Confidence intervals get top billing as the alternative to significance. But beware: confidence intervals rely on the same math as significance tests and share the same shortcomings. A confidence interval does not tell you where the true effect lies, even probabilistically. What it does is delimit a range of true effects that are broadly consistent with the observed effect.
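To make the distinction concrete, here is a minimal simulation sketch (assuming normally distributed data and hypothetical parameter values; `true_effect`, `n`, and `n_sims` are illustrative, not taken from any study). Across repeated samples, about 95% of the intervals cover the true effect, which is the property the interval actually has; any single realized interval is just the range of effects broadly consistent with that sample's estimate.

```python
# Coverage sketch: what a 95% confidence interval does (and does not) promise.
# Assumed setup: normal data, hypothetical true effect and sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.3          # hypothetical true mean effect (assumption for illustration)
n, n_sims = 50, 10_000     # observations per study, number of simulated studies

covered = 0
for _ in range(n_sims):
    sample = rng.normal(true_effect, 1.0, size=n)
    est = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * se
    lo, hi = est - half_width, est + half_width
    covered += (lo <= true_effect <= hi)

print(f"coverage: {covered / n_sims:.3f}")  # close to 0.95, the long-run property
```

The 0.95 refers to the procedure's long-run behavior over repeated samples, not to a probability that the true effect sits inside the one interval you happened to compute.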
An oft-overlooked detail in the significance debate is the challenge of calculating correct p-values and confidence intervals, the favored statistics of the two sides. Standard methods rely on assumptions about how the data were generated and can be far off when those assumptions do not hold. Papers on heterogeneous effect sizes by Kenny and Judd and by McShane and Böckenholt present a compelling scenario in which the standard calculations are highly optimistic, as the sketch below illustrates.
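The following is a minimal sketch of that kind of scenario, not the authors' own analysis: each study's true effect is assumed to vary around an average effect (the heterogeneity parameter `tau` and the other values are hypothetical), while the standard calculation treats the study as estimating a single fixed effect. The nominal 95% interval then covers the average effect far less often than advertised.

```python
# Heterogeneous-effects sketch: nominal intervals become optimistic when the
# standard fixed-effect assumption fails. All parameter values are assumed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, tau = 0.3, 0.3      # average effect and between-study heterogeneity (assumed)
n, n_sims = 50, 10_000  # observations per study, number of simulated studies

covered = 0
for _ in range(n_sims):
    study_effect = rng.normal(mu, tau)               # this study's own true effect
    sample = rng.normal(study_effect, 1.0, size=n)   # data from that one study
    est = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(n)             # ignores the heterogeneity
    half_width = stats.t.ppf(0.975, df=n - 1) * se
    covered += (est - half_width <= mu <= est + half_width)

print(f"coverage for the average effect: {covered / n_sims:.3f}")  # well below 0.95
```

With these illustrative numbers the realized coverage lands near 60%, because the reported standard error reflects only sampling error within the study and omits the variation in true effects across studies.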
[NOTE: This is a repost of an entry that Andrew Gelman wrote for the blog Statistical Modeling, Causal Inference, and Social Science]. Blake McShane and David Gal recently wrote two articles (“Blinding us to the obvious? The effect of statistical training on the evaluation of evidence” and “Statistical significance and the dichotomization of evidence”) on the misunderstandings of p-values that are common even among supposed experts in statistics and applied social research.
NOTE: This entry is based on the article “There’s More Than One Way to Conduct a Replication Study: Beyond Statistical Significance” (Psychological Methods, 2016, Vol. 21, No. 1, 1-12). Following a large-scale replication project in economics (Chang & Li, 2015) that successfully replicated only a third of 67 studies, a recent headline boldly reads, “The replication crisis has engulfed economics” (Ortman, 2015).