Null Hypothesis Significance Testing

 

AoI*: “What Is the False Discovery Rate in Empirical Research?” by Engsted (2024)

[*AoI = “Articles of Interest” is a feature of TRN where we report abstracts of recent research related to replication and research integrity.] ABSTRACT (taken from the article): “A scientific discovery in empirical research, e.g., establishing a causal relationship between two variables, is typically based on rejecting a statistical null hypothesis of no relationship.
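To make the quantity in the title concrete, here is a minimal back-of-the-envelope sketch (with assumed values for alpha, power, and the share of tested hypotheses that are true effects, not numbers from Engsted’s paper) of how the false discovery rate among significant results is computed:

```python
# Illustrative calculation (assumed inputs, not from Engsted's paper):
# the false discovery rate among "significant" results depends on alpha,
# the test's power, and the share of tested hypotheses that are true effects.
alpha, power, prior_true_effect = 0.05, 0.50, 0.10   # illustrative assumptions

p_sig_given_null = alpha * (1 - prior_true_effect)    # false positives
p_sig_given_effect = power * prior_true_effect        # true positives
fdr = p_sig_given_null / (p_sig_given_null + p_sig_given_effect)
print(f"FDR = {fdr:.2f}")   # ~0.47 with these inputs
```

With these illustrative inputs, roughly half of all “discoveries” would be false, which is why the prior plausibility of tested hypotheses and statistical power matter alongside the significance threshold.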

REED: EIR* – Interval Testing

[* EIR = Econometrics in Replications, a feature of TRN that highlights useful econometrics procedures for re-analysing existing research. The material for this blog is motivated by a recent blog at TRN, “The problem isn’t just the p-value, it’s also the point-null hypothesis!” by Jae Kim and Andrew Robinson] In a recent blog, Jae Kim and Andrew Robinson highlight key points from their recent paper, “Interval-Based Hypothesis Testing and Its Applications to Economics and Finance” (Econometrics, 2019).

KIM & ROBINSON: The problem isn’t just the p-value, it’s also the point-null hypothesis!

In Frequentist statistical inference, the p-value is used as a measure of how incompatible the data are with the null hypothesis. When the null hypothesis is fixed at a point, the test statistic reports a distance from the sample statistic to this point. A low (high) p-value means that this distance is large (small), relative to the sampling variability.
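As a rough illustration of the contrast, the sketch below (made-up data, not the authors’ code) rejects the point null H0: ÎŒ = 0 in a large sample even though the true effect is tiny, while a two one-sided tests (TOST) construction against the interval |ÎŒ| ≄ 0.1 indicates the effect is negligible; the 0.1 bound and the simulated data are illustrative assumptions:

```python
# Minimal sketch (illustrative data and bound, not the authors' code):
# point-null t-test vs. an interval null of "no meaningful effect".
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.05, scale=1.0, size=20000)   # tiny true effect, large n

# Point null: H0: mu = 0
t_point, p_point = stats.ttest_1samp(x, popmean=0.0)

# Interval null via TOST: reject "|mu| >= delta" only if both one-sided
# tests (against -delta and +delta) are significant.
delta = 0.1
n = len(x)
se = x.std(ddof=1) / np.sqrt(n)
t_lower = (x.mean() + delta) / se               # H0: mu <= -delta
t_upper = (x.mean() - delta) / se               # H0: mu >= +delta
p_lower = 1 - stats.t.cdf(t_lower, df=n - 1)
p_upper = stats.t.cdf(t_upper, df=n - 1)
p_tost = max(p_lower, p_upper)

print(f"point null  p = {p_point:.4f}")   # typically 'significant' at this n
print(f"interval    p = {p_tost:.4f}")    # yet the effect is negligible
```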

HIRSCHAUER et al.: Twenty Steps Towards an Adequate Inferential Interpretation of p-Values in Econometrics

This blog is based on the paper of the same name by Norbert Hirschauer, Sven GrĂŒner, Oliver Mußhoff, and Claudia Becker in the Journal of Economics and Statistics. It is motivated by prevalent inferential errors and the intensifying debate on p-values – as expressed, for example, in the activities of the American Statistical Association, including its p-value symposium in 2017 and the March 2019 special issue “Statistical inference in the 21st century: A world beyond p < 0.05”.

GOODMAN: When You’re Selecting Significant Findings, You’re Selecting Inflated Estimates

Replication researchers cite inflated effect sizes as a major cause of replication failure. It turns out this is an inevitable consequence of significance testing. The reason is simple. The p-value you get from a study depends on the observed effect size, with more extreme observed effect sizes giving better p-values; the true effect size plays no role.
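A small simulation makes the selection effect visible (illustrative parameter values, not Goodman’s code): among studies that happen to cross the significance threshold, the average observed effect sits well above the true effect.

```python
# Minimal sketch (assumed true effect and sample sizes): simulate many
# underpowered two-group studies and compare the average observed effect
# among "significant" results with the true effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n_per_group, n_studies = 0.2, 30, 20000

x = rng.normal(true_d, 1.0, size=(n_studies, n_per_group))
y = rng.normal(0.0, 1.0, size=(n_studies, n_per_group))
t, p = stats.ttest_ind(x, y, axis=1)
observed_d = x.mean(axis=1) - y.mean(axis=1)    # observed effect per study

sig = p < 0.05
print(f"true effect:                    {true_d:.2f}")
print(f"mean observed effect (all):     {observed_d.mean():.2f}")
print(f"mean observed effect (p<.05):   {observed_d[sig].mean():.2f}")  # inflated
```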

MILLER: The Statistical Fundamentals of (Non-)Replicability

“Replicability of findings is at the heart of any empirical science” (Asendorpf, Conner, De Fruyt, et al., 2013, p. 108). The idea that scientific results should be reliably demonstrable under controlled circumstances has a special status in science. In contrast to our high expectations for replicability, unfortunately, recent reports suggest that only about 36% (Open Science Collaboration, 2015) to 62% (Camerer, Dreber, Holzmeister, et al.) of published findings replicate.

REED: Why Lowering Alpha to 0.005 is Unlikely to Help

[This blog is based on the paper, “A Primer on the ‘Reproducibility Crisis’ and Ways to Fix It” by the author] A standard research scenario is the following: A researcher is interested in knowing whether there is a relationship between two variables, x and y. She estimates the model y = ÎŒ₀ + ÎŒ₁x + Δ, Δ ~ N(0, σÂČ).
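A minimal sketch of this scenario (illustrative parameter values, not the author’s code) simulates the regression above and tests H0: ÎŒ₁ = 0, reporting how often the null is rejected at α = 0.05 and α = 0.005 when it is true and when it is false; note in the output how the stricter threshold also cuts power:

```python
# Minimal sketch of the scenario described above (illustrative numbers):
# estimate y = mu0 + mu1*x + eps by OLS and test H0: mu1 = 0,
# comparing rejection rates at alpha = 0.05 and alpha = 0.005.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def reject_rates(true_mu1, n=50, sims=5000, alphas=(0.05, 0.005)):
    hits = {a: 0 for a in alphas}
    for _ in range(sims):
        x = rng.normal(size=n)
        y = 1.0 + true_mu1 * x + rng.normal(size=n)   # mu0 = 1, sigma = 1
        slope, intercept, r, p, se = stats.linregress(x, y)  # OLS slope p-value
        for a in alphas:
            hits[a] += p < a
    return {a: hits[a] / sims for a in alphas}

print("H0 true  (mu1 = 0):  ", reject_rates(0.0))   # false-positive rates
print("H0 false (mu1 = 0.2):", reject_rates(0.2))   # power at each alpha
```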

PARASURAMA: Why Overlapping Confidence Intervals Mean Nothing About Statistical Significance

[NOTE: This is a repost of a blog that Prasanna Parasurama published at the blogsite Towards Data Science.] “The confidence intervals of the two groups overlap, hence the difference is not statistically significant.” The statement above is wrong. Overlapping confidence intervals/error bars say nothing about statistical significance. Yet a lot of people make the mistake of inferring a lack of statistical significance from overlapping intervals.
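A small numeric example (assumed numbers, not taken from the original post) shows why: the standard error of a difference is sqrt(se_aÂČ + se_bÂČ), which is smaller than the sum of the two margins of error, so two 95% intervals can overlap while the difference between the groups is still significant.

```python
# Illustrative example (assumed means and standard errors): two group means
# whose 95% CIs overlap even though the difference is statistically significant.
import numpy as np
from scipy import stats

mean_a, se_a = 10.0, 0.60
mean_b, se_b = 12.0, 0.60

z = stats.norm.ppf(0.975)
ci_a = (mean_a - z * se_a, mean_a + z * se_a)   # about (8.82, 11.18)
ci_b = (mean_b - z * se_b, mean_b + z * se_b)   # about (10.82, 13.18)

# The test of the difference uses the SE of the difference, which is
# smaller than se_a + se_b.
se_diff = np.sqrt(se_a**2 + se_b**2)
z_diff = (mean_b - mean_a) / se_diff
p = 2 * (1 - stats.norm.cdf(abs(z_diff)))

print(f"CI A: {ci_a}, CI B: {ci_b}  (they overlap)")
print(f"difference: z = {z_diff:.2f}, p = {p:.4f}  (significant)")
```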

MCSHANE & GAL: Statistical Significance and Dichotomous Thinking Among Economists

[Note: This blog is based on our articles “Blinding Us to the Obvious? The Effect of Statistical Training on the Evaluation of Evidence” (Management Science, 2016) and “Statistical Significance and the Dichotomization of Evidence” (Journal of the American Statistical Association, 2017).] The null hypothesis significance testing (NHST) paradigm is the dominant statistical paradigm in the biomedical and social sciences.

SCHEEL: When Null Results Beat Significant Results OR Why Nothing May Be Truer Than Something

[The following is an adaptation of (and in large parts identical to) a recent blog post by Anne Scheel that appeared on The 100% CI.] Many, probably most, empirical scientists use frequentist statistics to decide whether a hypothesis should be rejected or accepted, in particular null hypothesis significance testing (NHST).
