At the end of The Importance of Being Earnest, Lady Bracknell remarks that “the number of engagements that go on seem to me to be considerably above the proper average that statistics have laid down for our guidance.” For years, economists have been concerned that the number of published results that are just significant at the 5 percent level is well above the average that statistics would predict.
By tradition, we tend to draw a bright line at the 5 percent level (and another, less bright line at 10 percent). Results that have a 5 percent likelihood or less of occurring by chance are considered clearly statistically significant. Some view those with a 5–10 percent likelihood as marginally significant, while those with a greater than 10 percent likelihood of arising by chance are considered not significant at all. But there is really not much difference between a result that has a 4.9 percent probability of occurring by chance (a p value of .049) and a result with a p value of .051. And in the absence of our bright-line rules, there is no reason to think that there should be more results with p values at .049 than results with p values at .051.
The bunching of results with p values just below 5 percent is likely caused by one (or both) of the following reasons:
- Studies with results with p values of .049 are more likely to be published than those with results with p values of .051
- Authors manipulate their results in small ways to push p values just below .05, for example by including or excluding covariates, or, more drastically, by redefining the sample over which the relationship is estimated
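The first channel, selective publication, is enough on its own to produce bunching. Here is a minimal simulation (the 30 percent publication rate for insignificant results is an illustrative assumption, not a number from the paper): every test is run on pure noise, so p values are uniform before selection, yet the published distribution piles up just below .05.

```python
import math
import random

def two_sided_p(z):
    """Two-sided p value for a standard-normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(0)

published = []
for _ in range(200_000):
    z = random.gauss(0, 1)  # the null is true: any "effect" is noise
    p = two_sided_p(z)
    # Selective publication: significant results always appear;
    # insignificant ones appear only 30% of the time (assumed rate).
    if p < 0.05 or random.random() < 0.30:
        published.append(p)

just_below = sum(0.04 <= p < 0.05 for p in published)
just_above = sum(0.05 <= p < 0.06 for p in published)
print(just_below, just_above)  # far more published mass just below .05
```

Without the publication filter, the two counts would be roughly equal; the gap between them is the "bulge" that Brodeur and coauthors measure in actual journals.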
A recent paper by Abel Brodeur and coauthors with the wonderful title “Star Wars: The Empirics Strike Back” estimates the extent of the bulge (excess mass) in results with p values just below 5 percent. They collect data on fifty thousand empirical results published in three top journals between 2005 and 2011 (AER, JPE, and QJE). They find that 10–20 percent of tests with p values below .05 should have p values between .10 and .25. My colleague Ben Olken argued at a session on transparency at the AEA that this is not quite as bad as it looks. He calculates that when you look at a result showing a p value of .05, these results suggest that on average the p value should be .0725. Instead of 5 percent of these results being false rejections, 7.25 percent are false rejections. We are only misreading 2.25 tests out of a hundred, so it’s not a big deal? Of course, we could also say there is nearly a 50 percent increase in the number of false positives: a big deal? My bigger concern is that the Brodeur study looks at all tests, many of which are robustness tests. I fear that the manipulation might be bigger for the headline results than for others. This, of course, is harder to test: what is a headline result?
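Olken's back-of-the-envelope arithmetic is easy to reproduce. The .05 and .0725 figures come from the discussion above; everything else is just subtraction, and the two framings ("only 2.25 extra per hundred" versus "nearly a 50 percent increase") come from the same pair of numbers.

```python
nominal = 0.05    # false-rejection rate a reported p value of .05 claims
actual = 0.0725   # Olken's estimate of the true false-rejection rate

extra_per_hundred = (actual - nominal) * 100   # the "not a big deal" view
relative_increase = (actual - nominal) / nominal  # the "big deal" view

print(round(extra_per_hundred, 2))   # 2.25 extra false rejections per hundred
print(round(relative_increase, 2))   # 0.45: a 45 percent rise in false positives
```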
There is, however, another striking finding for followers of RCTs in this paper. Field RCTs show no bulge in p values just below 5 percent. Compare the distribution of results for RCTs with those for other sources in Figure 9 from the paper, shown at left (in the paper, they test for the bulge and find none). If the journals are as likely to reject insignificant results of RCTs as they are for other studies, then this suggests RCT results are less likely to be manipulated than other studies.