How do you test the replicability of an RCT?

At the BITSS conference last month I heard a fascinating talk by Uri Simonsohn, a psychologist at Wharton, on replication of randomized evaluations. You may remember that a while back there was a big fuss (which reached The New York Times) about how many famous randomized trials in psychology failed to replicate. Uri’s paper, however, shows that many of the papers that attempt to test the replicability of RCTs do a really bad job. This is not a question of whether the replication accurately copies the original experimental protocol; rather, it’s a question of bad statistics. The common standard for judging whether a result replicates is to run the experiment again and see if a result that was statistically significant before is statistically significant in the second version. But as Uri convincingly shows, this is not a good standard. For example, if the replication is done on a much smaller sample, the finding that the result is no longer significant tells us very little.
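To see why, here is a quick simulation of my own (not from the paper; the effect size and sample sizes are made up). Even when an effect is real, a much smaller replication will routinely come out “not significant”:

```python
# My own illustration: a real effect, tested in a replication much smaller
# than the original study. How often is the replication "not significant"?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.4        # assumed standardized effect size (Cohen's d)
n_replication = 30       # hypothetical per-group n, much smaller than the original
n_sims = 10_000

not_significant = 0
for _ in range(n_sims):
    treat = rng.normal(true_effect, 1, n_replication)
    control = rng.normal(0, 1, n_replication)
    _, p = stats.ttest_ind(treat, control)
    if p >= 0.05:
        not_significant += 1

print(f"Share of replications of a real effect that are 'not significant': "
      f"{not_significant / n_sims:.0%}")
# With d = 0.4 and 30 per group, roughly two-thirds of replications fail to
# reach significance -- so non-significance alone is weak evidence against the effect.
```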

It turns out that determining the right test for replication is not straightforward (even though it’s pretty obvious that many of the papers claiming “nonreplication” are really bad). We have to take into account the magnitude of the effect (is the replicated effect size similar to the original?) and the relative power of the original study and the replication.

My initial reaction was that the best test would be whether the new coefficient was significantly different from the coefficient in the earlier study; however, Uri points out this is not a good test. It might be that in two high-powered studies the coefficients are very tightly measured, and while they are very close for practical purposes, they are statistically different from each other.
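A made-up numerical example of the problem (the coefficients and standard errors are hypothetical):

```python
# Hypothetical numbers: two very precise estimates that are nearly identical
# in practical terms but statistically different from each other.
from scipy import stats

b_original, se_original = 0.200, 0.005
b_replication, se_replication = 0.180, 0.005

z = (b_original - b_replication) / (se_original**2 + se_replication**2) ** 0.5
p = 2 * stats.norm.sf(abs(z))
print(f"difference = {b_original - b_replication:.3f}, z = {z:.2f}, p = {p:.4f}")
# The estimates differ by only 0.02 -- trivial for most purposes -- yet the
# difference is 'significant' (z ~ 2.8), so this test would flag a failure
# to replicate where none really exists.
```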

Uri suggests the following standard: a study fails to replicate if the replication rules out an effect big enough to be detectable by the original study. He uses the analogy of a telescope. If someone finds a new star with a small telescope (a low-powered study) but a large telescope (a high-powered study) can’t find the same star, this raises doubt about whether the original finding was correct.
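Here is a minimal sketch of how I understand that standard, with made-up numbers; the 33% power benchmark, the sample sizes, and the confidence level are my own illustrative assumptions, so see the paper for the exact procedure:

```python
# A rough sketch of the 'small telescope' idea: what is the smallest effect the
# original study had a reasonable chance (here, 33% power) of detecting, and
# does the replication's confidence interval rule it out? All numbers and the
# 33%/95% choices are illustrative assumptions, not the paper's exact procedure.
import numpy as np
from statsmodels.stats.power import TTestIndPower

n_original = 40        # hypothetical per-group n in the original study
n_replication = 160    # hypothetical per-group n in the larger replication

# Smallest standardized effect the original study could detect with 33% power
d_detectable = TTestIndPower().solve_power(nobs1=n_original, alpha=0.05, power=0.33)

# Suppose the replication estimates a small effect; rough SE of Cohen's d
d_rep = 0.05
se_rep = np.sqrt(2 / n_replication)
ci_upper = d_rep + 1.96 * se_rep     # upper end of a 95% interval

print(f"Effect detectable by the original study: d = {d_detectable:.2f}")
print(f"Upper bound of the replication's interval: d = {ci_upper:.2f}")
if ci_upper < d_detectable:
    print("Replication rules out effects the original could have seen: fails to replicate.")
else:
    print("Replication cannot rule out such effects: no failure to replicate by this standard.")
```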

For more details, see the paper. It’s thought-provoking, if a little scary.