**Topics:**

Owen Ozier considers the following as important:

Pre-analysis plans increase the chances that published results are true by restricting researchers’ ability to data-mine. Unfortunately, writing a pre-analysis plan isn’t easy, nor is it without costs, as discussed in recent work by Olken and Coffman and Niederle. Two recent working papers - “Split-Sample Strategies for Avoiding False Discoveries,” by Michael L. Anderson and Jeremy Magruder (ungated here) and “Using Split Samples to Improve Inference on Causal Effects,” by Marcel Fafchamps and Julien Labonne (ungated and updated here) - propose some very clever refinements to address some of the challenges inherent in pre-analysis plans.

Two of the big problems are that (a) it is hard to formulate the best way to test a hypothesis without looking at an associated dataset, and (b) even if one knew the best way to test a hypothesis, most papers perform a series of tests, each associated with the outcome of a previous test. Coffman and Niederle’s discussion of the first problem suggests that if an experiment is inexpensive enough that it can be replicated, then the first round of exploratory work can always be replicated by a second round of confirmatory work. Olken refers to the second problem as “pre-specifying the entire ‘analysis tree,’” the combinatorics of which quickly become intractably onerous without the ability to know some of the patterns of results in advance. The two new papers basically wed these insights, formalize them statistically, and present a solution: split your experimental dataset; use the first piece of the dataset for exploratory work, choosing which hypotheses to test and how to test them; refine a pre-analysis plan using this exploratory work; and, plan in hand, use the second piece to actually perform the tests, in essence performing a replication of the exploratory work.

But, having split the sample, a smaller testing dataset will surely reduce statistical power – the chance that you’ll actually detect an effect if it is there. Or will it? Both papers have a common departure point. Statistical power is the probability of detecting a nonzero effect, conditional on a coefficient truly being nonzero. However, the probability that you – the researcher – successfully detect such an effect also depends on something else: the probability that the coefficient you’ve decided to estimate (true beta, not beta-hat) is actually nonzero. Since that isn’t a sure thing, split-sample methods can increase the odds that you succeed at rejecting a false null hypothesis – and score points for science.
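One way to write the distinction in the paragraph above, as a sketch in my own notation rather than either paper's: the chance of a genuine discovery is the product of the prior that the coefficient is nonzero and the power of the test.

$$
\Pr(\text{reject } H_0 \text{ and } \beta \neq 0) \;=\; \Pr(\beta \neq 0) \times \underbrace{\Pr(\text{reject } H_0 \mid \beta \neq 0)}_{\text{statistical power}}
$$

Splitting the sample lowers the second factor, but it can raise the first by letting the data steer the researcher toward hypotheses whose nulls are actually false.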

Bear with me, here comes the math. (Choose your own adventure: if you aren’t sufficiently caffeinated for the math right now, just skip down a few paragraphs.)

**Fafchamps and Labonne**

Fafchamps and Labonne take an approach involving a parameter, *psi*: the likelihood that, when writing a pre-analysis plan uninformed by actual experimental data, a researcher tests a hypothesis for which the null is indeed not true (i.e. where there is truly a non-zero coefficient).

Here’s the basic idea. Consider what would happen if a researcher correctly worked out that the statistical power for a test was 0.99. Very nice. However, what if there was only a 50 percent chance (*psi*) that the test was an interesting one to perform? The other half the time, the researcher tests a hypothesis for which the null is true and there is no effect to find. That means the probability of detecting a true effect is only 0.5*0.99 = 0.495. A great dataset with slim odds of a discovery: a loss for science.

Fafchamps and Labonne’s suggestion is to split the sample, and test the intended hypothesis, in all the forms one can think of, in the first half. Power for this search process is lower, since the sample is smaller: 0.86. But that means that if the researcher tries all relevant hypotheses, there is an 86 percent chance of detecting it, conditional on stumbling on the right one. Then, whatever hypothesis the researcher picks in the first round, she tests in the second round, having written a more-informed pre-analysis plan. Power? 0.86 again in the second half. Take the product of those two numbers to find the probability of detecting the right formulation in the first half and then having it pass the test in the second half: 0.74. Voila: 74 percent power instead of 49.5 percent power. Progress!
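For readers who want to check the arithmetic, here is a minimal Python sketch of the calculation above. It is not the authors' code: it assumes a two-sided z-test at the 5 percent level, a normal approximation, and an expected z-statistic that shrinks with the square root of the sample fraction. The function and variable names are mine.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_sided(ncp, alpha=0.05):
    """Power of a two-sided z-test whose expected z-statistic is ncp."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(ncp - crit) + Z.cdf(-ncp - crit)

psi = 0.5          # chance the pre-specified hypothesis has a false null
full_power = 0.99  # power in the full sample

# Expected z-statistic implied by 99 percent power (far tail ignored).
ncp_full = Z.inv_cdf(1 - 0.05 / 2) + Z.inv_cdf(full_power)

# Assumption: halving the sample shrinks the expected z-statistic by sqrt(2).
half_power = power_two_sided(ncp_full / sqrt(2))

no_split = psi * full_power      # guess the hypothesis, test it once
split = half_power * half_power  # find it in one half, confirm in the other

print(f"half-sample power: {half_power:.2f}")
print(f"no split: {no_split:.3f}   split: {split:.2f}")
```

Under these assumptions the half-sample power rounds to 0.86 and the split-sample product to 0.74, matching the figures in the text.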

**Anderson and Magruder**

Anderson and Magruder operationalize Olken’s distinction between “primary” and “secondary” hypotheses: a primary hypothesis is about a “key variable of interest,” while a secondary hypothesis is of lesser (or perhaps conditional) importance. Anderson and Magruder consider two parameters: u_h, the importance associated with a hypothesis; and p_h, the prior probability that the null for that hypothesis is actually false. A hypothesis with large values of u_h and p_h is likely to be considered “primary.” If either of these is sufficiently small, however, it costs more in power than it yields in expectation to include the hypothesis in a pre-analysis plan.

Here’s the basic idea. Consider that there is a main hypothesis tested with power 0.8 in the full dataset. Consider a second hypothesis, with the same statistical power on its own, but for which an accurate prior is that there is only a 10 percent chance that the null is false – that the underlying coefficient is actually nonzero. If the researcher tests both hypotheses, she should adjust for multiple testing. The Bonferroni correction to the p-value means that there is now only power 0.71 on each hypothesis. But since the second null hypothesis had only a 10 percent chance of being false, this means we have sacrificed 9 percentage points of statistical power on the first hypothesis (0.80-0.71) while only gaining a 7 percent chance of an additional hypothesis being rejected (0.71*0.10). If the researcher’s objective function is the expected total number of hypotheses rejected, this is a bad deal (0.78 instead of 0.80). A loss for science.
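The Bonferroni numbers can be reproduced with a short Python sketch. As before, this is an illustration under my own assumptions (two-sided z-tests, a 5 percent family-wise level, normal approximation), not the authors' code, and the names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def reject_prob(ncp, crit):
    """Chance that |z| exceeds crit when the expected z-statistic is ncp."""
    return Z.cdf(ncp - crit) + Z.cdf(-ncp - crit)

prior = 0.10  # chance that the second null is false

crit_single = Z.inv_cdf(1 - 0.05 / 2)   # one test at the 5 percent level
crit_bonf = Z.inv_cdf(1 - 0.025 / 2)    # two tests, each at 2.5 percent

# Expected z-statistic implied by power 0.80 in the full sample.
ncp = crit_single + Z.inv_cdf(0.80)

power_bonf = reject_prob(ncp, crit_bonf)

# Expected number of false nulls rejected under each plan.
main_only = reject_prob(ncp, crit_single)
both = power_bonf + prior * power_bonf

print(f"power after Bonferroni: {power_bonf:.2f}")
print(f"main only: {main_only:.2f}   both hypotheses: {both:.2f}")
```

Under these assumptions, adding the long-shot hypothesis lowers the expected count from about 0.80 to about 0.78.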

Anderson and Magruder’s suggestion is to split the sample, but to then do something they call “hybrid.” Leave the main hypothesis alone: it will be tested in the full sample, regardless. It can be in the pre-analysis plan from the very beginning. But use a little bit of the data, perhaps 30 percent, to try out the second hypothesis. That’s a small sample, so be lenient: look for an absolute T statistic of 1.2, for example. Conditional on the 10 percent chance that there is an effect to detect, there is a 63 percent chance of detecting the second effect in this 30-percent sample. (Of course, conditional on the 90 percent chance that there is really nothing to detect, there is also a good chance of a false positive: 23 percent, under the null.) Now, if the secondary hypothesis doesn’t pass the threshold, the researcher just gets to do the one main test; this happens 0.1*(1-0.63) + 0.9*(1-0.23) = 72.9 percent of the time. So the nice feature of the hybrid approach is that, much of the time, the main hypothesis doesn’t need a multiple test correction. Its power ends up being 0.729*0.8 + (1-0.729)*0.71 = 77.6 percent.
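Here is a hedged Python sketch of the hybrid arithmetic in the paragraph above. It assumes z-tests under a normal approximation and an expected z-statistic that scales with the square root of the sample share; the names are mine, and the authors' own calculations may differ in detail.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def reject_prob(ncp, crit):
    """Chance that |z| exceeds crit when the expected z-statistic is ncp."""
    return Z.cdf(ncp - crit) + Z.cdf(-ncp - crit)

prior = 0.10     # chance the secondary null is false
share = 0.30     # fraction of the data used for the first look
threshold = 1.2  # lenient |t| cutoff in the first look

crit_single = Z.inv_cdf(1 - 0.05 / 2)   # main test alone
crit_bonf = Z.inv_cdf(1 - 0.025 / 2)    # Bonferroni with two tests
ncp = crit_single + Z.inv_cdf(0.80)     # expected z implied by power 0.80

# First look at the secondary hypothesis in the 30 percent sample.
pass_if_real = reject_prob(ncp * sqrt(share), threshold)
pass_if_null = reject_prob(0.0, threshold)

# Probability the secondary hypothesis is dropped after the first look.
p_drop = prior * (1 - pass_if_real) + (1 - prior) * (1 - pass_if_null)

# Main hypothesis: uncorrected when the secondary is dropped, Bonferroni otherwise.
main_power = (p_drop * reject_prob(ncp, crit_single)
              + (1 - p_drop) * reject_prob(ncp, crit_bonf))

print(f"pass if real: {pass_if_real:.2f}   pass if null: {pass_if_null:.2f}")
print(f"dropped: {p_drop:.3f}   main power: {main_power:.3f}")
```

With these inputs the sketch gives roughly 0.63, 0.23, 0.729, and 0.776, in line with the text.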

When the secondary hypothesis does pass the threshold, Anderson and Magruder have another suggestion: just do a one-sided test for it. After all, it is wildly unlikely that, if a real effect is at work, it would turn up with the right sign in the one sample split but with the opposite sign in the other. So: test for only the sign that appeared in the first split of the data. (This is a clever way to use a little bit of information from the first split of the data to increase the power of your test in the second split.) With 70 percent of the data remaining, a one-sided test with Bonferroni adjustment (since it is the second hypothesis) has power 0.65. How many hypotheses will be rejected in expectation? 0.776 + 0.1*0.63*0.65 = 0.817. If the researcher’s objective function is total hypotheses rejected, this is a better deal (0.817 instead of 0.800). Progress!
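The one-sided confirmation step can be sketched the same way. This Python snippet takes the rounded figures 0.776 and 0.63 from the text as inputs and assumes a one-sided z-test at the Bonferroni-adjusted 2.5 percent level on the remaining 70 percent of the data; it is an illustration with hypothetical names, not the authors' code.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

prior = 0.10          # chance the secondary null is false
pass_if_real = 0.63   # first-look pass rate when the effect is real (from the text)
main_power = 0.776    # power on the main hypothesis under the hybrid plan (from the text)
remaining = 0.70      # share of the data left for confirmation

# Expected z-statistic implied by power 0.80 in the full sample.
ncp = Z.inv_cdf(1 - 0.05 / 2) + Z.inv_cdf(0.80)

# One-sided test at 0.05 / 2, in the direction seen in the first split.
crit_one_sided = Z.inv_cdf(1 - 0.05 / 2)
confirm_power = Z.cdf(ncp * sqrt(remaining) - crit_one_sided)

# Expected number of false nulls rejected under the hybrid plan.
expected = main_power + prior * pass_if_real * confirm_power

print(f"confirmation power: {confirm_power:.2f}")
print(f"expected rejections: {expected:.3f}")
```

Under these assumptions the confirmation power rounds to 0.65 and the expected count to 0.817.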

The math is over. Now to wrap up.

There were three pretty innovative tricks in these papers. The first is splitting the sample. Though Anderson and Magruder point out that splitting the sample has been used for various purposes in statistics for more than 80 years, this application is a new one. Split-sample approaches help a pre-analysis plan when, *ex ante*, you can’t precisely characterize the hypotheses you would like to test, or the exact weights you attach to the importance of testing them. They provide more power than guessing the hypotheses, but less power than if you had been sure of the hypotheses from the get-go. The two other tricks? Using a hybrid pre-analysis plan approach; and the one-sided test in the second split. This last trick (the one-sided test in the second slice of the data using the sign from the first slice of the data) improves statistical power, and is one of the very few situations I can think of in which a one-sided test in a pre-analysis plan both legitimately preserves test size and doesn’t risk missing unanticipated negative results – after all, the impacts of new interventions may surprise you! (Examples come to mind in education, cash transfers, and public works programs, to name a few.)

My discussion vastly oversimplifies both papers. I used the Bonferroni correction, but both papers consider a variety of multiple-testing adjustments, including those that, like Bonferroni, control the family-wise error rate (FWER: the probability of getting at least one false rejection), as well as those that control the false discovery rate (FDR: the fraction of rejections that are incorrect). The methods work, whichever approach you take.

The Fafchamps and Labonne paper goes on to discuss how this approach might reorganize other aspects of the research process: data management might be divided between the portion of a research team that controls and anonymizes the whole dataset and a separate group that formulates and tests hypotheses in the split-sample while writing the pre-analysis plan; journals might accept papers based only on the pre-analysis plan and the analysis in the first half of the dataset, without knowing what remains significant in the second half.

The Anderson and Magruder paper goes on to show how their approach could have changed the conclusions of the Casey, Glennerster, and Miguel paper that brought pre-analysis plans to prominence in the context of field experiments in development economics. Anderson and Magruder’s finding serves as a warning: a pre-analysis plan does bind researchers’ hands against data mining and p-hacking, but may also bind them against some important discoveries.

A caveat.

There is a looming problem, hinted at by both papers. Lunch (or, in this case, a pre-analysis plan with lots of hypotheses) still isn’t free. Anderson and Magruder report two statistics: among recently-published field experiments, the median T-statistic is 2.6; among recently-filed pre-analysis plans, the median number of tests is 128. The contradiction here is that if your expected T-statistic is 2.6, your unadjusted power is 74 percent. If you adjust the FWER for 128 tests, your power is down to 17 percent. How do we reconcile this? Perhaps field data collection will have to be on a larger scale than before, or only some coefficients require multiple test corrections. Fafchamps and Labonne’s proposed division of labor also appears to necessitate a larger research team than has previously been typical. This trend may place some types of research out of reach for graduate students, or for researchers who are “only” able to secure a few hundred thousand dollars in research funding. No matter how you slice the data, multiple test correction and pre-analysis plans combine to drive the required sample sizes up considerably. If these requirements are disproportionately applied to field experiments, they may be raising the bar in precisely the wrong places: “specification searching and publication biases are quite small in randomized controlled trials,” as Vivalt (2016) and the amazingly-titled Brodeur, et al. (2016) (ungated here) conclude.
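The 74 percent and 17 percent figures can be checked with a few lines of Python. This sketch assumes a two-sided z-test, a normal approximation, and a Bonferroni adjustment as the FWER control; the names are mine.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def power_two_sided(expected_t, alpha):
    """Power of a two-sided z-test whose expected statistic is expected_t."""
    crit = Z.inv_cdf(1 - alpha / 2)
    return Z.cdf(expected_t - crit) + Z.cdf(-expected_t - crit)

expected_t = 2.6  # median T-statistic in published field experiments
n_tests = 128     # median number of tests in filed pre-analysis plans

unadjusted = power_two_sided(expected_t, 0.05)
bonferroni = power_two_sided(expected_t, 0.05 / n_tests)

print(f"unadjusted power: {unadjusted:.2f}")
print(f"power after Bonferroni over {n_tests} tests: {bonferroni:.2f}")
```

Less conservative FWER or FDR procedures would give somewhat higher power than Bonferroni, so the second number is best read as a lower bound under these assumptions.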

All is not lost. With the rise of “big data” comes massive sample size, and thus the required statistical power. If they arrive sequentially, early waves of “big data” can act as the first split that helps write the pre-analysis plan for later waves. (This only helps, of course, if “big data” somehow obviates the need for the kind of bespoke data collection that is common in current field experiments.) Finally, if you are still having a hard time writing your pre-analysis plan, or you worry that your pre-analysis plans won’t pan out, just do as Anderson, Magruder, Fafchamps, and Labonne have done: write papers *about* writing pre-analysis plans instead.

PS – here is a short piece of Stata code that produces all the calculations above.