Comparing cost-effectiveness across contexts

There have been a number of recent blogs and articles criticizing RCT results because of the fear that results from one context are mindlessly applied to another. But the description of policy influence in these articles has little in common with the nuanced discussions between researchers and policymakers I have been privileged to observe (and take part in) over the years at J-PAL and elsewhere. I realize that not everyone is privy to how these discussions unfold and that it might be useful to give a flavor of them. How to inform policy by taking evidence from RCTs and combining it with good descriptive evidence, theory, and an understanding of institutional context, political constraints, and needs on the ground, is a massive subject and not one I can cover in a single blog.

I will start with one aspect of the discussion: using comparative cost-effectiveness as a tool to inform policy. This was the subject of a panel at the Society for Research on Educational Effectiveness in Washington, DC, that I participated in on Saturday. David Evans from the World Bank and Steven Glazerman from Mathematica also presented. Felipe Barrera-Osorio from Harvard University was the discussant and Amanda Beatty organized and moderated the session.

We discussed technical issues about what discount rates and exchange rates to use as well as when to use cost-effectiveness and when to use cost-benefit analysis. But as these are covered in Chapter 9 of Running Randomized Evaluations and in J-PAL’s cost-effectiveness methodology paper, I will not repeat them here.

Instead I want to concentrate on the discussion of which assumptions the cost-effectiveness and cost-benefit conclusions were sensitive to, as a great example of how to combine RCTs with other methodologies to inform nuanced policy discussions. (While both cost-effectiveness analysis [CEA] and cost-benefit analysis [CBA] facilitate the comparison of different programs implemented in different years and different contexts, they do differ slightly. CEA shows the amount of “effect” a program achieves on one outcome measure for a given cost, while CBA combines all the different benefits of a program onto one scale [usually a monetary scale] and shows the ratio of the combined benefits to cost.)

Steve Glazerman, drawing on his RCTs of Job Corps and the Talent Transfer Initiative, showed how sensitive cost-benefit estimates could be to longer-term projections: the rate at which teachers attracted to low-performing schools stayed in position after the evaluation was critical to program effectiveness. Similarly, whether earnings gains from Job Corps persisted, and for how long, made a huge difference to estimated effectiveness.

How do you deal with this uncertainty when giving policy advice? One tool is to calculate the threshold above which the program will be effective (i.e., where the net present value [NPV] of benefits outweighs the NPV of costs). What proportion of teachers would have to stay, and how long would earnings gains have to persist, to reach the effectiveness threshold? What do descriptive data tell us about how long teachers stayed in their new positions? What is the trajectory of other earnings differentials over time: do they tend to persist, widen, or decline? In other words, there is no single number that perfectly summarizes the cost-benefit analysis of a program. Instead, a range of calculations informs a conversation about program effectiveness under different scenarios.

Sensitivity analysis can also inform project design. If we know what the CBA is highly sensitive to, program implementers can work on strengthening that element of the program. If the CBA shows the program is much more effective for one type of school than another, this can inform the selection of schools into the program.
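
To make the threshold idea concrete, here is a minimal sketch in Python of a break-even calculation: how many years an annual earnings gain would have to persist before the NPV of benefits covers an up-front program cost. The function names, the dollar figures, and the discount rates are all hypothetical and chosen purely for illustration; they are not taken from the Job Corps or Talent Transfer Initiative evaluations.

```python
def npv(cash_flows, rate):
    """Net present value of cash flows, where cash_flows[t] arrives
    at the end of year t + 1 and `rate` is the annual discount rate."""
    return sum(cf / (1 + rate) ** (t + 1) for t, cf in enumerate(cash_flows))


def break_even_years(cost_today, annual_gain, rate, max_years=50):
    """Smallest number of years a constant annual gain must persist for
    NPV(benefits) >= cost_today. Returns None if the threshold is not
    reached within max_years."""
    for years in range(1, max_years + 1):
        if npv([annual_gain] * years, rate) >= cost_today:
            return years
    return None


if __name__ == "__main__":
    # Hypothetical numbers: a $10,000 program cost per participant and
    # a $1,500 annual earnings gain, under three discount rates.
    for rate in (0.0, 0.05, 0.10):
        years = break_even_years(10_000, 1_500, rate)
        print(f"discount rate {rate:.0%}: gains must persist {years} years")
```

The output of a calculation like this is not a verdict on the program. It is the threshold that descriptive evidence on persistence can then be compared against.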

Dave and I both drew on cost-effectiveness calculations of programs designed to improve test scores throughout the developing world, summarized in Kremer, Brannen, and Glennerster (2013). In the education programs we examined, the choice of discount rate and exchange rate had relatively little influence on the relative cost-effectiveness of the programs.

However, there was considerable variation in the precision with which different impacts were estimated. Some of the programs that looked very cost-effective based on the point estimate of the evaluation were not precisely estimated. This means it is not possible to clearly distinguish the relative cost-effectiveness of several different programs. In the chart below, the error bars show the cost-effectiveness of programs assuming the impact was at the top, and the bottom, of the 90 percent confidence interval. In this example, there is overlap between the error bars for all four programs.
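
As a rough illustration of how such error bars can be built, the sketch below converts an impact estimate and its standard error into cost-effectiveness at the bottom, the point estimate, and the top of a 90 percent confidence interval, and then checks whether two programs' intervals overlap. It assumes a normal approximation (z = 1.645) and treats cost as known with certainty. The two programs and all their numbers are hypothetical, not figures from Kremer, Brannen, and Glennerster (2013).

```python
Z_90 = 1.645  # two-sided 90% interval under a normal approximation


def ce_interval(impact_sd, std_error, cost_per_child, z=Z_90):
    """Return (low, point, high) in SD of test-score gain per $100 spent.

    impact_sd and std_error are in standard deviations of test scores;
    cost_per_child is in dollars and is treated as known."""
    scale = 100.0 / cost_per_child
    low = (impact_sd - z * std_error) * scale
    high = (impact_sd + z * std_error) * scale
    return low, impact_sd * scale, high


def intervals_overlap(a, b):
    """True if two (low, point, high) intervals share any values."""
    return a[0] <= b[2] and b[0] <= a[2]


if __name__ == "__main__":
    # Hypothetical programs: B has the higher point estimate per $100
    # but is much less precisely estimated than A.
    program_a = ce_interval(impact_sd=0.20, std_error=0.05, cost_per_child=10)
    program_b = ce_interval(impact_sd=0.60, std_error=0.30, cost_per_child=15)
    print("A:", [round(x, 2) for x in program_a])
    print("B:", [round(x, 2) for x in program_b])
    print("overlap:", intervals_overlap(program_a, program_b))
```

In this made-up case, program B looks twice as cost-effective as program A on the point estimate, yet the intervals overlap, so the ranking cannot be asserted with confidence.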

In other cases, even when impact estimates are imprecise, the lower bound of the estimate would still mean the program was more cost-effective than alternatives (see table below).

It is also worth noting that all the examples shown here would be considered highly cost-effective, even at the lower end of the confidence interval, if they were compared to most programs designed to improve test scores in rich countries. Achieving a one standard deviation improvement in test scores for $100 is extremely cost-effective if we consider that a child typically gains between 0.7 and 1 SD during a full school year, and a school year usually costs a lot more than $100.

Dave presented work showing the sensitivity of results to different cost estimates using the detailed spreadsheets of cost data produced by J-PAL staff (especially Conner Brannen). For example, transport costs varied depending on whether a program was in a more or less densely populated area. Teacher costs (and contract-teacher costs) also varied by context. What would the cost-effectiveness of a program look like if it were taken to another context with different transport or teacher costs? Based on many different simulations, Dave classed programs into three groups: the “always winners” (which proved to be cost-effective under many different assumptions), the “always losers,” and those whose rank was sensitive to some of the cost assumptions made.
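
The flavor of this exercise can be sketched as follows. This is not Dave's actual method or data: the cost model (a fixed cost plus transport and teacher inputs priced by context), the benchmark of 1 SD per $100, and every number below are assumptions made up for illustration. Impact is held fixed across contexts, which is itself a strong assumption.

```python
def cost_per_child(program, scenario):
    """Hypothetical cost model: fixed cost plus context-priced inputs."""
    return (program["fixed"]
            + program["transport_units"] * scenario["transport_price"]
            + program["teacher_units"] * scenario["teacher_wage"])


def classify(programs, scenarios, benchmark):
    """Label each program by whether its SD gain per $100 meets the
    benchmark in every scenario, in none, or only in some."""
    labels = {}
    for name, program in programs.items():
        passes = [
            program["impact_sd"] / cost_per_child(program, s) * 100 >= benchmark
            for s in scenarios
        ]
        if all(passes):
            labels[name] = "always winner"
        elif not any(passes):
            labels[name] = "always loser"
        else:
            labels[name] = "sensitive"
    return labels


# Hypothetical programs and cost environments (all figures invented).
PROGRAMS = {
    "A": {"impact_sd": 0.30, "fixed": 2, "transport_units": 1, "teacher_units": 0},
    "B": {"impact_sd": 0.10, "fixed": 20, "transport_units": 2, "teacher_units": 1},
    "C": {"impact_sd": 0.25, "fixed": 5, "transport_units": 0, "teacher_units": 1},
}
SCENARIOS = [
    {"transport_price": 1, "teacher_wage": 5},   # dense area, low wages
    {"transport_price": 6, "teacher_wage": 5},   # remote area, low wages
    {"transport_price": 1, "teacher_wage": 30},  # dense area, high wages
]

if __name__ == "__main__":
    print(classify(PROGRAMS, SCENARIOS, benchmark=1.0))
```

Here program C passes the benchmark where teachers are cheap and fails where they are expensive, which is the kind of result that tells a policymaker which local cost to check before adopting it.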

One issue raised in the discussion, which (as Felipe emphasized) economists are not always very good at thinking through, was the validity of putting scores from different types of exams at different grades on one scale. For this reason, the studies shown here are all of primary education and most of the tests focus on basic literacy and numeracy. But as children progress, a 1 SD improvement on one test may mean something very different from a 1 SD improvement on another test.

What should we take away from the sensitivity of cost-effectiveness rankings to some of these assumptions? Is it impossible to conclude anything of policy value? In his discussion, Felipe concluded that differences in costs and cost assumptions were a manageable problem. With the detailed cost data and models made available, it was possible to calculate what the costs would be in a new context and (assuming effectiveness stayed the same) calculate relative cost-effectiveness in that cost environment. Similarly, it was possible to ask in what type of cost environment the program would likely be cost-effective. A bigger challenge, he concluded, was judging whether the effectiveness of a program translated from one context to another. In other words, assuming “effectiveness stayed the same” is a big assumption.

Every decision we make ends up being based on assumptions and projections. But how do we do those projections as well as we possibly can? What tools can we use? What is the empirical evidence about which estimates replicate and which don’t? These questions will be the subjects of forthcoming blogs.