Individual versus community incentives for service provision

Why would we want to incentivize at the community rather than individual level? This was the question that puzzled me as I prepared a discussion of Olken, Onishi, and Wong (2013) for J-PAL’s Maternal and Child Health Conference. This paper tests the impact of block grants to communities to improve health and education. In some communities the size of the block grant was linked to performance against target outcomes like the percentage of children immunized or enrollment rates at school. The incentives were effective at improving health, though not education on average. However, in communities with poor outcomes at the start of the program, the incentives improved outcomes over and above the impact of the block grants.

Incentives at the community level face collective-action problems: any effort by a single individual would have little impact on its own, making it tempting to free ride. Individuals may also find it hard to capture all the benefits of the incentives, which might similarly diminish their impact on behavior.

However, programs that provide incentives to individual service providers linked to performance have a mixed record, particularly those run by governments. All too often governments fail to implement the incentive programs they set up: providing bonuses even to those who have poor attendance records (Kremer and Chen 2001), refusing to impose sanctions on providers with high absence (Dhaliwal and Hanna 2014), or “excusing” absences to ensure providers avoid sanctions (Banerjee, Duflo, and Glennerster 2008). At the start of the program, when providers think the incentives will be enforced, absenteeism falls (in Dhaliwal and Hanna, health outcomes improve measurably as well), but once it becomes clear the government is not enforcing the incentives, absenteeism goes back up.

This prompted me to come up with a quick categorization of programs designed to improve public-service quality which used incentives at the individual and community level and have been rigorously evaluated (see figure below). I may well have missed studies, so please let me know if there are any I should add.

Given the problem governments appear to have in implementing individual incentives, I thought it would be useful to distinguish between programs implemented by NGOs for NGO workers; those which incentivize government workers but where an NGO determines who gets the incentive and how much; and those schemes that are entirely governmental. I also distinguished between programs working in health versus education.

The somewhat surprising pattern that emerged showed that community incentive programs administered by governments were on the whole more effective than government-run incentive programs targeted at individual providers. Interestingly, the effective community-incentive programs are all in health (as noted above, the incentives in education in Indonesia did not improve education on average, although they did in areas with low initial outcomes). (Note that Muralidharan and Sundararaman 2009 find positive impacts of individual incentives in schools in India but the incentives are determined and given out by an NGO.)

What advantage might community-level incentives have over individual level incentives? I can think of several reasons:

i) Communities have the ability to shift resources into activities that might be more productive in generating better services. For example, in Olken et al. (2013), incentivized communities switch resources from education to health (education outcomes do not fall as a result and there is evidence the education expenditure is used more effectively). An individual often cannot make these budget reallocations.

ii) Governments may find it more acceptable to reward or punish communities for performance than individuals. There may be strong norms that people doing the same job are paid the same, a norm that may not be as strong at the community level.

iii) Communities may have more information or more subtle ways to incentivize service providers to perform well in ways that are compatible with local norms. In discussion at the conference there was some skepticism that this was the case, but Olken et al. do show evidence that at least in health, the community incentives are in part working through higher provider effort. This is also the case in Bloom et al. 2006 where district-wide contracts in Cambodia linked to performance lead to lower absenteeism and higher provider effort.

These observations are not meant to suggest that we should give up on incentives for providers at the individual level, but these findings should give governments pause before introducing them, and in particular encourage governments to think of ways to tie supervisors’ hands to ensure follow-through. But this way of looking at the literature did make me more optimistic about community-based incentives, where more work would be useful, particularly in education. (Note that Muralidharan and Sundararaman look at incentives tied to the performance of a group of teachers, but this is not quite the same as community incentives, as it’s not clear that teachers can reallocate resources.) Finally, the challenges governments face in carrying through incentive schemes raise the importance of recruiting self-motivated staff in the first place (Ashraf, Bandiera, and Lee 2014a; also discussed at the J-PAL Maternal and Child Health Conference) and the role of nonfinancial incentives (Ashraf et al. 2014b).

Ben Olken’s 20-minute presentation of the Olken et al. findings can be viewed here, along with my discussion and subsequent general debate. Video of Nava Ashraf presenting Ashraf, Bandiera, and Lee (2014a) is available here.

The full citations for the papers listed in the table above are below:

Banerjee, Abhijit V., Esther Duflo, and Rachel Glennerster. 2008. “Putting a Band-Aid on a Corpse: Incentives for Nurses in the Indian Public Health Care System.” Journal of the European Economic Association 6(2-3): 487-500.

Bloom, Erik, Indu Bhushan, David Clingingsmith, Rathavuth Hong, Elizabeth King, Michael Kremer, Benjamin Loevinsohn, and J. Brad Schwartz. 2006. “Contracting for Health: Evidence from Cambodia.” Mimeo.

Basinga, Paulin, Paul Gertler, Agnes Binagwaho, Agnes L.B. Soucat, Jennifer Sturdy, and Christel Vermeersch. 2010. “Paying Primary Health Care Centers for Performance in Rwanda.” World Bank Policy Research Working Paper Series.

Dhaliwal, Iqbal and Rema Hanna. 2014. “Deal with the Devil: The Successes and Limitations of Bureaucratic Reform in India.” MIT Working Paper.

Duflo, Esther, Pascaline Dupas, and Michael Kremer. 2011. “Peer Effects, Teacher Incentives, and the Impact of Tracking: Evidence from a Randomized Evaluation in Kenya.” American Economic Review 101(5): 1739-74.

Glewwe, Paul, Nauman Ilias, and Michael Kremer. 2010. “Teacher Incentives.” American Economic Journal: Applied Economics 2(3): 205-27.

Kremer, Michael and Daniel Chen. 2001. “Interim Report on a Teacher Incentive Program in Kenya.” Mimeo.

Muralidharan, Karthik and Venkatesh Sundararaman. 2011. “Teacher Performance Pay: Experimental Evidence from India.” Journal of Political Economy 119(1): 39-77.

Olken, Benjamin, Junko Onishi, and Susan Wong. “Should Aid Reward Performance? Evidence from a Field Experiment on Health and Education in Indonesia.” Forthcoming, American Economic Journal: Applied Economics.

The complex ethics of randomized evaluations

When I gave a lecture to the faculty of Fourah Bay College in Freetown, Sierra Leone this past February, the first question I got was about ethics, but I was not asked whether randomized evaluations are ethical, as I often am when I present in the US or Europe. Instead, a chemistry professor was concerned that the validity of his study on iodine deficiency was being undermined by his inability to get parental permission for urine samples from schoolchildren. Iodine deficiency was potentially a major problem in this community and understanding the problem was essential to solving it. The children themselves were happy to provide urine samples, there was no health risk to the children, but tracking down their parents and getting consent was very hard and he risked getting a biased sample of respondents. Was it possible to reduce the reliance on parental consent for collecting these samples?

I suggested that he should get parental consent in all these cases because people get very sensitive about collection of bodily fluids and while there was no risk of physical harm, parents would likely feel upset if they found out that their child’s urine was collected without their consent. So we brainstormed ways to get higher consent rates at low cost.

But the story illustrates a number of points:

i) Many of the important ethical challenges that those doing RCTs face are not specific to RCTs, but rather apply to anyone collecting data on the ground. They are about how to ask for consent (i.e., determining when oral consent is sufficient and when written consent is needed) or how to store data in a secure way. Alderman, Das, and Rao (2013) have a nice paper about the practical ethical challenges of field work that is part of a series commissioned for The Oxford Handbook of Professional Economic Ethics, edited by George DeMartino and Deirdre McCloskey.

ii) The perception of the ethics of randomized evaluations is often very different in the US and Europe than it is in developing countries. The difference is even stronger if we move from a setting like Fourah Bay to the communities where we work. The poor are used to scarcity. They understand that there are often not enough resources for everyone to benefit from a new initiative. They are used to an NGO building a well in one village but not in another. The idea that the allocation might be based on a lottery in which each community has an equal chance is often regarded as a major improvement on the more normal approach of resources going to communities near the road or to those with connections.

iii) There are complicated trade-offs involved, with real costs and real benefits, a complexity that is often missed in the debate about the ethics of randomized evaluations.

The real ethical issues involved in randomized evaluations are rarely about whether an RCT leads to “denying” someone access to a program (an argument put forward by, for example, Casey Mulligan and ably answered by Jessica Goldberg).

Nor does it make sense to lump all different types of randomized evaluations into one category. A lottery around an eligibility cutoff raises different issues than a study in which services are phased in randomly, for example. A recent debate on the Development Impact blog made this point (see responses here, here, and here), and in Chapter 4 and in the lecture notes for teaching the ethics of randomized evaluations we systematically go through the ethical issues associated with different forms of randomized evaluations. These notes also provide practical tips for ensuring compliance with IRB requirements, an area where we provide graduate students and new researchers with only a little guidance (although I am pleased to report Deanna Ford from EPoD recently ran a session training graduate students from Harvard and MIT on these issues).

But some of the toughest issues around the ethics of randomized evaluations (and other empirical work) go beyond the current debate, and the answers are not straightforward. Shawn Powers and I have written a much longer piece that goes into these issues in some detail and will form another chapter in The Oxford Handbook of Professional Economic Ethics. Some of the issues we discuss that have no easy answers include:

i) What is the line between practice and research in economic and social research (a question the Belmont Report explicitly refused to address)? Clearly an evaluator cannot always be held responsible for the ethics of the program they evaluate. Think of Josh Angrist, who studied the impact of the Vietnam War by comparing lottery winners and losers. He could not have sought ethical approval for the Vietnam draft lottery. The war and the draft were not “research”; only the analysis was. At the other extreme, when a researcher designs a new vaccine and tests it, the risks of the vaccine, and not just the risks of taking a survey, should be reviewed. In other words, the program (the vaccine) is considered research. But there is a lot of grey area between these two extremes. What if the program would have gone ahead without the evaluation but the researcher gives advice to the implementer to help them improve the program: does the entire program fall under the jurisdiction of the researcher’s IRB? The line between practice and research has important practical implications. In particular, if the program itself is considered research, then informed consent needs to be collected from everyone who participates in the program, not just those on whom data are collected. This is a particular issue in clustered randomized evaluations, where only a fraction of those in the program usually have their data collected. While “community approval” can be gained for a clustered randomized trial from a community meeting, it is not as good as the consent given as part of data collection. (Some in the medical profession seem to have got around the difficult issues in clustered randomized evaluations by saying that the “subject” of the study is the medical provider, as providers are the level of randomization, so informed consent is only needed from the provider. As the provider has an ethical obligation to do the best for their patients, the researcher need not worry about ethics with respect to the patient.)

ii) What happens when a western IRB has a different view of ethics than the government or ethics board in the country where the study takes place? Whose position prevails? Specifically, imagine a researcher at a US university or the World Bank is advising a government on a national survey run by its statistical agency, and the researcher will have access to the data from the survey and will use these data in their research. Can the US university tell the government how to run its survey? Can it tell the government how to store its data, or only how the US researcher stores their data? (We have all seen government data held in insecure places.) Imagine the government plans to collect data on drug stocks and theft of stocks from clinics. The US university wants there to be a consent form in which clinic staff can refuse to answer. The government says that these are government employees and the drugs are government property, thus all clinic staff need to answer the survey as part of their employment duties. Should the US-based researcher refuse to participate in such a case?

iii) What data can be made public, and when and how is it possible to share data between researchers? Names, addresses, and phone numbers need to be taken out before data are made public, but what other information could be used to identify people? If we strip out village name and GPS coordinates, the public data lose a lot of their value to other researchers. It is not possible, for example, to check whether researchers addressed spillover issues appropriately. If other researchers promise to keep the data confidential, and follow their own IRB standards, can we share the village name with them? But does sharing violate our agreement with the people we survey? My colleague at J-PAL, Marc Shotland, is currently working with MIT’s IRB to think through how informed consent language can be adapted to gain consent to share GPS data between researchers more easily while preserving confidentiality.

These are only some of the practical challenges that researchers face every day when conducting RCTs and other empirical work that involves data collection. We hope that this contribution will move the debate on the ethics of randomized evaluations and other empirical work onto these issues that don’t necessarily have easy answers, and where a debate could generate some light rather than just heat.


Comparing cost-effectiveness across contexts

There have been a number of recent blogs and articles criticizing RCT results because of the fear that results from one context are mindlessly applied to another. But the description of policy influence in these articles has little in common with the nuanced discussions between researchers and policymakers I have been privileged to observe (and take part in) over the years at J-PAL and elsewhere. I realize that not everyone is privy to how these discussions unfold and that it might be useful to give a flavor of them. How to inform policy by taking evidence from RCTs and combining it with good descriptive evidence, theory, and an understanding of institutional context, political constraints, and needs on the ground, is a massive subject and not one I can cover in a single blog.


I will start with one aspect of the discussion: using comparative cost-effectiveness as a tool to inform policy. This was the subject of a panel at the Society for Research on Educational Effectiveness in Washington, DC, that I participated in on Saturday. David Evans from the World Bank and Steven Glazerman from Mathematica also presented. Felipe Barrera-Osorio from Harvard University was the discussant and Amanda Beatty organized and moderated the session.


We discussed technical issues about what discount rates and exchange rates to use as well as when to use cost-effectiveness and when to use cost-benefit analysis. But as these are covered in Chapter 9 of Running Randomized Evaluations and in J-PAL’s cost-effectiveness methodology paper, I will not repeat them here.


Instead I want to concentrate on the discussion of which assumptions the cost-effectiveness and cost-benefit conclusions were sensitive to, as a great example of ways to combine RCTs with other methodologies to inform nuanced policy discussions. (While both cost-effectiveness analysis [CEA] and cost-benefit analysis [CBA] facilitate the comparison of different programs in different years implemented in different contexts, they do differ slightly. CEA shows the amount of “effect” a program achieves on one outcome measure for a given cost, while CBA combines all the different benefits of a program onto one scale [usually a monetary scale] and shows the ratio of the combined benefits to cost.)

Steve Glazerman, drawing on his RCTs of Job Corps and the Talent Transfer Initiative, showed how sensitive cost-benefit estimates could be to longer-term projections: the rate at which teachers attracted to low-performing schools stayed in position after the evaluation was critical to program effectiveness. Whether earnings gains from Job Corps persisted, and for how long, made a huge difference to estimated effectiveness. How do you deal with this uncertainty when giving policy advice? One tool is to calculate the threshold above which the program will be effective (i.e., where the net present value [NPV] of benefits outweighs the NPV of costs). What proportion of teachers would have to stay, and how long would earnings gains have to persist, to reach the effectiveness threshold? What do descriptive data tell us about how long teachers stayed in their new positions? What is the trajectory of other earnings differentials over time: do they tend to persist, widen, or decline? In other words, there is no single number that perfectly summarizes the cost-benefit analysis of a program. Instead a range of calculations informs a conversation about program effectiveness under different scenarios.

Sensitivity analysis can also inform project design. If we know what the CBA is highly sensitive to, program implementers can work on strengthening that element of the program. If the CBA shows the program is much more effective for one type of school than another type, this can inform selection of schools into the program.
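The threshold idea can be made concrete with a small calculation. The sketch below is only an illustration: the up-front cost, the annual earnings gain, and the discount rates are hypothetical numbers, not figures from the Job Corps or Talent Transfer Initiative evaluations, and the function names are my own. It asks how many years a constant annual gain would have to persist before the NPV of benefits reaches the NPV of costs.

```python
def npv(cash_flows, rate):
    """Net present value, where cash_flows[t] arrives t years from today."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def break_even_years(cost_today, annual_gain, rate, max_years=40):
    """Smallest number of years the gain must persist for NPV(benefits) >= cost.

    Assumes the full cost is paid today and the first gain arrives one year
    later. Returns None if the threshold is not reached within max_years.
    """
    for years in range(1, max_years + 1):
        benefits = npv([0.0] + [annual_gain] * years, rate)
        if benefits >= cost_today:
            return years
    return None


# Hypothetical inputs, chosen only for illustration.
COST_PER_PARTICIPANT = 10_000.0
ANNUAL_EARNINGS_GAIN = 1_500.0

for rate in (0.03, 0.05, 0.10):
    years = break_even_years(COST_PER_PARTICIPANT, ANNUAL_EARNINGS_GAIN, rate)
    print(f"discount rate {rate:.0%}: gains must persist at least {years} years")
```

The output of a calculation like this is a question for the descriptive data rather than an answer in itself: is it plausible that gains last that long?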


Dave and I both drew on cost-effectiveness calculations of programs designed to improve test scores throughout the developing world, summarized in Kremer, Brannen, and Glennerster (2013). In the education programs we examined, the choice of discount rate and exchange rate had relatively little influence on the relative cost-effectiveness of the programs.

However, there was considerable variety in the precision with which different impacts were estimated. If you took the point estimate of the evaluation, some of the programs that looked very cost-effective were not precisely estimated. This means it is not possible to clearly distinguish the relative cost-effectiveness of several different programs. In the chart below, the error bars show the cost-effectiveness of programs assuming the impact was at the top, and the bottom, of the 90 percent confidence interval. In this example, there is overlap between the error bars for all four programs.

In other cases, even when impact estimates are imprecise, the lower bound of the estimate would still mean the program was more cost-effective than alternatives (see table below).
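To see how the two cases differ, here is a minimal sketch with invented numbers. The program names, impact estimates, confidence intervals, and costs are all hypothetical, and the sketch treats costs as known with certainty, which is itself a simplifying assumption. It converts each end of the impact confidence interval into test-score gains per $100 and checks whether two programs' ranges overlap.

```python
def sd_per_100_dollars(effect_sd, cost_per_child):
    """Test-score gain (in standard deviations) per $100 spent."""
    return effect_sd / cost_per_child * 100


def cost_effectiveness_range(point, ci_low, ci_high, cost_per_child):
    """Cost-effectiveness at the bottom, point estimate, and top of the impact CI."""
    return tuple(
        sd_per_100_dollars(effect, cost_per_child)
        for effect in (ci_low, point, ci_high)
    )


def ranges_overlap(a, b):
    """True if two (low, point, high) ranges overlap, so the ranking is unclear."""
    return a[0] <= b[2] and b[0] <= a[2]


# Hypothetical programs: impacts in SD, costs in dollars per child.
programs = {
    "Program A": dict(point=0.20, ci_low=0.02, ci_high=0.38, cost_per_child=4.0),
    "Program B": dict(point=0.10, ci_low=0.06, ci_high=0.14, cost_per_child=5.0),
    "Program C": dict(point=0.30, ci_low=0.25, ci_high=0.35, cost_per_child=1.0),
}

ranges = {name: cost_effectiveness_range(**p) for name, p in programs.items()}
for name, (low, point, high) in ranges.items():
    print(f"{name}: {low:.1f} to {high:.1f} SD per $100 (point estimate {point:.1f})")

print("A vs B overlap:", ranges_overlap(ranges["Program A"], ranges["Program B"]))
print("A vs C overlap:", ranges_overlap(ranges["Program A"], ranges["Program C"]))
```

In this made-up example, Program A looks better than Program B at the point estimate but the ranges overlap, while Program C beats both even at the bottom of its interval.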

It is also worth noting that all the examples shown here would be considered highly cost-effective, even at the lower end of the confidence interval, if they were compared to most programs designed to improve test scores in rich countries. Achieving a one standard deviation (SD) improvement in test scores for $100 is extremely cost-effective if we consider that a child typically gains between 0.7 and 1 SD during a full school year, and a school year usually costs a lot more than $100.


Dave presented work showing the sensitivity of results to different cost estimates using the detailed spreadsheets of cost data produced by J-PAL staff (especially Conner Brannen). For example, transport costs varied depending on whether a program was in a more or less densely populated area. Teacher costs (and contract-teacher costs) also varied by context. What would the cost-effectiveness of a program look like if it were taken to another context with different transport or teacher costs? Based on many different simulations, Dave classed programs into three groups: the “always winners” (which proved to be cost-effective under many different assumptions), the “always losers,” and those whose rank was sensitive to some of the cost assumptions made.
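A toy version of this kind of exercise might look like the sketch below. The programs, impacts, and cost scenarios are invented, and the classification rule (always ranked first, always ranked last, or anything else) is my own simplifying assumption rather than the rule used in Dave's simulations.

```python
def rank_programs(effects, costs):
    """Program names ordered from most to least cost-effective (effect per dollar)."""
    return sorted(effects, key=lambda name: effects[name] / costs[name], reverse=True)


def classify(effects, cost_scenarios):
    """Label each program by how its rank behaves across the cost scenarios.

    Assumed rule: "always winner" if ranked first in every scenario,
    "always loser" if ranked last in every scenario, otherwise "rank-sensitive".
    """
    last = len(effects) - 1
    labels = {}
    for name in effects:
        positions = [rank_programs(effects, costs).index(name) for costs in cost_scenarios]
        if all(p == 0 for p in positions):
            labels[name] = "always winner"
        elif all(p == last for p in positions):
            labels[name] = "always loser"
        else:
            labels[name] = "rank-sensitive"
    return labels


# Hypothetical impacts (SD) and costs per child (dollars) in two contexts.
effects = {"P1": 0.30, "P2": 0.10, "P3": 0.15, "P4": 0.02}
cost_scenarios = [
    {"P1": 2.0, "P2": 4.0, "P3": 5.0, "P4": 10.0},   # e.g., low transport costs
    {"P1": 3.0, "P2": 4.0, "P3": 8.0, "P4": 12.0},   # e.g., high transport costs
]

for name, label in classify(effects, cost_scenarios).items():
    print(name, label)
```

Note that the sketch holds impacts fixed across contexts and varies only costs.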


One issue raised in the discussion, which (as Felipe emphasized) economists are not always very good at thinking through, was the validity of putting scores from different types of exams at different grades on one scale. For this reason, the studies shown here are all of primary education and most of the tests focus on basic literacy and numeracy. But as children progress, a 1 SD change in one test may mean something very different than a 1 SD improvement in another test.


What should we take away from the sensitivity of cost-effectiveness rankings to some of these assumptions? Is it impossible to conclude anything of policy value? In his discussion, Felipe concluded that differences in costs and cost assumptions were a manageable problem. With the detailed cost data and models made available, it was possible to calculate what the costs would be in a new context and (assuming effectiveness stayed the same) calculate relative cost-effectiveness in that cost environment. Similarly, it was possible to ask in what type of cost environment the program would likely be cost-effective. A bigger challenge, he concluded, was judging whether the effectiveness of a program translated from one context to another. In other words, assuming “effectiveness stayed the same” is a big assumption.


Every decision we make ends up being based on assumptions and projections. But how do we do those projections as well as we possibly can? What tools can we use? What is the empirical evidence about which estimates replicate and which don’t? These questions will be the subjects of forthcoming blogs.

Strengthening the accountability of politicians


Across much of the world, votes are often cast on the basis of regional ties, patronage politics, or simple bribery. In Freetown last week, politicians, civil society, academics, and media came together to discuss ways to make politicians more accountable and to encourage people to base their vote on policies and performance, rather than party loyalty and/or gifts. In the past few years, an increasing number of studies have suggested that voters in developing countries will respond to information about candidates and change their vote, rewarding high-performing politicians and punishing poorly performing ones.  This encouraging evidence on efforts to strengthen formal democracy has come from political systems as diverse as Brazil, India, Benin, and now Sierra Leone, and is in contrast to the rather discouraging evidence on external efforts to change the workings of more informal institutions, which I blogged about in the fall. While no study has yet linked improvements in the workings of democracy to improved services for the poor on the ground, the hope is that by getting better politicians elected, and showing politicians that if they don’t perform they will be punished at the polls, these voter education campaigns will translate into improved services.

It was particularly exciting to have Amrita Johri from Satark Nagrik Sangathan (SNS) at the Freetown event (hosted by the International Growth Centre) to explain their pioneering work informing voters of the qualities of political candidates in India. SNS used the Indian Right to Information Act to collect data on how incumbent politicians spent their discretionary funds, what committees the MPs were on, and how active they were. They used this information to create scorecards that were placed in local newspapers and disseminated through meetings and street theater.

But many countries in Africa don’t have the kind of detailed information that is available on MPs in India. Sierra Leone has just introduced a new Freedom of Information Act, but it was not available during the 2012 election. Search for Common Ground (SCG) therefore decided to videotape debates between MP candidates in 14 constituencies and screen these at randomly selected polling centers throughout the relevant constituencies. Ambrose James of SCG presented the results: exit polls conducted in treatment and comparison communities showed that the debates led to greater knowledge of politics (for example, the size of the MPs’ Constituency Facilitation Fund), of candidates’ characteristics (e.g., candidate education), and of candidates’ policy stances. Debates also led to a change in how people voted, with a 5 percentage point increase in the vote share for the candidate who won the debate. They also led to greater policy alignment between voters and candidates. In other words, voters were more likely to vote for a candidate who shared the same policy preferences as themselves.

Dr. Yusuf Bangura noted that the evidence from the US is that political debates tend to have little influence on how people vote. Why would it be different in Sierra Leone? One possible reason was that in the US, voters are bombarded with information about candidates and the marginal information gained from debates may be small. In contrast, in Sierra Leone, voters had few alternative sources of information on MPs. One interesting fact revealed in the discussion was how similar (and low) the level of knowledge was among voters in India and Sierra Leone.

Some of those at the workshop and in the media coverage of the event questioned whether a 5 percentage point swing was big enough to change much. After all, the majority of voters still voted along traditional party lines. The Honorable Isatu Kabia argued that a 5 percentage point swing was big and important, and I agree with her. After all, if one debate screening late in the campaign in a system where there is little objective information to disseminate can have a net effect on vote shares of 5 percentage points, more continuous and more detailed information has the potential to have an even bigger impact. Hon. Kabia made the further important point that if potential candidates knew they would be rewarded for their objective performance, this would encourage good candidates to enter politics, which could have a profound effect on democracy.

Are RCTs less manipulable than other studies?

At the end of The Importance of Being Earnest, Lady Bracknell remarks that, “the number of engagements that go on seem to me to be considerably above the proper average that statistics have laid down for our guidance.” For years, economists have been concerned that the number of published results that are just significant at the 5 percent level is well above the average that statistics would predict.

By tradition, we tend to draw a bright line at the 5 percent level (and another, less bright line, at 10 percent). Results that have a 5 percent likelihood or less of occurring by chance are considered clearly statistically significant. Some view those with a 5–10 percent likelihood as marginally significant, while those with a greater than 10 percent likelihood of arising by chance are considered not significant at all. But there is really not much difference between a result that has a 4.9 percent probability of occurring by chance (a p value of 0.049) and a result with a p value of 0.051. And in the absence of our bright line rules, there is no reason to think that there should be more results with p values at 0.049 than results with p values at 0.051.

The bunching of results with p values just below 5 percent is likely caused by one (or both) of the following reasons:

  1. Studies with results with p values of 0.049 are more likely to be published than those with results with p values of 0.051
  2. Authors manipulate their results in small ways to push p values just below 0.05, for example, by including or excluding covariates, or more drastically redefining the sample over which the relationship is estimated
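A crude way to look for this kind of bunching is to count how many reported p values fall in a narrow window just below the threshold and how many fall in an equally narrow window just above it. The sketch below does only that, on a made-up list of p values; the function name and window width are my own choices, and this simple count is not the method used in the paper discussed next.

```python
def count_around_threshold(p_values, threshold=0.05, width=0.005):
    """Count p values in [threshold - width, threshold) and [threshold, threshold + width)."""
    below = sum(1 for p in p_values if threshold - width <= p < threshold)
    above = sum(1 for p in p_values if threshold <= p < threshold + width)
    return below, above


# Hypothetical p values, invented for illustration only.
reported = [0.012, 0.046, 0.047, 0.048, 0.049, 0.049, 0.051, 0.053, 0.080, 0.200]

below, above = count_around_threshold(reported)
print(f"just below 0.05: {below}, just above 0.05: {above}")
```

Absent a bright-line rule, the two counts should be similar in a large sample; a large excess just below the threshold is what a bulge looks like.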

A recent paper by Abel Brodeur and coauthors with the wonderful title “Star Wars: The Empirics Strike Back” estimates the extent of the bulge (excess mass) in results with p values just below 0.05. They collect data on fifty thousand empirical results published in three top journals (AER, JPE, and QJE) between 2005 and 2011. They find that 10–20 percent of tests with p values below 0.05 should have p values between 0.10 and 0.25. My colleague, Ben Olken, argued at a session on transparency at the AEA that this is not quite as bad as it looks. He calculates that when you look at a result showing a p value of 0.05, these results suggest that on average the p value should be 0.0725. Instead of 5 percent of these results being false rejections, 7.25 percent are. We are only misreading 2.25 tests out of a hundred, so is it not a big deal? Of course, we could also say there is nearly a 50 percent increase in the number of false positives, which sounds like a big deal. My bigger concern is that the Brodeur study looks at all tests, many of which are robustness tests. I fear that the manipulation might be bigger for the headline results than for others. This, of course, is harder to test: what counts as a headline result?
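The two ways of reading these numbers are just the absolute and the relative difference between a 5 percent and a 7.25 percent false-rejection rate. A few lines make the arithmetic explicit (the function name is mine):

```python
def compare_false_rejection_rates(nominal_pct, adjusted_pct):
    """Return (extra false rejections per 100 tests, relative increase)."""
    extra_per_hundred = adjusted_pct - nominal_pct
    relative_increase = extra_per_hundred / nominal_pct
    return extra_per_hundred, relative_increase


extra, relative = compare_false_rejection_rates(nominal_pct=5.0, adjusted_pct=7.25)
print(f"extra false rejections per 100 tests: {extra:.2f}")
print(f"relative increase in false positives: {relative:.0%}")
```

The same gap is 2.25 tests in a hundred in absolute terms and a 45 percent rise in relative terms, which is why it can be described either as small or as large.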

There is, however, another striking finding for followers of RCTs in this paper. Field RCTs show no bulge in p values just below 5 percent. Compare the distribution of results for RCTs with those for other sources in Figure 9 from the paper, shown at left (in the paper, they test for the bulge and find none). If the journals are as likely to reject insignificant results from RCTs as they are to reject those from other studies, then this suggests RCT results are less likely to be manipulated than the results of other studies.

Policy versus academic jobs in economics

As someone who worked as a policy economist for many years (at the UK Treasury and the IMF) before going into research, I am often asked for advice from those trying to decide whether to go into a career in policy or academia. With job market candidates making those decisions now, I decided to summarize my advice in writing. I think this is particularly important because many PhD candidates only get input from academic advisors--most of whom have no firsthand experience of working in policy (although that does not always stop them passing judgment on policy work).

I should stress that I am not talking about doing a research job at a policy institution (like being in the research department at the World Bank or a Federal Reserve bank). I am talking about jobs where you help set policy, like operations jobs at the World Bank or IMF, or in a government department. Judging these institutions by their research departments (as many academics do) is like judging Caltech by only looking at its music department.

Policy and academic work are equally intellectually challenging, but in very different ways--and which one is suited to you will depend a lot on your personality. For example, academia can be a pretty lonely profession. Papers are written over many years with only intermittent feedback from colleagues. However, you have a lot of autonomy in terms of what you work on and how you do your work. In policy, deadlines are much shorter (I once had to estimate, in twelve hours, the impact of the war in Kosovo lasting another three months on its neighbor’s balance of payments). You are also part of a collaborative process. There is no way, for example, to do a balance-of-payments projection in isolation from the fiscal and monetary projections. In policy you also have a boss, which can be the best or worst thing about your job, depending on the boss.

The questions that are worked on and the reward structure are also very different in each field. Academia rewards findings that are different and unexpected. In policy it is more important to be right than novel; after all, millions of lives may be impacted by a policy decision. In academia, people argue a lot about the direction of an effect but very little about the magnitude: in policy it’s the reverse. While academics argue whether government borrowing could impact GDP, policy economists argue whether a particular country should have a deficit of 3 or 4 percent of GDP. In academia, people often become increasingly specialized. In policy you have to be able to use the basic toolbox of economics and apply it to any problem that might get thrown at you. In a few years at the UK Treasury I worked on: introducing greater market forces within the National Health Service; the extent to which current trade imbalances reflected different demographic profiles across countries; what proportion of UK bonds should be long-term, short-term or index-linked; and whether a proliferation of new stock exchanges would undermine price revelation. When I first heard graduate students talking about the challenge of finding a problem to work on, I was stunned. This was not a problem I had ever faced. And this gets to the heart of the difference: in academia you look for a problem you can answer well, and in policy you find the best answer you can to the problem you are given.

Finally, in academia you have to convince other economists. In policy the challenge is often to persuade noneconomists. There is the added challenge of solving the economic question within the relevant political constraints and framing the solution in a politically acceptable way. Whether you find this irritating or exciting could help you decide which job is right for you. I was 21 and had been at the UK Treasury a matter of weeks when a small group of us were told, “Your job is to slip some common sense past the prime minister without her noticing.” Now that’s a challenge.

How do you test the replicability of an RCT?

At the BITSS conference last month I heard a fascinating talk by Uri Simonsohn, a psychologist at Wharton, on replication of randomized evaluations. You may remember a while back there was a big fuss (which reached The New York Times) about how many famous randomized trials in psychology failed to replicate. Uri’s paper, however, shows that many of the papers that attempt to test the replicability of RCTs do a really bad job. This is not a question of whether the replication accurately copies the original experimental protocol; rather, it’s a question of bad statistics. The common standard for judging whether a result replicates is to run the experiment again and see if a result that was statistically significant before is statistically significant in the second version. But as Uri convincingly shows, this is not a good standard. For example, if the replication is done on a much smaller sample, the finding that the result is not significant probably does not tell us much about anything.

It turns out that determining the right test for replication is not straightforward (even though it’s pretty obvious that many of the papers claiming “nonreplication” are really bad). We have to take into account the magnitude of the effect (i.e., is the effect size similar in magnitude?) and the relative power of the original study and the replication.

My initial reaction was that the best test would be whether the new coefficient was significantly different from the coefficient in the earlier study; however, Uri points out this is not a good test. It might be that in two high-powered studies the coefficients are very tightly measured, and while they are very close for practical purposes they are statistically different from each other.
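
To see how two tightly measured coefficients can be close for practical purposes yet statistically different, here is a minimal sketch using a standard z-test for the difference between two independent estimates. The coefficients and standard errors are invented for illustration and do not come from any study.

```python
from math import sqrt
from statistics import NormalDist


def difference_z_test(b1, se1, b2, se2):
    """Two-sided z-test that two independent estimates are equal."""
    z = (b1 - b2) / sqrt(se1 ** 2 + se2 ** 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical high-powered studies: similar effects, tiny standard errors.
z, p = difference_z_test(b1=0.200, se1=0.005, b2=0.220, se2=0.005)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these made-up inputs the two estimates differ by only two hundredths, yet the test rejects equality at the 5 percent level.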

Uri suggests the following standard: a study fails to replicate if the replications rule out an effect big enough to be detectable with the original study. He uses the analogy of a telescope. If someone finds a new star with a small telescope (a low-powered study) but a large telescope (high-powered study) can’t find the same star, this raises doubt about whether the original finding was correct.
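
One way to make this standard concrete is sketched below, using a normal approximation for a two-arm comparison of means with a standardized effect size. The 33 percent power threshold for "detectable", the one-sided 5 percent test, the function names, and the sample sizes are all assumptions made for illustration; see the paper for the procedure Uri actually proposes.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(effect, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test for a standardized effect."""
    se = sqrt(2 / n_per_arm)  # standard error of a standardized mean difference
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect / se
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def smallest_detectable_effect(n_per_arm, target_power=0.33):
    """Effect the original study had `target_power` chance of detecting.

    Found by bisection; target_power=0.33 is an assumed threshold.
    """
    low, high = 0.0, 10.0
    for _ in range(100):
        mid = (low + high) / 2
        if power_two_sided(mid, n_per_arm) < target_power:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def fails_to_replicate(rep_effect, rep_n_per_arm, orig_n_per_arm, alpha=0.05):
    """True if the replication rules out the original's detectable effect."""
    detectable = smallest_detectable_effect(orig_n_per_arm)
    rep_se = sqrt(2 / rep_n_per_arm)
    # Upper end of a one-sided confidence interval for the replication effect.
    upper_bound = rep_effect + STD_NORMAL.inv_cdf(1 - alpha) * rep_se
    return upper_bound < detectable


# Hypothetical case: original study with 20 per arm, replication effect of 0.05.
print(fails_to_replicate(0.05, rep_n_per_arm=500, orig_n_per_arm=20))
print(fails_to_replicate(0.05, rep_n_per_arm=20, orig_n_per_arm=20))
```

In this made-up case the large replication (the big telescope) rules out an effect the original could have seen, while the small replication cannot rule it out and so is simply uninformative.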

For more details, see the paper. It’s thought-provoking, if a little scary.

Pre-analysis plans at Berkeley's BITSS conference

On December 12th I attended the annual meeting of the Berkeley Initiative for Transparency in the Social Sciences (BITSS). BITSS brings together economists, political scientists, biostatisticians, and psychologists to think through how to improve the norms and incentives to promote transparency in the social sciences. I was on a panel talking about pre-analysis plans, in which researchers specify in advance how they will analyze their data.

I have now been involved in writing four of these plans and my thinking about them has evolved, as has the sophistication of the plans. Kate Casey, Ted Miguel and I first wrote one of these plans for our evaluation of a Community Driven Development program in Sierra Leone (see the previous blog). It was exactly the type of evaluation where pre-analysis plans are most useful. We had a large number of outcome variables with no obvious hierarchy of which ones were most important, so we specified how all the outcomes would be grouped into families and tested as a group. While the outcomes were complex, the randomization design was simple (one treatment, one comparison group).

The next case also included multidimensional outcomes: empowerment of adolescent girls in Bangladesh. However, now we had five treatments and a comparison group with different treatments targeted at different ages. The task of prespecifying was overwhelming and we made mistakes. It was extremely difficult to think through in advance what subsequent analysis would make sense for every combination of results we might get from the different arms. We also failed to take into account that some of our outcomes in a given group were clearly more important than others: we ended up with strong effects on years of schooling and math and literacy scores but the overall “education” effect was weakened by no or negative effects on indicators like how often a girl read a magazine. We hope that, when we write the paper, people will agree it makes sense to deviate from our plan and concentrate on the more important education results.

Learning from that example, we wrote the most recent pre-analysis plan in stages. Kelly Bidwell, Kate Casey and I evaluate the impact of screening debates between MPs in the Sierra Leone election (Kate did a webinar on the emerging results). We wrote our initial plan of action (when the survey was in the field) and then updated it as we revealed and analyzed different parts of the data sequentially. We also specified which outcomes were primary and which were secondary (i.e., would help us understand the mechanics of how the intervention had an impact but weren’t to be considered “success” on their own). We started by looking at data collected in treatment areas before and after the debates. Analyzing these data helped us update our hypotheses for the next version of the PAP. We next examined data from the comparison group and downgraded some outcomes as too hard to change (when turnout is 98% in the comparison group, debates are unlikely to increase turnout). With this information we updated our PAP for our individual-level experiment in which different voters were shown different parts of the debate (we logged the history of how this PAP evolved over time). Finally, after analyzing these results we finalized our PAP for the main evaluation of the debates. All changes were redlined with dates on which the changes were made.

The other unusual part of this PAP was that for some outcomes we specified the use of one-sided tests rather than the standard two-sided tests. For example, we only tested whether debates increased knowledge. We increased our power to detect this effect by committing not to look at decreases in knowledge. This approach is only reasonable if there is no theory under which a negative effect could make sense, i.e., if you saw a negative effect you would assume it was an anomaly. Where a one-sided test makes sense it’s important to commit to it in advance.
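
The power gain from a one-sided test can be illustrated with a normal approximation. This is a minimal sketch: the true effect of 2.5 standard errors and the 5 percent significance level are hypothetical choices, not figures from our study.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(shift, alpha=0.05, one_sided=False):
    """Power of a z-test when the true effect is `shift` standard errors.

    The one-sided version rejects only for positive estimates.
    """
    if one_sided:
        return 1 - STD_NORMAL.cdf(STD_NORMAL.inv_cdf(1 - alpha) - shift)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Hypothetical true effect equal to 2.5 standard errors.
two_sided = power(2.5)
one_sided = power(2.5, one_sided=True)
print(f"two-sided power: {two_sided:.2f}, one-sided power: {one_sided:.2f}")

# The price of the commitment: a true decrease is almost never flagged.
print(f"one-sided power against a decrease: {power(-2.5, one_sided=True):.5f}")
```

Under these assumptions the one-sided test detects the increase noticeably more often, but it would almost never flag a true decrease of the same size, which is why the choice has to be justified and fixed before the data are seen.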

These issues are discussed in more detail in Module 8.3. For examples of published pre-analysis plans see resource links under Chapter 8. There will also be a panel to discuss pre-analysis plans at the AEA.

Smackdown on Community Driven Development

When Dave Evans at the World Bank invited me to participate in a “smackdown” between evaluators and operations experts who worked on Community Driven Development (CDD), I was skeptical. One of the great advantages of working on randomized evaluations is the close and cooperative way in which evaluators and implementers work together (see my previous blog). The idea of deliberately highlighting the differences between the two groups seemed wrong. But Dave convinced me that a format set up as a mock fight would attract more interest and he was right: the two sides sat in the middle of a standing-room-only crowd, which added to the atmosphere of a boxing ring.

Team Operations was Daniel Owen, with whom I had worked on a randomized impact evaluation of CDD in Sierra Leone, and Susan Wong, who is an accomplished evaluator of CDD as well as a manager of CDD programs in Asia and Africa. On the research side ("Team Impact Evaluation") were Biju Rao, Macartan Humphreys, and me. Biju’s summary of the evidence on participation in development exposed the lack of rigorous evidence on the impact of CDD and caused a stir at the Bank when its first draft circulated in 2004/5. To their credit, people in operations, like Dan Owen, responded by supporting a number of rigorous impact evaluations of CDD including three randomized evaluations in Sierra Leone, Liberia, and the Democratic Republic of Congo (Macartan is a coauthor on the latter two studies). Other evaluations, like those by Susan Wong, tested alternative approaches to CDD. As a result we now know a lot more about CDD, much of which was discussed at the smackdown.

I would highlight a few points which emerged:

  1. Delivering the goods: Macartan and I noted how the Sierra Leone and DRC programs we evaluated were effective in bringing public goods to poor communities despite poorly functioning postwar governmental systems.

  2. Compared to what? Susan pointed out that it was important to compare CDD to the next best alternative. When the Bank went into dysfunctional postwar environments, CDD was often the only way to get money to communities to help them rebuild quickly. Even outside postwar environments there were often no functioning government structures at the very local level. If donors wanted to work at this local level they would inevitably have to have some institution-building component of the type often found (in varying intensities) in CDD projects.

  3. No spillovers: Macartan and I emphasized that in our studies, while the programs themselves had participation from women and minorities, CDD was not successful in making local decision-making processes more open and transparent: the inclusive decision making stayed within the project. Dan argued that with more time and better measures of decision making the effects would spill over. My view is that we have developed some good measures of participatory decision making, with the Sierra Leone and DRC studies good examples of this (see Chapter 5 of Running Randomized Evaluations).

  4. What we don’t know: One important gap in the evidence is the extent to which the emphasis on participatory decision making within the project is the reason these projects were successful in delivering the goods.

The audience was asked to vote at the beginning and end of the discussion on two questions: i) how can operations better foster CDD (based on evaluation evidence); and ii) how can research be improved to make it more useful for operations? The results of the votes are below.

Recommendations from Team Impact Evaluation

Recommendations from Team Operations

RCTs provide independence or objectivity through methodology

At the recent launch event of Running Randomized Evaluations, hosted by DIME, a questioner asked about my comment that because randomized evaluations provide an independent methodology they give evaluators the freedom to work hand in hand with implementing partners. This observation came from my experience as a member of the Independent Advisory Committee on Development Impact for the UK government’s Department for International Development (DFID). Much of the committee’s discussion focused on increasing the independence of evaluation in line with the then-current international standards. The concern was that if DFID (or any other agency) paid for an evaluation of its own projects there was a conflict of interest and evaluators would provide glowing reports on projects in order to get future contracts for other evaluations. Complex rules seek to avoid these conflicts, with heads of evaluation in most agencies not reporting to those in charge of operations and individual evaluations contracted for separately from implementation contracts. But this independence comes at the cost of isolating evaluation from the heart of operations.

In contrast, randomized evaluations require very close working relationships between evaluators and implementers from the design of the project through implementation. Despite this relationship it is possible for randomized evaluations to provide independent or objective results (I would argue that objectivity is what we are really after). This is because, for the most part, the results of a randomized evaluation are what they are. We set the experiment up, we monitor that the protocol is complied with, we collect the data and usually only when we see the final data can we tell whether the program worked. Usually at this final stage there is relatively little flexibility for the evaluator to run the analysis in different ways to generate the outcome they want to see. There are of course exceptions: in particular, when an experiment has a large number of outcome measures and it is not clear which ones are the most important, we may want to strengthen objectivity by specifying in advance how the data will be analyzed (see Chapter 8 of Running Randomized Evaluations). It is also important to provide independence by committing in advance to publish the results. But compared to much other evaluation work carried out by development agencies, randomized evaluations provide results which are harder to manipulate and thus are reasonably objective, allowing for the type of close partnerships between evaluator and implementer which can be incredibly productive for both sides and which help ensure evaluations are useful and used.

As a footnote to those interested in the agency mechanics of all this, DFID now evaluates a number of its programs with randomized evaluations which are usually commissioned by operations teams but are often classed as “internal” evaluations and thus not necessarily “independent”. Also, more recent international guidance notes (e.g., from NONIE) have a more nuanced approach to independence.