Session Recap: Beyond Internal Validity--The Peter H. Rossi Lecture
November 7, 2014 04:20 PM
By Sana Ahmad, Rutgers University
Dr. Larry Orr was selected as the 2014 Peter H. Rossi Award winner. Orr began his discussion by highlighting the important times and influential people in his life and career. He mentioned how he got his start as an Assistant Professor at the University of Wisconsin-Madison and discussed working with Joseph Newhouse on the RAND project.
An expert on randomized designs, Orr related that for forty years evaluation was consumed by a battle over internal validity. He and others engaged in a long struggle to convince researchers and policy makers that randomized trials are the best way to minimize threats to internal validity, such as selection and history. He related that random assignment designs are now universally acknowledged as the gold standard in evaluation. With some hurdles overcome, Orr acknowledged that the field now faces new challenges, one of which is the weakness of the standard model of impact evaluation. The standard model comprises random assignment to one or more interventions or a control group in a small number of sites, one or two rounds of follow-up surveys, and cost-benefit analysis. He claimed that, viewed broadly, the standard model is not the gold standard.
Orr highlighted two primary flaws of the standard model: a lack of external validity and a failure to take Rossi’s metallic laws into account. An externally valid evaluation provides unbiased estimates for the population of interest, which can be the population currently served or a new target population. He indicated that social policy interventions are never tested using a sample that is representative of the population of interest, a necessary requirement for establishing external validity. Samples are instead chosen for convenience or financial feasibility, and the population of interest is sometimes not even acknowledged. Expressing displeasure with the current practice of overlooking external validity, Orr said that just as researchers do not accept estimates of dubious internal validity, they should be equally intolerant of violations of external validity. Citing Head Start as an example, Orr showed that it is possible to have a nationally representative sample.
Orr prescribed five steps to ensure external validity in every evaluation: 1) define the population of policy interest at the outset, 2) think about how to draw a sample that reasonably resembles the population of interest, 3) compare the sample to the population on several characteristics to check that the sample is representative, 4) document these steps in the design report, and 5) use one of the various methods to project results from the sample to the population of interest.
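Steps 3 and 5 can be made concrete with a small sketch. The recap does not say which projection methods Orr had in mind, so the code below uses simple post-stratification reweighting as one possible approach. The strata, impact estimates, and population shares are invented for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: post-stratification is ONE possible projection
# method, not necessarily one Orr recommended. All numbers are made up.

def weighted_impact(stratum_impacts, shares):
    """Average the stratum-level impact estimates using the given shares."""
    total = sum(shares.values())
    return sum(stratum_impacts[s] * shares[s] / total for s in shares)

def share_gaps(sample_shares, population_shares):
    """Step 3: difference between sample and population share per stratum."""
    return {s: sample_shares[s] - population_shares[s] for s in population_shares}

# Hypothetical impact estimates from the evaluation sites, by stratum.
impacts = {"urban": 2.0, "rural": 0.5}
# Hypothetical composition of the evaluation sample (urban sites over-represented).
sample_shares = {"urban": 0.8, "rural": 0.2}
# Hypothetical composition of the population of policy interest.
population_shares = {"urban": 0.5, "rural": 0.5}

gaps = share_gaps(sample_shares, population_shares)
sample_estimate = weighted_impact(impacts, sample_shares)
projected_estimate = weighted_impact(impacts, population_shares)
print(gaps, sample_estimate, projected_estimate)
```

With these made-up numbers the unadjusted sample estimate is about 1.7, while the estimate reweighted to the population is 1.25, which shows how an unrepresentative sample can overstate the impact for the population of interest.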
The second primary flaw of the standard model that Orr highlighted is its violation of Rossi’s metallic laws, which hold that the expected value, or long-run average, of the net impact of large public programs is zero. This implies that the distribution of impacts of public programs is centered at zero, but it does not say anything about the outcomes of individual programs. He went on to give examples of trials from different fields (medicine, criminal justice, social justice) that essentially produced null results. Orr’s solution to this conundrum is to conduct better evaluations, and he related that Rossi came to the same conclusion.
It is almost impossible to know a priori whether an intervention will work, and an evaluation designed according to the standard model sometimes takes five or more years to complete and costs millions of dollars. Because of the heavy investment required, most agencies can conduct only one or two evaluations per year. To streamline, hasten, and improve the process, Orr proposed doing better and cheaper evaluations.
He suggested choosing interventions by strategically searching the existing evaluation literature to identify promising programs. The next step would be to conduct experiments in a two-stage process. The first stage would use administrative data to conduct a streamlined evaluation that measures the impact of the program on the primary outcome of interest to the client. If stage one shows promise, the next stage would look more like the standard model, in which cost-benefit studies are conducted. Orr said that the second stage is expected to yield a higher success rate because the program has already passed through stage one. The surest protection against false positive results is replication, and the two-stage evaluation model drives the false positive rate down. Orr concluded his talk by suggesting that management is an area in which cheap experiments can be conducted, noting that Google and Microsoft conduct these types of cheap experiments, where results are available very quickly.
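The arithmetic behind the replication argument can be sketched briefly. The significance level, statistical power, and share of truly effective programs below are assumptions chosen for illustration, not figures from the lecture, and the sketch treats the two stages as independent tests.

```python
# Illustrative arithmetic only. alpha, power, and prior_effective are
# assumed values, not numbers reported in the talk. Stages are treated
# as independent tests.

def false_positive_rate(alpha, stages):
    """Chance that an ineffective program passes every stage."""
    return alpha ** stages

def false_share_of_positives(prior_effective, alpha, power, stages):
    """Share of programs passing all stages that are actually ineffective."""
    true_positives = prior_effective * power ** stages
    false_positives = (1 - prior_effective) * alpha ** stages
    return false_positives / (true_positives + false_positives)

alpha = 0.05            # assumed significance level per stage
power = 0.80            # assumed power per stage
prior_effective = 0.10  # assumed share of tested programs that truly work

one_stage = false_share_of_positives(prior_effective, alpha, power, stages=1)
two_stage = false_share_of_positives(prior_effective, alpha, power, stages=2)
print(false_positive_rate(alpha, 2), one_stage, two_stage)
```

Under these assumptions, an ineffective program passes both stages with probability 0.0025 rather than 0.05, and the share of apparent successes that are false falls from 36 percent after one stage to roughly 3 percent after two. The trade-off is that a truly effective program must also pass twice, so some real effects are missed.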