In 2001, the US Congress passed the No Child Left Behind Act in an effort to measure and improve student performance in math and English language skills. The law required states to adopt standardized tests and to create timelines, with annual benchmarks, that would bring student proficiency levels to 100 percent by 2014. Schools were judged not only on their overall achievement rates, but on the performance of various subgroups, such as students from low-income families or historically disadvantaged ethnic groups. If any of these subgroups failed to meet the annual goals, the school would fail and therefore face penalties including the loss of funding and the possibility of restructuring or closure.
Under these new regulations, however, states were allowed wide flexibility. They could choose their own standardized tests, set their own annual benchmarks, and apply various allowances when grading a school’s performance. The result? Small, often subtle differences in implementation led to significant differences in measured outcomes, according to new research by Professor Jonah Rockoff, who worked with Elizabeth Davidson of Teachers College and Randall Reback of Barnard College, both at Columbia University, and with Heather Schwartz of the RAND Corporation.
One of these subtle yet significant differences in implementation was the use, by some states, of a confidence interval, a statistical means of accounting for sampling error. Suppose a state set its math proficiency benchmark at 58 percent, and 56 percent of students in a particular school scored above the state’s proficiency level. The question then becomes: can state administrators be fairly certain that if all the students at the school took the test again, it would still fail to reach 58 percent? In order to provide more assurance that failing schools were truly below the benchmark, states adopted confidence intervals at levels as high as 90, 95, or even 99 percent.
With a large confidence interval, the real bar is often far below the stated benchmark, Rockoff explains. “It might sound trivial, but with a very wide confidence interval, maybe only 30 percent of students had to pass the test,” instead of 58 percent, he says. “It’s odd, because we never use confidence intervals in grading; a student could not tell a teacher, ‘It’s true that I failed, but you can’t be 99 percent sure that if you tested me again, I would still fail.’” Yet that is essentially how many states implemented No Child Left Behind. And while some states had wide confidence intervals, others set none, leading to dramatic differences in reported failure rates.
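To see how a confidence interval can pull the real bar so far below the stated one, here is a rough sketch. It assumes a simple normal approximation for a pass rate, which is an illustrative choice and not necessarily the formula any state used; the function name `effective_pass_rate` and the group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def effective_pass_rate(benchmark, n_students, confidence, two_sided=True):
    """Lowest observed pass rate that still counts as meeting the benchmark.

    Assumption (not from the research): the state subtracts a
    normal-approximation margin of error from the stated benchmark.
    """
    alpha = 1.0 - confidence
    tail = alpha / 2 if two_sided else alpha
    z = NormalDist().inv_cdf(1.0 - tail)  # critical value for the chosen level
    standard_error = sqrt(benchmark * (1.0 - benchmark) / n_students)
    return max(0.0, benchmark - z * standard_error)


# Stated benchmark of 58 percent, 99 percent confidence level.
small_group = effective_pass_rate(0.58, n_students=20, confidence=0.99)
large_group = effective_pass_rate(0.58, n_students=2000, confidence=0.99)
print(f"20 students:   effective bar {small_group:.1%}")
print(f"2000 students: effective bar {large_group:.1%}")
```

Under these assumptions, a group of 20 students tested against a 58 percent benchmark would need a pass rate of only about 30 percent, while a group of 2,000 would still need roughly 55 percent. The smaller the group and the higher the confidence level, the lower the real bar.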
Another implementation detail that had a significant effect on measured outcomes was allowing states to set the minimum size of their student subgroups. To some extent, this seems reasonable: If a school had only one student living in poverty, and that student failed the test, should the entire school fail? However, as with confidence intervals, the use of minimums varied greatly among states, with some states setting minimums in the single digits and others setting them many times higher. “I can’t imagine that the policymakers who designed No Child Left Behind would say they intended that in North Dakota, a school needs to have 40 poor students for the results of their testing to matter, but in South Dakota, you need only ten,” Rockoff says. “That goes beyond the intended flexibility.”
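The effect of the minimum subgroup size can be illustrated the same way. The sketch below is a simplification with made-up numbers: the school, its enrollment, and the `school_passes` function are hypothetical, and the only figures taken from the text are the 58 percent benchmark and the minimums of 40 and 10 that Rockoff mentions.

```python
def school_passes(subgroups, benchmark, min_size):
    """Return True if every subgroup large enough to count meets the benchmark.

    subgroups maps a label to (number tested, number proficient).
    Subgroups smaller than min_size are ignored entirely.
    """
    for n_tested, n_proficient in subgroups.values():
        if n_tested < min_size:
            continue  # too small to count under this state's rule
        if n_proficient / n_tested < benchmark:
            return False
    return True


# Hypothetical school: 62 percent proficient overall, 44 percent among
# its 25 low-income students.
school = {
    "all students": (300, 186),
    "low-income": (25, 11),
}

print(school_passes(school, benchmark=0.58, min_size=40))  # subgroup ignored
print(school_passes(school, benchmark=0.58, min_size=10))  # subgroup counted
```

With a minimum of 40, the low-income subgroup is too small to count and this school passes; with a minimum of 10, the same students and the same scores produce a failing school.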
No Child Left Behind was intended to raise standards, not to compare student performance across the nation or make schools all teach the same material. But as the researchers show, even with flexibility on what schools are teaching, meaningful accountability requires consistency on implementation. “It’s fine if state administrators in Mississippi want to teach fifth-grade math differently than they do in Massachusetts,” Rockoff says. “But even if you give states flexibility with some big-picture items, you really need to nail down all of the details on how things get measured.”
These findings have implications for any large organization that wants to establish an incentive and accountability system, Rockoff notes. A conglomerate that wants to establish incentives might allow its various divisions some leeway on how they set their benchmarks. But this intention can backfire without sufficient measurement standards. “You can give flexibility about goals or performance,” Rockoff says. “But if you allow too much flexibility with measurement, the system will fail.”
Jonah Rockoff is associate professor of finance and economics at Columbia Business School.
Read the Research
"Fifty Ways to Leave a Child Behind: Idiosyncrasies and Discrepancies in States' Implementation of NCLB"