The Will to Quantify: The “Bottom Line” in the Market Model of Education Reform
by Leo M. Casey - 2013
Background/Context: There is a deep and yawning chasm between the world of tests and testing practices as they ought to be and the actual tests and testing practices now imposed on American students, educators, and schools. That chasm of theory and practice is a function of the dominant paradigm of educational reform, with its theory of action that schools must be remade in the image and likeness of a corporation.
Purpose: To explore the role, development, and implications of assessment use in the market model of education reform.
Research Design: Analysis of the recent publication of the value-added measurements found in the Teacher Data Reports of the New York City Department of Education.
Conclusions: Since standardized tests provide data for a “bottom line,” they have been widely embraced in some circles as a basis for making “high stakes” decisions that hold individuals and schools accountable. In the market model of education reform, questions about the validity and reliability of the tests and of their use for “high stakes” decisions are dismissed as efforts to avoid accountability.
It has been close to a half-century since the American Education Research Association, the American Psychological Association, and the National Council on Measurement in Education first jointly published the Standards for Educational and Psychological Testing, setting forth their professional consensus on the appropriate designs and uses of testing. The Standards provided a framework for the evaluation of tests and testing practices, with a goal of encouraging the adoption of what would be called, in current nomenclature, best practices.
Understood by its sponsoring organizations as a living document, the Standards have gone through two major revisions, in 1985 and in 1999. In her contribution to the March 2012 conference on Educational Assessment, Accountability and Equity, Eva Baker analyzed the work of the joint committee that produced the 1999 version of the Standards, drawing upon her experience as its co-chair. Her account is rich and insightful, refreshingly self-reflective, and even self-critical. But for readers grounded in the contemporary reality of American K-12 schools, the one striking takeaway from her discussion is the deep and yawning chasm between the world of tests and testing practices as they ought to be, the raison d'être of the Standards, on the one hand, and the actual tests and testing practices now imposed on American students, educators, and schools, on the other. The more American education has come under the thrall of the testing regime initiated with No Child Left Behind, the harder it has become to see the substance of the Standards in actual tests and testing practices. Today, the tests and testing practices in American education are, on many significant counts, radically opposed to the Standards.
GOING PUBLIC: THE NEW YORK CITY EXPERIENCE
When education measures go public in the current environment, one must ask whether the measures in question actually tell us anything that is educationally meaningful. Consider the recent experience in New York City with the publication of individual Teacher Data Reports (TDRs). The TDRs were the NYC Department of Education's experiment in value-added measures of teacher performance. Using New York State standardized exams in English Language Arts and Math, grades 3 through 8, TDRs were developed for approximately 18,000 teachers of those subjects over three school years: 2007-08, 2008-09, and 2009-10. When the TDRs were first introduced, then Schools Chancellor Joel Klein and then United Federation of Teachers (UFT) President Randi Weingarten jointly wrote to all NYC public school teachers that the reports would not be used for evaluation purposes, specifically identifying tenure decisions and the annual rating process as two areas where they would play no role. Instead, they explained, the reports were new information that should be used to help teachers improve their instruction and develop as professionals. Contemporaneously, then Deputy Chancellor Chris Cerf wrote to Weingarten that it was the DOE's firm position and expectation that the Teacher Data Reports "will not and should not be disclosed or shared outside of the school community." But Klein very quickly went back on his word and the DOE's institutional promises. Acting unilaterally, the DOE began to use the TDRs in tenure decisions. And DOE officials began to solicit Freedom of Information requests for the TDR reports from news media, and the requests were filed in short order. In a news interview, then Deputy Chancellor John White defended this reversal of policy and the release of the TDRs: the data would strengthen the city's case for changing the policy on firing teachers, he said, providing grist for the city's campaign to make NYC public school teachers into at-will employees.
The UFT went to court and fought the release of individual teacher names with the TDRs, but was ultimately unsuccessful in its legal challenges, and the full TDRs were published by all of the major NYC print media in February 2012.
What are we to make of the TDRs as educational measures of teacher performance? There are serious issues regarding the validity and reliability of value-added scores, as the tests in question were designed for another purpose: the measurement of student performance, not teacher or school performance. (This issue is addressed directly in Standard 11.2 of the Standards.) But in the case of the TDRs, it was clear that the data from the New York State tests were neither reliable nor valid for their original purpose, much less for other, ancillary purposes. The New York State Education Department (NYSED) had asked a noted testing expert, Harvard Professor of Education Daniel Koretz, to study the rather dramatic increases in proficiency rates on the exams used to calculate the TDRs. Comparing student scores on those exams to student scores on the NAEP, Koretz concluded that the apparent improvement in performance on the New York tests appeared to arise in substantial part from score inflation, a lowering of standards, or both. Commenting on the Koretz study and the state exams, New York Regents Chancellor Meryl Tisch declared: "We have to stop lying to our kids. We have to be able to know what they do and do not know." To its credit, NYSED undertook a redesign and recalibration of the state exams, starting in 2010-11. Yet the flawed exams were the very same exams that provided the basis for the TDRs that were published.
What was worse, the DOE made choices in the design of the TDRs that significantly increased the validity and reliability problems in the data. A 2010 U.S. Department of Education study found that value-added measures in general have disturbingly high rates of error, with the use of three years of test data producing a 25% error rate in classifying teachers as above average, average, and below average, and one year of test data yielding a 35% error rate. New York University Professor Sean Corcoran has shown that it is hard to isolate a specific teacher effect from classroom factors and school effects using value-added measures, and that with a single year of test scores, it is impossible to distinguish the teacher's impact. The fewer the years of data and the smaller the sample (the number of student scores), the more imprecise the value-added estimates. Yet the DOE chose to ignore these problems in a push to produce data that could be used in the annual evaluations of the greatest number of teachers. Consequently, TDRs were produced with a single year of test score data, and sample sizes as low as 10 student scores were used. When students did not have a score in a previous year, scores were statistically imputed to them in order to produce a basis for making a growth measure. With this alchemy, the published TDRs had average confidence intervals spanning 60 to 70 percentiles for a single-year estimate. On a distribution that went from 0 to 99, the average margins of error in the TDRs were, by the DOE's own calculations, 35 percentiles for Math and 53 percentiles for English Language Arts. One-third of all TDRs, the DOE conceded, were completely unreliable; that is, so imprecise as to not warrant any confidence in them. The sheer magnitude of these numbers takes us into the realm of the statistically surreal.
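The statistical point about sample size can be illustrated with a minimal simulation. The sketch below is hypothetical and is not the DOE's actual value-added model: it assumes illustrative effect sizes (a spread of true teacher effects of 0.1 test standard deviations against student-level score noise of 0.8) and shows how much a teacher's percentile rank swings from one year to the next when each yearly estimate rests on only a handful of student scores.

```python
import random
import statistics

random.seed(0)

N_TEACHERS = 1000
TRUE_SD = 0.10   # assumed spread of true teacher effects (test SD units)
NOISE_SD = 0.80  # assumed student-level score noise (test SD units)

def percentile_ranks(scores):
    """Convert raw scores to 0-99 percentile ranks, as the TDRs did."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = (100 * position) // len(scores)
    return ranks

def mean_rank_swing(n_students):
    """Average year-to-year change in a teacher's percentile rank
    when each yearly estimate rests on n_students scores."""
    true_effects = [random.gauss(0, TRUE_SD) for _ in range(N_TEACHERS)]

    def one_year():
        # Standard error of the class-mean estimate shrinks with class size
        se = NOISE_SD / n_students ** 0.5
        return [t + random.gauss(0, se) for t in true_effects]

    ranks_year1 = percentile_ranks(one_year())
    ranks_year2 = percentile_ranks(one_year())
    return statistics.mean(abs(a - b) for a, b in zip(ranks_year1, ranks_year2))

for n in (10, 30, 100):
    print(f"{n:3d} students per class: mean swing "
          f"{mean_rank_swing(n):4.1f} percentiles")
```

Under these assumed numbers, a teacher ranked on 10 student scores sees her percentile rank jump by tens of percentiles between years purely from sampling noise, while larger classes yield noticeably more stable ranks; the design choice to allow samples as small as 10 scores guarantees volatile rankings regardless of the underlying model.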
As if all of this were not sufficiently troubling by itself, add into the equation the fact that the data used by the DOE to construct the TDRs were significantly corrupted. Many teachers were assigned scores for students they had not taught and were missing scores for students they had taught. Teachers were even assigned entire classes they had never taught, and were missing entire classes they had taught.
In sum, the construction and publication of the TDRs demonstrated how not to use tests and test data in the responsible ways advocated by the Standards. It was, to put it simply, a demonstration of professional malpractice in the realm of testing.
While they would not have expressed themselves in the language of the Standards, New York City public school teachers were deeply angered by the publication of the TDRs, with all of their well-documented flaws. For them, the act of publication, and the complicity of the DOE's Chancellor and the city's Mayor in that publication, was yet another undeserved attack on their work and their profession at a time when both are under siege. A significant element in this perception was the fact that teachers they knew, women and men who had dedicated their lives to educating and caring for the city's young people, had their professional reputations and their life's work damaged by the publication of misleading and incorrect information. Take the case of Pascale Mauclair, a sixth-grade teacher at P.S. 11 in Queens who was pilloried in the Murdoch-owned tabloid New York Post as the city's worst teacher based solely on her TDR reports. Upon investigation, Mauclair proved to be an excellent teacher who had the unqualified support of her school, one of the best in the city: her principal declared without hesitation or qualification that she would put her own child in Mauclair's class, and her colleagues met Mauclair with a standing ovation when she returned to the school after the Post's attack.
Mauclair's undoing had been her own dedication to teaching students with the greatest needs. As a teacher of English as a Second Language, she had taken on the task of teaching small, self-contained classes of recent immigrants for the previous five years. Those two factors, teaching high-needs students and teaching small classes, have a particularly contorting effect on the TDRs, given their methodology. Add to this mix the fact that Mauclair's school had an unusual configuration of its sixth-grade classes that distorted comparisons to other sixth-grade classes across the city, and you have the makings of a TDR that could not have been more wrong as a measure of a teacher's performance.
The public outcry over the publication of TDRs was intense, and was joined by unexpected allies such as Bill Gates and Wendy Kopp. In June 2012, the New York state legislature and governor responded by enacting legislation, over the strident objections of Mayor Bloomberg and the tabloid press, that prohibited the public disclosure of teacher evaluations. Under this new law, parents may be told the rating of their own children's teacher, but there will no longer be mass releases or mass publications of teacher evaluations. This legislation is a welcome development, but it must be noted that New York City is not the only school district that has seen the publication of value-added scores: the second largest school district in the country, Los Angeles, underwent the same experience when the Los Angeles Times produced and published value-added scores for that city's teachers, and there has yet to be any legislative relief in California.
THE BOTTOM LINE IN MARKET EDUCATIONAL REFORM
It is important to remember that the problems with the TDRs went well beyond their publication. The deeper, more vexing question posed by the New York City experience is why such professional malpractice came to define a central New York City Department of Education initiative to evaluate teachers. The answer to this question lies in the theory of action that informed this and other DOE projects. For Michael Bloomberg, Joel Klein, and a cohort of similarly minded education reformers across the United States, the fundamental problem with American public education is that it has been organized as a monopoly that is not subject to the discipline of the marketplace. The solution to all that ails public schools, therefore, is to remake them in the image and likeness of a competitive business. Just as private businesses rise and fall on their ability to compete in the marketplace, as measured by the bottom line of their profit-and-loss statement, schools need to live or die on their ability to compete with each other, based on an educational bottom line. If bad schools die and new good schools are created in their stead, the productivity of education improves. But to undertake this transformation and to subject schools to market discipline, an educational bottom line must be established. Standardized testing and value-added measures of performance based on standardized testing provide that bottom line.
This theory of action applied not only to schools, but also to the teachers who worked in them. Here Bloomberg, Klein, and other education reformers adopted the "stacking" theory of personnel management first popularized by Jack Welch, the controversial former CEO of General Electric. (Welch was hired as a consultant in the early days of the DOE's school supervisor training program, the Leadership Academy.) Welch believes that the disciplining power of competition must be applied within the business itself: all employees are to be ranked, or "stacked," from the highest to the lowest performing, and each year the bottom 10% of employees must be fired. Welch's theory relies upon what might be called a Hobbesian market model, in which the workplace is organized around a competition for survival, a war of all against all. This cut-throat competition engenders a fear-ridden, conformist culture and an ethos of servility in which the autocrat who rules the workplace exercises unchallenged power. To this end, the price of challenging autocratic power is deliberately set very high, so that few will overcome the fear and muster the courage to take such a step. Moreover, workers are pitted against each other in bitter competition precisely because that competition will make it more difficult for them to mount the solidaristic actions that could make challenges to autocratic power successful.
The Hobbesian market model of personnel management is therefore as much a political theory of how to rule the workplace as it is an economic theory of how to maximize enterprise productivity. Welch's well-known anti-union animus played a pivotal role in developing this model: on principle, he is opposed to the idea that workers should have a collective voice in how their workplace is organized and run, and he conducted harsh campaigns against the General Electric unions during his time as CEO. A significant part of the attractiveness of Welch's approach for Bloomberg, Klein, and like-minded education reformers has been their own ardent opposition to teacher unions and to a meaningful professional voice for teachers in the important decisions at their schools, as well as their general distrust of career professional educators. Once this common ground is grasped, the obsession of Bloomberg and Klein with having unfettered power to fire teachers, and the constant talk from the NYC DOE about the legions of bad teachers who need to be replaced, becomes comprehensible as part of the Hobbesian market approach to personnel management.
Seen in this context, decisions that are inexplicable in terms of the well-established professional code of the Standards become intelligible as part of a theory of action that seeks to remake schools in the image and likeness of competitive businesses. If the objective is to produce an intense competition for professional survival among teachers, in which the bottom 10% are fired each year, then what one needs is an annual ranking, period. Using three years of value-added data would produce a more reliable and valid measure of a teacher's performance than one year of value-added data, but it would also make it impossible to do annual rankings for many teachers. Similarly, using only larger sample sizes would eliminate the most unreliable and invalid measures, but then many teachers would no longer be ranked. What matters here are the political effects of the competition that comes from the ranking and the firings, not the educational soundness or quality of the ranking. The market model of education reform has become prisoner to a Nietzschean will to quantify, in which the validity and reliability of the actual numbers is irrelevant.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Corcoran, S. P. (2010). Can teachers be evaluated by their students' test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Providence, RI: Annenberg Institute for School Reform at Brown University. Retrieved from http://www.annenberginstitute.org/pdf/valueAddedReport.pdf