Trusting Our Judgment: Measurement and Accountability for Educational Outcomes
by David Steiner - 2013
Background/Context: Education policymakers across the country face an urgent problem: we know there is wide disparity in teacher effectiveness, but we lack meaningful tools to identify and reward the most effective teachers or to ensure that the least effective improve or leave the classroom.
Purpose: This article considers the value of the national move toward value-added measures and our current fascination with objective measurements – a fascination that stems from our collective distrust of our teachers and ourselves, and our reluctance to make judgments about the substantive narratives we teach students.
Research Design: This is an analytical and reflective piece that draws upon the author’s experience serving as New York State Commissioner of Education and dean of a teacher education institution in New York City.
Conclusions: Value-added measures of teachers’ impact on student learning are an imperfect but important tool; however, by often refusing to take responsibility for what is worth teaching, we risk cutting off important opportunities for democratic education and ultimately impoverishing students’ own ability to make meaningful judgments about their world, regardless of their test scores.
THE VALUE OF VALUE-ADDED MEASURES
Education policymakers across the country face an urgent problem: we know there is wide disparity in teacher effectiveness, but we lack meaningful tools to identify and reward the most effective teachers across the system or to ensure that the least effective improve or leave the classroom. The consequences for our children are real. Teachers are widely believed to be the single most important in-school factor in determining student achievement: students with the least effective teachers may learn only half of what their peers with average teachers learn in a year (Hanushek, 2008). Students burdened with very poor teachers two years in a row could therefore find themselves, solely as a result of the quality of their instruction, a full year behind, the kind of gap it may be impossible to close.
Like other educators and officials, I faced this problem directly when I served as New York State Commissioner of Education. Our principals and district leaders found it virtually impossible to remove teachers based on performance, no matter how persuasive the evidence of their ineffectiveness might be. To give one example, in 2008 and 2009, out of some 55,000 tenured public school teachers in New York City, only 3 were fired for incompetence (Medina, 2010).
A major part of the problem was the lack of a meaningful way to assess how well individual teachers taught. The teacher evaluation systems that were in place in New York and virtually everywhere else in this country were essentially pro forma. A 2009 study of teacher evaluations in 12 diverse districts across 4 states made this point starkly: fewer than 1% of all teachers were rated unsatisfactory, and where more than 2 ratings were available (e.g., superior, excellent, satisfactory, unsatisfactory), 94% of all teachers received 1 of the 2 highest ratings. The study concluded that meaningless evaluations made teachers seem interchangeable: the widget effect (Weisberg, Sexton, Mulhern, & Keeling, 2009). If all teachers are good or great, it is impossible to reward excellence or address ineffectiveness.
In this context, value-added measurement is an imperfect but, if used carefully, extremely valuable tool for improving educational outcomes. The latest research suggests that students assigned to high-VA teachers score higher on key exams than similar students with other teachers, controlling for external characteristics like parent socioeconomic status. This effect is visible from the first year these highly effective teachers arrive at a new school. More important, these students are more likely to succeed in a number of important ways outside the classroom: they are more likely to attend college, attend higher-ranked colleges, and earn higher salaries (Chetty, Friedman, & Rockoff, 2012).
Value-added measures are by no means a perfect tool. The multitude of factors that affect students means there is a lot of noise in the calculations, so that the correlation between value-added measures of teacher impact from year to year, as measured by Glazerman, Goldhaber, Loeb, Raudenbush, and Staiger (2010), is somewhere between 0.30 and 0.40. This is about as consistent as the tools relied on by employers in other complex professions, where the year-to-year correlations of widely used indicators like an investment firm’s average returns, a Major League Baseball player’s batting average, and a real estate agent’s annual sales volume range from 0.33 to 0.40 (Glazerman et al., 2010; Sturman, Cheramie, & Cashen, 2005; Schall & Smith, 2000). And, for educators specifically, it is significantly better than indicators traditionally used to evaluate and promote teachers, such as possession of a master’s degree, years of experience, and hours of professional development (Glazerman et al., 2010).
Of course, even the most artfully constructed value-added accountability system should not be the only measure used to assess a teacher’s effectiveness, and indeed every state that plans to use value-added measures does so as one of a set of evaluation tools. In New York State, for example, school districts will base 40% of a teacher’s annual review on student achievement as measured by state and local assessments, with at least half of this (20%) based on state standardized tests. The remaining 60% will be based primarily on principal observation, though districts may also give weight to student and parent feedback or student portfolios (Governor’s Press Office, 2012). Used in this way, as one indicator among several, value-added measures have proven to be effective at predicting student achievement on not just multiple-choice standardized tests but also higher-order assessments of conceptual understanding in mathematics and literacy tests requiring short written responses, and more effective, again, than graduate degrees or years of experience (Kane & Staiger, 2012).
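The weighting described above can be made concrete with a short sketch. It is purely illustrative and is not the state’s actual scoring formula: the function name, the 0-100 scale for each component, and the example scores are all hypothetical; only the 40%/60% split and the 20% floor for state tests come from the text.

```python
def composite_rating(state_tests, local_measures, other_measures, state_weight=0.20):
    """Combine three component scores into one rating.

    Each score is assumed (hypothetically) to be on a 0-100 scale.
    Student achievement carries 40% of the review, of which state
    standardized tests must account for at least 20 percentage points;
    the remaining 60% covers observation and other measures.
    """
    achievement_weight = 0.40
    other_weight = 0.60
    if not 0.20 <= state_weight <= achievement_weight:
        raise ValueError("state tests must carry between 20% and 40% of the review")
    # Whatever part of the 40% is not assigned to state tests goes to local measures.
    local_weight = achievement_weight - state_weight
    return (state_weight * state_tests
            + local_weight * local_measures
            + other_weight * other_measures)


# Hypothetical teacher: weak state-test growth, strong observation scores.
example = composite_rating(state_tests=50, local_measures=70, other_measures=90)
print(f"Composite rating: {example:.1f}")
```

Under these assumed numbers, a low score on the state-test component moves the overall rating only modestly, which is the sense in which value-added data serve as one indicator among several.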
As value-added accountability systems become a factor in important decisions like which teachers to reward and promote and which teachers to support with professional development or remove from the classroom, it is crucial that these systems be implemented carefully. This will mean attending to the important technical concerns raised by critics of value-added measures and the balanced analyses of experts, some of whose work appears in this volume. For example, we know that multiple years of data for a teacher yield more reliable value-added measures with tighter confidence intervals; we know that students are not always distributed into classrooms randomly, complicating the attribution of test-score gains; we know that increased accountability carries with it the risk of manipulation of the results of high-stakes tests; and we know that rigorous, normed standardized tests are often only available for mathematics and English language arts and only for Grades 3 or 4 to 8, which raises equity issues for the evaluation of teachers in other subjects and grade levels.
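The first of these points, that additional years of data tighten the estimate, can be sketched with a standard textbook result. The sketch rests on loud assumptions that are not drawn from the studies cited here: it treats each year’s value-added score as a parallel measurement of a stable teacher effect with independent, equal-variance noise, treats the year-to-year correlation as the single-year reliability, and applies the Spearman-Brown formula. The 0.35 starting value is simply the midpoint of the 0.30 to 0.40 range cited earlier, and the function names are hypothetical.

```python
import math


def reliability_of_average(single_year_r, years):
    """Spearman-Brown reliability of the mean of `years` parallel measurements."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return years * single_year_r / (1 + (years - 1) * single_year_r)


def relative_interval_width(years):
    """Confidence-interval width of a multi-year mean relative to one year,
    assuming independent, equal-variance noise in each year."""
    if years < 1:
        raise ValueError("years must be at least 1")
    return 1 / math.sqrt(years)


# Illustrative single-year reliability: midpoint of the 0.30-0.40 range.
single_year_r = 0.35
for years in (1, 2, 3, 5):
    r = reliability_of_average(single_year_r, years)
    w = relative_interval_width(years)
    print(f"{years} year(s): reliability {r:.2f}, relative interval width {w:.2f}")
```

Real value-added models are more elaborate than this, and a teacher’s true effectiveness may itself change over time, so the sketch should be read as showing the direction of the effect rather than its precise size.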
Policymakers are rarely ace statisticians or measurement experts, and I would not claim to be either. Most of us try, however, to listen to those who are, and must ultimately make up our minds in circumstances where the experts don’t fully agree. I know that educators, policymakers, and researchers must work diligently to incorporate all of this knowledge and ensure validity, and we need ongoing research to continue to enhance the reliability and consistency of our student assessments and value-added systems. But, on balance, I remain convinced that value-added measurement is indispensable to the improvement of our educational system.
As more and more states incorporate value-added measurements into teacher evaluations, the greatest risk I see is that our focus on measuring teachers (in part) through gains in standardized test scores could make the profession less attractive to talented aspiring teachers and, as the economy improves, drive some portion of the best current teachers out of the classroom. Media coverage of education debates often gives prominence to dramatic stories about outlier teachers whose incompetence is truly shocking (those who fall asleep in class, for example) and revels in the sometimes vitriolic rhetoric deployed by politicians and commentators. This has created a somewhat toxic atmosphere of suspicion and distrust, especially in some of our largest urban centers, for the vast majority of teachers who work conscientiously to educate their students.
We should be clear: teachers are our greatest assets in education reform. It would be a perilous error to view value-added systems as a way to force supposedly indifferent teachers finally to do their jobs; bean-counting measures of educational outcomes will never capture the full range of what they do, and pretending otherwise risks infantilizing the teaching profession. Rather, we must treat value-added data (always combined with other evidence) as a way to celebrate the many teachers who excel at educating children and support those who struggle to do so effectively, which will, in a non-trivial number of cases, mean counseling them out of the profession. If we do this well (if we do this from an attitude of trust), rich, multi-dimensional teacher evaluations should ultimately help increase respect for teachers and draw ever more qualified young people to the front of the class.
ACCOUNTABILITY AND THE RETREAT FROM JUDGMENT
I’d like to step back from the perspective of a policymaker grappling with the very specific, very pressing issue of value-added systems to consider a deeper question about the trajectory in American education that has brought us to this point. Teachers who critique value-added measures often point to the fact that such measures can only be as valuable as the underlying high-stakes student assessments on which they are based, and those same critics go on to pan our current assessments. They have an important point: while multiple-choice and short answer-based tests can track basic skills and some important competencies, they do a poorer job of assessing the kind of knowledge and analytic skills that students across the globe are routinely required to demonstrate. But these same teacher critics, in my view, have not helped us understand why this is the case.
I would trace the origins of this testing gap to the fact that as a society, we simply can’t decide what we want to teach. Our indecision cuts across English language arts, social studies, and even to some extent the sciences. Fewer and fewer individual texts are required reading either on curricula or for exams. The Common Core State Standards for the English language arts, which have recently been adopted by nearly all the states, are revolutionary both in their national reach and in the granular descriptions of increasingly complex skills they lay out. But they conspicuously avoid requiring specific readings beyond gestures toward classic myths and stories from around the world, America’s Founding Documents, foundational American literature, and Shakespeare (Common Core State Standards Initiative, 2012). In this respect, our standardized tests are symmetrical to our standards. The official description of the AP exam in English Literature and Composition, for example, explicitly states that there is no recommended or required reading list. Instead, teachers are offered a list of 150-200 representative authors to suggest the range and quality of reading expected in the course (College Board, 2010). (This is in marked contrast to the AP exam in Latin, which requires exactly two texts: four books of Virgil’s Aeneid and four books of Caesar’s Gallic War. It is evidently easier to be prescriptive in a dead language.)
Why would we be so reluctant to expect students to be familiar with particular works of literature? We live in a richly pluralistic democracy, home to more and more varied worldviews than perhaps any other nation in history. This heterogeneity creates a community of strangers ill at ease with grand narratives that could help us make sense of our culture, the kind of universal value systems that would allow us to say with confidence that one book is more worth our students’ attention (more profound, richer in imagination, more insightful) than another. To draw up a list of such books, even a list that changes through the years and has choices built into it, would require contentious choices we seem unprepared to make. We do not trust our judgment or our neighbors’. Thus the time-honored resistance to the imposition of a national curriculum and devolution of educational authority to the local district. And so rather than insisting that our students master a shared set of evolving cultural works, we ask them to acquire reading skills that can be exercised on any text, from Shakespeare to the morning paper or a corporate memorandum. Conveniently, these skills can be assessed through multiple-choice questions about short passages.
We thus allow all parties to retreat from judgment: we assume that our society need not make hard choices among texts; our teachers need not make potentially controversial qualitative judgments about student work; and our students need not learn to make aesthetic, philosophical, or spiritual judgments about their world. They are themselves judged only on their ability to sort evidence and information.
Taken as a whole, the Common Core State Standards are potentially a more important education reform than value-added measurements. As rigorous, well-defined national standards, they represent huge progress over the hodgepodge of uneven state standards they will replace. But, like value-added measurements, they are only part of the solution, and they must be implemented very carefully to avoid damaging the system they mean to improve.
In discussing the Common Core ELA standards and illustrating how they should be used, David Coleman, one of the chief architects of the standards, often recurs to the metaphors of the detective and the journalist. He imagines students who read and write like investigators marshaling evidence, which of course can be collected from any passage packed with sufficient information. The skills of detective and journalist are real and worthy, but I am concerned that we not neglect aesthetic judgment and higher-order thinking. Great literature is great in part because it is complex to its core, ambiguous, and multifaceted: that is why such literature can be read over many generations in vastly different ways. To be responsive to such texts, to be open to their invitation, is to be a reader, not only a detective gathering clues or a journalist dividing fact from fiction. The works that are central to our human narrative pose fundamental questions about the human condition, questions that I believe should be central to a democratic education. Whatever the virtues of a consummate journalist-detective, it is not clear that we should expect her to be an informed and autonomous citizen or a fulfilled human being.
Our classroom standards/student assessment/teacher evaluation system doesn’t have to eschew judgment in this way. The Common Core could represent not just an important consensus on the skills and competencies required for employment in the new economy but, perhaps, a framework in which we can begin to trust our collective judgments (hard-fought, compromised, and evolving as they may need to be in America) about how to teach our children who we are. Two consortiums of states are currently being funded by the federal government to design the next generation of exams, tied to the Common Core. This is a rare opportunity to ensure we have deep-probing assessments that will empower our teachers to teach not just basic skills but true judgment. Several years ago, French students aspiring to college were asked to write for four hours without notes on the question, “Can knowledge of the self be sincere?” We are not French, but this is a useful bookend to bear in mind, on the other side of the shelf from a potential content-averse, evidence-focused, multiple-choice test of investigative prowess.
Hannah Arendt wrote that when we choose to educate the next generation, we have to take responsibility for our world, for the portrait of that world we wish to place before our children, for the stories we will share with them, and for the knowledge that will ground their freedom (Arendt, 1968). Our retreat from judgment is a shirking of this responsibility. As we insist on greater accountability for our students’ growth (for teachers, schools, and teacher preparation programs), we should not give ourselves a free pass.
I have focused in this article on a few educational reforms that are now gaining traction nationwide and their larger implications for our culture. In her insightful paper in this issue, Eva Baker also raises important points about the future promise and risks of assessment and validity in an increasingly digital age. In the realm of teacher preparation, we are seeing a clear trend toward more granular assessments geared toward the measurement of particular pedagogical skills. For example, 25 states are now moving to adopt the new Teaching Performance Assessment (edTPA) exam, a video- and portfolio-based analysis of teaching practice with detailed, subject-specific rubrics. Against this backdrop, it is easy to imagine a system of badges of proficiency or mastery for both students and teachers along the lines Eva Baker envisions. These hold great promise for efficient career preparation and signaling, but it is hard to imagine a meaningful badge in judgment. As educators and policymakers, we need more and better data to make crucial decisions about how to design the systems in which teachers are prepared and children educated; this and vigilance in using that data responsibly are the only way we can ensure validity. But as citizens and human beings, we must remain loyal to the faculty of judgment in a way that transcends these important technocratic concerns. Somewhere at the heart of our ongoing quest to teach and to learn better, there is a holistic, qualitative, and very human component that deserves to be honored.
REFERENCES
Arendt, H. (1968). Between past and future. New York, NY: Penguin.
Chetty, R., Friedman, J. N., & Rockoff, J. (2012). The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. Cambridge, MA: NBER working paper. Retrieved from http://www.nber.org/papers/w17699
College Board (2010). English course description. New York, NY: College Board. Retrieved from http://apcentral.collegeboard.com/apc/public/repository/ap-english-course-description.pdf
Common Core State Standards Initiative (2012). Myths Vs. Facts. Retrieved from http://www.corestandards.org/about-the-standards/myths-vs-facts
Glazerman, S., Goldhaber, D., Loeb, S., Raudenbush, S., & Staiger, D. (2010). Evaluating teachers: The important role of value-added. Washington, DC: Brookings.
Governor’s Press Office (2012, February 16). Governor Cuomo announces agreement on evaluation guidelines that will make New York State a national leader on teacher accountability. Albany, NY: Governor’s Press Office. Retrieved from http://www.governor.ny.gov/press/02162012teacherevaluations
Hanushek, E. (2008). Teacher deselection. Leading Matters. San Francisco, CA: Leading Matters. Retrieved from http://www.stanfordalumni.org/leadingmatters/san_francisco/documents/Teacher_Deselection-Hanushek.pdf
Kane, T. J., & Staiger, D. O. (2012). Gathering feedback on teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: The Bill & Melinda Gates Foundation. Retrieved from http://www.metproject.org/downloads/MET_Gathering_Feedback_Practioner_Brief.pdf
Medina, J. (2010, February 24). Progress slow in city goal to fire bad teachers. New York Times, p. A1.
Schall, T., & Smith, G. (2000). Do baseball players regress to the mean? The American Statistician, 54, 231-235.
Sturman, M.C., Cheramie, R.A., & Cashen, L.H. (2005). The impact of job complexity and performance measurement on the temporal consistency, stability, and test-retest reliability of employee job performance ratings. Journal of Applied Psychology, 90, 269-283.
Weisberg, D., Sexton, S., Mulhern, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. New York, NY: The New Teacher Project. Retrieved from http://widgeteffect.org/downloads/TheWidgetEffect.pdf