Why “Engineering” Teacher Evaluation Systems is Best
by Nicole B. Kersting — July 11, 2014
New teacher evaluation systems are being designed, piloted and implemented in many states. The goal of these new systems is to provide more accurate and objective information on teacher performance than current systems do. Ideally, these new systems provide information not only for accountability purposes but also for helping teachers improve their performance. Achieving these goals is unlikely unless we stop treating the design of such systems as a political process and adopt an engineering perspective. Taking an engineering design approach can lead to solid designs for teacher evaluation systems, provide opportunities for improvement through monitoring and feedback, and create accountability for the design process, because the information on teacher performance these systems provide can be evaluated against the goals and intended uses that were specified.
In many states, new teacher evaluation systems are being designed, piloted, and implemented to improve the status quo. The goal of these efforts is to fix what many consider a broken system, in which virtually all teachers are deemed to be performing well. Although a considerable percentage of students do not meet grade-level standards, promotion and tenure are primarily a function of years spent in the classroom.
These new systems aspire to be more accurate and objective by including multiple measures of teacher performance, including information on student learning. Often, they reflect a weighted combination of classroom observation ratings, parent and student feedback, and value-added scores, a relatively new and controversial measure of teacher performance estimated from student scores on standardized tests. Simply put, teachers add value if their students learn more over the course of the school year than would be expected based on the students' past learning; they do not if students perform below expectations. Ideally, these new systems would provide not only more accurate information for accountability purposes, but also useful information for helping teachers to improve performance.
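To make the idea of "learning more than expected" concrete, here is a deliberately simplified sketch. It assumes that the expected score is predicted from a single prior-year score with a straight-line fit across all students, and that a teacher's value-added is the average gap between actual and expected scores for his or her students. The function names and the data are hypothetical; value-added models used in practice are considerably more elaborate than this.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of ys on xs; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def value_added(records):
    """records: list of (teacher, prior_score, current_score) per student.

    Assumption for illustration: the expected current score is a linear
    function of the prior score, fitted over all students at once.
    """
    slope, intercept = fit_line([r[1] for r in records],
                                [r[2] for r in records])
    gaps = {}
    for teacher, prior, current in records:
        expected = intercept + slope * prior
        gaps.setdefault(teacher, []).append(current - expected)
    # A teacher's value-added is the mean gap across his or her students.
    return {teacher: mean(g) for teacher, g in gaps.items()}


# Hypothetical data: both classes start from the same prior scores.
records = [
    ("A", 40, 50), ("A", 60, 70), ("A", 80, 90),
    ("B", 40, 40), ("B", 60, 60), ("B", 80, 80),
]
va = value_added(records)
print(va)
```

In this made-up example teacher A's students score above what their past scores predict and teacher B's students score below it, so A's value-added is positive and B's is negative.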
Without a doubt, the goals are worthy. If we need such systems, and not everyone agrees we do, we all want them to be fair to teachers and to accurately identify teachers who might need further training to reach their potential, those who are already doing a fine job, and those who might be called exemplary professionals. Yet it is not clear that current approaches are going to work much better unless we change our approach to designing and monitoring such complex systems, by understanding the information they provide, by evaluating their performance over time, and by taking corrective action for improvement when needed.
Instead, the process by which these systems are designed is often political and makes achieving these goals less likely. One instructive example of how political this process can be is the Chicago teacher strike in the fall of 2012. Protests were motivated in part by the city's mandate that value-added measures should account for nearly half of teachers' overall performance in the new evaluation system, which the union strongly opposed. In the end, the deal approved in Chicago reduced test contributions to about one-third of teachers' overall performance. Whatever one's opinion of the outcome, we probably all agree that turning the design of such systems into a political process is hazardous and counterproductive, and that it undermines the very goals to which these systems aspire.
Even without the obvious political pressures of Chicago, the design of new teacher evaluation systems is often tied to a political process. For example, Arizona put into law and is currently piloting a system in which classroom observations account for 50 percent of teacher performance, value-added scores account for 33 percent and teacher and student feedback for the remaining 17 percent. Other districts and states have legislated their own combination of measures and percentage weights. What is disconcerting is not that there will be different systems. In fact, we would expect and welcome that variation if we could be sure that these differences are deliberate and purposeful, aligned with each system's overall goals and not political in nature. Which measures we include and how we make them count should reflect the quality of the measures, but just as importantly, these choices directly affect what teacher performance means in each context.
Perhaps careful design is taking place in some districts or states. But instances like Chicago provide some evidence to the contrary, and in other cases it is simply hard to know. Designing teacher evaluation systems is complex and challenging because all decisions affect each other and ultimately the information the systems provide. Allowing the design process to be political trivializes this task. That is the case when decisions, like how much to make each performance measure count, are made politically, as was done in Chicago. But it is also the case when the focus is solely on these obvious decisions, because they can be easily explained and exploited for political purposes, even if that unwittingly reduces the complexity of designing such systems to a simplistic task. A lot more decision making goes into designing these systems than specifying measures and simple percentage weights to reflect their respective contributions to teachers' overall performance scores. Here are two important design issues that deserve our attention.
First, most new teacher evaluation systems seek to combine teachers' scores on different performance indicators into a single overall performance score that can be used for decision-making. Yet none of the available strategies is without challenges, and all require some assumptions and choices that affect the meaning of teachers' total performance scores.
Applying percentage weights to reflect each measure's contribution, which might appear straightforward, is only meaningful if scores on each measure are distributed roughly the same, which is rarely the case. A measure with more variation in scores will have a greater impact on teachers' overall scores than a measure with little variation, even if we would like the percentage weight for both measures to be the same. The highest scores on a measure with greater variability are much farther away from average performance than the highest scores on a measure with little variability. This is somewhat similar to the idea that the order and relative magnitude of any set of numbers do not change when we add a constant, the constant being the extreme case of a distribution with no variability at all. To deal with different distributions, we can either make them more alike or create a common metric so that combining scores from different distributions is not combining apples with oranges. Which strategies might work best and are most defensible for a given district or state depends in part on the actual distributions, which we often know little about.
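A small numerical sketch can illustrate the point about unequal variation. It assumes two hypothetical measures, an observation rating on a narrow scale and a value-added score on a wide scale, each given a nominal weight of 50 percent. The data, the function names, and the choice of z-scores as the common metric are assumptions made for illustration only; z-scores are one option among several, not a recommendation.

```python
from statistics import mean, pstdev


def weighted_composite(measures, weights):
    """Weighted sum of each teacher's scores across the named measures."""
    n = len(next(iter(measures.values())))
    return [sum(weights[m] * measures[m][i] for m in measures)
            for i in range(n)]


def standardize(scores):
    """Convert scores to z-scores (mean 0, standard deviation 1)."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma for s in scores]


def ranking(scores):
    """Teacher indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Hypothetical scores for four teachers.
observation = [3.0, 3.8, 3.2, 3.4]      # narrow spread
value_added = [40.0, 10.0, 30.0, 20.0]  # wide spread
weights = {"observation": 0.5, "value_added": 0.5}

raw = weighted_composite(
    {"observation": observation, "value_added": value_added}, weights)
common = weighted_composite(
    {"observation": standardize(observation),
     "value_added": standardize(value_added)}, weights)

print("raw ranking:         ", ranking(raw))
print("value-added ranking: ", ranking(value_added))
print("standardized ranking:", ranking(common))
```

With the raw scores, the composite ranking simply reproduces the value-added ranking despite the equal nominal weights, because the observation ratings barely vary. Once both measures are placed on a common metric, the observation ratings begin to influence the ordering.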
How we combine scores might also be affected by the level of precision we seek on teacher performance. If rank-ordering individual teachers is not important, we might divide the score distribution for each performance measure into performance groups using cut scores and then calculate each teacher's total performance as a weighted average of his or her performance group designations across the different measures.
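The cut-score strategy described above might look roughly like the following sketch. The cut scores, the weights, the four-group scheme, and the rule that a score exactly at a cut falls into the higher group are all hypothetical choices made for illustration; an actual system would need to justify each of them.

```python
from bisect import bisect_right


def performance_group(score, cut_scores):
    """Map a score to a group number: 1 is the lowest group.

    cut_scores must be sorted in ascending order. Assumption: a score
    exactly equal to a cut score is placed in the higher group.
    """
    return bisect_right(cut_scores, score) + 1


def total_performance(scores, cut_scores, weights):
    """Weighted average of a teacher's group designations."""
    total_weight = sum(weights.values())
    weighted = sum(
        weights[m] * performance_group(scores[m], cut_scores[m])
        for m in weights
    )
    return weighted / total_weight


# Hypothetical cut scores: three cuts define four groups per measure.
cut_scores = {
    "observation": [2.0, 3.0, 3.5],
    "value_added": [-5.0, 0.0, 5.0],
}
weights = {"observation": 0.5, "value_added": 0.5}

teacher = {"observation": 3.2, "value_added": 6.0}
total = total_performance(teacher, cut_scores, weights)
print(total)
```

Because every measure is reduced to the same small set of group numbers before averaging, the measures share a common scale, at the cost of discarding differences between teachers within the same group.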
A second important issue that has not received sufficient attention concerns the fact that we do not have the same information available to assess the performance of all teachers. For teachers who teach subjects or grade levels in which no standardized tests are administered, we cannot easily estimate value-added scores. That is also true for beginning teachers or teachers new to a district or state, when value-added models require information on student learning from multiple prior student cohorts. To make up for the missing data, average grade-level or, at times, average school-level value-added scores are used. This practice is not uncommon in statistics, where under certain conditions missing data might be replaced by mean values. While this practice might be defensible in the context of statistical analyses when the goal is to make inferences about the group, it seems a lot more problematic when the goal is to provide information for individual selection and high-stakes decision-making.
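The following sketch shows what substituting a grade-level mean does at the level of the individual teacher. The teacher names, the scores, and the function name are hypothetical, and the sketch assumes that missing scores are simply replaced by the mean of the observed scores in the same grade.

```python
from statistics import mean


def impute_grade_mean(value_added_by_teacher, grade_by_teacher):
    """Replace missing (None) value-added scores with the grade mean.

    Grades with no observed scores at all are left as None.
    """
    observed = {}
    for teacher, score in value_added_by_teacher.items():
        if score is not None:
            observed.setdefault(grade_by_teacher[teacher], []).append(score)
    grade_means = {grade: mean(s) for grade, s in observed.items()}

    filled = {}
    for teacher, score in value_added_by_teacher.items():
        if score is None:
            filled[teacher] = grade_means.get(grade_by_teacher[teacher])
        else:
            filled[teacher] = score
    return filled


# Hypothetical grade 5 teachers; two have no value-added estimate.
value_added_scores = {"Ames": 4.0, "Baker": -2.0, "Cole": None, "Diaz": None}
grades = {"Ames": 5, "Baker": 5, "Cole": 5, "Diaz": 5}

filled = impute_grade_mean(value_added_scores, grades)
print(filled)
```

The group average is unchanged by the substitution, which is why the practice can be reasonable for inferences about the group. But the two teachers without data receive identical scores that say nothing about their own performance.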
These challenges do not mean we cannot create better teacher evaluation systems than the ones we currently have. But they do suggest that designing high-quality systems requires sophisticated technical expertise and that politics won't get us there. We might be more successful in the long run if we treat creating such systems as an engineering design challenge. I am not advocating simplistic, mechanistic solutions. I am proposing that we take advantage of a framework for design and feedback that creates exactly the accountability and improvement new teacher evaluation systems aspire to provide. This framework has led time and time again to the greatest accomplishments.
Under this approach, designing teacher evaluation systems begins with specifying clearly the kind of information the system is supposed to provide, its intended use and the level of precision required. It also includes identifying the constraints that have to be taken into account, such as availability, reliability and validity of measures, distributions and so forth, all of which become important parameters in the design and later testing and monitoring of the system. Treating this work as a design challenge also allows us to acknowledge what every engineer knows, that there is no perfect design, that the first design is rarely the best design. But based on careful testing and improvements we almost always can create the best possible design. Doing this work right, of course, requires time, something that we always seem to be short of in education. Unless we are serious about spending the time, we should not expect real improvement. But if we do follow through, we might get a lot closer to the kind of educational reform and improvement we hope these systems can help facilitate. Now, that would be no small accomplishment in education.