The Effects of Accountability System Design on Teachers’ Use of Test Score Data
by Jennifer L. Jennings - 2012
Background/Context: Many studies have concluded that educational accountability policies increase data use, but we know little about how to design accountability systems to encourage productive versus distortive uses of test score data.
Purpose: I propose that five features of accountability systems affect how test score data are used and examine how individual and organizational characteristics interact with system features to influence teachers’ data use. First, systems apply varying amounts of pressure. Second, the locus of pressure varies across systems. Third, systems diverge in the distributional goals they set for student performance. Fourth, the characteristics of assessments vary across systems. Finally, systems differ in scope—that is, whether they incorporate multiple measures or are process- or outcome oriented.
Research Design: I review the literature on the effects of accountability systems on teachers’ data use and propose a research agenda to further our understanding of this area.
Conclusions/Recommendations: Researchers have spent much more time analyzing test score data than investigating how teachers use data in their work. Evolving accountability systems provide new opportunities for scholars to study how the interactions between accountability features, individual characteristics, and organizational contexts affect teachers’ test score data use.
“The focus on data, I would say, is the driving force [behind education] reform. No longer can we guess. We need to challenge ourselves everyday to see what the data mean.”
Secretary of Education Arne Duncan, 2010 (quoted in Prabhu, 2010)
Since the 1970s, American education policy has relied on test-based accountability policies to improve student achievement and to close achievement gaps between advantaged and disadvantaged groups. Central to the theory of action underlying accountability is the idea that newly available test score data, in conjunction with the sanctions attached to these data, change the way that schools and teachers do business. In the view of many policy makers, exemplified by the Secretary of Education quoted at the beginning of this article, data are the primary driver of education reform.
Because data cannot do anything by themselves, what’s missing from this account is an understanding of whether and how data change practice at the district, school, and classroom level and lead to educational improvement. Scholars have identified a number of potential positive effects of accountability-induced test score data use (hereafter, data use) on classroom and school practices, such as supporting diagnosis of student needs, identifying strengths and weaknesses in the curriculum, identifying content not mastered by students, motivating teachers to work harder and smarter, changing instruction to better align it with standards, encouraging teachers to obtain professional development that will improve instruction, and more effectively allocating resources within schools (Stecher, 2004). Other scholars studying the unintended consequences of accountability systems have been more skeptical about the transformative potential of data use because educators can also use data as a strategic resource to manipulate performance on accountability measures (Koretz, 2008).
Many studies have found that accountability policies increase data use (Kerr, Marsh, Ikemoto, & Barney, 2006; Marsh, Pane, & Hamilton, 2006; Massell, 2001). Yet little is known about how features of accountability systems affect how educators use data because accountability has been conceived of as one treatment in the literature. This is misleading, because accountability systems, even under the No Child Left Behind Act (NCLB), differ in important ways that have implications for schools’ and teachers’ use of data. As Lee (2008) wrote, “We need to know who is accountable for what, how, and why” (p. 625). Almost 20 years after the implementation of the first state accountability systems, we still know little about how features of accountability systems interact with organizational and individual characteristics to influence teachers’ responses to accountability.
This article draws on the existing literature to catalog the features of accountability systems that may affect data use and reviews what we know, and what we don’t know, about their effects. I limit my scope to the use of test score data, which at present are the primary data used in schools in response to accountability, and further restrict my focus to teachers’ data use.
WHAT COUNTS AS DATA USE?
I conceive of data use broadly in this review. At one end of the continuum is the most passive use of test score data: data use as a lens. Teachers may never open a spreadsheet or a score report, but they generally have a rough idea of how their school performed on state standardized tests. A near universal use of test score data among teachers, then, involves making inferences about school performance and determining whether test scores accurately reflect the school’s quality. Data use of this kind may lead educators to do nothing, to change classroom practice, to work longer hours, to inform their professional identities, or to look for a job at another school. The key point here is that whether exposure to this information leads to action, inaction, or dismissal of its relevance, educators are using these data to make sense of the schools in which they work.
Educators can also use test score data more actively as a tool for diagnosis, whereby they identify problems and develop an account of their causes. Teachers may utilize formative assessment data to identify skills on which their class or an individual student performed poorly. Data-based diagnoses range from more to less formal and systematic and may reflect perceptions as much as accurate inferences about student or class needs.
Data can also serve as a compass pointing toward particular instructional and organizational changes, or toward maintaining the status quo. Based on the inferences drawn from test score data, teachers may change their instructional approaches or the content they cover to improve student learning. Teachers may also use data in more general ways to allocate resources. They may alter the way they allocate attention to students, or change the way they spend classroom time. Data-induced instructional change is not necessarily instructional improvement, however, because all changes can make things worse or better.
Another type of data use involves monitoring.[1] School leaders and teachers set goals for student performance and gauge progress by using formative or summative standardized tests. Teachers often determine whether a curriculum is working, or whether a student should change ability groups or receive additional attention, by assessing progress on tests.
Finally, data can also be used as a legitimizer. Every day, teachers must provide accounts to themselves about why they chose Option A over Option B in their classrooms. Such decisions are never clear-cut and sometimes involve tradeoffs between content (“Should I focus more on math or science?”) or students (“Which students should I tutor after school?”). Data provide an objective rationale for making such decisions. Because of the cultural standing of data, they also provide a legitimate way to account for one’s actions to other parties, such as a colleague or a principal.
Together, these types of data use capture how teachers view their schools, students, and themselves (lens); how they determine what’s working, what’s going wrong, and why (diagnosis); what they should do in response (compass); how they establish whether it worked (monitoring); and how they justify decisions to themselves or to others (legitimizer).
Because of the positive valence of data use in the practitioner literature (e.g., Boudet, City, & Murnane, 2005) and in the culture at large, it is worth further refining these categories to distinguish between productive and distortive uses of data. At the core of data use is the process of making inferences from student test scores regarding students’ and schools’ performance, responding (or choosing not to respond) to these inferences, monitoring the effectiveness of this response, and accounting for it to oneself and others. I thus define productive data use as practices that improve student learning and do not invalidate the inferences about student- and school-level performance that policy makers, educators, and parents hope to make. To the extent that teachers’ use of test score data to make instructional and organizational decisions produces score gains that do not generalize to other measures of learning (for example, other measures of achievement or other measures of educational attainment) and thus leads us to make invalid inferences about which schools, teachers, and programs are effective, I will characterize this as distortive data use.
Two concrete examples are useful in making this distinction clear. In an extreme case such as cheating, teachers use formative assessment data to determine which students are lowest performing and thus whose answer sheets should be altered. As a result, scores increase substantially, and we infer that school quality has improved when it has not. On the other hand, consider a teacher who uses formative assessment data to pinpoint her students’ weaknesses in the area of mathematics and finds that they perform much worse on statistics and probability problems than on geometry problems. She searches for more effective methods for teaching this content (methods focused on the most important material in this strand, not the details of the specific test), and students’ performance on multiple external assessments improves in this strand of mathematics. We infer that students made gains in statistics and probability, which is, in this case, a valid inference.
THE EFFECTS OF FEATURES OF ACCOUNTABILITY SYSTEMS ON TEST SCORE DATA USE
I focus my review on the impact of five features of accountability systems on teachers’ data use and discuss these features in terms of productive and distortive forms of data use. I chose these features based on my review of the literature because they represent, in my view, the most important dimensions on which accountability systems differ.
First, accountability systems apply varying amounts of pressure. Systems differ in the required pace of improvement and vary on a continuum from supportive to punitive pressure. Second, the locus of pressure varies across accountability systems. Systems may hold districts, schools, or students accountable for performance, and recent policies hold individual teachers accountable. Third, accountability systems vary in the distributional goals they set for student performance. Prioritizing growth versus proficiency may produce different responses, as will holding schools accountable for racial and socioeconomic subgroups. Fourth, features of assessments vary across systems. To the extent that teachers feel that using test data will improve scores on a given test, teachers may be more likely to use them, though it is not clear whether they will use them in productive or distortive ways. Fifth, the scope of the accountability system may affect data use. This includes whether an accountability system incorporates multiple measures, or is process- or outcome oriented. An accountability system that rewards teachers for short-term test score increases will likely produce different uses of data than one that measures persistent effects on test scores.
In the preceding description, accountability features are treated as universally understood and processed by all teachers. The implication is that adopting a design feature will lead to a predicted set of responses. But there are good reasons to believe that teachers may understand and interpret the implications of these features differently. Coburn (2005) studied how three reading teachers responded to changes in the institutional environment. She found that teachers’ sensemaking, framed by their prior beliefs and practices, influenced their responses to messages from the institutional environment. Diamond (2007), studying the implementation of high-stakes testing policy in Chicago, found that teachers’ understanding of and responses to this policy were mediated by their interactions with colleagues. A number of other studies confirm that organizational actors may not perceive and react to the organizational environment similarly (Coburn, 2001, 2006; Spillane, Reiser, & Reimer, 2002; Weick, 1995). These studies largely found that organizational actors both within and between organizations construct the demands of, and appropriate responses to, accountability systems differently. As a result, individuals and organizations respond in varied ways (and potentially use data in varied ways) that are not simply a function of what policy makers perceive as teachers’ incentives.
The object of interest in this review, then, is not the average effect of the five accountability system features on data use. Rather, I consider how individual and organizational characteristics interact with different features to produce the responses we observe.
VARYING AMOUNTS OF PRESSURE
Accountability systems vary in the amount and type of pressure exerted on schools. The empirical problem for studying this issue, however, is that accountability pressure is in the eye of the beholder; that is, it does not exist objectively in the environment. What is perceived as significant pressure in one school may be simply ignored in another. Teachers use data as a lens for understanding environmental demands, but the meanings they attach to these data vary across individuals and schools. Although the studies that I discuss next treat pressure as a quantity known and agreed on by all teachers, I emphasize that understanding the effects of pressure on data use first requires a more complex understanding of how teachers use data to establish that they face accountability pressure.
Accountability pressure may have heterogeneous effects on data use depending on schools’ proximity to accountability targets. As noted, teachers may not have a uniform understanding of their proximity to targets, but most studies simulate this pressure by predicting schools’ odds of missing accountability targets. Only one study has examined the effect of schools’ proximity to accountability targets (here, adequate yearly progress [AYP]) on instructional responses. Combining school-level data on test performance and survey data from the RAND study of the implementation of NCLB in three states (Pennsylvania, Georgia, and California), Reback, Rockoff, and Schwartz (2010) examined the effects of accountability pressure on a series of plausibly data-driven instructional behaviors. Schools that were furthest from AYP targets were substantially more likely to focus on students close to proficiency relative to those that were very likely to make AYP (53% of teachers vs. 26%), to focus on topics emphasized on the state test (84% vs. 69%), and to “look for particular styles and formats of problems in the state test and emphasize them in [their] instruction” (100% vs. 67%; Reback et al., 2010).
This study does not illuminate the process of data use in these schools. These activities could have resulted from data use as a lens and stimulus rather than examination of test data to diagnose student performance levels and content weaknesses. Nonetheless, the study provides some support for the hypothesis that the amount of accountability pressure affects how, and how much, data are used; data appear to be used to target both students and item content and styles when schools face more accountability pressure. It also demonstrates that teachers facing objectively low risks of missing accountability targets still respond strongly to accountability systems. In these low-risk schools, more than two thirds of teachers in this study appeared to use data to focus on particular content and item formats.
Hamilton et al.’s (2007) study of the implementation of NCLB in three states (Pennsylvania, Georgia, and California) also considered how variation in state accountability systems may affect instructional responses, and data use in particular. Hamilton et al. found that districts and schools responded in similar ways across the three states. Among the most common improvement strategies deployed were aligning curricula and instruction with standards, providing extra support for low-performing students, encouraging educators to use test results for planning instruction, adopting benchmark assessments, and engaging in test preparation activities. Though this review focuses on teachers, some of the most useful information on data use in this study came from principal surveys. More than 90% of principals in all three states reported that they were using student test data to improve instruction, though a higher fraction of principals in Georgia found data useful than in the other two states. Georgia also appeared to be an outlier in data use in other ways. For example, 89% of districts in Georgia required interim assessments in elementary school math, whereas only 44% did in California and 38% did in Pennsylvania. Other studies suggest that these findings could be a function of variation in pressure resulting from where accountability targets were set and how quickly schools were expected to increase scores. Pedulla et al.’s (2003) study provides some evidence on this issue; they found that higher stakes increase data use, and another study has found that the amount of pressure a district is facing is associated with teachers’ emphasis of tested content and skills (Center on Education Policy, 2007).
Beyond these two studies, the management literature on organizational learning and behavior provides further insight into how the amount of accountability pressure might interact with organizational characteristics to affect data use. Current federal accountability targets of 100% proficiency by 2014[2] can be understood as a stretch goal, which Sitkin, Miller, See, Lawless, and Carton (in press) described as having two features: extreme difficulty (“an extremely high level of difficulty that renders the goal seemingly impossible given current situational characteristics and resources”) and extreme novelty (“there are no known paths for achieving the goal given current capabilities (i.e., current practices, skills, and knowledge)”; p. 9). Ordonez, Schweitzer, Galinsky, and Bazerman (2009) characterized this problem as “goals gone wild,” writing that they “can narrow focus, motivate risk-taking, lure people into unethical behavior, inhibit learning, increase competition, and decrease intrinsic motivation” (p. 17). Schweitzer, Ordonez, and Douma (2004) empirically tested this idea in a laboratory experiment and found that subjects with unmet goals engaged in unethical behavior at a higher rate than those told to do their best and that this effect was stronger when subjects just missed their goals. Although we have little empirical evidence on how these findings might generalize to the level of organizations, Sitkin et al. proposed that high-performing organizations are the most likely to benefit from stretch goals, whereas these goals are likely to have disruptive, suboptimal effects in low-performing organizations.
To the extent that the management literature described earlier applies to schools, these findings suggest that high- and low-performing schools will respond differently to accountability pressure. We can hypothesize that the lowest performing schools may be more likely to pursue distortive uses of data when faced with stretch goals. These schools may attempt to quickly increase test scores using some of the approaches documented in the studies reviewed here, such as targeting high-return content or students. Higher performing schools, which often have higher organizational capacity, may respond in more productive ways.
There is some evidence for this argument in the educational literature on data use; two studies have found that schools with higher organizational capacity are more likely to use data productively (Marsh et al., 2006; Supovitz & Klein, 2003). Marsh et al. synthesized the results of four RAND studies that incorporated study of teachers’ data use and found that staff preparation to analyze data, the availability of support to help make sense of data, and organizational norms of openness and collaboration facilitated data use. Supovitz and Klein (2003), in their study of America’s Choice schools’ use of data, found that a central barrier to data use was the lack of technical ability to manipulate data to answer questions about student performance.
LOCUS OF PRESSURE
Accountability systems hold some combination of students, teachers, schools, and districts accountable for improving performance. In cities like New York and Chicago, students must attain a certain level of performance on tests to be promoted to the next grade, and 26 states currently require exit exams for students to graduate (Urbina, 2010). Merit pay proposals in districts such as Houston, Charlotte-Mecklenburg, and Minneapolis tie some component of a teacher’s compensation to his or her students’ performance on standardized tests (Papay & Moore Johnson, 2009). More recently, many federal Race to the Top winners committed to linking up to 50% of teacher evaluations to student test scores (New Teacher Project, 2010), and in September 2010, the Los Angeles Times published individual value-added data for elementary teachers (Buddin, 2010). Principals in many cities can earn substantial bonuses or lose their jobs based on test scores, and NCLB holds both schools and districts accountable for improving test scores.
Does the locus of pressure affect how much, and how, data are used? We need to understand whether within-school variation in data use increases or decreases when accountability moves to the level of the teacher. An intriguing finding in the current literature is that most of the variation in data use currently exists within rather than between schools (Marsh et al., 2006). This means that individual user characteristics need to be a focus of study along with organizational characteristics. Because current studies focus on outcomes rather than process, we can only use existing results to generate hypotheses about between-teacher variation in data use.
Recent studies by Papay (2010) and Corcoran, Jennings, and Beveridge (2010) compared teacher effects on high- and low-stakes tests under the assumption that the stakes attached to tests matter for teacher responses. These studies suggest that teachers who appear effective on high-stakes tests are not necessarily effective on low-stakes tests. Furthermore, teacher effects on high-stakes tests decay more quickly than do teacher effects on low-stakes tests, perhaps because teachers face different pressures to increase scores on high-stakes tests that lead them to use test data in different ways. Corcoran et al. (2010) found that there are particularly large gaps in measured teacher effectiveness on the two tests for inexperienced teachers. These teachers may be using, or experiencing the stimulus effects of, high-stakes test data differently than their more experienced peers. This may be because these teachers face pretenure pressure to increase test scores; the Reback et al. (2010) study found that in schools facing accountability pressure, untenured teachers work substantially more hours per week.
These findings lead to two specific lines of inquiry. The first task is to explain whether data use plays a role in making teachers effective on a high-stakes test but not a low-stakes test. For example, do some teachers use test score data from high-stakes tests as the dominant lens in making sense of school performance, and thus focus their time and attention there? Do teachers with substantially greater effectiveness on high-stakes than low-stakes tests use item-level data to diagnose student needs and inform their instruction? The second task is to explain whether there are systematic teacher characteristics that are associated with different levels and types of data use. Do new teachers understand the line between productive and distortive data use differently than more experienced teachers? Does preservice training now focus more on data use, which would produce different types of usage among new teachers? Do untenured teachers experience accountability pressure in a more acute way, and does this make them more attentive to test score data? Answering each of these questions requires an understanding of how teachers internalize the locus of pressure, what they do as a result, and how these behaviors vary across teachers both between and within organizations.
The locus of accountability may also affect how data are used to target resources, such as time and attention in and out of class, to students. Targeting reflects the use of data as diagnosis and potentially as legitimizer. That is, test scores are used to diagnose students in need of additional resources, and the imperative to improve test scores may legitimize the use of these data to target additional time and attention to these students. Teachers (or their administrators) may also use administrative data to determine which students do not require targeted resources in the short term. Corcoran (2010) found that in the Houston Independent School District, a large fraction of students did not have two consecutive years of test data available. This creates incentives for teachers to use data to identify students who “don’t count” if teachers are held individually accountable for student value-added, so they can focus their attention elsewhere. Even if accountability is at the level of the school, Jennings and Crosta (2010) found that 7% of students are not continuously enrolled in the average school in Texas and thus do not contribute to accountability indicators. This may be consequential for students. For example, Booher-Jennings (2005) found that teachers in the Texas school that she studied focused less attention on students who were not counted in the accountability scheme because they were not continuously enrolled. These actions were legitimated as data-driven decisions. In this case, data systems that were intended to improve the outcomes of all students were instead used to determine which students would not affect the school’s scores.
Because policies that hold individual teachers accountable for scores are new and have not been studied, the performance management literature on individual incentives is helpful in understanding how teachers may use data differently in these cases. Much of this literature is based on formal models of human behavior rather than empirical data. These studies suggest that high-powered individual incentives focused on a limited set of easily measurable goals are likely to distort behavior and lead to undesirable uses of data (Baker, Gibbons, & Murphy, 1994; Campbell, 1979; Holmstrom & Milgrom, 1991). If this precept applies to schools, we can predict that individual accountability focused largely on test scores will encourage distortive uses of data. But we may also expect that these responses will be mediated by the organizational contexts in which teachers work (Coburn, 2001). In some places, this pressure may be experienced acutely, whereas in others, principals and colleagues may act as a buffer. As I will discuss later, subjective performance measures have been proposed as a way to offset these potentially distortive uses of data.
DISTRIBUTIONAL GOALS OF THE ACCOUNTABILITY SYSTEM: PROFICIENCY, GROWTH, AND EQUITY
The goals of an accountability system affect how student performance is measured, which in turn may affect how data are used. The three major models currently in use are a status (i.e., proficiency) model, a growth model, or some combination of the two. These models create different incentives for using data as both diagnosis and compass to target resources to students. In the case of status models, by which I mean models that focus on students’ proficiency, teachers have incentives to move as many students over the cut score as possible but need not attend to the average growth in their class. In a growth model, teachers have incentives to focus on those students who they believe have the greatest propensity to exhibit growth. Because state tests generally have strong ceiling effects that limit the measurable growth of high-performing students (Koedel & Betts, 2009), teachers may focus on lower performing students in a growth system.
No studies to date have investigated whether status models have different effects on data use than growth models. However, we can generate hypotheses about these effects by considering a growing body of literature that has assessed how accountability systems affect student achievement across the test score distribution. A prime suspect in producing uneven distributional effects is reliance of current accountability systems on proficiency rates, a threshold measure of achievement. Measuring achievement this way can lead teachers to manipulate systems of measurement to create the appearance of improvement. For example, teachers can focus on “bubble students,” those close to the proficiency cut score (Booher-Jennings, 2005; Hamilton et al., 2007; Neal & Schanzenbach, 2007; Reback, 2008). Test score data appear to play a central role in making these targeting choices and may also be used to legitimize these choices as “data-driven decision making” (Booher-Jennings, 2005). Because sanctions are a function of passing rates, slightly increasing the scores of a small number of students can positively impact the school’s accountability rating.
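The threshold logic described above can be made concrete with a small numerical sketch. The scores, the cut score of 300, and the function name below are hypothetical and are not drawn from any of the studies cited; the sketch only illustrates how moving a few students just past a cut score can raise a proficiency rate sharply while the mean scale score barely changes.

```python
from statistics import mean

CUT_SCORE = 300  # hypothetical proficiency cut score (an assumption)


def proficiency_rate(scores, cut=CUT_SCORE):
    """Share of students scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)


# Hypothetical class of ten students; three sit just below the cut.
before = [220, 240, 260, 295, 296, 298, 310, 330, 350, 380]

# Only the three "bubble" students gain, by five points each.
after = [220, 240, 260, 300, 301, 303, 310, 330, 350, 380]

mean_gain = mean(after) - mean(before)

print(f"Proficiency rate: {proficiency_rate(before):.0%} -> {proficiency_rate(after):.0%}")
print(f"Mean scale score: {mean(before):.1f} -> {mean(after):.1f}")
```

In this made-up class, the proficiency rate rises from 40% to 70%, whereas the mean scale score rises by only 1.5 points (297.9 to 299.4).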
A large body of evidence addresses the issue of distributional effects and provides insight into the extent to which teachers are using data to target resources to students. The literature is decidedly mixed. One study in Chicago found negative effects of accountability pressure on the lowest performing students (Neal & Schanzenbach, 2007), whereas another in Texas found larger gains for marginal students and positive effects for low-performing students as well (Reback, 2008). Four studies identified positive effects on low-performing students (Dee & Jacob, 2009; Jacob, 2005; Ladd & Lauen, 2010; Springer, 2007), whereas four found negative effects on high-performing students (Dee & Jacob, 2009; Krieg, 2008; Ladd & Lauen, 2010; Reback, 2008). Because these studies intended to establish effects at the level of the population, they did not directly attend to how teachers varied in their use of data to target students and how organizational context may have mediated these responses. I return to these issues in my proposed research agenda.
Only one study to date has compared the effects of status and growth models on achievement. Analyzing data from North Carolina, which has a low proficiency bar, Ladd and Lauen (2010) found that low-achieving students made more progress under a status-based accountability system. In contrast, higher achieving students made more progress under a growth-based system. This suggests that teachers’ allocation of resources is responsive to the goals of the measurement system; teachers targeted students below the proficiency bar under a status system, and those expected to make larger gains (higher performing students) under a growth system. As more states implement growth models, researchers will have additional opportunities to address this question and determine what role the difficulty of the proficiency cut plays in affecting how teachers allocate their attention.
A second feature that may be important for how data are used to target students is whether the system requires subgroup accountability, and what cutoffs are established for separately counting a subgroup. States vary widely in how they set their subgroup cutoffs. In Georgia and Pennsylvania, 40 students count as a subgroup, whereas in California, schools must enroll 100 students, or 50 students if that constitutes 15% of school enrollment (Hamilton et al., 2007). Only one study, by Lauen and Gaddis (2010), has addressed the impact of NCLB’s subgroup requirements. Though they found weak and inconsistent effects of subgroup accountability, Lauen and Gaddis found large effects of subgroup accountability on low-achieving students’ test scores in reported subgroups; these were largest for Hispanic students. These findings suggest that we need to know more about how data are used for targeting in schools that are separately accountable for subgroups compared with similar schools that are not.
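The minimum-size rules just described can be written as a simple decision rule. The sketch below encodes only the thresholds reported above from Hamilton et al. (2007). The function name and the reading of the California rule (at least 100 students, or at least 50 students who make up at least 15% of enrollment) are assumptions made for illustration, and actual state regulations contain details not captured here.

```python
def counts_as_subgroup(state, subgroup_n, school_enrollment):
    """Return True if a subgroup is large enough to be counted separately.

    Thresholds follow the figures reported in the text; the function name
    and the handling of boundary cases are illustrative assumptions.
    """
    if state in ("Georgia", "Pennsylvania"):
        return subgroup_n >= 40
    if state == "California":
        # 100 students, or 50 students if they make up 15% of enrollment.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        share_ok = subgroup_n * 100 >= 15 * school_enrollment
        return subgroup_n >= 100 or (subgroup_n >= 50 and share_ok)
    raise ValueError(f"No subgroup rule encoded for state: {state}")


# The same 45-student subgroup is counted in Georgia but not in California.
print(counts_as_subgroup("Georgia", 45, 400))     # True
print(counts_as_subgroup("California", 45, 400))  # False
# 60 students are exactly 15% of a 400-student school, so they count.
print(counts_as_subgroup("California", 60, 400))  # True
# 60 students are only 12% of a 500-student school, so they do not.
print(counts_as_subgroup("California", 60, 500))  # False
```

Under these assumptions, the same group of students can count toward a school's accountability rating in one state and not in another, which is one reason subgroup cutoffs may matter for how data are used to target students.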
To summarize, most of our knowledge about the effects of distributional goals of accountability systems comes from studies that examine test scores, rather than data use, as the focus of study. These studies raise many questions about the role data use played in producing these outcomes. First, data use has made targeting more sophisticated, real-time, and accurate, but we know little about how targeting varies across teachers and schools. Second, we need to know whether teachers, administrators, or both are driving targeting behavior. For example, whereas 77%90% of elementary school principals reported encouraging teachers to focus their efforts on students close to meeting the standards, only 29%37% of teachers reported doing so (Hamilton et al., 2007). Third, we need to know more about the uses of summative data for monitoring the effectiveness of targeting processes. How do teachers interpret students increases in proficiency when they are applying targeting practices? Depending on the inference teachers want to make, targeting can be perceived as a productive or distortive use of data. Targeting students below passing creates the illusion of substantial progress on proficiency, making it distortive if the object of interest is change in student learning. On the other hand, the inferences made based on test scores would not be as distortive if teachers examined average student scale scores. At present, we do not know to what extent teachers draw these distinctions in schools. A final area of interest, which will be discussed in more detail in the following section, is the extent to which targeting increases students skills generally or is tailored to predictable test items that will push students over the cut score.
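The contrast between proficiency rates and average scale scores can be made concrete with a small numerical sketch. The Python below is illustrative only: the cut score, the ten student scores, and the size of the gains are invented for the example and are not drawn from any study cited here. It assumes that targeting raises only the three students just below the cut by a few scale points.

```python
from statistics import mean

CUT_SCORE = 650  # hypothetical proficiency cut, not from any real test


def proficiency_rate(scores, cut=CUT_SCORE):
    """Share of students scoring at or above the cut score."""
    return sum(s >= cut for s in scores) / len(scores)


# Hypothetical class of ten students before targeting.
before = [600, 620, 640, 645, 648, 660, 680, 700, 720, 740]

# Assumed outcome of targeting: only the three students just below the
# cut gain a few points; every other score is unchanged.
after = [600, 620, 652, 651, 650, 660, 680, 700, 720, 740]

rate_change = proficiency_rate(after) - proficiency_rate(before)
mean_change = mean(after) - mean(before)

print(f"Change in proficiency rate: {rate_change:+.0%}")
print(f"Change in mean scale score: {mean_change:+.1f} points")
```

Under these invented numbers the proficiency rate rises sharply while the mean scale score barely moves, which is the sense in which an inference about learning drawn from the proficiency rate alone could be distorted.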
FEATURES OF ASSESSMENTS
Features of assessments may affect whether teachers use data in productive or distortive ways. Here, I focus on three attributes of assessments (the framing of standards, the sampling of standards, and the cognitive demand of the skills represented on state tests) because they are most relevant to the potentially distortive uses of data. Many more features of assessments, such as the extent to which items are coachable, should also be explored. The specificity of standards and their content varies widely across states (Finn, Petrilli, & Julian, 2006). Framing standards too broadly leads teachers to use test data to focus on tests rather than on the standards themselves (Stecher, Chun, Barron, & Ross, 2000). In other words, if a standard requires that students understand multiple instantiations of a skill but the test always samples the same one, teachers will likely ignore the unsampled parts of the standard. On the other hand, overly narrow framing of standards also enables test preparation that increases overall scores without increasing learning. For example, in some cases, state standards are framed so narrowly that they describe a test question rather than a set of skills (Jennings & Bearak, 2010). The implication for data use is that the framing of standards may affect how much teachers use item data disaggregated at the standard level. Ultimately, this turns on teachers' assessments of how much information standard-level data provide.
By the same token, the sampling of standards (what is actually covered on the tests) may also affect how teachers use data to inform instruction. Teachers themselves have reported significant misalignment between the tests and the standards, such that content was included that was not covered in the curriculum, or important content was omitted (Hamilton et al., 2007). Omissions are not randomly drawn from the standards; for example, Rothman, Slattery, Vranek, and Resnick (2002) found that tests were more likely to cover standards that were less cognitively demanding. Studies of standard coverage on the New York, Massachusetts, and Texas tests confirm these impressions and have found that state tests do not cover the full state standards and are predictable across years in ways that facilitate test-specific instruction, though there is substantial variation across subjects and states (Jennings & Bearak, 2010). At one end of the continuum is New York; in mathematics, in no grade was more than 55% of the state standards tested in 2009. By contrast, the Texas math tests covered almost every standard in 2009, and the Massachusetts exams covered roughly four fifths of the standards in that year. Jennings and Bearak (2010) analyzed test-item level data and found that students performed better on standards that predictably accounted for a higher fraction of test points. This suggests that teachers had targeted their attention to standards most likely to increase test scores. Survey evidence largely confirms these findings. In the RAND three-state study, teachers reported that there were many standards to be tested, so teachers had identified highly assessed standards on which to focus their attention (Hamilton & Stecher, 2006).
A complementary group of studies on teaching to the format illustrates how features of the assessment can affect how data are used to influence practice. Studies by Borko and Elliott (1999), Darling-Hammond and Wise (1985), McNeil (2000), Shepard (1988), Shepard and Dougherty (1991), Smith and Rottenberg (1991), Pedulla et al. (2003), and Wolf and McIver (1999) all demonstrate how teachers focus their instruction not only on the content of the test, but also its format, by presenting material in formats as they will appear on the test and designing tasks to mirror the content of the tests. To the extent that students learn how to correctly answer questions when they are presented in a specific format but struggle with the same skills when they are presented in a different format, this use of test data to inform instruction is distortive because it inflates scores.
Taken together, these studies suggest that different assessments create incentives for teachers to use data in different ways. They also suggest that teachers are using relatively detailed data about student performance on state exams in their teaching but provide little insight into the types of teachers or schools where these practices are most likely to be prevalent. School organizational characteristics, such as the availability of data systems, instructional support staff, and professional development opportunities, may affect how features of assessments are distilled for teachers. Teachers' own beliefs about whether these uses constitute good teaching may also matter. Some teachers view focusing on frequently tested standards as best practice, whereas others see this as teaching to the test. Another important area for inquiry is whether this type of data use is arising from the bottom up or the top down. In many cases, teachers are not making these decisions alone. Rather, school and district leaders may mandate uses of data and changes in practice that will increase test scores, and teachers may unevenly respond to these demands (Bulkley, Fairman, Martinez, & Hicks, 2004; Hannaway, 2007; Koretz, Barron, Mitchell, & Stecher, 1996a; Koretz, Mitchell, Barron, & Keith, 1996b; Ladd & Zelli, 2002).
Many have hypothesized that accountability systems based on multiple process and outcome measures may encourage more productive uses of data than those based only on test scores. In a policy address, Ladd (2007) proposed the use of teams of inspectors that would produce qualitative evaluations of school quality so that accountability systems would more proactively promote good practice. Hamilton et al. (2007) suggested creating a broader set of indicators to provide more complete information to the public about how schools are performing and to lessen the unwanted consequences of test-based accountability. The hope is that by measuring multiple outcomes and taking account of the processes through which outcomes are produced, educators will have weaker incentives to use distortive means.
A large body of literature in economics and management has considered how multiple measures may influence distortive responses to incentives, particularly when firms have multiple goals. In their seminal article, Holmstrom and Milgrom (1991) outlined two central problems in this area: Organizations have multiple goals that are not equally easy to measure, and success in one goal area may not lead to improved performance in other goal areas. They showed that when organizational effectiveness in achieving goals is weakly correlated across goal domains and information across these domains is asymmetric, workers will focus their attention on easily measured goals to the exclusion of others. They recommended minimizing strong objective performance incentives in these cases.
Because comprehensive multiple measures systems have not been implemented in the United States, it is currently difficult to study their effects on data use. Existing studies have examined how the use of multiple measures affects the validity of inferences we can make about school quality (Chester, 2005) or have described the features of these systems (Brown, Wohlstetter, & Liu, 2008), but none has evaluated their effects on data use. The European and U.K. experience with inspectorates provides little clear guidance on this issue; scholars continue to debate whether these systems have improved performance or simply led to gaming of inspections (Ehren & Visscher, 2006).
A PROPOSED RESEARCH AGENDA
To improve accountability system design and advance our scholarly knowledge of data use, researchers should study both productive and distortive uses of data. Next, I propose a series of studies that would help build our understanding of how accountability system features affect teachers' data use and what factors produce variability in teachers' responses.
AMOUNT OF PRESSURE
There are two specific features of regulatory accountability systems that should be explored to understand their impacts on data use: where the level of expected proficiency or gain is set, and how quickly schools are expected to improve. There is wide variation in the cut scores for proficiency that states set under NCLB as well as substantial variation in the required pace of improvement under NCLB. As Koretz and Hamilton (2006) have written, establishing the right amount of required improvement and its effects has been a recurrent problem in performance management systems and is generally aspirational rather than evidence based. Some states used linear trends to reach 100% proficiency by 2013–2014, whereas others allowed slower progress early on but required the pace of improvement to increase as 2014 approached (Porter, Linn, & Trimble, 2005). Such differences across states can be exploited to understand the impact of the amount of pressure on data use.
Also worth exploring is how schools react to cross-cutting accountability pressures. Local, state, and federal accountability systems are currently layered on top of each other, yet we know little about which sanctions drive educators' behavior and why. For example, California maintained its pre-NCLB state growth system and layered an NCLB-compliant status system on top. In New York City, the A-F Progress Report system was layered on top of the state and federal accountability system. In both cases, growth and status systems frequently produced different assessments of school quality. We need to know more about how educators make sense of the competing priorities of these systems and determine how to respond to their multiple demands. For example, we might expect that judgments of low performance from both systems would intensify accountability pressure and increase teachers' use of test score data, whereas conflicting judgments might lead teachers to draw on test score data to legitimize one designation versus the other.
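The difference between a linear and a back-loaded improvement schedule can be sketched numerically. The Python below is a hypothetical illustration, not a reconstruction of any state's actual targets: the 40% baseline, the 12-year horizon, and the quadratic shape of the back-loaded path are all assumptions chosen only to show how the required annual gain differs between the two designs.

```python
def linear_targets(baseline, n_years, goal=100.0):
    """Annual proficiency targets that rise by the same amount each year."""
    step = (goal - baseline) / n_years
    return [baseline + step * t for t in range(n_years + 1)]


def back_loaded_targets(baseline, n_years, goal=100.0, power=2):
    """Targets that rise slowly at first and faster near the deadline.

    The power-curve shape is an assumption made for illustration; real
    state schedules took other forms.
    """
    span = goal - baseline
    return [baseline + span * (t / n_years) ** power for t in range(n_years + 1)]


def annual_gains(targets):
    """Required year-over-year increase implied by a target schedule."""
    return [later - earlier for earlier, later in zip(targets, targets[1:])]


BASELINE = 40.0  # hypothetical starting percent proficient
N_YEARS = 12     # assumed number of years until the 100% deadline

linear = linear_targets(BASELINE, N_YEARS)
back_loaded = back_loaded_targets(BASELINE, N_YEARS)

for year, (lin, back) in enumerate(zip(linear, back_loaded)):
    print(f"year {year:2d}: linear {lin:6.1f}%   back-loaded {back:6.1f}%")
```

Both schedules end at the same goal, but the back-loaded one demands very little in early years and much larger annual gains near the deadline, so the pressure a school faces in a given year depends on which design its state adopted.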
LOCUS OF PRESSURE
Rapidly evolving teacher evaluation systems and the staggered implementation of these systems across time and space provide an opportunity to understand how data use differs when both teachers and schools are held accountable.
We need to know more about how educators understand value-added data that are becoming widely used and distributed and how they respond in their classrooms. For example, we need to know more about how teachers interpret the meaning of these measures as well as their validity. We need to know to what extent teachers perceive that these measures accurately capture their impact and how this perceived legitimacy affects the way they put these data into use. Of particular interest is how these interpretations vary by teacher and principal characteristics such as experience, grade level, subject taught, teacher preparation (i.e., traditional vs. alternative), demographics, and tenure. Researchers should also investigate the interaction between individual and organizational characteristics, such as the school's level of performance, racial and socioeconomic composition, resources, and climate.
Once we have a better understanding of how teachers interpret these data, scholars should study whether and how teachers' data use changes in response to measures of their value-added. For example, we need to understand whether teachers pursue productive or distortive forms of data use as they try to improve their value-added scores and how these reactions are mediated by individual and organizational characteristics.
GOALS OF THE ACCOUNTABILITY SYSTEM: STATUS, GROWTH, AND EQUITY
Although scholars have proposed a variety of theories to predict how teachers will respond to status versus growth-oriented systems, few of these theories have been tested empirically. What seems clear is that status and growth-based systems create different incentives for using data to inform instruction. Though there is no extant research on this area, one might hypothesize that in a status-based system, a teacher might use formative assessment data to reteach skill deficits of students below the threshold. Under a growth system, the same teacher might use these data to maximize classroom growth, which could occur by focusing on the skills that the majority of the class missed. Alternatively, a more sophisticated approach could be used whereby teachers determine which students have the greatest propensity to exhibit growth and focus on the skill deficits of these students. As accountability systems evolve, researchers should study how the incentives implied by status and growth systems affect data use practices.
All the ideas posed in the preceding paragraphs suggest that teachers have a clear understanding of how status versus growth systems work. Because growth systems are based on sophisticated statistical models, there are good reasons to suspect that is untrue. Researchers should also study how teachers vary in their understanding of these systems, how these understandings vary between and within organizations, and how they shape teachers' use of data.
FEATURES OF ASSESSMENTS
As described in the literature review, assessments offer different opportunities to use data to improve test scores. This area will become particularly relevant with the implementation of the Common Core standards and assessments, which focus on fewer concepts but promote deeper understanding of them. The switch to a new set of standards will provide a unique opportunity for researchers to observe the transition in data use that occurs when the features of standards and assessments change.
We also need to know how teachers vary in the extent to which they use distortive approaches to increase scores, which can be enabled by assessments that build in predictable features. Existing research suggests that there is substantial variation across states in such responses, which suggests that features of assessments may matter. More investigation at the organizational level is needed. For example, there is substantial variation in the fraction of teachers reporting that they emphasize certain assessment styles and formats of problems in their classrooms. One hypothesis is that features of assessments make these behaviors a higher return strategy in some places than others. Another hypothesis is that variation in score reporting produces variation in teachers' data use. Researchers could contrast teachers' responses in states and districts that disaggregate subscores in great detail, relative to those that provide only overall scores.3 Once we better understand the features of assessments and assessment reporting that contribute to these differences, researchers can design assessments that promote desired uses of data and minimize undesired uses. Using existing student-by-item-level administrative data, it is now possible to model teachers' responses to these incentives, but what is missing from such studies is an understanding of the data-related behaviors that produced them. Future studies using both survey and qualitative approaches to study data use can help to unpack these findings.
Many have hypothesized that accountability systems based on multiple measures, and in particular those that are both process- and outcome-oriented, may produce more productive uses of data. Future studies should establish how teachers interpret multiple measures systems. These systems will put different weights on different types of measures, requiring teachers to decide how to allocate their time between meeting them. For example, in New York City, 85% of schools' letter grades are based on student test scores, whereas 15% are based on student, teacher, and parent surveys and attendance records. Likewise, new systems of teacher evaluation incorporate both test scores (in some cases, up to 51% of the evaluation) and other evaluations. We need to know how teachers understand these systems in practice and how the weights put on different types of measures influence their understanding.
The rise of the educational accountability movement has created a flurry of enthusiasm for the use of data to transform practice and generated reams of test score data that teachers now work with every day. Researchers have spent much more time analyzing these test score data themselves than trying to understand how teachers use data in their work. What this literature review makes clear is just how scant our knowledge is about what teachers are doing with these data on a day-to-day basis. Given the widespread policy interest in redesigning accountability systems to minimize the undesired consequences of these policies, understanding how accountability features influence teachers' data use is an important first step in that enterprise.
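A short calculation shows why the weights in a multiple measures system may matter for how educators allocate effort. The Python below is a sketch under stated assumptions: it uses the 85%/15% split described above for New York City, but the 0-100 component scale, the component names, and the school's scores are hypothetical and do not reproduce the actual Progress Report formula.

```python
# Weights follow the 85%/15% split described in the text; the component
# names and the 0-100 scale are assumptions made for illustration.
WEIGHTS = {"test_scores": 0.85, "surveys_and_attendance": 0.15}


def composite(components, weights=WEIGHTS):
    """Weighted sum of component scores, each on an assumed 0-100 scale."""
    return sum(weights[name] * components[name] for name in weights)


# Hypothetical school scoring 60 on both components.
school = {"test_scores": 60.0, "surveys_and_attendance": 60.0}
base = composite(school)

# Composite gain from an identical 10-point improvement in each component.
gain_from_tests = composite({**school, "test_scores": 70.0}) - base
gain_from_other = composite({**school, "surveys_and_attendance": 70.0}) - base

print(f"10-point gain on test scores adds {gain_from_tests:.1f} composite points")
print(f"10-point gain on surveys/attendance adds {gain_from_other:.1f} composite points")
```

In this sketch the same improvement is worth several times more when it occurs on the test score component, which is one reason a nominally multiple measures system could still concentrate attention on test score data.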
1. Federal, state, and district policy makers, of course, formally use data to measure how schools are doing and to apply rewards or sanctions, but I focus here on use of data by teachers.
2. Recent regulations now make it possible for states to request waivers from this requirement.
3. I thank a reviewer for this point.
Baker, G., Gibbons, R., & Murphy, K. J. (1994). Subjective performance measures in optimal incentive contracts. Quarterly Journal of Economics, 109, 1125–1156.
Booher-Jennings, J. (2005). Below the bubble: Educational triage and the Texas accountability system. American Educational Research Journal, 42, 231–268.
Borko, H., & Elliott, R. (1999). Hands-on pedagogy versus hands-off accountability: Tensions between competing commitments for exemplary math teachers in Kentucky. Phi Delta Kappan, 80, 394–400.
Boudet, K., City, E., & Murnane, R. (Eds.). (2005). Data-wise: A step-by-step guide to using assessment results to improve teaching and learning. Cambridge, MA: Harvard Education Press.
Brown, R. S., Wohlstetter, P., & Liu, S. (2008). Developing an indicator system for schools of choice: A balanced scorecard approach. Journal of School Choice, 2, 392–414.
Buddin, R. (2010). Los Angeles teacher ratings: FAQ and about. Los Angeles Times. Retrieved from http://projects.latimes.com/value-added/faq/
Bulkley, K., Fairman, J., Martinez, M. C., & Hicks, J. E. (2004). The district and test preparation. In W. A. Firestone & R. Y. Schorr (Eds.), The ambiguity of test preparation (pp. 113–142). Mahwah, NJ: Erlbaum.
Campbell, D. T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2, 67–90.
Center on Education Policy. (2007). Choices, changes, and challenges: Curriculum and instruction in the NCLB era. Retrieved from http://www.cep-dc.org
Chester, M. D. (2005). Making valid and consistent inferences about school effectiveness from multiple measures. Educational Measurement: Issues and Practice, 24, 40–52.
Coburn, C. E. (2001). Collective sensemaking about reading: How teachers mediate reading policy in their professional communities. Educational Evaluation and Policy Analysis, 23, 145–170.
Coburn, C. E. (2005). Shaping teacher sensemaking: School leaders and the enactment of reading policy. Educational Policy, 19, 476–509.
Coburn, C. E. (2006). Framing the problem of reading instruction: Using frame analysis to uncover the microprocesses of policy implementation. American Educational Research Journal, 43, 343–379.
Corcoran, S. P. (2010). Can teachers be evaluated by their students' test scores? Should they be? Providence, RI: Annenberg Institute, Brown University.
Corcoran, S. P., Jennings, J. L., & Beveridge, A. A. (2010). Teacher effectiveness on high and low-stakes tests (Working paper). New York University.
Darling-Hammond, L., & Wise, A. E. (1985). Beyond standardization: State standards and school improvement. Elementary School Journal, 85, 315–336.
Dee, T. S., & Jacob, B. (2009). The impact of No Child Left Behind on student achievement (NBER working paper). Cambridge, MA: National Bureau of Economic Research.
Diamond, J. B. (2007). Where rubber meets the road: Rethinking the connection between high-stakes testing policy and classroom instruction. Sociology of Education, 80, 285–313.
Ehren, M., & Visscher, A. J. (2006). Towards a theory on the impact of school inspections. British Journal of Educational Studies, 54, 51–72.
Finn, C. E., Petrilli, M. J., & Julian, L. (2006). The state of state standards. Washington, DC: Thomas B. Fordham Foundation.
Hamilton, L. S., & Stecher, B. M. (2006). Measuring instructional responses to standards-based accountability. Santa Monica, CA: RAND.
Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., . . . Barney, H. (2007). Implementing standards-based accountability under No Child Left Behind: Responses of superintendents, principals, and teachers in three states. Santa Monica, CA: RAND.
Hannaway, J. (2007, November). Unbounding rationality: Politics and policy in a data rich system. Mitstifer lecture, University Council of Education Administration, Alexandria, VA.
Holmstrom, B., & Milgrom, P. (1991). Multitask principal-agent analyses: Incentive contracts, asset ownership, and job design. Journal of Law, Economics, and Organization, 7, 24–52.
Jacob, B. A. (2005). Accountability, incentives, and behavior: Evidence from school reform in Chicago. Journal of Public Economics, 89, 761–796.
Jennings, J. L., & Bearak, J. (2010, August). State test predictability and teaching to the test: Evidence from three states. Paper presented at the annual meeting of the American Sociological Association, Atlanta, GA.
Jennings, J. L., & Crosta, P. (2010, November). The unaccountables. Paper presented at the annual meeting of APPAM, Boston, MA.
Kerr, K. A., Marsh, J. A., Ikemoto, G. S., & Barney, H. (2006). Strategies to promote data use for instructional improvement: Actions, outcomes, and lessons from three urban districts. American Journal of Education, 112, 496–520.
Koedel, C., & Betts, J. (2009). Value-added to what? How a ceiling in the testing instrument influences value-added estimation (Working paper). University of Missouri.
Koretz, D. (2008). Measuring up: What standardized testing really tells us. Cambridge, MA: Harvard University Press.
Koretz, D., Barron, S., Mitchell, K., & Stecher, B. (1996a). The perceived effects of the Kentucky Instructional Results Information System. Santa Monica, CA: RAND.
Koretz, D., & Hamilton, L. S. (2006). Testing for accountability in K-12. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 531–578). Westport, CT: American Council on Education/Praeger.
Koretz, D., Mitchell, K., Barron, S., & Keith, S. (1996b). The perceived effects of the Maryland School Performance Assessment Program (CSE Tech. Rep. No. 409). Los Angeles: University of California, National Center for Research on Evaluation, Standards, and Student Testing.
Krieg, J. (2008). Are students left behind? The distributional effects of No Child Left Behind. Education Finance and Policy, 3, 250–281.
Ladd, H. F. (2007, November). Holding schools accountable. Paper presented at the annual meetings of APPAM, Washington, DC.
Ladd, H. F., & Lauen, D. L. (2010). Status versus growth: The distributional effects of accountability policies. Journal of Policy Analysis and Management, 29(3), 426–450.
Ladd, H., & Zelli, A. (2002). School-based accountability in North Carolina: The responses of school principals. Educational Administration Quarterly, 38, 494–529.
Lauen, D. L., & Gaddis, S. M. (2010). Shining a light or fumbling in the dark? The effects of NCLB's subgroup-specific accountability on student achievement gains (Working paper). University of North Carolina, Chapel Hill.
Lee, J. (2008). Is test-driven external accountability effective? Synthesizing the evidence from cross-state causal-comparative and correlational studies. Review of Educational Research, 78, 608–644.
Marsh, J. A., Pane, J. F., & Hamilton, L. S. (2006). Making sense of data-driven decision making in education: Evidence from recent RAND research (No. OP-170). Santa Monica, CA: RAND.
Massell, D. (2001). The theory and practice of using data to build capacity: State and local strategies and their effects. In S. H. Fuhrman (Ed.), From the capitol to the classroom: Standards-based reform in the states (pp. 148–169). Chicago, IL: University of Chicago.
McNeil, L. M. (2000). Contradictions of school reform: The educational costs of standardized testing. London, England: Routledge.
Neal, D., & Schanzenbach, D. W. (2007). Left behind by design: Proficiency counts and test-based accountability (Working paper). University of Chicago.
New Teacher Project. (2010). The real race begins: Lessons from the first round of Race to the Top. New York, NY: Author.
Ordonez, L. D., Schweitzer, M. E., Galinsky, A. D., & Bazerman, M. H. (2009). Goals gone wild: The systematic side effects of over prescribing goal setting. Academy of Management Perspectives, 23, 6–16.
Papay, J. P. (2010). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48, 1–30.
Papay, J., & Moore Johnson, S. (2009). Redesigning teacher pay: A system for the next generation of educators. Washington, DC: Economic Policy Institute.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J. (2003). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Chestnut Hill, MA: National Board on Educational Testing and Public Policy.
Porter, A. C., Linn, R. L., & Trimble, C. S. (2005). The effects of state decisions about NCLB adequate yearly progress targets. Educational Measurement: Issues and Practice, 24, 32–39.
Prabhu, M. T. (2010). Forum calls for better use of data in education. eSchool News, 12(4).
Reback, R. (2008). Teaching to the rating: School accountability and the distribution of student achievement. Journal of Public Economics, 92, 1394–1415.
Reback, R., Rockoff, J., & Schwartz, H. (2010). Under pressure: Job security, resource allocation, and productivity in schools under NCLB (Working paper). Barnard College.
Rothman, R., Slattery, J. B., Vranek, J. L., & Resnick, L. B. (2002). Benchmarking and alignment of standards and testing (Working paper). UCLA.
Schweitzer, M., Ordonez, L., & Douma, B. (2004). Goal setting as a motivator of unethical behavior. Academy of Management Journal, 47, 422–432.
Shepard, L. A. (1988, April). The harm of measurement-driven instruction. Paper presented at the annual meeting of the American Educational Research Association, Washington, DC.
Shepard, L. A., & Dougherty, K. D. (1991). The effects of high stakes testing. In R. L. Linn (Ed.), Annual meetings of the American Education Research Association and the National Council of Measurement in Education. Chicago, IL.
Sitkin, S., Miller, C., See, K., Lawless, M., & Carton, D. (in press). The paradox of stretch goals: Pursuit of the seemingly impossible in organizations. Academy of Management Review, 36.
Smith, M. L., & Rottenberg, C. (1991). Unintended consequences of external testing in elementary schools. Educational Measurement: Issues and Practice, 10, 7–11.
Spillane, J. P., Reiser, B. J., & Reimer, T. (2002). Policy implementation and cognition: Reframing and refocusing implementation research. Review of Educational Research, 72, 387–431.
Springer, M. (2007). The influence of an NCLB accountability plan on the distribution of student test score gains. Economics of Education Review, 27, 556–563.
Stecher, B. (2004). Consequences of large-scale high-stakes testing on school and classroom practice. In L. Hamilton, B. M. Stecher, & S. Klein (Eds.), Making sense of test-based accountability in education (pp. 79–100). Santa Monica, CA: RAND.
Stecher, B. M., Chun, T. J., Barron, S. I., & Ross, K. E. (2000). The effects of the Washington State education reform on schools and classrooms: Initial findings. Santa Monica, CA: RAND.
Supovitz, J. A., & Klein, V. (2003). Mapping a course for improved student learning: How innovative schools systematically use student performance data to guide improvement. Philadelphia: Consortium for Policy Research in Education, University of Pennsylvania.
Urbina, I. (2010, January 12). As school exit tests prove tough, states ease standards. The New York Times, p. A1.
Weick, K. (1995). Sensemaking in organizations. London, England: Sage.
Wolf, S. A., & McIver, M. C. (1999). When process becomes policy: The paradox of Kentucky state reform for exemplary teachers of writing. Phi Delta Kappan, 80, 401–406.