The Use of Test Accommodations as a Gaming Strategy: A State-Level Exploration of Potential Gaming Tendencies in the 2007–2009 Period and Implications for Re-directing Research on Gaming Through Test Accommodations
by Argun Saatcioglu, Thomas M. Skrtic & Thomas A. DeLuca - 2016
The overuse of test accommodations (e.g., test readers, extra time, and calculators) for students with disabilities is a potential means of gaming the accountability system because it can inflate proficiency gains. However, no direct evidence on this problem exists, and findings on whether or not test accommodations improve test scores are persistently mixed. A key issue underlying both problems is the failure to account for contextual attributes. We propose students’ skill range—the breadth of knowledge and skills they acquire—as a fundamental contextual variable. A wider skill range implies a broader and thus more complex skillset. Under such conditions, more students with disabilities are likely to legitimately need test accommodations. Gaming is thus suspect when a high percentage of students with disabilities are given test accommodations and the skill range is narrow. Based on state-level panel data (2007–2009) for eighth graders, we find that states with a wider mathematics skill range indeed use test accommodations more commonly, and that under such conditions test accommodations do not necessarily result in greater proficiency gains, suggesting that gaming may be less likely in such states. Proficiency gains from test accommodations are greatest in states where the mathematics skill range is markedly narrower, yet a high percentage of students with disabilities are nonetheless given test accommodations. We offer a number of competing heuristic explanations of gaming and no-gaming under a wider set of combinations for skill range and test accommodations, which are testable in future research. We follow this with a discussion of broader implications for scholarship and practice.
A persistent criticism of proficiency-based accountability policies is that they may lead to gaming behaviors that aim to inflate achievement for maintenance of legitimacy in the face of regulative pressures and associated public expectations (Darling-Hammond, 2007; Linn, 2003; Peterson & Hess, 2006, 2008). The pressure to improve proficiency gains increases the risk of dumbed-down tests (Cronin, Dahlin, Adkins, & Kingsbury, 2007), narrower curricula (Berliner, 2009), teaching to the test (Firestone, Schoor, & Monfils, 2004), instruction focusing on near-proficient students (Booher-Jennings, 2005; Jennings & Sohn, 2014), suspensions of low performers during test periods (Figlio, 2006), push out practices encouraging such students to drop out (Hursh, 2007; Orfield & Kornhaber, 2001), and teacher cheating, including giving students answers prior to the test and changing responses after the test (Copeland, 2013; Jacob & Levitt, 2003).
We address special education as another gaming domain, focusing on the potential overuse of test accommodations (e.g., test readers, extra time, and calculators) to inflate proficiency gains. Past studies have shown that special education was used for gaming in various ways, including over-identification of low-income students and potential low performers as disabled so as to exempt them from assessments or to not publicize their scores (Cullen & Reback, 2006; Deere & Strayer, 2001; Figlio & Getzler, 2006; Haney, 2000; Hursh, 2007; Jacob, 2005; McGill-Franzen & Allington, 1993). Such strategies for gaming were curbed by the 1997 reauthorization of the Individuals with Disabilities Education Act (IDEA), which required states to include students with disabilities in state (and local) assessments, with accommodations as needed, and to develop alternate assessments for those unable to take the standard exam (Yell & Shriner, 1997). Alternate assessments are for students with the most significant cognitive disabilities who are unable to participate in general assessments, even with accommodations (Kleinert & Thurlow, 2001). To reduce incentives for unfair use of alternate assessments for gaming purposes (e.g., inappropriately assigning more capable students with disabilities to alternate assessments) and also to provide better adequate yearly progress (AYP) information about the students with disabilities subgroup, the U.S. Department of Education (2005a) subsequently capped the percentages of alternatively tested students whose scores can be used for AYP purposes to no more than 3% of the total student population tested regardless of the share of alternatively tested students in that overall population (Elledge, Le Floch, Taylor, & Anderson, 2009). With no such caps regarding test accommodations, on the other hand, the overuse of such accommodations remains a legally open and a particularly feasible and potentially attractive means of gaming.
Yet, not only is direct empirical evidence on gaming through accommodations virtually nonexistent, but more importantly, findings on the effect of accommodations on proficiency outcomes remain mixed (e.g., Ketterlin-Geller, Alonzo, Braun-Monegan, & Tindal, 2007; Sireci, Scarpati, & Li, 2005). The latter problem is critical in addressing the former because the notion that accommodations can be used for gaming presupposes that accommodations are likely to have positive effects on proficiency gains. In this article, we address variance in students’ skill range in mathematics (the breadth of math knowledge and skills they acquire) as a key contextual attribute that creates conditions determining the legitimate need for test accommodations. We argue that under a wide skill range condition (under which the average student learns more skills) a greater percentage of students with disabilities may legitimately require test accommodations and that under a narrow skill range condition, the need for accommodations is likely to be less. Relying on state-level data for 2007 and 2009, we examine the relation between accommodations and proficiency gains under different skill range conditions, and propose competing heuristic explanations for why and how accommodations can be used as a gaming strategy under those conditions. These explanations are testable in future research.
TEST ACCOMMODATIONS AND THEIR POTENTIAL USE FOR GAMING
Accommodations are changes in standardized assessment conditions to remove the construct-irrelevant variance created by disabilities (Fuchs, Fuchs, Eaton, Hamlett, & Karns, 2000, p. 66). They include changes in test presentation (e.g., read-aloud [proctor reads items/responses]), response (e.g., calculator), timing (e.g., extended time), and/or setting (e.g., separate room) (Thurlow, 2000). An individual student with a disability may receive one or more of these accommodations depending on his or her needs (Elliott & Marquart, 2004; Schulte, Elliott, & Kratochwill, 2000).1 The IDEA requires states to ensure access of the child [with a disability] to the general curriculum (§ 300.26(b)(3)), which means providing accommodations during instruction to ensure equal access to the general [content standards] (Thurlow, 2000, p. 11). Its policy logic is that together these two requirements will incentivize schools to set challenging (general education) content standards for students with disabilities and to develop effective instructional accommodations to help them (and their school) meet state performance standards (McDonnell, McLaughlin, & Morison, 1997). Inclusion of students with disabilities in state testing and accountability systems was reaffirmed in Title I of the No Child Left Behind Act of 2001 (NCLB). Title I continued the requirements for test accommodations and alternate assessments and further required that the performance of students with disabilities be disaggregated and targeted in AYP calculations, thus making schools, districts, and states accountable for the performance of the students with disabilities subgroup and for closing the achievement gap between it and all other subgroups (Thurlow, 2002).
However, the use of accommodations is controversial in part because of the longstanding concern that they may give students who use them an undue advantage. A test accommodation is a valid means of leveling the playing field only when it improves the performance of students with disabilities by compensating for their specific impairments, but does not have any effect, or has a significantly limited effect, on the performance of students without disabilities when those students are given the accommodation under experimental conditions (Sireci et al., 2005). This specific validity standard is generally referred to as the differential boost criterion (Fuchs et al., 2000). It is, however, only partially supported by experimental research, as there are many cases in which the accommodation involved significantly improved the performance of not just students with disabilities but also those without (Bolt & Thurlow, 2004; Cormier, Altman, Shyyan, & Thurlow, 2010; Lazarus, Thurlow, Lail, & Christensen, 2009; Zenisky & Sireci, 2007). If a given accommodation results in proficiency gains for all students, then it may be doing more than compensating for impairments of students with disabilities, making it fertile ground for gaming the system. As Sireci et al. (2005) point out, “if everyone benefits from accommodations, then scores from accommodated exams [taken by students with disabilities] may be invalidly inflated” (p. 458). The possibility that test accommodations can inflate performance and improve scores fuels the suspicion that they may also be used for gaming.
This suspicion is reinforced by findings indicating that teachers over-identify students with disabilities for accommodations and that they are unable to predict which students will benefit from accommodations (Fuchs et al., 2000; Helwig & Tindal, 2003; Ketterlin-Geller et al., 2007). Accommodation decisions often are formulated idiosyncratically by individualized education program [IEP] teams (Fuchs et al., 2000, p. 66), a process in which decisions . . . are based on inconsistent and often unreliable sources of information . . . [including] teachers’ subjective judgment (Ketterlin-Geller et al., 2007, p. 196). Neither state nor federal legislation provides adequate guidance on how accommodation decisions are to be made (Elliott, McKevitt, & Kettler, 2002). States go only as far as specifying which accommodations are allowed and the policy variables that can be used in decisions about the use of those accommodations. The two dominant policy variables in this regard are whether a test accommodation is specified in the student’s IEP (chosen from among those allowed by the state) and whether commensurate accommodations are used for the student in classroom instruction (Lazarus et al., 2009).
Despite the proneness of test accommodations to be (mis)used as a gaming strategy, no direct evidence on this potential problem exists. There is only one study that offers speculative insight. In their examination of data from Florida before 1997, when the IDEA required states to include students with disabilities in standardized assessments, Figlio and Getzler (2006) found that schools reclassified low income and previously low-performing students as disabled at much higher rates following the introduction of the testing regime so as to exempt them from testing or to not publicize their scores. The preclusion of such strategies by the IDEA, Figlio and Getzler speculated, would lead schools to use test accommodations for gaming purposes under NCLB.
In addition to the lack of direct evidence on the use of accommodations for gaming, findings on the actual effect of accommodations on proficiency gains have been persistently mixed (e.g., Ketterlin-Geller et al., 2007; Sireci et al., 2005). This is a more critical problem because the assumption that accommodations result in greater proficiency gains underlies the gaming argument. For it is beneficial to use accommodations for gaming the accountability system only when the outcome is greater proficiency. While there are many studies that indicate accommodations are indeed related to proficiency gains (for a review, see Sireci et al., 2005), there are several others that find no relationship, and some that find a negative relationship between accommodations and proficiency outcomes (for a review, see Johnstone, Altman, Thurlow, & Thompson, 2006). As Ketterlin-Geller et al. (2007) note, inappropriate assignment of accommodations or their inconsistent use in teaching and testing may significantly jeopardize student achievement by withholding necessary format changes or by providing distracting or confusing changes (p. 195). The former practice prevents students with disabilities from demonstrating their competence (Fuchs et al., 2000, p. 68) while the latter increases the likelihood that students will be adversely affected by an unneeded accommodation (Helwig & Tindal, 2003, p. 212).
A fundamental problem in the collective body of work bearing on both the use and consequences of accommodations is that accommodations are considered in a decontextualized fashion. Specifically, there is no account of whether there are objective conditions under which a greater proportion of students with disabilities are likely to legitimately need accommodations to level the playing field. This issue is important because if a high percentage of such students were given accommodations when objective conditions suggest less of a need for such a policy, then this would raise the question of gaming. This approach also sets the stage for a more strategic analysis of proficiency gains because proficiency outcomes can be interpreted in terms of how they vary depending on different conditions that imply different proportions of students with disabilities legitimately needing accommodations. We take an important step in redirecting research on accommodations as a gaming strategy in this regard.
SKILL RANGE AS AN OBJECTIVE CONTEXTUAL ATTRIBUTE FOR ACCOMMODATIONS
We address students’ skill range (the breadth, or scope, of their knowledge and skills) as an objective condition with legitimate implications for the percentage of students with disabilities needing accommodations. In simple terms, skill range can be understood as the skillset acquired by the average student in a given context, such as a classroom, school, district, state, or nation. As we will explain, we use this construct at the state level. Thus, skill range signifies the content acquired by the average student (not necessarily one with a disability) in a given state. This is primarily a function of the scope of topics covered in the curriculum, the standardized assessment involved, and the instructional orientation of the teacher. To elaborate, consider a hypothetical school-level example. Assume that two different schools across the street from each other offer a course titled algebra for the same grade level and that each school designs its own content for the course as well as the test involved. Assume also that it is possible for algebra teachers across the two schools to differ in their instructional orientation in terms of teaching to the test. The range of algebra skills the average student acquires in each school is likely to be a function of three factors: (1) what is supposed to be taught, that is, the scope of the curriculum or the official content; (2) what is tested, that is, the scope of the assessment, which may or may not be as wide as the curriculum; and (3) what actually gets taught, that is, the degree to which the instruction focuses on the curriculum and the test (Cizek, 2005; Kubiszyn & Borich, 2007). As a result of various degrees and combinations of these three moving parts, students in one school may acquire a broader or more limited range of algebra skills than their peers in the other school. Our concept of skill range applies the same idea at the state level.
A narrower skill range implies not just fewer, but also simpler skills. While this is likely pertinent to all skill domains, it may be more pronounced and thus more readily observable with regard to mathematics because math skills, unlike, for instance, reading skills, are sequentially arranged in complexity and must therefore be acquired in progressive fashion (Reys, 2006). In other words, the math skill range spans from easier skills to more complex ones. For example, acquiring the skill to solve simple linear equations with a single unknown is typically necessary prior to learning the skill to solve polynomial equations and those with multiple unknowns. But, while mastery over all these skills may be important in an ideal algebra class, students can successfully acquire linear function skills even when skills for polynomial equations and multiple unknowns are postponed or omitted, resulting in a suboptimal skill range.
Logically, in a system (for instance, a school, district, or state) that helps students acquire a broader range of skills, a greater share of students with disabilities are likely to require both instructional and test accommodations by virtue of the scope and associated complexity of the skillset. Consider the algebra example. A wider and thus more complex skill range implies items that may be longer and more challenging to read, resulting in a greater proportion of students with disabilities requiring extra time as well as read-aloud proctors to level the playing field. For the same reason, calculators may also be needed by a greater share of students with disabilities, as well as a separate quiet room if the test is more taxing for such students. This is consistent with the basic logic of accommodations, which are ideally designed to compensate for challenges associated with disabilities without making the test itself easier (Fuchs et al., 2000). Therefore, in discerning the use of accommodations as a gaming strategy to inflate proficiency gains, it is critical to consider what the skill range is in the system in question. If a high share of students are given accommodations when the skill range is narrow, it is more likely that actors in the system are engaged in gaming. As a means to examine this hypothesis, the association of accommodations with proficiency gains can be explored under different conditions of skill range as an objective contextual attribute.
STATE/NAEP COMPARABILITY AS A MEASURE OF STATE-LEVEL SKILL RANGE
The notion of skill range can be construed at the classroom, school, district, state, and even the national level. In this article, we address it at the state level and rely on what we refer to as state/NAEP comparability to measure it. Below, we discuss the conceptual features of this measure, leaving its empirical construction and properties for the subsequent section.
The central challenge with the notion of skill range is that there is neither an ideal criterion for skill range, nor a uniform set of guidelines to specify and govern it. The latter is particularly common for decentralized educational systems such as in the United States, where states and local districts exercise considerable autonomy (McDonnell, 2004). It is, however, possible to measure skill range at the state level in reference to a low-stakes nationwide test, namely the state-level National Assessment of Educational Progress (NAEP), widely recognized as the gold standard in assessment (Koretz, 2008). More specifically, our skill range measure involves the comparability of proficiency outcomes on state assessments to proficiency outcomes on the NAEP. This does not involve a side-by-side comparison of test content, nor is it about the comparison of test outcomes. It is about whether, or how well, the outcomes can be compared in the first place.
Designed by an independent federal board of experts, NAEP evaluates a uniquely wide skill range because, as a nationwide test, it approximates an amalgamation of skills students are expected to learn in all states (Hombo, 2003). As such, NAEP provides the next best thing to an ideal criterion for skill range: a comprehensive frame of reference. Individual states are represented on committees that develop NAEP’s frameworks (skillsets NAEP evaluates) and review the frameworks in draft form (National Assessment Governing Board, 2002). In contrast to NAEP’s skill range, students in individual states learn and are tested on a more targeted skill range reflecting local curricular preferences and instructional orientations. Thus, the more limited students’ skill range in a state, the less comparable their performance on the state assessment is to their performance on the NAEP. State/NAEP comparability denotes the degree to which students’ skill range approaches the range of skills evaluated by the NAEP.
NAEP’s wider skill range imposes an inherent limit on state/NAEP comparability, but we are interested in how this comparability varies across states. NAEP’s design and content do not vary by state and change only infrequently (Dee & Jacob, 2010; Kolstad, 2002), while state curricula, assessments, and instructional orientations are subject to frequent change. Finally, as a low-stakes test, NAEP is not subject to accountability pressures that can stimulate gaming behaviors (Jacob, 2001). These features make the NAEP a stable and thorough reference point to help examine students’ skill range without treating the NAEP as normatively superior to any state assessment. State/NAEP comparability is different from state/NAEP performance discrepancy, as outcomes of two tests can be comparable even if results are discrepant, so long as a similar range of skills is involved in performance on both tests.
In this article, we present analyses that address three guiding questions:
What is the relationship between the percentage of students with disabilities who are given test accommodations in a state and that state’s average skill range? A positive relationship implies a lower prevalence of accommodations used for gaming.
What is the relationship between the percentage of students with disabilities who are given test accommodations in a state and that state’s proficiency gains? Does this relationship vary under different conditions of skill range?
How do states vary in terms of skill range/test accommodation combinations and related proficiency gains? What are potential explanations for why (or why not) accommodations may be used as a gaming strategy under those combinatorial conditions?
We relied on various data sources to construct a state-level panel dataset for the 2007–2009 period with a two-year interval (resulting in two time points), focusing on mathematics for eighth graders (n = 50 × 2 = 100).2 Descriptions for all measures are shown in Table 1. We use the data to examine the relation of skill range to the percentage of students with disabilities who are given test accommodations, and for multivariate models predicting associated proficiency gains.
Table 1. Measures Used in the Study
State-level information on assessment types for students with disabilities is publicly available from the IDEA Data Center (IDC).3 The state-level data files from IDC go only as far back as 2007. We therefore obtained information for 2007 and 2009, consistent with our skill range measures described below. Our central assessment type measure is the percentage of students with disabilities taking the regular state assessment with accommodations. State-level data does not address whether a given student receives a single accommodation or multiple ones. Therefore, this issue is not part of our analysis. The issue of whether single and multiple accommodations have significantly different implications and outcomes is not raised in existing research either (see Cormier et al., 2010; Johnstone et al., 2006; Thompson, Blount, & Thurlow, 2002; Zenisky & Sireci, 2007).
We also obtained the percentages for students with disabilities taking the regular state assessment without accommodations, and for students with disabilities taking the alternate state assessment. We use these two measures for descriptive purposes in our analysis.
State/NAEP comparability denotes the degree to which the breadth of the average eighth grader’s mathematics skills in a state approaches the range of skills evaluated by the NAEP, which approximates an amalgamation of skills students are expected to learn in all states. State/NAEP comparability scores are drawn from federally commissioned studies that employ equipercentile linking procedures to map state proficiency results onto the NAEP scale for 2007 and 2009 in order to facilitate the comparison of proficiency outcomes across states and over time (see Bandeira de Mello, 2011; Bandeira de Mello, Blankenship, & McLaughlin, 2009). Assume two states in which 60% of eighth graders are proficient or above based on the proficiency cut scores for their respective state assessments. Assume also that the 60% proficiency for one state maps onto that particular state’s NAEP distribution at, say, a NAEP score of 200, while the 60% for the other state maps onto that state’s NAEP distribution at, say, a NAEP score of 220. Based on these NAEP scale equivalencies (NSEs) for state proficiency results, the average eighth grader in the second state would be considered higher performing than the average eighth grader in the first, even though both states report 60% proficiency on their own assessments. However, expressing the outcome of one test (i.e., state assessment) on the scale for another (i.e., NAEP) is valid only to the extent that student performances on the two tests, respectively, are comparable in terms of the skill range involved. Thus, the mapping studies include a diagnostic comparability measure that we adopted and refer to as state/NAEP comparability. This measure is based on building-level data from the state’s NAEP sample, representative of the entire state. 
Since students in the state’s NAEP sample take both the state assessment and the NAEP, each building in the NAEP sample has two critical scores: the percentage of students proficient or above on the state assessment in the building (p) and the percentage of students whose NAEP scores are at or above the state’s NSE (q) (e.g., the 200 or 220 discussed above). If the student outcomes on the state assessment and the NAEP are comparable in terms of skill range, then p and q would be highly correlated for schools within the state’s NAEP sample. In simple terms, if NAEP and the state assessments were the same and thus evaluated exactly the same skills, then p and q would be nearly identical. The correlation of p and q, ranging from 0 to 1, is also adjusted for NAEP measurement and sampling errors, and the natural between-school variation in state assessment outcomes (for more detail, see Bandeira de Mello et al., 2009, pp. 7–10).4, 5
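To make the logic of this diagnostic concrete, the following is a minimal Python sketch. It is not the procedure used in the mapping studies. It assumes a small hypothetical sample of four buildings, approximates the NSE with a simple rank-based cut (the NAEP score at or above which the same number of sampled students falls as is proficient on the state assessment), and computes the unadjusted correlation of p and q. All function names and data are illustrative assumptions, and the adjustments for measurement error, sampling error, and between-school variation are omitted.

```python
from math import sqrt

# Hypothetical NAEP sample (an assumption for illustration only):
# building -> list of (proficient_on_state_test, naep_score) per student.
buildings = {
    "A": [(True, 290), (True, 280), (False, 250), (False, 240)],
    "B": [(True, 300), (True, 285), (True, 270), (False, 245)],
    "C": [(False, 272), (True, 265), (False, 235), (False, 230)],
    "D": [(True, 255), (True, 252), (False, 295), (False, 241)],
}

def naep_scale_equivalent(students):
    """Rank-based approximation of the NSE: the NAEP score at or above
    which as many students fall as are proficient on the state test."""
    n_proficient = sum(1 for proficient, _ in students if proficient)
    ranked = sorted((score for _, score in students), reverse=True)
    k = max(1, min(n_proficient, len(ranked)))  # guard against 0 proficient
    return ranked[k - 1]

def building_shares(students, nse):
    """Return (p, q): share proficient on the state test and share
    scoring at or above the NSE on the NAEP, for one building."""
    n = len(students)
    p = sum(1 for proficient, _ in students if proficient) / n
    q = sum(1 for _, score in students if score >= nse) / n
    return p, q

def pearson(xs, ys):
    """Plain (unadjusted) Pearson correlation."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

all_students = [s for group in buildings.values() for s in group]
nse = naep_scale_equivalent(all_students)
shares = [building_shares(group, nse) for group in buildings.values()]
p_values = [p for p, _ in shares]
q_values = [q for _, q in shares]
r_pq = pearson(p_values, q_values)
print(nse, r_pq)
```

In this toy sample, buildings C and D rank differently on the two tests, so the correlation falls well below 1. A low value of this kind is what the comparability measure treats as a sign that the two tests draw on different ranges of skills.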
The central outcome measure is the gain in the percentage of eighth graders proficient or above on the state mathematics assessment, that is, the difference between proficiency in a given year and proficiency two years earlier. Thus, while the analysis is limited to the 2007–2009 period, we obtained 2005 proficiency information to generate the proficiency gain for 2007 (see Table 1).
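The gain calculation can be sketched in a few lines of Python. The proficiency figures below are hypothetical and the function name is an illustrative assumption. The sketch only shows why a 2005 baseline is needed even though 2005 itself does not enter the panel.

```python
def proficiency_gains(pct_proficient_by_year, interval=2):
    """Gain for each year that has a value `interval` years earlier:
    percent proficient in that year minus percent proficient earlier."""
    return {
        year: pct - pct_proficient_by_year[year - interval]
        for year, pct in pct_proficient_by_year.items()
        if year - interval in pct_proficient_by_year
    }

# Hypothetical percent proficient or above for one state.
state = {2005: 58.0, 2007: 61.5, 2009: 60.0}
gains = proficiency_gains(state)
print(gains)  # 2005 has no earlier baseline, so it yields no gain
```

Under these made-up figures the state gains 3.5 percentage points by 2007 and loses 1.5 points by 2009, giving the two time points used per state.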
CONTROL MEASURES FOR MULTIVARIATE ESTIMATION OF PROFICIENCY GAIN
States with greater non-school demographic disparities (specifically, greater proportions of low-income and racial/ethnic minority students and students with access to limited human, social, and cultural capital) are likely to find it more difficult to improve statewide performance. Waves of accountability reform have had only modest success in the last 25 years in closing achievement gaps due in part to non-school disparities (Reardon, 2011). We operationalize non-school disparities by means of six measures. These are state means for adults, ages 30 to 50, participating in the American Community Survey (ACS), generated from individual-level data available from the Integrated Public-Use Microdata Series (IPUMS).6 They include percent Black and percent Hispanic, capturing racial/ethnic disadvantages; average personal income and percent in poverty, representing the economic capital available to children; percent with college degree as a measure of human capital; and percent single-parent as a proxy for social/cultural capital (Coleman, 1988; McLanahan & Sandefur, 1994).
We also control for disparities in school quality, in terms of three measures drawn from the National Center for Education Statistics Common Core of Data7 (CCD): percentage of students in urban and rural schools, pupil/teacher ratio in secondary grades, and per-pupil revenue. States with greater proportions of students in urban and rural schools can experience greater challenges in improving their proficiency gain. This is partly related to non-school demographic disparities since urban and rural schools typically serve students affected by such disparities. But it also reflects the problem of low school quality, which aggravates achievement problems. Urban and rural schools are characterized by persistent compositional disadvantages, such as poverty concentration and racial/ethnic isolation, which are related to adverse peer effects (Hanushek, Kain, & Rivkin, 2009), low teacher and staff quality (Lankford, Loeb, & Wyckoff, 2002), disruptive climate (Bryk & Schneider, 2003), low expectations (Payne, 2008), and low academic pressure (Noguera, 2003; Schwartzbeck, 2003). The second school quality measure, pupil/teacher ratio, is a proxy for class size. The larger the size of a classroom, the more difficult it can be to attend to the various needs of all students. This not only increases the risk of teacher stress and turnover (Russell, Altmeier, & Van Velzen, 1987), but it can also impede achievement (Nye, Hedges, & Konstantopoulos, 2000), complicating efforts to improve statewide proficiency gains. The third measure, per-pupil revenue generated from state and local (district) sources, is a proxy for tangible school resources. 
More broadly, this is a fiscal expression of the state-specific demand for educational quality and outcomes (Gramlich & Rubinfeld, 1982).8 Per-pupil revenue from state and local sources is affected by the state’s income and wealth as well as its emphasis on education relative to other services such as law enforcement, healthcare, and highways (Baker & Duncombe, 2004). States with lower per-pupil state and local revenue spend less on hiring high-quality teachers, and provide students with lower quality programs, facilities, and supplies (Card & Payne, 2002). They also allocate less to students with costlier needs (Duncombe & Yinger, 2008). As a result, they may experience greater challenges in proficiency gains.
States with strongly conservative political values are more averse and thus respond more strategically to federal regulation, especially regarding social policies such as education. Elazar (1984) attributes this to a worldview that combines traditionalistic and individualistic orientations underlying a strong preference for small government. According to Henig (2013), proficiency pressures under the accountability regime have contributed to a sense of federal intrusion undermining local autonomy over education policy in conservative states. Therefore, while conservative values emphasize the importance of rigorous education (Elazar, 1984), states with strong conservative values may have been slower in their proficiency gains. We operationalize conservatism as percentage of Republicans in the state legislature.
WHAT WE FOUND
What is the relationship between the percentage of students with disabilities who are given test accommodations in a state and that state’s average skill range? As seen in Figure 1, this relationship is positive and significant (r = 0.255, p ≤ 0.010). As the range and thus the complexity of math skills acquired by the average eighth grader in a state increases, so does the percentage of students with disabilities who are given test accommodations, consistent with our expectations. Each observation in Figure 1 is a state-by-year observation and, for the overwhelming majority of states, the 2007 and 2009 observations are very close in Euclidean distance, often overlapping (which makes many state abbreviation labels unreadable except in outlying cases). Scatterplots for separate years are very similar to what is shown here. Pooled data offers greater power for testing the significance of the two-way correlation, given an n of 100 (50 × 2) for 2007 and 2009 combined.
Figure 1. Association of testing accommodations with skill range
Note. Pooled data for 2007 and 2009 (50 × 2). Solid line represents best linear fit (r = 0.255, p ≤ 0.010). Dashed line marks 0.747 on the skill range axis, the point at which the effect of test accommodations on proficiency gain is zero. This is explained in reference to our multivariate analysis.
The denser clustering of observations at the top right corner of the figure indicates that the majority of states have a wide skill range and a high percentage of students with disabilities who are given test accommodations. Evidently, the legitimate need for test accommodations to be used more commonly when the skill range involved is wider does not rule out the possibility that, under such circumstances, test accommodations can, at least in some cases, also be used for gaming. Our data does not support a closer examination of such dynamics. However, the top right corner of Figure 1 does constitute a potential reference point for interpreting outlying cases in the top left corner, where states give test accommodations to a high proportion of their students with disabilities even though the state-level skill range involved is markedly narrower. This suggests that test accommodations may not be legitimately needed for as high a percentage of students with disabilities. Therefore, there is a greater chance that gaming through test accommodations is more prevalent in outlying cases in the top left corner in Figure 1 than in cases located at the top right corner. We address this point further in our discussion of findings from multivariate estimates of proficiency gains from test accommodations under different conditions of skill range.
What is the relationship between the percentage of students with disabilities who are given test accommodations in a state and that state's proficiency gain? Does this relationship vary under different objective conditions of skill range? We rely on hierarchical gain score modeling to address these questions. By specifying the outcome as the gain from one period to the next and by controlling for the value of where the gain begins, gain score models account for the ceiling effect. Controlling for where the gain begins can also account for various unobserved state characteristics. Finally, hierarchical modeling is appropriate for panel data, as it limits biases from non-independence of repeated observations nested within states. We fit the following random intercept model in stepwise fashion9:
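\begin{align}
P_{jt} - P_{j(t-1)} &= \beta_{0j} + \beta_{1} P_{j(t-1)} + \gamma S_{jt} + \zeta A_{jt} + \eta \,(A_{jt} \times S_{jt}) + \sum_{i=1}^{10} \delta_{0(i)} C_{(i)jt} + \varepsilon_{jt} \tag{1a} \\
\beta_{0j} &= \lambda_{00} + \upsilon_{0j} \tag{1b}
\end{align}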
where j = state and t = time (2007 and 2009). P denotes percent proficient or above on the state assessment. The outcome is the gain in 2007 (from 2005 to 2007) and in 2009 (from 2007 to 2009). S represents skill range (state/NAEP comparability) and A represents percent of students with disabilities who are given test accommodations. The interaction of these two measures is also included in the model (Ajt*Sjt). C(i) denotes a vector of 10 state characteristics as control measures in predicting proficiency gain. Finally, ε and υ are first- and second-level error terms, normally distributed with zero mean and constant variance. The results are shown in Table 2, in which all estimates are based on unstandardized scores.
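As a minimal sketch of how the outcome and the interaction term in the model above are constructed for a single state-by-year observation, consider the following. The field names and the example values are hypothetical and are not drawn from our data; the control measures and the state-level random intercept are omitted for brevity.

```python
from dataclasses import dataclass


@dataclass
class StateYear:
    """One state-by-year observation (field names are hypothetical)."""
    state: str
    year: int
    proficient: float        # P_jt: percent proficient or above in year t
    prior_proficient: float  # P_j(t-1): percent proficient at the start of the gain period
    skill_range: float       # S_jt: state/NAEP comparability
    accommodated: float      # A_jt: share of students with disabilities given accommodations


def model_terms(obs: StateYear) -> dict:
    """Build the outcome and the focal predictors (controls omitted)."""
    return {
        "gain": obs.proficient - obs.prior_proficient,          # left-hand side of the model
        "prior_proficient": obs.prior_proficient,               # accounts for the ceiling effect
        "skill_range": obs.skill_range,
        "accommodated": obs.accommodated,
        "interaction": obs.accommodated * obs.skill_range,      # A_jt x S_jt
    }


# Made-up values for illustration only.
example = StateYear("XX", 2009, proficient=62.0, prior_proficient=58.0,
                    skill_range=0.80, accommodated=0.55)
terms = model_terms(example)
print(terms)
```

Rows built this way, one per state and year, would then be passed to whatever mixed-model routine is used to estimate the random intercept for each state.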
Results for all four models indicate that, given the ceiling effect, the greater the proficiency at the beginning of the gain period (Pj(t-1)), the smaller the gain. For instance, the estimated effect in Model 1 is –0.309 (p ≤ 0.010), which changes only negligibly in subsequent models. The effects of state characteristics, as controls, are more telling. Virtually none of these characteristics have a statistically significant relationship with proficiency gain in any of the models, and, while not shown here, this is the case even when percent proficient at the beginning of the gain period is omitted. This is consistent with insights from classic studies of compliance with legal mandates (e.g., Tolbert & Zucker, 1983). When a reform is legally mandated, all actors subject to the mandate are coerced to comply. Individual attributes are often inconsequential in the timing and degree of compliance. For example, all drivers, regardless of personal, cultural, and political characteristics, are expected to comply with speed limits when such rules are instituted. Coercive compliance in this regard is different from normative and mimetic processes where the characteristics and needs of complying actors play a greater role in timing and degree of compliance (DiMaggio & Powell, 1983; Tolbert & Zucker, 1983). NCLB legally obliged all states to improve proficiency irrespective of their distinct features. Hence, no state characteristic predicts proficiency gain.
Table 2. Hierarchical Linear Regression Estimates of Effects on Gain in Percent Proficient or Above
*** p ≤ 0.010; ** p ≤ 0.050; * p ≤ 0.100.
The percent of students with disabilities who are given test accommodations is entered as a predictor in Model 2 and is non-significant. So is skill range, which is entered in Model 3. In Model 4, these two measures are entered together along with their interaction (the full model shown in equations 1a and 1b above). In that full model, both the main effects and the interaction term are strong and significant. This suggests that, when either measure is specified by itself, as in Models 2 and 3, its estimated effect may be suppressed because it is contingent on the interaction. Consider the accommodation effect, which has conceptual primacy for this study, and is estimated in Model 2 as nearly zero (0.007, p ≥ 0.100). This is likely because opposite values of the accommodation effect under conditions of narrow and wide skill range tend to cancel each other out, resulting in a null aggregate estimate. Consistent with this interpretation, the main accommodation effect in Model 4 is strongly positive (0.566, p ≤ 0.010), while its interaction with skill range is strongly negative (–0.758, p ≤ 0.010), meaning that higher values of skill range considerably reduce the accommodation effect. Thus, when the two-way interaction is not specified, the total accommodation effect averages out to a small coefficient as in Model 2. This supports our contention that the association of test accommodations with proficiency gains needs to be explored under different conditions of skill range as an objective contextual attribute. Variance in the nature of the estimated accommodation effect across the skill range distribution is shown in Figure 2, based on the predicted values of proficiency gain from Model 4 in Table 2.
Figure 2. Predicted values of proficiency gain as a function of test accommodations and skill range based on Model 4 in Table 2
Note. 1 = Harmful gaming region (59); 2 = Tempered gaming region (13); 3 = Useful gaming region (13); 4 = Limited gaming region (10); 5 = Super gaming region (5).
Under low values of skill range, an increase in the percentage of students with disabilities who are given test accommodations is associated with a considerable increase in proficiency gain (net of where the gain begins and other state characteristics). This means that in states where the average eighth grader acquires a limited range (and thus complexity) of math skills relative to NAEP's math skill range, giving test accommodations to a greater proportion of students with disabilities considerably improves the state's proficiency gains. Simply put, test accommodations result in better statewide proficiency improvement when the average student is acquiring fewer skills. This also suggests that the handful of cases in the top left corner of Figure 1 (those with high accommodation percentages even though their skill range is markedly narrower) get the largest proficiency payoffs from test accommodations. Recall our earlier conjecture that gaming through test accommodations is more likely in the top left corner of Figure 1 because accommodation rates are much higher than what the skill range involved would suggest. This conjecture is further supported by the insight that, under such conditions, proficiency gains from test accommodations are the greatest.
Given the negative interaction term in Model 4, the slope of the accommodation effect on proficiency gain gradually declines as skill range increases. When skill range is equal to 0.747 (shown by the dashed lines in Figures 1 and 2), the accommodation effect is nullified. Past this point, the effect becomes increasingly negative, such that a rise in the percentage of students with disabilities who are given test accommodations reduces the proficiency gain. It is important to note that what is reduced here is the gain, not necessarily net proficiency, although a proficiency loss is likely as well. Either way, the use of test accommodations as a gaming strategy to inflate proficiency gains is less rational and thus less likely under conditions where skill range is greater than 0.747. And the majority of cases in the data are located past this threshold in the national skill range distribution, clustered particularly in the top right corner of Figure 1, which is the same area as the deep back corner of Figure 2. Thus, when skill range is wide, not only is the percentage of accommodated students with disabilities high (since, as we contend, more of such students would legitimately need test accommodations), but the use of test accommodations is also associated with lower proficiency gains, not higher. Taken together, these two reasons make the use of test accommodations as a gaming strategy less likely in states where skill range is greater than 0.747, even if such dynamics may indeed occur to a degree.
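The 0.747 threshold follows directly from the two Model 4 coefficients reported above: the slope of proficiency gain with respect to accommodations is the main effect plus the interaction coefficient times skill range, and it is zero where skill range equals 0.566 / 0.758. The short sketch below illustrates this arithmetic. The function names are ours, and the skill range values in the loop are arbitrary illustrative points rather than observations from the data.

```python
# Model 4 coefficients as reported in the text (Table 2).
ACCOMMODATION_MAIN = 0.566       # zeta: main effect of test accommodations
ACCOMMODATION_X_SKILL = -0.758   # eta: interaction of accommodations with skill range


def accommodation_slope(skill_range: float) -> float:
    """Marginal effect of accommodations on proficiency gain at a given skill range."""
    return ACCOMMODATION_MAIN + ACCOMMODATION_X_SKILL * skill_range


def zero_effect_skill_range() -> float:
    """Skill range at which the accommodation slope equals zero."""
    return -ACCOMMODATION_MAIN / ACCOMMODATION_X_SKILL


threshold = zero_effect_skill_range()
print(round(threshold, 3))  # 0.747

# Arbitrary illustrative skill range values on either side of the threshold.
for s in (0.3, 0.5, 0.9):
    print(s, round(accommodation_slope(s), 3))
```

Below the threshold the slope is positive, and above it the slope is negative, which is the sign change described in the text.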
The same conjectural logic suggests that chances of gaming through test accommodations are greater when skill range is below 0.747, because an increase in the percentage of students with disabilities who are given accommodations increases the statewide proficiency gain. This view is especially pertinent in situations where a markedly lower skill range is combined with a significantly common use of test accommodations, such as in the top left corner of Figure 1, which is the same area as the elevated front area of Figure 2.
While our approach makes it easier to interpret potential dynamics at the tails of the national skill range distribution (e.g., top right and top left corners of Figure 1), it is not as effective for interpreting the remaining cases. We therefore change our interpretive approach below and propose a more fine-grained grouping of all cases in the data. We then offer competing heuristic explanations pertaining to various regions across the entire space in Figures 1 and 2, which could be tested in future research.
HEURISTIC ACCOUNTS OF COMPETING EXPLANATIONS FOR ACCOMMODATION EFFECTS ON PROFICIENCY
How do states vary in terms of skill range/test accommodation combinations and related proficiency gains? What are potential explanations for why (or why not) accommodations may be used as a gaming strategy under those combinatorial conditions? To address these questions, we specify regions based on two dimensions instead of one. The first dimension is skill range as before, with a threshold of 0.747 on that dimension, where the nature of the accommodation effect on proficiency gain changes. The second dimension is the percentage of students with disabilities who are given test accommodations, shown by the dotted line in Figure 2. We use 50% as a cutoff on this dimension, for two reasons: (a) it is the theoretical midpoint, given the zero-to-one range, and (b) it is nearly the same as the mean (0.546) and the median (0.565), and using the mean or the median results in negligibly small changes in how cases are clustered.
The regions are depicted by the ellipses in Figure 2. For each one, we offer both a malevolent (gaming) interpretation and a competing benevolent (no gaming) interpretation, and specify when we favor one interpretation over the other. Regions 1 and 2, both of which are beyond the 0.747 skill range threshold where test accommodations reduce proficiency gain, are respectively labeled harmful (i.e., self-defeating) gaming and tempered gaming. The harmful gaming region includes 59 cases (59%, given n = 100) involving wide skill range and high accommodations (the top right corner in Figure 1), which, as discussed above, we do not consider a gaming location. Since test accommodations reduce proficiency gain in this region and could result in proficiency loss, if such accommodations are indeed used as a gaming strategy, then actors are implementing such a strategy at their own state's peril. For instance, given the wide skill range, various districts and schools in a state may use test accommodations to inflate their local proficiency gains, but this may backfire if students with disabilities are unable to perform highly on the regular assessment, even with accommodations. A wide skill range implies a more demanding test, and it is possible, for at least some actors, to overconfidently assume that the overuse of test accommodations can give a needed boost to proficiency results. This would be a classic case of individual actors pursuing their own (or local) interests at the expense of their collective interest (Collins, 1982).10 The overconfident pursuit of self-interest may involve another potential twist in this region. It is possible for a wide skill range to function as an incentive for using test accommodations as a gaming strategy even when students with disabilities are not given the required instructional accommodations in the classroom. This would be another case of overconfidence involving the assumption that test accommodations can be a substitute for adequate instruction in the classroom under conditions of wide skill range.
The benevolent interpretation for the harmful gaming region is more straightforward. Schools and districts located in the states in this region are likely to pursue rigorous learning goals, helping their eighth graders acquire a wide skill range, and are at the same time very inclusive regarding students with disabilities, even at their own peril in terms of how their proficiency gains compare to gains in other states. Nagle, Yunker, and Malmgren (2006) characterize this strongly inclusive posture as the cultural belief that the only way to challenge negative preconceptions about the abilities of students with disabilities is to hold them to high standards and provide them with opportunities to demonstrate their abilities (p. 37). This is the interpretation we favor.
Region 2, tempered gaming, includes 13 cases. We use the term tempered because, while the cases here are on the right-hand side of the 0.747 threshold on the skill range dimension in Figure 1, they are below the 0.50 mark on the accommodations dimension. As in Region 1, because test accommodations here reduce the proficiency gain and could even result in proficiency loss (since skill range is greater than 0.747), if such accommodations are indeed used as a gaming strategy, this occurs at the peril of the state's overall proficiency gain. Accordingly, a similar set of malevolent interpretations applies if one were to read gaming into this region. But a key rationale for the label tempered gaming, as opposed to harmful gaming, is that states in this region do not provide test accommodations for as high a percentage of their students with disabilities as in Region 1. This could be either because such students are more commonly tested without accommodations or because they are assigned to alternate assessments. We therefore profiled each region's use of these other two assessment types, as shown in Figure 3.
Figure 3. Mean percentages for alternate assessment and regular assessment without accommodations by gaming regions shown in Figure 2
While the average alternate assessment percentages are similarly low for Regions 1 and 2, a stark difference exists in terms of the mean percentage for regular assessments without accommodations. The mean percentage is nearly double in the tempered gaming region (0.506, the highest in Figure 3), suggesting that here there is less room to use accommodations for gaming in the first place, which is yet another basis for the label tempered gaming. But the high percentage of students with disabilities who are tested without accommodations in this region also supports a positive viewpoint, because it implies a greater prevalence of the type of inclusive culture we noted in the benevolent interpretation for the harmful gaming region. Specifically, it indicates that students with disabilities in the tempered gaming region enjoy greater access to the regular curriculum as well as to regular instruction, evidenced by a high proportion of them being tested without accommodations. If such a culture does exist in the region, the associated values and norms are also likely to make the use of test accommodations as a gaming strategy less likely, though the possibility of such a strategy is not entirely eliminated. Incidentally, one could argue that an educational culture that emphasizes inclusion to a degree that sharply reduces the use of test accommodations could also inadvertently hinder test outcomes for students with disabilities, since the test accommodations are intended to level the playing field. However, if this is true, it further strengthens the no gaming interpretation for this region. Ultimately, this benevolent view is what we favor for the tempered gaming region.
Region 3 is labeled useful gaming and includes 13 cases, for which the skill range is between 0.747 and 0.500, and test accommodations are above 50%. While the region is shown in Figure 2, the cases involved can also be observed in Figure 1. Here, nearly as many students with disabilities are given test accommodations as in Region 1, but because the skill range is narrower, accommodations increase, not decrease, the proficiency gain. Thus, if accommodations are used as a gaming strategy here, then such a strategy does work in terms of proficiency outcomes. Hence the label useful gaming. From this malevolent viewpoint, the skill range in this region is low enough relative to Regions 1 and 2 to preclude the problem of overconfidence from which potential gamers in Regions 1 and 2 may suffer. In other words, gaming may be a safer bet under conditions of moderate skill range (0.500–0.747).
The benevolent interpretation for the useful gaming region is that states here happen to teach somewhat fewer math skills to their average eighth grader and also happen to be highly inclusive when it comes to their students with disabilities. Schools and districts in such states may be accommodating many of these students both on the regular state assessment and in classroom instruction, thus effectively leveling the playing field. However, we are unable to specify whether we favor the malevolent or the benevolent perspective for Region 3.
Region 4 is labeled limited gaming, which includes 10 cases. The skill range here is between 0.747 and 0.500 as in Region 3, but test accommodations are below 50%, going as low as about 8% (see Figure 1). Therefore, although test accommodations can be used as a gaming strategy with positive payoffs in terms of proficiency gains (given the moderate skill range), this is likely to transpire to a much lower degree here since a more limited percentage of students with disabilities are given test accommodations relative to those in Region 3. Hence the label limited gaming. However, we favor the benevolent viewpoint for this region. This is because, as seen in Figure 3, 43% of students with disabilities are on average given the regular state assessment without accommodations, indicating an inclusive educational culture, as in Region 2. It is likely that the associated values and norms curtail, though not entirely eliminate, the risk of test accommodations being used as a gaming strategy.
However, it is important to note that on average, 26% of students in the limited gaming region are given alternate state assessments and are thus excluded from regular assessments and curricula (highest in Figure 3). Therefore, noticeable exclusionary tendencies and inclusionary practices may coexist in the limited gaming region. The U.S. Department of Education capped proficiency outcomes from alternate assessments that can be used for AYP purposes at a maximum of 3% of the total student population tested. As a result, alternate testing is not as attractive a means of gaming as test accommodations are in terms of proficiency returns. However, the cap applies to entire school systems and subsequently to states, not to individual schools, such that some school campuses can exceed the 3% cap as long as their school system overall does not exceed it (U.S. Department of Education, 2005b). Therefore, it is possible for school staff to use alternate testing as a gaming strategy to inflate their building-level proficiency, even though much of this would feed into neither the outcome for the district nor the state's overall proficiency gain, which is the primary focus for this study. Nonetheless, the fact that as much as 26% of students in the limited gaming region are given the alternate state assessment raises the question of whether schools in this region at times inappropriately assign some of their more capable students with disabilities to alternate assessments in order to increase their building-level proficiency outcomes. In a recent study of one state, Cho and Kingston (2011) found that more than 90% of all students who took the alternate assessment in math scored proficient or above.
Region 5, the final region, is labeled super gaming and includes only five cases. As our earlier discussion of these cases suggests, we favor the malevolent interpretation for this region. The skill range involved is markedly low/narrow compared to other cases, but the percentage of students with disabilities given test accommodations is nearly as high as in Region 1. The mean percentage tested without accommodations is also as high as in Region 1, whereas the percentage in alternate assessments is about twice as high, as seen in Figure 3. The central difference between the two regions is that, in Region 5, growth in test accommodations results in significant increases in proficiency gain; hence the label super gaming. If one were to adopt a benevolent viewpoint for Region 5, it would suggest that schools and districts here simply teach less, but may at the same time have a somewhat inclusive culture manifested by their high test accommodation percentages. We consider this a less than plausible but not an impossible combination.
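The five-region grouping described above can be summarized as a simple classification rule. The sketch below is illustrative only and rests on assumptions that the text does not state explicitly: Region 5 is taken to be the area where skill range falls below 0.500 and accommodations exceed 50%, values exactly at a cutoff are assigned as shown in the comments, and the function returns None for the combination of low skill range and low accommodations, for which no region is named. The function and constant names are ours.

```python
from typing import Optional

SKILL_ZERO_EFFECT = 0.747     # skill range at which the accommodation effect is zero
SKILL_MODERATE_FLOOR = 0.500  # lower bound of the moderate skill range band
ACCOMMODATION_CUTOFF = 0.50   # share of students with disabilities given accommodations

REGION_LABELS = {
    1: "harmful gaming",
    2: "tempered gaming",
    3: "useful gaming",
    4: "limited gaming",
    5: "super gaming",
}


def gaming_region(skill_range: float, accommodated: float) -> Optional[int]:
    """Assign a state-by-year case to one of the five heuristic regions.

    Boundary handling is an assumption: a skill range of exactly 0.747 is
    treated as moderate, and an accommodation share of exactly 0.50 as low.
    """
    high_accommodation = accommodated > ACCOMMODATION_CUTOFF
    if skill_range > SKILL_ZERO_EFFECT:
        return 1 if high_accommodation else 2
    if skill_range >= SKILL_MODERATE_FLOOR:
        return 3 if high_accommodation else 4
    # Markedly narrow skill range: only the high-accommodation case is named.
    return 5 if high_accommodation else None


# Made-up cases for illustration only.
for s, a in [(0.90, 0.80), (0.85, 0.30), (0.60, 0.70), (0.60, 0.10), (0.30, 0.75)]:
    region = gaming_region(s, a)
    print(s, a, region, REGION_LABELS[region])
```

A rule of this kind makes the two cutoffs explicit, so the sensitivity of the grouping to using the mean (0.546) or the median (0.565) in place of 0.50 could be checked by changing a single constant.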
When a performance measure, such as test scores in schools, is treated as the yardstick for legitimacy, it may result in goal displacement whereby the measure becomes an end in itself rather than the means to fundamental goals, such as academic learning and growth. This, as Campbell (1975) points out, increases the risk of distortion and corruption in processes that affect the performance measure. Such concerns have been at the core of the criticisms against proficiency-based high-stakes accountability. In this study, we focus on test accommodations for students with disabilities as a potential gaming strategy to inflate proficiency gains. Test accommodations have been controversial because of the risk for misuse. Moreover, teachers tend to over-identify students with disabilities for test accommodations, often based on their subjective judgment. Despite these potential issues and increased interest in gaming practices, direct evidence of test accommodations being used as a gaming strategy has been virtually nonexistent. In addition, findings on whether test accommodations affect proficiency outcomes positively or negatively remain persistently mixed. This is a critical problem because the gaming argument presumes that there should be a positive relationship between test accommodations and proficiency gains.
A central issue in research on the use and consequences of test accommodations is that studies often neglect contextual variables. We take an important step in this regard and propose skill range as an objective contextual attribute with implications for both the use and proficiency outcomes of test accommodations. Of course, this is one among many possible contextual attributes, one that we believe is fundamental. A wider skill range implies not just a broader, but also a more complex skillset and knowledge base. We contend that under such conditions, a greater share of students with disabilities is likely to legitimately require test accommodations, given the breadth and complexity of the skillset involved. Conversely, a narrower skill range suggests more limited use of test accommodations. Gaming is, therefore, more likely when a high percentage of students with disabilities are given test accommodations under conditions of narrow skill range. That is, if a state uses test accommodations even when its average student acquires a relatively limited skillset, then the state may be gaming the accountability system. Our findings provide support for this perspective at the state level. We also find that, when skill range is high, test accommodations tend to reduce proficiency gains, making the use of test accommodations as a gaming strategy a less likely option even when a high percentage of students with disabilities are given test accommodations.
LIMITATIONS OF THE STUDY AND ASSOCIATED DIRECTIONS FOR SCHOLARSHIP
The central limitation of our study is the lack of school- and district-level data, which would help examine within-state dynamics in addition to between-state patterns. Notions of skill range and gaming through test accommodations apply to and can thus be studied at any analytical level, including the classroom. Moreover, as we demonstrate, there are multiple ways to interpret the interaction of skill range and use of test accommodations, as well as the outcomes of the latter, in terms of proficiency gains. As implausible as some potential interpretations are from our view, lack of school- and district-level data limits our ability to test them empirically. Some individual schools and districts, for instance, may use test accommodations more than others in a state with a wide average skill range, resulting in proficiency losses. Thus, measures of learning and skill acquisition at the school and district levels are critical. Measures of proficiency gain at those levels are also critical to examine how schools and districts vary around their state averages in terms of proficiency gains, revealing insights on how various factors may drive state averages.
Also, we have no objective measures of the inclusiveness of educational culture at different analytical levels with regard to students with disabilities. Such a measure is important because greater inclusiveness is likely to curtail gaming tendencies. In our analysis, we treated the percentage of students with disabilities taking the regular state assessment without accommodations as a structural proxy measure for inclusiveness of the educational culture involved. A more common use of the non-accommodation option implies greater access of students with disabilities to the regular curriculum. This, in turn, implies a lower risk for test accommodations to be used for gaming purposes. But direct measures of inclusiveness at the school and district levels would be more effective for empirical analyses.
Finally, information on a broader range of years would strengthen our analysis. Available assessment type data from the IDEA Data Center go only as far back as 2007, and our skill range measure does not go beyond 2009. A broader range of years for both measures, as well as for proficiency gains, can help examine more complex dynamics. For instance, it is important to test how changes in skill range are related to changes in the use of test accommodations and in proficiency for a more complete picture of the potentially broader dynamics involved. While we were able to examine changes in proficiency, we used skill range and test accommodations as static measures due to data limitations. Better longitudinal data for multiple levels of analysis, along with more rigorous measures of inclusiveness, can significantly improve research on test accommodations as a potential gaming strategy.
IMPLICATIONS FOR AUTHENTIC PRACTICE
Test accommodations are most commonly used for students with mild disabilities (recall that, for more significant disabilities, alternate assessments are often used). Since identification and diagnosis of mild disabilities rely considerably on subjective judgment (Skiba et al., 2008), so do decisions to give such students test accommodations. Thus, practices to mitigate educator tendencies to use test accommodations as a gaming strategy are likely to be more effective if they are not limited to rules and regulations, but include rigorous normative controls fostering authenticity and equity in subjective judgment. In this respect, cultivation of a genuinely inclusive culture regarding students with disabilities is crucial, both at administrative and instructional levels. Such a culture is bereft of gaming tendencies. It accommodates meaningful participation of students with disabilities in the general education curriculum and in the accountability system in ways that serve their best interests. For instance, one of the most outlying cases in the top left corner of Figure 1, which is also the super gaming region (Region 5) in Figure 2, is Georgia in 2009. Georgia is also the state in which the biggest cheating scandal recently broke out, where educators were caught changing student answers on high-stakes exam sheets (Copeland, 2013). While this can be a pure coincidence, it also indicates the possibility that the use of test accommodations as a gaming strategy may be a component of a broader gaming problem, which can be mitigated by normative interventions, including better teacher and administrator preparation, parental participation and advocacy, and better administrative and community oversight.
One area in which clear rules and regulations can be particularly effective is the design of the accommodation policy itself. Specifically, procedures can be put in place that ensure that if a student is given test accommodations, then he or she is also given commensurate instructional accommodations in the classroom. As we noted earlier, it is possible for educators to view test accommodations as a means for improving proficiency gains without investment in corresponding instructional accommodations in the classroom. This involves the assumption that test accommodations can be a substitute for effective learning. Therefore, rules can be established to require and monitor the proper joint use of test and instructional accommodations.
A key priority should be to foster and implement both normative and regulative measures in potentially low-performing schools, particularly in urban and rural areas, as these schools are more likely to experience not just proficiency problems, but also severe resource constraints hindering their capacity to genuinely accomplish mandated proficiency outcomes. Both issues can increase the risk of gaming as a means of organizational survival.
Another policy implication that receives considerable attention in the accommodation effects literature is the potential of assessments built on the principles of Universal Design for Learning (UDL) (ERIC/OSEP, 1998). Applied to learning, universal design means building flexibility into curricular materials and instructional activities so that learning goals are achievable by students with wide differences in ability. Applied to assessment, it means that rather than being retrofitted with accommodations after the fact, tests would be designed from the outset to allow the widest participation possible, while supporting valid inferences about performance for all participants (Thompson, Johnstone, & Thurlow, 2002). Assessments so designed would be accessible to all test takers and thus would minimize the need for accommodations. Here, the goal of universal design is not to eliminate individualization, but rather to enable use of appropriate accommodations when they are needed to reduce threats to validity and comparability of scores (Thompson et al., 2002).
Finally, coordinating the use of universal design in instruction and in assessment for individual students would address another significant problem identified in the accommodations literature: the inconsistent use of appropriate accommodations in teaching and testing, which jeopardizes student achievement by providing distracting, confusing, and/or unneeded accommodations (Helwig & Tindal, 2003; Ketterlin-Geller et al., 2007).
1. In this study, we do not differentiate between single and multiple accommodations. We instead focus on whether any accommodation is provided, for two reasons. First, the vast majority of existing studies typically treat single and multiple accommodations as essentially similar treatments, and none of them raise the issue of whether single and multiple accommodations have significantly different implications and outcomes (see Cormier, Altman, Shyyan, & Thurlow, 2010; Johnstone, Altman, Thurlow, & Thompson, 2006; Thompson, Blount, & Thurlow, 2002; Zenisky & Sireci, 2007). Second, the distinction between single and multiple accommodations is not addressed in available state- and national-level data.
2. The District of Columbia, Puerto Rico, Virgin Islands, and other U.S. territories are excluded from the analysis.
4. The original name of the state/NAEP comparability score in the mapping studies is "relative error in mapping," smaller values of which indicate greater comparability. We reversed this measure to facilitate interpretation.
5. Since the construction of the state/NAEP comparability score starts with the state's own proficiency cut score on its own assessment, state/NAEP comparability is robust to changes in cut scores and tests. But there are other factors to consider for effects on the comparability measure. First, students may be motivated to succeed more on the state assessment than on the NAEP, a low-stakes test. Not only is this common across states and over time, and largely inconsequential in analyzing variance, but the problem is also less pronounced for fourth graders than for eighth and twelfth graders who also take the NAEP, since older cohorts can differentiate the consequences of high- and low-stakes tests more strategically than younger ones. Second, NAEP and state assessments are not administered at exactly the same time. This also poses a limited problem because many states conduct their assessment in the spring, close to when the NAEP is administered. Third, NAEP and state assessments differ in item design, which can reduce comparability. NAEP includes a mix of multiple-choice and open-response items, while state tests are increasingly composed almost entirely of multiple-choice items, particularly for mathematics. Yet, since this difference is increasingly common across states, the resulting noise is likely to be common too. There is also evidence that effects of test design differences on students may have been limited in the NCLB era compared to those of skill range differences (Jacob, 2005, 2007; Wei, Shen, Lukoff, Ho, & Haertel, 2006). Finally, NAEP and state assessments can be subject to different exclusions and alternate testing rules. For example, state/NAEP comparability can be adversely affected if the state uses exclusions and alternate assessments for a greater share of disabled students than NAEP does. However, under NCLB, schools must test no less than 95% of all students in order to make adequate yearly progress.
NCLB also requires that, at the state level, no more than 3% of students who are counted as proficient on the state assessment are those who were proficient on alternate assessments (U.S. Department of Education, 2005a). Therefore, the potential differences in exclusion and alternate assessment rules would have at best a limited effect on the quality of the state/NAEP comparability measure.
6. Minnesota Population Center, University of Minnesota, Minneapolis, MN (usa.ipums.org/usa).
7. National Center for Education Statistics, Department of Education, Washington, DC (nces.ed.gov/ccd).
8. It excludes federal contributions such as Title I funds to level the playing field. This is because not only do federal contributions fall short in leveling the playing field (they are limited to about 10% of total per-pupil revenue in a state), but including them in considerations of state-level per-pupil revenue also confounds the degree of underlying differences in state-specific demand for educational quality and outcomes. It is important to note that we replicated the entire empirical analysis with a per-pupil revenue measure that included federal contributions, which did not result in different findings. We retained our original approach and excluded federal contributions, because this is the conceptually more plausible alternative.
9. More complex models (for instance, those involving random slopes and time interactions) were infeasible given our data limitations (only two years of observations for 50 states).
10. See chapter one in particular, on "Non-rational foundations of rationality."
Altman, J. R., Lazarus, S. S., Thurlow, M. L., Quenemoen, R. E., Cuthbert, M., & Cormier, D. C. (2008). 2007 survey of states: Activities, changes, and challenges for special education. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Baker, B. D., & Duncombe, W. (2004). Balancing district needs and student needs: The role of economies of scale adjustments and pupil need weights in school finance formulas. Journal of Education Finance, 29, 97–124.
Bandeira de Mello, V. (2011). Mapping state proficiency standards onto the NAEP scales: Variation and change in state standards for reading and mathematics, 2005–2009 (NCES 2011-458). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
Bandeira de Mello, V., Blankenship, C., & McLaughlin, D. (2009). Mapping state proficiency standards onto NAEP scales: 2005–2007 (NCES 2010-456). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
Berliner, D. C. (2009). MCLB (Much Curriculum Left Behind): A U.S. calamity in the making. The Educational Forum, 73, 284–296.
Bolt, S. E., & Thurlow, M. L. (2004). Five of the most frequently allowed testing accommodations in state policy: Synthesis of research. Remedial and Special Education, 25, 141–152.
Booher-Jennings, J. (2005). Below the bubble: "Educational triage" and the Texas accountability system. American Educational Research Journal, 42, 231–268.
Bryk, A. S., & Schneider, B. (2003). Trust in schools: A core resource for school reform. Educational Leadership, 60, 40–45.
Campbell, D. (1975). Assessing the impact of planned social change. In G. Lyons (Ed.), Social research and public policies: The Dartmouth/OECD Conference (pp. 3–45). Hanover, NH: Public Affairs Center, Dartmouth College.
Card, D., & Payne, A. A. (2002). School finance reform, the distribution of school spending, and the distribution of student test scores. Journal of Public Economics, 83, 49–82.
Cho, H., & Kingston, N. (2011). Capturing implicit policy from NCLB test type assignments of students with disabilities. Exceptional Children, 78, 58–72.
Cizek, G. J. (2005). High-stakes testing: Contexts, characteristics, critiques, and consequences. In R. P. Phelps (Ed.), Defending standardized testing (pp. 23–54). Mahwah, NJ: Lawrence Erlbaum Associates.
Coleman, J. S. (1988). Social capital in the creation of human capital. American Journal of Sociology, 94, S95–S120.
Collins, R. (1982). Sociological insight: An introduction to nonobvious sociology. New York, NY: Oxford University Press.
Copeland, L. (2013, April 14). School cheating scandal shakes up Atlanta. USA Today.
Cormier, D. C., Altman, J. R., Shyyan, V., & Thurlow, M. L. (2010). A summary of the research on the effects of test accommodations: 2007–2008 (Technical Report 56). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Cronin, J., Dahlin, M., Adkins, D., & Kingsbury, G. G. (2007). The proficiency illusion. Washington, DC: Thomas B. Fordham Institute.
Cullen, J., & Reback, R. (2006). Tinkering toward accolades: School gaming under a performance accountability system. Improving School Accountability: Check-ups or choice. Advances in Applied Economics, 14, 1–34.
Darling-Hammond, L. (2007). Race, inequality and educational accountability: The irony of No Child Left Behind. Race Ethnicity and Education, 10, 245–260.
Dee, T. S., & Jacob, B. A. (2010). The impact of No Child Left Behind on student achievement. Journal of Policy Analysis and Management, 30, 418–464.
Deere, D., & Strayer, W. (2001). Putting schools to the test: School accountability, incentives, and behavior. College Station, TX: Private Enterprise Research Center, Texas A&M University.
DiMaggio, P. J., & Powell, W. W. (1983). The iron cage revisited: Institutional isomorphism and collective rationality in organizational fields. American Sociological Review, 48, 147–160.
Duncombe, W., & Yinger, J. (2008). Measurement of cost differentials. In H. Ladd & E. B. Fiske (Eds.), Handbook of research in education finance and policy (pp. 239–256). New York, NY: Routledge.
Elazar, D. J. (1984). American federalism. Third edition. New York, NY: Harper & Row.
Elledge, A., Le Floch, K. C., Taylor, J., & Anderson, L. (2009). State and local implementation of the No Child Left Behind Act. Volume V: Implementation of the 1 percent rule and 2 percent interim policy options. Washington, DC: U.S. Department of Education.
Elliott, S. N., & Marquart, A. M. (2004). Extended time as a testing accommodation: Its effects and perceived consequences. Exceptional Children, 70(3), 349–367.
Elliott, S. N., McKevitt, B. C., & Kettler, R. J. (2002). Testing accommodations research and decision making: The case of "good" scores being highly valued but difficult to achieve for all students. Measurement and Evaluation in Counseling and Development, 35, 153–166.
ERIC/OSEP (Educational Resources and Information Clearinghouse & Office of Special Education Programs). (1998). Topical report. Washington, DC: Author.
Figlio, D. N. (2006). Testing crime and punishment. Journal of Public Economics, 90, 837–851.
Figlio, D. N., & Getzler, L. S. (2006). Accountability, ability and disability: Gaming the system. Improving School Accountability: Check-ups or choice. Advances in Applied Economics, 14, 35–49.
Firestone, W. A., Schoor, R. Y., & Monfils, L. F. (2004). The ambiguity of teaching to the test: Standards, assessment, and educational reform. Mahwah, NJ: L. Erlbaum Associates Publishers.
Fletcher, J. M., Francis, D. J., Boudousquie, A., Copeland, K., Young, V., Kalinowski, S., & Vaughn, S. (2006). Effects of accommodations on high-stakes testing for students with reading disabilities. Exceptional Children, 72, 136–150.
Fuchs, L. S., Fuchs, D., Eaton, S. B., Hamlett, C. L., & Karns, K. M. (2000). Supplementing teacher judgments of mathematics test accommodations with objective data sources. School Psychology Review, 29, 65–85.
Fuchs, L. S., Fuchs, D., Hamlett, C., Eaton, S. B., Binkley, E., & Crouch, R. (2000). Using objective data sources to enhance teacher judgments about test accommodations. Exceptional Children, 67, 67–81.
Gramlich, E. M., & Rubinfeld, D. L. (1982). Micro estimates of public spending demand functions and tests of the Tiebout and median voter hypotheses. The Journal of Political Economy, 90, 536–560.
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Retrieved from http://epaa.asu.edu/epaa/v8n41/index.html
Hanushek, E. A., Kain, J. F., & Rivkin, S. G. (2009). New evidence about Brown v. Board of Education: The complex effects of school racial composition on achievement. Journal of Labor Economics, 27, 349–383.
Helwig, R., & Tindal, G. (2003). An experimental analysis of accommodation decisions on large-scale mathematics tests. Exceptional Children, 69, 211–225.
Henig, J. R. (2013). The end of exceptionalism in American education: The changing politics of school reform. Cambridge, MA: Harvard Education Publishing Group.
Hollenbeck, K., Tindal, G., & Almond, P. (1998). Teachers' knowledge of accommodations as a validity issue in high-stakes testing. The Journal of Special Education, 32, 175–183.
Hombo, C. (2003). NAEP and No Child Left Behind: Technical challenges and practical solutions. Theory into Practice, 42, 59–65.
Hursh, D. (2007). Exacerbating inequality: The failed promise of the No Child Left Behind Act. Race Ethnicity and Education, 10, 295–308.
Jacob, B. A. (2001). The impact of high-stakes testing on student achievement: Evidence from Chicago. Unpublished Manuscript. Cambridge, MA: John F. Kennedy School of Government, Harvard University.
Jacob, B. A. (2005). Accountability, incentives and behavior: The impact of high-stakes testing in the Chicago public schools. Journal of Public Economics, 89, 761–796.
Jacob, B. A. (2007). Test-based accountability and student achievement: An investigation of differential performance on NAEP and state assessments (NBER Working Paper No. 12817). Cambridge, MA: National Bureau of Economic Research.
Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. Quarterly Journal of Economics, 118, 843–877.
Jennings, J., & Sohn, H. (2014). Measure for measure: How proficiency-based accountability systems affect inequality in academic achievement. Sociology of Education, 87, 125–141.
Johnstone, C. J., Altman, J., Thurlow, M. L., & Thompson, S. J. (2006). A summary of research on the effects of test accommodations: 2002 through 2004 (Technical Report 45). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Kearns, J. F., Towles-Reeves, E., Kleinert, H. L., Kleinert, J. O., & Thomas, M. K. (2009). Characteristics of and implications for students participating in alternate assessments based on alternate academic achievement standards. The Journal of Special Education, 45, 3–14.
Ketterlin-Geller, L. R., Alonzo, J., Braun-Monegan, J., & Tindal, G. (2007). Remedial and Special Education, 28, 194–206.
Kleinert, H., & Thurlow, M. (2001). An introduction to alternate assessment. In H. Kleinert & J. Kearns (Eds.), Alternate assessment: Measuring outcomes and supports for students with disabilities (pp. 1–15). Baltimore, MD: Paul H. Brookes.
Kolstad, A. (2002). Design goals: NAEP 2002 and beyond. Washington, DC: National Center for Education Statistics, U.S. Department of Education.
Koretz, D. (2008). Measuring up: What educational testing really tells us. Cambridge, MA: Harvard University Press.
Kubiszyn, T., & Borich, G. (2007). Educational testing and measurement: Classroom application and practice. New York, NY: John Wiley & Sons.
Lankford, H., Loeb, S., & Wyckoff, J. (2002). Teacher sorting and the plight of urban schools: A descriptive analysis. Educational Evaluation and Policy Analysis, 24, 37–62.
Lazarus, S. S., Thurlow, M. L., Lail, K. E., & Christensen, L. (2009). A longitudinal analysis of state accommodations policies: Twelve years of change, 1993–2005. The Journal of Special Education, 43, 67–80.
Linn, R. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32, 3–13.
McDonnell, L. M. (2004). Politics of persuasion and educational testing. Cambridge, MA: Harvard University Press.
McDonnell, L. M., McLaughlin, M. J., & Morison, P. (1997). Educating one and all: Students with disabilities and standards-based reform. Washington, DC: National Academy Press.
McGill-Franzen, A., & Allington, R. L. (1993). Flunk 'em or get them classified: The contamination of primary grade accountability data. Educational Researcher, 22, 19–22.
McLanahan, S. S., & Sandefur, G. (1994). Growing up with a single parent: What hurts and what helps. Cambridge, MA: Harvard University Press.
Nagle, K., Yunker, C., & Malmgren, K. W. (2006). Students with disabilities and accountability reform challenges identified at the state and local levels. Journal of Disability Policy Studies, 17, 28–39.
National Assessment Governing Board. (2002). Using the National Assessment of Educational Progress to confirm state test results. Washington, DC: National Assessment Governing Board.
Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high-stakes testing corrupts America's schools. Cambridge, MA: Harvard Education Press.
No Child Left Behind Act of 2001, 20 U.S.C. 6301 et seq.
Noguera, P. A. (2003). City schools and the American dream: Reclaiming the promise of public education. New York, NY: Teachers College Press.
Nye, B., Hedges, L. V., & Konstantopoulos, S. (2000). The effects of small classes on academic achievement: The results of the Tennessee class size experiment. American Educational Research Journal, 37, 123–151.
Orfield, G., & Kornhaber, M. L. (Eds.). (2001). Raising standards or raising barriers: Inequality and high-stakes testing in public education. New York, NY: Century Foundation Press.
Payne, C. M. (2008). So much reform, so little change: The persistence of failure in urban schools. Cambridge, MA: Harvard Education Press.
Peterson, P. E., & Hess, F. M. (2006). Keeping an eye on state standards: A race to the bottom? Education Next, 6, 28–29.
Peterson, P. E., & Hess, F. M. (2008). Keeping an eye on state standards: A race to the bottom? Education Next, 8, 70–73.
Reardon, S. F. (2011). The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In G. J. Duncan & R. J. Murnane (Eds.), Whither opportunity? Rising inequality, schools, and children's life chances (pp. 91–115). New York, NY: Russell Sage Foundation.
Reys, B. J. (2006). The intended mathematics curriculum as represented in state-level curriculum standards: Consensus or confusion? Greenwich, CT: Information Age.
Russell, D. W., Altmaier, E., & Van Velzen, D. (1987). Job-related stress, social support, and burnout among classroom teachers. Journal of Applied Psychology, 72, 269–274.
Schulte, A. A. G., Elliott, S. N., & Kratochwill, T. R. (2000). Educators' perceptions and documentation of testing accommodations for students with disabilities. Special Services in the Schools, 16, 35–56.
Schwartzbeck, T. D. (2003). Declining counties, declining school enrollments. Arlington, VA: American Association for School Administrators.
Sireci, S. G., Scarpati, S. E., & Li, S. (2005). Test accommodations for students with disabilities: An analysis of the interaction hypothesis. Review of Educational Research, 75, 457–490.
Skiba, R. J., Simmons, A. B., Ritter, S., Gibb, A. C., Rausch, M. K., Cuadrado, J., & Chung, C. G. (2008). Achieving equity in special education: History, status, and current challenges. Exceptional Children, 74, 264–288.
Thompson, S. J., Blount, A., & Thurlow, M. L. (2002). A summary of research on the effects of test accommodations: 1999 through 2001 (Technical Report 34). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large scale assessments (Synthesis Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Thurlow, M. L. (2000). Standards-based reform and students with disabilities: Reflections on a decade of change. Focus on Exceptional Children, 33, 1–16.
Thurlow, M. L. (2002). Positive educational results for all students: The promise of standards-based reform. Remedial and Special Education, 23, 195–202.
Tolbert, P. S., & Zucker, L. G. (1983). Institutional sources of change in the formal structure of organizations: The diffusion of civil service reform, 1880–1935. Administrative Science Quarterly, 28, 22–39.
Towles-Reeves, E., Kearns, J., Kleinert, H., & Kleinert, J. (2008). An analysis of the learning characteristics of students taking alternate assessments based on alternate achievement standards. The Journal of Special Education, 42, 241–254.
U.S. Department of Education, Office of Elementary and Secondary Education. (2003). Title I: Improving the academic achievement of the disadvantaged, final regulations. Washington, DC: U.S. Department of Education.
U.S. Department of Education. (2005a). Alternate achievement standards for students with the most significant cognitive disabilities: Non-regulatory guidance. Washington, DC: U.S. Department of Education.
U.S. Department of Education. (2005b). To raise the achievement of students with disabilities, greater flexibility available for states, schools. Washington, DC: U.S. Department of Education.
Wei, X., Shen, X., Lukoff, B., Ho, A. D., & Haertel, E. (2006). Using test content to address trend discrepancies between NAEP and California state tests. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA.
Yell, M. L., & Shriner, J. G. (1997). The IDEA amendments of 1997: Implications for special and general education teachers, administrators, and teacher trainers. Focus on Exceptional Children, 30, 1–20.
Zenisky, A. L., & Sireci, S. G. (2007). A summary of the research on the effects of test accommodations: 2005–2006 (Technical Report 47). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.