The Informational Significance of A–F School Accountability Grades
by Curt M. Adams, Patrick B. Forsyth, Jordan Ware & Mwarumba Mwavita - 2016
Background/Context. Despite problems with accountability systems under No Child Left Behind, the policy has been widely commended for exposing the depth and breadth of educational inequality in the United States. As states implement new accountability systems, there is growing concern that attention to achievement gaps and the performance of marginalized children has faded. Many approved accountability plans no longer report achievement by student subgroups or include subgroup performance in the calculation of accountability indicators.
Research Purpose. This study examined the informational significance of Oklahoma’s A–F accountability grades relative to the policy objective of achievement equity. Informational significance as explained in self-determination theory provided a framework to explore the usefulness of an A–F grade for understanding achievement differences within and between schools.
Research Design. We evaluated the informational significance of Oklahoma A–F school grades by analyzing reading and math test scores from over 25,000 students in 81 elementary and middle schools. The study was designed to address two questions: Do students in “A” and “B” schools have high average achievement and small achievement gaps compared to students in “D” and “F” schools? What is the difference in average achievement and achievement gaps between school grades when holding constant contextual school conditions?
Results. We found test score gaps attributed to Free and Reduced Lunch qualification and minority status. Free and Reduced Lunch and minority students average about one standard deviation lower in math and reading than their peers. Test score gaps varied across A–F school grades with the largest gaps existing in “A” and “B” schools. HLM results showed that A–F grades do not differentiate schools by effectiveness levels. For reading, we did not find statistically significant main effects attributed to letter grades. For math, the only statistically significant difference was between students in “A” and “B” schools and students in “F” schools. This difference had a small effect size. School grades did moderate achievement gaps, but gaps moved in a direction opposite from what would be desired of an accountability system that measured achievement equity.
Conclusions. Progress made under NCLB in exposing achievement inequity in the U.S. has taken a step back with Oklahoma’s A–F school grades. Our evidence suggests that a composite letter grade provides very little meaningful information about achievement differences.
Despite problems with accountability systems under No Child Left Behind (NCLB), the policy has been widely commended for exposing the depth and breadth of educational inequality in the United States. Achievement equity remains a target of accountability systems approved for NCLB waivers. As of February 2014, 42 states, Washington, DC, and eight school districts in California were operating under approved NCLB flexibility waivers. At a minimum, approved accountability systems need to set ambitious Annual Measurable Objectives (AMO) in reading/language arts and math, to recognize reward schools and identify low performing schools as priority and focus schools, to measure student growth, to support efforts to close achievement gaps, and to support priority and reward schools in building capacity and improving performance (U.S. Department of Education, 2012).
As states implement new accountability systems, there is growing concern that attention to achievement gaps and the performance of marginalized children has faded (Ayers, 2011; Hall, 2013). Many approved accountability plans no longer report achievement by student subgroups or include subgroup performance in the calculation of accountability indicators (CEP, 2012; Domaleski & Perie, 2013; Polikoff, McEachin, Wrabel, & Duque, 2014). Some states have opted to combine historically marginalized students into a super subgroup, to use growth of the bottom 25% of students in a school to satisfy the achievement gap reporting requirement, and to evaluate school performance with a composite indicator (Ayers, 2011; McNeill, 2012). These changes have the potential to produce a performance indicator that effectively hides achievement disparities within and between schools (Hall, 2013).
Initial evidence shows that concerns raised from document analysis of waiver applications are legitimate. Ushomirsky, Williams, and Hall (2014) found that schools in Florida, Kentucky, and Minnesota earned high effectiveness rankings despite low test performance from minority and low-income students. In some cases, minority and low-income students in lower ranked schools outperformed peers in higher ranked schools. We intensify the scrutiny of new accountability systems by examining the informational significance of Oklahoma’s A–F school grades. Oklahoma uses an A–F ranking system as the basis of State recognition and intervention. High letter grades, “A”s and “B”s, lead to rewards and public acclaim, whereas low grades, “D”s and “F”s, impose mandated turnaround plans on schools. Although the A–F grading system was not designed specifically to measure achievement gaps, it is assumed that ranking schools by letter grades can support efforts to equalize achievement distributions.
This study examined the informational significance of letter grades relative to the policy objective of achievement equity. Informational significance as explained in self-determination theory provided a framework to explore the usefulness of an accountability grade for understanding achievement differences within and between schools. This study does not measure the validity of inferences based on a letter grade. Rather, the purpose was to determine if students in “A” and “B” schools had high average achievement and smaller achievement gaps compared to students in “D” and “F” schools, and to explore differences in average achievement when holding constant conditions that schools do not control.
TEST-BASED ACCOUNTABILITY AND ACHIEVEMENT EQUITY
We use a narrow definition of achievement equity in this study, focusing on racial and economic test score gaps that many reform policies target (Fusarelli, 2004; Lee, 2006, 2008). Under NCLB, test-based accountability became the primary policy instrument to redress achievement disparities (Lee, 2008). As it turns out, the complexity of achievement equity exceeded the structure and function of test-based accountability (Lee, 2006; Harris, 2011; Mintrop & Sunderman, 2009). Trend data from the National Assessment of Educational Progress (NAEP) indicate that progress made in the 1970s and 1980s in narrowing achievement gaps stalled from 1990–2004 (Lee, 2006; Mintrop & Sunderman, 2009; Rothstein, Jacobson, & Wilder, 2008) and from 2004–2012 (National Center for Education Statistics, 2013). Additionally, poverty gaps actually increased in two thirds of the states over the last decade (Quality Counts, 2014).
Several explanations exist for persistent test score gaps. Differences in learning opportunities, resource disparities, school capacity, and teaching quality partly explain lower average achievement of poor and minority students (Barton & Coley, 2010). More specific to this study is the quality and use of accountability indicators. Test-based accountability under NCLB used simplistic performance indicators in a targeted way: to reward schools meeting yearly objectives and to sanction schools falling short of annual achievement targets (Harris, 2011; Mintrop & Sunderman, 2009). Persistent achievement gaps during the NCLB era raise questions about the improvement assumptions of test-based accountability (Ryan & Weinstein, 2009).
IMPROVEMENT ASSUMPTIONS AND ACCOUNTABILITY
Test-based accountability assumes that external contingencies (e.g., threats or rewards) have instrumental value in reinforcing desired actions and outcomes (Ryan & Deci, 2002, 2012; Polikoff et al., 2014). The belief is that threats and sanctions can provoke internal change by eliciting the collective will to make instructional systems efficient and effective (Mintrop & Sunderman, 2009). It is assumed that school actors, when confronted with punitive consequences for low achievement, will take action that leads to better outcomes (O’Day, 2002). Accordingly, schools falling short of measurable objectives encounter public scorn and face mandated improvement plans, loss of students through choice options, prescribed reform models, reconstitution, or in some cases closure (Sahlberg, 2008). Test-based accountability relies on accountability indicators to identify underperforming schools so that pressure or sanctions can induce school actors to improve learning opportunities.
Agency and expectancy theories have been used to explain how accountability systems work from outside of schools to bring about change within them (Polikoff et al., 2014; Ryan & Weinstein, 2009). As suggested by these frameworks, clear achievement standards and accurate performance indicators function as an external motivator for goal attainment. Agency theory assumes that accountability information is a mechanism used by principals (i.e., school administrators, community members, legislators, taxpayers, etc.) to ensure school agents (i.e., teachers) deliver student achievement (Polikoff et al., 2014). Expectancy theory assumes that rewards and threats motivate teachers to improve achievement as long as standards are clear and performance information is accurate (Finnigan & Gross, 2007). Through an agency and expectancy lens, the legitimacy and trustworthiness of accountability indicators affect the behavioral response of school members.
In the design of test-based accountability, a body of divergent research findings on external control was either dismissed by policy makers or not known. The weight of the psychological evidence indicates that contingent reinforcement withers under the strain of complex, conceptual tasks like teaching and learning (Ryan & Deci, 2002; Ryan & Weinstein, 2009). Performance information used as an external control is inimical to work that involves professional discretion, adaptation, and cooperation among interdependent groups (Forsyth, Adams, & Hoy, 2011). An alternative theoretical lens used to explain optimal individual and group performance may well explain why achievement equity eluded NCLB and how performance information can play a supportive role in closing achievement gaps. The crucial adjustment is with the locus of causality, switching from external to internal motivators. Self-determination theory informs this adjustment.
SELF-DETERMINATION THEORY APPLIED TO ACCOUNTABILITY
The fundamental assumption of self-determination theory is that individuals are inherently oriented toward growth and goal fulfillment (Ryan & Deci, 2002). Accordingly, the drive and determination to excel are internal, primal states that require nurturing, not coercive control by external mechanisms. External mechanisms, like the use of performance information, fuel motivation and effective behavior by satisfying the innate psychological needs of autonomy, competence, and relatedness (Adams, Forsyth, Dollarhide, Miskell, & Ware, 2015; Ryan & Deci, 2012).
Here autonomy does not mean independence; rather, it is a cognitive belief embodied in the volitional and purposeful action of individuals (Williams, 2002). Competence is feeling effective in one’s task and having confidence in one’s ability to execute actions required to achieve a challenging outcome. Relatedness comes from supportive social connections that foster feelings of belonging and psychological security within a group or organization (Reeve, 2002). Schools bring to life teaching and learning when structures and processes promote interactions that support autonomy, competence, and relatedness (Reeve & Jang, 2006). On the other hand, teaching and learning become uninspired and stale when psychological needs are thwarted by controlling structures (Reeve & Halusic, 2009). Performance indicators can be used to build professional capacity, but to do so the information needs to be used in ways that enhance the desire, creativity, and energy of teachers and students to press for academic excellence (Ryan & Weinstein, 2009).
NCLB was not designed to build professional capacity; its intent was to hold schools accountable for past results. Paradoxically, NCLB added considerable noise, confusion, and dysfunction to many of the low performing, high need schools it promised to reform (Moller, 2008; O’Day, 2002; Sirotnik, 2005). When considered through self-determination theory, the anemic performance of accountability is not surprising (Ryan & Weinstein, 2009). Accountability indices derived from aggregated test scores cover a very thin slice of the performance pie. The complexity of school work, more directly the complexity of teaching and learning, far exceeds the narrow parameters of a composite achievement index compiled from curricular tests administered at one occasion during the school year (Moller, 2008). Low quality accountability information used to trigger sanctions constricts meaningful learning opportunities, hinders innovation and risk-taking, and undermines motivation (Sahlberg, 2008).
Effective accountability systems informed by self-determination theory depend on the functional significance of performance indicators. Functional significance is defined as the meaning and worth individuals place in an object, experience, or event (Ryan & Deci, 2012). The functional significance of accountability indicators can be informational or controlling (Ryan & Weinstein, 2009). Using indicators as an external control has arguably diminished innovation, creativity, and joy of teaching and learning in the very schools that NCLB intended to reform (Darling-Hammond, 2006; Feuer, 2008; Sunderman & Kim, 2005). From a controlling perspective, NCLB accountability systems failed to generate the legitimacy and trustworthiness needed for performance indicators to be used as a tool to reform schools.
Accountability indicators that have informational significance stand a better chance of generating the energy and capacity to narrow persistent achievement gaps (Ryan & Deci, 2012; Ryan & Weinstein, 2009). Informational significance comes from the diagnostic value associated with a performance indicator (Ryan & Weinstein, 2009). Clear and accurate information about learning processes and outcomes is needed to generate knowledge about student performance; this knowledge in turn can drive improvement decisions and actions. It is hard to see how appropriate action can be taken to close achievement gaps without first knowing how achievement varies within schools. If consequential decisions and actions are based on accountability indicators, the indicators should provide enough information to understand differences in student achievement.
Informational significance, as understood in self-determination theory, is the standard by which we evaluate the performance of Oklahoma’s A–F grades. We do not test the construct validity or reliability of letter grades; instead, our concern is with the ability of the grade to yield meaningful and useful information about achievement differences between student groups. Our assessment of informational significance is based on the degree to which school grades reflect achievement gaps within and between schools. This purpose stands in contrast to validity studies that use theory and empirical evidence to evaluate the ability of a measure to yield truthful judgments about the object it purports to measure (Messick, 1995; Miller, 2008). The State may not have intended for A–F letter grades to specifically measure achievement gaps, but the policy objective is to improve achievement equity.
For achievement equity, we evaluate informational significance by what a letter grade reveals about achievement gaps for FRL and minority students. To keep attention focused on high achievement for all students, a composite letter grade must reflect test score differences within and between schools (Linn, 2005, 2008). High grades, “A”s and “B”s, logically suggest strong achievement for all students in all subject areas. Low grades largely suggest lower average achievement and large achievement gaps. If A–F grades reflect subgroup differences, they may have value for equalizing achievement outcomes. If they do not, they fail the test of informational significance.
Our purpose was to assess the informational significance of Oklahoma’s school grades in relation to the policy objective of closing achievement gaps within and between schools. As such, we do not evaluate the validity of letter grades for judgments about achievement equity. The objective was to evaluate the usefulness of Oklahoma’s A–F grades for understanding achievement differences. Grades should be meaningful for determining the average achievement of all students and student subgroups. We asked two questions: First, do students in “A” and “B” schools have high average achievement and small achievement gaps compared to students in “D” and “F” schools? And second, what is the difference in average achievement and achievement gaps between school grades when holding school context constant?
COMPOSITION OF OKLAHOMA’S A–F SCHOOL GRADES
Oklahoma uses a single letter grade as an indicator of Annual Measurable Objectives (AMO), to classify reward, priority, or focus schools, and to rank schools by effectiveness (Oklahoma State Department of Education, 2012). School grades for the 2012–2013 school year were calculated using a formula that converts test scores into categorical data, categorical data back into a continuous index, and a continuous index into a summative letter grade. The final composite grade is derived from two components: (1) student achievement and (2) student growth (Ayers, 2011; OCEP & CERE, 2012; Oklahoma State Department of Education, 2014).
The student achievement component makes up 50% of the school grade. Student test scores from state math, reading, science, and social studies exams are used to calculate a school’s Performance Index (PI). The PI is calculated from a simple binary scale. Students scoring below proficiency for each tested subject are assigned a zero and students who score proficient or above are assigned a one. The total score for all tested subjects is divided by the total number of tests taken to calculate the PI score for a school. This produces a PI score ranging from 0–100. The PI score is then multiplied by .50 for the calculation of the final composite grade (Oklahoma State Department of Education, 2014).
Student growth makes up the other 50% of the composite letter grade. Only math and reading/English exams are used for the growth index. Growth is composed of overall student growth (25%) and growth of the bottom quartile of students in a school (25%). For both components, the growth score is calculated by first counting the students in the school who scored proficient/advanced in both testing periods, who increased a proficiency level in the current testing period, or who showed test score growth above the state average for growth. This number is then divided by the total number of eligible students to arrive at an overall growth index that ranges from 0–100. The overall growth index is then multiplied by .25 for the calculation of the composite school grade. Growth of the bottom quartile is similarly calculated (Oklahoma State Department of Education, 2014).
Up to 10 bonus points are awarded to schools based on attendance rates, advanced course participation, dropout rates, and return rate of parent climate surveys. The PI score, student growth, and bonus points are summed to arrive at an overall Index score that ranges from 0–100. Index scores of 90–100 receive an A, 80–89 a B, 70–79 a C, 60–69 a D, and 59 or below an F (Oklahoma State Department of Education, 2014).
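The weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the State's official calculation: the function names, the multiplication by 100 to reach the stated 0–100 range, the cap at 100, the treatment of any score under 60 as an F, and the example school are all assumptions made for the sketch.

```python
def performance_index(proficient_flags):
    """Percent of tests scored proficient or above (0-100).

    proficient_flags holds one 0/1 entry per test taken, following the
    binary scale in the text. Scaling by 100 is an assumption made so the
    result lands on the stated 0-100 range.
    """
    return 100.0 * sum(proficient_flags) / len(proficient_flags)


def composite_index(pi, overall_growth, bottom_quartile_growth, bonus=0.0):
    """Weighted sum: 50% achievement, 25% + 25% growth, plus bonus points."""
    if not 0.0 <= bonus <= 10.0:
        raise ValueError("bonus points are limited to the range 0-10")
    score = 0.50 * pi + 0.25 * overall_growth + 0.25 * bottom_quartile_growth + bonus
    return min(score, 100.0)  # assumption: the index is capped at 100


def letter_grade(index_score):
    """Map an index score to a letter using the cut points in the text."""
    for cutoff, letter in ((90, "A"), (80, "B"), (70, "C"), (60, "D")):
        if index_score >= cutoff:
            return letter
    return "F"  # assumption: anything under 60 is an F


# Hypothetical school: 6 of 10 tests proficient, growth indices of 70 and 50,
# and 4 bonus points.
pi = performance_index([1, 1, 1, 0, 1, 0, 1, 0, 1, 0])
index = composite_index(pi, overall_growth=70, bottom_quartile_growth=50, bonus=4)
print(pi, index, letter_grade(index))
```

Note that nothing in this calculation takes subgroup membership as an input, which is why the composite grade cannot, by construction, distinguish between schools with large and small within-school gaps.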
Analyses were based on 2012–2013 reading and math test scores of over 25,000 students from 81 urban, urban fringe, and suburban elementary and middle schools. Schools were sampled from three contiguous districts in a single metropolitan area. Achievement data were used from students in third, fourth, fifth, sixth, seventh, and eighth grades. Table 1 contains descriptive data for the sample of students and schools. Valid math scores were obtained from 25,663 students and valid reading scores from 25,469 students. Approximately 45% of the students qualified for Free or Reduced Lunch (FRL), 42% identified as a minority ethnic group, and 52% identified as nonminority Caucasian.
Scale scores from the state curricular exams in reading and math were used to operationalize achievement. Scale scores range from a low of 400 to a high of 990. The average reading scale score for the sample was 747 with a standard deviation of 90. The average math scale score was 759 with a standard deviation of 92. In the school sample, the average FRL rate was 70% and the average minority composition was 60%. Of the schools, 14% earned grades of A, 19% earned grades of B, 4% earned grades of C, 20% earned grades of D, and 43% earned grades of F. Of the sampled schools, 62 were elementary schools and 19 were middle schools.
Table 1. Descriptive Student and School Data
Math Sample by Student Composition and Test Score
Note. N = 81 elementary and middle schools from three contiguous districts in one metropolitan area. We had valid reading scores for 25,469 students and valid math scores for 25,663 students.
Two techniques were used in the analysis. First, consistent with conventional practices to report test score gaps (Jencks & Phillips, 1998; Reardon, 2011), we standardized scale scores to a mean of 0 and a standard deviation of 1. We then report mean differences between FRL and non-FRL and minority and nonminority students in A, B, C, D, and F schools. This approach is useful for examining differences in the achievement status of students.
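The standardization and gap calculation in this first technique can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names and the six scores are made up, and the population standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert scale scores to z-scores (mean 0, SD 1) within the sample."""
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]


def gap(z_scores, group_flags):
    """Mean z-score of the comparison group (flag 0) minus the focal group (flag 1)."""
    focal = [z for z, g in zip(z_scores, group_flags) if g == 1]
    other = [z for z, g in zip(z_scores, group_flags) if g == 0]
    return mean(other) - mean(focal)


# Made-up scale scores and FRL flags (1 = FRL, 0 = non-FRL).
scores = [700, 720, 760, 800, 820, 680]
frl = [1, 1, 0, 0, 0, 1]

z = standardize(scores)
frl_gap = gap(z, frl)
print(round(frl_gap, 2))  # gap expressed in standard deviation units
```

Running the same gap function separately on the students in each letter-grade category would give the kind of grade-by-grade comparison described in the text.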
School grades, however, rank schools by effectiveness, and as such they must measure what schools and teachers control by accounting for achievement variance attributed to different school contexts. Harris (2011) refers to this as the cardinal rule of accountability: schools should be held accountable for what they do. When an indicator is used to rank schools, simple descriptive data lack the power to control for alternative explanations of test score differences (Carlson, 2006; Forte, 2010; Harris, 2011). For this reason, we used a multilevel modeling approach to estimate mean differences and achievement gaps after controlling for factors that are unrelated to teaching effectiveness or school practices.
We followed a conventional multilevel model building process in HLM 7.0. The first step was to decompose achievement variance into within-school and between-school components with an unconditional random effects ANOVA. Results were used to calculate the intraclass correlation coefficient (ICC), the proportion of achievement variance attributable to schools rather than to non-school factors. Next, we tested the effects of student characteristics on achievement with a random coefficients regression. Student variables were grand-mean centered in this model. Grand-mean centering has a computational advantage over group-mean centered or uncentered models in that it controls for any shared variance between individual and group level predictors. Significant student variables were retained and set to vary randomly across schools. Nonrandom student effects were fixed to the school level.
$\text{Ach}_{ij} = \beta_{0j} + r_{ij}$
$\beta_{0j} = \gamma_{00} + u_{0j}$
$\rho = \sigma^2_{u_0} / (\sigma^2_{u_0} + \sigma^2_{e_0})$
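As a small worked illustration of the intraclass correlation, the ratio can be computed directly from the two variance components. The function name and the variance values below are hypothetical and are not taken from the study; they were chosen only to show how a between-school share of about 30% would arise.

```python
def intraclass_correlation(between_var, within_var):
    """Share of total achievement variance that lies between schools."""
    return between_var / (between_var + within_var)


# Hypothetical variance components (not estimates from the study).
rho = intraclass_correlation(between_var=2400.0, within_var=5600.0)
print(rho)
```

The remainder, one minus the ICC, is the share of variance that lies among students within schools.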
Random Coefficient Regression
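$\text{Ach}_{ij} = \beta_{0j} + \beta_{1j}(\text{Minority Status}_{ij}) + \beta_{2j}(\text{FRL Status}_{ij}) + r_{ij}$
$\beta_{0j} = \gamma_{00} + u_{0j}$
$\beta_{1j} = \gamma_{10} + u_{1j}$
$\beta_{2j} = \gamma_{20} + u_{2j}$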
The final step was to test a random coefficient slopes and intercepts as outcomes model with all significant student and school variables. We changed the centering to group-mean in this model to allow for a more accurate estimation of differences in level one slopes across schools (Enders & Tofighi, 2007). To further increase the reliability of the slope estimation, we used the state calculated school index score as a single predictor variable. The index score is a continuous variable that is used to determine the categorical letter grade. Using a single continuous variable as opposed to multiple categorical variables improves the degrees of freedom and yields a more reliable estimate of variation in level one slopes (Hox, 2010). Estimates represent the actual difference in scale scores after controlling for factors not related to teaching practices and school performance.
Random Coefficient Slopes and Intercepts as Outcomes Model
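$\text{Ach}_{ij} = \beta_{0j} + \beta_{1j}(\text{Minority Status}_{ij}) + \beta_{2j}(\text{FRL Status}_{ij}) + r_{ij}$
$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{C}) + \gamma_{02}(\text{D}) + \gamma_{03}(\text{F}) + \gamma_{04}(\%\ \text{Minority}) + \gamma_{05}(\text{FRL Rate}) + u_{0j}$
$\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{IndexScore}) + u_{1j}$
$\beta_{2j} = \gamma_{20} + \gamma_{21}(\text{IndexScore}) + u_{2j}$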
$\text{Ach}_{ij}$ = the estimated achievement of student i in school j
$\beta_{0j}$ = the mean achievement of school j
$\beta_{1j}$ = the minority achievement gap in school j
$\beta_{2j}$ = the FRL achievement gap in school j
$\gamma_{00}$ = the grand mean for achievement
$\gamma_{01}$ = the difference in average achievement between A/B schools and C schools
$\gamma_{02}$ = the difference in average achievement between A/B schools and D schools
$\gamma_{03}$ = the difference in average achievement between A/B schools and F schools
$\gamma_{04}$ = the effect of school % Minority on achievement
$\gamma_{05}$ = the effect of FRL rate on student achievement
$\gamma_{11}$ = the cross-level interaction of minority achievement and Index Score
$\gamma_{21}$ = the cross-level interaction of FRL achievement and Index Score
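To see where the cross-level interactions come from, it can help to substitute the school-level equations into the student-level equation. The combined form below is a sketch under one labeling assumption: the average within-school gaps are written as $\gamma_{10}$ and $\gamma_{20}$, and the school-specific deviations in the two slopes as $u_{1j}$ and $u_{2j}$, following the conventional multilevel notation.

```latex
\begin{aligned}
\text{Ach}_{ij} ={}& \gamma_{00} + \gamma_{01}\,\text{C}_j + \gamma_{02}\,\text{D}_j + \gamma_{03}\,\text{F}_j
  + \gamma_{04}\,(\%\,\text{Minority})_j + \gamma_{05}\,(\text{FRL Rate})_j \\
&+ \gamma_{10}\,\text{Minority}_{ij} + \gamma_{11}\,\text{IndexScore}_j \times \text{Minority}_{ij} \\
&+ \gamma_{20}\,\text{FRL}_{ij} + \gamma_{21}\,\text{IndexScore}_j \times \text{FRL}_{ij} \\
&+ u_{0j} + u_{1j}\,\text{Minority}_{ij} + u_{2j}\,\text{FRL}_{ij} + r_{ij}
\end{aligned}
```

Read this way, $\gamma_{11}$ and $\gamma_{21}$ are the coefficients on the product terms: they describe how much the minority and FRL gaps shift for each one-point change in a school's index score.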
We organized results by the two research questions: Do students in “A” and “B” schools have high average achievement and small achievement gaps compared to students in “D” and “F” schools? What is the difference in average achievement and achievement gaps between school grades when holding context constant?
AVERAGE ACHIEVEMENT AND ACHIEVEMENT GAPS
As reported in Table 2, students in A and B schools had higher average reading and math scores than students in C, D, and F schools. Students in A schools had an average reading score about .34 standard deviations above the sample mean and an average math score about .39 standard deviations above the sample mean. Students in B schools had an average reading score about .12 standard deviations above the sample mean and average math score about .11 standard deviations above. Average reading and math scores in C, D, and F schools were below the sample mean and around one standard deviation less than the average reading and math scores in A schools.
Table 2. Differences in Reading and Math Test Scores by FRL Status and School Grade
Note. Test scores were standardized to a mean of 0 and a standard deviation of 1. Values represent that average deviation from the sample mean.
We found test score gaps for FRL students (Table 2 and Figures 1 and 2). In the overall sample, FRL students averaged reading and math scores nearly one standard deviation lower than non-FRL students. Both reading and math gaps varied across school letter grades. In A schools the reading gap was .83 standard deviations, with the average FRL student scoring .31 standard deviations below the mean and the average non-FRL student scoring nearly .52 standard deviations above the mean. The math gap in A schools was about .75 standard deviations, with the average math score of FRL students falling .19 standard deviations below the mean and the average math score for non-FRL students at about .56 standard deviations above the mean.
Figure 1. Mean differences in reading test scores by FRL status and school letter grades
Test scores were standardized to a mean of 0 and a standard deviation of 1. Values are reported in standard deviation units. FRL students are coded as 1 and non FRL students are coded as 0.
For B schools, the FRL reading and math gaps were about 1 standard deviation. FRL students in B schools had an average reading score .33 standard deviations and an average math score .31 standard deviations below the mean. Non-FRL students had average reading and math scores of .38 and .36 standard deviations above the mean. Smaller FRL gaps were found in C, D, and F schools. The average reading difference in C schools was about .56 standard deviations and the average math difference was about .68. For D schools, differences were about .34 standard deviations for both reading and math, and in F schools the average reading difference was less than .02 standard deviations (with FRL students having a slightly higher average) and nearly .26 standard deviations for math.
Figure 2. Mean differences in math test scores by FRL status and school letter grades
Test scores were standardized to a mean of 0 and a standard deviation of 1. Values are reported in standard deviation units. FRL students are coded as 1 and non FRL students coded as 0.
The minority test score gap followed a pattern similar to the FRL gap (Table 3). The overall minority difference in reading and math scores was about 1 standard deviation. In reading, the average minority student scored .28 standard deviations below the sample mean, whereas the average nonminority student scored .27 standard deviations above. In math, the average minority student scored .31 standard deviations below the mean and the average nonminority student scored .27 standard deviations above.
Table 3. Differences in Reading and Math Test Scores by Minority Status and School Grade
Note. Test scores were standardized to a mean of 0 and a standard deviation of 1. Values represent that average deviation from the sample mean.
Test score gaps for minority students varied by letter grade. The largest minority gaps in reading and math (over one standard deviation) were found in B-rated schools (Figures 3 and 4). The minority reading gap in A schools was .49 standard deviations while the minority math gap was .59 standard deviations. For C schools the average reading gap was .64 standard deviations and the average math gap was .37. Smaller differences between minority and nonminority students were found in D and F schools. For D schools the minority reading gap was about .24 standard deviations and the math gap about .35. For F schools, minority gaps in reading and math were .30 and .25 standard deviations, respectively.
Figure 3. Mean differences in reading test scores by Minority status and school letter grades
Test scores were standardized to a mean of 0 and a standard deviation of 1. Values are reported in standard deviation units. Minority students are coded as 1 and non-minority students coded as 0.
Figure 4. Mean differences in math test scores by minority status and school letter grades
Test scores were standardized to a mean of 0 and a standard deviation of 1. Values are reported in standard deviation units. Minority students are coded as 1 and nonminority coded as 0.
We first report the variance decomposition from the unconditional random effects ANOVA models. Results show achievement variance that is attributed to student and school differences. Student differences accounted for 72% of variance in reading and 70% in math. Schools, on the other hand, accounted for 28% of the reading variance and 30% of the math variance (Table 2). To address the research question, we examined the main effects of letter grades and the moderating effect of letter grades on achievement gaps.
Table 2. Differences in reading and math test scores by FRL status and school grade
Note. Test scores were standardized to a mean of 0 and a standard deviation of 1. Values represent that average deviation from the sample mean.
SMALL MAIN EFFECTS
Table 3 displays average differences in the math and reading scale scores after controlling for student (FRL and minority status) and school characteristics (FRL rate and percent of Caucasian students). For reading, we did not find statistically significant achievement differences attributed to school letter grade. Further, the estimated differences were small and considerably less than the standard deviation for the sample and the average standard error of measurement for the reading assessment (SEM = 33) (CTP McGraw Hill, 2013). Students in schools receiving a C grade averaged 3 scale points lower than students in A and B schools. The average reading score for students in D schools was 1 scale point less than the average score for students in A and B schools. The largest difference, 31 scale points, was between students in F-ranked schools and students in A- and B-ranked schools. This difference, however, was not statistically significant and fell within the range of the standard error of measurement for the reading assessment (SEM = 33).
Letter grades performed only slightly better in explaining differences in average math scores. We did not find statistically significant differences in average math achievement between students in C schools and students in A and B ranked schools. The estimated difference of 11 scale points was small (Cohens d = .11) and fell within the average measurement error of the math test (SEM = 22). The average math difference of 25 scale points for students in D schools and students in A and B schools was also not statistically significance at p < .05. This estimated difference was small (Cohens d = .25) and around the measurement error of the test. The difference of 42 scale points in average math achievement between F and A and B ranked schools was statistically significant with a small effect size (Cohens d = .44).
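The comparisons above rest on two simple checks: expressing a scale-point difference as a standardized effect size, and asking whether that difference exceeds the test's standard error of measurement. The sketch below illustrates both under stated assumptions. The standard deviation of 100 scale points is hypothetical; it is used only because it roughly reproduces the reported effect sizes, and the article does not state the value the authors used. The function names are likewise illustrative.

```python
def cohens_d(mean_difference: float, pooled_sd: float) -> float:
    """Standardized mean difference: raw difference divided by the pooled SD."""
    if pooled_sd <= 0:
        raise ValueError("pooled_sd must be positive")
    return mean_difference / pooled_sd


def within_measurement_error(mean_difference: float, sem: float) -> bool:
    """True when the absolute difference does not exceed the SEM."""
    return abs(mean_difference) <= sem


ASSUMED_SD = 100.0  # hypothetical SD in scale points; not reported in the text
MATH_SEM = 22.0     # standard error of measurement reported for the math test

# Reported math differences (scale points) relative to "A" and "B" schools.
math_differences = {"C": 11, "D": 25, "F": 42}

for grade, diff in math_differences.items():
    d = cohens_d(diff, ASSUMED_SD)
    inside = within_measurement_error(diff, MATH_SEM)
    print(f"{grade}: difference = {diff}, d = {d:.2f}, within SEM = {inside}")
```

Under this assumed SD, the 25-point difference falls slightly outside the SEM of 22, which is consistent with the text describing it as "around" the measurement error rather than within it.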
Table 3. Differences in reading and math test scores by minority status and school grade
MODERATING EFFECTS OF LETTER GRADES
Consistent with the test score gaps we reported in the previous section, FRL and minority achievement gaps were smaller in schools with the lowest school index scores. For FRL students, within-school achievement gaps increased proportionally to increases in the school index score for reading and math. Negative parameter estimates for reading (γ11 = -0.44, p < .01) and math (γ21 = -0.53, p < .01) indicate a decline in the average achievement of FRL students as index scores increase. Figures 5 and 6 illustrate the negative relationship between FRL gaps and the school index score. As index scores increased, reading and math gaps widened. Additionally, average reading and math achievement of FRL students was considerably lower in schools with the highest index scores compared to schools with lower index scores.
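A negative cross-level interaction of this kind can be read as a gap that depends on the school index score: the within-school FRL slope equals an average slope plus the interaction coefficient times the index score. The sketch below illustrates that reading for the reading estimate. It makes two assumptions not stated in the text: that the index score is standardized (z-scored), and that the average FRL slope is -0.60 standard deviations. Both are hypothetical values chosen for illustration.

```python
GAMMA_11_READING = -0.44  # cross-level interaction reported for reading
ASSUMED_GAMMA_10 = -0.60  # hypothetical average FRL slope; not reported in the text


def predicted_frl_gap(index_z: float,
                      gamma_10: float = ASSUMED_GAMMA_10,
                      gamma_11: float = GAMMA_11_READING) -> float:
    """Predicted FRL slope (gap in SD units) at a given standardized index score."""
    return gamma_10 + gamma_11 * index_z


# Evaluate the predicted gap one SD below, at, and one SD above the mean index.
for index_z in (-1.0, 0.0, 1.0):
    gap = predicted_frl_gap(index_z)
    print(f"index z = {index_z:+.1f}: predicted FRL gap = {gap:+.2f} SD")
```

Because the interaction coefficient is negative, the predicted gap grows more negative as the index score rises, which is the pattern the figures describe.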
Figure 5. Graph from intercepts and slopes as outcomes model of reading achievement
Results show a larger FRL gap in reading achievement as index score increases.
Figure 6. Graph from intercepts and slopes as outcomes model of math achievement
Results show a larger FRL gap in math achievement as index score increases.
The relationship between index score and minority test score gaps was similar to FRL, but not as strong. Average reading achievement of minority students decreased (γ11 = -0.35, p < .05) as school index scores increased. Similarly, average math scores of minority students decreased (γ21 = -0.31, p < .05) as index scores increased. Figures 7 and 8 illustrate the changes in the minority test score gap by school index score. Notice that compared to the FRL gap, the slope of the line for minority students is not as steep and the average gap in schools with the best index scores is not as large.
Figure 7. Graph from intercepts and slopes as outcomes model of reading achievement
Results show a larger Minority gap in reading achievement as index score increases.
Figure 8. Graph from intercepts and slopes as outcomes model of math achievement
Results show a larger Minority gap in math achievement as index score increases.
Informational significance provides a different framework to evaluate accountability indicators. Unlike validity studies that evaluate measurement quality, informational significance targets the usefulness of an accountability indicator. An indicator may achieve a degree of validity but not have value or utility for decisions affecting policy and practice. To support achievement equity, letter grades should be capable of explaining high and equitable achievement within and between schools. Oklahoma’s grades did not meet this standard. We found that A–F letter grades end up hiding achievement gaps rather than revealing them.
When analyzing test score gaps, we found higher average reading and math scores in “A” and “B” schools compared to “C,” “D,” and “F” schools, but test scores were not equally distributed within letter grades. The largest achievement gaps were in schools ranked as the most effective. FRL and minority students in “A” and “B” schools had average reading and math achievement below the overall sample mean and in some cases not different from FRL and minority students in schools with lower letter grades. In “A” and “B” schools, FRL students had an average reading score about .32 standard deviations below the mean. This average score was similar to the average reading score of -.35 for FRL students in “D” schools and the average reading score of -.40 for FRL students in “F” schools. Average performance of FRL students was nearly equivalent across letter grades.
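The subgroup comparison described above can be sketched as follows: standardize scores against the whole sample, then average the standardized scores within each combination of letter grade and FRL status. The records below are synthetic and invented purely for illustration; they are not the study's data, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(scores):
    """Convert raw scores to z-scores using the whole-sample mean and SD."""
    mu = mean(scores)
    sd = pstdev(scores)  # population SD of the sample at hand
    return [(s - mu) / sd for s in scores]


def subgroup_means(records):
    """records: list of (letter_grade, is_frl, scale_score) tuples.

    Returns the mean z-score for each (letter_grade, is_frl) subgroup.
    """
    z_scores = standardize([score for _, _, score in records])
    groups = defaultdict(list)
    for (grade, is_frl, _), z in zip(records, z_scores):
        groups[(grade, is_frl)].append(z)
    return {key: mean(values) for key, values in groups.items()}


# Synthetic records: (school letter grade, FRL status, scale score).
records = [
    ("A", False, 820), ("A", False, 800), ("A", True, 690), ("A", True, 700),
    ("F", False, 720), ("F", False, 710), ("F", True, 690), ("F", True, 695),
]

means = subgroup_means(records)
gap_a = means[("A", True)] - means[("A", False)]  # FRL gap within "A" schools
gap_f = means[("F", True)] - means[("F", False)]  # FRL gap within "F" schools
print(f"FRL gap in A schools: {gap_a:.2f} SD; in F schools: {gap_f:.2f} SD")
```

In this made-up example the FRL subgroup means are nearly the same in both grades, while the within-school gap is much larger in the higher-graded schools, which mirrors the pattern the paragraph reports.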
Informational significance partly depends on a clear and accurate indication of achievement patterns within and across schools. Our evaluation of test score gaps suggests that Oklahoma’s A–F grades do not provide a clear signal of achievement for poor and minority students. Some “A” and “B” schools likely had high and equitable student achievement, but it is also true that schools with large test score gaps for FRL and minority students were rated as effective. Herein lies the crux of the issue: grades do not sort out schools with high and equitable achievement from schools with high average achievement and large achievement gaps. Not knowing the relative achievement of FRL and minority students leads to inaccurate judgments about school quality and diminishes the usefulness of letter grades. To be meaningful, grades need to reflect the performance of all students and student subgroups. Oklahoma’s A–F letter grades fail this test by making it possible for schools to receive “A’s” and “B’s” while failing to serve their FRL and minority students.
HLM results raise additional concerns about the informational significance of A–F grades. After removing achievement variance attributed to factors unrelated to teaching or school effectiveness, letter grades were unable to differentiate schools by average student achievement. In reading, average test scores in “A,” “B,” “C,” and “D” schools were similar. The lower average reading achievement we found in “F” schools does not correspond to the performance difference one would reasonably expect between an “A” and an “F.” Math results were not much different from reading. Perhaps the most troubling finding was that “A” and “B” schools were least effective for poor, minority children, while “D” and “F” schools were most effective. Rather than supporting schools in closing achievement gaps, the intent of NCLB waivers, the Oklahoma system rewards schools with high grades even when large achievement gaps exist. Informational significance is lost on grades that hide achievement variance within and between schools, making any diagnostic and improvement use of A–F grades ineffectual.
Evidence that FRL and minority students had higher achievement in “D” and “F” schools than their counterparts attending “A” and “B” schools challenges the formula used to calculate school grades. The distribution of letter grades would change quite drastically if the state assigned achievement gaps the same weight it assigns to achievement status. In our sample of schools, several “D” and “F” schools would become “C” or “B” schools, and many “A” and “B” schools would become “C” or “D” schools. Poor, minority students end up being left behind when grades obscure achievement differences within schools. In some instances, letter grades do not reflect the achievement of all students and student subgroups, and in other cases schools showing some progress may be misidentified as needing urgent improvement.
The absence of informational significance means that school grades cannot be used to nurture the human and social capacity under which effective schools adapt to their external environments and to the needs of their students. School grades deliver little informational value to teachers and administrators. They hide achievement differences, they cannot be disaggregated by content standards, and they do not measure student growth toward college, citizenship, and career-ready expectations. Furthermore, school grades cannot be used to measure the effectiveness of improvement strategies or interventions; any change from one year to the next is just as likely attributable to factors outside school control as to what happens within schools and classrooms.
School grades have limited use as an external control as well. Grades that obscure achievement differences encourage misguided judgments about school effectiveness and misplaced reform pressure. For instance, “D” and “F” schools could use additional support and resources, but instead they face mandated interventions that do not address the sources of diminished capacity. In contrast, “A” and “B” schools encounter no external pressure or incentives to track achievement of FRL and minority students. In fact, “A” and “B” schools can be rewarded even if the achievement of low-income and minority students lags behind that of students with more social advantages.
Problems identified with accountability indicators under NCLB compound in Oklahoma’s A–F grading system. First, the system uses proficiency scores for its calculation of student achievement and student growth. Proficiency scores are a simple metric to describe achievement status in the aggregate, but their accuracy erodes when used as the basis for ranking schools for the purpose of policy decisions or passing judgments of school effectiveness (Carlson, 2006; Forte, 2010; Ho, 2008; Linn, 2005). Second, the system hides achievement of poor and minority students by using the growth of the bottom 25% to satisfy the achievement equity expectation. To keep the spotlight on achievement equity, Oklahoma’s policy would, at the minimum, need to report proficiency scores by student subgroups and account for subgroup performance in calculations of the student achievement and student growth components. The State does neither, effectively ignoring poor, minority students in its calculations and reporting.
Finally, assumptions of letter grades do not correspond with the dynamic nature of schools and student learning. School performance is multifaceted and varies across subjects, classrooms, and students. Instead of measuring and reporting variability, grades treat teaching and learning in schools as fixed processes. As a result, lower achieving students receive the same performance status as higher achieving students, essentially ignoring variance that can help schools recognize and respond to unmet student needs. NCLB waivers were designed to provide states flexibility in developing a fair and focused accountability system to support continuous improvement (U.S. Department of Education, 2012). A policy that rewards schools for large FRL and minority achievement gaps and penalizes schools whose poor, minority students outperform peers in more affluent schools is neither fair nor supportive of continuous improvement.
Rather than advancing achievement equity, the intent of the federal NCLB waiver, letter grades seem to exploit achievement levels that derive from wealth and social advantage, while obscuring a school’s failure to serve all children. To advance achievement equity, educators need to understand common sources of achievement variance within schools. Letter grades, however, collapse achievement variance into a single composite indicator. No measure of school performance can yield accurate results if the majority of variance in student achievement is concealed by the indicator (Forte, 2010). As a practical consequence, grades end up classifying some schools as “A” and “B” schools when they are failing to meet the learning needs of all students and other schools as “D” and “F” schools when they are making progress with poor, minority students.
Progress made under NCLB in exposing achievement inequity in the US has taken a step back with Oklahoma’s A–F school grades. Our evidence suggests that a composite letter grade does not provide a clear signal or simple interpretation of achievement differences within and between schools. No meaningful information about achievement gaps can be obtained from a letter grade. We cannot conclude, for instance, that “A” and “B” schools have high average and equitable achievement. We also cannot conclude that FRL and minority students in “D” and “F” schools perform worse on average than peers in higher ranked schools. Herein lies a fundamental problem with the informational significance of A–F accountability grades: Grades do not provide the right information to understand achievement patterns in schools. Without knowledge of test score differences, it is hard to see how appropriate action can be taken to improve learning outcomes.
Although our evidence is limited to one state, many components of Oklahoma’s system are similar to those used in other states (Howe & Murray, 2015). Other states use proficiency bands without reporting results by subgroups, they use achievement of the bottom 25% to fulfill the achievement gap requirement, and they use a composite indicator to judge school effectiveness (Domaleski & Perie, 2013; Polikoff et al., 2014). These three components are likely to behave the same way in other state accountability systems. It is not variability in schools that presents a problem, but rather weaknesses of the components to measure achievement variance within schools.
We cannot conclude with certainty that effects found in our sample will appear in a larger, more representative sample of states, districts and schools. What is clear is that additional research on new state accountability policies is needed. With states using different accountability designs, it is important for researchers to identify system components capable of yielding valid inferences of school performance. As long as accountability carries with it high stakes consequences, state governments have a legal and ethical responsibility to ensure that accountability systems accurately distinguish among different levels of school effectiveness.
1. We use quotation marks when referring to letter grades associated with a school’s ranking. No quotation marks are used for a general reference to letter grades.
Adams, C. M., Forsyth, P. B., Dollarhide, E., Miskell, R. C., & Ware, J. K. (2015). Self-regulatory climate: A social resource for student regulation and achievement. Teachers College Record, 117, 1–28.
Ayers, J. (2011). No child left behind waiver applications: Are they ambitious and achievable? Washington, DC: Center for American Progress. Retrieved from http://files.eric.ed.gov/fulltext/ED535638.pdf
Baard, P. P. (2002). Intrinsic need satisfaction in organizations: A motivational basis of success in for-profit and not-for-profit settings. In E. Deci & R. Ryan (Eds.), Handbook of self-determination research (pp. 255–276). Rochester, NY: University of Rochester Press.
Baker, E. L., Barton, P. E., Darling-Hammond, L., Haertel, E., Ladd, H. F., Linn, R. L., Ravitch, D., Rothstein, R., Shavelson, R. J., & Shepard, L. A. (2010). Problems with the use of student test scores to evaluate teachers. Washington, DC: The Economic Policy Institute. Retrieved from http://epi.3cdn.net/724cd9a1eb91c40ff0_hwm6iij90.pdf
Baker, E. L., & Linn, R. L. (2002). Validity issues for accountability systems [CSE Technical Report 585]. Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing, University of California, Los Angeles.
Barton, P. E., & Coley, R. J. (2010). The black-white achievement gap: When progress stopped. Princeton, NJ: Educational Testing Services. Retrieved from http://files.eric.ed.gov/fulltext/ED511548.pdf
Booher-Jennings, J. (2005). Educational triage and the Texas accountability system. American Educational Research Journal, 42(2), 231–268.
Bryk, A. S. (2009). Support a science of performance improvement. Phi Delta Kappan, 90(8), 597–600.
Carlson, D. (2006). Focusing state educational accountability systems: Four methods of judging school quality and progress. Dover, NH: The Center for Assessment. Retrieved from http://www.nciea.org/publications/Dale020402.pdf
CEP. (2012). Accountability issues to watch under NCLB waivers. Washington, DC: The George Washington University, Center on Education Policy. Retrieved from http://files.eric.ed.gov/fulltext/ED535955.pdf
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum.
Darling-Hammond, L. (2006). Securing the right to learn: Policy and practice for powerful teaching and learning. Educational Researcher, 35(7), 13–24.
Domaleski, C., & Perie, M. (2013). Promoting equity in state education accountability systems. Lawrence, KS: National Center for the Improvement of Educational Assessment, Center for Educational Testing and Evaluation, University of Kansas.
Enders, C. K., & Tofighi, D. (2003). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychological Methods, 12(2), 121–138.
Feuer, M. J. (2008). Future directions for educational accountability: Notes for a political economy of measurement. In K. Ryan & L. Shepard (Eds.), The future of test-based educational accountability (pp. 293–306). New York, NY: Routledge.
Figlio, D. N., & Getzler, L. S. (2002). Accountability, ability and disability: Gaming the system [Working Paper 9307]. Cambridge, MA: National Bureau of Economic Research.
Finnigan, K. S., & Gross, B. (2007). Do accountability policy sanctions influence teacher motivation? Lessons from Chicago’s low-performing schools. American Educational Research Journal, 41(3), 594–630.
Forsyth, P. B., Adams, C. M., & Hoy, W. K. (2011). Collective trust: Why schools can’t improve without it. New York, NY: Teachers College Press.
Forte, E. (2010). Examining the assumptions underlying the NCLB federal accountability policy on school improvement. Educational Psychologist, 45(2), 76–88.
Fusarelli, L. D. (2004). The potential impact of the No Child Left Behind Act on equity and diversity in American education. Educational Policy, 18(1), 71–94.
Haladyna, T. M., Nolen, S. R., & Haas, N. S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 42(17), 2–7.
Hall, D. (2013). A step forward or a step back? State accountability in the waiver era. Washington, DC: The Education Trust. Retrieved from http://files.eric.ed.gov/fulltext/ED543222.pdf
Hamilton, L. S., Schwartz, H. L., Stecher, B. M., & Steele, J. L. (2013). Improving accountability through expanded measures of performance. Journal of Educational Administration, 51(4), 453–475.
Harris, D. N. (2011). Value-added measures in education: What every educator needs to know. Cambridge, MA: Harvard Press.
Heck, R. H. (2009). Teacher effectiveness and student achievement: Investigating a multilevel cross-classified model. Journal of Educational Administration, 47, 227–249.
Heilig, V. J., & Darling-Hammond, L. (2008). Accountability Texas-style: The progress and learning of urban students in a high-stakes testing context. Educational Evaluation and Policy Analysis, 30(2), 75–110.
Ho, A. D. (2008). The problem with proficiency: Limitations of statistics and policy under No Child Left Behind. Educational Researcher, 37(6), 351–360.
Howe, K. R., & Murray, K. (2015). Why school report cards merit a failing grade. Boulder, CO: National Education Policy Center. Retrieved from http://nepc.colorado.edu/publication/why-school-report-cards-fail
Hox, J. J. (2010). Multilevel analysis: Techniques and applications (2nd ed.). New York, NY: Routledge.
Jencks, C., & Phillips, M. (1998). America’s next achievement test: Closing the black-white test score gaps. The American Prospect, 9(40), 44–53.
Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. Journal of Economic Perspectives, 16(1), 91–114.
King, B., & Minium, E. (2003). Statistical reasoning in psychology and education (4th ed.). Hoboken, NJ: Wiley.
Lee, J. (2002). Racial and ethnic achievement gap trends: Reversing the progress toward equity? Educational Researcher, 31(1), 3–12.
Lee, J. (2006). Tracking achievement gaps and assessing the impact of NCLB on the gaps: An in-depth look into national and state reading and math outcome trends. Cambridge, MA: The Civil Rights Project at Harvard University. President and Fellows of Harvard College.
Lee, J. (2008). Is test-driven external accountability effective? Synthesizing the evidence from cross-state causal-comparative and correlational studies. Review of Educational Research, 78(3), 608–644.
Linn, R. L. (2005). Conflicting demands of No Child Left Behind and state systems: Mixed messages about school performance. Education Policy Analysis Archives, 13(33), 1–20.
Linn, R. L. (2008). Educational accountability systems. In K. Ryan & L. Shepard (Eds.), The future of test-based educational accountability (pp. 3–24). New York, NY: Routledge.
Linn, R. L., & Haug, C. (2002). Stability of school building accountability scores and gains [CSE Technical Report 561]. Los Angeles: National Center for Research on Evaluation, Standards, and Student Testing. University of California, Los Angeles.
McNeil, M. (2012). States punch reset button with NCLB waivers. Education Week. Retrieved from http://www.edweek.org/ew/articles/2012/10/17/08waiver_ep.h32.html?tkn=NSLFJ%2BWQnkqPlIMGUAUBakJda6JiHNTaJZDt&intc=es.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Miller, D. M. (2008). Data for school improvement and educational accountability: Reliability and validity in practice. In K. Ryan & L. Shepard (Eds.), The future of test-based educational accountability (pp. 249–262). New York, NY: Routledge.
Mintrop, H., & Sunderman, G. L. (2009). Predictable failure of federal sanctions-driven accountability for school improvement and why we may retain it anyway. Educational Researcher, 38(5), 353–364.
Moller, J. (2008). School leadership in an age of accountability: Tensions between managerial and professional accountability. Journal of Educational Change. doi:10.1007/s10833-008-9078-6.
National Center for Education Statistics. (2013a). NAEP 2012: Trends in academic progress, reading 1971–2012, math 1973–2012. Washington, DC: U.S. Department of Education. Retrieved from http://nces.ed.gov/nationsreportcard/subject/publications/main2012/pdf/2013456.pdf
National Center for Education Statistics. (2013b). The nation's report card: Trends in academic progress 2012 (NCES 2013–456). Washington, DC: National Center for Education Statistics, Institute of Education Sciences, U.S. Department of Education.
Neal, D., & Schanzenbach, D. W. (2010). Left behind by design: Proficiency counts and test-based accountability. The Review of Economics and Statistics, 92(2), 263–283.
Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26, 237–257.
OCEP & CERE. (2012). An examination of the Oklahoma State Department of Education’s A–F report card. Tulsa, OK: The Oklahoma Center for Education Policy, University of Oklahoma, and The Center for Educational Research and Evaluation, Oklahoma State University.
O’Day, J. A. (2002). Complexity, accountability, and school improvement. Harvard Educational Review, 72(3), 293–329.
Oklahoma State Department of Education. (2012). Oklahoma school testing program, Oklahoma core curriculum tests, Grades 3 to 8 assessments. New York, NY: Pearson.
Oklahoma State Department of Education. (2014l). 2014 A–F report card technical guide. Retrieved from http://ok.gov/sde/sites/ok.gov.sde/files/documents/files/AtoF_Report_Card_Technical_Guide_8-28-2014.pdf.
Quality Counts. (2014). District disruption and revival: School systems reshape to compete and improve. Education Week. Retrieved from: http://www.edweek.org/ew/toc/2014/01/09/index.html
Pearson, Inc. (2012). Technical report of the Oklahoma School Testing Program, Oklahoma Core Curriculum Tests, Grades 3 to 8 assessments. New York, NY: Pearson.
Polikoff, M., McEachin, A., Wrabel, S., & Duque, M. (2014). The waive of the future? School accountability in the waiver era. Educational Researcher. Retrieved from http://www-bcf.usc.edu/~polikoff/Waivers.pdf
Popham, J. (2007). The no-win accountability game. In C. Glickman (Ed.), Letters to the next president: What we can do about the real crisis in public education (pp. 166–173). New York, NY: Teachers College Press.
Raudenbush, S. W. (2004). Schooling, statistics, and poverty: Can we measure school improvement? Princeton, NJ: Educational Testing Service.
Reardon, S. F. (2011). The widening academic achievement gap between the rich and poor: New evidence and possible explanations. In G. J. Duncan & R. J. Murnane (Eds.), Whither opportunity? Rising inequality, schools, and children’s life chances (pp. 91–116). New York, NY: Russell Sage Foundation.
Reeve, J. (2002). Self-determination theory applied to educational settings. In E. Deci & R. Ryan (Eds.), Handbook of self-determination research (pp. 183–204). Rochester, NY: University of Rochester Press.
Reeve, J., & Halusic, M. (2009). How K-12 teachers can put self-determination theory principles into practice. Theory and Research in Education, 7(2), 145–154.
Reeve, J., & Jang, H. (2006). What teachers say and do to support students’ autonomy during a learning activity. Journal of Educational Psychology, 98(1), 209–218.
Rothstein, R. (2009). Getting accountability right. Education Week. Retrieved from http://www.csun.edu/~krowlands/Content/SED610/reform/Getting%20Accountability%20Right.pdf.
Rothstein, R., Jacobson, R., & Wilder, T. (2008). Grading education: Getting accountability right. New York, NY: Teachers College.
Ryan, R. M., & Deci, E. L. (2002). Overview of self-determination theory: An organismic dialectical perspective. In E. Deci & R. Ryan (Eds.), Handbook of self-determination theory research (pp. 3–36). Rochester, NY: University of Rochester Press.
Ryan, R. M., & Deci, E. L. (2012). Overview of self-determination theory: An organismic dialectical perspective. In R. Ryan (Ed.), The Oxford handbook of human motivation (pp. 3–33). Oxford, UK: Oxford University Press.
Ryan, R. M., & Weinstein, N. (2009). Undermining quality teaching and learning: A self-determination theory perspective on high-stakes testing. Theory and Research in Education, 7(2), 224–233.
Sahlberg, P. (2008). Rethinking accountability in a knowledge society. Journal of Educational Change. doi:10.1007/s10833-008-9098-2.
Schlechty, P. C. (2010). Leading for learning: How to transform schools into learning organizations. San Francisco, CA: Wiley.
Schwartz, H. L., Hamilton, L. S., Stecher, B. M., & Steele, J. L. (2011). Expanded measures of school performance [Technical Report]. Washington, DC: Rand Corporation.
Sirotnik, K. A. (2002). Promoting responsible accountability in schools and education. The Phi Delta Kappan, 83(9), 662–673.
Sirotnik, K. A. (2005). Holding accountability accountable. What ought to matter in public education. New York, NY: Teachers College Press.
Sunderman, G. L., & Kim, J. S. (2005). Measuring academic proficiency under the No Child Left Behind Act: Implications for educational equity. Educational Researcher, 34(8), 3–13.
U.S. Department of Education. (2012). ESEA Flexibility Request. Retrieved from http://www2.ed.gov/policy/elsec/guid/eseAFlexibility/index.html
Ushomirsky, N., Williams, D., & Hall, D. (2014). Making sure all children matter: Getting school accountability signals right. Washington, DC: The Education Trust.
Whitford, B. L., & Jones, J. (2000). Accountability, assessment, and teacher commitment: Lessons from Kentucky’s reform efforts. New York, NY: State University of New York.
Williams, G. C. (2002). Improving patients’ health through supporting the autonomy of patients and providers. In E. Deci & R. Ryan (Eds.), Handbook of self-determination theory research (pp. 233–254). Rochester, NY: University of Rochester Press.