Evidence of Grade and Subject-Level Bias in Value-Added Measures
by Jessica Holloway-Libell - June 08, 2015
While value-added models (VAMs)—the statistical tools used to measure teacher effects on student achievement scores—continue to emerge throughout districts and states across the country, education scholars simultaneously recommend caution, especially in terms of the inferences made and the decisions based on VAM output. This research note investigates an unexplored feature of bias in VAM-based estimates: bias associated with grade levels and subject areas. Findings contribute an alternative perspective on how we think about VAM-based bias and teacher classifications.
Tennessee was one of only two states to win first-round Race to the Top (RttT) funds in 2009 (RttT, 2011). While most other contending states had to adopt or develop a statistical model to measure growth in achievement scores and attribute that growth to teachers (e.g., via a growth or value-added model [VAM]), Tennessee had a distinct advantage in that its VAM, the Tennessee Value-Added Assessment System (TVAAS), had been in place for nearly two decades by the time applications were due.
While currently 44 states and Washington D.C. have adopted some form of VAM or other student growth model in order to comply with RttT stipulations, or to receive No Child Left Behind waivers (Collins & Amrein-Beardsley, 2014), Tennessee was rewarded for being a leader in this competition. In addition, the TVAAS model was later bought by the analytics software company SAS® Institute Inc. and renamed the Education Value-Added Assessment System (EVAAS), although it keeps the TVAAS name in Tennessee; it has since become the most popular and widely used VAM in the country.
Since the propagation of such models, however, many educational researchers have grown increasingly skeptical of said VAMs, primarily because of concerns with the reliability, validity, and bias of the models currently available (Haertel, Rothstein, Amrein-Beardsley & Darling-Hammond, 2011; Graue, Delaney & Karch, 2013; Newton, Darling-Hammond, Haertel & Thomas, 2010; Papay, 2010; Rothstein, 2009, 2010, 2014). The TVAAS model, specifically, has also been the subject of such criticisms (Amrein-Beardsley, 2008; Ballou & Springer, 2015; Gabriel & Lester, 2013; Kupermintz, 2003). The focus of this paper is to call for additional consideration of how we think about issues of bias, as observed in the "best" VAM on the market, and more specifically about whether Tennessee teachers' VAM estimates might be influenced by the grade levels and subject areas they teach.
VALUE-ADDED MODELS AND BIAS
VAMs are statistical tools meant to measure the purportedly causal relationships between teachers' instruction and student achievement. This is done by statistically measuring growth in student achievement from one year to the next using students' large-scale standardized test scores, while controlling for students' prior testing histories and sometimes student-level variables (e.g., race, gender, levels of poverty) and sometimes classroom- and school-level variables (e.g., class size, average prior achievement for a teacher). However, the extent to which these controls function as intended to control for bias has been a source of contention (Ehlert, Koedel, Parsons & Podgursky, 2013; Rothstein, 2009, 2010, 2014).
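The covariate-adjustment logic described above can be illustrated with a deliberately minimal sketch: regress current scores on prior scores, then average each teacher's residuals as a crude "value-added" estimate. This is a hypothetical simplification using synthetic data, not the actual TVAAS/EVAAS model, which is a far more elaborate longitudinal mixed model.

```python
import numpy as np

def simple_value_added(prior, current, teacher_ids):
    """Crude value-added sketch: fit current scores as a linear function
    of prior scores (OLS with intercept), then average each teacher's
    residuals. Positive means 'more than expected growth.' Illustrative
    only; real VAMs such as TVAAS/EVAAS are much more complex."""
    X = np.column_stack([np.ones_like(prior), prior])   # intercept + prior score
    beta, *_ = np.linalg.lstsq(X, current, rcond=None)  # OLS fit
    residuals = current - X @ beta                      # growth beyond expectation
    return {t: residuals[teacher_ids == t].mean() for t in np.unique(teacher_ids)}

# Hypothetical data: six students split across two teachers
prior = np.array([400.0, 420.0, 450.0, 500.0, 520.0, 540.0])
current = np.array([430.0, 445.0, 470.0, 525.0, 560.0, 580.0])
teachers = np.array(["A", "A", "A", "B", "B", "B"])
estimates = simple_value_added(prior, current, teachers)
```

Because OLS residuals with an intercept sum to zero, and the two (equal-sized) groups partition the sample, one teacher's estimate is the mirror image of the other's here, a reminder that such estimates are relative, not absolute, measures of effectiveness.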
The most salient concern is that of bias related to the types of students taught, predominantly in the cases of homogenous classrooms, including students who consistently score at the extremes of the normal bell curve distribution (Goldhaber, Gabele & Walch, 2012; Kupermintz, 2003; Newton et al., 2010; Rothstein, 2009, 2010, 2014; Rothstein & Mathis, 2013). To mitigate these effects, the best recommendation is to randomly assign students to teachers (Ballou, 2012; Ehlert, Koedel, Parsons & Podgursky, 2012; Glazerman & Potamites, 2011); yet it is unlikely this will be done in practice (Paufler & Amrein-Beardsley, 2014). Others have argued that random assignment is unnecessary because most, if not all, VAMs control for at least students' prior achievement scores, thus controlling for other risk variables by proxy (Sanders & Horn, 1998; Sanders, Wright, Rivers & Leandro, 2009).
One area that has yet to be explored, however, is potential bias at the grade and/or subject level, begging the question: are students more likely to demonstrate growth in some grade levels or subject areas than in others? If yes, then teachers' value-added scores might also be biased beyond just the student level.
A CASE IN POINT
This analysis was initially prompted by an assistant principal in Tennessee who recently wrote about his concerns about what he thought were skewed data trends in his state's 2013 TVAAS scores (Amrein-Beardsley, 2014). A quick look at the publicly available data seemed to corroborate his hunches, thus compelling a more systematic review. I pulled data from the 10 largest school districts in Tennessee, including all grade levels for which TVAAS scores were publicly available (i.e., third- through eighth-grade English/language arts [ELA] and mathematics). For each grade level and subject area, I calculated the percentage of schools per district whose students made more than expected growth (i.e., positive grade-level value-added) in 2013 and in their three-year composite scores.
What I found was that there were, indeed, some perplexing trends, suggesting that this Tennessee administrator's hunches were more than just conjecture. For clarity purposes, the tables below are color-coded, with green signifying the districts in which 75% or more of the schools had students who made more than expected growth (i.e., positive grade-level value-added), and red signifying the districts in which 25% or fewer of the schools had students who made more than expected growth.
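The district-level tallies behind Tables 1-4 can be sketched with a short script. The scores below are hypothetical placeholders, not actual TVAAS results; only the 75%/25% color-coding thresholds come from the description above.

```python
def summarize_district(school_scores, high=0.75, low=0.25):
    """Given one district's grade-level value-added scores (one per
    school), return the share of schools with positive value-added
    (i.e., more than expected growth) and the color-code flag used
    in the tables: green at >= 75%, red at <= 25%."""
    positive = sum(1 for score in school_scores if score > 0)
    pct = positive / len(school_scores)
    if pct >= high:
        flag = "green"
    elif pct <= low:
        flag = "red"
    else:
        flag = "none"
    return pct, flag

# Hypothetical grade-level scores for eight schools in one district
pct, flag = summarize_district([1.2, 0.4, 2.1, 0.9, 1.5, 0.3, 1.1, -0.2])
# Seven of eight schools are positive (87.5%), so this cell would be green
```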
In ELA in 2013, schools were, across the board, much more likely to receive positive value-added scores in fourth and eighth grades than in other grades (see Table 1). At the same time, districts struggled to yield positive value-added scores in sixth and seventh grades in the same subject. Fifth-grade scores fell consistently in the middle range, while third-grade scores varied across districts.
Table 1. Percent of Schools that had Positive Value-Added Scores in English/language arts by Grade and District (2013)
The three-year composite scores were similar (see Table 2), except that even more schools received positive value-added scores in the fifth and eighth grades. In fact, in each of the nine districts that had a composite score for eighth grade, at least 86% of the schools received positive value-added scores at the eighth-grade level. This, contrasted with the sixth- and seventh-grade composite scores, suggests that there might be some grade-level bias at play, in this case against sixth- and seventh-grade ELA teachers.
Table 2. Percent of Schools that had Positive Value-Added Scores in English/Language Arts by Grade and District (Three-Year Composite)
Though the mathematics scores were not as obviously skewed as the ELA scores, there were some trends to note there as well (see Table 3). In particular, the fourth- and seventh-grade scores were consistently higher than those of the third, fifth, sixth, and eighth grades, which showed much greater variation across districts. The three-year composite scores were similar. In fact, a majority of schools across the state received positive value-added scores in mathematics across all grade levels (see Table 4).
Table 3. Percent of Schools that had Positive Value-Added Scores in Mathematics by Grade and District (2013)
Table 4. Percent of Schools that had Positive Value-Added Scores in Mathematics by Grade and District (Three-Year Composite)
Of most importance here is how these results are being interpreted and used, particularly in terms of the validity of the inferences being based on these data. By Tennessee's standard, given the state's heavy reliance on the TVAAS to evaluate teachers, the conclusion might be that the mathematics teachers were, overall, more effective than the ELA teachers in almost every tested grade level (with the exception of eighth-grade ELA), regardless of school district. Perhaps the fourth- and eighth-grade ELA teachers across the state were indeed more effective than the sixth- and seventh-grade ELA teachers; thus, they earned and deserved the higher value-added scores and all of the accompanying accolades. Perhaps not.
Reckase (2004) reminds us that bias is not always what it seems. For example, sometimes apparent bias is due to a misalignment between the content taught and the content tested. Interestingly enough, however, if Tennessee deserves praise in one area, it could very well be for its strategic plan regarding standards, curriculum, and testing alignment, especially given the state's recent transition from state standards to the Common Core Standards. To accommodate the transition, the Tennessee Department of Education, with the support of a Technical Advisory Committee, revisits the standards and accompanying tests each year to ensure effective alignment (for the detailed transition plans, visit TNCore.org and TN.gov). I also spoke with the aforementioned assistant principal to confirm this was the case, and he agreed that the state has gone to extensive lengths to ensure that what is taught is what is tested, indicating that the problem most likely does not lie with alignment.
Perhaps a more reasonable explanation, though, is that there was/is some unique bias present, possibly related to issues with the vertical scaling of Tennessee's tests, other measurement errors, or some other culprit that is indeterminate at this time. The extreme grade-level differences in ELA, and the lack thereof in mathematics, suggest two potential forms of bias: grade-level bias (i.e., against sixth- and seventh-grade teachers in ELA) and subject-level bias (i.e., in favor of mathematics teachers in general). This indicates that teacher effectiveness is most likely not the sole reason for what are often simplistically positioned as true differences between and among teachers.
Regardless, in Tennessee, as well as in other states and districts across the country, this growth is being attributed to the teachers of students in these grades and subject areas, despite the bias (likely) inherent in these estimates: not only bias based on the types of students taught, which remains a matter of great controversy, but also bias associated with the grade levels and subject areas taught. The latter certainly warrants more research, particularly in terms of the ways we define and think about VAM-based bias and teacher classifications.
REFERENCES
Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System (EVAAS). Educational Researcher, 37(2), 65-75. doi: 10.3102/0013189X08316420
Amrein-Beardsley, A. (2014, February 9). An assistant principal from Tennessee on the EVAAS system. [Blog post]. Retrieved from http://vamboozled.com/an-assistant-principal-from-tennessee-on-the-evaas-system/
Ballou, D. (2012). Review of The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood. [Review of the report The long-term impacts of teachers: Teacher value-added and student outcomes in adulthood, by R. Chetty, J. N. Friedman, & J. E. Rockoff]. Boulder, CO: National Education Policy Center. Retrieved from http://nepc.colorado.edu/thinktank/review-long-term-impacts
Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77-86.
Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national overview. Teachers College Record, 116(1), 1-34. Retrieved from http://www.tcrecord.org/Content.asp?ContentId=17291
Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2012). Selecting growth measures for school and teacher evaluations. Washington, DC: National Center for Analysis of Longitudinal Data in Education Research. Retrieved from www.caldercenter.org/publications/upload/WP-80.pdf
Ehlert, M., Koedel, C., Parsons, E., & Podgursky, M. (2013). The sensitivity of value-added estimates to specification adjustments: Evidence from school- and teacher-level models in Missouri. Statistics and Public Policy, 1(1), 19-27. doi: 10.1080/2330443X.2013.856152
Gabriel, R., & Lester, J. N. (2013). Sentinels guarding the grail: Value-added measurement and the quest for education reform. Education Policy Analysis Archives, 21(9), 1-30. Retrieved from http://epaa.asu.edu/ojs/article/view/1165
Glazerman, S. M., & Potamites, L. (2011). False performance gains: A critique of successive cohort indicators. [Working paper]. Washington, DC: Mathematica Policy Research. Retrieved from http://www.mathematica-mpr.com/~/media/publications/PDFs/Education/False_Perf.pdf
Goldhaber, D., Gabele, B., & Walch, J. (2012). Does the model matter? Exploring the relationship between different student achievement-based teacher assessments. [Panel paper]. Seattle, WA: Center for Education Data and Research. Retrieved from https://appam.confex.com/appam/2012/webprogram/Paper2264.html
Graue, M. E., Delaney, K. K., & Karch, A. S. (2013). Ecologies of education quality. Education Policy Analysis Archives, 21(8), 1-36. Retrieved from http://epaa.asu.edu/ojs/article/view/1163
Haertel, E., Rothstein, J., Amrein-Beardsley, A., & Darling-Hammond, L. (2011). Getting teacher evaluation right: A challenge for policy makers. [Capitol Hill briefing]. Retrieved from http://www.aera.net/AboutAERA/KeyPrograms/EducationResearchandResearchPolicy/AERANAEDHoldSuccessfulBriefingonTeacherEval/VideoRecordingofResearchBriefing/tabid/12327/Default.aspx
Kupermintz, H. (2003). Teacher effects and teacher effectiveness: A validity investigation of the Tennessee Value-Added Assessment System. Educational Evaluation and Policy Analysis, 25, 287-298. doi: 10.3102/01623737025003287
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Education Policy Analysis Archives, 18(23), 1-27. Retrieved from http://epaa.asu.edu/ojs/article/view/810
Papay, J. P. (2010). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163-193. doi: 10.3102/0002831210362589
Paufler, N. A., & Amrein-Beardsley, A. (2014). The random assignment of students into elementary classrooms: Implications for value-added analyses and interpretations. American Educational Research Journal. doi: 10.3102/0002831213508299
Race to the Top Act of 2011, S. 844, 112th Congress. (2011). Retrieved from https://www.govtrack.us/congress/bills/112/s844