
The Artificial Conflation of Teacher-Level “Multiple Measures”

by Tray Geiger & Audrey Amrein-Beardsley - August 04, 2017

In this commentary, the authors introduce the idea of artificial conflation, as predicted by Campbell’s Law, and as defined by how those with power might compel principals to artificially conflate teachers’ observational scores with their value-added scores to purposefully exaggerate perceptions of validity, via the engineering of inflated correlation coefficients between these two indicators over time.

According to Campbell’s Law (1976), the more any social indicator is used for decision-making purposes, the more susceptible it becomes to behaviors that corrupt the indicator itself, and with it the validity of the inferences on which such decisions often rely. Such distortionary practices have been documented across the social science disciplines (Porter, 2015), including in educational contexts, in this case as a result of high-stakes testing.

As students’ test scores became fundamental to large-scale accountability policies and reforms in the late 1990s and early 2000s, such Campbellian behaviors increasingly came to light in popular media and academic outlets, largely originating from within some of America’s highest-needs schools (e.g., the Atlanta Public School Cheating Scandal). Instances included educators narrowing curricula to better teach that which was tested; coaching students to bubble in correct answers during tests, or erasing incorrect answers and replacing them with correct responses thereafter; and excluding, suspending, and expelling students with histories of low performance, preventing them from contributing their low scores to composite measures (e.g., Nichols & Berliner, 2007). In true Campbellian (and also Machiavellian) form, such cavalier behaviors ultimately led to higher test scores, although the resultant scores were artificially inflated (Haladyna, Nolen, & Haas, 1991; Shepard, 1990), individually but also when aggregated at the school, district, and state levels. Consequently, these false indicators of “improved student achievement” eventually faded away, exposed as synthetic and ephemeral.

Notwithstanding, contemporary teacher evaluation systems continue to rely heavily on the same student-level tests (i.e., as per No Child Left Behind (NCLB), 2001), but students’ test scores are now aggregated at the teacher level to hold teachers accountable for their students’ performance (or lack thereof) over time, with high stakes still attached (e.g., tenure, merit pay, termination). This is most often done using value-added models (VAMs), oft-considered the more “objective” of the two primary “multiple measures” being used for increased teacher accountability purposes (Kane & Staiger, 2012; Walsh, Joseph, Lakis, & Lubell, 2017). The second, arguably subservient and oft-considered more “subjective” metric (Weisberg, Sexton, Mullhearn, & Keeling, 2009) comprises rubric-based observational systems meant to help administrators observe and then evaluate teachers’ teaching, live and in practice.


While both of these indicators might perform reasonably well in low-stakes settings (and even there, each has unique measurement issues of its own, e.g., VAM-based issues with reliability and validity, and observation-based issues with observer subjectivity), it is becoming increasingly evident that once the stakes are raised, as many educational policies still require (e.g., in New Mexico, New York, and Tennessee) despite the passage of the Every Student Succeeds Act (ESSA, 2016), the observational measures become susceptible to similar Campbellian distortions, imposed by those with the most power and control (i.e., administrators) upon those with less of both (i.e., teachers).

Accordingly, in this commentary we bring forward a newly coined term for an existing form of score manipulation – artificial conflation – specific to America’s “reformed” or “new-and-improved” teacher evaluation systems (e.g., as per Race to the Top (RTTT), 2011). Artificial conflation occurs when these two measures, which are supposed to be assessed in isolation so as to independently capture unique facets of teaching as per the Standards for Educational and Psychological Testing (American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME), 2014), are forced together to represent one (cor)related construct.

The underlying theory, although few have empirically investigated it (Chin & Goldhaber, 2015), is that there is a construct called “teacher effectiveness” and that these two indicators should map onto this construct in similar ways if, in fact, they are capturing similar construct components. Accordingly, high (or low) values on value-added indicators should be positively correlated with high (or low) values on observational indicators, respectively. If the estimates derived via both do not adequately agree, something likely went awry, with the observational indicator typically to blame (Hanushek, 2009; Walsh et al., 2017; Weisberg et al., 2009). This is the logic propelling the push to force the observational indicator into better alignment with its more “objective” counterpart via artificial conflation. Evidence of this in policy and practice follows.
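To make the statistical mechanics concrete, the following sketch (our own illustration, not drawn from any of the systems discussed here) simulates two noisy indicators of a single latent “teacher effectiveness” construct and shows that pulling observation scores toward VAM scores inflates the correlation between the two indicators without adding any new information about teaching. The teacher count, noise levels, and 50/50 alignment weight are all arbitrary assumptions chosen for illustration.

```python
# Illustration (hypothetical parameters): "aligning" observation scores with
# value-added (VAM) scores raises the correlation between the two indicators
# even though no new evidence about teaching has been gathered.
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


random.seed(42)
n = 500  # hypothetical number of teachers

# One latent construct, measured by two noisy indicators.
effectiveness = [random.gauss(0, 1) for _ in range(n)]
vam = [e + random.gauss(0, 1.5) for e in effectiveness]          # noisier VAM estimate
observation = [e + random.gauss(0, 1.0) for e in effectiveness]  # noisy observation score

r_before = pearson(vam, observation)

# "Artificial conflation": pull each observation score halfway toward the
# teacher's VAM score, as an appraiser pressured to align ratings might.
aligned = [0.5 * o + 0.5 * v for o, v in zip(observation, vam)]
r_after = pearson(vam, aligned)

print(f"correlation before alignment: {r_before:.2f}")
print(f"correlation after alignment:  {r_after:.2f}")  # higher, by construction
```

The point of the sketch is that the post-alignment coefficient rises purely because the observation scores now contain a copy of the VAM scores’ own noise, which is precisely why an engineered correlation exaggerates the perception of validity rather than validity itself.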




In some cases, current state and district policies not only allow but expressly call for artificial conflation. As such, this type of score manipulation is, in essence, being directly stipulated via educational policy.


Evidence of such manipulation stemmed, first, from teachers’ and principals’ reports of principals being obligated to better align teachers’ observational scores with their value-added scores. Collins (2014), for example, evidenced that in the Houston Independent School District (HISD), when teachers’ observational scores did not support their VAM estimates, principals often manipulated teachers’ observational scores to increase their agreement with the VAM-based indicators. While such manipulation seemed preposterous at first glance, it was more recently evidenced that such actions were actually called for within a set of educational policies put into place in Houston. For example, in 2011-2012, 77% of Houston teachers with low VAM-based estimates received the highest or second-highest of four possible ratings on their observations. This was deemed problematic; hence, HISD (2012) noted:


While there is a meaningful, positive relationship between [teachers’ observational] ratings and [their VAM-based estimates], many are not aligning their assessment of classroom practice with student [value-added] growth…[Hence,]…there is significant room for improvement in aligning teacher [observation] scores to [teachers’ VAM-based] outcomes (p. 3).


One year later (2012-2013), the district facilitated more “rating accuracy” to help “at least 85% of [its 287] schools meet the minimum bar for alignment” (HISD, 2013, pp. 3, 11). HISD also helped “identify campuses and appraisers requiring intervention…to ensure [that their] rating accuracy improve[d]” (p. 8). Another year later (2013-2014), the court in an ongoing lawsuit in Houston1 reviewed the alignment data for three schools officially labeled as “at risk for misalignment” (HISD, 2013, pp. 3, 11), also noting evidence that the proportions of teachers with low value-added and high observational scores were significantly higher than the proportions with high value-added and low observational scores, respectively (see Figure 1).

Figure 1. Three HISD schools’ rating alignment matrices with associated descriptive statistics. Note: Teachers’ Instructional Practice (IP) scores = their observational scores, and teachers’ Education Value-Added Assessment System (EVAAS) scores = their value-added scores.




Perhaps the most glaring instance of this, though, occurred in Tennessee. In the 2011-2012 school year, the state implemented a teacher evaluation framework consisting of the same two measures (Poon & Schwartz, 2016). After implementation, however, the Tennessee Department of Education was troubled by the number of teachers who had VAM and classroom observation scores that were “misaligned,” with disproportionate numbers of teachers receiving high observational scores and low VAM-based scores (Poon & Schwartz, 2016). See Figure 2 to view the “shaded boxes [that] indicate scores outside of the acceptable range,” as officially declared by the Tennessee State Board of Education (TSBE, 2012, p. 5).

Figure 2. The “acceptable range” (i.e., non-shaded cells) of alignment between Tennessee Value-Added Assessment System (TVAAS) and observational estimates.


Note: For tables with actual data included within them see Poon and Schwartz (2016).

Thereafter, and in no uncertain terms, TSBE (2012) made it state policy that teachers’ classroom observation scores should be more aligned with their VAM-based scores to reduce the numbers of teachers with misaligned scores. While the policy did not explicitly tell administrators to overtly manipulate teachers’ observation scores to match VAM scores, it did state that the TSBE “expect[ed] to see a relationship between the two measures” (TSBE, 2012, p. 4). When a discrepancy was found, the TSBE also provided guidelines to help administrators ensure they were producing “valid” observation scores (see Figure 3).

Figure 3. An “alignment of scores” instructional table used in Tennessee, as per TSBE (2012) policy (Anonymous, personal communication, October 15, 2015; recreated from a photograph of an official document).



Perhaps of most interest is the underlying assumption that the VAM-based measure is the measure around which the observational measure is to revolve. Or rather, that the VAM-based measure is the objective (and hence, presumably correct) measure with which the subjective (and hence, presumably incorrect) measure is to align, even if alignment requires some synthetic manipulation (i.e., artificial conflation).


Relatedly, educational policies akin to those in Tennessee and Houston exist in other states and districts, although such de facto policies do not always exist in such perspicuous forms. Rather, related expectations have also been evidenced in practice in Arizona (Sloat, 2015), Baltimore (Goldring et al., 2015), and Maryland (Goldring et al., 2015), and perhaps elsewhere of which we are unaware. Likewise, similar bills have been proposed in other states (e.g., Alabama, Georgia), although they have not (yet) been passed.




Regardless of where such policy proposals or practices have been evidenced, of explicit concern is that some policymakers and practitioners do not understand how dangerous such policies and practices might be, especially from a validity perspective. While forcing this alignment might seem sensible to some, it is empirically nonsensical, as also noted in educational measurement standards (AERA et al., 2014).


That is, the aforementioned examples overlook or, worse, distort the validity of the inferences to be drawn about the very constructs these indicators are intended to represent, precisely because the two indicators are being artificially conflated to represent one correlated indicator of teacher effectiveness. Ironic in this regard is that the validity of these indicators, and possibly of the teacher evaluation system as a whole, is not actually increased as a result, although that is the apparent goal. Rather, it is merely the perception of validity that is increased.

Notwithstanding, more research is certainly needed into potential and actual accounts and evidence of artificial conflation. At the very least, conversations with practitioners might allow us to better understand artificial conflation, whether and to what extent it might be occurring, and how best to prevent it from occurring at all, or for much longer.


1. Houston Federation of Teachers (Plaintiff) v. Houston Independent School District (Defendant), United States District Court, Southern District of Texas, Houston Division.



American Educational Research Association (AERA), American Psychological Association (APA), & National Council on Measurement in Education (NCME). (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.


Campbell, D. T. (1976). Assessing the impact of planned social change. Hanover, NH: The Public Affairs Center, Dartmouth College. Retrieved from http://portals.wi.wur.nl/files/docs/ppme/Assessing_impact_of_planned_social_change.pd


Chin, M., & Goldhaber, D. (2015). Exploring explanations for the “weak” relationship between value added and observation-based measures of teacher performance. Cambridge, MA: Center for Education Policy Research (CEPR), Harvard University. Retrieved from http://cepr.harvard.edu/files/cepr/files/sree2015_simulation_working_paper.pdf


Collins, C. (2014). Houston, we have a problem: Teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98), 1–42. Retrieved from http://epaa.asu.edu/ojs/article/view/1594


Every Student Succeeds Act (ESSA) of 2016, Pub. L. No. 114-95, 129 Stat. 1802. (2016). Retrieved from https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf


Goldring, E., Grissom, J. A., Rubin, M., Neumerski, C. M., Cannata, M., Drake, T., & Schuermann, P. (2015). Make room value-added: Principals’ human capital decisions and the emergence of teacher observation data. Educational Researcher, 44(2), 96–104. doi:10.3102/0013189X15575031


Haladyna, T. M., Nolen, S. B., & Haas, N. S. (1991). Raising standardized achievement test scores and the origins of test score pollution. Educational Researcher, 20(5), 2–7. doi:10.2307/1176395


Hanushek, E. (2009). Teacher deselection. In D. Goldhaber & J. Hannaway (Eds.), Creating a new teaching profession (pp. 165–80). Washington, D.C.: Urban Institute Press.


Houston Independent School District (HISD). (2012). HISD Core Initiative 1: An effective teacher in every classroom, teacher appraisal and development System – Year one summary report. Houston, TX. Retrieved from http://www.nctq.org/docs/HISD_Teacher_AD_Implementation_Manual_08222012.pdf


Houston Independent School District (HISD). (2013). Progress conference briefing. Houston, TX. Available upon request from HISD.


Kane, T. J., & Staiger, D. O. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Seattle, WA: Bill & Melinda Gates Foundation. Retrieved from http://files.eric.ed.gov/fulltext/ED540960.pdf


Nichols, S. L., & Berliner, D. C. (2007). Collateral damage: How high-stakes testing corrupts America’s schools. Cambridge, MA: Harvard Education Press.


No Child Left Behind (NCLB) Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425. (2002). Retrieved from http://www.ed.gov/legislation/ESEA02/


Poon, A., & Schwartz, N. (2016, March). Investigating misalignment in teacher observation and value-added ratings. Paper presented at the annual meeting of the Association for Education Finance and Policy, Denver, CO.


Porter, E. (2015, March 24). Grading teachers by the test. The New York Times. Retrieved from http://www.nytimes.com/2015/03/25/business/economy/grading-teachers-by-the-test.html


Race to the Top (RttT) Act of 2011, S. 844--112th Congress. (2011). Retrieved from http://www.govtrack.us/congress/bills/112/s844


Shepard, L. A. (1990). Inflated test score gains: Is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9, 15-22. doi:10.1111/j.1745-3992.1990.tb00374.x


Sloat, E. F. (2015). Examining the validity of a state policy-directed framework for evaluating teacher instructional quality: Informing policy, impacting practice (Unpublished doctoral dissertation). Arizona State University, Tempe, AZ.


Tennessee State Board of Education (TSBE). (2012). Teacher and principal evaluation policy. Nashville, TN: Author. Retrieved from https://www.tn.gov/assets/entities/sbe/attachments/7-27-12-II_C_Teacher_and_Principal_Evaluation_Revised.pdf


Walsh, K., Joseph, N., Lakis, K., & Lubell, S. (2017). Running in place: How new teacher evaluations fail to live up to promises. Washington DC: National Council on Teacher Quality (NCTQ). Retrieved from http://www.nctq.org/dmsView/Final_Evaluation_Paper


Weisberg, D., Sexton, S., Mullhearn, J., & Keeling, D. (2009). The widget effect: Our national failure to acknowledge and act on differences in teacher effectiveness. Brooklyn, NY: The New Teacher Project.

Cite This Article as: Teachers College Record, Date Published: August 04, 2017. https://www.tcrecord.org ID Number: 22120.

About the Author
  • Tray Geiger, Arizona State University
    TRAY GEIGER is a PhD student in Educational Policy and Evaluation at Arizona State University. His research interests include high-stakes testing as related to accountability policy, equity issues, and Critical Race Theory.
  • Audrey Amrein-Beardsley, Arizona State University
    AUDREY AMREIN-BEARDSLEY, PhD, is an Associate Professor at Mary Lou Fulton Teachers College at Arizona State University.