Value-Added Model (VAM) Scholars on Using VAMs for Teacher Evaluation After the Passage of the Every Student Succeeds Act


by Matthew Ryan Lavery, Audrey Amrein-Beardsley, Tray Geiger & Margarita Pivovarova - 2020

Background/Context: The Race to the Top federal initiatives and requirements surrounding waivers of No Child Left Behind promoted expanded use of value-added models (VAMs) to evaluate teachers. Even after passage of the Every Student Succeeds Act (ESSA) relaxed these requirements, allowing more flexibility and local control, many states and districts continue to use VAMs in teacher evaluation systems, suggesting that they consider VAMs a valid measure of teacher effectiveness. Scholars in the fields of economics, education, and quantitative methods continue to debate several aspects of VAMs’ validity for this purpose, however.

Purpose: The purpose of this study was to directly ask the most experienced VAM scholars about the validity of VAM use in teacher evaluation, based on the aspects of validity described in the Standards for Educational and Psychological Testing and found in a review of high-quality peer-reviewed literature on VAMs.

Participants: We invited the 145 scholars listed as an author or coauthor of one or more of the 115 articles on evaluating teachers with VAMs published in prominent peer-reviewed journals between 2002 and the implementation of ESSA in 2016. In this article, we analyze data from 36 respondents (12 economists, 13 educators, and 11 methodologists) who rated themselves as “experienced scholars,” “experts,” or “leading experts” on VAMs.

Research Design: This article reports both quantitative and qualitative analyses of a survey questionnaire completed by experienced VAM scholars.

Findings: Analyses of 44 Likert-scale items indicate that respondents were generally neutral or mixed toward the use of VAMs in teacher evaluation, though responses from educational researchers were more critical of VAM use than were responses from economists and quantitative methodologists. Qualitative analysis of free response comments suggests that participants oppose exclusive or high-stakes use of VAMs but are more supportive of their use as a component of evaluation systems that use multiple measures.

Conclusions: These findings suggest that scholars and stakeholders from different disciplines and backgrounds think about VAMs and VAM use differently. We argue that it is important to understand and address stakeholders’ multiple perspectives to find the common ground on which to build consensus.

Former President Obama’s Race to the Top (2011) initiative spurred states and districts to explore alternative methods to hold teachers accountable for their measurable impacts on students’ growth in achievement over time. This was primarily done using value-added models (VAMs) or their growth model counterparts (e.g., the student growth percentiles model; Betebenner, 2011), hereafter referred to more generally as VAMs. Race to the Top (2011) stipulated that (1) measuring teachers’ impacts on their students’ growth over time was “to be weighted as a significant factor” should states receive Race to the Top funds, and (2) teacher accountability “measures must be rigorous (i.e., statistically rigorous) and comparable across classrooms in a district or across classrooms statewide” (U.S. Department of Education, 2010). Ultimately, federal monies totaling $4.35 billion (Duncan, 2009) were awarded to states that adopted teacher accountability measures such as VAMs. Soon thereafter, the federal government reinforced these initiatives when it also required states to adopt stronger teacher accountability systems in order to secure waivers excusing states from meeting the No Child Left Behind (2002) goal that 100% of the students across states would be academically proficient by the year 2014 (see also Dillon, 2010; Duncan, 2011; Layton, 2012).


Before the passage of the Every Student Succeeds Act (ESSA, 2016), which undid many of the federal efforts noted earlier and afforded states and districts more local control over their teacher evaluation and accountability systems, 44 states and the District of Columbia had adopted and implemented such VAM-based policies (Collins & Amrein-Beardsley, 2014; see also Banchero & Kesmodel, 2011; National Council on Teacher Quality, 2013). While the passage of ESSA (2016) has consequently curbed the extent to which states and districts are adopting and implementing VAMs for teacher evaluation purposes, especially statewide (e.g., in Alaska, Arkansas, California, Kansas, Oklahoma, and Texas), VAMs are still playing substantial roles in other states and districts (e.g., in Colorado, Florida, New Mexico, North Carolina, Maine, and Utah; Close et al., 2018; ExcelinEd, 2017).


In the simplest of terms, VAMs are designed to measure the amount of value that a teacher adds to (or does not add to) students’ growth on large-scale standardized achievement tests over the course of each school year. This is done while controlling for students’ prior testing histories; some VAMs also control for student-level variables (e.g., demographics, English language learner status, special education status, racial identity) and school-level variables (e.g., class size, school-level demographics, intervention programs), though the control variables used vary by model. Regardless of the model particulars and specifications, measuring teachers’ value-added is meant to allow for richer analyses of teachers’ causal impacts on their students’ growth in standardized test scores over time because groups of students are followed to assess their learning trajectories from the time students enter a teacher’s classroom to the time they leave. In practice, however, whether these models work as intended is still debated in the scholarly literature in the fields of economics, education, and quantitative methods, as revealed in the findings of a recent systematic review of literature on validity evidence that supports or challenges using VAMs to evaluate teachers (Lavery et al., 2019).
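
To make this general specification concrete, the sketch below expresses a minimal covariate-adjustment VAM as an ordinary least squares regression of students’ current scores on their prior scores, student covariates, and teacher indicators. The column names and the use of Python’s statsmodels package are illustrative assumptions; this is a sketch of the general idea, not the specification of any particular model discussed in the literature reviewed here.

```python
# A minimal covariate-adjustment VAM sketch (illustrative only): regress current
# test scores on prior scores, student covariates, and teacher indicators, then
# treat the estimated teacher fixed effects as "value-added" estimates.
# Column names (score, prior_score, frl, ell, teacher_id) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def estimate_value_added(df: pd.DataFrame) -> pd.Series:
    model = smf.ols(
        "score ~ prior_score + frl + ell + C(teacher_id)", data=df
    ).fit()
    # Extract the coefficients on the teacher indicator variables.
    teacher_effects = model.params.filter(like="C(teacher_id)")
    return teacher_effects.sort_values()
```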


PURPOSE OF THE STUDY


Because VAM use in teacher evaluation may lead to consequential personnel actions (e.g., merit pay, professional development, promotion, remediation, tenure, or dismissal), it is critical to examine whether VAMs consistently produce accurate estimates of teacher effectiveness to support valid inferences and decisions. Hence, it is important to understand the informed perspectives of those scholars with well-founded and intimate knowledge of VAMs on the topic of VAM use. This was the purpose of the present study. Researchers sought the expert opinions of the primary set of authors from the fields of economics, education, and measurement who have written about and researched VAMs to help others understand the empirical and pragmatic issues surrounding their use. Researchers invited the 145 scholars who (co)authored the 115 most influential peer-reviewed articles about VAMs and their use to evaluate teachers within preK–12 schools (participant selection procedures are discussed in the Methods section).


CONCEPTUAL FRAMEWORK


The jointly published Standards for Educational and Psychological Testing (American Educational Research Association [AERA] et al., 2014, henceforth referred to as the Standards) state that “it is the interpretations of test scores for proposed uses that are [validated], not the test itself” (p. 11). The Standards further describe validation as “a process of constructing and evaluating arguments for and against the intended interpretation of test scores” that “logically begins with an explicit statement of the proposed interpretation of test scores, along with a rationale for the relevance of the interpretation to the proposed use” (AERA et al., 2014, p. 11). M. T. Kane (2013) called this statement the interpretation/use argument (IUA), which “includes all of the claims based on the test scores (i.e., the network of inferences and assumptions inherent in the proposed interpretation and use)” (p. 2).


VALIDITY EVIDENCE


The Standards (AERA et al., 2014) describe five different sources of validity evidence (specifically, evidence based on test content, response processes, internal structure, relations to other variables, and related consequences) that may be collected to support (or challenge) arguments for (or against) the proposed IUA. In addition to these sources of validity evidence, the Standards (AERA et al., 2014) also discuss the reliability/precision of test scores (which are addressed “as an independent characteristic of test scores,” but one that “has implications for validity,” p. 34) and fairness (which is “a fundamental validity issue [that] requires attention throughout all stages of test development and use,” p. 49). This focus on validity and validation led us to examine the specific validity evidence that supports or challenges the use of VAMs to evaluate teachers in a systematic review of the VAM articles that have been published in respected, influential peer-reviewed journals in economics, education, and quantitative methods (Lavery et al., 2019). In the present study, we use this same lens to survey authors of the literature reviewed about the validity of using VAMs to evaluate preK–12 teachers. We developed a survey instrument that focuses on the specific issues that authors of articles in the aforementioned review discussed in their papers. Although a detailed discussion of the validity evidence found in high-quality peer-reviewed literature is beyond the scope of the current article, the Survey Instrument section highlights some of the disagreement found in the literature that led us to directly ask VAM scholars about the validity concerns reported here.


ECONOMISTS, EDUCATORS, AND THEIR LENSES


In the scholarship on VAMs, many of the strongest supporters of VAM use are economists (e.g., Chetty et al., 2014a, 2014b; Hanushek, 1971, 1979, 2011; T. J. Kane & Staiger, 2002, 2008, 2012), whereas many of the strongest critics of VAM use are educators (e.g., Berliner, 2013, 2014; Briggs & Domingue, 2011; Newton et al., 2010; Papay, 2011, 2012). Economics is often viewed, by those both within and outside the field, as the most empirically rigorous and trustworthy of the social sciences and has therefore become very influential in public policy (Fourcade et al., 2015; Lazear, 1999, 2001). Economists must often create parsimonious mathematical models to identify patterns and trends that lie “beneath the noise” of complex and dynamic social systems, a practice that lends itself to a large-scale, big-picture lens. Conversely, educators are charged to teach every child, differentiating instruction for learners’ unique backgrounds, strengths, and needs to ensure that no child is left behind. Thus, the nature of educators’ work lends itself to a more contextualized lens that considers the unique makeup of each school and classroom, the uniqueness of each teacher and student, and the unique interactions among the nearly infinite combinations and permutations of each. The distinct disciplinary lenses through which VAM scholars approach their work may lead them to build that work on very different assumptions (for a discussion from the education perspective, see Amrein-Beardsley & Holloway, 2017). Scholars in quantitative methodology, in addition to those in economics and education, have also contributed to the VAM literature and may have their own unique lenses through which to approach their work. Methodologists’ perspectives may also be shaped by the types of studies they tend to conduct and the specific applied fields they tend to examine. Thus, a secondary purpose of the present study is to identify the specific validity concerns about which scholars from economics, education, and quantitative methodology hold different views.


METHODS


We employed a survey research method for this study to simultaneously collect similar data from multiple respondents (Babbie, 1990; Blair et al., 2014; Scholl et al., 2002; Schonlau et al., 2001; Shannon et al., 2002). Researchers chose this approach for purposes of convenience, versatility, and efficiency and to permit access to study respondents via a simple, suitable, and time-invariant technology (Check & Schutt, 2012; Evans & Mathur, 2005; Weisberg, 2005).


SCHOLAR AND JOURNAL SAMPLE


To identify possible participants in this study, we compiled a list of all scholars who authored or coauthored articles on VAMs in teacher evaluation published in a high-quality peer-reviewed journal between 2002 and 2016. We consulted the 2016 Journal Citation Reports Social Sciences Edition (JCR; Clarivate Analytics, 2017) to select journals of sufficient scientific quality and rigor. Specifically, we considered the four key JCR metrics indicating journals’ impact factors, 5-year impact factors, Eigenfactor Scores, and Article Influence Scores to select journals for inclusion in this study. We considered journals that JCR listed in “Education & Educational Research” (235 journals), in “Economics” (347 journals), in “Psychology, Educational” (58 journals), or in “Social Sciences, Mathematical Methods” (49 journals) categories. Because several journals appeared in more than one JCR category, the combined list contained 644 journals considered for qualification. Articles published in journals that fell in the top quintile (i.e., 20%) of journals in their categories, or in the top quintile of the combined list of journals on any of the four key JCR metrics, qualified for inclusion. The final list contained 238 journals that researchers ultimately included in the study, representing 37% (238/644) of the total journals considered for inclusion.
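
As an illustration of the qualification rule just described, the following sketch flags a journal as qualifying if it falls in the top quintile of its JCR category, or of the combined list, on any of the four metrics. The data frame layout and column names are hypothetical assumptions, not the actual JCR export used in the study.

```python
# Illustrative sketch of the journal-qualification rule: a journal qualifies if it
# falls in the top 20% of its JCR category, or of the combined list, on any of the
# four metrics. Data frame layout and column names are hypothetical.
import pandas as pd

METRICS = ["impact_factor", "impact_factor_5yr", "eigenfactor", "article_influence"]

def qualifies(journals: pd.DataFrame) -> pd.Series:
    """journals has one row per journal-category pair plus the four metric columns."""
    in_top_quintile = pd.Series(False, index=journals.index)
    for metric in METRICS:
        # Top quintile within each JCR category ...
        within_category = journals.groupby("category")[metric].rank(pct=True) >= 0.80
        # ... or within the combined list of all journals considered.
        overall = journals[metric].rank(pct=True) >= 0.80
        in_top_quintile |= within_category | overall
    # A journal qualifies if any of its category listings meets the rule.
    return in_top_quintile.groupby(journals["journal"]).any()
```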


We then used the EBSCOhost online research engine (EBSCO Industries, 2018) to search the Business Source Complete, Education Full Text, Education Research Complete, Education Resources Information Center (ERIC), and PsycINFO databases for articles containing variations of the term “value-added,” some mention of teachers, and some variation of the words “evaluation,” “effective,” or “quality.” After limiting the journal set to academic journals and removing duplicates, the search returned 582 records (81 from Business Source Complete, 211 from Education Full Text, 322 from Education Research Complete, 359 from ERIC, and 122 from PsycINFO, with several articles returned by more than one database). We then limited results to articles published in one of the aforementioned 238 qualified, rigorous peer-reviewed journals. We also examined the reference lists of two recent reviews written about VAMs (Everson, 2017; Koedel et al., 2015) and included any article in either review that was also published in one of the 238 qualified journals.


Subsequently, 135 articles remained from all sources considered. As per the PRISMA statement (Moher et al., 2009), we then examined the titles and abstracts of these articles and included only articles in which scholar(s) discussed the suitability of VAMs for use in teacher evaluation. We excluded articles that only addressed rating or evaluating schools or school leaders and any articles in which VAM scores were used as a covariate or a qualification criterion for inclusion in a study on a different topic. The final list of included articles consisted of 115 articles (co)authored by 145 scholars. Both the list of 115 qualified articles (Online Supplement A) and the list of the 145 VAM scholars who authored them (Online Supplement B) are available at https://scholarworks.bgsu.edu/seflp_pubs/16/. It was this set of 145 scholars whom we invited to participate in this survey research study.


Table 1 summarizes the number of invited VAM scholars by the number of qualifying articles they (co)authored and by whether they were listed as first author of at least one qualifying article. Specifically, 103 of the 145 invited scholars (71%) were listed on only one qualifying article, while 12 invited scholars (8%) were listed as author or coauthor on four or more qualifying articles. In addition, 71 of the 145 invited scholars (49%) were listed as the first author or sole author of at least one qualifying article, while 74 scholars (51%) were only listed in the second author position or later. The distribution of authorship summarized in Table 1 suggests that VAM expertise varies among the VAM scholars invited to participate in this survey study. For example, it is reasonable to expect that a scholar who has published four or more articles about VAM use in teacher evaluation in rigorous, reputable journals (and who has first-authored at least one of them) has more knowledge and expertise about this topic than a scholar who has been listed as second author or later on only one such article. First authorship does not necessarily indicate the author with the greatest VAM expertise or the one who made the greatest contribution to the publication in question, however. Many of the VAM scholars invited to participate in this study are authors of papers published in the field of economics, in which it is not uncommon for the authors of a paper to make equivalent contributions and be listed alphabetically. Thus, we asked respondents to self-report their levels of VAM expertise on the survey instrument, and we only analyzed the responses of those who identified as experienced scholars or experts (described next).


Table 1. Number of VAM Scholars Invited to Participate by Number of Qualifying Articles and First Author or Coauthor Status

Number of Articles   First Author   Coauthor   Total
> 3                            11          1      12
3                               6          3       9
2                              13          8      21
1                              41         62     103
Total                          71         74     145


SURVEY INSTRUMENT


We constructed and administered a survey for the present study, aligned with the sources of validity evidence discussed in the Standards (AERA et al., 2014) as well as the fundamental validity issues of reliability/precision and fairness that the Standards discuss. The survey instrument was organized into pages of items addressing the following themes: respondents’ VAM expertise and perspective; VAM reliability/precision; bias; validity evidence based on content, related variables, intended consequences, and unintended consequences; and overall thoughts on VAMs and VAM use on a final page. Each page contained items in which respondents indicated the degree to which they agreed or disagreed with a series of statements about VAMs and VAM use on a 5-point Likert scale. Each page is described next with a brief discussion of its connection to the VAM literature. Interested readers may find the full instrument in Online Supplement C (available at https://scholarworks.bgsu.edu/seflp_pubs/16/).


VAM Expertise and Perspective


We acknowledge that not all invited authors contributed equally to the articles that qualified them for inclusion in this study. Clearly, coauthors can fill a wide variety of roles and make a wide variety of contributions to any research team. Because the survey items included in this study addressed some of the more detailed and nuanced concerns about VAMs, the first page asked about expertise and perspective. Respondents were asked to report the role(s) they filled for the article(s) that qualified them for inclusion in the study (e.g., wrote all or part of the paper, conducted the literature review, analyzed data) and to rate their own expertise on VAMs and VAM use on a 5-point Likert scale, where 5 = leading expert, 4 = expert, 3 = experienced scholar, 2 = scholar, and 1 = contributor (levels of expertise are defined in the Findings section). Respondents then identified their primary field as economics, education, or quantitative methodology and rated their expertise in all three of those fields on a Likert scale similar to the one used for VAM expertise, except that these items included an option of zero to indicate that the respondent had no background in that field.


Reliability/Precision


The second page of the survey contained items related to the reliability and precision of the teacher effectiveness estimates that VAMs produce. For the purposes of this survey, we define reliability as the Standards define it, a condition in which scores “are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and consistent” (AERA et al., 2014, pp. 222–223). Likewise, precision is defined as the degree to which scores are free from “the impact of measurement error on the outcome of the measurement” (AERA et al., 2014, p. 222). A review of the VAM literature indicates little disagreement about the year-to-year stability of VAM scores (Lavery et al., 2019). That is, most authors agree, as Koedel et al. (2015) reported, that “the year-to-year correlation in estimated teacher value-added . . . range[s] from 0.18 to 0.64” (p. 186) but differ in their assessment of whether VAM scores are stable enough for evaluative purposes. In light of these findings, we framed the survey items in this section to ask VAM scholars whether VAMs produce teacher effectiveness estimates that are stable enough or precise enough to sort, rank, and evaluate teachers, whether with or without consequences.
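
For readers who want the stability statistic referenced above made concrete, the sketch below computes the year-to-year correlation of teacher value-added estimates from a long-format table of estimates. The layout and column names are hypothetical assumptions rather than any specific study’s data structure.

```python
# Illustrative sketch: year-to-year stability of value-added estimates, computed as
# the correlation between each teacher's estimate in adjacent years. The long-format
# layout (teacher_id, year, vam_estimate) is a hypothetical assumption.
import pandas as pd

def year_to_year_stability(estimates: pd.DataFrame) -> pd.Series:
    wide = estimates.pivot(index="teacher_id", columns="year", values="vam_estimate")
    years = sorted(wide.columns)
    return pd.Series(
        {
            (y0, y1): wide[y0].corr(wide[y1])  # Pearson r, pairwise-complete
            for y0, y1 in zip(years, years[1:])
        }
    )
```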


Bias


Bias is defined by the Standards as “systematic error in a test [e.g., VAM] score” (AERA et al., 2014, p. 216). Bias is often associated with construct irrelevant variance (CIV), which is in turn defined as variance “that is attributable to extraneous factors that distort the meaning of the scores and thereby decrease the validity of the proposed interpretation” (AERA et al., 2014, p. 217). The items in this section asked respondents whether VAM scores are generally free from bias or are confounded by other sources of variance. Additional items asked about some specific sources of bias discussed in the VAM literature, such as nonrandom assignment of teachers and students to classrooms (Condie et al., 2014; Everson et al., 2013; Johnson et al., 2015; Rothstein, 2009), the inclusion or exclusion of student-level covariates in the model (Ballou et al., 2004; McCaffrey et al., 2004), and the effects of missing data (Amrein-Beardsley, 2008).


Validity Evidence Based on Test Content


The Standards state that “important validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure” (AERA et al., 2014, p. 14). In the case of VAMs, however, estimates of teacher effectiveness are derived from student tests rather than produced by them. As Nye et al. (2004) stated, “The effects of . . . teacher effectiveness are expected to be largest when the content covered during instruction is closely aligned with . . . student achievement measures” (p. 253). Because the VAM literature includes studies that investigate differences in VAM scores derived from different achievement tests (e.g., Grossman et al., 2014; Lockwood et al., 2007; Papay, 2011), the survey included items that ask about the alignment of the student achievement tests used to generate VAM scores and whether such tests are able to measure the contribution of individual teachers to student learning gains. The survey also contained items asking whether VAMs are accurate when nonadjacent grade levels are used as predictors (e.g., using Grade 4 science tests to estimate Grade 8 science teachers’ value-added) and when different content areas are used as predictors (e.g., using prior reading or mathematics test scores to estimate science or social studies teachers’ value-added).


Validity Evidence Based on Relations to Other Variables


The Standards (AERA et al., 2014) include convergent and discriminant evidence, as well as test-criterion relationships, under the umbrella of validity evidence based on relations to other variables. The VAM literature includes multiple studies that investigate the relationship between teacher VAM scores and other variables, such as student surveys, instructional observations, teacher content knowledge, supervisor evaluations, and measures of students’ future academic and socioeconomic outcomes (e.g., Chetty et al., 2014a, 2014b; Grossman et al., 2014; Harris et al., 2014; Koedel et al., 2015). We developed the survey to include items asking whether the relationships between VAM scores and the aforementioned variables provide convincing evidence for their use in teacher evaluation. We also included two items asking whether VAMs measure a different construct than surveys, observations, and supervisor evaluations and thus need not correlate with them (see Chin & Goldhaber, 2015; Harris et al., 2014).


Validity Evidence Based on Related Consequences


The Standards (AERA et al., 2014) discuss evaluating whether a given interpretation and use of test scores produces its intended outcomes, as well as evaluating the unintended outcomes of score use. M. T. Kane (2013) wrote that “a decision rule that achieves its goals at an acceptable cost and with acceptable consequences is considered a success. A decision rule that does not achieve its goals or has unacceptable consequences is considered a failure” (p. 47). To collect the input of VAM scholars, the survey included one page of items that ask whether VAM use has led to the intended consequences (such as increasing instructional quality and student outcomes) and another page of items that ask about unintended consequences (such as teachers leaving the profession, reducing collaboration, or lowering morale). Fully half of the articles included in our review of VAM literature addressed the intended or unintended consequences of VAM use in teacher evaluation; therefore, those details are not discussed further here (see Lavery et al., 2019).


Overall Thoughts About VAM Use


The final pages of the survey asked respondents to rate the overall use of VAMs in teacher evaluation and asked some free response questions about VAM use. Likert scale items asked whether VAMs should be used to evaluate teachers, and which teachers, as well as what kinds of consequences should be attached to VAM use. Free-response items asked for the scholar’s recommendations for a state or district considering the use of VAMs in teacher evaluation and recommendations to the secretary of education.


QUANTITATIVE DATA ANALYSES


We analyzed participants’ responses to Likert scale items to understand the opinions of experienced VAM scholars (i.e., the responses of those who self-identified as experienced scholars, experts, or leading experts) on the measurement topics and concepts described earlier. Because the population of interest in this study was too specialized to permit piloting the survey instrument before formal use, researchers conducted psychometric analyses of Likert scale survey items to examine both item performance and the underlying structure of the survey. More specifically, we conducted exploratory factor analyses (EFAs) using maximum likelihood estimation and promax rotation with Kaiser normalization. Oblique rotation was appropriate because each page of items was written, based on the Standards (AERA et al., 2014), to measure similar concerns, suggesting that underlying factors would be expected to correlate. We also analyzed the internal consistency of resulting subscales to verify that they performed acceptably well (i.e., Cronbach’s α ≥ .80) and to check for underperforming items. Only one item was removed from analysis for poor performance (Item 32; see Table 5 or Online Supplement C, available at https://scholarworks.bgsu.edu/seflp_pubs/16/). Researchers examined the means and standard deviations of each subscale across respondents and by field to better understand respondents’ thoughts on VAMs and VAM use, and they ran analyses of variance with an alpha level of .05 to test for significant differences across respondents’ fields of study.
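
A minimal sketch of this analytic workflow appears below: Cronbach’s alpha computed from the standard formula, a one-way analysis of variance with partial eta squared, and a maximum likelihood EFA with promax rotation. The column names, data layout, and use of the third-party factor_analyzer package are assumptions for illustration; this is not the study’s actual analysis code.

```python
# Illustrative sketch of the quantitative workflow described above: Cronbach's alpha,
# a one-way ANOVA across primary fields with partial eta squared, and an ML/promax EFA.
# Column names, data layout, and the factor_analyzer package are assumptions.
import pandas as pd
from scipy import stats
from factor_analyzer import FactorAnalyzer


def cronbach_alpha(items: pd.DataFrame) -> float:
    """Standard alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)


def anova_by_field(subscale: pd.Series, field: pd.Series):
    """One-way ANOVA of subscale scores across fields, with partial eta squared."""
    groups = [subscale[field == f].dropna() for f in field.unique()]
    f_stat, p_value = stats.f_oneway(*groups)
    df_between = len(groups) - 1
    df_within = sum(len(g) for g in groups) - len(groups)
    partial_eta_sq = (f_stat * df_between) / (f_stat * df_between + df_within)
    return f_stat, p_value, partial_eta_sq


def efa_loadings(items: pd.DataFrame, n_factors: int = 2) -> pd.DataFrame:
    """EFA of one page of Likert items (maximum likelihood estimation, promax rotation)."""
    fa = FactorAnalyzer(n_factors=n_factors, rotation="promax", method="ml")
    fa.fit(items.dropna())
    return pd.DataFrame(fa.loadings_, index=items.columns)
```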


QUALITATIVE DATA ANALYSES


Working together during synchronous face-to-face meetings, members of the research team jointly read, reviewed, and coded all participant responses to all open-ended questions included within and at the end of the survey instrument. We analyzed the qualitative data using the concepts and methods of grounded theory (Glaser & Strauss, 1967; Strauss & Corbin, 1998), engaging in two rounds of “constant comparison” per question while coding and collapsing responses into first-, second-, and third-level coding schemes (Glaser & Strauss, 1967).


More specifically, we collectively discussed each participant’s response to each question one at a time, during which we co-constructed codes to capture each of the responses respectively, unless we collectively deemed participants’ responses to be nonsensical or not of direct relevance or use (e.g., thank you messages for inviting scholars to participate in the study). We then co-constructed larger codes into which we collapsed our first- and second-level codes for conclusion-drawing purposes. We used a code-calculation spreadsheet to facilitate the quantification of the qualitative data collected (e.g., by generating frequencies and percentages representing responses; see Figure 1 for an example of a code-calculation spreadsheet) to best represent respondents’ convergent and divergent ideas, notions, and major and minor themes (Miles & Huberman, 1994; Miles et al., 2014). Engaging in this process also facilitated our analysis of participants’ responses by area of expertise (e.g., economics, education research, quantitative methodology). We also coded and summarized respondents’ opinions and perceptions based on these classifications.
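
To illustrate the quantification step described above, the sketch below tallies how often each qualitative code was applied, overall and by respondents’ primary field, and converts counts to percentages. The long-format layout is a hypothetical assumption and does not reproduce the research team’s code-calculation spreadsheet.

```python
# Illustrative sketch of a code-calculation tally: count how often each qualitative code
# was applied, overall and by respondents' primary field, then convert to percentages.
# The long-format layout (respondent_id, field, code) is a hypothetical assumption.
import pandas as pd

def tally_codes(coded_responses: pd.DataFrame) -> pd.DataFrame:
    counts = (
        coded_responses.groupby(["code", "field"]).size().unstack(fill_value=0)
    )
    counts["total"] = counts.sum(axis=1)
    # Percentage of all coded responses represented by each code.
    counts["pct_of_responses"] = 100 * counts["total"] / len(coded_responses)
    return counts.sort_values("total", ascending=False)
```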




Figure 1. Example of a code calculation spreadsheet

FINDINGS


Of the 145 VAM scholars invited to participate in this study, 40 responded. As mentioned, however, we only analyzed the responses of participants who rated their VAM expertise at the level of experienced scholar or higher to ensure that the findings of this study represented the informed opinions of only the most knowledgeable VAM scholars. Of the 40 respondents, one identified as a contributor (“I have little knowledge of VAMs beyond my contribution to the article[s] listed, for which I functioned mainly in a supporting role”), and three identified as scholars (“I have enough knowledge on the topic of VAMs to feel comfortable speaking about them but may not have been the most knowledgeable among the coauthors of the paper[s] listed”). We did not include the responses of these four respondents in our analyses. We analyzed the data contributed by the 16 respondents who self-identified as experienced scholars (“I have a great deal of knowledge and expertise about VAMs and have done substantial work in the area, but VAMs are not one of my primary areas of expertise or my work in this area is not yet fully developed”), the 12 who identified as experts (“VAMs are one of my primary areas of expertise; I have a great deal of knowledge on the subject, but there are other scholars that I consider to be the leading experts in this area”), and the eight who identified as leading experts (“I am among the most knowledgeable on this topic, known in the field for my scholarly contributions”). Although the overall response rate was low, the data included and analyzed represent a fairly balanced sample of 36 responses, consisting of 12 economists, 13 educators, and 11 methodologists (see Table 2). Given that these responses came from self-identified experienced VAM scholars, VAM experts, and leading VAM experts, analyses of these findings may still shed some light on the consensus of the most knowledgeable and well-informed VAM scholars.


Table 2. Number of Responses Analyzed by Self-Reported VAM Expertise and Primary Field

Respondent Primary Field   Experienced Scholar   Expert   Leading Expert   Total
Economics                                    3        4                5      12
Education                                    9        3                1      13
Quantitative Methods                         4        5                2      11
Total                                       16       12                8      36



We present our findings by construct next, organized by the Likert scale items included within each construct and the open-ended items included at the end of each construct. Table 3 displays the subscales that emerged from the Likert scale items included in the survey instrument, the items contributing to each scale, Cronbach’s α values, and means and standard deviations across respondents. Results of the EFAs suggest that three of the survey pages each measured two underlying constructs (reported in more detail next). Thereafter, we present findings pertaining to the open-ended questions, which we also discuss by discipline.


Table 3. Subscales Derived From Survey Instrument, Items Included, and Scale Descriptives

Subscale                       Items^a           Cronbach’s α    n    M      SD
Reliability and Precision      1–5               .91             35   2.88   1.18
Freedom From Bias              6–9               .91             34   2.21   1.11
Robust to Missing Data         10, 11            .88             31   1.58   0.72
Evidence Based on Content      12–16             .88             32   2.30   0.95
Relations to Other Variables   17, 18, 20        .91             30   2.70   1.13
Measure Something Different    19, 21            .88             29   3.00   1.21
Intended Consequences          22–26             .93             33   2.55   1.02
Unintended Consequences^b      27–31, 33–35      .91             29   2.65   0.81
VAM Use                        36, 37, 40–44     .92             33   2.66   1.08
VAMs for All Teachers          38, 39            .82             33   1.74   1.05
Overall VAM Support            All^c             .98             38   2.58   0.85


Note. VAM(s) = value-added model(s). Sample sizes reported for each subscale represent the number of respondents who completed the items included in that subscale. The Overall VAM Support subscale is calculated as the mean of all completed items for each respondent, regardless of how many pages of the survey that respondent completed. Means and standard deviations are calculated across all responses to 5-point Likert scale items in which respondents indicated their level of agreement with statements supporting the valid use of VAMs to evaluate teachers, where 1 = strongly disagree and 5 = strongly agree.

a Item numbers listed correspond to the item numbers displayed in Table 5. b Items in the Unintended Consequences subscale are negatively phrased (i.e., challenge the valid use of VAMs to evaluate teachers) and reverse coded such that 1 = strongly agree and 5 = strongly disagree. c Item 32 performed poorly and was excluded from all analyses, including the Overall VAM Support scale.


On average, across all responses and items, VAM scholars were either neutral or mixed toward the valid use of VAMs in teacher evaluation (M = 2.58, SD = 0.85). Overall VAM Support differed among respondents by their primary field, F(2, 32) = 3.80, p = .033, with partial η² = .19 (a large effect; Cohen, 1988). Respondents who identified as economics scholars showed higher support for VAMs and VAM use (M = 3.02, SD = 0.75) than those who identified as education scholars (M = 2.16, SD = 0.92). No other significant differences were observed between groups on overall VAM support. Table 4 displays the means and standard deviations for all subscales by reported field of study, while Table 5 displays mean responses for each individual item in the survey, both overall and by primary field.
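
As a check on the effect size reported above, partial eta squared for a one-way analysis of variance can be recovered directly from the F statistic and its degrees of freedom; the brief sketch below reproduces the reported value of .19 from F(2, 32) = 3.80.

```python
# Recover partial eta squared from the reported F statistic and degrees of freedom:
# partial eta^2 = (F * df_between) / (F * df_between + df_within).
def partial_eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
    return (f_stat * df_between) / (f_stat * df_between + df_within)

print(round(partial_eta_squared(3.80, 2, 32), 2))  # 0.19, matching the reported value
```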


Table 4. Analyses of Variance Results by Reported Primary Field Across Survey Subscales

Subscale                       Economics           Education           Quant. Methods      Test of Mean Differences
                               n    M     (SD)     n    M     (SD)     n    M     (SD)     F (df)         p      η²
Reliability and Precision      12   3.48  (1.03)   13   2.32  (1.31)   10   3.16  (0.88)   3.65 (2, 32)   .037   .19
Freedom From Bias              11   2.70  (0.97)   11   1.77  (1.03)   10   2.28  (1.27)   2.01 (2, 29)   .153   .12
Robust to Missing Data          9   1.83  (0.75)   10   1.65  (0.91)   10   1.40  (0.46)   0.84 (2, 26)   .442   .06
Evidence Based on Content      11   2.62  (0.72)   11   1.67  (0.97)   10   2.72  (0.97)   4.59 (2, 29)   .019   .24
Relations to Other Variables   11   3.12  (1.07)    9   1.67  (0.80)   10   3.17  (0.92)   7.69 (2, 27)   .002   .36
Measure Something Different    11   3.09  (1.16)   10   2.90  (1.35)   10   3.30  (1.11)   0.27 (2, 28)   .762   .02
Intended Consequences          11   3.12  (0.76)   10   1.86  (1.19)   10   2.77  (0.74)   5.22 (2, 28)   .012   .27
Unintended Consequences^a       8   2.75  (0.89)    9   2.24  (0.84)   10   2.75  (0.59)   1.30 (2, 24)   .291   .10
VAM Use                        11   3.31  (0.87)   10   1.89  (0.98)    9   3.08  (0.84)   7.35 (2, 27)   .003   .35
VAMs for All Teachers          11   2.09  (0.94)   10   1.10  (0.32)    9   2.22  (1.46)   3.74 (2, 27)   .037   .22
Overall VAM Support            12   3.02  (0.75)   13   2.16  (0.92)   10   2.74  (0.67)   3.80 (2, 32)   .033   .19


Note. VAM(s) = value-added model(s). Sample sizes reported for each subscale represent the number of respondents who completed the items included in that subscale. The Overall VAM Support scale is calculated as the mean of all completed items for each respondent, regardless of how many pages of the survey that respondent completed. Statistically significant tests of mean differences are displayed in bold type. Means and standard deviations are calculated across all responses to 5-point Likert-scale items in which respondents indicated their level of agreement with statements supporting the valid use of VAMs to evaluate teachers, where 1 = strongly disagree and 5 = strongly agree.

a Items in the Unintended Consequences subscale are negatively phrased (i.e., challenge the valid use of VAMs to evaluate teachers) and reverse coded, such that 1 = strongly agree and 5 = strongly disagree


Table 5. Mean Responses for Individual Survey Items Across All Respondents and by Reported Primary Field

Values are reported as n, M for all respondents (All) and by primary field (Econ., Educ., Quant.); item numbers^a precede the abbreviated item text.

Reliability and Precision (subscale): All 35, 2.9; Econ. 12, 3.5; Educ. 13, 2.3; Quant. 10, 3.2
1. VAMs are reliable enough to be used for summative evaluation purposes: All 35, 3.3; Econ. 12, 4.1; Educ. 13, 2.5; Quant. 10, 3.5
2. VAMs are reliable enough to be used for summative evaluation with high-stakes consequences (hiring, firing, merit pay, & other rewards and sanctions): All 35, 2.8; Econ. 12, 3.6; Educ. 13, 2.1; Quant. 10, 2.9
3. VAMs are reliable enough to provide meaningful rankings of teachers (from highly effective to highly ineffective teachers): All 35, 2.9; Econ. 12, 3.6; Educ. 13, 2.3; Quant. 10, 2.9
4. VAMs are precise enough to identify teachers in the tails of the effectiveness distribution (either highly effective or highly ineffective teachers): All 35, 3.3; Econ. 12, 3.6; Educ. 13, 2.5; Quant. 10, 4.0
5. VAMs are precise enough to identify teachers in the middle of the distribution (neither highly effective nor highly ineffective teachers): All 35, 2.4; Econ. 12, 2.6; Educ. 13, 2.2; Quant. 10, 2.5

Freedom From Bias (subscale): All 34, 2.2; Econ. 11, 2.7; Educ. 11, 1.8; Quant. 10, 2.3
6. VAMs are unbiased, independent of the ways teachers and students are assigned to classrooms: All 31, 2.0; Econ. 10, 2.2; Educ. 11, 1.6; Quant. 10, 2.1
7. VAMs are unbiased with statistical controls (including student attributes as covariates in the model): All 32, 2.8; Econ. 11, 3.4; Educ. 11, 2.3; Quant. 10, 2.7
8. VAMs are unbiased without statistical controls (models which include no covariates, using only prior achievement as controls): All 32, 1.8; Econ. 11, 2.2; Educ. 11, 1.4; Quant. 10, 2.0
9. VAMs correctly attribute variance in student scores to an individual teacher (estimates are not confounded by variance from other sources): All 32, 2.4; Econ. 11, 3.0; Educ. 11, 1.8; Quant. 10, 2.3

Robust to Missing Data (subscale): All 31, 1.6; Econ. 9, 1.8; Educ. 10, 1.7; Quant. 10, 1.4
10. VAMs are accurate regardless of missing data: All 29, 1.7; Econ. 9, 1.9; Educ. 10, 1.8; Quant. 10, 1.4
11. When data are missing, VAMs are accurate regardless of whether data are missing at random: All 28, 1.6; Econ. 9, 1.8; Educ. 9, 1.6; Quant. 10, 1.4

Evidence Based on Content (subscale): All 32, 2.3; Econ. 11, 2.6; Educ. 11, 1.7; Quant. 10, 2.7
12. VAMs are accurate independent of test selection (using one test to generate VAM estimates as opposed to a different test of the same content domain): All 31, 2.0; Econ. 10, 2.0; Educ. 11, 1.5; Quant. 10, 2.7
13. Tests currently used for VAMs are capable of measuring growth in student achievement over time: All 29, 3.1; Econ. 9, 3.7; Educ. 10, 2.1; Quant. 10, 3.5
14. Tests currently used for VAMs are capable of measuring a teacher’s unique contribution to student growth in achievement over time: All 28, 2.5; Econ. 9, 3.1; Educ. 9, 1.8; Quant. 10, 2.6
15. VAMs are accurate when prior achievement in one subject (math or reading) is used to estimate value-added in another subject (science or social studies): All 29, 2.1; Econ. 9, 2.3; Educ. 10, 1.5; Quant. 10, 2.4
16. VAMs are accurate when prior achievement in a nonadjacent grade level is used to estimate value-added (Grade 4 science test in Grade 8 science VAM): All 30, 1.9; Econ. 10, 2.0; Educ. 10, 1.3; Quant. 10, 2.4

Relations to Other Variables (subscale): All 30, 2.7; Econ. 11, 3.1; Educ. 9, 1.7; Quant. 10, 3.2
17. Correlations between VAM estimates and long-term outcomes (graduation, college, and lifetime earnings) provide convincing evidence for VAMs: All 29, 2.9; Econ. 11, 3.5; Educ. 9, 2.0; Quant. 9, 3.2
18. Correlations between VAM estimates and supervisor observation scores provide convincing evidence for VAMs: All 30, 2.7; Econ. 11, 3.1; Educ. 9, 1.6; Quant. 10, 3.3
20. Correlations between VAM estimates and student survey scores provide convincing evidence for VAMs: All 28, 2.3; Econ. 11, 2.8; Educ. 9, 1.4; Quant. 8, 2.6

Measure Something Different (subscale): All 29, 3.0; Econ. 11, 3.1; Educ. 10, 2.9; Quant. 10, 3.3
19. VAMs measure a substantively different aspect of effectiveness than observations, thus VAMs and observations do not need to correlate: All 31, 3.0; Econ. 11, 3.0; Educ. 10, 2.9; Quant. 10, 3.1
21. VAMs measure a substantively different aspect of effectiveness than student surveys, thus VAMs and observations do not need to correlate: All 30, 3.2; Econ. 11, 3.2; Educ. 10, 2.9; Quant. 9, 3.7

Intended Consequences (subscale): All 33, 2.6; Econ. 11, 3.1; Educ. 10, 1.7; Quant. 10, 2.8
22. The use of VAMs encourages teachers to increase their professional efforts in the classroom: All 26, 2.5; Econ. 9, 2.6; Educ. 9, 2.0; Quant. 8, 3.0
23. States and districts that use VAMs for personnel decisions (hiring, firing, merit pay, rewards, and sanctions) have seen instructional quality increase: All 23, 2.4; Econ. 8, 2.9; Educ. 8, 1.4; Quant. 7, 2.9
24. States and districts that use VAMs for personnel decisions (hiring, firing, merit pay, rewards, and sanctions) have seen student achievement increase: All 25, 2.6; Econ. 9, 3.3; Educ. 8, 1.8; Quant. 8, 2.8
25. VAM use encourages the most effective teachers to teach the students with the greatest educational needs: All 26, 2.2; Econ. 9, 2.8; Educ. 8, 1.3; Quant. 9, 2.4
26. VAMs promote transparency in the teacher evaluation process: All 30, 2.7; Econ. 10, 3.5; Educ. 10, 1.8; Quant. 10, 2.9

Not Analyzed^b
32. VAM use has led to increased teacher competition

Unintended Consequences^c (subscale): All 29, 2.7; Econ. 8, 2.8; Educ. 9, 2.4; Quant. 10, 2.8
27. Teachers are less inclined to teach subjects (mathematics or reading) for which teachers may be held accountable using VAMs: All 23, 2.3; Econ. 8, 2.5; Educ. 8, 1.6; Quant. 7, 2.7
28. Teachers are less inclined to teach grade levels (Grades 3–8) for which teachers may be held accountable using VAMs: All 25, 2.5; Econ. 8, 2.5; Educ. 9, 2.2; Quant. 8, 2.9
29. Teachers are less inclined to teach students with histories of low academic performance so that such students don’t suppress VAM estimates: All 20, 2.1; Econ. 6, 2.7; Educ. 7, 1.7; Quant. 7, 2.0
30. Teachers are less inclined to teach students with histories of high academic performance so that such students don’t suppress VAM estimates: All 20, 3.4; Econ. 6, 3.2; Educ. 7, 3.6; Quant. 7, 3.3
31. Teachers are less inclined to work with other teachers in teams or collaborative groups as a result of VAM use: All 20, 2.7; Econ. 5, 3.0; Educ. 8, 2.1; Quant. 7, 3.0
33. VAM use has decreased teacher morale: All 24, 2.0; Econ. 7, 2.3; Educ. 9, 1.7; Quant. 8, 2.1
34. Teachers are inclined to switch schools, districts, or states to work in settings where VAM estimates might not matter as much: All 19, 2.5; Econ. 6, 3.0; Educ. 7, 2.0; Quant. 6, 2.5
35. Teachers are inclined to leave the teaching profession to avoid perceived or real threats of consequences attached to VAMs: All 20, 2.4; Econ. 6, 2.7; Educ. 8, 2.0; Quant. 6, 2.5

VAM Use (subscale): All 33, 2.7; Econ. 11, 3.3; Educ. 10, 1.9; Quant. 9, 3.1
36. The benefits and intended consequences of VAM use outweigh any drawbacks or unintended consequences of their use: All 29, 2.6; Econ. 10, 3.4; Educ. 10, 1.5; Quant. 9, 3.0
37. VAMs should continue to be used to evaluate teachers: All 30, 3.0; Econ. 11, 3.7; Educ. 10, 2.0; Quant. 9, 3.2
40. VAMs are superior to other available methods of evaluating teacher effectiveness: All 29, 2.6; Econ. 10, 3.0; Educ. 10, 1.5; Quant. 9, 3.2
41. VAMs should represent the most important component of teacher evaluation systems (most heavily weighted): All 30, 2.1; Econ. 11, 2.6; Educ. 10, 1.5; Quant. 9, 2.1
42. VAMs can and should be used to help teachers improve instruction: All 28, 3.4; Econ. 10, 3.7; Educ. 9, 2.8; Quant. 9, 3.7
43. VAMs should continue to be used to make professional development decisions for teachers: All 29, 3.2; Econ. 10, 3.6; Educ. 10, 2.2; Quant. 9, 3.9
44. VAMs should continue to be used to hold teachers accountable for student learning with consequences (merit pay, tenure, firing, rewards, and sanctions): All 30, 2.4; Econ. 11, 3.2; Educ. 10, 1.6; Quant. 9, 2.4

VAMs for All Teachers (subscale): All 33, 1.7; Econ. 11, 2.1; Educ. 10, 1.1; Quant. 9, 2.2
38. VAMs can and should be used to evaluate teachers of every grade level: All 30, 1.9; Econ. 11, 2.4; Educ. 10, 1.0; Quant. 9, 2.3
39. VAMs can and should be used to evaluate teachers of every content area: All 30, 1.7; Econ. 11, 1.8; Educ. 10, 1.2; Quant. 9, 2.1

Note. Econ. = economics, and aggregates the responses of all respondents who identified economics as their primary field; Educ. = education, and aggregates the responses of all respondents who identified education as their primary field; Quant. = quantitative methods, and aggregates the responses of all respondents who identified quantitative methodology as their primary field; VAM(s) = value-added model(s). Sample sizes reported for each subscale represent the number of respondents who completed the items included in that subscale. Sample sizes reported for each item represent the number of respondents who completed that item. The Overall VAM Support subscale is calculated as the mean of all completed items for each respondent, regardless of how many pages of the survey that respondent completed. Means are calculated across responses to 5-point Likert scale items in which respondents indicated their level of agreement with statements supporting the valid use of VAMs to evaluate teachers where 1 = strongly disagree and 5 = strongly agree.

a Item numbers listed correspond to the item numbers as displayed in Appendix A, in which the full text of each item is also presented. b Item 32 performed poorly and was excluded from all analyses, including the Overall VAM Support scale. c Items in the Unintended Consequences subscale are negatively phrased (i.e., challenge the valid use of VAMs to evaluate teachers) and reverse coded, such that 1 = strongly agree and 5 = strongly disagree.


RELIABILITY/PRECISION


All items relating to the reliability and precision of VAMs loaded onto a single factor, Reliability and Precision, with Cronbach’s α = .91. Overall, respondents neither agreed nor disagreed with statements that VAMs are reliable and precise (M = 2.88, SD = 1.18), though responses did differ by field, F(2, 32) = 3.65, p = .037, partial η² = .19 (a large effect; Cohen, 1988). Respondents in the field of education expressed more concern with the reliability and precision of VAMs (M = 2.32, SD = 1.31) than economists (M = 3.48, SD = 1.03). No other significant differences were observed between groups.


The major issue respondents identified when assessing the reliability of VAMs was VAMs’ inability or limited ability to incorporate all dimensions of what makes a teacher effective. This was a common concern across respondents and across all three disciplines. Respondents who contributed written comments about VAMs’ reliability and precision emphasized that VAMs should be used only as one tool in combination with other measures of teacher effectiveness (e.g., “multiple measures”; see also Fox, 2016; Grossman et al., 2014; Martinez et al., 2016). Otherwise, no comments specifically addressed VAMs’ levels of reliability and precision.


BIAS


Items relating to CIV and bias loaded onto two subscales. Four items loaded onto Freedom From Bias (Cronbach’s α = .91), while two items, which specifically ask how VAMs perform when data are missing, loaded onto Robust to Missing Data (Cronbach’s α = .88). Responses indicated that experienced VAM scholars somewhat disagreed with statements that VAMs are unbiased (M = 2.21, SD = 1.11), and this did not differ by field, F(2, 29) = 2.01, p = .153, partial η² = .12 (a medium to large effect; Cohen, 1988). Across all fields, respondents also somewhat disagreed with statements that VAMs are robust to missing data (M = 1.58, SD = 0.72), again with no observed differences by field, F(2, 26) = 0.84, p = .442, partial η² = .06 (a medium effect; Cohen, 1988).

As per the open-ended responses, respondents noted that policy makers should consider not the bias itself (given that it will be present in the data because of missing values and nonrandom student sorting), but rather what amount of bias might be tolerable. Respondents commented on needing to determine “how much bias is too much” or whether “the amount of bias i[s] meaningfully large” enough to prohibit educators from drawing accurate conclusions about teachers from the data.


EVIDENCE BASED ON TEST CONTENT


Items relating to the content of tests used to calculate VAM scores loaded onto a single factor, Evidence Based on Test Content, with Cronbach’s α = .88. When considered together, respondents somewhat disagreed with statements that evidence based on test content supports valid VAM use in teacher evaluation (M = 2.30, SD = 0.95). Here again, responses differed by field, F(2, 29) = 4.59, p = .019, partial η² = .24 (a large effect; Cohen, 1988). Education scholars found evidence based on test content more challenging to VAMs (M = 1.67, SD = 0.97) than either economists (M = 2.62, SD = 0.72) or quantitative methodologists (M = 2.72, SD = 0.97), though no significant difference was observed between economists and methodologists.


Overall, respondents in their written answers agreed that the validity of the estimates depends on the type and properties of the tests being used. Because test scores are the main inputs in any model used to estimate value-added, different inputs will yield different outputs (VAM estimates). In other words, different tests might lead to different results even if the same model is used. As one respondent stated, “A VAM estimate may be accurate given the current test, but the test may not measure what is intended.”


RELATIONS TO OTHER VARIABLES


Items about criterion-related validity evidence loaded on two underlying factors. Three items generally related to correlations between VAM scores and other measures of teacher effectiveness loaded on to Relations to Other Variables, with Cronbach’s α = .91. Considered together, respondents were either neutral or mixed about whether VAMs are sufficiently related to other teacher performance criteria to justify use in teacher evaluation (M = 2.70, SD = 1.13). Responses on these items also differed by field, F(2, 27) = 7.69, p = .002, partial η² = .36 (a very large effect; Cohen, 1988), with much more concern expressed by educational researchers (M = 1.67, SD = 0.80) than either respondents who worked in economics (M = 3.12, SD = 1.07) or who studied quantitative methods (M = 3.17, SD = 0.92). Again, no difference was observed between those who identified with the fields of economics and methodology.


The two survey items that researchers used to ask respondents whether VAMs measure a substantively different construct than other performance measures, and therefore need not correlate with other criteria, loaded onto Measure Something Different with Cronbach’s α = .88. Across respondents, experienced VAM scholars neither agreed nor disagreed with statements that VAMs measure a different construct (M = 3.00, SD = 1.21). For this subscale, responses did not differ by primary field, F(2, 28) = 0.27, p = .762, with an observed partial η² = .02 (indicating small practical significance; Cohen, 1988).


As per the open-ended responses, one respondent indicated, “VAM[s] are best used in combination with other measures of teacher performance.” Relatedly, other respondents pointed out that “measures of teaching effectiveness should be designed in a such way as to capture different aspects of teaching quality, and hence do not need to correlate.”


INTENDED CONSEQUENCES


All items related to intended consequences of VAM use loaded onto a single factor, Intended Consequences, with Cronbach’s α = .93. Respondents, considered as a whole, neither agreed nor disagreed with statements that VAMs achieve their intended outcomes (M = 2.55, SD = 1.02), with mean responses differing by field, F(2, 28) = 5.22, p = .012, partial η² = .27 (a large effect; Cohen, 1988). VAM scholars in education expressed more concern about whether VAMs achieve intended outcomes (M = 1.86, SD = 1.19) than those in economics (M = 3.12, SD = 0.76). No other significant differences were observed between groups.


As per the open-ended responses taken from both sections on VAMs’ intended and unintended consequences, respondents emphasized that teachers cannot act on, and therefore cannot reasonably be held accountable for, measures of their effectiveness that are not transparent to begin with; hence, such measures are too difficult to use to achieve their intended consequences (e.g., increased teacher effectiveness that may result in increased student learning). Respondents also noted that VAM data often come back to teachers from non-user-friendly, or now more popularly termed “black box,” statistical models, which further undermines VAMs’ capacity to achieve their intended consequences (see also Amrein-Beardsley, 2008; Ballou & Springer, 2015; Collins, 2014; Eckert & Dabrowski, 2010; Kappler Hewitt, 2015).


UNINTENDED CONSEQUENCES


EFA of items related to the unintended consequences of VAM use did not initially converge. On further examination of the items, researchers noticed that an item asking whether VAM use has increased competition among teachers (Item 32; see Table 5 or Online Supplement C, available at https://scholarworks.bgsu.edu/seflp_pubs/16/) did not perform well. The poor performance of this item could have been due to some respondents considering increased teacher competition to be a positive (and perhaps intended) consequence of VAM use, whereas other respondents may have considered such competition to be negative. Once researchers removed the item from the analysis, the EFA produced a single scale, Unintended Consequences, with Cronbach’s α = .91. Responses on the Unintended Consequences scale did not differ by field, F(2, 24) = 1.30, p = .291, partial η² = .10 (a medium to large effect; Cohen, 1988). Across all fields, respondents neither agreed nor disagreed with statements that VAM use produces unintended negative consequences (M = 2.65, SD = 0.81).


Again, per the open-ended questions, some respondents also emphasized that the intended and unintended effects of VAMs might be different for teachers with different value-added scores (high vs. low), which seems likely. Respondents’ other concerns related to unintended consequences included teacher retention issues and low morale, churn on the job, and feelings of being unappreciated. However, one respondent commented that teachers remain committed to their schools and students “regardless of whether it improves their personal VAM or not.” Likewise, another noted, perhaps on the side of VAMs, that “VAMs are not the biggest piece of the problem for teachers.”


VAM USE


The eight items related to recommendations regarding VAM use loaded on two underlying factors. Six items related to VAM use in general loaded on to VAM Use, with Cronbach’s α = .92. Considered together, respondents were generally either neutral or mixed about whether VAMs can and should be used to evaluate teachers (M = 2.66, SD = 1.08). VAM Use responses differed by field, F(2, 27) = 7.35, p = .003, partial η² = .35 (a very large effect; Cohen, 1988), with educational researchers somewhat challenging their use (M = 1.89, SD = 0.98) more than either economists (M = 3.31, SD = 0.87) or methodologists (M = 3.08, SD = 0.84). There was no difference between the ratings of economists and methodologists.


Two items that researchers used to specifically ask whether VAMs should be used to evaluate all teachers (one item regardless of grade level taught, and the other regardless of content area taught) loaded on a second factor, VAMs for All Teachers, with Cronbach’s α = .82. Taken together, respondents were somewhat negative toward the use of VAMs for all teachers (M = 1.74, SD = 1.05), again differing by field, F(2, 27) = 3.74, p = .037, partial η² = .22 (a large effect; Cohen, 1988). Post hoc comparisons using Tukey’s honest significant difference test did not identify significant pairwise differences; however, educators seemed to demonstrate stronger disagreement than did experienced VAM scholars in other fields.
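
The post hoc procedure mentioned above can be illustrated with the Tukey HSD implementation in statsmodels; the data frame layout and column names in this sketch are hypothetical assumptions rather than the study’s actual data.

```python
# Illustrative sketch of a Tukey HSD post hoc comparison across primary fields for the
# VAMs for All Teachers subscale. Column names ('vams_for_all_teachers', 'field') are
# hypothetical assumptions.
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def tukey_by_field(df: pd.DataFrame):
    complete = df.dropna(subset=["vams_for_all_teachers", "field"])
    result = pairwise_tukeyhsd(
        endog=complete["vams_for_all_teachers"],
        groups=complete["field"],
        alpha=0.05,
    )
    return result.summary()
```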


Opinions about the use of VAMs varied widely among respondents. Some respondents were firmly in favor of using VAMs for accountability purposes, saying, “I believe VAMs are by far the best way to hold teachers accountable”; others were in staunch opposition, indicating that VAMs “should NOT [emphasis in original] be used to make employment decisions without a lot of other investigative work.” The main reasons that many respondents did not recommend using VAMs were that VAM estimates do not provide “usable, digestible information” and that there is a lack of understanding of why VAM estimates are low for some teachers and high for others (e.g., this may be related to student selection, problems with content knowledge, or classroom management, among other factors). However, some respondents supported the use of VAMs in certain specific contexts, such as to identify “CLASSROOMS [emphasis in original] needing support, not teachers that [sic] are weak” or teachers to be subjected to “judgmental decisions.” Respondents also repeatedly noted that if used, VAMs should be used in conjunction with other indicators of teacher effectiveness and performance, and that the value of VAMs depends on how they are implemented, which also stands to reason.


FREE-RESPONSE ITEMS


When respondents were asked what advice they would give to a state or district using, or interested in adopting and using, a VAM, as well as what comment(s) or recommendation(s) about VAMs or VAM use they would give to then U.S. Secretary of Education Betsy DeVos, respondents focused on whether and to what extent VAMs should be used in teacher evaluation systems. Overall, respondents indicated that while VAMs may currently be “the best available measure of teachers’ performance [when] raising student achievement on . . . standardized tests,” they are also “imperfect” and therefore “should be used in conjunction with other measures.” Again, responses differed by respondents’ discipline. Scholars who self-identified as educational researchers, versus economists or quantitative methodologists, were more concerned about VAMs and VAM use across the board.


Regardless of disciplinary area of expertise, however, respondents indicated that VAMs should not be used on their own for summative, high-stakes, consequential decisions (merit pay, salary, and termination were the consequences specifically mentioned). Economists and quantitative methodologists were more willing to use VAMs for such purposes as long as they were used in conjunction with other measures; however, some quantitative methodologists commented that these other measures (e.g., student surveys) should be given the same “critical evaluation” and “scrutiny” that VAMs have received and continue to undergo. These same respondents also cautioned that VAMs “should not be over-interpreted,” especially when compared with other measures or when important summative decisions are attached to them.


Among other notable findings, respondents indicated that while VAMs can and should be used in conjunction with other measures, teachers, schools, and districts should also improve their knowledge about VAMs so that they can become better informed VAM consumers and users. As one respondent wrote, “VAMs are based on very complex statistical procedures and messy data collection designs.” Knowing how to understand and then use these data to improve instruction and student learning will not simply come naturally.


By self-reported discipline, education researchers again took a much stronger stand against VAMs being used at all, for any reason (i.e., for formative or summative purposes). Some education researchers commented on the “harmful” impact of VAMs on teachers, along with the lack of actionable feedback that VAMs provide to teachers. One respondent indicated that the measurement issues with VAMs (e.g., reliability, validity, bias), along with VAMs’ unintended consequences, seem to “significantly outweigh any potential . . . benefits.” Although some respondents did indicate that VAMs could be used for formative feedback only, the majority of education research respondents emphasized that VAMs should never be used for summative, high-stakes decisions.


CONCLUSIONS


In this study, we solicited the opinions of experienced VAM scholars and VAM experts who have (co)authored articles that addressed the use of VAMs to evaluate teachers and were published in one or more high-impact, influential journals in the fields of economics, education, and statistical methods. Informed by a systematic review of the same literature (Lavery et al., 2019), we analyzed respondents’ answers to questions about their perspectives on VAMs, VAM use, and VAMs’ purported strengths and weaknesses. In the present study, we were particularly interested in the informed viewpoints of this set of notable VAM scholars—especially now, given that the implementation of ESSA (2016) has provided states and districts with greater flexibility to choose teacher evaluation methods that do not necessarily include VAMs. Yet, these models remain in rather wide use (Close et al., 2018; ExcelinEd, 2017).


Across all respondents, regardless of self-reported discipline or field, experienced VAM scholars and VAM experts were generally either neutral or mixed toward the use of VAMs to evaluate teachers. Economists and quantitative methodologists were the most favorable toward VAM use, with no detectable, significant difference between them, and education researchers were somewhat more critical. Further examination of Table 4 indicates that, although economists gave their strongest support to the Reliability and Precision of VAMs (M = 3.48, SD = 1.03), the highest mean score on any subscale from any subgroup of respondents, economists’ responses on this subscale still do not round up to a somewhat agree rating. Although it may seem easier to think of economists as generally pro-VAM, these data do not seem to support that conclusion, suggesting that they, too, are more neutral.


The findings of this study suggest, instead, that those scholars who are most informed about VAMs, and about the evidence of their reliability, validity, and bias, are neutral or mixed toward their use to evaluate teachers, regardless of scholars’ self-reported disciplines or fields of study. Qualitative analyses of the responses to open-ended items support this view; many respondents advised caution or added caveats when discussing the use of VAMs in teacher evaluation. Taken together, the responses of the VAM scholars who participated in this study suggest that it would be inappropriate to evaluate teachers based on VAMs alone, though they may show more support for VAMs when used in conjunction with other teacher evaluation measures. Collectively, respondents do not recommend VAMs for high-stakes uses, such as for hiring, termination, promotion, or merit pay, and this also held true across self-reported disciplines.


That said, it seems that VAMs will not disappear from evaluation systems very quickly, even after ESSA (2016) lifted the requirements for their use. Recent evidence from Close et al. (2018), for example, indicates that the proportion of states using statewide growth models or VAMs has decreased from 42% to 30% since 2014. In addition, many states no longer have a one-size-fits-all teacher evaluation system, allowing local districts to make more choices about models, implementation, and execution in the contexts of the schools and the communities in which those schools exist. Likewise, the rhetoric surrounding teacher evaluation has changed: Language about holding teachers accountable for their value-added effects, or lack thereof, is becoming much less evident. Rather, states are increasingly looking to provide data to teachers as a means of supporting professional development and improvement, essentially shifting the purpose of the evaluation system away from summative and toward formative use.


What remains unclear, however, is what drives continued VAM use, especially given both the nontrivial expenses connected with VAMs and the apparent lack of strong support among experts for their use. It is possible that the states and districts pulling away from VAM use are simply doing so more quickly than others. For example, some changes in VAM policies might be prompted by the lawsuits still being settled across the nation related to the federal government’s former Race to the Top (2011) initiatives (see, for example, “Teacher Evaluation Heads to the Courts,” 2015). As the scholarship and debate on VAMs and VAM use continue, however, the findings of this study suggest that economists, educational researchers, methodologists, policy makers, and other stakeholders may come from different perspectives, operate on different assumptions, and hold different priorities. Perhaps the first and best step toward developing a clear consensus among VAM scholars and VAM users is to focus on these underlying differences, with the aim of understanding and addressing stakeholders’ multiple perspectives in order to find the common ground on which to build consensus.


Note


1. The third and fourth authors contributed equally to this manuscript and are listed in alphabetical order.

References


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

 

Amrein-Beardsley, A. (2008). Methodological concerns about the Education Value-Added Assessment System. Educational Researcher, 37(2), 65–75. doi:10.3102/0013189X08316420

 

Amrein-Beardsley, A., & Holloway, J. (2017). Value-added models for teacher evaluation and accountability: Commonsense assumptions. Educational Policy, 33(3). doi:10.1177/0895904817719519

 

Babbie, E. (1990). Survey research methods (2nd ed.). Wadsworth.

 

Ballou, D., Sanders, W., & Wright, P. (2004). Controlling for student background in value-added assessment of teachers. Journal of Educational and Behavioral Statistics, 29(1), 37–65. doi:10.3102/10769986029001037

 

Ballou, D., & Springer, M. G. (2015). Using student test scores to measure teacher performance: Some problems in the design and implementation of evaluation systems. Educational Researcher, 44(2), 77–86. doi:10.3102/0013189X15574904

 

Banchero, S., & Kesmodel, D. (2011, September 13). Teachers are put to the test: More states tie tenure, bonuses to new formulas for measuring test scores. The Wall Street Journal. http://online.wsj.com/article/SB10001424053111903895904576544523666669018.html

 

Berliner, D. C. (2013). Problems with value-added evaluations of teachers? Let me count the ways! Teacher Educator, 48(4), 235–243. doi:10.1080/08878730.2013.827496

 

Berliner, D. C. (2014). Exogenous variables and value-added assessments: A fatal flaw. Teachers College Record, 116(1), 1–31.

 

Betebenner, D. W. (2011, April 8–12). Student growth percentiles [Training session]. Annual meeting of the National Council on Measurement in Education of the American Educational Research Association, New Orleans, LA.

 

Blair, J., Czaja, R. F., & Blair, E. A. (2014). Designing surveys: A guide to decisions and procedures (3rd ed.). SAGE.

 

Briggs, D., & Domingue, B. (2011). Due diligence and the evaluation of teachers: A review of the value-added analysis underlying the effectiveness rankings of Los Angeles Unified School District teachers by the Los Angeles Times. National Education Policy Center. http://nepc.colorado.edu/publication/due-diligence

 

Check, J., & Schutt, R. K. (2012). Research methods in education. SAGE.

 

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014a). Measuring the impacts of teachers I: Evaluating bias in teacher value-added estimates. American Economic Review, 104(9), 2593–2632. doi:10.1257/aer.104.9.2593

 

Chetty, R., Friedman, J. N., & Rockoff, J. E. (2014b). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. American Economic Review, 104(9), 2633–2679. doi:10.1257/aer.104.9.2633

 

Chin, M., & Goldhaber, D. (2015). Exploring explanations for the “weak” relationship between value added and observation-based measures of teacher performance. Center for Education Policy Research, Harvard University. http://cepr.harvard.edu/files/cepr/files/sree2015_simulation_working_paper.pdf

 

Clarivate Analytics. (2017). 2016 Journal Citation Reports® Social Sciences Edition. https://jcr.clarivate.com

 

Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. National Education Policy Center. https://nepc.colorado.edu/publication/state-assessment

 

Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Erlbaum.

 

Collins, C. (2014). Houston, we have a problem: Teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22(98). doi:10.14507/epaa.v22.1594

 

Collins, C., & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national overview. Teachers College Record, 116(1), 1–34.

 

Condie, S., Lefgren, L., & Sims, D. (2014). Teacher heterogeneity, value-added and education policy. Economics of Education Review, 40, 76–92. doi:10.1016/j.econedurev.2013.11.009

 

Dillon, S. (2010, January 31). Obama to seek sweeping change in “No Child” law. The New York Times. http://www.nytimes.com/2010/02/01/education/01child.html?pagewanted=all

 

Duncan, A. (2009). The Race to the Top begins: Remarks by Secretary Arne Duncan. U.S. Department of Education. http://www.ed.gov/news/speeches/2009/07/07242009.html

 

Duncan, A. (2011). Winning the future with education: Responsibility, reform and results. Testimony given to the U.S. Congress, Washington D.C. U.S. Department of Education. https://www.ed.gov/news/speeches/winning-future-education-responsibility-reform-and-results

 

EBSCO Industries. (2018). EBSCOhost. https://www.ebsco.com/products/ebscohost-research-platform

 

Eckert, J. M., & Dabrowski, J. (2010). Should value-added measures be used for performance pay? Phi Delta Kappan, 91(8), 88–92.

 

Evans, J. R., & Mathur, A. (2005). The value of online surveys. Internet Research, 15(2), 195–219.

 

Everson, K. C. (2017). Value-added modeling and educational accountability. Review of Educational Research, 87(1), 35–70. doi:10.3102/0034654316637199

 

Everson, K. C., Feinauer, E., & Sudweeks, R. R. (2013). Rethinking teacher evaluation: A conversation about statistical inferences and value-added models. Harvard Educational Review, 83(2), 349–370. doi:10.17763/haer.83.2.m32hk8q851u752h0

 

Every Student Succeeds Act of 2015, Pub. L. No. 114-95, 129 Stat. 1802. (2016). https://www.gpo.gov/fdsys/pkg/BILLS-114s1177enr/pdf/BILLS-114s1177enr.pdf

 

ExcelinEd. (2017). ESSA state plans: 50-state landscape analysis. https://www.excelined.org/wp-content/uploads/2017/12/ExcelinEd.Quality.ESSA_.50StateAnalysis.Dec072017.pdf

 

Fourcade, M., Ollion, E., & Algan, Y. (2015). The superiority of economists. Journal of Economic Perspectives, 29(1), 89–114. doi:10.1257/jep.29.1.89

 

Fox, L. (2016). Playing to teachers’ strengths: Using multiple measures of teacher effectiveness to improve teacher assignments. Education Finance and Policy, 11(1), 70–96. doi:10.1162/edfp_a_00176

 

Glaser, B., & Strauss, A. (1967). The discovery of grounded theory: Strategies for qualitative research. Aldine.

 

Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: The relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303. doi:10.3102/0013189X14544542

 

Hanushek, E. A. (1971). Teacher characteristics and gains in student achievement: Estimation using micro data. American Economic Review, 61(2), 280–288. http://www.jstor.org/stable/1817003

 

Hanushek, E. A. (1979). Conceptual and empirical issues in the estimation of educational production functions. Journal of Human Resources, 14(3), 351–388. doi:10.2307/145575

 

Hanushek, E. A. (2011). The economic value of higher teacher quality. Economics of Education Review, 30(3), 466–479. doi:10.1016/j.econedurev.2010.12.006

 

Harris, D. N., Ingle, W. K., & Rutledge, S. A. (2014). How teacher evaluation methods matter for accountability: A comparative analysis of teacher effectiveness ratings by principals and teacher value-added measures. American Educational Research Journal, 51(1), 73–112. doi:10.3102/0002831213517130

 

Johnson, M. T., Lipscomb, S., & Gill, B. (2015). Sensitivity of teacher value-added estimates to student and peer control variables. Journal of Research on Educational Effectiveness, 8(1), 60–83. doi:10.1080/19345747.2014.967898

 

Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73. doi:10.1111/jedm.12000

 

Kane, T. J., & Staiger, D. O. (2002). The promise and pitfalls of using imprecise school accountability measures. The Journal of Economic Perspectives, 16(4), 91–114. doi:10.1257/089533002320950993

 

Kane, T. J., & Staiger, D. O. (2008). Estimating teacher impacts on student achievement: An experimental evaluation. National Bureau of Economic Research. http://www.nber.org/papers/w14607

 

Kane, T. J., & Staiger, D. (2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains. Bill & Melinda Gates Foundation. http://www.metproject.org/downloads/MET_Gathering_Feedback_Research_Paper.pdf

 

Kappler Hewitt, K. (2015). Educator evaluation policy that incorporates EVAAS value-added measures: Undermined intentions and exacerbated inequities. Education Policy Analysis Archives, 23(76), 1–49. http://epaa.asu.edu/ojs/article/view/1968

 

Koedel, C., Mihaly, K., & Rockoff, J. E. (2015). Value-added modeling: A review. Economics of Education Review, 47, 180–195. doi:10.1016/j.econedurev.2015.01.006

 

Lavery, M. R., Amrein-Beardsley, A., Pivovarova, M., Holloway, J., Geiger, T., & Hahs-Vaughn, D. L. (2019, April 5–9). Do value-added models (VAMs) tell truth about teachers? Analyzing validity evidence from VAM scholars [Paper presentation]. Annual meeting of the American Educational Research Association, Toronto, Canada.

 

Layton, L. (2012, September 20). Rethinking the classroom: Obama’s overhaul of public education. The Washington Post. https://www.washingtonpost.com/local/education/rethinking-the-classroom-obamas-overhaul-of-public-education/2012/09/20/a5459346-e171-11e1-ae7f-d2a13e249eb2_story.html

 

Lazear, E. P. (1999). Economic imperialism. National Bureau of Economic Research. http://www.nber.org/papers/w7300.pdf

 

Lazear, E. P. (2001). Educational production. The Quarterly Journal of Economics, 116(3), 777–803. doi:10.1162/00335530152466232

 

Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V.-N., & Martinez, J. F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44(1), 47–67. doi:10.1111/j.1745-3984.2007.00026.x

 

Martinez, J. F., Schweig, J., & Goldschmidt, P. (2016). Approaches for combining multiple measures of teacher performance: Reliability, validity, and implications for evaluation policy. Educational Evaluation and Policy Analysis, 38(4), 738–756. doi:10.3102/0162373716666166

 

McCaffrey, D. F., Lockwood, J. R., Koretz, D., Louis, T. A., & Hamilton, L. (2004). Models for value-added modeling of teacher effects. Journal of Educational and Behavioral Statistics, 29(1), 67–101. http://www.rand.org/pubs/reprints/2005/RAND_RP1165.pdf

 

Miles, M. B., & Huberman, A. M. (1994). Qualitative data analysis (2nd ed.). SAGE.

 

Miles, M. B., Huberman, A. M., & Saldana, J. (2014). Qualitative data analysis: A methods sourcebook. SAGE.

 

Moher, D., Liberati, A., Tetzlaff, J., & Altman, D. G. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA Statement. Annals of Internal Medicine, 151(4), 264–269.

 

National Council on Teacher Quality. (2013). State of the States 2013: Connect the dots: Using evaluations of teacher effectiveness to inform policy and practice. http://www.nctq.org/dmsView/State_of_the_States_2013_Using_Teacher_Evaluations_NCTQ_Report

 

Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Educational Policy Analysis Archives, 18(23), 1–27. doi:10.14507/epaa.v18n23.2010

 

No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115 Stat. 1425. (2002). http://www.ed.gov/legislation/ESEA02/

 

Nye, B., Konstantopoulos, S., & Hedges, L. V. (2004). How large are teacher effects? Educational Evaluation and Policy Analysis, 26(3), 237–257. http://dx.doi.org/10.3102/01623737026003237

 

Papay, J. P. (2011). Different tests, different answers: The stability of teacher value-added estimates across outcome measures. American Educational Research Journal, 48(1), 163–193. doi:10.3102/0002831210362589

 

Papay, J. P. (2012). Refocusing the debate: Assessing the purposes and tools of teacher evaluation. Harvard Educational Review, 82(1), 123–141. doi:10.17763/haer.82.1.v40p0833345w6384

 

Race to the Top Act of 2011, S. 844, 112th Cong. (2011). http://www.govtrack.us/congress/bills/112/s844

 

Rothstein, J. (2009). Student sorting and bias in value-added estimation: Selection on observables and unobservables. Education Finance and Policy, 4(4), 537–571. doi:10.1162/edfp.2009.4.4.537

 

Scholl, N., Mulders, S., & Drent, R. (2002). Online qualitative market research: Interviewing the world at a fingertip. Qualitative Market Research, 5(3), 210–223. doi:10.1108/13522750210697596

 

Schonlau, M., Fricker, R. D. Jr., & Elliot, M. N. (2001). Conducting research surveys via e-mail on the Web. RAND.

 

Shannon, D. M., Johnson, T. E., Searcy, S., & Lott, A. (2002). Using electronic surveys: Advice from survey professionals. Practical Assessment, Research & Evaluation, 8(1). http://ericae.net/pare/13~getvn.html

 

Strauss, A. L., & Corbin, J. (1998). Basics of qualitative research: Grounded theory procedures and techniques (2nd ed.). SAGE.

 

Teacher evaluation heads to the courts. (2015, October 6). Education Week. http://www.edweek.org/ew/section/multimedia/teacher-evaluation-heads-to-the-courts.html

 

U.S. Department of Education. (2010). Race to the Top program: Guidance and frequently asked questions. https://www2.ed.gov/programs/racetothetop/faq.pdf

 

Weisberg, H. F. (2005). The total survey error approach: A guide to the new science of survey research. University of Chicago Press.

 









Cite This Article as: Teachers College Record, Volume 122, Number 7, 2020, pp. 1–34. https://www.tcrecord.org, ID Number: 23320