
Spring or Fall Annual Tests? Implications for Value-Added Models

by Michael S. Hayes & Seth Gershenson — September 04, 2018

School districts rely on standardized tests that are only administered once per academic year to produce value-added measures (VAMs) of teacher effectiveness. This is problematic because students’ summer learning is incorrectly attributed to the teacher, potentially biasing estimates of teacher effectiveness. However, there is limited research on whether spring or fall tests yield more valid VAMs. We fill this gap in knowledge by comparing the accuracy of fall-to-fall and spring-to-spring “cross-year” VAMs relative to arguably more valid fall-to-spring “within-year” VAMs. We find that spring-to-spring cross-year VAMs are more valid than fall-to-fall cross-year VAMs, in that they are more consistent with within-year VAMs. This suggests that spring assessments are preferred to fall assessments, at least when the objective is to obtain valid VAM-based estimates of school or teacher effectiveness.

INTRODUCTION

Most school districts rely on standardized tests that are only administered once per academic year, usually in the spring, to evaluate teacher and school effectiveness. For example, the Nashville school district administers standardized tests each spring that measure students’ achievement gains between the spring of grade g – 1 and the spring of grade g. However, this approach is potentially problematic, as students’ summer learning is incorrectly attributed to students’ grade-g teachers and schools. The resultant potential bias in value-added measures (VAMs) of teacher and school effectiveness when using such “cross-year” or “spring-to-spring” achievement gains is well documented (Downey et al., 2008; McEachin & Atteberry, 2017; Gershenson & Hayes, 2018). One seemingly attractive solution is to administer tests at both the start and end of the school year and estimate VAMs on the associated within-year gains.
However, this is not a panacea, as schools and teachers being evaluated on growth would have an incentive to artificially depress the fall scores. Moreover, adding a second round of tests would be costly, in terms of both time and money, not to mention the political costs of adding more tests at a time when many stakeholders are calling for fewer tests (Superville, 2015; Ujifusa, 2015). Accordingly, if high-stakes tests are to be administered only once per year, it is vital for school administrators to know whether implementing those assessments in the spring or fall yields more credible VAMs.

The current study fills a gap in the VAM literature by directly addressing the question of whether fall or spring assessments yield more accurate VAMs. We do so by comparing the accuracy of fall-to-fall and spring-to-spring cross-year VAMs relative to the arguably more valid fall-to-spring “within-year” VAMs. Specifically, we use student-level data from the Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K) to estimate both cross-year and within-year VAMs for first-grade classrooms, then compare the VAM-based rankings. We find that spring-to-spring cross-year VAMs, relative to fall-to-fall cross-year VAMs, are more reliable and consistent with within-year VAMs. This suggests that when only one assessment per year is feasible, spring assessments are preferred to fall assessments, at least when the objective is to obtain valid estimates of school or teacher effectiveness.

DATA

The current study uses student-level data from the Early Childhood Longitudinal Study, Kindergarten Class of 2010–11 (ECLS-K) to estimate classroom value-added scores. The ECLS-K, collected by the National Center for Education Statistics (NCES), is a longitudinal dataset composed of a nationally representative sample of the 2010–11 kindergarten cohort.
The full sample contains approximately 18,000 children in over 900 kindergarten programs.^{1} The survey oversampled certain subgroups of children, and the current study uses NCES-provided sampling weights to adjust for the survey’s nonrandom sampling frame. The main results remain qualitatively similar when the sampling weights are not applied, which is reassuring because it suggests that the main result is not driven by schools that serve relatively more disadvantaged students than the typical U.S. school (Solon, Haider, & Wooldridge, 2015). The ECLS-K administered age-appropriate reading and mathematics tests to all surveyed children in the fall and spring of kindergarten and the spring of first grade.^{2} The ECLS-K computed vertically scaled test scores using Item Response Theory (IRT). In the baseline analysis, we use the unstandardized version of these IRT test scores. However, results are qualitatively similar if we standardize by subject, grade, and semester. We caution readers to avoid extrapolating the current study’s results to other grade levels.

Our analysis also requires test score data from the fall of both first grade and second grade. However, only a random subsample of ECLS-K children were surveyed in the falls of first and second grades. Therefore, the analytic sample is restricted to this subsample, which includes approximately 3,700 children. The fall observations facilitate the following calculations:

• Test-score change between spring of kindergarten (K) and spring of first grade (1)
• Test-score change between fall of first grade (1) and spring of first grade (1)
• Test-score change between fall of first grade (1) and fall of second grade (2)

We make three additional sample restrictions. First, students who changed schools at any time between kindergarten and second grade, or who experienced a mid-year classroom change, are excluded.
School changers are excluded to avoid conflating summer learning with disruptions to learning associated with changing schools (Schwartz et al., 2017). Second, we exclude students who repeated or skipped kindergarten, first grade, or second grade. Lastly, students are excluded if they were missing basic demographic data or classroom indicators. These additional restrictions yield a baseline analytic sample of approximately 1,800 first graders in over 750 classrooms. To check that the main results are not driven by imprecision associated with the relatively small number of students per classroom, we rerun the analysis on a subsample of 150 classrooms that have at least four surveyed students and find qualitatively similar results.

The ECLS-K is an ideal dataset for the current study for at least two reasons. First, the ECLS-K is the only nationally representative survey of U.S. students with test score data spanning two summer vacations (i.e., the summers after kindergarten and first grade) that links students to classrooms. This feature of the ECLS-K allows for the estimation of both within-year and cross-year VAMs for the same cohort’s first-grade school year. Second, the ECLS-K collects data from parents on students’ summer activities, which allows us to test whether conditioning on summer activities reduces the bias inherent in cross-year VAMs. Table A1 summarizes the student characteristics and summer activities of the analytic sample of ECLS-K first graders.
Notes. All ECLS-K estimates are weighted to account for the unequal probabilities of sample selection by NCES-provided sampling weights. ECLS-K sample sizes are rounded to the nearest 50, in accordance with NCES regulations for restricted-use ECLS-K data.

Three features of the ECLS-K assessments require further explanation. First, due to the cohort nature of the ECLS-K, we only observe teachers in one school year; therefore, we can only identify “classroom” and not teacher effects. Second, the ECLS-K assessments were administered to different students on different days. A small number of ECLS-K administrators met individually with each student when administering an assessment, and this resulted in variation in test dates across schools, classrooms, and even students within the same classrooms (Fitzpatrick et al., 2011). Third, the assessments were not administered on the first or last days of the academic year. This is potentially problematic because some kindergarteners took the exam well in advance of the end of kindergarten and some first graders took the test well after the start of first grade. To account for this, we follow Quinn (2015) in adjusting for the timing of the test by extrapolating each math and reading test score to the first (or last) day of the school year. For the summer between kindergarten (K) and first grade (1) and the summer between first grade and second grade (2), there are nine relevant dates (d): the fall test date, spring test date, and last day of school in K (d^{fall,K}, d^{spring,K}, d^{end,K}); the first day of school, fall test date, spring test date, and last day of school in grade 1 (d^{start,1}, d^{fall,1}, d^{spring,1}, d^{end,1}); and the first day of school and fall test date in grade 2 (d^{start,2}, d^{fall,2}).^{3} The extrapolations follow two steps. First, we calculate the daily learning rate during the relevant academic year for each child. Second, assuming that the same daily learning rate applies to the start and end of the year, we extrapolate what test scores would have been at the beginning and end of the school year. For example, we use the following equations to calculate the extrapolated end-of-K test score:
(1)   rate^{K}_{i} = (y^{spring,K}_{i} – y^{fall,K}_{i}) / (d^{spring,K}_{i} – d^{fall,K}_{i})

(2)   y^{end,K}_{i} = y^{spring,K}_{i} + rate^{K}_{i} × (d^{end,K} – d^{spring,K}_{i})

where y^{j} represents achievement at date j for j ∈ {fall of K, spring of K, end of K}. Only the first two of these are observed. Equation 1 calculates the child-specific daily learning rate in K. Equation 2 uses the spring assessment score in kindergarten and the kindergarten daily learning rate to calculate the predicted assessment score at the end of kindergarten for each child.

Table 1 reports both math and reading extrapolated test scores for all relevant dates. Interestingly, average math achievement appears to increase by almost two points between the end of kindergarten and the start of first grade. However, there is a slight decrease in average math achievement between the end of first grade and the start of second grade. Table 1 also shows that the average child experiences summer reading loss over both summers.

METHODS

We utilize the ECLS-K data to make three sets of comparisons. First, we compare VAM-based rankings of classroom effectiveness generated by fall-to-spring (within-year) achievement gains to the arguably less valid rankings generated by spring-to-spring (cross-year) gains; the corresponding VAM specifications are given by Equations 3a and 3b, respectively:
(3a)   y^{spring,1}_{ic} – y^{fall,1}_{ic} = x′_{i}β + θ_{c} + u_{ic}

and

(3b)   y^{spring,1}_{ic} – y^{spring,K}_{ic} = x′_{i}β + θ_{c} + u_{ic}

Second, we compare VAM-based rankings of classroom effectiveness generated by Equation 3a to the arguably less valid rankings generated by fall-to-fall (cross-year) gains shown in Equation 3c below:

(3c)   y^{fall,2}_{ic} – y^{fall,1}_{ic} = x′_{i}β + θ_{c} + u_{ic}

Lastly, we compare VAM-based rankings from the two cross-year equations (3b and 3c). In Equation 3, students and classrooms are indexed by i and c, respectively; K, 1, and 2 indicate kindergarten, first grade, and second grade, respectively; y is academic achievement (i.e., extrapolated math and reading scores); vector x contains some combination of the student characteristics and summer activities described in Table A1; θ are the classroom fixed effects (FE) upon which rankings of classroom effectiveness will be based; and u is a mean-zero error term that captures the unobserved predictors of achievement. All equations are estimated by ordinary least squares (OLS). The baseline model contains only a limited set of student demographic variables, including indicators for race, gender, poverty, English language learner (ELL) status, individualized education plan (IEP) status, kindergartener redshirt status, attending private school, attending an urban school, and attending a rural school (Gershenson & Hayes, 2017). As a robustness check, we add additional controls for student demographic characteristics and summer activities to all models.

We compare the rankings generated by cross-year and within-year VAMs in two ways, similar to previous researchers (Gershenson & Hayes, 2018; Guarino, Reckase, & Wooldridge, 2015; Koedel & Betts, 2011; Loeb & Candelaria, 2012; McCaffrey et al., 2009; McEachin & Atteberry, 2017). First, we estimate Spearman rank correlations, which are simple summary statistics that measure the similarity between two rankings.
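To make the mechanics concrete, the sketch below simulates toy data, estimates classroom fixed effects by OLS as in Equation 3 (dummy-variable regression), and computes the Spearman rank correlation between the within-year and cross-year rankings. This is not the authors’ code, and every magnitude (numbers of classrooms and students, effect and noise variances) is an illustrative assumption; it only shows the estimation-and-comparison logic.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy data (all magnitudes hypothetical): 60 classrooms, 10 students each.
n_class, n_per = 60, 10
cls = np.repeat(np.arange(n_class), n_per)   # classroom id for each student
theta = rng.normal(0, 3, n_class)            # "true" classroom effects
x = rng.normal(size=cls.size)                # one student covariate

# The within-year gain reflects the classroom; the cross-year gain adds
# summer learning that the classroom did not cause.
within_gain = theta[cls] + 1.5 * x + rng.normal(size=cls.size)
cross_gain = within_gain + rng.normal(0, 2, cls.size)

def classroom_effects(gain, x, cls, n_class):
    """OLS of the achievement gain on the covariate plus classroom dummies
    (the fixed-effects specification in Equation 3); returns the FE estimates."""
    X = np.column_stack([x, np.eye(n_class)[cls]])
    beta, *_ = np.linalg.lstsq(X, gain, rcond=None)
    return beta[1:]

# Spearman rank correlation between the two VAM-based rankings.
rho, _ = spearmanr(classroom_effects(within_gain, x, cls, n_class),
                   classroom_effects(cross_gain, x, cls, n_class))
```

In this toy setup the summer-learning noise attenuates the agreement between the two rankings, which is exactly the quantity the Spearman coefficients in Table 2 summarize.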
Second, we construct transition matrices that document switching across specifications, which provide a more nuanced understanding of how the rankings change and of the implications for policies that penalize teachers at the bottom of the effectiveness distribution or reward teachers at the top.

RESULTS

Table 2 reports Spearman rank correlations of the comparisons between estimated first-grade classroom effects generated by Equations 3a, 3b, and 3c for the baseline specification as well as several alternative specifications. The Spearman rank correlations suggest that estimated classroom effects from VAMs using spring-to-spring achievement gains are, for both subjects, more robust to test timing than similar VAMs using fall-to-fall achievement gains. In fact, for both math and reading achievement, the Spearman rank correlation coefficients are more than 10 percentage points higher when using the classroom effects generated by spring-to-spring achievement gains relative to fall-to-fall achievement gains. Not surprisingly, as shown in Column 3, the Spearman rank correlation coefficients are smallest when comparing the rankings of classroom effects generated from VAMs using fall-to-fall achievement gains to similar classroom effects from spring-to-spring achievement gains.

The results reported in Table 2 are not significantly affected by changes to the baseline specification. For example, removing or adding control variables on student characteristics and summer activities does not appreciably change the Spearman correlation coefficients. This finding is not surprising, as previous research suggests that only 10% of the variation in summer learning can be explained by student and household characteristics (Downey et al., 2004), and the ECLS-K data on summer activities do not contain detailed information on the quality of summer activities or on parent involvement over the summer.
The result is not affected by excluding the NCES-provided sampling weights mentioned in the Data section. Similarly, the main result is not altered by including only classrooms with at least four surveyed students, which suggests that the findings are not driven by imprecision associated with the relatively small number of students per classroom.

Table 3 presents transition matrices for math and reading achievement based on the baseline value-added model that conditions on elements of x typically observed in administrative data. Transition matrices report the movement of classrooms across quintiles of the classroom-effectiveness distribution, which provides a more nuanced understanding of the stability of the rankings reported in Table 2. The diagonal elements of the transition matrices reported in Table 3 represent classrooms that were in the same quintile of the effectiveness rankings generated by fall-to-fall and fall-to-spring VAMs. As expected given the results in Table 2, the figures along the diagonals are significantly lower than 100%, reinforcing the general finding that first-grade classroom effectiveness rankings are sensitive to the timing of the assessments used in the VAM. Indeed, only about half of classrooms ranked in the lowest or highest quintiles of math effectiveness remained in the same quintile in both the within-year and cross-year rankings.
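A quintile transition matrix of the kind just described can be built directly from two vectors of estimated classroom effects. The sketch below is a minimal, generic implementation (the ranking-to-quintile rule is one common convention, not necessarily the authors’ exact procedure):

```python
import numpy as np

def transition_matrix(effects_a, effects_b, n_bins=5):
    """Entry (i, j) is the percent of classrooms in quintile i of ranking A
    that land in quintile j of ranking B; each row sums to 100."""
    def bins(v):
        ranks = np.argsort(np.argsort(v))   # dense ranks 0..n-1
        return ranks * n_bins // len(v)     # equal-size bins 0..n_bins-1
    qa, qb = bins(np.asarray(effects_a)), bins(np.asarray(effects_b))
    m = np.zeros((n_bins, n_bins))
    for i, j in zip(qa, qb):
        m[i, j] += 1
    return 100.0 * m / m.sum(axis=1, keepdims=True)

# Sanity check: identical rankings put every classroom on the diagonal.
m = transition_matrix(np.arange(50.0), np.arange(50.0))
```

When the two rankings disagree, mass moves off the diagonal, which is what the below-100% diagonals in Table 3 capture.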
Notes. The baseline model contains only a limited set of the student demographic variables, including indicators for race, gender, poverty, English language learner (ELL) status, individualized education plan (IEP) status, kindergartener redshirt status, attending private school, attending an urban school, and attending a rural school. The rich control set specification contains the variables summarized in Table A1. All Spearman correlation coefficients reported in Table 2 are strongly statistically significant, with p-values less than 0.0005. The restricted sample includes classrooms with at least four surveyed students; it contains 700 students in 150 classrooms.
Notes. The statistics reported in this table compare rankings of classroom effects generated by Equations 3a and 3c of the main text. The sample contains 1,800 students and 750 classrooms.

Table 4 replicates the transition matrix analysis of Table 3 for Equations 3a and 3b, comparing the rankings of first-grade classrooms generated by fall-to-spring VAMs to those generated by spring-to-spring VAMs. Table 4 shows that the spring-to-spring VAMs are more stable than the fall-to-fall VAMs, and large swings across multiple quintiles are exceedingly rare in both subjects. Overall, the results from Table 4 are consistent with the main finding, from Table 2, that estimated classroom effects from VAMs using spring-to-spring achievement gains are, for both subjects, less affected by test timing than similar VAMs using fall-to-fall achievement gains.
Notes. The statistics reported in this table compare rankings of classroom effects generated by Equations 3a and 3b of the main text. The sample contains 1,800 students and 750 classrooms.

DISCUSSION

The current study addresses a common problem facing the majority of U.S. school districts: the difficulty of estimating teacher effectiveness with standardized tests administered only once per year. The problem is that students’ summer learning gains and losses are incorrectly attributed to schools and teachers when cross-year VAMs are used to evaluate teacher effectiveness. Indeed, previous research has documented this potential bias in cross-year VAM estimates (Downey et al., 2008; McEachin & Atteberry, 2017; Gershenson & Hayes, 2018). Given the political and financial challenges of administering standardized exams twice per year, school administrators need to know whether implementing those assessments in spring or fall yields more credible VAMs. Ours is the first study to directly address this question.

The current study provides information on the validity of value-added estimates of classroom effects generated by fall-to-fall and spring-to-spring cross-year VAMs relative to arguably more valid fall-to-spring within-year VAMs. We consistently find that estimated classroom effects from VAMs using spring-to-spring achievement gains are, for both subjects, more robust than similar VAMs using fall-to-fall achievement gains. Specifically, for both math and reading achievement, the Spearman rank correlation coefficients are more than 10 percentage points higher when using the classroom effects generated by spring-to-spring achievement gains relative to fall-to-fall achievement gains. Transition matrices reported in the current study provide a similar finding.
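The stability comparisons drawn from Tables 3 and 4 reduce to two summary numbers per transition matrix: the share of classrooms that keep their quintile (the diagonal) and the share that swing two or more quintiles. The sketch below computes both; the example matrix is hypothetical and purely illustrative, not a matrix from the paper:

```python
import numpy as np

def stability_summary(m):
    """Summarize a quintile transition matrix whose rows sum to 100%:
    average percent staying in the same quintile, and average percent
    moving two or more quintiles away."""
    m = np.asarray(m, dtype=float)
    n = m.shape[0]
    stay = np.mean(np.diag(m))
    dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    big_swing = np.mean(np.sum(np.where(dist >= 2, m, 0.0), axis=1))
    return stay, big_swing

# Hypothetical matrix: 50% of classrooms stay put; movers land only in
# an adjacent quintile, so there are no large swings.
m = np.zeros((5, 5))
np.fill_diagonal(m, 50.0)
m[0, 1] = m[4, 3] = 50.0
for i in (1, 2, 3):
    m[i, i - 1] = m[i, i + 1] = 25.0
stay, big = stability_summary(m)
```

A matrix with a higher `stay` and a lower `big_swing` corresponds to the greater stability the paper reports for spring-to-spring VAMs.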
The policy implication of this finding is that when only one assessment per year is feasible, spring assessments are preferable to fall assessments, at least when the objective is to obtain valid estimates of school or teacher effectiveness.

Moving forward, one area for future research is determining the optimal timing of the spring assessment. Assuming that the validity of spring-to-spring cross-year VAMs increases monotonically as the school year progresses, we would predict that a spring-to-spring cross-year VAM based on tests administered in March would be more valid than one based on tests administered in February, but less valid than one based on tests administered in April. Therefore, if one is willing to make this sort of monotonicity assumption, tests administered later in the school year are better than tests administered earlier in the year. Unfortunately, our current data do not allow us to formally test this hypothesis.

Notes

1. Sample sizes are rounded to the nearest 50, as per NCES rules for restricted-use ECLS-K data.

2. See Fitzpatrick et al. (2011) and Quinn (2015) for more discussion of the tests.

3. Unfortunately, the ECLS-K does not report the exact day of the assessment. Instead, an indicator for the week the assessment was administered is provided. We impute test dates by converting the week indicators to the midpoint of the week (e.g., if week 1 covers the 1st through the 7th, we impute a test date of the 4th). This should not create any systematic bias due to the conditional randomness of the test dates (Fitzpatrick et al., 2011).

References

Downey, D. B., von Hippel, P. T., & Broh, B. (2004). Are schools the great equalizer? Cognitive inequality during the summer months and the school year. American Sociological Review, 69(5), 613–635.

Downey, D. B., von Hippel, P. T., & Hughes, M. (2008). Are “failing” schools really failing? Using seasonal comparison to evaluate school effectiveness. Sociology of Education, 81(3), 242–270.

Gershenson, S., & Hayes, M. S. (2017). The summer learning of exceptional students. American Journal of Education, 123(3), 447–473.

Gershenson, S., & Hayes, M. S. (2018).
The implications of summer learning loss for value-added estimates of teacher effectiveness. Educational Policy, 32(1), 55–85.

Fitzpatrick, M. D., Grissmer, D., & Hastedt, S. (2011). What a difference a day makes: Estimating daily learning gains during kindergarten and first grade using a natural experiment. Economics of Education Review, 30(2), 269–279.

Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). Can value-added measures of teacher performance be trusted? Education Finance & Policy, 10(1), 117–156.

Koedel, C., & Betts, J. R. (2011). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Education Finance & Policy, 6(1), 18–42.

Loeb, S., & Candelaria, C. (2012). How stable are value-added estimates across years, subjects, and student groups? The Carnegie Knowledge Network. Retrieved from http://www.carnegieknowledge.org/wp-content/uploads/2012/10/CKN_201210_Loeb.pdf

McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance & Policy, 4(4), 572–606.

McEachin, A., & Atteberry, A. (2017). The impact of summer learning loss on measures of school performance. Education Finance & Policy, 12(4), 468–491.

Quinn, D. M. (2015). Black–white summer learning gaps: Interpreting the variability of estimates across representations. Educational Evaluation and Policy Analysis, 37(1), 50–69.

Schwartz, A. E., Stiefel, L., & Cordes, S. A. (2017). Moving matters: The causal effect of moving schools on student performance. Education Finance & Policy, 12(4), 419–446.

Superville, D. R. (2015). Students take too many redundant tests, study finds. Education Week. Retrieved from https://www.edweek.org/ew/articles/2015/10/28/students-take-too-many-redundant-tests-study.html

Ujifusa, A. (2012). Standardized testing costs states $1.7 billion a year, study says. Education Week.
Retrieved from https://www.edweek.org/ew/articles/2012/11/29/13testcosts.h32.html

Solon, G., Haider, S. J., & Wooldridge, J. M. (2015). What are we weighting for? Journal of Human Resources, 50(2), 301–316.


