Spring or Fall Annual Tests? Implications for Value-added Models
by Michael S. Hayes & Seth Gershenson — September 04, 2018
School districts rely on standardized tests that are only administered once per academic year to produce value-added measures (VAMs) of teacher effectiveness. This is problematic because students’ summer learning is incorrectly attributed to the teacher, potentially biasing estimates of teacher effectiveness. However, there is limited research on whether spring or fall tests yield more valid VAMs. We fill this gap in knowledge by comparing the accuracy of fall-to-fall and spring-to-spring “cross-year” VAMs relative to arguably more valid fall-to-spring “within-year” VAMs. We find that spring-to-spring “cross-year” VAMs, relative to fall-to-fall “cross-year” VAMs, are more valid, as they are more consistent with “within-year” VAMs. This suggests that spring assessments are preferred to fall assessments, at least when the objective is to obtain valid VAM-based estimates of school or teacher effectiveness.
Most school districts rely on standardized tests that are only administered once per academic year, usually in the spring, to evaluate teacher and school effectiveness. For example, the Nashville school district administers standardized tests each spring that measure students' achievement gains between the spring of grade g − 1 and the spring of grade g. However, this approach is potentially problematic, as students' summer learning is incorrectly attributed to students' grade-g teachers and schools. The resultant potential bias in value-added measures (VAMs) of teacher and school effectiveness when using such cross-year or spring-to-spring achievement gains is well documented (Downey et al., 2008; McEachin & Atteberry, 2017; Gershenson & Hayes, 2018).
One seemingly attractive solution is to administer tests at both the start and end of the school year and estimate VAMs on the associated within-year gains. However, this is not a panacea, as schools and teachers being evaluated on growth would have an incentive to artificially depress the fall scores. Moreover, adding a second round of tests would be costly, in terms of both time and money, not to mention the political costs of adding more tests at a time when many stakeholders are calling for fewer tests (Superville, 2015; Ujifusa, 2012). Accordingly, if high-stakes tests are to be administered only once per year, it is vital for school administrators to know whether implementing those assessments in the spring or fall yields more credible VAMs.
The current study fills a gap in the VAM literature by directly addressing the question of whether fall or spring assessments yield more accurate VAMs. We do so by comparing the accuracy of fall-to-fall and spring-to-spring cross-year VAMs relative to the arguably more valid fall-to-spring within-year VAMs. Specifically, we use student-level data from the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K) to estimate both cross-year and within-year VAMs for first-grade classrooms, then compare the VAM-based rankings. We find that spring-to-spring cross-year VAMs, relative to fall-to-fall cross-year VAMs, are more reliable and consistent with within-year VAMs. This suggests that when only one assessment per year is feasible, spring assessments are preferred to fall assessments, at least when the objective is to obtain valid estimates of school or teacher effectiveness.
The current study uses student-level data from the Early Childhood Longitudinal Study, Kindergarten Class of 2010-11 (ECLS-K) to estimate classroom value-added scores. The ECLS-K, collected by the National Center for Education Statistics (NCES), is a longitudinal dataset comprising a nationally representative sample of the 2010-11 kindergarten cohort. The full sample contains approximately 18,000 children in over 900 kindergarten programs.1 The survey oversampled certain subgroups of children, and the current study uses NCES-provided sampling weights to adjust for the survey's nonrandom sampling frame. The main results remain qualitatively similar when the sampling weights are not applied, which is reassuring because it suggests that the main result is not driven by schools that serve relatively more disadvantaged students than the typical U.S. school (Solon, Haider, & Wooldridge, 2015).
The ECLS-K administered age-appropriate reading and mathematics tests to all surveyed children in the fall and spring of kindergarten and the spring of first grade.2 The ECLS-K computed vertically scaled test scores using Item Response Theory (IRT). In the baseline analysis, we use the unstandardized version of these IRT test scores; however, results are qualitatively similar if we standardize by subject, grade, and semester. We caution readers against extrapolating the current study's results to other grade levels.
Our analysis also requires test score data in the fall of both first grade and second grade. However, only a random subsample of ECLS-K children were surveyed in the falls of first and second grades. Therefore, the analytic sample is restricted to this subsample, which includes approximately 3,700 children. The fall observations facilitate the following calculations:
Test-score change between spring of kindergarten (K) and spring of first grade (1)
Test-score change between fall of first grade (1) and spring of first grade (1)
Test-score change between fall of first grade (1) and fall of second grade (2)
We make three additional sample restrictions. First, students who changed schools at any time between kindergarten and second grade, or who experienced a mid-year classroom change, are excluded. School changers are excluded to avoid conflating summer learning with the disruptions to learning associated with changing schools (Schwartz et al., 2017). Second, we exclude students who repeated or skipped kindergarten, first grade, or second grade. Lastly, students are excluded if they were missing basic demographic data or classroom indicators. These restrictions yield a baseline analytic sample of approximately 1,800 first graders in over 750 classrooms. To check that the main results are not driven by imprecision associated with the relatively small number of students per classroom, we rerun the analysis on a subsample of 150 classrooms with at least four surveyed students and find qualitatively similar results.
The ECLS-K is an ideal dataset for the current study for at least two reasons. First, the ECLS-K is the only nationally representative survey of U.S. students with test score data spanning two summer vacations (i.e., the summers after kindergarten and first grade) that links students to classrooms. This feature of the ECLS-K allows for the estimation of both within-year and cross-year VAMs for the same cohort's first-grade school year. Second, the ECLS-K collects data from parents on students' summer activities, which allows us to test whether conditioning on summer activities reduces the bias inherent in cross-year VAMs. Table A1 summarizes the student characteristics and summer activities of the analytic sample of ECLS-K first graders.
Notes. All ECLS-K estimates are weighted using NCES-provided sampling weights to account for unequal probabilities of sample selection. ECLS-K sample sizes are rounded to the nearest 50, in accordance with NCES regulations for restricted-use ECLS-K data.
Three features of the ECLS-K assessments require further explanation. First, due to the cohort nature of the ECLS-K, we only observe teachers in one school year; therefore, we can only identify classroom and not teacher effects. Second, the ECLS-K assessments were administered to different students on different days. A small number of ECLS-K administrators met individually with each student when administering an assessment, and this resulted in variation in test dates across schools, classrooms, and even students within the same classrooms (Fitzpatrick et al., 2011). Third, the assessments were not administered on the first or last days of the academic year. This is potentially problematic because some kindergarteners took the exam well in advance of the end of kindergarten and some first graders took the test well after the start of first grade. To account for this, we follow Quinn (2015) in adjusting for the timing of the test by extrapolating each math and reading test score to the first (or last) day of the school year.
For the summer between kindergarten (K) and first grade (1) and the summer between first grade and second grade (2), there are nine relevant dates (d): the test dates in the fall and spring of kindergarten, the fall and spring of first grade, and the fall of second grade, plus the last days of kindergarten and first grade and the first days of first and second grade.3 The extrapolations follow two steps. First, we calculate the daily learning rate during the relevant academic year for each child. Second, assuming that the same daily learning rate applies to the start and end of the year, we extrapolate what test scores would have been at the beginning and end of the school year. For example, we use the following equations to calculate the extrapolated end-of-kindergarten test score:

r_K = (y_SK − y_FK) / (d_SK − d_FK)    (1)

ŷ_EK = y_SK + r_K × (d_EK − d_SK)    (2)

where y_j represents achievement at date j for j ∈ {FK, SK, EK}, with FK and SK denoting the fall and spring kindergarten test dates and EK the last day of kindergarten; only the first two of these scores are observed. Equation 1 calculates the child-specific daily learning rate in kindergarten. Equation 2 uses the spring kindergarten assessment score and the kindergarten daily learning rate to calculate the predicted assessment score at the end of kindergarten for each child. Table 1 reports both math and reading extrapolated test scores for all relevant dates. Interestingly, average math achievement appears to increase by almost two points between the end of kindergarten and the start of first grade. However, there is a slight decrease in average math achievement between the end of first grade and the start of second grade. Table 1 also shows that the average child experiences summer reading loss over both summers.
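The two-step extrapolation can be sketched in a few lines of Python. The child's scores and test dates below are invented for illustration; they are not ECLS-K values.

```python
from datetime import date

def daily_rate(y_fall, y_spring, d_fall, d_spring):
    # Step 1: child-specific daily learning rate over the school year
    return (y_spring - y_fall) / (d_spring - d_fall).days

def extrapolate(y_spring, rate, d_spring, d_end):
    # Step 2: project the spring score forward to the last day of school,
    # assuming the same daily rate applies through the end of the year
    return y_spring + rate * (d_end - d_spring).days

# Hypothetical child: scored 32.0 on Oct 15 and 50.0 on Apr 15;
# the school year ends Jun 10 (all values illustrative).
r = daily_rate(32.0, 50.0, date(2010, 10, 15), date(2011, 4, 15))
y_end = extrapolate(50.0, r, date(2011, 4, 15), date(2011, 6, 10))
```

The same daily rate can likewise be applied backward from a fall score to impute achievement on the first day of the school year.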
We utilize the ECLS-K data to make three sets of comparisons. First, we compare VAM-based rankings of classroom effectiveness generated by fall-to-spring (within-year) achievement gains to the arguably less valid rankings generated by spring-to-spring (cross-year) gains; the corresponding VAM specifications are given by Equations 3a and 3b, respectively:

y_i,S1 − y_i,F1 = x_i′β + θ_c + u_ic    (3a)

y_i,S1 − y_i,SK = x_i′β + θ_c + u_ic    (3b)

where the subscripts S and F denote spring and fall test scores in the indicated grade (K for kindergarten, 1 for first grade).
Second, we compare VAM-based rankings of classroom effectiveness generated by Equation 3a to the arguably less valid rankings generated by fall-to-fall (cross-year) gains shown in Equation 3c below:

y_i,F2 − y_i,F1 = x_i′β + θ_c + u_ic    (3c)

where F2 and F1 denote scores from the fall of second grade and the fall of first grade, respectively.
Lastly, we compare VAM-based rankings from the two cross-year equations (3b and 3c).
In Equations 3a–3c, students and classrooms are indexed by i and c, respectively; K, 1, and 2 indicate kindergarten, first grade, and second grade, respectively; y is academic achievement (i.e., extrapolated math and reading scores); the vector x contains some combination of the student characteristics and summer activities described in Table A1; the θ are classroom fixed effects (FE) upon which rankings of classroom effectiveness are based; and u is a mean-zero error term that captures unobserved predictors of achievement. All equations are estimated by ordinary least squares (OLS). The baseline model contains only a limited set of student demographic variables, including indicators for race, gender, poverty, English language learner (ELL) status, individualized education plan (IEP) status, kindergarten redshirt status, attending private school, attending an urban school, and attending a rural school (Gershenson & Hayes, 2017). As a robustness check, we add further controls for student demographic characteristics and summer activities to all models.

We compare the rankings generated by cross- and within-year VAMs in two ways, following previous researchers (Gershenson & Hayes, 2018; Guarino, Reckase, & Wooldridge, 2015; Koedel & Betts, 2011; Loeb & Candelaria, 2012; McCaffrey et al., 2009; McEachin & Atteberry, 2017). First, we estimate Spearman rank correlations, simple summary statistics that measure the similarity between two rankings. Second, we construct transition matrices that document switching across specifications, which provide a more nuanced understanding of how the rankings change and of the implications for policies that penalize teachers at the bottom of the effectiveness distribution or reward teachers at the top.
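As an illustration of the first comparison method, the Spearman rank correlation is simply the Pearson correlation of the two rankings. The sketch below uses invented classroom effects, not ECLS-K estimates:

```python
def ranks(values):
    """Rank from 1 (smallest) to n; ties receive the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of tied positions i..j, 1-indexed
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Hypothetical classroom effects from a within-year and a cross-year VAM:
within = [0.10, -0.30, 0.05, 0.40, -0.10]
cross = [0.20, -0.25, -0.05, 0.35, 0.00]
rho = spearman(within, cross)
```

With real estimates, `within` and `cross` would hold the classroom fixed effects recovered from Equations 3a and 3b (or 3c).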
Table 2 reports Spearman rank correlations comparing the estimated first-grade classroom effects generated by Equations 3a, 3b, and 3c for the baseline specification as well as several alternative specifications. The Spearman rank correlations suggest that estimated classroom effects from VAMs using spring-to-spring achievement gains are, in both subjects, more robust to test timing than similar VAMs using fall-to-fall achievement gains. In fact, for both math and reading achievement, the Spearman rank correlation coefficients are more than 0.10 larger when using the classroom effects generated by spring-to-spring achievement gains than when using fall-to-fall achievement gains. Not surprisingly, as shown in Column 3, the Spearman rank correlation coefficients are smallest when comparing the rankings of classroom effects generated from VAMs using fall-to-fall achievement gains to similar classroom effects from spring-to-spring achievement gains.
The results reported in Table 2 are not significantly affected by changes to the baseline specification. For example, removing or adding control variables for student characteristics and summer activities does not appreciably change the Spearman correlation coefficients. This finding is not surprising, as previous research suggests that only about 10% of the variation in summer learning can be explained by student and household characteristics (Downey et al., 2004), and the ECLS-K data on summer activities do not contain detailed information on the quality of summer activities or on parental involvement over the summer. The result is also unaffected by excluding the NCES-provided sampling weights mentioned above. Similarly, the main result is not altered by including only classrooms with at least four surveyed students, which suggests that the findings are not driven by imprecision associated with the relatively small number of students per classroom.
Table 3 presents transition matrices for math and reading achievement based on the baseline value-added model that conditions on elements of x typically observed in administrative data. Transition matrices report the movement of classrooms across quintiles of the classroom effectiveness distribution, which provides a more nuanced understanding of the stability of the rankings reported in Table 2. The diagonal elements of the transition matrices reported in Table 3 represent classrooms that were in the same quintile of the effectiveness rankings generated by fall-to-fall and fall-to-spring VAMs. As expected given the results in Table 2, the figures along the diagonals are significantly lower than 100%, reinforcing the general finding that first-grade classroom effectiveness rankings are sensitive to the timing of the assessments used in the VAM. Indeed, only about half of classrooms ranked in the lowest or highest quintiles of math effectiveness remained in the same quintile in both the within-year and cross-year rankings.
Notes. The baseline model contains only a limited set of student demographic variables, including indicators for race, gender, poverty, English language learner (ELL) status, individualized education plan (IEP) status, kindergarten redshirt status, attending private school, attending an urban school, and attending a rural school. The rich control set specification contains the variables summarized in Table A1. All Spearman correlation coefficients reported in Table 2 are strongly statistically significant, with p-values less than 0.0005. The restricted sample includes only classrooms with at least four surveyed students: 700 students in 150 classrooms.
Notes. The statistics reported in this table compare rankings of classroom effects generated by Equations 3a and 3c of the main text. The sample contains 1,800 students and 750 classrooms.
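The quintile transition matrices can be built directly from two sets of estimated classroom effects. The sketch below uses invented effects for ten hypothetical classrooms, not ECLS-K estimates:

```python
def quintiles(values):
    """Assign each classroom to a quintile (1 = bottom, 5 = top) of its ranking."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    q = [0] * n
    for pos, i in enumerate(order):
        q[i] = min(pos * 5 // n + 1, 5)
    return q

def transition_matrix(a, b):
    """m[i][j]: number of classrooms in quintile i+1 under ranking a
    and quintile j+1 under ranking b; diagonal cells are 'stayers'."""
    qa, qb = quintiles(a), quintiles(b)
    m = [[0] * 5 for _ in range(5)]
    for i, j in zip(qa, qb):
        m[i - 1][j - 1] += 1
    return m

# Hypothetical within-year and cross-year classroom effects:
within = [0.4, 0.1, -0.2, 0.3, -0.5, 0.0, 0.2, -0.1, 0.5, -0.3]
cross = [0.3, 0.2, -0.1, 0.4, -0.4, -0.2, 0.1, 0.0, 0.6, -0.5]
m = transition_matrix(within, cross)
stayers = sum(m[k][k] for k in range(5))  # classrooms in the same quintile
```

Dividing each row of `m` by its row total converts the counts into the shares reported in a transition matrix.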
Table 4 replicates the transition matrix analysis of Table 3 for Equations 3a and 3b, comparing the rankings of first-grade classrooms generated by fall-to-spring VAMs to those generated by spring-to-spring VAMs. Table 4 shows that the spring-to-spring rankings track the within-year rankings more closely than the fall-to-fall rankings do, and large swings across multiple quintiles are exceedingly rare in both subjects. Overall, the results in Table 4 are consistent with the main finding from Table 2: estimated classroom effects from VAMs using spring-to-spring achievement gains are, in both subjects, less affected by test timing than similar VAMs using fall-to-fall achievement gains.
Notes. The statistics reported in this table compare rankings of classroom effects generated by Equations 3a and 3b of the main text. The sample contains 1,800 students and 750 classrooms.
The current study addresses a common problem facing the majority of U.S. school districts: the difficulty of estimating teacher effectiveness with standardized tests administered only once per year. The problem is that students' summer learning gains and losses are incorrectly attributed to schools and teachers when cross-year VAMs are used to evaluate teacher effectiveness. Indeed, previous research has documented this potential bias in cross-year VAM estimates (Downey et al., 2008; McEachin & Atteberry, 2017; Gershenson & Hayes, 2018). Given the political and financial challenges of administering standardized exams twice per year, school administrators need to know whether administering a single assessment in the spring or the fall yields more credible VAMs. Ours is the first study to directly address this question.
The current study provides information on the validity of value-added estimates of classroom effects generated by fall-to-fall and spring-to-spring cross-year VAMs relative to arguably more valid fall-to-spring within-year VAMs. We consistently find that estimated classroom effects from VAMs using spring-to-spring achievement gains are, in both subjects, more robust than similar VAMs using fall-to-fall achievement gains. Specifically, for both math and reading achievement, the Spearman rank correlation coefficients are more than 0.10 larger when using the classroom effects generated by spring-to-spring achievement gains than when using fall-to-fall achievement gains. The transition matrices reported in the current study point to the same conclusion. The policy implication is that when only one assessment per year is feasible, spring assessments are preferable to fall assessments, at least when the objective is to obtain valid estimates of school or teacher effectiveness. Moving forward, one area for future research is determining the optimal timing of the spring assessment. Assuming that the validity of spring-to-spring cross-year VAMs increases monotonically as the school year progresses, we would predict that a spring-to-spring cross-year VAM based on tests administered in March would be more valid than a similar VAM based on tests administered in February, but less valid than one based on tests administered in April. Therefore, if one is willing to make this sort of monotonicity assumption, tests administered later in the school year are preferable to tests administered earlier in the year. Unfortunately, our data do not allow us to formally test this hypothesis.
1. Sample sizes are rounded to the nearest 50, as per NCES rules for restricted-use ECLS-K data.
2. See Fitzpatrick et al. (2011) and Quinn (2015) for more discussion of the tests.
3. Unfortunately, the ECLS-K does not report the exact day of the assessment. Instead, an indicator for the week the assessment was administered is provided. We impute test dates by converting the week indicators to the midpoint of the week (e.g., if week 1 covers the 1st through the 7th, we impute a test date of the 4th). This should not create any systematic bias due to the conditional randomness of the test dates (Fitzpatrick et al., 2011).
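Assuming the week indicator marks the first day of the assessment week (this encoding is an assumption for illustration, not the ECLS-K's documented format), the midpoint imputation amounts to:

```python
from datetime import date, timedelta

def impute_test_date(week_start):
    # Midpoint imputation: if the reported week begins on day d,
    # impute the test date as d + 3 (the 4th day of that week).
    return week_start + timedelta(days=3)

# e.g., a week beginning September 1 yields an imputed date of September 4
```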
Downey, D. B., von Hippel, P. T., & Broh, B. (2004). Are schools the great equalizer? Cognitive inequality during the summer months and the school year. American Sociological Review, 69(5), 613–635.
Fitzpatrick, M. D., Grissmer, D., & Hastedt, S. (2011). What a difference a day makes: Estimating daily learning gains during kindergarten and first grade using a natural experiment. Economics of Education Review, 30(2), 269–279.
Gershenson, S., & Hayes, M. S. (2017). The summer learning of exceptional students. American Journal of Education, 123(3), 447–473.
Gershenson, S., & Hayes, M. S. (2018). The implications of summer learning loss for value-added estimates of teacher effectiveness. Educational Policy, 32(1), 55–85.
Guarino, C. M., Reckase, M. D., & Wooldridge, J. M. (2015). Can value-added measures of teacher performance be trusted? Education Finance & Policy, 10(1), 117–156.
Koedel, C., & Betts, J. R. (2011). Does student sorting invalidate value-added models of teacher effectiveness? An extended analysis of the Rothstein critique. Education Finance & Policy, 6(1), 18–42.
Loeb, S., & Candelaria, C. (2012). How stable are value-added estimates across years, subjects, and student groups? The Carnegie Knowledge Network. Retrieved from
McCaffrey, D. F., Sass, T. R., Lockwood, J. R., & Mihaly, K. (2009). The intertemporal variability of teacher effect estimates. Education Finance & Policy, 4(4), 572–606.
McEachin, A., & Atteberry, A. (2017). The impact of summer learning loss on measures of school performance. Education Finance & Policy, 12(4), 468–491.
Quinn, D. M. (2015). Black-white summer learning gaps: Interpreting the variability of estimates across representations. Educational Evaluation and Policy Analysis, 37(1), 50–69.
Schwartz, A. E., Stiefel, L., & Cordes, S. A. (2017). Moving matters: The causal effect of moving schools on student performance. Education Finance & Policy, 12(4), 419–446.
Solon, G., Haider, S. J., & Wooldridge, J. M. (2015). What are we weighting for? Journal of Human Resources, 50(2), 301–316.
Superville, D. R. (2015). Students take too many redundant tests, study finds. Education Week. Retrieved from https://www.edweek.org/ew/articles/2015/10/28/students-take-too-many-redundant-tests-study.html
Ujifusa, A. (2012). Standardized testing costs states $1.7 billion a year, study says. Education Week. Retrieved from https://www.edweek.org/ew/articles/2012/11/29/13testcosts.h32.html