Change and Continuity in Grades 3–5: Effects of Poverty and Grade on Standardized Test Scores
by Heidi Legg Burross - 2008
Background/Context: The question of the influence of Comprehensive School Reform (CSR) on achievement is an important one because many policy makers use achievement scores as the measure of success for schools, classrooms, and students. Research has demonstrated that high-poverty schools have less experienced teachers and access to fewer resources than do low- and moderate-poverty schools. Interest in fourth-grade achievement has been minimal both in research and in legislation.
Research Question: Do these CSR schools make gains that would not be expected without the funding and programs? Another question examined here is whether there is a decrease in performance at fourth grade.
Population: The population consists of third-, fourth-, and fifth-grade student data from 65 schools.
Research Design: Data for the samples from the state's norm-referenced tests (the Stanford Achievement Test and the TerraNova) and the criterion-referenced Arizona's Instrument to Measure Standards (AIMS), spanning the years 2000–2007, were compared over time and between groups.
Conclusions: These limited data indicate that there were occasional, observable performance decreases on student standardized test scores from third to fourth grade that often recovered somewhat in Grade 5. Because of problems with making cross-year and cross-grade comparisons using the AIMS scores, the “fourth-grade window” hypothesis could not be reliably inspected with the data available. Although gains were shown for schools that received CSR funding, their gains were similar to both high- and low-poverty schools that received no funding. Fluctuations in yearly performances may be more of an artifact of changes in test design and scoring than of student improvements.
The question of the influence of Comprehensive School Reform (CSR) on achievement is an important one because many policy makers use achievement scores as the measure of success for schools, classrooms, and students. Do these CSR schools make gains that would not be expected without the funding and programs? Another question examined here involves the fourth-grade window, the idea that there is a decrease in performance at fourth grade (Good, Burross, & McCaslin, 2005). There are many hypotheses as to what may lead to this observed dip in performance, some of which are discussed subsequently, but these data tell more about this effect and its relation to poverty. Currently, the focus of decision-making (including No Child Left Behind [NCLB]) is third grade, with fourth grade not even tested by some states until recently.
REVIEW OF THE LITERATURE
THE EFFECTS OF POVERTY
Do the resources available at home and in school affect the level of achievement of the students they house and serve? There is considerable debate on this point dating back over 40 years, with Coleman (Coleman et al., 1966) and Hanushek (1997) arguing that the poverty level of the school does not relate to the achievement level of its students. These authors are in a general minority, however, because other research has found that students' achievement levels do correlate with poverty indicators in schools and surrounding communities (Ladd, Chalk, & Hansen, 1999; Stiefel, Rubenstein, & Berne, 1998). With limited resources available in their homes and schools, high-poverty students have limited access to technology, texts, and programs, and limited opportunities for success (Darling-Hammond, 2006). Darling-Hammond has also found that high-poverty schools have more inexperienced teachers, with lower levels of education, than schools with less poverty. The effect of poverty on the crucial fourth-grade window is of special interest for determining whether students at a financial disadvantage are also suffering academic decline unparalleled by their more affluent peers. There is international evidence that fourth graders show differential achievement in reading based on their poverty status (Ogle et al., 2003).
The third grade has been the focus of many studies and policy-making decisions. Current legislation considers third grade to be the point at which students should be able to read, and third grade is the targeted population for determining federal grant eligibility (No Child Left Behind Act [NCLB] of 2001, 2002). In fact, third grade is mentioned three times in the NCLB Act as the end period for when students should have mastered key basic skills; fourth grade is mentioned only once in a quote from research as the grade in which Hawaiian students ranked the lowest of 39 states on reading in 1998. This, despite Pogrow's (1999) claim that third-grade test scores tend to overpredict achievement for students in high-poverty situations.
Third grade is important, but fourth grade has shown some interesting trends that bear investigation. Fourth grade marks the start of puberty for some students, along with the emotional and cognitive changes that accompany this developmental stage (Erikson & Erikson, 1981; Piaget, 1967). With this change, often there is also distraction and academic decline. Academic decline has been detected in standardized tests (Good et al., 2005) and in international research in which third-grade students outperformed fourth-grade students on the same items (Wang, 2003).
CONSIDERATIONS IN COMPARING TYPES OF SCHOOLS
The adoption of the NCLB law exerted pressure on public schools throughout the nation (Nichols, Glass, & Berliner, 2006). All schools in the present study were presumably responding to the demands of the legislation and working to improve their standardized test scores to maintain federal funding. These pressures led to changes in instruction and focus in classrooms (Nichols & Berliner, 2005; Pedulla et al., 2003). All schools in the present study and most nationwide were facing penalties for failures related to high-stakes testing, and many were changing their approaches to educating. These changes may have affected schools' test results, and these changes may be differentiated by the funding provided through the CSR program and by the resources available at the schools.
A total of 21 of 27 CSR schools that included Grades 3, 4, or 5 constituted the original sample. Schools in a comparison group of 23 nearby non-CSR schools in the same or a close zip code with similar grade make-up, free/reduced lunch percentages, and size were individually matched to the CSR schools. Another sample of 21 low-poverty schools was selected based on the same criteria, except these schools all had free/reduced lunch percentages less than 10% in 2000. The CSR schools averaged 79.8% free/reduced lunch recipients, matched non-CSR averaged 72.3%, and low poverty averaged 6.2% across the 8 years of this study.
The norm-referenced test used for third through fifth grades by the state was the Stanford Achievement Test (SAT-9) until 2003. Both the SAT-9 and the current norm-referenced test, the TerraNova (TN), have content areas that include math, reading, and language; both use multiple-choice-item formats. These are reported as national percentile ranks of the mean, which allows for comparisons across schools but makes general trends within grades difficult to detect. For example, evidence of the fourth-grade window would not be detected using these norm-referenced tests because the relative standing of individuals and schools would not change. An overall drop would be invisible in the rankings.
Third and fifth grades also have been assessed by the criterion-referenced test, Arizona's Instrument to Measure Standards (AIMS), since its inception in 1999, and the testing of fourth grade started in 2005. The AIMS test uses items created by Arizona teachers. It includes math, reading, and writing content areas. The determination of what is required for scores of "falls far below the standard," "approaches the standard," "meets the standard," and "exceeds the standard" changes over years and differs by grades, so it is difficult to compare across these time periods. Percentages of students within each category are reported online through the state's Web site (Arizona Department of Education, 2008). Those who exceed or meet the standard are considered to be passing.
Arizona started using the Arizona's Instrument to Measure Standards Dual Purpose Assessment (AIMS-DPA) in 2005 for third through eighth grades. This weeklong process combines items from the AIMS and the TN tests.
RELATIONSHIPS BETWEEN ACHIEVEMENT SCORES AND POVERTY LEVELS
Pearson product-moment correlations between AIMS and SAT-9/TN test scores across the years 2000–2007 by grade indicate that the two are highly related, with correlation values ranging from r = 0.80 to r = 0.98 (all p < 0.01). Fourth grades in available years 2005–2007 had the lowest correlations, with none exceeding r = 0.91. Although the tests use different assessment methods and the criteria for passing change over years, there is still a strong relationship between the scores.
The correlations between test scores and free/reduced lunch percentages were also strong (all p < 0.01) between and across all years for both AIMS and SAT-9/TN. For SAT-9/TN, the range was r = -.89 to r = -.95 (n = 58 to n = 62), and for AIMS, the correlations ranged from r = -.84 to r = -.94 (n = 52 to n = 61). These negative correlations stem from the fact that those schools with lower percentages of students receiving free/reduced lunches have higher standardized test scores. Even when the low-poverty sample was removed from the analysis, correlations among SAT-9 scores and free/reduced lunches for just CSR and non-CSR schools were between r = -.74 and r = -.39 (p < 0.05, n = 40 to n = 43 for all) for all within-year relationships. For AIMS, r was found to be between r = -.71 and r = -.30 (p < 0.06, n = 35 to n = 42 for all). With even small poverty differences in schools, there was a relationship between test scores and poverty levels.
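The direction and strength of these school-level relationships can be reproduced in miniature. Below is a minimal sketch using scipy; the nine schools and their values are invented for illustration and are not the study's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical school-level data: percentage of students receiving
# free/reduced lunch, and mean percentage of students passing a test.
free_lunch_pct = np.array([85, 78, 72, 65, 40, 25, 12, 8, 5])
pct_passing    = np.array([18, 25, 30, 41, 55, 68, 80, 88, 92])

r, p = pearsonr(free_lunch_pct, pct_passing)
print(f"r = {r:.2f}, p = {p:.4f}")
```

The negative sign of r mirrors the reported finding: schools with higher poverty indicators show lower test scores.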
SCHOOL TYPE DIFFERENCES
There were differences among the three school types on the norm-referenced tests. From 2000 to 2007, the CSR schools increased two to eight percentiles on average (see Figure 1).
Figure 1. Mean Percentile Rank SAT-9/TerraNova Scores Across Subjects by School Type From 2000 to 2007
The low-poverty samples dropped by about the same amount over the same time period. There was a general increase in 2004, the year before the change from SAT-9 to TN, and a drop in all schools' percentile scores in 2005, when the TN test was adopted by the state. Although the reason for this drop is unknown and not documented in the state's investigation of the change in norm-referenced tests (Arizona Linking Study, 2005), a few factors may have led to this result.
One possible explanation is that the schools had been using the SAT-9 practice tests to help prepare students for the exam, and the change may have negated this technique. The TN may be less aligned with classroom achievement goals. There are also fewer items on the TN than on the SAT-9, decreasing reliability.
An examination of mean differences in AIMS mean percentage passing among the three samples yielded similar findings but must be approached cautiously because of the changing criteria for passing (Olson & Sabers, in press). Using the 2000 math AIMS test as an example, CSR schools had mean values of 10% and 21% passing for third and fifth grades, respectively, non-CSR had 35% and 19%, and low poverty had 83% and 67%. Prior to 2005, at least six CSR and non-CSR schools had no students pass AIMS math in third grade or fifth grade in one or more years. In all content areas, the performances in years 2000–2004 on the AIMS maintained at least a 40- to 50-percentage-point difference between CSR and low-poverty schools in all grades. In 2005, all schools saw a jump in percentage passing the AIMS (as explained by the National Assessment of Educational Progress and in Olson & Sabers, in press). The CSR and non-CSR schools now had percentage passing values on the AIMS ranging from 37% to 71%, whereas low poverty had percentage passing values mostly above 90%, with only one year in one grade dropping below 80%.
It is of little surprise that there was greater fluctuation in the criterion-referenced AIMS test scores than in the norm-referenced SAT-9/TN scores. Percentile ranks are generally less variable than are criterion-referenced scores by their very natures. If a test in one year is more difficult or easier than it had been in previous years, all students may perform differently, which affects the percentage who pass at criterion level (Olson & Sabers, in press). But their general standing in comparison to other students would not be expected to change much.
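The contrast between the two score types can be illustrated with a short simulation (synthetic scores, not AIMS or SAT-9/TN data): a uniformly harder test lowers every raw score, which moves many students below a fixed cut score, yet leaves each student's percentile rank untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(500, 50, 1000)   # raw scores in a typical year
harder = scores - 30                 # same students, uniformly harder test

def percentile_ranks(x):
    # Percentile rank: position of each score in the ordered distribution.
    order = x.argsort().argsort()
    return 100.0 * (order + 1) / len(x)

cut = 480                            # fixed "meets the standard" cut score
print("pass rate, typical year:", np.mean(scores >= cut))
print("pass rate, harder test: ", np.mean(harder >= cut))
print("percentile ranks unchanged:",
      np.allclose(percentile_ranks(scores), percentile_ranks(harder)))
```

Subtracting a constant preserves the ordering of students, so the norm-referenced ranks are identical across the two "years," while the criterion-referenced pass rate drops.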
COMPARISONS OVER TIME
Comparisons over years and grades should be cautiously approached, particularly with the AIMS scores, because the general difficulty of items, the cut-off scores for the passing categories, and the stakes involved in the test scores changed over the years of this study (Tucson Unified School District, 2005). The test was modified in response to public pressures and when a passing score became required for graduation. However, following the same groups across years may shed some light on performance differences in grades and school types. Four cohorts were created for the years of data in this study: third grade in 2000 (Cohort 1), 2003 (Cohort 2), 2004 (Cohort 3), and 2005 (Cohort 4). Repeated measures analyses of variance allowed for examination of three aspects: the main effect of school type (CSR, non-CSR, and low poverty); changes over time (by year/grade); and the interaction of school type and time (addressing the question, Are changes in time different for the different school types?). Analyses were run separately for AIMS and SAT-9/TN data.
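The design just described, with school type as a between-groups factor, year/grade as a repeated measurement, and the school as the repeated unit, can be approximated with a mixed-effects model. The sketch below uses statsmodels on synthetic data; the school counts, baseline scores, and trend values are invented, not the study's:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
rows = []
for stype, base in [("CSR", 30), ("nonCSR", 35), ("lowpov", 75)]:
    for s in range(15):                   # 15 hypothetical schools per type
        school_effect = rng.normal(0, 3)  # stable school-level variation
        for year in range(4):             # four yearly measurements per school
            score = base + school_effect + 1.5 * year + rng.normal(0, 2)
            rows.append({"school": f"{stype}_{s}", "type": stype,
                         "year": year, "score": score})
df = pd.DataFrame(rows)

# A random intercept per school accounts for the repeated measurements;
# the type-by-year interaction asks whether trends differ by school type.
model = smf.mixedlm("score ~ C(type) * year", df, groups=df["school"])
result = model.fit()
print(result.summary())
```

In this framing, the main effect of school type, the effect of time, and their interaction correspond to the three aspects examined in the repeated-measures analyses of variance.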
The first aspect was easy to address because for all cohorts in all content areas on both tests, the low-poverty schools outperformed both CSR and non-CSR schools (all p ≤ .001). There were some differences between the CSR and non-CSR, but there were no discernible trends.
Analyses with SAT-9/TN scores on these cohorts did reveal mixed trends (Table 1).
Table 1. SAT-9/TN performance area percentile rank means and standard deviations (in parentheses) by school type and year/grade
For both tests, there were no interaction effects of school type with time, so relative standing of these school types did not change over time. The first cohort increased in math and reading in fourth grade but dropped in reading and language in fifth grade. The second cohort also improved in math and reading in fourth grade but dropped in fourth-grade language and in fifth-grade math. Cohort 3 was more stable but did drop in fifth-grade math and in fourth-grade language. The most recent cohort improved in fourth-grade reading and language but dropped in fifth-grade math.
There was more variability in the AIMS data (Table 2).
Table 2. AIMS performance area percentage passing means and standard deviations (in parentheses) by school type and year/grade
Only third- and fifth-grade data are available for the first two cohorts. Cohort 1 dropped in all content areas from third to fifth grade. The drop was more drastic for the CSR and non-CSR schools in reading and writing. The second cohort improved in math but dropped in writing.
The last two cohorts would be of the most interest because they allow the first criterion-referenced examination of the fourth-grade window; however, the cut scores for proficiency changed during these years (Olson & Sabers, in press), making conclusions about changes impossible to interpret.
These limited data indicate that there were occasional, observable performance decreases on student standardized test scores from third to fourth grade that often recovered somewhat in Grade 5. Because of problems with making cross-year and cross-grade comparisons using the AIMS scores, the fourth-grade window hypothesis could not be reliably inspected with the data available. If at some point the cut score for passing stabilizes and the test is designed to allow for observations of growth over time, the fourth-grade window hypothesis should be reexamined.
Although gains were shown for schools that received CSR funding, their gains were similar to both high- and low-poverty schools that received no funding. Fluctuations in yearly performances may be more of an artifact of changes in test design and scoring than of student improvements. Although it would be nice to claim that funding and implementing new programs solved issues of low test performance, evidence suggests that this process alone may not be sufficient. Data are not available to examine prefunding performances, but the funding may have arrested plummeting scores over time, allowing these schools to maintain performances, albeit lower performances. Individual schools have claimed that progress was made after being funded, but their performances mirror the performances of schools in similar situations that did not receive funding.
The picture is less clear for the fourth-grade window. All types of schools show differences in test performance over time: sometimes increases, sometimes decreases. This picture is even more muddled by the changing nature of the criterion-referenced test given in Arizona and the lack of absolute performance information available from the norm-referenced tests. The only way to definitively examine this phenomenon is to look at individual student data over years on a stable, comparable measure. Absence of individual, trustworthy data makes it impossible to address, for example, whether a student at the 20th percentile has moved up by the fourth grade, or whether individual classrooms are seeing impressive, notable gains.
Despite these results, individual student test performance is sent to principals and teachers in the late summer or early fall after the administration. Many practitioners use this information for student program placement. They need to depend on the validity and reliability of the test scores not only as a student indicator but also in relation to the achievement profiles determined by the state. As part of the state's classification system, AIMS scores are used to determine where a school falls along a continuum, from excelling to failing. Schools deemed failing are targeted for funding and staffing changes.
The AIMS test is a state and federal requirement, so there is little indication that the state will stop using it in the near future. Whether the implementation of high-stakes tests such as these benefits education and students has not been determined. Many researchers (Nichols et al., 2006; Pedulla et al., 2003) have concluded that high-stakes tests are detrimental to the educational process and derail techniques that teachers have found to be historically effective. Although it is an admirable goal to aim to improve student achievement, there is little evidence that this goal has been achieved in CSR schools in the state of Arizona.
This research was supported by the U.S. Office of Educational Research and Improvement (OERI) Grant No. R306S000033. The author takes full responsibility for the work, and no endorsement from OERI should be assumed.
Arizona Department of Education. (2008). Research and evaluation section. Retrieved April 18, 2008,
Arizona Linking Study. (2005). Technical report for the linking of TerraNova to SAT/9. Retrieved
September 27, 2007, from
Coleman, J., Campbell, E., Hobson, C., McPartland, J., Meade, A., Weinfeld, F., et al. (1966). Equality of
educational opportunity. Washington, DC: US Department of Health, Education and Welfare.
Darling-Hammond, L. (2006). Securing the right to learn: Policy and practice for powerful teaching and
learning. Educational Researcher, 35(7), 13–24.
Erikson, E. H., & Erikson, J. M. (1981). On generativity and identity: From a conversation with Erik and Joan Erikson. Harvard Educational Review, 51, 249–269.
Good, T., Burross, H. L., & McCaslin, M. (2005). Comprehensive School Reform: A longitudinal study of
school improvement in one state. Teachers College Record, 107, 2205–2226.
Hanushek, E. (1997). Assessing the effects of school resources on student performance: An update.
Educational Evaluation and Policy Analysis, 19, 141–164.
Ladd, H. F., Chalk, R., & Hansen, J. S. (Eds.). (1999). Equity and adequacy in educational finance:
Issues and perspectives. Washington, DC: U.S. Department of Education.
Nichols, S., & Berliner, D. C. (2005, March). The inevitable corruption of indicators and educators through
high-stakes testing (EPSL-0503101-EPRU). Retrieved September 27, 2007, from
Nichols, S. L., Glass, G. V., & Berliner, D. C. (2006). High-stakes testing and student achievement: Does
accountability pressure increase student learning? Education Policy Analysis Archives, 14(1). Retrieved
September 27, 2007, from http://epaa.asu.edu/epaa/v14n1/
No Child Left Behind Act of 2001, Pub. L. No. 107-110, 115, Stat. 1425 (2002).
Ogle, L. T., Sen, A., Pahlke, E., Jocelyn, L., Kastberg, D., Roey, S., et al. (2003). International
comparisons in fourth-grade reading literacy: Findings from the progress in international reading literacy
study (PIRLS) of 2001. Boston: Boston College.
Olson, A. M., & Sabers, D. L. (in press). Standardized tests. In T. L. Good (Ed.), 21st century education: A
reference handbook. Thousand Oaks, CA: SAGE.
Pedulla, J. J., Abrams, L. M., Madaus, G. F., Russell, M. K., Ramos, M. A., & Miao, J. (2003, March).
Perceived effects of state-mandated testing programs on teaching and learning. Findings from a national
survey of teachers. Retrieved September 27, 2007, from the National Board on Educational Testing and
Public Policy (Boston College) Web site: http://www.bc.edu/research/nbetpp/statements/nbr2.pdf
Piaget, J. (1967). The child's conception of the world. Totowa, NJ: Littlefield, Adams, and Company.
Pogrow, S. (1999). Overcoming the cognitive wall: Accelerating the learning of Title I students after the
third grade. In G. Orfield & E. Debray (Eds.), Hard work for good schools: Facts not fads in Title I reform
[mimeo]. The Civil Rights Project, Harvard University, Cambridge, MA.
Stiefel, L., Rubenstein, R., & Berne, R. (1998). Intra-district equity in four large cities: Data, methods and
results. Journal of Education Finance, 23, 447–467.
Tucson Unified School District. (2005). AIMS: Cut changes, test changes, and score comparability.
Retrieved April 28, 2008, from
Wang, J. (2003, April). An analysis of item score difference between 3rd and 4th grades using the TIMSS
database. Paper presented at the annual meeting of the American Educational Research Association.