Testing High-Stakes Tests: Can We Believe the Results of Accountability Tests?
by Jay Greene, Marcus Winters & Greg Forster - 2004
This study examines whether the results of standardized tests are distorted when rewards and sanctions are attached to them, making them high-stakes tests. It measures the correlation in school-level test results, including both score levels and year-to-year score changes, on high-stakes and low-stakes tests administered in the same schools in nine school systems. It finds that test score levels generally correlate very well, while year-to-year score changes correlate very well in Florida but much more weakly in other school systems. It concludes that the stakes of high-stakes tests do not distort information about the general level at which students are performing, and in Florida they also do not prevent the tests from providing accurate information about school influence over student progress.
There is considerable diversity in testing policies nationwide. States and school districts around the country vary in the types of tests they use, the number of subjects they test, the grades in which they administer the tests, and the seriousness of the sanctions or rewards they attach to test results. Some states, such as Minnesota, report scores on state-mandated tests to the public in order to shame school districts into performing better; other states, such as Ohio and Massachusetts, require students to pass the state exam before receiving a high school diploma. Chicago public school students must perform well on the Iowa Test of Basic Skills in specified grades in order to be promoted to the next grade, even though neither the test nor the sanction is required by the state of Illinois.
Perhaps the nation's most aggressive test-based accountability measure is Florida's A+ program. Florida uses results on the Florida Comprehensive Assessment Test (FCAT) to hold students accountable by requiring all students to pass the third grade administration of the exam before moving to the fourth grade, and by withholding diplomas from students who have not passed all sections of the tenth grade administration of the exam. It also holds schools and districts accountable by using FCAT results to grade schools from A to F on school report cards that are very widely publicized and scrutinized. However, what really makes Florida's program stand out is that the state holds schools and districts accountable for their students' performance on FCAT by offering vouchers to all students in schools that have earned an F on their report cards in any two of the previous four years. These chronically failing schools face the possibility of the ultimate consequence: they could lose their students and the state funding that accompanies them.
Two states, Florida and Virginia, and several school districts gave their students both a high-stakes test and a commercially designed low-stakes test during the school year. The low-stakes tests are used to assess how well students are doing compared to national norms and to decide what curriculum changes should be implemented to better serve students. Since parents and school officials see the results of the tests and use them for their own purposes, it would be incorrect to say that there are no stakes attached to them at all. However, the stakes attached to these tests are small enough that schools have little or no incentive to manipulate the results in the way that some fear high-stakes tests may be manipulated. Thus a student's performance on a low-stakes test is most likely free from potential distortion.
Several objections have been raised against using standardized testing for accountability purposes. Most concerns about high-stakes testing revolve around the adverse incentives created by the tests. Some have worried that pressures to produce gains in test scores have led to poor test designs or questionable revisions in test designs that exaggerate student achievement (e.g., see Koretz and Barron 1998 on Kentucky's test; Haney 2000 on Texas's test; and Haney et al. 1999 on Massachusetts's test). Others have written that instead of teaching generally useful skills, teachers are teaching skills unique to a particular test (e.g., see Amrein and Berliner 2002; Klein et al. 2000; McNeil and Valenzuela 2000; Haney 2000; and Koretz and Barron 1998). Still others have directly questioned the integrity of those administering and scoring the high-stakes tests, suggesting that cheating has produced much of the claimed rise in student achievement on such exams (e.g., see Cizek 2001; Dewan 1999; Hoff 1999; and Lawton 1996).
Most of these criticisms fail to withstand scrutiny. Much of the research done in this area has been largely theoretical, anecdotal, or limited to one or another particular state test. For example, McNeil and Valenzuela's critique of the validity of high-stakes testing is based largely on theoretical expectations and anecdotal reports from teachers, whose resentment of high-stakes testing for depriving them of autonomy may cloud their assessments of the effectiveness of testing policies (see McNeil and Valenzuela 2000). Their reports of cases in which high-stakes tests were manipulated are intriguing, but they do not present evidence on whether these practices are sufficiently widespread to fundamentally distort testing results.
Other researchers have compared high-stakes test results to results on other tests, as we do in this study. Prior research in this area, however, has failed to use tests that accurately mirror the population of students taking the high-stakes test or the level of knowledge needed to pass the state-mandated exam.
Amrein and Berliner (2002) find a weak relationship between the adoption of high-stakes tests and improvement in other test indicators, such as NAEP, SAT, ACT, and AP results.1 Koretz and Barron (1998) find that Kentucky's high-stakes test results show increases that are not similarly found in the state's NAEP results. Klein et al. (Klein, Hamilton, McCaffrey, and Stecher 2000) similarly claim that gains on the Texas high-stakes test appear to be larger than are shown by NAEP.
Comparing state-mandated high-stakes tests with college entrance and AP exams is misleading because the college-oriented exams are primarily taken by the best high school students, who represent a minority of all students. Though the percentage of students taking these exams has increased to the point that test-takers now include more than the most elite students, they still are not taken by all students, and this hinders their usefulness for assessing the validity of near-universally administered high-stakes tests. Only a third of all high school students take the SAT, and even fewer take the ACT or AP. Furthermore, college-oriented tests tell us nothing about the academic progress of the student population that high-stakes testing is most intended to benefit: low-performing students in underserved communities. In addition, because these tests are intended only for college-bound students they test a higher level of knowledge than most high-stakes tests, which are used to make sure students have the most basic knowledge necessary to earn a diploma. Any discrepancy between the results of college-oriented tests and high-stakes tests could be attributable to the difference in the populations taking these tests and the different sets of skills they demand.
Comparisons between high-stakes tests and NAEP are more meaningful than comparisons to college-oriented tests, though NAEP-based analyses also fall short of the mark. NAEP is administered infrequently and only to certain grades. Any weak correlation between NAEP and high-stakes tests could be attributable to such factors. When tests are not administered around the same time and are not administered to the same students, their results are less likely to track each other. This will soon change with the new, more frequent NAEP testing schedule required under the No Child Left Behind Act, although NAEP will also become a high-stakes test under No Child Left Behind, so its usefulness for evaluating other tests may not be improved.
Rather than focusing on statewide outcomes, like NAEP or college-oriented exam results, Haney uses classroom grades to assess the validity of Texas's high-stakes test. He finds a weak correlation between Texas's high-stakes results and classroom grades, from which he concludes that the Texas high-stakes test results lack credibility (see Haney 2000). However, it is more likely that classroom grades lack credibility. Classroom grades are subjective and inconsistently assigned, and are thus likely to be misleading indicators of student progress (see Barnes and Finn 2002 and Figlio and Lucas 2001). Even if we stipulate that classroom grades are valuable for some purposes, at the very least they are not an appropriate benchmark for evaluating the credibility of standardized tests. They are subject to significant differences in assessment standards between teachers, and are frequently assigned based on an individual student's personal situation rather than according to a set of strict standards. Furthermore, the importance of these problems is magnified when we analyze only one particular subject in one particular year, since grading standards and practices may also vary by subject and student grade.
There have also been a number of responses to these critiques of state testing validity. For example, Hanushek and Phelps have written a series of methodological critiques of the work by Haney and Klein (see Hanushek 2001 and Phelps 2001). Hanushek points out that Klein's finding of stronger gains on the Texas state test than on NAEP should come as no surprise given that Texas school curricula are more closely aligned with the Texas test than with NAEP (see Hanushek 2001). Phelps takes Haney and Klein to task for a variety of errors, alleging (for example) that Haney used incorrect NAEP figures on exemption rates in Texas and that Klein failed to note more significant progress on NAEP by Texas students because of excessive disaggregation of scores (see Phelps 2000).
Other analyses, such as those by Grissmer et al. and Greene, also contradict Haney and Klein's results. Contrary to Haney and Klein, Grissmer and Greene find that Texas made exceptional gains on the NAEP as state-level test results were increasing dramatically (see Grissmer, Flanagan, Kawata, and Williamson 2000; and Greene 2000). Unfortunately, our inability to correlate individual-level or school-level performance on the NAEP and the Texas test, as well as the infrequent administration of NAEP, prevent any clear resolution of this dispute.
This study differs from other analyses in that it focuses on the comparison of school-level results on high-stakes tests and commercially designed low-stakes tests. By focusing on school-level results we are comparing test results from the same or similar students, reducing the danger that population differences may hinder the comparison. Examining school-level results also allows for a more precise correlation of the different kinds of test results than is possible by looking only at state-level results, which provide fewer observations for analysis.2 In addition, school-level analyses are especially appropriate because in most cases the accountability consequences of high-stakes test results are applied at the school level. By comparing school-level scores on high-stakes and low-stakes tests, this study attempts to find where, if anywhere, we can believe high-stakes test results. If we see that high-stakes and low-stakes tests produce similar results, we have reason to believe that results on the high-stakes test were not affected by any of the adverse incentives tied to the test.
The first step in conducting this study was to locate states and school districts that administer both high-stakes and low-stakes tests. We examined information available on each state's Department of Education website about its testing programs, and contacted by phone states whose information was unclear. A test was considered high-stakes if any of the following depended upon it: student promotion or graduation, accreditation, funding cuts, teacher bonuses, a widely publicized school grading or ranking system, or state assumption of at least some school responsibilities. We found two states, Florida and Virginia, that administered both a high-stakes test and a low-stakes test.3 Test scores in Florida were available on the Florida Department of Education's website, and we were able to obtain scores from Virginia through a data request.
We next attempted to find individual school districts that also administered both high-stakes and low-stakes tests. We first investigated the 58 member districts of the Council for Great City Schools, which includes many of the largest school districts in the nation. Next, through Internet searches, we looked for other school districts that administer multiple tests. After locating several of these districts, we contacted them by phone and interviewed education staffers about the different types of tests the districts administered.
Because we were forced to rely on Internet searches and non-systematic phone interviews to find school districts that gave both high- and low-stakes tests, our search was certainly not exhaustive.4 As indicated in Table 1, the two states and seven school districts included in this study, which did administer both high- and low-stakes tests, contain approximately 9% of all public-school students in the United States and a significantly higher percentage of all students who take a high-stakes test. We therefore have reason to believe that our results provide evidence on the general validity of high-stakes testing nationwide.
We examined data from two states in which high- and low-stakes test results were available. Florida administers the FCAT, which is tied to an aggressive set of sanctions for low performance, including denial of promotion or graduation as well as vouchers for students at chronically failing schools (see discussion above). It also administers the Stanford-9, a nationally respected norm-referenced test, with no formal stakes attached to the results. Virginia administers the Standards of Learning (SOL) test, which is required for graduation and is also tied to school accreditation, as well as the Stanford-9 with no formal stakes attached.
We also examined data from seven school districts in which high- and low-stakes test results were available. Chicago administers the Iowa Test of Basic Skills (ITBS), a nationally respected norm-referenced test. In third, sixth, and eighth grade the city requires students to pass the ITBS for promotion, while in fourth, fifth, and seventh grade the ITBS is administered without formal stakes.5 Boston administers the Massachusetts Comprehensive Assessment System (MCAS), which the state requires for student promotion and graduation, as well as the Stanford-9 with no formal stakes. Toledo and Fairfield, both in Ohio, administer the Ohio Proficiency test, which the state requires for graduation and ties to school funding; Toledo also administers the Stanford-9 without formal stakes, while Fairfield administers the Terra Nova, a nationally respected norm-referenced test, without formal stakes. Blue Valley, Kansas, administers the Kansas Assessment test, which the state ties to school accreditation, as well as the ITBS without formal stakes. Columbia, Missouri, administers the Missouri Assessment Program test, which the state ties to school accreditation. The city also administers a test without formal stakes: the ITBS in 1998–1999 and 1999–2000, and the Stanford-9 in 2000–2001. Fountain Fort Carson, Colorado, administers the Colorado Student Assessment Program test, which the state uses to issue school report cards, as well as the ITBS without formal stakes.
In each of the school systems we studied, we compared scores on each test given in the same subject and in the same school year. When possible, we also compared the results of high- and low-stakes tests given at the same grade levels. We were not able to do this for all the school systems we studied, however, because several districts administer their low-stakes tests in different grade levels than their high-stakes tests. When a high- or low-stakes test was administered in multiple grade levels of the same school level (elementary, middle, or high school), we took an average of the tests for that school level. Though this method does not directly compare test scores for the same students on both tests, the use of school-level scores does reflect the same method used in most accountability programs.
Because we sometimes had to compute an average test score for a school, and because scores were reported in different ways (percentiles, scale scores, percent passing, etc.), we standardized scores from each separate test administration by converting them into what are technically known as z-scores. To standardize the test scores into z-scores, we subtracted the average score on that administration throughout the district or state from the score a school received on the administration. We then divided that difference by the standard deviation of the test administration. The standardized test score is therefore equal to the number of standard deviations each school's result lies from the sample average.
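This standardization step can be sketched in a few lines of Python; the function name and the raw scores below are our own illustrations, not figures from the study.

```python
def standardize(scores):
    """Convert raw school scores from one test administration into z-scores.

    Each z-score is the number of standard deviations a school's result
    lies from the average across all schools in the district or state.
    """
    n = len(scores)
    mean = sum(scores) / n
    # Standard deviation of scores on this administration.
    sd = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    return [(s - mean) / sd for s in scores]

# Illustrative raw percentile scores for five schools on one administration.
z = standardize([45.0, 60.0, 55.0, 70.0, 50.0])
```

Because every test administration is standardized against its own mean and standard deviation, scores reported on different scales (percentiles, scale scores, percent passing) become directly comparable.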
In school systems with accountability programs, there is debate over how to evaluate test results. School systems evaluate test results in one of two ways: either they look at the actual average test score in each school or they look at how much each school improved its test scores from one year to another. Each method has its advantages and disadvantages. Looking at score levels tells us whether or not students are performing academically at an acceptable level, but it does not isolate the influence of schools from other factors that contribute to student performance, such as family and community factors. Looking at year-to-year score gains is a value-added approach, telling us how much educational value each school added to its students in each year.
For the school systems we studied, we computed the correlation between high- and low-stakes test results for both the score level and the year-to-year gain in scores. We found the year-to-year gain scores for each test by subtracting the standardized score on the previous year's test administration from the standardized score on the current year's administration. For example, in Florida we subtracted each school's standardized score on the fourth grade reading FCAT in 2000 from the same school's standardized score on the fourth grade reading FCAT in 2001. This showed us whether a school was gaining or losing ground on the test.
We used a Pearson correlation to measure how similar the results from the high- and low-stakes tests were, both in terms of score levels and in terms of the year-to-year gain in scores. For example, for score levels we measured the correlation between the high-stakes FCAT third grade reading test in 2001 and the low-stakes Stanford-9 third grade reading test in 2001. Similarly, for year-to-year score gains we measured the correlation between the 2000–2001 score gain on the FCAT and the 2000–2001 score gain on the Stanford-9.6 Where there is a high correlation between high- and low-stakes test results, we conclude that the high stakes of the high-stakes test do not distort test results, and where there is a low correlation we have significantly less confidence in the validity of the high-stakes test results.7
There are many factors that could explain a low correlation between high- and low-stakes test results. One possibility would be that the high-stakes test is poorly designed, such that schools can successfully target their teaching on the skills required for the high-stakes test without also conveying a more comprehensive set of skills that would be measured by other standardized tests. It is also possible that the implementation of high-stakes tests in some school systems could be poorly executed. Administering high-stakes tests in only a few grades may allow schools to reallocate their best teachers to those grades, creating false improvements that are not reflected in the low-stakes test results from other grades. The security of high-stakes tests could also be compromised, such that teachers and administrators could teach the specific items needed to answer the questions on the high-stakes test without at the same time teaching a broader set of skills covered by the low-stakes standardized test. It is even possible that in some places teachers and administrators have been able to manipulate the high-stakes test answers to inflate the apparent performance of students on the high-stakes test.
More benign explanations for weak correlations between high- and low-stakes test results are also available. When we analyze year-to-year gains in test scores, there is the problem of having to measure student performance twice, thus introducing more measurement error. Weak correlations could also partially be explained by the fact that the score gains we examine do not track a cohort of the same students over time. Such data are not available, forcing us to compute the difference in scores between one year's students and the previous year's students in the same grade. While this could suppress the correlation of gain scores, it is important to note that our method is comparable to the method of evaluation used in virtually all state high-stakes accountability systems that have any kind of value-added measurement. In addition, if a school as a whole is in fact improving, we would expect to observe similar improvement on high- and low-stakes tests comparing the same grades over time.
Correlations between results on high- and low-stakes tests could also be reduced to some extent by differences in the material covered by different tests. High-stakes tests are generally geared to a particular state or local curriculum, while low-stakes tests are generally national. But this can be no more than a partial explanation of differences in test results. There is no reason to believe that the set of skills students should be expected to acquire in a particular school system would differ dramatically from the skills covered by nationally respected standardized tests. Students in Virginia need to be able to perform arithmetic and understand what they read just like students in other places, especially if students in Virginia hope to attend colleges or find employment in other places.
If low correlations between results on high- and low-stakes tests are attributable to differences between the skills required for the two tests, we might reasonably worry that the high-stakes test is not guiding educators to cover the appropriate academic material. It might be the case that the high-stakes test is too narrowly drawn, such that it does not effectively require teachers to convey to their students a broad set of generally useful skills. The low-stakes tests used in the school systems we studied are all nationally respected tests that are generally acknowledged to measure whether or not students have successfully achieved just this kind of broad skill learning, so if the high-stakes test results in these systems do not correlate with their low-stakes test results, this may be an indication that poorly-designed high-stakes tests are failing to cover a broad set of skills. On the other hand, if their high-stakes test results are strongly correlated with their results on low-stakes tests that are nationally respected as measurements of broad skill learning, this would give us a high degree of confidence that the high-stakes tests are indeed testing a broad set of generally useful skills and not just a narrow set of skills needed only to pass the test itself.
Interpretation of our results is made somewhat problematic because we cannot know with absolute certainty the extent to which factors other than school quality influence test score levels. Family background, population demographics, and other factors are known to have a significant effect on students' level of achievement on tests, but we have no way of knowing how large this effect is. To an unknown extent, score level correlations reflect other factors in addition to the reliability of the high-stakes test. However, the higher the correlation between score levels on high- and low-stakes tests, the less reason we have to believe that poor test design or implementation undermines the reliability of high-stakes test results. Furthermore, where a high correlation between year-to-year score gains accompanies a high correlation between score levels, we can be very confident that the high-stakes test is reliably measuring school quality because family and demographic factors have no significant effect on score gains.
No doubt some will object that a high correlation between high- and low-stakes test scores does not support the credibility of high-stakes tests because they do not believe that low-stakes standardized tests are any better than high-stakes standardized tests. Some may question whether students put forth the necessary effort on a test with no real consequences tied to their scores. This argument might prove true if we find low correlations between the tests on the score levels. If a large number of students randomly fill in answers on the low-stakes test, that randomness will produce low correlations with the high-stakes tests, on which the students presumably gave their best effort. But where we find high correlations on the score levels, we can have confidence that students gave comparable effort on the two tests. Increased randomness in student answers on low-stakes tests, which would result from their investing lower levels of effort in taking those tests, would show up in reduced correlations between high-stakes and low-stakes test results. High correlations between test results would be very unlikely unless students gave about the same level of effort on both tests.
Others may object entirely to the use of standardized testing to assess student performance. To those readers, no evidence would be sufficient to support the credibility of high-stakes testing, because they are fundamentally opposed to the notion that academic achievement can be systematically measured and analyzed by standardized tests. Obviously these readers will not be persuaded by the analysis presented here, which uses one set of standardized tests to measure the reliability of another set of standardized tests. This study begins with the premise that achievement can be measured by standardized tests, and is in fact adequately measured by low-stakes tests. Its purpose is to address the concern that measurements of achievement are distorted by the accountability incentives that are designed to spur improvement in achievement. By comparing scores on tests where there may be incentives to distort the results with scores on tests where there are almost no incentives to distort the results, we are able to isolate the extent to which the incentives of high-stakes testing are in fact distorting information on student achievement.
As indicated in Table 2, for all the school systems examined in our study we generally found high correlations between score levels on high- and low-stakes tests.8 We also found some high correlations for year-to-year gains in scores on high- and low-stakes tests, but the correlations of score gains were not as consistently high, and in some places were quite low.
This greater variation in score gain correlations might be partially explained by the increased measurement error involved in calculating score gains as opposed to score levels. It is also possible that high-stakes tests provide less credible measures of student progress in some school systems than in others. In places where high-stakes tests are poorly designed (such that teaching to the test is an effective strategy for boosting performance on the high-stakes test without also conveying useful skills that are captured by the low-stakes test) or where the security of tests has been compromised (such that teachers can teach the exact items to be included in the high-stakes test, or help students cheat during the test administration), the correlations between score gains on high- and low-stakes tests may be quite low. The high correlations between score level results on high- and low-stakes tests do not rule out these possibilities because, to an unknown extent, score levels reflect family and demographic factors in addition to school quality. However, despite this, the high correlations between score level results do justify a moderate level of confidence in the reliability of these systems' high-stakes tests.
Perhaps the most intriguing results we found came from the state of Florida. We might expect the especially large magnitude of the stakes associated with Florida's high-stakes test to make it highly vulnerable to adverse responses to the incentives created by high-stakes testing. It was in Florida, however, that we found the highest correlations between high- and low-stakes test results, both for score levels in each given year and for the year-to-year score gains.
Florida's high-stakes test, the FCAT, produced score levels that correlated with the score levels of the low-stakes Stanford-9 standardized test across all grade levels and subjects at 0.96. If the two tests had produced identical results, the correlation would have been 1.00. The year-to-year score gains on the FCAT correlated with the year-to-year score gains on the Stanford-9 at 0.71. Both of these correlations are very strong, suggesting that the high- and low-stakes tests produced very similar information about student achievement and progress. Because the high-stakes FCAT produces results very similar to those from the low-stakes Stanford-9, we can be confident that the high stakes associated with the FCAT did not distort its results. If teachers were teaching to the test on the FCAT, they were teaching generally useful skills that were also reflected in the results of the Stanford-9, a nationally respected standardized test.
In other school systems we found very strong correlations between score levels for high- and low-stakes test results in each given year, but relatively weak or even negative correlations between the year-to-year score gains on the two types of tests. For example, in Virginia (Table 4) the correlation between score levels on the state's high-stakes SOL and the low-stakes Stanford-9 was 0.77, but the correlation between the year-to-year score gains on these two tests was only 0.15. Similarly, in Boston (Table 5) the correlation between the level of the high-stakes MCAS and the low-stakes Stanford-9 was 0.75, but the correlation on the gain in scores between these two tests was a moderate 0.27. In Toledo (Table 6) the correlation between the level of the high- and low-stakes tests was 0.79, while the correlation between the score gains on the same tests was only 0.14.
In Chicago (Table 7), the correlation between score levels on the high- and low-stakes administrations of the ITBS was a very strong 0.88. But the year-to-year score gain of the results in high-stakes grades was totally uncorrelated (0.02) with the year-to-year score gain from the grades where the stakes are low. Similarly, in Columbia (Table 8) the high correlation (0.82) of score levels on the high- and low-stakes tests was accompanied by a weak negative correlation (-0.14) between the year-to-year score gains on the two types of tests.
In some school systems even the level of results on high- and low-stakes tests correlated only moderately well. In Blue Valley (Table 9) the high- and low-stakes tests produced score levels that correlate at 0.53 and score gains that correlate at only 0.12. In Fairfield the score levels on the high- and low-stakes tests correlated at 0.49, while, oddly, the year-to-year score gains have a moderate negative correlation of -0.56. In Fountain Fort Carson (Table 11) the score level correlation was only 0.32, while the score gain correlation was an even weaker 0.05.
The finding that high- and low-stakes tests produce very similar score levels tells us that the stakes of the tests do not distort information about the general level at which students are performing. If high-stakes testing is used only to ensure that students can perform at certain academic levels, then the results of those high-stakes tests appear to be reliable policy tools. The generally strong correlations between score levels on high- and low-stakes tests in all the school systems we examined suggest that teaching to the test, cheating, and other manipulations are not causing high-stakes tests to produce results that look very different from those of tests that carry no incentives for distortion.
But policymakers have increasingly recognized that score levels are strongly influenced by a variety of factors outside a school system's control, including student family background, family income, and community factors. If policymakers want to isolate the difference that schools and educators make in student progress, they need to look at year-to-year score gains, or value-added measures, as part of a high-stakes accountability system.
Florida has incorporated value-added measures into its high-stakes testing and accountability system, and the evidence shows that Florida has designed and implemented a system in which the year-to-year score gains on the high-stakes test correspond very closely with year-to-year score gains on standardized tests that carry no incentives to manipulate the results. This strong correlation suggests that the value-added results produced by Florida's high-stakes testing system provide credible information about student progress that is not distorted by the rewards and sanctions of the state's accountability system.
In all of the other school systems we examined, however, the correlations between score gains on high- and low-stakes tests are much weaker. We cannot be completely confident that those high-stakes tests provide accurate information about school influence over student progress. However, the consistently high correlations we found between score levels on high- and low-stakes tests do justify a moderate level of confidence in the reliability of those high-stakes tests.
Our examination of school systems containing 9% of all public school students shows that accountability systems that use high-stakes tests can, in fact, be designed to produce credible results that are not distorted by teaching to the test, cheating, or other manipulations of the testing system. We know this because we have observed at least one statewide system, Florida's, where high stakes have not distorted information either about the level of student performance or about the value that schools add to students' year-to-year progress. In other school systems we have found that high-stakes tests produce very credible information on the level of student performance and somewhat credible information on the academic progress of students over time.
1 For a critique specifically of the Amrein and Berliner study, see Greene and Forster 2003. Raymond and Hanushek question the validity of Amrein and Berliner's findings, given that their replication of Amrein and Berliner's study produces the opposite results (see Raymond and Hanushek 2003).
2 Using school-level data rather than state-level data will greatly reduce the extent to which correlations between test results are artificially inflated because of the differences in population characteristics that are masked by lumping together all the students in each state. Individual-level data would allow for an even more precise analysis, but unfortunately they are not available.
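The point about aggregation can be illustrated with a small simulation (the data below are entirely made up; the sketch only shows why averaging up to the state level inflates correlations when between-state population differences dominate within-state, school-level variation):

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n_states, schools_per_state = 50, 40

school_a, school_b = [], []            # school-level scores on two tests
means_a, means_b = [], []              # state-level averages
for _ in range(n_states):
    base = random.gauss(0, 10)         # between-state population differences
    a = [base + random.gauss(0, 5) for _ in range(schools_per_state)]
    b = [base + random.gauss(0, 5) for _ in range(schools_per_state)]
    school_a += a
    school_b += b
    means_a.append(mean(a))
    means_b.append(mean(b))

school_r = pearson(school_a, school_b)
state_r = pearson(means_a, means_b)
# Averaging to the state level washes out school-level noise, so the
# state-level correlation comes out higher than the school-level one.
print(f"school-level r = {school_r:.2f}, state-level r = {state_r:.2f}")
```

In this toy setup the two tests share only the state-level component, yet the correlation of state means approaches 1, while the school-level correlation is noticeably lower.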
3 A number of states and school districts administer a standardized test in addition to the state criterion-referenced test, but many of those standardized tests had high stakes attached to the results. For example, Houston and Dallas in Texas, as well as Arizona and California, administered multiple tests to their students, but all of the tests had high stakes. We could not include those states or school districts in our sample.
4 Because school-level test scores are public information and are usually covered under state freedom of information laws, we might have expected obtaining the scores to be relatively easy. Unfortunately, we encountered numerous delays and refusals from school officials. Some school districts were very helpful and provided us with the necessary data. Others, however, were less helpful and in some cases downright hostile. The Maynard, Massachusetts, school district, for instance, refused to give us the data. We spoke directly to the district's Assistant Superintendent, who said she was in charge of testing. She informed us that she would not release the test score information because she was philosophically opposed to our study. We are unaware of how her philosophical opposition trumps public information laws, but since we had neither the time nor the resources to pursue the matter in the courts, she was successful in denying us the district's test score information. The Maynard case was by far the most blatant obstruction we faced in attempting to obtain the necessary test scores, but some other districts were reluctant to provide the information until we informed them that they were legally required to do so. We found this rather disturbing considering that public schools claim transparency as one of their greatest virtues. In performing this study, at least, we certainly did not find public schools to be transparent.
5 Because the same test is used as both a high-stakes test and a low-stakes test in Chicago, our findings there cannot address whether teaching to the test is occurring. ITBS is generally agreed to be a good measurement of student learning, in which case teaching to the ITBS would be desirable, but we cannot prove this is the case by comparing one administration of the ITBS to another. However, our results will indicate whether the high-stakes administrations of the exam are distorted by outright cheating (e.g. giving out answers or tampering with student answer sheets).
6 Our method can be illustrated using Virginia's administration of the high-stakes SOL and the low-stakes SAT-9 elementary math tests in 2000 as an example. In that year, Virginia gave the SOL to students in the 3rd and 5th grades, and gave the SAT-9 to 4th graders. We averaged the 3rd and 5th grade scores on the SOL to get a single school score on that test.
We next standardized the scores on each of the tests. The SOL was reported as mean scaled scores and the SAT-9 scores were reported as mean percentiles. We calculated both the average school score and the standard deviation for each test administration. On the SOL the average school mean scaled score was 431.93 and the standard deviation was 39.31. On the SAT-9 the average school percentile was 57.93 and the standard deviation was 15.24. For each school we subtracted the average school score on the test from that individual school's score and divided the result by the standard deviation. So for Chincoteague Elementary School, which posted a percentile score of 60 on the SAT-9, the calculation was:
(60 - 57.93) / 15.24 = .14
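The standardization step can be sketched in code (an illustrative helper, not the authors' actual procedure; the mean and standard deviation are the SAT-9 figures from the example):

```python
def standardize(score, group_mean, group_sd):
    """Convert a school's raw score to a standard (z) score."""
    return (score - group_mean) / group_sd

# SAT-9 figures from the example: average school percentile and SD.
sat9_mean, sat9_sd = 57.93, 15.24

# Chincoteague Elementary posted a percentile score of 60.
z = standardize(60, sat9_mean, sat9_sd)
print(round(z, 2))  # 0.14, matching the worked example
```

Standardizing both tests this way puts the SOL's scaled scores and the SAT-9's percentiles on a common scale, which is what makes the correlation comparison meaningful.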
After standardizing scores for every school in the state on each of the two test administrations in question (SAT-9 4th grade math, 2000, and SOL elementary average math, 2000), we then correlated the standardized scores on the two tests. In this instance we found a correlation of .80. This high correlation leads us to conclude that in this case the stakes of the tests had no effect on their results.
We then found and correlated the gain scores for each test. Continuing with our example, we subtracted the standardized scores on the 1999 administration of each test from the standardized scores on the 2000 administration to find the gain or loss the school made on that test over the year. In our example school, this meant a .01 standard score gain on the SAT-9 and a .10 standard score gain on the SOL. We calculated the gain scores for every school in the state and correlated the results. In this example we found a correlation of .34, a moderate correlation between the two tests.
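The gain-score step can be sketched as follows (the five school scores below are hypothetical; only the method follows the footnote):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical standardized scores for five schools (the study used
# every school in the state) in 1999 and 2000 on each test.
sol_1999  = [0.50, -0.20, 1.10, -0.80, 0.30]
sol_2000  = [0.60, -0.10, 1.00, -0.90, 0.55]
sat9_1999 = [0.40, -0.30, 1.20, -0.70, 0.20]
sat9_2000 = [0.41, -0.25, 1.15, -0.85, 0.45]

# Gain = this year's standardized score minus last year's.
sol_gain  = [b - a for a, b in zip(sol_1999, sol_2000)]
sat9_gain = [b - a for a, b in zip(sat9_1999, sat9_2000)]

print(f"gain-score correlation: {pearson(sol_gain, sat9_gain):.2f}")
```

Because the gains difference out each school's level, this correlation isolates year-to-year progress rather than where schools start.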
Next we pooled the standardized scores across grade levels, while keeping them separated by year and subject, and correlated the results. In our example this meant combining all 2000 administrations of the SAT-9 math test (elementary, middle, and high school scores), doing the same for the 2000 SOL math test, and correlating the results. In this example we found a high correlation of .77. We then repeated this pooling and correlating for the difference scores. In our example we found that the difference between the 2000 and 1999 standardized scores on the SOL in all grades correlated with the difference between the 2000 and 1999 standardized scores on the SAT-9 in all grades at a level of .29, a moderate correlation.
7 There is one distortion that might be caused by the incentives created by the high stakes of high-stakes tests that this method cannot detect: if school systems are excluding low-performing students from the testing pool altogether, such as by labeling them as disabled or non-English speaking, a high correlation between scores on high- and low-stakes tests would not reveal it. However, the research that has been done so far on exclusion from high-stakes testing gives us no good reason to believe that this is occurring to a significant extent. Most studies of this phenomenon are methodologically suspect, and those that are not have found no significant relationship between high-stakes testing and testing exclusion (for a full discussion, see Greene and Forster 2002).
8 It is generally accepted to consider correlations between .75 and 1 strong, correlations between .25 and .75 moderate, and correlations between 0 and .25 weak (Mason et al., 1999).
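These cutoffs can be expressed as a small helper (a sketch of the cited convention, not code from the study):

```python
def describe_correlation(r):
    """Label |r| using the rough cutoffs cited from Mason et al. (1999)."""
    strength = abs(r)
    if strength >= 0.75:
        label = "strong"
    elif strength >= 0.25:
        label = "moderate"
    else:
        label = "weak"
    return label + (" negative" if r < 0 else "")

print(describe_correlation(0.91))   # strong
print(describe_correlation(-0.56))  # moderate negative
print(describe_correlation(0.15))   # weak
```

The labels line up with the body of the paper, where 0.77 is called strong, 0.27 moderate, and 0.15 weak.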
Amrein, Audrey L. and David C. Berliner, High-Stakes Testing, Uncertainty, and Student Learning, Education Policy Analysis Archives, Volume 10, Number 18, March 28, 2002. Available from http://epaa.asu.edu/epaa/v10n18/.
Barnes, Christopher and Chester E. Finn, What Do Teachers Teach? A Survey of America's Fourth and Eighth Grade Teachers, September 2002.
Cizek, Gregory J., Cheating to the Test, Education Matters, Volume 1, Number 1, Spring 2001. Available from http://educationnext.org/2001sp/40.html.
Dewan, Shaila, The Fix Is In. Are educators cheating on TAAS? Is anyone going to stop them?, The Houston Press, February 25, 1999. Available from http://www.houstonpress.com/issues/1999-02-25/feature.html/1/index.html.
Education Week's Quality Counts 2002. Available from http://www.edweek.org/sreports/qc02/
Figlio, David N. and Maurice E. Lucas, Do High Grading Standards Affect Student Performance? December 2001.
Greene, Jay P., The Business Model, Education Next, Summer 2002.
Greene, Jay P., The Looming Shadow: Can the Threat of Vouchers Persuade a Public School to Turn Itself Around? The Case of Florida Suggests Yes, Education Next, Winter 2001. Available from http://educationnext.org/20014/76.html
Greene, Jay P., An Evaluation of the Florida A-Plus Accountability and School Choice Program. Florida State University, Manhattan Institute, and Harvard Program on Education Policy and Governance, February 2001. Available from http://www.manhattan-institute.org/html/cr_aplus.htm
Greene, Jay P., The Texas School Miracle is for Real, City Journal, Summer 2000. Available from http://www.cityjournal.org/html/10_3_the_texas_school.html
Greene, Jay P. and Greg Forster, Burning High Stakes Tests at the Stake, The Education Gadfly, Volume 3, Number 1, January 8, 2003. Available from http://www.edexcellence.net/gadfly/
Greene, Jay P. and Greg Forster, Effects of Funding Incentives on Special Education Enrollment. Manhattan Institute, December 2002. Available from http://www.manhattan-institute.org/html/cr_32.htm
Greene, Jay P. and Marcus A. Winters, When Schools Compete: The Effects of Vouchers on Florida Public School Achievement. Manhattan Institute, August 2003.
Grissmer, David W., Ann Flanagan, Jennifer Kawata and Stephanie Williamson, Improving Student Achievement: What State NAEP Test Scores Tell Us. Rand Report, July 25, 2000. Available from http://www.rand.org/publications/MR/MR924/
Haney, Walt, The Myth of the Texas Miracle in Education, Education Policy Analysis Archives, Volume 8, Number 41, August 19, 2000. Available from http://epaa.asu.edu/epaa/v8n41/index.html.
Haney, Walt, Clarke Fowler, Anne Wheelock, Damian Bebell and Nicole Malec, Less Truth Than Error? An independent study of the Massachusetts Teacher Tests, Education Policy Analysis Archives, Volume 7, Number 4, February 11, 1999. Available from http://epaa.asu.edu/epaa/v7n4/
Hanushek, Eric A., Deconstructing RAND, Education Matters, Volume 1, Number 1, Spring 2001. Available from http://educationnext.org/2001sp/65.html
Hoff, David J., N.Y.C. Probe Levels Test-Cheating Charges, Education Week, December 15, 1999. Available from http://www.edweek.org/ew/ewstory.cfm?slug=16cheat.h19
Klein, Stephen P., Laura S. Hamilton, Daniel F. McCaffrey and Brian M. Stecher, What Do Test Scores in Texas Tell Us? Rand Report, October 24, 2000. Available from http:// www.rand.org/publications/IP/IP202/
Koretz, Daniel M. and Sheila I. Barron, The Validity of Gains in Scores on the Kentucky Instructional Results Information System (KIRIS). Rand Report, 1998. Available from http://www.rand.org/publications/MR/MR1014/#contents
Lawton, Millicent, Alleged Tampering Underscores Pitfalls of Testing, Education Week, November 13, 1996. Available from http://www.edweek.org/ew/vol-16/11cheat.h16
Mason, Robert D. et al., Statistical Techniques in Business and Economics, 10th Edition, McGraw-Hill, 1999.
McNeil, Linda and Angela Valenzuela, The Harmful Impact Of The TAAS System Of Testing In Texas: Beneath The Accountability Rhetoric. Harvard Civil Rights Project, January 6, 2000. Available from http://www.law.harvard.edu/civilrights/conferences/testing98/drafts/mcneil_valenzuela.html
Phelps, Richard P., Test Bashing Series. EducationNews.org, 2000. Available from http://www.educationnews.org/test_bashing_series_by_richard_p.htm
Raymond, Margaret E. and Eric A. Hanushek, High-Stakes Research, Education Next, Volume 3, Number 3, Summer 2003. Available from http://www.educationnext.org/20033/48.html