Measuring Up: What Educational Testing Really Tells Us
reviewed by Hong Qian - April 14, 2009
Title: Measuring Up: What Educational Testing Really Tells Us
Author(s): Daniel Koretz
Publisher: Harvard University Press, Cambridge
ISBN: 0674028058, Pages: 368, Year: 2008
With the implementation of the No Child Left Behind Act (NCLB), educational testing is becoming universal and important in American education. However, because of their complexity, test scores are commonly misunderstood and misused by policy makers, educators, and parents. Since achievement tests have great influence on students, teachers, and schools, those mistakes can lead to severe consequences. Measuring Up: What Educational Testing Really Tells Us, by Daniel Koretz, attempts to help lay people recognize what educational testing can and cannot tell us, understand the limitations inherent in testing, interpret test scores sensibly, and use testing reasonably.
The book is organized into 13 chapters that address why people should be careful when they interpret test scores and why it is not reasonable to depend solely on test results to identify good and bad schools. Koretz starts the discussion with a simple question: What is a test? He points out that, contrary to the common expectation that test scores provide a straightforward and complete measure of educational achievement, tests are by nature incomplete in two ways. One is that tests can assess only a part of the various educational goals that are pursued by schools. The second is that even in measuring the goals that are amenable to testing, tests are unavoidably very small samples of behavior that we use to make estimates of students' mastery of very large domains of knowledge and skill (p. 9). According to Koretz, people often forget that tests are incomplete and indirect measures, and this is the origin of many misunderstandings.
Even though many newspapers and policy makers present claims about educational testing with great certainty and self-assurance, Koretz warns us that this certainty is not warranted. To help readers understand the complexities and uncertainty of achievement testing, he illustrates core principles and concepts essential to testing, including reliability, scaling, and validity.
In Chapter 7, Koretz describes two kinds of error inherent in a test that undermine its reliability: measurement error and sampling error. Since the items on a test are only a small sample of what is important, different selections of items can yield variation in performance. That is why a student will get different scores on the SAT if he or she takes it several times, even when other conditions are the same. Such measurement error means that it can be highly risky to judge a student's performance with a single test. Sampling error is another kind of inconsistency, resulting not from the selection of test items but from selection of particular people from which one takes a measurement (p. 164). For example, it is not safe to conclude that a school is improving based on increased scores on a standardized test, because it is possible that the school simply has a higher-scoring student population this year. This is especially true when it comes to judging performance improvement for minority students, who often appear in small samples, so that their group averages are vulnerable to a few extreme-scoring students.
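The point about small samples can be illustrated with a short simulation. The sketch below is not taken from the book; it assumes a hypothetical score scale (population mean 500, standard deviation 100) and hypothetical group sizes of 10 and 200 students. Every simulated cohort is drawn from the same population, so any year-to-year difference in a cohort's average is sampling error alone.

```python
import random
import statistics


def spread_of_cohort_means(cohort_size, n_cohorts=2000, seed=0):
    """Return the standard deviation of average scores across simulated cohorts.

    All numbers here are illustrative assumptions, not figures from Koretz.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_cohorts):
        # Each cohort comes from the same population (mean 500, SD 100),
        # so the quality of instruction is identical by construction.
        scores = [rng.gauss(500, 100) for _ in range(cohort_size)]
        means.append(statistics.mean(scores))
    return statistics.stdev(means)


small_group = spread_of_cohort_means(cohort_size=10)    # e.g. a small subgroup
whole_grade = spread_of_cohort_means(cohort_size=200)   # e.g. a full grade

print(f"Spread of averages, 10 students:  {small_group:.1f}")
print(f"Spread of averages, 200 students: {whole_grade:.1f}")
```

Under these assumptions the average for the group of 10 swings several times more from cohort to cohort than the average for the group of 200, even though nothing about the school has changed. That is the sense in which gains or losses for small subgroups are hard to interpret.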
Koretz also reminds us to be careful when we encounter standards-based reporting of student achievement, that is, reports of how many students reach levels like basic, proficient, and advanced. Even though this type of achievement measurement is now popular in the United States, it can obscure a lot of useful information, such as variations in student performance, because information about differences among students within any one of those ranges does not register (p. 194). Standards-based reporting can also be misleading because standard setting is less scientific than most people believe and varies significantly among states. Therefore, any comparisons among states are hindered, since "proficient" may mean different things in different states.
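A small worked example may make both concerns concrete. The scores, the two schools, and the cut scores of 240 and 250 below are invented for illustration and do not come from the book or from any real state standard.

```python
import statistics


def percent_proficient(scores, cut_score):
    """Share of students scoring at or above the cut score."""
    return sum(score >= cut_score for score in scores) / len(scores)


# Hypothetical scores for two schools (illustrative only).
school_a = [241, 242, 243, 200, 205]
school_b = [290, 295, 300, 238, 239]

# With a hypothetical cut score of 240, both schools report the same rate,
# although their average scores differ widely.
rate_a = percent_proficient(school_a, 240)
rate_b = percent_proficient(school_b, 240)
mean_a = statistics.mean(school_a)
mean_b = statistics.mean(school_b)
print(f"School A: {rate_a:.0%} proficient, mean {mean_a:.1f}")
print(f"School B: {rate_b:.0%} proficient, mean {mean_b:.1f}")

# A different, equally hypothetical cut score of 250 changes the picture
# for School A without any change in what its students know.
rate_a_strict = percent_proficient(school_a, 250)
print(f"School A under a cut score of 250: {rate_a_strict:.0%} proficient")
```

In this sketch the two schools look identical in a percent-proficient report, because differences within each performance range do not register, and School A's apparent standing depends entirely on where the line is drawn.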
The third important concept Koretz emphasizes is validity, which asks how well a test measures what we want to measure. Two factors undermine validity: failing to measure adequately what ought to be measured and measuring something that shouldn't be measured (p. 220). Because tests are man-made and cannot be perfect, no test can eliminate these two factors entirely.
According to Koretz, the uncertainty and imprecision of inferences from achievement tests come not only from the limitations inherent in testing mentioned above, but also from factors outside testing itself. In Chapter 6, Koretz sets the question "What influences test scores?" in a larger context. He argues that a great many things other than the quality of schools influence educational achievement and the impact of these noneducational factors can be huge (p. 114). Test scores may tell us what students know or do not know, but they cannot tell us why. Koretz proposes that before drawing a conclusion about the quality of a school from test scores, both educational and noneducational factors should be considered.
Koretz argues that another, more formidable threat to correct inference from achievement tests is score inflation, which is the dirty secret of high-stakes testing (p. 235). His argument is based on a handful of studies, including two of his own. Studies consistently show that gains on high-stakes tests are much larger than those on low-stakes tests. When teachers teach to the test, the resulting inflated scores render one core principle of testing no longer true: the measure based on the samples no longer represents the larger whole (p. 241). Some defenders of teaching to the test may argue that if the test covers the right content, then it is good that people are teaching to it because students will learn important things. However, Koretz believes that because tests are incomplete by nature (for example, not all desirable educational goals are suitable for testing), students will be deprived of learning other important things if teachers teach only content that will appear on the tests. Inflated scores could also distort readers' comparisons of schools because different schools may have different levels of test preparation. Hence, gains in scores may not represent meaningful gains in student achievement. Koretz maintains that score inflation cannot be easily fixed because teachers have incentives to raise students' scores on high-stakes tests.
When it comes to testing students with special needs (those with disabilities or with limited proficiency in English), the picture is even more complex and uncertain. Federal laws require students with special needs to be assessed in the same way as other students, with the good intention of holding educators accountable for these students. However, Koretz believes that the difficulties inherent in testing these groups appropriately are daunting (p. 282). The need for appropriate accommodation poses a challenging problem when testing students with special needs because we often don't know which accommodations will offset the bias caused by the disability without giving the student an unfair advantage (p. 290). For example, it is easy to provide a blind student with a Braille version of a test to offset the disadvantages caused by visual disabilities, but for a student with a learning disability, it is more difficult to find an appropriate accommodation. If there are no reasonable adaptations or accommodations, the test scores of these students cannot provide us with a meaningful basis for interpreting their achievement.
Even though Koretz illustrates the limitations of testing, he is not among the anti-testing crowd. Rather, he repeatedly makes the point that the limitations of testing do not render educational testing useless. He believes educational testing is analogous to a powerful medication. If used carefully, it can be a very powerful tool for changing education for the better; if used indiscriminately, it poses a risk of various and severe side effects (p. 332). Such a position provides him a safe place to discuss educational testing, avoiding intense and unnecessary controversy between pro-testing and anti-testing positions.
Koretz challenges common expectations about educational testing and offers easily understood and informative explanations of these issues. From the perspective of an expert in testing, he provides helpful knowledge for policy makers, educators, and parents who need to interpret test scores carefully and use tests reasonably. Even though the book is based primarily on Koretz's own experience and argument rather than research studies, it can be a good introductory lesson for all people involved in American education.