U.S. Mathematics and Science Achievement: How Are We Doing?
by George W. Bohrnstedt - 1997
One of two articles on academic achievement, this article discusses the performance of U.S. students on mathematics and science achievement tests compared to students in other countries. Recent data show they are performing better. This commentary examines the importance of meeting certain research conditions before drawing conclusions about achievement trends. (Source: ERIC)
Since the appearance of A Nation at Risk (U.S. Department of Education, 1983), policymakers, practitioners, and parents have all agreed on the need to improve American elementary and secondary education. Data from most international studies show our nation's children performing poorly on tests of academic achievement compared with students of the same age in some other nations. Yet, as the editors of this special feature section point out, data from the National Assessment of Educational Progress (NAEP) suggest that U.S. students are performing better than they did at the time when A Nation at Risk was written (U.S. Department of Education, 1997). Is there a paradox? Before answering this question, it is instructive to ask what conditions must be met for one to draw firm conclusions about the validity of trends in academic achievement.
Common content frameworks across nations and across time. If one is to make valid comparisons either across nations or over time within the same nation, it is necessary to compare "apples with apples" and "oranges with oranges." If one is assessing mathematics and the framework for one assessment calls for measuring the full range of mathematical curricular topics based on the National Council of Teachers of Mathematics (NCTM) standards while another focuses only on numbers and operations, any comparison of results based on the two frameworks will necessarily be flawed.
Items that validly measure the various subareas and themes contained in the frameworks. Valid comparisons depend not only on common content frameworks across nations and across time, but also on items that measure the various subareas and themes captured by the frameworks. If the items in one assessment are balanced so as to measure all the themes and subareas in its framework, including both closed- and open-ended questions, but the other assessment measures only those themes or subareas that can be captured with multiple-choice items, comparisons of results across assessments are again problematic.
Identical rules for deciding who will be assessed and who will be excluded from assessment. If the administrators of one assessment have and apply strict criteria for excluding students (e.g., because of a serious mental or physical disability) but another assessment does not provide its administrators with a clear set of exclusion guidelines, comparisons across the two assessments may be invalid. Or, in the case of international comparisons, if strict guidelines are not given about which students are to be sampled for the assessment, one runs the risk of overrepresenting some types of students (e.g., by social class or language group) and underrepresenting others. Closely related to this is the necessity of drawing random samples of students once a set of rules for defining the populations to be sampled has been established.
Comparison with the same populations across time. If we want to make reliable comparisons across time, it is important that the populations compared be defined in the same way. It would be impossible to judge U.S. fourth graders' reading achievement over time if the population of students being sampled were defined differently each year. Similarly, it would be difficult to draw firm inferences about fourth graders' reading performance across time if different nations are included in the studies over time.
Other criteria for comparison might be mentioned (e.g., the meaning of the items must be identical over time or across nations, the range of achievement assessed should be the same, the reliabilities of the tests should be the same, etc.), but the ones I have listed allow me to make the point I would like to make.
How Are Our Students Doing?
When we examine the various international assessments and apply the criteria discussed above, we quickly see that few of them are met. The frameworks and items for the First International Mathematics Study (circa 1970) are different from those of the International Assessment of Educational Progress (IAEP), which in turn differ from those used in the Third International Mathematics and Science Study (TIMSS). Furthermore, the countries included in the various studies done at various times differ, as do the populations sampled from those countries. In spite of these weaknesses, the data strongly suggest that our thirteen-year-olds are not as proficient in mathematics and science as students in many of the nations with whom we compete economically. Furthermore, our performance seems to be worse in mathematics than in science.
But not all results using international comparisons are as clear. Results from the 1994 NAEP Reading Assessment sent reading educators in the United States into shock when we learned that 40 percent of our fourth graders scored at or below the basic level. Given the NAEP results, most of us would have guessed that our fourth graders would perform terribly by international standards on an international reading assessment. Not so, as it turns out. The international study of reading literacy showed that U.S. nine-year-olds read better than students from any of the participating OECD countries! My point here is not to debate which of these conclusions about the reading ability of our fourth graders is the correct one; rather, I wish to point out how difficult it can be to draw firm inferences based on international comparisons when studies use different content frameworks and items and assess different subareas of an academic subject.
The long-term and short-term NAEP trend results suggest that our students are doing better in mathematics and science now than they were doing at the time A Nation at Risk was written. Can these results be trusted? I think the answer is yes. The long-term NAEP frameworks, items, and population definitions have remained the same over time. The short-term comparisons (e.g., those beginning in 1990 or after) also appear to be valid. Although they use a different framework than is used for the long-term trend results, the 1990 to 1994 comparisons are based on assessments using the same framework. In addition, the inclusion rules about students have remained constant.
Given this encouraging news, can we also conclude that our thirteen-year-olds are improving relative to thirteen-year-olds in other countries? Unfortunately, we cannot. Because few of the comparison criteria listed above are met, it is impossible to draw firm conclusions about whether our performance in mathematics and science has improved over time relative to most of the other countries. There appears to be at least one exception, however. In all of the mathematics assessments for thirteen-year-olds in which Japan has participated, Japanese students have scored at or near the top. This is remarkable given the differences in the frameworks, countries compared, sampling, and so on, from assessment to assessment. I think it is safe to conclude that the mathematics performance of our thirteen-year-olds has not improved relative to the performance of Japan's thirteen-year-olds over the history of international assessments in mathematics.
In summary, while I think it is safe to conclude that the mathematics and science achievement of our thirteen-year-olds has improved since the appearance of A Nation at Risk, the available data simply are not good enough to draw firm conclusions as to whether we have improved relative to students from other nations.
Some Concluding Thoughts
It is important for us as a nation to use international benchmarks as a way to judge the performance of our students. But to do so meaningfully requires that we improve the bases for such comparisons. Al Beaton (TIMSS Director) is credited with saying: "If you want to measure change, don't change the measure." Based on the conclusions from my analysis, Beaton's dictum might be modified to state: "If you want to make comparisons, don't change the bases for making comparisons." More specifically, don't change the measures, the content frameworks, the inclusion criteria, or the populations used for comparison.
TIMSS is a carefully constructed assessment that has provided us with useful information about how the nation's fourth and eighth graders are performing relative to those of other nations. If we can get agreement that future international assessments will use the TIMSS frameworks, the same or parallel TIMSS assessment items, and the same sampling frame, and if the same nations agree to participate, we will in the future be able to say with some degree of confidence whether the academic performance of our students is improving relative to that of other nations. Being able to link results from international assessments, such as TIMSS, to NAEP, which is being explored as this is being written, will also allow us to determine whether trends noted for NAEP are replicated in international assessments. With these kinds of data, the potential for paradoxes in the future will have been minimized.
References

Elley, W. B. (1992). How in the world do students read? The Netherlands: International Association for the Evaluation of Educational Achievement.

U.S. Department of Education. (1983). A nation at risk: The imperative for educational reform. Washington, DC: Author.

U.S. Department of Education, National Center for Education Statistics. (1997). The condition of education 1997. Washington, DC: Author.