The Many Facets of PISA
by David C. Berliner - 2015
This article discusses four facets of the PISA program: (a) the issue of the comparability of the cognitions elicited by items across national and linguistic cultures, (b) the association of PISA with economic outcomes for nations, (c) the search in PISA data for universally applicable instructional techniques, and (d) the differences in cross-national attitudes toward the PISA subjects and how those affect test scores.
Reading these articles I was reminded of the blind men and the elephant. If you remember, each of those gentlemen described a bit of the elephant, but getting to understand the whole was impossible for them. While our authors see better and more clearly than those in the parable, PISA is quite like the elephant. There are so many facets to this enormous and ambitious program, and each facet could be the object of study for years. And even were those many studies of the various facets of PISA competently completed, we are still not likely to have a good grip on the whole.
In this collection of papers I identify five facets of ongoing research into the PISA program. They do not come close to constituting an understanding of the whole, any more than a description of the ears and trunk of the elephant helps in describing that beast. But both the blind men of the parable and the keen-sighted authors of these papers quite accurately describe their piece of the whole. These are the kinds of understandings that are needed for our knowledge base if we hope to keep moving closer to better understanding the PISA program in its full complexity, if, indeed, that were ever possible.
We have as one facet the interpretation of PISA scores, examined by Ercikan, Roth, and Asil (hereafter E, R, & A). I will address this paper last because, in some way, PISA cannot be interpreted sensibly without understanding all the problems and achievements associated with the test's development, its use, and its correlates. These are facets of PISA discussed in the other papers, as all these authors try to understand their piece of this enormously complicated program. I identified a second facet of the PISA program, reflected in the papers dealing with item development and interpretation. These issues are addressed in two papers. One is by Solano-Flores and Wang (hereafter S-F & W), and the other by Ruiz-Primo and Li (hereafter R-P & L). A third facet is related to economics and test scores within and across countries. This complex area is addressed by M. M. Chiu and could be studied fruitfully and perhaps endlessly, as societies change, negating the insights of one decade because of changes in national policies and practices in another decade. The fourth facet I identify is about the cognitive processes needed for achieving well on these international assessments, as discussed in the paper by Artelt and Schneider (hereafter A & S). This paper is one of the many that continuously search PISA results for sound and universal instructional advice that could be given to nations so that their students might perform better on PISA tests. This has always been a major goal of PISA interpreters, but like the holy grail, non-trivial universal advice about how to increase a nation's scores always seems to be sought, but never found. The fifth facet is about attitudes held about the subject matter being assessed, and the relationship of those attitudes to PISA scores. This facet is addressed by Lim, Bong, and Woo (hereafter L, B, & W).
PISA AND THE ITEM FACET
S-F & W, through amazingly detailed and highly logical analysis of PISA science items, teach us that national curriculum and test congruence matters. In their analysis the particular element of test items examined was the illustrations that accompanied PISA science items. S-F & W found many more ways to code these illustrations than I had thought possible. More importantly, they found that characteristics of the items' illustrations significantly influenced national scores. For example, American and Mexican students were found not to do as well on items with representations as did Chinese students (really, Shanghai students). It seems likely that the cause of these differences is that American and Mexican teachers see science as more like factual knowledge to be taught in ways that allow for answers on multiple choice science tests, while Chinese teachers appear to see science as requiring illustrations that provide alternative representations of scientific concepts. Under these circumstances, these countries are expected to differ on how they perform on a test that uses various representations of science concepts in its item formats.
This is not, however, a new insight. It is only a replication at the international level of the well-known fact that in state testing the overlap between the curriculum taught and the curriculum assessed is a strong determiner of a teacher's (or a school's) test score. Separate from the effects on PISA scores, teaching science through multiple illustrations or representations versus teaching science through a heavy reliance on formula or verbal descriptions appears to be a good idea. Most scientists would agree that this should be done, even if national differences in achievement on these kinds of items were not so evident. So S-F & W sensibly suggest that Mexico and the United States change to new curriculum and methods of instruction, new ways that might be better for teaching science and for increasing PISA test scores.
But suggestions for improvement of a nation's schools, derived from PISA test data, are always likely to need qualification. In this case, educational and psychological literature exists documenting how illustrations can sometimes hinder learning during instruction. The research also demonstrates that sometimes illustrations can confuse test takers (Hannus & Hyönä, 1999; Torcasio & Sweller, 2009). As yet, we are not smart enough to always know when that will be the case. So there is room for much work around this facet, and S-F & W have a lead on the world in figuring this out. It is certainly worthwhile work.
Another piece of this facet concerned with item type is explored by R-P & L. These diligent investigators also engage in logical, micro-analytic work, as they explore the contexts used in PISA items. The context for one series of test items could be a household situation. Another set of items might be related to the description of a workplace. The fundamental question asked by R-P & L is whether these uses of context help or hinder some types of students, and some nations, but not others. On the surface it seems as if providing context for test items should make items easier, helping students to perform better. But might context, instead, add content irrelevant variance to our distribution of scores, because the contexts are unfamiliar to some students or distracting to others? That is, are poor children distracted by the embedding of some PISA test items in a professional workplace with which they have little familiarity? Is the household described as the setting for a set of items going to be responded to the same way in rural France, the deep south of the United States, Norway, and New York City? From their meticulous research R-P & L raise doubts about the use of context in PISA item construction, and so does everyone I know to whom I asked this question. But this is a common attribute of PISA items.
It is hard to have tests work the same way in different countries. Maybe it is impossible. Perhaps impediments to interpretation are small when such items are embedded in contexts, as supporters of PISA would claim; but perhaps not. Years ago we learned that small changes in items, changes that seem unremarkable to item developers, can have enormous effects on the percent of students who get an item right.
For example, Professor L. Shepard of Colorado (1988) reported that, in New Jersey, in the 1970s, whether students were tested in the addition of decimals with items laid out vertically or horizontally made an enormous difference in student and in school scores.
This kind of layout:
resulted in an 86% passing rate in the state. But the layout of the same problem horizontally, as
resulted in a passing rate of 46%. For subtraction of decimals the two different formats produced passing rates of 78% and 30%. Shepard cites another famous example. For an assessment, students were asked to write the equivalent Roman numeral for an Arabic numeral, and the Arabic number was given first. But if instead the Roman number were given first, and the Arabic number was what was sought as the right answer, the result was a 40% difference in test scores! Changes of these magnitudes, stemming from what appear to be very minor changes in item design, illustrate clearly the importance of the work done by R-P & L and by S-F & W for psychometric, linguistic, psychological, and sociological understanding of how test items function. Their work takes on additional significance as we try to interpret PISA scores derived from a multifaceted, cross-national assessment claiming to be a valid measure of achievement both within and across quite different countries, where populations possessing quite different cognitive systems for interpreting the world reside.
PISA AND THE ECONOMIC FACET
In examining a small piece of this very important facet M. M. Chiu tries to unravel why countries that have greater economic disparities generally have lower PISA scores. Chiu proposes something even meaner than the simple fact that poverty exists at a high rate in a rich country such as the United States. Chiu finds evidence that poverty in many nations is not merely a function of such things as a lack of jobs, a lack of a minimum wage, lack of unions, lack of sufficiently educated people for a modern economy, lack of health care, or other factors that might lead to high levels of poor people in a nation, and as a byproduct, great disparities in wealth. Chiu finds that within nations it is the behavior of the wealthy that contributes significantly to disadvantages in the community of poor people. Perhaps economists cannot use the words that a lay person like myself might use, but selfish, shameful, and callous are the words that come to my mind as Chiu convincingly argues that the wealthy seek not just to keep or add to their wealth: They seek advantages based on keeping other people poor and disadvantaged! This is quite disturbing.
The New York Times columnist and Nobel laureate Paul Krugman agrees with Chiu's conclusions. Krugman (2012) says the rich in the United States are really engaging in class warfare. He believes that they don't particularly want to destroy our government, even if that sometimes appears to be the case. Instead, they just want to ensure that they benefit from it! So members of the upper income group are often against social security, health care, food stamps, and any other program that adds to their taxes but not to their quality of life. Thus, they have successfully lobbied for the virtual removal of estate taxes, for large deductions for ownership of second homes, and for vouchers and tuition tax credits. In my state of Arizona, tuition tax credits that support wealthy people's children in private schools have so far removed from public funding about a half billion dollars. Last year alone tuition tax credits of about $70 million were granted, almost all of which subsidized private schooling for the wealthiest citizens of Arizona. Krugman believes that America's wealthy are simply against all forms of government that do not benefit them. That is why, as a group with powerful ties to legislators, the wealthy often are against striving for equality in public school resources. Krugman, a fellow economist, backs up Chiu's microeconomic research with his sociopolitical argument.
How does this play out in communities across the United States and elsewhere? Do the wealthy get the best teachers for their children? Apparently. Do the wealthy get schools that have the most resources for their children? Apparently. There are microeconomic mechanisms at work within countries that reduce educational resources for one group, or allocate resources less efficiently to various groups in society. This results in fewer learning opportunities and lower student academic achievement for the group least powerful in the sociopolitical arena.
Chiu introduced me to the term rent seeking: activities that do not increase the overall output of the schools, but do allow certain segments of the school population to thrive. Vouchers and tuition tax credits have that characteristic. About 14 U.S. states have these tuition tax credits in place and more states are proposing them. These policies have been lobbied for by the wealthy in the United States for a number of decades now, especially since inequalities in income rose in the United States after the 1980s. These programs provide extra money to the wealthy by subsidizing the schooling of their children, or they allow the wealthy to claim hefty tax credits for the tuitions they pay for their children. This, of course, reduces the amount of money available for the commons: the support of public school teachers, police, fire fighters, child welfare services, etc. Rent seeking is also about the wealthier parent complaining to the school principal, or school board, about the presence of some lower income students, or English language learners, or some special education students getting special attention. Perhaps one of these students is given some tutoring, or accommodations on tests, and this takes away a teacher's attention from that parent's own child. Wealthy parents lobby school personnel more often and more successfully than do poorer parents. They help to get their child moved to this teacher or that one; they lobby for this AP class or that one, and then lobby to have their child admitted to that class; or they lobby to have their child serve on the journalism club, or to be given special accommodations on a test, etc. These are all examples of how the wealthy and the politically powerful, much more than the poor and the politically weak, engage in rent seeking, and this occurs whether the children of the rent-seeking parents are in private or public school.
I found some findings of Chiu predictable, but one was not. Among the predictable findings is that students in countries with greater family inequality had lower mathematics scores than students in countries with less family inequality. The difference in income inequality between Finland and the United States, and their differences in achievement on PISA, is a well-cited example of this. Chiu, also predictably, found that students in countries with greater inequality of teacher quality across schools, or greater inequality of education materials across schools, had lower overall mathematics test scores than students in countries with more equal distributions of these human and nonhuman educational resources. Resources such as teacher degrees and training, or school facilities and textbooks, really do matter. Again, there is nothing surprising there. But the most interesting of Chiu's findings was not obvious to me. This was the fact that inequality of income and resources diminishes the scores of the rich and poor alike. The achievement test scores of the wealthiest children in a country reflect the degree of inequality that exists in that nation, and this is a separate effect from the harm that inequality does to a nation's poorest students. In fact, Chiu teaches us that inequality is a lose-lose game: On average, both groups do less well in achievement than if such inequalities were not present. Perhaps if the rich were more aware of how inequality hurts both their own children's scores and the scores achieved by their nation, they would be less inclined to engage in rent seeking and more inclined to strive for a society that is more egalitarian in its distribution of wealth and its opportunities for schooling.
PISA AND THE COGNITIVE PROCESSES FACET
While exploring this facet A & S try to test a very hard to test theory, namely, that knowledge of metacognitive processes (strategies that promote greater learning and memory) actually affects the use of those strategies for facilitating learning. And it is further hypothesized that the actual use of the strategies affects measures of reading competence. I view this position as similar to believing that a gambler who understands the odds of winning at a roulette or blackjack table will not gamble. Many who have knowledge of the odds of winning at these various activities act on that information. Many will not. Las Vegas is filled with the latter kind of individuals, often poorer than the rest of the city's population. Thought and actions congruent with those thoughts are usually correlated, but probably not nearly as strongly as people think.
Turning to the data analysis done by A & S we learn that possession of information about cognitive strategies that are capable of promoting a high score on the PISA 2009 reading test is, in fact, convincingly correlated with reading achievement. Correlations were quite uniform across countries, and they were strong. So, although correlational, this finding, along with the theory and research that surrounds this finding, suggests a universal instructional recommendation: Nations that wish to score higher on the PISA reading test ought to explicitly teach well-researched metacognitive instructional strategies.
But, as is often the case, we frequently fail in our diligent search for universal instructional principles, recommendations that might fit all nations, all students. This principle, derived from this study, is no different. First, there is nothing in this research to link metacognitive strategy knowledge to metacognitive strategy use. And without empirical verification of that link between knowledge of strategies and use of those strategies the recommendation to learn these strategies seems weak. Second, the analysis of the three subcategories of metacognitive strategies that were studied suggests that none of them is strongly related to reading competence, even though the correlation when these three categories were combined was strong and significant. This suggests it would be unwise to recommend instruction in metacognition to improve reading performance because we would not be sure which of the many types of strategies identified are worth teaching, even if we had evidence that the strategies were used. Further, the weak correlations of the subcategories of strategy to reading competence also suggest that the theory used to understand the ways metacognitive strategies work is not as robust as we might like. But the biggest problem in searching for implications in this study of cognition and PISA testing is recognized by the authors themselves. That is, it is as likely to be the case that sampled students who are high in measured reading achievement have higher metacognitive strategy knowledge because they are good readers. The causal flow might well go the other way, from reading competence to strategy knowledge, and not from strategy knowledge, through to strategy use, and then to reading competence. So universal instructional implications are not found in this study, despite a quite valiant search with a data set that the authors recognized had many limitations.
I also noticed that the PISA scores corresponding to four levels of metacognitive strategy knowledge and use, from lowest to highest, resembled PISA data about social class and test scores. Not discussed here, but worth following up, I think, are the correlations between metacognitive strategy knowledge and such social class indicators as books in the home, highest degrees earned by parents, class composition of the school that was sampled, and so forth. My instincts tell me that interesting relationships will be found, illuminating the roots of metacognitive strategies in interesting ways. The paper by Lim, Bong, and Woo, discussed next, explicitly explores that point and the likelihood that the causal flow goes from reading competence to knowledge of strategies, rather than the other way around.
I should also note that my caution in interpretation of these data and my thoughts about the direction of the causal flow would not stop my recommending that more emphasis be placed on teaching and using metacognitive strategies while teaching reading to students. Although some students may develop a number of common and helpful strategies out of their reading, I have no doubt that others read less well than they might had they discovered these strategies on their own. For those students it would seem wise to provide some coaching in the use of some well-researched strategies known to aid in comprehension.
In sum, it would be good if this facet concerned with cognitive processing when learning and when taking PISA examinations gets further attention. Attention might well be paid, as well, to the possibility that there are social class differences in the acquisition and use of these strategies.
PISA AND THE FACET DEALING WITH ATTITUDE ABOUT SUBJECT MATTER
In this investigation of Korean students' attitudes and reading achievement on PISA 2009 by L, B, & W, we find a remarkable correspondence with reading research in Europe and the United States. Among the apparent universal findings to be gleaned from this study is that a positive attitude toward reading predicts the amount, diversity, and complexity of reading behavior, as evidenced by strong and significant positive correlations between these variables. A negative attitude toward reading also predicts reading behavior, but the sign of the correlation with the amount, complexity, and diversity of reading is negative. Together they make a strong case for a strong relationship between attitudes toward subject matter and competence in that subject matter.
The implication of this seemingly simple and easily predicted finding is really quite profound. It appears likely that the best and most frequent readers come from households where positive attitudes toward reading are communicated covertly (say through modeling of book and newspaper reading), overtly (say through discussions of what has been read), and rewarded (the act of reading garners both attention and praise). In the United States we have many households where such practices known to facilitate reading do not take place. This is especially true in homes headed by single parents. In Canada, roughly one fourth of its children are raised in single parent households. In the United States, about one third of its children are raised in such households. But the United States rate moves up to about 7 in 10 for children of African American heritage. It seems to me that educating parents about supporting literacy at home is nearly as important as teaching the children of these parents to read. Using television as a babysitter is so convenient in our modern lives, enjoyed by the children, and it can provide relief for stressed single parents, but television does not foster traditional literacy of the type measured on PISA and valued by employers. Parent education may be a helpful policy recommendation.
The data provided by L, B, & W describe what could be thought of as a virtuous circle. It works like this: Parents' positive attitudes toward literacy result in children's positive attitudes toward literacy; this leads to higher levels of school achievement; this results in higher average adult income for those children; those children, as adults, then raise families that promote positive attitudes toward literacy in its many forms, and those parents possess the fiscal resources to support literacy-facilitating activities outside the home (tutors, debate teams, music lessons, trips to museums, etc.). It is quite possible that the well-documented hardening of the social class lines in the United States is related to the greater emphasis on acquiring higher levels of literacies in our society, literacies that are promoted in some families and left more to chance in others.
A second and not unexpected finding from this study is that girls in Korea, like their counterparts in the Western world, have both more positive attitudes toward reading and higher reading test scores than do boys. The social psychological implication of this finding across countries is yet to be dealt with thoughtfully. If entrance to higher education is based on merit, not quotas, women will be admitted at much higher rates than men. This is now occurring in the United States, where women are close to being 60% of the college-going undergraduate population. Back in 1970-1971, for example, women received less than 10% of all university business degrees. Now they receive about 50% of those degrees. Equal opportunity, paired with women's more positive attitudes toward literacy, which is accompanied by stronger literacy performance, means there is a much better chance that the higher paying occupations of the 21st century will be held by women. But the social systems of most countries are male dominated. Those social systems more often laud male breadwinners, elect more male politicians, and in many places, still adhere to property laws that favor men. Contemporary reality, however, is that the skills needed for the economies of the 21st century favor females, just as the skills needed for the economies of most nations in the 19th and early 20th century favored males. Cross-national PISA data about gender, academic achievement, and economic growth suggest that we live in times where substantial social change is likely to occur. Even the more traditional Asian cultures are likely to experience the kinds of changes now starting to occur in the Western cultures.
On another issue, it took some time for researchers to realize that the idea of Information and Communication Technology (ICT) needed to be studied, because programming and use of ICT is a form of literacy highly valued and frequently needed for functioning in the 21st century. It seems also to be taking time for the research community to realize that they need to extend their ideas about the ways that literacy is studied traditionally, to include the reading, writing, and communication skills required for using the smart phone, tablet, and computer. Thus I was pleased to see L, B, & W note that communication via modern technologies should be considered a distinct form of literacy that is quite worthy of research. It may seem a shame to many of us older adults that youth do not read as much as we did for fun, as a leisure activity, or for inquiry and intellectual stimulation. But that should never be confused with youths' possession of a lower level of literacy. Contemporary youth, Korean and Western alike, appear to be quite adept at communication in the forms they value for themselves, namely, communication associated with smart phones, tablets, and computers. Their writing and reading behavior strikes me as no less complex in emotional tone, ideas, metaphors, similes, creativity, and the like, than was true of the literacy activities of my generation. PISA needs to acknowledge this shift in what literacy looks like in the 21st century and assess it accordingly.
In sum, and generalizing, we might posit that attitudes toward mathematics, science, and reading, the three curriculum areas assessed by PISA, vary at least by gender, family, social class, and nation. We might also posit that these attitudes about subject matter areas substantially influence achievement in those areas for the two sexes, families, social classes, and nations. Ultimately, then, the best teachers in, say, elementary and middle school, may not always be those that get the highest test scores from their students, if their students simultaneously develop an aversion to a subject matter area. Relying on a fast pace, spending more time on test preparation and homework, and embedding those activities in a stressful classroom environment to get scores up, may well lead to negative attitudes toward the subject area tested. A teacher with lower test scores, who finds ways to develop mathematics, science, or reading skills while simultaneously fostering more positive feelings toward those subject matter areas, may ultimately be better for a nation. This is important to think about because of the spread of value-added models of teacher evaluation (VAMs) in the United States and elsewhere. Teachers and schools are forced to compete to get the highest scores on various tests, and it is primarily the test score that counts in the evaluations of either a teacher or a school. The intense pressure felt by teachers and schools to score high on many tests is, of course, communicated to the students. The students, then, often suffer from that same stress, and many are likely to dislike the subjects where their anxieties are highest. These data suggest that, for example, a 10% rise in enjoyment of mathematics and science might get us a larger group of students choosing mathematics and science as majors in college, than would a 10% gain in mathematics and science test scores, if those gains were undermining student enjoyment of those subject matters. 
In sum, these data suggest that student attitude toward subject matter is as legitimate an outcome of our education systems to worry about as is student performance on standardized tests.
PISA AND THE INTERPRETATION FACET
This is the fifth facet addressed in this set of papers. I think in many ways this paper by E, R, & A is the most important of them all, which is why I saved it for the last of my comments. This paper is all about interpretation. If ears and trunks and tails and feet are really well described, or in our case, if social class issues, strategy use, attitudes toward subject matter, and item development strategies were really well understood, would we then be able to accurately interpret what the whole really is like?
I think not. There is commonality between the parable of the blind men and their elephant and our current understanding of the PISA program. Interpreting the whole proves to be quite difficult. A part of the interpretive problem, understanding the system, is a straightforward validity issue. Validity is about the quality of the argument and the evidence you have to support your claims about the meaning of a test. The logic of the argument that would allow us to trust the scores on the PISA test as indicators of the quality of various national systems of education, and as guides for changing one's own system of education, is clearly articulated by E, R, & A. They say, first, that the scores on the tests across countries must be trustworthy enough to allow rank ordering those countries on a single scale. We have seen, above, that mean scores from all the schools sampled in a nation hide information, and thus national rankings based on those means distort the achievements of a nation's educational system. Apples and oranges are all combined to provide a score that hides as much as it provides.
Let's do a thought experiment. Suppose that only the public schools where less than 10% of the children are in poverty were used to judge a nation's system of education. That would have the potential of being a maximal test of the public schools, a test of their quality when things are likely to be working well in the schools, the families, and the communities served by those schools. It would not be an examination of the typical or overall performance of a nation's schools, many of which are trying to teach poor children from less well-educated families, in underresourced schools, and in neighborhoods that are not always wholesome. In the case of the United States this new score is likely to be close to other nations' scores, since PISA reveals that the public schools for the wealthy, across nations, are much more similar than are the public schools of the poor, across nations. These new rankings might very well tell us that American schools are performing quite well when school and family resources are at maximum. If that were the case, and it appears that it is, conclusions drawn from the scores would be different than those now drawn from using the entire sample of schools. In the case of assessing the schools with the least poverty we are less likely to think the curriculum needs changing, less likely to think that teacher quality is deficient, less likely to blame the teacher unions, or to blame any of the other favorite objects of discontent that are suggested to explain why the United States has a school system that is of purportedly poor quality. So first, E, R, & A tell us we need to know that the scores used for the international rankings are worthy of our trust as indicators of the quality of a nation's schools: not just trustworthy psychometrically, but logically, as well.
Second, E, R, & A ask if the scores obtained are sufficient indicators of the quality of the education systems across nations. Would love of learning or enjoyment of childhood differ by country? And should those be used as an outcome of schooling that is of worth, or do only achievement test scores matter? Does school dropout rate vary across countries, and if so, is that an indicator of school quality? Compared to other nations, does a small difference in the test scores of the wealthy and the poor matter? Compared to other nations, does a small variance in achievement across schools or within schools matter? For students less interested in the tested academic subjects, would the quality of the technical schools offered be a measure of the educational system's overall quality? Using a single indicator system such as test scores in three subject matter areas clearly limits our evaluations of a nation's educational system. Each nation's school critics and supporters need to think more deeply about this issue.
Third, say E, R, & A, to make a statement about the functioning of schools in a nation requires that we be sure there is homogeneity of performance across different student populations in that area. In general, gaps in opportunity and achievement for different social classes, ethnic groups, and regions, within countries, make comparisons of school system quality within nations and across nations quite difficult. We in the United States, for decades now, have been well aware of the educational gaps that exist between the poor and the wealthy. Now we find, as well, that the gap in educational outcomes is increasing between the middle class and the wealthy. We also have gaps in achievement between the south and the north and between most minorities and the majority. These gaps can be large and obvious, as in the 2006 PISA science test. On that test white U.S. students scored 523, well above the OECD average. But black students scored 409. What, then, can we say about American schools overall? Australia, on the 2003 PISA science test, found that poor students in schools that served poor communities scored 455. But wealthy students in schools that served wealthy communities scored 607. This is a score difference of about 1.5 standard deviations, a huge difference. How then should we judge the quality of Australia's schools? What does Australia's average PISA score and its corresponding rank tell us about the quality of education in Australia? These large within-nation differences are lost in rankings based on mean test scores. At best, then, PISA rankings provide only the crudest indicator of a nation's educational quality.
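The size of these gaps can be checked with a little arithmetic. The short sketch below converts the score differences quoted above into standard deviation units. It assumes the nominal PISA reporting scale, on which one standard deviation is roughly 100 points; the constant and the function name are illustrative choices, not part of any PISA software.

```python
# Assumption: the PISA reporting scale has a nominal standard deviation of 100 points.
PISA_SCALE_SD = 100.0


def gap_in_sd_units(higher_score, lower_score, sd=PISA_SCALE_SD):
    """Express the difference between two scale scores in standard deviation units."""
    return (higher_score - lower_score) / sd


# Scores quoted in the text above.
australia_gap = gap_in_sd_units(607, 455)  # wealthy vs. poor schools, 2003 science
us_gap = gap_in_sd_units(523, 409)         # white vs. black students, 2006 science

print(f"Australia gap: {australia_gap:.2f} SD")
print(f"United States gap: {us_gap:.2f} SD")
```

Under that assumption the Australian gap of 152 points is about 1.5 standard deviations, matching the figure given above, and the U.S. gap of 114 points is a little over one standard deviation.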
This paper by E, R, & A is congruent with much of my own experience. After years of examining PISA and other international test data it is still hard for me to believe that the scores we get on these tests are valid indicators of what goes on in a nation's classrooms, how good a nation's teachers are, how good a nation's curriculum and teacher training are, how good or bad the school or district or state or nation really is, and what the future holds for examinees, their state, or their nation. Put differently, I think that the remarkably high level of psychometric work on PISA, along with the face validity and cleverness of the items used in the assessments, has taken away from the more fundamental concern of any assessment program, namely, the validity arguments that every test must meet. Fortunately, E, R, & A bring these issues to our attention.
This facet concerned with our understanding of PISA also addresses the issue of consequential validity quite directly. The interpretation of my nation's or any other nation's test scores, and the actions taken as a result of knowing them, are simply not as clear as many make them out to be. There is the widespread, but fundamentally misleading, notion that one can borrow another nation's practices and make them fit into one's own system of education with exactly the same effects that they had in their original system. Thus, Andreas Schleicher, one of PISA's directors, has proposed that schools striving to be excellent should include little bits of Finland, Japan, England, Israel, Norway, Canada, Belgium, and Germany (Alexander, 2012). But over 100 years ago, as international comparisons of schools were being made, sans tests, a British educator of renown wrote this warning. It is as important today as when it was written, and ignored, back then:
In studying foreign systems of education we should not forget that the things outside the schools matter even more than the things inside the schools, and govern and interpret the things inside. . . . No other nation, by imitating a little bit of German organization, can thus hope to achieve a true reproduction of the spirit of German institutions. . . . All good and true education is an expression of national life and character. . . . The practical value of studying in a right spirit and with scholarly accuracy the working of foreign systems of education is that it will result in our being better fitted to study and understand our own. (Sadler, 1900).
So we can learn from PISA by engaging in conversations about the results and thinking about our own systems of education. But conversation is a lot different from what often happens in the frenzied reporting of PISA results that occurs every three years. After the triennial reporting of PISA results too many proclamations are made by too many people who know nothing of schools or the effects of culture on school life. Their ignorance, and their preference for proclamation rather than conversation, can distract, even hurt, a national system of education. As one example, in the United States the new rigorous Common Core State Standards are being forced on states to make American students more competitive (meaning that they will score higher on PISA and other international assessments). This new curriculum, along with highly consequential tests for students and teachers, is supposed to fix our purported problem of poor, or merely average, performance. But that inadequate American curriculum, in schools that have 10% or less of their children in poverty, helped American students to score the highest in the world in reading on the 2009 PISA test. In fact, the same PISA test revealed that American students in schools that had 25% or fewer students in poverty placed third in the world. So these two groups of public school attendees, totaling somewhere around 12 million students, seemed to have a curriculum that was outstanding! But the U.S. overall score was not as high as desired, and so a totally new curriculum was mandated at the cost of billions of dollars for the complete package, including the new tests. Lost in the triennial attempt to fix blame for our schools not being number one, a position we probably never held, was that childhood poverty was the problem that needed addressing, not the curriculum. This is what happens when interpretations of a nation's scores are made by people who do not understand the educational system's strengths as well as its problems.
It is, therefore, appropriate to wonder if PISA testing is often exploited or manipulated in many countries just for political purposes.
I have commented on five facets of research on PISA. The work done on each is important. We do need to know more about the problems and affordances that are inherent to item types for different test takers, living in different cultures. That work has generalizability and will influence many other test designers. I think also that important work is occurring in the research underlying both the facets concerned with characteristics of the students who take PISA tests: their attitudes and cognitive processes. Surely the attitudes held and the cognitive skills employed by students influence a broad array of behavior, and of most interest is the influence of that behavior on test takers in various countries. It is likely that those influences on behavior are not the same in every country or jurisdiction. A few universal findings may be found, but in general, findings that hold across all countries, for all demographic subgroups, are expected to be hard to establish. The study of the economic facet, at both the macro- and microeconomic levels, is also important work. It serves to illuminate what I will discuss next, the out-of-school factors that greatly influence schools within nations, and therefore the international rankings of those nations that participate in PISA. But even after a decade or two more of successful work on these and other facets of the PISA system we might identify, we would still have a fundamental question about PISA that is hard to answer: How shall we interpret PISA scores and ranks? This is not just the $64 question. Because so many educational expenditures are made on the basis of PISA score interpretations it may well be more like the $64 billion question, worldwide.
The meaning of test scores really is the fundamental issue that every assessment must answer convincingly. Every assessment must provide arguments that support this or that interpretation of what the assessment is thought to reveal. PISA has not yet convinced me that the current interpretations of the test results are appropriate. For example, because I am old, I remember that the United States has never fared well in international comparisons of student achievement in math and science. In 1964, the United States actually took 11th place among 12 countries participating in one of the first international studies of student achievement in math. And an international reading test in the 1970s reported poor performance for American pupils, far behind the leading nation, Italy. Germany faced the same problem not too long ago when PISA revealed that it scored well below where its self-image as a nation had placed it. Despite interpretations of PISA scores as indicators of a doomsday scenario in both countries, the achievements of Germany over the last few years, and the achievements of the United States over the last few decades, have proved to be remarkable. These two nations serve as economic engines of Europe and the world. So their poor rankings on the international tests tell us little about what to expect from an economy, although once that was the way PISA and other international tests were sold to the various participating nations. Sadly, most economists and news reporters still interpret PISA scores as indicators of future economic growth, which may be true of less well developed economies but seems not to be true for more highly developed countries. What I have learned, but my country still ignores, is that these interesting and well-constructed tests measure achievement reliably, but that the national scores have substantial validity problems when used to predict a nation's future economic health.
My advice for those in my country who worry because we are not doing well, overall, is that they should stop worrying. Our nation's PISA scores and rank seem to matter little in the economic sphere, influencing nothing much beyond political rhetoric.
Over the years that PISA and other international tests have been around, I have become more impressed with trying to understand youth in their cultures, and thus I think a lot about how out-of-school factors influence in-school achievement (Berliner, 2009). For me, PISA has morphed into an indicator of how powerful cultural influences affect the behavior of youth, both in and out of school. Thus PISA scores, which quite naturally suggest interpretations that focus on schools, may be indicators of many other things as well. The paper by E, R, & A is sympathetic to this view.
Where might such a view lead? It suggests, perhaps, that underlying the success of Asian countries that are disparate in so many other ways, may be a common cultural force. Their success on PISA might reflect the influence of Confucianism on school achievement, quite independent of the quality of particular schools or the national school systems of which they are a part. If that seems sensible, as some have proposed, then a great improvement in academic achievement in the United States might occur with the adoption of Confucianism by more than our Asian American citizens!
With this hypothesis we see that out-of-school historical and cultural factors might be a greater influence on school achievement than are inside-the-school factors. The search for universal curriculum and instructional variables to improve low performing schools would then be predicted to have less payoff than thinking about what makes for a successful social milieu within which schools can function. Sociology, anthropology, history, and cultural studies, rather than educational psychology and psychometrics may be the disciplines that best can interpret PISA scores.
Currently PISA score interpreters in the United States and elsewhere look inside the school systems of the more successful nations on PISA, searching for suggestions about how to improve their low performing schools and nations. In the United States we hear: Raise expectations! Test more frequently! Use more incentives! Assign more homework! And so forth. But a nation's performance on PISA may be much more dependent on the historical, cultural, and economic factors that surround its schools. The quality of each nation's schools, therefore, may have a lot less to do with what happens inside the schools than outside the schools. PISA scores, perhaps, may then indicate more about the societies in which schools are embedded than about the schools themselves.
Actually, after the first few administrations of PISA, it was made clear by PISA personnel that their tests measure what youth have learned from birth to 15 years of age in and out of school. Thus, they warned, interpreters of PISA scores must be aware that the quality of the schooling experienced is only a part of those 15 years of development. For example, and not often thought about when interpreting PISA test data in the United States, a 15-year-old in the United States, say in grade 9 or 10, has spent about 92% of their life out of school and only about 8% of their life in school. What does that mean for the interpretation of the test scores? What does it mean when, in the United States, about two-thirds of the variation we see between schools is accounted for by out-of-school factors? This particular finding differs substantially by country, but for the United States it certainly complicates interpretations of the overall PISA rankings as a reflection of school quality.
In a nation like the United States, with such enormous variations in quality of life, as well as quality of schooling, the scores on any national or international test will mislead if we infer that those scores are most strongly influenced by what occurs inside the schools. Life in suburban Boston and life in rural Mississippi vary enormously, perhaps even more than does the quality of the schools attended in those two settings. PISA mean scores for any nation are a pool of data from all these various kinds of social and school settings. For the United States, in particular, judging our entire school system with such data seems patently unfair. Nevertheless, this overly simple portrait of American schools, and schools in other big and diverse nations, is used in PISA to make inferences about what makes successful schools across countries. But PISA score interpreters might entertain an equally tenable position, namely, that the rank of a nation's schools is an indicator of the overall success or failure of its society. As I thought more about PISA score interpretation I came to believe that high quality schools do not make societies noticeably more successful, as is a basic premise of this elaborate, expensive, multifaceted program called PISA. Rather it may be that successful societies make for more successful schools. The causal flow is surely as likely to go the way I posit as the way PISA score interpreters believe it to be. I now find it no more likely that the success of a society depends on the performance of its schools than that the success of the schools is an indicator of the overall performance of the society in which those schools are situated.
This somewhat heretical view about the causal flow between societal factors and schools within nations is supported by research conducted by Feniger and Lefstein (2013). Their provocative study looked at the effects of a nation's schools, as revealed in that nation's PISA scores, for both Turkish immigrant children living in Western European countries, and for Chinese immigrant children living in Australia and New Zealand. The children classified as immigrants were either born in the host countries or were brought there as infants. These kinds of students provide an interesting natural experiment about culture and school effects. If the quality of a nation's schools determines the scores that its students get, then the PISA scores of the immigrant students should resemble the PISA scores of similar children in their host country. But if it is the culture of the sending country, and the representation of that culture through the students' family and neighborhood, that determines school performance, then the PISA scores of the immigrant students should resemble the scores of the students in their home countries. The results were clear and supported by other research: PISA scores obtained by immigrant students with 15 years in their host country were closer to the scores of students from the country of origin than they were to the scores of students in the country to which they had moved. These findings suggest a weakness in the arguments and recommendations associated with PISA scores and rankings. Feniger and Lefstein say it this way:
We have argued on the basis of results from comparisons of immigrant students' test scores to those of their host and source countries that cultural and historical forces are more consequential for student achievement than national education systems and policies. . . . [Thus] greater care should be exercised in the interpretation and uses of international comparative testing programmes such as PISA. While it may be politically attractive and expedient to attempt to imitate the educational policies and structures of high-attaining systems, our analysis suggests that such cross-national policy borrowing will likely be ineffective without attending to the historical and cultural contexts in which those policies operate.
Ultimately, these researchers note, the more important question is not whether national educational policies matter more or less than their social and cultural contexts, but how those national educational policies, school level variables, and a myriad of out-of-school factors interact with one another to influence PISA test scores. In sum, I believe the work conducted on many of the facets comprising PISA will be useful in sparking conversations about the improvement of assessment and instruction. But ultimately, we need to work most on understanding how to interpret the results we get, knowing full well that the chance for agreement on that facet by the community of scholars is virtually nil.
Alexander, R. (2012, May). International evidence, national policy and classroom practice: Questions of judgment, vision and trust. Paper presented at the Third Van Leer International Conference on Education, From Regulation to Trust: Education in the 21st Century, Jerusalem, Israel.
Berliner, D. C. (2009). Poverty and potential: Out-of-school factors and school success. Boulder and Tempe: Education and the Public Interest Center & Education Policy Research Unit. Retrieved from http://epicpolicy.org/publication/poverty-and-potential
Feniger, Y., & Lefstein, A. (2013, June). 'Surpassing Shanghai': An ironic investigation of PISA data. Paper presented at the International Workshop on Comparative and International Education, Hebrew University of Jerusalem, Israel.
Hannus, M., & Hyönä, J. (1999). Utilization of illustrations during learning of science textbook passages among low- and high-ability children. Journal of Contemporary Educational Psychology, 99(24), 95–123.
Krugman, P. (2012, November 29). Class wars of 2012. New York Times, Opinion Pages. Retrieved from
Sadler, M. (1900). How can we learn anything of practical value from the study of foreign systems of education? In J. H. Higginson (Ed.), Selections from Michael Sadler: Studies in world citizenship (pp. 48–51). Liverpool: Dejall and Meyorre.
Shepard, L. (1988, April). Should instruction be measurement driven? A debate. Paper presented at the meetings of the American Educational Research Association, New Orleans, LA.