Cautions About Inferences From International Assessments: The Case of PISA 2009
by Kadriye Ercikan, Wolff-Michael Roth & Mustafa Asil - 2015
Background/Context: Two key uses of international assessments of achievement have been (a) comparing country performances for identifying the countries with the best education systems and (b) generating insights about effective policy and practice strategies that are associated with higher learning outcomes. Do country rankings really reflect the quality of education in different countries? What are the fallacies of simply looking to higher performing countries to identify strategies for improving learning in our own countries?
Purpose: In this article we caution against (a) using country rankings as indicators of better education and (b) using correlates of higher performance in high ranking countries as a way of identifying strategies for improving education in our home countries. We elaborate on these cautions by discussing methodological limitations and by comparing five countries that scored very differently on the reading literacy scale of the 2009 PISA assessment.
Population: We use PISA 2009 reading assessment for five countries/jurisdictions as examples to elaborate on the problems with interpretation of international assessments: Canada, Shanghai-China, Germany, Turkey, and the US, i.e., countries from three continents that span the spectrum of high, average, and low ranking countries and jurisdictions.
Research Design: Using the five jurisdiction data in an exemplary fashion, our analyses focus on the interpretation of country rankings and correlates of reading performance within countries. We first examine the profiles of these jurisdictions with respect to high school graduation rates, school climate, student attitudes and disciplinary climate and how these variables are related to reading performance rankings. We then examine the extent to which two predictors of reading performance, reading enjoyment and out of school enrichment activities, may be responsible for higher performance levels.
Conclusions: This article highlights the importance of establishing comparability of test scores and data across jurisdictions as the first step in making international comparisons based on international assessments such as PISA. When it comes to interpreting jurisdiction rankings in international assessments, researchers need to be aware that there is a variegated and complex picture of the relations between reading achievement ranking and rankings on a number of factors that one might think to be related, individually or in combination, to quality of education. This makes it highly questionable to use reading score rankings as a criterion for adopting educational policies and practices of other jurisdictions. Furthermore, reading scores vary greatly for different student sub-populations within a jurisdiction (e.g., gender, language, and cultural groups) that are all part of the same education system in a given jurisdiction. Identifying effective strategies for improving education using correlates of achievement in high performing countries should also be done with caution. Our analyses present evidence that two factors, reading enjoyment and out of school enrichment activities, cannot be considered solely responsible for higher performance levels. The analyses suggest that the PISA 2009 results are variegated with regard to attitudes towards reading and out-of-school learning experience, rather than exhibiting clear differences that might explain the different performances among the five jurisdictions.
The PISA 2009 report card was released in December 2010. Shanghai-China, Korea, and Finland were on top of the rankings! The National Education Association (NEA) website reported: "U.S. students in the middle of the pack" (Walker, 2010). As with previous publications, the PISA results made the front pages of the national news media in many countries around the world. In the United States, The New York Times headlined "Top Test Scores from Shanghai Stun Educators" (Dillon, 2010); in Germany, we read about the "shock of PISA," and the German weekly Der Spiegel found reasons to publish a "Review of a Decade of Shock" (Verbeet, 2010); and in Canada the interpretation of results was mixed as one article warned, "Canada is becoming outclassed in school," (Hammer, 2010) whereas another article in the same newspaper contradicted the former by stating, "Canada is not becoming outclassed" (Simpson, 2010). Similar to previous report cards, certain countries (e.g., Finland, Korea, Singapore) are found to be at the top, outscoring traditional economic powerhouses, such as Germany or the United States, whereas many countries find themselves below expectations. In some jurisdictions, such as our home province of British Columbia, Canada, politicians and policy makers orient themselves to the scorecard leaders (Shanghai-China, Finland, or Singapore) to think about and rethink their education policies (Premier's Technology Council, 2010).
In the United States, the results have been widely commented upon. In a letter to the editor of Education Week, the NEA president Dennis Van Roekel (2010) stated, "We should look at comparable, successful nations and try to learn from their experience. Examining the PISA results in more detail gives an intriguing insight into how other nations boost student learning and performance and underscores the importance of elevating the teaching profession." The NEA president, thereby, explicitly recommended looking at other, higher scoring nations and learning from them; he also intimated that there are specific actions by means of which the high scoring countries boost their students' scores to such an extent that the nations end up on top. In a similar vein, blogging in Education Week during the same week, Diane Ravitch (2010) criticized current education reform strategies in the United States by comparing them to practices and policies in the top performing jurisdictions Shanghai-China and Finland, suggesting that policies and practices similar to those in these jurisdictions may lead to improvements in education in the United States. Here, again, not only was a direct comparison made between practices and policies, but these were also seen as direct causes for the differences in average PISA scores.
We have already seen such debates and suggestions following the release of previous PISA results. This has led to instances, for example, where educators looked into Japanese classrooms, e.g., the TIMSS video study data (e.g., Shimizu, 1999), for clues about what to do to improve their own country's ranking based on international assessment results. We might ask, "Do higher ranks in achievement indicate more effective education systems for all students, and ought we indeed look across the fence into other countries with very different cultural and educational contexts in the hope of finding such clues?"
The purpose of this article is to caution against simplistic international comparisons of country performances to identify the most effective education systems and strategies for improving learning. We begin by articulating a broader framework for making inferences from international assessments and then provide exemplifying analyses that exhibit the limitations in the inferential process.
INFERENCES FROM INTERNATIONAL ASSESSMENTS
Identifying high performing countries based on international comparisons and emulating their practices to improve education in our own countries involves the following necessary sequence of inferences: (a) We can identify countries that have the best schooling outcomes and therefore best education systems; (b) we can determine what policies or practices are responsible for their success; and (c) if those policies or practices were emulated and implemented in our country then our education outcomes would improve. In the current discussions of international assessments and comparisons based on these assessments, these inferences are commonplace and there is now a burgeoning mini-industry based on culling lessons learned from the study of high-performance education systems (Braun, 2013). To better understand the complexity of these inferences and elaborate on their limitations, we break them down into implied propositions. In this section we discuss key propositions that are related to the inferences rather than an exhaustive list.
IDENTIFYING COUNTRIES WITH THE MOST EFFECTIVE EDUCATION
In international assessment contexts, identifying countries with the most effective education systems typically means rank ordering countries based on their overall performance levels and identifying high-ranking countries as candidate models to be emulated. Inferences are then made about the factors that led to the effectiveness of these countries' education systems based on their rank-ordered placement on the assessment. Such an inference would need to be supported by a key set of at least three propositions: (a) the international assessment provides comparable scores for countries and jurisdictions that allow researchers to rank order countries on a single scale; (b) these scores alone are sufficient indicators of the quality of education and education systems in a country; and (c) there is homogeneity in performance levels across different student populations within countries (e.g., across provincial systems in Canada and states in the United States) that allows comparisons and rank ordering of countries.
IDENTIFYING EFFECTIVE POLICIES OR PRACTICES
Education and policy researchers use international assessment data to identify effective policies and practices by investigating relationships between performance levels and student background characteristics, classroom practices, and school contexts. The goal frequently is to identify practices, policies, and conditions that are responsible for the success levels of high performing countries and jurisdictions. This is achieved by, among other things, examining differences between the country of focus and top performing countries. An example of this is seen in a policy report Paine and Schleicher (2011) prepared about lessons that the United States may learn from PISA. We highlight three of their recommendations based on international comparisons.
First, in international comparisons, [in] the countries with the highest performance, teachers are typically paid better relative to others, education credentials are valued more, and a higher share of educational spending is devoted to instructional services than is the case in the United States (Paine & Schleicher, 2011, p. 4). The authors recommend emulating the raising of teachers' status. A second recommendation is to raise education standards based on the observation that "Most of the high performing countries have developed world-class academic standards for their students and their existence tends to be a consistent predictor for the overall performance of education systems" (p. 5). Third, they recommend allocating equitable funds to educate economically disadvantaged students based on the reasoning that available funds are spent differently in the United States than in the most successful PISA performing nations. It is one of only four OECD countries that appears to spend less money per student (based on teacher/student ratios) in its economically disadvantaged schools, while spending more in richer districts (p. 6).
All of these recommendations are made by identifying a selected set of differences between policies in the United States and high performing countries on PISA 2009. It is difficult to argue with these three recommendations as good strategies for improving education. However, the authors do not provide any evidence that all the high-ranking countries had these strategies in place or, perhaps even more importantly, that low-ranking countries had policies that differed from these in any significant way. Without such evidence, the recommendations may be supported based on education philosophy and equity principles, but a causal link between the recommended strategies and high learning outcomes cannot be established. Here the implied propositions of inferences about effective policies and practices are: (a) there are uniform practices and policies across jurisdictions within countries that are responsible for success levels at the country level; and (b) these practices and policies are responsible for the success levels in high performing countries.
EMULATING AND IMPLEMENTING POLICIES AND PRACTICES
In the sequence of learning from international comparisons, the next inference involves emulating and implementing policies and practices of high-ranking countries in our own countries. Even if we could identify the factors that were primarily responsible for the success of high-ranking countries, the question remains whether emulating these practices would lead to higher learning outcomes in our own countries. Successful emulation and implementation of strategies for improving education imply the following propositions: (a) practices and policies that lead to higher learning outcomes in other countries will lead to higher learning outcomes in our country; and (b) school practices commonplace in one country can be transported to another country with different educational expectations, cultural contexts, teacher backgrounds, and educational systems. These are propositions that are not directly related to interpretation of international comparisons and the assessment data.
There are many reasons why specific policies and practices that are effective in one country or culture may not lead to better learning outcomes in another country or culture. For example, even within the same country, i.e., the United States, the participation structures in reading lessons that work for white American children do not work for those of native Hawaiian backgrounds (Au, 1980); reading practices, therefore, do not simply transfer from one cultural setting to another. Other factors, such as home literacy practices and student, parent, and public expectations are relevant. In addition, certain school practices and policies may not be feasible to implement across countries. These include increasing or decreasing class sizes, enforcing certain types or levels of education for teachers, and centralization or decentralization of education systems.
In this section we identified three sequential inferences and discussed their implied assumptions and propositions. The first two of these inferences depend on interpretations of international assessment data; the last one is based on expectations and assumptions independent of the assessment data. In the next section, we discuss limitations of these inferences, specifically those that are based on the assessment data. We focus on the two inferences about identifying most effective countries and identifying effective policies and practices that are based on international assessments, but not on inferences that are not directly tied to the data.
LIMITATIONS IN IDENTIFYING COUNTRIES THAT HAVE THE BEST SCHOOLING OUTCOMES AND MOST EFFECTIVE EDUCATION SYSTEMS
In this section, we present three key arguments against inferences that involve country rankings for identifying most effective education systems. First, there are several sources of incomparability in international assessments that create inaccuracies in comparing and ranking countries. These levels of incomparability require researchers to be cautious about making large inferential leaps based on small (absolute) differences between countries. Second, overall country performance is one of many indicators of education systems. If we want to examine education systems we need to look at multiple indicators such as school dropout rates, school climate, student and teacher behavior, and students' perceptions of benefits of schooling. Third, focusing on performance at the level of entire countries may obscure important within-country differences in rankings for subgroups, such as gender or ethnic groups.
We exemplify these problems with the three key propositions and support our arguments using data from PISA 2009. PISA is a large-scale assessment that has been administered internationally in 3-year cycles since 2000. The number of participating countries has grown steadily since its inception, with 31, 41, 57, and 75 countries/jurisdictions participating in 2000, 2003, 2006, and 2009, respectively. PISA is administered to 15-year-old students to determine the extent to which they have attained the required knowledge, skills, and attitudes in scientific, mathematical, and reading literacies to succeed and fully participate in society when they near the end of compulsory education. Each assessment cycle has a primary domain and two secondary domains. In 2009, the primary domain was reading literacy. In this paper we draw our examples from this domain.
We use PISA 2009 reading assessment for five countries/jurisdictions as examples to elaborate on the problems with these propositions: Canada, Shanghai-China, Germany, Turkey, and the United States, i.e., countries from three continents that span the spectrum of high, average, and low ranking countries and jurisdictions. Among the five focus jurisdictions Shanghai-China is not a country but a city in China and the participating student sample was designed to be representative of Shanghai as a city not China as the country. In the remainder of this text, we therefore refer to cities or countries participating in PISA as jurisdictions. Three of the focus jurisdictions are part of the G7, the meeting of the finance ministers of the seven largest developed countries: Canada, Germany, and the United States. The two other jurisdictions are part of the seven largest emerging markets: China and Turkey. These five focus jurisdictions allow us to elaborate our points in more detail than discussing all 75 participant jurisdictions. However, whenever the data allow, we conduct our analyses using data from all participating jurisdictions and present findings for the five focus jurisdictions in the context of the broader international comparisons.
COMPARABILITY ISSUES IN INTERNATIONAL ASSESSMENTS
In using data from international assessments, such as PISA, there are two key comparability issues researchers need to take into account: comparability of student samples and comparability of scores and survey data. These two issues are discussed below.
Comparability of Student Samples
In international assessments, students are sampled based on a systematic probability sampling design. There are numerous challenges in implementing a complex sampling design. For example, in many jurisdictions, selected schools are not required to participate in the assessment and they may decline. Thus, school participation rates in PISA 2009 ranged between 67.83% for the United States and 100% for Estonia, Korea, Luxembourg, and Turkey (OECD, 2010, p. 177). Different methods of replacement for declining schools, as well as statistical adjustments for nonresponse, may reduce but cannot eliminate the resulting biases. In addition, education systems themselves may present limitations on the degree of comparability of samples. For example, jurisdictions may vary in their identification of special needs students (who are integrated in some jurisdictions but placed in special schools in others [e.g., the German Sonderschule]), their inclusion/exclusion rules, whether they include private schools, and whether they have single track (United States, Canada) or multitrack school systems (in Germany, there are Hauptschule, Realschule, and Gymnasium, with different length of schooling and academic concentrations). Exclusion rates in PISA 2009 ranged from 0% for Japan to 5.47% for Canada among the OECD countries (OECD, 2010, p. 174). Different practices in the implementation of the sampling design (the inclusion and exclusion rules) lead to differential representativeness of target populations and comparability of samples across jurisdictions.
In international assessments, decisions on which populations to target have inevitable trade-offs. Target populations are either age groups, such as the 15-year-olds in PISA, or grade levels, such as Grades 4, 8, and 12 in TIMSS. Differences in education systems, such as the age at which children start school, or the curricular sequencing of topics and subject areas across the years of schooling limit the inferences that can be drawn between schooling and learning outcomes based on the student samples. For example, student outcomes at age 15 may be a result of 9 to 11 years of schooling, depending on whether children start school as early as age four (e.g., UK) versus age six (e.g., Turkey). These differences in the starting age for schooling lead to age groups that may be in different grades. In PISA 2009, the 15-year-olds sampled from the participating jurisdictions could be at any level between seventh grade and 12th grade. Targeting grade levels, on the other hand, results in student samples with similar numbers of years of schooling but with variation in ages in different jurisdictions and at different stages of development.
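The arithmetic behind the 9 to 11 year range can be sketched as follows. This is only a minimal illustration: the function name is hypothetical, and the calculation assumes uninterrupted schooling from the starting age, ignoring grade repetition, mid-year entry, and differences in the length of the school year.

```python
def years_of_schooling(assessment_age: int, start_age: int) -> int:
    """Approximate years of schooling completed by the assessment age.

    Simplifying assumption (not from the PISA documentation): schooling
    is continuous from start_age, with no repetition or interruption.
    """
    if start_age > assessment_age:
        raise ValueError("start_age cannot exceed assessment_age")
    return assessment_age - start_age


# Starting ages taken from the examples in the text (age four vs. age six).
for start_age in (4, 6):
    years = years_of_schooling(15, start_age)
    print(f"start at age {start_age}: about {years} years of schooling by age 15")
```

Under these assumptions, two students who sit the same assessment at age 15 may differ by two full years of schooling, which is the comparability concern raised above.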
Comparability of Scores and Survey Data
Comparability of scores requires equivalence of measurement in different language and national versions of the assessments. Measurement equivalence includes: (a) equivalence of constructs, (b) equivalence of tests and surveys, and (c) equivalence of testing conditions. Linguistic equivalence of international assessments, an aspect of test equivalence, is one of the most investigated aspects of measurement comparability. In one of the subsections below we discuss test translation and adaptation issues to demonstrate types and degrees of differences that may exist in different language versions of international assessments. In the last subsection we discuss other sources of measurement incomparability that are not captured by the three categories of measurement equivalence.
Construct equivalence refers to the similarity of the construct measured by international assessments in different jurisdictions. This equivalence addresses whether there is theoretical and empirical evidence to support similar development and equivalent definitions of, for example, reading as a construct in different jurisdictions and languages. This type of evidence in reading would require test developers to demonstrate similarity of cognitive processes and development in reading competencies in different languages with different alphabets. In PISA 2009, the framework starts with the concept of literacy, a concept that includes students' capacity to extrapolate from what they have learned and apply their knowledge in real-life settings, and their capacity to analyze, reason and communicate effectively as they pose, interpret and solve problems in a variety of situations (OECD, 2010, p. 22). The construct equivalence evidence would include evidence that reading competency is defined and used in similar ways to analyze, reason, interpret, and solve problems in real-life settings in the different jurisdictions participating in the assessment. By age 15, key differences between languages may not affect the literacy of students as greatly as when students first learn to read. However, application of literacy competencies to real-life contexts and solving problems may be operationalized in different ways.
Construct equivalence issues are particularly relevant in measurements captured by student, parent, and school background questionnaires. In addition to the limitations of indirect stakeholder report data (see Perry & Winne, 2006; Winne, Jamieson-Noel, & Muis, 2002) such as potential for positive response bias, constructs such as student attitudes, classroom climate, or even socioeconomic status are likely to have different significations in different countries, cultures, and languages. Little to no research has been conducted to date on the comparability of survey data gathered as part of international assessments.
Test equivalence refers to linguistic, content, and cultural equivalence of different language versions of tests and surveys administered in different jurisdictions. This means that items: (a) have similar referential relations to the life-world of the students (e.g., an item about [American] football is not equally salient to students in North America and those from other jurisdictions)1; (b) capture similar constructs (e.g., items pertaining to emotion may hold different meaning for students in Asia than they do in North America and Europe or in Arabic countries); (c) are of the same length (so as not to disadvantage speakers of some languages with respect to the total amount and complexity of reading to be done2); (d) use the same format; and (e) do not include cultural references that may disadvantage students from different jurisdictions.
Testing Conditions Equivalence
Equivalence of testing conditions refers to whether: (a) tests and surveys are administered in an identical fashion in different jurisdictions; (b) the formats are equally appropriate for students in each jurisdiction (which is not the case for some sub/cultures, where the salience between literacy and orality differs such as for working-class students [Eckert, 1989]); (c) speededness is similar in all the jurisdictions (in fact, because of differently experienced temporality, students from some [indigenous, African American] cultures may be disadvantaged when the speed is exactly the same; Roth, Tobin, & Ritchie, 2008); and (d) other response styles, such as acquiescence, tendency to guess or strategic guessing, and social desirability, are similar (Ercikan & Lyons-Thomas, 2012; Hambleton, 2005; Hambleton & Patsula, 1999).
Measurement Equivalence Evidence
Types of evidence that would be needed to establish measurement equivalence include evidence: (a) of similar construct meaning for the comparison groups; (b) of equivalence of the constructs captured by pairs of items in the comparison languages for construct equivalence; and (c) from bilingual expert reviews, psychometric comparability, and cognitive analysis (Ercikan & Lyons-Thomas, 2012). The equivalence of testing conditions would need to include evidence that procedures used in test administration and familiarity with test format, speededness, and response styles were similar in the participating jurisdictions. Users of international assessment data may not be in a position to gather the evidence required for an evaluation of the equivalence of tests, some of which demand special research studies and data collection efforts. However, the comparability of scores and questionnaire data that result from international assessments needs to be an important consideration in interpreting findings from secondary analyses of such data.
PISA 2009 was administered in 75 countries/jurisdictions and was adapted to 50 different languages. Validity of inferences from international assessments critically depends on the degree to which these adapted versions of tests measure the intended constructs and provide comparable measurements (AERA, APA, & NCME, 1999; Ercikan, 2006; Ercikan & Lyons-Thomas, 2012; Hambleton, 2005; ITC, 2001). In PISA 2009, the verification procedures included: (a) development of two source versions of the instruments in English and French; (b) double translation design; (c) preparation of detailed instructions for the translation of the instruments for the participating jurisdictions; (d) preparation of translation/adaptation guidelines; (e) training of national staff in charge of the translation/adaptation of the instruments; and (f) verification of the national versions by international verifiers (OECD, 2011a).
An international verification of comparability was conducted for 83 out of 101 national versions of the assessment materials. These procedures were not conducted when a testing language was used for minorities that make up less than 10% of the target population or when jurisdictions borrowed a version that had been verified at the national level without making any adaptations. For two national versions, Irish and Valencian, verification was done only at the national level (OECD, 2011a). These procedures demonstrate the great care given to the translation/adaptation process in PISA. However, even though such procedures minimize translation/adaptation errors and problems with comparability, their existence does not indicate the degree to which assessments were comparable.
Previous research on international educational achievement tests demonstrated considerable incomparability between different language versions of tests (Ercikan, 1998; Ercikan & Koh, 2005; Ercikan & McCreith, 2002; Ercikan & Lyons-Thomas, 2012; Grisay, 2003; Hambleton, Merenda, & Spielberger, 2005; Oliveri & Ercikan, 2011; Solano-Flores, Backhoff, & Contreras-Niño, 2009). Even though some of the differences between different language versions of assessments could be attributed to adaptation errors, many were due to intrinsic differences between languages that may lead to differences in difficulty or commonness of vocabulary, differential length or complexity of sentences, and differential contextual meaning of vocabulary (Ercikan, 1998; Roth et al., 2013). Potential problems that may arise from adaptation of items into other languages are those factors that may lead to differential item functioning (DIF) (Allalouf, 2003; Ercikan, 1998; Gierl & Khaliq, 2001; Hambleton, 2005). DIF indicates that items have different psychometric properties and may be assessing somewhat different constructs (Lord, 1980).
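To make the notion of DIF concrete, the following sketch computes the Mantel-Haenszel common odds ratio, one widely used DIF screening statistic. It is offered only as an illustration and is not necessarily the method used in the studies cited above. The function name and all counts are hypothetical; each stratum is assumed to group examinees from two language versions who have the same total test score.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for a single item.

    strata: iterable of (ref_correct, ref_incorrect,
                         focal_correct, focal_incorrect),
    one tuple per matched total-score level.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts for one item at three matched score levels.
strata = [
    (20, 30, 10, 40),
    (30, 20, 20, 30),
    (40, 10, 30, 20),
]

alpha = mantel_haenszel_odds_ratio(strata)
# ETS delta metric: negative values indicate the item is harder
# for the focal group than for matched reference-group examinees.
delta = -2.35 * math.log(alpha)
print(f"MH odds ratio = {alpha:.2f}, MH D-DIF = {delta:.2f}")
```

An odds ratio near 1 (delta near 0) would suggest that examinees of equal overall proficiency have similar chances of answering the item correctly in both language versions; values far from 1 flag the item for closer review, for example by bilingual experts.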
Other Sources of Measurement Incomparability
In addition to linguistic differences between different national versions of assessments, researchers have highlighted several other sources of difference, including curricula, educational policies and standards, wealth, standard of living, cultural values, motivation to take the test, etc., that may be essential for properly interpreting scores across cultural/language and/or national groups (Hambleton & Patsula, 1999, p. 162). In a reading assessment in particular, cultural differences may influence intrinsic interest, familiarity, and interpretation of the content of items. This is so because words and sentences are bound up differently with the life-worlds experienced by members of different cultures, and, therefore, the associated significations can be considerably different (Ercikan & Roth, 2006; Roth, 2010). All translation therefore jeopardizes comparability, so that, strictly speaking, even within-language translations are nonequivalent (Derrida, 1996). Even within a particular linguistic context, such as the United States or Canada, some groups of students speak structurally and semantically different forms of English (e.g., Aboriginals, African Americans, working-class students), making Standard English their second language (Roth & Harama, 2000). Test items articulated in Standard English therefore are nonequivalent for these groups, even though all students may declare English as their mother tongue.
RANK ORDERING OF JURISDICTIONS IN OVERALL PERFORMANCE DOES NOT TELL THE WHOLE STORY
The rank ordering of jurisdictions according to overall performance along a dimension such as reading literacy does not and cannot reveal differential outcomes that are due to effectiveness of education systems. Even when scores are comparable across jurisdictions, evaluating the effectiveness of educational systems based on a single criterion and measure is highly problematic. In addition to reading literacy outcomes there are many other educational outcomes that need to be considered, and some possible measures are not even included, such as high school graduation rates, suicide rates among students, or levels of happiness (an important measure in Bhutan, where education is also viewed as one of the basic needs required to achieve Gross National Happiness; Ministry of Health and Education, 2004, p. 5). Among those available as part of PISA data collection efforts are school climate, student- and teacher-related factors that hinder learning, and students' perceptions of benefits of schooling.
In this subsection, we discuss problems with respect to interpreting school rankings as indicators of effectiveness of education systems. Using the five jurisdiction data in an exemplary fashion, we examine the profiles of these jurisdictions with respect to important educational outcomes that cannot be simply summarized by an average scale score. These include high school graduation rates, school climate, student attitudes, and disciplinary climate. We examine these indicators of quality of education based on the PISA 2009 school and student questionnaire data, except for the graduation rates that were obtained from another OECD database. We then demonstrate that even when we only consider reading literacy scores, overall jurisdiction rankings are problematic by elaborating on how jurisdiction rankings vary for sub-groups such as gender groups.
HIGH SCHOOL GRADUATION RATES
A variable such as high school completion rate is often used as an indicator of literacy levels in the population and the labor force in a jurisdiction, and as an important indicator of quality and effectiveness of an education system. Yet this education quality indicator is neither uniform within countries and jurisdictions nor positively correlated with some other indicators of quality of education. A comparison of 35 developing countries shows dropout (noncompletion) rates that vary widely across grade levels and age (under age, on time, one year over age, and two or more years over age) and that may, for some subgroups, reach enormous proportions (e.g., nearly 80% of the students in Grade 11 who are one or more years above age drop out of school, and they are over age because of considerable repeating rates in all grades) (EPDC, 2009). There typically are also higher dropout rates at the transitions from elementary to middle and middle to high school levels. In industrialized nations, too, dropout (noncompletion) rates show historical and geographical differences. For example, in Canada, the 1990/1993 dropout rates ranged from 13.3% in British Columbia to 19.9% in Newfoundland; the 2007/2010 dropout rates ranged from a low of 6.2% (British Columbia) to a high of 11.7% (Quebec) (Statistics Canada, 2010). That is, within the same nation we may observe considerable differences over time and across regions or jurisdictions.
Based on high school graduation rate data from the OECD (OECD, 2011b), we examined the degree to which jurisdiction performance ranks on the reading assessment corresponded with high school graduation rates in the jurisdiction. The OECD database contains graduation rates for only 30 of the 75 participating jurisdictions in PISA 2009. Most importantly, Shanghai-China graduation rates are not included in the database. In Figure 1, jurisdictions that fall below the red line are those whose graduation rate ranks are higher than their PISA reading performance ranks. Portugal, with the highest graduation rate of 96%, ranks 17th among the 30 jurisdictions on reading performance displayed on this chart. Korea, on the other hand, ranks 11th on the graduation rate and 1st on the reading performance (among the jurisdictions considered in this comparison). Canada and the United States rank considerably higher on reading performance than they do on their graduation rates. Even though these two jurisdictions rank very similarly on their graduation rates (20th and 21st), Canada's reading performance ranking is much higher (third) than that of the USA (10th).
Figure 1. PISA reading performance and high school graduation rate ranks
In fact, it would be surprising to find one-to-one correspondence between high school completion rates and performance. For example, all other things being equal, one might expect an increase in performance with an increase in dropout rate because dropouts are likely to be lower performing than those students remaining in school. Furthermore, completion rates and academic performance measure very different, but equally important, aspects of an education system. The graduation rates are results of a complex set of factors such as equity in the society, rural/urban distributions of populations, opportunities for employment, and industrialization of the jurisdiction among many others. Graduation rates are loosely connected to quality of education in schools and we rarely look for such connections to improve learning in schools. But we must ask, can we make judgments about quality of education in a jurisdiction without considering such an important outcome of an education system?
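The mismatch between the two rankings can be made concrete with a small calculation. The following is a minimal sketch in Python that uses only the four pairs of ranks quoted above (among the 30 jurisdictions with graduation data); the helper name `rank_gap` is hypothetical, and the sketch does not reproduce the analysis behind Figure 1.

```python
# Ranks quoted in the text, among the 30 jurisdictions with graduation data.
# Each value is (graduation-rate rank, reading rank); rank 1 is best on both.
ranks = {
    "Portugal": (1, 17),
    "Korea": (11, 1),
    "Canada": (20, 3),
    "United States": (21, 10),
}


def rank_gap(graduation_rank, reading_rank):
    """Hypothetical helper: positive when a jurisdiction ranks better
    on reading than on graduation rate, negative for the reverse."""
    return graduation_rank - reading_rank


gaps = {name: rank_gap(g, r) for name, (g, r) in ranks.items()}

# List jurisdictions from "graduation far ahead of reading" to the opposite.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name:14s} gap = {gap:+d}")
```

A gap of zero for every jurisdiction would correspond to the one-to-one correspondence that, as argued above, should not be expected.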
Conditions of school and school climate are important aspects of educational systems. Are higher performing schools more efficient and do they have better school climates? Using the PISA 2009 school questionnaire data, we examined how jurisdiction reading performance rankings compared with jurisdiction school climate indicators for all 75 participating jurisdictions. Two composite scales presented in the PISA 2009 database were used: school climate with respect to students and school climate with respect to teachers. The school climate with respect to students included school principals' responses to questions about the degree to which learning at school was hindered by the following factors on a four-point scale from "Not at all" to "A lot": student absenteeism, poor student-teacher relations, disruption of classes by students, students skipping classes, students lacking respect for teachers, student use of alcohol or illegal drugs, and students intimidating or bullying other students. Principals also rated the degree to which the following teacher-related factors hindered schooling on the same four-point scale: teachers not meeting individual students' needs, teachers' low expectations of students, teacher absenteeism, staff resisting change, teachers being too strict with students, and students not being encouraged to achieve their full potential. Higher scores on these two scales indicate fewer student- or teacher-related problems.
The rankings of jurisdictions based on the student-behavior factors hindering learning are plotted against reading performance rankings (Figure 2). Jurisdictions that are closer to the red line are those that are ranked similarly on both scales, whereas those farther from the red line ranked differently on the two scales. Among our five focus jurisdictions, Shanghai-China reports the fewest student-behavior problems and is ranked the highest in reading performance. However, Canada, which ranks higher on the reading performance scale than the remaining three jurisdictions, ranks better only than Turkey on the student-related problems scale. There is greater contrast in ranks for Albania, Azerbaijan, Georgia, Indonesia, and Malaysia, jurisdictions that ranked very high (all within the top 10) on the student behavior scale but very low (all below 40th) on the reading scale.
Figure 2. PISA reading performance and student behavior ranks (Spearman's rho = -0.183, p = 0.122)
Ranks for teacher behavior hindering learning were plotted against reading performance ranks (Figure 3). Relative to each other, the five focus jurisdictions ranked differently on the teacher behavior scale than the way they ranked on the reading scale. Shanghai-China ranked fourth on the teacher behavior scale, while Canada ranked first. There is a large group of jurisdictions that rank higher on the teacher behavior scale whose performances are not high on the reading scale, all in the top left-hand corner of the chart. This group includes Albania, Azerbaijan, Indonesia, Malaysia, and Qatar. Another group of jurisdictions in the lower right-hand corner of the chart ranks high on reading performance (lower ranking values) but ranks low (high ranking values) on the teacher behavior scale. These jurisdictions include Shanghai-China, Korea, and Hong Kong.
Figure 3. PISA reading performance and teacher behavior ranks (Spearman's rho = -0.035, p = 0.769)
STUDENT SATISFACTION WITH SCHOOLS
Students' satisfaction with schools and their evaluation of what they get out of school is another indicator of quality of education. To investigate whether satisfaction with schools as an indicator of quality of education was consistent with reading performance rankings, we compared satisfaction with school rankings with reading score rankings. On the PISA 2009 student questionnaire, students were asked to rate their agreement on a four-point scale with the following statements: (a) "School has done little to prepare me for adult life when I leave school"; (b) "School has been a waste of time"; (c) "School has helped give me confidence to make decisions"; and (d) "School has taught me things that could be useful in a job." Higher ratings on the first two of these indicated low satisfaction with schools, whereas higher ratings on the latter two indicated high satisfaction. The overall satisfaction scale (reported in the PISA database) is a composite score scale where the direction of responses is taken into account and higher scores indicate higher satisfaction. The relative rankings on the performance and satisfaction scales are compared in Figure 4. Whereas the United States ranks nearly the same on both scales, Turkey ranks much higher in satisfaction with schools than on the reading scale, and Shanghai-China ranks lowest on satisfaction with schools (71st) but highest (first) on reading scores.
Figure 4. Satisfaction with schools and reading performance ranks (Spearman's rho = 0.565, p < 0.001)
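The idea of taking the direction of responses into account can be illustrated with a short sketch. The Python example below is an assumption-laden simplification: the numeric coding (1 = strongly disagree to 4 = strongly agree), the item labels, the function names, and the use of a simple mean are all hypothetical, and the composite reported in the PISA database is derived by the PISA scaling procedures rather than by this averaging.

```python
# Assumed numeric coding of the four-point agreement scale (hypothetical):
# 1 = strongly disagree, 2 = disagree, 3 = agree, 4 = strongly agree.
SCALE_MIN, SCALE_MAX = 1, 4

# Items (a) and (b) are negatively worded: agreement signals low satisfaction.
NEGATIVE_ITEMS = {"a", "b"}


def reverse(score):
    """Flip a rating so that 1 becomes 4, 2 becomes 3, and so on."""
    return SCALE_MIN + SCALE_MAX - score


def satisfaction(responses):
    """Mean of the four items after aligning them so that higher
    values always mean higher satisfaction (illustrative only)."""
    aligned = [
        reverse(value) if item in NEGATIVE_ITEMS else value
        for item, value in responses.items()
    ]
    return sum(aligned) / len(aligned)


# One hypothetical student: rejects (a) and (b), endorses (c) and (d).
student = {"a": 1, "b": 2, "c": 4, "d": 3}
print(satisfaction(student))  # 3.5
```

Without the reversal step, a student who strongly agrees that school has been a waste of time would raise, rather than lower, the satisfaction score.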
Positive associations have been reported between various aspects of classroom and school climate including discipline and achievement (Creemers & Reezigt, 1999; Freiberg, Huzinec, & Templeton, 2009). A study of 7,259 schools in 41 jurisdictions showed that better school discipline is associated with better classroom discipline and the latter with higher achievement, for example, in mathematics (Chiu & Chow, 2011). This result was obtained by pooling school-level within-jurisdiction analyses. Therefore, there is reason to think that the ranking of jurisdictions according to disciplinary climate may correlate with the ranking of these jurisdictions according to reading achievement. On the PISA 2009 test, students rated their agreement with the following on a four-point scale: (a) "Students don't listen to what the teacher says"; (b) "There is noise and disorder"; (c) "The teacher has to wait a long time for the students to quiet down"; (d) "Students cannot work well"; and (e) "Students don't start working for a long time after the lesson begins." The relative rankings of jurisdictions on the reading performance and effective classroom processes scales are presented in Figure 5. The results show that there is great variability in how jurisdictions ranked on the effective classroom processes scale compared to the reading performance scale. There are large groups of jurisdictions that ranked high on effectiveness of classroom processes but low on reading performance. These include Albania, Azerbaijan, Georgia, Kazakhstan, Kyrgyzstan, and Moldova. On the other hand, jurisdictions like Finland, Australia, and New Zealand had high rankings on the reading scale but low rankings on effective classroom processes.
Among our five focus jurisdictions, Shanghai-China ranked high on both reading performance (first) and effective classroom processes (fifth), whereas Canada ranked second on reading performance but the lowest on effective classroom processes (58th). Germany (28th on reading, 18th on classroom processes), Turkey (41st on reading, 42nd on classroom processes), and the United States (17th on reading, 21st on classroom processes) ranked similarly on the two scales.
Figure 5. Disciplinary climate and reading performance ranks (Spearman's rho = 0.268, p = 0.021)
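The rank correlations reported with the figures are Spearman coefficients. As a minimal sketch of how such a coefficient is computed, the Python example below applies the textbook formula for untied ranks to the five focus jurisdictions only, using the ranks quoted above and re-ranking them from 1 to 5 within the group. The function names are hypothetical, and the result describes these five jurisdictions alone; it is not a reproduction of the coefficient reported for Figure 5, which is based on all jurisdictions in the chart.

```python
def to_ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def spearman_rho(x, y):
    """Spearman's rho for untied data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    rank_x, rank_y = to_ranks(x), to_ranks(y)
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))


# Ranks quoted in the text, in the order:
# Shanghai-China, Canada, Germany, Turkey, United States.
reading = [1, 2, 28, 41, 17]
classroom_processes = [5, 58, 18, 42, 21]

rho = spearman_rho(reading, classroom_processes)
print(round(rho, 2))  # 0.3
```

With only five jurisdictions, a single discordant case such as Canada is enough to pull the coefficient far below 1, which mirrors the variability visible in the full chart.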
JURISDICTION RANKINGS VARY BY SUB-GROUP: THE CASE OF GENDER
The PISA report card shows that Shanghai-China scored significantly higher on the reading scale than Canada, which in turn scored significantly higher than Germany and the United States, neither of which differs from the OECD mean; Turkey scored significantly below the OECD mean (see Table 1). Gender differences (D) are marked: Girls in all jurisdictions outperformed the boys, with the smallest difference in the United States (D = 25), followed by Canada (D = 35), Germany and Shanghai-China (D = 40), and Turkey (D = 43). With such large differences between gender groups in these jurisdictions, ranging between 0.26 SD (U.S.) and 0.52 SD (Turkey), the rank ordering of mean performance of jurisdictions is not the same for boys and girls: Canadian girls (M = 542) have about the same mean as the Shanghai-Chinese boys (M = 536); German girls (M = 518) outperform boys in the United States and boys in Canada; the Turkish girls (M = 491) perform at about the same level as the German boys (M = 478) and the boys in the United States (M = 488). The girls in the United States perform at the same level as the boys in Canada.3
Table 1. PISA 2009 Mean Score and Gender Differences on the Reading Scale for Five Jurisdictions
Within each of these jurisdictions, girls and boys are part of the same education system. Therefore, if PISA is to provide educational data and information for improving policy and practice by using an international comparative perspective, the United States cannot simply look toward Canada to inform education improvement plans when girls in the United States perform at a similar level to boys in Canada. Similarly, Canada cannot look to Shanghai-China to improve performance in reading when on the average Canadian girls are performing at the same level as boys in Shanghai-China.
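The interleaving of gender groups across jurisdictions can be laid out with simple arithmetic. The Python sketch below starts from the means (M) and gender differences (D) quoted in the text and derives each missing subgroup mean as girls = boys + D. The derived values are reconstructions from rounded figures, not entries copied from Table 1, and the variable names are hypothetical.

```python
# Means (M) quoted in the text for some subgroups.
girls_quoted = {"Canada": 542, "Germany": 518, "Turkey": 491}
boys_quoted = {"Shanghai-China": 536, "Germany": 478, "United States": 488}

# Gender differences (D = girls minus boys) quoted in the text.
difference = {
    "United States": 25,
    "Canada": 35,
    "Germany": 40,
    "Shanghai-China": 40,
    "Turkey": 43,
}

# Derive the missing mean of each jurisdiction from the quoted one and D.
means = {}
for name, d in difference.items():
    if name in girls_quoted:
        girls = girls_quoted[name]
        boys = boys_quoted.get(name, girls - d)
    else:
        boys = boys_quoted[name]
        girls = boys + d
    means[(name, "girls")] = girls
    means[(name, "boys")] = boys

# Order all ten subgroups by mean score, highest first.
ordering = sorted(means, key=means.get, reverse=True)
for name, gender in ordering:
    print(f"{means[(name, gender)]:4d}  {name} {gender}")
```

In the resulting list, girls in the United States (513) sit directly above boys in Canada (507), and Canadian girls (542) directly above Shanghai-Chinese boys (536), which is the pattern the argument above relies on.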
In this section, we presented a discussion of arguments against three propositions: (a) the international assessment provides comparable scores for countries and jurisdictions that allow researchers to rank order countries on a single scale; (b) these scores alone are sufficient indicators of quality of education in a country; and (c) there is homogeneity in performance levels across different student populations within countries that allows comparisons and rank ordering of countries. First, a wide range of issues that may jeopardize comparability of scores and survey data were discussed. Second, exemplifying analyses were used to provide a variegated and complex picture of the relations between reading achievement rankings and rankings on a number of factors that one might think to be related individually or in combination to quality of education and, subsequently, academic performance. This makes it highly questionable to use reading score rankings as a criterion for adopting educational policies and practices of other jurisdictions. Third, reading scores vary greatly for different student subpopulations within a jurisdiction (e.g., gender, language, and cultural groups) that are all part of the same education system in a given jurisdiction. This makes it difficult to rank order jurisdictions and identify the most effective education policies and practices that can be expected to lead to improvement for all students.
LIMITATIONS IN DETERMINING POLICIES OR PRACTICES RESPONSIBLE FOR SUCCESS
In this section, we discuss limitations of drawing inferences about effective strategies. We do so by demonstrating problems with one of the two propositions implied when effective strategies are identified from differences between high-performing and lower-performing jurisdictions, namely, that the identified practices and policies are responsible for the success levels in high-performing jurisdictions. Paralleling the Paine and Schleicher (2011) example of drawing inferences from PISA 2009, we consider a scenario where researchers and policy makers look for insights to identify strategies to improve education and performance in the United States. In our discussion of the limitations of rank ordering jurisdictions, the United States was one of our focus countries. In this section, we discuss potential problems with identifying strategies for improving education in the United States by comparing it to the other four jurisdictions: two of the top-performing jurisdictions (Shanghai-China and Canada), Germany (which performed at approximately the same level as the United States), and Turkey (which performed lower than the United States). We focus on two factors that might be perceived as being responsible for performance differences between the United States and high-performing countries: reading enjoyment and out-of-school enrichment activities. For the identified strategies to account for higher performance levels, all high-performing countries need to be consistent with respect to these strategies, and low-performing countries need to be distinctly different from high-performing countries with respect to these strategies. Our analyses present evidence that these two factors, reading enjoyment and out-of-school enrichment activities, cannot be considered solely responsible for higher performance levels.
READING ENJOYMENT AND READING ACHIEVEMENT
Large differences between girls and boys reported in the preceding section may lead educators to think that differences in social factors, such as attitudes and general interest in reading, may be important sources of differences between jurisdictions as well. In fact, in the five jurisdictions we examined, enjoyment of reading is the student variable most highly correlated, or among the most highly correlated, with reading performance, ranging from r = .3 (Turkey) to r = .5 (Canada and Germany). In addition, there is a big difference between Shanghai-China, the top-performing jurisdiction, and the United States, with Chinese students expressing greater interest and enjoyment in reading compared to the American students. When asked how much they read for enjoyment, distinct patterns emerge for students from the five focus jurisdictions. There is only a small fraction of Shanghai-Chinese students (8%) and a relatively low proportion of students in Turkey (22%) who do not read for enjoyment, whereas large percentages of Canadian, German, and American students (ranging from 33% to 42%) do not read for enjoyment (Figure 6). A similar pattern is observed regarding students who read only when they have to (see Figure 6). The proportion of Turkish students lies about halfway between the traditional Western countries and Shanghai-China. There is a considerable percentage of students in the Western countries who consider reading to be a waste of time, ranging from 24% to 27%, whereas only a small percentage of students in the two emergent economies express the same attitude (6% to 8%). Similarly, the proportions of students who reported that they like to talk about what they read are lowest among German, American, and Canadian students (ranging from 32% to 43%) and highest for the Chinese (65%) and Turkish students (67%).
Figure 6. Percentage of Students Reporting That They do not Read for Enjoyment and Read Only if They Have to.
High correlations between interest in and enjoyment of reading and the reading scores within jurisdictions combined with low levels of interest and enjoyment of reading by the United States students may lead researchers to think that improving interest and enjoyment of reading may be the key to improving performance in reading. However, students from the two emerging economies are very similar in exhibiting much more favorable attitudes toward reading. Yet there are vast differences in reading performances between Shanghai-China and Turkey. Also interesting are the similarities in attitudes expressed by Canadian and American students in the face of significant differences in the reading scores. This discussion demonstrates problems with the proposition that the factors associated with higher performance in high-ranking jurisdictions, in this case interest and enjoyment in reading, are responsible for their success levels. This discussion also reveals problems with using correlates of higher achievement within countries and jurisdictions for identifying strategies to improve achievement in our own countries.
OUT OF SCHOOL ENRICHMENT AND REMEDIATION
Other striking differences between the United States and Shanghai-China exist in out-of-school enrichment and remediation activities, with much greater proportions of Chinese students being involved in out-of-school enrichment and remediation. Previous research explored out-of-school learning experiences, in particular in Asian jurisdictions, as an important factor in performance in international assessments (Lee, 2008; Lee, Park, & Lee, 2009). The PISA 2009 data show that the percentage of students who do some form of enrichment by studying outside of their school lessons does not differ among the three first-world nations. But the rates are nearly three times as high in Turkey and almost five times as high in Shanghai-China. There are few differences between boys and girls within each jurisdiction (Figure 7).
Figure 7. Percentage of Students by Gender Reporting that they Attend Out-of-school Study Lessons
One form of enrichment consists of going to a school or public library to read for fun or to read magazines. The PISA 2009 results across the five jurisdictions show that there are no significant variations in the availability of libraries (only about 2.5% of the students report not having access). Yet across the jurisdictions, the percentage of students who go to the library to read for fun and to read magazines varies considerably. The results are similar to the preceding question about out-of-school lessons (Figure 8). In the present case, going to the library to read for fun or to read magazines is least frequent among German students and most frequent among Turkish and Shanghai-Chinese students, even though there are about the same number of libraries at school or in their neighborhood available to students in all five jurisdictions. This discussion provides additional evidence against the proposition that out of school enrichment activities are responsible for the performance gap between the United States and high-performing jurisdictions such as Shanghai-China and Canada.
Figure 8. Percentage of Students Reporting That They go to the Library to Read for Fun or to Read Magazines
The analyses in the preceding subsections suggest that the PISA 2009 results are variegated with regard to attitudes towards reading and out-of-school learning experience, rather than exhibiting clear differences that might explain the different performances among the five jurisdictions. Even though attitudes towards reading are some of the strongest predictors of reading performance in all five jurisdictions, these correlations are in no way informative about identifying strategies for improving reading. Students in jurisdictions with similar levels of interest in reading turn out to rank very differently in the reading comparison, which means that interest is insufficient for predicting reading achievement. In the five focus jurisdictions, students in the emerging economies are more likely to engage in enrichment and remedial classes than their counterparts in the G7 nations. The percentage is the highest for the highest performing jurisdiction, Shanghai-China, but lowest for the second-highest scoring jurisdiction. The percentage of students pursuing out-of-school studies follows the same pattern as that in going to the library to read for fun or to read magazines.
During the past four decades since the beginning of international assessments, researchers and policy makers have taken advantage of the freely available international assessment data sets to conduct analyses that explore relationships between student performance levels and student background characteristics, classroom practices, and school contexts. The goal typically is to identify the factors that contribute to success levels of the high-performing jurisdictions. These are good steps to take for learning from other jurisdictions' experiences. However, the correlational relationships we identify between many factors and the learning outcomes within countries do not lend themselves to making simple inferences about what practices and policies are individually responsible for improving education in our own jurisdictions.
The purpose of this study is to articulate a general framework for making inferences from international assessments and present cautions about these inferences. Our model includes a three-step sequence of inference: (a) identification of jurisdictions that have the best schooling outcomes and therefore best education systems; (b) determination of policies or practices responsible for success; and (c) emulation and implementation of policies and practices that would lead to improvement of educational outcomes in other jurisdictions. We exemplify limitations in this inferential sequence for some of the propositions that come with the first two of these steps.
The publication of the 2009 PISA results constituted a report card; a lot of shoulder patting (e.g., Finland; Domisch, 2009; Rautalin & Alasuutari, 2009) and soul searching has been and will be going on around the world, depending on how well the students of the respective nation have done. According to the PISA 2009 report, "Addressing the educational needs of such diverse populations and narrowing the gaps in student performance that have been observed remains a formidable challenge for all jurisdictions" (OECD, 2010, p. 13). Within jurisdictions, correlational research can point to useful directions that can be further explored in in-depth research to guide education policy and practice. However, correlational associations within jurisdictions can neither be used to make claims about effectiveness of education systems, policies, and strategies in such jurisdictions, nor can they be expected to translate to higher performance in other jurisdictions. There are limitations of cross-sectional international assessment data such as PISA for examining effectiveness of education systems across jurisdictions; for making comparisons in terms of the effects of education systems, it is necessary (although not sufficient) to have longitudinal data (Goldstein, 2004, p. 4). Moreover, it remains a persistent weakness of all the existing large-scale international comparative assessments that they make little effort to do this (p. 4). Should we look toward other jurisdictions to copy strategies and practices for the purpose of changing education in another jurisdiction? The presented analyses show that the variations in achievement are likely due to much more complex interrelations of cultural, societal, and educational factors. This means that taking this or that student variable and education practice in one context and trying to replicate it somewhere else may not result in the aspired-to learning outcomes from the original jurisdiction.
There are several issues related to research method, which we point out and discuss in the early parts of this paper. Thus, international comparisons such as the widely publicized ones using PISA 2009 face the fundamental problem of ensuring comparability of test scores and questionnaire data across jurisdictions with diverse educational systems and cultures. Therefore, establishing comparability of test scores and data across jurisdictions should be the first step in making international comparisons based on international assessment data such as PISA. Before researchers have established the existence of high levels of comparability, there is little use in attempting to make comparisons across nations or cultures.
In this paper, we conduct only a small number of the analyses necessary for establishing the full range of limitations for drawing implications for policy and practice from international tests. In particular, studies are required that investigate (a) whether the emulation of policies and practices from another jurisdiction actually leads to improvement of educational outcomes; and (b) what kinds of adjustments need to be made to accommodate the policies and practices of one jurisdiction to another with different culture, administrative structures, networks of social relations, and so on.
This study was supported by grants from the Social Sciences and Humanities Research Council of Canada (to Ercikan and Roth). All opinions are our own.
1. One only has to look at the different levels of coverage of the different sports across nations to understand that if an item were to make reference to sports it would be read differently and have different levels of salience in different countries and subcultures.
2. One think-aloud study with experts analyzing the equivalence of a multilingual national test reveals that French items were judged to be about 20%-25% longer than the corresponding English versions (Roth, Oliveri, Sandilands, Lyons-Thomas, & Ercikan, 2013).
3. These comparisons take standard errors presented in Table 1 into account.
AERA, APA, & NCME. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Allalouf, A. (2003). Revising translated differential item functioning items as a tool for improving cross-lingual assessment. Applied Measurement in Education, 16, 55-73.
Au, K. H. (1980). Participation structures in a reading lesson with Hawaiian children: Analysis of a culturally appropriate instructional event. Anthropology and Education Quarterly, 11, 91-115.
Braun, H. (2013). Prospects for the future: A framework and discussion of directions for the next generation of international large-scale assessments. In M. von Davier, E. Gonzalez, I. Kirsch, & K. Yamamoto (Eds.), The role of international large-scale assessments: Perspectives from technology, economy, and educational research (pp. 149-160). Princeton, NJ: Educational Testing Service.
Chiu, M. M., & Chow, B. W. Y. (2011). Classroom discipline across forty-one countries: School, economic, and cultural differences. Journal of Cross-Cultural Psychology, 42, 516-533.
Creemers, B. P. M., & Reezigt, G. J. (1999). The role of school and classroom climate in elementary school learning environments. In H. J. Freiberg (Ed.), School climate: Measuring, improving and sustaining healthy learning environments (pp. 31-48). London, UK: Falmer Press.
Derrida, J. (1996). Le monolinguisme de l'autre ou la prothèse d'origine [Monolingualism of the Other or The prosthesis of origin]. Paris, France: Galilée.
Dillon, S. (2010, December 7). Top test scores from Shanghai stun educators. New York Times. Retrieved from http://www.nytimes.com/2010/12/07/education/07education.html
Domisch, R. (2009). Keine Mythen, sondern fundierte Schulreformen [No myths, just well-founded school reforms: The learning success of Finnish pupils from the perspective of the Finnish National Board of Education]. Zeitschrift für Erziehungswissenschaft, 12, 597-615.
Eckert, P. (1989). Jocks and burnouts: Social categories and identity in the high school. New York: Teachers College Press.
Educational Policy and Data Center (EPDC). (2009). Pupil performance and age: A study of promotion, repetition, and dropout rates among pupils in four age groups in 35 developing countries. EPDC Working Paper No. WP-09-02. Retrieved from http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=18&ved=0CKoBEBYwEQ&url=http%3A%2F%2Fepdc.org%2Fpolicyanalysis%2Fstatic%2FEPDC%2520No.%252009-02%2520Pupil%2520performance%2520and%2520age.pdf&ei=MS0fT8GpLY2viQfWzeDqDQ&usg=AFQjCNFrSZFX2dPO41nGkGFqXCuCEfshrw&sig2=NQ8ackf0_yJJIaspT6m9AQ
Ercikan, K. (1998). Translation effects in international assessments. International Journal of Educational Research, 29, 543553.
Ercikan, K. (2006). Developments in assessment of student learning and achievement. In P. A. Alexander & P. H. Winne (Eds.), American Psychological Association, Division 15, Handbook of educational psychology (2nd ed., pp. 929953). Mahwah, NJ: Lawrence Erlbaum Associates.
Ercikan, K., & Koh, K. (2005). Construct comparability of the English and French versions of TIMSS. International Journal of Testing, 5, 23–35.
Ercikan, K., & McCreith, T. (2002). Effects of adaptations on comparability of test items and test scores. In D. Robitaille & A. Beaton (Eds.), Secondary analysis of the TIMSS results: A synthesis of current research (pp. 391–407). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Ercikan, K., & Roth, W.-M. (2006). What good is polarizing research into qualitative and quantitative? Educational Researcher, 35(5), 14–23.
Freiberg, H. J., Huzinec, C. A., & Templeton, S. M. (2009). Classroom management – A pathway to student achievement: A study of fourteen inner-city elementary schools. Elementary School Journal, 110, 63–80.
Gierl, M. J., & Khaliq, S. N. (2001). Identifying sources of differential item and bundle functioning on translated achievement tests. Journal of Educational Measurement, 38, 164–187.
Goldstein, H. (2004). International comparisons of student attainment: Some issues arising from the PISA study. Assessment in Education: Principles, Policy and Practice, 11, 319–330.
Grisay, A. (2003). Translation procedures in OECD/PISA 2000 international assessment. Language Testing, 20, 225–240.
Hambleton, R. K. (2005). Issues, designs, and technical guidelines for adapting tests into multiple languages and cultures. In R. K. Hambleton, P. F. Merenda, & C. D. Spielberger (Eds.), Adapting educational and psychological tests for cross-cultural assessment (pp. 3–38). Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., Merenda, P. F., & Spielberger, C. D. (Eds.). (2005). Adapting educational and psychological tests for cross-cultural assessment. Mahwah, NJ: Lawrence Erlbaum Associates.
Hambleton, R. K., & Patsula, L. (1999). Increasing the validity of adapted tests: Myths to be avoided and guidelines for improving test adaptation practices. Journal of Applied Testing Technology, 1, 1–30.
Hammer, K. (2010, December 8). How Canada is becoming outclassed in school. The Globe and Mail. Retrieved from http://www.theglobeandmail.com/news/national/how-canada-is-becoming-outclassed-in-school/article1829259/
International Test Commission (ITC). (2001). International test commission guidelines for test adaptation. London, UK: Author.
Lee, J. (2008). Missing link in international education studies: Can we compare the US with East Asian countries in the TIMSS? International Electronic Journal for Leadership in Learning, 3(18). Retrieved from http://iejll.synergiesprairies.ca/iejll/index.php/iejll/article/view/462
Lee, C.-J., Park, H.-J., & Lee, H. (2009). Shadow education systems. In G. Sykes, B. Schneider, & D. N. Plank (Eds.), Handbook of education policy research (pp. 901–919). New York: Routledge.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Ministry of Health and Education. (2004). Education sector strategy: Realizing vision 2020 policy and strategy. Thimphu, Bhutan: Author.
OECD. (2010). PISA 2009 results: What students know and can do. Student performance in reading, mathematics and science vol. 1. OECD Publishing.
OECD. (2011a). PISA 2009 technical report. OECD Publishing.
OECD. (2011b). Education at a Glance 2011: OECD indicators. OECD Publishing. Retrieved from http://dx.doi.org/10.1787/eag-2011-en
Oliveri, M., & Ercikan, K. (2011). Do different approaches to examining construct comparability lead to similar conclusions? Applied Measurement in Education, 24, 349–366.
Paine, S. L., & Schleicher, A. (2010). What the US can learn from the most successful education reform efforts. McGraw-Hill Research Foundation, Policy Paper: Lessons from PISA.
Perry, N. E., & Winne, P. H. (2006). Learning from learning kits: gStudy traces of students' self-regulated engagements with computerized content. Educational Psychology Review, 18(3), 211–228.
Ravitch, D. (2010, December 14). The real lessons of PISA. Education Week's Blogs. Available at http://blogs.edweek.org/edweek/Bridging-Differences/
Rautalin, M., & Alasuutari, P. (2009). The uses of the national PISA results by Finnish officials in central government. Journal of Educational Policy, 24, 539–556.
Roth, W.-M. (2010). Language, learning, context: Talking the talk. London, England: Routledge.
Roth, W.-M., & Harama, H. (2000). (Standard) English as second language: Tribulations of self. Journal of Curriculum Studies, 32, 757–775.
Roth, W.-M., Oliveri, M. E., Sandilands, D., Lyons-Thomas, J., & Ercikan, K. (2013). Investigating sources of differential item functioning using expert think-aloud protocols. International Journal of Science Education.
Roth, W.-M., Tobin, K., & Ritchie, S. (2008). Time and temporality as mediators of science learning. Science Education, 92, 115–140.
Shimizu, Y. (1999). Studying sample lessons rather than one excellent lesson: A Japanese perspective on the TIMSS videotape classroom study. Zentralblatt für Didaktik der Mathematik, 99, 190–194.
Simpson, J. (2010, December 10). Canada is not becoming outclassed. The Globe and Mail. Retrieved from http://www.theglobeandmail.com/news/opinions/opinion/canada-is-not-becoming-outclassed/article1831853/
Solano-Flores, G., Backhoff, E., & Contreras-Niño, L. A. (2009). Theory of test translation error. International Journal of Testing, 9, 78–91.
Statistics Canada. (2010). Trends in dropout rates and the labor market outcomes of young dropouts. Retrieved from http://www.statcan.gc.ca/pub/81-004-x/2010004/article/11339-eng.htm
Van Roekel, D. (2010, December 15). To raise PISA scores, we must support teachers. Education Week. Retrieved from http://www.edweek.org/ew/articles/2010/12/15/15vanroekel.h30.html
Verbeet, M. (2010, December 7). Bilanz eines Schock-Jahrzehnts [Accounting for a decade of shock]. Spiegel Online. Retrieved from http://www.spiegel.de/schulspiegel/wissen/0,1518,733310,00.html
Walker, T. (2010, December 7). PISA 2009: U.S. students in the middle of the pack. Retrieved from http://neatoday.org/2010/12/07/pisa2009/
Winne, P. H., Jamieson-Noel, D., & Muis, K. (2002). Methodological issues and advances in researching tactics, strategies, and self-regulated learning. Advances in Motivation and Achievement: New Directions in Measures and Methods, 12, 121–155.