The Flynn Effect and the Demography of Schooling

by Geraldine McDonald - 2010

Background/Context: Although the Flynn effect has been recognized for 60 years and a wide range of factors has been suggested, there is still no agreement on cause. The effect is generally interpreted as a phenomenon involving changes in mental functioning as a consequence of various forms of environmental influence.

Purpose: The purpose of the account is to argue that at least part of the change in intergenerational IQ scores is an artifact of the age-based scoring system of IQ tests, together with historical changes in age-grade patterns in school systems.

Research Design: This is a logical argument using secondary analysis to illustrate historical change in the demographic patterns at the level of the classroom, together with a survey of psychometric documents and accounts of constructing Otis-type tests to explain the role of age in calculating an IQ. A review of research focuses on whether the IQ measures age or grade and whether age change in a school population can account for an IQ change.

Conclusions/Recommendations: It is concluded that because the age-based scoring systems of IQ tests interact with generational changes of age by grade, a Flynn effect should not be interpreted as a massive intergenerational rise in mental functioning unless the two generations match age in grade in the case of school populations, and highest grades achieved in the case of adults.

Standardized tests in general—whether individual or group, timed or untimed, verbal, arithmetical, or figural, for children or adults—have all, in the past, become outdated. By the end of the 1940s, it had been established on both sides of the Atlantic that a rise in IQ appeared over time, whether those tested were adults (Tuddenham, 1948) or children (Scottish Council for Research in Education, 1949).

The generational rise in IQ scores found in many countries (Flynn, 1984, 1987, 1988, 1998) was called “the Flynn effect” by Herrnstein and Murray (1994), and, despite having existed for many years, it retains the status of a mystery. Grissmer, Flanagan, and Williamson (1998), for example, stated that “no one has been able to explain the gain in IQ scores” (p. 194), and Neisser (1998) stated, “A rise in test scores is surely driven by some kinds of environmental change, but what are they?” (p. 17). Nor does Flynn (2006) offer any explanation.

It is assumed by Flynn and commentators that the IQ has a common meaning, so that, irrespective of type of test, it refers to a mental entity identifiable through performance on varied test items. Although items may differ according to the type of intelligence test, there is now a widespread method of scoring: a deviation IQ based on the normal curve, with a standard deviation of 15 and raw scores adapted for units of age or obtained from a narrow age band. Test developers establish the validity of new tests by comparing results with existing tests. Group tests are compared with tests that are individually administered, such as the Stanford-Binet Intelligence Scale; figural tests are compared with tests using written language; and all psychological tests in English have, for nearly 70 years, been evaluated in The Mental Measurements Yearbook produced by the Buros Institute. These procedures tend to perpetuate measures based on earlier age distributions across grade.
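The deviation scoring just described can be illustrated with a minimal sketch. It assumes the conventional mean of 100 alongside the standard deviation of 15 mentioned above. The function name and the age-band norms (a raw-score mean of 30 and a standard deviation of 8) are hypothetical values chosen only for illustration and are not taken from any test manual.

```python
def deviation_iq(raw_score, band_mean, band_sd, mean_iq=100.0, sd_iq=15.0):
    """Convert a raw score to a deviation IQ relative to one age band's norms.

    band_mean and band_sd are the raw-score mean and standard deviation of
    the age band the pupil is compared with (hypothetical values here).
    """
    if band_sd <= 0:
        raise ValueError("band_sd must be positive")
    z = (raw_score - band_mean) / band_sd  # standing within the age band
    return mean_iq + sd_iq * z


# Hypothetical norms for one age band: mean raw score 30, SD 8.
print(deviation_iq(30, 30.0, 8.0))  # a score at the band mean
print(deviation_iq(38, 30.0, 8.0))  # a score one SD above the band mean
```

Because the norms belong to an age band rather than to a grade, the same raw score yields a different IQ when the pupil is compared with an older or a younger band.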

Thorndike (1973) claimed that the Stanford-Binet Intelligence Scale had been unchanged from the mid-1930s to the 1970s, with “any shifts being attributable . . . to differences in the characteristics of the subjects tested” (p. 353). Such a view is maintained in the popular acceptance of the idea, triggered by Flynn’s findings, that children today are smarter than their parents (Brown, 2002). Possible explanations for the rise have been put forward over at least 60 years (e.g., Angoff, 1988; Elley, 1969; Neisser, 1998; Scottish Council for Research in Education, 1949; Tuddenham, 1948), and in a comprehensive discussion of the rise published by the American Psychological Association (Neisser, 1998), most commentators assumed mental changes in the subjects tested.

Sir Godfrey Thomson’s early suggestions for causes included the environmental effects of the wireless and “puzzle corners” in newspapers (Scottish Council for Research in Education, 1949). Increase in amount of schooling, test sophistication, better educated homes, and new technologies are explanations that match our ideas about sources of learning and our knowledge that the IQ is closely correlated with educational performance. There is also support for explanations such as better nutrition, increases in population density, urbanization, and other factors that have increased over time. Arguments countering the idea of a rise, based on trends such as decrease in size of middle-class families, assume that the IQ measures, at least to some extent, hereditary endowment.

It is suggested in this present account that the Flynn effect may not necessarily be associated with changes in human brains or ways of thinking, but that it is connected to the practice of converting raw test scores on the basis of age categories and hence with historical changes in age by grade, which affect the comparability of samples. Evidence is taken from the raw score conversion charts in intelligence test manuals, records of historical changes in school populations, studies of whether intelligence tests measure age or grade effects, and scores from the standardization of one intelligence test at two points in time. Illustrations are drawn from school populations and tests of a self-administered kind. The reports of the Flynn effect generally collect differences in raw scores, which are then converted to an IQ. Tests vary in the number of items required to establish a measure of 1 year.

Since the invention of the IQ, there has been an increase in the provision of schooling. Amount of schooling, frequently suggested as accounting for the Flynn effect, is imprecise because it may include grade repetition or acceleration, and varying age of school entry (Cahan & Cohen, 1989). For populations of both children and adults, a better measure is grade level achieved.

The question to be explored is: To what extent can the Flynn effect, independently of any other factor, be explained by a scoring system that picks up changes in age at grade level? The argument follows a chain of issues. It looks first at historical changes in age in relation to grade. The demography of schooling is then explained, followed by discussion of the role of age in calculating an IQ, the relationship between school grade and scores, and the association of grade with IQ scores. This is followed by a review of empirical studies of age versus grade in relation to IQ. An account then follows of a practical test of the grade-level hypothesis. The argument is logical and not statistical.


Over the last 100 years, there has been a lowering of children’s ages for grade in education systems in industrialized countries. As ages have fallen, IQ scores have risen.

Flynn (1987) has focused particularly on individually administered and untimed tests such as the Wechsler tests, and figural tests such as the Raven’s Progressive Matrices, but he has also reported, for several countries, rising scores obtained from self-administered IQ tests for schoolchildren.

In the decade from 1920 to 1930, Arthur Otis, Lewis Terman, E. L. Thorndike, and others developed pencil-and-paper group tests to determine the intelligence of schoolchildren (Thorndike, Hagen, & Sattler, 1986). These spread to many countries, where they were standardized for local conditions or became the basis for locally produced tests. Like the 1908 Binet-Simon scale for individuals (Binet & Simon, 1916), group tests derived a pupil’s score by comparing the raw score gained with the median score for those within the same age category irrespective of the grade in which the pupil was placed. This was originally expressed as a ratio of mental over chronological age.
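The ratio mentioned above can be written as a short sketch. The multiplication by 100 is the usual convention for the ratio IQ rather than something stated in this account, and the function name and the use of months as the unit are assumptions made for illustration.

```python
def ratio_iq(mental_age_months, chronological_age_months):
    """Ratio IQ: mental age over chronological age, scaled by 100."""
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * mental_age_months / chronological_age_months


# A hypothetical pupil aged 10 years (120 months) whose raw score matches
# the median for 11-year-olds (mental age 132 months).
print(ratio_iq(132, 120))

# The same mental age set against an older chronological age gives a lower IQ.
print(ratio_iq(132, 144))
```

The second call shows the point that matters for the argument: with performance held fixed, an older pupil receives a lower quotient.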

The Otis Self-Administered Intermediate Test of Mental Ability, designed for elementary school pupils and first published in 1921, contained 75 items to be attempted in 30 minutes. This test and similar ones survive today in the form of other multiple-choice, self-administered, timed tests. Descendants include the eighth version of the Otis-Lennon School Ability Test (OLSAT), described by a reviewer of the first version as measuring “verbal-educational g” (Grotelueschen, 1969, p. 113).

Age-standardized tests for school populations became the most widely used of all intelligence tests providing IQs for the placement of children in groups, grades, or streams in schools and to compare the intelligence of populations according to race (Shuey, 1966). Age-based scoring systems are typical of IQ tests, although intelligence may now have been removed from test titles in favor of terms such as scholastic aptitude.


The systems of mass education developed in the 19th century were based on age and grade. Mandated age of entry, annual promotion, and age at which leaving was permitted created a system in which it was expected that at a certain age, a pupil should be at a particular grade. This age-grade relationship became a standard for the efficiency of an education system and for the classification of individual pupils. The smooth flow of pupils up the ladder of schooling was, however, impeded by standards of achievement required before a child could proceed to the next grade. Standards might be set at several grade levels; Ayres (1909), commenting on the results of such policies in the United States, wrote, “The term ‘retarded’ is applied to the child who is below the proper grade for his age. Our schools are crowded with such children” (p. 50).

In 1904, the rate of retardation in New York City was 39%, and the superintendent of the time studied the situation to identify its causes (Volkmor & Noble, 1914). As other cities began to investigate the problem, it became apparent that all had substantial rates of retardation, that there were variations from city to city, and that children were frequently retarded by several grades.

Statistics collected in the United States show that between a third and a half of the school children fail to progress through the grades at the normal rate; that from 10 to 16 percent are retarded by two years or more; that from 5 to 8 percent are retarded by at least three years. (Terman, 1919, p. 3)

The number of grades children were below that expected for their age served to identify the “retardates” who were believed to require special provision. To the extent that the reasons for the sluggishness concentrated on the children rather than the education system, retardation became a psychological issue, leading to a search for causes within the characteristics of the children and ultimately for methods of early identification of these characteristics through tests of intelligence.

Student flows through systems of schooling are not the consequence of better teaching, greater cultural complexity, improved nutrition, or enlightened parenting practices. Patterns of age in grade come about from factors such as population pressure on the capacity of schools, school attendance laws, economic conditions, social attitudes toward attendance, regulations on class size, ratio of teacher to pupils, and the “room in the next class” effect—that is, the need to balance pupil numbers moving out of one grade with the space available in the next (Frederiksen, 1983; Harvey, 1938; McDonald, 1993). The greatest influences on student flows over the first half of the 20th century have been a desire to reduce the costs caused by repeaters, the opening up of secondary schooling, and the need to ensure that pupils would enter high school before they were eligible to leave. Policies that supported social promotion came later. These forces resulted in a progressive decline in age at grade level (Schwager, Mitchell, Mitchell, & Hecht, 1992). Between 1918 and 1952, for example, average ages in the United States fell as much as 13 months in the middle grades (Lennon & Mitchell, 1955).

Evidence for demographic change can be found in age by grade tables that record the ages of children in an education system on one axis and their grades on the other. Cooke (1931), an enthusiast for such records, complained that historical sequences of age-grade distributions were not easy to find. This remains true. However, Harvey (1939a) noted that “Virginia . . . has published the longest continuous series presented in comprehensive form” (p. 751). Virginia’s record continues today. Comparisons across jurisdictions are complicated by different ages at which leaving is permitted, fluctuations in the birth rate, whether ages are recorded as means or medians, and the time within the school year when the statistics are collected.

Figure 1 shows the percentage distribution of 10-year-olds in four education systems, collected at two points in time, across all the grades in which they were located. The number of 10-year-olds varies in each system. In 1913 in Fredonia, New York, there were only 193 ten-year-olds in school. In the London Borough in 1915, there were 3,319. In September 2002, in New Brunswick, Canada, there were 9,305, and in Virginia in 2002, there were 94,380.

Figure 1. Distribution of 10-year-olds across grades in four education systems, 1913–2002


Irrespective of the difference in size of each of the four systems, the distributions shown in Figure 1 are variations on the same template. Each system has a peak of 10-year-olds in one grade. The greatest number of 10-year-olds in the systems recorded in 1913 and 1915 were in Grade 4. The peak in the two recent systems is at Grade 5. It can be seen that over the years from 1913 to 2002, there was a reduction in both the “tail” of the distributions and in the much smaller proportion of 10-year-olds accelerated in advance of their age. Had this reduction of age in grade been the consequence of changes in the intelligence of the children, then surely acceleration would have increased rather than decreased. The lowering of age in grade provides “evidence of changes in administrative practice, resulting in the more rapid acceleration of pupils in the early grades where they would congest” (Harvey, 1938, p. 659).

School systems distribute pupils across grades through the processes of selection for retardation, normal progress, and acceleration. When Cooke (1931) collected surveys of public school systems in the United States to determine the extent of retardation, he located surveys covering more than 2 million school pupils over a span of 20 years for whom comparable age-grade data were found. Figure 2 shows the result of selection for retardation and acceleration and its changes in 5-year spans from 1908 to 1928.

Figure 2. Retardation, normality, and acceleration in public school systems by 5-year periods from 1908 to 1928


Note. Based on Cooke (1931, Table 4, p. 264).

The administrative practices of acceleration and retardation work toward homogenizing scholastic ability at grade level, although selection is at least in part influenced by the issue of fairness: that it would be unfair to accelerate younger children while older children remain behind, or to retain older children while younger ones progress.


When Alfred Binet and Theodore Simon wanted to assign a value to the intelligence of children, they arranged the test items in their third test, published in 1908, according to small intervals of chronological age. Terman greeted this innovation in the following manner: “Why should a device so simple have waited so long for a discoverer? We do not know. It is of a class with many other unaccountable mysteries in the development of scientific method” (Terman, 1919, p. 41).

Thelen and Adolph (1992) commented that “it is probably hard to estimate how thoroughly we have internalized the idea of age-appropriate activities as an index of biological functioning” (p. 374).

“The WAIS [Wechsler Adult Intelligence Scale] Intelligence Quotient (IQ) . . . is obtained from a direct comparison of the subject’s test results with those of persons in his chronological age group. This is perhaps the most meaningful item of information with respect to the subject’s mental ability” (Wechsler, 1955, p. 2).

“Historically, all the basic concepts in intelligence measurement (mental age, intelligence quotient, and the modern concept of deviation-IQ) have been defined solely on the basis of age, irrespective of schooling” (Cahan & Cohen, 1989, p. 1247).

“Test scores are assumed to increase gradually with age, a small increase in age leading to a small increase in score” (de Lemos, 1989, p. 21). The use of age appears reasonable in light of the early Binet-Simon scale because it was designed to test children from a very young age, and it used everyday test items. The use of age independent of grade became much less plausible once testing expanded to older children in school.

Although Francis Galton (1892/1869) had no measures of intelligence to put to the test, he believed that intelligence was inherited and hence would be distributed according to the laws of probability in the form of a Gaussian, normal, or bell curve, in the same manner as the physical measurements of men in the army recorded earlier by Adolphe Quetelet. Despite the lack of empirical evidence for the normal distribution in test results, there was widespread belief that it mirrored the distribution of mental capacity. Cyril Burt (1919), for example, worked out a formula based on the normal curve and its standard deviation for the proportions of schoolchildren of any one age who should be retarded, in the normal grade, or accelerated, and he illustrated the difference between the ability of men and women with two perfectly symmetrical normal curves differing only in the size of their standard deviations (Burt & Moore, 1912).

Despite the assumption that the IQ is distributed normally, it is almost impossible to find such distributions of raw scores from either intelligence or achievement tests (see critique by Micceri, 1989). Raw test scores are typically asymmetrical, with a disproportionately long tail at the over-age and lower grade end in much the same manner as the distribution of age across grade (McDonald, 1989).

The 15-point standard deviation is an arbitrary value, the normal curve providing a framework for converting raw scores to an IQ and for comparing results from tests of different kinds. Although the curve is one means of ranking individuals and may be replaced by another scale such as a percentile rank, the symmetrical Gaussian curve appears to have been accepted as an image of reality, as suggested by its appearance on the cover of more than one book on the topic of intelligence (Herrnstein & Murray, 1994; Neisser, 1998; Olssen, 1987).


There are various forms of evidence that point to the role of grade rather than chronological age in determining the IQ of samples of schoolchildren. The first signs consist of what can be considered “straws in the wind.” Early test developers found that their curves differed according to sex.

In 1936, McIntyre standardized the Otis Self-Administering Intermediate Test of Mental Ability, Form A for Australia. He referred to the greater grade dispersion of boys than girls and to girls scoring higher than boys, and he offered as a possibility that “results have been affected by the grade placement of the pupil” (McIntyre, 1938, p. 72). Redmond and Davies (1940), who standardized the same Otis Intermediate Test for New Zealand, found that “the average age of the girls in any [grade] is lower than that of the boys” (p. 71) and that “the girls’ scores are higher than the boys” (p. 69), “the difference being equivalent to 1 IQ point” (p. 105).

Such was the belief in the validity of the IQ that when the sex difference in age at grade level was found to accompany difference in scores, the test developers found reasons for ignoring the association. The difference was attributed to test bias, or simply that the girls conformed readily to school and the boys did not (McIntyre, 1938; Redmond & Davies, 1940).

There is evidence of the general role of schooling. Vroon (1980) pointed out that IQ becomes reasonably predictable only from school age on. There are studies that show that initial achievement differences in schooling associated with small intervals of chronological age have disappeared within 4 or 5 years of schooling (Holmes, 1989; McAfee, 1981; Shepard & Smith, 1985). Angelini, Alves, Custodio, and Duarte (1989) tested Brazilian children on the Colored Progressive Matrices and found that a 1988 sample performed more poorly than a sample tested in 1982, the difference being that the 1988 sample was based on a cohort that included children who were not attending school. O’Leary (2001) reported results from two international tests of science administered in Ireland, which showed the effect of grade level on performance. In one, which compared student performance on the basis of age, Ireland achieved a bottom rank. In the other, which compared student performance by grade, Ireland appeared near the middle of the ranking. O’Leary explained that in Ireland, pupils were at a lower grade for age than the students from other education systems with whom they were being compared.

During their standardization of the Otis Intermediate Test in New Zealand, Redmond and Davies (1940) found that although “a difference of one year between [grades] corresponds exactly with a year difference in mental age” (p. 65), the mean score for a grade was higher than the mean score for the age normal to that grade (Redmond & Davies, pp. 112–113; see also McIntyre, 1938, p. 57).


“In the course of norming a series of group tests of general ability, I have consistently found that there is no increase in mean score according to age within grade” (de Lemos, 1989, p. 21).

De Lemos was concerned that when age norms rather than grade norms were used, different standards were being applied to older and younger children within the same grade. Older children within a grade received lower IQs for the same raw score than children as little as 3 months younger. “Norms based on age assume that no differences are expected between students of the same age who are at different grade levels, or students of the same age tested at different times of the year” (de Lemos, 1989, p. 32).

If the number of items successfully answered really rose through small maturational increments, a raw score rise would appear within a sample from which the children who had been either retarded or accelerated had been removed. De Lemos graphed the raw scores, obtained from the standardization of a number of overseas tests for Australia, of those pupils whose ages matched their grade. In every case, she found not a rise with age, but a rise between grades. What appeared was a steplike pattern reflecting grade levels as a series of treads, and the pattern from grade to grade as a series of risers. Raw scores obtained from age-standardized group tests for schoolchildren administered at one point in the school year showed that for children whose age was normal for grade, grade level was indeed reflected in raw scores, and further, that because of the missing children at the extremes of age normal for grade, the youngest 3-month group, rather than scoring lowest, generally scored a little higher than the oldest group of the normal age in the same grade. This is the reverse of the patterns of scores rising with age provided in the conversion charts of intelligence test manuals.

It is possible that figurative tests differ from Otis-type tests in their association with age. However, when pupils attempted a timed version of the Raven’s Standard Progressive Matrices, raw scores revealed the steplike pattern similar to that of timed self-administered tests (de Lemos, 1990, 1994), suggesting that timing has its own effects on scores.

In New Zealand, a study carried out by McDonald (1992) was stimulated by a concern that because the indigenous Māori pupils were disproportionately retained in the beginning classes, they would be penalized on the basis of either age (older than others in the class) or grade (in a class with a lower mean score than the classes of other pupils of the same age). McDonald studied 3,000 raw scores produced during the 1980 standardization of an Otis-type test called the Test of Scholastic Abilities (TOSCA; Reid, Jackson, Gilmore, & Croft, 1981). This produced the same patterns as the analysis by de Lemos. No score rise appeared within age groups normal for the grade levels for which the test was intended.

In 1987, Cahan and Cohen (1989) addressed the question of the effects of age versus grade using results from 11,000 children at Grades 4, 5, and 6 in Jerusalem’s Hebrew-language state-controlled elementary schools. The pupils were tested with a battery of items drawn from well-known group tests of general ability incorporating a range of modalities. The question was explored with respect to each type of item in a design that separated out age from grade to study their respective contributions to raw scores. Results pointed to “schooling—rather than to other age-related factors—as the major factor underlying the increase of intelligence test scores as a function of age” (Cahan & Cohen, p. 1245). The authors did find a greater contribution of age to nonverbal items. Evidence of this kind strongly suggests that grade level should be taken into account in attempts to explain the rising IQ. These studies show the association between raw scores and grade but do not show a pattern of scores that would justify the age norms provided in conversion charts. It seems possible that age norms based on small age intervals have been inherited from earlier times when they were calculated on children of one age but at different grades, as shown in Figure 1, and have simply been reproduced over the years.

Figure 3 shows how the Otis Intermediate Test of Mental Ability converted raw scores according to chronological age. Raw scores, ages, and IQs have been read off the conversion table provided for the Otis Intermediate Test of Mental Ability (New Zealand Council for Educational Research [NZCER], 1969). Although they were all at the same grade, older children were required to pass more items than younger children to get the same IQ.

Figure 3. Raw score required at each age from 9.6 to 11.0 to gain an IQ of 100


Note. Based on conversion table in Otis Intermediate Test (NZCER, 1969).

Figure 3 records ages in 3-month categories, from 9.6 to 11.0, representing the age distribution at grade level at the time the test was developed. To have an IQ of 100, children in the youngest category required a raw score of only 23. The raw score required for IQ 100 rises with increments of age, and children in the oldest group needed to get 38 items correct. A calculation for the same age range was then based on a similar test that provides only a percentile rank (Reid et al., 1981). The same age conversion effect emerged. A pupil aged 9.9 with a raw score of 35 would get a percentile rank of 78, whereas a pupil aged 11 with the same raw score would get a percentile rank of 47.
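The age conversion effect can be sketched in a few lines. Only the two endpoints come from the account above (a raw score of 23 for the youngest category and 38 for the oldest). The sketch reads 9.6 and 11.0 as 9 years 6 months and 11 years 0 months, and it assumes a straight line between the endpoints; both are assumptions for illustration, since the intermediate values in the conversion table are not reported here. The function names are hypothetical.

```python
# Ages in months, ASSUMING "9.6" means 9 years 6 months and "11.0" means 11 years.
YOUNGEST_MONTHS, OLDEST_MONTHS = 114, 132
RAW_AT_YOUNGEST, RAW_AT_OLDEST = 23, 38  # raw scores reported for IQ 100


def raw_needed_for_iq_100(age_months):
    """Raw score needed for IQ 100, ASSUMING linear change between endpoints."""
    if not YOUNGEST_MONTHS <= age_months <= OLDEST_MONTHS:
        raise ValueError("age outside the range covered by the sketch")
    span = OLDEST_MONTHS - YOUNGEST_MONTHS
    rise = RAW_AT_OLDEST - RAW_AT_YOUNGEST
    return RAW_AT_YOUNGEST + rise * (age_months - YOUNGEST_MONTHS) / span


def margin_over_iq_100(raw_score, age_months):
    """How far a raw score sits above (+) or below (-) the IQ 100 threshold."""
    return raw_score - raw_needed_for_iq_100(age_months)


# The same raw score of 30, judged against each 3-month age category.
for age in range(YOUNGEST_MONTHS, OLDEST_MONTHS + 1, 3):
    print(age, raw_needed_for_iq_100(age), margin_over_iq_100(30, age))
```

Under these assumptions, a raw score of 30 sits above the IQ 100 threshold for the youngest pupils in the grade and below it for the oldest, although all of them are in the same class.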


This section looks at the extent to which demographic change can account for an IQ rise. The U.S. version of the Otis Self-Administering Intermediate Test of Mental Ability was used in New Zealand in the 1920s. Form A was standardized in 1936 and renormed in 1968. The gains in intelligence over 32 years on this Otis test were claimed by the author of the 1968 standardization (Elley, 1969) and by Flynn (1988) to show changes in mental ability equal to an increase of 1 year of mental growth. The gain had been “approximately seven points of ‘I.Q.’ at each age level” (Elley, p. 146). Flynn recalculated the figure as 8 points.

McDonald (1998) asked whether demographic change within a system of schooling could account for 7 points of Otis IQ. Grade-level raw score means had been reported for the first standardization of the Otis test (Redmond & Davies, 1940), and the means from the 1968 administration came from Elley (1969). Over the years, the proportions of age by grade had indeed altered, and the fall in average age across grade levels was about 7 months. Of the age samples from 9 to 14 years, the 10-year-old sample was the most representative. The proportion of 10-year-olds in each age category at each grade level in the two years 1936 and 1968 was established by using the official age-by-grade figures for all New Zealand schools in the relevant years. By 1968, there was a greater proportion of 10-year-olds at the appropriate grade level than in 1936, and the proportions were reduced at lower grades. The grade-level mean scores established in 1936 were then assigned according to the proportions of 10-year-olds at the different grade levels in 1968. Calculations showed that the age change at grade level would “have accounted for about three-quarters of the gain in scores for 10-year-olds in the classes in which they were most numerous and would have accounted for all the gain if the sample had been 10-year-olds at Standard IV [Grade 5]” (McDonald, 1998, p. 230).
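The reweighting step described above, in which the 1936 grade-level means are held fixed and only the distribution of 10-year-olds across grades is changed, can be sketched as follows. Every number in the sketch is hypothetical and chosen for easy arithmetic; the actual means and proportions are those reported by Redmond and Davies (1940), Elley (1969), and the official age-by-grade figures, none of which are reproduced here. The function name is likewise an assumption.

```python
def expected_mean(grade_means, proportions):
    """Mean raw score for an age group, weighting fixed grade-level means
    by the share of that age group found in each grade."""
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return sum(grade_means[grade] * share for grade, share in proportions.items())


# HYPOTHETICAL grade-level mean raw scores from the earlier standardization.
grade_means_1936 = {3: 20.0, 4: 30.0, 5: 40.0}

# HYPOTHETICAL shares of 10-year-olds in each grade at the two dates:
# by the later date more of them are in the higher grade.
shares_1936 = {3: 0.3, 4: 0.5, 5: 0.2}
shares_1968 = {3: 0.1, 4: 0.4, 5: 0.5}

mean_then = expected_mean(grade_means_1936, shares_1936)
mean_now = expected_mean(grade_means_1936, shares_1968)

# Gain in mean raw score produced by the change in age-by-grade alone.
print(mean_then, mean_now, mean_now - mean_then)
```

The design choice is the one used in the study: because the grade means are never changed, any difference between the two results can come only from the shift in where 10-year-olds are placed, not from any change in what pupils at a given grade can do.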


The Flynn effect is said to be “adding about 3 IQ points per decade” (Neisser, 1998, p. 4), which suggests a regular increase. However, age change at grade level has not been regular over the last century, or even for the different age cohorts at each grade level. Harvey (1939a) showed that in Virginia Public Schools, over a period of 12 years, median age at Grade 4 fell by 8 months, at Grade 5 by 9 months, and at Grade 6 by 10 months, as Table 1 shows.

Table 1. Age Medians at Grades 4, 5, and 6 in Virginia Public Schools

Median Age in Grade (cell values missing from this copy)

Year         Grade 4   Grade 5   Grade 6
1924–1925
1930–1931
1936–1937

Note. Figures from Harvey, 1939a, Table 1, p. 752.

If the grade-level hypothesis is correct, IQ scores should have varied over time in Virginia. On the basis of increasing access to the higher levels of schooling, the rise should eventually disappear, although the access of all pupils to noncompulsory levels of schooling has implications for the nature of samples. Flynn has indeed suggested that the score rise may not continue and has reported no rise on the WAIS in a recent Dutch sample (Flynn, 2006); in addition, a decline has been reported in Denmark (Teasdale & Owen, 2005), attributed by the authors to a decline in students aged 16–18 years entering advanced-level programs.

Like the Flynn effect, mystery surrounds the Black–White 15-point differential. “If group differences in test performance do not result from the simple forms of bias . . . what is responsible for them?” (American Psychological Association, 1995). Test manuals show that a test that converts a raw score according to age will be sensitive to age difference at grade level, whether between populations over time, or at any one time between groups based on categories such as sex or race. In Virginia Public Schools in 1936–1937, the median age of the Black pupils in urban schools at Grade 5 was a year older than that of the White children in the same grade, and the difference in rural schools was even greater (Harvey, 1939b). A population difference of more than a year in age at grade level would amount to a difference of about 7 Otis IQ points—insufficient to account for a gap of 1 standard deviation, but sufficient to make a difference to its size.

Figure 4 shows age at Grade 5 in Virginia schools in 1996–1997 and the position 7 years later.

Figure 4. Comparison of age in years at Grade 5, measured at end of school years 1996–1997 and 2002–2003 in Virginia Public Schools


If age in grade increases but IQ does not fall, this would provide evidence for some factor independent of the conversion of raw scores for small categories of age. Figure 4 shows that pupils in Virginia Public Schools have recently become older for grade. There is a smaller percentage of 10-year-olds in Grade 5 in 2002–2003 than in 1996–1997—60% rather than 67%—and a greater percentage of overage children. It can be asked whether IQ continued to rise.

At least some of the suggested explanations may have their own connection to age by grade. For example, the association between IQ and socioeconomic status is complicated by the fact that not only do disadvantaged children generally score relatively poorly when tested, but teachers may also retain them in the belief that retention will give them time to consolidate their learning. There will also be effects on IQ levels if achievement gates serve as barriers to progress, and intervention programs, designed to improve the performance of disadvantaged children, may create a group that fails the intervention (Hiebert, 1994).

Test sophistication as a cause of rise in IQ has long been acknowledged by test developers. According to Flynn, the role of test sophistication has been relatively modest since 1947, but he does not mention coaching, which was investigated early in the history of group testing. Gilmore (1927) demonstrated that although familiarity with the test meant that scores rose when young adults took the same Otis test a second time, a group that had also been coached made even greater gains.

A reversal of a score rise is likely to be counteracted by coaching, a process that works on improving answers to the test items and not on age–grade relationships. If schools face penalties for their score levels on any kind of test, they will coach pupils in the style of the test and the nature of the test items. According to Cannell (2006), publishers are likely to assist schools by providing model tests for pupil practice. If entry to a gifted and talented program involves taking a test such as the OLSAT, parents, hoping to get a child accepted, can find test models and advice on the Internet.


Irrespective of the scale used, at least part of the Flynn effect appears to be connected to a system of scoring that compares a raw score with an age standard in order to assign an IQ. Demographic change has meant that a 10-year-old in Fredonia today is likely to be in Grade 5, an increase of one grade from the position in 1913 (see Figure 1). That should contribute about 7 points of Otis IQ (Elley, 1969). If the comparison is made by age “then and now,” 10-year-olds will be on a higher tread of the de Lemos ladder of schooling, with a higher mean grade raw score. At grade level, a fall in age would produce a greater number of children at the younger end of the spread of ages. Their ages would work to raise their scores relative to older children at the same grade on conversion to an IQ (see Figure 3).
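The mechanism described above, in which a raw score is compared with an age standard, can be sketched as follows. This is a minimal illustration and not the Otis procedure: the norm constants, the class age profiles, and the function names are all invented, and the conversion is assumed to be linear. The only figure taken from the text is the slope of about 7 IQ points per year.

```python
from typing import Sequence

# Hypothetical age norm: every number here is invented for illustration,
# except the slope of about 7 IQ points per year taken from the text.
NORM_AGE = 10.0           # reference age in years (assumed)
NORM_RAW_AT_AGE = 30.0    # raw score expected at the reference age (assumed)
RAW_PER_YEAR = 6.0        # growth in expected raw score per year (assumed)
IQ_POINTS_PER_YEAR = 7.0  # about 7 Otis IQ points per year (from the text)


def age_equivalent(raw_score: float) -> float:
    """Age at which the raw score equals the hypothetical age norm."""
    return NORM_AGE + (raw_score - NORM_RAW_AT_AGE) / RAW_PER_YEAR


def to_iq(raw_score: float, age_years: float) -> float:
    """Convert a raw score to an IQ by comparing it with the age standard."""
    return 100.0 + IQ_POINTS_PER_YEAR * (age_equivalent(raw_score) - age_years)


def mean_iq(raw_score: float, ages: Sequence[float]) -> float:
    """Mean IQ of a class in which every pupil obtains the same raw score."""
    return sum(to_iq(raw_score, age) for age in ages) / len(ages)


# Two Grade 5 classes with identical raw scores but different age profiles.
older_class = [10.0, 10.5, 11.0, 11.5, 12.0]   # more overage pupils
younger_class = [9.5, 10.0, 10.0, 10.5, 10.5]  # fewer overage pupils

print(mean_iq(30.0, older_class))
print(mean_iq(30.0, younger_class))
```

With the raw score held constant, the younger class receives the higher mean IQ purely through the conversion for age.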

Most of the evidence presented by Flynn has been based on adult IQs. There is a connection between age by grade and adult scores in that the historical lowering of age at grade level has been accompanied by an increase in final grade levels achieved.

The age-grade distribution of 10-year-olds within a school system is not an isolated factor, but part of the student flow. In the absence of barriers to promotion, students younger for grade stay longer in school and reach higher grades than earlier cohorts. The evidence suggests that the Flynn effect should not be interpreted as a massive intergenerational change in mental functioning unless, in the case of school populations, the samples tested have a match of age in grade and, in the case of adults, a match of highest grades achieved.

It has been argued that the Flynn effect is affected by a scoring system that measures grade rather than age. However, questions remain about the extent to which demographic change and scoring for age can explain all instances of the Flynn effect. The illustrations provided in this account, based on a narrow range of age, a narrow range of education jurisdictions, and Otis-type tests, cannot answer this question. Future investigation should include individual, figural, and nonverbal tests used to test adult populations of the kind for which Flynn provided evidence of substantial increase in IQ.


I would like to thank Jacqueline Swansinger for information about the Fredonia table, the New Brunswick Department of Education for assistance in locating the Web site for their table, Marion de Lemos, Joanna Higgins, Caroline McDonald, and Fay and John Panckhurst for commenting on drafts of this paper, and anonymous reviewers for helpful advice.


1. The figures for Fredonia, New York, were retrieved July 28, 2003, from http://www.Fredonia.edu, where they had been posted as a student resource for a historical study. The table had been found in the local library. It has since been removed from the World Wide Web. The figures for the London Borough are in Burt (1919, p. 22). The figures for New Brunswick, Canada, were found on the World Wide Web with assistance from the New Brunswick Department of Education. The figures for Virginia are from the Virginia Department of Education’s enrollment table for 2002.

2. The figures from the Virginia Department of Education were taken from its annual enrollment tables showing all original entry pupils by age and grade, end-of-year membership, and number of pupils promoted and retained.


Angelini, A. L., Alves, I. C., Custodio, E. M., & Duarte, W. F. (1989). The São Paulo norms of J. Raven's Coloured Progressive Matrices. Psychological Test Bulletin, 2(2), 46–49.

Angoff, W. H. (1988). The nature-nurture debate, aptitudes, and group differences. American Psychologist, 43, 713–720.

American Psychological Association. (1995). Stalking the wild taboo. Intelligence: Knowns and Unknowns. Report of a task force established by the Board of Scientific Affairs. Retrieved April 6, 2007, from http://www.lrainc.com/swtaboo/taboos/apa_01.html

Ayres, L. P. (1909). The money cost of the repeater. Psychological Clinic, 3, 49–57.

Binet, A., & Simon, T. (1916). The development of intelligence in children (the Binet-Simon Scale) (E. Kite, Trans.). Vineland, NJ: The Training School at Vineland.

Brown, P. (2002). Brain gain. New Scientist, 2(2332), 24–27.

Burt, C. (1919). The distributions and relation of educational abilities. London: London County Council.

Burt, C., & Moore, R. C. (1912). The mental differences between the sexes. Journal of Experimental Pedagogy, 1(4), 273–284, 355–388.

Cahan, S., & Cohen, N. (1989). Age versus schooling effects on intelligence development. Child Development, 60, 1239–1249.

Cannell, J. J. (2006). “Lake Woebegone,” twenty years later. Third Education Group Review, 2(1). Retrieved February 2, 2007, from http://www.tegr.org/Review/Articles/vol2/v2n1.pdf

Cooke, D. H. (1931). A study of school surveys with regard to age-grade distribution. Peabody Journal of Education, 8, 259–266.

de Lemos, M. M. (1989). Effects of relative age within grade: Implications for the use of age-based norms for group tests of general ability. Bulletin of the International Test Commission, 28&29, 21–44.

de Lemos, M. M. (1990, September). The Raven's Progressive Matrices: Does schooling make a difference? Paper presented at the Australian Psychological Society Conference, Melbourne, Australia.

de Lemos, M. M. (1994). Not so straightforward: Interpreting the scores. Evaluation and Research in Education, 8(1&2), 69–83.

Elley, W. B. (1969). Changes in mental ability in New Zealand school children, 1936–1968. New Zealand Journal of Educational Studies, 4(2), 140–155.

Flynn, J. R. (1984). The mean IQ of Americans: Massive gains 1932 to 1978. Psychological Bulletin, 95(1), 29–51.

Flynn, J. R. (1987). Massive IQ gains in 14 nations: What IQ tests really measure. Psychological Bulletin, 101(2), 171–191.

Flynn, J. R. (1988). IQ gains and New Zealand: About race, retardation and longitudinal studies. In M. Olssen (Ed.), Mental testing in New Zealand (pp. 128–151). Dunedin, New Zealand: University of Otago Press.

Flynn, J. R. (1998). IQ gains over time: Toward finding the causes. In U. Neisser (Ed.), The rising curve: Long-term gains in IQ and related measures (pp. 25–66). Washington, DC: American Psychological Association.

Flynn, J. R. (2006). Beyond the Flynn effect: Solution to all outstanding problems—except enhancing wisdom. Lecture at the Psychometrics Centre, Cambridge Assessment, University of Cambridge, England. Retrieved February 28, 2007, from http://www.thepsychometricscentre.co.uk/publications/BeyondTheFlynnEffect.asp

Frederiksen, B. J. (1983). Internal efficiency of school systems: A study in the use of pupil flow models for developing countries. Unpublished doctoral dissertation, University of Lancaster, England.

Galton, F. (1892). Hereditary genius: An inquiry into its laws and consequences. London: Macmillan. (Original work published 1869)

Gilmore, M. E. (1927). Coaching for intelligence tests. Journal of Educational Psychology, 18(2), 119–121.

Grissmer, D., Flanagan, A., & Williamson, S. (1998). Why did the Black-White score gap narrow in the 1970s and 1980s? In C. Jencks & M. Phillips (Eds.), The Black-White test score gap (pp. 182–226). Washington, DC: Brookings Institution Press.

Grotelueschen, A. (1969). Review of Otis-Lennon Mental Ability Test, by Arthur S. Otis & Roger T. Lennon. Journal of Educational Measurement, 6(2), 111–113.

Harvey, O. L. (1938). Enrollment trends and population shifts. Elementary School Journal, 38, 655–662.

Harvey, O. L. (1939a). Use of age-grade and promotion tables in the study of enrollment trends. Elementary School Journal, 39, 751–759.

Harvey, O. L. (1939b). Negro representation in public school enrollments. Journal of Negro Education, 8(1), 26–30.

Herrnstein, R., & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New York: Free Press.

Hiebert, E. H. (1994). Reading recovery in the United States: What difference does it make to an age cohort? Educational Researcher, 23(9), 15–25.

Holmes, C. (1989). Grade level retention: A meta-analysis of research studies. In L. Shepard & M. Smith (Eds.), Flunking grades: Research and policies on retention (pp. 16–33). Lewes, England: Falmer.

Lennon, R., & Mitchell, B. (1955, October). Trends in age-grade relationship: A 35-year review. School and Society, 124–125.

McAfee, J. K. (1981, April). Towards a theory of promotion: Does retaining students really work? Paper presented at the annual meeting of the American Educational Research Association, Los Angeles, CA. (ERIC Document Reproduction Service No. ED204871)

McDonald, G. (1989, November–December). The normal curve of intelligence: Is this a representation of promotion patterns? Paper presented at the annual meeting of the New Zealand Association for Research on Education, Central Institute of Technology, Heretaunga, New Zealand.

McDonald, G. (1992). “Henry and Iain. . . ”: A comment on a response. New Zealand Journal of Educational Studies, 27(1), 103–106.

McDonald, G. (1993). Ages, stages and evaluation: The demography of the classroom. Evaluation and Research in Education, 7(3), 143–154.

McDonald, G. (1998). “Working its magic”? IQ rise and the demography of the classroom. Oxford Review of Education, 24(2), 225–234.

McIntyre, G. A. (1938). The standardization of intelligence tests in Australia. Melbourne, Australia: Melbourne University Press.

Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156–166.

Neisser, U. (Ed.). (1998). The rising curve: Long-term gains in IQ and related measures. Washington, DC: American Psychological Association.

New Zealand Council for Educational Research. (1969). Otis Tests of Mental Ability manual of directions. Wellington: Author.

O'Leary, M. (2001). The effects of age-based and grade-based sampling on the relative standing of countries in comparative studies of student achievement. British Educational Research Journal, 27, 187–200.

Olssen, M. (Ed.). (1987). Mental testing in New Zealand: Critical and oppositional perspectives. Dunedin, New Zealand: University of Otago Press.

Redmond, M., & Davies, F. R. (1940). The standardization of two intelligence tests. Wellington: New Zealand Council for Educational Research.

Reid, N., Jackson, P., Gilmore, A., & Croft, C. (1981). Test of Scholastic Abilities: Teacher's manual. Wellington: New Zealand Council for Educational Research.

Schwager, M. T., Mitchell, D. E., Mitchell, T. K., & Hecht, J. B. (1992). How school district policy influences grade level retention in elementary schools. Educational Evaluation and Policy Analysis, 14, 421–438.

Scottish Council for Research in Education. (1949). The trend of Scottish intelligence: A comparison of the 1947 and 1932 surveys of the intelligence of eleven-year-old pupils. London: University of London Press.

Shepard, L. A., & Smith, M. L. (1985). Boulder Valley kindergarten study: Retention practices and retention effects. Boulder, CO: Boulder Valley Public Schools.

Shuey, A. (1966). The testing of Negro intelligence (2nd ed.). New York: Social Science Press.

Teasdale, T. W., & Owen, D. R. (2005). A long-term rise and recent decline in intelligence test performance: The Flynn effect in reverse. Personality and Individual Differences, 39, 837–843.

Terman, L. M. (1919). The measurement of intelligence. London: Harrap.

Thelen, E., & Adolph, K. E. (1992). Arnold L. Gesell: The paradox of nature and nurture. Developmental Psychology, 28, 368–380.

Thorndike, R. L. (1973). Stanford-Binet Intelligence Scale. Third revision. Form L-M. Boston: Houghton Mifflin.

Thorndike, R. L., Hagen, E. P., & Sattler, J. M. (1986). Stanford-Binet Intelligence Scale: Technical manual (4th ed.). Chicago: Riverside.

Tuddenham, R. D. (1948). Soldier intelligence in World Wars I and II. American Psychologist, 3, 54–56.

Volkmor, H., & Noble, I. (1914). Retardation as indicated by one hundred city school reports. Psychological Clinic, 8, 75–81.

Vroon, P. A. (1980). Intelligence: On myths and measurement. Amsterdam: North-Holland.

Wechsler, D. (1955). Wechsler Adult Intelligence Scale. San Antonio, TX: Psychological Corporation.

Cite This Article as: Teachers College Record Volume 112 Number 7, 2010, p. 1851-1870
https://www.tcrecord.org ID Number: 15918, Date Accessed: 1/20/2022 11:48:45 AM

About the Author
  • Geraldine McDonald
    Victoria University of Wellington
    GERALDINE MCDONALD is a research associate in the Faculty of Education, Victoria University of Wellington. She was formerly assistant director of the New Zealand Council for Educational Research, where one of the council’s functions was the development of standardized tests. She then taught at Victoria University of Wellington in teacher education and at the Wellington College of Education in a graduate course for teachers. She has a longstanding interest in the demographic patterns of schooling and their effects, particularly on the children of indigenous and minority groups. Recent publications include: with Huong Le, Joanna Higgins, & Val Podmore, “Artifacts, Tools and Classrooms” in Mind, Culture and Society (2005); “Literacy and the Achievement Gap” in Curriculum Matters (2006); and, with Joanna Higgins and Mary Jane Shuker, “Addressing the Baseline: Erving Goffman and Ethics in a Postgraduate Degree for Practising Teachers” in Teaching in Higher Education (2008).