Failing to Meet the Standards: The English Language Arts Test for Fourth Graders in New York State
by Clifford Hill - 2004
This article examines two kinds of problems associated with the English Language Arts test at the fourth-grade level in New York State: (1) problems that inhere in the test itself and (2) problems associated with its use. As for the test itself, three kinds of problems are analyzed: (1) the use of multiple-choice tasks to assess reading comprehension, (2) the number of writing tasks that children must respond to within a limited time frame, and (3) the use of a rubric that contains unrealistic criteria. As for the use of the test, two kinds of problems are analyzed: (1) the use of coaching material that has not been properly vetted and (2) the discontinued use of broadly based assessment programs that complement testing with portfolio sampling of children's work. The conclusions offer recommendations for dealing with these problems.
In the early 1990s, I was asked to deliver a lecture for the general public at Teachers College, Columbia University, on testing and assessment issues in American education. At the time, the standards movement was gathering momentum, and there was widespread optimism that once the right kinds of tests were built, they could be used to ensure that high standards were met and thus bring about the school reform that all those concerned with education were calling for (see, e.g., Wiggins, 1989). Having worked in urban schools in poor neighborhoods, I was well aware of the need for educational reform, but I was not optimistic that testing could bring about the kind of reform that was needed. I thus reported in the lecture what colleagues and I were discovering as we conducted research on test material that was emerging within the standards movement. Using various techniques of discourse analysis (Aronowitz, 1984; Johnston, 1984; van Dijk & Kintsch, 1983), I described how various kinds of problems in the material were confusing to students, especially those who came from culturally diverse backgrounds (see Hill, 1992, pp. 24–27).
I then made the case for a broadly based assessment model in which appropriate testing is complemented by other methods of documenting student achievement. The metaphor of ecology was central to the lecture, since I was concerned that high-stakes testing was likely to upset the delicate balance that skilled teachers work to maintain between what they teach and how they assess. For a number of years, early childhood teachers and I had been developing a curriculum-embedded assessment model for certain school districts in metropolitan New York. I described how the Progress Profile, by including both constructivist testing and portfolio documentation, provided useful information at the classroom level at the same time that it provided the system-level evaluation called for by New York State legislation (see Hill, 1992, pp. 36–42).
A decade has now passed, and I would like, once again, to draw on research that colleagues and I are conducting in addressing the question of whether high-stakes testing has been effective in raising standards and achieving educational reform. Since this research deals with early childhood education in metropolitan New York, I would like to examine the English Language Arts (ELA) test administered to fourth graders in New York State. In the first part of the article, I present a research project on the ELA test, in which colleagues and I analyzed its three major components. In the second part, I report research on the effects of the ELA test on classroom practice. I first describe a project in which colleagues and I investigated the use of coaching materials in a Bronx school where the failure rate on the ELA test was unusually high. I then describe what has happened to the Progress Profile under the pressure of high-stakes testing. I end the article with a set of recommendations as to how the ELA test might be first revised and then made to function within a broader assessment model.
ENGLISH LANGUAGE ARTS (ELA) TEST: FOURTH-GRADE LEVEL, NEW YORK STATE
The ELA test, which was developed by CTB/McGraw-Hill for the New York State Education Department, is administered to fourth graders and eighth graders. At the fourth-grade level, the test consists of three components, which are administered in three 60-minute periods on consecutive days:
Day 1: a reading component, which consists of six selections followed by multiple-choice tasks (this material is drawn from a nationally normed test administered by CTB/McGraw-Hill)
Day 2: a listening/writing component, which consists of a single selection followed by constructed-response tasks (30 minutes) and an independent writing prompt (30 minutes)
Day 3: a reading/writing component, which consists of two selections followed by constructed-response tasks
To illustrate how the ELA test works, I would like to examine sample material from the version that was administered in 2002. I should note, at the outset, that New York State law ensures that this material, once used, must be available for public review. This policy is welcome, since test publishers have often used security concerns to prevent researchers from including sample material when reporting on research. Throughout the years I have found this prohibition frustrating since my own research on testing includes close analysis of actual material. For example, I was unable to obtain permission from CTB/McGraw-Hill to include material from the Test of Adult Basic Education in a journal article (Hill, 1995) and hence was limited in what I was able to communicate. In an era of high-stakes testing, it is important that test material be available for public discussion.1
2002 VERSION OF THE ELA TEST
The reading component contained 6 selections and 28 multiple-choice tasks. With respect to the selections, the test makers avoided certain problems that colleagues and I had documented in earlier work. To begin with, no selection appears to be a fragment from a larger text (see Hill & Parry, 1989). Moreover, the selections represent the kind of material that children are likely to read at school: a folktale, an interview with a naturalist, a poem, a student letter to a school newspaper, an article about the Appalachian Trail, and an article about a female astronaut. Finally, the selections, with the exception of the student letter, appear to have been adapted from various sources rather than written specifically for this test (see Hill, 1999; Hill & Larsen, 2000, for problems that result from using material prepared specifically for a test).
As for the multiple-choice tasks, they are virtually identical to those traditionally used on reading tests: They focus on the main idea, the author's purpose, factual details, basic inferences, and so on. For example, some form of main-idea task (a term that will be used to refer to any task that requires test takers to judge relative salience) is used with four of the six selections, including, surprisingly, the poem (as one of my undergraduate teachers once put it, "when a poem starts having ideas, it ceases being a poem"). Test makers find judgments about salience attractive because they presumably involve higher order thinking. Or, to put it another way, such judgments are a convenient way to go beyond the merely factual, which multiple-choice tests are often accused of dwelling on.
There are, however, at least three ways in which main-idea tasks can be misleading. First, they seem to ask for the test taker's judgment about meaning. Yet, as Hill and Larsen (2000) point out, they are more oriented to structure than meaning:
Our ability to answer such tasks depends not so much on our knowledge of meaning relationships in the real world, or in the imaginative world activated by a text, as on our sensitivity to how a text is written. Our sense of what a text is mainly about depends on its rhetoric, that is, on the order in which information is presented and on the relationships that are set up. (p. 48)
Hill and Larsen go on to show how a set of ideas can be reordered in various ways to make first one salient and then another. Thus, while a main-idea task appears to ask test takers to judge salience from their own point of view, it is in fact the author of a text who establishes what is salient, and such tasks are really concerned with the ability to read the author's signals accurately. In practice, however, relative salience is not particularly important in many kinds of texts, and hence a reader often has access to only fragmentary cues (Braddock, 1974).
A second way that main-idea tasks are misleading is that they assume a particular kind of text structure, one in which a main idea is at the top of a hierarchy and other ideas are subordinate to it. Although this notion is prevalent in testing (and, for that matter, in reading and writing instruction), it doesn't fare all that well in the real world. As psycholinguists (e.g., Black & Bower, 1980; Trabasso, 1981) have observed, prose is often better understood as a network of related ideas rather than a hierarchy dominated by a single idea.
A third way in which main-idea tasks are misleading is that what they seem to offer (asking test takers to choose between competing candidates for salience) does not work out in practice. Given the ambiguities of establishing salience, test makers have difficulty establishing a correct answer that cannot be questioned. More often than not, they simply do not include alternative candidates for "main-ness" among the available choices.
One of the six selections used in the multiple-choice component, a letter from a student to a school newspaper, is followed by three tasks (see Figure 1). The first two of these tasks, numbered 16 and 17, include the word main and exemplify various kinds of difficulties described previously.
This selection has already been singled out as appearing to have been written for the test rather than adapted from existing material. As Hill and Parry (1989) point out, whenever test makers attempt to include authentic material, they are likely to end up with quite the opposite if they resort to constructing examples of such material. Using a student letter to a school newspaper is an attractive idea, but where is one to find a letter written by an actual seventh grader that can serve as a basis for multiple-choice tasks? Whatever the provenience of this letter, it comes off as inauthentic since readers cannot easily picture a seventh grader who would take on the subject of garbage in a letter to the school newspaper or who would be able to manage, even in the age of the Internet, such a sophisticated recitation of supporting facts.
Task 16 focuses on the main problem with garbage, and the correct answer ("there are not enough places to put it all") is based on the second sentence of the letter ("we are running out of places to dump garbage"). Yet the word "instead" immediately follows this sentence, suggesting that what it presents is not to be taken as definitive.
The network of related ideas in the text is shown in Figure 2. The words in bold are taken directly from the text; those in capital letters indicate how these words might function in a conventional reading. For example, the reader moves from the initial proposition (garbage is a big problem) to a suggested reason for this problem (we are running out of places to put it). The most obvious fix (trying to find more places to dump garbage) is then rejected ("instead") in favor of a preferred alternative (throw less away). The best way to achieve this (recycling) results in a beneficial side effect (getting more use out of what we throw away). Thus, the text questions the view that garbage is necessarily constant and we must keep on finding places to put it, and favors a view that we can reduce the amount of garbage through recycling and hence lessen the pressure to find more places to put it. The letter ends by moving from thinking globally to acting locally (we should start a recycling program at our school), which is the apparent motivation for writing the letter.
Test takers who have been conditioned to follow the most obvious structural cues will go along with the view that garbage is a constant and, in working with task 16, will accept the choice "the main problem with garbage is that there are not enough places to put it all." But more critically minded test takers who see this text as rejecting this view in favor of something better would look for a choice such as "the main problem with garbage is that we don't recycle enough." But they will not find it, since what is arguably the main problem with garbage is not included among the possible choices. Of the three incorrect choices supplied, not a single one has any foundation in the letter that would enable it to function as a credible main problem (Hill & Larsen, 2000, provide numerous examples of test makers simply avoiding genuinely competing answers in a main-idea task that would require test takers to engage in critical thinking).
Task 17 also includes the word "main," but it has a quite different effect. While task 16 asks for the main problem, implying that there is only one, task 17 asks for a main argument, implying that there is more than one. In fact, there are two possibilities supplied in the text (see Figure 2): recycling is a great way to "reduce the amount of garbage that we send to dumps" and to "get more use out of what we throw away." Option 1 is more central to the text than option 2, yet the correct answer B ("turns garbage into something useful") is clearly based on option 2. Still, with a bit of reasoning it would be possible to view the choice C ("makes more places available to dump garbage") as epitomizing option 1. This deduction might go something like "recycling makes more places available to dump the garbage that we can't recycle," which could be thought of as constituting a main argument.
It is intriguing to explore whether multiple-choice tasks on a comprehension test can, in fact, be answered without reading the selection. Table 1 shows how 35 graduate students at Teachers College, Columbia University, responded to the three multiple-choice tasks without having read the student letter.2
Well over half of the 35 students (20) were able to answer all three tasks correctly. If we assume a random response model, so that the probability of selecting the correct answer for all three tasks is 1 in 64, then in a sample of 35 students the probability that 20 of them would answer all three correctly turns out to be effectively 0. This result is, no doubt, related to a number of factors, one of which may well have to do with the point established previously: When test makers construct a main-idea task, they tend to avoid genuinely competing answers and thus end up with incorrect choices that lack plausibility.
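The "effectively 0" figure is easy to check. A minimal sketch follows, assuming (as the lettered choices F, G, H, and J for task 16 suggest) four options per task, so that pure guessing yields a 1-in-64 chance of getting all three tasks right; it computes the binomial upper tail, the probability that at least 20 of 35 guessers would succeed:

```python
from math import comb

def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

# Pure guessing on three four-option tasks: p = (1/4)^3 = 1/64
p_all_three = (1 / 4) ** 3

# Chance that at least 20 of 35 guessers answer all three correctly
print(f"{binom_tail(35, 20, p_all_three):.1e}")  # on the order of 10^-27
```

The tail probability comes out vanishingly small, which is the sense in which chance alone cannot account for the students' success; their test-taking strategies, discussed next, are the more plausible explanation.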
When the students explained how they were able to figure out the correct answers to these tasks, they reflected a good deal of test-taking savvy. On task 16, for example, one student pointed out that she could eliminate choices F and J because they reflected particular conditions that did not match the generic frame set up by the question. She further pointed out that although she was initially attracted to choice G, she was able to eliminate it because combining this choice with the introductory words produces an unlikely sentence in which the word "garbage" is repeated: "According to the letter, the main problem with garbage is that nobody wants to help clean up garbage." The correct choice H, on the other hand, includes "it," which results in a normal sentence: "According to the letter, the main problem with garbage is that there are not enough places to put it all." Hill and Larsen (2000) describe various ways in which test takers work with the notion of an ill-formed sentence as they narrow down their choices on a multiple-choice task.
In explaining their choices for tasks 17 and 18, certain students worked with another useful notion: that a correct choice on a multiple-choice test often reflects a view of how things are supposed to be. Hence for task 17 they selected choice B, since recycling should result in something useful, and for task 18 they selected choice G, since a persuasive letter should use examples. In effect, task 17 presupposes a normative view of how the world works and task 18 a normative view of how text works. Gee (1992) reports that skilled test takers use a similar strategy when they try to figure out answers without having read the passage. He found that such test takers were willing to stick with a normatively based answer, even when it contradicted what they knew about a particular subject.
Having reviewed the multiple-choice component of the ELA test for fourth graders, I find little evidence that it avoids the kinds of problems that colleagues and I have documented throughout the years (e.g., Hill, 1999; Hill & Larsen, 2000; Hill & Parry, 1994). As we have seen, this component is built around a simplistic view of how text is structured and how readers make sense of it. What is especially disconcerting is that test takers are in a position to answer multiple-choice tasks correctly without even reading the selection on which they are based. Clearly this component is at odds with the remaining components of the ELA test, which are more performance oriented and thus better reflect the approach to testing that was originally envisioned by those advocating higher standards in American education.
The listening/writing component is administered in two 30-minute segments: (1) children listen to a selection read aloud and respond to three constructed-response tasks, and (2) they respond to a writing prompt that, though described as "independent," does bear a thematic relation to the listening selection. The selection that was used in the 2002 version of the ELA test is shown in Figure 3.
"Waldo's Up and Down Day" consists of 845 words, which, when read aloud at a normal speed, take about 6 to 7 minutes to read. Since the story is read twice, nearly half of the 30 minutes can be used up by the oral reading, which leaves children with just over 15 minutes to respond to the three tasks. In the absence of a recording, it is virtually impossible to standardize the speed at which the material is read. Certain teachers, especially those working with children who speak other languages, may be inclined to read slowly to maximize understanding (especially on the second reading, when children are allowed to take notes). Other teachers, however, may prefer to read at a faster pace to maximize the amount of time that children have to respond to the tasks.
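The time arithmetic above can be verified with a quick back-of-the-envelope sketch; the 130-words-per-minute read-aloud rate is my own assumption (any rate in the normal range implied by the 6-to-7-minute figure gives a similar picture):

```python
words = 845          # length of "Waldo's Up and Down Day"
rate_wpm = 130       # assumed normal read-aloud rate (words per minute)

one_reading = words / rate_wpm          # about 6.5 minutes
two_readings = 2 * one_reading          # about 13 minutes
minutes_for_tasks = 30 - two_readings   # roughly 15 to 17 minutes remain

print(round(one_reading, 1), round(two_readings, 1), round(minutes_for_tasks, 1))
```

A teacher reading slightly slower, at the 7-minutes-per-reading end of the range, would leave children only about 16 minutes for all three tasks, which underscores how much the unstandardized reading speed matters.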
Figure 4 presents one child's responses to the three tasks that follow "Waldo's Up and Down Day." This work is taken from the material used to train scorers, where it exemplifies level 4, the highest rating (New York State Testing Program, 2002d).
Given the testing situation, it is useful to examine both the quantity and the quality of what this child wrote. The child responded to three tasks: task 29 (a graphic-organizer task), task 30 (a brief-answer task), and task 31 (an extended-answer task).
When one considers the conditions imposed by the test, 174 words constitute a rather impressive output for a fourth grader. The child is responding to three separate tasks about a story that is not present in written form. The amount of time available is quite limited (as we have seen, probably just over 15 minutes), and the writing must be done with care so that it fits in the predetermined spaces and is sufficiently legible for independent scorers to decipher it.
As for what the child wrote, one can detect a good deal of repetition across the three tasks. The three bits of bad luck listed in task 29 reappear in task 30 in virtually the same form (as can be seen, certain spelling errors are preserved):
His chair colapsed . . .
He had a assinment in math . . .
He struck out in softball.
. . . his chair colapsed . . .
. . . He didnt finish his assingment.
. . . he struck out at bat . . .
In task 30, the child does provide an initial framing for the recycled material, but even that framing has been recycled from the instructions:
Why doesn't Waldo want it back? Waldo dosent want it back because . . .
The child's extended response to task 31 recycles the same material one more time (it is even closer in form to what was written in task 29):
His chair colapsed and everyone laughed at him.
He had a assinment in math and got Eddie so they didnt finish.
He struck out in softball.
His chair colapsed and everyone laughed.
. . . he got Eddie who never finishes
. . . he was playing softball and he stuck out.
The New York State Testing Program included this child's responses as exemplifying the highest level of performance, even though the child uses essentially the same material on all three tasks. This is not to fault the child for using this strategy: It was an efficient way of responding, and it turned out to have a considerable payoff; in addition to receiving the highest score, it was chosen as a model for training materials distributed throughout the state.
The general rubric for these tasks can be found in Figure 5. The first column lists the general qualities that are looked at: meaning, development, organization, and language use. The first two are relevant to all three tasks, whereas the last two are applied only to the extended response (task 31). The second column provides descriptions of how the various qualities are manifested in responses that receive the highest score.
Even a brief glance shows that the rubric is out of line with what this child was able to write. Consider, for example, such descriptions of language use as "is fluent and easy to read, with vivid language and a sense of engagement or voice" and "is stylistically sophisticated, using varied sentence structure and challenging vocabulary." In state education departments throughout the country, phrases like these have been recycled in rubrics used to evaluate what children write on language arts tests. It is disconcerting that standards associated with the highly edited work of seasoned adult writers, working on familiar material over months or even years, are being applied to what children, working under the pressure of a high-stakes test, manage to get on the page when they have about 15 minutes to respond to three tasks about a story that they have just heard for the first time.
In addition to the rubric discussed here, the scoring manual supplies a specific rubric for the listening/writing component (see Figure 6). Evidently the more general rubric was found impractical as a basis for the actual work of scoring the responses, but even this more plain-spoken rubric is, at times, inflated (e.g., "Responses are logical, well organized, focused, and fluent, with a sense of engagement or voice"). If the specific rubric is the one actually used, why promulgate a general rubric that is so dysfunctional?
During the second 30-minute segment on day 2, children are asked to respond to the following independent writing prompt:
32 Write a story about making a new friend. Describe how you met, what you did the first day together, and what helped you become friends.
In your story, be sure to include
● a title for your story
● a clear beginning, middle, and end to your story
● specific details to make your story interesting
Space limitations prevent any discussion of sample responses to this prompt, but I would like to make two points. First, the rubrics used are similar to those in Figures 5 and 6: an inflated general rubric followed by a more practical specific rubric. Second, the time allotment seems appropriate for a task that calls for an extended response, but why, then, is so much less time allotted for responding to task 31, which presents a similar challenge? Clearly the distribution of time on day 2 merits further attention.
The two reading selections that were used on day 3 in the 2002 version of the ELA test are shown in Figures 7 and 8. Both selections deal with the raven: One is an article that describes how intelligent this bird is (439 words), and the other is a Native American folktale that explains why it has only one color (382 words). Once again, children are required to do a good deal of work within a 60-minute period. They read two pieces (together, over 800 words) and respond to four constructed-response tasks, the last one calling for extended writing. Figure 9 shows the responses to these tasks that are included in the scorers manual as an example of work at level 4 (New York State Testing Program, 2002e).
The child responded to four tasks: task 33 (a graphic-organizer task that follows the first selection), tasks 34 and 35 (short-response tasks that follow the second selection), and task 36 (an extended-response task that follows the two selections).
Once again, when one considers the conditions imposed by the test, this is a reasonable output. It is surprising that this child wrote about 70% as much for each of the short responses as for the extended response. Indeed, the child went two lines beyond the 10 lines available in the training manual for each of the short responses, while using only 19 of the 27 lines available for the extended response. It is puzzling that in the test booklet actually used in 2002, only 6 lines rather than 10 were provided for the short responses. Children were thus faced with an even greater challenge of fitting what they wrote into the prescribed space.
It is clear that space can have an effect on how much and how well test takers manage to write. On the one hand, children are frequently concerned about how much they are expected to write in a testing situation, and in the absence of other indications, they tend to use the amount of space they are given as a guide, which may lead them to use larger or smaller handwriting than usual. On the other hand, sometimes children are warned that they must be sure to keep what they write within the prescribed spaces, which brings another kind of pressure. I might note here that the use of a computer ensures that writing is not only legible but also contained within a preestablished space (since the writing box simply expands to accommodate whatever a child writes). A computer can also provide an ongoing word count so that children can be aware of how much they are writing. In the conclusion of this article, I address the issue of providing children the option of using a computer when taking high-stakes tests such as the ELA test.
The general rubric used to evaluate children's performance in the listening/writing component is used for these tasks as well. Since this child's answers were scored at the highest level (4), they are supposed to reflect the qualities described in Figure 5. Consider, for example, the child's response to the final task, the one that calls for extended writing. Certainly the child's letter has a good deal of charm and is rather disarming in its direct appeal to Mr. Pataki not to adopt the raven as the state bird (as the child puts it, "the blue bird is just fine"). But it is misleading to imply that a sequence of five short, rather disjointed paragraphs (three consist of a single sentence, and even the other two are punctuated as if they were a single sentence) "develops ideas fully with thorough elaboration" and "shows a logical, coherent sequence of ideas through the use of appropriate transitions or other devices."
We have examined two samples of children's writing that were scored at the highest level, but they obviously do not meet the standards set up by the rubric. This mismatch provides evidence for a point frequently made: that high-stakes testing is of limited value in maintaining high standards in American education (Klein, Hamilton, McCaffrey, & Stecher, 2000; Koretz & Barron, 1998; Koretz, Linn, Dunbar, & Shepard, 1991; Stecher, Barron, Kaganoff, & Goodwin, 1998). As I suggest in the next section, such testing needs to function within a larger model in which samples of student work produced over an extended period of time can be assessed as well.
Before leaving the ELA test, I would like to point out that rubric inflation creates a difficult situation for those who score the language arts test (ordinarily fourth-grade teachers). On the one hand, if they apply the rubric rigorously, they end up failing a substantial proportion of children who may then be forced to repeat the fourth grade (in many states, the number of children in urban schools who fail the test is depressingly large). On the other hand, if the scorers are lax in applying the rubric, they can be accused of not maintaining standards (in certain states, such accusations have been blown into full-scale cheating scandals). As long as the rubrics used to evaluate children's writing are grossly inflated, this dilemma will remain: Either children run the risk of being evaluated unfairly, or teachers are put in a position where they can be accused of not maintaining standards. Having dealt with various problems that inhere in the ELA test, let us now turn to those associated with its use.
INCREASED USE OF COACHING MATERIAL
As soon as high-stakes testing was introduced, coaching materials began to be marketed to schools across the country. The marketing was especially intense in urban districts where a large proportion of students were failing. In New York City, for example, more than three quarters of the students failed the ELA test when it was introduced in 1999. As the academic year 1999–2000 got underway, a team of doctoral students and I initiated a research project designed to document how an elementary school in the Bronx would prepare children for the ELA test to be administered in February 2000. The school had been placed on probation because nearly 90% of the fourth graders had failed the ELA test during the previous year.
As early as October, coaching material had been purchased by the district office and sent to the school with instructions as to how it should be used. Each day a certain amount of time was to be set aside for all fourth graders to work with the material. In classes for children with special needs, teachers were asked to use this material for a full day each week. Children were also expected to work with test-prep material every Saturday and even for 4 days over the winter break. A hefty packet of material was prepared for each home, and parents were instructed on how to help their children work through it.
At first glance, the coaching material looked promising. It was handsomely produced with ample illustrations. As we examined it more closely, however, we were dismayed with what we found between its glossy covers. The practice passage in Figure 10 illustrates just how misleading this material can be. This passage is followed by eight multiple-choice tasks designed to represent the range of tasks that children encounter. After children respond to the tasks, they turn to the back of the manual, where a correct choice is provided for each task along with explanations of why it is correct and why the other choices are wrong.
"An Old China Test" presents a good deal of misleading information. Let us begin by considering certain historical inaccuracies. This test did not originate in the Tang dynasty but rather in the Han dynasty. Moreover, contrary to a popular misconception, it was not based only on the teachings of Confucius but also on the teachings of other seminal thinkers of the Confucian era.
Students were not required to take a single test but rather a series of tests (those who failed a particular test were not allowed to take the following one). Nor were they required to "write from memory all the books of Confucius . . . hundreds and hundreds of pages." Only students who had developed a sophisticated composing style were able to move successfully through all the tests.
Perhaps the most disconcerting feature of the passage is its use of materialistic values associated with modern society to characterize the motivation of those taking the tests. It is simply not accurate to state that successful candidates "won a job and lots of money for life." They were provided an opportunity to work in the civil service but were not guaranteed lifetime employment. Certainly the material benefits associated with a government position were attractive, but it is misleading to ignore the role that the scholarly culture in ancient China played in motivating individuals to undertake long years of arduous study. From a Confucian perspective, such study developed the disciplined thinking that any effective leader must draw on.
This imposition of an alien perspective is also evidenced in describing Chinese writing as requiring students "to begin their writing on the last page of the book and write toward the front of the book" (in task 21 the expression "write backwards" is used). Even an illustration that accompanies the passage reflects an alien perspective: a Western-style book is used rather than a scroll. This coaching material reflects a disturbing quality that can be found in what children read at school: Under the guise of multiculturalism, other cultures are often misrepresented to make them seem more familiar.
Historical inaccuracies can also be found in the explanations that accompany the multiple-choice tasks. These inaccuracies are especially pronounced in the explanation provided for task 25 (see Figure 11). Choice B ("few people took the tests") is described as incorrect, since it is claimed that many people took the test to get "lots of money and a job for life." As a matter of historical record, relatively few people took the test because of the arduous preparation that it required, so B is, in fact, a perfectly legitimate answer.
The reasoning used to justify choice A ("the test took one month to finish") as the correct answer is equally flawed. One of the reasons given as justifying a long period was that the students had to write with a brush. Hence, in choosing A as the correct answer, children are supposed to infer that writing with a brush would have been slow. A more culturally appropriate inference, however, would be that Chinese students who lived at that time used a brush as efficiently as we use a pen or pencil today.
In rejecting choice A, Raisa, one of the children who attended the school we were observing, made not only this point but an equally compelling one. She pointed out that the passage didn't state exactly how long the test took. Her point is well taken, since children are often penalized when they use information not given in the passage in responding to a multiple-choice task. Children are also frequently advised to reject a precise choice, like "one month" in A, in favor of one that allows greater leeway, such as "few people" in B. During an interview, Raisa's teacher identified her as a gifted student who does not do as well on tests as she should.
At the time that doctoral students and I conducted this research, the New York Times published an op-ed piece that I wrote about the adverse effects of the coaching material used in New York City schools, especially those attended by culturally diverse children who live in poor neighborhoods (Hill, 2000a). Afterwards, I received various responses from teachers expressing their frustration with high-stakes testing. One fourth-grade teacher in an inner city school wrote the following:
Test prep is not limited to the commercial materials you wrote about in the opinion piece. Test prep is a culture that a failing (usually synonymous with poor) school is forced to choose. It means that each morning the number of days and hours until the test are ticked off over the school PA system. Test prep means that billboards around the school are covered with testing tips rather than student work. Test prep is when a school holds pep rallies not for its basketball team but for its test takers. Test prep is when students brag about the label given to them by a testing agency: "I'm Proficient in multiple meanings. You're below Basic in computation." . . . I fear that the effects of high-stakes testing on a school's climate are incurably opposed to the best traditions of real teaching and real learning. I can't wait until April 10th, the first day after testing, when I'll really become a teacher.
This teacher also focused on the conflict between testing and the creative work that students were doing in his class.
I recently completed a month-long Black History project with my students. It involved learning the methods of research. The project also involved creating a piece of history: for example, one group made a replica of the Amistad, another reenacted the killing of Crispus Attucks at the Boston Massacre. At the conclusion of the project, a colleague and I cynically wondered how my students' creativity would be assessed on test day. "Not at all," we decided.
This complaint about a lack of connection between testing and creative work is often voiced by teachers.
In the next section, I consider another major effect of high-stakes testing: the gradual disappearance of broader assessment models in which children's creative work is evaluated. I focus on the Progress Profile, mentioned briefly at the outset of this article, and sketch in its major components in order to illustrate what is being lost.
DECREASED USE OF BROADER ASSESSMENT MODELS
Prior to the advent of high-stakes testing, I worked for a number of years with early childhood teachers in various districts of metropolitan New York to develop the Progress Profile (Hill, 2000b). The fundamental goal of this assessment model was to document children's growth in literacy and numeracy from kindergarten through the fourth grade. The Progress Profile consisted of two components: (1) curriculum-based testing built around constructivist principles and (2) documentation of children's work throughout each school year (i.e., a portfolio system was devised to collect samples of children's work both at school and at home).
The curriculum-based testing of children's literacy development is exemplified in Figure 12, which presents a folktale and five different kinds of tasks. The initial task of retelling the story provides a baseline that is useful in evaluating children's responses to the tasks that follow. The next three tasks ask children to identify salient facts, to use these facts in making basic inferences, and to relate the story to their own experience (Bloom, 1984). The final task requires children to restructure the story in some way (e.g., to provide a new ending).
It is instructive to contrast these tasks with a multiple-choice task found in a commonly used test in New Zealand. The story on which this task is based is similar to the one in Figure 12 except that it is about a dog and a pukeko, a long-necked bird that bears a Maori name.
This story is mostly about a
(A) dog finding a bone.
(B) dog with a sore stomach.
(C) pukeko's reward.
(D) foolish pukeko.
Once again, we encounter a task dealing with "main-ness." "Pukeko's reward" is designated as the correct choice, presumably because it is central to the plot. However, when Hayes and McCartney (1997) administered this task to 17 children in a Bronx school, 13 of them chose "a foolish pukeko." Consider, for example, how an 11-year-old Latina girl defended this choice.
Why did you choose (D)?
Because he put his neck inside the dog's mouth and that's nasty and that's foolish.
Why is it foolish?
Because a normal person, or whatever it is, wouldn't actually put his mouth or his head or anything else into the dog's mouth.
'Cause it could be a trap.
What kind of trap?
He puts his head into the dog's mouth and the dog just bites it off. (p. 17)
Clearly this girl understood the story and its moral (one should be careful before acting), but the test makers provided her no reward.
After completing the comprehension tasks in the Progress Profile, the children were presented tasks that deal with literacy conventions: For example, they were asked (1) to identify specific uses of conventions such as capitalization or quotation marks and explain how they were being used; (2) to read aloud a short excerpt so that teachers could carry out a miscue analysis (Goodman, 1976); and (3) to write a short excerpt that was read aloud so that their knowledge of sound-spelling correspondences could be evaluated. Children's responses to these tasks were evaluated according to a preexisting scheme, which was designed to take account of appropriate variation: For example, when children wrote what they heard, they received partial credit for any phonetically motivated spelling, even if it was not the correct one (e.g., speakers of African American vernacular received partial credit for "axed" in place of "asked").
While administering the tasks, the test giver, who was ordinarily the classroom teacher, attempted to reduce pressure on the child. Resisting a widely held notion that time should be used as a factor in evaluating how well children can perform, the test giver allowed them, within reason, as much time as they needed to respond to the tasks. Moreover, if an individual child did not understand a given task or was unable to perform it, the test giver provided additional information. In developing the Progress Profile, we were influenced by the model of dynamic assessment (Feuerstein, Falik, & Feuerstein, 1998), in which teaching and learning are allowed, indeed encouraged, as children carry out a task. From this perspective, a major goal of assessment is to determine how well a child can use relevant information in problem solving (the scoring of a child's responses took into account the amount of information that a test giver provided).
The portfolio component varied from school to school, but it generally included (1) a journal in which a child's reading experiences at home were documented and (2) projects based on the child's exploratory reading in different content areas. As children took home books to read each week, they took along a small notebook in which entries could be made: Whoever read with the child (a parent, a grandparent, an older sibling, or even a visiting friend) wrote comments about the child's reading experience (as children progressed, they were encouraged to write their own comments as well).
When children brought the notebook back to school, the teacher responded to the entry. In this way, a home-school dialogue was built around the books that children read during the course of the year, and the teacher received a good deal of information that he or she could use in working with an individual child. In addition, this journal provided a comprehensive record of what children read at home that was useful at the beginning of a new school year when children moved from one teacher to another. The journal was so well received that we encouraged teachers to initiate it as early as kindergarten.
The projects that children undertook in different content areas could take a variety of forms. In the language arts, teachers showed considerable ingenuity in helping children display personal work: The stories children told and the accompanying pictures they drew were brought together in small books with a pocket for a library card so that children could check them out and read each other's stories at home. As children participated in this activity, they developed various kinds of literacy knowledge. They learned, for example,
how an individual book is organized (images and words working together to make meaning) and how books are organized in a library. They received early exposure to the alphabetical principle of organization basic to the Western experience of literacy; the alphabet becomes something more for children than a string of letters to recite. But apart from this practical knowledge of literacy, children learned an even more important lesson: that they could bring their personal worlds to the classroom. (Hill, 1992, pp. 38–39)
Such active assimilation of literacy knowledge and skills was reinforced by the portfolio system as a whole. To assemble their portfolios, children were expected to maintain separate files in which they kept track of multiple drafts of various projects. They then had to take responsibility for deciding which projects to include and for preparing an introduction in which they explained how these decisions were made and how the portfolio was organized. This approach to portfolios developed not only children's literacy skills but also their capacity for evaluating their own work.
The Progress Profile was well received by children, teachers, and parents. For children, the close attention that they received from their teachers was a rewarding experience that increased their confidence and motivation. We discovered that children looked forward to the testing as an opportunity to show what they knew and could do; that the teacher was able to provide help, when needed, contributed to their walking away from the experience with a sense of support. This experience contrasts sharply with the fear and intimidation that high-stakes testing often engenders.
For the teachers, working together to develop the Progress Profile provided an opportunity to share their best practices with each other. Moreover, as they administered the various tasks, they were able to make careful observations that carried over into their daily interactions with individual children. They also developed clinical knowledge and skills as they learned to do a miscue analysis or to evaluate sound-spelling correspondences.
An attractive feature of the Progress Profile was its focus on the connection between home and school. It was not only the journal documenting children's reading experience that strengthened this connection. Parents were encouraged to be present when children presented their portfolios and thus were in a position to understand better the work they had done throughout the school year. Moreover, since the teachers had been involved in both designing and administering the test, they were in a position to communicate useful information during parent conferences.
As teachers in the various districts were confronted with high-stakes testing, they began to experience difficulty in maintaining the Progress Profile. They simply did not have sufficient time to maintain such a demanding program while preparing children for the new tests. In certain districts the use of commercially produced coaching material was mandated, so that teachers had to set aside additional time for test-prep activities. The teachers were thus forced to abandon an assessment program that they had worked hard to develop and had come to regard with a great deal of pride.
I have covered a good deal of ground in examining the ELA test administered to fourth-grade children in New York State. I first examined problems that inhere in the test itself and then those that have arisen from its use. I would now like to consider how these problems might be best addressed. It is important to stress at the outset that the ELA test, given the current political climate, is not likely to disappear. In fact, the No Child Left Behind legislation requires New York State to administer a comparable test in Grades 4 through 8, and it is thus prudent to concentrate on how the present test might be improved.
From my vantage point, an important first step is to eliminate the multiple-choice component. When I first examined high-stakes tests, I was disappointed to discover that multiple-choice testing was still in use, since the standards movement had promised the use of performance testing in which students would engage in familiar activities such as those evidenced in the listening/writing and reading/writing components. Writing about something one has read or listened to is an activity that has value not only in school but also in the larger society.
By way of contrast, a multiple-choice task is an activity that children carry out only when they take a test (or are being prepped for one). Indeed, this kind of task was invented early in the last century so that test answers could be quickly and impartially scored, a process that became even more efficient with the development of machine scoring. With the development of artificial intelligence techniques, however, computer-based scoring of constructed-response tasks is becoming more common (see Wang & Hill, 1999, for innovative research on such scoring within the People's Republic of China). The traditional rationale for multiple-choice testing is thus gradually being eroded.
Unfortunately, our culture is heavily invested in multiple-choice testing. While there is a growing number of dissenters among teachers and educators, the general public and, most significantly, legislators tend to take these tests at face value. Educating them about the drawbacks of the multiple-choice format is likely to be a lengthy process. An effective strategy might be to target an area like reading comprehension, where the multiple-choice technique can be shown to be especially damaging (by way of contrast, this technique is relatively benign if it is used to focus on, say, grammatical features of language so that ambiguity can be generally avoided in establishing a correct choice).
As illustrated with main-idea tasks, this technique often forces test takers to adopt an oversimplified, even distorted approach to what they read. As we have seen, the wrong answers to a multiple-choice task can be so implausible that they can be rejected even without reading the passage. At the same time, they can be so attractive that children are stimulated to use them in establishing a context for material they find difficult to understand. Added to these psycholinguistic criticisms is a more practical argument. The multiple-choice component of the ELA test is simply not needed, since the reading/writing component provides a viable alternative. Its use of performance testing is, in fact, more congruent with the goals of the standards movement.
As for the listening/writing and reading/writing components, a more thoughtful approach to the number and distribution of selections and tasks could increase the effectiveness of the ELA test. In the version of the test under consideration, the relation between what children are asked to do and the time they are allotted varies markedly across the two components. It would be preferable to establish a stable frame for both components in which a single selection is followed by three related tasks, and in the case of the listening/writing component, care should be taken that the selection, read twice, leaves ample time for the tasks.
The three tasks should be designed so that children move through an integrated experience. The first task, as presently constituted, functions well as an organizing activity that helps children select and arrange relevant material from the selection. The second task, however, could be more closely aligned with the first by asking children to use the material they have assembled in interpreting what they have read or listened to. The integration of these two tasks would help children to ground their interpretation in the text and thus prepare them for a third task that asks them to connect the text to their own experience. In effect, the three tasks should reflect a broad model of comprehension that encourages children to integrate fact, inference, and experience (Bloom, 1984).
If the ELA test is to reach its full potential, the rubrics used to evaluate children's writing need to be overhauled to reflect more realistic standards. To begin with, the general rubric should be more closely aligned with the specific rubrics used to carry out the actual work of evaluation, and both should, in turn, be more closely aligned with what children of this age are able to produce under the conditions imposed by the ELA test (i.e., writing about unfamiliar material in a limited amount of time). We can observe here a major flaw in high-stakes testing, namely, that standards associated with good writing have not been sufficiently adapted to what students are able to do in a testing situation.
To maintain the standards associated with less-pressured writing, testing needs to be complemented by other forms of assessment (for example, students carrying out a research project) in which students have the opportunity to develop and refine their writing about familiar material over an extended period of time. But even here the standards must be adapted for use with students of different ages.
There is a final recommendation that I would like to make: Children should be provided the option of taking the ELA test on a computer. I am directing a research project in which a digitally based assessment model is being developed for the Pacesetter Program, which is sponsored by the College Board to help culturally diverse students prepare for higher education (Hill, 2003). High school students in this program generally prefer to respond to a writing task on a computer, since it allows them to get their thoughts down and more easily revise the language used in expressing them. It is clear that tools such as a spell checker contribute to what is often referred to as "scribal fluency."
Given the high stakes attached to a test such as the ELA test, students should, in principle, be provided the tools that enable them to do their best work (Bennett, 1999). Hence children who are accustomed to writing on a computer should be provided a digital version of the ELA test (just as older students are provided such versions of the TOEFL or the GRE). As Russell and Abrams (2004) observe, an unfortunate consequence of high-stakes testing is that its exclusive reliance on pencil and paper has worked against the use of computers in the classroom.
Before leaving the subject of computers and testing, I would like to make two further observations based on our current research. First, when students respond by computer, we are able to examine not only their work but also how they carried it out: for example, how much time they spent previewing a task, how much time they spent revising their answer, and what kinds of tools they used. We are thus in a position to provide useful feedback on how they went about doing the various tasks. I have not dealt with the feedback provided on the ELA test, but I have found that it is so generic that it is of little value (see Hill, 1992, for a discussion of how misleading feedback based on multiple-choice items can be).
Second, our assessment model encompasses not only print material but also material that integrates print with sound and image. Since students increasingly work with such material, we find that they are especially motivated when they encounter it on an assessment task. We also find that how they respond to such a task provides useful information about how well they are prepared to function in a digital age. Ultimately, the standards movement will reach its full potential only when performance testing makes greater use of digital technologies.
I would like to close this article by considering how, given the attention that must be paid to the ELA test, the values associated with school-based assessment programs like the Progress Profile can be preserved. Certainly the testing component of such programs has been superseded by a high-stakes test, but a strong case can be made that the portfolio component be preserved and integrated into a statewide assessment program. Such a program would be responsive to the policy statements issued by professional organizations such as the National Council on Measurement in Education. In summarizing these statements in a report published by the National Research Council, Heubert and Hauser (1999) write that "scores from large-scale assessments should never be the only sources of information used to make a promotion or retention decision. . . . Test scores should always be used in combination with other sources of information about student achievement" (p. 286).
I do not have space to explore the challenges of designing such a statewide assessment program (see Darling-Hammond, 2004), but I would like to revisit certain benefits. Apart from the greater equity discussed previously, such a model would allow teachers to play an important role in assessment. Historically, educational testing has tended to reduce that role. Indeed, it originated in the early part of the last century partly because of a commonly held belief that many teachers were biased toward immigrant children and thus should not be responsible for assessing them (Thorndike, 1915).
This lack of confidence in teachers has persisted in American education, even though they have the most knowledge about the children they teach. As indicated by the earlier discussion of Raisa in the Bronx school, teachers are often aware that an individual child's test performance is misleading. They should thus be given the opportunity to provide reliable evidence to counterbalance a test score (whether too high or too low). One approach would be to allow teachers to submit portfolio evidence only when they are convinced that an ELA test score is seriously misleading. Hence the state would not be required to maintain an official portfolio score for each child (an attempt by Vermont to maintain such a system collapsed under its own weight; Hill, 1992, describes how other misguided policies contributed to the collapse). Such a parsimonious approach would limit bureaucracy, an important consideration in any statewide assessment program, while preserving the possibility that a teacher can play an instrumental role in deciding how an individual child is assessed.
Integrating portfolios into a statewide assessment program would also result in the benefits discussed earlier in relation to the Progress Profile. When teachers and children are engaged in portfolio assessment, they work together on a daily basis to ensure that children's best work is assembled. Such cooperation strengthens the bond that is crucial to children's success in school (see Ferguson, 2002, for evidence that this bond is especially crucial for Latino and African American children). This approach also provides parents with more meaningful information about their child's work in school. An unfortunate consequence of high-stakes testing is that it leads parents to accept on faith whatever number is assigned to their child, but provides them little help in figuring out what the number might mean. When parents are invited to review their child's portfolio, they come away with a real sense of what their child knows and is able to do.
Children are the ultimate beneficiaries when teachers and parents are involved in assessment. As they themselves review what they have done throughout the school year and select certain pieces for their portfolio, they develop a sense of what constitutes good work. The ultimate goal of the standards movement is to help students internalize standards that they can draw on in evaluating their own work, not only in school but also as they take on responsibilities in the larger world.
I owe special thanks to Eric Larsen, a longtime friend and colleague, who has been present in all phases of this work: helping to conceptualize the research projects on which the article is based, ensuring that its various parts work together, and shaping both its style and substance as it moved through various drafts. I would also like to thank Howard Everson, Ellen Maier, and Gary Natriello for editorial suggestions as well as Hung Chia Yuan and Robert Schwarz for technical assistance. Finally, I would like to thank Frank Horowitz and Jo Anne Kleifgen as well as the graduate students they teach for responding to multiple-choice tasks without reading the passage and describing how they went about making their choices.
Aronowitz, R. (1984). Reading tests as texts. In D. Tannen (Ed.), Coherence in spoken and written discourse (pp. 43–62). Norwood, NJ: Ablex.
Babcock, D. (2000). Comprehensive reading and writing assessment. Merrimack, NH: Options.
Bennett, R. (1999). Using new technology to improve assessment. Educational Measurement: Issues and Practice, 18, 5–12.
Black, J., & Bower, G. (1980). Story understanding as problem-solving. Poetics, 9, 223–250.
Bloom, B. S. (1984). Taxonomy of educational objectives. New York: Longman.
Braddock, R. (1974). The frequency and placement of topic sentences in expository prose. Research in the Teaching of English, 8, 287–302.
Darling-Hammond, L. (2004). Standards, accountability, and school reform. Teachers College Record, 106, 1047–1085.
Ferguson, R. (2002). Addressing racial disparities in high-achieving suburban schools. NCREL Policy Issues, 13, 1–11.
Feuerstein, R., Falik, L., & Feuerstein, R. (1998). In R. Samuda (Ed.), Advances in cross-cultural assessment (pp. 100–161). Thousand Oaks, CA: Sage.
Gee, J. (1992). What is reading? Literacies, discourses, and domination. Journal of Urban and Cultural Studies, 2, 65–77.
Goodman, K. (1976). Miscue analysis: Theory and reality in reading. In J. E. Merritt (Ed.), New horizons in reading: Proceedings of the Fifth International Reading Association World Congress on Reading (pp. 15–26). Newark, DE: International Reading Association.
Hayes, T., & McCartney, N. (1997). An analysis of the Progressive Achievement Tests of Reading. Unpublished manuscript.
Heubert, J., & Hauser, R. (Eds.). (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.
Hill, C. (1992). Testing and assessment: An ecological approach (Inaugural lecture for the Arthur I. Gates Chair in Language and Education). New York: Teachers College, Columbia University.
Hill, C. (1995). Testing and assessment: An applied linguistics perspective. Educational Assessment, 2, 179–212.
Hill, C. (1999). A national reading test for fourth graders: A missing component in the policy debate. In B. Preseissen (Ed.), Teaching for intelligence I (pp. 129–152). Chicago, IL: Skylight Publishing.
Hill, C. (2000a, March 18). Practicing without learning [Op-ed article]. The New York Times, p. A15.
Hill, C. (2000b). The Progress Profile: Constructivist assessment in early childhood education. In A. L. Costa (Ed.), Teaching for intelligence II (pp. 211–230). Chicago: Skylight Publishing.
Hill, C. (2001). Linguistic and cultural diversity: A growing challenge to American education. New York: College Board.
Hill, C. (2003). Integrating digital tools into a culturally diverse curriculum: An assessment model for the Pacesetter Program. Teachers College Record, 10, 278–296.
Hill, C., & Larsen, E. (2000). Children and reading tests. Stamford, CT: Ablex.
Hill, C., & Parry, K. (1989). Autonomous and pragmatic models of literacy: Reading assessment in adult education. Linguistics and Education, 1, 223–289.
Hill, C., & Parry, K. (1992). The test at the gate: Models of literacy in reading assessment. TESOL Quarterly, 26, 433–461.
Hill, C., & Parry, K. (1994). From testing to assessment: English as an international language. Harlow, UK: Longman.
Johnston, P. (1984). Reading comprehension assessment: A cognitive basis. Newark, DE: International Reading Association.
Klein, S. P., Hamilton, L. S., McCaffrey, D. F., & Stecher, B. M. (2000). What do test scores in Texas tell us? Santa Monica, CA: RAND.
Koretz, D., & Barron, S. I. (1998). The validity of gains on the Kentucky Instructional Results Information System (KIRIS) (MR-1014-EDU). Santa Monica, CA: RAND.
Koretz, D., Linn, R. L., Dunbar, S. B., & Shepard, L. A. (1991, April). The effects of high-stakes testing: Preliminary evidence about generalization across tests. In R. L. Linn (Chair), The effects of high stakes testing. Symposium conducted at the Annual Meeting of the American Educational Research Association and the National Council on Measurement in Education, Chicago, IL.
New York State Testing Program. (2002a). English language arts, Book 1 (Grade 4). Monterey, CA: CTB/McGraw-Hill.
New York State Testing Program. (2002b). English language arts, Book 2 (Grade 4). Monterey, CA: CTB/McGraw-Hill.
New York State Testing Program. (2002c). Listening selection (Grade 4). Monterey, CA: CTB/McGraw-Hill.
New York State Testing Program. (2002d). Listening: Scoring guide and scorer practice set (English Language Arts, Grade 4). Albany, NY: New York State Education Department.
New York State Testing Program. (2002e). Reading: Scoring guide and scorer practice set (English Language Arts, Grade 4). Albany, NY: New York State Education Department.
Russell, M., & Abrams, L. (2004). Instructional uses of computers for writing: The effect of state testing programs. Teachers College Record, 106, 1332–1357.
Stecher, B. M., Barron, S., Kaganoff, T., & Goodwin, J. (1998). The effects of standards-based assessment on classroom practices: Results of the 1996-97 RAND Survey of Kentucky Teachers of Mathematics and Writing (CSE Technical Report 482). Los Angeles: Center for Research on Evaluation, Standards, and Student Testing.
Thorndike, E. (1915). An improved scale for measuring ability in reading. Teachers College Record, 16, 31–53, 445–467.
Trabasso, T. (1981). On the making of inferences during reading and their assessment. In J. T. Guthrie (Ed.), Comprehension and teaching: Research reviews (pp. 56–75). Newark, DE: International Reading Association.
van Dijk, T. A., & Kintsch, W. (1983). Strategies of discourse comprehension. New York: Academic Press.
Wang, H., & Hill, C. (1999). Short-answer questions in College English testing. In Y. Teng (Ed.), Second language acquisition and English learning strategies (pp. 317–338). Nanjing, China: Nanjing University Press.
Wiggins, G. (1989). Teaching to the (authentic) test. Educational Leadership, 46(7), 41–47.
CLIFFORD HILL is the Arthur I. Gates Professor of Language and Education Emeritus at Teachers College, Columbia University. He has written various articles and books that deal with language and literacy assessment, most notably Children and Reading Tests and From Testing to Assessment: English as an International Language. Another major area of Dr. Hill's research is concerned with how language represents space and time. This research has been funded by international research institutes, such as the Max Planck Institut für Psycholinguistik and the Institut Nationale de Recherches Pédagogiques. His publications in this area have been translated into other languages. He currently directs a research project on digitally based assessment, jointly funded by the College Board, Teachers College, Columbia University, and the Office of Educational Research and Improvement (OERI) in the U.S. Department of Education.