An Experimental Study of the Effects of Monetary Incentives on Performance on the 12th-Grade NAEP Reading Assessment


by Henry Braun, Irwin Kirsch & Kentaro Yamamoto - 2011

Background/context: The National Assessment of Educational Progress (NAEP) is the only comparative assessment of academic competencies regularly administered to nationally representative samples of students enrolled in Grades 4, 8, and 12. Because NAEP is a low-stakes assessment, there are long-standing questions about the level of engagement and effort of the 12th graders who participate in the assessment and, consequently, about the validity of the reported results.

Purpose/Focus: This study investigates the effects of monetary incentives on the performance of 12th graders on a reading assessment closely modeled on the NAEP reading test in order to evaluate the likelihood that scores obtained at regular administrations underestimate student capabilities.

Population: The study assessed more than 2,600 students in a convenience sample of 59 schools in seven states. The schools are heterogeneous with respect to demographics and type of location.

Intervention: There were three conditions: a control and two incentive interventions. For the fixed incentive, students were offered $20 at the start of the session. For the contingent incentive, students were offered $5 in advance and $15 for correct responses to each of two randomly chosen questions, for a maximum payout of $35. All students were administered one of eight booklets comprising two reading blocks (a passage with associated questions) and a background questionnaire. All reading blocks were operational blocks released by NAEP.

Research Design: This was a randomized controlled field trial. Students agreed to participate without knowing that monetary incentives would be offered. Random allocation to condition was conducted independently in each school.

Data Collection/Analysis: Regular NAEP contractors administered the assessments and carried out preliminary data processing. Scaling of results and linking to the NAEP reporting scale were conducted using standard NAEP procedures.

Findings: Monetary incentives have a statistically significant and substantively important impact on both student engagement/effort and performance overall, and for most subgroups defined by gender, race, and background characteristics. For both males and females, the effect of the contingent incentive was more than 5 NAEP score points, corresponding to one quarter of the difference in the average scores between Grades 8 and 12. In general, the effect of the contingent incentive was larger than that of the fixed incentive, particularly for lower scoring subgroups.

Conclusions/Recommendations: There is now credible evidence that NAEP may both underestimate the reading abilities of students enrolled in 12th grade and yield biased estimates of certain achievement gaps. Responsible officials should take this into account as they plan changes to the NAEP reading framework and expand the scope of the 12th-grade assessment survey.

INTRODUCTION


The National Assessment of Educational Progress (NAEP) is a federally funded program that operates under the auspices of the National Assessment Governing Board (NAGB) and is managed by the National Center for Education Statistics (NCES). The work of NAEP is conducted by a consortium of private, nonprofit, and commercial organizations under contract to NCES.1 NAEP is the only regularly administered assessment of academic competencies that is given to nationally representative samples of students in Grades 4, 8, and 12.2 Under the No Child Left Behind (NCLB) Act, states are required to participate in the NAEP reading and mathematics assessments for Grades 4 and 8. Consequently, the national sample in those grades is now the aggregate of the state samples.


Although NAEP’s 12th-grade assessment is not mentioned in the NCLB Act, it has received greater attention as human capital issues have become increasingly important in terms of our global competitiveness, especially in light of concerns over the productivity of our nation’s school systems (New Commission on the Skills of the American Workforce, 2008; see also Kirsch, Braun, Yamamoto, & Sum, 2007, and references therein). These concerns are not new; 2008 marks the 25th anniversary of A Nation at Risk, the bellwether report that many feel played a key role in initiating the wave of education reforms at both the state and federal levels that continues to this day.3


American high schools have been the subject of much attention of late, with substantial evidence of high dropout rates and the low skills possessed by many students who do persist and graduate (Belfield & Levin, 2007; Orfield, 2004). Indeed, there are reports that in many states, more than half of high school graduates entering community colleges must register for at least one “developmental” course before taking a full load of credit courses.4 These findings, at least in part, have led to the recognition of the need for credible national indicators of the proficiencies of high school graduates. In response, the NAGB established the National Commission on NAEP 12th Grade Assessment and Reporting. The commission’s report, released in 2004, recommended that 12th-grade NAEP be redesigned so that it can report on “the readiness of 12th graders for college, training for employment, and entrance into the military” (p. 1). It also argued for expanding “state NAEP” to include 12th-grade NAEP. (In fact, the 2009 administration of 12th-grade NAEP included 11 state samples.) The report goes on to discuss some of the work that would need to be done to lay the groundwork for such an ambitious overhaul.


Notwithstanding the current and proposed roles for 12th-grade NAEP, a number of issues have arisen with regard to the utility of the reported results. Because NAEP only samples enrolled students, the large numbers of young people in the corresponding age cohort who have left school by the spring semester of 12th grade are not represented in the sample. Although it is generally understood that inferences from the NAEP sample can only be made to the in-school population, from a human capital perspective, this is a serious limitation.5 Moreover, comparisons over time are confounded with how cumulative dropout rates by the midpoint of the 12th grade vary from one cohort to another.


With respect to the in-school population, the validity of inferences based on NAEP depends critically on both the quality of the sample drawn and the nature of the nonresponse. Although sample selection is designed and executed with great care, nonresponse rates at the school and student levels are somewhat higher than desirable.6 Classical sampling theory asserts that if selected students and schools that choose not to participate differ in the distribution of achievement from those that do, then estimates based on the obtained NAEP sample will be biased.7


Finally, there are long-standing questions about the level of engagement and effort in completing the assessment among those students who do participate in NAEP. NAEP is a “low-stakes” assessment, with the results having no bearing on the students’ academic record. Given the diversity of educational experiences and interests among 12th graders, devising an instrument that would be regarded as meaningful and appropriate by all students is a considerable challenge.


These issues were recognized by the commission, and indeed, one of its recommendations was that “NAEP’s leaders . . . should develop and implement bold and dramatically new incentives to increase the participation of high schools and 12th grade students in NAEP and the motivation of 12th grade students to do their best on NAEP” (National Commission on NAEP 12th Grade Assessment and Reporting, 2004, p. 8). The commission made some suggestions in this regard but recognized the need for more research with respect to both participation and motivation. A brief review of the extant research is presented in the literature review that follows. We note here only that a survey of the motivational factors and social influences that can impact student engagement and effort can be found in Brophy and Ames (2005). They concluded,


Students have little reason to choose to take the NAEP tests, to fully engage in doing so, or to use strategies likely to optimize their performance. . . . Taken together, these considerations point toward the conclusion that NAEP assessments of twelfth graders faces [sic] daunting motivational obstacles that are difficult to overcome, so that efforts to do so are not likely to be successful. (p. 16)


Nonetheless, enthusiasm and support for 12th-grade NAEP generally remains high. Accordingly, it is all the more essential to attempt to quantify the degree to which the outcomes for students taking 12th-grade NAEP reasonably reflect their capabilities and, if not, what are the possible impacts on the many statistics reported by NAEP. The primary goal of this study, then, was to provide policy makers with credible evidence on the possibility of bias in reported NAEP statistics resulting from differential motivation and engagement among various subgroups of interest.


Because it is logistically difficult to study this issue in the context of operational NAEP, we carried out an independent randomized experiment in which some students took the assessment under “control” conditions, whereas others took the assessment under one of two incentive conditions (the “treatments”) involving monetary payments. The treatment effect is typically defined to be the average difference in outcomes between each incentive condition and the control condition. However, there was also interest in investigating the extent to which treatment effects varied across different subgroups of the population.


The study samples were designed to be large enough to detect treatment effects of practical interest. In addition, the study was implemented in such a way that the treatment effects could be properly represented on the NAEP scale, thereby adding to their utility and interpretability. As will be clear from the literature review to follow, most of the experimental studies of the effects of extrinsic motivation on academic outcomes have employed assessments of mathematics. For the present study, we selected the NAEP reading assessment. In part, this was to broaden the range of NAEP tests that have been examined. More important, one can plausibly conjecture that at the 12th grade, increased motivation is more likely to have an impact on reading performance than on mathematics performance, because the latter depends more directly on the courses taken. That is, greater motivation cannot compensate for the absence of relevant skills.


This study is noteworthy for both the size and heterogeneity of the sample of participating schools and students. Although the sample was not nationally representative, it did include schools from various regions and types of locations. In addition, key race/ethnic groups were well represented. The study also had high internal validity; that is, the random allocation of students to the control and treatment groups within each school appeared to have been well executed, and the linking of the results to the NAEP scale was relatively straightforward. At the same time, the estimated standard errors are somewhat larger than desirable, particularly for some subgroups.8


LITERATURE REVIEW


There has been long-standing interest in the relationship between motivation and performance. The general review by Wise and DeMars (2005) concluded that the average difference in performance scores between motivated and unmotivated students is approximately 0.6 standard deviations.9 Turning to NAEP, there have been ongoing concerns regarding differential student motivation and its impact on the reported results. Lazer (n.d.) provided some evidence that student motivation may play a role in the accuracy of 12th-grade NAEP data. He reported that 12th-grade students are more likely than their younger peers to omit questions that involve writing or independently constructing a response. In addition, he noted that NAEP data indicate that seniors are significantly more likely than fourth- and eighth-grade students to report not trying as hard on NAEP as on other tests. For example, in 2000, 45% of 12th graders reported not trying as hard on NAEP as on other mathematics tests; the corresponding percentages among fourth- and eighth-grade students participating in NAEP were 8% and 20%, respectively.


Issues of motivation were also raised by Hoffman (2004) and by Brophy and Ames (2005), both of whom concluded that NAEP faces daunting challenges in this respect. In part, these challenges stem from the fact that there is no incentive for students to participate and do their best, nor for local schools or districts to value their participation in 12th-grade NAEP. For example, NAEP is administered in the middle of the second term, when high school seniors may well be less focused on academic issues. In addition, NAEP is not aligned to a school’s curriculum and does not yield any outcomes that are directly related to local schools or districts.


Hoffman (2004) conducted a review of the literature on motivation theory to examine the implications for NAEP participation and performance. Brophy and Ames (2005) provided an overview of the motivational research in education and stressed the relevance of this literature to the challenges NAEP faces, noting that it deals with individuals choosing whether to participate in an activity and the factors that influence the quality of such participation. Both sets of authors recognized the relevance of expectancy × value theory as it might relate to test motivation. Basically, this theory says that the effort individuals are willing to invest in a task reflects the extent to which they expect to be able to perform the task and whether they value the rewards they will receive or see the activity as important and worthwhile.


Brophy and Ames (2005) concluded that despite the challenges that arise from the fact that large-scale assessments do not have direct consequences for students, teachers, or school districts, the motivational literature does suggest some strategies that might be employed. One of these is the use of monetary incentives. They noted, “Monetary payment is the ideal choice of incentive because it is easy to administer, and because money is attractive to all students” (p. 19).


Despite the increased focus on large-scale assessments generally, and 12th-grade NAEP specifically, there have been relatively few investigations of the role of monetary incentives on student performance in low-stakes testing situations. A meta-analysis conducted by Deci, Koestner, and Ryan (1999) in examining the effects of extrinsic rewards on intrinsic motivation did not report any studies that directly examined this issue. In our search of the literature, we did not locate studies that directly measured the impact of monetary rewards on student performance in low-stakes tests such as NAEP, PISA, or TIMSS (Brown & Walberg, 1993; Kiplinger & Linn, 1996).


We found only two teams of researchers who reported experimental studies examining the impact of monetary incentives on student performance on items drawn from large-scale national and international surveys. One, led by H. O’Neil, reported several studies dating back more than a decade that used monetary incentives to increase student motivation and performance on released NAEP mathematics items (O’Neil, Sugrue, & Baker, 1996). The other, led by J. Baumert, focused on mathematics items used in the PISA 2000 survey of 15-year-olds (Baumert & Demmrich, 2001). More recently, O’Neil, Abedi, Miyoshi, and Mastergeorge (2005) published a study using released TIMSS mathematics items to examine the effect of motivation on 12th-grade student performance.


In their 1996 study, O’Neil et al. manipulated test motivation using monetary and other types of incentives to examine their impact on the performance of students in Grades 8 and 12 in the United States responding to a set of NAEP mathematics items. In the monetary treatment groups, students were told that they would receive $1 for each item they answered correctly. At both grade levels, a control group received the same set of NAEP mathematics items, administered following standard NAEP procedures. Both main and interaction effects were tested at each grade level for the full sample of students and for the subsample that could correctly identify their treatment group. The authors felt that students who were unable to identify their treatment group might also have been less likely to increase their motivational state in the testing situation. A main effect for the monetary treatment was found only for the subsample of eighth-grade students (those who correctly identified their treatment condition) working on easy and moderately difficult items. In contrast, no treatment effects were found for the full sample of 12th graders (even for easier items) or for the subsample of students who correctly identified their treatment condition.


In the study that used released TIMSS mathematics items, O’Neil et al. (2005) attempted to “improve” on the results they found using NAEP mathematics items with students in Grades 8 and 12 in the United States. They argued that the primary reason they did not see an effect for 12th graders was that the size of the incentive was not large enough and that many of the participants in the earlier study were actually surprised that they received a payment.


All the studies reported here were carefully designed and executed. Nonetheless, there are design choices that can yield greater power and sensitivity, leading to better estimates of the impact of monetary incentives on test performance.  First, almost all the studies cited here focus on mathematics. As the various authors noted, tests comprising such items are highly dependent on specific knowledge that is most commonly acquired through the type and number of mathematics courses that have been taken. Given our interest in 12th-grade NAEP, we decided that an interesting alternative would be to focus on reading literacy instead of mathematics because the former is less dependent on specific coursework. In addition, passages selected for inclusion in NAEP are typically school based and relatively long, requiring sustained reading on the part of a student. It is possible that in this context, monetary incentives might well have a stronger impact on motivation and performance.


Second, the O’Neil et al. (2005) study used a relatively small number of schools (fewer than 10) and students, limiting the range of students who might participate and the potential power to find main or interaction effects. Accordingly, we decided to include a larger number of high schools across several states, involving a much greater number of students.


A third area of improvement concerns the nature of the incentives. In the earlier studies, some of the treatment conditions involved different types of external rewards, not just monetary payments. Where financial rewards were employed, they appeared not to be immediate. That is, students had to wait some period of time before receiving their payment, possibly reducing the impact of the incentive on their performance. Our design incorporated immediate payment, as well as a device to heighten students’ awareness of their treatment condition.


STUDY DESIGN


INCENTIVES


Reflecting on the literature briefly reviewed in the preceding section, it was decided that the study should incorporate two different incentive strategies. For the first, on entering the class where the session was held, students were offered, in advance, a flat sum in appreciation for their participation. The rationale was that this was easy to administer and would strengthen students’ sense of obligation to exert effort. For the second, on entering the class where the session was held, students were offered a token sum for participation, and larger sums contingent on their performance on two randomly chosen items. The rationale was that the conditions of the reward would encourage students to work on all the items. The sums involved had to be large enough to have the intended motivational effect, but not so large as to be entirely impractical in an operational setting.10


For the control condition, students were read the standard NAEP instructions.11 For the first incentive condition, students were read the same NAEP instructions as the control group and told that at the conclusion of the session, they would receive a $20 gift card from either Target or Barnes & Noble in appreciation for their participation and doing their best to answer each item. On a sign-up sheet, they were asked to indicate their gift card preference and to sign their name.


For the second incentive condition, students were read the same NAEP instructions as the control group. They were informed that they would receive a $5 gift card at the end of the session. In addition, they were told that at the conclusion of the session, two questions would be selected at random from the booklet. The value of the gift card would be increased by $15 for each question answered correctly, for a maximum gift card value of $35. As with the other incentive group, they were given a choice between Target and Barnes & Noble and were asked to indicate their preference on a sign-up sheet and sign their name. For both incentive groups, having the students declare a preference was intended to reinforce the message that they would receive a monetary award at the conclusion of the session.


In point of fact, at the conclusion of all three types of sessions, students were given a $35 gift card to the store of their choice. The rationale was that it was undesirable to have students in the same school participate in the study but receive substantially different rewards.12 Because the distribution of the gift cards took place at the end of the session, this change did not compromise the integrity of the study and, not surprisingly, by all accounts, the students were pleased with the change.


COGNITIVE INSTRUMENTS


The reading framework used for the 2007  NAEP reading assessment has been the basis for developing reading assessments since 1992 (National Assessment Governing Board, 2006). It describes how reading should be assessed at Grade 12 and distinguishes among three contexts for reading, as well as among four aspects. The three contexts—reading for literary experience, reading for information, and reading to perform a task—provide guidance to test developers as to the types of texts that should be included in the assessment. The four aspects are intended to characterize the ways in which students might respond to these different texts.


Reading for literary experience involves the reader in exploring themes, events, characters, settings, and the language of literary works. Reading for information engages the reader with expository texts that describe the real world. Reading to perform a task involves reading to accomplish something. The framework specifies that 35% of the assessment be devoted to reading for literary experience, 45% to reading for information, and 20% to reading to perform a task.


As explained in the NAEP framework document, readers develop their understanding of a particular text in different ways. They focus on general topics or themes, interpret and integrate ideas within and across texts, make connections to background knowledge and experiences, and examine the content and structure of the text. NAEP’s questions are based on these four aspects of reading and require the selection and integration of various reading strategies rather than the application of a specific strategy or skill. The framework specifies that 50% of the assessment items must be devoted to forming a general understanding and developing an interpretation, 15% to making reader/text connections, and 35% to examining the content and structure of a text.


NAEP tasks are typically organized into 25-minute blocks, with each block consisting of a single reading passage and a set of associated questions. The questions comprise both multiple-choice and constructed response (student-produced) formats requiring either short or extended answers. At least half of the questions associated with each passage involve a constructed response format that requires students to write their answers and explain and support their ideas. Passage lengths range from 500 to 1,500 words. Each student is assigned two 25-minute reading blocks, as well as a final (common) block that contains background questions. School administrators are also asked to complete the NAEP school questionnaire.


Only two of the three contexts for reading were used in this study: reading for literary experience, and reading for information. Reading to perform a task was not included because the other two contexts were judged to be much more important (accounting for 80% of the test, according to the framework). Moreover, to include all three contexts would have required significantly expanding the number of booklets and the sample size.


The study design called for employing four released NAEP reading blocks.13 Two reading for literary experience blocks (denoted A and C) and two reading for information blocks (denoted B and D) were selected from among seven available released blocks. The four that were chosen for inclusion in this study had the closest overall match to the reading framework, with respect to both question formats and aspects, as shown in Table 1. Constructing booklets that matched the framework specifications as closely as possible is especially important because the number of booklets for the study is much smaller than the number employed in a regular NAEP administration.


Table 1. Booklet Design

Booklet | Total number of questions | Multiple choice | Essay questions (NAEP: >50%) | Forming a general understanding (NAEP: 50%) | Making reader/text connections (NAEP: 15%) | Examining content and structure (NAEP: 35%)
AB/BA | 20 | 6 | 14 (70%) | 8 (40%) | 4 (20%) | 8 (40%)
CD/DC | 20 | 7 | 13 (65%) | 9 (45%) | 4 (20%) | 7 (35%)
AC/CA | 22 | 9 | 13 (59%) | 10 (45%) | 4 (18%) | 8 (36%)
BD/DB | 18 | 4 | 14 (78%) | 7 (39%) | 4 (22%) | 7 (39%)


The four blocks were assembled into four different combinations of two blocks, with each such combination constituting a booklet (see Table 1). To control for order effects, each combination was presented twice, with a different block in the lead position. Consequently, there were eight different booklets in all. The four combinations were selected to provide the information needed to construct each subscale and to estimate the covariation between the two subscales. Booklets were randomly distributed to students in each school in the study.
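As a concrete illustration of this assembly scheme, the eight booklets can be enumerated as in the short sketch below. The block pairings follow Table 1; the snippet is purely illustrative and is not the operational booklet-assembly procedure.

```python
# Sketch: enumerate the eight study booklets from the four released blocks.
# A and C are reading-for-literary-experience blocks; B and D are
# reading-for-information blocks. Each chosen pair appears in both orders
# to control for position effects.
from itertools import permutations

block_pairs = [("A", "B"), ("C", "D"), ("A", "C"), ("B", "D")]
booklets = [order for pair in block_pairs for order in permutations(pair)]
print(booklets)
# [('A', 'B'), ('B', 'A'), ('C', 'D'), ('D', 'C'),
#  ('A', 'C'), ('C', 'A'), ('B', 'D'), ('D', 'B')]
```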


The first pair of booklets displayed in Table 1 contains blocks A and B, in either AB or BA order. Together, the two blocks contain 20 questions: 6 multiple choice and 14 essay questions (i.e., requiring a student-produced response). The proportion of essay questions, 70%, exceeds the test blueprint requirement of 50%. The next three columns compare the task profile of the booklets with that mandated in the blueprint. For example, with respect to forming a general understanding, the requirement is that 50% of the questions be linked to that goal; for this booklet, only 40% of the questions qualify.


BACKGROUND QUESTIONNAIRE


The background questions appeared as a block at the end of each of the eight booklets. Study items were selected from the operational NAEP student questionnaire. Two sections were selected for inclusion. The first section contained 14 questions. Students were asked to identify their gender and race/ethnicity and to record the types and number of written materials in their home, whether they had a computer in their home, the highest level of education attained by their mother and father, and the number of days they were absent from school in the last month. The second section asked students about their reading practices, their expectations regarding future education and activities, and their level of effort on this and other tests they have taken.


SAMPLE SELECTION


A principal consideration in the design of the study was to gather sufficient data so as to have reasonable power in detecting substantively meaningful treatment effects. The investigators conducted a power analysis under the assumption that random samples of students in each school would be assigned to each of the three treatment conditions.14 Given budgetary constraints, it was impossible to attempt to obtain a nationally representative sample of schools. However, there was a deliberate effort to recruit schools from diverse locales serving students from different backgrounds.


The power analysis, which took into account the nested structure of the sample, indicated that a sample of 60 schools with approximately 60 students per school (i.e., 20 students randomly assigned to each condition in each school) would be sufficient to achieve the desired power. With the assistance of a number of NAEP state coordinators and Westat, the investigators recruited 64 schools in seven states to participate in the study. Schools were offered $200 for taking part in the study. Because of last-minute cancellations, the final tally was 59 schools in seven states. Student response rates by state varied from 23.1% in Massachusetts to 78.2% in Mississippi.15 See Table 2 for the breakdown by state.
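For readers who want the flavor of such a calculation, the sketch below computes approximate power for comparing one incentive condition with the control under a simple normal approximation. The intraclass-correlation-driven design effect and the target effect size are illustrative assumptions only; the investigators' own power analysis modeled the nested structure directly and may have used different inputs.

```python
# Illustrative power calculation for comparing one incentive condition with the
# control. Because students are randomized to conditions within schools, the
# design effect for the treatment contrast is modest; the value below is an
# assumption for illustration, not a figure from the study.
from scipy.stats import norm

def approx_power(n_schools, n_per_cell, effect_size, design_effect=1.0, alpha=0.05):
    n_per_condition = n_schools * n_per_cell
    n_eff = n_per_condition / design_effect      # effective n per condition
    se = (2.0 / n_eff) ** 0.5                    # SE of a standardized mean difference
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(effect_size / se - z_crit)

# 60 schools, 20 students per condition per school, target effect of 0.15 SD.
print(round(approx_power(60, 20, effect_size=0.15, design_effect=1.5), 2))
```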


Table 2. Student Response Rates by State

State | Number of schools | Response rate (%)
Florida | 10 | 66.8
Massachusetts | 7 | 23.1
Michigan | 23 | 60.2
Mississippi | 8 | 78.2
New Jersey | 2 | 40.0
New Mexico | 4 | 45.8
Wyoming | 5 | 57.1
Overall | 59 | 56.0


Altogether, 4,663 students were invited to participate and assigned a priori to one of the three conditions. Ultimately, 2,612 students were assessed, corresponding to an overall response rate of 56%. The participation rates by condition were nearly equal (see Table 3) and, more important, the student samples were well matched on the background characteristics that are typically associated with academic achievement. Thus, neither the overall student response rate nor the variability across states undermined the internal validity of the study.


Table 3. Student Response Rates by Condition

 | Control group | Incentive 1 | Incentive 2 | Total
Number of students to be assessed | 1,552 | 1,565 | 1,546 | 4,663
Number of students actually assessed | 835 | 884 | 893 | 2,612
Percentage assessed | 53.8 | 56.5 | 57.8 | 56.0


It is crucial to note that students agreed to participate in the study with no foreknowledge that there would be monetary rewards. That information was known only to the school principal and the individual who served as the study coordinator for the school. They were asked to keep this information absolutely confidential and, as far as we know, this was done.16 Moreover, all sessions in a school were scheduled at the same time or at adjacent periods, so there was limited opportunity for “contamination” within a school.


ADMINISTRATION


Another goal of the study was to collect data in such a manner as to facilitate a credible linking to the NAEP scale. To this end, the investigators (1) enlisted Westat (the contractor that performs these functions in regular administrations of NAEP) to carry out school recruitment and administration of the assessment and (2) enlisted Pearson (the contractor that performs these functions in regular administrations of NAEP) to carry out the preparation and shipping of the test booklets, quality control for the returned materials, and processing and scoring of the answer booklets, and to prepare data files for analysis. Throughout the implementation phase, regular NAEP procedures were followed to the extent possible.


The main deviation from operational NAEP concerned the timing of the administration. Although 12th-grade NAEP is administered in February and March, for logistical reasons, it was impossible to conduct the study during this period. Accordingly, the investigators settled on a fall (October and November) administration. It was felt that reading proficiency would not differ much between mid-fall and midwinter and, moreover, that the results from the control group could be compared with those from regular administrations.


Some other differences from a regular NAEP administration were: (1) there were fewer inducements for students to participate in the study,  (2) only very limited accommodations for students with disabilities were offered, (3) no make-up sessions were available, (4) the instructions read to the control group classes differed very slightly from those read in regular administrations, and (5) the instructions read to the incentive group classes were the same as those read to the control group classes, with the addition of information related to the particular incentive.


SCORING


Multiple-choice responses were scored directly. Open-ended responses for each block were evaluated by teams of scorers, trained and supervised by experienced leaders recruited by Pearson, as is the case in operational NAEP. Training sets of responses and scoring rubrics drawn from archives were employed, and the process was carried out in a manner analogous to standard NAEP procedures.17 Rater reliability was estimated by double-reading a 10% random subsample. With the exception of one polytomous item, exact agreement ranged from 88% to 98%, with a median value above 90%.
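A minimal sketch of the double-scoring check described above follows; the 10% rate comes from the text, while the function names and data structures are hypothetical.

```python
# Sketch: estimate rater reliability by rescoring a 10% random subsample and
# computing the proportion of exact score matches.
import random

def select_double_score_sample(response_ids, rate=0.10, seed=0):
    """Pick a random subsample of responses to be scored a second time."""
    rng = random.Random(seed)
    k = max(1, int(rate * len(response_ids)))
    return rng.sample(response_ids, k)

def exact_agreement(first_scores, second_scores):
    """Proportion of double-scored responses given identical scores by both raters."""
    matches = sum(a == b for a, b in zip(first_scores, second_scores))
    return matches / len(first_scores)
```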


DATA PREPARATION


INTRODUCTION


The argument supporting the representation of the study results on the NAEP reporting scale has two aspects. The substantive aspect notes the use of NAEP released item blocks (representing the two dominant subscales of the reading framework), as well as the scrupulous adherence to NAEP procedures for administration and data processing. The second aspect rests on the technical validity of the statistical links from the raw scores to the NAEP scale. The linkage was conducted in a carefully designed sequence of analyses, beginning with an examination of classical item statistics, followed by the estimation of a number of item response theory (IRT) models and an examination of the quality of the different fits to the data. Upon obtaining satisfactory results, the conditioning and linking steps of the standard NAEP procedure were implemented. As is the case with regular NAEP administrations, the psychometric analyses were carried out separately for each subscale, with the composite reporting scale constructed only at the last step.


ITEM ANALYSIS


Item analyses were conducted separately for each of the experimental conditions. For multiple-choice items (Type 1) and dichotomously scored constructed response items (Type 2), standard item statistics, including proportion correct and r-biserials with block totals, were computed.18 Extended constructed response items (Type 3) were scored on a 4-point scale. For these items, a generalization of the r-biserial was employed.19 Summary results for the proportions correct are displayed in Table 4. Examination of Table 4 reveals that when confronted with constructed response items, students in the incentive conditions obtained higher proportions correct than did students in the control condition. For example, for Incentive 2, the difference was about 0.05. Although this does not appear to be large, recall that most of the NAEP score points are derived from performance on constructed response items. Consequently, the higher proportions correct noted here can have a substantial impact on reported scores and in fact foreshadow the results to be obtained with more sophisticated approaches.
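For concreteness, the classical statistics referred to here can be computed along the lines sketched below; the point-biserial is a simpler stand-in for the r-biserial used operationally, and the variable names are illustrative.

```python
# Sketch: classical item statistics for a dichotomously scored item.
import numpy as np

def proportion_correct(item_scores):
    """Mean of 0/1 item scores."""
    return float(np.mean(item_scores))

def item_total_correlation(item_scores, block_totals):
    """Point-biserial correlation of the item with the rest of its block
    (the item is removed from the total to avoid part-whole inflation)."""
    item = np.asarray(item_scores, dtype=float)
    rest = np.asarray(block_totals, dtype=float) - item
    return float(np.corrcoef(item, rest)[0, 1])
```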


Table 4. Average Item Proportions Correct by Item Type and Incentive Condition

N | Item type | Control group | Incentive 1 | Incentive 2
13 | Multiple choice | .62 | .63 | .62
18 | CR-Dichotomous | .55 | .59 | .60
10 | CR-Polytomous | .52 | .54 | .57


Auxiliary analyses (not shown) indicate that the differences in performance were greater on the information blocks, which included greater numbers of constructed response items. There were only small differences by condition with respect to off-task responses, omitted items, and items not reached. In the aggregate, however, the control group had a slightly higher percentage of items with missing responses than did the two incentive groups.


PRELIMINARY SCALING


NAEP relies on IRT to carry out the scaling of the data. This involves obtaining for each item estimates of the parameters that characterize the probability of a particular response as a function of location along a latent unidimensional proficiency scale. This function is known as the item characteristic curve (ICC). A version of the Bilog-Parscale software package adapted for the NAEP data structure was used in this study.20 For multiple-choice and dichotomously scored constructed response items, a standard three-parameter IRT model was fit to the data. For polytomously scored items, a partial credit model was employed.
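The two model families can be written compactly as in the sketch below. It uses conventional parameter names (discrimination a, difficulty b, guessing c, step parameters d_k) and the 1.7 scaling constant commonly applied in this tradition; it is a sketch, not the Bilog-Parscale implementation itself.

```python
# Sketch: item response functions for the two model families used in the scaling.
import numpy as np

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta - b)))

def p_partial_credit(theta, a, b, steps):
    """Category probabilities (0..m) under a generalized partial credit model
    with item location b and step parameters d_1..d_m."""
    z = np.cumsum([0.0] + [1.7 * a * (theta - b + d) for d in steps])
    numerators = np.exp(z - z.max())   # subtract the max for numerical stability
    return numerators / numerators.sum()
```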


For the first stage of analysis, IRT models were fit to the data for each subscale, separately by condition. A comparison of the estimated item parameters for the control condition with archival values showed close agreement, indicating that the items had similar operating characteristics despite differences in the timing and the year of administration. For the second stage, these results were compared with those obtained by fitting the IRT models, pooling the data over conditions. The latter approach differs from regular NAEP only in that each group was assumed to have a normal distribution of latent proficiency with its own mean and variance. However, the parameters that describe the ICC for each item were assumed to be the same for all three groups. The fit of the latter model was acceptable for all three groups (i.e., good internal consistency) and nearly as good as the fit of the models when estimated separately by group.


NAEP SCALING


For this stage of analysis, the item parameters for each item were constrained to be equal to their archival values (based on earlier administrations), and the three-group model was refit to the data. The output was a set of estimates of the means and variances of the three proficiency distributions. A number of the polytomously scored constructed response items displayed some lack of fit: four from the reading for literary experience subscale and three from the reading for information subscale. The scaling was then rerun, allowing the model parameters for those items to be estimated along with the parameters of the proficiency distributions.21 This resulted in slight changes in the estimates of the latter parameters. A pseudo-chi-square goodness-of-fit statistic was calculated for each combination of group and subscale. The results indicated generally reasonable fits for all three conditions with respect to both subscales.


Effect sizes were calculated using the standard deviation for the control group as the denominator and are displayed in Table 5. They range from 0.08 to 0.25 and are greater for the Incentive 2 condition. They are also greater for the information subscale, which is based on a greater number of polytomously scored items than is the other subscale. Note that these effect sizes do not translate easily to score differences on the NAEP reporting scale.
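In symbols, each entry in Table 5 is a standardized mean difference of the form

$$\mathrm{ES}_j \;=\; \frac{\bar{\theta}_{\text{Incentive } j} - \bar{\theta}_{\text{Control}}}{\mathrm{SD}_{\text{Control}}}, \qquad j = 1, 2,$$

computed separately for each subscale on the latent proficiency metric.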


Table 5. Effect Sizes by Subscale, Item Parameters Based on Archival Data and Study Data

Subscale | Incentive 1 to Control | Incentive 2 to Control
Literary experience | 0.08 | 0.13
Information | 0.15 | 0.25


CONDITIONING


The NAEP data structure can be characterized as an incomplete matrix of students by items (i.e., each student only takes a small fraction of the total set of items associated with each subscale). Because of the sparseness of the matrix, estimates of students’ locations on the latent proficiency scale based only on their responses to the items administered will be both biased and volatile.22  


To address this problem, NAEP augments the information in the cognitive data with ancillary data drawn from the student, teacher, and school questionnaires. Because of the amount of background data available, NAEP constructs a large number of principal components that account for more than 90% of the variation in the background variables. In the subsequent conditioning phase, the test data and the principal components based on the background data are combined in a multivariate latent regression model to generate for each student a family of multivariate posterior distributions of proficiency with respect to each of the three subscales.23


Then, for each student, five members of the family of posterior distributions are selected at random, and a single draw is made at random from each of the distributions. Each draw or realization consists of a vector representing potential scores on the three subscales. The five realizations are called “plausible values” in NAEP terminology, yielding five data sets for analysis. Typically, analyses are run separately for each set of plausible values, and the average over the five results is taken as an estimate of the population quantity of interest. The variability among the five results is used to obtain an estimate of the measurement error in the estimate of the population quantity. When combined with an estimate of the error due to the sampling of students and schools, an estimate of the total error is derived.
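A sketch of this combination rule is given below; the function is illustrative (it is not the NAEP analysis software), and the example numbers are invented solely to show the calling convention.

```python
# Sketch: combine results computed once per plausible value (M = 5 in NAEP).
import numpy as np

def combine_plausible_values(estimates, sampling_variance):
    """Average the M per-plausible-value estimates and fold the variability
    among them (measurement error) into the total standard error."""
    est = np.asarray(estimates, dtype=float)
    m = len(est)
    point_estimate = est.mean()
    between = est.var(ddof=1)                    # variance among the M results
    measurement_var = (1.0 + 1.0 / m) * between  # usual imputation adjustment
    total_var = sampling_variance + measurement_var
    return point_estimate, float(np.sqrt(total_var))

# Hypothetical subgroup mean estimated on each of five plausible-value data sets,
# with a sampling variance supplied by the survey variance procedure.
mean, se = combine_plausible_values([289.0, 289.6, 288.7, 289.9, 289.3],
                                    sampling_variance=2.6)
```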


These NAEP procedures were followed, with the difference that two (rather than three) subscales were involved, and only information from the student questionnaire was used. Only 92 principal components were derived from the student variables and from selected two-way interactions among them.24 These principal components captured more than 90% of the available variance. Analyses were then run five times, once for each set of plausible values, and the results averaged to obtain the estimates reported here.


LINKING


Again following NAEP procedures, a linear transformation based on archival parameters was employed to link the plausible values for each subscale to the corresponding NAEP subscale. Once that was done, the transformed subscales were combined to form the composite reporting scale. In main NAEP, as noted previously, there are three subscales. The two subscales included in this study were each assigned a relative weight of 0.4, and the third subscale (reading to perform a task) was assigned a weight of 0.2. For this study, the composite scale was constructed by assigning the two subscales equal weights.


RESULTS


INTRODUCTION


The analyses reported next are organized as follows: We first examine the effects of the incentives on student self-reported engagement and effort, and then the effects on performance, overall and for subgroups defined by various combinations of characteristics. Most of the tables present results separately for males and females because there are interactions of interest. The purpose is to determine whether the overall treatment effects also reflect effects at lower levels of aggregation. Consistency in sign and, to a lesser degree, in magnitude adds to the credibility of the findings.


Because there are some differences among the students allocated to the three conditions with respect to relevant characteristics, there is a question of what proportion of the observed treatment effects can be accounted for statistically by those differences. Accordingly, we carry out an analysis of variance using a number of demographic and behavioral characteristics, as well as a set of indicator variables that distinguish students with different reading patterns. This approach is complemented by the identification of those subgroups for which the incentives have had the greatest impact.


The results of the study indicate that both monetary incentives have an appreciable effect on student performance, with the contingent incentive (Incentive 2) generally having a larger effect than the fixed incentive (Incentive 1). These aggregate effects can be represented graphically (Figure 1), employing a proposal by Holland (2002) on how to display the difference between two cumulative distribution functions. In this case, the score distribution for each incentive condition is compared with the score distribution under the control condition. To construct each graph, a convenient set of percentiles is selected and, for each percentile, the difference in the corresponding quantiles for the two distributions is plotted. In essence, this amounts to looking at the horizontal difference between the cumulative distribution functions at selected points along the vertical (probability) scale.25 To construct Figure 1, 12 percentiles were chosen, and a smooth curve passed through the points obtained.
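The construction can be sketched as follows; the set of percentiles and the variable names are illustrative, since the paper does not list the twelve percentiles it used.

```python
# Sketch: quantile-difference computation in the spirit of Holland (2002).
# Plotting the returned differences against the percentiles (e.g., with
# matplotlib) reproduces the style of Figure 1.
import numpy as np

def quantile_differences(treatment_scores, control_scores, percentiles):
    """Horizontal gaps between the two cumulative distributions at the
    chosen percentiles (treatment quantile minus control quantile)."""
    return (np.percentile(treatment_scores, percentiles)
            - np.percentile(control_scores, percentiles))

percentiles = np.linspace(5, 95, 12)   # twelve illustrative percentiles
```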


Figure 1. Quantile difference plots for comparison of the two incentive subgroups with the control subgroup




It is evident from Figure 1 that the effect of Incentive 2 is positive over the entire range of the distribution, and that is true for Incentive 1, except (perhaps) for the lowest percentiles. Formally, this means that the score distribution for each incentive condition is stochastically larger than the score distribution for the control condition. It is noteworthy that the effect of Incentive 2 is stronger at the left tail of the distribution, suggesting that a contingent monetary reward has greater impact among typically lower scoring students. It is also at the left tail of the distribution that the differences in the effects of the incentives are largest. The impressions generated by this figure must be examined in light of the more detailed comparisons between similar groups of students (with respect to individual characteristics) exposed to the different conditions. However, it does suggest that there are interactions between the type of incentive and the type of student.   


In the tables that follow, statistics are accompanied by their estimated standard errors. With NAEP data, estimation of the standard errors of statistics is complicated by the nature of the survey design. The standard errors reported here were calculated using operational NAEP procedures that take account of both sampling error and measurement error. For further details, see Allen et al. (2001).


FINDINGS (CROSS-TABS)


We begin with student self-reports on levels of engagement and effort. Table 6 displays the results for the question, “How important was it to you to do well on this test?” and Table 7 displays the results for the question, “How hard did you try on this test in comparison to other tests?” Apparently, the incentives did have an impact. For example, from Table 6, we note that slightly fewer than 36% of students in the control group answered important or very important. The corresponding percentages for the two incentive groups were nearly 46% and 50%, respectively. Moreover, the score gains associated with the incentives were greatest among those who answered important or very important.


Table 6. How important was it to you to do well on this test?

 |  | Control group | Incentive 1 | Incentive 2
Not very important | N | 155 | 122 | 98
 | COL% | 19.0 | 14.1 | 11.1
 | Score mean | 287.0 (3.3) | 292.4 (3.5) | 288.7 (4.4)
Somewhat important | N | 369 | 348 | 345
 | COL% | 45.3 | 40.3 | 39.2
 | Score mean | 291.6 (2.1) | 293.0 (2.4) | 296.3 (2.6)
Important | N | 205 | 286 | 294
 | COL% | 25.2 | 33.1 | 33.4
 | Score mean | 289.9 (2.4) | 294.2 (2.2) | 297.7 (2.4)
Very important | N | 86 | 108 | 144
 | COL% | 10.6 | 12.5 | 16.3
 | Score mean | 282.6 (3.9) | 291.2 (4.5) | 295.1 (2.9)
Total | N | 815 | 864 | 881
 | COL% | 100 | 100 | 100
 | Score mean | 289.3 (1.7) | 293.1 (1.9) | 295.7 (1.8)


Table 7. How hard did you try on this test compared to other tests?

 |  | Control group | Incentive 1 | Incentive 2
Tried not as much | N | 306 | 232 | 212
 | COL% | 37.5 | 26.8 | 24.0
 | Score mean | 285.2 (2.0) | 286.8 (3.1) | 286.9 (3.0)
Tried about as much | N | 440 | 550 | 578
 | COL% | 54.0 | 63.5 | 65.5
 | Score mean | 295.7 (2.2) | 298.2 (1.8) | 300.4 (1.8)
Tried harder | N | 53 | 60 | 74
 | COL% | 6.5 | 6.9 | 8.4
 | Score mean | 269.8 (6.0) | 280.4 (4.1) | 290.1 (4.9)
Tried much harder | N | 16 | 24 | 18
 | COL% | 2.0 | 2.8 | 2.0
 | Score mean | 258.5 (7.4) | 271.8 (6.1) | 276.5 (7.0)


Similarly, from Table 7, we see that nearly 38% of students in the control group reported not trying as hard on this test. The corresponding percentages for the two incentive groups were 27% and 24%. The score gains associated with the incentives were greatest among those who reported trying harder or much harder on this test.


Table 8 displays the condition score means overall, by gender and by race.26 Note first that the numbers of students from a particular demographic category are very similar across conditions, attesting to the success of the implementation of the design. Next, we note that the mean for the control condition, 289.2 points, is 3 points greater than the reported mean for the 2005 12th-grade NAEP reading assessment. That is, the students participating in this study were somewhat more able than those who participated in the 2005 assessment. In particular, the average for males in the control group (286.7 points) is about 7 points greater than that reported for males in 2005, and the average for females in the control group (291.5 points) is about a half-point smaller than that reported for females in 2005. Although these findings do not in any way invalidate the analyses that follow, they do suggest some caution in generalizing the results to the national population.


Table 8. Study Statistics by Condition and Selected Student Characteristics

 |  | Control group | Incentive 1 | Incentive 2 | Total
Male | N | 393 | 389 | 409 | 1,191
 | Score mean | 286.7 (2.0) | 289.3 (2.2) | 292.2 (2.4) | 289.4 (1.8)
Female | N | 433 | 482 | 486 | 1,401
 | Score mean | 291.5 (2.1) | 295.4 (2.1) | 296.8 (2.2) | 294.7 (1.8)
Total | N | 826 | 871 | 895 | 2,592
 | Score mean | 289.2 (1.7) | 292.6 (1.9) | 294.7 (2.0) | 292.3 (1.7)
White, not Hispanic | N | 493 | 552 | 561 | 1,606
 | Score mean | 296.8 (1.5) | 302.9 (1.3) | 302.7 (1.9) | 300.9 (1.2)
Black, not Hispanic | N | 218 | 220 | 207 | 645
 | Score mean | 276.4 (2.2) | 274.3 (2.9) | 278.6 (2.7) | 276.4 (2.2)
Hispanic | N | 88 | 79 | 93 | 260
 | Score mean | 277.1 (3.2) | 274.1 (4.6) | 280.2 (5.5) | 277.3 (3.1)
Asian/Pacific Islander | N | 20 | 19 | 30 | 69
 | Score mean | 297.4 (7.5) | 292.7 (9.2) | 299.8 (5.6) | 297.2 (4.1)
American Indian/Alaska Native | N | 6 | 3 | 3 | 12
 | Score mean | 277.5 (13.3) | 300.4 (40.7) | 295.2 (17.1) | 287.6 (7.3)
Other | N | 2 | 2 | 3 | 7
 | Score mean | 307.6 (20.2) | 306.1 (46.8) | 301.5 (22.7) | 304.6 (14.9)


Turning to the estimation of treatment effects, students in the first incentive condition scored, on average, 3.4 points higher than those in the control condition, whereas students in the second incentive condition scored 5.5 points higher. Because the standard deviation of the scores overall is just under 36 points, the larger effect size is approximately 0.15. This is substantial and worthy of note.27 When the data are disaggregated by gender, the patterns are rather similar: For males, the effects of the two incentives are 2.6 points and 5.5 points, respectively. For females, they are 3.9 and 5.3 points, respectively. With regard to race/ethnicity, the effects are greatest for White students and for Incentive 2. For Black students and Hispanic students, the effects for Incentive 1 are negative but not significantly different from zero.
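Spelled out, the overall effect sizes implied by these figures are

$$d_{\text{Incentive 2}} \approx \frac{5.5}{36} \approx 0.15, \qquad d_{\text{Incentive 1}} \approx \frac{3.4}{36} \approx 0.09.$$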


Table 9 displays score means for students cross-classified by gender and race. Of the 12 estimated incentive effects, 8 are positive and 4 are negative. The effects appear to be strongest for Whites (males and females) and Hispanic females. The differential effect of the incentive has implications for the estimation of gaps between groups. For example, under the control condition, the Black–White gap for males is 20 points. Under the first incentive condition, it is 27.1 points, and under the second, it is 24.6 points. For females, the Black–White gap under the control condition is 20.4 points. Under the first incentive condition, it is 29.3 points, and under the second, it is 23.5 points.


Table 9. Study Statistics by Condition, Gender, and Race/Ethnicity

MALE |  | Male control | Male Incentive 1 | Male Incentive 2
White, not Hispanic | N | 224 | 246 | 263
 | COL% | 57.0 | 63.2 | 64.3
 | Score mean | 293.6 (2.0) | 299.2 (1.5) | 299.6 (2.6)
Black, not Hispanic | N | 104 | 101 | 93
 | COL% | 26.5 | 26.0 | 22.7
 | Score mean | 273.6 (3.6) | 272.1 (3.8) | 274.8 (3.8)
Hispanic | N | 50 | 31 | 36
 | COL% | 12.7 | 8.0 | 8.8
 | Score mean | 281.1 (4.2) | 261.7 (6.8) | 279.4 (6.1)

FEMALE |  | Female control | Female Incentive 1 | Female Incentive 2
White, not Hispanic | N | 269 | 302 | 298
 | COL% | 62.1 | 62.8 | 61.3
 | Score mean | 299.4 (2.0) | 305.5 (1.9) | 305.4 (1.9)
Black, not Hispanic | N | 114 | 119 | 112
 | COL% | 26.3 | 24.7 | 23.0
 | Score mean | 279.0 (3.4) | 276.2 (3.3) | 281.9 (3.3)
Hispanic | N | 37 | 47 | 57
 | COL% | 8.5 | 9.8 | 11.7
 | Score mean | 271.6 (4.9) | 281.4 (4.1) | 280.7 (7.9)


Table 10 presents score means for students cross-classified by gender and mother’s education. Of the 16 estimated effects, 13 are positive and 3 are negative. Overall, the effects are similar for males and females, with no strong patterns with respect to the level of mother’s education. Table 11 presents results for students cross-classified by gender and number of days absent in the previous month. As expected, for each combination of condition and gender, mean scores generally decreased with more frequent absences. The incentive effects, with one exception, were all positive and somewhat stronger for males than for females. In general, both males and females with higher frequencies of absences displayed the greatest effects.


Table 10. Study Statistics by Condition, Gender, and Mother’s Education Level

MALE |  | Male control | Male Incentive 1 | Male Incentive 2
Did not finish high school | N | 43 | 35 | 38
 | COL% | 11.2 | 9.2 | 9.4
 | Score mean | 280.8 (5.6) | 278.6 (6.1) | 276.3 (5.9)
Graduated high school | N | 89 | 93 | 90
 | COL% | 23.2 | 24.3 | 22.3
 | Score mean | 282.5 (3.1) | 285.3 (3.4) | 284.2 (3.3)
Some education after high school | N | 105 | 97 | 101
 | COL% | 27.3 | 25.4 | 25.1
 | Score mean | 288.0 (3.4) | 292.1 (3.2) | 294.4 (3.1)
Graduated college | N | 132 | 142 | 160
 | COL% | 34.4 | 37.2 | 39.7
 | Score mean | 292.3 (3.5) | 295.0 (2.5) | 303.6 (2.2)
I don't know | N | 15 | 15 | 14
 | COL% | 3.9 | 3.9 | 3.5
 | Score mean | 265.2 (8.4) | 283.7 (9.6) | 278.5 (8.4)

FEMALE |  | Female control | Female Incentive 1 | Female Incentive 2
Did not finish high school | N | 55 | 72 | 67
 | COL% | 12.9 | 15.0 | 14.0
 | Score mean | 280.3 (4.2) | 283.7 (3.2) | 284.4 (3.3)
Graduated high school | N | 107 | 100 | 127
 | COL% | 25.1 | 20.9 | 26.6
 | Score mean | 285.1 (3.5) | 287.0 (4.1) | 292.5 (3.2)
Some education after high school | N | 120 | 121 | 124
 | COL% | 28.1 | 25.3 | 25.9
 | Score mean | 298.2 (2.8) | 296.5 (3.3) | 301.2 (2.8)
Graduated college | N | 135 | 172 | 157
 | COL% | 31.6 | 35.9 | 32.8
 | Score mean | 298.8 (3.0) | 306.2 (2.6) | 305.4 (2.4)
I don't know | N | 10 | 14 | 3
 | COL% | 2.3 | 2.9 | 0.6
 | Score mean | 265.0 (10.5) | 275.4 (9.2) | 296.2 (12.3)


Table 11. Study Statistics by Condition, Gender, and Number of Days Absent From School Last Month

MALE |  | Male control | Male Incentive 1 | Male Incentive 2
None | N | 160 | 166 | 187
 | COL% | 41.3 | 43.3 | 46.3
 | Score mean | 292.6 (2.6) | 291.3 (2.8) | 295.6 (2.4)
1–2 days | N | 144 | 152 | 152
 | COL% | 37.2 | 39.7 | 37.6
 | Score mean | 285.6 (2.6) | 288.8 (2.8) | 291.7 (3.2)
3–4 days | N | 56 | 49 | 38
 | COL% | 14.5 | 12.8 | 9.4
 | Score mean | 279.6 (5.1) | 289.4 (5.1) | 292.6 (6.4)
5–10 days | N | 18 | 12 | 22
 | COL% | 4.7 | 3.1 | 5.4
 | Score mean | 274.6 (5.7) | 290.5 (10.7) | 295.5 (6.8)
>10 days | N | 9 | 4 | 5
 | COL% | 2.3 | 1.0 | 1.2
 | Score mean | 255.4 (13.6) | 277.8 (17.4) | 268.4 (20.8)

FEMALE |  | Female control | Female Incentive 1 | Female Incentive 2
None | N | 148 | 166 | 156
 | COL% | 34.7 | 34.6 | 32.5
 | Score mean | 296.3 (3.0) | 300.0 (2.5) | 301.7 (2.7)
1–2 days | N | 194 | 202 | 215
 | COL% | 45.4 | 42.1 | 44.8
 | Score mean | 293.6 (2.7) | 296.1 (2.6) | 298.7 (2.5)
3–4 days | N | 58 | 79 | 75
 | COL% | 13.6 | 16.5 | 15.6
 | Score mean | 280.8 (3.6) | 283.8 (4.1) | 291.6 (3.4)
5–10 days | N | 21 | 28 | 30
 | COL% | 4.9 | 5.8 | 6.2
 | Score mean | 279.7 (6.8) | 297.1 (7.0) | 287.1 (7.3)
>10 days | N | 6 | 5 | 4
 | COL% | 1.4 | 1.0 | 0.8
 | Score mean | 284.7 (15.3) | 294.8 (14.1) | 298.6 (20.8)


FINDINGS (ANALYSES OF VARIANCE)


The cross-tabulated results presented here, as well as others not shown, exhibit consistency in the signs of the estimated effects for the two incentive conditions relative to the control condition. The magnitudes do vary, no doubt in part because of sampling fluctuations. A natural question, then, is to what degree the estimated effects can be accounted for by differences across groups in the student samples “exposed” to each treatment. That is, despite the evidence that the samples are approximately randomly equivalent, perhaps the accumulation of small differences on characteristics related to the outcome could explain some portion of the observed effects.


One way to address the question is to conduct an analysis of variance (ANOVA) with NAEP scores as the outcome and a host of student characteristics as predictors. Because of the large number of variables available, it was advisable to employ a smaller group of predictors. Accordingly, the following strategy was pursued. The student questionnaire included 15 questions related to the frequency of engaging in various reading habits. An exploratory latent class analysis was conducted, and a six-class solution was selected as providing the best tradeoff between data fit and parsimony. Each class was characterized by a particular profile of reading habits.28 Indicator functions for the set of classes were constructed and incorporated into the ANOVA model.
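
To make the latent class step concrete, the following is a minimal sketch of an exploratory latent class analysis for dichotomous items, fit by EM and compared across numbers of classes with BIC (fit versus parsimony). It is not the software used in the study; the data matrix X_habits and all settings are hypothetical stand-ins for the 15 dichotomized reading-habit responses.

import numpy as np

def fit_binary_lca(X, n_classes, n_iter=500, tol=1e-6, seed=0):
    """EM for a latent class model with dichotomous items.
    X: (n_students, n_items) array of 0/1 responses (no missing data)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)             # class proportions
    theta = rng.uniform(0.3, 0.7, size=(n_classes, m))   # P(item = 1 | class)
    loglik_old = -np.inf
    for _ in range(n_iter):
        # E-step: posterior class membership for each student (log space for stability)
        log_joint = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_marg = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_marg)               # responsibilities, (n, K)
        # M-step: update class proportions and item-endorsement probabilities
        pi = np.clip(resp.mean(axis=0), 1e-6, None)
        pi = pi / pi.sum()
        theta = (resp.T @ X + 0.5) / (resp.sum(axis=0)[:, None] + 1.0)  # light smoothing
        loglik = log_marg.sum()
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    n_params = (n_classes - 1) + n_classes * m
    bic = -2 * loglik + n_params * np.log(n)              # approximate, adequate for a sketch
    return pi, theta, resp, bic

# Compare solutions with different numbers of classes, then use each student's
# modal class as an indicator in the ANOVA model. X_habits is placeholder data.
X_habits = np.random.default_rng(1).integers(0, 2, size=(2600, 15))
for k in range(2, 8):
    *_, bic = fit_binary_lca(X_habits, k)
    print(k, round(bic, 1))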


In addition to these class indicators, indicators were included for race (White, Black, Hispanic, Other), eligibility for school lunch, frequency of talking about studies at home, frequency of absences from school in the previous month, and frequency of a language other than English being spoken in the home, as well as for the experimental conditions.29 Mother’s education was treated as an integer variable, with the four levels coded 0, 1, 2, and 3. The dependent variable was reading achievement, as represented by the first plausible value.
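
As a rough illustration of this specification, the sketch below sets up an ordinary least squares ANOVA with condition, race, school-lunch eligibility, the three dichotomized frequency variables, the reading-habit latent class, and mother's education coded 0–3. The column names and synthetic data are hypothetical stand-ins, not the study's variables or coding.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
# Hypothetical student-level data frame; all column names are illustrative stand-ins.
df = pd.DataFrame({
    "pv1_reading": rng.normal(290, 35, n),                       # first plausible value
    "condition": rng.choice(["control", "incentive1", "incentive2"], n),
    "race": rng.choice(["White", "Black", "Hispanic", "Other"], n),
    "lunch_eligible": rng.integers(0, 2, n),
    "talk_about_studies": rng.integers(0, 2, n),                  # dichotomized frequency
    "absences": rng.integers(0, 2, n),                            # dichotomized frequency
    "other_language_home": rng.integers(0, 2, n),                 # dichotomized frequency
    "reading_class": rng.integers(1, 7, n),                       # modal latent class (1-6)
    "mother_ed": rng.integers(0, 4, n),                           # coded 0-3, treated as integer
})

# Fixed-effects specification paralleling the ANOVA described in the text.
formula = (
    "pv1_reading ~ C(condition) + C(race) + C(lunch_eligible)"
    " + C(talk_about_studies) + C(absences) + C(other_language_home)"
    " + C(reading_class) + mother_ed"
)
fit = smf.ols(formula, data=df).fit()
print(fit.summary())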


The models were estimated separately for males and females and were run as standard ANOVAs (i.e., ignoring the nested structure of the data). Accordingly, the standard errors generated underestimate the true level of uncertainty. The fitted models yielded an R² of 0.23 for males and an R² of 0.26 for females. For females, the effects for both incentive conditions were statistically significant. For males, only the effect for the second incentive was significant. The patterns in the coefficients were generally similar in the two models, with the coefficients for race/ethnicity, mother’s education, and eligibility for school lunch being large and significant. There were also significant differences among the latent classes, but the patterns differed by gender.
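
One way to see why ignoring the school-level nesting understates uncertainty is the standard design-effect approximation from survey sampling (a general result, not a calculation reported in the study): with average cluster size \(\bar{m}\) and intraclass correlation \(\rho\), the variance of a mean under clustered sampling is inflated relative to simple random sampling by roughly

\[
\operatorname{Var}_{\text{clustered}}(\bar{y}) \;\approx\; \operatorname{Var}_{\text{SRS}}(\bar{y})\,\bigl[\,1 + (\bar{m} - 1)\,\rho\,\bigr],
\]

so standard errors that assume independent observations are too small whenever \(\rho > 0\).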


Each model yields “least squares means” for the three conditions—that is, means that have been adjusted for the other variables in the model. The differences among these means represent adjusted treatment effects and can be compared with the treatment effects derived from the unadjusted means presented in Table 8. For convenience, the four sets of treatment effects are presented in Table 12. Estimated standard errors are in parentheses. We note that for males, the adjusted treatment effect for Incentive 2 is somewhat smaller than the unadjusted treatment effect. On the other hand, for females, it is the estimated effect for Incentive 1 that is reduced. For both males and females, the adjusted effects for Incentive 2 are statistically significant. Thus, there is no basis for discounting the estimated effects of the incentives on student performance, at least with regard to the contingent monetary incentive.


Table 12. Treatment Effects (in NAEP Score Points) by Incentive Condition and Gender
(Estimated standard errors in parentheses)

MALE (Male Control / Male Incentive 1 / Male Incentive 2)
  Unadjusted: 0 / 2.5 / 5.5
  Adjusted: 0 / 2.5 (1.8) / 4.2 (1.8)

FEMALE (Female Control / Female Incentive 1 / Female Incentive 2)
  Unadjusted: 0 / 3.9 / 5.3
  Adjusted: 0 / 3.0 (1.7) / 5.6 (1.7)
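
To illustrate what the adjusted ("least squares") means involve, the sketch below fits a simple additive model on synthetic data, predicts each student's score under every condition while holding the other covariates at their observed values, and averages the predictions; differences between these adjusted means are the adjusted treatment effects. For an additive linear model this is equivalent to the usual least squares means contrast for condition. All variable names and data are hypothetical, and the printed numbers are meaningless placeholders.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "pv1_reading": rng.normal(290, 35, n),      # placeholder outcome
    "condition": rng.choice(["control", "incentive1", "incentive2"], n),
    "race": rng.choice(["White", "Black", "Hispanic", "Other"], n),
    "mother_ed": rng.integers(0, 4, n),
})
fit = smf.ols("pv1_reading ~ C(condition) + C(race) + mother_ed", data=df).fit()

adjusted_means = {}
for cond in ["control", "incentive1", "incentive2"]:
    counterfactual = df.assign(condition=cond)   # everyone assigned to this condition
    adjusted_means[cond] = fit.predict(counterfactual).mean()

for cond in ["incentive1", "incentive2"]:
    effect = adjusted_means[cond] - adjusted_means["control"]
    print(cond, "adjusted effect:", round(effect, 1))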


The ANOVAs were also run separately for each subscale. For both males and females, the effects of the incentives were substantially greater for the reading for information scale than for the reading for literary experience scale. Recall that the blocks for the former scale have proportionately more constructed response item formats than do the blocks for the latter scale.


KEY GROUPS


Reviewing the patterns in the effects displayed in Tables 9–11 does not suggest a strong relationship between the size of the effect and the mean score of the focal group in the control condition. That is, if we consider a group defined by a combination of characteristics (e.g., males who are White or females whose mothers graduated from college), then the mean score of that group in the control condition is treated as the baseline, and the effect of the intervention is the difference between the mean score of the group exposed to that intervention and the baseline. Groups with lower baselines are not systematically more likely to experience larger effects, nor are groups with higher baselines systematically more likely to experience smaller ones (or the reverse). Rather, it appears that the magnitudes of the effects may be more strongly linked to certain demographic or behavioral characteristics.
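
In symbols, for a group \(g\) and incentive condition \(k\), the effect discussed here is simply

\[
\Delta_{g,k} \;=\; \bar{y}_{g,\text{Incentive } k} \;-\; \bar{y}_{g,\text{Control}}, \qquad k = 1, 2,
\]

where \(\bar{y}_{g,c}\) denotes the mean score of group \(g\) under condition \(c\).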


Accordingly, an exploratory analysis was conducted to identify population subgroups that displayed a large positive effect and for which the sample size was sufficiently large to effectively rule out sampling fluctuations as a plausible explanation. Admittedly, this is a form of “data snooping,” but the findings are presented to explore the robustness of the results rather than to test hypotheses regarding specific demographic subgroups. This investigation was carried out separately for males and females, with a focus on the second incentive.


The group defined by gender (male), race (White), and days absent (more than 3 days in the last month) displays an effect of 18.1 points, based on a total sample size of 95. For all males, the effect is 5.5 points, based on a total sample size of 802. If the 95 students in the focal group (constituting about 12% of all males) were deleted, the effect for males would be approximately 3.8 points, corresponding to a reduction of about 30%.


The group defined by gender (female), race (Hispanic), and not being an English language learner displays an effect of 13.2 points, based on a total sample size of 82. For all females, the effect is 5.3 points, based on a total sample size of 919. If the 82 students in the focal group (constituting about 9% of all females) were deleted, the effect for females would be approximately 4.9 points, corresponding to a reduction of about 17%. We conclude that neither for males nor for females can the observed treatment effects be substantially accounted for by a relatively small group with an unusually large effect.
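
The "deletion" calculations above follow from writing the overall effect as a weighted average of the focal-group effect and the effect for everyone else (an approximation that ignores condition-specific group sizes):

\[
\Delta_{\text{all}} \;=\; w\,\Delta_{\text{focal}} + (1 - w)\,\Delta_{\text{rest}}
\;\;\Longrightarrow\;\;
\Delta_{\text{rest}} \;=\; \frac{\Delta_{\text{all}} - w\,\Delta_{\text{focal}}}{1 - w}.
\]

For males, \(\Delta_{\text{all}} \approx 5.5\), \(\Delta_{\text{focal}} \approx 18.1\), and \(w \approx 0.12\) give \(\Delta_{\text{rest}} \approx (5.5 - 0.12 \times 18.1)/0.88 \approx 3.8\), consistent with the figure quoted above.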


DISCUSSION


SUMMARY


As human capital issues have become increasingly important, policy makers have focused their attention on American high schools and the fact that almost half of disadvantaged minority students fail to graduate on time. Moreover, according to NAEP data, among those students who remain through the 12th grade, more than half perform at the basic or below basic levels of proficiency in both reading and mathematics.  That is, more than half of high school seniors fail to demonstrate the mastery of the prerequisite knowledge and skills that are fundamental for proficient work in Grade 12. Two of the factors that bear on the validity and credibility of reported NAEP results are the statistical characteristics of the sample of students who take the assessment, and the level of effort expended by those students. The study reported here focused solely on the latter factor. This article has described the implementation and results of a large, randomized field trial that was designed to estimate the effects of modest monetary incentives on 12th-grade students’ performance on the NAEP reading assessment. The analyses reported here support the claim that we have been able to estimate the experimental effects of the incentives relative to the control condition (for the population of participating students) and to represent those effects on the NAEP scale.


The impact of the incentives was reflected in students’ self-reports regarding the importance of doing well and trying harder, as well as in higher average scores. The absolute sizes of the effects were substantial, ranging from approximately 2.5 to 5.5 scale score points for subgroups defined by gender alone. (Note that 5 points is about one quarter of the observed achievement gap between White and Black students in the control group in this study.) The data also revealed significant interaction effects. For the contingent incentive, the effect is mainly due to the lower tail of the score distribution being shifted sharply upward relative to the lower tail of the distribution in the control condition. Equally important, for each incentive condition there are differential effects among demographic subgroups of interest. For example, among Hispanic females who were not English language learners, the impact of the contingent incentive on average reading scores was about two and a half times greater than the overall effect for females; for White males who reported frequent absences from school, the impact was about three times greater than the overall effect for males.


It appears that modest monetary rewards can influence the behaviors of many American high school seniors—at least with respect to performance on a NAEP-like reading assessment. Thus, we conclude that operational NAEP estimates of the overall average reading proficiencies of the nation’s 12th graders are likely lower than those that would be obtained if contingent monetary incentives were offered. These are important findings, and, accordingly, a number of analyses were conducted to examine the consistency of the aggregate results for various subpopulations of students, as well as their robustness. In the main, both consistency and robustness were confirmed. Interpretation of the results is complicated by substantial evidence that motivation and effort vary by student subgroups defined by some of NAEP’s reporting variables. These interactions imply that standard NAEP estimates of the gaps in achievement between different subgroups may be biased. The results of the study lend credence to the concerns raised in the reports of the National Commission (2004) and by Brophy and Ames (2005). Of course, inferring what the effects would be for a nationally representative sample would be speculative. Nonetheless, the magnitudes of the observed effects, as well as their consistency across subgroups, certainly urge caution in interpreting the reported NAEP statistics, with respect to both absolute performance and achievement gaps.


Now, it may be argued that the statistics reported by NAEP are useful even if NAEP does not capture students’ maximal performance. That is, perhaps NAEP results capture instead students’ typical performance, and it is typical performance that is more representative of how students’ reading competencies would manifest themselves in various “real-life” settings. However, if students’ performances in such settings have meaningful consequences for them (e.g., job applications), then their motivation and engagement arguably would be greater than they are in a NAEP administration. Thus, the observed score differences between the control condition and the incentive conditions should have import for educators and policy makers alike.


At this juncture, there is already increased pressure for 12th-grade NAEP to provide a snapshot of students’ readiness for the world beyond high school. This may well entail a substantial change in the assessment framework, as well as other aspects of NAEP. If and when such changes are implemented, a new trend line would have to be established. That would be a convenient time for modifications to current recruitment and administration procedures to be introduced, with the twin goals of (1) improving participation at both the school and student levels, and (2) enhancing student motivation and effort.


CAVEATS


This study is based on a convenience sample of 59 schools in seven states. Although the school sample is heterogeneous, it is not nationally representative, and consequently, one cannot draw inferences from the sample to the nation as a whole. As is the case with NAEP, students participating in the assessment are volunteers. The overall participation rate was 56%, and we have no evidence on how able these students are in comparison with those who were invited but did not choose to sit for the assessment. We have no direct information on how comparable the students in our study are with students who participate in regular NAEP administrations, except that the students in the study do appear to be somewhat more able than the students who sat for the 2005 administration.


The test battery for the study does differ in a number of respects from that of operational NAEP. Only four blocks of items were employed rather than the 20 or so blocks that are available for a regular administration. Moreover, only two of three reading subscales were directly assessed. Thus, there is some question as to how fully the study was able to represent the construct of reading as delineated in the NAEP reading framework.


Finally, the validity of the estimated treatment effects depends on the students in the incentive groups understanding the nature of the monetary incentives and, equally important, on none of the students having external information about the study. With regard to the former, students were asked to select the particular gift certificate they preferred and to sign their names so that the incentive would be more tangible. In the debriefings reported by Schultz, Deatz, and Gladden (2008), few students in the incentive conditions indicated that they were unaware of the monetary rewards offered. With regard to the latter, administrators reported only four isolated instances in which students had received information from friends in other schools. Most deviations from the intended administrative procedures and context would, if anything, tend to reduce the estimated treatment effects.


LOOKING TO THE FUTURE


The findings presented here are the result of a single study that focused on the NAEP reading assessment. Studies of NAEP and other large-scale assessments reported in the literature have focused on mathematics and found small or negligible effects of incentives. The difference in outcomes may be due to a combination of subject matter and experimental design. Standard scientific practice calls for some form of replication to strengthen the credibility of our findings. However, such field studies are very difficult to carry out, particularly if there is a desire to embed the study in an operational administration to achieve greater external validity. Moreover, it is evident that offering even modest monetary incentives to the full 12th-grade sample in a regular administration would be prohibitively expensive. We are then left with a conundrum: Is it possible to elicit maximal test performance without offering monetary incentives? If there are such strategies, it may be more useful to investigate their utility in an experimental or quasi-experimental setting rather than conduct a true replication of the experiment reported here.


In debriefings after the test administration, students offered suggestions on how to enhance motivation with nonmonetary incentives. Appendix D of the report of the National Commission outlines a number of strategies to increase both participation and motivation.  Moreover, if the nation is serious about using 12th-grade NAEP as a measure of student readiness or as a national measure of human capital, then we need to take into account the fact that a significant percentage of the age cohort of interest is no longer in school and, therefore, missing from these estimates.  A more comprehensive view of human capital will need to find ways of complementing NAEP results with analogous statistics drawn from the population of out-of-school youth.


What is clear is that those charged with expanding and enhancing the utility of NAEP will need to devote substantial thought and appropriate resources to accomplishing this goal. Both are necessary for enhancing the accuracy, utility, and credibility of the results reported by NAEP for the 12th-grade reading assessment. At the same time, it must be recognized that no single strategy will be equally effective for all students, so the problem of getting all participants to “try their best” is not easily solved.


Acknowledgments


This project was jointly funded by grants from the Labor Economics Section, Department of Economics, Princeton University, from the National Center for Education Statistics, U.S. Department of Education, and from the Educational Testing Service. We thank Alan Krueger, Peggy Carr, and Ida Lawrence for their support and encouragement. We also benefitted from comments and suggestions provided by reviewers from NCES, ETS, and Teachers College Record. In the recruitment of schools, we were assisted by the state NAEP coordinators. We are also appreciative of the assistance of the principals and study liaisons in the participating schools, as well as the efforts of all the students who sat for the assessment. In the implementation of the study, we were assisted by Julie Eastland, Mary Lou Lennon, and the staff at Westat, Inc., and Pearson, Inc.  Final production assistance was provided by Elizabeth Brophy and Youjin Lee.


Notes


1. The principal consortium members are the Educational Testing Service, Westat, Pearson, and Fulcrum IT.

2. For a history of NAEP, see Jones and Olkin (2004). For further information on NAEP, see http://nationsreportcard.gov.

3. See the report of the National Commission on Excellence in Education (1983). For recent comments, see the special report in the Phi Delta Kappan (Smith, 2008).

4. See, for example, the report of the National Commission on NAEP 12th Grade Assessment and Reporting, 2004, p. 2.

5. At its inception in 1969, NAEP did assess a sample of an age cohort, regardless of whether the individuals were enrolled in school. This is no longer the case, even with the NAEP long-term trend.

6. In 2007, school and student response rates were 89% and 80%, respectively, whereas in 2009, they were 83% and 80%.

7. In many large-scale assessment surveys, respondents and nonrespondents can be compared on a number of background characteristics, an estimate of the nonresponse bias can be obtained, and adjusted estimates using poststratification can be calculated. These estimates, however, are themselves subject to considerable uncertainty.

8. That is, estimated differences between conditions of practical import may not reach statistical significance because standard errors vary inversely as the square roots of the sample sizes.

9. The result is based on a meta-analysis of 12 studies.

10. Of course, the study budget was also a constraint on the total award amount.

11. The instructions were modified slightly to acknowledge that the assessment was part of a special study and not a regular NAEP administration.

12. This change resulted in a substantial increase in the cost of the study.

13. Only blocks released some years earlier were used, so that the use of operational blocks would not risk undermining the validity of the NAEP 12th-grade assessment administered in the spring of 2009.

14. Randomization of treatment conditions within each school affords greater power than randomizing across schools, because the latter approach introduces between-school differences in achievement into the estimation of treatment effects. On the other hand,  the former requires a more complex administration protocol.

15. The student response rate in a state is defined as the ratio of the number of students assessed to the number of students invited to participate. The low response rate in Massachusetts is due to the fact that the state requires that parents return signed consent forms.

16. Based on reports from the assessment administrators, there are four known cases in which one or more students in a session indicated that they had been told that they would receive a gift certificate. It appears they obtained this information from students in another school. In addition, the audit report by Schultz, Deatz, and Gladden (2008) notes that in one session, in response to a student’s query, students were informed that the study was being conducted to investigate student motivation.

17. One difference is that in regular NAEP administrations, student responses are scanned and the images sent electronically to scorers. Given the small volume of responses in this study, it was decided to have the scorers work directly with the booklets.

18. The r-biserial is the Pearson correlation between the item score and the total score for the block.

19. The statistic is described in Allen, Donoghue, and Schoeps (2001).

20. This is the same software used in operational NAEP. For a detailed description of NAEP scaling procedures, see Allen et al. (2001).

21. This procedure is consistent with that of NAEP when such items are reused.

22. See Mislevy, Johnson, and Muraki (1992) for a fuller explication of the conditioning model in NAEP.

23. A bivariate regression model is employed to take advantage of the correlation between the subscales.

24. The interactions included those for which we intended to present results. Had the interactions not been incorporated in the conditioning phase, the reported results would be subject to some degree of bias.

25. For example, consider the score distribution under the control condition. The score point corresponding to the 20th percentile is 269.5. That is, approximately 20% of the students in the control condition “scored” (i.e., had the first plausible value) at or below 269.5. For the score distribution under the Incentive 2 condition, the corresponding percentile is 276.8. The difference is 7.3 score points and is plotted on the vertical axis above the point 20 on the horizontal axis. This was done for a set of percentiles, and then a smooth curve was passed through the points. (A sketch of this construction follows the notes.)

26. Counts and average scores for the Totals column may vary across tables because of missing data. Results for the categories Asian/Pacific Islander, American Indian/Alaska Native, and Other are not reported because the counts for the first were less than 30 and for the other two were less than 10.

27. Although the results are presented on the NAEP scale, the difference between conditions cannot be directly compared with reported NAEP gaps that are based on nationally representative samples.

28. Although the nature of these classes is of interest in its own right, it is not germane to the statistical adjustment that follows.

29. The three frequency variables were each dichotomized so that only a single indicator was needed for each.
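
The percentile-difference construction described in note 25 can be sketched as follows; the arrays, their values, and the variable names are hypothetical placeholders, not study data.

import numpy as np

# Stand-ins for the first plausible values of students in the control and
# contingent-incentive (Incentive 2) conditions.
rng = np.random.default_rng(0)
control_scores = rng.normal(288, 35, 1200)       # placeholder data
incentive2_scores = rng.normal(293, 33, 1200)    # placeholder data

percents = np.arange(5, 100, 5)
diffs = (np.percentile(incentive2_scores, percents)
         - np.percentile(control_scores, percents))

for p, d in zip(percents, diffs):
    print(f"{p:>3}th percentile: difference = {d:5.1f} score points")
# In the article, these differences are plotted against the percent on the
# horizontal axis and a smooth curve is passed through the points.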


References


Allen, N. L., Donoghue, J. R., & Schoeps, T. L. (2001). The NAEP 1998 technical report (NCES 2001-509). Washington, DC: National Center for Education Statistics, Office of Educational Research and Improvement, U.S. Department of Education.


Baumert, J., & Demmrich, A. (2001). Testing motivation in the assessment of student skills: The effects of incentives on motivation and performance. European Journal of Psychology of Education, 16, 441–462.


Belfield, C. R., & Levin, H. M. (2007). The economic losses from high school dropouts in California. California Dropout Research Project Report No. 1. Santa Barbara: University of California.


Brophy, J., & Ames, A. (2005, September). NAEP testing for twelfth graders: Motivational issues. Paper prepared for the National Assessment Governing Board, Michigan State University. Retrieved from http://www.nagb.org/publications/final_naep-testing_paper_carole_jere.doc


Brown, S. M., & Walberg, H. J. (1993). Motivational effects of test scores of elementary students. Journal of Educational Research, 86, 133–136.


Deci, E. L., Koestner, R., & Ryan, R. M. (1999). A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin, 125, 627–668.


Hoffman, R. G. (2004). Implications from motivation theory for NAEP participation and performance. Alexandria, VA: Human Resources Research Organization.


Holland, P. W. (2002). Two measures of change in gaps between the CDFs of test-score distributions. Journal of Educational and Behavioral Statistics, 27, 3–17.


Jones, L. V., & Olkin, I. (2004). The nation’s report card: Evolution and perspectives. Bloomington, IN: Phi Delta Kappa Educational Foundation.


Kiplinger, V., & Linn, R. (1996). Raising the stakes of test administration: The impact on student performance in the National Assessment of Educational Progress. Educational Assessment, 3, 111–133.


Kirsch, I., Braun, H. I., Yamamoto, K., & Sum, A. (2007). America's perfect storm: Three forces changing our nation's future. Policy Information Center Report. Princeton, NJ: Educational Testing Service.


Lazer, S. (n.d.). Rethinking the Grade 12 NAEP assessment. Paper prepared for the National Assessment Governing Board. Princeton, NJ: Educational Testing Service.


Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of Educational Statistics, 17, 131–154.


National Assessment Governing Board. (2006). Reading framework for the 2007 National Assessment of Educational Progress. Washington, DC: U.S. Government Printing Office.


National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform. Washington, DC: Author.


National Commission on NAEP 12th Grade Assessment and Reporting. (2004). 12th grade student achievement in America: A new vision for NAEP. A report to the National Assessment Governing Board, Washington, DC.


New Commission on the Skills of the American Workforce. (2008). Tough choices or tough times. San Francisco: Jossey-Bass.


Orfield, G. (2004). Dropouts in America: Confronting the graduation rate crisis. Cambridge, MA: Harvard Education Press.


O’Neil, H. F., Jr., Abedi, J., Miyoshi, J., & Mastergeorge, A. (2005). Monetary incentives for low-stakes tests. Educational Assessment, 10, 185–208.


O’Neil, H. F., Jr., Sugrue, B., & Baker, E. L. (1996). Effects of motivational interventions on the National Assessment of Educational Progress mathematics performance. Educational Assessment, 3, 135–157.


Schultz, S. R., Deatz, R. C., & Gladden, F. L. (2008). NAEP-QA Grade 12 motivation study: Summary of assessment site visits (Report No. FR-08-10). Alexandria, VA: Human Resources Research Organization.


Smith, B. M. (Ed.). (2008). Special section: School reform turns 25. Phi Delta Kappan, 89(8).


Wise, S. L., & DeMars, C. E. (2005). Low examinee effort in low-stakes assessment: Problems and potential solutions. Educational Assessment, 10(1), 1–17.




Cite This Article as: Teachers College Record, Volume 113, Number 11, 2011, pp. 2309–2344.
https://www.tcrecord.org ID Number: 16008

About the Author
  • Henry Braun
    Boston College
    HENRY BRAUN has held the Boisi Chair in Education and Public Policy at Boston College since 2007. From 1979 to 2006, he worked at Educational Testing Service (ETS) in Princeton, New Jersey, where he served as vice president for research management (1989–1999). He has a bachelor’s degree in mathematics from McGill University and M.A. and Ph.D. degrees, both in mathematical statistics, from Stanford University. He has a long-standing involvement in technical analyses of policy issues, especially those involving testing and accountability. He has done considerable work in the area of value-added modeling and authored Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models (2006). He was a major contributor to the OECD monograph, Measuring Improvements in Learning Outcomes: Best Practices to Assess the Value-added of Schools (2008), and chair of the NRC panel that recently issued the publication, Getting Value out of Value-Added: Report of a Workshop.
  • Irwin Kirsch
    Educational Testing Service
    IRWIN KIRSCH has held the title of Distinguished Presidential Appointee at Educational Testing Service since 1999, where he began working in 1984. He holds a bachelor’s degree in psychology from the University of Maryland, an M.S. in communication disorders from Johns Hopkins University, and a Ph.D. in educational psychology from the University of Delaware. His interests include issues involving the comparability and interpretability of large-scale assessments, and using technology to link learning and assessment. He has had a long-standing involvement in the development and implementation of large-scale comparative surveys including NAEP, and he was one of the original developers of the International Adult Literacy Survey (IALS). He currently directs the Program for the International Assessment of Adult Competencies (PIAAC) for the OECD and chairs the reading expert group for PISA. He has authored a number of policy reports using data from these surveys, including America’s Perfect Storm.
  • Kentaro Yamamoto
    Educational Testing Service
    KENTARO YAMAMOTO is deputy director/principal research scientist for the Center for Global Assessment at Educational Testing Service (ETS). He has been a technical advisor for the OECD and the U.S. Department of Education. He has designed or contributed to the design of numerous national and international large-scale surveys of adults and of special populations, including NAEP, TIMSS, PISA, IALS, ALL, and PIAAC. He has also designed several individual tests in reading and literacy, as well as a mixture model of continuous and discrete measurement models for diagnostic testing and IRT scaling that has been used for all literacy surveys at ETS, and an online testlet-based adaptive test of adult literacy skills. He has written numerous reports and research papers, contributed chapters, and given numerous presentations at national and international conferences.
 