Are Achievement Gap Estimates Biased by Differential Student Test Effort? Putting an Important Policy Metric to the Test
by James Soland - 2018
Background/Context: Achievement gaps motivate a range of practices and policies aimed at closing those gaps. Most gaps studies assume that differences in observed test scores across subgroups are measuring differences in content mastery. For such an assumption to hold, students in the subgroups being compared need to be giving similar effort on the test. Studies already show that low test effort is prevalent and biases observed test scores downward. What research does not demonstrate is whether test effort differs by subgroup and, therefore, biases estimates of achievement gaps.
Purpose: This study examines whether test effort differs by student subgroup, including by race and gender. The sensitivity of achievement gap estimates to any differences in test effort is also considered.
Research Design: A behavioral proxy for test effort called “rapid guessing” was used. Rapid guessing occurs when students answer a test item so fast, they could not have understood its content. Rates of rapid guessing were compared across subgroups. Then, achievement gaps were estimated unconditional and conditional on measures of rapid guessing.
Findings: Test effort differs substantially by subgroup, with males rapidly guessing nearly twice as often as females in later grades, and Black students rapidly guessing more often than White students. However, these differences in rapid guessing generally do not impact substantive interpretations of achievement gaps, though basic conclusions about male–female gaps and changes in gaps as students progress through school may change when models account for test effort.
Conclusions: Although the bias introduced into achievement gap estimates by differential test effort is hard to quantify, results provide an important reminder that test scores reflect achievement only to the extent that students are willing and able to demonstrate what they have learned. Understanding why there are subgroup differences in test effort would likely be useful to educators and is worthy of additional study.
Achievement gaps are one of our nation's most important policy metrics. We use gaps between races and genders to measure the effectiveness and fairness of the U.S. education system, including progress in those areas over the course of decades. In part to monitor that progress, the National Assessment of Educational Progress (NAEP) is administered at regular intervals and measures achievement by student subgroup. Achievement gap estimates based on tests like NAEP are the basis for many education practices, programs, policies, and funding streams. For example, Title I of the Elementary and Secondary Education Act is one of the largest sources of federal education funding. Those resources are allocated to support schools and districts serving low-income students, who often show achievement levels far below those of their more affluent peers and are much more likely to be racial minorities.
Many prominent gap studies base estimates on tests that have minimal stakes for students like NAEP and the assessments administered under federal studies like the Early Childhood Longitudinal Study of the Kindergarten Class of 1998 (ECLS-K). One reason that the lack of stakes tied to these tests should concern educators, researchers, and policy makers is that studies show low-stakes tests are associated with reduced student motivation to take them (Cole, Bergin, & Whittaker, 2008). A fundamental assumption underlying the validity of most uses of tests is that the examinees will use the assessment as a chance to demonstrate their mastery of the content. In education, we assume that students use math and reading tests to show what they have learned in those subjects. No matter how well a test is designed, it will not provide useful information for a student who chooses not to try. Research already shows that assuming test takers are motivated is not always justifiable (Wise, 2015). Further, the effect of low effort is almost always that observed test scores are biased downward relative to true achievement, meaning that the test scores used by practitioners and researchers often understate what a student knows and can do.
There are also reasons to believe that low test effort may be biasing achievement gap estimates. First, there is evidence that girls show higher test effort than boys, suggesting that there may be subgroup differences in test effort (Wise & DeMars, 2010). If two subgroups being compared in an achievement gap estimate show differential test effort, then one group's test scores will likely be biased downward at greater rates relative to those of the other group. Second, when present, low effort can bias scores downward by as much as 0.2 standard deviations (Rios, Guo, Mao, & Liu, 2016), which is sizeable relative to achievement gaps often reported at around 0.1 to 0.2 standard deviations. Third, there is evidence that this biasing occurs frequently enough to make subgroup differences practically and statistically significant. For instance, Wise (2015) found that rates of low test effort can reach more than 15% of examinees during middle and high school. Thus, in the presence of imperfect test effort, achievement gap estimates based on observed test scores could be biased.
In this study, a low-stakes assessment was used to examine whether achievement gap estimates appear to be biased by differential test effort. The theory was investigated using a measure of a particular type of low test effort called rapid guessing, which is when students respond to an item so quickly, they could not have understood its content (Schnipke & Scrams, 1997). Metrics that use rapid guessing to measure test motivation are supported by decades of validation research reviewed by Wise (2015). (Though rapid guessing is a particular behavior, it is referred to throughout the article as low test effort or low test motivation for convenience.) Achievement gaps were estimated unconditional and conditional on this rapid-guessing metric and compared to answer two research questions. First, are there subgroup differences in rapid-guessing rates? A finding of differential effort would mean that observed test scores are biased downward relative to true scores at different rates by subgroup. Second, if there is differential test motivation, how much does the downward bias in observed scores influence achievement gap estimates?
I found that test effort tends to be much lower for most racial minority groups relative to White students and for male students relative to female students, with rapid-guessing rates differing by more than 0.3 standard deviations in later grades. As a result, observed test scores tend to understate achievement for minorities relative to White students, and for boys relative to girls. However, most gap estimates do not change enough to alter substantive conclusions drawn from them. Exceptions to this general finding include male–female gaps, and changes in gaps as students progress through school, both of which differ substantively between unconditional and conditional estimates. In mathematics, original gap estimates favor girls in later grades, whereas effort-adjusted estimates favor boys.
There are two broad implications of these findings for practice and policy. First, achievement gaps are often the primary metric used to measure the current effectiveness and fairness of the U.S. education system, identify how sources of ineffectiveness or unfairness develop as students move through school, and evaluate the efficacy of policies designed to address such issues. For example, the male–female gap in mathematics favoring boys has been posited as a possible cause of lower participation in quantitative college majors and lucrative professions among women. Studies suggest that these gaps have narrowed over time (Friedman, 1989) and that certain targeted policies may account for some of that narrowing (Fennema & Hart, 1994). If estimates of achievement gaps are biased by differential test motivation, then conclusions being made about the effectiveness and fairness of the U.S. education system over time, including the effects of policies designed to improve that system, may be misleading.
Second, my results provide a reminder that mean achievement test scores are not as uncomplicated as we might hope. Mean achievement test scores are not measures of ability or intelligence, but performance. Performance can be impacted by motivation, mindset, and other psychological factors. Therefore, tests only measure what students know to the extent that students are willing (and able) to demonstrate that knowledge. My results suggest that students are willing to show what they can do at very different rates across subgroups, which raises questions about underlying causes. Asking such questions has implications for how educators conceptualize, measure, and perhaps address achievement gaps.
BACKGROUND ON ACHIEVEMENT GAPS
This section briefly discusses what is known about the size of gaps for the subgroups in the study sample and the broad implications of these gaps. The literature is reviewed separately for racial minority–White gaps and male–female gaps, in part because racial gaps are more confounded with socioeconomic status, which is also a source of differential achievement (Reardon, 2011). Treating racial minority students together is entirely for parsimony and is not meant to suggest that the causes and implications of these minority–White gaps are homogenous. Even within subgroups, achievement is not homogenous. For example, Reardon and Galindo (2009) showed that Hispanic–White achievement gaps differ substantially for Hispanic students from different countries of origin. The purpose of this literature review is simply to explain why stakeholders should care if gaps as a policy measure are flawed and show that the gaps I estimate are comparable with those documented in the literature.
GAPS BETWEEN WHITE AND RACIAL MINORITY STUDENTS
Achievement gaps in reading and mathematics between White students and racial minority students are generally large and tend to increase as students get older, though some of the evidence is mixed. For Black students, Quinn (2015) estimated that the gap in kindergarten is 0.32 standard deviations in reading and 0.54 standard deviations in math. There is much disagreement and confusion about how much these gaps change as students progress through school (Reardon & Galindo, 2009). However, studies often cite changes in gaps across grades that are less than 0.05 standard deviations as evidence of shifts in those gaps (Clotfelter, Ladd, & Vigdor, 2006; Hanushek & Rivkin, 2006). Gaps are also sizeable for Hispanic students. Using ECLS-K data, Reardon and Galindo (2009) estimated that the Hispanic–White achievement gap in the fall of kindergarten is 0.77 standard deviations in math and 0.51 standard deviations in reading. Shifts in those gaps by year between kindergarten and fifth grade are variable, with changes ranging from 0.01 to 0.17 standard deviations (most changes are less than a tenth of a standard deviation). There is less research for Native American students, but estimates based on ECLS-K show that Native American kindergarten students score about 27 percentile points lower than White students in reading (Demmert, Grissmer, & Towner, 2006).
Asian American students, by comparison, are the only student subgroup to outperform White students across grade levels, especially in math (Konstantopoulos, 2009). According to one study, in seventh grade, Asian students outperformed White students in math by nearly 1 standard deviation but performed below White students in reading by 0.13 standard deviations (Pang, Han, & Pang, 2011). However, these achievement gap estimates differ substantially by ethnicity within the Asian American subgroup (Pang et al., 2011).
IMPLICATIONS OF GAPS BETWEEN WHITE AND RACIAL MINORITY STUDENTS
These gaps, especially for racial subgroups with significantly lower achievement than White students, have existed for decades and often have consequences for students, teachers, school systems, and society. For example, achievement gaps have been posited as sources of differential educational attainment and postsecondary enrollment (Corbett & Hill, 2012; Hill, Corbett, & St. Rose, 2010; Milgram, 2011; Neal, 2006). Gaps may also help explain why many racial minority groups are severely underrepresented in selective universities (Reardon, Baker, & Klasik, 2012). These impacts on education, in turn, affect the types of careers people enter and their earnings. Heckman and Vytlacil (2001) found that returns on education tend to be much higher among students in the top quintile of high school test score distributions, which means the achievement gap likely influences how much students earn. These educational and financial implications say nothing of the psychological, health, and familial consequences of gaps (Ferraro, Farmer, & Wybraniec, 1997).
GAPS BETWEEN MALE AND FEMALE STUDENTS
Gaps in mathematics achievement favoring males were highlighted in the 1970s, though there is some evidence this gap has narrowed in recent years (Rampey, Dion, & Donahue, 2009; Robinson & Lubienski, 2011). For example, a meta-analysis of studies looking at the gender gap in mathematics found that average gender differences during the 1980s were not statistically significant (Friedman, 1989). Fennema and Hart (1994) attributed this narrowing of the gap partially to an awareness of differences in math achievement by gender and policies targeted at addressing those differences. In the last 10 years, results evidence very small if persistent gaps that vary based on the measure used. For example, results from NAEP show small gender gaps favoring males of around 0.1 standard deviation in elementary, middle, and high school (McGraw, Lubienski, & Strutchens, 2006). Gaps also appear to grow as students progress through school. Robinson and Lubienski (2011) used ECLS-K and found that gaps tended to grow between kindergarten and eighth grade, with year-to-year changes ranging from more than 0.1 standard deviation to less than 0.01 standard deviation. Relevant to test motivation decreasing between elementary and middle school, the gaps estimated by Robinson and Lubienski (2011) using quantile regression decreased substantively between fifth and eighth grade.
Whereas mathematics gaps tend to favor males, reading gaps usually favor females. Higher achievement in reading for girls has been observed since the 1960s (Gates, 1961), though research suggests these gaps may be closing (Rampey et al., 2009). Across grades, Husain and Millimet (2009) found that low-achieving males tend to lose ground in reading during late elementary school. Robinson and Lubienski (2011) also found that gaps can widen as students get older, especially among low-achieving students. For example, students at the 10th percentile of the ECLS-K test had a gap of 0.12 standard deviations that reached 0.24 standard deviations in eighth grade (Robinson & Lubienski, 2011). Gaps favoring girls often increased between fifth and eighth grade, a period when test motivation also tends to decrease (Robinson & Lubienski, 2011).
IMPLICATIONS OF GAPS BETWEEN MALE AND FEMALE STUDENTS
Many studies on the impact of male–female achievement gaps focus on the math gap, an emphasis that occurs because achievement differences in mathematics across genders likely impact career trajectories and earnings. Higher performance in math among males may result in higher participation in science, technology, engineering, and math (STEM) majors and fields, which are associated with increased earnings relative to other careers. Since 2000, women have earned fewer than 20% of undergraduate degrees in engineering (Dey & Hill, 2007). This underrepresentation of women applies to most professions in STEM (Hill et al., 2010; Milgram, 2011). Women also earn less on average than men, a discrepancy driven in part by the higher proportion of men in lucrative STEM careers (Dey & Hill, 2007). Corbett and Hill (2012) found that there are significant wage gaps between men and women one year out of college and that part of the gap (though not all of it) is attributable to college major.
BACKGROUND ON TEST-TAKING EFFORT
This study uses a metric called response time effort (RTE), which flags extremely quick answers to test questions (Wise & Kong, 2005). Wise and Kong's (2005) metric quantifies a very particular type of low effort, namely, when students respond to a question so quickly that they could not have understood its content. For example, a student's response to a reading item would be flagged as noneffortful if he responded to it much faster than the time necessary to read the associated passage. This approach was first developed by Schnipke and Scrams (1997), who divided test examinees into two categories: those exhibiting solution behavior and those exhibiting rapid-guessing behavior. Assignment of examinees to those categories is based on response time, or the seconds between when an item is presented and when it is answered. If a student responds faster than some specific (and very short) response time threshold, then the answer is deemed rapid guessing. Response time has several advantages as a basis for measuring examinee effort. For example, because the examinee is unaware the data are being collected, response time does not suffer from self-report biases like surveys of student effort (Wise & Kong, 2005).
The main difficulty inherent in using response time lies in determining what item responses constitute solution versus rapid-guessing behavior. That is, a response to a test question must be deemed either effortful or noneffortful, and wrongly grouping students would threaten the measure's validity. Wise and Kong (2005) used an empirical approach to identify response time thresholds. These thresholds represent a very quick response time below which a student almost certainly responded before understanding the item's content. A primary goal underlying response time thresholds set for RTE is to avoid wrongly identifying effortful responses as rapid, such as when an advanced student answers a question correctly and quickly. Wise and Kong (2005) and Wise and Ma (2012) used the thresholds that did the best job of avoiding such false positives.
Once specific item responses were separated into solution versus rapid-guessing behavior, Wise and Kong (2005) used a single overall measure of test effort for each student. This overall measure computes the proportion of a student's test items on a single test that represent solution behavior. RTE scores range from 0 to 1. An RTE value of, say, .70 means the student rapidly guessed on 30% of the items. Wise (2015) went on to show that observed test scores from students with RTE values below .90 on the same test are much less reliable and show significant downward bias. Therefore, observed scores from tests on which students had an RTE value below .90 may be biased downward so much that they are not entirely trustworthy. Based on this work, RTE bands of 1 (no rapid responding), less than 1 but at least .90, and below .90 are used throughout my descriptive statistics and gaps estimates.
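To make the RTE calculation concrete, the following is a minimal sketch, not the scoring code used by Wise and Kong (2005) or in this study. It assumes each item has a known rapid-guessing threshold in seconds; the function names, the 3-second thresholds, and the response times are all hypothetical. A response is treated as solution behavior when its response time is at or above the item's threshold, and the band boundaries follow the three RTE groups used in the gap models (RTE of 1, at least .90 but below 1, and below .90).

```python
def response_time_effort(response_times, thresholds):
    """Return RTE: the proportion of items showing solution behavior.

    A response counts as rapid guessing when it is faster than the
    item's threshold; otherwise it counts as solution behavior.
    """
    if len(response_times) != len(thresholds):
        raise ValueError("need one threshold per item response")
    if not response_times:
        raise ValueError("need at least one item response")
    solution_flags = [rt >= th for rt, th in zip(response_times, thresholds)]
    return sum(solution_flags) / len(solution_flags)


def rte_band(rte):
    """Assign an RTE value to one of the three bands used in the gap models."""
    if rte == 1.0:
        return "no_rapid_guessing"
    if rte >= 0.90:
        return "at_least_90"
    return "below_90"


# Hypothetical 10-item test with a 3-second threshold on every item.
times = [12.0, 8.0, 1.5, 20.0, 2.0, 15.0, 9.0, 1.0, 11.0, 14.0]
thresholds = [3.0] * 10

rte = response_time_effort(times, thresholds)
print(rte)            # 0.7: three of ten responses were rapid guesses
print(rte_band(rte))  # below_90
```

In this toy case the student rapidly guessed on 3 of 10 items, which reproduces the RTE value of .70 used as the example above and places the score in the band that warrants closer scrutiny.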
As Wise's (2015) review of the literature on test effort shows, nearly 30 studies spanning over a decade provide consistent evidence that rapid guessing is a form of bias irrelevant to the construct, not an aspect of true achievement (DeMars, 2007; Rios, Liu, & Bridgeman, 2014; Swerdzewski, Harmes, & Finney, 2011; Wise & Kong, 2005). This research is vital to my own study. If rapid guessing is merely a proxy for content knowledge rather than a form of bias, then controlling for it in gaps models likely washes out true differences in achievement. That body of research addresses several specific and related concerns about RTE as a measure. For example, RTE would not be a valid measure of test effort if high-achieving students were responding quickly to items and still getting them right, or if low-performing examinees were immediately identifying items as too difficult and reacting by responding rapidly. These concerns have been addressed in the research (DeMars, 2007; Rios et al., 2014; Swerdzewski et al., 2011; Wise & Kong, 2005), summarized by Wise (2015).
As a form of bias, rapid guessing almost always biases observed scores downward relative to true scores (Wise, 2015; Wise & Kingsbury, 2016). For example, a simulation study by Rios et al. (2016) found that true scores in assessed subjects were understated by 0.2 standard deviations on average when students responded rapidly to 6% or more of the items (equivalent to an RTE value of .94 or below). This biasing occurs frequently and in foreseeable ways. For example, rapid guessing occurs across grade levels and subjects but is most extreme for middle- and high school students in reading (Wise, 2015). At times, more than 20% of test takers in later grades have RTE values below .90 (Jensen, Rice, & Soland, 2018; Wise, 2015).
There is some evidence that RTE differs by subgroup and therefore impacts mean differences in observed achievement test scores. Most of this work examines test effort differences between boys and girls. DeMars, Bashkov, and Socha (2013) examined literature on test-taking effort and found a limited number of studies showing that girls tend to exhibit higher levels of test-taking effort. Wise and DeMars (2010) looked at freshman-to-sophomore gains on a college communications test and found that higher rates of rapid guessing among boys distorted growth comparisons between males and females.
The data consist of reading and mathematics scores from the Measures of Academic Progress (MAP), a vertically scaled computer-adaptive assessment given in most states across the country. Because MAP is administered via computer, response times can be measured. MAP has several advantages for effort-related work, including that students cannot skip items, there are no time constraints that would force students to rapidly guess, and the test is adaptive, which means students are unlikely to see extremely difficult or easy content. Although MAP is administered nationwide, calculating RTE is cumbersome in terms of both computing power and labor. Therefore, rather than use data from all states to establish nationally representative gaps estimates, data were used from seven states and four regions: West (State 1), Midwest (States 2–5), Northeast (State 6), and Southeast (State 7).¹
Most of these states do not use MAP for high-stakes purposes. There are, however, exceptions. For example, districts in State 6 often use MAP as the basis for a teacher evaluation policy that identifies and remediates extremely low-performing teachers. Other states may also use MAP for district-specific accountability policies, but the test is not otherwise used in statewide policies. This fact is important given research showing that rapid guessing is a bigger issue for low-stakes tests (Wise, Pastor, & Kong, 2009). The relatively low stakes for MAP are both an advantage and a disadvantage of this study. On one hand, knowing how achievement gaps are impacted by test effort in high-stakes environments is important and will still need to be investigated. On the other, much of the literature on achievement gaps relies on tests that have no consequences for students. Given that some states in the sample use MAP in an accountability system and others use the assessment to inform practice (e.g., setting student growth goals), one could argue that stakes are lower on tests like NAEP, making estimates of the effect of rapid responding presented herein conservative.
Test results were used from three periods per year (fall, winter, and spring) over a 5-year period stretching from 2010 to 2015.² In all the figures that follow, each period-by-year combination is given a consecutive number, such that those time periods range from period 1 (fall of 2010) to period 15 (spring of 2015). Only one grade level was used per year to ensure that differences in RTE were not conflated with students' ages, which are often negatively correlated with test-taking effort (DeMars & Wise, 2010; Wise & DeMars, 2010). Students in the sample began in Grade 5 and ended in Grade 9. Although test effort and achievement gaps can be estimated for groups of students in successive grades, the cohorts are not intact: Students can enter and exit the sample at any time period, so long as they have at least one valid test score and accompanying RTE value during the study's duration. Table 1 presents statistics on the number of students by race, gender, and time period for three periods. The achievement patterns in Table 1 generally match those from the literature. For example, female students outperform males in reading, with little discernible gap in mathematics.
Table 1. Mean Standardized Reading and Math Scores by Student Subgroup
In this section, methods are presented for each research question.
DO STUDENT SUBGROUPS DIFFER IN TERMS OF TEST-TAKING EFFORT?
First, gaps in student test-taking effort were examined visually by plotting the proportion of students with an RTE value below .90 by subgroup, time period, and subject. Again, an RTE of .90 means that a given student showed solution behavior on 90% of the items on a test. The .90 RTE cutoff was used based on research showing that scores with values below this threshold should be given closer scrutiny and, in some cases, may not be valid achievement estimates (Wise & Ma, 2012). Second, differences in rapid guessing rates are reported in standard deviation units. RTE differences were standardized using a metric-free approach to estimating gaps developed by Ho (2009) and Ho and Reardon (2012). This approach uses only the ordinal information from RTE-based rank orderings of students (Ho, 2009; Ho & Haertel, 2006; Reardon & Ho, 2015). Therefore, the approach helps ensure that test effort differences across subgroups are not the result of some arbitrary facet of the RTE scale (Ho, 2009). Showing that the results are not a facet of the underlying scale is important given research demonstrating that achievement gap estimates are often sensitive to scale (Ho, 2009).
Specifically, RTE gaps are presented using an ordinal measure reported in standard deviation units called $V$ in the literature (Ho, 2009). $V$ equals
$$V = \sqrt{2}\,\Phi^{-1}\left(P_{a>b}\right), \qquad (1)$$
where $\Phi$ is the cumulative standard normal distribution function and $\Phi^{-1}$ is the probit function. $P_{a>b}$ is the probability that a randomly chosen minority or female student has a higher score (RTE value) than that of a randomly chosen reference student (Whites for race-based gaps, males for gender gaps). The statistic $V$ can be thought of as the gap if RTE values are monotonically transformed such that both minority and reference values are normally distributed and gaps are computed in pooled standard deviation units. For more details on estimating and interpreting $V$ and $P_{a>b}$, please see Ho (2009).
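The ordinal gap can be illustrated with a short sketch. It assumes the statistic takes the form V = sqrt(2) * probit(P), where P is the probability that a randomly chosen focal-group student has a higher RTE than a randomly chosen reference-group student, which is how Ho (2009) is commonly summarized. Counting ties as one half is an assumption made here because RTE values cluster at 1; it is not necessarily how ties were handled in the study. The function names and RTE values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_superiority(focal, reference):
    """Estimate P(focal > reference) over all pairs, counting ties as 0.5."""
    wins = 0.0
    for a in focal:
        for b in reference:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(focal) * len(reference))


def v_gap(focal, reference):
    """Ordinal gap in standard deviation units: sqrt(2) * probit(P).

    NormalDist.inv_cdf raises StatisticsError when P is exactly 0 or 1,
    that is, when the two groups do not overlap at all.
    """
    p = prob_superiority(focal, reference)
    return sqrt(2) * NormalDist().inv_cdf(p)


# Hypothetical RTE values for a focal and a reference group.
focal_rte = [0.70, 0.85, 0.90, 1.00]
reference_rte = [0.90, 0.95, 1.00, 1.00]

p = prob_superiority(focal_rte, reference_rte)
v = v_gap(focal_rte, reference_rte)
print(p)  # 0.21875
print(v)  # negative: the effort gap favors the reference group
```

Because only the rank ordering of RTE values enters the calculation, any monotonic rescaling of RTE leaves the result unchanged, which is the property that motivates the metric-free approach.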
ARE ACHIEVEMENT GAPS SENSITIVE TO TEST-TAKING EFFORT?
Gaps were estimated⁴ at each time period $t$ (corresponding to the 15 term-year combinations) for student $i$ in subject $j$ and state $k$ by regressing standardized test scores $Y_{ijkt}$ on a dummy variable $G_i$ that takes the value of 1 if the student is a racial minority or female student and 0 if the student is in the reference category:
$$Y_{ijkt} = \beta_0 + \beta_1 G_i + \varepsilon_{ijkt}. \qquad (2)$$
The coefficient of interest is $\beta_1$, which is the mean difference in MAP scores associated with being a minority or female student relative to being a White or male student, respectively.
Equation 3 is the same as Equation 2 but conditions on three RTE bands:
$$Y_{ijkt} = \beta_0 + \beta_1 G_i + \beta_2 R^{(2)}_{ijkt} + \beta_3 R^{(3)}_{ijkt} + \varepsilon_{ijkt}, \qquad (3)$$
where $R^{(1)}_{ijkt}$ is a dummy coded as 1 when the student’s RTE equals 1 (here, the omitted reference group), $R^{(2)}_{ijkt}$ is a dummy coded as 1 when the RTE value is less than 1 but greater than or equal to .90, and $R^{(3)}_{ijkt}$ is a dummy variable coded as 1 when the RTE value is less than .90. One benefit of using dummy variables to measure RTE is that such an approach alleviates a problem with the measure as a covariate in a regression model, namely, that it is highly skewed. In this model, $\beta_1$ can be interpreted as the mean difference in achievement, reported in standard deviation units, when only comparing students within the same RTE band.
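The difference between the unconditional and conditional gap models can be sketched with a small ordinary least squares example. This is an illustration only: the data are invented and constructed without noise so that the answer is known in advance, the scores are not standardized as they are in the study, and the helper names are hypothetical. The unconditional model regresses scores on the subgroup dummy alone; the conditional model adds dummies for RTE of at least .90 but below 1 and for RTE below .90, with RTE equal to 1 as the omitted band.

```python
def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * p for x, p in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def gap_estimates(scores, group, rte):
    """Return (unconditional gap, gap conditional on RTE band)."""
    mid = [1.0 if 0.90 <= v < 1.0 else 0.0 for v in rte]
    low = [1.0 if v < 0.90 else 0.0 for v in rte]
    x_uncond = [[1.0, g] for g in group]
    x_cond = [[1.0, g, m, l] for g, m, l in zip(group, mid, low)]
    return ols(x_uncond, scores)[1], ols(x_cond, scores)[1]


# Invented data: score = -0.2*group - 0.5*(mid band) - 1.0*(low band).
group = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
rte = [1.0, 1.0, 1.0, 0.95, 0.80, 1.0, 0.95, 0.80, 0.80, 0.70]
scores = [0.0, 0.0, 0.0, -0.5, -1.0, -0.2, -0.7, -1.2, -1.2, -1.2]

uncond_gap, cond_gap = gap_estimates(scores, group, rte)
print(round(uncond_gap, 3), round(cond_gap, 3))  # -0.6 -0.2
```

In this constructed example the focal group rapidly guesses more often, so the raw gap of -0.6 overstates the within-band gap of -0.2. The size of any such change in real data depends on how strongly effort differs by subgroup and how much rapid guessing depresses scores.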
As a robustness check, gaps were also estimated using the metric-free approach detailed in research question 1, but with RIT scores as the outcome rather than RTE values.⁵ Results showed changes in gaps of similar direction and magnitude to those between Equations 2 and 3. Therefore, one can be somewhat confident that the sensitivity of gaps to test effort is not being driven solely by the underlying scale. Given the similarity of the results, only those from Equations 2 and 3 are reported, but metric-free results are available on request.
LIMITATIONS OF THE STUDY DESIGN AND DIRECTIONS FOR FUTURE RESEARCH
There are several limitations of this study. First, despite considerable evidence that RTE is a form of bias and not a proxy for true achievement, one cannot rule out some modest correlation between RTE and achievement. The problem such a correlation presents can be made clear through a hypothetical: If achievement and effort are perfectly correlated, then, conditional on effort, there would be no gap. My study is not meant to provide an estimate of the unbiased true gap if everyone tried equally, but simply to show that achievement gaps are likely sensitive to effort. Current research suggests that correlations between test effort and achievement are low. For example, across 10 study results, there was a median correlation between SAT scores and RTE of .08, and a median correlation between course grades and RTE of .07 (Wise, 2015). In my sample, correlations between RTE and a student's MAP score from a different time period are low (.04–.12), including when conditioned on race. Nonetheless, this study should be replicated using an approach that rescores tests in a way that directly accounts for student effort to see how much those modest correlations impact the findings.
Second, despite strong evidence that RTE is valid for quantifying the proportion of test items on which a student demonstrated solution behavior, the metric has limitations. For one, RTE does not capture all forms of low effort. As an example, a student may have an item open on the computer for a long time and still not have engaged with the content, a scenario that would not be captured by RTE. Ultimately, the type of effort examined in this study is limited only to cases in which there is a high degree of certainty that the student responded to an item before fully understanding its content. Given how many other forms of disengagement likely exist, estimates reported in this study may understate the effect of test effort on gaps.
Third, there is no guarantee that the results generalize to the true within-state gaps estimates if all students in each state in the sample had taken MAP. As discussed in an earlier footnote, estimates were produced that were reweighted to better match school-level demographics and socioeconomic status within states. Results did not change; therefore, only the unweighted results are reported. There is also no guarantee that results generalize to other states not included in the sample. Findings should be replicated using other samples.
Results suggest that test effort is highly correlated with subgroup status. As a result, unconditional and conditional achievement gaps often differ, with the biggest changes occurring in reading and between male and female students.
DO STUDENT SUBGROUPS DIFFER IN TERMS OF TEST-TAKING EFFORT?
Results show that high proportions of students in Grades 7 and above exhibit low test effort and that these rates of low effort can differ by more than 0.3 standard deviations across subgroups. Figure 1 shows plots of the proportion of students with RTE values below .90 by subgroup, subject, and time period (the higher the value on the vertical axis, the higher the rates of rapid guessing). This figure suggests that test effort differs by subgroup, especially in reading. For example, in reading at Time 15 (spring of Grade 9), 25% of African American students showed low test effort, compared with 15% of White students. Although gaps in mathematics test effort tend to be much smaller relative to reading, they are quite high for boys versus girls: Roughly twice as many males disengaged from math tests in high school as compared with females. Gaps are fairly small between White and Asian students, but the latter are the only racial minority subgroup evincing more effort relative to White students. One should also note that rates of rapid guessing increase differentially over time by subgroup, which could mean that changes in achievement gaps estimates over time are also impacted.
Estimates of the gap in test effort in standard deviation units using the metric-free approach detailed by Ho (2009) also show large differences in rapid guessing by subgroup. Negative values mean that the test effort gap favors White students or boys. In math, gaps range from -0.28 standard deviations (Native Americans) to 0.17 standard deviations (female students). In reading, gaps range from -0.36 standard deviations (African Americans) to 0.22 standard deviations (Asian Americans). In general, male students showed higher rates of rapid guessing than females, with differences consistently reaching 0.2 standard deviations in later grades.
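One way to picture a metric-free gap of this kind is the V statistic associated with Ho (2009), which transforms the probability that a randomly chosen member of one group outscores a randomly chosen member of the other. The sketch below is illustrative only: the function names are hypothetical, ties are counted as one half (an assumption that matters for RTE, where many students share the value 1.0), and the sign convention (focal group minus reference group) is assumed to match the one described above.

```python
from math import sqrt
from statistics import NormalDist


def prob_superiority(focal, reference):
    """P(focal > reference) over all cross-group pairs, counting ties as 0.5."""
    wins = 0.0
    for f in focal:
        for r in reference:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(focal) * len(reference))


def v_gap(focal, reference):
    """V = sqrt(2) * Phi^{-1}(P(focal > reference)).

    Positive values mean the focal group tends to have higher values.
    The inverse normal CDF is undefined at 0 and 1, so V cannot be computed
    when one group dominates the other completely.
    """
    p = prob_superiority(focal, reference)
    return sqrt(2) * NormalDist().inv_cdf(p)


# Toy RTE values (illustrative only): the focal group shows lower effort,
# so the resulting gap is negative.
gap = v_gap([0.80, 0.90, 1.00], [0.90, 1.00, 1.00])
```

Because V depends only on the ordering of values across groups, it is unchanged by any monotonic rescaling of the underlying measure.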
Figure 1. The proportion of students with RTE values below .90 by subject and student subgroup
ARE ACHIEVEMENT GAPS SENSITIVE TO TEST-TAKING EFFORT?
The sensitivity of gap estimates to test motivation largely mirrors the patterns in RTE. Figure 2 presents gaps in mathematics achievement conditional and unconditional on RTE by subgroup. Gaps are presented in standard deviation units (though these are not metric-free estimates), with 95% confidence intervals. Note that the vertical axes are scaled differently so that gaps can be examined more closely. Changes in gaps between conditional and unconditional estimates are not statistically distinguishable from 0 for any racial subgroup. For male–female gaps, the conditional gap changes by 0.05 standard deviation units, on average, compared with the unconditional gap, with a minimum change of 0.024 standard deviations in fifth grade and a maximum change of 0.088 standard deviations in ninth grade.
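The distinction between unconditional and conditional gaps can be sketched as two least-squares fits: one regressing scores on a subgroup indicator alone, and one that also controls for RTE. This is a simplified illustration, not the study's actual model specification. The data are invented, the function names are hypothetical, and scores are left unstandardized for brevity. The conditional coefficient is obtained by residualizing both the score and the group indicator on RTE, which gives the same group coefficient as a regression that includes both predictors.

```python
def mean(values):
    return sum(values) / len(values)


def slope_intercept(x, y):
    """Ordinary least-squares slope and intercept for y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def residuals(x, y):
    """Residuals of y after removing its linear association with x."""
    slope, intercept = slope_intercept(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def unconditional_gap(group, score):
    """Coefficient on the 0/1 group indicator with no controls (mean difference)."""
    return slope_intercept(group, score)[0]


def conditional_gap(group, score, rte):
    """Coefficient on the group indicator after controlling for RTE."""
    return slope_intercept(residuals(rte, group), residuals(rte, score))[0]


# Invented data: the focal group (1) has lower RTE on average, and scores
# depend on both RTE and group membership.
group = [0, 0, 0, 1, 1, 1]
rte = [1.0, 0.9, 1.0, 0.8, 0.9, 0.7]
score = [200 + 10 * r - 2 * g for r, g in zip(rte, group)]

gap_unconditional = unconditional_gap(group, score)
gap_conditional = conditional_gap(group, score, rte)
```

In this toy example the unconditional gap is larger in magnitude than the conditional gap because part of the raw difference is carried by the groups' different RTE levels.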
Figure 2. Math gaps in achievement conditional and unconditional on test effort
Figure 3 presents the same figure but in reading. In general, conditional estimates are smaller than unconditional. For Hispanic students, conditional and unconditional estimates differ by 0.058 standard deviation units, on average, with a range of 0.034 standard deviation units in fifth grade to 0.08 standard deviation units in ninth grade. For Black students, conditional and unconditional estimates differ by a mean of 0.086 standard deviation units, with a range of -0.054 standard deviation units in fifth grade to -0.137 standard deviation units in ninth grade. Meanwhile, male–female gaps tend to shrink when conditioned on RTE, with a mean change of 0.093 standard deviation units (the range is from 0.058 to 0.13 standard deviation units).
Figure 3. Reading gaps in achievement conditional and unconditional on test effort
In many cases, conditional and unconditional estimates do not differ in substantive ways. For racial minority gaps, conditional and unconditional estimates do not differ much in mathematics, even in later grades. Reading gap estimates for racial minorities are more sensitive, but the magnitude of the shifts is generally modest, especially during elementary school. The sensitivity does, however, appear statistically and practically meaningful in three cases. First, African American gaps in reading differ across conditional and unconditional estimates, with the difference reaching more than a tenth of a standard deviation in later grades. Second, the sensitivity of gaps to test effort may be greatest for male–female gaps. Conditional male–female gaps in mathematics increase by around 200% in upper grades relative to unconditional estimates. This difference reverses the direction of the mathematics gap in later grades, from slightly favoring girls (unconditional) to favoring boys by more than 0.05 standard deviations (conditional). Third, given that rates of rapid guessing shown in Figure 1 increase at different rates as students progress through school, changes in gaps across grades appear somewhat sensitive, especially for male–female gaps.
IMPLICATIONS FOR PRACTICE AND POLICY
Achievement gap estimates are used to understand a host of challenges facing our nation. If gaps estimates are biased, or reflect factors unrelated to achievement, then our understanding of those challenges and their root causes may be misguided as well. In education, gaps estimates are often used to assess the fairness and effectiveness of the U.S. schooling system over the course of decades. For example, researchers, practitioners, and policy makers use Black–White gaps to measure progress in reducing long-standing educational inequities that have their roots in segregation and economic disparity (Quinn, 2015). Beyond education, gap estimates are also an important data point in conversations about addressing broader societal issues that may stem from schooling. For instance, male–female gaps in mathematics can help shed light on why women are less likely to major in quantitative disciplines and to take related (and oftentimes lucrative) jobs, which in turn contributes to differences in wages between men and women (Fennema & Hart, 1994). Basing such research on biased gaps estimates means those challenges may be misunderstood.
In addition to helping us understand educational and societal problems, gaps are also used to evaluate how effectively those problems are being addressed. For example, research often highlights that although achievement gaps in math favoring boys have existed for decades (Friedman, 1989), these gaps appear to be narrowing (Rampey et al., 2009). Some attribute that narrowing to policies designed to support women in science- and math-related careers (Fennema & Hart, 1994). Other studies use gaps to examine how practices and policies have impacted disparities in Black–White achievement (Lee, 2002). If gaps estimates are biased by differential rapid guessing, then they are less likely to be useful as instruments of evaluation and may even lead to ineffective programs being deemed effective or vice versa.
Fortunately, the news for gaps research is generally good: Most gap estimates in my study are largely insensitive to the inclusion of RTE in the model. Overall, the sensitivity of gaps to test motivation appears low in early grades, in mathematics, and for most racial minority groups relative to White students. Even for African American students in reading, though gaps can shift by more than 0.1 standard deviation between conditional and unconditional estimates, these changes are insufficient to influence most basic conclusions about differential achievement. That is, the gaps still favor White students and are large in magnitude.
There are, however, a couple of instances in which the conclusions being drawn in earlier research may be misguided if test motivation is not considered. First, male–female gaps in both subjects are quite sensitive to test motivation. The shifts I observe are large relative to gaps estimates reported in the literature, including by studies that use gaps to examine issues of fundamental importance to the U.S. educational system. For example, if test motivation were to similarly impact the male–female math gap of roughly 0.1 standard deviation reported by McGraw et al. (2006), then the conditional gap would grow by nearly 100%. In my study, the shift in the mathematics gap is the difference between the gap favoring males versus females in later grades. As discussed, the mathematics gap has major consequences for education, career opportunities, and earnings.
Second, the differences in the changes in gaps across grades between conditional and unconditional estimates can also be sufficient to alter conclusions drawn in other research. Studies often report changes in gaps as students progress through school of .05 or less for Black–White and male–female gaps (Hanushek & Rivkin, 2006; Reardon & Galindo, 2009; Robinson & Lubienski, 2011). The shifts in year-to-year gaps reported in my study would be enough to wash out some of the across-grade changes reported elsewhere. For example, Robinson and Lubienski (2011) reported gap estimates in mathematics that decrease between fifth and eighth grade, and reading gap estimates that increase between fifth and eighth grade. I see similar patterns in unconditional estimates but little shift in the gaps between fifth and eighth grade for the conditional estimates. Although one cannot be certain low test motivation is also a factor in Robinson and Lubienski (2011) or other referenced studies, the argument is plausible given that most of this research relies on low-stakes tests like NAEP or ECLS-K. Results may generalize less well to studies using state accountability tests that have consequences for schools, though one should note that such tests often have minimal stakes for students. Future studies should investigate the issue.
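The "nearly 100%" figure can be checked with simple arithmetic. The text does not state which shift was applied to the McGraw et al. (2006) gap, so the worked example below assumes the largest change in the male–female mathematics gap reported in the results (0.088 standard deviations, in ninth grade) and, for comparison, the average change (0.05 standard deviations). Both choices are assumptions made for illustration.

```latex
\[
\frac{\text{shift}_{\max}}{\text{gap}}
  = \frac{0.088}{0.10} = 0.88 \quad (\text{about } 88\%),
\qquad
\frac{\text{shift}_{\text{mean}}}{\text{gap}}
  = \frac{0.05}{0.10} = 0.50 \quad (50\%).
\]
```

Under the first assumption, a gap of roughly 0.1 standard deviation would come close to doubling, which is consistent with the statement in the text.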
Yet, in some ways, the specific shifts in gap estimates are less important than the challenge they raise to assumptions about what gaps measure, exactly. This observation is especially relevant given how difficult test motivation itself is to measure. On one hand, there are plausible arguments that my estimates overstate the sensitivity of gaps to test motivation, especially if RTE and true achievement are correlated. On the other, there are equally plausible arguments that the reported estimates understate that sensitivity. RTE measures only a very particular behavioral manifestation of low test effort and is designed to be conservative, which means there are almost certainly students who are unmotivated but not flagged for rapid guessing. Regardless of whether my findings understate or overstate the bias introduced into gaps estimates by low motivation, they provide a new lens through which to view gaps.
Specifically, my results provide a useful reminder that mean achievement differences are not as uncomplicated as practitioners and policy makers sometimes assume. As Reardon (2016) noted, gaps result from the sum total of students' schooling, after-school, home, and neighborhood experiences. Further, aggregated achievement test scores are not measures of intelligence or ability but of performance. Therefore, observed scores are impacted by factors that adults control, like what students are taught, and by motivation to perform. As Pollock (2009) argued, the trick is to "actively denaturalize racial achievement patterns: to name them and claim them as things we, together, have both produced and allowed" (p. 171). Perhaps data showing stark differences in test motivation can help denaturalize some achievement patterns by raising new questions about what they are measuring and why they are occurring.
In practical terms, this denaturalization might come in several forms. First, even if achievement gaps are not sensitive to motivation, the differences in rapid guessing rates across subgroups raise important questions for practice and policy. Differences in these rates can reach more than 0.3 standard deviations for certain subgroup comparisons. In later grades, the proportion of students who rapidly guess so often that the validity of their scores might be in question differs by more than 10 percentage points across groups. These differences in rapid guessing might make educational leaders wonder: Why are there racial and gender differences in whether students feel that showing what they can do academically is worth their time?
The trends in low test effort shown in Figure 1 tend to mirror patterns on other outcomes that may be related to test motivation. For example, Balfanz, Herzog, and Mac Iver (2007) showed that low academic motivation often begins in middle school and increases during the early high school years. Similar across-grade trends occur for self-management, which measures how focused students are on academic tasks, and could relate to low test effort (Briesch & Chafouleas, 2009). In my sample, rapid guessing rates increased across subgroups as students got older and reached somewhat alarming rates in middle school for some subject-subgroup combinations. Additional research linking rates of rapid guessing to other social-emotional learning phenomena would give further credence to the idea that measured achievement gaps are the sum total of students' educational experiences, contexts, mindsets, and engagement.
Assuming a broader connection between social-emotional learning and test motivation, this study's findings could also have implications for how we understand the root causes of gaps. Though my analyses do not make a causal link between test motivation and achievement gaps, this line of work could eventually provide insights into what causes gaps and how to address them. If test motivation is related to broader engagement in school, then addressing disengagement could help reduce gaps. This idea is in line with the current emphasis in policy on measuring and fostering social-emotional learning constructs, many of which are determinants of achievement (Duckworth & Yeager, 2015). Should additional research find that rapid guessing is related to social-emotional learning constructs like self-management and to broader disengagement from school, then RTE may provide a useful proxy for such constructs and could inform interventions designed to close gaps.
NOTES
1. Given that within-state samples may not match within-state populations, several weighting schemes were employed to restore representativeness, but gap estimates changed by no more than .02 standard deviations.
2. One problem with the time span of this study is that changes to the testing platform led to somewhat anomalous RTE values in one time period (fall of eighth grade). While RTE and gaps estimates are reported for this period, they should be interpreted with caution. Figures that plot achievement gaps omit this period altogether because including those results distorts the scale range of the RTE values.
3. Gaps were re-estimated using intact cohorts, but results did not change substantively beyond the noise associated with smaller sample sizes.
4. One potential concern with Models 1 and 2 is that they do not account for measurement error. As a result, estimated gaps may be a function of the reliability with which achievement is estimated. I therefore adjusted both RIT-scale and standardized scores for measurement error. When using RIT scores, estimates were weighted by the standard error of measurement for each score. When using standardized RIT scores, estimates were weighted by the overall test reliability. These approaches did not change results substantively (that is, not beyond the second decimal place) and are therefore not reported.
5. RIT scores were coarsened into a set of K categories, removing all but the ordinal information from the test score distribution. From there, heteroskedastic ordered probit models were fit to those coarsened data that included a binary variable for race (1 for racial minority students, 0 for White students) or gender. The framework for this approach was developed by Ho and Reardon (2012) and Reardon and Ho (2015) to estimate gaps when only ordinal proficiency categories are available. I estimated gaps using these probit models conditional and unconditional on RTE, converted the coefficients to a V statistic, and compared results across models. While several factors complicate the direct comparison of these conditional and unconditional V estimates, they nonetheless showed changes in the gaps of similar direction and magnitude to those between Equations 2 and 3. Therefore, one can be somewhat confident that the sensitivity of gaps to test effort is not being driven solely by the underlying scale.
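The idea in Note 5 of recovering a gap from coarsened, ordinal data can be sketched in simplified form. The example below does not fit a heteroskedastic ordered probit model by maximum likelihood, as the study did. Instead, it swaps in a simpler least-squares line fit on probit-transformed cumulative proportions, which recovers the same V when both groups' scores are normally distributed on some common scale. The cut points, function names, and example distributions are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def cumulative_proportions(scores, cuts):
    """Share of scores at or below each cut point (the coarsened summary)."""
    n = len(scores)
    return [sum(s <= c for s in scores) / n for c in cuts]


def v_from_cumulative(cum_focal, cum_reference):
    """Estimate V (focal minus reference) from cumulative proportions.

    Fits z_reference = m0 + m1 * z_focal by least squares, where z is the
    inverse normal CDF of each cumulative proportion, then returns
    sqrt(2) * m0 / sqrt(1 + m1 ** 2). Requires at least two cut points and
    proportions strictly between 0 and 1.
    """
    z_f = [STD_NORMAL.inv_cdf(p) for p in cum_focal]
    z_r = [STD_NORMAL.inv_cdf(p) for p in cum_reference]
    mean_f = sum(z_f) / len(z_f)
    mean_r = sum(z_r) / len(z_r)
    sxx = sum((a - mean_f) ** 2 for a in z_f)
    sxy = sum((a - mean_f) * (b - mean_r) for a, b in zip(z_f, z_r))
    m1 = sxy / sxx
    m0 = mean_r - m1 * mean_f
    return sqrt(2) * m0 / sqrt(1 + m1 ** 2)


# Idealized check: focal ~ N(0.5, 1) and reference ~ N(0, 1), coarsened at
# three hypothetical cut points. The standardized gap is 0.5 by construction.
cuts = [-1.0, 0.0, 1.0]
cum_focal = [NormalDist(0.5, 1.0).cdf(c) for c in cuts]
cum_reference = [STD_NORMAL.cdf(c) for c in cuts]
v_estimate = v_from_cumulative(cum_focal, cum_reference)
```

With real data, `cumulative_proportions` would supply the inputs from each group's scores, and sampling error in those proportions would make the fitted line, and therefore V, an estimate rather than an exact recovery.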
REFERENCES
Balfanz, R., Herzog, L., & Mac Iver, D. J. (2007). Preventing student disengagement and keeping students on the graduation path in urban middle-grades schools: Early identification and effective interventions. Educational Psychologist, 42(4), 223–235.
Briesch, A. M., & Chafouleas, S. M. (2009). Review and analysis of literature on self-management interventions to promote appropriate classroom behaviors (1988–2008). School Psychology Quarterly, 24(2), 106.
Clotfelter, C. T., Ladd, H. F., & Vigdor, J. L. (2006). The academic achievement gap in grades three to eight (NBER Working Paper No. 12207). Cambridge, MA: National Bureau of Economic Research.
Cole, J. S., Bergin, D. A., & Whittaker, T. A. (2008). Predicting student achievement for low stakes tests with effort and task value. Contemporary Educational Psychology, 33(4), 609–624.
Corbett, C., & Hill, C. (2012). Graduating to a pay gap: The earnings of women and men one year after college graduation. Washington, DC: American Association of University Women.
DeMars, C. E. (2007). Changes in rapid-guessing behavior over a series of assessments. Educational Assessment, 12, 23–45.
DeMars, C. E., Bashkov, B. M., & Socha, A. B. (2013). The role of gender in test-taking motivation under low-stakes conditions. Research and Practice in Assessment, 8, 69–82.
DeMars, C. E., & Wise, S. L. (2010). Can differential rapid-guessing behavior lead to differential item functioning? International Journal of Testing, 10(3), 207–229.
Demmert, W. G., Grissmer, D., & Towner, J. (2006). A review and analysis of the research on Native American students. Journal of American Indian Education, 45(3), 5–23.
Dey, J. G., & Hill, C. (2007). Behind the pay gap. Washington, DC: American Association of University Women Educational Foundation.
Duckworth, A. L., & Yeager, D. S. (2015). Measurement matters: Assessing personal qualities other than cognitive ability for educational purposes. Educational Researcher, 44(4), 237–251.
Fennema, E., & Hart, L. E. (1994). Gender and the JRME. Journal for Research in Mathematics Education, 25(6), 648–659.
Ferraro, K. F., Farmer, M. M., & Wybraniec, J. A. (1997). Health trajectories: Long-term dynamics among Black and White adults. Journal of Health and Social Behavior, 38, 38–54.
Friedman, L. (1989). Mathematics and the gender gap: A meta-analysis of recent studies on sex differences in mathematical tasks. Review of Educational Research, 59(2), 185–213.
Gates, A. I. (1961). Sex differences in reading ability. Elementary School Journal, 61, 431–434.
Hanushek, E. A., & Rivkin, S. G. (2006). School quality and the black-white achievement gap (Working Paper No. 12651). Cambridge, MA: National Bureau of Economic Research.
Heckman, J., & Vytlacil, E. (2001). Identifying the role of cognitive ability in explaining the level of and change in the return to schooling. Review of Economics and Statistics, 83(1), 1–12.
Hill, C., Corbett, C., & St Rose, A. (2010). Why so few? Women in science, technology, engineering, and mathematics. Washington, DC: American Association of University Women.
Ho, A. D. (2009). A nonparametric framework for comparing trends and gaps across tests. Journal of Educational and Behavioral Statistics, 34(2), 201–228.
Ho, A. D., & Haertel, E. H. (2006). Metric-free measures of test score trends and gaps with policy-relevant examples (CSE Report 665). National Center for Research on Evaluation, Standards, and Student Testing (CRESST), University of California, Los Angeles.
Ho, A. D., & Reardon, S. F. (2012). Estimating achievement gaps from test scores reported in ordinal "proficiency" categories. Journal of Educational and Behavioral Statistics, 37(4), 489–517.
Husain, M., & Millimet, D. L. (2009). The mythical "boy crisis"? Economics of Education Review, 28(1), 38–48.
Jensen, N., Rice, A., & Soland, J. (2018). The influence of rapidly guessed item responses on teacher value-added estimates: Implications for policy and practice. Educational Evaluation and Policy Analysis, 40(2), 267–284.
Konstantopoulos, S. (2009). The mean is not enough: Using quantile regression to examine trends in Asian-White differences across the entire achievement distribution. Teachers College Record, 111(5), 1274–1295.
Lee, J. (2002). Racial and ethnic achievement gap trends: Reversing the progress toward equity? Educational Researcher, 31(1), 3–12.
McGraw, R., Lubienski, S. T., & Strutchens, M. E. (2006). A closer look at gender in NAEP mathematics achievement and affect data: Intersections with achievement, race/ethnicity, and socioeconomic status. Journal for Research in Mathematics Education, 37(2), 129–150.
Milgram, D. (2011). How to recruit women and girls to the science, technology, engineering, and math (STEM) classroom. Technology and Engineering Teacher, 71(3), 4–11.
Neal, D. (2006). Why has Black–White skill convergence stopped? Handbook of the Economics of Education, 1, 511–576.
Pang, V. O., Han, P. P., & Pang, J. M. (2011). Asian American and Pacific Islander students: Equity and the achievement gap. Educational Researcher, 40(8), 378–389.
Pollock, M. (2009). Colormute: Race talk dilemmas in an American school. Princeton, NJ: Princeton University Press.
Quinn, D. M. (2015). Kindergarten Black–White test score gaps: Re-examining the roles of socioeconomic status and school quality with new data. Sociology of Education, 88, 120–139.
Rampey, B. D., Dion, G. S., & Donahue, P. L. (2009). NAEP 2008: Trends in academic progress. NCES 2009-479. Washington, DC: National Center for Education Statistics.
Reardon, S. F. (2011). The widening academic achievement gap between the rich and the poor: New evidence and possible explanations. In G. J. Duncan & R. J. Murnane (Eds.), Whither opportunity? Rising inequality, schools, and children's life chances (pp. 91–116). New York, NY: Russell Sage Foundation.
Reardon, S. F. (2016, May). The landscape of U.S. educational inequality. Symposium on achievement gaps conducted at Stanford University, Stanford, CA.
Reardon, S. F., Baker, R., & Klasik, D. (2012). Race, income, and enrollment patterns in highly selective colleges, 1982–2004. Center for Education Policy Analysis, Stanford University, Stanford, CA.
Reardon, S. F., & Galindo, C. (2009). The Hispanic–White achievement gap in math and reading in the elementary grades. American Educational Research Journal, 46(3), 853–891.
Rios, J. A., Guo, H., Mao, L., & Liu, O. L. (2016). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104.
Rios, J. A., Liu, O. L., & Bridgeman, B. (2014). Identifying low‐effort examinees on student learning outcomes assessment: A comparison of two approaches. New Directions for Institutional Research, 2014(161), 69–82.
Robinson, J. P., & Lubienski, S. T. (2011). The development of gender achievement gaps in mathematics and reading during elementary and middle school: Examining direct cognitive assessments and teacher ratings. American Educational Research Journal, 48(2), 268–302.
Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: A new method of measuring speededness. Journal of Educational Measurement, 34, 213–232.
Swerdzewski, P. J., Harmes, J. C., & Finney, S. J. (2011). Two approaches for identifying low-motivated students in a low-stakes assessment context. Applied Measurement in Education, 24, 162–188.
Wise, S. L. (2015). Effort analysis: Individual score validation of achievement test data. Applied Measurement in Education, 28(3), 237–252.
Wise, S. L., & DeMars, C. E. (2010). Examinee noneffort and the validity of program assessment results. Educational Assessment, 15(1), 27–41.
Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test‐taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105.
Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18(2), 163–183.
Wise, S. L., & Ma, L. (2012, April). Setting response time thresholds for a CAT item pool: The normative threshold method. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.
Wise, S. L., Pastor, D. A., & Kong, X. J. (2009). Correlates of rapid-guessing behavior in low-stakes testing: Implications for test development and measurement practice. Applied Measurement in Education, 22(2), 185–205.