Post on the article: Creating Mathematical Futures through an Equitable Teaching Approach: The Case of Railside School

Statistical errors in this article?

Posted By: Careful Reader on May 4, 2014
I believe there may be some statistical errors in this article. Some questions for the authors:

(1) Given that the main hypothesis being tested relates to different patterns of development (it was predicted that students in the Railside school would have greater gains than those in the traditional schools), why wasn't a time-by-group ANOVA conducted on the data reported in Table 3? Doesn't Boaler & Staples's prediction relate to the time-by-group interaction effect, not the simple between-groups effects? In other words, why didn't Boaler & Staples investigate the trend in between-groups differences, not just the between-groups differences? Is Boaler & Staples's analysis a case of what Ben Goldacre refers to as "the statistical error that just keeps on coming" (http://www.theguardian.com/commentisfree/2011/sep/09/bad-science-research-error)?

(2) In Table 3 Boaler & Staples report summary statistics for students who took "all three" tests (Year 1 pre-test, Year 1 post-test and Year 2 post-test). Boaler & Staples compare the group means on these tests with a series of three independent t tests. Given that a Year 3 post-test was conducted as well, why were the results from this test not reported in this table? What were the mean scores on the Year 3 test of the two groups of students who had taken all four tests?

(3.1) On page 622 Boaler & Staples seek to argue that racial differences in achievement were reduced at Railside (the reform-oriented school), but that this was not the case at the traditional schools. They write:

"At the end of Year 1, only one year after the students started at Railside, there were no longer significant differences between the achievement of white and Latino students, nor Filipino students and Latino and Black students. The significant differences that remained at that time were between white and Black students and between Asian students and Black and Latino students (ANOVA F=5.208, df=280, p=0.000). Table 5 shows these results."

Isn't a one-way ANOVA on post-test data the wrong statistical test to use here? The claim Boaler & Staples wish to make is that racial differences were reduced at Railside more than they were at the traditional schools. Given this, why did they not investigate whether or not the school-by-year-by-ethnicity interaction effect was significant? Is Boaler & Staples's analysis another example of the error discussed by Goldacre?

(3.2) In order to convert an F statistic derived from a one-way ANOVA into a p value one needs two degrees of freedom, given by k-1 and N-k, where k is the number of groups and N is the total sample size. Here k = 5 and N = 272 (the sum of the n column in Table 5). Boaler & Staples report a single degree of freedom of 280, which would be impossible as either df, since N < 280. Where does this df value come from, and what are the correct degrees of freedom? Are the data in the table different to those used to calculate the test statistic? If so, why?
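The degrees-of-freedom arithmetic in (3.2) can be checked with a minimal Python sketch. The function name `one_way_anova_df` is hypothetical and introduced only for illustration; the values k = 5 and N = 272 are the ones quoted above from Table 5.

```python
from typing import Tuple


def one_way_anova_df(k: int, n_total: int) -> Tuple[int, int]:
    """Return (between-groups df, within-groups df) for a one-way ANOVA.

    k is the number of groups, n_total the total sample size.
    (Hypothetical helper, not taken from the article.)
    """
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")
    return k - 1, n_total - k


# Values quoted in the post: five ethnic groups, N = 272 (sum of the n column).
df_between, df_within = one_way_anova_df(k=5, n_total=272)
print(df_between, df_within)

# Neither df can equal the reported value of 280, because both are below N.
print(280 in (df_between, df_within))
```

With these inputs the two degrees of freedom are 4 and 267, which is why a single reported df of 280 cannot correspond to either of them.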

(3.3) Re-calculating the one-way ANOVA by hand from the summary statistics in Table 5 yields a different F value: F(4, 267) = 7.082, p < .001 (you can do this yourself at http://www.danielsoper.com/statcalc3/calc.aspx?id=43). Where does Boaler & Staples's value of 5.208 come from?
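For readers who prefer not to rely on the online calculator, the recalculation described in (3.3) might be sketched as follows. The function name `anova_f_from_summary` is hypothetical, and the numbers in the example call are toy placeholders, not the values from Table 5; substitute each group's n, mean and standard deviation from the table to repeat the check.

```python
from typing import List, Tuple


def anova_f_from_summary(groups: List[Tuple[int, float, float]]) -> Tuple[float, int, int]:
    """One-way ANOVA F statistic from per-group (n, mean, sd) summaries.

    Returns (F, df_between, df_within). Assumes sd is the sample standard
    deviation (n - 1 denominator). Hypothetical helper for illustration.
    """
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / n_total

    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups sum of squares: recovered from each group's sample variance.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy placeholder data (NOT Table 5): two groups of three observations each.
toy_groups = [(3, 1.0, 1.0), (3, 3.0, 1.0)]
f_stat, df1, df2 = anova_f_from_summary(toy_groups)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

The same function applied to the five rows of Table 5 should reproduce the F(4, 267) value quoted above if the table's summary statistics are the ones the authors analysed.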

(3.4) Boaler & Staples suggest that at the end of Year 1 "there were no longer significant differences between the achievement of white and Latino students". But using the summary statistics reported in Table 5 to conduct a t test suggests that the difference between white and Latino students' achievement was significant, t(152) = 2.55, p = .012. What test did Boaler & Staples use to conclude that there was no significant difference here? If it involved a correction for multiple comparisons, why wasn't such a correction used in other places in the paper (to, for example, correct for the multiple comparisons in Tables 2 and 3)?

(4) In the analyses on page 623 Boaler & Staples report that their sample sizes varied between 63 and 67 (bottom of paragraph 2). However, at the start of this paragraph they state that the sample size for this piece of data collection was 105. Why were around 40 participants excluded from these analyses?

(5.1) Throughout the paper Boaler & Staples report responses to various self-report items. For example, on page 623 they write:

"In the Year 3 questionnaire students were asked to finish the statement: 'I enjoy math in school' with one of four time options: all of the time, most of the time, some of the time, or none of the time. Fifty-four percent of students from Railside (n=198) said that they enjoyed mathematics all or most of the time, compared with 29% of students in traditional classes (n=318) which is a significant difference (t = 4.758, df = 286, p<0.001)."

Since the authors report a t test, it seems that they are not computing the significance of the claim about the percentages of students who selected "all" or "most of the time", but rather are looking at group differences in the mean responses to these items. Is it appropriate to use a parametric statistical test to investigate group differences in means on a four-point Likert scale? This is an especially relevant question for those items which clearly did not follow a normal distribution (e.g., on page 636 Boaler & Staples report that nearly half of Railside students selected the highest possible response to one item).
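To illustrate the distinction drawn in (5.1), here is a rough sketch of what a test on the percentages themselves could look like: a pooled two-proportion z test. This is not the authors' analysis. It assumes the quoted figures (54% of 198 and 29% of 318) are the actual proportions and group sizes, which the reported degrees of freedom call into question, and the function name `two_proportion_z` is hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z test; returns (z, two-sided p value).

    Uses the normal approximation. Hypothetical helper for illustration only.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Figures as quoted from page 623 (rounded percentages, so only approximate).
z, p_value = two_proportion_z(0.54, 198, 0.29, 318)
print(f"z = {z:.2f}, two-sided p = {p_value:.2g}")
```

A statistic of this kind tests the claim actually stated in the text (a difference in the share answering "all" or "most of the time"), whereas a t test on the item scores tests a difference in mean response.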

(5.2) In the quote above Boaler & Staples report that the analysis was conducted on a sample size of 516 (198 + 318), but the degrees of freedom in their t test (286), given by N-2, suggest that it was in fact 288. Which value is correct? Similarly, on page 637 Boaler & Staples write "At Railside, 84% of the students agreed with ["Anyone can be really good at math if they try"], compared with 52% of students in the traditional classes (n= 473, t = -8.272, df = 451, p<0.001)." How is it possible for N = 473 to yield df = 451?
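The mismatches in (5.2) can be laid out explicitly with a small sketch. It assumes, as the post does, an independent-samples t test with pooled variance, for which df = N - 2; the helper names are hypothetical.

```python
def pooled_t_df(n_total: int) -> int:
    """Degrees of freedom for a pooled-variance independent-samples t test."""
    return n_total - 2


def implied_n(df: int) -> int:
    """Total sample size implied by a reported pooled-variance df."""
    return df + 2


# Page 623: stated group sizes 198 and 318, reported df = 286.
stated_n_623 = 198 + 318
print(stated_n_623, pooled_t_df(stated_n_623), implied_n(286))

# Page 637: stated n = 473, reported df = 451.
print(473, pooled_t_df(473), implied_n(451))
```

Under this assumption the stated sample sizes would give df = 514 and df = 471, while the reported df values imply totals of 288 and 453 respondents respectively.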