Creating Mathematical Futures through an Equitable Teaching Approach: The Case of Railside School

Statistical errors in this article?
Posted By: Careful Reader on May 4, 2014

I believe there may be some statistical errors in this article. Some questions for the authors:
(1) Given that the main hypothesis being tested relates to different patterns of development (it was predicted that students in the Railside school would have greater gains than those in the traditional schools), why wasn't a time-by-group ANOVA conducted on the data reported in Table 3? Doesn't Boaler & Staples's prediction relate to the time-by-group interaction effect, not the simple between-groups effects? In other words, why did Boaler & Staples not investigate how the between-groups differences changed over time, rather than only the between-groups differences at each time point? Is Boaler & Staples's analysis a case of what Ben Goldacre refers to as "the statistical error that just keeps on coming" (http://www.theguardian.com/commentisfree/2011/sep/09/badscienceresearcherror)?
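For concreteness, here is the kind of analysis I have in mind, as a minimal Python sketch. The scores below are made-up placeholders, not the Table 3 data; with only two time points, comparing gain scores between groups tests the same hypothesis as the time-by-group interaction.

import numpy as np
from scipy import stats

# Placeholder scores only -- NOT the actual Railside / traditional-school data.
railside_pre  = np.array([22.0, 18.0, 25.0, 30.0, 27.0])
railside_post = np.array([35.0, 30.0, 41.0, 44.0, 39.0])
trad_pre      = np.array([28.0, 24.0, 31.0, 26.0, 29.0])
trad_post     = np.array([33.0, 29.0, 37.0, 30.0, 34.0])

# With two time points, comparing gains between groups is equivalent to
# testing the time-by-group interaction, which is the hypothesis of interest.
railside_gain = railside_post - railside_pre
trad_gain = trad_post - trad_pre

t, p = stats.ttest_ind(railside_gain, trad_gain)
print(f"gain-score comparison: t = {t:.3f}, p = {p:.4f}")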
(2) In Table 3 Boaler & Staples report summary statistics for students who took "all three" tests (Year 1 pretest, Year 1 posttest and Year 2 posttest). Boaler & Staples compare the group means on these tests with a series of three independent t tests. Given that a Year 3 posttest was conducted as well, why were the results from this test not reported in this table? What were the mean scores on the Year 3 test of the two groups of students who had taken all four tests?
(3.1) On page 622 Boaler & Staples seek to argue that racial differences in achievement were reduced at Railside (the reform-oriented school), but that this was not the case at the traditional schools. They write:
"At the end of Year 1, only one year after the students started at Railside, there were no longer significant differences between the achievement of white and Latino students, nor Filipino students and Latino and Black students. The significant differences that remained at that time were between white and Black students and between Asian students and Black and Latino students (ANOVA F=5.208, df=280, p=0.000). Table 5 shows these results."
Isn't a one-way ANOVA on posttest data the wrong statistical test to use here? The claim Boaler & Staples wish to make is that racial differences were reduced at Railside more than they were at the traditional schools. Given this, why did they not investigate whether or not the school-by-year-by-ethnicity interaction effect was significant? Is Boaler & Staples's analysis another example of the error discussed by Goldacre?
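To be explicit about what I would have expected to see, here is a sketch of that kind of model. It assumes a hypothetical long-format table (one row per student per test occasion) with columns school, year, ethnicity and score, and it is only an illustration: it ignores the repeated-measures structure, which a mixed model with a student grouping factor would handle properly.

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format data: one row per student per test occasion,
# with columns 'school', 'year', 'ethnicity', 'score' (file name is a placeholder).
df = pd.read_csv("achievement_long.csv")

# The claim that racial gaps narrowed at Railside relative to the traditional
# schools rests on the school-by-year-by-ethnicity interaction term.
model = smf.ols("score ~ C(school) * C(year) * C(ethnicity)", data=df).fit()
print(anova_lm(model, typ=2))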
(3.2) In order to convert an F statistic derived from a one-way ANOVA into a p value one needs two degrees of freedom, given by k - 1 and N - k, where k is the number of groups and N is the total sample size. Here k = 5 and N = 272 (the sum of the n column in Table 5). Boaler & Staples report a single degree of freedom of 280, which would be impossible as either df, since N < 280. Where does this df value come from, and what are the correct degrees of freedom? Are the data in the table different to those used to calculate the test statistic? If so, why?
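Spelling out the arithmetic: with k = 5 and N = 272, the degrees of freedom should be k - 1 = 4 and N - k = 267, i.e. the result should be reported as F(4, 267); the reported df of 280 matches neither value.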
(3.3) Recalculating the one-way ANOVA by hand from the summary statistics in Table 5 yields a different F value: F(4, 267) = 7.082, p < .001 (you can do this yourself at http://www.danielsoper.com/statcalc3/calc.aspx?id=43). Where does Boaler & Staples's value of 5.208 come from?
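For anyone who wants to check this without the web calculator, the computation from group summary statistics is straightforward. Here is a short Python sketch; the means, SDs and ns passed in at the bottom are placeholders (chosen only so that the ns total 272), not the published Table 5 figures, which you would substitute in.

import numpy as np
from scipy import stats

def oneway_anova_from_stats(means, sds, ns):
    # One-way ANOVA F test computed from per-group means, SDs and sample sizes.
    means, sds, ns = map(np.asarray, (means, sds, ns))
    N, k = ns.sum(), len(ns)
    grand_mean = (ns * means).sum() / N
    ss_between = (ns * (means - grand_mean) ** 2).sum()
    ss_within = ((ns - 1) * sds ** 2).sum()
    df1, df2 = k - 1, N - k
    F = (ss_between / df1) / (ss_within / df2)
    return F, df1, df2, stats.f.sf(F, df1, df2)

# Placeholder values -- replace with the five rows of Table 5.
F, df1, df2, p = oneway_anova_from_stats(
    means=[60.0, 55.0, 50.0, 48.0, 45.0],
    sds=[15.0, 14.0, 16.0, 15.0, 13.0],
    ns=[40, 70, 84, 50, 28],
)
print(f"F({df1}, {df2}) = {F:.3f}, p = {p:.4f}")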
(3.4) Boaler & Staples suggest that at the end of Year 1 "there were no longer significant differences between the achievement of white and Latino students". But using the summary statistics reported in Table 5 to conduct a t test appears to suggest that the difference between white and Latino students' achievement was significant, t(152) = 2.55, p = .012. What test did Boaler & Staples use to conclude that there was no significant difference here? If it involved a correction for multiple comparisons, why wasn't such a correction used elsewhere in the paper (for example, to correct for the multiple comparisons in Tables 2 and 3)?
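The check above is just a two-sample t test computed from the Table 5 summary statistics; in Python it is a single call. The numbers below are placeholders rather than the published white and Latino means, SDs and ns -- only their total n of 154 is implied by the reported df of 152.

from scipy import stats

# Placeholder summary statistics -- substitute the white and Latino rows of Table 5.
t, p = stats.ttest_ind_from_stats(
    mean1=58.0, std1=15.0, nobs1=70,   # white students (placeholder)
    mean2=52.0, std2=14.0, nobs2=84,   # Latino students (placeholder)
    equal_var=True,                    # pooled test, df = n1 + n2 - 2 = 152 here
)
print(f"t(152) = {t:.2f}, p = {p:.4f}")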
(4) In the analyses on page 623 Boaler & Staples report that their sample sizes varied between 63 and 67 (bottom of paragraph 2). However, at the start of this paragraph they state that the sample size for this piece of data collection was 105. Why were around 40 participants excluded from these analyses?
(5.1) Throughout the paper Boaler & Staples report responses to various self-report items. For example, on page 623 they write:
"In the Year 3 questionnaire students were asked to finish the statement: 'I enjoy math in school' with one of four time options: all of the time, most of the time, some of the time, or none of the time. Fiftyfour percent of students from Railside (n=198) said that they enjoyed mathematics all or most of the time, compared with 29% of students in traditional classes (n=318) which is a significant difference (t = 4.758, df = 286, p<0.001)."
Since the authors report a t test, it seems that they are not computing the significance of the claim about the percentages of students who selected "all" or "most of the time", but rather are looking at group differences in the mean responses to these items. Is it appropriate to use a parametric statistical test to investigate group differences in means on a four-point Likert scale? This is an especially relevant question for those items which clearly did not follow a normal distribution (e.g. on page 636 Boaler & Staples report that nearly half of Railside students selected the highest possible response to one item).
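If the raw item responses were available, a rank-based check that makes no normality assumption would be easy to run alongside the t test. A sketch with made-up responses coded 1 (none of the time) to 4 (all of the time):

import numpy as np
from scipy import stats

# Made-up Likert responses coded 1-4 -- not the actual questionnaire data.
railside = np.array([4, 3, 4, 2, 3, 4, 3, 4, 2, 3])
traditional = np.array([2, 3, 1, 2, 3, 2, 1, 2, 3, 2])

# Parametric comparison of means (what a t test on the item does):
t, p_t = stats.ttest_ind(railside, traditional)

# Rank-based alternative that does not assume normality of a four-point scale:
u, p_u = stats.mannwhitneyu(railside, traditional, alternative="two-sided")

print(f"t test: t = {t:.2f}, p = {p_t:.4f}")
print(f"Mann-Whitney: U = {u:.1f}, p = {p_u:.4f}")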
(5.2) In the quote above Boaler & Staples report that the analysis was conducted on a sample size of 516 (198 + 318), but the degrees of freedom in their t test (286), given by N - 2, suggest that it was in fact 288. Which value is correct? Similarly, on page 637 Boaler & Staples write "At Railside, 84% of the students agreed with ["Anyone can be really good at math if they try"], compared with 52% of students in the traditional classes (n= 473, t = 8.272, df = 451, p<0.001)." How is it possible for N = 473 to yield df = 451?
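To spell out the arithmetic: a pooled two-sample t test has df = N - 2, so N = 516 should give df = 514, and the reported df = 286 implies N = 288; likewise N = 473 should give df = 471, and df = 451 implies N = 453.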

 