A Grand Educational Experiment in Reading Instruction: Toward Methodology for Building the Capacity of Pre-Collegiate Schooling

by Dick Schutz - January 24, 2012

Context: The results of 14 very-large-scale Randomized Controlled Experiments call into question the utility of the methodology for building the capacity of pre-collegiate schooling.

Purpose: To sketch the workings of Planned Variation methodology as an alternative and to call attention to a Natural Experiment currently in progress involving reading instruction in the UK and the US.

Conclusions: (1) Planned Variation methodology offers considerable promise for improving pre-collegiate schooling capacity. (2) The Grand Experiment warrants careful consideration as its operations unfold.

There is widespread consensus throughout the English-speaking world on a few matters:

Education is important to the welfare of individuals and nations

Reading capability is foundational to high-quality schooling

Education should be “evidence-based”

The “gold standard” in educational research is the Randomized Controlled Experiment

Setting aside other agreements and disagreements, these are the starting point for the discourse.

Randomized Controlled Experiments (in which participants are assigned randomly to experimental groups) are often called for in education, but seldom conducted. However, the U.S. Institute of Education Sciences has since the mid-1990s managed to conduct more than a dozen very-large-sample, multiple-year Randomized Controlled Experiments. The experiments and their associated web links are listed in Figure 1 at the end of the paper.

The studies represent an impressive, sustained replication of Randomized Controlled Experimental methodology.  Each of the 14 experiments was conducted consistent with the highest methodological standards, and all aspects of the experiments are clearly reported—complete with Executive Summary, Full Report, and Technical Appendices.  Each report runs to several hundred pages.

The announcement of each study was typically well-publicized in press conferences and news releases, but few of the reports of the results have been published in professional journals or have found their way into the peer-reviewed literature; they remain buried in obscurity on the Internet.   None of the experiments has contributed to building schooling capacity; no more was known about how to improve instruction after the experiments than before.

The experiments, without exception, replicate a single finding: “No Impact.”  Although the results are spun to put the best possible light on the initiatives/innovations being investigated, the studies consistently find that the instructional consequences sought were not obtained.

In each of the studies, what is consistently seen is variability: variability within classes; variability between classes within schools; variability between schools within districts; variability between districts within states; and variability between states.  What you see is what you get: variability.  What you don’t see or get is any information about the instruction that the students actually received or how to make the instruction more effective and reliable.

The cumulative results of the studies raise serious questions about the applicability and usefulness of Randomized Controlled Experiments in investigations of pre-collegiate schooling.  There is no reason to expect that further replications will produce any different results.

The experiments come at a very high dollar cost. Moreover, the results of such experiments become available “too late.” That is, when a Randomized Controlled Experiment is completed, the “interventions” have “vened”; they are completed; over. Both participants and investigators plod on as they were, or chase other “promising potentials.” All that can be said is, “More research is needed.”


Is there a better way? Yes, there is. It involves investigating Natural Variation and/or Planned Variation.

[See: http://en.wikipedia.org/wiki/Natural_experiment ]

The methodology has been used in epidemiology and in economics where, as in pre-collegiate schooling, the assumptions entailed in a Randomized Controlled Experiment cannot be met without creating artificial and intrusive conditions, such as randomly assigning participants (schools, teachers, and students) to comparison groups and denying the innovation/treatment to some participants to establish a “control group.”

Natural Variation Experimentation entails observing what is happening and looking for replications of if-then functional relationships.  That is, the investigation opens the black box of instruction to identify manipulable ways to achieve specified instructional consequences.  The methodology is closer to standard engineering than to standard science, although the scientific canons certainly apply.

Planned Variation Experimentation kicks Natural Variation Experimentation methodology up a notch. It entails investigating planned alternative ways (called Models, Treatments, Innovations, or Interventions) of arriving at a specified, aspired outcome under natural conditions. Although the methodology is presently all-but-unknown in education, a compilation of papers on the topic, edited by Alice Rivlin and Michael Timpane, was published way back in 1975 (Rivlin & Timpane, 1975).

The papers provide commentary on large-scale educational Planned Variation studies conducted in the 1960s and 1970s, primarily involving Head Start and Follow Through. The papers make it clear that there was serious confounding in the designs, along with problems in the statistical analysis, which fogged the “if-then” cause-and-effect relationships. Further, the most popular Models of the time did not fare at all well in the comparisons, engendering a tangle of politics and academic ideology.

The question of whether we should “give up or try harder” was answered by what later transpired.  Educational research history took a very different direction; the methodology was not further pursued. Randomized Controlled Experiments became viewed as the “Gold Standard,” and we’ve seen how that turned out.

Pre-collegiate schooling is characterized by all sorts of natural “experimental treatments,” constituting a “natural laboratory”; opportunities for observing the consequences of alternatives abound. As Yogi Berra noted, “You can observe a lot just by watching what is going on.” The instructional consequences of interest in pre-collegiate schooling are transparent. All one has to do to see them is to observe them. But the arcane statistics inherent in Randomized Controlled Experiments fog the observation. The situation is very much like what happened with the “Quants” and “Derivatives” that eventually triggered the banking collapse.

Planned Variation Experiments can be conducted unobtrusively, with hardly noticeable additional cost, and the results are immediately and continuously available. One of the things that has plagued research in pre-collegiate schooling is Fidelity to Treatment. Fidelity is the extent to which the participants (teachers) use (implement) the instructional materials and procedures that the investigators offer as a Treatment (program). When teachers don’t field the Treatment the way the investigators hoped, the investigators explain the poor results as lack of fidelity. Participants, however, regard the phenomenon differently. They see the Treatment as the product of ivory-tower people who don't understand the realities of schooling. Both are right. Investigators and participants are reading from the same musical score, but the songs they are singing are very different.

Investigators firmly believe that the Treatment is “well defined”; they understand it, and they convince teachers that they too understand it—more or less.  But obviously if the participants are fielding the Treatment in different ways, the Treatment is anything but “well defined.”  It's not that the participants are trying to subvert the investigators.  “Professional Development” is provided to ensure that the participants and the investigators are talking about the Treatment in the same way.  It's that the Treatments are vapor; all one sees as the outcome of the treatment is variability.  Investigators ignore the vapor, concluding “my theory of action was and is correct; it's just that there was a lack of fidelity.” Such a conclusion is tantamount to saying, “I'm a good leader, but no one is faithful in following me.”

Planned Variation Experiments subdue this pesky phenomenon by simply looking at the accomplishments that are being obtained under natural conditions.  In a Planned Variation Experiment, the investigators and the treatments are being tested, not the participants and students.  If a Model (i.e. the instruction) yields only variability, the Model is vapor.  If there are observable differences in the accomplishments, the Models can be compared in terms of everyday standards: reliability of effect, time, and cost.

In short, the results of a Planned Variation Experiment provide evidence of how instruction is working, not how investigators theorize that it should work.

It’s worth noting that the orientation advocated here is consistent with the “Reforms” generated by the Stanford Evaluation Consortium led by Lee Cronbach from 1974 to 1979 (Cronbach et al., 1980):

By the term evaluation we mean systematic examination of events occurring in and consequent on a contemporary program—an examination conducted to assist in improving that program and other programs having the same general purpose (p. 4).

Following Luther’s precedent for Reform, the Consortium posted Ninety-Five Theses.  The most relevant Theses for present purposes are extracted in Figure 2 at the end of the paper.  It’s apparent that the term “Reform” in education came to take on a whole different meaning than the Reformation the Consortium hoped for.


It so happens that initiatives in reading instruction currently underway in the UK and in the US provide the makings for a very grand Planned Variation Experiment.  The experimental design is sketched in what follows.


UK Model.  It happens that the UK government has committed to teach all children to read by the end of Year/Grade 2.  To ensure that the commitment is being met, a Screening Check has been constructed to be administered to all children at the end of Year/Grade 1. It's a simple check of whether or not a child has been taught how to handle the Alphabetic Code that enables the child to process written text in the same way that spoken communication is processed.

The Framework for the Check is very straightforward:


The Check is unobtrusive, will be administered by teachers, and will take only a few minutes for each child. The Check was trialed in 12 schools last fall, was field-tested in 300 schools this spring, and will be rolled out nation-wide at the end of the 2011-2012 school year.

Very important to the Experiment, teachers and schools are asked to specify what instructional program(s) they used.  This makes it possible to disaggregate the results, not only by the usual bio-social categories of interest, but also by instructional program. With this information, programs can be compared, instruction can be modified, and the modifications can be investigated with fresh students the following year.
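The disaggregation described above can be illustrated with a minimal Python sketch that tallies pass rates by reported instructional program. The record layout, the program names, and the pass/fail values are all hypothetical; nothing here reflects actual Screening Check data or the way the results are reported.

```python
from collections import defaultdict


def pass_rate_by_program(records):
    """Return {program: fraction passed} from (program, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for program, passed in records:
        totals[program] += 1
        if passed:
            passes[program] += 1
    return {program: passes[program] / totals[program] for program in totals}


# Hypothetical records: (program reported by the school, passed the Check?)
example_records = [
    ("Program A", True), ("Program A", True),
    ("Program A", False), ("Program A", True),
    ("Program B", True), ("Program B", False),
    ("Program B", False), ("Program B", False),
]

rates = pass_rate_by_program(example_records)
for program in sorted(rates):
    print(f"{program}: {rates[program]:.0%} passed")
```

The same tally could be keyed on any of the usual bio-social categories instead of, or in addition to, the instructional program.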

The Check will be administered at the end of Year/Grade 2 to students who do not pass the Check in Year/Grade 1. Year/Grade 2 marks the end of Key Stage 1 of the UK National Curriculum. The results of the Check in Year/Grade 2 will augment the teacher’s assessment, and will provide longitudinal information to teachers in Key Stage 2 (Years/Grades 3-6), which leads to the first formal testing at the end of Year/Grade 6.


The US Model.  It happens that the United States is following a very different route.  While the UK has taken on the commitment of teaching all children to read, the US has abandoned this commitment. Instead, the US is now committed to “ensuring that all students will graduate from high school 'college and career-ready' by 2020.”

To achieve the US commitment, Common Core State Standards for English Language Arts & Literacy in History/Social Studies, Science, and Technical Subjects have been formulated, and “new and better tests” consistent with the Standards are being constructed. While the UK Model expects formal reading instruction to be completed by Grade 2, the US Model stretches reading instruction from Grades K through 12. However, children will not be tested against the Standards until the end of Grade 3, a year later than the UK Model expects the task to be completed.


Data.  Data for the UK 2011 field-test will soon become available, followed by data for the spring 2012 roll out, which will then be updated regularly each year.

Data for the US Model will not become available until after the 2014-2015 school year when the “new and better tests” are rolled out. The commitment of “all students graduating from high school college and/or career-ready” is scheduled to be reached in 2020. However, the 2020 graduates-to-be are currently in the pipeline as Grade 4 students.  NAEP 2011 results indicate that many students are badly positioned for their “Race to the Top.”  


Costs.  Most certainly!  These can at this time be compared and projected into the future.  The UK Model's superiority on this variable is clear and has a high degree of practical significance.  

Student Learning: Reliability in Delivering Intended Instructional Accomplishments.  The US Model regards the UK Model’s intended accomplishment as “unrealistic.”  Preliminary evidence on that point for the UK Model will begin to become available when the results of the spring 2011 field-test are released.  The US Model’s intended accomplishment is not projected to be achieved until 2020, but progression can be monitored in the interim.

Application of the Results. The UK Model will generate results that can be used to upgrade the instruction of the new cohorts of students who enter school each year. The US Model will generate Summative [end of year] and Formative [intra-year] test scores each year for students from Grade 3 through high school. The Summative data will also be used as at least one element in determining teacher tenure and salary. The differences between the UK and the US Models will permit comparisons of teacher attrition, overall costs of teacher training and professional development, and several other consequences of professional and public interest.


Possibly. Skirmishes in the “reading wars” have been going on since at least the 18th Century, and continue today in the battle between proponents of “Whole Language” and proponents of “Phonics.” The Alphabetic Code-based Screening Check entailed in the UK Model is all about “Phonics.”

The Standards, foundational to the US Model, mention “Phonics” only indirectly and in passing:  

“. . .materials must meet the needs of a wide range of students, reinforcing key lessons in concepts of print, the alphabetic principle, and other basic conventions of the English writing system.”  

The US Model focuses on “close reading and analysis” of grade-banded texts. The tests accompanying the Standards focus on “comprehension of meaning” and give no attention to “phonics/the Alphabetic Code” per se.

The overwhelming majority of teacher unions, university professors, and literacy associations in both the UK and US support “Balanced Literacy/Mixed Methods,” in which “Phonics” is but one of several elements. However, the information that will be generated in the UK Model regarding the instructional programs that were used by schools and teachers will provide empirical comparison of “Phonics” and “Whole Language” instruction. The results will provide an empirical basis for ending the “wars.” Whether the war actually ends or not remains to be seen.


The basis that this one Experiment will provide is only a “beachhead.” Certainly there is more, much more, to schooling than teaching reading, and it’s still early in the game for this particular Grand Experiment. The high methodological value of this particular Experiment is that it constitutes a prototype for further natural experimentation that can operationally build pre-collegiate schooling capacity.

A notable feature of Planned Variation Experiments is that they can easily involve very large samples—hundreds of thousands if not millions of “subjects.” The results can be disaggregated by the usual bio-social and demographic variables of interest. But more importantly, the methodology provides entry into the black box of Instruction. That is, very large samples make it possible to readily draw and investigate random sub-samples to replicate comparisons of the Models involved. This procedure operationalizes the randomized, replicated “controls” that Randomized Controlled Experiments can only get at statistically. With the very large sample, it’s possible to draw sub-samples of any size. If you keep seeing the same thing consistently, you’ve demonstrated real-time replicability. If you don’t consistently see the same thing, you at least have good clues as to why when you get “back to the drawing board.”
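The sub-sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcome data are synthetic, the 80% and 60% pass rates are invented, and the function names are hypothetical rather than part of any established procedure.

```python
import random


def subsample_pass_rate(outcomes, size, rng):
    """Pass rate in one random sub-sample drawn without replacement."""
    sample = rng.sample(outcomes, size)
    return sum(sample) / size


def replicate_comparison(model_a, model_b, size, replications, seed=0):
    """Fraction of replications in which Model A's pass rate exceeds Model B's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(replications):
        rate_a = subsample_pass_rate(model_a, size, rng)
        rate_b = subsample_pass_rate(model_b, size, rng)
        if rate_a > rate_b:
            wins += 1
    return wins / replications


# Synthetic outcomes (1 = passed, 0 = did not pass); both rates are invented.
data_rng = random.Random(42)
model_a = [1 if data_rng.random() < 0.80 else 0 for _ in range(100_000)]
model_b = [1 if data_rng.random() < 0.60 else 0 for _ in range(100_000)]

consistency = replicate_comparison(model_a, model_b, size=500, replications=200)
print(f"Model A ahead in {consistency:.0%} of sub-samples")
```

If one Model comes out ahead in nearly every sub-sample, the comparison is replicating. If the fraction hovers near one half, the sub-samples show no consistent difference between the Models.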

Another notable feature of Planned Variation Experiments is that their conduct, analysis, and interpretation do not rely on statistical apparatus that only a very few “Quants” understand.  Planned Variation methodology makes it possible to eliminate the distinction between “researchers” and “practitioners.”  It also eliminates the distinction between “qualitative research” that is anecdotal and “quantitative research” that has questionable validity in “real world” application.

Bottom Line:  Planned Variation methodology has much to offer in building the capacity of pre-collegiate schooling.  Meanwhile, the Grand Experiment is in play.  

“More news on this breaking story at eleven.”


Cronbach, L. J., Ambron, S. R., Dornbusch, S. M., Hess, R. D., Hornik, R. C., Phillips, D. C., Walker, D. F., & Weiner, S. S. (1980). Toward Reform of Program Evaluation: Aims, Methods, and Institutional Arrangements. San Francisco: Jossey-Bass.

Rivlin, A. M. & Timpane, P. M. (1975). Planned Variation in Education: Should We Give Up or Try Harder?  Washington, D. C.: Brookings Institution.


Figure 1: Randomized Controlled Experiments Sponsored By the U.S. Institute of Education Sciences

Remedial Reading Interventions


Reading First


Early Reading First Evaluation


After-school instruction


DC Scholarship Program


Reading Comprehension Interventions


Comprehensive School Reform


Head Start


Charter Middle Schools


Supplemental Literacy Courses for Struggling Ninth-Grade Readers


Teacher Performance Pay

www.performanceincentives.org/data/files/pages/POINT REPORT_9.21.10.pdf

Efficacy of Schoolwide Programs to Promote Social and Character Development and Reduce Problem Behavior in Elementary School Children


Comprehensive Teacher Induction Programs


Middle School “Striving Readers”


Figure 2: Toward Reform of Educational Evaluation—Relevant Theses

1. Program evaluation is a process by which society learns about itself.

2. Program evaluations should contribute to enlightened discussion of alternative plans for social action.

4. An evaluation of a particular program is only an episode in the continuing evolution of thought about a problem area.

7. In debates over controversial programs, liars figure and figures often lie; the evaluator has a responsibility to protect clients from both types of deception.

9. Commissioners of evaluations complain that the messages from evaluations are not useful, while evaluators complain that the messages are not used.

12. The hope that an evaluation will provide unequivocal answers, convincing enough to extinguish controversy about the merits of a social program, is certain to be disappointed.

15. Accountability emphasizes looking back in order to assign praise or blame; evaluation is better used to understand events and processes for the sake of guiding future activities.

18. A demand for accountability is a sign of pathology in the political system.  [emphasis added]

20. The ideal of efficiency in government is in tension with the ideal of democratic participation; rationalism is dangerously close to totalitarianism.

21. The notion of the evaluator as a superman who will make all social choices easy and all programs efficient, is a pipedream.

26. What is needed is information that supports negotiation rather than information calculated to point out the “correct” decision.

27. Events move forward by piecemeal adaptations.

30. It is unwise for evaluations to focus on whether a project “has attained its goals.”

31. Goals are a necessary part of political rhetoric, but all social programs, even supposedly targeted ones, have broad aims.

33. Unfortunately, whatever the evaluator decides to measure tends to become a primary goal of program operators.

39. Before laying out a design, the evaluator should do considerable homework. Pertinent questions should be identified by examining the history of similar programs, the related social theory, and the expectations of program advocates, critics, and prospective clients.

51. An evaluation of a particular project has its greatest implications for projects that will be put in place in the future.

54. It is better for an evaluative inquiry to launch a small fleet of studies than to put all its resources into a single approach.

58. Merit lies not in form of inquiry but in relevance of information…

60. External validity, that is, the validity of inferences that go beyond the data, is the crux; increasing internal validity by elegant design often reduces relevance.

76. Evaluation contracts are increasing in size, but tying many strands into a single knot is rarely the best way to get useful information.

79. Decentralizing much evaluation to the state level would be a healthy development.

93. The evaluator is an educator; his success is to be judged by what others learn.

Cite This Article as: Teachers College Record, Date Published: January 24, 2012
https://www.tcrecord.org ID Number: 16667, Date Accessed: 10/27/2021 12:43:17 PM
