
Leveraging Big Data to Help Each Learner and Accelerate Learning Science

by Philip H. Winne - 2017

Background: Today’s gold standard for identifying what works, the randomized controlled trial, poorly serves each and any individual learner. Elements of my argument provide grounds for proposed remedies in cases where software can log extensive data about operations each learner applies to learn and each bit of information to which a learner applies those operations.

Purpose of Study: Analyses of such big data can produce learning analytics that provide raw material for self-regulating learners, for instructors to productively adapt instructional designs, and for learning scientists to advance learning science. I describe an example of such a software system, nStudy.

Research Design: I describe and analyze features of nStudy, including bookmarks, quotes, notes, and other artifacts that can be used to generate trace data.

Results: By using software like nStudy as they study, learners can partner with instructors and learning scientists in a symbiotic and progressive ecology of authentic experimentation.

Conclusion: I argue that software technologies like nStudy offer significant value in supporting learners and advancing learning science. A rationale and recommendations for this approach arise from my critique of pseudo-random controlled trials.

There is no real need to document that a great many who work in the field of education, as well as numerous commentators discussing the field of education, believe that software technologies can improve learning. Among these wide-ranging beliefs, causal factors that might account for improvements in learning are numerous. For example, recently in this journal Deschryver (2014) claimed that “ongoing rapid technological innovation, particularly as it relates to the web, is changing how humans interact with information” (p. 1). In support, he cited Selwyn (2008), who reported that 90% of undergraduates access information on the Internet in their programs of study. Can mere access to the Internet using a web browser and search engine improve learning? The question is interesting because such a practice entirely lacks an instructional design—neither web browsers nor search engines are designed to enhance what anyone learns when searching for information on the Internet.

At the opposite end of the software technology spectrum lie intelligent tutoring systems. Here, instructional design that adapts to learners is paramount. Four features are usually ascribed to intelligent tutoring systems: an interface allowing input from the learner and reporting output from the system, a model of the subject matter the learner studies, a model of what the learner knows (or probably knows), and a model of how the learner can be tutored to build knowledge of the subject studied (Sottilare, Graesser, Hu, & Holden, 2013). Recent meta-analyses (Ma, Adesope, Nesbit, & Liu, 2014; Steenbergen-Hu & Cooper, 2013) reported generally positive although modest benefits when students study using an intelligent tutoring system.

I recommend a third approach to using software technologies in education (Winne, 1992, 2006). It emanates from a belief that individuals will persist with change when they individually perceive that benefit is associated with that change, as well as a value that we should not disregard an individual’s learning trajectory in favor of a group’s gain. To warrant the approach I recommend, I take some care to describe why the current gold-standard methodology for identifying factors that improve learning—the random controlled trial (RCT)—is not a valid guide to improving an individual’s learning. This critique sets the stage for a proposal to adopt a different approach that uses software technologies as tools for gathering and analyzing extensive and highly detailed information—big data—about learning and to distribute learning analytics grounded in empirically guided partitions of that big data. I conjecture that such big data and its appropriate analyses can aid learners to make adaptations—to self-regulate learning—that would otherwise elude them or that adaptive technologies are not designed to make. To illustrate how such a system might be fashioned, I briefly describe software called nStudy. I conclude with several bold propositions about how to improve education and about how learners can become partners in this mission by donning the mantle of a self-serving learning scientist, i.e., a productively self-regulating learner.

Advances in science and in democracy rely on spirited, well-founded, and public debate. My further goal is to spark wide-ranging discourse about how learning science might be carried out more robustly in everyday learning settings and about potential fusions of software technologies, learning science, and big data.


Whatever the causal factors that improve learning may be, the received view, as espoused by the U.S. Institute of Education Sciences’ What Works Clearinghouse, is that those factors are best identified by carrying out randomized controlled trials, also known as true experiments. According to the What Works Clearinghouse Procedures and Standards Handbook, Version 3.0 (n.d.), the definitive attribute of an RCT is that “the unit that is assigned (for example, study participants, schools, etc.) must have been placed into each study condition through random assignment or a process that was functionally random” (p. 9). Virtues claimed for RCTs are widely published (e.g., What Works Clearinghouse, n.d.). Diverging from the received view, I argue that such experiments provide weak warrants for any particular instructor or any individual learner to adopt or participate in an intervention found effective in an RCT or a meta-analysis of multiple RCTs. Four issues underlie my claim.


A single RCT offers a fragile foundation for predicting what will happen if the intervention is replicated. I acknowledge that there is woefully little empirical evidence to validate this claim because replications are rare in the published literature. Disfavor for publishing replications prevails even though replications of an experiment are very highly valued in science (and by me). New, yet fragile, findings are preferred over investigations of whether a particular finding is robust. Fortunately, calls for publishing reports of high quality replications have recently been issued (Holcombe, 2013) and supported by the Association for Psychological Science (n.d.).

It is possible to estimate mathematically the range of the effect that might be observed if a particular RCT were perfectly replicated with the same sample that participated in the original study. First, some assumptions need to be made. I adopt Stanley and Spence’s (2014) approach.

The true effect of an intervention in the population from which an RCT’s sample is randomly drawn can never be known (unless the study involves the entire population). Therefore, I first assume the true value of an effect size in the population. Expressed as a correlation, the true effect size might plausibly range from 0.10 ≤ ρ ≤ 0.40. Expressed using Cohen’s d, this range is 0.20 ≤ d ≤ 0.87. In other words, compared to the comparison group, whose mean individual score is arbitrarily set at the 50th percentile, the intervention in this imagined RCT elevates the treatment group’s mean achievement score to somewhere between the 58th percentile and the 81st percentile. Second, I assume that the psychometric reliability of the achievement measure is 0.70 in this RCT’s sample.

Now an experimenter carries out a real RCT. Suppose that this experiment yields an effect size of r = 0.30, equivalent to d = 0.63. In other words, the treatment group’s mean score was elevated from the 50th percentile (the comparison group’s position) to the 74th percentile. Under the assumptions just made, we can ask: What is the expected range of the effect if the same sample participates in a replication, assuming zero carryover effect? A 95% confidence interval—a replication interval—given this observed effect size spans r = –0.33 (d = –0.70, 24th percentile) to r = 0.65 (d = 1.71, 96th percentile). The replication interval widens, but its boundaries cannot be estimated if a new random sample drawn from the same population experiences the exact same intervention. Unfortunately, a widened replication interval is inevitable. Samples must change when an intervention is disseminated, and the assumption of zero carryover effect is untenable if the outcome variable of the original RCT was achievement (because learning is defined to reflect a relatively permanent change in knowledge, skill, motivation, or attitude).
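The effect-size conversions and percentile equivalents used in this thought experiment can be reproduced with a short script. This is my own illustrative sketch, not the author's code; the function names are mine, and the r-to-d formula assumes equal group sizes and normally distributed outcomes:

```python
# Sketch (my illustration): converting between the correlation (r) and
# Cohen's d metrics used in the thought experiment, and mapping d onto the
# percentile of the comparison group's distribution.
import math

def r_to_d(r):
    """Convert a point-biserial r to Cohen's d (equal group sizes assumed)."""
    return 2 * r / math.sqrt(1 - r ** 2)

def d_to_percentile(d):
    """Percentile of the treated mean within the comparison distribution, Phi(d)."""
    return 100 * 0.5 * (1 + math.erf(d / math.sqrt(2)))

for r in (0.10, 0.30, 0.40):
    d = r_to_d(r)
    print(f"r = {r:.2f}  ->  d = {d:.2f}, ~{d_to_percentile(d):.0f}th percentile")
```

Running this recovers the figures cited above: r = 0.10 corresponds to d ≈ 0.20 (58th percentile), r = 0.30 to d ≈ 0.63 (74th), and r = 0.40 to d ≈ 0.87 (81st).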

What does this thought experiment imply? Expecting a benefit from adopting an intervention is quite chancy if the finding from an RCT is put into practice. While there is a chance of benefit, there is also a chance of no benefit, and even a chance of harm. This unpredictability is inescapable. It is inherent in the methodological logic underlying RCTs.


Meta-analyses offer two significant benefits over any single RCT. First, because multiple RCTs are examined, the central tendency and the variability of a treatment’s effect size can be empirically estimated across multiple samples under modest to moderate variation in the circumstances within which the treatment is provided. However, this entails generalizing over—i.e., ignoring—many factors that differentiate individual studies. Examples of ignored factors might be opportunity to learn relevant prior knowledge before the study; characteristics of self-regulated learning; in situ versus controlled field versus a laboratory setting; or researcher-generated materials versus a curriculum series adopted by a district. Any one or more of these, as well as hosts of other factors, may moderate the size, expected variability, and even the direction of an effect.

The second benefit of meta-analyses addresses this weakness. Because meta-analyses synthesize multiple studies, this affords an opportunity to identify and investigate whether some particular variables are moderators of an intervention’s effect. Individual studies with smallish sample sizes cannot practically investigate many moderators because power to statistically detect an intervention’s effects would be excessively diminished. Meta-analyses can, to a degree, overcome this challenge.

The advantage meta-analyses gain in examining moderators is, however, an insufficient remedy. No matter how a set of studies in a meta-analysis is carved into subsets to investigate moderator variables, the resulting effect size for a particular moderator variable is, by the argument developed about a mean in the preceding section, a chancy estimate of what will happen if an intervention is replicated. Again, this is especially so if the replication of a meta-analysis involves a new sample. Meta-analytic findings do not escape or even lessen that challenge.

Furthermore, analyses that investigate the impact of any particular moderator variable on a treatment’s effect are necessarily incomplete. Because individual studies in a meta-analysis each include only one or a very few moderator variables, it is not possible in a meta-analysis to simultaneously examine all potential combinations of all moderator variables using all studies. Claims about the influence of a particular moderator variable on an intervention’s effect are confounded because other moderator variables that are statistically identified in the meta-analysis do not figure into the analysis of that particular moderator variable. In technical terms, the model of an effect moderated by a particular factor is examined using a poorly formed fractional factorial design, i.e., a design that confounds some interaction effects with main effects (as opposed to a fully factorial design that allows every main and every interaction effect to be statistically examined uniquely). The model predicting an effect that involves any single moderator is misspecified because it omits other moderators that the meta-analysis demonstrates have impact on a treatment’s effect. This very likely biases the estimate of the magnitude and the variance of each individual moderated effect (Rao, 1971).
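A toy simulation can make the left-out-variable problem concrete. The sketch below is my own illustration, not drawn from any cited meta-analysis; the moderator names and effect sizes are invented. When two correlated moderators both shift an effect but only one is analyzed, the one-at-a-time estimate absorbs part of the other's influence:

```python
# Sketch (my illustration of Rao's "left out variable" misspecification):
# two correlated moderators each add 0.10 to an effect, but a one-at-a-time
# analysis of the first moderator yields a biased estimate of its influence.
import random
import statistics

random.seed(2)
rows = []
for _ in range(5000):
    lab = random.random() < 0.5  # moderator 1: laboratory vs field setting
    # moderator 2 (researcher-generated materials) correlates with moderator 1
    researcher_materials = random.random() < (0.8 if lab else 0.2)
    effect = 0.30 + 0.10 * lab + 0.10 * researcher_materials
    rows.append((lab, effect))

# Analyzing "lab vs field" alone confounds the two moderators:
lab_mean = statistics.mean(e for l, e in rows if l)
field_mean = statistics.mean(e for l, e in rows if not l)
print(f"apparent lab-vs-field gap: {lab_mean - field_mean:.3f}; true direct gap: 0.100")
```

Under these invented parameters the apparent gap is about 0.16, well above the true direct contribution of 0.10, because the omitted moderator rides along with the analyzed one.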


The normal curve of scores for members of a population and the notion of randomly sampling from a population are beguiling abstractions. Learning science and the What Works Clearinghouse hold these in very high regard. As advantageous as these concepts are in theory, neither is particularly helpful when one attempts to describe, for any particular individual, a probable causal effect of a treatment that was statistically identified in an RCT.

Effects in RCTs are measured only for a group. When individuals are randomly sampled from a well-defined population, this is made clear by the standard decomposition of a score: Xij for an arbitrary individual learner, represented by the subscript i, in an arbitrary group, represented by the subscript j. As set out in Equation 1, Xij comprises three terms. The term m represents influences on the learner’s score that arise because s/he is a member of a particular population; every learner in the population has exactly the same value of m. The effect of a treatment is symbolized by tj. It is preceded by a coefficient b that is often not written; in the simplest case, b has a value of 1 for learners who experience the treatment and 0 for learners who do not. In other words, learners in the group that does not experience the treatment are assigned a zero value for that treatment’s (potential) effect; because b = 0, b × tj = 0. Because the symbol tj has no subscript i (referring to an individual), the treatment effect is theoretically identical for every learner in the treatment group j. In other words, there is zero variance for the treatment effect across learners within the treatment group. Finally, the term eij represents the aggregate contribution of a very large number of unknown random factors. This is the only term in the model of each learner’s score that reflects a learner’s individuality.

Xij = m + btj + eij


When a mean is calculated over all the learners in a group, the average value of all the individual learners’ eij components is theoretically expected to be zero. For every learner in the treatment group, the mean is completely and only determined by the terms symbolized by m and btj in Equation 1. Any deviation is due to unidentified random factors and is ignored in measuring the size of a treatment’s effect. If this is not the case, the model is biased.

This approach to identifying treatment effects in RCTs has a significant logical consequence: For any individual who experiences the treatment, it is impossible to predict that individual’s score, even if one knows the treatment group’s mean. All that can be predicted is that, with 99.7% probability, an individual’s score lies within a span of ±3 standard deviations surrounding the group’s mean. Odds are just 1 in 2, or 50%, that the individual will score on the upper side of the mean. These are poor betting odds for any particular individual to gamble on a treatment. The mean of a treatment group poorly forecasts what any individual learner can expect to happen by experiencing the treatment.
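The score model in Equation 1 and its consequence for individual prediction can be illustrated with a small Monte Carlo simulation. This is my own sketch; the parameter values (m, tj, and the spread of eij) are arbitrary:

```python
# Sketch (my illustration under Equation 1's assumptions): simulate scores
# X_ij = m + b*t_j + e_ij and show that while the group mean recovers the
# treatment effect t_j, roughly half of treated individuals still fall
# below their own group's mean.
import random
import statistics

random.seed(1)
m, t = 100.0, 5.0   # population term and (hypothetical) treatment effect
sigma_e = 15.0      # spread of the aggregated random factors e_ij

treated = [m + 1 * t + random.gauss(0, sigma_e) for _ in range(10_000)]  # b = 1
control = [m + 0 * t + random.gauss(0, sigma_e) for _ in range(10_000)]  # b = 0

t_mean = statistics.mean(treated)
mean_diff = t_mean - statistics.mean(control)
below = sum(x < t_mean for x in treated) / len(treated)
print(f"estimated effect ~ {mean_diff:.1f}; share of treated below own mean ~ {below:.2f}")
```

The simulated group difference hovers near the built-in effect of 5, yet about half of the treated learners score below the treated group's mean, exactly the 50:50 gamble described above.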


The Publication Manual of the American Psychological Association (2010) recommends that researchers “Describe the sample adequately . . . [because] appropriate identification of research participants is critical to the science and the practice of psychology, particularly for generalizing the findings, making comparisons across replications, and using the evidence in research syntheses and secondary data analyses” (p. 29). Technically, each factor named to identify a population is theorized or, better, empirically proven to make a causal contribution to determining the value of m in Equation 1. Symbolically using numeric subscripts to mark each factor, Equation 1 can be re-expressed as Equation 2.

Xij = (m1 + m2 + m3 + m4 + . . .) + btj + eij


As mentioned in the section on moderator variables, modeling these variables’ influences on scores is biased when factors that genuinely affect the outcome score, Xij, are left out of the model and when factors that do not influence the outcome score are included in the model. Rao (1971, p. 37) labeled these kinds of model misspecification the “left out variable” and the “irrelevant variable,” respectively.

Factors that researchers use to identify a population must have variance. If they do not, there is no reason to identify them. Some participants are female, some are male; participants represent various ethnic or racial groups; some participants receive funding for meals, while others do not. Factors that researchers name in defining the population to which they intend to generalize are rarely used as moderator variables in analyses of outcome scores. Because the researcher identifies these factors as relevant—or is required to identify them, per the Publication Manual—it follows that a theoretical account is needed that explains how the factor causes variation in the outcome. In addition, citations of previous empirical studies are needed to demonstrate that this factor needs attention in a current study. In practice, neither is typically provided. Moreover, if a researcher’s sample included such a factor that varied among participants—e.g., if some in the sample were female and others male—it would be prudent to involve that factor in analyses of data. Again, this is rare. In short, samples are described nearly universally using factors that are convenient or politically interesting at the group level rather than because they have been scientifically demonstrated to account for variance in the outcome variable. The consequence is that features that genuinely matter in setting the value of the population’s m in Equation 2 are mostly unknown, and some features that set a population apart from another population are irrelevant. Thus, determining whether anyone in particular belongs to the population is guesswork.

An additional issue arises due to the common practice by which samples are formed. According to rules of statistical inference, a sample must be randomly drawn from a well-identified population. In the vast majority of studies, however, individuals in samples of convenience—not randomly sampled from a population—are randomly assigned to groups. This undermines the opportunity for random assignment to reduce bias in modeling outcome scores. As a result, an RCT is almost always a pseudo-random controlled trial (P-RCT).

Together, these practices lead to an incomplete, potentially misleading, and statistically misspecified model of the population to which findings observed in a P-RCT might apply. An individual will be greatly challenged to judge whether she or he is a member of the population to which the finding of a P-RCT is inferred to apply.


Four challenges cumulatively make it quite hazardous to predict what will happen to a particular learner who experiences a treatment that was identified as having a statistically detectable effect in a P-RCT or a meta-analysis of multiple P-RCTs. First, there is considerable wobble in groups’ means when replicating a true RCT. Effects at the group level may strengthen, weaken, disappear, or even reverse. Unfortunately, it cannot be predicted which is more likely. P-RCTs exacerbate this problem. Second, meta-analyses almost always identify moderator variables that adjust treatment effects. One-at-a-time moderator adjustments oversimplify the mixture of moderator variables that are enfolded in any particular effect and in any particular learner’s profile. Coupled with the wobble of means on replication, this further erodes confidence in predicting the effects of a treatment for an individual learner. Third, the logic of statistics offers an unattractive gamble. The mean is not representative of any individual learner, and odds are 50:50 that any learner should expect to score on one or the other side of the group’s mean. Fourth, practical but scientifically disagreeable practices for identifying populations and randomly sampling from them further undermine the validity of generalizing an effect to any particular learner. The cumulative result of these four issues is that a particular learner stands on very, very soft ground when predicting what will happen if a treatment is adopted or experienced when studying different content under different conditions.

I emphasize that these challenges do not necessarily lead to a conclusion that P-RCTs and meta-analyses of P-RCTs are useless. In every form of research I know about, inference has an inherent degree of uncertainty, sources of variance cannot be completely identified, and reference groups are indefinite. The benefit of research carried out as a P-RCT or a meta-analysis of several P-RCTs is a hypothesis about how learning may be influenced. I leverage this property subsequently.


What might be changed about how data are gathered, analyzed, and interpreted to offer better guidance to each learner about how to learn better? What about P-RCTs could be remedied?

P-RCTs dump many factors that influence learning into a so-called “error” or (better label) residual term, and, by design, researchers must ignore those factors when recommending a treatment to an individual learner. Those factors are important in the individual case. They may cause an individual’s score to deviate from a population’s or a sample’s mean. They represent a learner’s individuality. Data should be gathered and analyzed in a manner that, as much as possible, recognizes and leverages these factors rather than ignoring them.

Random sampling, a bedrock assumption of statistical models used to analyze data gathered in P-RCTs, is almost never achieved. The elasticity of the validity of statistical inferences when data are sampled nonrandomly, i.e., with bias, is unspecified. Analyses of data should, as much as possible, be freed from random sampling as a requirement for drawing evidence-based inferences.

There is typically meager or no evidence that the factors commonly recommended and used to identify a sample of learners who participate in a P-RCT cause differences in outcomes. Moreover, factors such as age, grade level, sex, and so forth are convenient but distal proxies for genuine and proximal causal factors. Factors used to identify a learner who participates in research should be tested to provide evidence that they influence achievement and provide clarity about whether a learner is a member of the population in which the results of research were observed.

Treatments that researchers operationalize in P-RCTs are carefully implemented and shielded from hosts of factors that could theoretically affect the outcome measured in an experiment. A learner studying in everyday circumstances is unlikely to ever replicate such highly controlled treatments. Treatments that learners can operationalize should be preferred.

P-RCTs typically gather scanty or no evidence about what learners do in the process of learning. Data should be gathered to trace which operations learners apply and to which content those operations are applied. Providing this information to learners and their instructors allows them to monitor how learning unfolds. Knowing what the treatment actually was is paramount for linking it to outcomes. Such information is also essential if researchers are to open the black box of their theories and more strongly connect theoretically key cognitive and motivational factors to what can be observed about how learners learn (Winne, 1982).


Suppose that all of the learners in a class, school, district, or state/province use software technology for a large portion of their learning activities in school and while doing homework. Content that they study—text, graphical material, tables, audio and video content, games, simulations, and so forth—is presented within a web browser or similar delivery envelope. As learners carry out their everyday activities to study, their operations on content—viewing a web page, bookmarking a video, reviewing an image or a note, highlighting text on a web page, copying text from a course and pasting it into the draft of an essay, annotating a time mark in a video, searching, and so forth—are fully recorded with a time stamp. Learners can organize content on which they operate by placing artifacts representing that content—bookmarked web pages, quotes, notes, essays, and so forth—in folders and by tagging each artifact: e.g., review this, explore, evidence? A text (actually, hypertext) editor is available for drafting compositions, lab reports, business plans, poems, arguments, and so forth. Artifacts that learners construct—highlighted content and notes, for example—can be dropped or pasted into their compositions. A discussion tool is available. Learners can form groups on their own as well as participate in groups their instructor creates. Texts that they contribute to threads in discussions are recorded and reusable in notes and compositions. The discussion tool also allows learners to share artifacts.

As learners use this software and its tools, data about their learning activities are generated without imposing requirements beyond using the features that the software affords. In essence, the stream of data that this software logs can be played back to portray a complete record of everything that can be observed about what and how the learner studied. If the software provides forms for structuring content in notes (described later in the section on nStudy), a form can be offered for learners to record text or audio as typed/think-aloud data. Analyses of these kinds of data using state-of-the-art techniques in data mining and process mining (see, e.g., Roll & Winne, 2015; Winne & Baker, 2013) set the stage for detailed learning analytics about learning as learners actually carried it out.
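A minimal sketch of the kind of time-stamped trace log such software could keep might look like the following. The event names and record fields here are hypothetical illustrations, not nStudy's actual schema:

```python
# Sketch (hypothetical schema, my own): each study operation is appended to a
# trace log as a time-stamped record of the operation and the content it
# was applied to, so the full stream can later be replayed or mined.
import json
import time

trace_log = []

def log_event(learner_id, operation, content):
    """Append one time-stamped study operation to the trace log."""
    trace_log.append({
        "t": time.time(),        # when the operation occurred
        "learner": learner_id,
        "operation": operation,  # e.g., "highlight", "note", "bookmark"
        "content": content,      # the information the operation was applied to
    })

log_event("s042", "highlight", "mitochondria are the site of cellular respiration")
log_event("s042", "note", "review this before the quiz")
log_event("s042", "bookmark", "https://example.org/cell-biology")

# Emit the stream as JSON lines for downstream data or process mining.
for event in trace_log:
    print(json.dumps(event))
```

Because every record carries both the operation and its target content, a playback of the log reconstructs everything observable about what and how the learner studied.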

Data gathered using this kind of software exemplify the 7 Vs of big data (van Rijmenam, 2013):

Velocity refers to the rate at which data are generated. The software I describe traces every operation every learner performs and the content on which each learner performs the operations. While data at the individual level are generated only as fast as a learner works, as tens to hundreds to thousands of learners study, the velocity of data generation is orders of magnitude greater than in a set of P-RCTs examined in a meta-analysis.

Volume is the quantity of data. Typical P-RCTs may generate several codes describing sample characteristics of each participant, five to 20 subscale scores on various self-report instruments, and, perhaps, 50 codes for responses to items on achievement tests. The software system that I propose gathers these data as well. Of greater value for learning science is the opportunity to discover what might be changed to help learners go about learning more effectively. This value is gained because the software logs a time-stamped record of every operation a learner performs and the content operated on each time the learner uses a tool in the software: every highlight operation and the text highlighted, every word added to every note created and then edited during review, every web page bookmarked and when it is viewed and then reviewed, every contribution to a discussion and the peers to whom it was contributed, and so on. While it is difficult to estimate, the volume of data gathered by software may be two to ten orders of magnitude—hundreds to billions—greater than the volume of data gathered in a collection of P-RCTs examined in a meta-analysis.

Variety describes the diversity of formats for data. In this respect, software has a subtle advantage relative to typical P-RCTs. All the formats for data that can be gathered in software can be gathered in a P-RCT. What gives software an advantage as a research instrument is that many kinds of data can be precoded or automatically scored and made ready for analysis almost instantaneously. This markedly enhances opportunities to analyze data compared to a group of P-RCTs that are carried out over a period of years.

Veracity is the accuracy of data. Compared to humans gathering and coding data, software has a clear advantage: The only mistakes and oversights software makes are ones that people make in designing and building the software. After those errors are identified and corrected, for all intents and purposes, data are perfectly accurate when gathered and coded by software.

Variability concerns changes in the interpretation of a kind of data. This concept is subtle. Consider a highlighting operation applied to a technical term in text, e.g., “antagonist.” In one location, the highlight may suggest that the learner is identifying and rehearsing the term’s definition. Elsewhere in the text, a superficially identical highlighting operation may suggest that the learner is comparing the concepts antagonist and protagonist. Software affords this variability of interpretation and facilitates considering variability when compared to the labor-intensive methods available for investigating variability in a P-RCT.

Visualization refers to a display of raw or analyzed data in a form that minimizes the training needed to interpret the visualization. In this respect, P-RCTs and software can be considered indistinguishable.

Value refers to the cost and efficiency of gathering and processing data relative to the utility of what can be discovered or verified with the data. The pace at which software can gather and process data and distribute results of analyses of data hugely exceeds that of a single P-RCT or set of P-RCTs carried out over a span of years and examined after that in a meta-analysis. At the same time that an instructor views information about learners’ studying activities and achievements, that information (likely reformatted to suit) can be relayed to and individualized for each learner. This can enhance instructors’ sense of learners, learners’ grounds for self-regulating learning, and researchers’ opportunities to pursue learning science.

What might be accomplished if big data about every learner’s learning every time they study every subject were generated by software such as I propose? To answer, I return to my analysis of P-RCTs and add other advantages of software like this as a medium for learning and for researching learning.

First, if all learners use this software as their venue for learning, the sample is the population. Concerns about pseudo-randomly sampling from an inaptly defined population of convenience evaporate. When researchers or policy makers are interested in comparisons across groups, the machinery of inferential statistical techniques can be shed. Simple effect sizes and statistics describing variability are sufficient.

Second, the problem of makeshift fractional factorial designs for investigating moderator variables, an insurmountable challenge for individual P-RCTs and a considerable challenge for meta-analyses of P-RCTs, is vanquished. Every learner can be characterized in terms of every relevant proximal moderator. Importantly, moderators now extend to traces of operations learners use to learn plus qualities of those operations: frequency, patterning, and context. A much more extensive and theoretically revealing characterization of moderator variables becomes possible.

Third, to identify subpopulations of learners that are homogeneous at a point in time on a profile of multiple moderator variables, traditional methods such as clustering algorithms (see, e.g., Fahad et al., 2014) and newer methods such as process mining (Bannert, Reimann, & Sonnenberg, 2013; van der Aalst, 2014) can be applied to big data. A learner’s membership in a subpopulation can be identified using a very large pool of factors that form a multivariate profile. Distinguishable profiles based on multiple moderators can be explored for their correlation with achievement and other variables of interest, e.g., retention or rate of progress. Importantly, unlike a P-RCT in which membership in a treatment or control group must remain fixed to accommodate statistical analyses, big data afford the ability to dynamically identify subpopulations and a learner’s fit to subpopulations. Adjustments to membership can be applied as data accumulate over short time periods of hours and days. Neither learner, time, nor context needs to be static.
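As a sketch of how such subpopulation profiles might be identified, the following toy k-means groups hypothetical study-activity profiles. The features and values are invented for illustration; real analyses would use far richer trace-derived features and dedicated data- and process-mining tools:

```python
# Sketch (my illustration, not a production pipeline): cluster learners'
# multivariate study profiles into provisional subpopulations with a
# minimal k-means implemented from scratch.
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Tiny k-means: returns final centers and the grouped points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # assign each profile to its nearest center
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        centers = [  # recompute each center as its group's mean profile
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

# Hypothetical profiles: (notes per hour, highlights per page, reviews per week)
profiles = [(0.5, 1, 0), (0.7, 2, 1), (6.0, 9, 4),
            (5.5, 8, 5), (0.4, 1, 1), (6.2, 10, 4)]
centers, groups = kmeans(profiles, k=2)
print("subpopulation sizes:", sorted(len(g) for g in groups))
```

Here the six invented profiles separate into two subpopulations of three learners each, e.g., infrequent versus intensive annotators; with streaming trace data, the same procedure could be rerun daily so that membership stays dynamic rather than fixed.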

Fourth, big data can be gathered iteratively—for a single learner, over time and ranging over contexts created by studying different subjects with different mentors using different media at different times of day in different groupings with peers, and so forth. This provides a much finer grained and fuller account of learning, fashioned as a learning trajectory rather than a point-in-time outcome. What is examined about what learners learn can extend beyond responses to test items and items on self-report inventories to include tactics and strategies that they use in learning, traced by software.

Fifth, novel treatments can be examined using a contextualized evidentiary base. As an overly simple example, software can identify a schedule for reviewing content in grade 10 biology that strongly predicts achievement among learners who take few notes. Replicating a test of that schedule is as easy as distributing a recommendation over the Internet to other learners through the software they use to study. In this sense, instructional designs that arise naturally in a Darwinian ecology of learning can be identified, replicated, verified regarding treatment implementation, tested for effects, and evolved in two ways: recommending that self-regulating learners use favorable adaptations observed in the ecology, and broadcasting carefully selected but untested recommendations for learning that learning scientists design based on their analyses of data generated in the ecology. Beyond publishing interventions to progressively mutate learning activities, the software monitors each learner’s uptake of these interventions. This sets the stage for providing formative feedback that can help each learner transition from prior study strategies to updated ones. The uptake and validity of each discrete treatment implementation can be thoroughly mapped, and, over time, the evolution of learning can be charted.

Compared to P-RCTs and meta-analyses of them, the key advantage afforded by this ecology is the opportunity to transform everyday schooling and studying outside the classroom into thoroughly documented, naturalistic, and progressively responsive experiments. The ecology is a huge laboratory for easy-to-do, constant experimentation in the service of improving learning and advancing learning science. This contrasts with the rather long delay between identifying an effect in a novel P-RCT, awaiting replications and variations of that study by other researchers, completing a meta-analysis of that slowly accumulating literature, and, finally, disseminating information to instructors and then to learners. A learning ecology supported by learning software and fueled by big data processed to generate up-to-date and individualized learning analytics can identify and rapidly—e.g., daily—disseminate tailored and empirically grounded best practices to learners and their instructors.

In this new ecology, learners’ self-interest—be it to learn, or to earn good grades, or to “satisfice” to allow time for sports or music practice—equates to an easy-to-fulfill commitment to generate data that can advance learning science. This removes factors that commonly impede research in the genre of P-RCTs, e.g., time lost from required curricula, need for special equipment beyond a computer and the Internet, developing unique materials, supervision of treatment implementation to assure conformity to an experimental protocol, and so forth. As well, learners have full opportunity to exercise self-regulated learning through options that they are provided describing new approaches to learning. Whatever their choices may be, they continuously contribute, with minimal imposition, data that widen, deepen, and sharpen learning science.


The software nStudy (version 3) is an extension developed for the Google Chrome web browser. It provides tools for learners to apply study tactics while they study web pages, PDF documents, and videos on the Internet. It was designed, in part, to address the black-box problem in learning science, i.e., the problem of strongly grounding, in observed behavior, inferences about learners’ unobservable cognitive operations and motivational states that theoretically determine what they learn (Winne, 1982). Trace data that nStudy logs enhance the validity of inferences about what goes on in the black box of learners’ minds as they learn.

Suppose that Emma is studying a web page about global warming using nStudy. She drags the cursor over this text: “Life on Earth is possible because of the warmth of the sun” (see Figure 1); nStudy automatically highlights that text. In a menu that nStudy automatically pops up whenever a learner selects text, Emma types “explore” to tag the selection. Then she continues studying.

Figure 1. Emma creates a quote in nStudy at http://www.davidsuzuki.org/issues/climate-change/science/climate-change-basics/climate-change-101-1/


As Emma uses nStudy’s features, the software logs attributes of these events as trace data. In this case, nStudy logged that Emma selected text located within a uniform resource locator (URL) for the web page. The text Emma highlighted was logged, and the tag she applied to index her selection was also logged and associated with that selection. Each trace event is time-stamped, with an accuracy of approximately 1/20th of a second.
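A trace log of this kind can be pictured as a sequence of time-stamped records. The field names and helper function below are assumptions for illustration, not nStudy's actual schema.

```python
import time

def log_trace_event(log, event_type, url, **attributes):
    """Append a time-stamped trace event to an in-memory log.
    A sketch of the kind of record a logging system might store."""
    event = {"type": event_type, "url": url, "timestamp": time.time()}
    event.update(attributes)
    log.append(event)
    return event

page = ("http://www.davidsuzuki.org/issues/climate-change/science/"
        "climate-change-basics/climate-change-101-1/")
log = []
log_trace_event(log, "select_text", page,
                text="Life on Earth is possible because of the warmth of the sun")
log_trace_event(log, "apply_tag", page, tag="explore")
print([event["type"] for event in log])
```

Ordering events by their time stamps is what later permits reconstructing the sequence and context of a learner's operations.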

These trace data enhance the validity of inferences about Emma’s cognition and motivation at this moment. It can be inferred that Emma was motivated to metacognitively monitor information in this text. Motivation, which accounts for what people choose to do, is traced by Emma’s behavior of selecting text. The standard Emma likely used in metacognitively monitoring the information she was reading is whether her background knowledge is incomplete or the materials at hand lack information she wants or believes she needs. The meaning of the tag “explore” strongly implies this. That Emma took initiative to tag the selection “explore” grounds an inference that she is motivated to fill a gap in or extend her knowledge. Emma has a plan to explore for information that may satisfy that goal in the future.

Trace data like these are improvements over self-report data that could be obtained by asking Emma, before or after this study session, various questions such as whether she actively monitors her background knowledge, how usual it is for her to plan to elaborate information, whether information in the text is satisfying, or whether she makes plans to explore for additional information. Self-report responses rely on memory, which is not always veridical (Winne, 2010; Winne & Perry, 2000). As well, nStudy’s trace data can be used to examine context, such as whether Emma immediately broke off studying about global warming to explore the topic of how life depends on the sun. Had she done that, nStudy’s log of URLs visited and time stamps identifying when they were accessed would trace a redirection of her focus.


The extension nStudy illustrates the kind of software described in the preceding section. As much as possible, nStudy’s design reuses methods for creating and manipulating information in software with which learners are already familiar, e.g., selecting text in a word processor or chat tool to do something with it. Where nStudy’s methods are novel, it often offers guidance. For example, in my experience, few learners title their notes. The default note form in nStudy provides a box in which to do that. Above the note’s text field, nStudy provides light gray text reading “untitled plain note . . . .” This gently prompts the learner to title each note. (My hypothesis to be tested is that categorizing information benefits learning.) As a learner begins entering a title, the prompt disappears and is replaced by the learner’s title.

Instructors and researchers can configure some features of nStudy’s tools as guides to learners’ studying. For example, labels of text fields and replacement text in those fields can be removed or adapted to suit the purposes of an assigned project or a research study. In a study where learners annotate evidence they use to build an argument, for instance, a researcher might want to introduce learners to the definition of a warrant, a concept not widely known.

When a learner uses one of nStudy’s tools, data are automatically logged in a database. These traces fully describe the learner’s observable studying activities. Gathering trace data is seamless: All that learners do is carry out everyday activities. The previously described scene in which Emma highlighted and tagged text is illustrative.

There are a variety of features in nStudy that learners can use while studying:

Bookmark: Bookmarks record the URL of every web site, PDF document, and video a learner views while logged into nStudy. A bookmark is automatically titled using information at the origin of that material; learners can retitle bookmarks as they wish. Then nStudy can filter the exhaustive history of URLs to show learners only those in which the learner operated on information, e.g., tagged selections of text or a frame in a video. Contrasting the complete history of URLs with those in which a learner operated on information, as indicated by traces, supports research comparing features of information a learner uses with features of information merely browsed.

Quote: Quotes are verbatim copies of text information a learner selects for highlighting, tagging, or annotating. Selecting text and choosing “Create Quote” from nStudy’s automatically generated popup menu creates a quote. The software marks a quote within the larger text by highlighting the selection and identifies the selection’s relative location in the browser’s scroll bar by a short, horizontal, colored band—a nub. Hovering the cursor over a nub displays an abbreviated version of the quote (its first 30 characters). Clicking that display scrolls to the highlighted text on the page, allowing review of the quoted information in context.

Note: Notes store a learner’s comments about and interpretations of content. Every note is a web form. Notes with a plain form have a title field that is initially filled with light gray replacement text reading “untitled plain note.” This prompts the learner to title these notes. Plain notes have a single text field for the note per se. A plain note form is the simplest of nStudy’s configurable forms. Adding other elements to a note form can prompt learners to metacognitively monitor content for particular kinds of information. For example, an Explain note form has a title field with the replacement text “untitled Explain note” plus four text fields labeled “Cause,” “Effect,” “Context,” and “Why” (see Figure 2). Various types of fields can be included in a note form: rich text, checkbox, radio button, slider, drop-down list, date, image, and associated artifacts as clickable links. Note forms can be configured by researchers or instructors; we are developing an editor so learners can design forms for themselves.

Learners create note artifacts by selecting text or clicking in a video, then choosing the form they want to use from a list of options in the menu nStudy pops up. A note is automatically linked to the content that prompted the learner to create it. Like quotes, notes are marked by a nub in the scroll bar of the browser or the video player’s scrub bar. Clicking the nub scrolls to its associated quote and opens the note for review or editing.

Figure 2. Emma creates a note in nStudy using the Explain note form at http://www.davidsuzuki.org/issues/climate-change/science/climate-change-basics/climate-change-101-1/


Term: Terms are a reserved note form learners can use to define key concepts. The note form for a term has a title field for the term itself, a definition field and a “See also” field where the learner or nStudy can link this term to conceptually related terms, notes, or bookmarks.

Termnet: Definitions of terms are often expressed using other terms. To represent this relational property of terms, nStudy automatically forms a graph of terms defined by the relationship “in terms of.” Nodes in a termnet display represent terms, the conceptual building blocks of a topic; edges represent in-terms-of relationships among terms. For each open URL or essay (see below), nStudy automatically identifies the subgraph within the overall termnet that represents terms appearing in that artifact. The learner can view this visualization to examine a term’s conceptual structure as represented by its definitional relations to other key concepts. Other kinds of relations, e.g., terms’ co-occurring in a sentence, can be used to form termnets displaying different information qualities of web pages and essays.
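The in-terms-of relation can be sketched as a small graph-building routine. The naive substring matching and sample definitions below are my illustrations, not nStudy's actual algorithm.

```python
def build_termnet(definitions):
    """Map each term to the other defined terms appearing in its definition
    (the 'in terms of' relation). Uses naive substring matching; a real
    system would handle word boundaries, inflections, and synonyms."""
    graph = {}
    for term, definition in definitions.items():
        text = definition.lower()
        graph[term] = sorted(t for t in definitions if t != term and t in text)
    return graph

# Hypothetical learner-authored definitions.
definitions = {
    "greenhouse gas": "a gas that traps heat in the atmosphere",
    "atmosphere": "the layer of gases surrounding Earth",
    "global warming": "a rise in temperature driven by greenhouse gas "
                      "buildup in the atmosphere",
}
termnet = build_termnet(definitions)
print(termnet["global warming"])
```

The adjacency structure returned here is the kind of graph from which a subgraph could be extracted for the terms appearing in any one web page or essay.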

Tag: Tags index information. Researchers and learners can create tags tailored to subjects (e.g., chemistry: acid, base, ionic bond, covalent bond), tasks (e.g., explore, review, summarize), or other purposes. Learners can use tags as a tool to collect all items tagged by a particular word or phrase (the tag). The name of any note’s form is automatically used as a tag. This allows all notes created using a particular schema to be quickly identified.
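Collecting every item that carries a given tag amounts to a simple filter. The item structure below is an assumption for illustration only.

```python
def items_with_tag(items, tag):
    """Return every artifact carrying a given tag. Each item is modeled
    as a dict with a 'tags' set; a simplification of however nStudy
    actually indexes artifacts."""
    return [item for item in items if tag in item["tags"]]

# Hypothetical artifacts; note that a note form's name ("Explain") is a tag.
items = [
    {"title": "warmth of the sun", "tags": {"explore", "quote"}},
    {"title": "CO2 sources", "tags": {"Explain", "note"}},
    {"title": "feedback loops", "tags": {"explore", "note"}},
]
print([item["title"] for item in items_with_tag(items, "explore")])
```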

Discussion: To support and trace how learners collaborate, nStudy provides tools for them to exchange free text and share any nStudy artifacts they have constructed. Discussions can be synchronous or asynchronous. Within the discussion interface, learners can select from a dropdown list to choose a role, e.g., summarizer, critic, or analyst. For each role, a collection of prompts is available in another dropdown list that guides the learner’s participation within that role. A critic’s prompts might include, e.g., “What is the evidence for that?” “Is that an overgeneralization?” “How reliable is that?” Clicking a prompt inserts it into a field where the learner can type contributions to be added to the discussion. The learner can elaborate the basic prompt before submitting it. As with bookmarked web pages, learners can annotate discussions with quotes, notes, and terms.

Essay: nStudy’s essay tool allows learners to compose richly formatted text to draft and edit into final form a term paper, lab report, business plan, and so forth. Common formatting features—e.g., font styles, paragraph styles, tables, and numbered and bulleted lists—are available in a toolbar. Notes, terms, and other nStudy artifacts can be quickly added to an essay. This encourages reuse of quotes, notes, terms, and segments of discussions and facilitates reviewing the context that prompted creating those items.

Map: A map is nStudy’s tool for constructing spatial displays of information as relations (lines) among items (nodes). Learners construct maps by adding existing artifacts and linking them at will or by creating artifacts directly within the map. Items in maps can be grouped to form a submap, represented as a folder. An item’s conceptual neighborhood can be shown at an arbitrary distance measured by the number of links traversed from a focal artifact to artifacts removed by n steps from the focal item.
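The n-step conceptual neighborhood just described amounts to a bounded breadth-first search over the map's links. The following sketch uses hypothetical artifact names.

```python
from collections import deque

def neighborhood(links, focal, n):
    """All items reachable from a focal artifact within n link traversals,
    found by breadth-first search over an undirected map of (a, b) links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {focal}
    frontier = deque([(focal, 0)])
    while frontier:
        item, dist = frontier.popleft()
        if dist == n:
            continue  # Do not expand beyond the requested distance.
        for neighbor in adjacency.get(item, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen - {focal}

# Hypothetical artifacts linked in a learner's map.
links = [("note:causes", "term:greenhouse gas"),
         ("term:greenhouse gas", "quote:sun"),
         ("quote:sun", "bookmark:climate-101")]
print(sorted(neighborhood(links, "note:causes", 2)))
```

Varying n widens or narrows the displayed neighborhood around the focal artifact.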

Search: In nStudy’s map view, all of a learner’s nStudy contents—notes, terms, documents, and so forth—are listed in a panel to the left of the map. The list can be filtered by entering text in a search field, removing items from the list that do not satisfy the search. For example, entering “uranium” filters out all items that do not contain that word. Entering metadata such as a tag removes all items except those with that tag. A more advanced search feature is being developed to provide learners the opportunity to search using other metadata. For example, to structure a study session focusing on reviewing content, a learner might search for notes not opened for a week or more that contain a term for which the learner judges knowledge to be incomplete.

Learning analytics: A feature currently being developed is a tool for learners (and researchers) to investigate studying by analyzing trace data that a learner generates. For example, a termnet might be displayed so that the diameter of the nodes representing terms is proportional to the number of times each term has been used in notes. This might suggest to learners how to apportion time to studying terms—those used often in notes might not need much more study. A log book to keep track of whether such hypotheses are supported by future data can guide productive self-regulated learning. Learning analytics that help learners productively self-regulate learning are a focus for future research.
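The example analytic, scaling a term's node by how often the term is used in notes, reduces to counting occurrences. The notes and terms below are invented for illustration.

```python
from collections import Counter

def term_usage(notes, terms):
    """Count how often each defined term appears across a learner's notes;
    a termnet display could scale node diameters by these counts.
    Illustrative of one possible analytic, not nStudy's implementation."""
    counts = Counter()
    for note in notes:
        text = note.lower()
        for term in terms:
            counts[term] += text.count(term)
    return counts

notes = ["Greenhouse gas levels drive global warming.",
         "The atmosphere traps heat; greenhouse gas buildup amplifies this.",
         "A greenhouse gas absorbs and re-emits infrared radiation.",
         "Review global warming basics."]
usage = term_usage(notes, ["greenhouse gas", "global warming", "atmosphere"])
print(usage.most_common(1))
```

A learner inspecting such counts might allocate less review time to heavily used terms and more to rarely used ones, then log whether that hypothesis pays off.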


I argue that software technologies like nStudy offer significant value in supporting learners and advancing learning science. A rationale and recommendations for this approach arise from my critique of pseudo-random controlled trials (P-RCTs). Using software like nStudy to gather fine-grained, time-stamped data can represent, within the wide limits of what learners can do when they use software, how learners operate on information to learn it and which information is the subject of which operation. Those data and the relations they bear to achievement are keys to improving learners’ self-regulated learning and instruction in ways that parallel my argument about the case of process–product research on teaching effectiveness (Winne, 1987). As well, the scope of observation and the pace of investigation afforded by software set a new stage for exploring how to support productive self-regulated learning over much longer spans of time and much more varied subject areas than have been studied so far in learning science. Because of the precision and fullness of data about how learners go about learning, opportunities to replicate and adapt treatments are rendered practically simple because operational definitions of treatments are thoroughly represented by trace data. What makes this kind of software-supported system even more appealing is that experimentation to improve learning becomes ubiquitous. Experimentation is intrinsic to an ecology populated by learners accessing authentic content for authentic purposes.

Adopting this kind of big data approach requires all learners to use software extensively in their academic work. At today’s price points for computing hardware and electronic content, this is now practical or very soon will be. The gains I forecast for learning and for learning science may be appealing enough to warrant up-front investment even where modest to moderate real costs remain.


Acknowledgments

This work was supported by grants to Philip H. Winne from the Canada Research Chairs Program and the Social Sciences and Humanities Research Council of Canada (SRG 410-2011-0727). Thanks to Zahia Marzouk, John Nesbit, Ilana Ram, Donya Samadi, Jason Stewart, Derra Truscott, and Jovita Vytasek for thoughtful comments on a draft of this article.


References

American Psychological Association. (2010). Publication manual of the American Psychological Association (6th ed.). Washington, DC: Author.

Association for Psychological Science. (n.d.). Registered Replication Reports [Web page]. Retrieved from http://www.psychologicalscience.org/publications/replication

Bannert, M., Reimann, P., & Sonnenberg, C. (2013). Process mining techniques for analysing patterns and strategies in students’ self-regulated learning. Metacognition Learning, 9(2), 161–185. doi:10.1007/s11409-013-9107-6

Deschryver, M. (2014). Higher order thinking in an online world: Toward a theory of web-mediated knowledge synthesis. Teachers College Record, 116(12), 1–44.

Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., & Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267–279. doi:10.1109/tetc.2014.2330519

Holcombe, A. (2013, March 3). Registered Replication Reports are open for submissions! [Web log post]. Retrieved from https://alexholcombe.wordpress.com/2013/03/03/registered-replication-reports-are-open-for-submissions/

Ma, W., Adesope, O. O., Nesbit, J. C., & Liu, Q. (2014). Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4), 901–918.

Rao, P. (1971). Some notes on misspecification in multiple regressions. The American Statistician, 25(5), 37–39.

Roll, I., & Winne, P. H. (2015). Understanding, evaluating, and supporting self-regulated learning using learning analytics. Journal of Learning Analytics, 2(1), 7–12.

Selwyn, N. (2008). An investigation of differences in undergraduates’ academic use of the internet. Active Learning in Higher Education, 9(11), 11–22.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2001). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Sottilare, S., Graesser, A., Hu, X., & Holden, H. (Eds.). (2013). Design recommendations for intelligent tutoring systems. Orlando, FL: U.S. Army Research Laboratory.

Stanley, D. J., & Spence, J. R. (2014). Expectations for replications: Are yours realistic? Perspectives on Psychological Science, 9(3), 305–318.

Steenbergen-Hu, S., & Cooper, H. (2013). A meta-analysis of the effectiveness of intelligent tutoring systems on K–12 students’ mathematical learning. Journal of Educational Psychology, 105(4), 970–987.

van der Aalst, W. M. P. (2014). Process mining in the large: A tutorial. In E. Zimányi (Ed.), Business intelligence: Third European summer school, eBISS 2013, Dagstuhl Castle, Germany, July 7–12, 2013, tutorial lectures (pp. 33–76). Basel, Switzerland: Springer.

van Rijmenam, M. (2013, August 6). Why the 3V’s are not sufficient to describe big data [Web log post]. Retrieved from https://datafloq.com/read/3vs-sufficient-describe-big-data/166

What Works Clearinghouse. (n.d.). Procedures and standards handbook (Version 3.0). Retrieved from https://ies.ed.gov/ncee/wwc/Docs/referenceresources/wwc_procedures_v3_0_standards_handbook.pdf

Winne, P. H. (1982). Minimizing the black box problem to enhance the validity of theories about instructional effects. Instructional Science, 11, 13–28.

Winne, P. H. (1987). Why process-product research cannot explain process-product findings and a proposed remedy: The cognitive mediational paradigm. Teaching and Teacher Education, 3, 333–356.

Winne, P. H. (1992). State-of-the-art instructional computing systems that afford instruction and bootstrap research. In M. Jones & P. H. Winne (Eds.), Adaptive learning environments: Foundations and frontiers (pp. 349–380). Berlin, Germany: Springer-Verlag.

Winne, P. H. (2006). How software technologies can improve research on learning and bolster school reform. Educational Psychologist, 41, 5–17.

Winne, P. H. (2010). Improving measurements of self-regulated learning. Educational Psychologist, 45, 267–276.

Winne, P. H., & Baker, R. S. J. d. (2013). The potentials of educational data mining for researching metacognition, motivation and self-regulated learning. Journal of Educational Data Mining, 5(1), 1–8.

Winne, P. H., & Perry, N. E. (2000). Measuring self-regulated learning. In M. Boekaerts, P. Pintrich, & M. Zeidner (Eds.), Handbook of self-regulation (pp. 531–566). Orlando, FL: Academic Press.

Cite This Article as: Teachers College Record Volume 119 Number 3, 2017, p. 1-24
https://www.tcrecord.org ID Number: 21769, Date Accessed: 7/23/2019 9:40:20 AM

About the Author
  • Philip Winne
    Simon Fraser University
    PHILIP H. WINNE is Professor and Canada Research Chair in Self-Regulated Learning and Learning Technologies; and Associate Dean, Graduate Studies and Research in the Faculty of Education, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada.