
Can Groups Learn?

by Elizabeth G. Cohen, Rachel Lotan, Beth Scarloss, Susan E. Schultz & Percy Abram - 2002

This is a study of assessment of the work of creative problem-solving groups in sixth-grade social studies. We test the proposition that providing students with specific guidelines as to what makes an exemplary group product (evaluation criteria) will improve the character of the discussion as well as the quality of the group product. To assess the group’s potential for successful instruction, we examine the character of the group conversation as well as the quality of the group product. We present a statistical model of the process of instruction that connects the use of evaluation criteria, group discussion, creation of the group product, and average performance on the final written assessment.


Assessing the outcome of group work that uses creativity and problem solving requires very careful attention to the nature of the group assignment in relationship to any measures of performance. According to Solomon (1998), the articulation between the content and the performance that will be used to measure it is of prime importance. Assessments that require factual and conceptual knowledge are only appropriate for tasks with rich academic potential. Sometimes, teachers faced with students with limited skills in reading and writing give them a poster to draw instead of requiring them to read materials or to produce written reports. They do so with the idea that they are using the latest work in psychology to reflect multiple intelligences (Gardner, 1999). However, the poster assignment does not usually involve factual and conceptual knowledge, so the teacher is unable to make academic assessments. These group products demand no connection between the academic discourse of the group and the academic content that students are expected to master. Students are not required to demonstrate their understanding of concepts. In contrast, the group tasks used in this study demand close connections between the group discussion and academic content, while simultaneously employing a wide range of intellectual abilities. The assessments we use therefore focus on academic content, the quality of the group products, and written essays. We chose the last type of assessment because written performance is increasingly important with higher grade levels.


Given creative open-ended tasks, the assessment of group work requires a redefinition of a group as more than a collection of individuals. Although sociologists and social psychologists who have studied groups have always claimed that a group is greater than the sum of its parts, there has been a tendency in assessment to regard the potential for performance in a group as the sum total of the amount of information, skills, and abilities that individuals bring to that group. Through the creative exchange of ideas, groups can solve problems and construct knowledge beyond the capacity of any single member. Thus it is possible to talk about the concept of group learning that is a result of the interaction of the group members and is not attributable to one well-informed person who undertakes to create the product or even to a division of labor in which different persons contribute different pieces of the product.


In this study of assessment we take the position that the quality of the interchange among the members and the quality of the group product are good indicators of the group’s learning potential. We argue that the product should be assessed because it has an important mediating effect on instructional outcomes. Specific kinds of group discussion affect the quality of the group product that, in turn, influences the nature of group learning.

The quality of group discussion and the group product have consequences not only for group learning but also for individual performance. Group experiences are important for students’ individual-level comprehension and ability to express what they have learned.


Given a carefully engineered set of academic group tasks, group learning will depend on clarity of criteria for evaluation. Very often, groups with an assigned task have only a vague idea about what criteria will be used to assess their product. It is a truism in assessment literature that students should be aware of criteria that are being used for evaluation (Frederiksen & Collins, 1989). However, there is no clear research evidence in the assessment literature that awareness of such criteria will improve performance. The evidence for the importance of knowing the basis for the evaluation of one’s work comes from research in social psychology.

In this study we provide specific evaluation criteria for each group product. Students know from the start what will constitute an exemplary performance. Instead of generic criteria, students have criteria that refer to the specific academic content of the particular group task they have undertaken. Because they have these criteria in advance, groups are able to evaluate their own planned performance and product before they must rise and present to the class and the teacher. Thus, this assessment meets two of the conditions Shepard specified for classroom assessments that align with instruction: “Expectations visible to students” and “Students active in evaluating their own work” (Shepard, 2000, p. 8).


This is a study of five sixth-grade classrooms, all using the strategies of complex instruction. We have detailed data concerning talk of the groups, group products, and performance on an essay after the completion of a unit. Complex instruction is a set of strategies for creating equitable classrooms. Using these strategies, teachers can teach to a high intellectual level in academically and linguistically heterogeneous classrooms (Cohen & Lotan, 1997). Curricular materials developed for complex instruction have challenging group activities that require multiple intellectual abilities and also demand comprehension and application of academic content. In addition to the printed word, the academic content is embedded in nontext resources, such as pictures, diagrams, and music. Multiple ability curricula permit different kinds of students to make different kinds of intellectual contributions to the final product on each of the tasks.

These multiple ability curricula assume that the effects on academic learning will be more powerful if there is a set of group-work tasks carefully engineered to represent both central concepts and more factual academic materials. Each group moves through this set of tasks requiring products each day, such as skits, murals, or models. Group activities make up a unit designed to teach what is called the big idea, along with considerable factual, academic content. Because tasks vary as to media and mode, and because groups of students rotate through different tasks, there is more than one way and more than one chance to apprehend central concepts of the unit. In this way, a wide variety of learners can acquire deep understanding. To assess student learning, one can construct a test on the unit or require a written essay concerning major concepts underlying the unit.


The first set of hypotheses pertains to the effects of using evaluation criteria on quality of group discussion, on group products, and on written assessment. The second set focuses on the effects of quality of group discussion and group products on written assessment. We then develop a path model of the direct and indirect effects on written assessment. Finally, we take up the question of the effects of group-level variables on individual outcomes.


Hypothesis I. The use of evaluation criteria will lead directly to more evaluative and task-focused talk as well as to better group products and will be associated with higher scores on final written assessment.

The use of evaluation criteria should affect group learning in three ways. First, criteria should help to produce a better group product. With explicit criteria, both students and teachers have guidance as to what will count as a good product. Schultz et al. (2000) have shown that teachers who use evaluation criteria provide much more specific and concrete feedback on group products than teachers who lack these criteria. In complex instruction, the whole class hears feedback to each group on tasks that some groups will subsequently carry out. This feedback prepares the groups who will do particular tasks the next day to do a better job at meeting the criteria.

Second, the use of evaluation criteria should improve the quality of the discussion and thus promote group learning. When teachers tell groups that they are expected to use evaluation criteria to evaluate their own work as they move through the task, they are delegating the right to evaluate to groups of students. Students will feel that they have the right and duty to evaluate the quality of the group product and to assess their own and others’ contributions. Under these conditions, students will be more critical of their own work (Abram et al., 2002). More evaluative and self-critical conversation will lead to higher average performance of group members on an academic assessment such as a written essay.

Third, the presence of evaluation criteria should make the group more task focused. According to Dornbusch and Scott (1975), participants exert greater effort when they perceive evaluations as soundly based. When evaluations are seen as sound, participants will direct their efforts toward improving the evaluations they receive. If students have specific evaluation criteria for each group task, then both students and teacher will be more likely to make evaluations that will be seen as soundly based. The perception of soundly based evaluations will increase student effort to turn out a product that will achieve favorable responses/feedback from peers and teacher. In addition, evaluation criteria should heighten students’ awareness of the requirements for developing a good product to present to the class and should therefore make students’ talk more product focused. Groups that are more product focused should show less disengagement. There should be fewer interactions that are observably off-task.

This discussion leads to some specific predictions: Groups with evaluation criteria in comparison to groups without criteria will (1) have more talk that evaluates the group product; (2) show less off-task behavior; (3) have better group products; and (4) write better final essays on the academic content of the unit. Also, evaluation criteria and task-focused talk will constitute independent predictors of quality of group product.


Hypothesis II. More self-assessment of the group product as well as a better quality group product will lead to higher average scores for the group on written assessment.

Under certain conditions, we expect to see a relationship between final written assessment and the quality of the group product. First, the written assessment must match the content and concepts of group tasks. Second, requirements for a group product must likewise reflect academic content and concepts of the assigned group task. Third, the group product must require reciprocal interdependence of group members. In other words, group members must engage in mutual exchange of ideas, and all group members must make contributions to the product.

Ordinary group work frequently does not meet these conditions. However, when instruction and assessment do constitute a good match and when there is a true group task that cannot be well done by one individual, the process of creating a group product will add to the understanding and ability to articulate knowledge on the part of all group members. When the group focuses on the required academic content, they will produce a superior product. The creation and presentation of that superior product to the class as a whole will represent critical learning opportunities for all the members of the group.

As group members prepare group products, they brainstorm and exchange ideas, manipulate concrete objects to construct artifacts and models, plan dialogues and rehearse debates, map the information, and design expressive visuals. Such processes serve as excellent prewriting activities for individuals who will then be asked to formulate their own written answers to prompts about the big idea of the unit. Prewriting activities are particularly important to scaffold the writing process.

In addition to positive effects from the group product, there is a critical feature of group discussion that will lead to superior understanding on the part of individuals. As the group becomes self-critical and analyzes ways in which their planned group product will or will not be successful, members are able to achieve a deeper grasp of central ideas underlying the activity. To carry out this critique, they must apply ideas and academic content gained from resource materials and their discussion of those ideas to the product and to the presentation for the class. Obviously, if they have explicitly stated evaluation criteria, an accurate assessment of the quality of their product will be much easier. We have learned from pilot studies that such evaluative talk is comparatively rare among sixth graders and requires explicit modeling by the teacher for most groups to exhibit the behavior at all. We predict that both evaluative talk about the product and the quality of the group product will be independent precursors of the average performance of the group on written essays on the academic content of the unit.


This conceptualization suggests a path model picturing causal relationships among a set of four successive events: the initial conditions of evaluation criteria, group discussion, creation of the group product, and average group performance on the final written assessment. How do all these factors fit together into a model of the process of instruction and assessment? For example, we have predicted that evaluation criteria will have direct effects on quality of discussion and group products. Will they also have direct effects on written assessment once quality of the discussion and product are controlled? Will all the different indicators of a high-quality group discussion have direct effects on the group’s average essay score or will some of them primarily affect the product while others affect essay scores? To answer these questions we will develop a path model.


Hypothesis III. The better the quality of the group discussion and product, the better will be the individual performance of group members.

Assessment typically focuses on what individuals have learned from an instructional experience. Thus far, we have confined our analysis to the group level. A natural question is, What is the effect on the individual’s performance of being in a group that produces a superior product?

We would expect that individual learning is greatly benefited both by exposure to high-quality group talk and by participation in the creation of a superior group product. Schultz randomly assigned students to group and individual conditions to experience the same multiple ability unit on ecology. Students who participated in groups performed much better on several different assessments compared with students who carried out the same tasks as individuals (Schultz, 1999). Additionally, Cossey (1997) showed that exposure to high-quality conversation in math groups working on a multiple ability unit was a direct predictor of individual learning gains in mathematics. We reason that if individuals are exposed to more academic content relevant to the product and to the creation of a better product, their individual achievement should reflect that superior experience. In addition, because complex instruction requires students to write up an individual report after each group task, they should be well prepared to write individual essays summarizing some of what has been learned by the group.


In the five classrooms, there were 39 groups of 4-5 students who all experienced group tasks and assessment meeting the conditions outlined previously. For all the hypotheses and predictions, except the one concerning individual performance, we use only measures at the group level. These include the measures of student talk in the groups, assessment of group products, and individual essay scores aggregated to the group level. The variability among small groups in the nature of their discussions and in quality of their group products as well as variation in the dependent variable of the average group essay score permitted us to test the predictions concerning group learning.
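The aggregation of individual essay scores to the group level can be sketched as follows. This is a minimal illustration, assuming that aggregation means taking the arithmetic mean of members' scores within each group; the function name, group labels, and scores are hypothetical and are not data from the study.

```python
from collections import defaultdict
from statistics import mean


def group_means(records):
    """Average individual scores within each group.

    records: iterable of (group_id, individual_score) pairs.
    Returns a dict mapping group_id to the mean score of its members.
    """
    by_group = defaultdict(list)
    for group_id, score in records:
        by_group[group_id].append(score)
    return {group_id: mean(scores) for group_id, scores in by_group.items()}


# Hypothetical essay scores for two groups (illustrative values only)
essay_records = [("A", 3), ("A", 4), ("B", 2), ("B", 3), ("B", 4)]
group_essay_scores = group_means(essay_records)
```

Each group then contributes a single value to the group-level analysis, regardless of how many members it has.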


All five teachers were highly trained in instructional strategies of complex instruction. Failure to delegate authority to groups of students or failure to hold groups accountable for their group products would undermine the learning outcomes for reasons quite outside variability in groups we wished to study. Therefore, all five teachers were examined with standard measures of implementation, assuring us that these were all classrooms where complex instruction met our own standards.

Groups in different classrooms all carried out the same group tasks using the same curricular material. All five classes completed three preliminary complex instruction units to familiarize teachers and students with evaluation criteria and with our data collection procedures and instruments. We collected data used in this study in the fourth and final complex instruction unit (focal unit), “The Importance of the Afterlife in Ancient Egypt.”


Three of the five teachers used activity cards with evaluation criteria printed on them. Students of these three teachers practiced using evaluation criteria during “skillbuilder” exercises prior to implementation of the units. These skillbuilders provided students with guidelines and practice in talking about evaluation criteria. In the two comparison classrooms, students experienced a general skillbuilder designed to improve quality of group discussion. In addition, students using evaluation criteria had opportunities to practice the use of evaluation criteria in three units prior to measurement in the focal unit.


We gathered information from five sixth-grade classrooms (163 students in 39 groups) drawn from a linguistically, ethnically, and racially diverse population in California’s Central Valley during the 1998-1999 academic year. The average percentile ranking on the SAT-9 standardized test for the 163 students in our sample was 34.6. Approximately 25% of the students in the study were designated limited English proficient. The majority of these students reported either Spanish or Punjabi as their first language. As is common in communities in California’s Central Valley, many residents are involved in agriculture, and a number are immigrant workers.


We audiotaped student groups for the entire lesson each day during the focal unit. The focal unit required 5 days of implementation so that students could rotate through the five tasks, give a presentation, and receive feedback on each task. We analyzed group conversation from three of the 5 days of implementation to derive measures of group talk. We collected data on group products and presentations with audiotape and digital photographs of the physical products. After the focal unit, all students wrote an essay on several activities of the unit and their relationship to the unit’s central idea.


Complex instruction units in social studies are organized around “big ideas” that are central to the discipline of history. The unit “The Importance of the Afterlife in Ancient Egypt” prompts students to develop a conceptual understanding of the afterlife’s role in the everyday lives of ancient Egyptians. Understanding how people order their lives and the ideas that shape their behavior is central to understanding the history of that people.

The unit on Egyptian afterlife is centered around the conceptual question “How did the ancient Egyptians’ concept of the afterlife affect their everyday lives?” Students completed five different activities, each consisting of a performance task conceptually tied to the unit’s big idea and intended to provide an opportunity to explore one aspect of the central concept. Each activity gives groups one piece of the academic content of the unit, an open-ended performance task, and a conceptual tie to the big idea. See Figure 1 for the various activities and their conceptual tie to the big idea.

To give the reader a sense of the unit, we discuss one activity in detail, Activity 2: Heavy Heart. The activity card instructs students as follows: “As a group, develop a skit depicting the journey into the afterlife and the weighing of the heart ceremony. Include characters such as Osiris, Anubis, the Devourer, Maat, and the deceased person. Discuss the many good deeds the deceased person did in his/her life as well as defend some of the bad deeds he/she did.”

Deciding what to do, both in terms of how to go about producing the skit and what academic information to include, is in the students’ hands. In this activity, four “resource cards” contain the “history.” Two are text-based resources that are primary source excerpts translated from the ancient Egyptian Book of the Dead. One contains spells to protect the deceased’s heart on the way to the Hall of Judgment, and the other lists 15 of the confessions the deceased was required to make to each of the 42 lesser deities, or judges, along the way. The remaining two resource cards present the content information in color images of the weighing of the heart ceremony. One of these is a panel reproduced from an ancient Egyptian tomb painting. The other is a modern drawing of the scene, including a brief description of each of the participants and their roles. These are the only content resources included with this activity and are intended to communicate all of the academic material students will use to complete their task.

Groups are expected to discuss and answer several higher order questions requiring them to deeply explore their resource materials. One question, for example, asks students to explore the 15 listed confessions and to “Choose three ‘sins’ your group considers to be the most serious crimes against the gods and three your group considers to be the least serious. Explain your choices.” Defining what is important (both to ancient Egyptians and to themselves), figuring out what the Egyptians thought about the afterlife, deciding how to put this information together, and planning what information should be included in the skit are all left to the students. In this way, all of the activities are open-ended.


Each activity required intellectual abilities that go beyond reading and writing tasks. Although reading and writing are required, the ability to understand information from an ambiguous text, to translate that information to another context, to apply information from one visual context to another, and to plan dramatic action based on static images are equally important for a successful performance of the task. In this way, all of the tasks are multiple-ability in nature.

Evaluation criteria are purposely open-ended, allowing students to decide how to fulfill them. Evaluation criteria are not intended to give students a “recipe for success”; rather, they give students insight into the teacher’s expectations for the academic content and its presentation. Teachers direct students to use evaluation criteria when assessing their own products, when making presentations to the class, and when evaluating the academic performance of others. Evaluation criteria from The Heavy Heart Activity are as follows: “Skit includes at least 2 sins, 2 virtues and, 1 spell; Skit gives good reasons for whether or not the deceased entered the afterlife; Skit is well rehearsed and believable.”


The final essay asked students to incorporate what they had learned about Egypt during the complex instruction unit in written form and to make connections between the activities of the unit and the daily life and beliefs of ancient Egyptians. The essay prompt read:

You have just finished studying ancient Egypt. By participating in group discussions, preparing your projects, and making your presentations, you learned that the ancient Egyptian’s beliefs and ideas about the afterlife influenced all aspects of their culture. In this essay, discuss how those beliefs about the afterlife affected the ancient Egyptians while they were alive. Use what you learned in the following three activities for details: Activity 1—Preparations for the afterlife; Activity 2—Beliefs about the weighing of the heart ceremony; Activity 4—Mummification.



To check that all teachers met standards for good implementation of complex instruction, we collected data on overall levels of classroom interaction with the Whole Class Observation Instrument. For each lesson we generated two separate Whole Class Observations. These snapshots indicated how many students were talking, with or without manipulating materials, and how many were disengaged. Percentages were calculated by counting the number of students engaged in the specific behavior, dividing that number by the total number of students observed, and multiplying by 100. One teacher in the comparison condition approached but failed by 1% to meet the first criterion of 35% of students talking or talking and manipulating materials. Teachers exceeded the standards in all other categories.
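The percentage calculation described above can be sketched in a few lines. The function name and the snapshot counts are hypothetical, and treating the 35% criterion as "at least 35%" is an assumption made for illustration.

```python
def percent_of_observed(count, total_observed):
    """Students showing a behavior, as a percentage of all students observed."""
    if total_observed <= 0:
        raise ValueError("total_observed must be positive")
    return count / total_observed * 100


# Hypothetical snapshot: 11 of 30 observed students were talking or
# talking and manipulating materials (illustrative counts only).
snapshot_pct = percent_of_observed(11, 30)

# Assumption: the first criterion is met when the percentage is at least 35.
meets_first_criterion = snapshot_pct >= 35
```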


We developed the Student Talk Instrument to capture the degree to which the 39 groups studied were engaged in evaluative, product-focused, and content-related talk across three of the five rotations. After listening to the audiotaped recordings of the groups, we created nine mutually exclusive categories that characterized types of students’ talk. Raters independently scored each tape and achieved an interrater agreement score of over 85%.
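One common way to express interrater agreement is simple percent agreement: the share of coded units to which two raters assign the same category. The sketch below assumes that definition, which the text does not specify; the function name, category labels, and codes are hypothetical.

```python
def percent_agreement(rater_a, rater_b):
    """Percentage of units that two raters coded into the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of units")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a) * 100


# Hypothetical codes for eight units of talk (labels are illustrative)
codes_a = ["off_task", "evaluative", "content", "content",
           "off_task", "content", "evaluative", "content"]
codes_b = ["off_task", "evaluative", "content", "off_task",
           "off_task", "content", "evaluative", "content"]
agreement = percent_agreement(codes_a, codes_b)  # 7 of 8 units match
```

Percent agreement does not correct for chance agreement; a chance-corrected index would give a lower figure for the same codes.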

Off-task talk recorded group discussions on subjects that were neither product related nor academic in nature. This included personal discussions of events that occurred outside of the classroom or the school. Approximately 15% of all group talk was off-task.

Group evaluative talk refers to questions, declarative statements, opinions, or reflections about the product/presentation. It does not include evaluations of content materials from the unit (i.e. resource cards, activity questions), nor does it include evaluations of persons. This category captures comments that assess quality of the group’s product. Evaluative talk might, for example, question the planned dialogue for the skit illustrating the Weighing of the Heart Ceremony: “That’s dumb—a king wouldn’t talk like that.”

Content discussion of the product is a category in which talk explicitly addresses the given task. This category includes proposals for the nature of the product, connections made between the big idea and the group’s product or presentation, as well as rehearsals of lines included in a script, song, or presentation. In our sample activity, such talk might refer to discussions about which would be appropriate virtues and sins (“I think raping is worse than stealing”), who would be at the Weighing of the Heart Ceremony (“It says Osiris was always there”), or how the skit will portray the determination of whether or not the deceased person’s heart weighs more than the feather of truth (“We can put a weight inside the heart”).


Using audiotapes of the presentations and the products themselves, Scarloss (2001) created a unique method of scoring group performance. She developed separate rubrics for each of three components of the scoring system: concrete content, conceptual content, and presentation conventions for each of the five activities.

The rubrics for concrete content were developmental. For example, when a group created a skit for the weighing of the heart ceremony, they received a top score if the skit addressed five or more major elements of the ceremony, was highly detailed, and included evidence of props. Elements for this activity included specific content-based requirements, such as naming the deceased’s good deeds and the weighing of the heart against the feather of truth. She used the following distinctions to allocate scores: “(1) Minimal or missing; (2) Applied but with elements missing or wrong; (3) Applied with included reasoning; (4) Applied with included reasoning, complete, coherent, exemplary.”

The rubrics for conceptual content were also developmental. For the same activity a top score required the following: “deceased’s disposition into the afterlife explicitly linked to specific deeds in the deceased’s life; disposition explicitly tied to the weight of the heart relative to weight of the feather of truth.”

The rubrics for presentation conventions were task specific. Each of the aesthetic conventions was rated on a score from 0 to 2. For example, one of the elements on which a skit was scored was the narrative: “skit tells a story: has a beginning, middle, and end.” The total score of all the aesthetic elements was then converted into scores in the 1-4 range used by all the rubrics.

On each of the rubrics, scorers reached greater than 90% agreement. Preliminary analyses indicated a high degree of collinearity among these component measures. A product index score (“Average Product”) was created by taking the average of the component scores.
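The product index can be illustrated with a short sketch, assuming an unweighted mean of the three component scores, each on the 1-4 scale. The function name and the example scores are hypothetical.

```python
from statistics import mean


def average_product(concrete, conceptual, presentation):
    """Unweighted mean of the three rubric component scores (each 1 to 4)."""
    components = [concrete, conceptual, presentation]
    if any(not 1 <= score <= 4 for score in components):
        raise ValueError("component scores are expected on the 1-4 scale")
    return mean(components)


# Hypothetical group: concrete content 3, conceptual content 2, presentation 4
index = average_product(3, 2, 4)
```

Because the components are highly collinear, averaging them yields one product measure instead of three overlapping predictors.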


We administered a multiple-choice test before any teaching about Egypt and again after the focal unit. Unfortunately, it proved almost useless in assessment. Although all classrooms showed major gains in their knowledge of Egypt, the test results were affected by the great variation in the amount of preteaching on ancient Egypt among the five teachers. Two of the three teachers who used evaluation criteria spent many weeks on Egypt prior to the focal unit, thus confounding our attempts to assess the impact of the evaluation criteria on the multiple-choice test independent of the amount of the preteaching.

We also gave a general test on Egypt after the teacher completed background instruction but before the focal unit. This test included some items directly pertaining to the focal unit and other more general items concerning ancient Egypt. In this analysis we used scores from this general test as a control on the group’s prior knowledge of Egypt. We used only those items that were directly related to the content of the unit, so as to avoid confounding influences of the variable amount of preteaching.

The individual essay was a more appropriate final assessment than the general test because it dealt specifically with activities of the focal unit. Scores on the essay were based on four areas: Facts and Details; the Big Idea; Organization; and Mechanics. For each area, students received a score ranging from 1 through 4, where 4 represented the highest score possible and 1 represented the lowest. Two scorers of the essays reached an interrater agreement of 91%. For purposes of this analysis we used only the score for Facts and Details and that for the Big Idea.


Table 1 presents descriptive statistics for the principal variables of the analysis: average group scores for the pretest, the product, and the essay and the average percentage of group talk that concerned evaluation of the product, content of the product, and off-task remarks. Note the low percentage of evaluation as a percentage of total talk (2.3%); at least one group had none of this type of talk. Talking about the content of the product was much more common, with a mean of 20.8%. Off-task talk ranged from 2.8% of total talk to 40.9%, with a mean of 15.4%.


The first set of predictions concerns the effects of evaluation criteria. Table 2 shows the means of the two sets of classrooms, with and without evaluation criteria. The table includes the results of t tests for the difference between means. As predicted, groups in classrooms using evaluation criteria have a significantly higher rate of evaluating their product in their group discussions than groups without criteria (2.8% vs. 1.7%; t = 2.5; p = .02). In terms of classroom talk, this means that groups with evaluation criteria used questions, statements, opinions, and reflections to evaluate the accuracy, relevance, and appropriateness of the content in their group products and presentations more than their peers in groups without evaluation criteria. This difference was significant despite the low overall level of the occurrence of this kind of talk.
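A t test for the difference between two group means can be sketched as below. The text does not say which variant of the test was used; this sketch assumes the classic pooled-variance (Student) form, and the sample percentages are hypothetical values, not the study's data.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # Pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical percentages of evaluative talk per group (illustrative only)
with_criteria = [2.1, 3.4, 2.9, 3.0, 2.6]
without_criteria = [1.2, 2.0, 1.9, 1.5, 1.8]
t_stat, degrees_of_freedom = pooled_t(with_criteria, without_criteria)
```

The p value is then read from the t distribution with the returned degrees of freedom.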


Groups without criteria were significantly less task-focused than groups with evaluation criteria. We used a negative measure of task focus—the percentage of off-task talk. Almost 20% of all talk in groups without criteria was off task, whereas the parallel percentage for the groups with evaluation criteria was 13%. The t test was significant at the .03 level. Off-task talk within a group had a deleterious effect on that group’s overall performance. When some members of a group refused to cooperate on the task at hand, two results were common: Either the remaining members were forced to complete the task without those persons’ help in the limited time available (or even in the face of active obstruction), or the remaining members were themselves drawn into off-task behavior. Thus, there was good support for the prediction that groups with evaluation criteria would show less off-task behavior. Surprisingly, on the second measure of task focus, groups working without evaluation criteria were just as likely to talk about the content of the product as groups working with evaluation criteria.

Groups with criteria had significantly superior products to groups without criteria (t = 4.0; p = .00). The results strongly support the prediction that use of evaluation criteria improves the quality of group products.

As predicted, groups with criteria had a significantly higher average score on the essay (3.9) than the groups without criteria (3.1; p = .00). In other words, groups that worked with evaluation criteria were better able to include correct academic content and the central concept in their essays than groups working without criteria. That groups working with evaluation criteria also had significantly higher scores on the pretest suggests that we need to control for possible differences in what groups knew at the start of the unit in analyzing predictors of essay scores.

We also predicted that evaluation criteria and the amount of task-focused behavior would be independent predictors of quality of group product. In regressing the average group product score on its predictors, we controlled for average pretest score. Table 3 contains the correlations among the principal variables in the analysis. Table 4 presents the regression of average group product score on use of evaluation criteria (a dummy variable), percentage of talk about content of the product, and the aggregated group pretest score. This regression uses the percentage of talk about content of the product rather than the percentage of off-task talk as an indicator of task focus for two reasons: (1) unlike off-task talk, it is unrelated to the use of evaluation criteria and (2) it is a positive and direct measure of product-focused behavior. Table 3 shows that off-task behavior and talk about the content of the product are rather closely and negatively related to each other (r = -.60).


The regression reported in Table 4 shows that both the measure of talk about content of the product (Beta = .33; p = .02) and use of evaluation criteria (Beta = .51; p = .001) are positive predictors of quality of the group’s product. The pretest score is unrelated to product quality. Thus, there is good support for the prediction that task-focused talk and evaluation criteria are independent predictors of product quality. Knowing the amount of information brought to the task by the group is of no use in predicting quality of the product. Only two things were useful: knowing the product focus of the discussion and knowing whether or not the group had evaluation criteria as guides to an exemplary product.
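To make the regression with a dummy variable concrete, the sketch below computes standardized beta weights by z-scoring every variable and solving the normal equations. It assumes the reported "Beta" values are standardized coefficients of this kind, which the text does not confirm. All variable names and numbers are hypothetical and are not the study's data.

```python
from statistics import mean, pstdev


def zscores(values):
    """Rescale a variable to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def standardized_betas(predictors, outcome):
    """OLS slopes after z-scoring all variables (the intercept is then zero)."""
    zx = [zscores(col) for col in predictors]
    zy = zscores(outcome)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in zx] for ci in zx]
    xty = [sum(p * q for p, q in zip(ci, zy)) for ci in zx]
    return solve(xtx, xty)


# Hypothetical group-level data for eight groups (illustrative only).
criteria = [1, 1, 1, 1, 0, 0, 0, 0]          # dummy: 1 = evaluation criteria used
content_talk = [25.0, 18.0, 30.0, 22.0, 20.0, 15.0, 24.0, 12.0]  # % of talk
pretest = [60.0, 55.0, 70.0, 65.0, 58.0, 62.0, 50.0, 66.0]
product = [3.6, 3.0, 3.9, 3.4, 2.8, 2.4, 3.0, 2.2]

betas = standardized_betas([criteria, content_talk, pretest], product)
for name, beta in zip(["criteria", "content talk", "pretest"], betas):
    print(f"{name}: beta = {beta:.2f}")
```

Because the dummy variable is coded 0 or 1, its coefficient reflects the difference between the two sets of classrooms after the other predictors are held constant.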



Our second hypothesis places group product as an important antecedent to the learning outcome as measured by the aggregate essay score. We also hypothesized that self-assessment of the group, as indicated by the percentage of talk evaluating the product, would be a direct predictor of the aggregate essay score. Again, pretest score is included as a control variable measuring what the group knew at the beginning of the unit.

Table 5 presents the regression of group essay score on these predictors. The beta weights for average product score (.43) and talk that is evaluation of the product (.38) are both statistically significant (p = .01). Again, pretest score is not a significant predictor of essay performance. This regression provides good support for the predictions that evaluative talk and quality of the group product are independent predictors of aggregated essay scores.



Figure 2 portrays the causal relations between the instructional and assessment variables in a path model that flows from the initial differences in the use of evaluation criteria through the discussions of the groups to their products and finally to the aggregate essay score. To calculate the path coefficients for this model, we ran new regressions, dropping the pretest score because it was not a significant predictor. All the path coefficients were statistically significant.

The group product plays a critical mediating role between the character of the discussion and the ability of the group members to write about what they have learned in an essay. Evaluation criteria have a direct effect on the nature of the group discussion and group product but only an indirect effect on essay scores. In another regression of group essay score, not presented here, we included the dummy variable for evaluation criteria as a predictor. Evaluation criteria had no effect on essay score. It is through the increase in self-assessment (talk that is evaluation of product) and through the superior product that evaluation criteria affect the final essay.
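The idea of an indirect effect can be illustrated with the usual path-analysis rule: multiply the coefficients along each route and add the routes together. The sketch below assumes two routes, criteria to evaluative talk to essay and criteria to product to essay, as the paragraph above describes. The coefficient values are hypothetical and are not those of Figure 2.

```python
from math import prod


def indirect_effect(routes):
    """Sum, over all routes, of the product of the path coefficients on each route."""
    return sum(prod(route) for route in routes)


# Hypothetical standardized path coefficients (illustrative only).
criteria_to_talk, talk_to_essay = 0.4, 0.5
criteria_to_product, product_to_essay = 0.5, 0.4

effect = indirect_effect([
    (criteria_to_talk, talk_to_essay),        # via evaluative talk
    (criteria_to_product, product_to_essay),  # via the group product
])
print(f"indirect effect of criteria on essay score: {effect:.2f}")
```

With no direct path from criteria to essay score, this sum is the whole effect of the criteria on the essay in such a model.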



We used hierarchical linear modeling techniques developed by Bryk and Raudenbush (1992) to estimate the effects of group product scores and evaluative group talk on individual essay performance. Expressed as equations, the model we estimated is:

Y (Individual Essay Score) = B0 + R

B0 = G00 + G01 * (Group Product Score) + G02 * (Group Evaluative Talk) + U0

where G00 represents the intercept and R and U0 are error terms. The second equation is substituted into the first, and the model is calculated as a regression.
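The substitution described above can be written out explicitly. The subscripts i (student) and j (group) are added here for clarity and do not appear in the original notation. Placing the error term U0 in the group-level equation is an assumption based on the statement that R and U0 are both error terms.

```latex
\begin{aligned}
Y_{ij} &= B_{0j} + R_{ij} \\
B_{0j} &= G_{00} + G_{01}\,(\text{Group Product Score})_j + G_{02}\,(\text{Group Evaluative Talk})_j + U_{0j} \\
Y_{ij} &= G_{00} + G_{01}\,(\text{Group Product Score})_j + G_{02}\,(\text{Group Evaluative Talk})_j + U_{0j} + R_{ij}
\end{aligned}
```

The last line is the combined model: each student's essay score is predicted from two characteristics of that student's group, with one error term at the group level and one at the individual level.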

We must acknowledge that we had small numbers for conducting this analysis (group level N = 39, individual level N = 147). Although these numbers are adequate for OLS regression, the average of four students per group necessarily reduces reliability estimates for this model (RE = .051). The consistency of these findings with the results of our earlier analyses suggests that estimates from the hierarchical models are not spurious correlations but reflect strong influences of evaluative talk and making high quality group products on students’ ability to write about what they have learned.

Table 6 presents results from this analysis. The quality of the group product and the evaluative talk in the group are both significant predictors of individual essay performance. Creating a high-quality group product appears to have a stronger effect on students’ ability to write about what they have learned than does the amount of evaluative talk in the group. The reader may recall that although discussion about the academic content of the group product constituted, on average, about 20% of group talk, evaluative talk represented only about 2% of total talk. According to this and the earlier analyses, high-quality group products and student discussion that represents self-assessment have clear effects on academic achievement at both the individual and group levels.


A chi-square test of the unexplained variance (error terms) in the equation shows no significant unexplained variance (χ² = 27.23, p > .5). That is, beyond the effects of group product scores and evaluative talk, nothing substantial about the groups accounts for improved individual performance.

Reading scores have long been robust predictors of academic performance. We ran a second analysis that included students’ reading scores as a control. That analysis did not differ appreciably from the results presented in Table 5. Individual reading scores were a significant predictor of performance on the essay (t = 4.4, p < .00). Controlling for individual reading score and estimating the effects of the quality of group products and evaluative talk in the group did not substantially change our ability to predict essay performance. Specifically, although there were small changes in t values for the group level variables (G01: t = 3.0, p = .01; G02: t = 2.2, p = .03), reading accounted for some of the variability in the intercept (G00: t = 2.7, p = .03).


Can groups learn? The groups in this study did indeed learn as a result of their discussions and their creation of group products. Both self-assessment in group discussion and quality of the group product were independent predictors of group essay performance. Moreover, learning was not a matter of relevant academic knowledge that individuals brought to the group but came about through reciprocal exchange of ideas and through a willingness to be self-critical about what the group was creating. Failure of average pretest score to predict either quality of group product or average essay score suggests that learning arose from the group as a whole.

Groups learned particularly well if they knew precisely what criteria would be used to evaluate their product. Evaluation criteria are a motivational tool: they help groups to be more self-critical and increase their effort to turn out a superior group product, thereby strengthening essay performance. This study presents direct empirical evidence of the benefits of students’ awareness of evaluation criteria. A simple comparison of essay scores of the two sets of classrooms shows that classrooms with evaluation criteria had higher scores than classrooms not using evaluation criteria (Abram et al., 2002). But such an analysis obscures the finding that evaluation criteria have an indirect, not a direct, effect on learning outcomes. The direct effects of evaluation criteria are on group functioning and products that, in turn, boost essay scores.

What is missing in this model of classroom instruction is the role of the teacher. Unless the teacher trains the students to use evaluation criteria, and uses evaluation criteria when providing specific feedback to groups, one would probably not see the favorable effects of the criteria. Teachers in classrooms with evaluation criteria give much more concrete and specific feedback to groups (Schultz et al., 2000). Our pilot study demonstrated that unless there is adequate opportunity to learn how to use criteria, there is little evidence of their use in group discussion. Self-assessment is a type of meta-cognition, which is learned behavior. Students probably learn this new behavior by listening to the teacher use the evaluation criteria to model critical analysis of products and to provide feedback to groups.


Researchers, evaluators, and teachers have typically neglected products of cooperative learning. Scarloss’s (2001) development of reliable rubrics for a variety of products shows how it is possible for researchers and evaluators to assess cooperative learning. Creating a product is a performance task that can be used for assessment. In these groups, students grappled with central concepts and academic content as presented in resource materials and discussion questions. Groups concretized their understanding by representing it in a group product.

For teachers, group products are an invaluable way to gauge students’ understanding of underlying concepts. Evaluation criteria that are specific to each product provide a way for teachers to give formative assessment and feedback as each group makes a presentation. Once teachers see which criteria are particularly challenging for students to meet, they can supplement instruction as necessary. Without criteria, teachers often make only the most general comments when a group presents its unique solution or product.


The final essay gave us a window on the extent to which students walked away with a good grasp of the major ideas underlying the unit. Scoring the essay separately for organization and mechanics allowed us to examine deep academic understanding quite apart from the student’s ability to write well. Because many of these students were English learners or from a poor rural background, they exhibited marked difficulties in the mechanics of writing, despite a reasonable grasp of the central concepts. If a major objective is to see improvement in these writing skills, it will be necessary to spend much more instructional time on writing and providing feedback for the individual reports. Those classes where teachers had spent considerable time on paragraph construction showed good effects in scores with the organizational rubric.

Aggregated essay scores for each group provided an excellent dependent variable for summative assessment at the group level. We are not suggesting that teachers should use this kind of assessment; we were only trying to show how the average achievement on the essay is related to what went on during the group interaction and to the kind of product the group created. The fact that individual performance in the final, individual-level analysis was affected by both quality of group product and self-assessment shows that teachers can feel confident about using individual assessment to measure the instructional outcome of group work. Individuals greatly benefit by exposure to the discourse and the creative process in groups.


It is important not to overgeneralize these findings to all forms of cooperative learning. The observed relationships hold at least under certain conditions. For example, the tasks were open-ended and challenging, demanding an exchange of ideas. Instead of the more common evaluation criteria that are generic, our results come about when criteria are specific to the academic content in each group product. There are other conditions we regard as necessary for these relationships to hold. Desirable group discourse will not happen by magic. Students must be trained in skills for harmonious and helpful discourse. If there are severe status problems within the groups, so that some persons are prevented from contributing or if one person does all the talking, then the positive results of the group work are at risk. Therefore teachers must know (as these teachers did) how to treat status problems (Cohen, 1994). Finally, tasks and assessments must be properly engineered so that the group product reflects the academic content of the task and so that the summative assessment is a good match to the tasks’ academic content.


The overwhelming emphasis on accountability in today’s educational debates has resulted in a lack of focus on the primary mission of education, namely teaching students what they need to know and do (Popham, 1999). McColskey and McMunn (2000) state that many districts institute short-term strategies to increase test scores but neglect to examine whether these strategies will enhance student motivation, learning, and development. This amplifies the dilemma teachers already face between improving test scores quickly and focusing on strategies that support high-quality learning environments in all classrooms. This study has illustrated how it is possible to have accountability at the classroom level in such a way that assessment supports the process of learning.


Abram, P. L., Scarloss, B., Holthuis, N. C., Cohen, E. G., Lotan, R. A., & Schultz, S. E. (2002). The use of evaluation criteria to improve academic discussion in cooperative groups. Asia Pacific Journal of Education, 22, 16-27.

Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.

Cohen, E. G. (1994). Designing groupwork: Strategies for heterogeneous classrooms (2nd ed.). New York: Teachers College Press.

Cohen, E. G., & Lotan, R. A. (Eds.). (1997). Working for equity in heterogeneous classrooms: Sociological theory in practice. New York: Teachers College Press.

Cossey, R. (1997). Mathematics communication: Issues of access and equity. Unpublished doctoral dissertation, Stanford University.

Dornbusch, S. M., & Scott, W. R. (1975). Evaluation and the exercise of authority. San Francisco:

Frederiksen, J. R., & Collins, A. (1989). A systems approach to educational testing. Educational Researcher, 18, 27-32.

Gardner, H. (1999). Intelligence reframed: Multiple intelligences for the 21st century. New York: Basic

McColskey, W., & McMunn, N. (2000, October). A lapse in standards: Linking standards-based reform with student achievement. Phi Delta Kappan, 115-120.

Popham, J. W. (1999). Where large scale education assessment is heading and why it shouldn’t. Educational Measurement, 18, 13-17.

Scarloss, B. (2001). Sensemaking, interaction, and learning in student groups. Unpublished doctoral dissertation, Stanford University.

Schultz, S. E. (1999). To group or not to group: Effects of groupwork on students’ declarative and procedural knowledge in science. Unpublished doctoral dissertation, Stanford University.

Schultz, S. E., Scarloss, B., Lotan, R. A., Abram, P. L., Cohen, E. G., & Holthuis, N. C. (2000, April). Let’s give ‘em somethin’ to talk about: Teacher’s talk to students in open-ended group tasks. Paper presented to the AERA Annual Meeting, New Orleans, LA.

Shepard, L. (2000). The role of assessment in a learning culture. Educational Researcher, 29, 4-14.

Solomon, P. G. (1998). The curriculum bridge: From standards to actual classroom practice. Thousand Oaks, CA: Corwin Press.

Cite This Article as: Teachers College Record Volume 104 Number 6, 2002, p. 1045-1068
https://www.tcrecord.org ID Number: 10986, Date Accessed: 12/2/2021 11:40:07 PM

About the Author
  • Elizabeth Cohen
    School of Education, Stanford University
    ELIZABETH G. COHEN is Professor Emerita of the School of Education and the Department of Sociology at Stanford University. She founded and directed the Program for Complex Instruction at Stanford. She authored Designing Groupwork: Strategies for Heterogeneous Classrooms. She is currently writing a book on educational policy concerning inequity in the schools and what might be done to ameliorate these problems. It is tentatively entitled And Never Mind the Children.
  • Rachel Lotan
    School of Education, Stanford University
    RACHEL LOTAN is associate professor (teaching) at the Stanford University School of Education and director of the Stanford Teacher Education Program. Her academic interests are teaching education, teaching and learning in heterogeneous classrooms, and the sociology of classrooms and schools. Currently, she is conducting research on how English learners in the middle schools can simultaneously acquire academic English and master central concepts in the social studies.
  • Beth Scarloss
    School of Education, Stanford University
    BETH A. SCARLOSS is a postdoctoral research associate at Stanford University, evaluating a teacher professional development program and its relationship to student academic performance.
  • Susan Schultz
    School of Education, Stanford University
    SUSAN E. SCHULTZ is a social science research associate and lecturer in the School of Education at Stanford University. Dr. Schultz's teaching and research interests focus on science education and the education of preservice as well as in-service teachers, with particular emphasis on issues of alternative assessments, cooperative learning strategies, and equity. She is currently coordinating Stanford's work on the Center for Assessment and Evaluation of Student Learning Project funded by the National Science Foundation and teaching a three-quarter Science Curriculum and Instruction course in the Stanford Teacher Education Program.
  • Percy Abram
    School of Education, Stanford University
    PERCY L. ABRAM is a doctoral candidate at Stanford University's School of Education. Mr. Abram's doctoral thesis, entitled Does Language Matter?: The Impact of Native Language Use on Academic Achievement for Second-Generation Latinos, examines the role of Spanish use on the formation of information networks with adults at school and in the community and the effects of these network relationships on students' educational expectations and achievement. Mr. Abram recently published an article in the Autumn 2001 TESOL Journal entitled "Beyond Sheltered Instruction: Rethinking Conditions for Academic Language Development."