
What Large-Scale Survey Research Tells Us About Teacher Effects on Student Achievement: Insights from the Prospects Study of Elementary Schools

by Brian Rowan, Richard Correnti & Robert Miller (2002)

This paper discusses conceptual and methodological issues that arise when educational researchers use data from large-scale survey research to examine the effects of teachers and teaching on student achievement. Using data from Prospects: The Congressionally Mandated Study of Educational Growth and Opportunity 1991–1994, we show that researchers’ use of different statistical models has led to widely varying interpretations about the overall magnitude of teacher effects on student achievement. However, we conclude that in well-specified models of academic growth, teacher effects on elementary school students’ growth in reading and mathematics achievement are substantial (with d-type effect sizes ranging from .72 to .85). We also conclude that various characteristics of teachers and their teaching account for these effects, including variation among teachers in professional preparation and content knowledge, use of teaching routines, and patterns of content coverage, with variables measuring these characteristics showing d-type effect sizes in the range of .10. The paper concludes with an assessment of the current state of the art in large-scale survey research on teaching. Here, we conclude that survey researchers must simultaneously improve their measures of instruction while paying careful attention to issues of causal inference.

This paper is about conceptual and methodological issues that arise when educational researchers use data from large-scale survey research studies to investigate teacher effects on student achievement. In the paper, we illustrate these issues by reporting on a series of analyses we conducted using data from Prospects: The Congressionally Mandated Study of Educational Opportunity. This large-scale survey research effort gathered a rich store of data on instructional processes and student achievement in a large sample of American elementary schools during the early 1990s as part of the federal government’s evaluation of the Title I program. We use data from Prospects to estimate the overall size of teacher effects on student achievement and to test some specific hypotheses about why such effects occur.
On the basis of these analyses, we draw some substantive conclusions about the magnitude and sources of teacher effects on student achievement and suggest some ways that survey-based research on teaching can be improved.¹

The paper is divided into three parts. Part I illustrates the varying analytic procedures that researchers have used to estimate the overall magnitude of teacher effects on student achievement, showing why previous research has led to conflicting conclusions. This issue has gained special salience in recent years as a result of Sanders’s (1998) claim that “differences in [the] effectiveness of individual classroom teachers . . . [are] the single largest [contextual] factor affecting the academic growth of . . . students” (p. 27, emphasis added). Sanders’s conclusion, of course, is sharply at odds with findings from an earlier generation of research, especially production function research showing that home and social background effects are more important than classroom and school effects in explaining variance in student achievement. In Part I of this paper, we discuss the conceptual and methodological foundations that underlie various claims about the magnitude of teacher effects on student achievement, and we present some empirical results that explain why different analysts have reached differing conclusions about this topic.

Part II of the paper shifts from examining the overall effects of teachers on student achievement to an analysis of why such effects occur. We review some findings from recently conducted, large-scale research on American schooling. This literature has examined a variety of hypotheses about the effects of teachers’ professional expertise, students’ curricular opportunities, and classroom interaction patterns on students’ achievement.
Decades of research suggest that each of these factors can have effects on student learning, but the research also suggests that such effects are usually small and often inconsistent across grade levels, types of pupils, and academic subjects (Brophy & Good, 1986). In Part II of this paper, we review some common hypotheses about teacher effects on student achievement and use Prospects data to empirically assess both the size and consistency of these effects.

In Part III, we review what we learned from these analyses and suggest some strategies for improving large-scale survey research on teaching. We argue that large-scale survey research has an important role to play in contemporary educational research, especially in research domains where education policy debates are framed by questions about what works and how big the effects of specific educational practices are on student achievement. But we also argue that large-scale survey research on teaching must evolve considerably before it can provide accurate information about such questions. In particular, our position is that future efforts by survey researchers should clarify the basis for claims about effect sizes; develop better measures of teachers’ knowledge, skill, and classroom activities; and take care in making causal inferences from nonexperimental data.

PART I: EXAMINING THE SIZE AND STABILITY OF TEACHER EFFECTS ON STUDENT ACHIEVEMENT

Our discussion of large-scale survey research on teaching begins with questions about how big teacher effects on student achievement are. Researchers can use a variety of analytic procedures to estimate the overall magnitude of teacher effects on student achievement, but as we demonstrate later, these alternative procedures produce markedly different conclusions about this question.
The overall purpose of this section, then, is to carefully describe the conceptual and methodological underpinnings of alternative approaches to estimating the magnitude of teacher effects on student achievement and to make clear why different approaches to this problem produce the results they do.

VARIANCE DECOMPOSITION MODELS

In educational research, the overall importance of some factor in the production of student learning is often judged by reference to the percentage of variance in student achievement accounted for by that factor in a simple variance decomposition model.² With the widespread use of hierarchical linear models, a large number of studies (from all over the world) have decomposed the variance in student achievement into components lying among schools, among classrooms within schools, and among students within classrooms. In a review of this literature, Scheerens and Bosker (1997) found that when student achievement was measured at a single point in time (and without controlling for differences among students in social background and prior achievement), about 15–20% of the variance in student achievement lies among schools, another 15–20% lies among classrooms within schools, and the remaining 60–70% of variance lies among students. Using the approach suggested by Scheerens and Bosker, these variance components can be translated into what Rosenthal (1994) calls a d-type effect size. The effect sizes for classroom-to-classroom differences in students’ achievement in the findings just cited, for example, range from .39 to .45, medium-sized effects by the conventional standards of social science research.³

Although the review by Scheerens and Bosker (1997) is a useful starting point for a discussion of the overall magnitude of teacher effects on student achievement, it does not illustrate the full range of empirical strategies that researchers have used to address this question.
As a result, we decided to analyze data from Prospects to duplicate and extend that analysis. In the following pages, we illustrate several alternative procedures for estimating the percentages of variance in students’ achievement lying among schools, among classrooms within schools, and among students within classrooms. The analyses were conducted using the approach to hierarchical linear modeling developed by Bryk and Raudenbush (1992) and were implemented using the statistical computing software HLM/3L, version 5.25 (Bryk, Raudenbush, Cheong, & Congdon, 2000).

ANALYSIS OF PROSPECTS DATA

As a first step in the analysis, we duplicated the approach to estimating teacher effects on student achievement reported by Scheerens and Bosker (1997) and just discussed. The analysis was conducted using data on two cohorts of students in the Prospects study, those going from 1st to 3rd grade over the course of the study and those going from 3rd to 6th grade. In the analyses, we simply decomposed the variance in students’ achievement at a single point, using as dependent variables students’ IRT scale scores on the CTBS reading and mathematics batteries. The analysis involves estimation of a simple, three-level, random effects model that Bryk and Raudenbush (1992) call an unconditional model, that is, a model in which there are no independent variables. For each cohort, we conducted variance decompositions at each grade level for reading and at each grade level for mathematics achievement, yielding a total of 12 separate analyses.⁴ Across the first set of analyses, we found that between 12% and 23% of the total variance in reading achievement was among classrooms and that between 18% and 28% of the total variance in mathematics achievement was among classrooms. Thus, the classroom effect sizes in these analyses ranged from about .35 to about .53 using the d-type effect size metric discussed by Scheerens and Bosker (1997).
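
The translation from variance shares to d-type effect sizes can be illustrated with a short sketch. The effect sizes quoted above are consistent with taking the square root of the share of total variance lying among classrooms; that rule is inferred here from the reported figures and should be read as an assumption, and the function name `d_type_effect_size` is hypothetical.

```python
import math


def d_type_effect_size(variance_share: float) -> float:
    """Return the square root of a variance share (a proportion in [0, 1]).

    Assumption: the d-type metric equals the classroom-level standard
    deviation divided by the total standard deviation, which is the
    square root of the classroom share of total variance.
    """
    if not 0.0 <= variance_share <= 1.0:
        raise ValueError("variance_share must be a proportion between 0 and 1")
    return math.sqrt(variance_share)


# Classroom variance shares quoted in the text (as proportions).
for share in (0.12, 0.15, 0.20, 0.28):
    print(f"{share:.0%} of variance -> d = {d_type_effect_size(share):.2f}")
```

Under this assumption, shares of 15% and 20% give roughly .39 and .45, and shares of 12% and 28% give roughly .35 and .53, matching the ranges reported above.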
Although these results duplicate those reported by Scheerens and Bosker (1997), they are not very good estimates of teacher effects on student achievement. One problem is that the analyses look at students’ achievement status, that is, achievement scores at a single point in time. However, students’ achievement status results not only from the experiences students had in particular classrooms during the year of testing but also from all previous experiences students had, both in and out of school, prior to the point at which their achievement was assessed. As a result, most analysts would rather not estimate the effect of teachers on cumulative measures of achievement status, preferring instead to estimate the effect teachers have on changes in students’ achievement during the time when students are in teachers’ classrooms.

A second problem with the estimates just cited is that they come from a fully unconditional model, that is, a model that does not control for the potentially confounding effects of students’ socioeconomic status and prior achievement on classroom-to-classroom differences in achievement. In the analyses just cited, for example, at least some of the classroom-to-classroom differences in students’ achievement status resulted not only from some teacher effect but also from differences in the socioeconomic background and prior achievement of the students in different classrooms. Most analysts are unwilling to attribute compositional effects on achievement to teachers, and they therefore estimate teacher effects on student achievement only after controlling for such effects in their models.

These clarifications have led to the development of what researchers call value-added analyses of teacher effects. Value-added models have two key features. First, the dependent variables in the analysis are designed to measure the amount of change that occurs in students’ achievement during the year when students are in the classrooms under study.
Second, measures of change are adjusted for differences across classrooms in students’ prior achievement, home and social background, and the social composition of the schools students attended. The purpose of value-added models is to estimate the proportions of variance in changes in student achievement lying among classrooms, after controlling for the effects of other, confounding variables.⁵

To see whether value-added models give different results than those previously discussed, we conducted further analyses using Prospects data. In these analyses, we used two of the most common empirical approaches to value-added estimates of teacher effects on student achievement. The first approach is often called a covariate adjustment model. Here, students’ achievement status in a given year is adjusted for students’ prior achievement, home and social background, and the social composition of schools, and the variance in students’ adjusted achievement status is decomposed into school, classroom, and student components using the same three-level hierarchical linear model as before.⁶ Using this approach with Prospects data, we found that roughly 4% to 16% of the variance in students’ adjusted reading achievement lay among classrooms (depending on the grade level in the analysis) and that roughly 8% to 18% of the variance in adjusted mathematics achievement lay among classrooms (depending on the grade at which the analysis was conducted). In the covariate adjustment models, then, the d-type effect sizes for classrooms ranged between .21 and .42, depending on the grade level and subject under study, somewhat smaller than the effect sizes in the fully unconditional models.⁷

A second approach to value-added analysis uses students’ annual gains in achievement as the criterion variable.
In this approach, students’ gain scores for a given year become the dependent variable in the analysis, where these gains are once again adjusted through regression analysis for the potential effects of students’ socioeconomic status, family background, prior achievement, and school composition (using variables discussed in footnote 4). Using this approach with Prospects data, we found that somewhere between 3% and 10% of the variance in adjusted gains in students’ reading achievement lay among classrooms (depending on the grade being analyzed), and somewhere between 6% and 13% of the variance in adjusted gains in mathematics lay among classrooms. The corresponding d-type effect sizes in these analyses therefore range from .16 to .36.

PROBLEMS WITH CONVENTIONAL ANALYSES

Neither of the value-added analyses just discussed indicates that classroom effects on student achievement are large. But each suffers from important interpretive and methodological problems warranting more discussion. Consider, first, some problems with covariate adjustment models. Several analysts have demonstrated that covariate adjustment models do not really model changes in student achievement (Rogosa, 1995; Stoolmiller & Bank, 1995). Instead, such analyses are simply modeling students’ achievement status, which in a value-added framework has been adjusted for students’ social background and prior achievement. When viewed in this way, it is not surprising to find that teacher effects are relatively small in covariate adjustment models. Such models, in fact, are assessing teacher effects on achievement status, not change.

If one really wants to assess the size of teacher effects on changes in student achievement, models of annual gains in achievement are preferable. As Rogosa (1995) demonstrates, annual gains in achievement are unbiased estimates of students’ “true” rates of achievement growth and are therefore preferable to covariate adjustment models in the analysis of change.
However, simple gain scores suffer from an important methodological problem that researchers need to guard against. As Rogosa demonstrates, when there is little variance among students in true rates of academic growth, annual gains in achievement provide very unreliable measures of underlying differences among students in rates of change. In addition, in variance decomposition models using gain scores, measurement error due to unreliability in the gain scores will be reflected in student-level variance components, increasing the denominator in effect size formulas and thus reducing teacher effect size coefficients. In fact, as we discuss later, this problem is present in the Prospects data, where differences among students in true rates of academic growth are relatively small and problems of unreliability loom large. For this reason, the effect sizes derived from the gain score models discussed in this paper are almost certainly underestimates of the overall effects that classrooms have on growth in students’ achievement.

IMPROVING ESTIMATES OF TEACHER EFFECTS

What can researchers do in light of the problems just noted? One obvious solution is to avoid the covariate adjustment and gains models used in previous research and to instead use statistical models that directly estimate students’ individual growth curves (Rogosa, 1995). In current research, the statistical techniques developed by Bryk and Raudenbush (1992), as implemented in the statistical computing package HLM/3L (Bryk et al., 2000), are frequently used for this purpose. For example, the HLM/3L statistical package can be used to estimate students’ growth curves directly if there are at least three data points on achievement for most students in the data set.
However, at the current time, this computing package cannot be used to estimate the percentages of variance in rates of achievement growth lying among classrooms within schools over time because, as Raudenbush (1995) demonstrated, estimation of these variance components within a growth modeling framework requires development of a “cross-classified” random effects model.⁸ Fortunately, the computer software needed to estimate cross-classified random effects models within the framework of the existing HLM statistical package is now under development, and we have begun working with Raudenbush to estimate such models using this computing package. A detailed discussion of the statistical approach involved here is beyond the scope of this paper, but suffice it to say that it is an improvement over the simple gains models discussed earlier, especially because the cross-classified random effects model allows us to estimate the random effects of classrooms on student achievement within an explicit growth modeling framework.⁹

For this paper, we developed a three-level, cross-classified, random effects model to analyze data on the two cohorts of students in the Prospects data set discussed earlier. In these analyses, we decomposed the variance in students’ growth in achievement (in mathematics and reading) into variance lying among schools, among students within schools, within students across time, and among students within classrooms.¹⁰

Two important findings have emerged from the analyses of these cross-classified random effects models. One is that only a small percentage of variance in rates of achievement growth lies among students. In cross-classified random effects models that include all of the control variables listed in footnote 4, for example, about 27–28% of the reliable variance in reading growth lies among students (depending on the cohort), with about 13–19% of the reliable variance in mathematics growth lying among students.
An important implication of these findings is that the true score differences among students in academic growth are quite small, raising questions about the reliability of the gain scores used in the analysis of Prospects data discussed previously.

More important for our purposes is a second finding. The cross-classified random effects models produce very different estimates of the overall magnitude of teacher effects on growth in student achievement than do simple gain score models. For example, in the cross-classified random effects analyses, we found that after controlling for student background variables, the classrooms to which students were assigned in a given year accounted for roughly 60–61% of the reliable variance in students’ rates of academic growth in reading achievement (depending on the cohort), and 52–72% of the reliable variance in students’ rates of academic growth in mathematics achievement. This yields d-type effect sizes ranging from .77 to .78 for reading growth (roughly two to three times what we found using a simple gains model) and d-type effect sizes ranging from .72 to .85 for mathematics growth (again, roughly two to three times what we found using a simple gains model).¹¹ The analysis also showed that school effects on achievement growth were substantial in these models (d = .55 for reading and d = .53 for mathematics).¹²

It should be noted that these effects are not only statistically significant but also substantively important. One way to demonstrate this involves calculating the number of months of growth that would be expected in a calendar year for students who differed by 1 SD in their individual rates of academic growth, or who were similar in individual growth rates but were assigned to classrooms, schools, or both, that were 1 SD apart in instructional effectiveness.
For example, holding constant students’ classroom and school assignments, the cross-classified random effects models we estimated previously suggest that two students who differed by a standard deviation in their linear rates of growth in mathematics across Grades 1 through 3 would experience a difference of 1.28 months in academic growth across a calendar year. Meanwhile, holding other conditions constant, a school 1 SD above another in instructional effectiveness would produce about 1.60 months more academic growth for a student in mathematics during a calendar year, and a classroom 1 SD higher than another would produce about 2.13 months of added mathematics growth for a student during a calendar year. It should be pointed out that similar teacher effects are found for both reading and mathematics in all cohorts, suggesting that this example generalizes to both mathematics and reading at all grade levels being analyzed here.

THE CONSISTENCY OF CLASSROOM EFFECTS ACROSS DIFFERENT ACADEMIC SUBJECTS AND PUPIL GROUPS

The analyses just reported suggest that the classrooms to which students are assigned in a given year can have nontrivial effects on students’ achievement growth in a calendar year. But this does not exhaust the questions we can ask about such effects. An additional set of questions concerns the consistency of these effects, for example, across different subjects (i.e., reading and mathematics), for different groups of pupils, or both. We have been unable to find a great deal of prior research on these questions, although Brophy and Good’s (1986) seminal review of process-product research on teaching did discuss a few studies in this area. For example, Brophy and Good cite a single study showing a correlation of .70 for adjusted, classroom-level gains across tests of word knowledge, word discrimination, reading, and mathematics.
They also cite correlations ranging from around .20 to .40 in the adjusted gains produced by the same teacher across years, suggesting that the effectiveness of a given teacher can vary across different groups of pupils. Both kinds of findings, it is worth noting, are comparable to findings on the consistency of school effects across subjects and pupil groups (see Scheerens & Bosker, 1997).

Given the sparseness of prior research on these topics, we turned to Prospects once again for relevant insights. To assess whether classrooms had consistent effects on students’ achievement across different academic subjects, we simply correlated the residuals from the value-added gains models for each classroom.¹³ Recall that these residuals are nothing more than the deviations in actual classroom gains from the gains predicted for a classroom after adjusting for the student- and school-level variables in our models. In the analyses, we found only a moderate degree of consistency in classroom effects across reading and mathematics achievement, with correlations ranging from .30 to .47, depending on the grade level of the classrooms under study.¹⁴ The results therefore suggest that a given teacher varies in effectiveness when teaching different academic subjects. In Prospects data, there was slightly less variation in teacher effects across academic subjects at later grades, but this could be a cohort effect because different groups of pupils are in the samples in earlier and later grades.

A second question we investigated was whether classrooms had consistent effects on students from different social backgrounds. To investigate this issue, we changed the specification of the value-added regression models discussed previously. In previous analyses, we were assuming that the effects of student-level variables on annual gains in achievement were the same in all classrooms.
In this phase of the analysis, we allowed the effects of student socioeconomic status (SES), gender, and minority status on achievement gains to vary randomly across classrooms. Because the data set contains relatively few students per classroom, we decided to estimate models in which the effect of only one of these independent variables was allowed to vary randomly in any given regression analysis.

Overall, the analyses showed that background variables had different effects on annual gains in achievement across classrooms, with these random effects being larger in lower grades (especially in reading) than at upper grades. Thus, in the Prospects study, students from different social backgrounds apparently did not perform equally well across classrooms within the same school. Moreover, when the variance components for these additional random effects were added to the variance components for the random effects of classrooms, the overall effects of classrooms on gains in student achievement became larger. In early grades reading, for example, the addition of random effects for background variables approximately doubles the variance in achievement gains accounted for by classrooms (the increase is much less, however, for early grades mathematics and also less for upper grades mathematics and reading). For example, in a simple gains model where only the main effects of classrooms are treated as random, the d-type effect size was .26. When we also allowed background effects to vary across classrooms, however, the d-type effect sizes became .36 when the male effect was treated as random, .26 when the SES effect was allowed to vary, and .38 when the minority effect was allowed to vary.

STUDENT PATHWAYS THROUGH CLASSROOMS

A third issue we examined was the consistency of classroom effects for a given student across years.
We have seen that in any given year, students are deflected upward or downward from their expected growth trajectory by virtue of the classrooms to which they are assigned. This occurs, of course, because some classrooms are more effective at producing academic growth for students, with the d-type effect size for annual deflections being around .72 to .85 in cross-classified random effects models (and around .16 to .36 when measured in terms of annual gains in achievement). In any given year, such effects may not seem especially sizeable. But if some students were consistently deflected upward as a result of their classroom assignments during elementary school, while other students were consistently deflected downward, the cumulative effects of classroom placements on academic growth could be quite sizeable, producing substantial inequality in student achievement in elementary schools.

Currently, we know very little about this process in American elementary schools. Instead, the most important evidence comes from Kerckhoff’s (1983) seminal study of schools in Great Britain. Kerckhoff tallied the accumulated deflections to expected academic growth for students as they passed through British schools and found that the accumulation of consistently positive or negative deflections was much greater in British secondary schools than in primary schools. A similar process might be occurring in the United States, where elementary schools have a common curriculum, classrooms tend to be heterogeneous in terms of academic and social composition, and tracking is not a part of the institutional landscape. Because this is the case, elementary schools do not appear to be explicitly designed to produce academic differentiation. As a result, we might expect the accumulation of classroom effects on student achievement to be fairly equal over the course of students’ careers in elementary schools.
To get a sense of this issue, we analyzed the classroom-level empirical Bayes (EB) residuals from the cross-classified growth models estimated above. Recall that these models control for a large number of student and school variables. In the analysis, we first calculated the classroom residuals for each student at each time point. We then correlated these residuals at the student level across time points. In the analysis, a positive correlation of residuals would indicate that students who experienced positive deflections in one year also experienced positive deflections in the following year, suggesting that classroom placements in elementary schools worked to the consistent advantage of some students and the consistent disadvantage of others. What we found in the Prospects data, however, was that deflections were inconsistently correlated across successive years, sometimes being positive, sometimes being negative, and ranging from −.30 to +.18. Overall, this pattern suggests that, on average, within a given school, a student would be expected to accumulate no real learning advantage by virtue of successive classroom placements.

Note, however, that these data do not show that students never accumulate successively positive (or negative) deflections as a result of their classroom placements. In fact, some students do experience consistent patterns. But in these data, such patterns should be exceedingly rare. For example, assuming that classroom effects are uncorrelated over time, we would expect about 3% of students to experience positive deflections 1 SD or more above their expected gain for 2 years in a row, and less than 1% to receive such positive deflections 3 years in a row. Another 3% of students in a school would receive 2 straight years of negative deflections of this magnitude, with less than 1% receiving three straight negative deflections.
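
The expected frequencies just cited can be approximated with a short calculation. The sketch below assumes that classroom deflections are normally distributed and independent across years; normality is an assumption added here for illustration (the text states only that effects are taken to be uncorrelated over time), and the function names are hypothetical.

```python
import math


def upper_tail(z: float) -> float:
    """P(Z >= z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_consecutive(z: float, years: int) -> float:
    """Probability of a deflection of at least z SDs in each of `years`
    consecutive years, assuming independence across years."""
    return upper_tail(z) ** years


for years in (2, 3):
    p = prob_consecutive(1.0, years)
    print(f"{years} consecutive years at +1 SD or more: {p:.2%}")
```

Under these assumptions the two-year figure is about 2.5% and the three-year figure about 0.4%, in line with the "about 3%" and "less than 1%" cited above. By symmetry, the same values apply to consecutive negative deflections.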
Obviously, students who experience consistently positive or negative deflections will end up with markedly different cumulative gains in achievement over the years (Sanders, 1998). But the data analyzed here suggest that such differences arise almost entirely by chance, not from a systematic pattern of academic differentiation through successively advantaging or disadvantaging classroom placements.

The following results further illustrate this point. Using the EB residuals just discussed, we classified students according to whether (in a given year) they were in classrooms that were 1 SD above the mean in effects on achievement growth, 1 SD below the mean, or somewhere in between. Overall, when data on both cohorts and for both academic subjects are combined, we found that 3.4% of the students were in classrooms 1 SD above the mean in 2 consecutive years, and 2.4% of the students were in classrooms 1 SD below the mean for 2 consecutive years. Across 3 years, .45% of students were in classrooms 1 SD above the mean for 3 consecutive years, and .32% were in classrooms 1 SD below the mean for 3 consecutive years. To be sure, students accumulated different classroom deflections to growth over time, and this produced inequalities in achievement among students. But the pattern of accumulation here appears quite random and not at all the result of some systematic process of social or academic differentiation.

SUMMARY OF PART I

What do the findings just discussed suggest about the overall size and stability of teacher effects on student achievement? On the basis of the analyses reported here, it seems clear that assertions about the magnitude of teacher effects on student achievement depend to a considerable extent on the methods used to estimate these effects and on how the findings are interpreted.
With respect to issues of interpretation, it is not surprising that teacher effects on students’ achievement status are small in variance decomposition models, even in the earliest elementary grades. After all, status measures reflect students’ cumulative learning over many years, and teachers have students in their classrooms only for a single year. In this light, the classroom effects on students’ achievement status found in Prospects data might be seen as surprisingly large. In elementary schools, Prospects data suggest that after controlling for student background and prior achievement, the classrooms to which students are assigned account for somewhere between 4% and 18% of the variance in students’ cumulative achievement status in a given year, which translates into a d-type effect size of .21 to .42. As we have seen, however, most analysts do not want to analyze teacher effects on achievement status, preferring instead to examine teacher effects on students’ academic growth. Here, the use of gain scores as a criterion variable is common. But analyses based on gain scores are problematic. Although annual gains provide researchers with unbiased estimates of true rates of change in students’ achievement, they can be especially unreliable when true differences among students in academic growth are small. In fact, this was the case in Prospects data, and the resulting unreliability in achievement gains probably explains why we obtained such small effect size coefficients when we used gain scores to estimate teacher effects. Recall that in these analyses, only 3% to 13% of the variance in students’ annual achievement gains was found to lie among classrooms.
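The link between a variance share and a d-type effect size can be illustrated with a small calculation. The sketch assumes that the d-type effect size is the square root of the proportion of variance lying among classrooms, that is, the standard deviation of classroom effects expressed in units of the outcome's total standard deviation. That definition is our reading rather than something restated in this section, and the function name is hypothetical.

```python
import math


def d_type_effect_size(variance_share: float) -> float:
    """Convert the share of variance among classrooms into a d-type effect size.

    Assumption: d is the SD of classroom effects divided by the total SD of
    the outcome, which equals the square root of the variance share.
    """
    if not 0.0 <= variance_share <= 1.0:
        raise ValueError("variance_share must be a proportion between 0 and 1")
    return math.sqrt(variance_share)


# Variance shares reported for achievement status (4% to 18%).
status_low = d_type_effect_size(0.04)
status_high = d_type_effect_size(0.18)

print(f"status: d from {status_low:.2f} to {status_high:.2f}")
```

Under this assumption the status figures come out close to the reported range of .21 to .42; the small gap at the lower end would be expected if the published percentages were rounded.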
One clear implication of these analyses is that researchers need to move beyond the use of both covariate adjustment models (which estimate effects on students’ adjusted achievement status) and annual gains models if they want to estimate the overall magnitude of teacher effects on growth in student achievement. A promising strategy here is to use a cross-classified random effects model, as Raudenbush (1995) and Raudenbush and Bryk (2002) discuss. The preliminary analysis of Prospects data reported here suggests that cross-classified random effects models will lead to findings of larger d-type teacher effects. For example, in the cross-classified random effects analysis discussed in this paper, we reported d-type effect sizes of .77–.78 for teacher effects on students’ growth in reading achievement, and d-type effect sizes of .72–.85 for teacher effects on students’ growth in mathematics achievement. These are roughly three times the effect size found in other analyses. In this paper, we also presented findings on the consistency of teacher effects across academic subjects and groups of pupils. Using a gains model, we found that the same classroom was not consistently effective across different academic subjects or for students from different social backgrounds. We also used a cross-classified random effects model to demonstrate that cumulative differences in achievement among students resulting from successive placements in classrooms could easily have resulted from successive chance placements in more and less effective classrooms. This latter finding suggests that elementary schools operate quite equitably in the face of varying teacher effectiveness, allocating pupils to more and less effective teachers on what seems to be a chance rather than a systematic basis. Although the equity of this system of pupil allocation to classrooms might be comforting to some, the existence of classroom-to-classroom differences in instructional effectiveness should not be.
As a direct result of teacher-to-teacher differences in instructional effectiveness, some students make less academic progress than they would otherwise be expected to make simply by virtue of chance placements in ineffective classrooms. All of this suggests that the important problem for American education is not simply to demonstrate that differences in effectiveness exist among teachers but rather to explain why these differences occur and to improve teaching effectiveness broadly.

PART II: WHAT ACCOUNTS FOR CLASSROOM-TO-CLASSROOM DIFFERENCES IN ACHIEVEMENT

To this point, we have been reviewing evidence on the overall size of teacher effects on student achievement. But these estimates, although informative about how the educational system works, do not provide any evidence about why some teachers are more instructionally effective than others. To explain this phenomenon, we need to inquire about the properties of teachers and their teaching that produce effects on students’ growth in achievement. In this section, we organize a discussion of this problem around Dunkin and Biddle’s (1974) well-known scheme for classifying types of variables in research on teaching. Dunkin and Biddle were working within the process-product paradigm and discussed four types of variables of relevance to research on teaching. Product variables were defined as the possible outcomes of teaching, including student achievement. Process variables were defined as properties of the interactive phase of instruction, that is, the phase of instruction during which students and teachers interact around academic content. Presage variables were defined as properties of teachers that can be assumed to operate prior to, but also to have an influence on, the interactive phase of teaching. Finally, context variables were defined as variables that can exercise direct effects on instructional outcomes, condition the effects of process variables on product variables, or both.
PRESAGE VARIABLES

The process-product paradigm discussed by Dunkin and Biddle (1974) arose partly in response to a perceived overemphasis on presage variables in early research on teaching. Among the presage variables studied in such work were teachers’ appearance, enthusiasm, intelligence, and leadership, the so-called trait theories of effective teaching (Brophy & Good, 1986). Most of these trait theories are no longer of interest in research on teaching, but researchers have shown a renewed interest in other presage variables in recent years. In particular, researchers increasingly argue that teaching is a form of expert work that requires extensive professional preparation, strong subject-matter knowledge, and a variety of pedagogical skills, all of which are drawn upon in the complex and dynamic environment of classrooms (for a review of conceptions of teachers’ work in research on teaching, see Rowan, 1999). This view of teaching has encouraged researchers once again to investigate the effects of presage variables on student achievement. In large-scale survey research, teaching expertise is often measured by reference to teachers’ educational backgrounds, credentials, and experience. This is especially true in the so-called production function research conducted by economists. Because employment practices in American education entail heavy reliance on credentials, with more highly educated teachers, those with more specialized credentials, or those with more years of experience gaining higher pay, economists have been especially interested in assessing whether teachers with different educational backgrounds perform differently in the classroom. In this research, teachers’ credentials are seen as “proxies” for the actual knowledge and expertise of teachers, under the assumption that teachers’ degrees, certification, or experience index the instructionally relevant knowledge that teachers bring to bear in classrooms.
In fact, research on presage variables of this sort has a long history in large-scale studies of schooling. Decades of research have shown, for example, that there is no difference in adjusted gains in student achievement across classes taught by teachers with a master’s or other advanced degree in education compared with student achievement in classes taught by teachers who lack such degrees. However, when large-scale research has focused in greater detail on the academic majors of teachers, the courses teachers have taken, or both, results have been more positive. For example, several large-scale studies (reviewed in Brewer & Goldhaber, 2000; Rowan, Chiang, & Miller, 1997) have tried to assess the effect of teachers’ subject-matter knowledge on student achievement by examining differences in student outcomes for teachers with different academic majors. In general, these studies have been conducted in high schools and have shown that in classes where teachers have an academic major in the subject area being tested, students have higher adjusted achievement gains. In the NELS:88 data, for example, the r-type effect sizes for these variables were .05 for science gains and .01 for math gains. [15] Other research suggests an extension of these findings, however. At least two studies, using different data sets, suggest that the gains to productivity coming from increases in high school teachers’ subject-matter course work occur mostly when advanced material is being taught (see, e.g., Chiang, 1996; Monk, 1994). [16] Fewer production function studies have used teachers’ professional preparation as a means of indexing teachers’ pedagogical knowledge, although a study by Monk (1994) is noteworthy in this regard. In Monk’s study, the number of classes in subject-matter pedagogy taken by teachers during their college years was found to have positive effects on high school students’ adjusted achievement gains.
Darling-Hammond, Wise, and Klein (1995) cite additional, small-scale studies supporting this conclusion.

ANALYSES OF PRESAGE VARIABLES

As a follow-up to this research, we examined the effects of teachers’ professional credentials (and experience) on student achievement using Prospects data. In these analyses, we developed a longitudinal data set for two cohorts of students in the Prospects study: students passing from Grades 1 to 3 over the course of the study and students passing from Grades 3 to 6. Using these data, we estimated an explicit model of students’ growth in academic achievement using the statistical methods described in Bryk and Raudenbush (1992) and the computing software HLM/3L, version 5.25 (Raudenbush, Bryk, Cheong, & Congdon, 2000). Separate growth models were estimated for each cohort of students and for each academic subject (reading and mathematics). Thus, the analyses estimated four distinct growth models: (a) a model for growth in reading achievement in Grades 1–3; (b) a model for growth in mathematics achievement in Grades 1–3; (c) a model for growth in reading achievement in Grades 3–6; and (d) a model for growth in mathematics achievement in Grades 3–6. In all of these analyses, achievement was measured by the IRT scale scores provided by the test publisher. The reader will recall that these are equal-interval scores (by assumption), allowing researchers to directly model growth across grades using an equal-interval metric. In all analyses, students’ growth in achievement was modeled in quadratic form, although the effect of this quadratic term was fixed. In the early grades cohort, the results showed that students’ growth in both reading and mathematics was steep in initial periods but decelerated over time. In the upper grades, academic growth in reading was linear, and growth in mathematics achievement accelerated at the last point in the time series.
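For readers who want the growth model in symbols, a generic quadratic growth specification of the kind described above can be written as follows. This is a sketch only, in notation of the sort used for three-level growth models; it is not the exact specification estimated in these analyses. The symbols are our assumptions: $Y_{tij}$ is the scale score at occasion $t$ for student $i$ in school $j$, $a_{tij}$ is a time metric, and student and school covariates are omitted. Reading "the quadratic term was fixed" as "no random variation in the quadratic coefficient" is likewise an interpretation.

```latex
% Level 1 (occasions within students): quadratic growth
Y_{tij} = \pi_{0ij} + \pi_{1ij}\, a_{tij} + \pi_{2ij}\, a_{tij}^{2} + e_{tij}

% Level 2 (students within schools): random status and linear growth,
% quadratic coefficient without a student-level random effect
\pi_{0ij} = \beta_{00j} + r_{0ij}, \qquad
\pi_{1ij} = \beta_{10j} + r_{1ij}, \qquad
\pi_{2ij} = \beta_{20j}

% Level 3 (schools): quadratic coefficient fixed across schools
\beta_{00j} = \gamma_{000} + u_{00j}, \qquad
\beta_{10j} = \gamma_{100} + u_{10j}, \qquad
\beta_{20j} = \gamma_{200}
```

In this form, growth that is steep at first and then decelerates corresponds to $\gamma_{100} > 0$ with $\gamma_{200} < 0$, whereas linear growth corresponds to $\gamma_{200} \approx 0$.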
Average growth rates for both reading and mathematics were much lower in the upper grades than in the lower grades. In all of the models, we estimated the effects of home and social background on both achievement status and achievement growth, where the variables included (a) gender; (b) SES; (c) minority status; (d) number of siblings; (e) family marital status; and (f) parental expectations for a student’s educational attainment. In general, these variables had very large effects on students’ achievement status but virtually no effects on growth in achievement. We also controlled for school composition and location in these analyses, where the social composition of schools was indexed by the percentage of students in a school eligible for the federal free lunch program and where location was indexed by whether or not a school was in an urban location. Here too, the school-level variables had large effects on intercepts but not on growth. All of these results are important, suggesting that when the analysis shifts from a concern with students’ achievement status to a concern with students’ growth in achievement, home and social background, as well as school composition and location, become relatively insignificant predictors of academic development. In our analysis of presage variables using Prospects data, we focused on three independent variables measuring teachers’ professional background and experience. One was a measure of whether or not a teacher had special certification to teach reading or mathematics. The second was a measure of whether or not a teacher had a bachelor’s or master’s degree in English (when reading achievement was the dependent variable) or in mathematics (when mathematics was tested). Third, we reasoned that teacher experience could serve as a proxy for teachers’ professional knowledge, under the assumption that teachers learn from experience about how to represent and teach subject-matter knowledge to students.
The reader is cautioned that very few teachers in the Prospects sample (around 6%) had special certification, subject-matter degrees, or both. For this and other reasons we used the robust standard errors in the HLM statistical package to assess the statistical significance of the effects of these variables on growth in student achievement. The analyses were conducted using a three-level hierarchical linear model of students’ growth in academic achievement, where classroom variables are included at level one of the model as time-varying covariates. [17] The results of these analyses were reasonably consistent across cohorts in the Prospects data but differed by academic subject. In reading, neither teachers’ degree status nor teachers’ certification status had statistically significant effects on growth in students’ achievement, although we again caution the reader about the small number of teachers in this sample who had subject-matter degrees or special certification. In reading, however, teacher experience was a statistically significant predictor of growth in students’ achievement, the d-type effect size being d = .07 for early grades reading and d = .15 for later grades reading. [18] In mathematics, the results were different and puzzling. Across both cohorts of students, there were no effects of teachers’ mathematics certification on growth in student achievement. There was a positive effect of teachers’ experience on growth in mathematics achievement but only for the later grades cohort (d = .18). [19] Finally, in mathematics and for both cohorts, students who were taught by a teacher with an advanced degree in mathematics did worse than those who were taught by a teacher not having a mathematics degree (d = -.25). [20] It is difficult to know how to interpret the negative effects of teachers’ mathematics degree attainment on students’ growth in mathematics achievement.
On one hand, the negative effects could reflect selection bias (see also footnote 13, where this is discussed in the context of high school data). In elementary schools, for example, we might expect selection to negatively bias estimated teacher effectiveness, especially if teachers with more specialized training work in special education, compensatory classroom settings, or both. In a subsidiary analysis, we respecified the regression models to control for this possibility (by including measures of students’ special education, compensatory education, or gifted and talented classification), but the effects remained unchanged. The other possibility is that this is a real effect and that advanced academic preparation is actually negatively related to students’ growth in achievement in elementary schools. Such an interpretation makes sense only if one assumes that advanced academic training somehow interferes with effective teaching, either because it substitutes for pedagogical training in people’s professional preparation or because it produces teachers who somehow cannot simplify and clarify their advanced understanding of mathematics for elementary school students.

DISCUSSION

What is interesting about production function studies involving presage variables is how disconnected they are from mainstream research on teaching. Increasingly, discussions of teachers’ expertise in mainstream research on teaching have gone well beyond a concern with proxy variables that might (or might not) index teachers’ expertise. Instead, researchers are now trying to formulate more explicit models of what teaching expertise looks like. In recent years, especially, discussions of expertise in teaching often have been framed in terms of Shulman’s (1986) influential ideas about pedagogical content knowledge. Different analysts have emphasized different dimensions of this construct, but most agree that there are several dimensions involved.
One is teachers’ knowledge of the content being taught. At the same time, teaching also is seen to require knowledge of how to represent that content to different kinds of students in ways that produce learning. That, in turn, requires teachers to have a sound knowledge of the typical ways students understand particular topics or concepts within the curriculum and of the alternative instructional moves that can produce new understandings in light of previous ones. None of this would seem to be well measured by the usual proxies used in production function studies, and as a result many researchers have moved toward implementing more direct measures of teachers’ expertise. To date, most research of this sort has been qualitative and done with small samples of teachers. A major goal has been to describe in some detail the pedagogical content knowledge of teachers, often by comparing the knowledge of experts and novices. Such work aims to clarify and extend Shulman’s (1986) original construct. One frustrating aspect of this research, however, is that it has been conducted in relative isolation from large-scale survey research on teaching, especially the long line of production function studies just discussed. Thus, it remains to be seen if more direct measures of teachers’ knowledge will be related to students’ academic performance. It is worth noting that prior research has found positive effects of at least some direct measures of teachers’ knowledge on student achievement. For example, large-scale research dating to the Coleman report (Coleman et al., 1966) suggests that verbal ability and other forms of content knowledge are significantly correlated with students’ achievement scores, as the meta-analysis reported in Greenwald, Hedges, and Laine (1996) shows. This is complemented by more recent work showing that teachers’ scores on teacher certification tests and college entrance exams also affect student achievement (for a review, see Ferguson & Brown, 2000).
It should be noted, however, that Shulman’s (1986) original conception of pedagogical content knowledge was intended to measure something other than the “pure” content knowledge measured in the tests just noted. As Shulman pointed out, it would be possible to know a subject well but lack the knowledge to translate this kind of knowledge into effective instruction for students. Given the presumed centrality of teachers’ pedagogical expertise to teaching effectiveness, a logical next step in large-scale survey research is to develop direct measures of teachers’ pedagogical and content knowledge and to estimate the effects of these measures on growth in students’ achievement. In fact, along with colleagues, we are currently taking steps in this direction. [21] Our efforts originated in two lines of work. The first was the Teacher Education and Learning to Teach (TELT) study conducted at Michigan State University. The researchers who conducted this study developed a survey battery explicitly designed to assess teachers’ pedagogical content knowledge in two areas, mathematics and writing (Kennedy, Ball, & McDiarmid, 1993). Within each of these curricular areas, a battery of survey items was designed to assess two dimensions of teachers’ pedagogical content knowledge: teachers’ knowledge of subject matter and teachers’ knowledge of effective teaching practices in a given content area. As reported in Deng (1995), the attempt to construct these measures was more successful in the area of mathematics than in writing and more successful in measures of content knowledge than pedagogical knowledge. An interesting offshoot of this work is that one of the items originally included as a measure of pedagogical content knowledge in the TELT study was also included in the NELS:88 teacher questionnaire. As a result, we decided to investigate the association between this item and student achievement in the NELS:88 data on 10th-grade math achievement. As reported in Rowan et al.
(1997), we found that in a well-specified covariate adjustment model for student achievement, the item included in the NELS:88 teacher questionnaire had a statistically significant effect on student achievement. In this analysis, a student whose teacher provided a correct answer to this single item scored .02 SD higher on the NELS:88 mathematics achievement test than did a student whose teacher did not answer the item correctly. The corresponding r-type effect size for this finding is r = .03, and R^{2} = .0009. [22] Although the effect sizes in the NELS:88 analysis are tiny, the measurement problems associated with an ad hoc, one-item scale measuring teachers’ content knowledge are obvious. Moreover, the effect of this ad hoc measure of teachers’ knowledge was assessed in Rowan et al.’s (1997) analysis by reference to a covariate adjustment model of students’ 10th-grade achievement status. As a result, one should not expect large effects from such an analysis. For this reason, our colleagues are now developing an extensive battery of survey items to directly assess teachers’ pedagogical content knowledge in the context of elementary schooling. Our development work to date is promising. For example, we have found that we can construct highly reliable measures of teachers’ pedagogical content knowledge within fairly narrow domains of the school curriculum using as few as six to eight survey items. Our goal in the future is to estimate the effects of these measures on growth in students’ achievement in our own study of school improvement interventions. [23]

TEACHING PROCESS VARIABLES

Although presage variables of the sort just discussed, if well measured, hold promise for explaining differences in teacher effectiveness, quantitative research on teaching for many years has focused more attention on process-product relationships than on presage-product relationships.
In this section of the paper, we discuss prior research on the effects of teaching process variables on student achievement and describe how we examined such effects using Prospects data.

Time on Task/Active Teaching

One aspect of instructional process that has received a great deal of attention in research on teaching is time on task. A sensible view of this construct, based on much previous process-product research, would refer not so much to the amounts of time allocated to learning a particular subject, which has virtually no effect on achievement, nor even to the amount of time in which students are actively engaged in instruction, because high-inference measures of student engagement during class time also have only very weak effects on achievement (Karweit, 1985). Rather, process-product research suggests that the relevant causal agent producing student learning is how teachers use instructional time. Brophy and Good’s (1986) review of process-product research on teaching suggests that effective use of time involves active teaching. In their view, active teaching occurs when teachers spend more time in almost any format that directly instructs students, including lecturing, demonstrating, leading recitations and discussions, and frequently interacting with students during seatwork assignments. This kind of teaching contrasts with a teaching style in which students frequently work independently on academic tasks, are engaged in nonacademic work, or both. Active teaching also involves good classroom management skills, for example, the presence of clear rules for behavior with consistent enforcement, close and accurate monitoring of student behavior, and the quick handling of disruptions and transitions across activities. There are several interesting points about these findings. The most important is that the concept of active teaching is generic.
That is, research shows that active teaching looks much the same across academic subjects and positively affects student achievement across a range of grade levels and subjects. At the same time, the concept does not imply that a particular instructional format (e.g., lecture and demonstration, recitation, or other forms of guided discussion) is generally more effective than another across academic subjects and grade levels. In fact, the findings presented in Brophy and Good (1986) suggest that what is important is not how a teacher is active (i.e., the activities he or she engages in) as much as that the teacher is, in fact, an active agent of instruction. Thus, we can expect to find variability in the frequency and effectiveness of various instructional formats, but in virtually all settings high achievement growth is expected to occur when the teacher is actively carrying the material to students as opposed to allowing students to learn without scaffolding, supervision, and feedback.

Analysis of Time-on-Task/Active-Teaching Measures

To see if patterns of active teaching help explain classroom-to-classroom differences in students’ academic growth, we analyzed the effects on growth in achievement of several measures of active teaching available for upper grade classrooms in Prospects data. [24] The measures were taken from three types of questions on the teacher questionnaire. One question asked teachers to report on the average minutes per week spent in their classrooms on instruction in reading and mathematics. The second asked teachers to rate the percentage of time they spent engaged in various active teaching formats, including time spent (a) presenting or explaining material; (b) monitoring student performance; (c) leading discussion groups; and (d) providing feedback on student performance. The third asked teachers to rate the percentage of time that students in their classrooms spent in individualized and whole-class instruction.
Following the review of evidence on active teaching mentioned earlier, we reasoned that what would matter most to student achievement was not the amount of time teachers spent on instruction nor even how teachers distributed their time across various active teaching behaviors. Instead, we hypothesized that the important variable would be how much active teaching occurred. From this perspective, we predicted that there would be no effect of minutes per week of instruction in reading or math on student achievement and no effect of the instructional format variables (a–d above). What would matter most, we reasoned, was the extent to which the teacher was operating as an active agent of instruction. We therefore predicted that the percentage of time students spent in individualized instruction (where students work alone) would indicate a lack of active teaching and would have negative effects on students’ growth in achievement. By contrast, we reasoned that the percentage of time spent in whole-class instruction (where teachers are the active agents of instruction) would have positive effects. To conduct this analysis, we simply respecified the HLM growth analyses used in estimating the effects of teacher certification and experience so that they now included the active teaching variables. As expected, teachers’ reports about minutes per week spent in instruction, and their reports on the teaching format variables, did not have statistically significant effects on students’ growth in reading or mathematics achievement. The results for time spent on individualized instruction were mixed but generally supportive of our hypotheses. For reading, the data were consistent with the prediction that more time spent by students in individualized settings translated into less academic growth, the effect size here being d = -.09. [25] In mathematics, however, time spent on individualized instruction had no significant effect.
The data on percentage of time spent in whole-class instruction were consistently supportive of our hypothesis. In both reading and mathematics, this variable was statistically significant. In reading, the effect size was d = .09. In mathematics, the effect size was d = .12. [26]

Discussion of Time-on-Task/Active-Teaching Variables

The results from the Prospects analyses appear remarkably consistent with previous process-product research and confirm that active teaching (as carried out in a whole-class setting) can have a positive effect on students’ growth in achievement. However, the results reported here probably do not give us a very accurate indication of the magnitude of this effect for several reasons. For one, items in the Prospects teacher questionnaire forced teachers to report on their use of different instructional behaviors and settings by averaging across all of the academic subjects they taught. Yet Stodolsky (1988) has found that the mix of instructional activities and behavior settings used by the same teacher can differ greatly across subjects. Moreover, a great deal of research on the ways in which respondents complete questionnaires suggests that the kinds of questions asked on the Prospects teacher questionnaire (questions about how much time was spent in routine forms of instructional activities) cannot be responded to accurately in one-shot questionnaires. This lack of accuracy probably introduces substantial error into our analyses, biasing all effect sizes downward and perhaps preventing us from discovering statistically significant relationships among teaching processes and student achievement.

OPPORTUNITY TO LEARN/CONTENT COVERED

In addition to active teaching, process-product research also consistently finds a relationship between the curricular content covered in classrooms and student achievement.
However, definitions and measures of curricular content vary from study to study, with some studies measuring only the content that is covered in a classroom and other studies measuring both the content covered and the cognitive demand of such content. Any serious attempt to measure content coverage begins with a basic categorization of curriculum topics in a particular subject area (e.g., math, reading, writing). Such categorization schemes have been derived from many different sources, including curriculum frameworks or standards documents, textbooks, and items included in the achievement test(s) being used as the dependent variable(s) in a process-product study. In most research on content coverage, teachers are asked to rate the amount of emphasis they place on each topic in the content list developed by researchers. Across all such studies, the procedures used to measure content coverage vary in two important respects. First, some surveys list curriculum content categories in extremely fine-grained detail, whereas others are more coarse-grained. Second, teachers in some studies fill out these surveys on a daily basis, whereas in most studies they fill out an instrument once annually, near the end of the year. Obviously, measures of content coverage can serve either as dependent or independent variables in research on teaching because it is as interesting to know why content coverage differs across teachers as it is to know about the effects of content coverage on student achievement. When the goal of research is to predict student achievement, however, a common approach has been to measure the amount of overlap in content covered in a classroom with the content assessed in the achievement test serving as the dependent measure in a study. A great deal of research, ranging from an early study by Cooley and Leinhardt (1980) to more recent results from the TIMSS assessments (Stedman, 1997), has used this approach.
These studies uniformly show that students are more likely to answer items correctly on an achievement test when they have received instruction on the topics assessed by that item. In fact, the degree of overlap between content covered in a classroom and content tested is a consistent predictor of student achievement scores.[27] In addition to measuring topics covered, it can be useful to examine the cognitive objectives that teachers are seeking to achieve when teaching a given topic. In research on teaching, the work of Porter and colleagues (Porter, Kirst, Osthoff, Smithson, & Schneider, 1993) is particularly noteworthy in this regard. In Porter et al.’s work, curriculum coverage is assessed on two dimensions: which topics are covered and, for each topic, the level of cognitive demand at which that topic is covered, where cognitive demand involves rating the complexity of work that students are required to undertake in studying a topic. Recently, Porter et al. found that the addition of a cognitive demand dimension to the topic coverage dimension increases the power of content measures to predict gains in student achievement (Gamoran, Porter, Smithson, & White, 1997).

Analysis of Content Covered

To examine the effects of content coverage on student achievement, we conducted an analysis of Prospects data. In the Prospects study, teachers filled out a questionnaire near the end of the year in which they were asked to rate the amount of emphasis they gave to several broad areas of the reading and mathematics curricula using a 3-point rating scale (ranging from no emphasis, to moderate emphasis, to a great deal of emphasis). From these data, we were able to construct two measures of content coverage: one in reading for the lower grades cohort (sufficient items for a scale were not available for the upper grades) and one for mathematics. In the following paragraphs, we discuss how these items were used to assess the effects of content coverage on student achievement.
For lower grades reading, we developed a set of measures intended to reflect students’ exposure to a balanced reading curriculum. Such a curriculum, we reasoned, would include attention to three broad curricular dimensions: word analysis, reading comprehension, and writing. We measured students’ exposure to word analysis through a single item in which the teacher reported the amount of emphasis placed on this topic. We measured students’ exposure to reading comprehension instruction by combining eight items into a single Rasch scale, where the items were ordered according to the cognitive demand of instruction in this area. In the scale, items ranged in order from the lowest cognitive demand to the highest cognitive demand as follows: identify main ideas, identify sequence of events, comprehend facts and details, predict events, draw inferences, understand author’s intent, differentiate fact from opinion, and compare and contrast reading assignments. The scale had a person reliability (for teachers) of .73.[28] A third measure was a single item in which teachers reported the emphasis they placed on the writing process. In assessing the effects of these variables on growth in students’ reading achievement, we simply expanded the HLM growth models for the early grades cohort used in previous analyses. In the analyses, each of the curriculum coverage variables had a positive and statistically significant effect on students’ growth in reading. The effect of a teacher’s emphasis on word analysis skills was d = .10. The effect of the reading comprehension measure was d = .17. The effect of a teacher’s emphasis on the writing process was d = .18.[29] For mathematics, we used a single, multi-item scale measuring content coverage.
Data for this measure were available for both cohorts of students in the Prospects data, and for both cohorts the measure can be thought of as indexing the difficulty of the mathematics content covered in a classroom. Difficulty was assessed using an equal-interval Rasch scale in which the order of difficulty for items (from easiest to most difficult) was whole numbers/whole number operations, problem solving, measurement or tables (or both), geometry, common fractions or percent (or both), ratio and proportions, probability and statistics, and algebra (formulas and equations). In both scales, a higher score indicated that a student was exposed to more difficult content. For the early elementary cohort, the scale had a person reliability (for teachers) of .77; in the upper elementary sample, the person reliability (for teachers) was .80. Once again, this measure was simply added as an independent variable into the HLM growth models used in earlier analyses. When this was done, the effect of content coverage on early elementary students’ growth in mathematics achievement was not statistically significant. However, there was a statistically significant relationship for students in the upper elementary grades, the effect size being d = .09.[30]

Discussion of Content Covered

In general, the d-type effect sizes reported for the association of content coverage measures and growth in student achievement are about the same size as d-type effect sizes for the other variables measured here. This should give pause to those who view opportunity to learn as the main explanation for student-to-student differences in achievement growth. In fact, in one of our analyses (lower grades mathematics), the opportunity to learn variable had no statistically significant effects on student achievement.[31] Moreover, the positive effects of curriculum coverage should be interpreted with caution for two reasons.
One problem lies in assuming that opportunity to learn is causally prior to growth in student achievement and is therefore a causal agent. It is very possible that, instead, a student’s exposure to more demanding academic content is endogenous; that is, it results from that student’s achievement rather than causing it. To the extent that this is true, we have overestimated curriculum coverage effects.[32] On the other hand, if curriculum coverage is relatively independent of past achievement, as some preliminary results in Raudenbush et al. (2002) suggest, then our measurement procedures could be leading us to underestimate its effects on student achievement. This is because the measures of curriculum coverage used in our analyses are very coarse grained in their descriptions of instructional content and because teachers are expected to accurately recall their content coverage patterns across an entire year in responding to a one-shot questionnaire. Once again, the findings just discussed seem plagued by unreliability in measurement, and in this light it is somewhat remarkable that crude measures of the sort developed for the Prospects study show any relationship at all to achievement growth.

CONTEXT VARIABLES

As a final step in our analysis of instructional effects on student achievement, we examined the extent to which the relationships of presage and process variables to student achievement just discussed were stable for different kinds of students. This analysis was motivated by data from the random effects models estimated in Part I of this paper, which showed that the same classroom could have different effects on growth in achievement for students from different social backgrounds. In Part II of this paper, we have shifted from estimating random effects models to estimating mixed models in which instructional effects are fixed, that is, assumed to have the same effects in all classrooms for students from all social backgrounds.
In this section, we relax this assumption to examine interactions among presage and process variables and student background. The HLM statistical package being used here allows researchers to examine whether presage and process variables have the same effects on growth in achievement for students from different social backgrounds, but it can do so only when there are sufficient data. In the analyses conducted here, for example, students’ achievement is measured only at three or (in the best case) four time points. With so few time points, the program has insufficient data to estimate the extremely complex models that would be required to test for interactions among social background and instructional process variables. But there are some ways around this problem.[33] In addition, if one proceeds with such an analysis, as we did for exploratory purposes, interactions can be found. For example, in an exploratory analysis, we specified a statistical model for growth in early reading achievement in which we assumed that the effects of the instructional variables discussed earlier would be conditioned by students’ gender, SES, or minority status. In the analysis, we found some evidence for the kinds of interactions being modeled, but it was far from consistent. For example, the data suggested that whole-class instruction was more effective for males and less effective for higher SES students. The analysis also suggested that teachers’ emphasis on the writing process was more effective for males and that teacher experience was less effective for minority students. Thus, one can find evidence that the effectiveness of particular teaching practices varies for different groups of pupils. But there are problems with this kind of analysis that extend far beyond the fact that there are insufficient data for such an analysis in Prospects. Equally important, there is little strong theory to use when formulating and testing such hypotheses.
Thus, although research on teaching suggests that the effects of instructional variables can vary across different groups of pupils, it provides little guidance about what, exactly, we should predict in this regard. Consider, for example, the findings just discussed. What instructional theory predicts that whole-class teaching is more effective for males than females, or for lower SES rather than higher SES students? More important, although it would be possible to formulate an elaborate, post hoc explanation for why more experienced teachers appear to be less effective in promoting early reading growth among minority students (e.g., cohort differences in teacher training or in attitudes might explain the finding), should we interpret this finding knowing that it occurs in the context of several other findings that are completely unpredicted by any theory? We would argue that we should not and that, at least until theory catches up with our power to analyze data statistically, we should keep our statistical analyses simple. The main point about context effects, then, is that educational researchers have a long way to go in modeling them, both in having the requisite data available for modeling complex, multilevel statistical interactions and in having the kinds of theories that would make attempts to do so justifiable. As a result, we recommend that large-scale research on teaching limit itself for now to an examination of fixed effects models, where theoretical predictions are stronger and more straightforward.

SUMMARY OF PART II

The analyses in Part II of this paper illustrate that large-scale research can be used to examine hypotheses drawn from research on teaching. The results also suggest that such hypotheses can be used to at least partially explain why some classrooms are more instructionally effective than others.
The analyses presented in this paper, for example, showed that classroom-to-classroom differences in instructional effectiveness in early grades reading achievement and in mathematics achievement (at all grades) could be explained by differences in presage and process variables commonly examined in research on teaching. In the analyses, several variables had d-type effect sizes in the range of .10 to .20, including teacher experience, the use of whole-class instruction, and patterns of curriculum coverage in which students were exposed to a balanced reading curriculum and to more challenging mathematics. At the same time, the results in Part II of this paper suggest that we probably shouldn’t expect a single instructional variable to explain the classroom-to-classroom differences in instructional effectiveness found in Part I of this paper. Instead, the evidence presented in Part II of this paper suggests that many small instructional effects would have to be combined to produce classroom-to-classroom differences in instructional outcomes of the magnitude found in Part I of this paper. At the same time, the distribution of classroom effectiveness within the same school (discussed in Part I of this paper) suggests that very few classrooms in the same school present an optimal combination of desirable instructional conditions. Instead, the majority of classrooms probably present students with a mix of more and less instructionally effective practices simultaneously. This scenario is made all the more plausible by what we know about the organization and management of instruction in the typical American school. Research demonstrates that American teachers have a great deal of instructional autonomy within their classrooms, producing wide variation in instructional practices within the same school.
Variations in instructional practices, in turn, produce the distribution of classroom effects that we discovered in our variance decomposition models, with a lack of real coordination across classrooms probably accounting for students’ movement through more and less effective classrooms over the course of their careers in a given school. If there is a “magic bullet” to be found in improving instructional effectiveness in American schools, it probably lies in finding situations in which many instructionally desirable conditions coexist in classrooms and in situations where students experience such powerful combinations of instructional practice across their careers in school. In fact, this is one reason we and our colleagues have become so interested in studying instructional interventions. By design, these interventions seek to smooth out classroom-to-classroom differences in instructional conditions and to encourage the implementation of instructional conditions that combine to produce fairly powerful effects on student learning across all classrooms within a school. This insight suggests a real limitation to research on teaching that looks exclusively at natural variations in instructional practice, as the research presented in this paper did (and as much other large-scale, survey research tends to do). If we look only at natural variation, we will find some teachers who work in ways that combine many desirable instructional conditions within their classrooms and others who don’t. But if we rely solely on a strategy of looking at naturally occurring variation to identify best practice, we have no way of knowing if the best cases represent a truly optimal combination of instructional conditions or whether even the best classrooms are operating below the real (and obtainable) production frontier for schooling.
In our view, it would be better to shift away from the study of naturally occurring variation in research on teaching and to instead compare alternative instructional interventions that have been designed a priori to implement powerful combinations of instructionally desirable conditions across classrooms in a school. In this case, we would no longer be studying potentially idiosyncratic variations in teacher effectiveness but rather the effects of well-thought-out instructional designs on student learning.[34]

PART III: HOW TO IMPROVE LARGE-SCALE, SURVEY RESEARCH ON TEACHING

The discussions presented in this paper show how large-scale, survey research has been used to estimate classroom-to-classroom differences in instructional effectiveness and to test hypotheses that explain these differences by reference to presage, process, and context variables commonly used in research on teaching. Throughout this paper, however, we have pointed out various conceptual and methodological issues that have clouded interpretations of the findings from prior research on teaching or threatened its validity. In this section, we review these issues and discuss some steps that can be taken to improve large-scale research on teaching.

EFFECT SIZES IN RESEARCH ON TEACHING

One issue that has clouded research on teaching is the question of how big instructional effects on student achievement are. As we tried to show in earlier sections of this paper, the answer one gives to the question of how much of the variance in student achievement outcomes is accounted for by students’ locations in particular classrooms depends in large part on how the criterion outcome in an analysis of this problem is conceived and measured. Research that uses achievement status as the criterion variable in assessing teacher effects is looking at how much a single year of instruction (or exposure to a particular instructional condition during a single year) affects students’ cumulative learning over many years.
Obviously, the size of the instructional effect that one obtains here will differ from what would be obtained if the criterion variable assessed instructional effects on changes in student achievement over a single year. In fact, in analyses of achievement status, home background variables and prior student achievement will account for larger proportions of variance than variables indexing a single year of teaching. That said, it is worth noting that analyses using covariate adjustment models to assess instructional effects on students’ achievement status can identify both the random effects of classroom placement on students’ achievement and the effects of specific instructional variables. However, the effect sizes resulting from such analyses will be relatively small, because a single year of instruction is being weighed against learning accumulated over many years. A shift to the analysis of instructional effects on growth in achievement presents different problems, especially if gain scores are used to measure students’ rates of academic growth. To the extent that the gain scores used in analysis are unreliable, estimates of the overall magnitude of instructional effects on student achievement will be biased downward. As the literature on assessing change suggests, it is preferable to begin any analysis of instructional effects by first estimating students’ true rates of academic growth and then assessing teacher effects on growth within this framework. Unfortunately, computing packages that allow for such analyses are not yet commercially available, although preliminary results obtained while working with a developmental version of such a program (being developed by Steve Raudenbush) suggest that effect size estimates from such models will be very different from those obtained using covariate adjustment and gains models. All of this suggests that there might be more smoke than fire in discussions of the relative magnitude of instructional effects on student achievement.
Certainly, the discussion to this point suggests that not all effect sizes are created equal. In fact, the same instructional conditions can be argued to have large or small effects simply on the basis of the analytic framework used to assess the effects (i.e., a covariate adjustment model, a gains model, or an explicit growth model). There is much to be said in favor of recent discussions in educational research about the overreliance on statistical significance testing as the single metric by which to judge the relative magnitude of effects, especially in large-scale, survey research, where large numbers of subjects almost always assure that very tiny effects can be statistically significant. Nevertheless, the discussion presented in this paper also suggests that substantively important instructional effects can indeed have very small effect sizes when particular analytic frameworks are used in a study. Moreover, when this is the case, large sample sizes and statistical significance testing turn out to be an advantage because they work against having insufficient statistical power to identify effects that are substantively important when the dependent variable is measured differently. In particular, to the extent that researchers are using covariate adjustment or gains models to assess instructional effects, large sample sizes and statistical significance tests would seem to be an important means for locating substantively meaningful effects, especially because these models present analytic situations in which the decks are stacked against finding large effect sizes.[35] A final point can be made about efforts to estimate the magnitude of teacher effects on student achievement. In our view, the time has come to move beyond variance decomposition models that estimate the random effects of schools and classrooms on student achievement.
These analyses treat the classroom as a black box. They can be useful in identifying more and less effective classrooms, and in telling us how much of a difference natural variation in classroom effectiveness can make to students’ achievement. But variance decomposition models do not tell us why some classrooms are more effective than others, nor do they give us a very good picture of the potential improvements in student achievement that might be produced if we combined particularly effective instructional conditions into powerful instructional programs. For this reason, we would argue that future large-scale research on teaching should move to directly measuring instructional conditions inside classrooms, assessing the implementation and effectiveness of deliberately designed instructional interventions, or both.

THE MEASUREMENT OF INSTRUCTION

As the goal of large-scale, survey research on teaching shifts from estimating the random effects of classrooms on student achievement to explaining why some classrooms are more instructionally effective than others, problems of measurement in survey research will come to the fore. As we discussed in Part II of this paper, there is a pervasive tendency in large-scale, survey research to use proxy variables to measure important dimensions of teaching expertise, as well as an almost exclusive reliance on one-shot questionnaires to crudely measure instructional process variables. Although the findings presented here suggest that crude measures of this sort can be used to test hypotheses from research on teaching, and that crude measures often show statistically significant relationships to student achievement, it is also true that problems of measurement validity and reliability loom large in such analyses. What can be done about these problems?
One line of work would involve further studies of survey data quality, that is, the use of a variety of techniques to investigate the validity and reliability of commonly used survey measures of instruction. There are many treatments of survey data quality in the broader social science literature (Biemer, Groves, Lyberg, Mathiowetz, & Sudman, 1991; Groves, 1987, 1989; Krosnick, 1999; Scherpenzeel & Saris, 1997; Sudman & Bradburn, 1982; Sudman, Bradburn, & Schwarz, 1996), and a burgeoning literature on the quality of survey measures of instruction in educational research (Brewer & Stasz, 1996; Burstein et al., 1995; Calfee & Calfee, 1976; Camburn, Correnti, & Taylor, 2000, 2001; Chaney, 1994; Elias, Hare, & Wheeler, 1976; Fetter, Stowe, & Owings, 1984; Lambert & Hartsough, 1976; Leighton, Mullens, Turnbull, Weiner, & Williams, 1995; Mayer, 1999; Mullens, 1995; Mullens & Gayler, 1999; Mullens & Kasprzyk, 1996, 1999; Porter et al., 1993; Salvucci, Walter, Conley, Fink, & Mehrdad, 1997; Shavelson & Dempsey-Atwood, 1976; Shavelson, Webb, & Burstein, 1986; Smithson & Porter, 1994; Whittington, 1998). A general conclusion from all of this work seems to be that the survey measures of instruction used in educational research suffer from a variety of methodological and conceptual problems that can only be addressed by more careful work during the survey development stage. The work that we are doing with colleagues to address these problems deserves brief mention here. As we discussed at an earlier point in this paper, we have become keenly interested in assessing the effects of teachers’ pedagogical content knowledge on students’ achievement, but rather than rely on the kinds of indirect proxy measures that typify much previous research in this area, we have instead begun a program of research designed to build direct measures of this construct from scratch.
To date, we have been through one round of pretesting in which we have found that it is possible to develop highly reliable measures of teachers’ content and pedagogical knowledge in very specific domains of the school curriculum using as few as six to eight items (Rowan, Schilling, Ball, & Miller, 2001). We also have begun to validate these measures by looking at think-aloud protocols in which high- and low-scoring teachers on our scales talk about how and why they answered particular items as they did. Finally, in the near future, we will begin to correlate these measures to other indicators of teachers’ knowledge and to growth in student achievement. The work here has been intensive (and costly). But it is the kind of work that is required if survey research on instruction is to move forward in its examination of the role of teaching expertise in instructional practice.[36] We also have been exploring the use of instructional logs to collect survey data on instructional practices in schools. In the broader social science research community, logs and diaries have been used to produce more accurate responses from survey respondents about the frequency of activities conducted on a daily basis. The advantage of logs and diaries over one-shot questionnaires is that logs and diaries are completed frequently (usually on a daily basis) and thus avoid the problems of memory loss and misestimation that plague survey responses about behavior gathered from one-shot surveys. Here, too, we have engaged in an extensive development phase. In spring 2000, we asked teachers to complete daily logs for a 30- to 60-day time period, and during this time we conducted independent observations of classrooms where logging occurred, conducted think-alouds with teachers after they completed their logs, and administered separate questionnaires to teachers designed to measure the same constructs being measured by the logs.
To date, we have found that teachers will complete daily logs over an extended period of time (if given sufficient incentives), that due to variation in daily instructional practice roughly 15–20 observations are needed to derive reliable measures of instructional processes from log data, that log and one-shot survey measures of the same instructional constructs often are only moderately correlated, and that rates of agreement among teachers and observers completing logs on the same lesson vary depending on the construct being measured.[37] In future work, we will be correlating log-derived measures with student achievement and comparing the relative performance of measures of the same instructional construct derived from logs and from our own one-shot questionnaire. The point of all this work is not to trumpet the superiority of our measures over those used in other studies. Rather, we are attempting to take seriously the task of improving survey-based measures of instruction so that we can better test hypotheses derived from research on teaching. Without such careful work, estimates about what works in terms of instructional improvement and how big the effects of particular instructional practices are on student achievement will continue to be plagued by issues of reliability and validity that currently raise doubts about the contributions of past survey research to broader investigations of teaching and its consequences for student achievement.

PROBLEMS OF CAUSAL INFERENCE IN SURVEY RESEARCH

If the goal of survey research is to test hypotheses about the effects of teachers and their teaching on student achievement, then more is needed than appropriate interpretation of differing effect size metrics and careful development of valid and reliable survey instruments. To achieve the fundamental goal of assessing the effects of teachers and their teaching on students’ achievement, researchers must also pay attention to problems of causal inference in educational research.
That large-scale survey research confronts tricky problems of causal inference in this area is demonstrated by some of the results we reported earlier in this paper. Consider, for example, the findings we reported about the effects of teacher qualifications and students’ exposure to advanced curricula on students’ achievement. A major problem in assessing the effects of these variables on student achievement is that students who have access to differently qualified teachers or to more and less advanced curricula are also likely to differ in many other ways that also predict achievement. These other factors are confounding variables that greatly complicate causal inference, especially in nonexperimental settings. For several decades, educational researchers assumed that multiple regression techniques could resolve most of these problems of causal inference. But this is not always the case. For example, some analysts have noted that strategies of statistical control work effectively to reduce problems of causal inference only under limited circumstances. These include circumstances where all confounding variables are measured without error and included in a regression model; where two-way and higher-order interactions between confounding variables and the causal variable of interest are absent or specified in a model; where confounding variables are not also an outcome in the model; and where confounding variables have the same linear association with the outcome that was specified by the multiple regression model (Cohen, Ball, & Raudenbush, in press).
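The first of these conditions, that confounders be measured without error, can be illustrated with a small simulation. The sketch below is not an analysis of Prospects data; every name and number in it is a hypothetical assumption chosen for illustration. A confounder (think of prior achievement) drives both exposure to an instructional condition and the outcome, the true instructional effect is set to 0.10, and the confounder is then "controlled" twice: once using its true values and once using a noisy measure with reliability .70.

```python
import random


def partial_slope(t, c, y):
    """OLS coefficient on t when y is regressed on t and c (with an intercept)."""
    n = len(y)
    mt, mc, my = sum(t) / n, sum(c) / n, sum(y) / n
    s_tt = sum((a - mt) ** 2 for a in t)
    s_cc = sum((b - mc) ** 2 for b in c)
    s_tc = sum((a - mt) * (b - mc) for a, b in zip(t, c))
    s_ty = sum((a - mt) * (v - my) for a, v in zip(t, y))
    s_cy = sum((b - mc) * (v - my) for b, v in zip(c, y))
    # Closed-form solution of the two-predictor normal equations.
    return (s_ty * s_cc - s_cy * s_tc) / (s_tt * s_cc - s_tc ** 2)


def simulate(n=20000, reliability=0.7, true_effect=0.10, seed=0):
    """Return (slope controlling the true confounder,
               slope controlling an error-laden measure of it).

    All coefficients below are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    # reliability = var(true) / (var(true) + var(error)), with var(true) = 1
    noise_sd = ((1 - reliability) / reliability) ** 0.5
    t, c, c_obs, y = [], [], [], []
    for _ in range(n):
        ci = rng.gauss(0, 1)                    # true confounder
        ti = 0.6 * ci + rng.gauss(0, 1)         # exposure depends on confounder
        yi = true_effect * ti + 1.0 * ci + rng.gauss(0, 1)  # outcome
        c.append(ci)
        t.append(ti)
        y.append(yi)
        c_obs.append(ci + rng.gauss(0, noise_sd))  # fallible measure
    return partial_slope(t, c, y), partial_slope(t, c_obs, y)


b_true_control, b_noisy_control = simulate()
print("controlling the true confounder: ", round(b_true_control, 3))
print("controlling the noisy confounder:", round(b_noisy_control, 3))
```

Under these assumptions, controlling for the true confounder recovers an estimate close to the specified effect, whereas controlling for the fallible measure leaves part of the confounding in place and inflates the estimate. The direction and size of the residual bias depend entirely on the assumed coefficients, so the sketch should be read as a demonstration of the mechanism and not as a statement about any particular study.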
Other researchers have taken to using instrumental variables and two-stage least squares procedures to simulate the random assignment of experiments, or they have employed complex selection models to try to control for confounding influences across treatment groups formed by nonrandom assignment, or they have advocated for interrupted time series analyses in which data on outcomes are collected at multiple time points before and after exposure to some treatment of interest. All of these approaches are useful, but they also can be difficult to employ successfully, especially in research on teaching, where knowledge of confounding factors is limited and where at least one of the main confounding variables is also the outcome of interest (students’ achievement levels). In fact, difficulties associated with effectively deploying alternatives to random assignment in nonexperimental research might account for the finding that nonexperimental data are less efficient than experimental data in making causal inferences. For example, Lipsey and Wilson (1993) reported on 74 meta-analyses that included both experimental and nonexperimental studies of psychological, educational, or behavioral treatment efficacy, or all three. Their analysis showed that average effect sizes for various causal hypotheses did not differ much between experiments and nonexperimental studies but that variation in effect sizes was much larger for the nonexperimental studies. All of this suggests that the typical (nonexperimental) survey study of instructional effects on student achievement probably builds knowledge more slowly, and more tenuously, than experimental research. The argument we are making should not be considered an unambiguous call for experimental studies of teaching, however.
Although there is growing consensus among researchers in many disciplines (including economics, political science, and the applied health sciences fields) that experiments are the most desirable way to draw valid causal inferences, it is the case that educational experiments will suffer from a number of shortcomings, especially when they are conducted in complex field settings, over long periods of time, where treatments are difficult to implement, where attrition is pervasive, where initial randomization is compromised, where crossover effects frequently occur, and where complex organizations (like schools) are the units of treatment. Much has been learned about how to minimize these problems in experimental studies (e.g., Boruch, 1997), but in the real world of educational research, complex and larger scale experiments seldom generate unassailable causal inferences. Thus, scrupulous attention to problems of causal inference seems warranted not only in nonexperimental but also in experimental research. Moreover, even when experiments (or various quasi-experiments that feature different treatment, control groups, or both) are conducted, there is still an important role for survey research. Although policy makers may be interested in the effects of intent to treat (i.e., mean differences in outcomes among those assigned to experimental and control groups), program developers are usually interested in testing their own theories of intervention. They therefore want to know whether the conditions they think should produce particular outcomes do indeed predict these outcomes. The usual black box experiment, which examines differences in outcomes across those who were and were not randomly assigned to the treatment, regardless of actual level of treatment, is fairly useless for this purpose. Instead, measures of treatment implementation and its effects on treatment outcomes are what program developers usually want to see.
Program developers recognize that treatments are implemented variably, and they want to know how, and to what effect, their treatments have been implemented. Thus, even in experimental studies of teaching effects on student achievement, there is an important need for careful measurement of instruction, and the larger the experiment, the more likely that surveys will be employed to gather the necessary data for such measures.
CONCLUSION
All of this suggests that there is a continuing role for survey research in the study of instructional effects on student achievement. It also shows the critical interdependence among the three problems that must be confronted if survey research is to inform research on teaching. We cannot interpret the results of large-scale survey research on teaching very sensibly if we do not have a clear understanding of what constitutes a big or small effect, but no matter which method we choose to develop effect size metrics, we will not have good information from survey research about these effects if we do not also pay attention to issues of measurement and causal inference. Without good measures, no amount of statistical or experimental sophistication will lead to valid inferences about instructional effects on student achievement, but even with good measures, sound causal inference procedures are required. The comments and illustrations presented in this paper therefore suggest that although large-scale survey research has an important role to play in research on teaching and in policy debates about what works, survey researchers still have some steps to take if they want to improve their capacity to contribute to this important field of work.
The research reported here was supported by grants to the Consortium for Policy Research in Education from the Atlantic Philanthropies, the National Science Foundation’s Interagency Educational Research Initiative (Grant #REC 9979863), and the U.S. Department of Education (Grant #OERIR308A600003).
The opinions expressed are those of the authors, not the agencies supporting this work. We thank Steve Raudenbush for advice and assistance at various stages of the work.
REFERENCES
Alexander, K., & Entwisle, D. (1992). Summer setback: Race, poverty, school composition, and mathematics achievement in the first two years of school. American Sociological Review, 57(1), 72–84.
Alexander, K., & Entwisle, D. (2001). Keep the faucet flowing: Summer learning and home environment. American Educator, 25(3), 10–15.
Biemer, P., Groves, R., Lyberg, L., Mathiowetz, N., & Sudman, S. (1991). Measurement errors in surveys. New York: John Wiley & Sons.
Boruch, R. F. (1997). Randomized experiments for planning and evaluation: A practical guide (Applied Social Research Methods Series, Vol. 44). Thousand Oaks, CA: Sage Publications.
Brewer, D. J., & Goldhaber, D. D. (2000). Improving longitudinal data on student achievement: Some lessons from recent research using NELS:88. In D. W. Grissmer & J. M. Ross (Eds.), Analytic issues in the assessment of student achievement (pp. 169–188). Washington, DC: U.S. Department of Education.
Brewer, D. J., & Stasz, C. (1996). Enhancing opportunity to learn measures in NCES data. In From data to information: New directions for the National Center for Education Statistics (NCES 96-901, pp. 31–328). Washington, DC: U.S. Government Printing Office.
Brophy, J. E., & Good, T. (1986). Teacher behavior and student achievement. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 328–375). New York: Macmillan.
Bryk, A. S., & Raudenbush, S. W. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage.
Bryk, A. S., Raudenbush, S. W., Cheong, Y. F., & Congdon, R. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Burstein, L., McDonnell, L., Van Winkle, J., Ormseth, T., Mirocha, J., & Guiton, G. (1995). Validating national curriculum indicators. Santa Monica, CA: RAND.
Calfee, R., & Calfee, K. H. (1976). Beginning teacher evaluation study: Phase II, 1973–74, final report: Volume III.2. Reading and mathematics observation system: Description and analysis of time expenditures. Washington, DC: National Institute of Education. (ERIC Document Reproduction Service No. ED127367)
Camburn, E., Correnti, R., & Taylor, J. (2000). Using qualitative techniques to assess the validity of teachers’ responses to survey items. Paper presented at the meeting of the American Educational Research Association, New Orleans, LA.
Camburn, E., Correnti, R., & Taylor, J. (2001). Examining differences in teachers’ and researchers’ understanding of an instructional log. Paper presented at the meeting of the American Educational Research Association, Seattle, WA.
Chaney, B. (1994). The accuracy of teachers’ self-reports on their postsecondary education: Teacher transcript study, Schools and staffing survey (NCES 94-04). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Chiang, F.-S. (1996). Teacher’s ability, motivation and teaching effectiveness. Unpublished doctoral dissertation, University of Michigan, Ann Arbor.
Cohen, D. K., Raudenbush, S. W., & Ball, D. L. (in press). Resources, instruction, and research. In R. F. Boruch & F. W. Mosteller (Eds.), Evidence matters: Randomized trials in educational research. Washington, DC: Brookings Institution.
Coleman, J. S., Campbell, E., Hobson, C., McPartland, J., Mood, A., Weinfeld, F., & York, R. (1966). Equality of educational opportunity. Washington, DC: U.S. Government Printing Office.
Cooley, W. W., & Leinhardt, G. (1980). The instructional dimensions study. Educational Evaluation and Policy Analysis, 2(1), 7–25.
Darling-Hammond, L., Wise, A. E., & Klein, S. P. (1995). A license to teach: Building a profession for 21st-century schools. San Francisco: Westview Press.
Deng, Z. (1995). Estimating the reliability of the teacher questionnaire used in the Teaching and Learning to Teach (TELT) study (Technical series 95-1). East Lansing, MI: National Center for Research on Teacher Learning. (ERIC Document Reproduction Service No. ED 392 750)
Dunkin, M., & Biddle, B. (1974). The study of teaching. New York: Holt, Rinehart & Winston.
Elias, P. J., Hare, G., & Wheeler, P. (1976). Beginning teacher evaluation study: Phase II, 1973–74, final report: Volume V.5. The reports of teachers about their mathematics and reading instructional activities. Washington, DC: National Institute of Education. (ERIC Document Reproduction Service No. ED127374)
Ferguson, R. F., & Brown, J. (2000). Certification test scores, teacher quality, and student achievement. In D. W. Grissmer & J. M. Ross (Eds.), Analytic issues in the assessment of student achievement (pp. 133–156). Washington, DC: U.S. Department of Education.
Fetters, W. B., Stowe, P. S., & Owings, J. A. (1984). High school and beyond, a national longitudinal study for the 1980s, quality of responses of high school students to questionnaire items (NCES 84-216). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Gage, N. L., & Needels, M. C. (1989). Process-product research on teaching: A review of criticisms. Elementary School Journal, 89, 253–300.
Gamoran, A., Porter, A., Smithson, J., & White, P. (1997). Upgrading high school mathematics instruction: Improving learning opportunities for low-achieving, low-income youth. Educational Evaluation and Policy Analysis, 19(4), 325–338.
Greenwald, R., Hedges, L. V., & Laine, R. D. (1996). The effect of school resources on student achievement. Review of Educational Research, 66, 361–396.
Groves, R. M. (1987). Research on survey data quality. Public Opinion Quarterly, 51, 156–172.
Groves, R. M. (1989). Survey errors and survey costs. New York: John Wiley & Sons.
Karweit, N. (1985). Should we lengthen the school term? Educational Researcher, 14, 9–15.
Kennedy, M., Ball, D., & McDiarmid, W. (1993). A study package for examining and tracking changes in teachers’ knowledge (Technical series 93-1). East Lansing, MI: National Center for Research on Teacher Learning. (ERIC Document Reproduction Service No. ED 359 170)
Kerckhoff, A. C. (1983). Diverging pathways: Social structure and career deflections. New York: Cambridge University Press.
Krosnick, J. A. (1999). Survey research. Annual Review of Psychology, 50, 537–567.
Lambert, N. M., & Hartsough, C. S. (1976). Beginning teacher evaluation study: Phase II, 1973–74, final report: Volume III.1. APPLE observation variables and their relationship to reading and mathematics achievement. Washington, DC: National Institute of Education. (ERIC Document Reproduction Service No. ED127366)
Leighton, M., Mullens, J., Turnbull, B., Weiner, L., & Williams, A. (1995). Measuring instruction, curriculum content, and instructional resources: The status of recent work (NCES Working Paper No. 1995-11). Washington, DC: U.S. Department of Education.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48(12), 1181–1209.
Mayer, D. (1999). Measuring instructional practice: Can policymakers trust survey data? Educational Evaluation and Policy Analysis, 21(1), 29–45.
Monk, D. H. (1994). Subject area preparation of secondary mathematics and science teachers and student achievement. Economics of Education Review, 13(2), 125–145.
Mullens, J. (1995). Classroom instructional processes: A review of existing measurement approaches and their applicability for the teacher follow-up survey (NCES Working Paper No. 1995-15). Washington, DC: U.S. Department of Education.
Mullens, J., & Gayler, K. (1999). Measuring classroom instructional processes: Using survey and case study field test results to improve item construction (NCES Working Paper No. 1999-08). Washington, DC: U.S. Department of Education.
Mullens, J., & Kasprzyk, D. (1996). Using qualitative methods to validate quantitative survey instruments. In 1996 Proceedings of the Section on Survey Research Methods (pp. 638–643). Alexandria, VA: American Statistical Association.
Mullens, J., & Kasprzyk, D. (2000). Validating item responses on self-report teacher surveys. In Selected papers on education surveys: Papers presented at the 1998 and 1999 ASA and 1999 AAPOR meetings (NCES Working Paper 2000-04). Washington, DC: U.S. Department of Education.
Porter, A. C., Kirst, M., Osthoff, E., Smithson, J., & Schneider, S. (1993). Reform up close: An analysis of high school mathematics and science classrooms. Madison: University of Wisconsin, Wisconsin Center for Education Research.
Raudenbush, S. W. (1995). Hierarchical linear models to study the effects of social context on development. In J. M. Gottman (Ed.), The analysis of change (pp. 165–202). Mahwah, NJ: Lawrence Erlbaum Associates.
Raudenbush, S. W., & Bryk, A. S. (2002). Hierarchical linear models: Applications and data analysis (2nd ed.). Thousand Oaks, CA: Sage.
Raudenbush, S., Bryk, A., Cheong, Y., & Congdon, R. (2000). HLM 5: Hierarchical linear and nonlinear modeling. Lincolnwood, IL: Scientific Software International.
Raudenbush, S. W., Hong, G. L., & Rowan, B. (2002). Studying the causal effects with application to primary school mathematics (Working Paper). Ann Arbor: University of Michigan, Consortium for Policy Research in Education, Study of Instructional Improvement.
Rogosa, D. (1995). Myths and methods: Myths about longitudinal research plus supplemental questions. In J. M. Gottman (Ed.), The analysis of change (pp. 3–66). Mahwah, NJ: Lawrence Erlbaum Associates.
Rosenthal, R. (1994). Parametric measures of effect size. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 231–244). New York: Russell Sage Foundation.
Rowan, B. (1999). The task characteristics of teaching: Implications for the organizational design of schools. In R. Bernhardt, C. N. Hedley, G. Cattaro, & V. Svolopoulos (Eds.), Curriculum leadership for the 21st century. Cresskill, NJ: Hampton Press.
Rowan, B., Chiang, F.-S., & Miller, R. J. (1997). Using research on employees’ performance to study the effects of teachers on students’ achievement. Sociology of Education, 70, 256–284.
Rowan, B., Schilling, S., Ball, D. L., & Miller, R. (2001). Measuring teachers’ pedagogical content knowledge in surveys: An exploratory study (Research Note S2). Ann Arbor: University of Michigan, Consortium for Policy Research in Education, Study of Instructional Improvement.
Salvucci, S., Walter, E., Conley, V., Fink, S., & Mehrdad, S. (1997). Measurement error studies at the National Center for Education Statistics (NCES 97-464). Washington, DC: U.S. Department of Education, Office of Educational Research and Improvement, National Center for Education Statistics.
Sanders, W. (1998). Value-added assessment. The School Administrator, 55(11), 24–32.
Sanders, W., & Horn, S. P. (1994). The Tennessee value-added assessment system (TVAAS): Mixed-model methodology in educational assessment. Journal of Personnel Evaluation in Education, 8, 299–311.
Scheerens, J., & Bosker, R. (1997). The foundations of educational effectiveness. New York: Pergamon.
Scherpenzeel, A., & Saris, W. E. (1997). The validity and reliability of survey questions: A meta-analysis of MTMM studies. Sociological Methods & Research, 25(3), 341–383.
Shavelson, R. J., & Dempsey-Atwood, N. (1976). Generalizability of measures of teaching behavior. Review of Educational Research, 46, 553–611.
Shavelson, R. J., Webb, N. M., & Burstein, L. (1986). Measurement of teaching. In M. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 50–91). New York: Macmillan.
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14.
Smithson, J. L., & Porter, A. C. (1994). Measuring classroom practice: Lessons learned from efforts to describe the enacted curriculum: The reform up close study. Madison: University of Wisconsin, Consortium for Policy Research in Education.
Stedman, L. C. (1997). International achievement differences: An assessment of a new perspective. Educational Researcher, 26(3), 4–15.
Stodolsky, S. S. (1988). The subject matters: Classroom activity in math and social studies. Chicago: University of Chicago Press.
Stoolmiller, M., & Bank, L. (1995). Autoregressive effects in structural equation models: We see some problems. In J. M. Gottman (Ed.), The analysis of change (pp. 261–276). Mahwah, NJ: Lawrence Erlbaum Associates.
Sudman, S., & Bradburn, N. M. (1982). Asking questions: A practical guide to questionnaire design. San Francisco: Jossey-Bass.
Sudman, S., Bradburn, N. M., & Schwarz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. San Francisco: Jossey-Bass.
Whittington, D. (1998). How well do researchers report their measures? An evaluation of measurement in published educational research. Educational and Psychological Measurement, 58(1), 21–37.
BRIAN ROWAN is a professor of education at the University of Michigan and director of the Study of Instructional Improvement, conducted by the Consortium for Policy Research in Education. His scholarly interests focus on the organizational analysis of schooling, paying special attention to the ways in which schools organize and manage instruction and affect student learning. Rowan’s recent publications appear in Hoy and Miskel (Eds.), Theory and Research in Educational Administration (Vol. 1) and the Journal of Educational Change.
RICHARD CORRENTI is a doctoral candidate in educational administration and policy at the University of Michigan, Ann Arbor. His research interests include the measurement of instruction, instructional effects on student learning, and program evaluation of educational reform interventions.
ROBERT J. MILLER is a doctoral candidate in educational administration and policy at the University of Michigan, Ann Arbor. His main fields of interest are educational policy, organizational theory, and analysis of school effectiveness.


