Evaluating Teacher Preparation Programs Using the Performance of their Graduates
by Cory Koedel & Eric Parsons - November 04, 2014
In a recent article the authors use data from Missouri to show that differences between traditional teacher preparation programs, measured in terms of the effects of their graduates on student achievement, are smaller than has been suggested by previous research in other states. Indeed, they find that most programs in Missouri are statistically indistinguishable from one another. The authors identify a technical error made in previous work to which they attribute their discrepant findings. In short, some previous studies have failed to properly account for teacher sampling, and in doing so, have overstated the extent to which graduates from different teacher preparation programs truly differ. This commentary considers the implications of this result in the context of the current policy push for more rigorous evaluations of teacher preparation programs.
Recently proposed federal and state policies that are designed to improve teacher preparation have focused on using outcome-based information about program efficacy to inform decision-making. Among the program outcomes being considered for graduates are statistical performance measures based on student test scores, high-quality observational measures of classroom performance, job placement, retention rates, and feedback from surveys of program graduates and their employers (Crowe, 2011; United States Department of Education, 2011). The idea of using outcome-based metrics to evaluate teacher preparation programs is a logical extension of a related literature that uses similar metrics to evaluate individual teachers. The best available scientific evidence from the literature on individual teachers indicates that these outcome-based metrics, particularly value-added metrics, can be used to identify persistently effective and ineffective teachers (Chetty, Friedman, and Rockoff, 2014a; Kane, McCaffrey, Miller, and Staiger, 2013; Kane and Staiger, 2008). In turn, this information can be applied in a number of ways to improve student outcomes (Boyd, Lankford, Loeb, and Wyckoff, 2011; Chetty, Friedman, and Rockoff, 2014b; Dee and Wyckoff, 2013; Goldhaber and Theobald, 2013; Ehlert, Koedel, Parsons, and Podgursky, 2014; Condie, Lefgren, and Sims, 2014; Hanushek, 2011; Winters and Cowen, 2013).
The purpose of this commentary is to discuss the seemingly innocuous extension of our knowledge base on outcome-based performance metrics for individual teachers toward the development of similar metrics for teacher preparation programs. While it is our view that the literature on value-added and alternative rigorous performance measures for individual teachers has rightly encouraged the policy push in this direction, there are fundamental aspects of the evaluation problem for teacher preparation programs that merit careful attention. We argue that outcome-based accountability for teacher preparation programs may have the potential to meaningfully improve the quality of the teaching workforce in the long run, but in the short run an overreliance on outcome-based rankings in policy decisions is inadvisable.
Focusing first on the long run, the argument for developing and reporting on how graduates from different teacher preparation programs perform in the workforce is compelling. We know from a large research literature (with many key studies cited above) that differences in exposure to effective teaching can dramatically influence student success in school and beyond. One obvious way to improve the teaching workforce is to improve the quality of training that future teachers receive before entering the classroom.
Of course, the first step in the improvement process is to develop rigorous measures that highlight performance differences across programs. Identifying these differences is critical to promoting policies that can improve training. For example, if more effective programs can be identified, they can be studied to determine best practices, which can then be replicated elsewhere. And ineffective programs can be asked to improve or risk being disaccredited. Schooling and labor-market outcomes for students in today's K-12 schools are being harmed every time an ineffective program places a teacher in a classroom who either (a) could have gone to a better program and been trained more effectively, or (b) should never have been in the classroom in the first place. We owe it to these students to do better.
Taking the long view, we can imagine a situation 5-10 years from now where rigorous performance metrics for program graduates are incorporated in a meaningful way into holistic evaluations of teacher preparation programs, much like what is currently happening in some states and school districts for individual teachers (e.g., see Dee and Wyckoff, 2013). More rigorous evaluations of teacher preparation programs would provide a much-needed and currently missing mechanism by which effective programs could distinguish themselves. Such an outcome would be of great benefit to the K-12 education system: if stronger programs could be identified, it would likely lead to their growth. Combined with a corresponding decline in the presence of ineffective programs, the net effect would be a larger share of teachers entering the profession from teacher preparation programs that produce stronger graduates. None of this is possible, however, unless we can develop evaluation systems that have the ability to distinguish between the quality of graduates from different programs. The development of these systems should start today, and in many locales, it already has.
In contrast to our optimism in the long run, in the short run we are pessimistic about evaluating teacher preparation programs based on the performance of their graduates. Two related factors contribute to our pessimism: (a) differences in average quality across graduates from different programs, at least among most traditional programs, are presently small (Koedel, Parsons, Podgursky, and Ehlert, in press; Goldhaber, Liddle, and Theobald, 2013), and (b) technical aspects of the evaluation problem for teacher preparation programs make it difficult to differentiate programs based on small performance differences. In short, the primary technical issue is that the teacher-level sample sizes available for evaluating most programs in most states are small. This, combined with the fact that individual teacher effects are relatively large, makes it statistically infeasible to detect small to moderate differences in program performance using available data in most states. Note that this evaluation problem is not unique to any particular performance metric for program graduates, be it value-added, teacher observations, student surveys, etc. For a lengthier discussion of the technical issues associated with evaluating teacher preparation programs, we refer the interested reader to Koedel et al. (in press).
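As a rough illustration of the sample-size point, the sketch below computes how large the gap between two programs' average graduate effects would need to be before it clears a conventional two-sided 5 percent significance threshold. It is not the estimation approach used in Koedel et al. (in press). It assumes teachers are independent draws, ignores student-level sampling error, and uses a hypothetical teacher-level standard deviation (0.15 student test-score standard deviations) and hypothetical program sizes chosen only for illustration.

```python
import math


def program_mean_se(teacher_sd: float, n_teachers: int) -> float:
    """Standard error of a program's average graduate effect.

    Assumes each graduate's effect is an independent draw with
    standard deviation `teacher_sd` (a simplifying assumption).
    """
    return teacher_sd / math.sqrt(n_teachers)


def min_detectable_gap(teacher_sd: float, n_teachers: int, z: float = 1.96) -> float:
    """Smallest gap between two equally sized programs that would be
    statistically significant at the two-sided 5 percent level."""
    se_of_difference = math.sqrt(2) * program_mean_se(teacher_sd, n_teachers)
    return z * se_of_difference


# Hypothetical value, not an estimate from any state's data.
TEACHER_SD = 0.15

for n in (25, 100, 400):
    gap = min_detectable_gap(TEACHER_SD, n)
    print(f"{n:>4} teachers per program -> detectable gap of about {gap:.3f}")
```

Under these assumptions the detectable gap shrinks only with the square root of the number of teachers, so quadrupling a program's sample merely halves the gap that can be detected.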
Of particular concern to us in the short run is that policymakers will rush to generate and use program rankings based on these types of graduate performance metrics, despite the fact that available evidence suggests that differences in program rankings will generally not represent substantive differences in the quality of program graduates. For example, consider the rankings of two programs in Missouri, where in our earlier work we have identified 12 large programs for which ratings based on graduate performance can be reasonably estimated. Suppose we generate rankings showing that program-A rated 3rd overall and program-B rated 10th. This would seem to imply a large difference in quality between the graduates across programs. However, available evidence from Koedel et al. (in press) and Goldhaber et al. (2013) suggests that the quality difference between graduates from these programs is likely quite small and may even be non-existent. As such, we worry that policymakers, practitioners, and potential students will overvalue these ranking differences in their decision-making.
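The ranking concern can be sketched with a small simulation. In the hypothetical setup below, 12 programs are given identical true quality by construction, so every difference in their estimated averages is sampling noise. The number of teachers per program, the teacher-level standard deviation, and the function name are illustrative assumptions, not values or methods taken from the Missouri data.

```python
import random
import statistics


def average_gap_between_ranks(
    n_programs: int = 12,
    teachers_per_program: int = 50,   # hypothetical sample size
    teacher_sd: float = 0.15,         # hypothetical teacher-level spread
    n_trials: int = 300,
    seed: int = 0,
) -> float:
    """Average estimated gap between the 3rd- and 10th-ranked programs
    when all programs have the same true quality (zero)."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        # Each program's estimate is the mean of its graduates' effects.
        program_means = [
            statistics.fmean(
                [rng.gauss(0.0, teacher_sd) for _ in range(teachers_per_program)]
            )
            for _ in range(n_programs)
        ]
        ranked = sorted(program_means, reverse=True)
        gaps.append(ranked[2] - ranked[9])  # 3rd-ranked minus 10th-ranked
    return statistics.fmean(gaps)


if __name__ == "__main__":
    gap = average_gap_between_ranks()
    print(f"Average 3rd-vs-10th gap with identical programs: {gap:.3f}")
```

Because the simulated programs do not differ at all, any apparent gap between the 3rd and 10th positions in a given draw reflects only which teachers happened to be sampled, which is the sense in which ranking differences can overstate real differences in graduate quality.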
So how can we be optimistic in the long run yet so pessimistic in the short run? If programs are not currently distinguishable using output-based metrics, why would this change, especially given that the statistical issues that make evaluating teacher preparation programs difficult will not? The source of our optimism in the long run is that we believe the development of a more rigorous system of evaluation in teacher preparation has the potential to foster innovation and generate real, detectable differences in quality across programs in the future. It is important to recognize that today's parity may be driven in part by the current policy environment where program performance is not measured and, as a result, programs have few external incentives to consistently train and certify graduates who will excel in the classroom. The development of an evaluation system in the short run, with the acknowledgement that its immediate value will be small, may reap long-term rewards if it fosters innovation and a stronger commitment to improvement from teacher preparation programs.
In summary, we hope to have conveyed both our long-run optimism and short-run pessimism about the value of rigorously evaluating teacher preparation programs based on the classroom performance of their graduates. The absence of any such evaluation right now does not encourage innovation or promote excellence, and thus it is not surprising that differences across programs are small. The development of more rigorous evaluations may lead to long-term improvements in teacher preparation, and given the importance of teacher quality for student success, this potential should not be overlooked. But it may take some time for changes to occur. Premature policy actions based on the currently small and largely undetectable differences across most programs may lead to bad decision-making, ranging from employers over-valuing differences in program rankings in hiring decisions to regulators prematurely anointing some programs as effective and other programs as not. Our recommendation is to continue to develop and implement these systems, monitor and report the results, but keep our finger off the trigger. This is the approach that is in the best interest of the primary stakeholders in teacher preparation: the students who are taught by program graduates in K-12 schools across the United States.
Boyd, D., Lankford, H., Loeb, S. & Wyckoff, J. (2011). Teacher Layoffs: An Empirical Illustration of Seniority v. Measures of Effectiveness. Education Finance and Policy, 6(3), 439–454.
Chetty, R., Friedman, J.N. & Rockoff, J.E. (2014a). Measuring the Impacts of Teachers I: Evaluating Bias in Teacher Value-Added Estimates. American Economic Review, 104(9), 2593–2632.
Chetty, R., Friedman, J.N. & Rockoff, J.E. (2014b). Measuring the Impacts of Teachers II: Teacher Value-Added and Student Outcomes in Adulthood. American Economic Review, 104(9), 2633–2679.
Condie, S., Lefgren, L. & Sims, D. (2014). Teacher Heterogeneity, Value-Added and Education Policy. Economics of Education Review, 40(1), 76–92.
Dee, T. & Wyckoff, J. (2013). Incentives, Selection, and Teacher Performance: Evidence from IMPACT. NBER Working Paper No. 19529.
Ehlert, M., Koedel, C., Parsons, E. & Podgursky, M. (2014). Choosing the Right Growth Measure: Methods Should Compare Similar Schools and Teachers. Education Next, 14(2), 66–71.
Goldhaber, D., Liddle, S. & Theobald, R. (2013). Gateway to the Profession: Assessing Teacher Preparation Programs Based on Student Achievement. Economics of Education Review, 34(1), 29–44.
Goldhaber, D. & Theobald, R. (2013). Managing the Teacher Workforce in Austere Times: The Determinants and Implications of Teacher Layoffs. Education Finance and Policy, 8(4), 494–527.
Hanushek, E. A. (2011). The Economic Value of Higher Teacher Quality. Economics of Education Review, 30(3), 466–479.
Kane, T. J., McCaffrey, D. F., Miller, T. & Staiger, D. O. (2013). Have We Identified Effective Teachers? Validating Measures of Effective Teaching Using Random Assignment. Seattle, WA: Bill and Melinda Gates Foundation.
Kane, T. J. & Staiger, D.O. (2008). Estimating Teacher Impacts on Student Achievement: An Experimental Evaluation. NBER Working Paper No. 14607.
Koedel, C., Parsons, E., Podgursky, M. & Ehlert, M. (in press). Teacher Preparation Programs and Teacher Quality: Are There Real Differences Across Programs? Education Finance and Policy.
United States Department of Education. (2011). Our Future, Our Teachers: The Obama Administration's Plan for Teacher Education Reform and Improvement.
Winters, M. A. & Cowen, J. M. (2013). Would a Value-Added System of Retention Improve the Distribution of Teacher Quality? A Simulation of Alternative Policies. Journal of Policy Analysis and Management, 32(3), 634–654.