
Caught in a Vise: The Challenges Facing Teacher Preparation in an Era of Accountability

by Rick Ginsberg & Neal Kingston - 2014

Background: Despite polling data suggesting that teachers are well respected by the general public, criticism of teacher preparation by various organizations and interest groups is common, often highlighting a perceived need to increase the rigor and performance of preparation programs. A number of studies and reports have critiqued teacher preparation, and high-profile leaders like Secretary of Education Arne Duncan have called for substantive changes. At the same time, the field of teacher preparation has been embracing change, including the idea of accountability based on student performance. Indeed, recently released evidence suggests that, in the area of clinical preparation, education programs require many hours of field placement experiences, countering one of the key criticisms of preparation programs.

Purpose: The purpose of this study is to examine the field of teacher preparation in the current era of accountability and testing. After a brief overview of the current context facing teacher preparation, the issue of outcome measures for varying professions is explored by comparing accreditation outcome measures utilized in selected professions. Then, the strengths and weaknesses of currently emerging assessment models are explored. Finally, a discussion of potential ways to assess teacher preparation program performance with an array of sources and measures is presented.

Research Design: The study is a combination of a secondary analysis and analytic essay. Outcome measures associated with 10 professions were examined by reviewing accreditation standards and documentation from published reports available on websites for the specific measures used to assess student success and program outcomes. As a means of validating findings, feedback was obtained from accreditation coordinators and/or other leaders in each profession. The analysis of currently emerging assessment models for teacher preparation was based upon a review of the literature on value-added and other similar assessments.

Conclusions/Recommendations: The review of professions found that all are struggling to find better means of assessing program outcomes, with a great deal of similarity in the processes currently in place across fields. Teacher education was found to employ a wider range of outcome assessments than any other profession. Significant concerns with currently promoted value-added models for assessing outcomes of teacher preparation were identified, with the use of multiple measures of evidence suggested as the best means for moving forward. We argue that teacher preparation programs are caught in a vise—with an appreciation and desire among those in the field for greater accountability while being squeezed by a sense that the approaches being suggested are prone to error and misuse.

For those NOT involved in education, this sort of "value-added" approach to education makes perfect sense, but those of us who are intimately involved in this calling understand that boiling education down to a single number is an exercise in futility. Should we all be held accountable? Absolutely! After all, we are talking about our most precious commodity—our children and their intellectual and emotional growth and well-being. But have we figured out the ultimate best method for this accountability system yet? Absolutely not!

—a State Department of Education leader, former superintendent, principal, and teacher (personal communication, September 19, 2012)

On December 2, 2006, Pfizer, the large pharmaceutical company, made a much-dreaded announcement. Torcetrapib, the drug on which the company was pinning much of its future in the cardiology market, then in its Phase III clinical trial (the last phase prior to seeking FDA approval), was being terminated. This was quite a shock, as at an investors’ meeting just days earlier, the CEO had announced that this same drug was “one of the most important compounds of our generation” (Lehrer, 2011). Hopes for the drug were high, and Pfizer had been planning for Torcetrapib to replace its best seller, LIPITOR®, whose patent would soon expire; generics would gravely impact the more than $10 billion it returned per year. Unlike LIPITOR, which lowers LDL, what physicians generally refer to as the “bad” cholesterol, the promise of Torcetrapib—in a class of drugs known as CETP inhibitors—was that it would raise levels of HDL, the so-called “good” cholesterol. Early studies were very promising, and evidence was clear that high levels of HDL are associated with lower risk of heart disease. For some, the promise of Torcetrapib was that it would act as a form of “roto-rooter,” clearing the arteries of plaque. It had anti-inflammatory, antithrombotic, and antioxidant effects. Estimates suggested that Pfizer had invested close to a billion dollars in the trial.

So, what happened? An independent panel of experts monitoring the trial discovered that the drug triggered higher rates of chest pain and heart failure than in the control group, along with a 60% increase in overall mortality. In real numbers, with 15,000 people in the trial (half on LIPITOR and Torcetrapib, half on LIPITOR alone), 82 patients in the experimental Torcetrapib group died versus 51 in the control group. The stakes for patients could hardly have been starker: live or die. The drug was pulled.
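The reported 60% figure follows directly from those mortality counts; as a quick arithmetic check (using only the numbers above):

```python
# Mortality counts reported for the terminated Torcetrapib trial.
deaths_treatment = 82  # LIPITOR plus Torcetrapib arm
deaths_control = 51    # LIPITOR-only arm

# Relative increase in mortality in the treatment arm:
# (82 - 51) / 51 is about 0.61, i.e., roughly the reported 60% increase.
relative_increase = (deaths_treatment - deaths_control) / deaths_control
```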

This paper, however, focuses on accountability and testing in education, specifically on accountability for teacher preparation programs. But the problems exemplified in the Torcetrapib saga hold lessons for us in education. In both cases, there is an overreliance on correlation to infer causality: an understandable circumstance in the highly public and vulnerable world of education, but still not a valid means of making important determinations about graduates of teacher preparation programs. In dealing with human subjects, causality is complex, involving a number of interactions and effects that are not well understood today. This suggests that making high-stakes decisions on the back of these correlations is problematic and prone to error.

The underlying thinking—what Henry Braun (2008) referred to as the theory of action—goes like this: Students go through a college preparation program, they graduate, they teach, and their students take tests, so the teacher preparation institutions should be judged by the scores those students earn on the state’s standardized tests. This logic is reductionistic and relies on the correlation of teacher preparation with the test scores of students taught by program graduates. Causation (here, the program’s effect on student performance) is simply inferred from the results of graduates’ students years after preparation ends.

In the case of Torcetrapib, what the researchers didn’t consider in the trials is what other effects the drug might have. It turns out that Torcetrapib triggers increased levels of the hormone aldosterone, which is tied to fatty plaque buildup in arteries and high blood pressure. And, the reality set in that there are several forms of HDL: the main alpha fraction and at least three others, along with four types of alpha and 16 lipoproteins, enzymes, and other proteins in an array of ratios among all of them (Lowe, 2008). These various HDL particles differ in substantive ways, such as shape, density, size, and antiatherogenic properties. So, it may be that you don’t want to raise HDL numbers across the board, but rather raise just one of them while lowering others. Joy and Hegele (2008) summed it up by saying that HDL metabolism is quite intricate, “with functional quality perhaps being a more important consideration than the circulating quantity of HDL” (p. 1), or simply that “some of the key beneficial mechanisms of HDL are incompletely understood” (p. 3). This is like a classroom with 30 students, all with differing experiences, different home backgrounds, and different levels of health and motivation, with students in school A in a very different context than those in school B—some with great mentoring, strong collegial relations, and principal leadership, and others not, and so on. Or, to put it in terms that we educators can all understand, correlation doesn’t equal causation.

Truly grasping the causal links among variables is a challenge for any field. In chemistry, for example, Stephen Johnson (2008) warned of the problem of overreliance on simple correlation as the causal link in chemical modeling used to predict biological activity. He refers to this assignment of causality based on correlated variables as the logical fallacy cum hoc, ergo propter hoc (with this, therefore because of this).
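The fallacy is easy to reproduce in a toy simulation. The sketch below (a hypothetical illustration, not from any of the cited studies) lets a hidden confounder drive two variables that have no causal link to each other; the two nonetheless end up strongly correlated:

```python
import random

random.seed(0)

# Hypothetical illustration: a hidden confounder z drives both x and y.
# x has no causal effect on y, yet the two are strongly correlated --
# cum hoc, ergo propter hoc.
n = 10_000
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 0.5) for zi in z]  # x is caused by z
y = [zi + random.gauss(0, 0.5) for zi in z]  # y is caused by z, not by x

def pearson(a, b):
    """Plain Pearson correlation, no external libraries."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5

r = pearson(x, y)  # close to 0.8 despite zero causal link between x and y
```

A mentor, a principal, or a student's home background can play the role of z just as easily as aldosterone did in the Torcetrapib trial.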

The purpose of this paper is to examine the field of teacher preparation in the current era of accountability and testing. We argue that teacher preparation programs are caught in a vise—with an appreciation and desire among those in the field for greater accountability being squeezed by a sense that the approaches being suggested are prone to error and misuse. After a brief overview of the current context facing teacher preparation, the issue of outcome measures for various professions is explored by comparing accreditation in selected professions. Then, the strengths and weaknesses of currently emerging assessment models are explored. Finally, a discussion of potential ways to assess teacher preparation program performance utilizing an array of sources and measures is presented.


Recently, Gallup and Phi Delta Kappa released their annual poll on Americans’ attitudes toward public education (Bushaw & Lopez, 2012). The results were largely positive about public education and teaching. For example, 72% of Americans say teachers gave their children praise or recognition for doing good work in the past seven days; 84% say they feel that their child is safe in school; 62% are willing to pay more money in taxes for public schools; 89% feel it is important to close the achievement gap; 84% believe this can happen while maintaining high standards (suggesting a lot of confidence in schools); 71% have trust and confidence in public school teachers; and 77% give the school their eldest child attends an A or a B rating. The single largest problem identified for public schools is the lack of financial support (35%), not bad teaching, teacher preparation, or failing schools, as much of the media and critics’ rhetoric would have us believe. A sense of crisis about American schools, with problems directed at teachers and teaching, is missing from these results.

The poll asked a number of targeted questions about teacher preparation. The results suggest that the public favors increasing the rigor for entry into teacher preparation programs. The public is split (52% in favor) with regard to the use of standardized tests to evaluate teachers. Indeed, 63% of those favoring their use responded that such tests should account for one third or more of teacher evaluations. At the same time, however, when asked to describe the characteristics of teachers who made a difference in their lives, the top responses were “caring,” “encouraging,” and “attentive/believe in me.” The standardized tests that about half of the public approves of for evaluating teachers provide no evidence of the characteristics of teachers who effectively impacted respondents’ lives. This is quite a conundrum, suggesting inconsistency, a great divide among the public, and a likely lack of understanding of the implications of using standardized tests to gauge the performance of new teachers.

This lack of understanding seems to be at the core of much of the criticism leveled at teacher preparation. Much of the current criticism is built upon perceptions of practices that modern teacher preparation programs refined long ago. Yet, as a subset of the entire education enterprise, teacher preparation is among the most highly criticized components. For example, the well-funded and self-anointed National Council on Teacher Quality (NCTQ), soon to release its own ratings of every teacher preparation program in the country about which it can obtain information, will have its findings published amid great fanfare by U.S. News & World Report. On its website, NCTQ makes clear that its rating of education schools rests on conclusions it reached prior to conducting any review: “Unlike other professional schools, teacher prep programs are held to weak standards, enabling ineffective programs to receive state approval and national accreditation. The result? Too few new teachers receive the knowledge and skills they need to be successful in the classroom.” Its president, Kate Walsh, has argued, “It is an accepted truth that the field is broken” (Kronholz, 2012, p. 3). Similarly, Shober (2012) of the American Enterprise Institute argued that “overhauling teacher education” is the least disruptive elixir and the key to what he called closing the teacher quality gap (p. 12).

Levine’s (2006) widely cited study of teacher preparation programs was more nuanced, but it found that teacher preparation programs were very diverse, of mixed quality, and seemingly not preparing teachers with the skills that principals agreed are most important for new teachers. He cited problems with curriculum, faculty disconnected from the field, low admission standards, insufficient quality control, and great disparities among institutions in terms of quality. Interestingly, the data from program graduates utilized for this study were based on students prepared between 1995 and 2000. In contrast, 80% of education school graduates in a 2008 survey reported that they were very (42%) or somewhat (35%) prepared for their first year of teaching (Public Agenda, 2008). More recently, Eduventures reported that nearly 80% of new teachers from a national sample indicated they were very or well prepared for the classroom (Eduventures, 2009).

Of course, criticism of teacher preparation is not new. The landmark report, A Nation at Risk (National Commission on Excellence in Education, 1983), lamented that not enough of the academically able students were being attracted to teaching and that teacher preparation programs needed substantial improvement. The Commission argued that the curriculum was too heavy with education methods courses and not enough content. A series of reform reports directed at improving teacher preparation followed in the 1980s and 1990s (e.g., Holmes Group, 1995). In 1998, Heather MacDonald of the Manhattan Institute was scathing in her denunciation of education schools as unchanging. She wrote, “Like aging vestal virgins, today’s schools (of education) lovingly guard the ancient flame of progressivism. Since the 1920s they have not had a single new idea; they have merely gussied up old concepts in new rhetoric, most recently in the jargon of minority empowerment. To enter an education classroom, therefore, is to witness a timeless ritual, embodied in an authority structure of unions and state education departments as rigid as the Vatican” (MacDonald, 1998, p. 4). George Will argued in a 2006 column that “The surest, quickest way to add quality to primary and secondary education would be addition by subtraction: Close all the schools of education” (2006, January 16).

In more recent years, leaders like Secretary of Education Arne Duncan have echoed this refrain. In a well-publicized speech at Teachers College in October 2009, Duncan argued that “by any standard, many if not most of the nation’s 1,450 schools, colleges, and departments of education are doing a mediocre job of preparing teachers for the realities of the 21st century classroom. America’s university-based teacher preparation programs need revolutionary change, not evolutionary tinkering” (Ed.gov, 2009, October 22). Just two weeks earlier, in a speech at the University of Virginia, Duncan had sounded a similar theme: “In far too many universities, education schools are the neglected stepchild. Too often they don’t attract the best students or faculty. The programs are heavy on educational theory and light on developing the core area knowledge and clinical training under the supervision of master teachers. . . . Student teachers are not trained in how to use data to improve their instruction.” He concluded, “So it is clear that teacher colleges need to become more rigorous and clinical, much like other graduate programs” (Ed.gov, 2009, October 9). And although Duncan did sound some cautious optimism about making changes, in a policy era that has demanded “scientifically-based evidence,” the assertions typically repeated are based on little or no data.

The irony amid these waves of critical commentary is that teacher preparation programs have been undergoing change for years and have embraced the idea of accountability focused on student learning. For example, the American Association of Colleges for Teacher Education (AACTE) opens its mission statement with the goal of promoting the learning of all P–12 students. The recent merger of the two teacher preparation accrediting bodies (NCATE and TEAC) created the Council for the Accreditation of Educator Preparation (CAEP), which has as its mission the preparation of highly qualified educators through the accreditation of programs in which data drive decisions; resources and practices support candidate learning; and candidates demonstrate knowledge, skills, and professional dispositions geared toward raising student achievement. Indeed, CAEP’s recently created Commission on Standards and Performance Reporting is charged with creating a system of standards that will “transform the preparation of teachers by creating a rigorous system of accreditation that demands excellence and produces teachers who raise student achievement.” Neither of these organizations, which together represent a significant portion of the institutions preparing future teachers, appears bent on defending past practices, ignoring the need for change, or refusing to consider the importance of P–12 learning in its work.

Recently released evidence directly refutes much of the criticism leveled at teacher preparation institutions. A common criticism of teacher preparation accreditation is that everyone passes. Recent data released by CAEP suggests otherwise. For example, examining NCATE’s accreditation decisions in 2011, of the 135 accreditation decisions, only 67% were fully accredited for up to seven years, and even many of these were noted with areas for improvement. Another 9% were approved after addressing areas for improvement from a review two years earlier, and 13% were approved for only two years with an additional site visit required. Eight percent of all decisions were deferments intended to provide the institutions additional time to clarify questions that arose during the decision-making process. Three percent of decisions resulted in accreditation being revoked, with another set of institutions (equivalent to 3% of the total) choosing to drop out prior to a site visit or final decision. The picture is hardly one of low expectations and simple assessment, but instead suggests a process applying standards that leads to hard decisions (Council for the Accreditation of Educator Preparation, 2013).

AACTE collects the Professional Education Data System (PEDS) data about its institutions, though other institutions not members of AACTE participate as well. These data suggest that field experiences are robust in teacher preparation programs. Clock hours were grouped by averages for programs with the lowest requirements and those with the highest requirements. At the BA level, in terms of actual clock hours, the averages for early field experiences (prior to student teaching) were 114 hours among the programs with the lowest requirements and 189 hours among those with the highest requirements. For master’s programs, they were 111 and 164 hours, respectively. For supervised clinical experiences/student teaching, the average clock hours were 500 among BA programs with the lowest requirements and 562 among the highest. At the master’s level, averages were 480 hours for the lowest requirements and 586 hours for the highest requirements. The average number of semesters of supervised clinical experiences/student teaching was 1.29 semesters for BA programs and 1.43 semesters for master’s degree programs. Thus, the total average number of hours of field experiences among the schools with the lowest requirements among bachelor’s programs was 614 hours and the average among schools with the highest requirements was 751 hours. For master’s programs the averages were 644 hours among the schools with the lowest requirements and 750 hours for schools with the highest requirements. To those who suggest that teacher preparation slights the importance of clinical experiences, these data suggest a very different picture (American Association of Colleges for Teacher Education, 2013).
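At the bachelor's level, the reported totals are simple sums of the two component averages; a quick consistency check on those figures:

```python
# AACTE PEDS average clock hours for BA programs (from the data above).
early_field_low, early_field_high = 114, 189  # early field experiences
clinical_low, clinical_high = 500, 562        # supervised clinical/student teaching

total_low = early_field_low + clinical_low    # 614 hours, as reported
total_high = early_field_high + clinical_high # 751 hours, as reported
```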

Indeed, accreditation standards in other fields often do not specify the number of hours required for clinical experiences (e.g., journalism and mass communications standards call for encouraging opportunities for internships and allow up to two semester courses at an appropriate professional organization, or nine semester hours at a professional media outlet where supervision is available and faculty are involved; ACEJMC Accrediting Standards, 2012). Calculations of required clinical experiences for entire professional fields are not typically available. In some professions, such as social work, clinical hours are specified: social work accreditation standards call for a minimum of 400 hours at the undergraduate level (Council on Social Work Education [CSWE], 2010). Similarly, nurse practitioner preparation, a graduate-level program, requires 500 supervised direct patient care clinical hours (National Task Force on Quality Nurse Practitioner Education, 2012). Teacher education, in practice, requires an equivalent number of hours or more.

With all this criticism, it is easy for those involved in teacher preparation to fall into a defensive stance, but that isn’t the intent here. Indeed, some of the most compelling arguments for making changes in teacher preparation come from those intimately involved in the work (e.g., see Bransford, Darling-Hammond, & LePage, 2005; Ball, Sleep, Boerst, & Bass, 2009; Blue Ribbon Panel on Clinical Preparation and Partnerships for Improved Student Learning, 2010). And the field itself created the edTPA (a portfolio-based assessment developed by 24 states and 160 participating institutions), a means of authentically assessing candidate performance during preservice training (see http://edtpa.aacte.org). It certainly is true that there are a number of highly intractable problems in P–12 education, like addressing achievement gaps and assisting the many high-poverty and at-risk students, often concentrated in our urban areas. Teacher preparation must ensure that its graduates have experience teaching children living in these conditions and must work with other segments of the education landscape in addressing these issues. But, amid the very vocal criticism of the field, it appears that teacher preparation has made advances that directly address areas of concern, and it is clearly pushing itself to improve in new ways. While it is no doubt true that a great deal of room for improvement remains, the overall picture painted by the critics is far bleaker than the reality suggests.

We now turn to examining other fields to determine how they have embraced calls for outcome measures in assessing the quality of their preparation.


Sociologists of organizations have long recognized a set of common characteristics of professions—high income, prestige and influence, high educational requirements, professional autonomy, licensure, commitment of members to the profession, codes of ethics, cohesion of the professional community, monopoly over a task, intensive adult socialization experiences for recruits, etc. (Goode, 1969). Today, professions all have some form of accreditation expectation. And, a consistent theme emerging among all the professions is how outcome measures can become a part of the accreditation process.

The Council for Higher Education Accreditation (CHEA) and the U.S. Department of Education (USDE) formally recognize accrediting organizations (regional accreditors, professional accreditors, etc.). Together, these two recognizing bodies account for 7,818 accredited programs. In all, 56 accreditors were recognized by CHEA, 54 by USDE, and 30 by both (CHEA, 2012). CHEA began its significant focus on student learning outcomes with the commissioning of a paper on the topic in 2001 (Ewell, 2001). A year later, it hosted a series of three Student Learning Outcomes Workshops addressing the issue (CHEA, 2002), which identified direct and indirect types of evidence for assessing student learning outcomes:

direct: capstone performances, professional/clinical performances, third-party testing (e.g., licensure), and faculty-designated examination

indirect: portfolios and work samples, follow-up of graduates, employer rating of graduates, and self-reported growth by graduates

CHEA’s interest in student outcome measures has not waned, notably addressed in its reports Accreditation and Accountability: A CHEA Special Report (CHEA, 2006) and Effective Practices: The Role of Accreditation in Student Achievement (CHEA, 2010). In a joint report prepared with the Association of American Colleges and Universities (AAC&U), aptly titled New Leadership for Student Learning and Accountability (CHEA & AAC&U, 2008), a series of six principles and eight actions were outlined “for meaningful educational accountability” (p. 1).

Others have also stressed the importance of student outcomes, some specifically tied to professional accreditation. In the legal profession, for example, two key reports, one produced by the Carnegie Foundation (Sullivan, Colby, Wegner, Bond, & Shulman, 2007) and one by the Clinical Legal Education Association (Stuckey et al., 2007), provided great impetus for considering student learning outcomes in the profession. The Carnegie study, for example, argued that all professional training seeks to “initiate novice practitioners to think, perform and to conduct themselves (that is, to act morally and ethically) like professionals” (Sullivan et al., p. 22). In 2008, the American Bar Association’s Section of Legal Education and Admissions to the Bar released a report from its Outcome Measures Committee (Carpenter et al., 2008), with the direct charge of determining “whether and how we can use output measures, other than bar passage and job placement, in the accreditation process” (p. 4).

As part of their analysis, they reviewed accreditation processes related to outcome measures across 10 other professions—medical doctors (MD), osteopathic medical doctors (DO), dentists (DD), veterinarians (Vet), pharmacists (Phm), psychologists (Psy), teachers (Tch), engineers (Eng), accountants (Acct), and architects (Arch). They isolated 28 assessment criteria that were used by 2 or more of the 10 professions. Table 1 lists these 28 assessment criteria, along with indicators of whether or not each profession utilized the method. Criteria are listed from most common to least common and professions from right to left, listed from those evaluated by the greatest number of criteria to those evaluated by the fewest.

Table 1. Assessment Criteria Used by Selected Professions

1. Licensure/certification exams
2. Evaluate clinical, problem-solving, and/or communication skills
3. Students possess competencies expected by public/profession
4. Criteria measure skills, knowledge, behaviors, and/or attitudes
5. Student portfolios used to measure competency
6. Learning objectives and evidence-based data for all competencies
7. Attrition/completion and timely graduation rates
8. Ongoing outcome assessment of student achievement
9. Attract, manage, serve (patient/client), and communicate with diverse community
10. Formative/summative evaluation of student achievement in class and/or clinic
11. Results of board exams before/requirements for graduation
12. Peer review and external evaluations
13. Students demonstrate self-initiated learning abilities/traits
14. Longitudinal tracking of careers and achievement
15. Preceptor assessment of competency and professional behavior
16. Ongoing assessment by faculty of curriculum and changes in profession
17. School fosters/assesses experimentation and innovation
18. Faculty scholarship
19. Employment rates
20. Mastery of technology
21. Acceptance into residency and internship programs
22. Program directors assess graduates’ preparation and professional behavior
23. Graduates assess their own preparation and professional behavior
24. Structured clinical exams
25. National standardized assessment instruments
26. Employer reports/satisfaction
27. State reviews of programs
28. Evaluations during mentoring year (evaluations during student teaching)

Adapted from Carpenter et al., 2008

The authors of the report were careful to explain that their attempts to describe the categories were as accurate as possible, though they understood that not every factor utilized by each profession was captured. Each discipline’s outcome-centered accreditation norms were approved by CHEA or USDE in the four years prior to the analysis. The number of assessment criteria utilized by the various professions ranged from a low of 5 (dentistry and engineering) to a high of 16 (teaching). The authors suggested that three broad models of accreditation emerged: (a) some accreditation agencies require certain assessments (while allowing the addition of others); (b) some models mandate several criteria but suggest others; and (c) some models delegate great authority to the schools in designating criteria. While the authors applied no weights to the various criteria, and the types of measures utilized were clearly linked to each profession’s mission, it is interesting to note that teaching accreditation utilized the most variables in its accreditation processes. In addition, the authors identified the six outcome measures that were most common among the 10 fields across the 28 assessment criteria analyzed. Those measures and their frequencies are licensure of graduates (7); evaluation of the clinical, problem-solving, and communication skills of students (7); criteria to ensure that students possess the competencies expected by the profession and public (6); evaluation of the skills, knowledge, and behavior/attitudes of students (6); student portfolios (5); and collection of evidence-based data on learning objectives/competencies (5). Teaching required all of these common criteria, the only one of the 10 professions whose accrediting body did so.

The medical profession is particularly interested in outcome measures for the field. A recent critique (Makary, 2012), for example, captured the concern that hospitals hide outcome data from patients when they could easily provide greater transparency to assist clients in selecting among their care options. The notion of “quality of care” has become a significant thrust in the medical field, with more attention being focused on medical education outcomes than ever before. A recent Rand Europe study (Nolte, Fry, Winpenny, & Brereton, 2011) examined the use of outcome metrics in preparing medical practitioners across the world. It concluded that “assuring and enhancing the quality of care should form the ultimate goal of high quality training but the available evidence of direct association between the quality of healthcare education and training and the quality of care provided remains scant” (p. xi). The Rand study also cited a Belgian Health Care Knowledge Centre literature review that could not find evidence of an association among quality training settings, learning outcomes, and physicians’ competencies.

In the United States, medical education at the MD level is controlled by the Liaison Committee on Medical Education (LCME). The LCME standards for accreditation leading to a medical degree include two standards related to outcomes in section “E” on Evaluation of Program Effectiveness (LCME, 2012):

ED-46. A medical education program must collect and use a variety of outcome data, including national norms of accomplishment, to demonstrate the extent to which its educational objectives are being met.

The medical education program should collect outcome data on medical student performance, both during program enrollment and after program completion, appropriate to document the achievement of the program’s educational objectives. The kinds of outcome data that could serve this purpose include performance on national licensure examinations, performance in courses and clerkships (or, in Canada, clerkship rotations) and other internal measures related to educational program objectives, academic progress and program completion rates, acceptance into residency programs, and assessments by graduates and residency directors of graduates' preparation in areas related to medical education program objectives, including the professional behavior of its graduates.

ED-47. In evaluating program quality, a medical education program must consider medical student evaluations of their courses, clerkships (or, in Canada, clerkship rotations), and teachers, as well as a variety of other measures.

It is expected that the medical education program will have a formal process to collect and use information from medical students on the quality of courses and clerkships/clerkship rotations. The process could include such measures as questionnaires (written or online), other structured data collection tools, focus groups, peer review, and external evaluation. (p. 15)

In essence, assessing the competency of MDs once they’ve departed the institution is left to using the equivalent of employer surveys with residency directors.

Medical education is complicated by the fact that all MDs must serve a residency, and the assessment of those programs is handled by the Accreditation Council for Graduate Medical Education (ACGME). ACGME has been struggling with developing systems for assessing quality of care provided by graduates since the late 1990s. Their Outcomes Project has gone through several phases and is working toward development of a system to assess the outcomes of graduate medical education that can be measured and quantified in an attempt to demonstrate “that clinical patient outcomes are associated with and linked to educational outcomes” (Haan et al., 2008, p. 574). The work is challenging (Swing, 2007). As Swing suggested, “Measurement of patient care quality, in particular using clinical process and outcome measures, is still in its infancy. Assessing care quality using patient care process measures associated with desirable outcomes has some advantages, but, to-date, a relatively small number of validated processes exist” (p. 653). What complicates the drive for finding quality measures to link preparation to patient care is what Haan et al. (2008) described as the ability to link a selected process or outcome measure to a particular resident. They reported, “Medical education does not occur in isolation, and most process and outcomes measures represent the group milieu in which treating and learning occur” (p. 579). They called for physicians to recognize their role and responsibility as part of a team, though they clearly understood that a great deal of work remains in order to identify what they called “the most optimal quality indicators and benchmarked targets” (p. 580). So, even in this secondary accreditation involving physicians already holding professional licenses from accredited institutions, a variety of difficult obstacles remain to be overcome in order to validly measure training and link it with patient outcomes.

Our research examined the use of outcomes in accreditation in 10 professions—law, medicine, social welfare, engineering, journalism, athletic training, psychology, business, pharmacy, and teaching. We examined the accreditation requirements for each field and gathered the specific measures utilized for assessing student success and outcomes from the documents available on accrediting body websites. As a validity check on our findings, we asked accreditation coordinators and other leaders in these professions for feedback on the descriptions we compiled. What follows is a brief synopsis of the state of the art in each profession.


Law schools are accredited by the American Bar Association. The current ABA standards consider outcomes by examining scores on the licensing test, the bar examination. They also require reporting employment outcomes nine months after graduation. Law school deans are required to determine if students have attained their identified learning outcomes. The ABA, as previously discussed, is examining new outcome measures. Below is a description of proposed standards that likely will become the basis for new approaches to addressing outcome measures (ABA Standards Review Committee, 2010):

Interpretation 304-1: Assessment activities and tools are likely to be different from school to school and law schools are not required by Standard 303 to use any particular activities or tools.

Learning and other outcomes should be assessed using tools both internal to the law school and external to the law school. The following internal tools, when properly applied and given proper weight, are among the tools generally regarded to be pedagogically effective to assess student performance: completion of courses with appropriate assessment mechanisms, performance in clinical programs, performance in simulations, preparation of in-depth research papers, preparations of pleadings and briefs, performance in internships, peer (student to student) assessment, compliance with an honor code, achievement in co-curricular programming, evaluation of student learning portfolios, student evaluation of the sufficiency of their education and performance in capstone courses or other courses that appropriately assess a variety of skills and knowledge. The following external tools, when properly applied and given proper weight, are among the tools generally regarded to be pedagogically effective: bar exam passage rates, placement rates, surveys of attorneys, judges, and alumni, and assessment of student performance by judges, attorneys or law professors from other schools. (pp. 5-6)


Medical schools are accredited by the Liaison Committee on Medical Education (LCME). The LCME bases its accreditation determinations on a report written by surveyors and site visits. As previously discussed, the medical field is examining its use of outcome measures for all its health-related fields. For physicians, the key outcome measures are the licensing examinations, criterion-referenced tests taken multiple times during a physician’s training in medical school through residency. Regarding the LCME, its standard ED-46 speaks directly to outcome measures (LCME, 2012).


Social work programs are accredited by the Council on Social Work Education (CSWE). Under its Educational Policy 2.1, Core Competencies, CSWE outlines 10 competencies: measurable practice behaviors composed of knowledge, values, and skills. Its Educational Policy and Accreditation Standards document (CSWE, 2010) states, “The goal of the outcome approach is to demonstrate the integration and application of the competencies in practice with individuals, families, groups, organizations, and communities” (p. 3). Regarding assessment of the 10 competencies, Educational Policy 4.0 addresses assessment through five standards:


The program presents its plan to assess the attainment of its competencies. The plan specifies procedures, multiple measures, and benchmarks to assess the attainment of each of the program’s competencies.


The program provides evidence of ongoing data collection and analysis and discusses how it uses assessment data to affirm and/or make changes in the explicit and implicit curriculum to enhance student performance.


The program identifies any changes in the explicit and implicit curriculum based on the analysis of the assessment data.


The program describes how it makes its constituencies aware of its assessment outcomes.


The program appends the summary data for each measure used to assess the attainment of each competency for at least one academic year prior to submission of the self-study.

In a recently revised document prepared to assist institutions with preparing their assessment materials (Holloway, 2012), several instruments or approaches were specified that programs might utilize in their plans to assess student achievement: portfolios, embedded measures, exit interviews, focus groups, surveys (e.g., employers, field-instructors, student exit surveys, and alumni), self-efficacy measures, standardized cases, and licensure exam results.


Engineering programs are accredited by ABET. ABET sets out eight general criteria for baccalaureate programs to meet (ABET, 2012). Two general criteria—Program Educational Objectives and Student Outcomes—are directly related to outcome measures:

Criteria 2—Program Educational Objectives (PEOs): The program must have published program educational objectives that are consistent with the mission of the institution, the needs of the program’s various constituencies, and these criteria. There must be a documented and effective process, involving program constituencies, for the periodic review and revision of these program educational objectives.

Criteria 3—Student Outcomes: The program must have documented student outcomes that prepare graduates to attain the program educational objectives. Student outcomes are outcomes (a) through (k) plus any additional outcomes that may be articulated by the program. (a) an ability to apply knowledge of mathematics, science, and engineering; (b) an ability to design and conduct experiments, as well as to analyze and interpret data; (c) an ability to design a system, component, or process to meet desired needs within realistic constraints such as economic, environmental, social, political, ethical, health and safety, manufacturability, and sustainability; (d) an ability to function on multidisciplinary teams; (e) an ability to identify, formulate, and solve engineering problems; (f) an understanding of professional and ethical responsibility; (g) an ability to communicate effectively; (h) the broad education necessary to understand the impact of engineering solutions in a global, economic, environmental, and societal context; (i) a recognition of the need for, and an ability to engage in life-long learning; (j) a knowledge of contemporary issues; and (k) an ability to use the techniques, skills, and modern engineering tools necessary for engineering practice.

PEOs are defined on the ABET website as broad statements that describe what graduates are expected to attain within a few years of graduation. PEOs are based on the needs of the program’s constituencies. In practice, for General Criteria 2, it is expected that programs will assess PEOs 3–5 years after graduation. Most institutions use alumni surveys, employer surveys, and input from groups like advisory boards to inform these assessments.

Student outcomes are defined on the ABET website as describing what students are expected to know and be able to do by the time of graduation. These relate to the knowledge, skills, and behaviors that students acquire as they progress through the program. Each degree program defines its assessment program using direct and indirect measures (grades, assignments, simulations, portfolios, surveys, etc.) typically developed into a matrix across the curriculum.

ABET General Criteria 4 is Continuous Improvement. This, too, relates to outcome measures with a focus on using data for program improvement.


The accrediting agency for journalism is the Accrediting Council on Education in Journalism and Mass Communications (ACEJMC). There are nine standards, including Standard 9, Assessment of Learning Outcomes, which includes five indicators (ACEJMC, 2012):


The unit defines the goals for learning that students must achieve, including the “Professional Values and Competencies” of this Council.


The unit has a written assessment plan that uses multiple direct and indirect measures to assess student learning.


The unit collects and reports data from its assessment activities and applies the data to improve curriculum and instruction.


The unit maintains contact with its alumni base to assess their experiences in the professions and to provide suggestions for improving curriculum and instruction.


The unit includes members of journalism and mass communication professions in its assessment process.

The following items are listed as “Evidence”: (a) a written statement on competencies; (b) a written assessment plan; (c) evidence of alumni and professional involvement in assessment, such as surveys, advisory boards, social media initiatives, portfolio reviews, and other activities; (d) records on information collected from multiple measures of assessment and on the application of this information to course development and improvement of teaching, ensuring that the assessment findings have been systematically gathered, synthesized, and applied; and (e) end-of-year unit summary assessment report and analysis.


Athletic training is accredited by the Commission on Accreditation of Athletic Training Education (CAATE). CAATE operates with ten standards (CAATE, 2012). Standard II is Outcomes. It has five components—develop a plan, assessment measures, collect the data, data analysis, and action plan. In practice, outcomes that are required include the following information:

clinical site evaluations

clinical instructor evaluations

completed clinical proficiency evaluations

academic course performance

retention and graduation rates

graduating student exit evaluations

alumni placement rates one year post graduation

A recent addition is the requirement to report aggregate Board of Certification (BOC) examination data for the most recent three test-cycle years on the following metrics: the number of students graduating from the program who took the examination, the number and percentage of students who passed the examination on the first attempt, and the overall number and percentage of students who passed the examination regardless of the number of attempts. Programs with a three-year aggregate BOC first-time pass rate below 70% must provide an analysis of the deficiencies and develop an action plan for correction.
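The three-year aggregation and the 70% trigger are simple arithmetic; a minimal sketch of the calculation, with entirely hypothetical cohort figures (only the 70% first-attempt threshold comes from the standard described above):

```python
# Sketch of CAATE's three-year aggregate BOC pass-rate calculation.
# The cohort numbers below are hypothetical; only the 70% first-attempt
# threshold is taken from the published requirement.

def boc_aggregate(cycles):
    """cycles: list of (tested, passed_first_attempt, passed_overall) per test-cycle year."""
    tested = sum(c[0] for c in cycles)
    first = sum(c[1] for c in cycles)
    overall = sum(c[2] for c in cycles)
    return {
        "tested": tested,
        "first_attempt_rate": 100.0 * first / tested,
        "overall_rate": 100.0 * overall / tested,
    }

# Three hypothetical test-cycle years.
results = boc_aggregate([(20, 13, 18), (25, 17, 23), (15, 11, 14)])
needs_action_plan = results["first_attempt_rate"] < 70.0
```

On these invented numbers, 41 of 60 graduates pass on the first attempt, an aggregate rate of about 68%, so the hypothetical program would owe CAATE an analysis of deficiencies and an action plan.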

Some additional measures that are typically utilized include student prerotation goals, student mid- and end-of-semester evaluation by clinical instructors, clinical instructor self-evaluations, and student foundational professional behavior evaluations.


Psychologists (PhDs) are accredited by the American Psychological Association (APA). Accreditation applies to counseling psychologists, clinical psychologists, and school psychologists. The APA requires professional psychology programs to report outcome data, which in turn are dependent on individual student outcome data. Programs have some latitude in establishing the required competencies, as programs define their own training model, program goals, and objectives. However, there are core competencies that all programs must implement. Exactly how competencies should be evaluated is not strictly defined, though the APA Commission on Accreditation (CoA) has established an Implementing Regulation that provides parameters for the outcomes evaluation for doctoral programs.

The CoA requires all accredited programs to provide outcome data on the extent to which the program is effective in achieving its goals, objectives, and competencies. The Implementing Regulation clarifies the type of data CoA needs to make an accreditation decision for doctoral programs.

According to the Guidelines and Principles (G&P) for doctoral programs (F.1a), the program, with appropriate involvement from its students, engages in regular, ongoing self-studies that address its effectiveness in achieving program goals and objectives in terms of outcome data (i.e., while students are in the program and after completion). Accredited doctoral programs specify their goals, objectives, and competencies. It is each program’s responsibility to collect, present, and utilize (a) aggregate proximal outcome data that are directly linked to program goals, objectives, and competencies; and (b) aggregate distal outcome data that are directly linked to program goals and objectives (APA, 2012).

Proximal data are defined as outcomes for students as they progress through and complete the program that are linked to the program’s goals, objectives, and competencies. Typically, these include evaluations of students’ performance (e.g., in courses) and more objective data (e.g., number of presentations/publications). Distal data are defined as outcomes for students after they have completed the program that are linked to the program’s goals and objectives (e.g., alumni surveys and graduates’ professional accomplishments such as licensure and employment). Finally, aggregate data are compilations of proximal data and of distal data across students that may be presented by cohort, program year, or academic year. Aggregate data demonstrate the effectiveness of the program as a whole, rather than the accomplishments of individual students over time.


Business programs are accredited by the Association to Advance Collegiate Schools of Business (AACSB). AACSB provides accreditation standards for both business and accounting programs at the baccalaureate, master’s, and doctoral levels. The guidelines require that each degree-granting program have an assessment process in place. For business, the general standards include (a) strategic management, (b) students and faculty, and (c) assurance of learning. Each of these has its own set of individual standards, and the assurance of learning standard is directly related to outcome measures. The AACSB revised standards document (AACSB, 2012) describes the intent of the assurance of learning standards:

Assurance of Learning Standards evaluate how well the school accomplishes the educational aims at the core of its activities. The learning process is separate from the demonstration that students achieve learning goals. Do students achieve learning appropriate to the programs in which they participate? Do they have the knowledge and skills appropriate to their earned degrees? Because of differences in mission, student population, employer population, and other circumstances, the program learning goals differ from school to school. Every school must enunciate and measure its educational goals. Few characteristics of the school are as important to stakeholders as knowing the accomplishment levels of the school's students when compared against the school's learning goals.

Assurance of learning to demonstrate accountability (such as in accreditation) is an important reason to assess learning accomplishments. Measures of learning can assure external constituents, such as potential students, trustees, public officials, supporters, and accreditors, that the organization meets its goals.

Another important function for measures of learning is to assist the school and faculty members to improve programs and courses. By measuring learning the school can evaluate its students’ success at achieving learning goals, can use the measures to plan improvement efforts, and (depending on the type of measures) can provide feedback and guidance for individual students (p. 60).

AACSB describes both direct and indirect measures to be used for the assurance of learning standards, though it emphasizes that indirect measures can supplement, but not replace, the direct measures. Examples of direct measures include student selection criteria, course-embedded measurement, and demonstration through stand-alone testing or performance. Indirect measures include things like surveying alumni and surveying employers.

AACSB also accredits accounting programs, which have Assurance of Learning Standards as well. Nine standards are utilized across undergraduate, master’s, and doctoral programs to address assurance of learning, largely focused on criteria related to curriculum and course activities (see AACSB Accounting Accreditation Standards—Assurance of Learning at http://www.aacsb.edu/accreditation/accounting/standards/aol/). In addition, Standard 33, part of the Accounting Participants Standards, calls for providing information on graduates’ placement (after three months) and career success (after an appropriate time, using graduate and employer surveys).


The doctor of pharmacy (PharmD) program is accredited by the Accreditation Council for Pharmacy Education (ACPE). Its most recent standards were adopted in 2011 (ACPE, 2011). In all, there are 30 standards broken into six categories: (a) Standards for Mission, Planning, and Evaluation; (b) Standards for Organization and Administration; (c) Standards for Curriculum; (d) Standards for Students; (e) Standards for Faculty and Staff; and (f) Standards for Facilities and Resources. Regarding evaluation, the standards document explains the purpose this way:

The college or school must establish and implement an evaluation plan that assesses achievement of the mission and goals. The evaluation must measure the extent to which the desired outcomes of the professional degree program (including assessments of student learning and evaluation of the effectiveness of the curriculum) are being achieved. Likewise, the extent to which the desired outcomes of research and other scholarly activities, service, and pharmacy practice programs are being achieved must be measured. The college or school must use the analysis of process and outcome measures for continuous development and improvement of the professional degree program (p. 6).

In practice, postgraduate PharmD assessment typically includes at least five measures:

North American Pharmacist Licensure Examination (NAPLEX), the “board” exam required to practice, available only to PharmD graduates of an accredited school of pharmacy;

Multistate Pharmacy Jurisprudence Examination (MPJE), the “law board” exam and the second board component a PharmD graduate must pass to practice pharmacy;

acceptance into competitive residency programs, which allow specialization more appropriate to practice in institutional settings (very competitive, with only 3,000 positions available for 4,500 applicants);

job placement; and

postgraduation surveys, although considered optional.

While students matriculate through the PharmD program, student learning and curricular effectiveness must be evaluated. Though the standards do not identify specific required elements, suggested methods include student portfolios, simulations, course examinations, formative and summative assessments, licensure examination results, self-assessment, and preceptor and faculty assessments of student development. One example we were provided used four major measures of student knowledge and performance: (a) course examinations; (b) a standardized client program (i.e., case challenges for students using actors trained as patients, physicians, and nurses), which measures base knowledge, communication skills, and rudimentary counseling skills each year; (c) the Introductory Pharmacy Practice Experiential (IPPE) Program, which places students with a preceptor in the summer after the first and second academic years; and (d) the Advanced Pharmacy Practice Experiential (APPE) Program, which places students with nine different preceptors in a diversity of pharmacy settings over a 10-month period in the final year of the program.


The majority of teacher preparation programs are accredited by the National Council for Accreditation of Teacher Education (NCATE). However, a second accrediting agency, the Teacher Education Accreditation Council (TEAC), has been active since the 1990s and has accredited roughly one fourth to one third as many institutions as NCATE. Recently, these two agencies agreed to merge into a single accrediting agency, the Council for the Accreditation of Educator Preparation (CAEP). For purposes of this analysis, the NCATE standards are applied, though CAEP recently appointed a national commission to create new standards for the emerging organization that promise to be more rigorous than those utilized in the past.

NCATE accreditation is complicated by the fact that states license students, so all programs must be state approved regardless of their NCATE status. In practice, many states have partnerships with NCATE, though some require a state review of program content, while others allow for specialized professional associations (e.g., math teachers association) to perform the curricular review embedded in the NCATE accreditation process. NCATE has six standards that programs must meet: (a) candidate knowledge, skills, and professional dispositions; (b) assessment system and unit evaluation; (c) field experiences and clinical practice; (d) diversity; (e) faculty qualifications, performance, and development; and (f) unit governance and resources (NCATE, 2008).

Regarding Standard 2, NCATE requires that the unit has an assessment system that collects and analyzes data on applicant qualifications, candidate and graduate performance, and unit operations to evaluate and improve the performance of candidates, the unit, and its programs. This standard and its subcomponents are evaluated as unacceptable, acceptable, or target. NCATE suggests that evaluation systems usually have the following features:

Unit faculty collaborate with members of the professional community to implement and evaluate the system.

Professional, state, and institutional standards are key reference points for candidate assessments.

The unit embeds assessments in programs, conducts them on a continuing basis for both formative and summative purposes, and provides candidates with ongoing feedback.

The unit uses multiple indicators (e.g., 3.0 GPA, mastery of basic skills, general education knowledge, content mastery, and life and work experiences) to identify candidates with potential to become successful teachers or assume other professional roles in schools at the point of entry into programs (as a freshman, junior, or postbaccalaureate candidate).

The unit has multiple decision points, (e.g., at entry, prior to clinical practice, and at program completion).

The unit administers multiple assessments in a variety of forms and aligns them with candidate proficiencies. These may come from end-of-course evaluations, written essays, or topical papers, as well as from tasks used for instructional purposes (such as projects, journals, observations by faculty, comments by cooperating teachers, or videotapes) and from activities associated with teaching (such as lesson planning, identifying student readiness for instruction, creating appropriate assessments, reflecting on results of instruction with students, or communicating with parents, families, and school communities).

The unit uses information available from external sources such as state licensing exams, evaluations during an induction or mentoring year, employer reports, follow-up studies, and state program reviews.

The unit has procedures to ensure credibility of assessments: fairness, consistency, accuracy, and avoidance of bias.

The unit establishes scoring guides, which may be rubrics, for determining levels of candidate accomplishment and completion of their programs.

The unit uses results from candidate assessments to evaluate and make improvements in the unit, and its programs, courses, teaching, and field and clinical experiences.

In the evaluation of unit operations and programs, the unit collects, analyzes, and uses a broad array of information and data from course evaluations and evaluations of clinical practice, faculty, and admissions (p. 28).


Our analysis of the 10 professions yielded several key findings, consistent with some of the major outcomes of the prior studies reported here. First, programs typically have a great deal of leeway in devising their assessment systems and plans. While some professions specifically mandate certain elements, even certain instruments or methods, the general pattern is to allow institutions to devise plans specifically tied to their stated missions and standards.

Second, the accrediting bodies typically differentiate between what are variously called direct and indirect, proximal and distal, or internal and external measures. The greater reliance, and for some accrediting bodies a requirement, is on the internal, direct, proximal measures, as these are best suited to assessing student outcomes while students matriculate through their course of study. Several fields are cautious about the use of the other types of measures.

Third, a great deal of similarity appears in the accreditation standards and practices across the various fields. As the ABA’s Outcome Measures Committee study revealed, some professions utilize more assessment measures than others, but there is a great deal of what institutional theorists have referred to as institutional isomorphism (DiMaggio & Powell, 1983): that is, considerable homogeneity among the professions. Perhaps this is the result of the CHEA and USDE recognition that many professions seek. Though we didn’t analyze this, we believe there may be as much variation within fields as across fields in the types of data employed for outcome measures in professional accreditation.

Finally, measuring graduates’ work performance after graduation as an indicator of preservice training largely relies on some form of employer satisfaction survey and graduate/alumni self-efficacy survey. While medicine and the other health fields are keenly aware of the quality-of-care issue facing them, their growing use of data linking preparation to practice occurs at the graduate level, after the MD is already earned, through ACGME accreditation of residency programs. Here too, however, outcome performance is largely measured with a form of employer (e.g., residency director) satisfaction survey.

Several accreditation directors highlighted the difficulty of measuring the performance of graduates once preparation is completed and they are in the workforce. One from engineering, for example, explained that PEOs are difficult to measure directly, that the alumni surveys typically used are only marginally effective, and that employer surveys can also be problematic; professional advisory boards, because they hire graduates, can supply more valuable evidence. A pharmacy director shared similar concerns, explaining that discussions are underway about potential measures, including quality of patient care (there are moves to develop measures for different pharmacy settings) and health-care cost savings (pilot studies demonstrate how groups of pharmacists can achieve such savings, though they are difficult to attribute to an individual). Such discussions, however, recognize the problems of relating these outcomes to the education an individual received at a particular institution.


Our culture has a fascination with numbers, quantification, and rankings. In an elegant little book called The Numbers Game, Blastland and Dilnot (2009) argued that numbers provide the means for making sense of our vast and complicated world. But, they sounded a warning. While counting is easy, “life,” they explained, “is messier than a number” (p. 1). That is why, in their terms, we end up squashing things into a shape that fits the numbers.

We experienced a stark example of their caution in a recent newspaper story that ranked our University of Kansas football team’s home advantage as 14th best in the country. That was astounding to us, given that over the past three years our home record was 8 wins and 13 losses, with the wins largely attributable to scheduling weaker teams and likely victories at the beginning of each season. It turns out that this ranking was based on a very complicated mathematical formula devised by a website called the “Prediction Machine” (http://predictionmachine.com/college-football-homefield-advantage), and that the ranking was due to a variety of factors including swings in wins at home versus on the road, win differentials, etc. The numbers didn’t lie, but what the prediction actually captured were teams that didn’t necessarily win consistently at home and on the road, since for them there would be no home-field advantage. Thus Alabama, winner of several recent national championships, was ranked 106th. Just for sanity’s sake, given that we are at the institution where James Naismith, the inventor of basketball, worked for the majority of his career, we also checked the website’s ranking of home-court advantage in basketball. For the 2012 NCAA Division I basketball championship game, the teams were ranked as follows: 86 for Kansas and 134 for Kentucky, the eventual champion (http://predictionmachine.com/college-basketball-homecourt-advantage).

Sports aficionados eat this stuff up, and talking heads on multiple radio and television shows wax eloquent about rankings nearly every day. Each year, a complicated statistical formula dictates which two teams play in the football national championship game, a process that typically creates a firestorm of debate over the teams left out. But, it isn’t just in sports that our culture is enamored with numbers. In education, we are all familiar with the rankings and ratings provided by news magazines and other agencies, and many school districts and schools are rated and ranked yearly. Some are highly critical of these ratings and rankings (e.g., see Gladwell, 2011), arguing that the process of selecting which factors enter the analysis completely shapes the ultimate rank. Thus, for example, the two highest ranked law schools in the 2012 ranking by U.S. News and World Report, Yale (#1) and Stanford (#2)—schools that incidentally have the lowest acceptance rates for students—were ranked 36th (Yale) and 64th (Stanford) in 2009 by the Internet Legal Research Group (http://www.ilrg.com/rankings/law/index.php/1/desc/Bar/2009), which ranks law schools solely on bar examination passage rates. Its top 10 that year included Marquette University (tied for #1), Campbell University (#9), and Ave Maria School of Law (#10). U.S. News and World Report utilizes a wide variety of variables in setting its rankings. Obviously, which variables are used and emphasized to compute rankings makes a huge difference. The numbers are accurate as computed, but deriving meaning and validity is complicated.
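The point that variable selection drives rank can be made concrete with a small sketch. The schools, indicators, and weights below are entirely hypothetical; this is not the actual U.S. News or ILRG methodology, only an illustration of how reweighting reorders a ranking:

```python
# Hypothetical schools with two made-up indicators (higher = better).
schools = {
    "School A": {"bar_pass": 0.97, "selectivity": 0.60},
    "School B": {"bar_pass": 0.85, "selectivity": 0.95},
    "School C": {"bar_pass": 0.90, "selectivity": 0.80},
}

def rank(weights):
    """Order schools by a weighted sum of the chosen indicators."""
    def score(name):
        return sum(w * schools[name][k] for k, w in weights.items())
    return sorted(schools, key=score, reverse=True)

# Weighting bar passage alone puts School A on top...
print(rank({"bar_pass": 1.0}))               # ['School A', 'School C', 'School B']
# ...while weighting selectivity heavily reverses the order.
print(rank({"bar_pass": 0.3, "selectivity": 0.7}))  # ['School B', 'School C', 'School A']
```

The underlying numbers never change; only the weights do, yet the “best” school flips, which is precisely the critique leveled at competing law school rankings.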

A major thrust driving debate in the teacher preparation field is the use of value-added measures (VAMs) or growth measures as a means of assessing the quality of preparation programs. Driven in large part by Race to the Top and other federal initiatives, the idea is to assess the quality of a preparation program through the value-added performance of its graduates with P–12 students once they are teaching. Indeed, the waivers from NCLB requirements that many states pursued require the use of VAMs or growth measures to assess schools and evaluate teachers, and the USDE’s forthcoming regulations regarding the Higher Education Act will most certainly require these assessments to evaluate preparation programs.

We argue that we should view and utilize these numeric classifications of teachers or teacher preparation programs with the same caution that any rankings demand. We explain our concern by identifying three theoretical reasons for caution and then offering multiple specific psychometric and practical problems that have to be considered. Our caution stems from the understanding that the basis for the reliance on test scores to assess teachers or teacher preparation programs ultimately comes down to a hypothesized causal relationship—that is, a well-prepared teacher should have a positive influence on student outcomes as measured by a state assessment. This is logical and simple, easily understood, but ignores the reality that correlation does not necessarily equate to causation, and that other factors may be at play, as depicted in the story of Pfizer’s Torcetrapib.

We call our first caution the iatrogenic concern. Iatrogenic diseases in medicine are those that are physician or hospital induced, that is, diseases caused by errors unrelated to the reason for seeking care. Typically, these outcomes aren’t purposeful, but iatrogenic diseases cause real damage. So, for example, you go into the hospital for an operation on your kidney, and end up with a staph infection. Or, you are prescribed a medication that causes all sorts of other complications. A study in the Journal of the American Medical Association (Starfield, 2000) found that such iatrogenic diseases were the third leading cause of death in the United States after heart disease and cancer. In education, our iatrogenic concern relates to the studies reporting teaching to the test, shrinking curriculum, and an array of other effects of high-stakes testing that may be negatively impacting schools (Berliner, 2011; Center on Education Policy, 2009; McMurrer, 2007).

Related to this is the economic term negative externalities, which describes the consumption or production of a good that causes harmful effects to a third party. In other words, a negative externality arises when a company or firm doesn’t have to pay the full cost of a decision. So, for example, it is a negative externality if a company produces chemicals that cause toxic deposition to land so builders can’t build, or alters water composition so fishermen can’t fish because the fish died or moved to other locations, and the producers aren’t held responsible. In personal terms, it may be as simple as playing loud music at night, making it impossible for a neighbor to sleep. (This description of negative externalities was drawn from two websites—http://economics.fundamentalfinance.com/neagtive-externality.php and http://www.economicshelp.org/marketfailure/negative-externality.html.) Some negative externalities that the VAMs may be causing include the shrinking of curriculum and narrowly teaching to the test.

Perhaps the most widely known of these conceptual cautions is the concept known as Campbell’s Law (Campbell, 1976). It goes as follows: “The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor” (p. 49). Campbell discussed this topic using examples from multiple social programs and situations, including education, but was clear that such effects can be the outcome for education systems bent on having to raise scores to be considered successful. As he explained, “When test scores become the goal of the teaching process, they both lose their value as indicators of educational status and distort the educational process in undesirable ways” (p. 52).

What all these concepts share—iatrogenic concerns, negative externalities, and Campbell’s Law—is the implication that unintended consequences are possible and can have devastating effects on the very issues that policies are trying to impact. Consequences need to be carefully considered and policies changed once the reality of a negative outcome is realized. Sadly, as economic writer Tim Harford pointed out in a Ted.com talk (http://www.ted.com/talks/tim_harford.html), too many policymakers are imbued with what he calls the God Complex, which leads to the inability to change policies once in place, even when they don’t work as intended or cause problematic outcomes.

Specifically in terms of VAMs and growth models, we identify a number of problems related to their use with teachers in school districts and in assessing teacher preparation programs. VAMs and growth models aren’t the same, much as there are multiple types of VAMs themselves. So, this review is just a simple compilation of concerns with these types of tests as utilized in making decisions for individuals. When assessing teacher preparation, the focus of analysis is a group, not an individual, involving many teachers and classroom settings, so it is likely that some of the reliability issues may be less severe than they are for decisions about individuals (Gansle et al., 2012). But, many of the validity and bias concerns we raise remain. Others in this volume and elsewhere examine all these issues in more detail (e.g., Goldschmidt, Choi, & Beaudoin, 2012; Harris, 2011; Newton, Darling-Hammond, Haertel, & Thomas, 2010). For our purposes, as depicted in Table 2, Concerns with overreliance on test scores, we break these issues into four areas: (a) curriculum and teaching, (b) students, (c) teachers and teacher preparation, and (d) psychometric. We don’t view this as an all-inclusive list, and realize that in some instances our placement of factors could easily be altered from one category to another.

Table 2. Concerns with Overreliance on Test Scores

Curriculum and Teaching

Teaching to the test

Shrinking the curriculum

Students

Nonrandom distribution of students to schools and classrooms

Quells teaching of creativity

Teachers and Teacher Preparation

Nonrandom distribution of teachers to undergraduate and graduate programs

Students take courses, credentials, and degrees from multiple institutions

Can’t be applied to all teachers or educators who don’t test

Psychometric

Lack of reliability of difference scores (Time 1–Time 2)

Regression to the mean

Decision consistency

Tests capture a limited perspective of student learning

Causal claims may not be warranted

Quantified metrics tend to drive decisions

Instructional insensitivity

Teaching to the Test

Numerous researchers have lamented the concern that the overreliance on test scores for accountability purposes can cause problems such as narrowing of the curriculum, teaching to tests, and the like (Berliner, 2011; CEP, 2009; Falk, 2002; Gordon Commission, 2012). We describe this as a prime example of the tail wagging the proverbial dog. Teachers will teach to tests and their leaders will promote a focus on test outcomes, especially when such measures are the driving factor for evaluative purposes. Some simply call this an example of bad teaching practice (Ritter & Shuls, 2012), but in the world of high-stakes assessment, this is optimizing behavior and should be expected. One option would be for test makers to develop tests that are worth teaching to. The two consortia creating tests for the Common Core Standards hold out some promise for altered types of tests that go beyond a focus on short-term memory recall items, but the evidence isn’t yet clear that this will happen. Indeed, our concern is that the very accountability measures, the VAMs aimed at assessing student performance, will inevitably lead to worse teaching due to reactions to the types of accountability requirements being promoted. Mathematicians and scientists studying accountability approaches in college courses have compared multiple-choice and constructed-response questions and offered some explanation for this. They consistently concluded that multiple-choice questions, the kind that VAMs rely on, are incapable of capturing anything beyond low-level cognitive processing, hindering critical thinking (Dufresne, Leonard, & Gerace, 2002; Martinez, 1999; Stanger-Hall, 2012). Similarly, the Gordon Commission (2012) recently concluded that “several studies, using several different methodologies, have shown that the state tests do not measure the higher order thinking, problem solving, and creativity needed for students to succeed in the 21st century” (p. 2).

Our concern is deeper than this. In a high-stakes accountability environment, what sorts of teaching might we expect to see proliferate? Researchers examining the development of teachers evolving from novice to expert have highlighted a continuum of phases that expert teachers must evolve through. Twiselton (2007, 2012), for example, examined teachers in training and highlighted that expert teachers grow in their practice through three phases: from task manager (keeping students on task, orderly, and completing assigned tasks), to curriculum deliverer (prescribed learning, dictated by someone else, and curriculum as the goal itself), and finally, to concept/skill builder (focused on concepts and skill development, tasks as a vehicle for learning, and transferable and transformational learning). When higher test scores are the desired outcome measure, our concern is that many teachers will never move beyond the two early stages of teaching into the more expert concept and skill-building phase. If expert teaching isn’t captured in the high-stakes assessments proliferating through American schools and the type of information that experts impart isn’t assessed, expert instruction may become a lost art. The Gordon Commission (2012) highlighted the multiple negative impacts on good teaching that especially impact poor and minority students.

Sadly, teaching to the test can be taken to the extreme. Over the last several years, newspapers have increasingly been reporting cases of educators changing student answers to increase classroom, school, and district scores. A particularly severe example of educator cheating that occurred in Georgia was documented by Kingston (2013). While cheating may be the result of unethical behavior by educators, it may also be due to educators covering for material they weren’t able to review with students. This is the sort of corruption and distortion predicted by Campbell’s Law due to the overreliance on quantitative measures for decision making. Obviously, educator cheating would impact the validity of VAM scores.

Shrinking Curriculum

As suggested above, when accountability is built on test scores in selected subjects, it is likely that other subjects will receive less emphasis. The Center on Education Policy (CEP) examined accountability in three states (CEP, 2009) and found a consistent focus on test preparation in classroom instruction and a narrowing of the curriculum to address the emphasis placed on tested subjects. Others have reported the same sort of narrowing of curriculum due to the significance placed on the tested subjects (Berliner, 2011; Hamilton et al., 2007; McMurrer, 2007).


Non-Random Distribution of Students

Although policymakers and school leaders may choose to ignore a well-established statistical requirement, the lack of random assignment of students to classrooms undermines reliance on tests as a way of gauging teacher or school performance. Though quasi-experimental research designs and value-added test models judiciously strive to capture as many confounding variables as possible, they do not capture them all. This doesn’t necessarily mean the tests are useless for assessment purposes (e.g., see Harris, 2011, 2012), but it does leave any results questionable if used as the primary indicator, and immoral if they are relied upon for high-stakes decisions.

Quells Teaching of Creativity

Zhao (2012) argued that overattention given to standardized tests zaps creativity out of classrooms. When the focus is on increasing the test scores, one of the great strengths of American education is bound to be lost—the characteristic of allowing individuality and creativity to emerge. Indeed, Zhao showed that American education is currently a leader in measures of creativity and entrepreneurship, and that many Asian countries that are traditionally heavily focused on pumping up test scores are now striving to make their schools more like their American counterparts in order to bolster their ability to foster greater creativity.


Non-Random Distribution of Teachers

Much as with students, teachers are rarely randomly assigned to classrooms, except in certain research situations. Some evidence suggests that newer teachers are often assigned to classes with lower achieving, harder to teach students (Feng, 2010). But, it seems to us it should be unnecessary to remind policymakers that students entering teacher preparation are never randomly assigned. Drawing conclusions about teacher preparation without such assignment is highly problematic. Typically, research that studies differences in preparation models attempts to randomly assign graduating students from differing models to classrooms or schools as a means of capturing the impact of preparation (e.g., see Constantine et al., 2009; Eduventures, 2011). But, this approach ignores the reality that, in such studies, the treatment being tested is the preparation, so the random assignment must occur with assignment of prospective teachers into teacher preparation models, and not with the graduates of these programs. What Mathematica and other studies have done is rendered near meaningless due to this design flaw. In states using VAMs, teachers being evaluated are never randomly assigned, which leaves any conclusions that might be drawn about teacher or teacher preparation performance highly problematic.

Multiple Institutions

There are other research design-related flaws that states using VAMs for assessing teachers and teacher preparation must consider. Teachers often have degrees from multiple institutions—community colleges, four-year institutions, and, at the graduate level, often schools different from those where they earned their other degrees. How are these multiple institutions to be parsed out in assessing the performance of teacher preparation?

Non-Tested Subjects

Obviously, not all subjects (or grades) are tested, so not all teachers or their preparation institutions can be assessed in this manner (Henry, Kershaw, Zulli, & Smith, 2012). Most proposals for using VAMs rely solely on in-state data. So, institutions with a large number of out-of-state students who return to their home state, or graduates who migrate to other states following program completion, face a potentially serious sampling bias: only the scores of those remaining in the state can be examined. Although two multistate consortia are developing new tests tied to the Common Core Standards, it remains uncertain if cross-state analyses will be possible in the foreseeable future.


Lack of Reliability of Difference Scores (Time 1 – Time 2)

Under most circumstances, when one test score is subtracted from another, the reliability of the difference score is much lower than the reliability of either original test score. While some authors (e.g., Rogosa & Willett, 1983) provided examples where this would not be the case, when the correlation between the observed scores at time 1 and time 2 is fairly high (as is the case with most achievement test data), this lower reliability is to be expected.
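The classical test theory result behind this caution can be sketched directly. Assuming the two tests have equal variances, the reliability of the difference score is ((r11 + r22)/2 − r12)/(1 − r12); the specific values below are illustrative, not drawn from any particular assessment:

```python
def reliability_of_difference(r11, r22, r12):
    """Reliability of the difference D = Y - X under classical test theory,
    assuming the two measures have equal variances.
    r11, r22: reliabilities of the two tests
    r12: correlation between the two tests"""
    return ((r11 + r22) / 2 - r12) / (1 - r12)

# Two quite reliable tests (0.90 each) that correlate 0.80, as
# consecutive-year achievement scores often do:
print(reliability_of_difference(0.90, 0.90, 0.80))  # 0.5
```

Each test is reliable on its own, yet the gain score that growth models build on has a reliability of only .5, illustrating why difference scores deserve extra caution.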

Regression to the Mean

Another way to look at difference scores is that a low score at time 1 might be due to low achievement, negative errors of measurement (or negative errors of prediction), or both. Since, by definition, error scores are uncorrelated across occasions, about half the test takers who had negative errors of measurement at time 1 will have positive errors of measurement at time 2. Thus, for many examinees, low scores at time 1 will be followed by relatively high scores at time 2. This is also true (though to a lesser extent) of the average scores of classrooms of students.

Kahneman (2011) devoted an entire chapter of his bestselling Thinking, Fast and Slow to explaining how misunderstood the statistical certainty of regression to the mean is in a variety of practices. To explain, he argued that success in any endeavor is due to a combination of talent plus luck. So, even the greatest golfers, for example, are likely to have a lower score after a higher one, or vice versa, due to weather conditions, bounces of the ball, or other luck-related factors. With students, luck may relate to the specific questions on a test, what their teacher chose to emphasize, a headache on test day, a bad night’s sleep the evening before, etc. But, the point is that the issue of regression to the mean, in any practice, cannot simply be ignored.
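A quick simulation makes the mechanism visible. The numbers are purely illustrative (standard-normal “true ability” plus an assumed error standard deviation of 0.6), not calibrated to any real assessment:

```python
import random

random.seed(1)
N = 100_000
true_ability = [random.gauss(0, 1) for _ in range(N)]
# Observed score = true ability + independent measurement error each time.
time1 = [t + random.gauss(0, 0.6) for t in true_ability]
time2 = [t + random.gauss(0, 0.6) for t in true_ability]

# Take the students in the bottom tenth at time 1 and re-measure them.
cutoff = sorted(time1)[N // 10]
low_group = [i for i in range(N) if time1[i] < cutoff]
mean_t1 = sum(time1[i] for i in low_group) / len(low_group)
mean_t2 = sum(time2[i] for i in low_group) / len(low_group)
print(f"time 1 mean: {mean_t1:.2f}, time 2 mean: {mean_t2:.2f}")
```

Nothing about these students’ ability changed between occasions, yet the low group’s average rises at time 2 simply because the negative errors that put many of them in the bottom tenth do not recur.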

Decision Consistency

Goldschmidt et al. (2012) looked at a number of issues related to using measures of student growth in accountability systems. They reported the correlation between two consecutive years of school-level growth scores using 10 different models of growth for reading and math scores in 2,645 schools in four states (p. 47). For simple gain scores (year 2 – year 1), the correlations ranged from .24 to .28. For the most consistent of the 10 models, the correlations ranged from .72 to .81.

One way to look at these correlations is the extent to which a school that demonstrates growth in the top fifth of schools one year is in the top fifth the next year. By pure chance (that is, without knowing anything about their year 1 performance), we would expect 20% of schools to be in the top fifth. With a correlation of .25, the probability of being in the top fifth a second year only increases to 31%. Even with a correlation of .8, the probability of a school being in the top fifth a second year in a row is only 64%. Thus, many schools that show good growth one year (top fifth) are unlikely to show good growth the subsequent year. In fact, many schools that are in the top fifth one year will be in the bottom fifth the next year due substantially to the lack of reliability of growth models. Because individual teacher average growth will almost always be based on fewer students than school average growth, the correlations between two years will likely be lower and, thus, the decision consistency will be lower.

Related to this, a similar validity concern was raised by Hill, Kapitula, and Umland (2011), who conducted case studies of middle school mathematics teachers, using survey- and observation-based indicators of teacher quality, instruction, and student characteristics to determine a mathematical quality of instruction (MQI) score. They found that while all teachers judged to have above-average instruction also had above-average VAM scores, teachers with low and very low MQI scores typically had moderately strong VAM scores. They concluded that VAMs “are not sufficient to identify problematic and excellent teachers accurately” (p. 825). VAMs alone were not seen as accurately identifying high- and low-quality teachers for purposes of accountability.

Tests Capture a Limited Perspective of Student Learning

There remain many educational objectives that are difficult or expensive to measure using large-scale standardized assessments. By itself, this might not be problematic. Such objectives could still be taught and measured informally. But in the high-stakes decision world, this raises substantial problems. How will teachers and other educators associated with non-tested subjects be evaluated?

Causal Claims May Not Be Warranted

Studies have shown that students often perform at different levels on different tests (Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; Lockwood et al., 2007). This lack of consistency across different measures renders the idea that tests of content knowledge capture exactly what has been taught very fallible. Indeed, Darling-Hammond et al. (2012) argued that estimates of teacher effectiveness themselves vary depending on the statistical method employed. Thus, making causal claims about teacher performance with such evidence can result in error-laden decisions (see also Newton et al., 2010).

An added concern relates to timing. Tests are typically given in the spring semester, which means that, even in a self-contained classroom, the teacher being assessed has been with her or his students for only seven to eight months. That teacher had no control over what happened in the months with the prior teacher after the previous year’s test, nor over education-related activity during the summer. Baker (2012b) referred to this as compromising the attribution of responsibility: causal claims may be very misleading when the gains being gauged reflect instruction actually delivered by multiple instructors, and when the effects of time away from instruction in the summer (which is shaped by learning supports offered in the home) go uncalculated. Teachers are thus being assessed in part for instructional time over which they have no control.

Research is clear that issues like leadership, new teacher mentoring, and other support are vital for new teachers. The lack of controls for these and other potentially contaminating variables minimizes any causality claims that might be made about teacher preparation programs once graduates are out of the university and teaching in school districts.

Quantified Metrics Tend to Drive Decisions

One would like to believe that a lack of measurement would not negatively impact instruction, but a growing body of evidence suggests otherwise: the corollary to “What gets measured gets taught” (National Science Foundation, 2012) is “What does not get measured does not get taught.”

But, even more significant for assessing teachers or teacher preparation, is Baker’s (2012a) argument that in current accountability schemes, the quantified metric, no matter its actual weighting in an evaluation, becomes the sole metric driving decisions when a certain level of performance must be achieved. Thus, if teachers fall below the identified level of performance, no matter their principal’s observational evaluation, student survey data, or performance on other measures, they are labeled as failures.

Instructional Insensitivity

A small but growing literature shows that, when conditioned on overall student achievement (total test score), there is little or no difference in student performance between items on topics teachers have taught and items on topics they have not (e.g., Court, 2010; Kao, 1990; Mehrens & Phillips, 1986; Niemi, J. Wang, Steinberg, Baker, & H. Wang, 2007; Polikoff, 2010; Popham, 2007). That is, most test items are insensitive to instruction. There may be many reasons this is so (more learning goes on outside schools; the use of item-total correlations to select items favors items that tap into general factors not readily amenable to instruction; etc.), but the reasons for the observed findings have not been studied. Regardless, before holding teachers accountable for improving student performance, evidence should be provided that such scores can be improved by good instruction.


In this section, we have tried to explain the obvious—that data are not infallible, and that significant caution needs to be taken when applying any measure to high-stakes decisions. Policy leaders should be asking a series of very practical questions prior to utilizing assessments for high-stakes summative decisions. For example, what does a particular score on a test, or the number of questions right or wrong, indicate? Given that thousands, or even millions, of students may be taking these exams, a single question right or wrong may be very important to a summative decision. But just as in the research arena, where effect size has become as important as statistical significance in reporting results, decision makers need to better understand exactly what the scores they count on truly mean. We both have engaged in cutoff score setting and, as anyone who has participated in that process knows, there is a lot of magic involved, and typically little in the way of predictive validity for any score.

There should always be questions about validity. Are the assessments measuring what is intended? Hill et al. (2011), for example, raised serious questions about the validity of VAMs alone for assessing differences in quality of teaching. And the question of how much gain is appropriate for a certain mix of students in a single year is very difficult to answer, especially since students aren’t randomly assigned to schools, classrooms, and teachers.

As we consider the multitude of concerns regarding the overreliance on tests for assessing individual teachers or teacher preparation programs, we characterize what is happening as a form of “operational verisimilitude,” that is, the sense or appearance that what is being presented is true or real. Tests certainly capture something: a limited picture of what teachers do in classrooms with their students. Even those supportive of using VAMs recognize this reality. The Strategic Data Project (2011), for example, in its defense of VAMs, cautioned that these tests are an imperfect measure of student achievement; that the statistical models rest on challengeable assumptions; that students are often taught by groups, limiting the ability for these tests to measure single-teacher effects; that different VAMs often produce different results; and that some student information may not be accounted for, thereby limiting conclusions that might be drawn. Currently, standardized tests used in VAMs are a one-time, one-day snapshot of what happens over the course of an entire year. Thus, when there is an overreliance on such types of data, we are concerned with the notion of infallibility that follows in many quarters, that is, the unquestioned faith about the causality link among variables—e.g., teachers, teacher preparation programs, and these test scores. Statisticians are fully aware of the need to consider both Type I and Type II errors in any hypothesis they test. Those same sorts of cautions should be heeded in this situation as well.

We should never forget that all standardized tests are subjectively developed by human beings. As Blastland and Dilnot (2009) explained, “it is . . . a fundamental of almost any statistic that, in order to produce it, something somewhere has been defined and identified. Never underestimate how much nuisance that small practical detail can cause” (p. 5). Numbers lose their precision in the real world of schools, classrooms, students, and teachers, and that needs to be fully understood when high-stakes decisions are rendered on numbers that may be capturing and portraying a limited view of reality.

When it comes to assessing teacher preparation programs, the multitude of issues that need to be considered exacerbates these concerns. Recent scholarship on the use of VAMs for teacher program assessment has underscored this dilemma. Floden (2012), for example, identified three issues to consider: (a) VAMs measure only one of several important dimensions of teacher preparation quality; (b) they compare programs based on average scores, ignoring within-program variability; and (c) program graduate VAMs are influenced by the labor market. Henry et al. (2012) suggested that VAMs only partially fulfill one of several goals for any evaluation, the accountability/oversight function, ignoring program improvement, assessment of merit and growth, or knowledge development. Crowe (2010) argued that single measures can never gauge all components of program effectiveness, while Plecki, Elfers, and Nakamura (2012) reported that “accountability and improvement efforts are not well served by simple measures used out of convenience” (p. 330). Even Gansle et al. (2012) in Louisiana, in a description of the formative evaluation program in place in that state, highlighted that while VAMs do find variability in teacher effectiveness across and within programs, identifying the factors that specifically cause such differences is “particularly daunting” (p. 312). They further highlighted that only teachers in tested grades and subjects can be assessed, and cautioned that “the results do not answer why a particular result occurred or what might be done to improve on it” (p. 312).


We do not argue against the use of VAMs, growth models, or other tests and the scores they produce. Instead, we agree with others who call for multiple measures, done multiple times, over multiple years (Ferguson, 2012). This form of triangulation is vital for any system leading to important decisions, and properly used tests—whether standardized or teacher-made—should be a part of any such assessment program. Recently, the MET Project (MET Policy and Practice Brief, 2013) called for the use of multiple assessments to evaluate teachers, including student achievement measures, structured observations, and student perception surveys. We find their arguments about evaluating teachers compelling, though it remains interesting that Finland, the nation with the highest test scores in many international comparisons, largely eschews the use of standardized tests for assessment purposes, but instead relies on well-prepared teachers to evaluate and assess the quality of the work their students produce (Sahlberg, 2011).
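
The triangulation idea can be made concrete with a toy calculation. In the sketch below (the measures, scores, and weights are hypothetical, chosen purely for illustration), each measure is standardized across programs and then combined with explicit weights, so that no single score dominates the judgment.

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert raw scores to z-scores across programs."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def composite(measures, weights):
    """Weighted sum of standardized measures; `measures` maps a measure name
    to one score per program."""
    z = {name: standardize(vals) for name, vals in measures.items()}
    n = len(next(iter(measures.values())))
    return [sum(weights[name] * z[name][i] for name in measures) for i in range(n)]

# Hypothetical scores for three programs on three measures.
measures = {
    "vam": [0.20, -0.10, 0.40],        # value-added estimates
    "observation": [3.1, 3.4, 2.8],    # rubric-based classroom observations
    "survey": [4.0, 4.2, 3.5],         # graduate/employer survey ratings
}
weights = {"vam": 0.3, "observation": 0.4, "survey": 0.3}
print(composite(measures, weights))
```

The point of the sketch is not the particular weights but the design choice: making each measure’s contribution explicit forces a conversation about how much any one instrument should count.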

In this final section, we turn to a very brief discussion of some potential and promising approaches that might serve as an appropriate way to judge teacher preparation. What we envision is a system with multiple measures, one that employs VAMs without overreaching beyond the current state of the science. Harris (2012), in his well-reasoned support of VAMs, argued that at this point they should be used as a form of screening device, much like diagnostic tests in medicine. Although cardiac stress tests, cholesterol tests, and any number of other medical assessments often have high rates of error, they spur further investigation to truly understand particular medical problems. All tests are likely to result in some Type I and Type II errors, but Harris felt comfortable that “VAM as diagnostic tool” is a worthwhile use of the tests in better understanding teacher (or teacher preparation) effects.
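
The screening analogy can be sharpened with a bit of base-rate arithmetic. The error rates and prevalence below are hypothetical numbers of our own choosing, not figures from Harris or the VAM literature, but they show why a flag from even a reasonably accurate screen warrants further review rather than a final judgment.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a flagged case is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Suppose 10% of programs are genuinely weak, and the screen catches 80%
# of them (a 20% Type II error rate) while wrongly flagging 15% of sound
# programs (a 15% Type I error rate).
ppv = positive_predictive_value(sensitivity=0.80, specificity=0.85, prevalence=0.10)
print(f"Share of flagged programs that are truly weak: {ppv:.0%}")
```

Under these assumed rates, only about a third of flagged programs are truly weak; the rest are false positives, which is exactly why a weak score should trigger closer examination rather than a verdict.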

This seems a reasonable use of the tests, using them as a sort of screening mechanism to determine if there are potential problems that require further examination. Of course, the tests may also mask teaching deficits (e.g., overemphasis on test preparation or teaching to the test), so even when there is a positive result, further examination with other measures is appropriate. But if scores are weak, that should trigger some sort of further review to confirm or challenge the finding. Building on what we said earlier about overreliance on quantitative metrics, we do not believe that any specific cut-off score must be reached—even if the tests are only a percentage of the total assessment.

Sophisticated portfolios that include teaching videos and other data are emerging as useful tools for candidate assessment and licensure. Portfolios like edTPA (see http://www.edtpa.aacte.org/; Singer-Gabella, 2012) provide valuable information for assessing teacher performance. A system in which a sample of program graduates have such portfolios compiled prior to graduation and again once in the field may prove to be a valuable source for determining effectiveness. These are labor intensive—in terms of both compiling and assessing the portfolios—but offer a more authentic way of analyzing and evaluating candidate, teacher, and, potentially, teacher preparation performance.

Observations have long been a staple of teacher evaluation by school authorities. Well-designed systems that are properly analyzed can provide important information about the quality of teacher work and its effects. Hill (2009) argued that observations can identify teaching skills that VAMs miss. In a more recent study, Hill, Kapitula, and Umland (2011) called for using VAMs in combination with discriminating observation systems. Danielson’s framework for teacher evaluation has been shown to produce both valid and reliable results that differentiate quality of teaching (Danielson, 2011; Sartain, Stoelinga, & Krone, 2010). Adding well-designed teacher observations to any assessment of individual teachers and the graduates of preparation programs adds detail and insight into what, specifically, is happening in classrooms.

Graduate and employer (school principal) surveys seem most relevant for assessing teacher preparation graduates and preparation programs. Surveys of program graduates can provide useful insights into the value of the training they received and, perhaps, even prompt self-reflection on their preparedness for the classroom. Their employers, ostensibly principals, can also provide useful insights on the quality of the preparation their employees (teachers) received. These types of surveys are common in many professional accreditation reports. Similarly, as the Phi Delta Kappa survey reported earlier highlighted, most people remember those teachers who impacted their lives. Figuring out ways to discern which teachers actually had a positive impact on their students’ lives would be complicated, but it would nonetheless capture what is arguably the most significant impact a teacher can have.

Ron Ferguson and his colleagues have developed surveys for students to gauge teacher effects on student engagement and classroom learning conditions. These factors have been shown to be closely associated with student learning measures in the MET study (Crow, 2011; Ferguson, 2012; MET Policy and Practice Brief, 2012). Such student surveys, perhaps in combination with parent and other stakeholder surveys, add another set of elements to understanding the full value of preparation.

Finally, Deborah Meier (1995), in her work with inner city schools, developed a set of what might be described as outcome measures, which she referred to as “Habits of Mind.” These are a means for teaching professionals to question and examine what they are doing to impact their elementary school students. Some equivalent of “Habits of Mind” may need to be created to get a full picture of the impact of teacher preparation on the students their graduates ultimately serve. Perhaps some combination of the various approaches discussed here could be compiled to assess the habits of effective new teachers in a way that truly captures the elements of teachers we want to prepare without overvaluing any particular assessment method.


In this paper, we have tried to show how teaching has been highly criticized for its practices, how much of that criticism is unwarranted, and how various professions are working to deal with the very real need to measure outcomes of their training—with little emerging beyond what teacher preparation already does. We also highlighted the problems that will inevitably ensue with overreliance on test scores as the dominating feature in accountability systems. The field of teacher preparation is caught—caught between a desire to extend its already extensive assessment practices to truly capture the impact on P–12 student performance and the reality of a host of limitations on what is possible. Sadly, policymakers do not feel caught the same way, but instead are scurrying to implement approaches that are laden with problems.

That there is such support for approaches with numerous psychometric and other questions surrounding their use is probably understandable in a policy context. Americans want to have good schools, and people in policy positions are seeking solutions to the perception of American schools as inferior. But caution, rather than speed, is really what is needed when both individual and program survival depend on the measures employed. How many alleged criminals, for example, have been exonerated with the advent of DNA testing? Prior decisions may have been well meaning, but were shown to be erroneous. Perhaps what is at play is what Sperber (2010) called the “guru effect.” Simply put, when experts say something is true, believers think that it is so. As he explained, “All too often, what readers do is judge profound what they have failed to grasp” (p. 583). Similarly, researchers have found that neuroscientific explanations are considered sound even when the information provided is irrelevant or illogical (Weisberg, Keil, Goodstein, Rawson, & Gray, 2008). Mathematicians, too, are finding that their discipline is held in such high esteem by nonexperts that its misuse often goes unchallenged. As Eriksson (2012) argued, “If mathematics is held in awe in an unhealthy way, its use is not subjected to sufficient levels of critical thinking” (p. 746). Along the same lines, Trout (2002) concluded that people often believe explanations because they are intuitively satisfying, not because they are accurate.

This “guru effect” surrounding very sophisticated statistical measures like VAMs, and those who produce and promote them, makes a lot of sense for nonexperts tasked with finding ways to make teachers or teacher preparation better. But they err, just as Pfizer did with torcetrapib, in rushing to judgment before the evidence truly supports their hopes. What makes causal sense, what seems logical and intuitively pleasing, may be a false hope. The evidence presented in this paper shows that teacher preparation is at the forefront in its use of outcome measures to gauge the effectiveness of its work. The move to employ VAMs and other types of tests as part of the outcome matrix is the apple in the Garden of Eden of outcome assessment. Taking a bite is fraught with peril. Nuanced use of these measures, in ways that do not overassume their validity, should be the approach taken as this innovation evolves.

References
ABET. (2012). Criteria for accrediting engineering programs, 2013–2014. Baltimore, MD: Author. Retrieved from http://www.abet.org/DisplayTemplates/DocsHandbook.aspx?id=3149

ACEJMC. (2012). ACEJMC accrediting standards. Lawrence, KS: Author. Retrieved from http://www2.ku.edu/~acejmc/PROGRAM/STANDARDS.SHTML

ACPE. (2011). Accreditation standards and guidelines for the professional program in pharmacy leading to the doctor of pharmacy degree. Chicago, IL: Author. Retrieved from https://www.acpe-accredit.org/pdf/FinalS2007Guidelines2.0.pdf

American Association of Colleges for Teacher Education (2013). The changing teacher preparation profession: A report from AACTE’s Professional Education Data System (PEDS). Washington, DC: Author.

American Bar Association, Section of Legal Education and Admissions to the Bar, Standards Review Committee (ABA). (2010). Student learning outcomes: Chapter 3—Program of legal education (Draft). January 8–9, 2010 meeting. Retrieved from http://www.abajournal.com/files/Learning_Outcomes_Clean_Copy_for_January_2010.pdf

American Psychological Association (APA). (2012). Implementing regulations: Section C: IRs related to the guidelines and principles. Washington, DC: Author. Retrieved from http://www.apa.org/ed/accreditation/about/policies/implementing-guidelines.pdf

Association to Advance Collegiate Schools of Business (AACSB). (2012). Eligibility procedures and accreditation standards for business accreditation. Tampa, FL: Author. Retrieved from http://www.aacsb.edu/accreditation/standards-busn-jan2012.pdf

Baker, B. (2012a, March 31). Firing bad teachers based on bad (VAM) versus wrong (SGP) measures of effectiveness: Legal note. [Web log post]. Retrieved from http://schoolfinance101.wordpress.com/2012/03/31/firing-teachers-based-on-bad-vam-versus-wrong-sgp-measures-of-effectiveness-legal-note/

Baker, B. (2012b, April 28). If it’s not valid, reliability doesn’t matter so much! More VAM-ing & SGP-ing teacher dismissal [Web log post]. Retrieved from http://schoolfinance101.wordpress.com/2012/04/28/if-its-not-valid-reliability-doesnt-matter-so-much-more-on-vam-ing-sgp-ing-teacher-dismissal/

Ball, D. L., Sleep, L., Boerst, T. A., & Bass, H. (2009). Combining the development of practice and the practice of development in teacher education. Elementary School Journal, 109(5), 458–474.

Berliner, D. (2011). Rational responses to high stakes testing: The case of curriculum narrowing and the harm that follows. Cambridge Journal of Education, 41(3), 287–302.

Blastland, M. & Dilnot, A. (2009). The numbers game. New York, NY: Gotham Books.

Blue-Ribbon Panel on Clinical Preparation and Partnerships for Improved Student Learning (2010). Transforming teacher education through clinical practice: A national strategy to prepare effective teachers. Washington, DC: National Council for Accreditation of Teacher Education.

Bransford, J., Darling-Hammond, L., & LePage, P. (2005). Introduction. In L. Darling-Hammond & J. Bransford (Eds.), Preparing teachers for a changing world (pp.1–39). San Francisco, CA: Jossey-Bass.

Braun, H. (2008, November 6). Using value-added modeling to judge institutional effectiveness: Consensus and contention in the land. Presented at the annual AAU Education Dean’s meeting, Washington, DC.

Bushaw, W. J. & Lopez, S. J. (2012). Public education in the United States: A nation divided. Phi Delta Kappan, 94(1), 9–25.

Campbell, D. T. (1976). Assessing the impact of planned social change. Occasional Paper Series Paper #8, Hanover, NH: Dartmouth College.

Carpenter, C. L., Davis, M. J., Harbaugh, J. D., Hertz, R., Johnson, Jr., E. C., Jones, M., . . . Worthen, K. J. (2008). Report of the Outcome Measures Committee. Chicago, IL: American Bar Association, Section of Legal Education and Admissions to the Bar.

Center on Education Policy (CEP). (2009). How state and federal accountability policies have influenced curriculum and instruction in three states. Washington, DC: Author.

Commission on Accreditation of Athletic Training Education (CAATE). (2012). Standards for the academic accreditation of professional athletic training programs. Austin, TX: Author.

Constantine, J., Player, D., Silva, T., Hallgren, K., Grider, M., & Deke, J. (2009). An evaluation of teachers trained through different routes to certification. Washington, DC: U.S. Department of Education.

Council for Higher Education Accreditation (CHEA). (2002). Student learning outcomes. The CHEA Chronicle, 5(2), 1–4.

Council for Higher Education Accreditation (CHEA). (2006). Accreditation and accountability: A CHEA special report. Washington, DC: Author.

Council for Higher Education Accreditation (CHEA). (2010). Effective practices: The role of accreditation in student achievement. Washington, DC: Author.

Council for Higher Education Accreditation (CHEA). (2012, August). Profile of Accreditation. Fact Sheet #1. Washington, DC: Author.

Council for Higher Education Accreditation & Association of American Colleges and Universities (2008). New leadership for student learning and accountability. Washington, DC: Council for Higher Education Accreditation.

Council for the Accreditation of Educator Preparation (2013). Annual report to the public, the states, policymakers, and the education profession. Washington, DC: Author.

Council on Social Work Education (CSWE). (2010). Educational policy and accreditation standards. Alexandria, VA: Author.

Court, S. C. (2010). Instructional sensitivity of accountability tests: Recent refinements in detecting insensitive items. Paper presented at the Council of Chief State School Officers’ National Conference on Student Assessment, Detroit, MI.

Crow, T. (2011). The view from the seats. Journal of Staff Development, 32(6), 24–30.

Crowe, E. (2010). Measuring what matters: A stronger accountability model for teacher education. Washington, DC: Center for American Progress.

Danielson, C. (2011). Evaluations that help teachers learn. Educational Leadership, 68(4), 35–39.

Darling-Hammond, L., Amrein-Beardsley, A., Haertel, E., & Rothstein, J. (2012). Evaluating teacher education. Phi Delta Kappan, 93(6), 8–15.

DiMaggio, P. J. & Powell, W. W. (1983). The iron cage revisited: Institutional isomorphism and collective rationality in organizational fields. American Sociological Review, 48(2), 147–160.

Dufresne, R. J., Leonard, W. J., & Gerace, W. J. (2002). Making sense of students’ answers to multiple-choice questions. The Physics Teacher, 40, 174–180.

Ed.gov. (2009, October 9). A call to teaching: Secretary Arne Duncan's remarks at The Rotunda at the University of Virginia. Charlottesville, VA. Retrieved from http://www2.ed.gov/print/news/speeches/2009/10/10092009.html

Ed.gov. (2009, October 22). Teacher preparation: Reforming the uncertain profession—Remarks of Secretary Arne Duncan at Teachers College, Columbia University. New York, NY. Retrieved from http://www2.ed.gov/print/news/speeches/2009/10/10222009.html

Eduventures. (2009). Educator preparation: Strengths and areas for improvement in preparation programs (Catalog No. 10SOECR0709). Boston, MA: Author.

Eduventures. (2011). Effectively preparing teachers to positively impact student achievement: What does the research say? (Catalog No. 17SOECRI0211). Boston, MA: Author.

Eriksson, K. (2012). The nonsense math effect. Judgement and Decision Making, 7(6), 746.

Ewell, P. T. (2001). Accreditation and student learning outcomes: A proposed point of departure. Washington, DC: Council for Higher Education Accreditation.

Falk, B. (2002). Standards based reforms: Problems and possibilities. Phi Delta Kappan, 83(8), 612–620.

Feng, L. (2010). Hire today, gone tomorrow: New teacher classroom assignments and teacher mobility. Education Finance and Policy, 5(3), 278–316.

Ferguson, R. F. (2012). Can student surveys measure teaching quality? Phi Delta Kappan, 94(3), 24–28.

Floden, R. E. (2012). Teacher value added as a measure of program quality: Interpret with caution. Journal of Teacher Education, 63(5), 356–360.

Gansle, K. A., Noell, G. H., & Burns, J. M. (2012). Do student achievement outcomes differ across teacher preparation programs? An analysis of teacher education in Louisiana. Journal of Teacher Education, 63(5), 304–317.

Gladwell, M. (2011, February 14). The order of things: What college rankings really tell us. The New Yorker. Retrieved from http://www.newyorker.com/reporting/2011/02/14/110214fa_fact_gladwell

Goldschmidt, P., Choi, K., & Beaudoin, J. P. (2012). Growth model comparison study: Practical implications of alternative models for evaluating school performance. Washington, DC: Council of Chief State School Officers.

Goode, W. J. (1969). The theoretical limits of professionalization. In A. Etzioni (Ed.), The semi-professions and their organizations: Teachers, nurses and social workers. New York, NY: The Free Press.

Gordon Commission on the Future of Assessment in Education. (2012). Shifting paradigms: Beyond the abstract. Assessment, Teaching, and Learning, 2(2), 1–6.

Haan, C. K., Edwards, F. H., Poole, B., Godely, M., Genuardi, F. J., & Zenni, E. A. (2008). A model to begin to use clinical outcomes in medical education. Academic Medicine, 83(6), 574–580.

Hamilton, L. S., Stecher, B. M., Marsh, J. A., McCombs, J. S., Robyn, A., Russell, J. L., . . . Barney, H. (2007). Standards-based accountability under No Child Left Behind: Experiences of teachers and administrators in three states. Santa Monica, CA: RAND Corporation.

Harris, D. N. (2011). Value-added measures in education: What every educator needs to know. Cambridge, MA: Harvard Education Press.

Harris, D. N. (2012, November 28). Creating a valid process for using teacher value-added measures [Web log post]. Retrieved from http://shankerblog.org/?p=7242

Henry, G. T., Kershaw, D. C., Zulli, R. A., & Smith, A. A. (2012). Incorporating teacher effectiveness into teacher preparation program evaluation. Journal of Teacher Education, 63(5), 335–355.

Hill, H. (2009). Evaluating value-added models: A validity argument approach. Journal of Policy Analysis and Management, 28, 700–709.

Hill, H., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794–831.

Holloway, S. (2012). Some suggestions on educational program assessment and continuous improvement for the 2008 EPAS. Alexandria, VA: Council on Social Work Education, Commission on Accreditation.

Holmes Group. (1995). Tomorrow’s schools of education. East Lansing, MI: Author.

Johnson, S. R. (2008). The trouble with QSAR: Or how I learned to stop worrying and embrace fallacy. Journal of Chemical Modeling, 48(1), 25–26.

Joy, T. & Hegele, R. A. (2008). Is raising HDL a futile strategy for atheroprotection? Nature Reviews Drug Discovery, 7, 143–155.

Kahneman, D. (2011). Thinking, fast and slow. New York, NY: Farrar, Straus and Giroux.

Kao, C.-F. (1990). An investigation of instructional sensitivity in mathematics achievement test items for U.S. eighth grade students. (Doctoral dissertation). Retrieved from University of California, Los Angeles.

Kingston, N. M. (Accepted, 2013). Educational testing case studies. In J. Wollack & J. Fremer (Eds.), Handbook of test security. New York, NY: Routledge.

Kronholz, J. (2012). A new type of ed school: Linking candidate success to student success. Education Next, 12(4), 42.

Lehrer, J. (2011). Trials and errors: Why science is failing us [Online article]. Retrieved from http://www.wired.com/magazine/2011/12/ff_causation/all/

Levine, A. (2006). Educating school teachers. Washington, DC: The Education Schools Project.

Liaison Committee on Medical Education (LCME). (2012, May). Standards for accreditation of medical education programs leading to the M.D. degree. Retrieved from  http://www.lcme.org/publications/functions2012may.pdf

Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V.-N., & Martinez, J. F. (2007). The sensitivity of value-added teacher effect estimates to different mathematics achievement measures. Journal of Educational Measurement, 44(1), 47–67.

Lowe, D. (2008, February 10). A look behind Pfizer’s failed “good cholesterol” drug. Seeking Alpha. Retrieved from http://seekingalpha.com/article/63913-a-look-behind-pfizers-failed-good-cholesterol-drug

MacDonald, H. (1998, Spring). Why Johnny’s teacher can’t teach. City Journal. Retrieved from http://www.city-journal.org/html/8_2_al.html

Makary, M. (2012). Unaccountable: What hospitals won’t tell you and how transparency can revolutionize health care. New York, NY: Bloomsbury Press.

Martinez, M. E. (1999). Cognition and the question of test item format. Educational Psychologist, 34(4), 207–218.

McMurrer, J. (2007). Choices, changes, and challenges: Curriculum and instruction in the NCLB era. Washington, DC: Center on Education Policy.

Mehrens, W. A. & Phillips, S. E. (1986). Detecting impacts of curricular differences in achievement test data. Journal of Educational Measurement, 23(3), 186–196.

Meier, D. (1995). How our schools could be. Phi Delta Kappan, 76(5), 369–373.

MET Policy and Practice Brief. (2012). Asking students about teaching. Seattle, WA: Bill & Melinda Gates Foundation.

MET Policy and Practice Brief. (2013). Ensuring fair and reliable measures of effective teaching. Washington, DC: Author.

National Commission on Excellence in Education. (1983). A nation at risk: The imperative for educational reform. Washington, DC: Author.

National Council for Accreditation of Teacher Education (NCATE). (2008). Professional standards for the accreditation of teacher preparation institutions. Washington, DC: Author.

National Science Foundation. (2012). America’s pressing challenge—Building a stronger foundation. Retrieved from http://www.nsf.gov/statistics/nsb0602/#standards

National Task Force on Quality Nurse Practitioner Education. (2012). Criteria for the evaluation of nurse practitioner programs. Washington, DC: National Organization of Nurse Practitioner Faculties.

Newton, X. A., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Educational Policy Analysis Archives, 18(23), 1–27.

Niemi, D., Wang, J., Steinberg, D. H., Baker, E. L., & Wang, H. (2007). Instructional insensitivity of a complex language arts performance assessment. Educational Assessment, 12(3–4), 215–237.

Nolte, E., Fry, C. V., Winpenny, E., & Brereton, L. (2011). Use of outcome metrics to measure quality in education and training of healthcare professionals. Cambridge, UK: RAND Europe.

Plecki, M. L., Elfers, A. M., & Nakamura, Y. (2012). Using evidence for teacher education program improvement and accountability: An illustrative case of the role of value-added measures. Journal of Teacher Education, 63(5), 318–334.

Polikoff, M. S. (2010). Instructional sensitivity as a psychometric property of assessment. Educational Measurement: Issues and Practice, 29(4), 3–14.

Popham, W. J. (2007). Instructional insensitivity of tests: Accountability’s dire drawback. Phi Delta Kappan, 89(2), 146–150.

Public Agenda. (2008). Lessons learned: New teachers talk about their jobs, challenges and long-range plans. Washington, DC: National Comprehensive Center for Teacher Quality and Public Agenda.

Ritter, G. W. & Shuls, J. V. (2012). If a tree falls in a forest, but no one hears . . . Phi Delta Kappan, 94(3), 34–38.

Rogosa, D. R. & Willett, J. B. (1983). Demonstrating the reliability of the difference score in the measurement of change. Journal of Educational Measurement, 20(4), 335–343.

Sahlberg, P. (2011). Finnish lessons. New York, NY: Teachers College Press.

Sartain, L., Stoelinga, S. R., & Krone, E. (2010). Rethinking teacher evaluation. Chicago, IL: Consortium on Chicago School Research at the University of Chicago Urban Education Institute.

Shober, A. F. (2012). From teacher education to student progress: Teacher quality since NCLB. Washington, DC: American Enterprise Institute.

Singer-Gabella, M. (2012). New way of testing rookie teachers could be a game changer. Hechinger Report. Retrieved from http://hechingerreport.org/content/testing-what-it-takes-to-teach_9230/

Sperber, D. (2010). The guru effect. Review of Philosophy and Psychology, 1(4), 583–592.

Starfield, B. (2000). Is U.S. health really the best in the world? Journal of the American Medical Association, 284(4), 483–485.

Stanger-Hall, K. F. (2012). Multiple-choice exams: An obstacle for higher-level thinking in introductory science classes. CBE Life Sciences Education, 11(3), 294–306.

Strategic Data Project. (2011). Value-added measures: How and why the Strategic Data Project uses them to study teacher effectiveness. Cambridge, MA: Center for Educational Policy Research at Harvard University.

Stuckey, R., et al. (2007). Best practices for legal education: A vision and a road map. St. Paul, MN: Clinical Legal Education Association.

Sullivan, W. M., Colby, A., Wegner, J. W., Bond, L., & Shulman, L. S. (2007). Educating lawyers: Preparation for the profession of law. San Francisco, CA: Jossey-Bass.

Swing, S. R. (2007). The ACGME outcome project: Retrospective and prospective. Medical Teacher, 29, 648–654.

Trout, J. D. (2002). Scientific explanation and the sense of understanding. Philosophy of Science, 69, 212–233.

Twiselton, S. (2007). Seeing the wood for the trees: Learning to teach beyond the curriculum. Cambridge Journal of Education, 37(4), 489–502.

Twiselton, S. (2012, November 8). Seeing the wood for the trees: Developing teacher expertise—Conditions, contexts and implications. Paper presented at the Universities’ Council for the Education of Teachers Annual Conference, Hinckley, UK.

Weisberg, D. S., Keil, F. C., Goodstein, J., Rawson, E., & Gray, J. R. (2008). The seductive allure of neuroscience explanations. Journal of Cognitive Neuroscience, 20(3), 470–477.

Will, G. (2006, January 16). Ed schools vs. education. Newsweek, 147(3), 98.

Zhao, Y. (2012). World class learners. Thousand Oaks, CA: Sage.

Cite This Article as: Teachers College Record Volume 116 Number 1, 2014, p. -
https://www.tcrecord.org ID Number: 17295, Date Accessed: 1/28/2022 4:22:47 AM

About the Author
  • Rick Ginsberg
    University of Kansas
    RICK GINSBERG is dean of the School of Education at the University of Kansas. Prior to that, he served as the director of the School of Education at Colorado State University. He is the 2012–2013 chairman of the board of the American Association of Colleges for Teacher Education, the chair of the Kansas Professional Standards Board, a member of the interim Board of the Council for the Accreditation of Educator Preparation (CAEP), and serves on the CAEP Commission on Standards and Performance Reporting. His recent research examines educational policy, politics and reform, and aspects of leadership. His most recent publication focused on the impact on principals and superintendents of leading during a fiscal downturn—Ginsberg, R. & Multon, K. (2011). Leading through a fiscal nightmare: The impact on principals and superintendents. Phi Delta Kappan, 92, 42–47.
  • Neal Kingston
    University of Kansas
    NEAL KINGSTON is a professor in the Psychology and Research in Education Department at the University of Kansas and serves as director of the Achievement and Assessment Institute and codirector of the Center for Educational Research and Evaluation. His research focuses on helping large-scale assessments better support student learning.