
Proceed With Caution: Interactive Rules and Teacher Work Sample Scoring Strategies, an Ethnomethodological Study


by Robert V. Bullough, Jr. - 2010

Background: Facing growing accountability pressures, the Teacher Work Sample (TWS), a model of performance-based assessment, is of growing significance in teacher education. Developed at Western Oregon University and since widely adopted and adapted, the model is claimed by proponents to be “real,” “natural,” “meaningful,” and “helpful” (G. R. Girod, 2002).

Research Questions: The study addresses three questions: (1) How do sample raters understand their responsibilities? (2) What are the underlying interactive rules and strategies used by raters to achieve their aims, and how are they employed? (3) What issues or concerns should teacher educators interested in using TWS methods address as they seek to demonstrate candidate quality and program value?

Research Design: Conversation analysis, ethnomethodology.

Data Collection and Analysis: Ten TWS scoring conversations conducted by four teams were recorded and analyzed to identify interactive rules and strategies. Scoring teams were composed of one tenure-track elementary teacher education faculty member and one clinical teacher education faculty member. Excerpts from a TWS case judged marginal are presented and analyzed.

Findings: From the case, a set of interactive rules (tenure-track faculty speak first; the efficiency and equivalence rules; and scorers are prepared) and strategies (splitting the difference; rubric simplification; previewing scores; and rubric stretching) are identified, and implications of their use are discussed for assessment validity, fairness, content quality and coverage, meaningfulness, and cognitive complexity (R. L. Linn, E. L. Baker, & S. B. Dunbar, 1991).

Conclusions: This study raises a number of concerns about the expectations for and use of Teacher Work Samples and cautions about their use for high-stakes assessment.

Facing what Cochran-Smith (2001) described as the “outcomes question in teacher education,” there has been a dramatic shift over the past few years within teacher education toward gathering and providing evidence to demonstrate that teacher education and teachers make a positive difference in student learning. Feeling ever-increasing accountability pressures across the nation, “providers of teacher education are struggling to demonstrate, document, and measure the effects, results, consequences, and effects of teacher preparation on school and other outcomes” (Cochran-Smith, 2005, p. 9). As a means for providing such evidence, interest in Teacher Work Samples (TWSs) has grown significantly. From a survey of 240 member institutions of the American Association of State Colleges and Universities, for example, Wineburg (2006) concluded that “work samples/portfolios of teacher candidates, usually developed during methods courses or student teaching,” is one of four “primary methods” for establishing teacher education program effectiveness (p. 56).


Despite such growing influence, remarkably little research has been conducted on work samples and their use, scoring or programmatic value, and impact. The purpose of this study is to critically consider the scoring processes used to achieve consensus (or agreement) and to warrant teaching quality. The intent is not to argue that the analysis and use of work samples have no place in teacher education, but rather to suggest limitations to what reasonably can be expected of them. Most especially, doubt is cast on the use of TWS ratings for high-stakes assessment. Three specific questions guided the study. Given that scorer consensus is the widely assumed aim of sample ratings, (1) how do the raters understand their responsibilities? (2) what are the underlying interactive rules and strategies used by raters to achieve their aims, and how are they employed? and (3) what issues or concerns should teacher educators interested in using TWS methods address as they seek to meet the “outcomes question”?


When conceptualizing this study, issues raised by Moss and Schutz (2001) about standards-driven models of assessment like the TWS proved helpful. Drawing on data from two case studies, one of a team struggling to produce a set of descriptive statements helpful for scoring writing portfolios for a particular INTASC performance level, these authors show how any standard inevitably represents a compromise rather than a consensus of understanding. Further, they demonstrated that, when “cut free of the contexts in which they [are] created” (p. 38), standards become fixed, masking the diversity of opinions within the community that produced them. Moss and Schutz concluded, “Even the areas of seemingly straightforward agreement and disagreement are not as clear-cut as they might seem” (p. 50). They demonstrated how agreement comes less from the existence of an assumed underlying “professional consensus” than from avoiding details, ignoring contexts, and shifting discursive levels so that the language used is general enough to allow a comfortable place for contending positions. Additionally, they raised a disturbing possibility: that when scoring student work, more talk among coders about the application of any given standard actually may lead to greater rather than less disagreement. Hence, it may be unwise for raters to explore in any depth the reasons underlying assignment of a particular score.


As Moss and Schutz (2001) demonstrated, when freed from context, the production of scores of teacher or student performance always requires raters to go beyond—probably well beyond—the level of agreement forged when the standards were produced. This is true even when considerable care goes into making certain that the group producing the standards is diverse and members well seasoned. There are, then, important reasons for attempting to uncover the processes by which current standards-driven models of assessment produce teacher ratings and inform program evaluation.


BACKGROUND, CLAIMS, AND THE TWS MODEL


Pioneered by faculty at Western Oregon University and now widely adopted and adapted as a model of teacher assessment, the Teacher Work Sample is tightly linked to accreditation standards set by the National Council for Accreditation of Teacher Education (NCATE) and the Interstate New Teacher Assessment and Support Consortium (INTASC) (Denner, Salzman, & Bangert, 2001; Schalock, 1998; Watkins & Bratberg, 2006). The model is briefly described by some of those centrally involved with its development and testing:


The methodology is designed on 2- to 5-week units of instruction intended for all students in a regular classroom that are anchored to state- or district-established standards for learning. The instructional units are planned and implemented by prospective teachers, with review and approval by both college and school supervisors, with preinstruction and postinstruction analysis and reporting of student progress toward targeted learning outcomes. (Schalock, Schalock, & Ayres, 2006, p. 104)


Much is claimed for the model:


First, it is real because the performance assessment tasks prospective teachers to learn to perform reflect real-life teaching. Second, it is natural because the performance assessment tasks occur in classrooms with real pupils. Third, it is meaningful because the performance assessment tasks reflect important aspects of teaching. Finally, it is helpful because the assessments cause prospective teachers to explore their own practice by asking them to consider and address [important] questions when planning, implementing, and documenting their work on an instructional unit. (Girod, 2002, p. 67)


Generally, a work sample offers evidence obtained from teaching a unit of several lessons of having met 5 to 7 standards determined by some group or groups (whose membership is probably unknown to those doing the scoring) to be crucially important to successful teaching. With seven standards, Brigham Young University (BYU, 2006), the site of this study, is representative: (1) “The teacher uses information about the learning/teaching context and student individual differences in setting learning goal(s) and objectives and planning instruction and assessment”; (2) “The teacher sets significant, challenging, varied and appropriate learning goals and objectives based on state/district content standards”; (3) “The teacher uses multiple assessment modes and aligned with learning goal(s) and objectives to assess student learning before, during and after instruction”; (4) “The teacher designs instruction for specific learning goal(s) and objectives that address characteristics and needs of students, and the learning context”; (5) “The teacher uses ongoing analysis of student learning to make instructional decisions”; (6) “The teacher uses assessment data to profile student learning and communicate information about student progress and achievement”; and (7) “The teacher analyzes the relationship between his or her instruction and student learning in order to improve teaching practice.” Finally, an overall rating is made that includes an assessment of the presentation, organization, attention to diversity, and quality of the sample. This is the decision that matters most to the candidate. Guidelines for candidates include suggested page length for each category of evidence within the TWS, and sometimes explicit questions to be addressed.


Rubrics composed of standards, quality indicators, and descriptors for rating (and writing) work samples facilitate both TWS production and assessment of the quality of the teacher candidate’s evidence and presentation. On a scoring sheet, ratings of each indicator are recorded and points assigned. Rating scales vary across institutions. For example, at Western Carolina University, the scale is above standard, at standard, below standard, and unsatisfactory. It is advanced, proficient, and developing at Colorado State University. In some institutions, students are required to rework areas judged weak. A candidate at Wichita State University, for example, must “earn 80% or better on the TWS total score with no less than 60% on any one section” (Wichita State University, 2007, p. 6). A student who fails to meet this level is placed on remediation until the “expected proficiency level” is met. To pass at Emporia State University, a candidate must earn 70% of the points possible on the work sample.


Representing scoring thresholds, the descriptors guide assessment and indicate different levels of performance for each indicator, and when taken together, the indicators are understood to capture the essential elements of each standard. For example, at BYU, Standard II, “The teacher sets significant, challenging, varied and appropriate learning goals and objectives based on state/district content standards,” has four indicators, each rated on a scale from 0 to 5 points. Indicators are: clarity of learning goal and objectives; alignment with national, state or local standards; levels of objectives; and appropriateness of objectives for students. Operating like short Likert scales, the TWS rating scales used across the country commonly range from high to low; clear to unclear; appropriate to not appropriate; aligned to not supported; thorough to minimal; well to poorly designed; logically sequenced to disorganized; always to seldom; and substantial, multiple, and all to none. For some items, the presence or absence of evidence is considered. For assessing the entire work sample, a similar pattern obtains, including reliance on terms denoting quantity: All sections are well-organized to few sections, and spelling, grammar, and writing are correct to some set percentage.


Individuals or panels of teachers, school administrators, and university professors may score TWSs. The expectation is that raters will be trained to achieve a reasonable level of reliability. A common approach to rating work samples, one employed at BYU, is to have two educators independently rate a sample and then meet to negotiate scores, including a final determination of pass or fail. For institutions with large student enrollments like BYU, where more than 200 samples are scored in elementary education each year, an ongoing and very serious challenge is to find and train a sufficient number of raters to distribute the burden of scoring. One consequence of this difficulty is that faculty members who have taught the teacher candidates producing the samples become involved in scoring them. Another challenge is to report the cumulative results to arts and sciences colleges and the many departments involved in teacher education in a way that is useful for program improvement, a requirement under NCATE Standard Two for program accreditation.


THEORETICAL FRAMEWORK: ETHNOMETHODOLOGY AND CONVERSATION ANALYSIS


Teacher Work Sample scoring represents a particular and peculiar form of institutional talk, talk tied to making and justifying quality judgments about teaching. Like all talk, scoring conversations are embedded in context-specific rules of interaction that, although generally taken for granted, are heavily value laden. Analyzing institutional talk is a means for revealing values and commitments and for locating overly constraining but taken-for-granted assumptions about purpose and practice. Because such talk opens and closes opportunities for future learning for beginning teachers and produces judgments of program quality used when making a case for accreditation, analysis of such conversations is of genuine importance to teacher educators as well as to beginning teachers.


The task of opening up and making sense of the scoring conversations was informed by the assumptions of ethnomethodology. These helped to focus criticism and direct the analysis. Ethnomethodology is the “study of the methods people use for producing recognizable social orders. . . . The object . . . is to discover the things that persons in particular situations do, the methods they use, to create the patterned orderliness of social life” (Rawls, 2002, p. 6). The specific concern of ethnomethodologists is understanding and revealing how social knowledge, which is tacit, operates—“how members of society, in their interactions with one another, draw upon their common-sense knowledge of the society, including its institutional realities, in producing and managing those interactions” (Francis & Hester, 2004, p. 203). What is revealed are aspects of an embodied “rules system” supported by shared expectations and reflecting a rationality of everyday interaction (Mehan & Wood, 1975). A key assumption is that persons go about life trying to “normalize . . . incongruities within the order of events of everyday life” (Garfinkel, 1967, p. 54), and this is done in part through complying with what is expected. What is expected within the institutional context and practice is for sample scorers to achieve consensus on their ratings and to certify sample quality, which, by inference, is assumed to represent teaching ability. These aims are supported by a set of taken-for-granted rules and interactive strategies. Operationally, these same rules and interactive strategies also reveal how institutional aims are understood.


As one of many forms of discourse analysis, conversation analysis (CA) represents an “attempt to describe people’s methods for producing orderly social interaction” (Silverman, 2006, p. 210). In particular, CA has proved to be a fruitful means for getting at the rules underpinning interaction (Hutchby & Wooffitt, 1999) and is a favorite tool of ethnomethodologists. While drawing on CA transcription symbols and coding, it should be noted here that the analysis presented departs somewhat from CA conventions for reasons that will be noted shortly. Involving the coding of conversation in such a way as to reveal patterns, CA helps make available the ways in which talk seeks to achieve a purpose by offering “methodic answers” to specific problems (Hutchby & Wooffitt). Neither the problems nor the answers are necessarily self-evident. Revealing them requires systematic analysis, digging into what was said, and how and where it was said. In CA, a central focal point for analysis is turn-taking, the transitions between turns. CA codes give readers access to how interaction takes shape and flows, and open for consideration the emotional loading of what is said. These are important qualities for locating speaker intent. In the relationship between turns and next turns, participants reveal how they understand content and the other’s intent—what actions are designed to accomplish. Of special importance here is the way in which the context within which interaction takes place—a scoring conversation—shapes that interaction, constraining, although not fully determining, what is said and how it is said.


Contexts offer “rules” (norms or procedures) by which speakers make their inferences of cospeaker intentions and about the nature and purpose of the interaction taking place. In the interpretive process and against this normative backdrop, they make sense of what is said and not said, and they respond. CA, then, seeks to reveal the “inferential order of talk: the kinds of cultural and interpretative resources participants rely on in order to understand one another in appropriate ways” (p. 39). Talk of this sort is specialized, adapted, and sequenced for institutional purposes that orient speakers—at least to the degree to which the speakers identify themselves and are identified with the practices supported by an institution (see Wenger, 1999). Forms of identification and status matter here, although usually they fall outside of the range of the interpretative interest of CA (Silverman, 2006).


Underpinning the analysis done for this study is a general question posed by Silverman (1993), one central to the intent of conversation analysis: “How was this outcome accomplished?” (p. 142). The institutional outcome of interest is consensus on the quality of a TWS, and achieving consensus frames much that was said and done within the scoring conversations. A word about agreement and consensus: Consensus, Moss and Schutz (2001) argued, is different from agreement:


With . . . consensus, it is assumed that parties reach an understanding that all interpret in the same way. With agreement. . . whereas parties accept a particular conclusion in a particular context, what is agreed upon may actually be (and to some extent, always is) interpreted differently by each. (p. 59)


As will be noted, this difference is important because a greater authority claim follows consensus, offering a much stronger warrant for a judgment than a mere agreement. In standards-driven assessment models, the presumption tends to be that consensus will be achieved by raters and, as Moss and Schutz (2001) demonstrated, disagreement and dissension have no place.


DATA PRESENTATION AND ANALYSIS


Ten work sample scoring conversations conducted by four 2-person faculty teams during the fall of 2007 were audiotaped and later coded using CA guidelines for analysis. Analysis took place in two phases. The first involved identifying patterns in the interaction of the two scorers—in how the conversations were structured, who said what, when, and how it was said, and how the conversation flowed and was kept moving. From the patterns, tacitly shared underlying interactive rules and strategies were inferred. Second, an attempt was made to determine the extent to which the rules and strategies were present across the 10 conversations. The majority of the data presented here come from a single TWS scoring conversation that well illustrates the operation of the interactive rules and strategies. Although most CA work involves building and analyzing large collections of transcripts representing a particular form of conversation (such as interviewing), sequences (like phone greetings), or other interactive practices, there is value in focusing intensely on a single case of interaction (see Hutchby & Wooffitt, 1999, chapter 3). This is so because, as with TWS rating conversations, “Turns at talk are often very long, and accomplish many different actions” (Hutchby & Wooffitt, p. 120). Where they illuminate the sample case, other data drawn from the nine remaining conversations will be offered. It is important to note that all 10 scoring conversations involved a clinical faculty member and a tenure-track faculty member negotiating their separate ratings of a single TWS to produce a single score.


The scoring conversation presented is of the second sample rated by the team. Comparing the first and second scoring conversations of each team made apparent that between the first TWS rating conversation and the second, a way of working together emerged that the scorers found productive and responsive to the purpose at hand. This way of working involved the generation of scoring strategies and rules that were present, although to varying degrees, in the transcripts of every team conversation. The primary reason for focusing on this particular conversation rather than on another or on the entire set is that the sample was one of only two judged marginal but passing by the 8 raters, and of the two, this one proved most revealing, in part because it was most demanding of the raters. By representing neither very low nor very high quality, cases of this kind provide especially rich opportunities for locating the assumptions and commitments that underpin scorer actions and interpretations. In effect, as Toulmin (2001) suggested, cases like this one, as outliers, “can be used to explain the central, rather than the other way around!” (p. 30). By straining comfortable and established methods of decision-making, marginal cases make unusual methodological demands on raters of the sort noted by Moss and Schutz (2001). Ambiguity in criteria proves vexing, differences in understanding loom large, and justifications of scores become insistent. The difficulty of scoring such cases produces procedural adjustments and interpretive leaps, unlikely to emerge under other circumstances, that illuminate rater practices and raters’ understandings of the purposes of their work.


Generally, CA is conducted by analysts who are part of the same culture as the interactants studied. This is a valued condition for understanding and producing an informed interpretation, one that represents what the participants “take it they are doing” (Hutchby & Wooffitt, 1999, p. 113). However, serious problems may arise, as they did here, when analysis is done by an insider, leading to a necessary departure from preferred CA methods. Because he knew many of the raters, the author could not transcribe tapes of the TWS scoring conversations. As a condition for participation, raters insisted that strong assurances of anonymity be given. Indeed, the majority of raters chose not to participate in part out of concern that their identities would become known. Hence, the tapes were transcribed by a skilled coder familiar with TWSs but outside the faculty (using CA transcription codes; Silverman, 2006, pp. 398–399; see also Hutchby & Wooffitt, 1999, pp. vi–vii, and an abbreviated version in the appendix).


As noted, the work samples were rated independently by two educators representing two categories of faculty: tenure-track elementary education professors who were specialists in specific program content areas such as reading and science, and clinical faculty. The clinical faculty members were distinguished elementary school teachers employed full-time by the university for 2- or 3-year terms to work in the teacher education program. Their primary responsibility was to coordinate and supervise field work, including student teaching. Cooperating teachers also rated samples, but not those considered here. Previously, each of the tenure-track and most of the clinical faculty scorers participated in a 2-day training session that involved discussing and clarifying rubrics and practice scoring, seeking an acceptable level of reliability and validity. Although the author did not know (and does not know) who did the scoring, faculty status was indicated in anticipation of the possibility that the different institutional roles might influence the TWS scoring conversations. As noted, scorers faced two interrelated institutional tasks, to achieve consensus on a set of scores for parts of the sample and on a final score for the entire sample and thereby to warrant sample (and presumably teacher) quality. The ratings were used to determine whether the candidate’s sample was of sufficient quality to justify continuation in the program and eventually teaching licensure. Depending on the score given, a beginning teacher might be required to rework parts of the sample or, if the sample failed, potentially the entire sample. Although the beginning teachers understood that passing the sample was required for program completion, they also knew that reteaching a unit or even redoing the entire sample was a possibility. Such requirements would delay program completion. 
Generally, it appeared that faculty were willing to work with students until they achieved the desired level of performance, although presently there are growing concerns about what this commitment requires of faculty time.


Roughly 40% of the TWS scoring conversation follows, including, most especially, those selections that display the interactive rules and the strategies developed by the team to reach consensus (agreement) about sample quality. Rules and strategies will be noted where they emerged in the conversation and represent, as suggested, elements critical to the value and success of the analysis for purposes of this study. The rating scale is 0–5, with scores spread across four performance categories: exceeds expectation (5); meets expectation (4–3); partially meets expectation (2–1); and not met/missing evidence (0). As noted, a brief description of the CA transcript coding system is contained in the appendix. For readers unfamiliar with CA, and even for those used to working with transcribed audiotapes, reviewing the codes before reading the excerpts will prove helpful. The codes reveal actions beneath the words, such as speaker emphasis, which is important for determining intent. Discussion is interspersed throughout. The first speaker is a tenure-track elementary education faculty member (TTF). The second speaker is a clinical faculty member (CF). A discussion section follows the analysis of the entire transcript. The case is divided into sections representing the standards and related indicators that introduce each section. In part, this organization was selected because it is the one employed by all the teams, and it shapes the logic of the conversations that emerged. By convention, the lines in CA transcriptions are numbered, but for ease of reading, they are not numbered here. To maintain reasonable excerpt length, CA code conventions are also broken such that textual gaps are indicated by a line of three horizontal periods preceded and followed by a space ( ... ).


THE CASE: A THIRD-GRADE SCIENCE UNIT


*******

Standard I: The teacher uses information about the learning/teaching context and student individual differences in setting learning goal(s) and objectives and planning instruction and assessment. Indicators: Knowledge of community, school, and classroom factors; knowledge of characteristics of students; implications for instructional planning and assessment.


TTF: I’m pretty sure I taught her . . . and she did quite well in my class.

CF: I was thinking this was a student (.5) that I had . . . but I can’t remember if her last name was _____=

TTF: = but that wouldn’t be right would it because she’s in a student teaching experience not in an internship.

CF: That’s true. . . (  ).

TTF: I’m not sure of that but that would [make sense

CF:  No I think you’re right because I was very surprised because the one I was thinking of was a tremendous student in the class and [this is weak

TTF: [and my memory of her in the classroom was she was really strong. Now I again could be mixed up on the person because I didn’t go back and look at it but um =

CF: = now that you mention it it couldn’t be her so =

TTF: = I don’t think I let that color the evaluations I made here but (.2) this is a weaker TWS and I think we both would agree.


Discussion. The two faculty raters begin their analysis by trying to recall the candidate, thereby attempting to provide a human context for scoring, a backdrop against which to make their judgments that transcends the written document. Although in principle, samples are not to be compared—only rated in relationship to the rubrics—in fact, comparisons are made because prior experience of sample scoring comprises an interpretative backdrop that gives substance to the rubrics. Hence, the raters actually come to the scoring conversations in some measure, however small or large, already having transcended the rubrics by giving them detail and definition. That comparisons are made is clear when both faculty members agree that the sample is “weak.” Locating the sample early in the conversation in this way frames much of the discussion that follows.


After struggling to place the student, the raters realize that she has been confused with another student, one considered to be “tremendous” and “strong.” Despite the confusion, a serious problem emerges but is quickly set aside: how to explain a troubling contradiction—a “tremendous” student producing a weak TWS. Strong students—able teachers—are expected to produce high-quality work samples, and the entire assessment system rests on this assumption. Potentially, the discovery that able teachers, as determined by empirical evidence, produce poor-quality samples might undermine the model and additionally raise critical questions about the quality of the program and of individual faculty members’ instruction. This danger is evident in the scoring conversation that led to the only failed TWS of the 10. An excerpt from this conversation reveals the difficulty:


CF: You know what I think is str:ange? I think her teaching=

TTF: =is passing=

CF: =is passing and her Teacher Work Sample is not. Is that possible?

TTF: Uh huh. I agree with you.

CF: But then is that our fault in not teaching her appropriately how to w:rite a Teacher Work Sample? (.4) Maybe that’s not fair because other students do it and pass.

TTF: Uh huh.

CF: But I, I think if we had w:atched her lessons (.2) and saw the kids learn and you look at (.1) how much they d:id learn, it was probably a passing unit.


From this transcript, it is apparent that the possibility of failing the sample proved deeply troubling for the raters. Both come to a disturbing conclusion: The problem may well have arisen because this student teacher, who taught in a remote urban location several hundred miles from campus, received comparatively little mentoring and was not adequately coached in how to prepare a successful sample. Still, within the institutional context of scoring the sample and warranting quality, they failed it.


Returning to the case sample: As the raters interact, they conclude that neither one actually knows the student, and they feel relief. This initial way of reading the sample, however, appears to have influenced the scorers’ judgment despite the tenure-track faculty member’s assertion that believing that she knew the student did not “color the evaluations” initially made of the sample.


TTF: So, (.7) starting at the beginning. Okay (.2) hhh here we go on ___________. Hers is a Grade 3, both of these [samples] are third grade but this one is a science unit is how she labels it. Okay, s:o contextual factors, you saw a 4, 4, and [a 2.

CF: [ I saw 4, 4, 2 =

TTF: = ((laughs)) I saw 2, 2, and 3.

CF: So we’re (.5) =

TTF: = Yeah =

CF: = diverged again and this is probably going to play back to the [same thing

TTF: [issue with the=

CF: = (one) we dealt with, yeah. I see it as listing things and I thought her listing of things was (.) pretty thorough.


Discussion. The tenure-track faculty member signals the beginning of the formal scoring conversation and sets an agenda for how the team will work (an agenda solidified by the first sample that the team scored). This is the first rule of interaction—that tenure-track faculty speak first—present across all but one of the four faculty scoring teams. This rule represents an initial claim to rhetorical power. The scorers understand Standard I, indicators, and descriptors somewhat differently. They encountered this problem in the first sample that they jointly scored. Laughter indicates recognition of the problem but does not signal a need to renegotiate meaning, which is assumed to be shared. Both realize that as with the first sample scored, the tenure-track faculty member’s scores usually will be lower, sometimes much lower, than the clinical faculty member’s scores, and each set of scores will necessarily be adjusted if agreement is to be achieved; mostly the tenure-track faculty member’s scores will be raised, and the clinical faculty member’s scores lowered. This challenge establishes the need to create strategies for overcoming their differences.


With respect to this standard, the clinical faculty member believes that demonstrating “knowledge” of school and community means being able to list “factors” that characterize a community, whereas the tenure-track faculty member expects something different, something more. The assumption of the clinical faculty member may very well be grounded in the program’s reliance on the language traditions embedded in Benjamin Bloom’s Taxonomy, where “knowledge” is understood to represent facts. What is certain is that they do not agree on what the rubric requires of students, which raises questions about the reliability and validity of their scores.


CF: ... (1.2) (    ) She describes the classroom quite well.

TTF: You know I said, I made a comment about the previous sample that was really one that came up here. She’s the one who talks about the self-contained students and never explains (.2) ... And I made a comment about [the writer of the first sample], but it really was about [this writer]. (.2) .hh She gives us lots of numbers about the school and the classroom, yet her depth of understanding I thought was lacking. S:o, (.2) you see her as a 4, and I see her as a 2. Should we call it a 3 because the 3 and the 4 [are similar?

CF:[ I think that’s fair.


Discussion. The tenure-track faculty member admits in this excerpt to confusing two samples. This issue arises again later in the conversation; scorers have difficulty keeping work samples distinct and treating each as separate for rating purposes. Most important, this excerpt illustrates the emergence of a strategy for achieving agreement, perhaps the most important strategy: splitting the difference. A score of 3 or 4 indicates meets expectation, making the decision relatively easy “because the 3 and 4 are similar.” The aim of consensus is set aside in favor of strategic agreement.


In the next excerpt, the tenure-track faculty member brings additional information to the discussion: background knowledge held of the area within which the school is located. This dramatically alters the discussion.


TTF: [I can tell you because this is an area just adjacent to where I lived for so very long a time and maybe that’s some of why I was extra (.) sensitive to the way that she =

CF: = yeah

TTF: = her understanding of community and her assumptions were kind of (.5) skewed I thought. But that’s (.2) a lens that shouldn’t be used.


Discussion. The tenure-track faculty member admits to possessing knowledge that may have influenced scoring, even while asserting that such knowledge “shouldn’t be used.” The presumption is that as an expression of professional judgment, TWS scoring should be objective and disinterested, but it is not; with this realization, questions of fairness and reliability emerge. Had the scorer possessed no relevant knowledge about the community or the specific area surrounding the school, it is likely that the sample would have been scored differently—perhaps, ironically, less validly. Possession of additional contextual knowledge elevates what is required to indicate “a very good knowledge of the characteristics of the community” (level 4) for the tenure-track faculty member. The statement that this additional knowledge “shouldn’t be used” is a reminder to both raters of their institutional responsibilities, yet it appears that the score may have been lowered because of this knowledge, an outcome that would be recognized as a sign of acting unprofessionally.


CF: My biggest problem with this section was the implications. I thought they were =

TTF: = okay, [you

CF: [weak at best. She only listed 3 factors instead of 4.

TTF: Oh, okay. ... So (.2) you have =

CF: = I have a 2 and =

TTF: = and I have a 3. I am really perfectly happy to go to a 2 here.

CF: I think they were weak to start with and she didn’t even do f:our, which was the minimum.

TTF: (.9) I didn’t pick that up.

CF: So, had she had four I would have probably said there a 3 because they were (.) low end okay.

TTF: Yeah, if you look about implications, I made a list—use topics they are interested in, give fast finishers, which n:either one of those to me are substantial =

CF: = right, [now I think

TTF:   [I mean they’re okay, but they are not deep. And then the last is (.9) academic and developmental (.5) work that is appropriate, wh:ich I put an okay here to say yeah but she doesn’t give us any [detail on what she means when she says that.

CF:       [She just talks about it but doesn’t say here’s what I’m going to do about it much. It’s just there’s this problem.

TTF: Okay, so we’ve got 3, 3, 2, and we’re both comfortable with that?

CF: I think so.


Discussion. The clinical faculty member assumes that both raters share an understanding of what is meant by “weak” and that no justification is necessary. The clincher comes not from a quality determination but from lack of quantity. An acceptable score for this indicator—instructional implications—comes primarily from counting; no fewer than four implications are required by the clinical faculty member’s reckoning. The indicator for partially meeting expectation reads: “Candidate provides general implications for instruction and assessment based on student individual differences, and community, school, and classroom characteristics.” “Student individual differences” is 1; “community,” 2; “school,” 3; and “classroom,” 4—one general implication for each area. Reading the indicator in this literal and narrow way elevates a quantitative measure over a quality judgment but offers the virtue of clarity and simplicity in scoring, an aim much sought after by those who write and score rubrics. Four is the “minimum” number, and the raters thereby avoid a complex and perhaps time-consuming discussion of the differences in the meaning of the descriptors: What is the difference between “general” and “specific” implications (the difference between a score of 2 and a score of 3)? A strategy, rubric simplification (whenever possible, rubrics should be simplified; focus on what is most important; what can be counted should be counted), and a rule, the efficiency rule (efficiency in scoring is essential; agreement needs to come as quickly as possible), emerge in this excerpt.


*******

Standard II: The teacher sets significant, challenging, varied, and appropriate learning goals and objectives based on state/district content standards. Indicators: Identify a clear unit outcome or learning goal based on the State Core Curriculum that will guide the planning, delivery, and assessment of your unit; alignment of objectives; levels of objectives; appropriateness.


TTF: Okay what do we look like on the next section? You’ve got =

CF: = uh (.3) 4, 5, 3, and 4.

TTF: I’ve got 4, 3, 3, and 3. So we agree on that third 3.

CF: The one we’re m:ost disagreeing on is number two in the aligning. Um (.8) =

TTF: = You know, part of (.4) the reason for that, (.9) let me see what I wrote down. I think (.5) I put clarity and objective number one, I was (.5) confused here. OH, students will defend—what does she mean?

CF: Uh (.5) [I think

TTF:   [I think I f:inally figured it out but it’s not clearly written. Okay, on number two, what do you have?

CF: I have a 5 and you have a 3.

TTF: I could go to a four. [Should we go for a 4?

CF:         [I think a 4. Yes.

TTF: Okay. And then the next one, oh, what about =

CF: = on the first one I had a 4 and you had a 4.

TTF: Oh, sure. We’re just the same. Then (.2) appropriateness of objectives, you had a 3 and I had a 4. Should we call it a 3.5 because the descriptor is the same?

CF: I think that’s right. We’re going to see a lot of 3/4 scores on this one. It’s the same.

TTF: It is the same descriptor. (.9)


Discussion. Here another strategy is employed, one that was used earlier: previewing scores, which satisfies the efficiency rule—one rater states all the scores for a standard, quick comparisons are made, and when there are differences, discussion follows. When both raters’ individual scores are the same, the efficiency and equivalency rules lead to the conclusion that there is no reason for discussion: “Oh, sure. We’re just the same.” The equivalency rule holds that identical scores represent identical interpretations (identical scores = same meanings). This rule is crucially important for achieving agreement but ultimately proves false, as will become evident shortly. It is a rule of convenience or necessity that underpins and sustains the strategy of splitting the difference.


Splitting the difference between a 3 and a 5 on objective alignment is the difference between “objectives support the learning outcome and align with national, state, or local standards” and “objectives support the learning outcome and are unusually well aligned with national, state, or local standards.” Making such a judgment requires moving well beyond the rubric to compare the individual whose work is being scored with some imagined population of teachers—presumably, but not necessarily, beginning teachers. This represents a dramatic extension of the interpretative backdrop discussed earlier, moving from other beginning TWSs to a general population of teachers. Because it is probably impossible to describe the difference between “unusually well aligned” and “aligned” objectives, the distinction is likely meaningless, raising questions about both the validity and reliability of any score produced. Agreeing that this part of the sample is acceptable (falling in the 4–3 range, meets expectation), splitting the difference moves the conversation along. A problem, however, arises with one of the sample objectives. Knowing the school area, the tenure-track faculty rater is troubled by an assumption made by the teacher about the children’s background knowledge and experience. Additionally, the tenure-track faculty member’s comment that a section was “not clearly written” points toward a troubling possibility: To what degree are work samples writing tests rather than tests of performance?


*******

Standard III: The teacher uses multiple assessment modes aligned with the learning goal(s) and objectives to assess student learning before, during, and after instruction. Indicators: assessments (few to multiple); scoring and performance criteria; adaptations based on student need; quality of assessments. The last indicator speaks to questions of assessment validity and reliability.


CF: On (.2) on assessment plan I had her 4s on the first two.

TTF: And I have 2 and 3.

CF: Then I had 3 and 3. [ We’re pretty close on the last one.

TTF: [Okay. We agree on those. 3, 3, we could call that 3.5 because it’s the same descriptor [meets expectation].

CF: So let’s look at the first =

TTF: = Number one, multiple assessments with a variety of modes.

CF: (.7) hhh Okay, I put (.5) they are all written assessments (.3) but are well aligned with the objectives. Not a great variety but they are assessing (.2) what she is teaching and assessing what the (.5) core (curriculum) is supposed to be taught. So I guess I’m giving her credit for having her assessments be on t:arget but they are almost all written assessments. Not a large variety.

TTF: And not multiple assessments =

CF: = and as we go to the rubric it’s saying, probably not very strong on the multiple assessments. So I think=

TTF: = where are you willing to go?

CF: Let’s, you had a 2 and a 4, let’s go with 2.5 there. I think that’s probably pretty (.3)

   [do you feel good about that?

TTF: [I think that, uh huh. That represents our (.) collective thinking. And then the clarity of the scoring crit:eria, you gave her a 4 and I gave her a th:ree. I’ll go with a 4 because you helped me [see that one a little bit better last time.

CF: [I think we had a similar situation last time.

TTF: Yeah. I’m good with that.


Discussion. This excerpt includes an example of the tenure-track faculty member actually raising a score based on an argument made by the clinical faculty member, apart from the raises that result from the strategy of splitting the difference. The argument, however, was made previously when scoring the first sample. In that sample, following a comment made by the clinical faculty member, “I’m probably (.2) a little higher grader,” a discussion follows about the clarity of the descriptors. The clinical faculty scorer rated the performance criteria a 5, whereas the tenure-track faculty member scored it a 2. All that was required by the standard was a statement within the sample of a passing score for each assessment. In response, the clinical faculty scorer stated, “On each of her assessments her expectations (.4) set amount or a pass/fail on the test. I guess I’m not seeing where you thought she was l:acking on the clarity of her performance expectations.” The tenure-track faculty member responded, hesitatingly and cautiously, as the codes indicate: “Y::eah, I don’t have notes on it so I probably, we don’t know (.4).” The clinical faculty member pressed the case: “Each of the formative assessments is n:umbered with a=,” and the tenure-track faculty member interrupted: “=yeah, (.2) I think I was too low on that one.” At this point, the tenure-track faculty member refers to the standard and concludes: “=okay m:ost scoring procedures are clearly explained; performance criteria are generally clear (.2) and are provided for all assessments. Yeah, we can go with a 4 on that one.” Recalling the previous interaction and the resolution of the differing scores, coupled with the clinical faculty scorer’s return to the rubric (“as we go to the rubric it’s saying...”), made the case.


The tenure-track faculty member’s question, “= where are you willing to go?” is also important. It is asked of the clinical faculty member to determine how much flexibility there is in the suggested score. It seems that there is a great deal: A 4 becomes an agreed-on 2.5, a position right between meets expectation (3–4) and partially meets expectation (2–1). The intent seems clear; the tenure-track faculty member thought that the clinical faculty member’s score was much too high and, as so often happened in the conversation, the clinical faculty member’s score was lowered by agreement.


*******

Standard IV: The teacher designs instruction for the specific learning goal and objectives that address characteristics and needs of students, and the learning context. Indicators: preassessment and contextual information (meets expectation 4–3: “Preassessment data are charted, analyzed, and patterns noted that influence instructional design”); instructional strategies (meets expectation 4–3: “Most learning activities reflect best practices in the teaching major”); technology; adaptations for special needs learners; overall unit plan (meets expectation 4–3: “Lessons are logically sequenced, student interest/engagement would be high”).


... TTF: Adaptations. There are some. The thing that (.4), the thing that kept haunting me (.2) is th:at (.2) she gave this pretest and when you read the pretest you think, these are 3rd graders? Did that bother you at all? That that was =

CF: = I’m not remembering that.

TTF: That all of her instruction and all of the pretests, the whole thing seemed very, v:ery first grade to me. First gradish to me. And I will defer to your sense of that, because it’s been a while since I’ve been there. [But I was very disturbed by that.

CF: [As I look back through it now, it is very =

TTF: = remember it was like, list three things an animal needs to survive. (.6) ... hh S:o, she had this very simple pre test ... you’ll also notice in her charting of the posttest she has lots and lots of them topping out. That tells me her assessment doesn’t give her a window to really capture what these kids could do. So, hhh on adaptations (indicator 4) for special needs learners (.4) I gave her a 2 bec:ause it says some appropriate adaptations are identified to meet individual needs of students.

CF: And that is probably pretty accurate.

TTF: And I think (.) kids who are advanced are also special needs learners ...

CF: I could go down to a 2 on that without much problem. That is about what it was. There wasn’t a large variety.

TTF: (.9) She’s attended somewhat to the kids that are struggling, but she’s just=

CF: = maybe even a 2.5 would be (.2) appropriate, but I could go with a 2.

TTF: I’m going to go with a 2 because I don’t think she’s r:eally attending to the individual needs. What did you give her overall?

CF: I had a 3.

TTF: And I did too. Yeah, you’re exactly right on that. (.2)


Discussion. The discussion about the preassessment refers to the first indicator of the assessment standard (Standard IV). Deciding how to score this indicator proves difficult. The descriptor for meets expectation (4–3) reads: “The plan includes multiple assessments with a variety of modes that fit the content and the student skill level in their complexity.” The difficulty for the tenure-track rater is that the sample includes multiple assessments, but the content seems inappropriate, too simple. The descriptors do not provide an opening for considering the appropriateness of the content level, only the fit between assessment and skill level. On this reading, a higher score is warranted by the descriptors but disallowed by the raters, and although not initially sharing this concern, the clinical faculty member comes to agree. By design, rubrics bound judgment, but in this instance, the result proves troubling.


A new interactive strategy emerges: rubric stretching. This strategy differs from the others in part because it is attached to the task of recognizing quality rather than achieving consensus. Dissatisfied in some sense with what the rubric requires and moving beyond the underlying agreements that initially produced the standards, the scorers add to what is required. Such additions produce both validity and reliability problems for scoring. Standards thereby become increasingly arbitrary and undermine fairness because those who write the samples are unaware, operationally, of what comes to be required to receive a desired score. Rubric stretching of a different kind follows when, as noted, scorers bring background knowledge to their task that reveals potential weaknesses in a sample that otherwise would go unnoticed. In principle, however, it is possible that additional knowledge could advantage a sample writer if, for instance, a scorer were aware of specific challenges arising in a teaching context that were unacknowledged within the sample itself. This situation actually obtained when the entire sample was scored.


*******

Standard V: The teacher uses ongoing analysis of student learning to make instructional decisions. Indicators: modifications based on analysis of student learning and sound professional practice.


TTF: Okay, the next section. This is faster than I thought. Instructional decision making, I gave her two 3s.

CF: I gave her a 4 and 5. I thought she did g:ood on the (.) choices that she made. Let me look at that for a second.

TTF: I thought it was all right but kind of mundane. But (.4) I may be wrong on that one.

CF: Um ((pages turning)). ... I’m just looking through my notes here. ((pages turning)) (.6) I thought they were both (.) student learning oriented and she based her decisions on what to do from what she was seeing happening with the kids. (.4) (    )

TTF: See, this is the one you were referring to. It wasn’t very sophisticated.

CF: (.9) I’m getting this one confused with another I read this morning. I was asked to grade [a sample] last night ... so I’m—I think maybe I was thinking about that other one here.

TTF: (.5) She does dec:ide to do additional monitoring (.2) in the one and I think the other one her modification was reteaching. And so, to me those are okay.

CF: I think I just thought they both, she looked at what students were doing and as a result she decided to do something different [which is what they’re asking.

TTF: [Okay, so you have her a=

CF: = I had a [4 and a 5.

TTF: [4 and a 5.

CF: And you had 3 and=

TTF: =3.5.

CF: 3 and 3.5.

TTF: So should we go with=

CF: =let’s go 3.5 and 4. I think that’s fair.


Discussion. This segment proves quite revealing. Rubric stretching is employed when the tenure-track faculty scorer expresses disappointment with the instructional modifications made, deeming them “mundane.” Nowhere can the indicators or descriptors used be understood to require anything other than “appropriate modifications”; nothing is said that suggests that unusual or distinctive adjustments are required, let alone “sophisticated” actions. Puzzled by the differences in their assessments of this standard, the clinical faculty member realizes that scoring a case the evening before has caused confusion, and self-corrects: “I think maybe I was thinking about that other one here.” For both raters, scoring multiple samples is a problem; samples blur, and it is difficult to keep the evidence and the arguments straight. Time pressures add to the difficulty. The tenure-track faculty member tries to speed up the conversation (note the overlapping words: CF: “which is what they’re asking”; TTF: “Okay, so you have her a=”). Following these words, the clinical faculty member states, “I had a 4 and a 5.” After what has been said and admitted by the clinical faculty member, this statement proves surprising to the tenure-track scorer: “4 and a 5.” Rather than explore their differences, however, the raters split them, but not exactly: A 5 and a 3.5 become a 4, a rating closer to the tenure-track faculty member’s score than to the clinical faculty member’s.


*******

Standard VI: The teacher uses assessment data to profile student learning and communicate information about student progress and achievement. Indicators: profile clarity; data summary; and impact on student learning for whole class, student subgroup, and individual students.


TTF: (.9) Okay. Next is report of student learning. ... TTF: I gave her a 2.

CF: I gave her 2s and 1s all the rest.

TTF: Those 2s we agreed on.

CF: 2s on the last two I had [her

TTF:  [You were harder on her. That’s interesting that on this one I  [was

CF: [Here’s what I wrote—used unit scores. First of all her reporting was wrong. The graphs didn’t show the results of this objective. It showed the results of the whole unit.

TTF: I didn’t go back and double-check her. ...

CF: They were the wrong results. She also stated here that in her individual student section they’re both all zeros for the preassessment and yet it shows a 22. Her data and what she’s saying didn’t jibe very often and I thought it was inadequate. ...

TTF: So (.2) let’s go back to the whole class and talk about those two because we saw that kind of b:ackwards from each other. I may not have picked up=

CF: =clarity of presentation and summary, ag:ain to me that’s saying is the graph clear, is the summary [clear.

TTF: [it’s that same issue

CF: (.3) She mislabeled things, she called the pretest the posttest. I circled that.

TTF: See, you looked more carefully at it than I did.

CF: I thought the data was clearly represented with the graphs on this part. The pre- and posttests and all the objectives I thought were very clear, but the s:ummary was very weak. ...

TTF: So I have a 2 and you have a 4. Should we go with a 3 on that first one?

CF: that would be (.2), yeah that would be fine. ...

TTF: All right, we got that section.


Discussion. The scoring of this standard involves three categories of students: whole class, subgroups, and individual students. For each category, there are three indicators: profile clarity; data summary; and impact on student learning. For the first category, “whole class,” the first indicator, “profile clarity,” is scored by the clinical faculty member as a 4, whereas the tenure-track faculty member rates it a 2. All the rest of the categories, with one exception, are scored “2s and 1s” by the clinical faculty scorer, who describes this standard as the candidate’s “worst,” a statement followed by an audible exhale. The tenure-track faculty member seems surprised, noting, “You were harder on her. That’s interesting.” As the clinical faculty member describes the reasons for the scores, the tenure-track faculty member admits being careless, first by saying, “I didn’t pick that up” and later stating, “I didn’t take the time to check it.” In the tenure-track faculty scorer’s haste to complete the review, something potentially important about the sample was missed. Critical analysis of the sample by the clinical faculty member continues, and another admission and self-justification from the tenure-track scorer follows: “See, you looked more carefully at it than I did.” Rather than seek better understanding of the sample to check the claims of the clinical faculty scorer, the tenure-track faculty member agrees, and once again, they split the difference.


This excerpt reveals another rule: Scorers are prepared (scorers do their homework, act in good faith, and carefully consider the sample in relationship to the rubrics when scoring). Upon this rule rests the validity of the entire work sample assessment system, and it conveys a professional expectation. Near the conclusion of this excerpt, it is evident that the scorers, especially the tenure-track faculty member, are in a hurry to complete their task; the tenure-track faculty member remarks, “Okay, let’s go [to]” but is interrupted by the clinical faculty member, who has one more comment to make. The tenure-track faculty member pushes the assessment along, and there is a sense of relief in this scorer’s words when discussion of the standard ends: “All right, we got that section.” Most important, this interactive sequence reveals weaknesses in the splitting the difference scoring strategy. Use of the strategy assumes a relatively high level of interpretative agreement on a section and accepts that the most salient differences are tied to one or another scorer being slightly more or less demanding. It is not an effective strategy if either rater misreads the sample or is careless when reading it. Because splitting the difference is most often used without checking whether there is in fact an underlying interpretative agreement, not only about the conclusion but also about the “facts,” the possibility emerges that serious misunderstandings will go undetected and perhaps actually be amplified, particularly if an error comes early in the conversation and is offered by the tenure-track faculty member.


*******

Standard VII: The teacher analyzes the relationship between his or her instruction and student learning in order to improve teaching practice. Indicators: interpretation of student learning; insights on effective instruction and assessment; implications for future teaching; and implications for personal professional improvement.  


CF: Then her reflections. It was hhh all over the place. I went 3, 2, 4, 5.

TTF: And I went 3, 2, 3, 4. [So we’re close.

CF: [we’re very close. Why don’t we go 3, 2, =

TTF: =okay so 3, 2

CF: And then you go 3.5

TTF: And I actually have a 3.5 right there. Should we go for 4? Did you give her a 4 there?

CF: I gave her a 4. Then let’s go 4, 4. I think that’s fair.


Discussion. This excerpt begins with use of the strategy of previewing scores. The raters are in a hurry. Their task is nearly complete unless they fail the sample, and it is clear that they do not want to fail it (the next excerpt underscores this point). Failure would require additional work and extensive justifications, whether to support requiring that a unit be retaught or to guide efforts to improve the sample. After all the critical comments and the marginal and failing scores, the final conversation excerpt, in which the entire sample is rated, represents an effort to justify passing what both raters believe is a weak sample. To accomplish this aim, they engage in some contradictory talk.


*******

Overall assessment: indicators: mechanics of writing; organization of sample; diversity (descriptions related to diversity); and overall TWS quality.


TTF: Okay. (.5) All right, overall her mechanics were terrible.

CF: I thought they weren’t very good. I gave her a [3.

TTF: [ I gave her a 3. I should have given her a 2.5 but I was feeling sorry for her at this point.

CF: There were a lot of typos.

TTF: Yep. (.3) Organization you thought it was okay.

CF: I thought it was v:ery organized as far as how she laid it out, [indicating ( )

TTF: [I’m going to go with you on that because I couldn’t find much of ((laughs)) anything.

CF: I thought the rest of it was 3s. You had a 2 on diversity. I had a 3.

TTF: ... I’m also concerned about the l:ow level of complexity and challenge of the pre/posttest as well as all of the formative assessments. She had them drawing lines, matching, circling, sorting, it was very easy.

CF: It was very elementary.

TTF: I could conc:eive of an instance when she might have said, (.2) these are students who have a lot better conceptual knowledge than they have the ability to linguistically (.) talk about it. You could almost have justified some of the instructional decisions based on [demographics of her class

CF: [ It would have been for a group of students who had deficiencies.

TTF: I could almost have gone with what seemed to be very elementary stuff if she’d said this is why I’m doing it. But I saw absolutely n:o indication that it was a conscious choice. (.3) And so, that was my overall note. ... (.4) And then, I thought the best thing she did she didn’t talk about. When she sent them out into the school with the digital cameras.

CF: Right, (.3) she didn’t really discuss that.

TTF: I wondered if they actually did it. I thought, okay she’s on to something here. They were actually going out in their own environment, applying it.

CF: I think that’s because her reflections are low. She probably doesn’t have a very good idea of what she’s strong at and what she’s weak at. I don’t think she recognized that very well.

TTF: I was exc:ited to say okay she did use technology in a really (.2) solid way but never really talked about it. It made me wonder if that was one of the things she ran out of time and didn’t really do.

CF: It could be that, or just the fact that she didn’t realize she was doing a good thing there.

TTF: I got the sense as you looked at the lesson plans (.1), some had reflections and then they quit. I wondered if she really had finished teaching it.

CF: That’s something I didn’t notice. That could be that she ran out of gas at the end of this.

TTF: Or by the time this was due she’d only done half the unit which I don’t know how you counsel students about that. (.3) That’s how I explained to myself why she didn’t d:eal with that.

... CF: (.7) Now I hhh circled pass and I put barely.

CF: I put a very low pass. I thought she squeaked by.

TTF: Even though the numbers, and I know we were weren’t paying attention to the numbers, I sat back and the reason I said barely pass is I thought was how would you fix this? I put a barely pass.


Discussion. This excerpt begins with two remarkable statements: The tenure-track scorer comments, “overall her mechanics were terrible,” yet a 3, meets expectations, is given. The clinical faculty rater remarks that the mechanics “weren’t very good” but also gives a score of 3. The tenure-track scorer justifies the score by stating, “I was feeling sorry for her at this point.” This statement is indicative of the extent to which the scorers are willing to go in order to avoid failing a sample and in a sense represents the ultimate use of rubric stretching. Both scorers seek possible explanations for what they believe to be rather poor-quality work. The tenure-track faculty scorer even wonders if the candidate was unable to do higher quality work because of an inability to finish teaching the unit, and the clinical scorer wonders if the candidate simply ran out of “gas at the end of this.” Reaching well beyond the rubrics, both are sensitive to the time demands of the program on students and seeking reasons for leniency. In the end, they conclude that she “squeaked by.” They agree to pass the sample, but only reluctantly: “Now I hhh circled pass and I put barely.” Ultimately, it appears that the single most important reason for passing the sample is that the raters could not see any point to failing it or any way in which it could be fixed or improved with a reasonable investment of time and resources. They did not believe that the candidate should be required to reteach and produce a new sample, a possibility that was not discussed. They assiduously avoided adding up rubric scores, perhaps because if they actually went by the numbers, a different outcome might have followed.


DISCUSSION: RULES, STRATEGIES, AND ISSUES


Analysis of the case sample renders unsatisfactory the simple and straightforward answer to the first research question posed by this study: How do the raters understand their responsibilities? The simple and expected institutional answer is to produce consensus on a set of scores indicating sample quality. Clearly, there was no consensus on scores if consensus is understood to mean “that parties reach an understanding that all interpret in the same way” (Moss & Schutz, 2001, p. 59). They did not. Perhaps they cannot. Moreover, no amount of rubric rewriting to achieve higher levels of specificity could likely accomplish this end. The concepts embedded in the standards are inevitably, to a degree, slippery, and, lacking shared context, they require scorers to fill them in based on their own individual and, to some degree, idiosyncratic experience with schooling, teaching, learning, and assessment. As an aim, agreement offers a lower and seemingly more reasonable target than consensus, one that recognizes the inevitability of compromise but still enables joint action. So, is this how the raters understood their work: to produce a weak agreement about the quality of a work sample? Because beginning teachers could be prevented from being certified to teach based on a failing sample score, because sample scores are used to evaluate the program, and, perhaps most important, because children are involved, this cannot possibly be a fully acceptable conclusion.


An alternative emerges from the case that represents a return to consensus as an implicit aim and responsibility. Necessitating compromise, agreement operates explicitly and consensus implicitly, subtly, in scorer actions. A successful sample-scoring conversation results in agreement about whether a sample passes, but achieving this aim does not require a high degree of shared understanding of the rubrics among scorers; every score involves interpretation, a weighing and sifting of evidence, a defining and redefining of terms, and some form of negotiation growing out of a scorer’s sense of item quality. The consensus that is required to produce agreement and that guides negotiation of any and all scores grows out of the taken-for-granted interactive rules and scoring strategies employed by the raters. Grounded in an assumed common experience that is in fact not fully common, the rules and strategies take form and find expression in response to the very practical task of having, in a very short period of time, to make and then agree on a set of quality judgments about the work samples. What emerges is a kind of procedural consensus about how to go about sample scoring that aspires to utility and reasonableness rather than to precision and objectivity.


The second question concerns how samples are scored: what rules and strategies scorers use and how they use them. The interactive rules that emerged from analysis of the TWS scoring conversations were: tenure-track faculty speak first, the efficiency rule, the equivalency rule (identical score = same meaning), and scorers are prepared. Operating as assumptions, these rules enable interaction and facilitate agreement. The strategies were: splitting the difference, rubric simplification, and previewing scores, which support the efficiency rule, and rubric stretching, which supports the aim of warranting quality.


The first rule, tenure-track faculty speak first, was present in 9 of the 10 scoring conversations, indicating variation in one of the teams. By beginning the scoring conversation, the tenure-track faculty member established an initial framework for discussion and claimed a discursively more powerful position, at least at first. This rule places a requirement on the clinical faculty member to respond to and accept the tenure-track faculty member’s problem framing, sometimes by use of the previewing scores strategy (the case began with a preview of the clinical faculty member’s scores). As conversation evolves, these roles become somewhat fluid and more open. Both raters are bound by the institutional expectation that a single decision will emerge from the conversation, and both understand and work toward this outcome. The ease with which scorers accept the necessity of compromise seems to suggest recognition of legitimate, but nonconsequential, differences in background and understanding. Commitment to achieving agreement as quickly as possible is evident in the overlapping talk (indicated by brackets) and drives the conversation forward. Limits in available time for scoring prove very motivating.


With rare exception, the clinical faculty score the rubrics higher than do the tenure-track faculty. Noting differences, sometimes through use of the previewing strategy, and facing the press of the efficiency rule, scorers employ rubric simplification and, most especially, splitting the difference to achieve a single score. When the average produced by splitting the difference falls within the passing range, the conversation moves along swiftly, eliminating the need to consider carefully the reasons that the initial scores differed. This action is supported by the equivalency rule (identical score = same meaning). Splitting the difference is thought to produce scores that are fair, honoring both raters’ judgments while expediting the review. When splitting the difference points toward a failing score, more extensive justification is necessary. No rater wants to fail a work sample. For clinical faculty especially, there is a heavy personal investment in passing the samples because they likely would be most responsible for remediation. For both scorers, however, sample failure represents a negative statement not only about the candidate but also about program quality.
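To make the mechanics of splitting the difference concrete, the following sketch offers a hypothetical illustration. The five-item rubric, the individual scores, and the 2.5 cut score are invented for the sketch, not drawn from the study’s data; it shows only how averaging a lower-scoring tenure-track rater against a higher-scoring clinical rater can move a sample into the passing range without the reasons for the raters’ disagreement ever being examined.

```python
# Hypothetical illustration of the "splitting the difference" strategy.
# The rubric items, scores, and passing threshold below are invented
# for this sketch; they are not taken from the study's data.

def split_the_difference(tt_scores, cf_scores):
    """Average each pair of rubric scores (tenure-track vs. clinical)."""
    return [(a + b) / 2 for a, b in zip(tt_scores, cf_scores)]

# Tenure-track faculty tend to score lower; clinical faculty higher.
tt = [2, 2, 3, 2, 2]   # mostly "approaching expectations" (2s)
cf = [3, 3, 3, 3, 3]   # uniform "meets expectations" (3s)

merged = split_the_difference(tt, cf)
PASS_MEAN = 2.5        # invented cut score for the sketch

print(merged)                                  # [2.5, 2.5, 3.0, 2.5, 2.5]
print(sum(tt) / len(tt) >= PASS_MEAN)          # False: tenure-track profile alone fails
print(sum(merged) / len(merged) >= PASS_MEAN)  # True: the averaged profile passes
```

The merged profile passes even though the tenure-track rater’s scores alone would not; this is precisely the circumstance in which, the analysis suggests, the conversation moves on swiftly and potentially consequential differences between the raters go unexamined.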


Rubric stretching is not merely a strategy opposing rubric simplification. As noted, stretching is tied to the aim of providing quality assurances and is assumed to be the especial purview of the tenure-track faculty scorer, whereas simplification is linked to the efficiency rule. In addition to rubric stretching, tenure-track scorers assert their status as guardians of quality by usually scoring sample sections lower than the clinical faculty team member. Yet, out of necessity, initial scores are compromised. The result is that the tenure-track faculty scorer can defend quality while simultaneously lowering a score, an action justified by the rules, strategies, and formal role expectations. Stretching complicates scoring by allowing relevant but formally inappropriate information to influence judgment. The difficulty is that stretching, unlike simplifying a rubric, exposes a rater’s movement outside of proper role boundaries and reveals sources of potential bias. The scorer moves outside of the agreements that produced a rubric, revealing how a claim to objectivity—to sticking to the boundaries set by a rubric—and to disinterestedness is a pretense, one that the case scorers had difficulty maintaining, as noted previously, even as there was evidence of having tried. Nevertheless, and here a contradiction emerges, questions of quality and fairness, and perhaps of decency, may justify stretching, as when the scorers positively noted the promising but underanalyzed and inadequately reported activity of sending the children out into their environment with cameras.


No rule is of greater importance than scorers are prepared. When this rule breaks down, the entire process becomes suspect because no score can be assumed valid. Breakdown comes when scorers are overburdened by the number of samples they must rate while carrying on their various other work responsibilities. The case provides strong evidence of this problem. Yet, no one outside the scoring team would know this rule had been breached because the results for all samples are reported in the same clean, simple, numerical form, coupled with an indication of having been passed or failed. The simplicity of the form in which ratings are reported belies the dynamics of the scoring process itself, suggesting a strong presence of both objectivity and disinterestedness. More will be said on this point shortly. It is apparent that, short of long and detailed examination of the interpretations made of the evidence presented within a sample, use of the rubric-stretching and splitting-the-difference strategies may result in missing serious shortcomings in a sample (and perhaps in the beginning teacher), even as they mask potentially deep differences between scorers in their understanding of the rubrics and of good teaching. Here, once again, the efficiency rule comes into play: Given their many and diverse responsibilities, faculty scorers must complete their work as quickly as possible, and potentially consequential differences are procedurally set aside unnoticed.


An additional word about the scorers-are-prepared rule: Forming scoring teams of clinical and tenure-track faculty like those whose work was studied here brings together individuals representing very different status systems and claims to authority within the university. This combination of scorers may produce some unique challenges not found in other pairings. Confidence in a fellow scorer can be anchored in recognition of technical competence—through training, the scorer has demonstrated reasonable reliability—but, more likely for those who participated in this study, it was grounded in professional respect and relationship, what Seligman (1997) characterized as a “structure of familiarity” (p. 162). The danger here is that respect may be asymmetric: Tenure-track faculty are better known to clinical faculty than clinical faculty are to tenure-track faculty. Institutionally, tenure-track faculty enjoy greater status and much greater power than clinical faculty, thus supporting the tenure-track faculty scorers’ assumption of greater discursive power and influence over scoring outcomes and, concomitantly, clinical faculty deference. Such relationships, and differences in relationships, very likely influence how rules and strategies develop and are used within the scoring conversations.


The third and final question concerns issues about work sample use, and these are perhaps best approached through a set of questions about validity first discussed by Linn, Baker, and Dunbar (1991) in their classic paper on performance-based assessment. Linn and his colleagues offered a number of cautions about performance assessment, beginning with the “consequential basis of validity” (p. 17). They wrote, “High priority needs to be given to the collection of evidence about the intended and unintended effects of assessments on the ways teachers and students spend their time and think about the goals of education” (p. 17). Their point forces the question, Are samples worth the time students spend creating them and the time that faculty spend coaching writers and scoring samples? Could this time be spent in better ways? From a programmatic perspective, do samples represent a conception of teaching and learning consistent with program values and intentions? Consider that TWSs not only present a view of what beginning teachers should know and be able to do at some level but also provide a statement of what teacher education programs should teach and be held accountable for. Not surprisingly, given the tight linkage between NCATE and INTASC standards and work samples, emphasis is placed on skills associated with a rather technical view of teaching. Generally speaking, the model favors relatively simple teaching models, especially direct instruction, over less linear and more complex approaches to teaching. It is apparent that the outcome of a sample may be predetermined by the choice of the teaching unit to be profiled—an issue that, again, arose with the one failed sample. This practice is narrowing and perhaps negative, discouraging students from demonstrating what they actually are capable of achieving. Instead, at least for those who understand the task well, a safe and savvy selection is most likely made, one that is familiar and comfortable and thereby facilitates ease of reporting and also of scoring.


Considering fairness in relationship to validity, Linn and his colleagues (1991) warned of the dangers of bias of various kinds, including in scoring, and echoed the common statement that “training and calibrating of raters is critical” (p. 18). It is doubtful, however, that any amount of training could ever achieve the desired level of fairness, of validity, when raters score work samples. Of course, sample scoring, like all standards-based assessment systems, involves a good deal of interpretation. This is nothing new, only a matter of underscoring the difficulty and complexity of the task and of noting the importance of setting reasonable expectations. Given the results from this study, the problem seems especially daunting. For example, one solution often offered is to tighten the standards for scoring and the definitions of important terms. As the data suggest, the standards employed at BYU—which are representative—prove themselves in practice to be simultaneously so tight that students appear to be strongly encouraged to choose direct models of instruction to portray their best thinking (even when such models prove overly constraining) and so loose that raters struggle to understand precisely what is required by a standard or indicator. Tightening presents problems when the intention of sample scoring is to distinguish among multiple levels of performance, especially at higher levels. Moving up a scale often requires producing items that are increasingly imprecise—statements like “unusually well aligned”—and that call for comparing various kinds of candidate performance with some imagined and presumably well-known standard of teaching expertise. Clarity often requires triviality, sometimes involving counting, as has been shown. But then quantitative measures must give way to indicators and descriptors with ever higher degrees of imprecision as means for distinguishing between levels. To fill in the imprecision and to make a quality judgment, an accepted and well-understood interpretative framework is needed, especially one of beginning teacher development; but such frameworks, too, are general and varied, again underscoring the limitations of assuming the existence of a professional consensus. Yet in practice, such frameworks appear to be taken for granted when they need to be forged. On this point, more will follow shortly.


Ultimately, without rethinking the assumption that consensus is the aim of sample scoring, the production of exemplars as a helpful solution to reliability and validity problems also promises disappointment. On this view, scoring involves matching performance to rubric and indicator, but detailed exemplars would be required for every level and kind of judgment and every model of, or approach to, teaching—unless, of course, one model is preferred, and this would need to be a decision explicitly made and then publicly defended (then raising a set of questions that Linn and his colleagues [1991] discussed in relation to content quality and content coverage). What models would be excluded? Which one(s) included? Ironically, this solution actually would make scoring more labor intensive, more complex, and potentially less sensitive and certainly less professional. Difficulties of these kinds, at least in some measure, create the conditions that produce the scoring strategies identified in this study: means for reducing complexity and enabling agreement on sample scores within an acceptable time period.


Questions of fairness arise from an additional source: For faculty members intimately involved in teacher education, negative sample ratings are recognized, if vaguely, as comments about the quality not only of a candidate’s work but also of the program and the instruction offered. This point is underscored by comments of the scorers of the single failed sample and suggested by the discussion surrounding the decision to pass the case sample despite all its weaknesses. The bind expresses itself in multiple ways, one of which is the long-recognized tension between the responsibility to evaluate candidates and the responsibility to support their development. The easy solution to the problem is to have samples scored by uninvolved and genuinely disinterested parties, thus removing what might be thought of as a conflict of interest. Aside from the ongoing challenge of locating, paying for, and training sufficient numbers of scorers to support sample use, another and perhaps more serious difficulty arises with this solution. Having scorers who know the program, the area, and the students very likely leads to more responsible, sensitive, and fair scoring, sometimes to the teacher candidate’s benefit and sometimes not—as when the tenure-track scorer revealed background knowledge of the area in which the candidate was teaching, even while asserting, “that’s a lens that shouldn’t be used.”


Linn and his colleagues (1991) raised an important question about the cognitive complexity of the tasks required by performance assessments. They observed that “it should not simply be assumed . . . that a hands-on scientific task encourages the development of problem solving skills, reasoning ability, or more sophisticated mental models of the scientific phenomenon” (p. 19). So it is with Teacher Work Samples. The more precise the requirements of a successful sample—and precision is needed to achieve greater fairness in scoring, understood here as consistency—the more likely it is that students will create predictable products; in turn, the more predictable the product, the less likely it is that a sample will actually represent well a student’s most innovative thinking and perhaps best work. As previously noted, the model is inherently conservative, biased in favor of direct instructional models and teaching strategies. Relatedly, Linn and his colleagues posed questions about content quality and also about meaningfulness: “The tasks selected to measure a given content domain should themselves be worthy of the time and efforts of students and raters” (p. 19). If students are unable to demonstrate their best work and thinking, it is difficult to imagine that the effort put into work samples is worthy of their time or fully meaningful. Data from this study speak only indirectly to these issues.


There is some evidence within the wider literature that producing work samples, like portfolios, encourages reflectivity and invites careful consideration of the relationship between what teachers do and what students learn (Devlin-Scherer, Daly, Burroughs, & McCartan, 2007; Gordinier, Conway, & Journet, 2006). These are potentially valuable benefits of sample use. As the excerpts quoted from the failed sample suggest, however, problems likely remain. One of these, noted in the case sample, is the dependence of judgments of sample quality on writing ability. Work samples, like the portfolios used for National Board Certification (Burroughs, 2001), are writing tests. Despite claims to the contrary, there remains a large gap between the writing activities associated with sample production and classroom performance, raising the questions of content coverage and of transferability noted by Linn and his colleagues (1991). After all, a sample, excluding appendixes, is to be only 18–20 double-spaced typed pages. Given so little space, each sentence is precious, which helps explain why the case sample author found listing items necessary and why listing was advised, as the clinical faculty member noted. Nevertheless, even for a beginning teacher who struggles to write clearly, reporting on an effort to create and use a preassessment, for example—a source of difficulty in the case sample—undoubtedly is of some value even when the presentation is confusing. But is an evaluation system so dependent on student writing ability fair?


In another work, Linn (1994) made the case that different standards of validity are acceptable when stakes are high and when they are low. Assessment of TWSs is summative and often high stakes, yet, as noted, achieving a reasonable level of validity in scoring is extraordinarily difficult, perhaps impossible (for this reason, a decision has been made not to use samples for high-stakes purposes at BYU). To lower the stakes and to provide more helpful feedback to students, some institutions, Colorado State University among them, avoid typical evaluative terms in favor of advanced, proficient, and developing. Developing appears to be more than a euphemism for “failing.” There is a problem, however: In what sense can a beginning teacher involved in student teaching be considered proficient? Advanced is actually easier to understand in the sense of “above average.” What could proficient possibly mean for someone just starting out in teaching or any other complex professional activity? Language of this kind is reminiscent of discussions of two decades ago about the development of expertise in teaching (see Berliner, 1994). As noted, to be valid, the criteria for assessing samples ultimately must be embedded in some sort of understanding of beginning teacher development, and this should be a complex and rich understanding. Implicitly, such considerations undoubtedly do influence how rubrics are scored. Yet, the descriptors commonly employed even for the meets expectation category consistently deny this possibility; this is a problem noted by one tenure-track faculty scorer in this study, who commented: “If you think about the development (.5), and the level of the teacher at this stage (.2). You are a very experienced teacher so I probably would expect more from y:ou, using similar language from the rubric. .hhhh I don’t know if that is appropriate thinking or not.”  


Clearly, sample scoring always will be complex and messy, and issues of validity are, and will ever be, persistent, never fully or adequately resolved. Given this messiness and complexity, the rules that underpin scoring and the strategies used to produce agreement (not consensus) identified in this study make some sense even as they are worrisome; as noted, some sort of rules and strategies are always in play in human interaction, and these dramatically shape outcomes. Knowing the rules and strategies of sample scoring points toward an obvious question: Given the context of teacher education, are these rules and strategies inevitable? Moreover, are there more desirable alternatives? If so, can they be achieved, and at what cost? Work samples may be very helpful means for assessing teacher education program content and probably do have value for encouraging student reflectivity, but there are good reasons for doubting the wisdom of their use for determining a student’s suitability for teaching. At this point, a good deal of additional research is needed to more fully determine the strengths and weaknesses of Teacher Work Samples and to uncover the challenges of sample scoring. Such work is needed to better understand and realize the educative potential of work samples and to prevent their misuse.


APPENDIX—Conversation Analysis Transcript Codes


Left brackets [ indicate the point at which a current speaker’s talk is overlapped by another’s talk.
Equal signs = indicate no gap between lines, no stop.
Numbers in parentheses (.9) indicate elapsed time in silence, in tenths of a second.
A dot . indicates a tiny gap.
Underlining indicates some form of stress, indicated by pitch or amplitude.
Colons indicate prolongation of the immediately prior sound; the length of the row of colons indicates the length of prolongation.
Except at the beginning of lines, CAPITALS indicate an especially loud sound compared with the surrounding talk.
With a preceding dot, a row of h’s (.hhhh) indicates an inbreath; without a dot, an outbreath; the length of the row suggests the length of the inbreath or outbreath.
Empty parentheses ( ) indicate an inability to determine what was said.
Double parentheses (( )) contain coder descriptions.
(From Silverman, 2006, pp. 398–399)


References


Berliner, D. C. (1994). Expertise: The wonder of exemplary performances. In J. N. Mangieri & C. C. Block (Eds.), Creating powerful thinking in teachers and students (pp. 161–186). Fort Worth, TX: Harcourt Brace.


Brigham Young University. (2006, August). Educator preparation program: Teacher work sample. Retrieved November 15, 2007, from http://education.byu.edu/deans/documents/TWS_rubric.pdf


Burroughs, R. (2001). Composing standards and composing teachers: The problem of National Board Certification. Journal of Teacher Education, 52, 223–232.


Cochran-Smith, M. (2001). The outcomes question in teacher education. Teaching and Teacher Education, 17, 527–546.


Cochran-Smith, M. (2005). The new teacher education: For better or for worse? Educational Researcher, 34(7), 3–17.


Denner, P. R., Salzman, S. A., & Bangert, A. W. (2001). Linking teacher assessment to student performance: A benchmarking, generalizability, and validity study of the use of teacher work samples. Journal of Personnel Evaluation in Education, 12, 269–285.


Devlin-Scherer, R., Daly, J., Burroughs, G., & McCartan, W. (2007). The value of the teacher work sample for improving instruction and program. Action in Teacher Education, 29, 51–60.


Francis, D., & Hester, S. (2004). An invitation to ethnomethodology. London: Sage.


Garfinkel, H. (1967). Studies in ethnomethodology. Englewood Cliffs, NJ: Prentice Hall.


Girod, G. R. (Ed.). (2002). Connecting teaching and learning: A handbook for teacher educators on teacher work sample methodology. Washington, DC: AACTE Publications.


Gordinier, C., Conway, K., & Journet, A. (2006). Facilitating teacher candidates’ reflective development through the use of portfolios, teacher work sample, and guided reflections. Teaching and Learning, 20, 89–105.


Hutchby, I., & Wooffitt, R. (1999). Conversation analysis. Cambridge, England: Polity.


Linn, R. L. (1994). Performance assessment: Policy promises and technical measurement standards. Educational Researcher, 23(9), 4–14.


Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15–21.


Mehan, H., & Wood, H. (1975). The reality of ethnomethodology. New York: Wiley.


Moss, P. A., & Schutz, A. (2001). Educational standards, assessment, and the search for consensus. American Educational Research Journal, 38, 37–70.


Rawls, A. W. (2002). Introduction. In H. Garfinkel, Ethnomethodology’s program (pp. 1–64). Lanham, MD: Rowman and Littlefield.


Schalock, M. D. (1998). Accountability, student learning, and the preparation and licensure of teachers. Journal of Personnel Evaluation in Education, 12, 269–285.


Schalock, H. D., Schalock, M. D., & Ayres, R. (2006). Scaling up research in teacher education: New demands on theory, measurement, and design. Journal of Teacher Education, 57, 102–119.


Seligman, A. B. (1997). The problem of trust. Princeton, NJ: Princeton University Press.


Silverman, D. (1993). Interpreting qualitative data: Methods for analysing talk, text and interaction. London: Sage.


Silverman, D. (2006). Interpreting qualitative data (3rd ed.). London: Sage.


Toulmin, S. (2001). Return to reason. Cambridge, MA: Harvard University Press.


Watkins, P., & Bratberg, W. (2006). Teacher work sample methodology: Assessment and design compatibility with fine arts instruction. National Forum of Teacher Education Journal, 17(3), 1–10.


Wenger, E. (1999). Communities of practice. New York: Cambridge University Press.


Wichita State University. (2007, Fall). Teacher work sample: Handbook and rubrics. Retrieved November 5, 2007, from http://webs.wichita.edu/depttools/depttoolsmemberfiles/ess/TWSDocument.pdf


Wineburg, M. S. (2006). Evidence in teacher preparation: Establishing a framework for accountability. Journal of Teacher Education, 57, 51–64.




Cite This Article as: Teachers College Record, Volume 112, Number 3, 2010, pp. 775–810.
https://www.tcrecord.org ID Number: 15886

About the Author

Robert V. Bullough, Jr., Brigham Young University

ROBERT V. BULLOUGH, JR., is professor of teacher education and associate director, Center for the Improvement of Teacher Education and Schooling (CITES), Brigham Young University, Provo, Utah. His interests are wide ranging, from rare education book collecting to narrative inquiry into teacher lives and the history of education. His most recent books include Counternarratives: Studies of Teacher Education and Becoming and Being a Teacher (2008) and, with Craig Kridel, Stories of the Eight-Year Study: Reexamining Secondary Education in America (2007), winner of the AERA Division B Outstanding Book Award. Both books are published by SUNY Press.
 