State Assessment Becomes Political Spectacle--Part IV: First Act in the History of Arizona Assessment Policy
by Mary Lee Smith, Walter Heinecke & Audrey Noble - September 13, 2000
…Continued from Part III: Defining ASAP: Scripts and Readings
ASSESSMENT POLICY PRIOR TO ASAP
You have to go back in time before the advent of ASAP to understand its place in people’s consciousness as a reaction against standardized testing. Under the previous state assessment policy, Arizona students experienced one of the highest test burdens in the nation. Legislation mandated the administration of the Iowa Test of Basic Skills (ITBS) to every pupil in grades 3-8 (the Tests of Academic Progress in grades 9-12). The full battery of both tests (products of Riverside) was given in the spring of every year and the results published by school and grade. Although the state placed no consequences on the scores of students, some districts based decisions about salaries of administrators and teachers in part on test results. High stakes are often in the eye of the beholder, however, as teachers reported a high degree of embarrassment, shame, and pressure when the results (reported at the district, school and grade level) came out in the newspapers each summer.
In addition to the norm-referenced testing, the state also required that each school district prepare and administer tests of basic skills, a list of which had been first approved by the State Board of Education in 1983. Dissatisfaction with the list resulted in the development and authorization of the Arizona Essential Skills, for which the state would hold districts accountable. In 1987, the Board of Education appointed representative groups of educators and content specialists to committees that wrote the content frameworks and revised them based on extensive hearings around the state. Staff of the Arizona Department of Education guided the work of these committees toward the newly emerging principles of constructivism and progressivism. Like the work of all such committees, however, the finished products were compromises, so that even traditional basic skills schools could interpret the Essential Skills as accommodating their own instructional preferences.
Almost a decade later, an influential staff member reflected on their work:
"The Language Arts Essential Skills was very different from the polyglot of little skills that had been around for awhile. The Language Arts Essential Skills looked at what people were learning about from a constructivist philosophy of education. It looked at what the writing teachers were saying and the writing professors, writing research was saying about writing as a process. It looked at new ways of reading instruction, somewhat whole-language based or literature, and it looked at integrating the language arts so that you didn’t teach reading separately from writing. You looked at how they supported each other. And, finally, it looked at the possibilities of assessing language and other subjects directly that--instead of judging whether students can write by having them do a multiple choice test, it looked at having students actually write, and assessing that writing. So all those things came together to produce the Language Arts Essential Skills, but the conflict, the problem that had to be solved, was that the testing the state was doing was the most invasive in the nation."
Both anecdote and research studies portrayed educators’ dissatisfaction with state assessment policy (Nolan, Haladyna et al. 1989) (Smith, Edelsky et al. 1989). High-stakes uses of standardized testing stood in the way of curricular reform. Said a progressive reformer:
"What became evident to everyone was that we had these curriculum frameworks out there that were representing the latest and the best thinking in the content areas, and those were mirroring what we know about the way people learn. And that the dramatic difference between what we said we wanted and what the tests were measuring, and then the new ideas circulating in the testing circles about what you test and what you get, and how you test, and how you get it, made it almost imperative at one point for the testing to be examined. I don’t think testing in Arizona would have changed a whole lot if we hadn’t had the curriculum framework that suggested a very different way of testing, that you can’t test writing as a process with a multiple choice test."
THE BIRTH OF ASAP
One can trace the origins of ASAP to the dissatisfactions of various policy actors and constituencies. In the late eighties, the Board of Education and the Arizona Department of Education seem to have arrived at a shared definition of "the problem": that existing state tests failed to cohere with the newly developed progressive, constructivist content frameworks. C. Diane Bishop was an influential member of the Board in the late eighties and became state Superintendent and head of ADE in 1991. Then a Democrat, she taught high school mathematics and thus "understood higher-level thinking in mathematics." Bishop’s ideas found support among the curriculum specialists at ADE, the Arizona English Teachers Association, the Arizona Education Association, local affiliates of the National Council of Teachers of Mathematics and the Center for Establishing Dialogue in Education, and local university professors. ADE commissioned two research studies that contributed data to this common definition of the problem. One study compared the content of the ITBS with the content in the Essential Skills and found that only 26% of the Skills were tested. A survey (Nolan, Haladyna et al. 1989) showed that most Arizona educators disputed the validity of the state norm-referenced tests, spent too much time preparing students to take them, and believed that the tests had deleterious effects on students, teachers, and the curriculum.
A powerful policy constituency at that time consisted of a group of teachers who believed strongly that students construct knowledge actively and intentionally from their interactions with texts, teachers, and other students; that reading, writing, and problem solving are parts of a whole, and teaching is best when it acknowledges this. These educators believed that one structural barrier to expanding this mode of teaching and learning was the state-mandated standardized testing program. Standardized tests, they believed, encouraged teachers to teach in ways that mimicked the form of mandated assessment. That is, since the ITBS assessed spelling by the ability of the pupils to identify which of four words happened to be misspelled, teachers would teach spelling in precisely that way -- using work sheets that looked like the test items themselves and consisted of recognition of misspellings. Authentic writing would be postponed and de-emphasized altogether while classes spent time perfecting spelling and grammar. There was empirical support for their view that state testing narrowed curriculum and restricted instructional methods (Smith, Noble et al. 1994). If the state would only stop with all the standardized testing, they reasoned, the way would open for better education. Variously referred to as whole language or constructivist teachers, they had collaborated with language and university professors of language arts in an organization titled the Center for Establishing Dialogue in Education (CED). A separate organization of many of the same educators had successfully lobbied for a change in legislated assessment policy to exclude first-graders from ITBS testing. Flushed with this success, the group kept up pressure on the remaining assessment policy and turned out to be a natural ally to those ADE staffers who shared their perspectives on progressive education.
Common ground was also found by language educators and activists who believed that standardized tests adversely affect children whose first language is other than English. They wanted any assessment policy to allow students to be tested in their home language.
The momentum for change was building. Even within this coalition, however, one can uncover alternative views of what was problematic with existing state assessments. One subgroup believed that the ITBS detracted from efforts to reform instruction toward progressivism and constructivism. The other believed that the ITBS did not provide adequate accountability to the Essential Skills.
It was the perspective of the latter group, which desired more accountability for particular outcomes, that resonated most with the policy constituency with the power to make change: the legislature. Asked later about the function of ASAP, a member of this group said:
"This assessment is an accountability measure, because we want those Essential Skills taught. And the only way we know that it’s going to be done is if you drop in and take an assessment of that... because there really have been no accountability measures up until now... It was a matter of here we have the Essential Skills and I think there was ample evidence that many school districts weren’t getting at that. It was, ‘this too shall pass,’ and teachers were still teaching what they were teaching. They weren’t focusing on those Essential Skills. I think that was a driving force to put this all under a legislative piece and put a little teeth into this thing."
Absent from the above quotation is even a nod in the direction of reform toward progressivism. From the sum of data from policy makers analyzed in the early nineties, we find little hint that the legislators involved in the birth of ASAP had concern or understanding of those principles of schooling that so motivated the policy actors at ADE and in the professional associations. Nevertheless, the actors (legislators, ADE, Superintendent, and Board of Education) came together in the Goals for Educational Excellence project to develop new assessment policy and write enabling legislation. The report the project issued in November 1987 concentrated on accountability principles more than reform ideals. For example, "The keys to the future were... a combination of basic skills --communication and computation -- as well as skills such as citizenship, interpersonal skills, thinking skills, and developing creativity." And, "education must emphasize measurement of results to be accountable for accomplishing its goals." Little progressivism and reform there.
Arizona Revised Statutes 15-741, which became effective in May 1990, required the State Board to: 1) Adopt and implement Essential Skills tests that measure pupil achievement... of the state board adopted Essential Skills in reading, writing, and mathematics in grades three, eight, and twelve; 2) Ensure that the tests are uniform across the state, scored in an objective manner, and yield national comparisons; 3) Conduct a survey on "non-test indicators;" 4) Require districts to submit plans for assessment of Essential Skills at all grade levels; 5) Publish report cards at the pupil, school, district, and state levels; and 6) Require norm-referenced, standardized tests at grades 4, 7, and 10. In addition, the legislation affirmed existing (but not previously enforced) provisions for a policy of promotion from grade to grade based on achievement of the Essential Skills.
The legislation itself never mentions the Arizona Student Assessment Program nor commits schools to any principles of practice at all, nor to any particular form of testing. Performance assessment, alternative assessment, authentic assessment -- these terms were not codified in the law. The only thing that supported the progressive agenda in the 1990 legislation was decreasing ITBS testing and moving its administration to the fall, when its results would serve diagnostic rather than accountability functions. Everything else about ASAP was a radical transformation by Bishop and the progressives then serving at ADE. But no one knew it at the time, which made the later suspension of Form D so much of a shock.
Policy actors would later report a high degree of consensus among the Board, legislature, and ADE at the outset of the ASAP program. Yet what they failed to see was that these agencies were agreeing to quite different things. The legislators believed that they were promoting greater accountability as a result of the legislation, while parts of the department and board believed that the state had embarked on a bold new vision of teaching and learning. This confluence of alternative, even internally contradictory perspectives reflects Kingdon’s theory that policies usually obscure underlying contradictions in values and perspectives of political actors whose various agendas come together temporarily (Kingdon 1995). Indeed, the legislation would probably not even have passed if the contradictions had been brought to the surface. The ambiguities and contradictions then send conflicting signals to those who must implement the policy and those who are supposed to react to it.
ASAP: A TEXT AND MULTIPLE READINGS
Most educators never came into contact with the legislation itself and probably would have been surprised by its wording if they had done so. What they were exposed to, in contrast, was the program implemented by ADE and the extensive communication about the function of ASAP to change curriculum and teachers’ instructional practice by changing the nature of the test. Every communication from ADE, the extensive series of meetings, workshops, newsletters, and the like, all trumpeted the merits of performance assessment and the kind of education that was consistent with it.
The rational policy model assumes that a policy has an objective and fixed reality. Interactionist theories (Lipsky 1980; Hall 1995) assume that the text of a policy is only a point of departure for the persons who interpret and implement it and may undergo substantial unwritten revision and transformation over time and levels of the system. The moving target of Arizona assessment policy, rife with internal contradictions and ambiguities from the outset, evolved and diversified in relation to a) shifting power among coalitions of policy actors, b) variations and limitations in capacity-building efforts, c) confrontations with the tests themselves, and d) scarcity of time and resources.
Three reform-minded officials at ADE, supported by the Board, AEA, and constructivist educators, set the tone for the program at this stage. These individuals were effective and outspoken advocates for alternative assessment and for reforming instruction to be more holistic, thematic, and aimed at students’ active engagement and capacity to make connections, solve complex problems, and communicate their thoughts. They believed that assessment must be authentic and integrated with instruction; that subjects should be integrated with each other around interesting, real-life problems; that teachers should be co-learners, coaches, collaborators, facilitators of learning, and actors rather than targets of curricular reform; and that instruction should follow new research on cognition, multiple intelligences, constructivist learning theory and the like. They also believed that mandating a performance test would move teachers away from reductionistic, basic skills -- drill and practice -- teaching of subjects in isolation, the worksheet curriculum.
As one of them said later, "we saw the potential of doing something special... the chance for the state to break the lock of the norm-referenced testing which was not serving teaching or learning very well."
The Department’s public discourse rarely included the accountability aspects of ASAP in those early days, an omission that fit the reformers’ agenda. Later, one declared:
"[ASAP] was never intended to be that [a high stakes test], never intended to be used as a determiner for whether or not students graduate, never intended to be a determiner of whether or not they go from grade to grade."
Initial reactions of educators to ASAP were positive, for they believed that whatever else the policy entailed, it was preferable to the widely despised ITBS, with its accountability overtones. Opposing voices cited the subjectivity of scoring the performance assessments and feared that schools would de-emphasize basic skills, but their decibel level was nowhere near the level it would attain later.
Against this reform coalition, however, was a legislature that was still more interested in holding teachers’ "feet to the fire" than in changing the way they taught. As its composition became more conservative and Republican over time, one heard more complaints about, for example, the "subjective" scoring of the performance measures and the fact that "anti-business and environmental activist attitudes" had crept into the content of the tests.
Within ADE, the staff was not of one mind about assessment policy, even at the beginning. Bishop herself had always spoken of ASAP as a way to focus teachers’ attention on the state’s curriculum frameworks and students’ attention on mastery of the Essential Skills. Staff and officials concerned with ASAP were organized in two different units, with the reform agenda represented in the ASAP Unit. Meanwhile, the Pupil Achievement Testing Unit, made up of individuals who were experienced with norm-referenced and criterion-referenced assessment (not performance assessment), aligned themselves with the accountability values of the legislature. Two factions -- two different sets of values and priorities and definitions of assessment policy, each privately discounting or talking past the other.
About three years into the development of this program, however, the three key reformers left the department. Remaining staff proved to be less thoroughly grounded in progressive educational principles and practice and less effective in maintaining the direction that ADE initially took. While the ASAP Unit changed faces and voices, the Assessment Unit remained consistent. By that time, the accountability forces within the department had begun to dominate the discourse. More ADE staff time was spent standardizing and monitoring districts’ assessment, reporting, and goal-setting procedures. Superintendent Bishop and the testing department now set the tone.
A critical element of assessment policy was missing if its intent was to change the way teachers teach. Legislation that mandated state tests had failed to authorize funds to provide for teacher training in the principles and practices of performance assessment and of holistic, thematic, and problem-solving instruction and curriculum reform. For a reform of this type, extensive time and training have proved in other states to be necessary (Flexer & Gerstner 1994) (Shepard, Flexer et al. 1995). The report of the Goals for Educational Excellence panel had earlier promised the legislature that the new program would cost no more than the previous testing program, and this efficiency value in the state political culture foreshadowed subsequent problems. Now ADE was in a bind, able to mandate assessment policy but powerless to fund state-wide training of teachers to adapt to it. Some districts with sufficient wealth and officials who were open to the new direction suggested by ASAP invested considerable resources in local capacity development, but these were in the minority. In a state with considerable disparity in taxing ability, the already rich and poor districts reproduced disparities in staff development for ASAP as well.
The pace of transformation of the program away from its reform agenda increased under the press from key legislators to start producing data for accountability purposes, which was of course their definition and intention all along. As an ADE official reported later, the department chose to underplay the technical and administrative problems that had surfaced along with capacity building needs:
"We should have gone back to [the Legislature] and said, ‘we’re going to need some more training money, we need more field test money,’ but the things looked good, they had been sent out to the schools, teachers saying let’s get going, we want to do this. The Legislature was saying let’s get going, let’s get going. At that point what should have happened is we should have said we need two more years. We need another state-wide pilot, we need more of the psychometric people in making sure the thing is ready to go, and we need additional district training budgets so when they come on line with this they could train their teachers. We had underestimated the profound training effects that this would have, clearly underestimated what it would be."
Because it was a political project, ASAP had limited time either to develop the capacities of teachers and schools or to develop sound psychometric instruments. As a result of political pressure, it was necessary for the ADE and the test publisher to produce the various performance test forms in weeks rather than years. Form D-1 was commissioned and administered before all the psychometric and administrative kinks of Form A were worked out; D-2 was commissioned and administered before the characteristics of D-1 were corrected or even known. Nor was there an equating study to show whether Form D could function as an "audit" of Form A.
The notion of using Form D as an "audit" evolved over time, though insiders would claim that they intended that function all along. However, officials early in the Bishop administration suggested that Form D would be used as an efficient monitor of districts’ scoring and reporting of their local (DAP) assessments, which consisted of Forms A, B, and/or C, criterion-referenced measures, or portfolios. Districts could choose which assessment form they wanted, provided it could be scored by the state generic rubric and contingent on ADE approval. Although the word audit implies in the business community that an independent professional has verified that the company has used the proper procedures in its financial statements, ADE operationalized the concept to be the correlation of results of state-administered and scored Form Ds with the DAP assessments, at the individual pupil level.
An ADE official would later recall that the development of D "was done in a fairly shabby way, without adequate field testing." But ADE "didn’t act on this information for a couple of years." An ADE insider at the time agreed:
"Now, the problem there was that the first Form D was used, we tried it out, we reported the results, but Riverside ran a concurrent field test on the form D. The concurrent form D-1 field test was returned to the department in late ‘93, or the fall of ‘93 sometime. And what it said is that the D form didn’t match the A form well enough. But, for whatever reason, the staff of the Department kind of took that report and put it on the shelf. Because what it said is we want to do the D form. Politically, the thing was developing its own momentum down there. Nobody wanted to stop the process, nobody wanted to pull it back. Riverside staff was saying you’ve got to stop this because the D now needs to be revised and re-field tested to be sure that it matches the A that it’s auditing. Wasn’t done. [The report] was shelved.... and D-2 then was commissioned, ... was developed, was not field-tested, and was ready to go as the next state-wide audit."
Nor was there time, regardless of intent, for ADE to seek independent evaluation of the performance assessments or consultation by experts in the incipient technology of performance assessment. During interviews conducted later, the contractors also noted that development time was too short and that the state had overlooked the ramifications of getting the assessments in the field on such a short timeline.
Beyond ignoring the problematic technical data of early versions of the state performance assessments, Superintendent Bishop and other ADE staff reacted defensively to any criticism of ASAP. At a meeting of educators sponsored by ADE and AEA, she reacted to objections about ASAP by warning that if teachers complained too much, the conservative policy actors would likely move to reinstate universal standardized testing. At the time, the complaints seemed quite reasonable and problems for the most part correctable, having to do with glitches in administration, the burden of purchasing test materials, question wording, insufficient time limits, inadequately prepared scorers, vague scoring rubrics, and lack of time and training. The former president of AEA recalled that the superintendent "lost it" when informed of the teachers’ position, however. Open debate over assessment policy did not happen.
In June of 1993, the initial results of ASAP Form D were reported. The newspapers published the results by school and grade level and ranked them in much the same manner as they had always reported the standardized test results. The Arizona Daily Star headlined its report, "Tests say schools are failing." The Superintendent called the results disturbing and distressing, but failed to note the possible technical problems associated with any new test undergoing its maiden voyage. She criticized schools and teachers for not adapting fast enough and for not teaching "the way kids learn." Educators were shocked and dismayed at ADE and media reaction. Many had believed or were led to believe that performance assessments were to have a different function than had standardized tests. Instead, the high-stakes accountability function of ASAP was fully revealed.
Time and political capital had begun to run out for the reform faction in ADE. The State Board of Education, prompted by key legislators, demanded action on the accountability front. The use of ASAP in determining high school graduation became part of assessment policy in January 1994 through the action of state Board of Education rule R7-2-317. A Task Force on Graduation Standards was then appointed to make recommendations about proficiency levels. Its recommendations were later adopted, specifying a level of proficiency for graduation from Grade 12: "A student shall demonstrate competency in reading, writing, mathematics, social studies and science...by attaining a score of 3 or 4 on each question or item of each Form A assessment [of ASAP]... scored with the corresponding Essential Skills (ASAP) generic rubric..." The Task Force had met a number of times and had self-administered the Form A tests for twelfth grade and used the generic rubrics to score them. A member would later report that they considered the rubric scores in terms of percentages, as if the assessment were a competency measure. That is, four was the highest score on the rubric, 3 of 4 amounted to 75%, and a mastery level below 75% would be taken by the public as too lenient. Therefore, a 3 would be the cutoff between mastery and non-mastery, between graduation and non-graduation. There is no evidence that the Task Force examined technical data (for example, standard errors around cut scores) or consulted experts on established procedures for setting cut scores. It specifically ignored the Riverside technical report that warned against the use of Form A for pupil level reporting, let alone accountability. The Task Force report listed its rationale this way:
"After reviewing every question/item on the Form A assessments and sample student responses at each level of the 4 point rubric, it was determined that a score of less than a ‘3’ would not represent an adequate demonstration of competency."
In addition to recommending proficiency levels for graduation, the Task Force also suggested that limited English speakers be allowed to demonstrate their proficiency in their own language and all students who initially fail should be able to take the test again.
By 1995, what schools were attempting to implement was many things to many people. Administering ASAP was more about high stakes accountability, more about standardizing and centralizing education at the state level, and less about progressive reform than one would have predicted from the discourse of 1992. The political nature of this assessment policy was revealed in the press of time, the shifting power balance, and the subversion of balanced debate about where this program was headed. With all that, one may wonder whether any effects were possible.
CONSEQUENCES OF ASAP
Was ASAP effective in achieving its reform aims? The state neither conducted nor commissioned a rigorous evaluation of what happened as a result of ASAP. ADE monitored compliance through the DAPs and conducted a survey of teachers, but one would be hard pressed to call this a serious evaluation. Our independent policy study of the consequences of ASAP, however, tracked implementation and reactions for nearly the complete life of the program. Beginning in 1992, the year in which a pilot administration of Form A was conducted, we studied policy makers as they initiated administration of the program and transformed its initial intents. Then in the first year of Form D administration, we conducted in-depth case studies of schools while they were in the throes of accommodation and reform. In the next year we continued our observation of the case study schools and conducted a representative survey of Arizona educators about response to ASAP.
What we found (Smith 1996) can be summarized as follows: Arizona educators were fully cognizant of ASAP, though they defined it in quite different ways: as a preferable alternative to standardized testing, as a way to change teaching in a constructivist direction, or as just another state mandate. Somewhat less than half of the educators we studied approved of ASAP as they defined it. Much of the disapproval seemed to be the result of the implementation of ASAP Form D (e.g., inadequate time limits or directions, inappropriate item content or scoring rubrics, the use of ASAP scores for high-stakes purposes) rather than the idea of ASAP. Change in curriculum and teaching consistent with ASAP also varied widely and depended on certain local characteristics. For example, if there were adequate financial resources to make the change and knowledgeable personnel to help, if the existing teaching patterns and beliefs were amenable to constructivism, if there was little commitment to standardized achievement testing and traditional education, then change was evident, but not otherwise. The low rate of change can be attributed in part to inadequate professional development. The state failed to provide resources for teacher training, even though the reform implied fundamental changes in teacher knowledge and skill. This left the responsibility in the hands of the districts, many of which were too strapped financially to do anything. Although a few districts devoted impressive resources to develop teachers’ capacities to implement ASAP reform ideals and invested in curricular changes as well, the average amount of relevant professional development reported by teachers across the state was only about 8 hours over a two-year period. Still, there was enormous effort spent by teachers and administrators simply in complying with ASAP testing and reporting requirements. At the least, the evidence points to a widespread increase in students writing extended texts.
Though our surveys and case studies were made available to ADE staff, they never publicly acknowledged the results and we saw no signs that the research had any effect on subsequent decisions. Later on, when we interviewed policy actors subsequent to the demise of ASAP, we found that their interpretations of the consequences of ASAP fit their current agenda and position. For example, an ADE insider during Bishop’s administration said this about what happened as a result of ASAP reform:
"Well, the great strength of it was that it was changing the behavior in the classroom. We were making change with respect to teaching methodology, instruction, technique, those kinds of things. Also materials were being changed, moving from a reliance on rote memorization, teachers primarily engaged in lecture and students repeating back what they heard, to one where students were actually engaged in the application of knowledge. Teachers were engaging the students in the learning process more. In other words, they would have to solve problems, they would have to apply whatever they had learned in the classroom to a real live situation. They would have to actually write."
But a legislator reflecting back on the program noted:
"[There were] constant complaints about the content of the test, about perceptions that the test wasn’t valid, that it disrupted the classroom -- just constant complaints, and zero -- zero [support to keep it going]."
A district administrator noted that the limited consequences of ASAP might be due to the slow pace that schools have of changing anything and the complexity of the particular change involved in moving from basic skills teaching to more thematic, problem-solving, constructivist teaching. Teachers in that district were finally ready for ASAP Form D just at the point when it was suspended:
"They were all ready to go. They were just getting to the point where teachers were taking it seriously and gearing their teaching to it. The first year no one took it seriously, because the district was convinced that ASAP was not going to last, and we didn’t do too well. The second year teachers were just starting to get going, but they didn’t have the understanding to do it well. By the third year they were ready and eager to show what their students could do. But those teachers that modified their teaching will go ahead in that direction. The instructional aspect really had an effect."
However, curricular and teaching change failed to keep pace with political change.
Next – Part 5 – Cast Changes in the Second Act of Assessment Policy History