Accountability, California Style: Counting or Accounting?
by Michael Russell, Jennifer Higgins & Anastasia Raczek - 2004
Across the nation and at nearly all levels of our educational system, efforts to hold schools accountable for student learning dominate strategies for improving the quality of education. At both the national and state level, student testing stands at the center of educational accountability programs, such that schools are effectively held accountable for increases in student test scores. This working definition of accountability contrasts sharply with the formal definition of accountability in which those discharged with duties are expected to provide an account (that is, a description and an explanation) of their duties and conduct and to assist in determining whether said conduct was responsible. This article focuses specifically on the accountability system introduced in California in 1999 and identifies several shortcomings of test-based accountability systems. These shortcomings fall into two broad categories: unrealistic expectations established by the system and failure to identify why schools succeed or fail to meet test-based goals. Based on shortcomings identified in California, as well as limitations of test-based accountability itself, several recommendations for an improved accountability system are offered. Chief among these recommendations is that notions of accountability be expanded such that accountability becomes a two-way process in which all levels from the classroom up to the state should be asked to account for their practices and the impact those practices have on students and their learning.
Across the nation and at nearly all levels of our educational system, efforts to hold schools accountable dominate strategies for improving the quality of education. At both the national and state level, student testing stands at the center of educational accountability programs. This dominance is made clear in President Bush's No Child Left Behind Act of 2001 (Public Law No: 107-110), which requires states to test students in Grades 3-8 in reading and mathematics. Although the president's education policy does not stipulate how states should use test scores, the legislation itself and rhetoric surrounding the legislation exemplify the extent to which many education and political leaders equate educational accountability with student testing.
When accountability systems focus primarily (or exclusively) on test scores, educational accountability becomes defined as requiring schools to improve student test scores from year to year. In this way, the operational definition of accountability in education is based on a single set of student outcome measures, namely changes in test scores, without consideration of school policies and practices, or educational opportunities provided to students. To hold a school accountable one must only count the number of points students' scores change over the course of a year.
As Haney and Raczek (1994) describe, this working definition of accountability differs noticeably from a more formal definition of accountability. According to the Oxford English Dictionary, the word accountability means "the quality of being accountable; liability to give account of, and answer for, discharge of duties or conduct; responsible or amenableness" (p. 65). In this more formal definition, those discharged with duties are expected to provide an account (that is, a description and an explanation) of their duties and conduct, in order to assist in determining whether said conduct was responsible. Where this more formal definition differs most notably from the working definition in education is in the active role the leaders play in telling the story of education in their school(s) and the extent to which this education is responsible to its constituents: students, families, and the community.
This article examines the educational accountability system in the state of California from the perspective that such systems should result in an accounting: informing consumers about what schools are doing and how well. An accountability system that is test based alone is, by definition, a limited one. And California's further reductionist approach of developing a single accountability "index" number does little to inform schools, parents, students, or policy makers about the quality of education or the factors that influence this quality.
In this article, we first describe the accountability system California established in 1999 and then identify several shortcomings of this system. These shortcomings fall into two broad categories: unrealistic expectations established by the system and failure to identify why schools succeed or fail to meet test-based goals. We then explore lessons learned from accountability systems established in other states and examine recent data that provides insight into the extent to which these lessons are emerging in California. We conclude by considering alternative approaches to accountability and present a blueprint for a system that is more likely to provide information that schools and policy makers can use to improve the quality of education in California schools.
CALIFORNIA'S INDEX OF ACCOUNTABILITY: THE API
In contrast to the steady and consistent assessment system in place in California during the 1970s and 1980s, large-scale student assessment in California has been tumultuous during the past decade. Between 1990 and 2000, teachers, students and local communities have been faced with five separate assessment systems. Some of these systems have employed a variety of test instruments that were closely linked to state standards, while others have employed off-the-shelf standardized tests. Some of the systems have given districts latitude in determining what tests to use to assess student learning while others have mandated which instruments must be used.
The system established in 1999, known as the Academic Performance Index (API), grew out of the Standardized Testing and Reporting (STAR) program. Established in 1998, the STAR program replaced district-level programs that employed multiple measures of student learning with a single state-mandated standardized test, the SAT-9. A year later, the Public Schools Accountability Act (PSAA) folded the STAR program into California's current accountability system.
There are three components to PSAA:
The API, an index to measure school performance;
The Immediate Intervention/Underperforming Schools Program (II/USP) to help underperforming schools improve academic performance; and
The Governor's Performance Awards (GPA) program to reward schools for improving academic performance.
At its core, the PSAA legislation led to the creation of the API, which is a numeric index that ranges from a low of 200 to a high of 1,000 and has a target performance level of 800. Schools that neither meet the defined standard of 800 nor demonstrate sufficient annual growth (described later) are not eligible for monetary Governor's Performance Awards, School Site Employee Performance Bonuses, or Staff Performance Incentives. Such schools also can be identified for participation in the II/USP.
The 1999 PSAA legislation, and subsequent amendments, defines in a general way what the API should be and what level of performance is required for a school to be considered successful. However, the process of defining the components of the law to a level of specificity adequate for actual implementation required a committee of expert advisors to interpret the intent of a policy written by legislators and make decisions about operational definitions for five distinct variables embedded in the accountability system. Four of these decisions focused on the calculation of an API:
The selection of indicators of which the API is composed
The relative weight of each chosen indicator
The selection of cut-scores for performance band allocation for indicators
The relative weight for each performance band
The fifth decision focused on the choice of an API target score.
While available documents indicate that decisions about some of these variables were informed by simulations and modeling conducted by members of the technical advisory committee, it is clear that the process of selecting values for this system was performed quickly to ensure that a law approved by the governor in April, 1999 could be implemented by that July.1
As just one example of how limited time led to a questionable policy decision, the PSAA Advisory Committee's final report for the 1999 API opens with four concerns, including a concern about the limitations of the SAT-9 as the sole accountability measure for California. As the committee states,
Reluctantly, the Committee has arrived at the conclusion that for 1999, the API should consist solely of norm-referenced test results from the Stanford 9 administered as part of the Standardized Testing and Reporting (STAR) Program. The Stanford 9, however, has serious limitations as an accountability instrument for California public education. The norm-referenced component of this test is not linked to California content and performance standards. As a result, the Committee advocates that as soon as possible the SBE base the API predominately on measures linked to California's content and performance standards. (CDE, 1999a, p. 2)
One might question whether a system with serious limitations that is unrelated to current educational standards should hold such high stakes for students and schools in California. Clearly, this decision was a matter of judgment and, without time to develop a better-aligned test, was deemed the only alternative.2
CALCULATING AN API: THE PROCEDURE
Generating an API from Stanford 9 test scores requires an arcane calculation process. To calculate the API, individual student scores, in national percentile ranks (NPRs), in each subject area on the SAT-9 are combined into a single number that represents school performance. First, student NPR scores for each subject test are categorized into one of five Performance Bands. Next, the percentage of students scoring within each of the five performance bands is weighted by a different factor. These weighted proportions are combined to produce summary scores for each content area. The performance bands and associated weighting factors are as follows:
Band 1, Far Below Basic: 1-19th NPR, weighting factor 200
Band 2, Below Basic: 20-39th NPR, weighting factor 500
Band 3, Basic: 40-59th NPR, weighting factor 700
Band 4, Proficient: 60-79th NPR, weighting factor 875
Band 5, Advanced: 80-99th NPR, weighting factor 1,000
Results for content areas are then weighted and summed to produce a single number between 200 and 1,000, representing the school's API score (CDE, 2000).3
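The band-weighting step described above can be sketched in a few lines of Python. This is only an illustration, not the CDE's procedure: the function names are hypothetical, and the final step of weighting and summing across content areas is left out because the subject-area weights are not given here. The last line assumes exactly one-fifth of students fall in each band, which is a simplification (Band 1 spans 19 percentile ranks, not 20).

```python
# Performance bands from the list above: (lowest NPR, highest NPR, weighting factor).
BANDS = [
    (1, 19, 200),    # Band 1, Far Below Basic
    (20, 39, 500),   # Band 2, Below Basic
    (40, 59, 700),   # Band 3, Basic
    (60, 79, 875),   # Band 4, Proficient
    (80, 99, 1000),  # Band 5, Advanced
]


def band_weight(npr):
    """Return the weighting factor for one student's national percentile rank."""
    for low, high, weight in BANDS:
        if low <= npr <= high:
            return weight
    raise ValueError(f"NPR must be between 1 and 99, got {npr}")


def content_area_score(nprs):
    """Summary score for one content area: the mean band weight over students.

    This equals the weighted sum of the proportion of students in each band.
    """
    if not nprs:
        raise ValueError("need at least one student score")
    return sum(band_weight(n) for n in nprs) / len(nprs)


def score_from_band_shares(shares):
    """Same summary score, starting from the share of students in each band."""
    if len(shares) != len(BANDS) or abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("expected five band shares that sum to 1")
    return sum(share * weight for share, (_, _, weight) in zip(shares, BANDS))


# Simplifying assumption: one-fifth of students in each band.
national_norm_score = score_from_band_shares([0.2, 0.2, 0.2, 0.2, 0.2])
```

Under the one-fifth-per-band assumption the summary score is 655, the figure cited later in the article for a school whose scores mirror the national norm group.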
USE OF API SCORES
An API score is calculated for each school every year. The current target for each school is an API score of at least 800. This interim target was established by the Advisory Committee for the Public Schools Accountability Act, based on data analyses by the Committee's Technical Design Group. In the Advisory Committee's 1999 Report, the Committee emphasizes that this target is demanding:
These data analyses document exactly how demanding this target of 800 is. For 1999, a target represents an exemplary level of performance that was attained only by a very small percentage of California schools: an estimated eight percent of the elementary schools in the state, six percent of the middle schools, and four percent of the high schools. (CDE 1999a, p. 14)
For those schools that do not meet the interim target of 800, an API Growth Target is calculated. The Growth Target is determined by subtracting a school's current API score from 800 and then multiplying the difference by 5%. The Target is compared with actual change in API the following year. In this way, schools are expected to close the gap between their current performance and the target performance level by 5% each year. For schools that are within 20 points of the target, the API index is expected to grow by at least 1 point (CDE, 2001).
Beyond meeting the 5% growth target, schools whose API score is below 800 are also expected to demonstrate "comparable improvement in academic achievement by all numerically significant ethnic and socio-economically disadvantaged subgroups" (CDE, 1999a, p. 17). For these subgroups, comparable improvement is defined as 80% of the schoolwide growth target. Under these rules, a school is said to have made its target if the API based on all students in the school increases by at least 5% of the gap between its current API and 800, and the API for each numerically significant subgroup increases by at least 4% of that gap (CDE, 2001e).
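As a rough sketch of the growth rules just described, the Python below computes the schoolwide and subgroup targets. The function names are hypothetical, and no rounding is applied because the rounding convention is not stated here. The one-point floor is applied to every school below 800, which matches the "within 20 points" rule because 5% of 20 is 1.

```python
INTERIM_TARGET = 800
GAP_SHARE = 0.05       # schools must close 5% of the gap to 800 each year
MIN_GROWTH = 1.0       # floor for schools within 20 points of the target
SUBGROUP_SHARE = 0.80  # subgroup target is 80% of the schoolwide target


def growth_target(api):
    """Schoolwide API growth target for a school below the interim target."""
    if api >= INTERIM_TARGET:
        return 0.0  # schools at or above 800 have no growth target here
    return max(GAP_SHARE * (INTERIM_TARGET - api), MIN_GROWTH)


def subgroup_target(api):
    """Growth expected of each numerically significant subgroup."""
    return SUBGROUP_SHARE * growth_target(api)


def met_target(api_before, api_after, subgroup_changes):
    """True if schoolwide growth and every subgroup's growth meet their targets."""
    schoolwide_ok = (api_after - api_before) >= growth_target(api_before)
    subgroups_ok = all(
        change >= subgroup_target(api_before) for change in subgroup_changes
    )
    return schoolwide_ok and subgroups_ok
```

For a school with an API of 640, this gives a schoolwide target of 8 points and a subgroup target of 6.4 points.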
ADDING MEASURES TO THE API
Recognizing that the SAT-9 "has serious limitations as an accountability instrument for California public education" because this test "is not linked to California content and performance standards" (CDE, 1999a, p. 2), California Standards Tests (CSTs) are being developed for English Language Arts,4 Mathematics, History-Social Science, Science, Writing, and Coordinated/Integrated Sciences. Unlike the SAT-9, which is a nationally norm-referenced test, the CSTs are criterion referenced and specifically designed to be aligned with state standards. In addition to the content-area CSTs that will be administered in Grades 3-8, end-of-course tests are also being developed for several mathematics, science and social studies courses typically completed in high school. These end-of-course tests are criterion referenced and are being added as components of the API.
Over time, the state plans to fully replace the SAT-9 tests with CST tests. Although it is unclear when this full replacement is scheduled to occur, it will undoubtedly take several more years. But test scores are not the only components under consideration for inclusion in the accountability index. The PSAA legislation mandates that measures such as student and teacher attendance rates and high school graduation rates be incorporated into the API calculation when available, valid, and reliable (CDE, 1999a, p. 4).5
DECISIONS MADE BASED ON API SCORES
In addition to comparing the actual change in each school's API score to its annual growth target, schools are compared with other schools in two ways. First, schools are ranked by deciles, within elementary, middle and high schools. Second, schools are compared to other schools with similar characteristics. These characteristics include pupil mobility, pupil ethnicity, pupil socioeconomic status, percentage of teachers who are fully credentialed, percentage of teachers who hold emergency credentials, percentage of students who are English language learners, average class size per grade level, and whether the school operates multi-track year-round educational programs. A statistical model is used to create an index for similar schools (CDE, 2001). As a result, each school receives four rankings: an overall ranking, a similar school ranking, an overall growth ranking, and a similar school growth ranking. Only the overall ranking and growth ranking are used for official purposes, namely as criteria for II/USP eligibility and for financial awards.6
REASONABLENESS OF API SYSTEM AND EXPECTATIONS
Given the important decisions made about schools and teachers based on API scores, it is important to examine the reasonableness of the API system's expectations. To this end, we examine the reasonableness of three aspects of the API: the percentage of students who must perform above the national average for a school to meet the interim target; the ability of schools to achieve 5% growth each year; and the number of years required for many schools to meet the interim target of 800. Note that while the following analyses focus on the norm-referenced tests comprising the API, the issues described remain germane as CSTs are added to the API system.
REASONABLENESS OF INTERIM TARGET
As the 1999 Advisory Committee Report states, the interim API target of 800 is very demanding. By way of example, an elementary or middle school in which student SAT-9 scores are distributed identically to the national norm group would receive an API of 655. To obtain an API of 800, Rogosa (2000, p. 1) estimates that a little less than three-quarters of the students in the school must exceed the national 50th percentile on each SAT-9 test. At first it may seem reasonable that, on average, students within a school perform above the national average. It also may seem reasonable to expect students to perform above the 60th percentile, on average. Recognize, however, that the 60th percentile is just over a quarter of a standard deviation above the mean. If student performance on the SAT-9 within a California school were distributed identically to the national norm group, this accomplishment would represent an effect size of approximately .25. In education, an effect size of .25 is considered moderate and is often viewed as having important practical significance. However, the distribution of student performance on the SAT-9 in California is noticeably lower than that of the national norm group. As Herman, Brown, and Baker (2000) report, nearly a fifth of California's students are not proficient in English as compared to less than two percent nationwide. This, and other differences in demographics, contributes to performance that is well below the national average. In Grades 2 through 11, mean percentile ranks in 1999 ranged from the 32nd to the 46th percentile on the SAT-9 reading test and from the 44th to the 52nd percentile in mathematics. Given these starting points, the effect size required to move students in California, on average, from where they are now to above the 60th percentile ranges from .20 to .73. Again, an effect size above .2 is considered to be of practical significance, while an effect size of .73 represents an extraordinary change.
As Mosteller (1995) states, "Although effect sizes of the magnitude of 0.1, 0.2, or 0.3 may not seem to be impressive gains for a single individual, for a population they can be quite substantial" (p. 120). While an API target of 800 establishes an admirable goal, its loftiness destines many schools to failure.
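The effect sizes discussed in this section can be approximated by converting percentile ranks to standard-deviation units. The sketch below assumes scores are normally distributed and treats each mean percentile rank as the standing of a typical student. Both are simplifications introduced here for illustration, and the function names are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def z_for_percentile(percentile):
    """Standard-deviation units above (+) or below (-) the mean for a percentile rank."""
    return STANDARD_NORMAL.inv_cdf(percentile / 100)


def effect_size(start_percentile, goal_percentile=60):
    """Shift, in standard deviations, needed to move from start to goal percentile."""
    return z_for_percentile(goal_percentile) - z_for_percentile(start_percentile)


# Starting points taken from the percentile ranks reported in the text.
for start in (50, 52, 46, 44, 32):
    print(f"{start}th -> 60th percentile: effect size {effect_size(start):.2f}")
```

Under these assumptions, the move from the 50th to the 60th percentile is about a quarter of a standard deviation, and the moves from the 52nd and 32nd percentiles come out close to the .20 and .73 endpoints reported above.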
CAN 5% GROWTH BE SUSTAINED EACH YEAR?
California's accountability system requires schools to close the gap between their API and the interim target of 800 by at least 5% each year. Since a school's API is based solely on student test scores, meeting this expectation requires an increase in students' scores. Rogosa (2000) estimates that a school's API score would increase by roughly 8 points if the performance of all students in the school increased by 1 percentile point. Thus, for schools whose current API is at or above 640, a universal increase of one percentile rank on all tests would produce satisfactory growth of at least 5% toward the target. Similarly, a universal increase of two percentile points on all tests would produce satisfactory growth for schools whose API is at 480. And below this, student scores must increase roughly 3 percentile points.
At first blush, Rogosa's estimation creates the impression that a one to three percentile point increase is reasonable. On most of the SAT-9 tests, this growth would be obtained if all students answered one or two more questions correctly. With additional instruction and another opportunity to take a given test, it seems reasonable that a student's score would increase by a few or more points. But the gains students must make are not on the same test. Rather, the gains must be made on the test for the next grade level. While some of the subject matter overlaps across years, additional skills and knowledge are required to perform at the same level from year to year. Although often misinterpreted as showing no growth, percentile ranks that remain the same across years actually represent substantial growth: growth that is identical to that of the average student performing at that level nationwide. And increases in percentile ranks across years represent even more growth than that of the typical student nationwide.
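The arithmetic behind these thresholds is short enough to sketch. The code below takes Rogosa's rough figure of 8 API points per percentile point at face value. The function name is hypothetical, and the one-point minimum for schools near the target is ignored for simplicity.

```python
INTERIM_TARGET = 800
GAP_SHARE = 0.05           # share of the gap a school must close each year
POINTS_PER_PERCENTILE = 8  # Rogosa's rough estimate, as cited in the text


def percentile_gain_needed(api):
    """Universal percentile-point gain that would just meet the growth target."""
    gap = max(INTERIM_TARGET - api, 0)
    required_api_growth = GAP_SHARE * gap
    return required_api_growth / POINTS_PER_PERCENTILE
```

For APIs of 640, 480 and 320, the required universal gains work out to about 1, 2 and 3 percentile points respectively.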
But even if a school was able to improve the performance of its students as they pass through each grade level, these growth expectations ignore the fact that schools need to educate not only those students who have been in their past care, but also a set of new arrivals each year. Each year, two new groups of students enter a given school. One set of entering students comprises the lowest grade level in the school. In a K-5 school, this set of entering students becomes the kindergarten class. Some of these students may come from any number of preschool programs. Others may not have attended preschool at all, while still others may have recently arrived in the United States. A second entering set is composed of students who move into the school during the course of the school year. Some of these students may come from other schools in the same district, from other districts in California, from other states, or from other nations.7
For students entering kindergarten, it is reasonable to assume that the distribution of skills and knowledge is roughly comparable to that of students across the nation.8 Thus, it is reasonable to assume that if a nationally normed standardized test like the SAT-9 were administered to kindergartners in California, at least half of the students would perform below the mean (that is, would receive an NPR score of 50 or lower). Based on past and current performance of California's LEP students on the SAT-9, one might also assume that kindergarten students (or any student newly arrived in California from most other countries) whose primary language is not English would perform considerably below the mean. If these assumptions hold, it is also reasonable to assume that even if the school is extraordinarily successful in improving the performance of these entering groups of students, by the time they are in Grade 2, and eligible for taking the tests comprising the API, a large minority of the students would still be performing below the national mean. The larger the LEP population of these entering groups, the larger the percentage of students still performing below the mean is likely to be. Thus, absent an unusually large effect during the first 2 years of each child's formal schooling, a substantial portion of the second grade class will perform below the national mean. In turn, the "poor" performance of this segment of the second grade must be offset by much higher performance of students in upper grade levels to generate a high API.
To illustrate this situation, Table 1 models the relationship between the percentage of LEP students, the performance of second graders (assumed to be "normally" distributed for all non-LEP students and distributed evenly between the two lowest performance bands for all LEP students), and the performance required of Grade 3-5 students to obtain an API of 800. In a school that contains no LEP students, the performance of grade two students is assumed to reflect that of students across the nation. Thus the API score based only on Grade 2 students would be about 655.9 Given this starting point for Grade 2 students, 15% of students in Grades 3-5 would need to perform between the 40th and 59th percentile rank and 85% of students must perform between the 60th and 79th percentile rank.10
For a school that contains 20% LEP students, the performance of second grade students would be lower (on average), resulting in an API of 594 in this example. To offset the second grade API, 96% of students in Grades 3-5 must perform above the 60th percentile. Clearly, this level of performance is unrealistic. Yet, this unrealistic expectation applies to nearly 50% of schools in California (Russell, 2002).
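The two scenarios above can be approximated with a small model. The sketch assumes equal enrollment in Grade 2 and in each of Grades 3-5, and assumes that upper-grade students fall only in the Basic (700) and Proficient (875) bands. Both assumptions, and the function names, are introduced here for illustration and may differ from how Table 1 was built.

```python
TARGET = 800
NATIONAL_NORM_API = 655    # one-fifth of students in each performance band
LEP_API = (200 + 500) / 2  # LEP students split evenly between the two lowest bands


def grade2_api(lep_share):
    """API based only on Grade 2, given the share of LEP students."""
    return (1 - lep_share) * NATIONAL_NORM_API + lep_share * LEP_API


def required_upper_grade_api(lep_share, upper_grades=3):
    """Average API that Grades 3-5 need for the school to reach the target.

    Assumes equal enrollment in every tested grade.
    """
    total_grades = upper_grades + 1
    return (TARGET * total_grades - grade2_api(lep_share)) / upper_grades


def proficient_share_needed(required_api, basic_weight=700, proficient_weight=875):
    """Share of upper-grade students in the Proficient band if the rest are Basic."""
    return (required_api - basic_weight) / (proficient_weight - basic_weight)


for lep_share in (0.0, 0.2):
    needed = required_upper_grade_api(lep_share)
    share = proficient_share_needed(needed)
    print(f"LEP {lep_share:.0%}: Grade 2 API {grade2_api(lep_share):.0f}, "
          f"Grades 3-5 need {needed:.0f} ({share:.0%} Proficient)")
```

Under these assumptions the model reproduces the Grade 2 API of 594 for a school with 20% LEP students, and yields upper-grade shares of roughly 85% and 96%.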
NUMBER OF YEARS TO OBTAIN INTERIM BENCHMARK
Despite concerns about the reasonableness of achieving 5% improvement every year, a school that moves the required 5% a year toward the interim target is actually making slow progress toward the interim target of 800. For example, a school whose students, on average, perform at the national mean would obtain an API of 655. If that school closed the gap by the minimum 5% each year, it would take 48 years to reach the interim target of 800. Similarly, for the median California high school with an API of 636, it would take 52 years to reach the interim target. And for a low performing school, whose current API is 354, 71 years are required. In other words, if all schools met their growth expectation each year, four to seven generations of students would pass through California's schools before all schools met the interim benchmark.
USE OF API SCORES FOR IMPROVING AND IDENTIFYING PROMISING EDUCATIONAL PRACTICES
Beyond establishing an extremely lofty performance target and ambitious levels of sustained growth for many of California's schools, the API system provides little information that schools can use to identify strengths and weaknesses in their educational programs or to understand why scores have or have not improved. In this section, we describe how the process of aggregating test scores across grade levels and subject areas within each school may mask patterns of performance across and within grade levels and complicate attempts to identify factors that influence student performance. We then explore the meaning of score gains and demonstrate how the current API system does not provide insight into which educational practices may be influencing student learning as measured by standardized tests.
RESULTS OF AGGREGATING ACROSS GRADE LEVELS
As Kane and Staiger (2001) and Haney (2002) describe, aggregating individual test scores to produce a single school score can result in significant volatility in school scores. Examining test scores for several grade levels in North Carolina and Massachusetts, these researchers found that there were dramatic changes in average scores for schools that were based on fewer than 100 students.
California addresses the problem of volatility in two ways. First, scores are aggregated across all grades within a school rather than within each grade level. As a result, even in schools that have relatively small numbers of students in each grade, the total number of students the school API is calculated from is usually larger than 100. Second, for those schools that contain fewer than 100 students, PSAA specifies that an alternative API system will be established.
Although aggregation of scores across grade levels may help decrease the volatility of score changes, it presents at least two additional challenges. First, aggregation across grade levels masks differences in performance and/or gains at different grade levels. As noted above, students in California perform worse on average than students across the nation on the SAT-9. But this underperformance is not uniform across grade levels. Figure 1 indicates that Grades 9-11 perform noticeably worse than all other grades on the SAT-9 Reading test. For Grades 2-8, mean SAT-9 scores differ between grade levels by as much as 6 points on the reading test and 8 points on the math test. And, whereas Grade 3 has the lowest mean NPR among Grades 2 to 8 on the reading test, it is one of the top three scorers in math for those grades; Grade 4 is at the bottom.
Similarly, aggregating test scores masks variations in performance between students. As Russell (2002) demonstrates, several different combinations of student test scores produce an API of 800. As an example, a school in which 75% of students performed at the 99th percentile and 25% performed at the 1st percentile would earn an API of 800. Similarly, a school in which 57.5% of students performed at the 60th percentile and 42.5% performed at the 40th percentile would also obtain an API of 800. An API of 800 would also be obtained for a school in which 57.5% of students performed at the 79th percentile and 42.5% performed at the 59th percentile.
While these examples are extreme and highly improbable, they illustrate two points. First, an API of 800 (or any other number for that matter) can be obtained by many different combinations of scores that may be relatively uniform or may differ dramatically within a given school. Second, beyond indicating that on average students are performing above the 50th percentile, an API of 800 (or any value for that matter) does a poor job characterizing the actual performance of students in a school. In short, the API represents data summarized to such an extent that it no longer conveys useful diagnostic information.
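The example combinations described above can be checked directly against the band weights. The sketch below uses hypothetical names and treats the API as a single weighted sum of band shares, setting aside the weighting across content areas.

```python
# Weighting factors for the five performance bands.
WEIGHTS = {
    "far_below_basic": 200,
    "below_basic": 500,
    "basic": 700,
    "proficient": 875,
    "advanced": 1000,
}


def api_from_shares(shares):
    """API implied by a mapping of band name -> share of students in that band."""
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("band shares must sum to 1")
    return sum(WEIGHTS[band] * share for band, share in shares.items())


# 75% of students at the 99th percentile, 25% at the 1st percentile.
polarized = api_from_shares({"advanced": 0.75, "far_below_basic": 0.25})

# 57.5% in the Proficient band (60th-79th), 42.5% in the Basic band (40th-59th).
clustered = api_from_shares({"proficient": 0.575, "basic": 0.425})
```

Both schools land at essentially the same index (800 and about 800.6) despite very different score distributions. The 60th/40th and 79th/59th percentile examples are the same case here because each pair falls in the same two bands.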
AGGREGATE SCORES AND ECOLOGICAL FALLACY
A second problem with aggregating test scores stems from the difficulty in using aggregates to explain what factors actually caused scores to change. As Haney and Raczek (1994, p. 17) state, "attempting to hold schools or educational programs accountable in terms of aggregate performance of students almost guarantees that accountability in terms of explaining patterns of individual students' learning will be largely impossible."
By way of example, Haney and Raczek recount the work of Robinson (1950) who performed a series of analyses to explore the relationship between race and illiteracy using 1930 U.S. Census data. Depending on the level of aggregation, the correlation between race and illiteracy varied dramatically. When regions of the country were the unit of analysis, the correlation was 0.95. When state averages were calculated and then correlated, the correlation dropped to 0.77. And when individuals were the unit of analysis, the correlation was only 0.20. Thus, depending on the unit of analysis, correlations can vary widely.
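How aggregation can inflate a correlation is easy to see with a toy example. The numbers below are invented purely for illustration and have nothing to do with Robinson's Census data. They show a correlation computed on group averages exceeding the one computed on the individuals who make up those groups.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: three groups, each holding two (x, y) individuals.
groups = [
    [(1, 2), (2, 1)],
    [(3, 4), (4, 3)],
    [(5, 6), (6, 5)],
]

# Correlation with individuals as the unit of analysis.
individuals = [pair for group in groups for pair in group]
r_individual = pearson([x for x, _ in individuals], [y for _, y in individuals])

# Correlation with group averages as the unit of analysis.
group_means = [
    (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g)) for g in groups
]
r_group = pearson([mx for mx, _ in group_means], [my for _, my in group_means])

print(f"individual-level r = {r_individual:.2f}, group-level r = {r_group:.2f}")
```

In this toy case the group means line up perfectly while individuals within each group vary in the opposite direction, so the aggregate correlation overstates the individual-level relationship.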
The ecological fallacy associated with using aggregates to summarize student performance is relevant to the API in at least two ways. First, the focus on school-level performance across grade levels rather than within grade levels or classrooms obfuscates the impact of efforts within these lower level units to improve student learning. Second, although the Similar School Index is not used to inform formal decisions about the success or shortcomings of schools, the focus on school-level performance and characteristics may promote fallacious conclusions about the impacts of school-level programs and the influence other variables have on the success of these programs. While aggregation at the grade or classroom level may be a poor fix for this second problem, it might promote closer examination of practices and issues within these smaller operational units.
MEANING OF SCORE GAINS
It is often assumed that an increase in test scores represents an increase in learning. Over the past decade, however, several studies have suggested that this assumption becomes tenuous when schools are mandated to increase scores on a standardized test administered over several years. As an example, during the 1990s, Kentucky put into place a complex, multiple-measure assessment system known as KIRIS. Between 1992 and 1996, student scores on these assessment instruments increased. Koretz and Barron (1998) performed a series of analyses to examine the validity of these gains. Among their findings was that score gains on KIRIS did not translate to score changes on other related tests. As an example, fourth-grade KIRIS reading scores increased by three-fourths of a standard deviation, but fourth-grade reading scores on NAEP did not change. While math scores on KIRIS and NAEP increased across the four years, the gains on KIRIS were about 3.5 times larger than the gains on NAEP. In addition, Koretz and Barron (1998) noted that the initial score gains for many of the tests were "very large relative to past evidence about large-scale changes in performance, and several were huge" (p. 114). The authors explain that meaningful gains of these magnitudes would be highly unusual, but observed gains of this size are less surprising. It is common to find large gains in mean scores during the first years of administration of a new assessment, in part because of familiarization.
Similarly, Texas has had its accountability system in place for over ten years now. Over this time period, the percentage of students passing the state tests has increased dramatically. However, through a series of analyses, Haney (2000) found:
Little relationship between changes in TAAS scores and high school grades
Large gains on TAAS were not mirrored by changes in scores on the Scholastic Aptitude Test (also known as the SAT)
Gains on NAEP were about one-third the size of gains on TAAS, and when gains on NAEP are adjusted for Texas's unusually large exclusion rate, the gap increases further
Haney also presents considerable evidence that much of the gain in TAAS scores resulted from changes in retention and drop-out rates rather than increases in learning.
Score Gains in California
As Figures 2 and 3 show, SAT-9 scores in California increased between 1999 and 2001. Across all grade levels, the largest gains occurred during the first year. The pattern of gains on the SAT-9 in California is also similar to the pattern in Kentucky during the early years of KIRIS, with the sharpest gains occurring during the first 2 years, after which the gains flatten.
As was the case with the standardized tests in Kentucky and Texas, the sharp increases in California on the SAT-9 do not generalize to NAEP.11 Whereas California's Grade 4 SAT-9 Math scores saw a sharp increase, California's Grade 4 NAEP Math scores increased at about the same rate as those of the nation. And, whereas California's Grade 8 SAT-9 Math scores increased slightly between 1998 and 2001, California's Grade 8 NAEP Math scores decreased slightly between 1996 and 2000 while the national average increased. Thus, whereas California's Grade 4 SAT-9 Math scores suggest that California gained sharply on the nation, California's Grade 4 NAEP Math scores suggest that the gain was negligible. And whereas California's Grade 8 SAT-9 Math scores suggest that California gained on the nation, California's Grade 8 NAEP Math scores suggest that the gap between the state and the nation actually increased.
Later, we present data that may begin to explain some of the causes of these early increases and of the lack of transfer to NAEP. Not surprisingly, these causes are similar to those associated with score gains in Kentucky and Texas and include a focus on test-taking skills, teaching to the test, and increased retention in some schools.
IMPACT OF PSAA/API ON SCHOOL AND CLASSROOM PRACTICES
State educational leaders establish test-based accountability systems both to motivate educators to focus on specific types of learning and to improve student learning (Shepard, 1990). The intent to influence instructional practices was implicit in the API Framework developed by the State Board of Education, which states, "The API must strive to the greatest extent to measure content, skills, and competencies that can be taught and learned in school and that reflect the state standards" (CDE, 1999b). Clearly, by emphasizing that the content, skills, and competencies tested must be teachable, the Board anticipated that schools and teachers would endeavor to teach them.
But beyond influencing what and, potentially, how teachers teach, state accountability programs that rely heavily (or entirely) on test results can have negative consequences. Among the negative consequences, Herman et al. (2000) list increases in retention, increases in dropout rates, narrowing of the curriculum, and decreased attention to topics and subjects not tested. As Herman et al. (2000) describe, the dropout rate "is of interest in itself, but also to assure that schools are not achieving higher test scores at the cost of more children leaving the system" (p. 9). There is ample evidence that this unintended outcome is occurring in other states (Haney, 2000). Some observers have also noted that high-stakes testing programs lead to questionable educational practices such as focusing instruction on test-taking skills, falsely classifying poor-performing students as SPED so that their scores are excluded from averages, altering test administration conditions, providing inappropriate instruction during testing, and, in some extreme cases, altering student response sheets.
EMERGING PATTERNS IN CALIFORNIA
As the Advisory Committee specifically states: "A major priority of the accountability system must be to identify, evaluate, and mitigate unintended consequences" (CDE, 1999b, p. 3). In this section, we examine some of the consequences, both positive and negative, that are emerging in California. The analyses that follow are based on an examination of responses from a random sample of 433 California teachers to a survey administered in the late winter of 2001 to teachers across the nation by the National Board on Educational Testing and Public Policy (Pedulla, Abrams, Madaus, Russell, Ramos, & Miao, 2003).12 Specifically, we use data from this survey to explore the potential impact of the testing program on four broad areas: alignment of instruction to the standards and the test, changed emphasis on tested and non-tested subjects, preparation for tests, and conduct of questionable educational practices.13
Alignment of Instruction to State Standards and Tests
Sixty-two percent of California teachers agree that their district's curriculum is aligned with the test. Fewer teachers appear to be designing classroom tests that have the same content (40%) or the same format (34%) as the state tests. Fewer than half of the teachers believe that the tests are not compatible with their daily instruction (43%) or their instructional materials (40%). Perhaps most importantly, 73% of the teachers believe that the testing program is leading some teachers to teach in ways that are not consistent with what they believe is good practice.
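Because these percentages come from a sample of 433 teachers, each carries some sampling uncertainty. The sketch below computes an approximate 95% margin of error for two of the reported figures. It assumes simple random sampling and that all 433 teachers answered each item; the survey report may use weights or item-level response counts that would change these values, so the output should be read as a rough guide only.

```python
import math


def margin_of_error(proportion, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation z * sqrt(p * (1 - p) / n), which
    assumes a simple random sample of size n.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(proportion * (1.0 - proportion) / n)


SAMPLE_SIZE = 433  # number of California teachers in the sample

# Two percentages reported in the text; the labels are shorthand.
reported = {
    "district curriculum aligned with test": 0.62,
    "testing leads to teaching contrary to good practice": 0.73,
}

for label, p in reported.items():
    moe = margin_of_error(p, SAMPLE_SIZE)
    print(f"{label}: {p:.0%} +/- {moe * 100:.1f} percentage points")
```

Under these assumptions the margins are on the order of four to five percentage points, so differences of only a few points between items should not be over-interpreted.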
Changed Emphasis on Tested and Nontested Subjects
Many teachers report that the amount of time spent on several activities has changed in response to the state testing program. Eighty percent of teachers report that the amount of instructional time on tested subjects has increased, while 58% report that instruction in non-tested areas has decreased. Teachers indicate that instruction in the fine arts, physical education, and foreign language has decreased. Finally, 28% of the teachers report that teachers in their school do not use computers when teaching writing because the state-mandated writing test is handwritten.
Preparation for State Tests
Teachers in California employ a variety of methods to prepare students for the state tests. The most common practices include teaching specific test-taking skills (87%), encouraging students to work and prepare for the test (76%), and teaching the standards (69%). In addition, teachers provide students with items similar to those on the test (64%) and/or use preparation materials developed by someone outside of the school (55%). Surprisingly, 8.5% reported providing students with released items (this is surprising because the items are not released) while 6.8% indicated that they did not provide any special preparation for the test.
Conduct of Practices of Questionable Educational Value
Beyond specific preparation for the test, teachers indicate that the testing program is impacting the atmosphere within schools and classrooms. Over three-quarters of teachers believe that students are under intense pressure to perform well on the tests, but just over half of the teachers believe that their students feel destined to do poorly on the test no matter how hard they try. Only 7% of teachers believe that the tests are motivating students who were previously unmotivated. Similarly, pressures within schools have led nearly two-thirds of teachers to focus solely on test preparation. In many cases, teachers believe this preparation has led to practices that improve test scores but not learning. And a third of teachers report that retention has increased because of the tests.
MOVING BEYOND TEST SCORES
Although student test scores have become the predominant form of educational accountability in most states, we believe that a sole focus on test scores is a seriously flawed approach to helping schools improve teaching and learning. As Amrein and Berliner (2002), Koretz and Barron (1998), Haney (2000), and several other researchers have shown, test-based educational accountability systems create as many problems as they solve, if not more. While scores on the state tests often increase (creating the illusion of improved learning), more often than not, this "improved" learning does not translate to other tests. While these test-based systems are intended to better prepare students for college and the workplace, they often do not lead to increases in SAT or ACT scores, increases in the percentage of students attending college, increases in college completion, or improvements in college readiness. To the contrary, evidence is beginning to emerge that adequate performance on some state tests does not indicate that all of the students who achieve it are prepared for a promising future (McNeil & Valenzuela, 2001).
Why is test-based accountability failing? The answers are numerous. A sole focus on changes in test scores takes attention away from quality, well-rounded instruction across the disciplines and instead focuses instruction narrowly on what is tested.14 This intense focus of instruction on what is tested might not be as problematic if tests tested everything that was important for students to know. But they do not: Time does not permit them to do so, nor is it necessary to test everyone on everything.
Even more important, the single-minded focus on short-term changes in test scores provides little incentive for schools to improve their practices or to better serve students' long-term educational and social needs. Beyond taking instructional time away from non-tested areas, there is clear evidence that schools are engaging in questionable practices to improve test scores. In the worst cases, these practices include outright cheating (see Minutes, Superintendent Advisory Committee PSAA, May 25, 2000 and Philip Spears Deposition, pp. 171-262). In other cases, the practices are more subtle, such as reclassifying students as special needs so that they are not tested or their scores are not included in the school average, encouraging students to be absent on the day of testing, retaining students, and/or counseling students to seek other avenues for education such as a GED or attending an alternative school. In short, the result of test-based accountability is less about how to improve student learning by improving the conditions that affect learning. Instead, the focus is on obtaining prescribed changes in test scores.
Reducing these problems requires an improved accountability system. To improve the current system, the types of information considered by the system must be expanded to include inputs as well as outputs. As an example, there is a clear relationship between the percentage of emergency-credentialed teachers in a California school and that school's API score (Russell, 2002). As seen in Figure 4, 47% of schools in which fewer than half of the teachers are fully credentialed have API scores in the lowest decile, while only 7% of these schools have API scores above the 5th decile.
Given the relationship between emergency-credentialed teachers and API scores, one first step toward improving the performance of students is to replace emergency-credentialed teachers with teachers who are fully credentialed. Including a measure of the percentage of emergency-credentialed teachers in a school in California's accountability system would provide an important piece of information, and benchmarks for a desirable level of emergency-credentialed teachers could be established (most likely, 0%). But teacher quality is only one of many inputs that may be in need of improvement. Others include adequate textbooks, curricular materials, access to current technology, classrooms and schools that are not overcrowded, sanitary conditions, an environment conducive to learning, and so on.
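The kind of tabulation summarized in Figure 4 can be sketched as follows. The school records are hypothetical values invented for illustration and do not reproduce the 47% and 7% figures from Russell (2002); the 50% cutoff mirrors the grouping described above.

```python
# Hypothetical records: (percent of teachers fully credentialed, API decile 1-10).
schools = [
    (35, 1), (42, 1), (48, 2), (45, 6), (30, 1), (49, 3),
    (88, 7), (95, 9), (72, 4), (91, 10), (60, 5), (85, 8),
]


def decile_shares(records, credential_cutoff=50):
    """Summarize API deciles for schools below a credentialing cutoff.

    Returns (share in the lowest decile, share above the 5th decile)
    among schools whose percentage of fully credentialed teachers is
    below credential_cutoff.
    """
    deciles = [decile for pct, decile in records if pct < credential_cutoff]
    if not deciles:
        raise ValueError("no schools fall below the cutoff")
    total = len(deciles)
    lowest = sum(1 for d in deciles if d == 1) / total
    above_fifth = sum(1 for d in deciles if d > 5) / total
    return lowest, above_fifth


lowest_share, above_fifth_share = decile_shares(schools)
print(f"Lowest decile: {lowest_share:.0%}")
print(f"Above 5th decile: {above_fifth_share:.0%}")
```

Reporting an input measure such as this alongside API scores is one way an accountability system could make the relationship between conditions and outcomes visible for each school.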
BLUEPRINT FOR CALIFORNIA
To provide schools, constituents, funding agencies, and policy makers with a more thorough understanding of the impacts of school-based programs on student learning, a more comprehensive accountability system is needed. To better satisfy the needs of schools and their constituents and to overcome the shortcomings discussed earlier, we believe that comprehensive accountability systems must meet the following criteria:
Provide relevant and timely information that schools can use to examine the impact their programs have on a wide spectrum of student learning
Focus both on inputs and on outputs
Collect more valid and authentic measures of student achievement15
Implement a statewide student level data system
Be sensitive to local context
Increase the responsibility of teachers and school-leaders for accounting for educational practices and their outcomes
The state explicitly states that the API-based accountability system should include a range of outcome variables including scores from tests that are aligned with the state frameworks, graduation rates, and student and teacher attendance rates (CDE, 1999a). The CDE also states, "The API is part of an overall accountability system that must include comprehensive information which incorporates contextual and background indicators beyond those required by law" (CDE, 1999b). Despite these proclamations, the API-based accountability system currently relies solely on test scores (several of which are poorly aligned with the state frameworks). As described by the CDE (1999b), a truly comprehensive accountability system would ask schools to describe the programs and practices they have in place, the appropriateness of these programs and practices given specific context and background indicators, and the impacts these programs have on a variety of student outcomes. Programs and practices might include but should not be limited to:
Access to quality teachers (e.g., student-teacher ratios, percentage of teachers with emergency credentials, percentage with master's degree or higher)
Access to books, textbooks and other learning materials (e.g., ratio of library books to students, ratio of course specific textbooks to students, ratio of students to computers, ratio of students to Internet-accessible computers)
Type of school calendar (e.g., multitrack year-round schools; schools operating under the concept 6 model)
Availability of appropriate instructional materials and specially trained teachers for ELL students
Adequacy of school facilities (e.g., overcrowding, access to sanitary facilities, ratio of students to functioning toilets, ratio of contaminated classrooms to total classrooms, availability of functional heating and cooling systems, presence of lead paint)
Subject area curricular materials used (e.g., math curriculum/textbooks, science curriculum/textbooks)
Availability of Advanced Placement courses (e.g., number of courses offered, number of sections available)
Professional development opportunities (e.g., topics focused on during professional development, number of hours offered, number of hours taken, percentage of faculty participating)
Student outcomes might include but should not be limited to:
Performance on tests closely aligned with the state frameworks
Course-taking patterns (higher vs. lower level mathematics, AP courses)
Percentage of students completing all courses required for university admission
Percentage of students taking college entrance exams
To be clear, simply recording data for each of these variables would be a vast improvement over the current system. Yet without requiring schools to actively describe the effects their inputs have on these outputs, identify potential problem areas, and establish short- and long-term goals, the educational benefits of accountability will not be fully realized.
Moreover, the goals set through this process should not be limited to changes in outcomes. Given that inputs affect outcomes and that at times it is the inputs that must be altered before outcomes are impacted, schools must be allowed and encouraged to set goals that focus first on the inputs. As an example, given the correlation between emergency-credentialed teachers (ECTs) and API scores, an interim goal in schools with high percentages of ECTs should focus on decreasing the percentage of ECTs (ideally to 0%) rather than on increasing students' test scores. Only after significant progress toward this interim goal has been made should attention shift to changes in test scores.
As the system is reformed, educational and political leaders should also consider who is actually being held accountable and for what. As an example, we know that quality teachers, quality curricular materials, and quality facilities all result in positive impacts on student learning. But to what degree does a teacher, a school, or a district have control over each of these factors? Districts, the city or town, and the state in which schools operate have control over funds that impact facilities. Is it reasonable, then, to hold schools accountable for quality facilities? While local school leaders often have much say in the hiring process, they do not control the amount of funding available to pay salaries. Nor do they have control over the locations in which they operate. Is it reasonable to hold schools solely accountable for the quality of their teachers? Clearly schools should be asked to provide accounts for the conditions (physical and instructional) that they provide for students. When the school's account makes it clear that the conditions are in need of improvement, all those entities that can impact those conditions should also be expected to provide accounts of the actions they take and how those actions impact the conditions. Depending on the conditions in need of improvement, this responsibility may extend from the individual school, through the district, and up to the state. In short, an effective accountability system should not focus on a single level within the educational system. Instead, accountability should be a two-way process in which all levels from the classroom up to the state should be asked to account for their practices and the impact those practices have on students and their learning.
THE ROLE OF TESTS IN AN ACCOUNTABILITY SYSTEM
While reconsidering accountability systems, it is important to consider whether tests serve as a signaling effect or as an outcome measure. As a signaling effect, tests are used to identify schools and/or districts in which poor conditions (physical and/or instructional) may exist. In such cases, poor test performance may prompt an inquiry into the conditions that may be impacting student performance. Once identified, actions may be taken to improve these conditions. Conversely, when test scores are used as an outcome measure, schools first identify conditions for improvement. Once these conditions are improved, the test scores are used to examine the impact these actions had on student learning.
Currently, California's API-based system more closely resembles a signaling effect than an outcome measure. Test scores are used to target schools that are performing poorly. Once targeted, a school becomes eligible for funding that supports an investigation into conditions that may be negatively impacting student performance. The schools are then expected to remedy these conditions, but the extent to which the conditions are actually remedied is never examined.
Two problems arise when test scores are used as a signal that conditions are in need of improvement. First, this approach assumes that if test performance is acceptable, then the conditions must be good. As described earlier, test scores often improve due to practices of questionable educational value. In other cases, test scores are good due to conditions outside of the school (most often in schools serving students with high socio-economic status) and in spite of poor conditions or practices within the school. Using test scores to identify schools in need of improvement overlooks many schools that appear to perform well but have conditions that are in need of improvement. Second, since the way to avoid being targeted is to have "acceptable" test scores, schools can be attracted to remedies that improve test scores without improving conditions. As an example, schools that have invested in technology to improve student writing, research and other high-order skills may begin using the software to drill students on topics included on the test and/or may decrease the amount of student writing performed on the computer because the tests are administered on paper (Russell & Haney, 2000).
Finally, using test scores to signal potential problems puts schools, districts, and the state educational agency on a mission to discover what is already known: conditions matter. Given the large body of evidence that quality teachers, quality instruction, quality instructional materials, and quality facilities have a positive impact on student learning, it would be more efficient to ask all schools to examine their current conditions, identify those that are in need of improvement, and then hold the school, district, and state educational leaders accountable for improving those conditions. In this context, test scores might be used to examine the impact the improved conditions have on students and their learning. In turn, these conditions and their apparent impacts would be documented through public accounts provided by each school.
Without question, implementing a two-way accountability system that asks schools to identify areas for improvement, requires school, district and state leaders to support such improvements, and requires educational leaders at all levels to provide accounts of their actions and the effects of those actions is substantially more difficult and expensive than simply asking schools to improve test scores. However, such a system would not only embrace the full meaning of accountability, that is, providing an account of one's actions; it would also promote the information gathering, reflection, and action that are more likely to improve the quality of education. In essence, comprehensive accountability requires educational leaders at all levels of the system to decrease the focus on changes in test scores and increase the focus on taking responsible actions that improve instruction and learning.
REFERENCES
Amrein, A. L., & Berliner, D. C. (2002). High-stakes testing, uncertainty, and student learning. Education Policy Analysis Archives, 10(18). Retrieved January 15, 2003, from http://epaa.asu.edu/epaa/v10n18/
California Department of Education. (1999a). The 1999 Base Year Academic Performance Index (API): The Report of the Advisory Committee for the Public Schools Accountability Act of 1999. Sacramento, CA: Author.
California Department of Education. (1999b). Framework for the Academic Performance Index. Retrieved January 15, 2003, from http://www.cde.ca.gov/psaa/board/june
California Department of Education. (2000). Key elements of Senate Bill IX (Chapter 3 of 1999), Public Schools Accountability Act of 1999. Retrieved August 31, 2004, from http://www.cde.ca.gov/psaa/keyelements.pdf
California Department of Education. (2001). Explanatory notes for the 2001 Academic Performance Index Base Report. Retrieved August 31, 2004, from http://www.cde.ca.gov/psaa/api/api0102/base/expn01b.pdf
Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Commission on Instructionally Supportive Assessment. (2001). Building tests to support instruction and accountability: A guide for policymakers. Retrieved October 7, 2003, from http://www.aasa.org/issues_and_insights/assessment/Building_Tests.pdf
Haney, W. (2000). The myth of the Texas miracle in education. Education Policy Analysis Archives, 8(41). Retrieved February 7, 2002, from http://epaa.asu.edu/epaa/v8n41/
Haney, W. (2002). Lake Woebeguaranteed: Misuse of test scores in Massachusetts, Part I. Education Policy Analysis Archives, 10(24). Retrieved August 31, 2004, from http://epaa.asu.edu/epaa/v10n24/
Haney, W., & Raczek, A. (1994). Surmounting outcomes accountability in education. Unpublished manuscript.
Herman, J. L., Brown, R. S., & Baker, E. L. (2000). Student assessment and student achievement in the California public school system (CSE Technical Report). Los Angeles, CA: Center for the Study of Evaluation, Center for Research on Evaluation, Standards, and Student Testing.
Kane, T. J., & Staiger, D. O. (2001). Improving school accountability measures. Cambridge, MA: National Bureau of Economic Research.
Koretz, D. M., & Barron, S. I. (1998). The validity of gains in scores on the Kentucky Instructional Results Information System (KIRIS). Santa Monica, CA: RAND.
Linn, R. L., Baker, E. L., & Dunbar, S. B. (1991). Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher, 20(8), 15-21.
McNeil, L., & Valenzuela, A. (2001). The harmful impact of the TAAS system of testing in Texas: Beneath the accountability rhetoric. In M. Kornhaber & G. Orfield (Eds.), Raising standards or raising barriers? Inequality and high stakes testing in public education (pp. 127-150). New York: Century Foundation.
Mosteller, F. (1995). The Tennessee study of class size in the early school grades. The Future of Children, 5(2), 113-127.
Oxford English Dictionary. (1971). Oxford, UK: Oxford University Press.
Pedulla, J., Abrams, L., Madaus, G., Russell, M., Ramos, M., & Miao, J. (2003). Perceived effects of state-mandated testing programs on teaching and learning: Findings from a national survey of teachers. Chestnut Hill, MA: National Board on Educational Testing and Public Policy, Boston College.
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15, 351-357.
Rogosa, D. (2000). Interpretive notes for the Academic Performance Index. Retrieved February 6, 2002, from http://www.cde.ca.gov/psaa/api/fallapi/apnotes.pdf
Russell, M., & Haney, W. (2000). Bridging the gap between testing and technology in schools. Education Policy Analysis Archives, 8(19). Retrieved March 10, 2002, from http://epaa.asu.edu/epaa/v8n19.html
Shepard, L. (1990). Inflating test score gains: Is the problem old norms or teaching the test. Educational Measurement: Issues and Practice, 15-22.
Wenglinsky, H. (2002). How schools matter: The link between teacher classroom practices and student academic performance. Education Policy Analysis Archives, 10(12). Retrieved February 7, 2002, from http://epaa.asu.edu/epaa/v10n12/
The White House. (2002). Fact sheet: No Child Left Behind Act. Retrieved January 9, 2002, from http://www.whitehouse.gov/news/releases/2002/01/print/20020208.html