Applying the Joint Committee's 1994 Standards in International Contexts: A Case Study of Education Evaluations in Bangladesh
by Madhabi Chatterji - 2005
This case study examines the applicability of the 1994 standards, offered by the Joint Committee on Standards for Educational Evaluation, to evaluations conducted in international contexts. The work is undertaken in response to an open invitation from the Joint Committee in its 1994 publication. The article addresses two purposes. First, it asks whether the standards in the four broad areas (utility, feasibility, propriety, and accuracy) can be applied as written to guide and monitor evaluation practices in developing countries when the programmatic focus and evaluation models, including relationships among sponsors, program participants, stakeholders, and evaluators, vary significantly from the assumptions underlying the 1994 standards. Second, it develops and refines methods for conducting metaevaluations of international evaluations by analyzing documentary and interview-based data from one case, represented by a series of connected studies on education and health literacy programs in Bangladesh. The findings set the stage for more informed discussions on the robustness of existing standards and on the need for continuing case studies toward generating a revised or new set of standards for international evaluations in diverse fields, programs, and policy areas. The 1994 standards are presently undergoing revision by a task force of the American Evaluation Association.
The most current version of The Program Evaluation Standards, developed by a committee chaired by James R. Sanders (Sanders, 1994), ensued from a pioneering project in 1975 that aimed to develop guidelines for conducting useful, feasible, ethical, and sound evaluations of educational programs. Since the 1970s, the field of evaluation practice has rapidly expanded beyond educational programs to include institutions, programs, and services in health, rehabilitation, emergency management, economic development, and other community- and nation-building efforts. Institutions, programs, and services in both public and private sectors now serve as objects of evaluations. Sponsors for programs and evaluations vary greatly, including government, nongovernment, and private agencies or foundations that cross national boundaries. In terms of geography, projects expand their reach beyond North American settings to both industrialized and developing nations.
With respect to evaluation practices, evaluators now employ a diverse array of evaluation models, or approaches to information gathering, monitoring, and supporting program development efforts. In operation, evaluation models are characterized not simply by conceptual considerations and methodological characteristics but also by the roles played by evaluation researchers and the relationships they share with program sponsors, program participants, and other stakeholders (for visions of different approaches, see Fetterman, 2001b; Scriven, 2001; Torres & Preskill, 2001; for taxonomies of evaluation models, see Stufflebeam, 1999; Worthen, Sanders, & Fitzpatrick, 1997). When evaluation researchers (referred to as evaluators henceforward) are inclusive and participatory in their orientation, their roles and responsibilities often blur with those of the program sponsors, delivery personnel, and participants. Inclusive evaluation models, for example, Fetterman's empowerment evaluation approach (2001a), often envision stakeholders themselves assuming the roles of evaluators as a program evolves. In more traditional evaluation models, in contrast, external evaluators retain the design and implementation authority in evaluation processes and tend to use discipline-based research methods stemming from quantitative or qualitative traditions of inquiry. In the former approach, evaluations are guided chiefly by formative purposes, typically aiming toward supporting program development; in the latter, the evaluation aims can be either formative or summative, with a greater emphasis on formal program monitoring, judgments of worth, and accountability.
Yet another category of work, referred to as pseudoevaluations by Stufflebeam (1999), is represented in efforts driven by public relations or political motives; these reports rely heavily on informally gathered, subjective, and sometimes massaged data, with consultants essentially serving as program advocates rather than objective evaluators.
The globalization of the evaluation profession and the rapidity with which the field is evolving have generated much discussion on standards, guidelines, and ethics pertinent to evaluation practices. Leaders in the evaluation field have articulated visions for standards-driven professional practice (Sanders, 2001). At a 1995 conference in Vancouver, Canada, delegates from 50 countries gathered to discuss and acknowledge the international nature of the evaluation profession, implying a need for sound evaluation practices worldwide. Articles in recent issues of the American Journal of Evaluation (see, for example, Hopson, 2001) discussed evaluations in international contexts, particularly debating the necessity for establishing separate standards and guidelines to accommodate cultural and political diversities in developing versus more developed countries. In particular, Hopson's paper brought to light discussions about the inadequacies inherent in the current standards (Sanders, 1994) on topics of democracy, inclusiveness, and social justice as they influence evaluations conducted in Africa. Debates on the need to establish different guidelines in African countries such as Namibia have occurred over the past decade and have led to proposals for establishing a set of African evaluation standards. Currently, several industrialized countries have developed their own standards to guide regional program evaluations (e.g., the United Kingdom, Germany, Switzerland, and Japan).
In this article, I explore the applicability of the present version of the Program Evaluation Standards, 2nd edition (Sanders, 1994), to evaluations conducted in international contexts, focusing particularly on education and health literacy programs. I ask whether the standards in the four broad areas (utility, feasibility, accuracy, and propriety) are written in ways that will permit metaevaluation of evaluations conducted in a developing country undergoing intensive nation-building efforts. I use the term metaevaluation to refer to a structured review of the quality of evaluation practices that are reflected in a defined body of evaluation work, using the standards as a guiding framework. Because the standards originated in the United States, how well these guidelines can help monitor and guide evaluation practices in environments unlike those in Western developed countries is a question that the professional community must address.
The present study, representing the first case in a collective case study (after Stake, 1997, p. 404), responds to an invitation extended to users of the 1994 standards to identify conflicts or limitations in applying the guidelines to evaluations in a variety of fields and contexts. My conclusions set the stage for more informed discussions on the robustness of the existing standards and in evaluating whether a need exists at all for revising or developing additional guidelines for international evaluations of programs and services inside and outside the realm of education.
An additional aim of the article is to develop methods for analysis of diverse evaluation cases from multiple fields in international contexts, particularly developing nations, leading to a collective case study on the applicability of the standards. Although the study of a single case, with all its idiosyncrasies, acquaints a researcher with unique and atypical factors that begin and end with that case, a collective case analysis approach involving in-depth examinations of a series of cases pertinent to a particular problem, policy, or phenomenon can effectively aid in revealing patterns and consistencies that are more typical and generalizable (after Stake, 1997). In this study, I begin by acknowledging the limitations of a single-case approach for addressing the main purpose of the study but simultaneously use the present case both to derive useful case analysis methods and as a springboard to pursue a longer term research agenda. Once effective methods are developed for conducting metaevaluations, it becomes possible to pursue the first question posed in a more systematic and consistent manner, that is, to examine 10 or more evaluation cases across an array of fields and settings so that comprehensive recommendations can be made to the Joint Committee for possible changes to the standards should such a need be indicated. In sum, the first case study in my broader research plan forms the focus of the present article. It deals with evaluations of primary education programs in Bangladesh. Future implications for conducting metaevaluations of like and unlike evaluation cases are considered in the conclusion.
To begin, I summarize the substantive nature of the 1994 standards and submit a formal argument to initiate the present work. Part of my rationale for beginning this line of inquiry stems from my personal experiences as a university instructor in evaluation theory and methods. In my courses, students select evaluations in a program or policy area of their interest and, using the standards, critically evaluate practices evidenced in reports. International students routinely give me anecdotal feedback on difficulties they face with program evaluations in developing nations, a fact that prompted me to examine a few reports more in depth.
There is also currently some difference of opinion in the evaluation community as to whether the existing standards should be examined at all. As the previously cited literature suggests, some have raised questions on the generalizability of the 1994 standards (Hopson, 2001). Patton (2001) also recently addressed the need for guidelines to structure the currently popular lesson learning mania evidenced in worldwide evaluation practices of education, health, and economic development programs. Others, however, feel no need to reexamine or modify them. At the 2002 conference of the American Evaluation Association, I approached a leader in the International Evaluation-Topical Interest Group (TIG) to inquire into the TIG's work on the issue. I learned that standards-related questions have been raised over the past several years, but the overall consensus is that the existing standards work well. Such differences make it necessary to illustrate the diversity in evaluation models and methods found in international evaluations and the nature of the challenges that could potentially be faced in applying the current standards to different studies.1
THE PROGRAM EVALUATION STANDARDS: A SUMMARY OF THEIR PURPOSES AND APPLICATIONS
The 1994 standards resulted from a collaborative effort among 16 professional associations in the United States, including the American Educational Research Association, the American Psychological Association, and the American Evaluation Association. They consist of 30 standards for appraising evaluation practices in four broad domains: utility, feasibility, propriety, and accuracy.
The term evaluation is defined in the text and glossary of the Standards as "the systematic investigation of the worth or merit of an object" (see Sanders, 1994, p. 3). The document clarifies that evaluations could be conducted on established and ongoing programs, institutions, or service areas, or on projects with more limited time frames. The standards are intended to serve as (a) guides for designing and carrying out sound evaluations and stimulating the use of evaluation findings in appropriate ways; (b) resources for teaching clients/stakeholders about the purposes for evaluations and what they can expect from evaluative efforts; (c) a framework for conducting metaevaluations, or appraisals of the quality of evaluation practices in given projects and programs; (d) resources in proposal development for developing and evaluating new programs or projects; and (e) guiding criteria for assessments of evaluator knowledge and credibility.
The 30 standards are not discussed individually here; the general intent of each group of guidelines is described as follows in the 1994 publication.
1. Utility standards (U1-U7) are intended to ensure that an evaluation will serve the information needs of the intended audience and users. These standards deal with identification of relevant stakeholders, formulation of evaluation questions to address stakeholder information needs, and the usability, clarity, and timeliness of the reports for stakeholders and clients. Evaluation impact is also addressed under the utility standards.
2. Feasibility standards (F1-F3) are intended to ensure that an evaluation is designed and conducted in a manner that is prudent, practical, diplomatic, and cost effective. These standards acknowledge the social and political context in which social programs and institutions reside; they stipulate that evaluations be conducted in politically viable ways.
3. Propriety standards (P1-P8) are intended to ensure that evaluations are conducted legally, ethically, and with due regard for the welfare of those involved in the evaluation, as well as those affected by the results. These standards deal with respecting the rights of human subjects, compliance with agreements about the confidentiality of information gathered, the appropriate release of results, and so on.
4. Accuracy standards (A1-A12) deal with methodological rigor and technical adequacy of information on the product, program, institution, or service area that is evaluated. They are intended to ensure that quantitative and qualitative procedures employed are credible and that the information gathered, analyzed, and conveyed about various aspects of the program is technically defensible.
The text provides the reader with an extensive array of illustrative examples of violations of particular standards and discusses the repercussions of these violations for the quality of evaluation practices and evaluator credibility. These instructive cases are valuable guides for users. Their content leads logically to the committee's definition that evaluation is "the systematic investigation of the worth or merit of an object" (Sanders, 1994, p. 3).
Simultaneously, the narrative and illustrative cases of the Standards suggest a particular value orientation of the Joint Committee: specifically, that evaluations mainly involve judgments of worth of the object evaluated and hence are mainly summative in orientation; that they are conducted by external rather than internal evaluators; that evaluator roles are distinct and separate from program participants or stakeholders at large; and that evaluators possess formal training as methodologists. The document also suggests that any plan for delivering evaluation services should have a metaevaluation (A12) built into it. The Standards offers a checklist to guide metaevaluative efforts, with a four-category rating scale that can be applied to each standard, as follows:
• This standard was addressed.
• This standard was partially addressed.
• This standard was not addressed.
• This standard was not applicable.
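As an illustration only, the checklist could be kept as a simple record in which each of the 30 standards receives one of the four ratings. The sketch below is not part of the Standards or of this study's instruments; the names `Rating`, `STANDARDS`, and `tally_by_domain` are hypothetical, and the example ratings are invented to show the mechanics.

```python
from collections import Counter
from enum import Enum


class Rating(Enum):
    """The four checklist categories listed above."""
    ADDRESSED = "addressed"
    PARTIALLY_ADDRESSED = "partially addressed"
    NOT_ADDRESSED = "not addressed"
    NOT_APPLICABLE = "not applicable"


# Number of standards per domain, as summarized earlier:
# U1-U7, F1-F3, P1-P8, A1-A12 (30 in all).
DOMAIN_SIZES = {"U": 7, "F": 3, "P": 8, "A": 12}
STANDARDS = [f"{d}{i}" for d, n in DOMAIN_SIZES.items() for i in range(1, n + 1)]


def tally_by_domain(ratings):
    """Return (per-domain rating counts, list of standards not yet rated)."""
    unknown = [code for code in ratings if code not in STANDARDS]
    if unknown:
        raise ValueError(f"Unknown standard codes: {unknown}")
    counts = {d: Counter() for d in DOMAIN_SIZES}
    for code, rating in ratings.items():
        counts[code[0]][rating] += 1
    unrated = [code for code in STANDARDS if code not in ratings]
    return counts, unrated


# Hypothetical ratings for illustration only; not findings from any case.
example = {"U1": Rating.PARTIALLY_ADDRESSED, "A12": Rating.NOT_ADDRESSED}
counts, unrated = tally_by_domain(example)
print(counts["U"][Rating.PARTIALLY_ADDRESSED], len(unrated))
```

Keeping unrated standards separate from those rated "not applicable" preserves the distinction between a standard that was judged irrelevant and one that has not yet been appraised.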
HOW INTERNATIONAL EVALUATIONS DEPART FROM ASSUMPTIONS MADE IN THE 1994 STANDARDS
I now present a few examples of international evaluations that could present ambiguities and pose challenges in using the current standards. This section does not present formal case analyses; it simply attempts to illustrate the need for more systematic inquiry into the applicability of the standards with international evaluations in developing countries.
The cases in this section were informally but purposively selected to reflect common types of international evaluations in the developing world that obviously fall outside the boundaries of the present standards because of their focus on programs outside the field of education. All instances emerged from my graduate-level courses in which students engage in metaevaluations of published evaluations. My examples include evaluation reports by members of the World Bank's Operations Evaluation Department (OED), focusing mainly on economic development efforts, and an evaluation of a community-based health and rehabilitation services project in Afghanistan, supported by the United Nations (Boyce, 1999).
OED Studies and Research Reports
The World Bank is recognized as the world's largest lender in the health, nutrition, and population sector to the developing world. Simultaneously, its financial lending activities support the areas of industry, technology, agriculture, environment, and macro- and microeconomic development on small and large countrywide projects. The OED, staffed by qualified economists, researchers, and evaluators, reports to the executive directors of the World Bank, publishing large numbers of formal reports on programs, policies, and issues pertinent to the Bank's lending operations.
OED studies span a variety of evaluation models and methods but mainly use economic development and outcome indicators. In the OED's 2002 publications catalog, one finds comprehensive assessments of the overall effectiveness of the Bank's development programs in various countries (Hanna, 2000) and reports describing lessons learned in more limited sectors of activity (Stout & Johnston, 1998).
Superficial reviews of reports suggest that there may be three areas of conflict in using the standards with OED studies. First, OED studies rarely have a singular focus on education products, programs, or institutions, the primary focus of the 1994 standards. Second, the stakeholder focus of OED studies is more limited than what appears to be expected in the standards' guidelines. The utility standards suggest that all potential stakeholders' needs should be accounted for when designing evaluations. OED studies tend to be deliberately designed to serve decision-making needs of the World Bank's management and operations. Although characterized as an independent entity in Bank publications, the OED serves as an arm of the World Bank; the results from its studies appear to be directed more toward the World Bank's leadership than local governments, client countries, or stakeholders at large in regions that receive funds. The biggest departure from the assumptions underlying the standards, however, lies in the fact that many World Bank efforts include an Evaluation Capacity Development (ECB) component (see Mackay, 2002). ECB efforts tend to alter roles and responsibilities of World Bank representatives, evaluators, and project participants/deliverers in host countries as they engage local stakeholders in evaluation activities and in improving use of evaluation results. ECB has been recently promoted by the evaluation community at large as a means for improving use of evaluations. Yet, published standards for sound evaluation practices fail to directly address the criteria for evaluating long-term evaluations that incorporate ECB components. Depending on the type of evaluation report, then, the standards can be applied to greater or lesser degrees to the OED studies.
Comprehensive Disabled Afghans Program
The Afghan project evaluation (Boyce, 1999), conducted by a single North American university-based researcher, has some parallels with the World Bank evaluations but departs even further from assumptions underlying the 1994 standards. The evaluator visited the project for a short period solely for the purpose of the evaluation. The object of the evaluation was the Comprehensive Disabled Afghans Program (CDAP), established in 1995 as a joint United Nations Development Program (UNDP) and United Nations Office of Project Services (UNOPS) interagency initiative in Afghanistan. The program was supported by multiple donors. At the time of the evaluation, it had six implementing-partner nongovernment organizations (NGOs) operating in different regions of the country.
CDAP's objective was to provide rehabilitation, mobility devices, and other social services through employment support, orthopedic workshops, physiotherapy workshops, special education, and health awareness programs. The target population consisted of persons with disabilities, including landmine survivors. The purpose of the CDAP evaluation was characterized as lesson learning and process assessment, with an intent to alter the delivery model as field conditions changed. The evaluator's visit occurred as CDAP services were being provided to disabled individuals in a client country. In intent, the evaluation appeared to be one component of a much broader effort to create an effective system of care, networks, partnerships, and support that would foster follow-up work among the donor agencies and the client country, Afghanistan.
The CDAP report suggests that the evaluator was serving as a representative of the sponsor or donor agencies during the visit, and he describes his expert observations through visits and interviews with various project participants. In the report, the evaluator outlines potentially important program goals, processes, and contextual variables, offering an action model for CDAP services based on his assessment of local conditions and resources. The actual methods used for data gathering and compilation are not detailed in the report but can be inferred. In some respects, the evaluation appears to follow an inclusive, participant-oriented model (after Worthen et al., 1997) in that the evaluation seemed to be a component of the larger service delivery project; hence, the evaluator could be simultaneously viewed as a service deliverer and program designer rather than an external evaluator. In other respects, the evaluation reflects application of an expertise-oriented model (after Worthen et al.) in which an outside consultant, with knowledge of and experience with physical therapy and rehabilitation programs, appears to have provided an expert's opinion to donor agencies on how well the project was going, based on a brief visit, and recommended modifications to the program design. Thus, in terms of purpose, the evaluation was a formative one in which the main aim was program development. In more specific terms, the work was simultaneously a context evaluation (an examination of variables in the setting where the program's services were to be delivered) and a process evaluation (monitoring of service delivery and program processes), leading together to a determination of new needs for designing a modified service delivery and use plan.
The evaluation approach, role of the consultant, and the type of report that it generated made it difficult to appraise the CDAP evaluation with the 1994 standards as a guide, even on a superficial level. In a very general sense, an ambiguity arose because this was a needs assessment-cum-process evaluation unmotivated by a need to judge the overall merit or worth of the CDAP. Further, multiple project sponsors, donor agencies, and participants were engaged in different aspects of service delivery or evaluation. One could thus ask whose work should be held to the standards in such projects: the evaluator, sponsors, donors, or local NGOs, or all these groups?
There were other specific areas of conflict as well. For example, the utility standards are currently formulated with an emphasis on meeting information needs of all stakeholders (see U1). Given the nature of the CDAP, donor-target country relationship, and the evaluation purposes, one could ask if compliance with the first utility standard is relevant for lesson-learning evaluations of context and process variables in early phases of international development projects. One could extend that question to also ask whether all seven utility standards need full or partial compliance when the evaluation purposes are lesson-learning.
Similarly, to meet the feasibility standards, one has to assume that evaluation is clearly separated from program design and delivery, where there are no political tensions or conflicts as individuals switch roles from being evaluators to program personnel. When evaluation serves as a component of service design in a broad range of areas, one may wonder which feasibility standards would apply and how. Likewise, the propriety standards assume that the evaluator has obtained formal consent, and often legal agreements with clients, to proceed with an evaluation. One may question how the propriety standards would be applied when it is the sponsors and donor agencies who establish relationships with a client country, along with local and national governing bodies.
Finally, the accuracy standards call for a clear description of quantitative and qualitative procedures employed by the evaluator in reports. This raises questions as to whether the accuracy standards are potentially violated if a lesson-learning report does not offer descriptions of the purpose and procedures as methodically as do many empirical research reports of a summative outcome evaluation. What criteria should be used for judging the quality of lesson-learning context or process evaluations, in which the roles of evaluators, clients, sponsors, and implementers cross over? This is a question that is not at all well addressed in the 1994 standards.
It is reasonable that large-scale service provision efforts include both needs assessments and process/service monitoring phases in the overall evaluation design. Thus, there must be some acknowledgment of this reality in current professional guidelines for evaluation practice. Further, when the lines between members of NGOs, evaluators from multievaluation agencies, and multiple donor organizations are blurred, a consistent set of guidelines appears to be necessary for both evaluation researchers and the others to follow.
In sum, even a cursory review of a small number of international evaluations suggested a need for more systematic inquiry on the utility of the 1994 standards as guidelines. I began that line of inquiry with an evaluation case study from Bangladesh.
My methods for examining the selected evaluation case follow, with attention to the criteria I used for case selection, the number and types of data sources examined, and frameworks for data organization and analysis.
Four theoretical or methodologically grounded criteria derived from the preceding review of international evaluations were identified for case selection. The Bangladesh evaluation case met all of the criteria set. The criteria were that:
1. The evaluations are conducted in international contexts, preferably in developing countries.
2. There is enough available information, through documentary or other sources, to allow a systematic appraisal of the evaluation practices with respect to a majority of the 30 standards, if not all. (Thus, if the available data were scant, and it led to too many "cannot determine" or "not applicable" conclusions during analysis, the case was dropped.)
3. The evaluations are conducted in the context of larger nation- or community-building efforts.
4. Multiple donors, sponsors, deliverers, evaluators, and participants, either external or local, are involved in the project or program.
The pool from which the Bangladeshi case was selected included seven NGO-, U.N.-, or World Bank-sponsored reports, including those discussed in the previous section. The World Bank reports were among those published in its annual report for 1999 by the Operations Evaluation Department (Hanna, 2000). The choice of the Bangladeshi evaluations was mainly based on the accessibility of the lead evaluator for interviews within the time frame of the study and the availability of sufficient documentary information to begin the first case analysis (he was visiting the United States and had copies of necessary documents).
DATA SOURCES AND ANALYSIS
To conduct the metaevaluation, I compiled documentary and interview-based data on the selected evaluation case. The data were subjected to a two-stage analysis. First, published evaluation reports and associated materials in print were analyzed using the criteria in Appendix A as a framework. Next, the principal investigator/evaluator of the studies was interviewed to obtain deeper understandings of the limitations and usefulness of the standards in evaluating programs in non-Western developing countries. The interview protocol was built exclusively around the 30 standards and designed to iron out areas of ambiguity that surfaced during documentary analysis.
Organizing Documentary Data
Appendix A, the protocol for analyzing documentary data sources, enabled a classification of the data with reference to 14 variables, including type of service area or project, geographic context and development status of the nation, field conditions, the sponsoring/commissioning agencies, whether evaluators were internal or external, evaluator training and background, the evaluation purposes, and the predominant evaluation models and research designs reflected in the work. Additionally, one could conduct a preliminary analysis of how well the body of work as a whole appeared to have complied with the standards in the four broad areas and in areas in which the data posed problems or conflicts in applying individual standards.
A few words are necessary to clarify the classification schemes on evaluation purposes, models, and research designs. To classify studies in terms of evaluation purposes, I attempted to identify whether the work was formative (oriented toward conceptualizing, planning, modifying, or improving programs, projects, or services), summative (oriented toward judgments of worth, effectiveness, or merit), or some combination of the two. A more specific categorization helped identify the evaluation models, defined here by the specific evaluation questions or issues that evaluators pursued. Studies could be placed in the following categories with respect to evaluation models:
Context evaluations. Evaluation questions are mainly concerned with variables in the context in which a program, project, or service area is situated and how these affect program design or delivery. Needs assessments fall in this category.
Input evaluations. Evaluation questions are mainly concerned with input variables, such as how resources are allocated to implement a program, project, or service area.
Process evaluations. Evaluation questions are mainly concerned with process variables, such as monitoring how well a program, project, or service is being implemented.
Outcome evaluations. Evaluation questions are mainly concerned with determining the impact of a program, project, or service area, or the extent to which the desired outcomes have been achieved with target populations.
Systemic evaluations. Evaluation questions examine the functioning of a program, project, or organizational system as a whole, focusing on all or some combination of context, input, process, and outcome variables and their interdependencies.
Lesson learning. Subjective and relatively informal appraisals of a program, project, or service area based on brief site visits, generally early in the program's history.
Pseudoevaluation. Subjective appraisals of a program, project, or service area by advocates, mainly for public relations and promotion purposes.
Finally, with respect to specific kinds of research design that evaluators employed, studies could be categorized as qualitative, quantitative, or a combination of the two. All the above taxonomies used for classification of reports were previously tested with a sample of evaluations on standards-based reforms in the United States. Definitions and their sources are provided in greater detail in Chatterji (2002).
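The three classification schemes just described (purpose, evaluation model, research design) can be sketched as a small data structure. This is a hedged illustration, not the coding instrument in Appendix A or in Chatterji (2002); the class and field names are hypothetical, and the example record is invented.

```python
from dataclasses import dataclass
from enum import Enum


class Purpose(Enum):
    FORMATIVE = "formative"
    SUMMATIVE = "summative"
    COMBINED = "combination of the two"


class Model(Enum):
    CONTEXT = "context evaluation"
    INPUT = "input evaluation"
    PROCESS = "process evaluation"
    OUTCOME = "outcome evaluation"
    SYSTEMIC = "systemic evaluation"
    LESSON_LEARNING = "lesson learning"
    PSEUDOEVALUATION = "pseudoevaluation"


class Design(Enum):
    QUALITATIVE = "qualitative"
    QUANTITATIVE = "quantitative"
    COMBINED = "combination of the two"


@dataclass(frozen=True)
class StudyClassification:
    """One study's placement on the three schemes (hypothetical structure)."""
    label: str
    purpose: Purpose
    models: frozenset  # one or more Model members; a study may fit several
    design: Design

    def __post_init__(self):
        if not self.models:
            raise ValueError("A study needs at least one evaluation model.")
        if not all(isinstance(m, Model) for m in self.models):
            raise TypeError("models must contain only Model members.")


# Invented example record, for illustration only.
study = StudyClassification(
    label="hypothetical report",
    purpose=Purpose.FORMATIVE,
    models=frozenset({Model.CONTEXT, Model.PROCESS}),
    design=Design.QUALITATIVE,
)
print(sorted(m.value for m in study.models))
```

Allowing a set of models rather than a single value reflects the observation that one evaluation can be, for example, both a context and a process evaluation.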
Gathering and Organizing Interview Data
The second framework for analysis, Appendix B, consists of a list of guiding questions and probes pertinent to individual standards. The interview questions were intended to elicit particular types of information on those standards that were difficult to evaluate solely on the basis of the documentary data. The process made evident that the accuracy standards could be applied reasonably and with ease through documentary reviews; however, information on how well practices met the utility, feasibility, and propriety standards was best obtained through interviews because there was rarely sufficient information in the reports to allow reasonable assessments.
The documentary data sources for the study were Chowdhury, Ziegahn, Haque, Shrestha, & Ahmed, 1994; Chowdhury, Nath, & Choudhury, 2002; Chowdhury, Nath, Choudhury, & Ahmed, 2002; and Chowdhury, Choudhury, Nath, Ahmed, & Alam, 2001. All available documents from the lead researcher that were written in English were used for the analysis.
In addition, two informal interviews, one in person and the other by telephone, were conducted with the lead researcher, Dr. A. M. R. Chowdhury, between September and October 2002 to obtain details of different aspects of the work of the Bangladesh Rural Advancement Committee (BRAC). Detailed interview transcripts are available from the author.
CASE STUDY RESULTS
Table 1 summarizes the results of the documentary analysis for the Bangladesh case. Although the work was conducted in a country categorized as a least developed country by the United Nations and World Bank, the case was unique in some respects and contrasted quite sharply with the international evaluations discussed in earlier sections of this article. Some of the differences were as follows.
1. A primary focus of the work was on education programs, although health literacy and life skills were examined.
2. Rather than being based on a brief site visit by foreign evaluators/researchers, the studies were sustained over time and led by a group of indigenous Bangladeshi researchers, all members of a local NGO involved in social reform projects from the early 1970s.
3. The evaluation effort lasted for a long period of time (about 10 years), was funded by multiple international donors, and is continuing today. The shape and form of the project have changed.
4. Throughout, a trained research and evaluation team held authority over designing the evaluation studies, gathering the data, and executing the work.
DETAILED DESCRIPTION OF THE CASE
Led by a team of Bangladeshi researchers from BRAC and in collaboration with academics at the University of Dhaka, various international sponsors, and eventually, government policymakers/representatives, a series of national evaluations was set in motion between 1990 and 2000 to assess and monitor the status of primary education in Bangladesh. BRAC is a local NGO. It emerged from the 1971 national war for freedom, which resulted in the establishment of Bangladesh as an independent republic. Since then, BRAC has been committed to comprehensive rural development programs in Bangladesh, including relief, economic development, and environmental, health, education, and research programs. BRAC's work is mainly supported by internal funds, with 20% of its budget contributed by international donors, including OXFAM, members of the European Union (such as the British and Dutch governments), and UNICEF. UNICEF funded the first methodological study in education by BRAC, which led to an ongoing program of evaluation research. BRAC employs 56,000 staff members. Its researchers are well schooled, often with graduate and advanced graduate degrees from reputable universities in the United States and the United Kingdom. BRAC's resource pool is adequate for inviting international consultants when needed; one study included a researcher from a U.S. university. Table 1 summarizes the defining characteristics of the BRAC evaluations.
Evaluation Purposes and Models
The BRAC evaluations, now labeled as the Education Watch studies, span a decade. The nature and purposes of the studies changed considerably over time. According to the lead researcher, the original purposes for the evaluations were to study the effectiveness and quality of primary education programs, an unambiguously summative purpose (Chowdhury, personal communication). The evaluation was stimulated by the outcome of the World Conference on Education for All (WCEFA) held in Jomtien, Thailand, which encouraged participating nations to make a commitment toward improving levels of basic education in their regions. BRAC's leader asserted that he was up to the challenge; consequently, the research and evaluation division of BRAC assumed the responsibility of developing the methodology and implementing a long-term program of research to first evaluate the quality of existing programs and then to monitor basic education in formal and informal settings, including government and nongovernment schools, and to identify new educational needs. An early publication (Chowdhury et al., 1994) described pilot efforts in establishing a sample survey methodology and instrumentation to assess children's knowledge of basic reading, writing, arithmetic, and life skills at a national level. By 2001, however, the Education Watch reports showed that the original evaluation purposes had expanded considerably to include the following (Chowdhury, Nath, & Choudhury, 2002; Chowdhury, Nath, Choudhury, & Ahmed, 2002):
• Assessments of internal efficiency of schooling (namely, cost-effectiveness studies examining costs versus quality of primary education programs)
• Examination of a large number of school-level indicators, such as enrollment rates, dropouts, attendance, retention, teacher qualifications and training, student-teacher ratios, gender differences, and basic literacy levels
In sum, the specific evaluation questions suggested that the evaluations were now more systemic in orientation, with clearly formative purposes. Looked at another way, the early BRAC studies employed traditional models of evaluation consistent with characterizations of the objectives-oriented evaluation found in the literature (after Worthen et al., 1997). Here, information gathering on a national scale was triggered by internationally set goals to raise literacy and education levels of peoples in developing and least developed countries.
Toward the end of the decade, however, studies reflected more management-oriented models (after Worthen et al., 1997). Studies were now designed to generate ongoing information to support the planning, monitoring, and ongoing development of a nationwide education system, encompassing both government and nongovernment schools. Eventually, therefore, the education agenda set by the ruling national government of Bangladesh, with a broader stakeholder group, appeared to guide the evaluation designs. In all cases, the information was channeled back to the upper levels of national and regional leadership and to the international donor agencies that sponsored the BRAC evaluations.
The BRAC studies were nonexperimental in design. There were no comparisons attempted between different educational interventions, no controls instituted, and no variables manipulated. The studies were formal large-scale sample surveys; the outcome of interest consisted of achievement in basic reading, writing, arithmetic, life, and health literacy skills. Achievement was measured with an interview-based tool, including written and oral components.
The conceptualization, design, and conduct of all phases of the research remained in the control of the BRAC research and evaluation division and the field workers it trained.
DID UTILITY STANDARDS APPLY?
The utility standards, as written, could be applied quite unambiguously to examine the BRAC studies. BRAC's early reports recognized the need to generate information and feed it back to planners (Chowdhury et al., 1994, p. 366). Reports released in later years (by 2002) were documented to have received wide attention from various stakeholders including policymakers (Chowdhury, Nath, & Choudhury, 2002, p. xxvii). As the evaluation purposes of the BRAC studies changed over time, so did the clients and stakeholders, and the evaluator-stakeholder relationships.
In the first study (Chowdhury et al., 1994), the group that set the evaluation purposes, formulated the questions for the large-scale survey, and directed the overall research agenda consisted primarily of BRAC representatives, their international sponsors, and local university academics. Although BRAC researchers recognized that their primary target audience was actually the Bangladeshi government (Chowdhury, personal communication), the design authority for the study was retained by the lead research group. The purposes of the first pilot study were to develop and refine a research methodology (Chowdhury et al.). Implicit in BRAC's approach, however, was a long-term agenda for reaching their primary audience and consumer: the ruling Bangladeshi government. In the first interview, the lead researcher admitted that initially, BRAC was not too successful in gaining the government's attention to the results. There was a sense of complacency because the government had been doing its own work to address the gender gaps in education; there was also some resistance to allowing international involvement. It was the bureaucracy and its members that were viewed as more resistant; they would not acknowledge or mention the report at meetings. When a three-page report was released by BRAC, the government initially responded with a long list of criticisms of the work (Chowdhury, personal communication).
The picture had changed considerably by the time the third Education Watch report was published. Ten years had passed. The advisory group for the research (and contributors to the report) now included top government representatives, such as advisors to the president, the government of Bangladesh, and advisors from the Campaign for Popular Education, a nongovernmental initiative for which the Education Watch reports generated formative information. This committee also included a member of the national press. The new prime minister highlighted educational goals, and BRAC took advantage of the PM's support for education (Chowdhury, personal communication) to broaden their evaluation design. National dissemination of the Education Watch reports was carefully planned via press conferences and private television stations. In my second interview focusing on the Watch studies, the lead researcher stated that BRAC saw themselves as being ultimately responsible to the people of Bangladesh (Chowdhury, personal communication).
The language of the reports and the verbal descriptions from the principal researcher suggested, however, that the BRAC reports were initially generated for consumption by regional leadership groups rather than local school personnel, teachers, or the population at large. Schools and school leaders were not viewed as the primary audiences; they did not receive the results in any formal way. Reports, however, were published in multiple forms (short and long, technical and nontechnical, in Bengali and in English) and look comparable to polished publications for literate audiences in industrialized countries.
DID FEASIBILITY STANDARDS APPLY?
Feasibility guidelines dealing with the costs, practicality, and political viability of evaluations could likewise be applied without major conflicts or confusion. In the interview focusing on the first BRAC study, the lead researcher acknowledged that the initial development costs for the BRAC research program were high, but the long-term pay-off was worth the investment (Chowdhury, personal communication). One U.S. researcher was invited as a consultant in the early phases; this entailed a cost. BRAC researchers also went abroad to the United States or the United Kingdom for training. Over time, the skills of the research team were reported to have greatly improved. Thus, although costs were incurred, there appeared to be a long-term agenda for building capacity, awareness, and education levels of evaluators and local stakeholders.
DID PROPRIETY STANDARDS APPLY?
Propriety standards deal mainly with formal agreements, consent, and respect for human subjects. Here, it appeared that a rigid application of the Standards would result in revealing shortcomings of the BRAC evaluations that were rooted in the limited infrastructures of the country where the work was done. As of now, Bangladesh does not have the legal and subject protection infrastructure with which researchers can formally comply. Thus, the propriety standards, as written now, set expectations that do not fully account for the limitations in nations such as Bangladesh.
In districts and localities that were sampled, the BRAC studies involved surveys of students aged 9-12, school personnel, and teachers. According to the lead researcher, the BRAC field workers, who are all Bangladeshis, are trained to obtain informal verbal consent from respondents. No coercion was used. No rewards or incentives were given to participants, other than free pencils and erasers that they used to respond to surveys. People hardly refuse, according to the lead evaluator; if a question threatens them, they tend to distort or falsify the information. This happened on questions relating to income, for example, when respondents were afraid that they would be taxed if they revealed their actual earnings. Field researchers in BRAC are trained to interact with survey respondents in their own language and in nonthreatening ways (Chowdhury, personal communication).
DID THE ACCURACY STANDARDS APPLY?
Because of the clear documentation of methods and design protocols in all the reports examined (some of which included publications in international journals), the accuracy standards could be applied quite easily to the BRAC evaluations.
The BRAC studies showed attention to systematic research and methodological issues that meet Western academic standards. The sampling design (three-stage cluster samples), the instrument development and pilot work, the use of supplementary data collection tools (such as checklists), and the field researcher training protocols were all developed and refined over time. Their reports document in detail the national context of their work, design phases, and methodological refinements. Results are reported mainly in the form of descriptive statistics and graphs. In terms of reporting quality, BRAC publications resemble documents published by the National Center for Education Statistics in the United States, which also describe the condition of education.
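For readers unfamiliar with the term, the general idea of a three-stage cluster sample can be sketched as follows. This is a generic illustration, not a reconstruction of BRAC's actual design: the reports cited here are not quoted on what the three stages were, so the unit names, sample sizes, simple random selection at each stage, and the toy sampling frame are all assumptions.

```python
import random


def three_stage_cluster_sample(frame, n_stage1, n_stage2, n_stage3, seed=0):
    """Draw a simple three-stage cluster sample.

    frame maps each stage-1 unit to a dict that maps each stage-2 unit
    to a list of stage-3 units (for example, individual respondents).
    Selection at every stage is simple random sampling without
    replacement; real designs often use probability proportional to size.
    """
    rng = random.Random(seed)  # seeded so the draw is reproducible
    sample = []
    stage1_units = rng.sample(sorted(frame), min(n_stage1, len(frame)))
    for u1 in stage1_units:
        stage2 = frame[u1]
        for u2 in rng.sample(sorted(stage2), min(n_stage2, len(stage2))):
            stage3 = stage2[u2]
            for u3 in rng.sample(stage3, min(n_stage3, len(stage3))):
                sample.append((u1, u2, u3))
    return sample


# Toy frame with hypothetical unit names: 4 areas x 5 clusters x 10 respondents.
frame = {
    f"area{a}": {
        f"area{a}-cluster{c}": [f"area{a}-cluster{c}-resp{r}" for r in range(10)]
        for c in range(5)
    }
    for a in range(4)
}

sample = three_stage_cluster_sample(frame, n_stage1=2, n_stage2=3, n_stage3=4, seed=42)
print(len(sample))  # 2 areas x 3 clusters x 4 respondents = 24
```

Sampling in stages keeps fieldwork concentrated in a manageable number of locations, at the cost of larger sampling error than a simple random sample of the same size.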
Over time, BRAC evaluations appear to have acquired the rare status of becoming a part of systemic educational change in Bangladesh. What might have made the long-term evaluations politically viable in this case is the nationality of BRAC's research team. They were insiders who were able to sustain their effort over a long period of time and win the confidence and buy-in of key stakeholder groups, including politicians with national influence. The lead researcher attributed the impact of the BRAC studies to a larger nation-building context in which the work occurred: "It is true that the main work is done at BRAC, but we work as a part of a civil society initiative called Education Watch, which is coordinated by Campaign for Popular Education . . . the broad base of the Watch is largely responsible for the (studies) gaining wide credibility" (Chowdhury, personal communication). Because of BRAC's history in social work and reform in Bangladesh, trust issues were more limited than they might have been had outside evaluators conducted the work.
DID THE OVERALL STANDARDS FRAMEWORK APPLY?
During interviews, the lead BRAC researcher admitted to never having seen the 1994 standards. Despite the reported lack of awareness, the evaluators' approach to conducting the studies showed attention to most of the criteria that the standards outline. The only area in which accommodations seemed necessary pertained to the propriety standards. An inflexible and somewhat blind application of the propriety standards may have revealed that BRAC studies were noncompliant in attaining formal consent of human subjects. The violations, if any, were rather marginal, and evaluators made reasonable adaptations to the standards to maximize ethical practices locally. In terms of possible revisions to the 1994 standards, one might identify a need to broaden the language of the standards to make them more adaptable to existing conditions and infrastructures in a developing country such as Bangladesh. Keeping aside the limitations of a single-case approach in making broader generalizations, the overall 1994 standards framework appeared to be quite applicable to the Bangladesh case.
The maturity of a profession depends on professional accountability and establishment of usable guidelines for appropriately trained professionals and researchers to follow. That is the spirit in which the 1994 standards were written. The improvement and expansion of the standards where necessary must likewise be undertaken by the professional and scholarly community as the boundaries of evaluation practice expand.
In conducting critical analyses of evaluations, one could attempt to fit the standards to evaluations (examine how well agreed-upon standards function with different evaluations) or fit evaluations to the standards (examine the extent to which evaluations are compliant with accepted professional standards of practice). This study attempted to do the former. The main question dealt with whether the 1994 standards, as written, were useful in guiding and monitoring the quality of diverse types of evaluations in international contexts.
With respect to the usefulness of the standards to evaluations in developing countries, two conclusions seem to be supported by the data from the BRAC case, acknowledging that it is only the first case in a series of forthcoming case analyses.
1. The 1994 standards make certain assumptions about the type of evaluation (defined as evaluation model in this article), the evaluation purposes, the training of evaluators, and the relationships between evaluators, sponsors, and recipients of programs/services. When the assumptions are met, even in developing countries (as in Bangladesh), the standards tend to apply quite well.
2. Most research and evaluation methodologists trained in Western countries tend to comply with the standards, or make reasonable adaptations to the standards even when they are not aware that they exist.
WHERE THE 1994 STANDARDS ARE LIKELY TO WORK
That the framework given in the Standards appeared to apply quite well to the BRAC evaluations was possibly because the studies had an education focus, consistent with the main thrust of the 1994 standards. Another influential factor appeared to be the education and training of the research and evaluation team; the BRAC team consisted of educated and trained research methodologists from reputable international schools. Their training facilitated an automatic compliance with a majority of the 1994 standards.
The ease in applying the 1994 standards to the Bangladesh case was also facilitated by BRAC's choice of formal designs and more traditional evaluation models. The language of the 1994 standards and the examples of standards violations incorporated therein all suggest that the Joint Committee started with certain assumptions about evaluator roles, functions, and training. As discussed before, the document's content suggests that most evaluations are expected to be conducted by external evaluators (that is, evaluators who are clearly distinct in their roles from the program sponsors, participants, and stakeholders), an assumption that was met with the BRAC evaluations. Likewise, although the presentation of accuracy standards acknowledges that evaluators could potentially gather both quantitative and qualitative data, the text appears to expect formalized studies that only well-trained methodologists can deliver. Again, this assumption was met with the BRAC evaluations. In publishing its work in peer-reviewed international journals, BRAC was also subjected to some meta-evaluative appraisal, although not according to all the criteria given in the 1994 standards. Consistency of the BRAC studies with some of the major underlying assumptions of the 1994 standards, then, made the straightforward application of the standards possible, irrespective of the fact that the studies were done in a least-developed international context.
METHODS FOR CONDUCTING METAEVALUATIONS
A second question in this article deals with the "how to" issues for metaevaluations of the full gamut of evaluation efforts going on in international settings, particularly in developing nations. There is a need for consensus on methodological guidelines for conducting sound metaevaluations with the 30 standards as guiding criteria. To the best of my knowledge, only a few studies are presently published that attempt to use the standards-based rating scale in the way recommended by the Joint Committee (see, for example, Scott-Little, Hamann, & Jurs, 2002). Most such studies rely on only one report and simply provide subjective judgments of the authors using the rating scale categories. For several standards (U6, F2, F3, P2-P5, P8, A1, and A12) in the recent study cited (Scott-Little et al.), large numbers of the ratings fell in the "unable to judge" or "not addressed" categories, rendering somewhat limited results.
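The problem described above, in which many ratings for a standard fall into uninformative categories, can be made concrete with a small tally. The sketch below is illustrative only: the two uninformative labels come from the text, but the other rating labels ("met", "partially met"), the 50% threshold, and the toy ratings are assumptions and do not reproduce data from Scott-Little et al. (2002) or from this case study.

```python
from collections import Counter

# Rating labels treated as uninformative, as named in the text.
UNINFORMATIVE = {"unable to judge", "not addressed"}


def flag_uninformative_standards(ratings, threshold=0.5):
    """Return the standards whose share of uninformative ratings
    is at least `threshold`.

    ratings maps a standard code (e.g. "U6") to a list of rating
    labels, one per report reviewed.
    """
    flagged = []
    for standard, labels in ratings.items():
        if not labels:
            continue  # no reports rated against this standard
        counts = Counter(labels)
        share = sum(counts[label] for label in UNINFORMATIVE) / len(labels)
        if share >= threshold:
            flagged.append(standard)
    return sorted(flagged)


# Toy ratings, invented for illustration only.
ratings = {
    "U6": ["unable to judge", "not addressed", "met"],
    "A5": ["met", "met", "partially met"],
    "P2": ["not addressed", "not addressed", "unable to judge"],
}

print(flag_uninformative_standards(ratings))  # ['P2', 'U6']
```

Standards flagged this way are natural candidates for follow-up through interviews or additional documents rather than further review of the same report.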
An obvious conclusion from the present case study is that reasonable metaevaluations with the 30 standards as a framework are not possible by relying only on documentary data and certainly not on one evaluation report. Without examining a broad array of documents and having access to in-depth information from original authors, definitive conclusions are difficult to make and should not be made by outside reviewers. In the BRAC case, significant information necessary for comprehensive understanding of the evaluations had to be obtained through direct interviews with the principal investigator/evaluator. In particular, I faced barriers when applying the standards on utility, feasibility, and propriety based only on documentary evidence. Reliance on multiple data sources and use of a combination of documentary and interview-based analysis seemed more effective in extracting the information necessary to make accurate and fair judgments. The current protocols (Appendixes A and B) functioned well in the BRAC case but may need to be modified as new and different cases are examined.
To conclude, several limitations need to be reiterated. First, the study relies on a single-case approach. Second, the BRAC case analysis was based on all the documentary information written in English that could be marshaled within the tight timeline for the present study. Greater numbers and more varied types of documents should be reviewed for other cases to obtain comprehensive and deeper understandings of metaevaluative issues. Third, the BRAC studies represent one type of international evaluation that employs a management-oriented evaluation model and sample survey methods. As documented in this article, other internationally sponsored community- and nation-building projects alter some of the assumptions that the present standards make about the parameters of evaluation practice and definitions of evaluators. In particular, lesson-learning evaluations, or evaluations of the type evidenced in the CDAP report, may require a rethinking and broadening of the language and guidelines of the 1994 standards. Thus, case selection for continuing metaevaluative analyses must be deliberate and representative of the range of evaluation models and research methods found in international contexts.
There is clearly a need to continue to examine more cases prior to arriving at a formal set of recommendations for the evaluation community or the Joint Committee to consider. I conclude with the recommendation that further cases be analyzed and their results triangulated and cross-validated with the present findings, with continuing discussions among evaluators/researchers engaged in international evaluations on the usefulness of the standards. Data from both similar and dissimilar cases will help augment future discussions on improving the present guidelines in meaningful ways.
REFERENCES
Boyce, W. (1999). Comprehensive Disabled Afghans Project: Integration of the disabled and marginalized, Canada. Kingston, ON: Queen's University, International Center for the Advancement of Community-Based Rehabilitation.
Chatterji, M. (2002). Models and methods for examining standards-based reforms and accountability initiatives: Have the tools of inquiry answered the pressing questions on improving schools? Review of Educational Research, 72, 345-386.
Chowdhury, A. M. R., Choudhury, R. K., Nath, S. R., Ahmed, M., & Alam, M. (2001). Education Watch 2000, a question of quality: State of primary education in Bangladesh. Dhaka, Bangladesh: Campaign for Popular Education, University Press.
Chowdhury, A. M. R., Nath, S. R., & Choudhury, R. K. (2002). Enrolment at primary level: Gender difference disappears in Bangladesh. International Journal of Educational Development, 40, 437-454.
Chowdhury, A. M. R., Nath, S. R., Choudhury, R. K., & Ahmed, M. (2002). Renewed hope and daunting challenges: State of primary education in Bangladesh. Dhaka, Bangladesh: Campaign for Popular Education, University Press.
Chowdhury, A. M. R., Ziegahn, L., Haque, N., Shrestha, G. L., & Ahmed, Z. (1994). Assessing basic competencies: A practical methodology. International Review of Education, 40, 437-454.
Fetterman, D. M. (2001a). Foundations of empowerment evaluation. Thousand Oaks, CA: Sage.
Fetterman, D. M. (2001b). The transformation of evaluation into a collaboration: A vision of evaluation in the 21st century. American Journal of Evaluation, 22, 381-386.
Hanna, N. K. (2000). 1999 Annual review of development effectiveness. Washington, DC: World Bank, Operations Evaluation Department/World Bank Info Shop.
Hopson, R. (2001). Global and local conversations on culture, diversity, and social justice in evaluation: Issues to consider in a 9/11 era. American Journal of Evaluation, 22, 381-386.
Mackay, K. (2002). The World Bank's ECB experience. New Directions for Evaluation, 93, 81-99.
Patton, M. Q. (2001). Evaluation, knowledge management, best practices, and high-quality lessons learned. American Journal of Evaluation, 22, 329-336.
Sanders, J. R. (1994). The Program Evaluation Standards: How to assess evaluations of educational programs (2nd ed.). Thousand Oaks, CA: Sage.
Sanders, J. R. (2001). A vision for evaluation. American Journal of Evaluation, 22, 363-366.
Scott-Little, C., Hamann, M. S., & Jurs, S. G. (2002). Evaluations of after-school programs: A meta-evaluation of methodologies and narrative synthesis of findings. American Journal of Evaluation, 23, 387-419.
Scriven, M. (2001). Evaluation: Future tense. American Journal of Evaluation, 22, 301-308.
Stake, R. E. (1997). Case study methods in educational research. In R. M. Jaeger (Ed.), Complementary methods for research in education (pp. 401-422). Washington, DC: American Educational Research Association.
Stout, S., & Johnston, T. A. (1998). Lessons from experience in HNP. Washington, DC: World Bank, Operations Evaluation Department/World Bank Info Shop.
Stufflebeam, D. L. (1999). Foundational models for 21st century program evaluation. Kalamazoo, MI: The Evaluation Center, Western Michigan University.
Torres, R. T., & Preskill, H. (2001). Evaluation and organizational learning: Past, present, and future. American Journal of Evaluation, 22, 387-396.
Worthen, B. R., Sanders, J. R., & Fitzpatrick, J. L. (1997). Program evaluation: Alternative approaches and practical guidelines (2nd ed.). New York: Addison-Wesley Longman.