"...Where Angels Fear to Tread"
by Peter A. Taylor - 1970
Professor Taylor's philosophic critique of evaluative behaviors continues a discussion which began seven years ago in The Record. Before evaluators go too far, he suggests, they should confront the problem of what constitutes admissible evidence, the limits of measurement itself, the social nature of practical truth, and the role of personal value-experiences in affecting descriptions of a "world." He calls, in sum, for greater attention to goals and instrumentation, both of which require evaluation. He hopes to see educational evaluators working in teams and becoming clearer about meanings, more thoughtful in making judgments, more critical of objectives. The controversy goes on "where angels fear to tread."
In recent years, evaluative behaviors have been reasoned to consist of two major sets of activities, namely description and judgment.1 Researchers have continued to unearth, and writers to acknowledge, the multiplicity and multidimensionality of the outcomes of the educative process. It is this complexity of outcomes which is one of the main causes of the difficulties surrounding the collection and use of data, and which has increased the number of decisions required of the evaluator. In view of the mass of actual and potential data at the call of the educational evaluator, it is essential to be able to ask what, out of that mass, is admissible evidence for a given evaluative undertaking. The evaluator must be able to decide what, among all the many pieces of information he might seek, will be meaningful for a given venture, and what, among all the different techniques and methodologies for data-collection he might use, is appropriate.
The Problem of Meaning
Clearly, the crux of meaning as distinct from significance rests in what constitutes admissible evidence. A meaningful statement is one that is amenable to truth analysis, that is, is capable of a truth evaluation. If what is asserted by a statement can be known to be true, and is true, then that statement is undeniably meaningful; or, if what is asserted is false and can be known to be false, then that statement is meaningful by virtue of its falsification. And it is the function of critical philosophy to determine whether or not statements are subject either to verification or to falsification. Should a given statement prove to be neither verifiable nor falsifiable, nor true by its purely logical status, then it has no meaning.
Admissible evidence now becomes that which is allowed by the rules of critical analysis. Both positivists and empiricists have generally admitted two types of evidence: evidence in the usual sense of observation, sense-perception, reports and the like; and evidence that results from the legitimate use of a language according to the rules of its particular grammar. Thus a statement can be meaningful if it correctly reports some empirical state of affairs (the width of a piece of wire, the indicator reading of a galvanometer), or if it is logically true (that is, given certain definitions and rules of manipulation, valid conclusions will result from application of the rules). In other words, if a statement asserts an empirical state of affairs, and data either actually or potentially exist such that communicants can agree as to whether or not that state of affairs obtains, then that statement can be given meaning by virtue of its verification or falsification. And statements may also be meaningful if they follow an argument within some set of prescribed rules even though the specific referents of the statements are unknown or unrealizable. For example: given that X implies Y, then from X we conclude Y, regardless of whether or not there is anything in the real world for which either X or Y stands.
Thus, insofar as one is a positivist or an empiricist, two classes of statements have meaning: empirical ones, which are meaningful by virtue of their being empirically true or false, and logical ones, which are meaningful by virtue of their following a set of prescribed rules for construction and inference. At first glance, these criteria for meaningfulness are straightforward, and are indeed as old as empiricism itself. To ask for the meaningfulness of a statement one need only apply the criteria of meaning. Observation-based statements, reports of evidence, instrument readings, descriptions of apparatus or methods, and physical stimulus conditions are meaningful, as are the arguments of logic and mathematics. Statements that are not reducible to data or to a purely logical format are not meaningful. Philosophically, this implies that the metaphysics that spawned such terms as "essence," "the soul," "the id," and so forth, and, interestingly enough, even words like "atom" when it was first used, was largely evoking nonsense, except perhaps insofar as it directed the attention of scholars in the field.
So far as meaning is concerned, some words are not at all problematic: conventional definitions, connectives and articles, for example. However, those words or collections of words which purport to signify some thing in the universe, or some state of affairs, or some activity, can give rise to problems if unequivocal rules for designating their referents are not available. Within the behavioral sciences, for example, it is clear immediately that some words are unambiguous. Words which describe a piece of apparatus, or a subject's overt behavior (the S is "turning the page," "running to the paint jar"), are readily understood. Such words symbolize phenomena which all of us can witness. If a question of usage arises, the case in doubt can be presented directly to a panel of judges who not only have a dictionary-definition available, but who can experience the phenomenon to which the word is to refer. But suppose we consider a simple experiment. Suppose we ask a child to respond to a set of color-discs with the appropriate name: thus when we show him a red disc, he says "red," when we show him a green disc he says "green," and so on. The results are unequivocal: the subject names all the colors correctly. If we repeat this "experiment" with other children, and presume that none are colorblind or so young as not to have had any experience in naming colors, we get unanimous agreement in naming the colors. We may then conclude (by way of a generalization) that under the prevailing conditions, the children perceive the colors correctly.
Now notice the not-too-subtle change. We have changed from "naming" the colors correctly to "perceiving" the colors correctly. While we concentrated on the naming response, our experiment was a purely behavioral one. By changing to perceiving as the response, a semantic complication has arisen. If the phenomena we are reporting are those of the act of naming colors, there is no ambiguity. It is a simple matter to correlate the actual color of the disc and the response given, and everyone (allowing for exceptions like color blindness) concurs in what is meant by our description. However, if we report that a child who says "red" when confronted by a red disc is "seeing" red, we are in an ambiguous situation. On the one hand, our description is of his discriminating colors; on the other hand, our description is of what he sees—but not having access to what he sees, we have no way to know if he used the response "I am seeing red" according to a dictionary definition. If this argument is tedious, it is nonetheless essential: when a child reports that he has a "sore tummy," what is it, precisely, that he is reporting? In dealing with the child, a parent will take him to a physician, who will look for observable symptoms and make clinical inferences. To whom shall the evaluator run when his client complains of "dull textbooks"? How does he know where to look, and indeed what are the "external symptoms"?
If science consisted of no more than the collection of publicly verifiable facts and the manipulation of logical grammars, it would be simple enough to be dogmatic as to what kinds of data are acceptable. For instance, if it were only a question of prescribing the limits for admissible data, then behavioral evidence would be acceptable to the psychologist and evaluator, whereas reports taken to signify introspective states of consciousness would not, unless what we want to observe is the reports themselves. Few scientists, if any, are content merely to collect data, however. And educational evaluators are—within the dual role ascribed to them—somewhat duty-bound to go beyond the accumulation of information. They, in common with other scientists, feel the need to interpret the data, to seek explanations, to construct theories. But theories are more than summaries of data and more than aids to calculation. Invariably they include statements which are neither purely logical nor purely empirical. They include hypothetical terminology which is only indirectly related to data. And it is here that the task of adjudicating meaning becomes complex. There is at least one class of scientific statements—namely, statements about hypothetical entities—which is problematic. No easy decision can be attained as to the semantic status of such statements, for they are neither purely logical nor purely empirical.
Another problem arises with respect to universal statements. Historically, scientists have attempted to construct lawlike descriptions of natural phenomena, relying largely on the processes of induction to discover these laws. The problem is: how are any lawlike statements to be accepted as meaningful if an empiricist criterion for meaning requires that such statements be reduced to signifying a finite set of empirical events? Here again the hypothesis-problem intrudes. Even in the field of logic, the system of formal argument has been questioned owing to our inability to establish consistency and completeness. The matter of completeness is a complicated philosophical issue, but briefly it means that within a logical system, any statement that may be expressed should be capable of either proof or disproof. The proof that completeness cannot be established for logical systems rich enough to express arithmetic has been with us since the early 1930's,2 leaving us in an interesting quandary. It is possible in such a system, simple arithmetic for example, to formulate true theorems for which no proof can be found; yet such theorems cannot be dismissed as nonsensical, since they follow the rules of the system for forming meaningful statements.
For an empiricist, the alternative to absolute skepticism is evidential convergence. Scientists do not hesitate to acknowledge that Truth is beyond their reach; but that acknowledgment becomes a spur to better and better approximations. Obviously, convergence is more assuring if it exists within some mathematical framework, whereby each new result approaches some mathematical asymptote. But this is rarely the case. Each advance in knowledge sparks the flame of Truth, only to falter and die. Certainty is an unattainable end. Perhaps this is the hardest pill for the scientist (and assuredly for the layman who wants "the" answer) to swallow. Yet he has to condition himself, for the philosophy of science is essentially a matter of choosing amongst alternative constructs and explanations without recourse to proof.
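The notion of evidential convergence can be given a toy numerical sketch (this illustration is mine, not the paper's, and all the values in it are hypothetical): repeated noisy measurements of an unknown quantity yield successive estimates that draw ever closer to the underlying value without ever certifying it.

```python
import random

def converging_estimates(true_value, noise_sd, sample_sizes, seed=0):
    """Running-mean estimates of true_value after each sample size.

    Each 'measurement' is the (unknowable) true value plus Gaussian noise;
    the estimate at each stage is the mean of all measurements so far.
    The error shrinks roughly as 1/sqrt(n) but never reaches zero:
    convergence without certainty.
    """
    rng = random.Random(seed)
    measurements = []
    estimates = []
    for n in sample_sizes:
        while len(measurements) < n:
            measurements.append(rng.gauss(true_value, noise_sd))
        estimates.append(sum(measurements) / len(measurements))
    return estimates

if __name__ == "__main__":
    sizes = (10, 100, 1000, 10000)
    for n, est in zip(sizes, converging_estimates(10.0, 1.0, sizes)):
        print(f"n={n:>6}  estimate={est:.4f}  error={abs(est - 10.0):.4f}")
```

Each successive estimate is, with high probability, a better approximation than the last, yet no finite sample ever delivers the "true" value itself.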
The Measurement Problem
The problem does not end here. Just as there are conceptual limits to establishing meaning and approximating Truth, so the evaluator must recognize that there are limits to measurement itself. Of course, we have probably nowhere in the behavioral sciences yet approached a reasonable limit, but the limits to measurement should be remembered in establishing the payoffs of evaluative activities.
Perhaps eighty years ago, when the system of ideas of what we now call "classical physics" had been established, most physicists believed that they had an essentially true picture of the world. All that remained to perfect the picture, they thought, was to paint in the close details and make minor corrections to the physical laws. For this it was necessary to increase the accuracy of quantitative observations, on the one hand by increasing the accuracy and sensitivity of instruments, on the other hand by a careful logical analysis of possible errors. In fact, the situation became such that a physicist who succeeded in measuring a physical constant (sic) with a new degree of precision would assure himself of recognition in academic life.
The classical physicists realized, of course, that the process of measuring necessarily involves a mutual interaction between the thing being measured and the measuring device, which must of necessity change the measured object. For example, when one uses a micrometer screw gauge to measure the thickness of a piece of wire, the instrument causes a slight depression in the material of the wire. Or if compression effects were considered serious, and the thickness were measured optically under a microscope, the use of a concentrated beam of light would heat the wire slightly and make it expand. Similarly, to measure an electric current, at least a fraction of it must pass through the galvanometer, and that alters the original value of the current. Examples like this could go on and on. Yet it seemed to the physicist that, by increasing the sensitivity of his measuring devices, he could reduce this interaction effect below any preordained level, and that in any case corrections for the interaction could be calculated as long as the interaction was of a lawful character itself.
Around the turn of this century, this and other basic problems in measurement began to become a great deal more involved, partly as an outcome of new experimental techniques and partly as a result of new insights into the nature of physical laws themselves.
It is obvious that a physical quantity can be measured with accuracy only if it is precisely defined. Our discussion above should have made this clear. Consider the measurement of the properties of a gas. Now, according to the kinetic theory, the laws governing the macroscopic behavior of a gas hold strictly only for systems consisting of a theoretically infinite number of molecules. In other words, macroscopic quantities such as temperature and pressure (and hence density) are defined as statistical averages over infinitely large numbers of molecular events. But any device for measuring such a quantity can do no more than record the outcome for a large but finite number of events: for instance, a pressure gauge records the net result of a finite number of molecular impacts on its sensitive surface. Hence even the most perfect instrument could not give the true value of the quantity to be measured, and repeated measurements will show irregular fluctuations due to the irregular thermal movement of the molecules—fluctuations which become more and more apparent as the sensitivity of the instrument increases. The Brownian movement, first described in 1827, is a perfect example of readily observed fluctuation in movement.
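The statistical point here can be simulated directly (a sketch of my own, with hypothetical numbers, not the paper's): model each gauge reading as the average of a finite number of noisy molecular impacts, and watch the reading-to-reading fluctuation shrink as the number of impacts grows, while never vanishing.

```python
import random

def gauge_reading(true_pressure, n_impacts, impact_sd, rng):
    """One gauge reading: the average of a finite number of molecular impacts.

    The 'true' pressure is defined only as the limit over infinitely many
    impacts; any real reading averages finitely many, and so fluctuates.
    """
    impacts = (rng.gauss(true_pressure, impact_sd) for _ in range(n_impacts))
    return sum(impacts) / n_impacts

def fluctuation(true_pressure, n_impacts, n_readings=200, impact_sd=5.0, seed=1):
    """Standard deviation of repeated readings.

    Shrinks roughly as 1/sqrt(n_impacts), but is never exactly zero.
    """
    rng = random.Random(seed)
    readings = [gauge_reading(true_pressure, n_impacts, impact_sd, rng)
                for _ in range(n_readings)]
    mean = sum(readings) / n_readings
    return (sum((r - mean) ** 2 for r in readings) / n_readings) ** 0.5

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"impacts per reading={n:>5}  fluctuation={fluctuation(100.0, n):.3f}")
```

However sensitive the gauge, its readings scatter about the statistical average; refining the instrument only makes the scatter easier to see.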
The recognition that there is no point in trying to increase the accuracy of measurement of such statistical quantities beyond the inherent statistical error came early in this century. It was Einstein who pointed out that when we try to do so, we encounter another difficulty. The internal movement of particles in the measuring device itself produces irregular fluctuations in the readings, which increase with increased sensitivity and increasing temperature. In communication systems using electronic tubes to amplify weak signals, such fluctuations become apparent as background "noise." Clearly nothing can be gained in the performance of a receiver by increasing its sensitivity beyond the level at which fluctuations (noise) begin to exceed the strength of the signal itself.
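The futility of amplifying past the noise floor can be made concrete with a small numerical sketch (again mine, with hypothetical values): since the amplifier multiplies the weak signal and the instrument's own noise alike, the signal-to-noise ratio is unchanged by gain.

```python
import random

def amplified_snr(signal, noise_sd, gain, n=2000, seed=2):
    """Signal-to-noise ratio of an amplified noisy channel.

    Each output sample is gain * (signal + thermal noise). Because the
    gain multiplies signal and noise alike, the ratio of mean output to
    output scatter is independent of the gain: past the noise floor,
    extra sensitivity buys nothing.
    """
    rng = random.Random(seed)
    outputs = [gain * (signal + rng.gauss(0.0, noise_sd)) for _ in range(n)]
    mean = sum(outputs) / n
    sd = (sum((x - mean) ** 2 for x in outputs) / n) ** 0.5
    return mean / sd

if __name__ == "__main__":
    for g in (1.0, 10.0, 100.0):
        print(f"gain={g:>6}  signal-to-noise ratio={amplified_snr(1.0, 0.5, g):.4f}")
```

The printed ratios are identical at every gain; only a quieter instrument, not a more sensitive one, would improve them.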
In view of these facts the older ideas about the action of the measuring instrument on the thing being measured had also to be revised. Because of the Brownian motion of the measuring device, this reaction has an irregular character and therefore cannot be completely corrected by compensatory calculations. Even nuclear processes are irregular in character and cannot, therefore, be measured with complete accuracy. If we invoke quantum theory and the concept of "zero-point fluctuations," we might say that a particle performs a kind of drunken dance about its classical path. As a result, the calculation of the interactive effects of instrument and object to be measured can at best be probabilistic. All this has been reduced by the physicist to the so-called "uncertainty principle," which accounts for the indeterminacy of measurement through the quantum laws of motion.
If the physical-science picture has been painted with some tedium, it has been in the hope of making clear that the evaluator should not be given, nor should he accept, the task of refining his measurement activities beyond the inherent statistical error. What this error is will not always be obvious, or even knowable within the foreseeable future. But with the assurance that the statistical error associated with human endeavor is likely to be large, it is probably more effective for the evaluator to concentrate (as a first step) on evoking broadly descriptive measures than to expend effort on trying to obtain convergent truths. A suitable guide here is to examine the literature related to the activities or behaviors he is measuring. If he can be assured that the effort of obtaining refined data is worthwhile, then he would probably want to expend it. If the inherent error is large, then measurement effort should be apportioned accordingly. Only at a subsequent stage might the evaluator be interested in generating "truth" in its generalized sense. The basic point is that if exact measurement is impossible in physics, it is likely to be impossible in behavioral science. The evaluator should temper his efforts to measure to the likelihood of reasonable payoff from those efforts.
The Social Nature of Acceptable Evidence
It should have become quite clear by now that functional, practical truth (acceptable evidence) is social—the truth of agreement—and that absolute truth is not available in the language of science. The question now is: who does the agreeing?
In a technical sense, it is some in-group that does the agreeing. An "in-group" is any group whose members share a unique set of attitudes, beliefs, and sources that set them apart from other groups. Conversely, an "out-group" is a group seen critically by the in-group, and that differs from the latter in some manner recognized by the in-group. Neither group in the grand course of events remains either "in" or "out": changing forces transmute the one into the other, and back again, through time. If it so happens that the opposition to the current "in-group" is strong, then the "in-group" sets up prejudicial barriers against the "out-group" and unconscious biases maintain and reinforce these barriers. The "out-group" responds with counter-prejudices, and so on. Sometimes over-enthusiastic innovators, who for a while find no opposition, create a fictitious "out-group" in order to find a source for feedback, and a stimulus to effort.
"In-groups" will often develop a special language of their own—new terms that have only an immediate referent but which in due course become more widely accepted neologisms. A special language promotes "in-group" solidarity, and marks its members as different from the "out-group." Probably the best test of special languages is time: insofar as a new terminology survives, then so far can its contribution to communication be valued.
But to say that agreement is the sanction for truth is to do little more than to say that people will believe what they accept. There is, however, an element of respectability that enters here. Agreement is not easy to get, especially the agreement of a highly intelligent group of people. Still, the question remains: what are the conditions for accepting evidence in science?
On this question, there are two or three broadly relevant things to be said. First, scientific agreement means agreement by scientists, by persons trained to interpret empirical evidence and to see its implications. The approval of philosophers, of publishers and editors, of the politician, and of the whole of the rest of the public counts for nothing. Obviously, the reason evoked here for rejecting the opinion of the public and of untrained savants is that they do not know how to evaluate empirical evidence. Not only are they liable to misinterpret data (or be totally incapable of reading it), but they tend to put too much reliance on semantic representation—to assume that literary reports are self-validating.
This argument has its implications for the training of evaluators and for the operation of evaluators in the field. In the first place, the assertion is that evaluators—insofar as they wish to produce scientifically acceptable output—must be trained as scientists, trained to insist on the right kinds of data and trained to interpret them. Agreement amongst scientists means that more than one evaluator should be available. And since in many instances evaluators are asked to report on events which are non-replicable (either in fact, because the complex environment cannot be reproduced, or by design, because feedback is used to change the subsequent course of events), it would seem essential, if they are to assure themselves that their evidence is scientifically admissible, to work in teams. Further, although they should be willing and able to call on social philosophers, and on political, economic and lay opinion, they should always remember that wisdom in one field does not always transfer to another. The evaluator must establish, himself, the criteria for acceptable evidence, and take unto himself the responsibility for interpreting that evidence.
Second, the basic feature that distinguishes scientific observation from other empirical ways of observing Nature is control. The use of control is the scientist's way of protecting himself from bias, of protecting his inferences and conclusions from his own personal predilections. "Control" in its scientific sense means substituting a more conservative (and in this case, accurate) relative statement for an absolute one. At best, the scientist observes a difference between events, not a single event.
In this sense, not all prudent evaluative activity can be scientific. Control may be impractical (though rarely conceptually impossible). Political "science," economics, medicine and education all have frequently to act in the absence of evidence obtained under controlled conditions. Society is not always ready to use scientific method on all possible occasions: few of us can imagine parents being willing to hold their children out of school to act as a control group for some curriculum experiment.
A third, and briefer, point to make about the validation of scientific belief is the fairly obvious one that all evidence must be taken into account. Some years ago now, Lafleur (1951) made this point most dramatically in reviewing Velikovsky's Worlds in Collision. In an extended paper3 on the criteria of acceptable evidence, Lafleur points to the necessity for suspending judgment about a small inconsistency which contradicts a great mass of otherwise consistent information. Velikovsky, it will be recalled, attempted to throw overboard the whole of Newtonian mechanics in order to support the biblical story of creation. The question that Lafleur raised was: what substitute did Velikovsky have for Newtonian mechanics? Should physics abandon the more fully confirmed larger system in order to believe in a tiny, imperfectly plausible contradiction?
As far as evaluation is concerned, this conceptually "brief point"—taking all evidence into account—looms far larger. As has been pointed out constantly in this paper, the criteria for meaningful, scientific acceptance of data are not always possible to employ. And the evaluator usually wishes to be able to make some use of opinion and values (to which we shall presently turn). Here is a problem to which evaluators must pay more attention: what constitutes the full span of evidence? The answer will often come through the experimental establishment of contingencies: just as often it will have to come (at least in the meantime until new or better experimental techniques are available) through consensus. Just what antecedent data are relevant to curriculum evaluation, for instance? Should the evaluator press a child for information regarding his father's income? Is it "more important" to find out if he needs to wear spectacles than that he comes from a town with a population of 10,000 people? In what ways are "intents" to be described? It is currently quite impossible, even with all our electronic sophistication, to take "all evidence," including the notorious kitchen sink, into account. The evaluator must select, and do so on the most scientific basis he can. And at present there is a great deal to be done to help him make choices and decisions on the results of careful experiment.
The Problem of Valuing
It can probably be agreed that the distinction between facts and values is no more than a distinction between two types of claims that may be differentiated by the kinds of scientific and logical evidence that support them. The distinction is not between a system of rational empirical inquiry and thought outside science. Value-free social science is beyond possibility. It is impossible because judgments as to the merit of explanations, theories, experiments, data and instrumentation are expressions of value and essential to any science. In the present context, values are an undesirable intrusion unless they are explicitly laid bare, since it is the role of the social sciences (and hence of the evaluator as a practitioner of social science) to help resolve social problems, and that requires specific recommendations, not just descriptions. While moral value judgments do not enter as a problem in the physical sciences (though they may in applications, such as nuclear warfare or birth control), they are seen as essential in the social sciences4 and certainly enter as a major factor in the concerns of the evaluator. What kinds of rewards are acceptable to society in order to evoke certain responses? What kinds of punishment? Should the Bible be discussed in schools? Who should provide sex education?
A value is always an experience of a person. Crudely—for this "definition" will not stand too close a scrutiny—a value is used to refer to an experience that a person desires. What is valued (and here "value" can assume both positive and negative attributes) is the experience, some personal undergoing. The crucial point, however, is that the experiencing of value is not the valuing of experiences. When one evaluates, one is analyzing and co-relating sets of value-experience. Furthermore, evaluating finds one predisposed to some values and value-patterns. Even before an evaluator starts his consideration of value-experience, before he starts to formalize his procedures, he has an established set of behaviors that seek some experiences and avoid others.
Therefore, when he evaluates, the evaluator does not create the value-experiencings. He grades value-experiences in relation to each other, not only in terms of their "felt" qualities (or patterns of qualities) but also with regard to their internal consistency, their mutual coherence. But in grading, the evaluator is forced to move beyond his experiences themselves to a set of equally awkward facts about his psycho-physiological makeup in relation to what creates the value-experiences. The evaluating of value-experiences takes the evaluator beyond their immediately enjoyed quality (positive or negative) to some understanding of the preconditions within himself and outside him that produce the value-experiences. To evaluate, in this sense, is to establish an awareness (within the evaluator) of the causal relationships that exist within personal experiences and in the interaction of man and environment.
Thus, a value-pattern becomes a description of the world with man left in it: values relate the evaluator to the world. Evaluative activity requires of the evaluator the added ability to introspect and determine his own prejudices, to be able to tolerate and even seek opposing value-judgments.
In an earlier paper with Maguire,5 an evaluation model was proposed that effectively drew attention to the need for assessing the congruence of outcomes with goals, and at the same time permitted acts of both formative and summative evaluation. Now I am calling for greater attention in evaluation to the evaluation of the goals themselves, and to the evaluation of instrumentation.
This is not a new call. The problems of meaning, of measurement and of values have been written about for centuries. But evaluators have tended not to heed the call—there have been other matters that were more pressing, clients who wanted fairly quick answers, test data that were relatively easy to collect and to obtain agreement on. Even some of the terminology has been used, but used in ways that bring solutions to problems other than those raised here. Cronbach,6 for example, uses "meaning" in the sense of "attitudes"; Stake7 uses "judgments" to cover a series of activities ranging from the most scientifically ordained inference to the passing of a subjective, idiosyncratic like-dislike statement.
People want to know how well their materials or activities achieve certain goals. And they are entitled to know, of course. But they should also be willing to be told (whether or not they actually ask for it) that their goals are, or are not, worth achieving within some dynamic framework. Who cares how well the objectives of a mathematics curriculum are met if those objectives do not include the processes of addition and multiplication? Who cares if a school produces people with a high consciousness of traffic safety if they are never aware of a need for personal health standards?
Here again is an argument for teams of evaluators, never single evaluators. Extremely rarely will a given individual be sufficiently broadly endowed with skills to pass the socio-philosophical judgments required for this kind of analysis and at the same time to make scientific judgments on empirical data. It has been pointed out already that the kind of training for one does not necessarily generalize to the other. There is a pressing need for a kind of super-evaluation: an evaluation of the evaluative activities themselves, and of the ends to which they are directed. Are these goals meaningful? Do these processes lead to admissible evidence? What kind of evidence is this—to whom should it be passed for judgment? Is this person properly qualified to make this kind of judgment? Are these instruments oversensitive, in the sense that what they measure is (in context) largely "noise"? Is this judgment overly clouded by the personal value-system of the evaluator involved?
Each of these questions is a difficult one to answer, but not impossible. And the countenance of evaluation is changing sufficiently fast to make them questions to which some attention must be given.
In a typical congruence-oriented evaluation a statement is made as to the extent to which goals are met. To the extent that the goals are not attained, what has to be altered? More often than not, it is the material or the instructional technique, and the like, that are changed. But need they be? Who has proclaimed that goals are infallible, sacrosanct? How often do we, as evaluators, question the goals themselves? The possibility of giving a differential weighting to goals has not been overlooked in practice, and is a first approximation to the kind of concern being expressed here.
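The differential weighting of goals just mentioned can be given a minimal arithmetical sketch (the goals, attainment figures and weights below are entirely hypothetical, invented for illustration): a congruence score that values some goals more than others, rather than treating all as sacrosanct equals.

```python
def weighted_congruence(attainment, weights):
    """Weighted congruence score for a set of goals.

    attainment: fraction in [0, 1] to which each goal was met
    weights:    relative importance assigned to each goal
    Returns the weighted average attainment, also in [0, 1].
    """
    total_weight = sum(weights)
    return sum(a * w for a, w in zip(attainment, weights)) / total_weight

if __name__ == "__main__":
    # Three hypothetical curriculum goals, unequally valued:
    # the first is judged three times as important as the others.
    attainment = [0.9, 0.5, 0.2]
    weights = [3.0, 1.0, 1.0]
    print(f"weighted congruence = {weighted_congruence(attainment, weights):.2f}")
```

The weights themselves are, of course, exactly the kind of value-judgment this section argues must be made explicitly and by appropriately qualified judges.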
The problem of meaningfulness of goals is not altogether separate from the meaning attached to statements of objectives. In much of my own work and the work of Maguire in this area (all of this material being in various stages of progression towards the press) attempts have been made to obtain maps of semantic meaning ("map" being used in its mathematical sense). Through various scaling and factoring procedures we have attempted to find out the ways in which the logic of grammar is being manipulated. But these measurement-based efforts are not an absolution for the evaluator: he must, if necessary, obtain appropriate assistance to enable him to return to the question of meaningfulness with a chance of reaching a solution, and having assured himself of the latter, to find out if the goals are worth achieving anyway.
Similarly with the problems of obtaining appropriate judges. Fairly frequently, the evaluator will not be in a position to judge the over-all cohesiveness of a set of goals. It would seem essential to include a subject-matter specialist on the evaluating team to carry out this function, among others. More frequently, the evaluator will not be able to make social-philosophical judgments with anywhere near the necessary degree of efficiency; most likely he will have been trained to interpret the more objective kinds of data. The kinds of judgment to be passed during the course of an evaluation must be identified, and provision made for skilled judgments in these areas.
Finally, the problem of instrumentation arises. There is no point here in going into all the philosophical implications of constructs such as reliability and validity, or into the logic of experimental design. But as a general problem they are vital. How much effort do we need to put into obtaining highly sensitive tests to differentiate individuals when we are assessing the impact of curricular materials? Is it better (and it probably is) to use multivariate approaches than univariate approaches in order to obtain differences in comparative experiments? Are our methods having too much of an interactive effect? Are we operating on the assumption of certain underlying descriptions of characteristics that are totally unreasonable?
The very word "evaluate" implies a comparison. Objectives are patently of no importance unless they occur in a describable context. But simply to describe is not enough (as we said right at the beginning). To say that an objective is a good one, or that a course is desirable or useful, is, at the very least, to make an implied comparison. So we are confronted with experimental design problems for obtaining controls, for producing double-blind situations.
Perhaps we have gone too fast in our evaluation activities by concerning ourselves unduly with obtaining statements and measures of inputs and outcomes. Evaluators have tended to work alone, or in teams that have a certain bias toward the psychometric. Before we go too far, before we start to work within a set methodological framework, it would seem to be desirable to include more ways of looking at the meaning of goals, the semantics of objectives, the appropriateness of our measures and our judgments, and the evoking of value-biases.
1 See Lee J. Cronbach, "Course Improvement through Evaluation," Teachers College Record, 64, 1963, pp. 672-683; Peter A. Taylor and Thomas O. Maguire, "A Theoretical Evaluation Model," Manitoba Journal of Educational Research, 1, 1966, pp. 12-17; Robert E. Stake, "The Countenance of Educational Evaluation," Teachers College Record, 68, 1967, pp. 523-539.
2 Kurt Gödel, On Formally Undecidable Propositions of Principia Mathematica and Related Systems. (Original, 1931. Translated by B. Meltzer.) London: Oliver & Boyd, 1962.
3 L. J. Lafleur, "Cranks and Scientists," Scientific Monthly, 73, 1951, pp. 284-290.
4 Michael Scriven, "Value Claims in the Social Sciences." Mimeo. Social Sciences Education Consortium, 1966.
5 Taylor and Maguire, op. cit.
6 Cronbach, op. cit.
7 Stake, op. cit.