Developing and Validating Test Items
reviewed by Gregory Cizek - December 19, 2013
One way to think about Developing and Validating Test Items is as a substantially expanded edition of the previous work of one of its authors. Thomas Haladyna has a long record of scholarship and practical experience in test development; much of his work has been published in various journals and in the definitive guidelines presented in another book, Developing and Validating Multiple-Choice Items (2004), now in its third edition. In producing the current volume, Haladyna has collaborated with Michael Rodriguez, a partnership that joins arguably the two leading contemporary experts on item development and establishes a solid prima facie case for the authoritativeness of the present volume.
From a simple comparison of the titles, it is also clear how the present volume has elaborated on the former work. The earlier work focused on a single item format (multiple choice), whereas the latest work expands guidance on item construction to many other assessment formats, including a diverse slate of selected-response formats (e.g., matching, true-false, alternate choice) and constructed-response formats (e.g., essay, performance, fill-in-the-blank, portfolio). The art and science of item generation and scoring are thoroughly covered in this definitive, 400-plus page work on the topic.
In fact, one of my first reactions to the book was that the publisher's marketing copy presented inside the front cover seemed remarkably spot-on. The book is accurately touted as being "comprehensive and based on theory and research" and as having "illustrative examples" and "a focus on validity" (p. i), and it delivers on each of those promises.
Users of this book will find all the guidance necessary to choose an appropriate test item format; to create and refine test questions in diverse formats; and, importantly, to do so in a way that promotes the accurate interpretation of examinees' performances on those items, enhancing the essential characteristic of sound testing: validity (see Cizek, 2012). Instead of the usual listing of chapters, it might be most helpful to simply note that the book addresses all of the following:
* fundamentals for item developers, including deep attention to validity (Chapters One-Four and Sixteen-Nineteen);
* guidelines for developing selected-response (Chapters Five-Nine) and constructed-response (Chapters Ten-Twelve) items;
* special attention to assessing writing skill (Chapter Thirteen), competence for professional licensure or certification (Chapter Fourteen), and examinees with special needs (Chapter Fifteen); and
* the future of item development and validation (Chapter Nineteen).
The balance of this review will first highlight a few particularly noteworthy aspects of the book. Then, suggestions will be provided for what might be added to an even more expanded edition. However, readers are cautioned not to make an inappropriate inference from the fact that the listing of suggestions is somewhat more extensive than the list of praises. Overall, Developing and Validating Test Items is the most comprehensive and authoritative reference for those who engage in, oversee, or evaluate item development for educational, licensure, or certification assessments.
Of the book's many strengths, I think readers will particularly appreciate that it begins with clear, concise, user-friendly definitions of key terms such as test, construct, and domain. And it is hard to over-emphasize the book's strong grounding in validity as an overarching theme, as it should be. The authors also frequently link their advice to the best practices for testing embodied in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; revision forthcoming in 2014).
Another strength is that the volume contains a number of aids that users will find helpful, including diagrams, checklists, and other features that provide key guidelines, options, taxonomies, or resources in succinct form. One of the most helpful strategies used by the authors is the inclusion of examples and non-examples. That is, readers are given an illustration of an item that meets a technical guideline, and one that does not meet the guideline. In my opinion, the authors could invoke this strategy even more frequently; in many more cases, item developers could benefit from seeing typical missteps and how they are easily corrected. Finally, the authors are to be commended for including a chapter on developing/accommodating items so that they are accessible to persons with disabilities, and a chapter on the development of items for surveys. (A survey also meets the definition of a test in that it is a systematic sample of respondents' knowledge, attitudes, opinions, etc.)
The field of testing is evolving rapidly, so it is no surprise that there are some topics that, just a few years ago, were on the distant horizon. The following are a few suggested additions or revisions on small points (specks on the chrome) that might warrant attention in a second edition.
Overall, it was my impression that the book may have had one leg still too firmly planted in existing paradigms and that a second edition should reach further. One admittedly minor example of this is the authors' recommendation that all items in an item bank should be kept in camera-ready format (p. 19). On a more substantive note, I thought that the book did not incorporate computer-related aspects as much as is necessary for today and into the future. Tests are increasingly being delivered by computer, and the term "test item" is increasingly being preceded by the modifier "technology-enhanced." Such formats not only have the potential to assess a greater range of cognitive skills, but they also carry the potential to allow greater accommodations for persons with special needs, as well as challenges when the same test items are administered in both computer-based and paper-and-pencil modes. Technology-enhanced item formats take center stage in assessments ranging from large-scale measures of student achievement developed for the National Assessment of Educational Progress (NAEP; see http://nationsreportcard.gov/science_2009/ict_tasks.asp) to examinations used for professional licensure and certification, such as the AICPA examinations for certifying public accountants (see http://apps.aicpa.org/CBTeSampleTest/SampleTestStart.html). Admittedly, the universe of possibilities for creating innovative item formats and response modes using the computer seems endless, but much greater attention should be given to new computer-based formats. Examples of those items and evidence about their performance and value for measuring achievement/competence would make strong additions to the next edition.
In addition, the technology of computer scoring of constructed-response items is well established (see, e.g., Shermis & Burstein, 2003, 2013), and attention to the challenges of developing items amenable to automated scoring would also seem to be an important topic to include for the future. In the same vein, although not as advanced as automated item scoring, significant advances have been made in automated item generation. A revised edition might consider incorporation of highly efficient, cost-effective item generation technologies such as those described by Foster and Miller (2009), Stenner, Fisher, Stone, and Burdick (2013), and others.
A final computer-related exhortation is that more attention should be paid to the many ways that computer delivery can aid in removing barriers for examinees with special needs, and to the special challenge of ensuring comparability of inferences when (ostensibly) equivalent items are administered in different delivery modes.
Earlier in this review, I commented positively on the authors' decision to include a chapter on developing survey items. In conjunction with expanded treatment of computer-based formats, it would seem desirable for the authors to increase coverage of computer use in both developing and administering those items. Many researchers, test developers, graduate students, and others develop and administer surveys using tools such as SurveyMonkey, Qualtrics, Google Docs, and others. The range of presentation and formatting options permitted by these providers is impressive, and affords the survey item writer many options that go beyond the familiar 1-5, Strongly Disagree to Strongly Agree format.
Finally, I had just a few quibbles with the treatment of some topics. I stress that these are minor disagreements or differences in perspective, but ones that I hope readers will consider to be helpful. For one, I would add cognitive level to the list of item attributes that should be captured in an item bank. To its credit, the book has an entire chapter on cognitive item demands; this area is surely increasing in importance for item developers.
For another, I would avoid advice framed as absolutes. For example, regarding complex multiple-choice items, the authors state, "This format should not be used" (p. 72). Regarding the size of item pools, the authors recommend that item pools exceed the length of a test form by at least a factor of 2.5 (p. 132). In my experience, test purpose, contexts, familiarity, resources, and a host of other factors conspire to make it inadvisable to rule out formats in an absolute sense, and even item pool depth depends on factors such as the number of new forms needed in each testing cycle, the extent of item pilot-testing that is possible, the extent of security threats, the proportional representation of various content subareas within the pool, and many other factors.
Finally, as a student of the late Robert Ebel at Michigan State University, I would fail to honor that legacy if I didn't encourage the authors to elaborate on the use of the Alternate Choice (AC) item format. The examples in the book primarily illustrate this format as essentially a two-option multiple-choice item. I believe that Ebel (1981, 1982), who is credited with proposing the AC format, saw its value primarily as a refinement of the true/false format in which embedded directional opposites were used as answer choices in a single, focused statement. Typically, terms such as more/less, increase/decrease, improve/degrade, etc. are used to test examinees' knowledge of relationships. Two simple examples testing knowledge about Gay-Lussac's law and word processing might be:
1) The pressure of a gas of fixed mass and fixed volume a) increases b) decreases when the temperature of the gas increases.
2) When the font of a document is increased a) more b) fewer characters can be included in a line of text.
In conclusion, the authors of Developing and Validating Test Items attempted a difficult task: creating a go-to resource of definitive guidance for producing test items that can be used confidently in the service of measuring complex characteristics in the increasingly consequential contexts of education, licensure, and certification, and in the rapidly changing world of assessment technology. I believe they succeeded remarkably. Although there are some portions of the book that seem already ripe for new material to be added, I believe that readers will be well served by the comprehensive, evidence-based, and user-friendly information contained in the current volume.
References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31-43.
Ebel, R. L. (1981, April). Some advantages of alternate-choice items. Paper presented at the annual meeting of the National Council on Measurement in Education, Los Angeles, CA.
Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement, 19(4), 267-278.
Foster, D., & Miller, H. L. (2009). A new format for multiple-choice testing: Discrete-option multiple-choice. Results from early studies. Psychology Science Quarterly, 51(4), 355-369.
Haladyna, T. M. (2004). Developing and validating multiple-choice items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.
Shermis, M. D., & Burstein, J. (Eds.). (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum.
Shermis, M. D., & Burstein, J. (Eds.). (2013). Handbook of automated essay evaluation: Current applications and new directions. New York: Routledge.
Stenner, A. J., Fisher, W. P., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement. doi:10.3389/fpsyg.2013.00536