
Developing and Validating Test Items

reviewed by Gregory Cizek - December 19, 2013

Title: Developing and Validating Test Items
Author(s): Thomas M. Haladyna & Michael C. Rodriguez
Publisher: Routledge, New York
ISBN: 0415876052, Pages: 454, Year: 2013

One way to think about Developing and Validating Test Items is as a substantially expanded edition of the previous work of one of its authors. Thomas Haladyna has a long record of scholarship and practical experience in test development; much of his work has been published in various journals and in the definitive guidelines presented in another book, Developing and Validating Multiple-Choice Items (2004), now in its third edition. In producing the current volume, Haladyna has collaborated with Michael Rodriguez, a partnership that joins arguably the two leading contemporary experts on item development and establishes a solid prima facie case for the authoritativeness of the present volume.

From a simple comparison of the titles, it is also clear how the present volume elaborates on the former work. The earlier work focused on a single item format (multiple choice), whereas the latest work expands guidance on item construction to many other assessment formats, including a diverse slate of selected-response formats (e.g., matching, true-false, alternate choice) and constructed-response formats (e.g., essay, performance, fill-in-the-blank, portfolio). The art and science of item generation and scoring are thoroughly covered in this definitive, 400-plus-page work on the topic.

In fact, one of my first reactions to the book was that the publisher’s marketing copy presented inside the front cover seemed remarkably spot-on. The book is accurately touted as being “comprehensive” and “based on theory and research” and as having “illustrative examples” and a “focus on validity” (p. i), and it delivers on each of those promises.

Users of this book will find all the guidance necessary to choose an appropriate test item format; to create and refine test questions in diverse formats; and, importantly, to do so in a way that promotes the accurate interpretation of examinees’ performances on those items—enhancing the essential characteristic of sound testing: validity (see Cizek, 2012).  Instead of the usual listing of chapters, it might be most helpful to simply note that the book addresses all of the following:

* fundamentals for item developers, including deep attention to validity (Chapters One through Four and Sixteen through Nineteen);

* guidelines for developing selected-response (Chapters Five through Nine) and constructed-response (Chapters Ten through Twelve) items;

* special attention to assessing writing skill (Chapter Thirteen), competence for professional licensure or certification (Chapter Fourteen), and examinees with special needs (Chapter Fifteen); and

* the future of item development and validation (Chapter Nineteen).


The balance of this review will first highlight a few particularly noteworthy aspects of the book. Then, suggestions will be provided for what might be added to an even more expanded edition. However, readers are cautioned not to make an inappropriate inference from the fact that the listing of suggestions is somewhat more extensive than the list of praises. Overall, Developing and Validating Test Items is the most comprehensive and authoritative reference for those who engage in, oversee, or evaluate item development for educational, licensure, or certification assessments.

Of the book’s many strengths, I think readers will particularly appreciate that it begins with clear, concise, user-friendly definitions of key terms such as test, construct, and domain. And it is hard to overemphasize the book’s strong grounding in validity as an overarching theme, as it should be. The authors also frequently link their advice to the best practices for testing embodied in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; revision forthcoming in 2014).

Another strength is that the volume contains a number of aids that users will find helpful, including diagrams, checklists, and other features that provide key guidelines, options, taxonomies, or resources in succinct form. One of the most helpful strategies used by the authors is the inclusion of examples and non-examples. That is, readers are given an illustration of an item that meets a technical guideline and one that does not. In my opinion, the authors could invoke this strategy even more frequently; in many more cases, item developers could benefit from seeing typical missteps and how they are easily corrected. Finally, the authors are to be commended for including a chapter on developing and accommodating items so that they are accessible to persons with disabilities, and a chapter on the development of items for surveys. (A survey also meets the definition of a test in that it is a systematic sample of respondents’ knowledge, attitudes, opinions, etc.)

The field of testing is evolving rapidly, so it is no surprise that there are some topics that, just a few years ago, were on the distant horizon. The following are a few additions or revisions on small points (“specks on the chrome”) that might warrant attention in a second edition.

Overall, my impression was that the book may have one leg still too firmly planted in existing paradigms and that a second edition should reach further. One admittedly minor example of this is the authors’ recommendation that all items in an item bank be kept in “camera-ready format” (p. 19). On a more substantive note, I thought that the book did not incorporate computer-related aspects as much as is necessary for today and the future. Tests are increasingly being delivered by computer, and the term test item is increasingly preceded by the modifier technology-enhanced. Such formats not only have the potential to assess a greater range of cognitive skills, but they also carry the potential to allow greater accommodations for persons with special needs, as well as challenges when the same test items are administered in both computer-based and paper-and-pencil modes. Technology-enhanced item formats take center stage in assessments ranging from large-scale measures of student achievement developed for the National Assessment of Educational Progress (NAEP; see http://nationsreportcard.gov/science_2009/ict_tasks.asp) to examinations used for professional licensure and certification, such as the AICPA examinations for certifying public accountants (see http://apps.aicpa.org/CBTeSampleTest/SampleTestStart.html). Admittedly, the universe of possibilities for creating innovative item formats and response modes using the computer seems endless, but much greater attention should be given to new computer-based formats. Examples of those items, along with evidence about their performance and value for measuring achievement and competence, would make strong additions to the next edition.

In addition, the technology of computer scoring of constructed-response items is well established (see, e.g., Shermis & Burstein, 2003, 2013), and attention to the challenges of developing items amenable to automated scoring would also seem to be an important topic to include for the future. In the same vein, although not as advanced as automated item scoring, significant advances have been made in automated item generation. A revised edition might consider incorporating highly efficient, cost-effective, “one-off” item generation technologies such as those described by Foster and Miller (2009), Stenner, Fisher, Stone, and Burdick (2013), and others.

A final computer-related exhortation is that more attention should be paid to the many ways that computer delivery can aid in removing barriers for examinees with special needs, and to the special challenge of ensuring comparability of inferences when (ostensibly) equivalent items are administered in different delivery modes.

Earlier in this review, I commented positively on the authors’ decision to include a chapter on developing survey items. In conjunction with expanded treatment of computer-based formats, it would seem desirable for the authors to increase coverage of computer use in both developing and administering those items. Many researchers, test developers, graduate students, and others develop and administer surveys using tools such as Survey Monkey, Qualtrics, and Google Docs. The range of presentation and formatting options permitted by these providers is impressive and affords the survey item writer many options that go beyond the familiar 1-5, Strongly Disagree to Strongly Agree format.

Finally, I had just a few quibbles with the treatment of some topics. I stress that these are minor disagreements or differences in perspective, but ones that I hope readers will consider to be helpful. For one, I would add cognitive level to the list of item attributes that should be captured in an item bank. To its credit, the book has an entire chapter on cognitive item demands; this area is surely increasing in importance for item developers.  

For another, I would avoid advice framed as absolutes: For example, regarding complex multiple-choice items, the authors state “This format should not be used” (p. 72). Regarding the size of item pools, the authors “recommend that item pools exceed the length of a test form by at least a factor of 2.5” (p. 132). In my experience, test purpose, contexts, familiarity, resources, and a host of other factors conspire to make it inadvisable to rule out formats in an absolute sense, and even item pool depth depends on factors such as the number of new forms needed in each testing cycle, the extent of item pilot-testing that is possible, the extent of security threats, the proportional representation of various content subareas within the pool, and many other factors.

Finally, as a student of the late Robert Ebel at Michigan State University, I would fail to honor that legacy if I didn’t encourage the authors to elaborate on the use of the alternate-choice (AC) item format. The examples in the book primarily illustrate this format as essentially a two-option multiple-choice item. I believe that Ebel (1981, 1982), who is credited with proposing the AC format, saw its value primarily as a refinement of the true/false format in which embedded directional opposites are used as answer choices in a single, focused statement. Typically, terms such as more/less, increase/decrease, and improve/degrade are used to test examinees’ knowledge of relationships. Two simple examples testing knowledge about Gay-Lussac’s law and word processing might be:

1) The pressure of a gas of fixed mass and fixed volume (a) increases (b) decreases when the temperature of the gas increases.

2) When the font size of a document is increased, (a) more (b) fewer characters can be included in a line of text.

In conclusion, the authors of Developing and Validating Test Items attempted a difficult task: creating a go-to resource of definitive guidance for producing test items that can be used confidently in the service of measuring complex characteristics in the increasingly consequential contexts of education, licensure, and certification, and in the rapidly changing world of assessment technology. I believe they succeeded remarkably. Although there are some portions of the book that seem already ripe for new material to be added, I believe that readers will be well served by the comprehensive, evidence-based, and user-friendly information contained in the current volume.


American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31-43.

Ebel, R. L. (1981, April). Some advantages of alternate-choice items. Paper presented at the annual meeting of the National Council on Measurement in Education, Los Angeles, CA.

Ebel, R. L. (1982). Proposed solutions to two problems of test construction. Journal of Educational Measurement, 19(4), 267-278.

Foster, D., & Miller, H. L. (2009). A new format for multiple-choice testing: Discrete-option multiple choice. Results from early studies. Psychology Science Quarterly, 51(4), 355-369.

Haladyna, T. M. (2004). Developing and validating multiple-choice items (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

Shermis, M. D., & Burstein, J. (Eds.) (2003). Automated essay scoring: A cross-disciplinary perspective. Hillsdale, NJ: Lawrence Erlbaum.

Shermis, M. D., & Burstein, J. (Eds.) (2013). Handbook of automated essay evaluation: Current applications and new directions. New York: Routledge.

Stenner, A. J., Fisher, W. P., Stone, M. H., & Burdick, D. S. (2013). Causal Rasch models. Frontiers in Psychology: Quantitative Psychology and Measurement, doi:10.3389/fpsyg.2013.00536

Cite This Article as: Teachers College Record, Date Published: December 19, 2013. https://www.tcrecord.org, ID Number: 17369.


About the Author
  • Gregory Cizek
    University of North Carolina, Chapel Hill
GREGORY J. CIZEK is Professor of Educational Measurement and Evaluation at the University of North Carolina-Chapel Hill, where he teaches courses in psychometrics, assessment, statistics, research methods, and program evaluation. His scholarly interests include standard setting, validity, test security, and testing policy. He is a contributor to the Handbook of Classroom Assessment (1998) and Handbook of Test Development (2006); editor of the Handbook of Educational Policy (1999) and Setting Performance Standards (2001, 2012); co-editor of the Handbook of Formative Assessment (2010, with H. Andrade); and author of Filling in the Blanks (1999), Cheating on Tests: How to Do It, Detect It, and Prevent It (1999), Detecting and Preventing Classroom Cheating (2003), Addressing Test Anxiety in a High-Stakes Environment (with S. Burg, 2005), and Standard Setting: A Practitioner’s Guide (with M. Bunch, 2007). He provides expert consultation at the state and national level on testing programs and policy, including service as a member of the National Assessment Governing Board, which oversees the National Assessment of Educational Progress (NAEP). He has worked in leadership positions in the American Educational Research Association (AERA) and is past President of the National Council on Measurement in Education (NCME). Dr. Cizek has managed national licensure and certification testing programs and worked on test development for a statewide testing program. He began his career as an elementary school teacher and has served as an elected member of a local board of education.