Assessments :: Put Your Tests to the Test
Formative assessments are only as good as their item bank. Here’s how to make sure you’re asking the right questions.
Which of the following is one of the four types of quadralaterals?
A) A parallelogram
B) An isosceles
C) Scalene
D) The trapezoid
Standard: Classify three-dimensional figures by their
THERE’S A PROBLEM with this question. Actually, there are six problems with it (see “Answers”), and each is characteristic of bad question writing, or, in the vernacular of formative assessment, bad “item” writing.
Item quality has become an important consideration as formative assessments, which measure student progress toward a particular goal and guide further instruction, have gained new prominence with the advancements in computer technology. Vast repositories of test materials are now available, along with the tools to assemble and deliver them, to gather and analyze the resulting data, and to use this information strategically in the classroom.
But with this emphasis on formative testing comes the need for educators to ensure that the assessment materials, often furnished as large collections called item banks, provide accurate, useful information about students’ specific strengths and needs, and that the assessments are appropriately paced to the curriculum that students have covered. Technology can contribute to the quality of these items but is no substitute for human expertise and critical scrutiny. It may be time for you to put your items to the test. Here, then, are some questions to start with.
1) Do the items assess students’ relevant skills and knowledge?
This, of course, assumes that the relevant skills and knowledge for the area of study (reading, writing, mathematics, etc.) have been clearly defined. If the definitions are the state standards, then in order to align the formative assessment process with these standards, the contents of the item banks must fulfill two criteria. First, an item bank as a whole should provide reasonably thorough coverage of the standards. Second, each item should have a precise and meaningful link to the related standard, thereby allowing the student to demonstrate the extent to which he or she has mastered that particular skill or point of knowledge.
For example, items measuring a student’s ability to analyze written texts should require higher-order thinking skills, such as drawing inferences and making comparisons, rather than simple comprehension of content. Thus, the item-writing process should explicitly focus on matching the curriculum standards.
2) Do the documented procedures include requirements for ensuring the quality and integrity of the items?
Writing items is in itself a skill that takes time and practice to acquire. The qualifications of the writers and reviewers should include mastery of content, relevant teaching experience, and test-development expertise. Their knowledge is crucial to guaranteeing that the items do indeed match the standards and provide sound instructional guidance. In addition, the writing and reviewing procedures should observe the basic principles of good item writing.
Both multiple-choice items and constructed-response tasks should:
- Measure only the targeted skills or knowledge, not other skills or knowledge outside the item’s intended purpose
- Contain specific, understandable, and unambiguous ideas and language
- Use age-appropriate vocabulary and sentence structure
- Be factually accurate
- Be technically correct, demonstrating impeccable grammar, usage, and mechanics
- Be interesting, engaging, and varied
- Provide effective instructional guidance or feedback
In addition, multiple-choice items should:
- Contain answer choices that are roughly identical in length, parallel in structure, and equally forthright or abstract
- Have only one clearly correct answer, if the item is structured to have a single answer
- Provide plausible “distracters” (incorrect choices)
- Link instructional guidance and feedback to the item and, if appropriate, each answer choice
Constructed-response tasks should also:
- Be readily scorable with rubrics defining the characteristics of answers at each score level
- Be accompanied by a set of predictable responses with rationales (for right and wrong answers) or annotated sample responses for each score level
- If computer-scored, use a scoring engine backed by peer-reviewed research
The writing and reviewing procedures should also ensure that items are fair and free of bias, stereotyping, or inflammatory or upsetting content. Visual formats as well as language should be clear and accessible to all students, including those with disabilities. Pools of items should achieve balance and inclusiveness in gender and ethnicity.
3) Do the items represent the desired levels of difficulty and cognitive skill?
Often it is desirable to have items at varying cognitive levels even for a single standard, so that different levels of progress toward meeting the standard can be ascertained. Some items should call upon more basic cognitive skills, such as factual knowledge, and others upon higher-level skills, such as analysis or evaluation. Providing a rich variety contributes to the diagnostic usefulness of the items. Difficulty is a separate characteristic. With large-scale assessments, it is usually evaluated statistically in terms of the percentage of test takers who answer the item correctly, and the correlation between performance on a particular test item and on the overall test. If the items for the formative assessment lack these statistics, then expert judgment together with other tools (such as automated computations of readability levels for reading passages) can provide preliminary difficulty estimates.
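The two statistics mentioned above can be computed directly from a matrix of scored responses. The following is a minimal sketch, with illustrative function names and made-up data rather than output from any particular assessment platform: difficulty as the proportion of test takers answering an item correctly, and discrimination as the point-biserial correlation between the item score and the rest-of-test total.

```python
from statistics import mean, pstdev

# Scored responses: one row per student, one 0/1 column per item.
# (Illustrative data, not from a real assessment.)
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]

def item_difficulty(responses, i):
    """Proportion of test takers who answered item i correctly (the 'p-value')."""
    return sum(row[i] for row in responses) / len(responses)

def item_discrimination(responses, i):
    """Point-biserial correlation between item i and the rest-of-test total.

    A low or negative value flags an item that fails to separate
    stronger test takers from weaker ones.
    """
    scores = [row[i] for row in responses]
    # Exclude the item from the total so it is not correlated with itself.
    rest = [sum(row) - row[i] for row in responses]
    mx, my = mean(scores), mean(rest)
    cov = mean((x - mx) * (y - my) for x, y in zip(scores, rest))
    return cov / (pstdev(scores) * pstdev(rest))
```

With this data, the first item has a difficulty of 0.6 (three of five students answered it correctly) and a positive discrimination, consistent with an item that works as intended.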
4) Is statistical data being used to improve the quality of the items and the formative assessments?
Here, technological capabilities become essential. Computing and analyzing data from the items and formative assessments can help refine the test materials and make the information that they provide meaningful. Once the assessments are administered and scored, software tools can collect information on item difficulty and other psychometric characteristics, just as is done in field-testing items for high-stakes tests. The data can help to identify items that do not work as intended and eliminate them from the item bank.
In addition, software tools can compare students’ performance on standards-based formative assessments with their performance on the standards-based statewide tests. Agreement between the two can help confirm the validity of the formative assessments; disagreement can help show where the assessments or items need improvement.
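One simple way to quantify that agreement, assuming both tests report a scaled score per student and share a set of proficiency cut scores, is the fraction of students whom both tests place in the same proficiency band. The cut scores and student scores below are purely illustrative.

```python
def band(score, cut_scores):
    """Index of the proficiency band that a scaled score falls into."""
    return sum(score >= cut for cut in cut_scores)

def classification_agreement(formative, statewide, cut_scores):
    """Fraction of students placed in the same band by both tests."""
    same = sum(band(f, cut_scores) == band(s, cut_scores)
               for f, s in zip(formative, statewide))
    return same / len(formative)

# Illustrative scaled scores for four students, with cut scores at 60 and 80.
cut_scores = (60, 80)
formative = [55, 72, 90, 65]
statewide = [58, 81, 88, 62]
```

Low agreement does not by itself say which test is at fault, but it points to the items or standards most worth reviewing.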
Good item quality, then, combines many different types of efforts and has to account for many different considerations. It is a necessity, not a luxury, if formative assessments are to be of any help to teachers and students alike. Both the items and the follow-up instruction must have value if meaningful academic guidance, and the benefits it brings to students, are to result.
- The question does not match the standard.
- “Quadrilateral” is misspelled.
- The question is inaccurate; there are more than four types of quadrilaterals.
- The answer choices are not parallel.
- Two answer choices (A and D) are correct, even though the question asks about “one” type of figure.
- Although “parallelogram” and “trapezoid” are nouns, “isosceles” and “scalene” are adjectives that modify the noun “triangle.”
Nora V. Odendahl, an assessment specialist at Educational Testing Service, has worked as a test developer for 14 years, with a primary focus on assessments of writing and reading skills.
This article originally appeared in the 01/01/2007 issue of THE Journal.