Test development
An umbrella term for all that goes into the process of creating a test
Test development process
1. Test conceptualization
2. Test construction
3. Test tryout
4. Analysis
5. Revision
Test conceptualization
An idea for a test is conceived
Test construction
1. Writing test items (or re-writing/revising existing items)
2. Formatting items
3. Setting scoring rules
4. Designing and building the test
Scaling
The process of setting rules for assigning numbers in measurement; the process by which a measuring device is designed and calibrated and by which numbers (or other indices) – scale values – are assigned to different amounts of the trait, attribute, or characteristic being measured
Types of scales
Age-based scale
Grade-based scale
Stanine scale
Unidimensional versus multidimensional scale
Comparative versus categorical scale
Examples of Scaling methods
Rating scale
Summative scale (e.g. Likert scale)
Unidimensional and multidimensional scaling
Method of paired comparisons
Comparative scaling
Categorical scaling
Guttman scale/Scalogram analysis
Writing test items
Determining the range of content to cover
Selecting the item formats to employ
Deciding how many items to write in total and for each content area
Item pool
Reservoir from which items will or will not be drawn for the final version of the test
Types of Item formats
Selected-response format (e.g. multiple-choice, matching, true-false)
Constructed-response format (e.g. completion, short-answer, essay)
Multiple-choice items
Stem
Correct alternative/option
Distractors/foils
General item development guidelines and checklists exist for multiple-choice, Likert-type, matching, true-false, short-answer, and essay items
Scoring items
Cumulative model (cumulative credit for a construct)
Class/category scoring (credit for placement in a particular class/category)
Ipsative scoring (comparing a testtaker's score on one scale to another scale within the same test)
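A minimal Python sketch (illustrative only; the scale names and responses are invented) contrasting the cumulative and ipsative models on a hypothetical two-scale inventory:

# Hypothetical item responses keyed 1 (endorsed) / 0 (not endorsed),
# grouped by the scale each item belongs to.
responses = {
    "artistic":   [1, 1, 0, 1, 1],
    "scientific": [1, 0, 0, 1, 0],
}

# Cumulative model: the more items endorsed in the keyed direction,
# the higher the raw score on that construct.
raw = {scale: sum(items) for scale, items in responses.items()}
print(raw)  # {'artistic': 4, 'scientific': 2}

# Ipsative scoring (one illustrative representation): interpret each
# scale relative to the testtaker's other scales, not relative to
# other people -- here, as a share of this person's total endorsements.
total = sum(raw.values())
ipsative = {scale: score / total for scale, score in raw.items()}
print(ipsative)  # 'artistic' dominates within this person's own profile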
Test tryout
The test is tried out on people similar to the intended testtakers, under conditions as identical as possible to those under which the standardized test will be administered
The larger the tryout sample, the better; a common rule of thumb is no fewer than five subjects per test item
Item analysis
Analysis of testtakers' performance on the test as a whole and on each item
Statistical procedures are employed to assist in making judgments about which items are good, need revision, or should be discarded
Tools to analyze and select items
Item-difficulty index
Item-reliability index
Item-validity index
Item-discrimination index
Item-difficulty index
Proportion of total testtakers who answered the item correctly; the larger the index, the easier the item
Optimal item difficulty
The midpoint between a perfect proportion correct (1.00) and the chance success proportion: optimal p = (1.00 + chance) / 2
For binary-choice items (chance = .50): .75
For four-option multiple-choice items (chance = .25): .625
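A short Python sketch of both ideas, using made-up item scores; optimal difficulty here is the midpoint between a perfect score (1.00) and chance success:

def item_difficulty(item_scores):
    # Proportion of testtakers answering the item correctly (1 = correct).
    return sum(item_scores) / len(item_scores)

def optimal_difficulty(n_options):
    # Midpoint between 1.00 and the chance success proportion (1/n_options).
    chance = 1.0 / n_options
    return (1.0 + chance) / 2

print(item_difficulty([1, 1, 0, 1, 0, 1, 1, 0]))  # 0.625
print(optimal_difficulty(2))  # 0.75  (binary choice / true-false)
print(optimal_difficulty(4))  # 0.625 (four-option multiple choice)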
Item-reliability index
Indication of the internal consistency of a test; the higher the index, the greater the test's internal consistency
Item-validity index
Indication of the degree to which a test is measuring what it purports to measure; the higher the index, the greater the test's criterion-related validity
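A Python sketch of one common formulation of these two indices (item-score standard deviation multiplied by the item-total or item-criterion correlation); the scores below are invented for illustration:

import statistics

def pearson_r(x, y):
    # Pearson correlation between two equal-length score lists.
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

item      = [1, 0, 1, 1, 0, 1]   # scores on one item (1 = correct)
total     = [9, 4, 8, 7, 5, 9]   # total test scores
criterion = [6, 3, 7, 6, 4, 8]   # scores on an external criterion

s_i = statistics.pstdev(item)    # item-score standard deviation
print(s_i * pearson_r(item, total))      # item-reliability index
print(s_i * pearson_r(item, criterion))  # item-validity index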
Item-discrimination index
Indication of how adequately an item separates or discriminates between high scorers and low scorers on an entire test; the higher the value, the more adequately the item discriminates
A negative item-discrimination index is a red flag: it means low scorers on the test as a whole answered the item correctly more often than high scorers (e.g., the item may be miskeyed or misleading)
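A small Python sketch of the classic extreme-group method for computing the discrimination index d, assuming equal-sized upper and lower groups (often the top and bottom 27% of total scorers); the data are made up:

def discrimination_index(upper_item_scores, lower_item_scores):
    # d = (U - L) / n, where U and L are the numbers of high and low
    # scorers who answered the item correctly and n is one group's size.
    # d ranges from -1 to +1; negative d means low scorers outperformed
    # high scorers on this item.
    n = len(upper_item_scores)  # assumes equal-sized groups
    return (sum(upper_item_scores) - sum(lower_item_scores)) / n

upper = [1, 1, 1, 0, 1]  # item scores of the 5 highest total scorers
lower = [0, 1, 0, 0, 0]  # item scores of the 5 lowest total scorers
print(discrimination_index(upper, lower))  # 0.6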
Qualitative item analysis
Nonstatistical procedures designed to explore how individual test items work; compares individual items to each other and to the test as a whole
Test revision
Actions taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement
When are existing tests due for revision?
Stimulus materials look dated and current testtakers cannot relate to them
Verbal content contains dated vocabulary not readily understood by current testtakers
Words/expressions perceived as inappropriate or offensive due to changes in popular culture
Test norms are no longer adequate due to group membership changes or age-related shifts in abilities
Reliability, validity, or item effectiveness can be significantly improved
The theory on which the test was based has been improved significantly
Preliminary questions in TEST CONCEPTUALIZATION:
What is the test designed to measure?
What is the objective of the test?
Is there a need for this test?
Who will use this test?
Who will take the test?
What content will the test cover?
How will the test be administered?
What is the ideal format of the test?
Should more than one form of the test be developed?
What special training will be required of the test users for administering or interpreting the test?
What types of responses will be required of testtakers?
Who benefits from an administration of this test?
Is there any potential for harm as the result of an administration of this test?
How will meaning be attributed to scores on this test?
Scaling methods
• Assignment of numbers to responses so that a test score can be calculated
Rating scale
– a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
Summative scale
– summing ratings across all the items to obtain the final test score
Method of paired comparisons
– testtakers are presented with pairs of stimuli which they are asked to compare; they must select one stimulus from each pair according to some rule (e.g., the one they agree with more)
Comparative scaling
– entails judgments of a stimulus in comparison with every other stimulus on the scale
Categorical scaling
– stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
Guttman scale/Scalogram analysis
– items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
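To make two of these methods concrete, here is a brief Python sketch (with invented responses) of summative Likert-type scoring and a simple check for a perfect Guttman response pattern:

# Summative (Likert-type) scale: ratings on a 1-5 agreement continuum
# are summed across items to yield the final score.
likert_ratings = [4, 5, 3, 4, 2]
print(sum(likert_ratings))  # 18

def fits_guttman_pattern(endorsements):
    # Guttman scale: items ordered weakest -> strongest (1 = endorsed).
    # A pattern fits the scalogram model if, once a testtaker rejects
    # an item, all stronger items are rejected as well.
    rejected_earlier = False
    for e in endorsements:
        if e == 0:
            rejected_earlier = True
        elif rejected_earlier:  # endorsed a stronger item after a rejection
            return False
    return True

print(fits_guttman_pattern([1, 1, 1, 0, 0]))  # True: perfect scale type
print(fits_guttman_pattern([1, 0, 1, 0, 0]))  # False: contains a reversal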
Matching item
• Testtaker is presented with two columns: premises on the left and responses on the right
• The task is to determine which response is best associated with which premise
Binary-choice item (true-false item)
• Takes the form of a sentence that requires the testtaker to indicate whether the statement is or is not a fact
*TYPES OF CONSTRUCTED-RESPONSE ITEMS
Completion item
– requires the examinee to provide a word or phrase that completes a sentence
Short-answer item
– requires a succinct response
Essay
– requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation