7 Test Construction

  • Test development
    An umbrella term for all that goes into the process of creating a test
  • Test development process
    1. Test conceptualization
    2. Test construction
    3. Test tryout
    4. Analysis
    5. Revision
  • Test conceptualization
    • An idea for a test is conceived
  • Test construction
    1. Writing test items (or re-writing/revising existing items)
    2. Formatting items
    3. Setting scoring rules
    4. Designing and building the test
  • Scaling
    The process of setting rules for assigning numbers in measurement; process by which a measuring device is designed and calibrated and by which numbers (or other indices) – scale values – are assigned to different amounts of the trait, attribute or characteristic being measured
  • Types of scales
    • Age-based scale
    • Grade-based scale
    • Stanine scale
    • Unidimensional versus multidimensional scale
    • Comparative versus categorical scale
  • Examples of scaling methods
    • Rating scale
    • Summative scale (e.g. Likert scale)
    • Unidimensional and multidimensional scaling
    • Method of paired comparisons
    • Comparative scaling
    • Categorical scaling
    • Guttman scale/Scalogram analysis
  • Writing test items
    • Determining the range of content to cover
    • Selecting the item formats to employ
    • Deciding how many items to write in total and for each content area
  • Item pool
    Reservoir from which items will or will not be drawn for the final version of the test
  • Types of item formats
    • Selected-response format (e.g. multiple-choice, matching, true-false)
    • Constructed-response format (e.g. completion, short-answer, essay)
  • Multiple-choice items
    • Stem
    • Correct alternative/option
    • Distractors/foils
  • General item development guidelines and checklists exist for multiple-choice, Likert-type, matching, true-false, short-answer, and essay items
  • Scoring items
    • Cumulative model (cumulative credit for a construct)
    • Class/category scoring (credit for placement in a particular class/category)
    • Ipsative scoring (comparing a testtaker's score on one scale to his or her score on another scale within the same test; see the sketch below)
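A minimal sketch contrasting the cumulative and ipsative models, assuming a toy two-scale test (the scale names, item keys, and responses are hypothetical):

```python
# Toy illustration of cumulative vs. ipsative scoring.
# Scale names and 0/1 item responses are hypothetical examples.
responses = {
    "extraversion": [1, 0, 1, 1, 1],   # 1 = keyed direction endorsed
    "agreeableness": [1, 0, 0, 1, 0],
}

# Cumulative model: credit simply accumulates across a scale's items.
cumulative = {scale: sum(items) for scale, items in responses.items()}
print(cumulative)  # {'extraversion': 4, 'agreeableness': 2}

# Ipsative scoring: each scale score is interpreted relative to the
# testtaker's other scale scores on the same test, not to other people.
total = sum(cumulative.values())
ipsative = {scale: score / total for scale, score in cumulative.items()}
print(ipsative)  # extraversion is relatively stronger than agreeableness
```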
  • Test tryout
    The test is tried out on people similar in critical respects to the intended testtakers, under conditions as identical as possible to those of the standardized test administration
    The more subjects in the tryout, the better; a common rule of thumb is no fewer than five subjects per test item
  • Item analysis
    Analysis of testtakers' performance on the test as a whole and on each item
    Statistical procedures are employed to assist in making judgments about which items are good, need revision, or should be discarded
  • Tools to analyze and select items
    • Item-difficulty index
    • Item-reliability index
    • Item-validity index
    • Item-discrimination index
  • Item-difficulty index
    Proportion of total testtakers who answered the item correctly; the larger the index, the easier the item
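A minimal sketch of the item-difficulty index, assuming items scored 0/1 (the response data are hypothetical):

```python
# Item-difficulty index: proportion of testtakers answering the item correctly.
# Hypothetical 0/1 scores for ten testtakers on one item.
item_scores = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]

p = sum(item_scores) / len(item_scores)
print(f"item difficulty p = {p:.2f}")  # 0.70 -- the larger p, the easier the item
```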
  • Optimal item difficulty
    Midpoint between the chance success proportion and 1.00, i.e. (chance + 1.00)/2
    For binary-choice (true-false) items: (.50 + 1.00)/2 = .75
    For four-option multiple-choice items: (.25 + 1.00)/2 = .625
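A quick sanity check of those midpoints (a minimal sketch; `optimal_difficulty` is a hypothetical helper name):

```python
def optimal_difficulty(n_options: int) -> float:
    """Midpoint between chance success (1/n_options) and 1.00."""
    chance = 1 / n_options
    return (chance + 1.0) / 2

print(optimal_difficulty(2))  # 0.75 for binary-choice (true-false) items
print(optimal_difficulty(4))  # 0.625 for four-option multiple-choice items
```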
  • Item-reliability index
    Indication of the internal consistency of a test; the higher the index, the greater the test's internal consistency
  • Item-validity index
    Indication of the degree to which a test is measuring what it purports to measure; the higher the index, the greater the test's criterion-related validity
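Both indices are conventionally computed as the item-score standard deviation multiplied by a correlation: with the total test score for the item-reliability index, and with an external criterion for the item-validity index. A minimal sketch under that definition; all data below are hypothetical:

```python
import statistics

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (
        sum((a - mx) ** 2 for a in x) ** 0.5
        * sum((b - my) ** 2 for b in y) ** 0.5
    )

item = [1, 0, 1, 1, 0, 1, 1, 0]                        # hypothetical 0/1 item scores
total = [38, 21, 34, 40, 25, 31, 36, 19]               # total test scores
criterion = [4.1, 2.0, 3.5, 4.4, 2.6, 3.2, 3.9, 1.8]   # external criterion measure

s_item = statistics.pstdev(item)
item_reliability_index = s_item * pearson_r(item, total)
item_validity_index = s_item * pearson_r(item, criterion)
print(item_reliability_index, item_validity_index)
```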
  • Item-discrimination index
    Indication of how adequately an item separates or discriminates between high scorers and low scorers on an entire test; the higher the value, the more adequately the item discriminates
  • A negative item-discrimination index is a red flag: it indicates that low scorers on the test as a whole answered the item correctly more often than high scorers did (see the sketch below)
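A minimal sketch of the discrimination index d = (U - L) / n, assuming the common extreme-group approach (upper and lower groups, often the top and bottom 27% of total scores); the group sizes and counts are hypothetical:

```python
# d = (U - L) / n, where U and L are the numbers of correct answers in the
# upper- and lower-scoring groups and n is the size of each group.
def discrimination_index(upper_correct: int, lower_correct: int, group_size: int) -> float:
    return (upper_correct - lower_correct) / group_size

print(discrimination_index(20, 5, 25))  # 0.60 -- item separates the groups well
print(discrimination_index(8, 16, 25))  # -0.32 -- negative d: low scorers
                                        # outperform high scorers; revise or discard
```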
  • Qualitative item analysis
    Nonstatistical procedures designed to explore how individual test items work; compares individual items to each other and to the test as a whole
  • Test revision
    Actions taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement
  • When are existing tests due for revision?
    • Stimulus materials look dated and current testtakers cannot relate to them
    • Verbal content contains dated vocabulary not readily understood by current testtakers
    • Words/expressions perceived as inappropriate or offensive due to changes in popular culture
    • Test norms are no longer adequate due to group membership changes or age-related shifts in abilities
    • Reliability, validity, or item effectiveness can be significantly improved
    • The theory on which the test was based has been improved significantly
  • Preliminary questions in TEST CONCEPTUALIZATION:
    • What is the test designed to measure?
    • What is the objective of the test?
    • Is there a need for this test?
    • Who will use this test?
    • Who will take the test?
    • What content will the test cover?
    • How will the test be administered?
    • What is the ideal format of the test?
    • Should more than one form of the test be developed?
    • What special training will be required of test users for administering or interpreting the test?
    • What types of responses will be required of testtakers?
    • Who benefits from an administration of this test?
    • Is there any potential for harm as the result of an administration of this test?
    • How will meaning be attributed to scores on this test?
  • Scaling methods
    • Assignment of numbers to responses so that a test score can be calculated
  • Rating scale
    – a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker.
  • Summative scale
    – summing ratings across all the items to obtain the final test score
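A minimal sketch of summative (Likert-type) scoring, assuming 5-point ratings and one hypothetical reverse-keyed item:

```python
# Summative (Likert-type) scoring: ratings across items are summed to a total.
# Hypothetical 5-point ratings (1 = strongly disagree ... 5 = strongly agree).
ratings = [4, 5, 2, 4, 3]
reverse_keyed = {2}  # index of a hypothetical reverse-worded item

score = sum(
    (6 - r) if i in reverse_keyed else r  # flip a 5-point rating: 1<->5, 2<->4
    for i, r in enumerate(ratings)
)
print(score)  # 20
```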
  • Method of paired comparisons
    – testtakers are presented with pairs of stimuli and asked to compare them, selecting one stimulus according to some rule
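A minimal sketch of scoring paired-comparison judgments by crediting each stimulus once per time it is selected (the stimuli and choices are hypothetical):

```python
from collections import Counter

# Each tuple is (stimulus 1, stimulus 2, the one the testtaker selected).
choices = [
    ("A", "B", "A"),
    ("A", "C", "C"),
    ("B", "C", "C"),
]

wins = Counter(selected for _, _, selected in choices)
n_pairs_per_stimulus = 2  # each of the three stimuli appears in two pairs here
scale_values = {s: wins[s] / n_pairs_per_stimulus for s in ("A", "B", "C")}
print(scale_values)  # {'A': 0.5, 'B': 0.0, 'C': 1.0}
```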
  • Comparative scaling
    – entails judgments of a stimulus in comparison with every other stimulus on the scale
  • Categorical scaling
    – stimuli are placed into one of two or more alternative categories that differ quantitatively with respect to some continuum
  • Guttman scale/Scalogram analysis
    – items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured
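A minimal sketch of checking whether a response pattern fits a Guttman scale, assuming items are ordered weakest to strongest and scored 0/1:

```python
# A pattern fits a Guttman scale if endorsing any item implies endorsing every
# weaker item (a cumulative, "triangular" scalogram pattern).
def is_guttman_pattern(responses):
    """responses: 0/1 endorsements ordered weakest -> strongest item."""
    seen_zero = False
    for r in responses:
        if r == 0:
            seen_zero = True
        elif seen_zero:  # endorsement after a non-endorsement breaks the scale
            return False
    return True

print(is_guttman_pattern([1, 1, 1, 0, 0]))  # True: perfect scalogram pattern
print(is_guttman_pattern([1, 0, 1, 0, 0]))  # False: scale error
```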
  • Matching item
    • Testtaker is presented with two columns: premises on the left and responses on the right
    • The task is to determine which response is best associated with which premise
  • Binary-choice item (true-false item)
    • Takes the form of a sentence that requires the testtaker to indicate whether the statement is or is not a fact
  • Types of constructed-response items
    • Completion item – requires the examinee to provide a word or phrase that completes a sentence
    • Short-answer item – requires a succinct response
    • Essay – requires the testtaker to respond to a question by writing a composition, typically one that demonstrates recall of facts, understanding, analysis, and/or interpretation