5 Reliability

Cards (50)

  • Reliability refers to the consistency of test scores obtained by the same persons when they are re-examined with the same test on different occasions, or with different sets of equivalent items, or under varying examining conditions.
  • The standard error of measurement can be used to estimate the extent to which an observed score deviates from the true score.
  • Confidence interval: a range or band of test scores that is likely to contain the true score.
  • The higher the reliability of the test, the lower the standard error.
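The three cards above can be tied together in a short sketch. The formula SEM = SD × √(1 − r) and the z-value of 1.96 for a 95% band are the standard classical-test-theory conventions; the function names and example values (an IQ-style scale with SD = 15) are illustrative:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r_xx)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed: float, sd: float, reliability: float,
                        z: float = 1.96) -> tuple[float, float]:
    """95% confidence band likely to contain the true score (z = 1.96)."""
    e = z * sem(sd, reliability)
    return observed - e, observed + e

# Higher reliability -> smaller standard error -> narrower band.
print(sem(15, 0.90))                      # SD = 15, r = .90
print(sem(15, 0.95))                      # more reliable test, smaller SEM
print(confidence_interval(110, 15, 0.90))
```

Note how raising the reliability from .90 to .95 shrinks the standard error, which is the fourth card's point in numeric form.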
  • Reliability coefficient is an index of reliability, a proportion that indicates the ratio between the true score variance on a test and the total variance.
  • Error refers to the component of the observed score that does not reflect the test taker's true ability or the trait being measured.
  • Observed score = true score plus error (X = T + E).
  • True score: the score that the test taker would have obtained if measurement were perfect, i.e., if we were able to measure without error.
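The X = T + E model and the definition of the reliability coefficient as true-score variance over total variance can be illustrated with a simulation. The distributions chosen here (true SD = 15, error SD = 5, giving a theoretical reliability of 225/250 = .90) are arbitrary example values:

```python
import random
import statistics

random.seed(0)

# Simulate the true-score model X = T + E: observed = true + random error.
true_scores = [random.gauss(100, 15) for _ in range(10_000)]
errors      = [random.gauss(0, 5)    for _ in range(10_000)]
observed    = [t + e for t, e in zip(true_scores, errors)]

var_true  = statistics.pvariance(true_scores)
var_total = statistics.pvariance(observed)

# Reliability coefficient: ratio of true-score variance to total variance.
# Theoretical value here: 15**2 / (15**2 + 5**2) = 0.90.
print(var_true / var_total)
```

Because the simulated error is random noise uncorrelated with the true scores, the sampled ratio lands close to the theoretical .90.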
  • Error Variance is the variance from random sources.
  • Measurement Error refers to the factors associated with the process of measuring variable, other than the variable being measured.
  • Random Error: a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process (i.e., noise).
  • Systematic Error: a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured.
  • Test Construction: variation may exist within items on a test or between tests (i.e., item sampling or content sampling).
  • Test Administration: sources of error may stem from the testing environment and test taker variables such as emotional problems, physical discomfort, lack of sleep, and the effects of drugs or medication.
  • Test Scoring and Interpretation: computer testing reduces error in test scoring, but many tests still require expert interpretation (e.g., projective tests).
  • Internal Consistency: a type of reliability that measures the consistency of responses across the items within a single administration of a test (not consistency over time).
  • Inter-rater Reliability: a type of reliability that measures the consistency of raters in their evaluation of a test taker.
  • Test-Retest reliability: an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
  • Parallel forms: for each form of the test, the means and the variances of observed test scores are equal.
  • Kuder-Richardson formula 20: statistic of choice for determining the inter-item consistency of dichotomous items.
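KR-20 has a closed form: (k / (k − 1)) × (1 − Σpq / σ²), where p is each item's pass rate, q = 1 − p, and σ² is the variance of total scores. A minimal implementation; the 5-examinee, 4-item dataset is made up for illustration:

```python
import statistics

def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) items.
    `responses` is a list of examinees, each a list of 0/1 item scores."""
    k = len(responses[0])                       # number of items
    n = len(responses)                          # number of examinees
    totals = [sum(person) for person in responses]
    var_total = statistics.pvariance(totals)
    # Sum of p*q across items: p = proportion passing, q = 1 - p.
    pq = 0.0
    for item in range(k):
        p = sum(person[item] for person in responses) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical 4-item test taken by 5 examinees (1 = correct, 0 = wrong).
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(data), 3))
```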
  • Coefficient of equivalence: the degree of the relationship between various forms of a test.
  • Coefficient of inter-scorer reliability: the degree to which the scores assigned by different raters correlate with one another.
  • Alternate forms: different versions of a test constructed to be similar but that do not meet the strict requirements of parallel forms; item content and difficulty are typically comparable between forms.
  • Coefficient alpha: developed by Cronbach to estimate the internal consistency of tests in which the items are not scored dichotomously as 0 or 1 (wrong or right); its value ranges from 0 to 1.
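Coefficient alpha generalizes KR-20 by replacing the Σpq term with the sum of the item variances: α = (k / (k − 1)) × (1 − Σσ²ᵢ / σ²). A sketch with made-up Likert-style (1–5) data:

```python
import statistics

def cronbach_alpha(responses):
    """Coefficient alpha for items on any scoring scale (not just 0/1).
    `responses` is a list of examinees, each a list of item scores."""
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    var_total = statistics.pvariance(totals)
    # Sum of the individual item variances.
    var_items = sum(
        statistics.pvariance([person[i] for person in responses])
        for i in range(k)
    )
    return (k / (k - 1)) * (1 - var_items / var_total)

# Hypothetical 3-item scale (1-5 responses) from 4 respondents.
data = [
    [4, 5, 4],
    [3, 3, 2],
    [5, 4, 5],
    [2, 2, 1],
]
print(round(cronbach_alpha(data), 3))
```

For strictly 0/1 data this function returns the same value as KR-20, which is why alpha is described as the more general statistic.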
  • Split-half reliability is obtained by correlating the pairs of scores from two equivalent halves of a single test administered once.
  • Spearman-Brown formula allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test.
  • Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research; in clinical settings high reliability is extremely important, and tests used for life-and-death decisions must be held to higher standards (i.e., reliability of .90 to .95).
  • Inter-item consistency: the degree of relatedness of items on a test; used to gauge the homogeneity of a test.
  • Inter-scorer reliability: the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure; often used with behavioral measures, it guards against biases or idiosyncrasies in scoring.
  • A homogeneous test measures a single trait; a heterogeneous test measures more than one trait.
  • The nature of the test will often determine the reliability metric.
  • Test-retest reliability is most appropriate for variables that should be stable over time (e.g., personality) and not appropriate for variables expected to change over time (e.g., mood/states).
  • Test-retest estimates tend to decrease as the interval between administrations lengthens.
  • With intervals over 6 months the estimate of test-retest reliability is called the coefficient of stability.
  • Test items can be homogeneous or heterogeneous in nature.
  • Whether or not the test is criterion-referenced also influences the choice of reliability estimate.
  • The true-score model favors longer tests: the longer the test, the higher the reliability coefficient.
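The "longer test, higher reliability" claim is quantified by the general Spearman-Brown prophecy formula, rₙ = n·r / (1 + (n − 1)·r), where n is the factor by which the test is lengthened. A small sketch (the starting reliability of .70 is an example value):

```python
def sb_prophecy(r: float, n: float) -> float:
    """Predicted reliability when a test is lengthened by a factor n
    (n < 1 means the test is shortened)."""
    return n * r / (1 + (n - 1) * r)

# Doubling a test with r = .70 raises the predicted coefficient,
# while halving it lowers the coefficient.
print(round(sb_prophecy(0.70, 2), 3))
print(round(sb_prophecy(0.70, 0.5), 3))
```

The prediction assumes the added items are comparable in quality to the existing ones, which is why lengthening a test in practice may fall short of the prophesied value.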
  • The true-score model is often referred to as Classical Test Theory (CTT) and is the most widely used model due to its simplicity.
  • The test may be a speed test or a power test: a power test allows a time limit long enough for test takers to attempt all items, though some items are so difficult that no one obtains a perfect score; a speed test contains items of a uniform level of difficulty with a strict time limit, such that if given generous time limits, all test takers should be able to complete all the test items correctly.
  • Domain sampling theory estimates reliability from how well a sample of items represents the larger domain; e.g., one's spelling ability is evaluated using a sample of words rather than every word in the dictionary.