Title: Testing 05
1Testing 05
2Errors & Reliability
- Errors in the test cause unreliability.
- The fewer the errors, the more reliable the test.
- Sources of errors
- Obvious: poor health, fatigue, lack of interest
- Less obvious: facets discussed in Fig. 5.3
3Reliability & Validity
- Reliability is a necessary condition for
validity.
- Reliability and validity are complementary aspects
of the measurement.
- Reliability: how much of the performance is due
to measurement errors, or to factors other than
the language ability we want to measure.
- Validity: how much of the performance is due to
the language ability we want to measure.
4Reliability Measurement
- Reliability measurement includes logical
analysis and empirical research, i.e. identifying
sources of error and estimating the magnitude of
their effects on the scores.
5Logical Analysis
- Example of identifying a source of error
- Topic in an oral interview: business negotiation
- A source of error if we want to measure the test
takers' ability on general topics.
- An indicator of the ability if we want to measure
the test takers' ability in business English.
6Empirical Research
- Procedures are usually complex.
- Three kinds of theories
- Classical true score theory (CTS)
- Generalizability theory (G-Theory)
- Item Response Theory (IRT)
7Factors on Test Scores
- Characteristics of factors
- general vs. specific
- lasting vs. temporary
- systematic vs. unsystematic
8Factors that affect language test scores
9Variance & Standard Deviation
- $s$: standard deviation of the sample
- $\sigma$: standard deviation of the population
- $s^2$: variance of the sample
- $\sigma^2$: variance of the population
- $s = \sqrt{\sum (X - \bar{X})^2 / (n - 1)}$
- where
- $X$: individual score
- $\bar{X}$: mean score
- $n$: number of students
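A minimal sketch of the sample standard deviation defined on this slide, written out step by step. The function name `sample_std` and the scores are made up for illustration.

```python
import math

def sample_std(scores):
    """Sample standard deviation: sqrt(sum((X - mean)^2) / (n - 1))."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    sum_sq = sum((x - mean) ** 2 for x in scores)  # sum of squared deviations
    return math.sqrt(sum_sq / (n - 1))

# Made-up scores, for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
print(round(sample_std(scores), 3))
```

Dividing by n instead of n - 1 would give the population form of the same quantity.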
10Correlation Coefficient
- Covariance (COV): how two variables, X and Y, vary
together.
- $\mathrm{COV}(X,Y) = \frac{1}{n-1} \sum (X_i - \bar{X})(Y_i - \bar{Y})$
- Correlation coefficient (Pearson product-moment
correlation coefficient)
- $r(x,y) = \mathrm{COV}(x,y) / (s_x s_y)$
- $r(x,y) = \frac{1}{n-1} \sum (X_i - \bar{X})(Y_i - \bar{Y}) / (s_x s_y)$
11Correlation Coefficient
- where
- $n$: number of paired scores (one pair per test taker)
- $X_i$: individual score on the first half
- $\bar{X}$: mean of the scores on the first half
- $Y_i$: individual score on the second half
- $\bar{Y}$: mean of the scores on the second half
- $s_x$: standard deviation of the first half
- $s_y$: standard deviation of the second half
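The covariance and correlation formulas can be sketched directly in code. This is only an illustration: `pearson_r` is a hypothetical helper name and the half-test scores are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = COV(x, y) / (s_x * s_y), all computed with n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

# Invented half-test scores for five test takers.
first_half = [1, 2, 3, 4, 5]
second_half = [2, 4, 6, 8, 10]
print(pearson_r(first_half, second_half))
```

Because the second list is an exact linear function of the first, the result here is 1 up to floating-point rounding.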
12Calculation of Correlation Coefficient
- Manually
- Manual + Excel
- Excel
13Classical True Score Theory
- Also referred to as the classical reliability
theory because its major task is to estimate the
reliability of the observed scores of a test.
That is, it attempts to estimate the strength of
the relationship between the observed score and
the true score.
- Sometimes referred to as the true score theory
because its theoretical derivations are based on
a mathematical model known as the true score
model.
15Assumptions in CTS
- Assumption 1: The observed score consists of the
true score and the error score, i.e. $x = x_t + x_e$
- Assumption 2: Error scores are unsystematic,
random and uncorrelated with the true score, i.e.
$s_x^2 = s_t^2 + s_e^2$
16Parallel Test
- Two tests are parallel if
- $\bar{x} = \bar{x}'$
- $s_x^2 = s_{x'}^2$
- $r_{xy} = r_{x'y}$
17Correlation Between Parallel Tests
- If the observed scores on two parallel tests are
highly correlated, the effects of the error
scores are minimal.
- Reliability is the correlation between the
observed scores of two parallel tests.
- The definition is the basis for all estimates of
reliability within CTS theory.
- Condition: the observed scores on the two tests
are experimentally independent.
18Error Score Estimation and Measurement
- Relations between reliability, true score and
error score
- The higher the proportion of the true score, the
higher the correlation of the two parallel tests.
(True scores are systematic)
- The higher the proportion of the error score, the
lower the correlation of the two parallel tests.
(Error scores are random)
19Error Score Estimation and Measurement
- $r_{xx'} = s_t^2 / s_x^2$
- $(s_t^2 + s_e^2) / s_x^2 = 1$
- $s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2$
- $s_t^2 / s_x^2 = r_{xx'}$
- $s_e^2 / s_x^2 = 1 - r_{xx'}$
- $s_e^2 = s_x^2 (1 - r_{xx'})$
20Approaches to Estimate Reliability
- Three approaches based on different sources of
errors.
- Internal consistency: sources of error from
within the test and scoring procedure
- Stability: how consistent test scores are over
time.
- Equivalence: whether scores on alternative forms
of a test are equivalent.
21Internal Consistency
- Dichotomous
- Split-half reliability estimates
- The Spearman-Brown split-half estimate
- The Guttman split-half estimate
- Kuder-Richardson reliability coefficients
- Non-dichotomous
- Coefficient alpha
- Rater consistency
22Split-half Reliability Estimates
- Split the test into two halves which have equal
means and variances (equivalence) and are
independent of each other (independence).
- 1. divide the test into the first and second
halves.
- 2. random halves
- 3. odd-even method
23Spearman-Brown Reliability Estimate
- $r_{xx'} = 2 r_{hh'} / (1 + r_{hh'})$
- where
- $r_{hh'}$: correlation between the two halves of the
test
- Procedure
- 1. Divide the test into two equal halves
- 2. Calculate the correlation coefficient between
the two halves
- 3. Calculate the Spearman-Brown reliability
estimate
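The three-step procedure can be sketched as follows, using the odd-even method for the split. All names (`odd_even_halves`, `pearson_r`, `spearman_brown`) and the 0/1 response data are assumptions made for this illustration.

```python
import math

def odd_even_halves(responses):
    """Step 1: split each test taker's item scores into odd and even items."""
    odd = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    return odd, even

def pearson_r(xs, ys):
    """Step 2: correlation between the two half scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)  # the (n - 1) factors cancel

def spearman_brown(r_hh):
    """Step 3: step the half-test correlation up to full test length."""
    return 2 * r_hh / (1 + r_hh)

# Invented data: 4 test takers (rows) x 4 dichotomous items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
odd, even = odd_even_halves(responses)
reliability = spearman_brown(pearson_r(odd, even))
print(reliability)
```

A first-half/second-half or random split would only change step 1; steps 2 and 3 stay the same.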
24Guttman Split-Half Estimate
- $r_{xx'} = 2\left(1 - \frac{s_{h1}^2 + s_{h2}^2}{s_x^2}\right)$
- where
- $s_{h1}^2$: variance of the first half
- $s_{h2}^2$: variance of the second half
- $s_x^2$: variance of the total scores
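A short sketch of the Guttman estimate, assuming the two half scores are already available per test taker. The function name and the data are illustrative only; `statistics.variance` is the sample (n - 1) variance from the Python standard library.

```python
from statistics import variance

def guttman_split_half(half1, half2):
    """Guttman estimate: 2 * (1 - (var(h1) + var(h2)) / var(total))."""
    totals = [a + b for a, b in zip(half1, half2)]
    return 2 * (1 - (variance(half1) + variance(half2)) / variance(totals))

# Invented half scores for four test takers.
half1 = [2, 2, 1, 0]
half2 = [2, 1, 0, 0]
print(guttman_split_half(half1, half2))
```

Only variances are needed, so no correlation between the halves has to be computed. The result does not depend on whether n or n - 1 is used, as long as the same divisor is used for all three variances.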
25Kuder-Richardson Formula 20
- $r_{xx'} = \frac{k}{k-1}\left(1 - \frac{\sum pq}{s_x^2}\right)$
- where
- $k$: number of items on the test
- $p$: proportion of correct answers, i.e.
correct answers/total answers (difficulty)
- $q$: proportion of incorrect answers, i.e. $1 - p$
- $s_x^2$: total test score variance
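A minimal sketch of KR-20 on a small 0/1 response matrix. One assumption to flag: the total score variance is computed here with `statistics.pvariance` (divide by n) so that it is on the same footing as the item variances pq; the slides do not state which divisor to use, and `statistics.variance` could be swapped in to follow the n - 1 definition given earlier. The function name and data are invented.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 for rows = test takers, columns = dichotomous (0/1) items."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # item difficulty
        sum_pq += p * (1 - p)                     # p * q for this item
    return (k / (k - 1)) * (1 - sum_pq / pvariance(totals))

# Invented data: 4 test takers x 3 items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(responses))
```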
26Kuder-Richardson Formula 21
- $r_{xx'} = \frac{k s_x^2 - \bar{x}(k - \bar{x})}{(k-1) s_x^2}$
- where
- $k$: number of items on the test
- $s_x^2$: total test score variance
- $\bar{x}$: mean score
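KR-21 needs only three summary numbers, which a short sketch makes concrete. The function name `kr21` and the input values are made up for illustration.

```python
def kr21(k, mean, variance):
    """KR-21: (k * s^2 - mean * (k - mean)) / ((k - 1) * s^2)."""
    return (k * variance - mean * (k - mean)) / ((k - 1) * variance)

# Illustrative values: a 3-item test with mean 1.5 and variance 1.25.
print(kr21(k=3, mean=1.5, variance=1.25))
```

Since no item-level data enter the formula, KR-21 is quick to compute, at the cost of treating all items as equally difficult.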
27Coefficient alpha
- $\alpha = \frac{k}{k-1}\left(1 - \frac{\sum s_i^2}{s_x^2}\right)$
- where
- $k$: number of items on the test
- $\sum s_i^2$: sum of the variances of the different
parts of the test
- $s_x^2$: variance of the test scores
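Coefficient alpha can be sketched for any number of parts, assuming each part's scores are listed in the same test-taker order. The function name and the data are illustrative only.

```python
from statistics import variance

def coefficient_alpha(parts):
    """parts: one list of scores per test part, aligned by test taker."""
    k = len(parts)
    totals = [sum(scores) for scores in zip(*parts)]  # total score per person
    sum_part_variances = sum(variance(part) for part in parts)
    return (k / (k - 1)) * (1 - sum_part_variances / variance(totals))

# Invented data: three parts (here single 0/1 items), four test takers.
parts = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(coefficient_alpha(parts))
```

The parts do not have to be dichotomous; they can be sections or ratings scored on any scale.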
28Comparison of Estimates Assumptions
29Summary Estimate Procedure
- Spearman-Brown
- 1. split
- 2. variances of each half
- 3. correlation coefficient between the two halves
- 4. reliability coefficient
30Summary Estimate Procedure
- Guttman
- 1. split
- 2. variances of each half
- 3. variance of the whole test
- 4. reliability coefficient
31Summary Estimate Procedure
- K-R 20
- 1. number of questions
- 2. proportion of correct answers of each question
- 3. proportion of incorrect answers of each
question
- 4. sum of the product of p and q
- 5. variance of the whole test
- 6. reliability coefficient
32Summary Estimate Procedure
- K-R 21
- 1. number of questions
- 2. mean of the test
- 3. variance of the test
- 4. reliability coefficient
33Summary Estimate Procedure
- Coefficient alpha
- 1. number of the parts of the test
- 2. mean of each part
- 3. variance of each part
- 4. sum of variances of all parts
- 5. mean of the test
- 6. variance of the test
- 7. reliability coefficient
34Rater Consistency
35Intra-rater Reliability
- Rate each paper twice. Condition: the two ratings
must be independent of each other.
- Two ways of estimating
- Spearman-Brown: take each rating as a split half
and compute the reliability coefficient.
36Intra-rater Reliability
- Condition: the two ratings must have similar
means and variances to ensure the equivalence of
the two ratings
- Coefficient alpha: take the two ratings as two parts
of a test.
- $\alpha = \frac{k}{k-1}\left(1 - \frac{s_{x1}^2 + s_{x2}^2}{s_{x1+x2}^2}\right)$
37Intra-rater Reliability
- where
- $k$: number of ratings
- $s_{x1}^2$: variance of the first rating
- $s_{x2}^2$: variance of the second rating
- $s_{x1+x2}^2$: variance of the summed ratings
- Since $k = 2$, $k/(k-1) = 2$ and the formula reduces
to the Guttman reliability coefficient formula.
38Inter-rater Reliability
- If there are only two raters, use split-half
estimates to obtain the reliability coefficient.
- Or use the grade (rank-order) correlation coefficient
- $r_{xx'} = 1 - \frac{6 \sum D^2}{n(n^2 - 1)}$
- where
- $D$: difference between the grades (ranks) given
by the two ratings
39Inter-rater Reliability
- $n$: number of test takers
- See testing 05-2 sheet 5 for an example
- Note: test takers with the same score should share
the same grade (rank).
- If there are more than two raters, use the
coefficient alpha estimate
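The two-rater coefficient 1 - 6ΣD²/(n(n² - 1)) has the form of Spearman's rank-order correlation, so the sketch below assumes that "grade" means rank and that tied scores share the average of the ranks they occupy. Both helper names and the checks are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 (smallest); ties share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def rank_correlation(rater_a, rater_b):
    """1 - 6 * sum(D^2) / (n * (n^2 - 1)), with D the rank difference."""
    n = len(rater_a)
    ranks_a = average_ranks(rater_a)
    ranks_b = average_ranks(rater_b)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(average_ranks([10, 20, 20, 30]))
print(rank_correlation([1, 2, 3, 4], [4, 3, 2, 1]))
```

Ranking from the smallest or from the largest score gives the same coefficient, provided both raters are ranked in the same direction.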
40Stability (test-retest reliability)
- Administer the test twice to a group of
individuals and compute the correlation between
the two sets of scores. The correlation can then
be interpreted as an indicator of how stable the
scores are over time.
- Learning effects and practice effects must be
taken into account.
41Equivalence (parallel forms reliability)
- Use alternative forms of a given test. Compute
and compare the means and standard deviations
for each of the two forms to determine their
equivalence. The correlation between the two sets
of scores can be interpreted as an indicator of the
equivalence of the two tests or an estimate of
the reliability of either one.
42GENERALIZABILITY THEORY
43GENERALIZABILITY THEORY
- Generalizability theory (G-theory) builds on the
framework of factorial design and the analysis of
variance. It constitutes a theory and set of
procedures for specifying and estimating the
relative effects of different factors on observed
test scores, and thus provides a means for relating
the uses or interpretations of test scores to the
way test users specify and interpret different
factors as either abilities or sources of error.
44GENERALIZABILITY THEORY
- G-theory treats a given measure or score as a
sample from a hypothetical universe of possible
measures, i.e. on the basis of an individual's
performance on a test we generalize to their
performance in other contexts.
- Reliability = generalizability
- The way we define a given universe of measures
will depend upon the universe of generalization.
45Application of G-theory
- Two stages
- G-study
- D-study
46G-study
- Consider the uses that will be made of the test
scores, and investigate the sources of variance
that are of concern or interest. On the basis of
this generalizability study, the test developer
obtains estimates of the relative sizes of the
different sources of variance ('variance
components').
47D-study
- When the results of the G-study are satisfactory,
the test developer administers the test under
operational conditions, and uses G-theory
procedures to estimate the magnitude of the
variance components. These estimates provide
information that can inform the interpretation
and use of the test scores.
48Significance of G-theory
- The application of G-Theory thus enables test
developers and test users to specify the
different sources of variance that are of concern
for a given test use, to estimate the relative
importance of these different sources
simultaneously, and to employ these estimates in
the interpretation and use of test scores.
49Universes Of Generalization And Universe Of Measures
- Universe of generalization: a domain of uses or
abilities (or both)
- Universe of possible measures: the types of test
scores we would be willing to accept as
indicators of the ability to be measured for the
purpose intended.
50Populations of Persons
- In addition to defining the universe of possible
measures, we must define the group, or population
of persons about whom we are going to make
decisions or inferences.
51Universe Score
- A universe score $x_p$ is thus defined as the mean
of a person's scores on all measures from the
universe of possible measures. The universe score
is thus the G-theory analog of the CTS-theory
true score. The variance of a group of persons'
scores on all measures would be equal to the
universe score variance $s_p^2$, which is similar to
CTS true score variance in the sense that it
represents that proportion of observed score
variance that remains constant across different
individuals and different measurement facets and
conditions.
52Universe Score
- The universe score is different from the CTS true
score, however, in that an individual is likely
to have different universe scores for different
universes of measures.
53Generalizability Coefficients
- The G-theory analog of the CTS-theory reliability
coefficient is the generalizability coefficient,
which is defined as the proportion of observed
score variance that is universe score variance
- $\rho_{xx'}^2 = s_p^2 / s_x^2$
- where $s_p^2$ is universe score variance and $s_x^2$ is
observed score variance, which includes both
universe score and error variance.
54Estimation
- Variance components: sources of variance
- persons (p), forms (f), raters (r)
- $s_x^2 = s_p^2 + s_f^2 + s_r^2 + s_{pf}^2 + s_{pr}^2 + s_{fr}^2 + s_{pfr}^2$
- Use ANOVA to compute the magnitude of each
variance component
- Analyse those that are significantly large.
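To make the ANOVA step concrete, here is a deliberately reduced sketch: it drops the forms facet and estimates variance components for a fully crossed persons x raters design with one score per cell. The expected-mean-square equations used are the usual ones for that design; the function names, the truncation of negative estimates to zero, the relative-error form of the coefficient, and the data are all assumptions of this sketch rather than something stated on the slide.

```python
def variance_components(scores):
    """scores[p][r]: score given to person p by rater r (fully crossed)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(col) / n_p for col in zip(*scores)]

    # Mean squares from a two-way ANOVA without replication.
    ms_p = n_r * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in rater_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - person_means[p] - rater_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    return {
        "p": max((ms_p - ms_res) / n_r, 0.0),  # persons (universe score)
        "r": max((ms_r - ms_res) / n_p, 0.0),  # raters
        "pr,e": ms_res,                        # interaction + residual error
    }

def g_coefficient(components, n_r):
    """Universe score variance over itself plus relative error variance."""
    error = components["pr,e"] / n_r
    return components["p"] / (components["p"] + error)

# Invented ratings: 3 persons x 2 raters; rater 2 is uniformly 1 point higher.
scores = [[1, 2], [2, 3], [3, 4]]
components = variance_components(scores)
print(components, g_coefficient(components, n_r=2))
```

In this invented example the raters differ only in severity, so the residual component is zero and the persons are ordered identically by both raters.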
55Standard Error of Measurement (SEM)
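- We need to know the extent to which a test score
may vary (SEM).
- Formula for SEM estimation
- $s_e = s_x \sqrt{1 - r_{xx'}}$
- From
- $r_{xx'} = s_t^2 / s_x^2$ (1)
- $s_t^2 / s_x^2 + s_e^2 / s_x^2 = 1$ (2)
- $s_e^2 / s_x^2 = 1 - s_t^2 / s_x^2$ (3)
- $s_e^2 / s_x^2 = 1 - r_{xx'}$
- $s_e^2 = s_x^2 (1 - r_{xx'})$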
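A small numeric sketch of the SEM formula. The function name `sem` and the values (standard deviation 10, reliability 0.91) are invented for illustration.

```python
import math

def sem(s_x, r_xx):
    """Standard error of measurement: s_x * sqrt(1 - r_xx)."""
    if not 0 <= r_xx <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return s_x * math.sqrt(1 - r_xx)

# Illustrative values only.
print(sem(s_x=10, r_xx=0.91))
```

With these values the SEM is about 3 score points, so an observed score of 70 would be read as a band of roughly 67 to 73 rather than as an exact point.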
56Interpretation of Test Scores
- Difficulty
- Distinction
- Z score
57Difficulty for Dichotomous Scoring
- $p = R / n$
- where
- $p$: difficulty index
- $R$: number of right answers
- $n$: number of students
58Difficulty for Dichotomous Scoring (Corrected)
- $C_p = (kp - 1) / (k - 1)$
- where
- $C_p$: corrected difficulty index
- $p$: uncorrected difficulty index
- $k$: number of choices
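Both difficulty indices are one-line computations; the sketch below works one invented item through them (30 of 40 students correct, four answer choices). The function names are hypothetical.

```python
def difficulty(right, n):
    """Uncorrected difficulty index p = R / n."""
    return right / n

def corrected_difficulty(p, k):
    """Corrected index Cp = (k * p - 1) / (k - 1) for k answer choices."""
    return (k * p - 1) / (k - 1)

# Invented item: 30 of 40 students answered correctly, 4 choices.
p = difficulty(30, 40)
print(p, corrected_difficulty(p, k=4))
```

When p equals 1/k, the proportion expected from blind guessing, the corrected index is zero.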
59Difficulty for Non-dichotomous Scoring
60Distinction
- Label the top 27% of the total as the high group
and the lowest 27% of the total as the low group.
- $D = P_H - P_L$
- where
- $D$: distinction (discrimination) index
- $P_H$: rate of correct answers in the high group
- $P_L$: rate of correct answers in the low group
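A sketch of the high-group/low-group computation for a single item, assuming test takers are grouped by total score and that 27 is a percentage. The function name, the rounding of the group size, and the data are assumptions made for this illustration.

```python
def distinction_index(total_scores, item_correct, fraction=0.27):
    """D = P_H - P_L for one item, using the top and bottom score groups."""
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Indices of test takers, ordered from highest to lowest total score.
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    high = order[:group_size]
    low = order[-group_size:]
    p_high = sum(item_correct[i] for i in high) / group_size
    p_low = sum(item_correct[i] for i in low) / group_size
    return p_high - p_low

# Invented data: 10 test takers; only the stronger half got the item right.
total_scores = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
item_correct = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(distinction_index(total_scores, item_correct))
```

An item that weaker and stronger test takers answer equally well yields D = 0, and an item that weaker test takers answer better yields a negative D.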
61Z score
- A way of placing an individual score in the whole
distribution of scores on a test; it expresses
how many standard deviation units the score lies
above or below the mean. Scores above the mean are
positive; those below the mean are negative.
- An advantage of z scores is that they allow
scores from different tests to be compared, where
the mean and standard deviation differ, and where
score points may not be equal.
- $Z = (X - \bar{X}) / s$
62T-score
- A transformation of a z score, equivalent to it
but with the advantage of avoiding negative
values, and hence often used for reporting
purposes.
- $T = 10Z + 50$
63Standardized Score
- A transformation of raw scores which provides a
measure of relative standing in a group and
allows comparison of raw scores from different
distributions, e.g. from tests of different
lengths. It does this by converting a raw score
into a standard frame of reference which is
expressed in terms of its relative position in
the distribution of scores. The z score is the
most commonly used standardized score.
- Standardized score $= 100Z + 500$
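The z score, the T-score (10Z + 50) and the standardized score (100Z + 500) are successive linear transformations, which a short sketch can show side by side. The function names and raw scores are invented; `statistics.stdev` is the sample (n - 1) standard deviation.

```python
from statistics import mean, stdev

def z_scores(raw):
    """z = (X - mean) / s for every raw score."""
    m, s = mean(raw), stdev(raw)
    return [(x - m) / s for x in raw]

def t_score(z):
    """T = 10 * z + 50."""
    return 10 * z + 50

def standardized_score(z):
    """Standardized score = 100 * z + 500."""
    return 100 * z + 500

# Invented raw scores.
raw = [1, 2, 3]
for x, z in zip(raw, z_scores(raw)):
    print(x, z, t_score(z), standardized_score(z))
```

All three scales keep the same ordering and relative distances between test takers; only the mean and the unit change.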