Title: Core course Session 4
Slide 1: Core course Session 4
- Hypothesis testing; item/scale analysis
KW 11, 12
Slide 2: Television Watching Time
A manager of a broadcast station supposes that the number and the types of television programs and commercials targeted at children are affected by the amount of time children watch TV. To gain further insight, a survey was conducted among 100 North American children, who were asked to record the number of hours they watched TV per week. Based on these results, various claims are examined. The manager, for instance, assumes that the mean watching time is at most 25 hrs/week, while the North American Child Watch organization believes that it is at least 30 hrs/week. In addition, various other claims are advanced with respect to the variance of the watching time and the proportion of children that watch television more than 40 hrs/week. Initially, it is assumed that the population standard deviation of the weekly TV watching time is equal to σ = 8.0, an assumption that is rapidly relaxed.
The mean TV watching time is at most 25 hrs/week (σ is known)
The mean TV watching time is at least 30 hrs/week (σ is known)
The mean TV watching time is 30 hrs/week (σ is not known) - extra
The variance of TV watching time is 100 hrs²/week² (normality assumed)
The proportion of North American children watching TV more than 40 hrs/week is at most 5%
Slide 3: Agenda Session 4
- Introduction and case
- Hypothesis testing
  - general
  - mean (variance known)
  - mean (variance not known)
  - variance
  - proportion
  - peculiarities: type II error, sample size, p-value approach
- Scale reliability and scale/item analysis
Slide 4: Testing hypotheses about parameters (general statement)
KW 11.1, 11.2
Slide 5: Hypothesis testing: general
- In hypothesis testing, an assumption about a population parameter is evaluated in light of the sample information
- If the sample outcomes are far from the postulated parameter value, the assumption is rejected. Otherwise, the assumption is maintained. The test procedure establishes a measure of "far" (a decision rule)
- Rejecting a postulated value while in fact the assumption is true is a so-called type I error; the probability of this type of error is the so-called significance level of the test (α)
- Maintaining a postulated value while in fact the assumption is wrong is a so-called type II error; the probability of this type of error is denoted β. 1 − β is the power of the test.
A brief look at the available data and some examples
Slide 6: Hypothesis testing: general
Based on the background sketch, the following hypotheses are to be evaluated:
The mean TV watching time is at least 30 hrs/week (σ is known)
H0: μ ≥ 30, H1: μ < 30
The mean TV watching time is 30 hrs/week (σ is not known)
H0: μ = 30, H1: μ ≠ 30
The variance of TV watching time is 100 hrs²/week² (normality assumed)
H0: σ² = 100, H1: σ² ≠ 100
The proportion of North American children watching TV more than 40 hrs/week is at most 5%
H0: p ≤ 0.05, H1: p > 0.05
The data available to test these assumptions are on Blackboard, XM09-01 (5th ed)
Slide 7: Testing hypotheses about a mean (population variance known)
KW 11.3
The mean TV watching time is at least 30 hrs/week (σ is known)
H0: μ ≥ 30, H1: μ < 30
Slide 8: Hypothesis test for μ (σ known)
The mean TV watching time is at least 30 hrs/week (σ is known)
H0: μ ≥ 30, H1: μ < 30
Test statistic: Z = (X̄ − μ0) / (σ/√n); under H0, Z ~ N(0, 1)
α = 0.05; left-tailed rejection region: Z < −z0.05 = −1.645 (Table 3)
[Figure: standard normal density with the lower tail of area α = 0.05 shaded; Z axis centred at 0, X̄ axis centred at 30, critical value −z0.05 = −1.645]
Slide 9: Hypothesis test for μ (σ known)
Also see Refresher Session 4, Slide 33
H0: μ ≥ 30, H1: μ < 30
Step 1. Formulate H0 and H1
Step 2. Determine the test statistic
Step 3. State the distribution of the test statistic
Step 4. Assess the intuitive rejection area
Step 5. Decide upon the significance level
Step 6. Look up the critical values
Step 7. Perform the test
Reject H0 at 5%
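The seven steps above can be sketched numerically. The sketch below is in Python rather than SPSS, and the sample mean xbar = 27.2 is an assumed, illustrative value; the actual XM09-01 statistics are not reproduced on this slide:

```python
import math

def z_test_mean(xbar, mu0, sigma, n):
    """One-sample z statistic for a mean when the population sigma is known."""
    se = sigma / math.sqrt(n)          # standard error of the sample mean
    return (xbar - mu0) / se

# Illustrative numbers; xbar = 27.2 is an assumption, not the XM09-01 value
z = z_test_mean(xbar=27.2, mu0=30.0, sigma=8.0, n=100)
# Left-tailed test at alpha = 0.05: reject H0 if z < -z0.05 = -1.645
reject = z < -1.645
```

With these assumed numbers z = −3.5, which falls in the rejection region, in line with the slide's conclusion.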
Slide 10: Testing hypotheses about a mean (population variance not known)
KW 12.1-12.2
The mean TV watching time is 30 hrs/week (σ is not known)
H0: μ = 30, H1: μ ≠ 30
Slide 11: Hypothesis test for μ (σ not known)
The mean TV watching time is 30 hrs/week (σ is not known)
H0: μ = 30, H1: μ ≠ 30
Test statistic: T = (X̄ − μ0) / (S/√n); under H0, T ~ t(n−1) = t(99)
α = 0.05; two-tailed rejection region with α/2 = 0.025 in each tail (Table 4)
[Figure: t(99) density with both tails of area 0.025 shaded; T axis centred at 0, X̄ axis centred at 30]
Slide 12: Hypothesis test for μ (σ not known)
α = 0.05, α/2 = 0.025
n = 100, n − 1 = 99
t99,0.025 = 1.9842 (exact value); the table lookup gives ≈ 1.985
Slide 13: Hypothesis test for μ (σ not known)
H0: μ = 30, H1: μ ≠ 30
Step 1. Formulate H0 and H1
Step 2. Determine the test statistic
Step 3. State the distribution of the test statistic: T ~ t(99), since n = 100
Step 4. Assess the intuitive rejection area
Step 5. Decide upon the significance level
Step 6. Look up the critical values
Step 7. Perform the test
Reject H0 at 5%
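A minimal sketch of the same t-test in Python, using only the standard library; the toy data below are hypothetical and merely illustrate the mechanics, not the XM09-01 sample:

```python
import math
import statistics

def t_test_mean(sample, mu0):
    """One-sample t statistic for a mean when sigma is not known."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)            # sample standard deviation S
    t = (xbar - mu0) / (s / math.sqrt(n))
    return t, n - 1                         # statistic and degrees of freedom

# Hypothetical toy ratings, not the real data (n = 8 here instead of 100)
t, df = t_test_mean([28, 31, 27, 33, 29, 26, 30, 25], mu0=30)
# Two-sided test: reject H0 if |t| > t(df, 0.025)
```

With the full n = 100 sample the comparison would be against t99,0.025 as on slide 12.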
Slide 14: Testing hypotheses about a variance
KW 12.3
The variance of TV watching time is 100 hrs²/week² (normality assumed)
H0: σ² = 100, H1: σ² ≠ 100
Slide 15: Hypothesis test for σ² (normality assumed)
The variance of TV watching time is 100 hrs²/week² (normality assumed)
H0: σ² = 100, H1: σ² ≠ 100
Test statistic: Y = (n − 1)S²/σ0²; under H0, Y ~ χ²(n−1) = χ²(99)
α = 0.05; two-tailed rejection region with α/2 = 0.025 in each tail
Critical values: χ²99,0.975 = 73.4 and χ²99,0.025 = 128.4 (Table 5)
[Figure: χ²(99) density with both tails of area 0.025 shaded]
Slide 16: Hypothesis test for σ² (normality assumed)
α = 0.05, α/2 = 0.025, 1 − α/2 = 0.975
n = 100, n − 1 = 99
χ²99,0.975 = 73.361
χ²99,0.025 = 128.422
Slide 17: Hypothesis test for σ² (normality assumed)
H0: σ² = 100, H1: σ² ≠ 100
Step 1. Formulate H0 and H1
Step 2. Determine the test statistic: Y = (n − 1)S²/σ0²
Step 3. State the distribution of the test statistic: Y ~ χ²(n−1) = χ²(99), since n = 100
Step 4. Assess the intuitive rejection area: S² << 100 or S² >> 100, i.e. Y << 99 or Y >> 99
Step 5. Decide upon the significance level: α = 0.05
Step 6. Look up the critical values: χ²99,0.975 = 73.4 and χ²99,0.025 = 128.4
Step 7. Perform the test
Reject H0 ....
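The chi-square statistic of Step 2 is easy to compute once the sample variance is available. In the Python sketch below, the sample variance s2 = 64.0 is an assumed value for illustration; the critical values are those from slide 16:

```python
def chi2_stat(s2, sigma2_0, n):
    """Y = (n - 1) * S^2 / sigma0^2; chi-square with n - 1 df under H0."""
    return (n - 1) * s2 / sigma2_0

# s2 = 64.0 is an assumed sample variance, not the XM09-01 value
y = chi2_stat(s2=64.0, sigma2_0=100.0, n=100)
# Two-sided test at alpha = 0.05, critical values from slide 16
reject = y < 73.361 or y > 128.422
```

Here y = 63.36 < 73.361, so this (hypothetical) sample variance would lead to rejection of H0 in the lower tail.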
Slide 18: Testing hypotheses about a proportion
KW 12.4
The proportion of North American children watching TV more than 40 hrs/week is at most 5%
H0: p ≤ 0.05, H1: p > 0.05
Slide 19: Hypothesis test for p
The proportion of North American children watching TV more than 40 hrs/week is at most 5%
H0: p ≤ 0.05, H1: p > 0.05
Test statistic: Z = (p̂ − p0) / √(p0(1 − p0)/n); under H0, Z ~ N(0, 1)
α = 0.05; right-tailed rejection region: Z > z0.05 = 1.645 (Table 3)
[Figure: standard normal density with the upper tail of area 0.05 shaded]
Slide 20: Hypothesis test for p
H0: p ≤ 0.05, H1: p > 0.05
Step 1. Formulate H0 and H1
Step 2. Determine the test statistic: Z = (p̂ − p0) / √(p0(1 − p0)/n)
Step 3. State the distribution of the test statistic: Z ~ N(0, 1)
Step 4. Assess the intuitive rejection area: p̂ >> 0.05, i.e. Z >> 0
Step 5. Decide upon the significance level: α = 0.05
Step 6. Look up the critical values: z0.05 = 1.645
Step 7. Perform the test
Maintain H0 ....
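A Python sketch of this proportion test; the observed count of heavy viewers (7 out of 100) is an assumption for illustration, chosen so that the conclusion matches the slide (maintain H0):

```python
import math

def z_test_proportion(x, n, p0):
    """z statistic for a proportion; requires n*p0 >= 5 and n*(1-p0) >= 5."""
    p_hat = x / n
    se = math.sqrt(p0 * (1 - p0) / n)   # standard error under H0
    return (p_hat - p0) / se

# x = 7 heavy viewers out of n = 100 is an assumed count
z = z_test_proportion(x=7, n=100, p0=0.05)
# Right-tailed test at alpha = 0.05: reject H0 if z > 1.645
reject = z > 1.645
```

With these numbers z ≈ 0.92 < 1.645, so H0 is maintained.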
Slide 21: Specific issues in hypothesis testing
KW 11.3, 11.4
- Type II error
- Sample size
- p-value method
Slide 22: Hypothesis testing: peculiarities
- The classical test procedure is biased towards the null hypothesis. The probability of a type I error is set in advance (α = 0.10, 0.05, 0.01); the probability of making a type II error follows. But where is this type II error (maintaining a false H0) in the testing procedure?
- Increasing the sample size n has previously been seen to increase the accuracy of interval estimates (the error margin became smaller) as a result of smaller (estimated) standard errors. What happens with the test procedure when the sample size increases?
- The classical test procedure involves a lot of calculations. Do we have to go through all of them in each and every situation? An alternative is the p-value approach to testing
Type II errors
Slide 23: Hypothesis testing: determining the probability of a type II error
- The test procedure has been designed to control the probability of a type I error (α, the significance level of the test). The probability of a type II error is not, and cannot be, determined unless a specific alternative hypothesis is available
Imagine the case North American Child Watch versus the Broadcast Program Managers. The former claims that TV watching time in North America is at least 30 hrs/week. The latter claims that the mean TV watching time is at most 25 hrs/week. Determine the probability of a type II error of maintaining the NACW claim, while in fact the assumption of the Broadcast Program Managers is correct.
H0: μ ≥ 30, H1: μ ≤ 25
Evaluated in the first example (slides 7-9) for a vague, non-specific alternative hypothesis
Slide 24: Hypothesis testing: determining the probability of a type II error
The explanation begins where the one-sided test procedure (in slide 8) ended; for illustration purposes, the axis has been re-scaled in terms of X̄
How would both parties react to this outcome?
[Figure: sampling distributions of X̄ centred at 30 (under H0) and at 25 (under the specific alternative), with the acceptance region and β indicated]
Trade-off between type I and type II errors
Slide 25: Hypothesis testing: determining the probability of a type II error
- There is a trade-off between the type I and type II errors
- If the probability of a type I error (α, the significance level of the test) is increased, e.g. from 5% to 10%, then the acceptance region becomes smaller, and the probability of the type II error (β) decreases
- If the probability of a type I error (α, the significance level of the test) is decreased, e.g. from 5% to 1%, then the acceptance region becomes larger, and the probability of the type II error (β) increases
- The only way to have both α and β small is to increase the sample size n.
- The power of the test is defined as 1 − β. It measures the ability of the test procedure to reject the null hypothesis when it should (for a specific alternative). The greater the power of a test, the better.
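For the specific alternative μ = 25 of slide 23, β can be computed directly. A Python sketch using the standard normal CDF (via math.erf):

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# H0: mu >= 30 against the specific alternative mu = 25; sigma = 8, n = 100
sigma, n, mu0, mu1 = 8.0, 100, 30.0, 25.0
se = sigma / math.sqrt(n)              # 0.8
x_crit = mu0 - 1.645 * se              # maintain H0 when x-bar >= x_crit
beta = 1.0 - phi((x_crit - mu1) / se)  # P(maintain H0 | mu = 25)
power = 1.0 - beta
```

With these numbers the acceptance boundary is x̄ ≈ 28.68, and β is practically zero: a 5 hrs/week gap is almost always detected with n = 100.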
Sample Size
Slide 26: Hypothesis testing: increasing the sample size n
The explanation begins halfway through the previously explained testing procedure (slide 24), assuming an arbitrary sample size n0 which is substantially increased to n1
[Figure: sampling distributions of X̄ for sample sizes n0 and n1 > n0; the n1 distribution is narrower]
Slide 27: Hypothesis testing: increasing the sample size n
- When the sample size n increases, the (estimated) standard error of the sample average (σ/√n or S/√n) decreases, and the acceptance region becomes smaller (the rejection area becomes larger)
- This (partly) explains why analyses based on large samples often lead to significant test results
- Note: significance does not mean relevance. Remember that the null hypothesis is a working hypothesis
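The effect on the rejection boundary can be made concrete. A small Python sketch for the left-tailed test of slide 8 (H0: μ ≥ 30, σ = 8, α = 0.05) at three sample sizes:

```python
import math

sigma, mu0, z_alpha = 8.0, 30.0, 1.645
results = []
for n in (25, 100, 400):
    se = sigma / math.sqrt(n)            # standard error shrinks with sqrt(n)
    lower_crit = mu0 - z_alpha * se      # reject H0 when x-bar < lower_crit
    results.append((n, round(se, 3), round(lower_crit, 3)))
```

The boundary moves from about 27.37 (n = 25) to 28.68 (n = 100) to 29.34 (n = 400): ever smaller deviations from 30 become significant.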
p-value
Slide 28: Hypothesis testing: the p-value approach to testing
This explanation begins where the one-sided testing procedure in slide 8 ended
Note: make sure that both p and α refer to the same tails - either to one tail (in a one-sided test) or to two tails (in a two-sided test)!
p < α: reject H0
Here p = 0.0002 < α = 0.05
Slide 29: Hypothesis testing: the p-value approach to testing
- The p-value approach to hypothesis testing is a short-cut alternative to the 7-step approach explained before. The p-value frequently appears in computer output
- The p-value is defined as the probability of finding an outcome of the test statistic that is more extreme than the observed outcome of the test statistic in the sample, given the null hypothesis is true
- The p-value is directly compared to the significance level α to assess whether or not the null hypothesis should be rejected.
  - If p > α, then H0 should be maintained.
  - If p < α, then H0 should be rejected
- KW terminology:
  - p < 0.01: overwhelming evidence against H0; a highly significant test result
  - 0.01 < p < 0.05: strong evidence against H0; a significant test result
  - 0.05 < p < 0.10: weak evidence against H0; no significant test result at 5%
  - p > 0.10: no evidence against H0; no significant test result at 10%
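The p-value computation can be sketched in Python; the observed z = −3.5 is an assumed value, chosen because it yields a p-value of about 0.0002, the order of magnitude shown on slide 28:

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value_left(z_obs):
    """p-value for a left-tailed z test: P(Z < z_obs) under H0."""
    return phi(z_obs)

p = p_value_left(-3.5)     # about 0.0002
reject = p < 0.05          # p < alpha: reject H0
```

In KW terminology, p < 0.01 here, so this would count as a highly significant test result.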
Slide 30: Internal consistency of multi-item scales
KW --
Some unfinished business: in Session 2, we discussed the background of multi-item scales. The quality of these scales or indicators is often judged on the basis of Cronbach's α - a measure of the internal consistency of the scale
Slide 31: "Cronbach alpha coefficients were computed to test the reliabilities of the TQM scales (Cronbach, 1951). Typically, these coefficients should fall within a range of 0.70 to 0.90 for narrow constructs (...), and 0.55 to 0.70 for moderately broad constructs (...). In the empirical study, the coefficients for the twelve variables ranged between 0.78 and 0.90, and varied only trivially between the second and third phases of the research" - Powell (p. 24)
Cronbach's α is often encountered in the literature as a measure of scale reliability (also see Powell and Geletkanycz), but what is meant by it?
Recall the classical test model (X = T + ε) and Powell's indicator of TQM measurement
Slide 32: Internal consistency
- The classical test model postulates that the observational score of an indicator (X) consists of a true score (T) and a random measurement error (ε): X = T + ε. This assumption is applied to each of the items of the scale:
X1 = T + ε1
X2 = T + ε2
X3 = T + ε3
X4 = T + ε4
Slide 33: Internal consistency
- Reliability (ρ), the extent to which an indicator yields the same results when repeatedly applied, has various definitions. A particularly useful one is the ratio of the true score variance to the variance of the observational score
- In the case of the indicator Y, the observational score is Y = ΣXk and the true score is K·T. The reliability may therefore be obtained as
  α = [K/(K−1)] · (1 − ΣV(Xk) / V(ΣXk))
- which means that this reliability is inversely related to the ratio of the sum of the item variances and the variance of the sum of items (scale variance).
Slide 34: Internal consistency
- If the observed item scores Xk are highly (positively) correlated, then V(ΣXk) is (much) larger than ΣV(Xk), and Cronbach's α is close to 1. A high Cronbach α, say larger than 0.80, is interpreted as a good sign; a small Cronbach α, say below 0.60, indicates poor performance of the scale
- If the item scores are completely unrelated, then the variance of the scale V(ΣXk) is equal to the sum of item variances ΣV(Xk), and Cronbach's α is close to 0.
- If item scores are negatively correlated, which sometimes occurs when one forgets to recode the negatively stated items, the scale variance V(ΣXk) may be (slightly) smaller than the sum of item variances ΣV(Xk), and Cronbach's α is lower than 0.
- If the number of items K increases, then Cronbach's α increases as well.
- Please note that Cronbach's reliability measures the internal consistency of a scale (the degree to which separate items similarly order the respondents), and not the behavior of the scale in repeated measurements over time.
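The definition above translates directly into code. A minimal Python sketch, with three hypothetical, positively correlated items (statistics.variance gives the usual sample variance):

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score lists (one list per item)."""
    k = len(items)
    scale = [sum(scores) for scores in zip(*items)]          # per-respondent sum
    sum_item_var = sum(statistics.variance(x) for x in items)
    scale_var = statistics.variance(scale)                   # V(sum of items)
    return (k / (k - 1)) * (1 - sum_item_var / scale_var)

# Three hypothetical, positively correlated items rated by six respondents
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 2, 3, 5, 5, 6]
x3 = [1, 3, 3, 4, 6, 6]
alpha = cronbach_alpha([x1, x2, x3])   # close to 1 for such consistent items
```

Because the three toy items order the respondents almost identically, V(ΣXk) far exceeds ΣV(Xk) and α comes out close to 1.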
Use of Cronbach's α in scale construction
Slide 35: Scale or item analysis
KW --
Following the design of the indicator and the data collection, the observed item scores/ratings are further examined to obtain the best possible indicator
Slide 36: Attitude towards social issues
Some years ago, our Minister of Justice proposed an obligatory DNA test for sexual offenders. During an undergraduate course at the time, a survey was conducted to measure the students' attitude towards this DNA proposal. The survey contained a large number of items to be rated on a 6-point evaluation scale (1 = strongly disagree to 6 = strongly agree). For the sake of the example, 6 items are presented below.
The survey resulted in 69 responses. Based on the results, a scale analysis is performed to obtain the best possible indicator of the attitude towards the DNA proposal
What is a scale analysis?
Slide 37: Scale/item analysis
- Scale construction is not a one-step process that automatically generates the desired respondent scores once the survey is completed. Instead, the contribution of each separate item to the whole scale is evaluated a posteriori to obtain the best possible indicator (= an indicator that optimally succeeds in ordering the respondents)
- Scale analysis is performed after the data collection. It consists of selecting those items that contribute to the performance of the scale and skipping the items that do not
- Skipping items may be done on various grounds; three criteria are illustrated here:
  - Item-rest correlation: the correlation between separate items and the indicator calculated for the remaining items should be as high as possible; items with low item-rest correlations are dropped
  - Cronbach's α: separate items should add to the consistent ordering of respondents; items that increase Cronbach's α when removed are dropped
  - Irrelevance criterion: items should be discriminating, i.e. they should successfully order the respondents in accordance with the assumed scale model
Illustration of the process
Slide 38: Calculating Cronbach's α and performing the scale/item analysis is done with the Scale/Reliability Analysis instruction in the Analyze menu
Options "Scale" and "Scale if item deleted" are required to obtain the information for the scale analysis
Items selected for the item analysis: initially, the negatively stated items 8, 9 and 10 are not recoded
Slide 39: The pattern of positive and negative item-rest correlations suggests that some items are negatively stated and should be recoded before any further analysis. Action: check the survey form.

R E L I A B I L I T Y   A N A L Y S I S  -  S C A L E  (A L P H A)

Statistics for SCALE: Mean 19.2464, Variance 7.0119, Std Dev 2.6480, N of Variables 6

Item-total Statistics
            Scale Mean     Scale Variance   Corrected      Alpha
            if Item        if Item          Item-Total     if Item
            Deleted        Deleted          Correlation    Deleted
ITEMR01     15.7101        5.3853           .0532          -.2950
ITEMR02     16.1449        5.3905           .0315          -.2693
ITEMR03     15.5072        7.1066           -.2142         .0382
ITEMR08     16.2029        5.8700           -.0034         -.2060
ITEMR09     16.6377        6.4991           -.1283         -.0493
ITEMR10     16.0290        5.7933           -.0770         -.1083

Reliability Coefficients: N of Cases = 69.0, N of Items = 6, Alpha = -.1686

Cronbach's α for the scale consisting of the 6 original items has a quite unexpected, low, even negative outcome. Further checking the α if item deleted is not meaningful. Let's see what happens when we recode the negatively stated items 8, 9 and 10
Slide 40: Checking the survey form (slide 36) shows that items 8, 9 and 10 should have been recoded before the analysis. A (negative) answer 2 to item08 should actually be interpreted as a (positive) answer 5 (= 7 − 2). A (positive) answer 4 to item10 should actually be interpreted as a (negative) answer 3 (= 7 − 4), et cetera. How can we recode the negatively stated items in SPSS for Windows? The 7 in the calculations is the maximum rating value (= 6) plus 1.
Example using the Compute instruction in the Transform menu
Example using COMPUTE statements in the Syntax screen (which is more efficient):
* Recoding negatively stated items.
COMPUTE xitemr08 = 7 - itemr08 .
COMPUTE xitemr09 = 7 - itemr09 .
COMPUTE xitemr10 = 7 - itemr10 .
EXECUTE .
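The same recode in Python, for comparison (the raw ratings below are hypothetical):

```python
def recode(ratings, max_rating=6):
    """Reverse-score a negatively stated item: new = (max_rating + 1) - old."""
    return [max_rating + 1 - r for r in ratings]

itemr08 = [2, 5, 4, 1, 6]        # hypothetical raw ratings on the 6-point scale
xitemr08 = recode(itemr08)       # [5, 2, 3, 6, 1]
```

After recoding, a high raw rating on a negatively stated item counts as a low (negative) contribution to the attitude score, as intended.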
Slide 41:
R E L I A B I L I T Y   A N A L Y S I S  -  S C A L E  (A L P H A)

Statistics for SCALE: Mean 22.5072, Variance 21.4007, Std Dev 4.6261, N of Variables 6

Item-total Statistics
            Scale Mean     Scale Variance   Corrected      Alpha
            if Item        if Item          Item-Total     if Item
            Deleted        Deleted          Correlation    Deleted
ITEMR01     18.9710        15.3521          .5189          .7080
ITEMR02     19.4058        15.4800          .4730          .7210
ITEMR03     18.7681        15.8278          .5333          .7059
XITEMR08    18.5507        16.1334          .4748          .7203
XITEMR09    18.1159        16.0746          .4571          .7247
XITEMR10    18.7246        14.7319          .4959          .7159

Reliability Coefficients: N of Cases = 69.0, N of Items = 6, Alpha = .7516

Negative or low item-rest correlations are no longer found. This information does not suggest any further action.
The scale reliability of 0.75 cannot be improved by skipping items: all α if item deleted are below 0.75. This means that we are finished with this analysis.
This time, Cronbach's α is equal to 0.75, which is fair enough, though not extremely high
Slide 42: Irrelevance criterion
Sometimes, for instance in the case of multiple-choice exams, the item analysis is followed by a comparison of the a posteriori rating patterns with the a priori postulated scale model. Items are considered relevant if their rating patterns are in accordance with the postulated scale model. The analysis of the irrelevance criterion consists of the following steps:
Scale model: the more positive respondents are increasingly likely to express a positive attitude towards the positive item A (Session 2, slide 32)
- Calculate the respondent scores based on the available items
- Divide the respondents into several (say 5) groups with low to high overall scores
- For each item, calculate the average rating values for each of the respondent groups
- Compare the pattern of average rating values with the postulated scale model
The next slides show the results for this particular example, and also how to calculate the overall respondent score and how to make subgroups
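The four steps above can be sketched in Python. The six respondents and ratings below are hypothetical; a discriminating, positively stated item should show increasing group means:

```python
def group_means(item, score, n_groups):
    """Mean item rating within equal-sized groups of respondents ranked by score."""
    order = sorted(range(len(score)), key=lambda i: score[i])
    size = len(order) // n_groups        # assumes len(score) divisible by n_groups
    groups = [order[g * size:(g + 1) * size] for g in range(n_groups)]
    return [sum(item[i] for i in g) / len(g) for g in groups]

# Hypothetical overall scores and ratings for one positively stated item
score = [10, 25, 14, 30, 8, 22]
item = [2, 5, 3, 6, 1, 4]
means = group_means(item, score, n_groups=3)   # increasing, as the scale model predicts
```

A negatively stated (recoded-missing) item would instead show a decreasing pattern across the groups.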
Slide 43: [Figures: average category rating patterns for Item01, Item02, Item03, Item08, Item09 and Item10 across the respondent groups]
All items behave as expected: their a posteriori category rating patterns are similar to the a priori assumed scale models. The positively stated items 1, 2 and 3 show monotonically increasing response patterns (item03 is not entirely optimal, but not a bad item yet). The negatively stated items 8, 9 and 10 decrease monotonically.
SPSS instructions involved
Slide 44: Calculating the overall respondent scores is done with the Compute instruction (slide 40):
score = item01 + item02 + item03 + xitem08 + xitem09 + xitem10
Making the respondent groups is done by determining a categorical variable with 5 groups based on the overall score, using the Transform/Categorize variables instruction
Bar charts for each separate item are used to depict the category rating pattern over the respondent groups
In the example, average ratings of item01 are depicted for the 5 respondent groups identified by nscore
Final step: calculating respondent scores
Slide 45: After the scale analysis, the final respondent scores are determined as the sum or average of the remaining (possibly recoded) items. Here, the respondents' attitudes are measured as
score = item01 + item02 + item03 + xitem08 + xitem09 + xitem10
The resulting attitude score is analyzed in the usual way, by making histograms (because the attitude score is interval by construction) and calculating descriptive statistics
What can be concluded from these results?
... which is the starting point for the descriptive analyses (Session 3)
Slide 46: To conclude
Next week
Slide 47: End of Session 4
- Next time:
  - statistical process control