Title: LSSG Black Belt Training
1. LSSG Black Belt Training
Hypothesis Testing
2. Introduction
- Always about a population parameter
- Attempt to prove (or disprove) some assumption
- Setup
- Alternate hypothesis: what you wish to prove
- Example: a change in Y after an LSS project
- Null hypothesis: assume the opposite of what is to be proven. The null is always stated as an equality.
- Example: Y after the project is the same as before
3. The test
- Take a sample and compute the statistic of interest.
- Example: standardized mean customer satisfaction score
- How likely is it that, if the null were true, you would get such a statistic? (the p-value)
- How likely is it that the sample would show (by random chance) the difference that we see after the LSS project, if in fact there was no improvement?
- If very unlikely, then the null must be false, hence the alternate is proven beyond reasonable doubt.
- If quite likely, then the null may be true, so there is not enough evidence to discard it in favor of the alternate.
4. Types of Errors
- Reject the null (assume the alternate is proven):
- If the null is really true: Type I error (believe in an improvement when none occurred)
- If the null is really false: good decision
- Do not reject the null (evidence for the alternate is not strong enough):
- If the null is really true: good decision
- If the null is really false: Type II error (cannot show an improvement when in fact one occurred)
5. The Testing Process
- Set up the hypotheses (null and alternate)
- Pick a significance level (alpha)
- Compute the critical value (based on alpha)
- Compute the test statistic (from the sample)
- Conclude: if the test statistic > the critical value, the alternate hypothesis is proven at the alpha level of significance (see the sketch below).
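As an illustration of this decision rule, here is a minimal Python sketch (assuming SciPy is installed) that computes a two-sided critical value for a z-test at alpha = 0.05 and compares it to a test statistic. The statistic value is a placeholder, not course data.

```python
from scipy import stats

alpha = 0.05                             # chosen significance level
z_crit = stats.norm.ppf(1 - alpha / 2)   # two-sided critical value, about 1.96

z_stat = 2.50                            # placeholder test statistic from a sample
if abs(z_stat) > z_crit:
    print("Reject the null: alternate supported at alpha =", alpha)
else:
    print("Fail to reject the null: evidence not strong enough")
```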
6. Hypothesis Testing Roadmap
7. Parametric Tests
- Use parametric tests when:
- The data are normally distributed
- The variances of the populations (if more than one is sampled from) are equal
- The data are at least interval scaled
8. One-sample z-test
- A gap between two parts should be 15 microns. A sample of 25 measurements shows a mean of 17 microns. Test whether this is significantly different from 15, assuming the population standard deviation is known to be 4.

One-Sample Z
Test of mu = 15 vs not = 15
The assumed standard deviation = 4

 N     Mean  SE Mean  95% CI                   Z      P
25  17.0000   0.8000  (15.4320, 18.5680)    2.50  0.012
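The same numbers can be reproduced outside Minitab. A minimal sketch in Python, assuming SciPy is available and using only the summary values from the slide (n = 25, sample mean 17, sigma = 4, mu0 = 15):

```python
import math
from scipy import stats

mu0, sigma, n, xbar = 15.0, 4.0, 25, 17.0     # values from the slide

se = sigma / math.sqrt(n)                     # standard error = 0.8
z = (xbar - mu0) / se                         # z = 2.50
p = 2 * stats.norm.sf(abs(z))                 # two-sided p-value, about 0.012
z975 = stats.norm.ppf(0.975)
ci = (xbar - z975 * se, xbar + z975 * se)     # 95% CI, about (15.43, 18.57)
print(z, p, ci)
```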
9. Z-test for proportions
- You wish to test the hypothesis that at least two-thirds (66%) of people in the population prefer your brand over Brand X. Of the 200 customers surveyed, 140 say they prefer your brand. Is this statistically significant?

Test and CI for One Proportion
Test of p = 0.66 vs p > 0.66

                              95% Lower
Sample    X    N  Sample p        Bound  Z-Value  P-Value
1       140  200  0.700000     0.646701     1.19    0.116
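A minimal sketch of the same one-proportion z-test in Python (SciPy assumed), using the null proportion in the standard error as the normal approximation does:

```python
import math
from scipy import stats

x, n, p0 = 140, 200, 0.66          # survey result and hypothesized proportion
p_hat = x / n                      # 0.70
se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
z = (p_hat - p0) / se              # about 1.19
p_value = stats.norm.sf(z)         # one-sided (greater-than) p, about 0.116
print(z, p_value)
```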
10. One-sample t-test
Error Reduction: 10 12 9 8 7 12 14 13 15 16 18 12 18 19 20 17 15

The data show reductions in the percentage of errors made by claims processors after undergoing a training course. The target was a reduction of at least 13%. Was it achieved?
11. One-Sample t-test: Minitab results
One-Sample T: Error Reduction
Test of mu = 13 vs > 13

                                               95% Lower
Variable          N     Mean   StDev  SE Mean      Bound     T      P
Error Reduction  17  13.8235  3.9248   0.9519    12.1616  0.87  0.200

The p-value of 0.20 exceeds alpha (0.05), so the reduction in errors could not be proven to be greater than 13%; there is not enough evidence to rule out a true reduction of 13% or less.
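A minimal sketch of the same one-sided, one-sample t-test in Python, assuming SciPy 1.6+ (for the `alternative` argument):

```python
from scipy import stats

error_reduction = [10, 12, 9, 8, 7, 12, 14, 13, 15, 16, 18, 12, 18, 19, 20, 17, 15]

# One-sided test of H0: mu = 13 vs Ha: mu > 13
t_stat, p_value = stats.ttest_1samp(error_reduction, popmean=13, alternative='greater')
print(t_stat, p_value)   # should be close to t = 0.87, p = 0.20 from the Minitab output
```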
12. Two-sample t-test
You realize that although the overall reduction is not proven to be more than 13%, there seems to be a difference between how men and women react to the training. You separate the 17 observations by gender and wish to test whether there is in fact a significant difference between the genders in error reduction.
 M   F
10  15
12  16
 9  18
 8  12
 7  18
12  19
14  20
13  17
    15
13. Two-sample t-test
The test for equal variances shows that the variances are not different for the two samples, so a 2-sample t-test may be conducted. The results are shown below. The p-value indicates there is a significant difference between the genders in their error reduction due to the training.
Two-sample T for Error Reduction M vs Error Reduction F

              N   Mean  StDev  SE Mean
Error Red M   8  10.63   2.50     0.89
Error Red F   9  16.67   2.45     0.82

Difference = mu (Error Red M) - mu (Error Red F)
Estimate for difference: -6.04167
95% CI for difference: (-8.60489, -3.47844)
T-Test of difference = 0 (vs not =): T-Value = -5.02  P-Value = 0.000  DF = 15
Both use Pooled StDev = 2.4749
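A minimal sketch of the same pooled two-sample t-test in Python (SciPy assumed):

```python
from scipy import stats

men   = [10, 12, 9, 8, 7, 12, 14, 13]
women = [15, 16, 18, 12, 18, 19, 20, 17, 15]

# Pooled (equal-variance) two-sample t-test, as in the Minitab output above
t_stat, p_value = stats.ttest_ind(men, women, equal_var=True)
print(t_stat, p_value)   # should be close to t = -5.02, p < 0.001
```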
14. Chi-squared test of independence
For tabulated count data. Two types of glass
sheets are manufactured, and the defects found on
111 sheets are tabulated based on the type of
glass (Type A and Type B) and the location of the
defect on each sheet (Zone 1 and Zone 2). You
wish to test whether the two variables (Type of
glass and Location of error on the glass) are
statistically independent of each other.
15. Chi-Square Test Results
Tabulated statistics: Glass Type, Location
Rows: Glass Type   Columns: Location

         Zone 1  Zone 2     All
Type A       29      23      52
          29.98   22.02   52.00
Type B       35      24      59
          34.02   24.98   59.00
All          64      47     111
          64.00   47.00  111.00

Cell contents: Count
               Expected count

Pearson Chi-Square = 0.143, DF = 1, P-Value = 0.705
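A minimal sketch of the same test of independence in Python (SciPy assumed); Yates' continuity correction is turned off so the statistic matches the Pearson chi-square above:

```python
from scipy import stats

# Observed counts: rows = glass type (A, B), columns = defect location (Zone 1, Zone 2)
observed = [[29, 23],
            [35, 24]]

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # should be close to chi-square = 0.143, DF = 1, p = 0.705
print(expected)       # expected counts, as in the Minitab table
```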
16. Assignment
- From the book "Doing Data Analysis with Minitab 14" by Robert Carver:
- Pages 138-142: choose any 3 of the datasets mentioned on those pages and answer the related questions (1-sample means).
- Pages 148-151: choose any 3 of the datasets mentioned on those pages and answer the related questions (1-sample proportions).
- Pages 165-168: choose any 3 of the datasets mentioned on those pages and answer the related questions (2-sample tests).
17. Basics of ANOVA
- Analysis of Variance, or ANOVA, is a technique used to test the hypothesis that there is a difference between the means of two or more populations. It is used in regression, in the analysis of factorial experiment designs, and in Gauge R&R studies.
- The basic premise of ANOVA is that differences in the means of 2 or more groups can be seen by partitioning the Sum of Squares. Sum of Squares (SS) is simply the sum of the squared deviations of the observations from their mean. Consider the following example with two groups. The measurements show the thumb lengths, in centimeters, of two types of primates.
- Total variation (SS) is 28, of which only 4 (2 + 2) is within the two groups. Thus 24 of the 28 is due to the differences between the groups. This partitioning of SS into "between" and "within" is used to test the hypothesis that the groups are in fact different from each other.
- See www.statsoft.com for more details.
Obs.        Type A  Type B
1                2       6
2                3       7
3                4       8
Group mean       3       7
Group SS         2       2
Overall mean = 5; total SS = 28
18. Results of ANOVA
The results of running an ANOVA on the sample data from the previous slide are shown here. The hypothesis test computes the F-value as the ratio of MS Between to MS Within. The greater the value of F, the greater the likelihood that there is in fact a difference between the groups. Looking it up in an F-distribution table shows a p-value of 0.008, indicating 99.2% confidence that the difference is real (exists in the population, not just in the sample).

One-way ANOVA: Type A, Type B
Source  DF     SS     MS      F      P
Factor   1  24.00  24.00  24.00  0.008
Error    4   4.00   1.00
Total    5  28.00
S = 1   R-Sq = 85.71%   R-Sq(adj) = 82.14%

Minitab: Stat / ANOVA / One-Way (Unstacked)
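A minimal sketch of the same one-way ANOVA in Python (SciPy assumed):

```python
from scipy import stats

type_a = [2, 3, 4]
type_b = [6, 7, 8]

f_stat, p_value = stats.f_oneway(type_a, type_b)
print(f_stat, p_value)   # should be close to F = 24.0, p = 0.008
```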
19. Two-Way ANOVA
Is the strength of steel produced different for different temperatures to which it is heated and the speeds with which it is cooled? Here 2 factors (speed and temp) are varied at 2 levels each, and the strengths of 3 parts produced at each combination are measured as the response variable.

Strength  Temp  Speed
20.0      Low   Slow
22.0      Low   Slow
21.5      Low   Slow
23.0      Low   Fast
24.0      Low   Fast
22.0      Low   Fast
25.0      High  Slow
24.0      High  Slow
24.5      High  Slow
17.0      High  Fast
18.0      High  Fast
17.5      High  Fast

Two-way ANOVA: Strength versus Temp, Speed
Source        DF       SS       MS      F      P
Temp           1   3.5208   3.5208   5.45  0.048
Speed          1  20.0208  20.0208  31.00  0.001
Interaction    1  58.5208  58.5208  90.61  0.000
Error          8   5.1667   0.6458
Total         11  87.2292
S = 0.8036   R-Sq = 94.08%   R-Sq(adj) = 91.86%
The results show significant main effects as well
as an interaction effect.
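A sketch of the same two-way ANOVA with interaction in Python, assuming pandas and statsmodels are available; because the design is balanced, a Type II ANOVA table should mirror the Minitab table above.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

data = pd.DataFrame({
    "Strength": [20.0, 22.0, 21.5, 23.0, 24.0, 22.0,
                 25.0, 24.0, 24.5, 17.0, 18.0, 17.5],
    "Temp":  ["Low"] * 6 + ["High"] * 6,
    "Speed": ["Slow"] * 3 + ["Fast"] * 3 + ["Slow"] * 3 + ["Fast"] * 3,
})

# Two-way ANOVA with interaction: Strength ~ Temp + Speed + Temp:Speed
model = smf.ols("Strength ~ C(Temp) * C(Speed)", data=data).fit()
print(anova_lm(model, typ=2))
```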
20. Two-Way ANOVA
The box plots give an indication of the
interaction effect. The effect of speed on the
response is different for different levels of
temperature. Thus, there is an interaction effect
between temperature and speed.
21. Assignment
- From the book "Doing Data Analysis with Minitab 14" by Robert Carver:
- Pages 192-194: choose any 3 of the datasets mentioned on those pages and answer the related questions (1-way ANOVA).
- Pages 204-206: choose any 3 of the datasets mentioned on those pages and answer the related questions (2-way ANOVA).
22. DOE Overview
- A design of experiment involves controlling specific inputs (factors) at various levels (typically 2 levels, such as "High" and "Low" settings) to observe the resulting change in output, and analyzing the data to determine the significance and relative importance of the factors.
- The simplest case would be to vary a single factor, say temperature, while baking cookies. Keeping all else constant, we can set the temperature at 350 degrees and 400 degrees, make several batches at those two temperatures, and measure the desired output; in this case it could be a rating by experts of the crispiness of the cookies on a scale of 0-10.
23. Full Factorial Designs
- A 2^F factorial design implies that there are F factors at 2 levels each. The case described on the previous slide, with only one factor, is the simplest. Having two factors at 2 levels would give us four combinations; three factors yield 8 combinations, 4 would yield 16, and so forth.
- The following table shows the full factorial (all 8 combinations) design for 3 factors (a sketch generating these combinations follows the table):
- temperature,
- baking time, and
- amount of butter,
- each at two levels, HIGH and LOW.
Temp Time Butter
Low Low Low
High Low Low
Low High Low
High High Low
Low Low High
High Low High
Low High High
High High High
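A minimal sketch in Python (standard library only) that enumerates the same 2^3 = 8 runs:

```python
from itertools import product

factors = {"Temp":   ["Low", "High"],
           "Time":   ["Low", "High"],
           "Butter": ["Low", "High"]}

# All 2^3 = 8 combinations of the three factors
for temp, time, butter in product(*factors.values()):
    print(temp, time, butter)
```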
24Fractional Factorials
- The previous example would require 8 different
setups to bake the cookies. For each setup, one
could bake several batches, say 4 batches, to get
a measure of the internal variation. In practice,
as the number of factors tested grows, it is
difficult to even create all the setups needed,
much less have replications within a setup. - An alternative is to use fractional factorial
designs, typically a ½ or ¼. As the name
suggests, a ½ factorial design with 3 factors
would only require 4 of the 8 combinations to be
tested. This entails some loss of resolution,
usually a confounding of interaction terms, which
may be of no interest to the experimenter, and
can be sacrificed.
Temp Time Butter
High High High
Low Low High
Low High Low
High Low Low
Minitab: Stat / DOE / Create Factorial Design / Display Factorial Designs
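The four runs in the table are consistent with the half fraction obtained by setting the Butter code equal to the product of the Temp and Time codes, which confounds the Butter main effect with the Temp x Time interaction. A minimal sketch generating that fraction, assuming -1/+1 coding for Low/High (an illustration, not the Minitab generator):

```python
from itertools import product

levels = {-1: "Low", 1: "High"}

# Half fraction of a 2^3 design: Butter = Temp * Time in coded units
for temp, time in product([-1, 1], repeat=2):
    butter = temp * time
    print(levels[temp], levels[time], levels[butter])
```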
25. Running the Experiment: Outcome Values
- Once the settings to be used are determined, we can run the experiment and measure the values of the outcome variable. This table shows the values of the outcome variable Crisp (the crispiness index for the cookies) for each of the 8 settings of the full factorial experiment.
Temp Time Butter Crisp
Low Low Low 7
High Low Low 10
Low High Low 7
High High Low 5
Low Low High 4
High Low High 9
Low High High 8
High High High 8
26. Analysis of Data
- Analyzing the data in Minitab for the main effects, and ignoring interaction terms, we get the following output:

Factorial Fit: Crispiness versus Temp, Time, Butter

Estimated Effects and Coefficients for Crispiness (coded units)
Term       Effect    Coef  SE Coef      T      P
Constant           7.2500   0.3750  19.33  0.000
Temp       3.0000  1.5000   0.3750   4.00  0.016
Time       0.5000  0.2500   0.3750   0.67  0.541
Butter     1.5000  0.7500   0.3750   2.00  0.116

S = 1.06066   R-Sq = 83.64%   R-Sq(adj) = 71.36%

Analysis of Variance for Crispiness (coded units)
Source          DF  Seq SS  Adj SS  Adj MS     F      P
Main Effects     3  23.000  23.000   7.667  6.81  0.047
Residual Error   4   4.500   4.500   1.125
Total            7  27.500

Estimated Coefficients for Crispiness using data in uncoded units
Term      Coef
Constant  7.25000
Temp      1.50000
Time      0.250000
Butter    0.750000
Note that only temperature is significant (p-value lower than 0.05). The effect of temperature is 3.00, which means that if temperature is set at HIGH, crispiness will increase by 3.00 units on average compared to the LOW setting.
Minitab: Stat / DOE / Create Factorial Design / Analyze Factorial Design
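A sketch of fitting the same kind of main-effects model in Python with statsmodels (assumed available), using -1/+1 coded units as in the Minitab "coded units" output; with this coding, each factor's effect is twice its fitted coefficient. This is an illustration of the approach, not Minitab's factorial-fit routine, so the numbers need not match the output above exactly.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Coded units: Low = -1, High = +1, in the run order of the table on slide 25
runs = pd.DataFrame({
    "Temp":   [-1,  1, -1,  1, -1,  1, -1,  1],
    "Time":   [-1, -1,  1,  1, -1, -1,  1,  1],
    "Butter": [-1, -1, -1, -1,  1,  1,  1,  1],
    "Crisp":  [ 7, 10,  7,  5,  4,  9,  8,  8],
})

# Main-effects-only model; interactions are deliberately ignored here
model = smf.ols("Crisp ~ Temp + Time + Butter", data=runs).fit()
print(model.summary())
```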
27. Assignment
- From the book "Doing Data Analysis with Minitab 14" by Robert Carver:
- Pages 309-310: answer any 4 of the 7 questions on those pages (DOE).
28. Hypothesis Testing Roadmap
29. Non-Parametric Tests
- Use non-parametric tests:
- When the data are obviously non-normal
- When the sample is too small for the central limit theorem to lead to normality of averages
- When the distribution is not known
- When the data are nominal or ordinal scaled
- Remember that even non-parametric tests have some assumptions about the data.
30. The sign test
- The story:
- A patient sign-in process at a hospital is being evaluated, and the time lapse between arrival and seeing a physician is recorded for a random sample of patients. You believe that currently the median time is over 20 minutes, and wish to test the hypothesis.
31. The sign test: data
Data for the test are as follows (Process Time, in minutes):
5, 7, 15, 30, 32, 35, 62, 75, 80, 85, 95, 100
The histogram of the data shows that it is
non-normal, and the sample size is too small for
the central limit theorem to apply. The data are
at least ordinal in nature (here they are ratio
scaled), satisfying the assumption of the sign
test.
32. Sign test: analysis
- Since the hypothesis is that the median is
greater than 20, the test compares each value to
20. Those that are smaller get a negative sign,
those that are larger than 20 get a positive one.
The sign test then computes the probability that
the number of negatives and positives observed
would come about through random chance if the
null were true (that the median time is 20
minutes).
33. Sign Test in Minitab: Results
Sign Test for Median: Process Time
Sign test of median = 20.00 versus > 20.00

               N  Below  Equal  Above       P  Median
Process Time  12      3      0      9  0.0730   48.50
In these data, there are 9 observations above 20 and 3 below. This outcome has a 0.0730 probability of occurring by random chance even if the population median is in fact not greater than 20. Thus, there is insufficient evidence to prove the hypothesis (to reject the null) at the 5% level, but enough if you are willing to take a 7.3% risk.
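Because the sign test reduces to a binomial calculation, the p-value can be reproduced with an exact binomial test. A minimal sketch in Python, assuming SciPy 1.7+ (for `binomtest`):

```python
from scipy import stats

# Under the null (median = 20), each observation falls above 20 with probability 0.5.
# 9 of the 12 observations are above 20 and none equal 20.
result = stats.binomtest(9, n=12, p=0.5, alternative='greater')
print(result.pvalue)   # about 0.0730, matching the Minitab sign test
```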
34. The sign test: other applications
- The sign test can also be used for testing the value of the median difference between paired samples, as illustrated in the following link. The difference between values in a paired sample can be treated as a single sample, so any 1-sample hypothesis test can be applied. In such a case, the assumption is that the pairs are independent of each other.
- The equivalent parametric tests for the sign test are the 1-sample z-test and the 1-sample t-test.
- http://davidmlane.com/hyperstat/B135165.html
35. Wilcoxon Signed-Rank Test
- A test is conducted where customers are asked to rate two products based on various criteria and come up with a score between 0 and 100 for each. The tester's goal is to check whether product A, the new version, is perceived to be superior to product B. The null hypothesis would be that they are equal to each other.
36. Wilcoxon Signed-Rank Test
The measures are rankings by people, so the data are not necessarily interval scaled, and certainly not ratio scaled. Thus a paired-sample t-test is not appropriate. A non-parametric equivalent is the Wilcoxon Signed-Rank Test. This is similar to the sign test, but more sensitive.

Prod A  Prod B  Diff
55      50       5
60      62      -2
77      70       7
82      78       4
99      90       9
92      95      -3
86      90      -4
84      80       4
90      86       4
72      71       1
37. Wilcoxon Signed-Rank Test
- Unlike the sign test, which only looks at whether something is larger or smaller, this test uses the magnitudes of the differences, rank orders them, then applies the sign of each difference and computes the sum of those signed ranks. This statistic (called W) has a sampling distribution that is approximately normal.
- For details on the technique, see the link below.
- Assumptions are:
- Data are at least ordinal in nature
- The pairs are independent of each other
- The dependent variable is continuous in nature
- http://faculty.vassar.edu/lowry/ch12a.html
38. Wilcoxon test in Minitab: Results
Wilcoxon Signed Rank Test: Diff
Test of median = 0.000000 versus median > 0.000000

          N for   Wilcoxon             Estimated
       N   Test  Statistic        P       Median
Diff  10     10       44.5    0.046        2.500
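A minimal sketch of the same test in Python (SciPy assumed), applied to the paired differences from the table on the previous slide; because of ties among the absolute differences, SciPy falls back to a normal approximation, so the p-value may differ slightly from Minitab's.

```python
from scipy import stats

diff = [5, -2, 7, 4, 9, -3, -4, 4, 4, 1]   # Prod A minus Prod B

# One-sided test that the median difference is greater than zero
stat, p_value = stats.wilcoxon(diff, alternative='greater')
print(stat, p_value)   # statistic about 44.5; p close to the 0.046 reported above
```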
39. Mann-Whitney Test: Two Samples
- The story:
- Customers were asked to rate a service in the past, and 10 people did so. After some improvements were made, data were collected again with a new random set of customers. Twelve people responded this time.
- There is no pairing or matching of data, since the samples of customers for the old and the new processes are different.
- http://faculty.vassar.edu/lowry/ch11a.html
40. Mann-Whitney Test: Two Samples
Old  New
60   85
70   85
85   90
78   94
90   90
68   70
35   75
80   90
80   90
75   100
     95
     90
Note that the assumptions of a 2-sample t-test
are violated because the data are not interval
scaled, and may not be normally distributed. The
Mann-Whitney Test is the non-parametric
alternative to the 2-sample t-test.
41. Mann-Whitney Test: Two Samples
- The Mann-Whitney test rank orders all the data, with both columns combined into one. The ranks are then separated by group, so the raw data are now converted to ranks. The sum of the ranks for each column is computed.
- If there is no difference between the groups, the sums of ranks are expected to be in proportion to the sample sizes. Based on this premise, the actual sum is compared to the expected sum, and the statistic is tested for significance.
- See details with another example at this link from Vassar:
- http://faculty.vassar.edu/lowry/ch11a.html
42. Mann-Whitney Test in Minitab: Results
Mann-Whitney Test and CI: Old, New

      N  Median
Old  10   76.50
New  12   90.00

Point estimate for ETA1-ETA2 is -14.00
95.6 Percent CI for ETA1-ETA2 is (-22.00, -5.00)
W = 72.5
Test of ETA1 = ETA2 vs ETA1 < ETA2 is significant at 0.0028
The test is significant at 0.0025 (adjusted for ties)
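A minimal sketch of the same test in Python (SciPy assumed); with ties present, SciPy uses a normal approximation, so the p-value should be close to, but not necessarily identical to, Minitab's.

```python
from scipy import stats

old = [60, 70, 85, 78, 90, 68, 35, 80, 80, 75]
new = [85, 85, 90, 94, 90, 70, 75, 90, 90, 100, 95, 90]

# One-sided test that Old ratings tend to be lower than New ratings
u_stat, p_value = stats.mannwhitneyu(old, new, alternative='less')
print(u_stat, p_value)   # p should be near the ~0.003 reported above
```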
43. Kruskal-Wallis Test: 3 or more samples
- Here the data would be similar to the Mann-Whitney test, except that there are more than 2 samples. For parametric data, one would conduct an ANOVA to test for differences between 3 or more populations. The Kruskal-Wallis test is thus a non-parametric equivalent of ANOVA.
44. Kruskal-Wallis Test: Data
Rating  Factor
7       Adults
5       Adults
6       Adults
4       Adults
2       Adults
6       Adults
5       Adults
9       Teens
9       Teens
8       Teens
5       Teens
9       Teens
10      Teens
7       Teens
8       Teens
3       Children
4       Children
3       Children
5       Children
10      Children
2       Children

Adults  Teens  Children
7       9      3
5       9      4
6       8      3
4       5      5
2       9      10
6       10     2
5       7
        8

The data show ratings of some product by three different groups. The same data are shown stacked (Rating, Factor) in the first table, the form needed to perform the test in Minitab.
45. Kruskal-Wallis Test
- The Kruskal-Wallis test proceeds very similarly to the Mann-Whitney test. The data are all ranked from low to high, and the ranks are then separated by group. For each group, the ranks are summed and averaged.
- Each group average is compared to the overall average, and the deviation is measured, weighted by the number of observations in each group. If the groups were identical, the deviations from the grand mean would be a small number (not 0, as one might intuitively think) that can be computed.
- The actual difference is compared to the expected one (the H statistic is computed) to complete the test. See the link below for details of the computation, if interested.
- http://faculty.vassar.edu/lowry/ch14a.html
46. Kruskal-Wallis Test: Minitab Results
Kruskal-Wallis Test: Rating versus Factor
Kruskal-Wallis Test on Rating

Factor     N  Median  Ave Rank      Z
Adults     7   5.000       8.6  -1.23
Children   6   3.500       7.2  -1.79
Teens      8   8.500      15.9   2.86
Overall   21              11.0

H = 8.37  DF = 2  P = 0.015
H = 8.48  DF = 2  P = 0.014 (adjusted for ties)
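A minimal sketch of the same test in Python (SciPy assumed); SciPy applies the tie correction by default, so its result corresponds to the tie-adjusted line above.

```python
from scipy import stats

adults   = [7, 5, 6, 4, 2, 6, 5]
teens    = [9, 9, 8, 5, 9, 10, 7, 8]
children = [3, 4, 3, 5, 10, 2]

h_stat, p_value = stats.kruskal(adults, teens, children)
print(h_stat, p_value)   # should be close to H = 8.48, p = 0.014
```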
47. Mood's Median Test
Mood median test for Rating
Chi-Square = 10.52  DF = 2  P = 0.005

Factor    N<=  N>  Median  Q3-Q1
Adults      6   1    5.00   2.00
Children    5   1    3.50   3.50
Teens       1   7    8.50   1.75
(Individual 95.0% CI plot omitted; scale roughly 4.0 to 8.0)

Overall median = 6.00

Mood's median test is an alternative to Kruskal-Wallis. It is generally more robust against violations of assumptions, but less powerful.
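A minimal sketch of the same test in Python (SciPy assumed); `median_test` counts values equal to the grand median in the "below" group by default, which matches the N<= / N> counts above, though results can differ slightly depending on how ties at the median are handled.

```python
from scipy import stats

adults   = [7, 5, 6, 4, 2, 6, 5]
teens    = [9, 9, 8, 5, 9, 10, 7, 8]
children = [3, 4, 3, 5, 10, 2]

stat, p_value, grand_median, table = stats.median_test(adults, teens, children)
print(stat, p_value, grand_median)   # should be close to chi-square = 10.52, p = 0.005, median = 6
```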
48. Friedman's Test
- Friedman's test is the non-parametric equivalent of a randomized block design in ANOVA. In other words, there are 3 or more groups, but each row of values across the groups is matched.
- The story:
- A person's performance is rated in a normal state, rated again after introducing noise in the environment, and finally with classical music introduced in the background. This is done for a sample of 7 employees.
49. Friedman's Test: Data
Normal  Noise  Music
7       5      8
8       4      8
6       6      8
9       5      8
5       5      7
7       4      9
8       4      9

Perform  Group   Block
7        Normal  1
8        Normal  2
6        Normal  3
9        Normal  4
5        Normal  5
7        Normal  6
8        Normal  7
5        Noise   1
4        Noise   2
6        Noise   3
5        Noise   4
5        Noise   5
4        Noise   6
4        Noise   7
8        Music   1
8        Music   2
8        Music   3
8        Music   4
7        Music   5
9        Music   6
9        Music   7

The data show the ratings of performance for each person in each of 3 conditions. The same data are stacked in the second table, for doing the test in Minitab. Each person represents a block of data, since the 3 numbers for that person are related.
50. Friedman's Test: Analysis
- Friedman's test also ranks the ratings, but this time the ranking is done within each row: the three scores for each person are ranked 1, 2, and 3. These ranks are then summed and averaged for each group.
- If the groups are identical, one would expect no difference in the sum or mean of rankings for each group. In other words, if the conditions did not affect the performance rating, the rankings would either be the same, or vary randomly across people to yield equal sums.
- The sums are compared to this expectation to test the hypothesis. See the following link for more details.
- http://faculty.vassar.edu/lowry/ch15a.html
51. Friedman's Test in Minitab: Results
Friedman Test: Perform versus Group blocked by Block

S = 9.50   DF = 2  P = 0.009
S = 10.64  DF = 2  P = 0.005 (adjusted for ties)

                 Est  Sum of
Group    N    Median   Ranks
Music    7     8.000    19.5
Noise    7     4.667     8.0
Normal   7     7.333    14.5

Grand median = 6.667
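A minimal sketch of the same test in Python (SciPy assumed); the three lists are the matched measurements per person (each index is a block), and SciPy's implementation includes a tie correction, so its result corresponds to the tie-adjusted line above.

```python
from scipy import stats

normal = [7, 8, 6, 9, 5, 7, 8]
noise  = [5, 4, 6, 5, 5, 4, 4]
music  = [8, 8, 8, 8, 7, 9, 9]

s_stat, p_value = stats.friedmanchisquare(normal, noise, music)
print(s_stat, p_value)   # should be close to S = 10.64, p = 0.005
```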
52. Assignment
- From the book "Doing Data Analysis with Minitab 14" by Robert Carver:
- Pages 293-294: choose any 3 datasets on those pages and answer the related questions (non-parametric tests).