Title: Two-sample tests
1Two-sample tests
2Binary or categorical outcomes (proportions)
Outcome variable: binary or categorical (e.g., fracture, yes/no)
- Independent observations:
  - Chi-square test: compares proportions between two or more groups
  - Relative risks: odds ratios or risk ratios
  - Logistic regression: multivariate technique used when the outcome is binary; gives multivariate-adjusted odds ratios
- Correlated observations:
  - McNemar's chi-square test: compares a binary outcome between correlated groups (e.g., before and after)
  - Conditional logistic regression: multivariate regression technique for a binary outcome when groups are correlated (e.g., matched data)
  - GEE modeling: multivariate regression technique for a binary outcome when groups are correlated (e.g., repeated measures)
- Alternatives to the chi-square test if cells are sparse:
  - Fisher's exact test: compares proportions between independent groups when there are sparse data (some cells < 5)
  - McNemar's exact test: compares proportions between correlated groups when there are sparse data (some cells < 5)
3Recall: the odds ratio (two samples: cases and controls)
- Interpretation: there is a 2.25-fold higher odds of stroke in smokers vs. non-smokers.
4Inferences about the odds ratio
- Does the sampling distribution follow a normal distribution?
- What is the standard error?
5Simulation
- 1. In SAS, assume an infinite population of cases and controls with an equal proportion of smokers (exposure), p = .23 (UNDER THE NULL!)
- 2. Use the random binomial function to randomly select n = 50 cases and n = 50 controls, each with a p = .23 chance of being a smoker.
- 3. Calculate the observed odds ratio for the resulting 2x2 table.
- 4. Repeat this 1000 times (or some large number of times).
- 5. Observe the distribution of odds ratios under the null hypothesis.
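The deck runs this simulation in SAS; as a cross-check, here is a minimal Python re-implementation of the same five steps (the seed and the choice to skip tables with a zero cell, where the OR is undefined, are my assumptions, not from the slides):

```python
import math
import random

random.seed(42)

def simulate_null_or(n_cases=50, n_controls=50, p_exposed=0.23, reps=1000):
    """Sampling distribution of the OR under the null:
    the same exposure probability in cases and controls."""
    log_ors = []
    for _ in range(reps):
        # Count exposed (smokers) among cases and among controls
        a = sum(random.random() < p_exposed for _ in range(n_cases))
        b = sum(random.random() < p_exposed for _ in range(n_controls))
        c, d = n_cases - a, n_controls - b
        if min(a, b, c, d) == 0:
            continue  # OR undefined for a zero cell; essentially never happens here
        log_ors.append(math.log((a * d) / (b * c)))
    return log_ors

log_ors = simulate_null_or()
mean_lnor = sum(log_ors) / len(log_ors)
sd_lnor = math.sqrt(sum((x - mean_lnor) ** 2 for x in log_ors) / (len(log_ors) - 1))
print(round(mean_lnor, 2), round(sd_lnor, 2))
```

Under the null, the ln(OR) values center on 0 with a standard deviation near 0.5, matching the right-skewed OR and the empirical standard error quoted on the next slides.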
6Properties of the OR (simulation)
(50 cases / 50 controls / 23% exposed)
Under the null, this is the expected variability of the sample OR; note the right skew.
7Properties of the lnOR
Normal!
8Properties of the lnOR
From the simulation, we can get the empirical standard error (0.5) and p-value (.10).
9Properties of the lnOR
10Inferences about the ln(OR)
p = .10
11Confidence interval
Final answer: OR = 2.25, 95% CI (0.85, 5.92)
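The confidence limits are obtained on the log scale and exponentiated back. A Python sketch of that arithmetic (the standard error of ln(OR), about 0.495, is back-calculated from the slide's interval, since the 2x2 cell counts for the stroke example are not shown here):

```python
import math

or_hat = 2.25      # point estimate from the stroke example
se_lnor = 0.495    # assumed: back-calculated from the slide's CI, not given

ln_or = math.log(or_hat)
low = math.exp(ln_or - 1.96 * se_lnor)
high = math.exp(ln_or + 1.96 * se_lnor)
print(round(low, 2), round(high, 2))
```

This reproduces a lower limit of 0.85 and an upper limit near 5.9 (small differences from the slide's 5.92 reflect rounding of the standard error).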
12Practice problem
Suppose the following data were collected in a
case-control study of brain tumor and cell phone
usage
Is there sufficient evidence for an
association between cell phones and brain tumor?
13Answer
1. What is your null hypothesis?
- Null hypothesis: OR = 1.0 (lnOR = 0)
- Alternative hypothesis: OR ≠ 1.0 (lnOR ≠ 0)
2. What is your null distribution?
- lnOR ~ N(0, SE²); SD(lnOR) = .44
3. Empirical evidence: OR = (20 × 40)/(60 × 10) = 800/600 = 1.33 → lnOR = .288
4. Z = (.288 − 0)/.44 = .65; two-sided p-value = P(Z > .65 or Z < −.65) ≈ .51
5. Not enough evidence to reject the null hypothesis
of no association
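The arithmetic above can be verified in a few lines of Python (the exact 2x2 layout was lost in conversion, so the cell counts below are assumed from the slide's cross-products (20 × 40)/(60 × 10); the standard error is the usual sum of reciprocal cells, which is the same for either layout):

```python
import math

# Assumed cell counts reproducing the slide's cross-products
a, b, c, d = 20, 10, 60, 40

or_hat = (a * d) / (b * c)                      # 800/600 = 1.33
ln_or = math.log(or_hat)
se = math.sqrt(1/a + 1/b + 1/c + 1/d)           # SE of ln(OR) = .44
z = ln_or / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)
print(round(or_hat, 2), round(se, 2), round(z, 2), round(p_two_sided, 2))
```

Note the two-sided p-value for z = .65 works out to about .51, well above any conventional significance threshold, so the conclusion of no association stands.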
14Key measures of relative risk: 95% CIs for OR and RR
For an odds ratio, 95% confidence limits:
For a risk ratio, 95% confidence limits:
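The formulas on this slide were images that did not survive conversion; for reference, the standard large-sample limits (with 2x2 cells a, b, c, d for the OR, and event proportions p1, p2 in groups of size n1, n2 for the RR) are:

```latex
\text{OR: } \exp\!\left(\ln(\widehat{OR}) \pm 1.96\sqrt{\tfrac{1}{a}+\tfrac{1}{b}+\tfrac{1}{c}+\tfrac{1}{d}}\right)
\qquad
\text{RR: } \exp\!\left(\ln(\widehat{RR}) \pm 1.96\sqrt{\tfrac{1-p_1}{n_1 p_1}+\tfrac{1-p_2}{n_2 p_2}}\right)
```

Both intervals are built on the log scale, where the sampling distribution is approximately normal, and then exponentiated.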
15Continuous outcome (means)
Outcome variable: continuous (e.g., pain scale, cognitive function)
- Independent observations:
  - T-test: compares means between two independent groups
  - ANOVA: compares means between more than two independent groups
  - Pearson's correlation coefficient: shows linear correlation between two continuous variables
  - Linear regression: multivariate regression technique used when the outcome is continuous; gives slopes
- Correlated observations:
  - Paired t-test: compares means between two related groups (e.g., the same subjects before and after)
  - Repeated-measures ANOVA: compares changes over time in the means of two or more groups (repeated measurements)
  - Mixed models / GEE modeling: multivariate regression techniques to compare changes over time between two or more groups; gives rate of change over time
- Alternatives if the normality assumption is violated (and small sample size): non-parametric statistics
  - Wilcoxon signed-rank test: non-parametric alternative to the paired t-test
  - Wilcoxon rank-sum test (Mann-Whitney U test): non-parametric alternative to the two-sample t-test
  - Kruskal-Wallis test: non-parametric alternative to ANOVA
  - Spearman rank correlation coefficient: non-parametric alternative to Pearson's correlation coefficient
16The two-sample t-test
17The two-sample t-test
- Is the difference in means that we observe between two groups more than we'd expect to see based on chance alone?
18The standard error of the difference of two means
- First add the variances, then take the square root of the sum to get the standard error.
- Recall: Var(A − B) = Var(A) + Var(B) if A and B are independent!
19Shown by simulation
One sample of 30 (with SD = 5).
Another sample of 30 (with SD = 5).
Difference of the two samples.
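The slide's simulation can be sketched in Python (the seed, number of repetitions, and arbitrary common mean of 100 are my assumptions; only n = 30 and SD = 5 come from the slide):

```python
import math
import random

random.seed(7)

# Two independent samples of n = 30 from populations with SD = 5,
# equal means under the null; record the difference of sample means.
n, sd, reps = 30, 5.0, 2000
diffs = []
for _ in range(reps):
    x = [random.gauss(100, sd) for _ in range(n)]
    y = [random.gauss(100, sd) for _ in range(n)]
    diffs.append(sum(x) / n - sum(y) / n)

mean_d = sum(diffs) / reps
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (reps - 1))
theory = math.sqrt(sd**2 / n + sd**2 / n)  # add variances, then square-root
print(round(sd_d, 2), round(theory, 2))
```

The empirical SD of the differences closely matches the theoretical standard error sqrt(25/30 + 25/30) ≈ 1.29, which is exactly the "add the variances, then take the square root" rule from the previous slide.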
20Distribution of differences
- If X and Y are the averages of n and m subjects,
respectively
21But
- As before, you usually have to use the sample SD, since you won't know the true SD ahead of time.
- So, again, this becomes a t-distribution...
22Estimated standard error of the difference.
23Case 1: unpooled variance
Question: What are your degrees of freedom here? Answer: Not obvious!
24Case 1: t-test, unpooled variances
It is complicated to figure out the degrees of freedom here! A good approximation is given as df = harmonic mean (or SAS will tell you!).
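The approximation formula itself was an image that did not survive conversion; for reference, the Satterthwaite approximation that SAS reports for the unequal-variance case is:

```latex
df \approx \frac{\left(\dfrac{s_x^2}{n} + \dfrac{s_y^2}{m}\right)^2}
{\dfrac{(s_x^2/n)^2}{n-1} + \dfrac{(s_y^2/m)^2}{m-1}}
```

This always lands between min(n − 1, m − 1) and n + m − 2, which is why the unpooled test never gains degrees of freedom over the pooled one.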
25Case 2: pooled variance
If you assume that the standard deviation of the
characteristic (e.g., IQ) is the same in both
groups, you can pool all the data to estimate a
common standard deviation. This maximizes your
degrees of freedom (and thus your power).
26Estimated standard error (using pooled variance
estimate)
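The estimate on this slide was an image; the standard pooled form it refers to, consistent with the preceding slides' notation, is:

```latex
s_p^2 = \frac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2},
\qquad
SE(\bar{X}-\bar{Y}) = s_p\sqrt{\frac{1}{n} + \frac{1}{m}}
```

The pooled variance is just a degrees-of-freedom-weighted average of the two sample variances, which is why it "maximizes your degrees of freedom" (n + m − 2) as the slide says.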
27Case 2: t-test, pooled variances
28Alternate calculation formula: t-test, pooled variance
29Pooled vs. unpooled variance
- Rule of thumb: Use pooled unless you have a reason not to.
- Pooled gives you more degrees of freedom.
- Pooled has an extra assumption: variances are equal between the two groups.
- SAS automatically tests this assumption for you ("Equality of Variances" test). If p < .05, this suggests unequal variances, and it is better to use the unpooled t-test.
30Example two-sample t-test
- In 1980, some researchers reported that men have more mathematical ability than women, as evidenced by the 1979 SATs, where a sample of 30 random male adolescents had a mean score ± 1 standard deviation of 436 ± 77, and 30 random female adolescents scored lower, 416 ± 81 (genders were similar in educational backgrounds, socio-economic status, and age). Do you agree with the authors' conclusions?
31Data Summary
Group / n / Sample mean / Sample SD
Group 1 (women): 30 / 416 / 81
Group 2 (men): 30 / 436 / 77
32Two-sample t-test
- 1. Define your hypotheses (null, alternative)
- H0: µ1 − µ2 (math SAT) = 0
- Ha: µ1 − µ2 (math SAT) ≠ 0, two-sided
33Two-sample t-test
- 2. Specify your null distribution
- F and M have similar standard deviations/variances, so make a pooled estimate of the variance.
34Two-sample t-test
- 3. Observed difference in our experiment: 20 points
35Two-sample t-test
- 4. Calculate the p-value of what you observed:
data _null_;
  pval = (1 - probt(0.98, 58))*2;
  put pval;
run;
Output: 0.3311563454
- 5. Do not reject the null! No evidence that men are better at math :)
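The SAS step above only evaluates the tail probability; the t-statistic of 0.98 that it relies on can be reproduced from the summary data (a Python sketch, since the slides do not show this intermediate computation):

```python
import math

# Summary data from the slides: women 416 +/- 81, men 436 +/- 77, n = 30 each
n1, mean1, sd1 = 30, 416.0, 81.0   # women
n2, mean2, sd2 = 30, 436.0, 77.0   # men

# Pooled variance: df-weighted average of the two sample variances
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
se = math.sqrt(sp2) * math.sqrt(1/n1 + 1/n2)
t = (mean2 - mean1) / se
print(round(se, 1), round(t, 2))  # t = 0.98, the value plugged into probt()
```

With t = 0.98 on 58 degrees of freedom, SAS's `(1 - probt(0.98, 58))*2` gives the two-sided p of about .33 shown above.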
36Example 2 Difference in means
- Example: Rosenthal, R. and Jacobson, L. (1966) Teachers' expectancies: Determinants of pupils' I.Q. gains. Psychological Reports, 19, 115-118.
37The Experiment (note: exact numbers have been altered)
- Grade 3 students at Oak School were given an IQ test at the beginning of the academic year (n = 90).
- Classroom teachers were given a list of names of students in their classes who had supposedly scored in the top 20 percent; these students were identified as "academic bloomers" (n = 18).
- BUT the children on the teachers' lists had actually been randomly assigned to the list.
- At the end of the year, the same I.Q. test was re-administered.
38Example 2
- Statistical question: Do students in the treatment group have more improvement in IQ than students in the control group?
- What will we actually compare?
- One-year change in IQ score in the treatment group vs. one-year change in IQ score in the control group.
39Results
Change in IQ score, mean (SD):
- Academic bloomers (n = 18): 12.2 (2.0)
- Controls (n = 72): 8.2 (2.0)
12.2 points vs. 8.2 points: Difference = 4 points
40What does a 4-point difference mean?
- Before we perform any formal statistical analysis on these data, we already have a lot of information.
- Look at the basic numbers first; THEN consider statistical significance as a secondary guide.
41Is the association statistically significant?
- This 4-point difference could reflect a true effect, or it could be a fluke.
- The question: is a 4-point difference bigger or smaller than the expected sampling variability?
42Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no difference between academic bloomers and normal students (i.e., the difference is 0).
43Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true.
- These predictions can be made by mathematical theory or by computer simulation.
44Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true: math theory
45Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true: computer simulation
- In computer simulation, you simulate taking repeated samples of the same size from the same population and observe the sampling variability.
- I used computer simulation to take 1000 samples of 18 treated and 72 controls.
46Computer Simulation Results
47Step 3: Empirical data
- Observed difference in our experiment: 12.2 − 8.2 = 4.0
48Step 4: P-value
- A t-curve with 88 df's has slightly wider cut-offs for 95% of the area (t = 1.99) than a normal curve (Z = 1.96).
p-value < .0001
49Visually
50Step 5: Reject the null!
- Conclusion: I.Q. scores can bias expectancies in the teachers' minds and cause them to unintentionally treat "bright" students differently from those seen as less bright.
51Confidence interval (more information!!)
- 95% CI for the difference: 4.0 ± 1.99(.52) = (3.0, 5.0)
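The standard error of .52 and the interval can be reproduced from the slide's summary numbers (a Python sketch; the slides used a common SD of 2.0 for both groups, so the pooled SD is just 2.0):

```python
import math

# Slide data: bloomers n = 18, controls n = 72, change-score SD = 2.0 in both,
# observed difference = 4.0 IQ points
n1, n2, sd, diff = 18, 72, 2.0, 4.0

se = sd * math.sqrt(1/n1 + 1/n2)   # pooled SE; common SD of 2.0
t = diff / se
low = diff - 1.99 * se             # 1.99 = t critical value for 88 df
high = diff + 1.99 * se
print(round(se, 2), round(t, 1), round(low, 1), round(high, 1))
```

This gives SE ≈ .53, t ≈ 7.6 (hence p < .0001), and the (3.0, 5.0) interval on the slide.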
52What if our standard deviation had been higher?
- The standard deviations for change scores in treatment and control were each 2.0. What if change scores had been much more variable, say a standard deviation of 10.0 (for both)?
54With a std. dev. of 10.0: LESS STATISTICAL POWER!
55Don't forget: the paired t-test
- Did the control group in the previous experiment improve at all during the year?
- Do not apply a two-sample t-test to answer this question!
- After − Before yields a single sample of differences.
- This is a within-group rather than between-group comparison.
56Continuous outcome (means)
(Decision table for continuous outcomes repeated from slide 15.)
57Data Summary
Group / n / Sample mean / Sample SD
Control group, change in IQ: 72 / 8.2 / 2.0
58Did the control group in the previous experiment improve at all during the year?
p-value < .0001
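The within-group question is a one-sample (paired-differences) t-test; a Python sketch using the control-group summary numbers from the slides:

```python
import math

# Controls' change scores: n = 72, mean change = 8.2, SD = 2.0
n, mean_change, sd = 72, 8.2, 2.0

se = sd / math.sqrt(n)   # one-sample standard error of the mean change
t = mean_change / se     # testing H0: mean change = 0
print(round(t, 1))
```

The t-statistic is enormous (about 35 on 71 df), consistent with the p < .0001 on the slide: the controls clearly improved during the year too.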
59Normality assumption of ttest
- If the distribution of the trait is normal, it is fine to use a t-test.
- But if the underlying distribution is not normal and the sample size is small (rule of thumb: n > 30 per group if not too skewed; n > 100 if the distribution is really skewed), the Central Limit Theorem takes some time to kick in, and you cannot use the t-test.
- Note: otherwise, the t-test is very robust against violations of the normality assumption!
60Alternative tests when normality is violated
Non-parametric tests
61Continuous outcome (means)
(Decision table for continuous outcomes repeated from slide 15.)
62Non-parametric tests
- t-tests require your outcome variable to be normally distributed (or close enough) for small samples.
- Non-parametric tests are based on RANKS instead of means and standard deviations (population parameters).
63Example non-parametric tests
10 dieters following the Atkins diet vs. 10 dieters following Jenny Craig.
Hypothetical RESULTS: the Atkins group loses an average of 34.5 lbs.; the J. Craig group loses an average of 18.5 lbs.
Conclusion: Atkins is better?
64Example non-parametric tests
BUT, take a closer look at the individual data:
Atkins, change in weight (lbs): +4, +3, 0, −3, −4, −5, −11, −14, −15, −300
J. Craig, change in weight (lbs): −8, −10, −12, −16, −18, −20, −21, −24, −26, −30
65Jenny Craig
[Histogram: percent of dieters (0-30%) by weight change in lbs (−30 to +20).]
66Atkins
[Histogram: percent of dieters (0-30%) by weight change in lbs (−300 to +20); one extreme outlier at −300.]
67t-test inappropriate
- Comparing the mean weight loss of the two groups is not appropriate here.
- The distributions do not appear to be normally distributed.
- Moreover, there is an extreme outlier (this outlier influences the mean a great deal).
68Wilcoxon rank-sum test
- RANK the values: 1 = least weight loss, 20 = most weight loss.
- Atkins values: +4, +3, 0, −3, −4, −5, −11, −14, −15, −300
  ranks: 1, 2, 3, 4, 5, 6, 9, 11, 12, 20
- J. Craig values: −8, −10, −12, −16, −18, −20, −21, −24, −26, −30
  ranks: 7, 8, 10, 13, 14, 15, 16, 17, 18, 19
69Wilcoxon rank-sum test
- Sum of Atkins' ranks: 1 + 2 + 3 + 4 + 5 + 6 + 9 + 11 + 12 + 20 = 73
- Sum of Jenny Craig's ranks: 7 + 8 + 10 + 13 + 14 + 15 + 16 + 17 + 18 + 19 = 137
- Jenny Craig clearly ranked higher!
- P-value (from computer): .018
For details of the statistical test, see the
appendix of these slides
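The ranking and rank sums are easy to reproduce (a Python sketch; the U statistics use the convention from the appendix of these slides, and the exact p-value of .018 comes from software or a U table, which the code below does not recompute):

```python
atkins = [4, 3, 0, -3, -4, -5, -11, -14, -15, -300]
jcraig = [-8, -10, -12, -16, -18, -20, -21, -24, -26, -30]

# Rank 1 = least weight loss (largest value), rank 20 = most weight loss
combined = sorted(atkins + jcraig, reverse=True)
rank_of = {value: i + 1 for i, value in enumerate(combined)}  # no ties here

t_atkins = sum(rank_of[v] for v in atkins)
t_jcraig = sum(rank_of[v] for v in jcraig)
n1 = n2 = 10
u1 = n1 * n2 + n1 * (n1 + 1) // 2 - t_atkins
u2 = n1 * n2 + n2 * (n2 + 1) // 2 - t_jcraig
u0 = min(u1, u2)
print(t_atkins, t_jcraig, u0)
```

This confirms rank sums of 73 (Atkins) vs. 137 (Jenny Craig), with the smaller U statistic U0 = 18; note the −300 outlier gets rank 20 and no more, which is exactly why the rank-based test is robust to it.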
70Binary or categorical outcomes (proportions)
(Decision table for binary/categorical outcomes repeated from slide 2.)
71Difference in proportions (special case of
chi-square test)
72Null distribution of a difference in proportions
73Null distribution of a difference in proportions
74Difference in proportions test
Null hypothesis: The difference in proportions is 0.
75Recall case-control example
76Absolute risk Difference in proportions exposed
77Difference in proportions exposed
78Example 2 Difference in proportions
- Research question: Are antidepressants a risk factor for suicide attempts in children and adolescents?
- Example modified from: "Antidepressant Drug Therapy and Suicide in Severely Depressed Children and Adults." Olfson et al. Arch Gen Psychiatry. 2006;63:865-872.
79Example 2 Difference in Proportions
- Design: Case-control study
- Methods: Researchers used Medicaid records to compare prescription histories between 263 children and teenagers (6-18 years) who had attempted suicide and 1241 controls who had never attempted suicide (all subjects suffered from depression).
- Statistical question: Is a history of use of antidepressants more common among cases than controls?
80Example 2
- Statistical question: Is a history of use of antidepressants more common among suicide-attempt cases than controls?
- What will we actually compare?
- The proportion of cases who used antidepressants in the past vs. the proportion of controls who did.
81Results
Any antidepressant drug ever, no. (%):
- Cases (n = 263): 120 (46%)
- Controls (n = 1241): 448 (36%)
46% vs. 36%: Difference = 10%
82Is the association statistically significant?
- This 10% difference could reflect a true association, or it could be a fluke in this particular sample.
- The question: is 10% bigger or smaller than the expected sampling variability?
83Hypothesis testing
Step 1: Assume the null hypothesis.
Null hypothesis: There is no association between antidepressant use and suicide attempts in the target population (i.e., the difference is 0).
84Hypothesis Testing
Step 2: Predict the sampling variability assuming the null hypothesis is true.
85Also Computer Simulation Results
86Hypothesis Testing
Step 3: Do an experiment.
We observed a difference of 10% between cases and controls.
87Hypothesis Testing
Step 4: Calculate a p-value.
88P-value from our simulation
89P-value
From our simulation, we estimate the p-value to be 4/1000, or .004.
90Hypothesis Testing
Step 5: Reject or do not reject the null hypothesis.
Here we reject the null. Alternative hypothesis: There is an association between antidepressant
use and suicide in the target population.
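The simulation-based p-value of .004 can be checked against the usual two-proportion z-test (a Python sketch; the slides used simulation rather than this formula):

```python
import math

# Slide data: 120/263 cases vs. 448/1241 controls ever used antidepressants
x1, n1 = 120, 263
x2, n2 = 448, 1241

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
print(round(p1 - p2, 2), round(z, 1), round(p_two_sided, 3))
```

The formula gives z ≈ 2.9 and a two-sided p of about .004, agreeing closely with the 4/1000 from the simulation.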
91What would a lack of statistical significance
mean?
- If this study had sampled only 50 cases and 50 controls, the sampling variability would have been much higher, as shown in this computer simulation:
93With only 50 cases and 50 controls
94Two-tailed p-value
95Practice problem
- An August 2003 research article in Developmental and Behavioral Pediatrics reported the following about a sample of UK kids: when given a choice of a non-branded chocolate cereal vs. CoCo Pops, 97% (36) of 37 girls and 71% (27) of 38 boys preferred the CoCo Pops. Is this evidence that girls are more likely to choose brand-named products?
96Answer
- 1. Hypotheses:
- H0: p1 − p2 = 0
- Ha: p1 − p2 ≠ 0, two-sided
- 2. Null distribution of the difference of two proportions
- 3. Observed difference in our experiment: .97 − .71 = .26
- 4. Calculate the p-value of what you observed:
data _null_;
  pval = (1 - probnorm(3.06))*2;
  put pval;
run;
Output: 0.0022133699
- 5. The p-value is sufficiently low for us to reject the null; there does appear to be a difference in
gender preferences here.
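The z = 3.06 fed to `probnorm` above can be reproduced with a two-proportion z-test (a Python sketch; the slide rounds the proportions to .97 and .71 before dividing, so working with the exact counts gives a slightly larger z of about 3.10 and a correspondingly smaller p, with the same conclusion):

```python
import math

# Slide data: 36/37 girls vs. 27/38 boys preferred CoCo Pops
x1, n1 = 36, 37
x2, n2 = 27, 38

p1, p2 = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under the null
se = math.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
z = (p1 - p2) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))
print(round(z, 2), round(p_two_sided, 4))
```

Either way the two-sided p is around .002, so the null is rejected.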
97Key two-sample Hypothesis Tests
- Test for H0: µx − µy = 0 (σ² unknown, but roughly equal)
- Test for H0: p1 − p2 = 0
98Corresponding confidence intervals
- For a difference in means, 2 independent samples (σ²'s unknown but roughly equal)
- For a difference in proportions, 2 independent samples
99Appendix details of rank-sum test
100Wilcoxon Rank-sum test
101Example
- For example, if team 1 and team 2 (two gymnastic
teams) are competing, and the judges rank all the
individuals in the competition, how can you tell
if team 1 has done significantly better than team
2 or vice versa?
102Answer
- Intuition, under the null hypothesis of no difference between the two groups:
- If n1 = n2, the sums T1 and T2 should be equal.
- But if n1 ≠ n2, then T2 (where n2 = bigger group) should automatically be bigger. But how much bigger, under the null?
- For example, if team 1 has 3 people and team 2 has 10, we could rank all 13 participants from 1 to 13 on individual performance. If team 1 (X) and team 2 don't differ in talent, the ranks ought to be spread evenly among the two groups, e.g.:
- 1 2 X 4 5 6 X 8 9 10 X 12 13 (exactly even distribution if team 1 ranks 3rd, 7th, and 11th)
104It turns out that, if the null hypothesis is true, the difference between the larger-group sum of ranks and the smaller-group sum of ranks translates directly into the difference between U1 and U2:
105From slides 23 and 24:
Here, under the null: U2 = 55 + 30 − 70 = 15; U1 = 6 + 30 − 21 = 15; U2 = U1, and U1 + U2 = 30.
106→ Under the null hypothesis, U1 should equal U2.
- The U's should be equal to each other, and U1 + U2 = n1n2.
- Under the null hypothesis, U1 = U2 = U0, so E(U1 + U2) = 2E(U0) = n1n2, and E(U0) = n1n2/2.
- So the test statistic here is not quite the difference in the sums of ranks of the 2 groups: it's the smaller observed U value, U0.
- For small n's, take U0 and get the p-value directly from a U table.
107For large enough n's (>10 per group):
108Add observed data to the example
- Example: the girls on the two gymnastics teams were ranked as follows:
- Team 1: 1, 5, 7 → observed T1 = 13
- Team 2: 2, 3, 4, 6, 8, 9, 10, 11, 12, 13 → observed T2 = 78
- Are the teams significantly different?
- Total sum of ranks: 13(14)/2 = 91; n1n2 = 3 × 10 = 30
- Under the null hypothesis: expect U1 − U2 = 0 and U1 + U2 = 30 (each should equal about 15), so E(U0) = 15.
- U1 = 30 + 6 − 13 = 23
- U2 = 30 + 55 − 78 = 7
- → U0 = 7
- Not quite statistically significant: from the U table, p = .1084 (see attached), × 2 for a two-tailed
test
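The U arithmetic on this slide can be sketched in Python using the appendix's formula U = n1n2 + n(n+1)/2 − T:

```python
team1 = [1, 5, 7]                                 # observed ranks, team 1
team2 = [2, 3, 4, 6, 8, 9, 10, 11, 12, 13]        # observed ranks, team 2

t1, t2 = sum(team1), sum(team2)
n1, n2 = len(team1), len(team2)
u1 = n1 * n2 + n1 * (n1 + 1) // 2 - t1            # 30 + 6 - 13
u2 = n1 * n2 + n2 * (n2 + 1) // 2 - t2            # 30 + 55 - 78
u0 = min(u1, u2)
print(t1, t2, u1, u2, u0)
```

This reproduces U1 = 23, U2 = 7, and the test statistic U0 = 7, which is then looked up in the exact U table.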
109Example problem 2
A study was done to compare the Atkins Diet (low-carb) vs. Jenny Craig (low-cal, low-fat). The following weight changes were obtained; note they are very skewed because someone lost 100 pounds. The mean loss for Atkins is going to look higher because of that outlier, but does that mean the diet is better overall? Conduct a Mann-Whitney U test to compare ranks.
110Answer
Corresponding ranks (lower rank = more weight loss!):
Sum of ranks for JC = 25 (n = 5); sum of ranks for Atkins = 41 (n = 6)
n1n2 = 5 × 6 = 30; under the null hypothesis, expect U1 − U2 = 0 and U1 + U2 = 30, with E(U0) = 15.
U1 = 30 + 15 − 25 = 20
U2 = 30 + 21 − 41 = 10
U0 = 10; n1 = 5, n2 = 6 → go to the
Mann-Whitney chart: p = .2143, × 2 ≈ .43 for a two-tailed test.
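Only the rank sums survive on this slide (the data table was an image), but the U computation can still be checked from them (a Python sketch using the same appendix formula):

```python
# Rank sums from the slide: Jenny Craig = 25 (n = 5), Atkins = 41 (n = 6)
t_jc, n_jc = 25, 5
t_atkins, n_atkins = 41, 6

n1n2 = n_jc * n_atkins                            # 30
u_jc = n1n2 + n_jc * (n_jc + 1) // 2 - t_jc       # 30 + 15 - 25
u_atkins = n1n2 + n_atkins * (n_atkins + 1) // 2 - t_atkins  # 30 + 21 - 41
u0 = min(u_jc, u_atkins)
print(u_jc, u_atkins, u0)
```

This confirms U1 = 20, U2 = 10, and U0 = 10; the one-tailed p of .2143 from the exact U table then doubles to about .43, so the skewed Atkins "advantage" is nowhere near significant.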