Title: Stat 10x
1Stat 10x
- J. Chang
- Tuesday, 9/26/00
2Administrative Notes
- Go to Section on Thursday!
3Scatterplots
Last time
- Plot two variables simultaneously.
- Put one variable on the horizontal axis, the other
variable on the vertical axis.
4E.g. weight vs. height
Last time
5E.g. pulse vs. weight
Last time
6Correlation
Last time
- Measures strength and direction of linear
relationship between two variables.
- Between -1 and 1.
- 1: perfect linear relationship, positive slope
- -1: perfect linear relationship, negative slope
7Definition of correlation
Last time
- That is:
  - standardize each x_i and y_i,
  - multiply, and
  - average
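The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `correlation` and the toy data are made up here, and the "average" is assumed to divide by n - 1 so that it matches the sample SD.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: standardize each x_i and y_i, multiply, then average.

    Assumption: the 'average' divides by n - 1, matching the sample SD
    returned by statistics.stdev.
    """
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Standardize each pair and multiply the two standard scores.
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    # "Average" the products.
    return sum(products) / (n - 1)

# Hypothetical toy data (not from the lecture).
heights = [60, 62, 65, 68, 70]
weights = [110, 120, 140, 155, 170]
r = correlation(heights, weights)
print(round(r, 3))
```

Points that lie exactly on a line with positive slope give r = 1, and on a line with negative slope give r = -1.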
8Rough idea of definition
Last time
9A small example worked out in detail by hand
Last time
- Go to blackboard. People might want to take
notes.
10Fathers and sons data
Last time
Correlation r ≈ 0.5
What is the average height of a son whose father is 72
inches tall?
11Descriptive statistics on father-son data
Last time
- Fathers: Mean = 68, SD = 3
- Sons: Mean = 69, SD = 3
- Average height of son if father is 72 inches?
- A natural guess: Father is 4/3 SDs above mean,
so guess son will be 4/3 SDs above mean, or
73 inches
12Best guess depends on correlation
Last time
- Guess that son will be, not 4/3 SDs above
mean, but correlation × 4/3 = 2/3 SDs above
mean.
- That is, in our example, guess son's height to
be 69 + (2/3) × 3 = 71 inches.
13Equation of the regression line
Last time
Just a formula for all the best guesses
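As a sketch of "a formula for all the best guesses", the rule "move r times as many SDs as x did" can be written as a line with slope r × (SD of y) / (SD of x) passing through the point of means. The function names below are illustrative, not from the lecture; the numbers are the father-son summary statistics, with r taken as 0.5.

```python
def regression_line(mean_x, sd_x, mean_y, sd_y, r):
    """Return (slope, intercept) of the regression line of y on x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x  # line passes through (mean_x, mean_y)
    return slope, intercept

def best_guess(x, mean_x, sd_x, mean_y, sd_y, r):
    """Predicted y for a given x, read off the regression line."""
    slope, intercept = regression_line(mean_x, sd_x, mean_y, sd_y, r)
    return intercept + slope * x

# Father-son summary statistics from the slides (r is approximately 0.5).
son_guess = best_guess(72, mean_x=68, sd_x=3, mean_y=69, sd_y=3, r=0.5)
natural_guess = 69 + ((72 - 68) / 3) * 3  # ignores the correlation

print(son_guess)      # regression guess, in inches
print(natural_guess)  # the "natural guess", in inches
```

The regression guess reproduces the 71 inches worked out by hand, against 73 inches for the natural guess.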
14Today
- Notes on regression
- least-squares regression idea
- fraction of variability explained
- Bivariate normal distribution
- Analysis in strips
- Lurking variables
- Perils of aggregation
- Simpson's paradox
- An example of simulation
15Least squares regression
Imagine fitting a line through some data
16Observed y = (predicted or fitted y) + (error or
residual)
17The least squares criterion
Want residuals small: minimize the sum of
squared residuals.
[Figure panels: bad fit, better fit]
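To make the criterion concrete, here is a small sketch that scores candidate lines by their sum of squared residuals and compares an arbitrary line with the least-squares line. The data and function names are hypothetical; the slope and intercept formulas are the standard closed-form least-squares solution.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of (observed y - fitted y)^2 for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

bad_fit = sum_squared_residuals(xs, ys, intercept=5.0, slope=0.0)
a, b = least_squares_line(xs, ys)
better_fit = sum_squared_residuals(xs, ys, a, b)

print(bad_fit, better_fit)
```

Nudging the least-squares intercept or slope in either direction can only increase the sum of squared residuals.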
18Flat lines...
y = c
Q: Which c gives the least-squares fit?
Another property of the mean.
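A quick numerical check of this property, as a sketch with made-up data: among a handful of candidate values of c, the sum of squared deviations is smallest at the mean.

```python
from statistics import mean

def sum_squares_about(ys, c):
    """Sum of squared residuals for the flat line y = c."""
    return sum((y - c) ** 2 for y in ys)

# Hypothetical data (not from the lecture).
ys = [3, 5, 8, 10, 14]
y_bar = mean(ys)

# Try the mean and a few values on either side of it.
candidates = [y_bar - 1, y_bar - 0.1, y_bar, y_bar + 0.1, y_bar + 1]
best_c = min(candidates, key=lambda c: sum_squares_about(ys, c))

print(best_c == y_bar)
```

This only checks a few candidates; the general result follows from setting the derivative of the sum of squares with respect to c to zero.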
19r²
20Interpretation
That is,
21Bivariate normal distributions
22Distributions within vertical strips in a
bivariate Normal distribution
Consider y values in a narrow vertical strip at
x. These have a Normal distribution, with mean on the
regression line and SD equal to sY × √(1 − r²).
- SD within a strip is always ≤ sY (sY is the
SD over all individuals)
- If r = 1 then SD in a strip is 0
- If r = 0 then SD in a strip is the same as sY
23Example
1. What percentage of students score over 75 on
final exam?
Easy: 75 is (75 − 65)/10 = 1 SD above the
mean. Answer is 1 − Φ(1) ≈ 0.16 (16%).
24Example (cont.)
              Mean   SD
LSAT scores    650   80
Final exams     65   10
r = 0.6
2. Among students who get 730 on LSAT, what
fraction get over 75 on final exam?
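730 is 1 SD above the LSAT mean, so the strip has
mean 65 + 0.6 × 10 = 71 and SD 10 × √(1 − 0.6²) = 8.
So we want the fraction of the N(71, 8) distribution to the
right of 75. Standard score for 75 is (75 − 71)/8
= 0.5. Answer: 1 − Φ(0.5) ≈ 0.31. (Compare
previous 0.16.)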
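Both parts of the example can be reproduced with Python's standard `statistics.NormalDist`. This is a sketch under the bivariate Normal assumption; the helper name `strip_distribution` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the slides.
mean_lsat, sd_lsat = 650, 80
mean_final, sd_final = 65, 10
r = 0.6

# Part 1: all students, fraction scoring over 75 on the final.
p_all = 1 - NormalDist(mean_final, sd_final).cdf(75)

def strip_distribution(x):
    """Distribution of final scores among students with LSAT score x."""
    z_x = (x - mean_lsat) / sd_lsat
    strip_mean = mean_final + r * z_x * sd_final  # on the regression line
    strip_sd = sd_final * sqrt(1 - r ** 2)        # smaller than sd_final
    return NormalDist(strip_mean, strip_sd)

# Part 2: students with LSAT 730, fraction scoring over 75 on the final.
strip = strip_distribution(730)
p_strip = 1 - strip.cdf(75)

print(round(p_all, 2), round(p_strip, 2))
```

Knowing the LSAT score both shifts the center of the final-exam distribution and narrows it.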
25A Pythagorean identity
Ignoring divisions by n-1, this says:
(Variance of fitted values around mean)
+ (Variance of y's around fitted values)
= (Variance of y's).
26fraction of variance explained by the regression
Easily derived from the equation of the
regression line, which we know. Homework?
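The identity and the "fraction of variance explained" reading of r² can be checked numerically. The sketch below uses made-up data and sums of squares (that is, it ignores the divisions by n - 1); the variable names are illustrative.

```python
from math import sqrt

# Hypothetical data (not from the lecture).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.9, 4.5, 4.1, 6.3, 6.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares line and its fitted values.
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in xs]

ss_fitted = sum((f - y_bar) ** 2 for f in fitted)          # fitted around mean
ss_resid = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # y's around fitted
ss_total = s_yy                                            # y's around mean

r = s_xy / sqrt(s_xx * s_yy)
fraction_explained = ss_fitted / ss_total

print(ss_fitted + ss_resid, ss_total)
print(fraction_explained, r ** 2)
```

Each printed pair should agree up to floating-point rounding.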
27Notes about regression
- Least-squares regression is not robust
(resistant)
- Two kinds of interesting points
  - Outlier: a point with a large residual
  - Influential point: if removed, causes a large
change in the regression line
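The distinction can be illustrated by refitting the line with one extra point added. This is a sketch on made-up data: point A sits far from the line in the y direction but at the mean of the x values, while point B sits far out in the x direction.

```python
def fit_line(points):
    """Least-squares (intercept, slope) for a list of (x, y) points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def residual(point, line):
    """Observed y minus fitted y for one point."""
    intercept, slope = line
    x, y = point
    return y - (intercept + slope * x)

# Hypothetical data (not from the lecture).
base = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.0)]
point_a = (3, 8.0)   # unusual y, x at the center of the data
point_b = (15, 3.0)  # unusual x, far from the rest

line_base = fit_line(base)
line_a = fit_line(base + [point_a])
line_b = fit_line(base + [point_b])

# A: large residual (outlier), but the slope does not change.
print(residual(point_a, line_a), line_a[1] - line_base[1])
# B: modest residual, but the slope changes a lot (influential).
print(residual(point_b, line_b), line_b[1] - line_base[1])
```

In this toy data set, A is an outlier without changing the slope, and B is influential without having the largest residual, since it pulls the line toward itself.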
28A little example
29little example (cont)
Outlier?
No.
Yes.
Influential?
32Lurking variables
A variable that has an important effect but was
overlooked.
Danger: confounding. Thinking an effect is due to
one variable when it is better explained by
another (lurking) variable.
1971 study: People who drink a lot of coffee have higher
incidence of bladder cancer. Correlation
noticed. Causation?
33Lurking variables (cont.)
1993: A larger study concluded that, after
adjusting for the effects of smoking, there was no evidence
for increased risk from coffee.
Spurious correlations: The correlation is real,
but causation isn't.
34Lurking variables (cont.)
Lurking vars can also hide real correlations.
(...or even reverse correlations)
35More on the perils of aggregation: Simpson's
paradox
Categorical data
            Hospital A   Hospital B
Died               300           50
Survived          3000         1000
If you needed surgery, which hospital would you
prefer?
36Simpson's paradox (cont.)
            Hospital A   Hospital B
Died               300           50
Survived          3000         1000
Maybe:
- Hospital A: university medical center,
attracts seriously ill patients from a wide
area
- Hospital B: local, fewer seriously sick
patients.
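To see how this explanation could play out, here is a sketch that splits the table by patient condition. The split is entirely hypothetical: the slides give only the totals, and the breakdown below was invented so that it adds up to those totals.

```python
# Hypothetical breakdown (died, survived) by hospital and condition.
# Only the column totals (A: 300/3000, B: 50/1000) come from the slides.
counts = {
    "A": {"serious": (280, 1720), "mild": (20, 1280)},
    "B": {"serious": (30, 120), "mild": (20, 880)},
}

def death_rate(died, survived):
    """Fraction of patients who died."""
    return died / (died + survived)

def overall_rate(hospital):
    """Death rate after aggregating over patient condition."""
    died = sum(d for d, _ in counts[hospital].values())
    survived = sum(s for _, s in counts[hospital].values())
    return death_rate(died, survived)

for hospital in ("A", "B"):
    by_condition = {cond: round(death_rate(*cell), 3)
                    for cond, cell in counts[hospital].items()}
    print(hospital, round(overall_rate(hospital), 3), by_condition)
```

With these invented numbers, Hospital A has the higher death rate overall but the lower death rate within each group of patients, because it treats far more seriously ill patients.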
37Simpson's paradox (cont.)
Another (real) example: U.C. Berkeley,
1970s. Committee searched for discrimination --
higher percentage of male applicants accepted
into grad school than female. Looking at
individual depts, no evidence of admitting men
more than women -- if anything, the reverse. ???
Men were applying more to depts with higher
acceptance rates, women applying more to depts
that were harder to get into.