Title: ScatterPlots, Correlation, Regression
1ScatterPlots, Correlation, Regression
3- Annual wine consumption (liters of alcohol from
drinking wine/person) and deaths/100,000
4More Examples
- Whenever one variable (characteristic) of the
subjects can reasonably be used to explain
another, the first is called the explanatory
variable and the second the response variable.
- However, this does NOT imply that the explanatory
variable actually causes the response, even when
the two have a strong relationship.
- IQ scores and school grades (strong relationship,
not cause).
- Family income level and education level (either
could be explanatory; the other is then the
response variable).
- Someone's height and grades (not reasonable?).
- Husband's age and wife's age (strong
relationship, not cause).
5More Examples
- A study of the effect of alcohol on the change
in body temperature.
- Intake of carbohydrates and weight.
- Overweight and mortality rate.
- A study of antibiotics and breast cancer.
- Heights of parents and heights of children.
- Explanatory and response variables don't have to
be quantitative, but often they are.
6More Examples
- Observed meditation practice and age-related
enzyme level (general concern for one's well-being
may also affect both the response and the
decision to try meditation). Noetic Sciences
Review, Summer 1993, page 28.
- Lack of social relationships and frequency of
illness (is there another factor that predisposes
people both to have lower social activity and to
become ill?). Science, Vol. 241 (1988), pp. 540-545.
7ScatterPlots
- If we have two quantitative variables
(characteristics) of the subjects, we plot one
variable against the other. The result is called
a scatterplot (a graphical representation of the
two quantitative variables).
- If one of the two quantitative variables is
explanatory and the other is the response, the
explanatory variable goes on the horizontal axis
and the response variable on the vertical axis.
8How to draw a scatterplot: an example
Scatterplot of a car's speed (miles/hr) and fuel
economy (miles/gal) at different times.
Horizontal axis: Speed (explanatory). Vertical
axis: Miles/gal (response).
9One of the reasons we care about the overall
pattern is that we want to make predictions!
(Example: previous slide.)
10Page 85
11Page 95
12If the overall pattern of the scatterplot
approaches a straight line, the two variables are
linearly associated; otherwise they are not
linearly associated. The strength of the
association, whether linear or nonlinear, can be
roughly described and compared. Linear is the
simplest.
A relationship is strong if the points lie close
to the overall pattern curve, and weak if they
are widely scattered about it.
13More Examples of Relationships
15Introducing a categorical variable in Scatterplots
South Rising Page 82
16Linear Association: Correlation
- Correlation is a number that measures the
direction and strength of linear association.
Notation: r.
- The number r is always between -1 and 1.
- If r is negative, the two quantitative variables
are negatively associated; if r is positive, they
are positively associated.
- The association is strong if r is close to 1 or
-1 (that is, |r| is close to 1) and weak if r is
close to 0.
- Correlation applies only to linear association.
In a nonlinear situation, even a strong
association can have correlation r equal to 0.
- We study linear association because it is the
simplest case.
- r doesn't depend on the units used.
17Correlation Calculation
- Suppose we have data on variables X and Y for n
individuals: x1, x2, ..., xn and y1, y2, ..., yn.
- Each variable has a mean and a standard deviation.
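A minimal sketch of the calculation, assuming the standard textbook definition: r is the sum over all individuals of ((xi - x-bar)/sx) * ((yi - y-bar)/sy), divided by n - 1, where x-bar and y-bar are the means and sx and sy the standard deviations. The function name `correlation` and the data are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Correlation r as the average product of standardized values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divide by n - 1)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, not taken from the slides.
miles = [10.0, 20.0, 30.0, 40.0, 50.0]
minutes = [14.0, 25.0, 41.0, 48.0, 66.0]

r_miles = correlation(miles, minutes)

# Changing the units of x (miles to kilometers) should leave r unchanged.
kilometers = [m * 1.609344 for m in miles]
r_km = correlation(kilometers, minutes)

print(r_miles, r_km)
```

Because each value is standardized before the products are summed, rescaling either variable does not change r, which is why the two printed values agree up to rounding.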
18Correlation Near 0: Nonlinear Relationship
Miles/gallon versus speed
- Linear relationship?
- Correlation is close to zero.
- Strong association!
19Correlation Near 0: Nonlinear Relationship
Miles/gallon versus speed
- Curved (nonlinear) relationship.
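The point of these two slides can be checked numerically. The sketch below uses made-up speed and miles/gallon values (not the data behind the slide) that follow a perfect curve peaking at 50 miles/hr; the association is as strong as it can be, yet r comes out as zero up to rounding.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r computed from sums of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: fuel economy rises, peaks at 50 miles/hr, then falls.
speeds = [20, 30, 40, 50, 60, 70, 80]
mpg = [30 - 0.01 * (s - 50) ** 2 for s in speeds]

r = correlation(speeds, mpg)
print(r)  # essentially zero, although mpg is completely determined by speed
```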
20A Simple Example: Some students' midterm grades
and their final grades
It's a linear relationship!
21A Simple Example: Some students' midterm grades
and their final grades
Sum 85
23Which Relation Is Highly Correlated?
- Husbands' versus wives' ages.
- Husbands' versus wives' heights.
- Professional golfers' putting success: distance
of putt in feet versus percent success.
24Problems With Correlations
- Outliers can inflate or deflate correlations (see
next slide).
- Groups combined inappropriately may mask
relationships (a third variable).
- Groups may have different relationships when
separated.
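The effect of combining groups can be illustrated with a small sketch. The two groups below are invented for illustration: within each group the trend is perfectly positive, but once the groups are pooled the correlation turns negative because the group itself acts as a third variable.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r computed from sums of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: (x values, y values) for two separate groups.
group_1 = ([1, 2, 3], [10, 11, 12])
group_2 = ([6, 7, 8], [1, 2, 3])

r_1 = correlation(*group_1)
r_2 = correlation(*group_2)

# Pool the two groups and recompute.
r_combined = correlation(group_1[0] + group_2[0], group_1[1] + group_2[1])

print(r_1, r_2, r_combined)
```

Plotting the groups with different symbols or colors would reveal the pattern that the single pooled correlation hides.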
25Outliers and Correlation
Scatterplots A and B (each contains one outlier).
For each scatterplot above, how does the outlier
affect the correlation?
A: the outlier decreases the correlation (weakens
the trend).
B: the outlier increases the correlation
(strengthens the trend).
Example: 2004 election, Kerry-Edwards.
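Both effects can be reproduced with a short sketch. The points below are invented and are not the data from scatterplots A and B: in case A one stray point is added to a perfectly linear set, and in case B one distant point is added to a cloud with no linear trend.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation r computed from sums of deviations from the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Case A (hypothetical): a perfect line, then one point far off the line.
a_x, a_y = [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
r_a_without = correlation(a_x, a_y)
r_a_with = correlation(a_x + [6], a_y + [0])

# Case B (hypothetical): four corners of a square (no linear trend),
# then one distant point on the diagonal.
b_x, b_y = [1, 1, 3, 3], [1, 3, 1, 3]
r_b_without = correlation(b_x, b_y)
r_b_with = correlation(b_x + [10], b_y + [10])

print(r_a_without, r_a_with)  # the outlier lowers r
print(r_b_without, r_b_with)  # the outlier raises r
```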
26Linear Regression and Prediction
- If two quantitative variables are linearly
associated (the overall pattern is close to a
straight line), which line best describes the
data? (We want to make predictions!)
- The best line is called the least-squares
regression line. It helps predict the value of
the response variable (y) for a given value of the
explanatory variable (x).
- The slope of the regression line and the
correlation r always have the same sign.
27Page 105
28Page 106
29Least-squares Regression Line
- Regression equation: y-hat = bx + a
- x is the value of the explanatory variable
- y-hat is the average value of the response
variable (the predicted response for a value of x)
- a is the intercept
- b is the slope of the straight line
- Note that r and b are not the same thing, but
their signs will agree.
- The slope is b = r(sy/sx), where sx and sy are
the standard deviations of the two variables, and
r is their correlation.
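The relationship between r, the slope, and the intercept can be sketched in code. This assumes the standard least-squares formulas b = r(sy/sx) and a = y-bar - b(x-bar); the function name `least_squares_line` and the midterm/final grades are hypothetical, not the values from the slides.

```python
from statistics import mean, stdev


def least_squares_line(xs, ys):
    """Return (a, b, r): intercept, slope, and correlation."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x-bar, y-bar)
    return a, b, r


# Hypothetical midterm and final grades.
midterm = [60, 70, 80, 90]
final = [62, 75, 79, 92]

a, b, r = least_squares_line(midterm, final)
print(f"y-hat = {b:.3f}x + {a:.3f}, r = {r:.3f}")
```

Since sy and sx are both positive, b always takes the sign of r, which is the agreement noted on the slide.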
30Prediction Via Regression Line: Number of New
Birds and Percent Returning
- The regression equation is
y-hat = 31.9343 - 0.3040x
- y-hat is the average number of new birds for all
colonies with percent x returning
- For all colonies with 60% returning, we predict
the average number of new birds to be 13.69
- 31.9343 - (0.3040)(60) = 13.69 birds
- Suppose we know that an individual colony has 60%
returning. What would we predict the number of
new birds to be for just that colony?
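The arithmetic on this slide can be reproduced directly. The sketch below uses the slide's equation, y-hat = 31.9343 - 0.3040x; the function name `predict_new_birds` is an illustrative choice.

```python
def predict_new_birds(percent_returning):
    """Predicted number of new birds from the slide's regression equation."""
    return 31.9343 - 0.3040 * percent_returning


prediction = predict_new_birds(60)
print(round(prediction, 2))  # 13.69
```

The regression line gives the same point prediction, 13.69, whether we ask about the average over all colonies with 60% returning or about one such colony. The difference is that a single colony can be expected to vary more around that value than the average does.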
31Midterm Grades and Final Grades
The slope of the regression line is
The y-intercept is
The regression line is
The predicted final grade for Peter (if his
midterm is 70)?
32Residuals
- A residual is the difference between an observed
value of the response variable and the value
predicted by the regression line.
- residual = y - y-hat
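A minimal sketch of computing residuals, using hypothetical midterm and final grades and a least-squares line fitted to them. The helper names `fit_line` and `residuals` are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys, a, b):
    """Observed y minus predicted y-hat for every individual."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data, not taken from the slides.
midterm = [60, 70, 80, 90]
final = [62, 75, 79, 92]

a, b = fit_line(midterm, final)
res = residuals(midterm, final, a, b)
print(res)
print(sum(res))  # residuals from a least-squares line sum to zero (up to rounding)
```

A positive residual means the observed point lies above the line, and a negative one means it lies below.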
33Case Study: Residuals
Gesell Adaptive Score and Age at First Word
Draper, N. R. and John, J. A., "Influential
observations and outliers in regression,"
Technometrics, Vol. 23 (1981), pp. 21-26.
34Cautions About Correlation and Regression
- Both describe only linear relationships.
- Both are affected by outliers.
- Always plot the data before interpreting.
- Beware of extrapolation (predicting outside of
the range of x).
- Beware of lurking variables (variables that have
an important effect on the relationship among the
variables in a study, but are not included in the
study).
- Association does not imply causation.
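One way to guard against extrapolation is to flag any prediction whose x lies outside the range that was observed. The sketch below uses the bird equation from the earlier slide; the observed range of 40 to 80 percent is an assumption made only for illustration, and the function name `predict_with_range_check` is hypothetical.

```python
def predict_with_range_check(x, intercept, slope, x_min, x_max):
    """Return (y_hat, is_extrapolation) for a fitted line y-hat = intercept + slope*x."""
    y_hat = intercept + slope * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation


# Equation from the bird example; the observed x range (40 to 80) is assumed.
inside = predict_with_range_check(60, 31.9343, -0.3040, 40, 80)
outside = predict_with_range_check(95, 31.9343, -0.3040, 40, 80)

print(inside)   # prediction within the assumed observed range
print(outside)  # flagged: x = 95 lies outside the assumed observed range
```

The flag does not make the prediction wrong. It only marks cases where the data give no evidence that the straight-line pattern still holds.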
35Excel Instructions
- Download/open the Excel file GDPnLifeE.xls
- ScatterPlot only: Select Data Block (B3:C12) →
Insert → Chart → XY (Scatter) → Choose Scatter
subtype → Next → Series in Column → Next → Change
Title etc. → Next and Finish → Enhancements
- Correlation only: Enter =CORREL(Xarray,Yarray)
(=CORREL(B3:B12,C3:C12))
- Scatter and Regression Line: Tools → Data
Analysis → Regression → Next → Enter Y data
(C3:C12) → Enter X data (B3:B12) → Enter output
position → Check Line Fit Plot (or residual
plots) → OK → Cursor on predicted value → right
click → Add Trendline (no trendline for residual
plots) → Enhancements