Title: CSC 323 Quarter: Spring
1CSC 323 Quarter Spring 02/03
- Daniela Stan Raicu
- School of CTI, DePaul University
2Outline
Chapter 2 Looking at Data Relationships
between two or more variables
- Remarks on Correlation (last slides from the previous lecture)
- Linear regression
- Least-squares regression line
- Residual Analysis
- Cautions about regression and correlation
- SAS procedures for univariate data, scatterplots,
correlation and regression
3Correlation
- The correlation r measures the direction and
strength of the linear relationship between two
quantitative variables.
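- Suppose we have the following data:
X    Y
x1   y1
x2   y2
...  ...
xn   yn
The correlation is
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
where $\bar{x}$, $\bar{y}$ are the means and $s_x$, $s_y$ are the standard deviations for the two variables X and Y.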
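The standard definition of r (the sum of products of standardized x and y values, divided by n - 1) can be sketched in a few lines of Python. This is a minimal illustration, not the course's SAS procedure; the helper names and the small data set are made up for the example.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (n - 1 divisor)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustrative data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 3))
```

A value of r near +1 or -1 indicates a strong linear relationship; a value near 0 indicates a weak linear relationship.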
4More on Correlation
- Correlation ignores the distinction between explanatory and response variables
- Correlation requires that both variables be quantitative
- Correlation is not affected by changes in the unit of measurement of either variable
- Correlation measures the strength of only linear relationships
- Correlation is not a resistant measure, so outliers can greatly change the value of r.
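Several of these properties can be checked numerically. The sketch below uses made-up height and weight values (an assumption for illustration only) to show that r is unchanged by a change of units or by swapping x and y, and that a single outlier can change it substantially.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: heights in inches, weights in pounds.
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 125, 130, 155, 160, 180]

# Same data in different units (centimeters, kilograms).
heights_cm = [h * 2.54 for h in heights_in]
weights_kg = [w * 0.4536 for w in weights_lb]

r_original = correlation(heights_in, weights_lb)
r_converted = correlation(heights_cm, weights_kg)   # unit change
r_swapped = correlation(weights_lb, heights_in)     # x and y exchanged

# Add one made-up outlier: a very tall but very light person.
r_with_outlier = correlation(heights_in + [75], weights_lb + [100])

print(r_original, r_converted, r_swapped, r_with_outlier)
```

The first three values agree up to rounding error, while the value computed with the outlier is much smaller.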
5Not all Relationships are Linear: Miles per Gallon versus Speed
- Curved relationship (r is misleading)
- Speed varies from 20 mph to 60 mph
- MPG varies from trial to trial, even at the same speed
- Statistical relationship
Correlation measures the strength of only linear
relationships
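To see why r can mislead for a curved pattern, consider a sketch with made-up mileage figures that rise and then fall symmetrically over the 20 to 60 mph range. The numbers are hypothetical and chosen only to make the point; they are not the data from the slide.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical speeds (mph) and fuel economy (mpg): a clear curved pattern.
speed = [20, 30, 40, 50, 60]
mpg = [24, 28, 30, 28, 24]

r = correlation(speed, mpg)
print(r)
```

Here mpg is strongly related to speed, yet r is essentially zero because the relationship is not linear.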
6Problems with Correlations
- Outliers can inflate or deflate correlations
- Groups combined inappropriately may mask relationships (a third variable)
  - groups may have different relationships when separated
Plot
Correlation is not a resistant measure, so outliers can greatly change the value of r.
7Linear Regression
Objective: To quantify the linear relationship between an explanatory variable and a response variable by fitting a line to the data (that is, drawing a line that comes as close as possible to the points).
Example
8Linear Regression
- A regression line is a straight line that
describes how a response variable y changes as an
explanatory variable x changes.
Linear Regression equation:
y = a + bx, where b = slope (rate of change) and a = intercept (value of y at x = 0)
Height = a + b × age
9Prediction
- Regression is used to predict the value of y for any value of x by substituting this x into the equation of the regression line.
Example: Prediction via Regression Line, Husband and Wife Ages
- The regression equation is y = 3.6 + 0.97x, where y is the average age of all husbands who have wives of age x
- For all women aged 30, we predict the average husband age to be 32.7 years:
  3.6 + (0.97)(30) = 32.7 years
- Suppose we know that an individual wife's age is 30. What would we predict her husband's age to be?
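The substitution step can be written as a one-line function. The sketch below hard-codes the intercept 3.6 and slope 0.97 from the husband and wife example; the function name is made up for illustration.

```python
def predict_husband_age(wife_age, intercept=3.6, slope=0.97):
    """Predicted husband age from the regression line y = 3.6 + 0.97x."""
    return intercept + slope * wife_age

predicted = predict_husband_age(30)
print(round(predicted, 1))  # 32.7
```

The line gives the same point prediction, 32.7 years, for an individual wife aged 30 as for the average over all wives aged 30; the individual prediction is simply less certain.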
10Least-squares Regression
- Used to determine the best line
- We want the line to be as close as possible to the data points in the vertical (y) direction (since that is what we are trying to predict)
- The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
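[Figure: scatterplot of Y against x with the fitted line; the error is the vertical distance between an observed value y and the predicted value on the line]
A residual is the difference between an observed value of the response variable y and the value predicted by the regression line:
residual = observed y - predicted y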
11Least-Squares Regression
The regression line makes the prediction errors
as small as possible.
12Least-Squares Regression (cont.)
- How is the least squares regression line
calculated?
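The formula on this slide did not survive in the transcript. A standard textbook result is that the least-squares line y = a + bx has slope b = r (s_y / s_x) and intercept a = (mean of y) - b (mean of x). The sketch below computes the line that way on made-up data; the function name and the data are assumptions for illustration.

```python
import math

def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (n - 1 divisor).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    # Correlation as the sum of products of standardized values over n - 1.
    r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
            for x, y in zip(xs, ys)) / (n - 1)
    b = r * s_y / s_x          # slope
    a = y_bar - b * x_bar      # intercept
    return a, b

# Made-up illustrative data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3))
```

Because the intercept is chosen as a = (mean of y) - b (mean of x), the fitted line always passes through the point of means.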
13Coefficient of Determination (R^2)
- Measures the usefulness of the regression prediction
- R^2 (or r^2, the square of the correlation) measures how much of the variation in the values of the response variable (y) is explained by the regression line
- Example:
  - r = 1, R^2 = 1: regression line explains/captures all (100%) of the variation in y
  - r = 0.7, R^2 = 0.49: regression line explains almost half (about 50%) of the variation in y
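One way to make "variation explained" concrete is to compute R^2 as 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squared deviations of y from its mean, and then compare it with the squared correlation. The sketch below does this on made-up data; for a least-squares line with one explanatory variable, the two quantities agree.

```python
import math

# Made-up illustrative data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)          # total variation (SST)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                 # least-squares slope
a = y_bar - b * x_bar         # least-squares intercept
r = sxy / math.sqrt(sxx * syy)

# Unexplained variation: sum of squared residuals (SSE).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

print(round(r_squared, 3), round(r ** 2, 3))
```

Both printed values match, which is why the slide treats R^2 and r^2 interchangeably.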
14A Caution: Beware of Extrapolation
- Extrapolation is the use of a regression line for prediction outside the range of values of the explanatory variable x that you used to obtain the line.
- Such predictions are often not accurate.
- Sarah's height was plotted against her age
- Can you predict her height at age 42 months?
- Can you predict her height at age 30 years (360 months)?
15A Caution: Beware of Extrapolation
- Regression line: y = 71.95 + 0.383x
- height at age 42 months?
  - y = 88
- height at age 30 years?
  - y = 209.8
- She is predicted to be 6' 10.5" at age 30.
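The two predictions can be reproduced directly from the line y = 71.95 + 0.383x. The sketch below also converts the age-30 prediction to feet and inches, assuming the heights are in centimeters; the transcript does not state the unit, but the slide's figure of roughly 6 ft 10.5 in is consistent with it. The function names are made up for illustration.

```python
def predict_height(age_months, intercept=71.95, slope=0.383):
    """Predicted height from the regression line y = 71.95 + 0.383x."""
    return intercept + slope * age_months

def cm_to_feet_inches(cm):
    """Convert centimeters to (whole feet, remaining inches)."""
    total_inches = cm / 2.54
    feet = int(total_inches // 12)
    inches = total_inches - 12 * feet
    return feet, inches

height_42 = predict_height(42)     # age 42 months
height_360 = predict_height(360)   # age 30 years = 360 months (extrapolation)

feet, inches = cm_to_feet_inches(height_360)  # assumes heights are in cm
print(round(height_42, 1), round(height_360, 1), feet, round(inches, 1))
```

The arithmetic is equally easy in both cases; the age-360 prediction is implausible because childhood growth does not continue at the same rate into adulthood.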
16Accuracy of the predictions
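One possible measure of the accuracy of the regression predictions is given by the root mean square error (r.m.s. error). The r.m.s. error is defined as the square root of the average of the squared residuals:
$$\text{r.m.s. error} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
where $\hat{y}_i$ is the value predicted by the regression line for $x_i$.
In large data sets, the r.m.s. error is approximately equal to
$$\sqrt{1 - r^2} \; s_y$$
where $s_y$ is the standard deviation of y.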
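As a rough numerical check, the sketch below computes the r.m.s. error from the residuals and compares it with the common shortcut sqrt(1 - r^2) times the standard deviation of y. The data are made up for illustration. With the n divisor in the standard deviation the shortcut is exact for a least-squares line; with the n - 1 divisor it is only approximate, and the gap shrinks as n grows.

```python
import math

# Made-up illustrative data (not from the lecture).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx                 # least-squares slope
a = y_bar - b * x_bar         # least-squares intercept
r = sxy / math.sqrt(sxx * syy)

# r.m.s. error: square root of the average squared residual.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
rms_error = math.sqrt(sum(e ** 2 for e in residuals) / n)

# Shortcut using the standard deviation of y.
shortcut_n = math.sqrt(1 - r ** 2) * math.sqrt(syy / n)         # n divisor
shortcut_n1 = math.sqrt(1 - r ** 2) * math.sqrt(syy / (n - 1))  # n - 1 divisor

print(round(rms_error, 4), round(shortcut_n, 4), round(shortcut_n1, 4))
```

With only five points the n - 1 version is noticeably larger than the r.m.s. error, which is why the approximation is stated for large data sets.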
17Confounding factor
A confounding factor is a variable that has an important effect on the relationship among the variables in a study but is not included in the study.
Example: The mathematics department of a large university must plan the timetable for the following year. Data are collected on the enrollment year, the number x of first-year students, and the number y of students enrolled in elementary math courses.
The fitted regression line has equation y = 2491.6 + 91.0663x, with R^2 = 0.694.
18Influential Point
An observation is influential for the regression line if removing it would considerably change the fitted line. An influential point pulls the regression line towards itself.
[Figure: scatterplot marking an influential point/outlier, with the regression line fitted when that point is omitted]
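The effect of an influential point can be illustrated by fitting the least-squares line twice, once with and once without the suspect observation. The data below are made up for the example: five points on a roughly linear trend plus one point far to the right that does not follow it.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Made-up data; the last point (x = 15) is the candidate influential point.
xs = [1, 2, 3, 4, 5, 15]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 3.0]

intercept_with, slope_with = fit_line(xs, ys)
intercept_without, slope_without = fit_line(xs[:-1], ys[:-1])

print(round(slope_with, 3), round(slope_without, 3))
```

Removing the single point changes the slope from nearly flat to close to 1, which is the behavior the definition of an influential observation describes.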
19Summary - Warnings
- Correlation measures linear association; the regression line should be used only when the association is linear.
- Extrapolation: do not use the regression line to predict values outside the observed range; such predictions are not reliable.
- Correlation and the regression line are sensitive to influential / extreme points.
20Data Mining
- Exploring really large databases in the hope of finding useful patterns is called data mining.
Domain Understanding
Data Selection
Cleaning and Preprocessing
Evaluation and Interpretation
Knowledge
Discovering patterns
The entire process is iterative and interactive.