Title: Regression Analysis (overview)
1 Regression Analysis (overview)
Regression analysis is the idea of
- analyzing a set of sample data and
- establishing a relationship between two variables and
- explaining how one variable is dependent upon the other
- using this dependency to explain the population or to predict future data
(life expectancy example: see Excel)
2 Regression Analysis (overview, cont'd)
Regarding the second item, we will study linear
relationships. Regarding the third item, we can
extend the analysis to show how one variable is
dependent upon many variables.
3 Linear relationships
When two variables have a linear relationship, we call them
  X = independent variable (horizontal axis)
  Y = dependent variable (vertical axis)
and describe the relationship as
  Y = a + bX, where a = intercept and b = slope
For example,
  X = Year, Y = Male Life Expectancy
  a = -445.95, b = 0.2608
(see Excel: life expectancy)
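As a quick illustration of how the fitted line is used for a point estimate, here is a minimal Python sketch that plugs a year into Y = a + bX with the coefficients quoted above. The function name and the year 2000 are assumptions made for the example, not part of the Excel workbook.

```python
# Coefficients quoted on the slide for male life expectancy vs. year.
a = -445.95   # intercept
b = 0.2608    # slope: change in life expectancy per calendar year

def predict_life_expectancy(year):
    """Evaluate the fitted line Y = a + b*X at X = year."""
    return a + b * year

# The year 2000 is an arbitrary illustration, not a value from the slides.
print(round(predict_life_expectancy(2000), 2))
```

With these coefficients the line gives roughly 75.65 years for X = 2000.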
4 Linear Relationships (cont'd)
Y = a + bX, where a is the intercept and b is the slope
The intercept a is the value of Y when X = 0, or the place on the
vertical axis where the line crosses
The slope b describes the pitch of the line
- If b > 0, the line goes up from left to right: the variables have a positive relationship (when X goes up, Y goes up)
- If b < 0, the line goes down from left to right: the variables have a negative relationship (when X goes up, Y goes down)
In regression, we will take two variables X and Y and determine the
a and b that best describe the relationship between X and Y
Our starting assumption is that Y is dependent on X
5 Doing Regression Analysis
- Do a scatter diagram on the relevant data
- Add a trendline, displaying the equation as well as R^2
- Do the full regression analysis using Excel's Tools > Data Analysis > Regression command, saving and plotting the residuals
- Examine the residuals for individual information (and other trends)
6 Method of Least Squares
Think about trying to fit the line Y = a + bX to the data, for
whatever values of a and b that you wish
How good is the fit? The smaller the residuals (errors), the better
X    Y actual   Y predicted     Residual        Residual^2
2    3          5 + 2(2) = 9    3 - 9 = -6      36
5    -1         5 + 2(5) = 15   -1 - 15 = -16   256
Sum of Residual^2: 36 + 256 = 292
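The arithmetic in the table can be reproduced in a few lines of Python. This sketch reads the table's predicted column as the candidate line Y = 5 + 2X (a = 5, b = 2), which is an interpretation of the slide rather than something it states outright.

```python
# Candidate line read off the table: Y = 5 + 2X.
a, b = 5, 2
data = [(2, 3), (5, -1)]  # (X, actual Y) pairs from the table

sse = 0
for x, y_actual in data:
    y_pred = a + b * x              # Y predicted
    residual = y_actual - y_pred    # actual minus predicted
    sse += residual ** 2
    print(x, y_actual, y_pred, residual, residual ** 2)

print("sum of squared residuals:", sse)
```

The total matches the 292 shown in the table.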
7 Method of Least Squares (cont'd)
The main goal of the least squares method is to choose a and b so
that SSE (Sum of Squares due to Error, i.e. the sum of the squared
residuals) is minimized
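For a single X, the minimizing a and b have a well-known closed form: b = Sxy / Sxx and a = mean(Y) - b * mean(X). The sketch below applies it to a small invented data set (the numbers are hypothetical, not from the Excel files) and checks that nudging either coefficient does not lower SSE.

```python
def least_squares(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals for Y = a + b*X."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

def sse(xs, ys, a, b):
    """Sum of squared residuals for the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best = sse(xs, ys, a, b)
# Moving a or b away from the least-squares values should not reduce SSE.
nudged = min(sse(xs, ys, a + 0.1, b), sse(xs, ys, a, b + 0.1))
print(a, b, best, nudged)
```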
8 Method of Least Squares (cont'd)
SSR = Sum of Squares due to Regression = amount of variability
explained by the regression
SST = SSR + SSE
(total variability = explained variability + unexplained variability)
R^2 = SSR / SST = percentage of variability in Y that is explained
by X = sample coefficient of determination
Because of the above equations and the definition of SST (which
does not depend on a or b), minimizing SSE is the same as
maximizing SSR and/or R^2
R^2 is not the best measure of fit
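The decomposition can be checked numerically. This is a minimal sketch on invented data (hypothetical values, not from the spreadsheets); it fits the least-squares line, computes the three sums of squares, and confirms that SST = SSR + SSE before forming R^2.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - b * x_mean, b

# Hypothetical data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
y_mean = sum(ys) / len(ys)
preds = [a + b * x for x in xs]

sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variability
ssr = sum((p - y_mean) ** 2 for p in preds)          # explained variability
sst = sum((y - y_mean) ** 2 for y in ys)             # total variability
r2 = ssr / sst

print(sst, ssr + sse, r2)
```

Note that the identity SST = SSR + SSE relies on a and b being the least-squares values; it does not hold for an arbitrary line.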
9 Method of Least Squares (cont'd)
A better measure of fit is the standard error of the estimate,
denoted by Se
Se is the standard deviation of the residuals about their mean
The smaller SSE, the smaller Se
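A small sketch of how Se could be computed from a list of residuals. It assumes the usual divisor n - m - 1 (n - 2 when there is one X), which matches the formula the deck gives for multiple regression; the function name and the residual values are hypothetical.

```python
import math

def standard_error(residuals, m=1):
    """Se = sqrt(SSE / (n - m - 1)), where m is the number of X variables."""
    n = len(residuals)
    sse = sum(r ** 2 for r in residuals)
    return math.sqrt(sse / (n - m - 1))

# Hypothetical residuals from a one-variable regression on five points.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
se = standard_error(residuals)
print(se)
```

Because SSE sits in the numerator, any change that lowers SSE for the same n and m also lowers Se.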
10 Practical Trends with R^2 and Se
When working with regression, you will often have to compare two
separate regressions and try to determine which is the better
model. In this case, look for a higher R^2 and a lower Se
13 See KLMZOO Spreadsheet
16 Take a look at the regression example in the Regression Excel file
17 Point Estimates From Regression
18 More about the Regression Output
When you use the least squares method to perform a regression
analysis, here is a very important fact:
the mean of all residuals equals 0
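This fact is easy to confirm numerically. The sketch below fits a least-squares line (with an intercept) to deliberately noisy, invented data and averages the residuals; the data values are hypothetical.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - b * x_mean, b

# Hypothetical, deliberately noisy data (not from the slides).
xs = [1, 2, 3, 4, 5, 6]
ys = [4.0, 1.5, 7.2, 6.1, 11.9, 9.3]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)
print(mean_residual)  # zero up to floating-point rounding
```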
19 More on the Regression Output (cont'd)
In a regression, what does a slope coefficient of 0 mean? It means
that X has no linear effect on Y at all
- The regression output gives a 95% confidence interval for the slope coefficient
  - if 0 is outside the interval, then we are confident that the slope is not 0
  - if 0 is inside the interval, then the slope may be 0
P-value (or Prob-value) is the probability of seeing an estimated
slope at least this far from 0 if the real slope coefficient were 0
Generally speaking, to make inferences from the regression, you
want the P-value to be small (less than 0.05)
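To show where the confidence interval comes from, here is a rough sketch that rebuilds it by hand as b plus or minus t * SE(b), with SE(b) = Se / sqrt(Sxx). The data are invented, and the critical value 2.306 is the standard two-sided 95% Student's t value for 8 degrees of freedom, hard-coded because this example has exactly n = 10 points; Excel reports the interval directly, so this is only an illustration of the calculation.

```python
import math

# Hypothetical data with n = 10 points (not from the slides).
xs = list(range(1, 11))
ys = [2.3, 4.1, 5.8, 8.4, 9.9, 12.2, 13.8, 16.1, 18.3, 19.7]

n = len(xs)
x_mean, y_mean = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_mean) ** 2 for x in xs)
b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
a = y_mean - b * x_mean

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
se = math.sqrt(sse / (n - 2))      # standard error of the estimate
se_slope = se / math.sqrt(sxx)     # standard error of the slope

# Two-sided 95% critical value of Student's t with n - 2 = 8 degrees
# of freedom, taken from a standard t table (valid only for n = 10).
T_CRIT_DF8 = 2.306
low, high = b - T_CRIT_DF8 * se_slope, b + T_CRIT_DF8 * se_slope
zero_inside = low <= 0 <= high
print((low, high), zero_inside)
```

For this strongly trending data set 0 falls outside the interval, so the slope would be judged different from 0.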
20 A good regression model has
- A high R^2 value
- A low Se value
- Slope coefficient which is not likely to equal 0
  - 0 is outside the 95% confidence interval
  - P-value is less than 0.05
So then what makes a bad model?
21 A bad regression model has
- Residuals that are not very well distributed
  - overall
    - residual histogram should look normal
  - for each section of values of X
    - residual plot should not have any strong patterns
    - e.g., mean of each section's residuals should equal 0 (like entire model)
  - for each subcollection of non-biased sample data
    - e.g., mean of each subcollection's residuals should equal 0 (like entire model)
  - in other words, we want no systematic misprediction
- The linear idea just doesn't fit the situation
  - Do you have strong reason to believe that Y and X are not related by a line?
22 A bad regression model has (cont'd)
- Is the distribution of residuals non-normal?
- Is the linear model inappropriate?
When the answer to one of these is yes and the difference from the
ideal model is striking, then linear regression is probably not a
good idea
When the answer to one of these is yes but the difference from the
ideal model is moderate, then maybe the linear regression can be
improved
(see Excel: temperature, air passenger miles)
23 Excel Temperature Model: Examining Residuals
- Sum/Avg of subcollections should be close to zero: use pivot table grouping to examine
- Residuals should be normally distributed: use pivot table/chart to create histogram of residuals
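The same two checks can be sketched outside Excel. The example below is a stand-in for the pivot-table approach, not a reproduction of the temperature workbook: it fits a line to invented curved data (Y = X^2), averages the residuals over three sections of X, and counts residuals in bins of width 5 as a crude histogram.

```python
from collections import Counter
import math

def least_squares(xs, ys):
    """Return (a, b) for the least-squares line Y = a + b*X."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - b * x_mean, b

# Hypothetical curved data, chosen so that a straight line fits badly.
xs = list(range(9))
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Average residual in each third of the X range (low, middle, high X).
section_means = [sum(residuals[i:i + 3]) / 3 for i in range(0, 9, 3)]

# Coarse histogram: number of residuals in each bin of width 5,
# keyed by the bin's lower edge.
histogram = Counter(5 * math.floor(r / 5) for r in residuals)

print(section_means)
print(sorted(histogram.items()))
```

The overall mean of the residuals is still 0, but the section averages are clearly positive at both ends and negative in the middle, which is the kind of systematic misprediction the slides warn about.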
24 Air Passenger Miles Model
Does the histogram of residuals look like a normal distribution?
What about the averages over subcollections of residuals?
25 Multiple Linear Regression
Multiple linear regression is very similar to simple linear
regression except that the dependent variable Y is described by m
independent variables X1, ..., Xm
Y = a + b1 X1 + ... + bm Xm
(see Excel, artsy_regression.xls)
26 Multiple Linear Regression (cont'd)
Y = a + b1 X1 + ... + bm Xm
- Intercept a is the same
- Slope bi is the change in Y given a unit change in Xi while holding all other variables constant
- SST, SSE, SSR, and R^2 are the same
- Se is the same except Se = sqrt( SSE / (n - m - 1) )
- Slope coefficient C.I.s (one for each Xi) are the same
- P-values (one for each Xi) are the same
(see Excel, artsy_regression.xls)
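For readers who want to see the mechanics behind the Excel output, here is a minimal sketch of a multiple regression fit using the normal equations and plain Gaussian elimination. The helper names and the data are hypothetical (the data are generated exactly from Y = 1 + 2 X1 + 3 X2, so the fit should recover those coefficients); it is not the method Excel uses internally, only one standard way to get the same least-squares coefficients.

```python
import math

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):                   # back substitution
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_multiple(rows, ys):
    """Least-squares fit of Y = a + b1*X1 + ... + bm*Xm; returns [a, b1, ..., bm]."""
    X = [[1.0] + list(r) for r in rows]            # prepend intercept column
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)

# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coeffs = fit_multiple(rows, ys)
n, m = len(rows), len(rows[0])
sse = sum((y - sum(c * v for c, v in zip(coeffs, [1.0] + list(r)))) ** 2
          for r, y in zip(rows, ys))
se = math.sqrt(sse / (n - m - 1))                  # Se with n - m - 1 divisor
print(coeffs, se)
```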
27 Dummy Variables
Dummy variables are variables that only take on the values 0 or 1.
They can be used as a tool to put variables that do not have a
natural numerical interpretation into a regression
For example, a sex variable is 0 for female, 1 for male
Dummy variables can also be used to fix the problem we saw with
GRADE in the Artsy example.
28 Dummy Variables (cont'd)
Artsy example: introduce a variable Gk for each grade k, equal to
1 if in grade k, 0 if not
Grade / Variable   G1  G2  G3  G4  G5  G6  G7  G8
Grade 1             1   0   0   0   0   0   0   0
Grade 2             0   1   0   0   0   0   0   0
Grade 3             0   0   1   0   0   0   0   0
Grade 4             0   0   0   1   0   0   0   0
Grade 5             0   0   0   0   1   0   0   0
Grade 6             0   0   0   0   0   1   0   0
Grade 7             0   0   0   0   0   0   1   0
Grade 8             0   0   0   0   0   0   0   1
29 Dummy Variables (cont'd)
You must choose a base case. In this case Grade 1 is a good base
case, which means we should delete the variable G1
Grade / Variable   G2  G3  G4  G5  G6  G7  G8
Grade 1             0   0   0   0   0   0   0
Grade 2             1   0   0   0   0   0   0
Grade 3             0   1   0   0   0   0   0
Grade 4             0   0   1   0   0   0   0
Grade 5             0   0   0   1   0   0   0
Grade 6             0   0   0   0   1   0   0
Grade 7             0   0   0   0   0   1   0
Grade 8             0   0   0   0   0   0   1
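The encoding in the table above is mechanical, so it can be generated rather than typed. A minimal sketch, assuming grades are the integers 1 through 8 and Grade 1 is the base case; the function and constant names are made up for the example.

```python
GRADES = range(1, 9)   # grades 1..8, as in the Artsy example
BASE_GRADE = 1         # base case: no dummy variable for Grade 1

def grade_dummies(grade):
    """Return the dummy values [G2, ..., G8] for a grade in 1..8."""
    return [1 if grade == g else 0 for g in GRADES if g != BASE_GRADE]

# Print one row per grade, mirroring the table.
for grade in GRADES:
    print("Grade", grade, grade_dummies(grade))
```

A Grade 1 row is all zeros, so its effect is absorbed into the intercept and each Gk coefficient is read as the difference from Grade 1.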
30 Dummy Variables (cont'd)
- Add these variables to your regression in place of GRADE
- Rerun regression
- New regression eliminates earlier problems
(see Excel, artsy_regression.xls)
31 Why not add ALL dummy variables?
Doing so creates linear dependency problems
See Salary Data Spreadsheet: what happens when you include all the
dummy variables?
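The dependency can be seen directly: if every category gets its own dummy, the dummies in each row add up to 1, which is exactly the constant column that carries the intercept. The sketch below illustrates this with the eight grade dummies from the Artsy example and an invented sample of grades; it does not use the Salary Data spreadsheet.

```python
def all_dummies(grade):
    """Return [G1, ..., G8] with no base case removed."""
    return [1 if grade == g else 0 for g in range(1, 9)]

# Hypothetical sample of student grades.
grades = [1, 3, 3, 5, 8, 2]

# Each row: intercept column (always 1) followed by G1..G8.
design = [[1] + all_dummies(g) for g in grades]

# The intercept column equals the sum of the dummy columns in every row,
# so the columns are linearly dependent and the coefficients are not unique.
dependent = all(row[0] == sum(row[1:]) for row in design)
print(dependent)  # True
```

Dropping one dummy (the base case) breaks this exact relationship, which is why the base case is needed.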
32 A Basic Technique for Multiple Linear Regression
Regression analysis is all about trying to determine what
variables have an effect on a single variable
In other words, you are trying to describe Y in terms of X1, ..., Xm
Some of the variables may not have a significant effect (which
corresponds to a high P-value for the corresponding slope
coefficient)
Efficiency: adding in variables is okay, but remove the ones that
don't have a significant effect on Y
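One way to make the pruning step concrete is to pick, after each run, the variable whose slope has the largest P-value above 0.05, drop it, and rerun. The sketch below covers only the selection step; the function name, the one-at-a-time removal, and the P-values are all assumptions for illustration (in practice the P-values would be read off the regression output).

```python
def next_variable_to_drop(p_values, threshold=0.05):
    """Return the variable with the largest P-value above threshold, or None."""
    name, p = max(p_values.items(), key=lambda item: item[1])
    return name if p > threshold else None

# Hypothetical P-values, as if read off a regression output.
p_values = {"X1": 0.001, "X2": 0.430, "X3": 0.072}
print(next_variable_to_drop(p_values))  # X2
```

Removing one variable at a time matters because the remaining P-values usually change once the regression is rerun.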
33 Rough Flow Chart for Regression
[Flow chart; recoverable nodes:]
- Do regression
- residuals okay but not great?
- high P-value?
- residuals very bad or you feel linear is not right?
- evidence found?
When comparing different models, also take into account changes in
R^2 and Se