Regression Analysis (overview) - PowerPoint PPT Presentation

Slides: 34
Provided by: samuel97
Transcript and Presenter's Notes

Title: Regression Analysis (overview)


1
Regression Analysis (overview)
Regression analysis is the idea of
  • analyzing a set of sample data and
  • establishing a relationship between two variables
    and
  • explaining how one variable is dependent upon the
    other
  • using this dependency to explain the population
    or for the prediction of future data

(life expectancy example: see Excel)
2
Regression Analysis (overview contd)
Regression analysis is the idea of
  • analyzing a set of sample data and
  • establishing a relationship between two variables
    and
  • explaining how one variable is dependent upon the
    other
  • using this dependency to explain the population
    or for the prediction of future data

Regarding the second item, we will study linear
relationships. Regarding the third item, we can
extend the analysis to show how one variable is
dependent upon many variables.
3
Linear relationships
When two variables have a linear relationship, we
call them
  X = independent variable (horizontal axis)
  Y = dependent variable (vertical axis)
and describe the relationship as
  Y = a + bX, where a = intercept and b = slope
For example,
  X = Year
  Y = Male Life Expectancy
  a = -445.95
  b = 0.2608
(see Excel-life expectancy)
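As a quick check of the life expectancy line, the sketch below plugs a year into Y = a + bX using the intercept and slope quoted above. The function name `predict_life_expectancy` and the choice of the year 2000 are illustrative assumptions, not part of the original spreadsheet.

```python
# Intercept and slope quoted on the slide for X = Year, Y = Male Life Expectancy.
A = -445.95
B = 0.2608


def predict_life_expectancy(year):
    """Return the fitted value Y = a + bX for a given year (hypothetical helper)."""
    return A + B * year


# Example year chosen for illustration only.
prediction_2000 = predict_life_expectancy(2000)
print(round(prediction_2000, 2))
```

Each additional year raises the predicted value by the slope b = 0.2608, which is what "positive relationship" means for this line.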
4
Linear Relationships (contd)
Y = a + bX, where a is the intercept and b is
the slope
The intercept a is the value of Y when X = 0, or
the place on the vertical axis where the line
crosses
The slope b describes the steepness of the line:
the change in Y for a one-unit increase in X
  • If b > 0, the line goes up from left to right:
    the variables have a positive relationship
    (when X goes up, Y goes up)
  • If b < 0, the line goes down from left to right:
    the variables have a negative relationship
    (when X goes up, Y goes down)
In regression, we will take two variables X and Y
and determine the a and b that best describe the
relationship between X and Y
Our starting assumption is that Y is dependent on
X
5
Doing Regression Analysis
  1. Do a scatter diagram on the relevant data
  2. Add a trendline, displaying the equation as well
    as R²
  3. Do the full regression analysis using Excel's
    Tools > Data Analysis > Regression command,
    saving and plotting the residuals
  4. Examine the residuals for individual information
    (and other trends)

6
Method of Least Squares
Think about trying to fit the line Y = a + bX to
the data, for whatever values of a and b that you
wish
How good is the fit? The smaller the residuals /
errors the better

X    Y actual    Y predicted       Residual          Residual²
2    3           5 + 2(2) = 9      3 - 9 = -6        36
5    -1          5 + 2(5) = 15     -1 - 15 = -16     256
                                   Sum: 36 + 256 =   292
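The table's arithmetic can be reproduced with a short sketch. The predicted values in the table are consistent with the candidate line Y = 5 + 2X, so a = 5 and b = 2 are assumed here; the helper name `residual_table` is made up for illustration.

```python
def residual_table(points, a, b):
    """Return (x, y, predicted, residual, residual squared) rows for Y = a + bX."""
    rows = []
    for x, y in points:
        predicted = a + b * x
        residual = y - predicted  # actual minus predicted
        rows.append((x, y, predicted, residual, residual ** 2))
    return rows


# Data points and candidate line read off the slide's table (a = 5, b = 2).
points = [(2, 3), (5, -1)]
rows = residual_table(points, a=5, b=2)
sum_squared_residuals = sum(row[4] for row in rows)

for row in rows:
    print(row)
print("Sum of squared residuals:", sum_squared_residuals)  # 36 + 256 = 292
```

Trying other values of a and b in the call changes the total, which is the quantity the least squares method tries to make as small as possible.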
7
Method of Least Squares (contd)
The main goal of the least squares method is to
choose a and b so that SSE, the sum of the squared
residuals (errors), is minimized
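The slides do not show how a and b are found, so the sketch below uses the standard closed-form solution for a single X: b = Sxy / Sxx and a = mean(Y) - b * mean(X). The data values and the function names `least_squares` and `sum_squared_errors` are hypothetical, chosen only to illustrate the idea.

```python
def least_squares(xs, ys):
    """Return (a, b) that minimize the sum of squared residuals for Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sum_squared_errors(xs, ys, a, b):
    """Return SSE for the line Y = a + bX on the given data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
best_sse = sum_squared_errors(xs, ys, a, b)
print("a =", round(a, 4), "b =", round(b, 4), "SSE =", round(best_sse, 4))

# Any other line gives a larger SSE on the same data.
print(sum_squared_errors(xs, ys, a + 0.5, b) > best_sse)
```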
8
Method of Least Squares (contd)
SSR = Sum of Squares due to Regression = amount
of variability explained by the regression
SST = SSR + SSE
(total variability = explained variability +
unexplained variability)
R² = SSR / SST = percentage of variability in Y
that is explained by X = sample coefficient of
determination
Because of the above equations and the definition
of SST, minimizing SSE is the same as maximizing
SSR and / or R²
R² is not the best measure of fit
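A minimal sketch of these quantities, assuming a least squares line with an intercept (the decomposition SST = SSR + SSE relies on that). The data and the helper names `fit_line` and `variability` are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for Y = a + bX."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - b * x_bar, b


def variability(xs, ys):
    """Return (SST, SSR, SSE) for the least squares line through the data."""
    a, b = fit_line(xs, ys)
    y_bar = sum(ys) / len(ys)
    predicted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                   # total variability
    ssr = sum((p - y_bar) ** 2 for p in predicted)            # explained variability
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))    # unexplained variability
    return sst, ssr, sse


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = variability(xs, ys)
r_squared = ssr / sst
print("SST =", round(sst, 4), "SSR =", round(ssr, 4), "SSE =", round(sse, 4))
print("R^2 =", round(r_squared, 4))
```

Because SST depends only on the Y values, any reduction in SSE shows up as an equal increase in SSR, and therefore in R².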
9
Method of Least Squares (contd)
A better measure of fit is
the standard error of the estimate, denoted by Se
Se is the standard deviation of the residuals
about their mean
The smaller SSE, the smaller Se
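A small sketch of Se, assuming the formula Se = sqrt(SSE / (n - m - 1)) that the deck gives for multiple regression, with m = 1 for a single X. The data, the fitted coefficients, and the function name `standard_error` are hypothetical illustrations.

```python
import math


def standard_error(ys, predicted, m=1):
    """Return Se = sqrt(SSE / (n - m - 1)) for m independent variables."""
    n = len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return math.sqrt(sse / (n - m - 1))


# Hypothetical data; a = 0.05 and b = 1.99 are its least squares coefficients.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = 0.05, 1.99

predicted = [a + b * x for x in xs]
se = standard_error(ys, predicted)
print("Se =", round(se, 4))

# A worse line (larger SSE) on the same data gives a larger Se.
worse_predicted = [a + 1.0 + b * x for x in xs]
print(standard_error(ys, worse_predicted) > se)
```

Unlike R², Se is expressed in the units of Y, so it can be read as a typical size of a prediction error.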
10
Practical Trends with R² and Se
When working with regression, you will often have
to compare two separate regressions and try to
determine which is a better model. In this case,
look for
higher R² and lower Se
13
See KLMZOO Spreadsheet
16
Take a look at the regression example in the
Regression Excel file
17
Point Estimates From Regression
18
More about the Regression Output
When you use the least squares method to perform
a regression analysis, here is a very important
fact
the mean of all residuals equals 0
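This fact can be checked numerically. The sketch below fits a least squares line to hypothetical data and averages the residuals; for contrast it also averages the residuals of the hand-picked line Y = 5 + 2X on the two points from the earlier least squares table. The helper name `fit_line` is an assumption for illustration.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for Y = a + bX."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - b * x_bar, b


# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
mean_residual = sum(residuals) / len(residuals)
print("Least squares line, mean residual:", round(mean_residual, 10))

# A line that was NOT chosen by least squares: Y = 5 + 2X on points (2, 3), (5, -1).
hand_residuals = [y - (5 + 2 * x) for x, y in [(2, 3), (5, -1)]]
hand_mean = sum(hand_residuals) / len(hand_residuals)
print("Hand-picked line, mean residual:", hand_mean)
```

The zero mean holds for the least squares line up to floating-point rounding; an arbitrary line generally does not have this property.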
19
More on the Regression Output (contd)
In a regression, what does a slope coefficient of
0 mean? It means that X has no linear effect on Y
  • The regression output gives a 95% confidence
    interval for the slope coefficient
    • if 0 is outside the interval, then we are
      confident that the slope is not 0
    • if 0 is inside the interval, then the slope may
      be 0

P-value (or Prob-value) is the probability of
getting an estimated slope at least this far from
0 if the real slope coefficient were 0
Generally speaking, to make inferences from the
regression, you want the P-value to be small
(less than 0.05)
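To make the confidence interval concrete, here is a rough sketch for a single X. It assumes the standard formula for the slope's standard error, Se / sqrt(Sxx), and hard-codes the t-table critical value 3.182 for 95% confidence with n - 2 = 3 degrees of freedom. The data and the function name `slope_confidence_interval` are hypothetical; Excel's regression output reports the same kind of interval directly.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (slope, lower, upper) for the slope of Y = a + bX."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(sse / (n - 2))      # standard error of the estimate
    se_b = se / math.sqrt(s_xx)        # standard error of the slope
    return b, b - t_crit * se_b, b + t_crit * se_b


# Hypothetical data, made up for illustration (n = 5, so 3 degrees of freedom).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_95_DF3 = 3.182  # two-sided 95% critical value from a t-table, df = 3

slope, lower, upper = slope_confidence_interval(xs, ys, T_CRIT_95_DF3)
zero_inside = lower <= 0 <= upper
print("slope =", round(slope, 4), "95% CI =", (round(lower, 4), round(upper, 4)))
print("0 inside the interval?", zero_inside)
```

For a different sample size the critical value must be looked up again for n - 2 degrees of freedom.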
20
A good regression model has
  • A high R² value
  • A low Se value
  • Slope coefficient which is not likely to equal 0
    • 0 is outside the 95% confidence interval
    • P-value is less than 0.05

So then what makes a bad model?
21
A bad regression model has
  • Residuals that are not very well distributed
    • overall
      • residual histogram should look normal
    • for each section of values of X
      • residual plot should not have any strong
        patterns
      • e.g., mean of the section's residuals should
        equal 0 (like entire model)
    • for each subcollection of non-biased sample data
      • e.g., mean of the subcollection's residuals
        should equal 0 (like entire model)
    • in other words, we want no systematic
      misprediction
  • The linear idea just doesn't fit the situation
    • Do you have strong reason to believe that Y and
      X are not related by a line?

22
A bad regression model has (contd)
  • Is the distribution of residuals non-normal?
  • Is the linear model inappropriate?

When one of these is "yes" and the difference
from the ideal model is striking, then linear
regression is probably not a good idea
When one of these is "yes" but the difference
from the ideal model is moderate, then maybe the
linear regression can be improved
(see Excel, temperature, air passenger miles)
23
Excel Temperature Model: Examining Residuals
  • Sum/Avg of subcollections should be close to
    zero: use pivot table grouping to examine
  • Residuals should be normally distributed: use
    pivot table/chart to create histogram of
    residuals
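Outside Excel, the pivot-table grouping amounts to averaging residuals within each subcollection. The sketch below is one way to do that; the season labels and residual values are hypothetical and do not come from the temperature spreadsheet, and `mean_residual_by_group` is a made-up helper name.

```python
from collections import defaultdict


def mean_residual_by_group(groups, residuals):
    """Return {group label: mean residual} for residuals tagged with a group."""
    collected = defaultdict(list)
    for group, residual in zip(groups, residuals):
        collected[group].append(residual)
    return {group: sum(values) / len(values) for group, values in collected.items()}


# Hypothetical residuals, each tagged with the subcollection it belongs to.
groups = ["winter", "winter", "summer", "summer", "winter", "summer"]
residuals = [2.0, 1.5, -1.0, -2.0, 1.0, -1.5]

overall_mean = sum(residuals) / len(residuals)
by_group = mean_residual_by_group(groups, residuals)
print("Overall mean residual:", overall_mean)
print("Mean residual by group:", by_group)
```

In this made-up data the overall mean is 0, yet one group is consistently under-predicted and the other over-predicted, which is the kind of systematic misprediction the grouping is meant to reveal.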
24
Air Passenger Miles Model
Does the histogram of residuals look like a
normal distribution? What about the averages over
subcollections of residuals?
25
Multiple Linear Regression
Multiple linear regression is very similar to
simple linear regression except that the
dependent variable Y is described by m
independent variables X1, ..., Xm
Y = a + b1X1 + ... + bmXm
(see Excel, artsy_regression.xls)
26
Multiple Linear Regression (contd)
Y = a + b1X1 + ... + bmXm
  • Intercept a is the same
  • Slope bi is the change in Y given a unit change
    in Xi while holding all other variables constant
  • SST, SSE, SSR, and R² are the same
  • Se is the same except Se = sqrt( SSE / (n - m -
    1) )
  • Slope coefficient C.I.s (one for each Xi) are the
    same
  • P-values (one for each Xi) are the same

(see Excel, artsy_regression.xls)
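For readers without the spreadsheet, the sketch below fits Y = a + b1X1 + ... + bmXm by solving the normal equations with plain Gaussian elimination and then computes Se with the n - m - 1 divisor. This is one textbook approach, not necessarily what Excel does internally. The data and the function names `solve` and `multiple_regression` are hypothetical and unrelated to artsy_regression.xls.

```python
import math


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    coeffs = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (aug[r][n] - known) / aug[r][r]
    return coeffs


def multiple_regression(rows, ys):
    """Return ([a, b1, ..., bm], Se) for Y = a + b1*X1 + ... + bm*Xm."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 is the intercept column
    k = len(design[0])
    # Normal equations: (X^T X) coeffs = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    coeffs = solve(xtx, xty)
    sse = sum(
        (y - sum(c * v for c, v in zip(coeffs, d))) ** 2 for d, y in zip(design, ys)
    )
    n, m = len(ys), k - 1
    return coeffs, math.sqrt(sse / (n - m - 1))


# Hypothetical data with m = 2 independent variables, roughly Y = 1 + 2*X1 + 3*X2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [9.1, 7.9, 19.2, 17.8, 29.1, 27.9]

coeffs, se = multiple_regression(rows, ys)
print("a, b1, b2 =", [round(c, 3) for c in coeffs], "Se =", round(se, 3))
```

Each slope is read as the change in Y for a one-unit change in that X with the other X held fixed, matching the interpretation above.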
27
Dummy Variables
Dummy variables are variables that only take on
the values 0 or 1. They can be used as a tool to
put variables that do not have a natural numerical
interpretation into a regression
For example, a sex variable is 0 for female, 1
for male
Dummy variables can also be used to fix the
problem we saw with GRADE in the Artsy example.
28
Dummy Variables (contd)
Artsy example: introduce a variable Gk for each
grade, 1 if in grade, 0 if not

Grade / Variable   G1  G2  G3  G4  G5  G6  G7  G8
Grade 1             1   0   0   0   0   0   0   0
Grade 2             0   1   0   0   0   0   0   0
Grade 3             0   0   1   0   0   0   0   0
Grade 4             0   0   0   1   0   0   0   0
Grade 5             0   0   0   0   1   0   0   0
Grade 6             0   0   0   0   0   1   0   0
Grade 7             0   0   0   0   0   0   1   0
Grade 8             0   0   0   0   0   0   0   1
29
Dummy Variables (contd)
You must choose a base case. In this case, Grade
1 is a good base case, which means we should
delete the variable G1

Grade / Variable   G2  G3  G4  G5  G6  G7  G8
Grade 1             0   0   0   0   0   0   0
Grade 2             1   0   0   0   0   0   0
Grade 3             0   1   0   0   0   0   0
Grade 4             0   0   1   0   0   0   0
Grade 5             0   0   0   1   0   0   0
Grade 6             0   0   0   0   1   0   0
Grade 7             0   0   0   0   0   1   0
Grade 8             0   0   0   0   0   0   1
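The table above can be generated rather than typed. The following sketch builds the dummy variables for one grade with Grade 1 as the base case; the function name `grade_dummies` and its parameters are hypothetical, not taken from the Artsy spreadsheet.

```python
def grade_dummies(grade, grades=range(1, 9), base=1):
    """Return {"G2": 0 or 1, ..., "G8": 0 or 1} for one grade, omitting the base case."""
    return {f"G{g}": int(grade == g) for g in grades if g != base}


# Print one row per grade, in the same layout as the table.
for grade in range(1, 9):
    print("Grade", grade, list(grade_dummies(grade).values()))
```

The base case (Grade 1) is represented by all zeros, so each remaining coefficient is read as the difference from Grade 1.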
30
Dummy Variables (contd)
  • Add these variables to your regression in place
    of GRADE
  • Rerun regression
  • New regression eliminates earlier problems

(see Excel, artsy_regression.xls)
31
Why not add ALL dummy variables?
Doing so creates linear dependency problems
See Salary Data Spreadsheet: what happens when you
include all the dummy variables?
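One way to see the dependency, using the grade dummies from the Artsy example rather than the salary data: with all eight dummies, G1 + ... + G8 equals 1 for every observation, which is exactly the intercept column. The helper name `design_row` is hypothetical.

```python
def design_row(grade, include_all):
    """Return [intercept, dummies...] for one grade; drop G1 unless include_all."""
    first = 1 if include_all else 2
    return [1] + [int(grade == g) for g in range(first, 9)]


rows_all = [design_row(g, include_all=True) for g in range(1, 9)]
rows_base = [design_row(g, include_all=False) for g in range(1, 9)]

# With all eight dummies, the dummy columns always add up to the intercept column.
sums_match_all = all(sum(row[1:]) == row[0] for row in rows_all)
# With G1 dropped, the Grade 1 row breaks that identity.
sums_match_base = all(sum(row[1:]) == row[0] for row in rows_base)

print("All dummies: dummy columns sum to intercept column?", sums_match_all)
print("Base case dropped: same identity holds?", sums_match_base)
```

When one column is an exact combination of the others, least squares has no unique set of coefficients, so dropping the base-case dummy removes this particular dependency.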
32
A Basic Technique for Multiple Linear Regression
Regression analysis is all about trying to
determine what variables have an effect on a
single variable
In other words, you are trying to describe Y in
terms of X1, ..., Xm
Some of the variables may not have a significant
effect (which corresponds to a high P-value for
the corresponding slope coefficient)
Efficiency: Adding in variables is okay, but
remove the ones that don't have a significant
effect on Y
33
Rough Flow Chart for Regression
[Flow chart; decision points shown:]
  • Do regression
  • Residuals okay but not great?
  • High P-value?
  • Residuals very bad, or you feel linear is not
    right?
  • Evidence found?

When comparing different models, also take into
account changes in R² and Se