Title: Issues Regarding Regression Models
1. Issues Regarding Regression Models
2. Collinearity
- A perfect linear relationship between two (or more) independent variables is called collinearity (multicollinearity).
- Under this condition, the least-squares regression coefficients cannot be uniquely defined.
3. Collinearity
- A strong but less-than-perfect linear relationship between the independent variables can cause:
- regression coefficients to be unstable,
- standard errors of the coefficients to become large; hence, confidence intervals for the coefficients become wide and the estimates imprecise.
4. Collinearity Measurement
- One measure of the impact of collinearity on the precision of the estimates is the Variance Inflation Factor (VIF).
- In SPSS: Regression / Linear / Statistics / (check) Collinearity
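The VIF for predictor j is 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A minimal NumPy sketch (the `vif` helper and the simulated data are my own illustration, not part of the lecture):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of predictor matrix X.
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on the remaining predictors (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / (((y - y.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return out

# Two nearly collinear predictors plus one unrelated predictor
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.05, size=50)   # almost a copy of x1
x3 = rng.normal(size=50)
vifs = vif(np.column_stack([x1, x2, x3]))
```

With two nearly collinear predictors, their VIFs blow up far past the rule-of-thumb threshold of 5 (from the next slide), while the unrelated predictor stays near 1.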
5. Collinearity Effects
- Wrong signs for the coefficients.
- Drastic changes in the size and/or sign of the coefficients as a new variable is added to the equation.
- High VIF (VIF > 5) or low tolerance (< 0.1) are indicators of collinearity.
6. Collinearity Remedies
- There is no quick fix for collinearity. Some strategies:
- 1. Variable selection for the model: based on the correlation matrix, some of the highly correlated variables could be excluded from the model.
- 2. Ridge regression instead of Ordinary Least Squares (OLS) regression.
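To illustrate remedy 2: ridge regression adds a penalty lam*||b||^2 to the least-squares criterion, which has the closed form b = (X'X + lam*I)^(-1) X'y and keeps the coefficients stable under collinearity. A sketch under my own assumptions (function name, centering convention, and simulated data are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: minimize ||y - Xb||^2 + lam * ||b||^2.
    Centering X and y first leaves the intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    k = X.shape[1]
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(k), Xc.T @ yc)
    b0 = y.mean() - X.mean(axis=0) @ b
    return b0, b

# Two almost identical predictors: OLS coefficients are unstable
rng = np.random.default_rng(1)
x1 = rng.normal(size=40)
x2 = x1 + rng.normal(scale=0.01, size=40)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=40)

b0_ols, b_ols = ridge(X, y, 0.0)   # lam = 0 reduces to OLS
b0_r, b_r = ridge(X, y, 1.0)       # small penalty shrinks the estimates
```

The penalized coefficients have a smaller norm than the OLS ones, while their sum still tracks the combined effect of the two collinear predictors.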
7. Unusual Data
- A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis.
- If a single observation (or a small group of observations) substantially changes your results, you would want to know about it and investigate further.
- There are three ways that an observation can be unusual.
8. Unusual Data
- Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
9. Unusual Data
- Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. Such points can have an unusually large effect on the estimates of the regression coefficients.
10. Unusual Data
- Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.
11. Influential Data Diagnosis
- Cook's D
- If Cook's distance for a particular observation is greater than a cutoff point, that observation could be considered influential.
- One such cutoff point is D_i > 4 / (n - k - 1), where k is the number of independent variables.
- D > 1 is a strong indication of a problem.
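Cook's D combines the residual and the leverage of each case: D_i = (e_i^2 / (p * s^2)) * h_i / (1 - h_i)^2, with p fitted parameters. A self-contained sketch (helper name and toy data are my own; the cutoff is the slide's 4/(n-k-1) rule):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for a linear regression of y on X (intercept added)."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    p = Z.shape[1]
    H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T    # hat matrix
    h = np.diag(H)                          # leverages
    e = y - H @ y                           # residuals
    s2 = (e @ e) / (n - p)                  # residual variance
    return (e**2 / (p * s2)) * h / (1 - h)**2

# A clean line, except the last point is pushed far off it
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x
y[-1] += 25.0                               # single influential point
D = cooks_distance(x.reshape(-1, 1), y)
k = 1
cutoff = 4 / (len(y) - k - 1)               # slide's rule of thumb
```

The contaminated case produces by far the largest D, above both the 4/(n-k-1) cutoff and the stricter D > 1 flag.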
12. Influential Data Diagnostics in SPSS
- Standardized DfBETA(s)
- The change in a regression coefficient that results from the deletion of the i-th case. A standardized DfBETA value is computed for each case for each regression coefficient generated by the model.
- Cut-off points:
- DfBETA > 0 means case i increases the slope,
- DfBETA < 0 means case i decreases the slope,
- |DfBETA| > 2 is a strong indication of influence,
- |DfBETA| > 2/sqrt(n) might be a problem.
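Standardized DfBETAs can be computed by brute force: refit with each case deleted and scale the coefficient change by the deleted-case standard error. A sketch (the `dfbetas` helper and toy data are illustrative, not SPSS output):

```python
import numpy as np

def dfbetas(X, y):
    """Standardized DfBETA for each case and coefficient:
    (b_j - b_j(i)) / (s_(i) * sqrt((Z'Z)^-1_jj))."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    p = Z.shape[1]
    se_scale = np.sqrt(np.diag(np.linalg.inv(Z.T @ Z)))
    b_full, *_ = np.linalg.lstsq(Z, y, rcond=None)
    out = np.empty((n, p))
    for i in range(n):
        Zi, yi = np.delete(Z, i, axis=0), np.delete(y, i)
        b_i, *_ = np.linalg.lstsq(Zi, yi, rcond=None)
        e_i = yi - Zi @ b_i
        s_i = np.sqrt((e_i @ e_i) / (n - 1 - p))   # deleted-case sigma
        out[i] = (b_full - b_i) / (s_i * se_scale)
    return out

rng = np.random.default_rng(2)
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=10)
y[-1] += 25.0                                      # influential case
D = dfbetas(x.reshape(-1, 1), y)
flag = 2 / np.sqrt(10)                             # slide's 2/sqrt(n) rule
```

Column 0 is the intercept, column 1 the slope; the contaminated case dominates the slope DfBETAs and clears both cutoffs.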
13. Influential Data Diagnostics in SPSS
- Leverage h
- max(h) < 0.2: OK, no problem,
- 0.2 < max(h) < 0.5: might be a problem,
- max(h) > 0.5: usually a problem of too much leverage for one case,
- h > 2k/n flags the top few cases.
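Leverage values are the diagonal of the hat matrix H = Z (Z'Z)^-1 Z'. A small sketch (helper name and data are my own), showing one far-out predictor value triggering the max(h) > 0.5 flag:

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix for a regression with intercept;
    h_i measures how far case i's predictor values sit from the bulk.
    The leverages always sum to the number of fitted parameters."""
    n = X.shape[0]
    Z = np.column_stack([np.ones(n), X])
    H = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    return np.diag(H)

# Ten ordinary x values plus one extreme value
x = np.append(np.linspace(0.0, 1.0, 10), 8.0)
h = leverage(x.reshape(-1, 1))
```

The ordinary cases all sit below the 0.2 comfort level, while the extreme case takes almost all the leverage for itself.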
14. Influential Data Diagnostics in SPSS
- Standardized DfFIT
- The change in the predicted value when the i-th case is deleted.
- Cut-off point: |DfFIT| > 2*sqrt(k/n) indicates a problem.
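Like DfBETA, standardized DfFIT can be computed by deletion: the change in the i-th fitted value, scaled by s_(i)*sqrt(h_i). An illustrative sketch (names and data are mine; the cutoff is the slide's 2*sqrt(k/n) rule):

```python
import numpy as np

def dffits(X, y):
    """Standardized DfFIT: (yhat_i - yhat_i(i)) / (s_(i) * sqrt(h_i))."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    p = Z.shape[1]
    h = np.diag(Z @ np.linalg.inv(Z.T @ Z) @ Z.T)
    yhat = Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    out = np.empty(n)
    for i in range(n):
        Zi, yi = np.delete(Z, i, axis=0), np.delete(y, i)
        b_i = np.linalg.lstsq(Zi, yi, rcond=None)[0]
        e_i = yi - Zi @ b_i
        s_i = np.sqrt((e_i @ e_i) / (n - 1 - p))   # deleted-case sigma
        out[i] = (yhat[i] - Z[i] @ b_i) / (s_i * np.sqrt(h[i]))
    return out

rng = np.random.default_rng(3)
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x + rng.normal(scale=0.3, size=10)
y[-1] += 25.0                                      # influential case
d = dffits(x.reshape(-1, 1), y)
cutoff = 2 * np.sqrt(1 / 10)                       # k = 1 predictor, n = 10
```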
15. Influential Data Remedies
- Unusual data need to be investigated; they may, for example, stem from an error in data entry.
- The model could be re-specified, or robust estimation methods could be used.
- An influential observation should be discarded only if it is truly bad data and cannot be corrected.
16. Checking the Assumptions
- Several assumptions need to be met to accept the results of a regression analysis and use the model for future decision making:
- Linearity,
- Independence of errors (no autocorrelation),
- Normality of errors,
- Constant variance of errors (homoscedasticity).
17. Tests for Linearity
- Linearity
- Plot the dependent variable against each of the independent variables separately.
- Decide whether linear regression is a reasonable description of the tendency in the data:
- consider curvilinear patterns,
- consider undue influence of one data point on the regression line, etc.
18Nonlinear Relationships
Diminishing Returns Relationship of Advertising
versus Sales
Sales
Advertising
19Analysis of Residuals
3
2
1
Residuals
0
-1
-2
(a) Nonlinear Pattern
-3
3
2
1
Residuals
0
-1
-2
(b) Linear Pattern
-3
20. Tests for Independence
- Independence of Errors
- Ljung-Box test
- In SPSS: Graphs / Time Series / Autocorrelations
- Plot residuals against time (residual-time plot): residuals form the y-axis, time forms the x-axis.
- If the residuals group alternately into positive and negative clusters, that indicates autocorrelation.
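The Ljung-Box statistic formalizes the visual check: Q = n(n+2) * sum over lags k=1..m of r_k^2/(n-k), compared against a chi-square with m df. A sketch (the helper and simulated series are my own; 11.07 is the standard chi-square critical value for 5 df at alpha = .05):

```python
import numpy as np

def ljung_box_q(resid, m):
    """Ljung-Box Q statistic on residuals for lags 1..m."""
    e = np.asarray(resid, dtype=float) - np.mean(resid)
    n = len(e)
    denom = e @ e
    q = 0.0
    for k in range(1, m + 1):
        r_k = (e[k:] @ e[:-k]) / denom     # lag-k autocorrelation
        q += r_k**2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(4)
white = rng.normal(size=200)               # independent residuals
ar = np.empty(200)                         # AR(1) residuals: clustered signs
ar[0] = white[0]
for t in range(1, 200):
    ar[t] = 0.8 * ar[t - 1] + white[t]

chi2_5df_05 = 11.07                        # chi-square critical value, 5 df
q_white = ljung_box_q(white, 5)
q_ar = ljung_box_q(ar, 5)
```

The clustered AR(1) residuals give a Q far beyond the critical value, while independent residuals stay small.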
21Residuals-Time Plot
- Notice the tendency of the residuals to group
alternately into positive and negative clusters. - That is an indication that the residuals are not
independent but auto-correlated.
22Analysis of Residuals
3
2
1
Residuals
0
-1
-2
(a) Independent Residuals
Time
-3
3
2
1
Residuals
0
-1
-2
(b) Residuals Not Independent
-3
Time
23Non-Independence Remedies
- EGLS (Estimated Generalized Least Squares)
Methods - Prais-Winsten
- Cochrane-Orcutt
- (Note that these are effective only for
first-order autocorrelation.)
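The Cochrane-Orcutt idea can be sketched for one predictor: estimate rho from the lag-1 autocorrelation of the residuals, refit OLS on the quasi-differenced data y_t - rho*y_{t-1}, and iterate. This is my own simplified illustration (function name, iteration count, and simulated data are assumptions), not a full EGLS implementation:

```python
import numpy as np

def cochrane_orcutt(x, y, iters=10):
    """Cochrane-Orcutt for y = b0 + b1*x with AR(1) errors."""
    rho = 0.0
    for _ in range(iters):
        ys = y[1:] - rho * y[:-1]                  # quasi-differenced data
        xs = x[1:] - rho * x[:-1]
        Z = np.column_stack([np.ones(len(xs)), xs])
        a0, b1 = np.linalg.lstsq(Z, ys, rcond=None)[0]
        b0 = a0 / (1 - rho)                        # undo intercept scaling
        e = y - (b0 + b1 * x)
        rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1]) # lag-1 autocorrelation
    return b0, b1, rho

rng = np.random.default_rng(5)
n = 300
x = np.linspace(0.0, 10.0, n)
e = np.empty(n)
e[0] = rng.normal()
for t in range(1, n):                              # AR(1) errors, rho = 0.7
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e
b0, b1, rho = cochrane_orcutt(x, y)
```

The iteration recovers both the slope and the first-order autocorrelation; for higher-order autocorrelation this procedure, as the slide notes, is not appropriate.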
24. Tests for Normality
- Normality of Errors
- Kolmogorov-Smirnov test on the residuals,
- compute skewness,
- compute kurtosis,
- Jarque-Bera test.
25. Jarque-Bera Test
- Compute the JB test statistic:
- JB = (n/6)*Skew^2 + (n/24)*(Kurt - 3)^2
- where Skew and Kurt are the skewness and kurtosis of the residuals (e.g., SKEW and KURT of the data range).
- Compare with the chi-square critical value with 2 df: CHIINV(alpha / tails, 2).
- Fail to reject H0 (normality) when JB < chi-square_alpha.
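A direct sketch of the JB statistic using moment-based skewness and kurtosis (helper and data are illustrative; 5.99 is the standard chi-square critical value for 2 df at alpha = .05):

```python
import numpy as np

def jarque_bera(data):
    """JB = (n/6)*Skew^2 + (n/24)*(Kurt - 3)^2.
    Under normality, JB is approximately chi-square with 2 df."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    d = x - x.mean()
    m2 = np.mean(d**2)
    skew = np.mean(d**3) / m2**1.5          # moment-based skewness
    kurt = np.mean(d**4) / m2**2            # raw (non-excess) kurtosis
    return (n / 6) * skew**2 + (n / 24) * (kurt - 3)**2

rng = np.random.default_rng(6)
jb_normal = jarque_bera(rng.normal(size=500))       # should be small
jb_skewed = jarque_bera(rng.exponential(size=500))  # clearly non-normal
chi2_2df_05 = 5.99
```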
26. Non-Normality Remedies
- To stabilize the error variance and bring the errors closer to normality, one of the most frequently used techniques is data transformation.
- The X and/or Y values can be transformed by raising those variables to a power:
- y (or x) -> y^p (or x^p), where p = -2, -1, -1/2, 1/2, 2, 3.
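As a quick illustration of the power ladder, a square-root transform (p = 1/2) pulls in the long right tail of skewed data. The data here are simulated for demonstration only:

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness."""
    d = x - x.mean()
    m2 = np.mean(d**2)
    return np.mean(d**3) / m2**1.5

rng = np.random.default_rng(7)
y = rng.lognormal(mean=0.0, sigma=0.8, size=1000)  # strongly right-skewed
y_half = y ** 0.5                                  # p = 1/2 from the ladder
```

The transformed values remain right-skewed but much less so, which is typically enough to move a normality test from rejection toward acceptance.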
27. Tests for Constant Variance
- Constant Variance of Errors
- Divide the residuals into two halves and run an F-test.
- Plot residuals against the estimated y-values: residuals form the y-axis, estimated y-values form the x-axis.
- If the errors get larger (or smaller) as the y-values increase, that indicates non-constant variance.
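The two-halves F-test above can be sketched by splitting the residuals at the median fitted value and taking the ratio of the half variances (a Goldfeld-Quandt-style check; the helper and simulated residuals are my own illustration):

```python
import numpy as np

def variance_ratio_f(resid, yhat):
    """Order residuals by fitted value, split in half, and return
    F = larger half variance / smaller half variance.
    F near 1 suggests constant variance."""
    e = np.asarray(resid)[np.argsort(yhat)]
    half = len(e) // 2
    v1, v2 = e[:half].var(ddof=1), e[half:].var(ddof=1)
    return max(v1, v2) / min(v1, v2)

rng = np.random.default_rng(8)
x = np.linspace(1.0, 10.0, 100)                # stands in for y-estimates
e_const = rng.normal(scale=1.0, size=100)      # homoscedastic residuals
e_fan = rng.normal(size=100) * x               # spread grows with x
f_const = variance_ratio_f(e_const, x)
f_fan = variance_ratio_f(e_fan, x)
```

The fanning residuals produce a much larger F ratio than the constant-variance ones; in practice the ratio would be compared against an F critical value with the two half sample sizes as degrees of freedom.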
28Analysis of Residuals
3
2
1
Residuals
0
-1
-2
-3
x1
(a) Variance Decreases as x Increases
29Analysis of Residuals
3
2
1
Residuals
0
-1
-2
-3
x1
(b) Variance Increases as x Increases
30Analysis of Residuals
3
2
1
Residuals
0
-1
-2
-3
x1
(c) Constant Variance
31. Non-Constant Variance Remedies
- Transform the dependent variable (y): y -> y^p, where p = -2, -1, -1/2, 1/2, 2, 3.
- Use the Weighted Least Squares (WLS) regression method.
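Weighted least squares downweights the noisy observations: b = (Z'WZ)^-1 Z'Wy with weights w_i proportional to 1/Var(e_i). A minimal sketch, assuming the error standard deviation grows with x (function name and data are illustrative):

```python
import numpy as np

def wls(X, y, w):
    """Weighted least squares with intercept:
    minimize sum_i w_i * (y_i - z_i'b)^2, i.e. b = (Z'WZ)^-1 Z'Wy."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    W = np.diag(w)
    return np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)

rng = np.random.default_rng(9)
x = np.linspace(1.0, 10.0, 200)
y = 1.0 + 2.0 * x + rng.normal(size=200) * x   # error sd grows with x
b = wls(x.reshape(-1, 1), y, w=1.0 / x**2)     # weights = 1 / variance
```

With the correct weights, the intercept and slope are recovered efficiently despite the heteroscedastic errors.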
32. Next Lesson
- (Lesson 07/A)
- Qualitative Judgmental Forecasting Methods