Title: Model Specification and Multicollinearity
1. Model Specification and Multicollinearity
2. Model Specification Errors: Omitting Relevant Variables and Including Irrelevant Variables
- To properly estimate a regression model, we need to have specified the correct model.
- A typical specification error occurs when the estimated model does not include the correct set of explanatory variables.
- This specification error takes two forms:
- Omitting one or more relevant explanatory variables
- Including one or more irrelevant explanatory variables
- Either form of specification error results in problems with the OLS estimates.
3. Model Specification Errors: Omitting Relevant Variables
- Example: Two-factor model of stock returns.
- Suppose that the true model that explains a particular stock's returns is given by a two-factor model with the growth of GDP and the inflation rate as factors:
  RET = β0 + β1GDP + β2INF + ε
- Suppose instead that we estimated the following model:
  RET = β0 + β1GDP + ε*
4. Model Specification Errors: Omitting Relevant Variables
- The above model has been estimated by omitting the explanatory variable INF.
- Thus, the error term of this model is actually equal to ε* = β2INF + ε.
- If there is any correlation between the omitted variable (INF) and the included explanatory variable (GDP), then there is a violation of classical assumption III.
5. Model Specification Errors: Omitting Relevant Variables
- This means that the explanatory variable and the error term are correlated.
- If that is the case, the OLS estimate of β1 (the coefficient of GDP) will be biased.
- As in the above example, it is highly likely that there will be some correlation between two financial (or economic) variables.
- If, however, the correlation is low or the true coefficient of the omitted variable is zero, then the specification error is very small.
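To make the bias concrete, here is a minimal simulation sketch; the variable names, coefficient values, and correlation structure are illustrative assumptions, not taken from the slides. The true model includes both GDP and INF, but returns are regressed on GDP alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n, b1, b2 = 100_000, 2.0, -1.0          # true coefficients on GDP and INF

gdp = rng.normal(size=n)
inf = 0.6 * gdp + rng.normal(size=n)    # INF is correlated with GDP
ret = 0.5 + b1 * gdp + b2 * inf + rng.normal(size=n)

# OLS slope of ret on GDP alone (INF is pushed into the error term)
b1_hat = np.cov(gdp, ret)[0, 1] / np.var(gdp, ddof=1)
print(f"true b1 = {b1}, omitted-variable estimate = {b1_hat:.3f}")
# Converges to b1 + b2 * cov(GDP, INF) / var(GDP) = 2.0 - 0.6 = 1.4, not 2.0
```

Because INF sits in the error term, the simple slope converges to β1 plus β2 times the regression coefficient of INF on GDP, which is the classic omitted-variable bias formula.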
6. Model Specification Errors: Omitting Relevant Variables
- How can we correct omitted variable bias in a model?
- A simple solution is to add the omitted variable back to the model; the problem with this solution is being able to detect which variable is the omitted one.
- Omitted variable bias is hard to detect, but there can be some obvious indications of this specification error.
- For example, our estimated model may have a significant coefficient with the opposite sign from that expected by our arguments.
7. Model Specification Errors: Omitting Relevant Variables
- The best way to detect omitted variable specification bias is to rely on the theoretical arguments behind the model:
- Which variables does the theory suggest should be included?
- What are the expected signs of the coefficients?
- Have we omitted a variable that most other similar studies include in the estimated model?
- Note, though, that a significant coefficient with an unexpected sign can also occur due to a small sample size.
- However, most of the data sets used in empirical finance are large enough that a small sample is most likely not the cause of the specification bias.
8. Model Specification Errors: Including Irrelevant Variables
- Example: Going back to the two-factor model, suppose that we include a third explanatory variable in the model, for example, the degree of wage inequality (INEQ).
- So, we estimate the following model:
  RET = β0 + β1GDP + β2INF + β3INEQ + ε
- The inclusion of an irrelevant variable (INEQ) in the model increases the standard errors of the estimated coefficients and, thus, decreases the t-statistics.
9. Model Specification Errors: Including Irrelevant Variables
- This implies that it will be more difficult to reject the null hypothesis that a coefficient of one of the explanatory variables is equal to zero.
- Also, the inclusion of an irrelevant variable will usually decrease the adjusted R² (but not the R²).
- Finally, we can show that the inclusion of an irrelevant variable still allows us to obtain unbiased estimates of the model's coefficients.
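A short simulated sketch of this effect, assuming the statsmodels package is available (any OLS routine would do); the data and numbers are made up for illustration. INEQ has a true coefficient of zero but is correlated with GDP, so including it inflates the standard error on GDP while leaving the estimate unbiased:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
gdp = rng.normal(size=n)
inf = rng.normal(size=n)
ineq = 0.7 * gdp + rng.normal(size=n)          # irrelevant, correlated with GDP
ret = 0.5 + 2.0 * gdp - 1.0 * inf + rng.normal(size=n)

fit_true = sm.OLS(ret, sm.add_constant(np.column_stack([gdp, inf]))).fit()
fit_over = sm.OLS(ret, sm.add_constant(np.column_stack([gdp, inf, ineq]))).fit()

print("SE of GDP coefficient, correct model: ", fit_true.bse[1])
print("SE of GDP coefficient, with INEQ added:", fit_over.bse[1])
```

Across repeated samples the GDP estimate stays centered on its true value, but its standard error is systematically larger once INEQ is included.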
10. Model Specification Criteria
- To decide whether an explanatory variable belongs in a regression model, we can test whether most of the following conditions hold:
- The importance of theory: Is the decision to include the explanatory variable in the model theoretically sound?
- t-test: Is the variable statistically significant, and does it have the expected coefficient sign?
- Adjusted R²: Does the overall fit of the model improve when we add the explanatory variable?
- Bias: Do the coefficients of the other variables change significantly (in sign or statistical significance) when we add the variable to the model?
11. Problems with Specification Searches
- In an attempt to find the "right" or desired model, a researcher may estimate numerous models until an estimated model with the desired properties is obtained.
- The wrong approach to model specification is data mining.
- In this case, the researcher would estimate every possible model and choose to report only those that produce the desired results.
- The researcher should try to minimize the number of estimated models and guide the selection of variables mainly by theory, not purely by statistical fit.
12. Sequential Model Specification Searches
- In an effort to find the appropriate regression model, it is common to begin with a benchmark (or base) specification and then sequentially add or drop variables.
- The base specification can rely on theory, with variables then added or dropped based on adjusted R² and t-statistics.
- In this effort, it is important to follow the principle of parsimony: try to find the simplest model that best fits the data.
13. Model Specification: Choosing the Functional Form
- One of the assumptions used to derive the nice properties of OLS estimates is that the estimated model is linear.
- What if the relationship between two variables is not linear?
- OLS maintains its nice properties of unbiased and minimum variance estimates if we transform the nonlinear relationship into a model that is linear in the coefficients.
- Interesting case: the double-log (log-log) form.
14. Model Specification: Choosing the Functional Form
- Double-log form: a model where both the dependent and explanatory variables are in natural logs.
- Example: A well-known model of nominal exchange rate determination is the Purchasing Power Parity (PPP) model:
  s = P/P*
- s = nominal exchange rate (e.g., Euro/$), P = price level in the Eurozone, P* = price level in the US.
15. Model Specification: Choosing the Functional Form
- Taking natural logs, we can estimate the following model:
  ln(s) = β0 + β1ln(P) + β2ln(P*) + ε
- Property of the double-log model: the estimated coefficients show elasticities between the dependent and explanatory variables.
- Example: A 1% change in P will result in a β1% change in the nominal exchange rate (s).
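A minimal sketch of estimating the double-log form on simulated price levels (the data, seed, and statsmodels dependency are all assumptions for illustration, not real exchange-rate data). Under PPP we expect β1 ≈ +1 and β2 ≈ −1:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
p_eur = np.exp(rng.normal(4.6, 0.2, size=n))              # Eurozone price level
p_us = np.exp(rng.normal(4.6, 0.2, size=n))               # US price level
s = (p_eur / p_us) * np.exp(rng.normal(0, 0.05, size=n))  # PPP plus noise

X = sm.add_constant(np.column_stack([np.log(p_eur), np.log(p_us)]))
fit = sm.OLS(np.log(s), X).fit()
print(fit.params)    # expect roughly [0, +1, -1]: the slopes are elasticities
```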
16. Multicollinearity
- Multicollinearity occurs when some or all of the explanatory variables in the regression model are highly correlated.
- In this case, assumption VI of the classical model does not hold and OLS estimates lose some of their nice properties.
- It is common, particularly in the case of time series data, for two or more explanatory variables to be correlated.
- When multicollinearity is present, the estimated coefficients are unstable in their degree of statistical significance, magnitude, and sign.
17. The Impact of Multicollinearity on the OLS Estimates
- Multicollinearity has the following consequences for the OLS estimated model:
- The OLS estimates remain unbiased.
- The standard errors of the estimated coefficients are higher and, thus, the t-statistics fall.
- OLS estimates become very sensitive to the addition or removal of explanatory variables or to changes in the data sample.
- The overall fit of the regression (and the significance of non-multicollinear coefficients) is to a large extent unaffected.
- This implies that a sign of multicollinearity is a high adjusted R² combined with no statistically significant coefficients, as the sketch below illustrates.
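This symptom is easy to reproduce on simulated data (the names, numbers, and statsmodels dependency are illustrative assumptions): two near-duplicate regressors give a high R² while neither individual t-statistic is significant.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)    # nearly a copy of x1
y = 1.0 + x1 + x2 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print("R-squared:", round(fit.rsquared, 3))     # high overall fit
print("t-stats:  ", np.round(fit.tvalues, 2))   # individually insignificant
```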
18. Detecting Multicollinearity
- One approach to detecting multicollinearity is to examine the simple correlation coefficients between explanatory variables.
- These are shown in the correlation matrix of the model's variables.
- Some researchers consider a correlation coefficient with an absolute value above 0.80 to be an indication of concern for multicollinearity (see the sketch below).
- A second detection approach is to use the Variance Inflation Factor (VIF).
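A sketch of the correlation-matrix check with pandas; the regressor names are placeholders for whatever variables the model actually uses. Any off-diagonal entry above 0.80 in absolute value flags a potential problem:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
gdp = rng.normal(size=100)
df = pd.DataFrame({
    "GDP": gdp,
    "INF": 0.95 * gdp + 0.3 * rng.normal(size=100),  # strongly tied to GDP
    "INEQ": rng.normal(size=100),
})
# |corr| > 0.80 between two regressors is the warning sign from the slide
print(df.corr().round(2))
```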
19. Detecting Multicollinearity
- The VIF method tries to detect multicollinearity by examining the degree to which a given explanatory variable is explained by the other explanatory variables.
- The method involves the following steps:
- Run an OLS regression of the explanatory variable Xi on all other explanatory variables.
- Calculate the VIF for the coefficient of variable Xi as VIF = 1/(1 − R²i), where R²i is the R² from that auxiliary regression.
- Evaluate the size of the VIF.
- Rule of thumb: if the VIF of the coefficient of explanatory variable Xi is greater than 5, multicollinearity has a substantial impact on the estimated coefficient of that variable.
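The same steps can be run with statsmodels' variance_inflation_factor helper (assumed available; regressing each Xi on the others and computing 1/(1 − R²i) by hand gives identical numbers). The data below are simulated for illustration:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
gdp = rng.normal(size=100)
inf = 0.9 * gdp + 0.3 * rng.normal(size=100)   # nearly collinear with GDP
ineq = rng.normal(size=100)

X = sm.add_constant(np.column_stack([gdp, inf, ineq]))  # constant in column 0
for i, name in enumerate(["GDP", "INF", "INEQ"], start=1):
    print(name, round(variance_inflation_factor(X, i), 2))
# Rule of thumb from the slide: a VIF above 5 signals trouble
```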
20. Correcting for Multicollinearity
- One way to deal with multicollinearity is to do nothing.
- Multicollinearity may not be severe enough to make the explanatory variables insignificant or to produce coefficients different from those expected.
- Only when one of these two issues is observed should multicollinearity problems be addressed.
- A common way to deal with multicollinearity is to drop one of the correlated explanatory variables (a redundant variable).
21. Correcting for Multicollinearity
- However, in this case there is the risk of causing specification bias (omitted variable bias).
- But if dropping a variable is the best course of action, how should this decision be made?
- The decision of which variable to drop should be based on theory: the variable with the least theoretical support should be dropped.
- Finally, another way to deal with multicollinearity is to increase the sample size (if this makes sense), since a larger data set will provide more accurate estimates of the model (see the sketch below).
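A last simulated sketch of why a larger sample helps (numbers and statsmodels dependency are assumptions): holding the correlation between the regressors fixed at about 0.9, the standard error of the first coefficient shrinks steadily as n grows.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
for n in (50, 500, 5000):
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + np.sqrt(1 - 0.81) * rng.normal(size=n)  # corr(x1, x2) ~ 0.9
    y = 1.0 + x1 + x2 + rng.normal(size=n)
    fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(f"n = {n:5d}, SE of x1 coefficient = {fit.bse[1]:.3f}")
```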