Model Specification and Multicollinearity


Model Specification Errors: Omitting Relevant
Variables and Including Irrelevant Variables
  • To properly estimate a regression model, we need
    to have specified the correct model
  • A typical specification error occurs when the
    estimated model does not include the correct set
    of explanatory variables
  • This specification error takes two forms:
  • Omitting one or more relevant explanatory
    variables
  • Including one or more irrelevant explanatory
    variables
  • Either form of specification error results in
    problems with OLS estimates

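Model Specification Errors: Omitting Relevant Variables
  • Example: Two-factor model of stock returns
  • Suppose that the true model that explains a
    particular stock's returns is a two-factor model
    with the growth of GDP and the inflation rate
    (INF) as factors:
    $R_i = \beta_0 + \beta_1 GDP_i + \beta_2 INF_i + \epsilon_i$
    where $R_i$ is the stock's return
  • Suppose instead that we estimated the following
    model:
    $R_i = \beta_0 + \beta_1 GDP_i + \epsilon_i^*$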

Model Specification Errors: Omitting Relevant Variables
  • The above model has been estimated by omitting
    the explanatory variable INF
  • Thus, the error term of this model is actually
    equal to the true model's error term plus the
    omitted term:
    $\epsilon_i^* = \epsilon_i + \beta_2 INF_i$
    where $\beta_2$ is the true coefficient of INF
  • If there is any correlation between the omitted
    variable (INF) and the explanatory variable
    (GDP), then there is a violation of classical
    assumption III

Model Specification Errors: Omitting Relevant Variables
  • This means that the explanatory variable and the
    error term are correlated
  • If that is the case, the OLS estimate of $\beta_1$
    (the coefficient of GDP) will be biased
  • As in the above example, it is highly likely that
    there will be some correlation between two
    financial (or economic) variables
  • If, however, the correlation is low or the true
    coefficient of the omitted variable is close to
    zero, then the resulting bias is very small
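
The omitted variable bias described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the data are simulated, and the parameter values (0.5, 1.0, -2.0), the strength of the GDP-INF link (0.6), and the helper name `ols` are hypothetical choices that do not come from the slides. The bias of the GDP coefficient in the short regression should be close to the true INF coefficient times cov(GDP, INF) / var(GDP).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical "true" two-factor model: R = b0 + b1*GDP + b2*INF + eps
beta0, beta1, beta2 = 0.5, 1.0, -2.0

gdp = rng.normal(size=n)
inf = 0.6 * gdp + rng.normal(size=n)  # INF is correlated with GDP
eps = rng.normal(size=n)
ret = beta0 + beta1 * gdp + beta2 * inf + eps


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_full = ols(ret, gdp, inf)  # correctly specified model
b_short = ols(ret, gdp)      # INF omitted

# Bias implied by theory: beta2 * cov(GDP, INF) / var(GDP)
implied_bias = beta2 * np.cov(gdp, inf)[0, 1] / np.var(gdp, ddof=1)

print("GDP coefficient, full model: ", b_full[1])
print("GDP coefficient, INF omitted:", b_short[1])
print("true beta1 + implied bias:   ", beta1 + implied_bias)
```

With these particular assumed values the bias is large enough to flip the sign of the GDP coefficient, which is the kind of symptom that can point to an omitted variable.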

Model Specification Errors: Omitting Relevant Variables
  • How can we correct the omitted variable bias in a
    model?
  • A simple solution is to add the omitted variable
    back to the model, but the difficulty with this
    solution lies in identifying which variable has
    been omitted
  • Omitted variable bias is hard to detect, but
    there could be some obvious indications of this
    specification error
  • For example, the estimated model may have a
    significant coefficient with the opposite sign
    from that expected on theoretical grounds

Model Specification Errors: Omitting Relevant Variables
  • The best way to detect the omitted variable
    specification bias is to rely on the theoretical
    arguments behind the model
  • Which variables does the theory suggest should be
    included?
  • What are the expected signs of the coefficients?
  • Have we omitted a variable that most other
    similar studies include in the estimated model?
  • Note, though, that a significant coefficient with
    the unexpected sign can also occur due to a small
    sample size
  • However, most of the data sets used in empirical
    finance are large enough that this most likely is
    not the cause of the specification bias

Model Specification Errors: Including Irrelevant Variables
  • Example: Going back to the two-factor model,
    suppose that we include a third explanatory
    variable in the model, for example, the degree of
    wage inequality (INEQ)
  • So, we estimate the following model ($R_i$ is the
    stock's return):
    $R_i = \beta_0 + \beta_1 GDP_i + \beta_2 INF_i + \beta_3 INEQ_i + \epsilon_i$
  • The inclusion of an irrelevant variable (INEQ) in
    the model increases the standard errors of the
    estimated coefficients and, thus, decreases the
    t-statistics
Model Specification Errors: Including Irrelevant Variables
  • This implies that it will be more difficult to
    reject a null hypothesis that a coefficient of
    one of the explanatory variables is equal to zero
  • Also, the inclusion of an irrelevant variable
    will usually decrease the adjusted R-sq (but not
    the R-sq)
  • Finally, we can show that the inclusion of an
    irrelevant variable still allows us to obtain
    unbiased estimates of the model's coefficients
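
A small Monte Carlo experiment can illustrate the two claims above: estimates stay unbiased, but they become less precise. This is a sketch on simulated data, not the authors' procedure. The parameter values, sample size, number of replications, and the assumption that INEQ is correlated with GDP (while having a true coefficient of zero) are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 200, 2000
beta = np.array([0.5, 1.0, -2.0])  # hypothetical intercept, GDP, INF

est_correct = np.empty(reps)  # GDP coefficient, correct model
est_extra = np.empty(reps)    # GDP coefficient, INEQ added

for r in range(reps):
    gdp = rng.normal(size=n)
    inf = rng.normal(size=n)
    # Irrelevant variable: correlated with GDP, true coefficient is zero
    ineq = 0.8 * gdp + 0.6 * rng.normal(size=n)
    y = beta[0] + beta[1] * gdp + beta[2] * inf + rng.normal(size=n)

    X_correct = np.column_stack([np.ones(n), gdp, inf])
    X_extra = np.column_stack([X_correct, ineq])

    est_correct[r] = np.linalg.lstsq(X_correct, y, rcond=None)[0][1]
    est_extra[r] = np.linalg.lstsq(X_extra, y, rcond=None)[0][1]

# Both averages should sit near the true value (unbiasedness), while
# the spread of the estimates is wider when INEQ is included.
print("mean of GDP estimates:", est_correct.mean(), est_extra.mean())
print("std of GDP estimates: ", est_correct.std(), est_extra.std())
```

The wider spread in the second column is the sampling counterpart of the larger standard errors, and hence smaller t-statistics, mentioned in the slides.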

Model Specification Criteria
  • To decide whether an explanatory variable belongs
    in a regression model, we can test whether most
    of the following conditions hold:
  • Theory: Is the decision to include an explanatory
    variable in the model theoretically sound?
  • t-Test: Is the variable statistically significant
    and does it have the expected coefficient sign?
  • Adjusted R-sq: Does the overall fit of the model
    improve when we add the explanatory variable?
  • Bias: Do the coefficients of the other variables
    change significantly (sign or statistical
    significance) when we add the variable to the
    model?
Problems with Specification Searches
  • In an attempt to find the "right" or desired
    model, a researcher may estimate numerous models
    until an estimated model with the desired
    properties is obtained
  • Data mining is definitely the wrong approach to
    model specification
  • In this case, the researcher would estimate every
    possible model and choose to report only those
    that produce the desired results
  • The researcher should try to minimize the number
    of estimated models and base the selection of
    variables mainly on theory and not purely on
    statistical fit

Sequential Model Specification Searches
  • In an effort to find the appropriate regression
    model, it is common to begin with a benchmark (or
    base) specification and then sequentially add or
    drop variables
  • The base specification can rely on theory;
    variables are then added or dropped based on the
    adjusted R-sq and t-statistics
  • In this effort, it is important to follow the
    principle of parsimony: try to find the simplest
    model that best fits the data

Model Specification: Choosing the Functional Form
  • One of the assumptions needed to derive the
    desirable properties of OLS estimates is that the
    estimated model is linear
  • What if the relationship between two variables is
    not linear?
  • OLS keeps its desirable properties (unbiased and
    minimum-variance estimates) if we transform the
    nonlinear relationship into a model that is
    linear in the coefficients
  • An interesting case: the double-log (log-log)
    form

Model Specification: Choosing the Functional Form
  • Double-log Form
  • A model where both the dependent and explanatory
    variables are in natural logs
  • Example: A well-known model of nominal exchange
    rate determination is the Purchasing Power Parity
    (PPP) model
  • $s = P / P^*$
  • s = nominal exchange rate (e.g. euros per US
    dollar), P = price level in the Eurozone, P* =
    price level in the US

Model Specification: Choosing the Functional Form
  • Taking natural logs, we can estimate the
    following model, where P* is the US price level:
  • $\ln(s_i) = \beta_0 + \beta_1 \ln(P_i) + \beta_2 \ln(P_i^*) + \epsilon_i$
  • Property of the double-log model: Estimated
    coefficients show elasticities between dependent
    and explanatory variables
  • Example: A 1% change in P will result in a
    $\beta_1$% change in the nominal exchange rate (s)
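
The elasticity interpretation of the double-log model can be sketched as follows. The data are simulated so that PPP holds by construction (an elasticity of 1 for P and -1 for P*). The sample size, noise level, and variable names are assumptions made for illustration, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

# Simulated log price levels (hypothetical values)
ln_p = rng.normal(loc=4.6, scale=0.2, size=n)       # Eurozone
ln_p_star = rng.normal(loc=4.6, scale=0.2, size=n)  # US

# PPP in logs: ln(s) = ln(P) - ln(P*) plus noise
ln_s = 0.0 + 1.0 * ln_p - 1.0 * ln_p_star + rng.normal(scale=0.05, size=n)

# OLS on the log-transformed variables (linear in the coefficients)
X = np.column_stack([np.ones(n), ln_p, ln_p_star])
b, *_ = np.linalg.lstsq(X, ln_s, rcond=None)

# Elasticity reading: the % change in s implied by a 1% rise in P
pct_change_s = (1.01 ** b[1] - 1.0) * 100.0

print("estimated elasticity w.r.t. P: ", b[1])
print("estimated elasticity w.r.t. P*:", b[2])
print("% change in s for a 1% rise in P:", pct_change_s)
```

The "1% change leads to a beta_1% change" reading is an approximation that works well for small percentage changes, as the last line shows.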

Multicollinearity
  • Multicollinearity occurs when some or all of the
    explanatory variables in the regression model are
    highly correlated
  • In this case, assumption VI of the classical
    model does not hold and OLS estimates lose some
    of their nice properties
  • It is common, particularly in the case of time
    series data, that two or more explanatory
    variables are correlated
  • When multicollinearity is present, the estimated
    coefficients are unstable in their statistical
    significance, magnitude, and sign

The Impact of Multicollinearity on the OLS Estimates
  • Multicollinearity has the following consequences
    for the OLS estimated model:
  • The OLS estimates remain unbiased
  • The standard errors of the estimated coefficients
    are higher and, thus, the t-statistics fall
  • OLS estimates become very sensitive to the
    addition or removal of explanatory variables or
    to changes in the data sample
  • The overall fit of the regression (and the
    significance of non-multicollinear coefficients)
    is to a large extent unaffected
  • This implies that a sign of multicollinearity is
    a high adjusted R-sq combined with few or no
    statistically significant coefficients

Detecting Multicollinearity
  • One approach to detect multicollinearity is to
    examine the simple correlation coefficients
    between explanatory variables
  • This will be shown in the correlation matrix
    of the model's variables
  • Some researchers consider a correlation
    coefficient with an absolute value above 0.80 to
    be an indication of concern for multicollinearity
  • A second detection approach is to use the
    Variance Inflation Factor (VIF)
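
The correlation-matrix check described above can be sketched in a few lines. The data are simulated and the variable names `x1`, `x2`, `x3` are hypothetical. The 0.80 cutoff is the rule of thumb quoted above, not a formal test.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000

x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)  # built to be highly correlated with x1
x3 = rng.normal(size=n)             # unrelated to the others

X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

# Simple correlation coefficients between explanatory variables
corr = np.corrcoef(X, rowvar=False)

# Flag every pair whose absolute correlation exceeds 0.80
flagged = [
    (names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > 0.80
]

print(np.round(corr, 2))
print("pairs above 0.80:", flagged)
```

Pairwise correlations can miss cases where one variable is explained by a combination of several others, which is one reason to also look at the VIF.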

Detecting Multicollinearity
  • The VIF method tries to detect multicollinearity
    by examining the degree to which a given
    explanatory variable is explained by the others
  • The method involves the following steps:
  • Run an OLS regression of the explanatory variable
    Xi on all other explanatory variables
  • Calculate the VIF for the coefficient of variable
    Xi, given by $VIF_i = 1 / (1 - R_i^2)$, where
    $R_i^2$ is the R-sq of that regression
  • Evaluate the size of the VIF
  • Rule of thumb: if the VIF of the coefficient of
    explanatory variable Xi is greater than 5,
    multicollinearity is considered severe; the
    higher the VIF, the greater the impact of
    multicollinearity on the estimated coefficient of
    this variable
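
The VIF steps above can be sketched as follows: regress each explanatory variable on the others, take the R-sq of that auxiliary regression, and compute 1 / (1 - R-sq). The function name `vif` and the simulated data are hypothetical. Including an intercept in each auxiliary regression is an assumption of this sketch.

```python
import numpy as np


def vif(X):
    """VIF for each column of X (explanatory variables, no constant)."""
    n, k = X.shape
    out = np.empty(k)
    for i in range(k):
        xi = X[:, i]
        # Step 1: regress Xi on all other explanatory variables
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xi, rcond=None)
        resid = xi - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((xi - xi.mean()) ** 2)
        # Step 2: VIF_i = 1 / (1 - R_i^2)
        out[i] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(5)
n = 1000
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)  # highly correlated with x1
x3 = rng.normal(size=n)             # unrelated to the others

vifs = vif(np.column_stack([x1, x2, x3]))

# Step 3: evaluate the size of each VIF against the rule of thumb (5)
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(name, round(value, 2), "above 5" if value > 5 else "below 5")
```

A VIF of 1 means the variable is not explained at all by the others, so values near 1 indicate no multicollinearity problem for that coefficient.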

Correcting for Multicollinearity
  • One way to deal with multicollinearity is to do
    nothing
  • Multicollinearity may not be severe enough in a
    model to make the explanatory variables
    insignificant or to produce coefficients
    different from those expected
  • Only when one of the above two issues is
    observed should multicollinearity problems be
    addressed
  • A common way to deal with multicollinearity is to
    drop one of the correlated explanatory variables
    (a redundant variable)

Correcting for Multicollinearity
  • However, in this case there is the risk of
    causing a specification bias (omitted variable
    bias)
  • But, if dropping a variable is the best course of
    action, how should this decision be made?
  • The decision of which variable to drop should be
    based on theory
  • The variable with the least theoretical support
    should be dropped
  • Finally, a way to deal with multicollinearity is
    to increase the sample size (if this makes sense)
    since a larger data set will provide more
    accurate estimates of the model
