Title: Econometrics Course: Endogeneity
1 Econometrics Course: Endogeneity & Simultaneity
2 Overview
- Endogeneity
- Sources
- Responses
- Omitted Variables
- Measurement Error
- Proxy Variables
- Method of Instrumental Variables
- Properties
- Validity and strength of instruments
3 Definition of Endogeneity
- Suppose we have a regression equation
- y = a + b1x1 + b2x2 + e
- The variable x1 is endogenous if it is correlated with e.
- Note that this is related to, but not identical to, the heuristic definition that x1 is determined within the model.
4 Sources of Endogeneity
- 1. Omitted variables
- If the true model underlying the data is
- y = a + b1x1 + b2x2 + b3x3 + n
- but you estimate the model
- y = a + b1x1 + b2x2 + e
- then variable x1 will be endogenous if it is correlated with x3. Why? Because e = f(n, x3): the omitted x3 is absorbed into the error term.
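Illustration (not part of the original slides): a minimal NumPy simulation of the omitted-variable case above. All coefficient values and the strength of the x1-x3 correlation are hypothetical choices for the sketch. It compares the estimate of b1 from the true model with the estimate obtained when x3 is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: x1 is correlated with x3 by construction.
x3 = rng.normal(size=n)
x1 = 0.8 * x3 + rng.normal(size=n)
x2 = rng.normal(size=n)
nu = rng.normal(size=n)                       # error term of the true model
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 3.0 * x3 + nu

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
b_full = ols(np.column_stack([const, x1, x2, x3]), y)   # true model
b_short = ols(np.column_stack([const, x1, x2]), y)      # x3 omitted

print("b1, true model:", round(b_full[1], 3))
print("b1, x3 omitted:", round(b_short[1], 3))
```

With x3 omitted, the estimate of b1 picks up part of the effect of x3, so it should land well above the true value of 2 used here.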
5 Sources of Endogeneity
- 2. Measurement error
- Suppose the true model underlying the data is
- y = a + b1x1 + b2x2 + e
- but you estimate the model
- y = a + b1x1 + b2x2' + e
- where x2' is the measured value of x2 and j is the measurement error (x2 = x2' + j).
6 Sources of Endogeneity
- 2. Measurement error - continued
- Variable x2, as measured, will be endogenous if the measurement error j depends on it.
- Example: Suppose that x2 measures hospital size (no. of beds), and that the measurement error is greater for larger hospitals. Then as measured x2 grows, so does j. Thus the error term is correlated with measured x2, causing endogeneity.
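7 Sources of Endogeneity
- 2. Measurement error - continued
- Rearranging the equation, with x2' denoting the measured value of x2 and x2 = x2' + j, we have
- y = a + b1x1 + b2x2 + e
- y = a + b1x1 + b2(x2' + j) + e
- y = a + b1x1 + b2x2' + (e + b2j)
- If j = f(x2'), then the composite error term (e + b2j) is correlated with x2', causing endogeneity.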
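Illustration (not part of the original slides): a small NumPy sketch of the measurement-error argument. Here `x2_meas` stands for the measured variable and the true value is `x2_meas + j`. The coefficient values and the form of the dependence (j grows by 0.5 per unit of the measured variable) are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
a, b1, b2 = 1.0, 2.0, 2.0               # hypothetical true coefficients

x1 = rng.normal(size=n)
x2_meas = rng.normal(loc=5.0, size=n)   # measured value of x2
e = rng.normal(size=n)

def b2_hat(j):
    """OLS coefficient on measured x2 when the true x2 = x2_meas + j."""
    x2_true = x2_meas + j
    y = a + b1 * x1 + b2 * x2_true + e
    X = np.column_stack([np.ones(n), x1, x2_meas])
    return np.linalg.lstsq(X, y, rcond=None)[0][2]

j_indep = rng.normal(size=n)                   # error unrelated to measured x2
j_dep = 0.5 * x2_meas + rng.normal(size=n)     # error grows with measured x2

est_indep = b2_hat(j_indep)
est_dep = b2_hat(j_dep)
print("b2 estimate, j independent of measured x2:", round(est_indep, 3))
print("b2 estimate, j depends on measured x2:    ", round(est_dep, 3))
```

When j is unrelated to the measured variable, the estimate stays near the true b2. When j depends on it, the term b2j in the composite error moves with the regressor and the estimate drifts away from b2.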
8 Sources of Endogeneity
- 3. Simultaneity
- A system of simultaneous equations occurs when two or more left-hand side variables are functions of each other (there are other ways of stating it, too)
- y1 = a + b1x1 + g2y2 + e
- y2 = a + g1x1 + g2y1 + e
9 Sources of Endogeneity
- 3. Simultaneity
- With some algebra you can rewrite these two
equations in reduced form as a single equation
with an endogenous regressor.
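Illustration (not part of the original slides): a NumPy sketch of why simultaneity makes a regressor endogenous. The two-equation system and every parameter name and value below (a1, b1, g, a2, c1, d) are hypothetical. The data are generated from the reduced form, and the first equation is then estimated by OLS.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a1, b1, g = 1.0, 1.0, 0.5    # y1 = a1 + b1*x1 + g*y2 + e1  (hypothetical values)
a2, c1, d = 2.0, 1.0, 0.5    # y2 = a2 + c1*x1 + d*y1 + e2  (hypothetical values)

x1 = rng.normal(size=n)
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# Reduced form: substitute the y2 equation into the y1 equation and solve for y1.
y1 = (a1 + g * a2 + (b1 + g * c1) * x1 + e1 + g * e2) / (1.0 - g * d)
y2 = a2 + c1 * x1 + d * y1 + e2

# OLS on the first structural equation, treating y2 as an ordinary regressor.
X = np.column_stack([np.ones(n), x1, y2])
g_ols = np.linalg.lstsq(X, y1, rcond=None)[0][2]
print("true g:", g, " OLS estimate:", round(g_ols, 3))
```

Because y2 depends on y1, it also depends on e1, so the OLS estimate of g does not settle on the true value even in a very large sample.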
10 Pretesting for Endogeneity
- The most famous test is Hausman (1978). Many others are described in Nakamura and Nakamura (1998).
- Idea: the method of instrumental variables (IV) uses two-stage least squares (2SLS). If there is no endogeneity, it is more efficient to use OLS. If there is endogeneity, OLS is inconsistent and so 2SLS is preferred.
11 Pretesting for Endogeneity
- Problem: the tests all have low power, particularly when 2SLS would cause a significant loss of efficiency.
- In practice, many people use a Hausman test, fail to reject the null hypothesis of no endogeneity, and then use OLS.
- A more statistically reliable approach is to base judgments of endogeneity on how the system under study works.
12 Responses to Endogeneity
- What if you are unsure whether a variable is endogenous?
- Approach 1: ignore it
- Approach 2: use instrumental variables (IV), described later, for every possibly endogenous variable
- Approach 3: subtract out the variable using time-series (panel) data
13 Responses to Endogeneity
- Approach 1: ignore it
- - Not advisable: true endogeneity causes OLS to be inconsistent
- Approach 2: use IV on every possibly endogenous variable
- - Not advisable: it will cause a loss of efficiency (and hence wider confidence intervals) and may lead to bias.
14 Responses to Endogeneity
- Approach 3: Difference it out
- Suppose that the source of the endogeneity is fixed over time, such as measurement error or an omitted variable. Further, suppose that you observe data in two time periods.
- A difference-in-difference (DD) model can be used: subtract values at time 1 (before) from values at time 2 (after) and the fixed source of endogeneity will drop out.
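Illustration (not part of the original slides): a NumPy sketch of the subtraction step, assuming a hypothetical omitted variable q that is constant over the two periods and correlated with x. It shows only the differencing idea, not a full DD design with treatment and comparison groups. All parameter values are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
b = 2.0                                    # hypothetical true effect of x on y

q = rng.normal(size=n)                     # omitted variable, fixed over time
x_t1 = 0.7 * q + rng.normal(size=n)        # x is correlated with q in both periods
x_t2 = 0.7 * q + rng.normal(size=n)
y_t1 = 1.0 + b * x_t1 + 3.0 * q + rng.normal(size=n)
y_t2 = 1.5 + b * x_t2 + 3.0 * q + rng.normal(size=n)

def slope(x, y):
    """Slope from a simple OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_levels = slope(x_t2, y_t2)                  # single period, q omitted
b_diff = slope(x_t2 - x_t1, y_t2 - y_t1)      # time 2 minus time 1, q cancels

print("levels estimate:    ", round(b_levels, 3))
print("differenced estimate:", round(b_diff, 3))
```

The differenced regression recovers a value near the true b because 3.0 * q appears in both periods and cancels. If q changed between periods, it would not cancel.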
15 Responses to Endogeneity
- Approach 3: Difference it out - continued
- Limitations
- - DD models will not eliminate selection bias.
- - DD models only eliminate fixed variables; sometimes endogenous variables change values over time.
16 Dealing with Omitted Variables
17 Dealing with Omitted Variables
- The investigator should have a conceptual model of the process under study. Guided by this understanding, there are a few options for dealing with omitted variables.
- 1. Find additional data so that every relevant variable is included.
- 2. Ignore it
- - Acceptable only if the omitted variable is uncorrelated with all included variables; otherwise the coefficient estimates will be biased up or down.
18 Dealing with Omitted Variables
- 3. Find proxy variable
- Suppose the following
- y is the outcome
- q is the omitted variable
- z is the proxy for q
- What properties should the proxy z have?
19 Dealing with Omitted Variables
- a. Proxy z should be strongly correlated with q.
- b. Proxy z must be redundant (= ignorable)
- E(y | x, q, z) = E(y | x, q)
- c. Omitted q must be uncorrelated with other regressors conditional on z
- corr(q, xj | z) = 0 for each xj
20 Dealing with Omitted Variables
- The last two mean roughly that q and z provide similar information about the outcome.
- You don't observe q, so how can you prove these conditions are met? Either argue it from theory or test the assumption using other data.
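Illustration (not part of the original slides): a NumPy sketch of a proxy that satisfies the conditions above by construction. The data-generating values are hypothetical. q depends on z plus independent noise, x is related to q only through z, and z does not enter y once q is known.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

z = rng.normal(size=n)                        # observed proxy
q = 0.9 * z + 0.5 * rng.normal(size=n)        # omitted variable, strongly tied to z
x = 0.6 * z + rng.normal(size=n)              # related to q only through z
y = 1.0 + 2.0 * x + 1.5 * q + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
b_omit = ols(np.column_stack([const, x]), y)[1]        # q omitted, no proxy
b_proxy = ols(np.column_stack([const, x, z]), y)[1]    # proxy z included

print("coefficient on x, q omitted:    ", round(b_omit, 3))
print("coefficient on x, proxy z added:", round(b_proxy, 3))
```

In this constructed case, adding z restores an estimate near the true coefficient of 2. With real data the conditions cannot be checked this directly, which is why they must be argued or tested with other data.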
21 Dealing with Measurement Error
22 Dealing with Measurement Error
- 1. Improve measurement
- - DSS improved by refusing extreme outlier values
- - NPPD improved by requiring more complete data
- 2. Argue that the degree of error is small
- - Use outside data for validation
- 3. Argue that error is uncorrelated with included variables
23 Dealing with Proxy Variables
24 Dealing with Proxy Variables
- 1. What if proxy variable z is correlated with a regressor x?
- OLS is inconsistent, but one can hope and argue that the inconsistency is less than if z is omitted.
25 Dealing with Proxy Variables
- 2. Consider using a lagged dependent variable as a proxy variable.
- Example: If you believe that omitted variable q_t strongly affects outcome y_t, then a lagged value of y (such as y_{t-2}) is probably correlated with q_t as well.
- Problem: y_{t-2} may be correlated with other x's as well, leading to inconsistency.
26 Dealing with Proxy Variables
- 3. Consider using multiple proxy variables for a single omitted variable.
- How? Simply put all proxy variables in the equation.
- Note: they all must meet the requirements for proxies.
27 Dealing with Proxy Variables
- 4. What if omitted variable q interacts with a regressor x?
- y = a + b1x + b2q + b3qx + e
- => dy/dx = b1 + b3q
- The marginal effect of x on y involves q, which is unobserved
28 Dealing with Proxy Variables
- Demean z: take every value of z and subtract out the grand (overall) average value. Call it zd.
- y = a + b1x + b2zd + b3zdx + e
- => dy/dx = b1 + b3zd
- = b1 on average, because E[zd] = 0
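Illustration (not part of the original slides): a NumPy sketch of what demeaning does to the coefficient on x in an interaction model. The coefficient values and the mean of z (5) are hypothetical. With raw z, the coefficient on x is the effect of x when z = 0. With demeaned zd, it is the effect of x at the average z.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

x = rng.normal(size=n)
z = rng.normal(loc=5.0, scale=2.0, size=n)         # proxy with a non-zero mean
y = 1.0 + 2.0 * x + 1.0 * z + 0.5 * z * x + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
raw = ols(np.column_stack([const, x, z, z * x]), y)

zd = z - z.mean()                                   # demeaned proxy
dem = ols(np.column_stack([const, x, zd, zd * x]), y)

print("coefficient on x, raw z:     ", round(raw[1], 3))
print("coefficient on x, demeaned z:", round(dem[1], 3))
print("raw b1 + raw b3 * mean(z):   ", round(raw[1] + raw[3] * z.mean(), 3))
```

The last two printed numbers agree because demeaning only reparameterizes the same fitted model. The demeaned coefficient on x is the average marginal effect, which is the quantity the slide calls b1.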
29 Instrumental Variables
30 Method of Instrumental Variables
- Often used to deal with simultaneity.
- More generally, IV applies whenever a regressor x is correlated with the error term e.
31 IV Definition
- Model: y = a + b1x1 + b2x2 + e
- Suppose that x2 is endogenous to y. An instrumental variable is one that
- (a) is correlated with the endogenous variable x2
- (b) is uncorrelated with error term e
- (c) should not enter the main equation (i.e., does not explain y directly)
32 Two-Stage Least Squares
- Two-stage least squares (2SLS) approach
- Stage 1
- Predict x2 as a function of all other variables plus an IV (call it z)
- x2 = a + g1x1 + g2z + n
- Create predicted values of x2; call them x2p
33 Two-Stage Least Squares
- Two-stage least squares (2SLS) approach
- Stage 2
- Predict y as a function of x2p and all other variables (but not z)
- y = a + b1x1 + b2x2p + e
- Note: adjust the standard errors to account for the fact that x2p is predicted.
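Illustration (not part of the original slides): a hand-rolled NumPy sketch of the two stages, assuming hypothetical simulated data in which an unobserved factor u drives both x2 and the error e. The standard-error step shows one common adjustment under homoskedastic errors: the residual variance is computed with the actual x2 rather than the predicted x2p.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

# Hypothetical data: u makes x2 endogenous; z shifts x2 but does not enter y.
x1 = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)
x2 = 0.5 + 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
e = u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + e

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Stage 1: regress x2 on all exogenous variables plus the instrument z.
Z = np.column_stack([const, x1, z])
x2p = Z @ ols(Z, x2)

# Stage 2: regress y on x1 and the predicted x2p (z itself is excluded).
X_hat = np.column_stack([const, x1, x2p])
b_2sls = ols(X_hat, y)

# Standard errors: residuals use the actual x2, not the predicted x2p.
X = np.column_stack([const, x1, x2])
resid = y - X @ b_2sls
sigma2 = resid @ resid / (n - X.shape[1])
se_2sls = np.sqrt(sigma2 * np.diag(np.linalg.inv(X_hat.T @ X_hat)))

b_ols = ols(X, y)
print("OLS  b2:", round(b_ols[2], 3))
print("2SLS b2:", round(b_2sls[2], 3), " se:", round(se_2sls[2], 4))
```

In practice a packaged IV routine would normally be used, since it applies the standard-error correction automatically. The manual version is shown only to make the two stages visible.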
34 Two-Stage Residual Inclusion
- 2SLS is only consistent when the Stage 2 equation is linear.
- If Stage 2 is nonlinear, use the two-stage residual inclusion (2SRI) method
- - Stage 1: as in 2SLS, leading to predicted x2p
- - Develop residuals: v = x2 - x2p
35 Two-Stage Residual Inclusion
- - Stage 2
- Predict y as a function of x1, x2 (not x2p) and the new residuals v
- y = f(a + b1x1 + b2x2 + b3v) + e
- where f(.) is a nonlinear function.
- Note that if Stage 2 is linear, then 2SRI yields the same results as 2SLS.
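Illustration (not part of the original slides): a NumPy check of the linear special case, using hypothetical simulated data. Stage 1 is shared. 2SLS then regresses y on x1 and x2p, while 2SRI regresses y on x1, the actual x2, and the Stage 1 residuals v. A nonlinear f(.) would need a nonlinear estimator in Stage 2 and is not covered by this sketch.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000

# Hypothetical data with an endogenous x2 and one instrument z.
x1 = rng.normal(size=n)
z = rng.normal(size=n)
u = rng.normal(size=n)
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u + rng.normal(size=n)

def ols(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)

# Stage 1 (same for both methods): predicted values and residuals.
Z = np.column_stack([const, x1, z])
x2p = Z @ ols(Z, x2)
v = x2 - x2p

# 2SLS Stage 2 uses x2p; 2SRI Stage 2 uses the actual x2 plus the residuals v.
b_2sls = ols(np.column_stack([const, x1, x2p]), y)
b_2sri = ols(np.column_stack([const, x1, x2, v]), y)

print("2SLS (a, b1, b2):", np.round(b_2sls, 4))
print("2SRI (a, b1, b2):", np.round(b_2sri[:3], 4))
print("2SRI coefficient on v:", round(b_2sri[3], 3))
```

The first three coefficients match up to floating-point error. A coefficient on v that is clearly non-zero is itself a sign that x2 is endogenous.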
36 Multiple IVs
- What if you have multiple endogenous variables?
- 1. The number of IVs must equal or exceed the number of endogenous variables
- 2. Estimate a separate 1st-stage regression for each endogenous variable
- 3. Every 1st-stage regression should contain all IVs
37 IV Issues
- Two issues plague the IV method
- 1. No IV is available
- 2. A potential IV is found, but its quality is
uncertain
38 IV Issues
- What if there is no IV?
- State that no IV exists and forge ahead anyway, arguing that any bias in OLS is likely to be small.
- Argue that the endogeneity is weak on theoretical grounds.
- Argue that external data indicate that the bias from OLS is likely to be small.
39 IV Properties
- What if you have an IV of unknown quality?
- Two characteristics mark a good IV
- 1. Validity
- 2. Strength
40 IV Validity
- Validity has several components
- a. Non-zero correlation with x2
- b. Uncorrelated with error term e
- c. Uncorrelated with y except through x2
- d. Monotonicity: as z increases, x2 increases
41 IV Validity
- There are several ways to show validity of an IV
- Non-zero correlation with the endogenous variable can be shown directly.
- Robustness: do alternative IVs yield similar results?
- Non-correlation with the outcome variable of the 2nd stage. This point must be argued from theory, an understanding of how the system under study works.
42 IV Validity
- Warning: one cannot simply add a candidate IV to the main model (i.e., the 2nd stage) to see whether it is significant. The result is biased.
- BUT
- If there are multiple IVs, one can use a test of over-identifying restrictions.
43 IV Validity
- Overidentification: number of candidate IVs exceeds number of endogenous variables.
- Suppose that
- (a) You have one endogenous variable and three candidate IVs
- (b) You know that one of the IVs is truly valid.
- Use the known-valid IV in the 1st stage and put the remaining two IVs in the 2nd stage.
44 IV Validity
- Over-identification test, continued
- If the two remaining IVs are jointly insignificant in the 2nd stage, then this supports their use as alternative IVs.
- Problem: this only works if the IV(s) in the 1st stage are truly valid, and you don't know that!
45 IV Validity
- Over-identification test, continued
- Partial solution: use Sargan's (1984) test, which assumes only that one or more of your IVs are valid; you don't have to specify which. This method fails only if none of the IVs is valid.
- In the end, you must argue for validity on conceptual grounds at a minimum.
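Illustration (not part of the original slides): a NumPy sketch of one common form of the Sargan over-identification statistic, computed as n times the R-squared from regressing the 2SLS residuals on all exogenous variables and instruments. Under the null that the instruments are valid, it is compared with a chi-square distribution whose degrees of freedom equal the number of IVs minus the number of endogenous variables. The function name `sargan_stat` and the simulated data are hypothetical.

```python
import numpy as np

def fit(X, y):
    """OLS coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def sargan_stat(y, exog, x_endog, instruments):
    """n * R^2 from regressing 2SLS residuals on exog plus all instruments.
    `exog` must already contain a constant column."""
    n = len(y)
    W = np.column_stack([exog, instruments])
    x_hat = W @ fit(W, x_endog)                              # first stage
    b = fit(np.column_stack([exog, x_hat]), y)               # second stage
    resid = y - np.column_stack([exog, x_endog]) @ b         # uses actual x
    fitted = W @ fit(W, resid)
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    return n * r2

rng = np.random.default_rng(9)
n = 20_000
x1 = rng.normal(size=n)
Zs = rng.normal(size=(n, 3))                 # three candidate IVs
u = rng.normal(size=n)
x2 = 0.5 * x1 + Zs.sum(axis=1) + u + rng.normal(size=n)
e = u + rng.normal(size=n)
exog = np.column_stack([np.ones(n), x1])

y_valid = 1.0 + 2.0 * x1 + 3.0 * x2 + e      # all three IVs are valid
y_invalid = y_valid + 1.0 * Zs[:, 2]         # third IV affects y directly

stat_valid = sargan_stat(y_valid, exog, x2, Zs)
stat_invalid = sargan_stat(y_invalid, exog, x2, Zs)
print("Sargan statistic, all IVs valid: ", round(stat_valid, 2))
print("Sargan statistic, one invalid IV:", round(stat_invalid, 2))
```

With one endogenous variable and three IVs, the reference distribution here has 2 degrees of freedom. A large statistic signals that at least one IV is suspect but does not say which one.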
46 IV Validity
- Conceptual arguments
- 1. Explain why z should influence x2
- 2. Explain why z should not influence y directly
- 3. Anticipate objections about omitted variables
that link z to the error term e. Show that z is
not related to those omitted variables, perhaps
using outside data. For example, use data on
non-veterans to support a claim about how
veterans act.
47 IV Properties
- Two characteristics mark a good instrumental variable
- 1. Validity
- 2. Strength
48 Strong IVs
- A strong instrument has a high correlation with the endogenous variable.
- How strong a correlation? Staiger & Stock (1997) recommend a partial F statistic of 5 or greater.
- - Run 1st stage with and without the IV.
- - Compare the overall F statistics: a difference of 5 or more is sufficient evidence of strength.
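Illustration (not part of the original slides): a NumPy sketch of a first-stage partial F statistic, computed by fitting the 1st stage with and without the candidate IV and comparing residual sums of squares. This is the usual textbook F test for excluded regressors, offered as one way to carry out the comparison. The function name `first_stage_partial_F` and the simulated instruments are hypothetical.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ beta
    return r @ r

def first_stage_partial_F(x2, exog, instruments):
    """F statistic for excluding `instruments` from the 1st-stage regression.
    `exog` must already contain a constant column."""
    instruments = np.column_stack([instruments])   # accepts 1-D or 2-D input
    full = np.column_stack([exog, instruments])
    q = instruments.shape[1]                       # number of IVs tested
    df = len(x2) - full.shape[1]                   # residual degrees of freedom
    rss_r, rss_u = rss(exog, x2), rss(full, x2)
    return ((rss_r - rss_u) / q) / (rss_u / df)

rng = np.random.default_rng(8)
n = 1_000
x1 = rng.normal(size=n)
z_strong = rng.normal(size=n)
z_weak = rng.normal(size=n)
x2 = 0.5 * x1 + 1.0 * z_strong + 0.02 * z_weak + rng.normal(size=n)
exog = np.column_stack([np.ones(n), x1])

F_strong = first_stage_partial_F(x2, exog, z_strong)
F_weak = first_stage_partial_F(x2, exog, z_weak)
print("partial F, strong IV:", round(F_strong, 1))
print("partial F, weak IV:  ", round(F_weak, 2))
```

The resulting statistic can then be compared against whatever strength threshold the analyst has adopted.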
49 Weak IVs
- If the IVs are weak,
- - 2SLS and 2SRI are consistent, but there can be considerable bias even in large samples
- - standard errors are too small
- - 2SLS and 2SRI perform poorly
50 Weak IVs
- What to do if IVs are weak?
- If there is a single endogenous variable, use a conditional likelihood ratio (CLR) test
- - perform a regular likelihood ratio test
- - adjust the critical values
- - available in Stata; see Stata Journal, 3, 57-70 and http://elsa.berkeley.edu/wp/marcelo.pdf by Moreira and Poi
51 Weak IVs
- What if there are multiple endogenous variables and only weak IVs?
- A solution has not been developed yet!
52 Selected References
53 Selected References
- JM Wooldridge. Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press, 2002.
- A graduate-level econometrics textbook with lengthy textual descriptions of practical issues.
- HS Bloom, ed. Learning more from social experiments: evolving analytic approaches. Russell Sage.
- A largely non-technical exploration of how instrumental variables are found and used, with examples from welfare reform studies.
54 Selected References
- MP Murray. Avoiding invalid instruments and coping with weak instruments. Journal of Economic Perspectives 2006;20(4):111-132.
- A superb reference with relatively few equations. Has an extensive reference list.
- A Nakamura, M Nakamura. Model specification and endogeneity. Journal of Econometrics 1998;83:213-237.
- Presents major endogeneity tests, explores approaches to endogeneity testing. Somewhat iconoclastic.
55 Selected References
- M McClellan, B McNeil, J Newhouse. Does more intensive treatment of acute myocardial infarction in the elderly reduce mortality? Analysis using instrumental variables. JAMA 1994;272(11):859-66.
- Classic paper using IV in health, but challenging to read.
- J Newhouse, M McClellan. Econometrics in outcomes research: the use of instrumental variables. Ann Rev Pub Health 1998;19:17-34.
- Non-technical introduction to IV.
56 Selected References
- J Terza, A Basu, P Rathouz. Two-stage residual inclusion estimation: Addressing endogeneity in health econometric modeling. Journal of Health Economics 2008;27:531-543.
- Explains two-stage residual inclusion models and contrasts them to two-stage least squares. Moderately technical.
57 Acknowledgements
- Much of the content of this presentation is derived from Wooldridge (2002), Murray (2006), and Nakamura and Nakamura (1998).
- Helpful comments were also provided by HERC staff.
58 Questions?