Econometrics Course: Endogeneity

1
Econometrics Course: Endogeneity & Simultaneity
  • Mark W. Smith

2
Overview
  • Endogeneity
  • Sources
  • Responses
  • Omitted Variables
  • Measurement Error
  • Proxy Variables
  • Method of Instrumental Variables
  • Properties
  • Validity and strength of instruments

3
Definition of Endogeneity
  • Suppose we have a regression equation
  • y = a + b1x1 + b2x2 + e
  • The variable x1 is endogenous if it is correlated
    with e.
  • Note that this is related to, but not identical
    to, the heuristic definition that x1 is
    determined within the model.

4
Sources of Endogeneity
  • 1. Omitted variables
  • If the true model underlying the data is
  • y = a + b1x1 + b2x2 + b3x3 + n
  • but you estimate the model
  • y = a + b1x1 + b2x2 + e
  • then variable x1 will be endogenous if it is
    correlated with x3. Why? Because e = f(n, x3).
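
The omitted-variable argument above can be checked with a small simulation. This is a minimal sketch in Python with NumPy, not part of the original slides: the coefficient values, the sample size, and the helper `ols` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process: x3 is the variable we will omit.
x3 = rng.normal(size=n)
x1 = 0.8 * x3 + rng.normal(size=n)   # x1 is correlated with x3
x2 = rng.normal(size=n)              # x2 is unrelated to x3
nu = rng.normal(size=n)              # the true error term n
y = 1.0 + 2.0 * x1 + 1.0 * x2 + 3.0 * x3 + nu

def ols(columns, y):
    """OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols([x1, x2, x3], y)   # true model
short = ols([x1, x2], y)      # x3 omitted, so e = n + b3*x3

print("b1 with x3 included:", round(full[1], 3))
print("b1 with x3 omitted: ", round(short[1], 3))
```

In this setup the estimate of b1 moves away from its assumed true value of 2 once x3 is dropped, while the estimate of b2 is essentially unaffected because x2 is unrelated to x3.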

5
Sources of Endogeneity
  • 2. Measurement error
  • Suppose the true model underlying the data is
  • y = a + b1x1 + b2x2* + e
  • but you estimate the model
  • y = a + b1x1 + b2x2 + e
  • where x2 = x2* + j (x2* is the true value, x2 the
    measured value, and j the measurement error).

6
Sources of Endogeneity
  • 2. Measurement error - continued
  • Variable x2 will be endogenous if j depends on
    x2.
  • Example: Suppose that x2 measures hospital size
    (no. of beds), and that the measurement error is
    greater for larger hospitals. Then as x2 grows,
    so does j. Thus e is correlated with x2, causing
    endogeneity.

7
Sources of Endogeneity
  • 2. Measurement error - continued
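  • Rearranging the equation with x2* = x2 - j, where
    x2* is the true value and x2 the measured value,
    we have
  • y = a + b1x1 + b2x2* + e
  • y = a + b1x1 + b2(x2 - j) + e
  • y = a + b1x1 + b2x2 + (e - b2j)
  • If j = f(x2), then the error term is correlated
    with x2, causing endogeneity.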
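
A small simulation can illustrate the rearranged equation. This is a sketch under stated assumptions, not the author's example: it uses a simpler case than the hospital one, in which the error j is independent of the true value, and all names and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
b1, b2 = 2.0, 3.0                      # assumed true coefficients

x1 = rng.normal(size=n)
x2_true = rng.normal(size=n)           # true regressor
j = rng.normal(size=n)                 # measurement error (assumed independent of x2_true)
x2_obs = x2_true + j                   # what the analyst observes
e = rng.normal(size=n)
y = 1.0 + b1 * x1 + b2 * x2_true + e

# OLS of y on the mismeasured regressor.
X = np.column_stack([np.ones(n), x1, x2_obs])
coef = np.linalg.lstsq(X, y, rcond=None)[0]

# Composite error from the rearranged equation: (e - b2*j).
composite = e - b2 * j
corr = np.corrcoef(x2_obs, composite)[0, 1]

print("estimated b2:", round(coef[2], 3))
print("corr(observed x2, composite error):", round(corr, 3))
```

Even in this simple case the measured x2 is correlated with the composite error, because both contain j, and the estimated b2 falls below its assumed true value of 3.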

8
Sources of Endogeneity
  • 3. Simultaneity
  • A system of simultaneous equations occurs when
    two or more left-hand side variables are
    functions of each other (there are other ways of
    stating it, too)
  • y1 = a + b1x1 + g2y2 + e
  • y2 = a + g1x1 + g2y1 + e

9
Sources of Endogeneity
  • 3. Simultaneity
  • With some algebra you can rewrite these two
    equations in reduced form as a single equation
    with an endogenous regressor.
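
To see why simultaneity makes a right-hand-side variable endogenous, the sketch below simulates a two-equation system and runs OLS on the first equation. The parameter names (`a1`, `b1`, `g2`, `a2`, `c1`, `c2`) and their values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical structural parameters:
#   y1 = a1 + b1*x1 + g2*y2 + e1
#   y2 = a2 + c1*x1 + c2*y1 + e2
a1, b1, g2 = 1.0, 1.0, 0.5
a2, c1, c2 = 2.0, 0.5, 0.5

x1 = rng.normal(size=n)
e1 = rng.normal(size=n)
e2 = rng.normal(size=n)

# Solve the two equations for y1 (its reduced form), then build y2.
denom = 1.0 - g2 * c2
y1 = (a1 + g2 * a2 + (b1 + g2 * c1) * x1 + e1 + g2 * e2) / denom
y2 = a2 + c1 * x1 + c2 * y1 + e2

# OLS on the first structural equation treats y2 as if it were exogenous.
X = np.column_stack([np.ones(n), x1, y2])
coef = np.linalg.lstsq(X, y1, rcond=None)[0]

print("assumed true g2:", g2)
print("OLS estimate of g2:", round(coef[2], 3))
```

Because y2 depends on y1, it also depends on e1, so the OLS coefficient on y2 does not settle at the assumed value of 0.5 even with a very large sample.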

10
Pretesting for Endogeneity
  • The most famous test is Hausman (1978). Many
    others are described in Nakamura and Nakamura
    (1998).
  • Idea: the method of instrumental variables (IV)
    uses two-stage least squares (2SLS). If there is
    no endogeneity, OLS is more efficient than 2SLS.
    If there is endogeneity, OLS is inconsistent, so
    2SLS is preferred.
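
One common way to carry out this kind of pretest is the regression-based (Durbin-Wu-Hausman) variant: add the first-stage residual to the main equation and test whether its coefficient is zero. The sketch below uses that variant, which may differ from the exact form the presenter had in mind; the data-generating process, the instrument `z`, and the helper `ols` are assumptions.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and conventional standard errors."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, se

rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
z = rng.normal(size=n)                      # hypothetical instrument
u = rng.normal(size=n)                      # shock shared by x2 and y
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u + rng.normal(size=n)

const = np.ones(n)

# First stage: regress x2 on the exogenous variable and the instrument.
Z = np.column_stack([const, x1, z])
g, _ = ols(Z, x2)
v = x2 - Z @ g                              # first-stage residual

# Augmented regression: a nonzero coefficient on v signals endogeneity.
X_aug = np.column_stack([const, x1, x2, v])
beta, se = ols(X_aug, y)
t_v = beta[3] / se[3]

print("t statistic on first-stage residual:", round(t_v, 2))
```

Here x2 is endogenous by construction, so the t statistic is far from zero. With a weaker link between x2 and the error, the same test can easily fail to reject.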

11
Pretesting for Endogeneity
  • Problem: the tests all have low power,
    particularly when 2SLS would cause a significant
    loss of efficiency.
  • In practice, many people use a Hausman test, fail
    to reject the null hypothesis of no endogeneity,
    and then use OLS.
  • A more statistically reliable approach is to base
    judgments of endogeneity on how the system under
    study works.

12
Responses to Endogeneity
  • What if you are unsure whether a variable is
    endogenous?
  • Approach 1: ignore it
  • Approach 2: use instrumental variables (IV) --
    described later -- for every possibly endogenous
    variable
  • Approach 3: subtract out the variable using
    time-series (panel) data

13
Responses to Endogeneity
  • Approach 1: ignore it
  • -- Not advisable: true endogeneity causes
    OLS to be inconsistent
  • Approach 2: use IV on every possibly endogenous
    variable
  • -- Not advisable: it will cause a loss of
    efficiency (and hence wider confidence intervals)
    and may lead to bias.

14
Responses to Endogeneity
  • Approach 3: Difference it out
  • Suppose that the source of endogeneity is fixed
    over time, such as measurement error or an
    omitted variable. Further, suppose that you
    observe data in two time periods.
  • A difference-in-difference (DD) model can be
    used: subtract values at time 1 (before) from
    values at time 2 (after) and the time-invariant
    source of endogeneity will drop out.
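
The differencing idea can be sketched with two periods of simulated data. This is an illustrative example only: the unobserved fixed variable `q`, the coefficient values, and the helper `slope` are assumptions, and the sketch shows plain first-differencing across two periods.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
b = 2.0                                  # assumed true effect of x on y

q = rng.normal(size=n)                   # unobserved, fixed over time
x_t1 = q + rng.normal(size=n)            # x is correlated with q in both periods
x_t2 = q + rng.normal(size=n)
y_t1 = 1.0 + b * x_t1 + q + rng.normal(size=n)
y_t2 = 1.0 + b * x_t2 + q + rng.normal(size=n)

def slope(x, y):
    """Slope from an OLS regression of y on an intercept and x."""
    X = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Pooled OLS ignores q, so x is endogenous.
b_pooled = slope(np.concatenate([x_t1, x_t2]), np.concatenate([y_t1, y_t2]))

# Differencing removes q because it takes the same value in both periods.
b_diff = slope(x_t2 - x_t1, y_t2 - y_t1)

print("pooled OLS:", round(b_pooled, 3))
print("differenced:", round(b_diff, 3))
```

If q changed between the two periods, it would not cancel, and the differenced estimate would remain biased.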

15
Responses to Endogeneity
  • Approach 3: Difference it out -- continued
  • Limitations:
  • - DD models will not eliminate selection bias.
  • - DD models only eliminate fixed variables;
    sometimes endogenous variables change values over
    time.

16
Dealing with Omitted Variables
17
Dealing with Omitted Variables
  • The investigator should have a conceptual model
    of the process under study. Guided by this
    understanding, there are a few options for
    dealing with omitted variables.
  • 1. Find additional data so that every relevant
    variable is included.
  • 2. Ignore it
  • - Acceptable only if the omitted variable is
    uncorrelated with all included variables;
    otherwise the coefficient estimates will be
    biased up or down.

18
Dealing with Omitted Variables
  • 3. Find proxy variable
  • Suppose the following
  • y is the outcome
  • q is the omitted variable
  • z is the proxy for q
  • What properties should the proxy z have?

19
Dealing with Omitted Variables
  • a. Proxy z should be strongly correlated with q.
  • b. Proxy z must be redundant (= ignorable):
  • E(y | x, q, z) = E(y | x, q)
  • c. Omitted q must be uncorrelated with other
    regressors conditional on z:
  • corr(q, xj | z) = 0 for each xj

20
Dealing with Omitted Variables
  • The last two mean roughly that q and z provide
    similar information about the outcome.
  • You don't observe q, so how can you prove these
    conditions are met? Either argue it from theory
    or test the assumption using other data.
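
The proxy conditions can be illustrated with simulated data in which they hold by construction. This is a sketch under assumptions: q is built as a multiple of z plus noise unrelated to x, and all coefficient values and the helper `ols` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000

z = rng.normal(size=n)                       # proxy
q = 0.9 * z + 0.5 * rng.normal(size=n)       # omitted variable; the remainder is unrelated to x
x = 0.7 * z + rng.normal(size=n)             # x relates to q only through z
y = 1.0 + 2.0 * x + 3.0 * q + rng.normal(size=n)

def ols(columns, y):
    """OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_no_proxy = ols([x], y)[1]       # q omitted, no proxy
b_proxy = ols([x, z], y)[1]       # z stands in for q

print("coefficient on x without proxy:", round(b_no_proxy, 3))
print("coefficient on x with proxy:   ", round(b_proxy, 3))
```

With the proxy included, the coefficient on x is close to the assumed true value of 2. The coefficient on z itself mixes the effect of q with the strength of the proxy relationship and should not be read as the effect of q.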

21
Dealing with Measurement Error
22
Dealing with Measurement Error
  • 1. Improve measurement
  • - DSS improved by refusing extreme outlier
    values
  • - NPPD improved by requiring more complete data
  • 2. Argue that the degree of error is small
  • - Use outside data for validation
  • 3. Argue that error is uncorrelated with included
    variables

23
Dealing with Proxy Variables
24
Dealing with Proxy Variables
  • 1. What if proxy variable z is correlated with a
    regressor x?
  • OLS is inconsistent, but one can hope and argue
    that the inconsistency is less than if z is
    omitted.

25
Dealing with Proxy Variables
  • 2. Consider using a lagged dependent variable as
    a proxy variable.
  • Example: If you believe that omitted variable q_t
    strongly affects outcome y_t, then a lagged value
    of y (such as y_{t-2}) is probably correlated with
    q_t as well.
  • Problem: y_{t-2} may be correlated with other x's
    as well, leading to inconsistency.

26
Dealing with Proxy Variables
  • 3. Consider using multiple proxy variables for a
    single omitted variable.
  • How? Simply put all proxy variables in the
    equation.
  • Note: they all must meet the requirements for
    proxies.

27
Dealing with Proxy Variables
  • 4. What if omitted variable q interacts with a
    regressor x?
  • y = a + b1x + b2q + b3qx + e
  • => dy/dx = b1 + b3q
  • The marginal effect of x on y involves q, which
    is unobserved.

28
Dealing with Proxy Variables
  • Demean z: take every value of z and subtract out
    the grand (overall) average value. Call it zd.
  • y = a + b1x + b2zd + b3zdx + e
  • => dy/dx = b1 + b3zd
  • = b1 on average, because E(zd) = 0
29
Instrumental Variables
30
Method of Instrumental Variables
  • Often used to deal with simultaneity.
  • More generally, IV applies whenever a regressor x
    is correlated with the error term e.

31
IV Definition
  • Model: y = a + b1x1 + b2x2 + e
  • Suppose that x2 is endogenous to y. An
    instrumental variable is one that
  • (a) is correlated with the endogenous variable
    x2
  • (b) is uncorrelated with error term e
  • (c) should not enter the main equation (i.e.,
    does not explain y directly)

32
Two-Stage Least Squares
  • Two-stage least squares (2SLS) approach
  • Stage 1:
  • Predict x2 as a function of all other
    variables plus an IV (call it z)
  • x2 = a + g1x1 + g2z + n
  • Create predicted values of x2; call them x2p

33
Two-Stage Least Squares
  • Two-stage least squares (2SLS) approach
  • Stage 2:
  • Predict y as a function of x2p and all other
    variables (but not z)
  • y = a + b1x1 + b2x2p + e
  • Note: adjust the standard errors to account
    for the fact that x2p is predicted.
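
The two stages can be sketched by hand with NumPy. This is an illustration under assumptions, not a substitute for a tested IV routine: the data-generating process, the instrument `z`, and the helper `ols` are hypothetical. The standard errors use residuals computed from the actual x2 rather than x2p, which is the usual form of the adjustment mentioned in the note.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 20_000
x1 = rng.normal(size=n)
z = rng.normal(size=n)                      # hypothetical instrument
u = rng.normal(size=n)                      # shock shared by x2 and y
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u + rng.normal(size=n)
const = np.ones(n)

def ols(X, y):
    """OLS coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1: predict x2 from all other variables plus the instrument.
Z = np.column_stack([const, x1, z])
x2p = Z @ ols(Z, x2)

# Stage 2: regress y on x1 and the predicted x2p (z is excluded).
Xp = np.column_stack([const, x1, x2p])
b_2sls = ols(Xp, y)

# Adjusted standard errors: residuals are computed with the actual x2.
X = np.column_stack([const, x1, x2])
resid = y - X @ b_2sls
sigma2 = resid @ resid / (n - X.shape[1])
se_2sls = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xp.T @ Xp)))

b_ols = ols(X, y)                           # for comparison

print("OLS  b2:", round(b_ols[2], 3))
print("2SLS b2:", round(b_2sls[2], 3), "+/-", round(se_2sls[2], 3))
```

Running the second stage with an ordinary regression routine reports standard errors based on residuals from x2p, which is why the adjustment is needed.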

34
Two-Stage Residual Inclusion
  • 2SLS is only consistent when the Stage 2 equation
    is linear.
  • If Stage 2 is nonlinear, use the two-stage
    residual inclusion (2SRI) method:
  • - Stage 1: as in 2SLS, leading to predicted x2p
  • - Develop residuals v = x2 - x2p

35
Two-Stage Residual Inclusion
  • - Stage 2:
  • Predict y as a function of x1, x2 (not x2p)
    and the new residuals v
  • y = f(a + b1x1 + b2x2 + b3v) + e
  • where f(.) is a nonlinear function.
  • Note that if Stage 2 is linear, then 2SRI
    yields the same results as 2SLS.
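
The equivalence in the linear case can be verified numerically. The sketch below, on simulated data with hypothetical names and values, runs both procedures with a linear Stage 2 and compares the coefficients on the intercept, x1, and x2.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 2_000
x1 = rng.normal(size=n)
z = rng.normal(size=n)                      # hypothetical instrument
u = rng.normal(size=n)
x2 = 0.5 * x1 + 1.0 * z + u + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u + rng.normal(size=n)
const = np.ones(n)

def ols(X, y):
    """OLS coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Stage 1 (shared by both methods).
Z = np.column_stack([const, x1, z])
x2p = Z @ ols(Z, x2)
v = x2 - x2p                                # first-stage residuals

# 2SLS: replace x2 with its prediction.
b_2sls = ols(np.column_stack([const, x1, x2p]), y)

# 2SRI with a linear Stage 2: keep x2 and add the residuals v.
b_2sri = ols(np.column_stack([const, x1, x2, v]), y)

print("2SLS:", np.round(b_2sls, 4))
print("2SRI:", np.round(b_2sri[:3], 4))
```

The match is exact up to floating-point error because v is orthogonal to every Stage 1 regressor. With a nonlinear f(.) the two procedures generally give different answers.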

36
Multiple IVs
  • What if you have multiple endogenous variables?
  • 1. The number of IVs must equal or exceed the
    number of endogenous variables
  • 2. Estimate a separate 1st-stage regression for
    each endogenous variable
  • 3. Every 1st-stage regression should contain all
    IVs
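
The three rules can be sketched for two endogenous variables and three candidate IVs. Everything here is simulated: the names `xA` and `xB`, the coefficient values, and the data-generating process are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 50_000
const = np.ones(n)
x1 = rng.normal(size=n)
z = rng.normal(size=(n, 3))                 # three hypothetical IVs
uA = rng.normal(size=n)
uB = rng.normal(size=n)

# Two endogenous variables, each correlated with the error in y.
xA = 0.3 * x1 + z[:, 0] + 0.5 * z[:, 2] + uA
xB = -0.2 * x1 + z[:, 1] + 0.5 * z[:, 2] + uB
e = 0.5 * uA + 0.5 * uB + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * xA - 1.0 * xB + e

# Rule 1: at least as many IVs as endogenous variables.
endog = np.column_stack([xA, xB])
assert z.shape[1] >= endog.shape[1]

# Rules 2 and 3: one first-stage regression per endogenous variable,
# each containing the exogenous variable and all IVs.
W = np.column_stack([const, x1, z])
fitted = W @ np.linalg.lstsq(W, endog, rcond=None)[0]

# Second stage on the predicted endogenous variables.
X2 = np.column_stack([const, x1, fitted])
beta = np.linalg.lstsq(X2, y, rcond=None)[0]

print("estimates (const, x1, xA, xB):", np.round(beta, 3))
```

Passing both endogenous variables to `lstsq` at once simply runs the two first-stage regressions side by side on the same set of regressors.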

37
IV Issues
  • Two issues plague the IV method
  • 1. No IV is available
  • 2. A potential IV is found, but its quality is
    uncertain

38
IV Issues
  • What if there is no IV?
  • State that no IV exists and forge ahead anyway,
    arguing that any bias in OLS is likely to be
    small.
  • Argue that the endogeneity is weak on theoretical
    grounds.
  • Argue that external data indicate that the bias
    from OLS is likely to be small.

39
IV Properties
  • What if you have an IV of unknown quality?
  • Two characteristics mark a good IV
  • 1. Validity
  • 2. Strength

40
IV Validity
  • Validity has several components
  • a. Non-zero correlation with x2
  • b. Uncorrelated with error term e
  • c. Uncorrelated with y except through x2
  • d. Monotonicity: as z increases, x2 increases

41
IV Validity
  • There are several ways to show validity of an IV
  • Non-zero correlation with the endogenous variable
    can be shown directly.
  • Robustness: do alternative IVs yield similar
    results?
  • Non-correlation with the outcome variable of the
    2nd stage: this point must be argued from theory,
    that is, from an understanding of how the system
    under study works.

42
IV Validity
  • Warning: one cannot simply add a candidate IV to
    the main model (i.e., the 2nd stage) to see
    whether it is significant. The result is biased.
  • BUT
  • If there are multiple IVs, one can use a test of
    over-identifying restrictions.

43
IV Validity
  • Overidentification: the number of candidate IVs
    exceeds the number of endogenous variables.
  • Suppose that
  • (a) You have one endogenous variable and three
    candidate IVs
  • (b) You know that one of the IVs is truly valid.
  • Use the known-valid IV in the 1st stage and put
    the remaining two IVs in the 2nd stage.

44
IV Validity
  • Over-identification test, continued
  • If the two remaining IVs are jointly
    insignificant in the 2nd stage, then this
    supports their use as alternative IVs.
  • Problem: this only works if the IV(s) in the 1st
    stage are truly valid, and you don't know that!

45
IV Validity
  • Over-identification test, continued
  • Partial solution: use Sargan's (1984) test,
    which assumes only that one or more of your IVs
    are valid; you don't have to specify which. This
    method fails only if none of the IVs is valid.
  • In the end, you must argue for validity on
    conceptual grounds at a minimum.
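
A Sargan-type statistic is commonly computed as n times the R-squared from regressing the 2SLS residuals on all exogenous variables and IVs, and compared with a chi-squared distribution whose degrees of freedom equal the number of IVs minus the number of endogenous variables. The sketch below follows that textbook recipe on simulated data. The function name `sargan`, the data, and the invalid-IV scenario are assumptions.

```python
import numpy as np

def sargan(y, X_exog, x_endog, Z_iv):
    """n*R^2 overidentification statistic for one endogenous regressor.

    X_exog must include the constant; Z_iv holds the candidate IVs.
    """
    n = len(y)
    W = np.column_stack([X_exog, Z_iv])
    x_hat = W @ np.linalg.lstsq(W, x_endog, rcond=None)[0]
    beta = np.linalg.lstsq(np.column_stack([X_exog, x_hat]), y, rcond=None)[0]
    # 2SLS residuals use the actual endogenous regressor.
    resid = y - np.column_stack([X_exog, x_endog]) @ beta
    fitted = W @ np.linalg.lstsq(W, resid, rcond=None)[0]
    r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
    df = Z_iv.shape[1] - 1
    return n * r2, df

rng = np.random.default_rng(10)
n = 5_000
x1 = rng.normal(size=n)
z = rng.normal(size=(n, 3))
u = rng.normal(size=n)
x2 = z.sum(axis=1) + u + rng.normal(size=n)
X_exog = np.column_stack([np.ones(n), x1])

y_valid = 1.0 + 2.0 * x1 + 3.0 * x2 + u + rng.normal(size=n)
y_invalid = y_valid + 0.5 * z[:, 2]         # third IV now affects y directly

stat_valid, df = sargan(y_valid, X_exog, x2, z)
stat_invalid, _ = sargan(y_invalid, X_exog, x2, z)

# The 5% critical value of chi-squared with 2 degrees of freedom is 5.991.
print("valid IVs:  ", round(stat_valid, 2), "df =", df)
print("one invalid:", round(stat_invalid, 2), "df =", df)
```

A large statistic signals that at least one IV is suspect, but it does not say which one.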

46
IV Validity
  • Conceptual arguments
  • 1. Explain why z should influence x2
  • 2. Explain why z should not influence y directly
  • 3. Anticipate objections about omitted variables
    that link z to the error term e. Show that z is
    not related to those omitted variables, perhaps
    using outside data. For example, use data on
    non-veterans to support a claim about how
    veterans act.

47
IV Properties
  • Two characteristics mark a good instrumental
    variable
  • 1. Validity
  • 2. Strength

48
Strong IVs
  • A strong instrument has a high correlation with
    the endogenous variable.
  • How strong a correlation? Staiger & Stock (1997)
    recommend a partial F statistic of 10 or greater.
  • - Run 1st stage with and without the IV.
  • - Compare the two fits with a partial F test on
    the IV: an F statistic of 10 or more is
    sufficient evidence of strength.
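
A partial F statistic for the IV can be computed from the residual sums of squares of the first stage run with and without it. This is a minimal sketch on simulated data: the function `partial_f` and the "strong" and "weak" first-stage coefficients are hypothetical.

```python
import numpy as np

def partial_f(x2, X_other, z):
    """Partial F statistic for the instrument(s) z in the first stage.

    X_other holds the constant and the other exogenous variables.
    """
    n = len(x2)

    def rss(X):
        resid = x2 - X @ np.linalg.lstsq(X, x2, rcond=None)[0]
        return resid @ resid

    X_full = np.column_stack([X_other, z])
    q = X_full.shape[1] - X_other.shape[1]   # number of instruments tested
    rss_r, rss_u = rss(X_other), rss(X_full)
    return ((rss_r - rss_u) / q) / (rss_u / (n - X_full.shape[1]))

rng = np.random.default_rng(11)
n = 1_000
x1 = rng.normal(size=n)
z = rng.normal(size=n)
X_other = np.column_stack([np.ones(n), x1])

x2_strong = 0.5 * x1 + 1.0 * z + rng.normal(size=n)    # z matters a lot
x2_weak = 0.5 * x1 + 0.02 * z + rng.normal(size=n)     # z barely matters

f_strong = partial_f(x2_strong, X_other, z)
f_weak = partial_f(x2_weak, X_other, z)

print("partial F, strong IV:", round(f_strong, 1))
print("partial F, weak IV:  ", round(f_weak, 2))
```

With a single IV, this statistic equals the square of the IV's t statistic in the first-stage regression.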

49
Weak IVs
  • If the IVs are weak,
  • 2SLS and 2SRI are consistent, but there can be
    considerable bias even in large samples
  • standard errors are too small
  • 2SLS and 2SRI perform poorly
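
A small Monte Carlo exercise can illustrate the bias. The sketch below compares the median 2SLS estimate under weak and strong IVs. The design is an assumption made for illustration: ten IVs, 200 observations per replication, an error correlation of 0.8, and the helper names `tsls_slope` and `median_bias`.

```python
import numpy as np

def tsls_slope(y, x, Z):
    """2SLS slope for one endogenous regressor x with instruments Z."""
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    return (x_hat @ y) / (x_hat @ x)

def median_bias(pi, n=200, k=10, reps=500, beta=1.0, seed=0):
    """Median of (2SLS estimate - beta) when each IV has first-stage coefficient pi."""
    rng = np.random.default_rng(seed)
    est = np.empty(reps)
    for r in range(reps):
        Z = rng.normal(size=(n, k))
        v = rng.normal(size=n)
        e = 0.8 * v + 0.6 * rng.normal(size=n)   # corr(e, v) = 0.8
        x = Z @ np.full(k, pi) + v
        y = beta * x + e
        est[r] = tsls_slope(y, x, Z)
    return np.median(est) - beta

bias_weak = median_bias(pi=0.02)
bias_strong = median_bias(pi=0.5)

print("median bias, weak IVs:  ", round(bias_weak, 3))
print("median bias, strong IVs:", round(bias_strong, 3))
```

In this design the weak-IV estimates cluster near the biased OLS answer rather than the assumed true value of 1, while the strong-IV estimates are centered close to it.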

50
Weak IVs
  • What to do if IVs are weak?
  • If there is a single endogenous variable, use a
    conditional likelihood ratio (CLR) test:
  • perform a regular likelihood ratio test
  • adjust the critical values
  • available in Stata: see Stata Journal, 3,
    57-70, and
    http://elsa.berkeley.edu/wp/marcelo.pdf
    by Moreira and Poi

51
Weak IVs
  • What if there are multiple endogenous variables
    and only weak IVs?
  • A solution has not been developed yet!

52
Selected References
53
Selected References
  • JM Wooldridge. Econometric analysis of cross
    section and panel data. Cambridge, MA: MIT
    Press, 2002.
  • A graduate-level econometrics textbook with
    lengthy textual descriptions of practical issues.
  • HS Bloom, ed. Learning more from social
    experiments: evolving analytic approaches.
    Russell Sage.
  • A largely non-technical exploration of how
    instrumental variables are found and used, with
    examples from welfare reform studies.

54
Selected References
  • MP Murray. Avoiding invalid instruments and
    coping with weak instruments. Journal of Economic
    Perspectives 2006;20(4):111-132.
  • A superb reference with relatively few
    equations. Has an extensive reference list.
  • A Nakamura, M Nakamura. Model specification and
    endogeneity. Journal of Econometrics
    1998;83:213-237.
  • Presents major endogeneity tests, explores
    approaches to endogeneity testing. Somewhat
    iconoclastic.

55
Selected References
  • M McClellan, B McNeil, J Newhouse. Does more
    intensive treatment of acute myocardial
    infarction in the elderly reduce mortality?
    Analysis using instrumental variables.
    JAMA 1994;272(11):859-66.
  • Classic paper using IV in health, but
    challenging to read.
  • J Newhouse, M McClellan. Econometrics in outcomes
    research: the use of instrumental variables. Ann
    Rev Pub Health 1998;19:17-34.
  • Non-technical introduction to IV.

56
Selected References
  • J Terza, A Basu, P Rathouz. Two-stage residual
    inclusion estimation: Addressing endogeneity in
    health econometric modeling. Journal of Health
    Economics 2008;27:531-543.
  • Explains two-stage residual inclusion models and
    contrasts them to two-stage least squares.
    Moderately technical.

57
Acknowledgements
  • Much of the content of this presentation is
    derived from Wooldridge (2002), Murray (2006),
    and Nakamura and Nakamura (1998).
  • Helpful comments were also provided by HERC staff.

58
Questions?