5. Endogenous right hand side variables

Slides: 68

Provided by: accl2

1
5. Endogenous right hand side variables

5.1 The problem of endogeneity bias
5.2 The basic idea underlying the use of
instrumental variables
5.3 When the endogenous right hand side variable
is continuous
5.4 When the endogenous right hand side variable
is binary

2
5.1 Endogeneity bias

Consider a simple OLS regression
$Y_{it} = a_0 + a_1 X_{1it} + u_{it}$
Recall that our estimate of a1 will be unbiased
only if we can assume that X1it is uncorrelated
with the error term (uit)
We have discussed two ways to help ensure that
this assumption is true
First, we should control for any observable
variables that affect Yit and which are
correlated with X1it. For example, we should
control for X2it if X2it affects Yit and X2it is
correlated with X1it (see Chapter 2)
$Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_{it}$

3
5.1 Endogeneity bias

Second, if we have panel data, we can control for
any unobservable firm-specific characteristics
(ui) that affect Yit and which are correlated
with the X variables.
From Chapter 4
$Y_{it} = a_0 + a_1 X_{1it} + a_2 X_{2it} + u_i + e_{it}$
We control for the correlations between ui and
the X variables by estimating fixed effects
models.
Our estimates of a1 and a2 are unbiased if the X
variables are uncorrelated with eit. In this
case, we say that the X variables are exogenous.

4
5.1 Endogeneity bias

Unfortunately, multiple regression and fixed effects models do not always ensure that the X variables are uncorrelated with the error term:
If we do not observe all the variables that affect Y and that are correlated with X, multiple regression will not solve the problem.
If we do not have panel data, the fixed effects models cannot be estimated.
Even if we have panel data, the Y and X variables may display little variation over time, in which case the fixed effects models can be unreliable (Zhou, 2001).
Even if we have panel data and the Y and X variables display sufficient variation over time, the unobservable variables that are correlated with X may not be constant over time, in which case the fixed effects models will not solve the problem.

A variable is more likely to be correlated with
the error term if it is endogenous
Endogenous means that the variable is
determined within the economic model that we are
trying to estimate.
For example, suppose that Y2it is an endogenous
explanatory variable
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + u_{it}$ (1)
$Y_{2it} = b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}$ (2)
Equations (1) and (2) have a triangular
structure since Y2it is assumed to affect Y1it,
but Y1it is assumed not to affect Y2it
Given this triangular structure, the OLS estimate
of a1 in equation (1) is unbiased only if vit is
uncorrelated with uit
If vit is correlated with uit, then Y2it is
correlated with uit which means that the OLS
estimate of a1 would be biased
To avoid this bias, we must estimate equation (1) using instrumental variables (IV) regression rather than OLS.
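
A small simulation can make the bias concrete. The sketch below is not part of the original slides: it generates artificial data from equations (1) and (2) with a true a1 of 1 and deliberately correlated error terms, then compares OLS with IV. All variable names (x, z, c, u, v, y1, y2) and parameter values are assumptions chosen purely for illustration.

```stata
* Simulated triangular system; every name and value here is hypothetical
clear
set seed 12345
set obs 5000

* Exogenous variables
generate x = rnormal()
generate z = rnormal()

* Error terms that share a common component c, so Cov(u, v) = 1
generate c = rnormal()
generate u = c + rnormal()
generate v = c + rnormal()

* Equation (2), then equation (1) with true a1 = 1
generate y2 = 0.5 + x + z + v
generate y1 = 1 + y2 + x + u

* OLS: y2 is correlated with u through v, so the y2 coefficient is biased
regress y1 y2 x

* IV: z appears in the y2 model but is excluded from the y1 model
ivregress 2sls y1 x (y2 = z)
```

In this design Cov(u, v) = 1 and the part of y2 that is left after controlling for x has variance 3, so the OLS coefficient on y2 should settle near 1 + 1/3 in large samples, whereas the IV estimate should be close to the true value of 1.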

Equations (1) and (2) are called structural
equations because they describe the economic
relationship between Y1it and Y2it
We can obtain a reduced-form equation by
substituting eq. (2) into eq. (1)
$Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}$
In this reduced-form equation, all the
explanatory variables (Xit and Zit) are exogenous
The basic idea underlying IV regression is to
remove vit from the Y1it model so that our
estimate of a1 is unbiased.

7
5.2 The basic idea underlying the use of
instrumental variables

Note that vit is removed from the Y1it model if
we use the predicted rather than the actual
values of Y2it on the right hand side.
We predict Y2it using all the exogenous variables in the system (in our example, we use the two exogenous variables Xit and Zit)
$\hat{Y}_{2it} = \hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}$

8
5.2 The basic idea

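We then use the predicted rather than the actual values of Y2it when estimating the Y1it model
$Y_{1it} = a_0 + a_1 (b_0 + b_1 X_{it} + b_2 Z_{it} + v_{it}) + a_2 X_{it} + u_{it}$ (3)
$Y_{1it} = a_0 + a_1 (\hat{b}_0 + \hat{b}_1 X_{it} + \hat{b}_2 Z_{it}) + a_2 X_{it} + u_{it}$ (4)
The a1 estimate is biased in eq. (3) but it is unbiased in eq. (4) because the vit term has been removed.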
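
The two-stage logic can be sketched in Stata with simulated data. The data-generating values and variable names (x, z, c, u, v, y1, y2, y2hat) are illustrative assumptions, not taken from the slides. The first regression predicts y2 from the exogenous variables and the second uses the prediction in place of y2.

```stata
* Manual two-stage estimation on hypothetical simulated data
clear
set seed 12345
set obs 5000
generate x = rnormal()
generate z = rnormal()
generate c = rnormal()
generate u = c + rnormal()
generate v = c + rnormal()
generate y2 = 0.5 + x + z + v
generate y1 = 1 + y2 + x + u

* Stage 1: predict y2 using all the exogenous variables in the system
regress y2 x z
predict y2hat, xb

* Stage 2: replace the actual y2 with its predicted value
regress y1 y2hat x

* One-step equivalent
ivregress 2sls y1 x (y2 = z)
```

The coefficient on y2hat in the manual second stage should match the ivregress coefficient on y2. The standard errors from the manual second stage are not correct, because they are computed from residuals based on y2hat rather than y2, which is one reason to prefer ivregress in practice.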

In eq. (4) the estimated coefficient for the Zit variable is a1b2
We already know the value of b2 from eq. (2)
Therefore a1 = (a1b2) / b2
It is important to note that the a1 coefficient can be estimated only if there is at least one exogenous variable in the structural model for Y2it that is excluded from the structural model for Y1it
This is the Zit variable in eq. (2)

In eq. (4) the a1 coefficient is just identified because there is only one exogenous variable (Zit) that is in the Y2it model and that is excluded from the Y1it model

Suppose we had included Zit in both models, with coefficient a3 in the Y1it model
In this case, the a1 coefficient cannot be identified because we estimate only (a1b2 + a3) and b2
In other words, we cannot determine whether the effect of Zit on Y1it is a main effect (a3) or an indirect effect through Y2it (a1b2)
Here we say that the system of equations is under-identified

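Suppose we had included two exogenous variables in the Y2it model (Z1it and Z2it, with coefficients b2 and b3) and we excluded both these variables from the Y1it model
Now we have estimates of a1b2, a1b3, b2, and b3.
Therefore a1 can be recovered in two ways: a1 = (a1b2) / b2 and a1 = (a1b3) / b3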
Here we say that the system of equations is
over-identified
In this example, the system is triangular
because there are two equations and one
endogenous right-hand side variable

13
5.3 When the endogenous right hand side variable
is continuous

When the models have a triangular structure, the
models can be estimated using the ivregress
command
The models can be estimated using 2SLS or LIML or
GMM
2SLS is more commonly used in practice

14
5.3.1 Estimating triangular models using 2SLS
(ivregress)

Go to MySite
Open up the housing.dta file which provides data
from 50 U.S. states (1980 Census)
use "J\phd\housing.dta", clear
pct_population_urban = the percentage of the population that lives in urban areas
family_income = median annual family income
housing_value = median value of private housing
rent = median monthly housing rental payments
region1 - region4 = dummy variables for the four regions in the U.S.

Suppose we want to estimate the following
$\text{rent} = a_0 + a_1\,\text{pct\_population\_urban} + a_2\,\text{housing\_value} + u$
$\text{housing\_value} = b_0 + b_1\,\text{family\_income} + b_2\,\text{region2} + b_3\,\text{region3} + b_4\,\text{region4} + v$
This is a triangular system because there are two
equations and one endogenous right hand side
variable (housing_value)
If u and v are correlated, the OLS estimate of a2
will be biased in the rent model

If we ignore the endogeneity problem and estimate
the rent model using simple OLS
reg rent housing_value pct_population_urban
To take account of the potential endogeneity
problem we use the ivregress command
ivregress estimator depvar1 varlist1 (depvar2 = varlistiv)
estimator is either 2sls or liml or gmm
depvar1 is the dependent variable for the model
which has an endogenous regressor
varlist1 are the exogenous variables in the model
that has the endogenous regressor
depvar2 is the endogenous regressor
varlistiv are the exogenous variables that are
believed to affect the endogenous regressor

The models that we want to estimate are
$\text{rent} = a_0 + a_1\,\text{pct\_population\_urban} + a_2\,\text{housing\_value} + u$
$\text{housing\_value} = b_0 + b_1\,\text{family\_income} + b_2\,\text{region2} + b_3\,\text{region3} + b_4\,\text{region4} + v$
The rent model has an endogenous regressor
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
The housing_value model can be estimated using OLS as there are no endogenous regressors
reg housing_value family_income region2 region3 region4

We should test whether
our chosen instruments are exogenous (i.e., they
should be uncorrelated with the error term) and
it is valid to exclude some of them from the
model that has the endogenous regressor.
If they are not exogenous or they should not be
excluded, they are not valid instruments.

The tests for instrument validity are also known
as tests of over-identifying restrictions
because the tests can only be performed if the
model with the endogenous regressor is
overidentified
the tests assume that at least one of the chosen
instruments is valid (unfortunately this
assumption cannot be tested)
In our example, the instrumented housing_value
variable is overidentified because four of the
exogenous variables (family_income region2
region3 region4) are excluded from the rent
model.
If we had excluded only one of these variables,
the instrumented housing_value variable would
have been just identified in which case it
would not be possible to test for instrument
validity.
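
How the over-identification test behaves can be sketched with simulated data in which one of the two excluded instruments is deliberately invalid. Everything in this block (variable names z1, z2, c, u, v, y1, y2 and all parameter values) is a hypothetical illustration, not the housing data used in the slides.

```stata
* Hypothetical data: z1 is a valid instrument, z2 is not
clear
set seed 2024
set obs 5000
generate z1 = rnormal()
generate z2 = rnormal()
generate c = rnormal()
generate u = c + rnormal()
generate v = c + rnormal()
generate y2 = z1 + z2 + v

* z2 has a direct effect on y1, so excluding it from the y1 model is invalid
generate y1 = 1 + y2 + 0.5*z2 + u

* Over-identified: two excluded instruments for one endogenous regressor
ivregress 2sls y1 (y2 = z1 z2)
estat overid

* Just identified: one excluded instrument, so no test can be computed
ivregress 2sls y1 (y2 = z1)
capture noisily estat overid
```

With a direct effect this large and 5,000 observations, the first test would be expected to reject the over-identifying restrictions. The capture noisily prefix lets a do-file continue after Stata reports that there are no over-identifying restrictions to test in the just-identified model.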

We obtain the tests for instrument validity by
typing estat overid after we run ivregress
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
These tests are statistically significant, which means we reject the null hypothesis that the chosen instruments are valid.

This is not surprising because we did not have
good reason to assume that they are exogenous and
validly excluded from the rent model.
For example
family_income is endogenous if family incomes
depend on housing values and rents
Why would this be true?
rents may be different across the four regions,
so the region dummies should not be excluded from
the rent model

We obtain different statistics for the tests of
instrument validity if the models are estimated
using LIML or GMM
However, the conclusions are the same as in our
previous example
ivregress liml rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid
ivregress gmm rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat overid

Note that we cannot test for instrument validity
when the endogenous regressor is just identified
This is because the test statistics are obtained
under the assumption that at least one of the
instruments is valid
For example
ivregress 2sls rent pct_population_urban (housing_value = family_income)
estat overid
ivregress liml rent pct_population_urban (housing_value = family_income)
estat overid
ivregress gmm rent pct_population_urban (housing_value = family_income)
estat overid

We can also test whether the coefficient of the
endogenous regressor is biased under OLS.
We obtain two Hausman tests for endogeneity bias
by typing estat endogenous after we run ivregress
ivregress 2sls rent pct_population_urban (housing_value = family_income region2 region3 region4)
estat endogenous
(The Durbin statistic uses an estimate of the error term's variance assuming that the variable being tested is exogenous whereas the Wu-Hausman statistic assumes that the variable being tested is endogenous)
Given these results, we may be tempted to reject
the hypothesis that housing_value is exogenous
However, the Hausman tests for endogeneity bias
are only reliable if the chosen instruments are
valid. In our example they are not, and so we
cannot draw conclusions about the potential for
endogeneity bias.
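
For contrast, the sketch below runs the same sequence of tests on simulated data in which both instruments are valid by construction, so the endogeneity tests can be interpreted. The data and variable names are hypothetical. The estat firststage command is not used in the slides; it is an additional ivregress postestimation command that reports how strongly the instruments predict the endogenous regressor.

```stata
* Hypothetical data with two valid instruments and an endogenous y2
clear
set seed 98765
set obs 5000
generate z1 = rnormal()
generate z2 = rnormal()
generate c = rnormal()
generate u = c + rnormal()
generate v = c + rnormal()
generate y2 = z1 + z2 + v
generate y1 = 1 + y2 + u

ivregress 2sls y1 (y2 = z1 z2)

* Instrument validity: should usually not reject in this design
estat overid

* Endogeneity of y2: should reject, because u and v are correlated
estat endogenous

* Instrument strength in the first stage
estat firststage
```

Because any single test rejects a true null at roughly its nominal significance level, an individual run can still produce a rejection by chance; the sketch is meant to show the order of the checks rather than a guaranteed outcome.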

25
Class exercise 5a

Using the fees.dta file, estimate the following
models for audit fees and company size
$\text{lnaf} = a_0 + a_1\,\text{lnta} + a_2\,\text{big6} + u$
$\text{lnta} = b_0 + b_1\,\text{ln\_age} + b_2\,\text{listed} + v$
where lnaf is the log of audit fees, lnta is the log of total assets, ln_age is the log of the company's age in years, and listed is a dummy variable indicating whether the company's shares are publicly traded on a market.
Is the instrumented lnta variable
over-identified, just-identified, or
under-identified? Explain.
Estimate the audit fee model using 2SLS.
Test the validity of the chosen instrumental
variables.
Test whether the lnta variable is affected by
endogeneity bias.
Verify that the test for instrument validity is
not available if you change the model so that it
is just-identified.

The key to estimating IV models is to find one or more exogenous variables that explain the endogenous regressor and that can be safely excluded from the main equation.
Unfortunately, most accounting studies that use
IV regression do not attempt to justify why their
chosen instruments are exogenous or why they can
be excluded from the structural model.
As a result, Larcker and Rusticus (2010)
criticize the way in which accounting studies
have applied IV regression
A key problem is that the IV results can be very sensitive to the researcher's choice of which variables to exclude from the structural model and, in many studies, these variables have been chosen in a very arbitrary way

30

Larcker and Rusticus (2010) recommend that
researchers justify their chosen instruments
using theory or economic intuition
the estat overid test should not be used to
select instruments on purely statistical grounds
particularly as the test is invalid if all of the
chosen instruments are also invalid
When testing instrument validity (estat overid)
and endogeneity bias (estat endog), it is also
important to consider your sample size
in large samples, the tests may reject a null
hypothesis that is nearly true.
in small samples, the tests may fail to reject a
null hypothesis that is very false.

31
5.3.2 Estimating simultaneous equations using
3SLS (reg3)

So far we have been examining a triangular
system. For example, Y2it affects Y1it but Y1it
does not affect Y2it
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$
$Y_{2it} = b_0 + b_2 X_{it} + b_3 Z_{1it} + v_{it}$
In a simultaneous system, both dependent
variables affect each other
$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$
$Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}$

$Y_{1it} = a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}$ (1)
$Y_{2it} = b_0 + b_1 Y_{1it} + b_2 X_{it} + b_3 Z_{1it} + v_{it}$ (2)
In this case, the OLS estimates are biased
because
Eq. (1) shows that uit affects Y1it while eq. (2)
shows that Y1it affects Y2it. As a result, it
must be true that uit is correlated with Y2it in
eq. (1). Therefore, the OLS estimate of a1 would
be biased in eq. (1).
Eq. (2) shows that vit affects Y2it while eq. (1)
shows that Y2it affects Y1it. As a result, it
must be true that vit is correlated with Y1it in
eq. (2). Therefore, the OLS estimate of b1 would
be biased in eq. (2).
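
The verbal argument above can be checked algebraically. The derivation below is a sketch under two assumptions that the slides do not state explicitly: that a1b1 is not equal to 1, and that Xit, Z1it and Z2it are uncorrelated with uit. It substitutes the Y1it equation into the Y2it equation and solves for Y2it.

```latex
\begin{align*}
Y_{2it}
  &= b_0 + b_1\left(a_0 + a_1 Y_{2it} + a_2 X_{it} + a_3 Z_{2it} + u_{it}\right)
     + b_2 X_{it} + b_3 Z_{1it} + v_{it} \\
(1 - a_1 b_1)\,Y_{2it}
  &= (b_0 + a_0 b_1) + (a_2 b_1 + b_2) X_{it} + a_3 b_1 Z_{2it} + b_3 Z_{1it}
     + b_1 u_{it} + v_{it} \\
\operatorname{Cov}(Y_{2it}, u_{it})
  &= \frac{b_1 \operatorname{Var}(u_{it}) + \operatorname{Cov}(u_{it}, v_{it})}{1 - a_1 b_1}
\end{align*}
```

The last line shows that Y2it is correlated with uit whenever b1 is non-zero, even if uit and vit are themselves uncorrelated. The same steps applied to Y1it give the corresponding result for vit in the second equation.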

For example, it seems reasonable to argue that
housing values depend on rents as well as rents
depending on housing values
$\text{rent} = a_0 + a_1\,\text{housing\_value} + a_2\,\text{pct\_population\_urban} + u$
$\text{housing\_value} = b_0 + b_1\,\text{rent} + b_2\,\text{family\_income} + b_3\,\text{region2} + b_4\,\text{region3} + b_5\,\text{region4} + v$
Note that for identification, each equation must contain at least one exogenous variable that is not included in the other equation. These are
pct_population_urban in the rent model
family_income and region2 - region4 in the housing_value model

We estimate this kind of model using the reg3
command
reg3 (depvar1 varlist1) (depvar2 varlist2)
use "J\phd\housing.dta", clear
reg3 (rent housing_value pct_population_urban) (housing_value rent family_income region2 region3 region4)
Unfortunately, the overid and endog commands are
not currently available with reg3
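
A simultaneous system can also be simulated to see reg3 at work. In the sketch below, all names (x, z1, z2, u, v, y1, y2) and coefficient values are assumptions made for illustration. Because each dependent variable depends on the other, the data are generated from the solved (reduced-form) expression for y2.

```stata
* Hypothetical simultaneous system:
*   y1 = 0.5*y2 + x + z2 + u
*   y2 = 0.4*y1 + x + z1 + v
clear
set seed 777
set obs 5000
generate x = rnormal()
generate z1 = rnormal()
generate z2 = rnormal()
generate u = rnormal()
generate v = rnormal()

* Substitute the y1 equation into the y2 equation and solve for y2
generate y2 = (0.4*(x + z2 + u) + x + z1 + v) / (1 - 0.5*0.4)
generate y1 = 0.5*y2 + x + z2 + u

* Equation-by-equation OLS is biased by the simultaneity
regress y1 y2 x z2
regress y2 y1 x z1

* 3SLS: z1 is excluded from the y1 equation, z2 from the y2 equation
reg3 (y1 y2 x z2) (y2 y1 x z1)
```

By default reg3 treats the dependent variables of the listed equations as endogenous and all other variables in the system as exogenous, so the exclusion of z1 and z2 from one equation each is what identifies the two coefficients on the endogenous regressors.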

35
5.4 When the endogenous right hand side variable
is binary

So far we have been dealing with the case where
the endogenous regressor is continuous.
We may want to estimate a model in which the
endogenous regressor is binary.
This brings us to a special class of models which are known as self-selection or Heckman models. Selectivity = endogeneity where the endogenous regressor is binary
The basic idea is similar to the instrumental
variable techniques that we have already
discussed.

Examples of endogenous binary variables in
accounting
Companies decide whether to use hedge contracts
(Barton, 2001; Pincus and Rajgopal, 2002).
Companies decide whether to grant stock options
(Core and Guay, 1999).
Companies decide whether to hire Big 5 or non-Big
5 auditors (e.g., Chaney et al., 2004).
Governments decide whether to fully or partially
privatize (Guedhami and Pittman, 2006).
Companies decide whether to follow international
financial reporting strategy (Leuz and
Verrecchia, 2000).
Companies decide whether to recognize financial
instruments at fair value or disclose (Ahmed et
al., 2006).
Companies decide whether or not to go private
(Engel et al., 2002).

37
Selection model

Concerns about selectivity arise when the RHS
dummy variable (D) is endogenous
Endogeneity results in bias if E(u | D) ≠ 0.
If u and v (the error terms of the Y and D models) are correlated, then E(u | D) ≠ 0, in which case the OLS estimate of the effect of D on Y would be biased.

38
Selection model

The intuition underlying Heckman is to estimate and then control for E(u | D). First model the choice of D.
Z is a vector of exogenous variables that affect D but have no direct effect on Y.

39
Selection model
(Diagram: Z affects D, and D affects Y; Z has no direct effect on Y.)
40
Selection model

Estimate E(u | D) and include it as a control variable on the RHS of the Y model
E(u | D) = ρσ × IMR, where ρ captures the correlation between u and v, σ is the standard deviation of u, and IMR is the inverse Mills ratio
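
For reference, the usual textbook form of this expectation is sketched below. It rests on assumptions that the transcript does not spell out: the choice model is a probit in which D = 1 when Zγ + v > 0, v is standard normal, and u and v are jointly normal with correlation ρ. The symbols γ, φ and Φ (the probit coefficients, the standard normal density and the standard normal distribution function) are introduced here for illustration.

```latex
\begin{align*}
E(u \mid D = 1, Z) &= \rho\sigma \, \frac{\phi(Z\gamma)}{\Phi(Z\gamma)} \\
E(u \mid D = 0, Z) &= -\rho\sigma \, \frac{\phi(Z\gamma)}{1 - \Phi(Z\gamma)}
\end{align*}
```

The ratios multiplying ρσ are the inverse Mills ratios for the two groups. Both conditional expectations are zero when ρ = 0, which is why a test of the IMR coefficient serves as a test for selectivity.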

41
Selection model

The MILLS variable is added as a control for
selectivity in the Y model
The OLS estimate of the effect of D on Y is now unbiased because E(e | D) = 0.
The D and Y models can be estimated in two steps or estimated jointly using maximum likelihood (ML)
ML yields separate estimates of ρ and σ.
The two-step yields an estimate of ρσ.
Under the null of no selectivity bias, ρ = 0 and ρσ = 0.

42
Class exercise 5b

We are going to look at a fictional dataset on
2,000 women.
use "J\phd\heckman.dta", clear
sum age education married children wage
Suppose we believe that older and more highly
educated women earn higher wages. Why would it be
wrong to estimate the following model?
reg wage age education
Estimate a probit model to test whether women are
more likely to be employed if they are married,
have children, are older and more highly educated.

43
5.4 When the endogenous right hand side variable
is binary (heckman)

It is easy to estimate the two-step Heckman model in Stata
heckman depvar1 varlist1, select(depvar2 = varlist2) twostep
where depvar1 is the dependent variable in the main equation, depvar2 is the dependent variable in the selection model, and varlist2 contains the explanatory variables of the selection model
Going back to our dataset on female wages
heckman wage education age, select(emp = married children education age) twostep
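
To see what the twostep option does, the two steps can be reproduced by hand on simulated data. The block below is a sketch on hypothetical data (the names x, z, u, v, d, y, zg and imr and all parameter values are assumptions), not the heckman.dta file used in class.

```stata
* Hypothetical selection model: y is observed only when d = 1
clear
set seed 4321
set obs 5000
generate x = rnormal()
generate z = rnormal()
generate v = rnormal()

* u is correlated with v, so selection is not random
generate u = 0.5*v + rnormal()
generate d = (0.5*x + z + v > 0)
generate y = 1 + x + u if d == 1

* Built-in two-step estimator; z is the exclusion restriction
heckman y x, select(d = x z) twostep

* Step 1 by hand: probit for selection, then the inverse Mills ratio
probit d x z
predict zg, xb
generate imr = normalden(zg) / normal(zg)

* Step 2 by hand: outcome regression on the selected sample with the IMR
regress y x imr if d == 1
```

The coefficients from the manual second step should match the heckman output, with the imr coefficient corresponding to the reported lambda; in this design its true value is ρσ = 0.5. The manual standard errors ignore the fact that imr is itself estimated, so the heckman standard errors are the ones to report.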

45

The 657 censored observations are the women who
are not in employment.
The Wald chi2 tests the overall significance of
the model.

Women's wages are higher if they are older and more highly educated

The probit model of employment is exactly the
same as what we had before
Women are more likely to be in employment if they
are married, have children, are more highly
educated or older.

The lambda variable is simply the IMR that was estimated from the emp model
The IMR coefficient is 4.00 and statistically significant
There is statistically significant evidence of a selection effect.

The IMR coefficient is the product of rho and sigma (ρσ)
Thus, 4.00 ≈ 0.67 × 5.95

47
Class exercise 5c

Estimate the following audit fee models
separately for Big 6 and Non-Big 6 audit clients
$\text{lnaf} = a_0 + a_1\,\text{lnta} + u$ (1)
$\text{lnaf} = a_0 + a_1\,\text{lnsales} + u$ (2)
where lnaf = log of audit fees, lnta = log of total assets, lnsales = log of sales
Use the heckman command to control for endogeneity with respect to the company's selected auditor. Your auditor choice models are as follows
$\text{big6} = b_0 + b_1\,\text{lnsales} + b_2\,\text{lnta} + v$
$\text{nbig6} = c_0 + c_1\,\text{lnsales} + c_2\,\text{lnta} + w$
where big6 = 1 (big6 = 0) if the company chooses a Big 6 (Non-Big 6) auditor and nbig6 = 1 (nbig6 = 0) if the company chooses a Non-Big 6 (Big 6) auditor.

48
Class exercise 5c

What exclusion restrictions are you imposing in
equations (1) and (2)?
Is there statistically significant evidence of
selectivity?
For the two different specifications of the audit
fee model
what are the signs of the MILLS coefficients?
what are the signs of rho?

49
Treatment effects model

In exercise 5c, we estimated the audit fee models
separately for the Big 6 and non-Big 6 audit
clients
To do this, we use the heckman command
Suppose that we want to estimate one audit fee
model with Big 6 on the right hand side of the
equation (i.e., we assume that the X coefficients
have the same slope in the two equations)

50
Treatment effects model

We can estimate this model using the treatreg
command
treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
treatreg lnaf lnsales, treat(big6 = lnta lnsales) twostep
If we don't specify the twostep option we will get the ML estimates
Sometimes the ML model will not converge due to a nonconcave likelihood function
treatreg lnaf lnta, treat(big6 = lnta lnsales)

51
Treatment effects model

The results for both the treatment effects and
Heckman models can be very sensitive to the model
specification.
For example, the Big 6 fee premium can easily
flip signs from positive to negative
treatreg lnaf lnta, treat(big6 = lnta lnsales) twostep
treatreg lnaf lnta lnsales, treat(big6 = lnta lnsales) twostep
Note that there are no exclusion restrictions (Z
variables) in the second specification since lnta
and lnsales appear in both the first stage and
second stage models
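
The role of the exclusion restriction can be explored with simulated data. The sketch below assumes Stata 13 or later, where the treatreg command is available under the name etregress; the data, the variable names (x, z, u, v, d, y) and the true treatment effect of 2 are all hypothetical.

```stata
* Hypothetical treatment effects model with a true effect of d equal to 2
clear
set seed 2468
set obs 5000
generate x = rnormal()
generate z = rnormal()
generate v = rnormal()
generate u = 0.5*v + rnormal()
generate d = (0.5*x + z + v > 0)
generate y = 1 + x + 2*d + u

* OLS: d is correlated with u through v, so its coefficient is biased
regress y x d

* Treatment effects model with z as a valid exclusion restriction
etregress y x, treat(d = x z) twostep

* No exclusion restriction: z also enters the outcome model, so the
* treatment effect is identified only through functional form
etregress y x z, treat(d = x z) twostep
```

Comparing the coefficient on d and its standard error across the last two commands gives a sense of how much of the identification comes from the excluded variable z rather than from the assumed non-linearity.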

52
Exclusion restrictions

Lennox, Francis and Wang (2012) argue that many accounting studies have estimated the Heckman and treatment effects models incorrectly
It is well recognized (in economics) that exogenous Z variables from the first stage choice model need to be validly excluded from the second stage outcome regression (Little, 1985; Little and Rubin, 1987; Manning et al., 1987).
Accounting studies have generally failed to (a)
impose exclusion restrictions, or (b) provide
compelling grounds for the validity of the
exclusion restrictions.

53
Exclusion restrictions

Economists recognize that it is important to
justify why the Zs can be validly excluded from
the Y model.
For example, Angrist (1990) examines how military
service affects the earnings of veteran soldiers
after they are discharged from the army.
This involves a selection issue because
individuals join the military if they have poor
wage offers in other types of job.
Angrist (1990) tackles the selectivity issue
using data from the Vietnam era, when military
service was partly determined by a draft lottery.

54
Exclusion restrictions
D = military service
Z = random lottery
Y = civilian earnings
55
Exclusion restrictions

Angrist and Evans (1998) test whether child
bearing reduces female participation in the labor
market
Selectivity is an issue because women are more
likely to have children rather than enter the
labor market if their wage offers would be low
(i.e., lower opportunity cost).
Use the sex composition of the first two children as an instrument for the decision to have a third child.

56
Angrist and Evans (1998): Exclusion restriction
D = decision to have a third child
Z = sex composition of first two children
Y = female participation in labor market
57
Exclusion restrictions

In accounting, many studies fail to justify why Z
has no direct impact on Y.
Many studies do not report results for the D
model, so the reader cannot evaluate the power of
the Z variables for identifying selectivity.
Some studies estimate models in which there are
no nominated Z variables.

58
Exclusion restrictions

When there are no exclusion restrictions, identification of the MILLS coefficients relies on the assumed non-linearity of the inverse Mills ratio
MILLS will capture any misspecification of the
functional relation between X and Y (e.g.,
non-linearity) in addition to any selectivity
bias.

59
Exclusion restrictions

Little (1985): Relying on nonlinearities to identify selectivity bias is unappealing because it is very difficult to distinguish empirically between selectivity and misspecification of the model's functional form.
Stata manual: "Theoretically, one does not need such identifying variables, but without them, one is depending on functional form to identify the model. It would be difficult to take such results seriously since the functional-form assumptions have no firm basis in theory."
A failure to nominate any Z variables can worsen the problems of multicollinearity (Manning et al., 1987; Puhani, 2000; Leung and Yu, 2000).

60
Example: Chaney, Jeter and Shivakumar (2004)
D = BIG5 (company hires a Big 5 or non-Big 5 auditor)
Y = audit fees
Z = null set
61
Example: Leuz and Verrecchia (2000)
D = IR97 (international reporting)
Z = ROA, capital intensity, UK/US listing
Y = cost of capital
63
Leuz and Verrecchia (2000)

Is it valid to assume that ROA, Capital
intensity, and UK/US listing have no direct
effect on the cost of capital?
Are these Z variables really exogenous?

65
Leuz and Verrecchia (2000)

Are the tests for selectivity bias powerful?
Are the results sensitive to functional form?
(see the free float variable).
LV do not report results using OLS
LV do not report whether their results are
sensitive to alternative model specifications.

66
Going forward

Researchers need to be aware that Heckman and
treatment effects models can provide results that
are extremely fragile. Sensitivity primarily
affects the RHS variable that is assumed to be
endogenous (D) and the IMRs.
Studies need to discuss
why the Zs are exogenous
why the Zs have no direct effect on Y
whether the Zs are powerful predictors of D
The signs and significance of the IMRs alone do
not provide compelling evidence as to the
direction or existence of selectivity bias.
Selection studies should routinely report tests
for multicollinearity problems.

67
Summary

When the endogenous regressor is continuous, you
can control for endogeneity using the ivregress
or reg3 commands.
When the endogenous regressor is binary, you can
control for endogeneity using the heckman or
treatreg commands.
If you want to control for endogeneity, it is
vitally important that you have a good
justification for your chosen exclusion
restrictions.
Choosing arbitrary exclusion restrictions will
probably give you garbage results.
