Title: Advanced regression analysis in Stata
1 Advanced regression analysis in Stata
- ASSR Short Intensive Course
- Herman van de Werfhorst
- April 2009
- Day 1: introduction to Stata and multivariate modelling
2 Programme
- Day 1
- Introduction to Stata
- Basics: the linear model (OLS)
- Regression diagnostics
- Logistic regression (logit)
- Day 2
- Mixed models: fixed and random effects
- Applications of mixed models in Stata
3 Introduction to Stata
- Data files: .dta
- Do files: command logs
- Ado files: additional programmes developed by the research community
- Log files: saved output
- Automatically generating publishable tables
4 Hands on Stata
- Creating a .do file
- Starting an output log file
- Missing values
- Scale construction
- Factor analysis
- Reliability analysis
5 Scale construction
Reliability coefficient: Cronbach's alpha
N = number of items, r = average correlation between items
Low reliability attenuates (reduces) the correlations of the scale with other variables.
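A minimal sketch, assuming the slide refers to the standardized form of Cronbach's alpha, alpha = N*r / (1 + (N-1)*r), which uses exactly the two quantities N and r defined above. The function name and the example values are hypothetical.

```python
def standardized_alpha(n_items, mean_r):
    """Standardized Cronbach's alpha from the number of items (N)
    and the average inter-item correlation (r)."""
    return n_items * mean_r / (1 + (n_items - 1) * mean_r)


# Hypothetical example: the same average correlation with more items
for n in (3, 5, 10):
    print(n, round(standardized_alpha(n, 0.3), 3))
```

With the average correlation held fixed, alpha rises as items are added to the scale.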
6 Scale construction 1: the Likert scale
- 1. Recode variables/items in the right direction. High scores should indicate a high position on the underlying dimension.
- 2. Take the average score on all selected items.
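The two steps can be sketched as follows. This is an illustration only: the function names, the 1 to 5 answer range, and the choice to skip missing answers (None) when averaging are assumptions, not part of the course material.

```python
def reverse_code(score, low=1, high=5):
    """Step 1: flip an item so that a high score means a high position."""
    return low + high - score


def likert_scale(scores, reversed_items=(), low=1, high=5):
    """Step 2: average one respondent's item scores after recoding.
    Missing answers (None) are skipped; returns None if all are missing."""
    recoded = []
    for i, s in enumerate(scores):
        if s is None:
            continue
        recoded.append(reverse_code(s, low, high) if i in reversed_items else s)
    return sum(recoded) / len(recoded) if recoded else None


# Hypothetical respondent: the second item (index 1) is worded negatively
print(likert_scale([5, 1, 4], reversed_items={1}))
```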
7 Example: ESS 2002 immigration attitudes
8 Scale construction 2: factor analysis
[Diagram: Item1 to Item5 as indicators of Factor 1 and Factor 2]
Some handy info:
- Kim & Mueller, factor analysis (Sage Publications, somewhere in the 1970s)
- http://www.siu.edu/epse1/pohlmann/factglos/
9 OLS regression
- Minimize the sum of squared errors
- Y = a + bX + e
10 Coefficient of determination
R² = proportion of variance in Y explained by the X variables
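A minimal sketch of least squares with a single X variable, using the textbook closed-form solution for a and b and R² = 1 - SSE/SST. The data and the function name are hypothetical.

```python
def ols_simple(x, y):
    """Least-squares intercept a, slope b and R-squared for Y = a + bX + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # Sum of squared errors around the fitted line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    # Total sum of squares around the mean of Y
    sst = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - sse / sst


x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r2 = ols_simple(x, y)
print(a, b, r2)
```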
11 Diagnostics: multicollinearity
- Bivariate correlations: > 0.7 is suspicious
- Regression of every X variable on the other X variables (X1 = a + b1X2 + b2X3 etc.)
- R-sq larger than 0.6 is critical.
- Tolerance = 1 - R-sq; less than 0.4 is critical.
- 1/Tolerance = variance inflation factor (VIF); larger than 2.5 is critical (some say larger than 6!)
- VIF: how much the variance of a coefficient is inflated (and its standard error by the square root of the VIF) compared to when the X variables are not correlated
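A short sketch of how the three thresholds line up. It takes the R-sq of the auxiliary regression (one X on the other X variables) as given; the function names are hypothetical.

```python
def tolerance(r_squared):
    """Tolerance = 1 - R-sq of the auxiliary regression."""
    return 1 - r_squared


def vif(r_squared):
    """Variance inflation factor = 1 / tolerance."""
    return 1 / tolerance(r_squared)


for r2 in (0.0, 0.6, 0.9):
    # The square root of the VIF is the inflation of the standard error
    print(r2, tolerance(r2), vif(r2), vif(r2) ** 0.5)
```

An R-sq of 0.6, a tolerance of 0.4 and a VIF of 2.5 are the same cut-off expressed three ways.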
12 Examine residuals
- Studentized residual (standardized, accounting for possibly different variances according to X)
- Rule of thumb: studentized residuals larger than 3.61 (in absolute value) are outliers
- Removal leads to lower standard errors of estimate and a higher R-sq. Non-removal makes tests more conservative.
13 Studentized residuals
14 Outliers: influential cases
- Leverage: how far the individual's X value differs from the mean X. The larger the value, the stronger the impact on determining the y-hat values.
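A minimal sketch of leverage for a regression with one X variable, using the standard hat-value formula h_i = 1/n + (x_i - mean)² / Σ(x_j - mean)². The data and the function name are hypothetical.

```python
def leverage(x):
    """Hat value of each case in a regression with an intercept and one X."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]


# Hypothetical data: the last case lies far from the mean of X
x = [1, 2, 3, 4, 20]
h = leverage(x)
for xi, hi in zip(x, h):
    print(xi, round(hi, 3))
```

The hat values sum to the number of estimated parameters (here 2: intercept and slope), so a case with a value close to 1 dominates its own prediction.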
15 Outliers (cont'd)
- DFBETA(i): change in the estimate of β when deleting individual i (one for each β for each i)
- DFFIT(i): effect on the fit of deleting individual i (one for each i)
16
- Cook's distance: impact of i on all parameter estimates jointly.
- Influence of i is a function of the residual (on Y) and of the place in the distribution of X
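The deletion idea behind these measures can be sketched by refitting the model without each case in turn. This computes the raw change in the slope, b - b(-i), for a regression with one X; it is not the standardized dfbeta that Stata reports, which additionally divides by a standard error. Data and function names are hypothetical.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def raw_dfbeta(x, y):
    """Change in the slope when each case is deleted in turn: b - b(-i)."""
    b_all = slope(x, y)
    return [
        b_all - slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]


# Hypothetical data: the last case has an extreme X and a large residual
x = [1, 2, 3, 4, 10]
y = [1, 2, 3, 4, 2]
d = raw_dfbeta(x, y)
for i, di in enumerate(d):
    print(i, round(di, 3))
```

The last case combines a large residual with an extreme X value, so deleting it changes the slope far more than deleting any other case.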
17 Dfbeta in Stata: standardized dfbetas; > 1 is suspicious
19 Logistic regression
20 Non-continuous outcomes
- Binary
- Ordinal
- Categorical
- OLS makes no assumptions about the distribution of the X variables
- It does, however, assume a continuous Y variable with a conditional normal distribution
- Binary outcomes: logit models (logistic regression models)
- Part of the Generalized Linear Model (GLM) framework
- If OLS regression is applied to binary outcomes, the predicted probability could be < 0 or > 1 (which is impossible)
21 The problem of the linear probability model
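A small illustration of the problem, fitting an ordinary least-squares line to a 0/1 outcome. The data are hypothetical and chosen only to make the point visible.

```python
def ols_line(x, y):
    """Least-squares intercept and slope for one X variable."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


# Hypothetical data: binary outcome that switches from 0 to 1 as X rises
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
a, b = ols_line(x, y)
predicted = [a + b * xi for xi in x]
for xi, pi in zip(x, predicted):
    print(xi, round(pi, 3))
```

The fitted "probabilities" for the lowest and highest X values fall below 0 and above 1.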
22 The dependent variable
- Outcome: 0 or 1
- (0 = no, 1 = yes)
- Probabilities (P) between 0 and 1.
- If 30% of respondents are voluntary members, the mean probability of membership is 0.3.
23 The Generalized Linear Model (GLM)
24 GLMs
- The function g(µ) is the link function
- Identity link (OLS)
- Logit link
- Log link
- Probit link
- Next to a link function, you also need to specify the probability distribution (normal, binomial, gamma, etc.).
- Thus GLMs let you choose the probability distribution instead of assuming it.
- Logistic regression: a binomial distribution with a logit link
- Advantage over probit: we interpret the results in terms of odds (p/(1-p)), and thus in odds ratios.
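For reference, the four link functions named above in their usual textbook form. The notation is an assumption: µ is taken to be the expected value of Y given the X variables, a and b follow the OLS notation used earlier, and Φ is the standard normal distribution function.

```latex
\begin{aligned}
g(\mu) &= a + b_1 X_1 + b_2 X_2 + \dots, \qquad \mu = E(Y \mid X) \\
\text{identity:}\quad g(\mu) &= \mu \\
\text{logit:}\quad g(\mu) &= \ln\frac{\mu}{1-\mu} \\
\text{log:}\quad g(\mu) &= \ln \mu \\
\text{probit:}\quad g(\mu) &= \Phi^{-1}(\mu)
\end{aligned}
```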
25 And the winner of the Best Statistic Ever Award is...
- The odds ratio
- (in my opinion)
- Advantage: margin-independent association measure for contingency tables.
- E.g. relative mobility versus absolute mobility
26 Logistic regression model
- Ratio p / (1-p): odds of yes versus no; range 0 to infinity
- The natural logarithm of this odds (log-odds or logit)
- 0 < odds < 1: log-odds < 0
- odds > 1: log-odds > 0
- odds = 1: log-odds = 0
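The three cases can be checked with a few lines. The function names and the example probabilities are hypothetical.

```python
import math


def odds(p):
    """Odds of yes versus no for a probability p (0 < p < 1)."""
    return p / (1 - p)


def logit(p):
    """Natural logarithm of the odds (log-odds)."""
    return math.log(odds(p))


for p in (0.2, 0.5, 0.8):
    print(p, round(odds(p), 3), round(logit(p), 3))
```

Probabilities below 0.5 give odds below 1 and a negative logit; probabilities above 0.5 give the mirror image.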
28 The impact of X-variables on the logit
- The logit is a linear function of the X variables.
- Logistic regression coefficient: b
- Antilog of b: e^b
- e^b is the odds ratio
29 Odds ratio
Odds ratio = (A / B) / (C / D)
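A sketch of the formula on a 2x2 table, which also shows the margin independence of the odds ratio. The layout is an assumption: A and B are the yes/no counts in the first group, C and D the yes/no counts in the second. The counts are hypothetical.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (A / B) / (C / D) for a 2x2 table of counts."""
    return (a / b) / (c / d)


# Hypothetical table: group 1 has 30 yes / 70 no, group 2 has 10 yes / 90 no
print(odds_ratio(30, 70, 10, 90))

# Ten times as many cases in group 1: the margins change, the odds ratio does not
print(odds_ratio(300, 700, 10, 90))
```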
30 Back to probabilities
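A minimal sketch of the step from the logit back to a probability, assuming the usual inverse of the logit, p = 1 / (1 + e^(-xb)), where xb is the linear prediction a + bX. The coefficients below are hypothetical, not estimates from the course data.

```python
import math


def inv_logit(xb):
    """Probability that corresponds to a linear prediction on the logit scale."""
    return 1 / (1 + math.exp(-xb))


a, b = -1.0, 0.5  # hypothetical intercept and slope
for x in (0, 2, 4):
    print(x, round(inv_logit(a + b * x), 3))
```

Whatever the value of the linear prediction, the result stays strictly between 0 and 1.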
31 Likelihood function
- Between 0 and 1
- Log likelihood: between -∞ and 0
- -2 LL: between 0 and ∞
- The difference in -2 LL between nested models is chi-square distributed (χ²)
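A sketch of the log likelihood for a binary outcome, LL = Σ [y ln(p) + (1 - y) ln(1 - p)]. The outcomes and the second set of predicted probabilities are hypothetical and are not the result of an actual model fit.

```python
import math


def log_likelihood(y, p):
    """Log likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(
        math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p)
    )


y = [1, 0, 1, 1]
p_null = [sum(y) / len(y)] * len(y)  # same probability for everyone (mean of y)
p_model = [0.9, 0.2, 0.7, 0.8]       # hypothetical predictions that use X

ll0 = log_likelihood(y, p_null)
ll1 = log_likelihood(y, p_model)
print(ll0, -2 * ll0)
print(ll1, -2 * ll1)
```

Better predictions move the log likelihood towards 0 and -2 LL towards 0 as well.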
33 An example in Stata: who votes?
34 Logit postestimation
- Predict
- P: probability that Y = 1
- Xb: linear prediction, ln(p/(1-p))
- Rs: standardized residual
35 Ordered logit model
- Similar to estimating separate binary logit models with equal slopes
- Advantage: total of probabilities = 1
- Problem: the proportional odds assumption
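A sketch of how the category probabilities follow from one linear prediction and a set of cutpoints, assuming the common parameterization P(Y <= j) = 1 / (1 + e^-(cut_j - xb)). The equal slopes are visible in the single xb shared by all categories. Cutpoints and xb are hypothetical values.

```python
import math


def inv_logit(z):
    return 1 / (1 + math.exp(-z))


def ordered_logit_probs(xb, cutpoints):
    """P(Y = j) for every category, given increasing cutpoints."""
    # Cumulative probabilities P(Y <= j); the last category closes at 1
    cumulative = [inv_logit(k - xb) for k in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


probs = ordered_logit_probs(0.4, [-1.0, 1.0])  # three ordered categories
print([round(p, 3) for p in probs], sum(probs))
```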
36 Multinomial logit model
- Similar to estimating separate binary logit models with unequal slopes
- Advantage: total of probabilities = 1
- Problem: many parameter estimates
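A sketch of the category probabilities in a multinomial logit, assuming the usual set-up with one base category whose linear prediction is fixed at 0 and a separate linear prediction (own intercept and slopes) for every other category. The values are hypothetical.

```python
import math


def multinomial_logit_probs(xbs):
    """Probabilities for the base category followed by the other categories.
    xbs holds the linear predictions of the non-base categories."""
    exponentials = [1.0] + [math.exp(v) for v in xbs]  # base: e^0 = 1
    total = sum(exponentials)
    return [e / total for e in exponentials]


probs = multinomial_logit_probs([0.5, -1.2])  # three categories
print([round(p, 3) for p in probs], sum(probs))
```

Each extra category adds a full set of coefficients, which is where the many parameter estimates come from.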