Title: Logistic regression
1Logistic regression
- V506 Class 14
- December 3, 2009
2Overview
- Categorical, binary dependent variable
- Logistic regression model
- Simple logistic regression
- Multiple logistic regression
- SPSS logistic regression
3Regression with a categorical, binary dependent
variable
- Dependent variable that takes on value of 0 or 1,
e.g., failure or success, no or yes
- Independent variables: interval, scale variables
(and possibly dummy variables)
4Regression model to predict probability of success
- The dependent variable itself is 0 or 1
- The regression model predicts the probability of
success, p, with values between 0 and 1
5Linear probability model
- Obvious possibility is to use traditional linear
regression model
- But this has problems
- Distribution of dependent variable hardly normal
- Predicted values can be less than 0 or greater
than 1, which is impossible for a probability
6Linear probability model predictions
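A minimal sketch of the out-of-range problem, assuming a small made-up data set (the values of `xs` and `ys` are hypothetical and are not the shuttle data). It fits an ordinary least-squares line to a 0/1 outcome and prints the fitted values, some of which fall outside the range a probability can take.

```python
# Hypothetical toy data (not the shuttle data): x is a predictor, y is 0 or 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(xs, ys)
predictions = [intercept + slope * x for x in xs]

# Flag fitted values that cannot be interpreted as probabilities.
for x, p in zip(xs, predictions):
    flag = "  <- outside [0, 1]" if p < 0 or p > 1 else ""
    print(f"x = {x}: predicted value = {p:.3f}{flag}")
```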
7Logistic regression model
- Instead, use logistic transformation (logit) of
probability, log of the odds
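Written out, the logit for a single independent variable x can be sketched as follows. The symbols B_0 (intercept) and B_1 (coefficient of x) are notation assumed here for illustration; the slides refer to the coefficient simply as B.

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = B_0 + B_1 x
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(B_0 + B_1 x)}}
```

The right-hand form shows why the predicted p always lies strictly between 0 and 1, whatever the value of x.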
8Logistic regression model predictions
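A small sketch of how logistic predictions behave, assuming hypothetical coefficients `b0` and `b1` chosen only to trace the S-shaped curve (they are not estimates from any data in these slides). The inverse of the logit turns the linear predictor into a probability.

```python
import math

def predicted_probability(b0, b1, x):
    """Inverse logit: convert the linear predictor b0 + b1*x to a probability."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients, used only to illustrate the shape of the curve.
b0, b1 = -4.0, 1.0
x_values = list(range(-10, 19))
probabilities = [predicted_probability(b0, b1, x) for x in x_values]

# Every prediction stays strictly between 0 and 1 and rises with x.
for x, p in zip(x_values[::4], probabilities[::4]):
    print(f"x = {x:3d}: p = {p:.4f}")
```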
9Estimation of logistic regression model
- Least-squares no longer best way of estimating
parameters of logistic regression model
- Instead use maximum likelihood estimation
- Finds values of parameters that make the observed
data most probable
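To make the idea concrete, here is a minimal sketch of maximum likelihood estimation for a one-variable logistic model. It uses Newton-Raphson iteration, which is one common way to maximize the log likelihood; the slides do not say which algorithm SPSS uses, so treat this as an illustration rather than a description of SPSS. The data in `xs` and `ys` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Log of the probability of the observed 0/1 outcomes given b0, b1."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit_logistic(xs, ys, iterations=25):
    """Newton-Raphson search for the b0, b1 that maximize the log likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0            # gradient of the log likelihood
        h00 = h01 = h11 = 0.0    # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data (not the shuttle data); outcomes overlap, so estimates are finite.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic(xs, ys)
print(f"b0 = {b0_hat:.3f}, b1 = {b1_hat:.3f}")
print(f"log likelihood at estimates: {log_likelihood(b0_hat, b1_hat, xs, ys):.3f}")
```

If the 0s and 1s were perfectly separated by x, the likelihood would keep increasing as b1 grows and this simple loop would not settle on finite estimates.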
10Logistic regression example
- Analysis of space shuttle data to see if that
might have provided evidence of risk in launching
Challenger on its last flight in 1986
- Based on
- Tufte, Visual Explanations, 1997
- Hamilton, Statistics with Stata, 2006
11Space shuttle data
- Data on 24 space shuttle launches prior to
Challenger
- Dependent variable: whether shuttle flight
experienced thermal distress incident
- Independent variables
- Date: whether shuttle changes or age has effect
- Temperature: whether joint temperature on
booster has effect
12First model: date as single independent variable
- Dependent variable
- Any thermal distress on launch
- Independent variable
- Date (days since 1/1/60)
- SPSS procedure
- Regression, Binary logistic
13Predicted probability of thermal distress using
date
14Statistical significance of model
- Chi-squared test calculated using -2 times log
likelihood
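A sketch of how such a test can be computed, assuming the model adds a single independent variable (1 degree of freedom) and using hypothetical log-likelihood values. The statistic is the drop in -2 log likelihood from the model without the variable to the model with it; for 1 degree of freedom the chi-squared tail probability can be written with the complementary error function.

```python
import math

def likelihood_ratio_test_1df(ll_null, ll_model):
    """Chi-squared statistic and p-value for one added parameter."""
    chi2 = (-2.0 * ll_null) - (-2.0 * ll_model)
    chi2 = max(chi2, 0.0)  # guard against tiny negative values from rounding
    # For 1 degree of freedom, P(chi-squared > c) = erfc(sqrt(c / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Hypothetical log likelihoods, chosen for illustration only.
chi2, p_value = likelihood_ratio_test_1df(ll_null=-15.0, ll_model=-13.0)
print(f"chi-squared = {chi2:.3f}, p = {p_value:.4f}")
```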
15Predictive power (fit) of model: pseudo R²s
- No exact equivalent of R² for linear regression
model
- Different estimates of pseudo R² range from 0
(no fit) to 1 (perfect prediction)
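As an illustration, three widely used pseudo R² definitions (McFadden, Cox and Snell, Nagelkerke) can be computed from the log likelihoods of the model with and without the independent variables. The slides do not say which measures are meant, so the choice here is an assumption, and the input values are hypothetical.

```python
import math

def pseudo_r2(ll_null, ll_model, n):
    """Return McFadden, Cox-Snell and Nagelkerke pseudo R-squared values.

    ll_null  -- log likelihood of the model with intercept only
    ll_model -- log likelihood of the fitted model
    n        -- number of cases
    """
    mcfadden = 1.0 - ll_model / ll_null
    cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
    # Cox-Snell cannot reach 1; Nagelkerke rescales it by its maximum value.
    max_cox_snell = 1.0 - math.exp((2.0 / n) * ll_null)
    nagelkerke = cox_snell / max_cox_snell
    return mcfadden, cox_snell, nagelkerke

# Hypothetical log likelihoods and sample size, for illustration only.
mcfadden, cox_snell, nagelkerke = pseudo_r2(ll_null=-15.0, ll_model=-13.0, n=23)
print(f"McFadden = {mcfadden:.3f}, Cox-Snell = {cox_snell:.3f}, "
      f"Nagelkerke = {nagelkerke:.3f}")
```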
16Predictive power (fit) of model: classification
improvement
- Classification tables: without the model (not using
date) and using the model with date
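The comparison behind a classification table can be sketched as follows, assuming a cutoff of 0.5 for predicting a 1 and hypothetical outcomes and predicted probabilities. Without a model, every case is assigned to the more common outcome; with a model, each case is assigned according to its predicted probability.

```python
def baseline_accuracy(ys):
    """Share classified correctly when every case gets the more common outcome."""
    ones = sum(ys)
    return max(ones, len(ys) - ones) / len(ys)

def model_accuracy(probs, ys, cutoff=0.5):
    """Share classified correctly when predicting 1 if probability >= cutoff."""
    correct = sum(1 for p, y in zip(probs, ys) if (1 if p >= cutoff else 0) == y)
    return correct / len(ys)

# Hypothetical outcomes and predicted probabilities, for illustration only.
ys = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]
probs = [0.10, 0.20, 0.30, 0.40, 0.60, 0.55, 0.70, 0.80, 0.35, 0.90]

print(f"without model: {baseline_accuracy(ys):.1%}")
print(f"with model:    {model_accuracy(probs, ys):.1%}")
```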
17Logistic regression coefficients and tests of
significance
- Probability positively related to date
- Logistic regression coefficient for date has
p-value of 0.051
- Wald test not as powerful as chi-squared test
based upon -2 Log Likelihood
18Interpreting logistic regression coefficient
- Regression coefficient for date, B = 0.002, is
the change in the logit of the probability
associated with a change of one day
- Unfortunately, this does not have a very
intuitive meaning
- Instead, look at exponential of B, exp(B), which
is the factor by which the odds are multiplied when
the independent variable increases by one unit
19Exponential of B as change in odds
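A short derivation of why exp(B) acts on the odds, assuming a model with intercept B_0 and a single independent variable x with coefficient B (B_0 is notation introduced here for illustration):

```latex
\text{odds}(x) = \frac{p}{1-p} = e^{B_0 + B x}
\qquad\Longrightarrow\qquad
\frac{\text{odds}(x+1)}{\text{odds}(x)}
  = \frac{e^{B_0 + B(x+1)}}{e^{B_0 + B x}}
  = e^{B}
```

So a one-unit increase in x multiplies the odds by exp(B), regardless of the starting value of x.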
20What does odds mean?
- Odds is the ratio of probability of success to
probability of failure
- Like odds on horse races
- Even odds, odds = 1, implies probability equals
0.5
- Odds = 2 means 2 to 1 in favor of success,
implies probability of 0.667
- Odds = 0.5 means 1 to 2 in favor (or 2 to 1
against) success, implies probability of 0.333
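The conversions between odds and probability used in these examples can be checked with a few lines of code. The function names are illustrative only.

```python
def odds_from_probability(p):
    """Odds = probability of success divided by probability of failure."""
    return p / (1.0 - p)

def probability_from_odds(odds):
    """Invert the odds: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

# The three cases from the slide: even odds, 2 to 1 in favor, 1 to 2 in favor.
for odds in (1.0, 2.0, 0.5):
    print(f"odds = {odds}: probability = {probability_from_odds(odds):.3f}")
```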
21Interpreting exponential of B as change in odds
- SPSS reports Exp(B) = 1.002 for date
- This means that the odds of a thermal incident
are 1.002 times as high for each day later in the
program that the shuttle is launched
- Doesn't sound like much, but 1.002^365 = 2.074, so
odds of thermal incident are twice as great for
each year later that the shuttle is launched
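The yearly figure comes from compounding the daily odds ratio over 365 days, which is the same as multiplying B by 365 before exponentiating. A quick check, using the rounded Exp(B) reported on the slide (variable names are illustrative):

```python
import math

daily_odds_ratio = 1.002                  # Exp(B) for date, as reported by SPSS
b_per_day = math.log(daily_odds_ratio)    # B, the change in the logit per day

# Two equivalent routes to the odds ratio for a 365-day change in date.
yearly_odds_ratio = daily_odds_ratio ** 365
yearly_from_b = math.exp(365 * b_per_day)

print(f"B per day         = {b_per_day:.4f}")
print(f"yearly odds ratio = {yearly_odds_ratio:.3f}")
```

Because 1.002 is itself a rounded value, the yearly figure is only approximate.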
22Multiple logistic regression
- Logistic regression can be extended to use
multiple independent variables exactly like
linear regression
23Adding joint temperature to the logistic
regression model
- Dependent variable
- Any thermal distress on launch
- Independent variables
- Date (days since 1/1/60)
- Joint temperature, degrees F
24Overall model results
25Classification improvement
- Classification accuracy increased from 73.9
percent to 78.3 percent over previous model
26Logistic regression coefficients
27Interpreting logistic regression coefficients
- date has p-value of 0.030 and similar Exp(B)
- Logistic regression coefficient for temp is
negative
- As temperature decreases, probability of thermal
distress increases
- p-value for temperature is 0.140
- Exp(B) for temperature is 0.841
- 20 degree decrease in temperature therefore
implies 0.841^(-20) = 31.92-fold increase in odds of
thermal distress
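The arithmetic for a 20 degree decrease can be written out as follows, holding date constant and taking Δx = -20 as the change in temperature (the symbol Δx is introduced here for illustration):

```latex
\frac{\text{odds}(x + \Delta x)}{\text{odds}(x)}
  = e^{B\,\Delta x}
  = \left(e^{B}\right)^{\Delta x}
  = 0.841^{-20}
  \approx 31.92
```

As with any odds ratio, this is a multiplicative change in the odds, not in the probability itself.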
28Issue of significance, error
- p-value of 0.140 for temperature would not
normally result in rejection of null hypothesis
of no relationship of temperature to probability
of failure, null hypothesis that B = 0 for joint
temperature
- But p-value controls only the risk of Type I error,
the error that arises when rejecting the null
hypothesis when it is true
29Type II error
- Type II error is the converse of Type I error
- Type II error is the error that arises in not
rejecting the null hypothesis when it is false
- In this case, Type II error is the error in
failing to conclude that there is a relationship
between joint temperature and thermal failures
when there is a relationship
30Type II error and Challenger
- In this context, one is more concerned with Type
II error, failing to conclude that there is a
relationship of joint failure to temperature when
that is true
- Because failure to conclude this led to
recommendation to launch Challenger under low
temperature conditions
31Logistic regression in SPSS
- Data preparation
- Dependent variable should have value of 1 or 0
- SPSS will recode categorical variable with 2
values, but better to do it yourself
- Independent variables are interval, scale
variables or dummy variables, as in linear
regression
32Logistic regression in SPSS
- Analyze, Regression, Binary Logistic
- Dependent, response variable entered into
Dependent box
- Independent variables (interval, scale variables
and dummy variables) entered into Covariates box
- Categorical button provides other options for
handling categorical variables, but we won't deal
with them here
33SPSS logistic regression output
34SPSS logistic regression output
Block 0: Beginning Block
Classification accuracy without independent
variables (no model)
35SPSS logistic regression output
Block 1: Method = Enter
p-value for significance of entire model
Pseudo-R² values giving idea of fit of the model
36SPSS logistic regression output
Classification accuracy for model. Can be compared
with classification accuracy without model or for
other models
37SPSS logistic regression output
p-values for hypothesis tests of whether
regression coefficients not equal to 0
Logistic regression coefficients: positive,
probability increases as variable increases;
negative, probability decreases as
variable increases
Exponential of regression coefficients, effects
on odds: greater than 1 means odds increase as
variable increases; less than 1 means odds
decrease as variable increases
38Predicting urban trail use in Indianapolis
- John R. Ottensmann and Greg Lindsey. A use-based
measure of accessibility to linear features to
predict urban trail use. Journal of Transport and
Land Use 1, 1 (2008) 41-63.
- Survey of residents of Marion County
- Questions on whether they used any of the
greenway trails in previous month
- Logistic regression to predict use, using
- Individual characteristics from survey
- Distance to nearest trail and more complex
measures of trail accessibility
39Logistic regressions for use of trails (odds
ratios; robust standard errors in parentheses)
40Measure of use-based accessibility to linear
features
- Sorry, I couldn't help myself!