Transcript and Presenter's Notes

Title: An Introduction to Logistic Regression


1
An Introduction to Logistic Regression
  • John Whitehead
  • Department of Economics
  • Appalachian State University

2
Outline
  • Introduction and Description
  • Some Potential Problems and Solutions
  • Writing Up the Results

3
Introduction and Description
  • Why use logistic regression?
  • Estimation by maximum likelihood
  • Interpreting coefficients
  • Hypothesis testing
  • Evaluating the performance of the model

4
Why use logistic regression?
  • There are many important research topics for
    which the dependent variable is "limited."
  • For example, voting, morbidity or mortality, and
    participation data are not continuous or
    normally distributed.
  • Binary logistic regression is a type of
    regression analysis where the dependent variable
    is a dummy variable coded 0 (did not vote) or
    1 (did vote).

5
The Linear Probability Model
  • In the OLS regression
  • Y = α + βX + e, where Y ∈ {0, 1}
  • The error terms are heteroskedastic
  • e is not normally distributed because Y takes on
    only two values
  • The predicted probabilities can be greater than 1
    or less than 0 (see the sketch below)
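
A minimal sketch of these problems, assuming simulated data and the
statsmodels library (the variables here are illustrative, not the
hurricane survey data):

  # Fit an OLS "linear probability model" to a binary outcome and show
  # that some fitted "probabilities" fall outside the [0, 1] interval.
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  x = rng.normal(size=500)
  p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))      # true process is logistic
  y = rng.binomial(1, p_true)                      # binary (0, 1) outcome

  ols = sm.OLS(y, sm.add_constant(x)).fit()
  fitted = ols.fittedvalues
  print("share of fitted values outside [0, 1]:",
        np.mean((fitted < 0) | (fitted > 1)))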

6
An Example: Hurricane Evacuations
Q: EVAC  Did you evacuate your home to go
someplace safer before Hurricane Dennis (Floyd)
hit?   1 = YES  2 = NO  3 = DON'T KNOW  4 = REFUSED
7
The Data
8
OLS Results
9
Problems
Predicted values outside the [0, 1] range
10
Heteroskedasticity
Park Test
11
The Logistic Regression Model
  • The "logit" model solves these problems:
    ln[p/(1-p)] = α + βX + e
  • p is the probability that the event Y occurs,
    p(Y = 1)
  • p/(1-p) is the "odds ratio"
  • ln[p/(1-p)] is the log odds ratio, or "logit"

12
  • More
  • The logistic distribution constrains the
    estimated probabilities to lie between 0 and 1.
  • The estimated probability is p = 1/[1 + exp(-α - βX)]
  • if you let α + βX = 0, then p = .50
  • as α + βX gets really big, p approaches 1
  • as α + βX gets really small, p approaches 0
    (a quick numerical check follows below)
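
A quick numerical check of these three statements, assuming plain NumPy
and illustrative values of α + βX:

  # The logistic function p = 1/[1 + exp(-(a + b*X))] always lies in (0, 1).
  import numpy as np

  for index in [-10.0, 0.0, 10.0]:                 # illustrative values of a + b*X
      p = 1 / (1 + np.exp(-index))
      print(f"a + b*X = {index:6.1f}  ->  p = {p:.4f}")
  # index = 0 gives p = .50; a large positive index pushes p toward 1,
  # and a large negative index pushes p toward 0.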

13
(No Transcript)
14
Comparing LP and Logit Models
(Figure: Y plotted against X for the LP model, with reference lines at
Y = 0 and Y = 1.)
15
Maximum Likelihood Estimation (MLE)
  • MLE is a statistical method for estimating the
    coefficients of a model.
  • The likelihood function (L) measures the
    probability of observing the particular set of
    dependent variable values (p1, p2, ..., pn) that
    occur in the sample: L = Prob(p1 · p2 · ... · pn)
  • The higher the L, the higher the probability of
    observing the ps in the sample.

16
  • MLE involves finding the coefficients (α, β) that
    make the log of the likelihood function (LL < 0)
    as large as possible
  • Or, finds the coefficients that make -2 times the
    log of the likelihood function (-2LL) as small as
    possible (see the sketch below)
  • The maximum likelihood estimates solve the
    following condition: Σ[Yi - p(Yi = 1)]Xi = 0,
    summed over all observations, i = 1, ..., n
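
A minimal sketch of this estimation idea on simulated data, assuming
NumPy and SciPy rather than the SPSS routine used in the presentation:

  # Maximize the binary-logit log likelihood LL (equivalently, minimize -2LL)
  # by numerical optimization.
  import numpy as np
  from scipy.optimize import minimize

  rng = np.random.default_rng(1)
  x = rng.normal(size=400)
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.3 + 1.5 * x))))

  def neg_two_LL(params):
      a, b = params
      p = 1 / (1 + np.exp(-(a + b * x)))
      p = np.clip(p, 1e-10, 1 - 1e-10)             # guard against log(0)
      LL = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))   # LL < 0
      return -2 * LL

  res = minimize(neg_two_LL, x0=[0.0, 0.0])
  print("MLE estimates (alpha, beta):", res.x, " -2LL:", res.fun)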

17
Interpreting Coefficients
  • Since ln[p/(1-p)] = α + βX + e
  • The slope coefficient (β) is interpreted as the
    rate of change in the "log odds" as X changes,
    which is not very useful.
  • Since p = 1/[1 + exp(-α - βX)], the marginal
    effect of a change in X on the probability is
    ∂p/∂X = f(α + βX)β

18
  • An interpretation of the logit coefficient which
    is usually more intuitive is the "odds ratio"
  • Since p/(1-p) = exp(α + βX)
  • exp(β) is the effect of the independent variable
    on the "odds ratio" (see the sketch below)
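
A minimal sketch of both interpretations, assuming statsmodels and a
simulated 0/1 regressor (the names and numbers are illustrative, not the
SPSS output on the next slide):

  # Odds ratio exp(beta) and the marginal effect f(alpha + beta*X)*beta
  # from a fitted logit; for the logistic, f(z) = p(1-p).
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(2)
  x = rng.binomial(1, 0.4, size=600)               # a hypothetical 0/1 regressor
  y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + 0.7 * x))))

  fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
  a, b = fit.params
  print("odds ratio exp(beta):", np.exp(b))        # effect on the odds p/(1-p)
  p_bar = 1 / (1 + np.exp(-(a + b * x.mean())))
  print("marginal effect at the mean:", p_bar * (1 - p_bar) * b)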

19
From SPSS Output
Households with pets are 1.933 times more likely
to evacuate than those without pets.
20
Hypothesis Testing
  • The Wald statistic for the ? coefficient
    is Wald ? /s.e.B2which is distributed
    chi-square with 1 degree of freedom.
  • The "Partial R" (in SPSS output) is R
    (Wald-2)/(-2LL(?)1/2
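
A minimal sketch of the Wald calculation, assuming statsmodels and
simulated data rather than the evacuation sample:

  # Wald = [beta / s.e.(beta)]^2, compared against chi-square with 1 df.
  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import chi2

  rng = np.random.default_rng(3)
  x = rng.normal(size=500)
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 0.8 * x))))

  fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
  beta, se = fit.params[1], fit.bse[1]
  wald = (beta / se) ** 2
  print("Wald:", wald, " p-value:", chi2.sf(wald, df=1))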

21
An Example
22
Evaluating the Performance of the Model
  • There are several statistics which can be used
    for comparing alternative models or evaluating
    the performance of a single model
  • Model Chi-Square
  • Percent Correct Predictions
  • Pseudo-R2

23
Model Chi-Square
  • The model likelihood ratio (LR) statistic is
    LRi = -2[LL(α) - LL(α, β)]. Or, as you are
    reading SPSS printout: LRi = [-2LL (of
    beginning model)] - [-2LL (of ending model)]
  • The LR statistic is distributed chi-square with i
    degrees of freedom, where i is the number of
    independent variables
  • Use the Model Chi-Square statistic to determine
    if the overall model is statistically
    significant (a sketch of the calculation follows
    below).
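
A minimal sketch of the model chi-square, assuming statsmodels (which
exposes the two log likelihoods as .llf and .llnull) and simulated data:

  # LR = -2LL(alpha) - [-2LL(alpha, beta)], chi-square with i df,
  # where i is the number of independent variables.
  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import chi2

  rng = np.random.default_rng(4)
  X = rng.normal(size=(500, 2))                    # i = 2 independent variables
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.1 + X @ [0.9, -0.6]))))

  fit = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
  LR = -2 * fit.llnull - (-2 * fit.llf)            # beginning -2LL minus ending -2LL
  print("model chi-square:", LR, " p-value:", chi2.sf(LR, df=2))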

24
An Example
25
Percent Correct Predictions
  • The "Percent Correct Predictions" statistic
    assumes that if the estimated p is greater than
    or equal to .5 then the event is expected to
    occur, and not to occur otherwise.
  • By converting these predicted probabilities to 0s
    and 1s and comparing them to the actual 0s and 1s,
    the correct Yes, correct No, and overall correct
    scores are calculated (see the sketch below).
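
A minimal sketch of that tabulation, assuming statsmodels and simulated
data:

  # Classify p >= .5 as a predicted 1, then compare with the actual 0s and 1s.
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(5)
  x = rng.normal(size=500)
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 1.2 * x))))

  fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
  pred = (fit.predict() >= 0.5).astype(int)
  print("correct Yes:", np.mean(pred[y == 1] == 1),
        " correct No:", np.mean(pred[y == 0] == 0),
        " overall:", np.mean(pred == y))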

26
An Example
27
Pseudo-R2
  • One pseudo-R2 statistic is the McFadden's-R2
    statistic: McFadden's-R2 = 1 -
    [LL(α, β)/LL(α)] = 1 - [-2LL(α, β)/-2LL(α)]
    (from SPSS printout; see the sketch below)
  • where the R2 is a scalar measure which varies
    between 0 and (somewhat close to) 1, much like the
    R2 in an LP model.
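
A minimal sketch of the calculation, assuming statsmodels and simulated
data (statsmodels also reports the same quantity as .prsquared):

  # McFadden's R2 = 1 - LL(alpha, beta)/LL(alpha)
  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(6)
  x = rng.normal(size=500)
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 1.0 * x))))

  fit = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
  print("McFadden's R2:", 1 - fit.llf / fit.llnull)   # same value as fit.prsquared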

28
An Example
29
Some Potential Problems and Solutions
  • Omitted Variable Bias
  • Irrelevant Variable Bias
  • Functional Form
  • Multicollinearity
  • Structural Breaks

30
Omitted Variable Bias
  • Omitted variable(s) can result in bias in the
    coefficient estimates. To test for omitted
    variables you can conduct a likelihood ratio
    test (see the sketch below): LRq = -2LL(constrained
    model, i = k - q) - [-2LL(unconstrained model,
    i = k)], where LR is distributed chi-square with
    q degrees of freedom, with q = 1 or more omitted
    variables
  • This test is conducted automatically by SPSS if
    you specify "blocks" of independent variables
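
A minimal sketch of this likelihood ratio test, assuming statsmodels and
simulated data in place of SPSS blocks:

  # LRq = -2LL(constrained, k - q regressors) - [-2LL(unconstrained, k regressors)],
  # chi-square with q degrees of freedom.
  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import chi2

  rng = np.random.default_rng(7)
  X = rng.normal(size=(600, 3))                    # k = 3 candidate regressors
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + X @ [0.8, 0.5, 0.0]))))

  unconstrained = sm.Logit(y, sm.add_constant(X)).fit(disp=False)         # k = 3
  constrained = sm.Logit(y, sm.add_constant(X[:, :1])).fit(disp=False)    # k - q = 1
  LR = -2 * constrained.llf - (-2 * unconstrained.llf)
  print("LR:", LR, " p-value:", chi2.sf(LR, df=2))                        # q = 2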

31
An Example
32
Constructing the LR Test
Since the chi-squared value is less than the
critical value, the set of coefficients is not
statistically significant. The full model is not
an improvement over the partial model.
33
Irrelevant Variable Bias
  • The inclusion of irrelevant variable(s) can
    result in poor model fit.
  • You can consult your Wald statistics or conduct a
    likelihood ratio test.

34
Functional Form
  • Errors in functional form can result in biased
    coefficient estimates and poor model fit.
  • You should try different functional forms by
    logging the independent variables, adding squared
    terms, etc.
  • Then consult the Wald statistics and model
    chi-square statistics to determine which model
    performs best.

35
Multicollinearity
  • The presence of multicollinearity will not lead
    to biased coefficients.
  • But the standard errors of the coefficients will
    be inflated.
  • If a variable which you think should be
    statistically significant is not, consult the
    correlation coefficients.
  • If two variables are correlated at a rate greater
    than .6, .7, .8, etc. then try dropping the least
    theoretically important of the two.

36
Structural Breaks
  • You may have structural breaks in your data.
    Pooling the data imposes the restriction that an
    independent variable has the same effect on the
    dependent variable for different groups of data
    when the opposite may be true.
  • You can conduct a likelihood ratio test (see the
    sketch below): LRi+1 = -2LL(pooled model) -
    [-2LL(sample 1) + -2LL(sample 2)], where samples
    1 and 2 are pooled, and i is the number of
    independent variables.
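
A minimal sketch of this test, assuming statsmodels and a simulated 0/1
group indicator standing in for the Dennis/Floyd samples:

  # LR = -2LL(pooled) - [-2LL(sample 1) + -2LL(sample 2)],
  # chi-square with i + 1 df (intercept and i slopes allowed to differ).
  import numpy as np
  import statsmodels.api as sm
  from scipy.stats import chi2

  rng = np.random.default_rng(8)
  x = rng.normal(size=800)
  group = rng.binomial(1, 0.5, size=800)           # e.g., 0 = Dennis, 1 = Floyd
  slope = np.where(group == 1, 1.5, 0.5)           # the effect differs by group
  y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + slope * x))))

  pooled = sm.Logit(y, sm.add_constant(x)).fit(disp=False)
  s1 = sm.Logit(y[group == 0], sm.add_constant(x[group == 0])).fit(disp=False)
  s2 = sm.Logit(y[group == 1], sm.add_constant(x[group == 1])).fit(disp=False)
  LR = -2 * pooled.llf - (-2 * s1.llf + -2 * s2.llf)
  print("LR:", LR, " p-value:", chi2.sf(LR, df=2))  # i = 1 regressor, so df = 2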

37
An Example
  • Is the evacuation behavior from Hurricanes Dennis
    and Floyd statistically equivalent?

38
Constructing the LR Test
Since the chi-squared value is greater than the
critical value, the sets of coefficients are
statistically different. The pooled model is
inappropriate.
39
What should you do?
  • Try adding a dummy variable: FLOYD = 1 if Floyd,
    0 if Dennis

40
Writing Up Results
  • Present descriptive statistics in a table.
  • Make it clear that the dependent variable is
    discrete (0, 1), not continuous, and that you
    will use logistic regression.
  • Logistic regression is a standard statistical
    procedure, so you don't (necessarily) need to
    write out the formula for it. You also (usually)
    don't need to justify using logit
    instead of the LP model or probit (similar to
    logit but based on the normal distribution, so
    the tails are less fat).


41
An Example
"The dependent variable which measures the
willingness to evacuate is EVAC. EVAC is equal to
1 if the respondent evacuated their home during
Hurricanes Floyd and Dennis and 0 otherwise. The
logistic regression model is used to estimate the
factors which influence evacuation behavior."
42
Organize your regression results in a table
  • In the heading, state your dependent variable
    (dependent variable = EVAC) and that these are
    "logistic regression results."
  • Present coefficient estimates, t-statistics (or
    Wald statistics, whichever you prefer), and (at
    least) the model chi-square statistic for
    overall model fit.
  • If you are comparing several model specifications,
    you should also present the percent correct
    predictions and/or Pseudo-R2 statistics to
    evaluate model performance.
  • If you are comparing models with hypotheses about
    different blocks of coefficients or testing for
    structural breaks in the data, you could present
    the ending log-likelihood values.

43
An Example
44
When describing the statistics in the tables,
point out the highlights for the reader. What
are the statistically significant variables?
  • "The results from Model 1 indicate that coastal
    residents behave according to risk theory. The
    coefficient on the MOBLHOME variable is positive
    and statistically significant at the p < .01
    level (t-value = 5.42). Mobile home residents are
    4.75 times more likely to evacuate."

45
Is the overall model statistically significant?
"The overall model is significant at the .01
level according to the Model chi-square
statistic. The model predicts 69.5% of the
responses correctly. The McFadden's R2 is .066."
46
Which model is preferred?
"Model 2 includes three additional independent
variables. According to the likelihood ratio test
statistic, the partial model is superior to the
full model in terms of overall model fit. The block
chi-square statistic is not statistically
significant at the .01 level (critical value =
11.35, df = 3). The coefficients on the children,
gender, and race variables are not statistically
significant at standard levels."
47
Also
  • You usually don't need to discuss the magnitude
    of the coefficients--just the sign (+ or -) and
    statistical significance.
  • If your audience is unfamiliar with the
    extensions (beyond SPSS or SAS printouts) to
    logistic regression, discuss the calculation of
    the statistics in an appendix or footnote or
    provide a citation.
  • Always state the degrees of freedom for your
    likelihood-ratio (chi-square) test.

48
References
  • http://personal.ecu.edu/whiteheadj/data/logit/
  • http://personal.ecu.edu/whiteheadj/data/logit/logitpap.htm
  • E-mail: WhiteheadJ@mail.ecu.edu