Introduction to Logistic Regression - PowerPoint PPT Presentation

Slides: 47
Provided by: stat96

Transcript and Presenter's Notes

Title: Introduction to Logistic Regression


1
Introduction to Logistic Regression
  • E. Barry Moser, Department of Experimental Statistics,
    Louisiana State University

2
Political Poll Example
  • A polling organization was asked to determine
    whether or not the preference for a particular
    candidate for a political office differed between
    males and females.
  • The organization took a simple random sample (SRS)
    of 250 males and another SRS of 250 females from
    the population of interest and determined the
    voting preference of each individual in the
    survey.

3
The Data Set
Data omitted to conserve space.
4
Summarized Data
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent;
Run;
5
Data Properties
  • The data are organized into 2 groups (SEX: Male,
    Female; n1 = n2 = 250)
  • The response variable (PREFERENCE) has only 2
    values (discrete/categorical): For, Against
  • The groups and subjects are independent (SRS)
  • The data can be completely summarized, without
    loss of information, into a 2x2 table

6
Hypothesis
  • The question to answer is whether preference for
    the candidate differs between the 2 groups.
  • This question typically is translated to mean
    that we wish to compare the proportions of
    individuals in favor of the candidate between the
    two groups. This can be restated in terms of the
    probability, pi, that a randomly selected
    individual favors the candidate.

7
Homogeneity of Proportions
  • A comparison of proportions among 2 or more
    populations is called a test of homogeneity of
    proportions, since the typical null hypothesis is
    that the category probabilities or proportions
    are the same for all populations.

8
Binomial Distribution
  • Model for the number of successes y out of n
    trials.
  • Is used as a model for binary discrete data since
    the probability of getting one of the two
    outcomes is one minus the probability of getting
    the other.
  • Has a variance smaller than its mean. For some
    data the model does not fit and extra-binomial
    variation must be accounted for.
  • Typically modeled using logistic regression.

9
Binomial Distribution
  • range: y = 0, 1, ..., n
  • parameters: n (number of trials) and p
    (probability of success)
  • mean = np
  • variance = np(1-p)
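
The mean and variance formulas can be checked numerically. The sketch below is written in Python rather than SAS, purely as a calculator; the values of n and p are hypothetical and were chosen only for illustration.

```python
from math import comb

def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) for Y ~ Binomial(n, p)."""
    return comb(n, y) * p**y * (1 - p) ** (n - y)

# Hypothetical values, not taken from the slides.
n, p = 10, 0.3

# The range of y is 0, 1, ..., n.
pmf = [binomial_pmf(y, n, p) for y in range(n + 1)]

# Mean and variance computed directly from the probability function.
mean = sum(y * f for y, f in enumerate(pmf))
variance = sum((y - mean) ** 2 * f for y, f in enumerate(pmf))

print("mean:", round(mean, 6), "np:", round(n * p, 6))
print("variance:", round(variance, 6), "np(1-p):", round(n * p * (1 - p), 6))
```

Because 1-p is less than 1 whenever p is positive, the variance np(1-p) is always smaller than the mean np.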

10
Goodness of fit
  • One approach to test the hypothesis of equal
    proportions from independent random samples is to
    use the Pearson goodness of fit chi-square test.
  • For large (table) cell counts (counts in the
    categories), the binomial (or multinomial) counts
    can be approximated using the Poisson
    distribution, for which the expected value (mean)
    and variance are equal.
  • Let ni be the observed cell count and let Ei be
    the cell count predicted by a model. The Ei is
    called the expected cell count.

11
Goodness of fit
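  • A Z statistic for this cell would be
    $Z_i = (n_i - E_i) / \sqrt{E_i}$, because under the
    Poisson approximation the variance of the cell
    count equals its mean, $E_i$.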
  • The square of a Z statistic is a chi-square
    random variable with 1 degree of freedom.

12
Goodness of fit
  • The sum of these k cell chi-squares is also a
    chi-square and corresponds with the Pearson
    chi-square goodness of fit statistic,
    $X^2 = \sum_{i=1}^{k} (n_i - E_i)^2 / E_i$.
  • The degrees of freedom for X2 are df = k - p - c,
    where k = the number of cells, p = the number of
    parameters estimated by the model, and c = the
    number of constraints in the model.
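
As a rough illustration of the Pearson statistic for a 2x2 table, the sketch below computes expected counts, cell chi-squares, and their sum in plain Python. The counts are hypothetical (the slides omit the poll data), and the p-value uses the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
from math import erfc, sqrt

# Hypothetical counts (rows: For, Against; columns: Female, Male).
# These are NOT the poll data, which the slides omit.
observed = [[140, 120],
            [110, 130]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under homogeneity: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Cell contributions (n - E)^2 / E, and their sum, the Pearson X^2.
cell_chisq = [[(n - e) ** 2 / e for n, e in zip(obs_row, exp_row)]
              for obs_row, exp_row in zip(observed, expected)]
x2 = sum(sum(row) for row in cell_chisq)

# A 2x2 homogeneity table has (2-1)*(2-1) = 1 degree of freedom, and the
# upper-tail probability of a chi-square(1) variable is erfc(sqrt(x/2)).
df = 1
p_value = erfc(sqrt(x2 / 2))

print("X2 =", round(x2, 4), "df =", df, "p =", round(p_value, 4))
```

Only the upper tail is used for the p-value, which is why the test is described as one-tailed.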

13
Goodness of fit test
  • If the model fits the data very well, then the
    expected cell counts will be very close or
    similar to the observed cell counts. This would
    result in a small chi-square statistic.
  • If the model explains the counts poorly, then
    the chi-square will be large.
  • Therefore, the chi-square goodness of fit test is
    a one-tailed (upper-tail) test.
  • The magnitudes of the individual cell chi-square
    contributions can be used to judge how the cells
    contribute to the overall lack-of-fit.

14
Expected and Cell Chi-squares
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent Expected Chisq;
  Exact Chisq;
Run;
Request expected values. Request Pearson
chi-square goodness-of-fit analysis.
15
Goodness-of-fit Results
Both the Pearson chi-square and the likelihood
ratio chi-square are measures of goodness-of-fit
and test the hypothesis of equal proportions
between the sexes. The EXACT statement requested
that an exact test of the proportions be computed
under the binomial model, which doesn't require the
Poisson-to-Normal approximation.
16
Extension
  • Although the homogeneity goodness-of-fit
    contingency table approach can be extended to
    multiple predictor variables, it doesn't appear
    natural to consider continuous predictor
    variables in this context.
  • An alternative approach is logistic analysis,
    where the parameter, p, in the binomial
    distribution is modeled as a function of
    predictor variables.

17
Logistic Analysis
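  • Specifically, logistic analysis links p from the
    binomial distribution with the predictors
    X1, ..., Xk using the form
    $p = \frac{\exp(b_0 + b_1 X_1 + \cdots + b_k X_k)}{1 + \exp(b_0 + b_1 X_1 + \cdots + b_k X_k)}$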
  • Note that p has a non-linear (logistic)
    relationship with the predictor variables (which
    could be dummy variables).

18
Log Odds
  • The relationship $p / (1 - p)$ is called the odds.
  • The relationship $\ln[p / (1 - p)]$ is called the
    log-odds or logit.
  • An odds of 1 (or 1:1) indicates that the event
    and its complement are equally likely. Odds
    greater than 1 indicate that the event is more
    likely to happen (and vice versa).
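
The odds and logit transformations, and the inverse that maps a logit back to a probability, can be sketched in a few lines of Python. The function names are illustrative and are not part of any SAS procedure.

```python
from math import exp, log

def odds(p: float) -> float:
    """Odds of an event with probability p (0 < p < 1)."""
    return p / (1 - p)

def logit(p: float) -> float:
    """Log-odds of an event with probability p."""
    return log(odds(p))

def inverse_logit(eta: float) -> float:
    """Probability corresponding to a log-odds value eta."""
    return 1 / (1 + exp(-eta))

for p in (0.25, 0.5, 0.75):
    print(p, "odds:", round(odds(p), 4), "logit:", round(logit(p), 4))
```

A probability of 0.5 gives odds of 1 and a logit of 0; probabilities above 0.5 give odds above 1 and positive logits.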

19
Odds Ratio
  • Let X1 and X2 be dummy indicator variables for
    sex = male and sex = female, respectively, and
    consider the model
    $\ln[p / (1 - p)] = b_0 + b_1 X_1 + b_2 X_2$
  • Equality of males and females implies that b1 = b2
    and that b1 - b2 = 0.
  • Let pm be the probability of preference in the
    male group, and pf be the probability of
    preference in the female group.

20
Odds Ratio
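  • Now rewrite our model solution for each group:
    $\ln[p_m / (1 - p_m)] = b_0 + b_1$ and
    $\ln[p_f / (1 - p_f)] = b_0 + b_2$, so that
    $\ln\left[\frac{p_m / (1 - p_m)}{p_f / (1 - p_f)}\right] = b_1 - b_2$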

21
Odds Ratio
  • The quantity $\frac{p_m / (1 - p_m)}{p_f / (1 - p_f)}$ is called
    the odds ratio.
  • If two populations have the same odds for an
    event, then it also implies that the
    probabilities are the same. Under this situation,
    the odds ratio is 1. Thus, an alternative test is
    to test whether or not the odds ratio is
    different from 1. Note also that an odds ratio of
    1 implies a log odds ratio of 0.
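
A small Python sketch of the odds ratio computed from group counts follows. The counts are hypothetical and serve only to show that equal proportions give an odds ratio of 1 and a log odds ratio of 0.

```python
from math import log

def odds_ratio(events_1: int, n_1: int, events_2: int, n_2: int) -> float:
    """Odds of the event in group 1 divided by the odds in group 2."""
    odds_1 = events_1 / (n_1 - events_1)
    odds_2 = events_2 / (n_2 - events_2)
    return odds_1 / odds_2

# Hypothetical counts of "For" responses out of 250 per group.
same = odds_ratio(100, 250, 100, 250)    # equal proportions
differ = odds_ratio(140, 250, 120, 250)  # unequal proportions

print("equal proportions:  ", same, log(same))
print("unequal proportions:", round(differ, 4), round(log(differ), 4))
```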

22
Logistic Analysis of Preference
Request dummy variables like GLM and MIXED.
  • Proc Logistic Data=Poll;
  • Class Sex / Param=GLM;
  • Model Preference(Event="For") = Sex / ExpB;
  • Exact Sex;
  • Run;
  • Quit;

Preference is the response variable, but the
program doesn't know which level of the response
you wish to call success. Thus you can be
explicit by using the EVENT option.
Compute the exact p-value under the
permutation concept. This methodology can take
considerable computing time.
Exponentiate the parameter estimates. Sometimes
parameters will correspond exactly with log odds
ratios, so detransformation shows the odds ratio.
23
Logistic Analysis Tables
Here we can see that the model that we
are attempting to fit is a logistic model with a
binary response.
Here we learn the values of the response
variable and how many occurrences are in the
dataset for each.
24
Class Level Table
The columns containing the 1s and 0s correspond
with the dummy variables that will be included in
the analysis. This is the GLM or
MIXED parameterization because each level of SEX
is assigned a dummy variable.
25
Model Fit Tables
This table indicates that a solution has been
found.
If the procedure did not converge, then either
more iterations are required (adjust a parameter
in the procedure statement), or more often than
not, there is some condition with the data that
the algorithm is having difficulty with, or the
model is ill specified.
The fit measures given here are relative to
alternate models for the data, or may be used to
directly test hypotheses for full and
reduced models (likelihood ratio test).
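
To make the idea of convergence concrete, here is a minimal Newton-Raphson sketch in Python for the simplest case, a logit model with one 0/1 predictor. It is not the algorithm as implemented in PROC LOGISTIC, only an illustration of iterating until the parameter updates become negligible. The function name, the iteration limit, the tolerance, and the counts are all assumptions.

```python
from math import exp, log

def fit_two_group_logistic(events0, n0, events1, n1, max_iter=25, tol=1e-10):
    """Newton-Raphson for logit(p) = b0 + b1*x, where x is 0 or 1.

    Returns (b0, b1, iterations_used, converged).
    """
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        p0 = 1 / (1 + exp(-b0))         # fitted probability when x = 0
        p1 = 1 / (1 + exp(-(b0 + b1)))  # fitted probability when x = 1
        # Score vector: derivative of the log-likelihood w.r.t. b0 and b1.
        g0 = (events0 - n0 * p0) + (events1 - n1 * p1)
        g1 = events1 - n1 * p1
        # Information matrix [[a, b], [b, b]] built from binomial variances.
        w0 = n0 * p0 * (1 - p0)
        w1 = n1 * p1 * (1 - p1)
        a, b = w0 + w1, w1
        det = a * b - b * b
        # Newton step: inverse information matrix times the score vector.
        step0 = (b * g0 - b * g1) / det
        step1 = (-b * g0 + a * g1) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False

# Hypothetical counts: 120 of 250 events when x = 0, 140 of 250 when x = 1.
b0, b1, iterations, converged = fit_two_group_logistic(120, 250, 140, 250)
print(round(b0, 4), round(b1, 4), iterations, converged)
```

With these counts the slope converges to the log odds ratio between the two groups. If one group had no events, or only events, the binomial variance for that group would shrink toward zero and the updates would not settle, which is one example of a data condition that prevents convergence.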
26
Hypotheses Tests
The global test is a test that all of
the parameters, other than the intercept, are
zero.
The Type 3 test of effects breaks the above test
statistic into components corresponding with
effects specified in the model.
The tests above indicate that there is a
difference between males and females with respect
to preference for the candidate. The likelihood
ratio test above corresponds exactly with the one
given in PROC FREQ.
27
Parameter Estimates
These are the beta coefficients corresponding
with b0, b1, and b2, respectively. Note that
b1 - b2 = b1 = 0.3708 (b2 is set to 0 under the GLM
parameterization), and Exp(0.3708) = 1.449. This is
the odds ratio of females to males, so the odds of
preference for the candidate are about 45% larger
for females than for males. Note that this is not
an estimate of the probability of preference for
the candidate, though these estimates can be
derived from the above model results.
28
Odds Ratios
Since odds ratios are usually important to the
interpretation of these models, a separate table
of them is constructed. Note that the 95%
confidence interval does not include 1. Intervals
based upon the likelihood can also be
requested if desired.
29
Exact Test
These are the exact p-value results using the
permutation distribution of the test statistic.
The mid p-value adjusts for the discreteness of
the test distribution. Again, these
results indicate that preference for the
candidate differs between males and females.
30
Default Class Parameterization
The default CLASS statement, without a PARAM
option, uses an effect parameterization that
differs from GLM and MIXED. The last level is
not included in the model.
Under this parameterization the exponentiated
estimate is not the odds ratio.
Note that the parameter estimate associated with
the Male category is -1 × 0.1854 = -0.1854. Note,
however, that the difference of these 2
parameters remains 0.3708.
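
The arithmetic linking the two parameterizations can be checked with a short Python calculation that uses only the estimates quoted on the slides (0.1854 and 0.3708). The variable names are illustrative.

```python
from math import exp

# Effect parameterization: the Female estimate is reported, and the Male
# effect is its negative because the level effects sum to zero.
effect_female = 0.1854
effect_male = -1 * effect_female

# The difference between the two level effects is the log odds ratio,
# which matches the GLM-parameterization estimate b1 = 0.3708.
log_odds_ratio = effect_female - effect_male
odds_ratio = exp(log_odds_ratio)

print("log odds ratio:", round(log_odds_ratio, 4))
print("odds ratio (female vs male):", round(odds_ratio, 3))
print("exp of the effect estimate alone:", round(exp(effect_female), 3))
```

Exponentiating the single effect-coded estimate gives a different number from the odds ratio, which is why the exponentiated column cannot be read as an odds ratio under the default parameterization.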
31
Credit Risk Example
  • Consider now an example in which the
    credit-worthiness of students is to be predicted
    from various student data (sex, age, major, grade
    point average, and hours carried in current
    semester).
  • The objective is to build a model with useful
    predictors of risk (good/bad), rather than to
    test specific a priori hypotheses.

32
Data Input
  • Title1 "Credit Modelling of Students";
  • Data Credit;
  • Input ID Sex $ Major $ Age GPT HRS Risk $;
  • DataLines;
  • 1 FEMALE SCI 25 4.0 5 GOOD
  • 2 MALE HUM 28 3.3 5 BAD
  • 3 FEMALE SOC 25 3.3 0 BAD
  • 4 FEMALE BUS 24 2.2 20 GOOD

MAJOR is college of major (BUS, HUM, SCI, SOC).
GPT is the grade point average (2.0-4.0).
HRS is the current semester hours carried (0, 5, 10,
15, 20).
RISK is credit risk (Good, Bad).
33
Simple Logistic Regression Model
Predict risk from HRS.
  • Proc Logistic Data=Credit;
  • Model Risk(Event="GOOD") = Hrs / ExpB
    LackFit CLOdds=PL;
  • Units Hrs=5;
  • Output Out=Results Predicted=RiskPred
    Lower=RiskLower Upper=RiskUpper;
  • Run;

Exponentiate estimates. Test for
lack-of-fit. Construct profile likelihood
confidence limits on the odds ratio estimates.
Compute the odds ratios for HRS based upon
a 5-unit change in HRS.
The output statement creates a new data set in
which the predicted probability of GOOD and its
lower and upper 95% confidence limits are added
to the original data set.
34
Parameter Estimates
Here the parameter corresponding with the HRS
predictor is judged to be different from zero.
The odds of being a GOOD risk are about 80% greater
(odds ratio 1.807) for each additional hour carried.
35
Odds Ratios
For the students in the data set, HRS are either
0, 5, 10, 15, or 20. Thus, it may make much more
sense to compute the odds ratio in terms of a
5-unit change in HRS. The units statement in
the program generates the above table. For each
additional course of 5 hours, the odds of being a
GOOD risk are over 19 times greater.
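
The link between the 1-unit and 5-unit odds ratios can be verified with a short Python calculation: an odds ratio for a change of c units is exp(c × b), where b is the coefficient on the logit scale. Only the 1-unit odds ratio of 1.807 comes from the slides; the function name is illustrative.

```python
from math import exp, log

unit_odds_ratio = 1.807       # reported odds ratio for a 1-hour change in HRS
b_hrs = log(unit_odds_ratio)  # implied coefficient on the logit scale

def odds_ratio_for_change(b: float, units: float) -> float:
    """Odds ratio for a change of `units` in a predictor with coefficient b."""
    return exp(b * units)

or_5 = odds_ratio_for_change(b_hrs, 5)
print("5-unit odds ratio:", round(or_5, 2))
```

The same value is obtained by raising the 1-unit odds ratio to the fifth power, since odds ratios multiply as the predictor increases.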
36
Lack-of-fit Test
The lack-of-fit test formed by grouping HRS into
5 groups (these are natural groupings) finds that
the linear relationship between HRS and the logit
of risk is sufficient to explain the
association between these two variables. A
curvilinear model is not needed.
37
Plot the Prediction Model
  • Proc Sort Data=Results;
  • By Hrs;
  • Run;
  • GOptions Reset=Axis Reset=Symbol;
  • Proc GPlot Data=Results;
  • Plot RiskPred*Hrs=1 RiskLower*Hrs=2
    RiskUpper*Hrs=3
  • / VAxis=Axis1 HAxis=Axis2 Overlay;
  • Axis1 Label=(A=90 "Predicted Risk of GOOD")
  • Order=(0 To 1 By 0.25);
  • Axis2 Label=("Hours Carried")
  • Order=(0 To 20 By 5);
  • Symbol1 C=Black V=Dot I=Spline L=1;
  • Symbol2 C=Black V=None I=Spline L=2;
  • Symbol3 C=Black V=None I=Spline L=2;
  • Run; Quit;

38
HRS Risk Model
39
Multiple Logistic Regression
  • Proc Logistic Data=Credit;
  • Model Risk(Event="GOOD") = GPT Hrs / ExpB
    LackFit CLOdds=PL;
  • Units Hrs=5 GPT=0.25;
  • Output Out=Results Predicted=RiskPred
    Lower=RiskLower Upper=RiskUpper;
  • Run;

Note that for GPT we have requested odds ratios
for a 0.25-unit change in GPT (something more
realistic than a 1-unit change).
40
Parameter Estimates
It appears that both predictors have parameters
that are different from zero. If the 1-unit
change were useful for each of the predictors,
then the estimated odds ratios in the last column
would be useful. The table based upon the units
statement is probably more valuable for this
purpose.
41
Odds Ratios
If the specified changes in GPT (0.25) and HRS
(5) represent comparable levels of change, then
the HRS variable would have a much greater impact
on the odds of being a GOOD risk than would GPT.
42
Prediction Graph
  • Proc Sort Data=Results; By GPT; Run;
  • GOptions Reset=Axis Reset=Symbol Reset=Legend;
  • Proc GPlot Data=Results;
  • Plot RiskPred*GPT=Hrs
  • / VAxis=Axis1 HAxis=Axis2 Legend=Legend1;
  • Axis1 Label=(A=90 "Predicted Risk of GOOD")
  • Order=(0 To 1 By 0.25);
  • Axis2 Label=("Grade Point Average")
  • Order=(2 To 4 By 0.25);
  • Legend1 Position=(Inside Middle Right)
  • Frame Across=1 Shape=Symbol(10,1);
  • Symbol1 C=Black V=Circle I=Spline L=1;
  • Symbol2 C=Black V=Triangle I=Spline L=1;
  • Symbol3 C=Black V=Square I=Spline L=1;
  • Symbol4 C=Black V=Diamond I=Spline L=1;
  • Symbol5 C=Black V=Dot I=Spline L=1;
  • Run; Quit;

43
Prediction Graph
44
Multiple Logistic Analysis
  • Proc Logistic Data=Credit;
  • Class Sex Major;
  • Model Risk(Event="GOOD") = Sex Major Age GPT HRS
    / ExpB CLOdds=PL;
  • Units Hrs=5 GPT=0.25;
  • Contrast "BUS VS HUM " Major 1 -1 0;
  • Contrast "BUS VS SCI " Major 1 0 -1;
  • Contrast "BUS VS SOC " Major 2 1 1;
  • Contrast "BUS VS Others" Major 4 0 0;
  • Run;

This code uses the default class effect
parameterization, and so contrast statement
coefficients will look quite different than those
used with GLM or MIXED.
45
Alternative Code
GLM parameterization.
  • Proc Logistic Data=Credit;
  • Class Sex Major / Param=GLM;
  • Model Risk(Event="GOOD") = Sex Major Age GPT HRS
    / ExpB CLOdds=PL;
  • Units Hrs=5 GPT=0.25;
  • Contrast "BUS VS HUM " Major 1 -1 0 0;
  • Contrast "BUS VS SCI " Major 1 0 -1 0;
  • Contrast "BUS VS SOC " Major 1 0 0 -1;
  • Contrast "BUS VS Others" Major 3 -1 -1 -1;
  • Run;

These contrasts will match similar questions that
you have posed with GLM or MIXED. These are the
same contrasts that are specified on the previous
slide.
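
One way to see why the two sets of contrast coefficients ask the same questions is to translate each GLM-style contrast into the effect parameterization. The Python sketch below assumes the levels are ordered BUS, HUM, SCI, SOC with SOC as the last (omitted) level, so that the SOC effect equals minus the sum of the other three. The helper name is hypothetical.

```python
# How each MAJOR level is coded on the three effect-coded parameters.
# The last level gets -1 in every column (assumed level order shown).
EFFECT_CODING = [
    [1, 0, 0],     # BUS
    [0, 1, 0],     # HUM
    [0, 0, 1],     # SCI
    [-1, -1, -1],  # SOC (last level, not given its own parameter)
]

def glm_to_effect(glm_coefficients):
    """Re-express a contrast on the 4 level effects using the 3 parameters."""
    return [sum(c * row[j] for c, row in zip(glm_coefficients, EFFECT_CODING))
            for j in range(3)]

glm_contrasts = {
    "BUS VS HUM": [1, -1, 0, 0],
    "BUS VS SCI": [1, 0, -1, 0],
    "BUS VS SOC": [1, 0, 0, -1],
    "BUS VS Others": [3, -1, -1, -1],
}

for label, coefficients in glm_contrasts.items():
    print(label, coefficients, "->", glm_to_effect(coefficients))
```

The translated coefficients reproduce those used with the default effect parameterization, including the less obvious `2 1 1` and `4 0 0`.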
46
Test Results
Some of these effects are not significant and
should probably be removed from the predictive
model. One might consider a backward elimination
approach and remove sex from the model and refit.
Likelihood ratio tests could also be used to help
judge effects to retain.
Alternatively, the fit statistics, such as AIC,
could be used to select the most parsimonious
model. One could fit all possible models and
select the one with the smallest AIC.
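
Selection by AIC can be sketched as follows, using AIC = -2 log L + 2k, where k is the number of estimated parameters. The log-likelihood values below are hypothetical placeholders, not results from the credit data; in practice they would be read from the fit statistics of each fitted model.

```python
def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike information criterion: -2 log L + 2k."""
    return -2 * log_likelihood + 2 * n_parameters

# Hypothetical (log-likelihood, number of parameters) for candidate models.
candidates = {
    "HRS": (-60.0, 2),
    "GPT HRS": (-52.0, 3),
    "SEX GPT HRS": (-51.8, 4),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(name, round(score, 1))
print("smallest AIC:", best)
```

In this made-up example, adding SEX raises the log-likelihood only slightly, so the penalty of 2 per parameter outweighs the gain and the smaller model is preferred.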