Title: Introduction to Logistic Regression
1. Introduction to Logistic Regression
- E. Barry Moser, Department of Experimental Statistics, Louisiana State University
2. Political Poll Example
- A polling organization was asked to determine whether or not the preference for a particular candidate for a political office differed between males and females.
- The organization took a simple random sample (SRS) of 250 males and another SRS of 250 females from the population of interest and determined the voting preference of each individual in the survey.
3. The Data Set
Data omitted to conserve space.
4. Summarized Data
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent;
Run;
5. Data Properties
- The data are organized into 2 groups (SEX = Male, Female; n1 = n2 = 250).
- The response variable (PREFERENCE) has only 2 values (discrete/categorical): For, Against.
- The groups and subjects are independent (SRS).
- The data can be completely summarized, without loss of information, into a 2x2 table.
6. Hypothesis
- The question to answer is whether preference for the candidate differs between the 2 groups.
- This question typically is translated to mean that we wish to compare the proportions of individuals in favor of the candidate between the two groups. This can be restated in terms of the probability, p, that a randomly selected individual favors the candidate.
- That is, the null hypothesis is H0: p_male = p_female.
7. Homogeneity of Proportions
- Comparison of proportions among 2 or more populations is called a test of homogeneity of proportions, since the typical null hypothesis is that the category probabilities or proportions are the same for all populations.
8. Binomial Distribution
- Model for the number of successes y out of n trials.
- Is used as a model for binary discrete data since the probability of getting one of the two outcomes is one minus the probability of getting the other.
- Has a variance smaller than its mean. For some data the model does not fit and extra-binomial variation must be accounted for.
- Typically modeled using logistic regression.
9. Binomial Distribution
- P(Y = y) = [n! / (y!(n-y)!)] p^y (1-p)^(n-y)
- range: y = 0, 1, ..., n
- parameters: n, p
- mean = np
- variance = np(1-p)
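As a quick numerical check of these formulas, the sketch below (not from the original slides) uses the poll's group size n = 250 with a hypothetical p = 0.5:
- Data BinomCheck;
- n = 250; p = 0.5;                     /* n matches the poll group size; p = 0.5 is hypothetical */
- Mean = n*p;                           /* 125 */
- Variance = n*p*(1-p);                 /* 62.5, smaller than the mean */
- P125 = PDF('BINOMIAL', 125, p, n);    /* probability of exactly 125 successes */
- Run;
- Proc Print Data=BinomCheck; Run;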
10. Goodness of fit
- One approach to test the hypothesis of equal proportions from independent random samples is to use the Pearson goodness-of-fit chi-square test.
- For large (table) cell counts (counts in the categories), the binomial (or multinomial) counts can be approximated using the Poisson distribution, for which the expected value (mean) and variance are equal.
- Let n_i be the observed cell count and let E_i be the cell count predicted by a model. The E_i is called the expected cell count.
11. Goodness of fit
- A Z statistic for this cell would be Z_i = (n_i - E_i) / sqrt(E_i).
- The square of a Z statistic is a chi-square random variable with 1 degree of freedom.
12. Goodness of fit
- The sum of these k cell chi-squares is also a chi-square and corresponds with the Pearson chi-square goodness-of-fit statistic, X^2 = sum over i of (n_i - E_i)^2 / E_i.
- The degrees of freedom for X^2 is df = k - p - c, where k = the number of cells, p = number of parameters estimated by the model, and c = number of constraints in the model.
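For example, one way to count these quantities for the 2x2 poll table is k = 4 cells, p = 1 parameter (the common preference proportion estimated under the null hypothesis), and c = 2 constraints (the two fixed sample sizes of 250), giving df = 4 - 1 - 2 = 1, the usual 1 degree of freedom for a 2x2 table.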
13. Goodness of fit test
- If the model fits the data very well, then the expected cell counts will be very close or similar to the observed cell counts. This would result in a small chi-square statistic.
- If the model explains the counts poorly, then the chi-square will be large.
- Therefore, the chi-square goodness-of-fit test is a one-tailed test.
- The magnitudes of the individual cell chi-square contributions can be used to judge how the cells contribute to the overall lack-of-fit.
14. Expected and Cell Chi-squares
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent Expected ChiSq;
  Exact ChiSq;
Run;
Request expected values. Request the Pearson chi-square goodness-of-fit analysis.
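Since the raw Poll data set was omitted to conserve space, one way to reproduce this analysis from the summarized 2x2 table is to read the cell counts and add a WEIGHT statement; the counts below are hypothetical placeholders rather than the actual poll results:
- Data PollCounts;
- Input Sex $ Preference $ Count;     /* one line per cell of the 2x2 table */
- DataLines;
- Male For 110
- Male Against 140
- Female For 135
- Female Against 115
- ;
- Run;
- Proc Freq Data=PollCounts Order=Data;
- Weight Count;                       /* treat Count as the cell frequency */
- Table Preference*Sex / NoRow NoPercent Expected ChiSq;
- Exact ChiSq;
- Run;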
15. Goodness-of-fit Results
Both the Pearson chi-square and the likelihood ratio chi-square are measures of goodness-of-fit and test the hypothesis of equal proportions between the sexes. The Exact statement requested that an exact test of the proportions be computed under the binomial model, and doesn't require the Poisson-to-Normal approximation.
16. Extension
- Although the homogeneity goodness-of-fit contingency table approach can be extended to multiple predictor variables, it doesn't appear natural to consider continuous predictor variables in this context.
- An alternative approach is logistic analysis, where the parameter, p, in the binomial distribution is modeled as a function of predictor variables.
17. Logistic Analysis
- Specifically, logistic analysis links p from the binomial distribution with the predictors using the form p = exp(b0 + b1X1 + ... + bkXk) / (1 + exp(b0 + b1X1 + ... + bkXk)).
- Note that p has a non-linear (logistic) relationship with the predictor variables (which could be dummy variables).
18. Log Odds
- The ratio p/(1-p) is called the odds.
- The relationship log[p/(1-p)] = b0 + b1X1 + ... + bkXk is called the log-odds or logit.
- An odds of 1 (or 1:1) indicates that the event and its complement are equally likely. Odds greater than 1 indicate that the event is more likely to happen (and vice versa), as illustrated below.
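As a quick numerical illustration (the probabilities here are arbitrary examples, not estimates from the poll): if p = 0.75, the odds are 0.75/0.25 = 3 (or 3:1) and the log-odds are log(3), about 1.10; if p = 0.5, the odds are 1 and the log-odds are 0; if p = 0.25, the odds are 1/3 and the log-odds are about -1.10.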
19. Odds Ratio
- Let X1 and X2 be dummy indicator variables for sex = male and sex = female, respectively, and consider the model log[p/(1-p)] = b0 + b1X1 + b2X2.
- Equality of males and females implies that b1 = b2 and that b1 - b2 = 0.
- Let pm be the probability of preference in the male group, and pf be the probability of preference in the female group.
20. Odds Ratio
- Now rewrite our model solution for each group: log[pm/(1-pm)] = b0 + b1 for males and log[pf/(1-pf)] = b0 + b2 for females, so that log[pm/(1-pm)] - log[pf/(1-pf)] = b1 - b2.
21. Odds Ratio
- The quantity [pm/(1-pm)] / [pf/(1-pf)] = exp(b1 - b2) is called the odds ratio.
- If two populations have the same odds for an event, then it also implies that the probabilities are the same. Under this situation, the odds ratio is 1. Thus, an alternative test is to test whether or not the odds ratio is different from 1. Note also that an odds ratio of 1 implies a log odds ratio = 0. A numerical example follows.
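For example, with hypothetical values pm = 0.50 and pf = 0.60 (not the poll estimates), the male odds are 0.50/0.50 = 1 and the female odds are 0.60/0.40 = 1.5, so the odds ratio of females to males is 1.5 and the log odds ratio is log(1.5), about 0.41. If instead pm = pf, the two odds are equal, the odds ratio is 1, and the log odds ratio is 0.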
22. Logistic Analysis of Preference
Request dummy variables like GLM and MIXED.
- Proc Logistic Data=Poll;
- Class Sex / Param=GLM;
- Model Preference(Event="For") = Sex / ExpB;
- Exact Sex;
- Run;
- Quit;
Preference is the response variable, but the program doesn't know which level of the response you wish to call success. Thus you can be explicit by using the EVENT= option.
Compute the exact p-value under the permutation concept. This methodology can take considerable computing time.
Exponentiate the parameter estimates. Sometimes parameters will correspond exactly with log odds ratios, so detransformation shows the odds ratio.
23. Logistic Analysis Tables
Here we can see that the model that we are attempting to fit is a logistic model with a binary response.
Here we learn the values of the response variable and how many occurrences are in the data set for each.
24. Class Level Table
The columns containing the 1s and 0s correspond with the dummy variables that will be included in the analysis. This is the GLM or MIXED parameterization because each level of SEX is assigned a dummy variable.
25. Model Fit Tables
This table indicates that a solution has been found.
If the procedure did not converge, then either more iterations are required (adjust a parameter in the procedure statement), or, more often than not, there is some condition with the data that the algorithm is having difficulty with, or the model is ill-specified.
The fit measures given here are relative to alternate models for the data, or may be used to directly test hypotheses for full and reduced models (likelihood ratio test).
26. Hypotheses Tests
The global test is a test that all of the parameters, other than the intercept, are zero.
The Type 3 test of effects breaks the above test statistic into components corresponding with effects specified in the model.
The tests above indicate that there is a difference between males and females with respect to preference for the candidate. The likelihood ratio test above corresponds exactly with the one given in PROC FREQ.
27. Parameter Estimates
These are the beta coefficients corresponding with b0, b1, and b2, respectively. Note that b1 - b2 = b1 = 0.3708 (since b2 is set to zero under this parameterization), and Exp(0.3708) = 1.449. This is the odds ratio of females to males, so the odds of preference for the candidate by females are about 44% larger than they are for males. Note that this is not an estimate of the probability of preference for the candidate, though these estimates can be derived from the above model results, as sketched below.
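To sketch that derivation, using the logistic form from slide 17 and the estimates above (the intercept b0 comes from the parameter estimates table and is not reproduced here): the predicted probability is p = exp(b0 + b1X1 + b2X2) / (1 + exp(b0 + b1X1 + b2X2)), so for females p_F = exp(b0 + 0.3708) / (1 + exp(b0 + 0.3708)) and for males p_M = exp(b0) / (1 + exp(b0)).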
28. Odds Ratios
Since odds ratios are usually important to the interpretation of these models, a separate table of them is constructed. Note that the 95% confidence limit does not include 1. Intervals based upon the likelihood can also be requested if desired.
29. Exact Test
These are the exact p-value results using the permutation distribution of the test statistic. The mid p-value adjusts for the discreteness of the test distribution. Again, these results indicate that preference for the candidate differs between males and females.
30. Default Class Parameterization
The default CLASS statement, without a PARAM= option, uses an effect parameterization that differs from GLM and MIXED. The last level is not included in the model.
Not the odds ratio.
Note that the parameter estimate associated with the Male category is (-1)(0.1854) = -0.1854. Note, however, that the difference of these 2 parameters remains 0.3708, as worked out below.
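Worked out explicitly: under this effect coding the Female estimate is 0.1854 and the Male estimate is -0.1854, so the log odds ratio of females to males is 0.1854 - (-0.1854) = 2 x 0.1854 = 0.3708, and exp(0.3708) = 1.449, the same odds ratio obtained with the GLM parameterization. Exponentiating the single coefficient 0.1854 by itself (about 1.20) is therefore not the odds ratio.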
31. Credit Risk Example
- Consider now an example in which the credit-worthiness of students is to be predicted from various student data (sex, age, major, grade point average, and hours carried in the current semester).
- The objective is to build a model with useful predictors of risk (good/bad), rather than to test specific a priori hypotheses.
32. Data Input
- Title1 "Credit Modelling of Students";
- Data Credit;
- Input ID Sex $ Major $ Age GPT HRS Risk $;
- DataLines;
- 1 FEMALE SCI 25 4.0 5 GOOD
- 2 MALE HUM 28 3.3 5 BAD
- 3 FEMALE SOC 25 3.3 0 BAD
- 4 FEMALE BUS 24 2.2 20 GOOD
MAJOR is the college of major (BUS, HUM, SCI, SOC). GPT is the grade point average (2.0-4.0). HRS is the current semester hours carried (0, 5, 10, 15, 20). RISK is credit risk (Good, Bad).
33. Simple Logistic Regression Model
Predict risk from HRS.
- Proc Logistic Data=Credit;
- Model Risk(Event="GOOD") = Hrs / ExpB LackFit CLOdds=PL;
- Units Hrs=5;
- Output Out=Results Predicted=RiskPred Lower=RiskLower Upper=RiskUpper;
- Run;
Exponentiate estimates. Test for lack-of-fit. Construct profile likelihood confidence limits on the odds ratio estimates.
Compute the odds ratios for HRS based upon a 5-unit change in HRS.
The Output statement creates a new data set in which the predicted probability of GOOD and its lower and upper 95% confidence limits are added to the original data set.
34. Parameter Estimates
Here the parameter corresponding with the HRS predictor is judged to be different from zero. The odds of being a GOOD risk are about 80% higher (odds ratio 1.807) for each additional hour carried.
35. Odds Ratios
For the students in the data set, HRS is either 0, 5, 10, 15, or 20. Thus, it may make much more sense to compute the odds ratio in terms of a 5-unit change in HRS. The Units statement in the program generates the above table. For each additional course of 5 hours, the odds of being a GOOD risk increase by a factor of more than 19, as worked out below.
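That 5-unit odds ratio follows directly from the per-hour estimate on the previous slide: the odds ratio for a 5-hour change is exp(5b) = 1.807^5, which is approximately 19.3.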
36. Lack-of-fit Test
The lack-of-fit test formed by grouping HRS into 5 groups (these are natural groupings) finds that the linear relationship between HRS and the logit of risk is sufficient to explain the association between these two variables. A curvilinear model is not needed.
37. Plot the Prediction Model
- Proc Sort Data=Results;
- By Hrs;
- Run;
- GOptions Reset=Axis Reset=Symbol;
- Proc GPlot Data=Results;
- Plot RiskPred*Hrs=1 RiskLower*Hrs=2 RiskUpper*Hrs=3
-   / VAxis=Axis1 HAxis=Axis2 Overlay;
- Axis1 Label=(A=90 "Predicted Risk of GOOD")
-   Order=(0 To 1 By 0.25);
- Axis2 Label=("Hours Carried")
-   Order=(0 To 20 By 5);
- Symbol1 C=Black V=Dot I=Spline L=1;
- Symbol2 C=Black V=None I=Spline L=2;
- Symbol3 C=Black V=None I=Spline L=2;
- Run; Quit;
38. HRS Risk Model
39. Multiple Logistic Regression
- Proc Logistic Data=Credit;
- Model Risk(Event="GOOD") = GPT Hrs / ExpB LackFit CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Output Out=Results Predicted=RiskPred Lower=RiskLower Upper=RiskUpper;
- Run;
Note that for GPT we have requested odds ratios for a 0.25-unit change in GPT (something more realistic than a 1-unit change).
40. Parameter Estimates
It appears that both predictors have parameters that are different from zero. If the 1-unit change were useful for each of the predictors, then the estimated odds ratios in the last column would be useful. The table based upon the Units statement is probably more valuable for this purpose.
41. Odds Ratios
If the specified changes in GPT (0.25) and HRS (5) represent comparable levels of change, then the HRS variable would have a much greater impact on the odds of being a GOOD risk than would GPT.
42. Prediction Graph
- Proc Sort Data=Results; By GPT; Run;
- GOptions Reset=Axis Reset=Symbol Reset=Legend;
- Proc GPlot Data=Results;
- Plot RiskPred*GPT=Hrs
-   / VAxis=Axis1 HAxis=Axis2 Legend=Legend1;
- Axis1 Label=(A=90 "Predicted Risk of GOOD")
-   Order=(0 To 1 By 0.25);
- Axis2 Label=("Grade Point Average")
-   Order=(2 To 4 By 0.25);
- Legend1 Position=(Inside Middle Right)
-   Frame Across=1 Shape=Symbol(10,1);
- Symbol1 C=Black V=Circle I=Spline L=1;
- Symbol2 C=Black V=Triangle I=Spline L=1;
- Symbol3 C=Black V=Square I=Spline L=1;
- Symbol4 C=Black V=Diamond I=Spline L=1;
- Symbol5 C=Black V=Dot I=Spline L=1;
- Run; Quit;
43. Prediction Graph
44. Multiple Logistic Analysis
- Proc Logistic Data=Credit;
- Class Sex Major;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS / ExpB CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Contrast "BUS VS HUM " Major 1 -1 0;
- Contrast "BUS VS SCI " Major 1 0 -1;
- Contrast "BUS VS SOC " Major 2 1 1;
- Contrast "BUS VS Others" Major 4 0 0;
- Run;
This code uses the default class effect parameterization, and so contrast statement coefficients will look quite different than those used with GLM or MIXED, as worked out below.
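To see where these coefficients come from: with the default effect coding for the 4 levels (BUS, HUM, SCI, SOC), only the first three levels receive parameters and the SOC effect is implied as -(BUS + HUM + SCI). Substituting that in gives, for example, BUS - SOC = BUS + (BUS + HUM + SCI) = 2*BUS + 1*HUM + 1*SCI (coefficients 2 1 1), and the BUS-versus-others comparison 3*BUS - HUM - SCI - SOC = 4*BUS (coefficients 4 0 0).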
45. Alternative Code
GLM parameterization.
- Proc Logistic Data=Credit;
- Class Sex Major / Param=GLM;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS / ExpB CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Contrast "BUS VS HUM " Major 1 -1 0 0;
- Contrast "BUS VS SCI " Major 1 0 -1 0;
- Contrast "BUS VS SOC " Major 1 0 0 -1;
- Contrast "BUS VS Others" Major 3 -1 -1 -1;
- Run;
These contrasts will match similar questions that you have posed with GLM or MIXED. These are the same contrasts that are specified on the previous slide.
46. Test Results
Some of these effects are not significant and should probably be removed from the predictive model. One might consider a backward elimination approach: remove Sex from the model and refit. Likelihood ratio tests could also be used to help judge which effects to retain. A sketch of an automated version follows below.
Alternatively, the fit statistics, such as AIC, could be used to select the most parsimonious model. One could fit all possible models and select the one with the smallest AIC.
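As a sketch of how that backward-elimination idea can be automated in PROC LOGISTIC (this code is not from the original slides, and the SLStay=0.05 cutoff is an arbitrary choice):
- Proc Logistic Data=Credit;
- Class Sex Major;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS
-   / Selection=Backward SLStay=0.05 ExpB;   /* remove non-significant effects one at a time */
- Units Hrs=5 GPT=0.25;
- Run;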