Title: Introduction to Logistic Regression
1. Introduction to Logistic Regression
- E. Barry Moser, Department of Experimental Statistics, Louisiana State University
2. Political Poll Example
- A polling organization was asked to determine whether or not the preference for a particular candidate for a political office differed between males and females.
- The organization took a simple random sample (SRS) of 250 males and another SRS of 250 females from the population of interest and determined the voting preference of each individual in the survey.
3. The Data Set
Data omitted to conserve space.
4. Summarized Data
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent;
Run;
5. Data Properties
- The data are organized into 2 groups (SEX = Male, Female; n1 = n2 = 250).
- The response variable (PREFERENCE) has only 2 values (discrete/categorical): For, Against.
- The groups and subjects are independent (SRS).
- The data can be completely summarized, without loss of information, into a 2x2 table.
6. Hypothesis
- The question to answer is whether preference for the candidate differs between the 2 groups.
- This question typically is translated to mean that we wish to compare the proportions of individuals in favor of the candidate between the two groups. This can be restated in terms of the probability, p, that a randomly selected individual favors the candidate.
- That is, the null hypothesis is H0: p_male = p_female.
7. Homogeneity of Proportions
- Comparison of proportions among 2 or more populations is called a test of homogeneity of proportions, since the typical null hypothesis is that the category probabilities or proportions are the same for all populations.
8. Binomial Distribution
- Model for the number of successes y out of n trials.
- Is used as a model for binary discrete data since the probability of getting one of the two outcomes is one minus the probability of getting the other.
- Has a variance smaller than its mean. For some data the model does not fit and extra-binomial variation must be accounted for.
- Typically modeled using logistic regression.
9. Binomial Distribution
- P(Y = y) = [n! / (y!(n-y)!)] p^y (1-p)^(n-y)
- range: y = 0, 1, ..., n
- parameters: n, p
- mean = np
- variance = np(1-p)
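As a quick numerical check of these formulas, the sketch below (not from the original slides) uses the poll's group size n = 250 with a hypothetical p = 0.5:
- Data BinomCheck;
- n = 250; p = 0.5;                     /* n matches the poll group size; p = 0.5 is hypothetical */
- Mean = n*p;                           /* 125 */
- Variance = n*p*(1-p);                 /* 62.5, smaller than the mean */
- P125 = PDF('BINOMIAL', 125, p, n);    /* probability of exactly 125 successes */
- Run;
- Proc Print Data=BinomCheck; Run;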
10. Goodness of fit
- One approach to test the hypothesis of equal proportions from independent random samples is to use the Pearson goodness-of-fit chi-square test.
- For large (table) cell counts (counts in the categories), the binomial (or multinomial) counts can be approximated using the Poisson distribution, for which the expected value (mean) and variance are equal.
- Let n_i be the observed cell count and let E_i be the cell count predicted by a model. The E_i is called the expected cell count.
11. Goodness of fit
- A Z statistic for this cell would be Z_i = (n_i - E_i) / sqrt(E_i).
- The square of a Z statistic is a chi-square random variable with 1 degree of freedom.
12. Goodness of fit
- The sum of these k cell chi-squares is also a chi-square and corresponds with the Pearson chi-square goodness-of-fit statistic, X^2 = sum over i of (n_i - E_i)^2 / E_i.
- The degrees of freedom for X^2 is df = k - p - c, where k = the number of cells, p = number of parameters estimated by the model, and c = number of constraints in the model.
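For example, one way to count these quantities for the 2x2 poll table is k = 4 cells, p = 1 parameter (the common preference proportion estimated under the null hypothesis), and c = 2 constraints (the two fixed sample sizes of 250), giving df = 4 - 1 - 2 = 1, the usual 1 degree of freedom for a 2x2 table.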
13. Goodness of fit test
- If the model fits the data very well, then the expected cell counts will be very close or similar to the observed cell counts. This would result in a small chi-square statistic.
- If the model explains the counts poorly, then the chi-square will be large.
- Therefore, the chi-square goodness-of-fit test is a one-tailed test.
- The magnitudes of the individual cell chi-square contributions can be used to judge how the cells contribute to the overall lack-of-fit.
14. Expected and Cell Chi-squares
Proc Freq Data=Poll Order=Data;
  Table Preference*Sex / NoRow NoPercent Expected ChiSq;
  Exact ChiSq;
Run;
Request expected values. Request the Pearson chi-square goodness-of-fit analysis.
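Since the raw Poll data set was omitted to conserve space, one way to reproduce this analysis from the summarized 2x2 table is to read the cell counts and add a WEIGHT statement; the counts below are hypothetical placeholders rather than the actual poll results:
- Data PollCounts;
- Input Sex $ Preference $ Count;     /* one line per cell of the 2x2 table */
- DataLines;
- Male For 110
- Male Against 140
- Female For 135
- Female Against 115
- ;
- Run;
- Proc Freq Data=PollCounts Order=Data;
- Weight Count;                       /* treat Count as the cell frequency */
- Table Preference*Sex / NoRow NoPercent Expected ChiSq;
- Exact ChiSq;
- Run;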
15. Goodness-of-fit Results
Both the Pearson chi-square and the likelihood ratio chi-square are measures of goodness-of-fit and test the hypothesis of equal proportions between the sexes. The Exact statement requested that an exact test of the proportions be computed under the binomial model, and doesn't require the Poisson-to-Normal approximation.
16. Extension
- Although the homogeneity goodness-of-fit contingency table approach can be extended to multiple predictor variables, it doesn't appear natural to consider continuous predictor variables in this context.
- An alternative approach is logistic analysis, where the parameter, p, in the binomial distribution is modeled as a function of predictor variables.
17. Logistic Analysis
- Specifically, logistic analysis links p from the binomial distribution with the predictors using the form p = exp(b0 + b1X1 + ... + bkXk) / (1 + exp(b0 + b1X1 + ... + bkXk)).
- Note that p has a non-linear (logistic) relationship with the predictor variables (which could be dummy variables).
18. Log Odds
- The ratio p/(1-p) is called the odds.
- The relationship log[p/(1-p)] = b0 + b1X1 + ... + bkXk is called the log-odds or logit.
- An odds of 1 (or 1:1) indicates that the event and its complement are equally likely. Odds greater than 1 indicate that the event is more likely to happen (and vice versa), as illustrated below.
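As a quick numerical illustration (the probabilities here are arbitrary examples, not estimates from the poll): if p = 0.75, the odds are 0.75/0.25 = 3 (or 3:1) and the log-odds are log(3), about 1.10; if p = 0.5, the odds are 1 and the log-odds are 0; if p = 0.25, the odds are 1/3 and the log-odds are about -1.10.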
19. Odds Ratio
- Let X1 and X2 be dummy indicator variables for sex = male and sex = female, respectively, and consider the model log[p/(1-p)] = b0 + b1X1 + b2X2.
- Equality of males and females implies that b1 = b2 and that b1 - b2 = 0.
- Let pm be the probability of preference in the male group, and pf be the probability of preference in the female group.
20. Odds Ratio
- Now rewrite our model solution for each group: log[pm/(1-pm)] = b0 + b1 for males and log[pf/(1-pf)] = b0 + b2 for females, so that log[pm/(1-pm)] - log[pf/(1-pf)] = b1 - b2.
21. Odds Ratio
- The quantity [pm/(1-pm)] / [pf/(1-pf)] = exp(b1 - b2) is called the odds ratio.
- If two populations have the same odds for an event, then it also implies that the probabilities are the same. Under this situation, the odds ratio is 1. Thus, an alternative test is to test whether or not the odds ratio is different from 1. Note also that an odds ratio of 1 implies a log odds ratio = 0. A numerical example follows.
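For example, with hypothetical values pm = 0.50 and pf = 0.60 (not the poll estimates), the male odds are 0.50/0.50 = 1 and the female odds are 0.60/0.40 = 1.5, so the odds ratio of females to males is 1.5 and the log odds ratio is log(1.5), about 0.41. If instead pm = pf, the two odds are equal, the odds ratio is 1, and the log odds ratio is 0.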
22. Logistic Analysis of Preference
Request dummy variables like GLM and MIXED.
- Proc Logistic Data=Poll;
- Class Sex / Param=GLM;
- Model Preference(Event="For") = Sex / ExpB;
- Exact Sex;
- Run;
- Quit;
Preference is the response variable, but the program doesn't know which level of the response you wish to call success. Thus you can be explicit by using the EVENT= option.
Compute the exact p-value under the permutation concept. This methodology can take considerable computing time.
Exponentiate the parameter estimates. Sometimes parameters will correspond exactly with log odds ratios, so detransformation shows the odds ratio.
23. Logistic Analysis Tables
Here we can see that the model that we are attempting to fit is a logistic model with a binary response.
Here we learn the values of the response variable and how many occurrences are in the data set for each.
24. Class Level Table
The columns containing the 1s and 0s correspond with the dummy variables that will be included in the analysis. This is the GLM or MIXED parameterization because each level of SEX is assigned a dummy variable.
25. Model Fit Tables
This table indicates that a solution has been found.
If the procedure did not converge, then either more iterations are required (adjust a parameter in the procedure statement), or, more often than not, there is some condition with the data that the algorithm is having difficulty with, or the model is ill-specified.
The fit measures given here are relative to alternate models for the data, or may be used to directly test hypotheses for full and reduced models (likelihood ratio test).
26. Hypotheses Tests
The global test is a test that all of the parameters, other than the intercept, are zero.
The Type 3 test of effects breaks the above test statistic into components corresponding with effects specified in the model.
The tests above indicate that there is a difference between males and females with respect to preference for the candidate. The likelihood ratio test above corresponds exactly with the one given in PROC FREQ.
27. Parameter Estimates
These are the beta coefficients corresponding with b0, b1, and b2, respectively. Note that b1 - b2 = b1 = 0.3708 (since b2 is set to zero under this parameterization), and Exp(0.3708) = 1.449. This is the odds ratio of females to males, so the odds of preference for the candidate by females are about 44% larger than they are for males. Note that this is not an estimate of the probability of preference for the candidate, though these estimates can be derived from the above model results, as sketched below.
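To sketch that derivation, using the logistic form from slide 17 and the estimates above (the intercept b0 comes from the parameter estimates table and is not reproduced here): the predicted probability is p = exp(b0 + b1X1 + b2X2) / (1 + exp(b0 + b1X1 + b2X2)), so for females p_F = exp(b0 + 0.3708) / (1 + exp(b0 + 0.3708)) and for males p_M = exp(b0) / (1 + exp(b0)).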
28. Odds Ratios
Since odds ratios are usually important to the interpretation of these models, a separate table of them is constructed. Note that the 95% confidence limit does not include 1. Intervals based upon the likelihood can also be requested if desired.
29. Exact Test
These are the exact p-value results using the permutation distribution of the test statistic. The mid p-value adjusts for the discreteness of the test distribution. Again, these results indicate that preference for the candidate differs between males and females.
30. Default Class Parameterization
The default CLASS statement, without a PARAM= option, uses an effect parameterization that differs from GLM and MIXED. The last level is not included in the model.
Not the odds ratio.
Note that the parameter estimate associated with the Male category is (-1)(0.1854) = -0.1854. Note, however, that the difference of these 2 parameters remains 0.3708, as worked out below.
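Worked out explicitly: under this effect coding the Female estimate is 0.1854 and the Male estimate is -0.1854, so the log odds ratio of females to males is 0.1854 - (-0.1854) = 2 x 0.1854 = 0.3708, and exp(0.3708) = 1.449, the same odds ratio obtained with the GLM parameterization. Exponentiating the single coefficient 0.1854 by itself (about 1.20) is therefore not the odds ratio.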
31. Credit Risk Example
- Consider now an example in which the credit-worthiness of students is to be predicted from various student data (sex, age, major, grade point average, and hours carried in the current semester).
- The objective is to build a model with useful predictors of risk (good/bad), rather than to test specific a priori hypotheses.
32. Data Input
- Title1 "Credit Modelling of Students";
- Data Credit;
- Input ID Sex $ Major $ Age GPT HRS Risk $;
- DataLines;
- 1 FEMALE SCI 25 4.0 5 GOOD
- 2 MALE HUM 28 3.3 5 BAD
- 3 FEMALE SOC 25 3.3 0 BAD
- 4 FEMALE BUS 24 2.2 20 GOOD
MAJOR is the college of major (BUS, HUM, SCI, SOC). GPT is the grade point average (2.0-4.0). HRS is the current semester hours carried (0, 5, 10, 15, 20). RISK is credit risk (Good, Bad).
33. Simple Logistic Regression Model
Predict risk from HRS.
- Proc Logistic Data=Credit;
- Model Risk(Event="GOOD") = Hrs / ExpB LackFit CLOdds=PL;
- Units Hrs=5;
- Output Out=Results Predicted=RiskPred Lower=RiskLower Upper=RiskUpper;
- Run;
Exponentiate estimates. Test for lack-of-fit. Construct profile likelihood confidence limits on the odds ratio estimates.
Compute the odds ratios for HRS based upon a 5-unit change in HRS.
The Output statement creates a new data set in which the predicted probability of GOOD and its lower and upper 95% confidence limits are added to the original data set.
34. Parameter Estimates
Here the parameter corresponding with the HRS predictor is judged to be different from zero. The odds of being a GOOD risk are about 80% higher (odds ratio 1.807) for each additional hour carried.
35. Odds Ratios
For the students in the data set, HRS is either 0, 5, 10, 15, or 20. Thus, it may make much more sense to compute the odds ratio in terms of a 5-unit change in HRS. The Units statement in the program generates the above table. For each additional course of 5 hours, the odds of being a GOOD risk increase by a factor of more than 19, as worked out below.
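That 5-unit odds ratio follows directly from the per-hour estimate on the previous slide: the odds ratio for a 5-hour change is exp(5b) = 1.807^5, which is approximately 19.3.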
36. Lack-of-fit Test
The lack-of-fit test formed by grouping HRS into 5 groups (these are natural groupings) finds that the linear relationship between HRS and the logit of risk is sufficient to explain the association between these two variables. A curvilinear model is not needed.
37. Plot the Prediction Model
- Proc Sort Data=Results;
- By Hrs;
- Run;
- GOptions Reset=Axis Reset=Symbol;
- Proc GPlot Data=Results;
- Plot RiskPred*Hrs=1 RiskLower*Hrs=2 RiskUpper*Hrs=3
-   / VAxis=Axis1 HAxis=Axis2 Overlay;
- Axis1 Label=(A=90 "Predicted Risk of GOOD")
-   Order=(0 To 1 By 0.25);
- Axis2 Label=("Hours Carried")
-   Order=(0 To 20 By 5);
- Symbol1 C=Black V=Dot I=Spline L=1;
- Symbol2 C=Black V=None I=Spline L=2;
- Symbol3 C=Black V=None I=Spline L=2;
- Run; Quit;
38. HRS Risk Model
39. Multiple Logistic Regression
- Proc Logistic Data=Credit;
- Model Risk(Event="GOOD") = GPT Hrs / ExpB LackFit CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Output Out=Results Predicted=RiskPred Lower=RiskLower Upper=RiskUpper;
- Run;
Note that for GPT we have requested odds ratios for a 0.25-unit change in GPT (something more realistic than a 1-unit change).
40. Parameter Estimates
It appears that both predictors have parameters that are different from zero. If the 1-unit change were useful for each of the predictors, then the estimated odds ratios in the last column would be useful. The table based upon the Units statement is probably more valuable for this purpose.
41. Odds Ratios
If the specified changes in GPT (0.25) and HRS (5) represent comparable levels of change, then the HRS variable would have a much greater impact on the odds of being a GOOD risk than would GPT.
42. Prediction Graph
- Proc Sort Data=Results; By GPT; Run;
- GOptions Reset=Axis Reset=Symbol Reset=Legend;
- Proc GPlot Data=Results;
- Plot RiskPred*GPT=Hrs
-   / VAxis=Axis1 HAxis=Axis2 Legend=Legend1;
- Axis1 Label=(A=90 "Predicted Risk of GOOD")
-   Order=(0 To 1 By 0.25);
- Axis2 Label=("Grade Point Average")
-   Order=(2 To 4 By 0.25);
- Legend1 Position=(Inside Middle Right)
-   Frame Across=1 Shape=Symbol(10,1);
- Symbol1 C=Black V=Circle I=Spline L=1;
- Symbol2 C=Black V=Triangle I=Spline L=1;
- Symbol3 C=Black V=Square I=Spline L=1;
- Symbol4 C=Black V=Diamond I=Spline L=1;
- Symbol5 C=Black V=Dot I=Spline L=1;
- Run; Quit;
43. Prediction Graph
44. Multiple Logistic Analysis
- Proc Logistic Data=Credit;
- Class Sex Major;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS / ExpB CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Contrast "BUS VS HUM " Major 1 -1 0;
- Contrast "BUS VS SCI " Major 1 0 -1;
- Contrast "BUS VS SOC " Major 2 1 1;
- Contrast "BUS VS Others" Major 4 0 0;
- Run;
This code uses the default class effect parameterization, and so contrast statement coefficients will look quite different than those used with GLM or MIXED, as worked out below.
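To see where these coefficients come from: with the default effect coding for the 4 levels (BUS, HUM, SCI, SOC), only the first three levels receive parameters and the SOC effect is implied as -(BUS + HUM + SCI). Substituting that in gives, for example, BUS - SOC = BUS + (BUS + HUM + SCI) = 2*BUS + 1*HUM + 1*SCI (coefficients 2 1 1), and the BUS-versus-others comparison 3*BUS - HUM - SCI - SOC = 4*BUS (coefficients 4 0 0).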
45. Alternative Code
GLM parameterization.
- Proc Logistic Data=Credit;
- Class Sex Major / Param=GLM;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS / ExpB CLOdds=PL;
- Units Hrs=5 GPT=0.25;
- Contrast "BUS VS HUM " Major 1 -1 0 0;
- Contrast "BUS VS SCI " Major 1 0 -1 0;
- Contrast "BUS VS SOC " Major 1 0 0 -1;
- Contrast "BUS VS Others" Major 3 -1 -1 -1;
- Run;
These contrasts will match similar questions that you have posed with GLM or MIXED. These are the same contrasts that are specified on the previous slide.
46. Test Results
Some of these effects are not significant and should probably be removed from the predictive model. One might consider a backward elimination approach: remove Sex from the model and refit. Likelihood ratio tests could also be used to help judge which effects to retain. A sketch of an automated version follows below.
Alternatively, the fit statistics, such as AIC, could be used to select the most parsimonious model. One could fit all possible models and select the one with the smallest AIC.
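As a sketch of how that backward-elimination idea can be automated in PROC LOGISTIC (this code is not from the original slides, and the SLStay=0.05 cutoff is an arbitrary choice):
- Proc Logistic Data=Credit;
- Class Sex Major;
- Model Risk(Event="GOOD") = Sex Major Age GPT HRS
-   / Selection=Backward SLStay=0.05 ExpB;   /* remove non-significant effects one at a time */
- Units Hrs=5 GPT=0.25;
- Run;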