Logistic Regression and Generalized Linear Models: - PowerPoint PPT Presentation

About This Presentation
Title:

Logistic Regression and Generalized Linear Models:

Description:

Logistic Regression and Generalized Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps Blood Screening ESR measurements (erythrocyte ... – PowerPoint PPT presentation

Number of Views:241
Avg rating:3.0/5.0
Slides: 33
Provided by: Information321
Category:

less

Transcript and Presenter's Notes

Title: Logistic Regression and Generalized Linear Models:


1
Logistic Regression and Generalized Linear Models
  • Blood Screening,
  • Womens Role in Society,
  • and Colonic Polyps

2
Blood Screening
  • ESR measurements (erythrocyte sedimentation
    rate)?
  • Study checks for the rate of increase of ESR when
    (blood proteins) fibrinogen and globulin increase
  • Looking for an association between the
    probability of an ESR reading gt 20mm/hr and the
    levels of the two plasma proteins.
  • Less than 20mm/hr indicates a healthy individual

3
Multiple Regression Doesnt Work
  • Not Normally distributed
  • The response variable is binary.
  • In fact, the distribution is Binomial. The can be
    seen by looking at how the error term relates to
    the probability. If y 1 then the error term is
    1- P(y1). We will assume for our Random
    Variables Y that

4
Logistic Regression, a Generalized Linear Model
  • Modelling the expected value of the response
    requires a transformation
  • This chapter is about the logit transformation of
    a model, which is of the form
  • If the response variable is 1, the log odds of
    the response is the logit function of a
    probability.
  • Fixing all explanatory variables but xj, exp(Bj)
    represents the odds that the response variable is
    1 when xj increases by 1.

5
R commands
  • Plasma_glm_1 lt- glm(ESR fibrinogen, data
    plasma, family binomial())?
  • Fits the model, in this example specifying the
    logistic function is implied because of the
    binomial() parameter is already passed.
  • Layout(matrix(12, ncol 2))?
  • Lets you choose the layout of your graph screen.
  • Cdplot(ESR fibrinogen, data plasma)?
  • Cdplot(ESR globulin, data plasma)?
  • cdplot plots conditional densities describing
    how the conditional distribution of a categorical
    variable y changes over a numerical variable x
    (help(cdplot))?

6
Interpretation
  • The area of the dark region is the probability
    that ESR lt 20 mm/hr, and this decreases as the
    protein levels increase
  • There is not much shape to the globulin density
    function.

7
R Commands
  • Confint(plasma_glm_1, parm fibrinogen)?
  • Gives a confidence interval
  • Output
  • Waiting for profiling to be done...
  • 2.5 97.5
  • 0.3389465 3.9988602
  • Exp(coef(plasma_glm_1)fibrinogen)?
  • This is necessary to do the reverse
    transformation to get the model coefficients
  • Output
  • fibrinogen
  • 6.215715

8
R Commands
  • Summary(plasma_glm_1)?
  • Output
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -0.9298 -0.5399 -0.4382 -0.3356 2.4794
  • Coefficients
  • Estimate Std. Error z value Pr(gtz)
  • (Intercept) -6.8451 2.7703 -2.471 0.0135
  • fibrinogen 1.8271 0.9009 2.028 0.0425
  • ---
  • Signif. codes 0 0.001 0.01 0.05
    . 0.1 1
  • (Dispersion parameter for binomial family taken
    to be 1)
  • Null deviance 30.885 on 31 degrees of
    freedom
  • Residual deviance 24.840 on 30 degrees of
    freedom

9
R Commands
  • Exp(confint(plasma_glm_1, parm fibrinogen))?
  • Output
  • 2.5 97.5
  • 1.403468 54.535954
  • Plasma_glm_2 lt- glm(ESR fibrinogen globulin,
    data plasma, family binomial())?
  • Use logistic regression for both variables
    fibrinogen and globulin
  • Summary(Plasma_glm_2)?

10
R Commands
  • Summary(Plasma_glm_2)?
  • Output
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -0.9683 -0.6122 -0.3458 -0.2116 2.2636
  • Coefficients
  • Estimate Std. Error z value Pr(gtz)
  • (Intercept) -12.7921 5.7963 -2.207 0.0273
  • fibrinogen 1.9104 0.9710 1.967 0.0491
  • globulin 0.1558 0.1195 1.303 0.1925
  • ---
  • Signif. codes 0 0.001 0.01 0.05
    . 0.1 1
  • (Dispersion parameter for binomial family taken
    to be 1)
  • Null deviance 30.885 on 31 degrees of
    freedom
  • Residual deviance 22.971 on 29 degrees of
    freedom

11
Interpretation
  • large confidence interval is because there are
    not many observations in all, and even fewer ESR
    gt 20mm/hr
  • The globulin coefficient is basically zero

12
R Commands
  • Anova(plasma_glm_1, plasma_glm_2, test
    Chisq)?
  • Compares the two models one of just fibrinogen,
    the other of both fibrinogen and globulin
  • Why Chisquared test?
  • ANOVA assumes normal distribution for each data
    set. Why is this OK?
  • Output
  • Analysis of Deviance Table
  • Model 1 ESR fibrinogen
  • Model 2 ESR fibrinogen globulin
  • Resid. Df Resid. Dev Df Deviance P(gtChi)
  • 1 30 24.8404
  • 2 29 22.9711 1 1.8692 0.1716
  • Comparing the residual deviance between the two
    models, we see that the difference is not
    significant. We must surmise that there is no
    association between globulin and ESR level.

13
R Commands
  • Prob lt- predict(plasma_glm_1, type response
  • The model of just fibrinogen useful in creating a
    neat looking bubble plot. First we must take the
    predicted values from the first model and use
    them to determine the size of the bubbles
  • Plot(globulin fibrinogen, data plasma, xlim
    c(2, 6), ylim c(25, 50), pch .)?
  • Plots the second model
  • Symbols(plasmafibrinogen, plasmaglobulin,
    circles prob, add TRUE)?
  • Uses the values of the first model to create
    different bubble sizes.

14
Bubbles size is Probability
  • The plot shows how the probability of having an
    ESR gt 20mm/hr increases as fibrinogen increases

15
Generalized Linear Model
  • Unifies the logistic regression, Analysis of
    Variance and multiple regression techniques.
  • GLMs have three essential parts
  • An error distribution- the distribution of the
    response variable.
  • Normal for Analysis of Variance and Multiple
    Regression
  • Binomial for Logistic Regression

16
Main parts of GLM (continued)?
  • A link function which links the explanatory
    variables to the expected value of the response.
  • Logit function for logistic regression
  • Identity function for ANOVA and multiple
    regression
  • A variance function which shows the dependency of
    the response variable variability on the mean

17
Measure of Fit
  • The deviance shows how well the model fits the
    data
  • Comparing two models deviances
  • Use a likelihood ratio test
  • Compare using Chi-square distribution

18
Womens Role in Society
  • Response to Women should take care of running
    their homes and leave the running the country up
    to men.
  • Factors Education, Sex
  • Response Agree or Disagree
  • data is presented as categories with counts for
    each education, sex combination

19
R Commands
  • Womensrole_glm_1 lt- glm(cbind(agree, disagree)
    sex education, data womensrole, family
    binomial())?
  • This uses the cbind function to change data from
    two responses to one response that is a matrix of
    agree, disagree counts.
  • Summary(womensrole_glm_1)?
  • We can see that education is a significant factor
    to the response.

20
Summary Output
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -2.72544 -0.86302 -0.06525 0.84340 3.13315
  • Coefficients
  • Estimate Std. Error z value Pr(gtz)
  • (Intercept) 2.50937 0.18389 13.646 lt2e-16
  • Why is there a P-value for the intercept?
  • sexFemale -0.01145 0.08415 -0.136 0.892
  • education -0.27062 0.01541 -17.560 lt2e-16
  • ---
  • Signif. codes 0 0.001 0.01 0.05
    . 0.1 1
  • (Dispersion parameter for binomial family taken
    to be 1)
  • Null deviance 451.722 on 40 degrees of
    freedom
  • Residual deviance 64.007 on 38 degrees of
    freedom

21
Declaring a function in R
  • Myplot lt- function(role.fitted)
  • Declares the function myplot and passes it an
    object role.fitted to be used in the function
  • f lt- womensrolesex Female
  • Stores everything in the data with sex femail
    in f
  • plot(womensroleeducation, role.fitted, type
    "n", ylab "probability of agreeing", xlab
    "Education", ylim c(0,1))?
  • Plots education against the role.fitted object
    which will be some predicted values from our GLM
    defined as
  • Role.fitted1 lt- predict(womensrole_glm_1, type
    response)?

22
Myplot function (cont)?
  • lines(womensroleeducation!f, role.fitted!f,
    lty 1)
  • lines(womensroleeducationf, role.fittedf,
    lty 2)
  • These graph the lines, one for males and one for
    femails. Lty indicates the kind of line (in this
    case solid or dotted)?
  • lgtxt lt- c("Fitted (Males)", "Fitted (Females)")
  • legend("topright", lgtxt, lty 12, bty "n")
  • These add a legend for each line.

23
Myplot Function (cont)?
  • ylt-womensroleagree/ (womensroleagree
    womensroledisagree)?
  • A basic calculation of the proportion of women
    that agree
  • text(womensroleeducation, y, ifelse(f, "\\VE",
    "\\MA"), family "HersheySerif", cex 1.25)?
  • Plots y and not y using male/female symbols

24
Interpretation
  • The two fitted lines indicate that sex does not
    change a probability of agreeing vs education
  • The symbols of unfitted observations may indicate
    an interaction between sex and education

25
MyPlot SexEducation
  • By running the same analysis, e.g.
  • Womensrole_glm_2 lt- glm(cbind(agree, disagree)
    sexeducation, data womensrole, family
    binomial())?
  • Summary(womensrole_glm_2)?
  • This summary shows a significant p-value for the
    sexeducation interaction
  • Role.fitted2 lt- predict(womensrole_glm_1, type
    response)?
  • Myplot(role.fitted2)?
  • The plot shows that less education is associated
    with agreement that women belong in the home.

26
The Deviance Residual
  • One of the many methods for checking the adequacy
    of the model fit
  • The deviance residual is the square root of the
    part of each observation that contributes to the
    deviance
  • Res lt- residuals( womensrole_glm_2, type
    deviance)?
  • Pulls the residuals for many other models as well
  • Plot(predict(womensrole_glm_2), res, xlab
    fitted values, ylab Residuals, lim
    max(abs(res))c(-1,1))?
  • No visible pattern fit appears ok
  • Abline(h 0, lty 2)?
  • Adds a dotted line at height 0

27
Drug Treatment TestingFamlilial Andenomatous
Polyposis (FAP)?
  • Counts of colonic polyps after 12 months of
    treatment
  • Dont want to be the guy that did that
  • Placebo controlled (Binary factor)?
  • Age

28
GLM Analysis
  • Multiple Regression wont work.
  • Count data strictly positive
  • Normality is not probable
  • Poisson Regression- GLM with a log link function
  • Ensures a Poisson Distribution
  • Ensures positive fitted amounts
  • R command
  • Polyps_glm_1 lt- glm(number treat age, data
    polyps, family poisson())?

29
Model Summary
  • Call
  • glm(formula number treat age, family
    poisson(), data polyps)?
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -4.2212 -3.0536 -0.1802 1.4459 5.8301
  • Coefficients
  • Estimate Std. Error z value Pr(gtz)
  • (Intercept) 4.529024 0.146872 30.84 lt 2e-16
  • treatdrug -1.359083 0.117643 -11.55 lt 2e-16
  • age -0.038830 0.005955 -6.52 7.02e-11
  • ---
  • Signif. codes 0 0.001 0.01 0.05
    . 0.1 1
  • (Dispersion parameter for poisson family taken to
    be 1)?
  • Null deviance 378.66 on 19 degrees of
    freedom
  • Residual deviance 179.54 on 17 degrees of
    freedom

30
Model Problem Overdispersion
  • In previous models the variance can be seen as
    dependent solely on the mean
  • E.G. Binomial, Poisson
  • In practice, this doesnt always work.
  • Sometimes the raw data points are not
    independent, there is some correlation, in this
    case, possible clustering
  • Compare residual deviance and degrees of freedom
    to determine
  • These should be basically equal

31
Model Solution Quasi-Likelihood
  • This procedure estimates the other factors that
    might contribute to the variance
  • R Command
  • Polyps_glm_2 lt- glm(number treat age, data
    polyps, family quasipoisson())?
  • Summary(polyps_glm_2)?
  • The coefficients are still significant, but less
    so

32
Homework
  • Please run through the commands for the myplot
    function (pg 100) and send me the command script.
  • Please send me what you thought of the
    presentation, give me a grade and add any
    constructive criticism.
  • zweihanderdawg_at_gmail.com
Write a Comment
User Comments (0)
About PowerShow.com