Correlation and Regression - PowerPoint PPT Presentation

Correlation and Regression

1
Correlation and Regression
Slides by Brad Evanoff, MD, MPH
Talk by Brian Gage, MD, MSc
2
Overview of Correlation and Regression
  • Correlation seeks to establish whether a
    relationship exists between two variables
  • Regression seeks to use one variable to predict
    another variable
  • Both measure the extent of a linear relationship
    between two variables
  • Statistical tests are used to determine whether the
    relationship is statistically significant

3
Nondependent and Dependent Relationships
  • Types of Relationship
    • Nondependent (correlation) -- neither variable is
      the target. Example: protein and fat intake
    • Dependent (regression) -- the value of one variable
      is used to predict the value of another variable.
      Example: ACT and MCAT scores for medical school
      applicants; MCAT is the dependent variable and ACT
      is the independent variable
  • Statistical Expressions
    • Correlation Coefficient -- index of nondependent
      relationship
    • Regression Coefficient -- index of dependent
      relationship

4
Example
  • Measure the daily fecal lipid and fecal energy
    for 20 children with cystic fibrosis
  • Plot each individual as a point on a graph which
    has fecal lipid on one axis and fecal energy on
    the other axis
  • What does the distribution of these values look
    like?

5
(No Transcript)
6
(No Transcript)
7
Pearson's Product-Moment Correlation Coefficient
  • The correlation coefficient, r, is a measure of
    the interdependent relationship between two
    continuous variables
  • For two variables, x and y, the correlation
    coefficient measures the extent to which greater
    values of x are associated with greater values of
    y

8
  • The value of r can range from -1 to 1
  • Absolute values close to 1, with either sign,
    will represent a close correlation
  • Values close to 0 will represent little or no
    correlation
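
As a minimal sketch of how r is computed from paired measurements, the function below uses the usual sums of squares and cross-products. The name `pearson_r` and the data values are made up for illustration and are not the cystic fibrosis data.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [2.0, 4.0, 6.0, 8.0, 10.0]    # greater x goes with greater y
falling = [10.0, 8.0, 6.0, 4.0, 2.0]   # greater x goes with smaller y

print(pearson_r(x, rising))
print(pearson_r(x, falling))
```

Points lying exactly on a rising line give r at the upper end of the range, and points on a falling line give r at the lower end.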

9
r = ?
10
r = ?
11
r = ?
12
r = ?
13
r = ?
14
r = ?
15
Importance of Scatterplots and Examining the Data
  • Scatterplot F shows the relationship between
    temperature and number of nerve fiber discharges
  • The scatterplot demonstrates a strong
    relationship
  • However, the correlation coefficient, which only
    measures a linear relationship, has a value of
    zero (Note that scatterplot E also has an r value
    of zero but clearly no relationship exists
    between the two variables)
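
The point about a strong relationship with r of zero can be reproduced with a small sketch. The data are hypothetical (an inverted-U curve, not the nerve fiber measurements), and `pearson_r` is an illustrative helper name.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y rises and then falls as x increases (an inverted U).
x = [-3, -2, -1, 0, 1, 2, 3]
y = [9 - xi ** 2 for xi in x]   # y is completely determined by x

print(pearson_r(x, y))  # the linear correlation vanishes despite the exact relationship
```

Here y is an exact function of x, yet the positive and negative cross-products cancel, which is why the scatterplot has to be examined alongside the coefficient.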

16
  • r values can be tested to see if an observed
    correlation is statistically significant
  • The same distinction between magnitude of effect
    and statistical significance must be made as for
    other tests - a large sample may make small
    correlations statistically significant yet
    clinically meaningless
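
One common way to test r is the t statistic with n - 2 degrees of freedom. The sketch below is an assumption about the test being referred to, and it uses a normal approximation for the p-value (reasonable only for large n) to stay within the Python standard library. The function names are illustrative.

```python
import math
from statistics import NormalDist

def t_statistic(r, n):
    """t statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def approx_two_sided_p(t):
    """Two-sided p-value from a normal approximation to the t distribution.

    Assumption: n is large enough that the t and normal distributions are
    close; an exact p-value would need the t distribution itself.
    """
    return 2 * (1 - NormalDist().cdf(abs(t)))

# The same small correlation, r = 0.05, at two hypothetical sample sizes.
for n in (50, 10_000):
    t = t_statistic(0.05, n)
    print(n, round(t, 2), approx_two_sided_p(t))
```

With the larger sample the same r = 0.05 produces a much larger t statistic, even though r^2 is only 0.0025, which illustrates the gap between statistical significance and magnitude of effect.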

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Coefficient of Determination, r^2
  • To understand the strength of the relationship
    between two variables, the correlation
    coefficient, r, is squared
  • r^2 shows how much of the variation in one
    measure (say, fecal energy) is accounted for by
    knowing the value of the other measure (fecal
    lipid loss)

21
  • For the cystic fibrosis patients, r = 0.42 and
    r^2 = 0.18
  • 18% of the variation in fecal energy may be
    accounted for by knowing fecal lipid loss
    (or vice versa)
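
Written out, the arithmetic behind these figures is as follows. This is a worked restatement of the slide's numbers, assuming the reported values were rounded to two decimal places.

```latex
r^2 = (0.42)^2 = 0.1764 \approx 0.18
\qquad\text{and}\qquad
1 - r^2 = 0.8236 \approx 0.82
```

So roughly 82% of the variation in one measure is not accounted for by the other.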

22
(No Transcript)
23
Caveats
  • Correlation does not imply causation
  • Correlation measures only linear association, and
    many biological systems are better described by
    curvilinear plots
  • This is one reason why data should always be
    looked at first (scatterplot)

24
  • The correlation coefficient assumes normally
    distributed data
  • The correlation coefficient is sensitive to
    extreme values
  • Non-normal distributions can be transformed
    (e.g., logarithmic transformation) or converted
    into ranks, and a non-parametric correlation test
    can be used (Spearman's rank correlation)
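
A minimal sketch of the rank-based approach: convert each variable to ranks, then apply the shortcut formula for Spearman's coefficient. The sketch assumes there are no tied values (ties need average ranks and a different formula). The data and function names are made up for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman rank correlation using the no-ties shortcut formula."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: y always increases with x, but the last value is extreme.
x = [1, 2, 3, 4, 5, 6]
y = [2, 3, 5, 7, 11, 400]

print(ranks(y))
print(spearman_rho(x, y))
```

Because only the ordering of the values enters the calculation, the extreme value 400 has no more influence than any other largest value would.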

25
Types of Coefficients
Type of Data                 Correlation Coefficient
Continuous v. Continuous     Pearson's r
Continuous v. Ordinal        Jaspen's Multiserial Coefficient (M)
Ordinal v. Ordinal           Spearman's ρ (Rho), Kendall's τ (Tau)
26
Linear Regression
  • Used when the goal is to predict the value of one
    characteristic from knowledge of another
  • Assumes a straight-line, or linear, relationship
    between two variables
  • However, a variable can be transformed first
  • When the term "simple" is used with regression, it
    refers to the situation where one explanatory
    (independent) variable is used to predict another
    variable
  • Multiple regression is used for more than one
    explanatory variable

27
  • If the point at which the line intercepts or
    crosses the Y-axis is β0 and the slope of the line
    is denoted as β1, then Y = β1X + β0
  • Like y = mx + b
  • The slope is a measure of how much Y changes for
    a one-unit change in X

28
(No Transcript)
29
  • Because the points rarely fall along a perfect
    straight line, there is also an error term e
  • The formula then becomes Y = β1X + β0 + e
  • The error term is a measure of the amount that
    the actual Y values depart from the Y values
    predicted by the equation
  • Regression lines are fitted using a method
    called least squares, which finds the line that
    minimizes the sum of the squared errors
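
As a sketch of the least-squares idea for a single predictor, the closed-form estimates are slope = Sxy / Sxx and intercept = mean(Y) - slope * mean(X). The function name `fit_line` and the four data points are hypothetical.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data points.
x = [1, 2, 3, 4]
y = [2, 4, 5, 8]

b0, b1 = fit_line(x, y)
# Errors: the amount by which each observed Y departs from the predicted Y.
errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in errors)   # the quantity that least squares minimizes

print(b0, b1, sse)
```

Any other straight line through these points would give a larger sum of squared errors than the fitted one.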

30
(No Transcript)
31
Example
  • Investigators want to be able to predict a
    potential medical school applicant's MCAT score
    from his or her previous ACT examination score
  • Create scatterplot of ACT and MCAT test scores
  • Calculate the regression equation for ACT scores
    and MCAT scores

32
r = ?
33
Y = -1.61 + 0.406X, where Y is the predicted
MCAT score and X is the ACT score
R = 0.62
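
Using the fitted equation is a matter of substitution. The sketch below assumes the two terms are added (intercept -1.61, slope 0.406). The function name and the ACT score of 30 are illustrative.

```python
INTERCEPT = -1.61   # value of Y when X = 0, taken from the slide
SLOPE = 0.406       # change in predicted MCAT per one-point change in ACT

def predict_mcat(act_score):
    """Predicted MCAT score from an ACT score using the slide's coefficients."""
    return INTERCEPT + SLOPE * act_score

# Hypothetical applicant with an ACT score of 30.
print(round(predict_mcat(30), 2))
# Two applicants one ACT point apart differ by the slope in predicted MCAT.
print(round(predict_mcat(31) - predict_mcat(30), 3))
```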
34
  • This model of simple linear regression can be
    extended to situations where there is more than
    one independent variable of interest
  • The equation below shows a model which predicts Y
    based on three independent variables, X1, X2,
    and X3
  • Y = β0 + β1X1 + β2X2 + β3X3

35
Multiple Regression
  • Just like simple linear regression, but with more
    variables
  • Allows the independent effects of several
    variables to be studied at once; can examine the
    contribution of any variable while controlling
    for the effects of other variables
  • Useful when predictor (independent) variables and
    the outcome (dependent) variable are numerical
    (continuous), e.g., weight, age, Hct

36
Multiple Regression
  • Y = estimated value for dependent (outcome)
    variable
  • β0 = intercept
  • βi = partial regression coefficients; each indicates
    how much Y changes for each unit of change in Xi
    when all other variables in the model are held
    constant
  • Xi = independent (predictor) variables
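
A minimal sketch of how the intercept and partial regression coefficients can be estimated by least squares, solving the normal equations with Gauss-Jordan elimination. This is one textbook approach, not necessarily what any statistics package does internally. The function name and data are hypothetical, and Y is constructed without error so the coefficients are known in advance.

```python
def fit_multiple_regression(predictors, y):
    """Least-squares coefficients [b0, b1, ..., bk] for y = b0 + b1*X1 + ... + bk*Xk.

    predictors is a list of k columns, each the same length as y.
    Raises ZeroDivisionError if the predictors are perfectly collinear.
    """
    n = len(y)
    rows = [[1.0] + [column[i] for column in predictors] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y).
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

# Hypothetical data built as Y = 1 + 2*X1 + 3*X2 with no error term.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [1 + 2 * a + 3 * b for a, b in zip(x1, x2)]

coefficients = fit_multiple_regression([x1, x2], y)
print([round(c, 6) for c in coefficients])
```

Each recovered slope is the change in Y per unit change in its own X with the other X held constant, which is the interpretation given in the bullets above.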

37
Multiple regression R
  • Multiple R = correlation coefficient; indicates
    correspondence between Y values predicted by the
    model and Y values observed
  • R^2 = amount of variability in Y explained by
    variation in the X variables contained in the
    model
  • Model calculates partial R values (correlation
    coefficients of individual variables) as well as
    R for the whole model
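
The two whole-model quantities can be sketched directly from observed and predicted Y values. The example assumes the predictions come from a least-squares fit with an intercept, in which case the square of multiple R equals R^2. The function names and numbers are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Proportion of the variability in Y explained by the model."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - sse / sst

def multiple_r(observed, predicted):
    """Correlation between observed Y values and model-predicted Y values."""
    mean_obs = sum(observed) / len(observed)
    mean_pred = sum(predicted) / len(predicted)
    cross = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return cross / math.sqrt(ss_obs * ss_pred)

# Hypothetical observed values, and the fitted values from the least-squares
# line through them (Y = 1.9 X for X = 1, 2, 3, 4).
observed = [2, 4, 5, 8]
predicted = [1.9, 3.8, 5.7, 7.6]

print(round(multiple_r(observed, predicted), 4))
print(round(r_squared(observed, predicted), 4))
```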

38
Results of Stepwise Regression Predicting
Resident Performance
39
Building A Multiple Regression Model
  • Usual case: picking a few significant variables
    from many candidate variables
  • Variables can be included because of clinical
    significance (forced into the model) or because
    of statistical significance
  • Statistical significance usually determined by a
    stepwise process

40
Forward Selection
  • Picks the X variable with the highest R and puts
    it in the model
  • Then looks for the X variable which will increase
    R^2 by the highest amount
  • Test for statistical significance performed
    (using the F test)
  • If statistically significant, the new variable is
    included in the model, and the variable with the
    next highest R^2 is tested
  • The selection stops when no variable can be added
    which significantly increases R^2
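
The procedure can be sketched as a loop over candidate variables. Several things here are assumptions rather than part of the slides: the partial F statistic is compared with a fixed "F-to-enter" cutoff of 4.0 instead of looking up a p-value, the first variable is tested the same way as later ones, and all names and data are hypothetical.

```python
def r_squared(columns, y):
    """R^2 of the least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations, solved by Gauss-Jordan elimination.
    aug = [
        [sum(row[i] * row[j] for row in rows) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    coef = [aug[i][k] / aug[i][i] for i in range(k)]
    mean_y = sum(y) / n
    sse = sum((yi - sum(c * v for c, v in zip(coef, row))) ** 2
              for row, yi in zip(rows, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - sse / sst

def forward_selection(candidates, y, f_to_enter=4.0):
    """Return candidate names in order of entry (f_to_enter is an assumed cutoff)."""
    n = len(y)
    selected, current_r2 = [], 0.0
    remaining = dict(candidates)
    while remaining:
        # Try each remaining variable; keep the one giving the highest R^2.
        base = [candidates[name] for name in selected]
        trial = {name: r_squared(base + [col], y) for name, col in remaining.items()}
        best = max(trial, key=trial.get)
        new_r2 = trial[best]
        df_resid = n - (len(selected) + 1) - 1
        if df_resid <= 0:
            break
        unexplained = 1.0 - new_r2
        # Partial F statistic for the increase in R^2 from adding one variable.
        f_stat = (float("inf") if unexplained <= 0
                  else (new_r2 - current_r2) / (unexplained / df_resid))
        if f_stat < f_to_enter:
            break
        selected.append(best)
        current_r2 = new_r2
        del remaining[best]
    return selected

# Hypothetical data: y depends on x1 plus a small error; x2 and x3 are extra candidates.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x2 = [5, 3, 8, 1, 9, 2, 7, 4, 6, 10, 12, 11]
x3 = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5]
error = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1, 0.0, 0.1]
y = [3 + 2 * a + e for a, e in zip(x1, error)]

candidates = {"x1": x1, "x2": x2, "x3": x3}
print(forward_selection(candidates, y))
```

In this made-up example x1 enters first because it explains almost all of the variation in y. Whether another variable enters depends on the assumed cutoff.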

41
Backwards Elimination
  • Starts with all variables in the model
  • Removes the X variable which results in the
    smallest change in R^2
  • Continues to remove variables from the model
    until removal produces a statistically
    significant drop in R^2

42
Stepwise regression
  • Similar to forward selection, but after each new
    X is added to the model, all X variables already in
    the model are re-checked to see if the addition
    of the new variable has affected their
    significance
  • Bizarre, but unfortunately true: running forward
    selection, backward elimination, and stepwise
    regression on the same data often gives different
    answers

43
Multiple Regression Caveats
  • Try not to include predictor variables which are
    highly correlated with each other
  • One X may force the other out, with strange
    results
  • Overfitting: too many variables make for an
    unstable model
  • Common rule of thumb: need > 10 subjects (or
    events) for each X variable
  • Model assumes normal distribution for Y variable
  • Widely skewed data may give misleading results
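
Two of these caveats lend themselves to quick checks before fitting a model. In the sketch below, the 0.8 cutoff for "highly correlated" is an assumption (the slides give no number), and the function names and data are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def highly_correlated_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs with |r| above threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
        r = pearson_r(col_a, col_b)
        if abs(r) > threshold:   # threshold is an assumed cutoff
            flagged.append((name_a, name_b, r))
    return flagged

def enough_subjects(n_subjects, n_predictors, per_predictor=10):
    """Rule of thumb from the slide: more than 10 subjects per X variable."""
    return n_subjects > per_predictor * n_predictors

# Hypothetical predictors; the two weight columns carry nearly the same information.
predictors = {
    "weight_kg": [60, 72, 85, 90, 55, 68],
    "weight_lb": [132, 158, 187, 198, 121, 150],
    "age": [34, 51, 29, 60, 45, 38],
}

for name_a, name_b, r in highly_correlated_pairs(predictors):
    print(name_a, name_b, round(r, 3))
print(enough_subjects(n_subjects=6, n_predictors=3))
```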

44
(No Transcript)
45
(No Transcript)