1
Simple Linear Regression: An Introduction
  • Dr. Tuan V. Nguyen
  • Garvan Institute of Medical Research
  • Sydney

2
  • Give a man three weapons: correlation, regression and a pen, and he will use all three. (Anon, 1978)

3
An example
Age and cholesterol levels in 18 individuals

ID   Age   Chol (mg/ml)
1    46    3.5
2    20    1.9
3    52    4.0
4    30    2.6
5    57    4.5
6    25    3.0
7    28    2.9
8    36    3.8
9    22    2.1
10   43    3.8
11   57    4.1
12   33    3.0
13   22    2.5
14   63    4.6
15   40    3.2
16   48    4.2
17   28    2.3
18   49    4.0
4
Read data into R
id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
plot(chol ~ age, pch = 16)

5
(Figure: scatter plot of cholesterol versus age produced by the plot() call above)
6
Questions of interest
  • Association between age and cholesterol levels
  • Strength of association
  • Prediction of cholesterol for a given age

Correlation and Regression analysis
7
Variance and covariance: algebra
  • Let x and y be two random variables from a sample of n observations.
  • Measure of variability of x and y: the variance.
  • Measure of covariation between x and y: the covariance.
  • Algebraically:
  • var(x + y) = var(x) + var(y) + 2cov(x, y)
  • var(x − y) = var(x) + var(y) − 2cov(x, y)
  • If x and y are independent, cov(x, y) = 0 and var(x ± y) = var(x) + var(y).
  • Where var(x) = Σ(xi − x̄)² / (n − 1) and cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)
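A quick numerical check of these identities in R; the two vectors are arbitrary illustrative values:

x <- c(46, 20, 52, 30, 57)
y <- c(3.5, 1.9, 4.0, 2.6, 4.5)
var(x + y)                        # variance of the sum
var(x) + var(y) + 2 * cov(x, y)   # the same value, by the identity above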

8
Variance and covariance: geometry
  • The independence or dependence between x and y
    can be represented geometrically

(Figure: x and y drawn as two sides of a triangle with angle H between them; h is the third side)

h² = x² + y² − 2xy·cos(H)   (dependent case, law of cosines)
h² = x² + y²                 (independent case, H = 90°)
9
Meaning of variance and covariance
  • Variance is always positive.
  • If the covariance is 0, x and y are uncorrelated (independent, under normality).
  • Covariance is a sum of cross-products and can be positive or negative.
  • Negative covariance: deviations in the two distributions are in opposite directions, e.g., genetic covariation.
  • Positive covariance: deviations in the two distributions are in the same direction.
  • Covariance is a measure of strength of association.

10
Covariance and correlation
  • Covariance is unit-dependent.
  • The coefficient of correlation (r) between x and y is a standardized covariance.
  • r is defined by r = cov(x, y) / [SD(x) × SD(y)]
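This standardization can be verified directly in R, using the age and chol vectors read in earlier:

cov(age, chol) / (sd(age) * sd(chol))  # covariance standardized by the SDs
cor(age, chol)                          # the same value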

11
Positive and negative correlation
(Figure: two scatter plots, one showing a positive correlation with r = 0.9 and one showing a negative correlation with r = −0.9)
12
Test of hypothesis of correlation
  • Hypothesis: H0: r = 0 versus H1: r ≠ 0.
  • Standard error of r: SE(r) = sqrt[(1 − r²) / (n − 2)]
  • The t-statistic: t = r / SE(r)
  • This statistic has a t distribution with n − 2 degrees of freedom.
  • Fisher's z-transformation: z = 0.5 × ln[(1 + r) / (1 − r)]
  • Standard error of z: SE(z) = 1 / sqrt(n − 3)
  • Then a 95% CI for z can be constructed as z ± 1.96 × SE(z). (A sketch of these formulas follows.)
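A minimal R sketch of these formulas, using the age and chol data from the earlier slides:

r <- cor(age, chol)
n <- length(age)
se.r <- sqrt((1 - r^2) / (n - 2))
r / se.r                            # t-statistic; compare to qt(0.975, n - 2)
z <- 0.5 * log((1 + r) / (1 - r))   # Fisher's z-transformation
z + c(-1.96, 1.96) / sqrt(n - 3)    # 95% CI on the z scale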

13
An illustration of correlation analysis
(Data: age, x, and cholesterol, y, for the 18 individuals listed in the table on slide 3)

Cov(x, y) = 10.68; r = 0.94
t-statistic = r / SE(r) = 10.7
Critical t-value with 16 df (n − 2) and alpha = 5% is 2.12.
Conclusion: there is a significant association between age and cholesterol.
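R's built-in functions reproduce these quantities (age and chol as before):

cov(age, chol)       # about 10.7
cor(age, chol)       # about 0.94
cor.test(age, chol)  # t = 10.7 on 16 df, p = 1.06e-08, matching the regression output below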
14
Simple linear regression analysis
  • Only two variables are of interest: one response variable and one predictor variable.
  • No adjustment is needed for confounding or covariates.
  • Assessment: quantify the relationship between the two variables.
  • Prediction: make predictions and validate a test.
  • Control: adjust for confounding effects (in the case of multiple variables).

15
Relationship between age and cholesterol
16
Linear regression model
  • Y: random variable representing a response.
  • X: random variable representing a predictor variable (predictor, risk factor).
  • Both Y and X can be categorical variables (e.g., yes/no) or continuous variables (e.g., age).
  • If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
  • Model: Y = a + bX + e
  • a: intercept
  • b: slope / gradient
  • e: random error (variation between subjects in Y even if X is constant, e.g., variation in cholesterol among patients of the same age).

17
Linear regression assumptions
  • The relationship is linear in terms of the parameters.
  • X is measured without error.
  • The values of Y are independent of each other (e.g., Y1 is not correlated with Y2).
  • The random error term (e) is normally distributed with mean 0 and constant variance.

18
Expected value and variance
  • If the assumptions are tenable, then:
  • The expected value of Y is E(Y | x) = a + bx
  • The variance of Y is var(Y) = var(e) = s²

19
Estimation of model parameters
Given two points A(x1, y1) and B(x2, y2) in a
two-dimensional space, we can derive an equation
connecting the points.
Gradient: m = (y2 − y1) / (x2 − x1) = dy / dx
Equation: y = mx + a, where a is the intercept (the value of y at x = 0).
What happens if we have more than 2 points?

(Figure: a straight line through points A(x1, y1) and B(x2, y2), with rise dy, run dx, and intercept a on the y-axis)
20
Estimation of a and b
  • For a series of pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn):
  • Let a and b be sample estimates of the population parameters α and β.
  • We then have a sample equation: ŷ = a + bx
  • Aim: find the values of a and b so that the sum of squared deviations (y − ŷ)² is minimal.
  • Let SSE = Σ(yi − a − bxi)².
  • The values of a and b that minimise SSE are called the least squares estimates.

21
Criteria of estimation
(Figure: scatter plot of Chol versus Age with the fitted line; d is the vertical distance from each observed yi to the line)

The goal of the least squares estimator (LSE) is to find a and b such that the sum of d² is minimal.
22
Estimation of a and b
  • After some calculus operations, the results can be shown to be
  • b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²
  • a = ȳ − b·x̄
  • where x̄ and ȳ are the sample means of x and y.
  • When the regression assumptions are valid, the estimators of a and b have the following properties:
  • Unbiased
  • Uniformly minimal variance (i.e., efficient)
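These formulas are easy to verify in R against lm(), using the age and chol data from earlier:

b <- sum((age - mean(age)) * (chol - mean(chol))) / sum((age - mean(age))^2)
a <- mean(chol) - b * mean(age)
c(a, b)               # least squares estimates computed by hand
coef(lm(chol ~ age))  # the same values from lm()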

23
Goodness-of-fit
  • Now we have the equation: Y = a + bX + e
  • Question: how well does the regression equation describe the actual data?
  • Answer: the coefficient of determination (R²), the proportion of the variation in Y that is explained by the variation in X.

24
Partitioning of variations: concept
  • SST: sum of squared differences between each yi and the mean of y.
  • SSR: sum of squared differences between the predicted values of y and the mean of y.
  • SSE: sum of squared differences between the observed and predicted values of y.
  • SST = SSR + SSE
  • The coefficient of determination is R² = SSR / SST. (A numerical check follows.)
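As a numerical check, these sums of squares can be computed directly in R from a fitted model (age and chol as before):

reg <- lm(chol ~ age)
sst <- sum((chol - mean(chol))^2)        # total variation
ssr <- sum((fitted(reg) - mean(chol))^2) # variation attributed to the model
sse <- sum(resid(reg)^2)                 # residual variation
c(sst, ssr + sse)                        # equal, by the partition SST = SSR + SSE
ssr / sst                                # R-squared, about 0.88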

25
Partitioning of variations: geometry

(Figure: scatter plot of Chol (Y) versus Age (X) showing the fitted line and the mean of Y; for one observation, the total deviation from the mean (SST) splits into the deviation of the fitted value from the mean (SSR) and the residual deviation from the fitted line (SSE))
26
Partitioning of variations: algebra
  • Some statistics:
  • Total variation: SST = Σ(yi − ȳ)²
  • Attributed to the model: SSR = Σ(ŷi − ȳ)²
  • Residual sum of squares: SSE = Σ(yi − ŷi)²
  • SST = SSR + SSE
  • SSR = SST − SSE
27
Analysis of variance
  • SS increases in proportion to sample size (n).
  • Mean squares (MS) normalise for degrees of freedom (df):
  • MSR = SSR / p (where p = the number of predictors, i.e., the regression degrees of freedom)
  • MSE = SSE / (n − p − 1)
  • MST = SST / (n − 1)
  • Analysis of variance (ANOVA) table:

Source       d.f.        Sum of squares (SS)   Mean squares (MS)   F-test
Regression   p           SSR                   MSR                 MSR/MSE
Residual     n − p − 1   SSE                   MSE
Total        n − 1       SST
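Using the sums of squares computed in the sketch above, the F-test can be reproduced by hand (one predictor, n = 18):

msr <- ssr / 1             # p = 1 predictor
mse <- sse / (18 - 1 - 1)  # n - p - 1 = 16 residual df
msr / mse                  # F-statistic, about 115; compare anova(reg) below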
28
Hypothesis tests in regression analysis
  • Now, we have:
  • Sample data: y = a + bx + e
  • Population: Y = α + βX + ε
  • H0: β = 0, i.e., there is no linear association between the outcome and the predictor variable.
  • In lay terms: if there were truly no association, what would be the chance of observing sample data at least as inconsistent with the null hypothesis as the data we actually observed? (This is the p-value.)

29
Inference about slope (parameter b)
  • Recall that e is assumed to be normally distributed with mean 0 and variance σ².
  • The estimate of σ² is the MSE (denoted s²).
  • It can be shown that:
  • The expected value of b is β, i.e., E(b) = β.
  • The standard error of b is SE(b) = s / sqrt[Σ(xi − x̄)²]
  • Then the test of whether β = 0 is t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom (16 in our example, as in the R output below).
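A hand computation of SE(b) and the t-statistic, checked against summary(reg) (a sketch; reg, age and chol as before):

s <- sqrt(sum(resid(reg)^2) / (length(chol) - 2))  # residual SD, about 0.30
se.b <- s / sqrt(sum((age - mean(age))^2))         # standard error of the slope
coef(reg)["age"] / se.b                            # t-statistic, about 10.7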

30
Confidence interval around predicted values
  • The observed value is yi; the predicted value is ŷi = a + bxi.
  • The standard error of the predicted mean response at x0 is SE(ŷ) = s × sqrt[1/n + (x0 − x̄)² / Σ(xi − x̄)²]
  • An interval estimate for ŷ is then ŷ ± t(0.975, n − 2) × SE(ŷ); for predicting an individual Yi value, the SE additionally includes the residual variance (1/n becomes 1 + 1/n).
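In R, predict() produces both kinds of interval from the fitted model (a usage sketch; the new age value of 50 is illustrative):

new <- data.frame(age = 50)
predict(reg, new, interval = "confidence")  # CI for the mean response
predict(reg, new, interval = "prediction")  # wider interval for an individual value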

31
Checking assumptions
  • Assumption of constant variance
  • Assumption of normality
  • Correctness of functional form
  • Model stability
  • All can be checked with graphical analysis. The residuals from the model, or a function of the residuals, play an important role in all of the model diagnostic procedures.

32
Checking assumptions
  • Assumption of constant variance:
  • Plot the studentized residuals versus the fitted values. Examine whether the variability of the residuals remains relatively constant across the range of fitted values.
  • Assumption of normality:
  • Plot the residuals versus their expected values under normality (normal probability plot). If the residuals are normally distributed, the points should fall along a 45° line.
  • Correct functional form?
  • Plot the residuals versus the fitted values. Examine the plot for evidence of a non-linear trend in the residuals across the range of fitted values.
  • Model stability:
  • Check whether one or more observations are influential, using Cook's distance. (A sketch of these checks in R follows.)
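A sketch of the four graphical checks with base R diagnostics on the fitted model reg:

plot(fitted(reg), rstudent(reg))         # constant variance: look for a uniform band
qqnorm(resid(reg)); qqline(resid(reg))   # normality: points should follow the line
plot(fitted(reg), resid(reg))            # functional form: no curved pattern
plot(cooks.distance(reg), type = "h")    # stability: tall spikes are influential points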

33
Checking assumptions (cont.)
  • Cook's distance (D) is a measure of the magnitude by which the fitted values of the regression model change if the ith observation is removed from the data set.
  • Leverage is a measure of how extreme the value of xi is relative to the remaining values of x.
  • The studentized residual provides a measure of how extreme the value of yi is relative to the remaining values of y.
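Each of these measures has a base R accessor (reg as before):

cooks.distance(reg)  # influence of each observation on the fitted values
hatvalues(reg)       # leverage of each x value
rstudent(reg)        # externally studentized residuals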

34
Remedial measures
  • Non-constant variance:
  • Transforming the response variable (y) to a new scale (e.g., logarithmic) is often helpful.
  • If no transformation can resolve the non-constant variance problem, use a more robust estimator such as iteratively reweighted least squares.
  • Non-normality:
  • Non-normality and non-constant variance go hand-in-hand.
  • Outliers:
  • Check for accuracy.
  • Use a robust estimator.
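A minimal sketch of both remedies on the cholesterol data (purely illustrative; the robust fit assumes the MASS package, whose rlm() fits by iterated re-weighted least squares):

log.reg <- lm(log(chol) ~ age)  # log-transformed response
library(MASS)
robust.reg <- rlm(chol ~ age)   # robust M-estimation fit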

35
Regression analysis using R
id <- seq(1, 18)
age <- c(46, 20, 52, 30, 57, 25, 28, 36, 22,
         43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1,
          3.8, 4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

# Fit the linear regression model
reg <- lm(chol ~ age)
summary(reg)

36
ANOVA result
> anova(reg)
Analysis of Variance Table

Response: chol
          Df  Sum Sq Mean Sq F value    Pr(>F)    
age        1 10.4944 10.4944  114.57 1.058e-08 ***
Residuals 16  1.4656  0.0916                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

37
Results of R analysis
> summary(reg)

Call:
lm(formula = chol ~ age)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.40729 -0.24133 -0.04522  0.17939  0.63040 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.089218   0.221466   4.918 0.000154 ***
age         0.057788   0.005399  10.704 1.06e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775,     Adjusted R-squared: 0.8698 
F-statistic: 114.6 on 1 and 16 DF,  p-value: 1.058e-08
38
Diagnostics: influential data

par(mfrow = c(2, 2))
plot(reg)
39
A non-linear illustration: BMI and sexual attractiveness
  • Study on 44 university students
  • Measure body mass index (BMI)
  • Sexual attractiveness (SA) score

id <- seq(1, 44)
bmi <- c(11.00, 12.00, 12.50, 14.00, 14.00, 14.00, 14.00, 14.00,
         14.00, 14.80, 15.00, 15.00, 15.50, 16.00, 16.50, 17.00,
         17.00, 18.00, 18.00, 19.00, 19.00, 20.00, 20.00, 20.00,
         20.50, 22.00, 23.00, 23.00, 24.00, 24.50, 25.00, 25.00,
         26.00, 26.00, 26.50, 28.00, 29.00, 31.00, 32.00, 33.00,
         34.00, 35.50, 36.00, 36.00)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
40
Linear regression analysis of BMI and SA
reg <- lm(sa ~ bmi)
summary(reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.54204 -0.97584  0.05082  1.16160  2.70856 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.92512    0.64489   7.637 1.81e-09 ***
bmi         -0.05967    0.02862  -2.084   0.0432 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376,    Adjusted R-squared: 0.07218 
F-statistic: 4.345 on 1 and 42 DF,  p-value: 0.04323
41
BMI and SA: analysis of residuals
plot(reg)
42
BMI and SA: a simple plot

par(mfrow = c(1, 1))
reg <- lm(sa ~ bmi)
plot(sa ~ bmi, pch = 16)
abline(reg)
43
Re-analysis of sexual attractiveness data
  • Fit 3 regression models:
  linear <- lm(sa ~ bmi)
  quad   <- lm(sa ~ poly(bmi, 2))
  cubic  <- lm(sa ~ poly(bmi, 3))
  • Make a new BMI axis:
  bmi.new <- 10:40
  • Get predicted values:
  quad.pred  <- predict(quad, data.frame(bmi = bmi.new))
  cubic.pred <- predict(cubic, data.frame(bmi = bmi.new))
  • Plot the predicted values:
  abline(reg)
  lines(bmi.new, quad.pred, col = "blue", lwd = 3)
  lines(bmi.new, cubic.pred, col = "red", lwd = 3)
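The three nested fits can also be compared formally with an F-test (a usage sketch):

anova(linear, quad, cubic)  # does each added polynomial term significantly improve the fit?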

44
(Figure: scatter plot of SA versus BMI with the linear fit, the quadratic fit in blue, and the cubic fit in red)
45
Some comments: Interpretation of correlation
  • Correlation lies between −1 and +1. A very small correlation does not mean that there is no association between the two variables; the relationship may be non-linear.
  • For curvilinear relationships, a rank correlation is better than Pearson's correlation.
  • A small correlation (e.g., 0.1) may be statistically significant but clinically unimportant.
  • R² is another measure of strength of association. An r of 0.7 may sound impressive, but R² is only 0.49!
  • Correlation does not mean causation.

46
Some comments: Interpretation of correlation (cont.)
  • Be careful with multiple correlations. For p variables, there are p(p − 1) / 2 possible pairs of correlations, and false positives are a problem.
  • The correlation between two variables cannot be inferred from their correlations with a third variable:
  • r(age, weight) = 0.05 and r(weight, fat) = 0.03 do not imply that r(age, fat) is near zero.
  • In fact, r(age, fat) = 0.79.

47
Some comments: Interpretation of regression
  • The fitted regression line is only an estimate of the relation between these variables in the population.
  • There is uncertainty associated with the estimated parameters.
  • The regression line should not be used to make predictions at x values outside the range of values in the observed data.
  • A statistical model is an approximation; the true relation may be nonlinear, but a linear model can be a reasonable approximation.

48
Some comments: Reporting results
  • Results should be reported in sufficient detail: the nature of the response and predictor variables, any transformations, checks of the assumptions, etc.
  • The regression coefficients (a, b), their associated standard errors, and R² are useful summaries.

49
Some final comments
  • Equations are the cornerstone on which the
    edifice of science rests.
  • Equations are like poems, or even an onion.
  • So, be careful when building your equations!