Regression04: 1 - PowerPoint PPT Presentation

Slides: 40
Provided by: Penelop84
Learn more at: http://people.umass.edu

Transcript and Presenter's Notes

1
An Introduction to REGRESSION AND CORRELATION
2
  • How do we measure the association of 2 continuous, numeric scale
    variables?
  • Example: observations are available on a sample of 30 individuals for
    systolic blood pressure (SBP) and age.
  • We are interested in the relationship between SBP and age
    for these patients (descriptive) and for the population which they
    represent (inferential).

3
Data on 30 individuals
4
  • Note: we have 30 pairs of observations, which we can denote as
  • $(x_1, y_1) = (39, 144)$
  • $(x_2, y_2) = (47, 220)$
  • ...
  • $(x_{30}, y_{30}) = (69, 175)$
  • where $x_i$ refers to age for the ith subject
  • and $y_i$ to SBP for the ith subject

5
These data pairs may be considered as points
in two dimensional space, so that we may plot
them on a graph.
[Figure: scatter diagram of age and systolic blood pressure; horizontal axis AGE in years, vertical axis SBP (mm Hg)]
  • Note: age and SBP seem to be related.
  • Younger subjects tend to have lower SBP, older subjects higher SBP.
6
How can this relationship be measured?
[Figure: three scatter plots of y against x]
  • No relationship between x and y: spread is even in all directions.
  • Linear relationship: a line indicates the main direction of the spread
    of points.
  • Non-linear relationship between x and y: a curve best describes the
    relationship.
7
Math Review: Equation for a Line
$y = \beta_0 + \beta_1 x$
$\beta_0$ = y-intercept = value of y when $x = 0$
$\beta_1$ = slope = $\Delta y / \Delta x$ = (change in y)/(change in x)
8
$\beta_1$ = slope = $\Delta y / \Delta x$ = (change in y)/(change in x)

Slope > 0: positive slope (as x increases, y increases)
Slope = 0: flat line (y does not change as x increases)
Slope < 0: negative slope (as x increases, y decreases)

9
  • Now, given a set of data, how can we get the line
    that best fits or best represents the data?
  • When it is appropriate to predict one variable
    (y) from another variable (x), that is, when there is some
    directionality in the relationship, we
  • commonly use a technique known as Least Squares
    Regression to estimate
  • the intercept $\beta_0$
  • the slope $\beta_1$
  • Denote the estimates $\hat\beta_0$ and $\hat\beta_1$, respectively
    (referred to as beta-nought-hat and beta-one-hat)
10
We are looking for that line which minimizes the
vertical distances to the data points.

  • For each observed value $x_i$, we have
  • an observed $y_i$, and the
  • predicted value $\hat y_i$ on the line, $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
  • The vertical distances are $d_i = (y_i - \hat y_i)$.






11
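That is, we have
$x_i$ = observed x for the ith subject
$y_i$ = observed y for the ith subject
$\hat y_i$ = predicted y for the ith subject
[Figure: fitted line in the x-y plane, showing an observed point $(x_i, y_i)$ and the corresponding predicted point $(x_i, \hat y_i)$ on the line]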
12
The squared distances are $d_i^2 = (y_i - \hat y_i)^2$,
and the sum of squared deviations from the line
(sound familiar?) is $\sum d_i^2 = \sum (y_i - \hat y_i)^2$. We want
the line such that this sum (the error sum of squares, SSE) is minimized.
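As a minimal sketch of the quantity being minimized, the snippet below computes the sum of squared vertical distances for two candidate lines. The data and both candidate lines are hypothetical, invented only for illustration; they are not the SBP and age data from the slides.

```python
def sse(x, y, b0, b1):
    """Sum of squared vertical distances between observed y and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data, invented only to illustrate the calculation
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sse_a = sse(x, y, b0=0.0, b1=2.0)  # candidate line A: y = 2x
sse_b = sse(x, y, b0=1.0, b1=1.5)  # candidate line B: y = 1 + 1.5x

print(f"SSE for line A: {sse_a:.2f}")
print(f"SSE for line B: {sse_b:.2f}")
```

Line A lies closer to these points, so its sum of squares is smaller; least squares picks the line for which this sum is as small as possible.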


13
  • The unbiased estimates of $\beta_0$ and $\beta_1$, which are
  • the least squares estimates and
  • the minimum variance estimates,
  • are obtained by using calculus to minimize the sum of squared
    deviations in the previous equations.
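For reference, the standard textbook closed-form solution is sketched below. It is the usual least squares result rather than a copy of the slide's own formulas; $\bar x$ and $\bar y$ denote the sample means, and hats mark estimates.

```latex
% Set both partial derivatives of the sum of squares to zero
\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0,
\qquad
\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 = 0

% Solving the two equations gives
\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n} (x_i - \bar x)^2},
\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x
```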
14
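  • Example: using the data on 30 individuals where we measured
  • AGE (x)
  • SBP (y)
  • $n = 30$, $\bar y = 142.53$, $\bar x = 45.13$
  • We get $\hat\beta_1 = 0.9709$ and $\hat\beta_0 = 98.71$ (the
    coefficients reported in the Minitab output on slide 25)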
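The calculation can be sketched in a few lines of Python. The function `fit_line` and its demo data are hypothetical helpers written for this illustration. Because the 30 raw data pairs are not listed in the transcript, the second part only cross-checks the intercept against the summary statistics, taking the slope 0.9709 from the Minitab output on slide 25.

```python
def fit_line(x, y):
    """Least squares intercept and slope for paired data."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical, exactly linear data (y = 1 + 2x), used only as a sanity check
demo_b0, demo_b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(demo_b0, demo_b1)

# Cross-check with the slide's summary statistics:
# the intercept must satisfy b0 = y_bar - b1 * x_bar
x_bar, y_bar = 45.13, 142.53
b1_hat = 0.9709  # slope as reported in the Minitab output
b0_hat = y_bar - b1_hat * x_bar
print(round(b0_hat, 2))
```

The rounded intercept agrees with the `Constant` coefficient in the Minitab output.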

15
Thus, the equation for this straight line is
given by $\hat y = 98.7 + 0.971\,x$, where y is SBP and x is AGE.
[Figure: plot of the fitted line against AGE]
16
  • Now, with SSE = $\sum (y_i - \hat y_i)^2$,
  • if $y_i = \hat y_i$ for all i, then SSE = 0 ⇒ perfect fit
    to the line
  • As the fit gets worse, SSE gets larger
  • SSE serves as a measure of fit to the line








17
  • One of the assumptions for regression analysis is
    that of homoscedasticity:
  • the variance of y is the same for any x; that is,
    the spread of values for y at each level of x
    remains constant

[Figure: spread of y given x ($y|x$) compared with the spread of y ignoring x]
18
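An estimate of $\sigma^2$ is given by
$s_{y|x}^2 = \frac{SSE}{n-2} = \frac{\sum (y_i - \hat y_i)^2}{n-2}$
(lose 2 df for estimating $\beta_0$ and $\beta_1$)
  • The standard error, $s_{y|x}$,
  • is a measure of the spread of y
  • around its predicted value $\hat y$
  • for each value of x.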


19
  • In our example, $s_{y|x}^2 = SSE/(n-2) = 8393.4/28 = 299.8$
  • And the estimated standard error is $s_{y|x} = \sqrt{299.8} = 17.31$
  • That is,
  • for any given age x,
  • the standard error of SBP is estimated as 17.31
    mm Hg.
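The arithmetic behind this estimate can be sketched as follows, taking the error sum of squares (8393.4) and the sample size (30) from the Minitab output on slide 25. The variable names are illustrative only.

```python
import math

sse = 8393.4  # error sum of squares from the Analysis of Variance table
n = 30        # number of subjects

s2_yx = sse / (n - 2)     # lose 2 df for the intercept and the slope
s_yx = math.sqrt(s2_yx)   # estimated standard error, in mm Hg

print(round(s2_yx, 1), round(s_yx, 2))
```

These match the `MS` for Error and the value of `S` in the Minitab output.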

20
  • To address the question of association of x and y,
  • we want to know if the slope is zero
  • $H_0: \beta_1 = 0$
  • $H_a: \beta_1 \neq 0$

[Figure: plot with AGE on the horizontal axis]
21
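  • Now, if we assume
  • that for any fixed value of x
  • y is normally distributed,
  • then we can show that the estimated slope $\hat\beta_1$ is normally
    distributed about $\beta_1$, with a variance proportional to $\sigma^2$
  • In practice, since $\sigma^2$ is unknown,
  • use $s_{y|x}^2$ in place of $\sigma^2$
  • use the t-distribution, with n-2 df,
  • for hypothesis testing and CIs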

22
  • With these assumptions, to test
  • $H_0: \beta_1 = 0$
  • $H_a: \beta_1 \neq 0$
  • Test statistic: $t = \hat\beta_1 / SE(\hat\beta_1)$, with n-2 df
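Written out in full, the usual textbook form of this test is sketched below. It is the standard result for simple linear regression, stated here as an assumption about what the slide displayed; $s_{y|x}$ is the estimated standard error of y about the line.

```latex
t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)},
\qquad
SE(\hat\beta_1) = \frac{s_{y|x}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}},
\qquad
t \sim t_{n-2} \ \text{under } H_0
```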

23
In our example, $t = 0.9709 / 0.2102 = 4.62$ with 28 df. The achieved
significance is then $p < 0.001$ (reported by Minitab as 0.000). With
$p < .05$, reject $H_0$ and conclude that
age (x) provides significant information for
predicting SBP (y).
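A rough sketch of this test using only the Python standard library is given below. The slope and its standard error come from the Minitab output on slide 25. The helper functions `t_pdf` and `two_sided_p` are hypothetical names, and the p-value is obtained by numerically integrating the t density with Simpson's rule, which is a swapped-in approximation and not how Minitab computes it.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t_stat, df, steps=10_000):
    """Two-sided p-value as 1 - 2 * (area under the density from 0 to |t|).

    The area is approximated with Simpson's rule; steps must be even.
    """
    upper = abs(t_stat)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

b1_hat = 0.9709  # estimated slope (Minitab "Coef" for age)
se_b1 = 0.2102   # its standard error (Minitab "SE Coef")
n = 30

t_stat = b1_hat / se_b1
p_value = two_sided_p(t_stat, n - 2)

print(round(t_stat, 2))
print(p_value < 0.05)
```

The statistic rounds to the 4.62 shown by Minitab, and the p-value falls well below .05, so the conclusion to reject the hypothesis of zero slope is reproduced.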
24
In Minitab, enter the data in 2 columns, for SBP
and AGE, and select Stat → Regression → Regression.
Response is the Y variable
Predictor is the X variable
25
Regression Analysis: spb versus age

The regression equation is
spb = 98.7 + 0.971 age

Predictor     Coef   SE Coef      T      P
Constant     98.71     10.00   9.87  0.000
age         0.9709    0.2102   4.62  0.000

S = 17.31   R-Sq = 43.2%   R-Sq(adj) = 41.2%

Analysis of Variance
Source       DF        SS       MS      F      P
Regression    1    6394.0   6394.0  21.33  0.000
Error        28    8393.4    299.8
Total        29   14787.5
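The summary figures in this output are tied together by a few standard identities, sketched below using only the sums of squares and degrees of freedom from the table. The variable names are illustrative.

```python
ss_regression = 6394.0
ss_error = 8393.4
ss_total = 14787.5
n = 30

df_regression = 1
df_error = n - 2
df_total = n - 1

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_stat = ms_regression / ms_error        # F = MS(Regression) / MS(Error)
r_sq = ss_regression / ss_total          # R-Sq = SS(Regression) / SS(Total)
r_sq_adj = 1 - ms_error / (ss_total / df_total)  # R-Sq adjusted for df

print(round(f_stat, 2))
print(round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
```

With a single predictor, F is also the square of the t statistic for the slope (4.62 squared is about 21.3).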
26
  • You'll note that a significance test is also
    provided for $\beta_0$
  • $H_0: \beta_0 = 0$ vs. $H_a: \beta_0 \neq 0$
  • T = 9.87, P = 0.000
  • We are rarely interested in tests of $\beta_0$.
  • It is often outside of the range of the data
    (e.g., here the youngest age is 20)
  • In this case it can be interpreted as the
    predicted SBP at age = 0: not meaningful.
  • It is inappropriate to interpret regression
    relationships outside the range of the observed
    data.
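One way to respect this in practice is to refuse predictions outside the observed age range, as in the hedged sketch below. The function `predict_sbp` is a hypothetical helper; the coefficients come from the Minitab output, the lower bound of 20 is the youngest age mentioned above, and the upper bound of 69 is only an assumption (it is the largest age shown in the slide 4 excerpt, not necessarily the sample maximum).

```python
def predict_sbp(age, age_min, age_max, b0=98.71, b1=0.9709):
    """Predicted SBP (mm Hg) from the fitted line, within the observed age range only."""
    if not (age_min <= age <= age_max):
        raise ValueError(
            f"age {age} is outside the observed range [{age_min}, {age_max}]"
        )
    return b0 + b1 * age

# Assumed observed range: 20 to 69 years (the upper bound is an assumption)
print(round(predict_sbp(50, 20, 69), 1))

try:
    predict_sbp(0, 20, 69)  # the intercept corresponds to age = 0
except ValueError as err:
    print("refused:", err)
```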

27
  • A better model might exist
  • (e.g., one with a curvilinear term)
  • but there is a linear component.

[Figure: two scatter plots of y against x]
  • Here, a curve would provide a better fit
  • Linear model fits better than $y = \bar y$

28
[Figure: two scatter plots of y against x]
  • x provides little or no help in predicting y,
  • or the true relationship between x and y is not linear.
  • Note: even when $H_0: \beta_1 = 0$ is rejected, some other
    non-linear model may be better
29
  • Part 2: The Correlation Coefficient
  • Provides a measure of how 2 random variables are
    associated, without assuming any direction to the
    association (i.e., no sense that x is predictive
    of y, just that they are related)
  • Also a measure of the strength of the
    straight-line relationship between X and Y
  • It can also be shown that

30
  • Characteristics of the correlation coefficient r
  • $-1 \le r \le 1$
  • -1 implies perfect negative correlation
  • 0 implies no linear correlation
  • 1 implies perfect positive correlation
  • r is dimensionless; it is independent of the units
    of x or y
  • r always has the same sign as the slope
  • r is the sample estimator of the population
    correlation $\rho$

31
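  • If we
  • divide the data into 4 quadrants by lines at the
    means of x and y
  • and for each point, examine the direction of the
    deviation from these means

For $(x_i, y_i)$, examine the sign (+/-) of $(x_i - \mu_x)$
and $(y_i - \mu_y)$ for each quadrant.
[Figure: the x-y plane divided by lines at $\mu_x$ and $\mu_y$ into quadrants I (upper right), II (upper left), III (lower left) and IV (lower right)]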
32
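Quadrant   $(x_i - \mu_x)$   $(y_i - \mu_y)$   $(x_i - \mu_x)(y_i - \mu_y)$
I                 +                 +                       +
II                -                 +                       -
III               -                 -                       +
IV                +                 -                       -

Covariance between x and y: $s_{xy} = \frac{1}{n-1}\sum (x_i - \bar x)(y_i - \bar y)$
[Figure: the four quadrants formed by lines at $\mu_x$ and $\mu_y$]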
33
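Correlation between x and y: $r = s_{xy} / (s_x s_y)$
Now, if points look like
[Figure: two scatter plots]
  • Since most points are in QI and QIII, $s_{xy} > 0 \Rightarrow r > 0$, $\hat\beta_1 > 0$
  • Since most points are in QII and QIV, $s_{xy} < 0 \Rightarrow r < 0$, $\hat\beta_1 < 0$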
34
Since points are spread over all 4 quadrants, $s_{xy} \approx 0 \Rightarrow r \approx 0$, $\hat\beta_1 \approx 0$
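The quadrant argument can be sketched in Python as follows. The data are hypothetical and invented for illustration, and the helper names `covariance`, `correlation` and `quadrant_counts` are not from the slides. Sample means stand in for the population means, and points lying exactly on a dividing line are not counted.

```python
import math

def covariance(x, y):
    """Sample covariance with divisor n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def correlation(x, y):
    """Pearson correlation: covariance divided by the two standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def quadrant_counts(x, y):
    """Count points in each quadrant formed by lines at the sample means."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    counts = {"I": 0, "II": 0, "III": 0, "IV": 0}
    for xi, yi in zip(x, y):
        dx, dy = xi - x_bar, yi - y_bar
        if dx > 0 and dy > 0:
            counts["I"] += 1
        elif dx < 0 and dy > 0:
            counts["II"] += 1
        elif dx < 0 and dy < 0:
            counts["III"] += 1
        elif dx > 0 and dy < 0:
            counts["IV"] += 1
    return counts

# Hypothetical data in which most points fall in quadrants I and III
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

print(quadrant_counts(x, y))
print(covariance(x, y) > 0, correlation(x, y) > 0)
```

Four of the six points fall in quadrants I and III, where the product of deviations is positive, so both the covariance and r come out positive.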
35
[Figure: two scatter plots, (a) and (b), each with a fitted line]

Correlation, r, in (a) is greater than r in (b),
since the points are closer to the line in (a). This is
true even when the slopes are the same.
36
  • Testing Hypotheses on Correlation
  • To test
  • $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$
  • Use $t = r\sqrt{n-2} / \sqrt{1 - r^2}$, with n-2 df
  • It is identical to testing for $\beta_1 = 0$


37
In Minitab: Stat → Basic Stats → Correlation
Correlations: sbp, age
Pearson correlation of sbp and age = 0.658
P-Value = 0.000
38
Note that the Regression Analysis results provide
a value for $r^2$ (see slide 25): R-Sq = 43.2%.
Use this to compute $r = \sqrt{.432} = .657$.
We also have the significance test for zero
correlation, $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$, since it is
identical to the test of zero slope:
T = 4.62, P = 0.000
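A short numerical check of this equivalence is sketched below. It recomputes r from the unrounded sums of squares in the Analysis of Variance table on slide 25, then applies the standard t statistic for a correlation, $t = r\sqrt{n-2}/\sqrt{1-r^2}$, which is stated here as the usual textbook formula. The positive square root is taken because the fitted slope is positive.

```python
import math

ss_regression = 6394.0
ss_total = 14787.5
n = 30

r_sq = ss_regression / ss_total
r = math.sqrt(r_sq)  # positive root, since the slope is positive

# Standard t statistic for testing zero correlation, with n - 2 df
t_from_r = r * math.sqrt(n - 2) / math.sqrt(1 - r_sq)

print(round(r, 3))
print(round(t_from_r, 2))
```

Using the unrounded R-Sq gives r of about 0.658, matching Minitab's Pearson correlation; the .657 above comes from taking the square root of the rounded value .432. The t statistic agrees with the 4.62 reported for the slope.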
39
  • Regression and Correlation Analysis are closely
    related
  • Correlation evaluates the strength of a linear
    association
  • Does not impose any directionality on the
    relationship
  • Regression evaluates the strength of a linear
    relationship (slope of line)
  • Direction is imposed (e.g., age → SBP rather
    than the reverse)
  • Significance test on the slope, $\beta_1$, is equivalent to the
    significance test on the correlation, $\rho$
