Correlation and Regression - PowerPoint PPT Presentation

Provided by: lawrence50

Transcript and Presenter's Notes


1
Correlation and Regression
  • (linear curve fitting)

2
correlation vs. regression
correlationish data
regressionish data
3
correlation/regression commonalities
  • assesses strength of relationship
  • operates on interval or ratio data
  • assumes a linear relationship

4
correlation / regression differences
  • regression
  • experimenter determines x
  • (ideally) multiple measurements at each x
  • produces equation (allowing prediction)
  • correlation
  • y and x random (bivariate normal)
  • assesses strength of (linear) relationship only

5
An Important Point
Linear regression and correlation both represent
a special case of general curve fitting. In
nature, linear relationships are rare.
Mathematically, they are easy.
6
scatterplots
7
intuitive definitions
  • correlation: the degree to which data points
    cluster about a line (of non-zero slope)
  • regression: determination of the line about which
    the data points cluster
  • (these are intimately related, obviously)

8
towards correlation: the covariance
  • obviously related to variance
  • is just a number that, all else being equal,
    will be largest when x = y (the Cauchy-Schwarz
    inequality)
  • (also gets larger with variability of x or y),
    so

9
The Pearson Product-Moment Correlation
Coefficient (Pearson's r)
10
Pearson's r
11
Pearson's r (cont.)
  • still follows the Cauchy-Schwarz inequality
  • is independent of the variability of x or y
    (which has been literally factored out)
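
The two bullets above can be checked numerically. The following is a minimal sketch, assuming the usual sample definitions (covariance with an n-1 denominator, and r as the covariance divided by the product of the two standard deviations); the function names and the synthetic data are illustrative only and do not come from the slides.

```python
import numpy as np

def covariance(x, y):
    """Sample covariance with an n-1 denominator."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

def pearson_r(x, y):
    """Covariance with the variability of x and y factored out."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return covariance(x, y) / (x.std(ddof=1) * y.std(ddof=1))

# Synthetic data (an assumption for illustration): y depends linearly on x plus noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

cov_xy = covariance(x, y)
cov_scaled = covariance(10 * x, y)   # stretching x inflates the covariance tenfold
r_xy = pearson_r(x, y)
r_scaled = pearson_r(10 * x, y)      # but leaves r unchanged

print(cov_xy, cov_scaled)
print(r_xy, r_scaled)
```

Rescaling x multiplies the covariance by the same factor, while r stays put and remains between -1 and 1.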

12
some example r's
r = 0.1
r = 0.4
r = 0.6
r = 0.8
13
cautionary notes on r
  • we have not obviously fit a straight line to the
    data in computing r
  • but we have nonetheless
  • r assumes a linear relationship
  • or, more technically correct, r assesses the
    strength of whatever linear relationship does
    exist, independent of any other relationships
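
A small illustration of this caution, as a sketch with made-up data: a perfect but purely quadratic relationship can have essentially zero r, because there is no linear component for r to pick up.

```python
import numpy as np

# x values symmetric about zero (an illustrative choice).
x = np.linspace(-1.0, 1.0, 101)

y_parabola = x ** 2        # y is completely determined by x, but not linearly
y_line = 2.0 * x + 1.0     # y is an exact linear function of x

r_parabola = np.corrcoef(x, y_parabola)[0, 1]
r_line = np.corrcoef(x, y_line)[0, 1]

print(r_parabola)  # essentially zero
print(r_line)      # essentially one
```

A near-zero r therefore says only that there is no linear trend, not that x and y are unrelated.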

14
more example correlations
15
Adjusted r
  • it exists
  • it is a good idea
  • it is rarely reported
  • (use it as a diagnostic)

16
(demo)
17
The Sampling Distribution of r
  • is about 1/sqrt(n-1) wide
  • influenced by n
  • influenced by r itself
  • can be transformed into t
  • it buys us all the same things that any sampling
    distribution does
  • confidence intervals
  • hypothesis testing

18
Confidence Intervals about r
  • For low to middling r and middling to high n, r
    is distributed as a Gaussian with
  • μ = r and σ = 1/sqrt(n-1)
  • So 95% CIs would just be r +/- 2σ, or r +/- 2/sqrt(n-1)
  • If there's any doubt (or anything important in
    the balance!), I would simulate the distribution
19
Another route to CIs
let r' = 0.5 ln((1+r)/(1-r)) (the Fisher transformation)
then r' is roughly normal with σ_r' = 1/sqrt(n-3)
(this is clever but, again, it is so easy to
simulate)
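
A sketch of this route, assuming the transformation intended here is the Fisher transformation r' = 0.5 ln((1+r)/(1-r)), which NumPy provides as arctanh. The interval is built on the transformed scale and mapped back with tanh; the function name and the example values r = 0.5, n = 30 are illustrative.

```python
import numpy as np

def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via the Fisher transformation (assumed form)."""
    r_prime = np.arctanh(r)            # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / np.sqrt(n - 3)          # standard deviation on the transformed scale
    low = np.tanh(r_prime - z_crit * se)
    high = np.tanh(r_prime + z_crit * se)
    return low, high

low, high = fisher_ci(0.5, 30)
print(low, high)
```

Unlike the symmetric +/- 2/sqrt(n-1) interval, this one is asymmetric about r and can never extend past -1 or 1.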
20
Hypothesis testing on r
  • we have the sampling distributions
  • is r non-zero? where is zero with respect to
    the sampling distribution of r?
  • are two r's different? where is zero with
    respect to the sampling distribution of r1-r2?

21
A shortcut for simple hypothesis testing
Pearson's r to Student's t
t = r sqrt(n-2) / sqrt(1 - r^2)
which is distributed as t on n-2 df
once again, I'm more comfortable with
simulating the sampling distribution
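
Both approaches can be sketched side by side. The conversion below assumes the standard formula t = r sqrt(n-2) / sqrt(1 - r^2); the simulated alternative shown is a permutation test, which is one possible way to simulate the null distribution and not necessarily the presenter's. All names and data are illustrative.

```python
import numpy as np

def r_to_t(r, n):
    """Convert Pearson's r to a t statistic on n-2 degrees of freedom."""
    return r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)

def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided p-value for r = 0, from shuffling y to break any pairing."""
    rng = np.random.default_rng(seed)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        r_perm = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return (count + 1) / (n_perm + 1)

t_val = r_to_t(0.5, 27)

# Synthetic, strongly related data (an assumption for illustration).
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = x + 0.5 * rng.normal(size=30)
p_val = permutation_p(x, y)

print(t_val, p_val)
```

The t value would be looked up against a t distribution with n-2 df; the permutation p-value needs no distributional table at all.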
22
Linear Regression
23
Which of all possible lines best describes my
data?
Find the line that is simultaneously closest to
all the data.
In other words, minimize the quantity g
24
Finding the best fit line
y = mx + b
bad offset, high g
bad slope, high g
good fit, low g
25
The fitting function
So g = sum_i (y_i - yhat_i)^2
and yhat_i = m x_i + b
or (more importantly) g(m, b) = sum_i (y_i - m x_i - b)^2
26
The fitting surface
[Figure: the surface g plotted over axes m and b, with the best-fit (m, b) marked]
27
Finding min(g)
  • cast about randomly
  • evaluate local slope, move downhill
  • etc.
  • in the case of a line, setting dg/dm and dg/db
    to 0 yields 2 equations that have a closed-form
    solution!
  • (which is why everyone does linear regressions)
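
The "evaluate local slope, move downhill" strategy can be sketched as plain gradient descent. This assumes g is the sum of squared vertical distances, g(m, b) = sum_i (y_i - m x_i - b)^2, which the slides imply but do not spell out in the transcript; the step size, iteration count, and data are illustrative choices.

```python
import numpy as np

def g(m, b, x, y):
    """Assumed fitting function: sum of squared residuals about y = m x + b."""
    return np.sum((y - (m * x + b)) ** 2)

def descend(x, y, lr=0.01, steps=5000):
    """Walk downhill on g from (m, b) = (0, 0) using its local slopes."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        resid = y - (m * x + b)
        dg_dm = -2.0 * np.sum(resid * x)   # partial derivative of g w.r.t. m
        dg_db = -2.0 * np.sum(resid)       # partial derivative of g w.r.t. b
        m -= lr * dg_dm / n                # scaled by n to keep the step size stable
        b -= lr * dg_db / n
    return m, b

# Synthetic data around the line y = 1.5 x + 2 (illustrative).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = 1.5 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

m_gd, b_gd = descend(x, y)
m_ref, b_ref = np.polyfit(x, y, 1)         # closed-form least squares for comparison

print(m_gd, b_gd)
print(m_ref, b_ref)
```

The iterative search ends up at the same (m, b) as the closed-form answer, after thousands of steps that the closed form makes unnecessary.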

28
Best fit linear parameters
Solving dg/dm = 0 and dg/db = 0 simultaneously
gives the regression equations, or the normal
equations
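
The equations themselves did not survive in this transcript. Assuming g is the sum of squared residuals, the standard closed-form solution is m = sum((x - mean(x)) (y - mean(y))) / sum((x - mean(x))^2) and b = mean(y) - m mean(x). A minimal sketch with made-up data (the function name is hypothetical):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares slope and intercept (standard result, assumed here)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    m = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    b = y.mean() - m * x.mean()
    return m, b

# Five illustrative points lying close to a line.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

m, b = fit_line(x, y)
y_pred = m * 5.0 + b   # prediction at a new x, which is what regression buys us

print(m, b, y_pred)
```

The slope formula is the covariance of x and y divided by the variance of x, which is one concrete sense in which regression and correlation are intimately related.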
29
What does it mean?
  • m = dy/dx (obviously)
  • rate of change of y
  • change in y for one unit of change in x
  • amount of bang for your IV buck
  • b = value of y at x = 0
  • sometimes has meaning, sometimes not

30
Standard whipping of dead horse
  • correlation does not imply causation
  • neither does any other statistic!!!

31
Other important matters
  • how good is the linear relationship?
  • how sure am I about the slope?
  • the intercept?
  • any given prediction?