Title: Correlation and Regression
1. Correlation and Regression
2. correlation vs. regression
[Figure: two example scatterplots, one of correlation-ish data and one of regression-ish data]
3. correlation/regression commonalities
- assesses strength of relationship
- operates on interval or ratio data
- assumes a linear relationship
4. correlation / regression differences
- regression
  - experimenter determines x
  - (ideally) multiple measurements at each x
  - produces equation (allowing prediction)
- correlation
  - y and x random (bivariate normal)
  - assesses strength of (linear) relationship only
5. An Important Point
Linear regression and correlation both represent
a special case of general curve fitting. In
nature, linear relationships are rare.
Mathematically, they are easy.
6. scatterplots
7. intuitive definitions
- correlation: the degree to which data points cluster about a line (of non-zero slope)
- regression: determination of the line about which the data points cluster
- (these are intimately related, obviously)
8. towards correlation: the covariance
$\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$
- obviously related to variance
- is just a number that, all else being equal, will be largest when x = y (the Cauchy-Schwarz inequality)
- (also gets larger with variability of x or y), so
9. The Pearson Product-Moment Correlation Coefficient (Pearson's r)
10. Pearson's r
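A sketch of the standard definition, written with the covariance in the numerator and the sample standard deviations s_x and s_y in the denominator (the notation is assumed here). The normalization constants cancel, leaving only sums of deviations:

```latex
r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y}
  = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;
          \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```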
11. Pearson's r (cont.)
- still follows the Cauchy-Schwarz inequality
- is independent of the variability of x or y (which has been literally factored out)
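A minimal sketch of the "factored out" point, assuming the usual sample covariance with an n - 1 normalization. The data values and function names are made up for illustration. Rescaling y by 10 inflates the covariance tenfold but leaves r untouched:

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance with the n - 1 normalization (an assumed convention)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson_r(x, y):
    """Covariance divided by the product of the sample standard deviations."""
    sx = math.sqrt(covariance(x, x))
    sy = math.sqrt(covariance(y, y))
    return covariance(x, y) / (sx * sy)


# Illustrative data only
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
y_scaled = [10.0 * v for v in y]

cov_xy = covariance(x, y)
cov_scaled = covariance(x, y_scaled)  # grows with the variability of y
r_xy = pearson_r(x, y)
r_scaled = pearson_r(x, y_scaled)     # same as r_xy up to rounding

print(cov_xy, cov_scaled)
print(r_xy, r_scaled)
```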
12. some example r's
[Figure: four example scatterplots, with r = 0.1, r = 0.4, r = 0.6, and r = 0.8]
13. cautionary notes on r
- we have not obviously fit a straight line to the data in computing r
- but we have nonetheless
- r assumes a linear relationship
- or, more technically correct, r assesses the strength of whatever linear relationship does exist, independent of any other relationships
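A small illustration of that last caution, using made-up data: y is a perfect (deterministic) function of x, yet r is essentially zero because the relationship has no linear component over a symmetric range of x. The helper name pearson_r is an assumption of this sketch.

```python
import math


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [i / 10.0 for i in range(-20, 21)]  # symmetric about zero
y_parabola = [v * v for v in x]         # perfectly predictable, not linear
y_line = [2.0 * v + 1.0 for v in x]     # perfectly linear

r_parabola = pearson_r(x, y_parabola)   # near 0
r_line = pearson_r(x, y_line)           # near 1

print(r_parabola, r_line)
```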
14. more example correlations
15. Adjusted r
- it exists
- it is a good idea
- it is rarely reported
- (use it as a diagnostic)
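The slide gives no formula, so the following is a sketch under an assumption: it uses one common definition of the adjusted correlation, r_adj = sqrt(1 - (1 - r^2)(n - 1)/(n - 2)), and returns only the magnitude. When the quantity under the root is negative (small r, small n), this sketch reports 0, which is a choice made here rather than a standard.

```python
import math


def adjusted_r(r, n):
    """Adjusted r (assumed definition); clamps to 0 if the adjustment goes negative."""
    adj_sq = 1.0 - (1.0 - r * r) * (n - 1) / (n - 2)
    return math.sqrt(adj_sq) if adj_sq > 0 else 0.0


small_sample = adjusted_r(0.5, 10)    # shrinks noticeably
large_sample = adjusted_r(0.5, 1000)  # barely changes
tiny_sample = adjusted_r(0.1, 5)      # adjustment wipes it out

print(small_sample, large_sample, tiny_sample)
```

Used as a diagnostic, a large gap between r and the adjusted value suggests the sample is too small for r to be taken at face value.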
16. (demo)
17. The Sampling Distribution of r
- is about 1/sqrt(n-1) wide
- influenced by n
- influenced by r itself
- can be transformed into t
- it buys us all the same things that any sampling distribution does
  - confidence intervals
  - hypothesis testing
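These properties can be checked by simulation. The sketch below assumes bivariate normal data with a known population correlation (called rho here, a name introduced for illustration), draws many samples of size n, and looks at the spread of the resulting r values.

```python
import math
import random
import statistics


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_r(rho, n, rng):
    """Draw n points from a bivariate normal with correlation rho; return r."""
    xs, ys = [], []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        xs.append(u)
        ys.append(rho * u + math.sqrt(1.0 - rho * rho) * v)
    return pearson_r(xs, ys)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 30
null_rs = [sample_r(0.0, n, rng) for _ in range(4000)]
high_rs = [sample_r(0.8, n, rng) for _ in range(4000)]

sd_null = statistics.stdev(null_rs)  # close to 1/sqrt(n-1) when rho = 0
sd_high = statistics.stdev(high_rs)  # narrower: the width depends on r itself
predicted = 1.0 / math.sqrt(n - 1)

print(sd_null, predicted, sd_high)
```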
18. Confidence Intervals about r
- For low to middling r and middling to high n, r is distributed as a Gaussian with
  - μ = r and σ = 1/sqrt(n-1)
- So 95% CIs would just be ± 2σ, or ± 2/sqrt(n-1)
- If there's any doubt (or anything important in the balance!), I would simulate the distribution
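A sketch comparing the Gaussian shortcut with one way of simulating. The simulation used here is a percentile bootstrap (resampling the observed pairs with replacement), which is a choice made for this example and not necessarily the simulation the author has in mind. The data are synthetic.

```python
import math
import random


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_ci(r, n):
    """Approximate 95% CI: r +/- 2/sqrt(n-1). Not clipped to [-1, 1]."""
    half_width = 2.0 / math.sqrt(n - 1)
    return r - half_width, r + half_width


def bootstrap_ci(x, y, n_boot, rng):
    """Percentile bootstrap 95% CI from resampled (x, y) pairs."""
    n = len(x)
    rs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    return rs[int(0.025 * n_boot)], rs[int(0.975 * n_boot) - 1]


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(40)]
y = [0.5 * a + math.sqrt(0.75) * rng.gauss(0.0, 1.0) for a in x]

r = pearson_r(x, y)
g_lo, g_hi = gaussian_ci(r, len(x))
b_lo, b_hi = bootstrap_ci(x, y, 2000, rng)

print(r, (g_lo, g_hi), (b_lo, b_hi))
```

The Gaussian interval is symmetric and can run past ±1 for high r or small n, which is one reason to simulate when it matters.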
19. Another route to CIs
let $r' = \tfrac{1}{2} \ln\frac{1 + r}{1 - r}$ (Fisher's transformation)
then r' is roughly normal with σ_r' = 1/sqrt(n-3)
(this is clever but, again, it is so easy to simulate)
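A minimal sketch of this route, assuming the transform in question is Fisher's, r' = (1/2) ln((1 + r)/(1 - r)), which is the inverse hyperbolic tangent. The interval is built on the transformed scale and mapped back with tanh. The function name and the 1.96 multiplier are choices made for this example.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for r via Fisher's transformation."""
    z = math.atanh(r)               # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)     # standard error on the transformed scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lo, hi = fisher_ci(0.6, 30)
print(lo, hi)
```

Unlike the Gaussian shortcut, the resulting interval is asymmetric about r and always stays inside (-1, 1).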
20. Hypothesis testing on r
- we have the sampling distributions
- is r non-zero? where is zero with respect to the sampling distribution of r?
- are two r's different? where is zero with respect to the sampling distribution of r1 - r2?
21. A shortcut for simple hypothesis testing
Pearson's r to Student's t:
$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$
which is distributed as t on n-2 df
once again, I'm more comfortable with simulating the sampling distribution
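A sketch of both routes on synthetic data. The t conversion uses the standard formula t = r sqrt(n-2) / sqrt(1 - r^2). The simulated alternative shown here is a permutation test (shuffling y to break any relationship with x), which is one way to simulate the null distribution and is an assumption of this example.

```python
import math
import random


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_to_t(r, n):
    """Convert r to a t statistic on n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def permutation_p(x, y, n_perm, rng):
    """Two-sided p-value: share of shuffles with |r| at least as large as observed."""
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)
        if abs(pearson_r(x, y_shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


rng = random.Random(2)
x = [rng.gauss(0.0, 1.0) for _ in range(25)]
y = [0.7 * a + math.sqrt(1.0 - 0.49) * rng.gauss(0.0, 1.0) for a in x]

r = pearson_r(x, y)
t = r_to_t(r, len(x))
p = permutation_p(x, y, 2000, rng)

print(r, t, p)
```

The t value would be compared against a t table with n - 2 df; the permutation p-value needs no table at all.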
22. Linear Regression
23. Which of all possible lines best describes my data?
Find the line that is simultaneously closest to all the data.
In other words, minimize the quantity g, the sum of squared vertical distances between the data points and the line.
24. Finding the best fit line
y = mx + b
[Figure: three candidate lines through the same data: bad offset, high g; bad slope, high g; good fit, low g]
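The three cases can be reproduced numerically. This sketch assumes g is the sum of squared vertical distances between the data and the line y = mx + b; the data and the candidate lines are invented for illustration.

```python
def sum_sq(x, y, m, b):
    """g(m, b): sum of squared vertical distances from the line y = m*x + b."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))


# Illustrative data lying roughly along y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

g_bad_offset = sum_sq(x, y, 2.0, 4.0)  # right slope, wrong intercept: about 44.51
g_bad_slope = sum_sq(x, y, 0.5, 1.0)   # right intercept, wrong slope: about 67.01
g_good = sum_sq(x, y, 2.0, 1.0)        # close to the data: about 0.11

print(g_bad_offset, g_bad_slope, g_good)
```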
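25. The fitting function
So $g = \sum_{i} (y_i - \hat{y}_i)^2$
and $\hat{y}_i = m x_i + b$
or (more importantly) $g(m, b) = \sum_{i} (y_i - m x_i - b)^2$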
26. The fitting surface
[Figure: surface plot of g over the (m, b) plane; the best-fit (m, b) lies at the minimum of g]
27. Finding min(g)
- cast about randomly
- evaluate local slope, move downhill
- etc.
- in the case of a line, setting dg/dm and dg/db = 0 yields 2 equations that have a closed form solution!
- (which is why everyone does linear regressions)
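A sketch contrasting the "move downhill" strategy with the closed form, assuming g is the sum of squared vertical residuals. The step size and iteration count are hand-picked for this small made-up data set and would need tuning elsewhere; the function names are illustrative.

```python
def gradient(x, y, m, b):
    """Partial derivatives (dg/dm, dg/db) of the sum of squared residuals."""
    dg_dm = sum(-2.0 * xi * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    dg_db = sum(-2.0 * (yi - (m * xi + b)) for xi, yi in zip(x, y))
    return dg_dm, dg_db


def descend(x, y, rate=0.01, steps=5000):
    """Plain gradient descent on g, starting from m = 0, b = 0."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        dg_dm, dg_db = gradient(x, y, m, b)
        m -= rate * dg_dm
        b -= rate * dg_db
    return m, b


def closed_form(x, y):
    """Least-squares slope and intercept from setting both derivatives to zero."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    m = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return m, my - m * mx


x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 7.1, 8.8]

m_gd, b_gd = descend(x, y)
m_cf, b_cf = closed_form(x, y)  # m = 1.96, b = 1.10 for these data

print((m_gd, b_gd), (m_cf, b_cf))
```

Both routes land on the same line; the closed form simply gets there in one step, with no step size to tune.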
28. Best fit linear parameters
Solving dg/dm = 0 and dg/db = 0 simultaneously gives the regression equations or the normal equations
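A sketch of that derivation, assuming g is the sum of squared vertical residuals, with s_x and s_y the sample standard deviations and r Pearson's r (notation assumed here):

```latex
\begin{align*}
g(m, b) &= \sum_{i=1}^{n} (y_i - m x_i - b)^2 \\
\frac{\partial g}{\partial m} &= -2 \sum_{i=1}^{n} x_i \,(y_i - m x_i - b) = 0 \\
\frac{\partial g}{\partial b} &= -2 \sum_{i=1}^{n} (y_i - m x_i - b) = 0 \\
m &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \frac{\operatorname{cov}(x, y)}{s_x^2}
   = r \,\frac{s_y}{s_x} \\
b &= \bar{y} - m \bar{x}
\end{align*}
```

The last form of m makes the link between regression and correlation explicit: the slope is r rescaled by the ratio of the two standard deviations.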
29. What does it mean?
- m = dy/dx (obviously)
  - rate of change of y
  - change in y for one unit of change in x
  - amount of bang for your IV buck
- b = value of y at x = 0
  - sometimes has meaning, sometimes not
30. Standard whipping of dead horse
- correlation does not imply causation
- neither does any other statistic!!!
31. Other important matters
- how good is the linear relationship?
- how sure am I about the slope?
- the intercept?
- any given prediction?