Title: Regression vs. Correlation
1. Regression vs. Correlation
Both
- Two variables
- Continuous data
Regression
- Change in X causes change in Y
- Independent and dependent variables
- Or predict Y based on X
Correlation
- No dependence (causation) assumed
- Estimate the degree to which 2 variables vary together
2. Correlation: more on bivariate statistics
- No dependence (causation) assumed
- Can call variables X and Y, or X1 and X2
- Are the two variables independent, or do they covary?
3. Nature of variables

Purpose of investigator | Nature of variables: Y random, X fixed | Nature of variables: both random
Establish and estimate dependence of Y upon X; describe functional relationship or predict Y from X | Model I regression | Model II regression (with few exceptions, e.g. prediction)
Establish and estimate association (interdependence) between X and Y | Meaningless | Correlation coefficient (significance only if X and Y are normally distributed)

Adapted from Sokal & Rohlf, p. 559
4. Visualize Correlation
[Figure: two scatterplots of Y (X2) against X1. Positive: increase in X associated with increase in Y. Negative: increase in X associated with decrease in Y.]
5. No correlation
[Figure: two scatterplots of Y (X2) against X1 showing no correlation, one labeled "horizontal" and one labeled "vertical".]
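6. Pearson product-moment correlation coefficient

$$ r = \frac{\sum xy}{\sqrt{\sum x^2 \sum y^2}} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} $$

In the first form, x and y stand for deviations from their means. The numerator is the summed products of deviations of x and y; the denominator is the square root of SS X times SS Y.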
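The deviation form of the formula above can be sketched in a few lines of Python (used here instead of SAS so the example needs only the standard library). The function name `pearson_r` and the data values are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson r from summed products of deviations (illustrative sketch)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Numerator: summed products of deviations of x and y
    sum_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    # Denominator: square root of (SS X * SS Y)
    ss_x = sum((xi - xbar) ** 2 for xi in x)
    ss_y = sum((yi - ybar) ** 2 for yi in y)
    return sum_xy / math.sqrt(ss_x * ss_y)

# Hypothetical values, not taken from the monitoring data set
turbidity = [2.0, 4.0, 6.0, 8.0, 10.0]
secchi_depth = [5.0, 4.1, 3.2, 2.0, 1.1]
r = pearson_r(turbidity, secchi_depth)
print(round(r, 4))
```

This sketch does not guard against a variable with zero variance, which would make the denominator zero.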
7. Equivalent calculations (1)

$$ r = \frac{\sum xy}{(n - 1)\, s_x s_y} $$

where $s_x$ = SD of X and $s_y$ = SD of Y.
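As a quick numerical check that the standard-deviation form agrees with the sums-of-squares form, the sketch below computes r both ways in Python. The data are hypothetical, and `statistics.stdev` is the sample SD (n - 1 in the denominator), which is what this formula assumes.

```python
import math
import statistics

# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = statistics.mean(x)
ybar = statistics.mean(y)

# Summed products of deviations and the two sums of squares
sum_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ss_x = sum((xi - xbar) ** 2 for xi in x)
ss_y = sum((yi - ybar) ** 2 for yi in y)

# Form 1: sum_xy / sqrt(SS X * SS Y)
r_from_ss = sum_xy / math.sqrt(ss_x * ss_y)

# Form 2: sum_xy / ((n - 1) * s_x * s_y), using sample SDs
r_from_sd = sum_xy / ((n - 1) * statistics.stdev(x) * statistics.stdev(y))

print(r_from_ss, r_from_sd)
```

The two values match up to floating-point rounding because $(n - 1) s_x s_y$ is algebraically the same quantity as the square root of SS X times SS Y.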
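8. Equivalent calculations (2)

$$ r^2 = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2} = \frac{\text{regression SS}}{\text{total SS}} $$

$$ r = \sqrt{r^2} = \sqrt{\frac{\text{regression SS}}{\text{total SS}}} $$

Here $\hat{Y}_i$ is the value of Y predicted by the regression line. The square root gives only the magnitude of r; its sign is the sign of $\sum xy$.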
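The relationship between r squared and the regression and total sums of squares can be checked numerically. This Python sketch fits a least-squares line to hypothetical data, forms regression SS from the fitted values, and compares the result with r computed directly from deviations; all variable names are illustrative.

```python
import math

# Hypothetical paired observations
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [9.5, 8.1, 7.4, 5.2, 4.9, 3.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

sum_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
ss_x = sum((xi - xbar) ** 2 for xi in x)
total_ss = sum((yi - ybar) ** 2 for yi in y)

# Least-squares line Y-hat = a + b*X
b = sum_xy / ss_x
a = ybar - b * xbar
y_hat = [a + b * xi for xi in x]

# Regression SS uses the fitted values, total SS the observed values
regression_ss = sum((yh - ybar) ** 2 for yh in y_hat)
r_squared = regression_ss / total_ss

# The square root loses the sign, so take it from the slope
r_from_ss_ratio = math.copysign(math.sqrt(r_squared), b)
r_direct = sum_xy / math.sqrt(ss_x * total_ss)

print(r_squared, r_from_ss_ratio, r_direct)
```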
9. Testing significance
$H_0: \rho = 0$
- Assumes that data come from a bivariate normal distribution
- $\rho$ = true population parameter
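10.

$$ t = \frac{r}{s_r}, \qquad s_r = \sqrt{\frac{1 - r^2}{n - 2}} $$

where $s_r$ is the SE of r.

Reject the null if $|t_{calc}| > t_{\alpha(2),\nu}$, with $\nu = n - 2$ degrees of freedom.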
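As a worked example of this test, the Python sketch below uses the temp vs. DO pair from the PROC CORR output in this lecture (r = -0.21792, n = 99, reported p = 0.0302). The function name `t_for_r` is hypothetical, and the critical value of about 1.985 for alpha(2) = 0.05 with 97 degrees of freedom is an assumption read from a t table, not computed here.

```python
import math

def t_for_r(r, n):
    """Return (t, degrees of freedom) for testing H0: rho = 0."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1:
        raise ValueError("r must lie strictly between -1 and 1")
    s_r = math.sqrt((1 - r ** 2) / (n - 2))  # SE of r
    return r / s_r, n - 2

# temp vs. DO from the correlation output: r = -0.21792, n = 99
t_calc, df = t_for_r(-0.21792, 99)

# Assumed critical value from a t table (two-tailed, alpha = 0.05, df = 97)
t_crit = 1.985

print(round(t_calc, 3), df, abs(t_calc) > t_crit)
```

The absolute value of t exceeds the assumed critical value, which is consistent with the p-value of 0.0302 that SAS reports for this pair.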
11.
data start;
  infile 'C:\Documents and Settings\cmayer3\My Documents\teaching\Biostatistics\Lectures\monitoring data for corr.csv' dlm=',' DSD;
  input year day site depth temp DO spCond turb pH Kpar secchi alk Chla;
options ls=180;
proc print;

data one; set start;
options ls=100;
/* Correlations on raw data */
proc corr; var temp DO spCond turb pH Kpar secchi alk Chla;

data two; set start;
  /* Create new variables by transformation */
  lnturb=log(turb);
  lnsecchi=log(secchi);
  lgturb=log10(turb);
  lgsecchi=log10(secchi);
  sqturb=sqrt(turb);
  sqsecchi=sqrt(secchi);
proc print;

data three; set two;
/* Correlations on transformed data */
proc corr; var lnturb lnsecchi;
proc corr; var lgturb lgsecchi;
proc corr; var sqturb sqsecchi;

data four; set two;
/* Plot raw and transformed */
options ls=100;
proc plot;
  plot turb*secchi;
  plot lnturb*lnsecchi;
  plot lgturb*lgsecchi;
  plot sqturb*sqsecchi;
run;
12. Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

Each variable has three rows: the correlation coefficient r, the p-value, and the number of observations.

              temp        DO    spCond      turb        pH      Kpar    secchi       alk      Chla
temp       1.00000  -0.21792   0.06538  -0.14523   0.35328  -0.23911   0.15689   0.11311   0.37612
                      0.0302    0.5202    0.1515    0.0003    0.1541    0.1209    0.3895    0.0001
                99        99        99        99        99        37        99        60        99

DO        -0.21792   1.00000   0.01542  -0.21550   0.50679  -0.24013  -0.06504   0.15790   0.38699
            0.0302              0.8796    0.0322    <.0001    0.1523    0.5224    0.2282    <.0001
                99        99        99        99        99        37        99        60        99

spCond     0.06538   0.01542   1.00000   0.48214  -0.29017   0.78394  -0.51332   0.74021   0.21367
            0.5202    0.8796              <.0001    0.0036    <.0001    <.0001    <.0001    0.0337
                99        99        99        99        99        37        99        60        99

turb      -0.14523  -0.21550   0.48214   1.00000  -0.33727   0.89941  -0.50336   0.47441   0.07208
            0.1515    0.0322    <.0001              0.0006    <.0001    <.0001    0.0001    0.4783
                99        99        99        99        99        37        99        60        99

pH         0.35328   0.50679  -0.29017  -0.33727   1.00000  -0.56355   0.14049  -0.14061   0.61033
            0.0003    <.0001    0.0036    0.0006              0.0003    0.1654    0.2839    <.0001
                99        99        99        99        99        37        99        60        99

Kpar      -0.23911  -0.24013   0.78394   0.89941  -0.56355   1.00000  -0.76680   0.85542   0.04579
            0.1541    0.1523    <.0001    <.0001    0.0003              <.0001    <.0001    0.7878
                37        37        37        37        37        37        37        29        37

secchi     0.15689  -0.06504  -0.51332  -0.50336   0.14049  -0.76680   1.00000  -0.49649  -0.30918
            0.1209    0.5224    <.0001    <.0001    0.1654    <.0001              <.0001    0.0018
                99        99        99        99        99        37        99        60        99

alk        0.11311   0.15790   0.74021   0.47441  -0.14061   0.85542  -0.49649   1.00000   0.12410
            0.3895    0.2282    <.0001    0.0001    0.2839    <.0001    <.0001              0.3448
                60        60        60        60        60        29        60        60        60

Chla       0.37612   0.38699   0.21367   0.07208   0.61033   0.04579  -0.30918   0.12410   1.00000
            0.0001    <.0001    0.0337    0.4783    <.0001    0.7878    0.0018    0.3448
                99        99        99        99        99        37        99        60        99
13. Nonparametric statistics
- Sometimes called distribution-free statistics because they do not require that the data fit a normal distribution
- Many nonparametric procedures are based on ranked data. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size.
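The ranking step described above can be sketched in Python as follows. The function name `rank_values` is hypothetical. The slide only describes assigning integers from 1 to n; giving tied values the average of the ranks they span is an added assumption here, following a common convention.

```python
def rank_values(values):
    """Rank from lowest (1) to highest (n); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of observations tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Positions i..j (0-based) correspond to ranks i+1..j+1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

# Hypothetical measurements
no_ties = rank_values([3.2, 1.5, 4.8, 2.0])
with_ties = rank_values([10, 20, 20, 30])
print(no_ties, with_ties)
```

Without ties the ranks are exactly the integers 1 to n, as the slide describes.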
14. Some Commonly Used Statistical Tests

Normal theory based test | Corresponding nonparametric test | Purpose of test
t test for independent samples | Mann-Whitney U test; Wilcoxon rank-sum test | Compares two independent samples
Paired t test | Wilcoxon matched pairs signed-rank test | Examines a set of differences
Pearson correlation coefficient | Spearman rank correlation coefficient | Assesses the linear association between two variables
One way analysis of variance (F test) | Kruskal-Wallis analysis of variance by ranks | Compares three or more groups
Two way analysis of variance | Friedman two way analysis of variance | Compares groups classified by two different factors

From http://www.tufts.edu/~gdallal/npar.htm
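To illustrate the Spearman rank correlation coefficient listed in the table, here is a minimal Python sketch. It uses the shortcut formula $r_s = 1 - 6\sum d^2 / (n(n^2 - 1))$, where d is the difference between paired ranks. That formula assumes no tied values, so the hypothetical function `spearman_rs` rejects ties rather than handling them.

```python
def spearman_rs(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this sketch assumes no tied values")

    def simple_ranks(values):
        # With no ties, rank = position in the sorted list, starting at 1
        sorted_values = sorted(values)
        return [sorted_values.index(v) + 1 for v in values]

    rank_x = simple_ranks(x)
    rank_y = simple_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical data: y rises with x, but not along a straight line
rs_increasing = spearman_rs([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])
rs_decreasing = spearman_rs([1, 2, 3, 4, 5], [50, 40, 30, 20, 10])
print(rs_increasing, rs_decreasing)
```

Because it works on ranks, the coefficient reaches 1 or -1 for any strictly increasing or decreasing relationship, even a curved one.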
15. Data transformations
- Data transformation can correct deviation from normality and uneven variance (heteroscedasticity)
- See chapter 13 in Zar
- Pretty much: whatever works, works. Some common ones are:
  - for percent or proportion, use the arcsine of the square root
  - for density (per m2), use log10
- The right transformation can allow you to use parametric statistics
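The two common transformations named above can be sketched in Python as follows. The function names and input values are hypothetical. The sketch assumes proportions lie between 0 and 1 and densities are strictly positive; a zero density would need some adjustment before taking the log.

```python
import math

def arcsine_sqrt(proportion):
    """Arcsine of the square root, for a proportion in [0, 1] (radians)."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")
    return math.asin(math.sqrt(proportion))

def log10_density(density):
    """Base-10 log, for a strictly positive density (e.g. per m2)."""
    if density <= 0:
        raise ValueError("density must be positive for a log transform")
    return math.log10(density)

# Hypothetical raw values
proportions = [0.05, 0.25, 0.50, 0.95]
densities = [1.0, 10.0, 100.0, 2500.0]

transformed_proportions = [arcsine_sqrt(p) for p in proportions]
transformed_densities = [log10_density(d) for d in densities]
print(transformed_proportions, transformed_densities)
```

Percentages would first be divided by 100 to turn them into proportions.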