Title: Regression (2): Multiple Linear Regression and Path Analysis
1. Regression (2): Multiple Linear Regression and Path Analysis
- Hal Whitehead
- BIOL4062/5062
2. Multiple Linear Regression and Path Analysis
- Multiple linear regression
- assumptions
- parameter estimation
- hypothesis tests
- selecting independent variables
- collinearity
- polynomial regression
- Path analysis
3. Regression
- One dependent variable: Y
- Independent variables: X(1), X(2), X(3), ...
4. Purposes of Regression
- 1. Relationship between Y and the X's
- 2. Quantitative prediction of Y
- 3. Relationship between Y and X, controlling for C
- 4. Which of the X's are most important?
- 5. Best mathematical model
- 6. Compare regression relationships: Y1 on X, Y2 on X
- 7. Assess interactive effects of the X's
5.
- Simple regression: one X
- Multiple regression: two or more X's
- Y = β0 + β1X(1) + β2X(2) + β3X(3) + ... + βkX(k) + E
6. Multiple linear regression: assumptions (1)
- For any specific combination of X's, Y is a (univariate) random variable with a certain probability distribution having finite mean and variance (Existence)
- Y values are statistically independent of one another (Independence)
- The mean value of Y given the X's is a straight-line (linear) function of the X's (Linearity)
7. Multiple linear regression: assumptions (2)
- The variance of Y is the same for any fixed combination of X's (Homoscedasticity)
- For any fixed combination of X's, Y has a normal distribution (Normality)
- There are no measurement errors in the X's (X's measured without error)
8. Multiple linear regression: parameter estimation
- Y = β0 + β1X(1) + β2X(2) + β3X(3) + ... + βkX(k) + E
- Estimate the β's in multiple regression using least squares
- Sizes of the coefficients are not good indicators of the importance of the X variables
- Number of data points in multiple regression:
  - at least one more than the number of X's
  - preferably 5 times the number of X's
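The least-squares estimation described above can be sketched in a few lines of NumPy; the data here are simulated (the coefficients and seed are arbitrary choices for illustration, not from the slides):

```python
import numpy as np

# Simulate n = 50 observations on k = 3 predictors (well above the
# "5 times the number of X's" rule of thumb).
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
beta_true = np.array([2.0, 0.5, -1.0, 3.0])           # beta0..beta3 (made up)
y = beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.1, size=n)

# Least squares: prepend a column of 1s so beta0 (the intercept) is estimated too.
X_design = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta_hat, 2))
```

With this much data and little noise, the estimates land close to the true β's; with fewer points than X's the system would be underdetermined, which is why the sample-size rule above matters.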
9. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
- N = 39
- Multiple regression of Y = Log(CNS) on:

    X            β      SE(β)
    Log(Mass)   -0.49   (0.70)
    Log(Fat)    -0.07   (0.10)
    Log(Muscle)  1.03   (0.54)
    Log(Heart)   0.42   (0.22)
    Log(Bone)   -0.07   (0.30)
10. Multiple linear regression: hypothesis tests
- Usually test:
- H0: Y = β0 + β1X(1) + β2X(2) + ... + βjX(j) + E
- H1: Y = β0 + β1X(1) + β2X(2) + ... + βjX(j) + ... + βkX(k) + E
- F-test with k-j and n-k-1 degrees of freedom (partial F-test)
- H0: the variables X(j+1), ..., X(k) do not help explain variability in Y
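A minimal sketch of the partial F-test, assuming the standard form F = [(RSS_reduced - RSS_full)/(k-j)] / [RSS_full/(n-k-1)]; the data are simulated so that only the first two X's matter:

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit (intercept column included)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

rng = np.random.default_rng(1)
n, k, j = 60, 4, 2                      # full model: k X's; reduced: first j
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)  # X3, X4 are noise

ones = np.ones((n, 1))
X_full = np.hstack([ones, X])           # beta0..betak
X_red = np.hstack([ones, X[:, :j]])     # beta0..betaj

rss_red, rss_full = rss(X_red, y), rss(X_full, y)
F = ((rss_red - rss_full) / (k - j)) / (rss_full / (n - k - 1))
print(f"partial F = {F:.2f} on ({k - j}, {n - k - 1}) df")
```

Because the two extra X's are pure noise here, the partial F should be small, i.e. no evidence that X(3) and X(4) help explain Y.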
11. Multiple linear regression: hypothesis tests
- e.g. Test the significance of the overall multiple regression:
- H0: Y = β0 + E
- H1: Y = β0 + β1X(1) + β2X(2) + ... + βkX(k) + E
- Test the significance of:
  - adding an independent variable
  - deleting an independent variable
12. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
- Multiple regression of Y = Log(CNS) on:

    X            β      SE(β)   P
    Log(Mass)   -0.49   (0.70)  0.49
    Log(Fat)    -0.07   (0.10)  0.52
    Log(Muscle)  1.03   (0.54)  0.07
    Log(Heart)   0.42   (0.22)  0.06
    Log(Bone)   -0.07   (0.30)  0.83

- Each P tests whether removal of that variable reduces the fit
13. Multiple linear regression: selecting independent variables
- Reasons for selecting a subset of the independent variables (X's):
  - cost (financial and other)
  - simplicity
  - improved prediction
  - improved explanation
14. Multiple linear regression: selecting independent variables
- Partial F-test
- predetermined forward selection
- forward selection based upon improvement in fit
- backward selection based upon improvement in fit
- stepwise (backward/forward)
- Mallows C(p)
- AIC
15. Multiple linear regression: selecting independent variables
- Partial F-test
- predetermined forward selection
  - Mass, Bone, Heart, Muscle, Fat
- forward selection based upon improvement in fit
- backward selection based upon improvement in fit
- stepwise (backward/forward)
16. Multiple linear regression: selecting independent variables
- Partial F-test
- predetermined forward selection
- forward selection based upon improvement in fit
- backward selection based upon improvement in fit
- stepwise (backward/forward)
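Forward selection based on improvement in fit can be sketched as a greedy loop: at each step, add the X that most reduces the residual sum of squares, and stop when its partial F falls below an F-to-enter threshold. This is an illustrative implementation on simulated data (the function name, threshold, and data are choices for this sketch, not from the slides):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection: repeatedly add the X that most improves
    the fit, stopping when its partial F drops below f_to_enter."""
    n, k = X.shape
    chosen = []
    design = np.ones((n, 1))                   # start with the constant only
    current = rss(design, y)
    while len(chosen) < k:
        # Find the not-yet-chosen variable giving the lowest RSS.
        best = None
        for i in range(k):
            if i in chosen:
                continue
            r = rss(np.hstack([design, X[:, [i]]]), y)
            if best is None or r < best[1]:
                best = (i, r)
        i, r = best
        p = design.shape[1]                    # parameters already in the model
        F = (current - r) / (r / (n - p - 1))  # partial F for one added term
        if F < f_to_enter:
            break
        chosen.append(i)
        design = np.hstack([design, X[:, [i]]])
        current = r
    return chosen

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=80)
print(forward_select(X, y))
```

On these data the two genuinely informative variables (indices 1 and 3) are picked up, strongest first; backward and stepwise selection differ only in the direction of the search.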
17. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
- Complete model (r² = 0.97)
- Forward stepwise (α-to-enter = 0.15, α-to-remove = 0.15):
  - 1. Constant (r² = 0.00)
  - 2. Constant + Muscle (r² = 0.97)
  - 3. Constant + Muscle + Heart (r² = 0.97)
  - 4. Constant + Muscle + Heart + Mass (r² = 0.97)
- -0.18 - 0.82×Mass + 1.24×Muscle + 0.39×Heart
18. Why do Large Animals have Large Brains? (Schoenemann, Brain Behav. Evol. 2004)
- Complete model (r² = 0.97)
- Backward stepwise (α-to-enter = 0.15, α-to-remove = 0.15):
  - 1. All (r² = 0.97)
  - 2. Remove Bone (r² = 0.97)
  - 3. Remove Fat (r² = 0.97)
- -0.18 - 0.82×Mass + 1.24×Muscle + 0.39×Heart
19. Comparing models
- Mallows' C(p):
  - C(p) = (k-p)·F(p) + (2p - k + 1)
  - k = parameters in the full model; p = parameters in the restricted model
  - F(p) is the F value comparing the fit of the restricted model with that of the full model
  - Lowest C(p) indicates the best model
- Akaike Information Criterion (AIC):
  - AIC = n·log(s²) + 2p
  - Lowest AIC indicates the best model
  - Can compare models not nested in one another
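The AIC formula above (n·log(s²) + 2p, with s² = RSS/n) is easy to compute directly. A sketch on simulated data in which only X1 matters, comparing three candidate models (the data and model names are invented for illustration):

```python
import numpy as np

def aic(X, y):
    """AIC = n*log(s^2) + 2p, with s^2 = RSS/n and p = number of parameters."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    s2 = (r @ r) / n
    return n * np.log(s2) + 2 * p

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + rng.normal(size=n)             # only X1 is informative

ones = np.ones((n, 1))
models = {
    "constant only": ones,
    "constant + X1": np.hstack([ones, X[:, [0]]]),
    "constant + all four X's": np.hstack([ones, X]),
}
scores = {name: aic(design, y) for name, design in models.items()}
for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
```

Note the trade-off built into the formula: adding parameters always lowers s², but each one costs +2, so useless X's tend to raise the AIC rather than lower it.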
20. Comparing models
21. Collinearity
- If two (or more) X's are linearly related:
  - they are collinear
  - the regression problem is indeterminate
  - e.g. X(3) = 5·X(2) + 16, or
  - X(2) = 4·X(1) + 16·X(4)
- If they are nearly linearly related (near collinearity), coefficients and tests are very inaccurate
22. What to do about collinearity?
- Centering (mean = 0)
- Scaling (SD = 1)
- Regression on the first few Principal Components
- Ridge Regression
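Centering, scaling, and ridge regression can be combined in a short sketch. This assumes the classic closed form for ridge on standardized variables, β = (XᵀX + λI)⁻¹Xᵀy; the near-collinear data and the value λ = 1 are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)      # x2 nearly collinear with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=n)

# Center and scale X (mean 0, SD 1) and center y, so no intercept is needed.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = y - y.mean()

# Ridge: adding lambda*I to X'X makes the near-singular system well conditioned
# and shrinks the unstable OLS coefficients toward zero.
lam = 1.0
beta_ridge = np.linalg.solve(Xs.T @ Xs + lam * np.eye(2), Xs.T @ ys)
beta_ols = np.linalg.lstsq(Xs, ys, rcond=None)[0]
print("OLS:  ", np.round(beta_ols, 2))
print("ridge:", np.round(beta_ridge, 2))
```

With near-collinear X's the OLS coefficients can be wildly large with opposite signs; the ridge estimates are smaller and split the shared effect more evenly, at the cost of a little bias.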
23. Curvilinear (Polynomial) Regression
- Y = β0 + β1X + β2X² + β3X³ + ... + βkX^k + E
- Used to fit fairly complex curves to data
- β's estimated using least squares
- Use sequential partial F-tests, or AIC, to find how many terms to use
- k > 3 is rare in biology
- Better to transform the data and use simple linear regression, when possible
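Since a degree-k polynomial regression is just a linear regression on X, X², ..., X^k, the β's come from the same least-squares machinery. A sketch on simulated quadratic data (the coefficients are invented, not the Sokal and Rohlf values):

```python
import numpy as np

# Quadratic data: Y = 1 + 2X - 0.5X^2 + noise.
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)

# polyfit builds the columns X, X^2, ... internally and solves by least squares;
# coefficients are returned lowest degree first: [beta0, beta1, beta2].
coefs = np.polynomial.polynomial.polyfit(x, y, deg=2)
print(np.round(coefs, 2))
```

Refitting with deg = 1, 2, 3, ... and comparing by sequential partial F-tests or AIC, as above, chooses how many terms to keep.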
24. Curvilinear (Polynomial) Regression
- Y = 0.066 + 0.00727·X
- Y = 0.117 + 0.00085·X + 0.00009·X²
- Y = 0.201 - 0.01371·X + 0.00061·X² - 0.000005·X³
- From Sokal and Rohlf
25. Path Analysis
26. Path Analysis
- Models with causal structure
- Represented by path diagram
- All variables quantitative
- All path relationships assumed linear
- (transformations may help)
27. Path Analysis
- All paths are one-way:
  - A → C, or
  - C → A
- No loops
- Some variables may not be directly observed:
  - residual variables (U)
- Some variables are not observed but known to exist:
  - latent variables (D)
28. Path Analysis
- Path coefficients and other statistics are calculated using multiple regressions
- Variables are:
  - centered (mean = 0), so there are no constants in the regressions
  - often standardized (SD = 1)
- So path coefficients usually lie between -1 and 1
- Paths with coefficients not significantly different from zero may be eliminated
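The recipe above (standardize, then run one multiple regression per endogenous variable) can be sketched for a hypothetical three-variable diagram X1 → X2, X1 → Y, X2 → Y; the path values 0.6, 0.3, 0.5 used to simulate the data are invented for illustration:

```python
import numpy as np

def std_coefs(X, y):
    """Standardized regression coefficients (all variables centered, SD = 1):
    these are the path coefficients for the arrows into one endogenous variable."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

# Simulate the hypothetical diagram; the residual noise plays the role of U.
rng = np.random.default_rng(6)
n = 500
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)
y = 0.3 * x1 + 0.5 * x2 + rng.normal(scale=0.8, size=n)

p_12 = std_coefs(x1[:, None], x2)                 # path X1 -> X2
p_y = std_coefs(np.column_stack([x1, x2]), y)     # paths X1 -> Y and X2 -> Y
print("X1 -> X2:", np.round(p_12, 2))
print("X1 -> Y, X2 -> Y:", np.round(p_y, 2))
```

As the slide notes, standardization keeps the estimated path coefficients in a comparable range (usually between -1 and 1), so the arrows of the diagram can be compared directly.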
29. Path Analysis: an example
- Isaak and Hubert. 2001. Production of stream habitat gradients by montane watersheds: hypothesis tests based on spatially explicit path analyses. Can. J. Fish. Aquat. Sci.
30. (Figure: path diagram)
- Dashed (- - -): predicted negative interaction
- Solid (———): predicted positive interaction