Title: Experimental design and statistical methods in biology
Experimental design and statistical methods in biology
- Lesson 7
- Analysis of ratios
- Logistic regression
- Generalized Linear Models (GENMOD)
- Analysis of frequency tables
- Log-linear models
Imagine an investigation that aims to examine the food selection of lynx.
- Method: The contents of 6 lynx stomachs were examined.
- Remains that could be identified to species (genus, family, order, class) were weighed.
- The number of prey individuals was estimated from the remains (bones, feathers, hairs, etc.).
Because it is a ratio between numbers of observations.
Because it depends on the weight unit, i.e. 534 g / 1021 g = 0.534 kg / 1.021 kg = 0.523.
The noise of a true ratio depends on the number of observations.
For instance, let p denote the probability that a prey item is a mouse and 1 - p the probability that it is not a mouse. Then the probability of the next prey item being a mouse (given that the number of mice among the prey items is binomially distributed) is found as
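The variance of r (the number of mice among n prey items)
$\mathrm{Var}(r) = n p (1 - p)$
The variance of p (estimated as $r/n$)
$\mathrm{Var}(r/n) = \mathrm{Var}(r)/n^2 = p (1 - p)/n$
which shows that the variance of p declines with n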
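As a small numerical illustration of how the variance of an estimated proportion declines with n, the sketch below evaluates the standard binomial variance formulas for a few sample sizes. The function names and the value p = 0.5 are assumptions chosen for illustration; they are not part of the lesson.

```python
def variance_of_count(p: float, n: int) -> float:
    """Variance of the binomial count r out of n observations: n * p * (1 - p)."""
    return n * p * (1.0 - p)


def variance_of_ratio(p: float, n: int) -> float:
    """Variance of the ratio r / n: p * (1 - p) / n."""
    return p * (1.0 - p) / n


p = 0.5  # hypothetical probability that a prey item is a mouse
for n in (10, 100, 1000):
    # The variance of the count grows with n, the variance of the ratio shrinks.
    print(n, variance_of_count(p, n), variance_of_ratio(p, n))
```

A ratio of weights has no comparable n, which is one way to read the distinction drawn earlier between true ratios and weight ratios.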
Conditions for analyzing frequency tables
- Observations are on a nominal scale.
- Observations show stochastic independence, that is, every new observation should be drawn from a distribution (binomial or multinomial) with probability pj for event j and 1 - pj for not event j.
For instance, teeth from prey animals cannot be regarded as stochastically independent because each prey contributes 1, ..., m observations. In this situation, it is more correct to use the prey item as the unit of observation.
Logistic regression
- Used when data are dichotomous.
- Used when data are true fractions between 0 and 1
Example
Does predation of eggs in nests of Oyster catcher depend on
- The distance from the nest to the nearest nest of Herring gull?
- The vegetation surrounding the nest?
- The number of eggs in the nest?
Data
OBS   DIST   EGGS   VEG   KILLED
1     0.5    3      B     3
2     1.0    7      C     5
3     5.7    5      B     1
4     3.8    9      A     6
5     3.0    7      C     5
6     6.1    8      A     3
...
57    3.3    3      A     3
Analysis of dichotomous data
- Nests are categorized according to whether predation has occurred or not.
- No predation is scored as 0
- Predation is scored as 1
Plus/minus predator visit to Oyster catcher nest
The purpose is to fit a model to the data: a model that predicts the probability of a nest being predated.
The logistic regression model
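A minimal Python sketch of the logistic model may help here. In logistic regression the logit of the probability is a linear function of the predictors, so the predicted probability is 1 / (1 + exp(-(β0 + β1·x))). The default coefficients below are the INTERCEPT and DIST estimates from the reduced Example A output reported later in this lesson; the function names are illustrative assumptions.

```python
import math


def logistic(xbeta: float) -> float:
    """Inverse of the logit link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-xbeta))


def predicted_visit_probability(dist: float,
                                intercept: float = 8.8288,
                                slope: float = -1.0012) -> float:
    """Predicted probability that a nest at distance `dist` is visited by a predator."""
    return logistic(intercept + slope * dist)


for dist in (0.5, 3.0, 5.7, 9.5):
    print(dist, round(predicted_visit_probability(dist), 4))
```

With these coefficients, the values for distances 0.5 and 5.7 agree with the Pred column of the Observation Statistics table to four decimals.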
How to do it in SAS
DATA logist;
  OPTIONS LINESIZE=90;
  /* Example on logistic regression */
  /* The example is inspired by Dorthe Lahrmann's investigations of
     Oyster catchers (strandskader) on Langli in Ho Bugt */
  INFILE 'h:\lin-mod\logist.prn' FIRSTOBS=2;
  INPUT dist eggs veg $ killed;  /* veg holds the letters A, B, C, so it is read as character ($) */
  /* dist: Distance to the nearest nest of Herring gull (sølvmåge) */
  /* eggs: Number of Oyster catcher eggs in a nest */
  /* veg: Vegetation type surrounding an Oyster catcher nest */
  IF killed gt 0 THEN visit = 1;
  IF killed = 0 THEN visit = 0;
  /* If killed > 0 then the nest has been visited by a predator at least once */
/* Example A: Analysis of whether a nest has been visited or not visited
   by predators, i.e. visit = 1 or 0 */
PROC GENMOD;  /* The procedure is Generalized Linear Models */
  TITLE 'Eksempel A';
  CLASS veg;  /* veg is a class variable */
  MODEL visit = dist veg / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
  /* DIST: distribution function (here chosen as binomial) */
  /* LINK: the model uses a logit transformation of the data */
  /* TYPE3: Type 3 analysis is used to evaluate the relative contribution
     of the different factors to the dependent variable */
  /* DSCALE: an option which tells SAS to scale the error in order to meet
     the demands of the model. If the scale parameter is approximately 1,
     scaling is not needed. */
  /* OBSTATS: gives the predicted values as well as their confidence limits */
RUN;
Eksempel A    10:19 Thursday, November 22, 2001    87
The GENMOD Procedure

Model Information
Description          Value
Data Set             WORK.LOGIST
Distribution         BINOMIAL
Link Function        LOGIT
Dependent Variable   VISIT
Observations Used    57
Number Of Events     52
Number Of Trials     57

Class Level Information
Class   Levels   Values
VEG     3        A B C
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value      Value/DF
Deviance             53   20.2819    0.3827
Scaled Deviance      53   53.0000    1.0000
Pearson Chi-Square   53   22.2740    0.4203
Scaled Pearson X2    53   58.2057    1.0982
Log Likelihood       .    -26.5000   .

Value: Low values (for a given DF) indicate a good fit.
Value/DF: Values greater than unity indicate overdispersion (variance greater than expected).
Value/DF: Values less than unity indicate underdispersion (variance less than expected).
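The relationship between the rows of the goodness-of-fit table can be checked with a short calculation. The sketch below takes the Deviance and Pearson Chi-Square values of Example A, forms Value/DF, and applies the deviance-based scale (the square root of DEVIANCE/DOF, as the SAS note in the output states). The function name is an illustrative assumption.

```python
import math


def dispersion_summary(statistic: float, df: int) -> tuple[float, float]:
    """Return (Value/DF, scale), where scale is the square root of Value/DF."""
    value_per_df = statistic / df
    return value_per_df, math.sqrt(value_per_df)


deviance, pearson, df = 20.2819, 22.2740, 53  # values from the Example A output

value_per_df, scale = dispersion_summary(deviance, df)
scaled_deviance = deviance / scale ** 2  # equals DF by construction
scaled_pearson = pearson / scale ** 2    # Pearson statistic divided by the same factor

print(round(value_per_df, 4), round(scale, 4))
print(round(scaled_deviance, 4), round(scaled_pearson, 4))
```

Here Value/DF is well below 1, which by the rule of thumb above points to underdispersion in Example A.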
Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    8.5639     2.1271    16.2093     0.0001
DIST         1    -1.0032    0.2651    14.3173     0.0002
VEG A        1    0.2489     0.9555    0.0678      0.7945
VEG B        1    0.4370     0.9250    0.2232      0.6366
VEG C        0    0.0000     0.0000    .           .
SCALE        0    0.6186     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     53    34.8596   0.0001   34.8596     0.0001
VEG      2     53    0.1118    0.8944   0.2237      0.8942
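To see where the ChiSquare and Pr>Chi columns of the parameter table come from, the following sketch assumes that they are Wald statistics, (Estimate / Std Err)², compared with a chi-square distribution with 1 degree of freedom. This reading reproduces the printed values up to rounding, but it is an interpretation of the output rather than something the lesson states. The function names are hypothetical.

```python
import math


def wald_chi_square(estimate: float, std_err: float) -> float:
    """Wald statistic for one parameter: (estimate / standard error) squared."""
    return (estimate / std_err) ** 2


def chi_square_p_value_1df(chi_square: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 df.

    A chi-square variable with 1 df is a squared standard normal variable,
    so the tail probability equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


# (Estimate, Std Err) pairs from the Example A parameter table
parameters = {
    "INTERCEPT": (8.5639, 2.1271),
    "DIST": (-1.0032, 0.2651),
    "VEG A": (0.2489, 0.9555),
    "VEG B": (0.4370, 0.9250),
}

for name, (estimate, std_err) in parameters.items():
    chi_square = wald_chi_square(estimate, std_err)
    print(name, round(chi_square, 4), round(chi_square_p_value_1df(chi_square), 4))
```

Small differences from the printed table are expected because the estimates and standard errors are themselves rounded to four decimals.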
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value      Value/DF
Deviance             55   20.3675    0.3703
Scaled Deviance      55   55.0000    1.0000
Pearson Chi-Square   55   21.6364    0.3934
Scaled Pearson X2    55   58.4265    1.0623
Log Likelihood       .    -27.5000   .

Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    8.8288     2.0182    19.1363     0.0001
DIST         1    -1.0012    0.2587    14.9777     0.0001
SCALE        0    0.6085     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     55    36.4999   0.0001   36.4999     0.0001
Observation Statistics
VISIT   Pred     Xbeta     Std      HessWgt    Lower    Upper    Resraw
1       0.9998   8.3283    1.8909   0.000652   0.9903   1.0000   0.000242
1       0.9996   7.8277    1.7639   0.001075   0.9875   1.0000   0.000398
1       0.9578   3.1222    0.6185   0.1091     0.8710   0.9871   0.0422
1       0.9935   5.0244    1.0628   0.0175     0.9498   0.9992   0.006533
1       0.9971   5.8253    1.2605   0.007924   0.9663   0.9998   0.002943
1       0.9383   2.7217    0.5356   0.1563     0.8418   0.9775   0.0617
1       0.9971   5.8253    1.2605   0.007924   0.9663   0.9998   0.002943
1       0.9973   5.9255    1.2854   0.007173   0.9679   0.9998   0.002663
0       0.3358   -0.6822   0.5813   0.6023     0.1392   0.6123   -0.3358
1       0.9764   3.7229    0.7525   0.0622     0.9045   0.9945   0.0236
0       0.7150   ...
Predicted values and 95% confidence limits
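The Lower and Upper columns of the Observation Statistics table appear to be obtained by building the interval on the logit scale (Xbeta plus or minus about 1.96 times Std) and then transforming both ends back to probabilities. The sketch below assumes that construction and the rounded normal quantile 1.96; the function names are hypothetical.

```python
import math


def logistic(xbeta: float) -> float:
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-xbeta))


def confidence_limits(xbeta: float, std: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% limits: interval on the logit scale, back-transformed."""
    return logistic(xbeta - z * std), logistic(xbeta + z * std)


# Xbeta and Std for two rows of the Observation Statistics table
for xbeta, std in ((3.1222, 0.6185), (-0.6822, 0.5813)):
    lower, upper = confidence_limits(xbeta, std)
    print(round(logistic(xbeta), 4), round(lower, 4), round(upper, 4))
```

Because the back-transformation is nonlinear, the limits are asymmetric around the predicted value and always stay between 0 and 1.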
/* Example B: Analysis of the fraction of eggs in a nest that are lost */
PROC GENMOD;  /* The procedure is Generalized Linear Models */
  TITLE 'Eksempel B';
  CLASS veg;  /* veg is a class variable */
  MODEL killed/eggs = dist veg eggs / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
  /* DIST: distribution function (here chosen as binomial) */
  /* LINK: the model uses a logit transformation of the data */
  /* TYPE3: Type 3 (SS3) analysis is used to determine the contribution of
     the individual factors to the dependent variable */
  /* DSCALE: option that can be used if Deviance/DF is different from 1.
     It reduces the risk of Type I errors if the scale parameter is > 1
     and the risk of Type II errors if the scale parameter is < 1 */
  /* OBSTATS: gives the predicted values and the confidence limits */
RUN;
Eksempel B    12:26 Thursday, November 22, 2001    7
The GENMOD Procedure

Model Information
Description          Value
Data Set             WORK.LOGIST
Distribution         BINOMIAL
Link Function        LOGIT
Dependent Variable   KILLED
Dependent Variable   EGGS
Observations Used    57
Number Of Events     183
Number Of Trials     336

Class Level Information
Class   Levels   Values
VEG     3        A B C
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             52   53.9491     1.0375
Scaled Deviance      52   52.0000     1.0000
Pearson Chi-Square   52   44.1413     0.8489
Scaled Pearson X2    52   42.5465     0.8182
Log Likelihood       .    -171.3777   .
Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    2.6437     0.5644    21.9369     0.0001
DIST         1    -0.5284    0.0623    71.9060     0.0001
VEG A        1    0.1425     0.3629    0.1541      0.6946
VEG B        1    0.1623     0.3602    0.2029      0.6524
VEG C        0    0.0000     0.0000    .           .
EGGS         1    -0.0314    0.0637    0.2433      0.6219
SCALE        0    1.0186     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     52    97.2164   0.0001   97.2164     0.0001
VEG      2     52    0.1135    0.8929   0.2271      0.8927
EGGS     1     52    0.2443    0.6232   0.2443      0.6211
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             55   54.5182     0.9912
Scaled Deviance      55   55.0000     1.0000
Pearson Chi-Square   55   45.0882     0.8198
Scaled Pearson X2    55   45.4867     0.8270
Log Likelihood       .    -179.6600   .

Analysis Of Parameter Estimates
Parameter    DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT    1    2.5156     0.2950    72.7128     0.0001
DIST         1    -0.5212    0.0589    78.3656     0.0001
SCALE        0    0.9956     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F          Pr>F     ChiSquare   Pr>Chi
DIST     1     55    107.8859   0.0001   107.8859    0.0001
Predicted values and 95% confidence limits
The likelihood function
The binomial distribution
A nest contains n eggs, of which r are eaten by predators. The probability that a given egg is eaten is denoted p. The probability that exactly r of the eggs are killed is
$P(r) = \binom{n}{r} p^{r} (1 - p)^{n - r}$
r1 = number of killed eggs out of n1 eggs in the first nest
r2 = number of killed eggs out of n2 eggs in the second nest
ri = number of killed eggs out of ni eggs in the ith nest
$L = P(r_1) \times P(r_2) \times P(r_3) \times \dots \times P(r_i) \times \dots \times P(r_k)$
$\ln L = \ln P(r_1) + \ln P(r_2) + \ln P(r_3) + \dots + \ln P(r_i) + \dots + \ln P(r_k)$
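The product of binomial probabilities and its logarithm can be sketched in a few lines of Python. For simplicity the example assumes one common probability p for all nests (in the logistic model p would differ between nests through the predictors), and it uses only the KILLED and EGGS values of the six nests listed on the data slide. The function names are hypothetical.

```python
import math


def log_binomial_probability(r: int, n: int, p: float) -> float:
    """ln P(r): log of the binomial probability of r killed eggs out of n."""
    return math.log(math.comb(n, r)) + r * math.log(p) + (n - r) * math.log(1.0 - p)


def log_likelihood(nests: list[tuple[int, int]], p: float) -> float:
    """ln L: sum of ln P(r_i) over all nests, assuming independent nests."""
    return sum(log_binomial_probability(r, n, p) for r, n in nests)


# (killed, eggs) for observations 1 to 6 of the data slide
nests = [(3, 3), (5, 7), (1, 5), (6, 9), (5, 7), (3, 8)]

for p in (0.3, 0.5, 23 / 39, 0.8):
    # 23 / 39 is the pooled fraction of killed eggs in these six nests
    print(round(p, 3), round(log_likelihood(nests, p), 4))
```

Working with ln L rather than L turns the product into a sum, which avoids very small numbers when many nests are multiplied together.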
Maximum likelihood
The parameter estimates β0, β1, ..., βp are found as the values that maximize the likelihood of observing exactly r1, r2, ..., ri, ..., rk positive events out of n1, n2, ..., ni, ..., nk trials.
The maximum value of L can be found by differentiating L with respect to β0, β1, ..., βp and setting the derivatives equal to 0. This is equivalent to differentiating ln L, because ln L reaches its maximum at the same parameter values.
The variance of a parameter
An example: Estimation of β0
The variance of
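Since the worked example on the original slides is not preserved, the following is only a sketch of how β0 and its variance could be estimated in the simplest case: a model with an intercept and no predictors. It sets the derivative of ln L to zero by Newton-Raphson iteration and uses the usual large-sample approximation that the variance is the inverse of the negative second derivative of ln L. The data are the six nests from the data slide; the function name and the iteration count are assumptions.

```python
import math


def fit_intercept_only(nests: list[tuple[int, int]], iterations: int = 25) -> tuple[float, float]:
    """Maximum likelihood estimate of beta0 and its approximate variance."""
    total_r = sum(r for r, _ in nests)  # total number of killed eggs
    total_n = sum(n for _, n in nests)  # total number of eggs
    beta0 = 0.0
    for _ in range(iterations):
        p = 1.0 / (1.0 + math.exp(-beta0))
        score = total_r - total_n * p            # d ln L / d beta0
        information = total_n * p * (1.0 - p)    # -(d2 ln L / d beta0^2)
        beta0 += score / information             # Newton-Raphson step
    p = 1.0 / (1.0 + math.exp(-beta0))
    variance = 1.0 / (total_n * p * (1.0 - p))   # inverse of the information
    return beta0, variance


nests = [(3, 3), (5, 7), (1, 5), (6, 9), (5, 7), (3, 8)]  # (killed, eggs)
beta0, variance = fit_intercept_only(nests)
print(round(beta0, 4), round(variance, 4), round(math.sqrt(variance), 4))
```

In this special case the estimate has a closed form, the logit of the pooled fraction of killed eggs, which gives a convenient check on the iteration. With predictors in the model the same idea applies, but the derivatives become vectors and matrices.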
Analysis of frequency tables
One-way classification
Example: Tomato plants
Height of tomato plants is determined by two alleles: T = tall (dominant), d = dwarf (recessive).
Leaf morphology is determined by two alleles: C = cut leaves (dominant), p = potato-shaped leaves (recessive).
TTCC x ddpp
TdCp x TdCp
F2-generation
9 Tall, cut-leaves
3 Tall, potato-leaves
3 dwarf, cut-leaves
1 dwarf, potato-leaves
H0: The observed distribution agrees with the expected 9:3:3:1 distribution
H1: The observed distribution does not agree with the expected distribution
α = 0.05
χ² one-sample test
G-test
G is distributed approximately as χ² with df = a - 1, where a is the number of classes.
With 3 df: P = 0.687
Conclusion: The observed and the expected distributions agree well.
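The observed counts behind the reported P value are not preserved in this transcript, so the sketch below uses clearly hypothetical counts to show how the χ² one-sample statistic and the G statistic can be computed against a 9:3:3:1 expectation. The G statistic is computed as 2 Σ O ln(O/E), and the P value uses the closed-form upper tail of the χ² distribution with 3 df. All function names are assumptions.

```python
import math


def expected_counts(observed: list[int], ratios: list[int]) -> list[float]:
    """Expected counts under the hypothesized ratio, same total as observed."""
    total, ratio_sum = sum(observed), sum(ratios)
    return [total * k / ratio_sum for k in ratios]


def chi_square_statistic(observed, expected) -> float:
    """Pearson chi-square: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def g_statistic(observed, expected) -> float:
    """G = 2 * sum of O * ln(O / E); classes with O = 0 contribute nothing."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def chi_square_p_value_3df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 3 df."""
    if x <= 0.0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


observed = [84, 35, 28, 13]  # hypothetical counts, NOT the data from the lesson
expected = expected_counts(observed, [9, 3, 3, 1])

for name, statistic in (("chi-square", chi_square_statistic(observed, expected)),
                        ("G", g_statistic(observed, expected))):
    print(name, round(statistic, 4), round(chi_square_p_value_3df(statistic), 4))
```

H0 is rejected at α = 0.05 when the P value falls below 0.05; with a = 4 classes the test has 3 degrees of freedom, as in the lesson.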