Title: Bayesian Statistics: A Biologist's Interpretation
1. Bayesian Statistics: A Biologist's Interpretation
Marguerite Pelletier, URI Natural Resources Science / U.S. EPA
2. How have Bayesian Methods been used?
- Federal allocation of money: Bayesian analysis of population characteristics, such as poverty, in small geographic areas
- Microsoft Windows Office Assistant: Bayesian artificial intelligence algorithm
- It has been suggested that Bayesian statistics be used in environmental science because it addresses questions about the probability of events occurring, which allows better decision-making
3. Bayesian Statistics vs. Frequentist Statistics
- Frequentist (Traditional) Statistics
  - Assumes a fixed, true value for the parameter of interest (e.g., mean, std dev)
  - Expected value: the average value obtained by random sampling repeated ad infinitum
  - Can only reject the null hypothesis (Ho), not support the alternative hypothesis (Ha); p-values indicate statistical rareness
  - Large sample sizes make rejection of Ho more likely
  - Confidence intervals show confidence about the value of the parameter, not how likely that parameter is in real life
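The sample-size point can be illustrated with a short sketch. It assumes a two-sided z-test with a known standard deviation, and the effect size and sample sizes are hypothetical numbers chosen only for illustration.

```python
import math

def two_sided_p_value(effect, sd, n):
    """Two-sided z-test p-value for a sample mean that differs from the null value by `effect`."""
    z = effect / (sd / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical: the same small observed difference (0.1 sd) at growing sample sizes
for n in (10, 100, 1000, 10000):
    print(n, two_sided_p_value(effect=0.1, sd=1.0, n=n))
```

The observed difference never changes, yet the p-value falls steadily as n grows, so Ho is eventually rejected on sample size alone.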
4. Bayesian Statistics vs. Frequentist Statistics, cont.
- Bayesian Statistics
  - Assumes the parameter of interest (e.g., mean, std dev) is variable and based on the data
  - Can test the probability of the alternative hypothesis (Ha) or hypotheses given the data (which is what most scientists really care about)
  - Generates a probability for any hypothesis being true
  - Sample sizes are taken into account: a large sample size alone won't cause acceptance of the hypothesis
  - Creates credible intervals rather than confidence intervals: these tell how likely the answer is in the real world
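A credible interval can be read directly off a posterior distribution. The sketch below is one simple way to do that, not a method from the talk: it assumes a proportion with a flat prior, approximates the posterior on a grid, and uses hypothetical data (7 occurrences in 20 samples).

```python
def credible_interval(successes, trials, mass=0.95, grid_size=10001):
    """Equal-tailed credible interval for a proportion under a flat prior (grid approximation)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalised posterior = binomial likelihood x flat prior
    post = [p ** successes * (1 - p) ** (trials - successes) for p in grid]
    total = sum(post)
    tail = (1 - mass) / 2
    cum, lo, hi = 0.0, None, None
    for p, w in zip(grid, post):
        cum += w / total
        if lo is None and cum >= tail:
            lo = p
        if hi is None and cum >= 1 - tail:
            hi = p
    return lo, hi

# Hypothetical data: 7 occurrences in 20 samples
lo, hi = credible_interval(7, 20)
print(f"95% credible interval for the proportion: {lo:.3f} to {hi:.3f}")
```

Under these assumptions the interval has the direct reading the slide describes: given the data and the prior, the proportion lies between the two bounds with 95% probability.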
5. How do Bayesian Statistics Work?
Posterior probability = (Fisher's likelihood function × Prior probability) / Expected likelihood function
- Likelihood function: given data with a known (or predicted) distribution (e.g., Normal, Poisson), a likelihood function (probability distribution) can be calculated
- Prior probability: based on existing data or a subjective indication of what the investigator believes to be true
- Expected likelihood function: marginal distribution of the data given the hyperparameter; takes sample size into account
Bayes' Rule: Posterior ∝ Likelihood × Priors
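As a minimal sketch of the rule for a discrete set of hypotheses: the function name, the prior weights and the likelihood values below are all hypothetical, chosen only to show the arithmetic.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule over a discrete set of hypotheses."""
    unnormalised = [p * l for p, l in zip(priors, likelihoods)]
    # The normalising constant plays the role of the "expected likelihood"
    evidence = sum(unnormalised)
    return [u / evidence for u in unnormalised]

# Hypothetical: the investigator's prior favours Ho 3:1,
# but the data are 8 times more likely under Ha
priors = [0.75, 0.25]        # P(Ho), P(Ha)
likelihoods = [0.01, 0.08]   # P(data | Ho), P(data | Ha)
print(posterior(priors, likelihoods))
```

Dividing by the sum is what turns "likelihood × prior" into probabilities that add to 1. With these numbers the posterior favours Ha despite the prior.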
6. Problems with Bayesian Statistics
- Computationally intense (integration of complex functions). However, better computers and the development of Markov Chain Monte Carlo methods have made the techniques more accessible
- Not directly applicable to many complex statistical analyses. Can be used for certain regression techniques and to generate a posterior distribution given a prior. Attempts to use it in clustering have been unsuccessful
- Not readily available in most common statistical software (SPSS, SAS)
- Not applicable to very rare events: priors dominate the function, so the posterior doesn't change, which implies that further study is not needed/useful
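The rare-event problem can be seen in a small conjugate example. This sketch assumes a Beta prior on an event probability with binomial data (a standard conjugate pair); the prior parameters and the data are hypothetical.

```python
def beta_update(a, b, events, trials):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + events, b + (trials - events)

def beta_mean(a, b):
    return a / (a + b)

# Hypothetical prior belief: the event occurs about once in 1000 opportunities
prior = (1, 999)
# Hypothetical new study: 50 observations, no events seen
post = beta_update(*prior, events=0, trials=50)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", beta_mean(*post))
```

The posterior mean is almost identical to the prior mean, because a study of realistic size carries very little information about an event this rare.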
7. So When are Bayesian Statistics Useful?
- When limited data are available: formalizes the use of Best Professional Judgment (Case Study 1)
- When Bayesian algorithms have been developed for a statistic, e.g., regression (Case Study 2)
- After using more traditional statistical methods, to develop a probability distribution (Case Study 3)
- When the answer is a single number rather than a complex function (e.g., a simple calculation, not a complex multivariate analysis)
8. Case Study 1: Development of a Bayesian Probability Network in the Neuse River Estuary, N.C.
(Borsuk ME, Stow CA, Reckhow KH. 2003. An integrated approach to TMDL development for the Neuse River estuary using a Bayesian probability network. Journal of Water Resources Planning and Management, accepted)
9. Summary of Project
- Neuse River estuary is impaired due to nitrogen (eutrophication problems), requiring a Total Maximum Daily Load (TMDL) to be developed
- For development of a TMDL, links must be established between pollutant load (N) and water quality impairment
- Because of the range of endpoints and the need to determine the probability of impact, a Bayesian Network was developed
- Data for the model came from routine water quality monitoring and from elicited judgment of scientific experts
10. [Figure: Bayesian Network diagram. Nodes: River N, River Flow, Algal Density, Pfiesteria Abundance, Carbon Production, Water Temperature, Sediment Oxygen Demand, Oxygen Concentration, Duration of Stratification, Shellfish Abundance, Days of Hypoxia, Frequency of Cross-Channel Winds, Frequency of Fish Kills, Fish Population Health. Legend: System variable; Node or Submodel; Association.]
11. Use of Bayesian Network (focus on Fish Kills)
- Fish kills = low bottom D.O. + cross-channel winds (force bottom water and fish to shores) + fish health (influences susceptibility)
- Two expert fisheries biologists were asked about the likelihood of a fish kill given certain conditions (various wind/hypoxia/fish health scenarios)
- All probabilistic relationships (including fish kill info) were incorporated into the Bayesian network
- Four nitrogen reduction scenarios assessed: 0, 15, 30, 45 and 60% (relative to 1991-1995 baseline), using Latin Hypercube sampling
- As N inputs decreased, mean chl and exceedance frequency also decreased
- Fish kills don't change substantially with N reduction: fish kills are relatively rare, and the effect of reduced C production is damped out further along the causal chain
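Latin Hypercube sampling, mentioned above, splits each input's range into equal-probability strata and draws exactly one value from each stratum. The sketch below is a generic illustration on the unit cube, not the study's actual implementation; the function name and sample counts are hypothetical.

```python
import random

def latin_hypercube(n_samples, n_vars, rng):
    """Return n_samples points in [0, 1)^n_vars with one point per stratum in every dimension."""
    columns = []
    for _ in range(n_vars):
        # One draw inside each of n_samples equal-width strata, then shuffle the order
        col = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(col)
        columns.append(col)
    return [tuple(col[i] for col in columns) for i in range(n_samples)]

# Hypothetical use: 5 samples over 2 uncertain inputs
for point in latin_hypercube(5, 2, random.Random(0)):
    print(point)
```

Each unit-interval value would then be mapped through the inverse cumulative distribution of the corresponding model input. Stratifying this way covers each input's range more evenly than the same number of purely random draws.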
12. Case Study 2: Assessing Spatial Population Viability Models using Bayesian Statistics
(Mac Nally R, Fleishman E, Fay JP, Murphy DD. 2003. Modeling butterfly species richness using mesoscale environmental variables: model construction and validation for the mountain ranges in the Great Basin of western North America. Biological Conservation 110:21-31.)
13. Summary of Project
- Species richness is a function of local environmental variables
- Over large scales these variables are hard to collect
- This study: 14 environmental variables from GIS and remote sensing were used to predict butterfly species richness
- Poisson regression was used to develop appropriate models from the 28 variables (IV + IV²); the Schwarz Information Criterion was used for selection
- Appropriate variables were then used in a Bayesian Poisson model
- Model output was validated against additional field data
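The Schwarz Information Criterion trades goodness of fit against the number of parameters: k ln(n) - 2 ln(L), with lower values preferred. The sketch below uses hypothetical log-likelihoods, parameter counts and site count, none of which come from the study.

```python
import math

def schwarz_criterion(log_likelihood, n_params, n_obs):
    """Schwarz (Bayesian) Information Criterion: k * ln(n) - 2 * ln(L). Lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

# Hypothetical candidate models fitted to 50 sites: (log-likelihood, number of parameters)
candidates = {"A": (-120.0, 3), "B": (-118.5, 6)}
scores = {name: schwarz_criterion(ll, k, n_obs=50) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> selected:", best)
```

Model B fits slightly better, but its three extra parameters cost more than the fit gains, so the criterion selects the simpler model A.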
14. Bayesian Poisson Regression
$\log \lambda_i = \alpha + \sum_k \beta_k X_{ik} + \varepsilon$, $Y_i \sim \mathrm{Poisson}(\lambda_i)$
where $\lambda_i$ = mean (unobservable, true) spp richness at site i; $\alpha$, $\beta_k$ = regression coefficients (non-informative priors); $\varepsilon$ = model error; $Y_i$ = observed spp richness
- Markov Chain Monte Carlo algorithm: 1000-iteration burn-in, then 3000 iterations to generate parameter estimates and mean spp richness estimates
- New model run using validation data and the regression-coefficient distribution from the 1st model
- Model worked well for the same mountain range, but not for a new range
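To make the MCMC step concrete, here is a minimal random-walk Metropolis sketch for a Poisson regression. It is a deliberate simplification and not the authors' implementation: it assumes a single covariate, flat priors and no separate error term, and the data, step size and seed are hypothetical. Only the 1000-iteration burn-in and 3000 retained iterations follow the slide.

```python
import math
import random

def log_posterior(alpha, beta, xs, ys):
    """Poisson log-likelihood; with flat priors this is the log-posterior up to a constant."""
    total = 0.0
    for x, y in zip(xs, ys):
        log_lam = alpha + beta * x
        total += y * log_lam - math.exp(log_lam)  # log(y!) omitted: constant in alpha, beta
    return total

def metropolis(xs, ys, burn_in=1000, keep=3000, step=0.1, seed=1):
    rng = random.Random(seed)
    alpha, beta = 0.0, 0.0
    current = log_posterior(alpha, beta, xs, ys)
    draws = []
    for it in range(burn_in + keep):
        # Propose a small Gaussian move in both coefficients
        cand_a = alpha + rng.gauss(0.0, step)
        cand_b = beta + rng.gauss(0.0, step)
        cand = log_posterior(cand_a, cand_b, xs, ys)
        # Accept with probability min(1, posterior ratio)
        if rng.random() < math.exp(min(0.0, cand - current)):
            alpha, beta, current = cand_a, cand_b, cand
        if it >= burn_in:
            draws.append((alpha, beta))
    return draws

# Hypothetical data: one environmental covariate and observed species counts at 7 sites
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [3, 3, 5, 6, 7, 10, 12]
draws = metropolis(xs, ys)
mean_alpha = sum(a for a, _ in draws) / len(draws)
mean_beta = sum(b for _, b in draws) / len(draws)
print("posterior means:", mean_alpha, mean_beta)
```

The retained draws approximate the joint posterior of the coefficients, so reusing that distribution for a validation run, as the slide describes, amounts to feeding these draws into predictions for new sites.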
15. Case Study 3: Assessing Spatial Population Viability Models using Bayesian Statistics
(McCarthy MA, Lindenmayer DB, Possingham HP. 2001. Assessing spatial PVA models of arboreal marsupials using significance tests and Bayesian statistics. Biological Conservation 98:191-200.)
16. Summary of Project
- Population Viability Analysis is used in Conservation Biology to assess the potential for species extinction
- Many models are based on limited data; they can be assessed via significance tests or Bayesian methods
- Metapopulation models (for 4 arboreal marsupials) were developed
- 2 competing null models were also developed
  - No effect of fragmentation
  - No dispersal between patches
- Models were compared using likelihood and Bayesian methods
17. Model Comparison
- Predicted presence in patches was compared to observed presence using logistic regression: $\ln(o/(1 - o)) = \alpha + \beta \ln(p/(1 - p))$, where o = observed presence, p = predicted presence, and $\alpha$, $\beta$ = regression coefficients
- Significant differences between predicted and observed if $\alpha$ is significantly different from 0 or $\beta$ is significantly different from 1
- Models compared using log-likelihood: models with higher log-likelihood values (closer to 0) more closely match the data
- Bayesian posterior probabilities used to compare models: higher probabilities more closely match the data
  - Prior: all 3 models equally plausible
  - Probability of model = likelihood of model / sum of all likelihoods
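With equal priors, the posterior probability of each model can be computed directly from the log-likelihoods. The sketch below uses hypothetical log-likelihood values for the three model types. Subtracting the largest value before exponentiating is a numerical safeguard added here, not something stated on the slide.

```python
import math

def model_probabilities(log_likelihoods):
    """Posterior model probabilities under equal priors: L_m / sum of all L."""
    top = max(log_likelihoods.values())
    # Subtract the largest log-likelihood before exponentiating to avoid underflow
    weights = {m: math.exp(ll - top) for m, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {m: w / total for m, w in weights.items()}

# Hypothetical log-likelihoods for the three competing models
probs = model_probabilities({
    "full": -40.0,
    "no fragmentation": -43.0,
    "no dispersal": -41.0,
})
print(probs)
```

Because the priors are equal they cancel, so the ranking always matches the log-likelihood ranking. What the posterior probabilities add is a sense of how decisive the difference is.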
18. Conclusions
- Comparison with actual data
  - Full model best for greater glider, yellow-bellied glider
  - No fragmentation model best for mountain brushtail possum, ringtail possum (but predicted values ½ observed values)
- Log-likelihood values
  - Confirmed no fragmentation model best for 2 possum spp
  - Confirmed full model best for the greater glider
  - Yellow-bellied glider equally represented by full model and no dispersal model
- Bayesian statistics confirmed log-likelihood results
- Authors indicated that significance tests are useful to assess model accuracy; Bayesian methods are useful for comparing models but computationally intense