Title: Maximum Entropy
1Maximum Entropy
- RESM 575
- Spring 2009
- Lecture 13
2Maximum entropy
(Phillips et al. 2008)
- History
- E. T. Jaynes 1957
- Thermodynamics
- Inference and information theory
3The Maximum Entropy Method
- Origins: Jaynes 1957, statistical mechanics
- Recent use: machine learning, e.g. automatic language translation
- To estimate an unknown distribution:
- Determine what you know (constraints)
- Among distributions satisfying constraints
- Output the one with maximum entropy
5What is it?
- Maxent is a general-purpose method for making predictions or inferences from incomplete information.
- Its origins lie in statistical mechanics (Jaynes, 1957), and it remains an active area of research with an Annual Conference, Maximum Entropy and Bayesian Methods, that explores applications in diverse areas such as astronomy, portfolio optimization, image reconstruction, statistical physics and signal processing.
6Like other Bayesian models
- Uses prior information
- Maxent is an alternative to methods of inference
of classical statistics
7Maximum Entropy Principle
The fact that a certain probability distribution maximizes entropy subject to certain constraints representing our incomplete information is the fundamental property which justifies the use of that distribution for inference; it agrees with everything that is known but carefully avoids assuming anything that is not known (Jaynes, 1990).
8Why?
- Introduced as a general approach for presence-only modeling of species distributions, suitable for all existing applications involving presence-only datasets.
9Modeling species distributions
Figure: Yellow-throated Vireo occurrence points and environmental variables
10Estimating a probability distribution
- Given
- Map divided into cells
- Environmental variables, with values in each cell
- Occurrence points: samples from an unknown distribution
- Our task is to estimate the unknown probability distribution
- Note
- The distribution sums to 1 over the whole map
- Most probability values will be very small
- Different from estimating probability of presence
11Entropy
- More entropy = more spread out, closer to uniform distribution
- 2nd law of thermodynamics
- Without external influences, a system moves to increase entropy
- Maximum entropy method
- Apply constraints to remove external influences
- Species spreads out to fill areas with suitable conditions
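As a small illustration of "more entropy = more spread out", the sketch below computes the Shannon entropy of two made-up distributions over four map cells. The cell probabilities are hypothetical and chosen only to show the contrast.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as probabilities."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical distributions over four cells (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]   # fully spread out
peaked = [0.85, 0.05, 0.05, 0.05]    # concentrated in one cell

print(entropy(uniform))  # equals ln(4), the maximum for four cells
print(entropy(peaked))   # smaller: the distribution is less spread out
```

The uniform distribution attains the largest possible entropy for a given number of cells, which is why an unconstrained maximum entropy estimate is uniform.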
12Using Maxent for Species Distributions
- Features
- Constraints
- Regularization
13Features impose constraints
Feature: environmental variable, or function thereof.
Find the distribution p of maximum entropy such that, for all features f, mean(f) = sample average of f.
14Features
- Environmental variables or functions thereof.
- Maxent has these classes of features (others are possible)
- Linear: variable itself
- Quadratic: square of variable
- Product: product of two variables
- Binary (indicator): membership in a category
- Threshold
- Hinge
Figure: threshold and hinge features, each plotted from 0 to 1 against an environmental variable
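The feature classes can be sketched as plain functions of one or two environmental variables. This is an illustrative sketch, not the Maxent program's code: the function names, the `knot` parameter, and the choice of a hinge that is 0 up to the knot and rises linearly to 1 at `vmax` are assumptions made here.

```python
def linear(v):
    """Linear feature: the variable itself."""
    return v

def quadratic(v):
    """Quadratic feature: square of the variable."""
    return v * v

def product(v, w):
    """Product feature: product of two variables."""
    return v * w

def binary(category, target):
    """Binary (indicator) feature: 1 if the cell is in the target category."""
    return 1.0 if category == target else 0.0

def threshold(v, knot):
    """Threshold feature: 1 above the knot, 0 otherwise."""
    return 1.0 if v > knot else 0.0

def hinge(v, knot, vmax):
    """Hinge feature (assumed form): 0 up to the knot, then linear up to 1 at vmax."""
    if v <= knot:
        return 0.0
    return (v - knot) / (vmax - knot)

# Hypothetical temperature value of 15 with a knot at 10 and a maximum of 30.
print(threshold(15.0, 10.0), hinge(15.0, 10.0, 30.0))
```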
15Constraints
Each feature type imposes constraints on the output distribution:
- Linear features: mean
- Quadratic features: variance
- Product features: covariance
- Threshold features: proportion above threshold
- Hinge features: mean above threshold
- Binary features (categorical): proportion in each category
16Regularization
Figure: sample average and true mean, plotted on precipitation and temperature axes
Find the distribution p of maximum entropy such that mean(f) lies in a confidence region around the sample average of f.
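A minimal sketch of the relaxed constraint, assuming the confidence region for each feature is an interval of half-width `beta` around its sample average. The function name, the array layout, and the numbers are hypothetical.

```python
import numpy as np

def within_confidence_region(p, features, sample_means, beta):
    """Check, per feature, whether the mean under p is within beta of the sample average.

    p            : (n_cells,) probabilities summing to 1
    features     : (n_cells, n_features) feature values per cell
    sample_means : (n_features,) averages over the occurrence points
    beta         : (n_features,) assumed half-widths of the confidence intervals
    """
    model_means = p @ features
    return np.abs(model_means - sample_means) <= beta

# Hypothetical example: four cells, one feature.
p = np.full(4, 0.25)
features = np.array([[0.0], [1.0], [2.0], [3.0]])
print(within_confidence_region(p, features, np.array([1.7]), np.array([0.3])))
```

With exact equality constraints (`beta` of zero) a small sample can force the model to match noise in the sample average; widening the interval lets the distribution stay more spread out.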
17The Maxent distribution
is always a Gibbs distribution
$q_\lambda(x) = \exp\left(\sum_j \lambda_j f_j(x)\right) / Z$
Z is a scaling factor so that the distribution sums to 1
$f_j$ is the jth feature
$\lambda_j$ is a coefficient, calculated by the program
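The Gibbs form can be evaluated directly once feature values and coefficients are given. The sketch below assumes a small feature matrix with one row per cell; the values and the function name are illustrative, and the coefficients are supplied by hand rather than fitted.

```python
import numpy as np

def gibbs_distribution(features, lam):
    """Return exp(sum_j lam_j * f_j(x)) / Z for every cell x.

    features : (n_cells, n_features) feature values
    lam      : (n_features,) coefficients
    """
    scores = features @ lam
    scores = scores - scores.max()   # numerical stability; cancels in the ratio
    weights = np.exp(scores)
    return weights / weights.sum()   # dividing by Z makes the result sum to 1

# Hypothetical example: three cells, one feature.
features = np.array([[0.0], [1.0], [2.0]])
print(gibbs_distribution(features, np.array([0.0])))  # all coefficients zero: uniform
print(gibbs_distribution(features, np.array([1.0])))  # mass shifts to high feature values
```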
18Maxent is penalized maximum likelihood
Log likelihood: $\mathrm{LogLikelihood}(q_\lambda) = \frac{1}{m} \sum_{i=1}^{m} \ln q_\lambda(x_i)$, where $x_1, \dots, x_m$ are the occurrence points.
Maxent maximizes the regularized likelihood $\mathrm{LogLikelihood}(q_\lambda) - \sum_j \beta_j |\lambda_j|$, where $\beta_j$ is the width of the confidence interval for $f_j$.
Similar to the Akaike Information Criterion (AIC).
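To make the objective concrete, the sketch below evaluates the average log likelihood of the occurrence points under a Gibbs distribution and subtracts a penalty on the coefficient magnitudes. It only evaluates the objective and does not perform the optimization; the function name, the use of cell indices for occurrence points, and the numbers are assumptions.

```python
import numpy as np

def penalized_log_likelihood(features, occurrence_idx, lam, beta):
    """Average log likelihood of the occurrence cells minus sum_j beta_j * |lam_j|.

    features       : (n_cells, n_features) feature values
    occurrence_idx : indices of the cells holding occurrence points
    lam            : (n_features,) coefficients
    beta           : (n_features,) regularization widths
    """
    scores = features @ lam
    log_z = np.log(np.sum(np.exp(scores)))
    log_q = scores - log_z                      # ln q(x) for every cell
    return log_q[occurrence_idx].mean() - np.sum(beta * np.abs(lam))

# Hypothetical example: occurrences sit in the two cells with the highest feature value.
features = np.array([[0.0], [1.0], [2.0], [3.0]])
occurrences = np.array([2, 3])
beta = np.array([0.1])
print(penalized_log_likelihood(features, occurrences, np.array([0.0]), beta))
print(penalized_log_likelihood(features, occurrences, np.array([1.0]), beta))
```

In this toy case a positive coefficient scores higher than zero because the gain in likelihood outweighs the penalty; a larger `beta` would shrink that advantage.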
19Output
- When Maxent is applied to presence-only species distribution modeling, the pixels of the study area make up the space on which the Maxent probability distribution is defined
- Pixels with known species occurrence records constitute the sample points, and the features are
- climatic variables,
- elevation,
- soil category,
- vegetation type or other environmental variables, and functions thereof.
20To note
- Sometimes both presence and absence occurrence data are available for the development of models, in which case general-purpose statistical methods can be used
- (for an overview of the variety of techniques currently in use, see Corsi et al., 2000; Elith, 2002; Guisan and Zimmermann, 2000; Scott et al., 2002).
21Opportunity
- However, while vast stores of presence-only data exist (records etc.), absence data are rarely available
- Poorly sampled areas, remote, difficult
- Absence data may be of questionable value in many situations
23Background
- 16 modeling methods
- 226 well surveyed species in 6 regions of the
world
24The authors used three statistics, the area under
the Receiver Operating Characteristic curve
(AUC), correlation (COR) and Kappa, to assess the
agreement between the presence-absence records
and the predictions.
27Maximum Entropy
- Only useful when applied to testable information (information for which one can determine whether a given distribution is consistent with it)
- Given testable information, the maximum entropy procedure consists of seeking the probability distribution which maximizes information entropy, subject to the constraints of the information.
- This constrained optimization problem is typically solved using the method of Lagrange multipliers.
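As a sketch of the Lagrange multiplier step, assume the testable information is a set of expected values $\tilde{f}_j$ for features $f_j$ (the symbols $\tilde{f}_j$ and $\mu$ are introduced here for illustration). Setting the derivative of the Lagrangian to zero gives an exponential (Gibbs) form:

```latex
\begin{aligned}
&\max_{p}\; H(p) = -\sum_x p(x)\ln p(x)
\quad \text{s.t.} \quad \sum_x p(x) f_j(x) = \tilde{f}_j \;\; \forall j,
\qquad \sum_x p(x) = 1 \\[4pt]
&\mathcal{L} = -\sum_x p(x)\ln p(x)
 + \sum_j \lambda_j \Big( \sum_x p(x) f_j(x) - \tilde{f}_j \Big)
 + \mu \Big( \sum_x p(x) - 1 \Big) \\[4pt]
&\frac{\partial \mathcal{L}}{\partial p(x)}
 = -\ln p(x) - 1 + \sum_j \lambda_j f_j(x) + \mu = 0 \\[4pt]
&\Rightarrow\; p(x) = \frac{\exp\big(\sum_j \lambda_j f_j(x)\big)}{Z},
\qquad Z = \sum_x \exp\Big(\sum_j \lambda_j f_j(x)\Big)
\end{aligned}
```

The multiplier $\mu$ is absorbed into the normalizer $Z$, and the remaining multipliers $\lambda_j$ are the values that make the constraints hold.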
29Output format
Raw output Cumulative output
30Cumulative output format
- Gives estimate of omission rate
- A pixel p has cumulative value c: the total probability of pixels with lower probability than p is c
- Set a threshold of c
- Binary model with presence if cumulative value > c
- Omission rate is c if test data drawn from Maxent distribution
- Predict omission rate of c for real test data
- Example thresholds
- 5 (light red)
- 20 (dark red)
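The cumulative format can be sketched from a raw distribution as follows. This is an illustration under stated assumptions: values are expressed as percentages (so thresholds such as 5 and 20 apply directly), ties are handled by counting only strictly less probable pixels, and the function name and raw values are hypothetical.

```python
import numpy as np

def cumulative_output(raw):
    """Cumulative value (percent) per pixel: total probability of strictly less probable pixels."""
    raw = np.asarray(raw, dtype=float)
    sorted_raw = np.sort(raw)
    # running[k] = total probability of the k least probable pixels
    running = np.concatenate(([0.0], np.cumsum(sorted_raw)))
    # for each pixel, the number of pixels with strictly lower probability
    n_lower = np.searchsorted(sorted_raw, raw, side="left")
    return 100.0 * running[n_lower]

# Hypothetical raw output over four pixels (sums to 1).
raw = np.array([0.1, 0.4, 0.2, 0.3])
cumulative = cumulative_output(raw)
presence = cumulative > 20          # binary model at threshold c = 20
print(cumulative, presence)
```

If test points were drawn from the raw distribution itself, the pixels excluded by a threshold of c would hold about c percent of the probability, which is the predicted omission rate.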
31Logistic output format
- Estimates probability of presence
- Between 0 and 1
- Scaled so that a typical presence has value 0.5
- Defined as $c\, q_\lambda(x) / (1 + c\, q_\lambda(x))$
- where $c = \exp(H(q_\lambda))$, with $H(q_\lambda)$ the entropy of $q_\lambda$
- Probability of presence depends on sampling details
- Site size
- Observation time
- These details should correspond to collection effort for occurrence points
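A short sketch of the logistic transform, assuming the raw output is available as an array of probabilities that sum to 1. The function name and the example values are illustrative.

```python
import numpy as np

def logistic_output(raw):
    """Map raw probabilities q(x) to c*q(x) / (1 + c*q(x)) with c = exp(entropy of q)."""
    raw = np.asarray(raw, dtype=float)
    positive = raw[raw > 0]
    entropy = -np.sum(positive * np.log(positive))
    c = np.exp(entropy)
    return c * raw / (1.0 + c * raw)

# With a uniform raw distribution every pixel is "typical" and maps to 0.5.
print(logistic_output([0.25, 0.25, 0.25, 0.25]))
# Hypothetical non-uniform case: higher raw values give higher logistic values.
print(logistic_output([0.1, 0.2, 0.7]))
```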
32Response curves
- Show how predicted probability of presence depends on each variable
- Simple features → simpler model
- Easier interpretation
- Complex features → complex model
- Better fit to data
- Linear and quadratic features (top)
- Threshold features (middle)
- All feature types (bottom)
33Effect of regularization: multiplier = 0.2
- Smaller confidence intervals
- Lower entropy
- Less spread-out
34Effect of regularization: multiplier = 5
- Larger confidence intervals
- Higher entropy
- More spread-out
35Effect of regularization: over-fitting
- Regularization multiplier 1.0 (not over-fit)
- Regularization multiplier 0.2 (clearly over-fit)
36- Sage grouse distribution model
- MAXENT software package
- Consistently superior to alternative methods
- Robust to collinearity between explanatory variables
- Accepts continuous and categorical variables
- Stable distribution with limited training data
- Evaluates relative variable importance
37West Virginia Conservation Prioritization using
Species Distribution Modeling
- Michael Dougherty
- West Virginia Division of Natural Resources
The Conservation Fund
38- Project Goals
- Develop statewide conservation prioritization map based on the distribution of:
- Species of Greatest Conservation Need (SGCN)
- Habitats of Concern
- Existing public land
- The Challenge
- Develop distribution models for 500 state-tracked species
- Species include plants, herps, birds, bats, mammals, aquatics
- Modeling process must be defensible, transparent, and repeatable
39- Occurrence data
- 1. State Natural Heritage Program Biotics database
- Biologists collect Source Features
- Source Features are grouped into Element Occurrences (EOs)
- EOs represent known populations
- Species identification is accurate and spatial accuracy documented
- Use of EOs seems to greatly reduce spatial autocorrelation
- 2. Community Ecologists Vegetation Plots Database
40- Predictor Variables
- Developed a broad range of predictor variables
- Climate
- Landcover
- Terrain
- Ecoregions
- Geology
- Soils
- Disturbances
41- Workflow Overview
- Build an array of workstations to run models
- Develop R scripts to automate running the maxent models by iterating through all 500 species
- Develop web-based map viewer to assist biologists in reviewing maxent model results
- Perform patch and connectivity analysis using FunConn
- (TBD) Assign weights to patches and connectors
42- Scripting Steps
- Developed R script to perform variable pre-selection using boosted regression trees to reduce the number of variables to an appropriate number (30)
- Developed R script to produce the maxent batch files and perform file management
- Developed R script to harvest maxent results, a Python script to store grids in an ArcSDE database, and publish results to a website
- (TBD) Develop R scripts to perform functional connectivity analysis
- (TBD) Perform layer weighting to produce conservation prioritization index
48Occurrence localities
- CSV file format. Each line has
- Species name
- X coordinate
- Y coordinate
- Multiple species can be in 1 file.
Example:
species,longitude,latitude
bradypus_variegatus,-65.4,-10.3833
bradypus_variegatus,-65.3833,-10.3833
bradypus_variegatus,-65.1333,-16.8
bradypus_variegatus,-63.6667,-17.45
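A file in this layout can be read with Python's standard `csv` module. The sketch below groups coordinates by species, since one file may hold several species; the function name is hypothetical and the column names are taken from the example header.

```python
import csv
import io
from collections import defaultdict

def read_occurrences(handle):
    """Group (longitude, latitude) pairs by species name from an occurrence CSV."""
    by_species = defaultdict(list)
    for row in csv.DictReader(handle):
        point = (float(row["longitude"]), float(row["latitude"]))
        by_species[row["species"]].append(point)
    return dict(by_species)

# The example rows from the slide, supplied as an in-memory file.
sample = io.StringIO(
    "species,longitude,latitude\n"
    "bradypus_variegatus,-65.4,-10.3833\n"
    "bradypus_variegatus,-65.3833,-10.3833\n"
    "bradypus_variegatus,-65.1333,-16.8\n"
    "bradypus_variegatus,-63.6667,-17.45\n"
)
occurrences = read_occurrences(sample)
print(len(occurrences["bradypus_variegatus"]))
```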
49Environmental variables
- ESRI ASCII raster grid file format.
- One file per environmental variable
- All files must have exactly the same bounds, cell size
- Coordinate system must be same as for occurrence localities
- Alternative: Diva .grd format.
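One way to check that grids share the same bounds and cell size is to compare their headers before running a model. The sketch below assumes the common ESRI ASCII header keys `ncols`, `nrows`, `xllcorner`, `yllcorner`, `cellsize` and an optional `NODATA_value`; the function names and grid values are made up for illustration.

```python
def read_ascii_grid_header(text):
    """Parse the leading 'key value' lines of an ESRI ASCII grid into a dict."""
    header = {}
    for line in text.splitlines():
        parts = line.split()
        # Header lines are a name followed by one number; data rows start with a number.
        if len(parts) != 2 or not parts[0][0].isalpha():
            break
        header[parts[0].lower()] = float(parts[1])
    return header

def same_geometry(h1, h2):
    """True if two headers agree on dimensions, lower-left corner, and cell size."""
    keys = ("ncols", "nrows", "xllcorner", "yllcorner", "cellsize")
    return all(h1[k] == h2[k] for k in keys)

# Hypothetical grids: the second uses a different cell size.
grid_a = "ncols 2\nnrows 2\nxllcorner -70.0\nyllcorner -20.0\ncellsize 0.5\nNODATA_value -9999\n1 2\n3 4\n"
grid_b = grid_a.replace("cellsize 0.5", "cellsize 0.25")
header_a = read_ascii_grid_header(grid_a)
header_b = read_ascii_grid_header(grid_b)
print(same_geometry(header_a, header_a), same_geometry(header_a, header_b))
```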
50Samples with data (SWD) format
- Environmental data given with samples in a .csv format file
- Example
species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap
bradypus_variegatus,-65.4,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,41.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
bradypus_variegatus,-65.3833,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,40.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0
bradypus_variegatus,-65.1333,-16.8,57.0,114.0,10.0,1.0,211.0,65.0,56.0,129.0,58.0,34.0,140.0,244.0,321.0,221.0
bradypus_variegatus,-63.6667,-17.45,57.0,112.0,10.0,3.0,363.0,36.0,33.0,71.0,27.0,13.0,135.0,229.0,307.0,202.0
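An SWD file can be split into coordinates and a per-sample dictionary of environmental values. The sketch below assumes the first three columns are species, longitude, and latitude, as in the example; the function name is hypothetical and only the first two example rows are used.

```python
import csv
import io

def read_swd(handle):
    """Return (variable names, records) from a samples-with-data CSV.

    Each record is (species, longitude, latitude, {variable: value}).
    """
    reader = csv.reader(handle)
    header = next(reader)
    variables = header[3:]                      # everything after the coordinates
    records = []
    for row in reader:
        values = dict(zip(variables, (float(v) for v in row[3:])))
        records.append((row[0], float(row[1]), float(row[2]), values))
    return variables, records

sample = io.StringIO(
    "species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap\n"
    "bradypus_variegatus,-65.4,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,41.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0\n"
    "bradypus_variegatus,-65.3833,-10.3833,76.0,104.0,10.0,2.0,121.0,46.0,40.0,84.0,54.0,3.0,192.0,266.0,337.0,279.0\n"
)
variables, records = read_swd(sample)
print(len(variables), records[0][3]["cld"])
```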
51Background data in SWD format
- Environmental data at (typically) random points in study area
- Useful
- when environmental grids huge
- Maxent needs only small random sample (10,000)
- when doing non-uniform sampling
- Example
species,longitude,latitude,cld,dtr,ecoreg,frs,h_dem,pre,pre_l10,pre_l1,pre_l4,pre_l7,tmn,tmp,tmx,vap
background,-61.775,6.175,60.0,100.0,10.0,0.0,747.0,55.0,24.0,57.0,45.0,81.0,182.0,239.0,300.0,232.0
background,-66.075,5.325,67.0,116.0,10.0,3.0,1038.0,75.0,16.0,68.0,64.0,145.0,181.0,246.0,331.0,234.0
background,-59.875,-26.325,47.0,129.0,9.0,1.0,73.0,31.0,43.0,32.0,43.0,10.0,97.0,218.0,339.0,189.0
background,-68.375,-15.375,58.0,112.0,10.0,44.0,2039.0,33.0,67.0,31.0,30.0,6.0,101.0,181.0,251.0,133.0
background,-68.525,4.775,72.0,95.0,10.0,0.0,65.0,72.0,16.0,65.0,69.0,133.0,218.0,271.0,346.0,289.0