Title: Inference on Relational Models Using Markov Chain Monte Carlo
1Inference on Relational Models Using Markov Chain Monte Carlo
- Brian Milch
- Massachusetts Institute of Technology
- UAI Tutorial
- July 19, 2007
2Example 1 Bibliographies
3Example 2 Aircraft Tracking
[Figure: aircraft tracking observations at successive time steps t1, t2, t3]
4Inference on Relational Structures
[Figure: competing ways of grouping citation strings (e.g., Russell & Norvig's "AI: A Modern Approach", Shakespeare's "Hamlet" and "The Tempest", works by Seuss) into underlying publications and authors, with each candidate relational structure assigned a probability such as 2.3 x 10^-12, 4.5 x 10^-14, 1.2 x 10^-12, ..., 5.0 x 10^-20]
5Markov Chain Monte Carlo (MCMC)
- Markov chain s1, s2, ... over worlds where evidence E is true
- Approximate P(Q | E) as the fraction of s1, s2, ... that satisfy query Q
[Figure: worlds satisfying query Q shown as a region inside the set of worlds satisfying evidence E]
6Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
7Simple Example Clustering
[Figure: bird wingspan measurements forming clusters along a "Wingspan (cm)" axis from 10 to 100]
8Simple Bayesian Mixture Model
- Number of latent objects is known to be k
- For each latent object i, have a parameter θi
- For each data point j, have an object selector Cj and an observable value Xj (a generative sketch follows below)
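A minimal generative sketch of this model (Python/NumPy; the Gaussian likelihood, its spread, and the range of the wingspan parameters are illustrative assumptions, not part of the tutorial):

import numpy as np

def sample_mixture(k=3, n=100, rng=np.random.default_rng(0)):
    """Generative process for the simple Bayesian mixture model.

    Each latent object i has a parameter theta_i (e.g., a typical wingspan);
    each data point j has a selector C_j and an observed value X_j.
    """
    theta = rng.uniform(10.0, 100.0, size=k)   # parameter for each latent object
    c = rng.integers(0, k, size=n)             # object selector for each data point
    x = rng.normal(theta[c], 5.0)              # observable value for each data point
    return theta, c, x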
9BN for Mixture Model
[Figure: Bayesian network with parameter nodes θ1, θ2, ..., θk, selector nodes C1, ..., Cn, and observation nodes X1, ..., Xn; each Xj has parents Cj and all of the θi]
10Context-Specific Dependencies
[Figure: the same network, but with each Xj depending only on the θi selected by Cj (e.g., C1 = 2, C2 = 1, C3 = 2)]
11Extensions to Mixture Model
- Random number of latent objects k, with distribution p(k) such as
  - Uniform(1, ..., 100)
  - Geometric(0.1)
  - Poisson(10)
  (Geometric and Poisson leave k unbounded)
- Random distribution π for selecting objects
  - p(π | k) = Dirichlet(α1, ..., αk) (a Dirichlet distribution over probability vectors)
  - Still symmetric if each αi = α/k
(a sampling sketch follows below)
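A sketch of the extended model with a random k and a random selection distribution π (Python/NumPy; the shifted Poisson prior on k, the total Dirichlet mass alpha, and the Gaussian likelihood are assumptions for illustration):

import numpy as np

def sample_extended_mixture(n=100, alpha=1.0, rng=np.random.default_rng(0)):
    k = 1 + rng.poisson(10)                     # random number of latent objects, k >= 1
    pi = rng.dirichlet(np.full(k, alpha / k))   # symmetric Dirichlet: each alpha_i = alpha/k
    theta = rng.uniform(10.0, 100.0, size=k)    # one parameter per latent object
    c = rng.choice(k, size=n, p=pi)             # object selectors drawn from pi
    x = rng.normal(theta[c], 5.0)               # observations
    return k, pi, theta, c, x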
12Existence versus Observation
- A latent object can exist even if no observations correspond to it
  - Bird species may not have been observed yet
  - Aircraft may fly over without yielding any blips
- Two questions
  - How many objects correspond to observations?
  - How many objects are there in total?
- Observed 3 species, each 100 times: probably no more
- Observed 200 species, each 1 or 2 times: probably more exist
13Expecting Additional Objects
[Figure: r observed species; will we observe more later?]
- P(ever observe a new species | seen r so far) is bounded by P(k > r)
- So as the number of species observed → ∞, the probability of ever seeing more → 0
- What if we don't want this?
14Dirichlet Process Mixtures
- Set k = ∞ and let π be an infinite-dimensional probability vector with a stick-breaking prior: πi = βi ∏_{j<i} (1 - βj), with βi ~ Beta(1, α) (sketch below)
- Another view: define the prior directly on partitions of data points, allowing an unbounded number of blocks
- Drawback: can't ask about the number of unobserved latent objects (always infinite)
[Ferguson 1983; Sethuraman 1994]; tutorials: [Jordan 2005; Sudderth 2006]
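A sketch of the stick-breaking construction mentioned above (Python/NumPy; the truncation level is a practical stand-in for the infinite vector, and the concentration parameter alpha is an assumed hyperparameter):

import numpy as np

def stick_breaking(alpha=1.0, truncation=1000, rng=np.random.default_rng(0)):
    # beta_t ~ Beta(1, alpha); pi_t = beta_t * prod_{s < t} (1 - beta_s)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining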
15Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
16Mistake 1 Ignoring Interchangeability
- Which birds are in species S1?
- Latent object indices are interchangeable
  - Posterior on selector variable CB1 is uniform
  - Posterior on θS1 has a peak for each cluster of birds
- Really care about the partition of the observations
- Partition with r blocks corresponds to k! / (k-r)! instantiations of the Cj variables
[Figure: birds B1, B3 in one cluster, B2 alone, and B4, B5 in another; the partition {1, 3}, {2}, {4, 5} corresponds to instantiations (1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), ...]
17Ignoring Interchangeability, Cont'd
- Say k = 4. What's the prior probability that B1, B3 are in one species and B2 in another?
- Multiply probabilities for CB1, CB2, CB3: (1/4) x (1/4) x (1/4)?
- Not enough! The partition {B1, B3}, {B2} corresponds to 12 instantiations of the Cs (worked out below)
- A partition with r blocks corresponds to kPr instantiations
(S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2), (S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
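Working the example out under the uniform selector prior of the basic model:

P({B1, B3}, {B2}) = kPr x (1/k)^3 = (4 x 3) x (1/4)^3 = 12/64 = 3/16,

twelve times the naive (1/4)^3 = 1/64 obtained by fixing particular species labels.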
18Mistake 2 Underestimating the Bayesian Ockham's Razor Effect
- Say k = 4. Are B1 and B2 in the same species?
- Maximum-likelihood estimation would yield one species with θ = 50 and another with θ = 52
- But the Bayesian model trades off likelihood against the prior probability of getting those θ values
[Figure: two observations XB1 = 50 and XB2 = 52 on a "Wingspan (cm)" axis from 10 to 100]
19Bayesian Ockham's Razor
[Figure: the same data, XB1 = 50 and XB2 = 52, on the Wingspan (cm) axis]
- H1: Partition is {B1, B2}: ≈ 1.3 x 10^-4
- H2: Partition is {B1}, {B2}: ≈ 7.5 x 10^-5
- Don't use more latent objects than necessary to explain your data
[MacKay 1992]
20Mistake 3 Comparing Densities Across Dimensions
[Figure: XB1 = 50 and XB2 = 52 on the Wingspan (cm) axis from 10 to 100]
- H1: Partition is {B1, B2}, θ = 51: ≈ 1.5 x 10^-5
- H2: Partition is {B1}, {B2}, θB1 = 50, θB2 = 52: ≈ 4.8 x 10^-7
- H1 wins by a greater margin
21What If We Change the Units?
[Figure: the same two birds measured in meters, XB1 = 0.50 and XB2 = 0.52, on a "Wingspan (m)" axis from 0.1 to 1.0]
- H1: Partition is {B1, B2}, θ = 0.51: ≈ 15
- H2: Partition is {B1}, {B2}, θB1 = 0.50, θB2 = 0.52: ≈ 48
- Now H2 wins by a landslide
22Lesson Comparing Densities Across Dimensions
- Densities don't behave like probabilities (e.g., they can be greater than 1)
- Heights of density peaks in spaces of different dimension are not comparable
- Work-arounds:
  - Find the most likely partition first, then the most likely parameters given that partition
  - Find the region in parameter space where most of the posterior probability mass lies
23Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
24Why Not Exact Inference?
- Number of possible partitions is superexponential in n
- Variable elimination?
  - Summing out θi couples all the Cjs
  - Summing out Cj couples all the θis
25Markov Chain Monte Carlo (MCMC)
- Start in an arbitrary state (possible world) s1 satisfying evidence E
- Sample s2, s3, ... according to a transition kernel T(si, si+1), yielding a Markov chain
- Approximate p(Q | E) by the fraction of s1, s2, ..., sL that are in Q
[Figure: worlds satisfying query Q inside the set of worlds satisfying evidence E]
26Why a Markov Chain?
- Why use a Markov chain rather than sampling independently?
  - Stochastic local search for high-probability s
  - Once we find such an s, explore around it
27Convergence
- Stationary distribution π is such that π(s') = Σ_s π(s) T(s, s')
- If the chain is ergodic (it can get from anywhere to anywhere, and it is aperiodic), then
  - It has a unique stationary distribution π
  - The fraction of s1, s2, ..., sL in Q converges to π(Q) as L → ∞
- We'll design T so that π(s) = p(s | E)
28Gibbs Sampling
- Order the non-evidence variables V1, V2, ..., Vm
- Given state s, sample from T as follows (a code sketch follows below):
  - Let s' = s
  - For i = 1 to m:
    - Sample vi' from p(Vi | s'_-i), the conditional for Vi given the other variables in s'
    - Let s' = (s'_-i, Vi = vi')
  - Return s'
- Theorem: the stationary distribution is p(s | E)
[Geman & Geman 1984]
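A generic sketch of one Gibbs sweep (Python; full_conditional is a hypothetical callback standing in for the model-specific conditional samplers):

def gibbs_sweep(state, variables, full_conditional, rng):
    """One pass of Gibbs sampling over the non-evidence variables V_1, ..., V_m.

    state:            dict mapping variable name -> current value
    variables:        ordered list of non-evidence variable names
    full_conditional: full_conditional(var, state, rng) returns a sample from
                      p(V_i | all other variables in state)
    """
    new_state = dict(state)
    for var in variables:
        new_state[var] = full_conditional(var, new_state, rng)
    return new_state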
29Gibbs on Bayesian Network
- Conditional for V depends only on the factors that contain V
- So condition on V's Markov blanket mb(V): parents, children, and co-parents
[Figure: node V in a Bayesian network with its Markov blanket highlighted]
30Gibbs on Bayesian Mixture Model
- Given the current state s:
  - Resample each θi given its prior and {Xj : Cj = i in s}
  - Resample each Cj given Xj and θ1, ..., θk (its context-specific Markov blanket)
(a sketch for the conjugate Normal case follows below)
[Neal 2000]
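A sketch of one such sweep for the conjugate Normal case (Python/NumPy; the known observation noise sigma, the Normal(mu0, tau0^2) prior on each theta_i, and the uniform prior on each C_j are modeling assumptions made for this example):

import numpy as np

def gibbs_mixture_sweep(x, c, theta, k, sigma=5.0, mu0=55.0, tau0=30.0,
                        rng=np.random.default_rng(0)):
    """One Gibbs sweep: resample each theta_i, then each C_j.

    x: observations (NumPy array), c: NumPy integer array of cluster assignments,
    theta: NumPy array of k cluster means.
    """
    # Resample each theta_i given its prior and the observations currently assigned to it.
    for i in range(k):
        xi = x[c == i]
        prec = 1.0 / tau0**2 + len(xi) / sigma**2            # posterior precision
        mean = (mu0 / tau0**2 + xi.sum() / sigma**2) / prec  # posterior mean
        theta[i] = rng.normal(mean, np.sqrt(1.0 / prec))
    # Resample each C_j given X_j and theta_1..theta_k (its context-specific Markov blanket).
    for j in range(len(x)):
        logp = -0.5 * ((x[j] - theta) / sigma) ** 2          # uniform prior on C_j cancels
        p = np.exp(logp - logp.max())
        c[j] = rng.choice(k, p=p / p.sum())
    return theta, c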
31Sampling Given Markov Blanket
- If V is discrete, just iterate over its values, normalize, and sample from the resulting discrete distribution (sketch below)
- If V is continuous:
  - Simple if the child distributions are conjugate to V's prior: the posterior has the same form as the prior with different parameters
  - In general, even sampling from p(v | s_-V) can be hard
See the BUGS software: http://www.mrc-bsu.cam.ac.uk/bugs
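A sketch of the discrete case (Python; unnorm_prob is a hypothetical callback returning the product of the factors in V's Markov blanket for a candidate value):

def sample_discrete_given_blanket(values, unnorm_prob, rng):
    """Enumerate V's values, weight each by its unnormalized conditional
    probability, normalize implicitly, and sample one value."""
    weights = [unnorm_prob(v) for v in values]
    r = rng.random() * sum(weights)
    acc = 0.0
    for v, w in zip(values, weights):
        acc += w
        if r <= acc:
            return v
    return values[-1]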
32Convergence Can Be Slow
[Figure: data that should form two clusters on the Wingspan (cm) axis, with θ1 = 20 near one cluster and θ2 = 90 far away from both]
- The Cjs won't change until θ2 is in the right area
- θ2 does an unguided random walk as long as no observations are associated with it
- Especially bad in high dimensions
33Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
34Metropolis-Hastings
[Metropolis et al. 1953; Hastings 1970]
- Define T(si, si+1) as follows:
  - Sample s' from a proposal distribution q(s' | si)
  - Compute the acceptance probability
    α = min(1, [p(s' | E) q(si | s')] / [p(si | E) q(s' | si)])
    (relative posterior probabilities times backward / forward proposal probabilities)
  - With probability α, let si+1 = s'; else let si+1 = si
- Can show that p(s | E) is the stationary distribution for T
(a code sketch follows below)
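A sketch of one Metropolis-Hastings transition (Python; log_p, propose, and log_q are hypothetical callbacks for the unnormalized log posterior, the proposal sampler, and the proposal density):

import math
import random

def mh_step(s, log_p, propose, log_q, rng=random.Random(0)):
    """One Metropolis-Hastings transition with target p(s | E).

    log_p(s):         log p(s | E) up to an additive constant
    propose(s, rng):  samples s' from q(s' | s)
    log_q(a, b):      log q(a | b)
    """
    s_new = propose(s, rng)
    log_alpha = (log_p(s_new) - log_p(s)) + (log_q(s, s_new) - log_q(s_new, s))
    accept_prob = math.exp(min(0.0, log_alpha))
    return s_new if rng.random() < accept_prob else s   # accept, else stay put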
35Metropolis-Hastings
- Benefits
  - Proposal distribution can propose big steps involving several variables
  - Only need to compute the ratio p(s' | E) / p(s | E), ignoring normalization factors
  - Don't need to sample from conditional distributions
- Limitations
  - Proposals must be reversible, else q(s | s') = 0 and the move is never accepted
  - Need to be able to compute q(s | s') / q(s' | s)
36Split-Merge Proposals
- Choose two observations i, j
- If Ci = Cj = c, then split cluster c:
  - Get an unused latent object c'
  - For each observation m such that Cm = c, change Cm to c' with probability 0.5
  - Propose new values for θc, θc'
- Else merge clusters Ci and Cj:
  - For each m such that Cm = Cj, set Cm = Ci
  - Propose a new value for θc
(a sketch of the split move follows below)
[Jain & Neal 2004]
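A sketch of the split half of the move for the mixture model above (Python/NumPy; resampling the parameters from a flat range is a simplification relative to [Jain & Neal 2004], and the acceptance test is left to the caller):

import numpy as np

def propose_split(c, theta, i, j, rng=np.random.default_rng(0)):
    """Split move: observations i and j currently share cluster cl = C_i = C_j.

    c: NumPy integer array of assignments; theta: list/array of cluster parameters.
    Returns a proposed (c, theta); the caller computes the acceptance probability.
    """
    cl = c[i]
    assert c[j] == cl, "split applies only when C_i = C_j"
    c_new, theta_new = c.copy(), list(theta)
    c_prime = len(theta_new)                       # an unused latent object
    theta_new.append(None)
    for m in np.flatnonzero(c_new == cl):          # each member moves with probability 0.5
        if rng.random() < 0.5:
            c_new[m] = c_prime
    theta_new[cl] = rng.uniform(10.0, 100.0)       # propose new parameter values
    theta_new[c_prime] = rng.uniform(10.0, 100.0)
    return c_new, theta_new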
37Split-Merge Example
[Figure: θ1 = 20 explains one cluster; θ2 starts far away at 90 and is resampled to 27, near the two split-off birds, on the Wingspan (cm) axis]
- Split two birds from species 1
- Resample θ2 to match these two birds
- Move is likely to be accepted
38Mixtures of Kernels
- If T1, ..., Tm all have stationary distribution π, then so does any mixture Σi λi Ti (with λi ≥ 0 and Σi λi = 1)
- Example: mixture of split-merge and Gibbs moves
- Point: faster convergence (sketch below)
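A sketch of such a mixture kernel (Python; the component kernels and mixing weights are placeholders supplied by the caller):

import random

def mixture_kernel(s, kernels, weights, rng=random.Random(0)):
    """Apply one of several transition kernels, chosen at random.
    If every kernel leaves the target distribution invariant, so does the mixture."""
    kernel = rng.choices(kernels, weights=weights, k=1)[0]
    return kernel(s, rng)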
39Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
40MCMC States in Split-Merge
- Not complete instantiations!
  - No parameters for unobserved species
- States are partial instantiations of random variables
- Each state corresponds to an event: the set of outcomes satisfying its description
Example: k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
41MCMC over Events
[Milch & Russell 2006]
- Markov chain over events σ, with stationary distribution proportional to p(σ)
- Theorem: the fraction of visited events in Q converges to p(Q | E) if
  - Each σ is either a subset of Q or disjoint from Q
  - The events form a partition of E
[Figure: the evidence set E partitioned into events, some inside and some outside the query Q]
42Computing Probabilities of Events
- Engine needs to compute p(σ') / p(σn) efficiently (without summations)
- Use instantiations that include all active parents of the variables they instantiate
- Then the probability is the product of the CPDs (sketch below)
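A sketch of that computation (Python; cpd and active_parents are hypothetical callbacks: cpd(var, value, parent_values) returns the CPD value, and active_parents(var, inst) lists the parents that are active given the instantiation):

import math

def log_prob_of_event(inst, cpd, active_parents):
    """Log probability of a partial instantiation that includes all active parents
    of the variables it instantiates: just the product of their CPDs."""
    total = 0.0
    for var, value in inst.items():
        parent_values = {p: inst[p] for p in active_parents(var, inst)}
        total += math.log(cpd(var, value, parent_values))
    return total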
43States That Are Even More Abstract
- Typical partial instantiation:
  k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
  - Specifies particular species numbers, even though species are interchangeable
- Let states be abstract partial instantiations:
  ∃x ∃y ≠ x: k = 12, CB1 = x, CB2 = y, θx = 31, θy = 84
- See [Milch & Russell 2006] for conditions under which we can compute probabilities of such events
44Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
45Representative Applications
- Tracking cars with cameras [Pasula et al. 1999]
- Segmentation in computer vision [Tu & Zhu 2002]
- Citation matching [Pasula et al. 2003]
- Multi-target tracking with radar [Oh et al. 2004]
46Citation Matching Model
[Pasula et al. 2003; Milch & Russell 2006]

#Researcher ~ NumResearchersPrior()
Name(r) ~ NamePrior()
#Paper ~ NumPapersPrior()
FirstAuthor(p) ~ Uniform({Researcher r})
Title(p) ~ TitlePrior()
PubCited(c) ~ Uniform({Paper p})
Text(c) ~ NoisyCitationGrammar(Name(FirstAuthor(PubCited(c))), Title(PubCited(c)))
47Citation Matching
- Elaboration of the generative model shown earlier
- Parameter estimation
  - Priors for names, titles, and citation formats learned offline from labeled data
  - String corruption parameters learned with Monte Carlo EM
- Inference
  - MCMC with split-merge proposals
  - Guided by "canopies" of similar citations
  - Accuracy stabilizes after 20 minutes
[Pasula et al., NIPS 2002]
48Citation Matching Results
[Results on four data sets of 300-500 citations, referring to 150-300 papers]
49Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994).

Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?
50Preliminary Experiments Information Extraction
- P(citation text | title, author names) modeled with a simple HMM
- For each paper, recover the title, author surnames, and given names
- Fraction whose attributes are recovered perfectly in the last MCMC state:
  - among papers with one citation: 36.1%
  - among papers with multiple citations: 62.6%
- Can use inferred knowledge for disambiguation
51Multi-Object Tracking
[Figure: observed detections generated by unobserved objects, plus false detections]
52State Estimation for Aircraft
#Aircraft ~ NumAircraftPrior()
State(a, t) ~ if t = 0 then InitState() else StateTransition(State(a, Pred(t)))
#Blip(Source = a, Time = t) ~ NumDetectionsCPD(State(a, t))
#Blip(Time = t) ~ NumFalseAlarmsPrior()
ApparentPos(r) ~ if (Source(r) = null) then FalseAlarmDistrib() else ObsCPD(State(Source(r), Time(r)))
53Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior()
Exits(a, t) ~ if InFlight(a, t) then Bernoulli(0.1)
InFlight(a, t) ~ if t < EntryTime(a) then false
                 elseif t = EntryTime(a) then true
                 else (InFlight(a, Pred(t)) & !Exits(a, Pred(t)))
State(a, t) ~ if t = EntryTime(a) then InitState()
              elseif InFlight(a, t) then StateTransition(State(a, Pred(t)))
#Blip(Source = a, Time = t) ~ if InFlight(a, t) then NumDetectionsCPD(State(a, t))

...plus the last two statements from the previous slide
54MCMC for Aircraft Tracking
- Uses the generative model from the previous slide (although not with BLOG syntax)
- Examples of Metropolis-Hastings proposals
[Figures by Songhwai Oh]
[Oh et al., CDC 2004]
55Aircraft Tracking Results
[Figures: estimation error and running time comparisons; figures by Songhwai Oh]
- MCMC has the smallest error, and it hardly degrades at all as the tracks get dense
- MCMC is nearly as fast as the greedy algorithm and much faster than MHT
[Oh et al., CDC 2004]
56Toward General-Purpose Inference
- Currently, each new application requires new code for
  - Proposing moves
  - Representing MCMC states
  - Computing acceptance probabilities
- Goal
  - User specifies the model and the proposal distribution
  - General-purpose code does the rest
57General MCMC Engine
[Milch & Russell 2006]
[Diagram: the user supplies a model (in a declarative language) and a custom proposal distribution (a Java class); a general-purpose engine (Java code) runs the chain, with MCMC states represented as partial worlds]
- Proposal distribution: propose MCMC state s' given sn, and compute the ratio q(sn | s') / q(s' | sn)
- Engine: compute the acceptance probability based on the model, and set sn+1
- The engine handles arbitrary proposals efficiently using context-specific structure
58Summary
- Models for relational structures go beyond standard probabilistic inference settings
- MCMC provides a feasible path for inference
- Open problems
  - More general inference
  - Adaptive MCMC
  - Integrating discriminative methods
59References
- Blei, D. M. and Jordan, M. I. (2005) Variational inference for Dirichlet process mixtures. Bayesian Analysis 1(1):121-144.
- Casella, G. and Robert, C. P. (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83(1):81-94.
- Ferguson, T. S. (1983) Bayesian density estimation by mixtures of normal distributions. In Rizvi, M. H. et al., eds., Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pages 287-302.
- Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence 6:721-741.
- Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) A language and program for complex Bayesian modelling. The Statistician 43(1):169-177.
- Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall.
- Green, P. J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4):711-732.
60References
- Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
- Jain, S. and Neal, R. M. (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Computational and Graphical Statistics 13(1):158-182.
- Jordan, M. I. (2005) Dirichlet processes, Chinese restaurant processes, and all that. Tutorial at the NIPS Conference, available at http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
- MacKay, D. J. C. (1992) Bayesian interpolation. Neural Computation 4(3):414-447.
- MacEachern, S. N. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23:727-741.
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equations of state calculations by fast computing machines. J. Chemical Physics 21:1087-1092.
- Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005) BLOG: Probabilistic models with unknown objects. In Proc. 19th Int'l Joint Conf. on AI, pages 1352-1359.
- Milch, B. and Russell, S. (2006) General-purpose MCMC inference over relational structures. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.
61References
- Neal, R. M. (2000) Markov chain sampling methods for Dirichlet process mixture models. J. Computational and Graphical Statistics 9:249-265.
- Oh, S., Russell, S. and Sastry, S. (2004) Markov chain Monte Carlo data association for general multi-target tracking problems. In Proc. 43rd IEEE Conf. on Decision and Control, pages 734-742.
- Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) Tracking many objects with many sensors. In Proc. 16th Int'l Joint Conf. on AI, pages 1160-1171.
- Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15, MIT Press, pages 1401-1408.
- Richardson, S. and Green, P. J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. Royal Statistical Society B 59:731-792.
- Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica 4:639-650.
- Sudderth, E. (2006) Graphical models for visual object recognition and tracking. Ph.D. thesis, Dept. of EECS, Massachusetts Institute of Technology, Cambridge, MA.
- Tu, Z. and Zhu, S.-C. (2002) Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.