Title: Inference on Relational Models Using Markov Chain Monte Carlo
1Inference on Relational Models Using Markov Chain Monte Carlo
- Brian Milch
- Massachusetts Institute of Technology
- UAI Tutorial
- July 19, 2007
2Example 1 Bibliographies
3Example 2 Aircraft Tracking
[Figure: aircraft tracking observations at successive time steps t1, t2, t3]
4Inference on Relational Structures
[Figure: competing ways of grouping citation strings (e.g., Russell & Norvig's "AI: A Modern Approach", Shakespeare's "Hamlet" and "The Tempest", works by Seuss) into underlying publications and authors, with each candidate relational structure assigned a probability such as 2.3 x 10^-12, 4.5 x 10^-14, 1.2 x 10^-12, ..., 5.0 x 10^-20]
5Markov Chain Monte Carlo (MCMC)
- Markov chain s1, s2, ... over worlds where evidence E is true
- Approximate P(Q | E) as the fraction of s1, s2, ... that satisfy query Q
[Figure: worlds satisfying query Q shown as a region inside the set of worlds satisfying evidence E]
6Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
7Simple Example Clustering
[Figure: bird wingspan measurements forming clusters along a "Wingspan (cm)" axis from 10 to 100]
8Simple Bayesian Mixture Model
- Number of latent objects is known to be k
- For each latent object i, have a parameter θi
- For each data point j, have an object selector Cj and an observable value Xj (a generative sketch follows below)
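A minimal generative sketch of this model (Python/NumPy; the Gaussian likelihood, its spread, and the range of the wingspan parameters are illustrative assumptions, not part of the tutorial):

import numpy as np

def sample_mixture(k=3, n=100, rng=np.random.default_rng(0)):
    """Generative process for the simple Bayesian mixture model.

    Each latent object i has a parameter theta_i (e.g., a typical wingspan);
    each data point j has a selector C_j and an observed value X_j.
    """
    theta = rng.uniform(10.0, 100.0, size=k)   # parameter for each latent object
    c = rng.integers(0, k, size=n)             # object selector for each data point
    x = rng.normal(theta[c], 5.0)              # observable value for each data point
    return theta, c, x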
9BN for Mixture Model
[Figure: Bayesian network with parameter nodes θ1, θ2, ..., θk, selector nodes C1, ..., Cn, and observation nodes X1, ..., Xn; each Xj has parents Cj and all of the θi]
10Context-Specific Dependencies
[Figure: the same network, but with each Xj depending only on the θi selected by Cj (e.g., C1 = 2, C2 = 1, C3 = 2)]
11Extensions to Mixture Model
- Random number of latent objects k, with distribution p(k) such as
  - Uniform(1, ..., 100)
  - Geometric(0.1)
  - Poisson(10)
  (Geometric and Poisson leave k unbounded)
- Random distribution π for selecting objects
  - p(π | k) = Dirichlet(α1, ..., αk) (a Dirichlet distribution over probability vectors)
  - Still symmetric if each αi = α/k
(a sampling sketch follows below)
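A sketch of the extended model with a random k and a random selection distribution π (Python/NumPy; the shifted Poisson prior on k, the total Dirichlet mass alpha, and the Gaussian likelihood are assumptions for illustration):

import numpy as np

def sample_extended_mixture(n=100, alpha=1.0, rng=np.random.default_rng(0)):
    k = 1 + rng.poisson(10)                     # random number of latent objects, k >= 1
    pi = rng.dirichlet(np.full(k, alpha / k))   # symmetric Dirichlet: each alpha_i = alpha/k
    theta = rng.uniform(10.0, 100.0, size=k)    # one parameter per latent object
    c = rng.choice(k, size=n, p=pi)             # object selectors drawn from pi
    x = rng.normal(theta[c], 5.0)               # observations
    return k, pi, theta, c, x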
12Existence versus Observation
- A latent object can exist even if no observations correspond to it
  - Bird species may not have been observed yet
  - Aircraft may fly over without yielding any blips
- Two questions
  - How many objects correspond to observations?
  - How many objects are there in total?
- Observed 3 species, each 100 times: probably no more
- Observed 200 species, each 1 or 2 times: probably more exist
13Expecting Additional Objects
[Figure: r observed species; will we observe more later?]
- P(ever observe a new species | seen r so far) is bounded by P(k > r)
- So as the number of species observed → ∞, the probability of ever seeing more → 0
- What if we don't want this?
14Dirichlet Process Mixtures
- Set k = ∞ and let π be an infinite-dimensional probability vector with a stick-breaking prior: πi = βi ∏_{j<i} (1 - βj), with βi ~ Beta(1, α) (sketch below)
- Another view: define the prior directly on partitions of data points, allowing an unbounded number of blocks
- Drawback: can't ask about the number of unobserved latent objects (always infinite)
[Ferguson 1983; Sethuraman 1994]; tutorials: [Jordan 2005; Sudderth 2006]
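A sketch of the stick-breaking construction mentioned above (Python/NumPy; the truncation level is a practical stand-in for the infinite vector, and the concentration parameter alpha is an assumed hyperparameter):

import numpy as np

def stick_breaking(alpha=1.0, truncation=1000, rng=np.random.default_rng(0)):
    # beta_t ~ Beta(1, alpha); pi_t = beta_t * prod_{s < t} (1 - beta_s)
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining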
15Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
16Mistake 1 Ignoring Interchangeability
- Which birds are in species S1?
- Latent object indices are interchangeable
  - Posterior on selector variable CB1 is uniform
  - Posterior on θS1 has a peak for each cluster of birds
- Really care about the partition of the observations
- Partition with r blocks corresponds to k! / (k-r)! instantiations of the Cj variables
[Figure: birds B1, B3 in one cluster, B2 alone, and B4, B5 in another; the partition {1, 3}, {2}, {4, 5} corresponds to instantiations (1, 2, 1, 3, 3), (1, 2, 1, 4, 4), (1, 4, 1, 3, 3), (2, 1, 2, 3, 3), ...]
17Ignoring Interchangeability, Cont'd
- Say k = 4. What's the prior probability that B1, B3 are in one species and B2 in another?
- Multiply probabilities for CB1, CB2, CB3: (1/4) x (1/4) x (1/4)?
- Not enough! The partition {B1, B3}, {B2} corresponds to 12 instantiations of the Cs (worked out below)
- A partition with r blocks corresponds to kPr instantiations
(S1, S2, S1), (S1, S3, S1), (S1, S4, S1), (S2, S1, S2), (S2, S3, S2), (S2, S4, S2), (S3, S1, S3), (S3, S2, S3), (S3, S4, S3), (S4, S1, S4), (S4, S2, S4), (S4, S3, S4)
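Working the example out under the uniform selector prior of the basic model:

P({B1, B3}, {B2}) = kPr x (1/k)^3 = (4 x 3) x (1/4)^3 = 12/64 = 3/16,

twelve times the naive (1/4)^3 = 1/64 obtained by fixing particular species labels.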
18Mistake 2 Underestimating the Bayesian Ockham's Razor Effect
- Say k = 4. Are B1 and B2 in the same species?
- Maximum-likelihood estimation would yield one species with θ = 50 and another with θ = 52
- But the Bayesian model trades off likelihood against the prior probability of getting those θ values
[Figure: two observations XB1 = 50 and XB2 = 52 on a "Wingspan (cm)" axis from 10 to 100]
19Bayesian Ockham's Razor
[Figure: the same data, XB1 = 50 and XB2 = 52, on the Wingspan (cm) axis]
- H1: Partition is {B1, B2}: ≈ 1.3 x 10^-4
- H2: Partition is {B1}, {B2}: ≈ 7.5 x 10^-5
- Don't use more latent objects than necessary to explain your data
[MacKay 1992]
20Mistake 3 Comparing Densities Across Dimensions
[Figure: XB1 = 50 and XB2 = 52 on the Wingspan (cm) axis from 10 to 100]
- H1: Partition is {B1, B2}, θ = 51: ≈ 1.5 x 10^-5
- H2: Partition is {B1}, {B2}, θB1 = 50, θB2 = 52: ≈ 4.8 x 10^-7
- H1 wins by a greater margin
21What If We Change the Units?
[Figure: the same two birds measured in meters, XB1 = 0.50 and XB2 = 0.52, on a "Wingspan (m)" axis from 0.1 to 1.0]
- H1: Partition is {B1, B2}, θ = 0.51: ≈ 15
- H2: Partition is {B1}, {B2}, θB1 = 0.50, θB2 = 0.52: ≈ 48
- Now H2 wins by a landslide
22Lesson Comparing Densities Across Dimensions
- Densities don't behave like probabilities (e.g., they can be greater than 1)
- Heights of density peaks in spaces of different dimension are not comparable
- Work-arounds:
  - Find the most likely partition first, then the most likely parameters given that partition
  - Find the region in parameter space where most of the posterior probability mass lies
23Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
24Why Not Exact Inference?
- Number of possible partitions is superexponential in n
- Variable elimination?
  - Summing out θi couples all the Cjs
  - Summing out Cj couples all the θis
25Markov Chain Monte Carlo (MCMC)
- Start in an arbitrary state (possible world) s1 satisfying evidence E
- Sample s2, s3, ... according to a transition kernel T(si, si+1), yielding a Markov chain
- Approximate p(Q | E) by the fraction of s1, s2, ..., sL that are in Q
[Figure: worlds satisfying query Q inside the set of worlds satisfying evidence E]
26Why a Markov Chain?
- Why use a Markov chain rather than sampling independently?
  - Stochastic local search for high-probability s
  - Once we find such an s, explore around it
27Convergence
- Stationary distribution π is such that π(s') = Σ_s π(s) T(s, s')
- If the chain is ergodic (it can get from anywhere to anywhere, and it is aperiodic), then
  - It has a unique stationary distribution π
  - The fraction of s1, s2, ..., sL in Q converges to π(Q) as L → ∞
- We'll design T so that π(s) = p(s | E)
28Gibbs Sampling
- Order the non-evidence variables V1, V2, ..., Vm
- Given state s, sample from T as follows (a code sketch follows below):
  - Let s' = s
  - For i = 1 to m:
    - Sample vi' from p(Vi | s'_-i), the conditional for Vi given the other variables in s'
    - Let s' = (s'_-i, Vi = vi')
  - Return s'
- Theorem: the stationary distribution is p(s | E)
[Geman & Geman 1984]
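A generic sketch of one Gibbs sweep (Python; full_conditional is a hypothetical callback standing in for the model-specific conditional samplers):

def gibbs_sweep(state, variables, full_conditional, rng):
    """One pass of Gibbs sampling over the non-evidence variables V_1, ..., V_m.

    state:            dict mapping variable name -> current value
    variables:        ordered list of non-evidence variable names
    full_conditional: full_conditional(var, state, rng) returns a sample from
                      p(V_i | all other variables in state)
    """
    new_state = dict(state)
    for var in variables:
        new_state[var] = full_conditional(var, new_state, rng)
    return new_state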
29Gibbs on Bayesian Network
- Conditional for V depends only on the factors that contain V
- So condition on V's Markov blanket mb(V): parents, children, and co-parents
[Figure: node V in a Bayesian network with its Markov blanket highlighted]
30Gibbs on Bayesian Mixture Model
- Given the current state s:
  - Resample each θi given its prior and {Xj : Cj = i in s}
  - Resample each Cj given Xj and θ1, ..., θk (its context-specific Markov blanket)
(a sketch for the conjugate Normal case follows below)
[Neal 2000]
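A sketch of one such sweep for the conjugate Normal case (Python/NumPy; the known observation noise sigma, the Normal(mu0, tau0^2) prior on each theta_i, and the uniform prior on each C_j are modeling assumptions made for this example):

import numpy as np

def gibbs_mixture_sweep(x, c, theta, k, sigma=5.0, mu0=55.0, tau0=30.0,
                        rng=np.random.default_rng(0)):
    """One Gibbs sweep: resample each theta_i, then each C_j.

    x: observations (NumPy array), c: NumPy integer array of cluster assignments,
    theta: NumPy array of k cluster means.
    """
    # Resample each theta_i given its prior and the observations currently assigned to it.
    for i in range(k):
        xi = x[c == i]
        prec = 1.0 / tau0**2 + len(xi) / sigma**2            # posterior precision
        mean = (mu0 / tau0**2 + xi.sum() / sigma**2) / prec  # posterior mean
        theta[i] = rng.normal(mean, np.sqrt(1.0 / prec))
    # Resample each C_j given X_j and theta_1..theta_k (its context-specific Markov blanket).
    for j in range(len(x)):
        logp = -0.5 * ((x[j] - theta) / sigma) ** 2          # uniform prior on C_j cancels
        p = np.exp(logp - logp.max())
        c[j] = rng.choice(k, p=p / p.sum())
    return theta, c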
31Sampling Given Markov Blanket
- If V is discrete, just iterate over its values, normalize, and sample from the resulting discrete distribution (sketch below)
- If V is continuous:
  - Simple if the child distributions are conjugate to V's prior: the posterior has the same form as the prior with different parameters
  - In general, even sampling from p(v | s_-V) can be hard
See the BUGS software: http://www.mrc-bsu.cam.ac.uk/bugs
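A sketch of the discrete case (Python; unnorm_prob is a hypothetical callback returning the product of the factors in V's Markov blanket for a candidate value):

def sample_discrete_given_blanket(values, unnorm_prob, rng):
    """Enumerate V's values, weight each by its unnormalized conditional
    probability, normalize implicitly, and sample one value."""
    weights = [unnorm_prob(v) for v in values]
    r = rng.random() * sum(weights)
    acc = 0.0
    for v, w in zip(values, weights):
        acc += w
        if r <= acc:
            return v
    return values[-1]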
32Convergence Can Be Slow
[Figure: data that should form two clusters on the Wingspan (cm) axis, with θ1 = 20 near one cluster and θ2 = 90 far away from both]
- The Cjs won't change until θ2 is in the right area
- θ2 does an unguided random walk as long as no observations are associated with it
- Especially bad in high dimensions
33Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
34Metropolis-Hastings
[Metropolis et al. 1953; Hastings 1970]
- Define T(si, si+1) as follows:
  - Sample s' from a proposal distribution q(s' | si)
  - Compute the acceptance probability
    α = min(1, [p(s' | E) q(si | s')] / [p(si | E) q(s' | si)])
    (relative posterior probabilities times backward / forward proposal probabilities)
  - With probability α, let si+1 = s'; else let si+1 = si
- Can show that p(s | E) is the stationary distribution for T
(a code sketch follows below)
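A sketch of one Metropolis-Hastings transition (Python; log_p, propose, and log_q are hypothetical callbacks for the unnormalized log posterior, the proposal sampler, and the proposal density):

import math
import random

def mh_step(s, log_p, propose, log_q, rng=random.Random(0)):
    """One Metropolis-Hastings transition with target p(s | E).

    log_p(s):         log p(s | E) up to an additive constant
    propose(s, rng):  samples s' from q(s' | s)
    log_q(a, b):      log q(a | b)
    """
    s_new = propose(s, rng)
    log_alpha = (log_p(s_new) - log_p(s)) + (log_q(s, s_new) - log_q(s_new, s))
    accept_prob = math.exp(min(0.0, log_alpha))
    return s_new if rng.random() < accept_prob else s   # accept, else stay put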
35Metropolis-Hastings
- Benefits
  - Proposal distribution can propose big steps involving several variables
  - Only need to compute the ratio p(s' | E) / p(s | E), ignoring normalization factors
  - Don't need to sample from conditional distributions
- Limitations
  - Proposals must be reversible, else q(s | s') = 0 and the move is never accepted
  - Need to be able to compute q(s | s') / q(s' | s)
36Split-Merge Proposals
- Choose two observations i, j
- If Ci = Cj = c, then split cluster c:
  - Get an unused latent object c'
  - For each observation m such that Cm = c, change Cm to c' with probability 0.5
  - Propose new values for θc, θc'
- Else merge clusters Ci and Cj:
  - For each m such that Cm = Cj, set Cm = Ci
  - Propose a new value for θc
(a sketch of the split move follows below)
[Jain & Neal 2004]
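A sketch of the split half of the move for the mixture model above (Python/NumPy; resampling the parameters from a flat range is a simplification relative to [Jain & Neal 2004], and the acceptance test is left to the caller):

import numpy as np

def propose_split(c, theta, i, j, rng=np.random.default_rng(0)):
    """Split move: observations i and j currently share cluster cl = C_i = C_j.

    c: NumPy integer array of assignments; theta: list/array of cluster parameters.
    Returns a proposed (c, theta); the caller computes the acceptance probability.
    """
    cl = c[i]
    assert c[j] == cl, "split applies only when C_i = C_j"
    c_new, theta_new = c.copy(), list(theta)
    c_prime = len(theta_new)                       # an unused latent object
    theta_new.append(None)
    for m in np.flatnonzero(c_new == cl):          # each member moves with probability 0.5
        if rng.random() < 0.5:
            c_new[m] = c_prime
    theta_new[cl] = rng.uniform(10.0, 100.0)       # propose new parameter values
    theta_new[c_prime] = rng.uniform(10.0, 100.0)
    return c_new, theta_new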
37Split-Merge Example
[Figure: θ1 = 20 explains one cluster; θ2 starts far away at 90 and is resampled to 27, near the two split-off birds, on the Wingspan (cm) axis]
- Split two birds from species 1
- Resample θ2 to match these two birds
- Move is likely to be accepted
38Mixtures of Kernels
- If T1, ..., Tm all have stationary distribution π, then so does any mixture Σi λi Ti (with λi ≥ 0 and Σi λi = 1)
- Example: mixture of split-merge and Gibbs moves
- Point: faster convergence (sketch below)
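A sketch of such a mixture kernel (Python; the component kernels and mixing weights are placeholders supplied by the caller):

import random

def mixture_kernel(s, kernels, weights, rng=random.Random(0)):
    """Apply one of several transition kernels, chosen at random.
    If every kernel leaves the target distribution invariant, so does the mixture."""
    kernel = rng.choices(kernels, weights=weights, k=1)[0]
    return kernel(s, rng)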
39Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
40MCMC States in Split-Merge
- Not complete instantiations!
  - No parameters for unobserved species
- States are partial instantiations of random variables
- Each state corresponds to an event: the set of outcomes satisfying its description
Example: k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
41MCMC over Events
[Milch & Russell 2006]
- Markov chain over events σ, with stationary distribution proportional to p(σ)
- Theorem: the fraction of visited events in Q converges to p(Q | E) if
  - Each σ is either a subset of Q or disjoint from Q
  - The events form a partition of E
[Figure: the evidence set E partitioned into events, some inside and some outside the query Q]
42Computing Probabilities of Events
- Engine needs to compute p(σ') / p(σn) efficiently (without summations)
- Use instantiations that include all active parents of the variables they instantiate
- Then the probability is the product of the CPDs (sketch below)
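A sketch of that computation (Python; cpd and active_parents are hypothetical callbacks: cpd(var, value, parent_values) returns the CPD value, and active_parents(var, inst) lists the parents that are active given the instantiation):

import math

def log_prob_of_event(inst, cpd, active_parents):
    """Log probability of a partial instantiation that includes all active parents
    of the variables it instantiates: just the product of their CPDs."""
    total = 0.0
    for var, value in inst.items():
        parent_values = {p: inst[p] for p in active_parents(var, inst)}
        total += math.log(cpd(var, value, parent_values))
    return total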
43States That Are Even More Abstract
- Typical partial instantiation:
  k = 12, CB1 = S2, CB2 = S8, θS2 = 31, θS8 = 84
  - Specifies particular species numbers, even though species are interchangeable
- Let states be abstract partial instantiations:
  ∃x ∃y ≠ x: k = 12, CB1 = x, CB2 = y, θx = 31, θy = 84
- See [Milch & Russell 2006] for conditions under which we can compute probabilities of such events
44Outline
- Probabilistic models for relational structures
- Modeling the number of objects
- Three mistakes that are easy to make
- Markov chain Monte Carlo (MCMC)
- Gibbs sampling
- Metropolis-Hastings
- MCMC over events
- Case studies
- Citation matching
- Multi-target tracking
45Representative Applications
- Tracking cars with cameras [Pasula et al. 1999]
- Segmentation in computer vision [Tu & Zhu 2002]
- Citation matching [Pasula et al. 2003]
- Multi-target tracking with radar [Oh et al. 2004]
46Citation Matching Model
[Pasula et al. 2003; Milch & Russell 2006]

#Researcher ~ NumResearchersPrior()
Name(r) ~ NamePrior()
#Paper ~ NumPapersPrior()
FirstAuthor(p) ~ Uniform({Researcher r})
Title(p) ~ TitlePrior()
PubCited(c) ~ Uniform({Paper p})
Text(c) ~ NoisyCitationGrammar(Name(FirstAuthor(PubCited(c))), Title(PubCited(c)))
47Citation Matching
- Elaboration of the generative model shown earlier
- Parameter estimation
  - Priors for names, titles, and citation formats learned offline from labeled data
  - String corruption parameters learned with Monte Carlo EM
- Inference
  - MCMC with split-merge proposals
  - Guided by "canopies" of similar citations
  - Accuracy stabilizes after 20 minutes
[Pasula et al., NIPS 2002]
48Citation Matching Results
[Results on four data sets of 300-500 citations, referring to 150-300 papers]
49Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural Language Input with a Graphical User Interface. NRL Report NRL/FR/5510-94-9711 (1994).

Is "Eucalyptus" part of the title, or is the author named K. Eucalyptus Wauchope?
50Preliminary Experiments Information Extraction
- P(citation text | title, author names) modeled with a simple HMM
- For each paper, recover the title, author surnames, and given names
- Fraction whose attributes are recovered perfectly in the last MCMC state:
  - among papers with one citation: 36.1%
  - among papers with multiple citations: 62.6%
- Can use inferred knowledge for disambiguation
51Multi-Object Tracking
[Figure: observed detections generated by unobserved objects, plus false detections]
52State Estimation for Aircraft
#Aircraft ~ NumAircraftPrior()
State(a, t) ~ if t = 0 then InitState() else StateTransition(State(a, Pred(t)))
#Blip(Source = a, Time = t) ~ NumDetectionsCPD(State(a, t))
#Blip(Time = t) ~ NumFalseAlarmsPrior()
ApparentPos(r) ~ if (Source(r) = null) then FalseAlarmDistrib() else ObsCPD(State(Source(r), Time(r)))
53Aircraft Entering and Exiting
#Aircraft(EntryTime = t) ~ NumAircraftPrior()
Exits(a, t) ~ if InFlight(a, t) then Bernoulli(0.1)
InFlight(a, t) ~ if t < EntryTime(a) then false
                 elseif t = EntryTime(a) then true
                 else (InFlight(a, Pred(t)) & !Exits(a, Pred(t)))
State(a, t) ~ if t = EntryTime(a) then InitState()
              elseif InFlight(a, t) then StateTransition(State(a, Pred(t)))
#Blip(Source = a, Time = t) ~ if InFlight(a, t) then NumDetectionsCPD(State(a, t))

...plus the last two statements from the previous slide
54MCMC for Aircraft Tracking
- Uses the generative model from the previous slide (although not with BLOG syntax)
- Examples of Metropolis-Hastings proposals
[Figures by Songhwai Oh]
[Oh et al., CDC 2004]
55Aircraft Tracking Results
[Figures: estimation error and running time comparisons; figures by Songhwai Oh]
- MCMC has the smallest error, and it hardly degrades at all as the tracks get dense
- MCMC is nearly as fast as the greedy algorithm and much faster than MHT
[Oh et al., CDC 2004]
56Toward General-Purpose Inference
- Currently, each new application requires new code for
  - Proposing moves
  - Representing MCMC states
  - Computing acceptance probabilities
- Goal
  - User specifies the model and the proposal distribution
  - General-purpose code does the rest
57General MCMC Engine
[Milch & Russell 2006]
[Diagram: the user supplies a model (in a declarative language) and a custom proposal distribution (a Java class); a general-purpose engine (Java code) runs the chain, with MCMC states represented as partial worlds]
- Proposal distribution: propose MCMC state s' given sn, and compute the ratio q(sn | s') / q(s' | sn)
- Engine: compute the acceptance probability based on the model, and set sn+1
- The engine handles arbitrary proposals efficiently using context-specific structure
58Summary
- Models for relational structures go beyond standard probabilistic inference settings
- MCMC provides a feasible path for inference
- Open problems
  - More general inference
  - Adaptive MCMC
  - Integrating discriminative methods
59References
- Blei, D. M. and Jordan, M. I. (2005) Variational inference for Dirichlet process mixtures. Bayesian Analysis 1(1):121-144.
- Casella, G. and Robert, C. P. (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83(1):81-94.
- Ferguson, T. S. (1983) Bayesian density estimation by mixtures of normal distributions. In Rizvi, M. H. et al., eds., Recent Advances in Statistics: Papers in Honor of Herman Chernoff on His Sixtieth Birthday. Academic Press, New York, pages 287-302.
- Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence 6:721-741.
- Gilks, W. R., Thomas, A. and Spiegelhalter, D. J. (1994) A language and program for complex Bayesian modelling. The Statistician 43(1):169-177.
- Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., eds. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall.
- Green, P. J. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4):711-732.
60References
- Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
- Jain, S. and Neal, R. M. (2004) A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. J. Computational and Graphical Statistics 13(1):158-182.
- Jordan, M. I. (2005) Dirichlet processes, Chinese restaurant processes, and all that. Tutorial at the NIPS Conference, available at http://www.cs.berkeley.edu/~jordan/nips-tutorial05.ps
- MacKay, D. J. C. (1992) Bayesian interpolation. Neural Computation 4(3):414-447.
- MacEachern, S. N. (1994) Estimating normal means with a conjugate style Dirichlet process prior. Communications in Statistics: Simulation and Computation 23:727-741.
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E. (1953) Equations of state calculations by fast computing machines. J. Chemical Physics 21:1087-1092.
- Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D. L., and Kolobov, A. (2005) BLOG: Probabilistic models with unknown objects. In Proc. 19th Int'l Joint Conf. on AI, pages 1352-1359.
- Milch, B. and Russell, S. (2006) General-purpose MCMC inference over relational structures. In Proc. 22nd Conf. on Uncertainty in AI, pages 349-358.
61References
- Neal, R. M. (2000) Markov chain sampling methods for Dirichlet process mixture models. J. Computational and Graphical Statistics 9:249-265.
- Oh, S., Russell, S. and Sastry, S. (2004) Markov chain Monte Carlo data association for general multi-target tracking problems. In Proc. 43rd IEEE Conf. on Decision and Control, pages 734-742.
- Pasula, H., Russell, S. J., Ostland, M., and Ritov, Y. (1999) Tracking many objects with many sensors. In Proc. 16th Int'l Joint Conf. on AI, pages 1160-1171.
- Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2003) Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15, MIT Press, pages 1401-1408.
- Richardson, S. and Green, P. J. (1997) On Bayesian analysis of mixtures with an unknown number of components. J. Royal Statistical Society B 59:731-792.
- Sethuraman, J. (1994) A constructive definition of Dirichlet priors. Statistica Sinica 4:639-650.
- Sudderth, E. (2006) Graphical models for visual object recognition and tracking. Ph.D. thesis, Dept. of EECS, Massachusetts Institute of Technology, Cambridge, MA.
- Tu, Z. and Zhu, S.-C. (2002) Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans. Pattern Analysis and Machine Intelligence 24(5):657-673.