Title: A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
1. A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
- Purnamrita Sarkar, Machine Learning Department, CMU
- Sajid M. Siddiqi, Robotics Institute, CMU
- Geoffrey J. Gordon, Machine Learning Department, CMU
2. A Static Problem: Pairwise Discrete Entity Links → Embedding in R²
Input: pairwise link data (author, keyword):
  (Alice, SVM), (Bob, Neural), (Charlie, Tree), (Alice, Entropy), (Charlie, Neural)
Co-occurrence counts (rows: authors; columns: Entropy, Neural, SVM, Tree):
  Alice    1 0 1 0
  Bob      0 1 0 0
  Charlie  0 1 0 1
Output: an embedding of the authors and keywords in R²
This problem was addressed by the CODE algorithm of Globerson et al. (NIPS 2004).
3. A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R²
Pairwise link data per timestep; co-occurrence counts per timestep.
This is the problem we address with our algorithm, D-CODE (Dynamic CODE). Additionally, we want distributions over entity coordinates rather than point estimates.
4. Notation
5. One time-slice of the model
[Graphical model for one time slice: the author coordinates and word coordinates jointly generate the author-word co-occurrence counts.]
6. The dynamic model
7. The Observation Model
- The closer a pair is in latent space, the higher its probability of co-occurrence (see the sketch below).
- But what if we want a distribution over the coordinates, and not just point estimates?
- Kalman filters!
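To make the first bullet concrete, here is a minimal sketch of a CODE-style observation model in which the co-occurrence probability of an author-word pair decays with their squared distance in the latent space (plain NumPy; the exact parametric form D-CODE uses, e.g. whether empirical marginals enter, is not reproduced here, and the names are ours):

import numpy as np

def cooccurrence_probs(author_coords, word_coords):
    """Probability table p(a, w) proportional to exp(-||x_a - y_w||^2).

    author_coords: (n_authors, 2) latent coordinates
    word_coords:   (n_words, 2) latent coordinates
    """
    # Squared distances between every author and every word.
    diff = author_coords[:, None, :] - word_coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    unnorm = np.exp(-sq_dist)
    # The normalizer is the "nasty log-sum" discussed later in the talk.
    return unnorm / unnorm.sum()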
8. Kalman Filters
Requirements:
1. X and O are real-valued, and
2. p(X_t | X_{t-1}) and p(O_t | X_t) are Gaussian
(X_t hidden, O_t observed)
Operations (see the sketch after this list):
- Start with an initial belief over X, i.e. P(X_0)
- For t = 1..T, with P(X_t | O_{1:t-1}) as the current belief:
  - Condition on O_t to obtain P(X_t | O_{1:t})
  - Predict the joint belief P(X_t, X_{t+1} | O_{1:t}) with the transition model
  - Roll up (i.e. integrate out) X_t to get the updated belief P(X_{t+1} | O_{1:t})
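Below is a minimal sketch of one such condition-then-predict cycle for a linear-Gaussian model (plain NumPy; the matrices A, C, Q, R are our placeholder names for the transition, observation, transition-noise, and observation-noise models, not D-CODE's actual parameterization):

import numpy as np

def kalman_step(mu, Sigma, obs, A, C, Q, R):
    """One condition-then-predict cycle of a linear-Gaussian Kalman filter.

    mu, Sigma: current belief P(X_t | O_{1:t-1})
    obs:       observation O_t
    """
    # Condition on O_t to obtain P(X_t | O_{1:t}).
    S = C @ Sigma @ C.T + R                    # innovation covariance
    K = Sigma @ C.T @ np.linalg.inv(S)         # Kalman gain
    mu_post = mu + K @ (obs - C @ mu)
    Sigma_post = (np.eye(len(mu)) - K @ C) @ Sigma
    # Predict and roll up to get P(X_{t+1} | O_{1:t}).
    mu_pred = A @ mu_post
    Sigma_pred = A @ Sigma_post @ A.T + Q
    return mu_post, Sigma_post, mu_pred, Sigma_pred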
9. Kalman Filter Inference: Conditioning
- Combine the observation Gaussian N(μ_obs, Σ_obs) with the belief N(μ_{t|t-1}, Σ_{t|t-1})
10. Kalman Filter Inference: Prediction
- Combine the transition-noise Gaussian N(0, Σ_transition) with the belief N(μ_{t|t-1}, Σ_{t|t-1})
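For reference, the closed-form updates behind these two steps, written in the same notation (standard Gaussian algebra; A denotes the transition matrix, which is not spelled out on these slides):

Conditioning:  Σ_{t|t} = (Σ_{t|t-1}^{-1} + Σ_obs^{-1})^{-1}
               μ_{t|t} = Σ_{t|t} (Σ_{t|t-1}^{-1} μ_{t|t-1} + Σ_obs^{-1} μ_obs)
Prediction:    μ_{t+1|t} = A μ_{t|t}
               Σ_{t+1|t} = A Σ_{t|t} Aᵀ + Σ_transition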
11. Approximating the Observation Model
- Let's take a closer look at the observation model.
- It can be moment-matched to a Gaussian, except for the normalizer (a nasty log-sum).
- Our approach: approximate the normalization constant.
12. Linearizing the normalizer
- We want to approximate the observation model with a Gaussian distribution.
- Step 1: a first-order Taylor approximation of the normalizer.
- However, this is still hard.
13. Linearizing the normalizer
- We want to approximate the observation model with a Gaussian distribution.
- Step 2: a second-order Taylor approximation (a sketch of this recipe follows below).
- We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian, N(μ_approx, Σ_approx), is our approximated observation model.
- We choose the linearization points to be the posterior means of the coordinates, given the data observed so far.
- This Gaussian preserves x-y correlations!
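A minimal sketch of the general recipe these two slides describe: expand the log observation term to second order around a linearization point x0 (the posterior mean) and read off a Gaussian from the gradient and Hessian. The finite-difference helpers and the assumption that the negated Hessian is positive definite are ours; D-CODE's actual derivation exploits the specific form of the normalizer rather than numerical differentiation.

import numpy as np

def gaussian_from_second_order(log_f, x0, eps=1e-5):
    """Second-order Taylor expansion of log_f around x0 -> N(mu, Sigma).

    exp(log_f(x)) ~ exp(log_f(x0) + g.(x - x0) + 0.5 (x - x0)' H (x - x0)),
    which, when -H is positive definite, is proportional to a Gaussian
    with precision -H and mean x0 - H^{-1} g.
    """
    d = len(x0)
    g = np.zeros(d)
    H = np.zeros((d, d))
    # Central finite differences for the gradient and Hessian.
    for i in range(d):
        e_i = np.eye(d)[i] * eps
        g[i] = (log_f(x0 + e_i) - log_f(x0 - e_i)) / (2 * eps)
        for j in range(d):
            e_j = np.eye(d)[j] * eps
            H[i, j] = (log_f(x0 + e_i + e_j) - log_f(x0 + e_i - e_j)
                       - log_f(x0 - e_i + e_j) + log_f(x0 - e_i - e_j)) / (4 * eps ** 2)
    Sigma = np.linalg.inv(-H)   # assumes -H is positive definite
    mu = x0 + Sigma @ g         # equals x0 - H^{-1} g
    return mu, Sigma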
14. Approximating the Observation Model
(A) (B)
Two pairs of contour plots: an author's true posterior conditional (left panel) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditional (right panel). (A) is a difficult-to-approximate bimodal case; (B) is an easier unimodal case.
15. Our Algorithms
- D-CODE: expected model probability (contrasted with the MLE version in the formulas after this list).
  - Can be obtained in closed form using our approximation.
- D-CODE MLE: evaluate the model probability at the posterior means.
- Static versions of the above, which learn an embedding on C_{T-1} to predict for year T.
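In symbols (our notation: X are the latent coordinates, o the held-out co-occurrences, and N(X; μ, Σ) the filtered posterior), the two scores contrast as follows; the expectation is the quantity the approximation above makes closed-form:

D-CODE:      E_{X ~ N(μ, Σ)}[ p(o | X) ] = ∫ p(o | X) N(X; μ, Σ) dX
D-CODE MLE:  p(o | μ)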
16. Algorithms for Comparison
- We compare with a dynamic version of PCA over overlapping windows of data.
  - Consistency of the configurations across consecutive time-steps is maintained with a Procrustes transform (see the alignment sketch below).
  - For ranking, we evaluate our model probability at the PCA coordinates.
- We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T-1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006); this allows us to compare with LLE.
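A minimal sketch of the kind of Procrustes alignment used to keep consecutive embeddings comparable (plain NumPy SVD; this rotation-only variant is one common choice, and the comparison's exact variant may also include scaling):

import numpy as np

def procrustes_align(X_prev, X_curr):
    """Rotate X_curr to best match X_prev in the least-squares sense.

    X_prev, X_curr: (n_points, d) embeddings of the same entities
    at consecutive time-steps.
    """
    # Center both configurations.
    A = X_prev - X_prev.mean(axis=0)
    B = X_curr - X_curr.mean(axis=0)
    # Optimal rotation R = U V' from the SVD of B'A (orthogonal Procrustes).
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return B @ R + X_prev.mean(axis=0)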
17. Experiments
- We present the experiments in two sections:
  - Qualitative results: visualization
  - Quantitative results: ranking on the Naïve Bayes author prediction task
- Naïve Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform Naïve Bayes ranking of authors given a subset of words from a paper in the next timestep (a ranking sketch follows this list).
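A minimal sketch of the point-estimate version of this ranking (our variable names; it scores each author by the sum of per-word log co-occurrence probabilities implied by the latent coordinates and sorts authors by that score):

import numpy as np

def rank_authors(author_coords, word_coords, query_word_ids):
    """Rank authors for a paper given a subset of its words (point-estimate version).

    Scores author a by sum_w log p(w | a), with p(w | a) proportional to
    exp(-||x_a - y_w||^2) over all words, then sorts by score (best first).
    """
    diff = author_coords[:, None, :] - word_coords[None, :, :]
    log_unnorm = -np.sum(diff ** 2, axis=-1)   # (n_authors, n_words)
    log_p_w_given_a = log_unnorm - np.logaddexp.reduce(log_unnorm, axis=1, keepdims=True)
    scores = log_p_w_given_a[:, query_word_ids].sum(axis=1)
    return np.argsort(-scores)                 # author indices, best first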
18. Synthetic Data
- Consider a dataset of 6 authors and 3 words (a data-generation sketch follows this list).
- There is one group of words, A1-A3, and two groups of authors, X1-X3 and Y1-Y3.
- Initially the words A_i are mostly used by authors X_i.
- Over time, the words gradually shift towards authors Y_i.
- There is a random noise component in the data.
19. The dynamic embedding successfully reflects trends in the underlying data
20. Ranking with and without distributions
Ranks computed using D-CODE shift much more smoothly over time.
21. NIPS co-authorship data
24. NIPS author rankings (Jordan, variational)
Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.
25. NIPS author rankings (Smola, kernel)
Same comparison as on the previous slide (D-CODE above, Dynamic PCA middle, empirical p(a | w) below), here for the author-word pair (Smola, kernel).
26. NIPS author rankings (Waibel, speech)
Same comparison as above, here for the author-word pair (Waibel, speech).
27. Rank Prediction
- Median predicted rank of the true authors of papers in t = 13, based on embeddings up to t = 12. Values statistically indistinguishable from the best in each row are in bold. D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.
28. Conclusion
- A novel dynamic embedding algorithm for co-occurrence count data, using Kalman filters
  - Visualization
  - Prediction
  - Detecting trends
- Distributions in embeddings make a difference!
- Can also do smoothing with closed-form updates
29. Acknowledgements
- We gratefully acknowledge Carlos Guestrin for his guidance and helpful comments.