Title: A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
1. A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
- Purnamrita Sarkar, Machine Learning Department, CMU
- Sajid M. Siddiqi, Robotics Institute, CMU
- Geoffrey J. Gordon, Machine Learning Department, CMU
2. A Static Problem: Pairwise Discrete Entity Links → Embedding in R²
Input: pairwise link data (author, keyword):
  (Alice, SVM), (Bob, Neural), (Charlie, Tree), (Alice, Entropy), (Charlie, Neural)
Co-occurrence counts (rows: authors; columns: Entropy, Neural, SVM, Tree):
  Alice    1 0 1 0
  Bob      0 1 0 0
  Charlie  0 1 0 1
Output: an embedding of the authors and keywords in R²
This problem was addressed by the CODE algorithm of Globerson et al. (NIPS 2004).
3. A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R²
Pairwise link data per timestep; co-occurrence counts per timestep.
This is the problem we address with our algorithm, D-CODE (Dynamic CODE). Additionally, we want distributions over entity coordinates rather than point estimates.
4. Notation
5. One time-slice of the model
[Graphical model for one time slice: the author coordinates and word coordinates jointly generate the author-word co-occurrence counts.]
6. The dynamic model
7. The Observation Model
- The closer a pair is in latent space, the higher its probability of co-occurrence (see the sketch below).
- But what if we want a distribution over the coordinates, and not just point estimates?
- Kalman filters!
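To make the first bullet concrete, here is a minimal sketch of a CODE-style observation model in which the co-occurrence probability of an author-word pair decays with their squared distance in the latent space (plain NumPy; the exact parametric form D-CODE uses, e.g. whether empirical marginals enter, is not reproduced here, and the names are ours):

import numpy as np

def cooccurrence_probs(author_coords, word_coords):
    """Probability table p(a, w) proportional to exp(-||x_a - y_w||^2).

    author_coords: (n_authors, 2) latent coordinates
    word_coords:   (n_words, 2) latent coordinates
    """
    # Squared distances between every author and every word.
    diff = author_coords[:, None, :] - word_coords[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    unnorm = np.exp(-sq_dist)
    # The normalizer is the "nasty log-sum" discussed later in the talk.
    return unnorm / unnorm.sum()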
8. Kalman Filters
Requirements:
1. X and O are real-valued, and
2. p(X_t | X_{t-1}) and p(O_t | X_t) are Gaussian
(X_t hidden, O_t observed)
Operations (see the sketch after this list):
- Start with an initial belief over X, i.e. P(X_0)
- For t = 1..T, with P(X_t | O_{1:t-1}) as the current belief:
  - Condition on O_t to obtain P(X_t | O_{1:t})
  - Predict the joint belief P(X_t, X_{t+1} | O_{1:t}) with the transition model
  - Roll up (i.e. integrate out) X_t to get the updated belief P(X_{t+1} | O_{1:t})
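Below is a minimal sketch of one such condition-then-predict cycle for a linear-Gaussian model (plain NumPy; the matrices A, C, Q, R are our placeholder names for the transition, observation, transition-noise, and observation-noise models, not D-CODE's actual parameterization):

import numpy as np

def kalman_step(mu, Sigma, obs, A, C, Q, R):
    """One condition-then-predict cycle of a linear-Gaussian Kalman filter.

    mu, Sigma: current belief P(X_t | O_{1:t-1})
    obs:       observation O_t
    """
    # Condition on O_t to obtain P(X_t | O_{1:t}).
    S = C @ Sigma @ C.T + R                    # innovation covariance
    K = Sigma @ C.T @ np.linalg.inv(S)         # Kalman gain
    mu_post = mu + K @ (obs - C @ mu)
    Sigma_post = (np.eye(len(mu)) - K @ C) @ Sigma
    # Predict and roll up to get P(X_{t+1} | O_{1:t}).
    mu_pred = A @ mu_post
    Sigma_pred = A @ Sigma_post @ A.T + Q
    return mu_post, Sigma_post, mu_pred, Sigma_pred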
9. Kalman Filter Inference: Conditioning
- Combine the observation Gaussian N(μ_obs, Σ_obs) with the belief N(μ_{t|t-1}, Σ_{t|t-1})
10. Kalman Filter Inference: Prediction
- Combine the transition-noise Gaussian N(0, Σ_transition) with the belief N(μ_{t|t-1}, Σ_{t|t-1})
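For reference, the closed-form updates behind these two steps, written in the same notation (standard Gaussian algebra; A denotes the transition matrix, which is not spelled out on these slides):

Conditioning:  Σ_{t|t} = (Σ_{t|t-1}^{-1} + Σ_obs^{-1})^{-1}
               μ_{t|t} = Σ_{t|t} (Σ_{t|t-1}^{-1} μ_{t|t-1} + Σ_obs^{-1} μ_obs)
Prediction:    μ_{t+1|t} = A μ_{t|t}
               Σ_{t+1|t} = A Σ_{t|t} Aᵀ + Σ_transition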
11. Approximating the Observation Model
- Let's take a closer look at the observation model.
- It can be moment-matched to a Gaussian, except for the normalizer (a nasty log-sum).
- Our approach: approximate the normalization constant.
12. Linearizing the normalizer
- We want to approximate the observation model with a Gaussian distribution.
- Step 1: a first-order Taylor approximation of the normalizer.
- However, this is still hard.
13. Linearizing the normalizer
- We want to approximate the observation model with a Gaussian distribution.
- Step 2: a second-order Taylor approximation (a sketch of this recipe follows below).
- We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian, N(μ_approx, Σ_approx), is our approximated observation model.
- We choose the linearization points to be the posterior means of the coordinates, given the data observed so far.
- This Gaussian preserves x-y correlations!
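A minimal sketch of the general recipe these two slides describe: expand the log observation term to second order around a linearization point x0 (the posterior mean) and read off a Gaussian from the gradient and Hessian. The finite-difference helpers and the assumption that the negated Hessian is positive definite are ours; D-CODE's actual derivation exploits the specific form of the normalizer rather than numerical differentiation.

import numpy as np

def gaussian_from_second_order(log_f, x0, eps=1e-5):
    """Second-order Taylor expansion of log_f around x0 -> N(mu, Sigma).

    exp(log_f(x)) ~ exp(log_f(x0) + g.(x - x0) + 0.5 (x - x0)' H (x - x0)),
    which, when -H is positive definite, is proportional to a Gaussian
    with precision -H and mean x0 - H^{-1} g.
    """
    d = len(x0)
    g = np.zeros(d)
    H = np.zeros((d, d))
    # Central finite differences for the gradient and Hessian.
    for i in range(d):
        e_i = np.eye(d)[i] * eps
        g[i] = (log_f(x0 + e_i) - log_f(x0 - e_i)) / (2 * eps)
        for j in range(d):
            e_j = np.eye(d)[j] * eps
            H[i, j] = (log_f(x0 + e_i + e_j) - log_f(x0 + e_i - e_j)
                       - log_f(x0 - e_i + e_j) + log_f(x0 - e_i - e_j)) / (4 * eps ** 2)
    Sigma = np.linalg.inv(-H)   # assumes -H is positive definite
    mu = x0 + Sigma @ g         # equals x0 - H^{-1} g
    return mu, Sigma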
14. Approximating the Observation Model
(A) (B)
Two pairs of contour plots: an author's true posterior conditional (left panel) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditional (right panel). (A) is a difficult-to-approximate bimodal case; (B) is an easier unimodal case.
15. Our Algorithms
- D-CODE: expected model probability (contrasted with the MLE version in the formulas after this list).
  - Can be obtained in closed form using our approximation.
- D-CODE MLE: evaluate the model probability at the posterior means.
- Static versions of the above, which learn an embedding on C_{T-1} to predict for year T.
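In symbols (our notation: X are the latent coordinates, o the held-out co-occurrences, and N(X; μ, Σ) the filtered posterior), the two scores contrast as follows; the expectation is the quantity the approximation above makes closed-form:

D-CODE:      E_{X ~ N(μ, Σ)}[ p(o | X) ] = ∫ p(o | X) N(X; μ, Σ) dX
D-CODE MLE:  p(o | μ)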
16. Algorithms for Comparison
- We compare with a dynamic version of PCA over overlapping windows of data.
  - Consistency of the configurations across consecutive time-steps is maintained with a Procrustes transform (see the alignment sketch below).
  - For ranking, we evaluate our model probability at the PCA coordinates.
- We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T-1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006); this allows us to compare with LLE.
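A minimal sketch of the kind of Procrustes alignment used to keep consecutive embeddings comparable (plain NumPy SVD; this rotation-only variant is one common choice, and the comparison's exact variant may also include scaling):

import numpy as np

def procrustes_align(X_prev, X_curr):
    """Rotate X_curr to best match X_prev in the least-squares sense.

    X_prev, X_curr: (n_points, d) embeddings of the same entities
    at consecutive time-steps.
    """
    # Center both configurations.
    A = X_prev - X_prev.mean(axis=0)
    B = X_curr - X_curr.mean(axis=0)
    # Optimal rotation R = U V' from the SVD of B'A (orthogonal Procrustes).
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return B @ R + X_prev.mean(axis=0)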
17. Experiments
- We present the experiments in two sections:
  - Qualitative results: visualization
  - Quantitative results: ranking on the Naïve Bayes author prediction task
- Naïve Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform Naïve Bayes ranking of authors given a subset of words from a paper in the next timestep (a ranking sketch follows this list).
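A minimal sketch of the point-estimate version of this ranking (our variable names; it scores each author by the sum of per-word log co-occurrence probabilities implied by the latent coordinates and sorts authors by that score):

import numpy as np

def rank_authors(author_coords, word_coords, query_word_ids):
    """Rank authors for a paper given a subset of its words (point-estimate version).

    Scores author a by sum_w log p(w | a), with p(w | a) proportional to
    exp(-||x_a - y_w||^2) over all words, then sorts by score (best first).
    """
    diff = author_coords[:, None, :] - word_coords[None, :, :]
    log_unnorm = -np.sum(diff ** 2, axis=-1)   # (n_authors, n_words)
    log_p_w_given_a = log_unnorm - np.logaddexp.reduce(log_unnorm, axis=1, keepdims=True)
    scores = log_p_w_given_a[:, query_word_ids].sum(axis=1)
    return np.argsort(-scores)                 # author indices, best first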
18. Synthetic Data
- Consider a dataset of 6 authors and 3 words (a data-generation sketch follows this list).
- There is one group of words, A1-A3, and two groups of authors, X1-X3 and Y1-Y3.
- Initially the words A_i are mostly used by authors X_i.
- Over time, the words gradually shift towards authors Y_i.
- There is a random noise component in the data.
19. The dynamic embedding successfully reflects trends in the underlying data
20. Ranking with and without distributions
Ranks computed using D-CODE shift much more smoothly over time.
21. NIPS co-authorship data
24. NIPS author rankings (Jordan, variational)
Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.
25. NIPS author rankings (Smola, kernel)
Same comparison as on the previous slide (D-CODE above, Dynamic PCA middle, empirical p(a | w) below), here for the author-word pair (Smola, kernel).
26. NIPS author rankings (Waibel, speech)
Same comparison as above, here for the author-word pair (Waibel, speech).
27. Rank Prediction
- Median predicted rank of the true authors of papers in t = 13, based on embeddings up to t = 12. Values statistically indistinguishable from the best in each row are in bold. D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.
28. Conclusion
- A novel dynamic embedding algorithm for co-occurrence count data, using Kalman filters
  - Visualization
  - Prediction
  - Detecting trends
- Distributions in embeddings make a difference!
- Can also do smoothing with closed-form updates
29. Acknowledgements
- We gratefully acknowledge Carlos Guestrin for his guidance and helpful comments.