A Latent Space Approach to Dynamic Embedding of Co-occurrence Data



1
A Latent Space Approach to Dynamic Embedding of Co-occurrence Data
  • Purnamrita Sarkar, Machine Learning Department
  • Sajid M. Siddiqi, Robotics Institute
  • Geoffrey J. Gordon, Machine Learning Department
  • CMU

2
A Static Problem: Pairwise Discrete Entity Links → Embedding in R2
  • Input: pairwise link data (author-keyword pairs)
    Alice-SVM, Bob-Neural, Charlie-Tree, Alice-Entropy, Charlie-Neural
  • The links are aggregated into co-occurrence counts:
        E  N  S  T
    A   1  0  1  0
    B   0  1  0  0
    C   0  1  0  1
  • Output: an embedding of the authors and keywords in R2
  • This static problem was addressed by the CODE algorithm of Globerson et al., NIPS 2004
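As a concrete illustration of the input format (a sketch using the toy entities from this slide, not the authors' code), the pairwise links aggregate into a count matrix like so:

```python
from collections import Counter
import numpy as np

# Toy pairwise link data from the slide: (author, keyword) observations.
links = [("Alice", "SVM"), ("Bob", "Neural"), ("Charlie", "Tree"),
         ("Alice", "Entropy"), ("Charlie", "Neural")]

authors = sorted({a for a, _ in links})    # ['Alice', 'Bob', 'Charlie']
keywords = sorted({w for _, w in links})   # ['Entropy', 'Neural', 'SVM', 'Tree']

counts = Counter(links)
# C[i, j] = number of times author i co-occurred with keyword j.
C = np.array([[counts[(a, w)] for w in keywords] for a in authors])
print(C)
# [[1 0 1 0]
#  [0 1 0 0]
#  [0 1 0 1]]
```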
3
A Dynamic Problem: Pairwise Discrete Entity Links Over Time → Embeddings in R2
  • Input: pairwise link data per timestep, aggregated into co-occurrence counts per timestep
  • This is the problem we address with our algorithm, D-CODE (Dynamic CODE)
  • Additionally, we want distributions over entity coordinates, rather than point estimates
4
Notation
5
One time-slice of the model
(Figure: latent author coordinates and latent word coordinates generate the observed author-word co-occurrence counts)
6
The dynamic model
7
The Observation Model
The closer a pair is in latent space → the higher its probability of co-occurrence.
But what if we want a distribution over the coordinates, and not just point estimates?
Kalman Filters!
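For concreteness, here is a sketch of a CODE-style observation model (our notation; the exact parameterization used in D-CODE may differ): with author coordinates x_a and word coordinates y_w in R2, the co-occurrence probability falls off with latent distance, and the normalizer Z is the log-sum term that later slides approximate:

```latex
p(a, w) \;=\; \frac{1}{Z}\,\bar{p}(a)\,\bar{p}(w)\,e^{-\|x_a - y_w\|^2},
\qquad
Z \;=\; \sum_{a',\,w'} \bar{p}(a')\,\bar{p}(w')\,e^{-\|x_{a'} - y_{w'}\|^2}
```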
8
Kalman Filters
Requirements
  1. X and O are real-valued, and
  2. p(X_t | X_{t-1}) and p(O_t | X_t) are Gaussian
  (X_t hidden, O_t observed)
Operations
  • Start with an initial belief over X, i.e. P(X_0)
  • For t = 1..T, with P(X_t | O_{1:t-1}) as the current belief:
    • Condition on O_t to obtain P(X_t | O_{1:t})
    • Predict the joint belief P(X_t, X_{t+1} | O_{1:t}) with the transition model
    • Roll up (i.e. integrate out) X_t to get the updated belief P(X_{t+1} | O_{1:t})
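A minimal sketch of this filtering loop for a generic linear-Gaussian model (a standard Kalman filter; A, Q, H, R are assumed transition and observation parameters, not the D-CODE observation model, which the next slides approximate):

```python
import numpy as np

def kalman_filter(observations, A, Q, H, R, mu0, Sigma0):
    """Generic Kalman filter loop: condition on O_t, then predict X_{t+1}."""
    mu, Sigma = mu0, Sigma0                      # current belief P(X_t | O_{1:t-1})
    filtered = []
    for o in observations:
        # Condition on O_t to obtain P(X_t | O_{1:t})
        S = H @ Sigma @ H.T + R                  # innovation covariance
        K = Sigma @ H.T @ np.linalg.inv(S)       # Kalman gain
        mu = mu + K @ (o - H @ mu)
        Sigma = Sigma - K @ H @ Sigma
        filtered.append((mu, Sigma))
        # Predict and roll up: P(X_{t+1} | O_{1:t}) under X_{t+1} = A X_t + noise
        mu = A @ mu
        Sigma = A @ Sigma @ A.T + Q
    return filtered
```

In D-CODE the observation model is not linear-Gaussian, which is why the following slides moment-match and linearize it first.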

9
Kalman Filter Inference
  • Conditioning: combine the observation term N(μ_obs, Σ_obs) with the predicted belief N(μ_{t|t-1}, Σ_{t|t-1})

10
Kalman Filter Inference
  • Prediction: combine the transition noise N(0, Σ_transition) with the belief N(μ_{t|t-1}, Σ_{t|t-1})
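Both steps stay in closed form; for conditioning, the product of two Gaussian densities is itself proportional to a Gaussian (a standard identity, written here in the same notation):

```latex
\mathcal{N}(\mu_{\mathrm{obs}}, \Sigma_{\mathrm{obs}}) \cdot \mathcal{N}(\mu_{t|t-1}, \Sigma_{t|t-1})
\;\propto\; \mathcal{N}(\mu_{t|t}, \Sigma_{t|t}),
\quad
\Sigma_{t|t} = \bigl(\Sigma_{\mathrm{obs}}^{-1} + \Sigma_{t|t-1}^{-1}\bigr)^{-1},
\quad
\mu_{t|t} = \Sigma_{t|t}\bigl(\Sigma_{\mathrm{obs}}^{-1}\mu_{\mathrm{obs}} + \Sigma_{t|t-1}^{-1}\mu_{t|t-1}\bigr)
```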

11
Approximating the Observation Model
  • Let's take a closer look at the observation model
  • It can be moment-matched to a Gaussian,
  • except for the normalizer (a nasty log-sum)

Our approach: approximating the normalization constant
12
Linearizing the normalizer
  • We want to approximate the observation model with
    a Gaussian distribution.
  • Step 1: First-order Taylor approximation
  • However, this is still hard

13
Linearizing the normalizer
  • We want to approximate the observation model with
    a Gaussian distribution.
  • Step 2: A second-order Taylor approximation
  • We obtain a closed-form Gaussian with parameters related to the Jacobian and Hessian of this Taylor approximation. This Gaussian N(μ_approx, Σ_approx) is our approximated observation model (see the sketch below).
  • We choose the linearization points to be the posterior means of the coordinates, given the data observed so far.
  • This Gaussian preserves x-y correlations!
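A sketch of the idea in symbols (our notation; the details of the paper's derivation may differ): writing log Z(x) for the log-normalizer and expanding around the current posterior mean x̂,

```latex
\log Z(x) \;\approx\; \log Z(\hat{x}) \;+\; g^{\top}(x - \hat{x}) \;+\; \tfrac{1}{2}(x - \hat{x})^{\top} H\, (x - \hat{x}),
\qquad
g = \nabla \log Z(\hat{x}),\quad H = \nabla^{2} \log Z(\hat{x})
```

Exponentiating this quadratic, together with the Gaussian-friendly terms of the observation model, gives a closed-form N(μ_approx, Σ_approx) whose parameters involve g and H.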

14
Approximating the Observation Model
(A), (B): Two pairs of contour plots of an author's true posterior conditional (left panel) in a 3-author, 5-word embedding and the corresponding approximate Gaussian posterior conditional (right panel). (A) is a difficult-to-approximate bimodal case; (B) is an easier unimodal case.
15
Our Algorithms
  • D-CODE: expected model probability
    • Can be obtained in closed form using our approximation
  • D-CODE MLE: evaluate model probability using the posterior means
  • Static versions of the above, which learn an embedding on C_{T-1} (the counts for year T-1) to predict for year T.

16
Algorithms for Comparison
  • We compare with a dynamic version of PCA over overlapping windows of data
  • Consistency between the configurations at two consecutive time-steps is maintained by a Procrustes transform (a sketch follows below)
  • For ranking, we evaluate our model probability at the PCA coordinates
  • We also compare to Locally Linear Embedding (LLE) on the author prediction task. Like the static D-CODE variants above, we embed data for year T-1 and predict for year T. We define author-author distances based on the words they use, as in Mei and Shelton (2006); this allows us to compare with LLE.
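A sketch of the Procrustes alignment step used by this Dynamic PCA baseline (assuming plain orthogonal Procrustes via SciPy; the exact variant used in the paper may also handle scaling):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align_to_previous(X_prev, X_curr):
    """Rotate/reflect the current embedding so it lines up with the previous one."""
    # Center both configurations (handles the translation component).
    X_prev_c = X_prev - X_prev.mean(axis=0)
    X_curr_c = X_curr - X_curr.mean(axis=0)
    # Orthogonal Procrustes: R minimizes ||X_curr_c @ R - X_prev_c||_F.
    R, _ = orthogonal_procrustes(X_curr_c, X_prev_c)
    return X_curr_c @ R + X_prev.mean(axis=0)
```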

17
Experiments
  • We present the experiments in two sections:
    • Qualitative results: visualization
    • Quantitative results: ranking on the Naive Bayes author prediction task
  • Naive Bayes author prediction: we use the distributions / point estimates over entity locations at each timestep to perform Naive Bayes ranking of authors, given a subset of words from a paper in the next timestep (see the sketch below).
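A sketch of the point-estimate version of this ranking (an assumed CODE-style score in which proximity in latent space stands in for p(w | a); D-CODE proper averages over the coordinate distributions instead):

```python
import numpy as np

def rank_authors(author_coords, word_coords, query_words, author_names):
    """Naive-Bayes-style ranking: score each author by the product over query words
    of an (unnormalized) co-occurrence likelihood exp(-||x_a - y_w||^2)."""
    scores = {}
    for a, x_a in zip(author_names, author_coords):
        log_score = 0.0
        for w in query_words:
            y_w = word_coords[w]                     # embedded word coordinate
            log_score += -np.sum((x_a - y_w) ** 2)   # log of exp(-distance^2)
        scores[a] = log_score
    # Higher score = more likely author; position 0 corresponds to rank 1.
    return sorted(scores, key=scores.get, reverse=True)
```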

18
Synthetic Data
  • Consider a dataset of 6 authors and 3 words
  • There is one group of words, A1-A3, and two groups of authors, X1-X3 and Y1-Y3.
  • Initially the words Ai are mostly used by authors Xi.
  • Over time, the words gradually shift towards authors Yi.
  • There is a random noise component in the data (a sketch of such data follows below).
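A sketch of how such drifting counts could be generated (our own construction under stated assumptions; the slide does not specify the exact noise model):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_counts(T=20, n_groups=3, total=50, noise=2.0):
    """Counts[t] has rows X1..X3, Y1..Y3 and columns A1..A3."""
    data = []
    for t in range(T):
        alpha = t / (T - 1)                      # 0 -> 1: share of word Ai shifting to Yi
        C = np.zeros((2 * n_groups, n_groups))
        for i in range(n_groups):
            C[i, i] = (1 - alpha) * total        # author Xi uses word Ai early on
            C[n_groups + i, i] = alpha * total   # author Yi takes over later
        C += rng.poisson(noise, size=C.shape)    # random noise component
        data.append(C)
    return data

counts_per_timestep = synthetic_counts()
```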

19
The dynamic embedding successfully reflects
trends in the underlying data
20
Ranking with and w/o distributions
Ranks using D-CODE shift much more smoothly with
time
21
NIPS co-authorship data
22
(No Transcript)
23
(No Transcript)
24
NIPS author rankings (Jordan, variational)
Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.
25
NIPS author rankings (Smola, kernel)
Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.
26
NIPS author rankings (Waibel, speech)
Average author rank given a word, predicted using D-CODE (above) and Dynamic PCA (middle), and the empirical probabilities p(a | w) on NIPS data (below). t = 13 corresponds to 1999. Note that D-CODE's predicted rank is close to 1 when p(a | w) is high, and larger otherwise. In contrast, Dynamic PCA's predicted rank shows no noticeable correlation.
27
Rank Prediction
  • Median predicted rank of true authors of papers in t = 13, based on embeddings until t = 12. Values statistically indistinguishable from the best in each row are in bold. D-CODE is the best model in most cases, showing the usefulness of having distributions rather than just point estimates. D-CODE and D-CODE MLE also beat their static counterparts, showing the advantage of dynamic modeling.

28
Conclusion
  • Novel dynamic embedding algorithm for co-occurrence count data, using Kalman filters
    • Visualization
    • Prediction
    • Detecting trends
  • Distributions in embeddings make a difference!
  • Can also do smoothing with closed-form updates

29
Acknowledgements
  • We gratefully acknowledge Carlos Guestrin for his
    guidance and helpful comments.