16,000 documents - PowerPoint PPT Presentation

Slides: 29
Provided by: acuk
Transcript and Presenter's Notes


1
Example
  • 16,000 documents
  • 100 topics
  • Picked the words with large p(w|z) for each topic

2
(No Transcript)
3
New document?
  • Given a new document, compute the variational parameters γ and φ_n
  • γ_i - α_i: expected number of words allocated to each topic
  • φ_n approximates p(z_n|w)
  • See cases where these values are relatively large
  • 4 topics found
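The "words allocated to each topic" idea can be sketched as below. This is a minimal illustration, assuming the posterior Dirichlet parameters (`gamma`) and the prior (`alpha`) have already been obtained by variational inference; the numbers, the function names, and the threshold of 1 are assumptions for illustration, not values from the slides.

```python
# Hypothetical values: a symmetric prior and posterior Dirichlet parameters
# for one document under a 6-topic model (not taken from the slides).
alpha = [0.5] * 6
gamma = [0.6, 12.5, 0.5, 3.9, 0.7, 8.3]

def expected_topic_counts(gamma, alpha):
    """Return gamma_i - alpha_i, the approximate number of words per topic."""
    return [g - a for g, a in zip(gamma, alpha)]

def prominent_topics(gamma, alpha, threshold=1.0):
    """Indices of topics whose expected word count is at least `threshold`."""
    counts = expected_topic_counts(gamma, alpha)
    return [i for i, c in enumerate(counts) if c >= threshold]

print(prominent_topics(gamma, alpha))
```

Topics that pass the threshold are the ones for which the document contributes a noticeable number of words; the rest stay close to the prior.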

4
Unseen document (contd.)
  • Bag of words: the words of "William Randolph Hearst Foundation"
    are assigned to different topics

5
Applications and empirical results
  • Document modeling
  • Document classification
  • Collaborative filtering

6
Document modeling
  • Task: density estimation; assign high likelihood to
    unseen documents
  • Measure of goodness: perplexity
  • Perplexity decreases monotonically in the test-data likelihood
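Perplexity is conventionally the exponential of the negative per-word log-likelihood on held-out documents. The sketch below assumes the per-document log-likelihoods log p(w_d) are already available from some model; the function name and the toy numbers are illustrative.

```python
import math

def perplexity(log_likelihoods, doc_lengths):
    """exp(-(sum_d log p(w_d)) / (sum_d N_d)) over held-out documents."""
    total_log_lik = sum(log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_lik / total_words)

# Toy check: a model that gives every word probability 1/100 should have
# a perplexity of about 100, whatever the document lengths.
lengths = [10, 25]
log_liks = [n * math.log(1 / 100) for n in lengths]
print(perplexity(log_liks, lengths))
```

Because the likelihood enters through a negated exponent, a higher held-out likelihood always yields a lower perplexity.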

7
The experiment
Corpus                Articles  Terms
Scientific abstracts  5,225     28,414
Newswire articles     16,333    23,075
8
The experiment (contd.)
  • Preprocessed: removed stop words and words appearing only once
  • 10% held out for testing; trained on the remaining 90%
  • Trained with the same stopping criteria

9
Results
10
Overfitting in Mixture of unigrams
  • Peaked posterior in the training set
  • Unseen document with unseen word
  • Word will have very small probability
  • Remedy: smoothing
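To see why smoothing helps, the sketch below uses simple additive (add-one) smoothing on a single word distribution. This is an assumption chosen for brevity and may differ from the smoothing used in the experiments; the vocabulary and tokens are made up.

```python
from collections import Counter

def word_probs(tokens, vocab, pseudo_count=0.0):
    """Estimate p(w) over `vocab`, adding `pseudo_count` to every word count."""
    counts = Counter(tokens)
    total = len(tokens) + pseudo_count * len(vocab)
    return {w: (counts[w] + pseudo_count) / total for w in vocab}

vocab = ["grain", "wheat", "profit", "merger"]
train = ["grain", "wheat", "grain", "profit"]   # "merger" never seen

mle = word_probs(train, vocab)                  # maximum likelihood
smoothed = word_probs(train, vocab, 1.0)        # add-one smoothing

print(mle["merger"], smoothed["merger"])
```

Without smoothing, a single unseen word drives the probability of the whole document to zero; with a pseudo-count every word keeps some mass.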

11
Overfitting in pLSI
  • Mixture of topics allowed
  • Marginalize over d to find p(w)
  • Restricted to the same topic proportions
    as the training documents
  • Folding in: ignore the p(z|d) parameters and refit
    p(z|d_new)
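Folding in can be sketched as a few EM iterations that re-estimate only the topic proportions of the new document while the topic-word distributions stay fixed. The two-topic model, the function name, and the iteration count below are hypothetical.

```python
def fold_in(doc_words, p_w_given_z, n_iter=50):
    """Refit p(z | d_new) by EM while keeping p(w | z) fixed."""
    n_topics = len(p_w_given_z)
    p_z = [1.0 / n_topics] * n_topics          # uniform start
    for _ in range(n_iter):
        new_p_z = [0.0] * n_topics
        for w in doc_words:
            # E-step: responsibility of each topic for this word.
            joint = [p_z[k] * p_w_given_z[k][w] for k in range(n_topics)]
            norm = sum(joint)
            for k in range(n_topics):
                new_p_z[k] += joint[k] / norm
        # M-step: topic proportions are the averaged responsibilities.
        p_z = [c / len(doc_words) for c in new_p_z]
    return p_z

# Hypothetical two-topic model over a three-word vocabulary.
p_w_given_z = [
    {"grain": 0.7, "wheat": 0.2, "profit": 0.1},
    {"grain": 0.1, "wheat": 0.1, "profit": 0.8},
]
print(fold_in(["grain", "grain", "wheat", "profit"], p_w_given_z))
```

Note that the proportions are fitted on the test document itself, which is one reason folding in is regarded as a heuristic rather than a clean held-out evaluation.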

12
LDA
  • Documents can have different proportions of
    topics
  • No heuristics

13
(No Transcript)
14
Document classification
  • Generative or discriminative
  • Choice of features in document classification
  • LDA as dimensionality reduction technique
  • Posterior Dirichlet parameters γ*(w) as LDA features

15
The experiment
  • Binary classification
  • 8000 documents, 15,818 words
  • True class labels not used when fitting LDA
  • 50 topics
  • Trained SVM on the LDA features
  • Compared with SVM on all word features
  • LDA reduced feature space by 99.6%
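The size of the reduction follows directly from the two figures on this slide: 50 topic features replace 15,818 word features. A quick check, with variable names chosen for illustration:

```python
vocab_size = 15_818   # word features, from the slide
n_topics = 50         # LDA features, from the slide

reduction = 1 - n_topics / vocab_size
print(f"{reduction:.2%}")
```

The result is a little above 99.6%, in line with the figure quoted above.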

16
GRAIN vs NOT GRAIN
17
EARN vs NOT EARN
18
LDA in document classification
  • Feature space reduced, performance improved
  • Results need further investigation
  • Use for feature selection

19
Collaborative filtering
  • Collection of users and movies they prefer
  • Trained on observed users
  • Task: given an unobserved user and all but one of the movies
    they prefer, predict the held-out movie
  • Only users who positively rated at least 100 movies
  • Trained on 89% of the data

20
Some quantities required
  • Probability of the held-out movie: p(w|w_obs)
  • For mixture of unigrams and pLSI: sum out the topic
    variable
  • For LDA: sum out the topic and Dirichlet variables
    (quantity efficient to compute)
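For the mixture of unigrams, summing out the topic variable amounts to computing a posterior over the single topic given the observed movies and then averaging the per-topic movie probabilities. The sketch below covers only that case, with a hypothetical two-topic model and made-up probabilities; the LDA version additionally integrates over the Dirichlet variable and is not shown.

```python
def predictive_prob(w, observed, p_z, p_w_given_z):
    """p(w | w_obs) for a mixture of unigrams, summing out the topic z."""
    # Posterior over the single document topic given the observed items.
    post = []
    for k, prior in enumerate(p_z):
        lik = prior
        for o in observed:
            lik *= p_w_given_z[k][o]
        post.append(lik)
    norm = sum(post)
    post = [p / norm for p in post]
    # Sum out z: p(w | w_obs) = sum_z p(w | z) p(z | w_obs).
    return sum(post[k] * p_w_given_z[k][w] for k in range(len(p_z)))

# Hypothetical model: two "taste" topics over four movies.
p_z = [0.5, 0.5]
p_w_given_z = [
    {"Alien": 0.4, "Aliens": 0.4, "Amelie": 0.1, "Chocolat": 0.1},
    {"Alien": 0.1, "Aliens": 0.1, "Amelie": 0.4, "Chocolat": 0.4},
]
print(predictive_prob("Aliens", ["Alien"], p_z, p_w_given_z))
```

A user who liked "Alien" shifts the topic posterior toward the first topic, so "Aliens" receives a higher predictive probability than "Amelie".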

21
Results
22
Further work
  • Other approaches for inference and parameter
    estimation
  • Embedded in another model
  • Other types of data
  • Partial exchangeability

23
Example: visual words
  • Document: image
  • Words: image features (bars, circles)
  • Topics: face, airplane
  • Bag of words: no spatial relationship between
    objects
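The bag-of-words assumption for images can be illustrated by counting quantized features while discarding their positions. The detections below are invented for illustration; a real pipeline would first extract and cluster local descriptors to obtain the visual words.

```python
from collections import Counter

# Hypothetical detections in one image: (visual word, x, y).
detections = [("bar", 10, 40), ("circle", 55, 12), ("bar", 80, 80), ("bar", 3, 7)]

def bag_of_words(detections):
    """Count visual words, discarding where each one was detected."""
    return Counter(word for word, _x, _y in detections)

# Same visual words at entirely different positions.
moved = [("bar", 0, 0), ("bar", 1, 1), ("circle", 2, 2), ("bar", 9, 9)]
print(bag_of_words(detections) == bag_of_words(moved))
```

Two images with the same features in different layouts are indistinguishable to the model, which is exactly the limitation named in the last bullet.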

24
Visual words
25
Identifying the visual words and topics
26
Conclusion
  • Exchangeability, de Finetti's theorem
  • Dirichlet distribution
  • ? Generative
  • ? Bag of words
  • ? Independence assumption in Dirichlet
    distribution - correlated topics

27
Implementations
  • In C (by one of the authors)
  • http://www.cs.princeton.edu/~blei/lda-c/
  • In C and Matlab
  • http://chasen.org/~daiti-m/dist/lda/

28
References
  • D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation.
    Journal of Machine Learning Research, 3:993-1022, 2003.
  • J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and
    W. T. Freeman. Discovering object categories in image
    collections. MIT AI Lab Memo AIM-2005-005, February 2005.
  • D. Blei and J. Lafferty. Correlated topic models.
    Advances in Neural Information Processing Systems 18, 2005.