Hierarchical Dirichlet Model for Document Classification


1
Hierarchical Dirichlet Model for Document
Classification
  • Sriharsha Veeramachaneni, Diego Sona, Paolo
    Avesani
  • ITC IRST
  • Automated Reasoning Systems division
  • Trento - Italy

2
Hierarchical Document Classification
Taxonomy defined by the editor
Document Corpus
Classification
3
Example: Web Directories
4
Hierarchical Document Classification Issues
  • Supervised learning
  • Very little training data relative to the large
    vocabularies
  • Small-sample estimation problems
  • Unsupervised learning
  • Only a few keywords per class to initialize the
    clustering algorithms
  • Sparse clusters

5
Hierarchical Document Classification Issues
  • We need good estimators that are tolerant to
    small sample sizes
  • Use regularization for variance reduction
  • Need to use prior knowledge about the problem to
    perform regularization
  • We believe that the class hierarchy contains
    valuable information that can be modeled into the
    prior

6
Dirichlet Distribution
  • The random vector X = (x1, x2, ..., xn) has a
    Dirichlet distribution with parameters v = (v1,
    v2, ..., vn) if its density is

    p(x) = [Γ(v0) / (Γ(v1) ··· Γ(vn))] · x1^(v1−1) ··· xn^(vn−1),
    where v0 = v1 + ... + vn.

The mean and covariance matrix are given by

    E[xi] = vi / v0
    Cov(xi, xj) = (δij vi v0 − vi vj) / (v0² (v0 + 1))
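The mean E[xi] = vi/v0 and covariance Cov(xi, xj) = (δij·vi·v0 − vi·vj)/(v0²(v0+1)), with v0 the sum of the parameters, can be checked numerically. A minimal sketch with NumPy (the parameter values are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters v = (v1, ..., vn); v0 is their sum.
v = np.array([2.0, 3.0, 5.0])
v0 = v.sum()

# Theoretical mean: E[x_i] = v_i / v0
mean_theory = v / v0

# Theoretical covariance:
# Cov(x_i, x_j) = (delta_ij * v_i * v0 - v_i * v_j) / (v0**2 * (v0 + 1))
cov_theory = (np.diag(v) * v0 - np.outer(v, v)) / (v0**2 * (v0 + 1))

# Empirical check against samples drawn from the Dirichlet distribution.
samples = rng.dirichlet(v, size=200_000)
print(samples.mean(axis=0))   # close to mean_theory
print(np.cov(samples.T))      # close to cov_theory
```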
7
Hierarchical Dirichlet Model
  • A document d is a sequence of words drawn from
    the vocabulary of size k
  • The probability of d given class i in the
    hierarchy is given by the multinomial (unigram)
    model

    P(d | i) ∝ θi(1)^n1(d) · θi(2)^n2(d) ··· θi(k)^nk(d),

    where nw(d) is the number of occurrences of word
    w in d
  • Furthermore, the parameter vectors themselves
    have Dirichlet priors given by

    θi ~ Dirichlet(s · θpa(i))

  • where pa(i) is the parent of node i
  • s is a smoothing parameter chosen in advance
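The generative story on this slide can be sketched in a few lines. This is an illustrative sketch, not the paper's code: the vocabulary size, smoothing value, root parameters, and helper names are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

K = 6        # vocabulary size (k in the slides); illustrative
s = 50.0     # smoothing parameter, chosen in advance

# Root parameter vector (uniform here purely for illustration).
theta_root = np.full(K, 1.0 / K)

def draw_child(theta_parent):
    """Child parameter vector: theta_i ~ Dirichlet(s * theta_parent).
    Larger s keeps children more tightly clustered around the parent."""
    return rng.dirichlet(s * theta_parent)

def doc_log_prob(doc, theta):
    """log P(d | class i) under the unigram multinomial model:
    the sum over the document's words of log theta[word]."""
    return np.log(theta[doc]).sum()

theta_child = draw_child(theta_root)
doc = rng.choice(K, size=20, p=theta_child)   # a 20-word document
print(doc_log_prob(doc, theta_child))
```

Drawing many children for one parent and comparing their spread at different values of s makes the role of the smoothing parameter concrete.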

8
Hierarchical Dirichlet Model
  • Motivation
  • Intuition: the children of a node are clustered
    around it.
  • That is, the concept at the parent subsumes those
    at the children
  • This is encoded into the model because
    E[θi] = θpa(i)
  • s controls the variability of the children about
    the parent

9
Hierarchical Dirichlet Model
10
Parameter Estimation
  • Iterative update algorithm
  • At each node the parameter vector is updated
    based upon
  • the data at the node
  • the prior, parameterized by the parameter vector
    at the parent
  • the parameter vectors at the children
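One update step of this kind can be sketched as a smoothed-count combination of the three ingredients listed above. This is a hedged reading of the slide, not the paper's exact LMMSE formula; the function name and weighting are assumptions.

```python
import numpy as np

def update_node(counts, theta_parent, child_thetas, s):
    """One iterative update for a node's parameter vector, sketched as:
    word counts at the node, smoothed toward the parent's parameters
    (the Dirichlet prior, weighted by s) and toward the current
    estimates at the children.  NOT the paper's exact LMMSE update."""
    theta = counts.astype(float) + s * theta_parent
    for theta_c in child_thetas:
        theta += s * theta_c
    return theta / theta.sum()

# Illustrative usage on a 4-word vocabulary.
counts = np.array([4, 0, 1, 0])                  # data at the node
theta_parent = np.full(4, 0.25)                  # prior from the parent
child_thetas = [np.array([0.5, 0.2, 0.2, 0.1])]  # one child's estimate
theta = update_node(counts, theta_parent, child_thetas, s=2.0)
print(theta)
```

Iterating such updates over all nodes until convergence matches the coordinate-wise flavor of the algorithm described on the slide.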

11
Parameter Estimation
  • Use the LMMSE (linear minimum mean-squared
    error) estimate
12
A Small Aside: Stein's Paradox and the Shrinkage
Estimate
  • Consider the following data
  • Out of na baseball players, ka are left-handed
  • Out of nb climbers, kb have climbed Mount Everest
  • Out of nc cars in Bonn, kc are foreign-made
  • Estimate (pa, pb, pc)
  • ML estimate (ka/na, kb/nb, kc/nc)
  • Stein showed that for squared-error loss a better
    estimate is obtained by shrinking the ML
    estimates toward the overall mean by a positive
    amount.
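The shrinkage idea can be sketched as a convex combination of each ML estimate with the grand mean. The shrinkage weight alpha below is an illustrative fixed constant, not the optimal James-Stein factor, and the numbers are invented toy data:

```python
import numpy as np

def shrink_toward_mean(p_hat, alpha):
    """Shrinkage estimate: pull each ML proportion toward the overall
    mean by a positive amount alpha in (0, 1).  Stein's result is that,
    for squared-error loss, a suitably chosen shrinkage dominates the
    raw ML estimates when three or more quantities are estimated
    jointly, even for unrelated quantities like those on the slide."""
    p_hat = np.asarray(p_hat, dtype=float)
    return (1 - alpha) * p_hat + alpha * p_hat.mean()

# Toy ML estimates (k_a/n_a, k_b/n_b, k_c/n_c); values are made up.
p_ml = np.array([0.10, 0.02, 0.40])
print(shrink_toward_mean(p_ml, 0.3))  # -> [0.122 0.066 0.332]
```

This is the same variance-reduction principle the hierarchical Dirichlet prior exploits: each node's estimate is shrunk toward its parent rather than toward a global mean.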

13
Experimental Evaluation
  • Data from 8 Google and LookSmart taxonomies.
  • Statistics of the datasets

14
Classification Accuracy
NB = Naïve Bayes, EM = Unconstrained EM, HD =
Hierarchical Dirichlet
15
Choice of Smoothing Parameter s
  • For supervised learning, s can be chosen by
    cross-validation
  • For unsupervised learning, s can be estimated
    from the data by maximizing likelihood on
    held-out data.
  • We recently implemented this and obtained results
    comparable to those with the best s.
  • We showed that the improvement in accuracy is
    observed for a wide range of s.

16
Conclusions
  • Done
  • Incorporated the hierarchy of the classes into
    the model as a prior.
  • Derived a simple parameter estimation algorithm.
  • Showed experimentally that linear regularization
    schemes can be effective in alleviating
    small-sample problems.
  • To be done
  • Develop a computationally efficient way to set
    the smoothing parameter.
  • Model links between the documents to improve
    classification accuracy.

17
Thank You