Title: Hierarchical Dirichlet Model for Document Classification
1. Hierarchical Dirichlet Model for Document Classification
- Sriharsha Veeramachaneni, Diego Sona, Paolo Avesani - ITC IRST
- Automated Reasoning Systems division
- Trento - Italy
2. Hierarchical Document Classification
Taxonomy defined by the editor
Document Corpus
Classification
3. Example Web Directories
4. Hierarchical Document Classification Issues
- Supervised learning
  - Very little training data with large vocabularies
  - Small-sample estimation problems
- Unsupervised learning
  - Only a few keywords per class to initialize clustering algorithms
  - Sparse clusters
5. Hierarchical Document Classification Issues
- We need good estimators that are tolerant of small sample sizes
- Use regularization for variance reduction
- We need to use prior knowledge about the problem to perform regularization
- We believe that the class hierarchy contains valuable information that can be modeled into the prior
6. Dirichlet Distribution
- The random vector X = (x1, x2, ..., xn) has a Dirichlet distribution with parameters v = (v1, v2, ..., vn) if its density has the form below
- The mean and covariance matrix are given below
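The formulas on this slide were lost in extraction; the following are the standard Dirichlet density and moments, writing \( v_0 = \sum_{i=1}^{n} v_i \):

\[ p(x_1, \dots, x_n) = \frac{\Gamma(v_0)}{\prod_{i=1}^{n} \Gamma(v_i)} \prod_{i=1}^{n} x_i^{\,v_i - 1}, \qquad x_i \ge 0, \ \sum_{i=1}^{n} x_i = 1 \]

\[ E[x_i] = \frac{v_i}{v_0}, \qquad \mathrm{Cov}(x_i, x_j) = \frac{v_i\,(v_0\,\delta_{ij} - v_j)}{v_0^2\,(v_0 + 1)} \]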
7. Hierarchical Dirichlet Model
- A document d is a sequence of words drawn from the vocabulary of size k
- The probability of d given class i in the hierarchy is given below
- Furthermore, the parameter vectors themselves have Dirichlet priors, given below, where pa(i) is the parent of node i
- s is a smoothing parameter chosen in advance
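The equations on this slide did not survive extraction. A reconstruction consistent with the surrounding text, with \( \theta_i \) denoting the word-probability vector at node \( i \) and \( n_w(d) \) the count of word \( w \) in document \( d \) (both symbols introduced here), is:

\[ P(d \mid i) \propto \prod_{w=1}^{k} \theta_{i,w}^{\, n_w(d)} \]

\[ p(\theta_i \mid \theta_{\mathrm{pa}(i)}) = \mathrm{Dir}\big( \theta_i \,;\ s \, \theta_{\mathrm{pa}(i)} \big) \]

Under this prior \( E[\theta_i \mid \theta_{\mathrm{pa}(i)}] = \theta_{\mathrm{pa}(i)} \), and larger s concentrates the children more tightly around the parent, matching the motivation on the next slide.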
8. Hierarchical Dirichlet Model
- Motivation
  - Intuition: the children of a node are clustered around it
  - That is, the concept at the parent subsumes those at the children
- This is encoded into the model: s controls the variability of the children about the parent
9. Hierarchical Dirichlet Model
10. Parameter Estimation
- Iterative update algorithm (a sketch follows this list)
- At each node the parameter vector is updated based upon:
  - the data at the node
  - the prior, parameterized by the parameter vector at the parent
  - the parameter vectors at the children
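A minimal sketch of one such update, assuming a MAP-style rule in which the node's word counts are augmented with s-weighted pseudo-counts from the parent and children; the function `update_theta` and its exact weighting are illustrative, not the authors' precise algorithm:

```python
import numpy as np

def update_theta(counts, theta_parent, theta_children, s):
    """Hypothetical smoothed update for one node's word-probability
    vector: combine the node's own word counts with the parent prior
    (weight s) and the children's current estimates (weight s each),
    then renormalize. A sketch of the slide's description only."""
    pseudo = counts + s * theta_parent
    for theta_c in theta_children:
        pseudo = pseudo + s * theta_c
    return pseudo / pseudo.sum()

# Toy example: vocabulary of 4 words.
counts = np.array([5.0, 0.0, 2.0, 0.0])           # word counts at this node
theta_parent = np.full(4, 0.25)                    # parent's parameter vector
theta_children = [np.array([0.4, 0.1, 0.4, 0.1])]  # current child estimates

theta = update_theta(counts, theta_parent, theta_children, s=10.0)
print(theta)  # zero-count words still receive nonzero probability
```

Iterating this update over all nodes until convergence realizes the coupling described above: each node is pulled toward its data, its parent, and its children.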
11. Parameter Estimation
12. A Small Aside: Stein's Paradox and Shrinkage Estimation
- Consider the following data:
  - out of na baseball players, ka are left-handed
  - out of nb climbers, kb have climbed Mount Everest
  - out of nc cars in Bonn, kc are foreign-made
- Estimate (pa, pb, pc)
- ML estimate: (ka/na, kb/nb, kc/nc)
- Stein showed that, under squared-error loss, a better estimate shrinks the ML estimates toward the overall mean by a positive amount (illustrated below)
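A quick numerical illustration of the idea; the counts and the fixed shrinkage weight below are made up, and Stein's actual estimator derives the shrinkage amount from the data rather than fixing it:

```python
import numpy as np

# Three unrelated proportions (left-handed players, Everest climbers,
# foreign cars), yet under squared-error loss the shrunken estimate
# dominates the per-group ML estimate.
k = np.array([12.0, 3.0, 40.0])   # hypothetical "success" counts
n = np.array([50.0, 20.0, 90.0])  # hypothetical group sizes

p_ml = k / n                      # per-group maximum-likelihood estimates
grand_mean = p_ml.mean()          # overall mean of the estimates

alpha = 0.3                       # fixed shrinkage weight (illustrative)
p_shrunk = (1 - alpha) * p_ml + alpha * grand_mean

print("ML estimates:     ", p_ml)
print("Shrunken estimates:", p_shrunk)
```

The hierarchical Dirichlet prior plays the same role here: it shrinks each node's sparse estimate toward its parent rather than toward a single overall mean.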
13. Experimental Evaluation
- Data from 8 Google and LookSmart taxonomies
- Statistics of the datasets
14Classification Accuracy
NB Naïve Bayes EM Unconstrained EM HD
Hierarchical Dirichlet
15. Choice of Smoothing Parameter s
- For supervised learning, s can be chosen by cross-validation
- For unsupervised learning, s can be estimated from the data by maximizing the likelihood on held-out data (a sketch follows this list)
- We recently implemented this and obtained results comparable to those with the best s
- We showed that the improvement in accuracy is observed over a wide range of s
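A minimal sketch of the held-out selection, assuming a single node smoothed toward its parent as in the model; `heldout_loglik`, the toy counts, and the grid of candidate values are all illustrative:

```python
import numpy as np

def heldout_loglik(s, train_counts, heldout_counts, theta_parent):
    """Hypothetical held-out score for one node: smooth the training
    counts toward the parent's parameter vector with weight s, then
    evaluate the held-out word counts under the smoothed multinomial."""
    theta = train_counts + s * theta_parent
    theta = theta / theta.sum()
    return float(heldout_counts @ np.log(theta))

train = np.array([30.0, 0.0, 10.0, 5.0])  # training word counts
held = np.array([4.0, 1.0, 2.0, 1.0])     # held-out word counts
parent = np.full(4, 0.25)                 # parent's parameter vector

# Simple grid search over candidate smoothing values.
grid = [0.1, 1.0, 10.0, 100.0]
best_s = max(grid, key=lambda s: heldout_loglik(s, train, held, parent))
print(best_s)
```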
16. Conclusions
- Done
  - Incorporated the hierarchy of the classes into the model as a prior
  - Derived a simple parameter estimation algorithm
  - Showed experimentally that linear regularization schemes can be effective in alleviating small-sample problems
- To be done
  - Develop a computationally efficient way to set the smoothing parameter
  - Model links between the documents to improve classification accuracy
17. Thank You