Title: Hierarchical Dirichlet Model for Document Classification
1. Hierarchical Dirichlet Model for Document Classification
- Sriharsha Veeramachaneni, Diego Sona, Paolo Avesani - ITC IRST
- Automated Reasoning Systems division
- Trento - Italy
2. Hierarchical Document Classification
Taxonomy defined by the editor
Document Corpus
Classification
3. Example Web Directories
4. Hierarchical Document Classification Issues
- Supervised learning
  - Very little training data with large vocabularies
  - Small-sample estimation problems
- Unsupervised learning
  - Only a few keywords per class to initialize clustering algorithms
  - Sparse clusters
5. Hierarchical Document Classification Issues
- We need good estimators that are tolerant of small sample sizes
- Use regularization for variance reduction
- We need to use prior knowledge about the problem to perform regularization
- We believe that the class hierarchy contains valuable information that can be modeled into the prior
6. Dirichlet Distribution
- The random vector X = (x1, x2, ..., xn) has a Dirichlet distribution with parameters v = (v1, v2, ..., vn) if its density has the form below
- The mean and covariance matrix are given below
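The formulas on this slide were lost in extraction; the following are the standard Dirichlet density and moments, writing \( v_0 = \sum_{i=1}^{n} v_i \):

\[ p(x_1, \dots, x_n) = \frac{\Gamma(v_0)}{\prod_{i=1}^{n} \Gamma(v_i)} \prod_{i=1}^{n} x_i^{\,v_i - 1}, \qquad x_i \ge 0, \ \sum_{i=1}^{n} x_i = 1 \]

\[ E[x_i] = \frac{v_i}{v_0}, \qquad \mathrm{Cov}(x_i, x_j) = \frac{v_i\,(v_0\,\delta_{ij} - v_j)}{v_0^2\,(v_0 + 1)} \]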
7. Hierarchical Dirichlet Model
- A document d is a sequence of words drawn from the vocabulary of size k
- The probability of d given class i in the hierarchy is given below
- Furthermore, the parameter vectors themselves have Dirichlet priors, given below, where pa(i) is the parent of node i
- s is a smoothing parameter chosen in advance
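The equations on this slide did not survive extraction. A reconstruction consistent with the surrounding text, with \( \theta_i \) denoting the word-probability vector at node \( i \) and \( n_w(d) \) the count of word \( w \) in document \( d \) (both symbols introduced here), is:

\[ P(d \mid i) \propto \prod_{w=1}^{k} \theta_{i,w}^{\, n_w(d)} \]

\[ p(\theta_i \mid \theta_{\mathrm{pa}(i)}) = \mathrm{Dir}\big( \theta_i \,;\ s \, \theta_{\mathrm{pa}(i)} \big) \]

Under this prior \( E[\theta_i \mid \theta_{\mathrm{pa}(i)}] = \theta_{\mathrm{pa}(i)} \), and larger s concentrates the children more tightly around the parent, matching the motivation on the next slide.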
8. Hierarchical Dirichlet Model
- Motivation
  - Intuition: the children of a node are clustered around it
  - That is, the concept at the parent subsumes those at the children
- This is encoded into the model: s controls the variability of the children about the parent
9. Hierarchical Dirichlet Model
10. Parameter Estimation
- Iterative update algorithm (a sketch follows this list)
- At each node the parameter vector is updated based upon:
  - the data at the node
  - the prior, parameterized by the parameter vector at the parent
  - the parameter vectors at the children
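A minimal sketch of one such update, assuming a MAP-style rule in which the node's word counts are augmented with s-weighted pseudo-counts from the parent and children; the function `update_theta` and its exact weighting are illustrative, not the authors' precise algorithm:

```python
import numpy as np

def update_theta(counts, theta_parent, theta_children, s):
    """Hypothetical smoothed update for one node's word-probability
    vector: combine the node's own word counts with the parent prior
    (weight s) and the children's current estimates (weight s each),
    then renormalize. A sketch of the slide's description only."""
    pseudo = counts + s * theta_parent
    for theta_c in theta_children:
        pseudo = pseudo + s * theta_c
    return pseudo / pseudo.sum()

# Toy example: vocabulary of 4 words.
counts = np.array([5.0, 0.0, 2.0, 0.0])           # word counts at this node
theta_parent = np.full(4, 0.25)                    # parent's parameter vector
theta_children = [np.array([0.4, 0.1, 0.4, 0.1])]  # current child estimates

theta = update_theta(counts, theta_parent, theta_children, s=10.0)
print(theta)  # zero-count words still receive nonzero probability
```

Iterating this update over all nodes until convergence realizes the coupling described above: each node is pulled toward its data, its parent, and its children.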
11. Parameter Estimation
12. A Small Aside: Stein's Paradox and Shrinkage Estimation
- Consider the following data:
  - out of na baseball players, ka are left-handed
  - out of nb climbers, kb have climbed Mount Everest
  - out of nc cars in Bonn, kc are foreign-made
- Estimate (pa, pb, pc)
- ML estimate: (ka/na, kb/nb, kc/nc)
- Stein showed that, under squared-error loss, a better estimate shrinks the ML estimates toward the overall mean by a positive amount (illustrated below)
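A quick numerical illustration of the idea; the counts and the fixed shrinkage weight below are made up, and Stein's actual estimator derives the shrinkage amount from the data rather than fixing it:

```python
import numpy as np

# Three unrelated proportions (left-handed players, Everest climbers,
# foreign cars), yet under squared-error loss the shrunken estimate
# dominates the per-group ML estimate.
k = np.array([12.0, 3.0, 40.0])   # hypothetical "success" counts
n = np.array([50.0, 20.0, 90.0])  # hypothetical group sizes

p_ml = k / n                      # per-group maximum-likelihood estimates
grand_mean = p_ml.mean()          # overall mean of the estimates

alpha = 0.3                       # fixed shrinkage weight (illustrative)
p_shrunk = (1 - alpha) * p_ml + alpha * grand_mean

print("ML estimates:     ", p_ml)
print("Shrunken estimates:", p_shrunk)
```

The hierarchical Dirichlet prior plays the same role here: it shrinks each node's sparse estimate toward its parent rather than toward a single overall mean.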
13. Experimental Evaluation
- Data from 8 Google and LookSmart taxonomies
- Statistics of the datasets
14Classification Accuracy
NB Naïve Bayes EM Unconstrained EM HD
Hierarchical Dirichlet
15. Choice of Smoothing Parameter s
- For supervised learning, s can be chosen by cross-validation
- For unsupervised learning, s can be estimated from the data by maximizing the likelihood on held-out data (a sketch follows this list)
- We recently implemented this and obtained results comparable to those with the best s
- We showed that the improvement in accuracy is observed over a wide range of s
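A minimal sketch of the held-out selection, assuming a single node smoothed toward its parent as in the model; `heldout_loglik`, the toy counts, and the grid of candidate values are all illustrative:

```python
import numpy as np

def heldout_loglik(s, train_counts, heldout_counts, theta_parent):
    """Hypothetical held-out score for one node: smooth the training
    counts toward the parent's parameter vector with weight s, then
    evaluate the held-out word counts under the smoothed multinomial."""
    theta = train_counts + s * theta_parent
    theta = theta / theta.sum()
    return float(heldout_counts @ np.log(theta))

train = np.array([30.0, 0.0, 10.0, 5.0])  # training word counts
held = np.array([4.0, 1.0, 2.0, 1.0])     # held-out word counts
parent = np.full(4, 0.25)                 # parent's parameter vector

# Simple grid search over candidate smoothing values.
grid = [0.1, 1.0, 10.0, 100.0]
best_s = max(grid, key=lambda s: heldout_loglik(s, train, held, parent))
print(best_s)
```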
16. Conclusions
- Done
  - Incorporated the hierarchy of the classes into the model as a prior
  - Derived a simple parameter estimation algorithm
  - Showed experimentally that linear regularization schemes can be effective in alleviating small-sample problems
- To be done
  - Develop a computationally efficient way to set the smoothing parameter
  - Model links between the documents to improve classification accuracy
17. Thank You