Automatically Labeling Hierarchical Clusters - PowerPoint PPT Presentation

1
Automatically Labeling Hierarchical Clusters
  • Puck Treeratpituk, Jamie Callan
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University

2
Motivation
  • Public comments (eRulemaking)
  • For each proposed regulation, more than 100,000
    comments are received.
  • There is a need to automatically organize large
    collections of documents into a hierarchical
    structure.
  • Browsing
  • Data analysis
  • Hierarchical clustering!

However, not the focus of this talk
3
Labeling hierarchical clusters
  • Why care about labeling?
  • A good generated hierarchy without good
    descriptors is not very useful.
  • Problem: given a generated document hierarchy,
    how do we assign good labels to each cluster?

4
Near-duplicate detection (DURIAN)
5
Near-duplicate detection (DURIAN)
6
What is a good label?
  • Characteristics of a good label?
  • Descriptive - describes the content of the
    cluster.
  • Discriminative - discriminates the cluster from
    its siblings and its parent.
  • Example
  • For a document cluster A, "computer science" is
    descriptive but not discriminative when A's
    sibling is "fuzzy logic".

A: "Bibliographies on Neural Networks"
7
Previous works
  • There have been many works on clustering, but
    few actually focus on labeling.
  • Most commonly, the terms with the highest
    occurrence counts are simply selected as the
    cluster's label.
  • Inferring hierarchical descriptors (Glover et
    al., CIKM '02)
  • Uses document-frequency-based ranking with some
    threshold-based term selection.
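The common baseline mentioned above (label a cluster with its most frequent terms) can be sketched as follows; the function name and tokenization are illustrative, not from the paper:

```python
from collections import Counter
import re

def baseline_label(docs, k=3):
    """Baseline labeling: pick the k most frequent terms
    across a cluster's documents as its label."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [term for term, _ in counts.most_common(k)]

labels = baseline_label(["neural networks learn", "networks of neurons"], k=2)
# -> ['networks', 'neural']
```

As the slide notes, this baseline ignores the hierarchy entirely, so the same generic terms tend to be picked for a cluster and its parent.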

8
Open Directory Project
  • Directory of the WWW, maintained by a community
    of volunteers.
  • Use hierarchical ontology scheme to organize
    site listings.

9
(No Transcript)
10
Descriptive score (DScore)
  • Linear model to predict how good a label is.
  • Features
  • DF_S/|S|, DF_P/|P|, TF_S, TF_P
  • TFIDF_S, TFIDF_P, TFIDF_S / TFIDF_P
  • r(DF_S/|S|), r(DF_P/|P|)
  • r(TFIDF_S), r(TFIDF_P)
  • log r(DF_P/|P|) - log r(DF_S/|S|)
  • log r(TFIDF_P) - log r(TFIDF_S)
  • (S = cluster, P = parent cluster, r(.) = rank)
  • Training
  • Automatically generate training examples to learn
    weights in the model from ODP labels.
  • Get synonyms from WordNet
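A minimal sketch of scoring one candidate term with the linear model, assuming the per-term statistics are precomputed; only a subset of the slide's features is shown, and the weight values are placeholders, not the trained ones:

```python
def dscore(stats, weights):
    """Sketch of the DScore linear model: a weighted sum of per-term
    features computed over the cluster S and its parent P.
    `stats` holds document frequency (df), cluster size, and TFIDF."""
    features = {
        "DF_S/|S|": stats["df_s"] / stats["size_s"],
        "DF_P/|P|": stats["df_p"] / stats["size_p"],
        "TFIDF_S": stats["tfidf_s"],
        "TFIDF_P": stats["tfidf_p"],
    }
    return sum(weights[name] * value for name, value in features.items())

# Placeholder weights: reward prevalence in the cluster,
# penalize prevalence in the parent (descriptive AND discriminative).
example_weights = {"DF_S/|S|": 1.0, "DF_P/|P|": -1.0,
                   "TFIDF_S": 0.5, "TFIDF_P": -0.5}
```

In the paper the weights come from regression on training examples auto-generated from ODP labels (plus WordNet synonyms), not hand-set values like these.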

11
Adaptive cutoff
  • Limited space to display labels.
  • How many labels should be displayed?
  • If confident, display only the top label.
  • Otherwise, display more.
  • Train another linear regression model to
    determine how many labels to display, based on
    the DScores of the top-ranked labels.
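One way the cutoff model above could look, assuming the regression is driven by the gap between the top two DScores (this framing and the coefficients are illustrative assumptions, not the paper's trained model):

```python
def adaptive_cutoff(top_scores, w0=3.0, w1=-4.0):
    """Sketch of the adaptive cutoff: a small linear model predicts how
    many of the top-ranked labels to show. A large gap between the top
    two DScores signals confidence, so fewer labels are displayed."""
    if len(top_scores) < 2:
        return min(1, len(top_scores))
    gap = top_scores[0] - top_scores[1]
    k = round(w0 + w1 * gap)          # linear model on the score gap
    return max(1, min(len(top_scores), k))
```

With these placeholder weights, a clear winner (scores 0.9 vs 0.1) yields one label, while three nearly tied scores yield all three.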

12
Experiment
  • ODP (Open Directory Project)
  • A manually generated document hierarchy with
    manually assigned labels.
  • Sampled a subset of the ODP hierarchy.
  • 20,462 web pages from a total of 165
    categories.
  • Try to predict the label for each cluster based
    on the documents in the cluster and in its
    parent.

13
Evaluation
  • Automatic evaluation
  • If the generated label matches any synonym of
    the correct ODP label, it is considered correct.
  • Use WordNet to automatically obtain synonyms.
  • Example: generated labels "bee", "honey" do not
    match the correct label "apitherapy".
  • MRR (Mean Reciprocal Rank)
  • Borrowed from Q/A evaluation.
  • MRR = mean of the reciprocal rank of the first
    correct answer.
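The MRR metric described above can be sketched as follows (function and argument names are illustrative):

```python
def mean_reciprocal_rank(ranked_labels, gold_synonyms):
    """MRR as borrowed from Q/A evaluation: for each cluster, take the
    reciprocal rank of the first generated label that matches any
    synonym of the correct ODP label (0 if none match), then average
    over all clusters."""
    total = 0.0
    for labels, gold in zip(ranked_labels, gold_synonyms):
        rr = 0.0
        for rank, label in enumerate(labels, start=1):
            if label in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_labels)
```

For instance, one cluster whose labels all miss and one whose top label matches give an MRR of (0 + 1) / 2 = 0.5.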

14
Generated labels
MRR = 0.45
15
Simulated-noise experiment
  • Automatically generated clusters are not perfect.
  • Simulate the errors produced by a clustering
    algorithm.
  • Randomly swap documents into the wrong clusters.
  • For a noise level X
  • For each document d
  • With probability (1 - X), d is assigned to the
    correct cluster.
  • With probability X, d is randomly assigned to an
    incorrect cluster, chosen in proportion to
    cluster size.
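The noise procedure above can be sketched as follows; the function and variable names are assumptions, not the paper's code:

```python
import random

def add_noise(assignments, cluster_sizes, noise=0.3, rng=random):
    """Simulated clustering noise: with probability (1 - noise) a
    document keeps its correct cluster; with probability `noise` it is
    reassigned to a random *incorrect* cluster, sampled in proportion
    to cluster size."""
    noisy = []
    for doc_cluster in assignments:
        if rng.random() < 1.0 - noise:
            noisy.append(doc_cluster)          # keep correct cluster
        else:
            others = [c for c in cluster_sizes if c != doc_cluster]
            weights = [cluster_sizes[c] for c in others]
            noisy.append(rng.choices(others, weights=weights)[0])
    return noisy
```

At noise level 0 the assignment is unchanged; at noise level 1 every document lands in some wrong cluster, which makes the two boundary cases easy to sanity-check.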

16
Simulated-noise experiment
  • DScore performance does not degrade much even
    with a high level of noise (up to about 30%).
  • The cutoff doesn't decrease performance.

17
Conclusion
  • Presented a simple linear model that effectively
    assigns labels to hierarchical clusters.
  • Provided a simple way to auto-generate the
    training data for such a model from ODP.
  • Showed that just by taking into account the
    parent-child relationship, one can greatly
    improve the cluster labels.
  • Evaluated the model in a simulated realistic
    setting, which shows that the model's
    performance is robust to clustering noise.

18
Questions?