Automatically Labeling Hierarchical Clusters - PowerPoint PPT Presentation

1
Automatically Labeling Hierarchical Clusters
  • Puck Treeratpituk, Jamie Callan
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University

2
Motivation
  • Public comments (eRulemaking)
  • For each proposed regulation, more than 100,000
    comments are received.
  • There is a need to automatically organize large
    collections of documents into a hierarchical
    structure.
  • Browsing
  • Data analysis
  • Hierarchical clustering!

However, not the focus of this talk
3
Labeling hierarchical clusters
  • Why care about labeling?
  • A good generated hierarchy without good
    descriptors is not very useful.
  • Problem: given a generated document hierarchy,
    how do we assign good labels to each cluster?

4
Near-duplicate detection (DURIAN)
5
Near-duplicate detection (DURIAN)
6
What is a good label?
  • Characteristics of a good label?
  • Descriptive - describes the content of the
    cluster.
  • Discriminative - discriminates the cluster from
    its siblings and its parent.
  • Example
  • For a document cluster A, "computer science" is
    descriptive but not discriminative when A's
    sibling is "fuzzy logic".

A: "Bibliographies on Neural Networks"
7
Previous works
  • There have been many works on clustering, but
    few actually focus on labeling.
  • Most commonly, the terms with the highest
    occurrence counts are simply selected as the
    cluster's label.
  • Inferring hierarchical descriptors (Glover et
    al., CIKM '02)
  • Uses document-frequency-based ranking with some
    threshold-based term selection.
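The common baseline mentioned above (label a cluster with its most frequent terms) can be sketched as follows; the function name and tokenization are illustrative, not from the paper:

```python
from collections import Counter
import re

def baseline_label(docs, k=3):
    """Baseline labeling: pick the k most frequent terms
    across a cluster's documents as its label."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return [term for term, _ in counts.most_common(k)]

labels = baseline_label(["neural networks learn", "networks of neurons"], k=2)
# -> ['networks', 'neural']
```

As the slide notes, this baseline ignores the hierarchy entirely, so the same generic terms tend to be picked for a cluster and its parent.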

8
Open Directory Project
  • Directory of the WWW, maintained by a community
    of volunteers.
  • Use hierarchical ontology scheme to organize
    site listings.

9
(No Transcript)
10
Descriptive score (DScore)
  • Linear model to predict how good a label is.
  • Features
  • DF_S/|S|, DF_P/|P|, TF_S, TF_P
  • TFIDF_S, TFIDF_P, TFIDF_S / TFIDF_P
  • r(DF_S/|S|), r(DF_P/|P|)
  • r(TFIDF_S), r(TFIDF_P)
  • log r(DF_P/|P|) - log r(DF_S/|S|)
  • log r(TFIDF_P) - log r(TFIDF_S)
  • (S = cluster, P = parent cluster, r(.) = rank)
  • Training
  • Automatically generate training examples to learn
    weights in the model from ODP labels.
  • Get synonyms from WordNet
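A minimal sketch of scoring one candidate term with the linear model, assuming the per-term statistics are precomputed; only a subset of the slide's features is shown, and the weight values are placeholders, not the trained ones:

```python
def dscore(stats, weights):
    """Sketch of the DScore linear model: a weighted sum of per-term
    features computed over the cluster S and its parent P.
    `stats` holds document frequency (df), cluster size, and TFIDF."""
    features = {
        "DF_S/|S|": stats["df_s"] / stats["size_s"],
        "DF_P/|P|": stats["df_p"] / stats["size_p"],
        "TFIDF_S": stats["tfidf_s"],
        "TFIDF_P": stats["tfidf_p"],
    }
    return sum(weights[name] * value for name, value in features.items())

# Placeholder weights: reward prevalence in the cluster,
# penalize prevalence in the parent (descriptive AND discriminative).
example_weights = {"DF_S/|S|": 1.0, "DF_P/|P|": -1.0,
                   "TFIDF_S": 0.5, "TFIDF_P": -0.5}
```

In the paper the weights come from regression on training examples auto-generated from ODP labels (plus WordNet synonyms), not hand-set values like these.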

11
Adaptive cutoff
  • Limited space to display labels.
  • How many labels should be displayed?
  • If confident, display only the top label.
  • Otherwise, display more.
  • Train another linear regression model to
    determine how many labels to display, based on
    the DScores of the top-ranked labels.
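One way the cutoff model above could look, assuming the regression is driven by the gap between the top two DScores (this framing and the coefficients are illustrative assumptions, not the paper's trained model):

```python
def adaptive_cutoff(top_scores, w0=3.0, w1=-4.0):
    """Sketch of the adaptive cutoff: a small linear model predicts how
    many of the top-ranked labels to show. A large gap between the top
    two DScores signals confidence, so fewer labels are displayed."""
    if len(top_scores) < 2:
        return min(1, len(top_scores))
    gap = top_scores[0] - top_scores[1]
    k = round(w0 + w1 * gap)          # linear model on the score gap
    return max(1, min(len(top_scores), k))
```

With these placeholder weights, a clear winner (scores 0.9 vs 0.1) yields one label, while three nearly tied scores yield all three.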

12
Experiment
  • ODP (Open Directory Project)
  • A manually generated document hierarchy with
    manually assigned labels.
  • Sampled a subset of the ODP hierarchy.
  • 20,462 web pages from a total of 165
    categories.
  • Try to predict the label for each cluster based
    on the documents in the cluster and in its
    parent.

13
Evaluation
  • Automatic evaluation
  • If the generated label matches any synonym of
    the correct ODP label, it is considered correct.
  • Use WordNet to automatically obtain synonyms.
  • Example: generated labels "bee", "honey" do not
    match the correct label "apitherapy".
  • MRR (Mean Reciprocal Rank)
  • Borrowed from Q/A evaluation.
  • MRR = mean of the reciprocal rank of the first
    correct answer.
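The MRR metric described above can be sketched as follows (function and argument names are illustrative):

```python
def mean_reciprocal_rank(ranked_labels, gold_synonyms):
    """MRR as borrowed from Q/A evaluation: for each cluster, take the
    reciprocal rank of the first generated label that matches any
    synonym of the correct ODP label (0 if none match), then average
    over all clusters."""
    total = 0.0
    for labels, gold in zip(ranked_labels, gold_synonyms):
        rr = 0.0
        for rank, label in enumerate(labels, start=1):
            if label in gold:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_labels)
```

For instance, one cluster whose labels all miss and one whose top label matches give an MRR of (0 + 1) / 2 = 0.5.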

14
Generated labels
MRR = 0.45
15
Simulated-noise experiment
  • Automatically generated clusters are not perfect.
  • Simulate the errors produced by a clustering
    algorithm.
  • Randomly swap documents into the wrong clusters.
  • For a noise level X
  • For each document d
  • With probability (1 - X), d is assigned to the
    correct cluster.
  • With probability X, d is randomly assigned to an
    incorrect cluster, chosen in proportion to
    cluster size.
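The noise procedure above can be sketched as follows; the function and variable names are assumptions, not the paper's code:

```python
import random

def add_noise(assignments, cluster_sizes, noise=0.3, rng=random):
    """Simulated clustering noise: with probability (1 - noise) a
    document keeps its correct cluster; with probability `noise` it is
    reassigned to a random *incorrect* cluster, sampled in proportion
    to cluster size."""
    noisy = []
    for doc_cluster in assignments:
        if rng.random() < 1.0 - noise:
            noisy.append(doc_cluster)          # keep correct cluster
        else:
            others = [c for c in cluster_sizes if c != doc_cluster]
            weights = [cluster_sizes[c] for c in others]
            noisy.append(rng.choices(others, weights=weights)[0])
    return noisy
```

At noise level 0 the assignment is unchanged; at noise level 1 every document lands in some wrong cluster, which makes the two boundary cases easy to sanity-check.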

16
Simulated-noise experiment
  • DScore performance does not degrade much even
    with a high level of noise (up to about 30%).
  • The cutoff doesn't decrease performance.

17
Conclusion
  • Presented a simple linear model that effectively
    assigns labels to hierarchical clusters.
  • Provided a simple way to auto-generate the
    training data for such a model from ODP.
  • Showed that just by taking into account the
    parent-child relationship, one can greatly
    improve the cluster labels.
  • Evaluated the model in a simulated realistic
    setting, which shows that the model's
    performance is robust to clustering noise.

18
Questions?