Naive clustering of a large XML document collection - PowerPoint PPT Presentation

Naive clustering of a large XML document collection Antoine Doucet University of Helsinki Department of Computer Science 1st INEX Workshop Schloss Dagstuhl, 10.12.2002 – PowerPoint PPT presentation


Transcript and Presenter's Notes


1
Naive clustering of a large XML document collection
  • Antoine Doucet
  • University of Helsinki
  • Department of Computer Science
  • 1st INEX Workshop
  • Schloss Dagstuhl, 10.12.2002

2
Introduction
  • Conjecture
  • As structure is supplementary information
    to raw text, there must exist a way to exploit
    it that improves the clustering quality
  • Document Clustering
  • in IR
  • XML Document Clustering

3
Introduction
  • Clustering experiments
  • Various feature sets
  • k-means clustering
  • Clustering quality evaluation
  • Results and future work

4
Document Clustering

5
Document Clustering in IR
  • Prior to the query
  • Goal: form a taxonomy (Yahoo!)
  • Post-Retrieval Clustering
  • Cluster Hypothesis

6
XML Document Clustering
  • Most work is on data-centric XML, that is:
  • regular structure
  • little or poor text content
  • Tree Edit Distance
  • as a preprocessing step for automated DTD
    construction methods

7
Clustering Experiments

8
Vector Space Model
  • Various feature sets
  • tags only (183)
  • text only (188,417)
  • text and tags (188,600)
  • normalized tf-idf
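
The three feature sets can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the author's implementation: the function names, the regex tokenisation, the `idf = log(N / df)` weighting, and the toy documents.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

def features(xml_text, use_text=True, use_tags=True):
    """Return a bag of features (tag names and/or lowercased words) for one document."""
    root = ET.fromstring(xml_text)
    bag = Counter()
    for elem in root.iter():
        if use_tags:
            bag["<" + elem.tag + ">"] += 1  # angle brackets keep tags distinct from words
        if use_text:
            for chunk in (elem.text, elem.tail):
                if chunk:
                    bag.update(re.findall(r"[a-z]+", chunk.lower()))
    return bag

def tfidf_vectors(bags):
    """Turn term-count bags into unit-length tf-idf vectors (sparse dicts)."""
    n = len(bags)
    df = Counter()
    for bag in bags:
        df.update(bag.keys())  # document frequency: one count per document
    vectors = []
    for bag in bags:
        vec = {t: tf * math.log(n / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

# Toy documents (hypothetical), not taken from the INEX collection.
docs = [
    "<article><title>Parallel sorting</title><sec1>Sorting on clusters</sec1></article>",
    "<article><title>Cache design</title><sec1>Cache misses</sec1></article>",
    "<volume><entity>January issue</entity></volume>",
]
vectors = tfidf_vectors([features(d) for d in docs])            # text and tags
tag_vectors = tfidf_vectors([features(d, use_text=False) for d in docs])  # tags only
```

Switching `use_text` and `use_tags` selects the tags-only, text-only, or combined representation; the vector length normalisation makes the later cosine comparison a plain dot product.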

9
Clustering
  • Similarity measure: cosine
  • k-means algorithm

10
k-means clustering algorithm
  • Initialisation
  • k points chosen as centroids
  • Assign each point to the closest centroid
  • Iterate
  • Re-compute the centroid of each cluster
  • Assign each point to the closest centroid
  • Stop Condition
  • As soon as the centroids are stable
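
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: centroids are initialised by random sampling with a fixed seed, vectors are dense lists normalised to unit length, "closest" means highest cosine similarity, and stability is detected when no assignment changes.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def cosine(u, v):
    # For unit-length vectors the cosine is just the dot product.
    return sum(a * b for a, b in zip(u, v))

def kmeans(points, k, seed=0, max_iter=100):
    points = [normalize(p) for p in points]
    rng = random.Random(seed)
    # Initialisation: k points chosen as centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest (most similar) centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(p, centroids[c])) for p in points
        ]
        # Stop condition: unchanged assignments mean the centroids are stable.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Re-compute the centroid of each cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster became empty
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment, centroids

# Toy data: two points near the x axis, two near the y axis.
points = [[1, 0.1], [0.9, 0.2], [0.1, 1], [0.2, 0.8]]
labels, _ = kmeans(points, k=2)
```

Each iteration touches every point once per centroid, which is why the running time grows linearly with the number of documents.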

11
Clustering
  • Similarity measure: cosine
  • k-means algorithm
  • partitional
  • linear

12
Clustering Evaluation
  • Internal Quality Measures
  • based on average inter- and intra-cluster
    similarities
  • for example
  • cohesiveness (a.k.a. overall similarity)
  • Inappropriate for our experiments
  • Because we use different feature sets, and
    similarity measures are intrinsically tied to the
    document representation, their values are not
    comparable across feature sets (that is,
    across our experiments).
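
A small sketch may make the comparability problem concrete. It assumes cohesiveness is the average pairwise cosine similarity inside a cluster (one common definition; the slides do not spell out the formula), and the vectors are invented toy values.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def cohesiveness(cluster):
    """Average cosine similarity over all distinct pairs in a cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # convention assumed here: a singleton is trivially cohesive
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# The same three (hypothetical) documents under two representations.
as_tags = [[2, 1], [4, 2], [2, 1]]           # few features, near-identical profiles
as_text = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]  # many features, sparse overlap

tags_score = cohesiveness(as_tags)
text_score = cohesiveness(as_text)
```

The same grouping scores very differently depending on the representation, so a higher cohesiveness with one feature set says nothing about whether that clustering is better.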

13
Clustering Evaluation
  • External Quality Measures
  • Based on an existing manual classification
  • Entropy is based on the distribution of classes
    within a cluster.
  • Purity measures how much a cluster is
    specialized in a class.
  • F-measure originates from IR.
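
Entropy and purity can be sketched as below. The exact formulas used in the talk are not given, so this follows the usual definitions and additionally assumes that entropy is divided by the log of the number of classes so that it falls in [0, 1]; both measures are averaged over clusters, weighted by cluster size.

```python
import math
from collections import Counter

def entropy_and_purity(clusters, classes):
    """clusters[i] and classes[i] are the cluster id and true class of document i."""
    n = len(clusters)
    n_classes = len(set(classes))
    by_cluster = {}
    for cl, cls in zip(clusters, classes):
        by_cluster.setdefault(cl, Counter())[cls] += 1
    entropy = purity = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        probs = [c / size for c in counts.values()]
        h = -sum(p * math.log(p) for p in probs)
        if n_classes > 1:
            h /= math.log(n_classes)  # assumed normalisation to [0, 1]
        entropy += size / n * h       # low entropy: one class dominates
        purity += size / n * max(probs)  # high purity: cluster specialised in a class
    return entropy, purity

perfect = entropy_and_purity([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = entropy_and_purity([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A clustering that reproduces the classes exactly gets entropy 0 and purity 1, while a single cluster mixing two equally sized classes gets entropy 1 and purity 0.5.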

14
Clustering Evaluation
  • External Quality Measures
  • Which classification can we use?
  • Usually (TREC) manual assessments are used as
    classes
  • INEX assessments were not yet available
  • The only available classes are the journals in
    which the articles were published

15
Clustering Evaluation
  • Problems with the journal title classification
  • The classes are disjoint.
  • Obviously does not suit text documents
  • The classes are probably too strict.
  • e.g., there should not be a strict border between
    Transactions on Computers and Transactions on
    Parallel and Distributed Systems
  • ...Finally we used the 18 journal classes plus an
    extra volume class

16
Results

17
Results
  • 15-way clustering

Features   Text     Tags     Text + Tags
Entropy    0.633    0.798    0.678
Purity     0.379    0.228    0.372
Time       754 s    11 s     837 s
  • Text is best!
  • Tags only is very fast

18
Results
  • The volume class exception

Features   Text     Tags     Text + Tags
Entropy    0.722    0.016    1
Purity     0.295    0.992    0
Precision  28       99       100
Time       754 s    11 s     837 s
  • The most discriminant text features are
    misleading: january, society, publish
  • The most discriminant tag features are not:
    <entity>, <title>, <sec1>

19
Results
  • Hints at a better method: use the tag
    features as a preprocessing step
  • k-means clustering based on tags only
  • The n clusters with a cohesiveness above a given
    threshold are kept
  • (k-n)-means clustering of the remaining documents
    based on text features only
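
The two-stage procedure can be sketched as follows. This is an illustration under several assumptions that are not in the slides: centroids are initialised with the first k vectors so the run is reproducible, cohesiveness is the average pairwise cosine, singleton clusters are never kept, and the vectors, `k` and `threshold` are toy values.

```python
import math
from itertools import combinations

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def kmeans(vectors, k, max_iter=100):
    """Cosine k-means on unit vectors; returns one cluster id per vector."""
    centroids = list(vectors[:k])  # assumed deterministic initialisation
    labels = None
    for _ in range(max_iter):
        new = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in vectors]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = unit([sum(col) / len(members) for col in zip(*members)])
    return labels

def cohesiveness(members):
    pairs = list(combinations(members, 2))
    return sum(dot(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0

def two_stage(tag_vecs, text_vecs, k, threshold):
    tag_vecs = [unit(v) for v in tag_vecs]
    text_vecs = [unit(v) for v in text_vecs]
    # Stage 1: k-means clustering based on tags only.
    tag_labels = kmeans(tag_vecs, k)
    final = [None] * len(tag_vecs)
    kept = 0
    for c in range(k):
        idx = [i for i, l in enumerate(tag_labels) if l == c]
        # Keep the n clusters whose cohesiveness is above the threshold.
        if len(idx) > 1 and cohesiveness([tag_vecs[i] for i in idx]) > threshold:
            for i in idx:
                final[i] = kept
            kept += 1
    # Stage 2: (k-n)-means on the text features of the remaining documents.
    rest = [i for i, l in enumerate(final) if l is None]
    if rest:
        rest_labels = kmeans([text_vecs[i] for i in rest], min(k - kept, len(rest)))
        for i, l in zip(rest, rest_labels):
            final[i] = kept + l
    return final

# Toy data: documents 0 and 3 share an identical tag profile (a "volume" class);
# the others have looser tag profiles but fall into two topics by text.
tag_vecs = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 2]]
text_vecs = [[0, 0, 1], [1, 0.1, 0], [0.1, 1, 0], [0, 0, 1], [0.2, 1, 0], [1, 0.2, 0]]
result = two_stage(tag_vecs, text_vecs, k=3, threshold=0.99)
```

In this toy run the tag stage isolates the two structurally identical documents, and the text stage regroups the rest by topic, overriding the looser tag-based grouping.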

20
Results
  • 15-way clustering with and without pre-clustering

           Standard text    Tags pre-clustering
Entropy    0.633            0.630
Purity     0.379            0.394
Time       754 s            11742 s
  • Does slightly better indeed

21
Conclusion - Discussion

22
Conclusion
  • Recall: the evaluation is not 100% reliable
  • Combining structural similarity and content
    similarity improves the clustering quality
  • But it is not straightforward!
  • We shall look for more sophisticated similarity
    measures

23
Next
  • We need
  • More collections to validate the techniques
  • A better classification (manual assessments?)

24
Next
  • More sophisticated similarity measures
  • How to weight XML elements?
  • tf-idf?
  • Element size? Average element size? Include the
    size of sub-elements? How to normalize the
    document vectors then?
  • Store trees in the bag of words?

25
Next
  • Structure feature selection
  • full path expression?
  • local path expression?
  • classes of tag labels?
  • (tfmath, sgmlmath, math) -> metamath
  • Automatically?
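
The candidate structure features listed above can be compared with a short sketch. The interpretations are assumptions: a "full path" is taken to be the path from the root, a "local path" the parent/child pair, and the tag class mapping (`tfmath`, `sgmlmath`, `math` grouped as `metamath`) is a hand-written table based on the slide's example; the function name and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed grouping of tag labels into a class, following the slide's example.
TAG_CLASSES = {"tfmath": "metamath", "sgmlmath": "metamath", "math": "metamath"}

def structure_features(xml_text, mode="tag"):
    """Count structural features of one document.

    mode: "tag"   -> tag label, e.g. sec1
          "full"  -> path from the root, e.g. /article/sec1/p
          "local" -> parent/child pair, e.g. sec1/p
          "class" -> tag label mapped through TAG_CLASSES
    """
    root = ET.fromstring(xml_text)
    bag = Counter()

    def walk(elem, path):
        path = path + [elem.tag]
        if mode == "tag":
            bag[elem.tag] += 1
        elif mode == "full":
            bag["/" + "/".join(path)] += 1
        elif mode == "local":
            bag["/".join(path[-2:])] += 1  # the root has no parent, so just its tag
        elif mode == "class":
            bag[TAG_CLASSES.get(elem.tag, elem.tag)] += 1
        else:
            raise ValueError(f"unknown mode: {mode}")
        for child in elem:
            walk(child, path)

    walk(root, [])
    return bag

doc = (
    "<article><sec1><p>x<math>a</math></p></sec1>"
    "<sec1><p>y<tfmath>b</tfmath></p></sec1></article>"
)
full = structure_features(doc, "full")
local = structure_features(doc, "local")
classes = structure_features(doc, "class")
```

Full paths give the most specific features and the largest feature space; tag classes go the other way and merge labels that play the same role.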