A Semi-supervised Document Clustering Algorithm based on EM

1
A Semi-supervised Document Clustering Algorithm
based on EM
  • Leonardo Rigutini and Marco Maggini
  • Department of Information Engineering
  • University of Siena, Siena, Italy
  • {rigutini,maggini}@dii.unisi.it

2
Outline
  • Document clustering and Semi-supervised
    clustering
  • EM algorithm and limitations
  • Using feature selection filtering to improve the
    EM algorithm
  • The proposed algorithm
  • Experimental results
  • Conclusions

3
Document Clustering
  • Document clustering is a very hard task in
    Automatic Text Processing
  • It requires extracting regular patterns from a
    document collection without a priori knowledge of
    the category structure
  • It is a difficult task even for humans: many
    different but valid partitions may exist for the
    same collection
  • There is no information about the categories
  • It is difficult to use effective feature selection
    techniques to reduce the noise in the
    representation of texts

4
Semi-supervised clustering
  • In between automatic categorization and
    auto-organization of data
  • The supervisor is not required to specify a set of
    classes, only to split a set of examples into
    groups
  • The initial examples are very few documents (from
    1 to at most 10) for each group
  • The initial examples could also be sets of
    keywords describing the desired groups

5
Feature Selection
  • Document clustering:
  • Global information cannot be used to filter
    words (no information on classes is available),
    so IG, TS and DotRatio are not usable
  • Feature selection is a very important issue in
    text representation:
  • The representation space has a very high
    dimensionality
  • Distances between documents are very similar
  • Semi-supervised clustering:
  • An initial filtering can be performed using the
    small amount of initial information

6
EM Algorithm
  • A general algorithm to fit the parameters of
    a model to the data distribution
  • E step: the unlabeled data are labeled by the
    classifier, assuming the current configuration is
    correct
  • M step: the parameters of the classifier are
    re-estimated using the data labeled in the
    previous E step, assuming the labels are
    correct
  • The procedure is iterated until convergence is
    reached
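
The E and M steps above can be sketched as a hard-assignment loop. This is only an illustration: the slides do not say which classifier is used, so a nearest-centroid classifier with cosine similarity is assumed here, and the names `hard_em`, `centroid` and `cosine` are hypothetical. Keeping the seed documents in every M step is also an assumption.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    """Sum the term counts of a group of documents."""
    total = Counter()
    for d in docs:
        total.update(d)
    return total

def hard_em(seeds, unlabeled, max_iter=20):
    """seeds: {label: [doc, ...]}, unlabeled: [doc, ...]; a doc is a Counter."""
    centroids = {c: centroid(docs) for c, docs in seeds.items()}
    labels = None
    for _ in range(max_iter):
        # E step: label every document with the current model, taken as correct.
        new_labels = [max(centroids, key=lambda c: cosine(d, centroids[c]))
                      for d in unlabeled]
        if new_labels == labels:  # convergence: the labeling no longer changes
            break
        labels = new_labels
        # M step: re-estimate the parameters from the labels, taken as correct.
        for c in centroids:
            members = [d for d, l in zip(unlabeled, labels) if l == c]
            centroids[c] = centroid(seeds[c] + members)
    return labels
```

Because the M step trusts every label produced by the E step, a single wrong assignment already shifts the centroid it was assigned to.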

7
EM algorithm limitations
  • The initialization of the classifier is an
    important issue for the correct final cluster
    composition
  • If the initial centroids are not distributed as
    the final user would like, the algorithm can form
    clusters whose semantics do not match the user's
    criteria
  • The iterative form of the EM algorithm produces a
    reinforcement effect on the badly labeled data
  • If at time t, in the expectation step (E), some
    documents are badly classified, these data
    influence the re-estimation step (M), and at time
    t+1 other documents will be badly classified
  • This effect grows with successive
    iterations of the E and M steps

8
Distribution of distances
  • The distance between two similar documents is
    very close to the one between two dissimilar
    documents
  • It is very likely that the E step mislabels
    some boundary documents
  • EM very often reaches a trivial solution:
  • A large central cluster including the major part
    of the documents
  • Various small peripheral clusters including
    outliers
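
The claim that similar and dissimilar documents end up at nearly the same distance can be illustrated with a small experiment. The sketch below does not use real documents: it assumes uniformly random points as a stand-in for a high-dimensional representation, and `relative_contrast` is a hypothetical helper name.

```python
import math
import random

def relative_contrast(dim, n_points=50, seed=0):
    """(max - min) / min over all pairwise Euclidean distances of random points."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(p, q) for i, p in enumerate(pts) for q in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    # The contrast shrinks as the dimensionality grows.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

When the contrast is small, the nearest and the farthest centroid are almost equally far from a boundary document, so a small perturbation is enough to change its label.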

9
Feature Selection
  • At each iteration of EM, the badly labeled data
    influence the re-estimation of the parameters,
    moving the centroids in the wrong direction
  • We can reduce the influence of badly labeled
    documents in the M step by applying a feature
    selection filter inside the EM algorithm
  • We use the labeled dataset produced by the E step
    to filter out the non-significant words for
    each class
  • In this way, the noisy words introduced by the
    documents badly classified in the E step will
    not contribute to the M step
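
A filter of this kind can be sketched with the standard Information Gain of a term, IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t), computed on the labels produced by the E step. The slides do not give the exact formulation used, so this is an assumed textbook version; documents are modeled as sets of terms, and `information_gain` and `top_k_terms` are hypothetical names.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, labels, term):
    """IG of the presence/absence of `term` with respect to the class labels."""
    classes = sorted(set(labels))
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_t.values())
    h_prior = entropy([with_t[c] + without_t[c] for c in classes])
    h_with = entropy([with_t[c] for c in classes])
    h_without = entropy([without_t[c] for c in classes])
    return h_prior - (n_with / n) * h_with - ((n - n_with) / n) * h_without

def top_k_terms(docs, labels, k):
    """Keep the k terms that best separate the current labels."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: information_gain(docs, labels, t),
                  reverse=True)[:k]
```

A word that appears in every document, or that is spread evenly over the classes, gets a gain close to zero and is dropped before the parameters are re-estimated.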

10
The proposed algorithm

11
The algorithm
  • The small initial labeled dataset is used to
    initialize the parameters of the classifier in
    the EM algorithm
  • An Information Gain filter, IG1, is used to
    extract the most significant words from the
    training dataset
  • Once the unlabeled data have been labeled, a
    second Information Gain filter, IG2, prevents
    wrongly labeled documents from influencing the
    re-estimation step
  • The algorithm ends when the confusion matrix does
    not change between two successive iterations
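
Putting the four points together gives a loop like the one below. It is a sketch under several assumptions that the slides do not confirm: a centroid classifier stands in for the unspecified classifier, IG is the standard presence/absence formulation, the seed documents are used only for initialization, and "the confusion matrix does not change" is approximated by "the assignments do not change". All function names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def ig_filter(docs, labels, k):
    """Return the k terms with the highest Information Gain w.r.t. labels."""
    classes = sorted(set(labels))
    n = len(docs)
    prior = entropy([labels.count(c) for c in classes])

    def gain(term):
        present = Counter(l for d, l in zip(docs, labels) if term in d)
        absent = Counter(l for d, l in zip(docs, labels) if term not in d)
        n_p = sum(present.values())
        return (prior - n_p / n * entropy([present[c] for c in classes])
                - (n - n_p) / n * entropy([absent[c] for c in classes]))

    vocab = sorted({t for d in docs for t in d})
    return set(sorted(vocab, key=gain, reverse=True)[:k])

def fit_centroids(docs, labels, keep):
    """Per-class term counts, restricted to the selected terms."""
    cents = {}
    for d, l in zip(docs, labels):
        kept = {t: c for t, c in d.items() if t in keep}
        cents.setdefault(l, Counter()).update(kept)
    return cents

def classify(doc, cents):
    """Pick the class whose length-normalized centroid best matches the doc."""
    def score(c):
        cen = cents[c]
        norm = math.sqrt(sum(v * v for v in cen.values())) or 1.0
        return sum(v * cen.get(t, 0) for t, v in doc.items()) / norm
    return max(sorted(cents), key=score)

def semi_supervised_em(seed_docs, seed_labels, unlabeled, k1, k2, max_iter=20):
    # Initialization: IG1 selects k1 terms from the small labeled seed set.
    keep = ig_filter(seed_docs, seed_labels, k1)
    cents = fit_centroids(seed_docs, seed_labels, keep)
    labels = None
    for _ in range(max_iter):
        new_labels = [classify(d, cents) for d in unlabeled]  # E step
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        keep = ig_filter(unlabeled, labels, k2)  # IG2 on the E-step labels
        cents = fit_centroids(unlabeled, labels, keep)  # M step
    return labels
```

Here `k1` and `k2` are the numbers of terms kept by the IG1 and IG2 filters respectively.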

12
Experimental results
  • Dataset
  • We downloaded about 24,000 messages from English
    newsgroups
  • Three different groups:
  • Auto
  • Hardware
  • Sport
  • We divided the dataset into 2 subsets:
  • Init: the repository from which the starting
    documents are picked
  • Unlabeled data: the documents to cluster

13
Experimental results
  • We decided to test the algorithm with 4 different
    initial configurations: 1, 3, 5 and 7 starting
    documents randomly sampled from the initial dataset
  • All results are averaged over a ten-fold
    cross-validation
  • Baseline:
  • K-means on the unlabeled data, initialized with
    the initial dataset
  • Proposed algorithm:
  • To speed up the clustering task, we ran the
    algorithm on a subset of the unlabeled data and
    then used the trained classifier to categorize
    the remaining unlabeled data
  • Two sizes for the small unlabeled dataset: 100 and
    300 documents

14
Baseline experiment
  • K-means on the unlabeled dataset, initialized with
    1, 3, 5 and 7 documents
  • The poor performance is due to the fact that no
    regularization can be applied in the k-means
    algorithm: assigning a document to the wrong
    cluster moves the centroids of the two clusters
    involved, which reinforces the wrong assignment

15
Proposed algorithm test 1
  • Proposed algorithm
  • 1, 3, 5 and 7 documents to initialize the
    classifier
  • k1 = 100 and k2 = 1000 for the IG filters
  • 100 documents in the unlabeled dataset

16
Proposed algorithm test 2
  • Proposed algorithm
  • 1, 3, 5 and 7 documents to initialize the
    classifier
  • k1 = 100 and k2 = 1000 for the IG filters
  • 300 documents in the unlabeled dataset

17
Conclusions
  • We presented a semi-supervised version of the EM
    algorithm for document clustering
  • It uses a small initial amount of knowledge to
    guide the EM algorithm in forming the clusters
  • Given a small initial amount of information about
    the clusters (for example some keywords
    describing each cluster), the system partitions a
    large collection of documents and shows quite
    good results
  • The main novelty is the use of a
    regularization step that exploits a feature
    selection technique inside the EM algorithm
  • With a different initialization technique that
    does not require the supervision of a human
    expert, the algorithm could be completely
    unsupervised