Title: Hierarchical Document Clustering Using Frequent Itemsets
1. Hierarchical Document Clustering Using Frequent Itemsets
- Benjamin Fung, Ke Wang, Martin Ester
- {bfung, wangk, ester}@cs.sfu.ca
- Simon Fraser University
- May 1, 2003 (SDM '03)
2. Outline
- What is hierarchical document clustering?
- Previous work
- Our method: Frequent Itemset-based Hierarchical Clustering (FIHC)
- Experimental results
- Conclusions
3. Hierarchical Document Clustering
- Document clustering: the automatic organization of documents into clusters, so that documents within a cluster are highly similar to one another, but very dissimilar to documents in other clusters.
- Hierarchical document clustering additionally organizes the clusters into a hierarchy (tree).
4. Challenges in Hierarchical Document Clustering
- High dimensionality.
- High volume of data.
- Consistently high clustering quality.
- Meaningful cluster description.
5. Previous Work
- Hierarchical methods
  - Agglomerative and divisive.
  - Reasonably accurate, but not scalable.
- Partitioning methods
  - Efficient, scalable, and easy to implement.
  - Clustering quality degrades if an inappropriate number of clusters is provided.
- Frequent itemset-based methods
  - HFTC depends on a greedy heuristic.
6. Preprocessing
- Remove stop words and apply stemming.
- Construct the vector model:
  - doci = ( item frequency1, if2, if3, ..., ifm )
- e.g., with vocabulary ( apple, boy, cat, window ):
  - doc1 = ( 5, 2, 7, 0 )
  - doc2 = ( 4, 0, 0, 3 )
  - doc3 = ( 0, 3, 1, 5 )
- Equivalently, as document vectors:
  - doc1: apple 5, boy 2, cat 7
  - doc2: apple 4, window 3
  - doc3: boy 3, cat 1, window 5
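The vector construction above can be sketched in Python. The tiny stop list is illustrative only, and stemming is omitted for brevity:

```python
from collections import Counter

# Tiny illustrative stop list; a real system would use a full list plus stemming.
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}

def to_vector(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    return [counts[term] for term in vocabulary]

vocab = ["apple", "boy", "cat", "window"]
doc2 = "apple and apple apple apple window window window"
print(to_vector(doc2, vocab))  # [4, 0, 0, 3], matching doc2 above
```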
7. Algorithm Overview of Our Method (FIHC)
- Input: high-dimensional document vectors; after mining global frequent itemsets, documents are represented by reduced-dimension feature vectors.
- Steps: construct clusters, build a tree, then prune the tree.
8. Definition: Global Frequent Itemset
- A global frequent itemset is a set of items (words) that appear together in more than a user-specified fraction of the document set.
- The global support of an itemset is the percentage of documents containing the itemset.
- e.g., if 7% of the documents contain both words apple and window, then {apple, window} has global support 7%.
- A global frequent item is an item that belongs to some global frequent itemset, e.g., apple.
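A minimal sketch of computing global support, using the five-document example of the next slide (each document reduced to the set of items with nonzero frequency):

```python
# Items of each document (nonzero entries of the example vectors).
DOCS = {
    "doc1": {"apple", "boy", "cat", "window"},
    "doc2": {"apple", "window"},
    "doc3": {"boy", "cat", "window"},
    "doc4": {"apple", "cat"},
    "doc5": {"apple", "window"},
}

def global_support(itemset):
    """Percentage of documents that contain every item of the itemset."""
    hits = sum(1 for items in DOCS.values() if itemset <= items)
    return 100.0 * hits / len(DOCS)

print(global_support({"apple", "window"}))  # 60.0
print(global_support({"boy"}))              # 40.0: boy is not global frequent at 60%
```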
9. Reduced-Dimension Vector Model
- High-dimensional vector model:
  - ( apple, boy, cat, window )
  - doc1 = ( 5, 2, 1, 1 )
  - doc2 = ( 4, 0, 0, 3 )
  - doc3 = ( 0, 3, 1, 5 )
  - doc4 = ( 8, 0, 2, 0 )
  - doc5 = ( 5, 0, 0, 3 )
- Suppose we set the minimum support to 60%. The global frequent itemsets are {apple}, {cat}, {window}, and {apple, window}.
- Store the frequencies only for global frequent items:
  - ( apple, cat, window )
  - doc1 = ( 5, 1, 1 )
  - doc2 = ( 4, 0, 3 )
- Each document vector becomes a feature vector.
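Projecting the high-dimensional vectors onto the global frequent items can be sketched as:

```python
DOC_VECTORS = {
    "doc1": {"apple": 5, "boy": 2, "cat": 1, "window": 1},
    "doc2": {"apple": 4, "window": 3},
    "doc3": {"boy": 3, "cat": 1, "window": 5},
    "doc4": {"apple": 8, "cat": 2},
    "doc5": {"apple": 5, "window": 3},
}
# boy fails the 60% minimum support, so it is dropped.
GLOBAL_FREQUENT_ITEMS = ["apple", "cat", "window"]

def feature_vector(doc):
    """Keep frequencies only for global frequent items."""
    return [doc.get(item, 0) for item in GLOBAL_FREQUENT_ITEMS]

print(feature_vector(DOC_VECTORS["doc1"]))  # [5, 1, 1]
print(feature_vector(DOC_VECTORS["doc2"]))  # [4, 0, 3]
```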
10. Intuition
- Frequent itemsets are combinations of words that suggest topics.
- e.g., {apple} → topic: Fruit
- {window} → topic: Renovation
- {apple, window} → topic: Computer
11. Construct Initial Clusters
- Construct one cluster for each global frequent itemset.
- Global frequent itemsets: {apple}, {cat}, {window}, {apple, window}.
- All documents containing an itemset are included in the same cluster, so initial clusters may overlap:
  - C{apple}, C{window}, C{apple, window}, C{cat}
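Constructing the (overlapping) initial clusters is a containment test per itemset; a sketch with the running example data:

```python
DOCS = {
    "doc1": {"apple", "boy", "cat", "window"},
    "doc2": {"apple", "window"},
    "doc3": {"boy", "cat", "window"},
    "doc4": {"apple", "cat"},
    "doc5": {"apple", "window"},
}
GLOBAL_FREQUENT_ITEMSETS = [
    frozenset({"apple"}), frozenset({"cat"}),
    frozenset({"window"}), frozenset({"apple", "window"}),
]

def initial_clusters():
    """One cluster per global frequent itemset; clusters may overlap."""
    return {label: {d for d, items in DOCS.items() if label <= items}
            for label in GLOBAL_FREQUENT_ITEMSETS}

clusters = initial_clusters()
print(sorted(clusters[frozenset({"apple", "window"})]))  # ['doc1', 'doc2', 'doc5']
```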
12. Making Clusters Disjoint
- Assign each document to the best initial cluster.
- Intuitively, a cluster Ci is good for a document docj if many of the global frequent items in docj appear in many documents of Ci.
13. Cluster Frequent Items
- A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of the documents in Ci.
- Suppose we set the minimum cluster support to 60%. For C{apple}:
  - ( apple, cat, window )
  - doc1 = ( 5, 1, 1 )
  - doc2 = ( 4, 0, 3 )
  - doc4 = ( 8, 2, 0 )
  - doc5 = ( 5, 0, 3 )
- Cluster supports in C{apple}: apple 100%, cat 50%, window 75%.
- apple and window are cluster frequent items; cat is not.
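The cluster supports for C{apple} above can be computed with a short sketch:

```python
DOC_VECTORS = {
    "doc1": {"apple": 5, "cat": 1, "window": 1},
    "doc2": {"apple": 4, "window": 3},
    "doc4": {"apple": 8, "cat": 2},
    "doc5": {"apple": 5, "window": 3},
}
C_APPLE = {"doc1", "doc2", "doc4", "doc5"}  # documents of cluster C{apple}

def cluster_support(item, cluster_docs):
    """Percentage of the cluster's documents that contain the item."""
    hits = sum(1 for d in cluster_docs if DOC_VECTORS[d].get(item, 0) > 0)
    return 100.0 * hits / len(cluster_docs)

for item in ("apple", "cat", "window"):
    print(item, cluster_support(item, C_APPLE))  # apple 100.0, cat 50.0, window 75.0
```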
14. Score Function (Example)
- Cluster frequent items and their cluster supports:
  - C{apple}: apple 100%, window 75%
  - C{window}: cat 60%, window 100%
  - C{apple, window}: apple 100%, cat 60%, window 100%
  - C{cat}: cat 100%
- doc1: apple 5, cat 1, window 3
15. Score Function
- Assign each docj to the initial cluster Ci that has the highest score:
  - Score(Ci ← docj) = Σx n(x) · cluster_support(x) − Σx' n(x') · global_support(x')
- x represents a global frequent item in docj that is also cluster frequent in Ci.
- x' represents a global frequent item in docj that is not cluster frequent in Ci.
- n(x) and n(x') are the frequencies of x and x' in the feature vector of docj.
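The score function can be sketched in Python. The cluster supports mirror the example slides, and the global supports come from the five-document example (apple 80%, cat 60%, window 80%); treat the numbers as illustrative:

```python
GLOBAL_SUPPORT = {"apple": 0.8, "cat": 0.6, "window": 0.8}
# Per cluster, only the cluster frequent items with their cluster supports.
CLUSTER_SUPPORT = {
    "C{apple}": {"apple": 1.0, "window": 0.75},
    "C{window}": {"cat": 0.6, "window": 1.0},
    "C{apple, window}": {"apple": 1.0, "cat": 0.6, "window": 1.0},
    "C{cat}": {"cat": 1.0},
}

def score(cluster, doc):
    """Reward cluster frequent items; penalise the rest by global support."""
    cf = CLUSTER_SUPPORT[cluster]
    s = 0.0
    for item, n in doc.items():            # n: frequency in the feature vector
        if item in cf:
            s += n * cf[item]              # x: cluster frequent in Ci
        else:
            s -= n * GLOBAL_SUPPORT[item]  # x': not cluster frequent in Ci
    return s

doc1 = {"apple": 5, "cat": 1, "window": 3}
best = max(CLUSTER_SUPPORT, key=lambda c: score(c, doc1))
print(best, round(score(best, doc1), 2))  # C{apple, window} 8.6
```

So doc1 lands in the cluster for {apple, window}, the itemset that best explains its frequent items.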
16. Score Function (Example)
- C{apple}: apple 100%, window 75%
- C{window}: cat 60%, window 100%
- C{apple, window}: apple 100%, cat 60%, window 100%
- C{cat}: cat 100%
- doc1: apple 5, cat 1, window 3
17. Tree Construction
- Put the more specific clusters at the bottom of the tree and the more general clusters at the top.
- Build the tree bottom-up by choosing a parent for each cluster, starting from the clusters with the largest number of items in their cluster labels.
- Example tree of cluster labels:
  - null
    - CS
      - CS, AI
      - CS, DM
    - Sports
      - Sports, Ball
      - Sports, Tennis
        - Sports, Tennis, Ball
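The bottom-up linking step can be sketched as follows. FIHC scores every potential parent and picks the best; for brevity this sketch simply takes the first label that is one item smaller, so only the tree-building structure is shown:

```python
labels = [frozenset(), frozenset({"CS"}), frozenset({"Sports"}),
          frozenset({"CS", "AI"}), frozenset({"CS", "DM"}),
          frozenset({"Sports", "Ball"}), frozenset({"Sports", "Tennis"}),
          frozenset({"Sports", "Tennis", "Ball"})]

def build_tree(labels):
    """Attach each cluster to a parent whose label is one item smaller.
    FIHC scores all such candidates and picks the best; here we take
    the first candidate just to illustrate the structure."""
    parent = {}
    for label in sorted(labels, key=len, reverse=True):  # most specific first
        if not label:
            continue  # the null root has no parent
        candidates = [p for p in labels if len(p) == len(label) - 1 and p <= label]
        parent[label] = candidates[0] if candidates else frozenset()
    return parent

tree = build_tree(labels)
print(sorted(tree[frozenset({"CS", "AI"})]))  # ['CS']
```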
18. Choose a Parent Cluster (Example)
- Candidate parents of C{Sports, Tennis, Ball}: C{Sports, Ball} and C{Sports, Tennis}.
- Merge the documents of the child cluster into a single combined document and score it against each candidate parent:
  - ( CS, DM, AI, Sports, Tennis, Ball )
  - doc1 = ( 0, 0, 0, 5, 10, 2 )
  - doc2 = ( 1, 0, 0, 5, 5, 3 )
  - doc3 = ( 0, 1, 0, 15, 10, 1 )
  - sum = ( 1, 1, 0, 25, 25, 6 )
19. Prune Cluster Tree
- Why prune the tree?
  - To remove overly specific child clusters.
  - Otherwise, documents of the same class (topic) are likely to be distributed over different subtrees, which would lead to poor clustering quality.
20. Inter-Cluster Similarity
- Reuse the score function to calculate the similarity Sim(Ci ← Cj) between clusters.
21. Child Pruning
- Efficiently shorten the tree by replacing child clusters with their parent.
- A child is pruned only if it is similar to its parent: prune if Inter_Sim > 1.
- Example tree (C{Sports, Tennis, Racket} is the pruning candidate):
  - null
    - CS
      - CS, DM
      - CS, AI
    - Sports
      - Sports, Ball
      - Sports, Tennis
        - Sports, Tennis, Racket
22. Sibling Merging
- Narrow the tree by merging similar subtrees at level 1.
  - null
    - CS
      - CS, DM
      - CS, AI
    - Sports
      - Sports, Ball
      - Sports, Tennis
    - IT
      - IT, Server
      - IT, Engineer
- Inter_Sim(CS ↔ IT) = 1.5
- Inter_Sim(IT ↔ Sports) = 0.75
- Inter_Sim(CS ↔ Sports) = 0.5
- The most similar pair, CS and IT, is merged.
23. Sibling Merging (Result)
- null
  - CS
    - CS, DM
    - CS, AI
    - IT, Server
    - IT, Engineer
  - Sports
    - Sports, Ball
    - Sports, Tennis
24. Experimental Results
- Compared with state-of-the-art clustering algorithms:
  - Bisecting k-means (Cluto 2.0 toolkit)
  - UPGMA (Cluto 2.0 toolkit)
  - HFTC (Java source code from the authors)
- Evaluation criteria: clustering quality, efficiency, and scalability.
25. Data Sets
- Each document is pre-classified into a single
natural class.
26. Clustering Quality (F-measure)
- A widely used evaluation measure for clustering algorithms, based on recall and precision.
- The overall F-measure is a weighted average of the per-class F-measures (each combining recall and precision).
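The F-measure computation can be sketched on a hypothetical toy example (the class and cluster sizes below are made up for illustration):

```python
def f_measure(n_class_cluster, n_class, n_cluster):
    """F-measure of one (natural class, cluster) pair."""
    if n_class_cluster == 0:
        return 0.0
    recall = n_class_cluster / n_class
    precision = n_class_cluster / n_cluster
    return 2 * precision * recall / (precision + recall)

def overall_f(classes, clusters, overlap):
    """Weighted average: each class matched with its best cluster."""
    total = sum(classes.values())
    return sum((n / total) * max(f_measure(overlap.get((ci, cj), 0), n, clusters[cj])
                                 for cj in clusters)
               for ci, n in classes.items())

classes = {"sports": 4, "tech": 6}   # hypothetical class sizes
clusters = {"A": 5, "B": 5}          # hypothetical cluster sizes
overlap = {("sports", "A"): 4, ("tech", "A"): 1, ("tech", "B"): 5}
print(round(overall_f(classes, clusters, overlap), 3))  # 0.901
```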
27. For FIHC and HFTC, we use MinSup from 3% to 6%.
28. Efficiency
29. Complexity Analysis
- Clustering: O(Σf∈F global_support(f)), where F is the set of global frequent itemsets (two scans over the documents).
- Tree construction: empty clusters are removed first; O(n), where n is the number of documents.
- Child pruning: one scan over the remaining clusters.
- Sibling merging: O(g²), where g is the number of remaining clusters at level 1.
30. Conclusions
- This research exploits frequent itemsets for:
  - defining a cluster;
  - organizing the cluster hierarchy.
- Our contributions:
  - Reduced dimensionality, hence efficiency and scalability.
  - Consistently high clustering quality.
  - Number of clusters as an optional input parameter.
  - Meaningful cluster descriptions.
31. Thank you.
32. References
- C. Aggarwal, S. Gates, and P. Yu. On the merits of building categorization systems by supervised clustering. In Proceedings of KDD '99, 5th ACM International Conference on Knowledge Discovery and Data Mining, pages 352-356, San Diego, US, 1999. ACM Press, New York, US.
- R. Agrawal, C. Aggarwal, and V. V. V. Prasad. Depth-first generation of large itemsets for association rules. Technical Report RC21538, IBM, October 1999.
- R. Agrawal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350-371, 2001.
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '98), pages 94-105, 1998.
- R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pages 207-216, Washington, D.C., May 1993.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pages 487-499. Morgan Kaufmann, 1994.
- R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3-14, Taipei, Taiwan, March 1995.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '99), pages 49-60, Philadelphia, PA, June 1999.
33. References
- F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002. http://www.cs.sfu.ca/~ester/publications.html.
- H. Borko and M. Bernick. Automatic document classification. Journal of the ACM, 10:151-162, 1963.
- S. Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2):1-11, 2000.
- M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the 29th Symposium on Theory of Computing (STOC 1997), pages 626-635, 1997.
- Classic data set. ftp://ftp.cs.cornell.edu/pub/smart/.
- D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, 1992.
- P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and Data Mining, pages 71-80, 2000.
- R. C. Dubes and A. K. Jain. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, March 1998.
- A. El-Hamdouchi and P. Willett. Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal, 32(3), 1989.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD '96), pages 226-231, Portland, Oregon, August 1996. AAAI Press.
- A. Griffiths, L. A. Robinson, and P. Willett. Hierarchical agglomerative clustering methods for automatic document classification. Journal of Documentation, 40(3):175-205, September 1984.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science, pages 359-366, 2000.
34. References
- S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, 1999.
- E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploration. In Proceedings of the Second International Conference on Autonomous Agents, pages 408-415. ACM Press, 1998.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, August 2000.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), Dallas, Texas, USA, May 2000.
- J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations, 2(1):58-64, July 2000.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001. ACM Press.
- G. Karypis. Cluto 2.0 clustering toolkit, April 2002. http://www.users.cs.umn.edu/~karypis/cluto/.
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, March 1990.
- D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D. Fisher, editor, Proceedings of ICML '97, 14th International Conference on Machine Learning, pages 170-178, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.
- R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations, 2, 2000.
- G. Kowalski and M. Maybury. Information Storage and Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers, 2nd edition, July 2000.
36. References
- J. Lam. Multi-dimensional constrained gradient mining. Master's thesis, Simon Fraser University, August 2001.
- B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD '99, 1999.
- D. D. Lewis. Reuters. http://www.research.att.com/~lewis/.
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Knowledge Discovery and Data Mining (KDD '98), pages 80-86, 1998.
- G. Miller. Princeton WordNet, 1990.
- M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 116-125, Athens, Greece, August 1997. Morgan Kaufmann.
- H. Schutze and C. Silverstein. Projections for efficient document clustering. In Proceedings of SIGIR '97, pages 74-81, Philadelphia, PA, July 1997.
- C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423 and 623-656, July and October 1948.
- M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. KDD Workshop on Text Mining, 2000.
- Text REtrieval Conference (TREC) TIPSTER, 1999. http://trec.nist.gov/.
- H. Uchida, M. Zhu, and T. Della Senta. UNL: A gift for a millennium. The United Nations University, 2000.
- C. J. van Rijsbergen. Information Retrieval. Dept. of Computer Science, University of Glasgow, Butterworth, London, 2nd edition, 1979.
- P. Vossen. EuroWordNet, Summer 1999.
- K. Wang, C. Xu, and B. Liu. Clustering transactions using large items. In CIKM '99, pages 483-490, 1999.
37. References
- K. Wang, S. Zhou, and Y. He. Hierarchical classification of real life documents. In Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, US, 2001.
- W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 186-195, Athens, Greece, August 25-29, 1997. Morgan Kaufmann.
- Yahoo! http://www.yahoo.com/.
- O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web documents. In KDD '97, pages 287-290, 1997.