Title: Hierarchical Document Clustering Using Frequent Itemsets
1. Hierarchical Document Clustering Using Frequent Itemsets
- Benjamin Fung, Ke Wang, Martin Ester
- {bfung, wangk, ester}@cs.sfu.ca
- Simon Fraser University
- May 1, 2003 (SDM '03)
2. Outline
- What is hierarchical document clustering?
- Previous work
- Our method: Frequent Itemset-based Hierarchical Clustering (FIHC)
- Experimental results
- Conclusions
3. Hierarchical Document Clustering
- Document clustering: the automatic organization of documents into clusters, so that documents within a cluster are highly similar to one another, but very dissimilar to documents in other clusters.
- Hierarchical document clustering additionally organizes the clusters into a hierarchy (tree).
4. Challenges in Hierarchical Document Clustering
- High dimensionality.
- High volume of data.
- Consistently high clustering quality.
- Meaningful cluster description.
5. Previous Work
- Hierarchical methods
  - Agglomerative and divisive.
  - Reasonably accurate, but not scalable.
- Partitioning methods
  - Efficient, scalable, and easy to implement.
  - Clustering quality degrades if an inappropriate number of clusters is provided.
- Frequent itemset-based methods
  - HFTC depends on a greedy heuristic.
6. Preprocessing
- Remove stop words and apply stemming.
- Construct the vector model:
  - doci = ( item frequency1, if2, if3, ..., ifm )
- e.g., with vocabulary ( apple, boy, cat, window ):
  - doc1 = ( 5, 2, 7, 0 )
  - doc2 = ( 4, 0, 0, 3 )
  - doc3 = ( 0, 3, 1, 5 )
- Equivalently, as document vectors:
  - doc1: apple 5, boy 2, cat 7
  - doc2: apple 4, window 3
  - doc3: boy 3, cat 1, window 5
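The vector construction above can be sketched in Python. The tiny stop list is illustrative only, and stemming is omitted for brevity:

```python
from collections import Counter

# Tiny illustrative stop list; a real system would use a full list plus stemming.
STOP_WORDS = {"the", "a", "an", "is", "and", "of"}

def to_vector(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
    return [counts[term] for term in vocabulary]

vocab = ["apple", "boy", "cat", "window"]
doc2 = "apple and apple apple apple window window window"
print(to_vector(doc2, vocab))  # [4, 0, 0, 3], matching doc2 above
```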
7. Algorithm Overview of Our Method (FIHC)
- Input: high-dimensional document vectors; after mining global frequent itemsets, documents are represented by reduced-dimension feature vectors.
- Steps: construct clusters, build a tree, then prune the tree.
8. Definition: Global Frequent Itemset
- A global frequent itemset is a set of items (words) that appear together in more than a user-specified fraction of the document set.
- The global support of an itemset is the percentage of documents containing the itemset.
- e.g., if 7% of the documents contain both words apple and window, then {apple, window} has global support 7%.
- A global frequent item is an item that belongs to some global frequent itemset, e.g., apple.
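A minimal sketch of computing global support, using the five-document example of the next slide (each document reduced to the set of items with nonzero frequency):

```python
# Items of each document (nonzero entries of the example vectors).
DOCS = {
    "doc1": {"apple", "boy", "cat", "window"},
    "doc2": {"apple", "window"},
    "doc3": {"boy", "cat", "window"},
    "doc4": {"apple", "cat"},
    "doc5": {"apple", "window"},
}

def global_support(itemset):
    """Percentage of documents that contain every item of the itemset."""
    hits = sum(1 for items in DOCS.values() if itemset <= items)
    return 100.0 * hits / len(DOCS)

print(global_support({"apple", "window"}))  # 60.0
print(global_support({"boy"}))              # 40.0: boy is not global frequent at 60%
```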
9. Reduced-Dimension Vector Model
- High-dimensional vector model:
  - ( apple, boy, cat, window )
  - doc1 = ( 5, 2, 1, 1 )
  - doc2 = ( 4, 0, 0, 3 )
  - doc3 = ( 0, 3, 1, 5 )
  - doc4 = ( 8, 0, 2, 0 )
  - doc5 = ( 5, 0, 0, 3 )
- Suppose we set the minimum support to 60%. The global frequent itemsets are {apple}, {cat}, {window}, and {apple, window}.
- Store the frequencies only for global frequent items:
  - ( apple, cat, window )
  - doc1 = ( 5, 1, 1 )
  - doc2 = ( 4, 0, 3 )
- Each document vector becomes a feature vector.
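Projecting the high-dimensional vectors onto the global frequent items can be sketched as:

```python
DOC_VECTORS = {
    "doc1": {"apple": 5, "boy": 2, "cat": 1, "window": 1},
    "doc2": {"apple": 4, "window": 3},
    "doc3": {"boy": 3, "cat": 1, "window": 5},
    "doc4": {"apple": 8, "cat": 2},
    "doc5": {"apple": 5, "window": 3},
}
# boy fails the 60% minimum support, so it is dropped.
GLOBAL_FREQUENT_ITEMS = ["apple", "cat", "window"]

def feature_vector(doc):
    """Keep frequencies only for global frequent items."""
    return [doc.get(item, 0) for item in GLOBAL_FREQUENT_ITEMS]

print(feature_vector(DOC_VECTORS["doc1"]))  # [5, 1, 1]
print(feature_vector(DOC_VECTORS["doc2"]))  # [4, 0, 3]
```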
10. Intuition
- Frequent itemsets are combinations of words that suggest topics.
- e.g., {apple} → topic: Fruit
- {window} → topic: Renovation
- {apple, window} → topic: Computer
11. Construct Initial Clusters
- Construct one cluster for each global frequent itemset.
- Global frequent itemsets: {apple}, {cat}, {window}, {apple, window}.
- All documents containing an itemset are included in the same cluster, so initial clusters may overlap:
  - C{apple}, C{window}, C{apple, window}, C{cat}
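Constructing the (overlapping) initial clusters is a containment test per itemset; a sketch with the running example data:

```python
DOCS = {
    "doc1": {"apple", "boy", "cat", "window"},
    "doc2": {"apple", "window"},
    "doc3": {"boy", "cat", "window"},
    "doc4": {"apple", "cat"},
    "doc5": {"apple", "window"},
}
GLOBAL_FREQUENT_ITEMSETS = [
    frozenset({"apple"}), frozenset({"cat"}),
    frozenset({"window"}), frozenset({"apple", "window"}),
]

def initial_clusters():
    """One cluster per global frequent itemset; clusters may overlap."""
    return {label: {d for d, items in DOCS.items() if label <= items}
            for label in GLOBAL_FREQUENT_ITEMSETS}

clusters = initial_clusters()
print(sorted(clusters[frozenset({"apple", "window"})]))  # ['doc1', 'doc2', 'doc5']
```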
12. Making Clusters Disjoint
- Assign each document to the best initial cluster.
- Intuitively, a cluster Ci is good for a document docj if many of the global frequent items in docj appear in many documents of Ci.
13. Cluster Frequent Items
- A global frequent item is cluster frequent in a cluster Ci if the item is contained in some minimum fraction of the documents in Ci.
- Suppose we set the minimum cluster support to 60%. For C{apple}:
  - ( apple, cat, window )
  - doc1 = ( 5, 1, 1 )
  - doc2 = ( 4, 0, 3 )
  - doc4 = ( 8, 2, 0 )
  - doc5 = ( 5, 0, 3 )
- Cluster supports in C{apple}: apple 100%, cat 50%, window 75%.
- apple and window are cluster frequent items; cat is not.
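The cluster supports for C{apple} above can be computed with a short sketch:

```python
DOC_VECTORS = {
    "doc1": {"apple": 5, "cat": 1, "window": 1},
    "doc2": {"apple": 4, "window": 3},
    "doc4": {"apple": 8, "cat": 2},
    "doc5": {"apple": 5, "window": 3},
}
C_APPLE = {"doc1", "doc2", "doc4", "doc5"}  # documents of cluster C{apple}

def cluster_support(item, cluster_docs):
    """Percentage of the cluster's documents that contain the item."""
    hits = sum(1 for d in cluster_docs if DOC_VECTORS[d].get(item, 0) > 0)
    return 100.0 * hits / len(cluster_docs)

for item in ("apple", "cat", "window"):
    print(item, cluster_support(item, C_APPLE))  # apple 100.0, cat 50.0, window 75.0
```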
14. Score Function (Example)
- Cluster frequent items and their cluster supports:
  - C{apple}: apple 100%, window 75%
  - C{window}: cat 60%, window 100%
  - C{apple, window}: apple 100%, cat 60%, window 100%
  - C{cat}: cat 100%
- doc1: apple 5, cat 1, window 3
15. Score Function
- Assign each docj to the initial cluster Ci that has the highest score:
  - Score(Ci ← docj) = Σx n(x) · cluster_support(x) − Σx' n(x') · global_support(x')
- x represents a global frequent item in docj that is also cluster frequent in Ci.
- x' represents a global frequent item in docj that is not cluster frequent in Ci.
- n(x) and n(x') are the frequencies of x and x' in the feature vector of docj.
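The score function can be sketched in Python. The cluster supports mirror the example slides, and the global supports come from the five-document example (apple 80%, cat 60%, window 80%); treat the numbers as illustrative:

```python
GLOBAL_SUPPORT = {"apple": 0.8, "cat": 0.6, "window": 0.8}
# Per cluster, only the cluster frequent items with their cluster supports.
CLUSTER_SUPPORT = {
    "C{apple}": {"apple": 1.0, "window": 0.75},
    "C{window}": {"cat": 0.6, "window": 1.0},
    "C{apple, window}": {"apple": 1.0, "cat": 0.6, "window": 1.0},
    "C{cat}": {"cat": 1.0},
}

def score(cluster, doc):
    """Reward cluster frequent items; penalise the rest by global support."""
    cf = CLUSTER_SUPPORT[cluster]
    s = 0.0
    for item, n in doc.items():            # n: frequency in the feature vector
        if item in cf:
            s += n * cf[item]              # x: cluster frequent in Ci
        else:
            s -= n * GLOBAL_SUPPORT[item]  # x': not cluster frequent in Ci
    return s

doc1 = {"apple": 5, "cat": 1, "window": 3}
best = max(CLUSTER_SUPPORT, key=lambda c: score(c, doc1))
print(best, round(score(best, doc1), 2))  # C{apple, window} 8.6
```

So doc1 lands in the cluster for {apple, window}, the itemset that best explains its frequent items.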
16. Score Function (Example)
- C{apple}: apple 100%, window 75%
- C{window}: cat 60%, window 100%
- C{apple, window}: apple 100%, cat 60%, window 100%
- C{cat}: cat 100%
- doc1: apple 5, cat 1, window 3
17. Tree Construction
- Put the more specific clusters at the bottom of the tree and the more general clusters at the top.
- Build the tree bottom-up by choosing a parent for each cluster, starting from the clusters with the largest number of items in their cluster labels.
- Example tree of cluster labels:
  - null
    - CS
      - CS, AI
      - CS, DM
    - Sports
      - Sports, Ball
      - Sports, Tennis
        - Sports, Tennis, Ball
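The bottom-up linking step can be sketched as follows. FIHC scores every potential parent and picks the best; for brevity this sketch simply takes the first label that is one item smaller, so only the tree-building structure is shown:

```python
labels = [frozenset(), frozenset({"CS"}), frozenset({"Sports"}),
          frozenset({"CS", "AI"}), frozenset({"CS", "DM"}),
          frozenset({"Sports", "Ball"}), frozenset({"Sports", "Tennis"}),
          frozenset({"Sports", "Tennis", "Ball"})]

def build_tree(labels):
    """Attach each cluster to a parent whose label is one item smaller.
    FIHC scores all such candidates and picks the best; here we take
    the first candidate just to illustrate the structure."""
    parent = {}
    for label in sorted(labels, key=len, reverse=True):  # most specific first
        if not label:
            continue  # the null root has no parent
        candidates = [p for p in labels if len(p) == len(label) - 1 and p <= label]
        parent[label] = candidates[0] if candidates else frozenset()
    return parent

tree = build_tree(labels)
print(sorted(tree[frozenset({"CS", "AI"})]))  # ['CS']
```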
18. Choose a Parent Cluster (Example)
- Candidate parents of C{Sports, Tennis, Ball}: C{Sports, Ball} and C{Sports, Tennis}.
- Merge the documents of the child cluster into a single combined document and score it against each candidate parent:
  - ( CS, DM, AI, Sports, Tennis, Ball )
  - doc1 = ( 0, 0, 0, 5, 10, 2 )
  - doc2 = ( 1, 0, 0, 5, 5, 3 )
  - doc3 = ( 0, 1, 0, 15, 10, 1 )
  - sum = ( 1, 1, 0, 25, 25, 6 )
19. Prune Cluster Tree
- Why prune the tree?
  - To remove overly specific child clusters.
  - Otherwise, documents of the same class (topic) are likely to be distributed over different subtrees, which would lead to poor clustering quality.
20. Inter-Cluster Similarity
- Reuse the score function to calculate the similarity Sim(Ci ← Cj) between clusters.
21. Child Pruning
- Efficiently shorten the tree by replacing child clusters with their parent.
- A child is pruned only if it is similar to its parent: prune if Inter_Sim > 1.
- Example tree (C{Sports, Tennis, Racket} is the pruning candidate):
  - null
    - CS
      - CS, DM
      - CS, AI
    - Sports
      - Sports, Ball
      - Sports, Tennis
        - Sports, Tennis, Racket
22. Sibling Merging
- Narrow the tree by merging similar subtrees at level 1.
  - null
    - CS
      - CS, DM
      - CS, AI
    - Sports
      - Sports, Ball
      - Sports, Tennis
    - IT
      - IT, Server
      - IT, Engineer
- Inter_Sim(CS ↔ IT) = 1.5
- Inter_Sim(IT ↔ Sports) = 0.75
- Inter_Sim(CS ↔ Sports) = 0.5
- The most similar pair, CS and IT, is merged.
23. Sibling Merging (Result)
- null
  - CS
    - CS, DM
    - CS, AI
    - IT, Server
    - IT, Engineer
  - Sports
    - Sports, Ball
    - Sports, Tennis
24. Experimental Results
- Compared with state-of-the-art clustering algorithms:
  - Bisecting k-means (Cluto 2.0 toolkit)
  - UPGMA (Cluto 2.0 toolkit)
  - HFTC (Java source code from the authors)
- Evaluation criteria: clustering quality, efficiency, and scalability.
25. Data Sets
- Each document is pre-classified into a single
natural class.
26. Clustering Quality (F-measure)
- A widely used evaluation measure for clustering algorithms, based on recall and precision.
- The overall F-measure is a weighted average of the per-class F-measures (each combining recall and precision).
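The F-measure computation can be sketched on a hypothetical toy example (the class and cluster sizes below are made up for illustration):

```python
def f_measure(n_class_cluster, n_class, n_cluster):
    """F-measure of one (natural class, cluster) pair."""
    if n_class_cluster == 0:
        return 0.0
    recall = n_class_cluster / n_class
    precision = n_class_cluster / n_cluster
    return 2 * precision * recall / (precision + recall)

def overall_f(classes, clusters, overlap):
    """Weighted average: each class matched with its best cluster."""
    total = sum(classes.values())
    return sum((n / total) * max(f_measure(overlap.get((ci, cj), 0), n, clusters[cj])
                                 for cj in clusters)
               for ci, n in classes.items())

classes = {"sports": 4, "tech": 6}   # hypothetical class sizes
clusters = {"A": 5, "B": 5}          # hypothetical cluster sizes
overlap = {("sports", "A"): 4, ("tech", "A"): 1, ("tech", "B"): 5}
print(round(overall_f(classes, clusters, overlap), 3))  # 0.901
```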
27. For FIHC and HFTC, we use MinSup from 3% to 6%.
28. Efficiency
29. Complexity Analysis
- Clustering: O(Σf∈F global_support(f)), where F is the set of global frequent itemsets (two scans over the documents).
- Tree construction: empty clusters are removed first; O(n), where n is the number of documents.
- Child pruning: one scan over the remaining clusters.
- Sibling merging: O(g²), where g is the number of remaining clusters at level 1.
30. Conclusions
- This research exploits frequent itemsets for:
  - defining a cluster;
  - organizing the cluster hierarchy.
- Our contributions:
  - Reduced dimensionality, hence efficiency and scalability.
  - Consistently high clustering quality.
  - Number of clusters as an optional input parameter.
  - Meaningful cluster descriptions.
31. Thank you.
32. References
- C. Aggarwal, S. Gates, and P. Yu. On the merits of building categorization systems by supervised clustering. In Proceedings of KDD '99, 5th ACM International Conference on Knowledge Discovery and Data Mining, pages 352-356, San Diego, US, 1999. ACM Press, New York, US.
- R. Agrawal, C. Aggarwal, and V. V. V. Prasad. Depth-first generation of large itemsets for association rules. Technical Report RC21538, IBM, October 1999.
- R. Agrawal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent item sets. Journal of Parallel and Distributed Computing, 61(3):350-371, 2001.
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '98), pages 94-105, 1998.
- R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '93), pages 207-216, Washington, D.C., May 1993.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. 20th Int. Conf. Very Large Data Bases (VLDB), pages 487-499. Morgan Kaufmann, 1994.
- R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3-14, Taipei, Taiwan, March 1995.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '99), pages 49-60, Philadelphia, PA, June 1999.
33. References
- F. Beil, M. Ester, and X. Xu. Frequent term-based text clustering. In Proc. 8th Int. Conf. on Knowledge Discovery and Data Mining (KDD 2002), Edmonton, Alberta, Canada, 2002. http://www.cs.sfu.ca/~ester/publications.html.
- H. Borko and M. Bernick. Automatic document classification. Journal of the ACM, 10:151-162, 1963.
- S. Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD Explorations, 1(2):1-11, 2000.
- M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the 29th Symposium on Theory of Computing (STOC 1997), pages 626-635, 1997.
- Classic data set. ftp://ftp.cs.cornell.edu/pub/smart/.
- D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 318-329, 1992.
- P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and Data Mining, pages 71-80, 2000.
- R. C. Dubes and A. K. Jain. Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ, March 1998.
- A. El-Hamdouchi and P. Willett. Comparison of hierarchic agglomerative clustering methods for document retrieval. The Computer Journal, 32(3), 1989.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD '96), pages 226-231, Portland, Oregon, August 1996. AAAI Press.
- A. Griffiths, L. A. Robinson, and P. Willett. Hierarchical agglomerative clustering methods for automatic document classification. Journal of Documentation, 40(3):175-205, September 1984.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In IEEE Symposium on Foundations of Computer Science, pages 359-366, 2000.
34. References
- S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th International Conference on Data Engineering, 1999.
- E. H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, and J. Moore. WebACE: A web agent for document categorization and exploration. In Proceedings of the Second International Conference on Autonomous Agents, pages 408-415. ACM Press, 1998.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, August 2000.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD '00), Dallas, Texas, USA, May 2000.
- J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations, 2(1):58-64, July 2000.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97-106, San Francisco, CA, 2001. ACM Press.
- G. Karypis. Cluto 2.0 clustering toolkit, April 2002. http://www.users.cs.umn.edu/~karypis/cluto/.
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, March 1990.
- D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D. Fisher, editor, Proceedings of ICML '97, 14th International Conference on Machine Learning, pages 170-178, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.
- R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations, 2, 2000.
- G. Kowalski and M. Maybury. Information Storage and Retrieval Systems: Theory and Implementation. Kluwer Academic Publishers, 2nd edition, July 2000.
36. References
- J. Lam. Multi-dimensional constrained gradient mining. Master's thesis, Simon Fraser University, August 2001.
- B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In KDD '99, 1999.
- D. D. Lewis. Reuters. http://www.research.att.com/~lewis/.
- B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Knowledge Discovery and Data Mining (KDD '98), pages 80-86, 1998.
- G. Miller. Princeton WordNet, 1990.
- M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 116-125, Athens, Greece, August 1997. Morgan Kaufmann.
- H. Schutze and C. Silverstein. Projections for efficient document clustering. In Proceedings of SIGIR '97, pages 74-81, Philadelphia, PA, July 1997.
- C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423 and 623-656, July and October 1948.
- M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. KDD Workshop on Text Mining, 2000.
- Text REtrieval Conference (TREC) TIPSTER, 1999. http://trec.nist.gov/.
- H. Uchida, M. Zhu, and T. Della Senta. UNL: A gift for a millennium. The United Nations University, 2000.
- C. J. van Rijsbergen. Information Retrieval. Dept. of Computer Science, University of Glasgow, Butterworth, London, 2nd edition, 1979.
- P. Vossen. EuroWordNet, Summer 1999.
- K. Wang, C. Xu, and B. Liu. Clustering transactions using large items. In CIKM '99, pages 483-490, 1999.
37. References
- K. Wang, S. Zhou, and Y. He. Hierarchical classification of real life documents. In Proceedings of the 1st SIAM International Conference on Data Mining, Chicago, US, 2001.
- W. Wang, J. Yang, and R. R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB '97), pages 186-195, Athens, Greece, August 25-29, 1997. Morgan Kaufmann.
- Yahoo! http://www.yahoo.com/.
- O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web documents. In KDD '97, pages 287-290, 1997.