Title: Taxonomy Generation from Documents
1Taxonomy Generation from Documents
2Taxonomy ?
- Definition
- Represents a form of information classes
- As elucidated by Bailey (1994), a taxonomy is an
orderly classification of information according
to presumed natural relationships. A vocabulary
is the simplest form of taxonomy. The typical
form of taxonomy is a hierarchy. At the top
level, general terms or descriptive phrases are
used. Each of the general terms has beneath it a
set of terms that provide more refinement of the
top-level term. Each of these second-level terms
may have a set of refining terms beneath it.
- Subject Taxonomy of Documents
- Organize documents according to subject classes
of document content
- Ex.: Web directory trees (Web portal or Web
enterprise portal), or library classification
schemes
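The hierarchy described above can be pictured as a nested mapping from each term to its refining terms. This is a minimal sketch only; the category names are hypothetical and are not taken from any real directory.

```python
# Hypothetical category names, for illustration only.
taxonomy = {
    "Science": {
        "Computer Science": {"Artificial Intelligence": {}, "Databases": {}},
        "Physics": {},
    },
    "Arts": {},
}

def paths(node, prefix=()):
    """Yield every category as a path from the top level down."""
    for term, refinements in node.items():
        path = prefix + (term,)
        yield path
        yield from paths(refinements, path)

for p in paths(taxonomy):
    print(" > ".join(p))
```

Each printed path runs from a general top-level term to one of its refinements, which is how a Web directory presents its categories.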
3Using Subject Taxonomy for Search
- Complement search engines and provide
high-quality searches in a Web portal or an
enterprise information portal (Sullivan02).
- Users will have an easier time finding
information.
- Users do not have to come up with keywords to
find information.
- Can categorize all content in a portal or
intranet. This will improve the chance users will
find what they are looking for in their searches.
4Automatic Taxonomy Generation
- Taxonomy-generation tools create a Yahoo!-like
directory structure for navigating content in a
portal or intranet.
- Some tools, such as Verity's K2 Enterprise,
Semio, Stratify, Quiver and SmartLogik, provide
automated methods for initial taxonomy
construction.
- These approaches are based on hierarchical
document clustering techniques.
- They rely on term extraction and document
clustering techniques for analyzing the content
similarity of the indexed documents.
- Similar documents can be grouped together to
form a category.
- The subject terms of the category are extracted
from the documents and represent their key
concepts.
5Difficulties
- Automatic approaches suffer from some difficulties.
- Document clustering varies in accuracy
- Scalability
- Whether the generated categories are
comprehensive and adaptable to users'
preferences and usage
6Clustering of Search Result Pages
- Taxonomy Generation from Terms
7Examples of Document Sources
8Examples of Document Sources
- Multimedia Sources
- Family photos
- News videos
14Related Techniques
- Extraction
- Word segmentation, term extraction
- Word sense disambiguation
- Named entity extraction
- Key pattern extraction
- Semantic Categorization, Clustering
- Term classification
- Relation Identification
- Relation extraction, synonym extraction, hyponym
identification
- Taxonomy Generation
- Ontology Generation
15Motivation Example I
Key terms: Text segment, Document, Query term,
Search-result snippets, Clustering, HAC,
Single-linkage, Complete-linkage, Average-linkage,
K-means, Cluster naming, Cohesiveness, Isolation,
Precision, Recall, F1-measure
16Motivation Example I
Key terms organized into groups:
- Cohesiveness, Isolation, Precision, Recall,
F1-measure
- Text segment, Document, Query term,
Search-result snippets
- Clustering, HAC, K-means, Single-linkage,
Complete-linkage, Min-max cut, Cluster naming
Finer groups:
- Cohesiveness, Isolation
- Precision, Recall, F1-measure
- Clustering, HAC, K-means
- Single-linkage, Complete-linkage, Min-max cut
- Cluster naming
17Motivation Example II
- Organizing Queries/Questions
Figure: users reach a call center by mail, email,
phone, or a Web interface (example questions).
19The Intuition
- What does a person do when facing an unknown term?
- Send it to search engines
- Refer to the context in which it appears
- Infer its meaning(s)
Example term: Sinica ?
Search-result snippets:
- Academia Sinica ... The top academic institution
in Taiwan. It is engaged in conducting scientific
research in its...
- Learn Chinese with Clavis Sinica Chinese language
reading and Clavis Sinica Chinese language
learning software Chinese characters. Learn
Chinese with Clavis Sinica. ...
- Institute of Information Science Academia Sinica
Homepage Information about research groups,
papers, and projects
20Overview of the Approach
21Experimental Results Structure
HAC
22Experimental Results Structure
HACP
23Data Representation
- Highly ranked search-result snippets are used to
enrich the representation of the text segments
- A huge number of pages have been indexed and are
continuously refreshed
- Features are extracted from neighboring content
(context)
- The vector space model is adopted to represent
each text segment. The weight of each feature term is
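A minimal sketch of this representation step follows. The weighting formula is not reproduced in the text above, so the sketch assumes a plain tf-idf weight as a stand-in; the snippets are hypothetical strings rather than real search results, and `tfidf_vectors` and `cosine` are illustrative names.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for highly ranked search results.
snippets = {
    "HAC": ["hierarchical agglomerative clustering merges clusters"],
    "K-means": ["k-means clustering assigns points to clusters"],
    "Recall": ["recall measures retrieved relevant documents"],
}

def tfidf_vectors(snippets_by_segment):
    """Build one sparse term vector per text segment (assumed tf-idf weights)."""
    tf = {seg: Counter(w for s in snips for w in s.lower().split())
          for seg, snips in snippets_by_segment.items()}
    n = len(tf)
    df = Counter(w for counts in tf.values() for w in counts)  # document frequency
    return {seg: {w: c * math.log(n / df[w]) for w, c in counts.items()}
            for seg, counts in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vectors = tfidf_vectors(snippets)
print(cosine(vectors["HAC"], vectors["K-means"]))  # shares context terms
print(cosine(vectors["HAC"], vectors["Recall"]))   # shares none
```

Two short text segments that share no words can still come out similar once their snippets supply overlapping context terms.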
24Hierarchical Clustering Algorithm
- Hierarchical agglomerative clustering (HAC)
- Treat each input object as a singleton cluster
- Compute the similarities between all pairs of
objects
- Merge the two most similar clusters
- Update the similarity matrix with the new cluster
- Repeat the merge and update steps until only a
single cluster remains
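The steps above can be sketched as follows. This is an illustrative, unoptimized version that assumes cosine similarity between objects and average-linkage between clusters; the function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hac(vectors):
    """Return the merge history as (cluster_a, cluster_b, similarity) tuples."""
    n = len(vectors)
    clusters = [(i,) for i in range(n)]              # singleton clusters
    pair = {(i, j): cosine(vectors[i], vectors[j])   # all pairwise similarities
            for i in range(n) for j in range(i + 1, n)}

    def link(a, b):
        # Average-linkage: mean similarity over all cross-cluster pairs.
        return sum(pair[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        candidates = [(link(a, b), a, b)
                      for x, a in enumerate(clusters) for b in clusters[x + 1:]]
        s, a, b = max(candidates, key=lambda t: t[0])  # most similar pair
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)   # similarities to the new cluster are recomputed by link()
        merges.append((a, b, s))
    return merges

vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
merges = hac(vectors)
for a, b, s in merges:
    print(a, b, round(s, 3))
```

The merge history defines a binary tree: each merge is an internal node whose two children are the merged clusters.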
25In-cluster Similarity
- Inter-object similarity
- Inter-cluster similarity
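As a sketch of how inter-cluster similarity can be derived from inter-object similarity, the three linkage criteria named elsewhere in this deck are shown below. The similarity matrix is made up for illustration; because the values are similarities rather than distances, single-linkage takes the maximum and complete-linkage the minimum.

```python
# Hypothetical symmetric inter-object similarity matrix.
sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.4, 0.3],
    [0.2, 0.4, 1.0, 0.9],
    [0.1, 0.3, 0.9, 1.0],
]

def single_linkage(a, b, sim):
    """Similarity of the most similar cross-cluster pair."""
    return max(sim[i][j] for i in a for j in b)

def complete_linkage(a, b, sim):
    """Similarity of the least similar cross-cluster pair."""
    return min(sim[i][j] for i in a for j in b)

def average_linkage(a, b, sim):
    """Mean similarity over all cross-cluster pairs."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

a, b = [0, 1], [2, 3]
print(single_linkage(a, b, sim), complete_linkage(a, b, sim), average_linkage(a, b, sim))
```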
26Hierarchical Cluster Partitioning
- To get a more natural multi-way tree from the
binary tree generated by HAC algorithm
28Hierarchical Cluster Partitioning
- To get a more natural multi-way tree from the
binary tree generated by HAC algorithm
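Figure: binary HAC tree with leaves C1 to C5 and
internal clusters C6 to C9, annotated with cut
levels l = 1 to 4 (at C9, C8, C7, C6), next to a
tree over the same leaves that retains C9, C7 and C6.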
30Hierarchical Cluster Partitioning
- To get a more natural multi-way tree from the
binary tree generated by HAC algorithm
Figure: binary HAC tree with internal clusters C9,
C8, C7, C6 at cut levels l = 1 to 4, leaves C1 to
C5, and labels T1 to T5.
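A minimal sketch of the partitioning idea, assuming that cutting at level l means undoing the last l merges below a node, so that the surviving clusters become that node's direct children. The tree shape and merge order below are hypothetical, and the choice of l (for example by a cluster quality measure) is left out.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    name: str
    order: int = 0                      # HAC merge step that created it; 0 for leaves
    children: list = field(default_factory=list)

def cut(root, level):
    """Undo the last `level` merges below `root` and return the resulting clusters."""
    frontier = [root]
    for _ in range(level):
        splittable = [n for n in frontier if n.children]
        if not splittable:
            break
        latest = max(splittable, key=lambda n: n.order)  # most recent merge
        i = frontier.index(latest)
        frontier[i:i + 1] = latest.children              # replace it by its children
    return frontier

# Hypothetical binary tree: C6..C9 are merges in HAC order over leaves C1..C5.
leaves = {f"C{i}": Node(f"C{i}") for i in range(1, 6)}
c6 = Node("C6", 1, [leaves["C1"], leaves["C2"]])
c7 = Node("C7", 2, [c6, leaves["C3"]])
c8 = Node("C8", 3, [c7, leaves["C4"]])
c9 = Node("C9", 4, [c8, leaves["C5"]])

print([n.name for n in cut(c9, 2)])
```

With this tree, cutting at level 2 drops C8 and leaves C9 with three children, so the result is no longer strictly binary. Applying the same cut recursively to each child would give a full multi-way hierarchy.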
31Cluster Quality Measure
- Min-max cut
- Minimize the inter-cluster similarities
- Maximize the intra-cluster similarities
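The two goals above can be combined into one score. The sketch below assumes the usual min-max cut formulation, in which each cluster contributes the ratio of its similarity to all other clusters over its internal similarity, and lower totals are better. The similarity matrix is hypothetical.

```python
# Hypothetical symmetric similarity matrix over four objects.
sim = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.4, 0.3],
    [0.2, 0.4, 1.0, 0.9],
    [0.1, 0.3, 0.9, 1.0],
]

def min_max_cut(clusters, sim):
    """Sum over clusters of (similarity to the rest) / (similarity within)."""
    total = 0.0
    for c in clusters:
        rest = [j for other in clusters if other is not c for j in other]
        inter = sum(sim[i][j] for i in c for j in rest)
        intra = sum(sim[i][j] for i in c for j in c)  # includes self-similarity
        total += inter / intra
    return total

good = [[0, 1], [2, 3]]   # groups the similar objects together
bad = [[0, 2], [1, 3]]    # splits them apart
print(min_max_cut(good, sim), min_max_cut(bad, sim))
```

A partition that keeps similar objects together gets a lower score, so the score can be compared across candidate cut levels.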
32Experiment Data
- Yahoo! CS directory
- 36, 177, 278 category names in the 1st, 2nd, 3rd
levels
- Yahoo! People/Scientists
- 250 famous people in 9 science fields
- Paper
- QuizNLQ
- 163 quiz questions in 7 categories
33Evaluation Metric F-measure
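A minimal sketch of a clustering F-measure, assuming the commonly used definition in which each reference class is matched to its best-scoring cluster and the per-class scores are weighted by class size. The exact variant used in the experiments is not stated here, and the example data is hypothetical.

```python
def f_measure(classes, clusters):
    """Class-size-weighted best F1 between reference classes and clusters."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            overlap = len(cls & clu)
            if overlap == 0:
                continue
            precision = overlap / len(clu)
            recall = overlap / len(cls)
            best = max(best, 2 * precision * recall / (precision + recall))
        total += len(cls) / n * best
    return total

# Hypothetical reference classes and generated clusters over items 0..4.
classes = [{0, 1, 2}, {3, 4}]
clusters = [{0, 1}, {2, 3, 4}]
print(f_measure(classes, clusters))
```

For a hierarchy, `clusters` would typically hold every node of the generated tree, so that each class is matched against the best cluster at any level.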
34Experimental Results Accuracy
- Yahoo! CS category names with various
inter-cluster similarities
- Average-linkage in various data sets
35Experimental Results Structure
HAC
36Experimental Results Structure
HACP
37Experimental Results Structure
- Structure measure based on
- Depth of the hierarchy
- Total clusters
- Average number of child clusters
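The three structure measures can be computed directly from a hierarchy. This sketch assumes that depth is the number of edges on the longest root-to-leaf path, that every node counts as a cluster, and that the average is taken over nodes that have children; the example tree is hypothetical.

```python
# Hypothetical hierarchy: (name, [children]); leaves have an empty child list.
tree = ("root", [("A", [("a1", []), ("a2", [])]), ("B", [])])

def depth(node):
    """Number of edges on the longest path from this node to a leaf."""
    _, kids = node
    return 0 if not kids else 1 + max(depth(k) for k in kids)

def total_clusters(node):
    """Count every node in the hierarchy, leaves included."""
    return 1 + sum(total_clusters(k) for k in node[1])

def avg_children(node):
    """Average number of child clusters over nodes that have children."""
    counts, stack = [], [node]
    while stack:
        _, kids = stack.pop()
        if kids:
            counts.append(len(kids))
            stack.extend(kids)
    return sum(counts) / len(counts) if counts else 0.0

print(depth(tree), total_clusters(tree), avg_children(tree))
```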
38User Evaluation Comprehension
- Five qualitative measures
- Cohesiveness
- Isolation
- Hierarchy
- Navigation balance
- Readability
39User Evaluation Usability
- Compare with the hierarchies constructed by humans
- Group I: construct a hierarchy from scratch
- Group II: use the auto-generated hierarchy as a
reference
40Conclusion
- Current contributions
- Study the feasibility of using search-result
snippets and clustering techniques for general
text-segment taxonomy generation
- Extensive experiments are performed on
several typical data sets and the results are
promising
- Future work
- Apply the approach to more types of text segments
- Investigate good cluster naming to enhance its
usability
- Explore techniques to deal with longer text
segments with fewer search-result snippets