Taxonomy Generation from Documents - PowerPoint PPT Presentation

Provided by: wkd
Title: Taxonomy Generation from Documents


1
Taxonomy Generation from Documents

2
What Is a Taxonomy?
  • Definition
  • A way of organizing information into classes
  • As elucidated by Bailey (1994), a taxonomy is an
    orderly classification of information according
    to presumed natural relationships. A vocabulary
    is the simplest form of taxonomy. The typical
    form of taxonomy is a hierarchy. At the top
    level, general terms or descriptive phrases are
    used. Each of the general terms has beneath it a
    set of terms that provide more refinement of the
    top-level term. Each of these second level terms
    may have a set of refining terms beneath it.
  • Subject Taxonomy of Documents
  • Organizes documents according to the subject
    classes of their content
  • Examples: Web directory trees (in a Web portal or
    enterprise portal) and library classification
    schemes

3
Using Subject Taxonomy for Search
  • A subject taxonomy complements search engines and
    provides high-quality searches in a Web portal or
    an enterprise information portal (Sullivan02).
  • Users will have an easier time finding
    information.
  • Users do not have to come up with keywords to
    find information.
  • A taxonomy can categorize all content in a portal
    or intranet, which improves the chance that users
    find what they are looking for in their searches.

4
Automatic Taxonomy Generation
  • Taxonomy-generation tools create a Yahoo!-like
    directory structure for navigating content in a
    portal or intranet.
  • Some tools, such as Verity's K2 Enterprise,
    Semio, Stratify, Quiver, and SmartLogik, provide
    automated methods for initial taxonomy
    construction.
  • These approaches are based on hierarchical
    document clustering techniques.
  • They rely on term extraction and document
    clustering to analyze the content similarity of
    the indexed documents.
  • Similar documents are grouped together to form a
    category.
  • The subject terms of a category are extracted
    from its documents and represent their key
    concepts.

5
Difficulties
  • These approaches suffer from several difficulties:
  • Document clustering varies in accuracy
  • Scalability
  • Whether the generated categories are
    comprehensive and adaptable to users'
    preferences and usage

6
Clustering of Search Result Pages
  • Taxonomy Generation from Terms

7
Examples of Document Sources

8
Examples of Document Sources
  • Multimedia Sources
  • Family photos
  • News videos

14
Related Techniques
  • Extraction
  • Word segmentation and term extraction
  • Word sense disambiguation
  • Named entity extraction
  • Key pattern extraction
  • Semantic Categorization and Clustering
  • Term classification
  • Relation Identification
  • Relation extraction, synonym extraction, hyponym
    identification
  • Taxonomy Generation
  • Ontology Generation

15
Motivation Example I
  • Key terms: Text segment, Document, Query term,
    Search-result snippets, Clustering, HAC,
    Single-linkage, Complete-linkage,
    Average-linkage, K-means, Cluster naming,
    Cohesiveness, Isolation, Precision, Recall,
    F1-measure

16
Motivation Example I
  • Key terms: Text segment, Document, Query term,
    Search-result snippets, Clustering, HAC,
    Single-linkage, Complete-linkage,
    Average-linkage, K-means, Cluster naming,
    Cohesiveness, Isolation, Precision, Recall,
    F1-measure
  • The key terms organized into groups
  • Cohesiveness, Isolation, Precision, Recall,
    F1-measure
  • Text segment, Document, Query term,
    Search-result snippets
  • Clustering, HAC, K-means, Single-linkage,
    Complete-linkage, Min-max cut, Cluster naming
  • The groups split further
  • Cohesiveness, Isolation
  • Precision, Recall, F1-measure
  • Clustering, HAC, K-means
  • Single-linkage, Complete-linkage, Min-max cut
  • Cluster naming

17
Motivation Example II
  • Organizing Queries/Questions

  • (Figure labels: users; call center; channels:
    mail, email, phone, Web interface; example
    questions)

19
The Intuition
  • What does a person do when facing an unknown term?
  • Send it to search engines
  • Refer to the context in which it appears
  • Infer its meaning(s)
  • Example: search-result snippets for the term
    "Sinica"
  • Academia Sinica ... The top academic institution
    in Taiwan. It is engaged in conducting scientific
    research in its...
  • Learn Chinese with Clavis Sinica. Chinese
    language reading and Clavis Sinica Chinese
    language learning software ... Chinese
    characters. Learn Chinese with Clavis Sinica. ...
  • Institute of Information Science, Academia
    Sinica Homepage. Information about research
    groups, papers, and projects
20
Overview of the Approach
21
Experimental Results Structure
HAC
22
Experimental Results Structure
HACP
23
Data Representation
  • Highly-ranked search-result snippets are used to
    enrich the representation of the text segments
  • A huge number of Web pages have been indexed by
    search engines and are continuously refreshed
  • Features are extracted from neighbor contents
    (context)
  • The vector space model is adopted to represent
    each text segment, with a weight assigned to
    each feature term
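
The transcript does not preserve the weighting formula used on this slide. As a minimal sketch only, the representation can be illustrated with a TF-IDF style weight, which is an assumption and not necessarily the authors' choice. The function name `tfidf_vectors` and its input layout are hypothetical; in practice the snippets would be fetched from a search engine.

```python
import math
from collections import Counter

def tfidf_vectors(snippets_by_segment):
    """Map each text segment to a sparse {term: weight} vector.

    snippets_by_segment: dict of text segment -> list of snippet strings
    (hypothetical input; real snippets would come from a search engine).
    """
    # Term frequencies: pool all snippets of a segment into one bag of words.
    tf = {seg: Counter(" ".join(snips).lower().split())
          for seg, snips in snippets_by_segment.items()}
    n = len(tf)
    # Document frequency: number of segments whose snippets contain the term.
    df = Counter(term for counts in tf.values() for term in counts)
    # Assumed weighting: tf * log(n / df). Not taken from the slides.
    return {seg: {t: c * math.log(n / df[t]) for t, c in counts.items()}
            for seg, counts in tf.items()}
```

Terms that occur in the snippets of every segment get weight zero under this scheme, so only distinguishing context terms contribute to similarity.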

24
Hierarchical Clustering Algorithm
  • Hierarchical agglomerative clustering (HAC)
  • Treat each input object as a singleton cluster
  • Compute the similarities between all pairs of
    objects
  • Merge the two most similar clusters
  • Update the similarity matrix with the new cluster
  • Repeat the merge and update steps until only a
    single cluster remains
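
The loop described on this slide can be sketched as follows. This is a naive illustration, not the authors' implementation: the function name `hac` and the matrix layout are assumptions, and instead of updating a similarity matrix in place it recomputes cluster similarities from the pairwise values on every pass, which is simpler but slower.

```python
def hac(sim, linkage=max):
    """Naive HAC over a symmetric similarity matrix (list of lists).

    Returns the merge history as (cluster_a, cluster_b) tuples of item indices.
    `linkage` combines the pairwise similarities between two clusters:
    max gives single-linkage, min gives complete-linkage.
    """
    # Start with every input object in its own singleton cluster.
    clusters = [(i,) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        # Find the most similar pair of current clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = linkage(sim[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merges.append((clusters[i], clusters[j]))
        # Replace the two clusters with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges
```

The merge history is a binary tree: each entry joins exactly two existing clusters, and the last entry is the root.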

25
In-cluster Similarity
  • Inter-object similarity
  • Inter-cluster similarity
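
The formulas for these two similarities are not preserved in the transcript. A minimal sketch, assuming cosine similarity between sparse term vectors for objects and the single-, complete-, and average-linkage definitions named elsewhere in the deck for clusters; the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Inter-object similarity: cosine between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def inter_cluster(ci, cj, linkage="average"):
    """Inter-cluster similarity between two non-empty lists of vectors."""
    sims = [cosine(u, v) for u in ci for v in cj]
    if linkage == "single":      # similarity of the closest pair
        return max(sims)
    if linkage == "complete":    # similarity of the farthest pair
        return min(sims)
    return sum(sims) / len(sims) # average over all cross-cluster pairs
```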

26
Hierarchical Cluster Partitioning
  • Goal: obtain a more natural multi-way tree from
    the binary tree generated by the HAC algorithm
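
One way to picture the partitioning is to cut the binary tree at a few similarity thresholds and make the pieces at each cut the children of one multi-way node. The sketch below is an illustration under assumptions: the tuple-based tree layout, the function names, and the hand-picked thresholds are all hypothetical, and the transcript does not show how the authors select cut levels. It also assumes merge similarities never increase toward the root.

```python
def leaves(node):
    """Collect leaf labels of a binary tree; a leaf is a string,
    an internal node is a (merge_similarity, left, right) tuple."""
    if isinstance(node, str):
        return [node]
    _, left, right = node
    return leaves(left) + leaves(right)

def cut(node, threshold):
    """Split a subtree into the pieces whose merge similarity is >= threshold."""
    if isinstance(node, str) or node[0] >= threshold:
        return [node]
    _, left, right = node
    return cut(left, threshold) + cut(right, threshold)

def to_multiway(node, thresholds):
    """Build a nested-list multi-way tree, one level per increasing threshold."""
    if isinstance(node, str):
        return node
    if not thresholds:
        return leaves(node)          # no cuts left: flatten the subtree
    parts = cut(node, thresholds[0])
    if len(parts) == 1:              # already tighter than this cut level
        return to_multiway(node, thresholds[1:])
    return [to_multiway(p, thresholds[1:]) for p in parts]
```

For example, `to_multiway((0.1, (0.3, (0.8, "C1", "C2"), "C3"), (0.7, "C4", "C5")), [0.5])` turns a four-merge binary tree into a root with three children.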

28
Hierarchical Cluster Partitioning
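  • To get a more natural multi-way tree from the
    binary tree generated by the HAC algorithm
  • (Figure: binary tree over clusters C1 to C5 with
    internal nodes C6 to C9 and cut levels l = 1 to
    4, shown beside the partitioned tree that keeps
    C9, C7, and C6 over C1 to C5)
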
30
Hierarchical Cluster Partitioning
  • To get a more natural multi-way tree from the
    binary tree generated by the HAC algorithm
  • (Figure: tree with nodes C9, C8, C7, and C6 at
    cut levels l = 1 to 4; labels T1 to T5 and
    clusters C1 to C5 at the bottom)

31
Cluster Quality Measure
  • Min-max cut
  • Minimize the inter-similarities
  • Maximize the intra-similarities
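
The slide's formula is not preserved in the transcript. A minimal sketch, assuming the commonly cited form of the min-max cut objective, which sums the ratio of inter-cluster to intra-cluster similarity over all clusters so that lower values indicate a better partition. The function name and the matrix layout are hypothetical.

```python
def min_max_cut(sim, clusters):
    """Min-max cut score of a partition; lower is better.

    sim: symmetric similarity matrix (list of lists) with a positive diagonal.
    clusters: list of lists of item indices that partition range(len(sim)).
    """
    n = len(sim)
    total = 0.0
    for members in clusters:
        inside = set(members)
        # Intra-similarity: all pairs within the cluster (including self-pairs).
        intra = sum(sim[a][b] for a in inside for b in inside)
        # Inter-similarity: pairs linking the cluster to everything outside it.
        inter = sum(sim[a][b] for a in inside for b in range(n) if b not in inside)
        total += inter / intra
    return total
```

A partition that keeps similar items together scores lower than one that separates them, which is the "minimize inter, maximize intra" trade-off in a single number.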

32
Experiment Data
  • Yahoo! CS directory
  • 36, 177, and 278 category names in the 1st, 2nd,
    and 3rd levels, respectively
  • Yahoo! People/Scientists
  • 250 famous people in 9 science fields
  • Paper
  • QuizNLQ
  • 163 quiz questions in 7 categories

33
Evaluation Metric F-measure
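
The formula on this slide is not preserved in the transcript. As a sketch under that caveat, one widely used F-measure for hierarchical clustering matches each reference class to its best-scoring cluster anywhere in the hierarchy and averages the resulting F1 values weighted by class size. Whether the authors used exactly this variant is an assumption, and the function name is hypothetical.

```python
def f_measure(classes, clusters):
    """Class-size-weighted best-match F1.

    classes: list of sets of item ids (the reference categories).
    clusters: list of sets of item ids (every node of the generated hierarchy).
    """
    n = sum(len(c) for c in classes)
    score = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            overlap = len(cls & clu)
            if overlap == 0:
                continue
            precision = overlap / len(clu)
            recall = overlap / len(cls)
            best = max(best, 2 * precision * recall / (precision + recall))
        score += len(cls) / n * best
    return score
```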
34
Experimental Results Accuracy
  • Yahoo! CS category names with various
    inter-cluster similarities
  • Average-linkage in various data sets

35
Experimental Results Structure
HAC
36
Experimental Results Structure
HACP
37
Experimental Results Structure
  • Structure measure based on
  • Depth of the hierarchy
  • Total clusters
  • Average number of child clusters
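
These three quantities can be computed from a hierarchy in a single traversal. The sketch below assumes the hierarchy is stored as nested lists with strings as leaf items, and it counts every child of a cluster (sub-cluster or leaf item) toward the average; both choices are assumptions, since the transcript does not give the exact definitions.

```python
def structure_stats(tree):
    """Return (depth, total_clusters, average_children) for a nested-list tree.

    A list is a cluster; a string is a leaf item. Depth counts cluster levels,
    so a flat list of items has depth 1.
    """
    depth, clusters, children = 0, 0, 0
    stack = [(tree, 1)]
    while stack:
        node, level = stack.pop()
        if isinstance(node, str):
            continue                 # leaf items are not clusters
        clusters += 1
        children += len(node)
        depth = max(depth, level)
        stack.extend((child, level + 1) for child in node)
    average = (children / clusters) if clusters else 0.0
    return depth, clusters, average
```

A binary HAC tree tends to be deep with an average of two children per cluster, while a partitioned multi-way tree is shallower and wider, which is the contrast these measures are meant to expose.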

38
User Evaluation Comprehension
  • Five qualitative measures
  • Cohesiveness
  • Isolation
  • Hierarchy
  • Navigation balance
  • Readability

39
User Evaluation Usability
  • Compare with the hierarchies constructed by
    humans
  • Group I: constructs a hierarchy from scratch
  • Group II: uses the auto-generated hierarchy as a
    reference

40
Conclusion
  • Current contributions
  • Studied the feasibility of using search-result
    snippets and clustering techniques for general
    text-segment taxonomy generation
  • Extensive experiments were performed on several
    typical data sets, and the results are promising
  • Future work
  • Apply the approach on more types of text segments
  • Investigate good cluster naming to enhance its
    usability
  • Explore techniques to deal with longer text
    segments that have fewer search-result snippets