Information Summarizing from Documents with User-Defined Classes

1
Information Summarizing from Documents with
User-Defined Classes
  • Hsin-Chen Chiao
  • Institute of Information Science
  • Academia Sinica

2
Introduction (I)
News-group
Web Pages
Digital Libraries
Files
Blog
3
Introduction (II)
Summarization
Clustering
Classification
4
LiveSum
  • Extracting possible keyterms from documents
  • Labeling user-defined concepts on the keyterms
  • Locating main characters, locations,
    organizations and their concepts (defined by
    users)

5
System Overview
[Figure: Documents -> LiveSum -> Information Summarizing Results]
6
Example
7
Term Extraction: Human Names
  • Chinese Word Segmentation System
  • Chinese Knowledge Information Processing Group,
    Academia Sinica
  • Segments Chinese sentences into the shortest
    meaningful words, each tagged with its grammar
    type (part of speech)
  • E.g. 資訊科學所 (Institute of Information Science)
    -> 資訊 (Information) (Na), 科學 (Science) (Na),
    所 (Institute) (D)
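
The segmentation step can be illustrated with a toy segmenter. This is only a sketch: it uses greedy forward maximum matching over a three-word lexicon, which is a stand-in technique and not the CKIP system's actual method. The function name `segment`, the `LEXICON` table, and the `UNK` tag are hypothetical, and the Chinese characters are an assumed rendering of the slide's example.

```python
# Toy lexicon: word -> grammar tag. Tags are copied from the slide's example;
# the lexicon itself is an illustrative assumption.
LEXICON = {"資訊": "Na", "科學": "Na", "所": "D"}


def segment(sentence, lexicon=LEXICON):
    """Greedy forward maximum matching: at each position take the
    longest lexicon word that starts there."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if word in lexicon:
                tokens.append((word, lexicon[word]))
                break
        else:
            # No lexicon word starts here: emit one character as unknown.
            size = 1
            tokens.append((sentence[i], "UNK"))
        i += size
    return tokens


for word, tag in segment("資訊科學所"):
    print(word, tag)
```

A real segmenter also has to resolve ambiguity and recognise unknown words such as human names, which is why the slides rely on an existing system instead.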

8
Term Extraction: Long Terms
  • PAT Tree
  • Chinese significant terms tend to be long
  • Uses a suffix tree to represent the data and
    mutual information to select candidate terms
  • Reference: Chien, L.-F., Huang, T.-I., and Chien,
    M.-C. PAT-tree-based keyword extraction for
    Chinese information retrieval. In Proceedings of
    ACM SIGIR, pp. 50-58, 1997.
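
To make the mutual-information idea concrete, here is a minimal sketch. It assumes pointwise mutual information between the two halves of a candidate string, and it replaces the PAT tree with a plain n-gram counter for brevity; the scoring used in the cited paper may differ. The names `ngram_counts` and `pmi` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n.
    (A PAT tree or suffix tree would give the same counts more compactly.)"""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def pmi(left, right, counts, total):
    """Pointwise mutual information of the string left+right, in bits.
    `total` is used as a shared normaliser for all n-gram lengths,
    which is an approximation."""
    joint = counts[left + right]
    if joint == 0 or counts[left] == 0 or counts[right] == 0:
        return float("-inf")
    p_joint = joint / total
    p_left = counts[left] / total
    p_right = counts[right] / total
    return math.log2(p_joint / (p_left * p_right))


text = "abab"
counts = ngram_counts(text, max_n=2)
score = pmi("a", "b", counts, total=len(text))
```

A high score means the two parts occur together more often than chance would predict, so the joined string is a better candidate term.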

9
LiveClassifier
[Figure: the term "Michael Jordan" is classified against a concept tree (Science, Art, Sport) using search engines and kNN; result: Sport]
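
A minimal kNN sketch of this classification step follows. It assumes each class example and each query term is represented as a bag of words (in LiveClassifier the text would come from the Web; here it is hand-written toy data), and that similarity is cosine similarity. The names `cosine`, `knn_rank`, and `EXAMPLES` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_rank(query, examples, k=3):
    """Return class labels ranked by the summed similarity of the
    k examples nearest to the query."""
    nearest = sorted(examples, key=lambda ex: cosine(query, ex[1]), reverse=True)[:k]
    scores = Counter()
    for label, vector in nearest:
        scores[label] += cosine(query, vector)
    return [label for label, _ in scores.most_common()]


# Toy training examples (assumed, not from the slides).
EXAMPLES = [
    ("Sport", Counter("basketball nba bulls".split())),
    ("Sport", Counter("nba playoffs game".split())),
    ("Science", Counter("physics lab research".split())),
    ("Art", Counter("painting gallery museum".split())),
]

query = Counter("nba basketball game".split())
ranking = knn_rank(query, EXAMPLES)
```

The function returns a ranked list instead of a single label so that the same output can feed a top-n evaluation.
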
10
Experiment Corpus
  • NTCIR-2 Chinese data
  • Three kinds of experiments: Term Extraction
    Accuracy, Usage Test, and Classification Accuracy

11
Experiment (I): Chinese Word Segmentation and PAT
Tree Test
  • Randomly select 50 articles
  • Humans label the correct terms
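
Scoring extracted terms against the human labels can be sketched as below. The slides report precision; recall is included here as an assumption, and the function name `precision_recall` is hypothetical.

```python
def precision_recall(extracted, gold):
    """Compare machine-extracted terms with human-labeled terms.
    precision = correct extracted / all extracted
    recall    = correct extracted / all human-labeled"""
    extracted, gold = set(extracted), set(gold)
    hits = len(extracted & gold)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: four extracted terms, three human-labeled terms, two in common.
p, r = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```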

12
Experiment (I): Discussion
  • Precision is low because:
  • The machine extracts many terms that a human
    would easily ignore, e.g. 美國政府 (The United
    States Government).
  • Common phrases are also extracted, but they can
    be filtered out easily, e.g. the Chinese phrase
    for "that is to say".

13
Experiment (II): Usage Test
  • Each keyterm is labeled with one of the 16
    classes
  • Record the time taken to extract keyterms and
    label concepts
  • Pre-processing time: 3 mins 42 secs

14
Experiment (III): Inclusion Rate Test
  • Yahoo Directory Tree: 16 classes
  • Arts, Entertainments, Sports, Education, Science,
    Region, Traveling, Internet Guides, Business,
    Computers, Health, Media, Social, Politics, and
    Books
  • Each keyterm is assigned only one class
  • After classification, each keyterm gets a
    ranked list of the 16 classes sorted by
    similarity score.
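
The slides do not define the inclusion rate. The sketch below assumes it means the fraction of keyterms whose assigned class appears within the top n entries of the ranked list; the function name `inclusion_rate` and the sample data are hypothetical.

```python
def inclusion_rate(rankings, gold, n):
    """Fraction of keyterms whose correct class is among the top n
    classes of its ranked list.

    rankings: keyterm -> list of classes, best first
    gold:     keyterm -> the single class assigned to it"""
    if not rankings:
        return 0.0
    hits = sum(1 for term, ranked in rankings.items() if gold[term] in ranked[:n])
    return hits / len(rankings)


# Toy data (assumed): two keyterms, each with a two-class ranked list.
rankings = {
    "term1": ["Sports", "Health"],
    "term2": ["Politics", "Health"],
}
gold = {"term1": "Sports", "term2": "Health"}

top1 = inclusion_rate(rankings, gold, n=1)
top2 = inclusion_rate(rankings, gold, n=2)
```

Under this reading, raising n can only keep the rate the same or increase it.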

15
Experiment (III): Inclusion Rate Test
16
Discussion
  • Department of Health belongs to both Health and
    Politics
  • A singer who shot an Education-related commercial
    will be labeled as both Entertainments and
    Education
  • Because LiveClassifier is used, all the data we
    collect come from the Web, which might differ
    from what people actually think.
  • Germany: Region and Sports

17
Conclusion
  • LiveSum combines two systems to help summarize
    keyterms in multiple articles.
  • LiveSum is highly sensitive to Internet
    documents: the concepts associated with a term
    may change over time.