Understanding the Flow of Content in Summarizing HTML Documents

Slides: 16
Provided by: ahmadfuadr

1
Understanding the Flow of Content in Summarizing
HTML Documents
  • A. Rahman, H. Alam, R. Hartono
  • Document Analysis and Recognition Team (DART)
  • BCL Computers Inc.
  • Santa Clara, Calif., USA

2
Basic Problem Statement
  • How do we summarize web-based documents?
  • Does HTML structure give us any clue to the
    understanding of the content?
  • Does flow of content have anything to do with the
    main message?

3
Why Summarization?
  • Display area of handheld devices, i.e. PDAs and
    cell phones, is too small for useful web browsing
  • Download times are still too long for comfortable
    browsing using wireless devices
  • Cost factor is still too high

4
Current need?
  • Viewing websites on small-screen handheld
    devices
  • Since websites are written in HTML, we need to
    translate them into formats that wireless
    devices can support.

5
Current Solutions
  • Handcrafting: custom web sites are typically
    crafted by hand by a set of content experts
  • Transcoding: HTML tags are replaced with suitable
    device-specific tags (HDML, WML, etc.)
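
To make the tag-replacement idea concrete, here is a minimal sketch in Python. It is not the method of any product named in these slides. The TAG_MAP table and the names NaiveTranscoder and transcode are assumptions made up for illustration, and the mapping is not a complete or validated HTML-to-WML conversion.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical HTML-to-WML tag mapping, for illustration only.
# Headings map to <b> here because this sketch assumes no heading tag on the target.
TAG_MAP = {
    "p": "p", "b": "b", "strong": "b",
    "i": "i", "em": "i", "h1": "b", "h2": "b",
}


class NaiveTranscoder(HTMLParser):
    """Replace mapped HTML tags, drop unmapped tags, keep all text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MAP:
            self.out.append(f"<{TAG_MAP[tag]}>")

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append(f"</{TAG_MAP[tag]}>")

    def handle_data(self, data):
        # Re-escape text so the output stays well-formed markup.
        self.out.append(escape(data))


def transcode(html):
    """Return the input with tags swapped according to TAG_MAP."""
    parser = NaiveTranscoder()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

For example, transcode("<h1>News</h1><div><em>Hi</em> there</div>") returns "<b>News</b><i>Hi</i> there". Every piece of text survives the conversion, so the page is no shorter than before. Only its markup changes.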

6
Handcrafting
  • Automation
  • Use of XML
  • There is no standard XML tag set (Document Type
    Definition, DTD) in use by vendors.
  • XML has been available to web designers for the
    last 10 years. Examination of websites shows
    little use of document structural elements.
  • Web masters see themselves as artists rather than
    programmers.
  • XML may meet the same fate as SGML, an earlier
    attempt to create structured documents.

7
Handcrafting
  • Take an existing website and make it available
    for wireless access. Aether Systems, Mshift and
    2Roam currently offer this type of solution.
  • Use a proprietary graphical interface to ease the
    development of wireless applications from
    scratch. Covigo and iConverse offer this type of
    solution.
  • Let the user do all coding in languages such as
    C or Java. ThinAirApps offers this type of
    solution.

8
Handcrafting
  • Labor intensive
  • Expensive
  • Typically less than 1% of a web site gets
    converted to wireless content.

9
Transcoding
  • Transcoding was introduced in Japan during
    1999-2000. It was widely rejected by Japanese
    users.
  • Recently, Google and Pixo introduced this
    solution for the US market, but have so far
    failed to attract the attention of end users.

10
The Alternate Solution
  • Separate the content into smaller segments
  • Generate a summary of these segments
  • Prioritize these summaries from individual
    segments
  • Put these together to form a summary of the
    overall document
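
The four steps above can be sketched as a small pipeline. The slides do not say how each step is carried out, so every heuristic here is an assumption chosen for brevity: segments are blank-line-separated blocks, a segment summary is its first sentence, and priority is word count. All function names are hypothetical.

```python
import re


def split_segments(text):
    """Step 1: separate the content into smaller segments (blank-line blocks)."""
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]


def summarize_segment(segment):
    """Step 2: summarize one segment (assumed heuristic: its first sentence)."""
    first = re.split(r"(?<=[.!?])\s+", segment, maxsplit=1)[0]
    return " ".join(first.split())  # normalize internal whitespace


def priority(segment):
    """Step 3: score a segment (assumed heuristic: longer means more important)."""
    return len(segment.split())


def summarize_document(text, max_items=3):
    """Step 4: put the summaries of the top-ranked segments together."""
    segments = split_segments(text)
    ranked = sorted(segments, key=priority, reverse=True)[:max_items]
    return [summarize_segment(s) for s in ranked]
```

Each step is a separate function, so any one heuristic can be replaced without touching the others, for example swapping word count for a score based on position in the page.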

11
Summarization vs. Transcoding
  • Long displays
  • Long download times
  • Finding information is difficult
  • No mapping of the importance of content in the
    original document

12
Steps to Summarization
  • Structural analysis: Understanding the
    relationship of the various segments with the
    document
  • Decomposition: Breaking down these segments into
    operational units
  • Contextual analysis: Employment of context to
    revise the segmentation
  • (Continued >)

13
Steps to Summarization (Continued)
  • Labeling > Segment Summary: Extraction of a
    low-level summary of the segment
  • Priority: Estimating the importance of these
    segments
  • Table of Contents (TOC) > Document Summary:
    Putting together a summary of the document
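
As a rough illustration of how structural analysis, labeling, priority and the table of contents might fit together, the sketch below treats each HTML heading as the start of a segment. This is an assumption, not the authors' algorithm. The names SegmentCollector and table_of_contents are hypothetical, and contextual analysis is left out entirely.

```python
from html.parser import HTMLParser

# Assumed priority: a smaller number means a more important heading.
HEADINGS = {"h1": 1, "h2": 2, "h3": 3}


class SegmentCollector(HTMLParser):
    """Structural analysis: start a new segment at every heading."""

    def __init__(self):
        super().__init__()
        self.segments = []  # each item: {"level": int, "label": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.segments.append({"level": HEADINGS[tag], "label": "", "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.segments:
            return  # text before the first heading is ignored in this sketch
        key = "label" if self._in_heading else "text"
        self.segments[-1][key] += data


def table_of_contents(html):
    """Label segments by heading text and order them by assumed priority."""
    collector = SegmentCollector()
    collector.feed(html)
    collector.close()
    # sorted() is stable, so document order is kept within one heading level.
    ranked = sorted(collector.segments, key=lambda s: s["level"])
    return [" ".join(s["label"].split()) for s in ranked]
```

A page with no headings produces an empty table of contents here, which is one concrete way in which HTML structure alone can fall short.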

14
Supported Devices and Formats
  • PDAs (HTML 3.2)
  • Cell phones
  • USA/Europe: WAP
  • Japan: iMode (NTT DoCoMo), J-Sky (J-Phone),
    EZWeb (KDDI)

15
Conclusion
  • It is a good idea to use the flow of content in
    understanding web documents
  • Content can be used effectively to summarize web
    documents
  • HTML structure is a good starting point, but not
    enough to understand context
  • Summarization offers significant advantages over
    transcoding
  • Summarization also provides a faster browsing
    experience