Hiroshi Nakagawa - PowerPoint PPT Presentation

1
GJWS for IR and NLP 2001
NLP for Mobile Environment
  • Hiroshi Nakagawa
  • (University of Tokyo)

2
Prologue
  • This workshop was last held one year and five
    months ago in Yokohama.
  • Since then my academic interests have changed
    dramatically, triggered by the director of the
    University of Tokyo library.
  • Of the 7 million books our university holds, 4
    million are cataloged electronically in the OPAC.
  • He asked me to build a system that lets us
    access this OPAC from mobile terminals such as
    mobile phones and PDAs.

3
  • We then started to develop a prototype system.
  • During this development, I realized that NLP
    technologies are necessary for this type of
    system.
  • As for input, Prof. Tanaka has already
    proposed an excellent idea.
  • As for output, the retrieved contents must be
    transformed into a readable and understandable
    form on the small screen of a mobile terminal.
  • This transformation can only be accomplished
    with new NLP technologies.

4
Current mobile terminals
  • i-mode mobile phone
  • 8 characters x 6 lines to 10 characters x 9 lines
    (in Japanese); with the Latin character set, 16 to
    20 characters per line
  • C-HTML (contents description language)
  • WAP terminal
  • WML, HDML (card and deck structure)
  • PDA
  • PalmOS, WindowsCE

5
Current Situation of Mobile Environment
  • Mobile phone subscribers now outnumber ordinary
    (traditional) phone subscribers in Japan.
  • The functions of mobile terminals, including cell
    phones, are coming to be on a par with those of
    PCs.
  • IMT-2000 enables a broadband communication
    channel.
  • The screen of a mobile terminal is still very
    small and will remain so.

6
New Paradigm for Contents Authoring
  • There is a variety of contents description
    languages and mobile terminals.
  • Converting contents by hand to fit each
    individual language and terminal device requires
    too much work, time and cost.
  • The solution is a single source of contents that
    can fit every device through translation into each
    description language, namely
  • Transcoding or One source Multi device

7
[Diagram: transcoding pipeline]
  • Input: original plain text or XML, and Web
    contents in HTML
  • NLP technologies such as summarization,
    paraphrasing, hyphenation, hypertext, table
    normalization
  • Mobile-XML: constraints for displaying are
    expressed as tagged contents
  • Transcoding: converting Mobile-XML contents into
    the form well displayed on each mobile terminal
    screen
  • Output: CHTML, WAP (HDML)

8
Contents description language containing tags
that constrain the way to display
  • Contents for mobile terminal display
  • An i-mode editor (i-edom) already exists.
  • The original contents are converted into
    trans-codable contents based on XML, with tags
    that constrain how every part, such as a word, is
    displayed.
  • The general trans-codable contents are finally
    transformed automatically into highly readable
    contents in each language, such as CHTML, WML,
    etc.
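
The one-source, multi-device idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source element names (content, title, para) are hypothetical and are not the Mobile-XML of this talk, and the CHTML and WML outputs are simplified fragments without the declarations a real terminal would expect.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical one-source document; element names are invented for this sketch.
SOURCE = "<content><title>Library hours</title><para>Open 10:00 to 17:00</para></content>"


def to_chtml(root: ET.Element) -> str:
    """Render the source as a simplified CHTML page."""
    title = escape(root.findtext("title", default=""))
    body = "".join(f"<p>{escape(p.text or '')}</p>" for p in root.iter("para"))
    return f"<html><head><title>{title}</title></head><body>{body}</body></html>"


def to_wml(root: ET.Element) -> str:
    """Render the source as a simplified WML deck with a single card."""
    title = escape(root.findtext("title", default=""))
    body = "".join(f"<p>{escape(p.text or '')}</p>" for p in root.iter("para"))
    return f'<wml><card title="{title}">{body}</card></wml>'


# One renderer per target description language.
TARGETS = {"chtml": to_chtml, "wml": to_wml}


def transcode(source_xml: str, target: str) -> str:
    """Parse the single source once and hand it to the chosen renderer."""
    return TARGETS[target](ET.fromstring(source_xml))


if __name__ == "__main__":
    print(transcode(SOURCE, "chtml"))
    print(transcode(SOURCE, "wml"))
```

Supporting a further terminal type then means adding one renderer to TARGETS, while the source contents stay untouched.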

9
Example of constraints that make contents
readable
  • <p> of HTML forces a new line to be inserted.
  • Soft constraints express how to display contents
    on a mobile terminal's small screen by
    inserting extra tags of each contents description
    language.
  • Example: the no-line-break tag <no-line-break>
  • What <no-line-break>strategic
    soft</no-line-break> means is that
    <no-line-break>UML based software
    </no-line-break>.
  • What<br>
  • strategic soft
  • means is that<br>
  • UML based software
  • The aim of inserting <br> tags is that "strategic
    soft" and "UML based software" are each kept on
    one line rather than split across two. If a screen
    is wide enough, we might not need to insert any
    tags.
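
A minimal sketch of how such a soft constraint could drive line breaking, assuming a greedy line filler and a fixed screen width counted in characters. The function names and the use of a non-breaking space as an internal marker are choices made for this illustration, not part of the system described in the talk.

```python
import re

NBSP = "\u00a0"  # internal marker for spaces that must not become line breaks
NLB = re.compile(r"<no-line-break>(.*?)</no-line-break>", re.DOTALL)


def protect(marked_up: str) -> str:
    """Drop the tags and glue the words inside each span with NBSP."""
    return NLB.sub(lambda m: NBSP.join(m.group(1).split()), marked_up)


def wrap(marked_up: str, width: int) -> list[str]:
    """Greedy line filling that never breaks inside a protected span."""
    lines, current = [], ""
    # Split on plain spaces only: str.split() with no argument would also
    # split on NBSP and undo the protection.
    for unit in protect(marked_up).split(" "):
        if not unit:
            continue
        candidate = unit if not current else current + " " + unit
        if not current or len(candidate) <= width:
            current = candidate
        else:
            lines.append(current)
            current = unit
    if current:
        lines.append(current)
    return [line.replace(NBSP, " ") for line in lines]


TEXT = ("What <no-line-break>strategic soft</no-line-break> means is that "
        "<no-line-break>UML based software</no-line-break>.")

if __name__ == "__main__":
    for line in wrap(TEXT, 16):
        print(line)
```

With a width of 16 characters the two protected phrases each stay on one line; with a width of 80 the whole sentence fits on a single line and no break is needed. A unit longer than the screen is still placed on its own line rather than split.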

10
  • In mark-up languages, the dichotomy between
    logical structure and physical rendering has been
    a tradition. But
  • No line break
  • Hypertext
  • Normalizing table data, etc.
  • are all about rendering and at the same time
    logical structure.
  • From this fact, we have to note that rendering
    and logical structure cannot be separated!
  • Thus, we need a contents description language
    (maybe based on XML) to deal with this.

11
Small screen and big screen
  • We need a contents displaying architecture that
    treats the small screens of mobile terminals and
    the big screens of ordinary PCs uniformly.
  • Intensive reading with enough time on a PC screen
  • versus quick reading on a mobile terminal
  • Research from a cognitive science perspective
  • Images, maps, etc. are also to be considered.

12
What is the problem in displaying contents on the
small screen of a mobile terminal?
  • To easily recognize the contents' meaning (line by
    line)
  • To browse the whole contents on one screen
  • To comprehend the meaning of contents smoothly
    and rapidly
  • To reduce the number of clicks for scrolling
    needed to find what we want to know
  • To reduce the amount of communication data
  • How to use colors, fonts, pictograms, tables,
    figures, etc.
  • How to display table data more understandably
  • What specification of the contents description
    language Mobile-XML fulfills these
    requirements?

13
Bad example of new-lines
  • Starting
  • time of
  • talk is 1
  • 4:45. Then
  • ten minutes
  • earlier sit
  • down in
  • your
  • reserved
  • seat

14
Good example
  • Starting time
  • of talk is
  • 14:45.
  • Then
  • ten minutes earlier
  • sit down
  • in your reserved
  • seat.

15
  • What kind of linguistic unit should not be split
    across two lines, for readability and
    understandability?
  • To automatically recognize unbreakable linguistic
    units
  • What is this kind of linguistic unit?
  • Time expressions such as 14:00
  • Proper nouns such as personal names
  • Linguistic insight and cognitive factors
  • Machine learning
  • To use paraphrasing to avoid unwanted new-lines

16
Paraphrasing
  • Paraphrasing
  • Two o'clock in the afternoon → 14:00
  • The Secretary of State Colin Powell → Powell
  • Hongo, Bunkyo, Tokyo, JAPAN → Hongo, Bunkyo,
    or Hongo
  • Inheritance from context, e.g. in the context of
    a library,
  • The library opens at 10 AM and closes at 5 PM
  • → 10 AM to 5 PM
  • Abbreviation
  • Electrotechnical Laboratory → ETL
17
Paraphrasing
  • Omission of noun phrases, case particles, and
    verbs
  • The upper house finally started to vote → The
    upper house to vote
  • In what contexts can we omit these?
  • In verb omission, what is the understandable use
    of particles?
  • Particle omission
  • (the montage is open to the public)
  • → (montage opened)
  • What is the condition for particle omission?
  • Replace a word with a particle or a symbol
  • The police suspect arson → Arson?
  • When can we replace which words with "?"?

18
How to conduct our research
  • Building a corpus (ordinary article paired with
    summarized article)
  • Collect newspaper articles on the Internet
  • Collect newspaper articles for mobile terminals
    such as i-mode
  • 200 Web news articles a day,
  • 40 i-mode articles a day
  • Alignment of these two corpora
  • Within one day, similarity based on term
    occurrence, by cosine measure
  • Recall 0.6 to 0.7, Precision 1.0
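
The alignment step can be sketched as follows, assuming whitespace tokenization, raw term counts, and a similarity threshold of 0.3. All three are assumptions made for this illustration: Japanese text would need a morphological analyzer for tokenization, and the talk does not state a threshold.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def align(web: dict, mobile: dict, threshold: float = 0.3) -> dict:
    """Map each mobile article to its most similar same-day Web article."""
    web_vecs = {wid: Counter(text.lower().split()) for wid, text in web.items()}
    pairs = {}
    for mid, text in mobile.items():
        mvec = Counter(text.lower().split())
        best = max(web_vecs, key=lambda wid: cosine(mvec, web_vecs[wid]), default=None)
        if best is not None and cosine(mvec, web_vecs[best]) >= threshold:
            pairs[mid] = best
    return pairs


def precision_recall(predicted: dict, gold: dict) -> tuple:
    """Precision and recall of predicted pairs against hand-checked pairs."""
    correct = sum(1 for k, v in predicted.items() if gold.get(k) == v)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data for one day; the article ids are hypothetical.
WEB = {
    "w1": "Tochigi regional governor Akio Fukuda declared that he will proceed the Nanma dam plan",
    "w2": "The upper house finally started to vote",
}
MOBILE = {"m1": "Governor Fukuda declared to proceed Nanma dam plan"}

if __name__ == "__main__":
    print(align(WEB, MOBILE))
```

A high threshold tends to favor precision over recall, which is one possible reading of the reported figures (precision 1.0, recall 0.6 to 0.7).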

19
(2) Regional governor Fukuda declared he will proceed
with the Nanma dam plan. Residents there got angry,
saying "Election-time promise discarded."

Nanma dam: Governor declared he will proceed with the
plan. Residents there object, saying "Election-time
promise discarded."

Tochigi regional governor Akio Fukuda declared on the
8th that he will proceed with the Nanma dam, a national
river development plan, at a reduced scale. In last
November's election campaign he promised to cancel two
dam plans in the region, and was elected for the first
time. The residents who opposed the plan are angry,
saying "He discarded his promise."

20
Summarization strategy
  • The first paragraph is important in newspaper
    articles
  • Titles or lead sentences are used as seeds. Then
    extra information is added.
  • If the articles describe the same incident, the
    proper names become abbreviated
  • Summarizing multiple texts about the same incident
  • What is the important information among these?
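
The seed-and-extend strategy can be illustrated with a small sketch, assuming the title is the seed, the lead sentences are already split, and the budget is counted in characters. The function name and the simple stop-at-first-overflow rule are assumptions for illustration only.

```python
def lead_summary(title: str, lead_sentences: list, budget: int) -> str:
    """Start from the title and add lead sentences while the budget allows."""
    parts = [title]
    used = len(title)
    for sentence in lead_sentences:
        cost = 1 + len(sentence)  # one joining space plus the sentence
        if used + cost > budget:
            break
        parts.append(sentence)
        used += cost
    return " ".join(parts)


TITLE = "Nanma dam"
LEAD = ["Governor declared to proceed the plan.", "Residents oppose."]

if __name__ == "__main__":
    print(lead_summary(TITLE, LEAD, 50))
```

The title is always kept, even if it alone exceeds the budget; shortening the title itself would be a job for paraphrasing.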

21
Conclusions
  • We started this project because of one sentence
    from the library director.
  • Everything is just starting.
  • But many people seem to agree on the importance of
    this problem. In fact, we received three research
    grants this year for this project.

22
Related works
  • Automatic summary generation from Web pages
  • Adam Berger & Vibhu Mittal, "OCELOT: A System for
    summarizing web pages", SIGIR 2000, 144-151, 2000
  • Closed caption generation on TV by sentence-final
    part compression with paraphrasing
  • Sirai and Ehara (title in Japanese), TAO
    Workshop on TV Closed Caption, 7-30, 1999
  • IBM WebSphere Transcoding Publisher
  • http://www-6.ibm.com/jp/software/network/transcoding/
  • HTML → CHTML, HDML
  • IBM's Dharma transforms the original contents
    tree structure (M-tree) into a V-tree, which is
    suitable for display on mobile terminals.
  • Kitayama and Hirose, Dharma (title and journal
    in Japanese), 46-2, 576-581, 2001

23
Related Products
  • Argo Group Inc.
  • http://www.argogroup.co.jp/product.html
  • Transforms i-mode contents into EZweb (one of the
    WAP languages) and J-Sky contents.
  • Japan Jeecom "Mobile Commerce Toolkit"
  • http://www.jeecom.co.jp/jeecom/030012-mct.php3
  • Device independent contents editing tool kit
  • NTT-ME "MITAi-kun"
  • http://www.ntt-me.co.jp/mitai/
  • V-Campus mobile environment at Rikkyo Univ.