Title: Hiroshi Nakagawa
1GJWS for IR and NLP 2001
NLP for Mobile Environment
- Hiroshi Nakagawa
- (University of Tokyo)
2Prologue
- This workshop was last held one year and five months ago in Yokohama.
- My academic interest changed dramatically, triggered by the director of the library of the Univ. of Tokyo.
- Of the 7 million books held by our university, 4 million are electronically cataloged in the OPAC.
- He asked me to build a system that lets us access this OPAC from mobile terminals such as mobile phones and PDAs.
3- Then we started to develop the prototype system.
- During this development, I realized that NLP technologies are necessary for this type of system.
- As for the input system, Prof. Tanaka has already proposed an excellent idea.
- As for output, the retrieved contents must be transformed into a readable and understandable form on the small screen of a mobile terminal.
- This transformation can only be accomplished with new NLP technologies.
4Current mobile terminals
- i-mode mobile phone
- 8 characters x 6 lines to 10 characters x 9 lines (in Japanese); with the Latin character set, 16 to 20 characters per line
- C-HTML (contents description language)
- WAP terminal
- WML, HDML (card and deck structure)
- PDA
- PalmOS, WindowsCE
5Current Situation of Mobile Environment
- Mobile phone subscribers have outnumbered ordinary (traditional) phone subscribers in Japan.
- The functions of mobile terminals, including cell phones, have been coming to be on a par with those of PCs.
- IMT-2000 enables a broadband communication channel.
- The screen of a mobile terminal is still very small and will remain so.
6New Paradigm for Contents Authoring
- A variety of contents description languages and mobile terminals
- Converting contents by hand to fit each individual language and terminal device requires too much work, time, and cost.
- The solution is a single source of contents that can fit every device through translation into each description language, namely
- Transcoding, or One Source Multi Device
7[Diagram: transcoding pipeline]
- Input: original contents in plain text or XML; Web contents in HTML
- NLP technologies such as summarization, paraphrasing, hyphenation, hypertext, and table normalization
- Mobile-XML: constraints for displaying are expressed as tagged contents
- Transcoding: converting Mobile-XML contents into a form that displays well on each mobile terminal screen
- Output: CHTML, WAP (HDML)
8Contents description language containing tags that constrain the way to display
- Contents for mobile terminal display
- An i-mode editor (i-edom) already exists.
- Converting the original contents into trans-codable contents based on XML, with tags that constrain how every part, such as a word, is displayed.
- The general trans-codable contents are finally transformed, automatically, into highly readable contents in each language such as CHTML, WML, etc.
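The final step, turning one general source into each target language, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the project's implementation: the function names (`to_chtml`, `to_wml`, `transcode`) and the flat title-plus-paragraphs source model are hypothetical, and the emitted markup is a bare skeleton of CHTML and WML rather than a complete document in either language.

```python
from html import escape


def to_chtml(title, paragraphs):
    """Render the source as a skeleton Compact HTML page (hypothetical helper)."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return (
        f"<html><head><title>{escape(title)}</title></head>"
        f"<body>{body}</body></html>"
    )


def to_wml(title, paragraphs):
    """Render the same source as a single-card WML deck (hypothetical helper)."""
    body = "".join(f"<p>{escape(p)}</p>" for p in paragraphs)
    return f'<wml><card id="main" title="{escape(title)}">{body}</card></wml>'


# One source, several devices: each target language gets its own renderer.
RENDERERS = {"chtml": to_chtml, "wml": to_wml}


def transcode(title, paragraphs, target):
    """Dispatch the single source to the renderer for the requested target."""
    return RENDERERS[target](title, paragraphs)


if __name__ == "__main__":
    source = ("OPAC", ["Search results: 3 books", "Tap a title for details"])
    for target in RENDERERS:
        print(target, transcode(*source, target))
```

The point of the design is that supporting a new terminal only means adding one more renderer to the table; the source contents stay untouched.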
9Example of constraints which enable contents to be readable
- <p> of HTML forces a new line to be inserted.
- Soft constraints express how to display contents on a mobile terminal's small screen by inserting extra tags of each contents description language.
- Example: the no-line-break tag <no-line-break>
- What <no-line-break>strategic soft</no-line-break> means is that <no-line-break>UML based software</no-line-break>.
- What<br>
- strategic soft
- means is that<br>
- UML based software
- The aim of inserting <br> tags is that "strategic soft" and "UML based software" are each kept on a single line. If a screen is wide enough, we might not need to insert any tags.
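How a soft constraint of this kind could drive line breaking can be sketched as follows. This is a minimal sketch, assuming a greedy line filler and a screen width counted in characters; the helper names `split_units` and `wrap_units` are hypothetical, and a real transcoder would emit the target language's own break tags instead of returning a list of lines.

```python
import re

# Matches one protected span; the tag name follows the slide's example.
TAG = re.compile(r"<no-line-break>(.*?)</no-line-break>", re.S)


def split_units(marked_text):
    """Split text into breakable units; each protected span stays one unit."""
    units, pos = [], 0
    for match in TAG.finditer(marked_text):
        units.extend(marked_text[pos:match.start()].split())
        units.append(" ".join(match.group(1).split()))
        pos = match.end()
    units.extend(marked_text[pos:].split())
    return units


def wrap_units(units, width):
    """Greedily fill lines of at most `width` characters without splitting a unit."""
    lines, current = [], ""
    for unit in units:
        candidate = unit if not current else current + " " + unit
        if len(candidate) <= width or not current:
            current = candidate  # fits, or the unit alone is wider than the screen
        else:
            lines.append(current)
            current = unit
    if current:
        lines.append(current)
    return lines


if __name__ == "__main__":
    text = ("What <no-line-break>strategic soft</no-line-break> means is that "
            "<no-line-break>UML based software</no-line-break>")
    for line in wrap_units(split_units(text), width=18):
        print(line)
```

With an assumed width of 18 characters the sketch produces the four lines shown on the slide, and with a sufficiently wide screen it produces a single line, so no break is needed at all.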
10- In mark-up languages, the dichotomy between logical structure and physical rendering has been a tradition. But
- No line break
- Hypertext
- Normalizing table data, etc.
- are all about rendering and, at the same time, about logical structure.
- From this fact, we have to note that rendering and logical structure cannot be separated!
- Thus, we need a contents description language (maybe based on XML) to deal with this.
11Small screen and big screen
- We have to cope with a contents displaying architecture that treats the small screens of mobile terminals and the big screens of ordinary PCs uniformly.
- Intensive reading, with enough time, on a PC screen
- versus quick reading on a mobile terminal
- Research from a cognitive science perspective
- Images, maps, etc. are also to be considered.
12What is the problem in displaying contents on the small screen of a mobile terminal?
- To easily recognize the meaning of contents (line by line)
- To browse the whole contents on one screen.
- To comprehend the meaning of contents smoothly and rapidly
- To reduce the number of clicks for scrolling to find what we want to know
- To reduce the amount of communication data
- How to use colors, fonts, pictograms, tables, figures, etc.
- How to display table data more understandably
- What is the specification of a contents description language, Mobile-XML, that fulfills these requirements?
13Bad example of new-lines
(Japanese column not recoverable; English column shown line by line)
- Starting
- time of
- talk is 1
- 4:45. Then
- ten minutes
- earlier sit
- down in
- your
- reserved
- seat
14Good example
- Starting time
- of talk is
- 14:45.
- Then
- ten minutes earlier
- sit down
- in your reserved
- seat.
15- What kind of linguistic unit should not be split into two lines for readability and understandability?
- To automatically recognize unbreakable linguistic units
- What is this kind of linguistic unit?
- Time expressions such as 14:00
- Proper nouns such as personal names
- Linguistic insight and cognitive factors
- Machine learning
- To use paraphrasing to avoid unwanted new-lines.
16Paraphrasing
- Paraphrasing
- Two o'clock in the afternoon -> 14:00
- Secretary of State Colin Powell -> Powell
- Hongo, Bunkyo, Tokyo, JAPAN -> Hongo, Bunkyo, or Hongo
- Inheritance, i.e. in the context of a library,
- Library opens at 10AM and closes at 5PM
- -> 10AM-5PM
- Abbreviation
- Electrotechnical Laboratory -> ETL
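The simplest of these paraphrases, replacing a long expression with a fixed short form, can be sketched as a table lookup. This is a minimal sketch, assuming a hand-built table: the table entries are adapted from the slide's examples, the names `PARAPHRASES` and `shorten` are hypothetical, and context-dependent cases (such as the library example) are out of its scope.

```python
# Hand-built paraphrase table; entries adapted from the examples on the slide.
PARAPHRASES = {
    "Two o'clock in the afternoon": "14:00",
    "Secretary of State Colin Powell": "Powell",
    "Electrotechnical Laboratory": "ETL",
}


def shorten(text, table=PARAPHRASES):
    """Replace each long form with its short form, longest entries first."""
    for long_form in sorted(table, key=len, reverse=True):
        text = text.replace(long_form, table[long_form])
    return text


if __name__ == "__main__":
    print(shorten("Secretary of State Colin Powell visited the Electrotechnical Laboratory."))
```

Applying longer entries first is a small design choice that keeps a short entry from rewriting part of a longer one.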
17Paraphrasing
- Omission of noun phrases, case particles, and verbs
- The upper house finally started to vote -> The upper house to vote
- In what context can we omit these?
- In verb omission, what is the understandable use of particles?
- Particle omission
- Japanese example (gloss): the montage is open to the public
- -> (gloss) montage opened
- What is the condition of particle omission?
- Replace a word with a particle or a symbol
- The police suspect arson -> arson?
- When can we replace what with "?"?
18How to conduct our research
- Building a corpus of (ordinary article, summarized article) pairs
- Collect newspaper articles on the Internet
- Collect newspaper articles for mobile terminals such as i-mode
- 200 Web news articles a day,
- 40 i-mode articles a day
- Alignment of these two corpora
- Within one day, similarity based on term occurrences, by cosine measure
- Recall 0.6 to 0.7, Precision 1.0
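The alignment step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the procedure actually used: the names `term_counts`, `cosine`, and `align` are hypothetical, the tokenizer only handles lower-cased ASCII words (Japanese text would need a morphological analyzer), and the `threshold` value is an assumed parameter.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Bag of lower-cased ASCII word counts (stand-in for a real tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def align(summaries, articles, threshold=0.3):
    """Pair each summary with its most similar article if the score is high enough."""
    article_vectors = [term_counts(a) for a in articles]
    pairs = []
    for i, summary in enumerate(summaries):
        vector = term_counts(summary)
        scores = [cosine(vector, av) for av in article_vectors]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    articles = [
        "Governor Fukuda declared to proceed with the Nanma dam plan",
        "Mobile phone subscribers outnumbered fixed phone subscribers",
    ]
    summaries = ["Nanma dam: governor to proceed", "Weather will be sunny tomorrow"]
    print(align(summaries, articles))
```

Raising the threshold discards uncertain pairs, which trades recall for precision; that is one plausible way to end up with a high-precision, lower-recall alignment like the one reported above.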
19(2) Regional governor Fukuda declared that he will proceed with the Nanma dam plan. Residents there got angry, saying "Election-time promise discarded."
Nanma dam: Governor declared that he will proceed with the plan. Residents there oppose it, saying "Election-time promise discarded."
Tochigi regional governor Akio Fukuda declared on the 8th that he will proceed with the national river development plan, the Nanma dam, at a reduced scale. In last November's election campaign he promised to cancel two dam plans in the region, and was elected for the first time. The residents who opposed the plan are angry, saying "He discarded his promise."
------
20Summarization strategy
- The first paragraph is important in newspaper articles
- Titles or lead sentences are used as seeds. Then extra information is added
- If articles describe the same incident, the proper names become abbreviated
- Summarizing multiple texts about the same incident
- What is the important information among these?
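The seed-then-add idea can be sketched as a greedy procedure. This is a minimal sketch, assuming a fixed character budget derived from the screen size; the function name `seed_summary` and the budget value are hypothetical, and deciding which extra information is actually important remains the open question raised above.

```python
def seed_summary(title, lead_sentences, budget):
    """Start from the title as seed, then add lead sentences while they fit."""
    parts = [title]
    used = len(title)
    for sentence in lead_sentences:
        cost = len(sentence) + 1  # one extra character for the joining space
        if used + cost > budget:
            break  # stop at the first sentence that no longer fits
        parts.append(sentence)
        used += cost
    return " ".join(parts)


if __name__ == "__main__":
    title = "Nanma dam plan to proceed."
    lead = ["Residents are angry.", "The governor had promised to cancel it."]
    print(seed_summary(title, lead, budget=50))
```

Taking sentences in their original order reflects the observation that the first paragraph of a newspaper article carries the most important information.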
21Conclusions
- We started this project from one sentence from the library director.
- Everything is just starting.
- But many people seem to agree on the importance of this problem. In fact, we received three research funds this year for this project.
22Related works
- Automatic summary generation from Web pages
- Adam Berger and Vibhu Mittal, OCELOT: A System for Summarizing Web Pages, SIGIR 2000, 144-151, 2000
- Closed caption generation on TV by sentence-final part compression with paraphrasing
- Sirai and Ehara, [title in Japanese], TAO Workshop on TV Closed Caption, 7-30, 1999
- IBM WebSphere Transcoding Publisher
- http://www-6.ibm.com/jp/software/network/transcoding/
- Converts HTML into CHTML, HDML
- IBM's Dharma transforms the original contents tree structure (M-tree) into a V-tree, which is suitable for display on mobile terminals.
- Kitayama and Hirose, Dharma [title in Japanese], [venue in Japanese] 46-2, 576-581, 2001
23Related Products
- Argo Group Inc.
- http://www.argogroup.co.jp/product.html
- Transforms i-mode contents into EZweb (one of the WAP languages) and J-Sky contents.
- Japan Jeecom "Mobile Commerce Toolkit"
- http://www.jeecom.co.jp/jeecom/030012-mct.php3
- Device-independent contents editing toolkit
- NTT-ME "MITAi-kun"
- http://www.ntt-me.co.jp/mitai/
- V-Campus mobile environment at Rikkyo Univ.