Title: The New Bill of Rights of Information Society
1 The New Bill of Rights of Information Society
- Raj Reddy and Jaime Carbonell
- Carnegie Mellon University
- March 23, 2006
- Talk at Google
2 New Bill of Rights
- Get the right information
- e.g. search engines
- To the right people
- e.g. categorizing, routing
- At the right time
- e.g. Just-in-Time (task modeling, planning)
- In the right language
- e.g. machine translation
- With the right level of detail
- e.g. summarization
- In the right medium
- e.g. access to information in non-textual media
3 Relevant Technologies
- right information: search engines
- right people: classification, routing
- right time: anticipatory analysis
- right language: machine translation
- right level of detail: summarization
- right medium: speech input and output
4 Right information: Search Engines
5 The Right Information
- Right Information from future Search Engines
- How to go beyond just relevance to query (all) and popularity
- Eliminate massive redundancy, e.g. "web-based email"
- Should not result in multiple links to different Yahoo sites promoting their email, or even non-Yahoo sites discussing just Yahoo email.
- Should result in a link to Yahoo email, one to MSN email, one to Gmail, one that compares them, etc.
- First show trusted info sources and user-community-vetted sources
- At least for important info (medical, financial, educational, ...), I want to trust what I read, e.g.,
- For new medical treatments
- First info from hospitals, medical schools, the AMA, medical publications, etc., and
- NOT from "Joe Shmo's quack practice page" or from the National Enquirer.
- Maximum Marginal Relevance
- Novelty Detection
- Named Entity Extraction
6 Beyond Pure Relevance in IR
- Current Information Retrieval Technology Only Maximizes Relevance to Query
- What about information novelty, timeliness, appropriateness, validity, comprehensibility, density, medium, ...??
- Novelty is approximated by non-redundancy!
- We really want to maximize relevance to the query, given the user profile and interaction history,
- P(U(f_i, ..., f_n) | Q, C, U, H)
- where Q = query, C = collection set,
- U = user profile, H = interaction history
- ...but we don't yet know how. Darn.
7 Maximal Marginal Relevance vs. Standard Information Retrieval
[Figure: documents selected around a query by MMR compared with standard IR]
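The contrast on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: it assumes bag-of-words vectors with raw term counts, cosine similarity for both relevance and redundancy, and a hypothetical trade-off parameter `lam` (1.0 gives plain relevance ranking, lower values penalize redundancy with already-selected documents). The function names and the toy documents are invented for the example.

```python
import math
from collections import Counter


def bag(text):
    """Bag-of-words vector with raw term counts (a simplifying assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mmr_rank(query, docs, k, lam=0.5):
    """Greedily pick k document indices, trading relevance against redundancy."""
    q = bag(query)
    vecs = [bag(d) for d in docs]
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Redundancy: similarity to the closest document already selected.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


docs = [
    "yahoo web email",
    "web email yahoo",
    "gmail web email service",
    "cooking recipes",
]
pure_relevance = mmr_rank("web email", docs, k=2, lam=1.0)
with_novelty = mmr_rank("web email", docs, k=2, lam=0.5)
print(pure_relevance, with_novelty)
```

With `lam=1.0` the two near-identical Yahoo entries are returned together; with `lam=0.5` the second slot goes to the Gmail entry, which is the behavior the "web-based email" example asks for.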
8 Novelty Detection
- Find the first report of a new event
- (Unconditional) Dissimilarity with Past
- Decision threshold on most-similar story
- (Linear) temporal decay
- Length-filter (for teasers)
- Cosine similarity with standard weights
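The ingredients listed on this slide can be combined into a small first-story detector. The sketch below is only an illustration under stated assumptions: raw term counts stand in for the "standard weights" (a real system would more likely use TF-IDF), story age is measured in arrival positions rather than clock time, and the names `threshold`, `window`, and `min_len` are hypothetical parameters chosen for the example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def detect_first_stories(stories, threshold=0.3, window=5, min_len=4):
    """Return indices of stories flagged as the first report of a new event."""
    history = []  # (arrival index, term-count vector) of stories seen so far
    first_reports = []
    for i, text in enumerate(stories):
        tokens = text.lower().split()
        if len(tokens) < min_len:
            continue  # length filter: teasers are neither flagged nor stored
        vec = Counter(tokens)
        best = 0.0
        for j, past in history:
            # Linear temporal decay: older stories count less, zero beyond the window.
            decay = max(0.0, 1.0 - (i - j) / window)
            best = max(best, decay * cosine(vec, past))
        # Decision threshold on the most similar (decayed) past story.
        if best < threshold:
            first_reports.append(i)
        history.append((i, vec))
    return first_reports


stories = [
    "plane crash near airport kills ten",
    "coming up next",
    "ten killed in plane crash near airport",
    "central bank raises interest rates again",
]
flagged = detect_first_stories(stories)
print(flagged)
```

Here the teaser is dropped by the length filter, the follow-up crash report is judged too similar to the first one, and the unrelated bank story is flagged as new.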
9 New First Story Detection Directions
- Topic-conditional models
- e.g. airplane, investigation, FAA, FBI, casualties, ... → topic, not event
- TWA 800, March 12, 1997 → event
- First categorize into topic, then use maximally-discriminative terms within topic
- Rely on situated named entities
- e.g. Arcan as victim, Sharon as peacemaker
10 Link Detection in Texts
- Find text (e.g. news stories) that mention the same underlying events.
- Could be combined with novelty (e.g. something new about interesting event.)
- Techniques: text similarity, NEs, situated NEs, relations, topic-conditioned models, ...
11 Named-Entity identification
- Purpose: to answer questions such as
- Who is mentioned in these 100 Society articles?
- What locations are listed in these 2000 web pages?
- What companies are mentioned in these patent applications?
- What products were evaluated by Consumer Reports this year?
12 Named Entity Identification
- President Clinton decided to send special trade
envoy Mickey Kantor to the special Asian economic
meeting in Singapore this week. Ms. Xuemei Peng,
trade minister from China, and Mr. Hideto Suzuki
from Japan's Ministry of Trade and Industry will
also attend. Singapore, who is hosting the
meeting, will probably be represented by its
foreign and economic ministers. The Australian
representative, Mr. Langford, will not attend,
though no reason has been given. The parties hope
to reach a framework for currency stabilization.
13 Methods for NE Extraction
- Finite-State Transducers w/variables
- Example output
- FNAME: Bill, LNAME: Clinton, TITLE: President
- FSTs Learned from labeled data
- Statistical learning (also from labeled data)
- Hidden Markov Models (HMMs)
- Exponential (maximum-entropy) models
- Conditional Random Fields [Lafferty et al]
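To make the example output concrete, here is a small hand-written stand-in for a finite-state transducer with variables. It is an assumption-laden sketch: a regular expression with named groups plays the role of the transducer, the title list is limited to the three titles that appear in the sample text of this talk, and nothing is learned from labeled data as the slide proposes. The names `NAME_PATTERN` and `extract_people` are invented for the illustration.

```python
import re

# Hand-coded pattern: a title, an optional first name, then a last name.
# The named groups act as the transducer's output variables.
NAME_PATTERN = re.compile(
    r"\b(?P<TITLE>President|Ms\.|Mr\.)\s+"
    r"(?:(?P<FNAME>[A-Z][a-z]+)\s+)?"
    r"(?P<LNAME>[A-Z][a-z]+)"
)


def extract_people(text):
    """Return one dict of TITLE/FNAME/LNAME bindings per matched mention."""
    return [m.groupdict() for m in NAME_PATTERN.finditer(text)]


people = extract_people("President Bill Clinton met Mr. Langford.")
for person in people:
    print(person)
```

A mention without a first name, such as "Mr. Langford", leaves `FNAME` unbound (`None`), which mirrors a transducer variable that was never filled.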
14 Named Entity Identification
- Extracted Named Entities (NEs)
People              Places
President Clinton   Singapore
Mickey Kantor       Japan
Ms. Xuemei Peng     China
Mr. Hideto Suzuki   Australia
Mr. Langford
15 Role Situated NEs
- Motivation: It is useful to know roles of NEs
- Who participated in the economic meeting?
- Who hosted the economic meeting?
- Who was discussed in the economic meeting?
- Who was absent from the economic meeting?
16 Emerging Methods for Extracting Relations
- Link Parsers at Clause Level
- Based on dependency grammars
- Probabilistic enhancements [Lafferty, Venable]
- Island-Driven Parsers
- GLR [Lavie], Chart [Nyberg, Placeway], LC-Flex [Rose]
- Tree-bank-trained probabilistic CF parsers [IBM, Collins]
- Herald the return of deep(er) NLP techniques.
- Relevant to new Q/A from free-text initiative.
- Too complex for inductive learning (today).
17 Relational NE Extraction
- Example (Who does What to Whom)
- "John Snell reporting for Wall Street. Today Flexicon Inc. announced a tender offer for Supplyhouse Ltd. for $30 per share, representing a 30% premium over Friday's closing price. Flexicon expects to acquire Supplyhouse by Q4 2001 without problems from federal regulators"
18 Fact Extraction Application
- Useful for relational DB filling, to prepare data for standard DM/machine-learning methods
Acquirer    Acquiree      Sh.price   Year
_________________________________________
Flexicon    Logi-truck    18         1999
Flexicon    Supplyhouse   30         2001
buy.com     reel.com      10         2000
...         ...           ...        ...
19 Right people: Text Categorization
20 The Right People
- User-focused search is key
- If a 7-year-old is working on a school project on taking good care of one's heart and types in "heart care", she will want links to pages like
- "You and your friendly heart",
- "Tips for taking good care of your heart",
- "Intro to how the heart works", etc.
- NOT the latest New England Journal of Medicine article on "Cardiological implications of immuno-active proteases".
- If a cardiologist issues the query, exactly the opposite is desired
- Search engines must know their users better, and the user tasks
- Social affiliation groups for search and for automatically categorizing, prioritizing and routing incoming info or search results. New machine learning technology allows for scalable high-accuracy hierarchical categorization.
- Family group
- Organization group
- Country group
- Disaster affected group
- Stockholder group
21 Text Categorization
- Assign labels to each document or web-page
- Labels may be topics such as Yahoo-categories
- finance, sports, News→World→Asia→Business
- Labels may be genres
- editorials, movie-reviews, news
- Labels may be routing codes
- send to marketing, send to customer service
22 Text Categorization Methods
- Manual assignment
- as in Yahoo
- Hand-coded rules
- as in Reuters
- Machine Learning (dominant paradigm)
- Words in text become predictors
- Category labels become the values to be predicted
- Predictor-feature reduction (SVD, χ², ...)
- Apply any inductive method: kNN, NB, DT, ...
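As one instance of the machine-learning recipe above (words as predictors, category labels as the values to predict), the following sketch trains a multinomial Naive Bayes (NB) classifier. It is a toy under stated assumptions: whitespace tokenization, add-one smoothing, no feature reduction step, and a four-document training set of routing codes invented for the example. The function names `train_nb` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count documents per label and words per label from (text, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior under add-one smoothing."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("new ad campaign budget", "send to marketing"),
    ("brand launch campaign", "send to marketing"),
    ("refund for broken product", "send to customer service"),
    ("product arrived broken need refund", "send to customer service"),
]
model = train_nb(training)
print(classify(model, "broken product refund"))
print(classify(model, "campaign budget"))
```

Swapping in kNN or a decision tree would change only the `classify` step; the representation of documents as word counts stays the same.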
23 Multi-tier Event Classification
24 Right timeframe: Just-in-Time - no sooner or later
25 Just in Time Information
- Get the information to user exactly when it is needed
- Immediately when the information is requested
- Prepositioned if it requires time to fetch and download (e.g. HDTV video)
- requires anticipatory analysis and pre-fetching
- How about push technology for, e.g. stock alerts, reminders, breaking news?
- Depends on user activity
- Sleeping or Don't Disturb or in Meeting → wait your chance
- Reading email → now if info is urgent, later otherwise
- Group info before delivering (e.g. show 3 stock alerts together)
- Info directly relevant to user's current task → immediately
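The delivery rules on this slide amount to a small decision procedure, sketched below. Several choices are assumptions not stated in the talk: the activity labels are hypothetical strings, a blocked state (sleeping, do-not-disturb, in a meeting) takes precedence over task relevance, and the "urgent now, otherwise later" rule given for reading email is applied to every other non-blocked activity as well.

```python
from collections import defaultdict

# Hypothetical activity labels for states in which the user should not be interrupted.
BLOCKED = {"sleeping", "do_not_disturb", "in_meeting"}


def delivery_decision(activity, urgent=False, task_relevant=False):
    """Return 'wait', 'now', or 'later' for one piece of pushed information."""
    if activity in BLOCKED:
        return "wait"  # wait your chance
    if task_relevant:
        return "now"  # directly relevant to the current task
    return "now" if urgent else "later"


def group_before_delivery(items):
    """Group (kind, payload) pairs so related alerts are shown together."""
    grouped = defaultdict(list)
    for kind, payload in items:
        grouped[kind].append(payload)
    return dict(grouped)


alerts = [("stock", "ACME up 3%"), ("news", "storm warning"), ("stock", "XYZ down 1%")]
batches = group_before_delivery(alerts)
print(delivery_decision("reading_email", urgent=True))
print(batches)
```

The precedence order is the main design choice here; a system that lets urgent, task-relevant information break through a meeting would reorder the first two checks.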
26 Right language: Translation
27 Access to Multilingual Information
- Language Identification (from text, speech, handwriting)
- Trans-lingual retrieval (query in 1 language, results in multiple languages)
- Requires more than query-word out-of-context translation (see Carbonell et al 1997 IJCAI paper) to do it well
- Full translation (e.g. of web page, of search results snippets, ...)
- General reading quality (as targeted now)
- Focused on getting entities right (who, what, where, when mentioned)
- Partial on-demand translation
- Reading assistant: translation in context while reading an original document, by highlighting unfamiliar words, phrases, passages.
- On-demand Text to Speech
- Transliteration
28 In the Right Language
- Knowledge-Engineered MT
- Transfer rule MT (commercial systems)
- High-Accuracy Interlingual MT (domain focused)
- Parallel Corpus-Trainable MT
- Statistical MT (noisy channel, exponential models)
- Example-Based MT (generalized G-EBMT)
- Transfer-rule learning MT (corpus and informants)
- Multi-Engine MT
- Omnivorous approach: combines the above to maximize coverage and minimize errors
29 Types of Machine Translation
[Figure: translation pyramid from Source (Arabic) to Target (English). Labels: Direct EBMT, Transfer Rules, Interlingua, Syntactic Parsing, Semantic Analysis, Sentence Planning, Text Generation]
30 EBMT example
English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.
English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.
English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
31 Multi-Engine Machine Translation
- MT Systems have different strengths
- Rapidly adaptable: Statistical, example-based
- Good grammar: Rule-Based (linguistic) MT
- High precision in narrow domains: KBMT
- Minority Language MT: Learnable from informant
- Combine results of parallel-invoked MT
- Select best of multiple translations
- Selection based on optimizing combination of
- Target language joint-exponential model
- Confidence scores of individual MT engines
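The selection step can be illustrated with a toy scorer. Note the substitution: instead of the joint-exponential target language model named on the slide, this sketch uses an add-one-smoothed bigram model trained on three sentences, and it combines the model's average log-probability with the log of each engine's confidence through a hypothetical `weight` parameter. Engine names, confidence values, and candidate outputs are invented for the example.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving the average smoothed bigram log-probability of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.lower().split() + ["</s>"]
        contexts.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    vocab_size = len(set(contexts) | {"</s>"})

    def avg_logprob(sentence):
        toks = ["<s>"] + sentence.lower().split() + ["</s>"]
        pairs = list(zip(toks, toks[1:]))
        # Add-one smoothing keeps unseen bigrams from having zero probability.
        total = sum(math.log((bigrams[p] + 1) / (contexts[p[0]] + vocab_size)) for p in pairs)
        return total / len(pairs)

    return avg_logprob


def select_translation(candidates, lm, weight=0.5):
    """Pick the (engine, text, confidence) triple with the best combined score."""
    def score(candidate):
        _engine, text, confidence = candidate  # confidence must be > 0
        return weight * lm(text) + (1.0 - weight) * math.log(confidence)
    return max(candidates, key=score)


lm = train_bigram_lm([
    "i would like to meet her",
    "the tallest man is my father",
    "i would like to meet the man",
])
candidates = [
    ("engine_a", "i would like to meet the tallest man", 0.6),
    ("engine_b", "i like would meet to man tallest the", 0.7),
]
best = select_translation(candidates, lm)
print(best[0], "->", best[1])
```

With the default weight the fluent candidate wins despite its lower engine confidence; setting `weight=0.0` falls back to trusting the engines' own scores.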
32 Illustration of Multi-Engine MT
33 State of the Art in MEMT for New "Hot" Languages
- We can do now
- Gisting MT for any new language in 2-3 weeks (given parallel text)
- Medium quality MT in 6 months (given more parallel text, informant, bi-lingual dictionary)
- Improve-as-you-go MT
- Field MT system in PCs
- We cannot do yet
- High-accuracy MT for open domains
- Cope with spoken-only languages
- Reliable speech-speech MT (but BABYLON is coming)
- MT on your wristwatch
34 Right level of detail: Summarization
35 Right Level of Detail
- Automate summarization with hyperlink one-click drilldown on user selected section(s).
- Purpose-driven summaries are in service of an information need, not one-size-fits-all (as in Shaom's outline and the DUC NIST evaluations)
- EXAMPLE: A summary of a 650-page clinical study can focus on
- effectiveness of the new drug for target disease
- methodology of the study (control group, statistical rigor, ...)
- deleterious side effects if any
- target population of study (e.g. acne-suffering teens, not eczema-suffering adults)
- ... depending on the user's task or information query
36 Information Structuring and Summarization
- Hierarchical multi-level pre-computed summary structure, or on-the-fly drilldown expansion of info.
- Headline
- Abstract: 1% or 1 page
- Summary: 5-10% or 10 pages
- Document: 100%
- Scope of Summary
- Single big document (e.g. big clinical study)
- Tight cluster of search results (e.g. vivisimo)
- Related set of clusters (e.g. conflicting opinions on how to cope with Iran's nuclear capabilities)
- Focused area of knowledge (e.g. What's known about Pluto? Lycos has good project in this via Hotbot)
- Specific kinds of commonly asked information (e.g. synthesize a bio on person X from any web-accessible info)
37 Document Summarization
38 Right medium: Finding information in Non-textual Media
39 Indexing and Searching Non-textual (Analog) Content
- Speech → text (speech recognition)
- Text → speech
- TTS: FESTVOX by far most popular high-quality system
- Handwriting → text (handwriting recognition)
- Printed text → electronic text (OCR)
- Picture → caption key words (automatically) for indexing and searching
- Diagram, tables, graphs, maps → caption key words (automatically)
40 Conclusion
41 What is Text Mining
- Search: documents, web, news
- Categorize: by topic, taxonomy
- Enables filtering, routing, multi-text summaries, ...
- Extract: names, relations, ...
- Summarize: text, rules, trends, ...
- Detect: redundancy, novelty, anomalies, ...
- Predict: outcomes, behaviors, trends, ...
Who did what to whom and where?
42 Data Mining vs. Text Mining
- Data: relational tables
- DM universe: huge
- DM tasks
- DB cleanup
- Taxonomic classification
- Supervised learning with predictive classifiers
- Unsupervised learning: clustering, anomaly detection
- Visualization of results
- Text: HTML, free form
- TM universe: 10^3 X DM
- TM tasks
- All the DM tasks,
- plus
- Extraction of roles, relations and facts
- Machine translation for multi-lingual sources
- Parse NL-query (vs. SQL)
- NL-generation of results