Title: WHIRL
1WHIRL Reasoning with IE output
2Announcements
- Next week: mid-term progress reports on project
- Talks Mon, Wed
- Written 2-page status update Wed midnight
- Don't get stressed about format
- Things to talk about
- Problem and approach
- Related work
- Dataset characteristics, baseline performance
- Your experiences so far: what's been hard
3What is Information Extraction
As a family of techniques
Information Extraction = segmentation +
classification + association + clustering
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates,
Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder,
Free Software Foundation
4What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
QA
End User
5What is Information Extraction
As a task
Answering questions from a user using information
in text
Is building a conventional DB a necessary
subgoal? When can you answer questions without
one?
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
QA
End User
15WHIRL project (1997-2000)
- WHIRL initiated when at AT&T Bell Labs
AT&T Research
AT&T Labs - Research
AT&T Research
AT&T Labs
AT&T Research Shannon Laboratory
AT&T Shannon Labs
16When are two entities the same?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- AT&T Labs
- AT&T Labs-Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations
1925
History of Innovation: From 1925 to today, AT&T
has attracted some of the world's greatest
scientists, engineers and developers.
(www.research.att.com)
Bell Labs Facts: Bell Laboratories, the research
and development arm of Lucent Technologies, has
been operating continuously since 1925.
(bell-labs.com)
17In the once upon a time days of the First Age of
Magic, the prudent sorcerer regarded his own true
name as his most valued possession but also the
greatest threat to his continued good health,
for--the stories go--once an enemy, even a weak
unskilled enemy, learned the sorcerer's true
name, then routine and widely known spells could
destroy or enslave even the most powerful. As
times passed, and we graduated to the Age of
Reason and thence to the first and second
industrial revolutions, such notions were
discredited. Now it seems that the Wheel has
turned full circle (even if there never really
was a First Age) and we are back to worrying
about true names again. The first hint Mr.
Slippery had that his own True Name might be
known--and, for that matter, known to the Great
Enemy--came with the appearance of two black
Lincolns humming up the long dirt driveway ...
Roger Pollack was in his garden weeding, had been
there nearly the whole morning.... Four heavy-set
men and a hard-looking female piled out, started
purposefully across his well-tended cabbage
patch. This had been, of course, Roger Pollack's
great fear. They had discovered Mr. Slippery's
True Name and it was Roger Andrew Pollack
TIN/SSAN 0959-34-2861.
18When are two entities the same?
Buddhism rejects the key element in folk
psychology: the idea of a self (a unified
personal identity that is continuous through
time). King Milinda and Nagasena (the Buddhist
sage) discuss personal identity. Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to:
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc. Milinda concludes that "Nagasena" doesn't
stand for anything. If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no "you",
and if there's no you, there are no beliefs or
desires for you to have. The folk psychology
picture is profoundly misleading and believing it
will make you miserable. -S. LaFave
20Deduction via co-operation
User
- Economic issues
- Who pays for integration? Who tracks errors and inconsistencies? Who fixes bugs? Who pushes for clarity in underlying concepts and object identifiers?
- Standards approach → publishers are responsible → publishers pay
- Mediator approach: 3rd party does the work, agnostic as to cost
Integrated KB
Site1
Site3
Site2
KB1
KB3
KB2
Standard Terminology
21Traditional approach
Linkage
Queries
Uncertainty about what to link must be decided by
the integration system, not the end user
22WHIRL approach
R.a | S.a | S.b | T.b
Strongest links: those agreeable to most users
Anhai | Anhai | Doan | Doan
Dan | Dan | Weld | Weld
Weaker links: those agreeable to some users
William | Will | Cohen | Cohn
Steve | Steven | Minton | Mitton
Even weaker links
William | David | Cohen | Cohn
23WHIRL approach
Link items as needed by Q
R.a | S.a | S.b | T.b
Anhai | Anhai | Doan | Doan
Dan | Dan | Weld | Weld
William | Will | Cohen | Cohn
Steve | Steven | Minton | Mitton
William | David | Cohen | Cohn
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
24WHIRL queries
- Assume two relations
- review(movieTitle, reviewText): archive of reviews
- listing(theatre, movieTitle, showTimes, ...): now showing
review:
movieTitle | reviewText
The Hitchhikers Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series, not surprisingly, as Adams wrote the screenplay ...
Men in Black, 1997 | Will Smith does an excellent job in this ...
Space Balls, 1987 | Only a die-hard Mel Brooks fan could claim to enjoy ...
listing:
movieTitle | theatre | showTimes
Star Wars Episode III | The Senator Theater | 1:00, 4:15, 7:30pm
Cinderella Man | The Rotunda Cinema | 1:00, 4:30, 7:30pm
25WHIRL queries
- Find reviews of sci-fi comedies (movie domain)
- FROM review as r SELECT * WHERE r.text~'sci fi comedy'
- (like standard ranked retrieval of "sci-fi comedy")
- Where is that sci-fi comedy playing?
- FROM review as r, listing as s SELECT *
- WHERE r.title~s.title and r.text~'sci fi comedy'
- (best answers: titles are similar to each other, e.g., "Hitchhikers Guide to the Galaxy" and "The Hitchhikers Guide to the Galaxy, 2005", and the review text is similar to "sci-fi comedy")
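The two-relation query above can be sketched as a ranked "soft join". This is a minimal illustration, not WHIRL's implementation: the helper names, the toy rows (including the review texts and the "Example Cinema" theatre), the shared IDF statistics across both title columns, and the choice to multiply the two similarity scores are all assumptions made for the example.

```python
import math
from collections import Counter

def tokenize(text):
    tokens = (w.strip(".,;:!?'\"()").lower() for w in text.split())
    return [t for t in tokens if t]

def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: math.log(1 + c) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(reviews, listings, query, k=3):
    """Rank (review, listing) pairs by title similarity times text similarity."""
    title_vecs = tfidf_vectors([r[0] for r in reviews] + [l[0] for l in listings])
    r_titles, l_titles = title_vecs[:len(reviews)], title_vecs[len(reviews):]
    text_vecs = tfidf_vectors([r[1] for r in reviews] + [query])
    query_vec = text_vecs[-1]
    scored = []
    for i, review in enumerate(reviews):
        text_score = cosine(text_vecs[i], query_vec)
        for j, listing in enumerate(listings):
            score = cosine(r_titles[i], l_titles[j]) * text_score
            if score > 0:
                scored.append((score, review[0], listing[1]))
    scored.sort(key=lambda item: -item[0])
    return scored[:k]

# Toy rows: (movieTitle, reviewText) and (movieTitle, theatre). Made up for illustration.
REVIEWS = [
    ("The Hitchhikers Guide to the Galaxy, 2005", "a faithful sci fi comedy based on the radio series"),
    ("Men in Black, 1997", "Will Smith is excellent in this sci fi comedy"),
    ("Space Balls, 1987", "only a die-hard Mel Brooks fan could enjoy this"),
]
LISTINGS = [
    ("Star Wars Episode III", "The Senator Theater"),
    ("Hitchhikers Guide to the Galaxy", "Example Cinema"),
    ("Cinderella Man", "The Rotunda Cinema"),
]

hits = soft_join(REVIEWS, LISTINGS, "sci fi comedy")
```

No exact key match is required: the two Hitchhikers titles differ as strings, yet the pair still surfaces because the rare title words overlap.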
26WHIRL queries
- Similarity is based on TFIDF → rare words are most important.
- Search for high-ranking answers uses inverted indices.
- It is easy to find the (few) items that match on important terms
- Search for strong matches can prune unimportant terms
Star Wars Episode III
Hitchhikers Guide to the Galaxy
Cinderella Man
The Hitchhikers Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987
hitchhiker → movie00137
the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, ..
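A small sketch of the inverted-index idea follows, assuming a toy collection built from strings that appear on these slides. The item ids, the function names, and the `max_postings` cutoff are hypothetical; the cutoff stands in for "prune unimportant terms" and is not WHIRL's actual pruning rule.

```python
from collections import defaultdict

def terms(text):
    return text.lower().replace(",", " ").split()

def build_inverted_index(docs):
    """Map each term to the set of item ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return index

def candidates(index, query, max_postings):
    """Items sharing at least one rare query term (terms with long postings are skipped)."""
    found = set()
    for term in terms(query):
        postings = index.get(term, set())
        if 0 < len(postings) <= max_postings:
            found |= postings
    return found

# Hypothetical ids for strings taken from the slides.
DOCS = {
    "item001": "Star Wars Episode III",
    "item002": "Cinderella Man",
    "item003": "The Hitchhikers Guide to the Galaxy, 2005",
    "item004": "Men in Black, 1997",
    "item005": "Space Balls, 1987",
    "item006": "The Senator Theater",
    "item007": "The Rotunda Cinema",
}
INDEX = build_inverted_index(DOCS)
matches = candidates(INDEX, "Hitchhikers Guide to the Galaxy", max_postings=2)
```

Here "the" has three postings and is skipped, while the rare terms each point to a single item, so only that item needs to be scored.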
27Inference in WHIRL
- Best-first search: pick state s that is best according to f(s)
- Suppose graph is a tree, and for all s, s', if s' is reachable from s then f(s) > f(s'). Then A* outputs the globally best goal state s first, and then next best, ...
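The search described above can be sketched with a priority queue. This is an illustrative sketch under the stated assumptions (the graph is a tree and f never increases along a path); the state encoding, the `OPTIONS` table, and its scores are hypothetical.

```python
import heapq
import math
from itertools import count

def best_first(start, f, children, is_goal):
    """Yield goal states best-first, assuming f never increases from parent to child."""
    tie = count()  # tie-breaker so the heap never compares states directly
    frontier = [(-f(start), next(tie), start)]
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            yield state
        else:
            for child in children(state):
                heapq.heappush(frontier, (-f(child), next(tie), child))

# Toy problem: pick one candidate link per level; a state's score is the
# product of the link scores chosen so far (scores are made up).
OPTIONS = [
    [("Anhai", 1.0), ("Will", 0.8)],
    [("Doan", 1.0), ("Cohn", 0.5)],
]

def f(state):
    return math.prod(score for _, score in state)

def children(state):
    return [state + (option,) for option in OPTIONS[len(state)]]

def is_goal(state):
    return len(state) == len(OPTIONS)

ranked = list(best_first((), f, children, is_goal))
scores = [f(state) for state in ranked]
```

Because the generator yields answers one at a time, a caller can stop after the first few, which matches the idea of producing a ranked list incrementally.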
28Inference in WHIRL
- Explode p(X1,X2,X3): find all DB tuples <p,a1,a2,a3> for p and bind Xi to ai.
- Constrain X~Y: if X is bound to a and Y is unbound,
- find DB column C to which Y should be bound
- pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
- Keep track of t's used previously, and don't allow Y to contain one.
29Inference in WHIRL
34Outline
- Information integration
- Some history
- The problem, the economics, and the economic problem
- Soft information integration
- Concrete uses of soft integration
- Classification
- Collaborative filtering
- Set expansion
45Stopped about here.
46Outline
- Information integration
- Some history
- The problem, the economics, and the economic problem
- Soft information integration
- Concrete uses of soft integration
- Classification
- Collaborative filtering
- Set expansion
51Other string distances
52Robust distance metrics for strings
- Kinds of distances between s and t
- Edit-distance based (Levenshtein, Smith-Waterman, ...): distance is cost of cheapest sequence of edits that transform s to t.
- Term-based (TFIDF, Jaccard, DICE, ...): distance based on set of words in s and t, usually weighting important words
- Which methods work best when?
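To make the two families concrete, here is a minimal sketch of one member of each: unit-cost Levenshtein distance and unweighted Jaccard similarity over word sets (a distance can be taken as one minus the similarity). The function names are illustrative, and Jaccard here does not weight important words as the TFIDF variants do.

```python
def levenshtein(s, t):
    """Cheapest number of single-character inserts, deletes, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete cs
                           cur[j - 1] + 1,              # insert ct
                           prev[j - 1] + (cs != ct)))   # substitute or match
        prev = cur
    return prev[-1]

def jaccard(s, t):
    """Word-set overlap: |S ∩ T| / |S ∪ T|."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

typo_cost = levenshtein("Cohen", "Cohn")
reorder_sim = jaccard("Bell Labs", "Labs Bell")
prefix_sim = jaccard("Bell Labs", "AT&T Bell Labs")
```

The contrast hints at the "which works best when" question: a one-letter typo is cheap for edit distance but makes two tokens look unrelated to a term-based method, while reordered words cost a term-based method nothing.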
53Robust distance metrics for strings
SecondString (Cohen, Ravikumar, Fienberg, IIWeb 2003)
- Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
- Tools for evaluating performance on test data
- Used to experimentally compare a number of metrics
54Results: Edit-distance variants
- Monge-Elkan (a carefully-tuned Smith-Waterman variant) is the best on average across the benchmark datasets
11-pt interpolated recall/precision curves
averaged across 11 benchmark problems
55Results Edit-distance variants
But Monge-Elkan is sometimes outperformed on
specific datasets
Precision-recall for Monge-Elkan and one other
method (Levenshtein) on a specific benchmark
56SoftTFIDF: A robust distance metric
- We also compared edit-distance based and term-based methods, and evaluated a new hybrid method
- SoftTFIDF, for token sets S and T
- Extends TFIDF by including pairs of words in S and T that almost match, i.e., that are highly similar according to a second distance metric (the Jaro-Winkler metric, an edit-distance like metric).
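The hybrid idea can be sketched as follows. This is not the SecondString code: to stay self-contained it swaps in Python's `difflib.SequenceMatcher` ratio for the Jaro-Winkler metric, and the threshold `theta`, the default IDF of 1.0 for unseen tokens, and the use of the closest token's weight in T are all assumptions made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher

def token_sim(a, b):
    # Stand-in secondary metric (NOT Jaro-Winkler): difflib's similarity ratio.
    return SequenceMatcher(None, a, b).ratio()

def weights(tokens, idf):
    """Unit-length TF-IDF weights; tokens missing from idf get weight 1.0."""
    tf = Counter(tokens)
    vec = {w: math.log(1 + c) * idf.get(w, 1.0) for w, c in tf.items()}
    norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
    return {w: x / norm for w, x in vec.items()}

def soft_tfidf(s, t, idf, theta=0.8):
    """Sum weight(w, S) * weight(v, T) * sim(w, v) over near-matching token pairs."""
    vs = weights(s.lower().split(), idf)
    vt = weights(t.lower().split(), idf)
    total = 0.0
    for w, weight_s in vs.items():
        if not vt:
            break
        closest = max(vt, key=lambda v: token_sim(w, v))
        d = token_sim(w, closest)
        if d > theta:  # only "almost matching" tokens contribute
            total += weight_s * vt[closest] * d
    return total

soft = soft_tfidf("William Cohen", "William Cohn", idf={})
exact_only = soft_tfidf("William Cohen", "William Cohn", idf={}, theta=0.999)
```

With a threshold close to 1 only exact token matches count, which reduces to ordinary TFIDF cosine; lowering it lets "Cohen" and "Cohn" contribute, so the soft score is higher.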
58Comparing token-based, edit-distance, and hybrid
distance metrics
SFS is a vanilla IDF weight on each token (circa
1959!)
59SoftTFIDF is a Robust Distance Metric
60Cohen, Kautz & McAllester paper
63Definitions
- S, H are sets of tuples over references
- B. Selman1, William W. Cohen34, B Selman2, ...
- Ipot is a weighted set of possible arcs.
- I is a subset of Ipot. Given r, follow a chain of arcs to get the final interpretation of r.
- B. Selman1 → Bart Selman22 → ... → B. Selman27
64Goal
- Given S and Ipot, find the I that minimizes
Number of arcs
Total weight of all arcs
Number of tuples in hard DB H = I(S)
- Idea: find MAP hard database behind S
- Arcs correspond to errors/abbreviations.
- Chains of transformations correspond to errors that propagate via copying
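A small sketch of the objects involved, under explicit assumptions: each reference has at most one outgoing arc (so an interpretation I is a dict), and the three quantities listed above are combined as a weighted sum with hypothetical coefficients `c_arc` and `c_tuple`. The slides do not state how the terms are combined, and the example references and weight are made up.

```python
def interpret(ref, arcs):
    """Follow the chain of arcs from ref to its final interpretation."""
    seen = {ref}
    while ref in arcs:
        ref = arcs[ref]
        if ref in seen:
            raise ValueError("arcs contain a cycle")
        seen.add(ref)
    return ref

def harden(soft_db, arcs):
    """H = I(S): rewrite every reference in every soft tuple, merging duplicates."""
    return {tuple(interpret(ref, arcs) for ref in tup) for tup in soft_db}

def cost(soft_db, arcs, weight, c_arc=1.0, c_tuple=1.0):
    """Assumed objective: arc count + total arc weight + hard-DB tuple count."""
    hard = harden(soft_db, arcs)
    total_weight = sum(weight[arc] for arc in arcs.items())
    return c_arc * len(arcs) + total_weight + c_tuple * len(hard)

SOFT_DB = {("B. Selman", "Cornell"), ("Bart Selman", "Cornell")}
ARCS = {"B. Selman": "Bart Selman"}                 # one chosen arc
WEIGHT = {("B. Selman", "Bart Selman"): 0.2}        # made-up arc weight

hard_db = harden(SOFT_DB, ARCS)
merged_cost = cost(SOFT_DB, ARCS, WEIGHT, c_arc=0.5)
unmerged_cost = cost(SOFT_DB, {}, WEIGHT, c_arc=0.5)
```

In this toy case the merge pays for itself: one arc and its weight cost less than keeping two separate tuples in the hard database.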
65Facts about hardening
- This simplifies a very simple generative model for a database
- Generate tuples in H one by one
- Generate arcs I in Ipot one by one
- Generate tuples in S one by one (given H and I)
- Greedy method makes sense
- Easy merges can lower the cost of later hard merges
- Hardening is hard
- NP hard even under severe restrictions, because the choices of what to merge where are all interconnected.
66B.selman → Bart Selman; Critical in -> Critical .. For ..
affil(Bert Sealmann3, Cornell3)
author(Bert Sealmann3, BLACKBOX problem solving 3)