WHIRL - PowerPoint PPT Presentation

WHIRL Reasoning with IE output 11/1/10 – PowerPoint PPT presentation

Slides: 65
Provided by: William1216
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

Title: WHIRL


1
WHIRL Reasoning with IE output
  • 11/1/10

2
Announcements
  • Next week: mid-term progress reports on project
  • Talks Mon, Wed
  • Written 2-page status update Wed midnight
  • Don't get stressed about format
  • Things to talk about:
  • Problem and approach
  • Related work
  • Dataset characteristics, baseline performance
  • Your experiences so far: what's been hard

3
What is Information Extraction
As a family of techniques
Information Extraction = segmentation +
classification + association + clustering
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Extracted segments: Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation




4
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
QA
End User
5
What is Information Extraction
As a task
Answering questions from a user using information
in text
Is building a conventional DB a necessary
subgoal? When can you answer questions without
one?
IE
NAME | TITLE | ORGANIZATION
Bill Gates | CEO | Microsoft
Bill Veghte | VP | Microsoft
Richard Stallman | founder | Free Soft..
QA
End User
15
WHIRL project (1997-2000)
  • WHIRL initiated when at AT&T Bell Labs

AT&T Research
AT&T Labs - Research
AT&T Research
AT&T Labs
AT&T Research Shannon Laboratory
AT&T Shannon Labs
16
When are two entities the same?
  • Bell Labs
  • Bell Telephone Labs
  • AT&T Bell Labs
  • AT&T Labs
  • AT&T Labs-Research
  • AT&T Labs Research, Shannon Laboratory
  • Shannon Labs
  • Bell Labs Innovations
  • Lucent Technologies/Bell Labs Innovations

History of Innovation: "From 1925 to today, AT&T
has attracted some of the world's greatest
scientists, engineers and developers."
(www.research.att.com)
Bell Labs Facts: "Bell Laboratories, the research
and development arm of Lucent Technologies, has
been operating continuously since 1925."
(bell-labs.com)
17
In the once upon a time days of the First Age of
Magic, the prudent sorcerer regarded his own true
name as his most valued possession but also the
greatest threat to his continued good health,
for--the stories go--once an enemy, even a weak
unskilled enemy, learned the sorcerer's true
name, then routine and widely known spells could
destroy or enslave even the most powerful. As
times passed, and we graduated to the Age of
Reason and thence to the first and second
industrial revolutions, such notions were
discredited. Now it seems that the Wheel has
turned full circle (even if there never really
was a First Age) and we are back to worrying
about true names again. The first hint Mr.
Slippery had that his own True Name might be
known--and, for that matter, known to the Great
Enemy--came with the appearance of two black
Lincolns humming up the long dirt driveway ...
Roger Pollack was in his garden weeding, had been
there nearly the whole morning.... Four heavy-set
men and a hard-looking female piled out, started
purposefully across his well-tended cabbage
patch. This had been, of course, Roger Pollack's
great fear. They had discovered Mr. Slippery's
True Name and it was Roger Andrew Pollack
TIN/SSAN 0959-34-2861.
18
When are two entities the same?
Buddhism rejects the key element in folk
psychology: the idea of a self (a unified
personal identity that is continuous through
time). King Milinda and Nagasena (the Buddhist
sage) discuss personal identity. Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to:
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc. Milinda concludes that "Nagasena" doesn't
stand for anything. If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no you,
and if there's no you, there are no beliefs or
desires for you to have. The folk psychology
picture is profoundly misleading and believing it
will make you miserable. -S. LaFave
20
Deduction via co-operation
  • Economic issues:
  • Who pays for integration? Who tracks errors and
    inconsistencies? Who fixes bugs? Who pushes for
    clarity in underlying concepts and object
    identifiers?
  • Standards approach: publishers are responsible,
    so publishers pay
  • Mediator approach: 3rd party does the work,
    agnostic as to cost

[Diagram labels: User; Integrated KB; Site1, Site2, Site3; KB1, KB2, KB3; Standard Terminology]
21
Traditional approach
Linkage
Queries
Uncertainty about what to link must be decided by
the integration system, not the end user
22
WHIRL approach
R.a | S.a | S.b | T.b
Anhai | Anhai | Doan | Doan
Dan | Dan | Weld | Weld
(Strongest links: those agreeable to most users)
William | Will | Cohen | Cohn
Steve | Steven | Minton | Mitton
(Weaker links: those agreeable to some users)
William | David | Cohen | Cohn
(even weaker links)
23
WHIRL approach
Link items as needed by Q
Incrementally produce a ranked list of possible
links, with "best matches" first. User (or
downstream process) decides how much of the list
to generate and examine.
R.a | S.a | S.b | T.b
Anhai | Anhai | Doan | Doan
Dan | Dan | Weld | Weld
William | Will | Cohen | Cohn
Steve | Steven | Minton | Mitton
William | David | Cohen | Cohn
24
WHIRL queries
  • Assume two relations:
  • review(movieTitle, reviewText): archive of reviews
  • listing(theatre, movieTitle, showTimes, ...): now
    showing

review:
movieTitle | reviewText
The Hitchhiker's Guide to the Galaxy, 2005 | This is a faithful re-creation of the original radio series, not surprisingly, as Adams wrote the screenplay ...
Men in Black, 1997 | Will Smith does an excellent job in this ...
Space Balls, 1987 | Only a die-hard Mel Brooks fan could claim to enjoy ...

listing:
movieTitle | theatre | showTimes
Star Wars Episode III | The Senator Theater | 1:00, 4:15, 7:30pm.
Cinderella Man | The Rotunda Cinema | 1:00, 4:30, 7:30pm.

25
WHIRL queries
  • Find reviews of sci-fi comedies [movie domain]
  • FROM review as r SELECT * WHERE r.text~'sci fi comedy'
  • (like standard ranked retrieval of "sci-fi
    comedy")
  • Where is that sci-fi comedy playing?
  • FROM review as r, listing as s SELECT *
  • WHERE r.title~s.title and r.text~'sci fi comedy'
  • (best answers: titles are similar to each other,
    e.g., "Hitchhiker's Guide to the Galaxy" and "The
    Hitchhiker's Guide to the Galaxy, 2005", and the
    review text is similar to "sci-fi comedy")
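The similarity join in the second query can be illustrated with a small sketch. This is not WHIRL's implementation: it assumes a simple log-scaled TF-IDF weighting and cosine similarity, scores every pair exhaustively, and all function names (`tokenize`, `tfidf_vectors`, `soft_join`) are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one unit-length TF-IDF vector (dict term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in token_lists for t in set(toks))
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        # Assumed weighting: log(1 + tf) * log(1 + N / df); rare terms weigh more.
        vec = {t: math.log(1 + c) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def soft_join(left, right):
    """Rank all (left, right) pairs by TF-IDF cosine similarity, best first."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(cosine(a, b), l, r)
             for l, a in zip(left, lv) for r, b in zip(right, rv)]
    return sorted(pairs, reverse=True)


review_titles = ["The Hitchhiker's Guide to the Galaxy, 2005",
                 "Men in Black, 1997", "Space Balls, 1987"]
listing_titles = ["Star Wars Episode III",
                  "Hitchhiker's Guide to the Galaxy", "Cinderella Man"]
ranked = soft_join(review_titles, listing_titles)
```

The first element of `ranked` pairs the two Hitchhiker's titles; every other pair shares no token and scores zero. A real system would avoid scoring all pairs, which is what the inverted-index search is for.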

26
WHIRL queries
  • Similarity is based on TFIDF, so rare words are most
    important.
  • Search for high-ranking answers uses inverted
    indices.
  • It is easy to find the (few) items that match
    on "important" terms
  • Search for strong matches can prune
    "unimportant terms"

Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987

Inverted index:
hitchhiker -> movie00137
the -> movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, ..
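A minimal sketch of the inverted-index idea, assuming a plain term-to-document-set map. The document ids and the helper names (`build_inverted_index`, `candidates`) are hypothetical; the strings are taken from the slides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def candidates(query, index):
    """Return the posting set of the rarest query term found in the index."""
    postings = [index[t] for t in tokenize(query) if t in index]
    return min(postings, key=len) if postings else set()


# Hypothetical ids; a mixed bag of strings from the slides.
docs = {
    "m1": "The Hitchhiker's Guide to the Galaxy, 2005",
    "m2": "Men in Black, 1997",
    "m3": "Space Balls, 1987",
    "t1": "The Senator Theater",
    "t2": "The Rotunda Cinema",
}
index = build_inverted_index(docs)
hits = candidates("The Hitchhiker's Guide", index)
```

The common word "the" has a long posting list, while "hitchhiker" points at a single document, so looking up the rare term first leaves very few items to score.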

27
Inference in WHIRL
  • Best-first search: pick state s that is best
    according to f(s)
  • Suppose graph is a tree, and for all s, s', if s'
    is reachable from s then f(s) >= f(s'). Then A*
    outputs the globally best goal state s* first,
    and then next best, ...
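The best-first property can be sketched with a priority queue. This is a generic illustration, not WHIRL's search code: the toy state space (pick one score from each list, with f equal to the product of the scores picked so far) is an assumption chosen so that f never increases along a path.

```python
import heapq
import itertools


def best_first(start, children, is_goal, f):
    """Yield goal states in the order they are popped, highest f first.

    If the search space is a tree and f never increases from a state to
    its descendants, goals come out in non-increasing order of f.
    """
    counter = itertools.count()  # tie-breaker so states are never compared
    frontier = [(-f(start), next(counter), start)]
    while frontier:
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            yield state
            continue
        for child in children(state):
            heapq.heappush(frontier, (-f(child), next(counter), child))


# Toy problem (hypothetical): choose one score from each list in turn.
options = [[0.9, 0.5], [0.8, 0.3]]


def children(state):
    return [state + (w,) for w in options[len(state)]]


def is_goal(state):
    return len(state) == len(options)


def f(state):
    # Product of the scores chosen so far; unchosen positions count as 1.0,
    # so f is an optimistic bound that can only shrink as choices are made.
    score = 1.0
    for w in state:
        score *= w
    return score


results = list(best_first((), children, is_goal, f))
```

Because the generator is lazy, a caller can stop after the first few goals, which matches the idea of producing only as much of the ranked list as is needed.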

28
Inference in WHIRL
  • Explode p(X1,X2,X3): find all DB tuples
    <p,a1,a2,a3> for p and bind Xi to ai.
  • Constrain X~Y: if X is bound to a and Y is
    unbound,
  • find DB column C to which Y should be bound
  • pick a term t in X, find proper inverted index
    for t in C, and bind Y to something in that index
  • Keep track of the terms t used previously, and don't
    allow Y to contain one.
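One way to read the constrain step is sketched below. It is a simplified interpretation, not the actual WHIRL operator: the function name `constrain`, its arguments, and the rule of picking the highest-weight unused term are all assumptions made for illustration.

```python
def constrain(x_weights, index, doc_terms, excluded):
    """One hypothetical 'constrain' step for X~Y with X bound and Y unbound.

    x_weights: term -> weight for the document bound to X
    index:     inverted index for column C (term -> list of doc ids)
    doc_terms: doc id -> set of terms in that document
    excluded:  frozenset of terms already used in earlier steps
    Returns (term, candidate bindings for Y, updated excluded set).
    """
    remaining = [t for t in x_weights if t not in excluded and t in index]
    if not remaining:
        return None, [], excluded
    # Assumption: try the most important (highest-weight) unused term first.
    t = max(remaining, key=x_weights.get)
    # Y may not contain a previously used term: those documents were
    # already offered as bindings in an earlier step.
    bindings = [d for d in index[t] if not (doc_terms[d] & excluded)]
    return t, bindings, excluded | {t}


# Hypothetical weights and column contents.
x_weights = {"hitchhiker": 0.8, "guide": 0.5, "the": 0.1}
index = {"hitchhiker": ["d1"], "guide": ["d1", "d2"], "the": ["d1", "d2", "d3"]}
doc_terms = {"d1": {"the", "hitchhiker", "guide"},
             "d2": {"the", "guide"},
             "d3": {"the"}}

steps = []
excluded = frozenset()
while True:
    term, bindings, excluded = constrain(x_weights, index, doc_terms, excluded)
    if term is None:
        break
    steps.append((term, bindings))
```

Each document is proposed exactly once, under the most important term it shares with X, which is the effect of tracking previously used terms.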

29
Inference in WHIRL
34
Outline
  • Information integration
  • Some history
  • The problem, the economics, and the economic
    problem
  • Soft information integration
  • Concrete uses of soft integration
  • Classification
  • Collaborative filtering
  • Set expansion

45
Stopped about here.
46
Outline
  • Information integration
  • Some history
  • The problem, the economics, and the economic
    problem
  • Soft information integration
  • Concrete uses of soft integration
  • Classification
  • Collaborative filtering
  • Set expansion

51
Other string distances
52
Robust distance metrics for strings
  • Kinds of distances between s and t:
  • Edit-distance based (Levenshtein, Smith-Waterman,
    ...): distance is the cost of the cheapest sequence
    of edits that transform s to t.
  • Term-based (TFIDF, Jaccard, DICE, ...): distance
    based on set of words in s and t, usually
    weighting "important" words
  • Which methods work best when?
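To make the two families concrete, here is a small sketch with one representative of each: unit-cost Levenshtein distance (edit-based) and unweighted Jaccard similarity over whitespace tokens (term-based). These are textbook definitions, not the SecondString implementations; the function names are illustrative.

```python
def levenshtein(s, t):
    """Minimum number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def jaccard_similarity(s, t):
    """Shared words divided by all words, ignoring case and word order."""
    a, b = set(s.lower().split()), set(t.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0


edit_cohen = levenshtein("Cohen", "Cohn")
edit_labs = levenshtein("Bell Labs", "Labs Bell")
jac_labs = jaccard_similarity("Bell Labs", "Labs Bell")
```

Edit distance handles the small typo ("Cohen" vs "Cohn") well but penalises the word swap heavily, while the term-based score treats the swapped names as identical; that difference is one reason the question of which method works best depends on the data.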

53
Robust distance metrics for strings
SecondString (Cohen, Ravikumar, Fienberg, IIWeb
2003)
  • Java toolkit of string-matching methods from AI,
    Statistics, IR and DB communities
  • Tools for evaluating performance on test data
  • Used to experimentally compare a number of metrics

54
Results: Edit-distance variants
  • Monge-Elkan (a carefully-tuned
    Smith-Waterman variant) is the best on average
    across the benchmark datasets

11-pt interpolated recall/precision curves
averaged across 11 benchmark problems
55
Results: Edit-distance variants
But Monge-Elkan is sometimes outperformed on
specific datasets
Precision-recall for Monge-Elkan and one other
method (Levenshtein) on a specific benchmark
56
SoftTFIDF: A robust distance metric
  • We also compared edit-distance based and
    term-based methods, and evaluated a new "hybrid"
    method
  • SoftTFIDF, for token sets S and T:
  • Extends TFIDF by including pairs of words in S
    and T that "almost" match, i.e., that are highly
    similar according to a second distance metric
    (the Jaro-Winkler metric, an edit-distance like
    metric).
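The hybrid idea can be sketched as follows. This is a SoftTFIDF-style score under stated assumptions, not the SecondString code: the secondary similarity is Python's `difflib.SequenceMatcher` ratio used as a stand-in for Jaro-Winkler, the threshold 0.8 and the IDF table are made up, and both token lists are assumed to be non-empty.

```python
import math
from difflib import SequenceMatcher


def token_sim(a, b):
    """Stand-in secondary similarity in [0, 1] (NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def unit_weights(tokens, idf):
    """Unit-length weight vector for a token set, using the given IDF table."""
    vec = {t: idf.get(t, 1.0) for t in set(tokens)}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()}


def soft_tfidf(s_tokens, t_tokens, idf, theta=0.8):
    """Sum weight products over tokens of S whose closest token in T is
    more similar than theta, scaled by that similarity."""
    vs, vt = unit_weights(s_tokens, idf), unit_weights(t_tokens, idf)
    total = 0.0
    for w in vs:
        best_v = max(vt, key=lambda v: token_sim(w, v))
        d = token_sim(w, best_v)
        if d > theta:
            total += vs[w] * vt[best_v] * d
    return total


# Hypothetical IDF values: surnames are rarer than first names.
idf = {"william": 1.0, "cohen": 2.0, "cohn": 2.0}
soft = soft_tfidf(["william", "cohen"], ["william", "cohn"], idf)
exact = soft_tfidf(["william", "cohen"], ["william", "cohn"], idf, theta=0.999)
```

With exact token matching only "william" contributes, so `exact` stays low; allowing "cohen" and "cohn" to count as a near match lets the heavily weighted surname contribute and raises the score substantially.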

58
Comparing token-based, edit-distance, and hybrid
distance metrics
SFS is a vanilla IDF weight on each token (circa
1959!)
59
SoftTFIDF is a Robust Distance Metric
60
Cohen, Kautz & McAllester paper
63
Definitions
  • S, H are sets of tuples over "references"
  • e.g., "B. Selman_1", "William W. Cohen_34", "B Selman_2", ...
  • Ipot is a weighted set of possible arcs.
  • I is a subset of Ipot. Given r, follow a chain of
    arcs to get the final interpretation of r.
  • B. Selman_1 → Bart Selman_22 → ... → B.
    Selman_27

64
Goal
  • Given S and Ipot, find the I that minimizes a cost
    combining: number of arcs, total weight of all arcs,
    and number of tuples in the hard DB H = I(S)
  • Idea: find MAP hard database behind S
  • Arcs correspond to errors/abbreviations.
  • Chains of transformations correspond to errors
    that propagate via copying

65
Facts about hardening
  • This simplifies a very simple generative model
    for a database
  • Generate tuples in H one by one
  • Generate arcs I in Ipot one by one
  • Generate tuples in S one by one (given H and I)
  • Greedy method makes sense
  • Easy merges can lower the cost of later hard
    merges
  • Hardening is hard
  • NP hard even under severe restrictions, because
    the choices of what to merge where are all
    interconnected.
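A toy sketch of a greedy hardening loop follows. The cost used here (number of distinct hard tuples plus total weight of the chosen arcs) is an assumed simplification of the objective on the Goal slide, and the references, arc weights and function names are hypothetical; it is not the algorithm from the paper.

```python
def interpret(ref, arcs):
    """Follow the chain of chosen arcs from ref to its final interpretation."""
    seen = set()
    while ref in arcs and ref not in seen:  # 'seen' guards against cycles
        seen.add(ref)
        ref = arcs[ref]
    return ref


def cost(soft_db, arcs, weights):
    """Assumed cost: distinct tuples in the hard DB + weight of chosen arcs."""
    hard_db = {tuple(interpret(r, arcs) for r in tup) for tup in soft_db}
    return len(hard_db) + sum(weights[(a, b)] for a, b in arcs.items())


def greedy_harden(soft_db, potential):
    """Repeatedly add the potential arc that lowers the cost the most."""
    arcs = {}
    best = cost(soft_db, arcs, potential)
    while True:
        choice = None
        for src, dst in potential:
            if src in arcs:
                continue  # each reference gets at most one outgoing arc
            trial = dict(arcs)
            trial[src] = dst
            c = cost(soft_db, trial, potential)
            if c < best:
                best, choice = c, (src, dst)
        if choice is None:
            return arcs, best
        arcs[choice[0]] = choice[1]


# Hypothetical soft database and potential arcs (weights are made up).
soft_db = [("B. Selman", "Cornell"),
           ("Bart Selman", "Cornell"),
           ("Bert Sealmann", "Cornell")]
potential = {("B. Selman", "Bart Selman"): 0.3,
             ("Bert Sealmann", "Bart Selman"): 0.6}
arcs, total = greedy_harden(soft_db, potential)
```

Here both arcs are accepted: each merge removes one tuple from the hard database at a weight below one, so the three soft tuples collapse to a single hard tuple. Greedy choices are not guaranteed to be optimal, which is consistent with the hardness remark above.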

66
B.selman → Bart Selman
Critical in -> Critical .. For ..
affil(Bert Sealmann_3, Cornell_3)
author(Bert Sealmann_3, BLACKBOX problem solving_3)