Title: Multi-Document Database Extraction and Fusion
1 Multi-Document Database Extraction and Fusion
- Gideon S. Mann
- Department of Computer Science
- University of Massachusetts at Amherst
- gideon.mann_at_gmail.com
2 Information Extraction
Named Entity Extraction
(MUC, 1995)
The Pentagon has denied a request that top U.S.
commanders in Hawaii in 1941 be absolved of blame
for failing to be on alert for the Japanese
attack on Pearl Harbor, but the military agreed
that top Washington officials also must share the
blame. A Pentagon study re-affirmed the
conclusion of previous government investigations
that both Rear Admiral Husband E. Kimmel and his
Army counterpart, Maj. Gen. Walter C. Short,
committed "errors of judgment" leading up to
the Dec. 7, 1941, debacle.
Organizations, Locations, Times, People
Biological Entity Extraction (2003)
In addition, IL-2 but not IL-12 induced nuclear
factors NF-kappa B and AP, and regulation of the
nuclear levels of these two DNA binding protein
complexes is correlated with IFN-gamma and GM-CSF
gene expression.
Protein Molecules, Protein complex, Misc-Gene
3 Information Extraction (cont.)
Template Filling (MUC3, 1992)
DEV-MUC3-0011 (NOSC) LIMA, 9 JAN 90 (EFE) --
TEXT AUTHORITIES HAVE REPORTED THAT
FORMER PERUVIAN DEFENSE MINISTER GENERAL ENRIQUE
LOPEZ ALBUJAR DIED TODAY IN LIMA AS A CONSEQUENCE
OF A TERRORIST ATTACK. LOPEZ ALBUJAR, FORMER
ARMY COMMANDER GENERAL AND DEFENSE MINISTER
UNTIL MAY 1989, WAS RIDDLED WITH BULLETS BY THREE
YOUNG INDIVIDUALS AS HE WAS GETTING OUT OF HIS
CAR IN AN OPEN PARKING LOT IN A COMMERCIAL CENTER
IN THE RESIDENTIAL NEIGHBORHOOD OF SAN ISIDRO.
LOPEZ ALBUJAR, 63, WAS DRIVING HIS OWN CAR
WITHOUT AN ESCORT. HE WAS SHOT EIGHT TIMES IN
THE CHEST. THE FORMER MINISTER WAS RUSHED TO THE
AIR FORCE HOSPITAL WHERE HE DIED.
- MESSAGE ID DEV-MUC3-0011
- INCIDENT DATE 09 JAN 90
- INCIDENT LOCATION PERU LIMA (CITY) SAN ISIDRO (NEIGHBORHOOD)
- INCIDENT TYPE ATTACK
- INCIDENT STAGE OF EXECUTION ACCOMPLISHED
- INCIDENT INSTRUMENT ID -
- INCIDENT INSTRUMENT TYPE GUN "-"
- PERP INCIDENT CATEGORY -
- PERP INDIVIDUAL ID "THREE YOUNG INDIVIDUALS"
- HUM TGT NAME "ENRIQUE LOPEZ ALBUJAR"
- HUM TGT DESCRIPTION "FORMER ARMY COMMANDER GENERAL AND DEFENSE MINISTER" "ENRIQUE LOPEZ ALBUJAR"
- HUM TGT TYPE FORMER GOVERNMENT OFFICIAL / FORMER ACTIVE MILITARY "ENRIQUE LOPEZ ALBUJAR"
- HUM TGT NUMBER 1 "ENRIQUE LOPEZ ALBUJAR"
- HUM TGT EFFECT OF INCIDENT DEATH "ENRIQUE LOPEZ ALBUJAR"
4 Database Extraction
- A type of Template Filling
- Sentence-level relations (vs. document-level templates)
- Interdependencies between fields
Miles Davis was born in Alton, Illinois on May
25, 1926.
Philip Condit retired in 2003, and was replaced
by Harry Stonecipher as CEO of Boeing.
5 Database Fusion
Large corpus with (some) redundant information
Extract
Partial, noisy relations extracted from each
document
Fuse
A consensus database created from many
noisy databases
6 1) Statistical Relation Extraction
Binary relations only (higher-order relations addressed in fusion)
Yields confidence weighting W
- Minimally Supervised Training
- From Examples with Types
- Statistical Models
- Classification Models
- Sequence Models
[Figure: extracted relations with confidence weights .95, .80, .79, .75, .40]
7 2) Statistical Relation Fusion
One target relation (e.g. birthday(Miles Davis, Y))
8 3) Statistical Database Fusion
- Cross-field bootstrapping
- Probabilistic database constraint violation
9 Biographic Fact Extraction
Wikipedia Biography
Billie Holiday (1915-1959), also called Lady
Day, was an American singer, generally considered
one of the greatest jazz voices of all time,
alongside Sarah Vaughan and Ella Fitzgerald.
She grew up in the Fells Point section of
Baltimore.
Her mother, Sadie Fagan, was allegedly only
thirteen at the time of her birth; her father,
Clarence Holiday, a jazz guitarist who would play
for Fletcher Henderson, was fifteen.
Holiday married trombonist Jimmy Monroe on August
25, 1941. While still married to Monroe, she took
up with trumpeter Joe Guy as his common law wife.
She finally divorced Monroe in 1957 as she split
with Guy.
10 Management Succession
Roger B. Smith Michigan '47 Former chairman of
General Motors.
Robert C. Stempel WPI Retired Chairman & CEO,
General Motors Corporation
The Board of Directors elects John F. Smith, Jr.,
chief executive officer and president following
the resignation of Robert C. Stempel.
G. Richard "Rick" Wagoner Jr. was named president
and chief executive officer of General Motors on
June 1, 2000.
11 Minimally Supervised Training for Fact Extraction
- use hook as query
- retrieve different depths of ranked list
[Figure: retrieved sentences are labeled (e.g. "Miles Davis was born in 1926" tagged First, Last, Birthyear) and then used to train a model]
12 From Examples to Positive Instances
- Label every exact match to hook or target
Billie Holiday (1915-1959), also called Lady
Day, was an American singer, generally considered
one of the greatest jazz voices of all time,
alongside Sarah Vaughan and Ella Fitzgerald.
Holiday married trombonist Jimmy Monroe on August
25, 1941.
- Every sentence which contains a hook and a target is a positive training instance.
Billie Holiday (1915-1959)
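The labeling rule above can be sketched in a few lines. This is a minimal illustration, not the implementation behind the slides: it assumes plain substring matching for "exact match", and the function name `positive_instances` is hypothetical.

```python
def positive_instances(sentences, hook, target):
    """Keep every sentence that contains both the hook and the target string."""
    return [s for s in sentences if hook in s and target in s]

sentences = [
    "Billie Holiday (1915-1959), also called Lady Day, was an American singer.",
    "Holiday married trombonist Jimmy Monroe on August 25, 1941.",
]

# Hook = person name, target = known birth year (the seed example).
positives = positive_instances(sentences, hook="Billie Holiday", target="1915")
```

Only the first sentence is kept, since the second mentions neither the full hook nor the target year.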
13 Negative Instances by Pattern Match
Billie Holiday ( 1915 - 1959), also
- Generate patterns which extract those positive instances.
X ( Y
- Estimate how much noise those patterns generate
P(birthyear | X ( Y)
Problems:
- Only works for fixed patterns.
- Not a general training method.
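One way to read the noise estimate P(birthyear | X ( Y) is as the fraction of pattern matches whose extracted Y agrees with the known target for hook X. The sketch below assumes that reading; the regular expression, the function name `pattern_precision`, and the toy sentences (the third is invented to act as a noisy match) are all illustrative assumptions.

```python
import re

def pattern_precision(sentences, known, pattern_template):
    """Fraction of pattern matches whose extracted Y equals the known target for X."""
    correct = 0
    total = 0
    for hook, target in known.items():
        regex = re.compile(pattern_template.format(hook=re.escape(hook)))
        for sentence in sentences:
            for match in regex.finditer(sentence):
                total += 1
                if match.group("y") == target:
                    correct += 1
    return correct / total if total else 0.0

# Toy data: the third sentence is invented so that the pattern misfires once.
sentences = [
    "Billie Holiday (1915-1959), also called Lady Day, was an American singer.",
    "Miles Davis (1926-1991) was an American trumpeter.",
    "Miles Davis (1959) recorded Kind of Blue.",
]
known = {"Billie Holiday": "1915", "Miles Davis": "1926"}

# The pattern "X ( Y", with Y restricted to a four-digit token (an assumption).
template = r"{hook} \(\s*(?P<y>\d{{4}})"
precision = pattern_precision(sentences, known, template)
```

Here the pattern fires three times and is right twice, so the estimate is 2/3. As the slide notes, this only scores a fixed pattern; it does not produce labeled negatives for training a general model.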
14 Negative Instances by Target Type
- Label every exact match to hook or target, and label all words which match the target type.
Billie Holiday (1915-1959), also called Lady
Day, was an American singer, generally considered
one of the greatest jazz voices of all time,
alongside Sarah Vaughan and Ella Fitzgerald.
Holiday married trombonist Jimmy Monroe on August
25, 1941.
Billie Holiday (1915-1959)
- Every sentence which contains a hook and a match with the target type (that isn't the correct target) is a negative instance.
Billie Holiday (1915-1959),
Holiday married trombonist Jimmy Monroe on August
25, 1941.
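A rough sketch of type-based negative generation follows. It assumes a toy type matcher for birth years (any four-digit token) and uses the last name as the hook; both choices, and the name `label_candidates`, are assumptions for illustration rather than the type model used in the talk.

```python
import re

# Toy "year" type matcher: any standalone four-digit token (an assumption).
YEAR = re.compile(r"\b\d{4}\b")

def label_candidates(sentence, hook, target):
    """Label each type match in a hook-bearing sentence as Target or False."""
    if hook not in sentence:
        return []
    return [
        (m.group(), "Target" if m.group() == target else "False")
        for m in YEAR.finditer(sentence)
    ]

s1 = "Billie Holiday (1915-1959), also called Lady Day, was an American singer."
s2 = "Holiday married trombonist Jimmy Monroe on August 25, 1941."

labels_1 = label_candidates(s1, hook="Holiday", target="1915")
labels_2 = label_candidates(s2, hook="Holiday", target="1915")
```

The first sentence yields one positive (1915) and one negative (1959); the second yields only a negative (1941), which is exactly the kind of instance the pattern-based method cannot supply.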
15 Fact Extraction Models
Positive Instances Only
No need for type tagger
- Phrase Conditional Model (rote)
- Naïve Bayes I
- CRF I
Positive and Negative Instances
Better models
- Naïve Bayes II
- Maximum Entropy Model
- CRF II
16 Phrase Conditional Likelihood (PCL)
P(occupation | X was a Y)
Phrase match:
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target]
Negative Instance: Miles [First] Davis [Last] was [N] a [N] king [False]
Can find negative examples without a type model
17 Naïve Bayes I: Positive Instances Only
P(was | occupation) x P(a | occupation)
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target]
Unigram language model as background model
P(was) x P(a)
Trained without negative instances
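As a sketch of how a positive-only Naïve Bayes score might be computed, the snippet below compares the context words under a relation-specific unigram model against a background unigram model. Combining the two as a log-likelihood ratio, the add-alpha smoothing, and the toy token lists are assumptions; the slide only states that a unigram language model serves as the background model.

```python
import math
from collections import Counter

def unigram_model(token_lists, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood_ratio(context, p_relation, p_background):
    """Sum of log P(w | relation) - log P(w) over the context words."""
    return sum(
        math.log(p_relation[w] / p_background[w])
        for w in context
        if w in p_relation
    )

# Toy data: words between hook and target in positive instances,
# and words from unlabeled background text.
positive_contexts = [["was", "a"], ["was", "a"], ["is", "a"]]
background = [["was", "married", "to", "an"], ["was", "a"], ["married", "to", "the"]]
vocab = sorted({w for toks in positive_contexts + background for w in toks})

p_occupation = unigram_model(positive_contexts, vocab)
p_background = unigram_model(background, vocab)

score_match = log_likelihood_ratio(["was", "a"], p_occupation, p_background)
score_other = log_likelihood_ratio(["married", "to"], p_occupation, p_background)
```

With this toy data the context "was a" scores above zero and "married to" below zero, so no labeled negative instances are needed to separate them.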
18 Naïve Bayes II: Positive and Negative Instances
P(was | occupation) x P(a | occupation)
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target; occupation type]
P(was | not occupation) x P(married | not occupation) x P(to | not occupation) x P(an | not occupation)
Negative Instance: Miles [First] Davis [Last] was [N] married [N] to [N] an [N] actress [False; occupation type]
19 Maximum Entropy Model
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target]
Negative Instance: Miles [First] Davis [Last] was [N] married [N] to [N] an [N] actress [False]
Arbitrary features: unigrams, n-grams, string length, interposing typed
20 CRF I: Positive Instances Only
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target]
Negative Instance: Miles [First] Davis [Last] was [I] married [I] to [I] an [I] actress [I]
Differs from the maximum entropy model when there are multiple candidate targets in one sentence.
21 CRF II: Positive and Negative Instances
Positive Instance: Miles [First] Davis [Last] was [Y] a [Y] musician [Target]
Negative Instance: Miles [First] Davis [Last] was [N] married [N] to [N] an [N] actress [False]
In extraction, force the CRF to choose between labeling an occupation as Target or False
22 Extraction Experiments
Scripts Extract Formatted Data
Training data (15 people)
Ground truth for evaluation (132 people)
23 Training / Extraction
[Figure: Training (150): Label (e.g. "Miles Davis was born in 1926" tagged First, Last, Birthyear), then Train a Model. Extraction (150): Extract, then Fuse]
24 Extraction
Relations from all 150 documents, multiple
relations per person
25 Precision
PCL, CRF II have best precision
26 Pseudo-Recall
Naïve Bayes II, CRF I have best pseudo-recall
27 Relation Fusion
Weighted relations from each document
- Highest Confidence Relation over all documents
- Threshold Frequency
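The two fusion rules can be contrasted with a small sketch. It assumes each document contributes (value, confidence) pairs and that "threshold frequency" means counting only extractions at or above a confidence threshold and taking the most frequent value; that reading, the function names, and the toy extractions are assumptions.

```python
from collections import Counter

def highest_confidence(extractions):
    """Return the value of the single most confident extraction."""
    return max(extractions, key=lambda pair: pair[1])[0]

def threshold_frequency(extractions, threshold):
    """Return the most frequent value among extractions at or above the threshold."""
    counts = Counter(value for value, conf in extractions if conf >= threshold)
    return counts.most_common(1)[0][0] if counts else None

# Toy (value, confidence) pairs for one hook, pooled over several documents.
extractions = [
    ("1926", 0.80),
    ("1926", 0.79),
    ("1991", 0.95),
    ("1926", 0.75),
    ("1955", 0.40),
]

by_confidence = highest_confidence(extractions)
by_frequency = threshold_frequency(extractions, threshold=0.7)
```

In this toy case a single confident but wrong extraction wins under the highest-confidence rule, while the frequency rule recovers the value that several documents agree on.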
28 Label Marginals for Fusion
The label marginal for the candidate target is computed via constrained forward-backward (Culotta and McCallum, 2004) and summed over all sentences.
Example: "Miles Davis was a musician, and Davis was a great one." (musician labeled Target)
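A minimal sketch of the summation step is shown below. It assumes the per-sentence label marginals (the probability that a candidate is labeled Target) have already been computed by the CRF; the constrained forward-backward computation itself is not shown, and the marginal values are invented toy numbers.

```python
from collections import defaultdict

def fuse_by_marginal_sum(sentence_marginals):
    """Sum each candidate's Target marginal over all sentences; return the best."""
    totals = defaultdict(float)
    for marginals in sentence_marginals:
        for candidate, prob in marginals.items():
            totals[candidate] += prob
    best = max(totals, key=totals.get)
    return best, dict(totals)

# One dict per sentence: candidate -> P(label = Target | sentence), toy values.
sentence_marginals = [
    {"musician": 0.9},
    {"musician": 0.6, "king": 0.3},
    {"king": 0.7},
]

best, totals = fuse_by_marginal_sum(sentence_marginals)
```

Summing marginals lets several moderately confident sentences outweigh one confident outlier, which is the intuition behind using the marginals rather than only the top-scoring label.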
29 Fusion
One relation per person, fused from multiple
documents
30 Biographic Fact Fusion Accuracy
CRF II (with negative examples) using Sum of
Label Marginal Fusion performs best
31 Fusion Accuracy
Threshold frequency fusion for PCL and Naïve
Bayes, Label Marginal fusion for the CRFs
CRF II is winner in all relations but occupation
32 Retrieval Set Size vs. Fusion Accuracy
[Figure: precision, recall, and fusion accuracy plotted against retrieval set size]
33 Fusion across Relations
Relations co-occur
- Likely to mention someone's birthday and birthyear in the same sentence
Relations are mutually constrained
- Relations form a timeline and must obey those semantics
34 Cross-Field Bootstrapping for Co-occurrence
Hesse was born on July 2. Hesse was born on July
2, 1877. On July 2, 1877, Hesse came into the
world in the town of Calw, Germany. In 1877,
Hesse's mother gave birth to a baby boy in Calw.
Fix an order for extraction/fusion of relations. In
training, add features for relations that come
earlier in the order.
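The idea of adding features for relations earlier in the order can be sketched as a feature function that checks whether an already-fused field value appears in the same sentence as the current candidate. The feature names and the function `cross_field_features` are hypothetical, and substring matching is a simplifying assumption.

```python
def cross_field_features(sentence, candidate, fused_fields):
    """Features for a candidate, including co-occurrence with already-fused fields."""
    feats = {"candidate=" + candidate: True}
    for field, value in fused_fields.items():
        # True when a field fused earlier in the order is mentioned in this sentence.
        feats["cooccurs_with_" + field] = value in sentence
    return feats

# Suppose birthday was fused first; birthyear extraction comes next in the order.
fused = {"birthday": "July 2"}

with_birthday = cross_field_features(
    "Hesse was born on July 2, 1877.", "1877", fused
)
without_birthday = cross_field_features(
    "In 1877, Hesse's mother gave birth to a baby boy in Calw.", "1877", fused
)
```

A model trained with such features can learn that a year appearing next to the known birthday is more likely to be the birth year than a year appearing alone.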
35 Cross-field Bootstrapping Accuracy
36 Database Dependencies
[Figure: CEO succession timeline: Smith, Stempel, Smith, Wagoner; dates 1990, 1992, 2000]
37 Semantic Database Constraints
[Figure: CEO succession timeline: Smith, Stempel, Smith, Wagoner; dates 1990, 1992, 2000]
38 Constraints for Re-ranking
Corroborative
Contradictory
Only CEO start dates belong in the database
A CEO's start is before their end
If one CEO precedes a second CEO, all their dates
must be in order
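The two timeline constraints can be written as simple predicates. This sketch represents a tenure as a (start, end) pair of years and allows a successor to start in the year the predecessor ends; the representation, the function names, and the example tenures (loosely based on the years shown on the slides, not verified data) are assumptions.

```python
def start_before_end(tenure):
    """A CEO's start date must come before their end date."""
    start, end = tenure
    return start < end

def in_succession_order(first, second):
    """If `first` precedes `second`, all four dates must be in order."""
    return (
        start_before_end(first)
        and first[1] <= second[0]
        and start_before_end(second)
    )

# Illustrative (start, end) tenures.
stempel = (1990, 1992)
smith = (1992, 2000)

consistent = in_succession_order(stempel, smith)
contradictory = in_succession_order(smith, stempel)
```

A candidate database row that makes such a predicate false is contradictory evidence, while rows that satisfy it corroborate each other.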
39 Probabilistic Database
Each relation's score is the single-relation fusion score
Assumes strong independence between relations
40 Probabilistic Database with Constraints
Each constraint holds over a set of relations and has the value 1 only when the constraint is not violated.
u: number of universally quantified relations in the constraint
e: number of existentially quantified relations in the constraint
41 Relation Re-scoring using Constraint Violation
Two indicators are used: one is 1 if the constraint holds, the other is 1 if the constraint applies.
This measures how likely a relation is to cause a
constraint violation, given that the constraint
applies.
In practice, multiple constraints are used.
42 Example Constraint
Start dates come before end dates
Applies?
Holds? (only false when the constraint is violated)
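One possible reading of the violation measure is a weighted ratio: among the other candidate relations to which the constraint applies, what share of the fusion weight violates it. The sketch below implements that reading for the start-before-end constraint; the tuple representation, the weights, and the ratio itself are assumptions, since the slide's exact formula is not shown here.

```python
def violation_rate(candidate, others, applies, holds):
    """Weighted share of applicable relations for which the constraint fails."""
    applicable = [(rel, w) for rel, w in others if applies(candidate, rel)]
    total = sum(w for _, w in applicable)
    if total == 0:
        return 0.0
    violated = sum(w for rel, w in applicable if not holds(candidate, rel))
    return violated / total

# Relations are (kind, person, year) tuples; weights are toy fusion scores.
def applies(start_rel, other):
    """The constraint applies to end relations for the same person."""
    return other[0] == "end" and other[1] == start_rel[1]

def holds(start_rel, other):
    """Start dates come before end dates."""
    return start_rel[2] < other[2]

candidate = ("start", "Stempel", 1990)
others = [
    (("end", "Stempel", 1992), 0.8),
    (("end", "Stempel", 1989), 0.2),
    (("end", "Wagoner", 2000), 0.9),  # different person: constraint does not apply
]

rate = violation_rate(candidate, others, applies, holds)
```

A candidate with a high violation rate conflicts with most of the applicable evidence and would be ranked lower under this reading.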
43 Relation Re-scoring for start(x,t)
[Plot: precision vs. recall]
44 Relation Re-scoring for end(x,t)
[Plot: precision vs. recall]
45 Relation Re-scoring for precedes(x,y)
[Plot: precision vs. recall]
46 MaxF1 for each constraint
47 Conclusion
- A general method for minimally supervised training (where negative examples are generated using a type model)
- Fusion across multiple documents using the CRF label marginals
- Cross-field bootstrapping for taking advantage of multi-relation co-occurrence
- Probabilistic database constraint violation for leveraging cross-relation dependencies