Title: Verification of Facts across Document Boundaries
1 Verification of Facts across Document Boundaries
- Roman Yangarber
- University of Helsinki 2006
2 Synopsis
- Extraction of facts from text
- Focus on relation extraction, n-ary
- Position paper
- Baseline extraction of facts
- Local analysis
- Assess confidence locally
- Aggregate facts globally
- Verify across a large collection
- Results
- Demos (?)
3 Baseline IE
- Local extraction
- Process one document at a time, e.g., a news stream
- Analyze text
- Fill in answer template(s)
- Forget the result
- Prior knowledge
- Contained in knowledge bases
- Pattern bases
- Ontologies
- Recognizers trained on a corpus (pre-tagged manually)
- Given a priori, static
- Not reused during extraction process
- NB: the document collection need not be static
4 Impact of traditional processing
- Utility
- Facts/events typically have multiple references in a document collection
- Facts evolve, are updated, repeated, corrected
- User would benefit from cross-document linking of facts/events
- Performance
- Glass ceiling on performance
- When viewed locally, the problem may be too restricted, and therefore ill-posed
- System may need more than
- Document-local information
- Static prior knowledge
5 Reasons behind local IE
- Tradition
- Evaluation
- MUC
- ACE
6 Proposal
- When IE system extracts facts from a document in a large collection, use
- A priori knowledge
- Information in current document
- A posteriori knowledge, derived from elsewhere in the collection
- Extensions to IE paradigm
- Locally, do not commit to hard answers
- Assess confidence locally
- Aggregate information globally
- Leverage global information to find improvements
7 Local analysis
- Traditional approach
- When system processes document D0, it may generate a filled template / database record
- Each slot in the record is a single best-guess answer
- Deferred commitment
- Return a distribution of answers
8 Example
- According to Reuters, stocks declined today on Wall Street after lower-than-expected earnings forecasts from Hewlett-Packard, Co. Investors are being cautious, said Stephen Kolano, an equity trader for Mellon Growth Advisors in Boston, awaiting further news on the company's planned merger with Compaq.
- ⇒ merger(the company, Compaq)
- System must resolve the definite NP reference, based on features in local context
- Distance from anaphor
- Relative position in sentence
- Syntactic constraints, semantic constraints
9 Reference resolution
- Distinguish three kinds of references
- Immediate
- Indirect
- Elliptic / implicit
- Ellipsis as anaphora
10 Ellipsis example
- Epidemics domain
- Pattern
- np(people) vg(die) in np(location)
- matches
- 5 people died in Sudan.
- 5 people died.
- Fill template
- epidemic(disease, victim, location, ...)
- Virtually transformed to:
- 5 people died in <Location>.
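The elided-location case can be made concrete with a toy sketch (mine, not from the talk), where a regex with an optional group stands in for the np/vg pattern elements:

    import re

    # Toy pattern: np(people) vg(die) [in np(location)], location optional.
    PATTERN = re.compile(r"(?P<victims>\d+ people) died(?: in (?P<location>\w+))?")

    def fill_template(sentence):
        m = PATTERN.search(sentence)
        if m is None:
            return None
        # When the location is elided, the sentence behaves like
        # "5 people died in <Location>": the empty slot becomes an
        # anaphor to be resolved from surrounding context.
        return {"victim": m["victims"], "location": m["location"] or "<Location?>"}

    print(fill_template("5 people died in Sudan."))  # location: Sudan
    print(fill_template("5 people died."))           # location: <Location?>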
12 Reference resolution
- Distinguish three kinds of references
- Immediate
- Unique in story
- Unique in sentence
- (Almost) as strong a criterion as immediate
- Indirect
- Elliptic / implicit
- Ellipsis as anaphora
- Rule-based / automatically trained / hybrid
13 Deferred commitment: reference resolution
- Relax the reference resolution mechanism
- Each non-immediate reference is linked to a distribution of answers
- Ranked list of value-confidence pairs
- A global analysis phase will try to infer the globally optimal answer
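A minimal sketch of the data structure this implies; the class name, the multiplicative re-weighting rule, and the example values are mine, for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class SlotFill:
        # Ranked list of (value, confidence) pairs, best first.
        candidates: list = field(default_factory=list)

        def best(self):
            return self.candidates[0][0]

        def rerank(self, global_weight):
            # The global phase re-weights local candidates with
            # collection-wide evidence, then re-sorts.
            self.candidates = sorted(
                ((v, c * global_weight(v)) for v, c in self.candidates),
                key=lambda vc: vc[1], reverse=True)

    fill = SlotFill([("Mellon Growth Advisors", 0.6), ("Hewlett-Packard", 0.4)])
    fill.rerank(lambda v: 2.0 if v == "Hewlett-Packard" else 1.0)
    print(fill.best())  # global evidence flips the local best guess to Hewlett-Packard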
14 Global analysis
- Fact base: the totality of facts extracted from the collection of documents
- A posteriori knowledge
- Imperfect
- Improve the fact base using global information
- If a fact is attested more than once, confidence increases
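As a toy illustration (not the paper's model) of why repeated attestation raises confidence, assume extraction errors are independent across documents, so each attestation lowers the probability that the fact is wrong:

    def aggregate(local_facts):
        """Combine per-document (fact, confidence) pairs into a fact base.

        Naive independence assumption (mine): P(wrong) = prod(1 - c_i),
        so confidence grows with each independent attestation."""
        fact_base = {}
        for fact, conf in local_facts:
            prev = fact_base.get(fact, 0.0)
            fact_base[fact] = 1.0 - (1.0 - prev) * (1.0 - conf)
        return fact_base

    # Two independent 0.6-confidence attestations yield ~0.84:
    print(aggregate([("ebola/Sudan/2000-05", 0.6), ("ebola/Sudan/2000-05", 0.6)]))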
15 Example
- Optimal global interpretation: e.g., D1 and D3 correct, and D2 partly wrong
16 Example
17 Experiments: global aggregation
- How can global aggregation be leveraged to obtain an (overall) improvement in results?
- Job advertisements
- Prior work
- Epidemic Surveillance
- Relation correction
- Estimate utility of global aggregation
18 Application: job advertisements
- Nahm & Mooney, 2000. Integrate IE and DM
- 1. Run IE on part of the collection → extracted facts
- 2. DM on facts → association rules between fills
- 3. Augment IE with mined rules: if a rule fires, add attributes from the head of the rule
- Example
- Facts: job=programmer, skills={Java, C}, platform={Unix, Mac}
- Rule: Linux ∈ platform → C ∈ skills
- Second-phase extraction
- If a rule fires and the head attribute is mentioned anywhere, then extract it too
- Improves recall, with no significant loss in precision
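A sketch of the second-phase step as described above; the rule representation, helper names, and the C++ example value are illustrative, not from Nahm & Mooney:

    def apply_mined_rules(record, rules, doc_text):
        """Second-phase extraction: if a mined rule fires on the record and
        the rule's head value is mentioned anywhere in the document text,
        extract the head value too.

        record: {attribute: set of values}; each rule is
        ((body_attr, body_value), (head_attr, head_value))."""
        text = doc_text.lower()
        for (body_attr, body_val), (head_attr, head_val) in rules:
            fires = body_val in record.get(body_attr, set())
            if fires and head_val.lower() in text:
                record.setdefault(head_attr, set()).add(head_val)

    record = {"job": {"programmer"}, "platform": {"Linux"}}
    rules = [(("platform", "Linux"), ("skills", "C++"))]  # illustrative rule
    apply_mined_rules(record, rules, "Experience with C++ required.")
    print(record["skills"])  # {'C++'}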
19 Application: epidemic surveillance
- On-line incremental database
- Start from plain text
- Extract database records
- Disease
- Location
- Date
- Victims
- Kind of victim/descriptor: people, animals, plants
- Victim status: sick, dead
20 Utility: confidence
- Reference resolution: immediate reference
- Compute local confidence for the entire fact
- For each record, local confidence is a function of three key attributes: disease, location, date
- Produce confidence c so that c > θ_conf when all key attributes are immediate (or unique)
- Record is locally confident if c > θ_conf
- (θ_conf = 0.75)
- In human manual judgements, confidence was found to be correlated with correctness (0.55)
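A minimal sketch of a local confidence score; the talk only states that confidence is a function of the three key attributes and that θ_conf = 0.75, so the geometric-mean combination below is my assumption:

    THETA_CONF = 0.75
    KEY_ATTRS = ("disease", "location", "date")

    def local_confidence(slot_conf):
        # slot_conf: {attribute: per-slot confidence in [0, 1]}.
        # Geometric mean of the key-attribute confidences (my choice).
        prod = 1.0
        for attr in KEY_ATTRS:
            prod *= slot_conf.get(attr, 0.0)
        return prod ** (1.0 / len(KEY_ATTRS))

    def is_locally_confident(slot_conf):
        return local_confidence(slot_conf) > THETA_CONF

    print(is_locally_confident({"disease": 0.9, "location": 0.8, "date": 0.95}))  # True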
21 Global aggregation
- Aggregate related records if they have similar values of key attributes
- Disease: identical
- Location: close (?)
- Time: close (within a small window, e.g. 1 month)
- Aggregate records into groups (epidemics)
- Any confident record induces a group
- Repeat: extend the timeline if another record falls within the time window, irrespective of its confidence
- (Figure: outbreak span)
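A sketch of this grouping step under stated assumptions: a 31-day window, a simplified record layout, and location matching omitted; only the seed-and-extend timeline logic follows the slide:

    from datetime import date, timedelta

    WINDOW = timedelta(days=31)  # "close in time": e.g., one month

    def aggregate_groups(records):
        """Group records into candidate epidemics.

        records: dicts with keys disease, date, confident. Each locally
        confident record seeds a group; the group's time span is then
        repeatedly extended by same-disease records falling within the
        window, irrespective of their own confidence."""
        groups = []
        for seed in (r for r in records if r["confident"]):
            group = [seed]
            lo = hi = seed["date"]
            changed = True
            while changed:  # repeat until the outbreak span stops growing
                changed = False
                for r in records:
                    if r in group or r["disease"] != seed["disease"]:
                        continue
                    if lo - WINDOW <= r["date"] <= hi + WINDOW:
                        group.append(r)
                        lo, hi = min(lo, r["date"]), max(hi, r["date"])
                        changed = True
            groups.append(group)
        return groups

    recs = [{"disease": "cholera", "date": date(2000, 5, 1), "confident": True},
            {"disease": "cholera", "date": date(2000, 5, 20), "confident": False},
            {"disease": "cholera", "date": date(2000, 6, 15), "confident": False}]
    print(len(aggregate_groups(recs)[0]))  # 3: the span grows to absorb June 15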
22 Local confidence
23 Judge fills in non-confident records
27 Experiment: relation correction
- Extract location and state (country) attributes
- Early commitment: single best guess
- Deferred commitment: distribution of answers
- Location may be ambiguous
- Aggregate records for each value of location
- Gather all possible states
- Apply majority vote to correct mistakes
- Document by document, to avoid simply mapping each location to its most common state (the baseline)
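A sketch of a document-by-document correction consistent with the bullets above; the combination rule that mixes the local distribution with global votes is mine, for illustration:

    from collections import Counter

    def correct_state(local_dist, global_votes):
        """Re-rank one record's state candidates using global evidence.

        local_dist: {state: local confidence} for this document's record
        (the deferred-commitment distribution); global_votes: Counter of
        states proposed for the same location elsewhere in the collection.
        Re-ranking is done record by record, rather than overwriting every
        record with the collection-wide majority (the baseline)."""
        total = sum(global_votes.values()) or 1
        return max(local_dist,
                   key=lambda s: local_dist[s] * (1 + global_votes[s] / total))

    votes = Counter({"UK": 5, "MA/US": 2})  # states seen for "Cambridge"
    print(correct_state({"MA/US": 0.5, "UK": 0.45}, votes))  # UK wins on global vote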
28 Performance of relation correction
29 Conclusion
- Locally
- Defer commitment
- Assess confidence
- Globally
- Aggregate (imperfectly!) extracted facts
- Leverage global information to obtain improvements
30 End
32 Experiment assessment
- Confidence
- Global clustering
33 What is a fact?
- Basic: entities and names; identify all
- persons, organizations, locations,
- artefacts, medicines/drugs, diseases
- Why is even this already useful? Examples
- find all persons related to person X
- find all companies related to company Y
- find all diseases in country Z
- try with Google
- Complex: relationships and events
- how entities are related to each other,
- how they interact
34 What it means to find a fact
- Unstructured → structured representation
- Plain text → spreadsheet, database table
35 Example: Executive Search
- George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
37 Example: Epidemics
- Rule: X confirm N death in Loc
38 Why is it important?
- Once facts are in database
- can search for them more easily
- computer can process them intelligently
- find patterns and trends
- Certain queries cannot be done with keywords alone
- Information explosion
39 IE and IR
- (Figure: additional processing)
40 Focused Search
- Not spontaneous, random search
- Users spend much time on persistent, focused search: the repeated pursuit of facts that are important in their analysis/research
- User places higher value on information related to a long-standing interest, to which s/he has a long-term commitment, than on information related to a one-time interest
41 Why is it difficult? References
- Language is complex
- George Garrick, 40 years old, has served as president of Sony, Inc. for 13 years.
- <more text>...
- The company announced his resignation effective July.
- (Table: expressions to resolve: "George Garrick, 40 years old"; "Sony, Inc."; "The company"; "his"; "October"; document date: June 23, 2000)
42 Example applications
- Database of global epidemics
- Database of corporate executives
- Corporate mergers and acquisitions
- Lawsuits / legal action, bankruptcy
- Terrorist attacks
- Natural disasters
- Space launches: rockets, missiles, ...
- Air accidents
44 Information extraction
- Finding facts
- names of entities
- people
- places
- organizations
- etc.
- relations between entities
- organizations employ people
- events
- who was affected, how, when, where
45 IE Pipeline: knowledge bases
46 Why IE is useful
- Semantic index into document collection
- For known scenarios, more reliable than keyword index
- Example: answer queries like
- Where does a given disease appear?
47 Performance
- Ceiling of 70%
- Many factors compounded
- Name classification
- Reference resolution
- Coverage of event patterns
- Elided elements in events
48 Examples of problems
- Names
- Diseases
- Agents
- bacteria, viruses, fungi, algae, parasites, ...
- Vectors
- Drugs
- ...
- Locations
- Reference resolution
- Location-country/state relation (normalization)
49 Examples of problems
- Reference resolution
- Location-country/state relation (normalization)
- Easy (because it is, almost, a functional relation)
- More generally, how can we verify a filled slot?
- No functional relations
- (e.g., any disease can occur anywhere)
50 Current research
- Favor unsupervised/weakly supervised techniques
- Minimize manual labor
- Allow us to use much larger corpora for training
- Unsupervised acquisition of semantic knowledge
- Names
- Semantic patterns
51 Current research
- Cross-document fact validation
- Aggregate information across documents
- Correct errors made in earlier stages of pipeline
- Provide richer information through the pipeline
- Propagate multiple hypotheses (instead of best guess)
- Self-confidence ratings
52 Cross-document integration
- Can provide strong global evidence
- Also poses challenges
- Time-varying information
- Evolving epidemics, changing (growing) numbers
- Resolution of doubtful events (at later time)
- Corrections
- Can be viewed as deeper understanding of domain
- E.g., reason about epidemics from incidents
53 Conclusion
- Applications form good base for research
- Observe performance improvements in real setting
- Provide large fact base for cross-document integration
55 Synopsis
- Information Extraction (IE)
- extracting factual information from textual documents, written in natural human language
- IE on a large scale
- in contrast with the traditional study of IE, which focuses on the smaller-scale, laboratory setting
- Applying IE methods to large collections of text attempts to exploit the massive redundancy in the facts contained in the collections
- Redundancy is inherent in the stream of emerging events, whether the topic is general news, science/medicine, business, etc.
56 Structure of talk
- Demonstration and motivation
- Problem domain
- Need semantic knowledge
- What is a pattern?
- What is a name?
- Techniques on a large scale
- Learning semantic patterns
- Learning semantic lexicons
- Learning global trends in extracted data
- For automatic recovery from errors
57 Objectives
- Acquire semantic patterns for IE
- To customize IE system rapidly
- Minimally supervised acquisition
- Build upon previously described methods
- Common feature: start out with high precision, then gradually trade off precision for recall
- Problem of convergence
- Algorithm discovers stream of patterns
- Want to find out when to stop
58 Outline
- Prior Work
- Basic Unsupervised Learner
- Counter-Training
- Experiments and Results
- General framework
- Current work/Conclusion
59 Prior Work
- On knowledge acquisition
- Riloff (1996); Yangarber, Grishman, Tapanainen, Huttunen (2000)
- Review by human, supervised, or set thresholds
- Thelen & Riloff (2002); Yangarber, Lin, Grishman (2002)
- Natural convergence
- Collins & Singer (1999), Yarowsky (1995)
60 Basic Unsupervised Algorithm
- For pattern discovery
- Pre-processing
- Factor out NEs (and other OOVs)
- NE tagger, e.g.
- Parse
- General-purpose dependency parser
- Tree normalization
- Passive, relative, ... → active
- Pattern extraction
- Tree → core constituents: S-V-O
- John Smith was hired by IBM → Company hire Person
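A minimal sketch (not the system's code) of the normalization and extraction steps, with the dependency parse faked for a single example:

    def extract_pattern(clause):
        """Reduce a parsed clause to its core S-V-O pattern.

        clause: output of a (hypothetical) dependency parse in which named
        entities have already been replaced by their classes. Passive
        clauses are normalized to active: the by-agent becomes the subject."""
        subj, verb, obj = clause["subject"], clause["verb"], clause["object"]
        if clause.get("voice") == "passive":
            subj, obj = clause.get("agent"), subj
        return (subj, verb, obj)

    # "John Smith was hired by IBM" -> Company hire Person
    clause = {"voice": "passive", "subject": "Person", "verb": "hire",
              "object": None, "agent": "Company"}
    print(extract_pattern(clause))  # ('Company', 'hire', 'Person')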
61 Bootstrapping Learner
- Initial query
- A small set of seed patterns which partially characterize the topic of interest
- Retrieve documents containing seed patterns
- Relevant documents
- Rank patterns (in relevant documents)
- According to frequency in relevant docs vs. overall frequency
- Add top-ranked pattern to seed pattern set; repeat
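A skeleton of this bootstrapping loop; the document representation, retrieval, scoring, and stopping criterion are simplified placeholders (scoring is the topic of the next slide):

    from collections import namedtuple

    Doc = namedtuple("Doc", "patterns")  # illustrative stand-in for a document

    def bootstrap(seeds, corpus, score, max_iters=20):
        """Grow the accepted pattern set from seeds, one pattern per iteration."""
        accepted = set(seeds)
        for _ in range(max_iters):
            # Documents matching any accepted pattern count as relevant.
            relevant = [d for d in corpus if accepted & d.patterns]
            candidates = {p for d in relevant for p in d.patterns} - accepted
            if not candidates:
                break
            accepted.add(max(candidates, key=lambda p: score(p, relevant, corpus)))
        return accepted

    corpus = [Doc({"Company hire Person", "Person resign"}),
              Doc({"Person resign", "Person succeed Person"}),
              Doc({"stock fall"})]
    freq = lambda p, relevant, corpus: sum(p in d.patterns for d in relevant)
    print(bootstrap({"Company hire Person"}, corpus, freq))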
62 Pattern score
- Trade off recall and precision
- Where
- H(p): set of documents matching p
- K(d): set of patterns matching document d
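The score formula itself did not survive the transcript. A plausible form, in the spirit of the RlogF-style scores of Riloff (1996) and Yangarber et al. (2000), with R denoting the current set of relevant documents (R is my assumption; it is not defined on this slide):

    Score(p) = \frac{|H(p) \cap R|}{|H(p)|} \cdot \log |H(p) \cap R|

The first factor rewards precision (the fraction of p's documents that are relevant); the log factor rewards reach (how many relevant documents p matches).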
63 Counter-training
- Eventually a mono-learner will pick up non-specific patterns
- They match documents relevant to the scenario, but also non-relevant documents
- Introduce multiple learners in parallel
- Learning in different, competing scenarios
- Documents which are ambiguous will receive a high relevance score in more than one scenario
- Prevent learning patterns which match such ambiguous documents
64 Counter-training
- (Figure: competing scenarios S1, S2, S3)
65 Refine Pattern Precision
- Take into account negative evidence provided by other learners
- In scenario S_i
- Prec(p) > 0
- Continue as long as the number of categories > 1
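One plausible reading of precision with negative evidence, since the exact formula is not in the transcript: documents claimed by competing learners count against the pattern:

    def precision(pattern_docs, rel_i, rel_others):
        """Precision of pattern p in scenario S_i, discounted by rivals.

        rel_i(d): relevance of document d to scenario S_i, in [0, 1];
        rel_others(d): strongest relevance assigned to d by any competing
        learner. The averaging rule is my assumption, for illustration."""
        docs = list(pattern_docs)
        if not docs:
            return 0.0
        return sum(rel_i(d) - rel_others(d) for d in docs) / len(docs)

    # A pattern is kept only while Prec(p) > 0 (see slide).
    print(precision(["d1", "d2"], lambda d: 0.9, lambda d: 0.2))  # 0.7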
66 Experiments
- Corpus
- WSJ 1992-1994
- 15,000 documents/3Mb
- Indirect evaluation
- Test
- 250 documents for each tested scenario
- 100 MUC-6 training documents (Management Succession)
- 150 documents tagged manually for each scenario
67 Scenarios
70 Current Problems
- Choice of scenarios
- Uneven representation within the corpus
- Choice of seeds
- Control focus of learning by making the scenario stronger
- Ambiguity/overlap
- At document level
- At pattern level
73 Counter-training framework
- Pre-process large corpus
- Factor out irrelevant information to reduce sparseness
- Give seeds to several category learners
- Add negative learners if possible
- (Seeds can be patterns or datapoints)
- Partition dataset
- Relevant to some learner, or relevant to none
- For each learner
- Rank rules
- Keep best
- Rank datapoints
- Keep best
- Repeat until convergence
74 Other Knowledge Discovery Tasks
- On name finding and categorization
- Yangarber, Lin, Grishman (2002)
- Thelen & Riloff (2002)
75 Conclusion
- In counter-training, unsupervised learners help each other to bootstrap
- by finding their own, weakly reliable positive evidence
- by providing reliable negative evidence to each other
- Unsupervised learners supervise each other
76 Counter-training
- Train several learners simultaneously
- Compete with each other in different domains
- Improve precision
- Convergence: learners provide an indication to each other of when to stop learning