Title: Verification of Facts across Document Boundaries
1 Verification of Facts across Document Boundaries
- Roman Yangarber
- University of Helsinki 2006
2 Synopsis
- Extraction of facts from text
- Focus on relation extraction, n-ary
- Position paper
- Baseline extraction of facts
- Local analysis
- Assess confidence locally
- Aggregate facts globally
- Verify across a large collection
- Results
- Demos (?)
3 Baseline IE
- Local extraction
- Process one document at a time, e.g., a news stream
- Analyze text
- Fill in answer template(s)
- Forget the result
- Prior knowledge
- Contained in knowledge bases
- Pattern bases
- Ontologies
- Recognizers trained on a corpus (pre-tagged manually)
- Given a priori, static
- Not reused during extraction process
- NB: the document collection need not be static
4 Impact of traditional processing
- Utility
- Facts/events typically have multiple references in a document collection
- Facts evolve, are updated, repeated, corrected
- User would benefit from cross-document linking of facts/events
- Performance
- Glass ceiling on performance
- When viewed locally, the problem may be too restricted, and therefore ill-posed
- System may need more than
- Document-local information
- Static prior knowledge
5 Reasons behind local IE
- Tradition
- Evaluation
- MUC
- ACE
6 Proposal
- When IE system extracts facts from a document in a large collection, use
- A priori knowledge
- Information in current document
- A posteriori knowledge, derived from elsewhere in the collection
- Extensions to IE paradigm
- Locally, do not commit to hard answers
- Assess confidence locally
- Aggregate information globally
- Leverage global information to find improvements
7 Local analysis
- Traditional approach
- When system processes document D0, it may generate a filled template / database record
- Each slot in the record is a single best-guess answer
- Deferred commitment
- Return a distribution of answers
8 Example
- According to Reuters, stocks declined today on Wall Street after lower-than-expected earnings forecasts from Hewlett-Packard, Co. Investors are being cautious, said Stephen Kolano, an equity trader for Mellon Growth Advisors in Boston, awaiting further news on the company's planned merger with Compaq.
- ⇒ merger(the company, Compaq)
- System must resolve the definite NP reference, based on features in local context
- Distance from anaphor
- Relative position in sentence
- Syntactic constraints, semantic constraints
9 Reference resolution
- Distinguish three kinds of references
- Immediate
- Indirect
- Elliptic / implicit
- Ellipsis as anaphora
10 Ellipsis example
- Epidemics domain
- Pattern
- np(people) vg(die) in np(location)
- matches
- 5 people died in Sudan.
- 5 people died.
- Fill template
- epidemic(disease, victim, location, ...)
- Virtually transformed to:
- 5 people died in <Location>.
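The elided-location case can be made concrete with a toy sketch (mine, not from the talk), where a regex with an optional group stands in for the np/vg pattern elements:

    import re

    # Toy pattern: np(people) vg(die) [in np(location)], location optional.
    PATTERN = re.compile(r"(?P<victims>\d+ people) died(?: in (?P<location>\w+))?")

    def fill_template(sentence):
        m = PATTERN.search(sentence)
        if m is None:
            return None
        # When the location is elided, the sentence behaves like
        # "5 people died in <Location>": the empty slot becomes an
        # anaphor to be resolved from surrounding context.
        return {"victim": m["victims"], "location": m["location"] or "<Location?>"}

    print(fill_template("5 people died in Sudan."))  # location: Sudan
    print(fill_template("5 people died."))           # location: <Location?>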
12 Reference resolution
- Distinguish three kinds of references
- Immediate
- Unique in story
- Unique in sentence
- (Almost) as strong a criterion as immediate
- Indirect
- Elliptic / implicit
- Ellipsis as anaphora
- Rule-based / automatically trained / hybrid
13 Deferred commitment: reference resolution
- Relax the reference resolution mechanism
- Each non-immediate reference is linked to a distribution of answers
- Ranked list of value-confidence pairs
- A global analysis phase will try to infer the globally optimal answer
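A minimal sketch of the data structure this implies; the class name, the multiplicative re-weighting rule, and the example values are mine, for illustration:

    from dataclasses import dataclass, field

    @dataclass
    class SlotFill:
        # Ranked list of (value, confidence) pairs, best first.
        candidates: list = field(default_factory=list)

        def best(self):
            return self.candidates[0][0]

        def rerank(self, global_weight):
            # The global phase re-weights local candidates with
            # collection-wide evidence, then re-sorts.
            self.candidates = sorted(
                ((v, c * global_weight(v)) for v, c in self.candidates),
                key=lambda vc: vc[1], reverse=True)

    fill = SlotFill([("Mellon Growth Advisors", 0.6), ("Hewlett-Packard", 0.4)])
    fill.rerank(lambda v: 2.0 if v == "Hewlett-Packard" else 1.0)
    print(fill.best())  # global evidence flips the local best guess to Hewlett-Packard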
14 Global analysis
- Fact base: the totality of facts extracted from the collection of documents
- A posteriori knowledge
- Imperfect
- Improve the fact base using global information
- If a fact is attested more than once, confidence increases
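As a toy illustration (not the paper's model) of why repeated attestation raises confidence, assume extraction errors are independent across documents, so each attestation lowers the probability that the fact is wrong:

    def aggregate(local_facts):
        """Combine per-document (fact, confidence) pairs into a fact base.

        Naive independence assumption (mine): P(wrong) = prod(1 - c_i),
        so confidence grows with each independent attestation."""
        fact_base = {}
        for fact, conf in local_facts:
            prev = fact_base.get(fact, 0.0)
            fact_base[fact] = 1.0 - (1.0 - prev) * (1.0 - conf)
        return fact_base

    # Two independent 0.6-confidence attestations yield ~0.84:
    print(aggregate([("ebola/Sudan/2000-05", 0.6), ("ebola/Sudan/2000-05", 0.6)]))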
15 Example
- Optimal global interpretation: e.g., D1 and D3 correct, and D2 partly wrong
16 Example
17 Experiments: global aggregation
- How can global aggregation be leveraged to obtain an (overall) improvement in results?
- Job advertisements
- Prior work
- Epidemic Surveillance
- Relation correction
- Estimate utility of global aggregation
18 Application: job advertisements
- Nahm & Mooney, 2000. Integrate IE and DM
- 1. Run IE on part of the collection → extracted facts
- 2. DM on facts → association rules between fills
- 3. Augment IE with mined rules: if a rule fires, add attributes from the head of the rule
- Example
- Facts: job=programmer, skills={Java, C}, platform={Unix, Mac}
- Rule: Linux ∈ platform → C ∈ skills
- Second-phase extraction
- If a rule fires and the head attribute is mentioned anywhere, then extract it too
- Improves recall, with no significant loss in precision
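A sketch of the second-phase step as described above; the rule representation, helper names, and the C++ example value are illustrative, not from Nahm & Mooney:

    def apply_mined_rules(record, rules, doc_text):
        """Second-phase extraction: if a mined rule fires on the record and
        the rule's head value is mentioned anywhere in the document text,
        extract the head value too.

        record: {attribute: set of values}; each rule is
        ((body_attr, body_value), (head_attr, head_value))."""
        text = doc_text.lower()
        for (body_attr, body_val), (head_attr, head_val) in rules:
            fires = body_val in record.get(body_attr, set())
            if fires and head_val.lower() in text:
                record.setdefault(head_attr, set()).add(head_val)

    record = {"job": {"programmer"}, "platform": {"Linux"}}
    rules = [(("platform", "Linux"), ("skills", "C++"))]  # illustrative rule
    apply_mined_rules(record, rules, "Experience with C++ required.")
    print(record["skills"])  # {'C++'}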
19 Application: epidemic surveillance
- On-line incremental database
- Start from plain text
- Extract database records
- Disease
- Location
- Date
- Victims
- Kind of victim/descriptor: people, animals, plants
- Victim status: sick, dead
20 Utility: confidence
- Reference resolution: immediate reference
- Compute local confidence for the entire fact
- For each record, local confidence is a function of three key attributes: disease, location, date
- Produce confidence c so that c > θ_conf when all key attributes are immediate (or unique)
- Record is locally confident if c > θ_conf
- (θ_conf = 0.75)
- In human manual judgements, confidence was found to be correlated with correctness (0.55)
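A minimal sketch of a local confidence score; the talk only states that confidence is a function of the three key attributes and that θ_conf = 0.75, so the geometric-mean combination below is my assumption:

    THETA_CONF = 0.75
    KEY_ATTRS = ("disease", "location", "date")

    def local_confidence(slot_conf):
        # slot_conf: {attribute: per-slot confidence in [0, 1]}.
        # Geometric mean of the key-attribute confidences (my choice).
        prod = 1.0
        for attr in KEY_ATTRS:
            prod *= slot_conf.get(attr, 0.0)
        return prod ** (1.0 / len(KEY_ATTRS))

    def is_locally_confident(slot_conf):
        return local_confidence(slot_conf) > THETA_CONF

    print(is_locally_confident({"disease": 0.9, "location": 0.8, "date": 0.95}))  # True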
21 Global aggregation
- Aggregate related records if they have similar values of key attributes
- Disease: identical
- Location: close (?)
- Time: close (within a small window, e.g. 1 month)
- Aggregate records into groups (epidemics)
- Any confident record induces a group
- Repeat: extend the timeline if another record falls within the time window, irrespective of its confidence
- (Figure: outbreak span)
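A sketch of this grouping step under stated assumptions: a 31-day window, a simplified record layout, and location matching omitted; only the seed-and-extend timeline logic follows the slide:

    from datetime import date, timedelta

    WINDOW = timedelta(days=31)  # "close in time": e.g., one month

    def aggregate_groups(records):
        """Group records into candidate epidemics.

        records: dicts with keys disease, date, confident. Each locally
        confident record seeds a group; the group's time span is then
        repeatedly extended by same-disease records falling within the
        window, irrespective of their own confidence."""
        groups = []
        for seed in (r for r in records if r["confident"]):
            group = [seed]
            lo = hi = seed["date"]
            changed = True
            while changed:  # repeat until the outbreak span stops growing
                changed = False
                for r in records:
                    if r in group or r["disease"] != seed["disease"]:
                        continue
                    if lo - WINDOW <= r["date"] <= hi + WINDOW:
                        group.append(r)
                        lo, hi = min(lo, r["date"]), max(hi, r["date"])
                        changed = True
            groups.append(group)
        return groups

    recs = [{"disease": "cholera", "date": date(2000, 5, 1), "confident": True},
            {"disease": "cholera", "date": date(2000, 5, 20), "confident": False},
            {"disease": "cholera", "date": date(2000, 6, 15), "confident": False}]
    print(len(aggregate_groups(recs)[0]))  # 3: the span grows to absorb June 15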
22 Local confidence
23 Judge fills in non-confident records
27 Experiment: relation correction
- Extract location and state (country) attributes
- Early commitment: single best guess
- Deferred commitment: distribution of answers
- Location may be ambiguous
- Aggregate records for each value of location
- Gather all possible states
- Apply majority vote to correct mistakes
- Document by document, to avoid simply mapping each location to its most common state (the baseline)
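A sketch of a document-by-document correction consistent with the bullets above; the combination rule that mixes the local distribution with global votes is mine, for illustration:

    from collections import Counter

    def correct_state(local_dist, global_votes):
        """Re-rank one record's state candidates using global evidence.

        local_dist: {state: local confidence} for this document's record
        (the deferred-commitment distribution); global_votes: Counter of
        states proposed for the same location elsewhere in the collection.
        Re-ranking is done record by record, rather than overwriting every
        record with the collection-wide majority (the baseline)."""
        total = sum(global_votes.values()) or 1
        return max(local_dist,
                   key=lambda s: local_dist[s] * (1 + global_votes[s] / total))

    votes = Counter({"UK": 5, "MA/US": 2})  # states seen for "Cambridge"
    print(correct_state({"MA/US": 0.5, "UK": 0.45}, votes))  # UK wins on global vote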
28 Performance of relation correction
29 Conclusion
- Locally
- Defer commitment
- Assess confidence
- Globally
- Aggregate (imperfectly!) extracted facts
- Leverage global information to obtain improvements
30 End
32 Experiment assessment
- Confidence
- Global clustering
33 What is a fact?
- Basic: entities and names; identify all
- persons, organizations, locations,
- artefacts, medicines/drugs, diseases
- Why is even this already useful? Examples
- find all persons related to person X
- find all companies related to company Y
- find all diseases in country Z
- try with Google
- Complex: relationships and events
- how entities are related to each other,
- how they interact
34 What it means to find a fact
- Unstructured → structured representation
- Plain text → spreadsheet, database table
35 Example: Executive Search
- George Garrick, 40 years old, president of the London-based European Information Services Inc., was appointed chief executive officer of Nielsen Marketing Research, USA.
37 Example: Epidemics
- Rule: X confirm N death in Loc
38 Why is it important?
- Once facts are in database
- can search for them more easily
- computer can process them intelligently
- find patterns and trends
- Certain queries cannot be done with keywords alone
- Information explosion
39 IE and IR
- (Figure: additional processing)
40 Focused Search
- Not spontaneous, random search
- Users spend much time on persistent, focused search: the repeated pursuit of facts that are important in their analysis/research
- User places higher value on information related to a long-standing interest, to which s/he has a long-term commitment, than on information related to a one-time interest
41 Why is it difficult? References
- Language is complex
- George Garrick, 40 years old, has served as president of Sony, Inc. for 13 years.
- <more text>...
- The company announced his resignation effective July.
- (Table: expressions to resolve: "George Garrick, 40 years old"; "Sony, Inc."; "The company"; "his"; "October"; document date: June 23, 2000)
42 Example applications
- Database of global epidemics
- Database of corporate executives
- Corporate mergers and acquisitions
- Lawsuits / legal action, bankruptcy
- Terrorist attacks
- Natural disasters
- Space launches: rockets, missiles, ...
- Air accidents
44 Information extraction
- Finding facts
- names of entities
- people
- places
- organizations
- etc.
- relations between entities
- organizations employ people
- events
- who was affected, how, when, where
45 IE Pipeline: knowledge bases
46 Why IE is useful
- Semantic index into document collection
- For known scenarios, more reliable than keyword index
- Example: answer queries like
- Where does a given disease appear?
47 Performance
- Ceiling of 70%
- Many factors compounded
- Name classification
- Reference resolution
- Coverage of event patterns
- Elided elements in events
48 Examples of problems
- Names
- Diseases
- Agents
- bacteria, viruses, fungi, algae, parasites, ...
- Vectors
- Drugs
- ...
- Locations
- Reference resolution
- Location-country/state relation (normalization)
49 Examples of problems
- Reference resolution
- Location-country/state relation (normalization)
- Easy (because it is, almost, a functional relation)
- More generally, how can we verify a filled slot?
- No functional relations
- (e.g., any disease can occur anywhere)
50 Current research
- Favor unsupervised/weakly supervised techniques
- Minimize manual labor
- Allow us to use much larger corpora for training
- Unsupervised acquisition of semantic knowledge
- Names
- Semantic patterns
51 Current research
- Cross-document fact validation
- Aggregate information across documents
- Correct errors made in earlier stages of pipeline
- Provide richer information through the pipeline
- Propagate multiple hypotheses (instead of best guess)
- Self-confidence ratings
52 Cross-document integration
- Can provide strong global evidence
- Also poses challenges
- Time-varying information
- Evolving epidemics, changing (growing) numbers
- Resolution of doubtful events (at later time)
- Corrections
- Can be viewed as deeper understanding of domain
- E.g., reason about epidemics from incidents
53 Conclusion
- Applications form good base for research
- Observe performance improvements in real setting
- Provide large fact base for cross-document integration
55 Synopsis
- Information Extraction (IE)
- extracting factual information from textual documents, written in natural human language
- IE on a large scale
- in contrast with the traditional study of IE, which focuses on the smaller-scale, laboratory setting
- Applying IE methods to large collections of text attempts to exploit the massive redundancy in the facts contained in the collections
- Redundancy is inherent in the stream of emerging events, whether the topic is general news, science/medicine, business, etc.
56 Structure of talk
- Demonstration and motivation
- Problem domain
- Need semantic knowledge
- What is a pattern?
- What is a name?
- Techniques on a large scale
- Learning semantic patterns
- Learning semantic lexicons
- Learning global trends in extracted data
- For automatic recovery from errors
57 Objectives
- Acquire semantic patterns for IE
- To customize IE system rapidly
- Minimally supervised acquisition
- Build upon previously described methods
- Common feature: start out with high precision, then gradually trade off precision for recall
- Problem of convergence
- Algorithm discovers stream of patterns
- Want to find out when to stop
58 Outline
- Prior Work
- Basic Unsupervised Learner
- Counter-Training
- Experiments and Results
- General framework
- Current work/Conclusion
59 Prior Work
- On knowledge acquisition
- Riloff (1996); Yangarber, Grishman, Tapanainen, Huttunen (2000)
- Review by human, supervised, or set thresholds
- Thelen & Riloff (2002); Yangarber, Lin, Grishman (2002)
- Natural convergence
- Collins & Singer (1999), Yarowsky (1995)
60 Basic Unsupervised Algorithm
- For pattern discovery
- Pre-processing
- Factor out NEs (and other OOVs)
- NE tagger, e.g.
- Parse
- General-purpose dependency parser
- Tree normalization
- Passive, relative, ... → active
- Pattern extraction
- Tree → core constituents: S-V-O
- John Smith was hired by IBM → Company hire Person
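A minimal sketch (not the system's code) of the normalization and extraction steps, with the dependency parse faked for a single example:

    def extract_pattern(clause):
        """Reduce a parsed clause to its core S-V-O pattern.

        clause: output of a (hypothetical) dependency parse in which named
        entities have already been replaced by their classes. Passive
        clauses are normalized to active: the by-agent becomes the subject."""
        subj, verb, obj = clause["subject"], clause["verb"], clause["object"]
        if clause.get("voice") == "passive":
            subj, obj = clause.get("agent"), subj
        return (subj, verb, obj)

    # "John Smith was hired by IBM" -> Company hire Person
    clause = {"voice": "passive", "subject": "Person", "verb": "hire",
              "object": None, "agent": "Company"}
    print(extract_pattern(clause))  # ('Company', 'hire', 'Person')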
61 Bootstrapping Learner
- Initial query
- A small set of seed patterns which partially characterize the topic of interest
- Retrieve documents containing seed patterns
- Relevant documents
- Rank patterns (in relevant documents)
- According to frequency in relevant docs vs. overall frequency
- Add top-ranked pattern to seed pattern set; repeat
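A skeleton of this bootstrapping loop; the document representation, retrieval, scoring, and stopping criterion are simplified placeholders (scoring is the topic of the next slide):

    from collections import namedtuple

    Doc = namedtuple("Doc", "patterns")  # illustrative stand-in for a document

    def bootstrap(seeds, corpus, score, max_iters=20):
        """Grow the accepted pattern set from seeds, one pattern per iteration."""
        accepted = set(seeds)
        for _ in range(max_iters):
            # Documents matching any accepted pattern count as relevant.
            relevant = [d for d in corpus if accepted & d.patterns]
            candidates = {p for d in relevant for p in d.patterns} - accepted
            if not candidates:
                break
            accepted.add(max(candidates, key=lambda p: score(p, relevant, corpus)))
        return accepted

    corpus = [Doc({"Company hire Person", "Person resign"}),
              Doc({"Person resign", "Person succeed Person"}),
              Doc({"stock fall"})]
    freq = lambda p, relevant, corpus: sum(p in d.patterns for d in relevant)
    print(bootstrap({"Company hire Person"}, corpus, freq))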
62 Pattern score
- Trade off recall and precision
- Where
- H(p): set of documents matching p
- K(d): set of patterns matching document d
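The score formula itself did not survive the transcript. A plausible form, in the spirit of the RlogF-style scores of Riloff (1996) and Yangarber et al. (2000), with R denoting the current set of relevant documents (R is my assumption; it is not defined on this slide):

    Score(p) = \frac{|H(p) \cap R|}{|H(p)|} \cdot \log |H(p) \cap R|

The first factor rewards precision (the fraction of p's documents that are relevant); the log factor rewards reach (how many relevant documents p matches).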
63 Counter-training
- Eventually a mono-learner will pick up non-specific patterns
- They match documents relevant to the scenario, but also non-relevant documents
- Introduce multiple learners in parallel
- Learning in different, competing scenarios
- Documents which are ambiguous will receive a high relevance score in more than one scenario
- Prevent learning patterns which match such ambiguous documents
64 Counter-training
- (Figure: competing scenarios S1, S2, S3)
65 Refine Pattern Precision
- Take into account negative evidence provided by other learners
- In scenario S_i
- Prec(p) > 0
- Continue as long as the number of categories > 1
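One plausible reading of precision with negative evidence, since the exact formula is not in the transcript: documents claimed by competing learners count against the pattern:

    def precision(pattern_docs, rel_i, rel_others):
        """Precision of pattern p in scenario S_i, discounted by rivals.

        rel_i(d): relevance of document d to scenario S_i, in [0, 1];
        rel_others(d): strongest relevance assigned to d by any competing
        learner. The averaging rule is my assumption, for illustration."""
        docs = list(pattern_docs)
        if not docs:
            return 0.0
        return sum(rel_i(d) - rel_others(d) for d in docs) / len(docs)

    # A pattern is kept only while Prec(p) > 0 (see slide).
    print(precision(["d1", "d2"], lambda d: 0.9, lambda d: 0.2))  # 0.7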
66 Experiments
- Corpus
- WSJ 1992-1994
- 15,000 documents/3Mb
- Indirect evaluation
- Test
- 250 documents for each tested scenario
- 100 MUC-6 training documents (Management Succession)
- 150 documents tagged manually for each scenario
67 Scenarios
70 Current Problems
- Choice of scenarios
- Uneven representation within the corpus
- Choice of seeds
- Control focus of learning by making the scenario stronger
- Ambiguity/overlap
- At document level
- At pattern level
73 Counter-training framework
- Pre-process large corpus
- Factor out irrelevant information to reduce sparseness
- Give seeds to several category learners
- Add negative learners if possible
- (Seeds can be patterns or datapoints)
- Partition dataset
- Relevant to some learner, or relevant to none
- For each learner
- Rank rules
- Keep best
- Rank datapoints
- Keep best
- Repeat until convergence
74 Other Knowledge Discovery Tasks
- On name finding and categorization
- Yangarber, Lin, Grishman (2002)
- Thelen & Riloff (2002)
75 Conclusion
- In counter-training, unsupervised learners help each other to bootstrap
- by finding their own, weakly reliable positive evidence
- by providing reliable negative evidence to each other
- Unsupervised learners supervise each other
76 Counter-training
- Train several learners simultaneously
- Compete with each other in different domains
- Improve precision
- Convergence: learners provide an indication to each other of when to stop learning