Verification of Facts across Document Boundaries
Transcript and Presenter's Notes

Title: Verification of Facts across Document Boundaries


1
Verification of Facts across Document Boundaries
  • Roman Yangarber
  • University of Helsinki 2006

2
Synopsis
  • Extraction of facts from text
  • Focus on relation extraction, n-ary
  • Position paper
  • Baseline extraction of facts
  • Local analysis
  • Assess confidence locally
  • Aggregate facts globally
  • Verify across a large collection
  • Results
  • Demos (?)

3
Baseline IE
  • Local extraction
  • Process one document at a time (e.g., a news
    stream)
  • Analyze text
  • Fill in answer template(s)
  • forget the result
  • Prior knowledge
  • Contained in knowledge bases
  • Pattern bases
  • Ontologies
  • Recognizers trained on a corpus (pre-tagged
    manually)
  • Given a priori; static
  • Not reused during extraction process
  • NB: the document collection need not be static

4
Impact of traditional processing
  • Utility
  • Facts/events typically have multiple references
    in a document collection
  • Facts evolve, are updated, repeated, corrected
  • User would benefit from cross-document linking of
    facts/events
  • Performance
  • Glass ceiling on performance
  • When viewed locally, the problem may be too
    restricted, and therefore ill-posed
  • The system may need more than
  • Document-local information
  • Static prior knowledge

5
Reasons behind local IE
  • Tradition
  • Evaluation
  • MUC
  • ACE

6
Proposal
  • When IE system extracts facts from a document in
    a large collection, use
  • A priori knowledge
  • Information in current document
  • A posteriori knowledge, derived from elsewhere in
    the collection
  • Extensions to IE paradigm
  • Locally, do not commit to hard answers
  • Assess confidence locally
  • Aggregate information globally
  • Leverage global information to find improvements

7
Local analysis
  • Traditional approach
  • When the system processes document D0, it may
    generate a filled template/database record
  • Each slot in the record is a single, best guess
    answer
  • Deferred commitment
  • Return a distribution of answers
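
A minimal Python sketch of the two modes, with illustrative names and values (not the system's actual data structures): an early-commitment slot holds one best guess, while a deferred-commitment slot holds a ranked distribution of value-confidence pairs for the global phase to resolve.

    # Sketch only: names and values are illustrative, not from the system.
    # Early commitment: each slot holds a single best-guess answer.
    record_early = {"merger_partner": "Compaq"}

    # Deferred commitment: each slot holds a ranked distribution of
    # (value, confidence) pairs; global analysis later picks the best one.
    record_deferred = {
        "merger_partner": [
            ("Compaq", 0.6),
            ("Hewlett-Packard, Co.", 0.3),
            ("Mellon Growth Advisors", 0.1),
        ]
    }

    def top_local_guess(distribution):
        """Return the locally top-ranked (value, confidence) pair."""
        return max(distribution, key=lambda vc: vc[1])

    print(top_local_guess(record_deferred["merger_partner"]))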

8
Example
  • According to Reuters, stocks declined today on
    Wall Street after lower than expected earnings
    forecasts from Hewlett-Packard, Co. Investors
    are being cautious, said Stephen Kolano, an
    equity trader for Mellon Growth Advisors in
    Boston, awaiting further news on the company's
    planned merger with Compaq.
  • → merger(the company, Compaq)
  • System must resolve the definite NP reference,
    based on features in local context
  • Distance from anaphor
  • Relative position in sentence
  • Syntactic constraints; semantic constraints

9
Reference resolution
  • Distinguish three kinds of references
  • Immediate
  • Indirect
  • Elliptic (implicit)
  • Ellipsis as anaphora

10
Ellipsis example
  • Epidemics domain
  • Pattern
  • np(people) vg(die) in np(location)
  • matches
  • 5 people died in Sudan.
  • 5 people died.
  • Fill template
  • epidemic(disease, victim, location, ...)
  • Virtually transformed to →
  • 5 people died in [Location].
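
A toy Python sketch of the fill, using a regular expression as a stand-in for the np/vg pattern (an assumption; the real system matches parsed patterns): when the location is elided, the slot stays open and is treated like an anaphor to be resolved later.

    import re

    # Toy stand-in for: np(people) vg(die) in np(location)
    PATTERN = re.compile(r"(?P<victim>\d+ people) died(?: in (?P<location>\w+))?\.")

    def fill_epidemic_template(sentence):
        m = PATTERN.search(sentence)
        if m is None:
            return None
        # Elliptic reference: location may be missing (None) and is
        # deferred to reference resolution / global analysis.
        return {"victim": m.group("victim"), "location": m.group("location")}

    print(fill_epidemic_template("5 people died in Sudan."))
    print(fill_epidemic_template("5 people died."))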

11
Reference resolution
  • Distinguish three kinds of references
  • Immediate
  • Indirect
  • Elliptic (implicit)
  • Ellipsis as anaphora

12
Reference resolution
  • Distinguish three kinds of references
  • Immediate
  • A. Unique in story
  • A. Unique in sentence
  • Almost as strong a criterion as immediate
  • Indirect
  • Elliptic (implicit)
  • Ellipsis as anaphora
  • Rule-based; automatically trained; hybrid

13
Deferred commitment: reference resolution
  • Relax reference resolution mechanism
  • Each non-immediate reference is linked to a
    distribution of answers
  • Ranked list of value-confidence pairs
  • A global analysis phase will try to infer the
    globally optimal answer

14
Global analysis
  • Fact base: the totality of facts extracted from the
    collection of documents
  • a posteriori knowledge
  • Imperfect
  • Improve fact base using global information
  • If a fact is attested more than once, confidence
    increases
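
A small Python sketch of the attestation idea; the fact keys and the noisy-or combination rule are assumptions, the point is only that repeated attestation raises confidence.

    from collections import defaultdict

    def build_fact_base(local_facts):
        """local_facts: iterable of (fact, local_confidence) pairs, where a fact
        is a hashable key such as (disease, location, month).  Repeated
        attestations are combined so that confidence increases (noisy-or,
        an illustrative choice)."""
        fact_base = defaultdict(float)
        for fact, conf in local_facts:
            fact_base[fact] = 1.0 - (1.0 - fact_base[fact]) * (1.0 - conf)
        return dict(fact_base)

    facts = [(("cholera", "Sudan", "2004-05"), 0.6),
             (("cholera", "Sudan", "2004-05"), 0.5)]
    print(build_fact_base(facts))   # 0.8: higher than either local estimate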

15
Example
  • Optimal global interpretation: e.g., D1 and D3
    correct, and D2 partly wrong

16
Example
17
Experiments Global aggregation
  • How can global aggregation be leveraged to obtain
    an improvement in results (overall)?
  • Job Advertisements
  • prior work
  • Epidemic Surveillance
  • Relation correction
  • Estimate utility of global aggregation

18
Application Job advertisements
  • Nahm & Mooney (2000): integrate IE and DM
  • 1. Run IE on part of collection → extracted facts
  • 2. DM on facts → association rules between fills
  • 3. Augment IE with mined rules: if a rule
    fires, add attributes from the head of the rule
  • Example
  • Facts: job = programmer; skills = Java, C;
    platform = Unix, Mac
  • Rules: Linux ∈ platform → C ∈ skills
  • Second-phase extraction
  • If rule fires and head attribute is mentioned
    anywhere, then extract it too.
  • Improve recall, no significant loss in precision
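
A Python sketch of the second-phase step, with hypothetical rule and record formats (the original rules were mined automatically, which is not reproduced here).

    def second_phase(record, document_text, rules):
        """rules: list of (condition, head) pairs, e.g.
        ({"platform": "Linux"}, ("skills", "C")); both structures are
        illustrative stand-ins for mined association rules."""
        for condition, (slot, value) in rules:
            fires = all(v in record.get(attr, []) for attr, v in condition.items())
            mentioned = value in document_text
            if fires and mentioned and value not in record.get(slot, []):
                record.setdefault(slot, []).append(value)   # recall boost
        return record

    record = {"job": ["programmer"], "skills": ["Java"], "platform": ["Linux", "Mac"]}
    rules = [({"platform": "Linux"}, ("skills", "C"))]
    print(second_phase(record, "Wanted: programmer. Java, C, Linux/Mac.", rules))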

19
Application Epidemic Surveillance
  • On-line incremental database
  • Start from plain text
  • Extract database records
  • Disease
  • Location
  • Date
  • Victims
  • Kind of victim/descriptor: people, animals,
    plants
  • Victim status: sick, dead

20
Utility Confidence
  • Reference resolution: immediate reference
  • Compute local confidence for entire fact
  • For each record, local confidence is a function
    of three key attributes: disease, location, date
  • Produce confidence c so that c > θ_conf when
    all key attributes are immediate (or unique)
  • Record is locally confident if c > θ_conf
  • (θ_conf = 0.75)
  • In manual human judgements, confidence was found
    to be correlated with correctness
  • (0.55)
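
A Python sketch of one way such a local confidence could be composed; the per-kind scores and the product combination are assumptions, only the threshold 0.75 comes from the slide.

    THETA_CONF = 0.75   # threshold quoted on the slide

    # Illustrative scores for how each key attribute was filled.
    KIND_SCORE = {"immediate": 1.0, "unique": 0.95, "indirect": 0.6, "elliptic": 0.3}

    def local_confidence(record):
        """record maps each key attribute (disease, location, date) to the
        kind of reference that filled it."""
        c = 1.0
        for attr in ("disease", "location", "date"):
            c *= KIND_SCORE[record[attr]]
        return c

    r = {"disease": "immediate", "location": "unique", "date": "immediate"}
    print(local_confidence(r), local_confidence(r) > THETA_CONF)   # 0.95 True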

21
Global aggregation
  • Aggregate related records if they have similar
    values of key attributes
  • Disease: identical
  • Location: close (?)
  • Time: close (within a small window, e.g., 1
    month)
  • Aggregate records into groups (epidemics)
  • Any confident record induces a group
  • Repeat: extend the timeline if another record falls
    within the time window, irrespective of its
    confidence

Outbreak span
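
A Python sketch of the grouping step under simplifying assumptions (location closeness reduced to equality, window fixed at 31 days): a locally confident record opens a group, and later records join and extend its timeline regardless of their own confidence.

    from datetime import date, timedelta

    WINDOW = timedelta(days=31)   # "small window, e.g., 1 month"
    THETA_CONF = 0.75

    def aggregate_outbreaks(records):
        """records: dicts with keys disease, location, date, confidence."""
        groups = []
        for r in sorted(records, key=lambda r: r["date"]):
            for g in groups:
                close = (r["disease"] == g["disease"]
                         and r["location"] == g["location"]
                         and r["date"] <= g["end"] + WINDOW)
                if close:
                    g["members"].append(r)                 # any confidence
                    g["end"] = max(g["end"], r["date"])    # extend timeline
                    break
            else:
                if r["confidence"] > THETA_CONF:           # confident record
                    groups.append({"disease": r["disease"],
                                   "location": r["location"],
                                   "start": r["date"], "end": r["date"],
                                   "members": [r]})
        return groups

    recs = [{"disease": "cholera", "location": "Sudan",
             "date": date(2004, 5, 3), "confidence": 0.9},
            {"disease": "cholera", "location": "Sudan",
             "date": date(2004, 5, 20), "confidence": 0.4}]
    print(len(aggregate_outbreaks(recs)))   # 1: second record joins the group
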
22
Local confidence
23
Judge fills in non-confident records
24
Judge fills in non-confident records
25
Judge fills in non-confident records
26
Judge fills in non-confident records
27
Experiment relation correction
  • Extract location and state (country) attributes
  • Early commitment: single best guess
  • Deferred commitment: distribution of answers
  • Location may be ambiguous
  • Aggregate records for each value of location
  • Gather all possible states
  • Apply majority vote to correct mistakes
  • Document by document, to avoid mapping location
    to most common state (baseline)
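
A simplified Python sketch of the vote: each document contributes one (location, state) opinion, and the majority state then overrides the others. The per-document refinement mentioned on the slide is not reproduced, and the data are illustrative.

    from collections import Counter

    def correct_states(extractions):
        """extractions: list of (doc_id, location, state) triples."""
        votes = {}
        for doc_id, location, state in set(extractions):   # one vote per document
            votes.setdefault(location, Counter())[state] += 1
        winner = {loc: counts.most_common(1)[0][0] for loc, counts in votes.items()}
        return [(doc_id, loc, winner[loc]) for doc_id, loc, _ in extractions]

    data = [("d1", "Cambridge", "Massachusetts"),
            ("d2", "Cambridge", "Massachusetts"),
            ("d3", "Cambridge", "Ontario")]      # likely a local error
    print(correct_states(data))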

28
Performance of relation correction
29
Conclusion
  • Locally
  • Defer commitment
  • Assess confidence
  • Globally
  • Aggregate (imperfectly!) extracted facts
  • Leverage global information to obtain improvements

30
End
  • end

31
(No Transcript)
32
Experiment assessment
  • Confidence
  • Global clustering

33
What is a fact?
  • Basic Entities and Names: identify all
  • persons, organizations, locations,
  • artefacts, medicines/drugs, diseases
  • Why is even this already useful? Examples
  • find all persons related to person X
  • find all companies related to company Y
  • find all diseases in country Z
  • try with Google
  • Complex Relationships and events
  • how entities are related to each other,
  • how they interact

34
What it means to find a fact
  • unstructured → structured representation
  • plain text → spreadsheet, database table

35
Example Executive Search
  • George Garrick, 40 years old, president of
    the London-based European Information
    Services Inc., was appointed chief executive
    officer of Nielsen Marketing Research, USA.

36
Example Executive Search
  • George Garrick, 40 years old, president of
    the London-based European Information
    Services Inc., was appointed chief executive
    officer of Nielsen Marketing Research, USA.

37
Example Epidemics
Rule: X confirm N death in Loc
38
Why is it important?
  • Once facts are in a database
  • can search for them more easily
  • computer can process them intelligently
  • find patterns and trends
  • Certain queries cannot be done with
    keywords alone
  • Information explosion

39
IE and IR
Additional processing
40
Focused Search
  • Not spontaneous, random search
  • Users spend much time on persistent, focused
    search: repeated pursuit of facts that are
    important in their analysis/research
  • User places higher value on information related
    to long-standing interest, to which s/he has a
    long-term commitment, than on information related
    to one-time interest

41
Why is it difficult? References
  • Language is complex
  • George Garrick, 40 years old, has served as
    president of Sony, Inc. for 13 years.
  • <more text> ...
  • The company announced his resignation effective
    July.

(Slide graphic: extracted record linked to the text: Date June 23, 2000; George Garrick, 40 years old; Sony, Inc.; "The company"; "his"; October)
42
Example applications
  • Database of global epidemics
  • Database of corporate executives
  • Corporate mergers and acquisitions
  • Lawsuits / Legal action, Bankruptcy
  • Terrorist attacks
  • Natural disasters
  • Space launches: rockets, missiles, ...
  • Air accidents

43
(No Transcript)
44
Information extraction
  • Finding facts
  • names of entities
  • people
  • places
  • organizations
  • etc.
  • relations between entities
  • organizations employ people
  • events
  • who was affected, how, when, where

45
IE Pipeline knowledge bases

46
Why IE is useful
  • Semantic index into document collection
  • For known scenarios, more reliable than keyword
    index
  • Example: answer a query like
  • Where does a given disease appear?

47
Performance
  • Ceiling of 70%
  • Many factors compounded
  • Name classification
  • Reference resolution
  • Coverage of event patterns
  • Elided elements in events

48
Examples of problems
  • Names
  • Diseases
  • Agents
  • bacteria, viruses, fungi, algae, parasites,
  • Vectors
  • Drugs
  • Locations
  • Reference resolution
  • Location-country/state relation (normalization)

49
Examples of problems
  • Reference resolution
  • Location-country/state relation (normalization)
  • easy (because functional relation, almost)
  • More generally, how can we verify a filled slot?
  • No functional relations
  • (e.g., any disease can occur anywhere)

50
Current research
  • Favor unsupervised/weakly supervised techniques
  • Minimize manual labor
  • Allow us to use much larger corpora for training
  • Unsupervised acquisition of semantic knowledge
  • Names
  • Semantic patterns

51
Current research
  • Cross-document fact validation
  • Aggregate information across documents
  • Correct errors made in earlier stages of pipeline
  • Provide richer information through the pipeline
  • Propagate multiple hypotheses (instead of best
    guess)
  • Self-confidence ratings

52
Cross-document integration
  • Can provide strong global evidence
  • Also poses challenges
  • Time-varying information
  • Evolving epidemics, changing (growing) numbers
  • Resolution of doubtful events (at later time)
  • Corrections
  • Can be viewed as deeper understanding of domain
  • E.g., reason about epidemics from incidents

53
Conclusion
  • Applications form good base for research
  • Observe performance improvements in real setting
  • Provide large fact base, for cross-document
    integration

54
(No Transcript)
55
Synopsis
  • Information Extraction (IE)
  • extracting factual information from textual
    documents, written in natural human language.
  • IE on a large scale
  • in contrast with the traditional study of IE,
    focusing on the smaller-scale, laboratory
    setting.
  • Applying IE methods to large collections of text
    attempts to exploit the massive redundancy in the
    facts contained in the collections.
  • Redundancy is inherent in the stream of emerging
    events, whether the topic is general news,
    science/medicine, business, etc.

56
Structure of talk
  • Demonstration and motivation
  • Problem domain
  • Need semantic knowledge
  • What is a pattern?
  • What is a name?
  • Techniques on a large-scale
  • Learning semantic patterns
  • Learning semantic lexicons
  • Learning global trends in extracted data
  • For automatic recovery from errors

57
Objectives
  • Acquire semantic patterns for IE
  • To customize IE system rapidly
  • Minimally supervised acquisition
  • Build upon previously described methods
  • Common feature: start out with high precision,
    then gradually trade off precision for recall
  • Problem of convergence
  • Algorithm discovers stream of patterns
  • Want to find out when to stop

58
Outline
  • Prior Work
  • Basic Unsupervised Learner
  • Counter-Training
  • Experiments and Results
  • General framework
  • Current work/Conclusion

59
Prior Work
  • On knowledge acquisition
  • Riloff (1996), Yangarber, Grishman, Tapanainen,
    Huttunen (2000)
  • Review by human, supervised, or set thresholds
  • Thelen & Riloff (2002); Yangarber, Lin, Grishman
    (2002)
  • Natural convergence
  • Collins & Singer (1999), Yarowsky (1995)

60
Basic Unsupervised Algorithm
  • For pattern discovery
  • Pre-processing
  • Factor out NEs (and other OOVs)
  • NE tagger, e.g.
  • Parse
  • General-purpose dependency parser
  • Tree normalization
  • Passive, relative → active
  • Pattern extraction
  • Tree → core constituents: S V O
  • John Smith was hired by IBM → Company hire
    Person
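
A toy Python sketch of the last step, assuming parsing and normalization have already produced a simple clause structure (the clause format and NE-class map are illustrative).

    def extract_pattern(clause, ne_class):
        """Reduce an already normalized, active-voice clause to its core
        subject-verb-object constituents, generalizing named entities to
        their classes."""
        generalize = lambda w: ne_class.get(w, w)
        return (generalize(clause["subject"]), clause["verb"],
                generalize(clause["object"]))

    clause = {"subject": "IBM", "verb": "hire", "object": "John Smith"}
    ne_class = {"IBM": "Company", "John Smith": "Person"}
    print(extract_pattern(clause, ne_class))   # ('Company', 'hire', 'Person')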

61
Bootstrapping Learner
  • Initial query
  • A small set of seed patterns which partially
    characterize the topic of interest
  • Retrieve documents containing seed patterns
  • Relevant documents
  • Rank patterns (in relevant documents)
  • According to frequency in relevant docs vs.
    overall frequency
  • Add top-ranked pattern to seed pattern set

repeat
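
A toy Python version of this loop; documents are reduced to sets of pattern identifiers, and the ranking is simplified to the share of a pattern's documents that are currently relevant (a stand-in for the score discussed on the next slide).

    def bootstrap(documents, seeds, rounds=10):
        """documents: dict doc_id -> set of patterns found in the document."""
        accepted = set(seeds)
        all_patterns = set().union(*documents.values())
        for _ in range(rounds):
            relevant = {d for d, pats in documents.items() if pats & accepted}
            candidates = all_patterns - accepted
            if not candidates or not relevant:
                break
            def rank(p):
                matching = [d for d, pats in documents.items() if p in pats]
                return sum(d in relevant for d in matching) / len(matching)
            accepted.add(max(candidates, key=rank))   # add top-ranked pattern
        return accepted

    docs = {"d1": {"seed", "p2"}, "d2": {"p2", "p3"}, "d3": {"p4"}}
    print(bootstrap(docs, seeds={"seed"}, rounds=2))
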
62
Pattern score
  • Trade-off recall and precision
  • Where
  • H(p): set of documents matching p
  • K(d): set of patterns matching document d
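
The scoring formula itself appears to be in a slide graphic not reproduced in the transcript. A Python sketch of one formulation along the lines of the author's earlier pattern-discovery work (an assumption here, not read off the slide) weighs a precision-like term against the log of the pattern's support.

    import math

    def pattern_score(H_p, relevant):
        """H_p: set of documents matching pattern p.
        relevant: current set of relevant documents.
        score = |H_p & relevant| / |H_p| * log|H_p & relevant|  (assumed form)."""
        support = len(H_p & relevant)
        if support == 0 or not H_p:
            return 0.0
        return support / len(H_p) * math.log(support)

    print(pattern_score({"d1", "d2", "d3"}, {"d1", "d2"}))   # ~0.46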

63
Counter-training
  • Eventually a mono-learner will pick up non-specific
    patterns
  • Match documents relevant to the scenario, but
    also match non-relevant documents
  • Introduce multiple learners in parallel
  • Learning in different, competing scenarios
  • Documents which are ambiguous will receive high
    relevance score in more than one scenario
  • Prevent learning patterns which match such
    ambiguous documents

64
Counter-training
(Slide graphic: competing scenarios S1, S2, S3)
65
Refine Pattern Precision
  • Take into account negative evidence provided by
    other learners
  • In scenario Si
  • Prec(p) > 0
  • Continue as long as number of categories > 1
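
A hedged Python sketch of how negative evidence from competing learners could enter the precision estimate; the exact formula is not given in the transcript, so the subtraction below only illustrates the idea that ambiguous documents push Prec(p) toward or below zero.

    def refined_precision(H_p, rel_own, rel_other):
        """H_p: documents matching pattern p.
        rel_own[d]: relevance of d to this scenario Si.
        rel_other[d]: strongest relevance assigned to d by any competing learner."""
        if not H_p:
            return 0.0
        return sum(rel_own.get(d, 0.0) - rel_other.get(d, 0.0) for d in H_p) / len(H_p)

    # A pattern is retained only while its refined precision stays positive,
    # and learning continues only while more than one category remains.
    print(refined_precision({"d1", "d2"}, {"d1": 0.9, "d2": 0.8}, {"d2": 0.95}))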

66
Experiments
  • Corpus
  • WSJ 1992-1994
  • 15,000 documents/3Mb
  • Indirect evaluation
  • Test
  • 250 documents for each tested scenario
  • 100 documents from the MUC-6 training data (Management Succession)
  • 150 documents tagged manually for each scenario

67
Scenarios
68
(No Transcript)
69
(No Transcript)
70
Current Problems
  • Choice of scenarios
  • Uneven representation within the Corpus
  • Choice of seeds
  • Control focus of learning by making scenario
    stronger
  • Ambiguity/overlap
  • At document level
  • At pattern level

71
(No Transcript)
72
(No Transcript)
73
Counter-training framework
  • Pre-process large corpus
  • Factor out irrelevant information to reduce
    sparseness
  • Give seeds to several category learners
  • Add negative learners if possible
  • (Seeds can be patterns or datapoints)
  • Partition dataset
  • Relevant to some learner, or relevant to none
  • For each learner
  • Rank rules
  • Keep best
  • Rank datapoints
  • Keep best
  • Repeat until convergence

74
Other Knowledge Discovery Tasks
  • On name finding and categorization
  • Yangarber, Lin, Grishman (2002)
  • Thelen Riloff (2002)

75
Conclusion
  • In counter-training unsupervised learners help
    each other to bootstrap
  • by finding their own, weakly reliable positive
    evidence
  • by providing reliable negative evidence to each
    other
  • Unsupervised learners supervise each other

76
Counter-training
  • Train several learners simultaneously
  • Compete with each other in different domains
  • Improve precision
  • Convergence: learners provide an indication to each
    other of when to stop learning