Title: Information Extraction
1Information Extraction
- Jordi Turmo
- TALP Research Centre
- Dep. Llenguatges i Sistemes Informàtics
- Universitat Politècnica de Catalunya
- turmo_at_lsi.upc.edu
- http://www.lsi.upc.edu/turmo
2Summary
- Information Extraction Systems
- Evaluation
- Multilinguality
- Adaptability
3Summary
- Information Extraction Systems
- Introduction
- Historical framework
- Architecture
- Knowledge specific for IE
- Examples
- Evaluation
- Multilinguality
- Adaptability
4Introduction
Definition
- Goal: localization and extraction, in a specific format, of the relevant information included in a collection of documents
- Input requirements: scenario of extraction and document collection
- Output requirements: output format
5Introduction
Typology
- Different points of view
- conceptual coverage: restricted-domain IE vs. open-domain IE
- language coverage: monolingual IE vs. multilingual IE
- media coverage: written text IE, speech IE, image IE, multimedia IE
- document type: IE from free text, from semi-structured documents, from structured documents (including Web pages in HTML and XML)
7Introduction
Example 1 Structured documents
- Web pages
- A list of members of an organization per document
- English
- Scenario of Extraction
- Name, degree, school and affiliation of the member
8Introduction
Example 1 Structured documents
Name      | Degree | School  | Affiliation
WL Hsu    | PhD    | Cornell | IIS, Sinica
CS Ho     | PhD    | NTU     | EE,NTIT
C.Chen    | PhD    | SUNY    | EE,NTIT
C.Wu      | PhD    | Utexas  | Cedu,NNU
Mark Liao | PhD    | NWU     | IIS, Sinica
CJ Liau   | PhD    | NTU     | IIS, Sinica
WK Cheng  | PhD    | TKU     | Tunghai
WC Wang   | MS     | Syracus | FIT
...
9Introduction
Example 2 Semi-structured documents
- 485 seminar announcements
- A description of one seminar per document
- English
- Scenario of Extraction
- Speaker, location, start time and end time of the seminar
10Introduction
Example 2 Semi-structured documents
11Introduction
Example 3 Free text
- 318 Wall Street Journal articles
- A description of an incident per document
- English
- Scenario of Extraction
- Type of incident, perpetrator, target, date, location, effects and instrument
12Introduction
Example 3 Free text
Incident type: bombing
Date: March 19
Location: El Salvador, San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
13Introduction
Example 4 Free text
- 78 documents
- A description of a mushroom per document
- Spanish
- Scenario of Extraction
- Colors of parts of mushrooms and the circumstances in which they occur
14Introduction
Example 4 Free text
15Introduction
Example 4 Free text
El color blanco de su sombrero pasa a amarillo crema al corte. El sombrero ennegrece si se corta.
(English gloss: The white color of its cap turns creamy yellow when cut. The cap blackens if it is cut.)
color_1: base blanco, tono indef, luz indef
Sombrero_1: color
virar_1: inicio, final, causa corte
color_2: base amarillo, tono crema, luz indef
Sombrero_2: color
virar_2: inicio indef, final, causa corte
color_3: base indef, tono negro, luz indef
16Introduction
Example 5 Combination
- 78 documents
- A description of a mushroom per document
- Spanish
- Scenario of Extraction
- Names of the mushroom in different languages, etymology
- Colors of parts of mushrooms and the circumstances in which they occur
17Introduction
Example 5 Combination
18Introduction
Applications
- IE from the Web
- Building of news DBs
- Information Integration
- Support for QA and Summarization
-
- Limitation when P < 80
19Introduction
References
- D.E. Appelt, D.J. Israel, 1999
- E. Hovy, 1999
- R.J. Mooney, C. Cardie, 1999
- Muslea, 1999
- J. Cowie, Y. Wilks, 2000
- M.T. Pazienza, 2003
- Turmo, 2003
- Turmo et al. 2005
20Introduction
Recent events
- IJCAI 2001 Workshop on Adaptive Text Extraction and Mining (ATEM-2001)
- ECML 03/PKDD Workshop on Adaptive Text Extraction and Mining (ATEM-2003)
- AAAI 04 Workshop on Adaptive Text Extraction and Mining (ATEM-2004)
- EACL 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
- COLING-ACL 06 Workshop on Information Extraction Beyond the Document
- ECAI 06 Workshop on Adaptive Text Extraction and Mining (ATEM-2006)
22Historical framework
Origin of IE
- Acquisition of the relevant information involved
in knowledge-based systems
- Traditionally (High human cost)
23Historical framework
Origin of IE
- Acquisition of the relevant information involved
in knowledge-based systems
Relevant Information
24Historical framework
Origin of IE
- Text-Based Intelligent Systems (TBIS)
- Information Retrieval
- Information Integration
- Information Filtering
- Information Routing
- Information Extraction
- Document Classification
- Question Answering
- Automatic Summarization
- Topic Detection and Tracking
- ...
25Historical framework
Relevant Historical Programs
- Precedents: LSP (Sager, 81), FRUMP (DeJong, 82), JASPER (Hayes, 86)
- in USA
- (1987-1991): MUC, US Navy
- TIPSTER (1991-1998): MUC, DARPA
- TIDES (1999-): ACE, NIST
- in Europe
- LRE (1993-1996): TREE, AVENTINUS, FACILE, ECRAN, SPARKLE
- PASCAL excellence network (2003-)
26Historical framework
MUC Evolution
- MUC-1 (1987)
- naval operations
- auto-definition of scenarios
- auto-evaluation
- MUC-2 (1989)
- naval operations
- output structure with 10 attributes
- (type of event, agent, place, ...)
- auto-evaluation
27Historical framework
MUC Evolution
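- MUC-3 (1991)
- Latin-American terrorism
- output structure with 18 attributes
- (type of incident, date, place, ...)
- recall and precision measures
\[ \text{extracted} = a + b + e + f \qquad \text{relevant} = a + f + d \]
\[ \text{recall} = \frac{a + 0.5\,f}{a + f + d} \qquad \text{precision} = \frac{a + 0.5\,f}{a + f + b + e} \]
[Figure: diagram of the extracted and relevant sets with regions a to f; f is the partially extracted region]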
28Historical framework
MUC Evolution
- MUC-4 (1992)
- Latin-American terrorism
- 24 attributes
- F-score (harmonic mean of recall and precision)
- MUC-5 (1993)
- Financial news, microelectronics
- English, Japanese
29Historical framework
MUC Evolution
- MUC-6 (1995)
- financial news
- subtasks: NE, coreference
- tasks: TE (template element), ST (scenario template)
- MUC-7 (1998)
- air crashes
- new task: TR (template relation)
30Historical framework
MUC Evolution
- MUC-6, MUC-7
- Partial extractions are discarded
\[ \text{extracted} = a + b \qquad \text{relevant} = a + d \]
\[ \text{recall} = \frac{a}{a + d} \qquad \text{precision} = \frac{a}{a + b} \]
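The MUC measures can be sketched in a few lines of Python. This is only an illustration: the function name `muc_scores` and the example counts are assumptions, and the F-score shown is the balanced harmonic mean. With `e = f = 0` the function reduces to the MUC-6/MUC-7 case, where partial extractions are discarded.

```python
def muc_scores(a, b, d, e=0.0, f=0.0):
    """Recall, precision and balanced F-score from the region counts a, b, d, e, f.

    f (partially extracted) earns half credit, as in the MUC-3 formulas.
    """
    credit = a + 0.5 * f
    relevant = a + f + d
    extracted = a + f + b + e
    recall = credit / relevant if relevant else 0.0
    precision = credit / extracted if extracted else 0.0
    total = recall + precision
    f_score = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_score


# Hypothetical counts, chosen only to exercise the formulas.
partial = muc_scores(a=6, b=3, d=2, e=1, f=4)
strict = muc_scores(a=6, b=2, d=2)
```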
32Architecture
General Architecture
- Hobbs,93
- Cascade of transducers (or modules) that add
structure to text and, often, drop out irrelevant
information by applying rules
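A cascade of this kind can be pictured as a list of functions applied one after another, each adding structure to the document or discarding material. The sketch below is a toy under stated assumptions: the stage names, the dictionary layout and the trigger list are all hypothetical and do not describe any of the systems discussed here.

```python
from functools import reduce


def tokenize(doc):
    # Adds structure: a list of whitespace-separated tokens.
    return {**doc, "tokens": doc["text"].split()}


def tag_triggers(doc, triggers=frozenset({"bomb"})):
    # Adds structure: marks tokens found in a (hypothetical) trigger list.
    flags = [t.lower().strip(".,") in triggers for t in doc["tokens"]]
    return {**doc, "flags": flags}


def keep_relevant(doc):
    # Drops irrelevant information: keeps only the flagged tokens.
    kept = [t for t, hit in zip(doc["tokens"], doc["flags"]) if hit]
    return {**doc, "relevant": kept}


def run_cascade(doc, stages):
    """Apply each stage to the output of the previous one."""
    return reduce(lambda current, stage: stage(current), stages, doc)


result = run_cascade(
    {"text": "A bomb went off near a power tower."},
    [tokenize, tag_triggers, keep_relevant],
)
```

Keeping every stage as a function from document to document is what makes the modules replaceable, which matters later when portability and tuning are discussed.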
33Architecture
Traditional Architecture
Document Preprocessing
Conceptual Hierarchy
Pattern Matching
Pattern Base
Postprocess
34Architecture
Traditional Architecture
Text Control
Lexical Analysis
Conceptual Hierarchy
Syntactic Analysis
Pattern Matching
Pattern Base
Postprocess
35Architecture
Traditional Architecture
Text Control
Lexical Analysis
Conceptual Hierarchy
Syntactic Analysis
Pattern Matching
Pattern Base
Discourse Analysis
Output Template Generation
Output Format
36Architecture
Architecture
Text control
- Filtering relevant documents
- Guessing the language of the documents
- Splitting documents into textual zones
- Filtering relevant zones
- Splitting text into appropriate units (e.g. sentences)
- Filtering relevant units
- Tokenizing units
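The last three text-control steps can be sketched with regular expressions. This is a deliberately naive illustration: the splitting rule, the keyword filter and the function names are assumptions, and real systems need more careful handling of abbreviations and numbers.

```python
import re


def split_units(text):
    """Split text into sentence-like units after '.', '!' or '?'."""
    return [u.strip() for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]


def filter_units(units, keywords):
    """Keep only the units that mention at least one scenario keyword."""
    return [u for u in units if any(k in u.lower() for k in keywords)]


def tokenize(unit):
    """Separate words from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", unit)


text = ("A bomb went off this morning. "
        "No casualties have been reported. "
        "The weather was fine.")
units = filter_units(split_units(text), {"bomb", "casualties"})
tokens = [tokenize(u) for u in units]
```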
37Architecture
Architecture
Text control
38Architecture
Architecture
Text control
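<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .>
<Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .>
<Con la edad generalmente palidece sus tonos .>
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .>
<Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas .>
(English gloss: Cap fairly fleshy, 4 to 8 cm, convex, then fully expanded, flattened and umbonate, smooth, moist and hygrophanous. This last condition influences the variability of its coloration, from light cinnamon to the whole range of tan shades. With age its tones generally fade. It can be confused with other edible pholiotas, but some species are bitter. Inexperienced enthusiasts may also confuse this genus with other non-edible ones, such as Hypholoma and Flacemula, which are also lignicolous.)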
39Architecture
Architecture
Lexical analysis
- Identifying morpho-syntactic categories and semantic categories of words
- General lexicon
- Recognizing terminology words
- Specific dictionaries
- Recognizing time expressions, quantities, abbreviations, ...
- Expanding abbreviations
- Lists of abbreviation expansions
40Architecture
Architecture
Lexical analysis
- Recognizing and classifying proper nouns (Named Entities, NERC)
- Gazetteers
- Patterns
- Dealing with unknown words
- Dealing with lexical ambiguities
- POS taggers
- WSD (???)
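Gazetteers and patterns, the two NERC resources listed above, can be illustrated as follows. The gazetteer entries, the four-digit time pattern and the function name `tag_entities` are assumptions made for the example; they are not taken from any of the systems described here.

```python
import re

# Hypothetical gazetteer: surface form -> entity class.
GAZETTEER = {"San Salvador": "location", "El Salvador": "location"}

# Hypothetical pattern for 24-hour times written as four digits, e.g. 0650.
TIME_PATTERN = re.compile(r"\b(?:[01]\d|2[0-3])[0-5]\d\b")


def tag_entities(text):
    """Return (surface form, class) pairs in order of appearance."""
    found = []
    for name, label in GAZETTEER.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            found.append((m.start(), m.group(), label))
    for m in TIME_PATTERN.finditer(text):
        found.append((m.start(), m.group(), "time"))
    return [(surface, label) for _, surface, label in sorted(found)]


entities = tag_entities(
    "the bomb blew up a power tower in the northwestern part "
    "of San Salvador at 0650"
)
```

A lookup like this cannot handle unknown names or ambiguous words, which is why the slide lists those two problems separately.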
41Architecture
Architecture
Lexical analysis
time expressions, mushroom names, abbreviations, numbers, morphological parts
<Sombrero bastante carnoso de 4 a 8 cm , convexo , luego completamente extendido , aplanado y mamelonado , liso , húmedo e higrófano .>
<Esta última condición influye en la variabilidad de su coloración desde canela claro a toda la gama de tostados .>
<Con la edad generalmente palidece sus tonos .>
<Puede confundirse con otras foliotas comestibles , pero alguna especie es amarga .>
<Los aficionados poco experimentados pueden también confundir este género con otros no comestibles , como Hypholoma y Flacemula , también lignícolas .>
Depends on the scenario
42Architecture
Architecture
Lexical analysis
<A bomb went off this morning near a power tower in San Salvador leaving a large part of the city without energy , but no casualties have been reported .>
<According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650 .>
time expressions, locations, organizations, persons
43Architecture
Architecture
Syntactic analysis
- Full parsing (Lolita, LaSIE, LaSIE-II)
- inefficient, due to the size of the grammars
- lack of robustness (out-of-vocabulary words)
- treebank grammars
- cascaded grammars
- Solves some problems related to the tuning and
incompleteness
44Architecture
Architecture
Syntactic analysis
- Partial parsing
- the most commonly used
- chunks or phrasal trees (noun phrases, verbal phrases, prep phrases, adj phrases, adv phrases)
- absence of global dependencies
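A minimal sketch of partial parsing is a chunker that groups POS-tagged words into flat noun phrases and leaves everything else untouched. The rule used here (optional determiner, any adjectives, one or more nouns) and the tag names are assumptions chosen to match the tagged examples in these slides.

```python
def np_chunks(tagged):
    """Group maximal 'det? adj* n+' runs into NP chunks.

    tagged is a list of (word, tag) pairs; words outside any noun
    phrase are returned as single-word chunks labelled with their tag.
    """
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "det":
            j += 1
        while j < n and tagged[j][1] == "adj":
            j += 1
        k = j
        while k < n and tagged[k][1] == "n":
            k += 1
        if k > j:  # at least one noun closes the phrase
            chunks.append(("NP", [w for w, _ in tagged[i:k]]))
            i = k
        else:
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


sentence = [("the", "det"), ("bomb", "n"), ("allegedly", "adv"),
            ("detonated", "v"), ("by", "prep"), ("urban", "adj"),
            ("guerrilla", "n"), ("commandos", "n")]
chunks = np_chunks(sentence)
```

The output is a flat sequence of chunks with no attachment between them, which is exactly the absence of global dependencies noted above.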
45Architecture
Architecture
Semantic interpretation
- Compositional semantics
- full parsing: λ-expressions
- LaSIE, LaSIE-II
- Entries with λ-expressions in the lexicons
- partial parsing: grammatical relations (Vilain, 99)
- output: logical forms
46Architecture
Architecture
Semantic interpretation
- Compositional semantics (example 1)
λ(z) λ(y) λ(x) (bombing(x,y,z,bomb,today_morning,power_tower(San_Salvador)))
[Parse tree nodes: s, vp, pp, np, np, np, pp]
A bomb went off this morning near a power tower in San Salvador
go_off → λ(t) λ(s) λ(r) λ(z) λ(y) λ(x) (bombing(x,y,z,r,s,t))
power_tower → λ(x) (power_tower(x))
47Architecture
Architecture
Semantic interpretation
- Compositional semantics (example 2)
[Grammatical relation labels: location_of, place, subj, time]
A bomb went off this morning near a power tower in San Salvador
event(bombing, E) subj(bomb, E) time(today_morning, E) place(power_tower, E) location_of(power_tower, San_Salvador)
48Architecture
Architecture
Semantic interpretation
- Pattern matching
- after partial parsing: SVO dependencies
- the most widespread approach
- patterns can be implemented in different ways
- scenario-driven approach (TE, TR, ST, ...)
- Output: partial templates
49Architecture
Architecture
Semantic interpretation
- Pattern matching (example)
A bomb went off this morning near a power tower
in San Salvador
np(C-instrument) vp(go_off) np(C-time) near np(C-place) in np(C-location)
→ INSTRUMENT: C-instrument, DATE: C-time, PHIS_TARGET: C-place, LOCATION: C-location
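The pattern in this example can be sketched as a sequence of constraints matched against the chunks of a sentence. The tuple encoding below, with a chunk type, a semantic class or literal, and an optional slot name, is an assumption made for illustration; as noted above, patterns can be implemented in different ways.

```python
# Each pattern element: (chunk type, required class or literal, slot to fill).
PATTERN = [
    ("np", "instrument", "INSTRUMENT"),
    ("vp", "go_off", None),
    ("np", "time", "DATE"),
    ("lit", "near", None),
    ("np", "place", "PHIS_TARGET"),
    ("lit", "in", None),
    ("np", "location", "LOCATION"),
]


def match(chunks, pattern=PATTERN):
    """Return a partial template if the chunks satisfy the pattern, else None.

    chunks is a list of (chunk type, class or literal, text) triples.
    """
    if len(chunks) != len(pattern):
        return None
    template = {}
    for (ctype, ckey, text), (ptype, pkey, slot) in zip(chunks, pattern):
        if ctype != ptype or ckey != pkey:
            return None
        if slot is not None:
            template[slot] = text
    return template


chunks = [
    ("np", "instrument", "A bomb"),
    ("vp", "go_off", "went off"),
    ("np", "time", "this morning"),
    ("lit", "near", "near"),
    ("np", "place", "a power tower"),
    ("lit", "in", "in"),
    ("np", "location", "San Salvador"),
]
partial_template = match(chunks)
```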
50Architecture
Architecture
Discourse analysis
- Inter-sentence analysis
- Co-reference resolution
- Ellipsis resolution
- Alias resolution
- Traditional semantic interpretation procedures
- Template merging procedures
- Inference procedures
- Open-domain and domain-specific knowledge for
inferences
51Architecture
Architecture
Discourse analysis
A bomb went off this morning near a power tower in San Salvador , but no casualties have been reported
λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
According to unofficial sources , the bomb -allegedly detonated by urban guerrilla commandos- blew up a power tower in the northwestern part of San Salvador at 0650
λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))
52Architecture
Architecture
Discourse analysis
λ(y) λ(x) (bombing(x,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
λ(z) λ(y) (bombing(urban_guerrilla_comandos,y,z,bomb,0650,power_tower(the_northwestern_part_of_San_Salvador)))
Unification + inference
λ(y) (bombing(urban_guerrilla_comandos,y,no_casualties,bomb,today_morning,power_tower(San_Salvador)))
Inference (blew_up → destroyed)
bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
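The unification step can be sketched as a merge of two partial templates in which unbound variables are represented by `None`. Everything named below is an assumption for illustration: the slot names, the `COMPATIBLE` table standing in for the inference knowledge, and the choice to keep the first template's value when two values are compatible but different.

```python
# Hypothetical knowledge: pairs of values that may denote the same thing.
COMPATIBLE = {
    ("today_morning", "0650"),
    ("San_Salvador", "the_northwestern_part_of_San_Salvador"),
}


def compatible(a, b):
    return a == b or (a, b) in COMPATIBLE or (b, a) in COMPATIBLE


def unify(first, second):
    """Merge two partial templates, or return None if some slot conflicts."""
    merged = {}
    for slot in first.keys() | second.keys():
        a, b = first.get(slot), second.get(slot)
        if a is None or b is None:
            merged[slot] = a if a is not None else b
        elif compatible(a, b):
            merged[slot] = a  # keep the value of the first template
        else:
            return None
    return merged


first = {"perpetrator": None, "physical_effect": None,
         "human_effect": "no_casualties", "instrument": "bomb",
         "time": "today_morning", "target": "San_Salvador"}
second = {"perpetrator": "urban_guerrilla_comandos", "physical_effect": None,
          "human_effect": None, "instrument": "bomb",
          "time": "0650", "target": "the_northwestern_part_of_San_Salvador"}
merged = unify(first, second)
```

The `physical_effect` slot stays unbound after the merge; in the example it is filled by the separate inference from blew_up to destroyed.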
53Architecture
Architecture
Output template generation
- Mapping of the extracted pieces onto the desired output format
- Specific inferences
- Normalization to predefined values of slots
- Mandatory slots
- Extracted information that implies different
slot values
54Architecture
Architecture
Output template generation
bombing(urban_guerrilla_comandos,destroyed,no_casualties,bomb,today_morning,power_tower(San_Salvador))
Today_morning → March_19
No_casualties → no_injuries_or_death
Incident type: bombing
Date: March 19
Location: El Salvador, San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
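Normalization to predefined slot values and the check on mandatory slots might look like the sketch below. The lookup table reuses the two mappings of this example, while the set of mandatory slots and the function name are assumptions. A real system would compute the date from the document date instead of a fixed table.

```python
# Mappings taken from the example; a fixed table is a simplification.
NORMALIZE = {
    "today_morning": "March 19",
    "no_casualties": "no injury or death",
}

# Hypothetical choice of mandatory slots.
MANDATORY = ("Incident type", "Date", "Location")


def generate_template(extracted):
    """Normalize slot values and verify that mandatory slots are filled."""
    template = {slot: NORMALIZE.get(value, value)
                for slot, value in extracted.items()}
    missing = [slot for slot in MANDATORY if not template.get(slot)]
    if missing:
        raise ValueError("missing mandatory slots: " + ", ".join(missing))
    return template


template = generate_template({
    "Incident type": "bombing",
    "Date": "today_morning",
    "Location": "San Salvador",
    "Effect on human target": "no_casualties",
    "Instrument": "bomb",
})
```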
56Knowledge specific for IE
Characteristics of IE systems
- Strong dependence on the domain
- Scenario of extraction
- Semantics vs. syntax
- Discourse analysis
- Strong dependence on the text structure
- Sublanguages
- Meta-information
- Strong dependence on the output format
- DBs
- annotations
57Knowledge specific for IE
Characteristics of IE systems
- Importance of portability and tuning
- Importance of knowledge engineering
- Modularity
- Basic tasks and specific tasks
- Use of weak and local knowledge
- Importance of NL resources
- MRDs, ontologies, general lexicons, specific dictionaries, ...
58Knowledge specific for IE
Knowledge resources
- Knowledge more or less stable
- general lexicon
- general grammar
- basic NL processors: segmenters, taggers, parsers, ...
- Domain dependent knowledge
- Domain specific vocabularies, terminology
- gazetteers and patterns for NERC
- IE patterns
59Knowledge specific for IE
Types of IE patterns
- Viewpoint 1 type of representation
- rules
np(C-instrument) vp(go_off) np(C-time) near np(C-place) in np(C-location)
→ Event.INSTRUMENT: C-instrument, Event.DATE: C-time, Event.PHIS_TARGET: C-place, Event.LOCATION: C-location
60Knowledge specific for IE
Types of IE patterns
- Viewpoint 1 type of representation
-
- statistical models (BNs, HMMs, ME, hyperplanes, ...)
62Knowledge specific for IE
Types of IE patterns
- Viewpoint 2 type of values extracted
- slot filler extraction patterns
- (the HMM presented before)
- event extraction patterns
- (the rule presented before)
np(C-instrument) vp(go_off) np(C-time) near np(C-place) in np(C-location)
→ Event.INSTRUMENT: C-instrument, Event.DATE: C-time, Event.PHIS_TARGET: C-place, Event.LOCATION: C-location
64Knowledge specific for IE
Types of IE patterns
- Viewpoint 3 number of slot fillers extracted
- single-slot IE patterns
- (the HMM presented before)
- multi-slot IE patterns
- (both rules presented before)
66Examples of IE systems
Methodologies (Turmo, 2002)
System   | Reference
LaSIE    | Gaizauskas et al, 1995
LaSIE-II | Humphreys et al, 1998
LOLITA   | Garigliano et al, 1998
CIRCUS   | Lehnert et al, 1991
FASTUS   | Hobbs et al, 1993
BADGER   | Fisher et al, 1995
HASTEN   | Krupka, 1995
PROTEUS  | Grishman, 1995
ALEMBIC  | Aberdeen et al, 1993
PIE      | Lin, 1995
TURBIO   | Turmo, 2002
PLUM     | Weischedel et al, 1995
IE2      | Aone et al, 1998
LOUELLA  | Childs et al, 1995
SIFT     | Miller et al, 1998
Parsing | Semantics | Discourse
indepth understanding
template
merging Chunking Pattern
matching -
semantic Gramm relations
interp interpretation
procedures Partial Parsing pattern
matching Pattern matching
template merging -
syntactico-semantic parsing
67Examples of IE systems
Knowledge Turmo,2002
System Parsing
Semantics Discourse
LaSIE LaSIE-II LOLITA CIRCUS FASTUS BADGER HASTEN
PROTEUS ALEMBIC TURBIO PIE PLUM IE2 LOUELLA SIFT
Treebank grammar λ-expressions hand-crafted
stratified general grammar General
grammar semantic network
concept nodes (AutoSlog)
hand-crafted IE rules concept
nodes (CRYSTAL) decision trees Phrasal
grammar E-graphs IE
rules (ExDISCO)
hand-crafted gram relations
IE rules (EVIUS) General grammar
hand-crafted IE rules
hand-crafted rules hand-crafted IE rules
decision trees Statistical models for
syntactic-semantic parsing coreference
resolution learned from PTB and on-domain
annotated texts
68Examples of IE systems
LaSIE-II system
gazetteers
Lexicon
Conceptual hierarchy
Sentence splitter
Gazetteer lookup
Buchart parser
Name matcher
Brill tagger
Tagged morph
Discourse interpreter
Template writer
69Examples of IE systems
LaSIE-II system
- Preprocessing
- NERC preprocess via gazetteers and keyword lists
- Root form and inflectional suffix for verbs, nouns and adjs found in sentences
According_to-adv unofficial-adj sources-n ,
the-det bomb-n allegedly-adv detonateed-v
by-prep urban-adj guerrilla-n commandos-n -
blow_up-v a-det power_tower-n in-prep the-det
northwestern-adj part-n of-prep San Salvador-loc
at-prep 0650
70Examples of IE systems
LaSIE-II system
- Syntactico-semantic interpretation
- bottom-up chart parser
- cascade of NERC grammars (e.g. aircraft, person, money, time, timex)
According_to-adv unofficial-adj sources-n , the-det bomb-n allegedly-adv detonateed-v by-prep urban-adj guerrilla-n commandos-n - blow_up-v a-det power_tower-n in-prep the-det [northwestern part of San Salvador]-loc (NE1) at-prep [0650]-time (NE2)
71Examples of IE systems
LaSIE-II system
- Syntactico-semantic interpretation
- bottom-up chart parser
- cascade of NERC grammars (e.g. aircraft, person, money, time)
- cascade of partial grammars (NPs, PPs, complex NP, VPs, complex VPs, RelClauses, Sentence)
S(According_to-adv NP(unofficial-adj sources-n) , NP(the-det bomb-n) allegedly-adv VP(detonateed-v) PP(by-prep NP(urban-adj guerrilla-n commandos-n)) - VP(blow_up-v) NP(a-det power_tower-n) PP(in-prep NP(the-det NE1-loc)) PP(at-prep NP(NE2-time)))
72Examples of IE systems
LaSIE-II system
- Syntactico-semantic interpretation
- bottom-up chart parser
- cascade of NERC grammars (e.g. aircraft, person, money, time)
- cascade of partial grammars (NPs, PPs, complex NP, VPs, complex VPs, RelClauses, Sentence)
- QLFs (Note: the real implementation of QLFs is not specified)
Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)
73Examples of IE systems
LaSIE-II system
- Discourse analysis
- Name matcher: matches variants of NEs across the text
- Discourse interpreter
- adds QLF representation to a semantic net (links)
- adds presuppositions
- coreference resolution
Event(E1), detonate(E1,Y,X), urban_guerrilla_comando(X), bomb(Y), Event(E2), blow_up(E2,Y,Z), power_tower(Z), location_of(Z,NE1), time_of(E2,NE2)
[Semantic net labels: bombing event, implies, implies, isa, location of event, destroy]
74Examples of IE systems
LaSIE-II system
- Output template generation
- procedure that writes the templates in the desired format
Incident type: bombing
Date: March 19
Location: El Salvador, San Salvador (city)
Perpetrator: urban guerrilla commandos
Physical target: power tower
Human target: -
Effect on physical target: destroyed
Effect on human target: no injury or death
Instrument: bomb
75Examples of IE systems
PROTEUS system
Lexicon
Chunk grammar
NERC Rules
IE-Rules
Conceptual hierarchy
Format Rules
Inference Rules
Partial parsing
Lexical Analyzer
Coreference resolution
Discourse Analysis
Scenario Patterns
Output generator
NERC
76Examples of IE systems
PROTEUS system
Preprocessing
According_to-adv unofficial-adj sources-n , the-det bomb-n allegedly-adv detonated-v by-prep urban-adj guerrilla-n commandos-n - blew_up-v a-det power_tower-n in-prep the-det [northwestern part of San Salvador]-loc (NE1) at-prep [0650]-time (NE2)
77Examples of IE systems
PROTEUS system
- Syntactico-semantic interpretation
- basic VP and NP chunks + head_semantics
- semantics refer to types of slot fillers (Conceptual hierarchy)
According_to-adv NP(unofficial-adj sources-n-s1)
, NP(the-det bomb-n-artifact) allegedly-adv
VP(detonated-v-s3) by-prep NP(urban-adj
guerrilla-n commandos-n-person)
VP(blew_up-v-s4) NP(a-det power_tower-n-building)
in-prep NP(NE1-location) at-prep NP(NE2-time)
78Examples of IE systems
PROTEUS system
- Syntactico-semantic interpretation
- basic VP and NP chunks + head_semantics
- IE-rules for relations (appositions, PP-attachments, limited conjunctions)
- NP(A-person) , B-integer years old , → instance(X,person), name_of(X,A), age_of(X,B)
- NP(A-position) of NP(B-company) → instance(X,person), position_of(X,A), company_of(X,B)
Real implementation as objects: Class person; Slot name, Value A; Slot age, Value B
79Examples of IE systems
PROTEUS system
- Syntactico-semantic interpretation
- basic VP and NP chunks + head_semantics
- IE-rules for relations (appositions, PP-attachments, limited conjunctions)
- IE-rules for events (PET interface or ExDISCO)
- NP(A-artifact) v-s4 NP(B-building) → instance(E1,s4), instrument_of(E1,A), phisical_target_of(E1,B)
According_to-adv NP(unofficial-adj sources-n-s1)
, NP(the-det bomb-n-artifact) allegedly-adv
VP(detonated-v-s3) by-prep NP(urban-adj
guerrilla-n commandos-n-person)
VP(blew_up-v-s4) NP(a-det power_tower-n-building)
in-prep NP(NE1-location) at-prep NP(NE2-time)
80Examples of IE systems
PROTEUS system
- Discourse analysis
- antecedents are sought in sequential order
- constraints
- instance of a hyperclass
- same number
- share arguments
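The antecedent search with these three constraints can be sketched as follows. Several points are assumptions and not details given in the slides: candidates are scanned from the most recent mention backwards, "share arguments" is read as "no conflicting argument values", and the small class hierarchy is invented for the example.

```python
# Hypothetical fragment of a conceptual hierarchy: class -> hyperclass.
HYPERCLASS = {"bomb": "artifact", "power_tower": "building",
              "commando": "person"}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = HYPERCLASS.get(cls)
    return False


def find_antecedent(anaphor, previous):
    """Return the closest earlier mention satisfying all constraints."""
    for candidate in reversed(previous):
        same_class = is_a(candidate["cls"], anaphor["cls"])
        same_number = candidate["number"] == anaphor["number"]
        args_agree = all(candidate["args"].get(k, v) == v
                         for k, v in anaphor["args"].items())
        if same_class and same_number and args_agree:
            return candidate
    return None


mentions = [
    {"id": 1, "cls": "bomb", "number": "sg", "args": {}},
    {"id": 2, "cls": "commando", "number": "pl", "args": {}},
]
# A later definite NP whose head denotes an artifact, e.g. "the device".
anaphor = {"cls": "artifact", "number": "sg", "args": {}}
antecedent = find_antecedent(anaphor, mentions)
```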
81Examples of IE systems
PROTEUS system
- Discourse analysis
- QLFs + inference rules → more complex QLFs
- conversion of date expressions
- inference of slot values from the QLFs already achieved
- inference of events from others explicitly described
- "Fred, the president of Cuban Cigar Corp., was appointed vice president of Microsoft" implies "Fred left the Cuban Cigar Corp."
82Examples of IE systems
PROTEUS system
- Output template generation
- use of rules to build the templates with the
desired format
83Examples of IE systems
IE2 system
Custom NameTag
Discourse Module
NetOwl Extractor 3.0
PhraseTag
EventTag
TempGen
Hand-crafted rules
84Examples of IE systems
IE2 system
- Preprocessing
- only NERC
- SGML-tagged
- general NE types and subtypes
- restricted-domain NE types and subtypes
<person id=1>Jeff Bantle</person>, <entity id=2>NASA</entity>'s mission operations directorate representative for the shuttle flight
85Examples of IE systems
IE2 system
- Syntactico-semantic interpretation
- SGML-tagging of phrases that are values of slots
- NPs denoting persons (PNP), organizations (ENP), artifacts (ANP), ...
- local links (location-of, employee-of, owner-of, ...)
<person id=1>Jeff Bantle</person>, <PNP affil=2><entity id=2>NASA</entity>'s mission operations directorate representative for the shuttle flight</PNP>
86Examples of IE systems
IE2 system
- Syntactico-semantic interpretation
- SGML-tagging of phrases that are values of slots in templates
- NPs
- local semantic relations (employee-of, location-of, product-of, ...)
- event IE-rules (note: the real implementation is not specified)
- Vehicle LaunchN → launch_event: vehicle_info = Vehicle
<launch_event id=2 vehicle_info=1><ANP> The <vehicle id=1>Arian 5</vehicle> launch </ANP> was successfully achieved at 6am
87Examples of IE systems
IE2 system
- Discourse analysis
- Three coreference resolution methods
- Rule based
- Machine learning based
- Hybrid
- Name alias resolution in addition to that performed by NetOwl
- Definite NPs
- Singular personal pronouns
<person id=1>Jeff Bantle</person>, <PNP ref=1 affil=2><entity id=2>NASA</entity>'s mission operations directorate representative for the shuttle flight</PNP>
88Examples of IE systems
IE2 system
- Output template generation
- Translates SGML output into templates in the desired format
- Resolves and normalizes time expressions
- Performs event merging
89Examples of IE systems
SIFT system
Output generator
Cross-sentence level
Sentence level
IdentifinderTM
Statistical models
90Examples of IE systems
SIFT system
- Preprocessing
- NERC using an HMM (Bikel et al. 97): Viterbi maximizing Pr(W,F,C)
- each word is tagged with one NE class
[HMM states: start-sentence, person, not-a-name, organization, location, end-sentence]
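Viterbi decoding over name classes can be sketched with a plain first-order HMM. This is a strong simplification of the model cited above: word features (F) are ignored, and all probabilities, the three-state inventory and the smoothing constant for unseen words are invented for the example.

```python
def viterbi(words, states, start, trans, emit, unseen=0.01):
    """Return the most likely state sequence for a non-empty word list."""
    best = [{s: start[s] * emit[s].get(words[0], unseen) for s in states}]
    back = []
    for word in words[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] * trans[p][s])
            scores[s] = (best[-1][prev] * trans[prev][s]
                         * emit[s].get(word, unseen))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters, invented for illustration only.
STATES = ["person", "not-a-name", "location"]
START = {"person": 0.3, "not-a-name": 0.5, "location": 0.2}
TRANS = {
    "person": {"person": 0.3, "not-a-name": 0.6, "location": 0.1},
    "not-a-name": {"person": 0.2, "not-a-name": 0.6, "location": 0.2},
    "location": {"person": 0.1, "not-a-name": 0.6, "location": 0.3},
}
EMIT = {
    "person": {"Nance": 0.8},
    "not-a-name": {"visited": 0.9},
    "location": {"Boston": 0.8},
}
tags = viterbi(["Nance", "visited", "Boston"], STATES, START, TRANS, EMIT)
```

Raw probabilities are multiplied here for readability; on long sentences a practical implementation would sum log-probabilities to avoid underflow.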
91Examples of IE systems
SIFT system
- Syntactico-semantic interpretation
- properties of NEs (TE) and relations (TR)
- generative statistical model (Miller et al. 98, 00)
- search for the most likely augmented parse tree (bottom-up chart based)
- pruning of low probability constituents
92Examples of IE systems
SIFT system
Syntactico-semantic interpretation
per/np
per-desc-r/np
emp-of/pp-lnk
org-ptr/pp
per-r/np per-desc/np
org-r/np
per/nnp , det vbn per-desc/nn to
org/nnp org/nnp ,
Nance , a paid consultant to
ABC News ,
93Examples of IE systems
SIFT system
- Syntactico-semantic interpretation
- relations between NEs across sentences
- statistical model Miller et al. 98
- classifier of pairs of entities
- entities in different sentences
- entities do not take part in local relations
- their types are compatible with any relation
94Examples of IE systems
TURBIO system
Partial-tree grammar
Lexicon
NERC Rules
IE-rule set scheduling
IE-Rule set processor
IE-Rule sets
Partial parsing
Lexical Analyzer
controller
NERC
Output generator
95Examples of IE systems
TURBIO system
- Preprocessing
- WordNet synsets, lemmas, POS tags
- NERC
- parsed trees of noun, verbal, and adjectival
phrases
96Examples of IE systems
TURBIO system
- Syntactico-semantic interpretation
- Hypothesis: dependence among relations of NEs
- Iterative execution of IE-rule sets depending on the scheduling
- Example
- Scenario: mushroom parts, their possible colors and the circumstances by which they are produced
- There are colors in the documents that are not related to any mushroom part, but all colors related to a circumstance are colors related to mushroom parts.