Title: CS 388: Natural Language Processing: Information Extraction
1CS 388 Natural Language ProcessingInformation
Extraction
- Raymond J. Mooney
- University of Texas at Austin
1
1
2Information Extraction (IE)
- Identify specific pieces of information (data) in
a unstructured or semi-structured textual
document. - Transform unstructured information in a corpus of
documents or web pages into a structured
database. - Applied to different types of text
- Newspaper articles
- Web pages
- Scientific articles
- Newsgroup messages
- Classified ads
- Medical notes
3Sample Job Posting
Subject US-TN-SOFTWARE PROGRAMMER Date 17 Nov
1996 173729 GMT Organization Reference.Com
Posting Service Message-ID lt56nigpmrs_at_bilbo.refe
rence.comgt SOFTWARE PROGRAMMER Position
available for Software Programmer experienced in
generating software for PC-Based Voice Mail
systems. Experienced in C Programming. Must be
familiar with communicating with and controlling
voice cards preferable Dialogic, however,
experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or
more experience with PC Based Voice Mail, but
will consider as little as 2 years. Need to find
a Senior level person who can come on board and
pick up code with very little training. Present
Operating System is DOS. May go to OS-2 or UNIX
in future. Please reply to Kim
Anderson AdNET (901) 458-2888 fax kimander_at_memphis
online.com
Subject US-TN-SOFTWARE PROGRAMMER Date 17 Nov
1996 173729 GMT Organization Reference.Com
Posting Service Message-ID lt56nigpmrs_at_bilbo.refe
rence.comgt SOFTWARE PROGRAMMER Position
available for Software Programmer experienced in
generating software for PC-Based Voice Mail
systems. Experienced in C Programming. Must be
familiar with communicating with and controlling
voice cards preferable Dialogic, however,
experience with others such as Rhetorix and
Natural Microsystems is okay. Prefer 5 years or
more experience with PC Based Voice Mail, but
will consider as little as 2 years. Need to find
a Senior level person who can come on board and
pick up code with very little training. Present
Operating System is DOS. May go to OS-2 or UNIX
in future. Please reply to Kim
Anderson AdNET (901) 458-2888 fax kimander_at_memphis
online.com
4Extracted Job Template
computer_science_job id 56nigpmrs_at_bilbo.referenc
e.com title SOFTWARE PROGRAMMER salary company
recruiter state TN city country US language
C platform PC \ DOS \ OS-2 \ UNIX application ar
ea Voice Mail req_years_experience
2 desired_years_experience 5 req_degree desired_
degree post_date 17 Nov 1996
5Named Entity Recognition
- Specific type of information extraction in which
the goal is to extract formal names of particular
types of entities such as people, places,
organizations, etc. - Usually a preprocessing step for subsequent
task-specific IE, or other tasks such as question
answering.
6Named Entity Recognition Example
U.S. Supreme Court quashes
'illegal' Guantanamo trials Military trials
arranged by the Bush administration for detainees
at Guantanamo Bay are illegal, the United States
Supreme Court ruled Thursday. The court found
that the trials known as military commissions
for people detained on suspicion of terrorist
activity abroad do not conform to any act of
Congress. The justices also rejected the
government's argument that the Geneva Conventions
regarding prisoners of war do not apply to those
held at Guantanamo Bay. Writing for the 5-3
majority, Justice Stephen Breyer said the White
House had overstepped its powers under the U.S.
Constitution. "Congress has not issued the
executive a blank cheque," Breyer
wrote. President George W. Bush said he takes the
ruling very seriously and would find a way to
both respect the court's findings and protect the
American people.
7Named Entity Recognition Example
people places
organizations U.S. Supreme Court
quashes 'illegal' Guantanamo trials Military
trials arranged by the Bush administration for
detainees at Guantanamo Bay are illegal, the
United States Supreme Court ruled Thursday. The
court found that the trials known as military
commissions for people detained on suspicion of
terrorist activity abroad do not conform to any
act of Congress. The justices also rejected the
government's argument that the Geneva Conventions
regarding prisoners of war do not apply to those
held at Guantanamo Bay. Writing for the 5-3
majority, Justice Stephen Breyer said the White
House had overstepped its powers under the U.S.
Constitution. "Congress has not issued the
executive a blank cheque," Breyer
wrote. President George W. Bush said he takes the
ruling very seriously and would find a way to
both respect the court's findings and protect the
American people.
8Relation Extraction
- Once entities are recognized, identify specific
relations between entities - Employed-by
- Located-at
- Part-of
- Example
- Michael Dell is the CEO of Dell Computer
Corporation and lives in Austin Texas.
9Early Information Extraction
- FRUMP (Dejong, 1979) was an early information
extraction system that processed news stories and
identified various types of events (e.g.
earthquakes, terrorist attacks, floods). - Used sketchy scripts of various events to
identify specific pieces of information about
such events. - Able to summarize articles in multiple languages.
- Relied on brittle hand-built symbolic knowledge
structures that were hard to build and not very
robust.
10MUC
- DARPA funded significant efforts in IE in the
early to mid 1990s. - Message Understanding Conference (MUC) was an
annual event/competition where results were
presented. - Focused on extracting information from news
articles - Terrorist events
- Industrial joint ventures
- Company management changes
- Information extraction of particular interest to
the intelligence community (CIA, NSA). - Established standard evaluation methodolgy using
development (training) and test data and metrics
precision, recall, F-measure.
11Other Applications
- Job postings
- Newsgroups Rapier from austin.jobs
- Web pages Flipdog
- Job resumes
- BurningGlass
- Mohomine
- Seminar announcements
- Company information from the web
- Continuing education course info from the web
- University information from the web
- Apartment rental ads
- Molecular biology information from MEDLINE
12Medline Corpus
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
13Medline Corpus Named Entity Recognition
(Proteins)
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
14Medline Corpus Relation ExtractionProtein
Interactions
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene Moreover, cyclin D1
was found to be phosphorylated on tyrosine
residues in vivo and, like cyclin A, was readily
phosphorylated by pp60c-src in vitro. In
synchronized human osteosarcoma cells, cyclin D1
is induced in early G1 and becomes associated
with p9Ckshs1, a Cdk-binding subunit. Immunoprecip
itation experiments with human osteosarcoma cells
and Ewings sarcoma cells demonstrated that
cyclin D1 is associated with both p34cdc2 and
p33cdk2, and that cyclin D1 immune complexes
exhibit appreciable histone H1 kinase activity
15Web Extraction
- Many web pages are generated automatically from
an underlying database. - Therefore, the HTML structure of pages is fairly
specific and regular (semi-structured). - However, output is intended for human
consumption, not machine interpretation. - An IE system for such generated pages allows the
web site to be viewed as a structured database. - An extractor for a semi-structured web site is
sometimes referred to as a wrapper. - Process of extracting from such pages is
sometimes referred to as screen scraping.
16Amazon Book Description
. lt/tdgtlt/trgt lt/tablegt ltb class"sans"gtThe Age of
Spiritual Machines When Computers Exceed Human
Intelligencelt/bgtltbrgt ltfont faceverdana,arial,helv
etica size-1gt by lta href"/exec/obidos/search-han
dle-url/indexbooksfield-author
Kurzweil2C20Ray/002-6235079-4593641"gt Ray
Kurzweillt/agtltbrgt lt/fontgt ltbrgt lta
href"http//images.amazon.com/images/P/0140282025
.01.LZZZZZZZ.jpg"gt ltimg src"http//images.amazon.
com/images/P/0140282025.01.MZZZZZZZ.gif" width90
height140 alignleft border0gtlt/agt ltfont
faceverdana,arial,helvetica size-1gt ltspan
class"small"gt ltspan class"small"gt ltbgtList
Pricelt/bgt ltspan classlistpricegt14.95lt/spangtltbrgt
ltbgtOur Price ltfont color990000gt11.96lt/fontgtlt/
bgtltbrgt ltbgtYou Savelt/bgt ltfont color990000gtltbgt2.
99 lt/bgt (20)lt/fontgtltbrgt lt/spangt ltpgt ltbrgt
. lt/tdgtlt/trgt lt/tablegt ltb class"sans"gtThe Age of
Spiritual Machines When Computers Exceed Human
Intelligencelt/bgtltbrgt ltfont faceverdana,arial,helv
etica size-1gt by lta href"/exec/obidos/search-han
dle-url/indexbooksfield-author
Kurzweil2C20Ray/002-6235079-4593641"gt Ray
Kurzweillt/agtltbrgt lt/fontgt ltbrgt lta
href"http//images.amazon.com/images/P/0140282025
.01.LZZZZZZZ.jpg"gt ltimg src"http//images.amazon.
com/images/P/0140282025.01.MZZZZZZZ.gif" width90
height140 alignleft border0gtlt/agt ltfont
faceverdana,arial,helvetica size-1gt ltspan
class"small"gt ltspan class"small"gt ltbgtList
Pricelt/bgt ltspan classlistpricegt14.95lt/spangtltbrgt
ltbgtOur Price ltfont color990000gt11.96lt/fontgtlt/
bgtltbrgt ltbgtYou Savelt/bgt ltfont color990000gtltbgt2.
99 lt/bgt (20)lt/fontgtltbrgt lt/spangt ltpgt ltbrgt
17Extracted Book Template
Title The Age of Spiritual Machines
When Computers Exceed Human Intelligence Author
Ray Kurzweil List-Price 14.95 Price 11.96
18Template Types
- Slots in template typically filled by a substring
from the document. - Some slots may have a fixed set of pre-specified
possible fillers that may not occur in the text
itself. - Terrorist act threatened, attempted,
accomplished. - Job type clerical, service, custodial, etc.
- Company type SEC code
- Some slots may allow multiple fillers.
- Programming language
- Some domains may allow multiple extracted
templates per document. - Multiple apartment listings in one ad
19IE as Sequence Labeling
- Can treat IE as a sequence labeling problem.
- Can apply a sliding window classifier using
various classification algorithms. - Can apply probabilistic sequence models
- HMM
- CRF
20Pattern-Matching Rule Extraction
- Another approach to building IE systems is to use
pattern-matching rules for each field to identify
the strings to extract for that field. - When building web extraction systems (wrappers)
manually, it is common to write regular
expression patterns (in a language like Perl) to
identify the desired regions of the text. - Works well when a fairly fixed local context is
sufficient to identify extractions, as in
extracting from web pages generated by a program
or very stylized text like classified ads.
21Regular Expressions
- Language for composing complex patterns from
simpler ones. - An individual character is a regex.
- Union If e1 and e2 are regexes, then (e1 e2 )
is a regex that matches whatever either e1 or e2
matches. - Concatenation If e1 and e2 are regexes, then e1
e2 is a regex that matches a string that consists
of a substring that matches e1 immediately
followed by a substring that matches e2 - Repetition (Kleene closure) If e1 is a regex,
then e1 is a regex that matches a sequence of
zero or more strings that match e1
22Regular Expression Examples
- (ue)nabl(eing) matches
- unable
- unabling
- enable
- enabling
- (unen)able matches
- able
- unable
- unenable
- enununenable
23Enhanced Regexs (Perl)
- Special terms for common sets of characters, such
as alphabetic or numeric or general wildcard. - Special repetition operator () for 1 or more
occurrences. - Special optional operator (?) for 0 or 1
occurrences. - Special repetition operator for specific range of
number of occurrences min,max. - A1,5 One to five As.
- A5, Five or more As
- A5 Exactly five As
24Perl Regexs
- Character classes
- \w (word char) Any alpha-numeric (not \W)
- \d (digit char) Any digit (not \D)
- \s (space char) Any whitespace (not \S)
- . (wildcard) Anything
- Anchor points
- \b (boundary) Word boundary
- Beginning of string
- End of string
25Perl Regex Examples
- U.S. phone number with optional area code
- /\b(\(\d3\)\s?)?\d3-\d4\b/
- Email address
- /\b\S_at_\S(\.com\.edu\.gov\.org\.net)\b/
26Simple Extraction Patterns
- Specify an item to extract for a slot using a
regular expression pattern. - Price pattern \b\\d(\.\d2)?\b
- May require preceding (pre-filler) pattern to
identify proper context. - Amazon list price
- Pre-filler pattern ltbgtList Pricelt/bgt ltspan
classlistpricegt - Filler pattern \\d(\.\d2)?\b
- May require succeeding (post-filler) pattern to
identify the end of the filler. - Amazon list price
- Pre-filler pattern ltbgtList Pricelt/bgt ltspan
classlistpricegt - Filler pattern .
- Post-filler pattern lt/spangt
27Adding NLP Information to Patterns
- If extracting from automatically generated web
pages, simple regex patterns usually work. - If extracting from more natural, unstructured,
human-written text, some NLP may help. - Part-of-speech (POS) tagging
- Mark each word as a noun, verb, preposition, etc.
- Syntactic parsing
- Identify phrases NP, VP, PP
- Semantic word categories (e.g. from WordNet)
- KILL kill, murder, assassinate, strangle,
suffocate - Extraction patterns can use POS or phrase tags.
- Crime victim
- Prefiller POS V, Hypernym KILL
- Filler Phrase NP
28Pattern-Match Rule Learning
- Writing accurate patterns for each slot for each
application requires laborious software
engineering. - Alternative is to use rule induction methods.
- RAPIER system (Califf Mooney, 1999) learns
three regex-style patterns for each slot - Pre-filler pattern
- Filler pattern
- Post-filler pattern
- RAPIER allows use of POS and WordNet categories
in patterns to generalize over lexical items.
29RAPIER Pattern Induction Example
- If goal is to extract the name of the city in
which a posted job is located, the
least-general-generalization constructed by
RAPIER is
30Evaluating IE Accuracy
- Always evaluate performance on independent,
manually-annotated test data not used during
system development. - Measure for each test document
- Total number of correct extractions in the
solution template N - Total number of slot/value pairs extracted by the
system E - Number of extracted slot/value pairs that are
correct (i.e. in the solution template) C - Compute average value of metrics adapted from IR
- Recall C/N
- Precision C/E
- F-Measure Harmonic mean of recall and precision
31IE Experiment in Bioinformatics
- Large scale comparison of IE methods on
identifying names of human proteins in biomedical
journal abstracts (Bunescu et al. 2004). - Goal is to mine the large body of biomedical
literature to extract a useful database of all
known protein interactions. - Biologists can use this protein network to
better understand the overall biochemical
functioning of an organism.
32Non-Learning Protein Extractors
- Dictionary-based extraction
- Uses a gazetteer of known human protein names.
- KEX (Fukuda et al., 1998)
- General protein-name identifier not specialized
for human.
33Learning Methods for Protein Extraction
- Rule-based pattern induction
- Rapier (Califf Mooney, 1999)
- BWI (Freitag Kushmerick, 2000)
- Token classification (chunking approach)
- K-nearest neighbor
- Transformation-Based Rule Learning Abgene
(Tanabe Wilbur, 2002) - Support Vector Machine (maximum-margin
Perceptron) - Maximum entropy (discriminative version of Naïve
Bayes) - Hidden Markov Models
- Conditional Random Fields (Lafferty, McCallum,
and Pereira, 2001) - Relational Markov Networks (Taskar, Abbeel, and
Koller, 2002)
34Biomedical Corpora
- AIMed 750 abstracts that contain the word human
were randomly chosen from Medline for testing
protein name extraction. They were manually
tagged by experts to annotate a total of 5,206
human protein references (Bunescu et al., 2005). - Yapex Another corpus of 200 abstracts manually
tagged for human protein names.
35Experimental Method
- 10-fold cross-validation Average results over 10
trials with different training and (independent)
test data. - For methods which produce confidence in
extractions, vary threshold for extraction in
order to explore recall-precision trade-off. - Use standard methods from information-retrieval
to generate a complete precision-recall curve. - Maximizing F-measure assumes a particular
cost-benefit trade-off between incorrect and
missed extractions.
36Protein Name Extraction ResultsAIMed Corpus
37Protein Name Extraction Results Yapex Corpus
38Relation Extraction
- Biomedical corpora gt Interactions between
Proteins.
interaction
protein
protein
Cyclin D1 is induced in early G1 and becomes
associated with p9Ckshs1, a Cdk binding subunit.
- Newspaper corpora gt relationships (e.g. Role,
Part, Location, Near, Social) between predefined
types of entities (e.g. Person, Organization,
Facility, Location, Geo-Political).
location
location
people
people
facility
Protesters seized several pumping stations,
holding 127 Shell workers hostage.
39ELCS (Extraction using Longest Common
Subsequences)
- A method for inducing pattern-match rules that
extract interactions between previously tagged
proteins. - Each rule consists of a sequence of words with
allowable word gaps between them (similar to
Blaschke Valencia, 2001, 2002). - (7)
interactions (0) between (5) PROT (9) PROT (17) . - Any pair of proteins in a sentence if tagged as
interacting forms a positive example, otherwise
it forms a negative example. - Positive examples are repeatedly generalized to
form rules until the rules become overly general
and start matching negative examples.
40Generalizing Rules using Longest Common
Subsequence
The self - association site appears to be formed
by interactions between helices 1 and 2 of beta
spectrin repeat 17 of one dimer with helix 3 of
alpha spectrin repeat 1 of the other dimer to
form two combined alpha - beta triple - helical
segments . Title - Physical and functional
interactions between the transcriptional
inhibitors Id3 and ITF-2b .
41Protein Interaction Corpus
- 200 abstracts previously known to contain protein
interactions were obtained from the Database of
Interacting Proteins. They contain 1,101
interactions and 4,141 protein names. - As negative examples for interaction extraction
are rare, an extra set of 30 abstracts containing
sentences with non-interacting proteins are
included. - The resulting 230 abstracts are used for testing
protein interaction extraction.
42Protein Interaction Extraction Results(gold-stand
ard protein tags)
43Protein Interaction Extraction Results(automated
protein tags)
44ERK Relation Extraction using a String
Subsequence Kernel
- Subsequences of words and POS tags are used as
implicit features. - Assumes the entities have already been annotated.
- The feature space can be further pruned down in
almost all examples, a sentence asserts a
relationship between two entities using one of
the following patterns - FI Fore-Inter interaction of P1 with P2,
activation of P1 by P2 - I Inter P1 interacts with P2, P1 is
activated by P2 - IA Inter-After P1 P2 complex, P1 and P2
interact
Bunescu et al., 2005.
interaction of (3) PROT (3) with PROT
45Protein Interaction Extraction Results(gold-stand
ard protein tags)
46ACE 2002 Newspaper Corpus
- Newspaper article extraction task.
- Documents
- 422 training documents
- 97 test documents
- Extracted information
- Entities Person, Organization, Facility,
Location, _______Geopolitical Entity - Relations Role, Part, Located, Near, Social
47ACE 2002 Newspaper Corpus
- Compared
- ERK string subsequence kernel extractor
- K4 The tree dependency kernel from
- Culotta et. al, 2004.
Method Precision Recall F-measure
ERK 73.9 35.2 47.7
K4 70.3 26.3 38.0
48Text Mining
- Automatically extract information from a large
corpus to build a large database or
knowledge-base of useful information. - For example, we have used our trained protein
interaction extractor to mine biomedical journal
abstracts - Input 753,459 Medline abstracts that reference
human - Output Database of 6,580 interactions between
3,737 human proteins
49Active Learning
- Annotating training documents for each
application is difficult and expensive. - Random selection can waste effort on annotating
documents that do not help the learner. - Best to focus human effort on annotating the most
informative documents. - Active learning methods pick only the most
informative examples for training. - At each step, select the example that is
estimated to be the most useful for improving the
current learner and then ask the human oracle to
annotate this example.
50Uncertainty Sampling
- Assume learned system can provide confidence in
its predicted labelings of examples. - From a pool of unlabeled data, pick as most
informative, the unlabelled example about which
the current learned system is most uncertain.
Let D be a set of unlabeled examples Until
desired accuracy is reached Apply current
learned system, L, to all examples in D
From D, select the example, E, whose label is
most uncertain Ask the user to label E and
remove it from D. Add E to the training
set and retrain L
51Rapier Uncertainty Sampling Results
52Information Extraction Issues
- Effectively exploiting global information
- Better active learning methods
- Integrating entity and relation extraction
- Unsupervised IE
- Semi-supervised IE
- Adaptation and transfer to new tasks
- Mining extracted data to find cross-document
regularities. - Use resulting mined knowledge to improve IE