Title: CS 124/LINGUIST 180: From Languages to Information
1CS 124/LINGUIST 180 From Languages to
Information
- Dan Jurafsky
- Lecture 14 Information Extraction and Semantic
Relation learning
2Information Extraction
- IE: extract limited kinds of information from text
- Sometimes called text analytics commercially
- Extract entities (the people, organizations, locations, times, dates, genes, diseases, medicines, etc. in a text)
- Extract the relations between entities
- Figure out the larger events that are taking place
3Information Extraction
- Digital Libraries
- Google Scholar, CiteSeer need to extract the title, author and references
- Bioinformatics
- Patent analysis
- Specific market segments for stock analysis
- SEC filings
- Intelligence analysis
4What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
5What is Information Extraction
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
6What is Information Extraction
As a family of techniques
Information Extraction = segmentation + classification + clustering + association
October 14, 2002, 4:00 a.m. PT For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
Extracted segments (named entity extraction):
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
10Task I Named Entity Tagging
11Biomedical NER Motivation
- Rapid growth in biomedical information
- MEDLINE
- primary biomedical research database,
- Over 12 million abstracts,
- 60,000 new abstracts each month.
- Many other biological databases
- w/info on genes, proteins, nucleotide and amino acid sequences,
- GenBank, Swiss-Prot, and FlyBase
- Don't want to have to build these by hand
12Named Entity Recognition
- General NER vs. Biomedical NER
- John Hennessy is a professor at Stanford University, in Palo Alto.
- TAR independent transactivation by Tat in cells derived from the CNS: a novel mechanism of HIV-1 gene regulation.
13Why BioMed NE tagging is hard
- The list of biomedical entities is growing.
- New genes and proteins are constantly being discovered, so explicitly enumerating and searching against a list of known entities is not scalable.
- Part of the difficulty lies in identifying previously unseen entities based on contextual, orthographic, and other clues.
- Biomedical entities don't have strict naming conventions.
- Common English words such as "period", "curved", and "for" are used for gene names.
- Entity names can be ambiguous. For example, in FlyBase, "clk" is the gene symbol for the "Clock" gene but it also is used as a synonym of the "period" gene.
- Biomedical entity names are ambiguous
- Experts only agree on whether a word is even a gene or protein 69% of the time. (Krauthammer et al., 2000)
- Often systematic polysemies between gene, RNA, DNA, etc.
14Maximum Entropy Markov Model
[Figure: MEMM tagging the tokens "of HIV-1 gene regulation"; each token receives a DNA or O label]
15Interesting Features
- Words
- Word shapes
- Part-of-Speech tags
- Parsing information
- Searching the web for the word in a given context
- X gene, X mutation, X antagonist
- Gazetteer
- list words whose classification is known
- Abbreviation extraction (Schwartz and Hearst,
2003) - Identify short and long forms when occurring
together in text
Zn finger homeodomain 2 (Zfh 2)
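The word-shape feature listed above can be sketched in a few lines. This is a minimal illustration, not the feature set of any particular tagger; the function name `word_shape` and the X/x/d symbol scheme are assumptions chosen for the example.

```python
import re

def word_shape(token: str, collapse: bool = False) -> str:
    """Map uppercase letters to X, lowercase to x, digits to d; keep other characters."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # hyphens, spaces, parentheses stay as they are
    result = "".join(shape)
    if collapse:
        # Collapse runs of the same symbol, e.g. "Xxxxx" becomes "Xx"
        result = re.sub(r"(.)\1+", r"\1", result)
    return result

print(word_shape("HIV-1"))                         # XXX-d
print(word_shape("Zfh 2"))                         # Xxx d
print(word_shape("Cotrimoxazole", collapse=True))  # Xx
```

Shapes let a classifier generalize to unseen names: a token it has never seen can still share a shape with known gene or drug names.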
16Features What's in a Name?
Cotrimoxazole
Wethersfield
Alien Fury Countdown to Invasion
17Task II Relation Extraction
18Relation Extraction Disease Outbreaks
- Extract structured information from text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis
Disease Outbreaks in The New York Times
Information Extraction System (e.g., NYU's Proteus)
19Example Protein Interactions
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
20Problem Which relations hold between 2 entities?
Cure?
Prevent?
Side Effect?
21Different relations between Disease (Hepatitis)
and Treatment
- Cure
- These results suggest that con A-induced hepatitis was ameliorated by pretreatment with TJ-135.
- Prevent
- A two-dose combined hepatitis A and B vaccine would facilitate immunization programs
- Vague
- Effect of interferon on hepatitis B
22Relation Extraction
- CHICAGO (AP) Citing high fuel prices, United
Airlines said Friday it has increased fares by $6
per round trip on flights to some cities also
served by lower-cost carriers. American Airlines,
a unit of AMR, immediately matched the move,
spokesman Tim Wagner said. United, a unit of UAL,
said the increase took effect Thursday night and
applies to most routes where it competes against
discount carriers, such as Chicago to Dallas and
Atlanta and Denver to San Francisco, Los Angeles
and New York
23Relation Types
- As with named entities, the list of relations is
application specific. For generic news texts...
24Relations
- By relation we really mean sets of tuples.
- Think about populating a database.
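A minimal sketch of "relations as sets of tuples", using entities from the airline news example. The relation names `unit_of` and `spokesman_for` and the helper `add_tuple` are illustrative assumptions, not a fixed inventory from the lecture.

```python
from collections import defaultdict

# relation name -> set of (arg1, arg2) tuples; the names are illustrative only
relations = defaultdict(set)

def add_tuple(relation, arg1, arg2):
    """Record one extracted tuple; the set silently drops repeated extractions."""
    relations[relation].add((arg1, arg2))

add_tuple("unit_of", "United Airlines", "UAL")
add_tuple("unit_of", "American Airlines", "AMR")
add_tuple("spokesman_for", "Tim Wagner", "American Airlines")
add_tuple("unit_of", "United Airlines", "UAL")  # same fact extracted twice

for name in sorted(relations):
    for row in sorted(relations[name]):
        print(name, row)
```

Each relation behaves like a database table whose rows are the tuples, so extracting the same fact from several sentences adds nothing new.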
25Relation Analysis
- We can divide this task into two parts
- Determining if 2 entities are related
- And if they are, classifying the relation
- The reason for doing this is two-fold
- Cutting down on training time for classification by eliminating most pairs
- Producing separate feature-sets that are appropriate for each task.
26Relation Analysis
- Let's just worry about named entities within the
same sentence
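Restricting attention to entities in the same sentence makes candidate generation simple. The sketch below assumes entity mentions have already been tagged and assigned a sentence index; the `mentions` data and the function name `candidate_pairs` are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Each mention: (sentence index, entity text, entity type). Toy annotations.
mentions = [
    (0, "American Airlines", "ORG"),
    (0, "AMR", "ORG"),
    (0, "Tim Wagner", "PER"),
    (1, "United", "ORG"),
    (1, "UAL", "ORG"),
]

def candidate_pairs(mentions):
    """Return every unordered pair of entities that share a sentence."""
    by_sentence = defaultdict(list)
    for sent_id, text, etype in mentions:
        by_sentence[sent_id].append((text, etype))
    pairs = []
    for sent_id in sorted(by_sentence):
        pairs.extend(combinations(by_sentence[sent_id], 2))
    return pairs

pairs = candidate_pairs(mentions)
for left, right in pairs:
    print(left, right)
```

These pairs would then go to the first classifier (related or not), and only the pairs judged related go on to relation-type classification.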
27More relations UMLS
- Unified Medical Language System
- integrates linguistic, terminological and semantic information
- Semantic Network consists of 134 semantic types and 54 relations between types
Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
Slide from Paul Buitelaar
28Relations in Ontologies GO (Gene Ontology)
- GO (Gene Ontology)
- Aligns descriptions of gene products in different databases, including plant, animal and microbial genomes
- Organizing principles are molecular function, biological process and cellular component
Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage:
all : all (164142)
GO:0008150 : biological process (115947)
GO:0007275 : development (11892)
GO:0009292 : genetic transfer (69)
Slide from Paul Buitelaar
29Relations in Ontologies geographical
[Figure: geographical ontology. Concepts: Geographical Entity (GE), Natural GE, Inhabited GE, city, country, river, mountain. Instances: Neckar, Zugspitze, Germany, Berlin, Stuttgart. Relations: is-a, instance_of, located_in, flow_through, capital_of. Attributes: length (km) 367, height (m) 2962.]
Design Philipp Cimiano
Slide from Paul Buitelaar
30Features
- We can group the features (for both tasks) into three categories
- Features of the named entities involved
- Features derived from the words between and around the named entities
- Features derived from the syntactic environment that governs the two entities
31Features
- Features of the entities
- Their types
- Concatenation of the types
- Headwords of the entities
- George Washington Bridge
- Words in the entities
- Features between and around
- Particular positions to the left and right of the entities
- +/- 1, 2, 3
- Bag of words between the two entities
32Features
- Syntactic environment
- Constituent path through the tree from one to the other
- Base syntactic chunk sequence from one to the other
- Dependency path
33Example
- For the following example, we're interested in the possible relation between American Airlines and Tim Wagner.
- American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
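For the American Airlines / Tim Wagner pair, the entity features and the between-and-around features can be sketched as follows. The tokenization, the entity spans, the feature names, and the "head = last token" heuristic are all assumptions for illustration; the syntactic features (constituent, chunk, and dependency paths) would need a parser and are left out.

```python
def pair_features(tokens, e1, e2):
    """e1 and e2 are (start, end, type) with end exclusive; e1 precedes e2."""
    s1, end1, type1 = e1
    s2, end2, type2 = e2
    between = tokens[end1:s2]
    return {
        "type1": type1,
        "type2": type2,
        "types": f"{type1}-{type2}",   # concatenation of the types
        "head1": tokens[end1 - 1],     # naive headword: last token of the entity
        "head2": tokens[end2 - 1],
        "words1": tokens[s1:end1],     # words in the entities
        "words2": tokens[s2:end2],
        "left1": tokens[s1 - 1] if s1 > 0 else "<s>",             # position -1
        "right2": tokens[end2] if end2 < len(tokens) else "</s>",  # position +1
        # bag of words between the two entities, punctuation dropped
        "between": sorted({w.lower() for w in between if w.isalpha()}),
    }

tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
feats = pair_features(tokens, (0, 2, "ORG"), (14, 16, "PER"))
for name, value in feats.items():
    print(name, value)
```

The last-token heuristic picks "Bridge" as the head of "George Washington Bridge", which matches the headword example on the earlier slide.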
34Example Extraction Patterns Snowball AG2000
[Figure: Snowball extraction patterns pairing ORGANIZATION and LOCATION slots]
35Example Extraction Rule NYU Proteus
36Hyponymy Extraction
- Let's focus a bit on extraction of the hyponymy
relation
37Hyponymy
- One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other
- car is a hyponym of vehicle
- dog is a hyponym of animal
- mango is a hyponym of fruit
- Conversely
- vehicle is a hypernym/superordinate of car
- animal is a hypernym of dog
- fruit is a hypernym of mango
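A tiny sketch of how hyponym/hypernym pairs can be stored and queried. The dictionary is a toy taxonomy: the first three entries come from the examples above, while the "poodle" entry and the function names are assumptions added only to show a chain of more than one step.

```python
# word -> its immediate hypernym (toy taxonomy, one parent per word)
HYPERNYM = {
    "car": "vehicle",
    "dog": "animal",
    "mango": "fruit",
    "poodle": "dog",  # assumed extra entry, not from the slide
}

def hypernyms(word, taxonomy=HYPERNYM):
    """Return the chain of hypernyms above `word`, nearest first."""
    chain, seen = [], {word}
    while word in taxonomy and taxonomy[word] not in seen:
        word = taxonomy[word]
        chain.append(word)
        seen.add(word)  # guards against accidental cycles in the data
    return chain

def is_hyponym(word, candidate, taxonomy=HYPERNYM):
    """True if `candidate` appears anywhere above `word` in the taxonomy."""
    return candidate in hypernyms(word, taxonomy)

print(hypernyms("poodle"))           # ['dog', 'animal']
print(is_hyponym("car", "vehicle"))  # True
print(is_hyponym("vehicle", "car"))  # False
```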
38WordNet Hierarchies
39MeSH (Medical Subject Headings) Thesaurus
Definition
MeSH Descriptor
Synonym set
Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and
Il-Yeol Song
40MeSH Tree
- MeSH Ontology
- Hierarchically arranged from most general to most specific.
- Actually a graph rather than a tree
- normally appear in more than one place in the tree
MeSH Tree
Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and
Il-Yeol Song
41Detecting hyponymy and other relations
- Could we discover new hyponyms, and add them to a taxonomy under the appropriate hypernym?
- Why is this important?
- insulin and progesterone are in WN 2.1,
- but leptin and pregnenolone are not.
- combustibility and navigability,
- but not affordability, reusability, or extensibility.
- HTML and SGML, but not XML or XHTML.
- Google and Yahoo, but not Microsoft or IBM.
- This unknown word problem occurs throughout NLP
42Goal Add hyponyms to WordNet directly from
text.
- Intuition from Hearst (1992)
- Agar is a substance prepared from a mixture of red algae, such as Gelidium, for laboratory or industrial use
- What does Gelidium mean?
- How do you know?
44Hearst's Hand-Designed Lexico-Syntactic Patterns
(Hearst, 1992) Automatic Acquisition of Hyponyms
Y such as X ((, X)* (, and/or) X)
such Y as X
X or other Y
X and other Y
Y including X
Y, especially X
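As a rough sketch, the first pattern ("Y such as X ...") can be approximated with a regular expression. This simplification assumes that X and Y are single words, so it recovers only the last word of a multiword noun phrase such as "red algae"; a real system would match over noun-phrase chunks. The names `SUCH_AS` and `hearst_such_as` and the second test sentence are made up for the example.

```python
import re

# Y such as X ((, X)* (, and/or) X), with single-word X and Y (an assumption)
SUCH_AS = re.compile(r"(\w+),? such as (\w+(?:(?:, \w+)*,? (?:and|or) \w+)?)")

def hearst_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        # Split the matched list "A, B, and C" into its items
        hyponyms = re.split(r",? (?:and|or) |, ", match.group(2))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms)
    return pairs

agar = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use")
print(hearst_such_as(agar))
print(hearst_such_as("countries such as France, England, and Spain"))
```

On the Agar sentence this yields the pair (Gelidium, algae), which is the inference a reader makes without knowing what Gelidium is.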
45Idea a classifier with patterns as features
- doubly heavy hydrogen atom called deuterium
- Take corpus sentences
- Collect noun pairs
- 752,311 pairs from 6M words of newswire
- Is pair an IS-A?
- 14,387 yes, 737,924 no
- Parse the sentences
- Extract patterns
- 69,592 dependency paths (>5 pairs)
- Train classifier on these patterns
- Logistic regression with 70K features (actually converted to 974,288 bucketed binary features)
46One of 70,000 patterns
Pattern: called
Learned from cases such as:
sarcoma / cancer: an uncommon bone cancer called osteogenic sarcoma and to
deuterium / atom: .heavy water rich in the doubly heavy hydrogen atom called deuterium.
New pairs discovered:
efflorescence / condition: and a condition called efflorescence are other reasons for
neal_inc / company: The company, now called O'Neal Inc., was sole distributor of E-Ferol
hat_creek_outfit / ranch: run a small ranch called the Hat Creek Outfit.
hiv-1 / aids_virus: infected by the AIDS virus, called HIV-1.
bateau_mouche / attraction: local sightseeing attraction called the Bateau Mouche...
kibbutz_malkiyya / collective_farm: an Israeli collective farm called Kibbutz Malkiyya
47Bootstrapping Approaches
- What if you don't have enough annotated text to train on?
- But you might have some seed tuples
- Or you might have some patterns that work pretty well
- Can you use those seeds to do something useful?
- Co-training and active learning use the seeds to train classifiers to tag more data to train better classifiers...
- Bootstrapping tries to learn directly (populate a relation) through direct use of the seeds
48Bootstrapping Example Seed Tuple
- Seed tuple: (Mark Twain, Elmira)
- Grep (google)
- Mark Twain is buried in Elmira, NY.
- X is buried in Y
- The grave of Mark Twain is in Elmira
- The grave of X is in Y
- Elmira is Mark Twain's final resting place
- Y is X's final resting place.
- Use those patterns to grep for new tuples that you don't already know
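One round of this bootstrapping loop can be sketched as below. The corpus is toy data: the three Mark Twain sentences come from the slide, and the other three sentences were invented so that the induced patterns have something new to find. The crude `NAME` matcher (capitalized word sequences) and all function names are assumptions; a real system would also score patterns and tuples to limit semantic drift.

```python
import re

NAME = r"(?:[A-Z][a-z]+(?: [A-Z][a-z]+)*)"  # crude proper-name matcher (assumption)

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain's final resting place",
    "Emily Dickinson is buried in Amherst, MA.",           # invented test sentence
    "The grave of Walt Whitman is in Camden",              # invented test sentence
    "Concord is Henry David Thoreau's final resting place",  # invented test sentence
]
seeds = {("Mark Twain", "Elmira")}

def make_pattern(sentence, x, y):
    """Replace the seed with placeholders; cut at punctuation after the last one."""
    templ = sentence.replace(x, "{X}").replace(y, "{Y}")
    last = max(templ.rfind("{X}"), templ.rfind("{Y}")) + 3
    tail = re.split(r"[,.;]", templ[last:], maxsplit=1)[0]
    return (templ[:last] + tail).strip()

def to_regex(pattern):
    """Turn 'X is buried in Y' style templates into a regex with named groups."""
    rx = re.escape(pattern)
    rx = rx.replace(re.escape("{X}"), "(?P<X>" + NAME + ")")
    rx = rx.replace(re.escape("{Y}"), "(?P<Y>" + NAME + ")")
    return re.compile(rx)

# Step 1: grep for sentences containing a seed and generalize them into patterns
patterns = {make_pattern(s, x, y) for x, y in seeds for s in corpus if x in s and y in s}

# Step 2: grep with the patterns to find tuples we do not already know
found = set()
for pattern in patterns:
    for sentence in corpus:
        for m in to_regex(pattern).finditer(sentence):
            found.add((m.group("X"), m.group("Y")))
new_tuples = found - seeds

print(sorted(patterns))
print(sorted(new_tuples))
```

The new tuples could then be added to the seed set and the two steps repeated.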
49Bootstrapping Relations
50Task 3 Coreference and Deduplication
51Coreference
- Victoria Chen, Chief Financial Officer of Megabucks Banking Corp since 2004, saw her pay jump 20%, to $1.3 million, as the 37-year-old also became the Denver-based financial-service company's president. It has been ten years since she came to Megabucks from rival Lotsabucks
- The Tin Woodman went to the Emerald City to see the Wizard of Oz and ask for a heart. After he asked for it, the Woodman waited for the Wizard's response
52Reference resolution in IE
- First Union Corp. is continuing to wrestle with
severe problems unleashed by a botched merger and
a troubled business strategy. According to
industry insiders at Paine Webber, their
president, John R. Georgius, is planning to
retire by the end of the year.
?
53Extracted Entities Resolving Duplicates
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
From Li, Morie, Roth, AI Magazine, 2005
54Part IV Template-based Extraction
55Template Filling
- For stories/texts with stereotypical sequences of events, participants, props etc.
- Represent these facts as slots and slot-fillers: templates (frames, scripts, schemas)
- Evoke the right template
- Identify the story elements that fill each slot
56Airline Example
57Template-Filling
- Two approaches
- Rules and cascades of rules
- Supervised ML as Sequence Labeling
- Two approaches
- One sequence classifier per slot
- One big sequence classifier
- Each is trained using IOB labels from annotated data.
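To make the IOB encoding concrete, here is a sketch of decoding a labeled token sequence back into slot fillers. The slot names `LEAD_AIRLINE` and `AMOUNT`, the tokens, and the labels are illustrative assumptions; in practice the labels would come from a trained sequence classifier.

```python
def iob_to_slots(tokens, labels):
    """Group tokens into slot fillers: B-SLOT opens a span, I-SLOT continues it."""
    slots = {}
    inside = None  # slot name of the span currently open, if any
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            inside = label[2:]
            slots.setdefault(inside, []).append([token])
        elif label.startswith("I-") and inside == label[2:]:
            slots[inside][-1].append(token)
        else:
            inside = None  # "O", or an I- tag that does not continue the open span
    return {slot: [" ".join(span) for span in spans] for slot, spans in slots.items()}

tokens = ["United", "Airlines", "said", "Friday", "it", "has",
          "increased", "fares", "by", "$6"]
labels = ["B-LEAD_AIRLINE", "I-LEAD_AIRLINE", "O", "O", "O", "O",
          "O", "O", "O", "B-AMOUNT"]
print(iob_to_slots(tokens, labels))
```

With one classifier per slot, each classifier would emit only B/I/O for its own slot; with one big classifier, the label set is the union shown here.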
58IE Summary
- Named entity recognition and classification
- Coreference analysis
- Temporal and numerical expression analysis
- Event detection and classification
- Relation extraction
- Template analysis
- Related tasks
- Language identification
- Age, nationality and gender classification
59IE Summary
- Note that for each task...
- We found a way to cast it as a problem in classification
- Extracted features
- Trained
- So it's classification all the way down. Even if we're dealing with a sequence.