CS 124/LINGUIST 180: From Languages to Information

1
CS 124/LINGUIST 180: From Languages to
Information
  • Dan Jurafsky
  • Lecture 14: Information Extraction and Semantic
    Relation Learning

2
Information Extraction
  • IE: extracting limited kinds of information from
    text
  • Sometimes called text analytics commercially
  • Extract entities (the people, organizations,
    locations, times, dates, genes, diseases,
    medicines, etc. in a text)
  • Extract the relations between entities
  • Figure out the larger events that are taking place

3
Information Extraction
  • Digital Libraries
  • Google Scholar, CiteSeer need to extract the
    title, author and references
  • Bioinformatics
  • Patent analysis
  • Specific market segments for stock analysis
  • SEC filings
  • Intelligence analysis

4
What is Information Extraction?
As a task: filling slots in a database from
sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates
railed against the economic philosophy of
open-source software with Orwellian fervor,
denouncing its communal licensing as a "cancer"
that stifled technological innovation.

Today, Microsoft claims to "love" the open-source
concept, by which software code is made public to
encourage improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers.

"We can be open source. We love the concept of
shared source," said Bill Veghte, a Microsoft VP.
"That's a super-important shift for us in terms
of code access."

Richard Stallman, founder of the Free Software
Foundation, countered saying

NAME              TITLE    ORGANIZATION
5
What is Information Extraction?
As a task: filling slots in a database from
sub-segments of text.
IE
NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
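
The "sub-segments of text" idea can be sketched as character spans that point back into the article. This is only an illustration: `Span` and `span_of` are hypothetical names, and the string lookup stands in for whatever a trained extractor would output.

```python
from typing import NamedTuple

class Span(NamedTuple):
    start: int  # offset of the first character
    end: int    # offset one past the last character

text = ("For years, Microsoft Corporation CEO Bill Gates railed against "
        "the economic philosophy of open-source software")

def span_of(phrase: str, source: str) -> Span:
    """Locate a phrase by exact match (a stand-in for a real extractor)."""
    start = source.index(phrase)
    return Span(start, start + len(phrase))

# One database row: each slot holds a span, not a copied string.
row = {
    "NAME": span_of("Bill Gates", text),
    "TITLE": span_of("CEO", text),
    "ORGANIZATION": span_of("Microsoft", text),
}

# Reading the slot fillers back out of the text.
filled = {slot: text[s.start:s.end] for slot, s in row.items()}
print(filled)
```

Keeping spans rather than strings lets later stages, such as relation extraction, look at the words around each filler.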
6
What is Information Extraction?
As a family of techniques:
Information Extraction = segmentation +
classification + clustering + association
named entity extraction:
Microsoft Corporation; CEO; Bill Gates; Microsoft;
Gates; Microsoft; Bill Veghte; Microsoft; VP;
Richard Stallman; founder; Free Software Foundation
10
Task I: Named Entity Tagging
11
Biomedical NER: Motivation
  • Rapid growth in biomedical information
  • MEDLINE
  • primary biomedical research database
  • over 12 million abstracts
  • 60,000 new abstracts each month
  • Many other biological databases
  • with information on genes, proteins, nucleotide
    and amino acid sequences
  • GenBank, Swiss-Prot, and FlyBase
  • Don't want to have to build these by hand

12
Named Entity Recognition
  • General NER vs. Biomedical NER
  • John Hennessy is a professor at Stanford
    University, in Palo Alto.
  • TAR independent transactivation by Tat in cells
    derived from the CNS - a novel mechanism of
    HIV-1 gene regulation.

13
Why BioMed NE tagging is hard
  • The list of biomedical entities is growing.
  • New genes and proteins are constantly being
    discovered, so explicitly enumerating and
    searching against a list of known entities is not
    scalable.
  • Part of the difficulty lies in identifying
    previously unseen entities based on contextual,
    orthographic, and other clues.
  • Biomedical entities don't have strict naming
    conventions.
  • Common English words such as "period", "curved",
    and "for" are used for gene names.
  • Entity names can be ambiguous. For example, in
    FlyBase, clk is the gene symbol for the Clock
    gene but it also is used as a synonym of the
    period gene.
  • Biomedical entity names are ambiguous
  • Experts only agree on whether a word is even a
    gene or protein 69% of the time. (Krauthammer et
    al., 2000)
  • Often systematic polysemies between gene, RNA,
    DNA, etc.

14
Maximum Entropy Markov Model
[Figure: MEMM tagging the words "of", "HIV-1",
"gene", "regulation" with the labels O and DNA]
15
Interesting Features
  • Words
  • Word shapes
  • Part-of-Speech tags
  • Parsing information
  • Searching the web for the word in a given context
  • X gene, X mutation, X antagonist
  • Gazetteer
  • lists of words whose classification is known
  • Abbreviation extraction (Schwartz and Hearst,
    2003)
  • Identify short and long forms when occurring
    together in text

Zn finger homeodomain 2 (Zfh 2)
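
The word shape feature listed above can be sketched as follows. The mapping used here (X for uppercase, x for lowercase, d for digit) is one common convention and an assumption, since the slide does not spell one out.

```python
import re

def word_shape(token: str) -> str:
    """Map each character to a class: X upper, x lower, d digit; keep the rest."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def short_shape(token: str) -> str:
    """Collapse runs of the same class so that name length does not matter."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))

for tok in ["HIV-1", "CBF-A", "Zfh", "period"]:
    print(tok, word_shape(tok), short_shape(tok))
```

A shape such as X-d or X-X is a hint of a gene or protein symbol even when the token itself has never been seen in training.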
16
Features: What's in a Name?
Cotrimoxazole
Wethersfield
Alien Fury Countdown to Invasion
17
Task II: Relation Extraction
18
Relation Extraction: Disease Outbreaks
  • Extract structured information from text

May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis

Disease Outbreaks in The New York Times

Information Extraction System (e.g., NYU's
Proteus)
19
Example: Protein Interactions
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
20
Problem: Which relations hold between 2 entities?
Cure?
Prevent?
Side Effect?
21
Different relations between Disease (Hepatitis)
and Treatment
  • Cure
  • These results suggest that con A-induced
    hepatitis was ameliorated by pretreatment with
    TJ-135.
  • Prevent
  • A two-dose combined hepatitis A and B vaccine
    would facilitate immunization programs
  • Vague
  • Effect of interferon on hepatitis B

22
Relation Extraction
  • CHICAGO (AP) -- Citing high fuel prices, United
    Airlines said Friday it has increased fares by $6
    per round trip on flights to some cities also
    served by lower-cost carriers. American Airlines,
    a unit of AMR, immediately matched the move,
    spokesman Tim Wagner said. United, a unit of UAL,
    said the increase took effect Thursday night and
    applies to most routes where it competes against
    discount carriers, such as Chicago to Dallas and
    Atlanta and Denver to San Francisco, Los Angeles
    and New York.

23
Relation Types
  • As with named entities, the list of relations is
    application specific. For generic news texts...

24
Relations
  • By relation we really mean sets of tuples.
  • Think about populating a database.
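
A relation as a set of tuples can be written down directly. The relation names below are illustrative labels chosen for this sketch, not an inventory from the lecture; the facts come from the United / American Airlines story.

```python
from collections import defaultdict

# Relation name -> set of (arg1, arg2) tuples.
relations = defaultdict(set)

def add_fact(relation, arg1, arg2):
    """Insert one tuple; repeated mentions collapse because this is a set."""
    relations[relation].add((arg1, arg2))

add_fact("part_of", "American Airlines", "AMR")
add_fact("part_of", "United", "UAL")
add_fact("spokesman_for", "Tim Wagner", "American Airlines")
add_fact("part_of", "United", "UAL")  # a second mention of the same fact

def first_args(relation, arg2):
    """All first arguments paired with arg2 under the given relation."""
    return sorted(a for a, b in relations[relation] if b == arg2)

print(len(relations["part_of"]))
print(first_args("spokesman_for", "American Airlines"))
```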

25
Relation Analysis
  • We can divide this task into two parts
  • Determining if 2 entities are related
  • And if they are, classifying the relation
  • The reason for doing this is two-fold
  • Cutting down on training time for classification
    by eliminating most pairs
  • Producing separate feature-sets that are
    appropriate for each task.

26
Relation Analysis
  • Let's just worry about named entities within the
    same sentence
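
The two-part split, restricted to entity pairs inside one sentence, might look like this. The entity type labels, the relation labels and both `toy_` functions are hypothetical stand-ins for trained classifiers.

```python
from itertools import combinations

# Entities per sentence as (text, type); the type labels are assumed.
sentences = [
    [("American Airlines", "ORG"), ("AMR", "ORG"), ("Tim Wagner", "PER")],
    [("United", "ORG"), ("UAL", "ORG")],
]

def candidate_pairs(tagged_sentences):
    """Yield every entity pair that shares a sentence, in text order."""
    for entities in tagged_sentences:
        yield from combinations(entities, 2)

def extract_relations(tagged_sentences, is_related, classify):
    """Part 1 discards unrelated pairs; part 2 labels only the survivors."""
    found = []
    for e1, e2 in candidate_pairs(tagged_sentences):
        if is_related(e1, e2):
            found.append((e1[0], classify(e1, e2), e2[0]))
    return found

# Hand-written lookup standing in for the "are they related?" classifier.
RELATED = {("American Airlines", "AMR"), ("American Airlines", "Tim Wagner"),
           ("United", "UAL")}

def toy_is_related(e1, e2):
    return (e1[0], e2[0]) in RELATED

def toy_classify(e1, e2):
    # Stand-in for the relation-type classifier.
    return "part_of" if e1[1] == e2[1] == "ORG" else "employs"

pairs = list(candidate_pairs(sentences))
triples = extract_relations(sentences, toy_is_related, toy_classify)
print(len(pairs), triples)
```

Only three of the four candidate pairs reach the second stage, which is the point of filtering first.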

27
More relations: UMLS
  • Unified Medical Language System
  • integrates linguistic, terminological and
    semantic information
  • Semantic Network consists of 134 semantic types
    and 54 relations between types

Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function
Slide from Paul Buitelaar
28
Relations in Ontologies: GO (Gene Ontology)
  • GO (Gene Ontology)
  • Aligns descriptions of gene products in different
    databases, including plant, animal and microbial
    genomes
  • Organizing principles are molecular function,
    biological process and cellular component

Accession: GO:0009292
Ontology: biological process
Synonyms (broad): genetic exchange
Definition: In the absence of a sexual life cycle,
the processes involved in the introduction of
genetic information to create a genetically
different individual.
Term Lineage:
all : all (164142)
  GO:0008150 : biological process (115947)
    GO:0007275 : development (11892)
      GO:0009292 : genetic transfer (69)
Slide from Paul Buitelaar
29
Relations in Ontologies: geographical
[Figure: geographical ontology.
Concepts: Geographical Entity (GE), Natural GE,
Inhabited GE, city, country, river, mountain.
Instances: Neckar, Zugspitze, Germany, Berlin,
Stuttgart.
Relations: is-a, instance_of, flow_through,
located_in, capital_of.
Attributes: length (km) 367, height (m) 2962.]
Design: Philipp Cimiano
Slide from Paul Buitelaar
30
Features
  • We can group the features (for both tasks) into
    three categories
  • Features of the named entities involved
  • Features derived from the words between and
    around the named entities
  • Features derived from the syntactic environment
    that governs the two entities

31
Features
  • Features of the entities
  • Their types
  • Concatenation of the types
  • Headwords of the entities
  • George Washington Bridge
  • Words in the entities
  • Features between and around
  • Particular positions to the left and right of the
    entities
  • +/- 1, 2, 3
  • Bag of words between the two entities

32
Features
  • Syntactic environment
  • Constituent path through the tree from one to the
    other
  • Base syntactic chunk sequence from one to the
    other
  • Dependency path

33
Example
  • For the following example, we're interested in
    the possible relation between American Airlines
    and Tim Wagner.
  • American Airlines, a unit of AMR, immediately
    matched the move, spokesman Tim Wagner said.
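
For this sentence, the entity and word features from the earlier slides could be computed roughly as below. The tokenization, the ORG/PER type labels and the last-token headword rule are simplifying assumptions made for the sketch.

```python
tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",",
          "immediately", "matched", "the", "move", ",", "spokesman",
          "Tim", "Wagner", "said", "."]

def pair_features(tokens, span1, type1, span2, type2):
    """Features for two mentions given as (start, end) token spans, end exclusive."""
    (s1, e1), (s2, e2) = span1, span2
    return {
        "type1": type1,
        "type2": type2,
        "types": type1 + "-" + type2,          # concatenation of the types
        "head1": tokens[e1 - 1],               # crude headword: last token
        "head2": tokens[e2 - 1],
        "before_e1": tokens[s1 - 1] if s1 > 0 else "<s>",
        "before_e2": tokens[s2 - 1],
        "after_e2": tokens[e2] if e2 < len(tokens) else "</s>",
        "between": sorted(set(tokens[e1:s2])),  # bag of words between
    }

feats = pair_features(tokens, (0, 2), "ORG", (14, 16), "PER")
for name, value in feats.items():
    print(name, value)
```

The word just before the second entity, "spokesman", is the kind of cue a relation classifier can pick up from these features.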

34
Example Extraction Patterns: Snowball [AG2000]
[Figure: two example patterns, each relating a
LOCATION and an ORGANIZATION]
35
Example Extraction Rule: NYU Proteus
36
Hyponymy Extraction
  • Let's focus a bit on extraction of the hyponymy
    relation

37
Hyponymy
  • One sense is a hyponym of another if the first
    sense is more specific, denoting a subclass of
    the other
  • car is a hyponym of vehicle
  • dog is a hyponym of animal
  • mango is a hyponym of fruit
  • Conversely
  • vehicle is a hypernym/superordinate of car
  • animal is a hypernym of dog
  • fruit is a hypernym of mango
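
Since hyponymy chains (a subclass of a subclass is still a subclass), a taxonomy can be queried by walking upward. This is a toy fragment, not WordNet: the first three links are the examples above, and the "fruit" to "food" link is an assumption added only to show the chaining.

```python
# Direct hypernym links: word -> its immediate hypernym.
HYPERNYM = {
    "car": "vehicle",
    "dog": "animal",
    "mango": "fruit",
    "fruit": "food",   # assumed extra link, to illustrate transitivity
}

def hypernym_chain(word):
    """Follow links upward and return every hypernym, nearest first."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain

def is_hyponym(specific, general):
    """True if 'general' is reachable from 'specific' by hypernym links."""
    return general in hypernym_chain(specific)

print(hypernym_chain("mango"))
print(is_hyponym("mango", "food"), is_hyponym("fruit", "mango"))
```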

38
WordNet Hierarchies
39
MeSH (Medical Subject Headings) Thesaurus
Definition
MeSH Descriptor
Synonym set
Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and
Il-Yeol Song
40
MeSH Tree
  • MeSH Ontology
  • Hierarchically arranged from most general to most
    specific.
  • Actually a graph rather than a tree
  • Descriptors normally appear in more than one
    place in the tree

MeSH Tree
Slide from Illhoi Yoo, Xiaohua (Tony) Hu, and
Il-Yeol Song
41
Detecting hyponymy and other relations
  • Could we discover new hyponyms, and add them to a
    taxonomy under the appropriate hypernym?
  • Why is this important?
  • insulin and progesterone are in WN 2.1,
  • but leptin and pregnenolone are not.
  • combustibility and navigability,
  • but not affordability, reusability, or
    extensibility.
  • HTML and SGML, but not XML or XHTML.
  • Google and Yahoo, but not Microsoft or
    IBM.
  • This unknown word problem occurs throughout NLP

42
Goal: Add hyponyms to WordNet directly from
text.
  • Intuition from Hearst (1992)
  • Agar is a substance prepared from a mixture of
    red algae, such as Gelidium, for laboratory or
    industrial use
  • What does Gelidium mean?
  • How do you know?

44
Hearst's Hand-Designed Lexico-Syntactic Patterns
(Hearst, 1992) Automatic Acquisition of
Hyponyms
  • Y such as X ((, X)* (, and/or) X)
  • such Y as X
  • X or other Y
  • X and other Y
  • Y including X
  • Y, especially X
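
Some of these patterns can be approximated with regular expressions. This is a rough sketch rather than Hearst's implementation: it treats a noun phrase as a single word, captures only the X next to the cue phrase, and leaves out the "such Y as X" pattern.

```python
import re

# (regex, group number of hypernym Y, group number of hyponym X)
PATTERNS = [
    (re.compile(r"(\w+),? such as (\w+)"), 1, 2),
    (re.compile(r"(\w+),? (?:and|or) other (\w+)"), 2, 1),
    (re.compile(r"(\w+),? including (\w+)"), 1, 2),
    (re.compile(r"(\w+), especially (\w+)"), 1, 2),
]

def hearst_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by any pattern."""
    pairs = []
    for regex, y_group, x_group in PATTERNS:
        for m in regex.finditer(sentence):
            pairs.append((m.group(x_group), m.group(y_group)))
    return pairs

agar = ("Agar is a substance prepared from a mixture of red algae, "
        "such as Gelidium, for laboratory or industrial use")
print(hearst_pairs(agar))
print(hearst_pairs("bruises, wounds, or other injuries"))
```

The first call proposes that Gelidium is a kind of algae, without knowing anything else about the word. Working over parsed noun phrases instead of single words would avoid picking up a modifier as the head.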
45
Idea: a classifier with patterns as features
  • Example: "doubly heavy hydrogen atom called
    deuterium" gives the pair (atom, deuterium),
    labeled YES
  • Take corpus sentences
  • Collect noun pairs
  • 752,311 pairs from 6M words of newswire
  • Is pair an IS-A?
  • 14,387 yes, 737,924 no
  • Parse the sentences
  • Extract patterns
  • 69,592 dependency paths (each seen with at least
    5 pairs)
  • Train classifier on these patterns
  • Logistic regression with 70K features (actually
    converted to 974,288 bucketed binary features)
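
A miniature version of this recipe, under heavy assumptions: the six training pairs and the pattern strings are invented for illustration (the real features were dependency paths), and the model is a small hand-rolled logistic regression trained by stochastic gradient ascent.

```python
import math

# (patterns observed between a noun pair, 1 if the pair is IS-A else 0)
TRAIN = [
    ({"Y called X"}, 1),
    ({"Y such as X"}, 1),
    ({"Y called X", "X and other Y"}, 1),
    ({"X and Y"}, 0),
    ({"X of Y"}, 0),
    ({"X and Y", "X of Y"}, 0),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, feats):
    """Probability that a pair with these binary pattern features is IS-A."""
    return sigmoid(bias + sum(weights.get(f, 0.0) for f in feats))

def train(data, epochs=200, lr=0.5):
    """Each update moves the weights of the active features toward the label."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in data:
            error = label - predict(weights, bias, feats)
            bias += lr * error
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * error
    return weights, bias

weights, bias = train(TRAIN)
# A new pair seen only with the "called" pattern, e.g. (efflorescence, condition)
print(round(predict(weights, bias, {"Y called X"}), 3))
print(round(predict(weights, bias, {"X and Y"}), 3))
```

Each pattern becomes one binary feature with its own learned weight, so a reliable pattern like "Y called X" ends up with a positive weight without anyone writing it by hand.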
46
One of 70,000 patterns
Pattern: "called"
Learned from cases such as
  • sarcoma / cancer: "...an uncommon bone cancer
    called osteogenic sarcoma and to..."
  • deuterium / atom: "...heavy water rich in the
    doubly heavy hydrogen atom called deuterium."
New pairs discovered
  • efflorescence / condition: "...and a condition
    called efflorescence are other reasons for..."
  • neal_inc / company: "The company, now called
    O'Neal Inc., was sole distributor of E-Ferol..."
  • hat_creek_outfit / ranch: "...run a small ranch
    called the Hat Creek Outfit."
  • hiv-1 / aids_virus: "...infected by the AIDS
    virus, called HIV-1."
  • bateau_mouche / attraction: "...local
    sightseeing attraction called the Bateau
    Mouche..."
  • kibbutz_malkiyya / collective_farm: "...an
    Israeli collective farm called Kibbutz Malkiyya"
47
Bootstrapping Approaches
  • What if you don't have enough annotated text to
    train on?
  • But you might have some seed tuples
  • Or you might have some patterns that work pretty
    well
  • Can you use those seeds to do something useful?
  • Co-training and active learning use the seeds to
    train classifiers to tag more data to train
    better classifiers...
  • Bootstrapping tries to learn directly (populate a
    relation) through direct use of the seeds

48
Bootstrapping Example: Seed Tuple
  • Seed tuple: <Mark Twain, Elmira>
  • Grep (Google)
  • "Mark Twain is buried in Elmira, NY."
  • X is buried in Y
  • "The grave of Mark Twain is in Elmira"
  • The grave of X is in Y
  • "Elmira is Mark Twain's final resting place"
  • Y is X's final resting place.
  • Use those patterns to grep for new tuples that
    you don't already know
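
One round of this loop, as a sketch. The three Mark Twain sentences are the ones above; the other corpus sentences are made up for the example, and `NAME` is a crude capitalized-word regex standing in for a named entity tagger.

```python
import re

NAME = r"[A-Z]\w+(?: [A-Z]\w+)*"  # one or more capitalized words

def induce_pattern(sentence, x, y):
    """Replace the seed with placeholders, e.g. '{X} is buried in {Y}'."""
    templ = sentence.replace(x, "{X}").replace(y, "{Y}")
    end = max(templ.index("{X}"), templ.index("{Y}")) + 3
    cut = re.search(r"[,.;]", templ[end:])  # drop context after punctuation
    return templ[: end + cut.start()] if cut else templ

def compile_pattern(templ):
    """Turn a template into a regex with named groups X and Y."""
    parts = re.split(r"(\{X\}|\{Y\})", templ)
    regex = "".join(
        f"(?P<{p[1]}>{NAME})" if p in ("{X}", "{Y}") else re.escape(p)
        for p in parts
    )
    return re.compile(regex)

def bootstrap(seed, corpus):
    """Seed -> patterns -> new tuples (a single iteration)."""
    x, y = seed
    patterns = {induce_pattern(s, x, y) for s in corpus if x in s and y in s}
    found = set()
    for templ in patterns:
        regex = compile_pattern(templ)
        for s in corpus:
            m = regex.search(s)
            if m:
                found.add((m.group("X"), m.group("Y")))
    return patterns, found - {seed}

corpus = [
    "Mark Twain is buried in Elmira, NY.",
    "The grave of Mark Twain is in Elmira",
    "Elmira is Mark Twain's final resting place",
    "Emily Dickinson is buried in Amherst, MA.",   # made-up corpus sentence
    "The grave of Oscar Wilde is in Paris",        # made-up corpus sentence
]
patterns, new_tuples = bootstrap(("Mark Twain", "Elmira"), corpus)
print(sorted(patterns))
print(sorted(new_tuples))
```

Feeding the new tuples back in as seeds gives the next round; in practice patterns and tuples would need confidence scores to keep errors from compounding.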

49
Bootstrapping Relations
50
Task 3: Coreference and Deduplication
51
Coreference
  • Victoria Chen, Chief Financial Officer of
    Megabucks Banking Corp since 2004, saw her pay
    jump 20%, to $1.3 million, as the 37-year-old
    also became the Denver-based financial-service
    company's president. It has been ten years since
    she came to Megabucks from rival Lotsabucks.
  • The Tin Woodman went to the Emerald City to see
    the Wizard of Oz and ask for a heart. After he
    asked for it, the Woodman waited for the Wizard's
    response.

52
Reference resolution in IE
  • First Union Corp. is continuing to wrestle with
    severe problems unleashed by a botched merger and
    a troubled business strategy. According to
    industry insiders at Paine Webber, their
    president, John R. Georgius, is planning to
    retire by the end of the year.

?
53
Extracted Entities: Resolving Duplicates

Document 1: The Justice Department has officially
ended its inquiry into the assassinations of John
F. Kennedy and Martin Luther King Jr., finding
"no persuasive evidence" to support conspiracy
theories, according to department documents. The
House Assassinations Committee concluded in 1978
that Kennedy was "probably" assassinated as the
result of a conspiracy involving a second gunman,
a finding that broke from the Warren Commission's
belief that Lee Harvey Oswald acted alone in
Dallas on Nov. 22, 1963.

Document 2: In 1953, Massachusetts Sen. John F.
Kennedy married Jacqueline Lee Bouvier in Newport,
R.I. In 1960, Democratic presidential candidate
John F. Kennedy confronted the issue of his Roman
Catholic faith by telling a Protestant group in
Houston, "I do not speak for my church on public
matters, and the church does not speak for me."

Document 3: David Kennedy was born in Leicester,
England in 1959. Kennedy co-edited The New Poetry
(Bloodaxe Books 1993), and is the author of New
Relations: The Refashioning Of British Poetry
1980-1994 (Seren 1996).
From Li, Morie, Roth, AI Magazine, 2005
54
Part IV: Template-based Extraction
55
Template Filling
  • For stories/texts with stereotypical sequences of
    events, participants, props etc.
  • Represent these facts as templates of slots and
    slot-fillers (frames, scripts, schemas)
  • Evoke the right template
  • Identify the story elements that fill each slot

56
Airline Example
57
Template-Filling
  • Two approaches
  • Rules and cascades of rules
  • Supervised ML as Sequence Labeling
  • Two approaches
  • One sequence classifier per slot
  • One big sequence classifier
  • Each is trained using IOB labels from annotated
    data.
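
Once a sequence classifier has produced IOB labels, turning them into slot fillers is mechanical. In this sketch the slot names LEAD_AIRLINE and AMOUNT are assumptions, and the tags are written by hand rather than predicted.

```python
def iob_to_slots(tokens, tags):
    """Group B-/I- runs into {slot name: [filler strings]}."""
    slots, current, label = {}, [], None

    def flush():
        # Store the finished filler, if there is one.
        if current:
            slots.setdefault(label, []).append(" ".join(current))

    for tok, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            flush()
            current, label = [tok], tag[2:]
        elif tag.startswith("I-"):
            current.append(tok)
        else:  # "O": outside any slot
            flush()
            current, label = [], None
    flush()
    return slots

tokens = ["United", "Airlines", "said", "Friday", "it", "has",
          "increased", "fares", "by", "$6"]
tags = ["B-LEAD_AIRLINE", "I-LEAD_AIRLINE", "O", "O", "O", "O",
        "O", "O", "O", "B-AMOUNT"]
print(iob_to_slots(tokens, tags))
```

With one big classifier, all slot names share one tag set as here; with one classifier per slot, each would emit just B, I and O for its own slot.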

58
IE Summary
  • Named entity recognition and classification
  • Coreference analysis
  • Temporal and numerical expression analysis
  • Event detection and classification
  • Relation extraction
  • Template analysis
  • Related tasks
  • Language identification
  • Age, nationality and gender classification

59
IE Summary
  • Note that for each task...
  • We found a way to cast it as a problem in
    classification
  • Extracted features
  • Trained
  • So it's classification all the way down. Even if
    we're dealing with a sequence.