Title: Information Extraction and Integration: An Overview
Slide 1: Information Extraction and Integration: An Overview
- William W. Cohen
- Carnegie Mellon University
- Jan 12, 2004
Slide 2: Administrivia
- Course web page
- www.cs.cmu.edu/wcohen/10-707/ (or Google me)
- No class 1/28 or 2/2.
- Unless I get a volunteer?
- Classwork
- Write ½ page on each paper being discussed.
- Starting next week.
- Present 1 or 2 optional papers.
- Do a course project.
Slide 3: Today's Lecture
- Overview of the task of information extraction.
- Overview of some methods for named entity extraction:
  - Sliding windows, boundary-finding reduce NE to classification.
  - HMM, CMM, CRF reduce NE to sequential classification (an independently interesting problem).
- Overview of some methods for associating (grouping, clustering, querying, using, ...) extracted data.
Slide 4: Example: The Problem
- Martin Baker, a person
- Genomics job
- Employer's job posting form
Slide 5: Example: A Solution
Slide 6: Extracting Job Openings from the Web
Slide 7: Job Openings: Category = Food Services; Keyword = Baker; Location = Continental U.S.
Slide 8: Data Mining the Extracted Job Information
Slide 9: IE from Research Papers
Slide 10: What is Information Extraction?
As a task: filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying
NAME TITLE ORGANIZATION
Slide 11: What is Information Extraction?
As a task: filling slots in a database from sub-segments of text.
(The same passage as Slide 10, now run through IE:)

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..
Slide 12: What is Information Extraction?
As a family of techniques: Information Extraction = segmentation + classification + association + clustering
(Applied to the same passage as Slide 10, segmentation marks up the entity mentions in order:)
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
(aka named entity extraction)
Slides 13-15: What is Information Extraction?
(These slides repeat Slide 12's passage and mention list, stepping through the remaining members of the family: classification, association, and clustering.)
Slide 16: IE in Context
Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mine
Supporting steps: Create ontology; Label training data; Train extraction models
Slide 17: Tutorial Outline
- IE History
- Landscape of problems and solutions
- Parade of models for segmenting/classifying
- Sliding window
- Boundary finding
- Finite state machines
- Trees
- Overview of related problems and solutions
- Association, Clustering
- Integration with Data Mining
- Where to go from here
Slide 18: IE History
- Pre-Web
  - Mostly news articles
    - De Jong's FRUMP [1982]: hand-built system to fill Schank-style scripts from news wire
    - Message Understanding Conference (MUC): DARPA '87-'95, TIPSTER '92-'96
  - Early work dominated by hand-built models
    - E.g. SRI's FASTUS, hand-built FSMs.
    - But by 1990s, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan & Leek '97, BBN [Bikel et al '98]
- Web
  - AAAI '94 Spring Symposium on Software Agents
    - Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni.
  - Tom Mitchell's WebKB, '96
    - Build KBs from the Web.
  - Wrapper Induction
    - Initially hand-built, then ML: [Soderland '96], [Kushmerick '97], ...
    - CiteSeer; Cora; FlipDog; contEd courses, corpInfo, ...
Slide 19: IE History
- Biology
  - Gene/protein entity extraction
  - Protein/protein interaction facts
  - Automated curation/integration of databases
    - At CMU: SLIF (Murphy et al; subcellular information from images & text in journal articles)
- Email
  - EPCA, PAL, RADAR, CALO: intelligent office assistant that understands some part of email
    - At CMU: web site update requests, office-space requests & calendar scheduling requests, social network analysis of email.
Slide 20: IE is different in different domains!
Example: on the web there is less grammar, but more formatting & linking.
Newswire
Web
www.apple.com/retail
Apple to Open Its First Retail Store in New York
City MACWORLD EXPO, NEW YORK--July 17,
2002--Apple's first retail store in New York City
will open in Manhattan's SoHo district on
Thursday, July 18 at 8:00 a.m. EDT. The SoHo
store will be Apple's largest retail store to
date and is a stunning example of Apple's
commitment to offering customers the world's best
computer shopping experience. "Fourteen months
after opening our first retail store, our 31
stores are attracting over 100,000 visitors each
week," said Steve Jobs, Apple's CEO. "We hope our
SoHo store will surprise and delight both Mac and
PC users who want to see everything the Mac can
do to enhance their digital lifestyles."
www.apple.com/retail/soho
www.apple.com/retail/soho/theatre.html
The directory structure, link structure, formatting & layout of the Web is its own new grammar.
Slide 21: Landscape of IE Tasks (1/4): Degree of Formatting
- Text paragraphs without formatting:
  "Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR."
- Grammatical sentences and some formatting & links
- Non-grammatical snippets, rich formatting & links
- Tables
Slide 22: Landscape of IE Tasks (2/4): Intended Breadth of Coverage
Web site specific
Genre specific
Wide, non-specific
Formatting
Layout
Language
Amazon.com Book Pages
Resumes
University Names
Slide 23: Landscape of IE Tasks (3/4): Complexity (e.g. word patterns)
- Closed set (e.g. U.S. states): "He was born in Alabama"; "The big Wyoming sky"
- Regular set (e.g. U.S. phone numbers): "Phone: (413) 545-1323"; "The CALD main office can be reached at 412-268-1299"
- Complex pattern (e.g. U.S. postal addresses): "University of Arkansas P.O. Box 140 Hope, AR 71802"; "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"
- Ambiguous patterns, needing context and many sources of evidence (e.g. person names): "... was among the six houses sold by Hope Feldman that year."; "Pawel Opalinski, Software Engineer at WhizBang Labs."
Slide 24: Landscape of IE Tasks (4/4): Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (named entity extraction):
- Person: Jack Welch
- Person: Jeffrey Immelt
- Location: Connecticut
Binary relationship:
- Relation: Person-Title; Person: Jack Welch; Title: CEO
- Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record:
- Relation: Succession; Company: General Electric; Title: CEO; Out: Jack Welch; In: Jeffrey Immelt
Slide 25: Evaluation of Single Entity Extraction
TRUTH
Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.
PRED
Michael Kearns and Sebastian Seung will start Monday's tutorial, followed by Richard M. Karpe and Martin Cooke.
Precision = # correctly predicted segments / # predicted segments = 2 / 6
Recall = # correctly predicted segments / # true segments = 2 / 4
F1 = harmonic mean of Precision & Recall = 1 / (((1/P) + (1/R)) / 2)
Slide 26: State of the Art Performance: a sample
- Named entity recognition from newswire text
- Person, Location, Organization,
- F1 in high 80s or low- to mid-90s
- Binary relation extraction
- Contained-in(Location1, Location2); Member-of(Person1, Organization1)
  - F1 in 60s or 70s or 80s
- Wrapper induction
- Extremely accurate performance obtainable
- Human effort (10min?) required on each site
Slide 27: Landscape of IE Techniques (1/1): Models
Lexicons: is a token a member of a list such as {Alabama, Alaska, ..., Wisconsin, Wyoming}? E.g. "Abraham Lincoln was born in Kentucky."
Any of these models can be used to capture words, formatting, or both.
Slide 28: Landscape
- Pattern complexity: closed set → regular → complex → ambiguous
- Pattern feature domain: words → words + formatting → formatting
- Pattern scope: site-specific → genre-specific → general
- Pattern combinations: entity → binary → n-ary
- Models: lexicon, regex, window, boundary, FSM
Slide 29: Sliding Windows
Slide 30: Extraction by Sliding Window
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
3:30 pm, 7500 Wean Hall
Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines: inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
CMU UseNet Seminar Announcement
Slides 31-33: Extraction by Sliding Window
(The same CMU UseNet seminar announcement as Slide 30, repeated as the highlighted sliding window advances across it.)
Slide 34: A Naïve Bayes Sliding Window Model
[Freitag 1997]
"... 00 pm Place: Wean Hall Rm 5409 Speaker: Sebastian Thrun ..."
(Window diagram: prefix = w_{t-m} ... w_{t-1}; contents = w_t ... w_{t+n}; suffix = w_{t+n+1} ... w_{t+n+m})
Estimate Pr(LOCATION | window) using Bayes rule. Try all reasonable windows (vary length, position). Assume independence for length, prefix words, suffix words, and content words. Estimate from data quantities like Pr("Place" in prefix | LOCATION).
If Pr(LOCATION | "Wean Hall Rm 5409") is above some threshold, extract it.
Other examples of sliding windows: [Baluja et al 2000] (decision tree over individual words & their context)
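A minimal sketch of this window scorer, under the slide's independence assumptions. Every probability table below (prefix, content, suffix, length, prior) is invented for illustration; a real system estimates these quantities from labeled training windows:

```python
import math

# Hypothetical smoothed estimates of Pr(word in prefix | LOCATION), etc.
PREFIX = {"Place": 0.4, ":": 0.3}
CONTENT = {"Wean": 0.3, "Hall": 0.3, "Rm": 0.2, "5409": 0.05}
SUFFIX = {"Speaker": 0.2}
LEN = {1: 0.05, 2: 0.2, 3: 0.2, 4: 0.4, 5: 0.05}  # Pr(window length | LOCATION)
DEFAULT = 0.001                                    # smoothed prob. for unseen events
PRIOR = 0.01                                       # Pr(LOCATION) for any window

def log_score(prefix, contents, suffix):
    """log Pr(LOCATION, window): independent length, prefix, content, suffix terms."""
    s = math.log(PRIOR) + math.log(LEN.get(len(contents), DEFAULT))
    for w in prefix:
        s += math.log(PREFIX.get(w, DEFAULT))
    for w in contents:
        s += math.log(CONTENT.get(w, DEFAULT))
    for w in suffix:
        s += math.log(SUFFIX.get(w, DEFAULT))
    return s

tokens = "Place : Wean Hall Rm 5409 Speaker Sebastian Thrun".split()

# Try all windows of length 1..5, with one token of prefix/suffix context each.
best = max(
    ((i, j, log_score(tokens[max(0, i - 1):i], tokens[i:j], tokens[j:j + 1]))
     for i in range(len(tokens))
     for j in range(i + 1, min(i + 6, len(tokens) + 1))),
    key=lambda t: t[2],
)
extracted = " ".join(tokens[best[0]:best[1]])
```

With these toy numbers, the highest-scoring window is "Wean Hall Rm 5409"; the length term is what keeps the model from preferring a shorter window with fewer low-probability factors.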
Slide 35: Naïve Bayes Sliding Window Results
Domain: CMU UseNet Seminar Announcements
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell School of Computer
Science Carnegie Mellon University
330 pm 7500 Wean
Hall Machine learning has evolved from obscurity
in the 1970s into a vibrant and popular
discipline in artificial intelligence during the
1980s and 1990s. As a result of its success and
growth, machine learning is evolving into a
collection of related disciplines inductive
concept acquisition, analytic learning in problem
solving (e.g. analogy, explanation-based
learning), learning theory (e.g. PAC learning),
genetic algorithms, connectionist learning,
hybrid systems, and so on.
Field        F1
Person Name  30
Location     61
Start Time   98
Slide 36: SRV: a realistic sliding-window-classifier IE system
[Freitag, AAAI '98]
- What windows to consider?
  - All windows containing at least as many tokens as the shortest example, but no more tokens than the longest example
- How to represent a classifier? It might:
  - Restrict the length of the window
  - Restrict the vocabulary or formatting used before/after/inside the window
  - Restrict the relative order of tokens
  - Use inductive logic programming techniques to express all these
"Course Information for CS213"
"CS 213 C Programming"
Slide 37: SRV: a rule-learner for sliding-window classification
- Primitive predicates used by SRV:
  - token(X,W), allLowerCase(W), numerical(W), ...
  - nextToken(W,U), previousToken(W,V)
- HTML-specific predicates:
  - inTitleTag(W), inH1Tag(W), inEmTag(W), ...
  - emphasized(W) = inEmTag(W) or inBTag(W) or ...
  - tableNextCol(W,U): U is some token in the column after the column W is in
  - tablePreviousCol(W,V), tableRowHeader(W,T), ...
Slide 38: SRV: a rule-learner for sliding-window classification
- Non-primitive conditions used by SRV:
  - every(X, f, c): for all W in X, f(W) = c
  - some(X, W, <f1, ..., fk>, g, c): exists W: g(fk(...(f1(W))...)) = c
  - tokenLength(X, relop, c)
  - position(W, direction, relop, c)
  - e.g., tokenLength(X, <relop>, 4), position(W, fromEnd, ...)
Slide 39: Rapier: an alternative approach
[Califf & Mooney, AAAI '99]
- A bottom-up rule learner:
  - initialize RULES to be one rule per example
  - repeat:
    - randomly pick N pairs of rules (Ri, Rj)
    - let G1, ..., GN be the consistent pairwise generalizations
    - let G = the Gi that optimizes compression
    - let RULES = RULES + {G} - {R : covers(G,R)}
- where compression(G, RULES) = size of (RULES + {G} - {R : covers(G,R)}), and covers(G,R) means every example matching G matches R
Slide 40:
"Course Information for CS213"
"CS 213 C Programming"
courseNum(window1) :- token(window1,CS), doubleton(CS), prevToken(CS,CS213), inTitle(CS213), nextTok(CS,213), numeric(213), tripleton(213), nextTok(213,C), tripleton(C), ...
"Syllabus and meeting times for Eng 214"
"Eng 214 Software Engineering for Non-programmers"
courseNum(window2) :- token(window2,Eng), tripleton(Eng), prevToken(Eng,214), inTitle(214), nextTok(Eng,214), numeric(214), tripleton(214), nextTok(214,Software), ...
Generalization:
courseNum(X) :- token(X,A), prevToken(A,B), inTitle(B), nextTok(A,C), numeric(C), tripleton(C), nextTok(C,D), ...
Slide 41: Rapier: an alternative approach
- Combines top-down and bottom-up learning
  - Bottom-up to find common restrictions on content
  - Top-down greedy addition of restrictions on context
- Use of part-of-speech and semantic features (from WordNet).
- Special pattern-language based on sequences of tokens, each of which satisfies one of a set of given constraints
Slide 42: Rapier results: precision/recall
Slide 43: Rapier results vs. SRV
Slide 44: Rule-learning approaches to sliding-window classification: Summary
- SRV, Rapier, and WHISK [Soderland, KDD '97]
  - Representations for classifiers allow restriction of the relationships between tokens, etc.
  - Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)
  - Use of these heavyweight representations is complicated, but seems to pay off in results
- Some questions to consider:
  - Can simpler, propositional representations for classifiers work? (see Roth and Yih)
  - What learning methods to consider? (NB, ILP, boosting, semi-supervised; see Collins & Singer)
  - When do we want to use this method vs. fancier ones?
Slide 45: BWI: Learning to detect boundaries
[Freitag & Kushmerick, AAAI 2000]
- Another formulation: learn three probabilistic classifiers:
  - START(i) = Prob(position i starts a field)
  - END(j) = Prob(position j ends a field)
  - LEN(k) = Prob(an extracted field has length k)
- Then score a possible extraction (i,j) by START(i) * END(j) * LEN(j-i)
- LEN(k) is estimated from a histogram
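The scoring rule above is simple enough to sketch directly. The detector outputs and the length histogram below are made-up numbers, standing in for the boosted START/END detectors and the training-data histogram:

```python
# Score candidate extractions (i, j) by START(i) * END(j) * LEN(j - i),
# as in the BWI formulation above.

START = {3: 0.9, 7: 0.4}             # hypothetical Prob(position i starts a field)
END = {5: 0.8, 9: 0.3}               # hypothetical Prob(position j ends a field)
LEN_HIST = {1: 2, 2: 5, 3: 2, 6: 1}  # hypothetical field-length counts from training

def length_prob(k):
    """LEN(k) estimated from the histogram of field lengths."""
    return LEN_HIST.get(k, 0) / sum(LEN_HIST.values())

def score(i, j):
    return START.get(i, 0.0) * END.get(j, 0.0) * length_prob(j - i)

# All candidate (i, j) pairs, best-scoring first.
candidates = sorted(
    ((i, j) for i in START for j in END if j > i),
    key=lambda ij: score(*ij), reverse=True,
)
```

Note how the length histogram lets a strong START detection at position 3 pair with the nearby END at 5 (a common length) rather than the distant END at 9.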
Slide 46: BWI: Learning to detect boundaries
- BWI uses boosting to find detectors for START and END
- Each weak detector has a BEFORE and AFTER pattern (on tokens before/after position i).
- Each pattern is a sequence of tokens and/or wildcards like anyAlphabeticToken, anyToken, anyUpperCaseLetter, anyNumber, ...
- Weak learner for patterns uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns
Slide 47: BWI: Learning to detect boundaries
Field        F1
Person Name  30
Location     61
Start Time   98
Slide 48: Problems with Sliding Windows and Boundary Finders
- Decisions in neighboring parts of the input are made independently from each other.
  - Naïve Bayes Sliding Window may predict a seminar end time before the seminar start time.
  - It is possible for two overlapping windows to both be above threshold.
  - In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.
Slide 49: Finite State Machines
Slide 50: Hidden Markov Models
HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, ...
(Figure: the graphical model / finite state model, with hidden states S_{t-1}, S_t, S_{t+1}, ... linked by transitions and emitting observations O_{t-1}, O_t, O_{t+1}, ...; the model generates a state sequence and an observation sequence o1 o2 o3 o4 o5 o6 o7 o8.)
Parameters: for all states S = {s1, s2, ...}:
- start state probabilities P(s_1)
- transition probabilities P(s_t | s_{t-1})
- observation (emission) probabilities P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet
Training: maximize probability of training observations (w/ prior)
Slide 51: IE with Hidden Markov Models
Given a sequence of observations:
  Yesterday Pedro Domingos spoke this example sentence.
and a trained HMM (with states such as person name, location name, background), find the most likely state sequence (Viterbi).
Any words said to be generated by the designated "person name" state are extracted as a person name:
Person name: Pedro Domingos
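The Viterbi decoding step can be sketched end to end. The state set is cut down to background/person, and every probability below is a tiny hypothetical number rather than a trained model:

```python
import math

# Minimal Viterbi decoder for HMM tagging: recover the most likely
# hidden state sequence for a token sequence.

STATES = ["background", "person"]
START = {"background": 0.9, "person": 0.1}
TRANS = {("background", "background"): 0.8, ("background", "person"): 0.2,
         ("person", "person"): 0.6, ("person", "background"): 0.4}
EMIT = {("person", "Pedro"): 0.3, ("person", "Domingos"): 0.3,
        ("background", "Yesterday"): 0.1, ("background", "spoke"): 0.1}
DEFAULT = 1e-4  # smoothed emission probability for unseen (state, word) pairs

def viterbi(tokens):
    """Return the most likely state sequence for tokens."""
    # delta[s] = best log-prob of any path ending in state s; back pointers in psi.
    delta = {s: math.log(START[s]) + math.log(EMIT.get((s, tokens[0]), DEFAULT))
             for s in STATES}
    psi = []
    for w in tokens[1:]:
        new_delta, back = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: delta[r] + math.log(TRANS[(r, s)]))
            new_delta[s] = (delta[prev] + math.log(TRANS[(prev, s)])
                            + math.log(EMIT.get((s, w), DEFAULT)))
            back[s] = prev
        delta, psi = new_delta, psi + [back]
    # Trace back from the best final state.
    path = [max(STATES, key=delta.get)]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

tags = viterbi("Yesterday Pedro Domingos spoke".split())
```

Under these numbers the decoder tags "Pedro Domingos" with the person state and the surrounding words as background, which is exactly the extraction step described above.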
Slide 52: HMM Example: "Nymble"
[Bikel et al 1998], BBN's IdentiFinder
Task: Named Entity Extraction
States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.
Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then to P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then to P(o_t).
Train on 500k words of news wire text.
Results:
Case   Language  F1
Mixed  English   93
Upper  English   91
Mixed  Spanish   90
Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
Slide 53: We want More than an Atomic View of Words
Would like richer representation of text: many arbitrary, overlapping features of the words.
(Figure: the same S/O chain as before; an observation such as the token "Wisniewski" carries features like:)
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"
Slide 54: Problems with a Richer Representation and a Generative Model
- These arbitrary features are not independent:
  - Multiple levels of granularity (chars, words, phrases)
  - Multiple dependent modalities (words, formatting, layout)
  - Past & future
- Two choices:
  - Ignore the dependencies. This causes over-counting of evidence (a la naïve Bayes). Big problem when combining evidence, as in Viterbi!
  - Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
Slide 55: Conditional Sequence Models
- We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o).
- Can examine features, but not responsible for generating them.
- Don't have to explicitly model their dependencies.
- Don't waste modeling effort trying to generate what we are given at test time anyway.
Slide 56: Conditional Markov Models (CMMs) vs HMMs
(Figure: two chains over states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}: in the HMM the states generate the observations; in the CMM the arcs are reversed, with each state conditioned on its observation.)
Lots of ML ways to estimate Pr(y | x).
Slide 57: From HMMs to CRFs
Conditional Finite State Sequence Models
[McCallum, Freitag & Pereira, 2000] [Lafferty, McCallum & Pereira, 2001]
Joint (HMM): P(s, o) = Π_t P(s_t | s_{t-1}) · P(o_t | s_t)
Conditional: P(s | o) = (1/Z(o)) Π_t Φ(s_{t-1}, s_t, o_t), where Φ(s_{t-1}, s_t, o_t) = exp( Σ_k λ_k f_k(s_{t-1}, s_t, o_t) )
(A super-special case of Conditional Random Fields.)
Slide 58: Feature Functions
Example: o = "Yesterday Pedro Domingos spoke this example sentence.", with observations o1 ... o7 and states s1 ... s4.
Slide 59: Efficient Inference
Slide 60: Learning Parameters of CRFs
Maximize the log-likelihood L(Λ) of parameters Λ = {λ_k} given training data D, using the log-likelihood gradient.
- Methods:
  - iterative scaling (quite slow: ~2000 iterations from a good start)
  - gradient, conjugate gradient (faster)
  - limited-memory quasi-Newton methods (super fast) [Sha & Pereira 2002] [Malouf 2002]
Slide 61: Voted Perceptron Sequence Models
[Collins 2001]; also [Hofmann 2003], [Taskar et al 2003]
(The update is analogous to the gradient for one training instance.)
- Avoids the tricky math; very fast
- Uses pseudo-negative examples of sequences
- Approximates a margin classifier for "good" vs. "bad" sequences
Slide 62: Broader Issues in IE
Slide 63: Broader View
Up to now we have been focused on segmentation and classification:
Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search → Data mine
Supporting steps: Create ontology; Label training data; Train extraction models
Slide 64: Broader View
Now touch on some other issues, numbered 1-5 on the pipeline:
Document collection → Spider → Filter by relevance → IE (Tokenize, Segment, Classify, (1) Associate, (2) Cluster) → Load DB → Database → Query, Search → (5) Data mine
Supporting steps: (3) Create ontology; Label training data; (4) Train extraction models
Slide 65: (1) Association as Binary Classification
"Christos Faloutsos conferred with Ted Senator, the KDD 2003 General Chair."
(Person: Christos Faloutsos; Person: Ted Senator; Role: KDD 2003 General Chair)
Person-Role(Christos Faloutsos, KDD 2003 General Chair)? → NO
Person-Role(Ted Senator, KDD 2003 General Chair)? → YES
Do this with SVMs and tree kernels over parse trees. [Zelenko et al, 2002]
Slide 66: (1) Association with Finite State Machines
[Ray & Craven, 2001]
"This enzyme, UBC6, localizes to the endoplasmic reticulum, with the catalytic domain facing the cytosol."
POS-tagged: DET/this N/enzyme N/ubc6 V/localizes PREP/to ART/the ADJ/endoplasmic N/reticulum PREP/with ART/the ADJ/catalytic N/domain V/facing ART/the N/cytosol
Extracted: Subcellular-localization(UBC6, endoplasmic reticulum)
Slide 67: (1) Association with Graphical Models
[Roth & Yih 2002]
Capture arbitrary-distance dependencies among predictions.
Slide 68: (1) Association with Graphical Models
[Roth & Yih 2002]
Also capture long-distance dependencies among predictions.
(Figure: a random variable over the class of entity 1, e.g. over person, location, ...; a random variable over the class of the relation between entity 2 and entity 1, e.g. over lives-in, is-boss-of, ...; local language models contribute evidence to entity classification and to relation classification; dependencies between classes of entities and relations! Inference with loopy belief propagation.)
Slide 69: (The same figure: once the relation is labeled lives-in, the second entity's label is resolved from "person?" to "location".)
Slide 70: Broader View
Now touch on some other issues. (Same pipeline as Slide 64.)
When do two extracted strings refer to the same object?
Slide 71: (2) Learning a Distance Metric Between Records
[Borthwick 2000; Cohen & Richman 2001; Bilenko & Mooney 2002, 2003]
Learn Pr(duplicate vs. not-duplicate | record1, record2) with a maximum entropy classifier. Do greedy agglomerative clustering using this probability as a distance metric.
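A minimal sketch of that pipeline. The toy `pair_distance` below (Jaccard distance over tokens) is a stand-in for `1 - Pr(duplicate | record1, record2)` from the trained classifier; the clustering loop is the greedy agglomerative part:

```python
# Greedy single-link agglomerative clustering of records under a
# learned pairwise distance. The distance function here is a
# hypothetical stand-in, not the trained maximum-entropy model.

def pair_distance(a, b):
    """Stand-in for 1 - Pr(duplicate | a, b): Jaccard distance over tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(ta & tb) / len(ta | tb)

def cluster(records, threshold=0.6):
    """Repeatedly merge the closest pair of clusters below threshold."""
    clusters = [[r] for r in records]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: distance between clusters = closest record pair
                d = min(pair_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)

groups = cluster(["William Cohen", "William W. Cohen", "Andrew McCallum"])
```

With this stand-in distance, the two "Cohen" variants merge into one cluster and the third record stays on its own; each final cluster is read as one underlying object.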
Slide 72: (2) String Edit Distance
- distance("William Cohen", "Willliam Cohon")
(Figure: a character-by-character alignment of s and t, with the edit op and cost at each position.)
Slide 73: (2) Computing String Edit Distance
D(i,j) = min of:
- D(i-1, j-1) + d(s_i, t_j)   (substitute/copy)
- D(i-1, j) + 1               (insert)
- D(i, j-1) + 1               (delete)
(the learned variant learns these parameters)
A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (there may be more than one).
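The recurrence and its trace can be sketched directly, with unit costs in place of the learned parameters:

```python
# Edit distance by dynamic programming, plus a traceback that recovers
# one minimal edit script. Unit costs here; the learned-distance variant
# replaces the constants with trained parameters.

def edit_distance(s, t):
    """Return (distance, ops) where ops is one minimal edit script."""
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i
    for j in range(1, len(t) + 1):
        D[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(
                D[i-1][j-1] + (s[i-1] != t[j-1]),  # substitute (cost 1) or copy (cost 0)
                D[i-1][j] + 1,                      # consume s[i-1] only (delete)
                D[i][j-1] + 1,                      # consume t[j-1] only (insert)
            )
    # Trace back to recover where each min came from.
    i, j, ops = len(s), len(t), []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (s[i-1] != t[j-1]):
            ops.append("copy" if s[i-1] == t[j-1] else "subst")
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i-1][j] + 1:
            ops.append("delete")
            i -= 1
        else:
            ops.append("insert")
            j -= 1
    return D[len(s)][len(t)], list(reversed(ops))

dist, ops = edit_distance("William Cohen", "Willliam Cohon")
```

On the slide's example pair the distance is 2: one inserted "l" and one "e"/"o" substitution, with everything else copied.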
Slide 74: (2) String Edit Distance Learning
[Bilenko & Mooney, 2002, 2003]
Precision/recall for MAILING dataset duplicate detection.
Slide 75: (2) Information Integration
[Minton, Knoblock et al 2001; Doan, Domingos & Halevy 2001; Richardson & Domingos 2003]
- Goal might be to merge results of two IE systems.
Slide 76: (2) Other Information Integration Issues
- Distance metrics for text: which work well? [Cohen, Ravikumar & Fienberg, 2003]
- Finessing integration by "soft" database operations based on similarity [Cohen, 2000]
- Integration of complex structured databases (capture dependencies among multiple merges) [Cohen, McAllester & Kautz, KDD 2000; Pasula, Marthi, Milch, Russell & Shpitser, NIPS 2002; McCallum and Wellner, KDD WS 2003]
Slide 77: Relational Identity Uncertainty with Probabilistic Relational Models (PRMs)
[Russell 2001; Pasula et al 2002; Marthi, Milch & Russell 2003]
(Applied to citation matching, and object correspondence in vision)
(Figure: a plate model over N mentions, with mention attributes: id, words, context, fonts, distance; and underlying object attributes: id, surname, age, gender, ...)
Slide 78: Broader View
Now touch on some other issues. (Same pipeline as Slide 64.)
Slide 79: (5) Working with IE Data
- Some special properties of IE data:
  - It is based on extracted text
  - It is dirty (missing & extraneous facts, improperly normalized entity names, etc.)
  - May need cleaning before use
- What operations can be done on dirty, unnormalized databases?
  - Data mine it directly.
  - Query it directly with a language that has soft joins across similar, but not identical, keys. [Cohen 1998]
  - Use it to construct features for learners. [Cohen 2000]
  - Infer a best underlying clean database. [Cohen, Kautz & McAllester, KDD 2000]