1 Information Extraction
Martin Ester, Simon Fraser University, School of Computing Science, CMPT 884, Spring 2009
2 Information Extraction
- Outline
- Introduction: motivation, applications, issues
- Entity extraction: hand-coded, machine learning
- Relation extraction: supervised, partially supervised
- Entity resolution: string similarity, finding similar pairs, creating groups
- Future research
- cf. [Feldman 2006], [Agichtein & Sarawagi 2006]
3 Introduction
- Motivation
- 80% of all human-generated data is natural language text
- search engines return whole documents, requiring the user to read the documents and manually extract the relevant information (entities, facts, ...) → very time-consuming
- need for automatic extraction of such information from collections of natural language text documents → information extraction (IE)
4 Introduction
- Definitions
- Entity: an object of interest such as a person or organization.
- Attribute: a property of an entity such as its name, alias, descriptor, or type.
- Relation: a relationship held between two or more entities, such as the position of a person in a company.
- Event: an activity involving several entities, such as a terrorist act, aircraft crash, management change, or new product introduction.
5 Introduction
6 Introduction
- Applications
- question answering: Who is the president of the US? Where was Martin Luther born?
- automatic creation of databases, e.g., a database of protein localizations or of adverse reactions to a drug
- opinion mining: analyzing online product reviews to get user feedback
7 Introduction
- Challenges
- Complexity of natural language: e.g., identifying word and sentence boundaries is fairly easy in European languages, much harder in Chinese / Japanese
- Ambiguity of natural language: e.g., homonyms
- Diversity of natural language: many ways of expressing a given piece of information, e.g., synonyms
- Diversity of writing styles: e.g., scientific papers, newspaper articles, maintenance reports, emails, ...
8 Introduction
- Challenges
- names are hard to discover
  - impossible to enumerate
  - new candidates are generated all the time
  - hard to provide syntactic rules
- types of proper names
  - people
  - companies
  - products
  - genes
  - ...
9 Introduction
- Architecture of IE System
- (diagram) local analysis, followed by discourse (global) analysis
10 Introduction
- Knowledge Engineering Approach
- Extraction rules are hand-crafted by linguists in cooperation with domain experts.
- Most of the work is done by inspecting a set of relevant documents.
- Development of the rule set is very time-consuming.
- Requires substantial CS and domain expertise.
- Rule sets are domain-specific and do not transfer to other domains.
- The knowledge engineering (KE) approach often achieves higher accuracy than the machine learning approach.
11 Introduction
- Machine Learning Approach
- Automatically learn a model (rules) from an annotated training corpus.
- Techniques are based on pure statistics and little linguistic knowledge.
- No CS expertise is required when building the model.
- However, creating the annotated corpus is very laborious, since a very large number of training examples is needed.
- Transfer to other domains is easier than with the KE approach.
- Accuracy of the machine learning (ML) approach is typically lower.
12 Introduction
- Topics Not Covered
- co-reference resolution: e.g., an article referencing a noun (entity) of another sentence
- event extraction: an event has a type, actor, time, ...
- sentiment detection: a certain statement (opinion) is classified as positive / negative
13 Entity Extraction
- Lexical Analysis
- breaking up the input document into individual words (tokens)
- token: a sequence of characters treated as a unit
- punctuation marks are also considered tokens, e.g., "," (comma)
- often, regular expressions are used to define the format of a token
14 Entity Extraction
- Syntactic Analysis
- part-of-speech tagging [Charniak 1997]
- marking up the tokens in a text as corresponding to a particular part of speech (POS), based on both the word's definition and its context
- coarse POS tags: e.g., N, V, A, Aux, ...
- finer POS tags:
  - PRP: personal pronouns (you, me, she, he, them, him, her, ...)
  - PRP$: possessive pronouns (my, our, her, his, ...)
  - NN: singular common nouns (sky, door, theorem, ...)
  - NNS: plural common nouns (doors, theorems, women, ...)
  - NNP: singular proper names (Fifi, IBM, Canada, ...)
  - NNPS: plural proper names (Americas, Carolinas, ...)
15 Entity Extraction
- Syntactic Analysis
- Words often have more than one POS, e.g., "back":
  - The back door → JJ
  - On my back → NN
  - Win the voters back → RB
  - Promised to back the bill → VB
- The POS tagging problem is to determine the POS tag for a particular instance of a word.
- e.g., input: "the lead paint is unsafe"; output: the/Det lead/N paint/N is/V unsafe/Adj
16 Entity Extraction
- Knowledge Engineering Approach [Chaudhuri 2005]
- hand-coded rules are often relatively straightforward
- easy to incorporate domain knowledge
- require substantial CS expertise
- example rule:
  <token>INITIAL</token>
  <token>DOT</token>
  <token>CAPSWORD</token>
  <token>CAPSWORD</token>
  → finds person names with a salutation and two capitalized words, e.g., "Dr. Laura Haas"
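A rule of this shape can be sketched as a single regular expression. This is a minimal illustration, not the rule set from [Chaudhuri 2005]; the salutation list and the token shapes are assumptions.

```python
import re

# Sketch of the hand-coded rule above: a salutation (INITIAL), a dot (DOT),
# and two capitalized words (CAPSWORD CAPSWORD).
# The salutation alternatives are an illustrative assumption.
PERSON_RULE = re.compile(
    r"\b(Dr|Mr|Mrs|Ms|Prof)"   # INITIAL (salutation)
    r"\."                      # DOT
    r"\s+([A-Z][a-z]+)"        # CAPSWORD
    r"\s+([A-Z][a-z]+)"        # CAPSWORD
)

def find_person_names(text):
    """Return matches of salutation + two capitalized words."""
    return [m.group(0) for m in PERSON_RULE.finditer(text)]

print(find_person_names("We met Dr. Laura Haas at IBM."))
# → ['Dr. Laura Haas']
```

Note how all-caps tokens like "IBM" do not match the CAPSWORD shape, so only the saluted name is extracted.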
17 Entity Extraction
- Knowledge Engineering Approach
- a more complex example: conference names (Perl)

  my $wordOrdinals = "(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
  my $numberOrdinals = "(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
  my $ordinals = "(?:$wordOrdinals|$numberOrdinals)";
  my $confTypes = "(?:Conference|Workshop|Symposium)";
  my $words = "(?:[A-Z]\\w*\\s*)";  # a word starting with a capital letter and ending with 0 or more spaces
  my $confDescriptors = "(?:international\\s|[A-Z]+\\s)";  # e.g., "International Conference ..." or the conference name for workshops (e.g., "VLDB Workshop ...")
  my $connectors = "(?:on|of)";
  my $abbreviations = "(?:\\([A-Z]\\w\\w\\W\\s?(?:\\d\\d)?\\))";  # abbreviations like "(SIGMOD'06)"
  my $fullNamePattern = "((?:$ordinals\\s$words|$confDescriptors)?$confTypes(?:\\s$connectors\\s.+?\\s)?$abbreviations?)(?:\\n|\\r|\\.|<)";
  . . .
18 Entity Extraction
- Machine Learning Approach
- We can view named entity extraction as a sequence classification problem: classify each word as belonging to one of the named entity classes or to the no-name class.
- The class label of a sequence element depends on the neighboring ones.
- One of the most popular techniques for classifying sequences is Hidden Markov Models (HMMs).
- Another popular ML method for entity extraction: Conditional Random Fields [Lafferty et al. 2001].
- Requires a large enough labeled (annotated) training dataset.
19 Entity Extraction
- Hidden Markov Models [Rabiner 1989]
- An HMM (Hidden Markov Model) is a finite state automaton with stochastic state transitions and symbol emissions.
- The automaton models a probabilistic generative process.
- In this process, a sequence of symbols is produced by starting in an initial state, transitioning to a new state, emitting a symbol selected by that state, and repeating this transition/emission cycle until a designated final state is reached.
- Very successful in many sequence classification tasks.
20 Entity Extraction
- Example
- HMM for addresses
21 Entity Extraction
- Hidden Markov Models
- T: length of the sequence of observations (training set)
- N: number of states in the model
- q_t: the actual state at time t
- S = {S_1, ..., S_N} (finite set of possible states)
- V = {O_1, ..., O_M} (finite set of observation symbols)
- π: π_i = P(q_1 = S_i), starting probabilities
- A: a_ij = P(q_{t+1} = S_i | q_t = S_j), transition probabilities
- B: b_i(O_t) = P(O_t | q_t = S_i), emission probabilities
- λ = (π, A, B): the hidden Markov model
22 Entity Extraction
- Hidden Markov Models
- How to find P(O | λ), the probability of an observation sequence given the HMM model? → forward-backward algorithm
- How to find the λ that maximizes P(O | λ)? This is the task of the training phase. → Baum-Welch algorithm
- How to find the most likely state trajectory given λ and O? This is the task of the test phase. → Viterbi algorithm
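The Viterbi step above can be sketched in a few lines of dynamic programming. The two-state toy model (π, A, B) at the bottom is an illustrative assumption.

```python
def viterbi(obs, pi, A, B):
    """Most likely state sequence for observation indices obs, given
    starting probabilities pi, transitions A[i][j] = P(q_{t+1}=j | q_t=i),
    and emissions B[i][k] = P(O_t=k | q_t=i)."""
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]   # best path probability ending in state j at time t
    psi = [[0] * N for _ in range(T)]       # backpointers
    for i in range(N):
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * A[i][j])
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] * A[best][j] * B[j][obs[t]]
    # backtrack from the best final state
    path = [max(range(N), key=lambda i: delta[-1][i])]
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return path[::-1]

# toy model: 2 hidden states, 2 observation symbols (an assumption)
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 0, 1], pi, A, B))   # → [0, 0, 1]
```

The forward-backward and Baum-Welch algorithms mentioned above run over the same trellis of (time, state) cells, replacing max by summation.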
23 Relation Extraction
- Example: extracting (Organization, Location) tuples
  - (Microsoft, Redmond), (Apple Computer, Cupertino), (Nike, Portland)
- from text such as:
  - "Microsoft's central headquarters in Redmond is home to almost every product group and division."
  - "Brent Barlow, 27, a software analyst and beta-tester at Apple Computer headquarters in Cupertino, was fired Monday for 'thinking a little too different.'"
  - "Apple's programmers 'think different' on a 'campus' in Cupertino, Cal. Nike employees 'just do it' at what the company refers to as its 'World Campus,' near Portland, Ore."
24 Relation Extraction
- Introduction
- No single source contains all the relations.
- Each relation appears on many web pages.
- There are repeated patterns in the way relations are represented on web pages → exploit redundancy
- The components of a relation appear close together → use the context of an occurrence of a relation to determine patterns
- pattern: consists of constants (tokens) and variables (placeholders for entities)
- tuple: an instance / occurrence of a relation
25 Relation Extraction
- Introduction
- Typically requires entity extraction (tagging) as preprocessing
- Knowledge engineering approach
  - patterns defined over lexical items: <company> located in <location>
  - patterns defined over parsed text: ((Obj <company>) (Verb located) () (Subj <location>))
- Machine learning approach
  - learn rules/patterns from examples
  - partially supervised: bootstrap from example tuples [Agichtein & Gravano 2000, Etzioni et al. 2004]
26 Relation Extraction
- Snowball [Agichtein & Gravano 2000]
- Exploit the duality between patterns and tuples:
  - find tuples that match a set of patterns
  - find patterns that match a lot of tuples
  → bootstrapping approach
- (diagram) initial seed tuples → find occurrences of seed tuples → tag entities → generate extraction patterns → generate new seed tuples → augment table
27 Relation Extraction
- Snowball
- How to represent patterns of occurrences?
- occurrences of the initial seed tuples:
  - "Computer servers at Microsoft's headquarters in Redmond ..."
  - "In mid-afternoon trading, shares of Redmond-based Microsoft fell ..."
  - "The Armonk-based IBM introduced a new line ..."
  - "The combined company will operate from Boeing's headquarters in Seattle."
  - "Intel, Santa Clara, cut prices of its Pentium processor."
28 Relation Extraction
- Patterns
- An (extraction) pattern has the format <left, tag1, middle, tag2, right>, where tag1, tag2 are named-entity tags and left, middle, and right are vectors of weighted terms.
- Patterns derived directly from occurrences are too specific.
- Example: "ORGANIZATION's central headquarters in LOCATION is home to ..." yields
  - tag1 = ORGANIZATION, tag2 = LOCATION
  - middle = <('s, 0.5), (central, 0.5), (headquarters, 0.5), (in, 0.5)>
  - right = <(is, 0.75), (home, 0.75)>
29 Relation Extraction
- Pattern Clusters
- cluster the patterns; the cluster centroids define the patterns
30 Relation Extraction
- Evaluation of Patterns
- How good are new extraction patterns?
- Measure their performance through their accuracy vs. the initial seed tuples (ground truth).
- Example: extraction with the pattern <ORGANIZATION>, <LOCATION>:
  - "Boeing, Seattle, said ..." → positive
  - "Intel, Santa Clara, cut prices ..." → positive
  - "invest in Microsoft, New York-based analyst Jane Smith said" → negative
31 Relation Extraction
- Evaluation of Patterns
- Trust only patterns with high support and confidence, i.e., patterns that produce many correct (positive) tuples and only a few false (negative) tuples.
- conf(p) = pos(p) / (pos(p) + neg(p)), where p denotes a pattern and pos(p), neg(p) denote the numbers of positive and negative tuples it produces
32 Relation Extraction
- Evaluation of Tuples
- Trust only tuples that match many patterns.
- Suppose candidate tuple t matches patterns p1 and p2. What is the probability that t is a valid tuple?
- Assume matches of different patterns are independent events.
- Pr[t matches p1 and t is not valid] = 1 - conf(p1)
- Pr[t matches p2 and t is not valid] = 1 - conf(p2)
- Pr[t matches p1, p2 and t is not valid] = (1 - conf(p1))(1 - conf(p2))
- Pr[t matches p1, p2 and t is valid] = 1 - (1 - conf(p1))(1 - conf(p2))
- If tuple t matches a set of patterns P: conf(t) = 1 - ∏_{p in P} (1 - conf(p))
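The two confidence formulas are direct to compute. The pos/neg counts below are illustrative assumptions, not values from the Snowball paper.

```python
import math

# Snowball-style confidence scores as defined above.
def pattern_conf(pos, neg):
    """conf(p) = pos(p) / (pos(p) + neg(p))"""
    return pos / (pos + neg)

def tuple_conf(pattern_confs):
    """conf(t) = 1 - product over matching patterns p of (1 - conf(p))"""
    return 1.0 - math.prod(1.0 - c for c in pattern_confs)

p1 = pattern_conf(pos=8, neg=2)   # 0.8
p2 = pattern_conf(pos=3, neg=2)   # 0.6
# a tuple matching both patterns:
print(tuple_conf([p1, p2]))       # 1 - 0.2 * 0.4 = 0.92
```

Matching one more pattern can only increase conf(t), which is exactly the "trust tuples that match many patterns" heuristic.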
33 Relation Extraction
- Snowball Algorithm
- 1. Start with a seed set R of tuples.
- 2. Generate a set P of patterns from R.
  - compute support and confidence for each pattern in P
  - discard patterns with low support or confidence
- 3. Generate a new set T of tuples matching the patterns in P.
  - compute the confidence of each tuple in T
  - add to R the tuples t in T with conf(t) > threshold
- 4. Go back to step 2.
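The bootstrapping loop can be illustrated end to end on a toy example. Here a "pattern" is simply the exact middle context between the two entities, and the support/confidence scoring of slides 31-32 is omitted for brevity; the mini corpus, the seed tuple, and the regex are illustrative assumptions.

```python
import re

# Toy, self-contained sketch of the Snowball loop: seed tuples yield
# patterns (middle contexts), patterns yield new tuples, and so on.
CORPUS = [
    "Microsoft's headquarters in Redmond",
    "Boeing's headquarters in Seattle",
    "Intel's headquarters in Santa Clara",
]

def find_occurrences(corpus):
    """Every 'X's headquarters in Y' occurrence, as (org, middle, loc)."""
    occs = []
    for s in corpus:
        m = re.match(r"(.+?)('s headquarters in )(.+)", s)
        if m:
            occs.append((m.group(1), m.group(2), m.group(3)))
    return occs

def snowball(seeds, corpus, iterations=2):
    R = set(seeds)                           # current set of trusted tuples
    for _ in range(iterations):
        occs = find_occurrences(corpus)
        # generate patterns: middle contexts that co-occur with a known tuple
        patterns = {mid for org, mid, loc in occs if (org, loc) in R}
        # extract new tuples that match a trusted pattern
        for org, mid, loc in occs:
            if mid in patterns:
                R.add((org, loc))
    return R

print(sorted(snowball({("Microsoft", "Redmond")}, CORPUS)))
# → [('Boeing', 'Seattle'), ('Intel', 'Santa Clara'), ('Microsoft', 'Redmond')]
```

Starting from a single seed, the shared context "'s headquarters in" becomes a pattern, which then pulls in the Boeing and Intel tuples.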
34 Relation Extraction
- Discussion
- The bootstrapping approach requires only a relatively small number of training tuples (semi-supervised).
- It is effective for binary, 1:1 relations.
- The bootstrapping approach has been adopted by lots of subsequent work.
- Pattern evaluation is heuristic and has no theory behind it. → Statistical Snowball, WWW '09
- What about n-ary relations?
- What about 1:m relations?
35 Entity Resolution
36 Entity Resolution
- Introduction
- Entity resolution: map entity mentions to the corresponding entities (stored in a database or ontology)
- Challenges
  - large lists with multiple noisy mentions of the same entity
  - no single attribute to order or cluster likely duplicates while separating them from similar but different entities
  - need to depend on fuzzy and computationally expensive string similarity functions
37 Entity Resolution
- Introduction
- Typical approach:
  - define a string similarity: numeric attributes are easy to compare; string attributes are hard and need approximate matching
  - find similar pairs of entities
  - create groups from the duplicate entity pairs (clustering)
38 Entity Resolution
- String Similarity
- Token-based
  - Jaccard and TF-IDF cosine similarities → suitable for large documents
- Character-based
  - edit distance and variants like Levenshtein, Jaro-Winkler, Soundex → suitable for short strings with spelling mistakes
- Hybrids
39 Entity Resolution
- Token-Based String Similarity
- Tokens/words: "AT&T Corporation" → {AT&T, Corporation}
- Similarity: various measures of the overlap of two sets S, T
- Jaccard(S,T) = |S ∩ T| / |S ∪ T|
- Example
  - S = "AT&T Corporation" → {AT&T, Corporation}
  - T = "AT&T Corp." → {AT&T, Corp.}
  - Jaccard(S,T) = 1/3
- Variants: weights attached to each token
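The Jaccard example above in code; whitespace tokenization is an assumption.

```python
def jaccard(s, t):
    """Jaccard(S,T) = |S intersect T| / |S union T| over whitespace token sets."""
    S, T = set(s.split()), set(t.split())
    return len(S & T) / len(S | T)

print(jaccard("AT&T Corporation", "AT&T Corp."))   # → 1/3: intersection {AT&T}, union of 3 tokens
```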
40 Entity Resolution
- Token-Based String Similarity
- The sets are transformed to vectors, with each term as a dimension.
- Cosine similarity: dot product of the two vectors, each normalized to unit length → cosine of the angle between them
- Term weight (TF-IDF): log(tf + 1) * log(idf), where
  - tf = frequency of the term in document d
  - idf = number of documents / number of documents containing the term
  → rare terms are more important
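A minimal sketch of TF-IDF cosine similarity with the weighting above, log(tf + 1) * log(idf). The tiny four-document corpus is an illustrative assumption.

```python
import math
from collections import Counter

# toy corpus (an assumption) over which document frequencies are computed
DOCS = ["AT&T Corporation", "AT&T Corp", "AT&T Inc", "IBM Corporation"]

def tfidf_vector(doc, docs):
    """Term -> weight map with w(term) = log(tf + 1) * log(N / df)."""
    tf = Counter(doc.split())
    n = len(docs)
    vec = {}
    for term, f in tf.items():
        df = sum(1 for d in docs if term in d.split())
        vec[term] = math.log(f + 1) * math.log(n / df)
    return vec

def cosine(u, v):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

u = tfidf_vector("AT&T Corporation", DOCS)
v = tfidf_vector("AT&T Corp", DOCS)
print(cosine(u, v))   # strictly between 0 and 1: shared "AT&T", differing suffixes
```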
41 Entity Resolution
- Token-Based String Similarity
- widely used in traditional IR
- Example: "AT&T Corporation", "AT&T Corp", or "AT&T Inc"
  - low weights for Corporation, Corp, Inc; higher weight for AT&T
42 Entity Resolution
- Character-Based String Similarity
- Given two strings S, T, edit(S,T) = the minimum cost sequence of operations to transform S into T.
- Character operations: I (insert), D (delete), R (replace).
- Example: edit(Error, Eror) = 1, edit(great, grate) = 2
- Dynamic programming algorithm to compute edit()
- Several variants (gaps, weights) → some become NP-complete
- Varying costs of operations can be learned
- Suitable for common typing mistakes on short strings
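The dynamic programming computation of edit() with unit costs can be sketched as follows.

```python
# Edit distance with unit costs for insert, delete, replace.
def edit(s, t):
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                            # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                            # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # replace / match
    return d[m][n]

print(edit("Error", "Eror"))    # → 1
print(edit("great", "grate"))   # → 2
```

The table has (|s|+1) * (|t|+1) cells, so the cost is quadratic in the string lengths; this is what makes the filter step on the next slides worthwhile.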
43 Entity Resolution
- Find Duplicate Pairs
- Input: a large list of entities with string attributes
- Output: all pairs (S,T) of entities which satisfy a similarity criterion such as
  - Jaccard(S,T) > 0.7
  - edit-distance(S,T) < k
- Naive method: for each record pair, compute the similarity score
  - I/O- and CPU-intensive, not scalable to millions of entities
- Goal: reduce the O(n^2) cost to O(n*w), where w << n
  - reduce the number of pairs on which the similarity is computed
44 Entity Resolution
- Find Duplicate Pairs
- Method: filter and refinement
- Use an inexpensive filter to filter out as many pairs as possible, e.g.:
  EditDistance(s,t) ≤ d → |q-grams(s) ∩ q-grams(t)| ≥ max(|s|,|t|) - (d-1)*q - 1
- q-gram: a subsequence of q consecutive characters, e.g., the 3-grams for "AT&T Corporation": "AT&", "T&T", "&T ", "T C", " Co", "Cor", "orp", "rpo", "por", "ora", "rat", "ati", "tio", "ion"
- If a pair (s,t) does not satisfy the filter, it cannot satisfy the similarity criterion:
  |q-grams(s) ∩ q-grams(t)| < max(|s|,|t|) - (d-1)*q - 1 → EditDistance(s,t) > d
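The count filter can be sketched as follows. The stated bound holds for q-grams taken over strings padded with q-1 special characters at each end; that padding convention, the pad character, and the example pair are assumptions.

```python
from collections import Counter

def qgrams(s, q=3, pad="#"):
    """Multiset of q-grams of s, padded with q-1 pad characters at each end."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def passes_filter(s, t, d, q=3):
    """Cheap necessary condition for EditDistance(s,t) <= d:
    the pair must share at least max(|s|,|t|) - (d-1)*q - 1 q-grams."""
    common = sum((qgrams(s, q) & qgrams(t, q)).values())
    return common >= max(len(s), len(t)) - (d - 1) * q - 1

print(passes_filter("AT&T Corporation", "ATT Corporation", d=1))  # → True: survives, refine with edit distance
print(passes_filter("AT&T Corporation", "IBM", d=1))              # → False: pruned, no edit distance needed
```

Counting shared q-grams is linear in the string lengths, versus the quadratic dynamic program for the edit distance itself.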
45 Entity Resolution
- Find Duplicate Pairs
- The filter does not have to be applied to all pairs of entities: use an index to retrieve the subset of entities that share q-grams.
- Compute the expensive similarity function, e.g., EditDistance(s,t), only for pairs that survive the filter step.
46 Entity Resolution
- Create Groups of Duplicates
- Given pairs of duplicate entities,
- group them such that each group corresponds to one entity.
- Many clustering algorithms have been applied.
- The number of clusters is hard to specify in advance.
- Ground truth may be available for some entity pairs → semi-supervised clustering
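One simple baseline for this grouping step is the transitive closure of the duplicate pairs via union-find. This is a sketch of that baseline, not the clustering algorithms the slides refer to; the pairs are illustrative.

```python
# Group duplicate pairs into entity clusters via union-find
# (transitive closure of the pairwise duplicate decisions).
def group_duplicates(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union the two clusters
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

pairs = [("AT&T Corp", "AT&T Corporation"),
         ("AT&T Corporation", "AT&T Inc"),
         ("IBM", "IBM Corp")]
print(group_duplicates(pairs))
```

Transitive closure can over-merge through chains of borderline pairs, which is one reason the slides turn to proper clustering with tunable cluster closeness.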
47 Entity Resolution
- Create Groups of Duplicates
- Agglomerative clustering: repeatedly merge the closest clusters.
- The definition of cluster closeness is subject to tuning: average / max / min similarity.
- Efficient implementations are possible using special data structures.
48 Entity Resolution
- Challenges
- Collective entity resolution: consider relationships between entities and propagate resolution decisions along these relationships → use Markov Logic Networks [Parag & Domingos 2005]
- Mapping to existing background knowledge: an ontology of real-world entities may be given; map entities / clusters of entities to ontology entries → k-nearest-neighbor methods
49 Information Extraction
- References
- Eugene Agichtein, Luis Gravano: Snowball: Extracting Relations from Large Plain-Text Collections, ACM DL 2000
- Eugene Agichtein, Sunita Sarawagi: Scalable Information Extraction and Integration, Tutorial, KDD 2006
- Eugene Charniak: Statistical Techniques for Natural Language Parsing, AI Magazine 18(4), 1997
- S. Chaudhuri, R. Ramakrishnan, G. Weikum: Integrating DB and IR Technologies: What Is the Sound of One Hand Clapping?, CIDR 2005
- Ronen Feldman: Information Extraction - Theory and Practice, Tutorial, ICML 2006
50 Information Extraction
- References
- John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, ICML 2001
- L. R. Rabiner: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proc. IEEE 77(2), 1989