Kein Folientitel - PowerPoint PPT Presentation

Slides: 51
Provided by: est138

1
Information Extraction
Martin Ester, Simon Fraser University, School of Computing Science
CMPT 884, Spring 2009
2
Information Extraction
  • Outline
  • Introduction: motivation, applications, issues
  • Entity extraction: hand-coded, machine
    learning
  • Relation extraction: supervised, partially
    supervised
  • Entity resolution: string similarity, finding
    similar pairs, creating groups
  • Future research
  • → [Feldman 2006], [Agichtein & Sarawagi 2006]

3
Introduction
  • Motivation
  • 80% of all human-generated data is natural
    language text
  • search engines return whole documents, requiring
    the user to read the documents and manually
    extract the relevant information (entities, facts,
    . . .) → very time-consuming
  • need for automatic extraction of such
    information from collections of natural language
    text documents → information extraction (IE)

4
Introduction
  • Definitions
  • Entity: an object of interest such as a person
    or organization.
  • Attribute: a property of an entity such as its
    name, alias, descriptor, or type.
  • Relation: a relationship held between two or
    more entities, such as the Position of a Person in
    a Company.
  • Event: an activity involving several entities,
    such as a terrorist act, aircraft crash,
    management change, or new product introduction.

5
Introduction
  • Example

6
Introduction
  • Applications
  • question answering: "Who is the president of
    the US?", "Where was Martin Luther born?"
  • automatic creation of databases,
    e.g., a database of protein localizations
    or of adverse reactions to a drug
  • opinion mining: analyzing online product
    reviews to get user feedback

7
Introduction
  • Challenges
  • Complexity of natural language: e.g.,
    identifying word and sentence boundaries is
    fairly easy in European languages, much
    harder in Chinese / Japanese
  • Ambiguity of natural language: e.g., homonyms
  • Diversity of natural language:
    many ways of expressing a given piece of
    information, e.g., synonyms
  • Diversity of writing styles:
    e.g., scientific papers, newspaper articles,
    maintenance reports, emails, . . .

8
Introduction
  • Challenges
  • names are hard to discover
  • impossible to enumerate
  • new candidates are generated all the time
  • hard to provide syntactic rules
  • types of proper names
  • people
  • companies
  • products
  • genes
  • . . .

9
Introduction
  • Architecture of IE System

[Figure: architecture of an IE system, divided into local analysis and discourse (global) analysis]

10
Introduction
  • Knowledge Engineering Approach
  • Extraction rules are hand-crafted by linguists in
    cooperation with domain experts.
  • Most of the work is done by inspecting a set of
    relevant documents.
  • Development of rule set is very time-consuming.
  • Requires substantial CS and domain expertise.
  • Rule sets are domain-specific, do not transfer
    to other domains.
  • Knowledge engineering (KE) approach often
    achieves higher accuracy than machine
    learning approach.

11
Introduction
  • Machine Learning Approach
  • Automatically learn model (rules) from
    annotated training corpus.
  • Techniques based on pure statistics and little
    linguistic knowledge.
  • No CS expertise required when building model.
  • However, creating the annotated corpus is very
    laborious, since a very large number of
    training examples is needed.
  • Transfer to other domains is easier than KE
    approach.
  • Accuracy of machine learning (ML) approach is
    typically lower.

12
Introduction
  • Topics Not Covered
  • co-reference resolution: e.g., an article
    referencing a noun (entity) of another sentence
  • event extraction: an event has a type, actor,
    time, . . .
  • sentiment detection: a certain statement
    (opinion) is classified as positive / negative

13
Entity Extraction
  • Lexical Analysis
  • breaking up the input document into individual
    words (tokens)
  • token: sequence of characters treated as a unit
  • punctuation marks are also considered
    tokens, e.g., "," (comma)
  • often, regular expressions are used to define the
    format of a token
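
A minimal sketch of regular-expression tokenization as described above. The token format used here (a run of word characters, or a single punctuation character) is an assumption for illustration; the slides do not specify one.

```python
import re

# Assumed token format: a run of word characters, or any single
# character that is neither a word character nor whitespace.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Dr. Laura Haas works at IBM, in San Jose."))
```

Punctuation marks such as "." and "," come out as separate tokens, which is what later rule-based or statistical stages usually expect.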

14
Entity Extraction
  • Syntactic Analysis
  • part-of-speech tagging [Charniak 1997]
  • marking up the tokens in a text as
    corresponding to a particular part of speech
    (POS), based on both their definition and
    their context
  • coarse POS tags, e.g., N, V, A, Aux, . . .
  • finer POS tags:
    - PRP: personal pronouns (you, me, she, he, them, him, her, . . .)
    - PRP$: possessive pronouns (my, our, her, his, . . .)
    - NN: singular common nouns (sky, door, theorem, . . .)
    - NNS: plural common nouns (doors, theorems, women, . . .)
    - NNP: singular proper names (Fifi, IBM, Canada, . . .)
    - NNPS: plural proper names (Americas, Carolinas, . . .)

15
Entity Extraction
  • Syntactic Analysis
  • Words often have more than one POS, e.g., "back":
  • The back door: JJ
  • On my back: NN
  • Win the voters back: RB
  • Promised to back the bill: VB
  • The POS tagging problem is to determine the POS
    tag for a particular instance of a word.
  • e.g., input: the lead paint is unsafe
  • output: the/Det lead/N paint/N is/V unsafe/Adj

16
Entity Extraction
  • Knowledge Engineering Approach [Chaudhuri 2005]
  • hand-coded rules are often relatively
    straightforward
  • easy to incorporate domain knowledge
  • require substantial CS expertise
  • example rule:
    <token>INITIAL</token> <token>DOT</token>
    <token>CAPSWORD</token> <token>CAPSWORD</token>
    → finds person names with a salutation and two
    capitalized words, e.g., Dr. Laura Haas
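
The example rule can be approximated with a single regular expression. This is only a sketch: the salutation list and the definition of a capitalized word are assumptions made here, not the token classes of the original rule language.

```python
import re

# Assumed token classes (hypothetical, not from the slides).
SALUTATION = r"(?:Dr|Mr|Mrs|Ms|Prof)"
CAPSWORD = r"[A-Z][a-z]+"

# salutation, dot, then two capitalized words
PERSON_RE = re.compile(rf"\b{SALUTATION}\.\s+({CAPSWORD})\s+({CAPSWORD})\b")

def find_person_names(text):
    """Return 'First Last' for every salutation + two capitalized words."""
    return [" ".join(m.groups()) for m in PERSON_RE.finditer(text)]

print(find_person_names("We thank Dr. Laura Haas of IBM for her comments."))
```

Even this small rule shows why hand-coded rule sets grow quickly: middle initials, hyphenated names, and names without a salutation would each need their own variant.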

17
Entity Extraction
  • Knowledge Engineering Approach
  • a more complex example: conference name
  • my $wordOrdinals="(?:first|second|third|fourth|fifth|sixth|seventh|eighth|ninth|tenth|eleventh|twelfth|thirteenth|fourteenth|fifteenth)";
  • my $numberOrdinals="(?:\\d?(?:1st|2nd|3rd|1th|2th|3th|4th|5th|6th|7th|8th|9th|0th))";
  • my $ordinals="(?:$wordOrdinals|$numberOrdinals)";
  • my $confTypes="(?:Conference|Workshop|Symposium)";
  • my $words="(?:[A-Z]\\w+\\s*)"; # a word starting
    with a capital letter and ending with 0 or more
    spaces
  • my $confDescriptors="(?:international\\s+|[A-Z]+\\s+)";
    # e.g. "International Conference ..." or the conference
    name for workshops (e.g. "VLDB Workshop ...")
  • my $connectors="(?:on|of)";
  • my $abbreviations="(?:\\([A-Z]\\w\\w+[\\W\\s]*?(?:\\d\\d+)?\\))";
    # abbreviations like "(SIGMOD'06)"
  • my $fullNamePattern="((?:$ordinals\\s+$words*|$confDescriptors)?$confTypes(?:\\s+$connectors\\s+.*?|\\s+)?$abbreviations?)(?:\\n|\\r|\\.|<)"; . . .

18
Entity Extraction
  • Machine Learning Approach
  • We can view named entity extraction as a
    sequence classification problem: classify
    each word as belonging to one of the named
    entity classes or to the noname class.
  • The class label of a sequence element depends on
    the neighboring ones.
  • One of the most popular techniques for
    classifying sequences is Hidden Markov
    Models (HMM).
  • Another popular ML method for entity extraction:
    Conditional Random Fields [Lafferty et al.
    2001].
  • Requires large enough labeled (annotated)
    training dataset.

19
Entity Extraction
  • Hidden Markov Models [Rabiner 1989]
  • HMM (Hidden Markov Model) is a finite state
    automaton with stochastic state transitions
    and symbol emissions.
  • The automaton models a probabilistic generative
    process.
  • In this process a sequence of symbols is
    produced by starting in an initial state,
    transitioning to a new state, emitting a
    symbol selected by the state and repeating this
    transition/emission cycle until a designated
    final state is reached.
  • Very successful in many sequence classification
    tasks.

20
Entity Extraction
  • Example
  • HMM for addresses

21
Entity Extraction
  • Hidden Markov Models
  • $T$: length of the sequence of observations
    (training set)
  • $N$: number of states in the model
  • $q_t$: the actual state at time $t$
  • $S = \{S_1, \dots, S_N\}$ (finite set of possible states)
  • $V = \{O_1, \dots, O_M\}$ (finite set of observation
    symbols)
  • $\pi = \{\pi_i\}$, $\pi_i = P(q_1 = S_i)$: starting probabilities
  • $A = \{a_{ij}\}$, $a_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$: transition
    probabilities
  • $B = \{b_i(O_t)\}$, $b_i(O_t) = P(O_t \mid q_t = S_i)$: emission
    probabilities
  • $\lambda = (\pi, A, B)$: hidden Markov model
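
To make the generative process concrete, here is a small sketch that samples from a toy address HMM. All states, symbols, and probabilities are invented for illustration. The three dictionaries play the roles of the starting, transition, and emission probabilities, and the hypothetical state "END" stands in for the designated final state.

```python
import random

# Toy address HMM; every number below is made up for illustration.
START = {"Number": 0.8, "Street": 0.2}           # starting probabilities
TRANS = {                                         # TRANS[current][next]
    "Number": {"Street": 1.0},
    "Street": {"Street": 0.4, "City": 0.6},
    "City": {"City": 0.3, "END": 0.7},
}
EMIT = {                                          # EMIT[state][symbol]
    "Number": {"12": 0.5, "221": 0.5},
    "Street": {"Main": 0.4, "Baker": 0.3, "Street": 0.3},
    "City": {"Burnaby": 0.5, "London": 0.5},
}

def draw(dist, rng):
    """Sample one key of dist according to its probability."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(rng):
    """Emit (state, symbol) pairs until the final state END is reached."""
    state = draw(START, rng)
    sequence = []
    while state != "END":
        sequence.append((state, draw(EMIT[state], rng)))
        state = draw(TRANS[state], rng)
    return sequence

print(generate(random.Random(0)))
```

In entity extraction the direction is reversed: the symbols (words) are observed and the states (entity classes) must be inferred.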

22
Entity Extraction
  • Hidden Markov Models
  • How to find $P(O \mid \lambda)$, the probability of an
    observation sequence given the HMM? →
    forward-backward algorithm
  • How to find the $\lambda$ that maximizes $P(O \mid \lambda)$? This
    is the task of the training phase. → Baum-Welch
    algorithm
  • How to find the most likely state trajectory
    given $\lambda$ and $O$?
    This is the task of the test phase. → Viterbi
    algorithm
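
A compact sketch of two of these computations on a toy two-state tagger (O = not a name, PER = person name). The model parameters are invented for illustration, and only the forward pass is shown for the first problem. Baum-Welch training is omitted.

```python
# Toy model; all probabilities are assumptions for illustration.
STATES = ["O", "PER"]
START = {"O": 0.8, "PER": 0.2}
TRANS = {"O": {"O": 0.7, "PER": 0.3}, "PER": {"O": 0.4, "PER": 0.6}}
EMIT = {
    "O": {"met": 0.5, "laura": 0.05, "haas": 0.05, "today": 0.4},
    "PER": {"met": 0.05, "laura": 0.45, "haas": 0.45, "today": 0.05},
}

def forward(obs):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for o in obs[1:]:
        alpha = {s: EMIT[s][o] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely state sequence and its joint probability with obs."""
    delta = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: delta[r] * TRANS[r][s])
            ptr[s] = best
            new_delta[s] = delta[best] * TRANS[best][s] * EMIT[s][o]
        backptrs.append(ptr)
        delta = new_delta
    last = max(STATES, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptrs):   # follow back-pointers to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]

OBS = ["met", "laura", "haas", "today"]
print(viterbi(OBS))
print(forward(OBS))
```

For longer sequences a practical implementation would work with log probabilities to avoid numerical underflow.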

23
Relation Extraction
  • Example

Source text:

Microsoft's central headquarters in Redmond is
home to almost every product group and division.

Brent Barlow, 27, a software analyst and
beta-tester at Apple Computer headquarters in
Cupertino, was fired Monday for "thinking a
little too different."

Apple's programmers "think different" on a
"campus" in Cupertino, Cal. Nike employees "just
do it" at what the company refers to as its
"World Campus," near Portland, Ore.

Extracted relation:

Organization      Location
Microsoft         Redmond
Apple Computer    Cupertino
Nike              Portland

24
Relation Extraction
  • Introduction
  • No single source contains all the relations
  • Each relation appears on many web pages
  • There are repeated patterns in the way
    relations are represented on web pages →
    exploit redundancy
  • Components of a relation appear close
    together → use the context of occurrence of a
    relation to determine patterns
  • pattern: consists of constants (tokens) and
    variables (placeholders for entities)
  • tuple: instance / occurrence of a relation

25
Relation Extraction
  • Introduction
  • Typically requires entity extraction (tagging)
    as preprocessing
  • Knowledge engineering approach
  • - patterns defined over lexical items:
    <company> located in <location>
  • - patterns defined over parsed text:
    ((Obj <company>) (Verb located) (*)
    (Subj <location>))
  • Machine learning approach
  • - learn rules/patterns from examples
  • - partially supervised: bootstrap from example
    tuples
  • [Agichtein & Gravano 2000], [Etzioni et
    al. 2004]

26
Relation Extraction
  • Snowball [Agichtein & Gravano 2000]
  • Exploit duality between patterns and tuples
  • - find tuples that match a set of patterns
  • - find patterns that match a lot of tuples
  • → bootstrapping approach

[Figure: Snowball cycle with the stages Initial Seed Tuples, Occurrences of Seed Tuples, Tag Entities, Generate New Seed Tuples, Generate Extraction Patterns, Augment Table]
27
Relation Extraction
  • Snowball
  • how to represent patterns of occurrences?

[Figure: initial seed tuples and their occurrences in text]

Occurrences of seed tuples:
Computer servers at Microsoft's headquarters in Redmond
In mid-afternoon trading, share of Redmond-based Microsoft fell
The Armonk-based IBM introduced a new line
The combined company will operate from Boeing's headquarters in Seattle.
Intel, Santa Clara, cut prices of its Pentium processor.
28
Relation Extraction
  • Patterns
  • an (extraction) pattern has the format <left, tag1,
    middle, tag2, right>,
  • where tag1, tag2 are named-entity tags and
    left, middle, and right are vectors of weighted
    terms
  • patterns derived directly from occurrences are
    too specific

Example occurrence: ORGANIZATION 's central headquarters in LOCATION is home to...
Resulting pattern < left, tag1, middle, tag2, right >:
tag1 = ORGANIZATION
middle = {<'s 0.5>, <central 0.5>, <headquarters 0.5>, <in 0.5>}
tag2 = LOCATION
right = {<is 0.75>, <home 0.75>}
29
Relation Extraction
  • Pattern Clusters
  • cluster patterns, cluster centroids define
    patterns

30
Relation Extraction
  • Evaluation of Patterns
  • How good are new extraction patterns?
  • Measure their performance through their accuracy
    vs. the initial seed tuples (ground truth).

Extraction with pattern "ORGANIZATION, LOCATION":
Boeing, Seattle, said . . . (Positive)
Intel, Santa Clara, cut prices . . . (Positive)
invest in Microsoft, New York-based analyst Jane Smith said . . . (Negative)
31
Relation Extraction
  • Evaluation of Patterns
  • Trust only patterns with high support and
    confidence, i.e. that produce many correct
    (positive) tuples and only a few false
    (negative) tuples.
  • $\mathrm{conf}(p) = \mathrm{pos}(p) / (\mathrm{pos}(p) + \mathrm{neg}(p))$, where $p$
    denotes a pattern and $\mathrm{pos}(p)$, $\mathrm{neg}(p)$ denote the
    numbers of positive and negative tuples produced
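
A small sketch of the confidence computation. It assumes that the seed tuples map each organization to exactly one location, that a tuple is negative when it contradicts a seed, and that tuples about unknown organizations are ignored. The slides only say that patterns are checked against the seed tuples, so these details are assumptions.

```python
def pattern_confidence(extracted, seeds):
    """conf(p) = pos / (pos + neg) for the tuples one pattern extracted."""
    pos = neg = 0
    for org, loc in extracted:
        if org not in seeds:
            continue          # unknown organization: counts as neither
        if seeds[org] == loc:
            pos += 1
        else:
            neg += 1
    return pos / (pos + neg) if pos + neg else 0.0

# Hypothetical seed table and extractions for a pattern "ORGANIZATION, LOCATION".
seeds = {"Boeing": "Seattle", "Intel": "Santa Clara", "Microsoft": "Redmond"}
extracted = [("Boeing", "Seattle"), ("Intel", "Santa Clara"), ("Microsoft", "New York")]
print(pattern_confidence(extracted, seeds))
```

With two positive tuples and one negative tuple, the confidence is 2/3.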

32
Relation Extraction
  • Evaluation of Tuples
  • Trust only tuples that match many patterns.
  • Suppose candidate tuple t matches patterns p1
    and p2. What is the probability that t is a
    valid tuple?
  • Assume matches of different patterns are
    independent events.
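  • $\Pr[t \text{ matches } p_1 \text{ and } t \text{ is not valid}] = 1 - \mathrm{conf}(p_1)$
  • $\Pr[t \text{ matches } p_2 \text{ and } t \text{ is not valid}] = 1 - \mathrm{conf}(p_2)$
  • $\Pr[t \text{ matches } p_1, p_2 \text{ and } t \text{ is not valid}] = (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
  • $\Pr[t \text{ matches } p_1, p_2 \text{ and } t \text{ is valid}] = 1 - (1 - \mathrm{conf}(p_1))(1 - \mathrm{conf}(p_2))$
  • If tuple $t$ matches a set of patterns $P$:
    $\mathrm{conf}(t) = 1 - \prod_{p \in P} (1 - \mathrm{conf}(p))$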
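
The tuple confidence is a one-liner once the pattern confidences are known. The sketch below assumes the caller passes the confidences of exactly those patterns that matched the tuple.

```python
import math

def tuple_confidence(pattern_confs):
    """1 minus the product of (1 - conf(p)) over the matching patterns."""
    return 1.0 - math.prod(1.0 - c for c in pattern_confs)

# A tuple matched by two patterns with confidences 0.5 and 0.8:
print(tuple_confidence([0.5, 0.8]))
```

Each additional matching pattern can only increase the confidence, which is why tuples supported by many patterns are trusted more.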

33
Relation Extraction
  • Snowball Algorithm
  • 1. Start with seed set R of tuples
  • 2. Generate set P of patterns from R
  • compute support and confidence for each pattern
    in P
  • discard patterns with low support or confidence
  • 3. Generate new set T of tuples matching patterns
    P
  • compute confidence of each tuple in T
  • add to R the tuples t in T with
    conf(t) > threshold.
  • 4. Go back to step 2
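
The loop can be sketched on a toy corpus. This is a strong simplification and not the published system: entity tagging is replaced by two hypothetical name lists, a pattern is just the literal text between an organization and a location rather than a clustered vector of weighted terms, and the thresholds are arbitrary.

```python
import math
import re

# Hypothetical dictionaries standing in for a named-entity tagger.
ORGS = ["Microsoft", "Boeing", "Intel", "IBM", "Apple"]
LOCS = ["Redmond", "Seattle", "Santa Clara", "Armonk", "Cupertino", "New York"]
ORG_RE = "|".join(map(re.escape, ORGS))
LOC_RE = "|".join(map(re.escape, LOCS))

CORPUS = [
    "Computer servers at Microsoft's headquarters in Redmond were upgraded.",
    "The combined company will operate from Boeing's headquarters in Seattle.",
    "Apple's headquarters in Cupertino host the event.",
    "Intel, Santa Clara, cut prices.",
    "IBM, Armonk, introduced a new line.",
    "Analysts invest in Microsoft, New York remains cautious.",
]
SEEDS = {"Microsoft": "Redmond", "Boeing": "Seattle", "Intel": "Santa Clara"}

def find_patterns(corpus, known):
    """Step 2: middle contexts connecting a known (org, loc) pair."""
    patterns = set()
    for org, loc in known.items():
        rx = re.compile(re.escape(org) + r"(\W[^.]{0,30}?)" + re.escape(loc))
        for sentence in corpus:
            patterns.update(m.group(1) for m in rx.finditer(sentence))
    return patterns

def apply_pattern(corpus, middle):
    """Step 3: all (org, loc) tuples that occur with this middle context."""
    rx = re.compile(f"({ORG_RE}){re.escape(middle)}({LOC_RE})")
    return {m.groups() for sentence in corpus for m in rx.finditer(sentence)}

def snowball(corpus, seeds, min_support=2, min_conf=0.6, threshold=0.6, rounds=3):
    known = dict(seeds)
    for _ in range(rounds):
        conf = {}
        for p in find_patterns(corpus, known):
            hits = [t for t in apply_pattern(corpus, p) if t[0] in known]
            pos = sum(known[org] == loc for org, loc in hits)
            if pos >= min_support and pos / len(hits) >= min_conf:
                conf[p] = pos / len(hits)
        miss = {}   # tuple -> product of (1 - conf(p)) over matching patterns
        for p, c in conf.items():
            for t in apply_pattern(corpus, p):
                miss[t] = miss.get(t, 1.0) * (1.0 - c)
        new = {org: loc for (org, loc), m in miss.items()
               if org not in known and 1.0 - m > threshold}
        if not new:
            break
        known.update(new)
    return known

print(snowball(CORPUS, SEEDS))
```

On this toy corpus the context "'s headquarters in " is kept, because it reproduces two seeds without contradiction, and it adds the Apple tuple. The context ", " is discarded, because it also yields a tuple that contradicts the Microsoft seed, so the IBM tuple is not accepted.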

34
Relation Extraction
  • Discussion
  • bootstrapping approach requires only a
    relatively small number of training tuples
    (semi-supervised)
  • is effective for binary, 1:1 relations
  • bootstrapping approach has been adopted by lots
    of subsequent work
  • pattern evaluation is heuristic and has no theory
    behind it
  • → Statistical Snowball, WWW 09
  • what about n-ary relations?
  • what about 1:m relations?

35
Entity Resolution
  • Introduction

36
Entity Resolution
  • Introduction
  • Entity resolution
    - map entity mentions to the corresponding entities
    - entities stored in a database or ontology
  • Challenges
    - large lists with multiple noisy mentions of
      the same entity
    - no single attribute to order or cluster likely
      duplicates while separating them from similar
      but different entities
    - need to depend on fuzzy and computationally
      expensive string similarity functions

37
Entity Resolution
  • Introduction
  • Typical approach
    - define string similarity: numeric attributes
      are easy to compare, string attributes are hard
      and require approximate matching
    - find similar pairs of entities
    - create groups from duplicate entity pairs
      (clustering)

38
Entity Resolution
  • String Similarity
  • Token-based
  • Jaccard and TF-IDF cosine similarities
  • → suitable for large documents
  • Character-based
  • Edit distance and variants like Levenshtein,
    Jaro-Winkler; Soundex
  • → suitable for short strings with spelling
    mistakes
  • Hybrids

39
Entity Resolution
  • Token-Based String Similarity
  • Tokens/words
  • "AT&T Corporation" → {AT&T, Corporation}
  • Similarity: various measures of overlap of two
    sets S, T
  • $\mathrm{Jaccard}(S,T) = |S \cap T| / |S \cup T|$
  • Example
  • S = "AT&T Corporation" → {AT&T, Corporation}
  • T = "AT&T Corp" → {AT&T, Corp}
  • Jaccard(S,T) = 1/3
  • Variants: weights attached to each token
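
A minimal sketch of token-based Jaccard similarity, assuming whitespace tokenization (the slides do not fix a tokenizer).

```python
def jaccard(s, t):
    """Size of the token intersection divided by size of the token union."""
    a, b = set(s.split()), set(t.split())
    return len(a & b) / len(a | b) if a | b else 1.0

print(jaccard("AT&T Corporation", "AT&T Corp"))
```

The two names share one of three distinct tokens, which gives the value 1/3 from the example.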

40
Entity Resolution
  • Token-Based String Similarity
  • Sets transformed to vectors with each term as
    dimension
  • Cosine similarity: dot product of two vectors,
    each normalized to unit length
  • → cosine of the angle between them
  • Term weight: TF/IDF = $\log(tf + 1) \cdot \log(idf)$,
    where
  • tf = frequency of the term in a document d
  • idf = number of documents / number of
    documents containing the term
  • → rare terms are more important
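
A sketch of TF-IDF cosine similarity using the weight log(tf + 1) · log(idf) given above. The lower-casing, the whitespace tokenizer, and the six-name corpus are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse vector (dict term -> weight) per document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: math.log(tf[t] + 1) * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Dot product of u and v after normalizing both to unit length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

names = ["AT&T Corporation", "AT&T Corp", "IBM Corporation",
         "Intel Corporation", "Nike Corp", "Apple Corp"]
vecs = tfidf_vectors(names)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Because "corporation" occurs in more names than "at&t", it receives a lower weight, so the two AT&T variants end up closer to each other than "AT&T Corporation" is to "IBM Corporation". A term that occurs in every document gets weight zero under this formula.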

41
Entity Resolution
  • Token-Based String Similarity
  • Widely used in traditional IR
  • Example
  • AT&T Corporation, AT&T Corp, or AT&T Inc
  • low weights for Corporation, Corp, Inc;
    higher weight for AT&T

42
Entity Resolution
  • Character-Based String Similarity
  • Given two strings S, T: edit(S,T) =
    minimum cost of a sequence of operations that
    transforms S into T.
  • Character operations: I (insert), D (delete), R
    (replace).
  • Example: edit(Error, Eror) = 1, edit(great, grate)
    = 2
  • Dynamic programming algorithm to compute edit()
  • Several variants (gaps, weights) → some variants
    become NP-complete
  • Varying costs of operations can be learnt
  • Suitable for common typing mistakes on small
    strings
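
A sketch of the dynamic programming algorithm for the basic case, assuming unit cost for each insert, delete, and replace operation.

```python
def edit_distance(s, t):
    """Minimum number of inserts, deletes, and replaces turning s into t."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        cur = [i]                           # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # replace, or match for free
        prev = cur
    return prev[-1]

print(edit_distance("Error", "Eror"), edit_distance("great", "grate"))
```

The table has one cell per pair of prefixes, so the running time is proportional to the product of the two string lengths. Only two rows are kept in memory at a time.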

43
Entity Resolution
  • Find Duplicate Pairs
  • Input: a large list of entities with string
    attributes
  • Output: all pairs (S,T) of entities which
    satisfy a similarity criterion such as
  • Jaccard(S,T) > 0.7
  • Edit-distance(S,T) < k
  • Naive method: for each record pair, compute the
    similarity score
  • I/O and CPU intensive, not scalable to millions
    of entities
  • Goal: reduce the $O(n^2)$ cost to $O(n \cdot w)$, where $w \ll n$
  • Reduce the number of pairs on which similarity is
    computed

44
Entity Resolution
  • Find Duplicate Pairs
  • Method: filter and refinement
  • Use an inexpensive filter to filter out as many
    pairs as possible, e.g.,
    $\mathrm{EditDistance}(s,t) \le d \Rightarrow |\text{q-grams}(s) \cap \text{q-grams}(t)| \ge \max(|s|,|t|) - (d-1)q - 1$
  • q-gram: subsequence of q consecutive
    characters, e.g., 3-grams for "AT&T
    Corporation": "AT&", "T&T", "&T ", "T C", " Co", "Cor",
    "orp", "rpo", "por", "ora", "rat", "ati", "tio", "ion"
  • If a pair (s, t) does not satisfy the filter, it
    cannot satisfy the similarity
    criterion, e.g.,
    $|\text{q-grams}(s) \cap \text{q-grams}(t)| < \max(|s|,|t|) - (d-1)q - 1 \Rightarrow \mathrm{EditDistance}(s,t) > d$

45
Entity Resolution
  • Find Duplicate Pairs
  • The filter does not have to be applied to all
    pairs of entities: use an index to retrieve the
    subset of entities that share q-grams
  • Apply the expensive similarity function only
    to pairs that survive the filter step, e.g.,
    EditDistance(s,t)
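
A sketch of the filter step with an inverted q-gram index. It assumes q-grams are taken from strings padded with q - 1 special characters on each side and counted as multisets, which is the setting in which the count bound is usually stated. The padding characters "#" and "$" are assumed not to occur in the data.

```python
from collections import Counter, defaultdict
from itertools import combinations

def qgrams(s, q=3):
    """Multiset of q-grams of s, padded with q - 1 marker characters per side."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))

def passes_filter(s, t, d, q=3):
    """Count filter: necessary condition for EditDistance(s, t) <= d."""
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= max(len(s), len(t)) - (d - 1) * q - 1

def candidate_pairs(names, d, q=3):
    """Pairs that share a q-gram (via the index) and survive the filter."""
    index = defaultdict(set)                 # q-gram -> ids of names containing it
    for i, name in enumerate(names):
        for gram in qgrams(name, q):
            index[gram].add(i)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    # Caveat: very short strings can be within distance d without sharing
    # any q-gram; this sketch would miss such pairs.
    return sorted((names[i], names[j]) for i, j in pairs
                  if passes_filter(names[i], names[j], d, q))

names = ["AT&T Corporation", "AT&T Corporaton", "IBM Corp"]
print(candidate_pairs(names, d=1))
```

Only the surviving pairs would then be passed to the expensive edit distance computation.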

46
Entity Resolution
  • Create Groups of Duplicates
  • Given pairs of duplicate entities
  • Group them such that each group corresponds to
    one entity
  • Many clustering algorithms have been applied
  • Number of clusters hard to specify in advance
  • Ground truth may be available for some entity
    pairs → semi-supervised clustering

47
Entity Resolution
  • Create Groups of Duplicates
  • Agglomerative clustering: repeatedly merge the
    closest clusters
  • Definition of closeness of clusters is subject to
    tuning: Average/Max/Min similarity
  • Efficient implementations are possible using
    special data structures
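
A naive sketch of agglomerative clustering with a pluggable closeness definition. The stopping threshold, the token Jaccard similarity, and the example names are assumptions for illustration, and no special data structures are used, so the loop is far from efficient.

```python
from statistics import mean

def jaccard(s, t):
    """Token-based Jaccard similarity (whitespace tokens)."""
    a, b = set(s.split()), set(t.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def agglomerative(items, sim, threshold, linkage=mean):
    """Merge the two closest clusters until no pair reaches the threshold.

    linkage aggregates pairwise similarities: mean (average), max, or min.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = linkage([sim(a, b) for a in clusters[i] for b in clusters[j]])
                if score >= best:
                    best, pair = score, (i, j)
        if pair is None:
            break                       # no remaining pair is similar enough
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

names = ["Apple Computer Inc", "Apple Computer", "Nike Inc", "Nike"]
print(agglomerative(names, jaccard, threshold=0.4))
```

Stopping at a similarity threshold avoids having to specify the number of clusters in advance.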

48
Entity Resolution
  • Challenges
  • Collective entity resolution: consider
    relationships between entities and
    propagate resolution decisions along these
    relationships → use Markov Logic Networks [Parag
    & Domingos 2005]
  • Mapping to existing background knowledge:
    an ontology of real-world entities may be given;
    map entities / clusters of entities to ontology
    entries → k-nearest neighbor methods

49
Information Extraction
  • References
  • Eugene Agichtein, Luis Gravano: Snowball:
    Extracting Relations from Large Plain-Text
    Collections, ACM DL, 2000
  • Eugene Agichtein, Sunita Sarawagi: Scalable
    Information Extraction and Integration,
    Tutorial KDD 2006
  • Eugene Charniak: Statistical Techniques for
    Natural Language Parsing, AI Magazine 18(4),
    1997
  • S. Chaudhuri, R. Ramakrishnan, G. Weikum:
    Integrating DB and IR Technologies: What is
    the Sound of One Hand Clapping?, CIDR 2005
  • Ronen Feldman: Information Extraction: Theory and
    Practice, Tutorial ICML 2006

50
Information Extraction
  • References
  • John Lafferty, Andrew McCallum, Fernando
    Pereira: Conditional Random Fields:
    Probabilistic Models for Segmenting and Labeling
    Sequence Data, ICML 2001
  • L. R. Rabiner: A Tutorial on Hidden Markov
    Models and Selected Applications in Speech
    Recognition, Proc. IEEE 77(2), 1989