Towards Web-Scale Information Extraction

Eugene Agichtein, KDD Webinar: Towards Web-Scale Information Extraction

1
Towards Web-Scale Information Extraction
  • Eugene Agichtein
  • Mathematics & Computer Science Dept.
  • Emory University
  • eugene_at_mathcs.emory.edu
  • http://www.mathcs.emory.edu/eugene/

2
The Value of Text Data
  • Unstructured text data is the primary form of
    human-generated information
  • Blogs, web pages, news, scientific literature,
    online reviews, ...
  • Semi-structured data (database-generated): see
    Prof. Bing Liu's KDD webinar,
    http://www.cs.uic.edu/liub/WCM-Refs.html
  • The techniques discussed here are complementary
    to structured object extraction methods
  • Need to extract structured information to
    effectively manage, search, and mine the data
  • Information Extraction: a mature but active
    research area
  • Intersection of Computational Linguistics,
    Machine Learning, Data mining, Databases, and
    Information Retrieval
  • Traditional focus on accuracy of extraction

3
Example: Answering Queries Over Text
For years, Microsoft Corporation CEO Bill Gates
was against open source. But today he appears to
have changed his mind. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access." Richard
Stallman, founder of the Free Software
Foundation, countered saying...
Query: SELECT Name FROM PEOPLE WHERE Organization = 'Microsoft'
PEOPLE
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Result: Bill Gates, Bill Veghte
(from William Cohen's IE tutorial, 2003)
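Once the entities are in a table, the query on this slide is ordinary SQL. The following is a minimal sketch, assuming the three PEOPLE rows have already been extracted; the in-memory SQLite database and the function name `query_names` are illustrative choices, not part of the original tutorial.

```python
import sqlite3

# Rows mirror the PEOPLE table on the slide. In practice they would be
# produced by an extraction system rather than typed in by hand.
PEOPLE = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Software Foundation"),
]


def query_names(organization):
    """Return the names of all extracted people at the given organization."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, title TEXT, organization TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?, ?)", PEOPLE)
    rows = conn.execute(
        "SELECT name FROM people WHERE organization = ? ORDER BY name",
        (organization,),
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


print(query_names("Microsoft"))
```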
4
Outline
  • Information Extraction Tasks
  • Entity tagging
  • Relation extraction
  • Event extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining can be most beneficial)
  • Other dimensions of scalability

5
Information Extraction Tasks
  • Extracting entities and relations: this talk
  • Entities: named (e.g., Person) and generic (e.g.,
    disease name)
  • Relations: entities related in a predefined way
    (e.g., Location of a Disease outbreak, a Company
    Merger/Acquisition)
  • Common extraction subtasks
  • Preprocessing: sentence chunking, syntactic
    parsing, morphological analysis
  • Create rules or extraction patterns: hand-coded,
    machine learning, and hybrid
  • Apply extraction patterns or rules to extract new
    information
  • Postprocessing and complex extraction: not
    covered
  • Co-reference resolution
  • Combining Relations into Events and Facts

6
Previous Information Extraction Tutorials
  • See these tutorials for more details
  • R. Feldman, Information Extraction: Theory and
    Practice, ICML 2006,
    http://www.cs.biu.ac.il/feldman/icml_tutorial.html
  • W. Cohen, A. McCallum, Information Extraction and
    Integration: an Overview, KDD 2003,
    http://www.cs.cmu.edu/wcohen/ie-survey.ppt
  • A. Doan, R. Ramakrishnan, S. Vaithyanathan,
    Managing Information Extraction, SIGMOD 2006

7
Entity Tagging
  • Identifying mentions of entities (e.g., person
    names, locations, companies) in text
  • MUC (1997): Person, Location, Organization,
    Date/Time/Currency
  • ACE (2005): more than 100 more specific types
  • Hand-coded vs. Machine Learning approaches
  • Best approach depends on entity type and domain
  • Closed class (e.g., geographical locations,
    disease names, gene and protein names): hand-coded
    dictionaries
  • Syntactic (e.g., phone numbers, zip codes):
    regular expressions
  • Semantic (e.g., person and company names):
    mixture of context, syntactic features,
    dictionaries, heuristics, etc.
  • Almost solved for common/typical entity types

8
Example: Extracting Entities from Text
  • Useful for data warehousing, data cleaning, web
    data integration

Address:
4089 Whispering Pines Nobel Drive San Diego CA 92122
Citation:
Ronald Fagin, Combining Fuzzy Information from
Multiple Systems, Proc. of ACM SIGMOD, 2002
Segment (s_i)  Sequence                                           Label(s_i)
S1             Ronald Fagin                                       Author
S2             Combining Fuzzy Information from Multiple Systems  Title
S3             Proc. of ACM SIGMOD                                Conference
S4             2002                                               Year
9
Hand-Coded Methods
  • Easy to construct in some cases
  • e.g., to recognize prices, phone numbers, zip
    codes, conference names, etc.
  • Intuitive to debug and maintain
  • Especially if written in a high-level language
  • Can incorporate domain knowledge
  • Scalability issues
  • Labor-intensive to create
  • Highly domain-specific
  • Often corpus-specific
  • Rule-matches can be expensive

IBM Avatar
10
Machine Learning Methods
  • Can work well when lots of training data is easy
    to construct
  • Can capture complex patterns that are hard to
    encode with hand-crafted rules
  • e.g., determine whether a review is positive or
    negative
  • extract long complex gene names
  • Non-local dependencies

11
Models (from Cohen & McCallum, 2003)
[Figure: six extraction models, each applied to the sentence "Abraham Lincoln was born in Kentucky."]
  • Lexicons: is the candidate a member of a list
    (Alabama, Alaska, ..., Wisconsin, Wyoming)?
  • Classify Pre-segmented Candidates: a classifier
    decides which class each candidate belongs to
  • Sliding Window: a classifier decides which class
    each window belongs to; try alternate window sizes
  • Boundary Models: classifiers mark BEGIN and END
    positions
  • Finite State Machines: most likely state sequence?
  • Context Free Grammars: most likely parse?
    (NNP, V, P, NP, PP, VP, S)
  • ... and beyond
Any of these models can be used to capture words,
formatting, or both.
12
Popular Machine Learning Methods
For details, see [Feldman, 2006] and [Cohen, 2004]
  • Naive Bayes
  • SRV [Freitag-98], Inductive Logic Programming
  • Rapier [Califf & Mooney-97]
  • Hidden Markov Models [Leek, 1997]
  • Maximum Entropy Markov Models [McCallum et al.,
    2000]
  • Conditional Random Fields [Lafferty et al., 2001]
  • Scalability
  • Can be labor intensive to construct training data
  • Question: how much training data is sufficient?
  • Accuracy rivals hand-coded methods
  • At run time, some features can be expensive to
    construct or process

13
Managing Complex Features [CNS2006]
R. Fagin and J. Helpern, Belief,
awareness, reasoning, In AI 1998
Many large tables
Authors
Ronald Fagin
Steve Cook
S. Sudarshan
S. Chakrabarti
Nick Koudas
R. K. Narayan
E. F. Codd
J. Widom
  1. Batch processing better than individual top-k?
  2. Find top segmentation without top-k matches for
    all segments?

14
Some Available Entity Taggers
  • ABNER
  • http://www.cs.wisc.edu/bsettles/abner/
  • Linear-chain conditional random fields (CRFs)
    with orthographic and contextual features.
  • Alias-I LingPipe
  • http://www.alias-i.com/lingpipe/
  • MALLET
  • http://mallet.cs.umass.edu/index.php/Main_Page
  • Collection of NLP and ML tools, can be trained
    for named entity tagging
  • MinorThird
  • http://minorthird.sourceforge.net/
  • Tools for learning to extract entities,
    categorization, and some visualization
  • Stanford Named Entity Recognizer
  • http://nlp.stanford.edu/software/CRF-NER.shtml
  • CRF-based entity tagger with non-local features

15
Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
  • Statistical Named-Entity Tagger
  • Generative Statistical Model
  • Find most likely tags given lexical and
    linguistic features
  • Lexical and Tag models
  • Explicitly targets scalability
  • 100K tokens/second runtime
  • Pipelined extraction of entities
  • User-defined mentions, pronouns and stop list
  • Specified in a dictionary, left-to-right, longest
    match
  • Can be trained/bootstrapped on annotated corpora
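The "left-to-right, longest match" dictionary strategy mentioned above can be sketched in a few lines. This is not LingPipe's API or implementation; the function name `longest_match_tag`, the tuple-keyed dictionary, and the `max_len` cap are assumptions made for illustration.

```python
def longest_match_tag(tokens, dictionary, max_len=4):
    """Scan left to right; at each position take the longest dictionary entry.

    `dictionary` maps a tuple of lowercase tokens to an entity type.
    Returns (start, end, type) spans with `end` exclusive.
    """
    mentions = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so longer entries win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok.lower() for tok in tokens[i:i + n])
            if key in dictionary:
                match = (i, i + n, dictionary[key])
                break
        if match:
            mentions.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return mentions


entries = {
    ("york",): "LOCATION",
    ("new", "york"): "LOCATION",
    ("new", "york", "times"): "ORGANIZATION",
}
print(longest_match_tag("The New York Times reported from York".split(), entries))
```

Preferring the longest entry is what keeps "New York Times" from being tagged as the location "New York" followed by an untagged "Times".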

16
Outline
  • Overview of Information Extraction
  • Entity tagging
  • Relation extraction
  • Event extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining and ML techniques shine)
  • Other dimensions of scalability

17
Relation Extraction
  • Extract structured relations from text
  • Goal: tuples of entities that are related in a
    predefined way

May 19 1995, Atlanta -- The Centers for Disease
Control and Prevention, which is in the front
line of the world's response to the deadly Ebola
epidemic in Zaire, is finding itself hard
pressed to cope with the crisis...
Information Extraction System (e.g., NYU's Proteus)
Disease Outbreaks in The New York Times
Date       Disease Name     Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  U.K.
Feb. 1995  Pneumonia        U.S.
May 1995   Ebola            Zaire
18
Example: Protein Interactions
From AliBaba
We show that CBF-A and CBF-C interact with each
other to form a CBF-A-CBF-C complex and that
CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
19
Relation Extraction (continued)
  • Often requires entity tagging as preprocessing,
    and tuning or training for each task
  • Knowledge engineering
  • Experts develop rules, patterns
  • can be defined over lexical items: <company>
    located in <location>
  • or over syntactic structures: ((Obj <company>)
    (Verb located) () (Subj <location>))
  • Sophisticated development/debugging environments
    developed
  • Proteus, GATE
  • Machine learning-based approach
  • Supervised: train system over manually labeled
    data
  • Soderland et al. 1997, Muslea et al. 2000, Riloff
    et al. 1996, Roth et al. 2005, Cardie et al. 2006,
    Mooney et al. 2005, ...
  • Partially-supervised: train system by
    bootstrapping from seed examples
  • Agichtein & Gravano 2000, Etzioni et al., 2004,
    Yangarber & Grishman 2001, ...
  • Open (no seeds): Sekine et al. 2006, Cafarella
    et al. 2007, Banko et al. 2007
  • Hybrid or interactive systems
  • Experts interact with machine learning algorithms
    (e.g., active learning family) to iteratively
    refine/extend rules and patterns
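A hand-written lexical pattern of the kind described above, a company mention followed by "located in" and a location mention, might look like this. The sketch assumes entity tagging has already run and marked mentions inline with company and location tags; that markup convention and the function name `extract_located_in` are assumptions for the example, not the format used by Proteus or GATE.

```python
import re

# Lexical pattern over entity-tagged text: <company> located in <location>.
LOCATED_IN = re.compile(
    r"<company>([^<]+)</company>\s+(?:is\s+)?located\s+in\s+"
    r"<location>([^<]+)</location>"
)


def extract_located_in(tagged_text):
    """Return (company, location) tuples matched by the lexical pattern."""
    return [(m.group(1), m.group(2)) for m in LOCATED_IN.finditer(tagged_text)]


sample = (
    "<company>Acme Corp</company> is located in <location>Atlanta</location>. "
    "<company>Globex</company> hired staff in <location>Boston</location>."
)
print(extract_located_in(sample))
```

The second sentence is skipped because its wording falls outside the pattern, which illustrates why knowledge-engineered rules need tuning for each task.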

20
GATE Information Extraction Development
Environment
21
Comparison of Information Extraction Approaches
  • Knowledge Engineering
  • Significant effort required for each task and
    domain
  • developed by experienced language engineers
  • make use of human intuition
  • requires only small amount of training data
  • development could be very time consuming
  • some changes may be hard to accommodate
  • Machine Learning
  • Use statistics or other machine learning
  • developers do not need language engineering
    expertise
  • requires large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus
  • annotators are cheap (but you get what you pay
    for!)

22
Event Extraction
  • Specific sub-tasks
  • Coreference resolution
  • Deduplication
  • Disambiguation
  • Complete Event Extraction Systems
  • NYU Proteus
  • Disease Outbreaks, Terrorist Attacks, Corporate
    Succession, ... http://nlp.cs.nyu.edu/index.shtml
  • ClearForest (commercial)
  • http://www.clearforest.com/
  • DBLife
  • Conferences, talks, publications, service
  • http://dblife.cs.wisc.edu/

23
Extracted Entities: Resolving Duplicates
Document 1: The Justice Department has officially
ended its inquiry into the assassinations of John
F. Kennedy and Martin Luther King Jr., finding
"no persuasive evidence" to support conspiracy
theories, according to department documents. The
House Assassinations Committee concluded in 1978
that Kennedy was "probably" assassinated as the
result of a conspiracy involving a second gunman,
a finding that broke from the Warren Commission's
belief that Lee Harvey Oswald acted alone in
Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F.
Kennedy married Jacqueline Lee Bouvier in Newport,
R.I. In 1960, Democratic presidential candidate
John F. Kennedy confronted the issue of his Roman
Catholic faith by telling a Protestant group in
Houston, "I do not speak for my church on public
matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester,
England in 1959. Kennedy co-edited The New Poetry
(Bloodaxe Books 1993), and is the author of New
Relations: The Refashioning Of British Poetry
1980-1994 (Seren 1996).
From Li, Morie, Roth, AI Magazine, 2005
24
Event Extraction: Integration Challenges
  • Information spans multiple documents
  • Missing or incorrect values
  • Combining relation tuples into complex events
  • No single key to order or cluster likely
    duplicates while separating them from similar but
    different entities.
  • Duplicate entities, relation tuples extracted
  • Large lists with multiple noisy mentions of the
    same entity/tuple
  • Need to depend on fuzzy and expensive string
    similarity functions
  • Cannot afford to compare each mention with every
    other.
  • See KDD 2006 Tutorial, Agichtein & Sarawagi (Part
    II) for details on scaling up data integration
    http://www.scalability-tutorial.net/
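One common way to avoid comparing every mention with every other is blocking: group mentions by a cheap key and run the expensive similarity function only within each group. The sketch below is an illustration under stated assumptions, not the method from the tutorial. The blocking key (last token of the name), the 0.8 threshold, and the use of Python's `difflib.SequenceMatcher` as the "expensive" similarity are all choices made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(mention):
    """Cheap key: the last token, lowercased (an assumption for this sketch)."""
    return mention.lower().split()[-1]


def candidate_duplicates(mentions, threshold=0.8):
    """Compare mentions only within blocks; return (pairs, comparisons made)."""
    blocks = defaultdict(list)
    for mention in mentions:
        blocks[blocking_key(mention)].append(mention)

    pairs = []
    comparisons = 0
    for block in blocks.values():
        for a, b in combinations(block, 2):
            comparisons += 1
            # Fuzzy string similarity, run only on same-block pairs.
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
                pairs.append((a, b))
    return pairs, comparisons


names = [
    "John F. Kennedy",
    "John Kennedy",
    "David Kennedy",
    "Martin Luther King Jr.",
    "Lee Harvey Oswald",
]
print(candidate_duplicates(names))
```

With five mentions, all-pairs comparison needs ten similarity calls, while blocking on the last token needs three here. The trade-off is that a poor key can split true duplicates into different blocks.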

25
Summary: Accuracy of Extraction Tasks
Feldman, ICML 2006 tutorial
  • Errors cascade (error in entity tag → error in
    relation extraction)
  • This estimate is optimistic
  • Holds for well-established tasks
  • Many specific/novel IE tasks exhibit lower
    accuracy

26
Outline
  • Overview of Information Extraction
  • Entity tagging
  • Relation extraction
  • Event Extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining and ML techniques shine)
  • Other dimensions of scalability

27
Scaling Information Extraction to the Web
  • Dimensions of Scalability
  • Corpus size
  • Applying rules/patterns is expensive
  • Need efficient ways to select/filter relevant
    documents
  • Document accessibility
  • Deep web: documents only accessible via a search
    interface
  • Dynamic sources: documents disappear from top
    page
  • Source heterogeneity
  • Coding/learning patterns for each source is
    expensive
  • Requires many rules (expensive to apply)
  • Domain diversity
  • Extracting information for any domain, entities,
    relationships
  • Some recent progress, e.g., [BCS07]
  • Not the focus of this talk

28
Scaling Up Information Extraction
  • Scan-based extraction
  • Classification/filtering to avoid processing
    documents
  • Sharing common tags/annotations
  • General keyword index-based techniques
  • QXtract, KnowItAll
  • Specialized indexes
  • BE/KnowItNow, Linguist's Search Engine
  • Parallelization/adaptive processing
  • IBM WebFountain, Google's Map/Reduce

29
Scan
[Diagram: Text Database → Extraction System → Output Tuples]
  1. Retrieve docs from database
  2. Process documents
  3. Extract output tuples
  • Scan retrieves and processes documents
    sequentially (until reaching target recall)
  • Execution time = $|\text{Retrieved Docs}| \cdot (R + P)$
  • R: time for retrieving a document
  • P: time for processing a document
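The Scan strategy and its cost (number of retrieved documents times the sum of per-document retrieval and processing time) can be sketched as a simple loop. This is a toy model: documents are represented as lists of already-known tuples, `toy_extract` stands in for a real extraction system, and the stopping rule uses a target tuple count in place of target recall.

```python
def toy_extract(doc):
    """Hypothetical extractor: a 'document' here is just its list of tuples."""
    return doc


def scan(documents, extract, target_tuples, t_retrieve, t_process):
    """Process documents in order until enough distinct tuples are found.

    Returns (tuples, estimated_time), where
    estimated_time = retrieved_docs * (t_retrieve + t_process).
    """
    found = set()
    retrieved = 0
    for doc in documents:
        if len(found) >= target_tuples:
            break
        retrieved += 1
        found.update(extract(doc))
    return found, retrieved * (t_retrieve + t_process)


docs = [
    [("Ebola", "Zaire")],
    [],  # a document with nothing to extract still costs R + P
    [("Malaria", "Ethiopia")],
    [("Cholera", "Sudan")],
]
print(scan(docs, toy_extract, target_tuples=2, t_retrieve=1, t_process=4))
```

Note that the empty document is charged in full, which is the inefficiency the filtering and index-based strategies try to remove.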
30
Efficient Scanning for Information Extraction
  • 80/20 rule: use a few simple rules to capture
    the majority of the cases [PRH2004]
  • Train a classifier to discard irrelevant
    documents without processing [GHY2002]
  • Share base annotations (entity tags) across
    multiple tasks

31
Filtered Scan
[Diagram: Text Database → Classifier → Extraction System → Output Tuples]
  1. Retrieve docs from database
  2. Process filtered documents
  3. Extract output tuples
  • Scan retrieves and processes all documents (until
    reaching target recall)
  • Filtered Scan uses a classifier to identify and
    process only promising documents (e.g., the
    Sports section of NYT is unlikely to describe
    disease outbreaks)
  • Execution time = $|\text{Retrieved Docs}| \cdot (R + F + s \cdot P)$
  • R: time for retrieving a document
  • F: time for filtering a document
  • P: time for processing a document
  • s: classifier selectivity ($s \le 1$)
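Filtered Scan pays the retrieval and filtering cost for every document but the processing cost only for the fraction the classifier lets through. A minimal sketch follows, assuming a rule-based stand-in for the classifier (skip the Sports section) and dictionary-shaped toy documents; both are illustrative, as are the function names.

```python
def is_promising(doc):
    """Stand-in for a trained classifier: skip the Sports section."""
    return doc["section"] != "Sports"


def filtered_scan(documents, classify, t_retrieve, t_filter, t_process):
    """Retrieve and filter every document; process only the promising ones.

    Returns (tuples, selectivity, estimated_time), where
    estimated_time = retrieved * (t_retrieve + t_filter + selectivity * t_process).
    """
    found = set()
    retrieved = 0
    processed = 0
    for doc in documents:
        retrieved += 1
        if not classify(doc):
            continue
        processed += 1
        found.update(doc["tuples"])  # toy extraction step
    selectivity = processed / retrieved if retrieved else 0.0
    time = retrieved * (t_retrieve + t_filter + selectivity * t_process)
    return found, selectivity, time


docs = [
    {"section": "World", "tuples": [("Ebola", "Zaire")]},
    {"section": "Sports", "tuples": []},
    {"section": "Health", "tuples": [("Malaria", "Ethiopia")]},
    {"section": "Sports", "tuples": []},
]
print(filtered_scan(docs, is_promising, t_retrieve=1, t_filter=1, t_process=4))
```

With these toy costs, plain Scan over the same four documents would cost 4 × (1 + 4) = 20, against 16 here. The saving grows as processing becomes more expensive relative to filtering, and it comes at the risk of discarding a useful document.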
32
Exploiting Keyword and Phrase Indexes
  • Generate queries to retrieve only relevant
    documents
  • Data mining problem!
  • Some methods in literature
  • Traversing Query Graphs [AIG2003]
  • Iteratively refine queries [AG2003]
  • Iteratively partition document space [Etzioni et
    al., WWW 2004]
  • Case studies: QXtract, KnowItAll

33
Simple Strategy: Iterative Set Expansion
[Diagram: Query Generation → Text Database → Extraction System → Output Tuples]
  1. Query database with seed tuples
    (e.g., [Ebola AND Zaire])
  2. Process retrieved documents
  3. Extract tuples from docs
  4. Augment seed tuples with new tuples
    (e.g., <Malaria, Ethiopia>)
  • Execution time = $|\text{Retrieved Docs}| \cdot (R + P) + |\text{Queries}| \cdot Q$
  • R: time for retrieving a document
  • P: time for processing a document
  • Q: time for answering a query
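The Iterative Set Expansion loop (query with known tuples, process the retrieved documents, add newly extracted tuples to the queue) can be sketched as below. The in-memory document store, the substring-based keyword search, and the `KNOWN` list that backs the toy extractor are all assumptions for the example; a real system would call a search interface and an extraction system.

```python
from collections import deque

# Toy extractor backing data: tuples the "extraction system" can recognize.
KNOWN = [("Ebola", "Zaire"), ("Malaria", "Ethiopia"),
         ("Cholera", "Sudan"), ("H5N1", "Vietnam")]


def toy_extract(text):
    return [t for t in KNOWN if t[0] in text and t[1] in text]


def iterative_set_expansion(docs, seeds, t_retrieve=1, t_process=4, t_query=2):
    """Return (tuples, estimated_time) with
    time = retrieved_docs * (t_retrieve + t_process) + queries * t_query."""
    tuples = set(seeds)
    queue = deque(seeds)
    retrieved = set()
    queries = 0
    while queue:
        disease, location = queue.popleft()
        queries += 1  # one conjunctive keyword query per tuple
        hits = [d for d, text in docs.items() if disease in text and location in text]
        for doc_id in hits:
            if doc_id in retrieved:
                continue  # each document is retrieved and processed once
            retrieved.add(doc_id)
            for new_tuple in toy_extract(docs[doc_id]):
                if new_tuple not in tuples:
                    tuples.add(new_tuple)
                    queue.append(new_tuple)
    return tuples, len(retrieved) * (t_retrieve + t_process) + queries * t_query


docs = {
    "d1": "Ebola in Zaire; Malaria in Ethiopia",
    "d2": "Malaria spreads in Ethiopia and Cholera in Sudan",
    "d3": "H5N1 in Vietnam",
}
print(iterative_set_expansion(docs, [("Ebola", "Zaire")]))
```

In this toy run the tuple for H5N1 is never found, because no query generated from the seed reaches its document.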
34
Querying Graph [AIG2003]
[Figure: bipartite graph between tuples t1 to t5 and documents d1 to d5]
  • The querying graph is a bipartite graph,
    containing tuples and documents
  • Each tuple (transformed to a keyword query)
    retrieves documents
  • Documents contain tuples
Tuples in the figure:
t1 <SARS, China>
t2 <Ebola, Zaire>
t3 <Malaria, Ethiopia>
t4 <Cholera, Sudan>
t5 <H5N1, Vietnam>
35
Recall Limit: Reachability Graph
[Figure: the querying graph over tuples t1 to t5 and documents d1 to d5, next to the reachability graph over tuples derived from it]
  • Edge t1 → t2 in the reachability graph: t1
    retrieves document d1 that contains t2
  • Upper recall limit is determined by the size of
    the biggest connected component
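The recall limit can be illustrated by following "tuple retrieves document, document contains tuple" edges from a seed. The graph below is hypothetical (the exact edges in the slide's figure are not recoverable from the transcript), and the function name `reachable_tuples` is an assumption for the example.

```python
from collections import deque


def reachable_tuples(retrieves, contains, seed):
    """Breadth-first search over the reachability graph.

    retrieves: tuple -> documents its keyword query returns
    contains:  document -> tuples that can be extracted from it
    """
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        current = frontier.popleft()
        for doc in retrieves.get(current, ()):
            for nxt in contains.get(doc, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


# Hypothetical querying graph with two disconnected groups of tuples.
retrieves = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d4"], "t5": ["d5"]}
contains = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d3": ["t3"],
            "d4": ["t4", "t5"], "d5": ["t5"]}

reached = reachable_tuples(retrieves, contains, "t1")
print(sorted(reached), len(reached) / 5)
```

Starting from t1, only three of the five tuples are reachable, so recall cannot exceed 0.6 no matter how many queries are issued.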
36
Reachability Graph for DiseaseOutbreaks
DiseaseOutbreaks, New York Times 1995
37
Getting Around Reachability Limit
  • KnowItAll [Etzioni et al., WWW 2004]
  • Add keywords to partition documents into
    retrievable disjoint sets
  • Submit queries with parts of extracted instances
  • QXtract [Agichtein & Gravano, ICDE 2003]
  • General queries with many matching documents
  • Assumes many documents retrievable per query

38
QXtract [AG2003]
[Diagram: User-Provided Seed Tuples → Seed Sampling → Information Extraction → Classifier Training → Query Generation → Queries]
  1. Get document sample with likely negative and
    likely positive examples.
  2. Label sample documents using the information
    extraction system as oracle.
  3. Train classifiers to recognize useful
    documents.
  4. Generate queries from classifier model/rules.
39
Using Generic Indexes: Summary
  • Order of magnitude scale-up in corpus size
  • Indexes are approximate (queries not precise)
  • Require many documents to retrieve
  • Can we do better?

40
Index Structures for Information Extraction
  • Bindings Engine [CE2005]
  • Indexes of entities [CGHX2006], IBM Avatar
  • Other systems (not covered)
  • Linguist's Search Engine (P. Resnik): index
    syntactic structures
  • FREE: Indexing regular expressions (J. Cho et al.)

41
Bindings Engine (BE) [Cafarella and Etzioni 2005]
  • Bindings Engine (BE) is a search engine where
  • No downloads during query processing
  • Disk seeks constant in corpus size
  • queries phrases
  • BE's approach
  • Variabilized search query language
  • Pre-processes all documents before query-time
  • Integrates variable/type data with inverted
    index, minimizing query seeks

42
Bindings Engine (BE) [Cafarella and Etzioni 2005]
  • Variabilized search query language
  • Integrates variable/type data with inverted
    index, minimizing query seeks
  • Index <NounPhrase>, <Adj-Term> terms
  • Key idea: neighbor index
  • At each position in the index, store neighbor
    text: both lexemes and types
  • Query: "cities such as <NounPhrase>"
[Figure: neighbor index layout. Each term in the dictionary (as, billy, cities, friendly, give, mayors, nickels, philadelphia, such, words) points to a posting list holding #docs and docid0, docid1, ..., docid(#docs-1). Each document entry holds #posns and pos0, pos1, ..., pos(#pos-1). Each position additionally stores a block offset (blk_offset) and #neighbors pairs (neighbor0, str0), (neighbor1, str1), ...]
Result in document 19: "I love cities such as
Philadelphia."
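The neighbor-index idea, storing next to each posting the typed text that follows it so a variable can be bound without re-reading the document, can be sketched with an in-memory toy. This is not the BE implementation: the on-disk layout is ignored, and the "noun phrase" is approximated by a run of capitalized tokens, which is an assumption standing in for a real chunker.

```python
from collections import defaultdict


def right_noun_phrase(tokens, i):
    """Toy chunker: the run of capitalized tokens right after position i."""
    j = i + 1
    while j < len(tokens) and tokens[j][:1].isupper():
        j += 1
    return " ".join(tokens[i + 1:j]) or None


def build_neighbor_index(docs):
    """term -> list of (doc_id, position, right-neighbor noun phrase)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        tokens = text.replace(".", "").split()
        for pos, tok in enumerate(tokens):
            index[tok.lower()].append((doc_id, pos, right_noun_phrase(tokens, pos)))
    return index


def query(index, phrase):
    """Answer '<phrase> <NounPhrase>' using only the index."""
    words = phrase.lower().split()
    # Positions where the whole phrase starts, found by intersecting postings.
    starts = {(d, p) for d, p, _ in index.get(words[0], [])}
    for k, word in enumerate(words[1:], start=1):
        starts &= {(d, p - k) for d, p, _ in index.get(word, [])}
    last = len(words) - 1
    # The binding is read from the neighbor stored with the last term.
    return sorted((d, np) for d, p, np in index.get(words[-1], [])
                  if (d, p - last) in starts and np)


docs = {
    19: "I love cities such as Philadelphia.",
    7: "Friendly mayors give nickels such as these.",
}
index = build_neighbor_index(docs)
print(query(index, "cities such as"))
```

Document 7 contains "such as" but not the full phrase, so it is eliminated during the posting-list intersection and never has to be fetched.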
43
Related Approach [CGHX2006]
  • Support relationship keyword queries over
    indexed entities
  • Top-K support for early processing termination

44
Workload-Driven Indexing [CPD 2006]: Indexing
Thousands of Entity Types
45
Selecting Types to Index [CPD 2006]
46
Open Information Extraction [Banko et al., IJCAI
2007]
  • Self-Supervised Learner
  • All triples in sample corpus (e1, r, e2) are
    potential tuples for r
  • Positive examples: candidate triples generated
    by dependency parser
  • Classifier trained on lexical features for
    positive and negative examples
  • Single-Pass Extractor
  • Classify all pairs of candidate entities as
    potential tuples
  • Heuristically generate a relation name from text
    between entities
  • Redundancy-Based Assessor
  • Estimate probability that entities are related
    from co-occurrence statistics
  • Scalability
  • Extraction/Indexing
  • No tuning or domain knowledge during extraction,
    loss of accuracy at query time
  • 0.04 CPU seconds per sentence, 9M web page corpus
    in 68 CPU hours (I/O costs?)
  • Every document retrieved, parsed, indexed
  • Query-time
  • Distributed index for tuples by hashing relation
    name text

47
Parallelization/Adaptive Processing
  • Parallelize processing
  • IBM WebFountain [GCG2004]
  • Google's Map/Reduce
  • Select most efficient access strategy
  • Cost Estimation and Optimization [IAJG2006]

48
Map/Reduce [Dean & Ghemawat, OSDI 2004]
  • General framework
  • Scales to 1000s of machines
  • Recently implemented in Nutch and other open
    source efforts
  • Maps nicely to information extraction
  • Map phase
  • Parse individual documents
  • Tag entities
  • Propose candidate relation tuples
  • Reduce phase
  • Merge multiple mentions of same relation tuple
  • Resolve co-references, duplicates
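The mapping of extraction onto the two phases can be sketched in a single process. This toy does not use any Map/Reduce framework: the map step proposes candidate tuples per document with an illustrative "X outbreak in Y" pattern, and the reduce step merges mentions of the same tuple by counting them. Lowercasing stands in for real duplicate resolution, and all function names are assumptions for the example.

```python
import re
from collections import defaultdict

# Illustrative pattern for candidate (disease, location) tuples.
OUTBREAK = re.compile(r"(\w+) outbreak in (\w+)")


def map_phase(doc):
    """Per document: emit (candidate tuple, 1) for each mention."""
    return [((disease.lower(), place.lower()), 1)
            for disease, place in OUTBREAK.findall(doc)]


def reduce_phase(pairs):
    """Across documents: merge mentions of the same tuple by summing counts."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


def run(docs):
    intermediate = [pair for doc in docs for pair in map_phase(doc)]
    return reduce_phase(intermediate)


docs = [
    "Ebola outbreak in Zaire worsens.",
    "Officials confirm EBOLA outbreak in Zaire.",
    "Cholera outbreak in Sudan.",
]
print(run(docs))
```

Because `map_phase` looks at one document at a time, it can run on many machines independently; only the merge in `reduce_phase` needs to bring mentions of the same tuple together.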

49
Summary
  • Brief overview of information extraction from
    text
  • Techniques to scale up information extraction
  • Scan-based techniques (limited impact)
  • Exploiting general indexes (limited accuracy)
  • Building specialized index structures (most
    promising)
  • Scalability is a data mining problem
  • Querying graphs → link discovery
  • Workload mining for index optimization
  • Can (automatically) optimize for specific text
    mining application

50
Summary (continued)