Web scale Information Extraction - PowerPoint PPT Presentation

Slides: 41
Provided by: Rad7
1
Web scale Information Extraction
2
The Value of Text Data
  • Unstructured text data is the primary form of
    human-generated information
  • Blogs, web pages, news, scientific literature,
    online reviews, ...
  • Semi-structured data (database-generated): see
    Prof. Bing Liu's KDD webinar,
    http://www.cs.uic.edu/~liub/WCM-Refs.html
  • The techniques discussed here are complementary
    to structured object extraction methods
  • Need to extract structured information to
    effectively manage, search, and mine the data
  • Information Extraction: a mature but active
    research area
  • Intersection of Computational Linguistics,
    Machine Learning, Data Mining, Databases, and
    Information Retrieval
  • Traditional focus on accuracy of extraction

3
Outline
  • Information Extraction Tasks
  • Entity tagging
  • Relation extraction
  • Event extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining can be most beneficial)
  • Other dimensions of scalability

4
Information Extraction Tasks
  • Extracting entities and relations: this talk
  • Entities: named (e.g., Person) and generic (e.g.,
    disease name)
  • Relations: entities related in a predefined way
    (e.g., Location of a Disease outbreak, or a CEO
    of a Company)
  • Events: can be composed from multiple relation
    tuples
  • Common extraction subtasks
  • Preprocess: sentence chunking, syntactic parsing,
    morphological analysis
  • Create rules or extraction patterns: hand-coded,
    machine learning, and hybrid
  • Apply extraction patterns or rules to extract new
    information
  • Postprocess and integrate information
  • Co-reference resolution, deduplication,
    disambiguation

5
Entity Tagging
  • Identifying mentions of entities (e.g., person
    names, locations, companies) in text
  • MUC (1997): Person, Location, Organization,
    Date/Time/Currency
  • ACE (2005): more than 100 more specific types
  • Hand-coded vs. Machine Learning approaches
  • Best approach depends on entity type and domain
  • Closed class (e.g., geographical locations,
    disease names, gene and protein names): hand-coded
    dictionaries
  • Syntactic (e.g., phone numbers, zip codes):
    regular expressions
  • Semantic (e.g., person and company names):
    mixture of context, syntactic features,
    dictionaries, heuristics, etc.
  • Almost solved for common/typical entity types
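
The split between closed-class and syntactic entity types can be sketched in a few lines of Python. This is only an illustration: the dictionary entries, the US-style zip and phone formats, and the function name tag_entities are assumptions, not part of the talk.

```python
import re

# Closed-class entities: a hand-coded dictionary (toy entries, illustration only).
DISEASES = {"malaria", "ebola", "cholera"}

# Syntactic entities: regular expressions (US-style formats are an assumption).
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")
PHONE_RE = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def tag_entities(text):
    """Return (start, end, type, surface) mentions sorted by position."""
    mentions = []
    for m in ZIP_RE.finditer(text):
        mentions.append((m.start(), m.end(), "ZIP", m.group()))
    for m in PHONE_RE.finditer(text):
        mentions.append((m.start(), m.end(), "PHONE", m.group()))
    # Dictionary lookup over single alphabetic tokens.
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() in DISEASES:
            mentions.append((m.start(), m.end(), "DISEASE", m.group()))
    return sorted(mentions)
```

Semantic types such as person or company names would need context features and a learned model rather than rules of this kind.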

6
Example: Extracting Entities from Text
  • Useful for data warehousing, data cleaning, web
    data integration

Address: 4089 Whispering Pines Nobel Drive San Diego CA 92122
Citation: Ronald Fagin, Combining Fuzzy Information from
Multiple Systems, Proc. of ACM SIGMOD, 2002
7
Hand-Coded Methods
  • Easy to construct in some cases
  • e.g., to recognize prices, phone numbers, zip
    codes, conference names, etc.
  • Intuitive to debug and maintain
  • Especially if written in a high-level language
  • Can incorporate domain knowledge
  • Scalability issues
  • Labor-intensive to create
  • Highly domain-specific
  • Often corpus-specific
  • Rule-matches can be expensive

IBM Avatar
8
Machine Learning Methods
  • Can work well when lots of training data is easy
    to construct
  • Can capture complex patterns that are hard to
    encode with hand-crafted rules
  • e.g., determine whether a review is positive or
    negative
  • extract long complex gene names
  • Non-local dependencies

9
Popular Machine Learning Methods
For details, see Feldman, 2006 and Cohen, 2004
  • Naive Bayes
  • SRV (Freitag 1998), Inductive Logic Programming
  • Rapier (Califf and Mooney 1997)
  • Hidden Markov Models (Leek 1997)
  • Maximum Entropy Markov Models (McCallum et al.
    2000)
  • Conditional Random Fields (Lafferty et al. 2001)
  • Scalability
  • Can be labor intensive to construct training data
  • At run time, complex features can be expensive to
    construct or process (batch algorithms can help:
    Chandel et al. 2006)
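
To make the sequence models in this list more concrete, here is a minimal sketch of Viterbi decoding for a first-order Hidden Markov Model tagger. The function signature and the unk smoothing constant are assumptions made for illustration; the sketch is not taken from any of the cited systems and expects a non-empty token list.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, unk=1e-6):
    """Most likely state sequence under a first-order HMM, computed in log space."""
    def emit(state, word):
        # Unseen words get a small constant probability (an assumed smoothing choice).
        return math.log(emit_p[state].get(word, unk))

    scores = [{s: math.log(start_p[s]) + emit(s, tokens[0]) for s in states}]
    back = []
    for word in tokens[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            cur[s] = prev[best] + math.log(trans_p[best][s]) + emit(s, word)
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)

    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

Decoding is linear in the number of tokens and quadratic in the number of states, which is one reason simple sequence models remain attractive at scale.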

10
Some Available Entity Taggers
  • ABNER
  • http://www.cs.wisc.edu/~bsettles/abner/
  • Linear-chain conditional random fields (CRFs)
    with orthographic and contextual features.
  • Alias-I LingPipe
  • http://www.alias-i.com/lingpipe/
  • MALLET
  • http://mallet.cs.umass.edu/index.php/Main_Page
  • Collection of NLP and ML tools, can be trained
    for named entity tagging
  • MinorThird
  • http://minorthird.sourceforge.net/
  • Tools for learning to extract entities,
    categorization, and some visualization
  • Stanford Named Entity Recognizer
  • http://nlp.stanford.edu/software/CRF-NER.shtml
  • CRF-based entity tagger with non-local features

11
Alias-I LingPipe (http://www.alias-i.com/lingpipe/)
  • Statistical named entity tagger
  • Generative statistical model
  • Find most likely tags given lexical and
    linguistic features
  • Accuracy at (or near) state of the art on
    benchmark tasks
  • Explicitly targets scalability
  • 100K tokens/second runtime on single PC
  • Pipelined extraction of entities
  • User-defined mentions, pronouns and stop list
  • Specified in a dictionary, left-to-right, longest
    match
  • Can be trained/bootstrapped on annotated corpora

12
Outline
  • Overview of Information Extraction
  • Entity tagging
  • Relation extraction
  • Event extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining and ML techniques shine)
  • Other dimensions of scalability

13
Relation Extraction Examples
  • Extract tuples of entities that are related in
    a predefined way

Example: Disease Outbreaks relation
Example (from AliBaba): We show that CBF-A and CBF-C
interact with each other to form a CBF-A-CBF-C complex
and that CBF-B does not interact with CBF-A or CBF-C
individually but that it associates with the
CBF-A-CBF-C complex.
14
Relation Extraction Approaches
  • Knowledge engineering
  • Experts develop rules, patterns
  • Can be defined over lexical items: <company>
    located in <location>
  • Over syntactic structures: ((Obj <company>)
    (Verb located) () (Subj <location>))
  • Sophisticated development/debugging environments
  • Proteus, GATE
  • Machine learning
  • Supervised: train system over manually labeled
    data
  • Soderland et al. 1997, Muslea et al. 2000, Riloff
    et al. 1996, Roth et al. 2005, Cardie et al. 2006,
    Mooney et al. 2005, ...
  • Partially-supervised: train system by
    bootstrapping from seed examples
  • Agichtein & Gravano 2000, Etzioni et al. 2004,
    Yangarber & Grishman 2001, ...
  • Open (no seeds): Sekine et al. 2006, Cafarella
    et al. 2007, Banko et al. 2007
  • Hybrid or interactive systems
  • Experts interact with machine learning algorithms
    (e.g., active learning family) to iteratively
    refine/extend rules and patterns
  • Interactions can involve annotating examples,
    modifying rules, or any combination
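
A knowledge-engineered lexical pattern such as "<company> located in <location>" can be applied to entity-tagged text roughly as below. The sentence representation (plain words mixed with (type, text) pairs) and the function name are assumptions chosen to keep the sketch short; "Acme Corp" is a made-up example.

```python
PATTERN = ["<company>", "located", "in", "<location>"]


def match_pattern(sentence, pattern=PATTERN):
    """Apply one lexical pattern to an entity-tagged sentence.

    sentence: list of items, each a plain word (str) or a tagged entity
    given as a (type, text) tuple. Returns one tuple of slot fillers per match.
    """
    results = []
    for i in range(len(sentence) - len(pattern) + 1):
        slots = []
        for item, want in zip(sentence[i:], pattern):
            if want.startswith("<"):
                # Slot: the item must be an entity of the requested type.
                if isinstance(item, tuple) and item[0] == want.strip("<>"):
                    slots.append(item[1])
                else:
                    break
            elif not (isinstance(item, str) and item.lower() == want):
                break
        else:
            results.append(tuple(slots))
    return results
```

The pattern only fires when the entity tagger has already labeled both arguments, which is where upstream tagging errors start to matter.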

15
Open Information Extraction (Banko et al., IJCAI
2007)
  • Self-Supervised Learner
  • All triples in a sample corpus (e1, r, e2) are
    considered potential tuples for relation r
  • Positive examples: candidate triples generated
    by a dependency parser
  • Train classifier on lexical features for positive
    and negative examples
  • Single-Pass Extractor
  • Classify all pairs of candidate entities for some
    (undetermined) relation
  • Heuristically generate a relation name from the
    words between entities
  • Redundancy-Based Assessor
  • Estimate probability that entities are related
    from co-occurrence statistics
  • Scalability
  • Extraction/Indexing
  • No tuning or domain knowledge during extraction,
    relation inclusion determined at query time
  • 0.04 CPU seconds per sentence, 9M web page corpus
    in 68 CPU hours
  • Every document retrieved, processed (parsed,
    indexed, classified) in a single pass
  • Query-time
  • Distributed index for tuples by hashing on the
    relation name text
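
The single-pass and query-time ideas above can be sketched loosely as follows. This is a strong simplification and not the system of Banko et al.: there is no trained classifier, only adjacent entity pairs are considered, and the stop-word list, function names, and the use of MD5 for sharding are all assumptions.

```python
import hashlib
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "was"}  # toy list, illustration only


def relation_name(words_between):
    """Heuristic relation name: the non-stop words between two entities."""
    return " ".join(w.lower() for w in words_between if w.lower() not in STOP_WORDS)


def extract(sentence):
    """Yield (e1, r, e2) for each pair of adjacent entities.

    sentence: list of plain words (str) and entities given as ("ENT", text).
    """
    positions = [i for i, item in enumerate(sentence) if isinstance(item, tuple)]
    for a, b in zip(positions, positions[1:]):
        r = relation_name(sentence[a + 1:b])
        if r:
            yield (sentence[a][1], r, sentence[b][1])


def assess(corpus):
    """Redundancy signal: how often each triple was extracted across the corpus."""
    return Counter(triple for sentence in corpus for triple in extract(sentence))


def shard_for(relation, n_shards):
    """Pick an index shard by hashing the relation name text (stable across runs)."""
    digest = hashlib.md5(relation.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards
```

Hashing on the relation name means that a query for one relation touches a single shard, at the cost of uneven shard sizes when a few relations dominate.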

16
Event Extraction
  • Similar to Relation Extraction, but
  • Events can be nested
  • Significantly more complex (e.g., more slots)
    than relations/template elements
  • Often requires coreference resolution,
    disambiguation, deduplication, and inference
  • Example: an integrated disease outbreak event
    (Huttunen et al. 2002)

17
Event Extraction Integration Challenges
  • Information spans multiple documents
  • Missing or incorrect values
  • Combining simple tuples into complex events
  • No single key to order or cluster likely
    duplicates while separating them from similar but
    different entities.
  • Ambiguity: distinct physical entities with same
    name (e.g., Kennedy)
  • Duplicate entities, relation tuples extracted
  • Large lists with multiple noisy mentions of the
    same entity/tuple
  • Need to depend on fuzzy and expensive string
    similarity functions
  • Cannot afford to compare each mention with every
    other.
  • See Part II of KDD 2006 Tutorial, Scalable
    Information Extraction and Integration
    (scaling up integration):
    http://www.scalability-tutorial.net/
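
One common way to avoid comparing every mention with every other is blocking: group mentions by a cheap key and run the expensive fuzzy comparison only within each group. The sketch below uses character-trigram Jaccard similarity; the blocking key (first letter of the last token), the threshold, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def ngrams(s, n=3):
    """Character n-grams of a padded, lower-cased string."""
    padded = f"  {s.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Fuzzy string similarity: overlap of character n-gram sets."""
    x, y = ngrams(a), ngrams(b)
    return len(x & y) / len(x | y)


def candidate_pairs(mentions):
    """Blocking: only pair mentions that share a cheap key."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[m.split()[-1][0].lower()].append(m)
    for block in blocks.values():
        yield from combinations(block, 2)


def likely_duplicates(mentions, threshold=0.5):
    return [(a, b) for a, b in candidate_pairs(mentions) if jaccard(a, b) >= threshold]
```

A poor blocking key trades recall for speed: duplicates that land in different blocks are never compared.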

18
Summary: Accuracy of Extraction Tasks
(Feldman, ICML 2006 tutorial)
  • Errors cascade (errors in entity tag cause errors
    in relation extraction)
  • This estimate is optimistic
  • Primarily for well-established (tuned) tasks
  • Many specialized or novel IE tasks (e.g. bio- and
    medical- domains) exhibit lower accuracy
  • Accuracy for all tasks is significantly lower for
    non-English
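
A rough way to see why errors cascade: if a relation tuple is only correct when both of its entity tags are correct, and errors are assumed independent, the per-stage accuracies multiply. The function name and the numbers in the usage note are hypothetical and are not taken from the tutorial's table.

```python
def cascaded_accuracy(entity_acc, relation_acc, entities_per_relation=2):
    """Rough end-to-end estimate, assuming independent errors at each stage."""
    return (entity_acc ** entities_per_relation) * relation_acc
```

For example, with a hypothetical entity tagger at 0.9 and a relation extractor at 0.8 on perfectly tagged input, the estimate is 0.9 * 0.9 * 0.8, roughly 0.65.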

19
Multilingual Information Extraction
  • Closely tied to machine translation and
    cross-language information retrieval efforts.
  • Language-independent named entity tagging and
    related tasks at CoNLL
  • 2006: multi-lingual dependency parsing
    (http://nextens.uvt.nl/~conll/)
  • 2002, 2003: shared tasks on language-independent
    Named Entity Tagging
    (http://www.cnts.ua.ac.be/conll2003/ner/)
  • Global Autonomous Language Exploitation program
    (GALE)
  • http://www.darpa.mil/ipto/Programs/gale/concept.htm
  • Interlingual Annotation of Multilingual Text
    Corpora (IAMTC)
  • Tools and data for building MT and IE systems for
    six languages
  • http://aitc.aitcnet.org/nsf/iamtc/index.html
  • REFLEX project: NER for 50 languages
  • Exploits temporal correlations in weekly aligned
    corpora for training
  • http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
  • Cross-Language Information Retrieval (CLEF)
  • http://www.clef-campaign.org/

20
Outline
  • Overview of Information Extraction
  • Entity tagging
  • Relation extraction
  • Event Extraction
  • Scaling up Information Extraction
  • Focus on scaling up to large collections (where
    data mining and ML techniques shine)
  • Other dimensions of scalability

21
Scaling Information Extraction to the Web
  • Dimensions of Scalability
  • Corpus size
  • Applying rules/patterns is expensive
  • Need efficient ways to select/filter relevant
    documents
  • Document accessibility
  • Deep web: documents only accessible via a search
    interface
  • Dynamic sources: documents disappear from top
    page
  • Source heterogeneity
  • Coding/learning patterns for each source is
    expensive
  • Requires many rules (expensive to apply)
  • Domain diversity
  • Extracting information for any domain, entities,
    relationships
  • Some recent progress (e.g., see slide 17)
  • Not the focus of this talk

22
Scaling Up Information Extraction
  • Scan-based extraction
  • Classification/filtering to avoid processing
    documents
  • Sharing common tags/annotations
  • General keyword index-based techniques
  • QXtract, KnowItAll
  • Specialized indexes
  • BE/KnowItNow, Linguist's Search Engine
  • Parallelization/distributed processing
  • IBM WebFountain, UIMA, Google's Map/Reduce
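
The map/reduce style of parallel extraction can be illustrated in a single process. In a real framework the map calls would run on different machines; here everything runs sequentially, and the toy pattern and function names are assumptions.

```python
import re
from collections import defaultdict

# Toy extraction pattern, standing in for a real extraction system.
OUTBREAK_RE = re.compile(r"(\w+) outbreak in (\w+)")


def map_doc(doc):
    """Map: emit (tuple, 1) for every extraction found in one document."""
    return [(t, 1) for t in OUTBREAK_RE.findall(doc)]


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce: sum the counts for each extracted tuple."""
    return {key: sum(values) for key, values in groups.items()}


def run(docs):
    pairs = [pair for doc in docs for pair in map_doc(doc)]
    return reduce_counts(shuffle(pairs))
```

Since each document is processed independently in the map phase, throughput scales with the number of workers until the shuffle becomes the bottleneck.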

23
Efficient Scanning for Information Extraction
[Figure: Text Database -> (filtered) -> Extraction System]
  • Retrieve docs from database
  • Process documents
  • Extract output tuples
  • 80/20 rule: use a few simple rules to capture the
    majority of the instances (Pantel et al. 2004)
  • Train a classifier to discard irrelevant
    documents without processing (Grishman et al.
    2002)
  • (e.g., the Sports section of NYT is unlikely to
    describe disease outbreaks)
  • Share base annotations (entity tags) for multiple
    extraction tasks
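
Filtering before extraction can be sketched as a scan loop that calls the expensive extractor only on documents a cheap filter keeps. The keyword filter below is a stand-in for a trained classifier; the keyword list, the toy extraction pattern, and the function names are assumptions.

```python
import re

KEYWORDS = {"outbreak", "epidemic", "virus"}  # toy list, illustration only
OUTBREAK_RE = re.compile(r"outbreak of (\w+) in (\w+)", re.IGNORECASE)


def keyword_filter(doc):
    """Cheap relevance test, standing in for a trained document classifier."""
    text = doc.lower()
    return any(k in text for k in KEYWORDS)


def extract_outbreaks(doc):
    """Stand-in for the expensive extraction system."""
    return OUTBREAK_RE.findall(doc)


def scan(documents, is_promising=keyword_filter, extract=extract_outbreaks):
    """Return the extracted tuples and the number of documents actually processed."""
    tuples, processed = [], 0
    for doc in documents:
        if not is_promising(doc):
            continue  # discarded without running the extractor
        processed += 1
        tuples.extend(extract(doc))
    return tuples, processed
```

The saving depends on the filter being much cheaper than extraction and on how many relevant documents it wrongly discards.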

24
Exploiting Keyword and Phrase Indexes
  • Generate queries to retrieve only relevant
    documents
  • Data mining problem!
  • Some methods in literature
  • Traversing Query Graphs (Agichtein et al. 2003)
  • Iteratively refine queries (Agichtein and Gravano
    2003)
  • Iteratively partition document space (Etzioni et
    al., 2004)
  • Case studies: QXtract, KnowItAll

25
Simple Strategy: Iterative Set Expansion
[Figure: loop between Text Database, Extraction System,
and Query Generation]
  • Query database with seed tuples
    (e.g., Ebola AND Zaire)
  • Process retrieved documents
  • Extract tuples from docs
  • Augment seed tuples with new tuples
    (e.g., <Malaria, Ethiopia>)
  • Execution time = |Retrieved Docs| * (R + P) +
    |Queries| * Q
  • R: time for retrieving a document
  • P: time for processing a document
  • Q: time for answering a query
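
The expansion loop and its cost model can be sketched over a toy in-memory collection. The document texts, the conjunctive search function, the extraction pattern, and the parameter names t_retrieve, t_process, and t_query (for the times to retrieve a document, process a document, and answer a query) are all assumptions made for illustration.

```python
import re

# Toy in-memory "text database" (made-up documents).
DOCS = {
    "d1": "Ebola outbreak in Zaire; Malaria outbreak in Ethiopia.",
    "d2": "Malaria outbreak in Ethiopia and Cholera outbreak in Sudan.",
    "d3": "H5N1 outbreak in Vietnam.",
}
OUTBREAK_RE = re.compile(r"(\w+) outbreak in (\w+)")


def search(tup):
    """Conjunctive keyword query, e.g. Ebola AND Zaire."""
    return [(doc_id, text) for doc_id, text in DOCS.items() if all(v in text for v in tup)]


def extract(text):
    return OUTBREAK_RE.findall(text)


def iterative_set_expansion(seeds, t_retrieve, t_process, t_query):
    """Query with known tuples, extract new ones, repeat until nothing new appears."""
    tuples, pending = set(seeds), list(seeds)
    seen_docs, n_queries = set(), 0
    while pending:
        query = pending.pop()
        n_queries += 1
        for doc_id, text in search(query):
            if doc_id in seen_docs:
                continue  # each document is retrieved and processed once
            seen_docs.add(doc_id)
            for new in extract(text):
                if new not in tuples:
                    tuples.add(new)
                    pending.append(new)
    cost = len(seen_docs) * (t_retrieve + t_process) + n_queries * t_query
    return tuples, cost
```

In this toy collection the H5N1 tuple is never found from the Ebola seed, because no retrieved document mentions it.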
26
Reachability via Querying (Agichtein et al. 2003b)
[Figure: Reachability Graph linking tuples t1-t5 to
documents d1-d5]
  • Tuples shown: <SARS, China>, <Ebola, Zaire>,
    <Malaria, Ethiopia>, <Cholera, Sudan>,
    <H5N1, Vietnam>
  • Example edge: t1 retrieves document d1 that
    contains t2
  • Upper recall limit determined by the size of
    the biggest connected component
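
Reachability from a seed tuple can be computed with a breadth-first search over the tuple-document graph. The function name and the two-dictionary graph representation are assumptions, and the small graph in the usage note is made up apart from the edge stated on the slide.

```python
from collections import deque


def reachable(seed, retrieves, contains):
    """Tuples reachable from seed by alternating query and extraction steps.

    retrieves: tuple id -> documents returned when querying with that tuple
    contains:  document id -> tuples that can be extracted from that document
    """
    seen, queue = {seed}, deque([seed])
    while queue:
        t = queue.popleft()
        for d in retrieves.get(t, ()):
            for u in contains.get(d, ()):
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
    return seen
```

For instance, with retrieves = {"t1": ["d1"], "t2": ["d2"], "t4": ["d4"]} and contains = {"d1": ["t1", "t2"], "d2": ["t2", "t3"], "d4": ["t4", "t5"]}, a search from t1 reaches t1, t2, and t3, so recall from that seed cannot exceed 3 of 5 tuples.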
27
Some Available IE Tools
  • MALLET (UMass)
  • statistical natural language processing,
  • document classification,
  • clustering,
  • information extraction
  • other machine learning applications to text.
  • Sample application: GeneTaggerCRF, a gene-entity
    tagger based on MALLET (MAchine Learning for
    LanguagE Toolkit). It uses conditional random
    fields to find genes in a text file.

28
MinorThird
  • http://minorthird.sourceforge.net/
  • a collection of Java classes for storing text,
    annotating text, and learning to extract entities
    and categorize text
  • Stored documents can be annotated in independent
    files using TextLabels (denoting, say,
    part-of-speech and semantic information)

29
GATE
  • http://gate.ac.uk/ie/annie.html
  • leading toolkit for Text Mining
  • distributed with an Information Extraction
    component set called ANNIE (demo)
  • Used in many research projects
  • Long list can be found on its website
  • Under integration of IBM UIMA

30
Sunita Sarawagi's CRF package
  • http://crf.sourceforge.net/
  • A Java implementation of conditional random
    fields for sequential labeling.

31
UIMA (IBM)
  • Unstructured Information Management Architecture.
  • A platform for unstructured information
    management solutions from combinations of
    semantic analysis (IE) and search components.

32
Some Interesting Websites Based on IE
  • ZoomInfo
  • CiteSeer.org (some of us use it every day!)
  • Google Local, Google Scholar
  • and many more

33
UIMA (IBM Research)
  • Unstructured Information Management Architecture
    (UIMA)
  • http://www.research.ibm.com/UIMA/
  • Open component software architecture for
    development, composition, and deployment of text
    processing and analysis components.
  • Run-time framework allows components and
    applications to be plugged in and run on
    different platforms. Supports distributed
    processing, failure recovery, ...
  • Scales to millions of documents: incorporated
    into IBM OmniFind, grid computing-ready
  • The UIMA SDK (freely available) includes a
    run-time framework, APIs, and tools for composing
    and deploying UIMA components.
  • Framework source code also available on
    Sourceforge
  • http://uima-framework.sourceforge.net/

34
UIMA Quick Overview: Architecture, Software
Framework and Tooling
35
Analytics Bridge the Unstructured & Structured
Worlds
Unstructured Information
  • Text, Chat, Email, Audio, Video
  • High-Value
  • Most Current
  • Fastest Growing
  • ...BUT...
  • Buried in Huge Volumes (Noise)
  • Implicit Semantics
  • Inefficient Search

Text and Multi-Modal Analytics
  • Discover Relevant Semantics -> Build into
    Structure
  • Docs, Emails, Phone Calls, Reports
  • Topics, Entities, Relationships
  • People, Places, Org, Times, Events
  • Customer Opinions, Products, Problems
  • Threats, Chemicals, Drugs, Drug Interactions....

Structured Information
  • Indices, DBs, KBs
  • Explicit Semantics
  • Efficient Search
  • Focused Content
  • ...BUT...
  • Slow Growing
  • Narrow Coverage
  • Less Current/Relevant

36
Analytics: The kinds of things they do
The right analysis for the job will likely be a
best-of-breed combination integrating
capabilities across many dimensions.
  • Independently developed
  • From an increasing number of sources
  • Different technologies & interfaces
  • Highly specialized & fine-grained

Analysis Capabilities
  • Language, Speaker Identifiers
  • Tokenizers
  • Classifiers
  • Part of Speech Detectors
  • Document Structure Detectors
  • Parsers, Translators
  • Named-Entity Detectors
  • Face Recognizers
  • Relationship Detectors

Capability Specializations
  • Modality
  • Human Language
  • Domain of Interest
  • Source Style and Format
  • Input/Output Semantics
  • Privacy/Security
  • Precision/Recall Tradeoffs
  • Performance/Precision Tradeoffs...

37
UIMA's Basic Building Blocks are Annotators. They
iterate over an artifact to discover new types
based on existing ones and update the Common
Analysis Structure (CAS) for upstream processing.
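
The annotator idea can be illustrated with a small sketch in which each annotator reads existing annotations from a shared structure and adds new ones. This is not the UIMA API: the CAS and Annotation classes below are hypothetical stand-ins, and the capitalized-run rule is a toy.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int


@dataclass
class CAS:
    """Hypothetical stand-in for a Common Analysis Structure: artifact plus annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]


def token_annotator(cas):
    """Add a Token annotation for every word in the artifact."""
    for m in re.finditer(r"\w+", cas.text):
        cas.annotations.append(Annotation("Token", m.start(), m.end()))


def name_annotator(cas):
    """Derive a new type from existing ones: runs of two or more capitalized Tokens."""
    run = []
    for tok in cas.select("Token") + [None]:  # None flushes the final run
        if tok is not None and cas.text[tok.begin].isupper():
            run.append(tok)
            continue
        if len(run) >= 2:
            cas.annotations.append(Annotation("Name", run[0].begin, run[-1].end))
        run = []


def run_pipeline(text, annotators):
    cas = CAS(text)
    for annotate in annotators:
        annotate(cas)  # each annotator updates the shared structure in place
    return cas
```

Order matters in such a pipeline: name_annotator finds nothing unless token_annotator has run first.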
38
  • Analyzed by a collection of text analytics
  • Detected Semantic Entities and Relations
    Highlighted
  • Represented in UIMA Common Analysis Structure
    (CAS)

39
UIMA: Unstructured Information Management
Architecture
  • Open Software Architecture and Emerging Standard
  • Platform independent standard for interoperable
    text and multi-modal analytics
  • Under development: UIMA Standards Technical
    Committee initiated under OASIS
  • http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
  • Software Framework Implementation
  • SDK Available on IBM Alphaworks
  • http://www.alphaworks.ibm.com/tech/uima
  • Tools, Utilities, Runtime, Extensive
    Documentation
  • Creation, Integration, Discovery, Deployment of
    analytics
  • Java, C, Perl, Python (others possible)
  • Supports co-located and service-oriented
    deployments (e.g., SOAP)
  • Cross-language high-performance APIs to common
    data structure (CAS)
  • Embeddable on Systems Middleware (e.g., ActiveMQ,
    WebSphere, DB2)
  • Apache UIMA open-source project
  • http://incubator.apache.org/uima/

40
[Figure: UIMA pipeline in three stages that exchange
data through the CAS]
  • Connect, Read & Segment Sources: any
    UIMA-compliant readers, segmenters
  • Analyze Content & Assign Task-Relevant Semantics:
    any UIMA-compliant analysis engine(s)
  • Index or Process Results: any UIMA-compliant CAS
    consumer(s)
  • Components shown: Web Crawler, File System Reader,
    Streaming Speech Segmenter, Transcription Engine,
    Video Object Detector, Deep Parser, Entity &
    Relation Detector(s), Arabic-English Translator
    (Web Service), Index Tokens & Annotations in IR
    Engine, Index Entities & Relations in RDB or OWL KB
  • Stores and indexes shown: Text IR Engine Index,
    Video Search Index, Relational Database, OWL
    Knowledge-Base
  • Access layers shown: Query Interface(s), Query
    Services, End-User Application Interfaces,
    Relevant Knowledge
  • UIMA: Pluggable Framework, User-defined Workflows
  • CAS: Common UIMA Data Representation &
    Interchange, aligned with OMG & W3C standards
    (i.e., XMI, SOAP, RDF)
41
UIMA Component Architecture
[Figure labels: Collection Processing Engine (CPE),
Aggregate Analysis Engine, Analysis Engine, Annotator,
Flow Controller, CAS Consumer, CAS]