Title: Web-Scale Information Extraction
1. Web-Scale Information Extraction
2. The Value of Text Data
- Unstructured text data is the primary form of human-generated information
  - Blogs, web pages, news, scientific literature, online reviews, ...
  - For semi-structured (database-generated) data, see Prof. Bing Liu's KDD webinar: http://www.cs.uic.edu/liub/WCM-Refs.html
  - The techniques discussed here are complementary to structured object extraction methods
- Need to extract structured information to effectively manage, search, and mine the data
- Information Extraction: a mature but active research area
  - At the intersection of Computational Linguistics, Machine Learning, Data Mining, Databases, and Information Retrieval
  - Traditional focus on accuracy of extraction
3. Outline
- Information Extraction Tasks
  - Entity tagging
  - Relation extraction
  - Event extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining can be most beneficial)
  - Other dimensions of scalability
4. Information Extraction Tasks
- Extracting entities and relations: the focus of this talk
  - Entities: named (e.g., Person) and generic (e.g., disease name)
  - Relations: entities related in a predefined way (e.g., the location of a disease outbreak, or the CEO of a company)
  - Events: can be composed from multiple relation tuples
- Common extraction subtasks:
  - Preprocess: sentence chunking, syntactic parsing, morphological analysis
  - Create rules or extraction patterns: hand-coded, machine learning, and hybrid
  - Apply extraction patterns or rules to extract new information
  - Postprocess and integrate information: co-reference resolution, deduplication, disambiguation
5. Entity Tagging
- Identifying mentions of entities (e.g., person names, locations, companies) in text
  - MUC (1997): Person, Location, Organization, Date/Time/Currency
  - ACE (2005): more than 100 more specific types
- Hand-coded vs. machine learning approaches
- The best approach depends on the entity type and domain:
  - Closed-class types (e.g., geographical locations, disease names, gene/protein names): hand-coded dictionaries
  - Syntactic types (e.g., phone numbers, zip codes): regular expressions (see the sketch after this list)
  - Semantic types (e.g., person and company names): a mixture of context, syntactic features, dictionaries, heuristics, etc.
- Almost solved for common/typical entity types
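As an illustration of the "regular expressions" bullet above, here is a minimal hand-coded tagger for two syntactic entity types. The patterns are illustrative sketches, not taken from any of the systems discussed.

```python
import re

# Illustrative patterns for two "syntactic" entity types.
US_PHONE = re.compile(r'\(?\b\d{3}\)?[-. ]\d{3}[-.]\d{4}\b')
US_ZIP   = re.compile(r'\b\d{5}(?:-\d{4})?\b')

def tag_syntactic_entities(text):
    """Return (type, match) pairs found by the hand-coded patterns."""
    entities = [('PHONE', m.group()) for m in US_PHONE.finditer(text)]
    entities += [('ZIP', m.group()) for m in US_ZIP.finditer(text)]
    return entities

print(tag_syntactic_entities("Call (858) 555-0123, San Diego CA 92122"))
# [('PHONE', '(858) 555-0123'), ('ZIP', '92122')]
```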
6. Example: Extracting Entities from Text
- Useful for data warehousing, data cleaning, web data integration
- Address: "4089 Whispering Pines Nobel Drive San Diego CA 92122"
- Citation: "Ronald Fagin, Combining Fuzzy Information from Multiple Systems, Proc. of ACM SIGMOD, 2002"
7. Hand-Coded Methods
- Easy to construct in some cases
  - e.g., to recognize prices, phone numbers, zip codes, conference names, etc.
- Intuitive to debug and maintain
  - Especially if written in a high-level language
- Can incorporate domain knowledge
- Scalability issues:
  - Labor-intensive to create
  - Highly domain-specific
  - Often corpus-specific
  - Rule matches can be expensive
- Example system: IBM Avatar
8. Machine Learning Methods
- Can work well when plenty of training data is easy to construct
- Can capture complex patterns that are hard to encode with hand-crafted rules
  - e.g., determine whether a review is positive or negative
  - e.g., extract long, complex gene names
  - Non-local dependencies
9. Popular Machine Learning Methods
For details see [Feldman, 2006] and [Cohen, 2004]
- Naive Bayes
- SRV [Freitag 1998], Inductive Logic Programming
- Rapier [Califf and Mooney 1997]
- Hidden Markov Models [Leek 1997]
- Maximum Entropy Markov Models [McCallum et al. 2000]
- Conditional Random Fields [Lafferty et al. 2001]
- Scalability:
  - Can be labor-intensive to construct training data
  - At run time, complex features can be expensive to construct or process (batch algorithms can help [Chandel et al. 2006]); a feature-template sketch follows this list
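To make the CRF-style approaches above concrete, the sketch below shows the kind of orthographic and contextual feature template a linear-chain CRF tagger consumes. The feature names and the mention of sklearn-crfsuite are assumptions for illustration; they are not tied to any specific system in this list.

```python
def token_features(tokens, i):
    """Orthographic and contextual features for token i, of the kind
    fed to a linear-chain CRF tagger (feature names are illustrative)."""
    w = tokens[i]
    return {
        'word.lower': w.lower(),
        'word.istitle': w.istitle(),
        'word.isupper': w.isupper(),
        'word.isdigit': w.isdigit(),
        'suffix3': w[-3:],
        'prev.lower': tokens[i - 1].lower() if i > 0 else '<BOS>',
        'next.lower': tokens[i + 1].lower() if i < len(tokens) - 1 else '<EOS>',
    }

sentence = "Ebola outbreak reported in Zaire".split()
X = [token_features(sentence, i) for i in range(len(sentence))]
# X, paired with gold BIO labels, could then be passed to a CRF trainer
# such as sklearn-crfsuite's CRF().fit([X], [labels]).
```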
10. Some Available Entity Taggers
- ABNER
  - http://www.cs.wisc.edu/~bsettles/abner/
  - Linear-chain conditional random fields (CRFs) with orthographic and contextual features
- Alias-i LingPipe
  - http://www.alias-i.com/lingpipe/
- MALLET
  - http://mallet.cs.umass.edu/index.php/Main_Page
  - A collection of NLP and ML tools; can be trained for named entity tagging
- MinorThird
  - http://minorthird.sourceforge.net/
  - Tools for learning to extract entities, categorization, and some visualization
- Stanford Named Entity Recognizer
  - http://nlp.stanford.edu/software/CRF-NER.shtml
  - CRF-based entity tagger with non-local features (a usage sketch follows this list)
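As a usage illustration for the Stanford tagger, the sketch below calls it through NLTK's wrapper. The model and jar file names are placeholders for files from a local Stanford NER download, and NLTK's punkt tokenizer data is assumed to be installed.

```python
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths are placeholders; point them at a local Stanford NER download.
tagger = StanfordNERTagger(
    'english.all.3class.distsim.crf.ser.gz',   # CRF model file
    'stanford-ner.jar',                        # tagger jar
    encoding='utf-8')

tokens = word_tokenize("Ronald Fagin works at IBM in San Jose.")
print(tagger.tag(tokens))
# e.g. [('Ronald', 'PERSON'), ('Fagin', 'PERSON'), ..., ('IBM', 'ORGANIZATION'), ...]
```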
11. Alias-i LingPipe (http://www.alias-i.com/lingpipe/)
- Statistical named entity tagger
  - Generative statistical model
  - Finds the most likely tags given lexical and linguistic features
  - Accuracy at (or near) the state of the art on benchmark tasks
- Explicitly targets scalability
  - 100K tokens/second runtime on a single PC
- Pipelined extraction of entities
  - User-defined mentions, pronouns, and stop list
  - Specified in a dictionary; matched left-to-right, longest match (see the sketch below)
- Can be trained/bootstrapped on annotated corpora
12. Outline
- Overview of Information Extraction
  - Entity tagging
  - Relation extraction
  - Event extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining and ML techniques shine)
  - Other dimensions of scalability
13. Relation Extraction: Examples
- Extract tuples of entities that are related in a predefined way
- Example: the Disease Outbreaks relation
- Example: protein interactions extracted from text (from AliBaba): "We show that CBF-A and CBF-C interact with each other to form a CBF-A-CBF-C complex and that CBF-B does not interact with CBF-A or CBF-C individually but that it associates with the CBF-A-CBF-C complex."
14. Relation Extraction Approaches
- Knowledge engineering
  - Experts develop rules and patterns
  - Patterns can be defined over lexical items: "<company> located in <location>" (see the sketch after this list)
  - Or over syntactic structures: ((Obj <company>) (Verb located) () (Subj <location>))
  - Sophisticated development/debugging environments: Proteus, GATE
- Machine learning
  - Supervised: train the system over manually labeled data [Soderland et al. 1997, Muslea et al. 2000, Riloff et al. 1996, Roth et al. 2005, Cardie et al. 2006, Mooney et al. 2005, ...]
  - Partially supervised: train the system by bootstrapping from seed examples [Agichtein & Gravano 2000, Etzioni et al. 2004, Yangarber & Grishman 2001, ...]
  - Open (no seeds): [Sekine et al. 2006, Cafarella et al. 2007, Banko et al. 2007]
- Hybrid or interactive systems
  - Experts interact with machine learning algorithms (e.g., the active learning family) to iteratively refine/extend rules and patterns
  - Interactions can involve annotating examples, modifying rules, or any combination
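To illustrate the lexical-pattern style above, the sketch below applies a "<company> located in <location>" pattern to text whose entities have already been tagged. The inline markup convention and the pattern itself are invented for illustration.

```python
import re

# Assume an upstream entity tagger wrapped entities as <TYPE>text</TYPE>.
PATTERN = re.compile(
    r'<COMPANY>(?P<company>[^<]+)</COMPANY>\s+'
    r'(?:is\s+)?(?:headquartered|located)\s+in\s+'
    r'<LOCATION>(?P<location>[^<]+)</LOCATION>')

tagged = "<COMPANY>Acme Corp</COMPANY> is located in <LOCATION>San Diego</LOCATION>."
for m in PATTERN.finditer(tagged):
    print(('LocatedIn', m.group('company'), m.group('location')))
# ('LocatedIn', 'Acme Corp', 'San Diego')
```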
15. Open Information Extraction [Banko et al., IJCAI 2007]
- Self-Supervised Learner
  - All triples (e1, r, e2) in a sample corpus are considered potential tuples for relation r
  - Positive examples: candidate triples generated by a dependency parser
  - Train a classifier on lexical features for positive and negative examples
- Single-Pass Extractor
  - Classify all pairs of candidate entities as holding some (undetermined) relation or not
  - Heuristically generate a relation name from the words between the entities (see the sketch after this list)
- Redundancy-Based Assessor
  - Estimate the probability that entities are related from co-occurrence statistics
- Scalability
  - Extraction/indexing:
    - No tuning or domain knowledge during extraction; relation inclusion is determined at query time
    - 0.04 CPU seconds per sentence; a 9M web page corpus processed in 68 CPU hours
    - Every document is retrieved and processed (parsed, indexed, classified) in a single pass
  - Query time:
    - Distributed index for tuples, hashed on the relation name text
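A toy sketch of two of the ideas above: the single-pass extractor's heuristic of naming a relation from the words between two entity mentions, and the redundancy-based assessor's intuition that repeated triples are more likely to be correct. This is not the actual TextRunner implementation; the entity spotter and stopword list are deliberately crude.

```python
import re
from collections import Counter

ENTITY = re.compile(r'\b[A-Z][a-zA-Z]+\b')          # crude proper-noun spotter
STOP = {'the', 'a', 'an', 'and', 'that', 'this'}

def extract_triples(sentence):
    """Yield (e1, relation, e2) with the relation name taken from the
    non-stopword tokens between two consecutive entity mentions."""
    mentions = list(ENTITY.finditer(sentence))
    for m1, m2 in zip(mentions, mentions[1:]):
        between = sentence[m1.end():m2.start()].split()
        relation = ' '.join(w.lower() for w in between if w.lower() not in STOP)
        if 0 < len(relation.split()) <= 4:
            yield (m1.group(), relation, m2.group())

corpus = [
    "Ebola was detected in Zaire last month.",
    "Officials confirmed that Ebola was detected in Zaire.",
]
counts = Counter(t for s in corpus for t in extract_triples(s))
print(counts.most_common(1))
# [(('Ebola', 'was detected in', 'Zaire'), 2)]  -- higher counts suggest higher confidence
```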
16. Event Extraction
- Similar to relation extraction, but:
  - Events can be nested
  - Significantly more complex (e.g., more slots) than relations/template elements
  - Often requires coreference resolution, disambiguation, deduplication, and inference
- Example: an integrated disease outbreak event [Huttunen et al. 2002]
17. Event Extraction: Integration Challenges
- Information spans multiple documents
  - Missing or incorrect values
  - Combining simple tuples into complex events
  - No single key to order or cluster likely duplicates while separating them from similar but different entities
  - Ambiguity: distinct physical entities with the same name (e.g., "Kennedy")
- Duplicate entities and relation tuples are extracted
  - Large lists with multiple noisy mentions of the same entity/tuple
  - Need to rely on fuzzy and expensive string similarity functions
  - Cannot afford to compare each mention with every other one (see the blocking sketch after this list)
- See Part II of the KDD 2006 tutorial "Scalable Information Extraction and Integration" on scaling up integration: http://www.scalability-tutorial.net/
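A minimal sketch of the blocking idea above: group mentions by a cheap key and run the expensive fuzzy comparison only within a block. Here difflib's ratio stands in for a real string similarity function (e.g., Jaro-Winkler or TF-IDF cosine), and the mentions and threshold are illustrative.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def dedup(mentions, threshold=0.85):
    """Cluster near-duplicate mentions; the expensive fuzzy comparison
    is only run within blocks sharing a cheap key (the first token)."""
    blocks = defaultdict(list)
    for m in mentions:
        blocks[m.split()[0].lower()].append(m)

    clusters = []
    for block in blocks.values():
        block_clusters = []
        for m in block:
            for c in block_clusters:
                if similar(m, c[0]) >= threshold:
                    c.append(m)
                    break
            else:
                block_clusters.append([m])
        clusters.extend(block_clusters)
    return clusters

print(dedup(["John F. Kennedy", "John Kennedy", "Kennedy Airport", "John F Kennedy"]))
# [['John F. Kennedy', 'John Kennedy', 'John F Kennedy'], ['Kennedy Airport']]
```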
18. Summary: Accuracy of Extraction Tasks
[Feldman, ICML 2006 tutorial]
- Errors cascade (errors in entity tagging cause errors in relation extraction)
- These accuracy estimates are optimistic:
  - They hold primarily for well-established (tuned) tasks
  - Many specialized or novel IE tasks (e.g., bio- and medical domains) exhibit lower accuracy
  - Accuracy on all tasks is significantly lower for non-English text
19. Multilingual Information Extraction
- Closely tied to machine translation and cross-language information retrieval efforts
- Language-independent named entity tagging and related tasks at CoNLL:
  - 2006: multilingual dependency parsing (http://nextens.uvt.nl/conll/)
  - 2002, 2003 shared tasks: language-independent named entity tagging (http://www.cnts.ua.ac.be/conll2003/ner/)
- Global Autonomous Language Exploitation (GALE) program
  - http://www.darpa.mil/ipto/Programs/gale/concept.htm
- Interlingual Annotation of Multilingual Text Corpora (IAMTC)
  - Tools and data for building MT and IE systems for six languages
  - http://aitc.aitcnet.org/nsf/iamtc/index.html
- REFLEX project: NER for 50 languages
  - Exploits temporal correlations in weakly aligned corpora for training
  - http://l2r.cs.uiuc.edu/~cogcomp/wpt.php?pr_key=REFLEX
- Cross-language information retrieval evaluation (CLEF)
  - http://www.clef-campaign.org/
20. Outline
- Overview of Information Extraction
  - Entity tagging
  - Relation extraction
  - Event extraction
- Scaling up Information Extraction
  - Focus on scaling up to large collections (where data mining and ML techniques shine)
  - Other dimensions of scalability
21. Scaling Information Extraction to the Web
- Dimensions of scalability:
  - Corpus size
    - Applying rules/patterns is expensive
    - Need efficient ways to select/filter relevant documents
  - Document accessibility
    - Deep-web documents are only accessible via a search interface
    - Dynamic sources: documents disappear from the top page
  - Source heterogeneity
    - Coding/learning patterns for each source is expensive
    - Requires many rules (expensive to apply)
  - Domain diversity
    - Extracting information for any domain, entities, or relationships
    - Some recent progress (e.g., see slide 17)
    - Not the focus of this talk
22. Scaling Up Information Extraction
- Scan-based extraction
  - Classification/filtering to avoid processing documents
  - Sharing common tags/annotations
- General keyword index-based techniques
  - QXtract, KnowItAll
- Specialized indexes
  - BE/KnowItNow, the Linguist's Search Engine
- Parallelization/distributed processing
  - IBM WebFountain, UIMA, Google's MapReduce
23. Efficient Scanning for Information Extraction
[Diagram: documents from the text database are filtered before being passed to the extraction system]
- Retrieve documents from the database
- 80/20 rule: use a few simple rules to capture the majority of the instances [Pantel et al. 2004]
- Train a classifier to discard irrelevant documents without processing them [Grishman et al. 2002]
  - e.g., the Sports section of the NYT is unlikely to describe disease outbreaks (see the filtering sketch after this list)
- Share base annotations (entity tags) across multiple extraction tasks
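A sketch of the document-filtering idea above using a simple scikit-learn text classifier; the training documents and labels are invented for illustration, and in practice the filter would be trained on a labeled sample of the corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: 1 = may describe a disease outbreak.
docs = ["Cholera cases reported near the Sudan border",
        "Officials confirm an Ebola outbreak in Zaire",
        "The Yankees won the World Series opener",
        "Quarterly earnings beat analyst expectations"]
labels = [1, 1, 0, 0]

# Cheap filter: only documents the classifier accepts reach the
# (expensive) extraction patterns.
doc_filter = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(docs, labels)

new_docs = ["Malaria outbreak suspected in Ethiopia",
            "Local team signs a new quarterback"]
for d, keep in zip(new_docs, doc_filter.predict(new_docs)):
    print(keep, d)   # run the extractor only when keep == 1
```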
24. Exploiting Keyword and Phrase Indexes
- Generate queries to retrieve only relevant documents
- This is a data mining problem!
- Some methods in the literature:
  - Traversing query graphs [Agichtein et al. 2003]
  - Iteratively refining queries [Agichtein and Gravano 2003]
  - Iteratively partitioning the document space [Etzioni et al. 2004]
- Case studies: QXtract, KnowItAll
25. Simple Strategy: Iterative Set Expansion
[Diagram: a loop between the text database, the extraction system, and query generation]
- Query the database with queries generated from seed tuples (e.g., "Ebola AND Zaire")
- Process the retrieved documents with the extraction system
- Augment the seed set with newly extracted tuples (e.g., <Malaria, Ethiopia>) and repeat (see the loop sketch after this list)
- Execution time = |Retrieved Docs| × (R + P) + |Queries| × Q, where R is the time to retrieve a document, P the time to process a document, and Q the time to answer a query
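A sketch of the iterative set expansion loop above. The `search` and `extract_tuples` functions are assumed to be supplied by the surrounding system (a keyword search interface and a tuple extractor, respectively).

```python
def iterative_set_expansion(seed_tuples, search, extract_tuples, max_iters=5):
    """Sketch of the loop above. `search(query)` returns documents for a
    keyword query and `extract_tuples(doc)` returns relation tuples;
    both are assumed to be provided by the surrounding system."""
    known = set(seed_tuples)
    frontier = list(seed_tuples)
    for _ in range(max_iters):
        if not frontier:
            break
        new_tuples = set()
        for disease, location in frontier:             # e.g. ('Ebola', 'Zaire')
            for doc in search(f'"{disease}" AND "{location}"'):
                new_tuples.update(extract_tuples(doc))
        frontier = list(new_tuples - known)             # only query with unseen tuples
        known |= new_tuples
    return known
```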
26. Reachability via Querying [Agichtein et al. 2003b]
[Diagram: a reachability graph connecting tuples t1-t5 (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) to documents d1-d5; for example, querying with t1 retrieves document d1, which contains t2]
- The upper recall limit is determined by the size of the biggest connected component (a sketch for computing it follows)
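A sketch of estimating that upper bound with a connected-component computation over a toy reachability graph, using networkx and approximating the directed querying/containment edges as an undirected graph, as on the slide.

```python
import networkx as nx

# Bipartite reachability graph: an edge between tuple t and document d means
# querying with t retrieves d, or d contains t. (Toy edges for illustration.)
G = nx.Graph()
G.add_edges_from([('t1', 'd1'), ('d1', 't2'),      # t1 retrieves d1, which contains t2
                  ('t2', 'd2'), ('d2', 't3'),
                  ('t4', 'd4')])                    # t4/d4 form a separate component

largest = max(nx.connected_components(G), key=len)
reachable_tuples = {n for n in largest if n.startswith('t')}
print(f"Upper bound on recall: {len(reachable_tuples)} tuples reachable "
      f"out of {sum(n.startswith('t') for n in G.nodes)}")
```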
27. Some Available IE Tools
- MALLET (UMass)
  - Statistical natural language processing, document classification, clustering, information extraction, and other machine learning applications to text
  - Sample application: GeneTaggerCRF, a gene-entity tagger based on MALLET (MAchine Learning for LanguagE Toolkit); it uses conditional random fields to find genes in a text file
28. MinorThird
- http://minorthird.sourceforge.net/
- A collection of Java classes for storing text, annotating text, and learning to extract entities and categorize text
- Stored documents can be annotated in independent files using TextLabels (denoting, say, part-of-speech and semantic information)
29. GATE
- http://gate.ac.uk/ie/annie.html
- A leading toolkit for text mining
- Distributed with an information extraction component set called ANNIE (demo)
- Used in many research projects
  - A long list can be found on its website
- Integration with IBM UIMA is under way
30. Sunita Sarawagi's CRF Package
- http://crf.sourceforge.net/
- A Java implementation of conditional random fields for sequential labeling
31. UIMA (IBM)
- Unstructured Information Management Architecture
- A platform for building unstructured information management solutions from combinations of semantic analysis (IE) and search components
32. Some Interesting Websites Based on IE
- ZoomInfo
- CiteSeer.org (some of us use it every day!)
- Google Local, Google Scholar
- ... and many more
33. UIMA (IBM Research)
- Unstructured Information Management Architecture (UIMA)
  - http://www.research.ibm.com/UIMA/
- An open component software architecture for the development, composition, and deployment of text processing and analysis components
- The run-time framework lets components and applications be plugged in and run on different platforms; it supports distributed processing, failure recovery, ...
- Scales to millions of documents; incorporated into IBM OmniFind; grid-computing ready
- The UIMA SDK (freely available) includes a run-time framework, APIs, and tools for composing and deploying UIMA components
- The framework source code is also available on SourceForge: http://uima-framework.sourceforge.net/
34. UIMA Quick Overview: Architecture, Software Framework, and Tooling
35. Analytics Bridge the Unstructured and Structured Worlds
[Diagram: text and multi-modal analytics connect unstructured information (text, chat, email, audio, video) with structured information (indices, DBs, KBs)]
- Discover relevant semantics and build them into structure
  - Sources: documents, emails, phone calls, reports
  - Extracted semantics: topics, entities, relationships; people, places, organizations, times, events; customer opinions, products, problems; threats, chemicals, drugs, drug interactions, ...
- Unstructured information: high-value, most current, fastest growing ... BUT buried in huge volumes (noise), implicit semantics, inefficient search
- Structured information: explicit semantics, efficient search, focused content ... BUT slow growing, narrow coverage, less current/relevant
36. The right analysis for the job will likely be a best-of-breed combination, integrating capabilities across many dimensions.
- Analytics: the kinds of things they do
  - Independently developed
  - From an increasing number of sources
  - Different technologies and interfaces
  - Highly specialized and fine-grained
- Analysis capabilities: language and speaker identifiers, tokenizers, classifiers, part-of-speech detectors, document structure detectors, parsers, translators, named-entity detectors, face recognizers, relationship detectors
- Capability specializations: modality, human language, domain of interest, source style and format, input/output semantics, privacy/security, precision/recall tradeoffs, performance/precision tradeoffs, ...
37. UIMA's basic building blocks are Annotators. They iterate over an artifact to discover new types based on existing ones, and update the Common Analysis Structure (CAS) for downstream processing.
38.
- Analyzed by a collection of text analytics
- Detected semantic entities and relations are highlighted
- Represented in the UIMA Common Analysis Structure (CAS)
39. UIMA: Unstructured Information Management Architecture
- Open software architecture and emerging standard
  - Platform-independent standard for interoperable text and multi-modal analytics
  - Under development: UIMA Standards Technical Committee initiated under OASIS
  - http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=uima
- Software framework implementation
  - SDK available on IBM alphaWorks: http://www.alphaworks.ibm.com/tech/uima
  - Tools, utilities, runtime, extensive documentation
  - Creation, integration, discovery, and deployment of analytics
  - Java, C++, Perl, Python (others possible)
  - Supports co-located and service-oriented deployments (e.g., SOAP)
  - Cross-language, high-performance APIs to the common data structure (CAS)
  - Embeddable in systems middleware (e.g., ActiveMQ, WebSphere, DB2)
- Apache UIMA open-source project: http://incubator.apache.org/uima/
40. [Diagram: a UIMA processing pipeline built from pluggable components in user-defined workflows]
- Connect, read, and segment sources: any UIMA-compliant readers/segmenters (e.g., web crawler, file system reader, streaming speech segmenter)
- Analyze content and assign task-relevant semantics: any UIMA-compliant analysis engines (e.g., transcription engine, video object detector, Arabic-English translator as a web service, deep parser, entity and relation detectors)
- Index or process results: any UIMA-compliant CAS consumers (e.g., index tokens and annotations in an IR engine; index entities and relations in a relational database or an OWL knowledge base; build a video search index)
- Results reach end users as relevant knowledge through query interfaces, query services, and end-user application interfaces
- The CAS is the common UIMA data representation and interchange format flowing between components, aligned with OMG and W3C standards (e.g., XMI, SOAP, RDF)
41. UIMA Component Architecture
[Diagram: a Collection Processing Engine (CPE) combines an Aggregate Analysis Engine with CAS Consumers; the Aggregate Analysis Engine composes Analysis Engines (each wrapping an Annotator) under a Flow Controller, with CASes passed between components]