Title: Information Extraction - Introduction and Tools
1. Information Extraction - Introduction and Tools
- V. G. Vinod Vydiswaran
- Roll no. 02329011
- M.Tech (1st Year)
- KReSIT, IIT Bombay
- 29th October 2002
- Guided by Prof. S. Sarawagi
2. Introduction
- What is Information Extraction (IE)?
- To select desired fields from the given data, by extracting common patterns that appear along with the information.
- To automate such a process.
- To make the process efficient by reducing the training data required, so as to restrict the cost.
3. Motivation
- Abundant online data is available.
- Most IE systems are specific to a single information resource.
- IE models are usually hand-coded, and hence error-prone.
- Data is available either in structured form or in highly verbose content; proper filters are needed.
4. Types of Data
- Based on text styles
  - Structured data
  - Semi-structured text
  - Plain text
- Based on information to the model
  - Labeled
  - Unlabeled
5. Structured Data
- Relational Data
  - Data in databases, in tables
- HTML Tags
  - Query responses translated into relational form using Wrappers
  - Usually hand-coded and very specific to the information resource
6. Wrapper Induction
- Wrapper
  - A procedure that extracts tuples from a particular information source
  - A function from a page to a set of tuples
- Induction
  - The task of generalizing from labeled examples to a hypothesis function for labeling instances
7. Wrapper Identification

ExtractCCs(page P)
  skip past first occurrence of <P> in P
  while next <B> is before next <HR> in P
    for each (lk, rk) ∈ {(<B>, </B>), (<I>, </I>)}
      skip past next occurrence of lk in P
      extract attribute from P to next occurrence of rk
  return extracted tuples

ExtractHLRT(page P, <h, t, l1, r1, ..., lK, rK>)
  skip past first occurrence of h in P
  while next l1 is before next t in P
    for each (lk, rk) ∈ {(l1, r1), ..., (lK, rK)}
      skip past next occurrence of lk in P
      extract attribute from P to next occurrence of rk
  return extracted tuples

HLRT = Head, Left, Right, Tail
Example page:

  <HTML><HEAD>
  <TITLE>Country Codes</TITLE>
  </HEAD>
  <BODY>
  <B>Some Country Codes</B>
  <P>
  <B>Congo</B> <I>242</I><BR>
  <B>Egypt</B> <I>20</I><BR>
  <B>India</B> <I>91</I><BR>
  <B>Spain</B> <I>34</I><BR>
  <HR>
  <B>End</B>
  </BODY>
  </HTML>
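The generic HLRT procedure above is easy to express directly. Below is a minimal Python sketch, assuming the page is available as a string; the function and variable names are mine, not Kushmerick's, and error handling is omitted.

import textwrap

def extract_hlrt(page, h, t, delimiters):
    """Extract tuples from `page` using an HLRT wrapper.
    h          -- head delimiter (start of the data region)
    t          -- tail delimiter (end of the data region)
    delimiters -- list of (lk, rk) pairs, one per attribute
    """
    tuples = []
    pos = page.index(h) + len(h)              # skip past first occurrence of h
    l1 = delimiters[0][0]
    while True:
        next_l1 = page.find(l1, pos)
        next_t = page.find(t, pos)
        if next_l1 == -1 or (next_t != -1 and next_t < next_l1):
            break                             # next l1 is not before next t
        row = []
        for lk, rk in delimiters:
            pos = page.index(lk, pos) + len(lk)   # skip past next lk
            end = page.index(rk, pos)             # extract up to next rk
            row.append(page[pos:end])
            pos = end + len(rk)
        tuples.append(tuple(row))
    return tuples

page = textwrap.dedent("""\
    <HTML><HEAD><TITLE>Country Codes</TITLE></HEAD><BODY>
    <B>Some Country Codes</B><P>
    <B>Congo</B> <I>242</I><BR>
    <B>Egypt</B> <I>20</I><BR>
    <B>India</B> <I>91</I><BR>
    <B>Spain</B> <I>34</I><BR>
    <HR><B>End</B></BODY></HTML>""")

# HLRT wrapper for this resource: head <P>, tail <HR>, and
# (<B>, </B>), (<I>, </I>) as the per-attribute delimiters.
print(extract_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")]))
# [('Congo', '242'), ('Egypt', '20'), ('India', '91'), ('Spain', '34')]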
8. Constructing Wrappers
- Problem
  - Given a supply of example query responses, learn a wrapper for the information resource that generated them.
- Instances
- Labels
- Hypotheses
- Oracles
  - Page Oracles, Label Oracles
- Probably Approximately Correct (PAC) Analysis
  - Input: accuracy (ε) and confidence (δ)
9. Building HLRT Wrappers

BuildHLRT(labeled pages T = {<P1, L1>, ..., <Pn, Ln>})
  for k = 1 to K
    rk ← any common prefix of strings following each (but not contained in any) attribute k
  for k = 2 to K
    lk ← any common suffix of strings preceding each attribute k
  for each common suffix l1 of the pages' heads
    for each common substring h of the pages' heads
      for each common substring t of the pages' tails
        if (a) h precedes l1 in each of the pages' heads
           (b) t precedes l1 in each of the pages' tails
           (c) t occurs between h and l1 in none of the pages' heads
           (d) l1 doesn't follow t in any inter-tuple separator
        then return (h, t, l1, r1, ..., lK, rK)

T ← ∅
repeat
  Pn ← PageOracle()
  Ln ← LabelOracle(Pn)
  T ← T ∪ {<Pn, Ln>}
  w ← BuildHLRT(T)
until Pr[E(w) < ε] > 1 − δ
return w

The wrapper induction algorithm (a Python sketch of the oracle loop follows below).
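In the sketch below, page_oracle, label_oracle and build_hlrt are assumed to be callables supplied by the caller, and the PAC stopping test Pr[E(w) < ε] > 1 − δ is replaced by a simple page-count stand-in, since computing the real bound is beyond this sketch.

def induce_wrapper(page_oracle, label_oracle, build_hlrt, enough_pages=10):
    examples = []                        # T <- empty set
    while True:
        page = page_oracle()             # Pn <- PageOracle()
        labels = label_oracle(page)      # Ln <- LabelOracle(Pn)
        examples.append((page, labels))  # T <- T U {<Pn, Ln>}
        wrapper = build_hlrt(examples)   # w <- BuildHLRT(T)
        if len(examples) >= enough_pages:  # stand-in for the PAC test
            return wrapper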
10. Semi-Structured Text
- Telegraphic messages
- Advertisements
- Rule (WHISK algorithm); a rough regex analogue follows the sample ad below
  - Pattern:: *( Nghbr ) *( Digit ) Bdrm *( Number )
  - Output:: Rental {Neighborhood $1} {Bedrooms $2} {Price $3}
  - Bdrm = (brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)

Sample ad:
  Capitol Hill 1 br twnhme. fplc D/W W/D. Undergrnd pkg incl $675.
  3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc $995.
  (206)999-9999
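As a rough illustration only: WHISK uses its own rule language, not regular expressions, but the slide's rule can be approximated with a Python regex. The grouping and the Bdrm alternation come from the slide; everything else here is my assumption.

import re

BDRM = r"(?:brs|br|bdrm|bd|bedrooms|bedroom|bed|BR)"
rule = re.compile(
    r"([A-Z][\w. ]+?)\s+"        # *( Nghbr )      -> Neighborhood ($1)
    r"(\d+)\s+" + BDRM +         # *( Digit ) Bdrm -> Bedrooms ($2)
    r"[^$]*\$(\d+)"              # ... *( Number ) -> Price ($3)
)

ad = ("Capitol Hill 1 br twnhme. fplc D/W W/D. Undergrnd pkg incl $675. "
      "3 BR, upper flr of turn of ctry HOME. incl gar, grt N. Hill loc "
      "$995. (206)999-9999")

m = rule.search(ad)
print({"Neighborhood": m.group(1), "Bedrooms": m.group(2), "Price": m.group(3)})
# {'Neighborhood': 'Capitol Hill', 'Bedrooms': '1', 'Price': '675'}

Note that this crude analogue only recovers the first rental in the ad; WHISK's actual rule application is more flexible.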
11. Free Text
- Newspaper reports
- Example: management succession
- Input text
  - "C. Vincent Protho, chairman and chief executive officer of this maker of semiconductors, was named to the additional post of president, succeeding John W. Smith, who resigned to pursue other interests."
- Succession Event
  - Person In: C. Vincent Protho
  - Person Out: John W. Smith
  - Post: president
- Much more difficult: needs syntax analysis
12. Hidden Markov Models
- Probability of "abb" (summed over both paths through the model):
  - 0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 = 0.084
  - 0.7 × 0.5 × 0.2 × 0.7 × 0.8 × 1 = 0.0392
  - Total: 0.084 + 0.0392 = 0.1232
- Maximum probability path: 0.3 × 1 × 0.7 × 0.5 × 0.8 × 1 = 0.084
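The transition diagram behind these numbers is not reproduced in this text, so the sketch below only redoes the arithmetic: each path's probability is the product of its transition and emission probabilities, and the probability of "abb" is the sum over both paths.

path1 = [0.3, 1.0, 0.7, 0.5, 0.8, 1.0]   # factors along the first path
path2 = [0.7, 0.5, 0.2, 0.7, 0.8, 1.0]   # factors along the second path

def path_prob(factors):
    p = 1.0
    for f in factors:
        p *= f
    return p

p1, p2 = path_prob(path1), path_prob(path2)
print(round(p1, 4))           # 0.084   -> first path
print(round(p2, 4))           # 0.0392  -> second path
print(round(p1 + p2, 4))      # 0.1232  -> total probability of "abb"
print(round(max(p1, p2), 4))  # 0.084   -> most likely single path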
13. Advantages of HMM
- Strong statistical foundation
- Robust handling of new data
- Computationally efficient to develop and evaluate

Disadvantages of HMM
- Requires an a priori notion of model topology
- Needs a large amount of training data
14. HMM Use in IE
- Enumeration of paths takes exponential time.
- The Viterbi algorithm, based on dynamic programming, finds the most likely path in quadratic time using a trellis structure (see the sketch after this list).
- Each state is associated with a class to be extracted.
- Each state emits words from a class-specific unigram distribution.
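A standard Viterbi decoder over a trellis looks like the sketch below. The states and probability tables are illustrative stand-ins, not taken from the slides; in the IE setting each state would be a field class emitting words from its class-specific unigram.

def viterbi(words, states, start_p, trans_p, emit_p):
    # trellis[t][s] = (best prob. of any path ending in s at t, backpointer)
    trellis = [{s: (start_p[s] * emit_p[s].get(words[0], 1e-9), None)
                for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            prev = max(states, key=lambda r: trellis[t - 1][r][0] * trans_p[r][s])
            p = (trellis[t - 1][prev][0] * trans_p[prev][s]
                 * emit_p[s].get(words[t], 1e-9))
            column[s] = (p, prev)
        trellis.append(column)
    # Trace back the most likely state sequence.
    last = max(states, key=lambda s: trellis[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        path.append(trellis[t][path[-1]][1])
    return list(reversed(path))

states = ["title", "author"]
start_p = {"title": 0.8, "author": 0.2}
trans_p = {"title": {"title": 0.7, "author": 0.3},
           "author": {"title": 0.1, "author": 0.9}}
emit_p = {"title": {"learning": 0.3, "hmm": 0.3, "structure": 0.2},
          "author": {"kristie": 0.4, "seymore": 0.4}}

print(viterbi(["learning", "hmm", "structure", "kristie", "seymore"],
              states, start_p, trans_p, emit_p))
# ['title', 'title', 'title', 'author', 'author']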
15. Training the HMM
- Multiple states per class are better for modeling hidden sequence structure.
- Each state is associated with one word and the associated class.
- The model can be simplified by neighbour merging and V-merging.
- Model structure can be learned automatically from data using Bayesian model merging.
16. Labeled, Unlabeled and Distantly Labeled Data
- Asking the Oracle to label is usually costly.
- Counts are used to estimate probabilities (see the sketch after this list).
- Unlabeled data: Baum-Welch training algorithm using the trellis structure of the HMM.
- Distantly labeled data: only the relevant portion of the fields is used, e.g. BibTeX entries.
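As a small illustration of the count-based estimate: a maximum-likelihood emission distribution for one state divides each word's count by the state's total emissions. The tokens below are made up for the example.

from collections import Counter

# P(w | state) = count(w emitted by state) / total emissions of state.
author_tokens = ["kristie", "seymore", "andrew", "mccallum", "kristie"]
counts = Counter(author_tokens)
total = sum(counts.values())
emit_p = {w: c / total for w, c in counts.items()}
print(emit_p["kristie"])   # 0.4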
17. Experiment
- Problem definition
  - To extract key fields such as title, author, affiliation, email, abstract, keywords and introduction from the headers of scientific research papers.
- Types
  - LD: distantly labeled data interpolated with labeled data
  - LD: labeled and unlabeled data
  - ML: maximum likelihood
18. Results
- Observations
  - ML and self perform better than full.
  - Distantly labeled data helps.
  - Smoothing gives the best results.
  - Models with multiple states per class outperform the ML model, reaching 92.9% accuracy.
19. Conclusion
- Wrapper Induction and WHISK are multi-slot systems with high, near-perfect accuracy on structured and semi-structured data.
- For free text, the WHISK algorithm falls considerably short of human capability, but is still useful.
- HMMs and other probabilistic modeling tools can also be learned for IE using efficient algorithms like Viterbi and Baum-Welch.
20. References
- N. Kushmerick, D. S. Weld, R. Doorenbos. Wrapper Induction for Information Extraction. IJCAI-97.
- S. Soderland. Learning Information Extraction Rules for Semi-structured and Free Text.
- P. de Souza. Introduction to Discrete Hidden Markov Models. 1997.
- K. Seymore, A. McCallum, R. Rosenfeld. Learning Hidden Markov Model Structure for Information Extraction.
21. Thank You
- for your interest and patient hearing