Title: Annotation Free Information Extraction
1Annotation Free Information Extraction
- Chia-Hui Chang
- Department of Computer Science and Information Engineering
- National Central University
- chia_at_csie.ncu.edu.tw
- 10/4/2002
2IEPAD: Information Extraction based on Pattern Discovery
- C.H. Chang.
- National Central University
- WWW10
3Semi-structured Information Extraction
- Information Extraction (IE)
- Input: HTML pages
- Output: A set of records
4Pattern Discovery based IE
- Motivation
- Display of multiple records often forms a repeated pattern
- The occurrences of the pattern are spaced regularly and adjacently
- Now the problem becomes ...
- Find regular and adjacent repeats in a string
5IEPAD Architecture
6The Pattern Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule Composer
71. Web Page Translation
- Encoding of HTML source
- Rule 1: Each tag is encoded as a token
- Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)
- HTML Example
- <B>Congo</B><I>242</I><BR>
- <B>Egypt</B><I>20</I><BR>
- Encoded token string
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
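The two encoding rules can be sketched in a few lines of Python. This is only an illustration, not the IEPAD translator itself: the function name `encode_html`, the regular expression, and the choice to drop tag attributes and skip whitespace-only text are all assumptions made here.

```python
import re

# Split the source into tags ("<...>") and runs of text between tags.
TAG_OR_TEXT = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<\s*(/?)\s*(\w+)")

def encode_html(html):
    """Translate HTML source into a list of tokens.

    Rule 1: each tag becomes one token such as "<B>" or "</B>".
    Rule 2: any non-blank text between two tags becomes the TEXT token "_".
    """
    tokens = []
    for piece in TAG_OR_TEXT.findall(html):
        if piece.startswith("<"):
            match = TAG_NAME.match(piece)
            if match is None:
                continue  # comments, doctype, etc. are skipped (assumption)
            slash, name = match.groups()
            tokens.append("<" + slash + name.upper() + ">")
        elif piece.strip():
            tokens.append("_")
    return tokens

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(encode_html("<B>Congo</B><I>242</I><BR>") == record)
```

Both example records encode to the same seven-token string, which is what makes the repeated pattern visible to the later stages.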
8Various Encoding Schemes
92. PAT Tree Construction
- PAT tree: binary suffix tree
- A Patricia tree constructed over all possible suffix strings of a text
- Example
- T(<B>) 000
- T(</B>) 001
- T(<I>) 010
- T(</I>) 011
- T(<BR>) 100
- T(_) 110
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
000110001010110011100 000110001010110011100
10The Constructed PAT Tree
11Definition of Maximal Repeats
- Let α occur in S at positions p1, p2, p3, ..., pk
- α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1]
- α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
- α is a maximal repeat if it is both left maximal and right maximal
12Finding Maximal Repeats
- Definition
- Let's call character S[pi-1] the left character of suffix pi
- A node v is left diverse if at least two leaves in v's subtree have different left characters
- Lemma
- The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse
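The definition of a maximal repeat can be checked directly with a brute-force sketch. It deliberately does not build a PAT tree: it enumerates every substring, so it is quadratic or worse and only suitable for small examples. The function name `maximal_repeats` and the treatment of the string boundaries as unique sentinels are assumptions made here.

```python
from collections import defaultdict

def maximal_repeats(s, min_len=1):
    """Return {substring: positions} for every maximal repeat in s.

    s may be a str or a tuple of tokens. A substring is a maximal repeat
    if it occurs at least twice and its occurrences disagree both on the
    character to the left and on the character to the right.
    """
    n = len(s)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occurrences[s[i:j]].append(i)

    result = {}
    for sub, positions in occurrences.items():
        if len(positions) < 2:
            continue
        # A boundary position gets a unique sentinel, so it differs
        # from every real character (assumption of this sketch).
        left = {s[p - 1] if p > 0 else ("^", p) for p in positions}
        right = {s[p + len(sub)] if p + len(sub) < n else ("$", p)
                 for p in positions}
        if len(left) > 1 and len(right) > 1:
            result[sub] = positions
    return result

print(maximal_repeats("xabcyabcz"))
```

On "xabcyabcz" only "abc" qualifies: "ab" is always followed by "c" and "bc" is always preceded by "a", so neither is maximal.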
133. Pattern Validator
- Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < ... < pk, where pi denotes the position of each suffix in the encoded token sequence.
- Characteristics of a Pattern
- Regularity: Variance coefficient
- Adjacency: Density
14Pattern Validator (Cont.)
- Basic Screening
- For each maximal repeat α, compute V(α) and D(α)
- a) check if the pattern's variance V(α) < 0.5
- b) check if the pattern's density 0.25 < D(α) < 1.5
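The screening step can be sketched as follows. The formulas for V and D did not survive in this transcript, so the definitions below are assumptions: V is taken as the standard deviation of the gaps between adjacent occurrences divided by their mean, and D as (k-1)·|α| / (pk - p1). Only the thresholds 0.5, 0.25 and 1.5 come from the slides; the function names are hypothetical.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """Assumed definition: std-dev of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, pattern_len):
    """Assumed definition: (k - 1) * |pattern| / (p_k - p_1)."""
    return (len(positions) - 1) * pattern_len / (positions[-1] - positions[0])

def passes_basic_screening(positions, pattern_len):
    """Apply the thresholds from the slide: V < 0.5 and 0.25 < D < 1.5."""
    if len(positions) < 2:
        return False
    v = variance_coefficient(positions)
    d = density(positions, pattern_len)
    return v < 0.5 and 0.25 < d < 1.5

# A 7-token pattern repeated back to back: perfectly regular and dense.
print(passes_basic_screening([0, 7, 14, 21], 7))
```

Under these assumed definitions a pattern whose occurrences are evenly spaced and contiguous has V = 0 and D = 1, while occurrences scattered at very uneven distances push V above the 0.5 threshold.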
154. Rule Composer
- Occurrence partition
- Flexible variance threshold control
- Multiple string alignment
- Increase density of a pattern
16Occurrence Partition
- Problem
- Some patterns are divided into several blocks
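- Ex: Lycos, Excite with large regularity
- Solution
- Clustering of the occurrences of such a pattern
(Flowchart: the occurrences of pattern P are clustered and each cluster is tested against V(P) < 0.1; the two outcomes are "Check density" and "Discard".)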
17Multiple String Alignment
- Problem
- Patterns with density less than 1 can extract only part of the information
- Solution
- Align k-1 substrings among the k occurrences
- A natural generalization of alignment for two strings, which can be solved in O(nm) by dynamic programming, where n and m are the string lengths.
18Multiple String Alignment (Cont.)
- Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb"
- If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd"
- a d c w b d
- a d c x b -
- a d c x b d
- The extraction pattern can be generalized as "adc[w|x]b[d|-]"
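The two-string case that the multiple alignment generalizes can be sketched with the usual O(nm) dynamic program. This is a plain unit-cost global alignment, not necessarily the scoring used by IEPAD. The names `align` and `generalize`, and the bracketed output notation `adc[w|x]b[d|-]`, are assumptions made for this example.

```python
def align(a, b, gap="-"):
    """Globally align strings a and b with unit edit costs, O(len(a)*len(b))."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum edit cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
                cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            out_a.append(a[i - 1]); out_b.append(gap); i -= 1
        else:
            out_a.append(gap); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def generalize(rows):
    """Collapse equal-length aligned rows into a pattern with alternatives."""
    parts = []
    for column in zip(*rows):
        symbols = list(dict.fromkeys(column))  # unique, first-seen order
        parts.append(symbols[0] if len(symbols) == 1
                     else "[" + "|".join(symbols) + "]")
    return "".join(parts)

print(align("adcwbd", "adcxb"))
print(generalize(["adcwbd", "adcxb-", "adcxbd"]))
```

Aligning "adcwbd" with "adcxb" pads the shorter string with a trailing gap, and collapsing the three aligned rows column by column yields alternatives at the fourth and sixth columns.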
19Pattern Viewer
- Java-application based GUI
- Web based GUI
- http://www.csie.ncu.edu.tw/chia/WebIEPAD/
20The Extractor
- Matching the pattern against the encoded token string
- Knuth-Morris-Pratt algorithm
- Boyer-Moore algorithm
- Alternatives in a rule
- matching the longest pattern
- What is extracted?
- The whole record
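Matching a fixed pattern against the encoded token string can be illustrated with a standard Knuth-Morris-Pratt search over token lists. This sketch handles a single pattern without alternatives; the function names are hypothetical and it is not the IEPAD extractor.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def find_all(tokens, pattern):
    """Return the start index of every (possibly overlapping) occurrence."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, tok in enumerate(tokens):
        while k > 0 and tok != pattern[k]:
            k = fail[k - 1]
        if tok == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]  # keep going to find later occurrences
    return hits

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
print(find_all(record + record, record))
```

Each hit marks the start of one record in the token string; mapping token positions back to the original text would then give the whole record.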
21Experiment Setup
- Fourteen sources: search engines
- Performance measures
- Number of patterns
- Retrieval rate and Accuracy rate
- Parameters
- Encoding scheme
- Thresholds control
22Translation
- Average page length is 22.7KB
23Accuracy and Retrieval Rate
24Problems
- Guarantees a high retrieval rate rather than a high accuracy rate
- A generalized rule can extract more than the desired data
- Currently only applicable when there are several records in a Web page
25ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites
- Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo
- VLDB2001
26Observations
- 1. Wrapper generators work by using additional information (labeled samples).
- 2. Wrapper induction systems have some a priori knowledge about the page organization.
- 3. Finally, systems generate a wrapper by examining one HTML page at a time.
27ROADRUNNER new perspective
- 1. Does not rely on any interaction with the user. (Completely automatic)
- 2. No a priori knowledge
- The HTML schema is inferred along with the wrapper.
- Can handle any nested structure.
- 3. Works with two HTML pages at a time. (based on the study of similarities and dissimilarities between the pages)
29Theoretical Background
- Site generation: Encoding of database content
- Data extraction: Decoding
- The problem is based on a close correspondence between nested types and union-free regular expressions (UFREs).
30Delimiter
- #PCDATA maps to string fields
- + maps to (possibly nested) lists, being the iterator
- ? maps to nullable fields, i.e. optional patterns.
- Finding the schema and extracting the data: find the minimal UFRE.
31Matching Technique
- It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract).
- HTML → XHTML → tokens
- The matching algorithm works on two objects
- A list of tokens, called the sample
- A wrapper (one UFRE)
- Matching is done by solving mismatches between the wrapper and the sample.
33Mismatches
- 1. String mismatches
- May be due only to different values of a database field.
- These mismatches are used to discover fields (#PCDATA).
- Ex: "John Smith" and "Paul Jones" at token 4
- 2. Tag mismatches
- Optional patterns
- Iterative patterns
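The simplest case, string mismatches, can be sketched as below. This covers only field discovery on two token lists that already agree on every tag; tag mismatches (optionals and iterators) are out of scope and raise an error. The function names and the token lists are illustrative assumptions, not the RoadRunner implementation.

```python
def is_tag(token):
    """Treat anything of the form <...> as a tag token."""
    return token.startswith("<") and token.endswith(">")

def generalize_strings(wrapper, sample):
    """Resolve string mismatches by introducing #PCDATA fields.

    Both inputs are token lists of equal length that differ only in
    text tokens; any other difference is a tag mismatch, which this
    sketch does not handle.
    """
    if len(wrapper) != len(sample):
        raise ValueError("different lengths: tag mismatch not handled here")
    result = []
    for w, s in zip(wrapper, sample):
        if w == s or (w == "#PCDATA" and not is_tag(s)):
            result.append(w)          # tokens agree, or field already found
        elif not is_tag(w) and not is_tag(s):
            result.append("#PCDATA")  # string mismatch: a database field
        else:
            raise ValueError("tag mismatch not handled here")
    return result

wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>"]
print(generalize_strings(wrapper, sample))
```

With these lists the only mismatch is at token 4 (counting from 1), so that position becomes a #PCDATA field while "Books of:" stays part of the template.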
34Discovering Optionals
- Strategy: look for repeated patterns as a first step and then, if this attempt fails, try to identify an optional pattern.
- Two steps
- 1. Optional Pattern Location by Cross-Search
- Mismatch at token 6: <UL> and <IMG/>
- Assume the optional pattern is located on the wrapper or on the sample.
- 2. Wrapper Generalization
- ( <IMG src/> )?
35Discovering Iterators
- 1. Square Location by Terminal Tag Search
- Both the wrapper and the sample contain at least one occurrence of the square.
- Terminal tag: the token at the position before the mismatch
- In this example it is </LI>
- Test which tag is the initial tag of the square
- </UL> </LI> vs. <LI> </LI>
- Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.
36Discovering Iterators (cont)
- 2. Square Matching
- Try to match the candidate square occurrence (tokens 20-25).
- Backwards: match tokens 25 and 19, then move to 24 and 18, and so on.
- 3. Wrapper Generalization
- If we denote the newly found square by s, we replace the repeated pattern by (s)+
37More Complex Example
- First mismatch at token 15 (external mismatch)
- Find iterators
- Terminal tag </LI>
- Candidate square found: <LI> </LI> at tokens 15-28
- Backward match: second mismatch at tokens 23 and 9 (internal mismatch) → solve the mismatch recursively
38Recursively solve mismatch
- Internal mismatch at tokens 23 and 9
- Solve it in the same way as an external mismatch.
- But instead of comparing one wrapper and one sample, compare two different portions of the same object.
- Terminal tag <B>
- Candidate square is </B><B>, tokens 23-18
- Backward match: mismatch at tokens 20 and 26
- Find that tokens 20-22 form an optional pattern.
40Matching as an AND-OR tree
- Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree.
- (i) match(w,s) must solve all external mismatches encountered during the parsing (AND node)
- (ii) a mismatch is solved by introducing either one field, one iterator, or one optional (OR)
- (iii) The search may be done either on the wrapper or on the sample (OR)
- (iv) there are various candidates for iterators and optionals (OR)
- (v) Discovering an iterator may require recursively solving several internal mismatches (AND)
41AND-OR tree
42Experimental Results
43Experimental Results (cont)
44Extracting Structured Data from Web Pages
- Arvind Arasu, Hector Garcia-Molina
- ACM SIGMOD 2003
45Cue
- Keywords: schema, template
- Web pages belonging to the same site are generated by encoding data of the same schema with a common template
- → pages are generated from a common template by plugging in values
46Figuration
47Goal and Challenge
- Previous IE techniques rely on human-supplied heuristics, e.g. wrappers
- Goal: to deduce the template without human input
- Time consuming and error-prone
- Optional attributes are ignored
- Challenge
- No obvious way of differentiating what text is template and what is data
- The schema of the data in pages isn't flat but more complex and semi-structured
48Model, Problem Formulation
- Structured Data
- Model of Page Creation
- Optionals and Disjunctions
- Problem Statement
- Miscellaneous Terminology, Definition
49Structured Data
- Token: a token is some basic unit of text
- Structured Data: any set of data values conforming to a common schema or type
- Define Type
- 1. Basic Type (ß): string of tokens
- e.g. <html>, text
- 2. Ordered List Type: tuple constructor of order n
- e.g. <T1, T2, ..., Tn>, where T1, T2, ..., Tn are types
- 3. Set Type: set constructor
- e.g. {T}, where T is a type
50Define term value and example
- Define instance
- 1. an instance of the basic type ß is a string of tokens
- 2. an instance of type <T1, T2, ..., Tn> is a tuple of the form <i1, i2, ..., in>, where the attributes i1, i2, ..., in are instances of types T1, T2, ..., Tn
- 3. an instance of type {T} is any set of elements {e1, e2, ..., em} such that each ei is an instance of type T
- Terminology: "value" is used for an instance, "string" for a string of tokens
- Example
- Schema S1
- Value
51Model of Page Creation
- Definition: A template T for a schema S (written TS) is defined as a function that maps each type constructor τ of S into an ordered set of strings T(τ), such that
- if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings
- if τ is a set constructor, T(τ) is a single string Sτ
- λ(T, x): the encoding of a value x that is an instance of a sub-schema of S
52Encoding of a value x of S
- 1. if x ∈ ß, then λ(T, x) = x
- 2. if x = <x1, x2, ..., xn>τt, where τt is a tuple constructor
- λ(T, x) = C1 λ(T, x1) C2 ... λ(T, xn) Cn+1
- 3. if x = {e1, e2, ..., em}τs, where τs is a set constructor of S
- λ(T, x) = λ(T, e1) S λ(T, e2) S ... S λ(T, em)
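The three encoding cases translate almost directly into a recursive function. In this sketch a template is a dictionary from constructor names to strings, tuple values and set values are small tagged classes, and strings are concatenated without separators; all of these representation choices, and the names `TupleValue`, `SetValue` and `encode_value`, are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TupleValue:
    ctor: str      # name of the tuple constructor
    items: tuple   # the n component values

@dataclass
class SetValue:
    ctor: str      # name of the set constructor
    elems: list    # elements, kept in a list so the output order is fixed

def encode_value(template, x):
    """Encode value x with the template, following the three cases."""
    if isinstance(x, str):                      # case 1: basic type
        return x
    if isinstance(x, TupleValue):               # case 2: C1 x1 C2 ... xn Cn+1
        strings = template[x.ctor]
        if len(strings) != len(x.items) + 1:
            raise ValueError("a tuple of order n needs n+1 template strings")
        out = [strings[0]]
        for item, sep in zip(x.items, strings[1:]):
            out.append(encode_value(template, item))
            out.append(sep)
        return "".join(out)
    if isinstance(x, SetValue):                 # case 3: e1 S e2 S ... S em
        return template[x.ctor].join(encode_value(template, e)
                                     for e in x.elems)
    raise TypeError("unsupported value")

template = {"t1": ["<b>", "</b><ol>", "</ol>"], "t2": "", "t3": ["<li>", "</li>"]}
book = TupleValue("t1", ("Databases",
                         SetValue("t2", [TupleValue("t3", ("Good",)),
                                         TupleValue("t3", ("Dense",))])))
print(encode_value(template, book))
```

The toy template above is made up for the example; it only mimics the shape of a book page with a list of reviews.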
53Example of Schema S1
54Optionals and Disjunctions
- Optional
- If T is a type, the optional type (T)? = {T}τ
- where τ has cardinality 0 or 1
- Disjunction
- If T1 and T2 are types, the disjunction type
- (T1 | T2) = <{T1}τ1, {T2}τ2>τ
- where the cardinalities of τ1 and τ2 sum to 1
55Problem Statement
- EXTRACT problem: given n pages pi = λ(T, xi) (1 ≤ i ≤ n), created from some unknown template T and values x1, ..., xn, deduce the template T and the values x1, ..., xn from the set of pages alone
56Example of correct solution of EXTRACT (cont.)
57Example of correct solution of EXTRACT (cont.)
58Miscellaneous Terminology, Definition
- An occurrence of a token in a template is called a template-token
- An occurrence of a token in a value is called a value-token
- An occurrence of a token in a page is called a page-token
- Two page-tokens in Pe have the same role iff they have been generated by the same template-token
59Overview Approach - EXALG
(ECGM)
60EXALG - ECGM FINDEQ (step2)
- The module used to compute equivalence classes: sets of tokens having the same frequency of occurrence in every page of Pe
- Ex. ee1: <html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html>
- Ex. ee3: <li>, Reviewer, Rating, Text, </li>
- EXALG retains only the equivalence classes that are large and frequently occurring (LFEQs)
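Grouping tokens by their occurrence vector is straightforward to sketch. The version below works on plain tokens (no role differentiation), and it assumes that "large" and "frequent" mean size and support at least equal to the thresholds; the function names and the toy pages are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(pages):
    """Group tokens by occurrence vector.

    pages is a list of token lists. The occurrence vector of a token is
    the tuple of its counts in each page; tokens sharing a vector form
    one equivalence class.
    """
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(list)
    for token in vocabulary:
        vector = tuple(c[token] for c in counts)
        classes[vector].append(token)
    return {vector: sorted(tokens) for vector, tokens in classes.items()}

def large_and_frequent(classes, size_thres, sup_thres):
    """Keep classes with enough tokens (size) and enough pages (support)."""
    kept = {}
    for vector, tokens in classes.items():
        support = sum(1 for count in vector if count > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            kept[vector] = tokens
    return kept

pages = [
    ["<html>", "<li>", "Text", "</li>", "</html>"],
    ["<html>", "<li>", "Text", "</li>", "<li>", "Text", "</li>", "</html>"],
    ["<html>", "</html>"],
]
classes = equivalence_classes(pages)
print(classes)
print(large_and_frequent(classes, size_thres=3, sup_thres=2))
```

In the toy pages the page-level tags occur exactly once everywhere, while the list-item tokens share the vector (1, 2, 0), which is the same effect the ee1 and ee3 examples show.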
61EXALG - ECGM HANDINV (step3)
- The module used to detect and remove invalid LFEQs: those that are not formed by tokens associated with a type constructor
62DIFFFORM (step1) and DIFFEQ (step4)
- The modules used to add more tokens to LFEQs by differentiating roles
- Ex. Name has multiple roles: one occurs in Book Name and the other occurs in Reviewer Name
- Differentiate the multiple roles
- The tokens occur in different paths from the root in the HTML parse tree (DIFFFORM)
- The tokens occur in different positions with respect to LFEQ ee1 (DIFFEQ)
- dtoken, ex. Name5 and Name14
- regard NameA and NameB as different tokens
63Review ECGM
64Example After ECGM Process
- ee1: <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html>
- 8 → 13 tokens
- ee3: <li>, <b>, Reviewer, Name, </b>, <b>, Rating, </b>, <b>, Text, </b>, </li>
- 5 → 12 tokens
- Position: empty and non-empty
65Construct Schema from ECGM
- Construct Schema S from ee1
- The 1st non-empty position is Basic Type ß
- The 2nd non-empty position is ee3, generated by set type constructor τe3
- ⇒ T(τe1) = <C11, C12, C13>, S = <ß, {S}τe2>τe1
- ⇒ T(τe2) = S <C31, C32, C33, C34>
- ⇒ T(τe3) = <C31, C32, C33, C34>, <ß, ß, ß>τe3
- ⇒ S = <ß, {<ß, ß, ß>τe3}τe2>τe1
66Equivalence Classes (Cont.)
- Pages P = {p1, ..., pn}, pi = λ(TS, xi)
- TS = {τ1, ..., τk}: type constructors
- Definition: all tokens of an equivalence class have the same occurrence vector
- ex. ee1: <1,1,1,1>, ee3: <1,2,1,0>
- Observation 1: Tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class. (used to decide whether an EQ class is valid or not)
- Support of a token: the number of pages that contain it
- Size of an EQ class: the number of tokens in it
67Equivalence Classes (Cont.)
- Observation 2: for real pages, an equivalence class of large size and support is usually valid
- Properties of an EQ class <t1, ..., tm>
- Ordered
- Nested: the spans of all occurrences of ei lie within some fixed position p of the other class, or the two classes do not overlap
- Observation 3: A valid equivalence class is ordered, and a pair of two valid equivalence classes is nested
68Handling Invalid Equivalence classes
- Detect the existence of invalid LFEQs using violations of the ordered and nested properties
- If any are found, discard some of the LFEQs and break others into smaller LFEQs
Differentiating roles of tokens
- By path: different roles of a token lie on different paths of the HTML parse tree
- By position: different roles of a token are located at different (non-empty) positions
69Equivalence Class Generation Module
- OUTPUT: a set of LFEQs of dtokens, and the pages represented as strings of dtokens
- FINDEQ: 2 parameters are used to select LFEQs (SIZETHRES, SUPTHRES)
- On the running example
- SIZETHRES = SUPTHRES = 3
- in iteration 2, ee1 and ee3 are found
70Building Template and Extracting Values
- Input to this module is the set of LFEQs e1, e2, ..., em
- The ANALYSIS stage consists of 2 modules: CONSTTEMP and EXVAL
- CONSTTEMP: ei = <d1, d2, ..., dl>
- Starts with the basic class e1 = <html>, <body>, ..., </body>, </html>
- recursively constructs a template Tei corresponding to ei, and a template Tei,p corresponding to each non-empty position p of ei
- Checks whether the set of strings PosString(ei, p) has some recognizable pattern
71Example
- In the running example, PosString(ee1, 6) is a string of dtokens for every occurrence of ee1, which matches Pattern 5 of the table; PosString(ee1, 10) is always a string of 0 or more occurrences of ee3, which matches Pattern 1
- ee1: <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html>
72Assumption
- The 4 assumptions
- (A1) A large number of tokens occurring in the template have unique roles
- (A2) The EQ class derived from a type constructor is recognized as an LFEQ
- (A3) Irregularity in encoded data that leads to invalid EQ class
- (A4) The separators are around data values. In this model, the strings associated with type constructors are non-empty
73Evaluation
- Leaf attribute Am in schema Sm
- Correct: the set of values of Am in the page is equal to the set of extracted values Ae in the page
- Partially correct: the set of values of Am in the page is not equal to the set of extracted values Ae in the page, but is part of the values of Ae
- Incorrect: neither correct nor partially correct
74Result
- For 18, or 40%, of the input collections, the system correctly extracted all the attributes
- Around 80% of the attributes were extracted correctly
- Normalized average
- Input size < 10
- Parameter 3
75Conclusion
- EXALG uses 2 novel concepts, equivalence classes and differentiating roles, to discover the template
- The impact of a failed assumption is limited to a few attributes
- Future work
- Develop techniques for crawling, indexing, and providing querying support for the structured pages on the web
- Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template
76References
- C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW2001, pp. 681-688.
- Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001, pp. 109-118.
- Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD2003, pp. 337-348.