Title: Schema-Guided Wrapper Maintenance for Web-Data Extraction
1Schema-Guided Wrapper Maintenance for Web-Data Extraction
- Xiaofeng Meng, Dongdong Hu
- Renmin University of China, Beijing, China
- Chen Li
- University of California, Irvine, CA, USA
2Wrappers for Web Sources
- Extract information from Web pages
- Used in many Web-based applications
[Figure: wrappers connect data sources (XML, HTML documents, RDBMS, programs) to an application, e.g., data integration]
3Problem
- The Web is very dynamic: contents and page structures change
- Original wrappers can stop working because they rely on Web page structures
- Re-generating wrappers is not easy: it is a heavy workload for system developers
[Figure: applied to changed documents, the original wrapper may extract nothing, return incomplete results, or return incorrect results]
4Example
The original wrapper fails due to the structure change.
5Problems
- Wrapper verification: is a wrapper operating correctly?
- Several studies have been conducted on the verification problem
- E.g., computing the similarity between a wrapper's expected and observed output, regression testing
- Wrapper maintenance: how to automatically modify a wrapper when the pages have changed?
- → Focus of this work
6Outline
- Motivation
- → System overview
- Schema-Guided Wrapper Maintenance
- Experiments
- Related Work and Conclusion
7The SG-WRAM System
8User-Defined Schema
User provides schema for the target data
- <!ELEMENT VideoList (Video*)>
- <!ELEMENT Video (Name, Director, Actors, Price)>
- <!ELEMENT Name (#PCDATA)>
- <!ELEMENT Director (#PCDATA)>
- <!ELEMENT Actors (#PCDATA)>
- <!ELEMENT Price (VHSPrice, DVDPrice)>
- <!ELEMENT VHSPrice (#PCDATA)>
- <!ELEMENT DVDPrice (#PCDATA)>
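To make the schema concrete, here is a minimal Python sketch that holds the DTD above as a plain dictionary and lists the path to every text-valued element. The dictionary layout and the `leaf_paths` helper are illustrative assumptions, not part of SG-WRAM.

```python
# Hypothetical in-memory form of the DTD: element -> list of child elements.
# An empty list marks a #PCDATA leaf.
SCHEMA = {
    "VideoList": ["Video"],
    "Video": ["Name", "Director", "Actors", "Price"],
    "Name": [],
    "Director": [],
    "Actors": [],
    "Price": ["VHSPrice", "DVDPrice"],
    "VHSPrice": [],
    "DVDPrice": [],
}


def leaf_paths(schema, root):
    """Return the path from root to every #PCDATA leaf, depth-first."""
    children = schema[root]
    if not children:
        return [root]
    paths = []
    for child in children:
        for sub in leaf_paths(schema, child):
            paths.append(f"{root}/{sub}")
    return paths


paths = leaf_paths(SCHEMA, "VideoList")
```

These leaf paths are the targets that data items in the HTML page get mapped to, in the same form as the schema path `VideoList/Video/Name` used later in the sample mapping.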
9Schema-Guided Wrapper Generation
- Using a GUI toolkit, users can map data items in HTML pages to elements in the DTD
[Figure: the DTD tree shown next to the HTML page]
10Schema-Guided Wrapper Generation
- Internally, the system computes the mappings from the corresponding HTML tree to the DTD tree
- It then generates the extraction rule
[Figure: the HTML tree mapped to the DTD tree]
11Expressing Extraction Rule in XQuery
- Each rule is an FLWR XQuery expression
Example
FOR $video IN $videoList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN <video>
  LET $name := $video/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name>
</video>
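The rule above locates data purely by position in the HTML tree. As a rough illustration of why that is brittle, the following Python sketch follows a path of the same shape through a small tree. The `select` helper, the zero-based index reading, and the sample markup are assumptions made for this example; the actual system evaluates XQuery.

```python
import re
import xml.etree.ElementTree as ET

# One path step: a tag name with an optional zero-based index, e.g. "span[0]".
STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def select(node, path):
    """Follow steps such as 'span[0]/b[0]/a[0]'.

    Indices are zero-based and count only children with the same tag.
    Returns None as soon as a step cannot be followed.
    """
    for step in path.strip("/").split("/"):
        m = STEP.match(step)
        if m is None:
            raise ValueError(f"unsupported step: {step!r}")
        tag, idx = m.group(1), int(m.group(2) or 0)
        same_tag = [child for child in node if child.tag == tag]
        if idx >= len(same_tag):
            return None
        node = same_tag[idx]
    return node


original = ET.fromstring(
    "<td><span><b><a href='#'>May Morning</a></b></span></td>"
)
# Same content, but the site wrapped the span in an extra div.
changed = ET.fromstring(
    "<td><div><span><b><a href='#'>May Morning</a></b></span></div></td>"
)

hit = select(original, "span[0]/b[0]/a[0]")
name = hit.text if hit is not None else None
miss = select(changed, "span[0]/b[0]/a[0]")
```

On the original cell the path reaches the name; after a single extra wrapper element the same path finds nothing, which is the failure mode that wrapper maintenance has to repair.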
12Annotations for data items
- Describe the semantic meaning of a data item
- Indicate the location of the data item
- Specified by the user using the GUI
- Recorded with the XPath function contains(pathToAnnotation, annotationValue)
/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null, "directed by")]
Data values in HTML page                 Annotations
May Morning                              -
Ugo Liberatore                           directed by
Jane Birkin John Steiner Rosella Falk    Featuring
15.38-23.26                              DVD
14.98-18.99                              VHS
13Outline
- Motivation
- System Overview
- → Wrapper Maintenance (four steps)
- Data-Feature Discovery
- Item Recovery
- Block Configuration
- Rule Re-induction
- Experiments
- Related Work and Conclusion
14Intuition of the approach
- The page structure could change
- Observation: many features of data items are more static, e.g.:
- Hyperlink
- Annotation
- Pattern
- These features can help us find the new locations of the old data items
15Step 1 Data-feature discovery
- Compute features of the data items in the original page
ID  DTD Element  L (hyperlink)  A (annotation)  P (data pattern)
1   Name         True           NULL            [A-Za-z]{0,}
2   Director     False          Directed by     [A-Za-z]{0,}
3   Actors       False          Featuring       [A-Za-z]{0,}(.)
4   VHSPrice     False          VHS             [0-9]{0,}[0-9](.)[0-9]{2}
5   DVDPrice     False          DVD             [0-9]{0,}[0-9](.)[0-9]{2}
16Data-Pattern Feature
- A syntactic feature
- Represented as a regular expression
- E.g., 15.38 → [0-9]{0,}[0-9](.)[0-9]{2}
- Can be extracted using existing technologies, e.g., Brin98, GHQR98, LM00
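As a small illustration of a data-pattern feature, the Python sketch below checks a value against the price pattern from the slide and derives a pattern from a sample value by run-length generalization. The `induce_pattern` function is a deliberately naive stand-in written for this example; it is not the technique of the cited works.

```python
import re
from itertools import groupby

# Price pattern as written on the slide; note that "(.)" matches any character.
SLIDE_PRICE = r"[0-9]{0,}[0-9](.)[0-9]{2}"


def char_class(ch):
    """Map a character to a coarse class; other characters stay literal."""
    if ch.isdigit():
        return "[0-9]"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)


def induce_pattern(value):
    """Collapse runs of the same class into class{n} (naive generalization)."""
    parts = []
    for cls, run in groupby(value, key=char_class):
        n = len(list(run))
        parts.append(cls if n == 1 else f"{cls}{{{n}}}")
    return "".join(parts)


price_ok = re.fullmatch(SLIDE_PRICE, "15.38") is not None
word_ok = re.fullmatch(SLIDE_PRICE, "Featuring") is not None
induced = induce_pattern("15.38")
```

The induced pattern is stricter than the one on the slide because it fixes the digit counts; a practical learner would generalize over several sample values rather than one.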
17Annotations and Hyperlinks
- Get annotation and hyperlink information from the original page
- By checking the XQuery-based extraction rule
- Hyperlink: a step of /a/ in the path
- Annotation: the function contains()
LET $name := $video/span[0]/b[0]/a[0]/text()[0]
RETURN <name> $name </name>
LET $actors := $video/text()[contains(/preceding-sibling::b[0], "Featuring")]
RETURN <actors> $actors </actors>
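The two checks above can be sketched in Python as simple string tests on an extraction path. The regular expressions and the `discover_features` name are assumptions for illustration, and the rule strings are written in a cleaned-up bracket notation.

```python
import re

# An "a" step, optionally indexed, followed by another step or the path end.
HYPERLINK_STEP = re.compile(r"/a(?:\[\d+\])?(?=/|$)")
# contains(<path>, "<annotation value>")
ANNOTATION = re.compile(r'contains\(\s*[^,]*,\s*"([^"]*)"\s*\)')


def discover_features(path):
    """Derive the L (hyperlink) and A (annotation) features from one path."""
    # Only the location part counts for the hyperlink test, not the predicate.
    location = path.split("[contains(", 1)[0]
    ann = ANNOTATION.search(path)
    return {
        "L": HYPERLINK_STEP.search(location) is not None,
        "A": ann.group(1) if ann else None,
    }


name_rule = "$video/span[0]/b[0]/a[0]/text()[0]"
actors_rule = '$video/text()[contains(/preceding-sibling::b[0], "Featuring")]'

name_features = discover_features(name_rule)
actors_features = discover_features(actors_rule)
```

For these two rules the result agrees with rows 1 and 3 of the feature table in Step 1: Name is a hyperlink without annotation, and Actors is annotated by "Featuring".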
18Step 2 Data-Item Recovery
- Traverse the new HTML tree in depth-first order
- Use the old features to identify potential data items, using 3 matching conditions:
- Hyperlink
- Annotation
- Data pattern
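A minimal Python sketch of this step, under several simplifying assumptions: the new page is well-formed markup, an annotation is the text of the element immediately preceding the value, recorded paths omit positional indices, and the name and director patterns are loosened to allow spaces. All function names and the sample page are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# (DTD element, L hyperlink, A annotation, P data pattern) from Step 1.
FEATURES = [
    ("Name", True, None, r"[A-Za-z ]+"),
    ("Director", False, "Directed by", r"[A-Za-z ]+"),
    ("DVDPrice", False, "DVD", r"[0-9]{0,}[0-9](.)[0-9]{2}"),
]


def text_nodes(node, path="", in_link=False):
    """Yield (text, path, in_link, annotation) in depth-first order."""
    in_link = in_link or node.tag == "a"
    here = f"{path}/{node.tag}"
    if node.text and node.text.strip():
        yield node.text.strip(), here, in_link, None
    for child in node:
        yield from text_nodes(child, here, in_link)
        if child.tail and child.tail.strip():
            # Text right after a child element: the child's own text is the
            # candidate annotation (the preceding-sibling case).
            label = (child.text or "").strip() or None
            yield child.tail.strip(), here, in_link, label


def recover_items(root, features):
    """Return (value, HTML path, DTD element) for every recognized item."""
    mappings = []
    for text, path, in_link, label in text_nodes(root):
        for element, link, annotation, pattern in features:
            if in_link != link:
                continue  # hyperlink condition
            if annotation is not None:
                if label is None or annotation.lower() not in label.lower():
                    continue  # annotation condition
            if re.fullmatch(pattern, text) is None:
                continue  # data-pattern condition
            mappings.append((text, path, element))
            break
    return mappings


# A changed page: the old table layout has been replaced by paragraphs.
page = ET.fromstring(
    "<div>"
    "<p><a href='#'>May Morning</a></p>"
    "<p><b>Directed by</b> Ugo Liberatore</p>"
    "<p><b>DVD</b> 15.38</p>"
    "</div>"
)
mappings = recover_items(page, FEATURES)
```

Although none of the old table paths exist in this page, the three items are still recognized because their hyperlink, annotation, and pattern features survived the layout change.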
19Example
[Figure: two recovery walk-throughs. Hyperlinked item: check hyperlink, check data pattern [A-Za-z]{0,}, recognize a data item. Annotated item: find annotation, find value starting from annotation, check data pattern [0-9]{0,}[0-9](.)[0-9]{2}, recognize a data item.]
20Results of Data Item Recovery
- A mapping list including all the recognized data items
- Each mapping contains:
- Value of the data item
- Path to it in the HTML tree
- Path of the corresponding DTD element
A sample mapping: M1 = (D: May, HP: /table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
21Step 3 Block Configuration
- Observation: data items are located in semantic blocks
- Blocks conform to the user-defined schema
- Data items are grouped in semantic blocks
[Figure: candidate blocks labeled Partial-Match, Full-Match, and Over-Match]
22Computing Full Match Blocks
Full match blocks:
- Identify the level in a top-down manner
- Check the level by recursively considering the matches between candidate blocks and the schema
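The slides do not define the three match kinds precisely. The Python sketch below assumes one plausible reading: a block is a partial match if a required schema element is missing, an over match if some element occurs more than once (the block spans several records), and a full match otherwise. The tree layout and function names are hypothetical.

```python
from collections import Counter

# Leaf elements each Video block is expected to contain (prices left out).
REQUIRED = ("Name", "Director", "Actors")


def classify(block_elements, required=REQUIRED):
    """Label a candidate block as a 'partial', 'over', or 'full' match."""
    counts = Counter(block_elements)
    if any(counts[e] == 0 for e in required):
        return "partial"
    if any(counts[e] > 1 for e in required):
        return "over"
    return "full"


def elements_under(node):
    """Collect the DTD elements recognized anywhere below a node."""
    found = list(node.get("items", []))
    for child in node.get("children", []):
        found.extend(elements_under(child))
    return found


def full_match_blocks(node):
    """Descend from the root and stop at the first level that fully matches."""
    kind = classify(elements_under(node))
    if kind == "full":
        return [node["path"]]
    if kind == "partial":
        return []  # going deeper can only lose elements
    blocks = []
    for child in node.get("children", []):
        blocks.extend(full_match_blocks(child))
    return blocks


page = {
    "path": "body",
    "children": [
        {"path": "table[1]", "items": ["Name", "Director", "Actors"]},
        {"path": "table[2]", "items": ["Name", "Director", "Actors"]},
    ],
}
blocks = full_match_blocks(page)
```

Here the whole body over-matches because it holds two records, so the search moves one level down and stops at the two tables.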
23Results of Block Configuration
- A set of blocks that can fully match the DTD
- Each of them is represented as a list of mappings
Examples
No.  Element   PATH
1    Title     table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
2    Director  table[1]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
3    Actors    table[1]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]
4    Title     table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
5    Director  table[2]/tr[0]/td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
6    Actors    table[2]/tr[0]/td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]
24Step 4 Rule Re-Induction
- Semantic blocks contain mappings from data items in HTML to DTD elements
- Induce a new extraction rule by calling the induction algorithm in the wrapper generator
- Refine the rule so that it covers all other semantic blocks
- Generalization is necessary
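One simple form of generalization can be sketched in Python: compare the paths of the same element across blocks step by step, and replace an index that differs with a wildcard. The `[*]` notation and the `generalize` function are assumptions for illustration, not the induction algorithm of the wrapper generator; paths with `contains()` predicates are not handled.

```python
import re

# One step: a tag (or text()) with an optional numeric index.
STEP = re.compile(r"^(?P<tag>[^\[\]]+)(?:\[(?P<idx>\d+)\])?$")


def generalize(paths):
    """Merge same-length paths; steps whose index differs become tag[*]."""
    split = [p.split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for column in zip(*split):
        if len(set(column)) == 1:
            merged.append(column[0])  # identical in every block
            continue
        matches = [STEP.match(step) for step in column]
        if any(m is None for m in matches) or len(
            {m.group("tag") for m in matches}
        ) != 1:
            raise ValueError(f"cannot generalize steps: {column}")
        merged.append(f"{matches[0].group('tag')}[*]")
    return "/".join(merged)


block1 = "table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]"
block2 = "table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]"
rule_path = generalize([block1, block2])
```

The merged path covers both blocks, which is the property the refinement step aims for.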
25Outline
- Motivation
- System Overview
- Wrapper Maintenance (four steps)
- Data-Feature Discovery
- Item Recovery
- Block Configuration
- Rule Re-induction
- → Experiments
- Related Work and Conclusion
26Web Sources
1Bookstreet Book
Allbooks4less Book
Amazon Book (search)
Amazon Magazine
Barnesandnoble Book
CIA Factbook
CNN Currency
Excite Currency
Hotels Hotel
Yahoo Shopping Video
Yahoo Quotes
Yahoo People Email
- From October 2002 to May 2003
- Collected Web page changes
- From 16 data-intensive sites
- Using site search engine or from the same URL
- All the pages have complex table structures
- Observed changes
- Data items (add, delete, modify)
- Table structure → non-table structure
- Complex table structure re-arrangement
27Experiment Procedures
[Figure: experiment procedure in three steps. Components shown: original Web docs, wrapper generator, original wrappers, wrapper repository, new Web docs, check extraction results, changed pages, wrapper maintainer, repaired wrappers]
28Experiment Metrics
- Recall (R): proportion of correctly extracted data items among all the data items that should be extracted
- Precision (P): proportion of correctly extracted data items among all the data items that have been extracted
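The two metrics can be computed as in the following Python sketch. Treating items as set members and returning `None` for an undefined ratio are choices made for this example.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision); a ratio with a zero denominator is None."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    recall = len(correct) / len(expected) if expected else None
    precision = len(correct) / len(extracted) if extracted else None
    return recall, precision


expected = {"May Morning", "Ugo Liberatore", "15.38", "14.98"}
extracted = {"May Morning", "15.38", "Featuring"}
recall, precision = recall_precision(extracted, expected)

# A wrapper that extracts nothing has recall 0 and no defined precision.
nothing = recall_precision([], expected)
```

With two correct items out of four expected and three extracted, recall is 0.5 and precision is 2/3. When a wrapper extracts nothing, precision has a zero denominator, which may be why some result rows pair a recall of 0 with "-" for precision.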
29Original wrappers after changes
Name                  Changed pages  Item number  Avg recall (%)  Avg precision (%)
1Bookstreet Book      12             6            82.54           100
Allbooks4less Book    15             4            0               -
Amazon Book (search)  15             6            40.49           100
Amazon Magazine       15             5            20.01           100
Barnesandnoble Book   15             5            0               100
CIA Factbook          5              10           0               100
CNN Currency          15             6            50.00           100
Excite Currency       18             11           42.86           100
Hotels Hotel          15             4            0               -
Yahoo Shopping Video  15             6            0               -
Yahoo Quotes          10             6            0               -
Yahoo People Email    10             3            0               -
30New wrappers (after item recovery)
Web site              Avg recall (%)  Avg precision (%)
1Bookstreet Book      98.67           71.26
Allbooks4less Book    75              32.69
Amazon Book (search)  83.05           36.3
Amazon Magazine       100             60.15
Barnesandnoble        78.72           43.13
CIA Factbook          100             100
CNN Currency          100             100
Excite Currency       100             100
Hotels Hotel          50              35.61
Yahoo Shopping        100             51.49
Yahoo Quotes          100             100
Yahoo People          100             53.54
31New Wrappers (final)
Web site              Avg recall (%)  Avg precision (%)
1Bookstreet Book      100             100
Allbooks4less Book    75              51.34
Amazon Book (search)  83.05           90.74
Amazon Magazine       100             100
Barnesandnoble        78.72           100
CIA Factbook          100             100
CNN Currency          100             100
Excite Currency       100             100
Hotels Hotel          50              41.87
Yahoo Shopping        100             92.86
Yahoo Quotes          100             100
Yahoo People          100             100
32Related Work on Wrapper Maintenance
- Kushmerick 99: using simple numeric features of the extracted strings
- Lerman K., Minton S. 00: using the starting and ending strings as the description of the data fields
- Chidlovskii B. 01: syntactic features of data items to be extracted, and semantic features (URL, time strings, entities)
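33Comparisons
- These approaches rely heavily on the syntactic features of the data items, and may not recognize data items precisely.
- Example: when the two price columns swap, both columns match the same price pattern, so syntactic features alone cannot tell them apart.
Original page:
Title             Our Price  List Price
Data on Web       23.00      29.00
Java Programming  49.00      59.00
Changed page:
Title             List Price  Our Price
Data on Web       29.00       23.00
Java Programming  59.00       49.00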
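The column-swap example can be replayed in a few lines of Python. Treating the column header as the annotation of a cell is an assumption made for this sketch, and both helper functions are hypothetical.

```python
import re

PRICE = re.compile(r"[0-9]+\.[0-9]{2}")


def by_pattern(row):
    """Syntactic only: every price-shaped cell is a candidate for any price field."""
    return [cell for cell in row if PRICE.fullmatch(cell)]


def by_annotation(header, row, annotation):
    """Pick the cell whose column header equals the annotation."""
    return row[header.index(annotation)]


old_header, old_row = ["Title", "Our Price", "List Price"], ["Data on Web", "23.00", "29.00"]
new_header, new_row = ["Title", "List Price", "Our Price"], ["Data on Web", "29.00", "23.00"]

candidates = by_pattern(new_row)
old_price = by_annotation(old_header, old_row, "Our Price")
new_price = by_annotation(new_header, new_row, "Our Price")
```

The pattern alone leaves two candidates for "Our Price" on the changed page, while the annotation selects the same value before and after the swap.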
34Conclusion
- SG-WRAM: a wrapper-maintenance system
- Intuition: use features that are more stable
- Pattern
- Hyperlink
- Annotation
- Four steps of the approach
- Data-Feature Discovery
- Item Recovery
- Block Configuration
- Rule Re-induction
- Experiments showed that it is effective
35Thank you!