Schema-Guided Wrapper Maintenance for Web-Data Extraction

Provided by: ics9
Learn more at: https://ics.uci.edu
1
Schema-Guided Wrapper Maintenance for Web-Data
Extraction
  • Xiaofeng Meng, Dongdong Hu
  • Renmin University of China, Beijing, China
  • Chen Li
  • University of California, Irvine, CA, USA

2
Wrappers for Web Sources
  • Extract information from Web pages
  • Used in many Web-based applications

[Figure: wrappers sit between the sources (XML, HTML documents, RDBMS, programs) and the application (e.g., data integration)]
3
Problem
  • The Web is very dynamic: contents and page
    structures change
  • Original wrappers can stop working because they
    rely on Web page structures
  • Re-generating wrappers is not easy: a heavy
    workload for system developers

[Figure: on changed documents, the original wrapper may extract nothing, return incomplete results, or return incorrect results]
4
Example
The original wrapper fails due to the structure
change.
5
Problems
  • Wrapper verification: is a wrapper operating
    correctly?
  • Several studies have been conducted on the
    verification problem
  • E.g., computing the similarity between a
    wrapper's expected and observed output, or
    regression testing
  • Wrapper maintenance: how to automatically modify
    a wrapper when the pages have changed? → Focus of
    this work

6
Outline
  • Motivation
  • → System Overview
  • Schema-Guided Wrapper Maintenance
  • Experiments
  • Related Work and Conclusion

7
The SG-WRAM System
8
User-Defined Schema
User provides schema for the target data
  • <!ELEMENT VideoList (Video)>
  • <!ELEMENT Video (Name, Director, Actors, Price)>
  • <!ELEMENT Name (#PCDATA)>
  • <!ELEMENT Director (#PCDATA)>
  • <!ELEMENT Actors (#PCDATA)>
  • <!ELEMENT Price (VHSPrice, DVDPrice)>
  • <!ELEMENT VHSPrice (#PCDATA)>
  • <!ELEMENT DVDPrice (#PCDATA)>
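
As an illustration only, the schema above can be held in memory as a small tree so that the leaf elements (the ones that carry text) can be listed by path. The dictionary layout and the `leaf_paths` helper are assumptions made for this sketch, not part of SG-WRAM.

```python
# Hypothetical in-memory form of the DTD on this slide: parent -> child elements.
# Elements that do not appear as keys are treated as text-only (#PCDATA) leaves.
SCHEMA = {
    "VideoList": ["Video"],
    "Video": ["Name", "Director", "Actors", "Price"],
    "Price": ["VHSPrice", "DVDPrice"],
}


def leaf_paths(element, schema=SCHEMA, prefix=""):
    """Return the schema path of every leaf element under `element`."""
    path = f"{prefix}/{element}" if prefix else element
    children = schema.get(element, [])
    if not children:  # no children listed: a text-only leaf
        return [path]
    paths = []
    for child in children:
        paths.extend(leaf_paths(child, schema, path))
    return paths


video_leaves = leaf_paths("VideoList")
```

Paths such as VideoList/Video/Name are the same form the later slides use for the DTD side of a mapping.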

9
Schema-Guided Wrapper Generation
  • Using a GUI toolkit, users can map data items in
    HTML pages to elements in DTD

[Figure: the DTD tree shown next to the HTML page]
10
Schema-Guided Wrapper Generation
  • Internally, the system computes the mappings from
    the corresponding HTML tree to the DTD tree
  • It then generates the extraction rule

[Figure: the HTML tree mapped to the DTD tree]
11
Expressing Extraction Rule in XQuery
  • Each rule is an FLWR XQuery expression

Example
FOR $vedio IN $vedioList/body/div[0]/table[4]/tr[0]/td[2]/table/tr[0]/td[1]
RETURN <vedio>
  LET $name := $vedio/span[0]/b[0]/a[0]/text()[0]
  RETURN <name> $name </name>
</vedio>
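
To make the path steps in the rule concrete, here is a minimal sketch that follows steps such as span[0]/b[0]/a[0] through a parsed fragment. It assumes that the indexes are 0-based and count only children with the same tag, and that a step without an index means the first such child; the `select` helper and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def select(node, path):
    """Follow steps like 'span[0]/b[0]/a[0]'; return the node reached or None."""
    for step in path.strip("/").split("/"):
        match = STEP.match(step)
        if match is None:
            raise ValueError(f"unsupported step: {step!r}")
        tag, index = match.group(1), int(match.group(2) or 0)
        same_tag = node.findall(tag)  # direct children with this tag
        if index >= len(same_tag):
            return None
        node = same_tag[index]
    return node


# Hypothetical fragment standing in for one table cell of the video page.
cell = ET.fromstring(
    "<td><span><b><a href='/v/1'>May Morning</a></b></span>"
    "<span>Ugo Liberatore</span></td>"
)
name = select(cell, "span[0]/b[0]/a[0]")
```

A path that no longer resolves returns None, which corresponds to the "extract nothing" failure once the page structure changes.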
12
Annotations for data items
  • Describe the semantic meaning of a data item
  • Indicate the location of the data item
  • Specified by the user using the GUI
  • Recorded with the XPath function
    contains(pathToAnnotation, annotationValue)

/body/div[0]/table[4]/tr[0]/td[2]/table[1]/tr[0]/td[1]/text()[0][contains(null, "directed by")]

Data values in HTML page | Annotations
May Morning | -
Ugo Liberatore | directed by
Jane Birkin John Steiner Rosella Falk | Featuring
15.38-23.26 | DVD
14.98-18.99 | VHS
13
Outline
  • Motivation
  • System Overview
  • → Wrapper Maintenance (four steps)
  • Data-Feature Discovery
  • Item Recovery
  • Block Configuration
  • Rule Re-induction
  • Experiments
  • Related Work and Conclusion

14
Intuition of the approach
  • The page structure could change
  • Observation: many features of data items are
    more stable than the page structure, e.g.
  • Hyperlink
  • Annotation
  • Pattern
  • These features can help us find the new places of
    the old data items

15
Step 1: Data-Feature Discovery
  • Compute features of the data items in the
    original page

ID | DTD Element | L (hyperlink) | A (annotation) | P (data pattern)
1 | Name | True | NULL | [A-Za-z]{0,}
2 | Director | False | Directed by | [A-Za-z]{0,}
3 | Actors | False | Featuring | [A-Za-z]{0,}(.)
4 | VHSPrice | False | VHS | [0-9]{0,}[0-9](.)[0-9]{2}
5 | DVDPrice | False | DVD | [0-9]{0,}[0-9](.)[0-9]{2}
16
Data-Pattern Feature
  • A syntactic feature
  • Represented as a regular expression
  • E.g., 15.38 → [0-9]{0,}[0-9](.)[0-9]{2}
  • Can be extracted using existing technologies,
    e.g., Brin98, GHQR98, LM00
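
A small sketch of what a data-pattern feature does in practice. Two assumptions are made here: the dot in the price pattern is escaped so that it matches a literal "." (in standard regular-expression syntax an unescaped "(.)" would match any character), and `induce_pattern` is a toy generalizer written for this example, not one of the cited techniques.

```python
import re

# Price pattern in the spirit of the slide, with the dot escaped (assumption).
PRICE = re.compile(r"[0-9]{0,}[0-9](\.)[0-9]{2}")

TOKEN = re.compile(r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<other>.)", re.S)


def induce_pattern(value):
    """Toy generalizer: digit runs -> [0-9]+, letter runs -> [A-Za-z]+,
    every other character is kept as an escaped literal."""
    parts = []
    for match in TOKEN.finditer(value):
        if match.lastgroup == "digits":
            parts.append("[0-9]+")
        elif match.lastgroup == "letters":
            parts.append("[A-Za-z]+")
        else:
            parts.append(re.escape(match.group()))
    return "".join(parts)


induced = induce_pattern("15.38")
```

The induced pattern is looser than the hand-written one (it accepts any number of decimals), which shows why the choice of generalization affects precision later on.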

17
Annotations and Hyperlinks
  • Get annotation and hyperlink information from the
    original page
  • By checking the XQuery-based extraction rule
  • Hyperlink: a step of /a/ in the path
  • Annotation: the contains() function

LET $name := $vedio/span[0]/b[0]/a[0]/text()[0]
RETURN <name> $name </name>
LET $actors := $vedio/text()[contains(/preceding-sibling::b[0], "Featuring")]
RETURN <actors> $actors </actors>
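
The two checks can be sketched as simple string tests on an extraction path. This assumes the bracketed path syntax (a[0], contains(path, "value")); the `features_from_path` helper and its regular expressions are hypothetical and only cover paths of the shape shown on this slide.

```python
import re


def features_from_path(path):
    """Return (hyperlink, annotation) for one extraction path.

    hyperlink  -> True if some step is an <a> element
    annotation -> the string inside contains(..., "..."), or None
    """
    hyperlink = re.search(r"/a(\[\d+\])?(/|$)", path) is not None
    match = re.search(r'contains\([^,]*,\s*"([^"]*)"\)', path)
    annotation = match.group(1) if match else None
    return hyperlink, annotation


name_features = features_from_path("$vedio/span[0]/b[0]/a[0]/text()[0]")
actors_features = features_from_path(
    '$vedio/text()[contains(/preceding-sibling::b[0], "Featuring")]'
)
```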
18
Step 2: Data-Item Recovery
  • Traverse the new HTML tree in depth-first
    order
  • Use the old features to identify potential data
    items using three matching conditions
  • Hyperlink
  • Annotation
  • Data pattern
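
A minimal sketch of this step under simplifying assumptions: the tree is a hand-built `Node` structure rather than parsed HTML, the annotation of a node is taken to be the text of its preceding sibling, and a candidate must satisfy all three conditions of one feature. The class names, the sample page, and the patterns (which allow spaces inside names) are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


@dataclass
class Feature:
    element: str      # DTD element name
    hyperlink: bool   # L: was the item inside an <a> element?
    annotation: str   # A: "" when the item had no annotation
    pattern: str      # P: regular expression for the value


def recover(root, features):
    """Depth-first walk returning (element, value, path) for each text node
    whose hyperlink flag, annotation and pattern agree with one feature."""
    found = []

    def visit(node, path, in_link, label):
        in_link = in_link or node.tag == "a"
        value = node.text.strip()
        if value:
            for f in features:
                if (f.hyperlink == in_link
                        and f.annotation.lower() in label.lower()
                        and re.fullmatch(f.pattern, value)):
                    found.append((f.element, value, path))
                    break
        previous_text, counts = "", {}
        for child in node.children:
            index = counts.get(child.tag, 0)
            counts[child.tag] = index + 1
            visit(child, f"{path}/{child.tag}[{index}]", in_link, previous_text)
            previous_text = child.text.strip()

    visit(root, root.tag, False, "")
    return found


page = Node("td", children=[
    Node("span", children=[Node("b", children=[Node("a", "May Morning")])]),
    Node("b", "Directed by"),
    Node("span", "Ugo Liberatore"),
    Node("b", "VHS"),
    Node("span", "14.98"),
])
features = [
    Feature("Name", True, "", r"[A-Za-z ]+"),
    Feature("Director", False, "Directed by", r"[A-Za-z ]+"),
    Feature("VHSPrice", False, "VHS", r"[0-9]+\.[0-9]{2}"),
]
recovered = recover(page, features)
```

Each tuple carries the value, the new HTML path, and the DTD element, which is the information a mapping needs.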

19
Example
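[Figure: two recovery paths. Hyperlinked item with pattern [A-Za-z]{0,}: check hyperlink (ok), check data pattern (ok), recognize a data item. Annotated item with pattern [0-9]{0,}[0-9](.)[0-9]{2}: find annotation (yes), find value starting from annotation, check data pattern, recognize a data item.]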
20
Results of Data Item Recovery
  • A mapping list including all the recognized data
    items
  • Each mapping contains
  • Value of the data item
  • Path to it in the HTML tree
  • Path of the corresponding DTD element

A sample mapping: M1 = (D: May, HP: /table[0]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0], SP: VideoList/Video/Name)
21
Step 3: Block Configuration
  • Observation: data items are located in semantic
    blocks
  • Each block conforms to the user-defined schema
  • Data items are grouped in semantic blocks

[Figure: candidate blocks compared with the schema: Partial-Match, Full-Match, Over-Match]
22
Computing Full Match Blocks
Full match blocks
  • Identify the level in a top-down manner
  • Check the level by recursively considering the
    matches between candidate blocks and the schema
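
The slides do not define the three match kinds precisely, so the following sketch rests on an assumed reading: a block is a full match when it holds each required element exactly once, an over-match when some element repeats (the block spans several records), and a partial match otherwise. The tree layout, the required elements, and both function names are hypothetical.

```python
from collections import Counter

REQUIRED = ("Name", "Director", "Actors")  # hypothetical leaf elements


def classify(elements, required=REQUIRED):
    """Classify the recovered elements of one candidate block."""
    counts = Counter(elements)
    if any(counts[e] > 1 for e in required):
        return "over"
    if all(counts[e] == 1 for e in required):
        return "full"
    return "partial"


def elements_under(node):
    """All recovered element names in the subtree rooted at `node`."""
    found = list(node.get("items", []))
    for child in node.get("children", []):
        found.extend(elements_under(child))
    return found


def full_match_blocks(node):
    """Top-down search: descend only while a subtree over-matches."""
    kind = classify(elements_under(node))
    if kind == "full":
        return [node]
    if kind == "over":
        blocks = []
        for child in node.get("children", []):
            blocks.extend(full_match_blocks(child))
        return blocks
    return []


page = {"children": [
    {"items": ["Name", "Director", "Actors"]},
    {"items": ["Name", "Director", "Actors"]},
    {"items": ["Name"]},
]}
blocks = full_match_blocks(page)
```

The page as a whole over-matches, so the search moves one level down and keeps the two complete records while dropping the incomplete one.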

23
Results of Block Configuration
  • A set of blocks that can fully match with the DTD
  • Each of them is represented as a list of mappings

Examples
No. | Element | Path
1 | Title | table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
2 | Director | table[1]/tr[0]//td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
3 | Actors | table[1]/tr[0]//td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]
4 | Title | table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]
5 | Director | table[2]/tr[0]//td[1]/span[1]/text()[contains(/preceding-sibling::b[0], "Directed by")]
6 | Actors | table[2]/tr[0]//td[1]/span[2]/text()[contains(/preceding-sibling::b[0], "Featuring")]
24
Step 4: Rule Re-Induction
  • Semantic blocks contain mappings from data items
    in HTML to DTD elements
  • Induce new extraction rule by calling the
    induction algorithm in wrapper generator
  • Refine the rule so that it covers all the
    other semantic blocks
  • Generalization is necessary
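
One simple form of generalization, shown only as a sketch: when the same element has paths in several blocks that differ in a step index, that index is replaced by a wildcard. The `generalize` helper and the `[*]` notation are assumptions for illustration; the slides do not say how SG-WRAM's induction algorithm generalizes, and this sketch does not handle steps that contain predicates.

```python
import re

STEP = re.compile(r"^([^\[\]]+)(?:\[(\d+)\])?$")


def generalize(paths):
    """Merge same-shaped paths; an index that differs becomes [*]."""
    split = [p.split("/") for p in paths]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError("paths must have the same number of steps")
    merged = []
    for column in zip(*split):
        parsed = [STEP.match(step) for step in column]
        if any(m is None for m in parsed):
            raise ValueError(f"unsupported step in {column!r}")
        tags = {m.group(1) for m in parsed}
        if len(tags) != 1:
            raise ValueError(f"steps differ in tag: {column!r}")
        tag = tags.pop()
        indexes = {m.group(2) for m in parsed}
        if len(indexes) == 1:
            index = indexes.pop()
            merged.append(tag if index is None else f"{tag}[{index}]")
        else:
            merged.append(f"{tag}[*]")
    return "/".join(merged)


rule_path = generalize([
    "table[1]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]",
    "table[2]/tr[0]/td[1]/span[0]/b[0]/a[0]/text()[0]",
])
```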

25
Outline
  • Motivation
  • System Overview
  • Wrapper Maintenance (four steps)
  • Data-Feature Discovery
  • Item Recovery
  • Block Configuration
  • Rule Re-induction
  • → Experiments
  • Related Work and Conclusion

26
Web Sources
1Bookstreet Book
Allbooks4less Book
Amazon Book (search)
Amazon Magazine
Barnesandnoble Book
CIA Factbook
CNN Currency
Excite Currency
Hotels Hotel
Yahoo Shopping Video
Yahoo Quotes
Yahoo People Email
  • From October 2002 to May 2003
  • Collected Web page changes
  • From 16 data-intensive sites
  • Using site search engine or from the same URL
  • All the pages have complex table structures
  • Observed changes
  • Data items (add, delete, modify)
  • Table structure → non-table structure
  • Complex table structure re-arrangement

27
Experiment Procedures
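[Figure: experiment workflow. Step 1: the Wrapper Generator builds original wrappers from the original Web docs and stores them in the Wrapper Repository. Step 2: the original wrappers are run on new Web docs and the extraction results are checked. Step 3: changed pages are passed to the Wrapper Maintainer, which produces repaired wrappers.]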
28
Experiment Metrics
  • Recall (R)
  • Proportion of correctly extracted data items
    out of all the data items that should be extracted
  • Precision (P)
  • Proportion of correctly extracted data items
    out of all the data items that have been extracted
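
The two metrics can be computed as below. This is a sketch that treats data items as set members; returning None when a denominator is zero is a choice made here, since precision is undefined when nothing is extracted.

```python
def recall_precision(extracted, expected):
    """Return (recall, precision) as fractions, or None where undefined."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    recall = correct / len(expected) if expected else None
    precision = correct / len(extracted) if extracted else None
    return recall, precision


# Three items extracted, two of them correct, four items expected.
r, p = recall_precision({"a", "b", "c"}, {"a", "b", "d", "e"})
```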

29
Original wrappers after changes
Name | # of changed pages | Item number | Avg recall (%) | Avg precision (%)
1Bookstreet Book | 12 | 6 | 82.54 | 100
Allbooks4less Book | 15 | 4 | 0 | -
Amazon Book (search) | 15 | 6 | 40.49 | 100
Amazon Magazine | 15 | 5 | 20.01 | 100
Barnesandnoble Book | 15 | 5 | 0 | 100
CIA Factbook | 5 | 10 | 0 | 100
CNN Currency | 15 | 6 | 50.00 | 100
Excite Currency | 18 | 11 | 42.86 | 100
Hotels Hotel | 15 | 4 | 0 | -
Yahoo Shopping Video | 15 | 6 | 0 | -
Yahoo Quotes | 10 | 6 | 0 | -
Yahoo People Email | 10 | 3 | 0 | -
30
New wrappers (after item recovery)
Web site | Avg recall (%) | Avg precision (%)
1Bookstreet Book | 98.67 | 71.26
Allbooks4less Book | 75 | 32.69
Amazon Book (search) | 83.05 | 36.3
Amazon Magazine | 100 | 60.15
Barnesandnoble | 78.72 | 43.13
CIA Factbook | 100 | 100
CNN Currency | 100 | 100
Excite Currency | 100 | 100
Hotels Hotel | 50 | 35.61
Yahoo Shopping | 100 | 51.49
Yahoo Quotes | 100 | 100
Yahoo People | 100 | 53.54
31
New Wrappers (final)
Web site | Avg recall (%) | Avg precision (%)
1Bookstreet Book | 100 | 100
Allbooks4less Book | 75 | 51.34
Amazon Book (search) | 83.05 | 90.74
Amazon Magazine | 100 | 100
Barnesandnoble | 78.72 | 100
CIA Factbook | 100 | 100
CNN Currency | 100 | 100
Excite Currency | 100 | 100
Hotels Hotel | 50 | 41.87
Yahoo Shopping | 100 | 92.86
Yahoo Quotes | 100 | 100
Yahoo People | 100 | 100
32
Related Work on Wrapper Maintenance
  • Kushmerick 99
  • Using simple numeric features of the extracted
    strings
  • Lerman K., Minton S. 00
  • Using the starting and ending strings as the
    description of the data fields
  • Chidlovskii B. 01
  • Syntactic features of data items to be extracted,
    and semantic features (URL, time strings,
    entities)

33
Comparison
  • These approaches heavily rely on the syntactic
    features of the data items, and may not precisely
    recognize data items.

Title | Our Price | List Price
Data on Web | 23.00 | 29.00
Java Programming | 49.00 | 59.00

Title | List Price | Our Price
Data on Web | 29.00 | 23.00
Java Programming | 59.00 | 49.00
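
The two tables can be used to illustrate the point in code. In this sketch a price pattern alone matches both price columns, so it cannot tell them apart after the columns swap, while a lookup by column header (used here as a stand-in for an annotation) still finds the right values. The pattern and both helpers are assumptions for illustration.

```python
import re

PRICE = re.compile(r"[0-9]+\.[0-9]{2}")  # assumed pattern shared by both price fields

old_page = [["Title", "Our Price", "List Price"],
            ["Data on Web", "23.00", "29.00"],
            ["Java Programming", "49.00", "59.00"]]
new_page = [["Title", "List Price", "Our Price"],
            ["Data on Web", "29.00", "23.00"],
            ["Java Programming", "59.00", "49.00"]]


def columns_matching(page, pattern):
    """Indexes of the columns whose data cells all match the pattern."""
    header, *rows = page
    return [i for i in range(len(header))
            if all(pattern.fullmatch(row[i]) for row in rows)]


def extract_by_header(page, label):
    """Values of the column whose header equals `label`."""
    header, *rows = page
    index = header.index(label)
    return [row[index] for row in rows]


ambiguous = columns_matching(new_page, PRICE)
our_price_old = extract_by_header(old_page, "Our Price")
our_price_new = extract_by_header(new_page, "Our Price")
```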
34
Conclusion
  • SG-WRAM: a wrapper-maintenance system
  • Intuition: use features that are more stable
  • Pattern
  • Hyperlink
  • Annotation
  • Four steps of the approach
  • Data-Feature Discovery
  • Item Recovery
  • Block Configuration
  • Rule Re-induction
  • Experiments showed that it is effective

35
Thank you!
  • Schema-Guided Wrapper Maintenance for Web-Data
    Extraction
  • Xiaofeng Meng, Dongdong Hu
  • Renmin University of China, Beijing, China
  • Chen Li
  • University of California, Irvine, CA, USA