Machine-learning based Semi-structured IE

1
Machine-learning based Semi-structured IE
  • Chia-Hui Chang
  • Department of Computer Science and Information
    Engineering
  • National Central University
  • chia_at_csie.ncu.edu.tw

2
Wrapper Induction
  • Wrapper
  • An extracting program to extract desired
    information from Web pages.
  • Semi-structured doc. → wrapper → structured info.
  • Web wrappers wrap...
  • Query-able or Search-able Web sites
  • Web pages with large itemized lists
  • The primary issues are
  • How to build the extractor quickly?

3
Semi-structured IE
  • Developed independently of traditional IE
  • The necessity of extracting and integrating data
    from multiple Web-based sources

4
Machine-Learning Based Approach
  • A key component of IE systems is
  • a set of extraction patterns
  • that can be generated by machine learning
    algorithms.

5
Related Work
  • Shopbot
  • Doorenbos, Etzioni, Weld, AA-97
  • Ariadne
  • Ashish, Knoblock, Coopis-97
  • WIEN
  • Kushmerick, Weld, Doorenbos, IJCAI-97
  • SoftMealy wrapper representation
  • Hsu, IJCAI-99
  • STALKER
  • Muslea, Minton, Knoblock, AA-99
  • A hierarchical FST

6
WIEN
  • N. Kushmerick, D. S. Weld,
  • R. Doorenbos,
  • University of Washington, 1997
  • http://www.cs.ucd.ie/staff/nick/

7
Example 1
8
Extractor for Example 1
9
HLRT
10
Wrapper Induction
  • Induction
  • The task of generalizing from labeled examples to
    a hypothesis
  • Instances: pages
  • Labels: (Congo, 242), (Egypt, 20), (Belize,
    501), (Spain, 34)
  • Hypotheses
  • E.g. (<P>, <HR>, <B>, </B>, <I>, </I>)
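
An HLRT hypothesis can be read as a head delimiter, a tail delimiter, and one left/right delimiter pair per attribute. The following is a minimal Python sketch of how such a wrapper might be executed on a page. The function name `exec_hlrt` and the sample page are illustrative assumptions, not taken from the slides, and the loop is only one plausible reading of the HLRT idea.

```python
def exec_hlrt(page, head, tail, delimiters):
    """Extract tuples from `page` using a head, a tail, and (left, right) pairs."""
    tuples = []
    pos = page.find(head)
    if pos < 0:
        return tuples
    pos += len(head)                      # skip everything up to the head
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return tuples
            row.append(page[start:end])   # text between left and right
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical sample page in the spirit of the country-code labels.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>\n"
    "<B>Some Country Codes</B><P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "<HR><B>End</B></BODY></HTML>"
)

rows = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
print(rows)
```

The head keeps the bold page title from being mistaken for a country, and the tail keeps the trailing bold "End" out of the result.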

11
BuildHLRT
12
Other Family
  • OCLR (Open-Close-Left-Right)
  • Use Open and Close as delimiters for each tuple
  • HOCLRT
  • Combine OCLR with Head and Tail
  • N-LR and N-HLRT
  • Nested LR
  • Nested HLRT

13
Terminology
  • Oracles
  • Page Oracle
  • Label Oracle
  • PAC analysis
  • determines how many examples are necessary
    to build a wrapper, given two parameters:
    accuracy ε and confidence δ
  • Pr[E(w) < ε] > 1 − δ, or Pr[E(w) > ε] < δ

14
Probably Approximately Correct (PAC) Analysis
  • With ε = 0.1, δ = 0.1, K = 4, and an average of 5
    tuples/page, BuildHLRT must examine at least 72
    examples

15
Empirical Evaluation
  • Extracts 48% of Web pages successfully.
  • Weakness
  • Missing attributes, attributes not in order,
    tabular data, etc.

16
SoftMealy
  • Chun-Nan Hsu, Ming-Tzung Dung, 1998
  • Arizona State University
  • http://kaukoai.iis.sinica.edu.tw/chunnan/mypublications.html

17
SoftMealy Architecture
  • Finite-State Transducers for Semi-Structured Text
    Mining
  • Labeling: use an interface to label examples
    manually.
  • Learner: FST (Finite-State Transducer)
  • Extractor
  • Demonstration
  • http://kaukoai.iis.sinica.edu.tw/video.html

18
SoftMealy Wrapper
  • SoftMealy wrapper representation
  • Uses a finite-state transducer in which each distinct
    attribute permutation is encoded as a
    successful path
  • Replaces delimiters with contextual rules that
    describe the context delimiting two adjacent
    attributes
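
The representation above can be illustrated with a toy transducer. This is a minimal sketch, assuming three hypothetical attributes N, A, and M, dummy "skip" states -N and -A, begin/end states b and e, and contextual rules written as a pair of regular expressions (left context, right context). The rule table, markup, and function name `run_fst` are assumptions for illustration and are not SoftMealy's learned rules.

```python
import re

# (state, next_state) -> (left, right): the transition fires at a position
# when `left` matches the text ending there and `right` matches the text
# starting there. An empty pattern places no constraint on that side.
RULES = {
    ("b", "N"): (r"<B>$", r""),
    ("N", "-N"): (r"", r"</B>"),
    ("-N", "A"): (r"<I>$", r""),
    ("A", "-A"): (r"", r"</I>"),
    ("-N", "M"): (r"<EM>$", r""),   # path for tuples without A
    ("-A", "M"): (r"<EM>$", r""),
    ("M", "e"): (r"", r"</EM>"),
}
ATTRIBUTE_STATES = {"N", "A", "M"}  # states whose input is extracted


def run_fst(text, rules=RULES):
    """Scan `text` once, extracting characters while in an attribute state."""
    state = "b"
    fields = {}
    for pos in range(len(text) + 1):
        moved = True
        while moved:                # allow several transitions at one position
            moved = False
            for (src, dst), (left, right) in rules.items():
                if (src == state
                        and re.search(left, text[:pos])
                        and re.match(right, text[pos:])):
                    state, moved = dst, True
                    break
        if state in ATTRIBUTE_STATES and pos < len(text):
            fields[state] = fields.get(state, "") + text[pos]
    return fields


full = "<LI><B>Mani Chandy</B>, <I>Professor</I>, <EM>Executive Officer</EM>"
short = "<LI><B>Jim Arvo</B>, <EM>Dean</EM>"
print(run_fst(full))
print(run_fst(short))
```

Each attribute order corresponds to a different path through the same rule table: the first input follows b, N, -N, A, -A, M, e, while the second skips A by taking the -N to M transition.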

19
Example
20
Label the Answer Key
21
Finite State Transducer
[Figure: finite-state transducer with states N, -N, U, -U, A, -A, M, and e; transitions labeled extract or skip; covers the attribute orders (N, M) and (N, A, M)]
22
Find the starting position -- Single Pass

23
Contextual based Rule Learning
  • Tokens
  • Separators
  • SL: Punc(,) Spc(1) Html(<I>)
  • SR: C1Alph(Professor) Spc(1) OAlph(of)
  • Rule generalization
  • Taxonomy Tree

24
Tokens
  • All-uppercase string: CAlph
  • An uppercase letter, followed by at least one
    lowercase letter: C1Alph
  • A lowercase letter, followed by zero or more
    characters: OAlph
  • HTML tag: Html
  • Punctuation symbol: Punc
  • Control characters: NL(1), Tab(4), Spc(3)
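
The token classes listed above can be approximated with a small regex-based tokenizer. This is a sketch under assumptions: the exact regular expressions, the `Other` fallback class for characters the slide does not classify (digits, for example), and the function name `tokenize` are all hypothetical. Control-character tokens carry a repeat count, as in NL(1) or Spc(3).

```python
import re

# Order matters: the first alternative that matches at a position wins.
TOKEN_SPEC = [
    ("Html",   r"<[^>]+>"),         # HTML tag
    ("C1Alph", r"[A-Z][a-z]+"),     # uppercase letter, then lowercase letters
    ("CAlph",  r"[A-Z]+"),          # all-uppercase string
    ("OAlph",  r"[a-z][A-Za-z]*"),  # starts with a lowercase letter
    ("NL",     r"\n+"),
    ("Tab",    r"\t+"),
    ("Spc",    r" +"),
    ("Punc",   r"[^\w\s]"),         # single punctuation symbol
    ("Other",  r"."),               # assumed fallback, not on the slide
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))
COUNTED = {"NL", "Tab", "Spc"}      # classes written with a repeat count


def tokenize(text):
    """Return tokens such as 'C1Alph(Professor)' or 'Spc(1)'."""
    tokens = []
    for match in MASTER.finditer(text):
        kind, value = match.lastgroup, match.group()
        arg = len(value) if kind in COUNTED else value
        tokens.append(f"{kind}({arg})")
    return tokens


print(tokenize(", <I>Professor of"))
```

On the input `, <I>Professor of` the sketch yields the same token sequence that the separator example uses for its left and right contexts.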

25
Rule Generalization
26
Learning Algorithm
  • Generalize each column by replacing its tokens
    with their least common ancestor in the taxonomy tree
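
Column-wise generalization by least common ancestor can be sketched as follows. The taxonomy here is a hypothetical one built only from the token class names (the transcript does not include the actual taxonomy tree), and the intermediate nodes `Alph`, `Control`, and `Any` are assumptions made for illustration.

```python
from functools import reduce

# Hypothetical taxonomy: child -> parent. "Any" is the root.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "NL": "Control", "Tab": "Control", "Spc": "Control",
    "Alph": "Any", "Control": "Any", "Html": "Any", "Punc": "Any",
    "Any": None,
}


def ancestors(node):
    """Return the chain from `node` up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = PARENT[node]
    return chain


def lca(a, b):
    """Least common ancestor of two token classes in the taxonomy."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def generalize(rows):
    """Generalize aligned token sequences column by column."""
    return [reduce(lca, column) for column in zip(*rows)]


examples = [
    ["C1Alph", "Spc", "Html"],
    ["CAlph",  "Spc", "Punc"],
]
print(generalize(examples))
```

A column whose tokens already agree stays unchanged, while a column mixing two alphabetic classes rises only as far as their shared parent.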

27
Taxonomy Tree
28
Generating to Extract the Body
  • The contextual rules for the head and tail
    separators are
  • hL: C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>)
    NL(1) Html(<UL>)
  • tR: Html(</UL>) NL(1) Html(<HR>) NL(1)
    Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)

29
More Expressive Power
  • SoftMealy allows
  • Disjunction
  • Multiple attribute orders within tuples
  • Missing attributes
  • Features of candidate strings

30
STALKER
  • I. Muslea, S. Minton, C. Knoblock,
  • University of Southern California
  • http://www.isi.edu/muslea/

31
STALKER
  • Embedded Catalog Tree
  • Leaves (primitive items)
  • Internal nodes (items)
  • Homogeneous list, or
  • Heterogeneous tuple.

32
EC Tree of a page
33
Extracting Data from a Document
  • For each node in the EC Tree, the wrapper needs a
    rule that extracts that particular node from its
    parent
  • Additionally, for each list node, the wrapper
    requires a list iteration rule that decomposes
    the list into individual tuples.
  • Advantages
  • First, the hierarchical extraction based on the EC tree
    allows us to wrap information sources that have
    arbitrarily many levels of embedded data.
  • Second, as each node is extracted independently
    of its siblings, our approach does not rely on
    there being a fixed ordering of the items, and we
    can easily handle extraction tasks from documents
    that may have missing items or items that appear
    in various orders.
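
The hierarchical scheme described above, one extraction rule per node plus an iteration rule per list node, can be sketched with a small tree walker. Everything named here is a hypothetical illustration: the node classes, the `between` helper (a plain substring rule standing in for learned landmark rules), and the restaurant markup.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional

Rule = Callable[[str], Optional[str]]   # parent text -> node text (or None)


@dataclass
class Leaf:
    name: str
    extract: Rule


@dataclass
class TupleNode:
    name: str
    extract: Rule
    children: list


@dataclass
class ListNode:
    name: str
    extract: Rule
    iterate: Callable[[str], List[str]]  # list text -> one chunk per item
    item: object


def between(start, end):
    """Build a rule returning the text between `start` and `end`."""
    def rule(text):
        i = text.find(start)
        if i < 0:
            return None                  # item is missing from the parent
        i += len(start)
        j = text.find(end, i)
        return text[i:j] if j >= 0 else text[i:]
    return rule


def whole(text):
    return text


def extract(node, parent_text):
    """Extract `node` from its parent's text, then recurse into children."""
    text = node.extract(parent_text)
    if text is None:
        return None
    if isinstance(node, Leaf):
        return text
    if isinstance(node, TupleNode):
        return {child.name: extract(child, text) for child in node.children}
    return [extract(node.item, chunk) for chunk in node.iterate(text)]


RESTAURANT = TupleNode("restaurant", whole, [
    Leaf("name", between("Name: ", "<p>")),
    Leaf("cuisine", between("Cuisine: ", "<p>")),
    ListNode("addresses", between("<ol>", "</ol>"),
             lambda t: re.findall(r"<li>(.*?)</li>", t),
             Leaf("address", whole)),
])

doc = ("Name: Yala<p>Cuisine: Thai<p><ol><li>4000 Colfax, Phoenix</li>"
       "<li>523 Vernon, Las Vegas</li></ol>")
print(extract(RESTAURANT, doc))
```

Because each child is located independently within its parent's text, the same tree also handles a page where the cuisine precedes the name, or where the cuisine is absent altogether.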

34
Extraction Rules as Finite Automata
  • Landmarks
  • A sequence of tokens and wildcards
  • Landmark automata
  • A non-deterministic finite automaton

35
Landmark Automata
  • A linear LA has one accepting state
  • from each non-accepting state, there are exactly
    two possible transitions: a loop to itself, and a
    transition to the next state
  • each non-looping transition is labeled by a
    landmark
  • all looping transitions mean "consume
    all tokens until you encounter the landmark that
    leads to the next state."
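
A linear landmark automaton of this kind can be simulated by consuming tokens until each landmark is found in turn. The sketch below takes the earliest match for every landmark, which is one simple way to run it. The wildcard definitions for `_HtmlTag_`, `_Symbol_`, and `_Word_`, the function names, and the token list are assumptions for illustration only.

```python
import re

# Assumed wildcard meanings; a pattern not listed here is a literal token.
WILDCARDS = {
    "_HtmlTag_": lambda tok: re.fullmatch(r"<[^>]+>", tok) is not None,
    "_Symbol_":  lambda tok: len(tok) == 1 and not tok.isalnum() and not tok.isspace(),
    "_Word_":    lambda tok: tok.isalpha(),
}


def matches(pattern, token):
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def find_landmark(tokens, landmark, start):
    """Index just past the first match of `landmark` at or after `start`, or -1."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if all(matches(p, t) for p, t in zip(landmark, tokens[i:i + n])):
            return i + n
    return -1


def run_linear_la(tokens, landmarks):
    """Loop in each state until its landmark is seen, then move to the next state."""
    pos = 0
    for landmark in landmarks:
        pos = find_landmark(tokens, landmark, pos)
        if pos < 0:
            return -1            # the automaton never reaches its accepting state
    return pos


tokens = ["Name", ":", "<b>", "Yala", "</b>", "<p>", "Cuisine", ":", "Thai",
          "<p>", "<i>", "4000", "Colfax", "</i>"]

print(run_linear_la(tokens, [["<i>"]]))
print(run_linear_la(tokens, [["Cuisine", "_Symbol_"], ["_HtmlTag_"]]))
```

The returned index is the position right after the last landmark, which is where the item to extract would begin.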

36
Rule Generating
  • Extract Credit info.
  • 1st: terminals reservation, _Symbol_, _Word_
  • Candidates: <i>, _Symbol_, _HtmlTag_
  • Perfect disjuncts: <i>, _HtmlTag_ (positive
    examples D3, D4)
  • 2nd: uncovered examples D1, D2
  • Candidate: _Symbol_
37
Possible Rules
40
The STALKER Algorithm
43
Features
  • Process is performed in a hierarchical manner.
  • Handles attributes that are not in a fixed order.
  • Uses disjunctive rules to handle missing attributes.

44
Multi-pass SoftMealy
  • Chun-Nan Hsu and Chian-Chi Chang
  • Institute of Information Science
  • Academia Sinica
  • Taipei, Taiwan

45
Multi-pass
46
Tabular style document
(Quote Server)
47
Tagged-list style document
(Internet Address Finder)
48
Layout styles and learnability
  • Tabular style
  • missing attributes, ordering as hints
  • Tagged-list style
  • variant ordering, tags as hints
  • Prediction
  • single-pass for tabular style
  • multi-pass for tagged-list style

49
Tabular result (Quote Server)
50
Tagged-list result (Internet Address Finder)
51
Comparison
  • Both
  • can handle irregular missing attributes.
  • Single-pass
  • Single-pass is good for tabular pages
  • Multi-pass
  • Multi-pass is good for tagged-list pages

52
Comparison
  • Quote Server
  • STALKER: 10 example tuples, 79%, 500 test
  • WIEN: the collection is beyond its learning capability
  • SoftMealy: multi-pass 85%, single-pass 97%
  • Internet Address Finder
  • STALKER: 80 100, 500 test
  • WIEN: the collection is beyond its learning capability
  • SoftMealy: multi-pass 68%, single-pass 41%

53
Comparison
  • Okra (tabular pages)
  • STALKER: 97%, 1 example tuple
  • WIEN: 100%, 13 example tuples, 30 test
  • SoftMealy: single-pass 100%, 1 example tuple, 30
    test
  • Big-book (tagged-list pages)
  • STALKER: 97%, 8 example tuples
  • WIEN: perfect, 18 example tuples, 30 test
  • SoftMealy: single-pass 97%, 4 examples, 30 test;
    multi-pass 100%, 6 examples, 30 test

54
References
  • Kushmerick, N. (2000). Wrapper induction:
    Efficiency and expressiveness. Artificial
    Intelligence J. 118(1-2):15-68 (special issue on
    Intelligent Internet Systems).
  • Chun-Nan Hsu and Ming-Tzung Dung. Generating
    finite-state transducers for semistructured data
    extraction from the web. Information Systems,
    23(8):521-538, Special Issue on Semistructured
    Data, 1998.
  • Ion Muslea, Steve Minton, Craig
    Knoblock. Hierarchical Wrapper Induction for
    Semistructured Information Sources. Journal of
    Autonomous Agents and Multi-Agent Systems,
    4:93-114, 2001.