Annotation Free Information Extraction

1
Annotation Free Information Extraction
  • Chia-Hui Chang
  • Department of Computer Science and Information
    Engineering
  • National Central University
  • chia_at_csie.ncu.edu.tw
  • 10/4/2002

2
Introduction
  • Text IE: AutoSlog-TS
  • Semi-structured IE: IEPAD

3
AutoSlog-TS: Automatically Generating Extraction
Patterns from Untagged Text
  • Ellen Riloff
  • University of Utah
  • AAAI-96

4
AutoSlog-TS
  • AutoSlog-TS is an extension of AutoSlog
  • It operates exhaustively by generating an
    extraction pattern for every noun phrase in the
    training corpus.
  • It then evaluates the extraction patterns by
    processing the corpus a second time and
    generating relevance statistics for each pattern.
  • A more significant difference is that AutoSlog-TS
    allows multiple rules to fire if more than one
    matches the context.

5
AutoSlog-TS Concept

6
Relevance Rate
  • Pr(relevant text | text contains pattern_i)
      = rel-freq_i / total-freq_i
  • rel-freq_i: the number of instances of
    pattern_i that were activated in relevant texts.
  • total-freq_i: the total number of instances
    of pattern_i that were activated in the training
    corpus.
  • The motivation behind the conditional probability
    estimate is that domain-specific expressions will
    appear substantially more often in relevant texts
    than irrelevant texts.

7
Rank function
  • Next, we use a rank function to rank the patterns
    in order of importance to the domain:
  • relevance rate * log2(frequency)
  • So, a person only needs to review the most highly
    ranked patterns.
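The relevance rate and rank function can be sketched in a few lines. This is a minimal illustration, not AutoSlog-TS itself: the function name `rank_patterns`, the input format (one `(pattern, is_relevant_text)` pair per activation), and the pattern strings in the usage note are assumptions made for the example. It also assumes that "frequency" in the rank function means total-freq, which the slide does not state.

```python
import math
from collections import Counter


def rank_patterns(activations):
    """Rank patterns by relevance rate * log2(frequency).

    activations: iterable of (pattern, is_relevant_text) pairs, one pair
    per instance of a pattern that fired in the training corpus.
    Returns rows of (pattern, relevance_rate, total_freq, score),
    best score first.
    """
    total_freq = Counter()
    rel_freq = Counter()
    for pattern, is_relevant_text in activations:
        total_freq[pattern] += 1
        if is_relevant_text:
            rel_freq[pattern] += 1

    rows = []
    for pattern, total in total_freq.items():
        relevance_rate = rel_freq[pattern] / total
        # Assumption: "frequency" is taken to be total-freq.
        score = relevance_rate * math.log2(total)
        rows.append((pattern, relevance_rate, total, score))

    rows.sort(key=lambda row: row[3], reverse=True)
    return rows
```

For instance, a hypothetical pattern that fires 4 times, always in relevant texts, scores 1.0 * log2(4) = 2.0, while one that fires 8 times but only twice in relevant texts scores 0.25 * log2(8) = 0.75, so the first is reviewed earlier.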

8
Experimental Results: Setup
  • We evaluated AutoSlog and AutoSlog-TS by manually
    inspecting the performance of their dictionaries
    in the MUC-4 terrorism domain.
  • We used the MUC-4 texts as input and the MUC-4
    answer keys as the basis for judging correct
    output (MUC-4 Proceedings 1992).
  • Training

9
Testing
  • To evaluate the two dictionaries, we chose 100
    blind texts from the MUC-4 test set. (50 relevant
    texts and 50 irrelevant texts)
  • We scored the output by assigning each extracted
    item to one of five categories: correct,
    mislabeled, duplicate, spurious, or missing.
  • Correct: the item matched against the answer
    keys.
  • Mislabeled: the item matched against the answer
    keys but was extracted as the wrong type of
    object.
  • Duplicate: the item was coreferent with an item
    in the answer keys.
  • Spurious: the item did not refer to any object
    in the answer keys.
  • Missing: items in the answer keys that were not
    extracted.

10
Experimental Results
  • We scored three items: perpetrators, victims, and
    targets.

11
Experimental Results
  • We calculated recall as
    correct / (correct + missing)
  • We computed precision as
    (correct + duplicate) / (correct + duplicate +
    mislabeled + spurious)
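The two measures translate directly into code. The sketch below only restates the formulas from this slide; the function names and the counts in the usage note are made up for illustration and are not results from the experiments.

```python
def recall(correct, missing):
    """recall = correct / (correct + missing)"""
    denominator = correct + missing
    return correct / denominator if denominator else 0.0


def precision(correct, duplicate, mislabeled, spurious):
    """precision = (correct + duplicate) /
                   (correct + duplicate + mislabeled + spurious)"""
    denominator = correct + duplicate + mislabeled + spurious
    return (correct + duplicate) / denominator if denominator else 0.0
```

With hypothetical counts of 30 correct, 10 missing, 10 duplicate, 5 mislabeled, and 5 spurious items, recall is 30 / 40 = 0.75 and precision is 40 / 50 = 0.8. Note that duplicates count in favor of precision, since they refer to items that are in the answer keys.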

12
Behind the scenes
  • In fact, we have reason to believe that
    AutoSlog-TS is ultimately capable of producing
    better recall than AutoSlog because it generates
    many good patterns that AutoSlog did not.
  • AutoSlog-TS produced 158 patterns with a
    relevance rate ≥ 90% and frequency ≥ 5. Only 45
    of these patterns were in the original AutoSlog
    dictionary.
  • The higher precision demonstrated by AutoSlog-TS
    is probably a result of the relevance statistics.

13
Future Directions
  • A potential problem with AutoSlog-TS is that
    there are undoubtedly many useful patterns buried
    deep in the ranked list, which cumulatively could
    have a substantial impact on performance.
  • The precision of the extraction patterns could
    also be improved by adding semantic constraints
    and, in the long run, creating more complex
    extraction patterns.

14
IEPAD: Information Extraction based on Pattern
Discovery
  • C.H. Chang.
  • National Central University
  • WWW10

15
Semi-structured Information Extraction
  • Information Extraction (IE)
  • Input: HTML pages
  • Output: a set of records

16
Pattern Discovery based IE
  • Motivation
  • Display of multiple records often forms a
    repeated pattern
  • The occurrences of the pattern are spaced
    regularly and adjacently
  • Now the problem becomes ...
  • Find regular and adjacent repeats in a string

17
IEPAD Architecture

18
The Pattern Generator
  • Translator
  • PAT tree construction
  • Pattern validator
  • Rule Composer

19
1. Web Page Translation
  • Encoding of HTML source
  • Rule 1: Each tag is encoded as a token
  • Rule 2: Any text between two tags is translated
    to a special token called TEXT (denoted by an
    underscore)
  • HTML example:
  • <B>Congo</B><I>242</I><BR>
  • <B>Egypt</B><I>20</I><BR>
  • Encoded token string:
  • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
  • T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
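The two translation rules can be sketched with a small regular-expression tokenizer. This is an illustrative approximation, not the IEPAD translator: the names `encode` and `show` are hypothetical, and the choices to drop tag attributes, upper-case tag names, and ignore whitespace-only text are assumptions of the sketch.

```python
import re

# A piece is either a whole tag or a run of text between tags.
PIECE = re.compile(r"<[^>]+>|[^<]+")
TAG_NAME = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)")


def encode(html):
    """Translate an HTML string into a list of tokens."""
    tokens = []
    for piece in PIECE.findall(html):
        if piece.startswith("<"):
            match = TAG_NAME.match(piece)
            if match:
                # Rule 1: each tag becomes one token (attributes dropped).
                slash, name = match.groups()
                tokens.append(f"<{slash}{name.upper()}>")
        elif piece.strip():
            # Rule 2: any text between two tags becomes the TEXT token "_".
            tokens.append("_")
    return tokens


def show(tokens):
    """Render tokens in the T(...) notation used on the slide."""
    return "".join(f"T({token})" for token in tokens)
```

Calling `show(encode("<B>Congo</B><I>242</I><BR>"))` gives `T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)`, and the Egypt row encodes to the same string, which is what makes the record structure visible as a repeat.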

20
Various Encoding Schemes

21
2. PAT Tree Construction
  • PAT tree: binary suffix tree
  • A Patricia tree constructed over all possible
    suffix strings of a text
  • Example:
  • T(<B>) 000
  • T(</B>) 001
  • T(<I>) 010
  • T(</I>) 011
  • T(<BR>) 100
  • T(_) 110

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
000110001010110011100 000110001010110011100
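The binary encoding in this example can be reproduced with a lookup table. The sketch below only builds the bit string and lists the suffixes that a PAT tree would index; it does not construct the Patricia tree. The code table is copied from the slide, while the function names and the restriction to suffixes that start on a token boundary are assumptions of the sketch.

```python
# Fixed-width codes for each token, as listed in the example.
CODES = {
    "<B>": "000",
    "</B>": "001",
    "<I>": "010",
    "</I>": "011",
    "<BR>": "100",
    "_": "110",
}


def to_bits(tokens):
    """Concatenate the binary code of every token."""
    return "".join(CODES[token] for token in tokens)


def token_suffixes(tokens):
    """Return (start_index, bit_string) for each suffix of the token list.

    Assumption: only suffixes that begin on a token boundary are kept,
    since a repeat that starts in the middle of a token is meaningless.
    """
    return [(start, to_bits(tokens[start:])) for start in range(len(tokens))]
```

For one encoded row, `to_bits(["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"])` yields `000110001010110011100`, the 21-bit string in the example. Two rows give 14 suffixes, one per token position.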
22
The Constructed PAT Tree

23
Definition of Maximal Repeats
  • Let α occur in S at positions p1, p2, p3, ..., pk
  • α is left maximal if there exists at least one
    (i, j) pair such that S[pi-1] ≠ S[pj-1]
  • α is right maximal if there exists at least one
    (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]
  • α is a maximal repeat if it is both left maximal
    and right maximal

24
Finding Maximal Repeats
  • Definition
  • Let's call character S[pi-1] the left character
    of suffix pi
  • A node v is left diverse if at least two leaves
    in v's subtree have different left characters
  • Lemma
  • The path label of an internal node v in a PAT
    tree is a maximal repeat if and only if v is left
    diverse
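To make the definition of a maximal repeat concrete, here is a brute-force check that applies the left-maximal and right-maximal conditions directly. It deliberately does not use a PAT tree or the left-diverse lemma, so it is far slower than the tree-based method and is meant only as a reference for small inputs. The function name and the use of `None` as the character beyond either end of the sequence are assumptions of the sketch.

```python
from collections import defaultdict


def maximal_repeats(seq):
    """Return {repeat (as a tuple): [start positions]} for a string or list."""
    n = len(seq)
    found = {}
    for length in range(1, n):
        # Group the start positions of every substring of this length.
        occurrences = defaultdict(list)
        for start in range(n - length + 1):
            occurrences[tuple(seq[start:start + length])].append(start)

        for repeat, starts in occurrences.items():
            if len(starts) < 2:
                continue  # not a repeat
            # Left/right characters; None stands for "outside the sequence".
            left = {seq[p - 1] if p > 0 else None for p in starts}
            right = {seq[p + length] if p + length < n else None for p in starts}
            # Two different left (right) characters mean some (i, j) pair
            # differs, which is exactly left (right) maximality.
            if len(left) > 1 and len(right) > 1:
                found[repeat] = starts
    return found
```

On the string `xabcyabcz`, the only maximal repeat is `abc` at positions 1 and 5: the shorter repeat `ab` is always followed by `c`, so it is not right maximal, and `bc` is always preceded by `b`'s neighbor `a`... that is, by the same left character, so it is not left maximal.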

25
3. Pattern Validator
  • Suppose the occurrences of a maximal repeat α are
    ordered by position such that p1 < p2 < p3 < ... < pk,
    where pi denotes the position of each suffix in
    the encoded token sequence.
  • Characteristics of a pattern
  • Regularity: variance coefficient
  • Adjacency: density

26
Pattern Validator (Cont.)
  • Basic Screening
  • For each maximal repeat α, compute V(α) and D(α)
  • a) Check if the pattern's variance V(α) < 0.5
  • b) Check if the pattern's density 0.25 < D(α) <
    1.5
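The basic screening step might look like the sketch below. The slides give the thresholds but not the formulas, so the definitions here are assumptions: V is taken to be the coefficient of variation (population standard deviation divided by mean) of the gaps between adjacent occurrences, and D is taken to be (k - 1) * |α| / (pk - p1). The function names are hypothetical.

```python
from statistics import mean, pstdev


def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def density(positions, pattern_length):
    """Assumed D: (k - 1) * |pattern| / (p_k - p_1)."""
    span = positions[-1] - positions[0]
    return (len(positions) - 1) * pattern_length / span


def passes_basic_screening(positions, pattern_length):
    """positions: sorted, distinct start positions of one maximal repeat."""
    if len(positions) < 2:
        return False
    regular = variance_coefficient(positions) < 0.5
    adjacent = 0.25 < density(positions, pattern_length) < 1.5
    return regular and adjacent
```

Under these assumed definitions, a 7-token pattern occurring at positions 0, 7, 14, and 21 has V = 0 and D = 1.0 and passes, while a 2-token pattern at positions 0, 2, and 40 has V = 0.9 and D = 0.1 and is rejected on both counts.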

27
4. Rule Composer
  • Occurrence partition
  • Flexible variance threshold control
  • Multiple string alignment
  • Increase density of a pattern

28
Occurrence Partition
  • Problem
  • Some patterns are divided into several blocks
  • Ex: Lycos, Excite with large regularity
  • Solution
  • Clustering of the occurrences of such a pattern

[Flowchart: P -> Clustering -> V(P) < 0.1? No: Discard. Yes: Check density.]
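One way to picture occurrence partition is to split the sorted occurrences wherever the gap to the previous occurrence is unusually large, then keep only the blocks whose variance falls under the 0.1 threshold. The slides do not say which clustering method is used, so the gap rule below (split when a gap exceeds `gap_factor` times the median gap) is purely an assumption, as are the function names and the definition of V as the coefficient of variation of adjacent gaps.

```python
from statistics import mean, median, pstdev


def variance_coefficient(positions):
    """Assumed V: std. deviation of adjacent gaps divided by their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)


def partition_occurrences(positions, gap_factor=2.0):
    """Split sorted positions into blocks at unusually large gaps.

    Hypothetical rule: a gap larger than gap_factor * median gap
    starts a new block.
    """
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return [list(positions)]
    limit = gap_factor * median(gaps)
    blocks = [[positions[0]]]
    for position, gap in zip(positions[1:], gaps):
        if gap > limit:
            blocks.append([position])
        else:
            blocks[-1].append(position)
    return blocks


def regular_blocks(positions, max_variance=0.1, gap_factor=2.0):
    """Keep blocks with V < max_variance; these go on to the density check."""
    kept = []
    for block in partition_occurrences(positions, gap_factor):
        if len(block) >= 2 and variance_coefficient(block) < max_variance:
            kept.append(block)
    return kept
```

With occurrences at 0, 5, 10, 15, 100, 105, and 110, the pattern as a whole is irregular (V is well above 0.5), but splitting at the gap of 85 gives two blocks that are each perfectly regular, so both would proceed to the density check.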
29
Multiple String Alignment
  • Problem
  • Patterns with density less than 1 can extract
    only part of the information
  • Solution
  • Align k-1 substrings among the k occurrences
  • A natural generalization of alignment for two
    strings which can be solved in O(nm) by dynamic
    programming where n and m are string lengths.
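The two-string case mentioned here can be sketched with the standard dynamic-programming table, which takes O(nm) time and space. This is a generic global alignment with unit costs for mismatches and gaps, not necessarily the scoring IEPAD uses; the function name `align` and the tie-breaking order in the traceback are assumptions of the sketch.

```python
def align(a, b, gap="-"):
    """Globally align sequences a and b; return (row_a, row_b, cost)."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match/mismatch
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
            )

    # Trace back from the bottom-right corner to recover one alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1])
            row_b.append(gap)
            i -= 1
        else:
            row_a.append(gap)
            row_b.append(b[j - 1])
            j -= 1
    return row_a[::-1], row_b[::-1], cost[n][m]
```

Aligning the token strings `adcwbd` and `adcxb` this way pairs `w` with `x` and places a gap opposite the final `d`, for a total cost of 2. Multiple string alignment extends the same idea to k occurrences.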

30
Multiple String Alignment (Cont.)
  • Suppose adc is the discovered pattern for token
    string adcwbdadcxbadcxbdadcb
  • If we have the following multiple alignment for
    strings "adcwbd", "adcxb" and "adcxbd":
  • a d c w b d
  • a d c x b -
  • a d c x b d
  • The extraction pattern can be generalized as
    adc[w|x]b[d|-]
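Once the occurrences are aligned, the generalized pattern can be read off column by column: a column with a single symbol stays as is, and a column with several symbols becomes a set of alternatives. The sketch below assumes the rows are already aligned to equal length with `-` marking a gap; the function name and the bracket notation `[x|y]` for alternatives are choices made for this example.

```python
def generalize(rows, gap="-"):
    """Collapse equal-length aligned rows into one pattern string."""
    parts = []
    for column in zip(*rows):
        # Collect distinct symbols in order of first appearance.
        symbols = []
        for symbol in column:
            if symbol not in symbols:
                symbols.append(symbol)
        # List the gap last so optional positions read as [d|-].
        if gap in symbols:
            symbols.remove(gap)
            symbols.append(gap)
        if len(symbols) == 1:
            parts.append(symbols[0])
        else:
            parts.append("[" + "|".join(symbols) + "]")
    return "".join(parts)
```

For the three aligned rows `adcwbd`, `adcxb-`, and `adcxbd`, the fourth column holds `w` or `x` and the sixth holds `d` or a gap, so `generalize` returns `adc[w|x]b[d|-]`.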

31
Pattern Viewer
  • Java-application based GUI
  • Web based GUI
  • http://www.csie.ncu.edu.tw/chia/WebIEPAD/

32
The Extractor
  • Matching the pattern against the encoded token
    string
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
  • Alternatives in a rule
  • matching the longest pattern
  • What are extracted?
  • The whole record
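Matching a pattern against the encoded token string can be sketched with the Knuth-Morris-Pratt algorithm applied to token lists instead of characters. This covers only a fixed pattern; alternatives in a rule and longest-pattern selection are not handled here. The function names are hypothetical and this is not the IEPAD extractor itself.

```python
def failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find_all(tokens, pattern):
    """Return the start index of every occurrence of pattern in tokens."""
    if not pattern:
        return []
    table = failure_table(pattern)
    hits = []
    k = 0  # number of pattern tokens matched so far
    for i, token in enumerate(tokens):
        while k > 0 and token != pattern[k]:
            k = table[k - 1]
        if token == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]  # keep going so overlapping matches are found
    return hits
```

Searching two encoded rows for the seven-token record pattern `<B> _ </B> <I> _ </I> <BR>` returns positions 0 and 7, one per record. Each hit marks a whole record, from which the text tokens can then be pulled out.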

33
Experiment Setup
  • Fourteen sources: search engines
  • Performance measures
  • Number of patterns
  • Retrieval rate and Accuracy rate
  • Parameters
  • Encoding scheme
  • Thresholds control

34
# of Patterns Discovered Using Block-Level Encoding
  • Average 117 maximal repeats in our test Web pages

35
Translation
  • Average page length is 22.7KB

36
Accuracy and Retrieval Rate

37
Summary
  • IEPAD: Information Extraction based on Pattern
    Discovery
  • Rule generator
  • The extractor
  • Pattern viewer
  • Performance
  • 97% retrieval rate and 94% accuracy rate

38
Problems
  • Guarantees a high retrieval rate rather than a
    high accuracy rate
  • A generalized rule can extract more than the
    desired data
  • Currently only applicable when a Web page
    contains several records

39
References
  • TEXT IE
  • Riloff, E. (1996). Automatically Generating
    Extraction Patterns from Untagged Text. AAAI-96,
    pp. 1044-1049.
  • Riloff, E. (1999). Information Extraction as a
    Stepping Stone toward Story Understanding. In
    Computational Models of Reading and
    Understanding, A. Ram and K. Moorman (Eds.). The
    MIT Press.

40
References
  • Semi-structured IE
  • D.W. Embley, Y.S. Jiang, and W.-K. Ng,
    Record-Boundary Discovery in Web Documents,
    SIGMOD'99 Proceedings
  • C.H. Chang and S.C. Lui. IEPAD: Information
    Extraction based on Pattern Discovery, WWW10, pp.
    681-688, May 2-6, 2001, Hong Kong.
  • B. Chidlovskii, J. Ragetli, and M. de Rijke,
    Automatic Wrapper Generation for Web Search
    Engines, The 1st Intern. Conf. on Web-Age
    Information Management (WAIM'2000), Shanghai,
    China, June 2000