Title: Annotation-Free Information Extraction
1. Annotation-Free Information Extraction
- Chia-Hui Chang
- Department of Computer Science and Information Engineering
- National Central University
- chia_at_csie.ncu.edu.tw
- 10/4/2002
2. Introduction
- Text IE
- AutoSlog-TS
- Semi-structured IE
- IEPAD
3. AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text
- Ellen Riloff
- University of Utah
- AAAI-96
4. AutoSlog-TS
- AutoSlog-TS is an extension of AutoSlog.
- It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus.
- It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
- A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.
5. AutoSlog-TS Concept
6. Relevance Rate
- relevance rate = Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
- rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.
- total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.
- The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.
7. Rank function
- Next, we use a rank function to rank the patterns in order of importance to the domain:
- rank score = relevance rate × log2(frequency) (see the sketch below)
- So, a person only needs to review the most highly ranked patterns.
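- A minimal Python sketch of the relevance rate and rank function above; the pattern names and counts are invented for illustration, and using rel-freq as the frequency term in the rank score is an assumption of this sketch:

    from math import log2

    # pattern -> (rel_freq, total_freq); hypothetical counts
    pattern_stats = {
        "<subj> was murdered": (22, 24),
        "attack on <np>":      (16, 20),
        "<subj> said":         (30, 300),
    }

    def relevance_rate(rel_freq, total_freq):
        # Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
        return rel_freq / total_freq

    def rank_score(rel_freq, total_freq):
        # rank = relevance rate * log2(frequency); taking rel_freq as the
        # frequency term is an assumption of this sketch
        return relevance_rate(rel_freq, total_freq) * log2(rel_freq)

    for pattern, (rf, tf) in sorted(pattern_stats.items(),
                                    key=lambda kv: rank_score(*kv[1]),
                                    reverse=True):
        print(f"{pattern:22s} rate={relevance_rate(rf, tf):.2f} "
              f"score={rank_score(rf, tf):.2f}")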
8. Experimental Results: Setup
- We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.
- We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging correct output (MUC-4 Proceedings 1992).
- Training
9. Testing
- To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set (50 relevant texts and 50 irrelevant texts).
- We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
- Correct: the item matched against the answer keys.
- Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object.
- Duplicate: the item was coreferent with an item in the answer keys.
- Spurious: the item did not refer to any object in the answer keys.
- Missing: an item in the answer keys that was not extracted.
10. Experimental Results
- We scored three items: perpetrators, victims, and targets.
11. Experimental Results
- We calculated recall as correct / (correct + missing).
- We computed precision as (correct + duplicate) / (correct + duplicate + mislabeled + spurious).
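- The scoring arithmetic above, as a small Python sketch with made-up category counts:

    # hypothetical counts for the five scoring categories
    counts = {"correct": 80, "mislabeled": 5, "duplicate": 10,
              "spurious": 25, "missing": 40}

    recall = counts["correct"] / (counts["correct"] + counts["missing"])
    precision = (counts["correct"] + counts["duplicate"]) / (
        counts["correct"] + counts["duplicate"]
        + counts["mislabeled"] + counts["spurious"])

    print(f"recall={recall:.2f} precision={precision:.2f}")   # 0.67, 0.75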
12. Behind the scenes
- In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog, because it generates many good patterns that AutoSlog did not.
- AutoSlog-TS produced 158 patterns with a relevance rate ≥ 90% and frequency ≥ 5. Only 45 of these patterns were in the original AutoSlog dictionary.
- The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.
13. Future Directions
- A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.
- The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.
14. IEPAD: Information Extraction based on Pattern Discovery
- C.-H. Chang
- National Central University
- WWW10
15. Semi-structured Information Extraction
- Information Extraction (IE)
- Input: HTML pages
- Output: a set of records
16. Pattern Discovery based IE
- Motivation
- Display of multiple records often forms a repeated pattern.
- The occurrences of the pattern are spaced regularly and adjacently.
- Now the problem becomes...
- Find regular and adjacent repeats in a string.
17. IEPAD Architecture
18. The Pattern Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule Composer
19. Step 1: Web Page Translation
- Encoding of the HTML source
- Rule 1: Each tag is encoded as a token.
- Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore). (See the sketch below.)
- HTML example:
- <B>Congo</B><I>242</I><BR>
- <B>Egypt</B><I>20</I><BR>
- Encoded token string:
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
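- A rough Python sketch of the two translation rules; the regular expression and function name are illustrative rather than the IEPAD implementation:

    import re

    def encode(html):
        tokens = []
        for piece in re.split(r"(<[^>]+>)", html):
            if not piece.strip():
                continue
            if piece.startswith("<"):
                tokens.append(f"T({piece})")   # Rule 1: each tag is a token
            else:
                tokens.append("T(_)")          # Rule 2: text between tags -> TEXT
        return "".join(tokens)

    print(encode("<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"))
    # T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)...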
20. Various Encoding Schemes
21. Step 2: PAT Tree Construction
- PAT tree = binary suffix tree
- A Patricia tree constructed over all possible suffix strings of a text.
- Example encoding:
- T(<B>) = 000
- T(</B>) = 001
- T(<I>) = 010
- T(</I>) = 011
- T(<BR>) = 100
- T(_) = 110
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
- 000110001010110011100 000110001010110011100
22. The Constructed PAT Tree
23. Definition of Maximal Repeats
- Let α occur in S at positions p1, p2, p3, ..., pk.
- α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1].
- α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|].
- α is a maximal repeat if it is both left maximal and right maximal.
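- The definition can be checked directly with a brute-force Python sketch (the PAT tree makes this efficient; this version only makes the left/right conditions concrete):

    def occurrences(S, alpha):
        return [i for i in range(len(S) - len(alpha) + 1)
                if S[i:i + len(alpha)] == alpha]

    def is_maximal_repeat(S, alpha):
        pos = occurrences(S, alpha)
        if len(pos) < 2:
            return False
        # left (right) maximal: at least two occurrences differ in the
        # character just before (after) alpha; None marks the string boundary
        lefts  = {S[p - 1] if p > 0 else None for p in pos}
        rights = {S[p + len(alpha)] if p + len(alpha) < len(S) else None
                  for p in pos}
        return len(lefts) > 1 and len(rights) > 1

    print(is_maximal_repeat("adcwbdadcxbadcxbdadcb", "adc"))   # True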
24. Finding Maximal Repeats
- Definition
- Let us call the character S[pi-1] the left character of suffix pi.
- A node v is left diverse if at least two leaves in v's subtree have different left characters.
- Lemma
- The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
25. Step 3: Pattern Validator
- Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < ... < pk, where pi denotes the position of each suffix in the encoded token sequence.
- Characteristics of a pattern:
- Regularity: the variance coefficient V(α)
- Adjacency: the density D(α)
26. Pattern Validator (Cont.)
- Basic screening
- For each maximal repeat α, compute V(α) and D(α):
- a) check that the pattern's variance coefficient satisfies V(α) < 0.5
- b) check that the pattern's density satisfies 0.25 < D(α) < 1.5
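- A Python sketch of the screening step; the exact formulas for V and D are assumptions of this sketch (V: standard deviation of the gaps between adjacent occurrences divided by the mean gap; D: k·|pattern| divided by the span covered by the occurrences):

    from statistics import mean, pstdev

    def variance_coefficient(positions):
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        return pstdev(gaps) / mean(gaps)

    def density(positions, pattern_len):
        k = len(positions)
        return k * pattern_len / (positions[-1] - positions[0] + pattern_len)

    def passes_screening(positions, pattern_len):
        return (variance_coefficient(positions) < 0.5
                and 0.25 < density(positions, pattern_len) < 1.5)

    # perfectly regular, adjacent occurrences: V = 0, D = 1
    print(passes_screening([0, 7, 14, 21], pattern_len=7))   # True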
27. Step 4: Rule Composer
- Occurrence partition
- Flexible variance threshold control
- Multiple string alignment
- Increase density of a pattern
28. Occurrence Partition
- Problem
- Some patterns are divided into several blocks.
- Example: Lycos, Excite, where patterns have large regularity (variance)
- Solution
- Clustering of the occurrences of such a pattern (see the sketch below)
- [Flowchart: for each pattern P, cluster its occurrences; if a cluster satisfies V(P) < 0.1, check its density; otherwise discard it.]
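- A hedged Python sketch of the partition idea: split the occurrence list wherever a gap is much larger than the typical gap, then re-check V(P) < 0.1 per block. The splitting heuristic below is an illustration, not necessarily the paper's exact clustering method:

    from statistics import median

    def partition_occurrences(positions, factor=3):
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        cutoff = factor * median(gaps)
        blocks, current = [], [positions[0]]
        for prev, nxt in zip(positions, positions[1:]):
            if nxt - prev > cutoff:       # unusually large gap: start a new block
                blocks.append(current)
                current = []
            current.append(nxt)
        blocks.append(current)
        return blocks

    print(partition_occurrences([0, 7, 14, 200, 207, 214]))
    # [[0, 7, 14], [200, 207, 214]]; each block is then re-screened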
29. Multiple String Alignment
- Problem
- Patterns with density less than 1 can extract only part of the information.
- Solution
- Align the k-1 substrings among the k occurrences.
- A natural generalization of alignment for two strings, which can be solved in O(nm) by dynamic programming, where n and m are the string lengths.
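- A Python sketch of the O(nm) two-string alignment (edit-distance style dynamic programming, with '-' as the gap symbol and simplified unit costs):

    def align(s, t, gap="-"):
        n, m = len(s), len(t)
        # dp[i][j] = minimal cost of aligning s[:i] with t[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dp[i][0] = i
        for j in range(m + 1):
            dp[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = min(dp[i-1][j-1] + (s[i-1] != t[j-1]),  # match/mismatch
                               dp[i-1][j] + 1,                     # gap in t
                               dp[i][j-1] + 1)                     # gap in s
        # trace back to recover the two aligned strings
        a, b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (s[i-1] != t[j-1]):
                a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
            elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
                a.append(s[i-1]); b.append(gap); i -= 1
            else:
                a.append(gap); b.append(t[j-1]); j -= 1
        return "".join(reversed(a)), "".join(reversed(b))

    print(align("adcxb", "adcxbd"))   # ('adcxb-', 'adcxbd')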
30. Multiple String Alignment (Cont.)
- Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb".
- If we have the following multiple alignment for the strings "adcwbd", "adcxb" and "adcxbd":
- a d c w b d
- a d c x b -
- a d c x b d
- The extraction pattern can be generalized as "adc[w|x]b[d|-]".
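- The column-wise generalization can be sketched in Python as follows (the function name is illustrative):

    def generalize(aligned_rows):
        pattern = []
        for column in zip(*aligned_rows):
            symbols = list(dict.fromkeys(column))   # unique, order preserved
            # columns that agree stay fixed; the rest become alternatives
            pattern.append(symbols[0] if len(symbols) == 1
                           else "[" + "|".join(symbols) + "]")
        return "".join(pattern)

    print(generalize(["adcwbd", "adcxb-", "adcxbd"]))   # adc[w|x]b[d|-]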
31. Pattern Viewer
- Java-application based GUI
- Web-based GUI
- http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
32. The Extractor
- Matching the pattern against the encoded token string
- Knuth-Morris-Pratt's algorithm
- Boyer-Moore's algorithm
- Alternatives in a rule:
- matching the longest pattern (see the sketch below)
- What is extracted?
- The whole record
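- A small Python sketch of the matching step, under the assumption that a composed rule can be applied as a regular expression over the encoded token string with the longer alternative tried first:

    import re

    token_string = "adcwbdadcxbadcxbdadcb"
    # the generalized pattern adc[w|x]b[d|-], with '-' treated as optional;
    # alternatives are ordered so the longest match wins
    rule = re.compile(r"adc(?:w|x)b(?:d|)")

    for m in rule.finditer(token_string):
        print(m.start(), m.group())   # each match delimits one whole record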
33. Experiment Setup
- Fourteen sources (search engines)
- Performance measures
- Number of patterns
- Retrieval rate and accuracy rate
- Parameters
- Encoding scheme
- Threshold control
34. Number of Patterns Discovered Using Block-Level Encoding
- On average, 117 maximal repeats were found in our test Web pages.
35. Translation
- Average page length is 22.7 KB.
36. Accuracy and Retrieval Rate
37. Summary
- IEPAD: Information Extraction based on Pattern Discovery
- Rule generator
- The extractor
- Pattern viewer
- Performance
- 97% retrieval rate and 94% accuracy rate
38. Problems
- Guarantees a high retrieval rate rather than a high accuracy rate.
- The generalized rule can extract more than the desired data.
- Currently only applicable when there are several records in a Web page.
39. References
- Text IE
- Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of AAAI-96, pp. 1044-1049.
- Riloff, E. (1999). Information Extraction as a Stepping Stone toward Story Understanding. In Computational Models of Reading and Understanding, A. Ram and K. Moorman (Eds.), The MIT Press.
40. References
- Semi-structured IE
- Embley, D.W., Jiang, Y.S., and Ng, W.-K. Record-Boundary Discovery in Web Documents. In Proceedings of SIGMOD'99.
- Chang, C.-H. and Lui, S.-C. IEPAD: Information Extraction based on Pattern Discovery. In Proceedings of WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
- Chidlovskii, B., Ragetli, J., and de Rijke, M. Automatic Wrapper Generation for Web Search Engines. In Proceedings of the 1st International Conference on Web-Age Information Management (WAIM 2000), Shanghai, China, June 2000.