Title: Annotation-Free Information Extraction
1. Annotation-Free Information Extraction
- Chia-Hui Chang
- Department of Computer Science and Information Engineering
- National Central University
- chia_at_csie.ncu.edu.tw
- 10/4/2002
2. Introduction
- Text IE
- AutoSlog-TS
- Semi-structured IE
- IEPAD
3. AutoSlog-TS: Automatically Generating Extraction Patterns from Untagged Text
- Ellen Riloff
- University of Utah
- AAAI-96
4. AutoSlog-TS
- AutoSlog-TS is an extension of AutoSlog.
- It operates exhaustively by generating an extraction pattern for every noun phrase in the training corpus.
- It then evaluates the extraction patterns by processing the corpus a second time and generating relevance statistics for each pattern.
- A more significant difference is that AutoSlog-TS allows multiple rules to fire if more than one matches the context.
5. AutoSlog-TS Concept
6. Relevance Rate
- relevance rate = Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
- rel-freq_i: the number of instances of pattern_i that were activated in relevant texts.
- total-freq_i: the total number of instances of pattern_i that were activated in the training corpus.
- The motivation behind the conditional probability estimate is that domain-specific expressions will appear substantially more often in relevant texts than in irrelevant texts.
7. Rank function
- Next, we use a rank function to rank the patterns in order of importance to the domain:
- rank score = relevance rate × log2(frequency) (see the sketch below)
- So, a person only needs to review the most highly ranked patterns.
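- A minimal Python sketch of the relevance rate and rank function above; the pattern names and counts are invented for illustration, and using rel-freq as the frequency term in the rank score is an assumption of this sketch:

    from math import log2

    # pattern -> (rel_freq, total_freq); hypothetical counts
    pattern_stats = {
        "<subj> was murdered": (22, 24),
        "attack on <np>":      (16, 20),
        "<subj> said":         (30, 300),
    }

    def relevance_rate(rel_freq, total_freq):
        # Pr(relevant text | text contains pattern_i) = rel-freq_i / total-freq_i
        return rel_freq / total_freq

    def rank_score(rel_freq, total_freq):
        # rank = relevance rate * log2(frequency); taking rel_freq as the
        # frequency term is an assumption of this sketch
        return relevance_rate(rel_freq, total_freq) * log2(rel_freq)

    for pattern, (rf, tf) in sorted(pattern_stats.items(),
                                    key=lambda kv: rank_score(*kv[1]),
                                    reverse=True):
        print(f"{pattern:22s} rate={relevance_rate(rf, tf):.2f} "
              f"score={rank_score(rf, tf):.2f}")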
8. Experimental Results: Setup
- We evaluated AutoSlog and AutoSlog-TS by manually inspecting the performance of their dictionaries in the MUC-4 terrorism domain.
- We used the MUC-4 texts as input and the MUC-4 answer keys as the basis for judging correct output (MUC-4 Proceedings 1992).
- Training
9. Testing
- To evaluate the two dictionaries, we chose 100 blind texts from the MUC-4 test set (50 relevant texts and 50 irrelevant texts).
- We scored the output by assigning each extracted item to one of five categories: correct, mislabeled, duplicate, spurious, or missing.
- Correct: the item matched against the answer keys.
- Mislabeled: the item matched against the answer keys but was extracted as the wrong type of object.
- Duplicate: the item was coreferent with an item in the answer keys.
- Spurious: the item did not refer to any object in the answer keys.
- Missing: an item in the answer keys that was not extracted.
10. Experimental Results
- We scored three items: perpetrators, victims, and targets.
11. Experimental Results
- We calculated recall as correct / (correct + missing).
- We computed precision as (correct + duplicate) / (correct + duplicate + mislabeled + spurious).
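- The scoring arithmetic above, as a small Python sketch with made-up category counts:

    # hypothetical counts for the five scoring categories
    counts = {"correct": 80, "mislabeled": 5, "duplicate": 10,
              "spurious": 25, "missing": 40}

    recall = counts["correct"] / (counts["correct"] + counts["missing"])
    precision = (counts["correct"] + counts["duplicate"]) / (
        counts["correct"] + counts["duplicate"]
        + counts["mislabeled"] + counts["spurious"])

    print(f"recall={recall:.2f} precision={precision:.2f}")   # 0.67, 0.75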
12. Behind the scenes
- In fact, we have reason to believe that AutoSlog-TS is ultimately capable of producing better recall than AutoSlog, because it generates many good patterns that AutoSlog did not.
- AutoSlog-TS produced 158 patterns with a relevance rate ≥ 90% and frequency ≥ 5. Only 45 of these patterns were in the original AutoSlog dictionary.
- The higher precision demonstrated by AutoSlog-TS is probably a result of the relevance statistics.
13. Future Directions
- A potential problem with AutoSlog-TS is that there are undoubtedly many useful patterns buried deep in the ranked list, which cumulatively could have a substantial impact on performance.
- The precision of the extraction patterns could also be improved by adding semantic constraints and, in the long run, creating more complex extraction patterns.
14. IEPAD: Information Extraction based on Pattern Discovery
- C.-H. Chang
- National Central University
- WWW10
15. Semi-structured Information Extraction
- Information Extraction (IE)
- Input: HTML pages
- Output: a set of records
16. Pattern Discovery based IE
- Motivation
- Display of multiple records often forms a repeated pattern.
- The occurrences of the pattern are spaced regularly and adjacently.
- Now the problem becomes...
- Find regular and adjacent repeats in a string.
17. IEPAD Architecture
18. The Pattern Generator
- Translator
- PAT tree construction
- Pattern validator
- Rule Composer
19. Step 1: Web Page Translation
- Encoding of the HTML source
- Rule 1: Each tag is encoded as a token.
- Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore). (See the sketch below.)
- HTML example:
- <B>Congo</B><I>242</I><BR>
- <B>Egypt</B><I>20</I><BR>
- Encoded token string:
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
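- A rough Python sketch of the two translation rules; the regular expression and function name are illustrative rather than the IEPAD implementation:

    import re

    def encode(html):
        tokens = []
        for piece in re.split(r"(<[^>]+>)", html):
            if not piece.strip():
                continue
            if piece.startswith("<"):
                tokens.append(f"T({piece})")   # Rule 1: each tag is a token
            else:
                tokens.append("T(_)")          # Rule 2: text between tags -> TEXT
        return "".join(tokens)

    print(encode("<B>Congo</B><I>242</I><BR><B>Egypt</B><I>20</I><BR>"))
    # T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)...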
20. Various Encoding Schemes
21. Step 2: PAT Tree Construction
- PAT tree = binary suffix tree
- A Patricia tree constructed over all possible suffix strings of a text.
- Example encoding:
- T(<B>) = 000
- T(</B>) = 001
- T(<I>) = 010
- T(</I>) = 011
- T(<BR>) = 100
- T(_) = 110
- T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>) T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
- 000110001010110011100 000110001010110011100
22. The Constructed PAT Tree
23. Definition of Maximal Repeats
- Let α occur in S at positions p1, p2, p3, ..., pk.
- α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1].
- α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|].
- α is a maximal repeat if it is both left maximal and right maximal.
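- The definition can be checked directly with a brute-force Python sketch (the PAT tree makes this efficient; this version only makes the left/right conditions concrete):

    def occurrences(S, alpha):
        return [i for i in range(len(S) - len(alpha) + 1)
                if S[i:i + len(alpha)] == alpha]

    def is_maximal_repeat(S, alpha):
        pos = occurrences(S, alpha)
        if len(pos) < 2:
            return False
        # left (right) maximal: at least two occurrences differ in the
        # character just before (after) alpha; None marks the string boundary
        lefts  = {S[p - 1] if p > 0 else None for p in pos}
        rights = {S[p + len(alpha)] if p + len(alpha) < len(S) else None
                  for p in pos}
        return len(lefts) > 1 and len(rights) > 1

    print(is_maximal_repeat("adcwbdadcxbadcxbdadcb", "adc"))   # True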
24. Finding Maximal Repeats
- Definition
- Let us call the character S[pi-1] the left character of suffix pi.
- A node v is left diverse if at least two leaves in v's subtree have different left characters.
- Lemma
- The path label of an internal node v in a PAT tree is a maximal repeat if and only if v is left diverse.
25. Step 3: Pattern Validator
- Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < ... < pk, where pi denotes the position of each suffix in the encoded token sequence.
- Characteristics of a pattern:
- Regularity: the variance coefficient V(α)
- Adjacency: the density D(α)
26. Pattern Validator (Cont.)
- Basic screening
- For each maximal repeat α, compute V(α) and D(α):
- a) check that the pattern's variance coefficient satisfies V(α) < 0.5
- b) check that the pattern's density satisfies 0.25 < D(α) < 1.5
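- A Python sketch of the screening step; the exact formulas for V and D are assumptions of this sketch (V: standard deviation of the gaps between adjacent occurrences divided by the mean gap; D: k·|pattern| divided by the span covered by the occurrences):

    from statistics import mean, pstdev

    def variance_coefficient(positions):
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        return pstdev(gaps) / mean(gaps)

    def density(positions, pattern_len):
        k = len(positions)
        return k * pattern_len / (positions[-1] - positions[0] + pattern_len)

    def passes_screening(positions, pattern_len):
        return (variance_coefficient(positions) < 0.5
                and 0.25 < density(positions, pattern_len) < 1.5)

    # perfectly regular, adjacent occurrences: V = 0, D = 1
    print(passes_screening([0, 7, 14, 21], pattern_len=7))   # True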
27. Step 4: Rule Composer
- Occurrence partition
- Flexible variance threshold control
- Multiple string alignment
- Increase density of a pattern
28. Occurrence Partition
- Problem
- Some patterns are divided into several blocks.
- Example: Lycos, Excite, where patterns have large regularity (variance)
- Solution
- Clustering of the occurrences of such a pattern (see the sketch below)
- [Flowchart: for each pattern P, cluster its occurrences; if a cluster satisfies V(P) < 0.1, check its density; otherwise discard it.]
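- A hedged Python sketch of the partition idea: split the occurrence list wherever a gap is much larger than the typical gap, then re-check V(P) < 0.1 per block. The splitting heuristic below is an illustration, not necessarily the paper's exact clustering method:

    from statistics import median

    def partition_occurrences(positions, factor=3):
        gaps = [b - a for a, b in zip(positions, positions[1:])]
        cutoff = factor * median(gaps)
        blocks, current = [], [positions[0]]
        for prev, nxt in zip(positions, positions[1:]):
            if nxt - prev > cutoff:       # unusually large gap: start a new block
                blocks.append(current)
                current = []
            current.append(nxt)
        blocks.append(current)
        return blocks

    print(partition_occurrences([0, 7, 14, 200, 207, 214]))
    # [[0, 7, 14], [200, 207, 214]]; each block is then re-screened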
29. Multiple String Alignment
- Problem
- Patterns with density less than 1 can extract only part of the information.
- Solution
- Align the k-1 substrings among the k occurrences.
- A natural generalization of alignment for two strings, which can be solved in O(nm) by dynamic programming, where n and m are the string lengths.
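- A Python sketch of the O(nm) two-string alignment (edit-distance style dynamic programming, with '-' as the gap symbol and simplified unit costs):

    def align(s, t, gap="-"):
        n, m = len(s), len(t)
        # dp[i][j] = minimal cost of aligning s[:i] with t[:j]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            dp[i][0] = i
        for j in range(m + 1):
            dp[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dp[i][j] = min(dp[i-1][j-1] + (s[i-1] != t[j-1]),  # match/mismatch
                               dp[i-1][j] + 1,                     # gap in t
                               dp[i][j-1] + 1)                     # gap in s
        # trace back to recover the two aligned strings
        a, b, i, j = [], [], n, m
        while i > 0 or j > 0:
            if i > 0 and j > 0 and dp[i][j] == dp[i-1][j-1] + (s[i-1] != t[j-1]):
                a.append(s[i-1]); b.append(t[j-1]); i -= 1; j -= 1
            elif i > 0 and dp[i][j] == dp[i-1][j] + 1:
                a.append(s[i-1]); b.append(gap); i -= 1
            else:
                a.append(gap); b.append(t[j-1]); j -= 1
        return "".join(reversed(a)), "".join(reversed(b))

    print(align("adcxb", "adcxbd"))   # ('adcxb-', 'adcxbd')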
30. Multiple String Alignment (Cont.)
- Suppose "adc" is the discovered pattern for the token string "adcwbdadcxbadcxbdadcb".
- If we have the following multiple alignment for the strings "adcwbd", "adcxb" and "adcxbd":
- a d c w b d
- a d c x b -
- a d c x b d
- The extraction pattern can be generalized as "adc[w|x]b[d|-]".
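- The column-wise generalization can be sketched in Python as follows (the function name is illustrative):

    def generalize(aligned_rows):
        pattern = []
        for column in zip(*aligned_rows):
            symbols = list(dict.fromkeys(column))   # unique, order preserved
            # columns that agree stay fixed; the rest become alternatives
            pattern.append(symbols[0] if len(symbols) == 1
                           else "[" + "|".join(symbols) + "]")
        return "".join(pattern)

    print(generalize(["adcwbd", "adcxb-", "adcxbd"]))   # adc[w|x]b[d|-]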
31. Pattern Viewer
- Java-application based GUI
- Web-based GUI
- http://www.csie.ncu.edu.tw/~chia/WebIEPAD/
32. The Extractor
- Matching the pattern against the encoded token string
- Knuth-Morris-Pratt's algorithm
- Boyer-Moore's algorithm
- Alternatives in a rule:
- matching the longest pattern (see the sketch below)
- What is extracted?
- The whole record
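- A small Python sketch of the matching step, under the assumption that a composed rule can be applied as a regular expression over the encoded token string with the longer alternative tried first:

    import re

    token_string = "adcwbdadcxbadcxbdadcb"
    # the generalized pattern adc[w|x]b[d|-], with '-' treated as optional;
    # alternatives are ordered so the longest match wins
    rule = re.compile(r"adc(?:w|x)b(?:d|)")

    for m in rule.finditer(token_string):
        print(m.start(), m.group())   # each match delimits one whole record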
33. Experiment Setup
- Fourteen sources (search engines)
- Performance measures
- Number of patterns
- Retrieval rate and accuracy rate
- Parameters
- Encoding scheme
- Threshold control
34. Number of Patterns Discovered Using Block-Level Encoding
- On average, 117 maximal repeats were found in our test Web pages.
35. Translation
- Average page length is 22.7 KB.
36. Accuracy and Retrieval Rate
37. Summary
- IEPAD: Information Extraction based on Pattern Discovery
- Rule generator
- The extractor
- Pattern viewer
- Performance
- 97% retrieval rate and 94% accuracy rate
38. Problems
- Guarantees a high retrieval rate rather than a high accuracy rate.
- The generalized rule can extract more than the desired data.
- Currently only applicable when there are several records in a Web page.
39. References
- Text IE
- Riloff, E. (1996). Automatically Generating Extraction Patterns from Untagged Text. In Proceedings of AAAI-96, pp. 1044-1049.
- Riloff, E. (1999). Information Extraction as a Stepping Stone toward Story Understanding. In Computational Models of Reading and Understanding, A. Ram and K. Moorman (Eds.), The MIT Press.
40. References
- Semi-structured IE
- Embley, D.W., Jiang, Y.S., and Ng, W.-K. Record-Boundary Discovery in Web Documents. In Proceedings of SIGMOD'99.
- Chang, C.-H. and Lui, S.-C. IEPAD: Information Extraction based on Pattern Discovery. In Proceedings of WWW10, pp. 681-688, May 2-6, 2001, Hong Kong.
- Chidlovskii, B., Ragetli, J., and de Rijke, M. Automatic Wrapper Generation for Web Search Engines. In Proceedings of the 1st International Conference on Web-Age Information Management (WAIM 2000), Shanghai, China, June 2000.