Title: Ad Hoc Data and the Token Ambiguity Problem
1 Ad Hoc Data and the Token Ambiguity Problem
- Qian Xi, Kathleen Fisher, David Walker, Kenny Zhu
- 2009/1/19
Princeton University, AT&T Labs Research
2 Ad Hoc Data
- Standardized data formats: HTML, XML
  - Many data processing tools: visualizers (HTML browsers), XQuery
- Ad hoc data: non-standard, semi-structured
  - Not many data processing tools
  - Examples: web server logs (CLF), phone call provisioning data
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
[Sample phone call provisioning record: long runs of numeric IDs and codes such as 9152272, 2813640092, no_ii, EDTF_60, MARVINS1, UNO10, 1000295291; field delimiters lost in extraction.]
3 learnPADS Goal
- Automatically generates a description of the format
- Automatically generates a suite of data processing tools
Punion payload {
    Pint32 i;
    PstringFW(3) s2;
};
Pstruct source {
    payload p1;
    ',';
    payload p2;
};

Example data: 0,24  bar,end  foo,16
Output: a declarative description (above), from which tools such as an XML converter, grapher, etc. are generated.
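To make the inferred description concrete, here is a minimal Python sketch of what it accepts. parse_payload and parse_source are hypothetical names that hand-code the union/struct semantics above; the real PADS compiler generates such parsers automatically.

def parse_payload(field):
    """payload = Pint32 | PstringFW(3): try the int branch first, then a 3-char string."""
    try:
        return ("Pint32", int(field))
    except ValueError:
        if len(field) == 3:
            return ("PstringFW(3)", field)
        raise ValueError("no branch of payload matches %r" % field)

def parse_source(line):
    """source = payload ',' payload."""
    p1, p2 = line.split(",", 1)
    return (parse_payload(p1), parse_payload(p2))

for line in ["0,24", "bar,end", "foo,16"]:
    print(parse_source(line))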
4 learnPADS Architecture
[Architecture diagram: Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
5 learnPADS Framework
[Worked example:
 Chunking & tokenization: the records "0,24", "bar,end", "foo,bag", "0,56", "cat,name" become the token sequences (int , int), (str , str), (str , str), (int , int), (str , str).
 Structure discovery: infers a candidate tree, a struct whose children are union { INT; STR }, the literal ',', and union { INT; STR }.
 Format refinement: rewrites that tree into a simpler, more precise description.]
6 Token Ambiguity Problem (TAP)
Given a string, there are multiple ways to tokenize it. For example:
- Message
- Word White Word White Word White ... White URL
- Word White Quote Filepath Quote White Word White ...
- Old learnPADS:
  - the user defines a set of base tokens with a fixed order
  - take the first, longest match
- New solution: probabilistic tokenization
  - use probabilistic models to find the most likely token sequences (see the sketch below)
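As a sketch of why this is a problem, the snippet below enumerates every tokenization of a string under a small token set. The patterns here (Word, Filepath, Punct, White) are illustrative, not the actual learnPADS base-token library; even four definitions yield several analyses of one fragment, and the number grows exponentially in general.

import re

TOKENS = [
    ("Word", r"[A-Za-z]+"),
    ("Filepath", r"/[A-Za-z0-9/.]+"),
    ("Punct", r"[/.]"),
    ("White", r" "),
]

def tokenizations(s, pos=0):
    """Yield every way to cover s with matches of TOKENS."""
    if pos == len(s):
        yield []
        return
    for name, pat in TOKENS:
        m = re.match(pat, s[pos:])
        if m:
            for rest in tokenizations(s, pos + m.end()):
                yield [(name, m.group())] + rest

for seq in tokenizations("get /tk/p.txt"):
    print(seq)  # three distinct analyses: Filepath vs. Punct/Word splits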
7 Probabilistic Graphical Models
[Bayesian network example: earthquake, burglar, alarm, parent comes home.]
Node = random variable; edge = probabilistic relationship.
8 Hidden Markov Model (HMM)
- Observation/character C_i
  - character features: upper/lower case, digit, punctuation, ...
- Hidden state/pseudo-token T_i
- Maximize the probability P(token sequence | character sequence)
[Figure: the input string "foo,16" with its surrounding quote characters, eight characters in all. Each character is emitted by a pseudo-token, giving the pseudo-token sequence Quote Word Word Word Comma Int Int Quote, which collapses into the token sequence Quote Word Comma Int Quote.]
transition probability P(T_i | T_{i-1})
emission probability P(C_i | T_i)
9 Hidden Markov Model Formula
P(T_1 ... T_n | C_1 ... C_n)  ∝  P(T_1) · ∏_{i=2}^{n} P(T_i | T_{i-1}) · ∏_{i=1}^{n} P(C_i | T_i)

- P(T_1 ... T_n | C_1 ... C_n): the probability of the token sequence given the character sequence
- P(T_1): the probability that token T_1 comes first
- P(T_i | T_{i-1}): the transition probability that token T_i follows T_{i-1}, for all i
- P(C_i | T_i): the emission probability that we see character C_i given token T_i, for all i
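A minimal sketch of decoding the most likely pseudo-token sequence under this formula with the standard Viterbi algorithm. The states, transition table, and feature-based emissions below are toy values invented for illustration (the Quote state is omitted), not the learned learnPADS parameters.

import math

STATES = ["Word", "Comma", "Int"]
START = {"Word": 0.6, "Comma": 0.1, "Int": 0.3}
TRANS = {
    "Word":  {"Word": 0.8,  "Comma": 0.15, "Int": 0.05},
    "Comma": {"Word": 0.45, "Comma": 0.1,  "Int": 0.45},
    "Int":   {"Word": 0.05, "Comma": 0.15, "Int": 0.8},
}

def emit(state, ch):
    """P(C_i | T_i), driven by simple character features (letter/digit/punct)."""
    if state == "Word":
        return 0.9 if ch.isalpha() else 0.05
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    return 0.9 if ch == "," else 0.05

def viterbi(chars):
    # V[i][s] = best log-probability of any state path ending in s at position i
    V = [{s: math.log(START[s] * emit(s, chars[0])) for s in STATES}]
    back = []
    for ch in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            row[s] = V[-1][prev] + math.log(TRANS[prev][s] * emit(s, ch))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(V[-1], key=V[-1].get)]  # best final state
    for ptr in reversed(back):          # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("foo,16"))  # ['Word', 'Word', 'Word', 'Comma', 'Int', 'Int']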
10 Hidden Markov Model Parameters
- transition probability P(T_i | T_{i-1})
- emission probability P(C_i | T_i)
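A sketch of estimating these parameters from hand-labeled character sequences by relative-frequency counting. The tiny training set and the add-one smoothing are assumptions for illustration, not the talk's actual training setup.

from collections import Counter

labeled = [
    [("f", "Word"), ("o", "Word"), ("o", "Word"),
     (",", "Comma"), ("1", "Int"), ("6", "Int")],
]

trans_n, emit_n, state_n, prev_n = Counter(), Counter(), Counter(), Counter()
for seq in labeled:
    for (ch, state), (_, prev) in zip(seq[1:], seq):  # consecutive pairs
        trans_n[(prev, state)] += 1
        prev_n[prev] += 1
    for ch, state in seq:
        emit_n[(state, ch)] += 1
        state_n[state] += 1

def p_trans(prev, state, n_states=3):
    return (trans_n[(prev, state)] + 1) / (prev_n[prev] + n_states)

def p_emit(state, ch, n_chars=128):
    return (emit_n[(state, ch)] + 1) / (state_n[state] + n_chars)

print(p_trans("Word", "Word"))  # (2+1)/(3+3) = 0.5
print(p_emit("Int", "6"))       # (1+1)/(2+128)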
11 Hierarchical Models
[Figure: the same input "foo,16" analyzed hierarchically. The upper level is a Markov chain over whole tokens (Quote Word Comma Int Quote); the lower level models the characters inside each token with per-token Maximum Entropy or Support Vector Machine models.]
12 Three Probabilistic Tokenizers
- Character-by-character Hidden Markov Model (HMM)
  - One pseudo-token depends only on the previous one.
- Hierarchical Maximum Entropy Model (HMEM)
  - The upper level models the transition probabilities.
  - The lower level builds Maximum Entropy models for individual tokens.
- Hierarchical Support Vector Machines (HSVM)
  - Same as HMEM, except that the lower level builds Support Vector Machine models for individual tokens.
(A sketch of the hierarchical scoring follows below.)
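A sketch of the hierarchical idea under simplifying assumptions: dynamic programming over segmentations of the string, where the upper level charges a (here flat) transition probability between tokens and the lower level scores each candidate token. token_model is a trivial stand-in; a real system plugs in the per-token Maximum Entropy or SVM classifiers.

import math

TYPES = ["Int", "Word", "Comma"]
TRANS = 0.2  # flat upper-level transition probability (an assumption)

def token_model(tok_type, text):
    """Stand-in for P(text | tok_type)."""
    if tok_type == "Int":
        return 0.9 if text.isdigit() else 1e-6
    if tok_type == "Word":
        return 0.9 if text.isalpha() else 1e-6
    return 0.9 if text == "," else 1e-6

def best_segmentation(s, max_len=10):
    # best[i] = (log-score, analysis) of the best segmentation of s[:i]
    best = {0: (0.0, [])}
    for i in range(1, len(s) + 1):
        cands = []
        for j in range(max(0, i - max_len), i):
            for t in TYPES:
                p = token_model(t, s[j:i]) * TRANS
                cands.append((best[j][0] + math.log(p),
                              best[j][1] + [(t, s[j:i])]))
        best[i] = max(cands)
    return best[len(s)][1]

print(best_segmentation("foo,16"))  # [('Word', 'foo'), ('Comma', ','), ('Int', '16')]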
13 Tokenization by the Old learnPADS, HMM, and HMEM
Input: Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed (ipc/send) invalid destination port

Old learnPADS:
date[Sat Jun 24] white time[06:38:46] white int[2006] white string[crashreporterd] char[[] int[120] char[]] char[:] white string[mach_msg] char[(] char[)] white string[reply] white string[failed] white char[(] string[ipc] char[/] string[send] char[)] white string[invalid] white string[destination] white string[port]

HMM:
word[Sat] white word[Jun] white int[24] white time[06:38:46] white int[2006] white word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation message[(ipc/send) invalid destination port]

HMEM:
date[Sat Jun 24] white time[06:38:46] white int[2006] white word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation message[(ipc/send) invalid destination port]
14 Test Data Sources
15 Evaluation 1: Tokenization Accuracy
Token error rate: fraction of misidentified tokens. Token boundary error rate: fraction of misidentified token boundaries.

Example: input string "qian Jan/19/09"
  ideal token sequence:    id white date
  inferred token sequence: id white filepath
  token error rate: 1/3; token boundary error rate: 0/3
(see the sketch below)
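A sketch of the two metrics on this example. It assumes the ideal and inferred sequences have the same length, which holds here; in general the real evaluation would need to align the two sequences first.

ideal    = [("id", "qian"), ("white", " "), ("date", "Jan/19/09")]
inferred = [("id", "qian"), ("white", " "), ("filepath", "Jan/19/09")]

def token_error_rate(ideal, inferred):
    """Fraction of positions whose (type, text) pair was misidentified."""
    wrong = sum(1 for a, b in zip(ideal, inferred) if a != b)
    return wrong / len(ideal)

def boundary_error_rate(ideal, inferred):
    """Fraction of token boundaries placed differently, regardless of type."""
    def boundaries(seq):
        cuts, pos = set(), 0
        for _, text in seq:
            pos += len(text)
            cuts.add(pos)
        return cuts
    wrong = len(boundaries(ideal) ^ boundaries(inferred))
    return wrong / len(boundaries(ideal))

print(token_error_rate(ideal, inferred))     # 1/3
print(boundary_error_rate(ideal, inferred))  # 0/3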
16 Evaluation 1: Tokenization Accuracy
[Chart: token and boundary error rates; PT = probabilistic tokenization; 20 test data sources.]
17 Evaluation 2: Type and Data Costs
[Chart: type and data costs; PT = probabilistic tokenization; 20 test data sources.]
Type cost: cost in bits of transmitting the description.
Data cost: cost in bits of transmitting the data given the description.
18 Evaluation 3: Execution Time
- The old learnPADS system takes 10 seconds to 25 minutes.
- The new system using probabilistic tokenization takes a few seconds to several hours:
  - extra time to find all possible token sequences
  - extra time to find the most likely token sequences
- Fastest: Hidden Markov Model
- Most time-consuming: Hierarchical Support Vector Machines
19 Related Work
- Grammar induction: structure discovery without the token ambiguity problem
  - Arasu & Garcia-Molina '03: extracting structure from web pages
  - Garofalakis et al. '00: XTRACT for inferring DTDs
  - Kushmerick et al. '97: wrapper induction
- Detecting table components row by row with Hidden Markov Models / Conditional Random Fields: Pinto et al. '03
- Extracting certain fields in records from text: Borkar et al. '01
- Predicting exons and introns in DNA sequences using generalized HMMs: Kulp '96
- Part-of-speech tagging in natural language processing: Heeman '99 (decision trees)
- Speech recognition: Rabiner '89
20 Contributions
- Identify the Token Ambiguity Problem and take initial steps towards solving it with statistical models: use all possible token sequences.
- Integrate three statistical approaches into the learnPADS framework:
  - Hidden Markov Model
  - Hierarchical Maximum Entropy Model
  - Hierarchical Support Vector Machines Model
- Evaluate correctness and performance by a number of measures.
- Results show that multiple token sequences and statistical methods achieve partial success.
21 End
22 Future Work
- How to make use of vertical information
  - one record is not independent of others
  - key alignment
  - Conditional Random Fields
- Online learning
  - old description + new data → new description
23 Evaluation 3: Qualitative Comparison
[Chart: descriptions scored on a scale from -2 to 2, where 0 is optimal. At one extreme the description is too general and loses much useful information; at the other it is too verbose and the structure is unclear.]