Title: Ad Hoc Data and the Token Ambiguity Problem
1 Ad Hoc Data and the Token Ambiguity Problem
- Qian Xi, Kathleen Fisher, David Walker, Kenny Zhu
- 2009/1/19
Princeton University, AT&T Labs Research
2 Ad Hoc Data
- Standardized data formats: HTML, XML
  - Many data processing tools: visualizers (HTML browsers), XQuery
- Ad hoc data: non-standard, semi-structured
  - Not many data processing tools
  - Examples: web server logs (CLF), phone call provisioning data
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
[Sample phone call provisioning record: long runs of numeric IDs and codes such as 9152272, 2813640092, no_ii, EDTF_60, MARVINS1, UNO10, 1000295291; field delimiters lost in extraction.]
3 learnPADS Goal
- Automatically generates a description of the format
- Automatically generates a suite of data processing tools
Punion payload {
    Pint32 i;
    PstringFW(3) s2;
};
Pstruct source {
    payload p1;
    ',';
    payload p2;
};

Example data: 0,24  bar,end  foo,16
Output: a declarative description (above), from which tools such as an XML converter, grapher, etc. are generated.
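To make the inferred description concrete, here is a minimal Python sketch of what it accepts. parse_payload and parse_source are hypothetical names that hand-code the union/struct semantics above; the real PADS compiler generates such parsers automatically.

def parse_payload(field):
    """payload = Pint32 | PstringFW(3): try the int branch first, then a 3-char string."""
    try:
        return ("Pint32", int(field))
    except ValueError:
        if len(field) == 3:
            return ("PstringFW(3)", field)
        raise ValueError("no branch of payload matches %r" % field)

def parse_source(line):
    """source = payload ',' payload."""
    p1, p2 = line.split(",", 1)
    return (parse_payload(p1), parse_payload(p2))

for line in ["0,24", "bar,end", "foo,16"]:
    print(parse_source(line))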
4 learnPADS Architecture
[Architecture diagram: Raw Data → Format Inference Engine (Chunking & Tokenization → Structure Discovery → Format Refinement) → Data Description → PADS Compiler → generated tools (XML converter, Profiler, ...)]
5 learnPADS Framework
[Worked example:
 Chunking & tokenization: the records "0,24", "bar,end", "foo,bag", "0,56", "cat,name" become the token sequences (int , int), (str , str), (str , str), (int , int), (str , str).
 Structure discovery: infers a candidate tree, a struct whose children are union { INT; STR }, the literal ',', and union { INT; STR }.
 Format refinement: rewrites that tree into a simpler, more precise description.]
6 Token Ambiguity Problem (TAP)
Given a string, there are multiple ways to tokenize it. For example:
- Message
- Word White Word White Word White ... White URL
- Word White Quote Filepath Quote White Word White ...
- Old learnPADS:
  - the user defines a set of base tokens with a fixed order
  - take the first, longest match
- New solution: probabilistic tokenization
  - use probabilistic models to find the most likely token sequences (see the sketch below)
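As a sketch of why this is a problem, the snippet below enumerates every tokenization of a string under a small token set. The patterns here (Word, Filepath, Punct, White) are illustrative, not the actual learnPADS base-token library; even four definitions yield several analyses of one fragment, and the number grows exponentially in general.

import re

TOKENS = [
    ("Word", r"[A-Za-z]+"),
    ("Filepath", r"/[A-Za-z0-9/.]+"),
    ("Punct", r"[/.]"),
    ("White", r" "),
]

def tokenizations(s, pos=0):
    """Yield every way to cover s with matches of TOKENS."""
    if pos == len(s):
        yield []
        return
    for name, pat in TOKENS:
        m = re.match(pat, s[pos:])
        if m:
            for rest in tokenizations(s, pos + m.end()):
                yield [(name, m.group())] + rest

for seq in tokenizations("get /tk/p.txt"):
    print(seq)  # three distinct analyses: Filepath vs. Punct/Word splits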
7 Probabilistic Graphical Models
[Bayesian network example: earthquake, burglar, alarm, parent comes home.]
Node = random variable; edge = probabilistic relationship.
8 Hidden Markov Model (HMM)
- Observation/character C_i
  - character features: upper/lower case, digit, punctuation, ...
- Hidden state/pseudo-token T_i
- Maximize the probability P(token sequence | character sequence)
[Figure: the input string "foo,16" with its surrounding quote characters, eight characters in all. Each character is emitted by a pseudo-token, giving the pseudo-token sequence Quote Word Word Word Comma Int Int Quote, which collapses into the token sequence Quote Word Comma Int Quote.]
transition probability P(T_i | T_{i-1})
emission probability P(C_i | T_i)
9 Hidden Markov Model Formula
P(T_1 ... T_n | C_1 ... C_n)  ∝  P(T_1) · ∏_{i=2}^{n} P(T_i | T_{i-1}) · ∏_{i=1}^{n} P(C_i | T_i)

- P(T_1 ... T_n | C_1 ... C_n): the probability of the token sequence given the character sequence
- P(T_1): the probability that token T_1 comes first
- P(T_i | T_{i-1}): the transition probability that token T_i follows T_{i-1}, for all i
- P(C_i | T_i): the emission probability that we see character C_i given token T_i, for all i
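A minimal sketch of decoding the most likely pseudo-token sequence under this formula with the standard Viterbi algorithm. The states, transition table, and feature-based emissions below are toy values invented for illustration (the Quote state is omitted), not the learned learnPADS parameters.

import math

STATES = ["Word", "Comma", "Int"]
START = {"Word": 0.6, "Comma": 0.1, "Int": 0.3}
TRANS = {
    "Word":  {"Word": 0.8,  "Comma": 0.15, "Int": 0.05},
    "Comma": {"Word": 0.45, "Comma": 0.1,  "Int": 0.45},
    "Int":   {"Word": 0.05, "Comma": 0.15, "Int": 0.8},
}

def emit(state, ch):
    """P(C_i | T_i), driven by simple character features (letter/digit/punct)."""
    if state == "Word":
        return 0.9 if ch.isalpha() else 0.05
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    return 0.9 if ch == "," else 0.05

def viterbi(chars):
    # V[i][s] = best log-probability of any state path ending in s at position i
    V = [{s: math.log(START[s] * emit(s, chars[0])) for s in STATES}]
    back = []
    for ch in chars[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: V[-1][p] + math.log(TRANS[p][s]))
            row[s] = V[-1][prev] + math.log(TRANS[prev][s] * emit(s, ch))
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    path = [max(V[-1], key=V[-1].get)]  # best final state
    for ptr in reversed(back):          # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi("foo,16"))  # ['Word', 'Word', 'Word', 'Comma', 'Int', 'Int']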
10 Hidden Markov Model Parameters
- transition probability P(T_i | T_{i-1})
- emission probability P(C_i | T_i)
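A sketch of estimating these parameters from hand-labeled character sequences by relative-frequency counting. The tiny training set and the add-one smoothing are assumptions for illustration, not the talk's actual training setup.

from collections import Counter

labeled = [
    [("f", "Word"), ("o", "Word"), ("o", "Word"),
     (",", "Comma"), ("1", "Int"), ("6", "Int")],
]

trans_n, emit_n, state_n, prev_n = Counter(), Counter(), Counter(), Counter()
for seq in labeled:
    for (ch, state), (_, prev) in zip(seq[1:], seq):  # consecutive pairs
        trans_n[(prev, state)] += 1
        prev_n[prev] += 1
    for ch, state in seq:
        emit_n[(state, ch)] += 1
        state_n[state] += 1

def p_trans(prev, state, n_states=3):
    return (trans_n[(prev, state)] + 1) / (prev_n[prev] + n_states)

def p_emit(state, ch, n_chars=128):
    return (emit_n[(state, ch)] + 1) / (state_n[state] + n_chars)

print(p_trans("Word", "Word"))  # (2+1)/(3+3) = 0.5
print(p_emit("Int", "6"))       # (1+1)/(2+128)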
11 Hierarchical Models
[Figure: the same input "foo,16" analyzed hierarchically. The upper level is a Markov chain over whole tokens (Quote Word Comma Int Quote); the lower level models the characters inside each token with per-token Maximum Entropy or Support Vector Machine models.]
12 Three Probabilistic Tokenizers
- Character-by-character Hidden Markov Model (HMM)
  - One pseudo-token depends only on the previous one.
- Hierarchical Maximum Entropy Model (HMEM)
  - The upper level models the transition probabilities.
  - The lower level builds Maximum Entropy models for individual tokens.
- Hierarchical Support Vector Machines (HSVM)
  - Same as HMEM, except that the lower level builds Support Vector Machine models for individual tokens.
(A sketch of the hierarchical scoring follows below.)
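A sketch of the hierarchical idea under simplifying assumptions: dynamic programming over segmentations of the string, where the upper level charges a (here flat) transition probability between tokens and the lower level scores each candidate token. token_model is a trivial stand-in; a real system plugs in the per-token Maximum Entropy or SVM classifiers.

import math

TYPES = ["Int", "Word", "Comma"]
TRANS = 0.2  # flat upper-level transition probability (an assumption)

def token_model(tok_type, text):
    """Stand-in for P(text | tok_type)."""
    if tok_type == "Int":
        return 0.9 if text.isdigit() else 1e-6
    if tok_type == "Word":
        return 0.9 if text.isalpha() else 1e-6
    return 0.9 if text == "," else 1e-6

def best_segmentation(s, max_len=10):
    # best[i] = (log-score, analysis) of the best segmentation of s[:i]
    best = {0: (0.0, [])}
    for i in range(1, len(s) + 1):
        cands = []
        for j in range(max(0, i - max_len), i):
            for t in TYPES:
                p = token_model(t, s[j:i]) * TRANS
                cands.append((best[j][0] + math.log(p),
                              best[j][1] + [(t, s[j:i])]))
        best[i] = max(cands)
    return best[len(s)][1]

print(best_segmentation("foo,16"))  # [('Word', 'foo'), ('Comma', ','), ('Int', '16')]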
13 Tokenization by the Old learnPADS, HMM, and HMEM
Input: Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed (ipc/send) invalid destination port

Old learnPADS:
date[Sat Jun 24] white time[06:38:46] white int[2006] white string[crashreporterd] char[[] int[120] char[]] char[:] white string[mach_msg] char[(] char[)] white string[reply] white string[failed] white char[(] string[ipc] char[/] string[send] char[)] white string[invalid] white string[destination] white string[port]

HMM:
word[Sat] white word[Jun] white int[24] white time[06:38:46] white int[2006] white word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation message[(ipc/send) invalid destination port]

HMEM:
date[Sat Jun 24] white time[06:38:46] white int[2006] white word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:] message[mach_msg() reply failed] punctuation message[(ipc/send) invalid destination port]
14 Test Data Sources
15 Evaluation 1: Tokenization Accuracy
Token error rate: fraction of misidentified tokens. Token boundary error rate: fraction of misidentified token boundaries.

Example: input string "qian Jan/19/09"
  ideal token sequence:    id white date
  inferred token sequence: id white filepath
  token error rate: 1/3; token boundary error rate: 0/3
(see the sketch below)
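A sketch of the two metrics on this example. It assumes the ideal and inferred sequences have the same length, which holds here; in general the real evaluation would need to align the two sequences first.

ideal    = [("id", "qian"), ("white", " "), ("date", "Jan/19/09")]
inferred = [("id", "qian"), ("white", " "), ("filepath", "Jan/19/09")]

def token_error_rate(ideal, inferred):
    """Fraction of positions whose (type, text) pair was misidentified."""
    wrong = sum(1 for a, b in zip(ideal, inferred) if a != b)
    return wrong / len(ideal)

def boundary_error_rate(ideal, inferred):
    """Fraction of token boundaries placed differently, regardless of type."""
    def boundaries(seq):
        cuts, pos = set(), 0
        for _, text in seq:
            pos += len(text)
            cuts.add(pos)
        return cuts
    wrong = len(boundaries(ideal) ^ boundaries(inferred))
    return wrong / len(boundaries(ideal))

print(token_error_rate(ideal, inferred))     # 1/3
print(boundary_error_rate(ideal, inferred))  # 0/3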
16 Evaluation 1: Tokenization Accuracy
[Chart: token and boundary error rates; PT = probabilistic tokenization; 20 test data sources.]
17 Evaluation 2: Type and Data Costs
[Chart: type and data costs; PT = probabilistic tokenization; 20 test data sources.]
Type cost: cost in bits of transmitting the description.
Data cost: cost in bits of transmitting the data given the description.
18 Evaluation 3: Execution Time
- The old learnPADS system takes 10 seconds to 25 minutes.
- The new system using probabilistic tokenization takes a few seconds to several hours:
  - extra time to find all possible token sequences
  - extra time to find the most likely token sequences
- Fastest: Hidden Markov Model
- Most time-consuming: Hierarchical Support Vector Machines
19 Related Work
- Grammar induction: structure discovery without the token ambiguity problem
  - Arasu & Garcia-Molina '03: extracting structure from web pages
  - Garofalakis et al. '00: XTRACT for inferring DTDs
  - Kushmerick et al. '97: wrapper induction
- Detecting table components row by row with Hidden Markov Models / Conditional Random Fields: Pinto et al. '03
- Extracting certain fields in records from text: Borkar et al. '01
- Predicting exons and introns in DNA sequences using generalized HMMs: Kulp '96
- Part-of-speech tagging in natural language processing: Heeman '99 (decision trees)
- Speech recognition: Rabiner '89
20 Contributions
- Identify the Token Ambiguity Problem and take initial steps towards solving it with statistical models: use all possible token sequences.
- Integrate three statistical approaches into the learnPADS framework:
  - Hidden Markov Model
  - Hierarchical Maximum Entropy Model
  - Hierarchical Support Vector Machines Model
- Evaluate correctness and performance by a number of measures.
- Results show that multiple token sequences and statistical methods achieve partial success.
21 End
22 Future Work
- How to make use of vertical information
  - one record is not independent of others
  - key alignment
  - Conditional Random Fields
- Online learning
  - old description + new data → new description
23 Evaluation 3: Qualitative Comparison
[Chart: descriptions scored on a scale from -2 to 2, where 0 is optimal. At one extreme the description is too general and loses much useful information; at the other it is too verbose and the structure is unclear.]