Title: Mining Reference Tables for Automatic Text Segmentation
Slide 1: Mining Reference Tables for Automatic Text Segmentation
- Eugene Agichtein
- Columbia University
- Venkatesh Ganti
- Microsoft Research
Slide 2: Scenarios
- Importing unformatted strings into a target structured database
  - Data warehousing
  - Data integration
- Requires each string to be segmented into the target relation schema
- Input strings are prone to errors (e.g., in data warehousing and data exchange)
Slide 3: Current Approaches
- Rule-based
  - Hard to develop, maintain, and deploy comprehensive sets of rules for every domain
- Supervised (e.g., BSD01)
  - Hard to obtain the comprehensive datasets needed to train robust models
Slide 4: Our Approach
- Exploit large reference tables
  - Learn domain-specific dictionaries
  - Learn structure within attribute values
- Challenges
  - The order of attribute concatenation in future test input is unknown
  - Robustness to errors in test input after training on clean and standardized reference tables
Slide 5: Problem Statement
- Target schema R(A1, …, An)
- For a given string s (a sequence of tokens):
  - segment s into substrings s1, …, sn at token boundaries
  - map s1, …, sn to attributes Ai1, …, Ain
  - maximize P(Ai1|s1) · … · P(Ain|sn) among all possible segmentations of s
- The product combination function handles an arbitrary concatenation order of attribute values
- P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model (ARMi)
- ARMs are learned from a reference relation r(A1, …, An)
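The objective on this slide can be sketched as a brute-force search over all splits and attribute orders. Here `arms` is a hypothetical mapping from attribute name to a scoring function standing in for the learned P(Ai|x); the real system replaces this enumeration with the ARM-based architecture described next.

```python
from itertools import combinations, permutations

def segment(tokens, arms):
    """Brute-force sketch of the problem statement: try every split of
    `tokens` into len(arms) contiguous substrings at token boundaries,
    and every attribute order, keeping the segmentation that maximizes
    P(Ai1|s1) * ... * P(Ain|sn).  `arms` maps attribute name -> scorer,
    a hypothetical stand-in for the learned ARMs."""
    n, k = len(tokens), len(arms)
    best_score, best_map = 0.0, None
    for cuts in combinations(range(1, n), k - 1):   # k-1 cut points
        bounds = (0,) + cuts + (n,)
        parts = [tokens[bounds[i]:bounds[i + 1]] for i in range(k)]
        for order in permutations(arms):            # attribute order unknown
            score = 1.0
            for attr, part in zip(order, parts):
                score *= arms[attr](part)
            if score > best_score:
                best_score, best_map = score, list(zip(order, parts))
    return best_score, best_map
```

Because the per-attribute probabilities are simply multiplied, nothing in the objective depends on which order the attributes were concatenated in, which is exactly why the product combination handles arbitrary orders.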
Slide 6: Segmentation Architecture
Slide 7: ARMs
- Design goals
  - Accurately distinguish an attribute's values from those of other attributes
  - Generalize to unobserved/new attribute values
  - Robust to input errors
  - Able to learn over large reference tables
Slide 8: ARM Instantiation of HMMs
- Purpose: estimate the probability that a token sequence belongs to an attribute
- ARMs instantiate HMMs (sequential models)
- Acceptance probability: the product of emission and transition probabilities
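The acceptance probability can be illustrated with a Viterbi-style computation over a toy HMM: multiply transition and emission probabilities along the best state path. The dictionaries `start`, `trans`, and `emit` below are hypothetical stand-ins for a learned ARM, not the paper's actual model.

```python
def acceptance_prob(tokens, states, start, trans, emit, end="END"):
    """Probability that the most likely state path emits `tokens`,
    i.e. the product of transition and emission probabilities along
    that path.  `start[s]` is the initial probability of state s,
    `trans[s][t]` the s->t transition, `emit[s][tok]` the emission."""
    # best-path probability of ending in state s after the first token
    cur = {s: start.get(s, 0.0) * emit[s].get(tokens[0], 0.0) for s in states}
    for tok in tokens[1:]:
        cur = {t: max(cur[s] * trans[s].get(t, 0.0) for s in states)
                  * emit[t].get(tok, 0.0)
               for t in states}
    # finish by transitioning into the designated end state
    return max(cur[s] * trans[s].get(end, 0.0) for s in states)
```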
Slide 9: Instantiating HMMs
- An instantiation has to define:
  - Topology: states and transitions
  - Emission and transition probabilities
- Current automatic approaches search for a topology among a pre-defined class of topologies using cross validation (FC00, BSD01)
  - Expensive
  - The number of states in the ARM is kept small so that the search space stays tractable
Slide 10: Intuition behind ARM Design
- Street address examples: "nw 57th St", "Redmond Woodinville Rd"
- Album name examples: "The best of eagles", "The fury of aquabats", "Colors Soundtrack"
- Large dictionaries to exploit (e.g., aquabats, soundtrack, st)
- Begin and end tokens are very important for distinguishing values of an attribute (nw, st, the, …)
- Patterns on tokens can be learned (e.g., "57th" generalizes to numbers ending in "th")
- Need robustness to input errors
  - "Best of eagles" for "The best of eagles", "nw 57th" for "nw 57th st"
Slide 11: Large Number of States
- Associate a state per token: each state emits only a single base token
  - More accurate transition probabilities
- Model sizes for many large reference tables are still within a few megabytes
  - Not a problem with current main memory sizes!
- Prune the number of states (say, by removing low-frequency tokens) to limit the ARM size
Slide 12: BMT Topology: Relax Positional Specificity
- A single state per distinct symbol within a category: the emission probability of a symbol is the same anywhere within a category
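One way to picture the BMT (Begin-Middle-Trailing) idea is to estimate a single emission distribution per category: the first token of each attribute value feeds the Begin category, the last token the Trailing category, and everything in between the Middle category. The sketch below is illustrative; the function name and input format are assumptions, and real ARMs also handle the feature hierarchy and robustness operations.

```python
from collections import Counter

def learn_bmt_emissions(values):
    """Estimate per-category emission probabilities from attribute
    values (each a list of tokens).  A symbol gets one state, and
    therefore one emission probability, within each category."""
    counts = {"B": Counter(), "M": Counter(), "T": Counter()}
    for tokens in values:
        counts["B"][tokens[0]] += 1          # first token -> Begin
        if len(tokens) > 1:
            counts["T"][tokens[-1]] += 1     # last token -> Trailing
            counts["M"].update(tokens[1:-1]) # the rest -> Middle
    emissions = {}
    for cat, ctr in counts.items():
        total = sum(ctr.values())
        emissions[cat] = {t: c / total for t, c in ctr.items()} if total else {}
    return emissions
```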
Slide 13: Feature Hierarchy: Relax Token Specificity (BSD01)
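The slide's figure is not reproduced here. In the spirit of the BSD01 feature hierarchy, each token is generalized from its literal form up to progressively coarser classes, so an unseen token such as "42nd" can still match an ARM state at the pattern level. The specific levels below are illustrative assumptions, not the paper's exact hierarchy.

```python
import re

def feature_hierarchy(token):
    """Return the token's generalizations, most specific first."""
    levels = [token]                              # literal token
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        levels.append("ORDINAL")                  # 57th, 42nd, ...
    if token.isdigit():
        levels.append("DIGITS")                   # 98052, ...
    if token.isalpha():
        levels.append("WORD")                     # redmond, st, ...
    levels.append("TOKEN")                        # most general: any token
    return levels
```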
Slide 14: Example ARM for Address
Slide 15: Robustness Operations: Relax Sequential Specificity
- Make ARMs robust to common errors in the input, i.e., maintain a high probability of acceptance despite these errors
- Common types of errors (HS98):
  - Token deletions
  - Token insertions
  - Missing values
- Intuition: simulate the effects of such erroneous values over each ARM
Slide 16: Robustness Operations
- Simulating the effect of token insertions: tokens and their corresponding transition probabilities are copied from the BEGIN to the MIDDLE state
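A minimal sketch of this operation, assuming BMT emission and transition tables keyed "B"/"M" and a hypothetical damping factor `weight` for the copied probability mass (the deck does not specify how the copied mass is weighted):

```python
def simulate_insertions(emit, trans, weight=0.1):
    """Copy BEGIN-category tokens and outgoing transitions into the
    MIDDLE category (down-weighted by `weight`), so an input with a
    spurious extra token before the begin token can still follow a
    reasonable-probability path.  Distributions are renormalized."""
    for tok, p in emit["B"].items():
        emit["M"][tok] = emit["M"].get(tok, 0.0) + weight * p
    norm = sum(emit["M"].values())
    emit["M"] = {tok: p / norm for tok, p in emit["M"].items()}

    for target, p in trans["B"].items():
        trans["M"][target] = trans["M"].get(target, 0.0) + weight * p
    norm = sum(trans["M"].values())
    trans["M"] = {t: p / norm for t, p in trans["M"].items()}
    return emit, trans
```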
Slide 17: Transition Probabilities
- Transitions B→M, B→T, M→M, and M→T are allowed
- Learned from examples in the reference table
- Transition probabilities are also weighted by their ability to distinguish an attribute
  - A transition that is common across many attributes gets a low weight
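The deck does not give the weighting formula; an IDF-style weight is one plausible reading (the log form below is an assumption), down-weighting transitions that appear in many attributes' reference data:

```python
import math

def transition_weights(trans_by_attr):
    """For each (attribute, transition) pair, assign a weight that
    shrinks as the transition appears in more attributes, in the
    style of inverse document frequency.  `trans_by_attr` maps each
    attribute name to the set of transitions observed for it."""
    attrs = list(trans_by_attr)
    weights = {}
    for a in attrs:
        for edge in trans_by_attr[a]:
            # number of attributes that also exhibit this transition
            df = sum(1 for b in attrs if edge in trans_by_attr[b])
            weights[(a, edge)] = math.log(1 + len(attrs) / df)
    return weights
```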
Slide 18: Summary of ARM Instantiation
- BMT topology
- Token hierarchy to generalize observed patterns
- Robustness operations on HMMs to address input errors
- One state per token in the reference table to exploit large dictionaries
Slide 19: Attribute Order Determination
- If the attribute order is known:
  - A dynamic programming algorithm can segment the input (Rabiner89)
- If the attribute order is unknown:
  - Ask the user to provide the attribute order, or
  - Discover the attribute order
    - Naïve, expensive strategy: evaluate all concatenation orders and segmentations for each input string
- Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
  - Several datasets on the web satisfy this assumption
  - Allows us to efficiently:
    - Determine the attribute order over a batch of tuples
    - Segment input strings (using dynamic programming)
Slide 20: Segmentation Algorithm (runtime)
Slide 21: Experimental Evaluation
- Reference relations from several domains:
  - Addresses: 1,000,000 tuples
    - Name, 1, 2, Street Address, City, State, Zip
  - Media: 280,000 tuples
    - ArtistName, AlbumName, TrackName
  - Bibliography: 100,000 tuples
    - Title, Author, Journal, Volume, Month, Year
- Compare CRAM (our system) with DataMold (BSD01)
Slide 22: Test Datasets
- Naturally erroneous datasets: unformatted input strings seen in operational databases
  - Media
  - Customer addresses
- Controlled error injection
  - Clean reference table tuples → inject errors → concatenate to generate input strings
  - Evaluate whether a segmentation algorithm recovered the original tuple
- Accuracy: the fraction of attribute values correctly recognized
Slide 23: Overall Accuracy
(accuracy charts for the DBLP and Addresses datasets)
Slide 24: Topology and Robustness Operations
(accuracy chart for the Addresses dataset)
Slide 25: Training on Hypothetical Error Models
Slide 26: Exploiting Dictionaries
(chart: accuracy vs. reference table size)
Slide 27: Conclusions
- Reference tables can be leveraged for segmentation
- Combining ARMs via an independence assumption allows segmenting input strings with unknown attribute order
- ARM models learned over clean reference relations can accurately segment erroneous input strings, thanks to:
  - BMT topology
  - Robustness operations
  - Exploiting large dictionaries
Slide 28: Model Sizes and Pruning
(table: accuracy vs. number of states, transitions, and model size in MB)
Slide 29: Order Determination Accuracy
Slide 30: Topology
(accuracy chart for the Media dataset)
Slide 31: Specificities of HMM Models
- Model specificity restricts the accepted token sequences
- Positional specificity
  - A number ending in "th"/"st" can only be the 2nd token in an address value
- Token specificity
  - The last state only accepts st, rd, wy, blvd
- Sequential specificity
  - st, rd, wy, blvd have to follow a number ending in "st"/"th"
Slide 32: Robustness Operations
- Token insertion
- Token deletion
- Missing values