Title: Mining Reference Tables for Automatic Text Segmentation
Slide 1: Mining Reference Tables for Automatic Text Segmentation
- Eugene Agichtein
- Columbia University
- Venkatesh Ganti
- Microsoft Research
Slide 2: Scenarios
- Importing unformatted strings into a target structured database
  - Data warehousing
  - Data integration
- Requires each string to be segmented into the target relation schema
- Input strings are prone to errors (e.g., in data warehousing and data exchange)
Slide 3: Current Approaches
- Rule-based
  - Hard to develop, maintain, and deploy comprehensive sets of rules for every domain
- Supervised (e.g., BSD01)
  - Hard to obtain the comprehensive datasets needed to train robust models
Slide 4: Our Approach
- Exploit large reference tables
  - Learn domain-specific dictionaries
  - Learn structure within attribute values
- Challenges
  - The order of attribute concatenation in future test input is unknown
  - Robustness to errors in test input after training on clean and standardized reference tables
Slide 5: Problem Statement
- Target schema R(A1, …, An)
- For a given string s (a sequence of tokens):
  - segment s into substrings s1, …, sn at token boundaries
  - map s1, …, sn to attributes Ai1, …, Ain
  - maximize P(Ai1|s1) · … · P(Ain|sn) among all possible segmentations of s
- The product combination function handles an arbitrary concatenation order of attribute values
- P(Ai|x), the probability that a string x belongs to Ai, is estimated by an Attribute Recognition Model (ARMi)
- ARMs are learned from a reference relation r(A1, …, An)
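The objective on this slide can be sketched as a brute-force search over all splits and attribute orders. Here `arms` is a hypothetical mapping from attribute name to a scoring function standing in for the learned P(Ai|x); the real system replaces this enumeration with the ARM-based architecture described next.

```python
from itertools import combinations, permutations

def segment(tokens, arms):
    """Brute-force sketch of the problem statement: try every split of
    `tokens` into len(arms) contiguous substrings at token boundaries,
    and every attribute order, keeping the segmentation that maximizes
    P(Ai1|s1) * ... * P(Ain|sn).  `arms` maps attribute name -> scorer,
    a hypothetical stand-in for the learned ARMs."""
    n, k = len(tokens), len(arms)
    best_score, best_map = 0.0, None
    for cuts in combinations(range(1, n), k - 1):   # k-1 cut points
        bounds = (0,) + cuts + (n,)
        parts = [tokens[bounds[i]:bounds[i + 1]] for i in range(k)]
        for order in permutations(arms):            # attribute order unknown
            score = 1.0
            for attr, part in zip(order, parts):
                score *= arms[attr](part)
            if score > best_score:
                best_score, best_map = score, list(zip(order, parts))
    return best_score, best_map
```

Because the per-attribute probabilities are simply multiplied, nothing in the objective depends on which order the attributes were concatenated in, which is exactly why the product combination handles arbitrary orders.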
Slide 6: Segmentation Architecture
Slide 7: ARMs
- Design goals
  - Accurately distinguish an attribute's values from those of other attributes
  - Generalize to unobserved/new attribute values
  - Robust to input errors
  - Able to learn over large reference tables
Slide 8: ARM Instantiation of HMMs
- Purpose: estimate the probability that a token sequence belongs to an attribute
- ARMs instantiate HMMs (sequential models)
- Acceptance probability: the product of emission and transition probabilities
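The acceptance probability can be illustrated with a Viterbi-style computation over a toy HMM: multiply transition and emission probabilities along the best state path. The dictionaries `start`, `trans`, and `emit` below are hypothetical stand-ins for a learned ARM, not the paper's actual model.

```python
def acceptance_prob(tokens, states, start, trans, emit, end="END"):
    """Probability that the most likely state path emits `tokens`,
    i.e. the product of transition and emission probabilities along
    that path.  `start[s]` is the initial probability of state s,
    `trans[s][t]` the s->t transition, `emit[s][tok]` the emission."""
    # best-path probability of ending in state s after the first token
    cur = {s: start.get(s, 0.0) * emit[s].get(tokens[0], 0.0) for s in states}
    for tok in tokens[1:]:
        cur = {t: max(cur[s] * trans[s].get(t, 0.0) for s in states)
                  * emit[t].get(tok, 0.0)
               for t in states}
    # finish by transitioning into the designated end state
    return max(cur[s] * trans[s].get(end, 0.0) for s in states)
```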
Slide 9: Instantiating HMMs
- An instantiation has to define:
  - Topology: states and transitions
  - Emission and transition probabilities
- Current automatic approaches search for a topology among a pre-defined class of topologies using cross validation (FC00, BSD01)
  - Expensive
  - The number of states in the ARM is kept small so that the search space stays tractable
Slide 10: Intuition behind ARM Design
- Street address examples: "nw 57th St", "Redmond Woodinville Rd"
- Album name examples: "The best of eagles", "The fury of aquabats", "Colors Soundtrack"
- Large dictionaries to exploit (e.g., aquabats, soundtrack, st)
- Begin and end tokens are very important for distinguishing values of an attribute (nw, st, the, …)
- Patterns on tokens can be learned (e.g., "57th" generalizes to numbers ending in "th")
- Need robustness to input errors
  - "Best of eagles" for "The best of eagles", "nw 57th" for "nw 57th st"
Slide 11: Large Number of States
- Associate a state per token: each state emits only a single base token
  - More accurate transition probabilities
- Model sizes for many large reference tables are still within a few megabytes
  - Not a problem with current main memory sizes!
- Prune the number of states (say, by removing low-frequency tokens) to limit the ARM size
Slide 12: BMT Topology: Relax Positional Specificity
- A single state per distinct symbol within a category: the emission probability of a symbol is the same anywhere within a category
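One way to picture the BMT (Begin-Middle-Trailing) idea is to estimate a single emission distribution per category: the first token of each attribute value feeds the Begin category, the last token the Trailing category, and everything in between the Middle category. The sketch below is illustrative; the function name and input format are assumptions, and real ARMs also handle the feature hierarchy and robustness operations.

```python
from collections import Counter

def learn_bmt_emissions(values):
    """Estimate per-category emission probabilities from attribute
    values (each a list of tokens).  A symbol gets one state, and
    therefore one emission probability, within each category."""
    counts = {"B": Counter(), "M": Counter(), "T": Counter()}
    for tokens in values:
        counts["B"][tokens[0]] += 1          # first token -> Begin
        if len(tokens) > 1:
            counts["T"][tokens[-1]] += 1     # last token -> Trailing
            counts["M"].update(tokens[1:-1]) # the rest -> Middle
    emissions = {}
    for cat, ctr in counts.items():
        total = sum(ctr.values())
        emissions[cat] = {t: c / total for t, c in ctr.items()} if total else {}
    return emissions
```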
Slide 13: Feature Hierarchy: Relax Token Specificity (BSD01)
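The slide's figure is not reproduced here. In the spirit of the BSD01 feature hierarchy, each token is generalized from its literal form up to progressively coarser classes, so an unseen token such as "42nd" can still match an ARM state at the pattern level. The specific levels below are illustrative assumptions, not the paper's exact hierarchy.

```python
import re

def feature_hierarchy(token):
    """Return the token's generalizations, most specific first."""
    levels = [token]                              # literal token
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        levels.append("ORDINAL")                  # 57th, 42nd, ...
    if token.isdigit():
        levels.append("DIGITS")                   # 98052, ...
    if token.isalpha():
        levels.append("WORD")                     # redmond, st, ...
    levels.append("TOKEN")                        # most general: any token
    return levels
```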
Slide 14: Example ARM for Address
Slide 15: Robustness Operations: Relax Sequential Specificity
- Make ARMs robust to common errors in the input, i.e., maintain a high probability of acceptance despite these errors
- Common types of errors (HS98):
  - Token deletions
  - Token insertions
  - Missing values
- Intuition: simulate the effects of such erroneous values over each ARM
Slide 16: Robustness Operations
- Simulating the effect of token insertions: tokens and their corresponding transition probabilities are copied from the BEGIN to the MIDDLE state
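A minimal sketch of this operation, assuming BMT emission and transition tables keyed "B"/"M" and a hypothetical damping factor `weight` for the copied probability mass (the deck does not specify how the copied mass is weighted):

```python
def simulate_insertions(emit, trans, weight=0.1):
    """Copy BEGIN-category tokens and outgoing transitions into the
    MIDDLE category (down-weighted by `weight`), so an input with a
    spurious extra token before the begin token can still follow a
    reasonable-probability path.  Distributions are renormalized."""
    for tok, p in emit["B"].items():
        emit["M"][tok] = emit["M"].get(tok, 0.0) + weight * p
    norm = sum(emit["M"].values())
    emit["M"] = {tok: p / norm for tok, p in emit["M"].items()}

    for target, p in trans["B"].items():
        trans["M"][target] = trans["M"].get(target, 0.0) + weight * p
    norm = sum(trans["M"].values())
    trans["M"] = {t: p / norm for t, p in trans["M"].items()}
    return emit, trans
```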
Slide 17: Transition Probabilities
- Transitions B→M, B→T, M→M, and M→T are allowed
- Learned from examples in the reference table
- Transition probabilities are also weighted by their ability to distinguish an attribute
  - A transition that is common across many attributes gets a low weight
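The deck does not give the weighting formula; an IDF-style weight is one plausible reading (the log form below is an assumption), down-weighting transitions that appear in many attributes' reference data:

```python
import math

def transition_weights(trans_by_attr):
    """For each (attribute, transition) pair, assign a weight that
    shrinks as the transition appears in more attributes, in the
    style of inverse document frequency.  `trans_by_attr` maps each
    attribute name to the set of transitions observed for it."""
    attrs = list(trans_by_attr)
    weights = {}
    for a in attrs:
        for edge in trans_by_attr[a]:
            # number of attributes that also exhibit this transition
            df = sum(1 for b in attrs if edge in trans_by_attr[b])
            weights[(a, edge)] = math.log(1 + len(attrs) / df)
    return weights
```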
Slide 18: Summary of ARM Instantiation
- BMT topology
- Token hierarchy to generalize observed patterns
- Robustness operations on HMMs to address input errors
- One state per token in the reference table to exploit large dictionaries
Slide 19: Attribute Order Determination
- If the attribute order is known:
  - A dynamic programming algorithm can segment the input (Rabiner89)
- If the attribute order is unknown:
  - Ask the user to provide the attribute order, or
  - Discover the attribute order
    - Naïve, expensive strategy: evaluate all concatenation orders and segmentations for each input string
- Consistent Attribute Order Assumption: the attribute order is the same across a batch of input tuples
  - Several datasets on the web satisfy this assumption
  - Allows us to efficiently:
    - Determine the attribute order over a batch of tuples
    - Segment input strings (using dynamic programming)
Slide 20: Segmentation Algorithm (runtime)
Slide 21: Experimental Evaluation
- Reference relations from several domains:
  - Addresses: 1,000,000 tuples
    - Name, 1, 2, Street Address, City, State, Zip
  - Media: 280,000 tuples
    - ArtistName, AlbumName, TrackName
  - Bibliography: 100,000 tuples
    - Title, Author, Journal, Volume, Month, Year
- Compare CRAM (our system) with DataMold (BSD01)
Slide 22: Test Datasets
- Naturally erroneous datasets: unformatted input strings seen in operational databases
  - Media
  - Customer addresses
- Controlled error injection
  - Clean reference table tuples → inject errors → concatenate to generate input strings
  - Evaluate whether a segmentation algorithm recovered the original tuple
- Accuracy: the fraction of attribute values correctly recognized
Slide 23: Overall Accuracy
(accuracy charts for the DBLP and Addresses datasets)
Slide 24: Topology and Robustness Operations
(accuracy chart for the Addresses dataset)
Slide 25: Training on Hypothetical Error Models
Slide 26: Exploiting Dictionaries
(chart: accuracy vs. reference table size)
Slide 27: Conclusions
- Reference tables can be leveraged for segmentation
- Combining ARMs via an independence assumption allows segmenting input strings with unknown attribute order
- ARM models learned over clean reference relations can accurately segment erroneous input strings, thanks to:
  - BMT topology
  - Robustness operations
  - Exploiting large dictionaries
Slide 28: Model Sizes and Pruning
(table: accuracy vs. number of states, transitions, and model size in MB)
Slide 29: Order Determination Accuracy
Slide 30: Topology
(accuracy chart for the Media dataset)
Slide 31: Specificities of HMM Models
- Model specificity restricts the accepted token sequences
- Positional specificity
  - A number ending in "th"/"st" can only be the 2nd token in an address value
- Token specificity
  - The last state only accepts st, rd, wy, blvd
- Sequential specificity
  - st, rd, wy, blvd have to follow a number ending in "st"/"th"
Slide 32: Robustness Operations
- Token insertion
- Token deletion
- Missing values