Mining Reference Tables for Automatic Text Segmentation

1
Mining Reference Tables for Automatic Text
Segmentation
  • Eugene Agichtein
  • Columbia University
  • Venkatesh Ganti
  • Microsoft Research

2
Scenarios
  • Importing unformatted strings into a target
    structured database
  • Data warehousing
  • Data integration
  • Requires each string to be segmented into the
    target relation schema
  • Input strings are prone to errors (e.g., data
    warehousing, data exchange)

3
Current Approaches
  • Rule-based
  • Hard to develop, maintain, and deploy
    comprehensive sets of rules for every domain
  • Supervised
  • E.g., [BSD01]
  • Hard to obtain comprehensive datasets needed to
    train robust models

4
Our Approach
  • Exploit large reference tables
  • Learn domain-specific dictionaries
  • Learn structure within attribute values
  • Challenges
  • Order of attribute concatenation in future test
    input is unknown
  • Robustness to errors in test input after training
    on clean and standardized reference tables

5
Problem Statement
  • Target schema R(A1, …, An)
  • For a given string s (a sequence of tokens)
  • segment s into substrings s1, …, sn at token
    boundaries
  • map s1, …, sn to Ai1, …, Ain
  • maximize P(Ai1 | s1) · … · P(Ain | sn) among all
    possible segmentations of s (see the sketch below)
  • Product combination function handles arbitrary
    concatenation order of attribute values
  • P(Ai | x), that a string x belongs to Ai, is
    estimated by an Attribute Recognition Model ARMi
  • ARMs are learned from a reference relation
    r(A1, …, An)
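As an illustration of the objective above, here is a minimal, exhaustive sketch in Python (not the efficient algorithm from the talk). It assumes a hypothetical arm_scores mapping from attribute name to a scoring function approximating P(Ai | x); the real system uses ARMs and dynamic programming.

```python
from itertools import combinations, permutations
from math import prod

def best_segmentation(tokens, arm_scores):
    """Exhaustively scores every segmentation of `tokens` into
    len(arm_scores) contiguous substrings under every attribute order,
    keeping the one that maximizes the product of per-attribute scores.
    `arm_scores` is a hypothetical dict {attribute: score_fn}, where
    score_fn(token_list) approximates P(attribute | token_list)."""
    attrs = list(arm_scores)
    n = len(attrs)
    best_score, best_seg = 0.0, None
    # choose n - 1 cut points at token boundaries
    for cuts in combinations(range(1, len(tokens)), n - 1):
        bounds = (0,) + cuts + (len(tokens),)
        parts = [tokens[bounds[i]:bounds[i + 1]] for i in range(n)]
        # the product combination handles any attribute concatenation order
        for order in permutations(attrs):
            score = prod(arm_scores[a](p) for a, p in zip(order, parts))
            if score > best_score:
                best_score, best_seg = score, list(zip(order, parts))
    return best_score, best_seg
```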

6
Segmentation Architecture
7
ARMs
  • Design goals
  • Accurately distinguish an attribute value from
    other attributes
  • Generalize to unobserved/new attribute values
  • Robust to input errors
  • Able to learn over large reference tables

8
ARM Instantiation of HMMs
  • Purpose: Estimate probabilities of token
    sequences belonging to attributes
  • ARM instantiation of HMMs (sequential models)
  • Acceptance probability: product of emission and
    transition probabilities (illustrated below)
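A minimal sketch of that acceptance probability, assuming the ARM is represented by hypothetical dictionaries start_p, trans_p, and emit_p learned from the reference table; the actual CRAM data structures are not shown in the slides.

```python
def acceptance_probability(tokens, start_p, trans_p, emit_p):
    """Probability that an ARM (an HMM) accepts a token sequence along its
    best state path: a product of transition and emission probabilities.
    start_p[s], trans_p[(q, s)], and emit_p[(s, token)] are hypothetical
    dictionaries learned from the reference table; unseen entries get 0."""
    if not tokens:
        return 0.0
    states = list(start_p)
    # probability of the best path ending in each state after the 1st token
    probs = {s: start_p[s] * emit_p.get((s, tokens[0]), 0.0) for s in states}
    for tok in tokens[1:]:
        probs = {
            s: max(probs[q] * trans_p.get((q, s), 0.0) for q in states)
               * emit_p.get((s, tok), 0.0)
            for s in states
        }
    return max(probs.values())
```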

9
Instantiating HMMs
  • Instantiation has to define
  • Topology: states and transitions
  • Emission and transition probabilities
  • Current automatic approaches for topology search
    from among a pre-defined class of topologies are
    based on cross-validation [FC00, BSD01]
  • Expensive
  • The number of states in the ARM is kept small to
    keep the search space tractable

10
Intuition behind ARM Design
  • Street address examples
  • nw 57th St, Redmond Woodinville Rd
  • Album names
  • The best of eagles, The fury of aquabats,
    Colors Soundtrack
  • Large dictionaries (e.g., aquabats, soundtrack,
    st) to exploit
  • Begin and end tokens are very important to
    distinguish values of an attribute (nw, st,
    the, …)
  • Can learn patterns on tokens (e.g., 57th
    generalizes to th)
  • Need robustness to input errors
  • "Best of eagles" for "The best of eagles", "nw
    57th" for "nw 57th st"

11
Large Number of States
  • Associate a state per token: each state only
    emits a single base token
  • More accurate transition probabilities
  • Model sizes for many large reference tables are
    still within a few megabytes
  • Not a problem with current main memory sizes!
  • Prune the number of states (say, remove
    low-frequency tokens) to limit the ARM size
    (see the sketch below)
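A minimal sketch of the one-state-per-token idea with frequency pruning, assuming a hypothetical list of one attribute's values drawn from the reference table; min_count is an illustrative threshold, not a value from the paper.

```python
from collections import Counter

def build_token_states(attribute_values, min_count=2):
    """Counts every base token observed for one attribute in the reference
    table and keeps a state only for tokens seen at least `min_count`
    times, bounding the ARM size."""
    counts = Counter(
        token for value in attribute_values for token in value.lower().split()
    )
    return {token for token, c in counts.items() if c >= min_count}
```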

12
BMT Topology: Relax Positional Specificity
A single state per distinct symbol within a
category; the emission probability of a symbol
within a category is the same
13
Feature Hierarchy: Relax Token Specificity [BSD01]
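This slide is a figure in the original deck, and the exact hierarchy from [BSD01] is not reproduced in the transcript. The following is an illustrative token-generalization sketch in the same spirit: an unseen token is mapped to progressively coarser symbol classes (e.g., 57th to a digits-plus-letters class) so the ARM can still score it. The class names are assumptions.

```python
import re

def generalize(token):
    """Maps a token to a coarser symbol class so an ARM can still score
    tokens it has never seen, e.g. '57th' -> 'DIGITS+LETTERS',
    'nw' -> 'LETTERS', '98052' -> 'DIGITS'."""
    if re.fullmatch(r"\d+", token):
        return "DIGITS"
    if re.fullmatch(r"[A-Za-z]+", token):
        return "LETTERS"
    if re.fullmatch(r"\d+[A-Za-z]+", token):
        return "DIGITS+LETTERS"
    return "OTHER"
```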
14
Example ARM for Address
15
Robustness Operations: Relax Sequential
Specificity
  • Make ARMs robust to common errors in the input,
    i.e., maintain high probability of acceptance
    despite these errors
  • Common types of errors [HS98]
  • Token deletions
  • Token insertions
  • Missing values
  • Intuition: Simulate the effects of such erroneous
    values over each ARM

16
Robustness Operations
Simulating the effect of token insertions: the token
and its corresponding transition probabilities are
copied from the BEGIN to the MIDDLE state
(see the sketch below)
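A hedged sketch of that operation, assuming the ARM's emission tables are plain dictionaries (emit[state][token] -> probability); only the emission-copying half is shown here, and the corresponding transition probabilities would be copied analogously. The discount factor is an assumption, not a value from the slides.

```python
def simulate_token_insertions(emit, discount=0.1):
    """Copies BEGIN-state tokens (with discounted probabilities) into the
    MIDDLE state, so an inserted token that pushes a begin token inward
    still gets a nonzero emission probability there.
    `emit` is a hypothetical dict of dicts: emit[state][token] -> prob."""
    begin_emit, middle_emit = emit["BEGIN"], emit["MIDDLE"]
    for token, prob in begin_emit.items():
        middle_emit[token] = max(middle_emit.get(token, 0.0), discount * prob)
    # re-normalize the MIDDLE emission distribution after the copy
    total = sum(middle_emit.values())
    for token in middle_emit:
        middle_emit[token] /= total
```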
17
Transition Probabilities
  • Transitions B→M, B→T, M→M, and M→T are
    allowed
  • Learned from examples in the reference table
  • Transition probabilities are also weighted by
    their ability to distinguish an attribute
    (see the sketch below)
  • A transition which is common across
    many attributes gets a low weight
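The exact weighting formula is not given in the transcript; the following is an IDF-style sketch of the stated idea, assuming a hypothetical mapping from each attribute to the set of transitions observed in its ARM.

```python
from math import log

def transition_weight(transition, transitions_by_attribute):
    """IDF-style weight: a transition that appears in the ARMs of many
    attributes distinguishes attributes poorly and gets a low weight.
    `transitions_by_attribute` is a hypothetical dict mapping each
    attribute to the set of transitions in its ARM."""
    n_attrs = len(transitions_by_attribute)
    n_with = sum(
        1 for ts in transitions_by_attribute.values() if transition in ts
    )
    return log(1.0 + n_attrs / n_with) if n_with else 0.0
```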

18
Summary of ARM Instantiation
  • BMT topology
  • Token hierarchy to generalize observed patterns
  • Robustness operations on HMMs to address input
    errors
  • One state per token in reference table to exploit
    large dictionaries

19
Attribute Order Determination
  • If attribute order is known
  • Can use a dynamic programming algorithm to
    segment [Rabiner89]
  • If attribute order is unknown
  • Can ask the user to provide attribute order
  • Can discover attribute order
  • Naïve, expensive strategy: evaluate all
    concatenation orders and segmentations for each
    input string
  • Consistent Attribute Order Assumption: the
    attribute order is the same across a batch of
    input tuples
  • Several datasets on the web satisfy this
    assumption
  • Allows us to efficiently
  • Determine the attribute order over a batch of
    tuples
  • Segment input strings (using dynamic programming;
    see the sketch below)
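A minimal sketch of the batch idea under the consistent-attribute-order assumption. It assumes a hypothetical segment_with_order(s, order) helper that returns (score, segmentation) for one string under a fixed attribute order, e.g., a dynamic-programming segmenter in the spirit of [Rabiner89]; the sample size is illustrative.

```python
from itertools import permutations

def segment_batch(batch, attrs, segment_with_order, sample_size=10):
    """Scores every attribute order on a small sample of the batch, keeps
    the order with the highest total segmentation score, then segments
    the whole batch under that single order."""
    sample = batch[:sample_size]
    best_order = max(
        permutations(attrs),
        key=lambda order: sum(segment_with_order(s, order)[0] for s in sample),
    )
    return [segment_with_order(s, best_order)[1] for s in batch]
```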

20
Segmentation Algorithm (runtime)
21
Experimental Evaluation
  • Reference relations from several domains
  • Addresses: 1,000,000 tuples
  • Name, 1, 2, Street Address, City, State, Zip
  • Media: 280,000 tuples
  • ArtistName, AlbumName, TrackName
  • Bibliography: 100,000 tuples
  • Title, Author, Journal, Volume, Month, Year
  • Compare CRAM (our system) with DataMold [BSD01]

22
Test Datasets
  • Naturally erroneous datasets: unformatted input
    strings seen in operational databases
  • Media
  • Customer addresses
  • Controlled error injection
  • Clean reference table tuples → Inject errors →
    Concatenate to generate input strings (see the
    sketch below)
  • Evaluate whether a segmentation algorithm
    recovered the original tuple
  • Accuracy: measure of attribute values correctly
    recognized
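A hedged sketch of that error-injection protocol, covering token deletions and missing values (token insertions are omitted for brevity); the error rates are illustrative, not the ones used in the experiments.

```python
import random

def make_test_string(clean_tuple, p_delete=0.1, p_missing=0.05):
    """Starts from a clean reference tuple (a list of attribute value
    strings), randomly drops whole values and individual tokens, and
    concatenates what is left into one unsegmented input string."""
    noisy_values = []
    for value in clean_tuple:
        if random.random() < p_missing:          # missing value
            continue
        kept = [t for t in value.split() if random.random() >= p_delete]
        noisy_values.append(" ".join(kept))      # token deletions
    return " ".join(v for v in noisy_values if v)
```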

23
Overall Accuracy
(Charts: accuracy on the DBLP and Addresses datasets)
24
Topology and Robustness Operations
(Chart: Addresses dataset)
25
Training on Hypothetical Error Models
26
Exploiting Dictionaries
Accuracy vs Reference Table size
27
Conclusions
  • Reference tables leveraged for segmentation
  • Combining ARMs based on independence allows
    segmenting input strings with unknown attribute
    order
  • ARM models learned over clean reference relations
    can accurately segment erroneous input strings
  • BMT topology
  • Robustness operations
  • Exploiting large dictionaries

28
Model Sizes and Pruning
(Table: Accuracy, States, Transitions, Model Size in MB)
29
Order Determination Accuracy
30
Topology
(Chart: Media dataset)
31
Specificities of HMM Models
  • Model specificity restricts accepted token
    sequences
  • Positional specificity
  • A number ending in "th"/"st" can only be the 2nd
    token in an address value
  • Token specificity
  • Last state only accepts st, rd, wy, blvd
  • Sequential specificity
  • st, rd, wy, blvd have to follow a number ending
    in "st"/"th"

32
Robustness Operations
Token insertion
Token deletion
Missing values