Title: Text Categorization with All Substring Features
We compute $k = \arg\max_k |v_k|$ for maximal substrings.
Daisuke Okanohara, Junichi Tsujii Dept. of
Computer Science, University of Tokyo, hillbig,
L1 regularization
Number of substrings in training data: O(1) time by table lookup.
Number of substrings weighted by probs.: O(1) time by table lookup.
We present a novel document classification method
using all substrings as features. Although the number of
substrings in a document of length $T$ is $O(T^2)$, we can
train this in $O(T)$ time without approximation.
We estimate the weight vector $w$ by maximum
likelihood estimation with L1 regularization.
Text categorization
Test Data Statistics
Log-likelihood of training data
Given a text $d$, estimate its category $y$ (e.g.
sports). A text $d$ is usually represented by a bag of words (BOW),
a feature vector $f(d) \in \mathbb{R}^m$ where each dimension
corresponds to a tokenized word.
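As a rough illustration of the BOW representation described above, the sketch below builds a count vector over a fixed vocabulary. The function name `bow_features`, the whitespace tokenizer, and the toy vocabulary are assumptions made for this example only.

```python
from collections import Counter

def bow_features(text, vocabulary):
    """Return a bag-of-words count vector f(d) over a fixed vocabulary."""
    # Assumption: lowercase + whitespace tokenization stands in for a real tokenizer.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary; each entry is one dimension of f(d).
vocabulary = ["ball", "game", "market", "team"]
f_d = bow_features("The team won the game and the team celebrated", vocabulary)
```

This only works when tokens can be found by splitting on whitespace, which is exactly what fails for unsegmented text.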
L1 regularization
Learning with L1 regularization has the effect of feature
selection, i.e., the number of non-zero weights is very
small. Benefits: efficient inference and interpretable
result features.
Problem 1. Difficulty in detecting word boundaries
(e.g. Japanese, Chinese, DNA sequences). Problem 2.
Word units are often NOT appropriate for text
categorization (e.g. movie titles/templates).
Grafting [S. Perkins 03]
Let $v$ be the gradient of $L(w)$, and $H$ be the
set of current active features. Then, we repeat
the following until $|v_k| < C$:
$k = \arg\max_k |v_k|$, $H = H \cup \{k\}$, optimize (1) with $H$.
This achieves the global optimum. If we can compute $\arg\max_k |v_k|$
efficiently, the training time is proportional to
the number of active features.
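The grafting loop above can be sketched on a toy problem as follows. This is a minimal sketch under several assumptions that are not from the poster: binary (not multi-class) logistic regression, dense feature lists instead of substring features, and proximal gradient descent (ISTA) as the inner optimizer over the active set. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_gradient(w, X, y):
    """Gradient v of the negative log-likelihood of binary logistic regression."""
    grad = [0.0] * len(w)
    for x, label in zip(X, y):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        for k, xk in enumerate(x):
            grad[k] += (p - label) * xk
    return grad

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def optimize_active(w, X, y, active, C, step=0.1, iters=500):
    """ISTA on the L1-regularized objective, updating only active features."""
    for _ in range(iters):
        grad = loss_gradient(w, X, y)
        for k in active:
            w[k] = soft_threshold(w[k] - step * grad[k], step * C)
    return w

def grafting(X, y, C):
    """Add the feature with the largest |v_k| until no gradient exceeds C."""
    w = [0.0] * len(X[0])
    active = set()
    while True:
        v = loss_gradient(w, X, y)
        candidates = [k for k in range(len(w)) if k not in active]
        if not candidates:
            break
        k = max(candidates, key=lambda j: abs(v[j]))
        if abs(v[k]) <= C:  # no inactive feature would get a non-zero weight
            break
        active.add(k)
        w = optimize_active(w, X, y, active, C)
    return w, active

# Toy data (hypothetical): feature 0 matches the label, features 1 and 2 do not help.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]
w, active = grafting(X, y, C=0.6)
```

In this toy run only feature 0 should enter the active set. The `max` over candidates scans every feature, which is the step the poster proposes to make efficient for substring features.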
The number of maximal substrings is about $T/3$.
Result of text categorization: accuracy (%)
All string Bag-of-Words
A text $d$ is represented as a feature vector
$f(d) \in \mathbb{R}^m$ where each dimension corresponds to a
substring. Problem: the number of substrings is much
larger, $O(|d|^2)$. We solve this by L1
regularization + Grafting + Maximal substring.
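To make the quadratic blow-up concrete, the naive sketch below counts every substring of a short string. The function name `all_substring_counts` is an assumption for illustration. This brute-force enumeration is the cost the poster's method avoids, not part of that method.

```python
from collections import Counter

def all_substring_counts(text):
    """Count every non-empty substring of text by brute force."""
    counts = Counter()
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            counts[text[start:end]] += 1
    return counts

counts = all_substring_counts("abab")
# A string of length n has n * (n + 1) / 2 substring occurrences.
total_occurrences = sum(counts.values())
```

For a text of length 4 this already gives 10 occurrences over 7 distinct substrings, and the total grows quadratically with the text length.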
Maximal substring
Lemma: Let $P(s)$ be the position list of a
substring $s$ in a text $d$, and say that $s$ and $s'$ are in the
same class if $P(s) = P(s')$. Then the number of
different classes in $d$ is at most $2|d|-1$. Many
statistics (e.g. tf, df, idf) are the
same for substrings in the same class. We call a
substring $s$ a maximal substring if $s$ is the longest
substring in its class. These maximal
substrings can be enumerated in $O(|d|)$ time by
enhanced suffix arrays [Abouelhoda 04].
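The class structure in the lemma can be checked on a small string with the brute-force sketch below. It assumes that the position list of a substring is the list of its start positions and that two substrings share a class when these lists are equal. The function name `maximal_substrings` is hypothetical, and this quadratic enumeration is only for illustration; it is not the linear-time enhanced suffix array method.

```python
from collections import defaultdict

def maximal_substrings(text):
    """Map each position list to the longest substring that has it."""
    positions = defaultdict(list)
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            positions[text[start:end]].append(start)
    longest = {}
    for substring, starts in positions.items():
        key = tuple(starts)  # substrings with equal position lists share a class
        if key not in longest or len(substring) > len(longest[key]):
            longest[key] = substring
    return longest

classes = maximal_substrings("abab")
```

For "abab" the substrings "a" and "ab" occur at the same positions, so they fall into one class whose maximal substring is "ab". The number of classes stays within the $2|d|-1$ bound.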
1 Search substring features greedily [Ifrim 08]
Classification Model
Time for extracting maximal substrings. Linear
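We use a multi-class logistic regression model.
Conditional probability of $y$ given $d$ is defined as
$p(y \mid d; w) \propto \exp(w^T F(d, y))$, where
$F(d, y) \in \mathbb{R}^{m|Y|}$ and its $i$-th block is $f(d)\, I(y = y_i)$.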
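The block feature vector and the resulting conditional probability can be sketched as below, assuming the standard softmax form of multi-class logistic regression. The function names, the toy weights, and the two-feature, two-label setup are assumptions for illustration only.

```python
import math

def block_features(f_d, label_index, num_labels):
    """Build F(d, y): f(d) in the block of the given label, zeros elsewhere."""
    m = len(f_d)
    F = [0.0] * (m * num_labels)
    F[label_index * m:(label_index + 1) * m] = f_d
    return F

def conditional_probs(w, f_d, num_labels):
    """Softmax over labels of the scores w . F(d, y)."""
    scores = [
        sum(wi * xi for wi, xi in zip(w, block_features(f_d, y, num_labels)))
        for y in range(num_labels)
    ]
    top = max(scores)  # subtract the max score for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights: 2 features x 2 labels, one block per label.
w = [1.0, 0.0, 0.0, 1.0]
probs = conditional_probs(w, [2.0, 0.0], num_labels=2)
```

Because only the block of the chosen label is non-zero, each label effectively has its own weight vector over the same features $f(d)$.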
Future work: unsupervised learning, sequential
labeling tasks