Title: Text Categorization with All Substring Features
We compute $k = \arg\max_k |v_k|$ over maximal substrings.
Daisuke Okanohara, Junichi Tsujii
Dept. of Computer Science, University of Tokyo
{hillbig, tsujii}_at_is.s.u-tokyo.ac.jp
Abstract
Number of substrings in the training data: O(1) time by suffix array (SA).
Number of substrings weighted by probabilities: O(1) time by table lookup.
We present a novel document classification method
using all substrings as features. Although the number of
substrings in a document $T$ is $O(|T|^2)$, we can
train the model in $O(|T|)$ time without approximation.
We estimate the weight vector $w$ by maximum
likelihood estimation with L1 regularization.
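Maximum likelihood estimation with L1 regularization can be written as a single objective. The form below is a sketch, not a quotation: the training pairs $(d_i, y_i)$, the count $n$, and the regularization constant $C$ are notational assumptions.

```latex
w^{*} = \arg\max_{w} \; L(w),
\qquad
L(w) = \sum_{i=1}^{n} \log p(y_i \mid d_i; w) \;-\; C \, \lVert w \rVert_1
```

The first term is the log-likelihood of the training data and the second is the L1 penalty. A larger $C$ drives more weights to exactly zero.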
Experiments
[Table: test data statistics]
[Equation label: log-likelihood of training data]
Text categorization
Given a text $d$, estimate its category $y$ (e.g.
sports). A text $d$ is usually represented by bag-of-words (BOW),
a feature vector $f(d) \in \mathbb{R}^m$ where each dimension
corresponds to a tokenized word.
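As a small illustration of the bag-of-words representation, the sketch below maps a text to a count vector over a fixed vocabulary. The function name `bow_vector`, the whitespace tokenizer, and the lowercasing are assumptions made for the example.

```python
from collections import Counter


def bow_vector(text, vocabulary):
    """Return word counts for `text`, one dimension per vocabulary word."""
    # Assumption: words are separated by whitespace and case is ignored.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["game", "team", "the", "score"]
print(bow_vector("The team won the game", vocabulary))
```

A whitespace tokenizer like this has nothing to split on in unsegmented text such as Japanese, Chinese, or DNA sequences.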
L1 regularization
Learning with L1 regularization has the effect of feature
selection, i.e., the number of non-zero weights is very
small. This gives efficient inference and interpretable
result features.
Problem 1. Words are difficult to detect
(e.g. Japanese, Chinese, DNA sequences).
Problem 2. Word units are often NOT appropriate for text
categorization (e.g. movie titles/templates).
Grafting [S. Perkins 03]
Let $v$ be the gradient of $L(w)$, and $H$ be the
set of currently active features. We repeat
the following until $|v_k| < C$ for all $k$:
$k = \arg\max_k |v_k|$, $H = H \cup \{k\}$,
optimize (1) with $H$. This achieves the global
optimum. If we can compute $\arg\max_k |v_k|$
efficiently, the training time is proportional to
the number of active features.
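The grafting loop can be sketched on a small dense problem. The code below is an illustration under several assumptions: it uses binary rather than multi-class logistic regression, a dense NumPy feature matrix rather than substring features, and proximal gradient steps (soft-thresholding) as the inner optimizer. The names `grafting`, `sigmoid`, and `inner_steps` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def grafting(X, y, C, inner_steps=200):
    """Grafting sketch: maximize log-likelihood minus C * ||w||_1."""
    n, m = X.shape
    w = np.zeros(m)
    active = []                                # H: active feature indices
    inactive = np.ones(m, dtype=bool)
    # Step size from a Lipschitz bound on the logistic gradient.
    lr = 4.0 / (np.linalg.norm(X, 2) ** 2)
    while True:
        v = X.T @ (y - sigmoid(X @ w))         # gradient of the log-likelihood
        scores = np.where(inactive, np.abs(v), -np.inf)
        k = int(np.argmax(scores))             # k = argmax_k |v_k|
        if scores[k] <= C:                     # no inactive feature passes C
            break
        active.append(k)                       # H = H + {k}
        inactive[k] = False
        XH, wH = X[:, active], w[active]
        for _ in range(inner_steps):           # optimize restricted to H
            z = wH + lr * (XH.T @ (y - sigmoid(XH @ wH)))
            wH = np.sign(z) * np.maximum(np.abs(z) - lr * C, 0.0)
        w[active] = wH
    return w, active


X = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 1],
              [0, 1, 0], [0, 1, 0], [0, 1, 1]], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)
w, active = grafting(X, y, C=1.0)
print(active, w)
```

In this toy data the third feature is uninformative, so it is never added to the active set and its weight stays exactly zero. The sketch scans every feature to find the argmax, which is the step that has to be made efficient when the features are all substrings.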
The number of maximal substrings is about $|T|/3$.
Result of text categorization: accuracy (%)
All string Bag-of-Words
A text $d$ is represented as a feature vector
$f(d) \in \mathbb{R}^m$ where each dimension corresponds to a
substring. Problem: the number of substrings is much
larger, $O(|d|^2)$. We solve this with L1
regularization, grafting, and maximal substrings.
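To make the quadratic growth concrete, the sketch below counts every substring occurrence of a short text by brute force. The function name `all_substring_counts` is an assumption, and the code only illustrates the size of the feature space; it is not how the features are handled in training.

```python
from collections import Counter


def all_substring_counts(d):
    """Count occurrences of every substring of d (brute force)."""
    counts = Counter()
    n = len(d)
    for i in range(n):                 # start position
        for j in range(i + 1, n + 1):  # end position (exclusive)
            counts[d[i:j]] += 1
    return counts


counts = all_substring_counts("abab")
print(len(counts), sum(counts.values()))
```

A text of length $n$ has $n(n+1)/2$ substring occurrences, so enumerating them explicitly is already quadratic for a single document.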
Maximal substring
Lemma: Let $P(s)$ be the position list of a
substring $s$ in a text $d$, and let $s$ and $s'$ be in the
same class if $P(s) = P(s')$. Then the number of
different classes in $d$ is at most $2|d| - 1$. Many
statistics (e.g. tf, df, idf) are
the same for substrings in the same class. We call a
substring $s$ a maximal substring if $s$ is the longest
substring in its class. These maximal
substrings can be enumerated in $O(|d|)$ time by
enhanced suffix arrays [Abouelhoda 04].
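The equivalence classes in the lemma can be checked by brute force on a short string. The sketch below assumes that the position list $P(s)$ is the list of start positions of $s$. It runs in time polynomial in $|d|$ and is not the linear-time enhanced suffix array method; the function name `maximal_substrings` is hypothetical.

```python
from collections import defaultdict


def maximal_substrings(d):
    """Map each position list to the longest substring that has it."""
    positions = defaultdict(list)      # substring -> start positions
    n = len(d)
    for i in range(n):
        for j in range(i + 1, n + 1):
            positions[d[i:j]].append(i)
    classes = defaultdict(list)        # position list -> substrings
    for s, p in positions.items():
        classes[tuple(p)].append(s)
    # The longest member of each class is its maximal substring.
    return {p: max(members, key=len) for p, members in classes.items()}


result = maximal_substrings("abab")
for p, s in sorted(result.items()):
    print(p, s)
```

For "abab" the seven distinct substrings fall into four classes, within the bound $2|d| - 1 = 7$. For example, "a" and "ab" both start at positions 0 and 2, so only "ab" is kept as a feature.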
1 Search substring features greedily [Ifrim 08]
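Time for extracting maximal substrings: linear
scalability.
Classification Model
We use a multi-class logistic regression model.
The conditional probability of $y$ given $d$ is defined
as
$$p(y \mid d; w) = \frac{\exp\left(w^{\top} F(d, y)\right)}{\sum_{y' \in Y} \exp\left(w^{\top} F(d, y')\right)}$$
where $F(d, y) \in \mathbb{R}^{m|Y|}$ and its $i$-th block is $f(d)\, I(y = y_i)$.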
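The block structure of $F(d, y)$ can be illustrated with a short NumPy sketch: the document vector $f(d)$ is copied into the block belonging to class $y$, all other blocks stay zero, and the class probabilities follow from a softmax over the scores. The function names and the toy weights are assumptions chosen for the example.

```python
import numpy as np


def joint_feature(f_d, y, num_classes):
    """F(d, y): f(d) placed in the y-th block, zeros elsewhere."""
    m = len(f_d)
    F = np.zeros(m * num_classes)
    F[y * m:(y + 1) * m] = f_d
    return F


def conditional_probs(w, f_d, num_classes):
    """p(y | d) for every class y under multi-class logistic regression."""
    scores = np.array([w @ joint_feature(f_d, y, num_classes)
                       for y in range(num_classes)])
    scores -= scores.max()             # stabilize the exponentials
    expd = np.exp(scores)
    return expd / expd.sum()


f_d = np.array([1.0, 2.0])             # toy document vector, m = 2
w = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 1.0])   # three classes
print(conditional_probs(w, f_d, num_classes=3))
```

With these toy weights the class scores are 1, 0, and 2, so the third class receives the highest probability.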
Future work: unsupervised learning, sequential
labeling tasks.