Title: Fast Methods for Kernel-based Text Analysis
1. Fast Methods for Kernel-based Text Analysis
- Taku Kudo
- Yuji Matsumoto
- NAIST (Nara Institute of Science and Technology)
41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan
2. Background
- Kernel methods (e.g., SVM) have become popular
- Prior knowledge can be incorporated independently of the machine learning algorithm by giving a task-dependent kernel (generalized dot product)
- High accuracy
3. Problem
- Kernel-based text analyzers are too slow for real NL applications (e.g., QA or text mining) because of their inefficiency in testing
- Some kernel-based parsers run at only 2-3 seconds per sentence
4. Goals
- Build fast but still accurate kernel-based text analyzers
- Make it possible to use them in a wider range of NL applications
5. Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
- PKI
- PKE
- Experiments
- Conclusions and Future Work
6. Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
- PKI
- PKE
- Experiments
- Conclusions and Future Work
7. Kernel Methods
(Figure: training data and the kernel-based decision function)
- No need to represent an example as an explicit feature vector
- Complexity of testing is O(L · |X|)
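To make the testing cost concrete, here is a minimal sketch (not the authors' code; names are illustrative) of the kernelized decision function over set-valued examples, where each support vector carries a weight y_j·α_j:

```python
# Minimal sketch: kernel-based classification over set-valued examples.
# Each support vector is (feature_set, weight) with weight = y_j * alpha_j.
# Every support vector is touched for every test example, hence O(L * |X|).

def classify(x, svs, kernel, b=0.0):
    """Return +1/-1 for a test example x (a set of features)."""
    score = sum(weight * kernel(sv, x) for sv, weight in svs) + b
    return 1 if score >= 0 else -1
```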
8. Kernels for Sets (1/3)
- Focus on the special case where examples are represented as sets
- Instances in NLP are usually represented as sets (e.g., bag-of-words)
(Figure: a feature set and training examples represented as sets)
9. Kernels for Sets (2/3)
- Combinations (subsets) of features are used as new features (2nd order, 3rd order, ...)
10. Kernels for Sets (3/3)
Example: dependency parsing as classification. Dependent (+1) or independent (-1)?
"I ate a cake" (PRP VBD DT NN), with the head and the modifier marked in the figure.
11. Polynomial Kernel of Degree d
Implicit form: K(X, X') = (|X ∩ X'| + 1)^d
12. Example (Cubic Kernel, d = 3)
Implicit form: K(X, X') = (|X ∩ X'| + 1)^3
Subsets of size up to 3 are used as new features
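The implicit form above is equivalent to counting the subsets shared by X and X', weighted by size: with n = |X ∩ X'|, (n + 1)^3 = 1 + 7·C(n,1) + 12·C(n,2) + 6·C(n,3). The identity can be checked numerically; the sketch below (mine, not from the slides) compares the implicit kernel with explicit subset counting on random sets.

```python
from math import comb
import random

def implicit_cubic(x, y):
    return (len(x & y) + 1) ** 3

def explicit_cubic(x, y):
    # Shared subsets of size k contribute with coefficient c_3(k) = 1, 7, 12, 6.
    n = len(x & y)
    return sum(c * comb(n, k) for k, c in enumerate((1, 7, 12, 6)))

random.seed(0)
feats = list("abcdefgh")
for _ in range(1000):
    x = set(random.sample(feats, random.randint(0, 6)))
    y = set(random.sample(feats, random.randint(0, 6)))
    assert implicit_cubic(x, y) == explicit_cubic(x, y)
```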
13. Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
- PKI
- PKE
- Experiments
- Conclusions and Future Work
14. Toy Example
Feature set: F = {a, b, c, d, e}
Examples (support vectors, L = 3):
  j | X_j       | y_j α_j
  1 | {a, b, c} |  1
  2 | {a, b, d} |  0.5
  3 | {b, c, d} | -2
Kernel: K(X, X') = (|X ∩ X'| + 1)^3
Test example: X = {a, c, e}
15. PKB (Baseline)
K(X, X') = (|X ∩ X'| + 1)^3, test example X = {a, c, e}
  j | X_j       | y_j α_j | K(X_j, X)
  1 | {a, b, c} |  1      | (2 + 1)^3
  2 | {a, b, d} |  0.5    | (1 + 1)^3
  3 | {b, c, d} | -2      | (1 + 1)^3
f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15
- Complexity is always O(L · |X|)
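The same computation as a short sketch, with the toy numbers from the slide above (variable names are mine):

```python
# Toy support vectors from the example above, with weights y_j * alpha_j.
svs = [({"a", "b", "c"},  1.0),
       ({"a", "b", "d"},  0.5),
       ({"b", "c", "d"}, -2.0)]

def pkb_score(x, svs, d=3):
    # PKB: intersect the test set with every support vector, O(L * |X|).
    return sum(w * (len(sv & x) + 1) ** d for sv, w in svs)

print(pkb_score({"a", "c", "e"}, svs))  # 1*27 + 0.5*8 - 2*8 = 15.0
```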
16. PKI (Inverted Representation)
K(X, X') = (|X ∩ X'| + 1)^3
Inverted index (feature -> support vectors containing it; B = average list size):
  a -> {1, 2}   b -> {1, 2, 3}   c -> {1, 3}   d -> {2, 3}
Support vectors: X_1 = {a, b, c}, X_2 = {a, b, d}, X_3 = {b, c, d}, with y_j α_j = 1, 0.5, -2
Test example: X = {a, c, e}
f(X) = 1·(2+1)^3 + 0.5·(1+1)^3 - 2·(1+1)^3 = 15
- Average complexity is O(B · |X| + L)
- Efficient if feature space is sparse
- Suitable for many NL tasks
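A sketch of the PKI idea (illustrative code, not the authors' implementation): the inverted index accumulates the overlaps |X_j ∩ X| by touching only the support vectors that share at least one feature with X, after which the kernel values follow directly.

```python
from collections import defaultdict

svs = [({"a", "b", "c"},  1.0),
       ({"a", "b", "d"},  0.5),
       ({"b", "c", "d"}, -2.0)]

# Build the inverted index once: feature -> ids of support vectors containing it.
index = defaultdict(list)
for j, (sv, _) in enumerate(svs):
    for f in sv:
        index[f].append(j)

def pki_score(x, svs, index, d=3):
    overlap = defaultdict(int)            # j -> |X_j ∩ X|
    for f in x:                           # roughly B index hits per test feature
        for j in index.get(f, ()):
            overlap[j] += 1
    # One kernel evaluation per support vector, using the accumulated overlaps.
    return sum(w * (overlap[j] + 1) ** d for j, (_, w) in enumerate(svs))

print(pki_score({"a", "c", "e"}, svs, index))  # 15.0, same as PKB
```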
17. PKE (Expanded Representation)
- Convert the kernel into a linear form by calculating a weight vector w
- A mapping projects X into the space of its subsets (of size up to d)
18. PKE (Expanded Representation)
K(X, X') = (|X ∩ X'| + 1)^3, so the decision function can be rewritten as
f(X) = sgn( Σ_j y_j α_j (|X_j ∩ X| + 1)^3 + b ) = sgn( Σ_{s ⊆ X, |s| ≤ 3} w(s) + b )
19. PKE in Practice
- Hard to calculate the expansion table exactly
- Use an approximated expansion table
  - Subsets with small |w| can be removed, since w represents the contribution to the final classification
- Use a subset mining (a.k.a. basket mining) algorithm for efficient calculation
20. Subset Mining Problem
Transaction database:
  id | set
  1  | a, c, d
  2  | a, b, c
  3  | a, b, d
  4  | b, c, e
Results (here σ = 2):
  a: 3, b: 3, c: 3, d: 2, {a, b}: 2, {b, c}: 2, {a, c}: 2, {a, d}: 2
- Extract all subsets that occur in no less than σ sets of the transaction database
- With no size constraint the problem is NP-hard
- Efficient algorithms have been proposed
(e.g., Apriori, PrefixSpan)
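For illustration, a minimal Apriori-style miner (far simpler than real Apriori or PrefixSpan implementations, and written only for this transcript) reproduces the result above with σ = 2:

```python
def frequent_subsets(transactions, sigma):
    """Return {subset: support} for all subsets occurring in >= sigma transactions."""
    candidates = [frozenset([i]) for i in sorted({f for t in transactions for f in t})]
    frequent = {}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= sigma}
        frequent.update(survivors)
        # Apriori-style join: combine frequent k-sets that differ in one element.
        candidates = list({a | b for a in survivors for b in survivors
                           if len(a | b) == len(a) + 1})
    return frequent

db = [{"a", "c", "d"}, {"a", "b", "c"}, {"a", "b", "d"}, {"b", "c", "e"}]
for s, n in sorted(frequent_subsets(db, 2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), n)   # a:3, b:3, c:3, d:2, then ab, ac, ad, bc each with 2
```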
21. Feature Selection as Mining
Treat the support vectors as a transaction database:
  i | X_i       | y_i α_i
  1 | {a, b, c} |  1
  2 | {a, b, d} |  0.5
  3 | {b, c, d} | -2
- Can efficiently build the approximated table
- σ controls the rate of approximation
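Putting slides 17-21 together, the sketch below builds the expansion table for the cubic kernel and classifies by table lookup. It assumes w(s) = c_3(|s|) · Σ_j y_j α_j · I(s ⊆ X_j) with c_3 = (1, 7, 12, 6), which follows from the expansion of (n + 1)^3; dropping entries with |w| below a threshold plays the role of the approximation that σ controls (the paper builds the table with subset mining rather than the brute-force enumeration used here).

```python
from itertools import combinations
from collections import defaultdict

C3 = (1, 7, 12, 6)   # cubic-kernel coefficients c_3(k) for subset sizes k = 0..3

def build_table(svs, d=3, sigma=0.0):
    """Expansion table: subset (sorted tuple) -> w, keeping only |w| >= sigma."""
    w = defaultdict(float)
    for sv, weight in svs:
        for k in range(d + 1):
            for s in combinations(sorted(sv), k):
                w[s] += C3[k] * weight
    return {s: v for s, v in w.items() if abs(v) >= sigma}

def pke_score(x, table, d=3):
    # Linear classification: sum the stored weights of all subsets of x.
    feats = sorted(x)
    return sum(table.get(s, 0.0)
               for k in range(d + 1)
               for s in combinations(feats, k))

svs = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]
table = build_table(svs)                   # sigma = 0: exact expansion
print(pke_score({"a", "c", "e"}, table))   # 15.0, same as PKB and PKI
```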
22. Outline
- Polynomial Kernel of degree d
- Fast Methods for the Polynomial Kernel
- PKI
- PKE
- Experiments
- Conclusions and Future Work
23. Experimental Settings
- Three NL tasks
- English Base-NP Chunking (EBC)
- Japanese Word Segmentation (JWS)
- Japanese Dependency Parsing (JDP)
- Kernel Settings
- Quadratic kernel is applied to EBC
- Cubic kernel is applied to JWS and JDP
24. Results (English Base-NP Chunking)
  Method          Time (sec./sent.)   Speedup   F-score
  PKB             .164                  1.0     93.84
  PKI             .020                  8.3     93.84
  PKE (σ=.01)     .0016               105.2     93.79
  PKE (σ=.005)    .0016               101.3     93.85
  PKE (σ=.001)    .0017                97.7     93.84
  PKE (σ=.0005)   .0017                96.8     93.84
25. Results (Japanese Word Segmentation)
  Method          Time (sec./sent.)   Speedup   Accuracy (%)
  PKB             .85                   1.0     97.94
  PKI             .49                   1.7     97.94
  PKE (σ=.01)     .0024               358.2     97.93
  PKE (σ=.005)    .0028               300.1     97.95
  PKE (σ=.001)    .0034               242.6     97.94
  PKE (σ=.0005)   .0035               238.8     97.94
26. Results (Japanese Dependency Parsing)
  Method          Time (sec./sent.)   Speedup   Accuracy (%)
  PKB             .285                  1.0     89.29
  PKI             .0226                12.6     89.29
  PKE (σ=.01)     .0042                66.8     88.91
  PKE (σ=.005)    .0060                47.8     89.05
  PKE (σ=.001)    .0086                33.3     89.26
  PKE (σ=.0005)   .0090                31.8     89.29
27. Results
- 2-12 fold speedup with PKI
- 30-300 fold speedup with PKE
- Accuracy is preserved when an appropriate σ is set
28. Comparison with Related Work
- XQK [Isozaki et al. 02]
- Same concept as PKE
- Designed only for the Quadratic Kernel
- Exhaustively creates the expansion table
- PKE
- Designed for general Polynomial Kernels
- Uses subset mining algorithms to create the
expansion table
29. Conclusions
- Proposed two fast methods for the polynomial kernel of degree d
  - PKI (inverted)
  - PKE (expanded)
- 2-12 fold speedup with PKI, 30-300 fold speedup with PKE
- Accuracy is preserved
30. Future Work
- Examine the effectiveness on general machine learning datasets
- Apply PKE to other convolution kernels
  - Tree Kernel [Collins 00]: dot product between trees, where the feature space is all sub-trees
  - Apply a sub-tree mining algorithm [Zaki 02]
31. English Base-NP Chunking
Extract non-overlapping noun phrases from text:
  [NP He] reckons [NP the current account deficit] will narrow to [NP only 1.8 billion] in [NP September].
- BIO representation (seen as a tagging task)
  - B: beginning of chunk
  - I: non-initial chunk
  - O: outside
- Pair-wise method for the 3-class problem
- Training: wsj 15-18, Test: wsj 20 (standard set)
32. Japanese Word Segmentation
Example sentence: "Taro made Hanako read a book" (the Japanese sentence and its candidate boundaries are shown in the original slide)
- For each point between two characters, output +1 if there is a word boundary and -1 otherwise
- Distinguish features by their relative position to the boundary
- Also use the character types of Japanese
- Training: KUC 01-08, Test: KUC 09
33Japanese Dependency Parsing
?? ???? ??? I-top cake-acc. eat
I eat a cake
- Identify the correct dependency relations
between two bunsetsu (base phrase in English) - Linguistic features related to the modifier
and head (word, POS, POS-subcat,
inflections, punctuations, etc) - Binary classification (1 dependent, -1
independent) - Cascaded Chunking Model kudo, et al. 02
- Training KUC 01-08, Test KUC 09
-
34. Kernel Methods (1/2)
Suppose a learning task with L training examples. The decision function is
  f(X) = sgn( Σ_{i=1..L} y_i α_i φ(X_i) · φ(X) + b ) = sgn( Σ_{i=1..L} y_i α_i K(X_i, X) + b )
- X: example to be classified
- X_i: training examples
- α_i: weight for examples
- φ: a function to map examples to another vectorial space
35. PKE (Expanded Representation)
If we calculate in advance, for all subsets s (I is the indicator function),
  w(s) = c_d(|s|) Σ_{i=1..L} y_i α_i I(s ⊆ X_i)
then classification reduces to summing w over the subsets of X.
36. TRIE Representation
Expansion table shown in the figure (subset -> w):
  a: 10.5, d: -10.5, {a,b}: 12, {a,c}: 12, {b,c}: -12, {b,d}: -18, {c,d}: -24, {b,c,d}: -12
(Figure: the table stored as a TRIE rooted at "root", so common prefixes are shared)
- Compress redundant structures
- Classification can be done by simply
traversing the TRIE
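A sketch of such a traversal (mine, not the authors' implementation), using the exact expansion table for the toy support vectors as computed in the PKE sketch above: each subset is stored along a trie path of sorted features, and classification walks only the branches whose features occur in X, so shared prefixes are visited once.

```python
from itertools import combinations
from collections import defaultdict

# Rebuild the exact expansion table for the toy support vectors (cubic kernel),
# exactly as in the PKE sketch above: subset (sorted tuple) -> weight w.
svs = [({"a", "b", "c"}, 1.0), ({"a", "b", "d"}, 0.5), ({"b", "c", "d"}, -2.0)]
C3 = (1, 7, 12, 6)
table = defaultdict(float)
for sv, weight in svs:
    for k in range(4):
        for s in combinations(sorted(sv), k):
            table[s] += C3[k] * weight

class Node:
    def __init__(self):
        self.children = {}   # feature -> child Node
        self.w = 0.0         # weight of the subset that ends at this node

def build_trie(table):
    root = Node()
    for subset, w in table.items():
        node = root
        for f in subset:
            node = node.children.setdefault(f, Node())
        node.w = w
    return root

def trie_score(x, root):
    """Sum w over all subsets of x by walking only the matching trie branches."""
    feats = sorted(x)
    def walk(node, start):
        total = node.w
        for i in range(start, len(feats)):
            child = node.children.get(feats[i])
            if child is not None:
                total += walk(child, i + 1)
        return total
    return walk(root, 0)

print(trie_score({"a", "c", "e"}, build_trie(table)))  # 15.0, same as before
```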
37. Kernel Methods
(Figure: training data and the kernel-based decision function, as on slide 7)
- No need to represent an example as an explicit feature vector
- Complexity of testing is O(L · |X|)