Title: Text Categorization: Support Vector Machines
1 Text Categorization: Support Vector Machines
2 (No transcript)
3 Modeling of Text Categorization (1)
- Each text is converted into a vector x_i. A component of x_i describes the frequency of a certain word in this text.
- We take d words into our dictionary. These are all the words we want to consider in our problem. We call them features. Together they span the feature space χ. Thus x_i ∈ χ ⊆ ℝ^d.
- We have a predefined set of categories {Category_1, ..., Category_k}.
- The label y_i is the category of x_i: y_i ∈ {Category_1, ..., Category_k}.
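As a minimal sketch of this text-to-vector step (the dictionary words and the sample text below are invented for illustration; a real system would use a much larger dictionary and a proper tokenizer):

```python
from collections import Counter

# Hypothetical dictionary of d feature words (real dictionaries are much larger).
dictionary = ["euro08", "bush", "beatles", "goal", "election", "guitar"]

def text_to_vector(text):
    """Map a text to x in R^d: component j counts how often
    the j-th dictionary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dictionary]

# A short sports text becomes a sparse frequency vector.
x = text_to_vector("euro08 kicks off with a goal a late goal")
print(x)  # [1, 0, 0, 2, 0, 0]
```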
4 Modeling of Text Categorization (2)
- Training data are (x_1,y_1), (x_2,y_2), ..., (x_n,y_n).
- A classifier is a function which maps a text to a category: y = c(x), c: χ → {Category_1, ..., Category_k}.
- The text which we want to classify is x_{n+1}.
- The categorization problem is the following: what is y_{n+1} ∈ {Category_1, ..., Category_k} for x_{n+1} ∈ χ?
5 Example
- [Figure: the feature space χ ⊆ ℝ^d with word axes "Bush", "Beatles", "Euro08"; a new document x_20 = (2,0,0)^T is plotted, and its category y_20 = c(x_20) = ? is asked for among Sport, Politics, Music.]
6 Text Categorization
- High-dimensional feature space χ
- Sparse text vectors x_i
- Few irrelevant words
- Stopwords
7 Support Vector Machine (SVM)
- [Figure: two panels separating Sport from Politics in the (Feature 1, Feature 2) plane, each with a separating hyperplane and margins m1 and m2; the SVM prefers the hyperplane with the larger margin, and the training points lying on the margin are the support vectors.]
8 Nonlinear Dividing Line
- If the classes cannot be separated by a line in the input space, a kernel function maps the data into a higher-dimensional space where a separating hyperplane exists.
- Φ: ℝ² → ℝ³, (F_1, F_2) ↦ (Z_1, Z_2, Z_3) = (F_1², √2·F_1·F_2, F_2²)
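The benefit of a kernel is that the inner product in the mapped space can be computed directly in the input space: for the map above, Φ(x)^T Φ(z) = (x^T z)². A quick numeric check (toy values):

```python
import numpy as np

def phi(f):
    """The feature map Phi: R^2 -> R^3 from this slide."""
    f1, f2 = f
    return np.array([f1**2, np.sqrt(2.0) * f1 * f2, f2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

print(phi(x) @ phi(z))  # 16.0: inner product after the explicit mapping
print((x @ z) ** 2)     # 16.0: polynomial kernel K(x, z) = (x^T z)^2
```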
9 Soft Margin
- [Figure: a data set that is linearly separable except for one outlier; the soft margin allows such points to violate the margin.]
10 More than two Categories
- [Figure: a new document x = x_{n+1} plotted among three categories; which label y_{n+1} does it get?]
11 Mathematical Formulation
- Training data: (x_1,y_1), (x_2,y_2), ..., (x_n,y_n)
- The separating hyperplane is described by a normal vector w and a translation parameter b, so w^T x + b = 0 holds for each point x on the plane.
- For support vectors (on the dashed lines): w^T x_i + b = ±m
- Labels: y_i = Category_1 (+1) if w^T x_i + b ≥ m; y_i = Category_2 (−1) if w^T x_i + b ≤ −m
- Classifier c: y_{n+1} = c(x_{n+1}) = sgn(w^T x_{n+1} + b)
- [Figure: the separating hyperplane with normal vector w and margin m in the (Feature 1, Feature 2) plane.]
12 Learning Problem
- Find w (normalized: ||w|| = 1) and b,
- such that the margin m is maximized:
- Maximize m
- (m = geometric margin, see figure)
- Subject to: ∀ x_i ∈ χ: y_i(w^T x_i + b) ≥ m
- [Figure: as on slide 11, with geometric margin m.]
13 Alternative Formulation without m!
- Rescaling: w → w/m, b → b/m
- ⇒ m² = 1/||w||² = 1/(2·½w^Tw)
- (without derivation!)
- Minimize ||w|| for a fixed functional margin m = 1
- ⇒ Minimize ½w^Tw
- Subject to: ∀ x_i ∈ χ: y_i(w^T x_i + b) ≥ 1
- ⇒ Generalized Lagrange function:
- L(w, b, α) = ½w^Tw − Σ_i α_i·[y_i(w^T x_i + b) − 1]
- ⇒ Find the saddle point:
- Minimize over w and b
- Maximize over the α_i ≥ 0
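The deck defers the full derivation to the handout; as a one-step bridge to the next slide, setting the derivatives of L to zero at the saddle point gives:

```latex
\frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0
  \;\Rightarrow\; w = \sum_i \alpha_i y_i x_i,
\qquad
\frac{\partial L}{\partial b} = -\sum_i \alpha_i y_i = 0
  \;\Rightarrow\; \sum_i \alpha_i y_i = 0
```

Substituting w = Σ_i α_i y_i x_i into sgn(w^T x + b) yields the dual form of the decision function on the next slide.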
14 Solution
- Solving this optimization problem analytically leads us to the decision function (= classifier c) of our text categorization problem:
- y_{n+1} = c(x_{n+1}) = sgn(w^T x_{n+1} + b) = sgn(Σ_i α_i·y_i·x_i^T x_{n+1} + b)
- (difficult derivation, see handout)
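A minimal numpy sketch of this decision function; the support vectors, labels, multipliers α_i, and offset b below are invented toy values, not the output of an actual solver:

```python
import numpy as np

# Invented toy values; a trained SVM would supply these.
X_sv  = np.array([[1.0, 2.0], [2.0, 0.5]])  # support vectors x_i
y_sv  = np.array([+1.0, -1.0])              # their labels y_i
alpha = np.array([0.4, 0.4])                # Lagrange multipliers alpha_i
b = -0.1

def classify(x_new):
    """c(x) = sgn( sum_i alpha_i * y_i * x_i^T x + b )"""
    return np.sign((alpha * y_sv) @ (X_sv @ x_new) + b)

print(classify(np.array([1.5, 2.5])))  # 1.0, i.e. the positive category
```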
15 Soft Margin
- Introduce a cost function!
- Minimize ½w^Tw + C·Σ_i ξ_i
- Subject to:
- ∀ x_i ∈ χ: y_i(w^T x_i + b) ≥ 1 − ξ_i
- ∀ x_i ∈ χ: ξ_i ≥ 0
- Cost parameter C
- [Figure: two points violate the margin, with slack variables ξ_1 and ξ_2 measuring the violations in the (Feature 1, Feature 2) plane.]
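In practice this cost parameter is exposed directly by SVM libraries; a sketch using scikit-learn (the tiny 2-D data set is invented) showing the usual tendency that a small C buys a wider margin by paying slack for the outlier, while a large C makes slack expensive:

```python
import numpy as np
from sklearn.svm import SVC

# Invented 2-D data: two clusters plus one positive outlier among the negatives.
X = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3], [0.5, 0.5]])
y = np.array([-1, -1, -1, +1, +1, +1, +1])  # the last point is the outlier

for C in (0.1, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Geometric margin 1/||w||; it typically shrinks as C grows.
    print(f"C={C}: margin = {1.0 / np.linalg.norm(w):.2f}")
```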
16 Quality Measure
- How good is the classifier we trained previously?
- Find a lower bound for the margin m!
17 Text Categorization Example
- We have 3 predefined categories: Music, Politics, Sport.
- Training data: 100 documents per category. Each document consists of exactly 150 words.
- Feature space: we choose 20,000 words for the dictionary, so the feature space χ has dimension 20,000. We assume that every word in the training documents is in the dictionary.
- We use "one against many" (a sketch follows below):
- Sport against ¬Sport
- (¬Sport = Music ∪ Politics)
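A small sketch of this "one against many" relabeling (document names invented):

```python
# Relabel the multi-class training data for the binary task "Sport vs. not-Sport".
docs = [("doc1", "Sport"), ("doc2", "Music"), ("doc3", "Politics"), ("doc4", "Sport")]

# not-Sport = Music u Politics, so both map to the negative class.
binary = [(name, +1 if cat == "Sport" else -1) for name, cat in docs]
print(binary)  # [('doc1', 1), ('doc2', -1), ('doc3', -1), ('doc4', 1)]
```

One such binary problem is set up per category; a new document can then be assigned to the category whose binary classifier gives the largest decision value.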
18 Odds Ratio
- [The formula and the worked examples on this slide were images; the odds ratio of a candidate feature compares how likely it is to occur in a Sport document versus in a ¬Sport document.]
- ⇒ The odds ratio of "Iraq" is …
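Since the slide's formula survived only as an image, here is the common definition of the odds ratio as a feature-selection score, assumed to be the one intended (it matches the interpretation on the next slide); the counts for "Iraq" are invented:

```python
def odds_ratio(df_pos, n_pos, df_neg, n_neg):
    """OR(w) = odds(w | Sport) / odds(w | not-Sport), where df_* counts the
    documents containing w. Assumed standard definition; the deck's exact
    formula was lost in conversion."""
    p = df_pos / n_pos   # P(w | Sport)
    q = df_neg / n_neg   # P(w | not-Sport)
    return (p * (1 - q)) / ((1 - p) * q)

# Invented counts: "Iraq" in 5 of 100 Sport docs and 40 of 200 not-Sport docs.
print(round(odds_ratio(5, 100, 40, 200), 2))  # 0.21 < 1: points away from Sport
```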
19 Sorting Features
- An odds ratio of
- = 1 means that the feature fits Sport as well as ¬Sport. Such a feature carries no information, e.g. stopwords.
- > 1 means that the feature helps to identify the category Sport.
- < 1 means that the feature more likely does not belong to Sport.
20 Example
TCat([p_1:n_1:f_1], ..., [p_s:n_s:f_s])-concept: group i contains f_i features; each positive example contains p_i occurrences from the group, each negative example n_i.
- TCat_Sport( [58:42:105],             stopwords
-   [26:8:96], [11:27:158],            high freq.
-   [14:3:864], [6:27:1602],           medium freq.
-   [4:1:2108], [2:10:6231],           low freq.
-   [29:32:8836]                       irrelevant
- )
⇒ Subsets instead of words! Easier to find a lower bound!
21 Find a Lower Bound (1)
- Define p = (p_1,...,p_s)^T, n = (n_1,...,n_s)^T, F = diag(f_1,...,f_s)
- For SVMs with a hyperplane passing through the origin and without soft margin, the following optimization problem holds (see section 3.2):
- W(w*) = min ½w^Tw, s.t. ∀ x_i ∈ χ: y_i·w^T x_i ≥ 1
- It holds m² = 1/||w*||² = 1/(2·½w*^Tw*) = 1/(2W(w*)) for the solution vector w*.
- Simplification of the optimization problem:
- Let us add the constraint that within each group of f_i features the weights are required to be identical. Then w^Tw = v^TFv, v ∈ ℝ^s.
- By definition, each example contains a certain number of features from each group. This means that all constraints for positive examples are equivalent to p^Tv ≥ 1, and those for negative examples to n^Tv ≤ −1.
- ⇒ V(v*) = min ½v^TFv, s.t. p^Tv ≥ 1, n^Tv ≤ −1
- v* is the solution vector. Since the extra constraint shrinks the feasible set, V(v*) ≥ W(w*) ⇒ m² ≥ 1/(2V(v*)), a lower bound.
22 Find a Lower Bound (2)
- Introducing and solving Lagrange multipliers:
- L(v, α+, α−) = ½v^TFv − α+·(v^Tp − 1) + α−·(v^Tn + 1), α+ ≥ 0, α− ≥ 0
- ⇔ v = F^{−1}·(α+·p − α−·n)
- For ease of notation we write
- v = F^{−1}XYα, with X = (p, n), Y = diag(1, −1), α^T = (α+, α−)
- ⇒ L(α) = 1^Tα − ½·α^T·YX^TF^{−1}XY·α
- Maximize L(α), s.t. α+ ≥ 0, α− ≥ 0
- Since only a lower bound on the margin is needed, it is possible to drop the constraints α+ ≥ 0 and α− ≥ 0, because removing these constraints can only increase the objective function at the solution. So the unconstrained maximum L(α*) is greater than or equal to the constrained one.
- ⇔ α* = (YX^TF^{−1}XY)^{−1}·1
- ⇒ L(α*) = ½·1^T·(YX^TF^{−1}XY)^{−1}·1   (*)
- The special form of (YX^TF^{−1}XY) makes it possible to compute its inverse in closed form.
- Substituting it into (*) yields the bound on the next slide.
23 Lower Bound for the Margin m
- For TCat([p_1:n_1:f_1], ..., [p_s:n_s:f_s])-concepts, there is always a hyperplane passing through the origin that has a margin m bounded by
- m² ≥ (a·c − b²) / (a + 2b + c), with a = Σ_i p_i²/f_i, b = Σ_i p_i·n_i/f_i, c = Σ_i n_i²/f_i
24 Our Example
TCat_Sport( [58:42:105],             stopwords
  [26:8:96], [11:27:158],            high freq.
  [14:3:864], [6:27:1602],           medium freq.
  [4:1:2108], [2:10:6231],           low freq.
  [29:32:8836]                       irrelevant
)
- a = 58²/105 + 26²/96 + 11²/158 + 14²/864 + 6²/1602 + 4²/2108 + 2²/6231 + 29²/8836 ≈ 40.20
- b = 58·42/105 + 26·8/96 + 11·27/158 + 14·3/864 + 6·27/1602 + 4·1/2108 + 2·10/6231 + 29·32/8836 ≈ 27.51
- c = 42²/105 + 8²/96 + 27²/158 + 3²/864 + 27²/1602 + 1²/2108 + 10²/6231 + 32²/8836 ≈ 22.68
- m² ≥ (40.20·22.68 − 27.51²) / (40.20 + 2·27.51 + 22.68) ≈ 1.32
- ⇒ The lower bound is m ≥ 1.15!
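The slide's arithmetic can be reproduced directly from the TCat groups; this sketch recomputes a, b, c and the bound:

```python
# (p_i, n_i, f_i) groups of TCat_Sport from slide 20.
groups = [(58, 42, 105), (26, 8, 96), (11, 27, 158), (14, 3, 864),
          (6, 27, 1602), (4, 1, 2108), (2, 10, 6231), (29, 32, 8836)]

a = sum(p * p / f for p, n, f in groups)  # ~ 40.20
b = sum(p * n / f for p, n, f in groups)  # ~ 27.51
c = sum(n * n / f for p, n, f in groups)  # ~ 22.68

m2 = (a * c - b * b) / (a + 2 * b + c)    # lower bound for m^2
print(round(m2, 2), round(m2 ** 0.5, 2))  # 1.32 1.15
```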
25 Questions & Remarks