Title: David Newman, UC Irvine, Lecture 7: SVMs
1ICS 278 Data Mining, Lecture 7: Support Vector Machines
- David Newman
- Department of Computer Science
- University of California, Irvine
2Notices
- Project proposals
- Homework 2 questions?
3What gets published?
4Writing tips
- Use active voice
  - A. Predictions of cancer rates were made by our classifier
  - B. Our classifier predicted cancer rates
- Don't use quotes unless you are quoting
  - We will compare to "ground truth"
  - We observed a "bias" in the data
- Use short sentences
  - 2-3 lines max. Check 3-line sentences
- Use formal, correct language
  - The classifier guessed the class labels for test data
  - We will use a bunch of features to predict Y
5Writing tips (cont.)
- Use inclusive language
  - First our crawler will find researchers. For researcher X, our crawler will also find where he got his Ph.D.
- Replace ambiguous pronouns (it, that), or check that the meaning is clear
- Don't use both e.g. and etc.
  - Some classifiers, e.g. Naïve Bayes, SVM, Decision Tree, etc. are ideal for
- Look for words to delete that don't add meaning
  - A. This effort may prove to be applicable to other domains
  - B. This effort may be applicable to other domains
6Writing tips (cont.)
- My pet peeve
  - We will employ this algorithm to
  - We will use this algorithm to
  - Say "employ" only if you are going to pay the algorithm
7Sentence re-writes
- However, most recognizers are prone to making errors.
- We are interested in discovering these constraints or patterns in an automatic way.
- A list of keywords that are frequent in pages which are announcing events are fed into the search API.
- For each of the models, the parameters that do best on the evaluation set are used for testing.
- The goal of modeling network growth by evolution presented in this proposal is to study the process of genome evolution.
- Old links are deleted when new connections satisfy generated or predicted rules better.
8Homework 2
- Classifiers
  - K-Nearest Neighbors
  - Naïve Bayes
    - Bernoulli
    - Multinomial
  - Support Vector Machine
    - 1 vs. rest
    - 1 vs. 1
- Weka
  - e.g. Decision Tree
9Feature Selection
- Performance of text classification algorithms can be improved by selecting a subset of the discriminative terms
  - See classification results later in these slides
- Greedy search (Chakrabarti 5.5)
  - Start from the empty set or the full set and add/delete one term at a time
- Heuristics for adding/deleting
  - Information gain (mutual information of term with class, e.g. the McCallum and Nigam 1998 paper); a sketch follows after this list
  - Chi-square
  - Other ideas
- Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
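As a concrete illustration of the information-gain heuristic, here is a minimal sketch that scores each term by its mutual information with the class label and keeps the top-scoring terms. The names (X as a dense document-term count matrix, y as a label vector) are assumptions for illustration, not part of the homework.

import numpy as np

def information_gain(X, y):
    # Mutual information of each (binarized) term with the class label:
    # sum over term in {0,1} and class c of P(term, c) * log( P(term, c) / (P(term) P(c)) )
    X = (X > 0).astype(float)                   # term present / absent per document
    n_docs, n_terms = X.shape
    eps = 1e-12                                 # avoid log(0)
    p_t1 = X.mean(axis=0)                       # P(term = 1)
    scores = np.zeros(n_terms)
    for c in np.unique(y):
        in_c = (y == c)
        p_c = in_c.mean()                       # P(class = c)
        p_t1_c = X[in_c].sum(axis=0) / n_docs   # P(term = 1, class = c)
        p_t0_c = p_c - p_t1_c                   # P(term = 0, class = c)
        scores += p_t1_c * np.log((p_t1_c + eps) / (p_t1 * p_c + eps))
        scores += p_t0_c * np.log((p_t0_c + eps) / ((1 - p_t1) * p_c + eps))
    return scores

# Keep the k highest-scoring terms (k is a tuning choice):
# top_k = np.argsort(information_gain(X_train, y_train))[::-1][:300]
# X_train_selected = X_train[:, top_k]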
10Effect of Feature Selection
Figure (from Chakrabarti, Fig 5.5): 9,600 documents from the US Patent database, 20,000 raw features (terms)
11Comparing Naïve Bayes models
- McCallum and Nigam (1998)
  - Found that multinomial outperformed Bernoulli in text classification experiments (a sketch of the two models follows below)
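To make the distinction concrete, here is a minimal sketch of the two event models using scikit-learn; the toy documents and labels are made up, and this is only a reference point, not the homework setup.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["stock prices fall", "rates rise again", "team wins title", "coach praises team"]
labels = ["finance", "finance", "sports", "sports"]

X_counts = CountVectorizer().fit_transform(docs)              # word counts
multinomial = MultinomialNB().fit(X_counts, labels)           # models how often each word occurs

X_binary = CountVectorizer(binary=True).fit_transform(docs)   # word present / absent
bernoulli = BernoulliNB().fit(X_binary, labels)               # models presence / absence only

print(multinomial.predict(X_counts), bernoulli.predict(X_binary))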
12WebKB Data Set
- Train on 5,000 hand-labeled web pages
- Cornell, Washington, U.Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results
13Comparing Bernoulli and Multinomial on WebKB Data
14Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
15Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
16 Comparing Bernoulli and Multinomial
Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics
17Note
- For Homework 2, we will NOT do feature selection
18Beyond independence
- Naïve Bayes assumes conditional independence of words given the class
- Alternative approaches try to account for higher-order dependencies
- Bayesian networks
  - p(x | c) = prod_j p(x_j | parents(x_j), c)
  - Equivalent to a directed graph where edges represent direct dependencies
  - Various algorithms search for a good network structure
  - Useful for improving the quality of the distribution model
  - ... however, this does not always translate into better classification
- Maximum entropy models (a sketch follows after this list)
  - p(x | c) = (1/Z) prod_S f_S(x_S, c), a product of functions f over subsets S of the terms
  - Equivalent to an undirected graphical model
  - Estimation is equivalent to a maximum entropy assumption
  - Feature selection is crucial (which f terms to include)
  - Can provide high accuracy classification
  - ... however, tends to be computationally complex to fit (estimating Z is difficult)
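In practice, a conditional maximum entropy text classifier with word-count features is equivalent to multinomial logistic regression, so a minimal sketch can lean on an off-the-shelf implementation; the toy data and names below are placeholders, not from the lecture.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["rates rise on bonds", "bank cuts interest rate", "team wins the final"]
labels = ["finance", "finance", "sports"]

X = CountVectorizer().fit_transform(docs)     # feature functions: word counts f(x)
maxent = LogisticRegression(max_iter=1000)    # fits the weights by maximum likelihood
maxent.fit(X, labels)
print(maxent.predict_proba(X))                # conditional model p(c | x)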
19Linear classifiers and SVM basics
- Decision boundary: w^T x + b = 0
- w is the direction vector (normal to the boundary)
- Distance of x from the boundary is (w^T x + b) / ||w||
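A minimal numeric sketch of this geometry; the weight vector, bias, and query point below are made-up values for illustration.

import numpy as np

w = np.array([2.0, 1.0])                 # weight vector, normal to the hyperplane
b = -4.0                                 # bias
x = np.array([3.0, 1.0])                 # a query point

score = w @ x + b                        # w^T x + b; the sign gives the predicted class
distance = score / np.linalg.norm(w)     # signed distance of x from the boundary
print(score, distance)                   # 3.0, about 1.34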
20Optimal Hyperplane, Support Vectors, and Margin
- Circles: support vectors, the points (on the convex hulls) that are closest to the hyperplane
- M: margin, the distance of the support vectors from the hyperplane
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)
21SVM setup 1
- Data x_i, targets t_i in {-1, +1}, i = 1..N
- Assume linearly separable: we can find w, b such that t_i (w^T x_i + b) > 0 for all i
- Distance of point x_i from the decision surface y(x) = w^T x + b = 0 is t_i y(x_i) / ||w||
22SVM setup 2
- Maximum margin solution: choose w, b to maximize the distance to the closest point, arg max over w, b of { (1/||w||) min_i [ t_i (w^T x_i + b) ] }
- Rescale w -> k w and b -> k b so that, for the point x_j closest to the surface, t_j (w^T x_j + b) = 1
- Then all data points satisfy the constraints t_i (w^T x_i + b) >= 1, i = 1..N
- Optimization problem: maximize 1/||w|| over w, b subject to these constraints
23SVM setup 3
- Minimize (1/2)||w||^2 subject to the constraints t_i (w^T x_i + b) >= 1, i = 1..N
- Lagrangian: L(w, b, a) = (1/2)||w||^2 - sum_i a_i [ t_i (w^T x_i + b) - 1 ], with a_i >= 0
- Solve by setting the derivatives with respect to w and b to zero
- -> Get a quadratic programming problem (written out below)
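For reference, the quadratic program this leads to is the standard dual form, obtained by substituting w = sum_i a_i t_i x_i and sum_i a_i t_i = 0 back into the Lagrangian; the slide does not spell it out, so this is the textbook version:

\max_{a}\; \tilde{L}(a) = \sum_{i=1}^{N} a_i
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j \, t_i t_j \, x_i^{\top} x_j
\qquad \text{subject to}\qquad a_i \ge 0, \quad \sum_{i=1}^{N} a_i t_i = 0 .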
24SVM setup 4
- Not separable: introduce slack variables xi_i >= 0 and relax the constraints to t_i (w^T x_i + b) >= 1 - xi_i
- Minimize C sum_i xi_i + (1/2)||w||^2
- C is the regularization parameter (it trades margin size against the slack penalty); see the sketch below
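A minimal soft-margin sketch with scikit-learn, where C plays exactly this role; the 2-D toy data is made up for illustration and is not the lecture's data.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
t = np.array([-1, -1, -1, 1, 1, 1])        # targets in {-1, +1}

clf = SVC(kernel="linear", C=1.0)          # solves the soft-margin quadratic program
clf.fit(X, t)
print(clf.support_vectors_)                # the support vectors
print(clf.coef_, clf.intercept_)           # learned w and b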
25Support Vector Machine
- Unique solution for a linearly separable data set
- Margin M of the classifier
  - the distance between the separating hyperplane and the closest training samples
  - optimal separating hyperplane -> maximum margin
- This results in a quadratic programming optimization problem
- Good news
  - convex function of the unknowns, unique optimum
  - variety of well-known algorithms for finding this optimum
- Bad news
  - quadratic programming in general scales as O(n^3)
  - in practice takes O(n^a), where a = 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166)
26From Chakrabarti, Chapter 5, 2002: Timing results on text classification
27Multi-class classification
- SVM does binary classification: y in {-1, +1}
- Build a K-class classifier by combining binary classifiers; classes c in {c1, c2, c3, ..., cK}
- 1 vs. rest
  - build classifier for y in {c1, not c1}
  - ...
  - build classifier for y in {ci, not ci}
  - ...
  - build classifier for y in {cK, not cK}
- 1 vs. 1
  - build classifier for y in {c1, c2}
  - ...
  - build classifier for y in {ci, cj}
  - ...
  - build classifier for y in {cK-1, cK}
28 1 vs. rest
[Figure: the three 1-vs-rest decision boundaries (C1 / not C1, C2 / not C2, C3 / not C3) carve the plane into regions R1-R7]
29 1 vs. rest
[Figure: the same 1-vs-rest boundaries, with regions assigned to C1, C2, C3 and ambiguous regions marked "?"]
30 1 vs. rest
- Learn K 1-vs-rest classifiers
  - for k = 1..K:  y_k(x) = svm_predict(x) using classifier k
- Predict class for test document x
  - class(x) = arg max_k y_k(x)
- -> Issue: no guarantee that y_k(x) from different classifiers will have comparable scales (a sketch follows below)
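A minimal sketch of this 1-vs-rest scheme; the helper names below are assumptions for illustration (scikit-learn's LinearSVC applies a one-vs-rest strategy internally, so you would not normally write this yourself).

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    # one binary classifier per class k: "class k" vs. "everything else"
    return {k: LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in classes}

def predict_one_vs_rest(models, X):
    # decision_function gives the real-valued score y_k(x); arg max is the usual
    # heuristic even though the scores of different classifiers are not calibrated
    classes = list(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return [classes[i] for i in np.argmax(scores, axis=1)]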
31 1 vs. 1
[Figure: the pairwise decision boundaries (C1 vs. C2, C1 vs. C3, C2 vs. C3) carve the plane into regions R1-R4]
32 1 vs. 1
[Figure: the same pairwise boundaries, with most regions assigned to C1, C2, or C3 by majority vote and one ambiguous region marked "?"]
33 1 vs. 1
[Figure: the same regions labeled by pairwise votes; regions where the votes give no clear winner are marked "?"]
34 1 vs. 1
- Train K(K-1)/2 binary classifiers
- Classify test docs using all K(K-1)/2 classifiers
- Predicted class is the class that gets the most votes (see the sketch below)
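A minimal sketch of 1-vs-1 voting; the function names are assumptions for illustration (scikit-learn's SVC uses this pairwise scheme for multi-class problems, so again you would not normally code it by hand).

from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_one(X, y, classes):
    # one binary classifier for every unordered pair of classes: K(K-1)/2 in total
    models = {}
    for ci, cj in combinations(classes, 2):
        mask = np.isin(y, [ci, cj])
        models[(ci, cj)] = LinearSVC(C=1.0).fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, x):
    # each pairwise classifier casts one vote; the class with the most votes wins
    votes = Counter(clf.predict(x.reshape(1, -1))[0] for clf in models.values())
    return votes.most_common(1)[0][0]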
35Classic Reuters Data Set
- 21,578 documents, labeled manually
  - 9,603 training, 3,299 test articles
  - 118 categories
  - An article can be in more than one category
  - Learn 118 binary category distinctions (see the sketch below)
- Example: an interest-rate article
  - 2-APR-1987 06:35:19.50
  - west-germany
  - b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  - FRANKFURT, March 2
  - The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Most common categories (number of training, test documents)
  - Earn (2,877, 1,087)
  - Acquisitions (1,650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)
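A minimal sketch of the "118 binary category distinctions" idea: because a Reuters article can carry several labels, each category gets its own binary classifier and a document is assigned every category whose classifier fires (no arg max). The data layout and names below are assumptions for illustration.

from sklearn.svm import LinearSVC

def train_per_category(X, Y):
    # Y[c] is a 0/1 indicator vector: does each training document belong to category c?
    return {c: LinearSVC(C=1.0).fit(X, Y[c]) for c in Y}

def assign_categories(models, x):
    # a document can receive zero, one, or several categories
    return [c for c, clf in models.items() if clf.predict(x.reshape(1, -1))[0] == 1]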
36Dumais et al. 1998 Reuters - Accuracy
37Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998) using the Reuters data set
38Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.