Title: David Newman, UC Irvine, Lecture 7: SVMs
1ICS 278 Data Mining, Lecture 7: Support Vector Machines
- David Newman
- Department of Computer Science
- University of California, Irvine
2Notices
- Project proposals
- Homework 2 questions?
3What gets published?
4Writing tips
- Use active voice
  - A. Predictions of cancer rates were made by our classifier
  - B. Our classifier predicted cancer rates
- Don't use quotes unless you are quoting
  - We will compare to "ground truth"
  - We observed a "bias" in the data
- Use short sentences
  - 2-3 lines max. Check 3-line sentences
- Use formal, correct language
  - The classifier guessed the class labels for test data
  - We will use a bunch of features to predict Y
5Writing tips (cont.)
- Use inclusive language
  - First our crawler will find researchers. For researcher X, our crawler will also find where he got his Ph.D.
- Replace ambiguous pronouns (it, that), or check that the meaning is clear
- Don't use both e.g. and etc.
  - Some classifiers, e.g. Naïve Bayes, SVM, Decision Tree, etc. are ideal for
- Look for words to delete that don't add meaning
  - A. This effort may prove to be applicable to other domains
  - B. This effort may be applicable to other domains
6Writing tips (cont.)
- My pet peeve
  - We will employ this algorithm to
  - We will use this algorithm to
  - Say "employ" only if you are going to pay the algorithm
7Sentence re-writes
- However, most recognizers are prone to making errors.
- We are interested in discovering these constraints or patterns in an automatic way.
- A list of keywords that are frequent in pages which are announcing events are fed into the search API.
- For each of the models, the parameters that do best on the evaluation set are used for testing.
- The goal of modeling network growth by evolution presented in this proposal is to study the process of genome evolution.
- Old links are deleted when new connections satisfy generated or predicted rules better.
8Homework 2
- Classifiers
  - K-Nearest Neighbors
  - Naïve Bayes
    - Bernoulli
    - Multinomial
  - Support Vector Machine
    - 1 vs. rest
    - 1 vs. 1
- Weka
  - e.g. Decision Tree
9Feature Selection
- Performance of text classification algorithms can be improved by selecting a subset of the discriminative terms
  - See classification results later in these slides
- Greedy search (Chakrabarti 5.5)
  - Start from the empty set or the full set and add/delete one term at a time
- Heuristics for adding/deleting
  - Information gain (mutual information of term with class, e.g. the McCallum and Nigam 1998 paper); a sketch follows after this list
  - Chi-square
  - Other ideas
- Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
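As a concrete illustration of the information-gain heuristic, here is a minimal sketch that scores each term by its mutual information with the class label and keeps the top-scoring terms. The names (X as a dense document-term count matrix, y as a label vector) are assumptions for illustration, not part of the homework.

import numpy as np

def information_gain(X, y):
    # Mutual information of each (binarized) term with the class label:
    # sum over term in {0,1} and class c of P(term, c) * log( P(term, c) / (P(term) P(c)) )
    X = (X > 0).astype(float)                   # term present / absent per document
    n_docs, n_terms = X.shape
    eps = 1e-12                                 # avoid log(0)
    p_t1 = X.mean(axis=0)                       # P(term = 1)
    scores = np.zeros(n_terms)
    for c in np.unique(y):
        in_c = (y == c)
        p_c = in_c.mean()                       # P(class = c)
        p_t1_c = X[in_c].sum(axis=0) / n_docs   # P(term = 1, class = c)
        p_t0_c = p_c - p_t1_c                   # P(term = 0, class = c)
        scores += p_t1_c * np.log((p_t1_c + eps) / (p_t1 * p_c + eps))
        scores += p_t0_c * np.log((p_t0_c + eps) / ((1 - p_t1) * p_c + eps))
    return scores

# Keep the k highest-scoring terms (k is a tuning choice):
# top_k = np.argsort(information_gain(X_train, y_train))[::-1][:300]
# X_train_selected = X_train[:, top_k]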
10Effect of Feature Selection
Figure (from Chakrabarti, Fig 5.5): 9,600 documents from the US Patent database, 20,000 raw features (terms)
11Comparing Naïve Bayes models
- McCallum and Nigam (1998)
  - Found that multinomial outperformed Bernoulli in text classification experiments (a sketch of the two models follows below)
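To make the distinction concrete, here is a minimal sketch of the two event models using scikit-learn; the toy documents and labels are made up, and this is only a reference point, not the homework setup.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["stock prices fall", "rates rise again", "team wins title", "coach praises team"]
labels = ["finance", "finance", "sports", "sports"]

X_counts = CountVectorizer().fit_transform(docs)              # word counts
multinomial = MultinomialNB().fit(X_counts, labels)           # models how often each word occurs

X_binary = CountVectorizer(binary=True).fit_transform(docs)   # word present / absent
bernoulli = BernoulliNB().fit(X_binary, labels)               # models presence / absence only

print(multinomial.predict(X_counts), bernoulli.predict(X_binary))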
12WebKB Data Set
- Train on 5,000 hand-labeled web pages
- Cornell, Washington, U.Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results
13Comparing Bernoulli and Multinomial on WebKB Data
14Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
15Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
16 Comparing Bernoulli and Multinomial
Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics
17Note
- For Homework 2, we will NOT do feature selection
18Beyond independence
- Naïve Bayes assumes conditional independence of words given the class
- Alternative approaches try to account for higher-order dependencies
- Bayesian networks
  - p(x | c) = prod_j p(x_j | parents(x_j), c)
  - Equivalent to a directed graph where edges represent direct dependencies
  - Various algorithms search for a good network structure
  - Useful for improving the quality of the distribution model
  - ... however, this does not always translate into better classification
- Maximum entropy models (a sketch follows after this list)
  - p(x | c) = (1/Z) prod_S f_S(x_S, c), a product of functions f over subsets S of the terms
  - Equivalent to an undirected graphical model
  - Estimation is equivalent to a maximum entropy assumption
  - Feature selection is crucial (which f terms to include)
  - Can provide high accuracy classification
  - ... however, tends to be computationally complex to fit (estimating Z is difficult)
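In practice, a conditional maximum entropy text classifier with word-count features is equivalent to multinomial logistic regression, so a minimal sketch can lean on an off-the-shelf implementation; the toy data and names below are placeholders, not from the lecture.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["rates rise on bonds", "bank cuts interest rate", "team wins the final"]
labels = ["finance", "finance", "sports"]

X = CountVectorizer().fit_transform(docs)     # feature functions: word counts f(x)
maxent = LogisticRegression(max_iter=1000)    # fits the weights by maximum likelihood
maxent.fit(X, labels)
print(maxent.predict_proba(X))                # conditional model p(c | x)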
19Linear classifiers and SVM basics
- Decision boundary: w^T x + b = 0
- w is the direction vector (normal to the boundary)
- Distance of x from the boundary is (w^T x + b) / ||w||
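A minimal numeric sketch of this geometry; the weight vector, bias, and query point below are made-up values for illustration.

import numpy as np

w = np.array([2.0, 1.0])                 # weight vector, normal to the hyperplane
b = -4.0                                 # bias
x = np.array([3.0, 1.0])                 # a query point

score = w @ x + b                        # w^T x + b; the sign gives the predicted class
distance = score / np.linalg.norm(w)     # signed distance of x from the boundary
print(score, distance)                   # 3.0, about 1.34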
20Optimal Hyperplane, Support Vectors, and Margin
- Circles: support vectors, the points (on the convex hulls) that are closest to the hyperplane
- M: margin, the distance of the support vectors from the hyperplane
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)
21SVM setup 1
- Data x_i, targets t_i in {-1, +1}, i = 1..N
- Assume linearly separable: we can find w, b such that t_i (w^T x_i + b) > 0 for all i
- Distance of point x_i from the decision surface y(x) = w^T x + b = 0 is t_i y(x_i) / ||w||
22SVM setup 2
- Maximum margin solution: choose w, b to maximize the distance to the closest point, arg max over w, b of { (1/||w||) min_i [ t_i (w^T x_i + b) ] }
- Rescale w -> k w and b -> k b so that, for the point x_j closest to the surface, t_j (w^T x_j + b) = 1
- Then all data points satisfy the constraints t_i (w^T x_i + b) >= 1, i = 1..N
- Optimization problem: maximize 1/||w|| over w, b subject to these constraints
23SVM setup 3
- Minimize (1/2)||w||^2 subject to the constraints t_i (w^T x_i + b) >= 1, i = 1..N
- Lagrangian: L(w, b, a) = (1/2)||w||^2 - sum_i a_i [ t_i (w^T x_i + b) - 1 ], with a_i >= 0
- Solve by setting the derivatives with respect to w and b to zero
- -> Get a quadratic programming problem (written out below)
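For reference, the quadratic program this leads to is the standard dual form, obtained by substituting w = sum_i a_i t_i x_i and sum_i a_i t_i = 0 back into the Lagrangian; the slide does not spell it out, so this is the textbook version:

\max_{a}\; \tilde{L}(a) = \sum_{i=1}^{N} a_i
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j \, t_i t_j \, x_i^{\top} x_j
\qquad \text{subject to}\qquad a_i \ge 0, \quad \sum_{i=1}^{N} a_i t_i = 0 .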
24SVM setup 4
- Not separable: introduce slack variables xi_i >= 0 and relax the constraints to t_i (w^T x_i + b) >= 1 - xi_i
- Minimize C sum_i xi_i + (1/2)||w||^2
- C is the regularization parameter (it trades margin size against the slack penalty); see the sketch below
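A minimal soft-margin sketch with scikit-learn, where C plays exactly this role; the 2-D toy data is made up for illustration and is not the lecture's data.

import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]], dtype=float)
t = np.array([-1, -1, -1, 1, 1, 1])        # targets in {-1, +1}

clf = SVC(kernel="linear", C=1.0)          # solves the soft-margin quadratic program
clf.fit(X, t)
print(clf.support_vectors_)                # the support vectors
print(clf.coef_, clf.intercept_)           # learned w and b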
25Support Vector Machine
- Unique solution for a linearly separable data set
- Margin M of the classifier
  - the distance between the separating hyperplane and the closest training samples
  - optimal separating hyperplane -> maximum margin
- This results in a quadratic programming optimization problem
- Good news
  - convex function of the unknowns, unique optimum
  - variety of well-known algorithms for finding this optimum
- Bad news
  - quadratic programming in general scales as O(n^3)
  - in practice takes O(n^a), where a = 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166)
26From Chakrabarti, Chapter 5, 2002: Timing results on text classification
27Multi-class classification
- SVM does binary classification: y in {-1, +1}
- Build a K-class classifier by combining binary classifiers; classes c in {c1, c2, c3, ..., cK}
- 1 vs. rest
  - build classifier for y in {c1, not c1}
  - ...
  - build classifier for y in {ci, not ci}
  - ...
  - build classifier for y in {cK, not cK}
- 1 vs. 1
  - build classifier for y in {c1, c2}
  - ...
  - build classifier for y in {ci, cj}
  - ...
  - build classifier for y in {cK-1, cK}
28 1 vs. rest
[Figure: the three 1-vs-rest decision boundaries (C1 / not C1, C2 / not C2, C3 / not C3) carve the plane into regions R1-R7]
29 1 vs. rest
[Figure: the same 1-vs-rest boundaries, with regions assigned to C1, C2, C3 and ambiguous regions marked "?"]
30 1 vs. rest
- Learn K 1-vs-rest classifiers
  - for k = 1..K:  y_k(x) = svm_predict(x) using classifier k
- Predict class for test document x
  - class(x) = arg max_k y_k(x)
- -> Issue: no guarantee that y_k(x) from different classifiers will have comparable scales (a sketch follows below)
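A minimal sketch of this 1-vs-rest scheme; the helper names below are assumptions for illustration (scikit-learn's LinearSVC applies a one-vs-rest strategy internally, so you would not normally write this yourself).

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    # one binary classifier per class k: "class k" vs. "everything else"
    return {k: LinearSVC(C=1.0).fit(X, (y == k).astype(int)) for k in classes}

def predict_one_vs_rest(models, X):
    # decision_function gives the real-valued score y_k(x); arg max is the usual
    # heuristic even though the scores of different classifiers are not calibrated
    classes = list(models)
    scores = np.column_stack([models[k].decision_function(X) for k in classes])
    return [classes[i] for i in np.argmax(scores, axis=1)]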
31 1 vs. 1
[Figure: the pairwise decision boundaries (C1 vs. C2, C1 vs. C3, C2 vs. C3) carve the plane into regions R1-R4]
32 1 vs. 1
[Figure: the same pairwise boundaries, with most regions assigned to C1, C2, or C3 by majority vote and one ambiguous region marked "?"]
33 1 vs. 1
[Figure: the same regions labeled by pairwise votes; regions where the votes give no clear winner are marked "?"]
34 1 vs. 1
- Train K(K-1)/2 binary classifiers
- Classify test docs using all K(K-1)/2 classifiers
- Predicted class is the class that gets the most votes (see the sketch below)
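A minimal sketch of 1-vs-1 voting; the function names are assumptions for illustration (scikit-learn's SVC uses this pairwise scheme for multi-class problems, so again you would not normally code it by hand).

from collections import Counter
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_one(X, y, classes):
    # one binary classifier for every unordered pair of classes: K(K-1)/2 in total
    models = {}
    for ci, cj in combinations(classes, 2):
        mask = np.isin(y, [ci, cj])
        models[(ci, cj)] = LinearSVC(C=1.0).fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, x):
    # each pairwise classifier casts one vote; the class with the most votes wins
    votes = Counter(clf.predict(x.reshape(1, -1))[0] for clf in models.values())
    return votes.most_common(1)[0][0]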
35Classic Reuters Data Set
- 21,578 documents, labeled manually
  - 9,603 training, 3,299 test articles
  - 118 categories
  - An article can be in more than one category
  - Learn 118 binary category distinctions (see the sketch below)
- Example: an interest-rate article
  - 2-APR-1987 06:35:19.50
  - west-germany
  - b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  - FRANKFURT, March 2
  - The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Most common categories (number of training, test documents)
  - Earn (2,877, 1,087)
  - Acquisitions (1,650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)
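A minimal sketch of the "118 binary category distinctions" idea: because a Reuters article can carry several labels, each category gets its own binary classifier and a document is assigned every category whose classifier fires (no arg max). The data layout and names below are assumptions for illustration.

from sklearn.svm import LinearSVC

def train_per_category(X, Y):
    # Y[c] is a 0/1 indicator vector: does each training document belong to category c?
    return {c: LinearSVC(C=1.0).fit(X, Y[c]) for c in Y}

def assign_categories(models, x):
    # a document can receive zero, one, or several categories
    return [c for c, clf in models.items() if clf.predict(x.reshape(1, -1))[0] == 1]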
36Dumais et al. 1998 Reuters - Accuracy
37Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998) using the Reuters data set
38Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) using three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.