1
ICS 278 Data Mining
Lecture 7: Support Vector Machines
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Project proposals
  • Homework 2 questions?

3
What gets published?
4
Writing tips
  • Use active voice
  • A. predictions of cancer rates were made by our
    classifier
  • B. our classifier predicted cancer rates
  • Don't use quotes unless you are quoting
  • We will compare to "ground truth"
  • We observed a "bias" in the data
  • Use short sentences
  • 2-3 lines max. Check 3-line sentences.
  • Use formal, correct language
  • The classifier guessed the class labels for test
    data
  • We will use a bunch of features to predict Y

5
Writing tips (cont.)
  • Use inclusive language
  • First our crawler will find researchers. For
    researcher X, our crawler will also find where he
    got his Ph.D.
  • Replace ambiguous pronouns (it, that), or check
    that meaning is clear
  • Don't use both "e.g." and "etc."
  • Some classifiers, e.g. Naïve Bayes, SVM, Decision
    Tree, etc. are ideal for
  • Look for words to delete that don't add meaning
  • A. This effort may prove to be applicable to
    other domains
  • B. This effort may be applicable to other domains

6
Writing tips (cont.)
  • My pet peeve:
  • We will employ this algorithm to
  • We will use this algorithm to

Use "employ" only if you are going to pay the algorithm.
7
Sentence re-writes
  • However, most recognizers are prone to making
    errors.
  • We are interested in discovering these
    constraints or patterns in an automatic way.
  • A list of keywords that are frequent in pages
    which are announcing events are fed into the
    search API.
  • For each of the models, the parameters that do
    best on the evaluation set are used for testing.
  • The goal of modeling network growth by evolution
    presented in this proposal is to study the
    process of genome evolution.
  • Old links are deleted when new connections
    satisfy generated or predicted rules better.

8
Homework 2
  • Classifiers
    • K-Nearest Neighbors
    • Naïve Bayes
      • Bernoulli
      • Multinomial
    • Support Vector Machine
      • 1 vs. rest
      • 1 vs. 1
    • Weka
      • e.g. Decision Tree
  • Data

9
Feature Selection
  • Performance of text classification algorithms can
    be improved by selecting a subset of the most
    discriminative terms
    • see classification results later in these slides
  • Greedy search (Chakrabarti 5.5)
    • start from an empty set or the full set and
      add/delete one term at a time
  • Heuristics for adding/deleting
    • information gain (mutual information of term with
      class, e.g. McCallum and Nigam 1998 paper)
    • chi-square
    • other ideas
  • Methods tend not to be particularly sensitive to
    the specific heuristic used for feature selection,
    but some form of feature selection often improves
    performance (a small information-gain sketch follows below)
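A minimal sketch of the information-gain heuristic (not from the lecture; the function and variable names here are made up for illustration): score each term by the mutual information between its presence/absence and the class label, then keep the top-scoring terms.

import math

def information_gain(docs, labels, term):
    # docs: list of token sets; labels: parallel list of class labels
    n = len(docs)
    present = [term in d for d in docs]
    ig = 0.0
    for t in (True, False):
        for c in set(labels):
            n_tc = sum(1 for p, y in zip(present, labels) if p == t and y == c)
            if n_tc == 0:
                continue
            p_tc = n_tc / n                               # joint p(term = t, class = c)
            p_t = sum(1 for p in present if p == t) / n   # marginal p(term = t)
            p_c = sum(1 for y in labels if y == c) / n    # marginal p(class = c)
            ig += p_tc * math.log(p_tc / (p_t * p_c))
    return ig

def select_features(docs, labels, vocab, k=100):
    # keep the k terms with the highest mutual information with the class label
    return sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)[:k]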

10
Effect of Feature Selection
(from Chakrabarti, Fig 5.5)
9,600 documents from the US Patent database; 20,000 raw features (terms)
11
Comparing Naïve Bayes models
  • McCallum and Nigam (1998)
  • Found that multinomial outperformed Bernoulli in
    text classification experiments
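A minimal sketch of such a comparison (assumes scikit-learn, and uses the 20 Newsgroups data bundled with it rather than the WebKB/Reuters data shown on the following slides):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = CountVectorizer(max_features=10000)
Xtr = vec.fit_transform(train.data)   # term counts (the multinomial event model's view)
Xte = vec.transform(test.data)

for model in (MultinomialNB(), BernoulliNB(binarize=0.0)):  # Bernoulli uses presence/absence
    model.fit(Xtr, train.target)
    acc = accuracy_score(test.target, model.predict(Xte))
    print(type(model).__name__, round(acc, 3))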

12
WebKB Data Set
  • Train on 5,000 hand-labeled web pages
  • Cornell, Washington, U.Texas, Wisconsin
  • Crawl and classify a new site (CMU)
  • Results

13
Comparing Bernoulli and Multinomial on Web KB Data
14
Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
15
Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
16
Comparing Bernoulli and Multinomial
Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics
17
Note
  • For Homework 2, we will NOT do feature selection

18
Beyond independence
  • Naïve Bayes assumes conditional independence of
    words given class
  • Alternative approaches try to account for
    higher-order dependencies
  • Bayesian networks
    • p(x | c) = ∏_j p(x_j | parents(x_j), c)
    • equivalent to a directed graph where edges
      represent direct dependencies
    • various algorithms search for a good network
      structure
    • useful for improving the quality of the
      distribution model
    • ... however, this does not always translate into
      better classification
  • Maximum entropy models
    • p(x | c) = (1/Z) ∏_S f(x_S | c), a product of feature
      functions f defined on subsets x_S of x
    • equivalent to an undirected graph model
    • estimation is equivalent to a maximum entropy
      assumption
    • feature selection is crucial (which f terms to
      include)
    • can provide high-accuracy classification
    • ... however, tends to be computationally complex
      to fit (estimating Z is difficult)
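As a concrete illustration, the workhorse maximum-entropy text classifier is multinomial logistic regression; a minimal scikit-learn sketch (not from the slides; train_texts, train_labels, and test_texts are placeholders assumed to exist):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

maxent = make_pipeline(
    CountVectorizer(max_features=10000),   # the feature functions f here are just word counts
    LogisticRegression(max_iter=1000),     # multinomial logistic regression = maxent classifier
)
# maxent.fit(train_texts, train_labels)    # lists of strings / labels (assumed to exist)
# predictions = maxent.predict(test_texts)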

19
Linear classifiers and SVM basics
Decision boundary: w^T x + b = 0
The weight vector w is normal to the boundary (direction of w vector)
Distance of x from the boundary is (w^T x + b) / ||w||
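A tiny numeric illustration of this distance formula (not from the slides; the weight vector and points are made up):

import numpy as np

w = np.array([3.0, 4.0])      # normal vector of the boundary (made-up values)
b = -5.0
X = np.array([[1.0, 2.0],     # a few example points
              [0.0, 0.0],
              [3.0, -1.0]])

signed_dist = (X @ w + b) / np.linalg.norm(w)
print(signed_dist)            # [ 1.2 -1.   0. ]  sign = side of boundary, magnitude = distance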
20
Optimal Hyperplane, Support Vectors, and Margin
Circles: support vectors, the points (on the convex hull of each class) closest to the hyperplane
M: margin, the distance of the support vectors from the hyperplane
Goal is to find the weight vector that maximizes M
Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)
21
SVM setup 1
  • Data x_i, targets t_i ∈ {-1, +1}, i = 1..N
  • Assume linearly separable: we can find w, b such that
    t_i (w^T x_i + b) > 0 for all i
  • Distance of point x_i to the decision surface
    y(x) = w^T x + b = 0 is t_i y(x_i) / ||w||
22
SVM setup 2
  • Maximum margin solution: choose w, b as
    arg max_{w,b} { (1/||w||) min_i [ t_i (w^T x_i + b) ] }
  • Rescale w → κw and b → κb, so that for the point j
    closest to the surface:  t_j (w^T x_j + b) = 1
  • Then all data points satisfy the constraints
    t_i (w^T x_i + b) ≥ 1,  i = 1..N
  • Optimization problem: maximize 1/||w|| subject to
    these constraints
23
SVM setup 3
  • Minimize (1/2) ||w||^2 subject to the constraints
    t_i (w^T x_i + b) ≥ 1
  • Lagrangian: L(w, b, a) = (1/2) ||w||^2
    - Σ_i a_i [ t_i (w^T x_i + b) - 1 ],  with a_i ≥ 0
  • Solve by setting the derivatives with respect to w
    and b to zero

⇒ Get a quadratic programming problem
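For reference, the textbook form of the resulting dual quadratic program in the same notation (a standard result, not copied from the slide images):

\max_{a}\;\; \tilde{L}(a) = \sum_{i=1}^{N} a_i
  - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} a_i a_j\, t_i t_j\, x_i^{\top} x_j
\qquad \text{subject to}\;\; a_i \ge 0,\;\; \sum_{i=1}^{N} a_i t_i = 0

The solution gives w = Σ_i a_i t_i x_i, and only the support vectors have a_i > 0.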
24
SVM setup 4
  • Not separable: use slack variables ξ_i ≥ 0, with
    constraints t_i (w^T x_i + b) ≥ 1 - ξ_i
  • Minimize (1/2) ||w||^2 + C Σ_i ξ_i
  • C is the regularization parameter, trading off margin
    size against training errors
25
Support Vector Machine
  • Unique solution for a linearly separable data set
  • Margin M of the classifier
    • the distance between the separating hyperplane
      and the closest training samples
    • optimal separating hyperplane ⇔ maximum margin
  • This results in a quadratic programming
    optimization problem
  • Good news
    • convex function of the unknowns, unique optimum
    • variety of well-known algorithms for finding this
      optimum
  • Bad news
    • quadratic programming in general scales as O(n^3)
    • in practice takes O(n^a), where a ≈ 1.6 to 2
      (see Chakrabarti, Chapter 5, p. 166)

26
Timing results on text classification (from Chakrabarti, Chapter 5, 2002)
27
Multi-class classification
  • SVM does binary classification: y ∈ {-1, +1}
  • Build a K-class classifier by combining binary
    classifiers. Classes c ∈ {c1, c2, c3, ..., cK}
  • 1 vs. rest
    • build classifier for y ∈ {c1, not c1}
    • ...
    • build classifier for y ∈ {ci, not ci}
    • ...
    • build classifier for y ∈ {cK, not cK}
  • 1 vs. 1
    • build classifier for y ∈ {c1, c2}
    • ...
    • build classifier for y ∈ {ci, cj}
    • ...
    • build classifier for y ∈ {cK-1, cK}

28
1 vs. rest
[Figure: three 1-vs-rest decision boundaries (C1 vs. not C1, C2 vs. not C2, C3 vs. not C3) dividing the input space into regions R1-R7]
29
1 vs. rest
[Figure: the same 1-vs-rest boundaries, with some regions assigned to C1, C2, or C3 and other regions marked ?]
30
1 vs. rest
  • Learn K 1-vs-rest classifiers
    • for k = 1..K:  y_k(x) = svm_predict(x) using classifier k
  • Predict class for test document x
    • class(x) = arg max_k y_k(x)
  • ⇒ Issue: no guarantee that y_k(x) for different
    classifiers will have appropriate scales
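A minimal sketch of this construction (assumes scikit-learn's LinearSVC and numpy arrays X, y; the function names are made up): train one binary SVM per class and predict with the arg max of the decision scores.

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, classes):
    # one binary SVM per class: class c vs. everything else
    clfs = {}
    for c in classes:
        clf = LinearSVC(max_iter=10000)
        clf.fit(X, (y == c).astype(int))
        clfs[c] = clf
    return clfs

def predict_one_vs_rest(clfs, X):
    # arg max over the K real-valued scores y_k(x); note the scale caveat above
    classes = list(clfs)
    scores = np.column_stack([clfs[c].decision_function(X) for c in classes])
    return [classes[i] for i in scores.argmax(axis=1)]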
31
1 vs. 1
[Figure: pairwise 1-vs-1 decision boundaries (C1 vs. C2, C1 vs. C3, C2 vs. C3) dividing the input space into regions R1-R4]
32
1 vs. 1
[Figure: the same 1-vs-1 boundaries, with regions labeled by the winning class (C1, C2, or C3) and one region marked ?]
33
1 vs. 1
[Figure: the same 1-vs-1 boundaries, with several regions marked ? and the remaining regions labeled C1, C2, or C3]
34
1 vs. 1
  • Train ½ K(K-1) binary classifiers
  • Classify test docs using all ½ K(K-1) classifiers
  • Predicted class is the class that receives the most
    votes
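A minimal sketch of 1-vs-1 training and majority voting (again assumes scikit-learn's LinearSVC, numpy arrays X, y, and a single feature vector x; the function names are made up):

from collections import Counter
from itertools import combinations
from sklearn.svm import LinearSVC

def train_one_vs_one(X, y, classes):
    # one binary SVM per unordered class pair: K(K-1)/2 classifiers in total
    clfs = {}
    for ci, cj in combinations(classes, 2):
        mask = (y == ci) | (y == cj)
        clf = LinearSVC(max_iter=10000)
        clf.fit(X[mask], (y[mask] == ci).astype(int))   # 1 means ci, 0 means cj
        clfs[(ci, cj)] = clf
    return clfs

def predict_one_vs_one(clfs, x):
    # every pairwise classifier casts one vote; the most-voted class wins
    votes = Counter()
    for (ci, cj), clf in clfs.items():
        votes[ci if clf.predict([x])[0] == 1 else cj] += 1
    return votes.most_common(1)[0][0]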

35
Classic Reuters Data Set
  • 21,578 documents, labeled manually
  • 9,603 training, 3,299 test articles
  • 118 categories
    • an article can be in more than one category
    • learn 118 binary category distinctions
  • Example "interest rate" article (header fields, then body text)
    • 2-APR-1987 06:35:19.50
    • west-germany
    • b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
    • FRANKFURT, March 2
    • The Bundesbank left credit policies unchanged
      after today's regular meeting of its council, a
      spokesman said in answer to enquiries. The West
      German discount rate remains at 3.0 pct, and the
      Lombard emergency financing rate at 5.0 pct.
  • Most common categories (# train, # test)
    • Earn (2877, 1087)
    • Acquisitions (1650, 179)
    • Money-fx (538, 179)
    • Grain (433, 149)
    • Crude (389, 189)
    • Trade (369, 119)
    • Interest (347, 131)
    • Ship (197, 89)
    • Wheat (212, 71)
    • Corn (182, 56)
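Since Reuters articles can carry several category labels, the per-category binary classifiers form a multi-label setup. A minimal sketch (assumes scikit-learn; texts, topics, and test_texts are placeholders assumed to exist):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# texts: list of article strings; topics: list of category-name lists (placeholders)
mlb = MultiLabelBinarizer()
# Y = mlb.fit_transform(topics)         # one binary indicator column per category
clf = make_pipeline(TfidfVectorizer(),
                    OneVsRestClassifier(LinearSVC(max_iter=10000)))
# clf.fit(texts, Y)                     # fits one binary SVM per category
# predicted = mlb.inverse_transform(clf.predict(test_texts))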

36
Dumais et al. 1998 Reuters - Accuracy
37
Precision-Recall for SVM (linear), Naïve Bayes,
and NN (from Dumais 1998) using the Reuters data
set
38
Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) on three data sets: 20 newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB. From Chakrabarti, 2003, Chapter 5.