Title: ICS 278: Data Mining Lecture 15: Text Classification
1. ICS 278 Data Mining, Lecture 15: Text Classification
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Roadmap for Lectures
- Lecture 15 (today): text classification
- Lectures 16, 17, 18, 19:
- Unsupervised learning from text: clustering and topic modeling
- Recommender systems
- Credit scoring applications
- Pattern-finding algorithms
- Lecture 20
- Thursday June 8th (2 weeks from Thursday)
- 5-minute project summary from each student
- More details on the format to come later.
3. Text Classification
- Text classification has many applications:
- Spam email detection
- Automated tagging of streams of news articles, e.g., Google News
- Automated creation of Web-page taxonomies
- Data representation:
- Bag of words most commonly used, with either counts or binary indicators (see the sketch below)
- Can also use phrases for commonly occurring combinations of words
- Classification methods:
- Naïve Bayes: widely used (e.g., for spam email)
- Fast and reasonably accurate
- Support vector machines (SVMs)
- Typically the most accurate method in research studies
- But more computationally complex
- Logistic regression (regularized)
- Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
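As a concrete illustration of the bag-of-words representation above, here is a minimal Python sketch; the two toy documents and the helper name `bag_of_words` are invented for this example:

```python
from collections import Counter

docs = ["the interest rate rose", "stocks fell as interest rates rose"]  # toy documents

# Build a vocabulary from all documents (stopwords not yet removed)
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc, vocab, binary=False):
    """Map a document to a term vector over `vocab`: counts or binary presence."""
    counts = Counter(doc.split())
    return [int(counts[w] > 0) if binary else counts[w] for w in vocab]

count_vectors = [bag_of_words(d, vocab) for d in docs]
binary_vectors = [bag_of_words(d, vocab, binary=True) for d in docs]
print(vocab, count_vectors, binary_vectors, sep="\n")
```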
4. Further Reading on Text Classification
- Web-related text mining in general:
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See Chapter 5 for discussion of text classification.
- General references on text and language modeling:
- C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
- D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
- SVMs for text classification:
- T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.
5. Common Data Sets Used for Evaluation
- Reuters
- 10,700 labeled documents
- 10% of documents have multiple class labels
- Yahoo! Science Hierarchy
- 95 disjoint classes with 13,598 pages
- 20 Newsgroups data
- 18,800 labeled USENET postings
- 20 leaf classes, 5 root-level classes
- WebKB
- 8,300 documents in 7 categories such as faculty, course, student
- Industry
- 6,449 home pages of companies partitioned into 71 classes
6. Trimming the Vocabulary
- Stopword removal
- remove non-content words
- very frequent stop words such as "the" and "and"
- remove very rare words, e.g., words that occur only a few times in 100k documents
- Can remove 30% or more of the original unique words
- Stemming (see the sketch below)
- Reduce all variants of a word to a single term
- E.g., draw, drawing, drawings → draw
- Porter stemming algorithm (1980)
- relies on a preconstructed suffix list with associated rules
- e.g., if the suffix is IZATION and the prefix contains at least one vowel followed by a consonant, replace with the suffix IZE
- BINARIZATION → BINARIZE
- This still often leaves p = O(10^4) terms
- a very high-dimensional classification problem!
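A small sketch of the vocabulary-trimming steps above (stopword removal, rare-word removal, and stemming), assuming NLTK's `PorterStemmer` is available; the toy corpus, stop list, and `min_count` threshold are illustrative choices, not values from the lecture:

```python
from collections import Counter
from nltk.stem import PorterStemmer  # Porter (1980) suffix-stripping rules

docs = [["the", "drawings", "and", "the", "drawing"],
        ["drawing", "rates", "and", "interest", "rates"]]  # toy tokenized documents

stopwords = {"the", "and", "of", "a"}   # small illustrative stop list
min_count = 2                           # drop words rarer than this

# Count raw term frequencies across the corpus
counts = Counter(w for d in docs for w in d)

stemmer = PorterStemmer()
vocab = sorted({stemmer.stem(w)
                for w, c in counts.items()
                if w not in stopwords and c >= min_count})
print(vocab)  # e.g., ['draw', 'rate']
```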
7. Feature Selection
- Performance of text classification algorithms can be improved by selecting only a subset of the most discriminative terms
- See classification results later in these slides
- Greedy search
- Start from the empty set or the full set and add/delete one term at a time
- Heuristics for adding/deleting (information-gain sketch below)
- Information gain (mutual information of term with class)
- Chi-square
- Other ideas
- Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
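A sketch of the information-gain (mutual-information) heuristic mentioned above for a single binary term; the function name and toy data are illustrative:

```python
import math
from collections import Counter

def information_gain(term_presence, labels):
    """Mutual information between a binary term indicator and the class label.

    term_presence: list of 0/1 flags (term absent/present in each document)
    labels:        list of class labels, one per document
    """
    n = len(labels)
    joint = Counter(zip(term_presence, labels))
    p_t = Counter(term_presence)
    p_c = Counter(labels)
    mi = 0.0
    for (t, c), n_tc in joint.items():
        p_tc = n_tc / n
        mi += p_tc * math.log(p_tc / ((p_t[t] / n) * (p_c[c] / n)))
    return mi

# Toy usage: a term that appears mostly in "finance" documents
presence = [1, 1, 1, 0, 0, 0]
classes  = ["finance", "finance", "finance", "sports", "sports", "finance"]
print(information_gain(presence, classes))
```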
8. Example of the Role of Feature Selection (from Chakrabarti, Chapter 5)
- 9,600 documents from the US Patent database; 20,000 raw features (terms)
9. Classifying Term Vectors
- Typically multiple different words may be helpful in classifying a particular class, e.g.,
- Class = finance
- Words = stocks, return, interest, rate, etc.
- Thus, classifiers that combine multiple features often do well, e.g.,
- Naïve Bayes, logistic regression, SVMs, etc.
- Linear classifiers often perform well in high dimensions
- In many cases there are fewer documents in the training data than dimensions,
- i.e., n < p, so the training data are typically linearly separable
- So again, naïve Bayes, logistic regression, and linear SVMs are all useful
- The question becomes which linear discriminant to select
10. Classification Issues
- Typically many features, p = O(10^4) terms
- Consider n sample points in p dimensions
- Binary labels: 2^n possible labelings (or dichotomies)
- A labeling is linearly separable if we can separate the labels with a hyperplane
- Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable
- f(n, p) = 1, for n ≤ p + 1
- f(n, p) = (2 / 2^n) Σ_{i=0}^{p} (n−1 choose i), for n > p + 1
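A small sketch that evaluates the f(n, p) expression above (Cover's function-counting result), under the assumption that the reconstruction with a bias term, summing i from 0 to p, is what the slide intends:

```python
from math import comb

def frac_linearly_separable(n, p):
    """Fraction of the 2^n labelings of n points in general position in p
    dimensions that a hyperplane (with bias term) can realize."""
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2**n

# For large p relative to n, almost every labeling is separable
print(frac_linearly_separable(10, 4))       # 0.5 (many labelings not separable)
print(frac_linearly_separable(100, 10000))  # 1.0
```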
11. (Figure slide: f(n, p) for large p; if n ≤ p + 1, any labeling is linearly separable)
12. Types of Classifiers
- Generative / probabilistic
- Model p(x | c) for each class, then estimate p(c | x)
- e.g., naïve Bayes model
- Conditional probability / regression
- Model p(c | x) directly
- e.g., logistic regression
- Discriminative
- Look for decision boundaries in input space x directly
- No probabilities
- e.g., perceptron, linear discriminants, SVMs, etc.
13. Probabilistic Generative Classifiers
- Model p(x | ck) for each class and perform classification via Bayes' rule:
- c = arg max_k p(ck | x) = arg max_k p(x | ck) p(ck)
- How to model p(x | ck)?
- p(x | ck) = probability of a bag of words x given a class ck
- Two commonly used approaches (for text):
- Naïve Bayes: treat each term xj as conditionally independent given ck
- Multinomial: model a document with N words as N tosses of a p-sided die
- Other models are possible but less common,
- e.g., model word order by using a Markov chain for p(x | ck)
14. Naïve Bayes Classifier for Text
- Naïve Bayes classifier = conditional independence model (sketch below)
- Assumes conditional independence of the terms given the class:
- p(x | ck) = Π_j p(xj | ck)
- Note that we model each term xj as a discrete random variable
- Binary terms (Bernoulli):
- p(x | ck) = Π_{j: xj=1} p(xj = 1 | ck) Π_{j: xj=0} p(xj = 0 | ck)
- Non-binary terms (counts):
- p(x | ck) = Π_j p(xj = k | ck)
- can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p(xj = k | ck) distributions
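A minimal sketch of the Bernoulli (binary-term) naïve Bayes model above, working in log space and using Laplace smoothing; the toy term matrix and class names are invented for illustration:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli naive Bayes: X is an (n_docs, n_terms) 0/1 matrix, y class labels.
    Returns class labels, log priors, and per-class log p(x_j = 1 | c_k) and
    log p(x_j = 0 | c_k), with Laplace smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, np.log(priors), np.log(theta), np.log(1 - theta)

def predict_bernoulli_nb(x, classes, log_prior, log_theta, log_1m_theta):
    """arg max_k  log p(c_k) + sum_j log p(x_j | c_k) for a binary term vector x."""
    log_post = log_prior + (log_theta * x + log_1m_theta * (1 - x)).sum(axis=1)
    return classes[np.argmax(log_post)]

# Toy usage: 3 documents over 4 terms, two classes
X = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]])
y = np.array(["finance", "finance", "sports"])
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(np.array([1, 0, 0, 1]), *model))
```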
15. Multinomial Classifier for Text
- Multinomial classification model
- Assume that the data are generated by a p-sided die (multinomial model):
- p(x | ck) = p(Nx | ck) · [Nx! / Π_j nj!] · Π_j p(term j | ck)^{nj}
- where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
- p(Nx | ck) = probability a document has length Nx, e.g., a Poisson model
- Can be dropped if length is thought not to be class dependent
- Here we have a single random variable for each class, and the p(xj = i | ck) probabilities sum to 1 over i (i.e., a multinomial model)
- Probabilities are typically only defined and evaluated for i = 1, 2, 3, …
- But zero counts could also be modeled if desired
- This would be equivalent to a naïve Bayes model with a geometric distribution on counts
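A matching sketch of the multinomial model; only terms that occur contribute to the log-likelihood, and the class-independent combinatorial factor Nx!/Π nj! is dropped. Again, the toy counts are invented:

```python
import numpy as np

def train_multinomial(X, y, alpha=1.0):
    """Multinomial text model: X is an (n_docs, n_terms) count matrix, y labels.
    Estimates theta[k, j] = p(term j | class k) with add-alpha smoothing."""
    classes = np.unique(y)
    log_prior = np.log([(y == c).mean() for c in classes])
    counts = np.array([X[y == c].sum(axis=0) for c in classes], dtype=float)
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return classes, log_prior, np.log(theta)

def predict_multinomial(x, classes, log_prior, log_theta):
    """arg max_k  log p(c_k) + sum_j n_j log p(term j | c_k)."""
    return classes[np.argmax(log_prior + log_theta @ x)]

# Toy usage with word-count vectors over 4 terms
X = np.array([[3, 1, 0, 0], [2, 0, 1, 0], [0, 0, 2, 3]])
y = np.array(["finance", "finance", "sports"])
print(predict_multinomial(np.array([1, 0, 0, 2]), *train_multinomial(X, y)))
```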
16. Highest-Probability Terms in Multinomial Distributions
17. Comparing Naïve Bayes and Multinomial Models
- McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments
- (however, this may be more a result of using counts vs. binary features)
- Note on names used in the literature:
- Bernoulli (or multivariate Bernoulli) is sometimes used for the binary version of the naïve Bayes model
- the multinomial model is also referred to as the unigram model
- the multinomial model is also sometimes (confusingly) referred to as naïve Bayes
18. WebKB Data Set
- Train on 5,000 hand-labeled web pages
- Cornell, Washington, U.Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results
19. Comparing Bernoulli and Multinomial on WebKB Data
20. Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
21. Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
22. Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford)
- Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics
23. Comments on Generative Models for Text
- (Comments applicable to both naïve Bayes and multinomial classifiers)
- Simple and fast → popular in practice
- e.g., linear in p, n, M for both training and prediction
- Training: smoothed frequency counts (e.g., Laplace smoothing of the term probabilities)
- e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
- Numerical issues
- Typically work with log p(ck | x), etc., to avoid numerical underflow
- Useful trick for sparse data (sketch below):
- when computing Σ_j log p(xj | ck), it may be much faster to
- precompute Σ_j log p(xj = 0 | ck)
- and then, for only the terms present in the document, subtract log p(xj = 0 | ck) and add log p(xj = 1 | ck)
- Note: both models are wrong, but for classification they are often sufficient
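A sketch of the sparse-data trick described above, assuming Bernoulli term probabilities like those in the earlier naïve Bayes sketch (the random theta matrix is purely illustrative):

```python
import numpy as np

# Assume log_theta[k, j] = log p(x_j = 1 | c_k) and log_1m_theta[k, j] = log p(x_j = 0 | c_k),
# e.g., from the Bernoulli naive Bayes sketch earlier (K classes, p terms).
rng = np.random.default_rng(0)
theta = rng.uniform(0.01, 0.2, size=(2, 10_000))   # toy per-class term probabilities
log_theta, log_1m_theta = np.log(theta), np.log(1 - theta)

# Precompute the "all terms absent" log-likelihood once per class
base = log_1m_theta.sum(axis=1)                    # shape (K,)

def doc_loglik(present_term_ids):
    """Log-likelihood of a sparse document under each class: start from the
    all-absent baseline and correct only the terms that actually occur."""
    j = np.asarray(present_term_ids)
    return base + (log_theta[:, j] - log_1m_theta[:, j]).sum(axis=1)

print(doc_loglik([3, 17, 4242]))   # touches 3 terms instead of all 10,000
```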
24. Beyond Independence
- Naïve Bayes and multinomial models both assume conditional independence of words given the class
- Alternative approaches try to account for higher-order dependencies
- Bayesian networks
- p(x | c) = Π_j p(xj | parents(xj), c)
- Equivalent to a directed graph where edges represent direct dependencies
- Various algorithms search for a good network structure
- Useful for improving the quality of the distribution model
- …however, this does not always translate into better classification
- Maximum entropy models (see the log-linear form below)
- p(x | c) = (1/Z) Π_S f_S(x_S, c), a product of functions over subsets of terms
- Equivalent to an undirected graphical model
- Estimation is equivalent to a maximum entropy assumption
- Feature selection is crucial (which f terms to include)
- can provide high-accuracy classification
- …however, tends to be computationally complex to fit (estimating Z is difficult)
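For reference, the maximum entropy (log-linear) family sketched above is commonly written as follows; the feature functions f_S and weights λ_S are generic placeholders rather than notation from the lecture:

```latex
p(\mathbf{x} \mid c) \;=\; \frac{1}{Z(c)}
  \exp\!\Big( \sum_{S} \lambda_S \, f_S(\mathbf{x}_S, c) \Big),
\qquad
Z(c) \;=\; \sum_{\mathbf{x}} \exp\!\Big( \sum_{S} \lambda_S \, f_S(\mathbf{x}_S, c) \Big)
```

The partition function Z(c) sums over all configurations of x, which is why estimation tends to be computationally expensive.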
26. Linear Classifiers
- Linear classifier (two-class case):
- classify as the positive class if w^T x + w0 > 0
- w is a p-dimensional vector of weights (learned from the data)
- w0 is a threshold (also learned from the data)
- Equation of the linear hyperplane (decision boundary):
- w^T x + w0 = 0
- Distance of a point x from the hyperplane: (w^T x + w0) / ||w||
27. Geometry of Linear Classifiers (figure)
- Decision boundary: w^T x + w0 = 0
- Direction of the w vector (normal to the boundary)
- Distance of x from the boundary is (1/||w||)(w^T x + w0)
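A tiny numeric check of the distance formula above, with an arbitrary weight vector chosen for illustration:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance of point x from the hyperplane w^T x + w0 = 0;
    the sign gives the predicted side (class) for a linear classifier."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5
print(signed_distance(np.array([1.0, 1.0]), w, w0))   # positive side
print(signed_distance(np.array([-1.0, 1.0]), w, w0))  # negative side
```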
28. Optimal Hyperplane, Support Vectors, and Margin (figure)
- Circles = support vectors: points on the convex hull that are closest to the hyperplane
- M = margin = distance of the support vectors from the hyperplane
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)
29. Optimal Separating Hyperplane
- Solution to a constrained optimization problem (see the formulation below)
- (Here yi ∈ {−1, +1} is the binary class label for example i)
- w.l.o.g., let ||w|| = 1/M
- Unique solution for a linearly separable data set
- Margin M of the classifier:
- the distance between the separating hyperplane and the closest training samples
- optimal separating hyperplane ⇔ maximum margin
30. Sketch of the Optimization Problem
- Define the Lagrangian as a function of the w vector, w0, and the multipliers αi
- The solution must satisfy the corresponding stationarity conditions
- Points with αi > 0 are called support vectors, and their distance from the hyperplane is M
- This results in a quadratic programming optimization problem
- Good news:
- convex function of the unknowns, unique optimum
- Variety of well-known algorithms for finding this optimum
- Bad news:
- Quadratic programming in general scales as O(n^3)
- In practice it takes O(n^a), where a ≈ 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166)
31. Timing Results on Text Classification (from Chakrabarti, Chapter 5)
32. Support Vector Machines
- If αi > 0, then the distance of xi from the separating hyperplane is M
- Support vectors = points with associated αi > 0
- The decision function f(x) is computed from the support vectors alone
- → prediction can be fast
- Non-linearly-separable case: can generalize to allow slack constraints
- Non-linear SVMs: replace the original x vector with non-linear functions of x
- kernel trick: can solve the high-dimensional problem without working directly in high dimensions
- Computational speedups can reduce training time to near-linear
- e.g., Platt's SMO algorithm, Joachims' SVMLight (see the linear-SVM sketch below)
33. Classic Reuters Data Set
- 21,578 documents, labeled manually
- 9,603 training, 3,299 test articles (ModApte split)
- 118 categories
- An article can be in more than one category
- Learn 118 binary category distinctions
- Example interest-rate article:
- 2-APR-1987 06:35:19.50
- west-germany
- b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
- FRANKFURT, March 2
- The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Common categories (train, test):
- Earn (2877, 1087)
- Acquisitions (1650, 179)
- Money-fx (538, 179)
- Grain (433, 149)
- Crude (389, 189)
- Trade (369, 119)
- Interest (347, 131)
- Ship (197, 89)
- Wheat (212, 71)
- Corn (182, 56)
34. Dumais et al. (1998): Reuters Accuracy
35. Precision-Recall for SVM (linear), Naïve Bayes, and NN on the Reuters Data Set (from Dumais, 1998)
36. Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) on three data sets: 20 Newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB (from Chakrabarti, 2003, Chapter 5)
37. Comparing Text Classifiers
- Naïve Bayes models (Bernoulli or multinomial)
- Low time complexity (single linear pass through the data)
- Generally good, but not always the best
- Widely used for spam email filtering
- Linear SVMs
- Often produce the best results in research studies
- But computationally complex to train
- not as widely used in practice as naïve Bayes
- Others
- Logistic regression and decision trees are less widely used, but can be useful
38. Learning with Labeled and Unlabeled Documents
- In practice, obtaining labels for documents is time-consuming, expensive, and error-prone
- Typical application: a small number of labeled docs and a very large number of unlabeled docs
- Idea (sketch below):
- Build a probabilistic model on the labeled docs
- Classify the unlabeled docs, getting p(class | doc) for each class and doc
- This is equivalent to the E-step of the EM algorithm
- Now relearn the probabilistic model using the new soft labels
- This is equivalent to the M-step of the EM algorithm
- Continue to iterate until convergence (e.g., class probabilities do not change significantly)
- This EM approach shows that unlabeled data can improve classification performance compared to using labeled data alone
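A sketch of the EM procedure above, using scikit-learn's MultinomialNB as the probabilistic model (a choice made for illustration, not the lecture's code); the soft labels enter the M-step through sample weights, and the toy count matrices are invented:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=10):
    classes = np.unique(y_lab)
    model = MultinomialNB().fit(X_lab, y_lab)          # initialize on labeled docs
    for _ in range(n_iter):
        # E-step: soft labels p(class | doc) for the unlabeled documents
        post = model.predict_proba(X_unlab)
        # M-step: refit on labeled docs plus one copy of each unlabeled doc
        # per class, weighted by its posterior probability for that class
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] + [post[:, k] for k in range(len(classes))])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return model

# Toy word-count data: 2 labeled docs, 3 unlabeled docs, 4 terms
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
y_lab = np.array(["finance", "sports"])
X_unlab = np.array([[2, 1, 0, 0], [0, 1, 1, 2], [1, 0, 0, 3]])
print(em_semi_supervised(X_lab, y_lab, X_unlab).predict(X_unlab))
```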
39. Learning with Labeled and Unlabeled Data (graph from "Semi-supervised text classification using EM," Nigam, McCallum, and Mitchell, 2006)
40. Other Issues in Text Classification
- Real-time constraints
- Being able to update classifiers as new data arrive
- Being able to make predictions very quickly in real time
- Document length
- Varying document length can be a problem for some classifiers
- The multinomial model tends to handle this better than the Bernoulli model, for example
- Multi-labels and multiple classes
- Text documents can have more than one label
- SVMs, for example, directly handle only binary (two-class) problems
- Feature selection
- Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results
- Linked documents
- Can view Web documents as nodes in a directed graph
- Classification can then be performed in a way that leverages the link structure,
- Heuristic: class labels of linked pages are more likely to be the same
41. Further Reading on Text Classification
- Web-related text mining in general:
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See Chapter 5 for discussion of text classification.
- General references on text and language modeling:
- C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
- D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
- SVMs for text classification:
- T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.