Title: ICS 278: Data Mining Lecture 15: Text Classification
1. ICS 278 Data Mining, Lecture 15: Text Classification
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Roadmap for Lectures
- Lecture 15 (today): text classification
- Lectures 16, 17, 18, 19:
- Unsupervised learning from text: clustering and topic modeling
- Recommender systems
- Credit scoring applications
- Pattern-finding algorithms
- Lecture 20
- Thursday June 8th (2 weeks from Thursday)
- 5-minute project summary from each student
- More details on the format to come later.
3. Text Classification
- Text classification has many applications:
- Spam email detection
- Automated tagging of streams of news articles, e.g., Google News
- Automated creation of Web-page taxonomies
- Data representation:
- Bag of words most commonly used, with either counts or binary indicators (see the sketch below)
- Can also use phrases for commonly occurring combinations of words
- Classification methods:
- Naïve Bayes: widely used (e.g., for spam email)
- Fast and reasonably accurate
- Support vector machines (SVMs)
- Typically the most accurate method in research studies
- But more computationally complex
- Logistic regression (regularized)
- Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
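As a concrete illustration of the bag-of-words representation above, here is a minimal Python sketch; the two toy documents and the helper name `bag_of_words` are invented for this example:

```python
from collections import Counter

docs = ["the interest rate rose", "stocks fell as interest rates rose"]  # toy documents

# Build a vocabulary from all documents (stopwords not yet removed)
vocab = sorted({w for d in docs for w in d.split()})

def bag_of_words(doc, vocab, binary=False):
    """Map a document to a term vector over `vocab`: counts or binary presence."""
    counts = Counter(doc.split())
    return [int(counts[w] > 0) if binary else counts[w] for w in vocab]

count_vectors = [bag_of_words(d, vocab) for d in docs]
binary_vectors = [bag_of_words(d, vocab, binary=True) for d in docs]
print(vocab, count_vectors, binary_vectors, sep="\n")
```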
4. Further Reading on Text Classification
- Web-related text mining in general:
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See Chapter 5 for discussion of text classification.
- General references on text and language modeling:
- C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
- D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
- SVMs for text classification:
- T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.
5. Common Data Sets Used for Evaluation
- Reuters
- 10,700 labeled documents
- 10% of documents have multiple class labels
- Yahoo! Science Hierarchy
- 95 disjoint classes with 13,598 pages
- 20 Newsgroups data
- 18,800 labeled USENET postings
- 20 leaf classes, 5 root-level classes
- WebKB
- 8,300 documents in 7 categories such as faculty, course, student
- Industry
- 6,449 home pages of companies partitioned into 71 classes
6. Trimming the Vocabulary
- Stopword removal
- remove non-content words
- very frequent stop words such as "the" and "and"
- remove very rare words, e.g., words that occur only a few times in 100k documents
- Can remove 30% or more of the original unique words
- Stemming (see the sketch below)
- Reduce all variants of a word to a single term
- E.g., draw, drawing, drawings → draw
- Porter stemming algorithm (1980)
- relies on a preconstructed suffix list with associated rules
- e.g., if the suffix is IZATION and the prefix contains at least one vowel followed by a consonant, replace with the suffix IZE
- BINARIZATION → BINARIZE
- This still often leaves p = O(10^4) terms
- a very high-dimensional classification problem!
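A small sketch of the vocabulary-trimming steps above (stopword removal, rare-word removal, and stemming), assuming NLTK's `PorterStemmer` is available; the toy corpus, stop list, and `min_count` threshold are illustrative choices, not values from the lecture:

```python
from collections import Counter
from nltk.stem import PorterStemmer  # Porter (1980) suffix-stripping rules

docs = [["the", "drawings", "and", "the", "drawing"],
        ["drawing", "rates", "and", "interest", "rates"]]  # toy tokenized documents

stopwords = {"the", "and", "of", "a"}   # small illustrative stop list
min_count = 2                           # drop words rarer than this

# Count raw term frequencies across the corpus
counts = Counter(w for d in docs for w in d)

stemmer = PorterStemmer()
vocab = sorted({stemmer.stem(w)
                for w, c in counts.items()
                if w not in stopwords and c >= min_count})
print(vocab)  # e.g., ['draw', 'rate']
```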
7. Feature Selection
- Performance of text classification algorithms can be improved by selecting only a subset of the most discriminative terms
- See classification results later in these slides
- Greedy search
- Start from the empty set or the full set and add/delete one term at a time
- Heuristics for adding/deleting (information-gain sketch below)
- Information gain (mutual information of term with class)
- Chi-square
- Other ideas
- Methods tend not to be particularly sensitive to the specific heuristic used for feature selection, but some form of feature selection often improves performance
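A sketch of the information-gain (mutual-information) heuristic mentioned above for a single binary term; the function name and toy data are illustrative:

```python
import math
from collections import Counter

def information_gain(term_presence, labels):
    """Mutual information between a binary term indicator and the class label.

    term_presence: list of 0/1 flags (term absent/present in each document)
    labels:        list of class labels, one per document
    """
    n = len(labels)
    joint = Counter(zip(term_presence, labels))
    p_t = Counter(term_presence)
    p_c = Counter(labels)
    mi = 0.0
    for (t, c), n_tc in joint.items():
        p_tc = n_tc / n
        mi += p_tc * math.log(p_tc / ((p_t[t] / n) * (p_c[c] / n)))
    return mi

# Toy usage: a term that appears mostly in "finance" documents
presence = [1, 1, 1, 0, 0, 0]
classes  = ["finance", "finance", "finance", "sports", "sports", "finance"]
print(information_gain(presence, classes))
```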
8. Example of the Role of Feature Selection (from Chakrabarti, Chapter 5)
- 9,600 documents from the US Patent database; 20,000 raw features (terms)
9. Classifying Term Vectors
- Typically multiple different words may be helpful in classifying a particular class, e.g.,
- Class = finance
- Words = stocks, return, interest, rate, etc.
- Thus, classifiers that combine multiple features often do well, e.g.,
- Naïve Bayes, logistic regression, SVMs, etc.
- Linear classifiers often perform well in high dimensions
- In many cases there are fewer documents in the training data than dimensions,
- i.e., n < p, so the training data are typically linearly separable
- So again, naïve Bayes, logistic regression, and linear SVMs are all useful
- The question becomes which linear discriminant to select
10. Classification Issues
- Typically many features, p = O(10^4) terms
- Consider n sample points in p dimensions
- Binary labels: 2^n possible labelings (or dichotomies)
- A labeling is linearly separable if we can separate the labels with a hyperplane
- Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable
- f(n, p) = 1, for n ≤ p + 1
- f(n, p) = (2 / 2^n) Σ_{i=0}^{p} (n−1 choose i), for n > p + 1
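A small sketch that evaluates the f(n, p) expression above (Cover's function-counting result), under the assumption that the reconstruction with a bias term, summing i from 0 to p, is what the slide intends:

```python
from math import comb

def frac_linearly_separable(n, p):
    """Fraction of the 2^n labelings of n points in general position in p
    dimensions that a hyperplane (with bias term) can realize."""
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2**n

# For large p relative to n, almost every labeling is separable
print(frac_linearly_separable(10, 4))       # 0.5 (many labelings not separable)
print(frac_linearly_separable(100, 10000))  # 1.0
```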
11. (Figure slide: f(n, p) for large p; if n ≤ p + 1, any labeling is linearly separable)
12. Types of Classifiers
- Generative / probabilistic
- Model p(x | c) for each class, then estimate p(c | x)
- e.g., naïve Bayes model
- Conditional probability / regression
- Model p(c | x) directly
- e.g., logistic regression
- Discriminative
- Look for decision boundaries in input space x directly
- No probabilities
- e.g., perceptron, linear discriminants, SVMs, etc.
13. Probabilistic Generative Classifiers
- Model p(x | ck) for each class and perform classification via Bayes' rule:
- c = arg max_k p(ck | x) = arg max_k p(x | ck) p(ck)
- How to model p(x | ck)?
- p(x | ck) = probability of a bag of words x given a class ck
- Two commonly used approaches (for text):
- Naïve Bayes: treat each term xj as conditionally independent given ck
- Multinomial: model a document with N words as N tosses of a p-sided die
- Other models are possible but less common,
- e.g., model word order by using a Markov chain for p(x | ck)
14. Naïve Bayes Classifier for Text
- Naïve Bayes classifier = conditional independence model (sketch below)
- Assumes conditional independence of the terms given the class:
- p(x | ck) = Π_j p(xj | ck)
- Note that we model each term xj as a discrete random variable
- Binary terms (Bernoulli):
- p(x | ck) = Π_{j: xj=1} p(xj = 1 | ck) Π_{j: xj=0} p(xj = 0 | ck)
- Non-binary terms (counts):
- p(x | ck) = Π_j p(xj = k | ck)
- can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p(xj = k | ck) distributions
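A minimal sketch of the Bernoulli (binary-term) naïve Bayes model above, working in log space and using Laplace smoothing; the toy term matrix and class names are invented for illustration:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli naive Bayes: X is an (n_docs, n_terms) 0/1 matrix, y class labels.
    Returns class labels, log priors, and per-class log p(x_j = 1 | c_k) and
    log p(x_j = 0 | c_k), with Laplace smoothing alpha."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, np.log(priors), np.log(theta), np.log(1 - theta)

def predict_bernoulli_nb(x, classes, log_prior, log_theta, log_1m_theta):
    """arg max_k  log p(c_k) + sum_j log p(x_j | c_k) for a binary term vector x."""
    log_post = log_prior + (log_theta * x + log_1m_theta * (1 - x)).sum(axis=1)
    return classes[np.argmax(log_post)]

# Toy usage: 3 documents over 4 terms, two classes
X = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]])
y = np.array(["finance", "finance", "sports"])
model = train_bernoulli_nb(X, y)
print(predict_bernoulli_nb(np.array([1, 0, 0, 1]), *model))
```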
15. Multinomial Classifier for Text
- Multinomial classification model
- Assume that the data are generated by a p-sided die (multinomial model):
- p(x | ck) = p(Nx | ck) · [Nx! / Π_j nj!] · Π_j p(term j | ck)^{nj}
- where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
- p(Nx | ck) = probability a document has length Nx, e.g., a Poisson model
- Can be dropped if length is thought not to be class dependent
- Here we have a single random variable for each class, and the p(xj = i | ck) probabilities sum to 1 over i (i.e., a multinomial model)
- Probabilities are typically only defined and evaluated for i = 1, 2, 3, …
- But zero counts could also be modeled if desired
- This would be equivalent to a naïve Bayes model with a geometric distribution on counts
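A matching sketch of the multinomial model; only terms that occur contribute to the log-likelihood, and the class-independent combinatorial factor Nx!/Π nj! is dropped. Again, the toy counts are invented:

```python
import numpy as np

def train_multinomial(X, y, alpha=1.0):
    """Multinomial text model: X is an (n_docs, n_terms) count matrix, y labels.
    Estimates theta[k, j] = p(term j | class k) with add-alpha smoothing."""
    classes = np.unique(y)
    log_prior = np.log([(y == c).mean() for c in classes])
    counts = np.array([X[y == c].sum(axis=0) for c in classes], dtype=float)
    theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])
    return classes, log_prior, np.log(theta)

def predict_multinomial(x, classes, log_prior, log_theta):
    """arg max_k  log p(c_k) + sum_j n_j log p(term j | c_k)."""
    return classes[np.argmax(log_prior + log_theta @ x)]

# Toy usage with word-count vectors over 4 terms
X = np.array([[3, 1, 0, 0], [2, 0, 1, 0], [0, 0, 2, 3]])
y = np.array(["finance", "finance", "sports"])
print(predict_multinomial(np.array([1, 0, 0, 2]), *train_multinomial(X, y)))
```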
16. Highest-Probability Terms in Multinomial Distributions
17. Comparing Naïve Bayes and Multinomial Models
- McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments
- (however, this may be more a result of using counts vs. binary features)
- Note on names used in the literature:
- Bernoulli (or multivariate Bernoulli) is sometimes used for the binary version of the naïve Bayes model
- the multinomial model is also referred to as the unigram model
- the multinomial model is also sometimes (confusingly) referred to as naïve Bayes
18. WebKB Data Set
- Train on 5,000 hand-labeled web pages
- Cornell, Washington, U.Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results
19. Comparing Bernoulli and Multinomial on WebKB Data
20. Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
21. Comparing Multinomial and Bernoulli on Reuters Data (from McCallum and Nigam, 1998)
22. Comparing Bernoulli and Multinomial (slide from Chris Manning, Stanford)
- Results from classifying 13,589 Yahoo! Web pages in the Science subtree of the hierarchy into 95 different topics
23. Comments on Generative Models for Text
- (Comments applicable to both naïve Bayes and multinomial classifiers)
- Simple and fast → popular in practice
- e.g., linear in p, n, M for both training and prediction
- Training: smoothed frequency counts (e.g., Laplace smoothing of the term probabilities)
- e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
- Numerical issues
- Typically work with log p(ck | x), etc., to avoid numerical underflow
- Useful trick for sparse data (sketch below):
- when computing Σ_j log p(xj | ck), it may be much faster to
- precompute Σ_j log p(xj = 0 | ck)
- and then, for only the terms present in the document, subtract log p(xj = 0 | ck) and add log p(xj = 1 | ck)
- Note: both models are wrong, but for classification they are often sufficient
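A sketch of the sparse-data trick described above, assuming Bernoulli term probabilities like those in the earlier naïve Bayes sketch (the random theta matrix is purely illustrative):

```python
import numpy as np

# Assume log_theta[k, j] = log p(x_j = 1 | c_k) and log_1m_theta[k, j] = log p(x_j = 0 | c_k),
# e.g., from the Bernoulli naive Bayes sketch earlier (K classes, p terms).
rng = np.random.default_rng(0)
theta = rng.uniform(0.01, 0.2, size=(2, 10_000))   # toy per-class term probabilities
log_theta, log_1m_theta = np.log(theta), np.log(1 - theta)

# Precompute the "all terms absent" log-likelihood once per class
base = log_1m_theta.sum(axis=1)                    # shape (K,)

def doc_loglik(present_term_ids):
    """Log-likelihood of a sparse document under each class: start from the
    all-absent baseline and correct only the terms that actually occur."""
    j = np.asarray(present_term_ids)
    return base + (log_theta[:, j] - log_1m_theta[:, j]).sum(axis=1)

print(doc_loglik([3, 17, 4242]))   # touches 3 terms instead of all 10,000
```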
24. Beyond Independence
- Naïve Bayes and multinomial models both assume conditional independence of words given the class
- Alternative approaches try to account for higher-order dependencies
- Bayesian networks
- p(x | c) = Π_j p(xj | parents(xj), c)
- Equivalent to a directed graph where edges represent direct dependencies
- Various algorithms search for a good network structure
- Useful for improving the quality of the distribution model
- …however, this does not always translate into better classification
- Maximum entropy models (see the log-linear form below)
- p(x | c) = (1/Z) Π_S f_S(x_S, c), a product of functions over subsets of terms
- Equivalent to an undirected graphical model
- Estimation is equivalent to a maximum entropy assumption
- Feature selection is crucial (which f terms to include)
- can provide high-accuracy classification
- …however, tends to be computationally complex to fit (estimating Z is difficult)
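For reference, the maximum entropy (log-linear) family sketched above is commonly written as follows; the feature functions f_S and weights λ_S are generic placeholders rather than notation from the lecture:

```latex
p(\mathbf{x} \mid c) \;=\; \frac{1}{Z(c)}
  \exp\!\Big( \sum_{S} \lambda_S \, f_S(\mathbf{x}_S, c) \Big),
\qquad
Z(c) \;=\; \sum_{\mathbf{x}} \exp\!\Big( \sum_{S} \lambda_S \, f_S(\mathbf{x}_S, c) \Big)
```

The partition function Z(c) sums over all configurations of x, which is why estimation tends to be computationally expensive.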
26. Linear Classifiers
- Linear classifier (two-class case):
- classify as the positive class if w^T x + w0 > 0
- w is a p-dimensional vector of weights (learned from the data)
- w0 is a threshold (also learned from the data)
- Equation of the linear hyperplane (decision boundary):
- w^T x + w0 = 0
- Distance of a point x from the hyperplane: (w^T x + w0) / ||w||
27. Geometry of Linear Classifiers (figure)
- Decision boundary: w^T x + w0 = 0
- Direction of the w vector (normal to the boundary)
- Distance of x from the boundary is (1/||w||)(w^T x + w0)
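A tiny numeric check of the distance formula above, with an arbitrary weight vector chosen for illustration:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance of point x from the hyperplane w^T x + w0 = 0;
    the sign gives the predicted side (class) for a linear classifier."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5
print(signed_distance(np.array([1.0, 1.0]), w, w0))   # positive side
print(signed_distance(np.array([-1.0, 1.0]), w, w0))  # negative side
```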
28. Optimal Hyperplane, Support Vectors, and Margin (figure)
- Circles = support vectors: points on the convex hull that are closest to the hyperplane
- M = margin = distance of the support vectors from the hyperplane
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization (see work by Vapnik in the 1990s)
29. Optimal Separating Hyperplane
- Solution to a constrained optimization problem (see the formulation below)
- (Here yi ∈ {−1, +1} is the binary class label for example i)
- w.l.o.g., let ||w|| = 1/M
- Unique solution for a linearly separable data set
- Margin M of the classifier:
- the distance between the separating hyperplane and the closest training samples
- optimal separating hyperplane ⇔ maximum margin
30. Sketch of the Optimization Problem
- Define the Lagrangian as a function of the w vector, w0, and the multipliers αi
- The solution must satisfy the corresponding stationarity conditions
- Points with αi > 0 are called support vectors, and their distance from the hyperplane is M
- This results in a quadratic programming optimization problem
- Good news:
- convex function of the unknowns, unique optimum
- Variety of well-known algorithms for finding this optimum
- Bad news:
- Quadratic programming in general scales as O(n^3)
- In practice it takes O(n^a), where a ≈ 1.6 to 2 (see Chakrabarti, Chapter 5, p. 166)
31. Timing Results on Text Classification (from Chakrabarti, Chapter 5)
32. Support Vector Machines
- If αi > 0, then the distance of xi from the separating hyperplane is M
- Support vectors = points with associated αi > 0
- The decision function f(x) is computed from the support vectors alone
- → prediction can be fast
- Non-linearly-separable case: can generalize to allow slack constraints
- Non-linear SVMs: replace the original x vector with non-linear functions of x
- kernel trick: can solve the high-dimensional problem without working directly in high dimensions
- Computational speedups can reduce training time to near-linear
- e.g., Platt's SMO algorithm, Joachims' SVMLight (see the linear-SVM sketch below)
33. Classic Reuters Data Set
- 21,578 documents, labeled manually
- 9,603 training, 3,299 test articles (ModApte split)
- 118 categories
- An article can be in more than one category
- Learn 118 binary category distinctions
- Example interest-rate article:
- 2-APR-1987 06:35:19.50
- west-germany
- b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
- FRANKFURT, March 2
- The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Common categories (train, test):
- Earn (2877, 1087)
- Acquisitions (1650, 179)
- Money-fx (538, 179)
- Grain (433, 149)
- Crude (389, 189)
- Trade (369, 119)
- Interest (347, 131)
- Ship (197, 89)
- Wheat (212, 71)
- Corn (182, 56)
34. Dumais et al. (1998): Reuters Accuracy
35. Precision-Recall for SVM (linear), Naïve Bayes, and NN on the Reuters Data Set (from Dumais, 1998)
36. Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) on three data sets: 20 Newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB (from Chakrabarti, 2003, Chapter 5)
37. Comparing Text Classifiers
- Naïve Bayes models (Bernoulli or multinomial)
- Low time complexity (single linear pass through the data)
- Generally good, but not always the best
- Widely used for spam email filtering
- Linear SVMs
- Often produce the best results in research studies
- But computationally complex to train
- not as widely used in practice as naïve Bayes
- Others
- Logistic regression and decision trees are less widely used, but can be useful
38. Learning with Labeled and Unlabeled Documents
- In practice, obtaining labels for documents is time-consuming, expensive, and error-prone
- Typical application: a small number of labeled docs and a very large number of unlabeled docs
- Idea (sketch below):
- Build a probabilistic model on the labeled docs
- Classify the unlabeled docs, getting p(class | doc) for each class and doc
- This is equivalent to the E-step of the EM algorithm
- Now relearn the probabilistic model using the new soft labels
- This is equivalent to the M-step of the EM algorithm
- Continue to iterate until convergence (e.g., class probabilities do not change significantly)
- This EM approach shows that unlabeled data can improve classification performance compared to using labeled data alone
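A sketch of the EM procedure above, using scikit-learn's MultinomialNB as the probabilistic model (a choice made for illustration, not the lecture's code); the soft labels enter the M-step through sample weights, and the toy count matrices are invented:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_semi_supervised(X_lab, y_lab, X_unlab, n_iter=10):
    classes = np.unique(y_lab)
    model = MultinomialNB().fit(X_lab, y_lab)          # initialize on labeled docs
    for _ in range(n_iter):
        # E-step: soft labels p(class | doc) for the unlabeled documents
        post = model.predict_proba(X_unlab)
        # M-step: refit on labeled docs plus one copy of each unlabeled doc
        # per class, weighted by its posterior probability for that class
        X_all = np.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(len(X_unlab), c) for c in classes])
        w_all = np.concatenate([np.ones(len(X_lab))] + [post[:, k] for k in range(len(classes))])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return model

# Toy word-count data: 2 labeled docs, 3 unlabeled docs, 4 terms
X_lab = np.array([[3, 1, 0, 0], [0, 0, 2, 2]])
y_lab = np.array(["finance", "sports"])
X_unlab = np.array([[2, 1, 0, 0], [0, 1, 1, 2], [1, 0, 0, 3]])
print(em_semi_supervised(X_lab, y_lab, X_unlab).predict(X_unlab))
```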
39. Learning with Labeled and Unlabeled Data (graph from "Semi-supervised text classification using EM," Nigam, McCallum, and Mitchell, 2006)
40. Other Issues in Text Classification
- Real-time constraints
- Being able to update classifiers as new data arrive
- Being able to make predictions very quickly in real time
- Document length
- Varying document length can be a problem for some classifiers
- The multinomial model tends to handle this better than the Bernoulli model, for example
- Multi-labels and multiple classes
- Text documents can have more than one label
- SVMs, for example, directly handle only binary (two-class) problems
- Feature selection
- Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can often improve results
- Linked documents
- Can view Web documents as nodes in a directed graph
- Classification can then be performed in a way that leverages the link structure,
- Heuristic: class labels of linked pages are more likely to be the same
41. Further Reading on Text Classification
- Web-related text mining in general:
- S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003. See Chapter 5 for discussion of text classification.
- General references on text and language modeling:
- C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.
- D. Jurafsky and J. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Prentice Hall, 2000.
- SVMs for text classification:
- T. Joachims, Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.