Title: ICS 278: Data Mining Lecture 12: Text Mining
1. ICS 278: Data Mining, Lecture 12: Text Mining
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
3. Text Classification
- Text classification has many applications
  - Spam email detection
  - Automated tagging of streams of news articles, e.g., Google News
  - Automated creation of Web-page taxonomies
- Data Representation
  - Bag of words is most commonly used: either counts or binary indicators
  - Can also use phrases for commonly occurring combinations of words
- Classification Methods
  - Naïve Bayes: widely used (e.g., for spam email)
    - Fast and reasonably accurate
  - Support vector machines (SVMs)
    - Typically the most accurate method in research studies
    - But more complex computationally
  - Logistic regression (regularized)
    - Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
4. Trimming the Vocabulary
- Stopword removal
  - Remove non-content words
    - very frequent stop words such as "the" and "and"
  - Remove very rare words, e.g., words that occur only a few times in 100k documents
  - Can remove 30% or more of the original unique words
- Stemming
  - Reduce all variants of a word to a single term
    - E.g., draw, drawing, drawings -> draw
  - Porter stemming algorithm (1980)
    - relies on a preconstructed suffix list with associated rules
    - e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      - BINARIZATION -> BINARIZE
- This still often leaves p = O(10^4) terms
  - => a very high-dimensional classification problem!
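A minimal sketch of this kind of preprocessing, assuming NLTK's English stopword list and PorterStemmer are used; the toy corpus and the rare-word threshold are made up for illustration:

```python
# Sketch: stopword removal, Porter stemming, and rare-word pruning (toy corpus).
# Assumes NLTK is installed and its 'stopwords' corpus has been downloaded
# via nltk.download('stopwords').
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

docs = ["the stocks rose as interest rates fell",
        "interest in stocks is rising"]          # hypothetical documents

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Tokenize (crudely, on whitespace), drop stopwords, and stem the rest.
tokens = [[stemmer.stem(w) for w in d.lower().split() if w not in stop]
          for d in docs]

# Prune very rare terms: here, terms appearing only once in the whole corpus.
counts = Counter(t for doc in tokens for t in doc)
vocab = {t for t, c in counts.items() if c > 1}
print([[t for t in doc if t in vocab] for doc in tokens])
```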
5. Classification Issues
- Typically many features, p = O(10^4) terms
- Consider n sample points in p dimensions
  - Binary labels => 2^n possible labelings (or "dichotomies")
  - A labeling is linearly separable if we can separate the labels with a hyperplane
  - Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable; then

      f(n, p) = 1                                 if n <= p + 1
      f(n, p) = (2 / 2^n) Σ_{i=0}^{p} C(n-1, i)   if n > p + 1

    where C(n-1, i) denotes "n-1 choose i" (see the numerical sketch below)
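A small numerical sketch of this fraction, using only the Python standard library; the example values of n and p are illustrative:

```python
# Sketch: fraction of the 2^n labelings of n points in p dimensions
# that are linearly separable (points assumed in general position).
from math import comb

def frac_separable(n: int, p: int) -> float:
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

# With n below p + 1 every labeling is separable; past that the fraction
# falls toward 0 as n grows.
for n in (50, 101, 150, 300):
    print(n, frac_separable(n, p=100))
```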
7. Classifying Term Vectors
- Typically multiple different words may be helpful in classifying a particular class, e.g.,
  - Class = finance
  - Words = "stocks", "return", "interest", "rate", etc.
- Thus, classifiers that combine multiple features often do well, e.g.,
  - Naïve Bayes, logistic regression, SVMs
  - Classifiers based on single features (e.g., trees) do less well
- Linear classifiers often perform well in high dimensions
  - In many cases there are fewer documents in the training data than dimensions,
  - i.e., n < p => the training data are linearly separable
  - So again, naïve Bayes, logistic regression, and linear SVMs are all useful
- The question becomes: which linear discriminant to select?
8. Probabilistic Generative Classifiers
- Model p( x | ck ) for each class and perform classification via Bayes rule:
    c = arg max_k p( ck | x ) = arg max_k p( x | ck ) p( ck )
- How to model p( x | ck )?
  - p( x | ck ) = probability of a bag of words x given a class ck
- Two commonly used approaches (for text):
  - Naïve Bayes: treat each term xj as being conditionally independent, given ck
  - Multinomial: model a document with N words as N tosses of a p-sided die
- Other models are possible but less common,
  - E.g., model word order by using a Markov chain for p( x | ck )
9. Naïve Bayes Classifier for Text
- Naïve Bayes classifier = conditional independence model
- Assumes the terms are conditionally independent given the class:
    p( x | ck ) = Π_j p( xj | ck )
- Note that we model each term xj as a discrete random variable
- Binary terms (Bernoulli):
    p( x | ck ) = Π_{j: xj = 1} p( xj = 1 | ck ) · Π_{j: xj = 0} p( xj = 0 | ck )
- Non-binary terms (counts):
    p( x | ck ) = Π_j p( xj = k | ck )
  - can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p( xj = k | ck ) distributions
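A minimal sketch of the binary (Bernoulli) version with Laplace-smoothed estimates, classifying by maximizing log p( ck ) + Σ_j log p( xj | ck ); the tiny term-document matrix and labels are hypothetical:

```python
# Sketch: Bernoulli naive Bayes on a binary term-document matrix (toy data).
import numpy as np

X = np.array([[1, 1, 0, 0],      # rows = documents, columns = terms (1 = present)
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])       # class labels

classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Laplace-smoothed estimates of p( xj = 1 | ck ).
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x):
    # log p(ck) + sum over terms of log p(xj | ck), maximized over classes
    log_post = log_prior + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_post)]

print(predict(np.array([1, 1, 0, 0])), predict(np.array([0, 0, 1, 1])))
```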
10. Multinomial Classifier for Text
- Multinomial classification model
- Assume that the data are generated by a p-sided die (multinomial model):
    p( x | ck ) = p( Nx | ck ) · [ Nx! / (n1! ... np!) ] · Π_j p( term j | ck )^nj
  - where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
- p( Nx | ck ) = probability a document has length Nx, e.g., a Poisson model
  - Can be dropped if length is thought not to be class dependent
- Here we have a single random variable for each class, and the p( xj = i | ck ) probabilities sum to 1 over i (i.e., a multinomial model)
- Probabilities are typically only defined and evaluated for counts i = 1, 2, 3, ...
  - But zero counts could also be modeled if desired
  - This would be equivalent to a Naïve Bayes model with a geometric distribution on counts
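A minimal sketch of the multinomial version on term counts, dropping the length term p( Nx | ck ) and the multinomial coefficient since both are constant across classes for a given document; the toy counts and labels are hypothetical:

```python
# Sketch: multinomial classifier on term-count vectors (toy data).
import numpy as np

X = np.array([[3, 1, 0, 0],      # rows = documents, columns = term counts
              [2, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 4, 1]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Smoothed per-class term probabilities: one p-sided die per class,
# p(term j | ck) proportional to (total count of term j in class k) + 1.
counts = np.array([X[y == c].sum(axis=0) + 1 for c in classes], dtype=float)
theta = counts / counts.sum(axis=1, keepdims=True)

def predict(x):
    # log p(ck) + sum over terms of nj * log p(term j | ck), maximized over classes
    return classes[np.argmax(log_prior + x @ np.log(theta).T)]

print(predict(np.array([2, 1, 0, 0])), predict(np.array([0, 0, 3, 1])))
```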
11. Comparing Naïve Bayes and Multinomial Models
- McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments
  - (however, this may be more a result of using counts vs. binary features)
- Note on names used in the literature:
  - "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary version of the naïve Bayes model
  - the multinomial model is also referred to as the "unigram" model
  - the multinomial model is also sometimes (confusingly) referred to as "naïve Bayes"
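For reference, a quick sketch fitting both variants on the same toy count matrix with scikit-learn (assuming scikit-learn is available; the data are hypothetical):

```python
# Sketch: binary (Bernoulli) naive Bayes vs. multinomial model in scikit-learn.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

X = np.array([[3, 1, 0, 0],      # rows = documents, columns = term counts
              [2, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 4, 1]])
y = np.array([0, 0, 1, 1])

bernoulli = BernoulliNB().fit(X, y)        # thresholds counts to presence/absence
multinomial = MultinomialNB().fit(X, y)    # uses the counts directly

x_new = np.array([[1, 0, 2, 1]])
print(bernoulli.predict(x_new), multinomial.predict(x_new))
```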
12. WebKB Data Set
- Train on 5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results:
13. Probabilistic Model Comparison
14. Highest Probability Terms in Multinomial Distributions
15. Sample Learning Curve (Yahoo Science Data)
16. Comments on Generative Models for Text
- (Comments applicable to both Naïve Bayes and multinomial classifiers)
- Simple and fast => popular in practice
  - e.g., linear in p, n, and M for both training and prediction
  - Training = smoothed frequency counts (e.g., Laplace-smoothed estimates of the p( xj | ck ) probabilities)
  - e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
- Numerical issues
  - Typically work with log p( ck | x ), etc., to avoid numerical underflow
  - Useful trick (see the sketch below):
    - when computing Σ_j log p( xj | ck ) for sparse data, it may be much faster to
    - precompute Σ_j log p( xj = 0 | ck ) over all terms,
    - and then, for the few terms present in the document, subtract off the log p( xj = 0 | ck ) terms and add in the log p( xj = 1 | ck ) terms
- Note: both models are "wrong", but for classification they are often sufficient
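A minimal sketch of that sparse-data trick for the binary model, assuming precomputed arrays of log p( xj = 1 | ck ) and log p( xj = 0 | ck ); the probabilities here are randomly generated placeholders:

```python
# Sketch: fast class log-likelihoods for sparse binary documents.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.01, 0.2, size=(2, 10_000))   # toy p( xj = 1 | ck )
log_theta1, log_theta0 = np.log(theta), np.log1p(-theta)

# Precompute the "all terms absent" sum once per class.
base = log_theta0.sum(axis=1)                      # shape: [num_classes]

def loglik(present_term_ids):
    # Adjust the precomputed sum only for the (few) terms that are present.
    adj = (log_theta1[:, present_term_ids] - log_theta0[:, present_term_ids]).sum(axis=1)
    return base + adj

print(loglik(np.array([3, 17, 4242])))             # one log-likelihood per class
```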
18. Linear Classifiers
- Linear classifier (two-class case):
    wT x + w0 > 0
  - w is a p-dimensional vector of weights (learned from the data)
  - w0 is a threshold (also learned from the data)
- Equation of the linear hyperplane (decision boundary):
    wT x + w0 = 0
19. Geometry of Linear Classifiers
- Decision boundary: wT x + w0 = 0
- Direction of the w vector
- Distance of x from the boundary is ( 1 / ||w|| ) ( wT x + w0 )
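A small sketch of the decision rule and the distance formula above; the weight vector, threshold, and input are made-up values:

```python
# Sketch: linear decision rule and signed distance to the decision boundary.
import numpy as np

w = np.array([2.0, -1.0, 0.5])     # hypothetical learned weight vector
w0 = -0.25                          # hypothetical threshold
x = np.array([1.0, 0.0, 3.0])       # a document's feature vector

score = w @ x + w0                           # wT x + w0
label = 1 if score > 0 else -1               # two-class decision rule
distance = score / np.linalg.norm(w)         # signed distance from the boundary
print(label, distance)
```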
20. Optimal Hyperplane and Margin
- M = margin; circles = support vectors
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization
21. Optimal Separating Hyperplane
- Solution to a constrained optimization problem: find the hyperplane that maximizes the margin M while classifying all training examples correctly, i.e., yi ( wT xi + w0 ) > 0 for all i
  - (Here yi ∈ {-1, 1} is the binary class label for example i)
- Unique for each linearly separable data set
- Margin M of the classifier:
  - the distance between the separating hyperplane and the closest training samples
  - optimal separating hyperplane <=> maximum margin
22. Sketch of the Optimization Problem
- Define the Lagrangian as a function of the w vector, w0, and the multipliers αi
- The form of the solution dictates that the optimal w can be expressed as a weighted combination of the training points, w = Σ_i αi yi xi
- This results in a quadratic programming optimization problem
- Good news:
  - convex function of the unknowns, unique optimum
  - Variety of well-known algorithms for finding this optimum
- Bad news:
  - Quadratic programming in general scales as O(n^3)
23. Support Vector Machines
- If αi > 0 then the distance of xi from the separating hyperplane is M
- Support vectors = points with associated αi > 0
- The decision function f(x) is computed from the support vectors as
    f(x) = Σ_{i: αi > 0} αi yi xiT x + w0
  - => prediction can be fast
- Non-linearly-separable case: can generalize to allow slack constraints
- Non-linear SVMs: replace the original x vector with non-linear functions of x
  - the "kernel trick" can solve the high-dimensional problem without working directly in high dimensions
- Computational speedups can reduce training time to O(n^2) or even near-linear
  - e.g., Platt's SMO algorithm, Joachims' SVMLight
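A brief sketch of a linear SVM text classifier built on a bag-of-words representation, using scikit-learn's CountVectorizer and LinearSVC (scikit-learn assumed available; the documents and labels are toy examples):

```python
# Sketch: linear SVM for text with a binary bag-of-words representation
# (toy documents and labels; scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["stocks and interest rates rose today",
        "the bank raised its lending rate",
        "the team won the game in overtime",
        "a late goal decided the match"]
labels = ["finance", "finance", "sports", "sports"]

vectorizer = CountVectorizer(binary=True)   # binary term vectors
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["interest rates and stocks"])))
```

LinearSVC trains a linear-kernel SVM with a fast solver, so it handles the high-dimensional sparse matrices typical of text well.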
24. Timing Results on Text Classification (from Chakrabarti, Chapter 5, 2002)
25. Classic Reuters Data Set
- 21,578 documents, labeled manually
  - 9,603 training, 3,299 test articles ("ModApte" split)
  - 118 categories
    - An article can be in more than one category
    - Learn 118 binary category distinctions
- Example "interest rate" article:
  - 2-APR-1987 06:35:19.50
  - west-germany
  - b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  - FRANKFURT, March 2
  - The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Common categories (train, test):
  - Earn (2877, 1087)
  - Acquisitions (1650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)
26. Dumais et al. (1998): Reuters Accuracy
27. Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998), using the Reuters data set
28. Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) using three data sets (20 Newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB). From Chakrabarti, 2003, Chapter 5.
29. Other Issues in Text Classification
- Real-time constraints
  - Being able to update classifiers as new data arrive
  - Being able to make predictions very quickly in real time
- Multi-labels and multiple classes
  - Text documents can have more than one label
  - SVMs, for example, can only handle binary (two-class) problems directly
- Feature selection
  - Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can improve results (see the sketch below)
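A brief sketch of information-gain-style feature selection, scoring each term by its mutual information with the class and keeping the top k, via scikit-learn's SelectKBest (scikit-learn assumed available; the toy documents, labels, and the choice of k are hypothetical):

```python
# Sketch: rank terms by term/class mutual information (information gain)
# and keep the top k before training a classifier (toy data; k is arbitrary).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap pills buy now", "limited offer buy cheap",
        "meeting agenda for tuesday", "project status and agenda"]
labels = [1, 1, 0, 0]             # 1 = spam, 0 = not spam

X = CountVectorizer(binary=True).fit_transform(docs)
selector = SelectKBest(mutual_info_classif, k=4)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)            # (4 documents, 4 selected terms)
```

The reduced matrix can then be passed to any of the classifiers discussed above.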
30. Further Reading on Text Classification
- General references on text and language modeling
  - Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
  - Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.
- SVMs for text classification
  - T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.
- Web-related text mining
  - S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.