Title: ICS 278: Data Mining Lecture 12: Text Mining
1. ICS 278: Data Mining, Lecture 12: Text Mining
- Padhraic Smyth
- Department of Information and Computer Science
- University of California, Irvine
2. Text Mining
- Information Retrieval
- Text Classification
- Text Clustering
- Information Extraction
3. Text Classification
- Text classification has many applications
  - Spam email detection
  - Automated tagging of streams of news articles, e.g., Google News
  - Automated creation of Web-page taxonomies
- Data Representation
  - Bag of words is most commonly used: either counts or binary indicators
  - Can also use phrases for commonly occurring combinations of words
- Classification Methods
  - Naïve Bayes: widely used (e.g., for spam email)
    - Fast and reasonably accurate
  - Support vector machines (SVMs)
    - Typically the most accurate method in research studies
    - But more complex computationally
  - Logistic regression (regularized)
    - Not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002)
4. Trimming the Vocabulary
- Stopword removal
  - Remove non-content words
    - very frequent stop words such as "the" and "and"
  - Remove very rare words, e.g., words that occur only a few times in 100k documents
  - Can remove 30% or more of the original unique words
- Stemming
  - Reduce all variants of a word to a single term
    - E.g., draw, drawing, drawings -> draw
  - Porter stemming algorithm (1980)
    - relies on a preconstructed suffix list with associated rules
    - e.g., if suffix = IZATION and the prefix contains at least one vowel followed by a consonant, replace with suffix = IZE
      - BINARIZATION -> BINARIZE
- This still often leaves p = O(10^4) terms
  - => a very high-dimensional classification problem!
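A minimal sketch of this kind of preprocessing, assuming NLTK's English stopword list and PorterStemmer are used; the toy corpus and the rare-word threshold are made up for illustration:

```python
# Sketch: stopword removal, Porter stemming, and rare-word pruning (toy corpus).
# Assumes NLTK is installed and its 'stopwords' corpus has been downloaded
# via nltk.download('stopwords').
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

docs = ["the stocks rose as interest rates fell",
        "interest in stocks is rising"]          # hypothetical documents

stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

# Tokenize (crudely, on whitespace), drop stopwords, and stem the rest.
tokens = [[stemmer.stem(w) for w in d.lower().split() if w not in stop]
          for d in docs]

# Prune very rare terms: here, terms appearing only once in the whole corpus.
counts = Counter(t for doc in tokens for t in doc)
vocab = {t for t, c in counts.items() if c > 1}
print([[t for t in doc if t in vocab] for doc in tokens])
```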
5. Classification Issues
- Typically many features, p = O(10^4) terms
- Consider n sample points in p dimensions
  - Binary labels => 2^n possible labelings (or "dichotomies")
  - A labeling is linearly separable if we can separate the labels with a hyperplane
  - Let f(n, p) = fraction of the 2^n possible labelings that are linearly separable; then

      f(n, p) = 1                                 if n <= p + 1
      f(n, p) = (2 / 2^n) Σ_{i=0}^{p} C(n-1, i)   if n > p + 1

    where C(n-1, i) denotes "n-1 choose i" (see the numerical sketch below)
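A small numerical sketch of this fraction, using only the Python standard library; the example values of n and p are illustrative:

```python
# Sketch: fraction of the 2^n labelings of n points in p dimensions
# that are linearly separable (points assumed in general position).
from math import comb

def frac_separable(n: int, p: int) -> float:
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

# With n below p + 1 every labeling is separable; past that the fraction
# falls toward 0 as n grows.
for n in (50, 101, 150, 300):
    print(n, frac_separable(n, p=100))
```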
7. Classifying Term Vectors
- Typically multiple different words may be helpful in classifying a particular class, e.g.,
  - Class = finance
  - Words = "stocks", "return", "interest", "rate", etc.
- Thus, classifiers that combine multiple features often do well, e.g.,
  - Naïve Bayes, logistic regression, SVMs
  - Classifiers based on single features (e.g., trees) do less well
- Linear classifiers often perform well in high dimensions
  - In many cases there are fewer documents in the training data than dimensions,
  - i.e., n < p => the training data are linearly separable
  - So again, naïve Bayes, logistic regression, and linear SVMs are all useful
- The question becomes: which linear discriminant to select?
8. Probabilistic Generative Classifiers
- Model p( x | ck ) for each class and perform classification via Bayes rule:
    c = arg max_k p( ck | x ) = arg max_k p( x | ck ) p( ck )
- How to model p( x | ck )?
  - p( x | ck ) = probability of a bag of words x given a class ck
- Two commonly used approaches (for text):
  - Naïve Bayes: treat each term xj as being conditionally independent, given ck
  - Multinomial: model a document with N words as N tosses of a p-sided die
- Other models are possible but less common,
  - E.g., model word order by using a Markov chain for p( x | ck )
9. Naïve Bayes Classifier for Text
- Naïve Bayes classifier = conditional independence model
- Assumes the terms are conditionally independent given the class:
    p( x | ck ) = Π_j p( xj | ck )
- Note that we model each term xj as a discrete random variable
- Binary terms (Bernoulli):
    p( x | ck ) = Π_{j: xj = 1} p( xj = 1 | ck ) · Π_{j: xj = 0} p( xj = 0 | ck )
- Non-binary terms (counts):
    p( x | ck ) = Π_j p( xj = k | ck )
  - can use a parametric model (e.g., Poisson) or a non-parametric model (e.g., histogram) for the p( xj = k | ck ) distributions
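A minimal sketch of the binary (Bernoulli) version with Laplace-smoothed estimates, classifying by maximizing log p( ck ) + Σ_j log p( xj | ck ); the tiny term-document matrix and labels are hypothetical:

```python
# Sketch: Bernoulli naive Bayes on a binary term-document matrix (toy data).
import numpy as np

X = np.array([[1, 1, 0, 0],      # rows = documents, columns = terms (1 = present)
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])       # class labels

classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Laplace-smoothed estimates of p( xj = 1 | ck ).
theta = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2) for c in classes])

def predict(x):
    # log p(ck) + sum over terms of log p(xj | ck), maximized over classes
    log_post = log_prior + (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return classes[np.argmax(log_post)]

print(predict(np.array([1, 1, 0, 0])), predict(np.array([0, 0, 1, 1])))
```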
10. Multinomial Classifier for Text
- Multinomial classification model
- Assume that the data are generated by a p-sided die (multinomial model):
    p( x | ck ) = p( Nx | ck ) · [ Nx! / (n1! ... np!) ] · Π_j p( term j | ck )^nj
  - where Nx = number of terms (total count) in document x, and nj = number of times term j occurs in the document
- p( Nx | ck ) = probability a document has length Nx, e.g., a Poisson model
  - Can be dropped if length is thought not to be class dependent
- Here we have a single random variable for each class, and the p( xj = i | ck ) probabilities sum to 1 over i (i.e., a multinomial model)
- Probabilities are typically only defined and evaluated for counts i = 1, 2, 3, ...
  - But zero counts could also be modeled if desired
  - This would be equivalent to a Naïve Bayes model with a geometric distribution on counts
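A minimal sketch of the multinomial version on term counts, dropping the length term p( Nx | ck ) and the multinomial coefficient since both are constant across classes for a given document; the toy counts and labels are hypothetical:

```python
# Sketch: multinomial classifier on term-count vectors (toy data).
import numpy as np

X = np.array([[3, 1, 0, 0],      # rows = documents, columns = term counts
              [2, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 4, 1]])
y = np.array([0, 0, 1, 1])

classes = np.unique(y)
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# Smoothed per-class term probabilities: one p-sided die per class,
# p(term j | ck) proportional to (total count of term j in class k) + 1.
counts = np.array([X[y == c].sum(axis=0) + 1 for c in classes], dtype=float)
theta = counts / counts.sum(axis=1, keepdims=True)

def predict(x):
    # log p(ck) + sum over terms of nj * log p(term j | ck), maximized over classes
    return classes[np.argmax(log_prior + x @ np.log(theta).T)]

print(predict(np.array([2, 1, 0, 0])), predict(np.array([0, 0, 3, 1])))
```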
11. Comparing Naïve Bayes and Multinomial Models
- McCallum and Nigam (1998) found that the multinomial model outperformed naïve Bayes (with binary features) in text classification experiments
  - (however, this may be more a result of using counts vs. binary features)
- Note on names used in the literature:
  - "Bernoulli" (or "multivariate Bernoulli") is sometimes used for the binary version of the naïve Bayes model
  - the multinomial model is also referred to as the "unigram" model
  - the multinomial model is also sometimes (confusingly) referred to as "naïve Bayes"
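For reference, a quick sketch fitting both variants on the same toy count matrix with scikit-learn (assuming scikit-learn is available; the data are hypothetical):

```python
# Sketch: binary (Bernoulli) naive Bayes vs. multinomial model in scikit-learn.
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

X = np.array([[3, 1, 0, 0],      # rows = documents, columns = term counts
              [2, 0, 1, 0],
              [0, 0, 2, 3],
              [0, 1, 4, 1]])
y = np.array([0, 0, 1, 1])

bernoulli = BernoulliNB().fit(X, y)        # thresholds counts to presence/absence
multinomial = MultinomialNB().fit(X, y)    # uses the counts directly

x_new = np.array([[1, 0, 2, 1]])
print(bernoulli.predict(x_new), multinomial.predict(x_new))
```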
12. WebKB Data Set
- Train on 5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results:
13. Probabilistic Model Comparison
14. Highest Probability Terms in Multinomial Distributions
15. Sample Learning Curve (Yahoo Science Data)
16. Comments on Generative Models for Text
- (Comments applicable to both Naïve Bayes and multinomial classifiers)
- Simple and fast => popular in practice
  - e.g., linear in p, n, and M for both training and prediction
  - Training = smoothed frequency counts (e.g., Laplace-smoothed estimates of the p( xj | ck ) probabilities)
  - e.g., easy to use in situations where the classifier needs to be updated regularly (e.g., for spam email)
- Numerical issues
  - Typically work with log p( ck | x ), etc., to avoid numerical underflow
  - Useful trick (see the sketch below):
    - when computing Σ_j log p( xj | ck ) for sparse data, it may be much faster to
    - precompute Σ_j log p( xj = 0 | ck ) over all terms,
    - and then, for the few terms present in the document, subtract off the log p( xj = 0 | ck ) terms and add in the log p( xj = 1 | ck ) terms
- Note: both models are "wrong", but for classification they are often sufficient
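A minimal sketch of that sparse-data trick for the binary model, assuming precomputed arrays of log p( xj = 1 | ck ) and log p( xj = 0 | ck ); the probabilities here are randomly generated placeholders:

```python
# Sketch: fast class log-likelihoods for sparse binary documents.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0.01, 0.2, size=(2, 10_000))   # toy p( xj = 1 | ck )
log_theta1, log_theta0 = np.log(theta), np.log1p(-theta)

# Precompute the "all terms absent" sum once per class.
base = log_theta0.sum(axis=1)                      # shape: [num_classes]

def loglik(present_term_ids):
    # Adjust the precomputed sum only for the (few) terms that are present.
    adj = (log_theta1[:, present_term_ids] - log_theta0[:, present_term_ids]).sum(axis=1)
    return base + adj

print(loglik(np.array([3, 17, 4242])))             # one log-likelihood per class
```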
18. Linear Classifiers
- Linear classifier (two-class case):
    wT x + w0 > 0
  - w is a p-dimensional vector of weights (learned from the data)
  - w0 is a threshold (also learned from the data)
- Equation of the linear hyperplane (decision boundary):
    wT x + w0 = 0
19. Geometry of Linear Classifiers
- Decision boundary: wT x + w0 = 0
- Direction of the w vector
- Distance of x from the boundary is ( 1 / ||w|| ) ( wT x + w0 )
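A small sketch of the decision rule and the distance formula above; the weight vector, threshold, and input are made-up values:

```python
# Sketch: linear decision rule and signed distance to the decision boundary.
import numpy as np

w = np.array([2.0, -1.0, 0.5])     # hypothetical learned weight vector
w0 = -0.25                          # hypothetical threshold
x = np.array([1.0, 0.0, 3.0])       # a document's feature vector

score = w @ x + w0                           # wT x + w0
label = 1 if score > 0 else -1               # two-class decision rule
distance = score / np.linalg.norm(w)         # signed distance from the boundary
print(label, distance)
```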
20. Optimal Hyperplane and Margin
- M = margin; circles = support vectors
- Goal is to find the weight vector that maximizes M
- Theory tells us that the max-margin hyperplane leads to good generalization
21. Optimal Separating Hyperplane
- Solution to a constrained optimization problem: find the hyperplane that maximizes the margin M while classifying all training examples correctly, i.e., yi ( wT xi + w0 ) > 0 for all i
  - (Here yi ∈ {-1, 1} is the binary class label for example i)
- Unique for each linearly separable data set
- Margin M of the classifier:
  - the distance between the separating hyperplane and the closest training samples
  - optimal separating hyperplane <=> maximum margin
22. Sketch of the Optimization Problem
- Define the Lagrangian as a function of the w vector, w0, and the multipliers αi
- The form of the solution dictates that the optimal w can be expressed as a weighted combination of the training points, w = Σ_i αi yi xi
- This results in a quadratic programming optimization problem
- Good news:
  - convex function of the unknowns, unique optimum
  - Variety of well-known algorithms for finding this optimum
- Bad news:
  - Quadratic programming in general scales as O(n^3)
23. Support Vector Machines
- If αi > 0 then the distance of xi from the separating hyperplane is M
- Support vectors = points with associated αi > 0
- The decision function f(x) is computed from the support vectors as
    f(x) = Σ_{i: αi > 0} αi yi xiT x + w0
  - => prediction can be fast
- Non-linearly-separable case: can generalize to allow slack constraints
- Non-linear SVMs: replace the original x vector with non-linear functions of x
  - the "kernel trick" can solve the high-dimensional problem without working directly in high dimensions
- Computational speedups can reduce training time to O(n^2) or even near-linear
  - e.g., Platt's SMO algorithm, Joachims' SVMLight
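A brief sketch of a linear SVM text classifier built on a bag-of-words representation, using scikit-learn's CountVectorizer and LinearSVC (scikit-learn assumed available; the documents and labels are toy examples):

```python
# Sketch: linear SVM for text with a binary bag-of-words representation
# (toy documents and labels; scikit-learn assumed installed).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["stocks and interest rates rose today",
        "the bank raised its lending rate",
        "the team won the game in overtime",
        "a late goal decided the match"]
labels = ["finance", "finance", "sports", "sports"]

vectorizer = CountVectorizer(binary=True)   # binary term vectors
X = vectorizer.fit_transform(docs)

clf = LinearSVC(C=1.0)
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["interest rates and stocks"])))
```

LinearSVC trains a linear-kernel SVM with a fast solver, so it handles the high-dimensional sparse matrices typical of text well.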
24. Timing Results on Text Classification (from Chakrabarti, Chapter 5, 2002)
25. Classic Reuters Data Set
- 21,578 documents, labeled manually
  - 9,603 training, 3,299 test articles ("ModApte" split)
  - 118 categories
    - An article can be in more than one category
    - Learn 118 binary category distinctions
- Example "interest rate" article:
  - 2-APR-1987 06:35:19.50
  - west-germany
  - b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  - FRANKFURT, March 2
  - The Bundesbank left credit policies unchanged after today's regular meeting of its council, a spokesman said in answer to enquiries. The West German discount rate remains at 3.0 pct, and the Lombard emergency financing rate at 5.0 pct.
- Common categories (train, test):
  - Earn (2877, 1087)
  - Acquisitions (1650, 179)
  - Money-fx (538, 179)
  - Grain (433, 149)
  - Crude (389, 189)
  - Trade (369, 119)
  - Interest (347, 131)
  - Ship (197, 89)
  - Wheat (212, 71)
  - Corn (182, 56)
26. Dumais et al. (1998): Reuters Accuracy
27. Precision-Recall for SVM (linear), Naïve Bayes, and NN (from Dumais 1998), using the Reuters data set
28. Comparison of accuracy across three classifiers (Naive Bayes, Maximum Entropy, and Linear SVM) using three data sets (20 Newsgroups, the Recreation sub-tree of the Open Directory, and University Web pages from WebKB). From Chakrabarti, 2003, Chapter 5.
29. Other Issues in Text Classification
- Real-time constraints
  - Being able to update classifiers as new data arrive
  - Being able to make predictions very quickly in real time
- Multi-labels and multiple classes
  - Text documents can have more than one label
  - SVMs, for example, can only handle binary (two-class) problems directly
- Feature selection
  - Experiments have shown that feature selection (e.g., by greedy algorithms using information gain) can improve results (see the sketch below)
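A brief sketch of information-gain-style feature selection, scoring each term by its mutual information with the class and keeping the top k, via scikit-learn's SelectKBest (scikit-learn assumed available; the toy documents, labels, and the choice of k are hypothetical):

```python
# Sketch: rank terms by term/class mutual information (information gain)
# and keep the top k before training a classifier (toy data; k is arbitrary).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

docs = ["cheap pills buy now", "limited offer buy cheap",
        "meeting agenda for tuesday", "project status and agenda"]
labels = [1, 1, 0, 0]             # 1 = spam, 0 = not spam

X = CountVectorizer(binary=True).fit_transform(docs)
selector = SelectKBest(mutual_info_classif, k=4)
X_reduced = selector.fit_transform(X, labels)
print(X_reduced.shape)            # (4 documents, 4 selected terms)
```

The reduced matrix can then be passed to any of the classifiers discussed above.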
30. Further Reading on Text Classification
- General references on text and language modeling
  - Foundations of Statistical Natural Language Processing, C. Manning and H. Schutze, MIT Press, 1999.
  - Speech and Language Processing: An Introduction to Natural Language Processing, Dan Jurafsky and James Martin, Prentice Hall, 2000.
- SVMs for text classification
  - T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory and Algorithms, Kluwer, 2002.
- Web-related text mining
  - S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.