ICS 278: Data Mining Lecture 15: Text Classification

1
ICS 278: Data Mining
Lecture 15: Text Classification
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Roadmap for Lectures
  • Lecture 15 (today): text classification
  • Lectures 16, 17, 18, 19:
  • Unsupervised learning from text: clustering and
    topic modeling
  • Recommender systems
  • Credit scoring applications
  • Pattern-finding algorithms
  • Lecture 20:
  • Thursday June 8th (2 weeks from Thursday)
  • 5-minute project summary from each student
  • More details on the format to come later.

3
Text Classification
  • Text classification has many applications
  • Spam email detection
  • Automated tagging of streams of news articles,
    e.g., Google News
  • Automated creation of Web-page taxonomies
  • Data Representation
  • Bag of words is most commonly used, with either
    counts or binary features (see the sketch below)
  • Can also use phrases for commonly occurring
    combinations of words
  • Classification Methods
  • Naïve Bayes: widely used (e.g., for spam email)
  • Fast and reasonably accurate
  • Support vector machines (SVMs)
  • Typically the most accurate method in research
    studies
  • But more computationally complex
  • Logistic Regression (regularized)
  • Not as widely used, but can be competitive with
    SVMs (e.g., Zhang and Oles, 2002)
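To make the bag-of-words representation concrete, here is a minimal sketch using scikit-learn's CountVectorizer; the toy documents and parameter choices are illustrative assumptions, not taken from the slides.

# Minimal bag-of-words sketch (toy documents; counts vs. binary features).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks and interest rates rise",
    "free offer, click now to claim your prize",
    "the interest rate remains unchanged",
]

count_vec = CountVectorizer()              # each column = a term, entries = counts
X_counts = count_vec.fit_transform(docs)

binary_vec = CountVectorizer(binary=True)  # presence/absence of each term
X_binary = binary_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(X_counts.toarray())
print(X_binary.toarray())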

4
Further Reading on Text Classification
  • Web-related text mining in general
  • S. Chakrabarti, Mining the Web: Discovering
    Knowledge from Hypertext Data, Morgan Kaufmann,
    2003.
  • See chapter 5 for a discussion of text
    classification
  • General references on text and language modeling
  • Foundations of Statistical Natural Language
    Processing, C. Manning and H. Schütze, MIT Press,
    1999.
  • Speech and Language Processing: An Introduction
    to Natural Language Processing, Dan Jurafsky and
    James Martin, Prentice Hall, 2000.
  • SVMs for text classification
  • T. Joachims, Learning to Classify Text Using
    Support Vector Machines: Methods, Theory and
    Algorithms, Kluwer, 2002

5
Common Data Sets used for Evaluation
  • Reuters
  • 10,700 labeled documents
  • 10% of documents have multiple class labels
  • Yahoo! Science Hierarchy
  • 95 disjoint classes with 13,598 pages
  • 20 Newsgroups data
  • 18,800 labeled USENET postings
  • 20 leaf classes, 5 root level classes
  • WebKB
  • 8,300 documents in 7 categories such as faculty,
    course, student.
  • Industry
  • 6,449 home pages of companies partitioned into 71
    classes

6
Trimming the Vocabulary
  • Stopword removal
  • remove non-content words
  • very frequent stop words such as "the", "and"
  • remove very rare words, e.g., that only occur a
    few times in 100k documents
  • Can remove 30% or more of the original unique
    words
  • Stemming (a short sketch of both steps follows
    this slide)
  • Reduce all variants of a word to a single term
  • E.g., draw, drawing, drawings → draw
  • Porter stemming algorithm (1980)
  • relies on a preconstructed suffix list with
    associated rules
  • e.g., if the suffix is IZATION and the prefix
    contains at least one vowel followed by a
    consonant, replace with the suffix IZE
  • BINARIZATION → BINARIZE
  • This still often leaves p ≈ O(10^4) terms
  • a very high-dimensional classification
    problem!
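A short sketch of the two trimming steps above, using NLTK's stopword list and Porter stemmer; it assumes nltk is installed and its stopword corpus has been downloaded with nltk.download("stopwords").

# Stopword removal followed by Porter stemming (illustrative token list).
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

tokens = ["the", "drawings", "and", "drawing", "were", "drawn", "quickly"]

content = [t for t in tokens if t not in stop_words]   # drop non-content words
stems = [stemmer.stem(t) for t in content]             # map variants to a stem
print(stems)    # e.g. ['draw', 'draw', 'drawn', 'quickli']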

7
Feature Selection
  • Performance of text classification algorithms can
    often be improved by selecting only a subset of
    the most discriminative terms
  • See the classification results later in these
    slides
  • Greedy search
  • Start from the empty set or the full set and
    add/delete one term at a time
  • Heuristics for adding/deleting
  • Information gain (mutual information of term with
    class)
  • Chi-square
  • Other ideas
  • Methods tend not to be particularly sensitive to
    the specific heuristic used for feature
    selection, but some form of feature selection
    often improves performance (see the sketch below)
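A hedged sketch of the two scoring heuristics above, using scikit-learn's chi-square and mutual-information (information gain) scorers on a toy two-class corpus; the documents, labels, and the value of k are assumptions for illustration.

# Score terms by chi-square or mutual information with the class label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = [
    "stocks rise as interest rates fall",
    "bank raises interest rate again",
    "team wins the championship game",
    "player scores twice in final game",
]
labels = [0, 0, 1, 1]                      # 0 = finance, 1 = sports

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=5).fit(X, labels)        # chi-square scores
print(vec.get_feature_names_out()[selector.get_support()])

mi_selector = SelectKBest(mutual_info_classif, k=5).fit(X, labels)
print(vec.get_feature_names_out()[mi_selector.get_support()])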

8
Example of Role of Feature Selection
(from Chakrabarti, Chapter 5)
9,600 documents from the US Patent database; 20,000
raw features (terms)
9
Classifying Term Vectors
  • Typically multiple different words may be helpful
    in classifying a particular class, e.g.,
  • Class: finance
  • Words: stocks, return, interest, rate,
    etc.
  • Thus, classifiers that combine multiple features
    often do well, e.g.,
  • Naïve Bayes, logistic regression, SVMs, etc.
  • Linear classifiers often perform well in
    high dimensions
  • In many cases there are fewer documents in the
    training data than dimensions,
  • i.e., n < p, so the training data are linearly
    separable
  • So again, naïve Bayes, logistic regression, and
    linear SVMs are all useful
  • The question becomes: which linear discriminant
    to select?

10
Classification Issues
  • Typically many features, p ≈ O(10^4) terms
  • Consider n sample points in p dimensions
  • Binary labels: 2^n possible labelings (or
    dichotomies)
  • A labeling is linearly separable if we can
    separate the labels with a hyperplane
  • Let f(n, p) = fraction of the 2^n possible
    labelings that are linearly separable
  • f(n, p) = 1 for n ≤ p + 1
  • f(n, p) = (2 / 2^n) Σ_{i=0}^{p} C(n − 1, i) for
    n > p + 1 (a small numeric check follows this
    slide)
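As a quick numeric check of this counting argument, the fraction f(n, p) can be computed directly; the code below and the sample values of n and p are illustrative.

# Fraction of the 2^n labelings of n points in general position in p
# dimensions that are linearly separable (Cover's function counting theorem).
from math import comb

def separable_fraction(n: int, p: int) -> float:
    if n <= p + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(p + 1)) / 2 ** n

# For large p, almost all labelings stay separable until n approaches 2(p + 1).
for n in (50, 150, 200, 250, 400):
    print(n, round(separable_fraction(n, 100), 4))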

11
If n < 2(p + 1), the labeling is almost certainly
linearly separable (for large p)
12
Types of Classifiers
  • Generative/Probabilistic
  • Model p(x | c) for each class, then estimate
    p(c | x)
  • e.g., naïve Bayes model
  • Conditional Probability/Regression
  • Model p(c | x) directly
  • e.g., logistic regression
  • Discriminative
  • Look for decision boundaries in input space x
    directly
  • No probabilities
  • e.g., perceptron, linear discriminants, SVMs, etc.

13
Probabilistic Generative Classifiers
  • Model p(x | ck) for each class and perform
    classification via Bayes' rule:
    c = arg max_k p(ck | x) = arg max_k p(x | ck) p(ck)
  • How to model p(x | ck)?
  • p(x | ck) = probability of a bag of words x
    given a class ck
  • Two commonly used approaches (for text)
  • Naïve Bayes: treat each term xj as being
    conditionally independent, given ck
  • Multinomial: model a document with N words as N
    tosses of a p-sided die
  • Other models possible but less common,
  • e.g., model word order by using a Markov chain
    for p(x | ck)

14
Naïve Bayes Classifier for Text
  • Naïve Bayes classifier = conditional
    independence model
  • Assumes conditional independence of the terms
    given the class: p(x | ck) = Π_j p(xj | ck)
  • Note that we model each term xj as a discrete
    random variable
  • Binary terms (Bernoulli; see the sketch after
    this slide):
    p(x | ck) = Π_{j: xj=1} p(xj = 1 | ck) ×
    Π_{j: xj=0} p(xj = 0 | ck)
  • Non-binary terms (counts):
  • p(x | ck) = Π_j p(xj = k | ck)
  • can use a parametric model (e.g.,
    Poisson) or non-parametric model
    (e.g., histogram) for the p(xj = k | ck)
    distributions.
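A minimal Bernoulli naive Bayes sketch along the lines of the binary-terms case above, using scikit-learn's BernoulliNB; the tiny term-presence matrix and labels are made up for illustration.

# Bernoulli naive Bayes on binary term-presence features (toy data).
import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[1, 0, 1, 0],      # rows = documents, columns = terms (0/1)
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])       # class labels

clf = BernoulliNB(alpha=1.0)     # alpha = Laplace smoothing of the counts
clf.fit(X, y)

x_new = np.array([[1, 0, 0, 1]])
print(clf.predict(x_new))            # most probable class
print(clf.predict_log_proba(x_new))  # log p(ck | x), computed in log space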

15
Multinomial Classifier for Text
  • Multinomial classification model (see the sketch
    after this slide)
  • Assume that the data are generated by a p-sided
    die (multinomial model)
  • where Nx = number of terms (total count) in
    document x, nj = number of times term j
    occurs in the document
  • p(Nx | ck) = probability a document has length
    Nx, e.g., a Poisson model
  • Can be dropped if thought not to be class
    dependent
  • Here we have a single random variable for each
    class, and the p(xj = i | ck) probabilities sum
    to 1 over i (i.e., a multinomial model)
  • Probabilities typically only defined and
    evaluated for i = 1, 2, 3, ...
  • But zero counts could also be modeled if
    desired
  • This would be equivalent to a Naïve Bayes model
    with a geometric distribution on counts
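A matching sketch for the multinomial model, where the features are raw term counts and, for each class, the estimated term probabilities sum to one; scikit-learn's MultinomialNB is used and the toy counts are illustrative.

# Multinomial model: a document of Nx words = Nx tosses of a p-sided die.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[3, 0, 1, 0],      # rows = documents, columns = term counts
              [2, 1, 0, 0],
              [0, 0, 4, 2],
              [0, 1, 2, 3]])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0)   # Laplace-smoothed term probabilities
clf.fit(X, y)

# Per class, the estimated term probabilities sum to 1 (the "die").
print(np.exp(clf.feature_log_prob_).sum(axis=1))
print(clf.predict(np.array([[1, 0, 2, 1]])))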

16
Highest Probability Terms in Multinomial
Distributions
17
Comparing Naïve Bayes and Multinomial models
  • McCallum and Nigam (1998) found that the
    multinomial model outperformed naïve Bayes (with
    binary features) in text classification
    experiments
  • (however, this may be more a result
    of using counts vs. binary features)
  • Note on names used in the literature
  • Bernoulli (or multivariate Bernoulli) is
    sometimes used for the binary version of the
    naïve Bayes model
  • the multinomial model is also referred to as the
    unigram model
  • the multinomial model is also sometimes
    (confusingly) referred to as naïve Bayes

18
WebKB Data Set
  • Train on 5,000 hand-labeled web pages
  • Cornell, Washington, U.Texas, Wisconsin
  • Crawl and classify a new site (CMU)
  • Results

19
Comparing Bernoulli and Multinomial on Web KB Data
20
Comparing Multinomial and Bernoulli on Reuters
Data (from McCallum and
Nigam, 1998)
21
Comparing Multinomial and Bernoulli on Reuters
Data (from McCallum and
Nigam, 1998)
22
Comparing Bernoulli and Multinomial
(slide from Chris Manning, Stanford)
Results from classifying 13,589 Yahoo! Web pages
in Science subtree of hierarchy into 95
different topics
23
Comments on Generative Models for Text
  • (Comments applicable to both Naïve Bayes and
    Multinomial classifiers)
  • Simple and fast, so popular in practice
  • e.g., linear in p, n, M for both training and
    prediction
  • Training involves only smoothed frequency counts
  • e.g., easy to use in situations where the
    classifier needs to be updated regularly (e.g.,
    for spam email)
  • Numerical issues
  • Typically work with log p(ck | x), etc., to
    avoid numerical underflow
  • Useful trick (sketched after this slide)
  • when computing Σ_j log p(xj | ck) for sparse
    data, it may be much faster to
  • precompute Σ_j log p(xj = 0 | ck)
  • and then, for each term present, subtract
    log p(xj = 0 | ck) and add log p(xj = 1 | ck)
  • Note: both models are wrong, but for
    classification they are often sufficient
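A sketch of the sparse-data trick above for the Bernoulli model: precompute the log-probability of the all-zeros document for each class, then adjust only for the terms that are actually present. The function and variable names are illustrative assumptions.

# Evaluate sum_j log p(xj | ck) for a sparse document in time proportional
# to the number of present terms, not the vocabulary size.
import numpy as np

def log_likelihoods(present_terms, log_p1, log_p0):
    # log_p1[k, j] = log p(xj = 1 | ck); log_p0[k, j] = log p(xj = 0 | ck)
    base = log_p0.sum(axis=1)                      # all terms absent
    delta = log_p1[:, present_terms] - log_p0[:, present_terms]
    return base + delta.sum(axis=1)                # adjust for present terms

rng = np.random.default_rng(0)
p1 = rng.uniform(0.01, 0.2, size=(2, 10_000))      # p(xj = 1 | ck), 2 classes
log_p1, log_p0 = np.log(p1), np.log1p(-p1)
print(log_likelihoods([3, 17, 4200], log_p1, log_p0))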

24
Beyond independence
  • Naïve Bayes and multinomial models both assume
    conditional independence of words given the class
  • Alternative approaches try to account for
    higher-order dependencies
  • Bayesian networks
  • p(x | c) = Π_j p(xj | parents(xj), c)
  • Equivalent to a directed graph where edges
    represent direct dependencies
  • Various algorithms search for a good network
    structure
  • Useful for improving the quality of the
    distribution model
  • ...however, this does not always translate into
    better classification
  • Maximum entropy models
  • p(x | c) = (1/Z) Π_subsets f(subset(x), c)
  • Equivalent to an undirected graph model
  • Estimation is equivalent to a maximum entropy
    assumption
  • Feature selection is crucial (which f terms to
    include)
  • can provide high-accuracy classification
  • ...however, tends to be computationally complex
    to fit (estimating Z is difficult)

25
(No Transcript)
26
Linear Classifiers
  • Linear classifier (two-class case):
  • wT x + w0 > 0
  • w is a p-dimensional vector of weights (learned
    from the data)
  • w0 is a threshold (also learned from the data)
  • Equation of the linear hyperplane (decision
    boundary):
  • wT x + w0 = 0
  • Distance of a point x from the hyperplane:
    (wT x + w0) / ||w||

27
Geometry of Linear Classifiers
wT x + w0 = 0
Direction of the w vector
Distance of x from the boundary is (1/||w||)(wT x +
w0); a small numeric sketch follows below
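A small numeric sketch of the decision rule and distance formula above; the weight vector, threshold, and test point are arbitrary illustrative numbers.

# Linear decision rule and signed distance from the hyperplane.
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # learned weight vector
w0 = -0.25                       # learned threshold
x = np.array([1.0, 0.0, 3.0])

score = w @ x + w0                       # classify by the sign of w^T x + w0
distance = score / np.linalg.norm(w)     # (1/||w||)(w^T x + w0)
print(np.sign(score), distance)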
28
Optimal Hyperplane, Support Vectors, and Margin
Circles = support vectors: the points on the
convex hull that are closest to the
hyperplane. M = margin = the distance of the
support vectors from the hyperplane. The goal is
to find the weight vector that maximizes M. Theory
tells us that the max-margin hyperplane leads to
good generalization (see work by Vapnik in the
1990s).
29
Optimal Separating Hyperplane
  • Solution to a constrained optimization problem
    (written out after this slide)
  • (Here yi ∈ {−1, +1} is
    the binary class label for example i)
  • WLOG, let ||w|| = 1/M
  • Unique solution for a linearly separable data set
  • Margin M of the classifier:
  • the distance between the separating hyperplane
    and the closest training samples
  • optimal separating hyperplane ⇔ maximum margin
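Written out, the constrained problem the slide refers to takes the standard max-margin form (equivalent to maximizing M with ||w|| = 1/M):

\min_{w,\,w_0} \ \tfrac{1}{2}\,\|w\|^{2}
\quad \text{subject to} \quad
y_i \left( w^{\mathsf T} x_i + w_0 \right) \ \ge\ 1,
\qquad i = 1, \dots, n, \qquad y_i \in \{-1, +1\}.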

30
Sketch of Optimization Problem
  • Define the Lagrangian as a function of the w
    vector and the multipliers αi
  • The solution must satisfy the
    Karush-Kuhn-Tucker (KKT) conditions
  • Points with αi > 0 are called support vectors,
    and their distance from the hyperplane is M
  • This results in a quadratic programming
    optimization problem
  • Good news
  • convex function of the unknowns, unique optimum
  • Variety of well-known algorithms for finding this
    optimum
  • Bad news
  • Quadratic programming in general scales as
    O(n^3)
  • In practice takes O(n^a), where a ≈ 1.6 to 2
    (see Chakrabarti, Chapter 5, p. 166)

31
From Chakrabarti (Chapter 5, 2002): timing results
on text classification
32
Support Vector Machines
  • If αi > 0, then the distance of xi from the
    separating hyperplane is M
  • Support vectors = points with associated αi > 0
  • The decision function f(x) is computed from the
    support vectors as a weighted sum
  • prediction can be fast
  • Non-linearly-separable case: can generalize to
    allow slack constraints
  • Non-linear SVMs: replace the original x vector
    with non-linear functions of x
  • kernel trick: can solve the high-dimensional
    problem without working directly in high
    dimensions
  • Computational speedups can reduce training time
    to near-linear
  • e.g., Platt's SMO algorithm, Joachims' SVMLight
    (a minimal linear-SVM sketch follows this slide)
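A minimal linear-SVM text classification sketch using scikit-learn's LinearSVC (a fast solver along the lines of the speedups mentioned above); the 20 Newsgroups categories and parameters chosen here are illustrative assumptions, and the data are downloaded on first use.

# Linear SVM on a TF-IDF bag-of-words representation of 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

cats = ["sci.space", "rec.autos"]
train = fetch_20newsgroups(subset="train", categories=cats)
test = fetch_20newsgroups(subset="test", categories=cats)

model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      LinearSVC(C=1.0))
model.fit(train.data, train.target)
print("test accuracy:", model.score(test.data, test.target))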

33
Classic Reuters Data Set
  • 21,578 documents, labeled manually
  • 9,603 training, 3,299 test articles (ModApte
    split)
  • 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
  • Example: interest rate article
  • 2-APR-1987 06:35:19.50
  • west-germany
  • b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  • FRANKFURT, March 2
  • The Bundesbank left credit policies unchanged
    after today's regular meeting of its council, a
    spokesman said in answer to enquiries. The West
    German discount rate remains at 3.0 pct, and the
    Lombard emergency financing rate at 5.0 pct.
  • Common categories (train, test):
  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369,119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)

34
Dumais et al. (1998): Reuters accuracy
35
Precision-Recall for SVM (linear), Naïve Bayes,
and NN (from Dumais 1998) using the Reuters data
set
36
Comparison of accuracy across three classifiers
(Naive Bayes, Maximum Entropy, and Linear SVM)
using three data sets: 20 Newsgroups, the
Recreation sub-tree of the Open Directory, and
University Web pages from WebKB. From
Chakrabarti, 2003, Chapter 5.
37
Comparing Text Classifiers
  • Naïve Bayes models (Bernoulli or Multinomial)
  • Low time complexity (single linear pass through
    the data)
  • Generally good, but not always best
  • Widely used for spam email filtering
  • Linear SVMs
  • Often produce best results in research studies
  • But computationally complex to train
  • not as widely used in practice as naïve Bayes
  • Others
  • Logistic regression, decision trees: less widely
    used, but can be useful

38
Learning with Labeled and Unlabeled documents
  • In practice, obtaining labels for documents is
    time-consuming, expensive, and error-prone
  • Typical application: a small number of labeled
    docs and a very large number of unlabeled docs
  • Idea (a sketch of this loop follows below)
  • Build a probabilistic model on the labeled docs
  • Classify the unlabeled docs, getting p(class |
    doc) for each class and doc
  • This is equivalent to the E-step in the EM
    algorithm
  • Now relearn the probabilistic model using the new
    soft labels
  • This is equivalent to the M-step in the EM
    algorithm
  • Continue to iterate until convergence (e.g.,
    class probabilities do not change significantly)
  • This EM approach shows that unlabeled data can
    improve classification performance compared to
    using labeled data alone
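A hedged sketch of the EM loop described above, built around a multinomial naive Bayes model: fit on the labeled documents, soft-label the unlabeled ones (E-step), refit with those soft labels as sample weights (M-step), and repeat. The function name, the weighting scheme, and the convergence test are illustrative assumptions; it also assumes dense count matrices and class labels 0..K-1 that all appear in the labeled set.

# Semi-supervised EM with a multinomial naive Bayes base classifier.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_classes, n_iter=20, tol=1e-4):
    clf = MultinomialNB().fit(X_lab, y_lab)
    prev = None
    for _ in range(n_iter):
        # E-step: soft class memberships p(class | doc) for unlabeled docs.
        soft = clf.predict_proba(X_unlab)
        # M-step: refit on labeled docs plus one weighted copy of each
        # unlabeled doc per class, weighted by p(class | doc).
        X_all = np.vstack([X_lab] + [X_unlab] * n_classes)
        y_all = np.concatenate(
            [y_lab] + [np.full(X_unlab.shape[0], c) for c in range(n_classes)])
        w_all = np.concatenate(
            [np.ones(len(y_lab))] + [soft[:, c] for c in range(n_classes)])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
        # Stop when the soft labels no longer change significantly.
        if prev is not None and np.abs(soft - prev).max() < tol:
            break
        prev = soft
    return clf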

39
Learning with Labeled and Unlabeled Data
Graph from Semi-supervised text classification
using EM, Nigam, McCallum, and Mitchell, 2006
40
Other issues in text classification
  • Real-time constraints
  • Being able to update classifiers as new data
    arrive
  • Being able to make predictions very quickly in
    real time
  • Document length
  • Varying document length can be a problem for some
    classifiers
  • The multinomial model tends to be better than
    Bernoulli, for example
  • Multi-labels and multiple classes
  • Text documents can have more than one label
  • SVMs, for example, natively handle only binary
    (two-class) problems
  • Feature selection
  • Experiments have shown that feature selection
    (e.g., by greedy algorithms using information
    gain) can often improve results
  • Linked documents
  • Can view Web documents as nodes in a directed
    graph
  • Classification can then be performed in a way
    that leverages the link structure
  • Heuristic: class labels of linked pages are more
    likely to be the same

41
Further Reading on Text Classification
  • Web-related text mining in general
  • S. Chakrabarti, Mining the Web: Discovering
    Knowledge from Hypertext Data, Morgan Kaufmann,
    2003.
  • See chapter 5 for a discussion of text
    classification
  • General references on text and language modeling
  • Foundations of Statistical Natural Language
    Processing, C. Manning and H. Schütze, MIT Press,
    1999.
  • Speech and Language Processing: An Introduction
    to Natural Language Processing, Dan Jurafsky and
    James Martin, Prentice Hall, 2000.
  • SVMs for text classification
  • T. Joachims, Learning to Classify Text Using
    Support Vector Machines: Methods, Theory and
    Algorithms, Kluwer, 2002