ICS 278: Data Mining Lecture 12: Text Mining

1
ICS 278: Data Mining, Lecture 12: Text Mining
  • Padhraic Smyth
  • Department of Information and Computer Science
  • University of California, Irvine

2
Text Mining
  • Information Retrieval
  • Text Classification
  • Text Clustering
  • Information Extraction

3
Text Classification
  • Text classification has many applications
  • Spam email detection
  • Automated tagging of streams of news articles,
    e.g., Google News
  • Automated creation of Web-page taxonomies
  • Data Representation
  • Bag of words is most commonly used, with either
    counts or binary indicators (a minimal sketch
    follows after this list)
  • Can also use phrases for commonly occurring
    combinations of words
  • Classification Methods
  • Naïve Bayes widely used (e.g., for spam email)
  • Fast and reasonably accurate
  • Support vector machines (SVMs)
  • Typically the most accurate method in research
    studies
  • But more computationally complex
  • Logistic Regression (regularized)
  • Not as widely used, but can be competitive with
    SVMs (e.g., Zhang and Oles, 2002)
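
As referenced above, a minimal bag-of-words sketch using
scikit-learn's CountVectorizer; the toolkit choice and the example
documents are ours, not the lecture's:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["stocks fell as interest rates rose",
            "the team won the final game",
            "interest in bank stocks is rising"]

    # Counts: entry (i, j) = number of times term j occurs in document i.
    counts = CountVectorizer().fit_transform(docs)

    # Binary: entry (i, j) = 1 if term j occurs in document i, else 0.
    binary = CountVectorizer(binary=True).fit_transform(docs)

    # ngram_range=(1, 2) also indexes two-word phrases ("interest rates").
    phrases = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)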

4
Trimming the Vocabulary
  • Stopword removal
  • remove non-content words
  • very frequent stop words such as "the" and "and"
  • remove very rare words, e.g., ones that occur
    only a few times in 100K documents
  • Can remove 30% or more of the original unique
    words
  • Stemming
  • Reduce all variants of a word to a single term
  • E.g., draw, drawing, drawings -> draw (see the
    sketch after this list)
  • Porter stemming algorithm (1980)
  • relies on a preconstructed suffix list with
    associated rules
  • e.g., if suffix = "IZATION" and the prefix
    contains at least one vowel followed by a
    consonant, replace with suffix "IZE"
  • BINARIZATION -> BINARIZE
  • This still often leaves p ~ O(10^4) terms
  • => a very high-dimensional classification
    problem!
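
A small illustration of the trimming steps, assuming NLTK for the
Porter stemmer and scikit-learn for stopword/rare-word removal
(neither toolkit is mandated by the lecture):

    from nltk.stem.porter import PorterStemmer
    from sklearn.feature_extraction.text import CountVectorizer

    stemmer = PorterStemmer()
    print([stemmer.stem(w) for w in ["draw", "drawing", "drawings"]])
    # -> ['draw', 'draw', 'draw']

    # Stopword and rare-word removal can be folded into vectorization:
    # stop_words drops function words ("the", "and", ...); min_df=3
    # drops terms that occur in fewer than 3 documents.
    vec = CountVectorizer(stop_words="english", min_df=3)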

5
Classification Issues
  • Typically many features, p ~ O(10^4) terms
  • Consider n sample points in p dimensions
  • Binary labels => 2^n possible labelings (or
    dichotomies)
  • A labeling is linearly separable if we can
    separate the labels with a hyperplane
  • Let f(n,p) = fraction of the 2^n possible
    labelings that are linearly separable; then
  • f(n,p) = 1, for n <= p + 1
  • f(n,p) = (2 / 2^n) Σ_{i=0}^{p} (n-1 choose i),
    for n > p + 1
  • (evaluated numerically in the sketch below)
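
The formula is straightforward to evaluate directly; a minimal
sketch in Python (the function name is our own, and math.comb
requires Python 3.8+):

    from math import comb

    def frac_linearly_separable(n, p):
        # Fraction of the 2^n labelings of n points (in general
        # position) in p dimensions that are linearly separable.
        if n <= p + 1:
            return 1.0
        return 2.0 * sum(comb(n - 1, i) for i in range(p + 1)) / 2**n

    print(frac_linearly_separable(10, 10))  # 1.0 (n <= p + 1)
    print(frac_linearly_separable(20, 10))  # ~0.68
    # With p ~ 10^4 terms, any realistic number of documents is
    # almost surely linearly separable.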

6
(No Transcript)
7
Classifying Term Vectors
  • Typically multiple different words may be helpful
    in classifying a particular class, e.g.,
  • Class: finance
  • Words: stocks, return, interest, rate, etc.
  • Thus, classifiers that combine multiple features
    often do well, e.g.,
  • Naïve Bayes, logistic regression, SVMs, etc.
  • Classifiers based on single features (e.g.,
    trees) do less well
  • Linear classifiers often perform well in
    high-dimensions
  • In many cases there are fewer documents in the
    training data than dimensions,
  • i.e., n < p => training data are linearly
    separable
  • So again, naïve Bayes, logistic regression, and
    linear SVMs are all useful
  • The question becomes: which linear discriminant
    to select?

8
Probabilistic Generative Classifiers
  • Model p(x | ck) for each class and perform
    classification via Bayes rule:
    c = arg max p(ck | x) = arg max p(x | ck) p(ck)
  • How to model p(x | ck)?
  • p(x | ck) = probability of a bag of words x
    given a class ck
  • Two commonly used approaches (for text)
  • Naïve Bayes: treat each term xj as being
    conditionally independent, given ck
  • Multinomial: model a document with N words as N
    tosses of a p-sided die
  • Other models possible but less common,
  • e.g., model word order by using a Markov chain
    for p(x | ck)

9
Naïve Bayes Classifier for Text
  • Naïve Bayes classifier = conditional
    independence model
  • Assumes conditional independence of the terms
    given the class:
    p(x | ck) = Π_j p(xj | ck)
  • Note that we model each term xj as a discrete
    random variable
  • Binary terms (Bernoulli):
    p(x | ck) = Π_{j: xj=1} p(xj = 1 | ck) ×
    Π_{j: xj=0} p(xj = 0 | ck)
    (a toy example follows after this list)
  • Non-binary terms (counts):
    p(x | ck) = Π_j p(xj = nj | ck)
  • can use a parametric model (e.g., Poisson) or a
    non-parametric model (e.g., histogram) for the
    p(xj = nj | ck) distributions
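
A toy example of the binary (Bernoulli) variant, using
scikit-learn's BernoulliNB as one convenient implementation (the
data are invented; this is not the lecture's own code):

    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    # Toy data: 4 documents over a 3-term vocabulary, binary indicators.
    X = np.array([[1, 0, 1],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
    y = np.array([0, 0, 1, 1])  # class labels ck

    clf = BernoulliNB(alpha=1.0)  # alpha = Laplace smoothing of counts
    clf.fit(X, y)
    print(clf.predict_log_proba([[1, 0, 0]]))  # log p(ck | x) per class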

10
Multinomial Classifier for Text
  • Multinomial classification model
  • Assume that the data are generated by a p-sided
    die (multinomial model):
    p(x | ck) = p(Nx | ck) × [Nx! / Π_j nj!] ×
    Π_j p(term j | ck)^nj
  • where Nx = number of terms (total count) in
    document x, and nj = number of times term j
    occurs in the document
  • p(Nx | ck) = probability that a document has
    length Nx, e.g., a Poisson model
  • can be dropped if length is thought not to be
    class dependent
  • Here we have a single random variable for each
    class, and the p(term j | ck) probabilities sum
    to 1 over the p terms (i.e., a multinomial model)
  • Probabilities are typically only defined and
    evaluated for the terms that occur in a document
    (counts nj >= 1)
  • But zero counts could also be modeled if desired
  • This would be equivalent to a Naïve Bayes model
    with a geometric distribution on counts
  • (a toy multinomial example follows after this
    list)
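
The same toy problem with term counts, using scikit-learn's
MultinomialNB (again just one convenient implementation, not the
lecture's code):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    X = np.array([[3, 0, 1],  # nj = term counts per document
                  [2, 1, 0],
                  [0, 0, 4],
                  [0, 2, 2]])
    y = np.array([0, 0, 1, 1])

    clf = MultinomialNB(alpha=1.0).fit(X, y)
    print(clf.predict([[1, 0, 2]]))
    # Note: the p(Nx | ck) length term and the multinomial coefficient
    # are omitted here; neither affects the arg max over classes when
    # document length is not class dependent.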

11
Comparing Naïve Bayes and Multinomial Models
  • McCallum and Nigam (1998) found that the
    multinomial model outperformed naïve Bayes (with
    binary features) in text classification
    experiments
  • (however, this may be more a result of using
    counts vs. binary features)
  • Note on names used in the literature:
  • "Bernoulli" (or "multivariate Bernoulli") is
    sometimes used for the binary version of the
    naïve Bayes model
  • the multinomial model is also referred to as the
    "unigram" model
  • the multinomial model is also sometimes
    (confusingly) referred to as "naïve Bayes"

12
WebKB Data Set
  • Train on 5,000 hand-labeled web pages
  • Cornell, Washington, U.Texas, Wisconsin
  • Crawl and classify a new site (CMU)
  • Results

13
Probabilistic Model Comparison
14
Highest Probability Terms in Multinomial
Distributions
15
Sample Learning Curve (Yahoo Science Data)
16
Comments on Generative Models for Text
  • (Comments applicable to both Naïve Bayes and
    multinomial classifiers)
  • Simple and fast => popular in practice
  • e.g., linear in p, n, M for both training and
    prediction
  • Training = smoothed frequency counts (e.g.,
    add-one/Laplace smoothing)
  • e.g., easy to use in situations where the
    classifier needs to be updated regularly (e.g.,
    for spam email)
  • Numerical issues
  • Typically work with log p(ck | x), etc., to
    avoid numerical underflow
  • Useful trick (sketched after this list):
  • when computing Σ_j log p(xj | ck) for sparse
    data, it may be much faster to
  • precompute Σ_j log p(xj = 0 | ck) over all terms
  • and then, for the few terms present in the
    document, subtract log p(xj = 0 | ck) and add
    log p(xj = 1 | ck)
  • Note: both models are "wrong", but for
    classification they are often sufficient
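
A sketch of the sparse-data trick above; the names (base, log_p0,
log_p1) are hypothetical placeholders for tables that a trained
model would supply:

    import numpy as np

    def class_log_likelihoods(present_terms, log_p0, log_p1, base):
        # base[k]       = precomputed sum_j log p(xj = 0 | ck)
        # log_p0[k, j]  = log p(xj = 0 | ck); log_p1 likewise for xj = 1
        # present_terms = indices j with xj = 1 (few, for a sparse doc)
        scores = base.copy()
        for j in present_terms:
            # swap the "absent" contribution of term j for "present"
            scores += log_p1[:, j] - log_p0[:, j]
        return scores  # one log-likelihood per class

    # Cost is O(number of terms present) per document instead of O(p).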

17
(No Transcript)
18
Linear Classifiers
  • Linear classifier (two-class case):
    assign to the positive class if
    w^T x + w0 > 0
    (a one-line implementation follows below)
  • w is a p-dimensional vector of weights (learned
    from the data)
  • w0 is a threshold (also learned from the data)
  • Equation of the linear hyperplane (decision
    boundary):
    w^T x + w0 = 0
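
The decision rule, written out as a one-line function; w and w0 are
assumed to have been learned already by one of the methods discussed:

    import numpy as np

    def linear_classify(x, w, w0):
        # Returns the predicted class label in {-1, +1}.
        return 1 if np.dot(w, x) + w0 > 0 else -1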

19
Geometry of Linear Classifiers
w^T x + w0 = 0
Direction of the w vector
Distance of x from the boundary is
(1 / ||w||) (w^T x + w0)
20
Optimal Hyperplane and Margin
M = margin. Circles = support vectors. The goal is
to find the weight vector that maximizes M. Theory
tells us that the max-margin hyperplane leads to
good generalization.
21
Optimal Separating Hyperplane
  • Solution to a constrained optimization problem:
    minimize (1/2) ||w||^2 subject to
    yi (w^T xi + w0) >= 1 for all i
  • (Here yi ∈ {-1, +1} is the binary class label
    for example i)
  • Unique for each linearly separable data set
  • Margin M of the classifier:
  • the distance between the separating hyperplane
    and the closest training samples
  • optimal separating hyperplane <=> maximum margin

22
Sketch of Optimization Problem
  • Define the Lagrangian as a function of the w
    vector and the Lagrange multipliers αi
  • The form of the solution dictates that the
    optimal w can be expressed as
    w = Σ_i αi yi xi
  • This results in a quadratic programming
    optimization problem
  • Good news:
  • convex function of the unknowns, unique optimum
  • variety of well-known algorithms for finding
    this optimum
  • Bad news:
  • quadratic programming in general scales as O(n^3)

23
Support Vector Machines
  • If αi > 0, then the distance of xi from the
    separating hyperplane is M
  • Support vectors = points with associated αi > 0
  • The decision function f(x) is computed from the
    support vectors as
    f(x) = Σ_i αi yi (xi^T x) + w0
  • => prediction can be fast (see the sketch below)
  • Non-linearly-separable case: can generalize to
    allow slack constraints
  • Non-linear SVMs: replace the original x vector
    with non-linear functions of x
  • the "kernel trick" can solve the
    high-dimensional problem without working
    directly in high dimensions
  • Computational speedups can reduce training time
    to O(n^2) or even near-linear
  • e.g., Platt's SMO algorithm, Joachims' SVMLight
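
A sketch of the support-vector form of the decision function for the
linear-kernel case; the argument names are our own, and the
coefficients are assumed to come from an already-trained SVM:

    import numpy as np

    def svm_decision(x, alphas, ys, svs, w0):
        # f(x) = sum_i alphai yi (xi . x) + w0, over support vectors
        # alphas, ys: multipliers and labels of the support vectors;
        # svs: support vectors as rows of a matrix.
        return float(np.sum(alphas * ys * (svs @ x)) + w0)

    # Only the (typically few) support vectors enter the sum,
    # which is why prediction is fast.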

24
From Chakrabarti, Chapter 5, 2002: timing results
on text classification
25
Classic Reuters Data Set
  • 21,578 documents, labeled manually
  • 9,603 training, 3,299 test articles ("ModApte"
    split)
  • 118 categories
  • An article can be in more than one category
  • Learn 118 binary category distinctions
  • Example "interest rate" article:
  • 2-APR-1987 06:35:19.50
  • west-germany
  • b f BC-BUNDESBANK-LEAVES-CRE 04-02 0052
  • FRANKFURT, March 2
  • The Bundesbank left credit policies unchanged
    after today's regular meeting of its council, a
    spokesman said in answer to enquiries. The West
    German discount rate remains at 3.0 pct, and the
    Lombard emergency financing rate at 5.0 pct.
  • Common categories (train, test):
  • Earn (2877, 1087)
  • Acquisitions (1650, 179)
  • Money-fx (538, 179)
  • Grain (433, 149)
  • Crude (389, 189)
  • Trade (369, 119)
  • Interest (347, 131)
  • Ship (197, 89)
  • Wheat (212, 71)
  • Corn (182, 56)
26
Dumais et al. (1998), Reuters: Accuracy
27
Precision-Recall for SVM (linear), Naïve Bayes,
and NN (from Dumais 1998) using the Reuters data
set
28
Comparison of accuracy across three classifiers
(Naive Bayes, Maximum Entropy, and Linear SVM)
using three data sets: 20 newsgroups, the
Recreation sub-tree of the Open Directory, and
University Web pages from WebKB. From Chakrabarti,
2003, Chapter 5.
29
Other Issues in Text Classification
  • Real-time constraints:
  • being able to update classifiers as new data
    arrive
  • being able to make predictions very quickly in
    real time
  • Multi-labels and multiple classes:
  • text documents can have more than one label
  • SVMs, for example, directly handle only binary
    (two-class) problems
  • Feature selection:
  • experiments have shown that feature selection
    (e.g., by greedy algorithms using information
    gain) can improve results

30
Further Reading on Text Classification
  • General references on text and language modeling:
  • Foundations of Statistical Natural Language
    Processing, C. Manning and H. Schütze, MIT
    Press, 1999.
  • Speech and Language Processing: An Introduction
    to Natural Language Processing, Dan Jurafsky and
    James Martin, Prentice Hall, 2000.
  • SVMs for text classification:
  • T. Joachims, Learning to Classify Text Using
    Support Vector Machines: Methods, Theory and
    Algorithms, Kluwer, 2002.
  • Web-related text mining:
  • S. Chakrabarti, Mining the Web: Discovering
    Knowledge from Hypertext Data, Morgan Kaufmann,
    2003.