1. KI2 - 3
Bayesian Learning: Application to Text Classification
Example: spam filtering
Marius Bulacu, prof. dr. Lambert Schomaker
Kunstmatige Intelligentie / RuG
2. Founders of Probability Theory
Pierre Fermat (1601-1665, France)
Blaise Pascal (1623-1662, France)
They laid the foundations of probability theory in a correspondence on a dice game.
3. Prior, Joint and Conditional Probabilities
- P(A): prior probability of A
- P(B): prior probability of B
- P(A, B): joint probability of A and B
- P(A | B): conditional (posterior) probability of A given B
- P(B | A): conditional (posterior) probability of B given A
4. Probability Rules
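The body of this slide appears to be missing from the extracted text (probably a figure or equations). Presumably it covered the standard sum and product rules used on the following slides; a hedged reconstruction:

```latex
% Hedged reconstruction of the missing slide body: the standard sum and
% product rules that the rest of the lecture builds on.
\begin{align*}
  P(A)    &= \sum_{B} P(A, B)                        && \text{(sum rule / marginalization)}\\
  P(A, B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A)   && \text{(product rule)}
\end{align*}
```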
5. Statistical Independence
- Two random variables A and B are independent iff
  - P(A, B) = P(A) P(B)
  - P(A | B) = P(A)
  - P(B | A) = P(B)
- Knowing the value of one variable does not yield any information about the value of the other.
6. Statistical Dependence - Bayes
Thomas Bayes (1702-1761, England)
"Essay towards solving a problem in the doctrine of chances", published in the Philosophical Transactions of the Royal Society of London in 1764.
7Bayes Theorem
gt P(A ? B) P(AB) P(B) P(BA) P(A)
P(BA) P(A)
gt P(AB)
P(B)
8. Bayes Theorem - Causality
Diagnostic: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Pattern Recognition: P(Class | Feature) = P(Feature | Class) P(Class) / P(Feature)
9. Bayes Formula and Classification
- Prior: probability of the class before seeing anything
- Conditional likelihood: of the data given the class
- Posterior: probability of the class after seeing the data
- Unconditional probability of the data
(the formula these terms annotate is reconstructed below)
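The formula itself did not survive extraction; a hedged reconstruction consistent with the four labels above:

```latex
% Hedged reconstruction: the Bayes formula these four terms annotate.
%   posterior = (conditional likelihood x prior) / unconditional data probability
\[
  P(C \mid x) \;=\; \frac{P(x \mid C)\, P(C)}{P(x)},
  \qquad
  P(x) \;=\; \sum_i P(x \mid C_i)\, P(C_i)
\]
```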
10Medical example
p(disease) 0.002 p(test disease)
0.97 p(test -disease) 0.04
p(test) p(test disease) p(disease)
p(test -disease) p(-disease) 0.97 0.002
0.04 0.97 0.00194 0.03992 0.04186
p(disease test) p(test disease)
p(disease) / p(test) 0.97 0.002 / 0.04186
0.00194 / 0.04186 0.046
p(-disease test) p(test -disease)
p(-disease) / p(test) 0.04 0.998 / 0.04186
0.03992 / 0.04186 0.953
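A minimal Python sketch that merely reproduces the arithmetic above (the variable names are mine, not from the slide):

```python
# Medical example: posterior probability of disease given a positive test.
p_disease = 0.002                 # prior P(disease)
p_pos_given_disease = 0.97        # P(test | disease)
p_pos_given_healthy = 0.04        # P(test | ~disease)

# Total probability of a positive test (the unconditional data probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1.0 - p_disease))          # 0.04186

# Bayes rule: posteriors given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos            # ~0.046
p_healthy_given_pos = p_pos_given_healthy * (1.0 - p_disease) / p_pos    # ~0.954

print(f"P(test)             = {p_pos:.5f}")
print(f"P(disease | test)   = {p_disease_given_pos:.3f}")
print(f"P(~disease | test)  = {p_healthy_given_pos:.3f}")
```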
11. MAP Classification
- To minimize the probability of misclassification, assign a new input x to the class with the Maximum A Posteriori probability, e.g. assign x to class C1 if
  p(C1 | x) > p(C2 | x)  ⟺  p(x | C1) p(C1) > p(x | C2) p(C2)
- Therefore the decision boundary must be placed where the two posterior probability distributions cross each other.
12. Maximum Likelihood Classification
- When the prior class distributions are not known, or for equal (non-informative) priors,
  p(x | C1) p(C1) > p(x | C2) p(C2)
  becomes
  p(x | C1) > p(x | C2)
- Therefore assign the input x to the class with the Maximum Likelihood of having generated it (a small sketch contrasting ML and MAP follows below).
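A small sketch contrasting the two decision rules; the Gaussian class models, priors and test input are illustrative assumptions, not from the slides:

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, std):
    """Class-conditional likelihood p(x | C) modeled as a 1-D Gaussian."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

# Illustrative class models (not from the slides).
classes = {
    "C1": {"prior": 0.7, "mean": 0.0, "std": 1.0},
    "C2": {"prior": 0.3, "mean": 2.0, "std": 1.0},
}

x = 1.2  # new input

# Maximum Likelihood: ignore the priors, compare p(x | Ci).
ml_class = max(classes, key=lambda c: gaussian_pdf(x, classes[c]["mean"], classes[c]["std"]))

# Maximum A Posteriori: compare p(x | Ci) * P(Ci).
map_class = max(classes, key=lambda c: classes[c]["prior"]
                * gaussian_pdf(x, classes[c]["mean"], classes[c]["std"]))

print("ML decision: ", ml_class)   # C2 (x lies closer to its mean)
print("MAP decision:", map_class)  # C1 (the larger prior shifts the boundary)
```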
13. Continuous Features
- Two methods for dealing with continuous-valued features:
  - Binning: divide the range of continuous values into a discrete number of bins, then apply the discrete methodology (a rough sketch follows after this list).
  - Mixture of Gaussians: make an assumption regarding the functional form of the PDF (linear combination of Gaussians) and derive the corresponding parameters (means and standard deviations).
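A rough sketch of the binning option, with hypothetical training samples and bin settings (the mixture-of-Gaussians option would instead fit means and standard deviations to the same samples):

```python
import numpy as np

# Hypothetical 1-D training samples for one class (e.g. temperatures labeled "Warm").
samples = np.random.normal(loc=22.0, scale=2.0, size=500)

# Binning: divide the continuous range into discrete bins and estimate
# p(x | class) from the normalized histogram counts.
n_bins = 20
counts, edges = np.histogram(samples, bins=n_bins, range=(10.0, 35.0))
bin_width = edges[1] - edges[0]
pdf = counts / (counts.sum() * bin_width)   # normalized so the histogram integrates to 1

def likelihood(x):
    """Discrete approximation of p(x | class): look up the bin containing x."""
    i = np.clip(np.searchsorted(edges, x) - 1, 0, n_bins - 1)
    return pdf[i]

print(likelihood(22.0), likelihood(30.0))   # high near the mean, low in the tail
```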
14Accumulation of Evidence
p(CX,Y) ? p(X,Y,C) ? p(C) p(X,YC) ? p(C)
p(XC) p(YC,X) ... ? p(C) p(XC) p(YC,X)
p(ZC,X,Y)
prior
new prior
new prior
- Bayesian inference allows for integrating prior knowledge about the world (beliefs being expressed in terms of probabilities) with new incoming data.
- Different forms of data (possibly incommensurable) can be fused towards the final decision using the common currency of probability.
- As new data arrives, the latest posterior becomes the new prior for interpreting the new input (a minimal sketch of this sequential updating follows below).
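A minimal sketch of this sequential updating for two classes; the class names, observations and likelihood values are illustrative assumptions, not from the slides:

```python
# Sequential Bayesian updating: after each observation the latest posterior
# becomes the new prior. Numbers are illustrative only.
prior = {"C1": 0.5, "C2": 0.5}

# Hypothetical likelihood tables p(observation | class), assuming the
# observations are conditionally independent given the class.
likelihood = {
    "X": {"C1": 0.8, "C2": 0.3},
    "Y": {"C1": 0.4, "C2": 0.7},
}

belief = dict(prior)
for obs in ["X", "Y"]:                        # evidence arriving one piece at a time
    unnormalized = {c: belief[c] * likelihood[obs][c] for c in belief}
    total = sum(unnormalized.values())        # p(observation), the normalizer
    belief = {c: v / total for c, v in unnormalized.items()}
    print(obs, belief)                        # this posterior is the next prior
```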
15. Example: temperature classification
Classes: C = Cold, N = Normal, W = Warm, H = Hot
[Figure: class-conditional likelihoods P(x | C), P(x | N), P(x | W), P(x | H) and the unconditional likelihood P(x) of the temperature values x]
16. Bayes probability blow-up
[Figure: posterior probabilities P(C | x), P(N | x), P(W | x), P(H | x) for the classes Cold, Normal, Warm, Hot, obtained by applying Bayes rule to the likelihoods of slide 15]
17.
[Figure: in = P(x | C) with an irregular PDF shape, out = P(C | x)]
Even with an irregular PDF shape, the Bayesian output P(C | x) = P(x | C) P(C) / P(x) has a nice plateau (a small numerical sketch follows below).
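The curves on slides 15-17 are not in the extracted text; as a stand-in, a small Python sketch with hypothetical Gaussian class-conditional PDFs shows the same "blow up" from likelihoods to posteriors:

```python
import numpy as np

# Hypothetical class-conditional likelihoods P(x | class) for temperature x (deg C).
classes = {"Cold": (5, 4), "Normal": (15, 3), "Warm": (23, 3), "Hot": (32, 4)}
priors = {c: 0.25 for c in classes}          # equal priors for simplicity

def pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x = np.linspace(-5, 45, 201)
likelihoods = {c: pdf(x, m, s) for c, (m, s) in classes.items()}

# P(x): the unconditional likelihood of the temperature values.
p_x = sum(priors[c] * likelihoods[c] for c in classes)

# Bayes "blow up": posteriors P(class | x) sum to 1 at every x and form
# plateaus where one class clearly dominates.
posteriors = {c: priors[c] * likelihoods[c] / p_x for c in classes}
print({c: round(float(posteriors[c][100]), 3) for c in classes})  # posteriors at x = 20
```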
18. Puzzle
- So if Bayes is optimal and can be used for
continuous data too, why has it become popular so
late, i.e., much later than neural networks?
19. Why Bayes has become popular so late
- Note: the example was 1-dimensional.
- A PDF (histogram) with 100 bins for one dimension will cost 10,000 bins for two dimensions, etc.
- ⇒ Ncells = Nbins^ndims (illustrated in the short snippet below)
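A short snippet illustrating the growth implied by Ncells = Nbins^ndims (only the 100-bin, 1-D and 2-D figures come from the slide; the higher dimensions follow from the formula):

```python
# Number of histogram cells needed for an empirical PDF grows
# exponentially with the number of feature dimensions.
n_bins = 100
for n_dims in range(1, 5):
    print(f"{n_dims}-D histogram with {n_bins} bins/dim: {n_bins ** n_dims:,} cells")
# 1-D: 100 cells, 2-D: 10,000 cells, 3-D: 1,000,000 cells, 4-D: 100,000,000 cells
```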
20. Why Bayes has become popular so late
- ⇒ Ncells = Nbins^ndims
- Yes, but you could use n-dimensional theoretical distributions (Gauss, Weibull, etc.) instead of empirically measured PDFs.
21. Why Bayes has become popular so late
- Even when using theoretical distributions instead of empirically measured PDFs, the dimensionality is still a problem:
  - 20 samples needed to estimate a 1-dim. Gaussian PDF
  - 400 samples needed to estimate a 2-dim. Gaussian!, etc.
- Massive amounts of labeled data are needed to estimate probabilities reliably!
22. Labeled (ground-truthed) data
Example: client evaluation in insurance

0.1    0.54   0.53   0.874   8.455   0.001   0.111   risk
0.2    0.59   0.01   0.974   8.40    0.002   0.315   risk
0.11   0.4    0.3    0.432   7.455   0.013   0.222   safe
0.2    0.64   0.13   0.774   8.123   0.001   0.415   risk
0.1    0.17   0.59   0.813   9.451   0.021   0.319   risk
0.8    0.43   0.55   0.874   8.852   0.011   0.227   safe
0.1    0.78   0.63   0.870   8.115   0.002   0.254   risk
...
(each row: seven feature values followed by the class label)
23. Success of speech recognition
- Massive amounts of data
- Increased computing power
- Cheap computer memory
- ... allowed for the use of Bayes in hidden Markov models for speech recognition
- Similarly (but slower): application of Bayes in script recognition
24. Global Structure
- year
- title
- date
- date and number of entry (Rappt)
- redundant lines between paragraphs
- jargon words:
- Notificatie
- Besluit fiat
- imprint with page number
- ⇒ XML model
25. Local probabilistic structure
P("Novb 16" is a date | sticks out to the left, is left of "Rappt") = ?
26. Naive Bayes: Conditional Independence
- Naive Bayes assumes the attributes (features) are conditionally independent given the class:
  p(X, Y | C) = p(X | C) p(Y | C)
- or, in general:
  p(x1, ..., xn | C) = ∏i p(xi | C)
- Often works surprisingly well in practice despite its manifest simplicity.
27. Accumulation of Evidence + Independence
28. The Naive Bayes Classifier
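The bodies of slides 27 and 28 are missing from the extracted text; a hedged reconstruction of the formulas they presumably showed, combining the accumulation of evidence with the conditional-independence assumption:

```latex
% Hedged reconstruction: chain-rule accumulation of evidence, simplified by
% conditional independence, gives the naive Bayes MAP classifier.
\begin{align*}
  P(c \mid x_1, \ldots, x_n) &\propto P(c) \prod_{i=1}^{n} P(x_i \mid c)\\
  c_{\mathrm{MAP}} &= \arg\max_{c}\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\end{align*}
```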
29. Learning to Classify Text
- Representation: each electronic document is represented by the set of words that it contains, under the independence assumptions:
  - order of words does not matter
  - co-occurrences of words do not matter
  - i.e. each document is represented as a bag of words
- Learning: estimate from the training dataset of documents:
  - the prior class probability P(ci)
  - the conditional likelihood of a word wj given the document class ci: P(wj | ci)
- Classification: maximum a posteriori (MAP); a small training sketch follows below.
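A toy sketch of the learning step under these assumptions; the miniature corpus and the add-one smoothing are my own illustrative choices, not from the slides:

```python
from collections import Counter

# Toy bag-of-words naive Bayes trainer; the tiny corpus is illustrative only.
training_docs = [
    ("win free credit card offer now", "spam"),
    ("free platinum card limited offer", "spam"),
    ("workshop paper submission deadline", "ham"),
    ("meeting notes and paper review", "ham"),
]

class_counts = Counter()           # document counts per class, for P(ci)
word_counts = {}                   # per-class word frequency tables
vocab = set()

for text, label in training_docs:
    class_counts[label] += 1
    counts = word_counts.setdefault(label, Counter())
    for word in text.split():      # bag of words: order and co-occurrence ignored
        counts[word] += 1
        vocab.add(word)

n_docs = len(training_docs)
priors = {c: class_counts[c] / n_docs for c in class_counts}   # P(ci)

def word_likelihood(word, label):
    """P(wj | ci) with add-one (Laplace) smoothing so unseen words get a
    small non-zero probability (the smoothing is my addition, not from the slides)."""
    counts = word_counts[label]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(priors, word_likelihood("free", "spam"), word_likelihood("free", "ham"))
```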
30. Learning to Classify e-mail
- Is this e-mail spam? e-mail ∈ {spam, ham}
- Each word represents an attribute characterizing the e-mail.
- Estimate the class priors p(spam) and p(ham) from the training data, as well as the class-conditional likelihoods for all the encountered words.
- For a new e-mail, assuming naive Bayes conditional independence, compute the MAP hypothesis (a minimal sketch of this step follows below).
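A minimal, self-contained sketch of the classification step, assuming the priors and word likelihoods have already been estimated; the numbers are invented for illustration, and log-probabilities are used (a common practical choice, not mentioned on the slide) to avoid underflow on long messages:

```python
import math

# Hypothetical estimates from a training set (illustrative numbers only).
priors = {"spam": 0.4, "ham": 0.6}
word_likelihood = {
    "spam": {"free": 0.020, "credit": 0.015, "card": 0.012, "meeting": 0.001},
    "ham":  {"free": 0.002, "credit": 0.001, "card": 0.002, "meeting": 0.015},
}

new_email = ["free", "credit", "card"]

def log_posterior(label):
    """log P(c) + sum_j log P(wj | c): the naive Bayes MAP score, up to the
    constant log P(words), which is the same for both classes."""
    score = math.log(priors[label])
    for word in new_email:
        score += math.log(word_likelihood[label].get(word, 1e-6))  # tiny floor for unseen words
    return score

map_class = max(priors, key=log_posterior)
print(map_class)   # "spam" for this particular word list
```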
31. Spam filtering
Example of regular mail:

From acd@essex.ac.uk Mon Nov 10 19:23:44 2003
Return-Path: <alan@essex.ac.uk>
Received: from serlinux15.essex.ac.uk (serlinux15.essex.ac.uk [155.245.48.17]) by tcw2.ppsw.rug.nl (8.12.8/8.12.8) with ESMTP id hAAIecHC008727; Mon, 10 Nov 2003 19:40:38 +0100

Apologies for multiple postings.
> 2nd C a l l  f o r  P a p e r s
>
> DAS 2004
> Sixth IAPR International Workshop on Document Analysis Systems
> September 8-10, 2004
> Florence, Italy
> http://www.dsi.unifi.it/DAS04
>
> Note: There are two main additions with respect to the previous CFP:
> 1) DASDL data are now available on the workshop web site
> 2) Proceedings will be published by Springer Verlag in LNCS series
32. Spam filtering
Example of spam:

From: "Easy Qualify" <mbulacu@netaccessproviders.net>
To: bulacu@hotmail.com
Subject: Claim your Unsecured Platinum Card - 75OO dollar limit
Date: Tue, 28 Oct 2003 17:12:07 -0400

mbulacu - Tuesday, Oct 28, 2003

Congratulations, you have been selected for an Unsecured Platinum Credit Card / 7500 starting credit limit. This offer is valid even if you've had past credit problems or even no credit history. Now you can receive a 7,500 unsecured Platinum Credit Card that can help build your credit. And to help get your card to you sooner, we have been authorized to waive any employment or credit verification.
33. Conclusions
- Effective: about 90% correct classification
- Could be applied to any text classification problem
- Needs to be polished
34. Summary
- Bayesian inference allows for integrating prior knowledge about the world (beliefs being expressed in terms of probabilities) with new incoming data.
- Inductive bias of Naive Bayes: the attributes are independent.
- Although this assumption is often violated, it provides a very efficient tool that is often used (e.g. text classification, spam filtering).
- Applicable to discrete or continuous data.