Title: Parametric Methods
1. Parametric Methods
2. Learning Objectives
- Understand what parametric methods are
- Understand how to learn probabilities from a training set
- Understand regression as a method for prediction
3. Acknowledgements
- Some of these slides have been adapted from Ethem Alpaydin.
4. Terminology
- A statistic is a value calculated from a given sample.
- Parametric methods assume that the training set obeys a known model (e.g., the Gaussian, or normal, distribution).
5. Parametric Estimation
- X = {x^t}_t where x^t ~ p(x)
- Parametric estimation
- Assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
- e.g., N(μ, σ²) where θ = (μ, σ²)
6. Maximum Likelihood Estimation
- Likelihood of θ given the sample X
- l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
- Log likelihood
- L(θ|X) = log l(θ|X) = Σ_t log p(x^t|θ)
- Maximum likelihood estimator (MLE)
- θ* = argmax_θ L(θ|X) (a numerical sketch follows below)
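To make the estimator concrete, here is a minimal numerical sketch, assuming a Gaussian p(x|θ) and an invented sample; the starting point and the log-σ parameterization are choices made for the example, not part of the slides.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical sample X drawn from an unknown Gaussian.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=200)

def neg_log_likelihood(params, x):
    """-L(theta|X) = -sum_t log p(x^t|theta) for a Gaussian."""
    mu, log_sigma = params              # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (x - mu)**2 / (2 * sigma**2))

# theta* = argmax_theta L(theta|X)  <=>  argmin of the negative log likelihood
result = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(X,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                # close to the true 2.0 and 1.5
```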
7. Examples: Bernoulli/Multinomial
- Bernoulli: two states, failure/success, x ∈ {0, 1}
- P(x) = p_0^x (1 − p_0)^(1−x)
- L(p_0|X) = log ∏_t p_0^{x^t} (1 − p_0)^{1−x^t}
- MLE: p̂_0 = Σ_t x^t / N
- Multinomial: K > 2 states, x_i ∈ {0, 1}
- P(x_1, x_2, ..., x_K) = ∏_i p_i^{x_i}
- L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^{x_i^t}
- MLE: p̂_i = Σ_t x_i^t / N (both closed-form estimators appear in the sketch below)
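A minimal sketch of the two closed-form MLEs, on invented 0/1 and one-hot samples:

```python
import numpy as np

# Bernoulli MLE: p0_hat = sum_t x^t / N
x = np.array([1, 0, 1, 1, 0, 1, 1, 0])   # hypothetical 0/1 sample
p0_hat = x.mean()
print(p0_hat)                             # 0.625

# Multinomial MLE: p_i_hat = sum_t x_i^t / N, with x^t one-hot over K states
X = np.array([[1, 0, 0],                  # hypothetical one-hot sample, K = 3
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])
p_hat = X.mean(axis=0)                    # relative frequency of each state
print(p_hat)                              # [0.25 0.5  0.25]
```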
8. Gaussian (Normal) Distribution
- p(x) = N(μ, σ²): p(x) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
- MLE for μ and σ²:
- m = Σ_t x^t / N
- s² = Σ_t (x^t − m)² / N
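The same two estimates in code, on an invented sample:

```python
import numpy as np

# Hypothetical sample; m and s^2 are the closed-form Gaussian MLEs.
x = np.array([2.1, 1.9, 2.4, 2.0, 1.6, 2.3])
N = len(x)
m  = x.sum() / N                   # m   = sum_t x^t / N
s2 = ((x - m) ** 2).sum() / N      # s^2 = sum_t (x^t - m)^2 / N  (biased MLE)
print(m, s2)
```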
9. Bias and Variance
- Unknown parameter θ
- Estimator d_i = d(X_i) on sample X_i
- Bias: b_θ(d) = E[d] − θ
- Variance: E[(d − E[d])²]
- Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance (derivation below)
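The last equality is worth one derivation step (not on the slide): add and subtract E[d] inside the square; the cross term vanishes because E[d − E[d]] = 0.

```latex
\begin{aligned}
r(d,\theta) &= E\big[(d-\theta)^2\big]
             = E\Big[\big((d - E[d]) + (E[d] - \theta)\big)^2\Big] \\
            &= E\big[(d-E[d])^2\big]
             + 2\,(E[d]-\theta)\,\underbrace{E\big[d-E[d]\big]}_{=\,0}
             + (E[d]-\theta)^2 \\
            &= \underbrace{(E[d]-\theta)^2}_{\text{Bias}^2}
             + \underbrace{E\big[(d-E[d])^2\big]}_{\text{Variance}}
\end{aligned}
```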
10. Bayes' Estimator
- Treat θ as a random variable with prior p(θ)
- Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
- Full density: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
- Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
- Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
- Bayes': θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
11. Bayes' Estimator: Example
- x^t ~ N(θ, σ_0²) and θ ~ N(μ, σ²)
- θ_ML = m
- θ_MAP = θ_Bayes = E[θ|X] = [(N/σ_0²) m + (1/σ²) μ] / (N/σ_0² + 1/σ²) (computed in the sketch below)
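A sketch of this posterior mean with invented numbers; the prior (μ, σ²), noise variance σ_0², and the sample are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 0.0, 4.0                 # prior: theta ~ N(mu, sigma^2)
sigma0_2   = 1.0                      # likelihood: x^t ~ N(theta, sigma0^2)
x = rng.normal(3.0, np.sqrt(sigma0_2), size=25)
N, m = len(x), x.mean()

# theta_MAP = theta_Bayes: a precision-weighted blend of the sample mean m
# and the prior mean mu; more data (larger N) pulls it toward m.
w = (N / sigma0_2) / (N / sigma0_2 + 1.0 / sigma2)
theta_bayes = w * m + (1 - w) * mu
print(m, theta_bayes)                 # posterior mean shrinks m toward mu
```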
12. Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
13. Bayes' Theorem
- Given training data D, the posterior probability of a hypothesis h follows from Bayes' theorem: P(h|D) = P(D|h) P(h) / P(D)
- MAP (maximum a posteriori) hypothesis: h_MAP = argmax_h P(h|D) = argmax_h P(D|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, plus significant computational cost
14. Naïve Bayes Classifier (I)
- A simplifying assumption: attributes are conditionally independent given the class
- Greatly reduces the computation cost: only count the class distribution
15. Naïve Bayes Classifier (II)
- Given a training set, we can compute the probabilities P(C) and P(x_i|C) by counting
16. Bayesian classification
- The classification problem may be formalized using a-posteriori probabilities:
- P(C|X) = probability that the sample tuple X = <x_1, ..., x_k> is of class C
- E.g., P(class = N | outlook = sunny, windy = true, ...)
- Idea: assign to sample X the class label C such that P(C|X) is maximal
17. Estimating a-posteriori probabilities
- Bayes' theorem: P(C|X) = P(X|C) P(C) / P(X)
- P(X) is constant for all classes
- P(C) = relative frequency of class C samples
- The C such that P(C|X) is maximum is the C such that P(X|C) P(C) is maximum
- Problem: computing P(X|C) is infeasible!
18. Naïve Bayesian Classification
- Naïve assumption: attribute independence
- P(x_1, ..., x_k|C) = P(x_1|C) · ... · P(x_k|C)
- If the i-th attribute is categorical: P(x_i|C) is estimated as the relative frequency of samples having value x_i as the i-th attribute in class C
- If the i-th attribute is continuous: P(x_i|C) is estimated through a Gaussian density function
- Computationally easy in both cases (see the sketch below)
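A minimal naïve Bayes sketch combining both attribute types; the dataset, attribute values, and class labels are invented for illustration:

```python
import numpy as np

# Minimal sketch (hypothetical data): one categorical and one continuous
# attribute per sample; classes are 'p' and 'n'.
X_cat = np.array(['sunny', 'rain', 'sunny', 'rain', 'overcast', 'sunny'])
X_num = np.array([30.0, 18.0, 28.0, 17.0, 22.0, 31.0])   # e.g. temperature
y     = np.array(['n', 'p', 'n', 'p', 'p', 'n'])

def posterior_score(c, cat_val, num_val):
    mask = (y == c)
    prior = mask.mean()                                   # P(C)
    # categorical attribute: relative frequency within class C
    p_cat = (X_cat[mask] == cat_val).mean()
    # continuous attribute: Gaussian density fitted within class C
    mu, var = X_num[mask].mean(), X_num[mask].var()
    p_num = np.exp(-(num_val - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return prior * p_cat * p_num                          # proportional to P(C|X)

scores = {c: posterior_score(c, 'sunny', 29.0) for c in ('p', 'n')}
print(max(scores, key=scores.get))                        # predicted class
```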
19. Play-tennis example: estimating P(x_i|C)
20. Play-tennis example: classifying X
- An unseen sample X = <rain, hot, high, false>
- P(X|p) P(p) = P(rain|p) P(hot|p) P(high|p) P(false|p) P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X|n) P(n) = P(rain|n) P(hot|n) P(high|n) P(false|n) P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- Sample X is classified in class n (don't play); the arithmetic is checked in the sketch below
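Reproducing the slide's arithmetic with exact fractions:

```python
from fractions import Fraction as F

# Scores for the unseen sample X = <rain, hot, high, false>.
p_score = F(3, 9) * F(2, 9) * F(3, 9) * F(6, 9) * F(9, 14)   # P(X|p) P(p)
n_score = F(2, 5) * F(2, 5) * F(4, 5) * F(2, 5) * F(5, 14)   # P(X|n) P(n)
print(float(p_score), float(n_score))   # 0.010582..., 0.018285...
print('n' if n_score > p_score else 'p')                     # -> 'n' (don't play)
```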
21. The independence hypothesis
- makes computation possible
- yields optimal classifiers when satisfied
- but is seldom satisfied in practice, as attributes (variables) are often correlated
- Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
- Decision trees, which reason on one attribute at a time, considering the most important attributes first
22. Bayesian Belief Networks (I)
[Figure: a six-node belief network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea, with arcs for causal influence (FamilyHistory and Smoker → LungCancer; LungCancer → PositiveXRay, Dyspnea).]
The conditional probability table for the variable LungCancer:

       (FH, S)  (FH, ¬S)  (¬FH, S)  (¬FH, ¬S)
LC       0.8      0.5       0.7       0.1
¬LC      0.2      0.5       0.3       0.9
23. Bayesian Belief Networks (II)
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
- Several cases of learning Bayesian belief networks:
- Given both the network structure and all the variables: easy
- Given the network structure but only some of the variables
- When the network structure is not known in advance
24. Regression
25. Regression: From Log Likelihood to Error
26. Linear Regression
- Linear regression: Y = α + βX
- Two parameters, α and β, specify the line and are to be estimated from the data at hand,
- using the least squares criterion on the known values Y_1, Y_2, ... and X_1, X_2, ...
- β = Σ_i (x_i − avg(x))(y_i − avg(y)) / Σ_i (x_i − avg(x))²
- α = avg(y) − β avg(x) (see the sketch below)
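The two formulas in code, on invented (x, y) pairs:

```python
import numpy as np

# Least-squares fit of y = alpha + beta * x on hypothetical data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

beta  = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
alpha = y.mean() - beta * x.mean()
print(alpha, beta)   # close to 0 and 2 for this data
```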
27. Other Error Measures
- Square error: E(θ|X) = Σ_t (r^t − g(x^t|θ))²
- Relative square error: E(θ|X) = Σ_t (r^t − g(x^t|θ))² / Σ_t (r^t − r̄)²
- Absolute error: E(θ|X) = Σ_t |r^t − g(x^t|θ)|
- ε-sensitive error: E(θ|X) = Σ_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε), all computed in the sketch below
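All four measures on invented targets r and predictions g (ε chosen arbitrarily):

```python
import numpy as np

# Hypothetical targets r and predictions g for the error measures above.
r = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.1, 1.8, 3.3, 3.9])
eps = 0.15

square   = ((r - g) ** 2).sum()
relative = square / ((r - r.mean()) ** 2).sum()
absolute = np.abs(r - g).sum()
# epsilon-sensitive: deviations within eps cost nothing
eps_sens = np.where(np.abs(r - g) > eps, np.abs(r - g) - eps, 0.0).sum()
print(square, relative, absolute, eps_sens)
```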
28. Bias and Variance
29. Estimating Bias and Variance
- M samples X_i = {x^t_i, r^t_i}, i = 1, ..., M,
- are used to fit g_i(x), i = 1, ..., M
- bias² = (1/N) Σ_t (ḡ(x^t) − f(x^t))²
- variance = (1/N)(1/M) Σ_t Σ_i (g_i(x^t) − ḡ(x^t))², with ḡ(x) = (1/M) Σ_i g_i(x) (simulated below)
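A simulation sketch of these two estimates, with an invented true function f, noise level, and model class (degree-3 polynomials):

```python
import numpy as np

# Refit the model on M independent noisy samples of a known f(x), then
# estimate bias^2 and variance over fixed evaluation points x^t.
rng = np.random.default_rng(2)
f = np.sin                                   # the "true" function
x = np.linspace(0.0, np.pi, 20)              # fixed evaluation points
M, degree = 50, 3

fits = []
for _ in range(M):
    noisy_r = f(x) + rng.normal(0.0, 0.2, size=x.size)
    coeffs = np.polyfit(x, noisy_r, degree)  # g_i fitted on sample i
    fits.append(np.polyval(coeffs, x))
G = np.array(fits)                           # shape (M, N)

g_bar = G.mean(axis=0)                       # g-bar(x) = average fit
bias2 = ((g_bar - f(x)) ** 2).mean()
variance = ((G - g_bar) ** 2).mean()
print(bias2, variance)
```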
30. Bias/Variance Dilemma
- Example: g_i(x) = 2 has no variance and high bias
- g_i(x) = Σ_t r^t_i / N has lower bias but higher variance
- As we increase complexity,
- bias decreases (a better fit to data) and
- variance increases (fit varies more with data)
- Bias/variance dilemma (Geman et al., 1992)
31. [Figure: the true function f, the individual fits g_i, and their average ḡ, with bias shown as the gap between f and ḡ and variance as the spread of the g_i around ḡ.]
32. Polynomial Regression
[Figure: polynomial fits of increasing order; best fit = minimum error.]
33. [Figure: error versus polynomial order; the best fit sits at the "elbow" of the error curve.]
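A sketch of that elbow on invented data: training error keeps falling with polynomial order, while validation error flattens once the true order (2 here) is reached.

```python
import numpy as np

# Hypothetical quadratic data with noise, split into train/validation halves.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 60)
r = 2.0 * x**2 - x + rng.normal(0.0, 0.1, size=x.size)
train, val = np.arange(0, 60, 2), np.arange(1, 60, 2)   # alternating split

for degree in range(1, 8):
    coeffs = np.polyfit(x[train], r[train], degree)
    e_train = np.mean((np.polyval(coeffs, x[train]) - r[train])**2)
    e_val   = np.mean((np.polyval(coeffs, x[val])   - r[val])**2)
    print(degree, round(e_train, 4), round(e_val, 4))
# training error keeps dropping; validation error stops improving near degree 2
```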
34. Model Selection
- Cross-validation: measure generalization accuracy by testing on data unused during training
- Regularization: penalize complex models, E′ = error on data + λ · model complexity
- Akaike's information criterion (AIC), Bayesian information criterion (BIC) (a sketch follows below)
- Minimum description length (MDL): Kolmogorov complexity, shortest description of the data
- Structural risk minimization (SRM)
35. Bayesian Model Selection
- Prior on models, p(model)
- Regularization, when the prior favors simpler models
- Bayes: MAP of the posterior, p(model|data)
- Average over a number of models with high posterior (voting, ensembles: Chapter 15)