Title: Statistical Learning Methods
1 Statistical Learning Methods
2 Introduction
- Agents can handle uncertainty by using the methods of probability theory and decision theory (probability theory + utility theory)
- But first they must learn their probabilistic theories of the world from experience...
3 Key Concepts
- Data: evidence, i.e., the instantiation of one or more random variables describing the domain
- Hypotheses: probabilistic theories of how the domain works
4 Outline
- Bayesian learning
- Maximum a posteriori and maximum likelihood learning
- Instance-based learning
- Intro to neural networks
5 Bayesian Learning
- Let D be all the data, with observed value d; the probability of a hypothesis hi follows from Bayes' rule: P(hi | d) = α P(d | hi) P(hi)
- For a prediction about quantity X: P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d) (the last step assumes X is independent of d given hi)
6 Bayesian Learning
- For a prediction about quantity X: P(X | d) = Σi P(X | d, hi) P(hi | d) = Σi P(X | hi) P(hi | d)
- No single best-guess hypothesis; all hypotheses are involved
7 Bayesian Learning
- Simply calculate the probability of each hypothesis given the data, and make predictions based on this
- I.e., predictions are based on all hypotheses, weighted by their posterior probabilities, rather than on a single best hypothesis
8 Candy
- Suppose there are five kinds of bags of candies:
- 10% are h1: 100% cherry candies
- 20% are h2: 75% cherry candies + 25% lime candies
- 40% are h3: 50% cherry candies + 50% lime candies
- 20% are h4: 25% cherry candies + 75% lime candies
- 10% are h5: 100% lime candies
- We observe candies drawn from some bag
9 More Candy
- We observe candies drawn from some bag
- Assume observations are i.i.d., e.g., because there are many candies in the bag
- Assume we don't like the green lime candies
- Important questions:
- What kind of bag is it? h1, h2, ..., h5?
- What flavor will the next candy be?
10 Posterior Probability of Hypotheses
11 Posterior Probability of Hypotheses
- The true hypothesis will eventually dominate the Bayesian prediction; the prior has no influence in the long run (see the sketch below)
- More importantly (maybe not for us?): the Bayesian prediction is optimal
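A minimal Python sketch of the Bayesian update and prediction for the candy example, assuming the priors and per-bag lime fractions listed above and that only lime candies are observed; the function names are illustrative, not from the slides:

```python
# Minimal sketch of Bayesian learning on the candy example.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(h1)..P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]        # P(lime | hi) for each bag type

def posteriors(num_limes_observed):
    """P(hi | d) after observing only lime candies (i.i.d. draws)."""
    unnorm = [p * (q ** num_limes_observed) for p, q in zip(priors, p_lime)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def predict_next_lime(num_limes_observed):
    """Full Bayesian prediction: P(next is lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(q * w for q, w in zip(p_lime, posteriors(num_limes_observed)))

for n in range(11):
    print(n, [round(w, 3) for w in posteriors(n)], round(predict_next_lime(n), 3))
```

Running it shows the posterior mass shifting toward h5 (the all-lime bag) as more lime candies are observed, which is the dominance effect described above.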
12 The Price for Being Optimal
- For real learning problems the hypothesis space is large, possibly infinite
- The summation / integration over hypotheses cannot be carried out
- Resort to approximate or simplified methods
13 Maximum A Posteriori
- Common approximation method: make predictions based on the single most probable hypothesis
- I.e., take the hi that maximizes P(hi | d)
- Such a MAP hypothesis is approximately Bayesian, i.e., P(X | d) ≈ P(X | hMAP); the more evidence, the better the approximation (a small sketch follows below)
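A small sketch of the MAP shortcut, assuming the posteriors() helper and p_lime list from the candy sketch above:

```python
# Minimal sketch of a MAP prediction: pick the single most probable
# hypothesis and predict with it alone (reuses the candy sketch above).
def predict_next_lime_map(num_limes_observed):
    post = posteriors(num_limes_observed)
    i_map = max(range(len(post)), key=lambda i: post[i])   # argmax_i P(hi | d)
    return p_lime[i_map]                                    # P(lime | h_MAP)
```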
14 Hypothesis Prior
- Both in Bayesian learning and in MAP learning, the hypothesis prior plays an important role
- If the hypothesis space is too expressive, overfitting can occur (cf. Chapter 18)
- The prior is used to penalize complexity instead of explicitly limiting the space: the more complex the hypothesis, the lower its prior probability
- If enough evidence is available, a complex hypothesis will eventually be chosen if necessary
15 Maximum Likelihood Approximation
- For enough data, the prior becomes irrelevant
- Maximum likelihood (ML) learning: choose the hi that maximizes P(d | hi)
- I.e., simply get the best fit to the data
- Identical to MAP for a uniform prior P(hi)
- Also reasonable if all hypotheses are of the same complexity
- ML is the standard non-Bayesian / classical statistical learning method
16 Example
- Bag from a new manufacturer: fraction θ of red cherry candies; any θ in [0, 1] is possible
- Suppose we unwrap N candies: c cherries and ℓ = N - c limes
- Likelihood: P(d | hθ) = θ^c (1 - θ)^ℓ
- Maximize for θ using the log likelihood (worked out below)
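A short worked derivation, assuming the θ^c (1 - θ)^ℓ likelihood above:

```latex
% Log likelihood and its maximization (sketch)
L(\mathbf{d} \mid h_\theta) = \log P(\mathbf{d} \mid h_\theta)
  = c \log \theta + \ell \log(1 - \theta)
% Setting the derivative with respect to theta to zero:
\frac{dL}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1 - \theta} = 0
\quad\Longrightarrow\quad
\theta = \frac{c}{c + \ell} = \frac{c}{N}
```

So the ML estimate is simply the observed fraction of cherry candies.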
17 Example 2
- Gaussian model, often denoted by N(µ, σ²)
- The log likelihood of N i.i.d. samples is L = -(N/2) log(2πσ²) - Σj (xj - µ)² / (2σ²)
- If σ is known, find the maximum likelihood estimate for µ
- If µ is known, find the maximum likelihood estimate for σ (both worked out below)
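A sketch of the two maximizations, assuming N i.i.d. samples x1, ..., xN:

```latex
% Maximum-likelihood estimates for the Gaussian (sketch)
L = -\frac{N}{2}\log(2\pi\sigma^2) - \sum_{j=1}^{N}\frac{(x_j-\mu)^2}{2\sigma^2}
% Sigma known: set dL/dmu = 0
\frac{\partial L}{\partial \mu} = \sum_{j}\frac{x_j-\mu}{\sigma^2} = 0
\;\Longrightarrow\; \mu_{ML} = \frac{1}{N}\sum_j x_j
% Mu known: set dL/dsigma^2 = 0
\frac{\partial L}{\partial \sigma^2} = -\frac{N}{2\sigma^2}
  + \frac{\sum_j (x_j-\mu)^2}{2\sigma^4} = 0
\;\Longrightarrow\; \sigma^2_{ML} = \frac{1}{N}\sum_j (x_j-\mu)^2
```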
18 Halfway Summary and Additional Remarks
- Full Bayesian learning gives the best possible predictions but is intractable
- MAP selects the single best hypothesis; the prior is still used
- Maximum likelihood assumes a uniform prior; OK for large data sets
- Choose a parameterized family of models to describe the data
- Write down the likelihood of the data as a function of the parameters
- Write down the derivative of the log likelihood w.r.t. each parameter
- Find parameter values such that the derivatives are zero
- ML estimation may be hard / impossible; modern optimization techniques help (see the sketch after this list)
- In games, data often becomes available sequentially; it is not necessary to train in one go
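A minimal sketch of numerical ML estimation via an off-the-shelf optimizer (assumes NumPy and SciPy are available; the candy likelihood is reused as a toy example):

```python
# Minimal sketch of numerical ML estimation when a closed form is not
# available or convenient: minimize the negative log likelihood.
import numpy as np
from scipy.optimize import minimize_scalar

c, N = 3, 10                      # hypothetical counts: 3 cherries in 10 candies

def neg_log_likelihood(theta):
    """Negative log likelihood of theta^c * (1 - theta)^(N - c)."""
    return -(c * np.log(theta) + (N - c) * np.log(1.0 - theta))

# Maximize the log likelihood by minimizing its negative on (0, 1).
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x)                   # close to the analytic estimate c / N = 0.3
```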
19 Outline
- Bayesian learning ✓
- Maximum a posteriori and maximum likelihood learning ✓
- Instance-based learning
- Intro to neural networks
20 Instance-Based Learning
- We saw statistical learning as parameter learning, i.e., given a specific parameter-dependent family of probability models, fit it to the data by tweaking the parameters
- Often simple and effective
- Fixed complexity
- May be good when there is very little data
21 Instance-Based Learning
- We saw statistical learning as parameter learning
- Nonparametric learning methods allow the hypothesis complexity to grow with the data
- The more data we have, the wigglier the hypothesis can be
22 Nearest-Neighbor Method
- Key idea: the properties of an input point x are likely to be similar to those of points in the neighborhood of x
- E.g., classification: estimate the unknown class of x using the classes of neighboring points
- Simple, but how does one define what a neighborhood is?
- One solution: find the k nearest neighbors
- But now the problem is how to decide what 'nearest' is...
23 k Nearest-Neighbor Classification
- Check the class / output label of your k neighbors and simply take, for example, (number of neighbors having class label x) / k as the posterior probability of having class label x (see the sketch below)
- When assigning a single label, take the MAP label!
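A minimal sketch of k-nearest-neighbor classification, assuming Euclidean distance as the notion of 'nearest'; the data and labels are made up for illustration:

```python
# Minimal sketch of k-nearest-neighbor classification.
import numpy as np
from collections import Counter

def knn_class_probabilities(X_train, y_train, x_query, k=3):
    """Return {label: fraction of the k nearest neighbors with that label}."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # 'nearest' = Euclidean
    nearest = np.argsort(dists)[:k]                      # indices of k closest points
    counts = Counter(y_train[i] for i in nearest)
    return {label: n / k for label, n in counts.items()}

def knn_predict(X_train, y_train, x_query, k=3):
    """MAP label: the class with the largest estimated posterior."""
    probs = knn_class_probabilities(X_train, y_train, x_query, k)
    return max(probs, key=probs.get)

# Tiny usage example with made-up 2-D points and labels.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(["cherry", "cherry", "lime", "lime"])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))      # -> "cherry"
```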
24 kNN Probability Density Estimation
25 Kernel Models
- Idea: put a little density function (a kernel) at every data point and take the normalized sum of these (see the sketch below)
- Somewhat similar to kNN
- Often provides comparable performance
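A minimal sketch of one-dimensional kernel density estimation, assuming a Gaussian kernel and a hand-picked bandwidth (both choices are assumptions, not from the slides):

```python
# Minimal sketch of kernel density estimation with a Gaussian kernel.
import numpy as np

def kernel_density(x, data, bandwidth=0.5):
    """Estimate p(x) as the average of Gaussian kernels centered on the data."""
    u = (x - data) / bandwidth
    kernels = np.exp(-0.5 * u**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean()            # normalized sum: each kernel integrates to 1

# Usage: density estimate at a few points from a small made-up sample.
data = np.array([0.0, 0.2, 0.3, 1.5, 1.7])
for x in (0.2, 1.0, 1.6):
    print(x, round(kernel_density(x, data), 3))
```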
26 Probability Density Estimation
27 Outline
- Bayesian learning ✓
- Maximum a posteriori and maximum likelihood learning ✓
- Instance-based learning ✓
- Intro to neural networks
28–34 Neural Networks and Games
35 So First... Neural Networks
- According to Robert Hecht-Nielsen, a neural network is simply "a computing system made up of a number of simple, highly interconnected processing elements, which process information by their dynamic state response to external inputs". Simply...
- We skip the biology for now
- And provide the bare basics (a small sketch follows below)
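As a preview of those bare basics, a minimal sketch of a single layer of such processing elements (weights, bias, sigmoid activation); all numbers and sizes are illustrative, not from the slides:

```python
# Minimal sketch of one layer of simple processing elements.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(inputs, weights, bias):
    """One layer: each unit computes sigmoid(w . x + b)."""
    return sigmoid(weights @ inputs + bias)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])               # external inputs
W = rng.normal(size=(4, 3))                   # 4 units, 3 inputs each
b = np.zeros(4)
print(layer(x, W, b))                         # activations of the 4 units
```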