Title: Artificial Intelligence
1. Artificial Intelligence
- Statistical learning methods
- Chapter 20, AIMA
- (only ANNs and SVMs)
2. Artificial neural networks
- The brain is a pretty intelligent system.
- Can we copy it?
- There are approx. 10^11 neurons in the brain.
- There are approx. 23×10^9 neurons in the male cortex (females have about 15 % fewer).
3. The simple model
- The McCulloch-Pitts model (1943)
- y = g(w0 + w1·x1 + w2·x2 + w3·x3)
- Image from Neuroscience: Exploring the Brain by Bear, Connors, and Paradiso
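A minimal sketch of the unit described above; the weights, inputs, and the Heaviside choice of g below are made-up illustration values, not taken from the slide.

```python
import numpy as np

def mcculloch_pitts(x, w, w0, g):
    """Output y = g(w0 + w1*x1 + w2*x2 + ...) of a single unit."""
    return g(w0 + np.dot(w, x))

# Illustration with a Heaviside (threshold) transfer function.
heaviside = lambda z: 1.0 if z >= 0 else 0.0
y = mcculloch_pitts(x=np.array([1.0, 0.0, 1.0]),
                    w=np.array([0.5, -0.3, 0.8]),
                    w0=-0.6,
                    g=heaviside)   # -> 1.0, since -0.6 + 0.5 + 0.8 >= 0
```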
4. Neuron firing rate
5. Transfer functions g(z)
- The logistic function
- The Heaviside function
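Written out, the standard forms of these two transfer functions (the slide only shows their plots); with the {-1, +1} representation used on the next slide, the threshold unit is typically the sign function instead.

```latex
g_{\mathrm{logistic}}(z) = \frac{1}{1 + e^{-z}},
\qquad
g_{\mathrm{Heaviside}}(z) =
\begin{cases}
1, & z \ge 0,\\
0, & z < 0.
\end{cases}
```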
6. The simple perceptron
- With the {-1, +1} output representation.
- Traditionally (early 60s) trained with perceptron learning.
7. Perceptron learning
f(n) is the desired output (target) for example x(n).
- Repeat until no errors are made anymore:
  - Pick a random example (x(n), f(n)).
  - If the classification is correct, i.e. if y(x(n)) = f(n), then do nothing.
  - If the classification is wrong, then update the parameters (η, the learning rate, is a small positive number); see the sketch below.
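A minimal sketch of perceptron learning as described above. The slide's exact update formula was shown as an image; the code assumes the classic rule w ← w + η·f(n)·x(n) (with the bias folded in as w0) for misclassified examples.

```python
import numpy as np

def perceptron_learning(X, f, eta=0.3, max_epochs=100, rng=None):
    """Perceptron learning with the {-1, +1} representation.

    X: (N, d) inputs, f: (N,) desired outputs in {-1, +1}.
    Returns the weight vector [w0, w1, ..., wd] (bias first).
    """
    rng = rng or np.random.default_rng(0)
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend a constant 1 for the bias w0
    w = np.zeros(Xb.shape[1])                   # assumed initial values (not given on the slide)
    for _ in range(max_epochs):
        errors = 0
        for n in rng.permutation(len(Xb)):      # pick examples in random order
            y = 1.0 if Xb[n] @ w >= 0 else -1.0
            if y != f[n]:                       # wrong classification -> learning action
                w += eta * f[n] * Xb[n]
                errors += 1
        if errors == 0:                         # repeat until no errors are made anymore
            break
    return w

# The AND function from the following example slides:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
f = np.array([-1, -1, -1, 1], dtype=float)
w = perceptron_learning(X, f, eta=0.3)
```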
8. Example: Perceptron learning
The AND function:

  x1  x2   f
   0   0  -1
   0   1  -1
   1   0  -1
   1   1   1

Initial values shown in the figure; learning rate η = 0.3.
9. Example: Perceptron learning
This one is correctly classified, no action.
10. Example: Perceptron learning
This one is incorrectly classified, learning action.
11. Example: Perceptron learning
This one is incorrectly classified, learning action.
12. Example: Perceptron learning
This one is correctly classified, no action.
13. Example: Perceptron learning
This one is incorrectly classified, learning action.
14. Example: Perceptron learning
This one is incorrectly classified, learning action.
15. Example: Perceptron learning
Final solution for the AND function (shown in the figure).
16. Perceptron learning
- Perceptron learning is guaranteed to find a solution in finite time, if a solution exists.
- Perceptron learning cannot be generalized to more complex networks.
- Better to use gradient descent, based on formulating an error and using differentiable functions.
17. Gradient search
The learning rate (η) is set heuristically.
(Figure: the error surface E(W); from the current point W(k), go downhill.)
W(k+1) = W(k) + ΔW(k)
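A minimal gradient-descent sketch. The slide only shows the figure and the update equation; the code assumes the usual downhill step ΔW(k) = -η·∂E/∂W evaluated at W(k).

```python
import numpy as np

def gradient_descent(grad_E, W0, eta=0.1, steps=100):
    """Iterate W(k+1) = W(k) + dW(k) with dW(k) = -eta * gradient of E at W(k)."""
    W = np.asarray(W0, dtype=float)
    for _ in range(steps):
        W = W - eta * grad_E(W)    # go downhill on the error surface E(W)
    return W

# Toy example: E(W) = ||W||^2 has gradient 2W, so the minimum is at W = 0.
W_min = gradient_descent(lambda W: 2 * W, W0=[1.0, -2.0], eta=0.1, steps=200)
```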
18. The Multilayer Perceptron (MLP)
- Combine several single-layer perceptrons.
- Each single-layer perceptron uses a sigmoid function (differentiable), e.g. the logistic function.
- (Figure: network diagram from input to output.)
- Can be trained using gradient descent.
19. Example: One hidden layer
- Can approximate any continuous function.
- q(z): sigmoid or linear (output layer),
- f(z): sigmoid (hidden layer).
20. Training: Backpropagation (gradient descent)
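A minimal backpropagation sketch for a one-hidden-layer MLP, assuming sigmoid hidden units, a linear output, a squared-error cost, and plain batch gradient descent; the slide itself only names the method, so treat this as one possible concrete instance.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, t, n_hidden=3, eta=0.5, epochs=5000, seed=0):
    """One-hidden-layer MLP trained with backpropagation (gradient descent).

    X: (N, d) inputs, t: (N,) targets. Sigmoid hidden units, linear output,
    squared-error cost E = 0.5 * sum((y - t)^2).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    N, d = X.shape
    W1 = rng.normal(scale=0.5, size=(d + 1, n_hidden))   # input -> hidden (bias row first)
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, 1))   # hidden -> output (bias row first)
    Xb = np.hstack([np.ones((N, 1)), X])
    for _ in range(epochs):
        # Forward pass
        h = sigmoid(Xb @ W1)                               # hidden activations
        hb = np.hstack([np.ones((N, 1)), h])
        y = hb @ W2                                        # linear output
        # Backward pass (gradients of E)
        delta_out = y - t                                  # dE/dy for squared error
        grad_W2 = hb.T @ delta_out
        delta_hid = (delta_out @ W2[1:].T) * h * (1 - h)   # backpropagate through the sigmoid
        grad_W1 = Xb.T @ delta_hid
        # Gradient-descent update
        W1 -= eta * grad_W1 / N
        W2 -= eta * grad_W2 / N
    return W1, W2
```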
21. Support vector machines
22. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
23. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
Choose the line with the largest margin: the large margin classifier.
(Figure: the margin.)
24. Linear classifier on a linearly separable problem
There are infinitely many lines that have zero training error. Which line should we choose?
Choose the line with the largest margin: the large margin classifier.
(Figure: the margin and the support vectors.)
25. Computing the margin
The plane separating the two classes is defined by wᵀx + a = 0. The dashed planes (through the closest points of each class) are given by wᵀx + a = ±b.
(Figure: the two classes, the weight vector w, and the margin.)
26. Computing the margin
Divide by b. Define new w = w/b and a = a/b, so the dashed planes become wᵀx + a = ±1.
(Figure: the weight vector w and the margin.)
We have thereby defined a scale for w and a.
27. Computing the margin
We have wᵀx + a = -1 at a point x on one dashed plane and wᵀ(x + λw) + a = +1 at the point x + λw straight across the margin, which gives λ·wᵀw = 2 and hence a margin of ‖λw‖ = 2/‖w‖.
(Figure: the points x and x + λw spanning the margin.)
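Written out, the margin computation sketched on the three slides above (my reconstruction; the slides' own equations were shown in the figures):

```latex
\begin{aligned}
 w^{T}x + a &= -1, \\
 w^{T}(x + \lambda w) + a &= +1
 \;\Rightarrow\; \lambda\, w^{T}w = 2
 \;\Rightarrow\; \lambda = \frac{2}{\lVert w\rVert^{2}}, \\[4pt]
 \text{margin} &= \lVert \lambda w \rVert = \frac{2}{\lVert w \rVert}.
\end{aligned}
```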
28. Linear classifier on a linearly separable problem
Maximizing the margin is equal to minimizing ‖w‖ subject to the constraints
  wᵀx(n) + a ≥ +1 for all n with f(n) = +1,
  wᵀx(n) + a ≤ -1 for all n with f(n) = -1.
This is a quadratic programming problem; the constraints can be included with Lagrange multipliers.
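Using the class labels f(n) = ±1 from the earlier slides, the two constraints combine into the standard quadratic program (a common rewriting, not verbatim from the slide):

```latex
\min_{w,\,a}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad \text{subject to} \quad
f(n)\left(w^{T}x(n) + a\right) \ge 1 \quad \text{for all } n .
```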
29. Quadratic programming problem
Minimize the cost (Lagrangian) Lp.
The minimum of Lp occurs at the maximum of LD (the Wolfe dual).
Only scalar products in the cost. IMPORTANT!
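The standard primal Lagrangian and Wolfe dual for this problem, reconstructed here since the slide's equations were shown as images (α(n) ≥ 0 are the Lagrange multipliers):

```latex
\begin{aligned}
L_{p} &= \tfrac{1}{2}\lVert w\rVert^{2}
        - \sum_{n} \alpha(n)\left[f(n)\left(w^{T}x(n) + a\right) - 1\right], \\
L_{D} &= \sum_{n} \alpha(n)
        - \tfrac{1}{2}\sum_{n}\sum_{m} \alpha(n)\,\alpha(m)\,f(n)\,f(m)\; x(n)^{T}x(m),
\end{aligned}
\qquad \text{with } \alpha(n) \ge 0,\ \ \sum_{n}\alpha(n) f(n) = 0 .
```

Note that the training data enter LD only through the scalar products x(n)ᵀx(m), which is exactly what the kernel trick two slides below exploits.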
30. Linear Support Vector Machine
Test phase: the predicted output is given by the expression below, where a is determined e.g. by looking at one of the support vectors. Still only scalar products in the expression.
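The standard form of the predicted output the slide refers to (reconstruction; the sum effectively runs over the support vectors, i.e. the examples with α(n) > 0):

```latex
y(x) = \operatorname{sign}\!\left(\sum_{n} \alpha(n)\, f(n)\, x(n)^{T}x \;+\; a\right)
```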
31. How to deal with the nonlinear case?
- Project the data into a high-dimensional space Z. There we know that it will be linearly separable (due to the VC dimension of the linear classifier).
- We don't even have to know the projection...!
32. Scalar product kernel trick
If we can find a kernel K that equals the scalar product of the projected points (see below), then we don't even have to know the mapping to solve the problem...
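The condition sketched on the slide, written out with z(·) denoting the (possibly unknown) projection into Z (my reconstruction of the equation shown as an image):

```latex
K\!\left(x(i), x(j)\right) = z\!\left(x(i)\right)^{T} z\!\left(x(j)\right)
```

Every x(n)ᵀx(m) in the Wolfe dual and in the decision function can then simply be replaced by K(x(n), x(m)).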
33. Valid kernels (Mercer's theorem)
Define the matrix K with entries Kij = K(x(i), x(j)).
If K is symmetric, K = Kᵀ, and positive semi-definite, then K(x(i), x(j)) is a valid kernel.
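A small sketch of this check on toy data, assuming a Gaussian kernel: build the Gram matrix and verify symmetry and positive semi-definiteness via its eigenvalues.

```python
import numpy as np

def gram_matrix(X, kernel):
    """Gram matrix K_ij = kernel(x_i, x_j) for the rows of X."""
    return np.array([[kernel(xi, xj) for xj in X] for xi in X])

gaussian = lambda x, y, gamma=1.0: np.exp(-gamma * np.sum((x - y) ** 2))

X = np.random.default_rng(0).normal(size=(5, 2))        # toy data
K = gram_matrix(X, gaussian)
symmetric = np.allclose(K, K.T)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)           # PSD up to numerical noise
```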
34. Examples of kernels
First, the Gaussian kernel. Second, the polynomial kernel; with d = 1 we have the linear SVM. The linear SVM is often used with good success on high-dimensional data (e.g. text classification).
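Sketches of the two kernels named above. Their exact parameterizations on the slide were shown as images, so these are the common textbook forms (γ is the Gaussian width parameter, d the polynomial degree; both names are my choice here).

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def polynomial_kernel(x, y, d=2, c=1.0):
    """Polynomial kernel: (x.y + c)^d; with d = 1 this reduces to the linear case."""
    return (np.dot(x, y) + c) ** d
```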
35. Example: Robot color vision (Competition 1999)
Classify the Lego pieces into red, blue, and yellow. Classify white balls, black sideboard, and green carpet.
36. What the camera sees (RGB space)
(Figure: the pixel clusters in RGB space, labelled Yellow, Red, and Green.)
37. Mapping RGB (3D) to rgb (2D)
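The mapping itself was shown as an image; a common choice for normalized rgb, and presumably what is meant here, is chromaticity normalization, dividing each channel by the intensity sum and keeping two of the resulting coordinates:

```python
def rgb_normalize(R, G, B):
    """Map RGB (3D) to normalized rg chromaticity (2D): r = R/(R+G+B), g = G/(R+G+B)."""
    s = R + G + B + 1e-12          # avoid division by zero for black pixels
    return R / s, G / s            # b = B/s is redundant since r + g + b = 1
```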
38. Lego in normalized rgb space
Input is 2D (x1, x2 in the figure).
Output is 6D: red, blue, yellow, green, black, white.
39. MLP classifier
E_train = 0.21, E_test = 0.24
2-3-1 MLP, Levenberg-Marquardt
Training time (150 epochs): 51 seconds
40. SVM classifier
E_train = 0.19, E_test = 0.20
SVM with γ = 1000
Training time: 22 seconds