Title: Lecture 1: Introduction to Machine Learning
1. Lecture 1: Introduction to Machine Learning
- Isabelle Guyon
- isabelle_at_clopinet.com
2. What is Machine Learning?
[Diagram: TRAINING DATA feeds a trained machine, which maps a query ("?") to an answer.]
3. What for?
- Classification
- Time series prediction
- Regression
- Clustering
4. Applications
5. Banking / Telecom / Retail
- Identify:
- Prospective customers
- Dissatisfied customers
- Good customers
- Bad payers
- Obtain:
- More effective advertising
- Less credit risk
- Less fraud
- Decreased churn rate
6. Biomedical / Biometrics
- Medicine:
- Screening
- Diagnosis and prognosis
- Drug discovery
- Security:
- Face recognition
- Signature / fingerprint / iris verification
- DNA fingerprinting
7. Computer / Internet
- Computer interfaces:
- Troubleshooting wizards
- Handwriting and speech
- Brain waves
- Internet:
- Hit ranking
- Spam filtering
- Text categorization
- Text translation
- Recommendation
8. Conventions
[Notation diagram: data matrix X of size m × n with entries x_ij; the i-th row is the pattern x_i; target vector y with entries y_i; weight vector w (one entry per feature); dual coefficient vector α (one entry per pattern).]
9. Learning Problem
Data matrix X:
- m lines = patterns (data points, examples): samples, patients, documents, images, ...
- n columns = features (attributes, input variables): genes, proteins, words, pixels, ...
Unsupervised learning: is there structure in the data?
Supervised learning: predict an outcome y.
[Example figure: colon cancer data, Alon et al., 1999]
10. Some Learning Machines
- Linear models
- Kernel methods
- Neural networks
- Decision trees
11. Linear Models
- f(x) = w · x + b = Σ_{j=1..n} w_j x_j + b
- Linearity in the parameters, NOT in the input components:
- f(x) = w · Φ(x) + b = Σ_j w_j φ_j(x) + b (Perceptron)
- f(x) = Σ_{i=1..m} α_i k(x_i, x) + b (Kernel method)
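As a minimal sketch (not from the slides), the three forms in Python/NumPy; phi and k stand for any choice of basis functions and kernel, and all names are illustrative:

    import numpy as np

    def linear_model(w, b, x):
        # f(x) = w . x + b: linear in the parameters AND in the inputs
        return np.dot(w, x) + b

    def perceptron_model(w, b, x, phi):
        # f(x) = w . Phi(x) + b: still linear in the parameters,
        # but non-linear in the inputs through the basis functions phi
        return np.dot(w, phi(x)) + b

    def kernel_model(alpha, b, X_train, x, k):
        # f(x) = sum_i alpha_i k(x_i, x) + b: the kernel (dual) form
        return sum(a * k(xi, x) for a, xi in zip(alpha, X_train)) + b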
12. Artificial Neurons
[Diagram: a biological neuron (dendrites, synapses, cell potential, axon) next to its artificial counterpart: weighted inputs are summed and passed through an activation function, whose output drives the activation of other neurons.]
f(x) = w · x + b
(McCulloch and Pitts, 1943)
13. Linear Decision Boundary
14. Perceptron
(Rosenblatt, 1957)
15. Non-Linear Decision Boundary
16. Kernel Method
(Potential functions, Aizerman et al., 1964)
17. What is a Kernel?
- A kernel is:
- a similarity measure
- a dot product in some feature space: k(s, t) = Φ(s) · Φ(t)
- But we do not need to know the Φ representation.
- Examples:
- k(s, t) = exp(−||s − t||² / σ²): Gaussian kernel
- k(s, t) = (s · t)^q: polynomial kernel
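A sketch of these two kernels in Python/NumPy (s and t are NumPy arrays; the defaults for σ and q are arbitrary illustrative values):

    import numpy as np

    def gaussian_kernel(s, t, sigma=1.0):
        # k(s, t) = exp(-||s - t||^2 / sigma^2)
        return np.exp(-np.linalg.norm(s - t) ** 2 / sigma ** 2)

    def polynomial_kernel(s, t, q=2):
        # k(s, t) = (s . t)^q
        return np.dot(s, t) ** q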
18. Hebb's Rule
[Diagram: two connected neurons (axon to synapse); the connection weight grows when input and output activities agree, i.e. w ← w + y_i x_i.]
(Link to Naïve Bayes)
19. Kernel Trick (for Hebb's Rule)
- Hebb's rule for the Perceptron:
- w = Σ_i y_i Φ(x_i)
- f(x) = w · Φ(x) = Σ_i y_i Φ(x_i) · Φ(x)
- Define a dot product:
- k(x_i, x) = Φ(x_i) · Φ(x)
- f(x) = Σ_i y_i k(x_i, x)
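The point of the trick is that the dual expression never forms w or Φ explicitly. A minimal sketch of the resulting predictor (illustrative names; k is any kernel, labels y_i in {−1, +1}):

    def hebb_dual_predict(X_train, y_train, x, k):
        # f(x) = sum_i y_i k(x_i, x), equivalent to training with
        # Hebb's rule w = sum_i y_i Phi(x_i) and predicting w . Phi(x)
        return sum(yi * k(xi, x) for xi, yi in zip(X_train, y_train))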
20. Kernel Trick (General)
Dual forms:
- f(x) = Σ_i α_i k(x_i, x)
- k(x_i, x) = Φ(x_i) · Φ(x)
- f(x) = w · Φ(x)
- w = Σ_i α_i Φ(x_i)
21. Simple Kernel Methods
Dual form: f(x) = Σ_i α_i k(x_i, x), with k(x_i, x) = Φ(x_i) · Φ(x)
- Potential function algorithm: α_i ← α_i + y_i if y_i f(x_i) < 0 (Aizerman et al., 1964)
- Dual minover: α_i ← α_i + y_i for the example minimizing y_i f(x_i)
- Dual LMS: α_i ← α_i + η (y_i − f(x_i))
Primal form: f(x) = w · Φ(x), with w = Σ_i α_i Φ(x_i)
- Perceptron algorithm: w ← w + y_i Φ(x_i) if y_i f(x_i) < 0 (Rosenblatt, 1958)
- Minover (optimum margin): w ← w + y_i Φ(x_i) for the example minimizing y_i f(x_i) (Krauth-Mézard, 1987; ancestor of SVM, 1992, similar to kernel Adatron, 1998, and SMO, 1999)
- LMS regression: w ← w + η (y_i − f(x_i)) Φ(x_i)
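As an illustration, the first dual update can be sketched in Python/NumPy; this rendering uses y_i f(x_i) ≤ 0 (rather than the strict inequality above) so that updates can start from α = 0, and the epoch count is an arbitrary choice:

    import numpy as np

    def potential_function_train(X, y, k, epochs=10):
        # Dual perceptron update (Aizerman et al., 1964):
        # alpha_i <- alpha_i + y_i whenever example i is misclassified,
        # where f(x) = sum_j alpha_j k(x_j, x)
        m = len(X)
        alpha = np.zeros(m)
        for _ in range(epochs):
            for i in range(m):
                f_xi = sum(alpha[j] * k(X[j], X[i]) for j in range(m))
                if y[i] * f_xi <= 0:
                    alpha[i] += y[i]
        return alpha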
22. Multi-Layer Perceptron
(Back-propagation, Rumelhart et al., 1986)
23. Chessboard Problem
24. Tree Classifiers
- CART (Breiman, 1984) or C4.5 (Quinlan, 1993)
25. Iris Data (Fisher, 1936)
[Figure from Norbert Jankowski and Krzysztof Grabczewski: decision boundaries of four classifiers (linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM)) on the three iris classes: setosa, versicolor, virginica.]
26. Fit / Robustness Tradeoff
[Figure: decision boundaries in the (x1, x2) plane illustrating the fit vs. robustness tradeoff.]
27. Performance Evaluation
[Figure: two classifiers in the (x1, x2) plane; the decision boundary f(x) = 0 separates the region f(x) < 0 from the region f(x) > 0.]
28. Performance Evaluation
[Figure: the same two classifiers with the threshold shifted; the level set f(x) = −1 separates f(x) < −1 from f(x) > −1.]
29. Performance Evaluation
[Figure: the threshold shifted the other way; the level set f(x) = 1 separates f(x) < 1 from f(x) > 1.]
30. ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[Plot: positive class success rate (hit rate, sensitivity) on the vertical axis, from 0 to 100%, versus 1 − negative class success rate (false alarm rate, 1 − specificity) on the horizontal axis, from 0 to 100%; an ideal ROC curve, an actual ROC curve, and the random ROC diagonal are shown.]
31. ROC Curve
For a given threshold on f(x), you get a point on the ROC curve.
[Plot: same axes, annotated with the area under the curve: ideal ROC curve (AUC = 1), actual ROC, random ROC (AUC = 0.5); in general, 0 ≤ AUC ≤ 1.]
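Both quantities can be sketched in Python/NumPy as follows (assuming labels in {−1, +1}, scores f(x_i) in a NumPy array, higher scores meaning "more positive", and illustrative function names):

    import numpy as np

    def roc_points(scores, labels):
        # Each threshold on f(x) gives one (false alarm rate, hit rate) point
        pos, neg = labels == 1, labels == -1
        points = []
        for thr in sorted(scores, reverse=True):
            pred_pos = scores >= thr
            hit_rate = np.mean(pred_pos[pos])      # sensitivity
            false_alarm = np.mean(pred_pos[neg])   # 1 - specificity
            points.append((false_alarm, hit_rate))
        return points

    def auc(scores, labels):
        # AUC = fraction of (positive, negative) pairs ranked correctly
        # (ties counted as half correct)
        pos, neg = scores[labels == 1], scores[labels == -1]
        return np.mean([(p > n) + 0.5 * (p == n) for p in pos for n in neg])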
32. What is a Risk Functional?
- A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
- Examples:
- Classification:
- Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i)
- 1 − AUC
- Regression:
- Mean square error: (1/m) Σ_{i=1..m} (f(x_i) − y_i)²
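These two risks translate directly into code; a sketch in Python/NumPy (F and f are the trained decision and prediction functions; names are illustrative):

    import numpy as np

    def error_rate(F, X, y):
        # (1/m) sum_i 1(F(x_i) != y_i)
        return np.mean([F(xi) != yi for xi, yi in zip(X, y)])

    def mean_square_error(f, X, y):
        # (1/m) sum_i (f(x_i) - y_i)^2
        return np.mean([(f(xi) - yi) ** 2 for xi, yi in zip(X, y)])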
33. How to Train?
- Define a risk functional R[f(x, w)].
- Optimize it with respect to w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.); a minimal sketch of the first option follows.
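A sketch of batch gradient descent on the mean-square-error risk of a linear model f(x) = w · x; the learning rate and epoch count are illustrative choices, not from the lecture:

    import numpy as np

    def gradient_descent(X, y, lr=0.01, epochs=100):
        # Minimize R(w) = (1/m) sum_i (w . x_i - y_i)^2 with respect to w
        m, n = X.shape
        w = np.zeros(n)
        for _ in range(epochs):
            residuals = X @ w - y              # f(x_i) - y_i for all i
            w -= lr * (2.0 / m) * (X.T @ residuals)
        return w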
(to be continued in the next lecture)
34. Summary
- With linear threshold units ("neurons") we can build:
- Linear discriminants (including Naïve Bayes)
- Kernel methods
- Neural networks
- Decision trees
- The architectural hyper-parameters may include:
- The choice of basis functions φ (features)
- The kernel
- The number of units
- Learning means fitting:
- Parameters (weights)
- Hyper-parameters
- Be aware of the fit vs. robustness tradeoff.
35. Want to Learn More?
- Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/stork/DHS.html
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, and J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/tibs/ElemStatLearn/
- Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz
- Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book