Title: Machine Learning
1. Machine Learning
- Georg Pölzlbauer
- December 11, 2006
2. Outline
- Exercises
- Data Preparation
- Decision Trees
- Model Selection
- Random Forests
- Support Vector Machines
3. Exercises
- Groups of 2 or 3 students
- Pick 3 data sets from the UCI ML Repository with different characteristics (i.e. number of samples, number of dimensions, number of classes)
- Estimate the classification error with 3 classifiers of your choice and compare the results
- Estimate appropriate parameters for these classifiers
- Implement in Matlab, R, WEKA, YALE, or KNIME
4. Exercises: Software
- Matlab
- YALE: http://rapid-i.com/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
- KNIME: http://www.knime.org/
- R: http://www.r-project.org/
5. Exercises: Software
- WEKA recommended: easy to use, easy to learn, no programming
- KNIME, YALE: also easy to use
- R: most advanced and powerful software; do not use it if you do not know R really well!
- Matlab: not recommended; requires installation of packages from the internet etc.
6. Exercises: Written Report
- Report should be 5-10 pages
- Discuss the characteristics of the data sets (e.g. handling of missing values, scaling etc.)
- Summarize the classifiers used (one paragraph each)
- Discuss the experimental results (tables, figures)
- Do not include code in the report
7. Exercises: How to Proceed
- It is not necessary to implement anything; rely on libraries, modules etc.
- UCI ML Repository: http://www.ics.uci.edu/mlearn/MLSummary.html
- Import the data file, scale the data, apply model selection, and write down any problems/findings
8. Grading
- No written/oral exam
- End of January: submission of the report
- Ca. 15 minutes discussion of results and code (individually for each group)
- Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices
9. Questions?
- Questions regarding theory:
- poelzlbauer_at_ifs.tuwien.ac.at
- musliu_at_dbai.tuwien.ac.at
- Questions regarding R, WEKA, etc.:
- Forum
10. Machine Learning Setting

gender  age  smoker  eye color  lung cancer
male     19  yes     green      no
female   44  yes     gray       yes
male     49  yes     blue       yes
male     12  no      brown      no
female   37  no      brown      no
female   60  no      brown      yes
male     44  no      blue       no
female   27  yes     brown      no
female   51  yes     green      yes
female   81  yes     gray       no
male     22  yes     brown      no
male     29  no      blue       no

New samples with unknown label:

male     77  yes     gray       ?
male     19  yes     green      ?
female   44  no      gray       ?
11. Machine Learning Setting
- The 12 labeled samples from the table above are used to train an ML model ("Train ML Model"); the three new samples remain unlabeled (?).
12. Machine Learning Setting
- The trained model predicts the missing labels for the three new samples:

male     77  yes     gray       yes
male     19  yes     green      no
female   44  no      gray       no
13. Data Preparation
- -> Example: adult census data
- Table format data (data matrix)
- Missing values
- Categorical data
- Quantitative (continuous) data with different scales
14. Categorical Variables
- Non-numeric variables with a finite number of levels
- E.g. "red", "blue", "green"
- Some ML algorithms can only handle numeric variables
- Solution: 1-to-N coding
15. 1-to-N Coding

feature   red  blue  green
red        1    0     0
blue       0    1     0
green      0    0     1
red        1    0     0
red        1    0     0
green      0    0     1
blue       0    1     0
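- A minimal sketch of 1-to-N coding in R, using model.matrix on a small made-up factor (the variable name `feature` and the values are only illustrative):

    # 1-to-N (one-hot) coding of a categorical variable in R.
    d <- data.frame(feature = factor(c("red", "blue", "green", "red",
                                       "red", "green", "blue")))

    # model.matrix with "- 1" drops the intercept, so every level
    # gets its own 0/1 indicator column.
    coded <- model.matrix(~ feature - 1, data = d)
    print(coded)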
16. Scaling of Continuous Variables
- Many ML algorithms rely on measuring the distance between 2 samples
- There should be no difference if a length variable is measured in cm, inch, or km
- To remove the unit of measure (e.g. kg, mph, ...), each variable dimension is normalized (see the sketch below):
- subtract the mean
- divide by the standard deviation
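- A short sketch of this normalization step in R; the numeric values are made up for illustration:

    # Standardize each column: subtract the mean, divide by the standard deviation.
    X <- cbind(height_cm = c(170, 182, 165, 190),
               weight_kg = c(65, 90, 58, 102))

    X_scaled <- scale(X)       # built-in: centers and scales column-wise
    colMeans(X_scaled)         # approximately 0
    apply(X_scaled, 2, sd)     # exactly 1

    # Equivalent by hand:
    X_manual <- sweep(sweep(X, 2, colMeans(X)), 2, apply(X, 2, sd), "/")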
17. Scaling of Continuous Variables
- The data set now has mean 0 and variance 1
- Chebyshev's inequality:
- at least 75% of the data between -2 and 2
- at least 89% of the data between -3 and 3
- at least 94% of the data between -4 and 4
18. Output Variables
- ML classifiers require categorical output (continuous output: regression)
- ML classification methods can still be applied to continuous output by binning it (loss of prediction accuracy); see the sketch below
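- A small sketch of binning a continuous output in R so that a classifier can be applied; the three-bin choice is arbitrary:

    # Turn a continuous target into a categorical one by binning.
    y <- c(2.3, 5.1, 7.8, 1.2, 9.4, 4.4, 6.0)

    # cut() assigns each value to one of 3 equally wide intervals;
    # the resulting factor can be used as a class label.
    y_binned <- cut(y, breaks = 3, labels = c("low", "medium", "high"))
    table(y_binned)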
19. Binary Decision Trees
- Rely on Information Theory (Shannon)
- Recursive algorithm that splits the feature space into 2 regions at each recursion step
- Classification works by going through the tree from the root node until arriving at a leaf node
20. Decision Trees: Example
21. Information Theory, Entropy
- Introduced by Claude Shannon
- Applications in data compression
- Concerned with measuring actual information vs. redundancy
- Measures information in bits
22. What is Entropy?
- In Machine Learning, entropy is a measure of the impurity of a set
- High entropy ⇒ bad for prediction
- High entropy ⇒ needs to be reduced (Information Gain)
23. Calculating H(X)
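- The slide only shows the formula; it is presumably the standard Shannon entropy over the class probabilities p(x_i):

    H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)

- For two classes this is H = -(p \log_2 p + (1-p) \log_2 (1-p)), which gives H = 1 for p = 0.5 and H ≈ 0.88 for p = 0.7, matching the case studies on the next slide.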
24. H(X): Case Studies

Case  p(x=blue)  p(x=red)  H(X)
I        0.5       0.5     1
II       0.7       0.3     0.88
III      0.3       0.7     0.88
IV       1         0       0
25. H(X): Relative vs. Absolute Frequencies

     red  blue
I      8     4
II    18     9

⇒ H(X_I) = H(X_II)
Only relative frequencies matter!
26. Information Gain
- Given a set and a choice between possible splits into subsets, which one is preferable?
- Information Gain: prefer the split whose subsets reduce the entropy by the largest amount

Original set: H(X) = 1

Split 1           A (green)  B (yellow)
Points               6          4
p(X_A), p(X_B)       0.6        0.4
p(x=red)             0.33       0.75
p(x=blue)            0.67       0.25
H(X_A), H(X_B)       0.92       0.81
IG                   0.12

Split 2           A (green)  B (yellow)
Points               9          1
p(X_A), p(X_B)       0.9        0.1
p(x=red)             0.44       1
p(x=blue)            0.56       0
H(X_A), H(X_B)       0.99       0
IG                   0.11

Split 3           A (green)  B (yellow)
Points               5          5
p(X_A), p(X_B)       0.5        0.5
p(x=red)             0.2        0.8
p(x=blue)            0.8        0.2
H(X_A), H(X_B)       0.72       0.72
IG                   0.28

⇒ Split 3 has the largest IG and is therefore preferred.
27. Information Gain (Properties)
- IG is at most as large as the entropy of the original set
- IG is the amount by which the original entropy can be reduced by splitting into subsets
- IG is at least zero (zero if the entropy is not reduced)
- 0 ≤ IG ≤ H(X) (a small computation sketch follows below)
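- A minimal sketch in R of how the entropy and IG values on the previous slides can be computed (the helper names are made up):

    # Entropy of a vector of class labels, in bits.
    entropy <- function(labels) {
      p <- table(labels) / length(labels)
      p <- p[p > 0]                     # 0 * log2(0) is taken to be 0
      -sum(p * log2(p))
    }

    # Information gain of splitting `labels` into the groups given by `groups`.
    info_gain <- function(labels, groups) {
      parts   <- split(labels, groups)
      h_after <- sum(sapply(parts, function(g)
                     length(g) / length(labels) * entropy(g)))
      entropy(labels) - h_after
    }

    # First split from the Information Gain slide: 5 red + 5 blue,
    # split into A = {2 red, 4 blue} and B = {3 red, 1 blue}.
    labels <- c("red", "red", "blue", "blue", "blue", "blue",
                "red", "red", "red", "blue")
    groups <- c(rep("A", 6), rep("B", 4))
    entropy(labels)              # 1 bit
    info_gain(labels, groups)    # approximately 0.12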
28. Building (Binary) Decision Trees
- Data set: categorical or quantitative variables
- Iterate over the variables and calculate the IG for every possible split:
- categorical variables: one level vs. the rest
- quantitative variables: sort the values, consider a split between each pair of adjacent values
- recurse until the prediction is good enough (a library example follows below)
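- In practice this recursion is rarely coded by hand; a hedged sketch of growing a classification tree with R's rpart package (the iris data set is only a stand-in for one of the UCI data sets):

    # Growing a classification tree with rpart (CART).
    # rpart uses the Gini index by default; split = "information"
    # selects the entropy-based criterion described above.
    library(rpart)

    tree <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))

    print(tree)                    # text form of the learned splits
    pred <- predict(tree, iris, type = "class")
    mean(pred == iris$Species)     # training accuracy (optimistic!)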
29. Decision Trees: Quantitative Variables
(Figure: the information gain of each candidate split point along a quantitative variable, with the original entropy H = 0.99; the split with the largest IG, here 0.43, would be chosen.)
30. Decision Trees: Quantitative Variables (figure)
31-33. Decision Trees: Classification (figures)
34. Decision Trees: More Than 2 Classes (figure)
35. Decision Trees: Non-Binary Trees (figure)
36. Decision Trees: Overfitting
- Fully grown trees are usually too complicated
37. Decision Trees: Stopping Criteria
- Stop when the absolute number of samples is low (below a threshold)
- Stop when the entropy is already relatively low (below a threshold)
- Stop if the IG is low
- Stop if the decision could be random (Chi-Square test)
- The threshold values are hyperparameters (see the sketch below)
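- These thresholds correspond roughly to the control parameters of tree libraries; a sketch with R's rpart.control (the mapping is only approximate, rpart does not implement exactly the criteria above):

    library(rpart)

    # minsplit  : do not attempt a split if a node has fewer samples than this
    # minbucket : minimum number of samples allowed in a leaf
    # cp        : a split must improve the fit by at least this amount ("IG is low")
    # maxdepth  : hard limit on the depth of the tree
    ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01, maxdepth = 10)

    tree <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)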
38. Decision Trees: Pruning
- "Pruning" means removing nodes from a tree after training has finished
- Stopping criteria are sometimes referred to as "pre-pruning"
- Redundant nodes are removed; sometimes the tree is remodeled
39. Example: Pruning (figure)
40-52. Decision Trees: Stability (sequence of figures illustrating how the learned tree changes when the training data changes slightly)
53. Model Selection
- General ML framework
- Takes care of estimating hyperparameters
- Takes care of selecting a good model (avoiding local minima)
54-58. Why is Generalization an Issue? (sequence of figures)
59. Bayes Optimal Classifier (figure)
60. Training Set, Test Set
- Solution:
- Split the data into training and test sets
- 80% training, 20% test
- Train different models
- Classify the test set
- Pick the model that has the smallest test set error (see the sketch below)
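- A minimal sketch of this procedure in R (80/20 split; the two candidate models are only examples):

    library(rpart)

    set.seed(1)
    n        <- nrow(iris)
    train_id <- sample(n, size = round(0.8 * n))    # 80% for training
    train    <- iris[train_id, ]
    test     <- iris[-train_id, ]

    # Train two candidate models of different complexity.
    m1 <- rpart(Species ~ ., data = train, method = "class",
                control = rpart.control(cp = 0.05))
    m2 <- rpart(Species ~ ., data = train, method = "class",
                control = rpart.control(cp = 0.001))

    # Pick the model with the smaller test set error.
    err <- function(m) mean(predict(m, test, type = "class") != test$Species)
    c(model1 = err(m1), model2 = err(m2))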
61. Trade-off: Complexity vs. Generalization
(Figure: training set error and test set error plotted against model complexity; the training set error keeps decreasing, while the test set error reaches a minimum and then rises again. The model at the minimum of the test set error is chosen.)
62. Estimation of Generalization Error
- The test set is used in model selection and in tuning the parameters
- Thus, the test set error is not an accurate estimate of the generalization error
- The generalization error is the expected error that the classifier will make on new, unseen data
63. Training-Test-Validation
- Save part of the data set for validation
- E.g. split:
- 60% training set
- 20% test set
- 20% validation set
- Train the classifiers on the training set
- Select a classifier based on test set performance
- Estimate the generalization error on the validation set (see the sketch below)
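- A sketch of the 60/20/20 split in R (again with iris and rpart as stand-ins):

    library(rpart)

    set.seed(2)
    n   <- nrow(iris)
    idx <- sample(n)                         # random permutation of the rows
    train <- iris[idx[1:round(0.6 * n)], ]
    test  <- iris[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
    valid <- iris[idx[(round(0.8 * n) + 1):n], ]

    err <- function(m, d) mean(predict(m, d, type = "class") != d$Species)

    # Model selection on the test set ...
    models <- list(rpart(Species ~ ., data = train, method = "class",
                         control = rpart.control(cp = 0.05)),
                   rpart(Species ~ ., data = train, method = "class",
                         control = rpart.control(cp = 0.001)))
    best   <- models[[which.min(sapply(models, err, d = test))]]

    # ... and the generalization error estimate on the untouched validation set.
    err(best, valid)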
64. Crossvalidation
- Split the data into 10 parts of equal size
- This is called 10-fold crossvalidation
- Repeat 10 times:
- use 9 parts for training/tuning
- calculate the performance on the remaining part (validation set)
- The estimate of the generalization error is the average of the validation set errors (see the sketch below)
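- A minimal sketch of 10-fold crossvalidation in R, with rpart as the example classifier:

    library(rpart)

    set.seed(3)
    k     <- 10
    folds <- sample(rep(1:k, length.out = nrow(iris)))  # assign each row to a fold

    errs <- sapply(1:k, function(i) {
      train <- iris[folds != i, ]
      valid <- iris[folds == i, ]
      m     <- rpart(Species ~ ., data = train, method = "class")
      mean(predict(m, valid, type = "class") != valid$Species)
    })

    mean(errs)   # estimate of the generalization error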
65. Bootstrapping
- A bootstrap sample is a random sample drawn from the data set
- The validation set is also a random sample
- In the sampling process, data points may be selected repeatedly (sampling with replacement)
- An arbitrary number of bootstrap samples may be used
- Bootstrapping is more reliable than crossvalidation or training-test-validation (see the sketch below)
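- A small sketch of bootstrap error estimation in R; using the points not drawn into the sample as the validation set (the "out-of-bag" points) is one common choice, assumed here:

    library(rpart)

    set.seed(4)
    B <- 100                    # number of bootstrap samples (arbitrary)
    n <- nrow(iris)

    errs <- sapply(1:B, function(b) {
      in_bag  <- sample(n, size = n, replace = TRUE)   # points may repeat
      out_bag <- setdiff(1:n, in_bag)                  # validation sample
      m <- rpart(Species ~ ., data = iris[in_bag, ], method = "class")
      mean(predict(m, iris[out_bag, ], type = "class") != iris$Species[out_bag])
    })

    mean(errs)   # bootstrap estimate of the generalization error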
66-67. Example: Bootstrapping (figures)
68. Random Forests
- Combination of the decision tree and bootstrapping concepts
- A large number of decision trees is trained, each on a different bootstrap sample
- At each split, only a random subset of the original variables is available (i.e. a small selection of columns)
- Data points are classified by majority voting of the individual trees (see the sketch below)
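- A hedged sketch of a random forest in R with the randomForest package; ntree and mtry are the two parameters mentioned on the "Properties of Random Forests" slide:

    library(randomForest)

    set.seed(5)
    rf <- randomForest(Species ~ ., data = iris,
                       ntree = 500,   # number of bootstrapped trees
                       mtry  = 2)     # variables tried at each split

    print(rf)                  # includes the out-of-bag error estimate
    importance(rf)             # variable importance estimate
    predict(rf, iris[1:3, ])   # majority vote of the 500 trees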
69. Example: Random Forests

Training data (first column: class label I / II; remaining columns: features):

I    red    501  0.1  A   12
II   red    499  1.2  B    8
II   blue   504  1.1  B    9
II   green  480  1.8  A   15
I    red    511  1.0  C    2
II   green  512  0.7  C   -2
I    cyan   488  0.4  C    7
I    cyan   491  0.6  A    7
I    cyan   500  1.5  A   10
II   blue   505  0.3  C    0
II   blue   502  1.9  B    9
70. Example: Random Forests (bootstrap sample)

I    red    501  0.1  A   12
II   red    499  1.2  B    8
II   green  480  1.8  A   15
I    red    511  1.0  C    2
I    cyan   488  0.4  C    7
I    cyan   491  0.6  A    7
II   blue   505  0.3  C    0
71-80. Example: Random Forests (sequence of figures; each slide repeats the bootstrap sample table from slide 70)
81. Classification with Random Forests
- Green = Class I, Blue = Class II; 3 votes for I, 2 votes for II ⇒ classify as I
82. Properties of Random Forests
- Easy to use ("off-the-shelf"), only 2 parameters (number of trees, number of variables per split)
- Very high accuracy
- No overfitting if a large number of trees is selected (choose a high number)
- Insensitive to the choice of the number of variables per split (20%)
- Returns an estimate of variable importance
83. Support Vector Machines
- Introduced by Vapnik
- Rather sophisticated mathematical model
- Based on 2 concepts:
- Optimization (maximization of the margin)
- Kernels (non-linear separation)
84-87. Linear Separation (figures)
88. Largest Margin (figure)
89. Largest Margin
- Finding the optimal hyperplane can be expressed as an optimization problem
- Solved by quadratic programming
- Soft margin: not necessarily 100% classification accuracy on the training set (the standard formulation is sketched below)
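- The slide does not give the formula; the usual textbook statement of the soft-margin problem (with penalty parameter C) is:

    \min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

- The margin is 2/||w||, so minimizing ||w|| maximizes the margin; the slack variables ξ_i allow some training points to violate it.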
90-92. Non-Linearly Separable Data (figures)
93-95. Additional Coordinate z = x² (figures; axes x, y, and z = x²)
96. Kernels
- Projection of the data space into a higher-dimensional space
- The data may be separable in this high-dimensional space
- Projection: multiplication of vectors with a kernel matrix
- The kernel matrix determines the shape of the possible separators
97. Common Kernels
- Quadratic kernel
- Radial basis kernel
- General polynomial kernel (arbitrary degree)
- Linear kernel (no kernel)
- Typical formulas for these kernels are sketched below.
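- The slide only names the kernels; their usual textbook forms are:

    k_{\text{linear}}(x, y) = x \cdot y
    k_{\text{poly}}(x, y) = (x \cdot y + 1)^d \qquad (d = 2 \text{ gives the quadratic kernel})
    k_{\text{RBF}}(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)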
98. Kernel Trick
- Other ML algorithms could also work with the projected (high-dimensional) data, so why bother with SVMs?
- Working with high-dimensional data is problematic (complexity)
- Kernel trick: the optimization problem can be restated such that it only uses distances (inner products) in the high-dimensional space, which the kernel function computes directly
- This is computationally very inexpensive (a usage sketch follows below)
99. Properties of SVM
- High classification accuracy
- Linear kernels: good for sparse, high-dimensional data
- Much research has been directed at SVMs, VC-dimension etc. ⇒ solid theoretical background
100. The End
101. Additional Topics
- Confusion matrix (weights)
- Prototype-based methods (LVQ, ...)
- k-NN