Title: Machine Learning
1. Machine Learning
- Georg Pölzlbauer
- December 11, 2006
2. Outline
- Exercises
- Data Preparation
- Decision Trees
- Model Selection
- Random Forests
- Support Vector Machines
3. Exercises
- Groups of 2 or 3 students
- Pick 3 data sets from the UCI ML Repository with different characteristics (i.e. number of samples, number of dimensions, number of classes)
- Estimate the classification error with 3 classifiers of your choice and compare the results
- Estimate appropriate parameters for these classifiers
- Implement in Matlab, R, WEKA, YALE, or KNIME
4. Exercises: Software
- Matlab
- YALE: http://rapid-i.com/
- WEKA: http://www.cs.waikato.ac.nz/ml/weka/
- KNIME: http://www.knime.org/
- R: http://www.r-project.org/
5. Exercises: Software
- WEKA recommended: easy to use, easy to learn, no programming
- KNIME, YALE: also easy to use
- R: most advanced and powerful software; do not use it if you do not know R really well!
- Matlab: not recommended; requires installation of packages from the internet etc.
6. Exercises: Written Report
- Report should be 5-10 pages
- Discuss the characteristics of the data sets (e.g. handling of missing values, scaling etc.)
- Summarize the classifiers used (one paragraph each)
- Discuss the experimental results (tables, figures)
- Do not include code in the report
7. Exercises: How to Proceed
- It is not necessary to implement anything; rely on libraries, modules etc.
- UCI ML Repository: http://www.ics.uci.edu/mlearn/MLSummary.html
- Import the data file, scale the data, apply model selection, and write down any problems/findings
8. Grading
- No written/oral exam
- End of January: submission of the report
- Ca. 15 minutes discussion of results and code (individually for each group)
- Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices
9. Questions?
- Questions regarding theory:
- poelzlbauer_at_ifs.tuwien.ac.at
- musliu_at_dbai.tuwien.ac.at
- Questions regarding R, WEKA, etc.:
- Forum
10. Machine Learning Setting

gender  age  smoker  eye color  lung cancer
male     19  yes     green      no
female   44  yes     gray       yes
male     49  yes     blue       yes
male     12  no      brown      no
female   37  no      brown      no
female   60  no      brown      yes
male     44  no      blue       no
female   27  yes     brown      no
female   51  yes     green      yes
female   81  yes     gray       no
male     22  yes     brown      no
male     29  no      blue       no

New samples with unknown label:

male     77  yes     gray       ?
male     19  yes     green      ?
female   44  no      gray       ?
11. Machine Learning Setting
- The 12 labeled samples from the table above are used to train an ML model ("Train ML Model"); the three new samples remain unlabeled (?).
12. Machine Learning Setting
- The trained model predicts the missing labels for the three new samples:

male     77  yes     gray       yes
male     19  yes     green      no
female   44  no      gray       no
13. Data Preparation
- -> Example: adult census data
- Table format data (data matrix)
- Missing values
- Categorical data
- Quantitative (continuous) data with different scales
14. Categorical Variables
- Non-numeric variables with a finite number of levels
- E.g. "red", "blue", "green"
- Some ML algorithms can only handle numeric variables
- Solution: 1-to-N coding
15. 1-to-N Coding

feature   red  blue  green
red        1    0     0
blue       0    1     0
green      0    0     1
red        1    0     0
red        1    0     0
green      0    0     1
blue       0    1     0
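- A minimal sketch of 1-to-N coding in R, using model.matrix on a small made-up factor (the variable name `feature` and the values are only illustrative):

    # 1-to-N (one-hot) coding of a categorical variable in R.
    d <- data.frame(feature = factor(c("red", "blue", "green", "red",
                                       "red", "green", "blue")))

    # model.matrix with "- 1" drops the intercept, so every level
    # gets its own 0/1 indicator column.
    coded <- model.matrix(~ feature - 1, data = d)
    print(coded)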
16. Scaling of Continuous Variables
- Many ML algorithms rely on measuring the distance between 2 samples
- There should be no difference if a length variable is measured in cm, inch, or km
- To remove the unit of measure (e.g. kg, mph, ...), each variable dimension is normalized (see the sketch below):
- subtract the mean
- divide by the standard deviation
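- A short sketch of this normalization step in R; the numeric values are made up for illustration:

    # Standardize each column: subtract the mean, divide by the standard deviation.
    X <- cbind(height_cm = c(170, 182, 165, 190),
               weight_kg = c(65, 90, 58, 102))

    X_scaled <- scale(X)       # built-in: centers and scales column-wise
    colMeans(X_scaled)         # approximately 0
    apply(X_scaled, 2, sd)     # exactly 1

    # Equivalent by hand:
    X_manual <- sweep(sweep(X, 2, colMeans(X)), 2, apply(X, 2, sd), "/")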
17. Scaling of Continuous Variables
- The data set now has mean 0 and variance 1
- Chebyshev's inequality:
- at least 75% of the data between -2 and 2
- at least 89% of the data between -3 and 3
- at least 94% of the data between -4 and 4
18. Output Variables
- ML classifiers require categorical output (continuous output: regression)
- ML classification methods can still be applied to continuous output by binning it (loss of prediction accuracy); see the sketch below
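- A small sketch of binning a continuous output in R so that a classifier can be applied; the three-bin choice is arbitrary:

    # Turn a continuous target into a categorical one by binning.
    y <- c(2.3, 5.1, 7.8, 1.2, 9.4, 4.4, 6.0)

    # cut() assigns each value to one of 3 equally wide intervals;
    # the resulting factor can be used as a class label.
    y_binned <- cut(y, breaks = 3, labels = c("low", "medium", "high"))
    table(y_binned)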
19. Binary Decision Trees
- Rely on Information Theory (Shannon)
- Recursive algorithm that splits the feature space into 2 regions at each recursion step
- Classification works by going through the tree from the root node until arriving at a leaf node
20. Decision Trees: Example
21. Information Theory, Entropy
- Introduced by Claude Shannon
- Applications in data compression
- Concerned with measuring actual information vs. redundancy
- Measures information in bits
22. What is Entropy?
- In Machine Learning, entropy is a measure of the impurity of a set
- High entropy ⇒ bad for prediction
- High entropy ⇒ needs to be reduced (Information Gain)
23. Calculating H(X)
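- The slide only shows the formula; it is presumably the standard Shannon entropy over the class probabilities p(x_i):

    H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)

- For two classes this is H = -(p \log_2 p + (1-p) \log_2 (1-p)), which gives H = 1 for p = 0.5 and H ≈ 0.88 for p = 0.7, matching the case studies on the next slide.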
24. H(X): Case Studies

Case  p(x=blue)  p(x=red)  H(X)
I        0.5       0.5     1
II       0.7       0.3     0.88
III      0.3       0.7     0.88
IV       1         0       0
25. H(X): Relative vs. Absolute Frequencies

     red  blue
I      8     4
II    18     9

⇒ H(X_I) = H(X_II)
Only relative frequencies matter!
26. Information Gain
- Given a set and a choice between possible splits into subsets, which one is preferable?
- Information Gain: prefer the split whose subsets reduce the entropy by the largest amount

Original set: H(X) = 1

Split 1           A (green)  B (yellow)
Points               6          4
p(X_A), p(X_B)       0.6        0.4
p(x=red)             0.33       0.75
p(x=blue)            0.67       0.25
H(X_A), H(X_B)       0.92       0.81
IG                   0.12

Split 2           A (green)  B (yellow)
Points               9          1
p(X_A), p(X_B)       0.9        0.1
p(x=red)             0.44       1
p(x=blue)            0.56       0
H(X_A), H(X_B)       0.99       0
IG                   0.11

Split 3           A (green)  B (yellow)
Points               5          5
p(X_A), p(X_B)       0.5        0.5
p(x=red)             0.2        0.8
p(x=blue)            0.8        0.2
H(X_A), H(X_B)       0.72       0.72
IG                   0.28

⇒ Split 3 has the largest IG and is therefore preferred.
27. Information Gain (Properties)
- IG is at most as large as the entropy of the original set
- IG is the amount by which the original entropy can be reduced by splitting into subsets
- IG is at least zero (zero if the entropy is not reduced)
- 0 ≤ IG ≤ H(X) (a small computation sketch follows below)
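- A minimal sketch in R of how the entropy and IG values on the previous slides can be computed (the helper names are made up):

    # Entropy of a vector of class labels, in bits.
    entropy <- function(labels) {
      p <- table(labels) / length(labels)
      p <- p[p > 0]                     # 0 * log2(0) is taken to be 0
      -sum(p * log2(p))
    }

    # Information gain of splitting `labels` into the groups given by `groups`.
    info_gain <- function(labels, groups) {
      parts   <- split(labels, groups)
      h_after <- sum(sapply(parts, function(g)
                     length(g) / length(labels) * entropy(g)))
      entropy(labels) - h_after
    }

    # First split from the Information Gain slide: 5 red + 5 blue,
    # split into A = {2 red, 4 blue} and B = {3 red, 1 blue}.
    labels <- c("red", "red", "blue", "blue", "blue", "blue",
                "red", "red", "red", "blue")
    groups <- c(rep("A", 6), rep("B", 4))
    entropy(labels)              # 1 bit
    info_gain(labels, groups)    # approximately 0.12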
28. Building (Binary) Decision Trees
- Data set: categorical or quantitative variables
- Iterate over the variables and calculate the IG for every possible split:
- categorical variables: one level vs. the rest
- quantitative variables: sort the values, consider a split between each pair of adjacent values
- recurse until the prediction is good enough (a library example follows below)
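- In practice this recursion is rarely coded by hand; a hedged sketch of growing a classification tree with R's rpart package (the iris data set is only a stand-in for one of the UCI data sets):

    # Growing a classification tree with rpart (CART).
    # rpart uses the Gini index by default; split = "information"
    # selects the entropy-based criterion described above.
    library(rpart)

    tree <- rpart(Species ~ ., data = iris, method = "class",
                  parms = list(split = "information"))

    print(tree)                    # text form of the learned splits
    pred <- predict(tree, iris, type = "class")
    mean(pred == iris$Species)     # training accuracy (optimistic!)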
29. Decision Trees: Quantitative Variables
(Figure: the information gain of each candidate split point along a quantitative variable, with the original entropy H = 0.99; the split with the largest IG, here 0.43, would be chosen.)
30. Decision Trees: Quantitative Variables (figure)
31-33. Decision Trees: Classification (figures)
34. Decision Trees: More Than 2 Classes (figure)
35. Decision Trees: Non-Binary Trees (figure)
36. Decision Trees: Overfitting
- Fully grown trees are usually too complicated
37. Decision Trees: Stopping Criteria
- Stop when the absolute number of samples is low (below a threshold)
- Stop when the entropy is already relatively low (below a threshold)
- Stop if the IG is low
- Stop if the decision could be random (Chi-Square test)
- The threshold values are hyperparameters (see the sketch below)
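- These thresholds correspond roughly to the control parameters of tree libraries; a sketch with R's rpart.control (the mapping is only approximate, rpart does not implement exactly the criteria above):

    library(rpart)

    # minsplit  : do not attempt a split if a node has fewer samples than this
    # minbucket : minimum number of samples allowed in a leaf
    # cp        : a split must improve the fit by at least this amount ("IG is low")
    # maxdepth  : hard limit on the depth of the tree
    ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01, maxdepth = 10)

    tree <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)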
38. Decision Trees: Pruning
- "Pruning" means removing nodes from a tree after training has finished
- Stopping criteria are sometimes referred to as "pre-pruning"
- Redundant nodes are removed; sometimes the tree is remodeled
39. Example: Pruning (figure)
40-52. Decision Trees: Stability (sequence of figures illustrating how the learned tree changes when the training data changes slightly)
53. Model Selection
- General ML framework
- Takes care of estimating hyperparameters
- Takes care of selecting a good model (avoiding local minima)
54-58. Why is Generalization an Issue? (sequence of figures)
59. Bayes Optimal Classifier (figure)
60. Training Set, Test Set
- Solution:
- Split the data into training and test sets
- 80% training, 20% test
- Train different models
- Classify the test set
- Pick the model that has the smallest test set error (see the sketch below)
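- A minimal sketch of this procedure in R (80/20 split; the two candidate models are only examples):

    library(rpart)

    set.seed(1)
    n        <- nrow(iris)
    train_id <- sample(n, size = round(0.8 * n))    # 80% for training
    train    <- iris[train_id, ]
    test     <- iris[-train_id, ]

    # Train two candidate models of different complexity.
    m1 <- rpart(Species ~ ., data = train, method = "class",
                control = rpart.control(cp = 0.05))
    m2 <- rpart(Species ~ ., data = train, method = "class",
                control = rpart.control(cp = 0.001))

    # Pick the model with the smaller test set error.
    err <- function(m) mean(predict(m, test, type = "class") != test$Species)
    c(model1 = err(m1), model2 = err(m2))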
61. Trade-off: Complexity vs. Generalization
(Figure: training set error and test set error plotted against model complexity; the training set error keeps decreasing, while the test set error reaches a minimum and then rises again. The model at the minimum of the test set error is chosen.)
62. Estimation of Generalization Error
- The test set is used in model selection and in tuning the parameters
- Thus, the test set error is not an accurate estimate of the generalization error
- The generalization error is the expected error that the classifier will make on new, unseen data
63. Training-Test-Validation
- Save part of the data set for validation
- E.g. split:
- 60% training set
- 20% test set
- 20% validation set
- Train the classifiers on the training set
- Select a classifier based on test set performance
- Estimate the generalization error on the validation set (see the sketch below)
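- A sketch of the 60/20/20 split in R (again with iris and rpart as stand-ins):

    library(rpart)

    set.seed(2)
    n   <- nrow(iris)
    idx <- sample(n)                         # random permutation of the rows
    train <- iris[idx[1:round(0.6 * n)], ]
    test  <- iris[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
    valid <- iris[idx[(round(0.8 * n) + 1):n], ]

    err <- function(m, d) mean(predict(m, d, type = "class") != d$Species)

    # Model selection on the test set ...
    models <- list(rpart(Species ~ ., data = train, method = "class",
                         control = rpart.control(cp = 0.05)),
                   rpart(Species ~ ., data = train, method = "class",
                         control = rpart.control(cp = 0.001)))
    best   <- models[[which.min(sapply(models, err, d = test))]]

    # ... and the generalization error estimate on the untouched validation set.
    err(best, valid)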
64. Crossvalidation
- Split the data into 10 parts of equal size
- This is called 10-fold crossvalidation
- Repeat 10 times:
- use 9 parts for training/tuning
- calculate the performance on the remaining part (validation set)
- The estimate of the generalization error is the average of the validation set errors (see the sketch below)
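- A minimal sketch of 10-fold crossvalidation in R, with rpart as the example classifier:

    library(rpart)

    set.seed(3)
    k     <- 10
    folds <- sample(rep(1:k, length.out = nrow(iris)))  # assign each row to a fold

    errs <- sapply(1:k, function(i) {
      train <- iris[folds != i, ]
      valid <- iris[folds == i, ]
      m     <- rpart(Species ~ ., data = train, method = "class")
      mean(predict(m, valid, type = "class") != valid$Species)
    })

    mean(errs)   # estimate of the generalization error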
65. Bootstrapping
- A bootstrap sample is a random sample drawn from the data set
- The validation set is also a random sample
- In the sampling process, data points may be selected repeatedly (sampling with replacement)
- An arbitrary number of bootstrap samples may be used
- Bootstrapping is more reliable than crossvalidation or training-test-validation (see the sketch below)
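- A small sketch of bootstrap error estimation in R; using the points not drawn into the sample as the validation set (the "out-of-bag" points) is one common choice, assumed here:

    library(rpart)

    set.seed(4)
    B <- 100                    # number of bootstrap samples (arbitrary)
    n <- nrow(iris)

    errs <- sapply(1:B, function(b) {
      in_bag  <- sample(n, size = n, replace = TRUE)   # points may repeat
      out_bag <- setdiff(1:n, in_bag)                  # validation sample
      m <- rpart(Species ~ ., data = iris[in_bag, ], method = "class")
      mean(predict(m, iris[out_bag, ], type = "class") != iris$Species[out_bag])
    })

    mean(errs)   # bootstrap estimate of the generalization error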
66-67. Example: Bootstrapping (figures)
68. Random Forests
- Combination of the decision tree and bootstrapping concepts
- A large number of decision trees is trained, each on a different bootstrap sample
- At each split, only a random subset of the original variables is available (i.e. a small selection of columns)
- Data points are classified by majority voting of the individual trees (see the sketch below)
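- A hedged sketch of a random forest in R with the randomForest package; ntree and mtry are the two parameters mentioned on the "Properties of Random Forests" slide:

    library(randomForest)

    set.seed(5)
    rf <- randomForest(Species ~ ., data = iris,
                       ntree = 500,   # number of bootstrapped trees
                       mtry  = 2)     # variables tried at each split

    print(rf)                  # includes the out-of-bag error estimate
    importance(rf)             # variable importance estimate
    predict(rf, iris[1:3, ])   # majority vote of the 500 trees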
69. Example: Random Forests

Training data (first column: class label I / II; remaining columns: features):

I    red    501  0.1  A   12
II   red    499  1.2  B    8
II   blue   504  1.1  B    9
II   green  480  1.8  A   15
I    red    511  1.0  C    2
II   green  512  0.7  C   -2
I    cyan   488  0.4  C    7
I    cyan   491  0.6  A    7
I    cyan   500  1.5  A   10
II   blue   505  0.3  C    0
II   blue   502  1.9  B    9
70. Example: Random Forests (bootstrap sample)

I    red    501  0.1  A   12
II   red    499  1.2  B    8
II   green  480  1.8  A   15
I    red    511  1.0  C    2
I    cyan   488  0.4  C    7
I    cyan   491  0.6  A    7
II   blue   505  0.3  C    0
71-80. Example: Random Forests (sequence of figures; each slide repeats the bootstrap sample table from slide 70)
81. Classification with Random Forests
- Green = Class I, Blue = Class II; 3 votes for I, 2 votes for II ⇒ classify as I
82. Properties of Random Forests
- Easy to use ("off-the-shelf"), only 2 parameters (number of trees, number of variables per split)
- Very high accuracy
- No overfitting if a large number of trees is selected (choose a high number)
- Insensitive to the choice of the number of variables per split (20%)
- Returns an estimate of variable importance
83. Support Vector Machines
- Introduced by Vapnik
- Rather sophisticated mathematical model
- Based on 2 concepts:
- Optimization (maximization of the margin)
- Kernels (non-linear separation)
84-87. Linear Separation (figures)
88. Largest Margin (figure)
89. Largest Margin
- Finding the optimal hyperplane can be expressed as an optimization problem
- Solved by quadratic programming
- Soft margin: not necessarily 100% classification accuracy on the training set (the standard formulation is sketched below)
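- The slide does not give the formula; the usual textbook statement of the soft-margin problem (with penalty parameter C) is:

    \min_{w, b, \xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

- The margin is 2/||w||, so minimizing ||w|| maximizes the margin; the slack variables ξ_i allow some training points to violate it.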
90-92. Non-Linearly Separable Data (figures)
93-95. Additional Coordinate z = x² (figures; axes x, y, and z = x²)
96. Kernels
- Projection of the data space into a higher-dimensional space
- The data may be separable in this high-dimensional space
- Projection: multiplication of vectors with a kernel matrix
- The kernel matrix determines the shape of the possible separators
97. Common Kernels
- Quadratic kernel
- Radial basis kernel
- General polynomial kernel (arbitrary degree)
- Linear kernel (no kernel)
- Typical formulas for these kernels are sketched below.
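- The slide only names the kernels; their usual textbook forms are:

    k_{\text{linear}}(x, y) = x \cdot y
    k_{\text{poly}}(x, y) = (x \cdot y + 1)^d \qquad (d = 2 \text{ gives the quadratic kernel})
    k_{\text{RBF}}(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)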
98. Kernel Trick
- Other ML algorithms could also work with the projected (high-dimensional) data, so why bother with SVMs?
- Working with high-dimensional data is problematic (complexity)
- Kernel trick: the optimization problem can be restated such that it only uses distances (inner products) in the high-dimensional space, which the kernel function computes directly
- This is computationally very inexpensive (a usage sketch follows below)
99. Properties of SVM
- High classification accuracy
- Linear kernels: good for sparse, high-dimensional data
- Much research has been directed at SVMs, VC-dimension etc. ⇒ solid theoretical background
100. The End
101. Additional Topics
- Confusion matrix (weights)
- Prototype-based methods (LVQ, ...)
- k-NN