Machine Learning: Transcript and Presenter's Notes

1
Machine Learning
  • Georg Pölzlbauer
  • December 11, 2006

2
Outline
  • Exercises
  • Data Preparation
  • Decision Trees
  • Model Selection
  • Random Forests
  • Support Vector Machines

3
Exercises
  • Groups of 2 or 3 students
  • UCI ML Repository: pick 3 data sets with different characteristics (i.e. number of samples, number of dimensions, number of classes)
  • Estimate the classification error with 3 classifiers of your choice and compare the results
  • Estimate appropriate parameters for these classifiers
  • Implement in Matlab, R, WEKA, YALE, or KNIME

4
Exercises: Software
  • Matlab
  • YALE: http://rapid-i.com/
  • WEKA: http://www.cs.waikato.ac.nz/ml/weka/
  • KNIME: http://www.knime.org/
  • R: http://www.r-project.org/

5
Exercises: Software
  • WEKA: recommended; easy to use, easy to learn, no programming required
  • KNIME, YALE: also easy to use
  • R: the most advanced and powerful software; do not use it if you do not know R really well!
  • Matlab: not recommended; requires installation of packages from the internet etc.

6
Exercises: Written Report
  • The report should be 5-10 pages
  • Discuss the characteristics of the data sets (e.g. handling of missing values, scaling, etc.)
  • Summarize the classifiers used (one paragraph each)
  • Discuss the experimental results (tables, figures)
  • Do not include code in the report

7
Exercises: How to proceed
  • It is not necessary to implement anything; rely on libraries, modules, etc.
  • UCI ML Repository: http://www.ics.uci.edu/mlearn/MLSummary.html
  • Import the data file, scale the data, apply model selection, and write down any problems/findings

8
Grading
  • No written/oral exam
  • End of January: submission of the report
  • Ca. 15 minutes of discussion of results and code (individually for each group)
  • Grading bonus: use of sophisticated models, detailed comparison of classifiers, thorough discussion of experiments, justification of choices

9
Questions?
  • Questions regarding theory:
  • poelzlbauer@ifs.tuwien.ac.at
  • musliu@dbai.tuwien.ac.at
  • Questions regarding R, WEKA, ...:
  • Forum

10
Machine Learning Setting

  gender   age  smoker  eye color | lung cancer
  male      19  yes     green     | no
  female    44  yes     gray      | yes
  male      49  yes     blue      | yes
  male      12  no      brown     | no
  female    37  no      brown     | no
  female    60  no      brown     | yes
  male      44  no      blue      | no
  female    27  yes     brown     | no
  female    51  yes     green     | yes
  female    81  yes     gray      | no
  male      22  yes     brown     | no
  male      29  no      blue      | no

  male      77  yes     gray      | ?
  male      19  yes     green     | ?
  female    44  no      gray      | ?
11
Machine Learning Setting
  The same table: an ML model is trained on the 12 labeled rows ("Train ML Model"); the three new rows remain unlabeled (?).
12
Machine Learning Setting
  The trained model then predicts the missing lung cancer labels for the three new rows:

  male      77  yes     gray      | yes
  male      19  yes     green     | no
  female    44  no      gray      | no
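
As an illustration only (not part of the original slides): a minimal Python sketch of this setting, assuming pandas and scikit-learn are available. The model is fit on the labeled rows and then predicts the missing labels; the column names follow the table above.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A few labeled rows from the table above
train = pd.DataFrame({
    "gender":      ["male", "female", "male", "male", "female", "female"],
    "age":         [19, 44, 49, 12, 37, 60],
    "smoker":      ["yes", "yes", "yes", "no", "no", "no"],
    "eye_color":   ["green", "gray", "blue", "brown", "brown", "brown"],
    "lung_cancer": ["no", "yes", "yes", "no", "no", "yes"],
})
# The new, unlabeled rows
new = pd.DataFrame({
    "gender":    ["male", "male", "female"],
    "age":       [77, 19, 44],
    "smoker":    ["yes", "yes", "no"],
    "eye_color": ["gray", "green", "gray"],
})

# Categorical features are encoded numerically (see 1-to-N coding below)
X = pd.get_dummies(train.drop(columns="lung_cancer"))
y = train["lung_cancer"]
X_new = pd.get_dummies(new).reindex(columns=X.columns, fill_value=0)

model = DecisionTreeClassifier().fit(X, y)   # "Train ML Model"
print(model.predict(X_new))                  # predicted labels for the new rows
```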
13
Data Preparation
  • → Example: adult census data
  • Table-format data (data matrix)
  • Missing values
  • Categorical data
  • Quantitative (continuous) data with different scales

14
Categorical variables
  • Non-numeric variables with a finite number of levels
  • E.g. "red", "blue", "green"
  • Some ML algorithms can only handle numeric variables
  • Solution: 1-to-N coding

15
1-to-N Coding
  feature | red  blue  green
  red     |  1    0     0
  blue    |  0    1     0
  green   |  0    0     1
  red     |  1    0     0
  red     |  1    0     0
  green   |  0    0     1
  blue    |  0    1     0
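
A minimal sketch of this coding in Python, assuming pandas is available (scikit-learn's OneHotEncoder does the same job); note that the indicator columns come out in alphabetical order.

```python
import pandas as pd

feature = pd.Series(["red", "blue", "green", "red", "red", "green", "blue"])
coded = pd.get_dummies(feature).astype(int)   # one 0/1 indicator column per level
print(coded)   # columns: blue, green, red; e.g. the first row ("red") is 0 0 1
```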
16
Scaling of continuous variables
  • Many ML algorithms rely on measuring the distance
    between 2 samples
  • There should be no difference if a length
    variable is measured in cm, inch, or km
  • To remove the unit of measure (e.g. kg, mph, ...), each variable dimension is normalized (see the sketch below):
  • subtract the mean
  • divide by the standard deviation
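
A minimal sketch of this normalization (z-scoring) in Python; scikit-learn's StandardScaler, used here as an example, implements exactly this subtract-mean, divide-by-standard-deviation step per column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 65.0],    # e.g. height in cm, weight in kg
              [180.0, 85.0],
              [160.0, 55.0]])

Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)   # subtract mean, divide by std (per column)
Z = StandardScaler().fit_transform(X)             # the same, via scikit-learn

print(np.allclose(Z, Z_manual))   # True: each column now has mean 0 and variance 1
```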

17
Scaling of continuous variables
  • The data set now has mean 0 and variance 1 in every dimension
  • Chebyshev's inequality:
  • at least 75% of the data lie between -2 and 2
  • at least 89% of the data lie between -3 and 3
  • at least 94% of the data lie between -4 and 4
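
These percentages are the bounds given by Chebyshev's inequality, which holds for any distribution with finite mean mu and standard deviation sigma:

```latex
P\bigl(|X - \mu| \ge k\sigma\bigr) \le \frac{1}{k^2}
\qquad\Longrightarrow\qquad
P\bigl(|X - \mu| < k\sigma\bigr) \ge 1 - \frac{1}{k^2}
```

With k = 2, 3, 4 this gives at least 1 - 1/4 = 75%, 1 - 1/9 ≈ 89%, and 1 - 1/16 ≈ 94% of the data within k standard deviations of the mean.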

18
Output variables
  • The ML methods discussed here require a categorical output (continuous output → regression)
  • They can still be applied by binning a continuous output (at a loss of prediction accuracy)

19
Binary Decision Trees
  • Rely on Information Theory (Shannon)
  • Recursive algorithm that splits feature space
    into 2 areas at each recursion step
  • Classification works by going through the tree
    from the root node until arriving at a leaf node

20
Decision Trees: Example
21
Information Theory, Entropy
  • Introduced by Claude Shannon
  • Applications in data compression
  • Concerned with measuring actual information vs.
    redundancy
  • Measures information in bits

22
What is Entropy?
  • In Machine Learning, Entropy is a measure of the impurity of a set
  • High Entropy ⇒ bad for prediction
  • High Entropy ⇒ needs to be reduced (Information Gain)

23
Calculating H(X)
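
For reference, the Shannon entropy of a set X with class proportions p(x) is defined as:

```latex
H(X) = -\sum_{x} p(x)\,\log_2 p(x)
```

For two classes with proportions p and 1 - p this becomes H = -p log2(p) - (1 - p) log2(1 - p), which is maximal (1 bit) at p = 0.5 and zero for a pure set, matching the case studies on the next slide.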
24
H(X): Case studies

  Case  H(X)  p(x=blue)  p(x=red)
  I     1     0.5        0.5
  II    0.88  0.7        0.3
  III   0.88  0.3        0.7
  IV    0     1          0
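
These values can be reproduced with a few lines of Python (an illustration, not from the slides):

```python
import math

def entropy(p):
    """Binary entropy in bits for class proportions p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p_blue in (0.5, 0.7, 0.3, 1.0):     # cases I-IV
    print(round(entropy(p_blue), 2))    # 1.0, 0.88, 0.88, 0.0
```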
25
H(X): Relative vs. absolute frequencies

       red  blue
  I     8    4
  II   18    9

  ⇒ H(X_I) = H(X_II)
  Only relative frequencies matter!
26
Information Gain
  • Given a set and a choice between possible
    subsets, which one is preferable?

Information Gain: prefer the split into subsets that reduces the Entropy by the largest amount.

  Original set: H(X) = 1

  Split 1     A (green)  B (yellow)
  Points      6          4
  p(X.)       0.6        0.4
  p(x=red)    0.33       0.75
  p(x=blue)   0.67       0.25
  H(X.)       0.92       0.81
  IG          0.12

  Split 2     A (green)  B (yellow)
  Points      9          1
  p(X.)       0.9        0.1
  p(x=red)    0.44       1
  p(x=blue)   0.56       0
  H(X.)       0.99       0
  IG          0.11

  Split 3     A (green)  B (yellow)
  Points      5          5
  p(X.)       0.5        0.5
  p(x=red)    0.2        0.8
  p(x=blue)   0.8        0.2
  H(X.)       0.72       0.72
  IG          0.28
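
A sketch of the computation behind these tables (the proportions above correspond to the class counts used below; e.g. 6 points with p(x=red) = 0.33 means 2 red and 4 blue):

```python
import math

def entropy(n_red, n_blue):
    """Entropy in bits of a set given its class counts."""
    h, n = 0.0, n_red + n_blue
    for c in (n_red, n_blue):
        if c:
            h -= c / n * math.log2(c / n)
    return h

def information_gain(parent, subsets):
    """parent and subsets are (n_red, n_blue) count pairs."""
    n_total = sum(sum(s) for s in subsets)
    return entropy(*parent) - sum(sum(s) / n_total * entropy(*s) for s in subsets)

# The three candidate splits above; the parent set has 5 red, 5 blue, H(X) = 1
print(round(information_gain((5, 5), [(2, 4), (3, 1)]), 2))   # 0.12
print(round(information_gain((5, 5), [(4, 5), (1, 0)]), 2))   # 0.11
print(round(information_gain((5, 5), [(1, 4), (4, 1)]), 2))   # 0.28 <- best split
```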
27
Information Gain (Properties)
  • IG is at most as large as the Entropy of the original set
  • IG is the amount by which the original Entropy can be reduced by splitting into subsets
  • IG is at least zero (zero if the Entropy is not reduced)
  • 0 ≤ IG ≤ H(X)

28
Building (binary) Decision Trees
  • Data set: categorical or quantitative variables
  • Iterate over the variables and calculate the IG for every possible split
  • categorical variables: one value vs. the rest
  • quantitative variables: sort the values and consider a split between each pair of adjacent values (see the sketch below)
  • recurse until the prediction is good enough
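
A self-contained sketch of the split search for a single quantitative variable, following the procedure above; the ages and labels are taken from the first rows of the table on slide 10.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split 'x <= t' of one quantitative variable, by information gain."""
    pairs = sorted(zip(values, labels))
    n, h_parent = len(pairs), entropy(labels)
    best_t, best_ig = None, 0.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # no split between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2     # candidate: midpoint of adjacent values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        ig = h_parent - (len(left) / n * entropy(left) + len(right) / n * entropy(right))
        if ig > best_ig:
            best_t, best_ig = t, ig
    return best_t, best_ig

ages   = [19, 44, 49, 12, 37, 60, 44, 27]           # first rows of the slide-10 table
cancer = ["no", "yes", "yes", "no", "no", "yes", "no", "no"]
print(best_threshold(ages, cancer))                 # (40.5, ~0.55) for these rows
```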

29
Decision Trees: Quantitative variables
  [Figure: candidate splits of the quantitative variables with their IG values (original H = 0.99); the split with the largest IG, 0.43, is chosen.]
30
Decision Trees: Quantitative variables
31
Decision Trees: Classification
(Slides 31-33: a sequence of figures.)
34
Decision Trees: More than 2 classes
35
Decision Trees: Non-binary trees
36
Decision Trees: Overfitting
  • Fully grown trees are usually too complicated

37
Decision Trees: Stopping Criteria
  • Stop when absolute number of samples is low
    (below a threshold)
  • Stop when Entropy is already relatively low
    (below a threshold)
  • Stop if IG is low
  • Stop if decision could be random (Chi-Square
    test)
  • Threshold values are hyperparameters

38
Decision Trees: Pruning
  • "Pruning" means removing nodes from a tree after training has finished
  • Stopping criteria are sometimes referred to as "pre-pruning"
  • Redundant nodes are removed, and sometimes the tree is restructured
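
In scikit-learn, used here only as an example implementation, the stopping thresholds from the previous slide and post-pruning appear as hyperparameters of the tree learner; the values below are arbitrary placeholders.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    criterion="entropy",           # split by entropy reduction (information gain)
    min_samples_leaf=5,            # stop: absolute number of samples too low
    min_impurity_decrease=0.01,    # stop: impurity reduction (IG) too low
    ccp_alpha=0.005,               # post-pruning via cost-complexity pruning
)
# tree.fit(X_train, y_train)       # X_train, y_train: placeholders for prepared data
```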

39
Example: Pruning
40
Decision Trees: Stability
(Slides 40-52: a sequence of figures on the stability of decision trees.)
53
Model Selection
  • General ML framework
  • Takes care of estimating hyperparameters
  • Takes care of selecting a good model (avoiding local minima)

54
Why is Generalization an Issue?
(Slides 54-58: a sequence of figures.)
59
Bayes Optimal Classifier
60
Training Set, Test Set
  • Solution:
  • Split the data into training and test sets
  • 80% training, 20% test
  • Train different models
  • Classify the test set
  • Pick the one model that has the lowest test set error (see the sketch below)
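
A sketch of this procedure with scikit-learn; X and y stand for the prepared data matrix and labels, and the two candidate classifiers are arbitrary examples.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X, y: prepared data matrix and class labels (placeholders)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = [DecisionTreeClassifier(min_samples_leaf=5), SVC(kernel="rbf")]
scores = [clf.fit(X_train, y_train).score(X_test, y_test) for clf in candidates]
best = candidates[scores.index(max(scores))]   # the model with the lowest test set error
```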

61
Trade-off: complexity vs. generalization
  [Figure: error vs. model complexity for the training set and the test set; the training set error keeps decreasing, while the test set error has a minimum that marks the preferred complexity.]
62
Estimation of Generalization Error
  • The test set is used in model selection and tuning of parameters
  • Thus, the test set error is not an accurate estimate of the generalization error
  • The generalization error is the expected error that the classifier will make on new, unseen data

63
Training-Test-Validation
  • Save part of the data set for validation
  • Split, e.g.:
  • 60% training set
  • 20% test set
  • 20% validation set
  • Train the classifiers on the training set
  • Select the classifier based on test set performance
  • Estimate the generalization error on the validation set

64
Crossvalidation
  • Split the data into 10 parts of equal size
  • This is called 10-fold crossvalidation
  • Repeat 10 times:
  • use 9 parts for training/tuning
  • calculate the performance on the remaining part (validation set)
  • The estimate of the generalization error is the average of the validation set errors (see the sketch below)
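
A sketch of 10-fold crossvalidation with scikit-learn; the classifier is just an example and X, y are placeholders for the prepared data.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)   # accuracy on each held-out part
print(1.0 - scores.mean())   # estimated generalization error: average over the 10 folds
```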

65
Bootstrapping
  • A bootstrap sample is a random sample drawn from the data set
  • The validation set is also a random sample
  • In the sampling process, data points may be selected repeatedly (sampling with replacement)
  • An arbitrary number of bootstrap samples may be used
  • Bootstrapping is more reliable than crossvalidation and training-test-validation (see the sketch below)
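
A sketch of one bootstrap round with NumPy; here the points that were never drawn (the "out-of-bag" points) serve as the random validation sample, and X, y, model are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = len(X)                                     # X, y: prepared data (placeholders)
boot = rng.choice(n, size=n, replace=True)     # bootstrap sample: indices drawn with replacement
oob = np.setdiff1d(np.arange(n), boot)         # points never drawn: used for validation

model.fit(X[boot], y[boot])                    # model: any classifier (placeholder)
error = 1.0 - model.score(X[oob], y[oob])      # one estimate; average over many bootstrap rounds
```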

66
Example: Bootstrapping
(Slides 66-67: figures.)
68
Random Forests
  • Combination of the decision tree and bootstrapping concepts
  • A large number of decision trees is trained, each on a different bootstrap sample
  • At each split, only a random subset of the original variables is available (i.e. a small selection of columns)
  • Data points are classified by majority voting of the individual trees

69
Example: Random Forests

  Data set (first column: class label I/II; remaining columns: feature values):

  I    red    501  0.1  A   12
  II   red    499  1.2  B    8
  II   blue   504  1.1  B    9
  II   green  480  1.8  A   15
  I    red    511  1.0  C    2
  II   green  512  0.7  C   -2
  I    cyan   488  0.4  C    7
  I    cyan   491  0.6  A    7
  I    cyan   500  1.5  A   10
  II   blue   505  0.3  C    0
  II   blue   502  1.9  B    9
70
Example: Random Forests

  Bootstrap sample (drawn with replacement from the data set above):

  I    red    501  0.1  A   12
  II   red    499  1.2  B    8
  II   green  480  1.8  A   15
  I    red    511  1.0  C    2
  I    cyan   488  0.4  C    7
  I    cyan   491  0.6  A    7
  II   blue   505  0.3  C    0
71
Example: Random Forests
(Slides 71-80 repeat the bootstrap sample shown above while a decision tree is grown from it.)
81
Classification with Random Forests
  Green = Class I, Blue = Class II; 3 votes for I, 2 votes for II ⇒ classify as I
82
Properties of Random Forests
  • Easy to use ("off-the-shelf"), only 2 parameters (number of trees, number of variables per split)
  • Very high accuracy
  • No overfitting when a large number of trees is selected (choose it high)
  • Insensitive to the choice of the number of variables per split (20)
  • Returns an estimate of variable importance (see the sketch below)
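
In scikit-learn's RandomForestClassifier (one possible implementation, used here as an example), the two parameters correspond to n_estimators and max_features, and the variable-importance estimate is available after fitting.

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,       # number of trees: can be chosen high without overfitting
    max_features="sqrt",    # number of variables considered at each split
)
# forest.fit(X_train, y_train)      # placeholders for prepared data
# forest.predict(X_new)             # majority vote of the individual trees
# forest.feature_importances_       # estimate of variable importance
```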

83
Support Vector Machines
  • Introduced by Vapnik
  • Rather sophisticated mathematical model
  • Based on 2 concepts:
  • Optimization (maximization of the margin)
  • Kernels (non-linear separation)

84
Linear separation
(Slides 84-87: a sequence of figures.)
88
Largest Margin
89
Largest Margin
  • Finding the optimal hyperplane can be expressed as an optimization problem (shown below)
  • Solved by quadratic programming
  • Soft margin: not necessarily 100% classification accuracy on the training set
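
In its standard soft-margin form, the optimization problem is the following quadratic program, where the slack variables allow some training points to violate the margin and C controls the trade-off:

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\qquad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
```

Maximizing the margin 2/||w|| is equivalent to minimizing ||w||^2.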

90
Non-linearly separable data
(Slides 90-92: a sequence of figures.)
93
Additional coordinate z = x²
(Slides 93-95: figures of the same data plotted with the additional coordinate z = x².)
96
Kernels
  • Projection of the data space into a higher-dimensional space
  • The data may be separable in this high-dimensional space
  • Projection: multiplication of the vectors with the kernel matrix
  • The kernel matrix determines the shape of the possible separators

97
Common Kernels
  • Quadratic Kernel
  • Radial Basis Kernel
  • General Polynomial Kernel (arbitrary degree)
  • Linear Kernel (no kernel)
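
For reference, the standard textbook forms of these kernels (the exact parameterization used on the slides is not shown) are:

```latex
\begin{aligned}
\text{linear:}              \quad & k(x, y) = x^\top y \\
\text{polynomial:}          \quad & k(x, y) = (x^\top y + c)^d \qquad (\text{quadratic: } d = 2) \\
\text{radial basis (RBF):}  \quad & k(x, y) = \exp\!\bigl(-\gamma \lVert x - y \rVert^2\bigr)
\end{aligned}
```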

98
Kernel Trick
  • Other ML algorithms could also work with the projected (high-dimensional) data, so why bother with SVM?
  • Working with high-dimensional data is problematic (complexity)
  • Kernel Trick: the optimization problem can be restated such that it uses only distances (inner products) in the high-dimensional space
  • This is computationally very inexpensive (see the sketch below)
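
With a kernelized SVM implementation such as scikit-learn's SVC (an example, not the software used in the course), the kernel is simply a parameter and the high-dimensional projection is never computed explicitly.

```python
from sklearn.svm import SVC

svm_linear = SVC(kernel="linear")              # linear kernel (no kernel)
svm_quad   = SVC(kernel="poly", degree=2)      # quadratic kernel
svm_rbf    = SVC(kernel="rbf", gamma="scale")  # radial basis kernel
# e.g. svm_rbf.fit(X_train, y_train).score(X_test, y_test)   # placeholders for prepared data
```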

99
Properties of SVM
  • High classification accuracy
  • Linear kernels: good for sparse, high-dimensional data
  • Much research has been directed at SVMs, VC-dimension etc. ⇒ solid theoretical background

100
The End
101
Additional topics
  • Confusion Matrix (weights)
  • Prototype-based methods (LVQ, ...)
  • k-NN