Title: Pre-processing: an example

1. Pre-processing: an example
- the iris data set
  - a commonly used data set in machine learning and statistics
  - the task: classifying an iris flower into one of three species based on four attributes
- attributes
  - sepal length (cm)
  - sepal width (cm)
  - petal length (cm)
  - petal width (cm)
  - all attributes are continuous
- species (or target class)
  - setosa
  - versicolour
  - virginica
- note
  - the sepal is one of the small, green, leaf-like outer parts of a flower
  - the petal is one of the brightly coloured outer parts of the flower
- 150 examples (50 in each class)
2. Pre-processing: an example
- 150 examples (50 in each class)
- what does the raw data look like?

  example  sepal length  sepal width  petal length  petal width  class (species)
  -------  ------------  -----------  ------------  -----------  ---------------
      1        6.3           3.4          5.6           2.4      virginica
      2        5.7           2.6          3.5           1.0      versicolour
    ...        ...           ...          ...           ...      ...
    149        5.0           3.4          1.5           0.2      setosa
    150        5.7           2.8          4.1           1.3      versicolour
3. Pre-processing: an example
- looking at the distribution of the data
  - for example, the distribution of sepal length [plot not shown]
4. Pre-processing: an example
- linear scaling of input data
- e.g. for sepal length
  - raw data range: 4.3 cm (min) to 7.9 cm (max)
  - scale to 0 (min) and 1 (max)
    - i.e. 4.3 becomes 0 and 7.9 becomes 1
  - scaled_value = (raw_value - minimum_raw_value) / (maximum_raw_value - minimum_raw_value)
  - or: scaled_value = (raw_value - minimum_raw_value) / range_of_raw_values
  - for example 1, where sepal length = 6.3:
    - scaled_value = (6.3 - 4.3) / (7.9 - 4.3) = 2 / 3.6 ≈ 0.56
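The linear scaling above can be sketched in a few lines; the min/max values are the iris sepal-length range quoted on the slide:

```python
def min_max_scale(raw_value, min_raw, max_raw):
    """Linearly scale raw_value from [min_raw, max_raw] onto [0, 1]."""
    return (raw_value - min_raw) / (max_raw - min_raw)

# sepal length ranges from 4.3 cm to 7.9 cm in the iris data
print(round(min_max_scale(6.3, 4.3, 7.9), 2))  # example 1 from the slide: 0.56
```

In practice the min and max would be computed from the training data only, so that test examples are scaled with the same constants.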
5. Pre-processing: an example
- pre-processing the target
  - three target classes (setosa, versicolour and virginica)
  - the numeric target values will depend on the activation function used by the output layer neurons (assume a logistic function)
  - using one output unit
    - e.g. 0.1 represents setosa
    - 0.5 represents versicolour
    - 0.9 represents virginica
  - or using three output units
    - 0.1 0.1 0.9 for setosa
    - 0.1 0.9 0.1 for versicolour
    - 0.9 0.1 0.1 for virginica
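The three-output-unit encoding above uses 0.1/0.9 rather than hard 0/1 targets, since a logistic unit only approaches 0 and 1 asymptotically. It can be sketched as a simple lookup; the output-unit ordering below is chosen to match the slide's vectors:

```python
# soft one-hot targets: 0.9 marks the class, 0.1 elsewhere
CLASSES = ["virginica", "versicolour", "setosa"]  # output-unit order from the slide

def encode_target(species):
    return [0.9 if c == species else 0.1 for c in CLASSES]

print(encode_target("setosa"))  # [0.1, 0.1, 0.9], as on the slide
```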
6. Pre-processing: an example
- so the pre-processed examples might look like this (using linear scaling and three output units for the target)
  - <etc.>
- 4 input units
- 3 output units
7. Data pre-processing (cont'd)
- circular/periodic data
  - values are repeated periodically
  - examples
    - days of the week
    - months of the year
    - seasons
- consider the representation of the seasons of the year
  - we could use a single input
    - e.g. 0 for summer
    - 0.3 for autumn
    - 0.6 for winter
    - 1 for spring
  - however, the cyclic nature of the data (e.g. that spring and summer are close together) is not preserved in this representation
- or ...
  - we could use two inputs to represent the season
    - e.g. 1 1 for summer
    - 1 0 for autumn
    - 0 1 for winter
    - and 0 0 for spring
8. Data pre-processing (cont'd)
- circular/periodic data
- or ...
  - we could use four inputs to represent the season
    - e.g. 1 0 0 0 for summer
    - 0 1 0 0 for autumn
    - 0 0 1 0 for winter
    - and 0 0 0 1 for spring
  - you could 'fuzzify' the inputs, i.e. encode the degree to which the time of the year is seen to be in each of the seasons
    - the time of the year could be represented using four inputs, each representing the seasonal degree of that time
    - e.g. March could be represented as 0.36 0.64 0 0 (0.36 in summer, 0.64 in autumn, 0 in winter and 0 in spring)
    - April as 0 1 0 0 (0 in summer, 1 in autumn, 0 in winter and 0 in spring)
    - and May as 0 0.36 0.64 0 (0 in summer, 0.36 in autumn, 0.64 in winter and 0 in spring)
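A fuzzified encoding like the one above can be sketched with a triangular membership function over the 12-month cycle. The season centre months (mid-season, southern-hemisphere seasons) and the triangular shape are assumptions, so the exact degrees differ slightly from the 0.36/0.64 values on the slide:

```python
# triangular fuzzy membership of a month in each season (assumed scheme):
# each season peaks at its centre month and falls to 0 over 3 months
SEASON_CENTRES = {"summer": 1, "autumn": 4, "winter": 7, "spring": 10}  # Jan, Apr, Jul, Oct

def season_degrees(month):
    """Return [summer, autumn, winter, spring] membership degrees for a month (1-12)."""
    degrees = []
    for centre in SEASON_CENTRES.values():
        # circular distance on the 12-month cycle
        dist = min(abs(month - centre), 12 - abs(month - centre))
        degrees.append(max(0.0, 1.0 - dist / 3.0))
    return degrees

print(season_degrees(4))  # April: [0.0, 1.0, 0.0, 0.0], wholly autumn
print(season_degrees(3))  # March: partly summer, mostly autumn
```

Unlike the crisp one-of-four encoding, nearby months now get nearby input vectors, which preserves the cyclic structure.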
9. Data pre-processing (cont'd)
- missing data
  - i.e. some examples in the data set have attributes for which the value is missing
  - what can you do?
- options
  - 1. Don't use these examples. This is not recommended unless there are only a few examples with missing data with respect to the total number of available examples (< 10%).
  - 2. Substitute a value for the missing value; some possibilities are
    - use the maximum
    - use the minimum
    - use the median or mode (the most common value across the data set for that attribute)
    - use a typical value for that output class
    - determine the 'closest' example and use its value for the attribute
  - 3. Accept that the value is missing and use an extra input to indicate that it is missing (applicable for discrete inputs)
    - e.g. gender, where
      - 1 0 0 represents male (or 1 0)
      - 0 1 0 represents female (or 0 1)
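Option 2 (substitute a value) and option 3 (an explicit missing indicator) can be sketched as follows; the `None` sentinel and the gender example's third "missing" pattern are assumptions for illustration:

```python
# option 2: replace a missing continuous attribute with the median
def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = sorted(v for v in values if v is not None)
    mid = len(observed) // 2
    median = (observed[mid] if len(observed) % 2
              else (observed[mid - 1] + observed[mid]) / 2)
    return [median if v is None else v for v in values]

# option 3: extra input unit flags a missing discrete value (assumed encoding)
def encode_gender(value):
    return {"male": [1, 0, 0], "female": [0, 1, 0], None: [0, 0, 1]}[value]

print(impute_median([5.1, None, 4.9, 6.0]))  # [5.1, 5.1, 4.9, 6.0]
print(encode_gender(None))                   # [0, 0, 1]
```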
10. Preparing the training and testing sets: an overview
[Diagram: the data set, after pre-processing decisions, is split into training data and testing data; the training data is used for training a multi-layer perceptron (training set performance), while the testing data measures testing performance (generalisation).]
11. Preparing the training and testing sets
- training and testing sets
  - training set: used to train the network (i.e. to adjust the weights)
  - testing set: used to test the generalisation capabilities of the network during training; it is not used to adjust weights but simply gives an indication of performance
- allocation of examples to each of these sets?
  - data is randomly allocated to the training set and the testing set
  - however, the distribution of the data should be preserved in both
    - the examples for each of the output (target) classes should be allocated proportionally in each set
    - e.g. if 70% of the total data set are examples from class A, then 70% of the training set should be from class A and 70% of the testing set should be from class A
  - there are no strict rules for the number of examples allocated to each; possibilities (training/testing):
    - 60%/40% or
    - 70%/30%
  - remember the number of training examples should exceed the number of testing examples
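A stratified 70/30 split along the lines described above can be sketched as follows; the random seed and the data layout are assumptions:

```python
import random

def stratified_split(examples, labels, train_fraction=0.7, seed=0):
    """Split examples so each class keeps the same proportion in both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [x for x, y in zip(examples, labels) if y == cls]
        rng.shuffle(members)  # random allocation within each class
        cut = int(len(members) * train_fraction)
        train += [(x, cls) for x in members[:cut]]
        test += [(x, cls) for x in members[cut:]]
    return train, test

# e.g. 150 iris examples, 50 per class: a 70/30 split keeps 35/15 per class
data = list(range(150))
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
train, test = stratified_split(data, labels)
print(len(train), len(test))  # 105 45
```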
12. Preparing the training and testing sets
- cross-validation
  - this is a more reliable measure of generalisation than creating single training and testing sets
  - for example, 10-fold cross-validation
[Diagram: the data set is divided into 10 equal subsets, Set 1 to Set 10. In each fold, one subset provides the testing examples and the remaining nine provide the training examples; e.g. fold 1 trains on Sets 1 and 3-10 and tests on Set 2, fold 2 holds out a different subset, etc.]
13. Preparing the training and testing sets
- cross-validation
  - this is a more reliable measure of generalisation than creating single training and testing sets
  - for example, 10-fold cross-validation
- comments
  - ensures that all examples are at some time used for training and testing
  - each example is used only once for testing
  - minimises bias in the data sets
  - results in many more experiments, i.e. it is time-consuming
  - cross-validation can give a performance estimate on unseen examples for a network trained on the entire data set
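The 10-fold procedure can be sketched as follows; training and scoring the network itself is not shown, only the generation of the fold index sets:

```python
def k_fold_indices(n_examples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for fold in range(k):
        # one subset is held out for testing ...
        test = indices[fold * fold_size:(fold + 1) * fold_size]
        # ... and the remaining k-1 subsets are used for training
        train = indices[:fold * fold_size] + indices[(fold + 1) * fold_size:]
        yield train, test

# each of the 150 iris examples is used exactly once for testing
folds = list(k_fold_indices(150, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 135 15
```

In each fold the network would be trained on the 135 training examples and scored on the 15 held-out examples, and the 10 scores averaged to estimate generalisation performance.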