Title: Stream Data Classification, Lecture 1
1. Stream Data Classification (Lecture 1)
2. Presentation Overview
- Classification
- Stream Data
- Data Selection
- Ensemble
- Our Approach
- Results
3. Classification
4. An Example
Classification
- (from Pattern Classification by Duda, Hart, and Stork, Second Edition, 2001)
- A fish-packing plant wants to automate the process of sorting incoming fish according to species
- As a pilot project, it is decided to try to separate sea bass from salmon using optical sensing
5. An Example (continued)
Classification
- Features (to distinguish)
  - Length
  - Lightness
  - Width
  - Position of mouth
6. An Example (continued)
Classification
- Preprocessing: images of different fishes are isolated from one another and from the background
- Feature extraction: the information of a single fish is then sent to a feature extractor, which measures certain features or properties
- Classification: the values of these features are passed to a classifier that evaluates the evidence presented and builds a model to discriminate between the two species
7. An Example (continued)
Classification
- Domain knowledge
  - A sea bass is generally longer than a salmon
- Related feature (or attribute)
  - Length
- Training the classifier
  - Examples are provided to the classifier in the form <fish_length, fish_name>
  - These examples are called training examples
  - The classifier learns from the training examples how to distinguish salmon from sea bass based on fish_length
8. An Example (continued)
Classification
- Classification model (hypothesis)
  - The classifier generates a model from the training data to classify future examples (test examples)
  - An example of such a model is a rule like this:
    - If length > l then sea bass, otherwise salmon
  - Here the value of l is determined by the classifier
- Testing the model (see the sketch below)
  - Once we get a model out of the classifier, we may use it to test future examples
  - Test data is provided in the form <fish_length>
  - The classifier outputs <fish_type> by checking fish_length against the model
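To make this train/test loop concrete, here is a minimal sketch in Python. The lecture does not say how the classifier determines the cutoff l; the midpoint between the two class means below is just one simple assumption, and the `train`/`classify` names are ours.

```python
def train(examples):
    """examples: list of (fish_length, fish_name) pairs.
    Picks the cutoff l as the midpoint between the mean salmon
    length and the mean sea bass length (one simple choice)."""
    salmon = [ln for ln, name in examples if name == "salmon"]
    bass = [ln for ln, name in examples if name == "sea bass"]
    return (sum(salmon) / len(salmon) + sum(bass) / len(bass)) / 2

def classify(l, fish_length):
    """Apply the learned rule: if length > l then sea bass else salmon."""
    return "sea bass" if fish_length > l else "salmon"

# Training examples in the form <fish_length, fish_name>
training_data = [(12, "salmon"), (15, "sea bass"), (8, "salmon"), (5, "sea bass")]
l = train(training_data)
print(classify(l, 18))  # a test <fish_length> in, a <fish_type> out
```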
9. An Example (continued)
Classification
- The overall classification process goes like this:
  - Training: training data → preprocessing and feature extraction → feature vector → training → model
  - Testing: test/unlabeled data → preprocessing and feature extraction → feature vector → testing against model / classification → prediction/evaluation
10. An Example (continued)
Classification
- [Flow diagram]
- Training: labeled data → pre-processing, feature extraction → feature vector (training data): (12, salmon), (15, sea bass), (8, salmon), (5, sea bass) → training → model: if len > 12, then sea bass else salmon
- Testing: unlabeled data → pre-processing, feature extraction → feature vector (test data): (15, salmon), (10, salmon), (18, ?), (8, ?) → test/classify → evaluation/prediction: sea bass (error!), salmon (correct), sea bass, salmon
11. An Example (continued)
Classification
- Why the error?
  - Insufficient training data
  - Too few features
  - Too many / irrelevant features
  - Overfitting / specialization
12. An Example (continued)
Classification
- [Figure]
13. An Example (continued)
Classification
- New feature
  - Average lightness of the fish scales
14. An Example (continued)
Classification
- [Figure]
15. An Example (continued)
Classification
- Model: if ltns > 6 or len*5 + ltns*2 > 100 then sea bass else salmon
- Training: pre-processing, feature extraction → feature vector (training data): (12, 4, salmon), (15, 8, sea bass), (8, 2, salmon), (5, 10, sea bass) → training → model
- Testing: pre-processing, feature extraction → feature vector (test data): (15, 2, salmon), (10, 7, salmon), (18, 7, ?), (8, 5, ?) → test/classify → evaluation/prediction: salmon (correct), salmon (correct), sea bass, salmon
16. Terms
Classification
- Accuracy
  - % of test data correctly classified
  - In our first example, accuracy was 3 out of 4 = 75%
  - In our second example, accuracy was 4 out of 4 = 100%
- False positive
  - A negative-class example incorrectly classified as positive
  - Usually, the larger class is the negative class
  - Suppose (as in the sketch below)
    - salmon is the negative class
    - sea bass is the positive class
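As a sketch of how these terms translate into code, with salmon as the negative class and sea bass as the positive class as assumed above (the function and variable names are ours):

```python
def evaluate(predicted, actual, positive="sea bass"):
    """Return (accuracy, false positives, false negatives)."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive
             for p, a in zip(predicted, actual))  # negatives called positive
    fn = sum(p != positive and a == positive
             for p, a in zip(predicted, actual))  # positives called negative
    return correct / len(actual), fp, fn

# The two labeled test examples from the first example:
acc, fp, fn = evaluate(["sea bass", "salmon"], ["salmon", "salmon"])
print(acc, fp, fn)  # 0.5 accuracy, 1 false positive, 0 false negatives
```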
17. Terms
Classification
- [Figure: regions of false positives and false negatives]
18. Terms
Classification
- Cross validation (3-fold), sketched in code below: split the data into three parts; each fold uses one part for testing and the other two for training
  - Fold 1: test on part 1, train on parts 2 and 3
  - Fold 2: test on part 2, train on parts 1 and 3
  - Fold 3: test on part 3, train on parts 1 and 2
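A minimal sketch of the 3-fold scheme above; `train_fn` and `test_fn` are placeholders for any learner and accuracy measure, not anything prescribed by the lecture.

```python
def three_fold_cv(data, train_fn, test_fn, k=3):
    """Split data into k parts; each part is the test fold once
    while the remaining parts form the training set."""
    folds = [data[i::k] for i in range(k)]  # simple round-robin split
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        accuracies.append(test_fn(model, test))
    return sum(accuracies) / k  # average accuracy over the folds
```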
19. Stream Data
20. Problem Description
Stream Data
- Suppose we have a continuous flow of data
  - For example, a network server is always receiving some data
- We would like to detect intrusions / attacks in the data
- Classification problem
  - Is the incoming data to the server attack or normal?
- How do we solve this classification problem?
21. Problem Formulation
Stream Data
- Distinguish normal traffic from attack traffic
- Identify important features from domain knowledge
- Extract features from the data
- Prepare training data
- Train a classifier
- Classify future data
22. Problem Formulation (cont.)
Stream Data
- Problem I
  - How much data should be used for training?
  - Train with the first t hours of data only?
  - What if no attack appears during the first t hours?
  - What if the first t hours of data were only attacks?
- [Timeline: data arrives over time 0 → t → now; training on the first t hours, testing on everything after t]
23. An Example
Stream Data
- [Figure]
24. Problem Formulation (cont.)
Stream Data
- Possible solution
  - Use all data up to now for training
- Problem II
  - Can't store unlimited data
  - Can't train a classifier with a large volume of data
- Possible solution
  - Choose only a subset of the data for training
25. Problem Formulation (cont.)
Stream Data
- Problem II
  - Can't store unlimited data
  - Can't train a classifier with a large volume of data
- Possible solution (see the sketch below)
  - Divide the data stream into chunks (e.g., 1 hour of data each)
  - Selectively add new data chunks to the training set (how?)
- [Diagram: chunk1 | chunk2 | chunk3 | ...]
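A sketch of the chunking idea, assuming fixed-size chunks and a most-recent-chunks policy as a stand-in for the "how?" above; the actual selection strategy is exactly what the following slides address.

```python
from collections import deque
from itertools import count

def chunks(stream, chunk_size):
    """Yield successive fixed-size chunks from a (possibly unbounded) stream."""
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []

# Keep only the 3 most recent chunks as the training set; older
# chunks fall out automatically, so memory stays bounded.
training_set = deque(maxlen=3)
for i, c in enumerate(chunks(count(), 3600)):  # count() stands in for the stream
    training_set.append(c)
    if i == 5:  # stop the demo after a few chunks
        break
```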
26. Problem Formulation (cont.)
Stream Data
- Problem III: concept drift
  - The concept (i.e., the characteristics of the classes) may change over time
  - For example, the characteristics (length, lightness) of salmon and sea bass may change after a thousand/million years
  - Thus, old training data becomes outdated and should be discarded
  - Solution: selectively discard old training data (how?)
27. Systematic Data Selection
- Source: Fan, W. Systematic data selection to mine concept-drifting data streams. In Proc. KDD '04.
28. Data Selection Problem
Systematic Data Selection
- In the presence of concept drift, which data should be used to train the classifier?
- Use all data? Discard the oldest? Select at random?
29. Data Selection Problem
Systematic Data Selection
- Concept drift
  - S_i is the data received at time stamp i
  - FO_i(x) is the optimal model for S_i
  - Let FO_{i-1}(x) be the optimal model at time stamp i-1
  - We say that there is concept drift from time stamp i-1 to time stamp i if there exists some x such that FO_i(x) ≠ FO_{i-1}(x)
- Data sufficiency
  - Training data is sufficient if adding more data to the training set does not improve classification accuracy
30. Will Old Data Help?
Systematic Data Selection
- Underlying model does not change (no concept drift)
  - Old data will help if the recent data is insufficient
  - Overfitting does not occur
- Underlying model does change
  - Let SP = S_1 ∪ ... ∪ S_{i-1}
  - The data in SP can fall into any of three categories:
    1. FO_i(x) ≠ FO_{i-1}(x) (disagree)
    2. FO_i(x) = FO_{i-1}(x) = y (agree and correct)
    3. FO_i(x) = FO_{i-1}(x) ≠ y (agree but wrong)
31. Will Old Data Help? (cont.)
Systematic Data Selection
- [Venn diagram of the three categories of old data:]
  1. FO_i(x) ≠ FO_{i-1}(x) (disagree)
  2. FO_i(x) = FO_{i-1}(x) = y (agree and correct)
  3. FO_i(x) = FO_{i-1}(x) ≠ y (agree but wrong)
32. Scenario-I
Systematic Data Selection
- New data is sufficient by itself and there is no concept drift
  - Optimal model: the one trained with new data only
  - The optimal model may also be the old model, if that data was sufficient
- Problem: we may never know whether the data is sufficient, or whether there is concept drift
- What if we
  - Train a new model from the new data
  - Train a new model from the combined new and old data
  - Compare these with the original old model
33. Scenario-II
Systematic Data Selection
- New data is sufficient by itself and there is concept drift
  - Optimal model: the one trained with new data only
- Problem: we may never know whether the data is sufficient, or whether there is concept drift
34. Scenario-III
Systematic Data Selection
- New data is insufficient by itself and there is no concept drift
  - Optimal model: if the previous data is sufficient, then the existing model
  - Optimal model: if the previous data is not sufficient, then
    - Train a new model from the new data plus the existing data
    - Choose the one with higher accuracy
35. Scenario-IV
Systematic Data Selection
- New data is insufficient by itself and there is concept drift
  - Optimal model: not obtainable from the new data only
  - Choose only those examples from previous data chunks whose concept is consistent with the new data chunk
  - And combine those examples with the new data
36. Computing Optimal Model
Systematic Data Selection
- The optimal model is different under different situations
  - The choice depends on whether the data is sufficient and whether there is concept drift
- Solution
  - Compare a few plausible models statistically
  - Choose the one with the highest accuracy
- Notation
  - FN(x): a new model trained from recent data
  - FO(x): the optimal model finally chosen
37. Computing Optimal Model (cont.)
Systematic Data Selection
- 1. Train a model FN_i(x) from the new data chunk
- 2. Let D_{i-1} be the dataset that trained the most recent optimal model FO_{i-1}(x)
  - D_{i-1} may not be the most recent data chunk S_{i-1}
  - How D_{i-1} is obtained will be discussed shortly
  - Select the examples from D_{i-1} on which both the model FN_i(x) and the model FO_{i-1}(x) make the correct prediction
  - Call these examples s_{i-1}
  - That is, s_{i-1} = { (x, y) ∈ D_{i-1} : FN_i(x) = FO_{i-1}(x) = y }
38. Computing Optimal Model (cont.)
Systematic Data Selection
- 3. Train a model FN_i+(x) from the new data chunk plus the data selected in the last step, i.e., from S_i ∪ s_{i-1}
- 4. Update the most recent model FO_{i-1}(x) with S_i and call this model FO_{i-1}+(x), i.e., FO_{i-1}+(x) is trained from D_{i-1} ∪ S_i
- 5. Compare the accuracies of all four models FO_{i-1}(x), FO_{i-1}+(x), FN_i(x), FN_i+(x)
  - Using cross-validation
  - Choose the one that is the most accurate
  - Call it FO_i(x)
39. Computing Optimal Model (cont.)
Systematic Data Selection
- 6. D_i is the training set that produced FO_i(x). It is one of the following (see the sketch below):
  - S_i
  - D_{i-1}
  - S_i ∪ s_{i-1}
  - D_{i-1} ∪ S_i
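Putting steps 1-6 together, a hedged sketch of the selection procedure. Models are treated as callables; `train` and `accuracy` stand in for the decision-tree learner and the cross-validated accuracy estimate used in the paper, and none of this is the paper's actual code.

```python
def select_optimal_model(S_i, D_prev, FO_prev, train, accuracy):
    """S_i: new chunk as a list of (x, y); D_prev: data that trained FO_prev.
    Returns (FO_i, D_i): the chosen model and its training set."""
    FN = train(S_i)                                  # step 1: FN_i
    s_prev = [(x, y) for (x, y) in D_prev            # step 2: s_{i-1}
              if FN(x) == FO_prev(x) == y]
    FN_plus = train(S_i + s_prev)                    # step 3: FN_i+
    FO_plus = train(D_prev + S_i)                    # step 4: FO_{i-1}+
    candidates = [                                   # step 5: the four models
        (FO_prev, D_prev),
        (FO_plus, D_prev + S_i),
        (FN, S_i),
        (FN_plus, S_i + s_prev),
    ]
    # accuracy(model) is assumed to perform the cross-validation of
    # step 5; the winner's training set becomes D_i (step 6).
    return max(candidates, key=lambda c: accuracy(c[0]))
```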
40. Scenarios, Revisited
Systematic Data Selection
- 1. New data is sufficient by itself and there is no concept change
  - Conceptually, FN_i(x) should be the optimal model
  - However, FN_i+(x), FO_{i-1}(x), and FO_{i-1}+(x) could be close matches, since there is no concept change
- 2. New data is sufficient by itself and there is concept change
  - Obviously, FN_i(x) should be the optimal model
  - However, FN_i+(x) could be very similar in performance to FN_i(x)
41. Scenarios, Revisited (continued)
Systematic Data Selection
- 3. New data is insufficient by itself and there is no concept change
  - The optimal model should be either FO_{i-1}(x) or FO_{i-1}+(x)
- 4. New data is insufficient by itself and there is concept change
  - The optimal model should be either FN_i(x) or FN_i+(x)
42. Data Set
Systematic Data Selection
- Synthetic data
  - Each data point is a d-dimensional vector (x_1, ..., x_d), where each x_i ∈ [0, 1]
  - Concept drift is achieved by a moving hyperplane
  - Equation of the hyperplane: ∑_{i=1}^{d} a_i x_i = a_0
  - The weights a_i are changed at a certain rate
43. Data Set (continued)
Systematic Data Selection
- Synthetic data (continued)
- Parameters (see the generator sketch below)
  - d: dimension = 10
  - t: rate of change of the weights
  - Each weight is changed with the formula a_i = a_i + s_i * t / N
  - N = 1000
  - k: how many dimensions to change (varied from 20% to 50%)
  - s_i: direction of change (randomly changed)
  - p: noise, set to 5%
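A sketch of a generator matching these parameters. The labeling rule (positive iff ∑ a_i x_i ≥ a_0 with a_0 = ½ ∑ a_i) is the common moving-hyperplane convention; since the slide's equation image was lost, treat it, along with the concrete value of k and the per-chunk weight update, as assumptions.

```python
import random

d, N, t = 10, 1000, 0.1   # dimension, chunk size, weight-change rate
k, p = 3, 0.05            # drifting dimensions (assumed), 5% noise
a = [random.random() for _ in range(d)]          # hyperplane weights
s = [random.choice([-1, 1]) for _ in range(d)]   # direction of change

def make_chunk():
    """Generate one chunk, then drift the first k weights by s_i * t / N."""
    chunk = []
    for _ in range(N):
        x = [random.random() for _ in range(d)]  # each x_i in [0, 1]
        y = int(sum(ai * xi for ai, xi in zip(a, x)) >= sum(a) / 2)
        if random.random() < p:                  # inject p% class noise
            y = 1 - y
        chunk.append((x, y))
    for i in range(k):                           # a_i = a_i + s_i * t / N
        a[i] += s[i] * t / N
        if random.random() < 0.1:                # occasionally flip direction
            s[i] = -s[i]
    return chunk
```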
44. Data Set (continued)
Systematic Data Selection
- Credit card fraud data (real)
  - Sampled from credit card transaction records within a one-year period
  - Contains 5 million transactions in total
  - Features
    - Time
    - Merchant type, location
    - Past payments
    - Summary of transaction history, etc.
45. Experiments
Systematic Data Selection
- Comparison with other methods
  - G1: decision tree trained from the new data chunk only
  - GA: decision tree trained from all data
  - Gi: single decision tree trained from the most recent i data chunks
  - Ei: decision tree ensemble trained from the most recent i data chunks, each tree from one chunk
46. Results
Systematic Data Selection
- [Results figure]
47. Criticism
Systematic Data Selection
- Quote: "will the training data Di become unnecessarily large? The answer is no. Di only grows in size (or includes older data) if and only if the additional data helps improve accuracy."
- Although it is claimed that the training data will not grow large, there is no guarantee that it will not exceed memory/system limitations
- Can we do better?
  - Store models rather than data
48. Conclusion
Systematic Data Selection
- Concept drift is a major problem in stream data mining
- Systematic selection of data works better than random data selection
- However, there is no guarantee that the data will not grow beyond an acceptable limit
49. Ensemble Methods for Stream Data Classification