Title: Stream Data Classification, Lecture 1
1. Stream Data Classification (Lecture 1)
2. Presentation Overview
- Classification
- Stream Data
- Data Selection
- Ensemble
- Our Approach
- Results
3. Classification
4. An Example
Classification
- (from Pattern Classification by Duda, Hart, and Stork, Second Edition, 2001)
- A fish-packing plant wants to automate the process of sorting incoming fish according to species
- As a pilot project, it is decided to try to separate sea bass from salmon using optical sensing
5. An Example (continued)
Classification
- Features (to distinguish)
  - Length
  - Lightness
  - Width
  - Position of mouth
6. An Example (continued)
Classification
- Preprocessing: images of different fishes are isolated from one another and from the background
- Feature extraction: the information of a single fish is then sent to a feature extractor, which measures certain features or properties
- Classification: the values of these features are passed to a classifier that evaluates the evidence presented and builds a model to discriminate between the two species
7. An Example (continued)
Classification
- Domain knowledge
  - A sea bass is generally longer than a salmon
- Related feature (or attribute)
  - Length
- Training the classifier
  - Examples are provided to the classifier in the form <fish_length, fish_name>
  - These examples are called training examples
  - The classifier learns from the training examples how to distinguish salmon from sea bass based on fish_length
8. An Example (continued)
Classification
- Classification model (hypothesis)
  - The classifier generates a model from the training data to classify future examples (test examples)
  - An example of such a model is a rule like this:
    - If length > l then sea bass, otherwise salmon
  - Here the value of l is determined by the classifier
- Testing the model (see the sketch below)
  - Once we get a model out of the classifier, we may use it to test future examples
  - Test data is provided in the form <fish_length>
  - The classifier outputs <fish_type> by checking fish_length against the model
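To make this train/test loop concrete, here is a minimal sketch in Python. The lecture does not say how the classifier determines the cutoff l; the midpoint between the two class means below is just one simple assumption, and the `train`/`classify` names are ours.

```python
def train(examples):
    """examples: list of (fish_length, fish_name) pairs.
    Picks the cutoff l as the midpoint between the mean salmon
    length and the mean sea bass length (one simple choice)."""
    salmon = [ln for ln, name in examples if name == "salmon"]
    bass = [ln for ln, name in examples if name == "sea bass"]
    return (sum(salmon) / len(salmon) + sum(bass) / len(bass)) / 2

def classify(l, fish_length):
    """Apply the learned rule: if length > l then sea bass else salmon."""
    return "sea bass" if fish_length > l else "salmon"

# Training examples in the form <fish_length, fish_name>
training_data = [(12, "salmon"), (15, "sea bass"), (8, "salmon"), (5, "sea bass")]
l = train(training_data)
print(classify(l, 18))  # a test <fish_length> in, a <fish_type> out
```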
9. An Example (continued)
Classification
- The overall classification process goes like this:
  - Training: training data → preprocessing and feature extraction → feature vector → training → model
  - Testing: test/unlabeled data → preprocessing and feature extraction → feature vector → testing against model / classification → prediction/evaluation
10. An Example (continued)
Classification
- [Flow diagram]
- Training: labeled data → pre-processing, feature extraction → feature vector (training data): (12, salmon), (15, sea bass), (8, salmon), (5, sea bass) → training → model: if len > 12, then sea bass else salmon
- Testing: unlabeled data → pre-processing, feature extraction → feature vector (test data): (15, salmon), (10, salmon), (18, ?), (8, ?) → test/classify → evaluation/prediction: sea bass (error!), salmon (correct), sea bass, salmon
11. An Example (continued)
Classification
- Why the error?
  - Insufficient training data
  - Too few features
  - Too many / irrelevant features
  - Overfitting / specialization
12. An Example (continued)
Classification
- [Figure]
13. An Example (continued)
Classification
- New feature
  - Average lightness of the fish scales
14. An Example (continued)
Classification
- [Figure]
15. An Example (continued)
Classification
- Model: if ltns > 6 or len*5 + ltns*2 > 100 then sea bass else salmon
- Training: pre-processing, feature extraction → feature vector (training data): (12, 4, salmon), (15, 8, sea bass), (8, 2, salmon), (5, 10, sea bass) → training → model
- Testing: pre-processing, feature extraction → feature vector (test data): (15, 2, salmon), (10, 7, salmon), (18, 7, ?), (8, 5, ?) → test/classify → evaluation/prediction: salmon (correct), salmon (correct), sea bass, salmon
16. Terms
Classification
- Accuracy
  - % of test data correctly classified
  - In our first example, accuracy was 3 out of 4 = 75%
  - In our second example, accuracy was 4 out of 4 = 100%
- False positive
  - A negative-class example incorrectly classified as positive
  - Usually, the larger class is the negative class
  - Suppose (as in the sketch below)
    - salmon is the negative class
    - sea bass is the positive class
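As a sketch of how these terms translate into code, with salmon as the negative class and sea bass as the positive class as assumed above (the function and variable names are ours):

```python
def evaluate(predicted, actual, positive="sea bass"):
    """Return (accuracy, false positives, false negatives)."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive
             for p, a in zip(predicted, actual))  # negatives called positive
    fn = sum(p != positive and a == positive
             for p, a in zip(predicted, actual))  # positives called negative
    return correct / len(actual), fp, fn

# The two labeled test examples from the first example:
acc, fp, fn = evaluate(["sea bass", "salmon"], ["salmon", "salmon"])
print(acc, fp, fn)  # 0.5 accuracy, 1 false positive, 0 false negatives
```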
17. Terms
Classification
- [Figure: regions of false positives and false negatives]
18. Terms
Classification
- Cross validation (3-fold), sketched in code below: split the data into three parts; each fold uses one part for testing and the other two for training
  - Fold 1: test on part 1, train on parts 2 and 3
  - Fold 2: test on part 2, train on parts 1 and 3
  - Fold 3: test on part 3, train on parts 1 and 2
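A minimal sketch of the 3-fold scheme above; `train_fn` and `test_fn` are placeholders for any learner and accuracy measure, not anything prescribed by the lecture.

```python
def three_fold_cv(data, train_fn, test_fn, k=3):
    """Split data into k parts; each part is the test fold once
    while the remaining parts form the training set."""
    folds = [data[i::k] for i in range(k)]  # simple round-robin split
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)
        accuracies.append(test_fn(model, test))
    return sum(accuracies) / k  # average accuracy over the folds
```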
19. Stream Data
20. Problem Description
Stream Data
- Suppose we have a continuous flow of data
  - For example, a network server is always receiving some data
- We would like to detect intrusions / attacks in the data
- Classification problem
  - Is the incoming data to the server attack or normal?
- How do we solve this classification problem?
21. Problem Formulation
Stream Data
- Distinguish normal traffic from attack traffic
- Identify important features from domain knowledge
- Extract features from the data
- Prepare training data
- Train a classifier
- Classify future data
22. Problem Formulation (cont.)
Stream Data
- Problem I
  - How much data should be used for training?
  - Train with the first t hours of data only?
  - What if no attack appears during the first t hours?
  - What if the first t hours of data were only attacks?
- [Timeline: data arrives over time 0 → t → now; training on the first t hours, testing on everything after t]
23. An Example
Stream Data
- [Figure]
24. Problem Formulation (cont.)
Stream Data
- Possible solution
  - Use all data up to now for training
- Problem II
  - Can't store unlimited data
  - Can't train a classifier with a large volume of data
- Possible solution
  - Choose only a subset of the data for training
25. Problem Formulation (cont.)
Stream Data
- Problem II
  - Can't store unlimited data
  - Can't train a classifier with a large volume of data
- Possible solution (see the sketch below)
  - Divide the data stream into chunks (e.g., 1 hour of data each)
  - Selectively add new data chunks to the training set (how?)
- [Diagram: chunk1 | chunk2 | chunk3 | ...]
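A sketch of the chunking idea, assuming fixed-size chunks and a most-recent-chunks policy as a stand-in for the "how?" above; the actual selection strategy is exactly what the following slides address.

```python
from collections import deque
from itertools import count

def chunks(stream, chunk_size):
    """Yield successive fixed-size chunks from a (possibly unbounded) stream."""
    chunk = []
    for record in stream:
        chunk.append(record)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []

# Keep only the 3 most recent chunks as the training set; older
# chunks fall out automatically, so memory stays bounded.
training_set = deque(maxlen=3)
for i, c in enumerate(chunks(count(), 3600)):  # count() stands in for the stream
    training_set.append(c)
    if i == 5:  # stop the demo after a few chunks
        break
```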
26. Problem Formulation (cont.)
Stream Data
- Problem III: concept drift
  - The concept (i.e., the characteristics of the classes) may change over time
  - For example, the characteristics (length, lightness) of salmon and sea bass may change after a thousand/million years
  - Thus, old training data becomes outdated and should be discarded
  - Solution: selectively discard old training data (how?)
27. Systematic Data Selection
- Source: Fan, W. Systematic data selection to mine concept-drifting data streams. In Proc. KDD '04.
28. Data Selection Problem
Systematic Data Selection
- In the presence of concept drift, which data should be used to train the classifier?
- Use all data? Discard the oldest? Select at random?
29. Data Selection Problem
Systematic Data Selection
- Concept drift
  - S_i is the data received at time stamp i
  - FO_i(x) is the optimal model for S_i
  - Let FO_{i-1}(x) be the optimal model at time stamp i-1
  - We say that there is concept drift from time stamp i-1 to time stamp i if there exists some x such that FO_i(x) ≠ FO_{i-1}(x)
- Data sufficiency
  - Training data is sufficient if adding more data to the training set does not improve classification accuracy
30. Will Old Data Help?
Systematic Data Selection
- Underlying model does not change (no concept drift)
  - Old data will help if the recent data is insufficient
  - Overfitting does not occur
- Underlying model does change
  - Let SP = S_1 ∪ ... ∪ S_{i-1}
  - The data in SP can fall into any of three categories:
    1. FO_i(x) ≠ FO_{i-1}(x) (disagree)
    2. FO_i(x) = FO_{i-1}(x) = y (agree and correct)
    3. FO_i(x) = FO_{i-1}(x) ≠ y (agree but wrong)
31. Will Old Data Help? (cont.)
Systematic Data Selection
- [Venn diagram of the three categories of old data:]
  1. FO_i(x) ≠ FO_{i-1}(x) (disagree)
  2. FO_i(x) = FO_{i-1}(x) = y (agree and correct)
  3. FO_i(x) = FO_{i-1}(x) ≠ y (agree but wrong)
32. Scenario-I
Systematic Data Selection
- New data is sufficient by itself and there is no concept drift
  - Optimal model: the one trained with new data only
  - The optimal model may also be the old model, if that data was sufficient
- Problem: we may never know whether the data is sufficient, or whether there is concept drift
- What if we
  - Train a new model from the new data
  - Train a new model from the combined new and old data
  - Compare these with the original old model
33. Scenario-II
Systematic Data Selection
- New data is sufficient by itself and there is concept drift
  - Optimal model: the one trained with new data only
- Problem: we may never know whether the data is sufficient, or whether there is concept drift
34. Scenario-III
Systematic Data Selection
- New data is insufficient by itself and there is no concept drift
  - Optimal model: if the previous data is sufficient, then the existing model
  - Optimal model: if the previous data is not sufficient, then
    - Train a new model from the new data plus the existing data
    - Choose the one with higher accuracy
35. Scenario-IV
Systematic Data Selection
- New data is insufficient by itself and there is concept drift
  - Optimal model: not obtainable from the new data only
  - Choose only those examples from previous data chunks whose concept is consistent with the new data chunk
  - And combine those examples with the new data
36. Computing Optimal Model
Systematic Data Selection
- The optimal model is different under different situations
  - The choice depends on whether the data is sufficient and whether there is concept drift
- Solution
  - Compare a few plausible models statistically
  - Choose the one with the highest accuracy
- Notation
  - FN(x): a new model trained from recent data
  - FO(x): the optimal model finally chosen
37. Computing Optimal Model (cont.)
Systematic Data Selection
- 1. Train a model FN_i(x) from the new data chunk
- 2. Let D_{i-1} be the dataset that trained the most recent optimal model FO_{i-1}(x)
  - D_{i-1} may not be the most recent data chunk S_{i-1}
  - How D_{i-1} is obtained will be discussed shortly
  - Select the examples from D_{i-1} on which both the model FN_i(x) and the model FO_{i-1}(x) make the correct prediction
  - Call these examples s_{i-1}
  - That is, s_{i-1} = { (x, y) ∈ D_{i-1} : FN_i(x) = FO_{i-1}(x) = y }
38. Computing Optimal Model (cont.)
Systematic Data Selection
- 3. Train a model FN_i+(x) from the new data chunk plus the data selected in the last step, i.e., from S_i ∪ s_{i-1}
- 4. Update the most recent model FO_{i-1}(x) with S_i and call this model FO_{i-1}+(x), i.e., FO_{i-1}+(x) is trained from D_{i-1} ∪ S_i
- 5. Compare the accuracies of all four models FO_{i-1}(x), FO_{i-1}+(x), FN_i(x), FN_i+(x)
  - Using cross-validation
  - Choose the one that is the most accurate
  - Call it FO_i(x)
39. Computing Optimal Model (cont.)
Systematic Data Selection
- 6. D_i is the training set that produced FO_i(x). It is one of the following (see the sketch below):
  - S_i
  - D_{i-1}
  - S_i ∪ s_{i-1}
  - D_{i-1} ∪ S_i
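Putting steps 1-6 together, a hedged sketch of the selection procedure. Models are treated as callables; `train` and `accuracy` stand in for the decision-tree learner and the cross-validated accuracy estimate used in the paper, and none of this is the paper's actual code.

```python
def select_optimal_model(S_i, D_prev, FO_prev, train, accuracy):
    """S_i: new chunk as a list of (x, y); D_prev: data that trained FO_prev.
    Returns (FO_i, D_i): the chosen model and its training set."""
    FN = train(S_i)                                  # step 1: FN_i
    s_prev = [(x, y) for (x, y) in D_prev            # step 2: s_{i-1}
              if FN(x) == FO_prev(x) == y]
    FN_plus = train(S_i + s_prev)                    # step 3: FN_i+
    FO_plus = train(D_prev + S_i)                    # step 4: FO_{i-1}+
    candidates = [                                   # step 5: the four models
        (FO_prev, D_prev),
        (FO_plus, D_prev + S_i),
        (FN, S_i),
        (FN_plus, S_i + s_prev),
    ]
    # accuracy(model) is assumed to perform the cross-validation of
    # step 5; the winner's training set becomes D_i (step 6).
    return max(candidates, key=lambda c: accuracy(c[0]))
```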
40. Scenarios, Revisited
Systematic Data Selection
- 1. New data is sufficient by itself and there is no concept change
  - Conceptually, FN_i(x) should be the optimal model
  - However, FN_i+(x), FO_{i-1}(x), and FO_{i-1}+(x) could be close matches, since there is no concept change
- 2. New data is sufficient by itself and there is concept change
  - Obviously, FN_i(x) should be the optimal model
  - However, FN_i+(x) could be very similar in performance to FN_i(x)
41. Scenarios, Revisited (continued)
Systematic Data Selection
- 3. New data is insufficient by itself and there is no concept change
  - The optimal model should be either FO_{i-1}(x) or FO_{i-1}+(x)
- 4. New data is insufficient by itself and there is concept change
  - The optimal model should be either FN_i(x) or FN_i+(x)
42. Data Set
Systematic Data Selection
- Synthetic data
  - Each data point is a d-dimensional vector (x_1, ..., x_d), where each x_i ∈ [0, 1]
  - Concept drift is achieved by a moving hyperplane
  - Equation of the hyperplane: ∑_{i=1}^{d} a_i x_i = a_0
  - The weights a_i are changed at a certain rate
43. Data Set (continued)
Systematic Data Selection
- Synthetic data (continued)
- Parameters (see the generator sketch below)
  - d: dimension = 10
  - t: rate of change of the weights
  - Each weight is changed with the formula a_i = a_i + s_i * t / N
  - N = 1000
  - k: how many dimensions to change (varied from 20% to 50%)
  - s_i: direction of change (randomly changed)
  - p: noise, set to 5%
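A sketch of a generator matching these parameters. The labeling rule (positive iff ∑ a_i x_i ≥ a_0 with a_0 = ½ ∑ a_i) is the common moving-hyperplane convention; since the slide's equation image was lost, treat it, along with the concrete value of k and the per-chunk weight update, as assumptions.

```python
import random

d, N, t = 10, 1000, 0.1   # dimension, chunk size, weight-change rate
k, p = 3, 0.05            # drifting dimensions (assumed), 5% noise
a = [random.random() for _ in range(d)]          # hyperplane weights
s = [random.choice([-1, 1]) for _ in range(d)]   # direction of change

def make_chunk():
    """Generate one chunk, then drift the first k weights by s_i * t / N."""
    chunk = []
    for _ in range(N):
        x = [random.random() for _ in range(d)]  # each x_i in [0, 1]
        y = int(sum(ai * xi for ai, xi in zip(a, x)) >= sum(a) / 2)
        if random.random() < p:                  # inject p% class noise
            y = 1 - y
        chunk.append((x, y))
    for i in range(k):                           # a_i = a_i + s_i * t / N
        a[i] += s[i] * t / N
        if random.random() < 0.1:                # occasionally flip direction
            s[i] = -s[i]
    return chunk
```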
44. Data Set (continued)
Systematic Data Selection
- Credit card fraud data (real)
  - Sampled from credit card transaction records within a one-year period
  - Contains 5 million transactions in total
  - Features
    - Time
    - Merchant type, location
    - Past payments
    - Summary of transaction history, etc.
45. Experiments
Systematic Data Selection
- Comparison with other methods
  - G1: decision tree trained from the new data chunk only
  - GA: decision tree trained from all data
  - Gi: single decision tree trained from the most recent i data chunks
  - Ei: decision tree ensemble trained from the most recent i data chunks, each tree from one chunk
46. Results
Systematic Data Selection
- [Results figure]
47. Criticism
Systematic Data Selection
- Quote: "will the training data Di become unnecessarily large? The answer is no. Di only grows in size (or includes older data) if and only if the additional data helps improve accuracy."
- Although it is claimed that the training data will not grow large, there is no guarantee that it will not exceed memory/system limitations
- Can we do better?
  - Store models rather than data
48. Conclusion
Systematic Data Selection
- Concept drift is a major problem in stream data mining
- Systematic selection of data works better than random data selection
- However, there is no guarantee that the data will not grow beyond an acceptable limit
49. Ensemble Methods for Stream Data Classification