Title: Systematic Data Selection to Mine Concept-Drifting Data Streams
1. Systematic Data Selection to Mine Concept-Drifting Data Streams
2. About
- Data stream: a continuous stream of new data, generated either in real time or periodically.
  - Credit card transactions.
  - Stock trades.
  - Insurance claim data.
  - Phone call records.
- Our notation.
3. Data Streams
(Figure: new data continually arriving in the stream.)
4. Data Stream Mining
- Data characteristics may change over time.
- Main goal of stream mining: make sure that the constructed model is the most accurate and up-to-date.
5. Data Sufficiency
- Definition
  - A dataset is considered sufficient if adding more data items does not significantly increase the final accuracy of a trained model.
  - We normally do not know whether a dataset is sufficient or not.
- Sufficiency detection
  - Requires an expensive progressive sampling experiment: keep adding data and stop when accuracy no longer increases significantly (see the sketch below).
  - The result depends on both the dataset and the algorithm, so it is difficult to make a general claim.
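A minimal sketch of such a progressive sampling check follows; the step schedule, tolerance, hold-out split, and choice of learner are illustrative assumptions, not prescribed by the paper.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def looks_sufficient(X_train, y_train, X_val, y_val,
                     start=500, growth=2, tolerance=0.005):
    """Progressive sampling: train on larger and larger prefixes of the data
    and declare the dataset sufficient once accuracy stops improving."""
    n, last_acc = start, None
    while n <= len(X_train):
        model = DecisionTreeClassifier().fit(X_train[:n], y_train[:n])
        acc = accuracy_score(y_val, model.predict(X_val))
        if last_acc is not None and acc - last_acc < tolerance:
            return True          # adding more data no longer helps much
        last_acc, n = acc, n * growth
    return False                 # accuracy was still climbing at the end
```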
6. Possible changes of data streams
- Possible concept drift.
  - For the same feature vector, different class labels are generated at some later time, either deterministically or stochastically, with different probabilities.
- Possible data sufficiency or insufficiency.
- Other possible changes are not addressed in our paper.
- Most important of all:
  - These are only possibilities.
  - There is no oracle out there to tell us the truth!
  - It is dangerous to make assumptions.
7. How many combinations?
- Four combinations:
  - Sufficient and no drift.
  - Insufficient and no drift.
  - Sufficient and drift.
  - Insufficient and drift.
- Question: Does the most accurate model remain the same under all four situations?
8. Case 1: Sufficient and no drift
- Solution one:
  - Throw away old models and data.
  - Re-train a new model from the new data.
  - Justified by the definition of data sufficiency.
- Solution two:
  - If the old model was trained from sufficient data, just use the old model.
9. Case 2: Sufficient and drift
- Solution one:
  - Train a new model from the new data.
  - Justified by the same sufficiency definition.
10. Case 3: Insufficient and no drift
- Possibility I: if the old model was trained from sufficient data, keep the old model.
- Possibility II: otherwise, combine the new data with the old data and train a new model.
11. Case 4: Insufficient and drift
- Obviously, the new data alone is not enough, by definition.
- What are our options?
  - Use old data?
  - But how?
12. A moving hyperplane
13. A moving hyperplane
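The two figure slides above illustrate a moving hyperplane, a common synthetic generator for drifting streams: an example's label depends on which side of a hyperplane it falls, and the hyperplane slowly moves between data chunks. A minimal sketch of such a generator (the dimension, drift rate, and threshold choice are illustrative assumptions):

```python
import numpy as np

def moving_hyperplane_stream(n_chunks=10, chunk_size=1000, dim=5,
                             drift=0.1, seed=0):
    """Yield (X, y) chunks; the labelling hyperplane drifts after each chunk."""
    rng = np.random.default_rng(seed)
    w = rng.uniform(0, 1, dim)                    # hyperplane weights
    for _ in range(n_chunks):
        X = rng.uniform(0, 1, (chunk_size, dim))
        y = (X @ w > 0.5 * w.sum()).astype(int)   # which side of the hyperplane
        yield X, y
        w += drift * rng.uniform(-1, 1, dim)      # concept drift: move the plane
```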
14. See any problems?
- Which old data items can we use?
15. We need to be picky
16. Inconsistent examples
17. Consistent examples
18. See more problems?
- We normally never know which of the four cases a real data stream belongs to.
- It may change over time from case to case.
- Normally, no ground truth is known a priori, or even later.
19. Solution
- Requirements
  - The right solution should not be one-size-fits-all.
  - It should not make any assumptions; any assumption can be wrong.
  - It should be adaptive: let the data speak for itself.
- We prefer model A over model B if A is likely to be more accurate than B on the evolving data stream.
- No assumptions!
20. An unbiased selection framework
- Train FN from the new data only.
- Train FN+ from the new data plus selected consistent old data.
- Assume FO is the previous most accurate model; update FO with the new data and call the result FO+.
- Use cross-validation to choose among the four candidate models FN, FN+, FO, and FO+ (a sketch follows).
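A compact sketch of this selection loop follows. The learner (a scikit-learn random forest standing in for random decision trees), the particular consistency test, and the way FO is "updated" are illustrative assumptions; only the four candidates and the cross-validated choice come from the slide.

```python
import numpy as np
from copy import deepcopy
from sklearn.ensemble import RandomForestClassifier

def cv_accuracy(build, X, y, folds=5, seed=0):
    """Average held-out accuracy over k folds of the new data;
    build(X_train, y_train) must return a fitted model."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(X)), folds)
    accs = []
    for test in parts:
        train = np.setdiff1d(np.arange(len(X)), test)
        model = build(X[train], y[train])
        accs.append((model.predict(X[test]) == y[test]).mean())
    return float(np.mean(accs))

def select_consistent(model, X_old, y_old):
    """One plausible notion of 'consistent' old data: old examples whose
    labels the model trained on new data still predicts correctly."""
    mask = model.predict(X_old) == y_old
    return X_old[mask], y_old[mask]

def choose_model(X_new, y_new, X_old, y_old, FO):
    FN = RandomForestClassifier().fit(X_new, y_new)
    X_c, y_c = select_consistent(FN, X_old, y_old)
    builders = {
        "FN":  lambda Xt, yt: RandomForestClassifier().fit(Xt, yt),
        "FN+": lambda Xt, yt: RandomForestClassifier().fit(
                   np.vstack([Xt, X_c]), np.concatenate([yt, y_c])),
        "FO":  lambda Xt, yt: FO,                        # old model, unchanged
        "FO+": lambda Xt, yt: deepcopy(FO).fit(Xt, yt),  # stand-in for an incremental update
    }
    scores = {name: cv_accuracy(b, X_new, y_new) for name, b in builders.items()}
    best = max(scores, key=scores.get)
    return best, builders[best](X_new, y_new)            # refit the winner on all new data
```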
21. Consistent old data
- Theoretically, if we knew the true models, we could use them to choose consistent data. But we don't.
- Practically, we have to rely on optimal models.
- Go back to the hyperplane example.
22. A moving hyperplane
23. Their optimal models
24. True model and optimal models
- True model
  - A perfect model that never makes mistakes.
  - Not always attainable, due to:
    - the stochastic nature of the problem,
    - noise in the training data,
    - insufficient data.
- Optimal model: defined with respect to a given loss function.
25. Optimal Model
- A loss function L(t, y) evaluates performance, where t is the true label and y is the prediction.
- Optimal decision: y is the label that minimizes the expected loss when x is sampled many times.
- 0-1 loss: y is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
- Cost-sensitive loss: y is the label that minimizes the empirical risk, e.g., if P(fraud|x) × $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud (both rules are written out below).
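The two decision rules above, written out; the $1000 transaction amount and $90 investigation cost are the slide's example figures.

```python
def decide_zero_one(p_fraud):
    """0-1 loss: predict the most probable label."""
    return "fraud" if p_fraud > 0.5 else "normal"

def decide_cost_sensitive(p_fraud, amount=1000.0, cost=90.0):
    """Cost-sensitive loss: flag the transaction whenever the expected
    benefit of investigating (p * amount) exceeds its cost."""
    return "fraud" if p_fraud * amount > cost else "normal"

# Example: p(fraud|x) = 0.2 gives "normal" under 0-1 loss,
# but "fraud" cost-sensitively, since 0.2 * 1000 = 200 > 90.
```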
26. Random decision trees
- Train multiple trees (details follow).
- Each tree outputs a posterior probability when classifying an example x.
- The probability outputs of the trees are averaged to form the final probability estimate.
- The loss function and this probability are then used to make the best prediction.
27. Training
- At each node, an unused feature is chosen randomly.
  - A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
  - A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.
28. Example
(Figure: an example random tree. The root tests Gender? with branches M and F; deeper nodes test Age > 30 and Age > 25; nodes store class counts such as P: 1, N: 9 and P: 100, N: 150.)
29. Training, continued
- We stop growing a path when one of the following happens:
  - a node becomes empty, or
  - the total height of the tree exceeds a threshold, currently set to the total number of features.
- Each node of the tree keeps the number of training examples belonging to each class (see the sketch below).
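A minimal sketch of this training procedure under stated assumptions: binary branching for discrete features, [0, 1]-scaled thresholds for continuous ones, and a structure-first, counts-later update are illustrative choices; the random unused-feature rule, the height limit, and the per-node class counts are the rules from slides 27 and 29 (the "node becomes empty" stop shows up simply as subtrees that never receive counts).

```python
import random
from collections import Counter

class Node:
    def __init__(self):
        self.feature = None      # feature index tested at this node (None = leaf)
        self.threshold = None    # split point when the feature is continuous
        self.children = {}       # branch key -> child Node
        self.counts = Counter()  # class counts of training examples reaching this node

def grow(feature_types, depth=0, used_discrete=frozenset()):
    """Build one random tree structure without looking at any labels.
    feature_types maps feature index -> 'discrete' or 'continuous'."""
    node = Node()
    if depth >= len(feature_types):                      # height limit: number of features
        return node
    candidates = [f for f, t in feature_types.items()
                  if t == "continuous" or f not in used_discrete]
    if not candidates:
        return node
    f = random.choice(candidates)
    node.feature = f
    if feature_types[f] == "continuous":
        node.threshold = random.random()                 # assumes features scaled to [0, 1]
        branches, used = ("<=", ">"), used_discrete
    else:
        branches, used = (0, 1), used_discrete | {f}     # assumes binary discrete values
    for b in branches:
        node.children[b] = grow(feature_types, depth + 1, used)
    return node

def update(node, x, y):
    """Route one training example (x, y) down the tree, updating class counts."""
    node.counts[y] += 1
    if node.feature is None:
        return
    key = ("<=" if x[node.feature] <= node.threshold else ">") \
          if node.threshold is not None else x[node.feature]
    child = node.children.get(key)
    if child is not None:
        update(child, x, y)
```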
30. Classification
- Each tree outputs a membership probability, e.g., p(fraud|x) = n_fraud / (n_fraud + n_normal).
- If a leaf node is empty (quite likely when a discrete feature is tested near the bottom of the tree), use the parent node's probability estimate; do not output 0 or NaN.
- The membership probabilities from the multiple random trees are averaged to form the final output (sketched below).
- A loss function is then required to make a decision:
  - 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
  - Cost-sensitive loss: if p(fraud|x) × $1000 > $90, predict fraud.
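A sketch of the classification step, using the Node structure from the training sketch above; the "fraud"/"normal" labels and the 0.5 fallback for a tree that saw no data at all are assumptions.

```python
def tree_probability(node, x, target="fraud"):
    """Walk x down one tree and return P(target | x) from the deepest node on
    the path that holds any examples; an empty leaf thus falls back to its
    parent's estimate instead of producing 0 or NaN."""
    prob = None
    while node is not None:
        total = sum(node.counts.values())
        if total > 0:
            prob = node.counts[target] / total
        if node.feature is None:
            break
        key = ("<=" if x[node.feature] <= node.threshold else ">") \
              if node.threshold is not None else x[node.feature]
        node = node.children.get(key)
    return prob if prob is not None else 0.5             # tree saw no data at all

def ensemble_probability(trees, x, target="fraud"):
    """Average the per-tree estimates; a loss function (slide 25) then turns
    this probability into the final decision."""
    return sum(tree_probability(t, x, target) for t in trees) / len(trees)
```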
31. N-fold cross-validation with random decision trees
- The tree structure is independent of the data.
- Cross-validation therefore only requires a compensation step when computing the probability (sketched below).
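A sketch of the compensation idea, shown in its simplest leave-one-out form: because the tree structure never depended on the training data, an example can be scored "as if held out" by subtracting its own contribution from the class counts along its path, at the same cost as an ordinary prediction.

```python
def held_out_probability(node, x, y, target="fraud"):
    """P(target | x) as if the training example (x, y) had not been used:
    discount its own count along its path before reading off the estimate."""
    prob = None
    while node is not None:
        counts = dict(node.counts)
        counts[y] = counts.get(y, 0) - 1                 # compensation for (x, y) itself
        total = sum(counts.values())
        if total > 0:
            prob = counts.get(target, 0) / total
        if node.feature is None:
            break
        key = ("<=" if x[node.feature] <= node.threshold else ">") \
              if node.threshold is not None else x[node.feature]
        node = node.children.get(key)
    return prob if prob is not None else 0.5
```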
32. Key advantage
- n-fold cross-validation comes cheap: it has the same cost as testing the model once on the training data.
- Training is efficient, since we do not compute information gain.
- It is actually also very accurate.
33. Experiments
- I have a demo available to show; please contact me.
- The paper reports the following experiments:
  - synthetic datasets,
  - credit card fraud datasets,
  - donation datasets.
34. Compare
- The new selective framework proposed in this paper.
- Our last year's hard-coded ensemble framework:
  - uses k weighted ensembles;
  - K = 1: train only on the new data;
  - K = 8: use the new data and the models from the previous 7 periods;
  - each classifier is weighted by its performance on the new data (a sketch follows);
  - evaluated on both sufficient and insufficient data, always with drift.
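For reference, a minimal sketch of how such an ensemble can weight its members against the new data. The specific weighting used here (each classifier's accuracy on the newest chunk minus that of random guessing, clipped at zero) is an illustrative assumption; the SIGKDD 2003 paper itself derives weights from estimated prediction error rather than raw accuracy.

```python
import numpy as np

def weighted_ensemble_predict(classifiers, X_new, y_new, X_query, n_classes=2):
    """Weight each previously trained classifier by how much better than
    random guessing it does on the newest chunk, then combine votes."""
    baseline = 1.0 / n_classes                            # accuracy of random guessing
    weights = []
    for clf in classifiers:
        acc = (clf.predict(X_new) == y_new).mean()
        weights.append(max(acc - baseline, 0.0))          # stale classifiers get ~0 weight
    votes = np.zeros((len(X_query), n_classes))
    for w, clf in zip(weights, classifiers):
        pred = clf.predict(X_query)
        votes[np.arange(len(X_query)), pred] += w         # assumes integer class labels
    return votes.argmax(axis=1)
```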
35. Data insufficient: new method
36. Last year's method
37. Average result
38. Data sufficient: new method
39. Data sufficient: last year's method
40. Average result
41. Independent study and implementation of random decision trees
- Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
- Edward Greengrass from the DOD, on their own datasets:
  - 100 to 300 features;
  - both categorical and continuous features;
  - some features have a lot of values;
  - 2,000 to 3,000 examples;
  - both binary and multi-class problems (16 and 25 classes).
42. Related publications on random trees
- "Is random model better? On its accuracy and efficiency", ICDM 2003.
- "On the optimality of probability estimation by random decision trees", AAAI 2004.
- "Mining concept-drifting data streams using ensemble classifiers", SIGKDD 2003.