Mining%20Decision%20Trees%20from%20Data%20Streams - PowerPoint PPT Presentation

About This Presentation

Title:

Mining%20Decision%20Trees%20from%20Data%20Streams

Description:

Mining Decision Trees from Data Streams Thanks: Tong Suk Man Ivy HKU – PowerPoint PPT presentation

Number of Views:324

Avg rating:3.0/5.0

Slides: 57

Provided by: Unkno200

Category:

more less

Transcript and Presenter's Notes

Title: Mining%20Decision%20Trees%20from%20Data%20Streams

1
Mining Decision Trees fromData Streams

Thanks Tong Suk Man Ivy
HKU

2
Contents

Introduction problems in mining data streams
Classification of stream data
VFDT algorithm
Window approach
CVFDT algorithm
Experimental results
Conclusions
Future work

3
Data Streams

Characteristics
Large volume of ordered data points, possibly
infinite
Arrive continuously
Fast changing
Appropriate model for many applications
Phone call records
Network and security monitoring
Financial applications (stock exchange)
Sensor networks

4
Problems in Mining Data Streams

Traditional data mining techniques usually
require
Entire data set to be present
Random access (or multiple passes) to the data
Much time per data item
Challenges of stream mining
Impractical to store the whole data
Random access is expensive
Simple calculation per data due to time and space
constraints

5
Processing Data Streams Motivation

A growing number of applications generate streams
of data
Performance measurements in network monitoring
and traffic management
Call detail records in telecommunications
Transactions in retail chains, ATM operations in
banks
Log records generated by Web Servers
Sensor network data
Application characteristics
Massive volumes of data (several terabytes)
Records arrive at a rapid rate
Goal Mine patterns, process queries and compute
statistics on data streams in real-time

(from VLDB02 Tutorial)
6
Data Streams Computation Model
Synopsis in Memory
Data Streams
Stream Processing Engine
(Approximate) Answer

A data stream is a (massive) sequence of
elements Stream processing requirements
Single pass Each record is examined at most once
Bounded storage Limited Memory (M) for storing
synopsis
Real-time Per record processing time (to
maintain synopsis) must be low

7
Network Management Application

Network Management involves monitoring and
configuring network hardware and software to
ensure smooth operation
Monitor link bandwidth usage, estimate traffic
demands
Quickly detect faults, congestion and isolate
root cause
Load balancing, improve utilization of network
resources

Network Operations Center
Measurements Alarms
Network
(from VLDB02 Tutorial)
8
IP Network Measurement Data

IP session data (collected using Cisco
NetFlow)
ATT collects 100 GBs of NetFlow data each
day!
ATT collects 100 GB of NetFlow data per day!

Source Destination Duration
Bytes Protocol 10.1.0.2
16.2.3.7 12 20K
http 18.6.7.1 12.4.0.3
16 24K http
13.9.4.3 11.6.8.2 15
20K http 15.2.2.9
17.1.2.1 19 40K
http 12.4.3.8 14.8.7.4
26 58K http
10.5.1.3 13.0.0.1 27
100K ftp 11.1.0.6
10.3.4.5 32 300K
ftp 19.7.1.2 16.5.5.8
18 80K ftp
(from VLDB02 Tutorial)
9
Network Data Processing
(from VLDB02 Tutorial)

Traffic estimation
How many bytes were sent between a pair of IP
addresses?
What fraction network IP addresses are active?
List the top 100 IP addresses in terms of traffic
Traffic analysis
What is the average duration of an IP session?
What is the median of the number of bytes in each
IP session?
Fraud detection
List all sessions that transmitted more than 1000
bytes
Identify all sessions whose duration was more
than twice the normal
Security/Denial of Service
List all IP addresses that have witnessed a
sudden spike in traffic
Identify IP addresses involved in more than 1000
sessions

10
Data Stream Processing Algorithms

Generally, algorithms compute approximate answers
Difficult to compute answers accurately with
limited memory
Approximate answers - Deterministic bounds
Algorithms only compute an approximate answer,
but bounds on error
Approximate answers - Probabilistic bounds
Algorithms compute an approximate answer with
high probability
With probability at least , the computed
answer is within a factor of the actual
answer
Single-pass algorithms for processing streams
also applicable to (massive) terabyte databases!

(from VLDB02 Tutorial)
11
Classification of Stream Data

VFDT algorithm
Mining High-Speed Data Streams, KDD 2000.
Pedro Domingos, Geoff Hulten
CVFDT algorithm (window approach)
Mining Time-Changing Data Streams, KDD 2001.
Geoff Hulten, Laurie Spencer, Pedro Domingos

12
Hoeffding Trees
13
Definitions

A classification problem is defined as
N is a set of training examples of the form (x,
y)
x is a vector of d attributes
y is a discrete class label
Goal To produce from the examples a model yf(x)
that predict the classes y for future examples x
with high accuracy

14
Decision Tree Learning

One of the most effective and widely-used
classification methods
Induce models in the form of decision trees
Each node contains a test on the attribute
Each branch from a node corresponds to a possible
outcome of the test
Each leaf contains a class prediction
A decision tree is learned by recursively
replacing leaves by test nodes, starting at the
root

15
Challenges

Classic decision tree learners assume all
training data can be simultaneously stored in
main memory
Disk-based decision tree learners repeatedly read
training data from disk sequentially
Prohibitively expensive when learning complex
trees
Goal design decision tree learners that read
each example at most once, and use a small
constant time to process it

16
Key Observation

In order to find the best attribute at a node, it
may be sufficient to consider only a small subset
of the training examples that pass through that
node.
Given a stream of examples, use the first ones to
choose the root attribute.
Once the root attribute is chosen, the successive
examples are passed down to the corresponding
leaves, and used to choose the attribute there,
and so on recursively.
Use Hoeffding bound to decide how many examples
are enough at each node

17
Hoeffding Bound

Consider a random variable a whose range is R
Suppose we have n observations of a
Mean
Hoeffding bound states
With probability 1- ?, the true mean of a is at
least
, where

18
How many examples are enough?

Let G(Xi) be the heuristic measure used to choose
test attributes (e.g. Information Gain, Gini
Index)
Xa the attribute with the highest attribute
evaluation value after seeing n examples.
Xb the attribute with the second highest split
evaluation function value after seeing n
examples.
Given a desired ?, if
after seeing n examples at a node,
Hoeffding bound guarantees the true
, with probability 1-?.
This node can be split using Xa, the succeeding
examples will be passed to the new leaves.

19
Algorithm

Calculate the information gain for the attributes
and determines the best two attributes
Pre-pruning consider a null attribute that
consists of not splitting the node
At each node, check for the condition

If condition satisfied, create child nodes based
on the test at the node
If not, stream in more examples and perform
calculations till condition satisfied

20
(No Transcript)
21
Performance Analysis

p probability that an example passed through DT
to level i will fall into a leaf at that point
The expected disagreement between the tree
produced by Hoeffding tree algorithm and that
produced using infinite examples at each node is
no greater than ? /p.
Required memory O(leaves attributes values
classes)

22
VFDT
23
VFDT (Very Fast Decision Tree)

A decision-tree learning system based on the
Hoeffding tree algorithm
Split on the current best attribute, if the
difference is less than a user-specified
threshold
Wasteful to decide between identical attributes
Compute G and check for split periodically
Memory management
Memory dominated by sufficient statistics
Deactivate or drop less promising leaves when
needed
Bootstrap with traditional learner
Rescan old data when time available

24
VFDT(2)

Scales better than pure memory-based or pure
disk-based learners
Access data sequentially
Use subsampling to potentially require much less
than one scan
VFDT is incremental and anytime
New examples can be quickly incorporated as they
arrive
A usable model is available after the first few
examples and then progressively defined

25
Experiment Results (VFDT vs. C4.5)

Compared VFDT and C4.5 (Quinlan, 1993)
Same memory limit for both (40 MB)
100k examples for C4.5
VFDT settings d 10-7, t 5, nmin200
Domains 2 classes, 100 binary attributes
Fifteen synthetic trees 2.2k 500k leaves
Noise from 0 to 30

26
Experiment Results
Accuracy as a function of the number of training
examples
27
Experiment Results
Tree size as a function of number of training
examples
28
Mining Time-Changing Data Stream

Most KDD systems, include VFDT, assume training
data is a sample drawn from stationary
distribution
Most large databases or data streams violate this
assumption
Concept Drift data is generated by a
time-changing concept function, e.g.
Seasonal effects
Economic cycles
Goal
Mining continuously changing data streams
Scale well

29
Window Approach

Common Approach when a new example arrives,
reapply a traditional learner to a sliding window
of w most recent examples
Sensitive to window size
If w is small relative to the concept shift rate,
assure the availability of a model reflecting the
current concept
Too small w may lead to insufficient examples to
learn the concept
If examples arrive at a rapid rate or the concept
changes quickly, the computational cost of
reapplying a learner may be prohibitively high.

30
CVFDT
31
CVFDT

CVFDT (Concept-adapting Very Fast Decision Tree
learner)
Extend VFDT
Maintain VFDTs speed and accuracy
Detect and respond to changes in the
example-generating process

32
Observations

With a time-changing concept, the current
splitting attribute of some nodes may not be the
best any more.
An outdated subtree may still be better than the
best single leaf, particularly if it is near the
root.
Grow an alternative subtree with the new best
attribute at its root, when the old attribute
seems out-of-date.
Periodically use a bunch of samples to evaluate
qualities of trees.
Replace the old subtree when the alternate one
becomes more accurate.

33
CVFDT algorithm

Alternate trees for each node in HT start as
empty.
Process examples from the stream indefinitely.
For each example (x, y),
Pass (x, y) down to a set of leaves using HT and
all alternate trees of the nodes (x, y) passes
through.
Add (x, y) to the sliding window of examples.
Remove and forget the effect of the oldest
examples, if the sliding window overflows.
CVFDTGrow
CheckSplitValidity if f examples seen since last
checking of alternate trees.
Return HT.

34
CVFDT algorithm process each example
Yes
No
Read new example
35
CVFDT algorithm process each example
36
CVFDTGrow

For each node reached by the example in HT,
Increment the corresponding statistics at the
node.
For each alternate tree Talt of the node,
CVFDTGrow
If enough examples seen at the leaf in HT which
the example reaches,
Choose the attribute that has the highest average
value of the attribute evaluation measure
(information gain or gini index).
If the best attribute is not the null
attribute, create a node for each possible value
of this attribute

37
CVFDT algorithm process each example
38
Forget old example

Maintain the sufficient statistics at every node
in HT to monitor the validity of its previous
decisions.
VFDT only maintain such statistics at leaves.
HT might have grown or changed since the example
was initially incorporated.
Assigned each node a unique, monotonically
increasing ID as they are created.
forgetExample (HT, example, maxID)
For each node reached by the old example with
node ID no larger than the max leave ID the
example reaches,
Decrement the corresponding statistics at the
node.
For each alternate tree Talt of the node,
forget(Talt, example, maxID).

39
CVFDT algorithm process each example
Read new example
40
CheckSplitValidtiy

Periodically scans the internal nodes of HT.
Start a new alternate tree when a new winning
attribute is found.
Tighter criteria to avoid excessive alternate
tree creation.
Limit the total number of alternate trees.

41
Smoothly adjust to concept drift

Alternate trees are grown the same way HT is.
Periodically each node with non-empty alternate
trees enter a testing mode.
M training examples to compare accuracy.
Prune alternate trees with non-increasing
accuracy over time.
Replace if an alternate tree is more accurate.

42
Adjust to concept drift(2)

Dynamically change the window size
Shrink the window when many nodes gets
questionable or data rate changes rapidly.
Increase the window size when few nodes are
questionable.

43
Performance

Require memory O(nodes attributes attribute
values classes).
Independent of the total number of examples.
Running time O(Lc attributes attribute values
number of classes).
Lc the longest length an example passes through
number of alternate trees.
Model learned by CVFDT vs. the one learned by
VFDT-Window
Similar in accuracy
O(1) vs. O(window size) per new example.

44
Experiment Results

Compare CVFDT, VFDT, VFDT-Window
5 million training examples
Concept changed at every 50k examples
Drift Level average percentage of the test
points that changes label at each concept change.
About 8 of test points change label each drift
100,000 examples in window
5 noise
Test the model every 10k examples throughout the
run, averaged these results.

45
Experiment Results (CVFDT vs. VFDT)
Error rate as a function of number of attributes
46
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of number of attributes
47
Experiment Results (CVFDT vs. VFDT)
Error rates of learners as a function of the
number of examples seen
48
Experiment Results (CVFDT vs. VFDT)
Error rates as a function of the amount of
concept drift
49
Experiment Results
CVFDTs drift characteristics
50
Experiment Results (CVFDT vs. VFDT vs.
VFDT-window)

Error Rate
VFDT 19.4
CVFDT 16.3
VFDT-Window 15.3
Running Time
VFDT 10 minutes
CVFDT 46 minutes
VFDT-Window expect 548 days

Error rates over time of CVFDT, VFDT, and
VFDT-window
51
Experiment Results

CVFDT not use too much RAM
D50, CVFDT never uses more than 70MB
Use as little as half the RAM of VFDT
VFDT often had twice as many leaves as the number
of nodes in CVFDTs HT and alternate subtrees
combined
Reason VFDT considers many more outdated
examples and is forced to grow larger trees to
make up for its earlier wrong decisions due to
concept drift

52
Conclusions

CVFDT a decision-tree induction system capable
of learning accurate models from high speed,
concept-drifting data streams
Grow an alternative subtree whenever an old one
becomes questionable
Replace the old subtree when the new more
accurate
Similar in accuracy to applying VFDT to a moving
window of examples

53
Future Work

Concepts changed periodically and removed
subtrees may become useful again
Comparisons with related systems
Continuous attributes
Weighting examples

54
Reference List

P. Domingos and G. Hulten. Mining high-speed data
streams. In Proceedings of the Sixth ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, 2000.
G. Hulten, L. Spencer, and P. Domingos. Mining
time-changing data streams. In Proceedings of the
Seventh ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, 2001
V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON
Mining and monitoring evolving data. In
Proceedings of the Sixteenth International
Conference on Data Engineering, 2000.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.L.
Loh. BOAT optimistic decision tree construction.
In Proceedings of the 1999 ACM SIGMOD
International Conference on Management of Data,
1999.