Title: Mining Decision Trees from Data Streams
1 Mining Decision Trees from Data Streams
- Tong Suk Man Ivy
- CSIS DB Seminar
- February 12, 2003
2 Contents
- Introduction: problems in mining data streams
- Classification of stream data
- VFDT algorithm
- Window approach
- CVFDT algorithm
- Experimental results
- Conclusions
- Future work
3 Data Streams
- Characteristics
- Large volume of ordered data points, possibly infinite
- Arrive continuously
- Fast changing
- Appropriate model for many applications
- Phone call records
- Network and security monitoring
- Financial applications (stock exchange)
- Sensor networks
4 Problems in Mining Data Streams
- Traditional data mining techniques usually require:
- The entire data set to be present
- Random access (or multiple passes) to the data
- Much time per data item
- Challenges of stream mining
- Impractical to store the whole data
- Random access is expensive
- Only simple calculations per data item are affordable, due to time and space constraints
5 Classification of Stream Data
- VFDT algorithm
- Mining High-Speed Data Streams, KDD 2000.
- Pedro Domingos, Geoff Hulten
- CVFDT algorithm (window approach)
- Mining Time-Changing Data Streams, KDD 2001.
- Geoff Hulten, Laurie Spencer, Pedro Domingos
6 Hoeffding Trees
7 Definitions
- A classification problem is defined as follows:
- N is a set of training examples of the form (x, y)
- x is a vector of d attributes
- y is a discrete class label
- Goal: produce from the examples a model y = f(x) that predicts the class y for future examples x with high accuracy
8 Decision Tree Learning
- One of the most effective and widely used classification methods
- Induces models in the form of decision trees
- Each internal node contains a test on an attribute
- Each branch from a node corresponds to a possible outcome of the test
- Each leaf contains a class prediction
- A decision tree is learned by recursively replacing leaves by test nodes, starting at the root (a minimal node-structure sketch follows)
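To make the tree structure concrete, here is a minimal Python sketch of such a node; the class and field names are illustrative and not taken from the papers.

```python
# Minimal sketch of a decision-tree node (illustrative names, not the papers' code).
class Node:
    def __init__(self, prediction=None):
        self.test_attribute = None    # attribute tested at an internal node
        self.children = {}            # one branch per possible outcome of the test
        self.prediction = prediction  # class predicted at a leaf

    def is_leaf(self):
        return self.test_attribute is None

    def classify(self, x):
        """Route example x (a dict attribute -> value) down to a leaf."""
        node = self
        while not node.is_leaf():
            node = node.children[x[node.test_attribute]]
        return node.prediction
```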
9 Challenges
- Classic decision tree learners assume all training data can be simultaneously stored in main memory
- Disk-based decision tree learners repeatedly read training data from disk sequentially
- Prohibitively expensive when learning complex trees
- Goal: design decision tree learners that read each example at most once, and use a small constant time to process it
10 Key Observation
- In order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node.
- Given a stream of examples, use the first ones to choose the root attribute.
- Once the root attribute is chosen, the successive examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.
- Use the Hoeffding bound to decide how many examples are enough at each node.
11 Hoeffding Bound
- Consider a real-valued random variable a whose range is R
- Suppose we have made n independent observations of a and computed their mean ā
- The Hoeffding bound states: with probability 1 − δ, the true mean of a is at least ā − ε, where ε = sqrt( R² ln(1/δ) / (2n) ) (a small computation sketch follows)
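As a quick illustration, the bound is a one-liner; for information gain with two classes the range is R = 1, so with δ = 10^-7 and n = 1000 observations, ε ≈ 0.09. A minimal sketch, with illustrative names:

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with range R lies within epsilon of the mean of n observations."""
    return math.sqrt((R * R) * math.log(1.0 / delta) / (2.0 * n))

print(hoeffding_bound(R=1.0, delta=1e-7, n=1000))  # ~0.09
```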
12 How many examples are enough?
- Let G(Xi) be the heuristic measure used to choose test attributes (e.g. information gain, Gini index)
- Xa: the attribute with the highest evaluation value after seeing n examples
- Xb: the attribute with the second-highest evaluation value after seeing n examples
- Given a desired δ, if ΔG = G(Xa) − G(Xb) > ε after seeing n examples at a node,
- the Hoeffding bound guarantees that the true ΔG > 0 with probability 1 − δ, i.e. Xa really is the better attribute
- The node can then be split using Xa, and the succeeding examples will be passed to the new leaves (a small sketch of this check follows)
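A minimal sketch of this check, assuming the per-attribute G values for the n examples seen so far have already been computed; all names are illustrative.

```python
import math

def should_split(gains, delta, n, R=1.0):
    """Return the attribute to split on, or None if more examples are needed.
    `gains` maps attribute -> G(X) estimated from the n examples seen so far."""
    best, second = sorted(gains, key=gains.get, reverse=True)[:2]
    delta_g = gains[best] - gains[second]
    epsilon = math.sqrt((R * R) * math.log(1.0 / delta) / (2.0 * n))
    return best if delta_g > epsilon else None

# e.g. gains estimated from n = 1000 examples: epsilon ~ 0.09, so "color" wins
print(should_split({"color": 0.40, "size": 0.25, "shape": 0.10}, delta=1e-7, n=1000))
```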
13 Algorithm
- Calculate the information gain for the attributes and determine the best two attributes
- Pre-pruning: also consider a null attribute that corresponds to not splitting the node
- At each leaf, check for the condition ΔG > ε
- If the condition is satisfied, create child nodes based on the test at the node
- If not, stream in more examples and repeat the calculation until the condition is satisfied (a leaf-level sketch follows)
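A rough, self-contained sketch of how a single leaf might accumulate statistics and decide when to split; it uses information gain and the Hoeffding check above, but the class layout and names are illustrative rather than the authors' implementation.

```python
import math
from collections import defaultdict

def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in class_counts.values() if c)

class HoeffdingLeaf:
    """Sketch of a Hoeffding-tree leaf: accumulate sufficient statistics and
    report the attribute to split on once the bound separates the best two."""

    def __init__(self, attributes, delta=1e-7, n_min=200):
        self.attributes = list(attributes)
        self.delta, self.n_min = delta, n_min
        self.n = 0
        self.class_counts = defaultdict(int)
        # counts[attribute][value][class] -> number of examples seen
        self.counts = {a: defaultdict(lambda: defaultdict(int)) for a in self.attributes}

    def add_example(self, x, y):
        self.n += 1
        self.class_counts[y] += 1
        for a in self.attributes:
            self.counts[a][x[a]][y] += 1
        if self.n % self.n_min == 0:        # only check for a split periodically
            return self.try_split()
        return None

    def info_gain(self, a):
        before = entropy(self.class_counts)
        after = 0.0
        for value_counts in self.counts[a].values():
            weight = sum(value_counts.values()) / self.n
            after += weight * entropy(value_counts)
        return before - after

    def try_split(self):
        ranked = sorted(self.attributes, key=self.info_gain, reverse=True)
        delta_g = self.info_gain(ranked[0]) - self.info_gain(ranked[1])
        epsilon = math.sqrt(math.log(1.0 / self.delta) / (2.0 * self.n))  # R = 1 (two classes)
        return ranked[0] if delta_g > epsilon else None
```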
15 Performance Analysis
- p: probability that an example passed through the tree to level i will fall into a leaf at that point
- The expected disagreement between the tree produced by the Hoeffding tree algorithm and the tree produced using infinite examples at each node is no greater than δ/p
- Required memory: O(leaves × attributes × values × classes)
16 VFDT
17 VFDT (Very Fast Decision Tree)
- A decision-tree learning system based on the Hoeffding tree algorithm
- Tie breaking: split on the current best attribute once ε falls below a user-specified threshold τ, since it is wasteful to keep deciding between nearly identical attributes (sketched below)
- Compute G and check for a split only periodically (every n_min examples)
- Memory management
- Memory is dominated by the sufficient statistics
- Deactivate or drop less promising leaves when needed
- Bootstrap with a traditional learner
- Rescan old data when time is available
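A minimal sketch of the refined split decision with tie breaking; the rule shown (split on the current best when ε drops below τ) follows the VFDT paper, but the function and parameter names are illustrative.

```python
import math

def vfdt_should_split(gains, n, delta=1e-7, tau=0.05, R=1.0):
    """`gains` maps attribute -> G after n examples; the key None stands for
    the null attribute (i.e. the option of not splitting at all)."""
    best, second = sorted(gains, key=gains.get, reverse=True)[:2]
    if best is None:
        return None                              # pre-pruning: best option is not splitting
    delta_g = gains[best] - gains[second]
    epsilon = math.sqrt((R * R) * math.log(1.0 / delta) / (2.0 * n))
    if delta_g > epsilon:
        return best                              # clear winner
    if epsilon < tau:
        return best                              # tie: contenders are nearly identical anyway
    return None                                  # wait for more examples
```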
18 VFDT (2)
- Scales better than pure memory-based or pure disk-based learners
- Accesses data sequentially
- Can use subsampling to potentially require much less than one scan
- VFDT is incremental and anytime
- New examples can be quickly incorporated as they arrive
- A usable model is available after the first few examples and is then progressively refined
19 Experiment Results (VFDT vs. C4.5)
- Compared VFDT and C4.5 (Quinlan, 1993)
- Same memory limit for both (40 MB)
- 100k examples for C4.5
- VFDT settings: δ = 10^-7, τ = 5%, n_min = 200
- Domains: 2 classes, 100 binary attributes
- Fifteen synthetic trees, 2.2k to 500k leaves
- Noise from 0% to 30%
20 Experiment Results
Accuracy as a function of the number of training examples
21 Experiment Results
Tree size as a function of the number of training examples
22 Mining Time-Changing Data Streams
- Most KDD systems, including VFDT, assume the training data is a sample drawn from a stationary distribution
- Most large databases and data streams violate this assumption
- Concept drift: data is generated by a time-changing concept function, e.g.
- Seasonal effects
- Economic cycles
- Goals
- Mine continuously changing data streams
- Scale well
23 Window Approach
- Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples (a minimal sketch follows)
- Sensitive to the window size w
- A w that is small relative to the concept-shift rate assures a model that reflects the current concept
- But too small a w may leave too few examples to learn the concept
- If examples arrive at a rapid rate or the concept changes quickly, the computational cost of repeatedly reapplying the learner may be prohibitively high
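For contrast with CVFDT, a minimal sketch of this naive sliding-window scheme; `learner` is assumed to be any batch algorithm exposing a fit(examples) method, and all names are illustrative.

```python
from collections import deque

def sliding_window_mining(stream, learner, w=100_000, retrain_every=1_000):
    """Keep the w most recent examples and periodically rebuild the model
    from scratch: O(w) work per retraining, hence costly on fast streams."""
    window = deque(maxlen=w)          # the oldest examples fall off automatically
    model, seen = None, 0
    for example in stream:
        window.append(example)
        seen += 1
        if seen % retrain_every == 0:
            model = learner.fit(list(window))
        yield model
```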
24 CVFDT
25 CVFDT
- CVFDT (Concept-adapting Very Fast Decision Tree learner)
- Extends VFDT
- Maintains VFDT's speed and accuracy
- Detects and responds to changes in the example-generating process
26 Observations
- With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
- An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
- Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date.
- Periodically use a batch of examples to evaluate the quality of the trees.
- Replace the old subtree when the alternate one becomes more accurate.
27 CVFDT algorithm
- Alternate trees for each node in HT start as empty.
- Process examples from the stream indefinitely. For each example (x, y):
- Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
- Add (x, y) to the sliding window of examples.
- Remove and forget the effect of the oldest examples if the sliding window overflows.
- Call CVFDTGrow.
- Call CheckSplitValidity if f examples have been seen since the last check of the alternate trees.
- Return HT (a skeleton of this loop is sketched below).
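A skeleton of this loop in Python; the helpers are stubs here and are sketched on the following slides. Class, method, and parameter names are illustrative, not the authors' code.

```python
from collections import deque

class CVFDTSketch:
    """Skeleton of the CVFDT main loop (illustrative only)."""

    def __init__(self, window_size=100_000, f=10_000):
        self.window = deque()        # sliding window of (example, max leaf ID it reached)
        self.window_size = window_size
        self.f = f                   # run CheckSplitValidity every f examples
        self.seen = 0

    def process(self, x, y):
        leaf_ids = self.sort_to_leaves(x, y)          # through HT and all alternate trees
        self.window.append(((x, y), max(leaf_ids)))
        if len(self.window) > self.window_size:       # window overflow: forget the oldest
            (old_x, old_y), max_id = self.window.popleft()
            self.forget_example(old_x, old_y, max_id)
        self.grow(x, y)                               # CVFDTGrow
        self.seen += 1
        if self.seen % self.f == 0:
            self.check_split_validity()

    # Placeholders; later slides sketch what these do.
    def sort_to_leaves(self, x, y): return [0]
    def forget_example(self, x, y, max_id): pass
    def grow(self, x, y): pass
    def check_split_validity(self): pass
```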
28 CVFDT algorithm: process each example
29 CVFDT algorithm: process each example
30 CVFDTGrow
- For each node reached by the example in HT:
- Increment the corresponding statistics at the node.
- For each alternate tree Talt of the node, call CVFDTGrow on Talt as well.
- If enough examples have been seen at the leaf in HT that the example reaches:
- Choose the attribute with the highest value of the attribute evaluation measure (information gain or Gini index).
- If the best attribute is not the null attribute, create a node for each possible value of this attribute (a rough sketch follows).
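A rough sketch of the grow step over a toy node structure; `try_split` is assumed to run the same Hoeffding test VFDT uses, and all names are illustrative.

```python
class SketchNode:
    """Toy node used only for these sketches (not the papers' data structure)."""
    def __init__(self, node_id=0):
        self.node_id = node_id      # unique, monotonically increasing creation ID
        self.counts = {}            # sufficient statistics n_ijk, kept at EVERY node
        self.alternates = []        # alternate subtrees rooted at this node
        self.children = {}          # test outcome -> child node (empty at a leaf)
        self.test_attribute = None
        self.examples_seen = 0

def cvfdt_grow(node, x, y, try_split, n_min=200):
    """Walk the example's path, updating statistics at every node, recursing
    into alternate trees, and attempting a split once the leaf is reached."""
    while node is not None:
        node.examples_seen += 1
        for a, v in x.items():                        # update the n_ijk counts
            node.counts[(a, v, y)] = node.counts.get((a, v, y), 0) + 1
        for alt in node.alternates:                   # alternate trees learn from it too
            cvfdt_grow(alt, x, y, try_split, n_min)
        if not node.children:                         # reached a leaf of this (sub)tree
            if node.examples_seen % n_min == 0:
                try_split(node)                       # may replace the leaf with a test node
            return
        node = node.children.get(x[node.test_attribute])   # follow the example's branch
```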
31 CVFDT algorithm: process each example
32 Forget old example
- Maintain the sufficient statistics at every node in HT, to monitor the validity of its previous decisions.
- VFDT only maintains such statistics at leaves.
- HT might have grown or changed since the example was initially incorporated.
- Assign each node a unique, monotonically increasing ID as it is created.
- forgetExample(HT, example, maxID):
- For each node reached by the old example whose node ID is no larger than maxID (the maximum leaf ID the example reached when it was added):
- Decrement the corresponding statistics at the node.
- For each alternate tree Talt of the node, call forgetExample(Talt, example, maxID) (a rough sketch follows).
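A rough sketch of the forgetting step, reusing the toy SketchNode layout from the CVFDTGrow sketch; names are illustrative.

```python
def forget_example(node, x, y, max_id):
    """Undo the old example's effect on every node it touched when it was
    added; nodes created later (ID greater than max_id) never counted it."""
    while node is not None and node.node_id <= max_id:
        node.examples_seen -= 1
        for a, v in x.items():                        # decrement the n_ijk counts
            node.counts[(a, v, y)] -= 1
        for alt in node.alternates:                   # alternate trees forget it too
            forget_example(alt, x, y, max_id)
        if not node.children:
            return
        node = node.children.get(x[node.test_attribute])
```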
33 CVFDT algorithm: process each example
34 CheckSplitValidity
- Periodically scan the internal nodes of HT.
- Start a new alternate tree when a new winning attribute is found at a node (a rough sketch follows).
- Use tighter criteria than the original split decision, to avoid excessive alternate-tree creation.
- Limit the total number of alternate trees.
35 Smoothly adjust to concept drift
- Alternate trees are grown the same way HT is.
- Periodically, each node with non-empty alternate trees enters a testing mode.
- The next M training examples are used to compare the accuracy of the node's subtree with the accuracies of its alternate trees (a rough sketch follows).
- Prune alternate trees whose accuracy does not increase over time.
- Replace the node's subtree if an alternate tree is more accurate.
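A rough sketch of the test phase for one node, again over the toy structure; `accuracy(tree, examples)` is an assumed helper, and the in-place replacement is a simplification of swapping the subtree.

```python
def test_phase(node, test_examples, accuracy):
    """Compare the node's subtree against each alternate on the next M
    examples; replace it if an alternate wins, prune alternates that stall."""
    current = accuracy(node, test_examples)
    for alt in list(node.alternates):
        alt_acc = accuracy(alt, test_examples)
        if alt_acc > current:
            node.__dict__.update(alt.__dict__)        # crude in-place subtree replacement
            return
        if alt_acc <= getattr(alt, "best_seen", -1.0):
            node.alternates.remove(alt)               # accuracy no longer improving: prune
        else:
            alt.best_seen = alt_acc                   # remember its best accuracy so far
```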
36 Adjust to concept drift (2)
- Dynamically change the window size
- Shrink the window when many nodes become questionable or the data rate changes rapidly.
- Increase the window size when few nodes are questionable.
37 Performance
- Required memory: O(nodes × attributes × attribute values × classes).
- Independent of the total number of examples.
- Running time per example: O(Lc × attributes × attribute values × classes).
- Lc: the length of the longest path an example can take through the tree, times the number of alternate trees.
- Model learned by CVFDT vs. the one learned by VFDT-Window:
- Similar in accuracy
- O(1) vs. O(window size) work per new example.
38 Experiment Results
- Compared CVFDT, VFDT, and VFDT-Window
- 5 million training examples
- The concept changed every 50k examples
- Drift level: the average percentage of test points that change label at each concept change
- About 8% of test points change label at each drift
- 100,000 examples in the window
- 5% noise
- Tested the model every 10k examples throughout the run and averaged the results
39 Experiment Results (CVFDT vs. VFDT)
Error rate as a function of the number of attributes
40 Experiment Results (CVFDT vs. VFDT)
Tree size as a function of the number of attributes
41 Experiment Results (CVFDT vs. VFDT)
Error rates of the learners as a function of the number of examples seen
42 Experiment Results (CVFDT vs. VFDT)
Error rates as a function of the amount of concept drift
43 Experiment Results
CVFDT's drift characteristics
44 Experiment Results (CVFDT vs. VFDT vs. VFDT-Window)
- Error rate
- VFDT: 19.4%
- CVFDT: 16.3%
- VFDT-Window: 15.3%
- Running time
- VFDT: 10 minutes
- CVFDT: 46 minutes
- VFDT-Window: an expected 548 days
Error rates over time of CVFDT, VFDT, and VFDT-Window
45 Experiment Results
- CVFDT does not use too much RAM
- At D = 50, CVFDT never uses more than 70 MB
- Uses as little as half the RAM of VFDT
- VFDT often had twice as many leaves as the number of nodes in CVFDT's HT and alternate subtrees combined
- Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for its earlier wrong decisions caused by concept drift
46 Conclusions
- CVFDT: a decision-tree induction system capable of learning accurate models from high-speed, concept-drifting data streams
- Grows an alternate subtree whenever an old one becomes questionable
- Replaces the old subtree when the new one is more accurate
- Similar in accuracy to applying VFDT to a moving window of examples
47 Future Work
- Concepts that change periodically: removed subtrees may become useful again
- Comparisons with related systems
- Continuous attributes
- Weighting examples
48 Reference List
- P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
- G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
- V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000.
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
49 The End
50 Thank You!