Title: Mining Decision Trees from Data Streams


1
Mining Decision Trees from Data Streams
  • Tong Suk Man Ivy
  • CSIS DB Seminar
  • February 12, 2003

2
Contents
  • Introduction: problems in mining data streams
  • Classification of stream data
  • VFDT algorithm
  • Window approach
  • CVFDT algorithm
  • Experimental results
  • Conclusions
  • Future work

3
Data Streams
  • Characteristics
  • Large volume of ordered data points, possibly
    infinite
  • Arrive continuously
  • Fast changing
  • Appropriate model for many applications
  • Phone call records
  • Network and security monitoring
  • Financial applications (stock exchange)
  • Sensor networks

4
Problems in Mining Data Streams
  • Traditional data mining techniques usually
    require
  • Entire data set to be present
  • Random access (or multiple passes) to the data
  • Much time per data item
  • Challenges of stream mining
  • Impractical to store the whole data
  • Random access is expensive
  • Only simple, fast calculations per data item are
    possible, due to time and space constraints

5
Classification of Stream Data
  • VFDT algorithm
  • Mining High-Speed Data Streams, KDD 2000.
  • Pedro Domingos, Geoff Hulten
  • CVFDT algorithm (window approach)
  • Mining Time-Changing Data Streams, KDD 2001.
  • Geoff Hulten, Laurie Spencer, Pedro Domingos

6
Hoeffding Trees
7
Definitions
  • A classification problem is defined as
  • N is a set of training examples of the form (x,
    y)
  • x is a vector of d attributes
  • y is a discrete class label
  • Goal: to produce from the examples a model y = f(x)
    that predicts the class y for future examples x
    with high accuracy

8
Decision Tree Learning
  • One of the most effective and widely-used
    classification methods
  • Induce models in the form of decision trees
  • Each internal node contains a test on an attribute
  • Each branch from a node corresponds to a possible
    outcome of the test
  • Each leaf contains a class prediction
  • A decision tree is learned by recursively
    replacing leaves by test nodes, starting at the
    root

9
Challenges
  • Classic decision tree learners assume all
    training data can be simultaneously stored in
    main memory
  • Disk-based decision tree learners repeatedly read
    training data from disk sequentially
  • Prohibitively expensive when learning complex
    trees
  • Goal: design decision tree learners that read
    each example at most once, and use a small
    constant amount of time to process it

10
Key Observation
  • In order to find the best attribute at a node, it
    may be sufficient to consider only a small subset
    of the training examples that pass through that
    node.
  • Given a stream of examples, use the first ones to
    choose the root attribute.
  • Once the root attribute is chosen, the successive
    examples are passed down to the corresponding
    leaves, and used to choose the attribute there,
    and so on recursively.
  • Use the Hoeffding bound to decide how many
    examples are enough at each node

11
Hoeffding Bound
  • Consider a real-valued random variable a whose
    range is R
  • Suppose we have n independent observations of a,
    with observed mean ā
  • The Hoeffding bound states
  • With probability 1 - δ, the true mean of a is at
    least ā - ε, where
    ε = sqrt( R² ln(1/δ) / (2n) )
    (evaluated numerically below)
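
A worked evaluation of the bound (not part of the original slides; plain Python, with parameter values chosen only for illustration):

    import math

    def hoeffding_bound(R, delta, n):
        # Epsilon such that, with probability 1 - delta, the true mean of a
        # variable with range R is within epsilon of the mean of n observations.
        return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

    # Information gain with two classes has range R = log2(2) = 1.
    print(hoeffding_bound(R=1.0, delta=1e-7, n=200))    # ~0.201
    print(hoeffding_bound(R=1.0, delta=1e-7, n=5000))   # ~0.040

Note how the bound shrinks as more observations arrive; this is what lets a node accumulate examples until a split decision is safe.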

12
How many examples are enough?
  • Let G(Xi) be the heuristic measure used to choose
    test attributes (e.g. information gain, Gini
    index)
  • Xa: the attribute with the highest evaluation
    value after seeing n examples
  • Xb: the attribute with the second-highest
    evaluation value after seeing n examples
  • Given a desired δ, if ΔG = G(Xa) - G(Xb) > ε
    after seeing n examples at a node,
  • the Hoeffding bound guarantees that the true
    ΔG > 0, i.e. Xa really is the better attribute,
    with probability 1 - δ
  • The node can then be split using Xa, and the
    succeeding examples will be passed to the new
    leaves (a worked split check follows below)
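
A hedged numeric example of this split test (the gain values 0.42 and 0.32 are invented for illustration; hoeffding_bound is the helper from the sketch above):

    import math

    def hoeffding_bound(R, delta, n):
        return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

    # Hypothetical observed gains of the two best attributes after n examples.
    g_a, g_b = 0.42, 0.32
    epsilon = hoeffding_bound(R=1.0, delta=1e-7, n=5000)   # ~0.040
    # True here, so the leaf can be split on Xa: with probability 1 - delta,
    # Xa really is the better attribute.
    print(g_a - g_b > epsilon)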

13
Algorithm
  • Calculate the information gain for the attributes
    and determine the best two attributes
  • Pre-pruning: also consider a null attribute that
    corresponds to not splitting the node
  • At each leaf, check the condition G(Xa) - G(Xb) > ε
  • If the condition is satisfied, create child nodes
    based on the test at the node
  • If not, stream in more examples and repeat the
    calculation until the condition is satisfied
    (a simplified implementation is sketched below)
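
Below is a simplified, hedged sketch of the resulting Hoeffding tree learner, assuming binary (0/1) attributes and information gain; the names Leaf, try_split and hoeffding_tree, and the n_min = 200 check period, are illustrative choices rather than the paper's code:

    import math
    from collections import defaultdict

    def hoeffding_bound(R, delta, n):
        return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def entropy(counts):
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

    class Leaf:
        def __init__(self, attrs):
            self.attrs = attrs                    # candidate attribute indices
            self.n = 0
            self.class_counts = defaultdict(int)  # y -> count
            self.stats = defaultdict(int)         # (attr, value, y) -> count
            self.children = None                  # value -> subtree, once split
            self.split_attr = None

        def update(self, x, y):
            self.n += 1
            self.class_counts[y] += 1
            for a in self.attrs:
                self.stats[(a, x[a], y)] += 1

        def gain(self, a):                        # information gain of attribute a
            g = entropy(self.class_counts)
            for v in (0, 1):                      # binary attributes assumed
                counts = {y: self.stats.get((a, v, y), 0) for y in self.class_counts}
                nv = sum(counts.values())
                if nv:
                    g -= (nv / self.n) * entropy(counts)
            return g

        def try_split(self, delta):
            if len(self.attrs) < 2:
                return
            (g2, _), (g1, a1) = sorted((self.gain(a), a) for a in self.attrs)[-2:]
            # The g1 > 0 test stands in for the "null" (do-not-split) attribute.
            if g1 > 0 and g1 - g2 > hoeffding_bound(1.0, delta, self.n):
                self.split_attr = a1
                rest = [a for a in self.attrs if a != a1]
                self.children = {0: Leaf(rest), 1: Leaf(rest)}

    def hoeffding_tree(stream, n_attrs, delta=1e-7, n_min=200):
        root = Leaf(list(range(n_attrs)))
        for x, y in stream:                       # a single pass over the stream
            leaf = root
            while leaf.children is not None:      # sort the example to a leaf
                leaf = leaf.children[x[leaf.split_attr]]
            leaf.update(x, y)
            if leaf.n % n_min == 0:               # check for a split periodically
                leaf.try_split(delta)
        return root

Here stream is any iterable of (x, y) pairs in which x is a sequence of 0/1 attribute values and y is a discrete class label.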

14
(No Transcript)
15
Performance Analysis
  • p = probability that an example passed through the
    tree to level i will fall into a leaf at that
    point
  • The expected disagreement between the tree
    produced by the Hoeffding tree algorithm and that
    produced using infinite examples at each node is
    no greater than δ/p
  • Required memory: O(leaves × attributes × values ×
    classes)

16
VFDT
17
VFDT (Very Fast Decision Tree)
  • A decision-tree learning system based on the
    Hoeffding tree algorithm
  • Ties: split on the current best attribute once ε
    falls below a user-specified threshold τ, since it
    is wasteful to keep deciding between nearly
    identical attributes
  • Compute G and check for a split only once every
    n_min examples (split check sketched after this
    list)
  • Memory management
  • Memory dominated by sufficient statistics
  • Deactivate or drop less promising leaves when
    needed
  • Bootstrap with traditional learner
  • Rescan old data when time available
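
A hedged sketch of the modified split check described above (the default values of delta and tau are illustrative; the settings actually used in the experiments appear on a later slide):

    import math

    def hoeffding_bound(R, delta, n):
        return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

    def vfdt_should_split(g_best, g_second, n, R=1.0, delta=1e-7, tau=0.05):
        # Split when the observed gap beats the bound, or break a near-tie
        # once the bound itself has shrunk below the user threshold tau.
        eps = hoeffding_bound(R, delta, n)
        return (g_best - g_second > eps) or (eps < tau)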

18
VFDT(2)
  • Scales better than pure memory-based or pure
    disk-based learners
  • Access data sequentially
  • Use subsampling to potentially require much less
    than one scan
  • VFDT is incremental and anytime
  • New examples can be quickly incorporated as they
    arrive
  • A usable model is available after the first few
    examples, and is then progressively refined

19
Experiment Results (VFDT vs. C4.5)
  • Compared VFDT and C4.5 (Quinlan, 1993)
  • Same memory limit for both (40 MB)
  • 100k examples for C4.5
  • VFDT settings: δ = 10^-7, τ = 5%, n_min = 200
  • Domains: 2 classes, 100 binary attributes
  • Fifteen synthetic trees with 2.2k to 500k leaves
  • Noise levels from 0% to 30%

20
Experiment Results
Accuracy as a function of the number of training
examples
21
Experiment Results
Tree size as a function of number of training
examples
22
Mining Time-Changing Data Stream
  • Most KDD systems, including VFDT, assume the
    training data is a sample drawn from a stationary
    distribution
  • Most large databases and data streams violate this
    assumption
  • Concept drift: data is generated by a
    time-changing concept function, e.g.
  • Seasonal effects
  • Economic cycles
  • Goal
  • Mining continuously changing data streams
  • Scale well

23
Window Approach
  • Common approach: when a new example arrives,
    reapply a traditional learner to a sliding window
    of the w most recent examples (sketched below)
  • Sensitive to the window size
  • If w is small relative to the concept shift rate,
    this assures the availability of a model that
    reflects the current concept
  • But too small a w may leave too few examples to
    learn the concept
  • If examples arrive at a rapid rate or the concept
    changes quickly, the computational cost of
    reapplying a learner may be prohibitively high.
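
A hedged sketch of this baseline, mainly to make the cost visible (window_learner and the train callable are illustrative names, not an API from the papers):

    from collections import deque

    def window_learner(stream, w, train):
        # Keep the w most recent examples and retrain a conventional batch
        # learner from scratch every time a new example arrives.
        window = deque(maxlen=w)         # the oldest example drops out automatically
        for example in stream:
            window.append(example)
            model = train(list(window))  # O(w) or worse work per new example
            yield model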

24
CVFDT
25
CVFDT
  • CVFDT (Concept-adapting Very Fast Decision Tree
    learner)
  • Extend VFDT
  • Maintain VFDT's speed and accuracy
  • Detect and respond to changes in the
    example-generating process

26
Observations
  • With a time-changing concept, the current
    splitting attribute of some nodes may not be the
    best any more.
  • An outdated subtree may still be better than the
    best single leaf, particularly if it is near the
    root.
  • Grow an alternative subtree with the new best
    attribute at its root, when the old attribute
    seems out-of-date.
  • Periodically use a batch of examples to compare
    the accuracy of each alternate subtree with that
    of the original.
  • Replace the old subtree when the alternate becomes
    more accurate.

27
CVFDT algorithm
  • Alternate trees for each node in HT start as
    empty.
  • Process examples from the stream indefinitely.
    For each example (x, y),
  • Pass (x, y) down to a set of leaves using HT and
    all alternate trees of the nodes (x, y) passes
    through.
  • Add (x, y) to the sliding window of examples.
  • Remove and forget the effect of the oldest
    examples, if the sliding window overflows.
  • Call CVFDTGrow to incorporate (x, y) into HT.
  • Call CheckSplitValidity if f examples have been
    seen since the last check of the alternate trees.
  • Return HT (the outer loop is sketched below).
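
A structural sketch of that outer loop (the three sub-procedures are passed in as callables; their real signatures in the CVFDT paper differ, so this shows only the control flow):

    from collections import deque

    def cvfdt(stream, HT, w, f, cvfdt_grow, forget_example, check_split_validity):
        window = deque()
        seen = 0
        for x, y in stream:
            window.append((x, y))
            if len(window) > w:                   # sliding window overflows:
                oldest = window.popleft()         # drop the oldest example and
                forget_example(HT, oldest)        # subtract its statistics
            cvfdt_grow(HT, (x, y))                # update counts, maybe split
            seen += 1
            if seen % f == 0:                     # every f examples,
                check_split_validity(HT)          # re-check the internal splits
        return HT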

28
CVFDT algorithm: process each example (flowchart)
29
CVFDT algorithm: process each example (flowchart, continued)
30
CVFDTGrow
  • For each node reached by the example in HT,
  • Increment the corresponding statistics at the
    node.
  • For each alternate tree Talt of the node, call
    CVFDTGrow on Talt recursively.
  • If enough examples have been seen at the leaf in
    HT which the example reaches,
  • Choose the attribute with the highest average
    value of the attribute evaluation measure
    (information gain or Gini index).
  • If the best attribute is not the null attribute,
    create a child node for each possible value of
    this attribute (a sketch follows below).
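
A hedged sketch of CVFDTGrow (the node attributes and helpers used here, such as update_statistics, alternate_trees, is_leaf, child_for, should_split, best_attribute and make_children, are assumptions made for the sketch, not the paper's API):

    def cvfdt_grow(node, example, should_split, best_attribute, make_children):
        x, y = example
        while node is not None:
            node.update_statistics(x, y)          # CVFDT keeps stats at every node
            for alt in node.alternate_trees:      # grow the alternates in parallel
                cvfdt_grow(alt, example, should_split, best_attribute, make_children)
            if node.is_leaf():
                if should_split(node):            # Hoeffding-bound check
                    attr = best_attribute(node)   # highest info gain / Gini value
                    if attr is not None:          # None plays the null-attribute role
                        make_children(node, attr)
                return
            node = node.child_for(x)              # descend toward a leaf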

31
CVFDT algorithm: process each example (flowchart, continued)
32
Forget old example
  • Maintain the sufficient statistics at every node
    in HT to monitor the validity of its previous
    decisions.
  • VFDT only maintains such statistics at leaves.
  • HT might have grown or changed since the example
    was initially incorporated.
  • Assign each node a unique, monotonically
    increasing ID as it is created.
  • forgetExample(HT, example, maxID)
  • For each node reached by the old example whose
    node ID is no larger than maxID (the maximum leaf
    ID the example reached when it arrived),
  • Decrement the corresponding statistics at the
    node.
  • For each alternate tree Talt of the node, call
    forget(Talt, example, maxID). (A sketch follows
    below.)
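
A hedged sketch of forgetExample, reusing the assumed node interface from the grow sketch above (node_id and decrement_statistics are likewise assumptions):

    def forget_example(node, example, max_id):
        x, y = example
        # Nodes created after the example arrived have IDs greater than max_id,
        # so the example never contributed to their statistics: stop there.
        while node is not None and node.node_id <= max_id:
            node.decrement_statistics(x, y)       # undo the earlier increment
            for alt in node.alternate_trees:
                forget_example(alt, example, max_id)
            node = None if node.is_leaf() else node.child_for(x)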

33
CVFDT algorithm: process each example (flowchart, continued)
34
CheckSplitValidity
  • Periodically scans the internal nodes of HT.
  • Starts a new alternate tree when a new winning
    attribute is found at a node.
  • Uses a tighter criterion than the original split
    test to avoid excessive alternate tree creation.
  • Limits the total number of alternate trees.
    (A sketch follows below.)
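
A hedged sketch of that scan, again with an assumed node interface (children is taken to be an iterable of child subtrees, split_attribute the attribute the node currently tests); the cap of five alternate trees per node is illustrative only:

    def check_split_validity(node, best_attribute, start_alternate, max_alternates=5):
        if node is None or node.is_leaf():
            return
        # Re-evaluate the split from the node's current statistics; the tighter
        # split criterion is assumed to live inside best_attribute.
        winner = best_attribute(node)
        if (winner is not None and winner != node.split_attribute
                and len(node.alternate_trees) < max_alternates):
            node.alternate_trees.append(start_alternate(node, winner))
        for child in node.children:
            check_split_validity(child, best_attribute, start_alternate, max_alternates)
        for alt in node.alternate_trees:
            check_split_validity(alt, best_attribute, start_alternate, max_alternates)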

35
Smoothly adjust to concept drift
  • Alternate trees are grown the same way HT is.
  • Periodically, each node with non-empty alternate
    trees enters a testing mode.
  • The next m training examples are used to compare
    the accuracy of the node's subtree with that of
    its alternates.
  • Prune alternate trees whose accuracy is not
    increasing over time.
  • Replace the subtree if an alternate tree is more
    accurate. (A sketch follows below.)
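
A hedged sketch of the test mode (predict, replace_with and the last_scores bookkeeping are assumptions made for the sketch):

    def accuracy(tree, examples):
        return sum(tree.predict(x) == y for x, y in examples) / len(examples)

    def test_alternates(node, test_examples, last_scores):
        own = accuracy(node, test_examples)       # accuracy of the current subtree
        kept = []
        for alt in node.alternate_trees:
            score = accuracy(alt, test_examples)
            if score > own:                       # the alternate has overtaken it:
                node.replace_with(alt)            # swap it in and stop testing
                return
            if score > last_scores.get(id(alt), 0.0):
                kept.append(alt)                  # still improving, so keep it
            last_scores[id(alt)] = score
        node.alternate_trees = kept               # prune non-improving alternates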

36
Adjust to concept drift(2)
  • Dynamically change the window size
  • Shrink the window when many nodes become
    questionable or the data rate changes rapidly.
  • Increase the window size when few nodes are
    questionable. (A heuristic is sketched below.)
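
A hedged sketch of such a rule (the thresholds, factors and window bounds are purely illustrative):

    def adjust_window_size(w, questionable_nodes, total_nodes,
                           w_min=10_000, w_max=200_000):
        fraction = questionable_nodes / max(total_nodes, 1)
        if fraction > 0.10:              # many questionable nodes: fast drift,
            return max(w // 2, w_min)    # so shrink the window
        if fraction < 0.01:              # few questionable nodes: stable concept,
            return min(w * 2, w_max)     # so grow the window
        return w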

37
Performance
  • Required memory: O(nodes × attributes × attribute
    values × classes).
  • Independent of the total number of examples.
  • Running time per example: O(Lc × attributes ×
    attribute values × classes).
  • Lc = the length of the longest path an example
    passes through, times the number of alternate
    trees.
  • Model learned by CVFDT vs. the one learned by
    VFDT-Window
  • Similar in accuracy
  • O(1) vs. O(window size) work per new example.

38
Experiment Results
  • Compare CVFDT, VFDT, and VFDT-Window
  • 5 million training examples
  • Concept changed every 50k examples
  • Drift level: the average percentage of test points
    that change label at each concept change
  • About 8% of test points change label at each drift
  • 100,000 examples in the window
  • 5% noise
  • Tested the model every 10k examples throughout the
    run and averaged these results

39
Experiment Results (CVFDT vs. VFDT)
Error rate as a function of number of attributes
40
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of number of attributes
41
Experiment Results (CVFDT vs. VFDT)
Error rates of learners as a function of the
number of examples seen
42
Experiment Results (CVFDT vs. VFDT)
Error rates as a function of the amount of
concept drift
43
Experiment Results
CVFDT's drift characteristics
44
Experiment Results (CVFDT vs. VFDT vs.
VFDT-window)
  • Error rate
  • VFDT: 19.4%
  • CVFDT: 16.3%
  • VFDT-Window: 15.3%
  • Running time
  • VFDT: 10 minutes
  • CVFDT: 46 minutes
  • VFDT-Window: an estimated 548 days

Error rates over time of CVFDT, VFDT, and
VFDT-window
45
Experiment Results
  • CVFDT does not use too much RAM
  • At D = 50, CVFDT never uses more than 70 MB
  • Uses as little as half the RAM of VFDT
  • VFDT often had twice as many leaves as the number
    of nodes in CVFDT's HT and alternate subtrees
    combined
  • Reason: VFDT considers many more outdated
    examples and is forced to grow larger trees to
    make up for its earlier wrong decisions caused by
    concept drift

46
Conclusions
  • CVFDT: a decision-tree induction system capable
    of learning accurate models from high-speed,
    concept-drifting data streams
  • Grows an alternate subtree whenever an old one
    becomes questionable
  • Replaces the old subtree when the new one is more
    accurate
  • Similar in accuracy to applying VFDT to a moving
    window of examples

47
Future Work
  • Handling concepts that change periodically, where
    removed subtrees may become useful again
  • Comparisons with related systems
  • Continuous attributes
  • Weighting examples

48
Reference List
  • P. Domingos and G. Hulten. Mining high-speed data
    streams. In Proceedings of the Sixth ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining, 2000.
  • G. Hulten, L. Spencer, and P. Domingos. Mining
    time-changing data streams. In Proceedings of the
    Seventh ACM SIGKDD International Conference on
    Knowledge Discovery and Data Mining, 2001.
  • V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON:
    Mining and monitoring evolving data. In
    Proceedings of the Sixteenth International
    Conference on Data Engineering, 2000.
  • J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y.
    Loh. BOAT: optimistic decision tree construction.
    In Proceedings of the 1999 ACM SIGMOD
    International Conference on Management of Data,
    1999.

49
The end
  • Q & A

50
Thank You!