Title: Data Mining, Session 4: Scaling up decision tree learners
1. Data Mining, Session 4: Scaling up decision tree learners
- Luc Dehaspe
- K.U.L. Computer Science Department
2. Data Mining process model: CRISP-DM
3. Course overview
- Sessions 2-3: Data preparation
- Data Mining
4. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000. Chapter 4.3, Divide-and-Conquer: Constructing Decision Trees.
F. Provost and V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 2:131-169, 1999.
A. Srivastava, E. Han, V. Kumar, V. Singh. Parallel Formulations of Decision-Tree Classification Algorithms. Data Mining and Knowledge Discovery, 3:237-261, 2000.
5. Classification
- the process of assigning new objects to predefined categories or classes
- given a set of labeled records:
  - build a model (decision tree)
  - predict labels for future unlabeled records
6. Decision trees
7. Decision trees: The Weka tool

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/
8. Decision trees: Attribute selection

http://www-lmmb.ncifcrf.gov/toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
9. Decision trees: Attribute selection

[Diagram: the entropy of the full training set is 0.94 bits, the amount of information required to specify the class of an example given that it reaches the node. The sunny branch holds 5/14 of the examples, with entropy 0.97 bits; outlook gives the maximal information gain of 0.25 bits.]
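The figures on the slide can be recomputed with a short script (a sketch in Python, not part of the original slides; the 14-example weather data is taken from the Weka listing above):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Bits needed to specify the class of an example drawn from labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values() if c)

def info_gain(rows, attr_index, labels):
    """Entropy reduction achieved by splitting on the given attribute."""
    n = len(labels)
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# the 14-example weather data: (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny","hot","high","FALSE","no"), ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"), ("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"), ("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"), ("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"), ("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"), ("rainy","mild","high","TRUE","no"),
]
rows = [d[:4] for d in data]
labels = [d[4] for d in data]

print(round(entropy(labels), 2))             # 0.94 bits at the root
print(round(info_gain(rows, 0, labels), 2))  # gain of outlook: 0.25 bits
```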
10. Decision trees: Building

[Diagram: the root is split on outlook into sunny, overcast, and rainy branches; the impure sunny branch has entropy 0.97 bits, and the attribute with maximal information gain is chosen to split it.]
11. Decision trees: Building

[Diagram: the sunny branch (entropy 0.97 bits) is split on humidity into high and normal.]
12. Decision trees: Final tree

[Diagram: outlook at the root; the sunny branch splits on humidity (high, normal), the rainy branch splits on windy (false, true), and overcast is a leaf.]
13. Decision trees: Basic algorithm
- Initialize the top node to all examples
- While impure leaves are available:
  - select the next impure leaf L
  - find the splitting attribute A with maximal information gain
  - for each value of A, add a child to L
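The loop above can be sketched as a recursive divide-and-conquer procedure (a minimal Python sketch; attribute indices stand in for names, and ties between attributes go to the first one):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Divide and conquer: split on the attribute with maximal information
    gain until every leaf is pure (or no attributes remain)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class
    def remainder(a):  # expected entropy after splitting on attribute a
        return sum(len(sub) / len(labels) * entropy(sub)
                   for v in {r[a] for r in rows}
                   for sub in [[l for r, l in zip(rows, labels) if r[a] == v]])
    best = min(attrs, key=remainder)  # maximal gain = minimal remainder
    children = {}
    for v in {r[best] for r in rows}:
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = build_tree([rows[i] for i in keep],
                                 [labels[i] for i in keep],
                                 [a for a in attrs if a != best])
    return (best, children)

# toy run on two attributes (0 = outlook, 1 = humidity):
tree = build_tree([("sunny", "high"), ("sunny", "normal"), ("overcast", "high")],
                  ["no", "yes", "yes"], [0, 1])
```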
14. Decision trees: Finding a good split
- The sufficient statistics needed to compute information gain form a count matrix: class counts per attribute value
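Because the count matrix is sufficient, the gain can be computed without revisiting the examples. A sketch in Python, using the outlook counts from the weather data (the matrix layout is mine):

```python
from math import log2

# count matrix for outlook on the 14-example weather data:
# one row per attribute value, one column per class
counts = {
    "sunny":    {"yes": 2, "no": 3},
    "overcast": {"yes": 4, "no": 0},
    "rainy":    {"yes": 3, "no": 2},
}

def entropy(dist):
    n = sum(dist)
    return -sum(c / n * log2(c / n) for c in dist if c)

def gain_from_counts(counts):
    """Information gain from the count matrix alone; the raw examples
    are not needed once the counts are known."""
    class_totals = {}
    for row in counts.values():
        for cls, c in row.items():
            class_totals[cls] = class_totals.get(cls, 0) + c
    n = sum(class_totals.values())
    remainder = sum(sum(row.values()) / n * entropy(row.values())
                    for row in counts.values())
    return entropy(class_totals.values()) - remainder

print(round(gain_from_counts(counts), 2))  # 0.25
```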
15. Decision trees: ID3 - C4.5 - C5 (Quinlan)
- Simple depth-first construction
- Needs the entire data set to fit in memory
- Unsuitable for large data sets
- Need to scale up
16. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
17. Scaling up: Why?
- Terabyte databases exist (e.g., 100 M records x 2000 fields x 5 bytes)
- Increasing the size of the training set often increases the accuracy of the learned classification models
- Overfitting with small data sets due to:
  - small disjuncts
  - large numbers of features (sparsely populated model spaces)
- In a discovery setting, special cases should occur frequently enough
- Other motivations for fast algorithms: interaction, cross-validation, multiple models
18. Scaling up: What counts as "very large"?
- database practitioner: 100 GB
- data mining: 100 MB - 1 GB
- "Somewhere around data sizes of 100 MB or so, qualitatively new, very serious scaling problems begin to arise, both on the human and on the algorithmic side." (Huber, 1997)
19. Scaling up: What?
- pragmatic: turning an impractical algorithm into a practical one; how large a problem can you deal with?
- time complexity: what is the growth rate of the algorithm's run time with an increasing number of
  - examples
  - attributes per example
  - values per attribute
- space complexity: the main memory limitation
- no substantial loss of accuracy
20. Scaling up: Methods
- Fast algorithms
  - algorithm/programming optimizations
  - parallelization
- Data partitioning (horizontal, vertical)
- Relational representations
  - e.g., a 3-table database (5 bytes/value):
    - customer table: 1 million customers, 20 attributes
    - state table: 50 states, 80 attributes
    - product table: 10 products, 400 attributes
  - 100 MB database -> 2.5 GB as a single flattened table
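The slide's arithmetic can be reconstructed as follows (the table and attribute counts are from the slide; the assumption that each flattened customer row repeats one state's and one product's attributes is mine):

```python
BYTES_PER_VALUE = 5
customers, cust_attrs = 1_000_000, 20
states, state_attrs = 50, 80
products, prod_attrs = 10, 400

# relational representation: each table is stored once
relational = (customers * cust_attrs + states * state_attrs
              + products * prod_attrs) * BYTES_PER_VALUE

# flattened single table: every customer row carries all
# 20 + 80 + 400 = 500 attribute values
flat = customers * (cust_attrs + state_attrs + prod_attrs) * BYTES_PER_VALUE

print(relational / 1e6, "MB")  # ~100 MB
print(flat / 1e9, "GB")        # 2.5 GB
```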
21. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
22. Decision trees, algorithm optimizations: SPRINT (Shafer, Agrawal, Mehta, 1996)

[Diagram: the horizontal data format is decomposed into one attribute list per attribute.]
23. Decision trees: SPRINT

[Diagram: attribute list L partitioned on the value sunny.]
24. Decision trees: SPRINT
- Advantages
  - attribute lists might fit in memory when the total data set doesn't
- Disadvantages
  - the size of the hash table is O(N) for the top levels of the tree
  - if the hash table does not fit in memory (mostly true for large data sets), build it in parts so that each part fits
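A minimal sketch of SPRINT's two data structures, the per-attribute lists and the record-id hash table used to partition them (the toy data and variable names are my own; the real system keeps the lists on disk and pre-sorts continuous attributes):

```python
# each record: (outlook, temperature, class)
records = [
    ("sunny", "hot", "no"), ("overcast", "hot", "yes"),
    ("rainy", "mild", "yes"), ("sunny", "mild", "no"),
]

# one attribute list per attribute: (value, class, record id) triples,
# so each list can be scanned independently of the others
attribute_lists = {
    a: [(rec[a], rec[-1], rid) for rid, rec in enumerate(records)]
    for a in range(2)
}

# splitting on outlook == "sunny": a hash table of record ids going to
# the left child -- O(N) entries near the top of the tree
left_rids = {rid for value, _, rid in attribute_lists[0] if value == "sunny"}

# every other attribute list is partitioned by probing the hash table
left = {a: [e for e in lst if e[2] in left_rids]
        for a, lst in attribute_lists.items()}
right = {a: [e for e in lst if e[2] not in left_rids]
         for a, lst in attribute_lists.items()}
```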
25. Parallelism: Task vs. data parallelism
- Data parallelism
  - the data is partitioned among P processors
  - each processor performs the same work on its local partition
- Task parallelism
  - each processor performs a different computation
  - data may be (selectively) replicated or partitioned
26. Parallelism: Static vs. dynamic load balancing
- Static load balancing
  - the work is divided once, initially
  - no subsequent computation or data movement
- Dynamic load balancing
  - steal work from heavily loaded processors
  - reassign it to lightly loaded processors
27. Decision trees: Data parallelism
28. Decision trees: Data parallelism
- Synchronous tree construction
  - initially partition the data across processors
- Pro
  - no data movement is required
- Con
  - load imbalance
  - communication cost becomes too high in the lower parts of the tree
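The synchronous scheme boils down to local counting plus a global reduction: only the small count matrices cross processor boundaries, never the examples. A sketch (the "processors" below are simulated as plain lists, an assumption for illustration):

```python
from collections import Counter

# (attribute value, class) pairs local to each of 3 simulated processors
partitions = [
    [("sunny", "no"), ("rainy", "yes")],
    [("sunny", "yes"), ("overcast", "yes")],
    [("rainy", "no"), ("sunny", "no")],
]

# each processor counts its own partition (done in parallel in reality)
local_counts = [Counter(part) for part in partitions]

# the only communication: summing the count matrices into a global one,
# from which every processor derives the same split decision
global_counts = sum(local_counts, Counter())
```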
29. Decision trees: Task parallelism

[Diagram: different subtrees assigned to different processors, e.g. proc 2 and proc 3.]
30. Decision trees: Task parallelism
- Partitioned tree construction
- Pro
  - highly concurrent
  - no data movement once one processor is responsible for an entire subtree
- Con
  - excessive data movement until one processor is responsible
  - load imbalance: portions of the tree die out (the assignment of nodes to processors is based on the number of cases, not on the actual work)
- Solution: dynamic load balancing
  - exchange data between processor partitions
  - load balance within processor partitions
31. Decision trees: Hybrid parallel formulation
- Start with synchronous tree construction
- Detect when the cost of communication becomes too high and redistributing the data is cheaper:
  - switch when (communication cost) >= (cost of moving data + load balancing)
- Proceed with the partitioned tree construction approach
32. Decision trees: Hybrid parallel formulation (detail)
- The database is split among P processors, all processors are assigned to one partition, and the tree's root node is allocated to that partition
- Nodes at the frontier of the tree within one partition are processed with synchronous tree construction
- When the communication cost becomes prohibitive, the processors in the partition are divided into 2 partitions, and the current set of frontier nodes is split and allocated accordingly (partitioned tree construction)
- The above steps are repeated in each of the partitions for the subtrees
- Idle processors are recycled
33. Speedup comparisons
- Data set: 1.6 million examples, 11 attributes
34. Scaling up decision trees: Summary
- Decision trees
- Scaling up
- why? what?
- methods
- fast algorithm
- data partitioning
- relational representation
- Scaling up Decision trees
- SPRINT attribute lists
- Parallelism (data, task, hybrid)