Title: Data Mining, Session 4: Scaling up decision tree learners
1. Data Mining, Session 4: Scaling up decision tree learners
- Luc Dehaspe
- K.U.L. Computer Science Department
2. Data Mining process model: CRISP-DM
3. Course overview
- Sessions 2-3: Data preparation
- Data Mining
4. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000. Chapter 4.3, Divide-and-Conquer: Constructing Decision Trees.
F. Provost and V. Kolluri. A Survey of Methods for Scaling Up Inductive Algorithms. Data Mining and Knowledge Discovery, 2:131-169, 1999.
A. Srivastava, E. Han, V. Kumar, V. Singh. Parallel Formulations of Decision-Tree Classification Algorithms. Data Mining and Knowledge Discovery, 3:237-261, 2000.
5. Classification
- the process of assigning new objects to predefined categories or classes
- given a set of labeled records:
  - build a model (decision tree)
  - predict labels for future unlabeled records
6. Decision trees
7. Decision trees: The Weka tool

@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no

http://www.cs.waikato.ac.nz/ml/weka/
8. Decision trees: Attribute selection

http://www-lmmb.ncifcrf.gov/toms/paper/primer/latex/index.html
http://directory.google.com/Top/Science/Math/Applications/Information_Theory/Papers/
9. Decision trees: Attribute selection

[Diagram: the entropy of the full training set is 0.94 bits, the amount of information required to specify the class of an example given that it reaches the node. The sunny branch holds 5/14 of the examples, with entropy 0.97 bits; outlook gives the maximal information gain of 0.25 bits.]
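The figures on the slide can be recomputed with a short script (a sketch in Python, not part of the original slides; the 14-example weather data is taken from the Weka listing above):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Bits needed to specify the class of an example drawn from labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values() if c)

def info_gain(rows, attr_index, labels):
    """Entropy reduction achieved by splitting on the given attribute."""
    n = len(labels)
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr_index] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# the 14-example weather data: (outlook, temperature, humidity, windy) -> play
data = [
    ("sunny","hot","high","FALSE","no"), ("sunny","hot","high","TRUE","no"),
    ("overcast","hot","high","FALSE","yes"), ("rainy","mild","high","FALSE","yes"),
    ("rainy","cool","normal","FALSE","yes"), ("rainy","cool","normal","TRUE","no"),
    ("overcast","cool","normal","TRUE","yes"), ("sunny","mild","high","FALSE","no"),
    ("sunny","cool","normal","FALSE","yes"), ("rainy","mild","normal","FALSE","yes"),
    ("sunny","mild","normal","TRUE","yes"), ("overcast","mild","high","TRUE","yes"),
    ("overcast","hot","normal","FALSE","yes"), ("rainy","mild","high","TRUE","no"),
]
rows = [d[:4] for d in data]
labels = [d[4] for d in data]

print(round(entropy(labels), 2))             # 0.94 bits at the root
print(round(info_gain(rows, 0, labels), 2))  # gain of outlook: 0.25 bits
```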
10. Decision trees: Building

[Diagram: the root is split on outlook into sunny, overcast, and rainy branches; the impure sunny branch has entropy 0.97 bits, and the attribute with maximal information gain is chosen to split it.]
11. Decision trees: Building

[Diagram: the sunny branch (entropy 0.97 bits) is split on humidity into high and normal.]
12. Decision trees: Final tree

[Diagram: outlook at the root; the sunny branch splits on humidity (high, normal), the rainy branch splits on windy (false, true), and overcast is a leaf.]
13. Decision trees: Basic algorithm
- Initialize the top node to all examples
- While impure leaves are available:
  - select the next impure leaf L
  - find the splitting attribute A with maximal information gain
  - for each value of A, add a child to L
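The loop above can be sketched as a recursive divide-and-conquer procedure (a minimal Python sketch; attribute indices stand in for names, and ties between attributes go to the first one):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attrs):
    """Divide and conquer: split on the attribute with maximal information
    gain until every leaf is pure (or no attributes remain)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]  # leaf: (majority) class
    def remainder(a):  # expected entropy after splitting on attribute a
        return sum(len(sub) / len(labels) * entropy(sub)
                   for v in {r[a] for r in rows}
                   for sub in [[l for r, l in zip(rows, labels) if r[a] == v]])
    best = min(attrs, key=remainder)  # maximal gain = minimal remainder
    children = {}
    for v in {r[best] for r in rows}:
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        children[v] = build_tree([rows[i] for i in keep],
                                 [labels[i] for i in keep],
                                 [a for a in attrs if a != best])
    return (best, children)

# toy run on two attributes (0 = outlook, 1 = humidity):
tree = build_tree([("sunny", "high"), ("sunny", "normal"), ("overcast", "high")],
                  ["no", "yes", "yes"], [0, 1])
```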
14. Decision trees: Finding a good split
- The sufficient statistics needed to compute information gain form a count matrix: class counts per attribute value
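Because the count matrix is sufficient, the gain can be computed without revisiting the examples. A sketch in Python, using the outlook counts from the weather data (the matrix layout is mine):

```python
from math import log2

# count matrix for outlook on the 14-example weather data:
# one row per attribute value, one column per class
counts = {
    "sunny":    {"yes": 2, "no": 3},
    "overcast": {"yes": 4, "no": 0},
    "rainy":    {"yes": 3, "no": 2},
}

def entropy(dist):
    n = sum(dist)
    return -sum(c / n * log2(c / n) for c in dist if c)

def gain_from_counts(counts):
    """Information gain from the count matrix alone; the raw examples
    are not needed once the counts are known."""
    class_totals = {}
    for row in counts.values():
        for cls, c in row.items():
            class_totals[cls] = class_totals.get(cls, 0) + c
    n = sum(class_totals.values())
    remainder = sum(sum(row.values()) / n * entropy(row.values())
                    for row in counts.values())
    return entropy(class_totals.values()) - remainder

print(round(gain_from_counts(counts), 2))  # 0.25
```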
15. Decision trees: ID3 - C4.5 - C5 (Quinlan)
- Simple depth-first construction
- Needs the entire data set to fit in memory
- Unsuitable for large data sets
- Need to scale up
16. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
17. Scaling up: Why?
- Terabyte databases exist (e.g., 100 M records x 2000 fields x 5 bytes)
- Increasing the size of the training set often increases the accuracy of the learned classification models
- Overfitting with small data sets due to:
  - small disjuncts
  - large numbers of features (sparsely populated model spaces)
- In a discovery setting, special cases should occur frequently enough
- Other motivations for fast algorithms: interaction, cross-validation, multiple models
18. Scaling up: What counts as "very large"?
- database practitioner: 100 GB
- data mining: 100 MB - 1 GB
- "Somewhere around data sizes of 100 MB or so, qualitatively new, very serious scaling problems begin to arise, both on the human and on the algorithmic side." (Huber, 1997)
19. Scaling up: What?
- pragmatic: turning an impractical algorithm into a practical one; how large a problem can you deal with?
- time complexity: what is the growth rate of the algorithm's run time with an increasing number of
  - examples
  - attributes per example
  - values per attribute
- space complexity: the main memory limitation
- no substantial loss of accuracy
20. Scaling up: Methods
- Fast algorithms
  - algorithm/programming optimizations
  - parallelization
- Data partitioning (horizontal, vertical)
- Relational representations
  - e.g., a 3-table database (5 bytes/value):
    - customer table: 1 million customers, 20 attributes
    - state table: 50 states, 80 attributes
    - product table: 10 products, 400 attributes
  - 100 MB database -> 2.5 GB as a single flattened table
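The slide's arithmetic can be reconstructed as follows (the table and attribute counts are from the slide; the assumption that each flattened customer row repeats one state's and one product's attributes is mine):

```python
BYTES_PER_VALUE = 5
customers, cust_attrs = 1_000_000, 20
states, state_attrs = 50, 80
products, prod_attrs = 10, 400

# relational representation: each table is stored once
relational = (customers * cust_attrs + states * state_attrs
              + products * prod_attrs) * BYTES_PER_VALUE

# flattened single table: every customer row carries all
# 20 + 80 + 400 = 500 attribute values
flat = customers * (cust_attrs + state_attrs + prod_attrs) * BYTES_PER_VALUE

print(relational / 1e6, "MB")  # ~100 MB
print(flat / 1e9, "GB")        # 2.5 GB
```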
21. Scaling up decision trees
- Decision trees
- Scaling up
- Scaling up Decision trees
22. Decision trees, algorithm optimizations: SPRINT (Shafer, Agrawal, Mehta, 1996)

[Diagram: the horizontal data format is decomposed into one attribute list per attribute.]
23. Decision trees: SPRINT

[Diagram: attribute list L partitioned on the value sunny.]
24. Decision trees: SPRINT
- Advantages
  - attribute lists might fit in memory when the total data set doesn't
- Disadvantages
  - the size of the hash table is O(N) for the top levels of the tree
  - if the hash table does not fit in memory (mostly true for large data sets), build it in parts so that each part fits
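A minimal sketch of SPRINT's two data structures, the per-attribute lists and the record-id hash table used to partition them (the toy data and variable names are my own; the real system keeps the lists on disk and pre-sorts continuous attributes):

```python
# each record: (outlook, temperature, class)
records = [
    ("sunny", "hot", "no"), ("overcast", "hot", "yes"),
    ("rainy", "mild", "yes"), ("sunny", "mild", "no"),
]

# one attribute list per attribute: (value, class, record id) triples,
# so each list can be scanned independently of the others
attribute_lists = {
    a: [(rec[a], rec[-1], rid) for rid, rec in enumerate(records)]
    for a in range(2)
}

# splitting on outlook == "sunny": a hash table of record ids going to
# the left child -- O(N) entries near the top of the tree
left_rids = {rid for value, _, rid in attribute_lists[0] if value == "sunny"}

# every other attribute list is partitioned by probing the hash table
left = {a: [e for e in lst if e[2] in left_rids]
        for a, lst in attribute_lists.items()}
right = {a: [e for e in lst if e[2] not in left_rids]
         for a, lst in attribute_lists.items()}
```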
25. Parallelism: Task vs. data parallelism
- Data parallelism
  - the data is partitioned among P processors
  - each processor performs the same work on its local partition
- Task parallelism
  - each processor performs a different computation
  - data may be (selectively) replicated or partitioned
26. Parallelism: Static vs. dynamic load balancing
- Static load balancing
  - the work is divided once, initially
  - no subsequent computation or data movement
- Dynamic load balancing
  - steal work from heavily loaded processors
  - reassign it to lightly loaded processors
27. Decision trees: Data parallelism
28. Decision trees: Data parallelism
- Synchronous tree construction
  - initially partition the data across processors
- Pro
  - no data movement is required
- Con
  - load imbalance
  - communication cost becomes too high in the lower parts of the tree
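The synchronous scheme boils down to local counting plus a global reduction: only the small count matrices cross processor boundaries, never the examples. A sketch (the "processors" below are simulated as plain lists, an assumption for illustration):

```python
from collections import Counter

# (attribute value, class) pairs local to each of 3 simulated processors
partitions = [
    [("sunny", "no"), ("rainy", "yes")],
    [("sunny", "yes"), ("overcast", "yes")],
    [("rainy", "no"), ("sunny", "no")],
]

# each processor counts its own partition (done in parallel in reality)
local_counts = [Counter(part) for part in partitions]

# the only communication: summing the count matrices into a global one,
# from which every processor derives the same split decision
global_counts = sum(local_counts, Counter())
```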
29. Decision trees: Task parallelism

[Diagram: different subtrees assigned to different processors, e.g. proc 2 and proc 3.]
30. Decision trees: Task parallelism
- Partitioned tree construction
- Pro
  - highly concurrent
  - no data movement once one processor is responsible for an entire subtree
- Con
  - excessive data movement until one processor is responsible
  - load imbalance: portions of the tree die out (the assignment of nodes to processors is based on the number of cases, not on the actual work)
- Solution: dynamic load balancing
  - exchange data between processor partitions
  - load balance within processor partitions
31. Decision trees: Hybrid parallel formulation
- Start with synchronous tree construction
- Detect when the cost of communication becomes too high and redistributing the data is cheaper:
  - switch when (communication cost) >= (cost of moving data + load balancing)
- Proceed with the partitioned tree construction approach
32. Decision trees: Hybrid parallel formulation (detail)
- The database is split among P processors, all processors are assigned to one partition, and the tree's root node is allocated to that partition
- Nodes at the frontier of the tree within one partition are processed with synchronous tree construction
- When the communication cost becomes prohibitive, the processors in the partition are divided into 2 partitions, and the current set of frontier nodes is split and allocated accordingly (partitioned tree construction)
- The above steps are repeated in each of the partitions for the subtrees
- Idle processors are recycled
33. Speedup comparisons
- Data set: 1.6 million examples, 11 attributes
34. Scaling up decision trees: Summary
- Decision trees
- Scaling up
- why? what?
- methods
- fast algorithm
- data partitioning
- relational representation
- Scaling up Decision trees
- SPRINT attribute lists
- Parallelism (data, task, hybrid)