CHAN Siu Lung, Daniel

About This Presentation

Description:

Title: PowerPoint Presentation Author: SLCHAN Last modified by: Africa Pig Created Date: 10/11/2002 4:18:07 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 36

Provided by: SLC2

Transcript and Presenter's Notes

1
SPRINT: A Scalable Parallel Classifier for Data Mining
CHAN Siu Lung, Daniel
CHAN Wai Kin, Ken
CHOW Chin Hung, Victor
KOON Ping Yin, Bob
2
Overview

Classification
Decision Tree
SPRINT
Serial Algorithm
Parallelizing Classification
Performance Evaluation
Conclusion
Reference

3
Classification

Objective
To build a model of the classifying attribute based upon the other attributes
Classification is given
A set of example records, called a training set, where each record consists of several attributes
Attributes are either
Continuous (e.g. Age: 23, 24, 25)
Categorical (e.g. Gender: Female, Male)
One of the attributes is called the classifying attribute

4
Decision Trees

Advantages
Good for data mining
Can be constructed fast
Models are simple
Easy to understand
Can be easily converted into SQL statements
Decision-tree classifiers obtain similar, and sometimes better, accuracy compared with other classification methods

5
Example: Car Insurance
(Figure: training set with one continuous attribute and one categorical attribute)
6
Classification algorithm

Two classification algorithms are presented in
the paper
SLIQ
SPRINT
SLIQ
Handles large disk-resident data
Requires some data per record to stay memory-resident all the time
SPRINT
Removes all of the memory restrictions
Fast and scalable, and can be easily parallelized
How good is it?
Excellent scaleup, speedup and sizeup properties

7
SPRINT
Serial Algorithm

A decision tree classifier is built in two phases
Growth phase
Prune phase
Growth phase is computationally much more
expensive than pruning

8
Serial Algorithm
Growth Phase

Key Issues
Finding split points that define node tests
Partitioning the data once a split point has been chosen
Data Structures
Attribute lists
Histograms

9
Serial Algorithm
Data Structure (Attribute lists)

SPRINT creates an attribute list for each
attribute
Entries are called attribute records, each of which contains
Attribute value
Class label
Index of the record (rid)
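
A minimal sketch of building attribute lists, under some assumptions: the name `build_attribute_lists`, the class attribute name `Risk`, and the per-record CarType values are illustrative (the CarType values are chosen only to agree with the count matrix shown later in these slides). Continuous lists are sorted by value here, as the Age example later in the slides appears to be.

```python
from typing import NamedTuple


class AttributeRecord(NamedTuple):
    value: object  # attribute value
    label: str     # class label
    rid: int       # index of the record in the training set


def build_attribute_lists(records, attributes, class_attr, continuous):
    """Create one attribute list per attribute (hypothetical helper)."""
    lists = {}
    for attr in attributes:
        entries = [AttributeRecord(rec[attr], rec[class_attr], rid)
                   for rid, rec in enumerate(records)]
        if attr in continuous:
            # Sort continuous lists once, up front, by attribute value.
            entries.sort(key=lambda e: e.value)
        lists[attr] = entries
    return lists


# Illustrative training set; list position is the rid.
training_set = [
    {"Age": 23, "CarType": "Family", "Risk": "High"},
    {"Age": 17, "CarType": "Sports", "Risk": "High"},
    {"Age": 43, "CarType": "Sports", "Risk": "High"},
    {"Age": 68, "CarType": "Family", "Risk": "Low"},
    {"Age": 32, "CarType": "Truck", "Risk": "Low"},
    {"Age": 20, "CarType": "Family", "Risk": "High"},
]
attribute_lists = build_attribute_lists(
    training_set, ["Age", "CarType"], "Risk", continuous={"Age"})
```

Because every entry carries its rid, the lists can be sorted and partitioned independently while still referring back to the same training records.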

10
Serial Algorithm
Data Structure (Attribute lists)

The initial lists are associated with the root
As the tree is grown, the attribute lists
belonging to each node are partitioned and
associated with the children

11
Serial Algorithm
Data Structure (Histograms)

For continuous attributes, two histograms are associated with each decision-tree node. These histograms, denoted Cabove and Cbelow, capture the class distribution of the attribute records at the node
Cbelow maintains this distribution for attribute records that have already been processed
Cabove maintains this distribution for attribute records that have not yet been processed

12
Serial Algorithm
Data Structure (Histograms)

For categorical attributes, a histogram is also associated with each node. However, only one histogram is needed; it is called the count matrix

13
Serial Algorithm
Finding split points

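While growing the tree, the goal at each node is to determine the split point that best divides the training records belonging to the leaf
In the paper, they use the gini index
$gini(S) = 1 - \sum_j p_j^2$
where $p_j$ is the relative frequency of class $j$ in $S$
If a split divides $S$ (with $n$ records) into $S_1$ (with $n_1$ records) and $S_2$ (with $n_2$ records):
$gini_{split}(S) = \frac{n_1}{n} gini(S_1) + \frac{n_2}{n} gini(S_2)$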
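
The two gini formulas can be sketched directly in Python. The function names and the representation of a class distribution as a plain list of counts are assumptions made for illustration.

```python
def gini(counts):
    """Gini index of a set, given the number of records in each class."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def gini_split(counts1, counts2):
    """Weighted gini index of a binary split into two non-empty subsets."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)


# A pure subset has gini 0; an even two-class mix has gini 0.5.
pure = gini([3, 0])
mixed = gini([2, 2])
```

A lower value of `gini_split` means the two subsets are purer, which is why the lowest value is chosen as the split.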

14
Serial Algorithm
Example: Split point for continuous attributes
Attribute list for Age
Age  Class  rid
17   High   1
20   High   5
23   High   0
32   Low    4
43   High   2
68   Low    3
Histograms at cursor position 3 (the first three records have been processed)
        H  L
Cbelow  3  0
Cabove  1  2
15
Serial Algorithm
Example: Split point for continuous attributes

After computing the gini index at every candidate position, we choose the position with the lowest value as the split point
Therefore, we split at position 3, where the candidate split point is the mid-point between ages 23 and 32 (i.e. Age < 27.5)
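
As a sketch of the scan described on these slides, the following walks the sorted Age list once, moving each record's class count from Cabove to Cbelow and scoring the mid-point between consecutive distinct values. The function name and tuple layout `(value, class, rid)` are assumptions for illustration.

```python
from collections import Counter


def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def best_continuous_split(attr_list):
    """attr_list: (value, class_label, rid) tuples sorted by value."""
    n = len(attr_list)
    c_below = Counter()                                      # processed records
    c_above = Counter(label for _, label, _ in attr_list)   # unprocessed records
    best_score, best_point = float("inf"), None
    for i in range(n - 1):
        value, label, _ = attr_list[i]
        c_below[label] += 1
        c_above[label] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue  # no split point between equal values
        n_below = i + 1
        score = ((n_below / n) * gini(c_below)
                 + ((n - n_below) / n) * gini(c_above))
        if score < best_score:
            best_score, best_point = score, (value + next_value) / 2
    return best_score, best_point


age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
score, split_value = best_continuous_split(age_list)
# split_value is 27.5, the mid-point between ages 23 and 32
```

Only the two histograms are kept in memory during the scan, so the list itself can be read sequentially from disk.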

16
Serial Algorithm
Example: Split point for categorical attributes
Count matrix for CarType
        H  L
Family  2  1
Sports  2  0
Truck   0  1
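
A sketch of how the count matrix could be built and used. The slides do not say how candidate subsets are chosen, so the exhaustive enumeration of subsets below is an assumption; it is only practical when the attribute has few distinct values. Function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations


def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def count_matrix(attr_list):
    """Map each attribute value to a Counter of class labels."""
    matrix = defaultdict(Counter)
    for value, label, _ in attr_list:
        matrix[value][label] += 1
    return matrix


def best_categorical_split(matrix):
    """Try every subset of values (up to half the values) as one side of the split."""
    values = sorted(matrix)
    total = sum(matrix.values(), Counter())
    n = sum(total.values())
    best_score, best_subset = float("inf"), None
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = sum((matrix[v] for v in subset), Counter())
            right = total - left
            n_left = sum(left.values())
            score = (n_left / n) * gini(left) + ((n - n_left) / n) * gini(right)
            if score < best_score:
                best_score, best_subset = score, set(subset)
    return best_score, best_subset


car_list = [("Family", "High", 0), ("Sports", "High", 1), ("Sports", "High", 2),
            ("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)]
matrix = count_matrix(car_list)
score, subset = best_categorical_split(matrix)
# For this matrix the best subset is {"Truck"}
```

The attribute list is scanned only once to fill the matrix; all candidate splits are then scored from the matrix alone.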
17
Serial Algorithm
Performing the split

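Once the best split point has been found for a node, we execute the split by creating child nodes and dividing the attribute records between them
The list of the splitting attribute (e.g. Age) can be partitioned directly by applying the test
For the rest of the attribute lists (e.g. CarType), the attribute value says nothing about which child a record belongs to, so we need to retrieve this information by using rids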
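
A minimal sketch of the split, assuming the rids sent to the left child are held in an in-memory set that acts as the probe structure. The function name and the `(value, class, rid)` tuple layout are illustrative.

```python
def split_attribute_lists(attribute_lists, split_attr, test):
    """Partition every attribute list of a node between two children."""
    # Scan the splitting attribute's list and record which rids go left.
    left_rids = {rid for value, _, rid in attribute_lists[split_attr]
                 if test(value)}
    left, right = {}, {}
    for attr, entries in attribute_lists.items():
        # Probe by rid; relative order is preserved, so sorted lists stay sorted.
        left[attr] = [e for e in entries if e[2] in left_rids]
        right[attr] = [e for e in entries if e[2] not in left_rids]
    return left, right


node_lists = {
    "Age": [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
    "CarType": [("Family", "High", 0), ("Sports", "High", 1),
                ("Sports", "High", 2), ("Family", "Low", 3),
                ("Truck", "Low", 4), ("Family", "High", 5)],
}
left, right = split_attribute_lists(node_lists, "Age", lambda age: age < 27.5)
```

Since the children's lists keep the parent's ordering, continuous attributes never need to be re-sorted as the tree grows.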

18
Serial Algorithm
Comparison with SLIQ

SLIQ does not have separate sets of attribute lists for each node
Advantage
It does not have to rewrite these lists during a split
Reassignment of records is easy
Disadvantage
Its class list must stay in memory all the time, which limits the amount of data that can be classified by SLIQ

The goal of SPRINT was not to outperform SLIQ. Rather, the purpose of the algorithm is to build an accurate classifier for datasets that exceed SLIQ's memory limit, and to build it efficiently. Furthermore, SPRINT is designed to be easily parallelizable, so the workload can be shared among N processors
19
SPRINT
Parallelizing Classification

SPRINT was specifically designed to remove any
dependence on data structures that are either
centralized or memory-resident
The parallel algorithms are all based on a shared-nothing parallel environment where each of the N processors has private memory and disks. The processors are connected by a communication network and can communicate only by passing messages

20
Parallelizing Classification
Data placement and Workload Balancing

The main data structures are
attribute lists
class histograms
SPRINT achieves uniform data placement and workload balancing by distributing the attribute lists evenly over the N processors
This allows each processor to work on only 1/N of the total data
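
One way to sketch the even distribution is to cut each attribute list into N contiguous sections whose sizes differ by at most one record. The function name is hypothetical, and a real shared-nothing system would write each section to a different processor's disk rather than keep them in one list.

```python
def partition_evenly(attr_list, n_procs):
    """Split a list into n_procs contiguous sections of near-equal size."""
    base, extra = divmod(len(attr_list), n_procs)
    sections, start = [], 0
    for rank in range(n_procs):
        # The first `extra` processors take one additional record.
        size = base + (1 if rank < extra else 0)
        sections.append(attr_list[start:start + size])
        start += size
    return sections


age_list = [(17, "High", 1), (20, "High", 5), (23, "High", 0),
            (32, "Low", 4), (43, "High", 2), (68, "Low", 3)]
age_sections = partition_evenly(age_list, 2)
```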

21
Parallelizing Classification
Finding split points (Continuous attributes)

In a parallel environment, each processor has a
separate contiguous section of a global
attribute list
Cbelow must initially reflect the class
distribution of all sections of an attribute-list
assigned to processors of lower rank
Cabove must initially reflect the class distribution of the local section as well as all sections assigned to processors of higher rank

22
Parallelizing Classification
Example: Split point for continuous attributes
Processor 0
        H  L
Cbelow  0  0
Cabove  4  2
Age  Class  rid
17   High   1
20   High   5
23   High   0
Processor 1
        H  L
Cbelow  3  0
Cabove  1  2
Age  Class  rid
32   Low    4
43   High   2
68   Low    3
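
The initial histograms in this example can be reproduced with a running sum over the per-processor class counts. This sketch simulates all processors in a single Python process; in a shared-nothing system the local counts would instead be exchanged by message passing. The function name is an assumption.

```python
from collections import Counter


def initial_histograms(sections):
    """Return (Cbelow, Cabove) for each processor, in rank order."""
    local = [Counter(label for _, label, _ in s) for s in sections]
    c_below = Counter()               # counts from lower-ranked sections
    c_above = sum(local, Counter())   # local section plus higher-ranked ones
    result = []
    for hist in local:
        result.append((Counter(c_below), Counter(c_above)))
        c_below = c_below + hist
        c_above = c_above - hist
    return result


sections = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],   # processor 0
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],     # processor 1
]
histograms = initial_histograms(sections)
```

After this initialization, each processor can scan its own section exactly as in the serial algorithm.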
23
Parallelizing Classification
Finding split points (Categorical attributes)

Each processor builds a local count matrix for a leaf
The processors exchange these matrices
Summing the local matrices gives the global count matrix
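
A small sketch of the summation step, with the exchange simulated by passing all local matrices to one function. Representing a count matrix as a `Counter` keyed by `(value, class)` pairs is an assumption, as are the two local matrices, which correspond to a CarType list split by rid across two processors.

```python
from collections import Counter


def global_count_matrix(local_matrices):
    """Sum per-processor count matrices keyed by (attribute value, class)."""
    total = Counter()
    for matrix in local_matrices:
        total.update(matrix)  # Counter.update adds counts
    return total


local_matrices = [
    Counter({("Family", "High"): 1, ("Sports", "High"): 2}),                    # processor 0
    Counter({("Family", "Low"): 1, ("Truck", "Low"): 1, ("Family", "High"): 1}),  # processor 1
]
global_matrix = global_count_matrix(local_matrices)
```

The summed result matches the count matrix from the serial example, so the split evaluation that follows is unchanged.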

24
Parallelizing Classification
Performing the Splits

Splitting the attribute lists for each leaf is identical to the serial algorithm
The only difference is that before building the probe structure, we need to collect rids from all the processors
The processors exchange the rids
Each processor constructs a probe structure with all the rids and uses it to split the leaf's remaining attribute lists
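
The parallel split can be sketched as three steps, again simulated in one process: collect rids locally, take their union (standing in for the exchange between processors), and probe each local section of another attribute list. All names are illustrative.

```python
def parallel_split(split_sections, other_sections, test):
    """Split each processor's section of a non-splitting attribute list."""
    # Step 1: each processor finds the left-child rids in its own section.
    local_rids = [{rid for value, _, rid in section if test(value)}
                  for section in split_sections]
    # Step 2: exchange the rids so every processor holds the full probe set.
    probe = set().union(*local_rids)
    # Step 3: each processor splits its local section using the probe.
    return [([e for e in section if e[2] in probe],
             [e for e in section if e[2] not in probe])
            for section in other_sections]


age_sections = [
    [(17, "High", 1), (20, "High", 5), (23, "High", 0)],
    [(32, "Low", 4), (43, "High", 2), (68, "Low", 3)],
]
car_sections = [
    [("Family", "High", 0), ("Sports", "High", 1), ("Sports", "High", 2)],
    [("Family", "Low", 3), ("Truck", "Low", 4), ("Family", "High", 5)],
]
result = parallel_split(age_sections, car_sections, lambda age: age < 27.5)
```

The exchange is needed because a record's Age entry and its CarType entry may live on different processors.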

25
Parallelizing Classification
Parallelizing SLIQ

Two approaches for parallelizing SLIQ
The class list is replicated in the memory of every processor (SLIQ/R)
The class list is distributed such that each processor's memory holds only a portion of the entire list (SLIQ/D)
Disadvantages
SLIQ/R: the size of the training set is limited by the memory size of a single processor
SLIQ/D: high communication cost

26
Performance Evaluation

The primary metrics for evaluating classifier performance are
Classification accuracy
Classification time
Size of the decision tree
The accuracy and tree size characteristics of SPRINT are identical to those of SLIQ, so the evaluation focuses only on the classification time

27
Performance Evaluation
Datasets

Each record in the synthetic database consists of nine attributes
The class labels are generated by classification functions; in the paper, they present results for two of these functions
Function 2 results in a fairly small decision tree
Function 7 produces a very large tree
Both these functions divide the database into two classes

28
Performance Evaluation
Serial Performance

Training sets ranging in size from 10,000 records to 2.5 million records were used
The aim is to examine how well SPRINT performs in operating regions where SLIQ can and cannot run

29
Performance Evaluation
Comparison of Parallel Algorithms

In these experiments, each processor contained 50,000 training examples and the number of processors varied from 2 to 16

30
Performance Evaluation
Scaleup
31
Performance Evaluation
Speedup
32
Performance Evaluation
Sizeup
33
Performance Evaluation
Conclusion

Classifying very large databases requires algorithms that can build classifiers at that scale
SPRINT efficiently allows classification of virtually any sized dataset and is easily parallelizable
SPRINT scales nicely with the size of the dataset
SPRINT parallelizes efficiently in a shared-nothing environment

34
References

John Shafer, Rakesh Agrawal and Manish Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. Proceedings of the 22nd VLDB Conference, 1996.
