1
SPRINT: A Scalable Parallel Classifier for Data
Mining
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken;
CHOW Chin Hung, Victor; KOON Ping Yin, Bob
2
Overview
  • Classification
  • Decision Tree
  • SPRINT
  • Serial Algorithm
  • Parallelizing Classification
  • Performance Evaluation
  • Conclusion
  • References

3
Classification
  • Objective
  • To build a model of the classifying attribute
    based upon the other attributes
  • Given
  • A set of example records, called a training set,
    where each record consists of several attributes
  • Attributes are either
  • Continuous (e.g. Age: 23, 24, 25)
  • Categorical (e.g. Gender: Female, Male)
  • One of the attributes, called the classifying
    attribute, indicates the class of each record

4
Decision Trees
  • Advantages
  • Well suited to data mining
  • Can be constructed relatively quickly
  • Models are simple and easy to understand
  • Can be easily converted into SQL statements
  • Obtain accuracy comparable to or better than
    other classification methods

5
Example: Car Insurance
(Training-set table; CarType is the categorical
attribute, Age is the continuous attribute, and the
class label is the insurance risk: High or Low)
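The table itself did not survive the transcript, but the attribute lists on slide 14 and the count matrix on slide 16 pin it down almost completely; the following is one CarType assignment consistent with both:

  rid  Age  CarType  Class
  0    23   Family   High
  1    17   Sports   High
  2    43   Sports   High
  3    68   Family   Low
  4    32   Truck    Low
  5    20   Family   High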
6
Classification algorithm
  • Two classification algorithms are presented in
    the paper
  • SLIQ
  • SPRINT
  • SLIQ
  • Handles large disk-resident data
  • Requires some data per record to stay
    memory-resident at all times
  • SPRINT
  • Removes all memory restrictions
  • Fast, scalable, and easily parallelized
  • How good is it?
  • Excellent scaleup, speedup and sizeup properties

7
SPRINT
Serial Algorithm
  • A decision tree classifier is built in two phases
  • Growth phase
  • Prune phase
  • Growth phase is computationally much more
    expensive than pruning

8
Serial Algorithm
Growth Phase
  • Key Issue
  • To find split points that define node tests.
  • Having chosen a split point, how to partition the
    data
  • Data Structures
  • Attribute lists
  • Histograms

9
Serial Algorithm
Data Structure (Attribute lists)
  • SPRINT creates an attribute list for each
    attribute (see the sketch below)
  • Entries are called attribute records, each of
    which contains
  • Attribute value
  • Class label
  • Index (rid) of the record
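A minimal Python sketch of this step, using the car-insurance training set reconstructed after slide 5; the record layout and function name are illustrative, not from the paper:

```python
# Build one attribute list per attribute. Each entry is a tuple
# (attribute value, class label, rid). Lists for continuous
# attributes are sorted by value once, up front.
def build_attribute_lists(records, continuous):
    attrs = [k for k in records[0] if k not in ('Class', 'rid')]
    lists = {}
    for a in attrs:
        entries = [(r[a], r['Class'], r['rid']) for r in records]
        if a in continuous:
            entries.sort(key=lambda e: e[0])
        lists[a] = entries
    return lists

training_set = [
    {'rid': 0, 'Age': 23, 'CarType': 'Family', 'Class': 'High'},
    {'rid': 1, 'Age': 17, 'CarType': 'Sports', 'Class': 'High'},
    {'rid': 2, 'Age': 43, 'CarType': 'Sports', 'Class': 'High'},
    {'rid': 3, 'Age': 68, 'CarType': 'Family', 'Class': 'Low'},
    {'rid': 4, 'Age': 32, 'CarType': 'Truck',  'Class': 'Low'},
    {'rid': 5, 'Age': 20, 'CarType': 'Family', 'Class': 'High'},
]
lists = build_attribute_lists(training_set, continuous={'Age'})
# lists['Age'][0] == (17, 'High', 1), matching slide 14
```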

10
Serial Algorithm
Data Structure (Attribute lists)
  • The initial lists are associated with the root
  • As the tree is grown, the attribute lists
    belonging to each node are partitioned and
    associated with the children

11
Serial Algorithm
Data Structure (Histograms)
  • For continuous attributes, two histograms are
    associated with each decision-tree node, denoted
    Cbelow and Cabove (see the sketch below)
  • Cbelow maintains the class distribution of the
    attribute records that have already been processed
  • Cabove maintains the class distribution of the
    attribute records that have not yet been processed
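A minimal sketch of how the two histograms evolve while a sorted attribute list is scanned (names are illustrative):

```python
from collections import Counter

def scan_histograms(attr_list):
    """Yield (cursor, Cbelow, Cabove) after each record of a
    sorted continuous attribute list is processed."""
    c_below = Counter()                                 # already processed
    c_above = Counter(cls for _, cls, _ in attr_list)   # not yet processed
    for cursor, (value, cls, rid) in enumerate(attr_list, start=1):
        c_below[cls] += 1
        c_above[cls] -= 1
        yield cursor, dict(c_below), dict(c_above)
```

Run over the Age list from slide 14, this reproduces Cbelow = {High: 3} and Cabove = {High: 1, Low: 2} at cursor position 3.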

12
Serial Algorithm
Data Structure (Histograms)
  • For categorical attributes, a single histogram is
    associated with each node; it is called the
    count matrix

13
Serial Algorithm
Finding split points
  • While growing the tree, the goal at each node is
    to determine the split point that best divides
    the training records belonging to that leaf
  • The paper uses the gini index
  • Gini(S) = 1 − Σj pj²
  • where pj is the relative frequency of class j in S
  • For a split of S into S1 (n1 records) and S2
    (n2 records):
  • Ginisplit(S) = (n1/n)·Gini(S1) + (n2/n)·Gini(S2)
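The two formulas in executable form; a direct transcription with illustrative function names:

```python
def gini(class_counts):
    """gini(S) = 1 - sum_j p_j^2, from a mapping class -> count."""
    n = sum(class_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())

def gini_split(counts1, counts2):
    """gini_split(S) = (n1/n)*gini(S1) + (n2/n)*gini(S2)."""
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)
```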

14
Serial Algorithm
Example: Split point for a continuous attribute

Age list (sorted):
  Age  Class  rid
  17   High   1
  20   High   5
  23   High   0
  32   Low    4
  43   High   2
  68   Low    3

State at cursor position 3 (three records processed):
           H  L
  Cbelow   3  0
  Cabove   1  2
15
Serial Algorithm
Example: Split point for a continuous attribute
(continued)
  • After computing the gini index at every candidate
    position, we choose the lowest value as the split
    point
  • Therefore, we split at position 3, where the
    candidate split point is the mid-point between
    age 23 and 32 (i.e. Age < 27.5)
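Putting the pieces together, a sketch of the search itself, reusing scan_histograms and gini_split from the earlier sketches; the midpoint rule is the one stated above:

```python
def best_continuous_split(attr_list):
    """Return (lowest gini_split, split value) for a sorted attribute
    list; candidates are midpoints between consecutive values."""
    best = (float('inf'), None)
    for cursor, c_below, c_above in scan_histograms(attr_list):
        if cursor == len(attr_list):
            break                      # nothing left above: not a split
        g = gini_split(c_below, c_above)
        midpoint = (attr_list[cursor - 1][0] + attr_list[cursor][0]) / 2
        best = min(best, (g, midpoint))
    return best

print(best_continuous_split(lists['Age']))
# -> (0.222..., 27.5): the minimum falls at position 3, i.e. Age < 27.5
```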

16
Serial Algorithm
Example: Split point for a categorical attribute

Count matrix for CarType:
           H  L
  Family   2  1
  Sports   2  0
  Truck    0  1
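The paper considers splits of the form CarType ∈ subset over general subsets of values; as an illustration of how the count matrix is used, a sketch that scores only the single-value subsets, reusing gini_split from the earlier sketch:

```python
count_matrix = {
    'Family': {'High': 2, 'Low': 1},
    'Sports': {'High': 2, 'Low': 0},
    'Truck':  {'High': 0, 'Low': 1},
}

def score_single_value_splits(matrix):
    """Yield (value, gini_split) for each split 'attribute in {value}'."""
    totals = {}
    for row in matrix.values():
        for cls, c in row.items():
            totals[cls] = totals.get(cls, 0) + c
    for value, row in matrix.items():
        rest = {cls: totals[cls] - row.get(cls, 0) for cls in totals}
        yield value, gini_split(row, rest)

for value, g in score_single_value_splits(count_matrix):
    print(value, round(g, 3))   # Family 0.444, Sports 0.333, Truck 0.267
```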
17
Serial Algorithm
Performing the split
  • Once the best split point has been found for a
    node, we execute the split by creating child
    nodes and dividing the attribute records between
    them
  • The list of the splitting attribute is partitioned
    directly; for the remaining attribute lists (e.g.
    CarType) we retrieve the partition information by
    probing with the rids (see the sketch below)
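A minimal sketch of the split, with a Python set of rids standing in for the paper's probe structure (a hash table); names are illustrative:

```python
def split_node(lists, split_attr, predicate):
    """Partition every attribute list of a node between two children;
    predicate(value) -> True sends a record to the left child."""
    left, right = {}, {}
    # 1. Split the winning attribute's list directly.
    left[split_attr] = [e for e in lists[split_attr] if predicate(e[0])]
    right[split_attr] = [e for e in lists[split_attr] if not predicate(e[0])]
    probe = {rid for _, _, rid in left[split_attr]}   # the probe structure
    # 2. Split every other list by probing with each record's rid.
    for attr, entries in lists.items():
        if attr == split_attr:
            continue
        left[attr] = [e for e in entries if e[2] in probe]
        right[attr] = [e for e in entries if e[2] not in probe]
    return left, right

left, right = split_node(lists, 'Age', lambda age: age < 27.5)
# left['CarType'] holds exactly the records with rids {0, 1, 5}
```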

18
Serial Algorithm
Comparison with SLIQ
  • SLIQ does not have separate sets of attribute
    lists for each node
  • Advantages
  • Does not have to rewrite these lists during a
    split
  • Reassignment of records is easy
  • Disadvantage
  • The class list must stay memory-resident at all
    times, which limits the amount of data that can
    be classified by SLIQ

SPRINT was not designed simply to outperform SLIQ.
Rather, the purpose of the algorithm is to build an
accurate classifier for datasets of any size, and to
build it efficiently. Furthermore, SPRINT is designed
to be easily parallelizable, so the workload can be
shared among N processors.
19
SPRINT
Parallelizing Classification
  • SPRINT was specifically designed to remove any
    dependence on data structures that are either
    centralized or memory-resident
  • These algorithms all assume a shared-nothing
    parallel environment, where each of N processors
    has private memory and disks. The processors are
    connected by a communication network and can
    communicate only by passing messages

20
Parallelizing Classification
Data placement and Workload Balancing
  • The main data structures are
  • attribute lists
  • class histograms
  • SPRINT achieves uniform data placement and
    workload balancing by distributing the attribute
    lists evenly over the N processors (see the
    sketch below)
  • Each processor then works on only 1/N of the
    total data
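A sketch of this placement, assuming each (sorted) attribute list is simply cut into N contiguous, near-equal sections; the helper name is illustrative:

```python
def place_attribute_list(attr_list, n_procs):
    """Split one attribute list into n_procs contiguous sections,
    one per processor."""
    n = len(attr_list)
    bounds = [round(i * n / n_procs) for i in range(n_procs + 1)]
    return [attr_list[bounds[i]:bounds[i + 1]] for i in range(n_procs)]

sections = place_attribute_list(lists['Age'], n_procs=2)
# sections[0] -> ages 17, 20, 23 (processor 0, as on slide 22)
# sections[1] -> ages 32, 43, 68 (processor 1)
```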

21
Parallelizing Classification
Finding split points (Continuous attributes)
  • In a parallel environment, each processor has a
    separate, contiguous section of a global
    attribute list
  • Cbelow must initially reflect the class
    distribution of all the sections assigned to
    processors of lower rank
  • Cabove must initially reflect the class
    distribution of the local section as well as all
    sections assigned to processors of higher rank
    (see the sketch below)
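A sketch of this initialization: each processor publishes the class counts of its own section, and every processor derives its starting histograms from the published counts (a pure-Python stand-in for the actual message exchange):

```python
from collections import Counter

def init_histograms(section_counts, my_rank):
    """section_counts[r]: class counts of processor r's section.
    Returns the initial (Cbelow, Cabove) for processor my_rank."""
    c_below, c_above = Counter(), Counter()
    for rank, counts in enumerate(section_counts):
        if rank < my_rank:
            c_below.update(counts)    # sections of lower rank
        else:
            c_above.update(counts)    # local section + higher ranks
    return c_below, c_above

counts = [Counter(High=3), Counter(High=1, Low=2)]   # as on slide 22
print(init_histograms(counts, my_rank=1))
# -> (Counter({'High': 3}), Counter({'High': 1, 'Low': 2}))
```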

22
Parallelizing Classification
Example: Split point for a continuous attribute

Processor 0:
  Age  Class  rid
  17   High   1
  20   High   5
  23   High   0

  Initial histograms:
           H  L
  Cbelow   0  0
  Cabove   4  2

Processor 1:
  Age  Class  rid
  32   Low    4
  43   High   2
  68   Low    3

  Initial histograms:
           H  L
  Cbelow   3  0
  Cabove   1  2
23
Parallelizing Classification
Finding split points (Categorical attributes)
  • Each processor builds a count matrix for each leaf
  • The processors exchange these matrices
  • Each processor sums the local matrices to obtain
    the global count matrix (see the sketch below)
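A sketch of the summation step, with the exchange abstracted to a list of already-received local matrices; the example values follow the processor 0/1 partition of slide 22 together with the CarType assignment reconstructed after slide 5:

```python
from collections import Counter

def global_count_matrix(local_matrices):
    """Sum per-processor count matrices {value: {class: count}}
    into the global count matrix."""
    total = {}
    for matrix in local_matrices:
        for value, row in matrix.items():
            total.setdefault(value, Counter()).update(row)
    return total

local = [
    {'Family': {'High': 2}, 'Sports': {'High': 1}},              # proc 0
    {'Family': {'Low': 1}, 'Sports': {'High': 1},
     'Truck': {'Low': 1}},                                       # proc 1
]
print(global_count_matrix(local))
# -> Family: 2 High / 1 Low, Sports: 2 High, Truck: 1 Low
```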

24
Parallelizing Classification
Performing the Splits
  • Splitting the attribute lists for each leaf is
    nearly identical to the serial algorithm
  • The only difference is that before building the
    probe structure, we need to collect rids from
    all the processors (see the sketch below)
  • The rids are exchanged
  • Each processor then constructs a probe structure
    with all the rids and uses it to split the leaf's
    remaining attribute lists
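A sketch of the collection step, with the received per-processor rid sets again stubbed in:

```python
def build_global_probe(local_rid_sets):
    """Union the rids received from all processors into one
    global probe structure (a hash set)."""
    probe = set()
    for rids in local_rid_sets:
        probe |= rids
    return probe

# rids going left under Age < 27.5, per processor (cf. slide 22):
print(build_global_probe([{1, 5, 0}, set()]))   # -> {0, 1, 5}
```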

25
Parallelizing Classification
Parallelizing SLIQ
  • Two approaches for parallelizing SLIQ
  • SLIQ/R: the class list is replicated in the
    memory of every processor
  • SLIQ/D: the class list is distributed so that
    each processor's memory holds only a portion of
    the entire list
  • Disadvantages
  • SLIQ/R: the size of the training set is limited
    by the memory size of a single processor
  • SLIQ/D: high communication cost

26
Performance Evaluation
  • The primary metrics for evaluating classifier
    performance are
  • Classification accuracy
  • Classification time
  • Size of the decision tree
  • The accuracy and tree-size characteristics of
    SPRINT are identical to those of SLIQ
  • The evaluation therefore focuses only on
    classification time

27
Performance Evaluation
Datasets
  • Each record in the synthetic database consists of
    nine attributes
  • The paper presents results for two of the
    classification functions
  • Function 2 results in a fairly small decision tree
  • Function 7 produces a very large tree
  • Both functions divide the database into two
    classes

28
Performance Evaluation
Serial Performance
  • We used training sets ranging in size from 10,000
    records to 2.5 million records
  • To examine how well SPRINT performs in operating
    regions where SLIQ can and cannot run

29
Performance Evaluation
Comparison of Parallel Algorithms
  • In these experiments, each processor contained
    50,000 training examples, and the number of
    processors was varied from 2 to 16

30
Performance Evaluation
Scaleup
31
Performance Evaluation
Speedup
32
Performance Evaluation
Sizeup
33
Conclusion
  • Classifiers can be built for very large databases
  • SPRINT efficiently classifies datasets of
    virtually any size and is easily parallelized
  • SPRINT scales nicely with the size of the dataset
  • SPRINT parallelizes efficiently in a
    shared-nothing environment

34
References
  • John Shafer, Rakesh Agrawal, and Manish Mehta.
    SPRINT: A Scalable Parallel Classifier for Data
    Mining. In Proceedings of the 22nd International
    Conference on Very Large Data Bases (VLDB), 1996.
