Title: SLIQ and SPRINT for disk resident data
1SLIQ and SPRINT for disk resident data
2SLIQ
- SLIQ is a decision tree classifier that can handle both numerical and categorical attributes
- Builds compact and accurate trees
- Uses a pre-sorting technique in the tree-growing phase
- Suitable for classification of large disk-resident datasets
3Issues
- There are two major performance-critical issues in the tree-growth phase:
  - How to find split points
  - How to partition the data
- The well-known decision tree classifiers:
  - Grow trees depth-first
  - Repeatedly sort the data at every node
- SLIQ:
  - Replaces this repeated sorting with a one-time sort
  - Uses a new data structure called the class-list
  - The class-list must remain memory resident at all times
4Some Data
5SLIQ - Attribute Lists
These are projections on (rid, attribute).
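A minimal sketch of this layout, using made-up records (the deck's own table is not reproduced here) and illustrative names such as `RECORDS` and `build_lists`. Each attribute becomes its own list of (rid, value) pairs that is sorted once by value, while the class label and current leaf of every record live in a separate class list indexed by rid.

```python
# Hypothetical records, invented for illustration:
# rid -> (age, salary, marital, class)
RECORDS = {
    0: (30, 65, "Single", "G"),
    1: (23, 15, "Married", "B"),
    2: (40, 75, "Married", "G"),
    3: (55, 40, "Single", "B"),
    4: (55, 100, "Married", "G"),
    5: (45, 60, "Single", "B"),
}
ATTRS = ("age", "salary", "marital")


def build_lists(records, attrs=ATTRS):
    """Return (attribute_lists, class_list) for a table of records."""
    attr_lists = {}
    for i, name in enumerate(attrs):
        entries = [(rid, row[i]) for rid, row in records.items()]
        # One-time sort by value: orders numeric attributes and groups equal
        # categorical values together. The rid links back to the class list.
        entries.sort(key=lambda e: e[1])
        attr_lists[name] = entries
    # Class list: rid -> [class label, current leaf]; all records start in root N1.
    class_list = {rid: [row[-1], "N1"] for rid, row in records.items()}
    return attr_lists, class_list


attr_lists, class_list = build_lists(RECORDS)
print(attr_lists["salary"])
print(class_list[0])
```

Only the class list has to be addressable by rid at random; the attribute lists are read front to back, which is what makes them suitable for disk.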
6SLIQ - Sort Numeric, Group Categorical
7SLIQ - Class List
N1
8SLIQ - Histograms
N1
age ≤ 25 ?
age ≤ 30 ?
Evaluate each split, using GINI or Entropy.
...
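For reference, the two criteria named here are commonly defined as follows for a node S holding n records, where p_j is the relative frequency of class j in S. The symbols are chosen for this note, and the split score is written for a binary split of S into S_1 and S_2 with n_1 and n_2 records.

```latex
\begin{align*}
\mathrm{gini}(S) &= 1 - \sum_{j} p_j^{2} \\
\mathrm{entropy}(S) &= -\sum_{j} p_j \log_2 p_j \\
\mathrm{gini}_{\mathrm{split}}(S) &= \frac{n_1}{n}\,\mathrm{gini}(S_1) + \frac{n_2}{n}\,\mathrm{gini}(S_2)
\end{align*}
```

As a worked case, a histogram with 3 records of one class and 1 of another gives gini = 1 - (0.75^2 + 0.25^2) = 0.375. A lower split score means purer children, so the candidate with the smallest score wins.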
9SLIQ - Histograms
N1
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
10SLIQ - Histograms
N1
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
11SLIQ - Histograms
N1
Married
Single
Evaluate each split, using GINI or Entropy.
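For a categorical attribute the histogram is a count matrix (one row per attribute value, one column per class), and a split sends a subset of the values to one child. The sketch below simply tries every subset and scores it with the gini index. It assumes a small domain and a non-empty node, and `best_subset` and the counts are illustrative, not taken from the deck.

```python
from itertools import combinations


def gini(counts):
    """Gini index of a list of per-class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0


def best_subset(count_matrix):
    """count_matrix: value -> per-class counts. Returns (score, subset sent left)."""
    values = sorted(count_matrix)
    n_classes = len(count_matrix[values[0]])
    total = [sum(count_matrix[v][k] for v in values) for k in range(n_classes)]
    best = (float("inf"), None)
    # Subsets up to half the domain are enough: a subset and its complement
    # describe the same binary split.
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = [sum(count_matrix[v][k] for v in subset) for k in range(n_classes)]
            right = [t - l for t, l in zip(total, left)]
            score = (sum(left) * gini(left) + sum(right) * gini(right)) / sum(total)
            if score < best[0]:
                best = (score, frozenset(subset))
    return best


# Hypothetical counts per marital status, as [G, B].
print(best_subset({"Married": [3, 0], "Single": [0, 3]}))
```

Exhaustive enumeration grows exponentially with the number of distinct values, so a real implementation would need a cheaper search for large domains.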
12SLIQ - Perform best split and Update Class List
N1
salary ≤ 60
N2
N3
13SLIQ - Perform best split and Update Class List
14SLIQ - Histograms
N1
N2
N1
age ≤ 25 ?
N2
Evaluate each split, using GINI or Entropy.
...
15SLIQ - Histograms
N1
N2
N1
age ≤ 25
N2
Evaluate each split, using GINI or Entropy.
...
16SLIQ - Pseudocode
- Split evaluation
- EvaluateSplits()
  - for each numeric attribute A do
    - for each value v in the attribute list do
      - find the corresponding entry in the class list, and hence the corresponding class and the leaf node Ni
      - update the class histogram in leaf Ni
      - compute splitting score for test (A ≤ v) for Ni
  - for each categorical attribute A do
    - for each leaf of the tree do
      - find subset of A with best split
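The numeric half of this procedure can be sketched as below. One pass over each presorted attribute list updates the histograms of all current leaves at once, because the class list says which leaf every rid belongs to. The function name, the data and the choice of the gini index are assumptions for illustration; equal values are processed as a group so that a test A ≤ v always covers every record with value v.

```python
from collections import defaultdict
from itertools import groupby


def gini(hist):
    """Gini index of a class -> count histogram."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def evaluate_numeric_splits(attr_lists, class_list):
    """attr_lists: attr -> [(rid, value)] presorted by value.
    class_list: rid -> [label, leaf].
    Returns leaf -> (score, attr, v) for the best test attr <= v."""
    totals = defaultdict(lambda: defaultdict(int))  # leaf -> class counts
    for label, leaf in class_list.values():
        totals[leaf][label] += 1
    best = {}
    for attr, entries in attr_lists.items():
        # leaf -> histogram of the records seen so far (those with A <= v)
        below = defaultdict(lambda: defaultdict(int))
        for v, group in groupby(entries, key=lambda e: e[1]):
            touched = set()
            for rid, _ in group:
                label, leaf = class_list[rid]
                below[leaf][label] += 1
                touched.add(leaf)
            for leaf in touched:
                above = {c: n - below[leaf][c] for c, n in totals[leaf].items()}
                n_below, n_above = sum(below[leaf].values()), sum(above.values())
                if n_above == 0:
                    continue  # every record on one side: not a split
                score = (n_below * gini(below[leaf]) + n_above * gini(above)) / (n_below + n_above)
                if leaf not in best or score < best[leaf][0]:
                    best[leaf] = (score, attr, v)
    return best


# Hypothetical data, all records still in the root N1.
attr_lists = {
    "age": [(1, 23), (0, 30), (2, 40), (5, 45), (3, 55), (4, 55)],
    "salary": [(1, 15), (3, 40), (5, 60), (0, 65), (2, 75), (4, 100)],
}
class_list = {0: ["G", "N1"], 1: ["B", "N1"], 2: ["G", "N1"],
              3: ["B", "N1"], 4: ["G", "N1"], 5: ["B", "N1"]}
best = evaluate_numeric_splits(attr_lists, class_list)
print(best)
```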
17SLIQ - Pseudocode
- Updating the class list
- UpdateLabels()
  - for each split leaf Ni do
    - Let A be the split attribute for Ni.
    - for each (rid, v) in the attribute list for A do
      - find the corresponding entry e in the class list (using the rid)
      - if the leaf referenced by e is Ni then
        - find the new leaf Nj to which (rid, v) belongs (by applying the splitting test)
        - update the leaf pointer for e to Nj
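A small sketch of this update step, assuming numeric tests of the form A ≤ threshold and a hypothetical `splits` mapping from each split leaf to its attribute, threshold and two child names. The attribute lists themselves are never rewritten; only the leaf pointers in the class list change.

```python
def update_labels(attr_lists, class_list, splits):
    """splits: leaf -> (attr, threshold, left_child, right_child).
    Child names are assumed to be new nodes that are not keys of `splits`."""
    for leaf, (attr, threshold, left, right) in splits.items():
        for rid, v in attr_lists[attr]:
            entry = class_list[rid]          # [label, current leaf]
            if entry[1] == leaf:
                # Apply the splitting test and repoint the record to its new leaf.
                entry[1] = left if v <= threshold else right


# Hypothetical data: split the root N1 on salary <= 60 into N2 and N3.
attr_lists = {"salary": [(1, 15), (3, 40), (5, 60), (0, 65), (2, 75), (4, 100)]}
class_list = {0: ["G", "N1"], 1: ["B", "N1"], 2: ["G", "N1"],
              3: ["B", "N1"], 4: ["G", "N1"], 5: ["B", "N1"]}
update_labels(attr_lists, class_list, {"N1": ("salary", 60, "N2", "N3")})
print(class_list)
```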
18SLIQ - bottleneck
- Class-list must remain memory resident at all times!
- Although not a big problem with today's memories, there might still be cases where this is a bottleneck.
- So, what can we do when the class-list doesn't fit in main memory?
- SPRINT is a solution...
19SPRINT
The main data structures used in SPRINT are attribute lists and class histograms.
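In SPRINT as published, each attribute-list entry carries the class label next to the attribute value and the rid, which is what removes the need for a separate memory-resident class list. A minimal sketch with invented records and illustrative names (`Entry`, `build_sprint_lists`):

```python
from typing import NamedTuple


class Entry(NamedTuple):
    value: object   # attribute value
    label: str      # class label, stored in every list
    rid: int        # record id, used later to keep the lists in step


def build_sprint_lists(records, attrs, numeric):
    """records: rid -> (attr values..., class). Returns attr -> [Entry]."""
    lists = {}
    for i, name in enumerate(attrs):
        entries = [Entry(row[i], row[-1], rid) for rid, row in records.items()]
        if name in numeric:
            entries.sort(key=lambda e: e.value)  # numeric lists are presorted once
        lists[name] = entries
    return lists


# Hypothetical records: rid -> (age, salary, marital, class)
records = {
    0: (30, 65, "Single", "G"),
    1: (23, 15, "Married", "B"),
    2: (40, 75, "Married", "G"),
    3: (55, 40, "Single", "B"),
}
lists = build_sprint_lists(records, ("age", "salary", "marital"), {"age", "salary"})
print(lists["salary"])
```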
20SPRINT - Histograms
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
21SPRINT - Histograms
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
22SPRINT - Histograms
Married
Single
Evaluate each split, using GINI or Entropy.
23SPRINT - Performing Best Split
- Once the best split point has been found for a node, we execute the split by creating child nodes.
- This requires splitting the node's lists for every attribute into two.
- Partitioning the attribute list of the winning attribute (salary) is easy.
- We scan the list, apply the split test, and move the records to two new attribute lists, one for each new child.
24SPRINT - Performing Best Split
- Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records.
- Solution: use the rids.
- As we partition the list of the splitting attribute (i.e. salary), we insert the rid of each record into a probe structure (hash table), noting to which child the record was moved.
- Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record.
- The retrieved information tells us in which child to place the record.
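The rid-probing step can be sketched as follows, with a plain Python dict standing in for the hash table. The entry layout (value, class, rid), the function name `split_node` and the data are assumptions for illustration. Because every list is scanned in order, the child lists stay sorted without any further sorting.

```python
def split_node(lists, split_attr, threshold):
    """lists: attr -> [(value, label, rid)]. Split test: value <= threshold."""
    left = {attr: [] for attr in lists}
    right = {attr: [] for attr in lists}
    goes_left = {}  # probe structure: rid -> True if the record moved to the left child

    # 1. Partition the splitting attribute's list and record each rid's side.
    for entry in lists[split_attr]:
        value, _, rid = entry
        goes_left[rid] = value <= threshold
        (left if goes_left[rid] else right)[split_attr].append(entry)

    # 2. Partition every other list by probing with the rid.
    for attr, entries in lists.items():
        if attr == split_attr:
            continue
        for entry in entries:
            (left if goes_left[entry[2]] else right)[attr].append(entry)
    return left, right


# Hypothetical node with two attribute lists, both sorted by value.
LISTS = {
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "B", 5), (55, "B", 3), (55, "G", 4)],
    "salary": [(15, "B", 1), (40, "B", 3), (60, "B", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
}
left, right = split_node(LISTS, "salary", 60)
print(left["age"])
print(right["age"])
```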
25SPRINT - Performing Best Split
- If the hash table is too large for memory, splitting is done in more than one step.
- The attribute list for the splitting attribute is partitioned up to the attribute record for which the hash table will fit in memory.
- Portions of the attribute lists of the non-splitting attributes are partitioned, and the process is repeated for the remainder of the attribute list of the splitting attribute.
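One way to picture the multi-step variant is to cap the probe table at a hypothetical `max_rids` entries and make one pass per chunk of the splitting attribute's list. Each pass emits an ordered run for every non-splitting list. Merging those runs by original position at the end is an assumption of this sketch, added so that the child lists stay sorted; the slides do not say how the pieces are recombined.

```python
import heapq


def split_node_chunked(lists, split_attr, threshold, max_rids):
    """lists: attr -> [(value, label, rid)]; max_rids >= 1 bounds the probe table."""
    others = [attr for attr in lists if attr != split_attr]
    left, right = {split_attr: []}, {split_attr: []}
    runs = {attr: ([], []) for attr in others}  # attr -> (left runs, right runs)
    split_entries = lists[split_attr]

    for start in range(0, len(split_entries), max_rids):
        probe = {}  # holds at most max_rids rids at a time
        for entry in split_entries[start:start + max_rids]:
            value, _, rid = entry
            probe[rid] = value <= threshold
            (left if probe[rid] else right)[split_attr].append(entry)
        for attr in others:
            l_run, r_run = [], []
            for pos, entry in enumerate(lists[attr]):
                side = probe.get(entry[2])
                if side is not None:  # rid belongs to the current chunk
                    (l_run if side else r_run).append((pos, entry))
            runs[attr][0].append(l_run)
            runs[attr][1].append(r_run)

    # Positions are unique, so the merge never has to compare the entries themselves.
    for attr in others:
        left[attr] = [e for _, e in heapq.merge(*runs[attr][0])]
        right[attr] = [e for _, e in heapq.merge(*runs[attr][1])]
    return left, right


# Hypothetical node; the probe table may hold only two rids at a time.
LISTS = {
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "B", 5), (55, "B", 3), (55, "G", 4)],
    "salary": [(15, "B", 1), (40, "B", 3), (60, "B", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
}
left, right = split_node_chunked(LISTS, "salary", 60, max_rids=2)
print(left["age"])
print(right["age"])
```

The cost of bounding memory this way is one extra scan of the non-splitting lists per chunk.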