Title: Scalable Mining For Classification Rules in Relational Databases
1. Scalable Mining for Classification Rules in Relational Databases
Min Wang Bala Iyer Jeffrey Scott Vitter
2. Abstract
- Problem: increase in the size of the training set
- MIND (MINing in Database) classifier
- Can be implemented easily over SQL
- Other classifiers need O(N) space in memory
- MIND scales well over
  - I/O
  - Number of processors
3. Overview
- Introduction
- Algorithm
- Database Implementation
- Performance
- Experimental Results
- Conclusions
4. Introduction - Classification Problem
[Figure: the DETAIL table is fed to a classifier, which produces the decision tree below.]
- Age < 30?
  - yes: salary < 62K?
    - yes: risky
    - no: safe
  - no: safe
5. Introduction - Scalability in Classification
- Importance of scalability
  - Use a very large training set: the data is not memory resident
  - Number of CPUs: better usage of resources
6. Introduction - Scalability in Classification
- Properties of MIND
  - Scalable in memory
  - Scalable in CPU
  - Uses SQL
  - Easy to implement
- Assumptions
  - Attribute values are discrete
  - We focus on the growth stage (no pruning)
7. The Algorithm - Data Structure
Data is kept in the DETAIL table: DETAIL(attr1, attr2, ..., class, leaf_num)
- attri: the i-th attribute
- class: the class type
- leaf_num: the number of the leaf the example belongs to (this can be calculated from the known tree)
8. The Algorithm - gini index
- S: the data set
- C: the number of classes
- p_i: the relative frequency of class i in S
- gini index: $gini(S) = 1 - \sum_{i=1}^{C} p_i^2$
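As a small worked illustration of the definition above, the following Python sketch computes the gini index of a list of class labels. The function name `gini` and the sample labels are assumptions made for this example, not part of the original slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_i p_i^2."""
    n = len(labels)
    if n == 0:
        return 0.0  # convention chosen here for an empty set
    counts = Counter(labels)
    # p_i = counts[i] / n is the relative frequency of class i in S
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

pure = gini(["safe", "safe", "safe"])
mixed = gini(["safe", "safe", "risky", "risky"])
```

A pure set has gini index 0, and the value grows as the classes become more evenly mixed.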
9. The Algorithm
- GrowTree(DETAIL TABLE)
  - Initialize tree T and put all records of DETAIL in root
  - while (some leaf in T is not a STOP node)
    - for each attribute i do
      - evaluate gini index for each non-STOP leaf at each split value with respect to attribute i
    - for each non-STOP leaf do
      - get the overall best split for it
    - partition the records and grow the tree for one more level according to best splits
    - mark all small or pure leaves as STOP nodes
  - return T
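The level-by-level growth loop can be sketched in memory as follows. This is only an illustration of the control flow, not the SQL formulation used by MIND. It assumes records are dictionaries with a `class` key, that a split is a test of the form `attr < value`, that the split quality is the size-weighted gini index of the two sides, and that `min_size` decides what counts as a "small" leaf. All names are hypothetical.

```python
from collections import Counter

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(records, attributes):
    """Return (gini, attr, value) for the best test 'attr < value', or None."""
    best, n = None, len(records)
    for attr in attributes:
        for value in sorted({r[attr] for r in records}):
            up = Counter(r["class"] for r in records if r[attr] < value)
            down = Counter(r["class"] for r in records if r[attr] >= value)
            n_up = sum(up.values())
            if n_up == 0 or n_up == n:
                continue  # one side is empty: not a real partition
            g = (n_up / n) * gini_from_counts(up) \
                + ((n - n_up) / n) * gini_from_counts(down)
            if best is None or g < best[0]:
                best = (g, attr, value)
    return best

def grow_tree(records, attributes, min_size=2):
    root = {"records": records}
    leaves = [root]                      # non-STOP leaves of the current level
    while leaves:
        next_leaves = []
        for leaf in leaves:
            recs = leaf["records"]
            split = None
            if len(recs) >= min_size and len({r["class"] for r in recs}) > 1:
                split = best_split(recs, attributes)
            if split is None:            # small, pure, or unsplittable: STOP node
                top = Counter(r["class"] for r in recs).most_common(1)
                leaf["label"] = top[0][0] if top else None
                continue
            _, attr, value = split
            leaf["attr"], leaf["value"] = attr, value
            leaf["yes"] = {"records": [r for r in recs if r[attr] < value]}
            leaf["no"] = {"records": [r for r in recs if r[attr] >= value]}
            next_leaves += [leaf["yes"], leaf["no"]]
        leaves = next_leaves             # grow one more level
    return root

data = [
    {"age": 25, "salary": 50, "class": "risky"},
    {"age": 28, "salary": 70, "class": "safe"},
    {"age": 40, "salary": 50, "class": "safe"},
    {"age": 45, "salary": 80, "class": "safe"},
]
tree = grow_tree(data, ["age", "salary"])
```

The sketch marks STOP nodes at the start of each pass rather than at the end, which visits the leaves in the same order as the loop on the slide.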
10. Database Implementation - Dimension table
- For each attribute and each level of the tree
- INSERT INTO DIMi
- SELECT leaf_num, class, attri, COUNT(*)
- FROM DETAIL
- WHERE leaf_num <> STOP
- GROUP BY leaf_num, class, attri
- Size of DIMi: at most #leaves * #distinct values of attri * #classes
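The dimension-table query can be tried on a toy table with Python's built-in `sqlite3` module. This is a sketch under stated assumptions: the attribute is `age`, the table is named `DIM_age`, STOP is encoded as the sentinel leaf number -1, and the count column is named `cnt`. None of these choices come from the slides.

```python
import sqlite3

STOP = -1  # assumed encoding of the STOP marker

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DETAIL (age INTEGER, salary INTEGER, class TEXT, leaf_num INTEGER)"
)
conn.executemany(
    "INSERT INTO DETAIL VALUES (?, ?, ?, ?)",
    [
        (25, 50, "risky", 0),
        (25, 60, "risky", 0),
        (28, 70, "safe", 0),
        (40, 50, "safe", STOP),  # already in a STOP leaf, so it is ignored
    ],
)

conn.execute(
    "CREATE TABLE DIM_age (leaf_num INTEGER, class TEXT, age INTEGER, cnt INTEGER)"
)
# One row per (leaf, class, attribute value) with the number of matching records.
conn.execute(
    "INSERT INTO DIM_age "
    "SELECT leaf_num, class, age, COUNT(*) FROM DETAIL "
    "WHERE leaf_num <> ? "
    "GROUP BY leaf_num, class, age",
    (STOP,),
)

dim_age = conn.execute(
    "SELECT leaf_num, class, age, cnt FROM DIM_age ORDER BY leaf_num, class, age"
).fetchall()
```

The resulting table depends only on the number of distinct (leaf, class, value) combinations, not on the number of training records.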
11. Database Implementation - Dimension table SQL
- SELECT FROM DETAIL
- INSERT INTO DIM1 leaf_num, class, attr1, COUNT(*)
- WHERE leaf_num <> STOP
- GROUP BY leaf_num, class, attr1
- INSERT INTO DIM2 leaf_num, class, attr2, COUNT(*)
- WHERE leaf_num <> STOP
- GROUP BY leaf_num, class, attr2
- ...
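The point of this form is that all dimension tables are filled from one scan of DETAIL. A minimal in-memory sketch of the same idea, assuming rows are dictionaries and STOP is encoded as -1 (both assumptions of this example, as is the name `build_dims`):

```python
from collections import Counter

def build_dims(detail, attributes, stop=-1):
    """Fill one counter per attribute in a single pass over the rows."""
    dims = {a: Counter() for a in attributes}
    for row in detail:                      # single scan of DETAIL
        if row["leaf_num"] == stop:
            continue                        # WHERE leaf_num <> STOP
        for a in attributes:                # one "INSERT" target per attribute
            dims[a][(row["leaf_num"], row["class"], row[a])] += 1
    return dims

detail = [
    {"leaf_num": 0, "class": "risky", "age": 25, "salary": 50},
    {"leaf_num": 0, "class": "risky", "age": 25, "salary": 60},
    {"leaf_num": -1, "class": "safe", "age": 40, "salary": 50},
]
dims = build_dims(detail, ["age", "salary"])
```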
12. Database Implementation - UP/DOWN - split
- for each attribute we find all possible split places
- INSERT INTO UP
- SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
- FROM DIMi d1 FULL OUTER JOIN DIMi d2
- ON d1.leaf_num = d2.leaf_num AND
- d2.attri < d1.attri AND
- d1.class = d2.class
- GROUP BY d1.leaf_num, d1.attri, d1.class
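The self-join accumulates, for every candidate split value, the counts of records on one side of it. The sketch below mirrors that accumulation in memory. It assumes the DIM table is given as a dictionary keyed by (leaf_num, class, value), that UP collects values strictly below the split value, and that DOWN is the complementary side (values greater than or equal to it). The DOWN definition and all function names are assumptions of this example.

```python
def cumulative(dim, keep):
    """For each (leaf, value, class) sum the counts of rows in the same leaf
    and class whose attribute value satisfies keep(other_value, value)."""
    out = {}
    for (leaf1, cls1, val1) in dim:
        out[(leaf1, val1, cls1)] = sum(
            cnt
            for (leaf2, cls2, val2), cnt in dim.items()
            if leaf2 == leaf1 and cls2 == cls1 and keep(val2, val1)
        )
    return out

def up_table(dim):
    return cumulative(dim, lambda other, value: other < value)

def down_table(dim):
    return cumulative(dim, lambda other, value: other >= value)

dim = {(0, "risky", 25): 2, (0, "risky", 30): 1, (0, "safe", 30): 3}
up = up_table(dim)
down = down_table(dim)
```

Unlike the outer join, this sketch reports an empty side as a count of 0 instead of a missing or NULL row.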
13. Database Implementation - Class View
- create view for each class k and attribute i
- CREATE VIEW Ck_UP(leaf_num, attri, count) AS
- SELECT leaf_num, attri, count
- FROM UP
- WHERE class = k
14. Database Implementation - GINI VALUE
- create view for all gini values
- CREATE VIEW GINI_VALUE(leaf_num, attri, gini) AS
- SELECT u1.leaf_num, u1.attri, ƒgini
- FROM C1_UP u1, ..., Cc_UP uc, C1_DOWN d1, ..., Cc_DOWN dc
- WHERE u1.attri = ... = uc.attri = ... = dc.attri
- AND u1.leaf_num = ... = uc.leaf_num = ... = dc.leaf_num
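The slide does not spell out ƒgini. A plausible reading, offered here only as an assumption, is the size-weighted gini index of the two sides of a split, computed from the per-class UP and DOWN counts that the join lines up. The function name `split_gini` is hypothetical.

```python
def split_gini(up, down):
    """up, down: per-class record counts on each side of a candidate split,
    listed in the same class order (class 1 .. class c)."""
    def gini(counts):
        n = sum(counts)
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

    n_up, n_down = sum(up), sum(down)
    n = n_up + n_down
    if n == 0:
        return 0.0
    # Each side contributes its gini index weighted by its share of the records.
    return (n_up / n) * gini(up) + (n_down / n) * gini(down)

perfect = split_gini([2, 0], [0, 3])   # each side holds a single class
useless = split_gini([1, 1], [1, 1])   # both sides are evenly mixed
```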
15. Database Implementation - MIN GINI VALUE
- create table for minimum gini values
- for attribute i
- INSERT INTO MIN_GINI
- SELECT leaf_num, i, attri, gini
- FROM GINI_VALUE a
- WHERE a.gini =
- (SELECT MIN(gini)
- FROM GINI_VALUE b
- WHERE a.leaf_num = b.leaf_num)
16. Database Implementation - BEST SPLIT
- create view over MIN_GINI for best split
- CREATE VIEW BEST_SPLIT
- (leaf_num, attr_name, attr_value) AS
- SELECT leaf_num, attr_name, attr_value
- FROM MIN_GINI a
- WHERE a.gini =
- (SELECT MIN(gini)
- FROM MIN_GINI b
- WHERE a.leaf_num = b.leaf_num)
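The correlated subquery can be checked on a toy MIN_GINI table with Python's built-in `sqlite3` module. The column types and the sample rows below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MIN_GINI "
    "(leaf_num INTEGER, attr_name TEXT, attr_value INTEGER, gini REAL)"
)
conn.executemany(
    "INSERT INTO MIN_GINI VALUES (?, ?, ?, ?)",
    [
        (0, "age", 30, 0.25),     # best split for leaf 0
        (0, "salary", 62, 0.40),
        (1, "age", 45, 0.30),
        (1, "salary", 50, 0.10),  # best split for leaf 1
    ],
)
# For each leaf keep the row whose gini equals the minimum within that leaf.
conn.execute(
    """
    CREATE VIEW BEST_SPLIT (leaf_num, attr_name, attr_value) AS
    SELECT leaf_num, attr_name, attr_value
    FROM MIN_GINI a
    WHERE a.gini = (SELECT MIN(gini)
                    FROM MIN_GINI b
                    WHERE a.leaf_num = b.leaf_num)
    """
)
best = conn.execute("SELECT * FROM BEST_SPLIT ORDER BY leaf_num").fetchall()
```

If two attributes tie on the minimum gini value for a leaf, this query returns both rows, so a tie-breaking rule would be needed before partitioning.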
17. Database Implementation - Partitioning
- Build new nodes by splitting old nodes according to BEST_SPLIT values
- Assign each record to its correct node
  - The leaf_num update is done by a function
  - No need to UPDATE the data or the DB
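One way to read "done by a function" is that the leaf number of a record is recomputed from the current tree whenever it is needed, instead of being written back to DETAIL. The sketch below assumes a nested-dictionary tree, tests of the form `attr < value`, and hypothetical leaf numbers. It illustrates the idea and is not the user function from the original work.

```python
def leaf_of(record, node):
    """Walk the tree from `node` and return the number of the leaf
    that the record falls into. Nothing is written back to the data."""
    while "attr" in node:                       # internal node: apply its test
        if record[node["attr"]] < node["value"]:
            node = node["yes"]
        else:
            node = node["no"]
    return node["leaf_num"]

# Hypothetical tree: Age < 30, then salary < 62 on the "yes" branch.
tree = {
    "attr": "age", "value": 30,
    "yes": {
        "attr": "salary", "value": 62,
        "yes": {"leaf_num": 3},
        "no": {"leaf_num": 4},
    },
    "no": {"leaf_num": 2},
}

young_low = leaf_of({"age": 25, "salary": 50}, tree)
young_high = leaf_of({"age": 25, "salary": 70}, tree)
older = leaf_of({"age": 40, "salary": 50}, tree)
```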
18. Performance
I/O cost of MIND
I/O cost of SPRINT
19. Experimental Results
Normalized time to finish building the tree
Normalized time to build the tree per example
20. Experimental Results
Normalized time to build the tree per number of processors
Time to build the tree by training set size
21. Conclusions
- MIND works over a DB
- MIND works well because
  - MIND rephrases classification as a DB problem
  - MIND avoids UPDATEs to the DETAIL table
  - Parallelism and scaling are achieved by the use of the RDBMS
  - MIND uses a user function to get the performance gain in the DIMi creation