Title: CrossMine: Efficient Classification Across Multiple Database Relations
1. CrossMine: Efficient Classification Across Multiple Database Relations
- Xiaoxin Yin, Jiawei Han, Jiong Yang
- University of Illinois at Urbana-Champaign
- Philip S. Yu
- IBM T. J. Watson Research Center
2. Roadmap
- Introduction, definitions
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
3. Introduction, definitions
- Most real-world data are stored in relational databases
- Multi-relational classification: the procedure of building a classifier based on information stored in multiple relations of a database
- ILP approaches are the most widely used, but they are not scalable
- Multi-relational classification automatically classifies objects using multiple relations
4. An Example: Loan Applications
[Figure: a customer applies for a loan; the bank asks the backend database to decide whether to approve.]
5. The Backend Database
- Loan (target relation; each tuple has a class label indicating whether the loan is paid on time): loan-id, account-id, date, amount, duration, payment
- Account: account-id, district-id, frequency, date
- Order: order-id, account-id, bank-to, account-to, amount, type
- Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
- Card: card-id, disp-id, type, issue-date
- Disposition: disp-id, account-id, client-id
- Client: client-id, birth-date, gender, district-id
- District: district-id, dist-name, region, people, lt-500, lt-2000, lt-10000, gt-10000, city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, crime95, crime96
How to make decisions on loan applications?
6. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
7. Preliminaries
- Target relation
- Class labels
- Predicates
- Rules
- Decision Trees
- Searching for Predicates by Joins
8. The problem
[Table: the joined relation of Loan, Account, Order, and Transaction, where x-y denotes attribute y of relation x.]
9. Rule Generation
- Search for good predicates across multiple relations
[Figure: applicants 1-4 in the target Loan Applications relation, linked to Accounts, Orders, Districts, and other relations.]
10. Previous Approaches
- Inductive Logic Programming (ILP)
  - To build a rule, repeatedly find the best predicate
  - To evaluate a predicate on relation R, first join the target relation with R
- Not scalable because
  - Huge search space (numerous candidate predicates)
  - Evaluating each predicate is expensive
- To evaluate the predicate
  - Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, monthly, ?)
  - first join the Loan relation with the Account relation
- CrossMine is more scalable, and more than one hundred times faster on datasets of reasonable size
11. CrossMine: An Efficient and Accurate Multi-relational Classifier
- Tuple-ID propagation: an efficient and flexible method for virtually joining relations
- Confines the rule search process to promising directions
- Look-one-ahead: a more powerful search strategy
- Negative tuple sampling: improves efficiency while maintaining accuracy
12. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
13. Tuple ID Propagation
Instead of performing a physical join, the IDs and class labels of target tuples can be propagated to the Account relation.
14. Tuple ID Propagation
Account relation with propagated IDs and labels (Applicants 1-4 are the target tuples; their IDs and class labels are attached to the matching accounts):

Account ID | Frequency | Open date | Propagated ID | Labels
124        | monthly   | 02/27/93  | 1, 2          | 2+, 0-
108        | weekly    | 09/23/97  | 3             | 0+, 1-
45         | monthly   | 12/09/96  | 4             | 0+, 1-
67         | weekly    | 01/01/97  | Null          | 0+, 0-

- Possible predicates
  - Frequency = monthly: 2+, 1-
  - Open date < 01/01/95: 2+, 0-
- Propagate tuple IDs of the target relation to non-target relations
- Virtually join relations to avoid the high cost of physical joins (see the sketch below)
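A minimal sketch of tuple ID propagation in Python, reproducing the table above. The relation and attribute names come from the example, but the propagate_ids helper and the dict-based representation are illustrative assumptions, not CrossMine's actual implementation.

```python
from collections import defaultdict

def propagate_ids(target, source, key):
    """Attach the tuple IDs and class-label counts of the target
    relation to each tuple of `source`, joining on the shared
    attribute `key`.  No join result is materialized: each source
    tuple only gains an ID list and positive/negative counts."""
    by_key = defaultdict(list)
    for t in target:
        by_key[t[key]].append(t)
    for s in source:
        matches = by_key.get(s[key], [])
        s["ids"] = [t["id"] for t in matches]
        s["pos"] = sum(1 for t in matches if t["label"] == "+")
        s["neg"] = sum(1 for t in matches if t["label"] == "-")

# The four loans (target tuples) and accounts from the table above.
loans = [
    {"id": 1, "account-id": 124, "label": "+"},
    {"id": 2, "account-id": 124, "label": "+"},
    {"id": 3, "account-id": 108, "label": "-"},
    {"id": 4, "account-id": 45,  "label": "-"},
]
accounts = [
    {"account-id": 124, "frequency": "monthly", "open-date": "02/27/93"},
    {"account-id": 108, "frequency": "weekly",  "open-date": "09/23/97"},
    {"account-id": 45,  "frequency": "monthly", "open-date": "12/09/96"},
    {"account-id": 67,  "frequency": "weekly",  "open-date": "01/01/97"},
]
propagate_ids(loans, accounts, "account-id")
# accounts[0] now carries ids=[1, 2], pos=2, neg=0, matching the table.
```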
15. Tuple ID Propagation (cont.)
- Efficient
  - Only the tuple IDs are propagated
  - Time and space costs are low
- Flexible
  - IDs can be propagated among non-target relations
  - Multiple sets of IDs, propagated along different join paths, can be kept on one relation
16. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
17. Overall Procedure
- Sequential covering algorithm (see the sketch below):
  - while (enough target tuples left)
    - generate a rule
    - remove positive target tuples satisfying this rule
[Figure: positive examples partitioned into the regions covered by Rule 1, Rule 2, and Rule 3.]
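A hedged sketch of the sequential covering loop, assuming a generate_rule callback that learns one rule from the current examples and rule objects that expose a covers predicate; both names are hypothetical.

```python
def sequential_covering(pos, neg, generate_rule, min_pos=5):
    """Learn rules one at a time; after each rule, remove the positive
    target tuples it covers and repeat while enough positives remain."""
    rules = []
    while len(pos) >= min_pos:
        rule = generate_rule(pos, neg)
        if rule is None:            # no predicate had enough foil-gain
            break
        rules.append(rule)
        pos = [t for t in pos if not rule.covers(t)]
    return rules
```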
18. Rule Generation
- To generate a rule:
  - while (true)
    - find the best predicate p
    - if foil-gain(p) > threshold, add p to the current rule; else break
[Figure: the rule grows one predicate at a time, e.g. A3=1, then A3=1, A1=2, then A3=1, A1=2, A8=5, covering fewer negative examples at each step.]
19. Evaluating Predicates
- All predicates in a relation can be evaluated based on propagated IDs
- Use foil-gain to evaluate predicates (sketched below)
- Suppose the current rule is r. For a predicate p,
  foil-gain(p) = P(r+p) * [ I(r) - I(r+p) ], where I(r) = -log2( P(r) / (P(r) + N(r)) )
  and P(r), N(r) are the numbers of positive and negative target tuples satisfying r
- Categorical attributes
  - Compute foil-gain directly
- Numerical attributes
  - Discretize, treating every distinct value as a candidate split point
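The foil-gain above, computed from the propagated label counts; a small sketch, using the slide-14 example as a check.

```python
import math

def info(p, n):
    """I(r) = -log2(P(r) / (P(r) + N(r)))."""
    return -math.log2(p / (p + n))

def foil_gain(p0, n0, p1, n1):
    """foil-gain(p) = P(r+p) * (I(r) - I(r+p)): (p0, n0) are the
    positive/negative counts for the current rule r, (p1, n1) the
    counts after appending predicate p."""
    if p1 == 0:
        return 0.0
    return p1 * (info(p0, n0) - info(p1, n1))

# Slide-14 check: the empty rule covers 2+/2-; adding the predicate
# "Frequency = monthly" leaves 2+/1-, so the gain is
# 2 * (1.0 - log2(3/2)) ~ 0.83.
print(foil_gain(2, 2, 2, 1))
```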
20. Rule Generation
- Start from the target relation
  - Only the target relation is active
- Repeat (sketched below)
  - Search in all active relations
  - Search in all relations joinable to active relations
  - Add the best predicate to the current rule
  - Set the involved relation to active
- Until
  - The best predicate does not have enough gain, or
  - The current rule is too long
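A sketch of this search loop, under the assumption that best_predicate(rel) evaluates all predicates of a relation via its propagated IDs and returns (gain, predicate, relation) or None, and that joins maps each relation to its joinable neighbors; both are hypothetical. Defaults mirror the slide-27 parameters.

```python
def generate_rule(target, joins, best_predicate,
                  max_rule_length=6, min_foil_gain=2.5):
    """Grow one rule predicate by predicate.  Candidates come from all
    active relations plus every relation one join away; the winning
    predicate's relation becomes active (tuple IDs are propagated to it)."""
    rule, active = [], {target}
    while len(rule) < max_rule_length:
        candidates = set(active)
        for rel in active:                     # relations one join away
            candidates.update(joins.get(rel, ()))
        scored = [s for s in map(best_predicate, candidates) if s]
        if not scored:
            break
        gain, pred, rel = max(scored, key=lambda s: s[0])
        if gain < min_foil_gain:
            break
        rule.append(pred)
        active.add(rel)                        # IDs now propagated here
    return rule if rule else None
```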
21. Rule Generation Example
[Figure: starting from the target relation, the range of search expands to joinable relations; the best predicate is added to the rule (first predicate, then second predicate), and the search range grows around the newly active relation.]
22. Look-one-ahead in Rule Generation
- Two types of relations: entity relations and relationship relations
- Useful predicates often cannot be found on relationship relations
[Figure: IDs propagated from the target relation reach a relationship relation that offers no good predicate.]
- CrossMine's solution: when propagating IDs to a relationship relation, propagate them one more step, to the next entity relation (see the sketch below)
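A sketch of the look-one-ahead step, reusing the ID sets attached by the propagate_ids sketch earlier: when IDs reach a relationship relation (e.g. Disposition), they are pushed one join further to the entity relation behind it (e.g. Client) before predicates are searched. push_ids is a hypothetical helper, not CrossMine's code.

```python
from collections import defaultdict

def push_ids(source, dest, key):
    """Push already-propagated target-tuple IDs one join further:
    each dest tuple receives the union of the ID sets of all source
    tuples it joins with on `key`."""
    by_key = defaultdict(set)
    for s in source:
        by_key[s[key]].update(s["ids"])
    for d in dest:
        d["ids"] = sorted(by_key.get(d[key], set()))

# e.g. push_ids(disposition, client, "client-id") right after
# propagating target IDs to Disposition, so that predicates on the
# Client entity relation can be evaluated in the same search step.
```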
23. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
24. Negative Tuple Sampling
- A rule covers some positive examples
- Positive examples are removed once covered
- After generating many rules, far fewer positive examples remain than negative ones
25. Negative Tuple Sampling (cont.)
- When there are many more negative examples than positive ones
  - Good rules cannot be built (low support)
  - Learning is still time-consuming (large number of negative examples)
- Sample the negative examples (see the sketch below)
  - Improves efficiency without affecting rule quality
  - Keep |T-| < Neg_Pos_Ratio x |T+| and |T-| < Max_Num_Negative
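A minimal sketch of the sampling step; the two bounds follow the constraint above, with the slide-27 defaults. A uniform random sample keeps the negative class representative.

```python
import random

def sample_negatives(pos, neg, neg_pos_ratio=1, max_num_negative=600):
    """Cap the negative examples at both neg_pos_ratio * |pos| and
    max_num_negative before learning the next rule."""
    cap = min(int(neg_pos_ratio * len(pos)), max_num_negative)
    return neg if len(neg) <= cap else random.sample(neg, cap)
```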
26. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
27. Performance Study
- 1.7 GHz Pentium 4 PC, Windows 2000
- Parameters for CrossMine-Rule:
  - Min_Foil_Gain = 2.5
  - Max_Rule_Length = 6
  - Neg_Pos_Ratio = 1
  - Max_Num_Negative = 600
28. Performance Study
- Synthetic relational databases are generated, varying
  - the number of relations
  - the number of tuples in each relation
  - the number of foreign keys
- Running time and accuracy are compared
- CrossMine also runs efficiently on disk-resident data, as in real applications
29. Synthetic Datasets
[Figures: scalability w.r.t. the number of relations, and scalability w.r.t. the number of tuples.]
30. Real Datasets
- PKDD Cup 99 dataset (loan application)
- Mutagenesis dataset (4 relations)
31. References
- H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of logical decision trees. In Proc. of the Fifteenth Int. Conf. on Machine Learning, Madison, WI, 1998.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
- L. Dehaspe and H. Toivonen. Discovery of relational association rules. In Relational Data Mining, Springer-Verlag, 2000.
- L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In Proc. of the 18th Int. Conf. on Machine Learning, Williamstown, MA, 2001.
- H. A. Leiva. MRDTL: A multi-relational decision tree learning algorithm. M.S. thesis, Iowa State Univ., 2002.
- T. Mitchell. Machine Learning. McGraw-Hill, 1997.
- S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, 1995.
- S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of the First Conf. on Algorithmic Learning Theory, Tokyo, Japan, 1990.
- A. Popescul, L. Ungar, S. Lawrence, and D. Pennock. Towards structural logistic regression: Combining relational and statistical learning. In Proc. of the Multi-Relational Data Mining Workshop, Alberta, Canada, 2002.
- J. R. Quinlan. FOIL: A midterm report. In Proc. of the Sixth European Conf. on Machine Learning, Springer-Verlag, 1993.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann, 1993.
- B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 2001.