Title: CrossMine: Efficient Classification Across Multiple Database Relations
1. CrossMine: Efficient Classification Across Multiple Database Relations
- Xiaoxin Yin, Jiawei Han, Jiong Yang
- University of Illinois at Urbana-Champaign
- Philip S. Yu
- IBM T. J. Watson Research Center
2. Roadmap
- Introduction, definitions
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
3. Introduction, definitions
- Most real-world data are stored in relational databases
- Multi-relational classification: the procedure of building a classifier based on information stored in multiple relations of a database
- ILP approaches are the most widely used, but they are not scalable
- Multi-relational classification automatically classifies objects using multiple relations
4. An Example: Loan Applications
[Figure: a customer applies for a loan; the bank asks the backend database to decide whether to approve.]
5. The Backend Database
- Loan (target relation; each tuple has a class label indicating whether the loan is paid on time): loan-id, account-id, date, amount, duration, payment
- Account: account-id, district-id, frequency, date
- Order: order-id, account-id, bank-to, account-to, amount, type
- Transaction: trans-id, account-id, date, type, operation, amount, balance, symbol
- Card: card-id, disp-id, type, issue-date
- Disposition: disp-id, account-id, client-id
- Client: client-id, birth-date, gender, district-id
- District: district-id, dist-name, region, people, lt-500, lt-2000, lt-10000, gt-10000, city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, crime95, crime96
How to make decisions on loan applications?
6. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
7. Preliminaries
- Target relation
- Class labels
- Predicates
- Rules
- Decision Trees
- Searching for Predicates by Joins
8. The problem
[Table: the joined relation of Loan, Account, Order, and Transaction, where x-y denotes attribute y of relation x.]
9. Rule Generation
- Search for good predicates across multiple relations
[Figure: applicants 1-4 in the target Loan Applications relation, linked to Accounts, Orders, Districts, and other relations.]
10. Previous Approaches
- Inductive Logic Programming (ILP)
  - To build a rule, repeatedly find the best predicate
  - To evaluate a predicate on relation R, first join the target relation with R
- Not scalable because
  - Huge search space (numerous candidate predicates)
  - Evaluating each predicate is expensive
- To evaluate the predicate
  - Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, monthly, ?)
  - first join the Loan relation with the Account relation
- CrossMine is more scalable, and more than one hundred times faster on datasets of reasonable size
11. CrossMine: An Efficient and Accurate Multi-relational Classifier
- Tuple-ID propagation: an efficient and flexible method for virtually joining relations
- Confines the rule search process to promising directions
- Look-one-ahead: a more powerful search strategy
- Negative tuple sampling: improves efficiency while maintaining accuracy
12. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
13. Tuple ID Propagation
Instead of performing a physical join, the IDs and class labels of target tuples can be propagated to the Account relation.
14. Tuple ID Propagation
Account relation with propagated IDs and labels (Applicants 1-4 are the target tuples; their IDs and class labels are attached to the matching accounts):

Account ID | Frequency | Open date | Propagated ID | Labels
124        | monthly   | 02/27/93  | 1, 2          | 2+, 0-
108        | weekly    | 09/23/97  | 3             | 0+, 1-
45         | monthly   | 12/09/96  | 4             | 0+, 1-
67         | weekly    | 01/01/97  | Null          | 0+, 0-

- Possible predicates
  - Frequency = monthly: 2+, 1-
  - Open date < 01/01/95: 2+, 0-
- Propagate tuple IDs of the target relation to non-target relations
- Virtually join relations to avoid the high cost of physical joins (see the sketch below)
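A minimal sketch of tuple ID propagation in Python, reproducing the table above. The relation and attribute names come from the example, but the propagate_ids helper and the dict-based representation are illustrative assumptions, not CrossMine's actual implementation.

```python
from collections import defaultdict

def propagate_ids(target, source, key):
    """Attach the tuple IDs and class-label counts of the target
    relation to each tuple of `source`, joining on the shared
    attribute `key`.  No join result is materialized: each source
    tuple only gains an ID list and positive/negative counts."""
    by_key = defaultdict(list)
    for t in target:
        by_key[t[key]].append(t)
    for s in source:
        matches = by_key.get(s[key], [])
        s["ids"] = [t["id"] for t in matches]
        s["pos"] = sum(1 for t in matches if t["label"] == "+")
        s["neg"] = sum(1 for t in matches if t["label"] == "-")

# The four loans (target tuples) and accounts from the table above.
loans = [
    {"id": 1, "account-id": 124, "label": "+"},
    {"id": 2, "account-id": 124, "label": "+"},
    {"id": 3, "account-id": 108, "label": "-"},
    {"id": 4, "account-id": 45,  "label": "-"},
]
accounts = [
    {"account-id": 124, "frequency": "monthly", "open-date": "02/27/93"},
    {"account-id": 108, "frequency": "weekly",  "open-date": "09/23/97"},
    {"account-id": 45,  "frequency": "monthly", "open-date": "12/09/96"},
    {"account-id": 67,  "frequency": "weekly",  "open-date": "01/01/97"},
]
propagate_ids(loans, accounts, "account-id")
# accounts[0] now carries ids=[1, 2], pos=2, neg=0, matching the table.
```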
15. Tuple ID Propagation (cont.)
- Efficient
  - Only the tuple IDs are propagated
  - Time and space costs are low
- Flexible
  - IDs can be propagated among non-target relations
  - Multiple sets of IDs, propagated along different join paths, can be kept on one relation
16. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
17. Overall Procedure
- Sequential covering algorithm (see the sketch below):
  - while (enough target tuples left)
    - generate a rule
    - remove positive target tuples satisfying this rule
[Figure: positive examples partitioned into the regions covered by Rule 1, Rule 2, and Rule 3.]
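A hedged sketch of the sequential covering loop, assuming a generate_rule callback that learns one rule from the current examples and rule objects that expose a covers predicate; both names are hypothetical.

```python
def sequential_covering(pos, neg, generate_rule, min_pos=5):
    """Learn rules one at a time; after each rule, remove the positive
    target tuples it covers and repeat while enough positives remain."""
    rules = []
    while len(pos) >= min_pos:
        rule = generate_rule(pos, neg)
        if rule is None:            # no predicate had enough foil-gain
            break
        rules.append(rule)
        pos = [t for t in pos if not rule.covers(t)]
    return rules
```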
18. Rule Generation
- To generate a rule:
  - while (true)
    - find the best predicate p
    - if foil-gain(p) > threshold, add p to the current rule; else break
[Figure: the rule grows one predicate at a time, e.g. A3=1, then A3=1, A1=2, then A3=1, A1=2, A8=5, covering fewer negative examples at each step.]
19. Evaluating Predicates
- All predicates in a relation can be evaluated based on propagated IDs
- Use foil-gain to evaluate predicates (sketched below)
- Suppose the current rule is r. For a predicate p,
  foil-gain(p) = P(r+p) * [ I(r) - I(r+p) ], where I(r) = -log2( P(r) / (P(r) + N(r)) )
  and P(r), N(r) are the numbers of positive and negative target tuples satisfying r
- Categorical attributes
  - Compute foil-gain directly
- Numerical attributes
  - Discretize, treating every distinct value as a candidate split point
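The foil-gain above, computed from the propagated label counts; a small sketch, using the slide-14 example as a check.

```python
import math

def info(p, n):
    """I(r) = -log2(P(r) / (P(r) + N(r)))."""
    return -math.log2(p / (p + n))

def foil_gain(p0, n0, p1, n1):
    """foil-gain(p) = P(r+p) * (I(r) - I(r+p)): (p0, n0) are the
    positive/negative counts for the current rule r, (p1, n1) the
    counts after appending predicate p."""
    if p1 == 0:
        return 0.0
    return p1 * (info(p0, n0) - info(p1, n1))

# Slide-14 check: the empty rule covers 2+/2-; adding the predicate
# "Frequency = monthly" leaves 2+/1-, so the gain is
# 2 * (1.0 - log2(3/2)) ~ 0.83.
print(foil_gain(2, 2, 2, 1))
```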
20. Rule Generation
- Start from the target relation
  - Only the target relation is active
- Repeat (sketched below)
  - Search in all active relations
  - Search in all relations joinable to active relations
  - Add the best predicate to the current rule
  - Set the involved relation to active
- Until
  - The best predicate does not have enough gain, or
  - The current rule is too long
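A sketch of this search loop, under the assumption that best_predicate(rel) evaluates all predicates of a relation via its propagated IDs and returns (gain, predicate, relation) or None, and that joins maps each relation to its joinable neighbors; both are hypothetical. Defaults mirror the slide-27 parameters.

```python
def generate_rule(target, joins, best_predicate,
                  max_rule_length=6, min_foil_gain=2.5):
    """Grow one rule predicate by predicate.  Candidates come from all
    active relations plus every relation one join away; the winning
    predicate's relation becomes active (tuple IDs are propagated to it)."""
    rule, active = [], {target}
    while len(rule) < max_rule_length:
        candidates = set(active)
        for rel in active:                     # relations one join away
            candidates.update(joins.get(rel, ()))
        scored = [s for s in map(best_predicate, candidates) if s]
        if not scored:
            break
        gain, pred, rel = max(scored, key=lambda s: s[0])
        if gain < min_foil_gain:
            break
        rule.append(pred)
        active.add(rel)                        # IDs now propagated here
    return rule if rule else None
```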
21. Rule Generation Example
[Figure: starting from the target relation, the range of search expands to joinable relations; the best predicate is added to the rule (first predicate, then second predicate), and the search range grows around the newly active relation.]
22. Look-one-ahead in Rule Generation
- Two types of relations: entity relations and relationship relations
- Useful predicates often cannot be found on relationship relations
[Figure: IDs propagated from the target relation reach a relationship relation that offers no good predicate.]
- CrossMine's solution: when propagating IDs to a relationship relation, propagate them one more step, to the next entity relation (see the sketch below)
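A sketch of the look-one-ahead step, reusing the ID sets attached by the propagate_ids sketch earlier: when IDs reach a relationship relation (e.g. Disposition), they are pushed one join further to the entity relation behind it (e.g. Client) before predicates are searched. push_ids is a hypothetical helper, not CrossMine's code.

```python
from collections import defaultdict

def push_ids(source, dest, key):
    """Push already-propagated target-tuple IDs one join further:
    each dest tuple receives the union of the ID sets of all source
    tuples it joins with on `key`."""
    by_key = defaultdict(set)
    for s in source:
        by_key[s[key]].update(s["ids"])
    for d in dest:
        d["ids"] = sorted(by_key.get(d[key], set()))

# e.g. push_ids(disposition, client, "client-id") right after
# propagating target IDs to Disposition, so that predicates on the
# Client entity relation can be evaluated in the same search step.
```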
23. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
24. Negative Tuple Sampling
- A rule covers some positive examples
- Positive examples are removed once covered
- After generating many rules, far fewer positive examples remain than negative ones
25. Negative Tuple Sampling (cont.)
- When there are many more negative examples than positive ones
  - Good rules cannot be built (low support)
  - Learning is still time-consuming (large number of negative examples)
- Sample the negative examples (see the sketch below)
  - Improves efficiency without affecting rule quality
  - Keep |T-| < Neg_Pos_Ratio x |T+| and |T-| < Max_Num_Negative
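A minimal sketch of the sampling step; the two bounds follow the constraint above, with the slide-27 defaults. A uniform random sample keeps the negative class representative.

```python
import random

def sample_negatives(pos, neg, neg_pos_ratio=1, max_num_negative=600):
    """Cap the negative examples at both neg_pos_ratio * |pos| and
    max_num_negative before learning the next rule."""
    cap = min(int(neg_pos_ratio * len(pos)), max_num_negative)
    return neg if len(neg) <= cap else random.sample(neg, cap)
```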
26. Roadmap
- Motivation
- Problem definition - preliminaries
- Tuple ID Propagation
- Rule Generation
- Negative Tuple Sampling
- Performance Study
27. Performance Study
- 1.7 GHz Pentium 4 PC, Windows 2000
- Parameters for CrossMine-Rule:
  - Min_Foil_Gain = 2.5
  - Max_Rule_Length = 6
  - Neg_Pos_Ratio = 1
  - Max_Num_Negative = 600
28. Performance Study
- Synthetic relational databases are generated, varying
  - the number of relations
  - the number of tuples in each relation
  - the number of foreign keys
- Running time and accuracy are compared
- CrossMine also runs efficiently on disk-resident data, as in real applications
29. Synthetic Datasets
[Figures: scalability w.r.t. the number of relations, and scalability w.r.t. the number of tuples.]
30. Real Datasets
- PKDD Cup 99 dataset (loan application)
- Mutagenesis dataset (4 relations)
31. References
- H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of logical decision trees. In Proc. of the Fifteenth Int. Conf. on Machine Learning, Madison, WI, 1998.
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
- L. Dehaspe and H. Toivonen. Discovery of relational association rules. In Relational Data Mining, Springer-Verlag, 2000.
- L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In Proc. of the 18th Int. Conf. on Machine Learning, Williamstown, MA, 2001.
- H. A. Leiva. MRDTL: A multi-relational decision tree learning algorithm. M.S. thesis, Iowa State Univ., 2002.
- T. Mitchell. Machine Learning. McGraw-Hill, 1997.
- S. Muggleton. Inverse entailment and Progol. New Generation Computing, Special Issue on Inductive Logic Programming, 1995.
- S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of the First Conf. on Algorithmic Learning Theory, Tokyo, Japan, 1990.
- A. Popescul, L. Ungar, S. Lawrence, and D. Pennock. Towards structural logistic regression: Combining relational and statistical learning. In Proc. of the Multi-Relational Data Mining Workshop, Alberta, Canada, 2002.
- J. R. Quinlan. FOIL: A midterm report. In Proc. of the Sixth European Conf. on Machine Learning, Springer-Verlag, 1993.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Series in Machine Learning, Morgan Kaufmann, 1993.
- B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 2001.