CrossMine: Efficient Classification Across Multiple Database Relations

1
CrossMine: Efficient Classification Across
Multiple Database Relations
  • Xiaoxin Yin, Jiawei Han, Jiong Yang
  • University of Illinois at Urbana-Champaign
  • Philip S. Yu
  • IBM T. J. Watson Research Center

2
Roadmap
  • Introduction, definitions
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

3
Introduction, definitions
  • Most real-world data are stored in relational
    databases
  • Multi-relational classification: the procedure of
    building a classifier based on information stored
    in multiple database relations
  • Inductive Logic Programming (ILP) approaches are
    the most widely used, but they are not scalable
  • Multi-relational classification: automatically
    classifying objects using multiple relations

4
An Example: Loan Applications
[Figure: a customer applies for a loan; the bank asks the backend database whether to approve or not.]
5
The Backend Database
Target relation: Loan. Each tuple has a class label,
indicating whether a loan is paid on time.
  • Loan: loan-id, account-id, date, amount, duration,
    payment
  • Account: account-id, district-id, frequency, date
  • Order: order-id, account-id, bank-to, account-to,
    amount, type
  • Transaction: trans-id, account-id, date, type,
    operation, amount, balance, symbol
  • Disposition: disp-id, account-id, client-id
  • Card: card-id, disp-id, type, issue-date
  • Client: client-id, birth-date, gender, district-id
  • District: district-id, dist-name, region, people,
    lt-500, lt-2000, lt-10000, gt-10000, city,
    ratio-urban, avg-salary, unemploy95, unemploy96,
    den-enter, crime95, crime96

How should decisions be made on loan applications?
6
Roadmap
  • Motivation
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

7
Preliminaries
  • Target relation
  • Class labels
  • Predicates
  • Rules
  • Decision Trees
  • Searching for Predicates by Joins

8
The problem
The joined relation of Loan, Account, Order, and
Transaction (x-y denotes attribute y in
relation x)
9
Rule Generation
  • Search for good predicates across multiple
    relations

[Figure: Applicants 1-4 in the Loan Applications relation, connected to the Orders, Accounts, Districts, and other relations.]
10
Previous Approaches
  • Inductive Logic Programming (ILP)
  • To build a rule, repeatedly find the best predicate
  • To evaluate a predicate on relation R, first join
    the target relation with R
  • Not scalable because
  • The search space is huge (numerous candidate
    predicates)
  • Evaluating each predicate is inefficient
  • For example, to evaluate the predicate in
  • Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?,
    monthly, ?)
  • ILP first joins the Loan relation with the Account
    relation
  • CrossMine is more scalable and more than one
    hundred times faster on datasets of reasonable
    size

11
CrossMine: An Efficient and Accurate
Multi-relational Classifier
  • Tuple-ID propagation: an efficient and flexible
    method for virtually joining relations
  • Confines the rule search process to promising
    directions
  • Look-one-ahead: a more powerful search strategy
  • Negative tuple sampling: improves efficiency while
    maintaining accuracy

12
Roadmap
  • Motivation
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

13
Tuple ID Propagation
Instead of performing a physical join, the IDs and
class labels of target tuples can be propagated
to the Account relation
14
Tuple ID Propagation
[Figure: target tuples for Applicants 1-4 and the Account relation after ID propagation.]

Account ID   Frequency   Open date   Propagated ID   Labels (pos, neg)
124          monthly     02/27/93    1, 2            2, 0
108          weekly      09/23/97    3               0, 1
45           monthly     12/09/96    4               0, 1
67           weekly      01/01/97    Null            0, 0

  • Possible predicates
  • Frequency = monthly: 2 positive, 1 negative
  • Open date < 01/01/95: 2 positive, 0 negative

  • Propagate tuple IDs of target relation to
    non-target relations
  • Virtually join relations to avoid the high cost
    of physical joins
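
The propagation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `loans` mapping is inferred from the propagated IDs and label counts in the table above, dates are assumed to be MM/DD/YY and rewritten in ISO form, and all function names are hypothetical.

```python
from collections import defaultdict

# Target relation (inferred from the slide's table): loan_id -> (account_id, label).
loans = {1: (124, "+"), 2: (124, "+"), 3: (108, "-"), 4: (45, "-")}

# Non-target relation: account_id -> attributes (dates assumed MM/DD/YY on the slide).
accounts = {
    124: {"frequency": "monthly", "open_date": "1993-02-27"},
    108: {"frequency": "weekly", "open_date": "1997-09-23"},
    45: {"frequency": "monthly", "open_date": "1996-12-09"},
    67: {"frequency": "weekly", "open_date": "1997-01-01"},
}


def propagate_ids(loans):
    """Attach to each account the set of target tuple IDs that join with it."""
    ids_by_account = defaultdict(set)
    for loan_id, (account_id, _label) in loans.items():
        ids_by_account[account_id].add(loan_id)
    return ids_by_account


def predicate_counts(accounts, ids_by_account, loans, test):
    """Count distinct positive and negative target tuples reachable through
    accounts that satisfy `test`, without materializing the join."""
    covered = set()
    for account_id, attrs in accounts.items():
        if test(attrs):
            covered |= ids_by_account.get(account_id, set())
    pos = sum(1 for loan_id in covered if loans[loan_id][1] == "+")
    return pos, len(covered) - pos


ids_by_account = propagate_ids(loans)
monthly = predicate_counts(
    accounts, ids_by_account, loans, lambda a: a["frequency"] == "monthly"
)
early = predicate_counts(
    accounts, ids_by_account, loans, lambda a: a["open_date"] < "1995-01-01"
)
```

Only the small ID sets travel across the join; the Account rows themselves are never copied, which is the point of the "virtual join".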

15
Tuple ID Propagation (cont.)
  • Efficient
  • Only propagate the tuple IDs
  • Time and space usage is low
  • Flexible
  • Can propagate IDs among non-target relations
  • Many sets of IDs can be kept on one relation,
    which are propagated from different join paths

16
Roadmap
  • Motivation
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

17
Overall Procedure
  • Sequential covering algorithm
  • while(enough target tuples left)
  • generate a rule
  • remove positive target tuples satisfying this
    rule

[Figure: the set of positive examples, with subsets covered by Rule 1, Rule 2, and Rule 3.]
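
The sequential covering loop above can be sketched as follows. This is a single-table toy under stated assumptions: `toy_learner` is a hypothetical stand-in for CrossMine's rule generation (it picks one attribute=value test by positives minus negatives covered), and the example data are invented for illustration.

```python
def sequential_covering(examples, learn_one_rule, min_positives=1):
    """examples: list of (features, is_positive). Returns a list of rules,
    where each rule is a function features -> bool."""
    remaining = list(examples)
    rules = []
    while sum(1 for _, y in remaining if y) >= min_positives:
        rule = learn_one_rule(remaining)
        if rule is None:
            break
        before = len(remaining)
        # Remove only the positive examples covered by the rule; negatives stay.
        remaining = [(x, y) for x, y in remaining if not (y and rule(x))]
        if len(remaining) == before:
            break  # the rule covered no positive example; stop to avoid looping
        rules.append(rule)
    return rules


def toy_learner(examples):
    """Hypothetical rule learner: best single attribute=value test."""
    best, best_score = None, 0
    for x, y in examples:
        if not y:
            continue
        for attr, val in x.items():
            pos = sum(1 for x2, y2 in examples if y2 and x2.get(attr) == val)
            neg = sum(1 for x2, y2 in examples if not y2 and x2.get(attr) == val)
            if pos - neg > best_score:
                best, best_score = (attr, val), pos - neg
    if best is None:
        return None
    attr, val = best
    return lambda x: x.get(attr) == val


examples = [
    ({"frequency": "monthly", "region": "north"}, True),
    ({"frequency": "monthly", "region": "south"}, True),
    ({"frequency": "weekly", "region": "north"}, True),
    ({"frequency": "weekly", "region": "south"}, False),
    ({"frequency": "weekly", "region": "south"}, False),
]
rules = sequential_covering(examples, toy_learner)
```
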
18
Rule Generation
  • To generate a rule
  • while(true)
  • find the best predicate p
  • if foil-gain(p) > threshold then add p to the
    current rule
  • else break

[Figure: positive and negative examples; the rule grows from A3=1 to A3=1 && A1=2 to A3=1 && A1=2 && A8=5.]
19
Evaluating Predicates
  • All predicates in a relation can be evaluated
    based on propagated IDs
  • Use foil-gain to evaluate predicates
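  • Suppose the current rule is r. For a predicate p,
  • foil-gain(p) = P(r+p) x [I(r) - I(r+p)], where
    I(r) = -log2(P(r) / (P(r) + N(r))) and P(r), N(r)
    are the numbers of positive and negative examples
    satisfying r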
  • Categorical Attributes
  • Compute foil-gain directly
  • Numerical Attributes
  • Discretize with every possible value
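
As a worked illustration, the sketch below computes the standard FOIL gain, foil-gain(p) = P(r+p) x [I(r) - I(r+p)] with I(r) = -log2(P(r) / (P(r) + N(r))), and scans every distinct value of a numerical attribute as a candidate threshold. It assumes one row per target tuple, returns a gain of 0 when the extended rule covers no positives (a convention chosen here), and uses hypothetical function names and toy values.

```python
import math


def info(pos, neg):
    """I(r) = -log2(P / (P + N)); requires pos > 0."""
    return -math.log2(pos / (pos + neg))


def foil_gain(pos_r, neg_r, pos_rp, neg_rp):
    """Gain of extending rule r (covering pos_r, neg_r) with predicate p,
    where r+p covers pos_rp positives and neg_rp negatives."""
    if pos_rp == 0:
        return 0.0  # convention: a predicate covering no positives is useless
    return pos_rp * (info(pos_r, neg_r) - info(pos_rp, neg_rp))


def best_threshold(rows, pos_r, neg_r):
    """rows: (value, is_positive) pairs for tuples satisfying r.
    Tries the predicate 'value <= v' for every distinct v and
    returns (best_gain, best_v)."""
    best = (0.0, None)
    pos = neg = 0
    rows = sorted(rows)
    for i, (v, is_pos) in enumerate(rows):
        pos += is_pos
        neg += not is_pos
        last_of_value = i + 1 == len(rows) or rows[i + 1][0] != v
        if last_of_value:
            gain = foil_gain(pos_r, neg_r, pos, neg)
            if gain > best[0]:
                best = (gain, v)
    return best


# With 2 positive and 2 negative target tuples under the empty rule:
gain_monthly = foil_gain(2, 2, 2, 1)  # a predicate covering 2 positive, 1 negative
gain_early = foil_gain(2, 2, 2, 0)    # a predicate covering 2 positive, 0 negative
split = best_threshold([(100, True), (200, True), (300, False), (400, False)], 2, 2)
```

Sorting once and sweeping cumulative counts evaluates all thresholds of one attribute in a single pass.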

20
Rule Generation
  • Start from the target relation
  • Only the target relation is active
  • Repeat
  • Search in all active relations
  • Search in all relations joinable to active
    relations
  • Add the best predicate to the current rule
  • Set the involved relation to active
  • Until
  • The best predicate does not have enough gain
  • Current rule is too long
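
The control flow of this search can be sketched as below. The predicate evaluator is passed in as a callable, since scoring depends on propagated IDs; `toy_best_predicate`, `TOY_GAINS`, `JOINABLE`, and the predicate strings are hypothetical placeholders. The default thresholds mirror the Min_Foil_Gain and Max_Rule_Length settings reported in the performance study.

```python
def generate_rule(target, joinable, best_predicate_in, min_gain=2.5, max_length=6):
    """joinable: relation -> set of relations it can be joined with.
    best_predicate_in(relation, rule) -> (gain, predicate or None)."""
    active = {target}  # only the target relation is active at first
    rule = []
    while len(rule) < max_length:
        # Search active relations and every relation joinable to an active one.
        candidates = set(active)
        for rel in active:
            candidates |= joinable.get(rel, set())
        best_gain, best_pred, best_rel = 0.0, None, None
        for rel in sorted(candidates):
            gain, pred = best_predicate_in(rel, rule)
            if pred is not None and gain > best_gain:
                best_gain, best_pred, best_rel = gain, pred, rel
        if best_pred is None or best_gain < min_gain:
            break  # the best predicate does not have enough gain
        rule.append(best_pred)
        active.add(best_rel)  # the involved relation becomes active
    return rule, active


# Hypothetical schema fragment and fixed gains, for illustration only.
JOINABLE = {
    "Loan": {"Account"},
    "Account": {"Loan", "District"},
    "District": {"Account"},
}
TOY_GAINS = {
    "Loan": (3.0, "Loan.duration = 12"),
    "Account": (5.0, "Account.frequency = monthly"),
    "District": (4.0, "District.region = north"),
}


def toy_best_predicate(relation, rule):
    gain, pred = TOY_GAINS.get(relation, (0.0, None))
    if pred in rule:
        return 0.0, None  # each toy predicate can be used once
    return gain, pred


rule, active = generate_rule("Loan", JOINABLE, toy_best_predicate)
```

In this toy run District is not searched in the first iteration, because it only becomes reachable once Account is active.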

21
Rule Generation Example
[Figure: starting from the target relation, the range of search widens as the first and second predicates are found; at each step the best predicate is added to the rule.]
22
Look-one-ahead in Rule Generation
  • Two types of relations: entity and relationship
  • Useful predicates often cannot be found on
    relationship relations

[Figure: the target relation joined to a relationship relation that has no good predicate.]
  • CrossMine's solution
  • When propagating IDs to a relationship relation,
    propagate one more step to the next entity
    relation.
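
A minimal sketch of look-one-ahead, assuming Disposition plays the role of the relationship relation between Account and Client. The Disposition rows and client IDs are invented for illustration, and `propagate` is a hypothetical helper, not the authors' code.

```python
def propagate(ids_by_key, links):
    """ids_by_key: key in the source relation -> set of target tuple IDs.
    links: (source_key, dest_key) pairs produced by the join condition.
    Returns dest_key -> set of target tuple IDs."""
    out = {}
    for src, dst in links:
        ids = ids_by_key.get(src)
        if ids:
            out.setdefault(dst, set()).update(ids)
    return out


def propagate_with_look_ahead(ids_on_account, dispositions):
    # Step 1: Account -> Disposition (a relationship relation).
    ids_on_disp = propagate(
        ids_on_account, [(d["account_id"], d["disp_id"]) for d in dispositions]
    )
    # Step 2 (look one ahead): Disposition -> Client (an entity relation),
    # so that predicates on Client can be evaluated in the same search step.
    ids_on_client = propagate(
        ids_on_disp, [(d["disp_id"], d["client_id"]) for d in dispositions]
    )
    return ids_on_disp, ids_on_client


# IDs already propagated from Loan to Account.
ids_on_account = {124: {1, 2}, 108: {3}, 45: {4}}
# Hypothetical Disposition rows.
dispositions = [
    {"disp_id": 1, "account_id": 124, "client_id": 10},
    {"disp_id": 2, "account_id": 124, "client_id": 11},
    {"disp_id": 3, "account_id": 108, "client_id": 12},
    {"disp_id": 4, "account_id": 67, "client_id": 13},
]
ids_on_disp, ids_on_client = propagate_with_look_ahead(ids_on_account, dispositions)
```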

23
Roadmap
  • Motivation
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

24
Negative Tuple Sampling
  • A rule covers some positive examples
  • Positive examples are removed after they are
    covered
  • After many rules have been generated, there are
    far fewer positive examples than negative ones

25
Negative Tuple Sampling (cont.)
  • When there are many more negative examples than
    positive ones
  • Good rules cannot be built (low support)
  • Rule building is still time consuming (large
    number of negative examples)
  • Sample the negative examples
  • Improves efficiency without affecting rule quality
  • T(-) < Neg_Pos_Ratio x T(+) and T(-) <
    Max_Num_Negative
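
The sampling step can be sketched as below. It treats both bounds as inclusive caps and uses uniform sampling without replacement with a fixed seed; those choices, and the function name, are assumptions rather than details given on the slide. The default values are the Neg_Pos_Ratio and Max_Num_Negative settings from the performance study.

```python
import random


def sample_negatives(positives, negatives, neg_pos_ratio=1,
                     max_num_negative=600, seed=0):
    """Return at most min(neg_pos_ratio * |positives|, max_num_negative)
    negative examples, chosen uniformly at random without replacement."""
    limit = min(int(neg_pos_ratio * len(positives)), max_num_negative)
    negatives = list(negatives)
    if len(negatives) <= limit:
        return negatives  # few enough negatives already; keep them all
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(negatives, limit)


positives = list(range(10))
negatives = list(range(100, 1100))  # 1000 negative examples
few = sample_negatives(positives, negatives)                      # capped by the ratio
capped = sample_negatives(positives, negatives, neg_pos_ratio=100)  # capped by the maximum
```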

26
Roadmap
  • Motivation
  • Problem definition - preliminaries
  • Tuple ID Propagation
  • Rule Generation
  • Negative Tuple Sampling
  • Performance Study

27
Performance study
  • 1.7 GHz Pentium 4 PC running Windows 2000
  • Parameters for CrossMine-Rule
  • Min_Foil_Gain = 2.5
  • Max_Rule_Length = 6
  • Neg_Pos_Ratio = 1
  • Max_Num_Negative = 600

28
Performance study
  • Synthetic relational databases are generated
  • The datasets vary in
  • Number of relations
  • Number of tuples in each relation
  • Number of foreign keys
  • The running time and accuracy are compared
  • CrossMine also runs efficiently on data stored
    on disk (as in real applications).

29
Synthetic datasets
[Plots: scalability w.r.t. number of relations; scalability w.r.t. number of tuples.]

30
Real Dataset
  • PKDD Cup 99 dataset: loan applications
  • Mutagenesis dataset (4 relations)
