Experiments with MRDTL

1
Experiments with MRDTL: A Multi-relational Decision Tree Learning Algorithm
Hector Leiva, Anna Atramentov and Vasant Honavar
Artificial Intelligence Laboratory, Department of Computer Science and Graduate Program in Bioinformatics and Computational Biology
Iowa State University, Ames, IA 50011, USA
www.cs.iastate.edu/honavar/aigroup/ html
Support provided in part by National Science Foundation, Carver Foundation, and Pioneer Hi-Bred, Inc.
2
Motivation
  • Importance of multi-relational learning
  • Growth of data stored in multi-relational databases (MRDB)
  • Techniques for learning from unstructured data often extract the data into an MRDB
  • Expansion of techniques for multi-relational learning
  • Blockeel's framework (ILP) (1998)
  • Getoor's framework (first-order extensions of probabilistic models, PM) (2001)
  • Knobbe's framework (MRDM) (1999). Problem: no experimental results available

Goals
  • Perform experiments and evaluate the performance of Knobbe's framework
  • Understand the strengths and limits of the approach

3
Multi-Relational Learning Literature
  • Inductive Logic Programming
  • First-order extensions of probabilistic models
  • Multi-Relational Data Mining
  • Propositionalization methods
  • PRM extensions for cumulative learning, for learning and reasoning as agents interact with the world
  • Approaches for mining data in the form of graphs
  • Blockeel, 1998; De Raedt, 1998; Knobbe et al., 1999; Friedman et al., 1999; Koller, 1999; Krogel and Wrobel, 2001; Getoor, 2001; Kersting et al., 2000; Pfeffer, 2000; Dzeroski and Lavrac, 2001; Dehaspe and De Raedt, 1997; Dzeroski et al., 2001; Jaeger, 1997; Karalic and Bratko, 1997; Holder and Cook, 2000; Gonzalez et al., 2000

4
Problem Formulation
  • Given: data stored in a relational database
  • Goal: build a decision tree for predicting the target attribute in the target table

Example of a multi-relational database

Schema
Department (ID, Specialization, Students)
Grad.Student (ID, Name, GPA, Publications, Advisor, Department)
Staff (ID, Name, Department, Position, Salary)

Instances

Department
ID  Specialization    Students
d1  Math              1000
d2  Physics           300
d3  Computer Science  400

Graduate Student
ID  Name    GPA  Publications  Advisor  Department
s1  John    2.0  4             p1       d3
s2  Lisa    3.5  10            p4       d3
s3  Michel  3.9  3             p4       d4

Staff
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k
5
Propositional decision tree algorithm.
Construction phase
Training data (d1, d2, d3, d4), split on Outlook at the root:

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
d1   Sunny     Hot          High      Weak    No
d2   Sunny     Hot          High      Strong  No
d3   Overcast  Hot          High      Weak    Yes
d4   Overcast  Cold         Normal    Weak    No

Left branch:
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
d1   Sunny     Hot          High      Weak    No
d2   Sunny     Hot          High      Strong  No

Right branch:
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
d3   Overcast  Hot          High      Weak    Yes
d4   Overcast  Cold         Normal    Weak    No

Tree_induction(D: data)
    A = optimal_attribute(D)
    if stopping_criterion(D)
        return leaf(D)
    else
        Dleft = split(D, A)
        Dright = splitcomplement(D, A)
        childleft = Tree_induction(Dleft)
        childright = Tree_induction(Dright)
        return node(A, childleft, childright)
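
A minimal Python sketch of the Tree_induction procedure above, run on the four PlayTennis rows from this slide. The slide does not say how optimal_attribute or stopping_criterion are defined, so this sketch assumes binary tests of the form attribute = value, chooses the test with the highest information gain, and stops when no test gives positive gain. All function names are illustrative.

```python
import math
from collections import Counter

# The four rows shown on the slide.
DATA = [
    {"Day": "d1", "Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "No"},
    {"Day": "d2", "Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong", "PlayTennis": "No"},
    {"Day": "d3", "Outlook": "Overcast", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak", "PlayTennis": "Yes"},
    {"Day": "d4", "Outlook": "Overcast", "Temperature": "Cold", "Humidity": "Normal", "Wind": "Weak", "PlayTennis": "No"},
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
TARGET = "PlayTennis"


def entropy(rows):
    """Shannon entropy (in bits) of the class labels in rows."""
    counts = Counter(r[TARGET] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def split(rows, test):
    """Binary split: rows that satisfy attribute == value, and the complement."""
    attr, value = test
    left = [r for r in rows if r[attr] == value]
    right = [r for r in rows if r[attr] != value]
    return left, right


def gain(rows, test):
    """Information gain of a binary test; 0 if one side is empty."""
    left, right = split(rows, test)
    if not left or not right:
        return 0.0
    n = len(rows)
    return entropy(rows) - len(left) / n * entropy(left) - len(right) / n * entropy(right)


def tree_induction(rows):
    """Recursive construction following the pseudocode on the slide."""
    tests = sorted({(a, r[a]) for r in rows for a in ATTRIBUTES})
    best = max(tests, key=lambda t: gain(rows, t))
    if gain(rows, best) <= 0.0:  # assumed stopping criterion
        majority = Counter(r[TARGET] for r in rows).most_common(1)[0][0]
        return {"leaf": majority}
    left, right = split(rows, best)
    return {"test": best, "left": tree_induction(left), "right": tree_induction(right)}


def classify(tree, row):
    """Follow the tests from the root down to a leaf."""
    while "leaf" not in tree:
        attr, value = tree["test"]
        tree = tree["left"] if row[attr] == value else tree["right"]
    return tree["leaf"]
```

Calling tree_induction(DATA) picks a test on Outlook at the root, as in the slide, and the resulting tree classifies all four training rows correctly.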
6
MR setting. Splitting data with Selection Graphs
[Figure: selection graphs over Staff, Grad.Student and Department, together with their complement selection graphs, split the Staff table into three subsets.]

Staff (all instances)
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
p4  David   d3          Professor         80-100k

Subset 1
ID  Name    Department  Position          Salary
p4  David   d3          Professor         80-100k

Subset 2
ID  Name    Department  Position          Salary
p1  Dale    d1          Professor         70-80k

Subset 3
ID  Name    Department  Position          Salary
p2  Martin  d3          Postdoc           30-40k
p3  Victor  d2          VisitorScientist  40-50k
7
What is a selection graph?
  • It corresponds to a subset of the instances from the target table
  • Nodes correspond to the tables from the database
  • Edges correspond to the associations between tables
  • Open edge: "have at least one"
  • Closed edge: "have none of"

[Figure: two example selection graphs over Grad.Student, Department and Staff, one with the condition Specialization = math.]
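
One possible in-memory representation of a selection graph, sketched in Python. The class and function names (Node, Edge, matches, select) are assumptions of this sketch, not names from the slides. Data are the Staff and Graduate Student instances from the example database. An open edge requires at least one matching associated record, a closed edge requires that there be none.

```python
from dataclasses import dataclass, field

DB = {
    "Staff": [
        {"ID": "p1", "Name": "Dale", "Department": "d1", "Position": "Professor", "Salary": "70-80k"},
        {"ID": "p2", "Name": "Martin", "Department": "d3", "Position": "Postdoc", "Salary": "30-40k"},
        {"ID": "p3", "Name": "Victor", "Department": "d2", "Position": "VisitorScientist", "Salary": "40-50k"},
        {"ID": "p4", "Name": "David", "Department": "d3", "Position": "Professor", "Salary": "80-100k"},
    ],
    "Grad.Student": [
        {"ID": "s1", "Name": "John", "GPA": 2.0, "Publications": 4, "Advisor": "p1", "Department": "d3"},
        {"ID": "s2", "Name": "Lisa", "GPA": 3.5, "Publications": 10, "Advisor": "p4", "Department": "d3"},
        {"ID": "s3", "Name": "Michel", "GPA": 3.9, "Publications": 3, "Advisor": "p4", "Department": "d4"},
    ],
}


@dataclass
class Node:
    table: str                                       # table this node stands for
    conditions: list = field(default_factory=list)   # predicates on one row
    edges: list = field(default_factory=list)        # outgoing Edge objects


@dataclass
class Edge:
    parent_key: str      # column in the parent node's table
    child_key: str       # column in the child node's table
    child: Node
    open: bool = True    # True: "has at least one"; False: "has none of"


def matches(node, row, db):
    """True if row satisfies the node's conditions and all of its edges."""
    if not all(cond(row) for cond in node.conditions):
        return False
    for e in node.edges:
        found = any(
            other[e.child_key] == row[e.parent_key] and matches(e.child, other, db)
            for other in db[e.child.table]
        )
        if found != e.open:
            return False
    return True


def select(graph, db):
    """IDs of the target-table rows selected by the graph."""
    return [r["ID"] for r in db[graph.table] if matches(graph, r, db)]


# Staff who are professors.
professors = Node("Staff", [lambda r: r["Position"] == "Professor"])
# Staff who advise at least one graduate student with GPA > 2.0 (open edge).
good_advisors = Node("Staff", edges=[
    Edge("ID", "Advisor", Node("Grad.Student", [lambda r: r["GPA"] > 2.0]))
])
# Staff who advise no graduate students at all (closed edge).
no_students = Node("Staff", edges=[Edge("ID", "Advisor", Node("Grad.Student"), open=False)])
```

On this data, select(professors, DB) gives p1 and p4, select(good_advisors, DB) gives p4, and select(no_students, DB) gives p2 and p3.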
8
Automatic transformation of selection graphs into SQL queries

Generic query:
select distinct T0.primary_key
from table_list
where join_list and condition_list

Graph: Staff with the condition Position = Professor
select distinct T0.id
from Staff T0
where T0.position = 'Professor'

Graph: Staff with an open edge to Grad.Student
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor

Graph: Staff with a closed edge to Grad.Student
select distinct T0.id
from Staff T0
where T0.id not in (select T1.Advisor from Graduate_Student T1)

Graph: Staff with an open edge to Grad.Student and a closed edge to Grad.Student with the condition GPA > 3.9
select distinct T0.id
from Staff T0, Graduate_Student T1
where T0.id = T1.Advisor
  and T0.id not in (select T1.Advisor from Graduate_Student T1 where T1.GPA > 3.9)
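
The four queries on this slide can be tried against the example database with Python's built-in sqlite3 module. This is a sketch under assumptions: the column names and types are guesses based on the schema slide, the closed-edge subqueries select the Advisor column (the column that joins Graduate_Student to Staff), and the GPA threshold is passed as a parameter so that other values can be tried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Staff (id text primary key, name text, department text,
                    position text, salary text);
create table Graduate_Student (id text primary key, name text, GPA real,
                               publications integer, Advisor text, department text);
insert into Staff values
  ('p1', 'Dale',   'd1', 'Professor',        '70-80k'),
  ('p2', 'Martin', 'd3', 'Postdoc',          '30-40k'),
  ('p3', 'Victor', 'd2', 'VisitorScientist', '40-50k'),
  ('p4', 'David',  'd3', 'Professor',        '80-100k');
insert into Graduate_Student values
  ('s1', 'John',   2.0, 4,  'p1', 'd3'),
  ('s2', 'Lisa',   3.5, 10, 'p4', 'd3'),
  ('s3', 'Michel', 3.9, 3,  'p4', 'd4');
""")

QUERIES = {
    # Condition on the target node.
    "professors":
        "select distinct T0.id from Staff T0 where T0.position = 'Professor'",
    # Open edge: staff with at least one graduate student.
    "open_edge":
        "select distinct T0.id from Staff T0, Graduate_Student T1 "
        "where T0.id = T1.Advisor",
    # Closed edge: staff with no graduate students.
    "closed_edge":
        "select distinct T0.id from Staff T0 "
        "where T0.id not in (select T1.Advisor from Graduate_Student T1)",
    # Open edge plus closed edge with a condition: staff with at least one
    # student, none of whom has a GPA above the threshold.
    "open_and_closed":
        "select distinct T0.id from Staff T0, Graduate_Student T1 "
        "where T0.id = T1.Advisor and T0.id not in "
        "(select T2.Advisor from Graduate_Student T2 where T2.GPA > ?)",
}


def ids(name, params=()):
    """Run one of the queries and return the selected Staff ids, sorted."""
    return sorted(row[0] for row in conn.execute(QUERIES[name], params))
```

For example, ids("closed_edge") returns p2 and p3, the two staff members who advise nobody. With the slide's threshold, ids("open_and_closed", (3.9,)) returns p1 and p4, because no student has a GPA strictly above 3.9. Lowering the threshold to 3.0 leaves only p1.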
9
MR decision tree
  • Each node contains a selection graph
  • Each child's selection graph is a supergraph of the parent's selection graph

10
How to choose selection graphs in nodes?
  • Problem: there are too many supergraph selection graphs to choose from in each node
  • Solution:
  • start with an initial selection graph
  • use a greedy heuristic to choose supergraph selection graphs (refinements)
  • use binary splits for simplicity
  • for each refinement, get its complement refinement
  • choose the best refinement based on the information gain criterion
  • Problem: some potentially good refinements may give no immediate benefit
  • Solution:
  • look-ahead capability
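
The information gain criterion for a binary split can be written out as follows. The notation is an assumption of this note rather than something taken from the slides: S is the set of target-table instances selected by the parent selection graph, S_R the subset selected by a refinement R, S_{\bar R} = S \setminus S_R the subset selected by its complement refinement, and p_c(S) the fraction of instances in S with class c.

```latex
H(S) = -\sum_{c} p_c(S)\,\log_2 p_c(S)

\mathrm{Gain}(S, R) = H(S)
  - \frac{|S_R|}{|S|}\, H(S_R)
  - \frac{|S_{\bar R}|}{|S|}\, H(S_{\bar R})
```

If a refinement selects every instance in S, then S_R = S and S_{\bar R} is empty, so Gain(S, R) = 0. Such a refinement gives no immediate benefit, even though a further condition added to the new node may split S well.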

11
Refinements of selection graph
[Figure: selection graph over Grad.Student, Department and Staff, with the conditions Specialization = math and GPA > 3.9.]
  • add a condition to a node: explores attribute information in the tables
  • add a present edge and an open node: explores relational properties between the tables

12
Refinements of selection graph
[Figure: the parent graph (Specialization = math), a refinement that adds the condition Position = Professor, and the complement refinement with Position != Professor.]
  • add condition to the node
  • add present edge and open node
13
Refinements of selection graph
[Figure: the parent graph (Specialization = math), a refinement that adds the condition GPA > 2.0, and its complement refinement.]
  • add condition to the node
  • add present edge and open node
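
When a condition is added to a node other than the target node, negating the condition does not in general give the complement. The small Python sketch below illustrates this with the Graduate Student instances from the example database, taking Staff as the target table. The threshold 3.5 is a hypothetical value chosen so that one advisor has students on both sides of it, and the helper name advisors is illustrative.

```python
# (ID, Name, GPA, Advisor) for the three graduate students in the example.
STUDENTS = [
    ("s1", "John", 2.0, "p1"),
    ("s2", "Lisa", 3.5, "p4"),
    ("s3", "Michel", 3.9, "p4"),
]
THRESHOLD = 3.5  # hypothetical value, not from the slides


def advisors(pred):
    """Staff IDs with at least one student whose GPA satisfies pred."""
    return {adv for _sid, _name, gpa, adv in STUDENTS if pred(gpa)}


# Parent graph: Staff with an open edge to Grad.Student.
parent = advisors(lambda gpa: True)

# Refinement: the Grad.Student node gets the condition GPA > THRESHOLD.
refinement = advisors(lambda gpa: gpa > THRESHOLD)

# Naive "complement": negate the condition on the Grad.Student node.
naive = advisors(lambda gpa: not gpa > THRESHOLD)

# Proper complement: keep the open edge and add a closed edge to a
# Grad.Student node with the condition, i.e. "has a student, but none
# with GPA > THRESHOLD".
complement = parent - refinement
```

Here p4 advises one student above the threshold and one at or below it, so p4 appears in both refinement and naive, and the naive pair is not a split of the parent set. The refinement and the proper complement are disjoint and together cover the parent set.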

14
Refinements of selection graph
[Figure: the parent graph over Grad.Student, Department and Staff (Specialization = math, GPA > 3.9), a refinement that adds the condition Students > 200, and its complement refinement.]
  • add condition to the node
  • add present edge and open node

15
Refinements of selection graph
[Figure: the parent graph over Grad.Student, Department and Staff (Specialization = math, GPA > 3.9), a refinement, and its complement refinement. Note: information gain = 0.]
  • add condition to the node
  • add present edge and open node

16
Refinements of selection graph
[Figure: the parent graph over Grad.Student, Department and Staff (Specialization = math, GPA > 3.9), a further refinement, and its complement refinement.]
  • add condition to the node
  • add present edge and open node

19
Look ahead capability
[Figure: a refinement and its complement refinement, both over Grad.Student, Department and Staff, with the conditions Specialization = math and GPA > 3.9.]
20
Look ahead capability
[Figure: a look-ahead refinement over Grad.Student, Department and Staff, with the conditions Specialization = math, GPA > 3.9 and Students > 200.]
21
MR decision tree algorithm. Construction phase
  • for each non-leaf node:
  • consider all possible refinements of the node's selection graph, together with their complements
  • choose the best ones based on the information gain criterion
  • create children nodes

[Figure: decision tree whose nodes hold selection graphs over Staff and Grad.Student.]
22
MR decision tree algorithm. Classification phase
  • for each leaf:
  • apply the selection graph of the leaf to the test data
  • classify the resulting instances with the classification of the leaf

[Figure: decision tree whose nodes hold selection graphs over Staff, Grad.Student and Department, with conditions such as GPA > 3.9, Position = Professor, Spec = math and Spec = physics, and the values 70-80k and 80-100k.]
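
The classification phase can be sketched in Python as a loop over leaves. Everything in this example is hypothetical: the three leaves, their labels, and the helper names are made up for illustration, and each leaf's selection graph is written directly as a set expression over the example Staff and Graduate Student instances rather than as SQL.

```python
# Test data: Staff IDs and (ID, GPA, Advisor) triples for graduate students.
TEST_DB = {
    "staff": ["p1", "p2", "p3", "p4"],
    "students": [("s1", 2.0, "p1"), ("s2", 3.5, "p4"), ("s3", 3.9, "p4")],
}


def advisors(students, pred=lambda gpa: True):
    """Staff IDs with at least one student whose GPA satisfies pred."""
    return {adv for _sid, gpa, adv in students if pred(gpa)}


def leaf_high(db):
    # Open edge to a Grad.Student node with GPA > 2.0.
    return set(db["staff"]) & advisors(db["students"], lambda gpa: gpa > 2.0)


def leaf_low(db):
    # Open edge to Grad.Student, closed edge to Grad.Student with GPA > 2.0.
    return (set(db["staff"]) & advisors(db["students"])) - leaf_high(db)


def leaf_none(db):
    # Closed edge to Grad.Student: staff with no students.
    return set(db["staff"]) - advisors(db["students"])


# Each leaf pairs a selection graph with a (hypothetical) class label.
LEAVES = [(leaf_high, "80-100k"), (leaf_low, "70-80k"), (leaf_none, "other")]


def classify(leaves, db):
    """Apply every leaf's selection graph to db and label what it selects."""
    predictions = {}
    for select, label in leaves:
        for instance_id in select(db):
            predictions[instance_id] = label
    return predictions
```

Because each split pairs a refinement with its complement, the leaves select disjoint subsets that together cover the target table, so every test instance receives exactly one label. Here classify(LEAVES, TEST_DB) labels p4 as 80-100k, p1 as 70-80k, and p2 and p3 as other.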
23
Experimental results. Mutagenesis
  • The most widely used database in ILP.
  • Describes molecules of certain nitroaromatic compounds.
  • Goal: predict their mutagenic activity (the label attribute), i.e. their ability to cause DNA to mutate. High mutagenic activity can cause cancer.
  • Class distribution:

Compounds              Active  Inactive  Total
Regression friendly    125     63        188
Regression unfriendly  13      29        42
Total                  138     92        230
  • There are 5 levels of background knowledge: B0, B1, B2, B3, B4. They provide increasingly rich descriptions of the examples. Only the first three levels (B0, B1, B2) are used here.

24
Experimental results. Mutagenesis
  • Results of 10-fold cross-validation for the regression friendly set.

Systems  Accuracy (%)     Time (secs.)
         B0   B1   B2     B0     B1    B2
Progol   79   86   86     8595   4627  6530
Progol   76   81   83     117k   64k   42k
FOIL     61   61   83     4950   9138  0.5
TILDE    75   79   85     41     170   142
MRDTL    67   87   88     0.85   332   221
  • Size of decision trees.

Systems  Number of nodes
         B0   B1   B2
MRDTL    1    53   51
25
Experimental results. Mutagenesis
  • Results of leave-one-out cross-validation for the regression unfriendly set.

Background  Accuracy (%)  Time (secs.)  Nodes
B0          70            0.6           1
B1          81            86            24
B2          81            60            22
  • Two recent approaches (Sebag and Rouveirol, 1997) and (Kramer and De Raedt, 2001) using B3 have achieved 93.6% and 94.7%, respectively, on the mutagenesis database.

26
Experimental results. KDD Cup 2001
  • Consists of a variety of details about the various genes of one particular type of organism.
  • Genes code for proteins, and these proteins tend to localize in various parts of cells and interact with one another in order to perform crucial functions.
  • Task: prediction of gene/protein localization (15 possible values)
  • Target table: Gene
  • Target attribute: Localization
  • 862 training genes, 381 test genes.
  • Challenge: many attribute values are missing.
  • Approach: using a special value to encode a missing value. Result: accuracy of 50%
  • Good techniques for filling in missing values have to be found.

27
Experimental results. KDD Cup 2001
  • Approach: replacing missing values by the most common value of the attribute for the class. Results:
    - accuracy of around 85% with a decision tree of 367 nodes, with no limit on the number of times an association can be instantiated
    - accuracy of 80% when limiting the number of times an association can be instantiated
    - accuracy of around 75% when following associations only in the forward direction
  • This shows that providing reasonable guesses for missing values can significantly enhance the performance of MRDTL on real-world data sets.
  • In practice, since the class labels for test data are unknown, it is not possible to apply this method.
  • Approach: extension of the Naïve Bayes algorithm for relational data. Result: no improvement compared to the first approach
  • Handling of missing values has to be incorporated into the decision tree algorithm
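
The first approach above, replacing a missing value by the most common value of the attribute within the instance's class, can be sketched in Python as follows. The rows, the attribute name essential, and the function names are made up for illustration and are not taken from the KDD Cup 2001 data. Missing values are represented by None.

```python
from collections import Counter, defaultdict

# Made-up rows: gene ID, class label (localization), one nominal attribute.
ROWS = [
    {"gene": "g1", "localization": "nucleus", "essential": "Essential"},
    {"gene": "g2", "localization": "nucleus", "essential": "Essential"},
    {"gene": "g3", "localization": "nucleus", "essential": "Non-Essential"},
    {"gene": "g4", "localization": "nucleus", "essential": None},
    {"gene": "g5", "localization": "cytoplasm", "essential": "Non-Essential"},
    {"gene": "g6", "localization": "cytoplasm", "essential": None},
]


def fit_class_modes(rows, attribute, target):
    """Most common observed value of attribute for each class of target."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[attribute] is not None:
            counts[row[target]][row[attribute]] += 1
    return {cls: c.most_common(1)[0][0] for cls, c in counts.items()}


def impute(rows, attribute, target, modes):
    """Return copies of rows with missing attribute values filled in."""
    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[attribute] is None and new_row[target] in modes:
            new_row[attribute] = modes[new_row[target]]
        filled.append(new_row)
    return filled
```

Note that impute reads the class label of every row it fills in. This is why the method cannot be applied to test data, whose class labels are unknown.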

28
Experimental results. Adult database
  • Suitable for propositional learning. One table, 6 numerical attributes, 8 nominal attributes.
  • Information from the 1994 census.
  • Task: determine whether a person makes over 50k a year.
  • Class distribution for the Adult database:

                     Training         Test
                     >50k   <=50k     >50k   <=50k    Total
With missing values  7841   24720     3846   12435    48842
W/o missing values   7508   22654     3700   11360    45222
  • Result after removing missing values and using the original train/test split: 82.2%
  • Filling in missing values with the Naïve Bayes approach yields 83%
  • C4.5 result: 84.46%

29
Summary
  • the algorithm is a promising alternative to existing algorithms such as Progol, FOIL, and TILDE
  • the running time is comparable with the best existing approaches
  • if equipped with principled approaches to handling missing values, it is an effective algorithm for learning from real-world relational data
  • the approach is an extension of propositional learning, and can be successfully applied to propositional learning
  • Questions:
  • - why can't we split the data based on the value of an attribute in an arbitrary table right away?
  • - is there a less restrictive and simpler way of representing the splits of data than selection graphs?
  • - the running time for computing the first nodes in the decision tree is much less than for the rest of the nodes. Is this unavoidable? Can we implement the same idea more efficiently?

30
Future work
  • Incorporation of more sophisticated techniques for handling missing values
  • Incorporation of more sophisticated pruning techniques or complexity regularization
  • More extensive evaluation of MRDTL on real-world data sets
  • Development of ontology-guided multi-relational decision tree learning algorithms to generate classifiers at multiple levels of abstraction (Zhang et al., 2002)
  • Development of variants of MRDTL for classification tasks where the classes are not disjoint, based on the recently developed propositional decision tree counterparts of such algorithms (Caragea et al., 2002)
  • Development of variants of MRDTL that can learn from heterogeneous, distributed, autonomous data sources, based on recently developed techniques for distributed learning and ontology-based data integration