Title: Mining Tree-Query Associations in a Graph
1Mining Tree-Query Associations in a Graph
- Bart Goethals
- University of Antwerp, Belgium
- Eveline Hoekx
- Jan Van den Bussche
- Hasselt University, Belgium
2Graph Data
- A (directed) graph over a set of nodes N is a set
G of edges: ordered pairs (i, j) with i, j ∈ N.
Snapshot of a graph representing the complete
metabolic pathway of a human.
3Graph Mining
- Transactional category
- dataset: set of many small graphs (transactions)
- frequency: number of transactions in which the pattern occurs (at least once)
- ILP: Warmr
- AGM, FSG, TreeMiner, gSpan, FFSM
- Single graph category
- dataset: single large graph
- frequency: number of copies of the pattern in the large graph
- Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM
The focus is on pattern mining; there is little work on association
rule mining!
4Our work
- Single graph category
- Pattern association rule mining
- Patterns with
- Existential nodes
- Parameters
- An occurrence of the pattern in G is any homomorphism from the pattern into G.
- So far only considered in the ILP (transactional) setting
5Example of a pattern
frequency = #{ x | ∃z : (5, z) ∈ G ∧ (z, 8) ∈ G ∧ (z, x) ∈ G }
6Patterns are conjunctive queries.
  select distinct G3.to as x
  from G G1, G G2, G G3
  where G1.from = 5 and G1.to = G2.from
  and G1.to = G3.from and G2.to = 8
frequency = #{ x | ∃z : (5, z) ∈ G ∧ (z, 8) ∈ G ∧ (z, x) ∈ G }
7Example of an Association Rule
8Features of the presented algorithms
- Pattern mining phase and association mining phase
- Restriction to trees ⇒ efficient algorithms
- Equivalence checking
- Apply theory of conjunctive database queries
- Database oriented implementation
9Outline rest of talk
- Formal problem definition
- Algorithms
- Pattern Mining
- Overall approach
- Outer loop: incremental
- Inner loop: levelwise
- Equivalence checking
- Association Rule Mining
- Result management
- Experimental results
- Future work
10Formal definition of a tree pattern.
- A tree pattern is a tree P whose nodes are called variables, and
- some variables are marked as existential (∃)
- some variables are parameters (labeled with a constant)
- the remaining variables are called distinguished
11Formal definition of a tree query.
- A tree query Q is a pair (H, P) where
- P is a tree pattern, the body of Q
- H, the head of Q, is a tuple of distinguished variables and
parameters of P. All distinguished variables of P
must appear at least once in H.
12Formal definition of a matching
- A matching of a pattern P in a graph G is a homomorphism
h: P → G, with h(z) = a for each parameter z labeled a.
19Example Matching
z? y z? x
h? 0 1 8 4
h? 0 1 8 8
h? 0 2 8 4
h? 0 2 8 5
h? 0 2 8 8
20Formal definition of frequency
We define the answer set of Q in G as follows:
Q(G) := { f(H) | f is a matching of P in G }
- The frequency of Q in G is the number of answers in the answer set.
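The definitions of matching, answer set and frequency can be illustrated with a small brute-force sketch. It is not the algorithm of the talk, only a direct reading of the definitions. The graph, the variable names `a`, `b`, `z`, `x` and the function names are assumptions chosen for illustration; the pattern is the one from the earlier example (edges from the parameter 5 to z, from z to the parameter 8, and from z to x).

```python
from itertools import product

def matchings(pattern_edges, params, graph):
    """Yield every homomorphism h from the pattern into the graph.

    pattern_edges: set of (variable, variable) pairs forming the tree pattern
    params: dict mapping each parameter variable to its constant label
    graph: set of (node, node) pairs
    """
    variables = sorted({v for edge in pattern_edges for v in edge})
    nodes = sorted({n for edge in graph for n in edge})
    for values in product(nodes, repeat=len(variables)):
        h = dict(zip(variables, values))
        # Parameters must be mapped to their constant labels.
        if any(h[z] != a for z, a in params.items()):
            continue
        # Every pattern edge must be mapped onto a graph edge.
        if all((h[u], h[v]) in graph for u, v in pattern_edges):
            yield h

def answer_set(head, pattern_edges, params, graph):
    """Q(G): the set of tuples f(H) over all matchings f."""
    return {tuple(h[v] for v in head)
            for h in matchings(pattern_edges, params, graph)}

def frequency(head, pattern_edges, params, graph):
    """Number of distinct answers in the answer set."""
    return len(answer_set(head, pattern_edges, params, graph))

# Hypothetical example graph (not taken from the talk).
G = {(5, 1), (5, 2), (1, 8), (1, 3), (2, 4), (2, 8), (6, 8)}
P = {("a", "z"), ("z", "b"), ("z", "x")}   # a and b are parameters, z is existential
PARAMS = {"a": 5, "b": 8}
HEAD = ("x",)

print(frequency(HEAD, P, PARAMS, G))
```

Because the head contains only x, the two matchings that send x to node 8 (through z = 1 and through z = 2) count as a single answer.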
21Example Matching
z? y z? x
h? 0 1 8 4
h? 0 1 8 8
h? 0 2 8 4
h? 0 2 8 5
h? 0 2 8 8
frequency ???
22Problem statement 1 Tree query mining
- Given a graph G and a threshold k, find all tree
queries that have frequency at least k in G.
- Those queries are called frequent.
23Formal definition of an association rule
An association rule (AR) is of the form Q1 ⇒ Q2
with Q1 and Q2 tree queries. The AR is legal if
Q2 ⊆ Q1. The confidence of the AR in a graph G
is defined as the frequency of Q2 divided by the
frequency of Q1.
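As a worked illustration of confidence, the sketch below evaluates two hand-written queries on a small graph. Both queries and the graph are assumptions made up for this example: Q1 asks for all x reachable from node 5 in two steps through some z, and Q2 additionally requires an edge from z to node 8, so every answer of Q2 is also an answer of Q1.

```python
from fractions import Fraction

# Hypothetical example graph (not taken from the talk).
G = {(5, 1), (5, 2), (5, 7), (1, 8), (1, 3), (2, 4), (2, 8), (7, 9)}

def answers_q1(g):
    """Q1: all x such that some z has (5, z) in g and (z, x) in g."""
    return {x for (a, z) in g if a == 5
              for (z2, x) in g if z2 == z}

def answers_q2(g):
    """Q2: as Q1, but z must also have an edge to node 8."""
    return {x for (a, z) in g if a == 5 and (z, 8) in g
              for (z2, x) in g if z2 == z}

def confidence(g):
    """Frequency of Q2 divided by frequency of Q1 for the rule Q1 => Q2."""
    q1, q2 = answers_q1(g), answers_q2(g)
    # On any graph, Q2's answers are a subset of Q1's (the rule is legal).
    assert q2 <= q1
    return Fraction(len(q2), len(q1))

print(confidence(G))
```

Here Q1 has the four answers 3, 4, 8 and 9, while Q2 loses the answer 9 because node 7 has no edge to 8, giving a confidence of 3/4.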
24Problem statement 2 Association rule mining
- Input: a graph G, minsup, a tree query Qleft
frequent in G, minconf
- Output: all tree queries Q such that Qleft ⇒ Q is
a legal and confident association rule in G.
26Pattern Mining Algorithm
- Outer loop: generate, incrementally, all possible trees of
increasing sizes. Avoid generation of isomorphic trees.
- Inner loop: for each newly generated tree,
generate all queries based on that tree, and test
their frequency.
27Outer loop
- It is well known how to efficiently generate all
trees uniquely up to isomorphism.
- Based on a canonical form of trees.
- Scoins, Li-Ruskey, Zaki, Chi-Yang-Muntz
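The role of a canonical form can be sketched as follows. This is a deliberately naive illustration, not one of the efficient enumeration algorithms cited on the slide: it grows rooted unordered trees one leaf at a time and uses a canonical form (children sorted recursively) to discard isomorphic duplicates. The nested-tuple encoding and the function names are assumptions made for this example.

```python
def canon(tree):
    """Canonical form of a rooted unordered tree.

    A tree is a tuple of its child subtrees; a single node is ().
    Sorting the children recursively makes isomorphic trees equal.
    """
    return tuple(sorted(canon(child) for child in tree))

def add_leaf(tree):
    """Yield every tree obtained by attaching one new leaf to some node."""
    yield tree + ((),)                       # new leaf under the root
    for i, child in enumerate(tree):
        for grown in add_leaf(child):        # new leaf somewhere below child i
            yield tree[:i] + (grown,) + tree[i + 1:]

def trees_up_to(n):
    """Return a list of levels; level k holds all trees with k + 1 nodes."""
    level = {()}
    levels = [level]
    for _ in range(n - 1):
        # Grow every tree of the previous level and deduplicate canonically.
        level = {canon(t) for smaller in level for t in add_leaf(smaller)}
        levels.append(level)
    return levels

print([len(level) for level in trees_up_to(5)])  # [1, 1, 2, 4, 9]
```

The cited methods avoid generating the duplicates in the first place; the set-based deduplication here only shows why a canonical form makes "unique up to isomorphism" checkable.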
28Inner loop Levelwise approach
- A query Q is characterized by:
- Π_Q: set of existential nodes
- Σ_Q: set of parameters
- Labeling λ_Q of the parameters by constants.
- Q1 = (Π1, Σ1, λ1) specializes Q2 = (Π2, Σ2, λ2) if
Π1 ⊇ Π2, Σ1 ⊇ Σ2 and λ1 agrees with λ2 on Σ2.
- If Q1 specializes Q2 then freq(Q1) ≤ freq(Q2).
- Most general query: T = (∅, ∅, ∅)
29Inner loop Candidate generation
- CanTab_{Π,Σ} = { λ | (Π, Σ, λ) is a candidate query }
- FreqTab_{Π,Σ} = { λ | (Π, Σ, λ) is a frequent query }
- (Π', Σ') is a parent of (Π, Σ) if either
- Σ' = Σ and Π has precisely one more node than Π', or
- Π' = Π and Σ has precisely one more node than Σ'
- Join Lemma
- Each candidacy table can be computed by taking
the natural join of its parent frequency tables.
30Inner loop Frequency counting
- Each candidacy table can be computed by a single
SQL query (cf. the Join Lemma).
- Suppose G(from, to) is a table in the database; then
each frequency table can be computed with a
single SQL query.
- Case Σ = ∅:
- formulate the query in SQL and count
- Case Σ ≠ ∅:
- formulate (Π, ∅, ∅) in SQL as an expression E
- take the natural join of E with CanTab_{Π,Σ}
- group by Σ
- count each group
31Inner loop Example
Example tree with edges x1 → x2, x2 → x3 and x2 → x4; Π = {x2}, Σ = {x1, x3}.
32Inner loop Example
- Join expression
- CanTab_{{x2},{x1,x3}} = FreqTab_{{x2},{x1}} ⋈ FreqTab_{{x2},{x3}} ⋈ FreqTab_{∅,{x1,x3}}
33Inner loop Example
- SQL expression E for ({x2}, ∅, ∅)
  select distinct G1.from as x1, G2.to as x3, G3.to as x4
  from G G1, G G2, G G3
  where G1.to = G2.from and G3.from = G2.from
34Inner loop Example
- SQL expression for filling the frequency table
  select distinct E.x1, E.x3, count(E.x4)
  from E, CanTab_{{x2},{x1,x3}} as CT
  where E.x1 = CT.x1 and E.x3 = CT.x3
  group by E.x1, E.x3
  having count(E.x4) >= k
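The two SQL expressions of this example can be tried out with a minimal sketch on an in-memory SQLite database (the talk's implementation runs on IBM DB2; SQLite is swapped in here only so the example is self-contained). The edge list, the candidate labelings, the table name `CanTab` and the function name are assumptions for illustration. The column names `from` and `to` are quoted because they are reserved words.

```python
import sqlite3

def frequency_table(edges, candidates, k):
    """Fill the frequency table for Π = {x2}, Σ = {x1, x3} with threshold k."""
    con = sqlite3.connect(":memory:")
    con.execute('create table G ("from" integer, "to" integer)')
    con.executemany("insert into G values (?, ?)", edges)
    con.execute("create table CanTab (x1 integer, x3 integer)")
    con.executemany("insert into CanTab values (?, ?)", candidates)
    sql = """
        with E as (
            -- the query ({x2}, {}, {}): parameters treated as distinguished
            select distinct G1."from" as x1, G2."to" as x3, G3."to" as x4
            from G G1, G G2, G G3
            where G1."to" = G2."from" and G3."from" = G2."from"
        )
        select E.x1, E.x3, count(E.x4) as freq
        from E join CanTab CT on E.x1 = CT.x1 and E.x3 = CT.x3
        group by E.x1, E.x3
        having count(E.x4) >= ?
        order by E.x1, E.x3
    """
    rows = con.execute(sql, (k,)).fetchall()
    con.close()
    return rows

# Hypothetical graph and candidate labelings (x1, x3).
EDGES = [(0, 1), (0, 2), (1, 8), (1, 4), (2, 8), (2, 4), (2, 5)]
CANDIDATES = [(0, 8), (0, 5)]

print(frequency_table(EDGES, CANDIDATES, 3))
```

Each returned row is a labeling of (x1, x3) together with the number of distinct x4 values it admits; labelings below the threshold are dropped by the `having` clause.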
35Equivalent queries
- Queries Q1 and Q2 are equivalent if they have the same answer
sets on all graphs G (up to renaming of the distinguished variables).
- 2 cases of equivalent queries:
- Q1 has fewer nodes than Q2
- Q1 and Q2 have the same number of nodes
36Equivalence theorem
Two queries are equivalent if and only if there
are containment mappings between them in both directions.
- A containment mapping from Q1 to Q2 is a homomorphism h:
Q1 → Q2 that maps distinguished variables of Q1 one-to-one to
distinguished variables of Q2, and maps parameters of Q1 to
parameters of Q2, preserving labels.
37Case 1: Q1 has fewer nodes than Q2
- Redundancy lemma
- Let Q be a tree query without selected nodes.
Then Q has a redundancy if and only if it contains a subtree C
in the form of a linear chain of ∃ nodes (possibly just a single
node), such that the parent of C has another subtree that is at least
as deep as C.
Redundant subtree
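The condition of the redundancy lemma can be checked mechanically. The sketch below is one possible reading of the lemma, assuming that "depth" means the number of nodes on the longest downward path of a subtree and that the query has no selected nodes; the `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    existential: bool = False
    children: list = field(default_factory=list)

def height(node):
    """Number of nodes on the longest downward path starting at node."""
    return 1 + max((height(c) for c in node.children), default=0)

def is_existential_chain(node):
    """True if the subtree rooted at node is a linear chain of ∃ nodes."""
    while True:
        if not node.existential or len(node.children) > 1:
            return False
        if not node.children:
            return True
        node = node.children[0]

def has_redundancy(root):
    """Look for a chain C of ∃ nodes whose parent has another subtree at least as deep."""
    stack = [root]
    while stack:
        parent = stack.pop()
        for c in parent.children:
            if is_existential_chain(c) and any(
                    d is not c and height(d) >= height(c)
                    for d in parent.children):
                return True
        stack.extend(parent.children)
    return False

# An ∃ leaf next to a distinguished leaf: the ∃ leaf adds no constraint.
redundant = Node(children=[Node(existential=True), Node()])
# An ∃ chain of two nodes next to a single leaf: the sibling is too shallow.
not_redundant = Node(children=[
    Node(existential=True, children=[Node(existential=True)]), Node()])

print(has_redundancy(redundant), has_redundancy(not_redundant))
```

In the first tree the existential leaf can be mapped onto its sibling, so removing it yields an equivalent query with fewer nodes; in the second tree no sibling is deep enough to absorb the chain.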
38Case 2: Q1 and Q2 have the same number of nodes
- Q1 and Q2 must be isomorphic.
- Canonical form of queries: refine the canonical
ordering of the underlying unlabeled tree, taking
into account node labels.
39Association Mining Algorithm
- Input: a graph G, minsup, a tree query Qleft
frequent in G, minconf
- Output: all tree queries Q such that Qleft ⇒ Q is
a legal and confident association rule in G.
40Containment mappings
- For each tree query, generate all containment
mappings from Qleft to Q, ignoring parameter assignments.
- For each containment mapping, generate all
parameter assignments such that Qleft ⇒ Q is
frequent and confident.
42Equivalent Association rules
- Equivalence checking of association rules is as
hard as general graph isomorphism testing.
43Outline rest of talk
- Result management
- Experimental results
- Future work
44Result management
- Output: frequency tables stored in a relational database.
- Browser
46Experimental results Real-life datasets
- Food web ??nodes????? ?edges?????
frequency 176
47Experimental results Real-life datasets
- Food web ??nodes????? ?edges?????
confidence 11
48Experimental results Performance
- Fully implemented on top of IBM DB2
- Preliminary performance results
- pattern mining algorithm
- adequate performance
- huge number of patterns
- constant overhead per discovered pattern
- association mining algorithm
- very fast
- constant overhead per discovered rule
49Future work
- Applications: scientific data mining
- Loosen restriction to trees
- Bart Goethals, Eveline Hoekx and Jan Van den
Bussche, Mining Tree Queries in a Graph, in
Proceedings of the Eleventh ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, pp. 61-69, ACM Press, 2005
- Eveline Hoekx and Jan Van den Bussche, Mining for
Tree-Query Associations in a Graph, to appear in
Proceedings of the 2006 IEEE International
Conference on Data Mining (ICDM 2006)