Mining Complex Patterns in Massive Relational Data

Slides: 67

Provided by: informati3

Transcript and Presenter's Notes

1
Mining Complex Patterns in Massive Relational Data

Mohammed J. Zaki
Computer Science Department
Rensselaer Polytechnic Institute
http://www.cs.rpi.edu/~zaki

2
Team Members

Data Mining Template Library
Nilanjana De (PhD Student)
Benjarath Phoophakdee (PhD Student)
Nagender Parimi (MS Student)
Extensible Data Mining Server
Feng Gao (PhD Candidate)
Joe Urban (MS Student)
Jeevan Pathuri (MS Student)
Past Members
Paolo Palmerini (Visiting Scholar)

3
Frequent Structure Mining (FSM) Toolkit

Systematic solution to a whole class of common
pattern mining tasks, rather than a specific
problem, in MASSIVE, complex, relational datasets
Develop large-scale FSM toolkit
Extensible: modular for ease of use
Scalable: high-performance for interactive
response

4
Mining Complex Patterns

Common Pattern Mining Tasks
Itemsets (transactional, unordered data)
Sequences (temporal/positional: text, bioseqs)
Tree patterns (semi-structured/XML data, web
mining, bioinformatics)
Graph patterns (bioinformatics, web data)
Frequent spans
Long patterns in dense data
Subspace patterns
Correlations, other statistical metrics
Interesting, non-redundant patterns

5
Example Pattern Types
Itemset
Sequence

Can add attributes
To nodes
To edges
Attributes
Labels
Type (directed or undirected)
Set-valued

6
FSM Motivation

Exploratory analysis of complex datasets
High-dimensional
Massive
Relational (graph-based)
Detect subspace patterns
Detect abnormal/rare high-value patterns embedded
in mass of normal data
Improve/build global models
Classification
Clustering, etc.

7
FSM: Specific Applications

Link-detection
XML, semi-structured data
Mine structure and content
Web usage mining
LOGML: Log Markup Language
Bioinformatics
Detect patterns (motifs) in bio-molecules
(RNA/Proteins)
Security-related
Intrusion, Fraud, Failure detection

8
Induced vs Embedded Sub-patterns

Induced Sub-patterns: S = (Vs, Es) is a
sub-pattern of T = (V, E) if and only if
Vs ⊆ V
e = (nx, ny) ∈ Es iff (nx, ny) ∈ E (nx directly
connected to ny)
Embedded Sub-patterns: S = (Vs, Es) is a
sub-pattern of T = (V, E) if and only if
Vs ⊆ V
e = (nx, ny) ∈ Es iff nx ≤_l ny in T (nx connected
to ny)
We say S occurs in T if S is a sub-pattern of T
If S has k nodes, we call it a k-pattern
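
For rooted trees, the two definitions differ only in the edge test: an induced edge needs a direct parent-child link, while an embedded edge only needs an ancestor-descendant link. A minimal sketch of both tests, assuming the tree T is stored as a parent array (this representation and the function names are hypothetical, not taken from the slides):

```cpp
#include <vector>

// Tree stored as a parent array: parent[v] is the parent of node v,
// and -1 marks the root. This layout is an assumption of the sketch.
using ParentArray = std::vector<int>;

// Induced edge test: nx must be the direct parent of ny in T.
bool directly_connected(const ParentArray& parent, int nx, int ny) {
    return parent[ny] == nx;
}

// Embedded edge test: nx must be a proper ancestor of ny in T.
bool connected(const ParentArray& parent, int nx, int ny) {
    for (int v = parent[ny]; v != -1; v = parent[v])
        if (v == nx) return true;
    return false;
}
```

Every induced edge also passes the embedded test, so each induced sub-pattern is an embedded sub-pattern as well.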

9
FSM Current Practice

A new algorithm for a new type of pattern
No unifying theme or framework
Little DBMS support
Scalability issues (typically in-memory!)
Not interactive, online
Little support for KDD process

10
FSM Toolkit: A Generic Pattern Mining Engine

Data Mining Template Library (DMTL)
Generic algorithms (work for ANY pattern type!)
Ability to define custom pattern types
Persistent, generic data structures (containers)
Iterators for traversal over persistent structs
Extensible Data Mining Server (EDMS)
Persistency (patterns and data)
Indexing (patterns and data)
Data layout and I/O issues

11
FSM Toolkit

Scalable, high-performance data mining
Tight DBMS integration
FSM functionality
Mining (search over complex high-dimensional
spaces)
Pre-processing (discretization, feature and data
selection)
Post-processing (interesting, rare, visualization)

12
Data Mining Template Library (DMTL)

A generic pattern mining engine
Facilitates frequent structure mining tasks
Generic programming goes beyond object-oriented
Compile-time resolution vs. run-time
Generic algorithms (work for ANY pattern type!)
Ability to define custom pattern types
Persistent, generic data structures (containers)
Iterators for traversal

13
Generic vs. Object-Oriented

Object Oriented (Run-time polymorphism)
class Pattern {
  virtual bool sub-pattern(Pattern* P2);
};
class set : public Pattern { ... sub-pattern() ... };
class seq : public Pattern { ... sub-pattern() ... };
Pattern* P1 = new set; Pattern* P2 = new set;
sub-pattern(Pattern* P1, Pattern* P2) {
  return P1->sub-pattern(P2);
}
Which sub-pattern() will be called?
Determined at run-time (inefficient)

14
Generic vs. Object-Oriented

Generic Programming (Compile-time)
template <class P> class Pattern {
  P Pat;
  bool sub-pattern(Pattern<P>& P2) {
    return Pat.sub-pattern(P2.Pat);
  }
};
class set { ... sub-pattern() ... }; class seq { ... sub-pattern() ... };
Pattern<set>* P1 = new Pattern<set>; Pattern<set>* P2 = new Pattern<set>;
Which sub-pattern() will be called?
Determined at compile-time (efficient)
Works for ANY pattern type, provided sub-pattern
is defined (also checked at compile time)
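
The compile-time variant can be sketched in valid C++ as follows. The identifier `sub_pattern` replaces the slides' `sub-pattern` (a hyphen is not legal in a C++ name), `Set` is a hypothetical pattern type, and the direction of the test (the receiver is checked for being contained in the argument) is an assumption.

```cpp
#include <algorithm>
#include <vector>

// A pattern type only has to provide sub_pattern(); there is no base class.
struct Set {
    std::vector<int> items;  // kept sorted in ascending order
    // true if every item of this set also occurs in other
    bool sub_pattern(const Set& other) const {
        return std::includes(other.items.begin(), other.items.end(),
                             items.begin(), items.end());
    }
};

// Generic wrapper: the call to P::sub_pattern is bound at compile time,
// so no virtual table lookup is involved.
template <class P>
struct Pattern {
    P pat;
    int support = 0;
    bool sub_pattern(const Pattern<P>& p2) const {
        return pat.sub_pattern(p2.pat);
    }
};
```

Instantiating `Pattern<T>` for a type `T` without `sub_pattern` fails to compile as soon as the member is used, which is the compile-time check the slide refers to.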

15
DMTL Generic classes

Data structure to represent a pattern
Operations necessary for a pattern
Data structure to represent a group of patterns
Operations necessary for a group of patterns
Algorithms that operate on any pattern or a group
of patterns
Interface with Extensible Data Mining Server for
scalable statistics computations for a pattern or
group of patterns

16
Generic Pattern Class

Types of patterns
Itemset
Sequence
Trees, Graphs
Define your own!
Pattern Class
Members for support counting and other statistics
Sub-pattern checking (may require isomorphism
testing!)

17
Pattern Class
18
Pattern

template <class P>
class Pattern {
  P pat;
  int support;
  bool sub-pattern(Pattern<P>& P2) {
    return pat.sub-pattern(P2.pat);
  }
};
Every new pattern-type must define its own
sub-pattern function

19
Pattern-Types: Itemset and Sequences

class Itemset {
  typedef vector<int> PT;
  bool sub-pattern(PT p2);
  PT p;
};
class Sequence {
  typedef list<vector<int> > PT;
  bool sub-pattern(PT p2);
  PT p;
};
class Sequence2 {
  typedef list<pair<vector<int>, time> > PT;
};
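
One way the two sub-pattern functions could be filled in is sketched below. It assumes itemsets are stored sorted, that a sequence is contained in another when its elements can be matched in order to distinct elements of the other, and it uses `sub_pattern` as a legal C++ spelling of the slides' `sub-pattern`. None of this is confirmed to be DMTL's actual implementation.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Itemset: a sorted vector of item ids.
struct Itemset {
    typedef std::vector<int> PT;
    PT p;
    // true if p is a subset of p2 (both sorted ascending)
    bool sub_pattern(const PT& p2) const {
        return std::includes(p2.begin(), p2.end(), p.begin(), p.end());
    }
};

// Sequence: an ordered list of sorted itemsets.
struct Sequence {
    typedef std::list<std::vector<int> > PT;
    PT p;
    // true if the elements of p can be matched, in order, to distinct
    // elements of p2 such that each element is a subset of its match.
    // A greedy left-to-right scan is sufficient for this test.
    bool sub_pattern(const PT& p2) const {
        PT::const_iterator it = p.begin();
        for (PT::const_iterator jt = p2.begin();
             it != p.end() && jt != p2.end(); ++jt) {
            if (std::includes(jt->begin(), jt->end(), it->begin(), it->end()))
                ++it;
        }
        return it == p.end();
    }
};
```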

20
String Representation of Trees
[Figure: example tree with labeled nodes; its string encoding is 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1]
With N nodes, M branches, F max fanout:
Adjacency Matrix requires N(F+1) space
Adjacency List requires 4N-2 space
Tree (node, child, sibling) requires 3N space
String representation requires 2N-1 space
21
Tree String Representation

Like an itemset
-1 as the backtrack item
class Tree {
  typedef vector<int> pattern_type;
};
Assuming only labels on nodes
For trees, labels on edges can be treated as
labels on nodes
edge-label + node-label = new label!
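
The encoding can be produced by a depth-first walk that writes a label when it enters a node and -1 when it backtracks to the parent. The sketch below assumes a hypothetical `Node` helper type with integer labels; each of the N-1 non-root nodes contributes one label and one -1, and the root contributes one label, giving the 2N-1 entries quoted for the string representation.

```cpp
#include <vector>

// A labeled, ordered tree node (hypothetical helper type for this sketch).
struct Node {
    int label;
    std::vector<Node> children;
};

// Append the encoding of the subtree rooted at n: the label on the way
// down, then each child's encoding followed by the backtrack item -1.
void encode(const Node& n, std::vector<int>& out) {
    out.push_back(n.label);
    for (const Node& c : n.children) {
        encode(c, out);
        out.push_back(-1);
    }
}

// String encoding of a whole tree.
std::vector<int> tree_string(const Node& root) {
    std::vector<int> out;
    encode(root, out);
    return out;
}
```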

22
Graphs: DFS Tree
[Figure: a graph with vertices 0, 1, 2, 3 labeled A, B, D, C respectively, shown with its DFS tree and the remaining edges]
Edges: (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D)
23
Graphs

Many Possible Representations
Canonical DFS-tree (gSpan)
CAM: Canonical Adjacency Matrix (FFSM)
Adjacency Matrix (FSG, FGM, etc.)
DMTL (uses Canonical DFS-tree)
Graph is a vector of edges
Each edge is a 5-tuple (v1, v2, vl1, el, vl2)
class Graph {
  typedef vector<edges> PT;
};
class Edges {
  int v1; int v2;   // v1, v2 are node ids
  int vl1; int vl2; // vl1, vl2 are node labels
  int el;           // el is edge label
};
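
A compilable sketch of this edge-list representation follows. The field order mirrors the 5-tuple (v1, v2, vl1, el, vl2); `num_vertices` and `example_graph` are hypothetical helpers, and the edge label 0 is a placeholder because the DFS-tree slide lists only vertex ids and vertex labels for its four edges.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// One edge as a 5-tuple (v1, v2, vl1, el, vl2).
struct Edge {
    int v1;   // id of the first vertex
    int v2;   // id of the second vertex
    int vl1;  // label of the first vertex
    int el;   // edge label
    int vl2;  // label of the second vertex
};

// A graph pattern is a vector of edges.
typedef std::vector<Edge> Graph;

// Number of distinct vertex ids mentioned by the edges.
std::size_t num_vertices(const Graph& g) {
    std::set<int> ids;
    for (const Edge& e : g) {
        ids.insert(e.v1);
        ids.insert(e.v2);
    }
    return ids.size();
}

// The four edges (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D),
// with a placeholder edge label of 0.
Graph example_graph() {
    return Graph{ {0, 1, 'A', 0, 'B'}, {0, 3, 'A', 0, 'C'},
                  {1, 2, 'B', 0, 'D'}, {3, 2, 'C', 0, 'D'} };
}
```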

24
Pattern Family

We call a group of patterns a family
A generic class that works for any pattern type
(a collection of patterns)
Support operations like
Get frequency
Compute other statistics (e.g., count-by-class)
Compute maximal, closed, etc.
Persistency (via persistency manager from EDMS)
Pattern Indexing (out-of-core): prefix trees for
sets and sequences

25
Pattern Family and pvector

template <class PFT>
class PatternFamily {
  typedef PFT pat_fam_t;
  typedef typename PFT::pattern P;
  PFT pat_fam;
};
template <class P, class PM>
class pvector {
  typedef P pattern_type;
  typedef PM persist_mgr;
};

26
DMTL Hierarchy
27
Generic Mining Algorithms

A collection of common, generic frequent pattern
mining algorithms
Horizontal pattern matching based
Vertical intersection based
BFS or DFS
Work for any pattern/family type
Future Work
Projection based algorithms
Maximal (long) pattern mining
Closed pattern mining
Add constraints

28
Generic DFS-Mine

template <class PFT>
void DFS-Mine(PatternFamily<PFT>& PF, Dbase& DB) {
  typedef typename PFT::pattern_type P;
  typedef typename PFT::persist_mgr PM;
  ...
}
Works for any PFT (pattern family type)!

29
Candidate Generation and Support Counting

Candidate Generation
Extend by a node or an edge
Avoid duplicates as far as possible
May involve isomorphism testing for graphs
Not required for sets, sequences or trees
DMTL provides a generic includes operation
Support Counting
EDMS for data access
Generic intersection operation for any pattern
type (e.g. Eclat, Spade, TreeMiner)
For horizontal data use generic sub-pattern
operation
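
For itemsets, intersection-based support counting combined with a depth-first search can be sketched in the style of Eclat. This is an illustration under stated assumptions, not DMTL code: tid-lists are sorted vectors held in memory, `dfs_mine` and `intersect` are hypothetical names, and the initial class is expected to contain only frequent single items.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

typedef std::vector<int> Itemset;  // sorted item ids
typedef std::vector<int> TidList;  // sorted transaction ids
typedef std::vector<std::pair<Itemset, TidList> > EqClass;

// Support counting by intersecting two sorted tid-lists.
TidList intersect(const TidList& a, const TidList& b) {
    TidList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Depth-first search over one equivalence class (patterns that share a
// prefix). Each member is recorded, then joined with its later siblings.
void dfs_mine(const EqClass& cls, std::size_t minsup,
              std::map<Itemset, std::size_t>& frequent) {
    for (std::size_t i = 0; i < cls.size(); ++i) {
        frequent[cls[i].first] = cls[i].second.size();
        EqClass next;
        for (std::size_t j = i + 1; j < cls.size(); ++j) {
            TidList t = intersect(cls[i].second, cls[j].second);
            if (t.size() < minsup) continue;  // infrequent candidate
            Itemset cand = cls[i].first;
            cand.push_back(cls[j].first.back());
            next.push_back(std::make_pair(cand, t));
        }
        dfs_mine(next, minsup, frequent);
    }
}
```

The support of a candidate is simply the length of the intersected tid-list, so the horizontal database is never rescanned.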

30
Candidate Generation

Sets: add the next item in lex (or other) order,
added at end of last element
E.g. ABC to ABCD
Sequences: add any item at end of last element
(set or sequence extension)
E.g. ABC to ABCA (only sequence extension
allowed; same as A→B→C→A)
E.g. ABC to ABC→A to ABC→AB (if set extensions
allowed)
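
The two extension rules can be sketched as follows, with items encoded as integers 0..num_items-1 in lexicographic order (A = 0, B = 1, and so on). The function names and the `allow_set_ext` flag are hypothetical, and every element of a sequence is assumed to be non-empty.

```cpp
#include <vector>

typedef std::vector<int> Itemset;                 // sorted items
typedef std::vector<std::vector<int> > Sequence;  // ordered list of itemsets

// Set candidates: append each item that follows the last item in the order.
std::vector<Itemset> extend_set(const Itemset& p, int num_items) {
    std::vector<Itemset> out;
    int start = p.empty() ? 0 : p.back() + 1;
    for (int x = start; x < num_items; ++x) {
        Itemset c = p;
        c.push_back(x);
        out.push_back(c);
    }
    return out;
}

// Sequence candidates: any item may open a new last element (sequence
// extension). If allow_set_ext is true, items that follow the last item
// of the last element may also be added to that element (set extension).
std::vector<Sequence> extend_seq(const Sequence& p, int num_items,
                                 bool allow_set_ext) {
    std::vector<Sequence> out;
    for (int x = 0; x < num_items; ++x) {
        Sequence c = p;
        c.push_back(std::vector<int>(1, x));
        out.push_back(c);
    }
    if (allow_set_ext && !p.empty()) {
        for (int x = p.back().back() + 1; x < num_items; ++x) {
            Sequence c = p;
            c.back().push_back(x);
            out.push_back(c);
        }
    }
    return out;
}
```

With four items, `extend_set` turns ABC into the single candidate ABCD, while `extend_seq` applied to ABC→A with set extensions enabled produces ABC→AB among its candidates.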

31
Trees: Systematic Candidate Generation
Two subtrees are in the same class iff they share
a common prefix string P up to the (k-1)th node
A valid element x is attached only to nodes
lying on the path from root to rightmost leaf in
prefix P
Not valid position: Prefix 3 4 2 x
32
Candidate Generation (Join operator)
[Figure: self join and join of the elements of an equivalence class, with the resulting new candidate trees]
Equivalence Class: Prefix 1 2, Elements (3,1) (4,0)
New Equivalence Class: Prefix 1 2 3, Elements
(3,1) (3,2) (4,0)
33
Graphs

Define an ordering on edges (5-tuples)
Use a canonical DFS tree to represent the
candidates (collection of edges)
Add one new edge to an existing graph
Can prove that every new candidate can be
obtained by rightmost path extension (like
trees), plus back-edges
First add back-edges, then forward edges in DFS
order
Test for canonical DFS tree (involves isomorphism
testing to eliminate duplicates)

34
Candidate Generation

Generate new candidates ((k+1)-patterns) from
equivalence classes of k-patterns
Consider each pair of elements in a class,
including self-extensions
Consider all new candidates from each pair of
joined elements
All possible candidate patterns are enumerated
Each pattern is generated only once if possible
(sets, seqs, trees, but not graphs)

35
Extensible Data Mining Server (EDMS)

Provide scalable I/O
Data and pattern Indexing
Persistency Manager
Provide native support for several data models
Horizontal (tabular, row-based)
Vertical (full vertical fragmentation)
Provide generic storage models
Flat-files
Databases (OODBMS, Embedded, etc)
Custom Libraries

36
EDMS Data Model: VATs (Vertical Attribute Tables)
37
VATs

VATs are composed of a header and a body.
Different types of patterns require different
VATs

Header
Body (List of Object IDs)
38
EDMS Class Hierarchy: DB, Metatable, VAT, Storage
VAT Header
VAT Records
Storage
DB
39
VAT Class
For Itemsets: VAT<int>
40
VATs for Sequences
For Sequences: VAT<pair<int, time> >
41
VATs for Trees: Match labels
[Figure: a subtree and a tree whose nodes n0 to n6 carry the scopes [0,6], [1,5], [2,4], [3,3], [4,4], [5,5], [6,6]]
Match Label: 03456, Support: 1
VAT for Trees: vector<id, match label, scope>
42
Frequency Computation: Scope List Joins (In Scope)
[Figure: three trees T0, T1, T2 with node scopes, the scope lists of the single-node patterns, and the scope lists produced by in-scope joins. Minsup 3 (100%)]
Equivalence Class: Prefix Ø, Elements (1,-1)
(2,-1) (3,-1) (4,-1)
Scope lists of the elements, as (tree id, [scope]):
(0, [0,3]) (1, [1,3]) (2, [0,7]) (2, [4,7])
(0, [1,1]) (1, [0,5]) (1, [2,2]) (1, [4,4]) (2, [2,2]) (2, [5,5])
(0, [2,3]) (1, [5,5]) (2, [1,2]) (2, [6,7])
(0, [3,3]) (1, [3,3]) (2, [7,7])
Joined scope lists, as (Tree Id, Prefix Match Label, [Last Node Scope]):
(0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5])
(0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7])
Count 3
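
An in-scope join can be sketched as below. The sketch assumes the test is strict containment of the second node's scope in the first node's scope within the same tree and under the same prefix match label, a reading that reproduces the joined entries listed on this slide. The type and function names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

typedef std::pair<int, int> Scope;  // [lower, upper]

struct Entry {
    int tid;                 // tree id
    std::vector<int> match;  // prefix match label
    Scope scope;             // scope of the last node
};
typedef std::vector<Entry> ScopeList;

// In-scope join: the node of y becomes a descendant of the node of x.
// The new match label is x's label extended by x's node position, and
// the new last-node scope is y's scope.
ScopeList in_scope_join(const ScopeList& x, const ScopeList& y) {
    ScopeList out;
    for (const Entry& a : x) {
        for (const Entry& b : y) {
            if (a.tid == b.tid && a.match == b.match &&
                a.scope.first < b.scope.first &&
                b.scope.second <= a.scope.second) {
                Entry e = {a.tid, a.match, b.scope};
                e.match.push_back(a.scope.first);
                out.push_back(e);
            }
        }
    }
    return out;
}

// Support = number of distinct trees that contain at least one occurrence.
std::size_t support(const ScopeList& l) {
    std::set<int> tids;
    for (const Entry& e : l) tids.insert(e.tid);
    return tids.size();
}
```

Joining the list (0,[0,3]) (1,[1,3]) (2,[0,7]) (2,[4,7]) with the list (0,[1,1]) (1,[0,5]) (1,[2,2]) (1,[4,4]) (2,[2,2]) (2,[5,5]) yields five entries spread over three trees, which agrees with the count of 3.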
43
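Scope List Joins: Out Scope
[Figure: candidate trees and their scope lists]
Input scope lists, as (tree id, prefix match label, [last node scope]):
(0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5])
(0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7])
Out-scope join result:
(0, 01, [3,3]) (1, 12, [3,3]) (2, 02, [7,7]) (2, 05, [7,7]) (2, 45, [7,7])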
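
The out-scope join can be sketched in the same style. It assumes the test requires the same tree id, the same prefix match label, and that the first node's scope ends before the second node's scope begins, which makes the second node a later sibling or cousin. This reading reproduces the match labels 01, 12, 02, 05 and 45 shown on this slide. The names are hypothetical.

```cpp
#include <utility>
#include <vector>

typedef std::pair<int, int> Scope;  // [lower, upper]

struct Entry {
    int tid;                 // tree id
    std::vector<int> match;  // prefix match label
    Scope scope;             // scope of the last node
};
typedef std::vector<Entry> ScopeList;

// Out-scope join: the node of y lies strictly to the right of the node
// of x. The new match label is x's label extended by x's node position,
// and the new last-node scope is y's scope.
ScopeList out_scope_join(const ScopeList& x, const ScopeList& y) {
    ScopeList out;
    for (const Entry& a : x) {
        for (const Entry& b : y) {
            if (a.tid == b.tid && a.match == b.match &&
                a.scope.second < b.scope.first) {
                Entry e = {a.tid, a.match, b.scope};
                e.match.push_back(a.scope.first);
                out.push_back(e);
            }
        }
    }
    return out;
}
```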
44
VATs for Graphs

One VAT per unique edge
VAT body consists of a vector of
Graph id
Vertex id 1 and vertex id 2
vector<int, vector<pair<int, int> > >
Possible to get (k+1)-pattern VAT by intersecting
k-pattern VAT with 1-pattern VAT!
Currently works for induced sub-graphs (same as
gSpan, FSG, etc.)

45
EDMS Classes
46
DB

DB is the main user interface
Provides methods like read/write for a DMTL
database
Accesses the mapper (for pre-processing)
Indexing: finds the correct MetaTable VAT index,
which allows retrieval of the desired VAT

47
MetaTable
MetaTables provide grouping of VATs. Attribute
stores encoded value for predefined grouping
strategy. Effects upload of data.

VAT index for fast and efficient search

Persistence is applied to VATs through this
object.

48
Persistency

A VAT's persistency status is one of the
following three:
Volatile. Only in main memory, not registered in
any MetaTable (small VATs).
Buffered. Accessed as if it were in main memory,
but actually stored on disk (big VATs).
Persistent. Stored on disk and kept there
after the computation has finished.

49
Storage<class T>

Abstracts the details of physical storage
Provides persistency to MetaTables
Methods to
Read/write a VAT from/to disk
Different implementations of this class
Metakit (embedded db)
Gigabase (object-relational)
Flat-file
Work in progress
Persistency manager for pattern/families
Performance optimizations

50
Buffer Replacement (LRU) for Flatfile/Metakit
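
A least-recently-used buffer for VAT bodies can be sketched with a list ordered by recency plus a hash index into that list. `LruBuffer` and its interface are hypothetical and do not describe the EDMS implementation; writing an evicted VAT back to storage is left out.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical in-memory buffer for VAT bodies, keyed by a VAT id.
class LruBuffer {
public:
    explicit LruBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Returns the cached body, or nullptr if the VAT is not buffered
    // and would have to be read from storage.
    const std::vector<int>* get(int vat_id) {
        auto it = index_.find(vat_id);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);  // most recent
        return &it->second->second;
    }

    // Inserts or refreshes a body; evicts the least recently used entry
    // when the buffer is full.
    void put(int vat_id, std::vector<int> body) {
        auto it = index_.find(vat_id);
        if (it != index_.end()) {
            it->second->second = std::move(body);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(vat_id, std::move(body));
        index_[vat_id] = order_.begin();
    }

private:
    typedef std::list<std::pair<int, std::vector<int> > > Order;
    std::size_t capacity_;
    Order order_;  // front = most recently used
    std::unordered_map<int, Order::iterator> index_;
};
```

Both operations run in constant expected time, because `std::list::splice` moves a node without invalidating the iterators stored in the index.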
51
What about other pattern types?

Define pattern type
Define VAT body type
Define the isomorphism or sub-pattern, and
candidate extension functions (or use default
graph functions)
Define VAT intersection operation
Select persistency manager if desired (default
instantiation, e.g., gigabase)
All containers and algorithms work!
Instead of pattern-specific sub-pattern/extension/VAT
joins, implement generic functions using
pattern properties

52
Pattern Properties

Define a hierarchy or partial order of pattern
properties
Also modeled as classes!
Write generic sub-pattern/extension/VAT join
functions
A given pattern-property satisfies all properties
above it in the partial order
E.g. Given pattern P (as collection of edges),
extend last edge (x,u) with an edge (u,v) with
v > u for sets, with any v for sequences. For trees
add new edge to any rightmost vertex, and for
graphs also add new back-edges.

Itemset Prop: vector<int>
Seq Prop: vector<int>
Seq2 Prop: vector<time, int>
Tree Prop: vector<int>
Seq3 Prop: list<pair<time, vector<int> > >
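
The set and sequence cases of this extension rule can be sketched with properties modeled as tag classes, so that one generic function picks the right rule at compile time. The tag names, the integer encoding of vertices, and the helper functions are hypothetical. Trees and graphs are omitted.

```cpp
#include <vector>

// Property tags (hypothetical names). Sequences drop the ordering
// constraint that sets impose on the new end point.
struct set_prop {};
struct seq_prop {};

// Valid end points v for a new edge (u, v) when the last edge ends at u.
inline std::vector<int> extensions(int u, int num_labels, set_prop) {
    std::vector<int> out;
    for (int v = u + 1; v < num_labels; ++v) out.push_back(v);  // only v > u
    return out;
}

inline std::vector<int> extensions(int u, int num_labels, seq_prop) {
    (void)u;  // any v is allowed, so u is not consulted
    std::vector<int> out;
    for (int v = 0; v < num_labels; ++v) out.push_back(v);
    return out;
}

// Generic front end: the property tag selects the overload at compile time.
template <class Prop>
std::vector<int> extend_last(int u, int num_labels) {
    return extensions(u, num_labels, Prop());
}
```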
53
Itemset Mining (2.8 GHz P4, 6 GB RAM)

As minsup decreases, the gap increases, but DMTL is
within a factor of 10 of optimized ECLAT
As database size increases, the gap decreases
and eventually closes. ECLAT breaks for 5000K
(>2hrs), while DMTL works (23.5s), since it does
transparent memory management

54
Itemset Mining (less memory)