Title: Mining Complex Patterns in Massive Relational Data
1Mining Complex Patterns in Massive Relational Data
- Mohammed J. Zaki
- Computer Science Department
- Rensselaer Polytechnic Institute
- http://www.cs.rpi.edu/~zaki
2Team Members
- Data Mining Template Library
- Nilanjana De (PhD Student)
- Benjarath Phoophakdee (PhD Student)
- Nagender Parimi (MS Student)
- Extensible Data Mining Server
- Feng Gao (PhD Candidate)
- Joe Urban (MS Student)
- Jeevan Pathuri (MS Student)
- Past Members
- Paolo Palmerini (Visiting Scholar)
3Frequent Structure Mining (FSM) Toolkit
- Systematic solution to a whole class of common pattern mining tasks, rather than a specific problem, in MASSIVE, complex, relational datasets
- Develop large-scale FSM toolkit
- Extensible: modular for ease of use
- Scalable: high-performance for interactive response
4Mining Complex Patterns
- Common Pattern Mining Tasks
- Itemsets (transactional, unordered data)
- Sequences (temporal/positional text, bioseqs)
- Tree patterns (semi-structured/XML data, web mining, bioinformatics)
- Graph patterns (bioinformatics, web data)
- Frequent spans
- Long patterns in dense data
- Subspace patterns
- Correlations, other statistical metrics
- Interesting, non-redundant patterns
5Example Pattern Types
Itemset
Sequence
- Can add attributes
- To nodes
- To edges
- Attributes
- Labels
- Type (directed or undirected)
- Set-valued
6FSM Motivation
- Exploratory analysis of complex datasets
- High-dimensional
- Massive
- Relational (graph-based)
- Detect subspace patterns
- Detect abnormal/rare high-value patterns embedded in mass of normal data
- Improve/build global models
- Classification
- Clustering, etc.
7FSM: Specific Applications
- Link-detection
- XML, semi-structured data
- Mine structure and content
- Web usage mining
- LOGML: Log Markup Language
- Bioinformatics
- Detect patterns (motifs) in bio-molecules (RNA/Proteins)
- Security-related
- Intrusion, Fraud, Failure detection
8Induced vs Embedded Sub-patterns
- Induced sub-patterns: S = (Vs, Es) is a sub-pattern of T = (V, E) if and only if
- Vs ⊆ V
- e = (nx, ny) ∈ Es iff (nx, ny) ∈ E (nx directly connected to ny)
- Embedded sub-patterns: S = (Vs, Es) is a sub-pattern of T = (V, E) if and only if
- Vs ⊆ V
- e = (nx, ny) ∈ Es iff nx ≤_l ny in T (nx connected to ny)
- We say S occurs in T if S is a sub-pattern of T
- If S has k nodes, we call it a k-pattern
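For tree patterns, the two definitions above come down to "parent of" versus "ancestor of". The following is a minimal sketch under that assumption; the `ParentArray` encoding and both function names are hypothetical and are not part of DMTL.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tree encoding for illustration: parent[i] is the parent of
// node i, and the root has parent -1.
using ParentArray = std::vector<int>;

// Induced edge: nx must be the direct parent of ny.
bool is_induced_edge(const ParentArray& parent, int nx, int ny) {
    assert(ny >= 0 && ny < static_cast<int>(parent.size()));
    return parent[ny] == nx;
}

// Embedded edge: nx only needs to be a proper ancestor of ny.
bool is_embedded_edge(const ParentArray& parent, int nx, int ny) {
    assert(ny >= 0 && ny < static_cast<int>(parent.size()));
    for (int cur = parent[ny]; cur != -1; cur = parent[cur]) {
        if (cur == nx) return true;
    }
    return false;
}
```

Every induced edge is also an embedded edge, so any induced sub-pattern is an embedded sub-pattern, but not the other way around.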
9FSM Current Practice
- A new algorithm for a new type of pattern
- No unifying theme or framework
- Little DBMS support
- Scalability issues (typically in-memory!)
- Not interactive, online
- Little support for KDD process
10FSM Toolkit: A Generic Pattern Mining Engine
- Data Mining Template Library (DMTL)
- Generic algorithms (work for ANY pattern type!)
- Ability to define custom pattern types
- Persistent, generic data structures (containers)
- Iterators for traversal over persistent structs
- Extensible Data Mining Server (EDMS)
- Persistency (patterns and data)
- Indexing (patterns and data)
- Data layout and I/O issues
11FSM Toolkit
- Scalable, high-performance data mining
- Tight DBMS integration
- FSM functionality
- Mining (search over complex high-dimensional spaces)
- Pre-processing (discretization, feature data selection)
- Post-processing (interesting, rare, visualization)
12Data Mining Template Library (DMTL)
- A generic pattern mining engine
- Facilitates frequent structure mining tasks
- Generic programming goes beyond object-oriented
- Compile-time resolution vs. run-time
- Generic algorithms (work for ANY pattern type!)
- Ability to define custom pattern types
- Persistent, generic data structures (containers)
- Iterators for traversal
13Generic vs. Object-Oriented
- Object Oriented (Run-time polymorphism)
- class Pattern {
-   virtual bool sub-pattern(Pattern *P2);
- };
- class set : public Pattern { ... sub-pattern() ... };
- class seq : public Pattern { ... sub-pattern() ... };
- Pattern *P1 = new set; Pattern *P2 = new set;
- sub-pattern(Pattern *P1, Pattern *P2) {
-   return P1->sub-pattern(P2);
- }
- Which sub-pattern() will be called?
- Determined at run-time (inefficient)
14Generic vs. Object-Oriented
- Generic Programming (Compile-time)
- template <class P> class Pattern {
-   bool sub-pattern(Pattern<P> &P2) {
-     return Pat.sub-pattern(P2.Pat);
-   }
-   P Pat;
- };
- class set { ... sub-pattern() ... }; class seq { ... sub-pattern() ... };
- Pattern<set> *P1 = new Pattern<set>; Pattern<set> *P2 = new Pattern<set>;
- Which sub-pattern() will be called?
- Determined at compile-time (efficient)
- Works for ANY pattern type, provided sub-pattern is defined (also checked at compile time)
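The contrast between the two slides can be made concrete with a small compilable sketch. All names here (`PatternBase`, `SetOO`, `Set`, `sub_pattern`) are illustrative assumptions, since the slide identifier `sub-pattern` is not a valid C++ name; this is not DMTL's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Run-time polymorphism: the call is resolved through a virtual table.
struct PatternBase {
    virtual ~PatternBase() = default;
    virtual bool sub_pattern(const PatternBase& other) const = 0;
};

struct SetOO : PatternBase {
    std::vector<int> items;  // must be sorted
    explicit SetOO(std::vector<int> v) : items(std::move(v)) {
        assert(std::is_sorted(items.begin(), items.end()));
    }
    bool sub_pattern(const PatternBase& other) const override {
        // The concrete type of `other` has to be recovered at run time.
        const SetOO* o = dynamic_cast<const SetOO*>(&other);
        return o && std::includes(o->items.begin(), o->items.end(),
                                  items.begin(), items.end());
    }
};

// Generic programming: no base class and no virtual call.
struct Set {
    std::vector<int> items;  // must be sorted
    bool sub_pattern(const Set& other) const {
        return std::includes(other.items.begin(), other.items.end(),
                             items.begin(), items.end());
    }
};

template <class P>
struct Pattern {
    P pat;
    int support = 0;
    // Resolved at compile time; compilation fails if P lacks sub_pattern().
    bool sub_pattern(const Pattern<P>& other) const {
        return pat.sub_pattern(other.pat);
    }
};
```

In the generic version the compiler sees the exact type of `pat`, so the call can be inlined, and instantiating `Pattern<P>::sub_pattern` with a type that has no `sub_pattern` member is rejected at compile time.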
15DMTL Generic classes
- Data structure to represent a pattern
- Operations necessary for a pattern
- Data structure to represent a group of patterns
- Operations necessary for a group of patterns
- Algorithms that operate on any pattern or a group of patterns
- Interface with Extensible Data Mining Server for scalable statistics computations for a pattern or group of patterns
16Generic Pattern Class
- Types of patterns
- Itemset
- Sequence
- Trees and Graphs
- Define your own!
- Pattern Class
- Members for support counting and other statistics
- Sub-pattern checking (may require isomorphism testing!)
17Pattern Class
18Pattern
- template <class P>
- class Pattern {
-   P pat;
-   int support;
-   bool sub-pattern(Pattern<P> &P2) {
-     return pat.sub-pattern(P2.pat);
-   }
- };
- Every new pattern type must define its own sub-pattern function
19Pattern Types: Itemsets and Sequences
- class Itemset {
-   typedef vector<int> PT;
-   bool sub-pattern(PT &p2);
-   PT p;
- };
- class Sequence {
-   typedef list<vector<int> > PT;
-   bool sub-pattern(PT &p2);
-   PT p;
- };
- class Sequence2 {
-   typedef list<pair<vector<int>, time> > PT;
- };
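One possible way to fill in the sub-pattern checks for the two basic pattern types is sketched below. It assumes that items inside each itemset are kept sorted and that a sequence is a sub-pattern when its elements map, in order, to supersets in the other sequence; `sub_pattern` stands in for the slide name `sub-pattern`.

```cpp
#include <algorithm>
#include <cassert>
#include <list>
#include <vector>

struct Itemset {
    typedef std::vector<int> PT;
    PT p;  // items in increasing order

    // True if this itemset is contained in p2 (also sorted).
    bool sub_pattern(const PT& p2) const {
        assert(std::is_sorted(p.begin(), p.end()));
        return std::includes(p2.begin(), p2.end(), p.begin(), p.end());
    }
};

struct Sequence {
    typedef std::list<std::vector<int> > PT;
    PT p;  // ordered list of itemsets, each sorted

    // True if the elements of p can be matched, in order, to elements of
    // p2 so that each element of p is a subset of its match.
    bool sub_pattern(const PT& p2) const {
        PT::const_iterator mine = p.begin();
        for (PT::const_iterator it = p2.begin();
             it != p2.end() && mine != p.end(); ++it) {
            if (std::includes(it->begin(), it->end(),
                              mine->begin(), mine->end())) {
                ++mine;  // taking the earliest possible match is sufficient
            }
        }
        return mine == p.end();
    }
};
```

Both checks rely on `std::includes`, which requires sorted ranges; a pattern type with a different item order would need its own comparison.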
20String Representation of Trees
[Figure: example tree with N = 7 nodes (labels 0, 1, 3, 1, 2, 2, 2)]
String encoding: 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1
With N nodes, M branches, and max fanout F:
- Adjacency Matrix requires N(F+1) space
- Adjacency List requires 4N-2 space
- Tree requires (node, child, sibling) 3N space
- String representation requires 2N-1 space
21Tree String Representation
- Like an itemset
- -1 as the backtrack item
- class Tree {
-   typedef vector<int> pattern-type;
- };
- Assuming only labels on nodes
- For trees, labels on edges can be treated as labels on nodes
- edge-label + node-label = new label!
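A short sketch of how such a string can be produced from a tree by a pre-order traversal that emits -1 on every backtrack. The `Tree` struct below (labels plus ordered child lists) is a hypothetical input format chosen for the example, not DMTL's internal one.

```cpp
#include <cassert>
#include <vector>

// Hypothetical input tree: node ids are 0..N-1 and node 0 is the root.
struct Tree {
    std::vector<int> label;                   // label[i] of node i
    std::vector<std::vector<int> > children;  // ordered children of node i
};

// Emits the label of `node`, then each child subtree followed by a -1
// that marks the backtrack from that child.
void encode_from(const Tree& t, int node, std::vector<int>& out) {
    out.push_back(t.label[node]);
    for (int c : t.children[node]) {
        encode_from(t, c, out);
        out.push_back(-1);
    }
}

std::vector<int> encode(const Tree& t, int root = 0) {
    assert(t.label.size() == t.children.size());
    std::vector<int> out;
    if (!t.label.empty()) encode_from(t, root, out);
    return out;
}
```

Each of the N nodes contributes one label and each of the N-1 edges contributes one -1, which gives the 2N-1 entries quoted for the string representation.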
22Graphs: DFS Tree
[Figure: graph with vertices 0, 1, 2, 3 labeled A, B, D, C respectively, and its DFS tree]
DFS Tree + Remaining Edges: (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D)
23Graphs
- Many Possible Representations
- Canonical DFS-tree (gSpan)
- CAMs: Canonical Adjacency Matrix (FFSM)
- Adjacency Matrix (FSG, FGM, etc.)
- DMTL (uses Canonical DFS-tree)
- Graph is a vector of edges
- Each edge is a 5-tuple (v1, v2, vl1, el, vl2)
- class Graph {
-   typedef vector<Edges> PT;
- };
- class Edges {
-   int v1; int v2; // v1, v2 are node ids
-   int vl1; int vl2; // vl1, vl2 are node labels
-   int el; // el is edge label
- };
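The edge-vector representation might look as follows in compilable form. The comparison operator is a plain lexicographic order on the 5-tuple and is only an assumption for illustration; it is not claimed to be the canonical DFS order used by gSpan or DMTL. Mapping labels A..D to 0..3 and using a single edge label 0 are also assumptions.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

struct Edge {
    int v1, v2;    // node ids
    int vl1, vl2;  // node labels
    int el;        // edge label
};

// Illustrative order only: lexicographic on (v1, v2, vl1, el, vl2).
bool operator<(const Edge& a, const Edge& b) {
    return std::tie(a.v1, a.v2, a.vl1, a.el, a.vl2) <
           std::tie(b.v1, b.v2, b.vl1, b.el, b.vl2);
}

typedef std::vector<Edge> Graph;

// Appends an edge; self-loops are excluded in this sketch.
void add_edge(Graph& g, const Edge& e) {
    assert(e.v1 != e.v2);
    g.push_back(e);
}

// The four edges (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D).
Graph example_graph() {
    const int A = 0, B = 1, C = 2, D = 3, NO_LABEL = 0;
    Graph g;
    add_edge(g, Edge{0, 1, A, B, NO_LABEL});
    add_edge(g, Edge{0, 3, A, C, NO_LABEL});
    add_edge(g, Edge{1, 2, B, D, NO_LABEL});
    add_edge(g, Edge{3, 2, C, D, NO_LABEL});
    return g;
}
```

Having a total order on edges is what allows two edge vectors to be compared, which is the basis for picking one canonical representative among isomorphic graphs.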
24Pattern Family
- We call a group of patterns a family
- A generic class that works for any pattern type (a collection of patterns)
- Support operations like
- Get frequency
- Compute other statistics (e.g., count-by-class)
- Compute maximal, closed, etc.
- Persistency (via persistency manager from EDMS)
- Pattern indexing (out-of-core): prefix trees for sets and sequences
25Pattern Family: pvector
- template <class PFT>
- class PatternFamily {
-   typedef PFT pat_fam_t;
-   typedef typename PFT::pattern P;
-   PFT pat_fam;
- };
- template <class P, class PM>
- class pvector {
-   typedef P pattern_type;
-   typedef PM persist_mgr;
- };
26DMTL Hierarchy
27Generic Mining Algorithms
- A collection of common, generic frequent pattern mining algorithms
- Horizontal: pattern matching based
- Vertical intersection based
- BFS or DFS
- Work for any pattern/family type
- Future Work
- Projection based algorithms
- Maximal (long) pattern mining
- Closed pattern mining
- Add constraints
28Generic DFS-Mine
- template <class PFT>
- void DFS-Mine(PatternFamily<PFT> &PF, Dbase &DB) {
-   typedef typename PFT::pattern_type P;
-   typedef typename PFT::persist_mgr PM;
-   ...
- }
- Works for any PFT (pattern family type)!
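To illustrate what a generic depth-first miner can look like, here is a simplified, in-memory sketch in the style of vertical (Eclat-like) mining. It is not the DMTL signature shown above: the `dfs_mine` function, the `ItemsetPat` type, and its `join()`/`support()` interface are all assumptions made for this example, and there is no persistency manager.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

typedef std::vector<int> TidList;  // sorted transaction ids

// Hypothetical pattern type: an itemset plus its vertical tid-list.
struct ItemsetPat {
    std::vector<int> items;
    TidList tids;

    int support() const { return static_cast<int>(tids.size()); }

    // Join with a sibling from the same prefix class: extend by the
    // sibling's last item and intersect the two tid-lists.
    ItemsetPat join(const ItemsetPat& sib) const {
        assert(!sib.items.empty());
        ItemsetPat out;
        out.items = items;
        out.items.push_back(sib.items.back());
        std::set_intersection(tids.begin(), tids.end(),
                              sib.tids.begin(), sib.tids.end(),
                              std::back_inserter(out.tids));
        return out;
    }
};

// Generic DFS over equivalence classes. Works for any P that provides
// support() and join(); eq_class must contain only frequent patterns.
template <class P>
void dfs_mine(const std::vector<P>& eq_class, int minsup,
              std::vector<P>& frequent) {
    for (std::size_t i = 0; i < eq_class.size(); ++i) {
        frequent.push_back(eq_class[i]);
        std::vector<P> next;  // class whose prefix is eq_class[i]
        for (std::size_t j = i + 1; j < eq_class.size(); ++j) {
            P cand = eq_class[i].join(eq_class[j]);
            if (cand.support() >= minsup) next.push_back(cand);
        }
        if (!next.empty()) dfs_mine(next, minsup, frequent);
    }
}
```

Only `ItemsetPat` knows about items and tid-lists; `dfs_mine` itself would compile unchanged for a sequence or tree type that offers the same two members.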
29Candidate Generation and Support Counting
- Candidate Generation
- Extend by a node or an edge
- Avoid duplicates as far as possible
- May involve isomorphism testing for graphs
- Not required for sets, sequences or trees
- DMTL provides a generic includes operation
- Support Counting
- EDMS for data access
- Generic intersections operation for any pattern type (e.g. Eclat, Spade, TreeMiner)
- For horizontal data use generic sub-pattern operation
30Candidate Generation
- Sets: add the next item in lex (or other) order (added at the end of the last element)
- E.g., ABC to ABCD
- Sequences: add any item at the end of the last element (set or sequence extension)
- E.g., ABC to ABCA (only sequence extension allowed; same as A→B→C→A)
- E.g., ABC to ABC→A to ABC→AB (if set extensions allowed)
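The two extension rules can be sketched as follows, assuming items are encoded as integers 0..num_items-1 in lexicographic order (so A = 0, B = 1, and so on). The function names and the `allow_set_ext` flag are hypothetical.

```cpp
#include <cassert>
#include <vector>

typedef std::vector<int> Itemset;       // sorted items
typedef std::vector<Itemset> Sequence;  // ordered elements

// Sets: append each item that follows the last item in the item order.
std::vector<Itemset> extend_set(const Itemset& p, int num_items) {
    std::vector<Itemset> out;
    int start = p.empty() ? 0 : p.back() + 1;
    for (int x = start; x < num_items; ++x) {
        Itemset c = p;
        c.push_back(x);
        out.push_back(c);
    }
    return out;
}

// Sequences: any item may start a new last element (sequence extension);
// optionally, later items may join the current last element (set extension).
std::vector<Sequence> extend_sequence(const Sequence& p, int num_items,
                                      bool allow_set_ext) {
    assert(!p.empty() && !p.back().empty());
    std::vector<Sequence> out;
    for (int x = 0; x < num_items; ++x) {
        Sequence c = p;
        c.push_back(Itemset(1, x));
        out.push_back(c);
    }
    if (allow_set_ext) {
        for (int x = p.back().back() + 1; x < num_items; ++x) {
            Sequence c = p;
            c.back().push_back(x);
            out.push_back(c);
        }
    }
    return out;
}
```

With four items, `extend_set` turns ABC into the single candidate ABCD, while `extend_sequence` applied to A→B→C yields four sequence extensions (the first being A→B→C→A) plus one set extension when `allow_set_ext` is true.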
31Trees: Systematic Candidate Generation
Two subtrees are in the same class iff they share a common prefix string P up to the (k-1)th node.
A valid element x may be attached only to the nodes lying on the path from the root to the rightmost leaf in prefix P.
[Figure: prefix 3 4 2 with element x, showing a position that is not valid]
32Candidate Generation (Join operator ⊗)
Equivalence Class: Prefix 1 2, Elements (3,1) (4,0)
[Figure: self join of element (3,1) and join of (3,1) with (4,0), with the resulting new candidate trees]
New Equivalence Class: Prefix 1 2 3, Elements (3,1) (3,2) (4,0)
33Graphs
- Define an ordering on edges (5-tuples)
- Use a canonical DFS tree to represent the candidates (collection of edges)
- Add one new edge to an existing graph
- Can prove that every new candidate can be obtained by rightmost path extension (like trees), plus back-edges
- First add back-edges, then forward edges in DFS order
- Test for canonical DFS tree (involves isomorphism testing to eliminate duplicates)
34Candidate Generation
- Generate new candidate (k+1)-patterns from equivalence classes of k-patterns
- Consider each pair of elements in a class, including self-extensions
- Consider all new candidates from each pair of joined elements
- All possible candidate patterns are enumerated
- Each pattern is generated only once if possible (sets, seqs, trees, but not graphs)
35Extensible Data Mining Server (EDMS)
- Provide scalable I/O
- Data and pattern Indexing
- Persistency Manager
- Provide native support for several data models
- Horizontal (tabular, row-based)
- Vertical (full vertical fragmentation)
- Provide generic storage models
- Flat-files
- Databases (OODBMS, Embedded, etc)
- Custom Libraries
36EDMS Data Model: VATs (Vertical Attribute Tables)
37VATs
- VATs are composed of a header and a body.
- Different types of patterns require different
VATs
Header
Body (List of Object IDs)
38EDMS Class Hierarchy: DB, MetaTable, VAT, Storage
[Figure: class hierarchy diagram showing DB, Storage, and VAT (VAT Header, VAT Records)]
39VAT Class
For Itemsets: VAT<int>
40VATs for Sequences
For Sequences: VAT< pair<int, time> >
41VATs for Trees: Match Labels
[Figure: a subtree and a tree whose nodes n0..n6 are annotated with scopes: n0 [0,6], n1 [1,5], n2 [2,4], n3 [3,3], n4 [4,4], n5 [5,5], n6 [6,6]]
Match Label: 03456, Support: 1
VAT for Trees: vector< id, match label, scope >
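A node's scope can be read as [l, u], where l is the node's own depth-first number and u is the number of its rightmost descendant. Under that reading, scopes can be computed in one pass over the string encoding of a tree. The sketch below assumes the -1 backtrack convention; `compute_scopes` is a hypothetical helper name.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef std::pair<int, int> Scope;  // [l, u]

// Computes the scope of every node from a string encoding in which
// -1 marks a backtrack. Nodes are numbered in depth-first order.
std::vector<Scope> compute_scopes(const std::vector<int>& enc) {
    std::vector<Scope> out;
    std::vector<int> open;  // ids of nodes on the current root path
    for (int x : enc) {
        if (x != -1) {
            int id = static_cast<int>(out.size());
            out.push_back(Scope(id, id));
            open.push_back(id);
        } else {
            assert(!open.empty());
            int closed = open.back();
            open.pop_back();
            // The rightmost descendant is the last node numbered so far.
            out[closed].second = static_cast<int>(out.size()) - 1;
        }
    }
    // Nodes never closed (at least the root) extend to the last node.
    for (int id : open) out[id].second = static_cast<int>(out.size()) - 1;
    return out;
}
```

For the encoding 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1 this yields [0,6], [1,5], [2,4], [3,3], [4,4], [5,5], [6,6] for nodes n0 through n6.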
42Frequency Computation: Scope-List Joins (In-Scope)
[Figure: database of three trees T0, T1, T2 with node scopes; Minsup = 3 (100%)]
Equivalence Class: Prefix Ø, Elements (1,-1) (2,-1) (3,-1) (4,-1)
Scope lists of the single items, as (tree id, scope):
- 1: (0, [0,3]) (1, [1,3]) (2, [0,7]) (2, [4,7])
- 2: (0, [1,1]) (1, [0,5]) (1, [2,2]) (1, [4,4]) (2, [2,2]) (2, [5,5])
- 3: (0, [2,3]) (1, [5,5]) (2, [1,2]) (2, [6,7])
- 4: (0, [3,3]) (1, [3,3]) (2, [7,7])
In-scope joins with prefix 1, as (tree id, prefix match label, last node scope):
- 2: (0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5])
- 4: (0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7])
Count: 3
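43Scope-List Joins: Out-Scope
[Figure: two patterns with root 1 (child 2 and child 4) joined into a pattern with root 1 and children 2 and 4]
Entries are (tree id, prefix match label, last node scope):
- Element 2: (0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5])
- Element 4: (0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7])
- Out-scope join: (0, 01, [3,3]) (1, 12, [3,3]) (2, 02, [7,7]) (2, 05, [7,7]) (2, 45, [7,7])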
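The in-scope and out-scope tests can be written as two small predicates over scope-list entries. This is a sketch based on the entry layout (tree id, prefix match label, last node scope); the struct and function names are hypothetical, and entries are assumed to be grouped by ascending tree id.

```cpp
#include <cassert>
#include <vector>

struct ScopeEntry {
    int tid;                 // tree id
    std::vector<int> match;  // prefix match label
    int l, u;                // scope [l, u] of the last node
};
typedef std::vector<ScopeEntry> ScopeList;

// In-scope: y's node is a descendant of x's node.
bool in_scope(const ScopeEntry& x, const ScopeEntry& y) {
    return x.tid == y.tid && x.match == y.match && x.l < y.l && y.u <= x.u;
}

// Out-scope: y's node lies strictly after the subtree of x's node.
bool out_scope(const ScopeEntry& x, const ScopeEntry& y) {
    return x.tid == y.tid && x.match == y.match && x.u < y.l;
}

// Joins two scope lists; each result keeps y's scope and extends the
// match label with the position of x's node.
template <class Test>
ScopeList scope_join(const ScopeList& lx, const ScopeList& ly, Test test) {
    ScopeList out;
    for (const ScopeEntry& x : lx) {
        for (const ScopeEntry& y : ly) {
            if (test(x, y)) {
                ScopeEntry e = y;
                e.match.push_back(x.l);
                out.push_back(e);
            }
        }
    }
    return out;
}

// Support = number of distinct trees with at least one entry.
int support(const ScopeList& sl) {
    int count = 0, last = -1;
    for (const ScopeEntry& e : sl) {
        assert(e.tid >= last);  // grouped by ascending tree id
        if (e.tid != last) { ++count; last = e.tid; }
    }
    return count;
}
```

Applying `scope_join` with `out_scope` to the two lists that start with (0, 0, [1,1]) and (0, 0, [3,3]) produces five entries spread over three trees, so the joined pattern has support 3.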
44VATs for Graphs
- One VAT per unique edge
- VAT body consists of a vector of
- Graph id
- Vertex id 1 and vertex id 2
- vector< int, vector< pair<int, int> > >
- Possible to get (k+1)-pattern VAT by intersecting k-pattern VAT with 1-pattern VAT!
- Currently works for induced sub-graphs (same as gSpan, FSG, etc.)
45EDMS Classes
46DB
- DB is the main user interface
- Provides methods like read/write for a DMTL database
- Accesses the mapper (for pre-processing)
- Indexing finds correct MetaTable VAT index, which allows retrieval of the desired VAT
47MetaTable
MetaTables provide grouping of VATs. Attribute stores encoded value for predefined grouping strategy. Effects upload of data.
- VAT index for fast and efficient search
- Persistence is applied to VATs through this object.
48Persistency
- A VAT's persistency status can be one of the following three:
- Volatile. Only in main memory, not registered in any MetaTable (small VATs).
- Buffered. Accessed as if it was in main memory, but actually stored on disk (big VATs).
- Persistent. Stored persistently on disk, also after the computation has finished.
49Storage<class T>
- Abstracts the details of physical storage
- Provides persistency to MetaTables
- Methods to
- Read/write a VAT from/to disk
- Different implementations of this class
- Metakit (embedded db)
- Gigabase (object-relational)
- Flat-file
- Work in progress
- Persistency manager for pattern/families
- Performance optimizations
50Buffer Replacement (LRU) for Flatfile/Metakit
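As a reminder of how least-recently-used replacement works, here is a minimal in-memory sketch. It is not the EDMS buffer manager: the `LruBuffer` class, the integer VAT id key, and the `VatBody` type are assumptions for illustration, and eviction simply drops the entry instead of writing it to storage.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal LRU buffer keyed by a hypothetical integer VAT id.
class LruBuffer {
public:
    typedef std::vector<int> VatBody;

    explicit LruBuffer(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Returns the cached body, or nullptr on a miss.
    // A hit makes the entry the most recently used one.
    const VatBody* get(int id) {
        auto it = index_.find(id);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or updates an entry; on overflow the least recently used
    // entry is evicted (a real buffer would write it back to disk first).
    void put(int id, VatBody body) {
        auto it = index_.find(id);
        if (it != index_.end()) {
            it->second->second = std::move(body);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(id, std::move(body));
        index_[id] = order_.begin();
    }

    std::size_t size() const { return order_.size(); }

private:
    typedef std::list<std::pair<int, VatBody> > Order;
    std::size_t capacity_;
    Order order_;  // front = most recently used
    std::unordered_map<int, Order::iterator> index_;
};
```

The list keeps entries in recency order and the hash map gives constant-time lookup, so both `get` and `put` avoid scanning the buffer.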
51What about other pattern types?
- Define pattern type
- Define VAT body type
- Define the isomorphism or sub-pattern, and candidate extension functions (or use default graph functions)
- Define VAT intersection operation
- Select persistency manager if desired (default instantiation, e.g., gigabase)
- All containers and algorithms work!
- Instead of pattern-specific sub-pattern/extension/VAT joins, implement generic functions using pattern properties
52Pattern Properties
- Define a hierarchy or partial order of pattern properties
- Also modeled as classes!
- Write generic sub-pattern/extension/VAT-join functions
- A given pattern property satisfies all properties above it in the partial order
- E.g., given pattern P (as a collection of edges), extend the last edge (x,u) with an edge (u,v), with v > u for sets and with any v for sequences. For trees, add a new edge to any vertex on the rightmost path, and for graphs also add new back-edges.
Itemset Prop: vector<int>
Seq Prop: vector<int>
Seq2 Prop: vector<time, int>
Tree Prop: vector<int>
Seq3 Prop: list< pair<time, vector<int> > >
53Itemset Mining (2.8GHz P4, 6GB RAM)
- As minsup decreases, the gap increases, but DMTL is within a factor of 10 of optimized ECLAT
- As database size increases, the gap decreases and eventually closes
- ECLAT breaks for 5000K (>2hrs), while DMTL works (23.5s), since it does transparent memory management
54Itemset Mining (less memory)
- Run on 256MB machine, Pentium III, 450MHz
- DMTL (metakit) able to scale to 10 million transactions!
- Within factor of 2 for ECLAT (optimized algo)
- ECLAT didn't finish (> 1 hr) for 5M, 10M
- Embedded DB 2 times faster than flat-file
55Sequence Mining
- Same trends as itemset mining: SPADE is faster (factor of 10), but the gap decreases with larger data
56Sequences (less mem)
- DMTL within a factor of 2 w.r.t. Optimized SPADE
- Gap closing for larger number of records
- SPADE will break at some point (no memory mgmt)
57Tree Mining
58Graph Mining
59Pre-processing: Configuration and Mapping
- XML-based configuration specification
- Dynamically specify different mapping strategies
- Discretization (possibly non-uniform) of numerical attributes
- Taxonomy and grouping of categorical attributes
60Mapping Continuous and Categorical Attributes
61Post-processing: Visualization of Patterns
- MIRAGE: Minimal association rules
- VTK (Visualization Toolkit)
- Interactive Exploration (Java-based)
- Works only for Itemsets
- Generalize to other patterns (future work)
- Preliminary prototype
62Lattice Visualization
63Interactive Browsing
64Other Visualization
Cylinder (confidence color coded)
Cone Trees
65Summary
- First-of-its-kind generic pattern miner!
- DMTL as a first-step exploratory tool
- Handle massive, high-dimensional data
- Pattern property hierarchy
- Generic algorithms and data structures
- Pattern and data indexing
- Persistency support
- Databases, flat-files, etc.
- Will make publicly available in the near future
66Future Work
- Completely generic approach
- To isomorphism, extension, and counting
- Graphs only work for induced sub-patterns
- Extend to embedded graphs
- Consider how to add constraints
- Closed and maximal patterns
- Optimization of DMTL to be competitive with specific algorithms
- Extend DMTL to other mining tasks (clustering, classification, etc.)