Mining Complex Patterns in Massive Relational Data

1
Mining Complex Patterns in Massive Relational Data
  • Mohammed J. Zaki
  • Computer Science Department
  • Rensselaer Polytechnic Institute
  • http://www.cs.rpi.edu/~zaki

2
Team Members
  • Data Mining Template Library
  • Nilanjana De, PhD Student
  • Benjarath Phoophakdee, PhD Student
  • Nagender Parimi, MS Student
  • Extensible Data Mining Server
  • Feng Gao, PhD Candidate
  • Joe Urban, MS Student
  • Jeevan Pathuri, MS Student
  • Past Members
  • Paolo Palmerini (Visiting Scholar)

3
Frequent Structure Mining (FSM) Toolkit
  • Systematic solution to a whole class of common
    pattern mining tasks, rather than a specific
    problem, in MASSIVE, complex, relational datasets
  • Develop large-scale FSM toolkit
  • Extensible: modular for ease of use
  • Scalable: high-performance for interactive
    response

4
Mining Complex Patterns
  • Common Pattern Mining Tasks
  • Itemsets (transactional, unordered data)
  • Sequences (temporal/positional: text, bioseqs)
  • Tree patterns (semi-structured/XML data, web
    mining, bioinformatics)
  • Graph patterns (bioinformatics, web data)
  • Frequent spans
  • Long patterns in dense data
  • Subspace patterns
  • Correlations, other statistical metrics
  • Interesting, non-redundant patterns

5
Example Pattern Types
Itemset
Sequence
  • Can add attributes
  • To nodes
  • To edges
  • Attributes
  • Labels
  • Type (directed or undirected)
  • Set-valued

6
FSM Motivation
  • Exploratory analysis of complex datasets
  • High-dimensional
  • Massive
  • Relational (graph-based)
  • Detect subspace patterns
  • Detect abnormal/rare high-value patterns embedded
    in mass of normal data
  • Improve/build global models
  • Classification
  • Clustering, etc.

7
FSM: Specific Applications
  • Link-detection
  • XML, semi-structured data
  • Mine structure and content
  • Web usage mining
  • LOGML: Log Markup Language
  • Bioinformatics
  • Detect patterns (motifs) in bio-molecules
    (RNA/Proteins)
  • Security-related
  • Intrusion, Fraud, Failure detection

8
Induced vs Embedded Sub-patterns
  • Induced Sub-patterns: S = (Vs, Es) is a
    sub-pattern of T = (V, E) if and only if
  • Vs ⊆ V
  • e = (nx, ny) ∈ Es iff (nx, ny) ∈ E (nx directly
    connected to ny)
  • Embedded Sub-patterns: S = (Vs, Es) is a
    sub-pattern of T = (V, E) if and only if
  • Vs ⊆ V
  • e = (nx, ny) ∈ Es iff nx ≤l ny in T (nx connected
    to ny)
  • We say S occurs in T if S is a sub-pattern of T
  • If S has k nodes, we call it a k-pattern

9
FSM Current Practice
  • A new algorithm for a new type of pattern
  • No unifying theme or framework
  • Little DBMS support
  • Scalability issues (typically in-memory!)
  • Not interactive, online
  • Little support for KDD process

10
FSM Toolkit: A Generic Pattern Mining Engine
  • Data Mining Template Library (DMTL)
  • Generic algorithms (work for ANY pattern type!)
  • Ability to define custom pattern types
  • Persistent, generic data structures (containers)
  • Iterators for traversal over persistent structs
  • Extensible Data Mining Server (EDMS)
  • Persistency (patterns and data)
  • Indexing (patterns and data)
  • Data layout and I/O issues

11
FSM Toolkit
  • Scalable, high-performance data mining
  • Tight DBMS integration
  • FSM functionality
  • Mining (search over complex high-dimensional
    spaces)
  • Pre-processing (discretization, feature and data
    selection)
  • Post-processing (interesting, rare, visualization)

12
Data Mining Template Library (DMTL)
  • A generic pattern mining engine
  • Facilitates frequent structure mining tasks
  • Generic programming goes beyond object-oriented
  • Compile-time resolution vs. run-time
  • Generic algorithms (work for ANY pattern type!)
  • Ability to define custom pattern types
  • Persistent, generic data structures (containers)
  • Iterators for traversal

13
Generic vs. Object-Oriented
  • Object Oriented (Run-time polymorphism)
  • class Pattern
  • virtual bool sub-pattern(Pattern P2);
  • class set : public Pattern ...sub-pattern()
  • class seq : public Pattern sub-pattern()
  • Pattern P1 = new set; Pattern P2 = new set;
  • sub-pattern(Pattern P1, Pattern P2)
  • return P1.sub-pattern(P2);
  • Which sub-pattern() will be called?
  • Determined at run-time (inefficient)

14
Generic vs. Object-Oriented
  • Generic Programming (Compile-time)
  • template <class P> class Pattern
  • sub-pattern(Pattern<P> P2)
  • return Pat.sub-pattern(P2.Pat);
  • P Pat;
  • class set ...sub-pattern() , class seq
    sub-pattern()
  • Pat<set> P1 = new Pat<set>; Pat<set> P2 = new
    Pat<set>;
  • Which sub-pattern() will be called?
  • Determined at compile-time (efficient)
  • Works for ANY pattern type, provided sub-pattern
    is defined (also checked at compile time)
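
The contrast on the last two slides can be sketched in compilable C++. This is a minimal illustration, not DMTL's actual code: the slide's `sub-pattern` is spelled `sub_pattern` because hyphens are not legal in C++ identifiers, and `PatternBase`, `DynSet`, and `Set` are hypothetical names. The sketch assumes that `a.sub_pattern(b)` asks whether `a` is contained in `b`, and that item vectors are kept sorted.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Run-time polymorphism: the call goes through a virtual table, and the
// argument's concrete type has to be recovered while the program runs.
struct PatternBase {
    virtual ~PatternBase() {}
    virtual bool sub_pattern(const PatternBase& other) const = 0;
};

struct DynSet : PatternBase {
    std::vector<int> items;  // assumed sorted ascending
    explicit DynSet(std::vector<int> v) : items(std::move(v)) {}
    bool sub_pattern(const PatternBase& other) const override {
        const DynSet* o = dynamic_cast<const DynSet*>(&other);
        return o != nullptr &&
               std::includes(o->items.begin(), o->items.end(),
                             items.begin(), items.end());
    }
};

// Compile-time polymorphism: Pattern<P> forwards to P::sub_pattern, and the
// compiler picks (and can inline) the right function when it instantiates P.
struct Set {
    std::vector<int> items;  // assumed sorted ascending
    bool sub_pattern(const Set& other) const {
        return std::includes(other.items.begin(), other.items.end(),
                             items.begin(), items.end());
    }
};

template <class P>
struct Pattern {
    P pat;
    int support;
    bool sub_pattern(const Pattern<P>& p2) const {
        return pat.sub_pattern(p2.pat);
    }
};
```

Instantiating `Pattern<T>` for a type `T` without a `sub_pattern` member fails to compile as soon as `Pattern<T>::sub_pattern` is used, which is the compile-time check the slide mentions.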

15
DMTL Generic classes
  • Data structure to represent a pattern
  • Operations necessary for a pattern
  • Data structure to represent a group of patterns
  • Operations necessary for a group of patterns
  • Algorithms that operate on any pattern or a group
    of patterns
  • Interface with Extensible Data Mining Server for
    scalable statistics computations for a pattern or
    group of patterns

16
Generic Pattern Class
  • Types of patterns
  • Itemset
  • Sequence
  • Trees and Graphs
  • Define your own!
  • Pattern Class
  • Members for support counting and other statistics
  • Sub-pattern checking (may require isomorphism
    testing!)

17
Pattern Class
18
Pattern
  • template <class P>
  • class Pattern
  • P pat;
  • int support;
  • bool sub-pattern(Pattern<P> P2)
  • return pat.sub-pattern(P2.pat);
  • Every new pattern-type must define its own
  • sub-pattern function

19
Pattern Types: Itemsets and Sequences
  • class Itemset
  • typedef typename vector<int> PT;
  • bool sub-pattern (PT p2)
  • PT p;
  • class Sequence
  • typedef typename list<vector<int> > PT;
  • bool sub-pattern (PT p2)
  • PT p;
  • class Sequence2
  • typedef typename list<pair<vector<int>, time> >
    PT;
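
As an illustration of what each pattern type has to supply, here is one possible `sub_pattern` for the `Itemset` and `Sequence` types above. It is a sketch under stated assumptions rather than DMTL's implementation: items inside a vector are assumed sorted, `p.sub_pattern(p2)` is assumed to ask whether `p` occurs in `p2`, and sequence containment uses the usual greedy left-to-right match of each element against a later superset element.

```cpp
#include <algorithm>
#include <list>
#include <vector>

struct Itemset {
    typedef std::vector<int> PT;
    PT p;  // items, assumed sorted ascending

    // True if every item of p also appears in p2.
    bool sub_pattern(const PT& p2) const {
        return std::includes(p2.begin(), p2.end(), p.begin(), p.end());
    }
};

struct Sequence {
    typedef std::list<std::vector<int> > PT;
    PT p;  // ordered list of itemsets, each assumed sorted ascending

    // True if the elements of p can be matched, in order, to distinct
    // elements of p2 that contain them.
    bool sub_pattern(const PT& p2) const {
        PT::const_iterator it = p.begin();
        for (PT::const_iterator jt = p2.begin();
             jt != p2.end() && it != p.end(); ++jt) {
            if (std::includes(jt->begin(), jt->end(),
                              it->begin(), it->end())) {
                ++it;  // matched this element, move to the next one
            }
        }
        return it == p.end();
    }
};
```

Either type can be wrapped in the `Pattern<P>` template from the previous slide once the wrapper forwards the underlying `PT` value.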

20
String Representation of Trees
[Figure: example tree with 7 labeled nodes]
String encoding: 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1
With N nodes, M branches, F max fanout:
  • Adjacency Matrix requires N(F+1) space
  • Adjacency List requires 4N-2 space
  • Tree (node, child, sibling) requires 3N space
  • String representation requires 2N-1 space

21
Tree String Representation
  • Like an itemset
  • -1 as the backtrack item
  • Class Tree
  • typedef typename vector<int> pattern-type;
  • Assuming only labels on nodes
  • For trees, labels on edges can be treated as
    labels on nodes
  • edge-label + node-label = new label!
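
The string encoding can be produced by a depth-first walk that emits a node's label on the way down and -1 each time it returns from a child. The sketch below assumes a hypothetical index-based `Tree` struct; only the encoding rule itself comes from the slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical in-memory tree: node i has label[i] and ordered children.
struct Tree {
    std::vector<int> label;
    std::vector<std::vector<int> > children;
};

// Append the encoding of the subtree rooted at `node` to `out`.
void encode_from(const Tree& t, int node, std::vector<int>& out) {
    out.push_back(t.label[node]);
    for (std::size_t i = 0; i < t.children[node].size(); ++i) {
        encode_from(t, t.children[node][i], out);
        out.push_back(-1);  // backtrack from the child to `node`
    }
}

// Encoding of the whole tree: N labels plus N-1 backtracks, 2N-1 entries.
std::vector<int> encode(const Tree& t, int root) {
    std::vector<int> out;
    encode_from(t, root, out);
    return out;
}
```

For a 7-node tree whose root (label 0) has children labeled 1 and 2, where the node labeled 1 has children labeled 3 and 2, and the node labeled 3 has children labeled 1 and 2, this yields the 13-entry string 0 1 3 1 -1 2 -1 -1 2 -1 -1 2 -1.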

22
Graphs: DFS Tree
[Figure: graph on vertices 0, 1, 2, 3 with labels A, B, D, C, shown with its DFS tree]
DFS tree and remaining edges: (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D)
23
Graphs
  • Many Possible Representations
  • Canonical DFS-tree (gSpan)
  • CAMs: Canonical Adjacency Matrix (FFSM)
  • Adjacency Matrix (FSG, FGM, etc.)
  • DMTL (uses Canonical DFS-tree)
  • Graph is a vector of edges
  • Each edge is a 5-tuple (v1, v2, vl1, el, vl2)
  • class Graph
  • typedef typename vector<edges> PT;
  • class Edges
  • int v1; int v2; // v1, v2 are node ids
  • int vl1; int vl2; // vl1, vl2 are node labels
  • int el; // el is edge label
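
A minimal sketch of the edge-vector representation follows, using the four edges listed on the DFS-tree slide. The names `Edge`, `Graph`, and `example_graph` are hypothetical, the edge label is set to 0 because that example shows none, and the `operator<` is a plain lexicographic comparison of the 5-tuple used only for illustration; it is not gSpan's DFS lexicographic order.

```cpp
#include <tuple>
#include <vector>

struct Edge {
    int v1, v2;  // node ids
    int vl1;     // label of v1
    int el;      // edge label
    int vl2;     // label of v2
};

// Illustrative ordering only: compare the 5-tuples field by field.
inline bool operator<(const Edge& a, const Edge& b) {
    return std::tie(a.v1, a.v2, a.vl1, a.el, a.vl2) <
           std::tie(b.v1, b.v2, b.vl1, b.el, b.vl2);
}

typedef std::vector<Edge> Graph;

// The example graph: (0,1,A,B) (0,3,A,C) (1,2,B,D) (3,2,C,D).
inline Graph example_graph() {
    const int no_label = 0;  // assumption: the example has no edge labels
    Graph g;
    Edge e0 = {0, 1, 'A', no_label, 'B'};
    Edge e1 = {0, 3, 'A', no_label, 'C'};
    Edge e2 = {1, 2, 'B', no_label, 'D'};
    Edge e3 = {3, 2, 'C', no_label, 'D'};
    g.push_back(e0);
    g.push_back(e1);
    g.push_back(e2);
    g.push_back(e3);
    return g;
}
```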

24
Pattern Family
  • We call a group of patterns, a family
  • A generic class that works for any pattern type
    (a collection of patterns)
  • Support operations like
  • Get frequency
  • Compute other statistics (e.g., count-by-class)
  • Compute maximal, closed patterns, etc.
  • Persistency (via persistency manager from EDMS)
  • Pattern Indexing (out-of-core): prefix trees for
    sets and sequences

25
Pattern Family and pvector
  • template <class PFT>
  • class PatternFamily
  • typedef typename
  • PFT pat_fam_t;
  • typedef typename
  • PFT::pattern P;
  • PFT pat_fam;
  • template <class P, class PM>
  • class pvector
  • typedef typename
  • P pattern_type;
  • typedef typename
  • PM persist_mgr;

26
DMTL Hierarchy
27
Generic Mining Algorithms
  • A collection of common, generic frequent pattern
    mining algorithms
  • Horizontal: pattern matching based
  • Vertical: intersection based
  • BFS or DFS
  • Work for any pattern/family type
  • Future Work
  • Projection based algorithms
  • Maximal (long) pattern mining
  • Closed pattern mining
  • Add constraints

28
Generic DFS-Mine
  • template <class PFT>
  • void DFS-Mine (PatternFamily<PFT> PF, Dbase
    DB)
  • typedef typename PFT::pattern_type P;
  • typedef typename PFT::persist_mgr PM;
  • Works for any PFT (pattern family type)!
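
One way such a generic depth-first miner could look is sketched below. This is an assumption-laden illustration, not the DMTL signature: the hypothetical policy class `ItemsetFamily` supplies `pattern_type`, `db_type`, `support`, and `extensions`, and `dfs_mine` only ever talks to those names, so a different pattern family could be plugged in without touching the algorithm. Support counting here scans a horizontal database; persistence is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical pattern-family policy for itemsets over horizontal data.
struct ItemsetFamily {
    typedef std::vector<int> pattern_type;           // sorted items
    typedef std::vector<std::vector<int> > db_type;  // sorted transactions

    // Number of transactions that contain p.
    static int support(const pattern_type& p, const db_type& db) {
        int n = 0;
        for (std::size_t t = 0; t < db.size(); ++t)
            if (std::includes(db[t].begin(), db[t].end(),
                              p.begin(), p.end()))
                ++n;
        return n;
    }

    // Extend p by each item that is larger than its last item.
    static std::vector<pattern_type> extensions(const pattern_type& p,
                                                const db_type& db) {
        std::set<int> items;
        for (std::size_t t = 0; t < db.size(); ++t)
            for (std::size_t i = 0; i < db[t].size(); ++i)
                if (p.empty() || db[t][i] > p.back())
                    items.insert(db[t][i]);
        std::vector<pattern_type> out;
        for (std::set<int>::const_iterator it = items.begin();
             it != items.end(); ++it) {
            pattern_type q = p;
            q.push_back(*it);
            out.push_back(q);
        }
        return out;
    }
};

// Generic depth-first search: works for any PFT with the members above.
template <class PFT>
void dfs_mine(const typename PFT::pattern_type& prefix,
              const typename PFT::db_type& db, int minsup,
              std::vector<typename PFT::pattern_type>& frequent) {
    typedef typename PFT::pattern_type P;
    std::vector<P> cands = PFT::extensions(prefix, db);
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (PFT::support(cands[i], db) >= minsup) {
            frequent.push_back(cands[i]);
            dfs_mine<PFT>(cands[i], db, minsup, frequent);
        }
    }
}
```

Calling `dfs_mine<ItemsetFamily>` with an empty prefix enumerates frequent itemsets in depth-first, lexicographic order.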

29
Candidate Generation and Support Counting
  • Candidate Generation
  • Extend by a node or an edge
  • Avoid duplicates as far as possible
  • May involve isomorphism testing for graphs
  • Not required for sets, sequences or trees
  • DMTL provides a generic includes operation
  • Support Counting
  • EDMS for data access
  • Generic intersections operation for any pattern
    type (e.g. Eclat, Spade, TreeMiner)
  • For horizontal data use generic sub-pattern
    operation
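
For the vertical case, the simplest instance of an intersection-based support count is the Eclat-style tidlist join for itemsets, sketched here. `TidList` and `intersect` are hypothetical names; the generic DMTL operation for sequences and trees carries more information per entry than a bare transaction id.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Sorted ids of the transactions that contain a pattern.
typedef std::vector<int> TidList;

// Tidlist of the joined pattern: transactions containing both inputs.
// Its size is the support of the new candidate.
TidList intersect(const TidList& a, const TidList& b) {
    TidList out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

If item A occurs in transactions {0, 1, 3} and item B in {0, 1, 2}, the candidate AB occurs in {0, 1} and has support 2, with no further pass over the original data.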

30
Candidate Generation
  • Sets: add the next item in lex (or other) order,
    added at end of last element
  • E.g. ABC to ABCD
  • Sequences: add any item at end of last element
    (set or sequence extension)
  • E.g. ABC to ABCA (only sequence extension
    allowed, same as A→B→C→A)
  • E.g. ABC to ABC→A to ABC→AB (if set extensions
    allowed)

31
Trees: Systematic Candidate Generation
Two subtrees are in the same class iff they share
a common prefix string P up to the (k-1)th node.
A valid element x is attached only to the nodes
lying on the path from the root to the rightmost
leaf in prefix P.
[Figure: prefix 3 4 2 with candidate positions for x, one of them marked as not valid]
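
The valid attachment points can be read directly off the string encoding. The sketch below (function name hypothetical) numbers nodes in depth-first order, rebuilds each node's parent with a stack, and then walks up from the last node added, which is the rightmost leaf.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the depth-first ids of the nodes on the path from the root to the
// rightmost leaf of the tree encoded by `enc` (labels, -1 for backtrack).
std::vector<int> rightmost_path(const std::vector<int>& enc) {
    std::vector<int> parent;  // parent[i] of node i, -1 for the root
    std::vector<int> stack;   // current root-to-node path during the scan
    for (std::size_t i = 0; i < enc.size(); ++i) {
        if (enc[i] == -1) {
            if (!stack.empty()) stack.pop_back();
        } else {
            int id = static_cast<int>(parent.size());
            parent.push_back(stack.empty() ? -1 : stack.back());
            stack.push_back(id);
        }
    }
    std::vector<int> path;
    for (int n = static_cast<int>(parent.size()) - 1; n >= 0; n = parent[n])
        path.push_back(n);
    std::reverse(path.begin(), path.end());
    return path;
}
```

A new element x may be attached to any node id returned here; attaching it elsewhere would produce a tree that is also reachable from a different prefix, which is what the restriction avoids.
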
32
Candidate Generation (Join Operator ⊗)
[Figure: self-join and join of the class elements, with the new candidates each produces]
Equivalence Class: Prefix 1 2, Elements (3,1) (4,0)
New Equivalence Class: Prefix 1 2 3, Elements (3,1) (3,2) (4,0)
33
Graphs
  • Define an ordering on edges (5-tuples)
  • Use a canonical DFS tree to represent the
    candidates (collection of edges)
  • Add one new edge to an existing graph
  • Can prove that every new candidate can be
    obtained by rightmost path extension (like
    trees), plus back-edges
  • First add back-edges, then forward edges in DFS
    order
  • Test for canonical DFS tree (involves isomorphism
    testing to eliminate duplicates)

34
Candidate Generation
  • Generate new candidate (k+1)-patterns from
    equivalence classes of k-patterns
  • Consider each pair of elements in a class,
    including self-extensions
  • Consider all new candidates from each pair of
    joined elements
  • All possible candidate patterns are enumerated
  • Each pattern is generated only once if possible
    (sets, seqs, trees, but not graphs)

35
Extensible Data Mining Server (EDMS)
  • Provide scalable I/O
  • Data and pattern Indexing
  • Persistency Manager
  • Provide native support for several data models
  • Horizontal (tabular, row-based)
  • Vertical (full vertical fragmentation)
  • Provide generic storage models
  • Flat-files
  • Databases (OODBMS, Embedded, etc)
  • Custom Libraries

36
EDMS Data Model: VATs (Vertical Attribute Tables)
37
VATs
  • VATs are composed of a header and a body.
  • Different types of patterns require different
    VATs

Header
Body (List of Object IDs)
38
EDMS Class Hierarchy: DB, Metatable, VAT, Storage
VAT Header
VAT Records
Storage
DB
39
VAT Class
For Itemsets: VAT<int>
40
VATs for Sequences
For Sequences: VAT<pair<int, time> >
41
VATs for Trees: Match Labels
[Figure: a subtree matched against a tree whose nodes n0 to n6 have scopes [0,6], [1,5], [2,4], [3,3], [4,4], [5,5], [6,6]]
Match Label 03456, Support 1
VAT for Trees: vector<id, match label, scope>
42
Frequency Computation: Scope List Joins (In-Scope)
[Figure: database trees T0, T1, T2 with node scopes; Minsup 3 (100%)]
Equivalence Class: Prefix Ø, Elements (1,-1) (2,-1) (3,-1) (4,-1)
Scope lists of the elements, as (Tree Id, Last Node Scope):
  • 1: (0, [0,3]) (1, [1,3]) (2, [0,7]) (2, [4,7])
  • 2: (0, [1,1]) (1, [0,5]) (1, [2,2]) (1, [4,4]) (2, [2,2]) (2, [5,5])
  • 3: (0, [2,3]) (1, [5,5]) (2, [1,2]) (2, [6,7])
  • 4: (0, [3,3]) (1, [3,3]) (2, [7,7])
In-scope join results, as (Tree Id, Prefix Match Label, Last Node Scope):
  • 1 joined with 2: (0, 0, [1,1]) (1, 1, [2,2]) (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5])
  • 1 joined with 4: (0, 0, [3,3]) (1, 1, [3,3]) (2, 0, [7,7]) (2, 4, [7,7])
Count 3
43
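Scope List Joins: Out-Scope
[Figure: out-scope join of two scope lists]
Entries are (Tree Id, Prefix Match Label, Last Node Scope):
  • (0, 0, [1,1]) joined with (0, 0, [3,3]) gives (0, 01, [3,3])
  • (1, 1, [2,2]) joined with (1, 1, [3,3]) gives (1, 12, [3,3])
  • (2, 0, [2,2]) (2, 0, [5,5]) (2, 4, [5,5]) joined with
    (2, 0, [7,7]) (2, 4, [7,7]) gives
    (2, 02, [7,7]) (2, 05, [7,7]) (2, 45, [7,7])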
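
The two scope tests behind these joins can be sketched as follows. This is an illustration inferred from the entries on the slides, not the EDMS code: a scope is taken to be a pair [lo, hi], the in-scope test asks whether one scope lies strictly inside another (a descendant), the out-scope test asks whether one scope ends before the other starts, and a join keeps pairs with the same tree id and the same prefix match label, appending the first entry's position to the match label. All type and function names are hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <vector>

struct Scope { int lo, hi; };

struct Entry {
    int tree_id;
    std::vector<int> match_label;  // positions matched by the prefix
    Scope scope;                   // scope of the last node
};
typedef std::vector<Entry> ScopeList;

// y lies strictly inside x: y is a descendant of x.
bool in_scope(const Scope& x, const Scope& y) {
    return x.lo <= y.lo && y.hi <= x.hi && !(x.lo == y.lo && x.hi == y.hi);
}

// x ends before y begins: y comes after x and is not its descendant.
bool out_scope(const Scope& x, const Scope& y) {
    return x.hi < y.lo;
}

// Join two scope lists under the given scope test.
ScopeList join(const ScopeList& xs, const ScopeList& ys,
               bool (*test)(const Scope&, const Scope&)) {
    ScopeList out;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        for (std::size_t j = 0; j < ys.size(); ++j) {
            if (xs[i].tree_id == ys[j].tree_id &&
                xs[i].match_label == ys[j].match_label &&
                test(xs[i].scope, ys[j].scope)) {
                Entry e = ys[j];
                e.match_label.push_back(xs[i].scope.lo);
                out.push_back(e);
            }
        }
    }
    return out;
}

// Support: number of distinct trees that contribute at least one entry.
int support(const ScopeList& list) {
    std::set<int> trees;
    for (std::size_t i = 0; i < list.size(); ++i)
        trees.insert(list[i].tree_id);
    return static_cast<int>(trees.size());
}
```

Applied to the entries shown on the out-scope slide, `join(..., out_scope)` turns (2, 0, [5,5]) and (2, 0, [7,7]) into (2, 05, [7,7]), and the resulting list has support 3 because trees 0, 1, and 2 all contribute.
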
44
VATs for Graphs
  • One VAT per unique edge
  • VAT body consists of a vector of
  • Graph id
  • Vertex id 1 and vertex id 2
  • Vector <int, vector<pair <int, int> > >
  • Possible to get (k+1)-pattern VAT by intersecting
    k-pattern VAT with 1-pattern VAT!
  • Currently works for induced sub-graphs (same as
    gSpan, FSG, etc.)

45
EDMS Classes
46
DB
  • DB is the main user interface
  • Provides methods like read/write for a DMTL
    database
  • Accesses the mapper (for pre-processing)
  • Indexing: finds the correct MetaTable and VAT
    index, which allows retrieval of the desired VAT

47
MetaTable
MetaTables provide grouping of VATs. An attribute
stores the encoded value for a predefined grouping
strategy. Effects upload of data.
  • VAT index for fast and efficient search
  • Persistence is applied to VATs through this
    object.

48
Persistency
  • A VAT's persistency status can be one of the
    following three
  • Volatile. Only in main memory, not registered in
    any MetaTable (small VATs).
  • Buffered. Accessed as if it was in main memory,
    but actually stored on disk (big VATs).
  • Persistent. Stored persistently on disk, also
    after the computation has finished.

49
Storage<class T>
  • Abstracts the details of physical storage
  • Provides persistency to MetaTables
  • Methods to
  • Read/write a VAT from/to disk
  • Different implementations of this class
  • Metakit (embedded db)
  • Gigabase (object-relational)
  • Flat-file
  • Work in progress
  • Persistency manager for pattern/families
  • Performance optimizations

50
Buffer Replacement (LRU) for Flatfile/Metakit
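
The slide names LRU replacement without giving details, so the following is only a generic sketch of the policy, not the EDMS buffer manager. `LruBuffer` is a hypothetical class that keeps a fixed number of VAT bodies in memory (a `std::vector<int>` stands in for the VAT) and drops the least recently used one when a new VAT arrives; a real buffer would write the evicted VAT to the storage back end first.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

class LruBuffer {
public:
    typedef std::vector<int> Vat;  // stand-in for a VAT body

    explicit LruBuffer(std::size_t capacity) : capacity_(capacity) {}

    // Returns the buffered VAT, or nullptr on a miss.
    // A hit makes the VAT the most recently used one.
    const Vat* get(int id) {
        Index::iterator it = index_.find(id);
        if (it == index_.end()) return nullptr;
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }

    // Inserts or updates a VAT, evicting the least recently used if full.
    void put(int id, Vat vat) {
        Index::iterator it = index_.find(id);
        if (it != index_.end()) {
            it->second->second = std::move(vat);
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (capacity_ == 0) return;
        if (order_.size() == capacity_) {
            index_.erase(order_.back().first);
            order_.pop_back();  // eviction point
        }
        order_.push_front(std::make_pair(id, std::move(vat)));
        index_[id] = order_.begin();
    }

    bool contains(int id) const { return index_.count(id) != 0; }

private:
    typedef std::list<std::pair<int, Vat> > Order;
    typedef std::unordered_map<int, Order::iterator> Index;

    std::size_t capacity_;
    Order order_;  // front is most recently used
    Index index_;  // VAT id to its position in order_
};
```
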
51
What about other pattern types?
  • Define pattern type
  • Define VAT body type
  • Define the isomorphism or sub-pattern, and
    candidate extension functions (or use default
    graph functions)
  • Define VAT intersection operation
  • Select persistency manager if desired (default
    instantiation, e.g., gigabase)
  • All containers and algorithms work!
  • Instead of pattern specific sub-pattern/extension/
    VAT joins, implement generic functions using
    pattern properties

52
Pattern Properties
  • Define a hierarchy or partial order of pattern
    properties
  • Also modeled as classes!
  • Write generic sub-pattern/extension/VAT-join
    functions
  • A given pattern-property satisfies all properties
    above it in the partial order
  • E.g. Given pattern P (as collection of edges),
    extend last edge (x,u) with an edge (u,v) with
    v > u for sets, with any v for sequences. For trees
    add new edge to any vertex on the rightmost path,
    and for graphs also add new back-edges.

Itemset Prop: vector<int>
Seq Prop: vector<int>
Seq2 Prop: vector<time, int>
Tree Prop: vector<int>
Seq3 Prop: list<pair<time, vector<int> > >
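
The extension rule for sets and sequences can be expressed with properties modeled as empty classes and selected at compile time. This is a hedged sketch: `set_prop`, `seq_prop`, `next_items`, and `extend` are hypothetical names, items are assumed to be integers in the range 0 to num_items-1, and the tree and graph cases are omitted.

```cpp
#include <cstddef>
#include <vector>

// Pattern properties modeled as empty tag classes.
struct set_prop {};
struct seq_prop {};

// Sets: only items v > u may follow the last item u.
inline std::vector<int> next_items(int u, int num_items, set_prop) {
    std::vector<int> out;
    for (int v = u + 1; v < num_items; ++v) out.push_back(v);
    return out;
}

// Sequences: any item v may follow the last item.
inline std::vector<int> next_items(int /*u*/, int num_items, seq_prop) {
    std::vector<int> out;
    for (int v = 0; v < num_items; ++v) out.push_back(v);
    return out;
}

// Generic extension: the property chosen at compile time decides which
// items are allowed after the last element of p.
template <class Prop>
std::vector<std::vector<int> > extend(const std::vector<int>& p,
                                      int num_items) {
    std::vector<std::vector<int> > out;
    if (p.empty()) return out;
    std::vector<int> next = next_items(p.back(), num_items, Prop());
    for (std::size_t i = 0; i < next.size(); ++i) {
        std::vector<int> q = p;
        q.push_back(next[i]);
        out.push_back(q);
    }
    return out;
}
```

With four items, `extend<set_prop>` on the pattern 0 2 yields only 0 2 3, while `extend<seq_prop>` yields four candidates, one per item.
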
53
Itemset Mining (2.8GHz P4, 6GB RAM)
  • As minsup decreases, the gap increases, but DMTL
    is within a factor of 10 of optimized ECLAT
  • As database size increases, the gap decreases
    and the two become equal. ECLAT breaks for 5000K
    (>2 hrs), while DMTL works (23.5s), since it does
    transparent memory management

54
Itemset Mining (less memory)
  • Run on 256MB machine, Pentium III, 450MHz
  • DMTL (metakit) able to scale to 10 million
    transactions!
  • Within a factor of 2 of ECLAT (optimized algo)
  • ECLAT didn't finish
  • (> 1 hr) for 5M, 10M
  • Embedded DB 2 times faster than Flat-file

55
Sequence Mining
  • Same trends as itemset mining: SPADE is faster
    (factor of 10), but the gap decreases with large data

56
Sequences (less mem)
  • DMTL within a factor of 2 w.r.t. Optimized SPADE
  • Gap closing for larger number of records
  • SPADE will break at some point (no memory mgmt)

57
Tree Mining
58
Graph Mining
59
Pre-processing: Configuration and Mapping
  • XML-based configuration specification
  • Dynamically specify different mapping strategies
  • Discretization (possibly non-uniform) of
    numerical attributes
  • Taxonomy and grouping of categorical attributes

60
Mapping Continuous and Categorical Attributes
61
Post-processing: Visualization of Patterns
  • MIRAGE: Minimal association rules
  • VTK (Visualization Toolkit)
  • Interactive Exploration (Java-based)
  • Works only for Itemsets
  • Generalize to other patterns (future work)
  • Preliminary prototype

62
Lattice Visualization
63
Interactive Browsing
64
Other Visualization
Cylinder (confidence color coded)
Cone Trees
65
Summary
  • First-of-its-kind generic pattern miner!
  • DMTL as a first-step exploratory tool
  • Handle massive, high-dimensional data
  • Pattern property hierarchy
  • Generic algorithms and data structures
  • Pattern and data indexing
  • Persistency support
  • Databases, flat-files, etc.
  • Will make publicly available in the near future

66
Future Work
  • Completely generic approach
  • To isomorphism, extension, and counting
  • Graphs only work for induced sub-patterns
  • Extend to embedded graphs
  • Consider how to add constraints
  • Closed and maximal patterns
  • Optimization of DMTL to be competitive with
    specific algorithms
  • Extend DMTL to other mining tasks (clustering,
    classification, etc.)