Title: Sequential Patterns
1Sequential Patterns and Process Mining
- Current State of Research
- Edgar de Graaf
- LIACS
2Mining Sequential Patterns
- Sequential Patterns
- Sequence Databases
- AprioriAll
- PrefixSpan
- Gap Constraints
3Sequential Patterns
- <(a,b)(c)(a,b,d)>
- <a1, a2, a3>
- <(3)(4,5)(8)> is contained in <(7)(3,8)(9)(4,5,6)(8)>
- <(3)(4,5)(8)> is not contained in <(7)(3,8)(9)(4)(5,6)(8)>
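The containment relation in the two examples above can be sketched in a few lines of Python. This is a minimal illustration, assuming a sequence is represented as a list of sets; the function name `contains` is chosen here and does not come from the slides.

```python
def contains(pattern, sequence):
    """Return True if `pattern` is contained in `sequence`.

    Both arguments are lists of item sets. Every item set of the pattern
    must be a subset of some element of the sequence, and the matched
    elements must appear in the same order as in the pattern.
    """
    pos = 0
    for itemset in pattern:
        # Skip elements of the sequence that do not cover this item set.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next item set must match a later element
    return True


pattern = [{3}, {4, 5}, {8}]
print(contains(pattern, [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))    # True
print(contains(pattern, [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}]))  # False
```

In the second sequence the items 4 and 5 sit in different elements, so no single element covers the item set (4,5).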
4Sequential databases
The Database with sequences
5Sequential databases
A generated candidate pattern: <(3)(4,5)(8)>
Support count: 0
6Sequential databases
<(3)(4,5)(8)>
Support count: 0, then 1
7Sequential databases
<(3)(4,5)(8)>
Support count: 1
Not contained, so not counted
8Sequential databases
Every further database sequence that contains <(3)(4,5)(8)> is counted
Support count: 1, 2, 3, 4, 5
IF minimal support is 50% THEN <(3)(4,5)(8)> is frequent
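The counting procedure walked through on these slides can be sketched as follows. The database below is made up for illustration (the one shown on the slides is not reproduced here), and the minimal support is assumed to be a fraction of the database size.

```python
def contains(pattern, sequence):
    """True if every item set of `pattern` is covered, in order, by `sequence`."""
    elements = iter(sequence)
    # `any` consumes the iterator up to the first covering element, so the
    # next item set can only match a later element.
    return all(any(itemset <= element for element in elements)
               for itemset in pattern)


def support_count(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(contains(pattern, sequence) for sequence in database)


def is_frequent(pattern, database, min_support):
    """Compare the relative support with a fractional threshold (assumption)."""
    return support_count(pattern, database) / len(database) >= min_support


# Hypothetical database of four sequences.
database = [
    [{7}, {3, 8}, {9}, {4, 5, 6}, {8}],    # contained
    [{7}, {3, 8}, {9}, {4}, {5, 6}, {8}],  # not contained, not counted
    [{3}, {4, 5}, {8}],                    # contained
    [{1}, {2}],                            # not contained, not counted
]
candidate = [{3}, {4, 5}, {8}]
print(support_count(candidate, database))     # 2
print(is_frequent(candidate, database, 0.5))  # True
```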
9Lifting order (1)
- Notation by examples
- <A,B,C>: an ordered list of sets is a sequence
- Every set A, B and C is unordered, e.g. A = (x,y,z) = (y,z,x) = (z,y,x)
- x,y,z is an extension: we ignore the order when counting frequency
10Lifting order (2)
- <(t1)(t2)(t3)(t4)> and
- <(t1)(t3)(t2)(t4)> are frequent
- ⇒
- <(t1)(t3,t2)(t4)> is frequent
- Says that t3 and t2 frequently occur in between t1 and t4, in either order
11Lifting Order (3)
- <(t1)(t2)(t3)(t4)> and
- <(t1)(t3)(t2)(t4)> are infrequent
- Suppose (t1)t3,t2(t4) is frequent
- Says that t3 and t2 often occur in between t1 and t4
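One possible reading of lifting order can be sketched in Python: the support of a lifted pattern counts every sequence that contains the unordered items in any order between the fixed prefix and suffix. The representation (sequences as lists of single events), the function names and the example database are assumptions made for this illustration.

```python
from itertools import permutations


def is_subsequence(pattern, events):
    """True if the events of `pattern` occur in `events` in the same order."""
    remaining = iter(events)
    return all(event in remaining for event in pattern)


def support(pattern, database):
    """Ordinary support: the order of every event matters."""
    return sum(is_subsequence(pattern, events) for events in database)


def lifted_support(prefix, unordered, suffix, database):
    """Support when the events in `unordered` may occur in any order."""
    return sum(
        any(is_subsequence(prefix + list(order) + suffix, events)
            for order in permutations(unordered))
        for events in database
    )


# Hypothetical database: t2 and t3 occur in both orders between t1 and t4.
database = [
    ["t1", "t2", "t3", "t4"],
    ["t1", "t3", "t2", "t4"],
    ["t1", "t2", "t3", "t4"],
    ["t1", "t3", "t2", "t4"],
]
print(support(["t1", "t2", "t3", "t4"], database))             # 2
print(support(["t1", "t3", "t2", "t4"], database))             # 2
print(lifted_support(["t1"], ["t2", "t3"], ["t4"], database))  # 4
```

With a minimal support count of 3, both ordered patterns are infrequent while the lifted pattern is frequent, which is the situation described on the slide.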
12Existing Algorithms
- AprioriAll: the first algorithm, based on the anti-monotone principle
- PrefixSpan: currently the fastest algorithm around; it uses projected databases
13AprioriAll (1)
- AprioriAll(DB, min_sup) {
-   L1 = frequent sequences of size 1
-   k = 2
-   while (Lk-1 is not empty) {
-     Ck = candidateGeneration(Lk-1, k)
-     Ck = candidatePruning(Ck, k)
-     Lk = supportBasedPruning(Ck)
-     k++
-   }
- }
14AprioriAll (2)
- candidateGeneration(Lk-1, k) {
-   Ck = ∅
-   for each a in Lk-1
-     for each b in Lk-1
-       if (for all n, 1 ≤ n ≤ k-2: an = bn) {
-         add to Ck the sequences
-         a1 ... ak-2, ak-1, bk-1 and
-         a1 ... ak-2, bk-1, ak-1
-       }
- }
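The pseudocode on the two AprioriAll slides can be turned into a small runnable sketch. To keep it short, it assumes that every element of a sequence is a single item, so a sequence is a tuple of items; the helper names mirror the pseudocode, but their bodies are one possible interpretation rather than the original implementation.

```python
def is_subsequence(pattern, events):
    """True if the items of `pattern` occur in `events` in the same order."""
    remaining = iter(events)
    return all(item in remaining for item in pattern)


def support(pattern, db):
    return sum(is_subsequence(pattern, events) for events in db)


def candidate_generation(prev_level):
    """Join two (k-1)-sequences that agree on their first k-2 items."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            if a[:-1] == b[:-1]:
                candidates.add(a + b[-1:])
                candidates.add(b + a[-1:])
    return candidates


def candidate_pruning(candidates, prev_level):
    """Anti-monotone check: every (k-1)-subsequence must be frequent."""
    return {c for c in candidates
            if all(c[:i] + c[i + 1:] in prev_level for i in range(len(c)))}


def apriori_all(db, min_sup):
    """Return all frequent sequences; `min_sup` is an absolute count."""
    items = {item for events in db for item in events}
    level = {(i,) for i in items if support((i,), db) >= min_sup}
    frequent = set(level)
    while level:
        candidates = candidate_pruning(candidate_generation(level), level)
        # Support-based pruning: one pass over the database per level.
        level = {c for c in candidates if support(c, db) >= min_sup}
        frequent |= level
    return frequent


db = [list("abc"), list("abc"), list("acb"), list("bc")]
print(sorted(apriori_all(db, 3)))
# [('a',), ('a', 'b'), ('a', 'c'), ('b',), ('b', 'c'), ('c',)]
```

The candidate ('a', 'c', 'b') is generated at level 3 but removed by `candidate_pruning` without touching the database, because its subsequence ('c', 'b') is not frequent.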
15PrefixSpan (1)
- Assume that the prefix is <(a,b)(c)>
- Scan the projected database to find every frequent item x such that
- <(a,b)(c,x)> is frequent or
- <(a,b)(c)(x)> is frequent
- Append x to the prefix and output the pattern
- Now call recursively, e.g. PrefixSpan(<(a,b)(c,x)>, newProjDB)
16PrefixSpan (2)
- A projected DB only stores the postfix
- E.g. if the prefix is <(a,b)>, then we store <(a,b,x)> as <(_, x)>
- New projected DB = old projected DB sequences without the prefix
17PrefixSpan (3)
- Faster than AprioriAll
- No non-existing candidates
- Testing on a shrinking projected DB
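The recursion described on the PrefixSpan slides can be sketched as below. This is a simplified version that assumes every element of a sequence is a single item, so only the extension <prefix (x)> occurs and the item-set extension <(c,x)> is left out; the function names are chosen for this example.

```python
def prefixspan(db, min_sup):
    """Return {pattern: support count} for sequences of single items."""
    results = {}

    def mine(prefix, projected):
        # Count in how many postfixes each item occurs.
        counts = {}
        for postfix in projected:
            for item in set(postfix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            pattern = prefix + (item,)
            results[pattern] = count  # append x to the prefix and output
            # New projected DB: what is left after the first occurrence of x.
            new_projected = [p[p.index(item) + 1:] for p in projected if item in p]
            mine(pattern, new_projected)

    mine((), [tuple(events) for events in db])
    return results


db = [list("abc"), list("abc"), list("acb"), list("bc")]
print(prefixspan(db, 3))
# {('a',): 3, ('a', 'b'): 3, ('a', 'c'): 3, ('b',): 4, ('b', 'c'): 3, ('c',): 4}
```

Only items that actually occur in the projected database are ever considered, and each recursive call works on shorter postfixes than its caller.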
18Gap Constraint
- Simple idea: a maximal distance between the item sets of a sequence
- E.g. given the sequence <(a)(c)(d)(e)>, the pattern <(a)(e)> and a gap of 1, this sequence is not counted
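A containment test with a gap constraint can be sketched as follows. The sketch assumes that the gap is the maximal number of sequence elements allowed between two consecutive matched item sets; the slide does not fix this definition, so treat `max_gap` as a hypothetical parameter.

```python
def contains_with_gap(pattern, sequence, max_gap):
    """True if `pattern` occurs in `sequence` with at most `max_gap`
    elements skipped between consecutive matched item sets."""

    def match(p_idx, prev_pos):
        if p_idx == len(pattern):
            return True
        if prev_pos is None:
            positions = range(len(sequence))  # first item set: anywhere
        else:
            last = min(len(sequence), prev_pos + 2 + max_gap)
            positions = range(prev_pos + 1, last)
        # Try every allowed position; a greedy first match is not enough
        # once a gap constraint is involved.
        return any(pattern[p_idx] <= sequence[pos] and match(p_idx + 1, pos)
                   for pos in positions)

    return match(0, None)


sequence = [{"a"}, {"c"}, {"d"}, {"e"}]
print(contains_with_gap([{"a"}, {"e"}], sequence, 1))  # False
print(contains_with_gap([{"a"}, {"e"}], sequence, 2))  # True
```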
19Process Mining
- What is process mining?
- Using D/F tables and graphs
- Genetic Algorithms
- Problem areas
- Using sequential patterns
20What is process mining? (1)
- The ordering of events is known, e.g. <(task A)(task B)(task C)>
- Process mining constructs a Petri net
[Figure: example Petri net with the labels pay, ready, claim, register, to_be_evaluated, send_letter]
Source: Workflow Management by W. van der Aalst and K. van Hee (1997)
21What is process mining? (2)
- Usability of process mining
- Given the audit trails, what is the workflow network?
- Does the mined workflow network match the original design? (Delta Analysis)
- Is the mined workflow network better than the original design? (Performance Analysis)
22Using D/F tables and graphs (1)
- For every task a D/F table
- Intuition: if A is often followed by B, then the probability of A causing B increases
23Using D/F tables and graphs (2)
- A D/F graph is constructed
- IF ((A→B ≥ N) AND (A > B ≥ σ) AND (B < A ≥ σ)) THEN connect A to B
- More complicated rules deal with recursion and
short loops
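The intuition behind the D/F graph can be illustrated with a strongly simplified sketch. It only counts how often one task directly follows another in the audit trails and connects A to B when that count reaches a threshold and exceeds the count in the opposite direction. This is a stand-in for the rule on the slide, not a reproduction of it: the causality score, recursion and short loops are left out, and the names `direct_follows`, `connections` and `sigma` are assumptions.

```python
from collections import Counter


def direct_follows(trails):
    """Count how often task b directly follows task a in the audit trails."""
    counts = Counter()
    for trail in trails:
        for a, b in zip(trail, trail[1:]):
            counts[(a, b)] += 1
    return counts


def connections(trails, sigma):
    """Connect a to b if 'a directly followed by b' is seen at least
    `sigma` times and more often than 'b directly followed by a'."""
    counts = direct_follows(trails)
    return {(a, b) for (a, b), n in counts.items()
            if n >= sigma and n > counts[(b, a)]}


trails = [list("ABCD"), list("ACBD"), list("ABCD")]
print(sorted(connections(trails, 2)))
# [('A', 'B'), ('B', 'C'), ('C', 'D')]
```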
24Using D/F tables and graphs (3)
25Using D/F tables and graphs (4)
- AND/OR-Splits
- OR: if neither C > B nor B > C is higher than the threshold
- AND: if both are higher than the threshold
[Figure: split from task A to tasks B and C]
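The AND/OR decision above can be sketched with direct-succession counts. The sketch assumes that "B > C" stands for the number of times C directly follows B in the audit trails; the slides do not say what happens when only one of the two counts is above the threshold, so that case is reported as "undecided" here.

```python
from collections import Counter


def direct_follows(trails):
    """Count how often task b directly follows task a in the audit trails."""
    counts = Counter()
    for trail in trails:
        for a, b in zip(trail, trail[1:]):
            counts[(a, b)] += 1
    return counts


def split_type(trails, b, c, threshold):
    """Classify the split towards tasks b and c as AND, OR or undecided."""
    counts = direct_follows(trails)
    b_then_c = counts[(b, c)]
    c_then_b = counts[(c, b)]
    if b_then_c > threshold and c_then_b > threshold:
        return "AND"  # both orders are common: b and c run in parallel
    if b_then_c <= threshold and c_then_b <= threshold:
        return "OR"   # b and c hardly ever follow each other: a choice
    return "undecided"


parallel = [list("ABCD"), list("ACBD"), list("ABCD"), list("ACBD")]
choice = [list("ABD"), list("ACD"), list("ABD")]
print(split_type(parallel, "B", "C", 1))  # AND
print(split_type(choice, "B", "C", 1))    # OR
```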
26Genetic Algorithms (1)
1. Create an initial population of workflows
2. Calculate their fitness using audit trails
3. Create a child
4. Mutate the child
5. Repeat 3 to 4 to create the new population
6. Go to 2
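The loop above can be sketched as a toy genetic algorithm. Everything specific in it is a placeholder chosen for illustration: a workflow is encoded as a set of task-to-task edges, fitness is the fraction of direct successions in the audit trails that the edge set allows minus a small penalty per edge, a child takes each edge from one of two parents, and mutation flips one random edge. Keeping the best individual unchanged (elitism) is also an assumption.

```python
import random


def fitness(edges, trails):
    """Toy fitness: covered direct successions minus a size penalty."""
    steps = [(a, b) for trail in trails for a, b in zip(trail, trail[1:])]
    covered = sum(step in edges for step in steps)
    return covered / len(steps) - 0.01 * len(edges)


def evolve(trails, generations=30, size=20, seed=0):
    rng = random.Random(seed)
    tasks = sorted({task for trail in trails for task in trail})
    all_edges = [(a, b) for a in tasks for b in tasks if a != b]
    # 1. Initial population: random edge sets.
    population = [frozenset(e for e in all_edges if rng.random() < 0.5)
                  for _ in range(size)]
    for _ in range(generations):
        # 2. Rank the population by fitness on the audit trails.
        ranked = sorted(population, key=lambda m: fitness(m, trails), reverse=True)
        parents = ranked[: size // 2]
        new_population = [ranked[0]]  # elitism (assumption)
        while len(new_population) < size:
            # 3. Child: every edge is inherited from one of two parents.
            mother, father = rng.sample(parents, 2)
            child = {e for e in all_edges
                     if e in (mother if rng.random() < 0.5 else father)}
            # 4. Mutation: flip one randomly chosen edge.
            child ^= {rng.choice(all_edges)}
            new_population.append(frozenset(child))
        population = new_population  # 5. and 6.: next generation, back to 2
    return max(population, key=lambda m: fitness(m, trails))


trails = [list("ABCD"), list("ACBD"), list("ABCD")]
best = evolve(trails)
print(sorted(best), round(fitness(best, trails), 2))
```

The fixed seed only makes runs repeatable; the result still depends on the placeholder choices for chromosome, fitness and operators.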
27Genetic Algorithms (2)
- Advantages
- Can deal with duplicate tasks and non-free choice
- Disadvantages
- The structure of the chromosome
- How do we measure fitness?
- How do we do cross-over and mutation?
28Problem Areas (1)
- Hidden tasks
- Duplicate tasks: when tasks have the same name
[Figure: tasks B and C]
29Problem Areas (2)
[Figure: workflow with tasks A, B, C, D and E]
30Problem Areas (3)
[Figure: workflow with tasks A, B, C and D]
31Problem Areas (4)
- Delta analysis: how do we compare two models?
- Other problems: time, dealing with noise and incompleteness
32Using sequential patterns
- Mining loops?
- Fitness measure in a GA?
- Use in delta analysis?
- Generate the important frequent subsequences to
help the designer
33Further research in sequences
- How about gaps between items in different item sets?
- What type of frequent subsequences to use in fitness?
- Lifting order: is it useful in workflow generation?
- Further research into lifting order
34The End
- Thank you for your attention
- Edgar de Graaf
- edegraaf_at_liacs.nl