Title: Data Mining Engineering
1 Data Mining Methods
Mining Association Rules and Sequential Patterns
2 KDD (Knowledge Discovery in Databases) Process
(Diagram: the KDD process. Stages shown: Operational Databases; Clean, Collect, Summarize; Data Warehouse; Data Preparation; Training Data; Data Mining; Model, Patterns; Verification, Evaluation.)
3 Mining Association Rules
- Association rule mining finds interesting association or correlation relationships among a large set of data items.
- This can help in many business decision-making processes: store layout, catalog design, and customer segmentation based on buying patterns. Another important field is medical applications.
- Market basket analysis is a typical example of association rule mining.
- How can we find association rules from large amounts of data? Which association rules are the most interesting? How can we help or guide the mining procedures?
4 Informal Introduction
- Given a set of database transactions, where each transaction is a set of items, an association rule is an expression X → Y, where X and Y are sets of items (literals). The intuitive meaning of the rule: transactions in the database which contain the items in X tend to also contain the items in Y.
- Example: 98% of customers who purchase tires and auto accessories also buy some automotive services; here 98% is called the confidence of the rule. The support of the rule is the percentage of transactions that contain both X and Y.
- The problem of mining association rules is to find all rules that satisfy a user-specified minimum support and minimum confidence.
5 Basic Concepts
Let J = {i1, i2, ..., im} be a set of items. Typically, the items are identifiers of individual articles (products), e.g., bar codes. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. Let A be a set of items; a transaction T is said to contain A if and only if A ⊆ T. An association rule is an implication of the form A → B, where A ⊂ J, B ⊂ J, and A ∩ B = ∅. The rule A → B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is the probability P(A ∪ B).
6 Basic Concepts (Cont.)
The rule A → B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B, i.e., the conditional probability P(B|A). Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.
7 Basic Concepts - Example

transaction | purchased items
1 | bread, coffee, milk, cake
2 | coffee, milk, cake
3 | bread, butter, coffee, milk
4 | milk, cake
5 | bread, cake
6 | bread

X = {coffee, milk}, R = {coffee, cake, milk}
support of X = 3 out of 6 = 50%
support of R = 2 out of 6 = 33%
Support of {milk, coffee} → {cake} equals the support of R = 33%.
Confidence of {milk, coffee} → {cake} = 2 out of 3 = 67% = support(R)/support(X).
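As a quick check of these numbers, here is a minimal Python sketch (the `support` helper is ours, not from the slides):

```python
# The six transactions of the example above, as Python sets.
transactions = [
    {"bread", "coffee", "milk", "cake"},
    {"coffee", "milk", "cake"},
    {"bread", "butter", "coffee", "milk"},
    {"milk", "cake"},
    {"bread", "cake"},
    {"bread"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

X = {"coffee", "milk"}          # antecedent
R = {"coffee", "milk", "cake"}  # antecedent plus consequent

print(round(support(X, transactions), 2))   # 0.5  (50%)
print(round(support(R, transactions), 2))   # 0.33 (33%)
# confidence({milk, coffee} -> {cake}) = support(R) / support(X)
print(round(support(R, transactions) / support(X, transactions), 2))  # 0.67
```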
8 Basic Concepts (Cont.)
An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk. Association rule mining is a two-step process:
1. Find all frequent itemsets.
2. Generate strong association rules from the frequent itemsets.
9 Association Rule Classification
- Based on the types of values handled in the rule: If a rule concerns associations between the presence or absence of items, it is a Boolean association rule. For example: computer → financial_management_software [support = 2%, confidence = 60%]. If a rule describes associations between quantitative items or attributes, then it is a quantitative association rule. For example: age(X, 30..39) ∧ income(X, 42K..48K) → buys(X, high resolution TV). Note that the quantitative attributes, age and income, have been discretized.
10 Association Rule Classification (Cont.)
- Based on the dimensions of data involved in the rule: If the items or attributes in an association rule reference only one dimension, then it is a single-dimensional association rule. For example: buys(X, computer) → buys(X, financial management software). The above rule refers to only one dimension, buys. If a rule references two or more dimensions, such as buys, time_of_transaction, and customer_category, then it is a multidimensional association rule. The second rule on the previous slide is a 3-dimensional association rule since it involves 3 dimensions: age, income, and buys.
11 Association Rule Classification (Cont.)
- Based on the levels of abstraction involved in the rule set: Suppose that a set of association rules mined includes
age(X, 30..39) → buys(X, laptop computer)
age(X, 30..39) → buys(X, computer)
In the above rules, the items bought are referenced at different levels of abstraction. (E.g., computer is a higher-level abstraction of laptop computer.) Such rules are called multilevel association rules. Single-level association rules refer to one abstraction level only.
12 Mining Single-Dimensional Boolean Association Rules from Transactional Databases
This is the simplest form of association rules (used in market basket analysis). We present Apriori, a basic algorithm for finding frequent itemsets. Its name comes from the fact that it uses prior knowledge of frequent itemset properties (explained later). Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets, L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. The Apriori property is used to reduce the search space.
13 The Apriori Property
All nonempty subsets of a frequent itemset must also be frequent. If an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ {A}) cannot occur more frequently than I. Therefore, I ∪ {A} is not frequent either, that is, P(I ∪ {A}) < min_sup. How is the Apriori property used in the algorithm? To understand this, let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions. These steps are explained on the next slides.
14 The Apriori Algorithm: the Join Step
To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted by Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li (e.g., l1[k-2] refers to the second to the last item in l1). Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if
(l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]).
The condition (l1[k-1] < l2[k-1]) simply ensures that no duplicates are generated. The resulting itemset is l1[1] l1[2] ... l1[k-1] l2[k-1].
15 The Apriori Algorithm: the Join Step (2)
Illustration by an example:
p ∈ Lk-1 = {1 2 3}
q ∈ Lk-1 = {1 2 4}
Join result ∈ Ck: {1 2 3 4}
Each frequent (k-1)-itemset p is always extended by the last item of all frequent itemsets q which have the same first k-2 items as p.
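The join step can be sketched in Python as follows (our own illustration, not the textbook pseudocode; itemsets are kept as sorted tuples, matching Apriori's lexicographic-order assumption):

```python
# Join step sketch: generate candidate k-itemsets from the frequent
# (k-1)-itemsets L_{k-1}, given as a sorted list of sorted tuples.
def apriori_join(L_prev):
    """Join L_{k-1} with itself: merge pairs agreeing on all but the last item."""
    C = []
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            # joinable if the first k-2 items match and p's last item < q's last
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                C.append(p + (q[-1],))
    return C

# the example above: p = (1 2 3) and q = (1 2 4) join into (1 2 3 4)
print(apriori_join([(1, 2, 3), (1, 2, 4)]))  # [(1, 2, 3, 4)]
```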
16 The Apriori Algorithm: the Prune Step
Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. The above subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.
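A minimal Python sketch of the prune step (our own illustration; the L2 and C3 values are those of the AllElectronics example that follows, with items I1..I5 encoded as the integers 1..5):

```python
from itertools import combinations

# Prune step sketch: drop candidates that have an infrequent (k-1)-subset.
def apriori_prune(C_k, L_prev):
    """Keep only candidates whose every (k-1)-subset is in L_{k-1}."""
    frequent = set(L_prev)
    return [c for c in C_k
            if all(s in frequent for s in combinations(c, len(c) - 1))]

L2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 4), (2, 5)]
C3 = [(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 4), (2, 3, 5), (2, 4, 5)]
print(apriori_prune(C3, L2))  # [(1, 2, 3), (1, 2, 5)]
```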
17 The Apriori Algorithm - Example
Let's look at a concrete example of Apriori, based on the AllElectronics transaction database D, shown below. There are nine transactions in this database, i.e., |D| = 9. We use the next figure to illustrate the finding of frequent itemsets in D.

TID | List of item_IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
18 Generation of Ck and Lk (min. supp. count = 2)
Scan D for the count of each candidate (C1), then compare each candidate's support count with the minimum support count (L1). Here every 1-itemset is frequent, so L1 = C1:

C1 = L1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

Generate the C2 candidates from L1, scan D for their counts, and compare with the minimum support count:

C2: {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0
L2: {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2
19 Generation of Ck and Lk (min. supp. count = 2)
Generate the C3 candidates from L2, scan D for their counts, and compare with the minimum support count:

C3: {I1,I2,I3}: 2, {I1,I2,I5}: 2
L3: {I1,I2,I3}: 2, {I1,I2,I5}: 2
20 Algorithm Application Description
- In the 1st iteration, each item is a member of C1. The algorithm simply scans all the transactions in order to count the number of occurrences of each item.
- Suppose that the minimum transaction support count is 2 (min_sup = 2/9 = 22%). L1 can then be determined.
- C2 = L1 ⋈ L1.
- The transactions in D are scanned and the support count of each candidate itemset in C2 is accumulated, as shown in the middle table of the second row in the last figure.
- The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
21 Algorithm Application Description (2)
- The generation of C3 = L2 ⋈ L2 is detailed in the next figure. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. We therefore remove them from C3.
- The transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
- C4 = L3 ⋈ L3; after the pruning, C4 = ∅.
22 Example: Generation of C3 from L2
1. Join: C3 = L2 ⋈ L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} ⋈ {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}} = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.
2. Prune using the Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
- The 2-item subsets of {I1,I2,I3} are {I1,I2}, {I1,I3}, {I2,I3}, and they all are members of L2. Therefore, keep {I1,I2,I3} in C3.
- The 2-item subsets of {I1,I2,I5} are {I1,I2}, {I1,I5}, {I2,I5}, and they all are members of L2. Therefore, keep {I1,I2,I5} in C3.
- Using the same analysis, remove the other 3-itemsets from C3.
3. Therefore, C3 = {{I1,I2,I3}, {I1,I2,I5}} after pruning.
23 Generating Association Rules from Frequent Items
We generate strong association rules - they satisfy both minimum support and minimum confidence.

confidence(A → B) = P(B|A) = support_count(A ∪ B) / support_count(A)

where support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support_count(A) is the number of transactions containing the itemset A.
24 Generating Association Rules from Frequent Items (Cont.)
Based on the equation on the previous slide, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule s → (l - s) if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
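These two steps can be sketched in Python for the frequent itemset l = {I1, I2, I5} of the AllElectronics database (items encoded as the integers 1..5; the helper names are ours, not from the slides):

```python
from itertools import combinations

# The nine AllElectronics transactions.
D = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
     {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]

def support_count(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in D)

def rules(l, min_conf):
    """All rules s -> (l - s) with confidence >= min_conf."""
    l = frozenset(l)
    out = []
    for r in range(1, len(l)):                      # nonempty proper subsets s
        for s in map(frozenset, combinations(sorted(l), r)):
            conf = support_count(l) / support_count(s)
            if conf >= min_conf:
                out.append((sorted(s), sorted(l - s), round(conf, 2)))
    return out

for s, rest, conf in rules({1, 2, 5}, min_conf=0.7):
    print(s, "->", rest, " confidence:", conf)
```

With min_conf = 70%, only the three rules with confidence 100% survive, matching the example on the next slide.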
25 Generating Association Rules - Example
Suppose that the transactional data for AllElectronics contain the frequent itemset l = {I1, I2, I5}. The resulting rules are:
I1 ∧ I2 → I5, confidence = 2/4 = 50%
I1 ∧ I5 → I2, confidence = 2/2 = 100%
I2 ∧ I5 → I1, confidence = 2/2 = 100%
I1 → I2 ∧ I5, confidence = 2/6 = 33%
I2 → I1 ∧ I5, confidence = 2/7 = 29%
I5 → I1 ∧ I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third, and the last rules above are output, since these are the only ones generated that are strong.
26 Multilevel (Generalized) Association Rules
For many applications, it is difficult to find
strong associations among data items at low or
primitive levels of abstraction due to sparsity
of data in multidimensional space. Strong
associations discovered at high concept levels
may represent common sense knowledge. However,
what may represent common sense to one user may
seem novel to another. Therefore, data mining
systems should provide capabilities to mine
association rules at multiple levels of
abstraction and traverse easily among different
abstraction spaces.
27 Multilevel (Generalized) Association Rules - Example
Suppose we are given the following task-relevant set of transactional data for sales at the computer department of an AllElectronics branch, showing the items purchased for each transaction TID.

TID | Items purchased
T1 | IBM desktop computer, Sony b/w printer
T2 | Microsoft educational software, Microsoft financial software
T3 | Logitech mouse computer accessory, Ergoway wrist pad accessory
T4 | IBM desktop computer, Microsoft financial software
T5 | IBM desktop computer
... | ...

Table: Transactions
28 A Concept Hierarchy for our Example
(Diagram: a concept hierarchy tree. Level 0: all. Level 1: computer, software, printer, computer accessory. Level 2: desktop and laptop under computer; educational and financial under software; color and b/w under printer; wrist pad and mouse under computer accessory. Level 3: brands such as IBM, Microsoft, HP, Sony, Ergoway, Logitech.)
29 Example (Cont.)
The items in Table Transactions are at the lowest level of the concept hierarchy. It is difficult to find interesting purchase patterns at such raw or primitive level data. If, e.g., IBM desktop computer or Sony b/w printer each occurs in a very small fraction of the transactions, then it may be difficult to find strong associations involving such items. In other words, it is unlikely that the itemset {IBM desktop computer, Sony b/w printer} will satisfy minimum support. Itemsets containing generalized items, such as {IBM desktop computer, b/w printer} and {computer, printer}, are more likely to have minimum support. Rules generated from association rule mining with concept hierarchies are called multiple-level or multilevel or generalized association rules.
30 Parallel Formulation of Association Rules
- Need:
  - Huge transaction datasets (10s of TB)
  - Large number of candidates
- Data distribution:
  - Partition the transaction database, or
  - Partition the candidates, or
  - Both
31 Parallel Association Rules: Count Distribution (CD)
- Each processor has the complete candidate hash tree.
- Each processor updates its hash tree with local data.
- Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.
- Multiple database scans per iteration are required if the hash tree is too big for memory.
32 CD Illustration
(Diagram: processors P0, P1, P2, each holding N/p transactions of the database, followed by a global reduction of the candidate counts.)
33 Parallel Association Rules: Data Distribution (DD)
- The candidate set is partitioned among the processors.
- Once local data has been partitioned, it is broadcast to all other processors.
- High communication cost due to data movement.
- Redundant work due to multiple traversals of the hash trees.
34 DD Illustration
(Diagram: processors P0, P1, P2, each holding a partition of the candidates with local counts, e.g., {1,2}: 9, {2,3}: 12, {1,3}: 10, {3,4}: 10; local data is broadcast to the other processors as remote data, with an all-to-all broadcast of the candidates.)
35 Predictive Model Markup Language (PMML) and Visualization
36 Predictive Model Markup Language - PMML
- Markup language (XML-based) to describe data mining models
- PMML describes:
  - the inputs to data mining models
  - the transformations used to prepare data for data mining
  - the parameters which define the models themselves
37 PMML 2.1 Association Rules (1)
Model attributes (1):

<xs:element name="AssociationModel">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
      <xs:element ref="MiningSchema" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Item" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Itemset" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="AssociationRule" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
38 PMML 2.1 Association Rules (2)
Model attributes (2):

    <xs:attribute name="modelName" type="xs:string" />
    <xs:attribute name="functionName" type="MINING-FUNCTION" use="required" />
    <xs:attribute name="algorithmName" type="xs:string" />
    <xs:attribute name="numberOfTransactions" type="INT-NUMBER" use="required" />
    <xs:attribute name="maxNumberOfItemsPerTA" type="INT-NUMBER" />
    <xs:attribute name="avgNumberOfItemsPerTA" type="REAL-NUMBER" />
    <xs:attribute name="minimumSupport" type="PROB-NUMBER" use="required" />
    <xs:attribute name="minimumConfidence" type="PROB-NUMBER" use="required" />
    <xs:attribute name="lengthLimit" type="INT-NUMBER" />
    <xs:attribute name="numberOfItems" type="INT-NUMBER" use="required" />
    <xs:attribute name="numberOfItemsets" type="INT-NUMBER" use="required" />
    <xs:attribute name="numberOfRules" type="INT-NUMBER" use="required" />
  </xs:complexType>
</xs:element>
39 PMML 2.1 Association Rules (3)
2. Items:

<xs:element name="Item">
  <xs:complexType>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="value" type="xs:string" use="required" />
    <xs:attribute name="mappedValue" type="xs:string" />
    <xs:attribute name="weight" type="REAL-NUMBER" />
  </xs:complexType>
</xs:element>
40 PMML 2.1 Association Rules (4)
3. Itemsets:

<xs:element name="Itemset">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="ItemRef" />
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required" />
    <xs:attribute name="support" type="PROB-NUMBER" />
    <xs:attribute name="numberOfItems" type="INT-NUMBER" />
  </xs:complexType>
</xs:element>
41 PMML 2.1 Association Rules (5)
4. AssociationRules:

<xs:element name="AssociationRule">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="Extension" />
    </xs:sequence>
    <xs:attribute name="support" type="PROB-NUMBER" use="required" />
    <xs:attribute name="confidence" type="PROB-NUMBER" use="required" />
    <xs:attribute name="antecedent" type="xs:string" use="required" />
    <xs:attribute name="consequent" type="xs:string" use="required" />
  </xs:complexType>
</xs:element>
42 PMML example model for AssociationRules (1)

<?xml version="1.0" ?>
<PMML version="2.1">
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" />
    <DataField name="item" optype="categorical" />
  </DataDictionary>
  <AssociationModel functionName="associationRules"
      numberOfTransactions="4" numberOfItems="4"
      minimumSupport="0.6" minimumConfidence="0.3"
      numberOfItemsets="7" numberOfRules="3">
    <MiningSchema>
      <MiningField name="transaction" />
      <MiningField name="item" />
    </MiningSchema>
43 PMML example model for AssociationRules (2)

    <!-- four items - input data -->
    <Item id="1" value="PC" />
    <Item id="2" value="Monitor" />
    <Item id="3" value="Printer" />
    <Item id="4" value="Notebook" />
    <!-- three frequent 1-itemsets -->
    <Itemset id="1" support="1.0" numberOfItems="1">
      <ItemRef itemRef="1" />
    </Itemset>
    <Itemset id="2" support="1.0" numberOfItems="1">
      <ItemRef itemRef="2" />
    </Itemset>
    <Itemset id="3" support="1.0" numberOfItems="1">
      <ItemRef itemRef="3" />
    </Itemset>
44 PMML example model for AssociationRules (3)

    <!-- three frequent 2-itemsets -->
    <Itemset id="4" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="2" />
    </Itemset>
    <Itemset id="5" support="1.0" numberOfItems="2">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="3" />
    </Itemset>
    <Itemset id="6" support="1.0" numberOfItems="2">
      <ItemRef itemRef="2" />
      <ItemRef itemRef="3" />
    </Itemset>
45 PMML example model for AssociationRules (4)

    <!-- one frequent 3-itemset -->
    <Itemset id="7" support="0.9" numberOfItems="3">
      <ItemRef itemRef="1" />
      <ItemRef itemRef="2" />
      <ItemRef itemRef="3" />
    </Itemset>
    <!-- three rules satisfy the requirements - the output -->
    <AssociationRule support="0.9" confidence="0.85"
        antecedent="4" consequent="3" />
    <AssociationRule support="0.9" confidence="0.75"
        antecedent="1" consequent="6" />
    <AssociationRule support="0.9" confidence="0.70"
        antecedent="6" consequent="1" />
  </AssociationModel>
</PMML>
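As a sketch of how such a model might be consumed, the rules of a PMML fragment like the one above can be read with Python's standard xml.etree.ElementTree (the snippet below is abbreviated, reusing only the rule attributes of the example):

```python
import xml.etree.ElementTree as ET

# Abbreviated PMML fragment with the three rules of the example model.
pmml = """<PMML version="2.1">
  <AssociationModel functionName="associationRules" numberOfTransactions="4"
      numberOfItems="4" minimumSupport="0.6" minimumConfidence="0.3"
      numberOfItemsets="7" numberOfRules="3">
    <AssociationRule support="0.9" confidence="0.85" antecedent="4" consequent="3"/>
    <AssociationRule support="0.9" confidence="0.75" antecedent="1" consequent="6"/>
    <AssociationRule support="0.9" confidence="0.70" antecedent="6" consequent="1"/>
  </AssociationModel>
</PMML>"""

root = ET.fromstring(pmml)
# iterate over every AssociationRule element and read its attributes
for rule in root.iter("AssociationRule"):
    print(rule.get("antecedent"), "->", rule.get("consequent"),
          " support:", rule.get("support"), " confidence:", rule.get("confidence"))
```

The antecedent and consequent attributes reference itemset ids, so itemset 4 ({PC, Monitor}) → itemset 3 ({Printer}) is the first rule.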
46 Visualization of Association Rules (1)

Antecedent | Consequent | Support | Confidence
PC, Monitor | Printer | 90% | 85%
PC | Printer, Monitor | 90% | 75%
Printer, Monitor | PC | 80% | 70%
47 Visualization of Association Rules (2)
(Diagram: graph visualization of the three rules, with the item nodes PC, Monitor, and Printer connected by directed edges for each rule.)
48 Visualization of Association Rules (3)
3. 3-D visualisation (diagram).
49 Mining Sequential Patterns (Mining Sequential Associations)
50 Mining Sequential Patterns
- Discovering sequential patterns is a relatively new data mining problem.
- The input data is a set of sequences, called data-sequences.
- Each data-sequence is a list of transactions where each transaction is a set of items. Typically, there is a transaction time associated with each transaction.
- A sequential pattern also consists of a list of sets of items.
- The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.
51 Application Examples
- Book club: Each data-sequence may correspond to all book selections of a customer, and each transaction corresponds to the books selected by the customer in one order. A sequential pattern may be: 5% of customers bought "Foundation", then "Foundation and Empire", and then "Second Foundation". A data-sequence corresponding to a customer who bought some other books in between these books still contains this sequential pattern.
- Medical domain: A data-sequence may correspond to the symptoms or diseases of a patient, with a transaction corresponding to the symptoms exhibited or diseases diagnosed during a visit to the doctor. The patterns discovered could be used in disease research to help identify the symptoms and diseases that precede certain diseases.
52 Discovering Sequential Associations
Given: a set of objects with associated event occurrences.
(Diagram: timelines for Object 1 and Object 2, with events occurring at times 10, 20, 30, 40, and 50.)
53 Problem Statement
We are given a database D of customer transactions. Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction. No customer has more than one transaction with the same transaction time. We do not consider quantities of items bought in a transaction: each item is a binary variable representing whether an item was bought or not. A sequence is an ordered list of itemsets. We denote an itemset i by (i1 i2 ... im), where ij is an item. We denote a sequence s by <s1 s2 ... sn>, where sj is an itemset. A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
54 Problem Statement (2)
For example, <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). However, the sequence <(3) (5)> is not contained in <(3 5)> (and vice versa). The former represents items 3 and 5 being bought one after the other, while the latter represents items 3 and 5 being bought together. In a set of sequences, a sequence s is maximal if s is not contained in any other sequence. Customer sequence - the list of the itemsets of a customer's transactions ordered by increasing transaction time: <itemset(T1) itemset(T2) ... itemset(Tn)>.
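The containment test defined above can be sketched in Python with a greedy scan (matching each itemset of the smaller sequence to the earliest possible itemset of the larger one; the function name is ours):

```python
# Sequence containment: A = <a1 ... an> is contained in B = <b1 ... bm>
# if each ai is a subset of some bi, in order.
def contains(big, small):
    """True if `small` (a list of itemsets) is contained in `big`."""
    i = 0
    for b in big:
        if i < len(small) and small[i] <= b:
            i += 1
    return i == len(small)

# <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)>
print(contains([{7}, {3, 8}, {9}, {4, 5, 6}, {8}], [{3}, {4, 5}, {8}]))  # True
# <(3) (5)> is not contained in <(3 5)>, and vice versa
print(contains([{3, 5}], [{3}, {5}]))  # False
print(contains([{3}, {5}], [{3, 5}]))  # False
```

The greedy scan is safe here because matching an itemset as early as possible never removes any later matching options.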
55 Problem Statement (3)
A customer supports a sequence s if s is contained in the customer sequence for this customer. The support for a sequence is defined as the fraction of total customers who support this sequence. Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support. Each such sequence represents a sequential pattern. We call a sequence satisfying the minimum support constraint a large sequence. See the next example.
56 Example

Customer Id | Transaction Time | Items Bought
1 | June 25 '00 | 30
1 | June 30 '00 | 90
2 | June 10 '00 | 10, 20
2 | June 15 '00 | 30
2 | June 20 '00 | 40, 60, 70
3 | June 25 '00 | 30, 50, 70
4 | June 25 '00 | 30
4 | June 30 '00 | 40, 70
4 | July 25 '00 | 90
5 | June 12 '00 | 90

Database sorted by customer Id and transaction time

Customer Id | Customer Sequence
1 | <(30) (90)>
2 | <(10 20) (30) (40 60 70)>
3 | <(30 50 70)>
4 | <(30) (40 70) (90)>
5 | <(90)>

Customer-sequence version of the database
57 Example (2)
With minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, <(30) (90)> and <(30) (40 70)>, are maximal among those satisfying the support constraint, and are the desired sequential patterns. <(30) (90)> is supported by customers 1 and 4. Customer 4 buys items (40 70) in between items 30 and 90, but supports the pattern <(30) (90)> since we are looking for patterns that are not necessarily contiguous. <(30) (40 70)> is supported by customers 2 and 4. Customer 2 buys 60 along with 40 and 70, but supports this pattern since (40 70) is a subset of (40 60 70). E.g., the sequence <(10 20) (30)> does not have minimum support: it is only supported by customer 2. The sequences <(30)>, <(40)>, <(70)>, <(90)>, <(30) (40)>, <(30) (70)> and <(40 70)> have minimum support, but they are not maximal - therefore, they are not in the answer.
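The support counts of this example can be checked with a short Python sketch over the customer-sequence table above (reusing a greedy containment test; all names are ours):

```python
# Greedy containment test: `small` is contained in `big` if each of its
# itemsets is a subset of some itemset of `big`, in order.
def contains(big, small):
    i = 0
    for b in big:
        if i < len(small) and small[i] <= b:
            i += 1
    return i == len(small)

# The five customer sequences from the example.
customer_seqs = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

def support(seq):
    """Number of customers whose sequence contains `seq`."""
    return sum(contains(c, seq) for c in customer_seqs)

print(support([{30}, {90}]))      # 2  (customers 1 and 4)
print(support([{30}, {40, 70}]))  # 2  (customers 2 and 4)
print(support([{10, 20}, {30}]))  # 1  (customer 2 only)
```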