Title: Parallel Mining of Association Rules
Rakesh Agrawal, John C. Shafer
Presented by Ting Hian Ong Xu XingJian http//www
- Introduction
- Overview of Serial Algorithm
- Parallel Algorithms
- Count Distribution (CD)
- Data Distribution (DD)
- Candidate Distribution (CaD)
- Parallel Rule Generation
- Performance Sensitivity Analysis
- Conclusions
- Q & A
Ultra-large databases
Possibility of faster access and manipulation
- The efficient discovery of previously unknown patterns in large databases -> the need for fast algorithms for discovering association rules
- Why Parallel Algorithms?
- Databases to be mined (raw transaction data instead of samples) are often very large, in the GB and TB range
- The need for fast algorithms for discovering association rules
- Transaction databases have to be scanned repeatedly to discover the frequent itemsets
- This requires a lot of computation power, memory, and I/O, which can only be provided by parallel computers using parallel algorithms
- Three parallel algorithms introduced
- Count Distribution (CD)
- Data Distribution (DD)
- Candidate Distribution (CaD)
- Based on the serial algorithm Apriori
- Association Rules
- The problem of mining association rules is to generate all association rules that have at least a user-specified minimum support and confidence.
- Problem Decomposition
- Find all sets of items whose support is greater than the user-specified minimum support (frequent itemsets)
- Use the frequent itemsets to generate the desired rules
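The two measures named above can be made concrete with a minimal sketch. The toy transaction list and the helper names `support` and `confidence` are illustrative assumptions, not taken from the slides; support is expressed here as a fraction of all transactions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent."""
    whole = frozenset(antecedent) | frozenset(consequent)
    return support(whole, transactions) / support(antecedent, transactions)


# Hypothetical toy database: each transaction is a set of items.
TOY_DB = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
```

On this toy data, {bread, milk} occurs in 2 of 4 transactions and {bread} in 3 of 4, so the rule bread => milk has confidence 0.5 / 0.75.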
Apriori Algorithm
L1 = {frequent 1-itemsets};
k = 2;
while (Lk-1 ≠ ∅) do
    Ck = new candidates of size k generated from Lk-1;
    forall transactions t ∈ D do
        increment the count of all candidates in Ck that are contained in t;
    Lk = all candidates in Ck with minimum support;
    k = k + 1;
end
Answer = ∪k Lk;
Apriori Algorithm: Candidate Generation
Join step:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
Prune step:
    delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1
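The main loop and the join and prune steps can be sketched in Python as follows. This is a minimal, unoptimized sketch under stated assumptions: itemsets are represented as sorted tuples, minimum support is an absolute count (`min_count`), and the function names `apriori_gen` and `apriori` are hypothetical. The containment test is a plain scan rather than an optimized data structure.

```python
from itertools import combinations


def apriori_gen(prev_frequent):
    """Join and prune: build size-k candidates from the frequent (k-1)-itemsets."""
    prev = set(prev_frequent)
    candidates = set()
    for p in prev:
        for q in prev:
            # Join: identical first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: drop c if any (k-1)-subset is not frequent.
                if all(s in prev for s in combinations(c, len(c) - 1)):
                    candidates.add(c)
    return candidates


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    frequent = {c: n for c, n in counts.items() if n >= min_count}  # L1
    answer = dict(frequent)
    while frequent:  # while L(k-1) is not empty
        candidates = apriori_gen(frequent.keys())  # Ck
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if t.issuperset(c):
                    counts[c] += 1
        frequent = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        answer.update(frequent)
    return answer
```

For instance, with L3 = {123, 124, 134, 135, 234} the join produces 1234 and 1345, and the prune step removes 1345 because its subset 145 is not in L3.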
- Three parallel algorithms (CD, DD, CaD), all based on Apriori
- Discovering frequent itemsets (1) is much more expensive than generating rules (2)
- Phase 1
- Each node generates candidate k-itemsets locally from the frequent (k-1)-itemsets -> how to partition?
- Phase 2
- Match candidate itemsets against transactions and collect the local counts -> how to distribute?
- Phase 3
- Determine the global counts for itemsets -> how to find?
- Find the frequent k-itemsets and replicate them in all nodes
- Implemented on an IBM POWERparallel System SP2, a shared-nothing machine, where each of the N processors has a private memory and a private disk.
- Data is evenly distributed among the nodes
Count Distribution (CD)
- Objective: minimize communication
- Techniques
- Straightforward parallelization of Apriori
- Carry out redundant duplicate computations in parallel to avoid communication
- Only requires communicating count values (no data tuples are exchanged)
- Processors can scan the local data asynchronously in parallel
- Algorithm
- Pass 1
- Each processor Pi generates its local candidate itemset Ci1 depending on the items present in its local data partition Di
- Develop and exchange local counts Ci1
- Develop global support counts C1
- Algorithm
- Pass k > 1
- Pi generates the complete Ck using the complete Lk-1 created at the end of pass (k-1). Each processor has the identical Lk-1, and thus generates an identical Ck and puts its count values in a common order into a count array
- Pi makes a pass over its data partition Di and develops local support counts for the candidates in Ck
- Pi exchanges local Ck counts with all other processors to develop global Ck counts. All processors must synchronize.
- Pi computes Lk from Ck
- Pi independently decides to terminate or continue to the next pass
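One pass k > 1 of this scheme can be simulated in a single process. The sketch below is illustrative only: each inner list of `partitions` plays the role of a local partition Di, and `all_reduce_sum` is a hypothetical stand-in for the count exchange (on a real cluster this would be a message-passing collective; the slides say only that MPI was used). Itemsets are assumed to be sorted tuples and minimum support an absolute count.

```python
def local_counts(ordered_candidates, partition):
    """One processor's pass over its own partition Di.

    Counts are returned in the same order as `ordered_candidates`,
    so every processor fills its count array in a common order.
    """
    return [sum(1 for t in partition if set(c) <= t) for c in ordered_candidates]


def all_reduce_sum(count_arrays):
    """Simulated count exchange: element-wise sum of all local count arrays."""
    return [sum(column) for column in zip(*count_arrays)]


def count_distribution_pass(candidates, partitions, min_count):
    """Simulate one CD pass and return Lk as {itemset: global count}."""
    ordered = sorted(candidates)  # identical Ck, identical order, on every processor
    per_processor = [local_counts(ordered, d) for d in partitions]
    global_counts = all_reduce_sum(per_processor)  # synchronization point
    return {c: n for c, n in zip(ordered, global_counts) if n >= min_count}


# Hypothetical data: two processors, each holding part of the database.
PARTITIONS = [
    [{1, 2, 3}, {1, 2}],
    [{1, 3}, {2, 3}, {1, 2, 3}],
]
C2 = {(1, 2), (1, 3), (2, 3)}
```

Only the count arrays cross processor boundaries here; the transactions themselves never leave their partition, which is the property the slides emphasize.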
- Disadvantages
- CD does not exploit the aggregate memory of the system
- Must synchronize and develop global counts at the end of each pass
Data Distribution (DD)
- Objective: utilize the aggregate main memory of the system effectively
- Technique
- Partitions the candidates into disjoint sets, which are assigned to different processors. Each processor works with the entire dataset but only a portion of the candidate set.
- Each processor counts mutually exclusive candidates. On an N-processor configuration, DD can count in a single pass a candidate set that would require N passes in CD
Basic Idea
- Example: 2 processors
- Each processor in Data Distribution processes only a subset of Ck, to utilize the aggregate memory
- Exchange data to develop global counts for Cik
- Algorithm
- Pass 1: Same as the CD algorithm
- Pass k > 1
- Pi generates Ck from Lk-1. It retains only 1/N of the itemsets, forming Cik
- Pi develops support counts for the itemsets in Cik for ALL transactions (using local data pages and data pages received from other processors)
- At the end of the data pass, Pi calculates Lik using the local Cik
- Processors exchange Lik so that every processor has the complete Lk for generating Ck+1 for the next pass (requires processors to synchronize)
- Pi can independently decide whether to terminate or continue on to the next pass
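One pass k > 1 of this scheme can likewise be simulated in a single process. In the sketch below, the round-robin split of Ck is an assumption (the slides only say each processor retains 1/N of the itemsets), concatenating the partitions stands in for the broadcast of data pages, and the function name is hypothetical.

```python
def data_distribution_pass(candidates, partitions, min_count):
    """Simulate one DD pass; return (candidate split per processor, complete Lk)."""
    n = len(partitions)
    ordered = sorted(candidates)
    # Assumed round-robin split: processor i keeps about 1/N of Ck as its Cik.
    owned = [ordered[i::n] for i in range(n)]
    # Every processor must see ALL transactions: its local pages plus
    # the pages received from every other processor.
    all_transactions = [t for d in partitions for t in d]
    local_frequent = []
    for mine in owned:
        counts = {c: sum(1 for t in all_transactions if set(c) <= t) for c in mine}
        local_frequent.append({c: cnt for c, cnt in counts.items() if cnt >= min_count})
    # Exchange the Lik so that every processor holds the complete Lk.
    complete = {}
    for lik in local_frequent:
        complete.update(lik)
    return owned, complete


# Hypothetical data: two processors, each holding part of the database.
PARTITIONS = [
    [{1, 2, 3}, {1, 2}],
    [{1, 3}, {2, 3}, {1, 2, 3}],
]
C2 = {(1, 2), (1, 3), (2, 3)}
```

Each processor holds fewer candidates than in Count Distribution, but the loop over `all_transactions` shows the price: every transaction is processed by every processor.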
Disadvantages: heavy communication
Each processor must broadcast its local data and frequent itemsets to all other processors and synchronize in every pass.
Candidate Distribution (CaD)
- Problem
- CD and DD require processors to synchronize at the end of each pass
- Basic Idea: Remove dependence among processors
- Data dependence
Complete transactions are required to compute support count (in CD)
- Frequent itemsets dependency
A global itemset Lk is needed during the pruning step of the Apriori candidate generation algorithm (in
- Remove Data Dependency
- Each processor Pi works on Cik, a disjoint subset of Ck
- Pi derives global support counts for Cik from local data.
- Replicate data amongst processors in order to achieve the above
- Reduce Frequent Itemset Dependency
- Does not wait for the complete pruning information to arrive from other processors.
- Prune the candidate set as much as possible
- Late-arriving pruning information is used in subsequent passes.
- Algorithm
- Pass k < l: Use either the CD or DD algorithm
- Pass k = l
- Partition Lk-1 among the N processors
- Pi generates Cik logically using only the Lik-1 partition (using standard pruning)
- Pi develops global counts for the candidates in Cik, and the database is repartitioned into DRi at the same time (requires communicating local data)
- Pi receives Ljk from all other processors, needed for pruning Cik+1
- Pi computes Lik from Cik and asynchronously sends it to the other N-1 processors
- Pass k > l
- Pi collects all frequent itemsets sent by other processors
- Pi generates Cik using the local Lik-1, taking care when pruning, since Ljk-1 from some other processors may not have arrived yet
- Pi passes over DRi and counts Cik
- Pi computes Lik from Cik and asynchronously sends it to the other N-1 processors
- How to partition Lk?
- Partition the itemsets in Lk based on common prefixes of length k-1
- Assume the items in the itemsets are lexicographically ordered
- Example (in the paper; an error: ADE)
- L6 = ∅
- E = {ABC, ABD, ABE} -> all have the common prefix AB
- The Apriori candidate generation procedure generates ABCD, ABCE, ABDE, and ABCDE by joining only the itemsets in E
- Repartition the database according to the partition of Lk
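The prefix-based grouping can be sketched as follows. The grouping by the first k-1 items follows the slide; the assignment of groups to processors (largest group first, to the least-loaded processor) is an assumed load-balancing heuristic, and the function name and the sample L3 are hypothetical. Itemsets are written as strings of lexicographically ordered single-letter items.

```python
from collections import defaultdict


def partition_by_prefix(frequent_k, n_processors):
    """Split Lk so that itemsets sharing a (k-1)-long prefix stay together."""
    groups = defaultdict(list)
    for itemset in sorted(frequent_k):
        groups[itemset[:-1]].append(itemset)  # key = common (k-1)-long prefix
    parts = [[] for _ in range(n_processors)]
    # Assumed heuristic: hand out the largest groups first, each to the
    # processor that currently holds the fewest itemsets.
    for prefix in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        target = min(range(n_processors), key=lambda i: len(parts[i]))
        parts[target].extend(groups[prefix])
    return parts


# Hypothetical L3 over the items A..E.
L3 = ["ABC", "ABD", "ABE", "ACD", "ACE", "ADE", "BCD", "BCE", "BDE", "CDE"]
```

Because the join step only combines itemsets that agree on their first k-1 items, keeping each prefix group on one processor lets that processor run the join on its own; only the prune step still needs information from other groups.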
- In Candidate Distribution, each processor works independently, counting only its portion of the global candidate set using only local data
- CaD must communicate the entire dataset during the redistribution pass (k = l, step 3), but only once. Unlike in DD, a processor may selectively filter out transactions it sends to other processors, depending upon how the dependency graph is partitioned.
Given a frequent itemset l, examine a subset a and generate the rule a => (l-a) with support = support(l) and confidence = support(l) / support(a)
Example: Frequent itemsets ABCD, AB
Confidence = support(ABCD) / support(AB)
Only proceed to smaller subsets if the rules have the required minconf.
Example: Frequent itemset ABCD. If ABC => D doesn't satisfy minconf, AB => CD will not have minconf
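The subset-by-subset procedure can be sketched as a short recursion. This is a minimal sketch under stated assumptions: `support_count` is a hypothetical lookup table holding the support count of the frequent itemset and of all its non-empty subsets, and the function name `gen_rules` is illustrative. A smaller antecedent is tried only when the rule for its parent antecedent met minconf, since shrinking the antecedent can only raise its support and therefore lower the confidence.

```python
from itertools import combinations


def gen_rules(itemset, support_count, minconf):
    """Return {(antecedent, consequent): confidence} for one frequent itemset."""
    whole = frozenset(itemset)
    rules = {}
    seen = set()  # antecedents already evaluated, whether they passed or not

    def expand(antecedent):
        for subset in combinations(sorted(antecedent), len(antecedent) - 1):
            smaller = frozenset(subset)
            if not smaller or smaller in seen:
                continue
            seen.add(smaller)
            conf = support_count[whole] / support_count[smaller]
            if conf >= minconf:
                rules[(smaller, whole - smaller)] = conf
                expand(smaller)  # proceed to smaller subsets only on success

    expand(whole)
    return rules


# Hypothetical support counts for the frequent itemset ABC and its subsets.
SUPPORT_COUNT = {
    frozenset("ABC"): 20,
    frozenset("AB"): 25, frozenset("AC"): 40, frozenset("BC"): 20,
    frozenset("A"): 50, frozenset("B"): 40, frozenset("C"): 50,
}
```

With these hypothetical counts and minconf = 0.7, the rules AB => C and BC => A qualify, while AC => B fails and its subsets are reached only through the other two antecedents.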
- Examination of the dataset is not required -> cheap
- Generating rules in parallel requires partitioning the set of all frequent itemsets. Each processor generates rules for its partition only, using the same algorithm.
- The workload is sensitive to itemset length; balance it by partitioning the itemsets of each length equally.
- Each processor must have access to all frequent itemsets before rule generation begins, in order to calculate the confidence -> in CaD, faster processors must wait for slower processors to discover and transmit all frequent itemsets.
- Given the load imbalance, this step can be performed off-line, possibly on a serial processor.
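The balancing idea, splitting the itemsets of each length equally, can be sketched as below. The round-robin assignment within each length and the function name are assumptions made for illustration; the slides do not specify how the equal split is achieved.

```python
from collections import defaultdict


def partition_for_rule_generation(frequent_itemsets, n_processors):
    """Give each processor an equal share of the itemsets of every length."""
    by_length = defaultdict(list)
    for itemset in frequent_itemsets:
        by_length[len(itemset)].append(itemset)
    parts = [[] for _ in range(n_processors)]
    for length in sorted(by_length):
        # Assumed round-robin deal within one length: shares differ by at most one.
        for index, itemset in enumerate(sorted(by_length[length])):
            parts[index % n_processors].append(itemset)
    return parts


# Hypothetical frequent itemsets, written as strings of single-letter items.
FREQUENT = ["AB", "AC", "BC", "BD", "ABC", "ABD"]
```

Each processor would then run rule generation on its own share only, while still reading supports from the complete set of frequent itemsets.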
- Hardware specifications
- A 32-node IBM SP2 Model 302
- Each node is a Thin Node 2 consisting of a POWER2 processor running at 66.7 MHz with 256 MB of memory
- Each node has a 2 GB disk, of which less than 500 MB was available for tests
- The combined communication hardware has a rated peak bandwidth of 80 MBps and a latency of less than 40 µs. Actual point-to-point bandwidth reached 20 MBps
- The Message Passing Interface (MPI) was used to facilitate communication among processors
- Six synthetic datasets of varying complexity were used
- All datasets were about 100 MB per processor in size
- Data Parameters
T: Average transaction length
I: Average size of frequent itemsets
D: Average number of transactions
- Response Time
- The time elapsed from the initiation of the execution to the end time of the last processor finishing the computation
- Note
- Run on the 6 datasets on a 16-node configuration
- Because of the limited disk space available, the response times for the serial version were measured on one node's worth of data, or 1/16th of the database
- Repartitioning for CaD was done in the 4th pass (best performance)
Response times for CD and CaD are much lower than for DD and close to those of the serial version run with 1/N of the data
- DD was able to exploit the aggregate memory of the multiprocessor and make fewer passes in the case of datasets with large average transaction and frequent itemset lengths.
- CaD makes just as many data passes as CD, because the large candidate sets that force CD into multiple subpasses all occur before CaD takes over with its redistribution pass.
No Communication
- Normal DD: the same 100 MB of data replicated on each of the 16 nodes
- No-communication DD: a node does not receive data from other nodes; it simply processes its local data 15 more times.
- Half of the time taken by DD was for communication.
- I/O savings due to DD making fewer passes become
- DD performs quite poorly for 2 reasons
- Extra communication
- Every node in the system must process every single transaction
- CaD must communicate the entire dataset ONCE, during the redistribution pass, and so suffers from the same problems as DD.
- Unfortunately, a single pass of redistribution is costly. The savings from each processor running completely independently with smaller candidate sets cannot compensate for this cost.
- Although CD's overhead is small (less than 7.5% compared to the serial version), the synchronization cost can be large if the data distributions are skewed or the nodes are not equally capable (different memory, processor speed, I/O bandwidth, and capacities)
- Suggestion: CD with load balancing
- TEST PARAMETERS (only on the CD algorithm)
- Scaleup
- Increased the size of the database in direct proportion to the number of nodes in the system
- Sizeup
- Fixed the size of the multiprocessor at 16 nodes, while increasing the database from 25 MB per node to 400 MB per node
- Speedup
- Fixed the size of each database at 400 MB and varied the number of processors
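One common way to turn the three experiments into single numbers is sketched below. The definitions are standard ratios of response times, offered here as an assumption about how such results are usually summarized, not as the presenters' exact formulas; the function names are hypothetical and any timings passed in would come from actual measurements.

```python
def speedup(time_base, time_more_nodes):
    """Same data, more nodes: how many times faster the larger system is."""
    return time_base / time_more_nodes


def efficiency(time_base, time_more_nodes, nodes_base, nodes_more):
    """Speedup divided by the factor by which the node count grew (1.0 = ideal)."""
    return speedup(time_base, time_more_nodes) / (nodes_more / nodes_base)


def scaleup(time_base, time_scaled):
    """Data and nodes grow together: 1.0 means response time stayed constant."""
    return time_base / time_scaled


def sizeup(time_base, time_more_data):
    """Same nodes, more data: compare this ratio with the data growth factor.

    A ratio below the growth factor means response time grew more slowly
    than the database did.
    """
    return time_more_data / time_base
```

For example, if a 16x larger database on the same nodes took only 14x as long, `sizeup` would return 14.0, below the growth factor of 16.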
SCALE UP
CD scales linearly: it is able to keep the response time almost constant as the database and multiprocessor sizes increase.
Reasons: The itemsets found by CD don't change as the database size increases, so the number of candidates whose support must be summed in the communication phase remains constant
SIZE UP
CD shows sublinear performance: the program is actually more efficient as the database size increases.
More efficient: increasing size of database -> more I/O and transaction processing -> smaller portion of time spent in communication
SPEED UP
CD has very good speedup performance, up to 8 processors.
Larger datasets show better speedup characteristics. The more data processed per node, the less significant the communication time becomes
- Count Distribution attempts to minimize communication by replicating the candidate sets in each processor's memory
- Data Distribution maximizes the use of aggregate memory by allowing each processor to work with the entire dataset but only a portion of the candidate set
- Candidate Distribution eliminates the synchronization costs at the end of every pass and maximizes the use of aggregate memory while limiting heavy communication to a single redistribution pass
- Count Distribution exhibited linear scale-up and excellent speed-up and size-up behaviour
- Data Distribution lost out because of the cost of broadcasting local data from each processor to every other processor, and Candidate Distribution lost because of the cost of data redistribution.
- Not all problems require an intricate