Title: Distributed Data Mining
1Distributed Data Mining
ACAI05/SEKT05 ADVANCED COURSE ON KNOWLEDGE
DISCOVERY
- Dr. Giuseppe Di Fatta
- University of Konstanz (Germany)
- and ICAR-CNR, Palermo (Italy)
- 5 July, 2005
- Email fatta_at_inf.uni-konstanz.de,
difatta_at_pa.icar.cnr.it
2Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Conclusions
3Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Moore's Law
- Processing
- Memory
- Communication
- Supercomputers
4Units of HPC
- Processing
- 1 Mflop/s = 1 Megaflop/s = 10^6 Flop/sec
- 1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/sec
- 1 Tflop/s = 1 Teraflop/s = 10^12 Flop/sec
- 1 Pflop/s = 1 Petaflop/s = 10^15 Flop/sec
- Memory
- 1 MB = 1 Megabyte = 10^6 Bytes
- 1 GB = 1 Gigabyte = 10^9 Bytes
- 1 TB = 1 Terabyte = 10^12 Bytes
- 1 PB = 1 Petabyte = 10^15 Bytes
5How far did we go?
6Technology Limits
- Consider a 1 Tflop/s, 1 TB sequential machine:
- data must travel some distance, r, to get from memory to the CPU
- to get 1 data element per cycle, data must move 10^12 times per second; at the speed of light c = 3x10^8 m/s, this means r < c/10^12 = 0.3 mm
- Now put 1 TB of storage in a 0.3 mm x 0.3 mm area:
- each word occupies about 3 square Angstroms, the size of a small atom
7Moore's Law (1965)
- Gordon Moore
- (co-founder of Intel)
The complexity for minimum component costs has
increased at a rate of roughly a factor of two
per year. Certainly over the short term this rate
can be expected to continue, if not to increase.
Over the longer term, the rate of increase is a
bit more uncertain, although there is no reason
to believe it will not remain nearly constant for
at least 10 years. That means by 1975, the number
of components per integrated circuit for minimum
cost will be 65,000.
8Moore's Law (1975)
- In 1975, Moore refined his law
- circuit complexity doubles every 18 months.
- So far it holds for CPUs and DRAMs!
- Extrapolation for computing power at a given
cost and semiconductor revenues.
9Technology Trend
10Technology Trend
11Technology Trend
- Processors issue instructions roughly every nanosecond.
- DRAM can be accessed roughly every 100 nanoseconds.
- DRAM cannot keep processors busy! And the gap is growing:
- processors are getting faster by 60% per year
- DRAM is getting faster by 7% per year
12Memory Hierarchy
- Most programs have a high degree of locality in their accesses
- spatial locality: accessing things near previous accesses
- temporal locality: reusing an item that was previously accessed
- The memory hierarchy tries to exploit locality.
13Memory Latency
- Hiding memory latency
- temporal and spatial locality (caching)
- multithreading
- prefetching
14Communication
- Topology
- The manner in which the nodes are connected.
- The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a bus, grid, torus, or hypercube.
- Latency
- How long does it take to start sending a "message"? Measured in microseconds.
- (Also in processors: how long does it take to output the results of some pipelined operations, such as floating point add, divide, etc.?)
- Bandwidth
- What data rate can be sustained once the message is started? Measured in Mbytes/sec.
15Networking Trend
- System interconnection network
- bus, crossbar, array, mesh, tree
- static, dynamic
- LAN/WAN
16LAN/WAN
- 1st network connection in 1969: 50 kbps
- At about 10:30 PM on October 29, 1969, the first ARPANET connection was established between UCLA and SRI over a 50 kbps line provided by the AT&T telephone company.
- "At the UCLA end, they typed in the 'l' and asked SRI if they received it; 'got the l' came the voice reply. UCLA typed in the 'o', asked if they got it, and received 'got the o'. UCLA then typed in the 'g' and the darned system CRASHED! Quite a beginning. On the second attempt, it worked fine!" (Leonard Kleinrock)
- 10Base5 Ethernet in 1976, by Bob Metcalfe and David Boggs
- End of the 90s: 100 Mbps (Fast Ethernet) and 1 Gbps
- Bandwidth is not the whole story!
- Do not forget to consider delay and latency.
17Delay in packet-switched networks
- (1) Nodal processing
- check bit errors
- determine the output link
- (2) Queuing delay
- time waiting at the output link for transmission
- depends on the congestion level of the router
- (3) Transmission delay
- R = link bandwidth (bps)
- L = packet length (bits)
- time to send the bits into the link = L/R
- (4) Propagation delay
- d = length of the physical link
- s = propagation speed in the medium (~2x10^8 m/sec)
- propagation delay = d/s
Note: s and R are very different quantities!
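To make the magnitudes concrete, here is a small worked example combining the transmission and propagation components defined above; the link parameters are illustrative assumptions, not values from the slides.

```python
# Hedged sketch: per-hop delay components, transmission (L/R) and propagation (d/s).
# All values below are illustrative assumptions.
L = 1500 * 8          # packet length in bits (one Ethernet frame)
R = 100e6             # link bandwidth in bps (Fast Ethernet)
d = 1000e3            # link length in meters (1000 km)
s = 2e8               # propagation speed in the medium (m/s)

transmission_delay = L / R    # time to push all bits onto the link
propagation_delay = d / s     # time for one bit to cross the link

print(f"transmission: {transmission_delay*1e6:.1f} us")  # 120.0 us
print(f"propagation:  {propagation_delay*1e3:.1f} ms")   # 5.0 ms
```

Note how, on this link, propagation dominates transmission; which term dominates depends entirely on R, L, and d.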
18Latency
- How long does it take to start sending a "message"?
- Latency may be critical for parallel computing. Some LAN technologies provide high bandwidth and low latency.
19HPC Trend
- 20 years ago: Mflop/s (1x10^6 floating point ops/sec); scalar based.
- 10 years ago: Gflop/s (1x10^9 floating point ops/sec); vector and shared-memory computing; bandwidth aware, block partitioned, latency tolerant.
- Today: Tflop/s (1x10^12 floating point ops/sec); highly parallel, distributed processing, message passing, network based, data decomposition, communication/computation overlap.
- 5 years away: Pflop/s (1x10^15 floating point ops/sec); many more levels of memory hierarchy, combinations/grids; more adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes.
20TOP500 SuperComputers
21TOP500 SuperComputers
22IBM BlueGene/L
23Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
24Parallel and Distributed Systems
25Different Architectures
- Parallel computing
- single systems with many processors working on the same problem
- Distributed computing
- many systems loosely coupled by a scheduler to work on related problems
- Grid Computing (MetaComputing)
- many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
- Massively Parallel Processors (MPPs) continue to account for more than half of all installed high-performance computers worldwide (Top500 list).
- Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
- Nowadays, cluster systems are the fastest-growing part.
26Classification Control Model
- Flynn's Classical Taxonomy (1966)
- Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.
27SISD
Von Neumann Machine
- Single Instruction, Single Data
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, until recently, the most prevalent form of computer
- Examples: most PCs, single-CPU workstations and mainframes
28SIMD
- Single Instruction, Multiple Data
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
- Synchronous (lockstep) and deterministic execution
- Two varieties: Processor Arrays and Vector Pipelines
- Examples:
- Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
- Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
29MISD
- Multiple Instruction, Single Data
- Few actual examples of this class of parallel computer have ever existed.
- Some conceivable examples might be:
- multiple frequency filters operating on a single signal stream
- multiple cryptography algorithms attempting to crack a single coded message
30MIMD
- Multiple Instruction, Multiple Data
- Currently, the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
- Execution can be synchronous or asynchronous, deterministic or non-deterministic.
- Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs.
31Classification Communication Model
- Shared vs. Distributed Memory systems
32Shared Memory UMA vs. NUMA
33Distributed Memory MPPs vs. Clusters
- Processor-memory nodes are connected by some type of interconnect network
- Massively Parallel Processor (MPP): tightly integrated, single system image
- Cluster: individual computers connected by SW
(Figure: six processor/memory (P/M) nodes connected by an interconnect network)
34Distributed Shared-Memory
- Virtual shared memory (shared address space)
- on hardware level
- on software level
- Global address space spanning all of the memory
in the system.
- E.g., HPF, TreadMarks, software for NoWs (JavaParty, Manta, Jackal)
35Parallel vs. Distributed Computing
- Parallel computing usually considers dedicated homogeneous HPC systems to solve parallel problems.
- Distributed computing extends the parallel approach to heterogeneous general-purpose systems.
- Both look at the parallel formulation of a problem.
- But reliability, security, and heterogeneity are usually not considered in parallel computing, while they are considered in Grid computing.
- "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." (Leslie Lamport)
36Parallel and Distributed Computing
- Parallel computing
- Shared-Memory SIMD
- Distributed-Memory SIMD
- Shared-Memory MIMD
- Distributed-Memory MIMD
- Beyond DM-MIMD:
- Distributed computing and Clusters
- Beyond parallel and distributed computing:
- Metacomputing
SCALABILITY
37Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Programming models
- Problem decomposition
- Parallel programming issues
38Programming Paradigms
- Parallel Programming Models
- Control
- how is parallelism created
- what orderings exist between operations
- how do different threads of control synchronize
- Naming
- what data is private vs. shared
- how logically shared data is accessed or
communicated - Set of operations
- what are the basic operations
- what operations are considered to be atomic
- Cost
- how do we account for the cost of each of the
above
39Model 1 Shared Address Space
- Program consists of a collection of threads of control,
- each with a set of private variables
- e.g., local variables on the stack
- collectively with a set of shared variables
- e.g., static variables, shared common blocks, global heap
- Threads communicate implicitly by writing and reading shared variables
- Threads coordinate explicitly by synchronization operations on shared variables
- writing and reading flags
- locks, semaphores
- Like concurrent programming on a uniprocessor
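As an illustration of this model, here is a minimal sketch using Python threads; the shared counter and the lock play the roles of the shared variable and the explicit synchronization operation described above.

```python
# Minimal shared-address-space sketch: shared variable + explicit lock.
import threading

counter = 0                 # shared variable (global heap)
lock = threading.Lock()     # synchronization object

def worker(n):
    global counter
    local_sum = 0           # private variable (on this thread's stack)
    for i in range(n):
        local_sum += i
    with lock:              # explicit coordination on shared data
        counter += local_sum

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)              # 4 * sum(range(1000)) = 1998000
```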
40Model 2 Message Passing
- Program consists of a collection of named processes
- thread of control plus local address space
- local variables, static variables, common blocks, heap
- Processes communicate by explicit data transfers
- matching pair of send/receive by source and destination process
- Coordination is implicit in every communication event
- Logically shared data is partitioned over local processes
- Like distributed programming
- Program with standard libraries: MPI, PVM
- a.k.a. shared-nothing architecture, or a multicomputer
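A minimal sketch of the same idea with explicit message passing, assuming the mpi4py binding of the MPI standard mentioned above:

```python
# Minimal message-passing sketch with mpi4py.
# Run with, e.g.: mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=11)      # explicit send by the source process
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # matching receive by the destination
    print("rank 1 received:", data)
```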
41Model 3 Data Parallel
- A single sequential thread of control consisting of parallel operations
- Parallel operations applied to all (or a defined subset) of a data structure
- Communication is implicit in parallel operators and shifted data structures
- Elegant and easy to understand
- Not all problems fit this model
- Vector computing
42SIMD Machine
- An SIMD (Single Instruction Multiple Data) machine:
- A large number of small processors
- A single control processor issues each instruction
- each processor executes the same instruction
- some processors may be turned off on any instruction
- The machines are no longer popular (CM2), but the programming model is
- implemented by mapping n-fold parallelism to p processors
- mostly done in compilers (HPF, High Performance Fortran)
43Model 4 Hybrid
- Shared-memory machines (SMPs) are the fastest commodity machines. Why not build a larger machine by connecting many of them with a network?
- CLUMP = Cluster of SMPs
- Shared memory within one SMP, message passing outside
- Clusters, ASCI Red (Intel), ...
- Programming model?
- Treat the machine as flat and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
- Expose two layers: shared memory (OpenMP) and message passing (MPI); higher performance, but ugly to program
44Hybrid Systems
45Model 5 BSP
- Bulk Synchronous Processing (BSP) (L. Valiant, 1990)
- Used within the message passing or shared memory models as a programming convention
- Phases separated by global barriers
- Compute phases: all operate on local data (in distributed memory)
- or have read access to global data (in shared memory)
- Communication phases: all participate in rearrangement or reduction of global data
- Generally all processes do the same thing in a phase
- all do f, but may all do different things within f
- The simplicity of data parallelism without its restrictions
(Figure: a BSP superstep)
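The following is a hedged sketch of a BSP-style superstep loop on top of message passing (mpi4py is an assumption; any MPI binding would do): local computation, then a global communication phase, then a barrier.

```python
# BSP-style superstep sketch: compute on local data, exchange globally, barrier.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
local = np.full(4, rank, dtype="i")           # local data for this process

for step in range(3):                         # three supersteps
    local += 1                                # compute phase: local data only
    total = np.empty_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)  # communication phase (reduction)
    comm.Barrier()                            # barrier ends the superstep
```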
46Problem Decomposition
- Domain decomposition → data parallel
- Functional decomposition → task parallel
47Parallel Programming
- Directives-based data-parallel languages
- such as High Performance Fortran (HPF) or OpenMP
- Serial code is made parallel by adding directives (which appear as comments in the serial code) that tell the compiler how to distribute data and work across the processors.
- The details of how data distribution, computation, and communications are to be done are left to the compiler.
- Usually implemented on shared-memory architectures.
- Message Passing (e.g. MPI, PVM)
- a very flexible approach based on explicit message passing via library calls from standard programming languages
- It is left up to the programmer to explicitly divide data and work across the processors as well as manage the communications among them.
- Multi-threading in distributed environments
- Parallelism is transparent to the programmer
- Shared-memory or distributed shared-memory systems
48Parallel Programming Issues
- The main goal of a parallel program is to get better performance than the serial version.
- Performance evaluation
- Important issues to take into account
- Load balancing
- Minimizing communication
- Overlapping communication and computation
49Speedup
- Serial fraction fs: the fraction of the execution that must run sequentially
- Parallel fraction fp = 1 - fs: the fraction that can run in parallel
- Speedup: S(p) = Ts / Tp, the ratio of serial to parallel execution time
- Superlinear speedup (S(p) > p) is, in general, impossible, but it may arise in two cases:
- memory hierarchy phenomena
- search algorithms
50Maximum Speedup
- Amdahl's Law states that the potential program speedup is defined by the fraction of code (fp) which can be parallelized: S(p) = 1 / (fs + fp/p).
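A small sketch evaluating Amdahl's bound for a few values of fp (the values match the next slide):

```python
# Amdahl's law: S(p) = 1 / (fs + fp/p), bounded above by 1/fs.
def amdahl_speedup(fp: float, p: int) -> float:
    fs = 1.0 - fp                  # serial fraction
    return 1.0 / (fs + fp / p)

for fp in (0.50, 0.90, 0.99):
    print(f"fp={fp:.2f}:",
          [round(amdahl_speedup(fp, p), 1) for p in (10, 100, 1000)])
# fp=0.50: [1.8, 2.0, 2.0]    -> speedup capped at 1/fs = 2
# fp=0.90: [5.3, 9.2, 9.9]    -> capped at 10
# fp=0.99: [9.2, 50.3, 91.0]  -> capped at 100
```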
51Maximum Speedup
- There are limits to the scalability of parallelism.
- For example, fp = 0.50, 0.90 and 0.99 means 50%, 90% and 99% of the code is parallelizable.
- However, certain problems demonstrate increased performance by increasing the problem size. Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time.
- fs and fp may not be static
52Efficiency
- Given the parallel cost C = p * Tp,
- Efficiency: E = S / p = Ts / (p * Tp)
- In general, the total overhead To = p * Tp - Ts is an increasing function of p, at least linear when fs > 0, due to:
- communication,
- extra computation,
- idle periods due to sequential components,
- idle periods due to load imbalance.
53Cost-optimality of Parallel Systems
- A parallel system is composed of a parallel algorithm and a parallel computational platform.
- A parallel system is cost-optimal if the cost of solving a problem has the same asymptotic growth (in Θ terms, as a function of the input size W) as the fastest known sequential algorithm.
- As a consequence, p * Tp = Θ(Ts), i.e. the efficiency of a cost-optimal system is E = Θ(1).
54Isoefficiency
- For a given problem size, when we increase the number of PEs, the speedup and the efficiency decrease.
- How much do we need to increase the problem size to keep the efficiency constant?
- Isoefficiency is a metric for scalability.
- In general, as the problem size increases, the efficiency increases, while keeping the number of processors constant.
- In a scalable parallel system, when increasing the number of PEs, the efficiency can be kept constant by increasing the problem size.
- Of course, for different problems, the rate at which W must be increased may vary. This rate determines the degree of scalability of the system.
55Sources of Parallel Overhead
- Total parallel overhead: To = p * Tp - Ts
- INTERPROCESSOR COMMUNICATION
- If each PE spends Tcomm time on communication, then the overhead increases by p * Tcomm.
- LOAD IMBALANCE
- If it exists, some PEs will be idle while others are busy. The idle time of any PE contributes to the overhead time.
- Load imbalance always occurs if there is a strictly sequential component of the algorithm.
- Load imbalance often occurs at the end of the run due to asynchronous termination (e.g. in coarse-grain parallelism).
- EXTRA COMPUTATION
- A parallel version of the fastest sequential algorithm may not be straightforward. Additional computation may be needed in the parallel algorithm; this also contributes to the overhead time.
56Load Balancing
- Load balancing is the task of equally dividing the work among the available processes.
- A range of load balancing problems is determined by:
- Task costs
- Task dependencies
- Locality needs
- There is a spectrum of solutions from static to dynamic.
- A closely related problem is scheduling, which determines the order in which tasks run.
57Different Load Balancing Problems
- Load balancing problems differ in:
- Task costs
- Do all tasks have equal costs?
- If not, when are the costs known? Before starting, when a task is created, or only when it ends?
- Task dependencies
- Can all tasks be run in any order (including in parallel)?
- If not, when are the dependencies known? Before starting, when a task is created, or only when it ends?
- Locality
- Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?
- When is the information about communication between tasks known?
58Task cost
59Task Dependency
(e.g. data/control dependencies at end/beginning
of task executions)
60Task Locality
(e.g. data/control dependencies during task
executions)
61Spectrum of Solutions
- Static scheduling: all information is available to the scheduling algorithm, which runs before any real computation starts (offline algorithms).
- Semi-static scheduling: information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.
- Dynamic scheduling: information is not known until mid-execution (online algorithms).
62LB Approaches
- Static load balancing
- Semi-static load balancing
- Self-scheduling (manager-workers)
- Distributed task queues
- Diffusion-based load balancing
- DAG scheduling (graph partitioning is
NP-complete) - Mixed Parallelism
63Distributed and Dynamic LB
- Dynamic load balancing algorithms, a.k.a. work stealing/donating
- Basic idea, when applied to search trees:
- Each processor performs search on a disjoint part of the tree
- When finished, it gets work from a processor that is still busy
- Requires asynchronous communication
(Figure: idle/busy state machine; an idle processor selects a processor and requests work, servicing pending messages while it waits; if no work is found it tries another donor; once it gets work, it does a fixed amount of work at a time, servicing pending messages, until finished and no available work remains)
64Selecting a Donor
- Basic distributed algorithms:
- Asynchronous Round Robin (ARR)
- Each processor k keeps a variable target_k
- When a processor runs out of work, it requests work from target_k
- Then it sets target_k = (target_k + 1) mod #procs
- Nearest Neighbor (NN)
- Round robin over neighbors
- Takes topology into account (as diffusive techniques do)
- Load balancing somewhat slower than randomized approaches
- Global Round Robin (GRR)
- Processor 0 keeps a single variable target
- When a processor needs work, it reads target and requests work from that processor
- P0 increments target (mod #procs) with each access
- Random polling/stealing
- When a processor needs work, it selects a random processor and requests work from it
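A hedged sketch of these donor-selection policies; the function names and the representation of the targets are illustrative, not from the original systems.

```python
# Donor-selection sketch: each function returns the rank to ask for work next.
import random

def arr_next(target_k: int, p: int) -> tuple[int, int]:
    """Asynchronous Round Robin: each processor cycles through its own target."""
    donor = target_k
    return donor, (target_k + 1) % p          # donor, updated target_k

def grr_next(global_target: list, p: int) -> int:
    """Global Round Robin: a single shared counter held by processor 0."""
    donor = global_target[0]
    global_target[0] = (global_target[0] + 1) % p
    return donor

def random_polling(p: int, me: int) -> int:
    """Random polling: pick any processor other than ourselves (requires p > 1)."""
    donor = random.randrange(p - 1)
    return donor if donor < me else donor + 1
```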
65Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
66Knowledge Discovery in Databases
- Knowledge Discovery in Databases (KDD) is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
(Figure: the KDD process; operational databases are cleaned, collected and summarized into a data warehouse, data preparation yields training data, data mining produces models and patterns, which then undergo verification and evaluation)
67Origins of Data Mining
- KDD draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization.
- Prediction methods
- Use some variables to predict unknown or future values of other variables.
- Description methods
- Find human-interpretable patterns that describe the data.
- Traditional techniques may be unsuitable due to:
- the enormity of the data
- the high dimensionality of the data
- the heterogeneous, distributed nature of the data
68Speeding up Data Mining
- Data oriented approach
- Discretization
- Feature selection
- Feature construction (PCA)
- Sampling
- Methods oriented approach
- Efficient and scalable algorithms
69Speeding up Data Mining
- Methods oriented approach (contd.)
- Distributed and parallel data mining
- Task or control parallelism
- Data parallelism
- Hybrid parallelism
- Distributed data mining
- Voting
- Meta-learning, etc.
70Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
71What is Classification?
- Classification is the process of assigning new objects to predefined categories or classes.
- Given a set of labeled records:
- build a model (e.g. a decision tree)
- predict labels for future unlabeled records
72Classification learning
- Supervised learning (labels are known)
- Examples are described in terms of attributes
- categorical (unordered symbolic values)
- numeric (integers, reals)
- Class (output/predicted attribute)
- categorical for classification
- numeric for regression
73Classification learning
- Training set
- a set of examples, where each example is a feature vector (i.e., a set of <attribute, value> pairs) with its associated class. The model is built on this set.
- Test set
- a set of examples disjoint from the training set, used for testing the accuracy of a model.
74Classification Example
(Figure: a training set with two categorical attributes, one continuous attribute, and a class label; a classifier is learned from the training set)
75Classification Models
- Some models are better than others in:
- accuracy
- understandability
- Models range from easy to understand to incomprehensible (roughly from easier to harder):
- Decision trees
- Rule induction
- Regression models
- Genetic algorithms
- Bayesian networks
- Neural networks
76Decision Trees
- Decision tree models are better suited for data mining:
- inexpensive to construct
- easy to interpret
- easy to integrate with database systems
- comparable or better accuracy in many applications
77Decision Trees Example
(Figure: decision tree learned from the example data. Splitting attributes: Refund = Yes -> NO; Refund = No -> test MarSt; MarSt = Married -> NO; MarSt = Single or Divorced -> test TaxInc; TaxInc < 80K -> NO; TaxInc >= 80K -> YES)
The splitting attribute at a node is determined based on the Gini index.
78From Tree to Rules
1) Refund = Yes -> NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K -> NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K -> YES
4) Refund = No and MarSt in {Married} -> NO
(Figure: the same decision tree as on the previous slide)
79Decision Trees Sequential Algorithms
- Many algorithms:
- Hunt's algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
- General structure:
- Tree induction
- Tree pruning
80Classification algorithm
- Build tree
- Start with the data at the root node
- Select an attribute and formulate a logical test on that attribute
- Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have examples from a single class, or nearly pure, i.e., the majority of examples are from the same class
- Prune tree
- Remove subtrees that do not improve classification accuracy
- Avoid over-fitting, i.e., training-set-specific artifacts
81Build tree
- Evaluate split points for all attributes
- Select the best point and the "winning" attribute
- Split the data into two
- Breadth-first or depth-first construction
- CRITICAL STEPS:
- formulation of good split tests
- selection measure for attributes
82How to capture good splits?
- Occam's razor: prefer the simplest hypothesis that fits the data
- Minimum message/description length:
- dataset D
- hypotheses H1, H2, ..., Hx describing D
- MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
- pick the Hk with minimum MML
- Mlength given by Gini index, Gain, etc.
83Tree pruning using MDL
- Data encoding: sum of classification errors
- Model encoding:
- encode the tree structure
- encode the split points
- Pruning: choose the smallest-length option
- convert to leaf
- prune left or right child
- do nothing
84Hunt's Method
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
- Class: Cheat, Don't Cheat
(Figure: the tree grows by successive refinement, starting from a single leaf labeled Don't Cheat)
85What's really happening?
(Figure: the training examples plotted in the Marital Status vs. Income plane; the tree's splits, e.g. Married and Income < 80K, partition the plane into Cheat / Don't Cheat regions)
86Finding good split points
- Use the Gini index for partition purity: Gini(S) = 1 - sum_i p(i)^2
- where p(i) = frequency of class i in the node
- If S is pure, Gini(S) = 1 - 1 = 0
- Find the split point with minimum Gini
- Only the class distributions are needed
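A small sketch of Gini-based split evaluation for a numeric attribute; the data values are illustrative, and the two-way split mirrors the TaxInc < 80K test from the running example.

```python
# Gini index and weighted Gini of a two-way split on a numeric attribute.
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p(i)^2 over the class frequencies in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted Gini of the partition induced by value < threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

income = [60, 70, 75, 85, 90, 95, 100, 120]        # illustrative data
cheat = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
print(gini_split(income, cheat, 80))               # candidate split TaxInc < 80K
```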
87Finding good split points
(Figure: two candidate splits of the Marital Status vs. Income plane, with Gini(split) = 0.34 and Gini(split) = 0.31; the split with the lower Gini is preferred)
88Categorical Attributes Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
- Two-way split (find the best partition of values) or multi-way split
89Decision Trees Parallel Algorithms
- Approaches for categorical attributes:
- Synchronous Tree Construction (data parallel)
- no data movement required
- high communication cost as the tree becomes bushy
- Partitioned Tree Construction (task parallel)
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
- Hybrid Algorithm
- combines the good features of the two approaches
- adapts dynamically according to the size and shape of the trees
90Synchronous Tree Construction
- Partitioning of data only
- a global reduction per tree node is required
- a large number of classification tree nodes gives high communication cost
(Figure: n records with m categorical attributes, partitioned row-wise across processors)
91Partitioned Tree Construction
- Partitioning of classification tree nodes
- natural concurrency
- load imbalance, as the amount of work associated with each node varies
- child nodes use the same data as the parent node
- loss of locality
- high data movement cost
92Synchronous Tree Construction
Partition Data Across Processors
- No data movement is required
- Load imbalance
- can be eliminated by breadth-first expansion
- High communication cost
- becomes too high in the lower parts of the tree
93Partitioned Tree Construction
Partition Data and Nodes
- Highly concurrent
- High communication cost due to excessive data movement
- Load imbalance
94Hybrid Parallel Formulation
(Figure: the hybrid formulation switches from synchronous to partitioned tree construction as the tree grows)
95Load Balancing
96Switch Criterion
- Switch to Partitioned Tree Construction when the switching criterion (a cost comparison omitted in these slides) is met
- The splitting criterion ensures balanced partitions
- Reference: "Parallel Formulations of Decision-Tree Classification Algorithms", A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, pp. 237-261, September 1999.
97Speedup Comparison
(Figure: speedup curves for 0.8 million and 1.6 million training examples; the hybrid algorithm tracks the linear-speedup line most closely, followed by the partitioned and then the synchronous approach)
98Speedup of the Hybrid Algorithm with Different
Size Data Sets
99Scaleup of the Hybrid Algorithm
100Summary of Algorithms for Categorical Attributes
- Synchronous Tree Construction Approach
- no data movement required
- high communication cost as the tree becomes bushy
- Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
- Hybrid Algorithm
- combines the good features of the two approaches
- adapts dynamically according to the size and shape of the trees
101Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
102Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- data points in one cluster are more similar to one another
- data points in separate clusters are less similar to one another
103Clustering
- Given N k-dimensional feature vectors, find a meaningful partition of the N examples into c subsets or groups.
- Discover the labels automatically
- c may be given, or discovered
- Much more difficult than classification, since in the latter the groups are given, and we seek a compact description.
104Clustering Illustration
(Figure: k = 3 Euclidean-distance-based clustering in 3-D space; intracluster distances are minimized, intercluster distances are maximized)
105Clustering
- We have to define some notion of "similarity" between examples
- Similarity measures:
- Euclidean distance if attributes are continuous
- other problem-specific measures
- Goal: maximize intra-cluster similarity and minimize inter-cluster similarity
- Feature vectors may be:
- all numeric (well-defined distances)
- all categorical or mixed (harder to define similarity; geometric notions don't work)
106Clustering schemes
- Distance-based
- Numeric
- Euclidean distance (root of sum of squared differences along each dimension)
- angle between two vectors
- Categorical
- number of common features
- Partition-based
- enumerate partitions and score each
107Clustering schemes
- Model-based
- Estimate a density (e.g., a mixture of Gaussians)
- Go bump-hunting
- Compute P(feature vector i | cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering
108Before clustering
- Normalization
- Given three attributes:
- A in microseconds
- B in milliseconds
- C in seconds
- We can't treat differences as the same in all dimensions or attributes
- Need to scale or normalize for comparison
- Can assign weights for more importance
109The k-means algorithm
- Specify k, the number of clusters
- Guess k seed cluster centers
- 1) Look at each example and assign it to the center that is closest
- 2) Recalculate the centers
- Iterate on steps 1 and 2 until the centers converge, or for a fixed number of iterations
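A minimal serial k-means sketch with numpy, mirroring the two steps above; initialization by random sampling is an assumption, and empty clusters keep their previous center.

```python
# Serial k-means: assign each point to the closest center, recompute centers.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # seed centers
    for _ in range(iters):
        # step 1: distance of every point to every center, then closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 2: recalculate each center as the mean of its assigned points
        new_centers = np.array([points[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):    # centers converged
            break
        centers = new_centers
    return centers, assign
```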
110K-means algorithm
Initial seeds
111K-means algorithm
New centers
112K-means algorithm
Final centers
113Operations in k-means
- Main operation: calculate the distance to all k means or centroids
- Other operations:
- find the closest centroid for each point
- calculate the mean squared error (MSE) for all points
- recalculate the centroids
114Parallel k-means
- Divide the N points among the P processors
- Replicate the k centroids
- Each processor computes the distance of each local point to the centroids
- Assign points to the closest centroid and compute the local MSE
- Perform a reduction for the global centroids and the global MSE value
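A hedged mpi4py sketch of one parallel k-means iteration, as described above: local assignment, then a global reduction of the per-cluster sums, counts, and MSE.

```python
# One parallel k-means iteration: local work + global reduction (mpi4py).
from mpi4py import MPI
import numpy as np

def kmeans_step(local_points, centers, comm):
    k, dim = centers.shape
    dists = np.linalg.norm(local_points[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)            # closest centroid per local point

    # local partial results: per-cluster sums, counts, and squared error
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        members = local_points[assign == j]
        sums[j], counts[j] = members.sum(axis=0), len(members)
    local_mse = np.array([(dists[np.arange(len(assign)), assign] ** 2).sum()])

    # group communication: one global reduction per quantity
    g_sums, g_counts, g_mse = np.empty_like(sums), np.empty_like(counts), np.empty(1)
    comm.Allreduce(sums, g_sums, op=MPI.SUM)
    comm.Allreduce(counts, g_counts, op=MPI.SUM)
    comm.Allreduce(local_mse, g_mse, op=MPI.SUM)
    # (a production version would guard against empty clusters here)
    return g_sums / g_counts[:, None], g_mse[0]   # new centroids, global MSE
```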
115Serial and Parallel k-means
Group communication
116Serial k-means Complexity
One iteration over N points with k centroids in d dimensions costs O(N k d) distance computations.
117Parallel k-means Complexity
With P processors, one iteration costs O((N/P) k d) plus the cost of the global reduction, which depends on the physical communication topology, e.g. O(log P) steps in a hypercube.
118Speedup and Scaleup
Condition for linear speedup
Condition for linear scaleup (w.r.t. n)
119Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Frequent Itemset Mining
- Graph Mining
120ARM Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
121ARM Definition
- Given a set of items/attributes, and a set of objects containing a subset of the items,
- find rules "if I1 then I2" (support, confidence)
- I1, I2 are sets of items
- I1, I2 have sufficient support: P(I1 ∪ I2)
- the rule has sufficient confidence: P(I2 | I1)
122Association Mining
- The user specifies interestingness:
- minimum support (minsup)
- minimum confidence (minconf)
- Find all frequent itemsets (support >= minsup)
- exponential search space
- computation and I/O intensive
- Generate strong rules (confidence >= minconf)
- relatively cheap
123Association Rule Discovery: Support and Confidence
(Figure: an example transaction table with an association rule and its support and confidence)
124Handling Exponential Complexity
- Given n transactions and m different items:
- the number of possible association rules and the computation complexity are exponential in m
- Systematic search for all patterns, based on the support constraint:
- if {A,B} has support at least s, then both A and B have support at least s
- if either A or B has support less than s, then {A,B} has support less than s
- use patterns of k-1 items to find patterns of k items
125Apriori Principle
- Collect single-item counts; find large (frequent) items.
- Find candidate pairs, count them -> large pairs of items.
- Find candidate triplets, count them -> large triplets of items, and so on...
- Guiding principle: every subset of a frequent itemset has to be frequent.
- Used for pruning many candidates.
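A compact sketch of the Apriori level-wise loop described above; the transactions and the support threshold are illustrative.

```python
# Apriori sketch: level-wise candidate generation, subset-based pruning, counting.
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # level 1: frequent single items
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(freq), 2
    while freq:
        # join frequent (k-1)-itemsets into k-item candidates
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # count the support of the surviving candidates
        freq = {c for c in cands
                if sum(c <= t for t in transactions) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))
```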
126Illustrating Apriori Principle
(Figure: items (1-itemsets), pairs (2-itemsets) and triplets (3-itemsets) counted with minimum support = 3)
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 2 = 14 candidates.
127Counting Candidates
- Frequent itemsets are found by counting candidates.
- Simple way:
- search for each candidate in each transaction. Expensive!
(Figure: N transactions matched against M candidates)
128Association Rule Discovery: hash tree for fast access
(Figure: a candidate hash tree; the hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to another, and 3, 6, 9 to the third)
129Association Rule Discovery Subset Operation
(Figure: a transaction is hashed down the candidate hash tree)
130Association Rule Discovery Subset Operation
(contd.)
(Figure, contd.: the 3-item subsets of the transaction, e.g. {1,3,6}, {3,4,5}, {1,5,9}, are routed to the matching hash tree leaves)
131Parallel Formulation of Association Rules
- Large-scale problems have:
- huge transaction datasets (tens of TB)
- a large number of candidates
- Parallel approaches:
- partition the transaction database, or
- partition the candidates, or
- both
132Parallel Association Rules Count Distribution
(CD)
- Each processor has the complete candidate hash tree.
- Each processor updates its hash tree with local data.
- Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.
- Multiple database scans per iteration are required if the hash tree is too big for memory.
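A hedged sketch of the CD scheme with mpi4py: each process counts the full candidate set against its local N/p transactions, then one global reduction yields identical global counts everywhere. Representing candidates as frozensets (rather than a hash tree) is a simplification.

```python
# Count Distribution sketch: local counting + one global reduction of counts.
from mpi4py import MPI
import numpy as np

def count_distribution(local_transactions, candidates, comm):
    local_counts = np.zeros(len(candidates), dtype="i")
    for t in local_transactions:               # scan the local database
        for i, c in enumerate(candidates):
            if c <= t:                         # candidate contained in transaction
                local_counts[i] += 1
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return global_counts                       # identical on every process
```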
133CD Illustration
(Figure: processors P0, P1, P2 each hold N/p transactions and a full copy of the candidate hash tree; a global reduction combines the counts)
134Parallel Association Rules Data Distribution
(DD)
- The candidate set is partitioned among the processors.
- Once local data has been partitioned, it is broadcast to all other processors.
- High communication cost due to data movement.
- Redundant work due to multiple traversals of the hash trees.
135DD Illustration
(Figure: processors P0, P1, P2 each hold a disjoint partition of the candidates with their counts; local data is broadcast all-to-all so that every candidate meets every transaction)
136Parallel Association Rules Intelligent Data
Distribution (IDD)
- Data distribution using point-to-point communication.
- Intelligent partitioning of candidate sets:
- partitioning based on the first item of the candidates
- a bitmap keeps track of local candidate items
- pruning at the root of the candidate hash tree using the bitmap
- Suitable for a single data source, such as a database server.
- With smaller candidate sets, load balancing is difficult.
137IDD Illustration
(Figure: as in DD, but candidates are partitioned by their first item; each processor keeps a bitmask of its candidate items, and data moves by a point-to-point shift instead of an all-to-all broadcast)
138Filtering Transactions in IDD
(Figure: each transaction is filtered against the local bitmask before it is matched against the candidate hash tree)
139Parallel Association Rules Hybrid Distribution
(HD)
- The candidate set is partitioned into G groups so as to just fit in main memory.
- This ensures good load balance with a smaller candidate set.
- A logical G x P/G processor mesh is formed.
- Perform IDD along the column processors:
- data movement among processors is minimized.
- Perform CD along the row processors:
- a smaller number of processors takes part in the global reduction.
140HD Illustration
(Figure: a logical mesh with G groups of P/G processors; each processor holds N/P transactions, IDD is performed within each group and CD across groups)
141Parallel Association Rules Comments
- HD has shown the same linear speedup and sizeup behavior as CD.
- HD exploits the total aggregate main memory, while CD does not.
- IDD has much better scaleup behavior than DD.
142Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Frequent Subgraph Mining
143Graph Mining
- Market Basket Analysis
- Association Rule Mining (ARM) -> find frequent itemsets
- Search space:
- unstructured data; only the item type is important
- item set I, |I| = n -> power set P(I), |P(I)| = 2^n
- a pruning technique makes the search feasible
- Subset test:
- for each user transaction t and each candidate frequent itemset s, we need a subset test in order to compute the support (frequency).
- Molecular Compound Analysis
- Frequent Subgraph Mining (FSM) -> find frequent subgraphs
- Bigger search space:
- structured data; atom types are not sufficient, since atoms have bonds with other atoms
- Subgraph isomorphism test:
- for each graph and each candidate frequent subgraph we need a subgraph isomorphism test.
- N.B.: for general graphs, the subgraph isomorphism test is NP-complete.
144Molecular Fragment Lattice
(Figure: a lattice of molecular fragments with minSupp = 50%: single atoms C, O, S, N; one-bond fragments C-C, S-O, C-S, S-N, S=N; two-bond fragments C-S-N, C-S=N, C-C-S, C-S-O, N-S-O; then C-S-O plus N, up to C-C-S-N)
145Mining Molecular Fragments
- Frequent molecular fragments
- Frequent Subgraph Mining (FSM)
- Discriminative molecular fragments
- Molecular compounds are classified into:
- active compounds -> focus subset (F)
- inactive compounds -> complement subset (C)
- Problem definition:
- find all discriminative molecular fragments which are frequent in the set of the active compounds and not frequent among the inactive compounds, i.e. contrast substructures.
- User parameters:
- minSupp (minimum support in the focus dataset)
- maxSupp (maximum support in the complement dataset)
146Molecular Fragment Search Tree
A search tree node represents a molecular fragment. Successor nodes are generated by extending the fragment by one bond and, possibly, one atom.
147Large-Scale Issue
- Need for scalability in terms of:
- the number of molecules
- larger main and secondary memory to store the molecules
- fragments with longer lists of instances in molecules (embeddings)
- the size of molecules
- larger memory to store larger molecules
- fragments with longer lists of longer embeddings
- more fragments (a bigger search space)
- the minimum support
- with a lower support threshold the mining algorithm produces:
- more embeddings for each fragment
- more fragments and longer search tree branches
148High-Performance Distributed Approach
- Sequential algorithms cannot handle large-scale problems and the small user-parameter values needed for better quality results.
- Search space partitioning
- distributed implementation of backtracking
- external representation (enhanced SMILES)
- DB selection and projection
- Tree-based reduction
- Dynamic load balancing for irregular problems
- donor selection and work-splitting mechanism
- Peer-to-Peer computing
149Search Space Partitioning
150Search Space Partitioning
- A 4th kind of search-tree pruning: Distributed Computing pruning
- prune a search node, generate a new job, and assign it to an idle processor
- asynchronous communication and low overhead
- backtracking is particularly suitable for parallel processing because a subtree rooted at any node can be searched independently
151Tree-based Reduction
(Figure: a 3-D hypercube with p = 8 nodes)
Star reduction (master-slave): O(p). Tree reduction: O(log p).
152Job Assignment and Parallel Overheads
(Figure: timeline of a job assignment: a donor splits its stack, embeds the fragment into an external representation (enhanced SMILES), and sends it to an idle receiver, incurring some latency and delay)
- Parallel computing overheads: communication, excess computation, idling periods
- The 1st job assignment, subsequent job assignments, and termination detection are overlapped with useful computation
- DB selection and projection
- DLB for irregular problems
153Parallel Execution Analysis
(Figure: worker execution timeline: setup, idle1, jobs, idle2, idle3)
- setup: configuration message, DB loading
- idle1: waiting for the first job assignment, due to the initial sequential part
- idle2: processor starvation
- idle3: idle period due to load imbalance
- jobs: job processing time (including computational overhead)
(Figure: single job execution: data, prep, mining)
- data: data preprocessing
- prep: prepare the root search node (embed the core fragment)
- mining: data mining processing (useful work)
154Load Balancing
- The search space is not known a priori and is very irregular.
- Dynamic load balancing:
- receiver-initiated approach
- donor selection
- work-splitting mechanism
- The DLB determines the overall performance and efficiency.
155Highly Irregular Problem
(Figure: distributions of search tree node visit time (subtree visit) and node expand time (node extension); both follow a power-law distribution)
156Work Splitting
A search tree node n can be donated only if:
1) stackSize() > minStackSize,
2) support(n) > (1 + α) * minSupp, and
3) lxa(n) < β * atomCount(n).
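A hedged sketch of this donation test; alpha, beta, and the node accessors are illustrative stand-ins for the actual implementation's data structures.

```python
# Work-splitting predicate sketch: may search tree node `node` be donated?
def can_donate(node, stack_size, min_stack_size, min_supp,
               alpha=0.1, beta=0.5):          # alpha/beta: illustrative values
    return (stack_size > min_stack_size           # enough local work remains
            and node.support > (1 + alpha) * min_supp   # subtree not about to be pruned
            and node.lxa < beta * node.atom_count)      # fragment still extensible
```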
157Dynamic Load Balancing
- Receiver-initiated approaches:
- Random Polling (RP): excellent scalability
- Scheduler-based (MS): close to an optimal solution
- Quasi-Random Polling (QRP) combines the advantages of both
- QRP policy:
- a global list of potential donors, sorted w.r.t. their running time
- the server collects job statistics
- each receiver periodically gets an updated donor list
- P2P computing framework
- a receiver selects a random donor according to a probability distribution decreasing with the donor's rank in the list
- high probability of choosing long-running jobs
158Issues
- Penalty for global synchronization
- Adaptive application
- Asynchronous communication
- Highly irregular problem
- Difficulty in predicting work loads
- Heavy work loads may delay message processing
- Large-scale multi-domain heterogeneous computing environments
- Network latency and delay tolerance
159Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Conclusions
160Large-scale Parallel KDD Systems
- Data
- Terabyte-sized datasets
- Centralized or distributed datasets
- Incremental changes (refine knowledge as data
changes)
- Heterogeneous data sources
161Large-scale Parallel KDD Systems
- Software
- Pre-processing, mining, post-processing
- Interactive (anytime mining)
- Modular (rapid development)
- Web services
- Workflow management tool integration
- Fault and latency tolerant
- Highly scalable
162Large-scale Parallel KDD Systems
- Computing Infrastructure
- Clusters already widespread
- Multi-domain heterogeneous
- Data and computational Grids
- Dynamic resource aggregation (P2P)
- Self-managing
163Research Directions
- Fast algorithms for different mining tasks
- Classification, clustering, associations, etc.
- Incorporating concept hierarchies
- Parallelism and scalability
- Millions of records
- Thousands of attributes/dimensions
- Single-pass algorithms
- Sampling
- Parallel I/O and file systems
164Research Directions (contd.)
- Parallel ensemble learning
- parallel execution of different data mining algorithms and techniques that can be integrated to obtain a better model
- Not just high performance but also high accuracy
165Research Directions (contd.)
- Tight database integration
- Push common primitives inside the DBMS
- Use multiple tables
- Use efficient indexing techniques
- Caching strategies for sequences of data mining operations
- Data mining query languages and parallel query optimization
166Research Directions (contd.)
- Understandability: too many patterns
- Incorporate background knowledge
- Integrate constraints
- Meta-level mining
- Visualization, exploration
- Usability: build a complete system
- Pre-processing, mining, post-processing, persistent management of mined results
167Conclusions
- Data mining is a rapidly growing field
- Fueled by enormous data collection rates and the need for intelligent analysis for business and scientific gains.
- Large and high-dimensional data requires new analysis techniques and algorithms.
- High-performance distributed computing is becoming an essential component in data mining and data exploration.
- Many research and commercial opportunities.
168Resources
- Workshops
- IEEE IPDPS Workshop on Parallel and Distributed Data Mining
- HiPC Special Session on Large-Scale Data Mining
- ACM SIGKDD Workshop on Distributed Data Mining
- IEEE IPDPS Workshop on High Performance Data Mining
- ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems
- IEEE IPPS Workshop on High Performance Data Mining
- LifeDDM, Distributed Data Mining in Life Science
- Books
- A. Freitas and S. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, Boston, MA, 1998.
- M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.
- H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Summer 2000.
169References
- Journal Special Issues
- P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
- Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and Knowledge Discovery: An International Journal.