Title: Distributed Data Mining
1Distributed Data Mining
ACAI05/SEKT05 ADVANCED COURSE ON KNOWLEDGE
DISCOVERY
- Dr. Giuseppe Di Fatta
- University of Konstanz (Germany)
- and ICAR-CNR, Palermo (Italy)
- 5 July, 2005
- Email fatta_at_inf.uni-konstanz.de,
difatta_at_pa.icar.cnr.it
2Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Conclusions
3Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Moore's Law
- Processing
- Memory
- Communication
- Supercomputers
4Units of HPC
- Processing
- 1 Mflop/s = 1 Megaflop/s = 10^6 Flop/sec
- 1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/sec
- 1 Tflop/s = 1 Teraflop/s = 10^12 Flop/sec
- 1 Pflop/s = 1 Petaflop/s = 10^15 Flop/sec
- Memory
- 1 MB = 1 Megabyte = 10^6 Bytes
- 1 GB = 1 Gigabyte = 10^9 Bytes
- 1 TB = 1 Terabyte = 10^12 Bytes
- 1 PB = 1 Petabyte = 10^15 Bytes
5How far did we go?
6Technology Limits
- Consider a 1 Tflop/s, 1 TB sequential machine:
- data must travel some distance, r, to get from memory to the CPU
- to get 1 data element per cycle, data must move 10^12 times per second; at the speed of light c = 3x10^8 m/s, this means r < c/10^12 = 0.3 mm
- Now put 1 TB of storage in a 0.3 mm x 0.3 mm area:
- each word occupies about 3 square Angstroms, the size of a small atom
7Moore's Law (1965)
- Gordon Moore
- (co-founder of Intel)
The complexity for minimum component costs has
increased at a rate of roughly a factor of two
per year. Certainly over the short term this rate
can be expected to continue, if not to increase.
Over the longer term, the rate of increase is a
bit more uncertain, although there is no reason
to believe it will not remain nearly constant for
at least 10 years. That means by 1975, the number
of components per integrated circuit for minimum
cost will be 65,000.
8Moore's Law (1975)
- In 1975, Moore refined his law
- circuit complexity doubles every 18 months.
- So far it holds for CPUs and DRAMs!
- Extrapolation for computing power at a given
cost and semiconductor revenues.
9Technology Trend
10Technology Trend
11Technology Trend
- Processors issue instructions roughly every nanosecond.
- DRAM can be accessed roughly every 100 nanoseconds.
- DRAM cannot keep processors busy! And the gap is growing:
- processors are getting faster by 60% per year
- DRAM is getting faster by 7% per year
12Memory Hierarchy
- Most programs have a high degree of locality in their accesses
- spatial locality: accessing things near previous accesses
- temporal locality: reusing an item that was previously accessed
- The memory hierarchy tries to exploit locality.
13Memory Latency
- Hiding memory latency
- temporal and spatial locality (caching)
- multithreading
- prefetching
14Communication
- Topology
- The manner in which the nodes are connected.
- The best choice would be a fully connected network (every processor to every other), but this is infeasible for cost and scaling reasons. Instead, processors are arranged in some variation of a bus, grid, torus, or hypercube.
- Latency
- How long does it take to start sending a "message"? Measured in microseconds.
- (Also in processors: how long does it take to output the results of some pipelined operations, such as floating point add, divide, etc.?)
- Bandwidth
- What data rate can be sustained once the message is started? Measured in Mbytes/sec.
15Networking Trend
- System interconnection network
- bus, crossbar, array, mesh, tree
- static, dynamic
- LAN/WAN
16LAN/WAN
- 1st network connection in 1969: 50 kbps
- At about 10:30 PM on October 29, 1969, the first ARPANET connection was established between UCLA and SRI over a 50 kbps line provided by the AT&T telephone company.
- "At the UCLA end, they typed in the 'l' and asked SRI if they received it; 'got the l' came the voice reply. UCLA typed in the 'o', asked if they got it, and received 'got the o'. UCLA then typed in the 'g' and the darned system CRASHED! Quite a beginning. On the second attempt, it worked fine!" (Leonard Kleinrock)
- 10Base5 Ethernet in 1976, by Bob Metcalfe and David Boggs
- End of the 90s: 100 Mbps (Fast Ethernet) and 1 Gbps
- Bandwidth is not the whole story!
- Do not forget to consider delay and latency.
17Delay in packet-switched networks
- (1) Nodal processing
- check bit errors
- determine the output link
- (2) Queuing delay
- time waiting at the output link for transmission
- depends on the congestion level of the router
- (3) Transmission delay
- R = link bandwidth (bps)
- L = packet length (bits)
- time to send the bits into the link = L/R
- (4) Propagation delay
- d = length of the physical link
- s = propagation speed in the medium (~2x10^8 m/sec)
- propagation delay = d/s
Note: s and R are very different quantities!
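To make the magnitudes concrete, here is a small worked example combining the transmission and propagation components defined above; the link parameters are illustrative assumptions, not values from the slides.

```python
# Hedged sketch: per-hop delay components, transmission (L/R) and propagation (d/s).
# All values below are illustrative assumptions.
L = 1500 * 8          # packet length in bits (one Ethernet frame)
R = 100e6             # link bandwidth in bps (Fast Ethernet)
d = 1000e3            # link length in meters (1000 km)
s = 2e8               # propagation speed in the medium (m/s)

transmission_delay = L / R    # time to push all bits onto the link
propagation_delay = d / s     # time for one bit to cross the link

print(f"transmission: {transmission_delay*1e6:.1f} us")  # 120.0 us
print(f"propagation:  {propagation_delay*1e3:.1f} ms")   # 5.0 ms
```

Note how, on this link, propagation dominates transmission; which term dominates depends entirely on R, L, and d.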
18Latency
- How long does it take to start sending a "message"?
- Latency may be critical for parallel computing. Some LAN technologies provide high bandwidth and low latency.
19HPC Trend
- 20 years ago: Mflop/s (1x10^6 floating point ops/sec); scalar based.
- 10 years ago: Gflop/s (1x10^9 floating point ops/sec); vector and shared-memory computing; bandwidth aware, block partitioned, latency tolerant.
- Today: Tflop/s (1x10^12 floating point ops/sec); highly parallel, distributed processing, message passing, network based, data decomposition, communication/computation overlap.
- 5 years away: Pflop/s (1x10^15 floating point ops/sec); many more levels of memory hierarchy, combinations/grids; more adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes.
20TOP500 SuperComputers
21TOP500 SuperComputers
22IBM BlueGene/L
23Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
24Parallel and Distributed Systems
25Different Architectures
- Parallel computing
- single systems with many processors working on the same problem
- Distributed computing
- many systems loosely coupled by a scheduler to work on related problems
- Grid Computing (MetaComputing)
- many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems
- Massively Parallel Processors (MPPs) continue to account for more than half of all installed high-performance computers worldwide (Top500 list).
- Microprocessor-based supercomputers have brought a major change in accessibility and affordability.
- Nowadays, cluster systems are the fastest-growing part.
26Classification Control Model
- Flynn's Classical Taxonomy (1966)
- Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data. Each of these dimensions can have only one of two possible states: Single or Multiple.
27SISD
Von Neumann Machine
- Single Instruction, Single Data
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, until recently, the most prevalent form of computer
- Examples: most PCs, single-CPU workstations and mainframes
28SIMD
- Single Instruction, Multiple Data
- Single instruction: all processing units execute the same instruction at any given clock cycle.
- Multiple data: each processing unit can operate on a different data element.
- This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units.
- Best suited for specialized problems characterized by a high degree of regularity, such as image processing.
- Synchronous (lockstep) and deterministic execution
- Two varieties: Processor Arrays and Vector Pipelines
- Examples:
- Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
- Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820
29MISD
- Multiple Instruction, Single Data
- Few actual examples of this class of parallel computer have ever existed.
- Some conceivable examples might be:
- multiple frequency filters operating on a single signal stream
- multiple cryptography algorithms attempting to crack a single coded message
30MIMD
- Multiple Instruction, Multiple Data
- Currently, the most common type of parallel computer
- Multiple instruction: every processor may be executing a different instruction stream.
- Multiple data: every processor may be working with a different data stream.
- Execution can be synchronous or asynchronous, deterministic or non-deterministic.
- Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers, including some types of PCs.
31Classification Communication Model
- Shared vs. Distributed Memory systems
32Shared Memory UMA vs. NUMA
33Distributed Memory MPPs vs. Clusters
- Processor-memory nodes are connected by some type of interconnect network
- Massively Parallel Processor (MPP): tightly integrated, single system image
- Cluster: individual computers connected by SW
(Figure: six processor/memory (P/M) nodes connected by an interconnect network)
34Distributed Shared-Memory
- Virtual shared memory (shared address space)
- on hardware level
- on software level
- Global address space spanning all of the memory
in the system.
- E.g., HPF, TreadMarks, software for NoWs (JavaParty, Manta, Jackal)
35Parallel vs. Distributed Computing
- Parallel computing usually considers dedicated homogeneous HPC systems to solve parallel problems.
- Distributed computing extends the parallel approach to heterogeneous general-purpose systems.
- Both look at the parallel formulation of a problem.
- But reliability, security, and heterogeneity are usually not considered in parallel computing, while they are considered in Grid computing.
- "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable." (Leslie Lamport)
36Parallel and Distributed Computing
- Parallel computing
- Shared-Memory SIMD
- Distributed-Memory SIMD
- Shared-Memory MIMD
- Distributed-Memory MIMD
- Beyond DM-MIMD:
- Distributed computing and Clusters
- Beyond parallel and distributed computing:
- Metacomputing
SCALABILITY
37Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Programming models
- Problem decomposition
- Parallel programming issues
38Programming Paradigms
- Parallel Programming Models
- Control
- how is parallelism created
- what orderings exist between operations
- how do different threads of control synchronize
- Naming
- what data is private vs. shared
- how logically shared data is accessed or
communicated - Set of operations
- what are the basic operations
- what operations are considered to be atomic
- Cost
- how do we account for the cost of each of the
above
39Model 1 Shared Address Space
- Program consists of a collection of threads of control,
- each with a set of private variables
- e.g., local variables on the stack
- collectively with a set of shared variables
- e.g., static variables, shared common blocks, global heap
- Threads communicate implicitly by writing and reading shared variables
- Threads coordinate explicitly by synchronization operations on shared variables
- writing and reading flags
- locks, semaphores
- Like concurrent programming on a uniprocessor
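As an illustration of this model, here is a minimal sketch using Python threads; the shared counter and the lock play the roles of the shared variable and the explicit synchronization operation described above.

```python
# Minimal shared-address-space sketch: shared variable + explicit lock.
import threading

counter = 0                 # shared variable (global heap)
lock = threading.Lock()     # synchronization object

def worker(n):
    global counter
    local_sum = 0           # private variable (on this thread's stack)
    for i in range(n):
        local_sum += i
    with lock:              # explicit coordination on shared data
        counter += local_sum

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)              # 4 * sum(range(1000)) = 1998000
```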
40Model 2 Message Passing
- Program consists of a collection of named processes
- thread of control plus local address space
- local variables, static variables, common blocks, heap
- Processes communicate by explicit data transfers
- matching pair of send/receive by source and destination process
- Coordination is implicit in every communication event
- Logically shared data is partitioned over local processes
- Like distributed programming
- Program with standard libraries: MPI, PVM
- a.k.a. shared-nothing architecture, or a multicomputer
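A minimal sketch of the same idea with explicit message passing, assuming the mpi4py binding of the MPI standard mentioned above:

```python
# Minimal message-passing sketch with mpi4py.
# Run with, e.g.: mpiexec -n 2 python send_recv.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = {"payload": [1, 2, 3]}
    comm.send(data, dest=1, tag=11)      # explicit send by the source process
elif rank == 1:
    data = comm.recv(source=0, tag=11)   # matching receive by the destination
    print("rank 1 received:", data)
```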
41Model 3 Data Parallel
- A single sequential thread of control consisting of parallel operations
- Parallel operations applied to all (or a defined subset) of a data structure
- Communication is implicit in parallel operators and shifted data structures
- Elegant and easy to understand
- Not all problems fit this model
- Vector computing
42SIMD Machine
- An SIMD (Single Instruction Multiple Data) machine:
- A large number of small processors
- A single control processor issues each instruction
- each processor executes the same instruction
- some processors may be turned off on any instruction
- The machines are no longer popular (CM2), but the programming model is
- implemented by mapping n-fold parallelism to p processors
- mostly done in compilers (HPF, High Performance Fortran)
43Model 4 Hybrid
- Shared-memory machines (SMPs) are the fastest commodity machines. Why not build a larger machine by connecting many of them with a network?
- CLUMP = Cluster of SMPs
- Shared memory within one SMP, message passing outside
- Clusters, ASCI Red (Intel), ...
- Programming model?
- Treat the machine as flat and always use message passing, even within an SMP (simple, but ignores an important part of the memory hierarchy)
- Expose two layers: shared memory (OpenMP) and message passing (MPI); higher performance, but ugly to program
44Hybrid Systems
45Model 5 BSP
- Bulk Synchronous Processing (BSP) (L. Valiant, 1990)
- Used within the message passing or shared memory models as a programming convention
- Phases separated by global barriers
- Compute phases: all operate on local data (in distributed memory)
- or have read access to global data (in shared memory)
- Communication phases: all participate in rearrangement or reduction of global data
- Generally all processes do the same thing in a phase
- all do f, but may all do different things within f
- The simplicity of data parallelism without its restrictions
(Figure: a BSP superstep)
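The following is a hedged sketch of a BSP-style superstep loop on top of message passing (mpi4py is an assumption; any MPI binding would do): local computation, then a global communication phase, then a barrier.

```python
# BSP-style superstep sketch: compute on local data, exchange globally, barrier.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
local = np.full(4, rank, dtype="i")           # local data for this process

for step in range(3):                         # three supersteps
    local += 1                                # compute phase: local data only
    total = np.empty_like(local)
    comm.Allreduce(local, total, op=MPI.SUM)  # communication phase (reduction)
    comm.Barrier()                            # barrier ends the superstep
```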
46Problem Decomposition
- Domain decomposition → data parallel
- Functional decomposition → task parallel
47Parallel Programming
- Directives-based data-parallel languages
- such as High Performance Fortran (HPF) or OpenMP
- Serial code is made parallel by adding directives (which appear as comments in the serial code) that tell the compiler how to distribute data and work across the processors.
- The details of how data distribution, computation, and communications are to be done are left to the compiler.
- Usually implemented on shared-memory architectures.
- Message Passing (e.g. MPI, PVM)
- a very flexible approach based on explicit message passing via library calls from standard programming languages
- It is left up to the programmer to explicitly divide data and work across the processors as well as manage the communications among them.
- Multi-threading in distributed environments
- Parallelism is transparent to the programmer
- Shared-memory or distributed shared-memory systems
48Parallel Programming Issues
- The main goal of a parallel program is to get better performance than the serial version.
- Performance evaluation
- Important issues to take into account
- Load balancing
- Minimizing communication
- Overlapping communication and computation
49Speedup
- Serial fraction fs: the fraction of the execution that must run sequentially
- Parallel fraction fp = 1 - fs: the fraction that can run in parallel
- Speedup: S(p) = Ts / Tp, the ratio of serial to parallel execution time
- Superlinear speedup (S(p) > p) is, in general, impossible, but it may arise in two cases:
- memory hierarchy phenomena
- search algorithms
50Maximum Speedup
- Amdahl's Law states that the potential program speedup is defined by the fraction of code (fp) which can be parallelized: S(p) = 1 / (fs + fp/p).
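A small sketch evaluating Amdahl's bound for a few values of fp (the values match the next slide):

```python
# Amdahl's law: S(p) = 1 / (fs + fp/p), bounded above by 1/fs.
def amdahl_speedup(fp: float, p: int) -> float:
    fs = 1.0 - fp                  # serial fraction
    return 1.0 / (fs + fp / p)

for fp in (0.50, 0.90, 0.99):
    print(f"fp={fp:.2f}:",
          [round(amdahl_speedup(fp, p), 1) for p in (10, 100, 1000)])
# fp=0.50: [1.8, 2.0, 2.0]    -> speedup capped at 1/fs = 2
# fp=0.90: [5.3, 9.2, 9.9]    -> capped at 10
# fp=0.99: [9.2, 50.3, 91.0]  -> capped at 100
```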
51Maximum Speedup
- There are limits to the scalability of parallelism.
- For example, fp = 0.50, 0.90 and 0.99 means 50%, 90% and 99% of the code is parallelizable.
- However, certain problems demonstrate increased performance by increasing the problem size. Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time.
- fs and fp may not be static
52Efficiency
- Given the parallel cost C = p * Tp,
- Efficiency: E = S / p = Ts / (p * Tp)
- In general, the total overhead To = p * Tp - Ts is an increasing function of p, at least linear when fs > 0, due to:
- communication,
- extra computation,
- idle periods due to sequential components,
- idle periods due to load imbalance.
53Cost-optimality of Parallel Systems
- A parallel system is composed of a parallel algorithm and a parallel computational platform.
- A parallel system is cost-optimal if the cost of solving a problem has the same asymptotic growth (in Θ terms, as a function of the input size W) as the fastest known sequential algorithm.
- As a consequence, p * Tp = Θ(Ts), i.e. the efficiency of a cost-optimal system is E = Θ(1).
54Isoefficiency
- For a given problem size, when we increase the number of PEs, the speedup and the efficiency decrease.
- How much do we need to increase the problem size to keep the efficiency constant?
- Isoefficiency is a metric for scalability.
- In general, as the problem size increases, the efficiency increases, while keeping the number of processors constant.
- In a scalable parallel system, when increasing the number of PEs, the efficiency can be kept constant by increasing the problem size.
- Of course, for different problems, the rate at which W must be increased may vary. This rate determines the degree of scalability of the system.
55Sources of Parallel Overhead
- Total parallel overhead: To = p * Tp - Ts
- INTERPROCESSOR COMMUNICATION
- If each PE spends Tcomm time on communication, then the overhead increases by p * Tcomm.
- LOAD IMBALANCE
- If it exists, some PEs will be idle while others are busy. The idle time of any PE contributes to the overhead time.
- Load imbalance always occurs if there is a strictly sequential component of the algorithm.
- Load imbalance often occurs at the end of the run due to asynchronous termination (e.g. in coarse-grain parallelism).
- EXTRA COMPUTATION
- A parallel version of the fastest sequential algorithm may not be straightforward. Additional computation may be needed in the parallel algorithm; this also contributes to the overhead time.
56Load Balancing
- Load balancing is the task of equally dividing the work among the available processes.
- A range of load balancing problems is determined by:
- Task costs
- Task dependencies
- Locality needs
- There is a spectrum of solutions from static to dynamic.
- A closely related problem is scheduling, which determines the order in which tasks run.
57Different Load Balancing Problems
- Load balancing problems differ in:
- Task costs
- Do all tasks have equal costs?
- If not, when are the costs known? Before starting, when a task is created, or only when it ends?
- Task dependencies
- Can all tasks be run in any order (including in parallel)?
- If not, when are the dependencies known? Before starting, when a task is created, or only when it ends?
- Locality
- Is it important for some tasks to be scheduled on the same processor (or nearby) to reduce communication cost?
- When is the information about communication between tasks known?
58Task cost
59Task Dependency
(e.g. data/control dependencies at end/beginning
of task executions)
60Task Locality
(e.g. data/control dependencies during task
executions)
61Spectrum of Solutions
- Static scheduling: all information is available to the scheduling algorithm, which runs before any real computation starts (offline algorithms).
- Semi-static scheduling: information may be known at program startup, at the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic.
- Dynamic scheduling: information is not known until mid-execution (online algorithms).
62LB Approaches
- Static load balancing
- Semi-static load balancing
- Self-scheduling (manager-workers)
- Distributed task queues
- Diffusion-based load balancing
- DAG scheduling (graph partitioning is
NP-complete) - Mixed Parallelism
63Distributed and Dynamic LB
- Dynamic load balancing algorithms, a.k.a. work stealing/donating
- Basic idea, when applied to search trees:
- Each processor performs search on a disjoint part of the tree
- When finished, it gets work from a processor that is still busy
- Requires asynchronous communication
(Figure: idle/busy state machine; an idle processor selects a processor and requests work, servicing pending messages while it waits; if no work is found it tries another donor; once it gets work, it does a fixed amount of work at a time, servicing pending messages, until finished and no available work remains)
64Selecting a Donor
- Basic distributed algorithms:
- Asynchronous Round Robin (ARR)
- Each processor k keeps a variable target_k
- When a processor runs out of work, it requests work from target_k
- Then it sets target_k = (target_k + 1) mod #procs
- Nearest Neighbor (NN)
- Round robin over neighbors
- Takes topology into account (as diffusive techniques do)
- Load balancing somewhat slower than randomized approaches
- Global Round Robin (GRR)
- Processor 0 keeps a single variable target
- When a processor needs work, it reads target and requests work from that processor
- P0 increments target (mod #procs) with each access
- Random polling/stealing
- When a processor needs work, it selects a random processor and requests work from it
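A hedged sketch of these donor-selection policies; the function names and the representation of the targets are illustrative, not from the original systems.

```python
# Donor-selection sketch: each function returns the rank to ask for work next.
import random

def arr_next(target_k: int, p: int) -> tuple[int, int]:
    """Asynchronous Round Robin: each processor cycles through its own target."""
    donor = target_k
    return donor, (target_k + 1) % p          # donor, updated target_k

def grr_next(global_target: list, p: int) -> int:
    """Global Round Robin: a single shared counter held by processor 0."""
    donor = global_target[0]
    global_target[0] = (global_target[0] + 1) % p
    return donor

def random_polling(p: int, me: int) -> int:
    """Random polling: pick any processor other than ourselves (requires p > 1)."""
    donor = random.randrange(p - 1)
    return donor if donor < me else donor + 1
```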
65Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
66Knowledge Discovery in Databases
- Knowledge Discovery in Databases (KDD) is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
(Figure: the KDD process; operational databases are cleaned, collected and summarized into a data warehouse, data preparation yields training data, data mining produces models and patterns, which then undergo verification and evaluation)
67Origins of Data Mining
- KDD draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization.
- Prediction methods
- Use some variables to predict unknown or future values of other variables.
- Description methods
- Find human-interpretable patterns that describe the data.
- Traditional techniques may be unsuitable due to:
- the enormity of the data
- the high dimensionality of the data
- the heterogeneous, distributed nature of the data
68Speeding up Data Mining
- Data oriented approach
- Discretization
- Feature selection
- Feature construction (PCA)
- Sampling
- Methods oriented approach
- Efficient and scalable algorithms
69Speeding up Data Mining
- Methods oriented approach (contd.)
- Distributed and parallel data mining
- Task or control parallelism
- Data parallelism
- Hybrid parallelism
- Distributed data mining
- Voting
- Meta-learning, etc.
70Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
71What is Classification?
- Classification is the process of assigning new objects to predefined categories or classes.
- Given a set of labeled records:
- build a model (e.g. a decision tree)
- predict labels for future unlabeled records
72Classification learning
- Supervised learning (labels are known)
- Examples are described in terms of attributes
- categorical (unordered symbolic values)
- numeric (integers, reals)
- Class (output/predicted attribute)
- categorical for classification
- numeric for regression
73Classification learning
- Training set
- a set of examples, where each example is a feature vector (i.e., a set of <attribute, value> pairs) with its associated class. The model is built on this set.
- Test set
- a set of examples disjoint from the training set, used for testing the accuracy of a model.
74Classification Example
(Figure: a training set with two categorical attributes, one continuous attribute, and a class label; a classifier is learned from the training set)
75Classification Models
- Some models are better than others in:
- accuracy
- understandability
- Models range from easy to understand to incomprehensible (roughly from easier to harder):
- Decision trees
- Rule induction
- Regression models
- Genetic algorithms
- Bayesian networks
- Neural networks
76Decision Trees
- Decision tree models are better suited for data mining:
- inexpensive to construct
- easy to interpret
- easy to integrate with database systems
- comparable or better accuracy in many applications
77Decision Trees Example
(Figure: decision tree learned from the example data. Splitting attributes: Refund = Yes -> NO; Refund = No -> test MarSt; MarSt = Married -> NO; MarSt = Single or Divorced -> test TaxInc; TaxInc < 80K -> NO; TaxInc >= 80K -> YES)
The splitting attribute at a node is determined based on the Gini index.
78From Tree to Rules
1) Refund = Yes -> NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K -> NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K -> YES
4) Refund = No and MarSt in {Married} -> NO
(Figure: the same decision tree as on the previous slide)
79Decision Trees Sequential Algorithms
- Many algorithms:
- Hunt's algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
- General structure:
- Tree induction
- Tree pruning
80Classification algorithm
- Build tree
- Start with the data at the root node
- Select an attribute and formulate a logical test on that attribute
- Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have examples from a single class, or nearly pure, i.e., the majority of examples are from the same class
- Prune tree
- Remove subtrees that do not improve classification accuracy
- Avoid over-fitting, i.e., training-set-specific artifacts
81Build tree
- Evaluate split points for all attributes
- Select the best point and the "winning" attribute
- Split the data into two
- Breadth-first or depth-first construction
- CRITICAL STEPS:
- formulation of good split tests
- selection measure for attributes
82How to capture good splits?
- Occam's razor: prefer the simplest hypothesis that fits the data
- Minimum message/description length:
- dataset D
- hypotheses H1, H2, ..., Hx describing D
- MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
- pick the Hk with minimum MML
- Mlength given by Gini index, Gain, etc.
83Tree pruning using MDL
- Data encoding: sum of classification errors
- Model encoding:
- encode the tree structure
- encode the split points
- Pruning: choose the smallest-length option
- convert to leaf
- prune left or right child
- do nothing
84Hunt's Method
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income
- Class: Cheat, Don't Cheat
(Figure: the tree grows by successive refinement, starting from a single leaf labeled Don't Cheat)
85What's really happening?
(Figure: the training examples plotted in the Marital Status vs. Income plane; the tree's splits, e.g. Married and Income < 80K, partition the plane into Cheat / Don't Cheat regions)
86Finding good split points
- Use the Gini index for partition purity: Gini(S) = 1 - sum_i p(i)^2
- where p(i) = frequency of class i in the node
- If S is pure, Gini(S) = 1 - 1 = 0
- Find the split point with minimum Gini
- Only the class distributions are needed
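A small sketch of Gini-based split evaluation for a numeric attribute; the data values are illustrative, and the two-way split mirrors the TaxInc < 80K test from the running example.

```python
# Gini index and weighted Gini of a two-way split on a numeric attribute.
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum_i p(i)^2 over the class frequencies in S."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(values, labels, threshold):
    """Weighted Gini of the partition induced by value < threshold."""
    left = [y for x, y in zip(values, labels) if x < threshold]
    right = [y for x, y in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

income = [60, 70, 75, 85, 90, 95, 100, 120]        # illustrative data
cheat = ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"]
print(gini_split(income, cheat, 80))               # candidate split TaxInc < 80K
```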
87Finding good split points
(Figure: two candidate splits of the Marital Status vs. Income plane, with Gini(split) = 0.34 and Gini(split) = 0.31; the split with the lower Gini is preferred)
88Categorical Attributes Computing Gini Index
- For each distinct value, gather counts for each class in the dataset
- Use the count matrix to make decisions
- Two-way split (find the best partition of values) or multi-way split
89Decision Trees Parallel Algorithms
- Approaches for categorical attributes:
- Synchronous Tree Construction (data parallel)
- no data movement required
- high communication cost as the tree becomes bushy
- Partitioned Tree Construction (task parallel)
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
- Hybrid Algorithm
- combines the good features of the two approaches
- adapts dynamically according to the size and shape of the trees
90Synchronous Tree Construction
- Partitioning of data only
- a global reduction per tree node is required
- a large number of classification tree nodes gives high communication cost
(Figure: n records with m categorical attributes, partitioned row-wise across processors)
91Partitioned Tree Construction
- Partitioning of classification tree nodes
- natural concurrency
- load imbalance, as the amount of work associated with each node varies
- child nodes use the same data as the parent node
- loss of locality
- high data movement cost
92Synchronous Tree Construction
Partition Data Across Processors
- No data movement is required
- Load imbalance
- can be eliminated by breadth-first expansion
- High communication cost
- becomes too high in the lower parts of the tree
93Partitioned Tree Construction
Partition Data and Nodes
- Highly concurrent
- High communication cost due to excessive data movement
- Load imbalance
94Hybrid Parallel Formulation
(Figure: the hybrid formulation switches from synchronous to partitioned tree construction as the tree grows)
95Load Balancing
96Switch Criterion
- Switch to Partitioned Tree Construction when the switching criterion (a cost comparison omitted in these slides) is met
- The splitting criterion ensures balanced partitions
- Reference: "Parallel Formulations of Decision-Tree Classification Algorithms", A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, pp. 237-261, September 1999.
97Speedup Comparison
(Figure: speedup curves for 0.8 million and 1.6 million training examples; the hybrid algorithm tracks the linear-speedup line most closely, followed by the partitioned and then the synchronous approach)
98Speedup of the Hybrid Algorithm with Different
Size Data Sets
99Scaleup of the Hybrid Algorithm
100Summary of Algorithms for Categorical Attributes
- Synchronous Tree Construction Approach
- no data movement required
- high communication cost as the tree becomes bushy
- Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
- Hybrid Algorithm
- combines the good features of the two approaches
- adapts dynamically according to the size and shape of the trees
101Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
102Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- data points in one cluster are more similar to one another
- data points in separate clusters are less similar to one another
103Clustering
- Given N k-dimensional feature vectors, find a meaningful partition of the N examples into c subsets or groups.
- Discover the labels automatically
- c may be given, or discovered
- Much more difficult than classification, since in the latter the groups are given, and we seek a compact description.
104Clustering Illustration
(Figure: k = 3 Euclidean-distance-based clustering in 3-D space; intracluster distances are minimized, intercluster distances are maximized)
105Clustering
- We have to define some notion of "similarity" between examples
- Similarity measures:
- Euclidean distance if attributes are continuous
- other problem-specific measures
- Goal: maximize intra-cluster similarity and minimize inter-cluster similarity
- Feature vectors may be:
- all numeric (well-defined distances)
- all categorical or mixed (harder to define similarity; geometric notions don't work)
106Clustering schemes
- Distance-based
- Numeric
- Euclidean distance (root of sum of squared differences along each dimension)
- angle between two vectors
- Categorical
- number of common features
- Partition-based
- enumerate partitions and score each
107Clustering schemes
- Model-based
- Estimate a density (e.g., a mixture of Gaussians)
- Go bump-hunting
- Compute P(feature vector i | cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering
108Before clustering
- Normalization
- Given three attributes:
- A in microseconds
- B in milliseconds
- C in seconds
- We can't treat differences as the same in all dimensions or attributes
- Need to scale or normalize for comparison
- Can assign weights for more importance
109The k-means algorithm
- Specify k, the number of clusters
- Guess k seed cluster centers
- 1) Look at each example and assign it to the center that is closest
- 2) Recalculate the centers
- Iterate on steps 1 and 2 until the centers converge, or for a fixed number of iterations
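A minimal serial k-means sketch with numpy, mirroring the two steps above; initialization by random sampling is an assumption, and empty clusters keep their previous center.

```python
# Serial k-means: assign each point to the closest center, recompute centers.
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]  # seed centers
    for _ in range(iters):
        # step 1: distance of every point to every center, then closest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # step 2: recalculate each center as the mean of its assigned points
        new_centers = np.array([points[assign == j].mean(axis=0)
                                if np.any(assign == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):    # centers converged
            break
        centers = new_centers
    return centers, assign
```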
110K-means algorithm
Initial seeds
111K-means algorithm
New centers
112K-means algorithm
Final centers
113Operations in k-means
- Main operation: calculate the distance to all k means or centroids
- Other operations:
- find the closest centroid for each point
- calculate the mean squared error (MSE) for all points
- recalculate the centroids
114Parallel k-means
- Divide the N points among the P processors
- Replicate the k centroids
- Each processor computes the distance of each local point to the centroids
- Assign points to the closest centroid and compute the local MSE
- Perform a reduction for the global centroids and the global MSE value
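A hedged mpi4py sketch of one parallel k-means iteration, as described above: local assignment, then a global reduction of the per-cluster sums, counts, and MSE.

```python
# One parallel k-means iteration: local work + global reduction (mpi4py).
from mpi4py import MPI
import numpy as np

def kmeans_step(local_points, centers, comm):
    k, dim = centers.shape
    dists = np.linalg.norm(local_points[:, None, :] - centers[None, :, :], axis=2)
    assign = dists.argmin(axis=1)            # closest centroid per local point

    # local partial results: per-cluster sums, counts, and squared error
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        members = local_points[assign == j]
        sums[j], counts[j] = members.sum(axis=0), len(members)
    local_mse = np.array([(dists[np.arange(len(assign)), assign] ** 2).sum()])

    # group communication: one global reduction per quantity
    g_sums, g_counts, g_mse = np.empty_like(sums), np.empty_like(counts), np.empty(1)
    comm.Allreduce(sums, g_sums, op=MPI.SUM)
    comm.Allreduce(counts, g_counts, op=MPI.SUM)
    comm.Allreduce(local_mse, g_mse, op=MPI.SUM)
    # (a production version would guard against empty clusters here)
    return g_sums / g_counts[:, None], g_mse[0]   # new centroids, global MSE
```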
115Serial and Parallel k-means
Group communication
116Serial k-means Complexity
One iteration over N points with k centroids in d dimensions costs O(N k d) distance computations.
117Parallel k-means Complexity
With P processors, one iteration costs O((N/P) k d) plus the cost of the global reduction, which depends on the physical communication topology, e.g. O(log P) steps in a hypercube.
118Speedup and Scaleup
Condition for linear speedup
Condition for linear scaleup (w.r.t. n)
119Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Frequent Itemset Mining
- Graph Mining
120ARM Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
121ARM Definition
- Given a set of items/attributes, and a set of objects containing a subset of the items,
- find rules "if I1 then I2" (support, confidence)
- I1, I2 are sets of items
- I1, I2 have sufficient support: P(I1 ∪ I2)
- the rule has sufficient confidence: P(I2 | I1)
122Association Mining
- The user specifies interestingness:
- minimum support (minsup)
- minimum confidence (minconf)
- Find all frequent itemsets (support >= minsup)
- exponential search space
- computation and I/O intensive
- Generate strong rules (confidence >= minconf)
- relatively cheap
123Association Rule Discovery: Support and Confidence
(Figure: an example transaction table with an association rule and its support and confidence)
124Handling Exponential Complexity
- Given n transactions and m different items:
- the number of possible association rules and the computation complexity are exponential in m
- Systematic search for all patterns, based on the support constraint:
- if {A,B} has support at least s, then both A and B have support at least s
- if either A or B has support less than s, then {A,B} has support less than s
- use patterns of k-1 items to find patterns of k items
125Apriori Principle
- Collect single-item counts; find large (frequent) items.
- Find candidate pairs, count them -> large pairs of items.
- Find candidate triplets, count them -> large triplets of items, and so on...
- Guiding principle: every subset of a frequent itemset has to be frequent.
- Used for pruning many candidates.
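A compact sketch of the Apriori level-wise loop described above; the transactions and the support threshold are illustrative.

```python
# Apriori sketch: level-wise candidate generation, subset-based pruning, counting.
from itertools import combinations

def apriori(transactions, minsup):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    # level 1: frequent single items
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(freq), 2
    while freq:
        # join frequent (k-1)-itemsets into k-item candidates
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # prune: every (k-1)-subset of a candidate must be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # count the support of the surviving candidates
        freq = {c for c in cands
                if sum(c <= t for t in transactions) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent

print(apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}], minsup=2))
```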
126Illustrating Apriori Principle
(Figure: items (1-itemsets), pairs (2-itemsets) and triplets (3-itemsets) counted with minimum support = 3)
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 2 = 14 candidates.
127Counting Candidates
- Frequent itemsets are found by counting candidates.
- Simple way:
- search for each candidate in each transaction. Expensive!
(Figure: N transactions matched against M candidates)
128Association Rule Discovery: hash tree for fast access
(Figure: a candidate hash tree; the hash function sends items 1, 4, 7 to one branch, 2, 5, 8 to another, and 3, 6, 9 to the third)
129Association Rule Discovery Subset Operation
(Figure: a transaction is hashed down the candidate hash tree)
130Association Rule Discovery Subset Operation
(contd.)
(Figure, contd.: the 3-item subsets of the transaction, e.g. {1,3,6}, {3,4,5}, {1,5,9}, are routed to the matching hash tree leaves)
131Parallel Formulation of Association Rules
- Large-scale problems have:
- huge transaction datasets (tens of TB)
- a large number of candidates
- Parallel approaches:
- partition the transaction database, or
- partition the candidates, or
- both
132Parallel Association Rules Count Distribution
(CD)
- Each processor has the complete candidate hash tree.
- Each processor updates its hash tree with local data.
- Each processor participates in a global reduction to get the global counts of the candidates in the hash tree.
- Multiple database scans per iteration are required if the hash tree is too big for memory.
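A hedged sketch of the CD scheme with mpi4py: each process counts the full candidate set against its local N/p transactions, then one global reduction yields identical global counts everywhere. Representing candidates as frozensets (rather than a hash tree) is a simplification.

```python
# Count Distribution sketch: local counting + one global reduction of counts.
from mpi4py import MPI
import numpy as np

def count_distribution(local_transactions, candidates, comm):
    local_counts = np.zeros(len(candidates), dtype="i")
    for t in local_transactions:               # scan the local database
        for i, c in enumerate(candidates):
            if c <= t:                         # candidate contained in transaction
                local_counts[i] += 1
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return global_counts                       # identical on every process
```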
133CD Illustration
(Figure: processors P0, P1, P2 each hold N/p transactions and a full copy of the candidate hash tree; a global reduction combines the counts)
134Parallel Association Rules Data Distribution
(DD)
- The candidate set is partitioned among the processors.
- Once local data has been partitioned, it is broadcast to all other processors.
- High communication cost due to data movement.
- Redundant work due to multiple traversals of the hash trees.
135DD Illustration
(Figure: processors P0, P1, P2 each hold a disjoint partition of the candidates with their counts; local data is broadcast all-to-all so that every candidate meets every transaction)
136Parallel Association Rules Intelligent Data
Distribution (IDD)
- Data distribution using point-to-point communication.
- Intelligent partitioning of candidate sets:
- partitioning based on the first item of the candidates
- a bitmap keeps track of local candidate items
- pruning at the root of the candidate hash tree using the bitmap
- Suitable for a single data source, such as a database server.
- With smaller candidate sets, load balancing is difficult.
137IDD Illustration
(Figure: as in DD, but candidates are partitioned by their first item; each processor keeps a bitmask of its candidate items, and data moves by a point-to-point shift instead of an all-to-all broadcast)
138Filtering Transactions in IDD
(Figure: each transaction is filtered against the local bitmask before it is matched against the candidate hash tree)
139Parallel Association Rules Hybrid Distribution
(HD)
- The candidate set is partitioned into G groups so as to just fit in main memory.
- This ensures good load balance with a smaller candidate set.
- A logical G x P/G processor mesh is formed.
- Perform IDD along the column processors:
- data movement among processors is minimized.
- Perform CD along the row processors:
- a smaller number of processors takes part in the global reduction.
140HD Illustration
(Figure: a logical mesh with G groups of P/G processors; each processor holds N/P transactions, IDD is performed within each group and CD across groups)
141Parallel Association Rules Comments
- HD has shown the same linear speedup and sizeup behavior as CD.
- HD exploits the total aggregate main memory, while CD does not.
- IDD has much better scaleup behavior than DD.
142Tutorial Outline
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Frequent Subgraph Mining
143Graph Mining
- Market Basket Analysis
- Association Rule Mining (ARM) -> find frequent itemsets
- Search space:
- unstructured data; only the item type is important
- item set I, |I| = n -> power set P(I), |P(I)| = 2^n
- a pruning technique makes the search feasible
- Subset test:
- for each user transaction t and each candidate frequent itemset s, we need a subset test in order to compute the support (frequency).
- Molecular Compound Analysis
- Frequent Subgraph Mining (FSM) -> find frequent subgraphs
- Bigger search space:
- structured data; atom types are not sufficient, since atoms have bonds with other atoms
- Subgraph isomorphism test:
- for each graph and each candidate frequent subgraph we need a subgraph isomorphism test.
- N.B.: for general graphs, the subgraph isomorphism test is NP-complete.
144Molecular Fragment Lattice
(Figure: a lattice of molecular fragments with minSupp = 50%: single atoms C, O, S, N; one-bond fragments C-C, S-O, C-S, S-N, S=N; two-bond fragments C-S-N, C-S=N, C-C-S, C-S-O, N-S-O; then C-S-O plus N, up to C-C-S-N)
145Mining Molecular Fragments
- Frequent molecular fragments
- Frequent Subgraph Mining (FSM)
- Discriminative molecular fragments
- Molecular compounds are classified into:
- active compounds -> focus subset (F)
- inactive compounds -> complement subset (C)
- Problem definition:
- find all discriminative molecular fragments which are frequent in the set of the active compounds and not frequent among the inactive compounds, i.e. contrast substructures.
- User parameters:
- minSupp (minimum support in the focus dataset)
- maxSupp (maximum support in the complement dataset)
146Molecular Fragment Search Tree
A search tree node represents a molecular fragment. Successor nodes are generated by extending the fragment by one bond and, possibly, one atom.
147Large-Scale Issue
- Need for scalability in terms of:
- the number of molecules
- larger main and secondary memory to store the molecules
- fragments with longer lists of instances in molecules (embeddings)
- the size of molecules
- larger memory to store larger molecules
- fragments with longer lists of longer embeddings
- more fragments (a bigger search space)
- the minimum support
- with a lower support threshold the mining algorithm produces:
- more embeddings for each fragment
- more fragments and longer search tree branches
148High-Performance Distributed Approach
- Sequential algorithms cannot handle large-scale problems and the small user-parameter values needed for better quality results.
- Search space partitioning
- distributed implementation of backtracking
- external representation (enhanced SMILES)
- DB selection and projection
- Tree-based reduction
- Dynamic load balancing for irregular problems
- donor selection and work-splitting mechanism
- Peer-to-Peer computing
149Search Space Partitioning
150Search Space Partitioning
- A 4th kind of search-tree pruning: Distributed Computing pruning
- prune a search node, generate a new job, and assign it to an idle processor
- asynchronous communication and low overhead
- backtracking is particularly suitable for parallel processing because a subtree rooted at any node can be searched independently
151Tree-based Reduction
(Figure: a 3-D hypercube with p = 8 nodes)
Star reduction (master-slave): O(p). Tree reduction: O(log p).
152Job Assignment and Parallel Overheads
(Figure: timeline of a job assignment: a donor splits its stack, embeds the fragment into an external representation (enhanced SMILES), and sends it to an idle receiver, incurring some latency and delay)
- Parallel computing overheads: communication, excess computation, idling periods
- The 1st job assignment, subsequent job assignments, and termination detection are overlapped with useful computation
- DB selection and projection
- DLB for irregular problems
153Parallel Execution Analysis
(Figure: worker execution timeline: setup, idle1, jobs, idle2, idle3)
- setup: configuration message, DB loading
- idle1: waiting for the first job assignment, due to the initial sequential part
- idle2: processor starvation
- idle3: idle period due to load imbalance
- jobs: job processing time (including computational overhead)
(Figure: single job execution: data, prep, mining)
- data: data preprocessing
- prep: prepare the root search node (embed the core fragment)
- mining: data mining processing (useful work)
154Load Balancing
- The search space is not known a priori and is very irregular.
- Dynamic load balancing:
- receiver-initiated approach
- donor selection
- work-splitting mechanism
- The DLB determines the overall performance and efficiency.
155Highly Irregular Problem
(Figure: distributions of search tree node visit time (subtree visit) and node expand time (node extension); both follow a power-law distribution)
156Work Splitting
A search tree node n can be donated only if:
1) stackSize() > minStackSize,
2) support(n) > (1 + α) * minSupp, and
3) lxa(n) < β * atomCount(n).
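A hedged sketch of this donation test; alpha, beta, and the node accessors are illustrative stand-ins for the actual implementation's data structures.

```python
# Work-splitting predicate sketch: may search tree node `node` be donated?
def can_donate(node, stack_size, min_stack_size, min_supp,
               alpha=0.1, beta=0.5):          # alpha/beta: illustrative values
    return (stack_size > min_stack_size           # enough local work remains
            and node.support > (1 + alpha) * min_supp   # subtree not about to be pruned
            and node.lxa < beta * node.atom_count)      # fragment still extensible
```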
157Dynamic Load Balancing
- Receiver-initiated approaches:
- Random Polling (RP): excellent scalability
- Scheduler-based (MS): close to an optimal solution
- Quasi-Random Polling (QRP) combines the advantages of both
- QRP policy:
- a global list of potential donors, sorted w.r.t. their running time
- the server collects job statistics
- each receiver periodically gets an updated donor list
- P2P computing framework
- a receiver selects a random donor according to a probability distribution decreasing with the donor's rank in the list
- high probability of choosing long-running jobs
158Issues
- Penalty for global synchronization
- Adaptive application
- Asynchronous communication
- Highly irregular problem
- Difficulty in predicting work loads
- Heavy work loads may delay message processing
- Large-scale multi-domain heterogeneous computing environments
- Network latency and delay tolerance
159Tutorial Outline
- Part 1 Overview of High-Performance Computing
- Technology trends
- Parallel and Distributed Computing architectures
- Programming paradigms
- Part 2 Distributed Data Mining
- Classification
- Clustering
- Association Rules
- Graph Mining
- Conclusions
160Large-scale Parallel KDD Systems
- Data
- Terabyte-sized datasets
- Centralized or distributed datasets
- Incremental changes (refine knowledge as data
changes)
- Heterogeneous data sources
161Large-scale Parallel KDD Systems
- Software
- Pre-processing, mining, post-processing
- Interactive (anytime mining)
- Modular (rapid development)
- Web services
- Workflow management tool integration
- Fault and latency tolerant
- Highly scalable
162Large-scale Parallel KDD Systems
- Computing Infrastructure
- Clusters already widespread
- Multi-domain heterogeneous
- Data and computational Grids
- Dynamic resource aggregation (P2P)
- Self-managing
163Research Directions
- Fast algorithms for different mining tasks
- Classification, clustering, associations, etc.
- Incorporating concept hierarchies
- Parallelism and scalability
- Millions of records
- Thousands of attributes/dimensions
- Single-pass algorithms
- Sampling
- Parallel I/O and file systems
164Research Directions (contd.)
- Parallel ensemble learning
- parallel execution of different data mining algorithms and techniques that can be integrated to obtain a better model
- Not just high performance but also high accuracy
165Research Directions (contd.)
- Tight database integration
- Push common primitives inside the DBMS
- Use multiple tables
- Use efficient indexing techniques
- Caching strategies for sequences of data mining operations
- Data mining query languages and parallel query optimization
166Research Directions (contd.)
- Understandability: too many patterns
- Incorporate background knowledge
- Integrate constraints
- Meta-level mining
- Visualization, exploration
- Usability: build a complete system
- Pre-processing, mining, post-processing, persistent management of mined results
167Conclusions
- Data mining is a rapidly growing field
- Fueled by enormous data collection rates and the need for intelligent analysis for business and scientific gains.
- Large and high-dimensional data requires new analysis techniques and algorithms.
- High-performance distributed computing is becoming an essential component in data mining and data exploration.
- Many research and commercial opportunities.
168Resources
- Workshops
- IEEE IPDPS Workshop on Parallel and Distributed Data Mining
- HiPC Special Session on Large-Scale Data Mining
- ACM SIGKDD Workshop on Distributed Data Mining
- IEEE IPDPS Workshop on High Performance Data Mining
- ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems
- IEEE IPPS Workshop on High Performance Data Mining
- LifeDDM, Distributed Data Mining in Life Science
- Books
- A. Freitas and S. Lavington. Mining Very Large Databases with Parallel Processing. Kluwer Academic Publishers, Boston, MA, 1998.
- M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000.
- H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery. AAAI Press, Summer 2000.
169References
- Journal Special Issues
- P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
- Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and Knowledge Discovery: An International Journal.