Title: Parallel and Distributed Data Mining: A Review
1 Parallel and Distributed Data Mining: A Review
- PhD student: Elio Lozano
- Advisor: Dr. Edgar Acuña
- University of Puerto Rico
- Mayagüez Campus
2 What is Data Mining?
- Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al. 2001).
- It uses machine learning and statistical techniques to discover knowledge and present it in a form that is easy for humans to comprehend.
- Data mining is an application of a more general problem called pattern recognition.
3 Parallel and Distributed Data Mining
- Distributed data mining techniques
  - Meta-learning, distributed classification
- Parallel data mining
  - Association rule algorithms, classification trees, clustering, discriminant methods, neural networks, genetic algorithms
4 Meta-Learning
- Finds global classifiers from distributed databases.
- Learning from learned knowledge (Chan and Stolfo 1997).
- Java Agents for Meta-learning (JAM) (Prodromidis et al. 1999).
5 Distributed Data Mining
- Base classifiers are trained on local training sets.
- Predictions on a validation set are generated by each learned classifier.
- The meta-level training set is composed of the validation-set labels and the base classifiers' predictions on the validation set.
- The final classifier is trained on the meta-level training set (see the sketch below).
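A minimal stacking-style sketch of this scheme in R; rpart as the base learner and the built-in iris data standing in for the local and validation sets are assumptions, not part of the original work.

```r
# Hedged sketch: two base classifiers are trained on a (local) training set,
# their class predictions on a validation set form the meta-level training
# set, and a final (meta-level) classifier is trained on it.
library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]                                    # local training set
valid <- iris[-idx, ]                                   # validation set

base1 <- rpart(Species ~ ., data = train)                          # base classifier 1
base2 <- rpart(Species ~ Sepal.Length + Sepal.Width, data = train) # base classifier 2

meta_train <- data.frame(                               # meta-level training set:
  p1      = predict(base1, valid, type = "class"),      # base predictions on validation set
  p2      = predict(base2, valid, type = "class"),
  Species = valid$Species                               # plus the true validation labels
)
final <- rpart(Species ~ p1 + p2, data = meta_train)    # final classifier
```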
6 Strategies for Meta-Learning
7 Meta-Learning
8 Meta-Learning
- Benefits of meta-learning:
- The base learning processes can be executed in parallel.
- Serial code can be reused without parallelizing it.
- The smaller data subsets can fit in main memory.
9 Resampling Techniques
- Classifier combination techniques such as bagging (Breiman 1994) and boosting (Freund and Schapire 1996) rely on resampling.
- They build a classifier from a sequence of training sets, each with n observations.
- Instead of obtaining the sequence of training samples from real life, it is generated artificially.
- The technique used here is bootstrapping; the final classifier is then obtained by voting the sequence of classifiers built from the bootstrap samples (see the sketch below).
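A minimal bagging sketch in R; rpart as the base classifier and the class column name y are illustrative assumptions.

```r
# Hedged bagging sketch: build B classifiers on bootstrap samples of the
# training set and combine them by majority vote on the test set.
library(rpart)

bagging_vote <- function(train, test, B = 25) {
  # 'train' and 'test' are data frames; the class column is assumed to be 'y'
  votes <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]          # bootstrap sample
    as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))         # vote per test row
}
```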
10 Bootstrapping Process
11 Parallel Bootstrapping
- Beddo 2002.
- 1) The master process sends the data set to all nodes.
- 2) Each node produces approximately B/p classifiers from B/p bootstrap samples (where B is the number of bootstrap samples and p the number of nodes).
- 3) Finally, all nodes perform a reduction, and the master process obtains the final classifier by voting their predictions (see the sketch below).
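A hedged snow-based sketch of this static B/p scheme; the object names train/test, the rpart base learner, and the cluster size are assumptions (Beddo's implementation is MPI-based).

```r
# Sketch: the master ships the data to all p workers, each worker builds
# about B/p classifiers from its own bootstrap samples, and the master
# combines all predictions by voting.
library(snow)

p  <- 4
B  <- 100
cl <- makeCluster(p, type = "SOCK")
clusterExport(cl, c("train", "test"))                  # 1) send the data set to all nodes

chunks <- split(seq_len(B), rep(1:p, length.out = B))  # ~B/p bootstrap samples per node
votes  <- clusterApply(cl, chunks, function(ids) {     # 2) local bootstrap classifiers
  library(rpart)
  sapply(ids, function(i) {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
  })
})
stopCluster(cl)

votes <- do.call(cbind, votes)                         # 3) reduction on the master
final <- apply(votes, 1, function(v) names(which.max(table(v))))
```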
12 Discussion
- Beddo concludes that resampling techniques are parallel in nature and reach near-linear speedup because little communication is performed between processors.
- We propose a natural dynamic load partition to avoid the static partition.
- The master process gives one bootstrap training set to each slave.
- Each slave computes its classifier, classifies the test data, and sends the result back to the master process.
- The master process joins the classifiers obtained from the slaves and hands out another bootstrap sample. This continues until no bootstrap samples are left (see the sketch below).
- In boosting, each bootstrap sample depends on the previous classifier, so we do not split the bootstrap samples between processes. Instead, for each bootstrap sample, every process classifies its partial test sample to find the partial errors and then sends its results to the master process.
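A hedged sketch of the proposed dynamic scheme using snow's clusterApplyLB, which hands one bootstrap sample at a time to whichever worker is idle; the data names and base learner are the same assumptions as before.

```r
# Dynamic load partition: each task is one bootstrap sample; clusterApplyLB
# dispatches tasks to workers as they become free, instead of a fixed B/p split.
library(snow)

cl <- makeCluster(4, type = "SOCK")
clusterExport(cl, c("train", "test"))
votes <- clusterApplyLB(cl, seq_len(100), function(i) {
  library(rpart)
  boot <- train[sample(nrow(train), replace = TRUE), ]             # one bootstrap sample
  as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
})
stopCluster(cl)
final <- apply(do.call(cbind, votes), 1, function(v) names(which.max(table(v))))
```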
13 Parallel Data Mining
- There are two main approaches:
- Task parallelism. Processors are divided into subgroups, and subtasks are assigned to processor subgroups based on the cost of processing each subtask.
- Data parallelism. Tasks are solved in parallel using all the processors, but this may suffer from load imbalance because the data may not be uniformly spread across the processors.
14 The K-Nearest Neighbors
- Cover and Hart 1967.
- Given an unknown sample, the k-nearest-neighbor algorithm searches the pattern space for the k training samples that are closest (using Euclidean distance) to the unknown sample (see the sketch below).
15 Parallel K-Nearest Neighbors
- Ruoming and Agrawal 2001.
- 1) The training set is distributed among the nodes.
- 2) Given an unknown sample, each node processes the training samples it owns to calculate the nearest neighbors locally.
- 3) A global reduction computes the overall k nearest neighbors from the local k nearest neighbors of each node (see the sketch below).
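A hedged snow sketch of this row-partitioned scheme; the original work uses its own middleware, so the cluster setup and names here are assumptions.

```r
# Each worker holds a block of training rows, finds its local k nearest
# neighbors of the query x, and the master reduces the local candidates
# to the global k nearest and takes a majority vote.
library(snow)

parallel_knn <- function(cl, train, labels, x, k = 3) {
  p      <- length(cl)
  blocks <- split(seq_len(nrow(train)), rep(1:p, length.out = nrow(train)))
  local  <- clusterApply(cl, blocks, function(rows, train, labels, x, k) {
    d <- sqrt(rowSums(sweep(train[rows, , drop = FALSE], 2, x)^2))
    o <- order(d)[seq_len(min(k, length(d)))]            # local k nearest
    list(dist = d[o], label = as.character(labels[rows][o]))
  }, train, labels, x, k)
  d   <- unlist(lapply(local, `[[`, "dist"))             # gather local candidates
  lab <- unlist(lapply(local, `[[`, "label"))
  nn  <- order(d)[1:k]                                   # global reduction on the master
  names(which.max(table(lab[nn])))                       # majority vote
}
```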
16 Discussion
- Ruoming concluded that his algorithm achieves high efficiency on both distributed-memory and shared-memory machines.
- However, when the number of instances is smaller than the number of features (as in the Golub data), the algorithm does not give good performance.
- We propose another parallel algorithm for this problem.
- Split the data by features; for each object, compute the partial distances from that object to the training set.
- After that, a global reduction is performed, and the master process finds the k nearest neighbors (see the sketch below).
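A hedged snow sketch of the proposed feature-based split; the column partitioning and names are assumptions.

```r
# Each worker holds a block of columns (features) and returns the partial
# squared distances from the query x to every training row over its features;
# the master sums the partials (global reduction) and picks the k nearest.
library(snow)

feature_knn <- function(cl, train, labels, x, k = 3) {
  p    <- length(cl)
  cols <- split(seq_len(ncol(train)), rep(1:p, length.out = ncol(train)))
  partial <- clusterApply(cl, cols, function(js, train, x) {
    rowSums(sweep(train[, js, drop = FALSE], 2, x[js])^2)   # partial squared distances
  }, train, x)
  d  <- Reduce(`+`, partial)               # full squared distances after the reduction
  nn <- order(d)[1:k]
  names(which.max(table(labels[nn])))      # majority vote among the k nearest
}
```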
17 R Wrappers for MPI
- Rmpi (Hao Yu)
  - Provides an interface (wrapper) to the MPI (Message-Passing Interface) APIs.
  - It also provides an interactive R slave environment in which distributed statistical computing can be carried out.
  - Maintainer: Hao Yu <hyu_at_stats.uwo.ca>
- snow (Luke Tierney, A. J. Rossini, Na Li, H. Sevcikova)
  - Simple Network of Workstations (snow).
  - Support for simple parallel computing in R (a minimal usage example follows).
  - Maintainer: Luke Tierney <luke_at_stat.uiowa.edu>
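A minimal snow usage example; the 4-worker socket cluster on a single machine is an assumption.

```r
library(snow)
cl <- makeCluster(4, type = "SOCK")     # spawn 4 local worker processes
parSapply(cl, 1:8, function(i) i^2)     # evaluate the function on the workers
stopCluster(cl)
```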
18 Experimental Results
- Data sets:
  - Landsat
  - Shuttle
- Cluster of 5 Intel Xeon (TM) 2.8 GHz CPUs with 3 GB of RAM.
- Cluster of 4 HP Intel Itanium nodes, each with two 1.5 GHz CPUs and 1.5 GB of RAM.
22 K-means
- K-means algorithm:
- 1) Initialize k points (m_j).
- 2) For each X_i, determine the point m_j to which it is closest and make X_i a member of that cluster.
- 3) Find the new m_j's as the averages of the points in their clusters.
- 4) Repeat 2) and 3) until convergence (a minimal R sketch follows).
- Hard K-means (Duda 1973): find the minimum-variance clustering of the data into k clusters, i.e., minimize
  J = \sum_{j=1}^{k} \sum_{X_i \in C_j} \lVert X_i - m_j \rVert^2
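A minimal R sketch of the algorithm above; random initialization and a fixed iteration cap are simplifications.

```r
# Lloyd-style k-means on a numeric matrix X: assign each point to its closest
# centroid, then move each centroid to the mean of its cluster; repeat.
simple_kmeans <- function(X, k, iters = 20) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]            # 1) initialize k points
  for (it in seq_len(iters)) {
    d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
    cluster <- max.col(-d)                                    # 2) closest centroid per point
    centers <- t(sapply(1:k, function(j)                      # 3) new centroids as averages
      colMeans(X[cluster == j, , drop = FALSE])))             #    (assumes no cluster empties)
  }
  list(centers = centers, cluster = cluster)
}
```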
23 K-means
Assign each point to the closest cluster center.
24 Parallel K-means
- Dhillon and Modha 1999.
- 1) The master process distributes the initial k points (m_j) to all nodes.
- 2) The master process distributes the data points X_1, X_2, ..., X_n equally to all nodes.
- 3) Each node determines the centroid to which each of its X_i is closest, one X_i at a time. For each m_j it maintains the sum and the count of the X_i assigned to it, and it sends its cluster labels to the master process.
- 4) At each iteration, the master process computes the new m_j's and sends them to the slave processes.
- 5) Repeat 3) and 4) until convergence (see the snow sketch below).
25 Discussion
- The parallel algorithm proposed by Dhillon has almost linear speedup when the data have more instances than features and the processors have the same computational speed.
- When there are more features than instances, or when the processors have different speeds, the algorithm scales poorly.
- Fortunately, there are other ways to parallelize this algorithm.
- Feature-based partition, in which the data are split by features.
- Once the master process sends the data, each process computes its respective partial sums and sends them back to the master, which joins the partial results and then computes the new centroids.
26 Parallel K-means Run Time (Rmpi)
28 Decision Tree
- ID3 (Quinlan 1986), C4.5 (Quinlan 1993).
- A decision tree T encodes a classifier or regression function in the form of a tree.
- Encoded classifier (see the rpart example after the diagram):
- If (age < 30 and carType = Minivan) then YES
- If (age < 30 and (carType = Sports or carType = Truck)) then NO
- If (age > 30) then YES
[Tree diagram: the root's splitting predicate is Age (< 30 / > 30); the < 30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO); the > 30 branch: YES.]
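A hedged R example fitting a tree of this shape with rpart (CART, a close relative of the ID3/C4.5 family named above); the toy data frame and its column values are invented for illustration.

```r
library(rpart)

toy <- data.frame(
  age     = c(23, 25, 45, 52, 28, 61, 33, 24),
  carType = factor(c("Minivan", "Sports", "Minivan", "Truck",
                     "Truck",   "Sports", "Minivan", "Minivan")),
  buy     = factor(c("YES", "NO", "YES", "YES", "NO", "YES", "YES", "YES"))
)
fit <- rpart(buy ~ age + carType, data = toy,
             control = rpart.control(minsplit = 2, cp = 0))   # let the toy tree grow fully
print(fit)                                                    # splits encode rules like those above
predict(fit,
        data.frame(age = 27, carType = factor("Minivan", levels = levels(toy$carType))),
        type = "class")
```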
29 Decision Tree
30 C4.5 Decision Tree
31 Feature-Based Parallelization
Taner et al. 2004
32 Node-Based Parallelization
Taner et al. 2004
33 Data-Based Parallelization
Taner et al. 2004
35 Discussion
- Taner et al. 2004:
- Feature-based parallelization has better speedup than node-based parallelization when the data sets have more instances than features; otherwise, node-based parallelization has better speedup.
- Amado et al. 2003:
- Their hybrid algorithm can be characterized as using both data parallelism (feature-based or data-based) and node-based parallelism.
- For nodes covering a significant number of examples, data parallelism is used to avoid the load-imbalance problem. For nodes covering fewer examples, the cost of communication can be higher than the time spent processing the examples.
- To avoid this problem, one of the processes continues the construction of the tree rooted at that node alone (node-based parallelism). Usually, the switch between data parallelism and task parallelism (node-based parallelism) is made when the communication cost exceeds the processing and data-transfer cost.
36 Types of Clustering Algorithms
- Hierarchical
  - Agglomerative
    - Initially, each record is its own cluster.
    - Repeatedly combine the 2 most similar clusters (see the short example after this list).
  - Divisive
    - Initially, all records are in a single cluster.
    - Repeatedly split 1 cluster into 2 smaller clusters.
- Partition-based
  - Partition the records into k clusters.
  - Optimize an objective function measuring the goodness of the clustering.
  - Heuristic optimization via hill-climbing:
    - Random choice of initial clusters.
    - Iteratively refine the clusters by re-assigning records when doing so improves the objective function.
    - Eventually converge to a local maximum.
37 Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Rules discovered:
- Milk --> Coke
- Diaper, Milk --> Beer
38 Parallel Association Rules
- Zaki (1998) uses a lattice-theoretic approach to represent the space of frequent itemsets and partitions it into smaller subspaces for parallel processing.
- Each node in the lattice represents a frequent itemset.
- He proposed 4 parallel algorithms, and they require only 2 database scans (a small serial example follows).
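A hedged serial example with the arules package; its apriori/eclat functions are serial implementations, not Zaki's parallel algorithms, and the toy baskets simply echo the rules on the earlier slide.

```r
# Mine association rules from a few toy market baskets.
library(arules)

baskets <- list(c("Milk", "Coke"),
                c("Diaper", "Milk", "Beer"),
                c("Diaper", "Milk", "Beer", "Coke"),
                c("Milk", "Coke"))
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.7))
inspect(rules)                    # includes, e.g., {Diaper, Milk} => {Beer}
```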
39 Any Questions?
Thank you.