Title: Parallel and Distributed Data Mining: A Review
1 Parallel and Distributed Data Mining: A Review
- PhD student: Elio Lozano
- Advisor: Dr. Edgar Acuña
- University of Puerto Rico
- Mayagüez Campus
2 What is Data Mining?
- Data mining is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al. 2001).
- It uses machine learning and statistical techniques to discover knowledge and present it in a form that is easy for humans to comprehend.
- Data mining is an application of a more general problem called pattern recognition.
3 Parallel and Distributed Data Mining
- Distributed data mining techniques
  - Meta-learning, distributed classification
- Parallel data mining
  - Association rule algorithms, classification trees, clustering, discriminant methods, neural networks, genetic algorithms
4 Meta-Learning
- Finds global classifiers from distributed databases.
- Learning from learned knowledge (Chan and Stolfo 1997).
- Java Agents for Meta-learning (JAM) (Prodromidis et al. 1999).
5 Distributed Data Mining
- Base classifiers are trained on local training sets.
- Predictions on a validation set are generated by each learned classifier.
- The meta-level training set is composed of the validation-set labels and the base classifiers' predictions on the validation set.
- The final classifier is trained on the meta-level training set (see the sketch below).
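A minimal stacking-style sketch of this scheme in R; rpart as the base learner and the built-in iris data standing in for the local and validation sets are assumptions, not part of the original work.

```r
# Hedged sketch: two base classifiers are trained on a (local) training set,
# their class predictions on a validation set form the meta-level training
# set, and a final (meta-level) classifier is trained on it.
library(rpart)

set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]                                    # local training set
valid <- iris[-idx, ]                                   # validation set

base1 <- rpart(Species ~ ., data = train)                          # base classifier 1
base2 <- rpart(Species ~ Sepal.Length + Sepal.Width, data = train) # base classifier 2

meta_train <- data.frame(                               # meta-level training set:
  p1      = predict(base1, valid, type = "class"),      # base predictions on validation set
  p2      = predict(base2, valid, type = "class"),
  Species = valid$Species                               # plus the true validation labels
)
final <- rpart(Species ~ p1 + p2, data = meta_train)    # final classifier
```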
6 Strategies for Meta-Learning
7 Meta-Learning
8 Meta-Learning
- Benefits of meta-learning:
- The base learning processes can be executed in parallel.
- Serial code can be reused without parallelizing it.
- The smaller data subsets can fit in main memory.
9 Resampling Techniques
- Classifier combination techniques such as bagging (Breiman 1994) and boosting (Freund and Schapire 1996) rely on resampling.
- They build a classifier from a sequence of training sets, each with n observations.
- Instead of obtaining the sequence of training samples from real life, it is generated artificially.
- The technique used here is bootstrapping; the final classifier is then obtained by voting the sequence of classifiers built from the bootstrap samples (see the sketch below).
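A minimal bagging sketch in R; rpart as the base classifier and the class column name y are illustrative assumptions.

```r
# Hedged bagging sketch: build B classifiers on bootstrap samples of the
# training set and combine them by majority vote on the test set.
library(rpart)

bagging_vote <- function(train, test, B = 25) {
  # 'train' and 'test' are data frames; the class column is assumed to be 'y'
  votes <- replicate(B, {
    boot <- train[sample(nrow(train), replace = TRUE), ]          # bootstrap sample
    as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))         # vote per test row
}
```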
10 Bootstrapping Process
11 Parallel Bootstrapping
- Beddo 2002.
- 1) The master process sends the data set to all nodes.
- 2) Each node produces approximately B/p classifiers from B/p bootstrap samples (where B is the number of bootstrap samples and p the number of nodes).
- 3) Finally, all nodes perform a reduction, and the master process obtains the final classifier by voting their predictions (see the sketch below).
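A hedged snow-based sketch of this static B/p scheme; the object names train/test, the rpart base learner, and the cluster size are assumptions (Beddo's implementation is MPI-based).

```r
# Sketch: the master ships the data to all p workers, each worker builds
# about B/p classifiers from its own bootstrap samples, and the master
# combines all predictions by voting.
library(snow)

p  <- 4
B  <- 100
cl <- makeCluster(p, type = "SOCK")
clusterExport(cl, c("train", "test"))                  # 1) send the data set to all nodes

chunks <- split(seq_len(B), rep(1:p, length.out = B))  # ~B/p bootstrap samples per node
votes  <- clusterApply(cl, chunks, function(ids) {     # 2) local bootstrap classifiers
  library(rpart)
  sapply(ids, function(i) {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
  })
})
stopCluster(cl)

votes <- do.call(cbind, votes)                         # 3) reduction on the master
final <- apply(votes, 1, function(v) names(which.max(table(v))))
```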
12 Discussion
- Beddo concludes that resampling techniques are parallel in nature and reach near-linear speedup because little communication is performed between processors.
- We propose a natural dynamic load partition to avoid the static partition.
- The master process gives one bootstrap training set to each slave.
- Each slave computes its classifier, classifies the test data, and sends the result back to the master process.
- The master process joins the classifiers obtained from the slaves and hands out another bootstrap sample. This continues until no bootstrap samples are left (see the sketch below).
- In boosting, each bootstrap sample depends on the previous classifier, so we do not split the bootstrap samples between processes. Instead, for each bootstrap sample, every process classifies its partial test sample to find the partial errors and then sends its results to the master process.
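A hedged sketch of the proposed dynamic scheme using snow's clusterApplyLB, which hands one bootstrap sample at a time to whichever worker is idle; the data names and base learner are the same assumptions as before.

```r
# Dynamic load partition: each task is one bootstrap sample; clusterApplyLB
# dispatches tasks to workers as they become free, instead of a fixed B/p split.
library(snow)

cl <- makeCluster(4, type = "SOCK")
clusterExport(cl, c("train", "test"))
votes <- clusterApplyLB(cl, seq_len(100), function(i) {
  library(rpart)
  boot <- train[sample(nrow(train), replace = TRUE), ]             # one bootstrap sample
  as.character(predict(rpart(y ~ ., data = boot), test, type = "class"))
})
stopCluster(cl)
final <- apply(do.call(cbind, votes), 1, function(v) names(which.max(table(v))))
```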
13 Parallel Data Mining
- There are two main approaches:
- Task parallelism. Processors are divided into subgroups, and subtasks are assigned to processor subgroups based on the cost of processing each subtask.
- Data parallelism. Tasks are solved in parallel using all the processors, but this may suffer from load imbalance because the data may not be uniformly spread across the processors.
14 The K-Nearest Neighbors
- Cover and Hart 1967.
- Given an unknown sample, the k-nearest-neighbor algorithm searches the pattern space for the k training samples that are closest (using Euclidean distance) to the unknown sample (see the sketch below).
15 Parallel K-Nearest Neighbors
- Ruoming and Agrawal 2001.
- 1) The training set is distributed among the nodes.
- 2) Given an unknown sample, each node processes the training samples it owns to calculate the nearest neighbors locally.
- 3) A global reduction computes the overall k nearest neighbors from the local k nearest neighbors of each node (see the sketch below).
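A hedged snow sketch of this row-partitioned scheme; the original work uses its own middleware, so the cluster setup and names here are assumptions.

```r
# Each worker holds a block of training rows, finds its local k nearest
# neighbors of the query x, and the master reduces the local candidates
# to the global k nearest and takes a majority vote.
library(snow)

parallel_knn <- function(cl, train, labels, x, k = 3) {
  p      <- length(cl)
  blocks <- split(seq_len(nrow(train)), rep(1:p, length.out = nrow(train)))
  local  <- clusterApply(cl, blocks, function(rows, train, labels, x, k) {
    d <- sqrt(rowSums(sweep(train[rows, , drop = FALSE], 2, x)^2))
    o <- order(d)[seq_len(min(k, length(d)))]            # local k nearest
    list(dist = d[o], label = as.character(labels[rows][o]))
  }, train, labels, x, k)
  d   <- unlist(lapply(local, `[[`, "dist"))             # gather local candidates
  lab <- unlist(lapply(local, `[[`, "label"))
  nn  <- order(d)[1:k]                                   # global reduction on the master
  names(which.max(table(lab[nn])))                       # majority vote
}
```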
16 Discussion
- Ruoming concluded that his algorithm achieves high efficiency on both distributed-memory and shared-memory machines.
- However, when the number of instances is smaller than the number of features (as in the Golub data), the algorithm does not give good performance.
- We propose another parallel algorithm for this problem.
- Split the data by features; for each object, compute the partial distances from that object to the training set.
- After that, a global reduction is performed, and the master process finds the k nearest neighbors (see the sketch below).
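A hedged snow sketch of the proposed feature-based split; the column partitioning and names are assumptions.

```r
# Each worker holds a block of columns (features) and returns the partial
# squared distances from the query x to every training row over its features;
# the master sums the partials (global reduction) and picks the k nearest.
library(snow)

feature_knn <- function(cl, train, labels, x, k = 3) {
  p    <- length(cl)
  cols <- split(seq_len(ncol(train)), rep(1:p, length.out = ncol(train)))
  partial <- clusterApply(cl, cols, function(js, train, x) {
    rowSums(sweep(train[, js, drop = FALSE], 2, x[js])^2)   # partial squared distances
  }, train, x)
  d  <- Reduce(`+`, partial)               # full squared distances after the reduction
  nn <- order(d)[1:k]
  names(which.max(table(labels[nn])))      # majority vote among the k nearest
}
```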
17 R Wrappers for MPI
- Rmpi (Hao Yu)
  - Provides an interface (wrapper) to the MPI (Message-Passing Interface) APIs.
  - It also provides an interactive R slave environment in which distributed statistical computing can be carried out.
  - Maintainer: Hao Yu <hyu_at_stats.uwo.ca>
- snow (Luke Tierney, A. J. Rossini, Na Li, H. Sevcikova)
  - Simple Network of Workstations (snow).
  - Support for simple parallel computing in R (a minimal usage example follows).
  - Maintainer: Luke Tierney <luke_at_stat.uiowa.edu>
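A minimal snow usage example; the 4-worker socket cluster on a single machine is an assumption.

```r
library(snow)
cl <- makeCluster(4, type = "SOCK")     # spawn 4 local worker processes
parSapply(cl, 1:8, function(i) i^2)     # evaluate the function on the workers
stopCluster(cl)
```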
18 Experimental Results
- Data sets:
  - Landsat
  - Shuttle
- Cluster of 5 Intel Xeon (TM) 2.8 GHz CPUs with 3 GB of RAM.
- Cluster of 4 HP Intel Itanium nodes, each with two 1.5 GHz CPUs and 1.5 GB of RAM.
22 K-means
- K-means algorithm:
- 1) Initialize k points (m_j).
- 2) For each X_i, determine the point m_j to which it is closest and make X_i a member of that cluster.
- 3) Find the new m_j's as the averages of the points in their clusters.
- 4) Repeat 2) and 3) until convergence (a minimal R sketch follows).
- Hard K-means (Duda 1973): find the minimum-variance clustering of the data into k clusters, i.e., minimize
  J = \sum_{j=1}^{k} \sum_{X_i \in C_j} \lVert X_i - m_j \rVert^2
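A minimal R sketch of the algorithm above; random initialization and a fixed iteration cap are simplifications.

```r
# Lloyd-style k-means on a numeric matrix X: assign each point to its closest
# centroid, then move each centroid to the mean of its cluster; repeat.
simple_kmeans <- function(X, k, iters = 20) {
  centers <- X[sample(nrow(X), k), , drop = FALSE]            # 1) initialize k points
  for (it in seq_len(iters)) {
    d <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
    cluster <- max.col(-d)                                    # 2) closest centroid per point
    centers <- t(sapply(1:k, function(j)                      # 3) new centroids as averages
      colMeans(X[cluster == j, , drop = FALSE])))             #    (assumes no cluster empties)
  }
  list(centers = centers, cluster = cluster)
}
```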
23 K-means
Assign each point to the closest cluster center.
24 Parallel K-means
- Dhillon and Modha 1999.
- 1) The master process distributes the initial k points (m_j) to all nodes.
- 2) The master process distributes the data points X_1, X_2, ..., X_n equally to all nodes.
- 3) Each node determines the centroid to which each of its X_i is closest, one X_i at a time. For each m_j it maintains the sum and the count of the X_i assigned to it, and it sends its cluster labels to the master process.
- 4) At each iteration, the master process computes the new m_j's and sends them to the slave processes.
- 5) Repeat 3) and 4) until convergence (see the snow sketch below).
25 Discussion
- The parallel algorithm proposed by Dhillon has almost linear speedup when the data have more instances than features and the processors have the same computational speed.
- When there are more features than instances, or when the processors have different speeds, the algorithm scales poorly.
- Fortunately, there are other ways to parallelize this algorithm.
- Feature-based partition, in which the data are split by features.
- Once the master process sends the data, each process computes its respective partial sums and sends them back to the master, which joins the partial results and then computes the new centroids.
26 Parallel K-means Run Time (Rmpi)
28 Decision Tree
- ID3 (Quinlan 1986), C4.5 (Quinlan 1993).
- A decision tree T encodes a classifier or regression function in the form of a tree.
- Encoded classifier (see the rpart example after the diagram):
- If (age < 30 and carType = Minivan) then YES
- If (age < 30 and (carType = Sports or carType = Truck)) then NO
- If (age > 30) then YES
[Tree diagram: the root's splitting predicate is Age (< 30 / > 30); the < 30 branch splits on Car Type (Minivan: YES; Sports, Truck: NO); the > 30 branch: YES.]
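A hedged R example fitting a tree of this shape with rpart (CART, a close relative of the ID3/C4.5 family named above); the toy data frame and its column values are invented for illustration.

```r
library(rpart)

toy <- data.frame(
  age     = c(23, 25, 45, 52, 28, 61, 33, 24),
  carType = factor(c("Minivan", "Sports", "Minivan", "Truck",
                     "Truck",   "Sports", "Minivan", "Minivan")),
  buy     = factor(c("YES", "NO", "YES", "YES", "NO", "YES", "YES", "YES"))
)
fit <- rpart(buy ~ age + carType, data = toy,
             control = rpart.control(minsplit = 2, cp = 0))   # let the toy tree grow fully
print(fit)                                                    # splits encode rules like those above
predict(fit,
        data.frame(age = 27, carType = factor("Minivan", levels = levels(toy$carType))),
        type = "class")
```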
29 Decision Tree
30 C4.5 Decision Tree
31 Feature-Based Parallelization
Taner et al. 2004
32 Node-Based Parallelization
Taner et al. 2004
33 Data-Based Parallelization
Taner et al. 2004
35 Discussion
- Taner et al. 2004:
- Feature-based parallelization has better speedup than node-based parallelization when the data sets have more instances than features; otherwise, node-based parallelization has better speedup.
- Amado et al. 2003:
- Their hybrid algorithm can be characterized as using both data parallelism (feature-based or data-based) and node-based parallelism.
- For nodes covering a significant number of examples, data parallelism is used to avoid the load-imbalance problem. For nodes covering fewer examples, the cost of communication can be higher than the time spent processing the examples.
- To avoid this problem, one of the processes continues the construction of the tree rooted at that node alone (node-based parallelism). Usually, the switch between data parallelism and task parallelism (node-based parallelism) is made when the communication cost exceeds the processing and data-transfer cost.
36 Types of Clustering Algorithms
- Hierarchical
  - Agglomerative
    - Initially, each record is its own cluster.
    - Repeatedly combine the 2 most similar clusters (see the short example after this list).
  - Divisive
    - Initially, all records are in a single cluster.
    - Repeatedly split 1 cluster into 2 smaller clusters.
- Partition-based
  - Partition the records into k clusters.
  - Optimize an objective function measuring the goodness of the clustering.
  - Heuristic optimization via hill-climbing:
    - Random choice of initial clusters.
    - Iteratively refine the clusters by re-assigning records when doing so improves the objective function.
    - Eventually converge to a local maximum.
37 Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Rules discovered:
- Milk --> Coke
- Diaper, Milk --> Beer
38 Parallel Association Rules
- Zaki (1998) uses a lattice-theoretic approach to represent the space of frequent itemsets and partitions it into smaller subspaces for parallel processing.
- Each node in the lattice represents a frequent itemset.
- He proposed 4 parallel algorithms, and they require only 2 database scans (a small serial example follows).
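A hedged serial example with the arules package; its apriori/eclat functions are serial implementations, not Zaki's parallel algorithms, and the toy baskets simply echo the rules on the earlier slide.

```r
# Mine association rules from a few toy market baskets.
library(arules)

baskets <- list(c("Milk", "Coke"),
                c("Diaper", "Milk", "Beer"),
                c("Diaper", "Milk", "Beer", "Coke"),
                c("Milk", "Coke"))
trans <- as(baskets, "transactions")
rules <- apriori(trans, parameter = list(support = 0.4, confidence = 0.7))
inspect(rules)                    # includes, e.g., {Diaper, Milk} => {Beer}
```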
39 Any Questions?
Thank you.