Title: The Road to a Ph'D'
1Distributed Classification in Peer-to-Peer
Networks
Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi
2Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
3Research Motivation
- Widespread use of P2P networks and sensor
networks - Data to be analyzed are distributed on nodes of
these large-scale dynamic networks - Traditional distributed data mining algorithms
must be extended to fit this new environment
- Motivating Examples
- - P2P anti-spam networks
- - Automatic organization of web documents in
P2P - environments
- A distributed classification algorithm is
critical in these applications.
4Research Motivation (2)
- New Challenges
- - highly decentralized peers, do not have
the notion - of clients and servers
- - including hundreds or thousands of nodes,
- impossible global synchronization
- - frequent topology changes caused by
frequent - failure and recovery of peers
- Algorithm Requirements
- - scalability, decentralized in-network
processing - - communication efficient, local synchronism
- - fault-tolerance
5Problem Formulation
- Given
- A connected topology graph
- Each peer owns its local training data
for classification - Local neighborhood change is informed to each
peer real-timely - Find
- Classification paradigm in this setting
- Including how to train and use a global
classifier - Objective
- Scalability, communication-efficient,
decentralized in-network processing,
fault-tolerance - Constraints
- Each peer can only communicate with its immediate
neighbors - The network topology changes dynamically
6Our Contributions
- An algorithm to build an ensemble classifier for
distributed classification in P2P networks by
plurality voting on all the local classifiers - Adapt the training paradigm of pasting bites for
building local classifiers - An algorithm of (restrictive) Distributed
Plurality Voting (DPV) to combine the decisions
of local classifiers - Correctness
- Optimality
- Extensive Experimental Evaluation
- Communication overhead and convergence time of
DPV - Accuracy comparison with centralized
classification
7Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
8Building Local Classifiers
- Pasting Bites by Breiman JML99
- Generating small bites of the data by importance
sampling based on the out-of-bag error of
classifiers built so far - Stopping criteria difference of errors between
two successive iteration is below a threshold - Voting uniformly all the classifiers
- The more data on a local node, the more
classifiers generated on it, the more votes it
owns.
9Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
10Problem Formulation Of DPV
- Given
- A group of peers in a graph
would like to agree on one of options. - Each peer conveys its preference by
initializing a voting vector ,
where is the number of votes on the
i-th option. - Local neighborhood change is informed to each
peer real-timely - Find
- The option with the largest number of votes over
all peers - Objective
- Scalability, communication-efficient,
decentralized in-network processing,
fault-tolerance - Constraints
- Each peer can only communicate with its immediate
neighbors - The network topology changes dynamically
11An Example Of DPV
- The third option is the answer.
12Comparison Between DPV and Distributed Majority
Voting (DMV, by Wolff et al. TSMC04)
- DMV Given
- A group of peers in a graph
- Each peer conveys its preference by
initializing a 2-tuple
, where stands for the number of
the votes for certain option and
stands for the number of the total vote on this
peer. - The majority ratio
- DMV Find
- Check whether the voting proportion of the
specified option is above - DMV Converted to DPV
- Replacing the 2-tuple
on each peer with the voting vector
13Comparison Between DPV and DMV (2)
- DPV vs. DMV
- DPV is a multi-values function while DMV is a
binary predicate. - DMV can be solved by converting it to DPV.
- However, DMV can only solve 2-option DPV
problems. For a d-option DPV problem, pairwise
comparisons among all d options must be performed
by DMV for times (Multiple Choice
Voting TSMC04). - DPV finds the maximally supported option
directly, and thus saves a lot of communication
overhead and the time for convergence. - DPV is the general form of DMV
14Challenges for DPV
- No central server to add all voting vectors, Only
communication between immediate neighbors - Dynamic change of not only the network topology
but also the local voting vectors - Supporting not only one-shot query, but also
continuous monitor the current voting result
according to the latest network status
15DPV Protocol Overview
- Assumption
- it includes a mechanism to maintain an
un-directional spanning tree for the dynamic P2P
network. The protocol performs on this tree
(duplicate insensitive). - A node is informed of changes in the status of
adjacent nodes. - Protocol Overview
- Each node performs the same algorithm
independently - Specify how nodes initialize and react under
different situations a message received,
neighboring node detached or joined, the local
voting vector changed - When the node status changes under the above
situation, the node notifies this change to the
other neighbors only if the condition for sending
messages satisfies. - To guarantee that each node in the network
converges toward the correct plurality
16The Condition for Sending Messages
No Message Sent
(5,2,1)(2,0,0)(7,2,1) 7-25
7-16 (8,4,1)(2,0,0)(10,4,1) 10-46
10-19 6gt5 9gt6 The differences between
the votes of maximally voted option and all other
options do not decrease.
17The Condition for Sending Messages (2)
Message Sent
(5,2,1)(2,0,0)(7,2,1) 7-25
7-16 (8,6,1)(2,0,0)(10,6,1) 10-44
10-19 4lt5 9gt6 The differences between
the votes of maximally voted option and any other
option decrease.
18The Correctness of DPV Protocol
All the nodes converge to the same result.
The difference between the actual votes of
maximally voted option and any other option is
not smaller than what the protocol have
sent. Then, all the nodes converge to the right
result.
19The Optimality of DPV Protocol
C1 is more restrictive than C2, iff, for any
input case if C1 is true then C2 is true. C1 is
strictly more restrictive than C2, iff, C1 is
more restrictive than C2 and there at least
exists an input case such that C1 is false and C2
is true.
is the most restrictive condition for
sending messages to keep the correctness of the
DPV protocol. It is the condition, which is the
most difficult to satisfy. In this sense, it
guarantees the optimality in communication
overhead.
20The Extension of DPV Protocol
Restrictive Distributed Plurality Voting
output the maximally voted option whose
proportion to all the votes is above a
user-specified threshold.
It can be used in a classification ensemble in a
restrictive manner by leaving out some uncertain
instances.
The new condition for sending messages is based
on the spirit of .
21Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
22Accuracy of P2P Classification
Data covtype (58101254, 7 classes) from the UCI
database, distributed onto 500 nodes
23The Performance of DPV Protocol
- Experimental Parameters
- Difference types of network topology Power-law
Graph, Random Graph, Grid - Number of nodes 500, 1000, 2000, 4000, 8000,
16000 - 7-option DPV problems
- Experimental Metrics
- The average communication overhead for each node
- The convergence time of the protocol for one-shot
query
24The Performance of DPV Protocol (2)
- DPV0 vs. RANK (Multiple Choice Voting)
- 500 nodes
- Averaging 2000 instances of 7-option plurality
voting problems
a and b are the largest and second largest
options, respectively.
25The Performance of DPV Protocol (3)
- The Scalability of DPV0
- Different number of nodes vs. communication
overhead of each node
26The Performance of DPV Protocol (4)
- The Local Optimality of DPV0
- Communication overhead and convergence time under
different conditions for sending messages
27Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
28Related Work - Ensemble Classifiers
- Model Combination (weighted) voting,
meta-learning - For Centralized Data
- applying different learning algorithms with
heterogeneous models - applying a single learning algorithm to different
versions of the data - Bagging random sampling with replacement
- Boosting re-weighting of the mis-classified
training examples - Pasting Bites generating small bites of the data
by importance sampling based on the quality of
classifiers built so far - For Distributed Data
- distributed boosting by Lazarevic et al.
sigkdd01 - distributed approach to pasting small bites by
Chawla et al. JMLR04, which uniformly votes
hundreds or thousands of classifiers built on all
distributed data sites
29Related Work - P2P Data Mining
- Primitive Aggregates
- Average
- Count, Sum
- Max, Min
- Distributed Majority Voting by Wolff et al.
TSMC04 - P2P Data Mining Algorithms
- P2P Association Rule Mining by Wolff et al.
TSMC04 - P2P K-means clustering by Datta et al. SDM06
- P2P L2 threshold monitor by Wolff et al. SDM06
- Outlier detection in wireless sensor networks by
Branch et al. ICDCS06 - A classification framework in P2P networks by
Siersdorfer et al. ECIR06 - Limitations local classifiers propagations,
experiments on 16 peers, only focusing on the
accuracy issue, without involving any dynamism of
P2P networks.
30Overview
- Introduction
- Building Local Classifiers
- Distributed Plurality Voting
- Experimental Results
- Related Works
- Summary
31Summary
- Proposed an ensemble paradigm for distributed
classification in P2P networks - Formalized a generalized Distributed Plurality
Voting (DPV) protocol for P2P networks - The property of DPV0
- supporting both one-shot query and continuous
monitor - theoretical local optimality in terms of
communication overhead - outperforms alternative approaches
- scale up to large networks
32Q. A.
Acknowledgement