The Road to a Ph'D' - PowerPoint PPT Presentation

About This Presentation

Title:

The Road to a Ph'D'

Description:

Data to be analyzed are distributed on nodes of these large-scale dynamic networks ... Local neighborhood change is informed to each peer real-timely. Find: ... – PowerPoint PPT presentation

Slides: 33

Provided by: hui64

Category:

Tags: be | informed | road

Transcript and Presenter's Notes

1
Distributed Classification in Peer-to-Peer
Networks
Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi
2
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

3
Research Motivation

Widespread use of P2P networks and sensor
networks
Data to be analyzed are distributed on nodes of
these large-scale dynamic networks
Traditional distributed data mining algorithms
must be extended to fit this new environment

Motivating Examples
- P2P anti-spam networks
- Automatic organization of web documents in
P2P
environments
A distributed classification algorithm is
critical in these applications.

4
Research Motivation (2)

New Challenges
- highly decentralized peers with no notion of
clients and servers
- hundreds or thousands of nodes, making global
synchronization impossible
- frequent topology changes caused by frequent
failure and recovery of peers
Algorithm Requirements
- scalability, decentralized in-network
processing
- communication efficiency, local synchronization
- fault-tolerance

5
Problem Formulation

Given
A connected topology graph
Each peer owns its local training data
for classification
Each peer is informed of changes in its local
neighborhood in real time
Find
Classification paradigm in this setting
Including how to train and use a global
classifier
Objective
Scalability, communication efficiency,
decentralized in-network processing,
fault-tolerance
Constraints
Each peer can only communicate with its immediate
neighbors
The network topology changes dynamically

6
Our Contributions

An algorithm to build an ensemble classifier for
distributed classification in P2P networks by
plurality voting on all the local classifiers
Adapt the training paradigm of pasting bites for
building local classifiers
An algorithm of (restrictive) Distributed
Plurality Voting (DPV) to combine the decisions
of local classifiers
Correctness
Optimality
Extensive Experimental Evaluation
Communication overhead and convergence time of
DPV
Accuracy comparison with centralized
classification

7
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

8
Building Local Classifiers

Pasting Bites by Breiman JML99
Generating small bites of the data by importance
sampling based on the out-of-bag error of
classifiers built so far
Stopping criterion: the difference in error between
two successive iterations falls below a threshold
Voting: all the classifiers vote uniformly
The more data on a local node, the more
classifiers generated on it, the more votes it
owns.
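
As a minimal sketch of the last point, the snippet below shows how a node's voting vector could be formed when every local classifier casts one uniform vote. The function name `local_voting_vector` and the representation of classifiers as plain callables are assumptions made for illustration; the slides do not specify an interface.

```python
from collections import Counter

def local_voting_vector(classifiers, instance, labels):
    """Return one vote count per label, in the order given by `labels`.

    Each classifier is assumed (hypothetically) to be a callable that maps
    an instance to a label. Every classifier casts exactly one vote, so a
    node that trained more classifiers contributes more votes in total.
    """
    votes = Counter(clf(instance) for clf in classifiers)
    return [votes.get(label, 0) for label in labels]

# Three toy classifiers on one node: two predict "spam", one predicts "ham".
toy_classifiers = [lambda x: "spam", lambda x: "ham", lambda x: "spam"]
vector = local_voting_vector(toy_classifiers, "some message", ["ham", "spam"])
```

A node with three classifiers owns three votes here; a node with thirty classifiers would own thirty.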

9
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

10
Problem Formulation Of DPV

Given
A group of peers $U$ in a graph
would like to agree on one of $d$ options.
Each peer $u \in U$ conveys its preference by
initializing a voting vector $P^u \in \mathbb{N}^d$,
where $P^u[i]$ is the number of votes on the
i-th option.
Each peer is informed of changes in its local
neighborhood in real time
Find
The option with the largest number of votes over
all peers
Objective
Scalability, communication efficiency,
decentralized in-network processing,
fault-tolerance
Constraints
Each peer can only communicate with its immediate
neighbors
The network topology changes dynamically
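
For reference, the quantity DPV has to find can be written as a centralized computation: sum all voting vectors and take the index of the largest total. The sketch below is only that centralized baseline, which the P2P setting rules out in practice; the function name `plurality` and the sample vectors are invented for illustration.

```python
def plurality(voting_vectors):
    """Sum the per-peer voting vectors and return (winning index, totals).

    This is the centralized answer that a distributed protocol should
    converge to. Ties are broken in favor of the lowest index, which is
    an arbitrary choice made for this sketch.
    """
    totals = [sum(column) for column in zip(*voting_vectors)]
    winner = max(range(len(totals)), key=lambda i: totals[i])
    return winner, totals

# Made-up voting vectors for three peers over three options.
peers = [(1, 0, 2), (0, 3, 1), (2, 0, 3)]
winner, totals = plurality(peers)
```

With these made-up vectors the totals are (3, 3, 6), so the option at index 2 (the third option) wins.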

11
An Example Of DPV

The third option is the answer.

12
Comparison Between DPV and Distributed Majority
Voting (DMV, by Wolff et al. TSMC04)

DMV Given
A group of peers in a graph
Each peer conveys its preference by
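initializing a 2-tuple
$(\mathit{sum}^u, \mathit{count}^u)$, where $\mathit{sum}^u$ stands for the number of
votes for a certain option and $\mathit{count}^u$
stands for the total number of votes on this
peer.
The majority ratio $\lambda \in [0, 1]$
DMV Find
Check whether the voting proportion of the
specified option is above $\lambda$
DMV Converted to DPV
Replacing the 2-tuple $(\mathit{sum}^u, \mathit{count}^u)$
on each peer with the voting vector $(\mathit{sum}^u, \lambda \cdot \mathit{count}^u)$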
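
The conversion can be sketched as follows, assuming the 2-tuple is (votes for the option, total votes) and the majority ratio is a number `lam` between 0 and 1. Under that assumption, replacing each tuple with the two-option vector (sum, lam * count) makes the first option win the plurality exactly when the overall proportion exceeds `lam`. The names `dmv_to_dpv` and `majority_holds` are hypothetical.

```python
def dmv_to_dpv(sum_votes, count_votes, lam):
    """Turn a DMV 2-tuple into a 2-option voting vector (assumed mapping)."""
    return (sum_votes, lam * count_votes)

def majority_holds(tuples, lam):
    """Decide DMV through a 2-option plurality vote.

    Option 0 wins iff total(sum) > lam * total(count), i.e. the voting
    proportion of the specified option is above lam.
    """
    vectors = [dmv_to_dpv(s, c, lam) for s, c in tuples]
    first = sum(v[0] for v in vectors)
    second = sum(v[1] for v in vectors)
    return first > second

# Made-up data: 8 of 13 votes overall are for the specified option.
sample = [(3, 5), (1, 4), (4, 4)]
```

Here `majority_holds(sample, 0.5)` is true because 8 > 6.5, while `majority_holds(sample, 0.75)` is false because 8 < 9.75.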

13
Comparison Between DPV and DMV (2)

DPV vs. DMV
DPV is a multi-valued function while DMV is a
binary predicate.
DMV can be solved by converting it to DPV.
However, DMV can only solve 2-option DPV
problems. For a d-option DPV problem, pairwise
comparisons among all d options must be performed
by DMV $d(d-1)/2$ times (Multiple Choice
Voting TSMC04).
DPV finds the maximally supported option
directly, and thus saves a lot of communication
overhead and the time for convergence.
DPV is the general form of DMV
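
To make the cost of the pairwise approach concrete, the short sketch below enumerates the option pairs that would each need their own 2-option vote. The helper name `pairwise_comparisons` is an assumption for illustration only.

```python
from itertools import combinations

def pairwise_comparisons(d):
    """List every unordered pair of the d options (0-based indices)."""
    return list(combinations(range(d), 2))

# For the 7-option problems used in the experiments of this deck.
pairs = pairwise_comparisons(7)
```

For d = 7 this yields 21 pairs, that is 7 * 6 / 2, whereas DPV runs a single vote over all seven options.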

14
Challenges for DPV

No central server to add all voting vectors; only
communication between immediate neighbors
Dynamic change of not only the network topology
but also the local voting vectors
Supporting not only one-shot queries, but also
continuous monitoring of the current voting result
according to the latest network status

15
DPV Protocol Overview

Assumption
The underlying network includes a mechanism to maintain an
undirected spanning tree for the dynamic P2P
network. The protocol runs on this tree
(duplicate insensitive).
A node is informed of changes in the status of
adjacent nodes.
Protocol Overview
Each node performs the same algorithm
independently
The protocol specifies how nodes initialize and react in
different situations: a message is received, a
neighboring node detaches or joins, or the local
voting vector changes
When a node's status changes in one of the above
situations, the node notifies its other
neighbors of the change only if the condition for sending
messages is satisfied.
Goal: guarantee that each node in the network
converges toward the correct plurality
16
The Condition for Sending Messages
No Message Sent
(5,2,1) + (2,0,0) = (7,2,1): 7 - 2 = 5, 7 - 1 = 6
(8,4,1) + (2,0,0) = (10,4,1): 10 - 4 = 6, 10 - 1 = 9
6 > 5, 9 > 6
The differences between the votes of the maximally
voted option and all other options do not decrease.
17
The Condition for Sending Messages (2)
Message Sent
(5,2,1) + (2,0,0) = (7,2,1): 7 - 2 = 5, 7 - 1 = 6
(8,6,1) + (2,0,0) = (10,6,1): 10 - 4... 10 - 6 = 4, 10 - 1 = 9
4 < 5, 9 > 6
The difference between the votes of the maximally
voted option and some other option decreases.
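
The two worked examples can be replayed with a small check: compute the gap between the leading option and every other option before and after an update, and flag a message only when some gap shrinks. This is a simplified reading of the slides' arithmetic, not the protocol's full per-neighbor condition; the names `gaps_to_leader` and `should_send` are hypothetical, and the leader is assumed to be the top option of the earlier vector.

```python
def gaps_to_leader(vector, leader):
    """Differences between the leader's votes and each other option's votes."""
    return [vector[leader] - v for i, v in enumerate(vector) if i != leader]

def should_send(old, new):
    """True if any gap to the leading option decreased from old to new.

    Simplification: the leader is taken from `old` and kept fixed, which
    matches the slide examples where option 0 leads throughout.
    """
    leader = max(range(len(old)), key=lambda i: old[i])
    old_gaps = gaps_to_leader(old, leader)
    new_gaps = gaps_to_leader(new, leader)
    return any(n < o for o, n in zip(old_gaps, new_gaps))

# Vectors taken from the two slides on the condition for sending messages.
no_message = should_send((7, 2, 1), (10, 4, 1))   # gaps 5,6 -> 6,9
message = should_send((7, 2, 1), (10, 6, 1))      # gaps 5,6 -> 4,9
```

In the first case both gaps grow, so nothing is sent; in the second the gap to the second option drops from 5 to 4, so a message is sent.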
18
The Correctness of DPV Protocol
All the nodes converge to the same result.
The difference between the actual votes of the
maximally voted option and any other option is
never smaller than what the protocol has
sent. Therefore, all the nodes converge to the correct
result.
19
The Optimality of DPV Protocol
C1 is more restrictive than C2 iff, for any
input case, if C1 is true then C2 is true. C1 is
strictly more restrictive than C2 iff C1 is
more restrictive than C2 and there exists at
least one input case in which C1 is false and C2
is true.
The proposed condition is the most restrictive condition for
sending messages that preserves the correctness of the
DPV protocol, that is, the condition that is the
most difficult to satisfy. In this sense, it
guarantees optimality in communication
overhead.
20
The Extension of DPV Protocol
Restrictive Distributed Plurality Voting
Output the maximally voted option only if its
proportion of all the votes is above a
user-specified threshold.
It can be used in a classification ensemble in a
restrictive manner by leaving out some uncertain
instances.
The new condition for sending messages follows
the spirit of the original DPV condition.
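
The output rule of the restrictive variant can be sketched on already-summed vote totals. The snippet below is a centralized illustration of that rule only, not the distributed protocol; the function name `restrictive_plurality` and the use of `None` for an abstained (uncertain) instance are assumptions.

```python
def restrictive_plurality(totals, threshold):
    """Return the winning option index, or None when it is not decisive.

    The winner is reported only if its share of all votes is strictly
    above `threshold`; otherwise the instance is left unclassified.
    """
    total = sum(totals)
    if total == 0:
        return None  # no votes at all, nothing to report
    winner = max(range(len(totals)), key=lambda i: totals[i])
    return winner if totals[winner] / total > threshold else None

confident = restrictive_plurality((10, 6, 1), 0.5)   # share 10/17, about 0.59
uncertain = restrictive_plurality((10, 6, 1), 0.6)
```

With a threshold of 0.5 the first option is returned; raising the threshold to 0.6 leaves the instance out.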
21
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

22
Accuracy of P2P Classification
Data: covtype (581012 × 54, 7 classes) from the UCI
database, distributed onto 500 nodes
23
The Performance of DPV Protocol

Experimental Parameters
Different types of network topology: power-law
graph, random graph, grid
Number of nodes: 500, 1000, 2000, 4000, 8000,
16000
7-option DPV problems

Experimental Metrics
The average communication overhead for each node
The convergence time of the protocol for one-shot
query

24
The Performance of DPV Protocol (2)

DPV0 vs. RANK (Multiple Choice Voting)
500 nodes
Averaging 2000 instances of 7-option plurality
voting problems

a and b are the options with the largest and
second-largest numbers of votes, respectively.
25
The Performance of DPV Protocol (3)

The Scalability of DPV0
Different number of nodes vs. communication
overhead of each node

26
The Performance of DPV Protocol (4)

The Local Optimality of DPV0
Communication overhead and convergence time under
different conditions for sending messages

27
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

28
Related Work - Ensemble Classifiers

Model Combination: (weighted) voting,
meta-learning
For Centralized Data
applying different learning algorithms with
heterogeneous models
applying a single learning algorithm to different
versions of the data
Bagging: random sampling with replacement
Boosting: re-weighting of the misclassified
training examples
Pasting Bites: generating small bites of the data
by importance sampling based on the quality of
classifiers built so far
For Distributed Data
distributed boosting by Lazarevic et al.
sigkdd01
distributed approach to pasting small bites by
Chawla et al. JMLR04, which uniformly votes
hundreds or thousands of classifiers built on all
distributed data sites

29
Related Work - P2P Data Mining

Primitive Aggregates
Average
Count, Sum
Max, Min
Distributed Majority Voting by Wolff et al.
TSMC04
P2P Data Mining Algorithms
P2P Association Rule Mining by Wolff et al.
TSMC04
P2P K-means clustering by Datta et al. SDM06
P2P L2 threshold monitor by Wolff et al. SDM06
Outlier detection in wireless sensor networks by
Branch et al. ICDCS06
A classification framework in P2P networks by
Siersdorfer et al. ECIR06
Limitations: propagation of local classifiers,
experiments on only 16 peers, a focus on
accuracy alone, and no consideration of the dynamism of
P2P networks.

30
Overview

Introduction
Building Local Classifiers
Distributed Plurality Voting
Experimental Results
Related Works
Summary

31
Summary

Proposed an ensemble paradigm for distributed
classification in P2P networks
Formalized a generalized Distributed Plurality
Voting (DPV) protocol for P2P networks
Properties of DPV0
supports both one-shot queries and continuous
monitoring
theoretical local optimality in terms of
communication overhead
outperforms alternative approaches
scales up to large networks

32
Q & A
Acknowledgement
