1
Distributed Classification in Peer-to-Peer
Networks
Ping Luo, Hui Xiong, Kevin Lü, Zhongzhi Shi
2
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

3
Research Motivation
  • Widespread use of P2P networks and sensor networks
  • The data to be analyzed are distributed over the nodes of these large-scale dynamic networks
  • Traditional distributed data mining algorithms must be extended to fit this new environment
  • Motivating examples
    - P2P anti-spam networks
    - Automatic organization of web documents in P2P environments
  • A distributed classification algorithm is critical in these applications.

4
Research Motivation (2)
  • New Challenges
    - Highly decentralized peers, with no notion of clients and servers
    - Hundreds or thousands of nodes, making global synchronization impossible
    - Frequent topology changes caused by frequent failure and recovery of peers
  • Algorithm Requirements
    - Scalability, decentralized in-network processing
    - Communication efficiency, local synchronization only
    - Fault tolerance

5
Problem Formulation
  • Given
    - A connected topology graph
    - Each peer owns its local training data for classification
    - Each peer is informed of changes in its local neighborhood in real time
  • Find
    - A classification paradigm for this setting, including how to train and use a global classifier
  • Objective
    - Scalability, communication efficiency, decentralized in-network processing, fault tolerance
  • Constraints
    - Each peer can only communicate with its immediate neighbors
    - The network topology changes dynamically

6
Our Contributions
  • An algorithm to build an ensemble classifier for distributed classification in P2P networks by plurality voting over all the local classifiers
  • Adapting the training paradigm of pasting bites for building local classifiers
  • An algorithm for (restrictive) Distributed Plurality Voting (DPV) to combine the decisions of local classifiers
    - Correctness
    - Optimality
  • Extensive experimental evaluation
    - Communication overhead and convergence time of DPV
    - Accuracy comparison with centralized classification

7
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

8
Building Local Classifiers
  • Pasting Bites by Breiman JML99
    - Generate small bites of the data by importance sampling based on the out-of-bag error of the classifiers built so far
    - Stopping criterion: the difference in error between two successive iterations falls below a threshold
    - All the resulting classifiers vote uniformly
  • The more data a local node has, the more classifiers it generates and the more votes it owns (a sketch of this local training loop follows below).
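The following is a minimal, single-peer sketch of this local training loop, assuming X and y are NumPy arrays with integer class labels and using scikit-learn decision trees; the bite size, the error-improvement threshold, and the use of the full local set in place of a true out-of-bag error estimate are illustrative simplifications, not details from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def paste_bites(X, y, bite_size=800, eps=1e-3, max_iters=50, seed=0):
    """Local pasting-bites loop on one peer's data (a simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    classifiers = []
    prev_err, err = 1.0, 0.5

    def ensemble_predict(A):
        votes = np.stack([clf.predict(A) for clf in classifiers]).astype(int)
        # plurality vote over the classifiers built so far
        return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)

    for _ in range(max_iters):
        if not classifiers:
            # First bite: a plain uniform sample.
            idx = rng.choice(n, size=min(bite_size, n), replace=True)
        else:
            # Importance sampling: misclassified instances are always kept,
            # correctly classified ones only with probability err / (1 - err).
            wrong = ensemble_predict(X) != y
            keep = np.where(wrong, 1.0, err / (1.0 - err))
            idx = rng.choice(n, size=min(bite_size, n), replace=True,
                             p=keep / keep.sum())
        classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

        err = float(np.mean(ensemble_predict(X) != y))
        err = min(max(err, 1e-6), 1.0 - 1e-6)   # keep the sampling ratio well defined
        if abs(prev_err - err) < eps:           # stopping criterion
            break
        prev_err = err
    return classifiers
```

The list of classifiers returned here is what the peer later contributes, one vote per classifier, to the distributed plurality vote.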

9
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

10
Problem Formulation Of DPV
  • Given
    - A group of peers in a graph would like to agree on one of d options.
    - Each peer conveys its preference by initializing a voting vector (v_1, ..., v_d), where v_i is the number of votes for the i-th option.
    - Each peer is informed of changes in its local neighborhood in real time
  • Find
    - The option with the largest total number of votes over all peers
  • Objective
    - Scalability, communication efficiency, decentralized in-network processing, fault tolerance
  • Constraints
    - Each peer can only communicate with its immediate neighbors
    - The network topology changes dynamically

11
An Example Of DPV
  • The third option is the answer.
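As a point of reference for what DPV computes, here is a trivially centralized sketch: sum the peers' voting vectors and take the argmax. The vectors below are illustrative, not the ones from the slide's figure.

```python
import numpy as np

# Hypothetical voting vectors of four peers over d = 3 options.
peer_votes = np.array([
    [1, 0, 4],
    [2, 1, 3],
    [0, 2, 5],
    [3, 0, 1],
])

totals = peer_votes.sum(axis=0)        # global tally: [6, 3, 13]
winner = int(np.argmax(totals))        # index 2, i.e. the third option
print(totals, "-> option", winner + 1)
```

DPV arrives at the same argmax, but without any single node ever seeing all of the vectors.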

12
Comparison Between DPV and Distributed Majority
Voting (DMV, by Wolff et al. TSMC04)
  • DMV Given
    - A group of peers in a graph
    - Each peer conveys its preference by initializing a 2-tuple (s, c), where s stands for the number of votes for a specified option and c stands for the total number of votes on this peer.
    - A majority ratio (a user-specified threshold)
  • DMV Find
    - Check whether the overall voting proportion of the specified option is above the majority ratio
  • DMV Converted to DPV
    - Replace the 2-tuple on each peer with a corresponding voting vector (a sketch of this conversion follows below)
13
Comparison Between DPV and DMV (2)
  • DPV vs. DMV
    - DPV is a multi-valued function while DMV is a binary predicate.
    - DMV can be solved by converting it to DPV.
    - However, DMV can only solve 2-option DPV problems. For a d-option DPV problem, pairwise comparisons among all d options must be performed by DMV multiple times (Multiple Choice Voting, TSMC04).
    - DPV finds the maximally supported option directly, and thus saves substantial communication overhead and convergence time.
    - DPV is the general form of DMV.

14
Challenges for DPV
  • No central server to add up all the voting vectors; communication only between immediate neighbors
  • Dynamic changes not only in the network topology but also in the local voting vectors
  • Supporting not only one-shot queries, but also continuous monitoring of the current voting result according to the latest network status

15
DPV Protocol Overview
  • Assumptions
    - A mechanism is available to maintain an undirected spanning tree over the dynamic P2P network; the protocol runs on this tree (so aggregation is duplicate insensitive).
    - A node is informed of changes in the status of adjacent nodes.
  • Protocol Overview
    - Each node performs the same algorithm independently
    - The protocol specifies how nodes initialize and how they react to different events: a message received, a neighboring node detached or joined, the local voting vector changed (a skeleton of these event handlers is sketched below)
    - When the node's status changes under one of these events, the node notifies its other neighbors only if the condition for sending messages is satisfied.
    - This guarantees that each node in the network converges to the correct plurality
16
The Condition for Sending Messages
No message sent:
(5,2,1) + (2,0,0) = (7,2,1), differences 7-2 = 5 and 7-1 = 6
(8,4,1) + (2,0,0) = (10,4,1), differences 10-4 = 6 and 10-1 = 9
Since 6 > 5 and 9 > 6, the differences between the votes of the maximally voted option and all other options do not decrease, so no message is sent.
17
The Condition for Sending Messages (2)
Message sent:
(5,2,1) + (2,0,0) = (7,2,1), differences 7-2 = 5 and 7-1 = 6
(8,6,1) + (2,0,0) = (10,6,1), differences 10-6 = 4 and 10-1 = 9
Since 4 < 5, the difference between the votes of the maximally voted option and another option decreases, so a message is sent.
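A minimal sketch of this check, using the numbers from the two slides above; the function name and the way the old and new aggregate vectors are obtained are illustrative simplifications of the protocol's per-neighbor bookkeeping.

```python
def should_send(old_sum, new_sum):
    """Decide whether an update must be sent to a neighbor.

    old_sum: the aggregate vector the neighbor currently knows about.
    new_sum: what that vector would become after the local update.

    A message is needed only if, for some option, the gap between the
    maximally voted option and that option has shrunk, i.e. the
    neighbor's current picture might overstate the winner's lead.
    """
    old_max, new_max = max(old_sum), max(new_sum)
    for old_v, new_v in zip(old_sum, new_sum):
        if (new_max - new_v) < (old_max - old_v):
            return True
    return False

# Slide 16: gaps grow from (5, 6) to (6, 9) -> no message.
print(should_send((7, 2, 1), (10, 4, 1)))   # False
# Slide 17: one gap shrinks from 5 to 4 -> message sent.
print(should_send((7, 2, 1), (10, 6, 1)))   # True
```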
18
The Correctness of DPV Protocol
All the nodes converge to the same result.
The difference between the actual votes of the maximally voted option and any other option is never smaller than the difference implied by what the protocol has already sent. Therefore, all the nodes converge to the correct result.
19
The Optimality of DPV Protocol
C1 is more restrictive than C2 iff, for any input case, C1 being true implies that C2 is true. C1 is strictly more restrictive than C2 iff C1 is more restrictive than C2 and there exists at least one input case for which C1 is false and C2 is true.
The condition for sending messages used by the protocol is the most restrictive condition that preserves the correctness of the DPV protocol, i.e., the condition that is the most difficult to satisfy. In this sense, it guarantees optimality in communication overhead.
20
The Extension of DPV Protocol
Restrictive Distributed Plurality Voting:
output the maximally voted option only if its proportion of all the votes is above a user-specified threshold.
It can be used in a classification ensemble in a restrictive manner by leaving out some uncertain instances (a small sketch of this decision rule follows below).
The new condition for sending messages follows the same spirit as the original one.
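A tiny sketch of this decision rule; the threshold value and function name are illustrative, and the aggregated vector would in practice come from the DPV protocol rather than from a central sum.

```python
def restrictive_plurality(totals, threshold=0.6):
    """Return the winning option only if its vote share exceeds `threshold`.

    totals: the globally aggregated voting vector.
    Returns None when no option is supported strongly enough, i.e. the
    instance is left out as 'uncertain'.
    """
    total = sum(totals)
    winner = max(range(len(totals)), key=lambda i: totals[i])
    if total > 0 and totals[winner] / total > threshold:
        return winner
    return None

print(restrictive_plurality([6, 3, 13]))   # 13/22 < 0.6 -> None (abstain)
print(restrictive_plurality([2, 1, 17]))   # 17/20 > 0.6 -> 2
```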
21
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

22
Accuracy of P2P Classification
Data: covtype (581,012 instances, 54 attributes, 7 classes) from the UCI repository, distributed onto 500 nodes
23
The Performance of DPV Protocol
  • Experimental Parameters
    - Different types of network topology: power-law graph, random graph, grid
    - Number of nodes: 500, 1000, 2000, 4000, 8000, 16000
    - 7-option DPV problems
  • Experimental Metrics
    - The average communication overhead per node
    - The convergence time of the protocol for a one-shot query

24
The Performance of DPV Protocol (2)
  • DPV0 vs. RANK (Multiple Choice Voting)
  • 500 nodes
  • Averaged over 2000 instances of 7-option plurality voting problems

a and b denote the options with the largest and second-largest numbers of votes, respectively.
25
The Performance of DPV Protocol (3)
  • The Scalability of DPV0
  • Different numbers of nodes vs. communication overhead per node

26
The Performance of DPV Protocol (4)
  • The Local Optimality of DPV0
  • Communication overhead and convergence time under
    different conditions for sending messages

27
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

28
Related Work - Ensemble Classifiers
  • Model Combination: (weighted) voting, meta-learning
  • For Centralized Data
    - Applying different learning algorithms with heterogeneous models
    - Applying a single learning algorithm to different versions of the data
    - Bagging: random sampling with replacement
    - Boosting: re-weighting of the misclassified training examples
    - Pasting Bites: generating small bites of the data by importance sampling based on the quality of the classifiers built so far
  • For Distributed Data
    - Distributed boosting by Lazarevic et al. SIGKDD01
    - A distributed approach to pasting small bites by Chawla et al. JMLR04, which uniformly votes hundreds or thousands of classifiers built on all distributed data sites

29
Related Work - P2P Data Mining
  • Primitive Aggregates
    - Average
    - Count, Sum
    - Max, Min
    - Distributed Majority Voting by Wolff et al. TSMC04
  • P2P Data Mining Algorithms
    - P2P Association Rule Mining by Wolff et al. TSMC04
    - P2P K-means clustering by Datta et al. SDM06
    - P2P L2 threshold monitoring by Wolff et al. SDM06
    - Outlier detection in wireless sensor networks by Branch et al. ICDCS06
    - A classification framework in P2P networks by Siersdorfer et al. ECIR06
      Limitations: propagation of local classifiers, experiments on only 16 peers, a focus solely on accuracy, and no consideration of the dynamism of P2P networks.

30
Overview
  • Introduction
  • Building Local Classifiers
  • Distributed Plurality Voting
  • Experimental Results
  • Related Works
  • Summary

31
Summary
  • Proposed an ensemble paradigm for distributed
    classification in P2P networks
  • Formalized a generalized Distributed Plurality
    Voting (DPV) protocol for P2P networks
  • Properties of DPV0
    - Supports both one-shot queries and continuous monitoring
    - Theoretical local optimality in terms of communication overhead
    - Outperforms alternative approaches
    - Scales up to large networks

32
Q &amp; A
Acknowledgement