Slingshot: Time-Critical Multicast for Clustered Applications

1
Slingshot: Time-Critical Multicast for Clustered Applications
  • Mahesh Balakrishnan
  • Stefan Pleisch
  • Ken Birman
  • Cornell University

2
The Contemporary Datacenter
  • Building-wide super-clusters: 1000s of commodity blade servers
  • Typically used as commercial website back-ends: Amazon, etc.
  • Software paradigms: SOA, Eventing, Publish/Subscribe
  • Many-to-many communication: Multicast!

3
Multicast in the Datacenter
  • IP Multicast is available; adding reliability to it is a
    well-researched technology
  • Scalability dimensions
      • Number of receivers
      • Number of senders?
      • Number of groups?
  • Metrics
      • Throughput
      • Timeliness?

4
Time-Critical Applications
  • Dealing in perishable data: stock quotes, location updates
  • Willing to trade complete reliability for timeliness
  • Requiring tunable reliability/timeliness/overhead tradeoffs
  • Probabilistic guarantee of timeliness?
      • For x% overhead, y% of lost packets are recovered in time t.
      • Remainder can be optionally recovered in time t.

5
Design Space
  • Reactive vs. Proactive
  • Reactive: Loss Discovery
      • ACK
      • Sender-Based Sequencing
          • If the multicast rate in a group is constant, the
            inter-multicast time at any sender goes up linearly with
            the number of senders
      • Gossip: Scalable
  • Proactive: FEC, Tunable
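The sender-based sequencing bullet can be made concrete with a little arithmetic. The sketch below assumes the group's total rate is split evenly across senders; the function name `inter_multicast_ms` is illustrative and not part of Slingshot. It uses the 1000 packets per second group rate from the evaluation setup.

```python
def inter_multicast_ms(group_rate_pps: float, num_senders: int) -> float:
    """Milliseconds between consecutive multicasts from one sender.

    Assumption: the group's total rate is shared evenly by all senders.
    """
    if group_rate_pps <= 0 or num_senders <= 0:
        raise ValueError("rate and sender count must be positive")
    return 1000.0 * num_senders / group_rate_pps


# With a fixed group rate, the gap at each sender grows linearly with the
# number of senders.
for senders in (1, 8, 64):
    print(senders, inter_multicast_ms(1000.0, senders))
```

Since a gap in one sender's sequence numbers only becomes visible when that sender's next packet arrives, this gap is presumably a lower bound on how quickly sequencing alone can reveal a loss.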

6
Slingshot Overview
  • Receiver-Based FEC
  • Senders send initially via unreliable IP Multicast
  • Phase 1: Receivers repair losses by proactively sending each other
    FEC repair packets
  • Phase 2: Remaining losses are recovered from the sender
  • Each receiver sends an error correction (XOR) packet, built from the
    last r packets it received, to c randomly selected receivers
  • Rate-of-fire parameter (r, c): allows tuning of the
    overhead-timeliness tradeoff
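The XOR repair idea can be illustrated in a few lines. This is a minimal sketch, not Slingshot's code: the helper `xor_packets` is a hypothetical name, and it assumes all packets have the same length (for example, after padding).

```python
from functools import reduce
from operator import xor


def xor_packets(packets: list[bytes]) -> bytes:
    """Byte-wise XOR of equal-length packets (equal length is an assumption)."""
    if not packets:
        raise ValueError("need at least one packet")
    if len({len(p) for p in packets}) != 1:
        raise ValueError("packets must have equal length")
    return bytes(reduce(xor, column) for column in zip(*packets))


r = 4
data = [bytes([i]) * 8 for i in range(1, r + 1)]  # r toy data packets
repair = xor_packets(data)                         # one repair packet

# A receiver that lost exactly one of the r packets can rebuild it from the
# repair packet and the r-1 packets it does hold.
lost_index = 2
survivors = data[:lost_index] + data[lost_index + 1:]
recovered = xor_packets([repair] + survivors)
print(recovered == data[lost_index])  # True
```

A single XOR packet can repair at most one missing packet out of the r it covers, which is why the choice of r and c matters for the tradeoff.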

7
Protocol Details 0
  • Two Packet Types
      • Data Packet
          • Packet ID (Sender, SeqNo)
          • Application Payload (Application MTU 1024)
      • Repair Packet
          • List of Data Packet IDs (sender1,seqno1), (sender2,seqno2).
          • XOR of Data Packets
      • Packet size: less than Network MTU
  • Terminology: data packets are included in a repair packet
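The two packet types could be modeled roughly as follows. This is a sketch under assumptions: the class and field names are illustrative, and the slide's "Application MTU 1024" is taken to mean 1024 bytes, which the slide does not state.

```python
from dataclasses import dataclass

APP_MTU = 1024  # from the slide; the unit (bytes) is an assumption

PacketId = tuple[str, int]  # (sender, seqno)


@dataclass(frozen=True)
class DataPacket:
    packet_id: PacketId
    payload: bytes

    def __post_init__(self) -> None:
        # Reject payloads larger than the application MTU.
        if len(self.payload) > APP_MTU:
            raise ValueError("payload exceeds application MTU")


@dataclass(frozen=True)
class RepairPacket:
    included: tuple[PacketId, ...]  # IDs of the data packets it covers
    xor: bytes                      # XOR of those packets' payloads


dp = DataPacket(("sender1", 1), b"quote: ACME 12.5")
rp = RepairPacket((dp.packet_id,), dp.payload)
print(rp.included)
```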

8
Protocol Details 1
  • Data Structures
      • Data Buffer: received data packets
      • Repair Bin: pointers to last <r data packets
  • Arrival of Data Packet dp at Receiver
      • dp is added to the data buffer
      • dp is added to the repair bin
      • If repair bin size equals r, a repair packet rp is created from
        its contents, and the repair bin is cleared
      • rp is dispatched to c random receivers
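The data-packet arrival steps above can be sketched as a small class. Everything here is illustrative rather than the authors' implementation: the `Receiver` class, its `on_data` method, the tuple representation of a repair packet, and the duplicate-packet guard are assumptions, and payloads are assumed to have equal length.

```python
import random
from functools import reduce
from operator import xor


def xor_payloads(payloads):
    # Assumes equal-length payloads (e.g. padded to a common size).
    return bytes(reduce(xor, col) for col in zip(*payloads))


class Receiver:
    def __init__(self, peers, r, c, rng=None):
        self.peers = list(peers)
        self.r, self.c = r, c
        self.rng = rng or random.Random()
        self.data_buffer = {}  # packet id -> payload
        self.repair_bin = []   # ids of fewer than r recent data packets

    def on_data(self, packet_id, payload):
        """Return a list of (peer, repair_packet) pairs to send, possibly empty."""
        if packet_id in self.data_buffer:  # assumption: ignore duplicates
            return []
        self.data_buffer[packet_id] = payload
        self.repair_bin.append(packet_id)
        if len(self.repair_bin) < self.r:
            return []
        ids = tuple(self.repair_bin)
        repair = (ids, xor_payloads([self.data_buffer[i] for i in ids]))
        self.repair_bin.clear()
        targets = self.rng.sample(self.peers, min(self.c, len(self.peers)))
        return [(peer, repair) for peer in targets]


rx = Receiver(peers=["b", "c", "d"], r=2, c=2, rng=random.Random(0))
print(rx.on_data(("s", 1), b"\x01\x02"))  # bin not yet full: nothing to send
out = rx.on_data(("s", 2), b"\x03\x04")   # bin full: repair goes to 2 peers
print(len(out))
```

Returning the packets to send, instead of sending them directly, keeps the sketch free of any network code.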

9
Protocol Details 2
  • Arrival of Repair Packet rp at Receiver: if the number of missing
    included data packets is
      • 0: rp is discarded
      • 1: the missing packet is recovered by XORing rp with the other
        r-1 data packets
      • >1: rp is stored in a special buffer, in case future data packet
        arrivals and recoveries make it usable
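The three cases for an arriving repair packet can be sketched as below. The function names `on_repair` and `retry_pending`, the list used as the "special buffer", and the equal-length payload assumption are illustrative choices, not details taken from the slides.

```python
from functools import reduce
from operator import xor


def xor_payloads(payloads):
    # Assumes equal-length payloads.
    return bytes(reduce(xor, col) for col in zip(*payloads))


def on_repair(data_buffer, pending, ids, xor_bytes):
    """Handle one repair packet; return the recovered packet id or None."""
    missing = [i for i in ids if i not in data_buffer]
    if not missing:
        return None                       # 0 missing: discard
    if len(missing) > 1:
        pending.append((ids, xor_bytes))  # >1 missing: keep for later
        return None
    present = [data_buffer[i] for i in ids if i in data_buffer]
    data_buffer[missing[0]] = xor_payloads([xor_bytes] + present)
    return missing[0]                     # exactly 1 missing: recovered


def retry_pending(data_buffer, pending):
    """Re-run stored repair packets until none of them makes progress."""
    progress = True
    while progress:
        progress = False
        for entry in list(pending):
            pending.remove(entry)
            if on_repair(data_buffer, pending, *entry) is not None:
                progress = True


buf = {("s", 1): b"\x01", ("s", 2): b"\x02"}
pending = []
all_four = (("s", 1), ("s", 2), ("s", 3), ("s", 4))
on_repair(buf, pending, all_four, b"\x0d")               # 2 missing: stored
on_repair(buf, pending, (("s", 2), ("s", 3)), b"\x05")   # recovers ("s", 3)
retry_pending(buf, pending)                              # now ("s", 4) too
print(sorted(buf))
```

In a fuller receiver, `retry_pending` would presumably also run after each ordinary data packet arrives, since arrivals as well as recoveries can make a stored repair packet usable.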

10
Evaluation Setup
  • 64-node rack-style cluster at Cornell
  • Loss rate fixed at 1%; packets dropped at end buffers
  • All nodes send and receive
  • Inter-node latencies: 50-100 microseconds
  • Group data rate: 1000 packets per second
  • Each node multicasts about 15.6 packets per second (1000 / 64),
    i.e. one packet every 64 milliseconds

11
Slingshot Tunability
  • For 27% overhead, 93.5% of lost packets are recovered at an avg. of
    3.5 milliseconds
  • Example tradeoff points between overhead, timeliness, and
    reliability
  • Figure: overhead and recovered packets plotted on left y-axis,
    recovery time on right

12
Slingshot vs SRM
  • Slingshot recovers 93% in 10 ms, 97% in 25 ms
  • Fastest SRM packet recovery is 2.2 seconds; 93% in 4.85 seconds,
    97% in 5.1 seconds
  • 2-3 orders of magnitude faster
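The "2-3 orders of magnitude" figure can be checked from the numbers on this slide. The helper name `orders_of_magnitude` is illustrative; the inputs are the 93% and 97% recovery times quoted above.

```python
import math


def orders_of_magnitude(slow_seconds: float, fast_seconds: float) -> float:
    """Base-10 logarithm of the speedup ratio."""
    return math.log10(slow_seconds / fast_seconds)


at_93 = orders_of_magnitude(4.85, 0.010)  # SRM 4.85 s vs Slingshot 10 ms
at_97 = orders_of_magnitude(5.1, 0.025)   # SRM 5.1 s vs Slingshot 25 ms
print(f"{at_93:.1f} {at_97:.1f}")
```

Both values come out at about 2.7 and 2.3, which is consistent with the slide's summary.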

13
Slingshot Scalability: Group Size
  • Simulation results
  • Gossip-style scalability: insensitive to scale beyond a certain
    size

14
Conclusion
  • Slingshot provides a tunable, probabilistic guarantee of timeliness
  • Outperforms SRM by 2 orders of magnitude in a 64-node system
  • Insensitive to number of senders
  • Future Work
      • Achieve scalability in other dimensions (number of groups)
      • Build a time-critical middleware layer that uses Slingshot as a
        generic primitive