Title: Reliable Multicasting with JGroups
1Reliable Multicasting with JGroups
- Bela Ban, Jan 2004
- belaban_at_yahoo.com
- http://www.jgroups.org
2Overview
- API, architecture
- Protocols
- Building Blocks
- Performance
- Future, Conclusion
3What Is It ?
- Toolkit for reliable multicasting
- Fragmentation
- Message retransmission
- Ordering
- Group membership, membership change notification
- LAN or WAN based
4License
- JGroups is a toolkit (JAR), to be linked against an application
- Open Source under LGPL
- Commercial products can use JGroups without having to LGPL their code
- Modifications to JGroups itself need to be LGPL'ed (if distributed)
- Dual licensing in the future
5API
- Channel: similar to java.net.MulticastSocket, plus group membership and reliability
- Operations
- Create a channel with a set of properties
- Connect to a group X. Everyone that connects to X will see each other
- Send a message to all members of X
- Send a message to a single member
6API
- Receive a message
- Retrieve membership
- Be notified when members join, leave (including crashes)
- Disconnect from the group
- Close the channel
7API
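import org.jgroups.JChannel;
import org.jgroups.Message;

// Create a channel from an XML stack configuration and join the group
JChannel channel=new JChannel("file://home/bela/default.xml");
channel.connect("demo-group");
System.out.println("members are " + channel.getView().getMembers());
// Destination null: send to all members of the group
Message msg=new Message(null, null, "Hello world");
channel.send(msg);
// Timeout 0: block until something arrives. The cast assumes the next
// event is a Message; receive() can also return other events, e.g. a View
Message m=(Message)channel.receive(0);
System.out.println("received msg from " + m.getSrc() + ": " + m.getObject());
channel.disconnect();
channel.close();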
8Group topology
9Architecture of JGroups
10Demo
- Draw
- ReplicatedTree shared state
11Stats
- JGroups has 90KLOC
- 30KLOC protocols
- 45KLOC main building blocks
- 15KLOC unit tests
- 90 protocols shipped with JGroups
- Set of well-tested stacks (in XML files)
12Available protocols I
- Transport
- UDP, TCP, TCP_NIO, TUNNEL, JMS, LOOPBACK
- Discovery
- PING, TCPPING, TCPGOSSIP, UDPPING
- Group membership
- Reliable delivery FIFO
- NAKACK, SMACK, UNICAST
13Available protocols II
- Failure detection
- FD, FD_SOCK, FD_PID, FD_SIMPLE, FD_PROB, VERIFY_SUSPECT
- Security
- ENCRYPT, SSL ConnectionTable (n/a)
- Fragmentation (FRAG)
- State transfer (STATE_TRANSFER)
14Available protocols III
- Ordering
- FIFO, CAUSAL, TOTAL, TOTAL_TOKEN
- Virtual Synchrony
- FLUSH, QUEUE, VIEW_ENFORCER
- Probabilistic Broadcast
- PBCAST
- Merging
- MERGE(2), MERGEFAST
15Available protocols IV
- Distributed message garbage collection
- STABLE
- Debugging
- PERF, TRACE, PRINTOBJS, SIZE, BSH
- Simulation
- SHUFFLE, DELAY, DISCARD, DEADLOCK, LOSS,
PARTITIONER
16Available protocols V
- Dynamic configuration
- AUTOCONF
- Flow control
- FLOW_CONTROL, FC
- Misc
- PIGGYBACK, COMPRESS
17Transport
- Task
- Send messages from above to all members in the group, or to a single member
- Receive messages from NW, pass up stack
- UDP: multicast and multiple UDP unicasts
- TCP: mcast done by multiple TCP unicasts
- TUNNEL: send to external router, e.g. through firewall
18Discovery
- Task
- Initial discovery of members
- Used by GMS to determine coordinator to send JOIN request to
- Each member returns its own addr, plus the addr of the coordinator
- Typical response: (A,A), (B,A), (C,A)
- Wait for n milliseconds or m responses
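A minimal sketch of how the discovery responses could be turned into a coordinator choice. The class and method names (DiscoverySketch, PingRsp, determineCoordinator) are hypothetical, and picking the most frequently named coordinator is an assumption for illustration, not necessarily what JGroups does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the JGroups implementation.
class DiscoverySketch {

    // One discovery response: the responder's own address and the
    // coordinator it currently knows about.
    static final class PingRsp {
        final String ownAddr;
        final String coordAddr;

        PingRsp(String ownAddr, String coordAddr) {
            this.ownAddr = ownAddr;
            this.coordAddr = coordAddr;
        }
    }

    // Returns the coordinator named most often in the responses, or the
    // local address if nobody answered (first member becomes coordinator).
    static String determineCoordinator(String localAddr, List<PingRsp> responses) {
        if (responses.isEmpty()) {
            return localAddr;
        }
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (PingRsp rsp : responses) {
            int v = votes.merge(rsp.coordAddr, 1, Integer::sum);
            if (v > bestVotes) {
                bestVotes = v;
                best = rsp.coordAddr;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<PingRsp> rsps = new ArrayList<>();
        rsps.add(new PingRsp("A", "A"));
        rsps.add(new PingRsp("B", "A"));
        rsps.add(new PingRsp("C", "A"));
        System.out.println(determineCoordinator("D", rsps)); // prints A
    }
}
```

The new member D would then send its JOIN request to A.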
19Discovery - UDP
- Multicast discovery request
- Each member responds with a unicast UDP datagram
(local-addr, coord-addr), back to the sender
20Discovery - TCPGOSSIP
- Can be used by both UDP and TCP
- External GossipServer
- org.jgroups.stack.GossipServer
- Maintains table of <group, members>
- Each member registers (groupname, own addr)
- Lease based: members have to periodically renew registration
- Multiple GossipServers possible
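The lease-based table can be illustrated with a small sketch. This is not the code of org.jgroups.stack.GossipServer; the class GossipTableSketch, its methods, and the explicit `now` parameter (used instead of a real clock so the behaviour is easy to follow) are all assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of a lease-based <group, members> table.
class GossipTableSketch {
    private final long leaseMillis;
    // group name -> (member address -> lease expiry time in ms)
    private final Map<String, Map<String, Long>> table = new HashMap<>();

    GossipTableSketch(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    // Registers or renews a member; the lease runs from 'now'.
    void register(String group, String member, long now) {
        table.computeIfAbsent(group, g -> new HashMap<>())
             .put(member, now + leaseMillis);
    }

    // Returns the members whose lease has not expired at time 'now',
    // dropping expired entries along the way.
    Set<String> getMembers(String group, long now) {
        Set<String> alive = new TreeSet<>();
        Map<String, Long> members = table.get(group);
        if (members == null) {
            return alive;
        }
        Iterator<Map.Entry<String, Long>> it = members.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= now) {
                it.remove();
            } else {
                alive.add(e.getKey());
            }
        }
        return alive;
    }

    public static void main(String[] args) {
        GossipTableSketch t = new GossipTableSketch(1000);
        t.register("demo-group", "A", 0);
        t.register("demo-group", "B", 0);
        t.register("demo-group", "A", 800);   // A renews, B does not
        System.out.println(t.getMembers("demo-group", 1500)); // prints [A]
    }
}
```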
21Discovery - TCPGOSSIP
- To obtain initial membership for a given group, TCPGOSSIP contacts the GossipServer
- Membership info does not need to be accurate: only goal is to determine coord to send JOIN request to
22Discovery - TCPPING
- Give a set of well known members
- For discovery, those members are pinged
- If at least 1 responds, we can find the coordinator
- Does not require additional process
23Group Membership
- Task
- Maintain a list of members
- Notify members when a new member joins, or an existing member leaves (or crashes)
- Each member has the same ordered list
- List can be retrieved by Channel.getView()
- First (= oldest) member is coordinator
- If coord crashes, 2nd oldest takes over
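A rough sketch of the ordered member list and the coordinator rule described above. ViewSketch and its methods are hypothetical names for illustration; the real GMS protocol also multicasts each new view to the group, which is left out here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of an ordered membership list (oldest member first).
class ViewSketch {
    private final List<String> members = new ArrayList<>();

    // JOIN: new members are appended, so the list stays ordered by age.
    List<String> join(String member) {
        if (!members.contains(member)) {
            members.add(member);
        }
        return getView();
    }

    // LEAVE or CRASH: the member is removed, everyone else keeps its rank.
    List<String> remove(String member) {
        members.remove(member);
        return getView();
    }

    // The first (oldest) member is the coordinator; null if the group is empty.
    String coordinator() {
        return members.isEmpty() ? null : members.get(0);
    }

    // Read-only snapshot of the current view.
    List<String> getView() {
        return Collections.unmodifiableList(new ArrayList<>(members));
    }

    public static void main(String[] args) {
        ViewSketch gms = new ViewSketch();
        gms.join("A");
        gms.join("B");
        gms.join("C");
        System.out.println(gms.coordinator()); // prints A
        gms.remove("A");                       // coordinator crashes
        System.out.println(gms.coordinator()); // prints B
    }
}
```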
24Group Membership - JOIN
- New member uses discovery to find coord
- If first member -> become coord
- Else sends JOIN to coord
- Coord adds new member to list, multicasts new view (member list) to all members
- If 2 initial members are started at the same time, MERGE protocol merges them into a single group
25Group Membership - LEAVE
- Member sends LEAVE to coord
- Coord multicasts new view to all members
26Group membership - CRASH
- Failure detection protocol sends up SUSPECT event
- VERIFY_SUSPECT double checks
- GMS multicasts new view (not containing crashed member)
- If member resurfaces, it will be shunned
- Has to leave and rejoin group
27Failure detection
- Task
- Detect if a member has crashed and send SUSPECT event up the stack (to be handled by GMS)
- Logical ring over membership
- Each member pings its neighbor to the right
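The ring can be derived directly from the ordered view. The sketch below is an illustration only; RingSketch and pingDest are hypothetical names, and the actual heartbeat sending and timeout handling are left out.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: which neighbor does a member monitor?
class RingSketch {

    // Returns the member that 'local' should ping: its right-hand neighbor
    // in the view, wrapping around at the end of the list. Returns null if
    // 'local' is alone in the view or not part of it.
    static String pingDest(List<String> view, String local) {
        int idx = view.indexOf(local);
        if (idx < 0 || view.size() < 2) {
            return null;
        }
        return view.get((idx + 1) % view.size());
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C");
        System.out.println(pingDest(view, "A")); // prints B
        System.out.println(pingDest(view, "C")); // prints A
    }
}
```

When a view change removes a member, recomputing pingDest on the new view closes the ring again.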
28Failure detection - FD
29Reliable delivery FIFO
- Lossless and FIFO delivery for multicast and unicast messages
- Multicast: NAK and ACK
- Unicast: ACK
- Missing messages (gaps) are retransmitted
- Sender resends or
- Receiver requests retransmission
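The receiver-driven (NAK) case can be sketched as follows: the receiver delivers messages in sequence-number order and reports the gaps it would ask the sender to retransmit. NakReceiverSketch is a hypothetical name, one instance is assumed per sender, and seqnos are assumed to start at 1; this is an illustration, not the NAKACK code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of gap detection and FIFO delivery for one sender.
class NakReceiverSketch {
    private long nextToDeliver = 1;                               // next expected seqno
    private final TreeMap<Long, String> buffer = new TreeMap<>(); // out-of-order msgs
    final List<String> delivered = new ArrayList<>();             // in FIFO order

    // Handles one received message and returns the seqnos that are still
    // missing (the ones a retransmission request would name).
    List<Long> receive(long seqno, String payload) {
        if (seqno >= nextToDeliver) {
            buffer.putIfAbsent(seqno, payload); // older seqnos are duplicates
        }
        // Deliver everything that is now contiguous.
        while (buffer.containsKey(nextToDeliver)) {
            delivered.add(buffer.remove(nextToDeliver));
            nextToDeliver++;
        }
        // Whatever lies between nextToDeliver and the highest buffered
        // seqno but is not buffered is a gap.
        List<Long> missing = new ArrayList<>();
        if (!buffer.isEmpty()) {
            for (long s = nextToDeliver; s < buffer.lastKey(); s++) {
                if (!buffer.containsKey(s)) {
                    missing.add(s);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        NakReceiverSketch r = new NakReceiverSketch();
        r.receive(1, "a");
        System.out.println(r.receive(3, "c")); // prints [2]
        r.receive(2, "b");                     // retransmission arrives
        System.out.println(r.delivered);       // prints [a, b, c]
    }
}
```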
30Encryption
- Uses public/private key encryption when a new member joins, to hand it the shared group key
- Shared key is used to encrypt all messages
- Group key is recomputed on joins/leaves
- SSL ConnectionTable
- As alternative, to be used in TCP
- Uses SSLSocket rather than Socket
31Properties configuration
- Plain string format
  "UDP(mcast_addr=228.8.8.8;mcast_port=45566;ip_ttl=32;" +
  "mcast_send_buf_size=64000;mcast_recv_buf_size=64000):" +
  "PING(timeout=2000;num_initial_members=3):" +
  "MERGE2(min_interval=5000;max_interval=10000):" +
  "FD_SOCK:" +
  "VERIFY_SUSPECT(timeout=1500):" +
  "pbcast.NAKACK(max_xmit_size=8096;gc_lag=50;retransmit_timeout=600,1200,2400):" +
  "UNICAST(timeout=600,1200,2400,4800):" +
  "pbcast.STABLE(desired_avg_gossip=20000):" +
  "FRAG(frag_size=8096;down_thread=false;up_thread=false):" +
  "pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;" +
  "shun=false;print_local_addr=true)"
- URL / XML
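To make the plain string format concrete, here is a small parser sketch. It assumes protocols are separated by ':' and parameters by ';' inside parentheses, written as name=value. StackConfigSketch and ProtocolConfig are hypothetical names; this is not the configurator that ships with JGroups, and malformed input is not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: parse "NAME(k=v;k=v):NAME:NAME(k=v)" into a list.
class StackConfigSketch {

    static final class ProtocolConfig {
        final String name;
        final Map<String, String> params = new LinkedHashMap<>();

        ProtocolConfig(String name) {
            this.name = name;
        }
    }

    // Splits the spec on ':' outside parentheses, bottom protocol first.
    static List<ProtocolConfig> parse(String spec) {
        List<ProtocolConfig> result = new ArrayList<>();
        int depth = 0;
        int start = 0;
        for (int i = 0; i <= spec.length(); i++) {
            // A virtual ':' after the last character flushes the final protocol.
            char c = i < spec.length() ? spec.charAt(i) : ':';
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ':' && depth == 0) {
                String part = spec.substring(start, i).trim();
                if (!part.isEmpty()) {
                    result.add(parseProtocol(part));
                }
                start = i + 1;
            }
        }
        return result;
    }

    private static ProtocolConfig parseProtocol(String part) {
        int open = part.indexOf('(');
        if (open < 0) {
            return new ProtocolConfig(part); // no parameters, e.g. FD_SOCK
        }
        ProtocolConfig cfg = new ProtocolConfig(part.substring(0, open));
        String body = part.substring(open + 1, part.lastIndexOf(')'));
        for (String kv : body.split(";")) {
            int eq = kv.indexOf('=');
            if (eq > 0) {
                cfg.params.put(kv.substring(0, eq).trim(), kv.substring(eq + 1).trim());
            }
        }
        return cfg;
    }

    public static void main(String[] args) {
        String spec = "UDP(mcast_addr=228.8.8.8;mcast_port=45566):"
                    + "PING(timeout=2000;num_initial_members=3):FD_SOCK";
        for (ProtocolConfig p : parse(spec)) {
            System.out.println(p.name + " " + p.params);
        }
    }
}
```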
32Advantages of protocol stacks
- Each property is implemented by one protocol
- Fragmentation, retransmission, ordering
- Protocols are assembled into a stack
- Stack has exactly the properties needed by the application / required by the network
- Can't get this with java.net.Socket, always comes with full TCP/IP
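The idea of assembling protocols into a stack can be sketched with a toy model in which every layer adds a header on the way down and removes it on the way up. The interface and class names below are hypothetical and much simpler than the real org.jgroups.stack classes; messages are plain strings to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a protocol stack.
class StackSketch {

    interface Protocol {
        String down(String msg); // toward the network
        String up(String msg);   // toward the application
    }

    // Layer that prepends a header on the way down and strips it on the way up.
    static final class HeaderProtocol implements Protocol {
        private final String header;

        HeaderProtocol(String name) {
            this.header = "[" + name + "]";
        }

        public String down(String msg) {
            return header + msg;
        }

        public String up(String msg) {
            if (!msg.startsWith(header)) {
                throw new IllegalArgumentException("missing header " + header);
            }
            return msg.substring(header.length());
        }
    }

    private final List<Protocol> layers; // index 0 = top (next to the application)

    StackSketch(List<Protocol> layers) {
        this.layers = new ArrayList<>(layers);
    }

    // Sending: the message passes every layer from top to bottom.
    String down(String msg) {
        for (Protocol p : layers) {
            msg = p.down(msg);
        }
        return msg;
    }

    // Receiving: the message passes every layer from bottom to top.
    String up(String msg) {
        for (int i = layers.size() - 1; i >= 0; i--) {
            msg = layers.get(i).up(msg);
        }
        return msg;
    }

    public static void main(String[] args) {
        StackSketch stack = new StackSketch(Arrays.<Protocol>asList(
                new HeaderProtocol("GMS"),
                new HeaderProtocol("NAKACK"),
                new HeaderProtocol("UDP")));
        String wire = stack.down("hello");
        System.out.println(wire);           // prints [UDP][NAKACK][GMS]hello
        System.out.println(stack.up(wire)); // prints hello
    }
}
```

Swapping or removing a layer changes the stack's properties without touching the application code that calls down() and up().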
33Advantages of protocol stacks
- Small scope: a protocol does just one job, but does it well
- Protocol stacks are fashionable
- Servlet 2.3 filters
- Interceptors (CORBA, JBoss)
- AOP: separation of concerns, e.g. fragmentation should not be an application concern
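As an example of a concern that a protocol can take off the application's hands, here is a minimal fragmentation sketch: a large payload is cut into fragments of at most fragSize bytes and put back together on the receiving side. FragSketch is a hypothetical name; the real FRAG protocol also has to tag fragments with headers and cope with interleaved senders, which this sketch ignores.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of fragmentation and reassembly.
class FragSketch {

    // Splits data into consecutive fragments of at most fragSize bytes.
    static List<byte[]> fragment(byte[] data, int fragSize) {
        if (fragSize <= 0) {
            throw new IllegalArgumentException("fragSize must be positive");
        }
        List<byte[]> frags = new ArrayList<>();
        for (int off = 0; off < data.length; off += fragSize) {
            int end = Math.min(off + fragSize, data.length);
            frags.add(Arrays.copyOfRange(data, off, end));
        }
        return frags;
    }

    // Concatenates fragments, assumed to be complete and in order.
    static byte[] reassemble(List<byte[]> frags) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] f : frags) {
            out.write(f, 0, f.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] payload = new byte[20000];
        List<byte[]> frags = fragment(payload, 8096);
        System.out.println(frags.size()); // prints 3
        System.out.println(Arrays.equals(payload, reassemble(frags))); // prints true
    }
}
```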
34Benefits
- Same application code, different protocol stacks (deployment issue)
- Application requirements reflected in protocol stack specification
- App focuses on domain specific issues
35Building Blocks
- Replicated Cache
- NotificationBus
- Group RPC
36Replicated Cache
- Shared state across a group
- Any change is replicated to all members
- New members acquire initial state from coord
- Structures supported
- Tree
- Hashmap
- Queues
37NotificationBus
- Thin layer on Channel
- Notifications sent to all members
- Callback when notification is received
- Hook for state sharing
38Group RPC
- Invoke a method call in all members
- Get a list of responses
- Wait for all responses, a majority, the first response, or none (optional timeout)
- Handles crashed members correctly (no blocking)
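The waiting rules can be sketched with a small response collector. The names RpcSketch, Mode and Collector are hypothetical and the timeout is left out; the point is only to show when a call may return and why a crashed member does not block it.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of response collection for a group method call.
class RpcSketch {

    enum Mode { GET_ALL, GET_MAJORITY, GET_FIRST, GET_NONE }

    // Number of responses needed before the caller may return.
    static int required(Mode mode, int groupSize) {
        switch (mode) {
            case GET_ALL:      return groupSize;
            case GET_MAJORITY: return groupSize / 2 + 1;
            case GET_FIRST:    return Math.min(1, groupSize);
            default:           return 0; // GET_NONE: fire and forget
        }
    }

    static final class Collector {
        private final Mode mode;
        private final Set<String> pending = new HashSet<>();
        private final Map<String, Object> responses = new LinkedHashMap<>();

        Collector(Mode mode, Collection<String> members) {
            this.mode = mode;
            pending.addAll(members);
        }

        void onResponse(String member, Object value) {
            if (pending.remove(member)) {
                responses.put(member, value);
            }
        }

        // A suspected (crashed) member is no longer waited for.
        void onSuspect(String member) {
            pending.remove(member);
        }

        boolean isDone() {
            int expected = pending.size() + responses.size();
            return pending.isEmpty() || responses.size() >= required(mode, expected);
        }

        Map<String, Object> getResponses() {
            return responses;
        }
    }

    public static void main(String[] args) {
        Collector c = new Collector(Mode.GET_ALL, java.util.Arrays.asList("A", "B", "C"));
        c.onResponse("A", 1);
        c.onResponse("B", 2);
        System.out.println(c.isDone()); // prints false, still waiting for C
        c.onSuspect("C");               // C crashed
        System.out.println(c.isDone()); // prints true
    }
}
```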
39Theory
- Virtual Synchrony
- DEFAULT
- Probabilistic Broadcast
40Virtual Synchrony
- A View is a list of members (A,B,C,D)
- When members join/leave, a new view will be installed (A,C,D)
- Every healthy member receives the same set of messages between subsequent views
- Messages sent in V1 are received in V1
- All msgs by sender received in same order
41Virtual Synchrony
- The FLUSH protocol ensures that all members have received all msgs in V1 before installing V2
- New members won't receive messages from previous views
- Member that left won't receive msgs
[Figure: message timeline for members A, B, C across views V1 and V2]
42DEFAULT
- VSYNC expensive, doesn't scale well
- Stop-the-world model on view changes
- DEFAULT treats views as regular msgs
- Less stringent reliability guarantees
- Still good enough for most apps
- SMACK does away with membership altogether, uses approximation of mbrship
- Good for large groups (no coord)
43Probabilistic Broadcast
- First a dirty multicast
44Probabilistic Broadcast
- Then gossipping to repair failures
45Probabilistic Broadcast
- Epidemic style msg dissemination
- Very resilient to attacks
- Avoid nak implosions
- Suited for large networks
- Probability
- Close to 1 that either all members or none receive the msg
- Close to 0 that only a few members receive the msg
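The two phases (dirty multicast, then gossip to repair) can be simulated in a few lines. This is a simplified illustration, not the PBCAST protocol: GossipSketch is a hypothetical name, the loss model is assumed to be independent per member and message, and in each round every member simply pulls whatever it lacks from one randomly chosen peer.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical simulation of "dirty multicast, then gossip to repair".
class GossipSketch {

    // Phase 1: unreliable multicast. Member 0 is the sender and has every
    // message; each other member misses each message with probability lossRate.
    static List<Set<Integer>> dirtyMulticast(int members, int messages,
                                             double lossRate, Random rnd) {
        List<Set<Integer>> received = new ArrayList<>();
        for (int m = 0; m < members; m++) {
            Set<Integer> got = new HashSet<>();
            for (int id = 0; id < messages; id++) {
                if (m == 0 || rnd.nextDouble() >= lossRate) {
                    got.add(id);
                }
            }
            received.add(got);
        }
        return received;
    }

    static boolean converged(List<Set<Integer>> received, int messages) {
        for (Set<Integer> got : received) {
            if (got.size() < messages) {
                return false;
            }
        }
        return true;
    }

    // Phase 2: gossip rounds. Returns the number of rounds needed until
    // every member has every message, or -1 if maxRounds was not enough.
    static int gossipUntilConverged(List<Set<Integer>> received, int messages,
                                    int maxRounds, Random rnd) {
        int n = received.size();
        for (int round = 0; round < maxRounds; round++) {
            if (converged(received, messages)) {
                return round;
            }
            for (int i = 0; i < n && n > 1; i++) {
                int peer = rnd.nextInt(n - 1);
                if (peer >= i) {
                    peer++; // never gossip with yourself
                }
                received.get(i).addAll(received.get(peer));
            }
        }
        return converged(received, messages) ? maxRounds : -1;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Set<Integer>> received = dirtyMulticast(8, 50, 0.3, rnd);
        int rounds = gossipUntilConverged(received, 50, 10000, rnd);
        System.out.println("rounds needed: " + rounds);
    }
}
```

No member has to send NAKs to the original sender, which is how the gossip phase avoids NAK implosions.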
46Serverless JMS
- JMS based on JGroups
- Peer-to-peer architecture rather than C/S
- Client publishing to a topic
- Instead of sending the msg to a server that distributes it to multiple clients, the publisher multicasts the message
- JMS Server just another member
- Handles persistent messages (DB)
47Serverless JMS
Cost: 4 unicasts
Cost: 1 multicast
48Serverless JMS
- Clients are still able to publish even when server is down
- Caveat: works in scenario where client and server are in same multicast-reachable NW
- Status
- Topics/Queues available
- No TX/XA, no durable subscriptions, no persistent messages
- Download (standalone) beta at jboss.org
49Session Replication in Tomcat
- Done by Filip Hanik in Tomcat 4.x
- Servlet sessions are replicated across Tomcat processes
- New Tomcat instance gets sessions from existing Tomcat instance(s)
- Modification (addition, removal of attributes) of session gets replicated
50Session Clustering in Tomcat II
- Expiry of session will expire session everywhere in the cluster
- Last timestamp update
- External load-balancer distributes requests to Tomcat instances
- Round-robin
- Sticky, next server on crash
51Session Clustering in Tomcat III
52Where is JGroups used ?
- JBoss
- Clustering
- Replication of entity beans, SLSBs and SFSBs
- HA-JNDI
- Cache invalidation
- Session repl (integrated Tomcat, Jetty)
- Serverless JMS
- Cache
- Replicated transactional clustered cache
53Where is JGroups used ?
- JOnAS appserver (clustering)
- GroupPac (FT-CORBA impl)
- GCT port to .NET
- Replicated Caching
- OpenSymphony OSCache
- Jakarta Turbine's JCS
- SwarmCache
54Where is JGroups used ?
- Session replication
- Jetty
- Tomcat 4.x
- Work in progress on plugin architecture for Tomcat 5.x
- Unofficial ones...
55Performance
- 4 nodes, 1 or 2 senders
- 750MHz SunBlade 1000 512MB, 100MB switched ethernet
- JGroups 2.1
- 8000 10K msgs, in 200 bursts of 20 (2 senders), sleep after burst 5ms
- 451 msgs/s, 4.5MB/s throughput
- Resident heap size 35MB max (-Xmx128m)
56Performance
- 1.4 billion messages total
- 4 nodes, 2 senders
- Message size: 10K
- Average msgs/s: 350
- Max resident mem: 35M (-Xmx128m)
- Tests available as part of JG distro
- Includes gnuplot scripts to generate graphs
57Current and future projects
- JBossCache, Serverless JMS
- Port to J2ME (first version available on www.jgroups-me.org)
- hsqldb (HyperSonic) database replication
- JCache JSR 107 compliant impl (JBoss Cache)
- Potential work on GroupComm JSR
- jcluster project on dev.java.net
58Links
- www.jgroups.org
- "Papers and Articles" link to IBM developerWorks
59Questions ?