Title: Reliable Datagram IPC


1
Reliable Datagram IPC
  • Richard.Frank_at_oracle.com, Zach.Brown_at_oracle.com

2
Vision Statement
  • A low overhead, low latency, high bandwidth,
    ultra reliable, supportable, IPC protocol and
    transport system
  • Which matches Oracle's existing IPC models for RAC
    communication
  • Optimized for transfers from 200 bytes to 8 MB

3
Goal and Objective
  • Support for a reliable datagram IPC in OpenIB
  • Based on Socket API
  • Minimal code change / testing for Oracle
  • Failover inter HCA and intra HCA ports
  • Runs over IB, Ether, iWARP, etc
  • 2 month validation / certification for RAC

4
Today's Situation
  • TCP streams used for connections to database by
    external clients, app servers, etc.
  • Reliable datagrams used for internal database
    IPC (RAC)
  • Thousands of processes
  • 200k associations (not connections)
  • 64 nodes

5
Parallel Query
  • SQL decomposed into execution plan / tree
  • Set of producer / consumer pipelined stages
  • Based on data accessed (rows, physical
    organization, logical operations (hash, index))
  • Each execution stage has producer / consumer
    slave groups (source, sink)
  • Each group can be many slaves 32

6
Parallel Query
  • Operation tree / plan is not aware of slave
    locality: communication could be local via shared
    memory or remote via IPC.
  • N:N, 1:N, N:1 communication between groups (16
    source, 16 sink, 16 nodes = 65k associations
    for N:N communication) for 1 query
  • May change group organization / communication
    model at each stage of plan.
  • 64k message size capable; typical today is 16k
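
The 65k figure can be checked with a little arithmetic. This is a minimal sketch, assuming the slide means 16 producer slaves and 16 consumer slaves on each of 16 nodes, with every producer talking to every consumer; the function name is illustrative and not from the presentation.

```python
def nn_associations(producers_per_node: int, consumers_per_node: int, nodes: int) -> int:
    """Associations needed when every producer slave talks to every consumer slave."""
    total_producers = producers_per_node * nodes
    total_consumers = consumers_per_node * nodes
    return total_producers * total_consumers


# 256 producers x 256 consumers across the cluster (assumed reading of the slide)
print(nn_associations(16, 16, 16))  # 65536
```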

7
Oracle Buffer Cache
  • Distributed Cache
  • Client / Server
  • Client sends request for buffer
  • Server sends back buffer (DDP)
  • Each node has pool of servers
  • Any client can ask any server

8
Oracle Buffer Cache
  • Buffer size is 8k by default but can range from
    2k up to 32k
  • Associations per server are (n-1) × C
  • C clients per node, n nodes
  • (16-1) × 800 = 12k per server process.
  • 8 servers per node = 96k associations
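
The association counts on this slide follow from the (n-1) × C formula. A small sketch, assuming 16 nodes and 800 clients per node (values that reproduce the 12k and 96k figures above); the function name is illustrative.

```python
def associations_per_server(n_nodes: int, clients_per_node: int) -> int:
    """Each server can be asked by every client on every other node."""
    return (n_nodes - 1) * clients_per_node


per_server = associations_per_server(16, 800)  # 15 * 800
per_node = 8 * per_server                      # 8 server processes per node

print(per_server, per_node)  # 12000 96000
```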

9
Oracle IPC Usage
  • New database functionality will significantly
    increase IPC utilization
  • Approaches database I/O rates
  • Very large messages > 8 MB

10
Reliable Datagram IPC
  • UDP: Oracle adds reliable delivery via a user-mode
    wire protocol engine.
  • Two sockets per process, thousands of messages
    on the wire
  • Slow send times (windowing, acks,
    retransmissions)
  • Holds together but degenerates under CPU load
  • Well tested!
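
To illustrate the kind of work a user-mode reliability layer has to do on top of an unreliable datagram service, here is a deliberately simplified stop-and-wait sketch. It is not Oracle's wire protocol: there is no windowing, the channel is an in-memory stand-in for UDP with a fixed drop pattern, and the acknowledgement is modelled as the return value of `send`. All class and function names are hypothetical.

```python
from collections import deque


class LossyChannel:
    """Hypothetical in-memory stand-in for UDP that drops packets on a fixed pattern."""

    def __init__(self, drop_pattern):
        self.drop_pattern = deque(drop_pattern)  # True means "drop the next packet"
        self.delivered = []

    def send(self, packet) -> bool:
        drop = self.drop_pattern.popleft() if self.drop_pattern else False
        if not drop:
            self.delivered.append(packet)
        return not drop  # stands in for "an ack came back"


def reliable_send(channel, messages, max_retries=5) -> int:
    """Stop-and-wait: retransmit each message until it is acknowledged."""
    transmissions = 0
    for seq, payload in enumerate(messages):
        for _attempt in range(max_retries):
            transmissions += 1
            if channel.send((seq, payload)):
                break
        else:
            raise TimeoutError(f"message {seq} was never acknowledged")
    return transmissions


channel = LossyChannel([True, False, True, True, False])
total = reliable_send(channel, ["a", "b"])
print(total, channel.delivered)
```

Every retransmission here costs user-mode CPU time, which is consistent with the slide's point that this approach degrades when the CPU is loaded.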

11
Available Options
  • uDAPL / itAPI: not supporting
  • IPoIB: high CPU overhead, same unreliable
    delivery (UDP)
  • SDP: connection oriented
  • We want to take our existing well-tested UDP
    module and shut off most of it to run over an
    O/S-provided RD IPC

12
Recommendation
  • RD: Reliable Datagram IPC over IB
  • 50% less CPU than IPoIB, UDP
  • ½ latency of UDP (no user-mode acks)
  • Within 5% of uDAPL throughput using Oracle
  • Minimal code change: reduced our UDP module by
    70% (removed windowing, acks, retransmissions,
    etc.)
  • RDS driver: 1k lines of C (b-copy)
  • Decoupled from user-mode CPU loading
  • Passes all Oracle regression tests in < 2 weeks!
  • Supports fail-over across and within HCAs.

13
RDS IPC over IB
  • Uses IB reliable connection (RC)
  • Node to Node level connection
  • User mode sockets share small pool of node to
    node RCs.
  • Formed either dynamically at send or at system
    startup
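
The sharing described on this slide can be pictured as a pool keyed by remote node rather than by socket. The sketch below is an assumption-laden illustration, not the RDS driver: the connection is a plain dictionary, and `NodeConnectionPool` and its methods are hypothetical names. It shows connections being formed dynamically at first send and then reused by every local socket.

```python
class NodeConnectionPool:
    """Hypothetical sketch: one shared node-to-node connection per remote node."""

    def __init__(self):
        self._by_node = {}
        self.connects = 0  # how many connections were actually formed

    def _get(self, remote_node):
        conn = self._by_node.get(remote_node)
        if conn is None:
            # Formed dynamically on first send to this node.
            self.connects += 1
            conn = {"node": remote_node, "sent": []}
            self._by_node[remote_node] = conn
        return conn

    def send(self, local_port, remote_node, remote_port, payload):
        # Many local sockets multiplex over the same node-level connection.
        self._get(remote_node)["sent"].append((local_port, remote_port, payload))


pool = NodeConnectionPool()
for local_port in range(1000):          # a thousand local sockets
    pool.send(local_port, "node-b", 7, b"ping")
    pool.send(local_port, "node-c", 7, b"ping")
print(pool.connects)  # 2
```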

14
Oracle Block Service Rate

15
Service Response Time

16
CPU Cost Per Block Served

17
(No Transcript)

18
RDS IPC
  • Implemented in 3 phases
  • b-copy
  • Zero Copy
  • Z-copy Directed Sends / Recvs (ES-API additions)

19
B-Copy
  • Sends are copied and completed immediately
  • Sends are not guaranteed to have made it to
    remote application.
  • If a send fails asynchronously to submission, the
    application must detect the loss of the send
  • Can only fail if there is no path to the
    destination (remote port / process is gone, or
    the path has failed with no alternate path)

20
B-Copy Send/Recv
  • Recvs are buffered in kernel / queued to remote
    socket.
  • If the total buffers queued to a remote socket
    exceed a threshold, the sending socket is
    back-pressured (EWOULDBLOCK) when sending to the
    blocked remote socket.
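
The back-pressure rule can be sketched as a bounded receive queue. This is an illustration under assumptions, not the RDS implementation: the threshold is counted in bytes, and `ReceiveQueue` is a hypothetical name. Python's `BlockingIOError` is the standard exception for `EWOULDBLOCK`.

```python
import errno


class ReceiveQueue:
    """Hypothetical model of the buffers queued to one receiving socket."""

    def __init__(self, threshold_bytes: int):
        self.threshold_bytes = threshold_bytes
        self.queued = []
        self.queued_bytes = 0

    def enqueue(self, payload: bytes) -> None:
        # Called on behalf of a sender; refuses once the threshold would be exceeded.
        if self.queued_bytes + len(payload) > self.threshold_bytes:
            raise BlockingIOError(errno.EWOULDBLOCK, "remote socket is congested")
        self.queued.append(payload)
        self.queued_bytes += len(payload)

    def receive(self) -> bytes:
        # The receiving application drains the queue, which relieves the pressure.
        payload = self.queued.pop(0)
        self.queued_bytes -= len(payload)
        return payload


queue = ReceiveQueue(threshold_bytes=10)
queue.enqueue(b"12345")
queue.enqueue(b"12345")
try:
    queue.enqueue(b"x")
except BlockingIOError as exc:
    print("sender back-pressured:", exc.errno == errno.EWOULDBLOCK)
queue.receive()
queue.enqueue(b"x")  # succeeds once the receiver has drained a buffer
```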

21
Z-Copy Send/Recv
  • Dynamic registration of buffer > size
  • Application is not required to do explicit
    registration.
  • Oracle IPC buffers are in shared memory and
    private heap
  • impractical to pre-register
  • O/S must manage any caching of registrations

22
Directed Sends / Recvs (DDP)
  • Key for target buffer returned from RDS interface
    (get memory handle)
  • Key is sent by application to remote side
  • Remote side initiates directed send passing in
    key of remote target buffer
  • Uses RDMA write to move data
  • ES-API additions: working on definition
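
The key exchange above can be sketched as a toy in-memory model. Since the slide says the ES-API additions are still being defined, every name here (`DirectedReceiver`, `get_memory_handle`, `rdma_write`) is hypothetical, and the "RDMA write" is just a copy into a Python `bytearray`.

```python
import itertools


class DirectedReceiver:
    """Hypothetical model of the side that owns the target buffer."""

    def __init__(self):
        self._targets = {}
        self._next_key = itertools.count(1)

    def get_memory_handle(self, buffer: bytearray) -> int:
        # Step 1: the interface returns a key for the target buffer.
        key = next(self._next_key)
        self._targets[key] = buffer
        return key

    def rdma_write(self, key: int, data: bytes) -> None:
        # Step 3: the remote side's directed send lands directly in the buffer.
        target = self._targets[key]
        if len(data) > len(target):
            raise ValueError("data larger than target buffer")
        target[:len(data)] = data


receiver = DirectedReceiver()
target = bytearray(8)
key = receiver.get_memory_handle(target)

# Step 2: the application sends `key` to the remote side (not modelled here).
receiver.rdma_write(key, b"block")
print(bytes(target[:5]))  # b'block'
```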

23
Next Steps ?
  • RDS bcopy supported in Oracle 10.2.0.2.
  • RDS from SilverStorm ported to OpenIB Gen2
  • Preparing to test OpenIB RDS at Oracle

24
Next Steps
  • Work on zcopy / directed send (ddp) specification
    now (ES-API).
  • RD IPC Docs from Oracle
  • Richard.Frank_at_oracle.com
  • RDS/eth
  • Zach.Brown_at_oracle.com