Make Protocol Ready for Gigabit - PowerPoint PPT Presentation

About This Presentation
Title:

Make Protocol Ready for Gigabit


Slides: 32
Provided by: shie151

Transcript and Presenter's Notes


1
Make Protocol Ready for Gigabit
2
Scopes
  • In this presentation, we will present various
    protocol design and implementation techniques
    that can allow a protocol to function correctly
    on Gbps or deliver Gbps performance to the user
    application or the system output.
  • (In the previous presentation, what we presented
    were operating system design and implementation
    techniques for supporting Gbps network)

3
Protect Against Wrapped Sequence Number
4
Sequence Number Wrapping Around
  • TCP uses sequence numbers to help it detect
    packet loss and duplicate packets.
  • TCP's sequence number counts bytes, rather than
    packets. The sequence number field is 32 bits
    long.
  • On a gigabit network, it takes only about 32
    seconds to wrap around a sequence number! The TTL
    field in the IP header only limits the maximum
    number of hops a packet can traverse. It does not
    limit the maximum amount of time a packet can
    stay in the network.
  • Wrapping around a sequence number can result in
    wrong comparisons of the freshness of two
    sequence numbers. As the following slides show,
    this can cause fresh packets to be discarded or
    stale packets to be accepted.
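
The wrap-around time follows directly from the size of the sequence space and the link rate. The following is a minimal sketch of that arithmetic, assuming the link is kept fully busy with payload bytes (the function name is illustrative and not taken from any TCP implementation). Taking 1 Gbps as 2^30 bit/s gives exactly 32 seconds, while 10^9 bit/s gives about 34 seconds.

```python
SEQ_SPACE_BYTES = 2 ** 32  # 32-bit sequence number, counted in bytes


def wrap_time_seconds(link_bps):
    """Time needed to send 2^32 bytes at the given link rate (bits/second)."""
    return SEQ_SPACE_BYTES * 8 / link_bps


# Compare the wrap-around time at three common Ethernet speeds.
for name, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
    print(f"{name}: {wrap_time_seconds(bps):.1f} s")
```

At 10 Mbps the same computation gives close to an hour, which is why the problem only became pressing on gigabit links.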

5
Problem 1: Sequence Number Drops to Zero
  • Suppose that the length of the sequence number
    field is n bits. When the sequence number grows
    from 0, 1, ... up to 2^n - 1, the next increment
    wraps it around and it drops to zero.
  • As a result, when we compare two sequence numbers
    a and b where b > a, the comparison result may be
    wrong.
  • The effect of the wrong comparison is that a more
    recent packet carrying b will be rejected and
    discarded because it is considered older than the
    packet (carrying a) that has already been received.

6
Sequence Number Wheel to Avoid Problem 1
  • To avoid the comparison problem, we can use a
    sequence number wheel scheme.
  • The sequence number space (N = 2^n) is divided
    into two parts, each of which is N/2 large.
  • The dividing line is not fixed. It floats with
    the sequence number (e.g., a) being compared
    against.
  • One part represents all the sequence numbers that
    are considered larger than a.
  • a < b if |a - b| < N/2 and a < b, or |a - b| >
    N/2 and a > b.
  • The other part represents all the sequence
    numbers that are considered smaller than a.
  • Otherwise, a > b.
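
The wheel comparison can be written down directly from the rule above. This is a minimal sketch, assuming n = 32 as in TCP; `seq_lt` is a hypothetical helper name, and the case where the two numbers are exactly N/2 apart is treated as "not smaller" because the rule leaves it undefined.

```python
N = 2 ** 32  # size of the sequence number space (n = 32 bits)


def seq_lt(a, b):
    """Return True if a is considered older (smaller) than b on the wheel."""
    distance = abs(a - b)
    # Either b is numerically ahead and close by, or b has already wrapped
    # around while a is still near the top of the number space.
    return (distance < N // 2 and a < b) or (distance > N // 2 and a > b)


print(seq_lt(10, 20))     # plain case: no wrap involved
print(seq_lt(N - 5, 3))   # 3 was produced after wrapping, so N - 5 is older
print(seq_lt(3, N - 5))   # the reverse comparison
```

A plain `a < b` would get the second and third comparisons wrong, which is exactly Problem 1.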

7
Sequence Number Wheel
8
Problem 2: Sequence Number Wraps and Grows Up
  • On a Gbps network, in just about 32 seconds, a
    sequence number can wrap and grow back to nearly
    the same number (e.g., a -> 0 -> a - 1).
  • This means that an outdated packet (carrying a)
    that stays in the network for a long time (e.g.,
    32 seconds) may look exactly like the next packet
    that the receiver expects to receive (because
    the last packet received carries a - 1).
  • This problem may result in a corrupted received
    file.
  • This problem cannot be solved by the sequence
    number wheel scheme.

9
PAWS Used to Detect Problem 2
  • PAWS (Protection Against Wrapped Sequence
    numbers, RFC 1323) is a scheme used in
    tcp_input() to detect problem 2.
  • PAWS is based on the premise that the 32-bit
    timestamp values wrap around at a much lower
    frequency than the 32-bit sequence number, on a
    high-speed network.
  • The TCP timestamp option is assumed to be used in
    the TCP header.
  • Right now in FreeBSD 4.x, one timestamp tick
    represents 1 ms. Therefore, it takes about 24.8
    days (about 597 hours, 2^31 ms) for the sign bit
    to change; a full 32-bit wrap takes about 49.7
    days (about 1193 hours).

10
PAWS Used to Detect Problem 2
  • Therefore, when tcp_input() receives a packet, it
    first checks whether the new packet's timestamp
    is older (smaller) than the timestamp of the most
    recently received packet.
  • If it is older, the packet is normally dropped.
  • However, if the TCP connection has been idle for
    more than 24 days, its timestamp may have wrapped
    around, which can make the comparison wrong. In
    this case, the packet is not dropped.
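
The PAWS decision described above can be sketched as follows. This is an illustration under stated assumptions, not the FreeBSD code: the function and parameter names are hypothetical, timestamps are treated as 32-bit values compared with wrap-around in mind, and the idle limit is the 24 days mentioned in the slides.

```python
TS_SPACE = 2 ** 32
PAWS_IDLE_SECONDS = 24 * 24 * 60 * 60  # 24 days, the idle limit from the slides


def ts_lt(a, b):
    """True if 32-bit timestamp a is older than b, allowing for wrap-around."""
    return 0 < (b - a) % TS_SPACE < TS_SPACE // 2


def paws_accept(ts_val, ts_recent, idle_seconds):
    """Decide whether a segment passes the PAWS check.

    ts_val       -- timestamp carried by the arriving segment
    ts_recent    -- timestamp of the most recently accepted segment
    idle_seconds -- how long the connection has been idle
    """
    if not ts_lt(ts_val, ts_recent):
        return True  # timestamp is not older: accept
    # Timestamp looks older. Only trust that verdict if the connection has
    # not been idle long enough for the timestamp clock itself to wrap.
    return idle_seconds > PAWS_IDLE_SECONDS


print(paws_accept(ts_val=1000, ts_recent=2000, idle_seconds=5))
print(paws_accept(ts_val=2000, ts_recent=1000, idle_seconds=5))
```

The first call models a stale duplicate and is rejected; the second models the normal in-order case and is accepted.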

11
(No Transcript)
12
TCP Checksum Offloading
13
Turn off Checksum Computation
  • Computing a checksum is very expensive.
  • Every byte of a packet needs to be read from
    memory into the CPU, added, and then written from
    the CPU back to memory.
  • CPU cycles are wasted.
  • The bandwidth of the CPU-memory bus or the memory
    system (depending on which one is smaller) is
    wasted.
  • Therefore, on Gbps networks, how to avoid or
    reduce the checksum cost becomes an important
    topic.
  • Solution 1: Do not calculate the checksum at all.
  • E.g., right now on FreeBSD 4.x, you can turn off
    checksum computation for UDP packets.

14
Checksum Offloading
  • Solution 2: Let the network interface card do the
    checksum computation (IP header checksum and TCP
    data payload checksum).
  • Nowadays almost every Gigabit Ethernet NIC
    supports computing checksums on the NIC. For
    example, see the 3COM if_ti.c driver.
  • To take advantage of this hardware function, the
    NIC device driver needs to communicate with the
    TCP code so that tcp_output() and tcp_input()
    know whether they should compute checksums.
  • The risk of checksum offloading is that errors
    occurring between the TCP layer and the device
    driver cannot be detected by the receiver!

15
Free Checksum Computation on some RISC Processors
  • Sometimes, on a computer system with a RISC
    processor, the checksum computation can be
    performed without any cost.
  • Some researchers observed that most RISC
    processors can perform two instructions per clock
    cycle, of which only one operation can be loading
    data from memory or storing data to memory.
  • Thus, there is room for two more instructions in
    the following copy loop:
  • Load r0 r2
  • Store r2 r1

16
Free Checksum Computation on some RISC Processors
  • As a result, if we add two instructions that
    calculate checksum in between these two load and
    store instructions, the checksum can be
    calculated for free. (No other work can be done
    in between the load and store instructions
    anyway)
  • Load r0 r2
  • Add r5 r2 r5 ! Add to running sum in r5
  • Addc r5 0 r5 ! Add carry into r5
  • Store r2 r1
  • This example also shows that programmed I/O is
    not always worse than DMA.
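
The same idea, folding the checksum into the copy loop so that each word is touched only once, can be sketched in a high-level language. This only illustrates the single-pass structure, not the instruction-level scheduling that makes it free on the RISC processors described above. The function name is hypothetical, and the checksum computed is the standard 16-bit ones' complement Internet checksum.

```python
def copy_and_checksum(src):
    """Copy src into a new buffer and compute its Internet checksum in one pass."""
    dst = bytearray(len(src))
    total = 0
    for i in range(0, len(src) - 1, 2):
        # "Load" one 16-bit word and "store" it to the destination ...
        dst[i] = src[i]
        dst[i + 1] = src[i + 1]
        # ... and add it to the running sum while it is at hand.
        total += (src[i] << 8) | src[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # add the carry back in
    if len(src) % 2:
        # An odd trailing byte is padded with a zero byte for the sum only.
        dst[-1] = src[-1]
        total += src[-1] << 8
        total = (total & 0xFFFF) + (total >> 16)
    return bytes(dst), (~total) & 0xFFFF


data = bytes.fromhex("0001f203f4f5f6f7")
copied, checksum = copy_and_checksum(data)
print(copied == data, hex(checksum))
```

The carry fold after each addition plays the role of the `Addc` instruction in the slide's loop.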

17
TCP Header Prediction for Gigabit Networks
18
TCP Implementation is Complicated
  • In FreeBSD 4.2, tcp_input.c has 2797 lines of C
    code and tcp_output.c has 939 lines. There are
    339 lines of if statements in tcp_input.c and
    126 lines in tcp_output.c.
  • These numbers show:
  • TCP processing is complicated.
  • TCP input processing is more complicated than TCP
    output processing.
  • As presented previously, locality is important
    for good cache performance.
  • Also, because conditional branches can
    significantly hurt the performance of a pipelined
    CPU, their effect should be minimized.

19
TCP Header Prediction
  • Header prediction looks for packets that fit the
    profile of the packets that the receiver expects
    to receive next.
  • If a packet meets the header prediction
    condition, it is handled in just a few
    instructions. Otherwise, it is handled by the
    general-purpose processing code.
  • Actually, this is an old design principle:
    optimize for the common case!
  • The TCP header prediction scheme can improve TCP
    transfer throughput because it improves
    instruction locality, which can improve cache
    performance.

20
Two Common Cases
  • If TCP is sending data, the next expected segment
    for this connection is an ACK for outstanding
    data.
  • If TCP is receiving data, the next expected
    segment for this connection is the next
    in-sequence data segment.
  • On a LAN, where packet losses are rare, header
    prediction succeeds between 97% and 100% of the
    time. On a WAN, where packet losses are more
    likely, the percentage drops to between 83% and
    99%.
  • The code for processing these two common cases is
    placed at the beginning of tcp_input(). This
    results in a better cache performance.
  • There is no information about how well this
    scheme can improve TCP transfer throughput though.

21
Measuring RTTs on Gigabit Networks
22
Measuring RTTs Is Important
  • To more efficiently retransmit lost packets, the
    RTT of a connection should be correctly and
    precisely measured.
  • When a packet is lost, we do not want to wait
    unnecessarily long before resending it.
  • To increase the accuracy of measurements, the
    first step is to use a high-resolution clock.
  • Before FreeBSD 3.0, the clock resolution for TCP
    RTT measurement was 500 ms.
  • Now it is 1 ms.
  • The second step is to use more RTT measurement
    samples to calculate the average RTT.

23
Timer Management
  • Timers are extensively used to measure RTTs.
  • When a packet is sent, a timer is started. When
    the corresponding ACK returns, the timer is
    stopped. The elapsed time of the timer represents
    one RTT sample.
  • The above approach can be used on a low speed
    network such as 10 Mbps. On a gigabit network,
    where a 1500-byte packet is sent every 12
    microseconds, this approach is infeasible.
  • To reduce the high frequency of timer
    setup/cancel operations, the original solution is
    to get one RTT sample per RTT, rather than one
    RTT sample for every sent packet.
  • This is simple to implement: just do not set up
    another timer until the previous timer is
    cancelled.
  • However, the accuracy suffers.

24
TCP Timestamp Option
  • To get an RTT sample for every sent packet while
    avoiding the need to set up and cancel timers at
    a high frequency, TCP uses a timestamp option
    (RFC 1323).
  • The sender places a timestamp in every sent
    segment. The receiver echoes the timestamp back
    in the ACK. This allows the sender to subtract the
    echoed timestamp from the current time and use
    the difference as the RTT sample.
  • This option must be supported by both the TCP
    sender and receiver.
  • In contrast, the original timer-based design does
    not need support from the receiver.
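
The timestamp-based measurement reduces to one subtraction per ACK, with no timer to start or cancel. The following is a minimal sketch, assuming a 1 ms timestamp clock and 32-bit timestamps; the function names are hypothetical, and the smoothing weight of 1/8 is an assumption borrowed from the conventional TCP estimator rather than something stated in the slides.

```python
TS_SPACE = 2 ** 32  # timestamps are 32-bit values


def rtt_sample_ms(now_ms, ts_echo_ms):
    """RTT sample: current clock minus the timestamp echoed in the ACK."""
    return (now_ms - ts_echo_ms) % TS_SPACE  # modulo handles clock wrap


def smooth(srtt_ms, sample_ms, alpha=0.125):
    """Fold a new sample into the running average (weight alpha is assumed)."""
    return (1 - alpha) * srtt_ms + alpha * sample_ms


srtt = 100.0
for now, echoed in [(1050, 1000), (1180, 1100), (1260, 1200)]:
    sample = rtt_sample_ms(now, echoed)
    srtt = smooth(srtt, sample)
    print(sample, round(srtt, 2))
```

Because every ACK yields a sample, the average is built from many more measurements than the one-sample-per-RTT scheme provides.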

25
(No Transcript)
26
TCP Window Scale Option for Gigabit Networks
27
TCP Maximum Throughput
  • A TCP connection's maximum achievable throughput
    is limited by the minimum of the TCP sender's
    socket send buffer and the TCP receiver's socket
    receive buffer.
  • Maximum throughput = Min(socket send buffer on
    the sender, socket receive buffer on the
    receiver) / RTT.
  • Although we can use setsockopt() to enlarge the
    socket send and receive buffer to a big value,
    the advertised window field in the TCP header is
    only 16 bits, which means a maximum window size
    of 64 KB only.
  • On gigabit networks, clearly this is not enough.
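
A short calculation makes the 64 KB limit concrete. This is a sketch of the formula above using integer arithmetic; the function names are illustrative and the 10 ms RTT is an assumed value chosen only for the example.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value the 16-bit window field can hold


def max_throughput_bps(send_buf_bytes, recv_buf_bytes, rtt_ms):
    """Min(send buffer, receive buffer) / RTT, expressed in bits per second."""
    window_bits = min(send_buf_bytes, recv_buf_bytes) * 8
    return window_bits * 1000 // rtt_ms


def window_needed_bytes(link_bps, rtt_ms):
    """Window size required to keep a link of the given rate full."""
    return link_bps * rtt_ms // (8 * 1000)


rtt_ms = 10  # assumed round-trip time
# Even with a 1 MB send buffer, the unscaled window caps the throughput.
print(max_throughput_bps(1 << 20, MAX_UNSCALED_WINDOW, rtt_ms))
# Window needed to fill a 1 Gbps link at the same RTT.
print(window_needed_bytes(1_000_000_000, rtt_ms))
```

With these assumed numbers the unscaled window allows roughly 52 Mbps, while filling the gigabit link would need a window of about 1.25 MB.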

28
TCP Window Scale Option
  • In this scheme, the definition of the TCP window
    is enlarged from 16 to 32 bits.
  • The window field in the header is still 16 bits,
    but an option is defined that applies a scaling
    (shift) operation to the 16-bit value.
  • During the three-way handshake, this option is
    carried in the SYN and SYN/ACK packets to
    indicate whether the option is supported.
  • In a TCP implementation, the real window size is
    internally maintained as a 32-bit value.
  • The shift field in the option is 1 byte long.
  • 0 means no scaling is performed.
  • 14 is the maximum, allowing a window of up to
    64 KB × 2^14 (1 GB).

29
TCP Window Scale Option
  • This option can only appear in a SYN packet.
    Therefore, the scale factor is fixed in each
    direction when the connection is established.
  • The shift count is automatically calculated by
    TCP, based on the size of the socket receive
    buffer.
  • Keep dividing the buffer size by 2 until the
    resulting number is less than 64 KB.
  • Each host thus maintains two shift counts: S for
    sending and R for receiving.
  • Every 16-bit advertised window that is received
    from the other end is left-shifted by R bits to
    obtain the real advertised window.
  • When we need to send a window advertisement to
    the other end, the real window size is
    right-shifted by S bits, and the resulting 16-bit
    value is placed in the TCP header.
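
The shift-count selection and the two shift operations can be sketched as follows. The function names are hypothetical; the logic follows the description above (halve the buffer size until it fits in 16 bits, capped at a shift of 14).

```python
MAX_SHIFT = 14               # largest shift count the option allows
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field


def shift_for_buffer(recv_buf_bytes):
    """Keep halving the buffer size until it is less than 64 KB."""
    shift = 0
    while shift < MAX_SHIFT and (recv_buf_bytes >> shift) > MAX_UNSCALED_WINDOW:
        shift += 1
    return shift


def encode_window(real_window, s):
    """Sending side: right-shift the real window by S for the header field."""
    return real_window >> s


def decode_window(field_value, r):
    """Receiving side: left-shift the 16-bit field by R to get the real window."""
    return field_value << r


shift = shift_for_buffer(1_000_000)
field = encode_window(1_000_000, shift)
print(shift, field, decode_window(field, shift))
```

Note that the low-order bits are lost in the right shift, so the advertised window is only accurate to a multiple of 2^S bytes.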

30
Congestion Control on Gigabit Networks
31
Congestion Control
  • Congestion control is more difficult on Gigabit
    networks.
  • Although the absolute control delay (a
    connection's RTT) may remain about the same
    regardless of the link bandwidth (10, 100, or
    1000 Mbps), the cost of the control delay becomes
    higher on higher-bandwidth networks.
  • Why? During the same RTT, a larger amount of data
    has been injected into the network, before the
    control packet arrives at the traffic source to
    reduce its sending rate.
  • For example, in the Gigabit Ethernet 802.3x PAUSE
    flow control scheme, one RTT is required for the
    pause packet to take effect. Therefore, as the
    bandwidth increases, more data will be sent
    before the congestion control starts to take
    effect.
  • The result is that congestion control becomes
    less and less effective on high-bandwidth
    networks. (No solution!)
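
The growing cost of the control delay can be quantified as the amount of data sent during one RTT. This is a minimal sketch with an assumed RTT of 1 ms, chosen only for illustration; the function name is hypothetical.

```python
def bytes_sent_in_one_rtt(link_bps, rtt_us):
    """Data injected into the network before a control packet can take effect."""
    return link_bps * rtt_us // 8_000_000


RTT_US = 1000  # assumed RTT of 1 ms, identical for every link speed
for mbps in (10, 100, 1000):
    print(mbps, "Mbps:", bytes_sent_in_one_rtt(mbps * 1_000_000, RTT_US), "bytes")
```

With the same RTT, the data already in flight grows in direct proportion to the bandwidth, so each control decision arrives after proportionally more traffic has been sent.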