Title: Make Protocol Ready for Gigabit
1. Make Protocol Ready for Gigabit
2. Scope
- This presentation covers protocol design and implementation techniques that allow a protocol to function correctly on Gbps networks or to deliver Gbps performance to the user application or the system output.
- (The previous presentation covered operating system design and implementation techniques for supporting Gbps networks.)
3. Protect Against Wrapped Sequence Numbers
4. Sequence Number Wrapping Around
- TCP uses sequence numbers to detect packet loss and duplicate packets.
- TCP's sequence number counts bytes rather than packets. The sequence number field is 32 bits long.
- On a gigabit network, it takes only about 32 seconds to send 2^32 bytes and wrap the sequence number! The TTL field in the IP header limits only the maximum number of hops a packet can traverse. It does not limit how long a packet can stay in the network.
- After a wrap, comparing two sequence numbers to decide which one is fresher can give the wrong answer, and the consequences can be serious.
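The wrap-around time quoted above can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is illustrative only. With a decimal gigabit (10^9 bits per second) the result is about 34 seconds, and with 2^30 bits per second it is exactly 32 seconds, which matches the rounded figure on the slide.

```python
SEQ_SPACE_BYTES = 2 ** 32  # a 32-bit sequence number counts bytes


def wrap_time_seconds(link_bps: float) -> float:
    """Time needed to send 2**32 bytes at link_bps bits per second."""
    return SEQ_SPACE_BYTES * 8 / link_bps


if __name__ == "__main__":
    for name, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9)]:
        print(f"{name}: sequence number wraps after {wrap_time_seconds(bps):.1f} s")
```

The same calculation shows why wrapping was not a practical concern at 10 Mbps, where it takes close to an hour.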
5. Problem 1: Sequence Number Drops to Zero
- Suppose the sequence number field is n bits long. When the sequence number grows from 0, 1, ... up to 2^n - 1, the next increment wraps it back to zero.
- As a result, when we compare two sequence numbers a and b where b is more recent than a (b > a before the wrap), the comparison result may be wrong.
- The effect of the wrong comparison is that a more recent packet carrying b is rejected and discarded, because it is considered older than the packet carrying a that has already been received.
6. Sequence Number Wheel to Avoid Problem 1
- To avoid the comparison problem, we can use a sequence number wheel scheme.
- The sequence number space (N = 2^n) is divided into two parts, each N/2 large.
- The dividing line is not fixed. It floats with the sequence number (e.g., a) being compared against.
- One part represents all the sequence numbers that are considered larger than a.
  - a < b if |a - b| < N/2 and a < b, or |a - b| > N/2 and a > b.
- The other part represents all the sequence numbers that are considered smaller than a.
  - Otherwise.
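The wheel comparison can be sketched as follows. The first function follows the two-case rule literally, reading the garbled condition as |a - b| compared against N/2, which is an assumption about the original slide. The second is an equivalent modular form. Both function names are illustrative.

```python
def seq_lt(a: int, b: int, n_bits: int = 32) -> bool:
    """True if a is considered older (smaller) than b on the wheel."""
    n = 1 << n_bits
    a %= n
    b %= n
    diff = abs(a - b)
    if diff < n // 2:
        return a < b      # no wrap between a and b: plain comparison
    if diff > n // 2:
        return a > b      # b has wrapped past zero, so the smaller value is newer
    return False          # exactly half the wheel apart: ambiguous


def seq_lt_modular(a: int, b: int, n_bits: int = 32) -> bool:
    """Same test: b lies less than half a wheel ahead of a."""
    n = 1 << n_bits
    return 0 < (b - a) % n < n // 2


if __name__ == "__main__":
    print(seq_lt(0xFFFFFFF0, 5))   # 5 was reached after wrapping, so it is newer
    print(seq_lt(5, 0xFFFFFFF0))
```

The modular form needs no branches, which is why implementations usually prefer a subtraction followed by a sign test.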
7. Sequence Number Wheel
8. Problem 2: Sequence Number Wraps and Grows Up
- On a Gbps network, in roughly 32 seconds a sequence number can wrap and grow back to almost the same value (e.g., a -> 0 -> a - 1).
- This means that an outdated packet carrying a that has stayed in the network for a long time (e.g., 32 seconds) may look exactly like the next packet the receiver expects, because the last packet received carries a - 1.
- This problem may result in a corrupted received file.
- This problem cannot be solved by the sequence number wheel scheme.
9. PAWS Used to Detect Problem 2
- PAWS (Protect Against Wrapped Sequence numbers, RFC 1323) is a scheme used in tcp_input() to detect problem 2.
- PAWS is based on the premise that, on a high-speed network, the 32-bit timestamp value wraps around at a much lower frequency than the 32-bit sequence number.
- It assumes that the TCP timestamp option is carried in the TCP header.
- In FreeBSD 4.x, one timestamp tick currently represents 1 ms. The sign bit therefore changes only after 2^31 ms, which is about 24.8 days (about 596 hours). A full 32-bit wrap takes about 1193 hours.
10. PAWS Used to Detect Problem 2
- When tcp_input() receives a packet, it first checks whether the new packet's timestamp is older (smaller) than the timestamp of the most recently received packet.
- If it is older, the packet is normally dropped.
- The exception is a TCP connection that has been idle for more than 24 days. Its timestamp may have wrapped around, which can make the comparison wrong, so in this case the packet is not dropped.
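The PAWS decision can be sketched as below. This is a simplified illustration, not the FreeBSD code: the names `ts_val`, `ts_recent`, `idle_ticks`, and `IDLE_LIMIT_TICKS` are assumptions, and a 1 ms tick is assumed so that 24 days fits below the 2^31-tick sign-bit limit.

```python
TS_MOD = 1 << 32                              # timestamps are 32-bit counters
IDLE_LIMIT_TICKS = 24 * 24 * 60 * 60 * 1000   # 24 days, assuming 1 ms ticks


def ts_older(ts_val: int, ts_recent: int) -> bool:
    """True if ts_val is older than ts_recent (wheel-style comparison)."""
    return 0 < (ts_recent - ts_val) % TS_MOD < TS_MOD // 2


def paws_should_drop(ts_val: int, ts_recent: int, idle_ticks: int) -> bool:
    """Decide whether an arriving segment is rejected by the PAWS test.

    ts_val     : timestamp carried by the arriving segment
    ts_recent  : timestamp of the most recently accepted segment
    idle_ticks : how long the connection has been idle, in ticks
    """
    if not ts_older(ts_val, ts_recent):
        return False              # timestamp is not older: accept
    if idle_ticks > IDLE_LIMIT_TICKS:
        return False              # idle too long: comparison is unreliable
    return True                   # old duplicate: drop


if __name__ == "__main__":
    print(paws_should_drop(ts_val=100, ts_recent=200, idle_ticks=10))
    print(2 ** 31 / 1000 / 86400)  # days until the sign bit flips at 1 ms per tick
```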
12. TCP Checksum Offloading
13. Turn off Checksum Computation
- Computing a checksum is very expensive.
  - Every byte of a packet needs to be read from memory into the CPU, added, and then written from the CPU back to memory.
  - CPU cycles are wasted.
  - The bandwidth of the CPU-memory bus or of the memory system (whichever is smaller) is wasted.
- Therefore, on Gbps networks, avoiding or reducing the checksum cost becomes an important topic.
- Solution 1: Do not calculate the checksum at all.
  - E.g., FreeBSD 4.x currently lets you turn off checksum computation for UDP packets.
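To make the per-byte cost concrete, here is a minimal sketch of the 16-bit one's-complement checksum used by IP, TCP, and UDP (described in RFC 1071). It is written for clarity rather than speed, and the function name is illustrative. Every byte of the packet passes through the loop, which is the cost the slide refers to.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each 16-bit big-endian word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF


if __name__ == "__main__":
    payload = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(payload)))
```

A receiver verifies a packet by running the same sum over the data plus the transmitted checksum; an undamaged packet yields zero.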
14. Checksum Offloading
- Solution 2: Let the network interface card do the checksum computation (both the IP header checksum and the TCP data payload checksum).
- Nowadays almost every Gigabit Ethernet NIC supports computing checksums on the NIC. One example is the 3COM NIC supported by the if_ti.c driver.
- To take advantage of this hardware function, the NIC device driver needs to communicate with the TCP code so that tcp_output() and tcp_input() know whether they should compute checksums.
- The risk of checksum offloading is that errors occurring between the TCP layer and the device driver cannot be detected by the receiver!
15. Free Checksum Computation on Some RISC Processors
- Sometimes, on a computer system with a RISC processor, the checksum computation can be performed at no extra cost.
- Some researchers observed that most RISC processors can execute two instructions per clock cycle, of which only one can load data from memory or store data to memory.
- Thus, there is room for two more instructions in the following copy loop:
  - Load r0 r2 ! Load the word at the source address in r0 into r2
  - Store r2 r1 ! Store r2 to the destination address in r1
16. Free Checksum Computation on Some RISC Processors
- As a result, if we add two instructions that calculate the checksum between the load and store instructions, the checksum is calculated for free. (No other work can be done between the load and store instructions anyway.)
  - Load r0 r2
  - Add r5 r2 r5 ! Add to running sum in r5
  - Addc r5 0 r5 ! Add carry into r5
  - Store r2 r1
- This example also shows that programmed I/O is not always worse than DMA.
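The idea of folding the checksum into the copy loop can be illustrated in a high-level language. This is only a sketch of the structure, with an illustrative function name. Python gains nothing from instruction pairing, so the point is the single pass over the data, not the speed.

```python
from typing import Tuple


def copy_with_checksum(src: bytes) -> Tuple[bytearray, int]:
    """Copy src and compute its 16-bit one's-complement checksum in one pass."""
    padded = src + b"\x00" if len(src) % 2 else src
    dst = bytearray(len(src))
    total = 0
    for i in range(0, len(padded), 2):
        word = (padded[i] << 8) | padded[i + 1]    # "Load": fetch one 16-bit word
        total += word                              # "Add": add to the running sum
        total = (total & 0xFFFF) + (total >> 16)   # "Addc": fold the carry back in
        dst[i:i + 2] = src[i:i + 2]                # "Store": write the word to the copy
    return dst, ~total & 0xFFFF


if __name__ == "__main__":
    copy, checksum = copy_with_checksum(bytes.fromhex("0001f203f4f5f6f7"))
    print(copy.hex(), hex(checksum))
```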
17. TCP Header Prediction for Gigabit Networks
18. TCP Implementation is Complicated
- In FreeBSD 4.2, tcp_input.c has 2797 lines of C code and tcp_output.c has 939 lines. There are 339 lines of if statements in tcp_input.c and 126 in tcp_output.c.
- These numbers show that:
  - TCP processing is complicated.
  - TCP input processing is more complicated than TCP output processing.
- Previously, we showed that locality is important for good cache performance.
- In addition, because conditional branches can badly hurt the performance of a pipelined CPU, their effect should be minimized.
19. TCP Header Prediction
- Header prediction looks for packets that fit the profile of the packets the receiver expects to receive next.
- If a packet meets the header prediction condition, it is handled in just a few instructions. Otherwise, it is handled by the general processing code.
- This is actually an old design principle: optimize for the common case!
- The TCP header prediction scheme can improve TCP transfer throughput because it improves instruction locality, which in turn improves cache performance.
20. Two Common Cases
- If TCP is sending data, the next expected segment for this connection is an ACK for outstanding data.
- If TCP is receiving data, the next expected segment for this connection is the next in-sequence data segment.
- On a LAN, where packet losses are rare, header prediction succeeds between 97% and 100% of the time. On a WAN, where packet losses are more likely, the percentage drops to between 83% and 99%.
- The code for processing these two common cases is placed at the beginning of tcp_input(), which results in better cache performance.
- However, there is no information about how much this scheme improves TCP transfer throughput.
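A header prediction test for the two common cases might look like the sketch below. It is a simplified illustration, not the tcp_input() code: the `Conn` and `Segment` types and their field names are hypothetical, and sequence number wrap-around is ignored in the comparisons to keep the example short.

```python
from dataclasses import dataclass

TH_FIN, TH_SYN, TH_RST, TH_ACK, TH_URG = 0x01, 0x02, 0x04, 0x10, 0x20


@dataclass
class Conn:               # hypothetical per-connection state
    established: bool
    rcv_nxt: int          # next sequence number expected from the peer
    snd_una: int          # oldest unacknowledged sequence number
    snd_max: int          # highest sequence number sent so far
    snd_wnd: int          # window last advertised by the peer


@dataclass
class Segment:            # hypothetical parsed header fields
    flags: int
    seq: int
    ack: int
    win: int
    length: int           # payload bytes


def classify(conn: Conn, seg: Segment) -> str:
    """Return which path a segment takes: fast_ack, fast_data, or slow_path."""
    plain_ack = (seg.flags & (TH_FIN | TH_SYN | TH_RST | TH_ACK | TH_URG)) == TH_ACK
    if (conn.established and plain_ack
            and seg.seq == conn.rcv_nxt and seg.win == conn.snd_wnd):
        if seg.length == 0 and conn.snd_una < seg.ack <= conn.snd_max:
            return "fast_ack"     # sender case: pure ACK for outstanding data
        if seg.length > 0 and seg.ack == conn.snd_una:
            return "fast_data"    # receiver case: next in-sequence data segment
    return "slow_path"            # everything else: general processing


if __name__ == "__main__":
    conn = Conn(True, rcv_nxt=1000, snd_una=500, snd_max=800, snd_wnd=8192)
    print(classify(conn, Segment(TH_ACK, 1000, 700, 8192, 0)))
```

Placing this single combined test first means the common case costs a handful of comparisons, and only unusual segments pay for the long chain of branches.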
21. Measuring RTTs on Gigabit Networks
22. Measuring RTTs Is Important
- To retransmit lost packets efficiently, the RTT of a connection should be measured correctly and precisely.
  - When a packet is lost, we do not want to wait unnecessarily long before resending it.
- To increase the accuracy of the measurements, the first step is to use a high-resolution clock.
  - Before FreeBSD 3.0, the clock resolution for TCP RTT measurement was 500 ms.
  - Now it is 1 ms.
- The second step is to use more RTT measurement samples to calculate the average RTT.
23. Timer Management
- Timers are used extensively to measure RTTs.
- When a packet is sent, a timer is started. When the corresponding ACK returns, the timer is stopped. The elapsed time of the timer is one RTT sample.
- This approach works on a low-speed network such as 10 Mbps. On a gigabit network, where a 1500-byte packet is sent every 12 microseconds, it is infeasible.
- To reduce the high frequency of timer setup/cancel operations, the original solution is to take one RTT sample per RTT, rather than one RTT sample for every packet sent.
  - This is simple to implement: just do not set up another timer until the previous timer is cancelled.
  - However, the accuracy suffers.
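The 12-microsecond figure, and the timer rate it implies, follow from simple arithmetic. This is a minimal sketch with illustrative function names.

```python
def packet_interval_us(packet_bytes: int, link_bps: float) -> float:
    """Microseconds between back-to-back packets of packet_bytes at link_bps."""
    return packet_bytes * 8 / link_bps * 1e6


def timer_ops_per_second(packet_bytes: int, link_bps: float) -> float:
    """Timer setups per second if every sent packet starts its own timer."""
    return link_bps / (packet_bytes * 8)


if __name__ == "__main__":
    for name, bps in [("10 Mbps", 10e6), ("1 Gbps", 1e9)]:
        print(name, packet_interval_us(1500, bps), "us per packet,",
              round(timer_ops_per_second(1500, bps)), "timers per second")
```

At 1 Gbps this comes to more than 80,000 timer setups per second, plus a matching number of cancellations.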
24. TCP Timestamp Option
- To get an RTT sample for every packet sent while avoiding high-frequency timer setup/cancel operations, TCP uses a timestamp option (RFC 1323).
- The sender places a timestamp in every segment it sends. The receiver sends the timestamp back in the ACK. This allows the sender to calculate the difference between the current time and the echoed timestamp and use it as the RTT sample.
- This option must be supported by both the TCP sender and the receiver.
  - In contrast, the original design (one sample per RTT) does not need support from the receiver.
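The timestamp echo can be sketched without any timers. The dictionaries, field names, and helper functions below are illustrative stand-ins for real segments. The smoothing weight of 1/8 is an assumption borrowed from common TCP practice and is not stated in the slides.

```python
from typing import Optional


def make_segment(now_ms: int, payload: bytes) -> dict:
    """Sender: stamp the outgoing segment with the current clock value."""
    return {"tsval": now_ms, "payload": payload}


def make_ack(segment: dict) -> dict:
    """Receiver: echo the sender's timestamp back in the ACK."""
    return {"tsecr": segment["tsval"]}


def rtt_sample_ms(now_ms: int, ack: dict) -> int:
    """Sender: the RTT sample is the current time minus the echoed timestamp."""
    return now_ms - ack["tsecr"]


def smooth(srtt: Optional[float], sample: float, alpha: float = 0.125) -> float:
    """Running average over many samples (assumed weight alpha = 1/8)."""
    return sample if srtt is None else (1 - alpha) * srtt + alpha * sample


if __name__ == "__main__":
    seg = make_segment(now_ms=1000, payload=b"data")
    ack = make_ack(seg)
    print(rtt_sample_ms(now_ms=1042, ack=ack))
```

Because the timestamp travels with the segment, the sender keeps no per-packet timer state, and every returning ACK yields a sample.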
26. TCP Window Scale Option for Gigabit Networks
27. TCP Maximum Throughput
- A TCP connection's maximum achievable throughput is limited by the minimum of the TCP sender's socket send buffer and the TCP receiver's socket receive buffer.
  - Maximum throughput = min(socket send buffer on the sender, socket receive buffer on the receiver) / RTT.
- Although we can use setsockopt() to enlarge the socket send and receive buffers, the advertised window field in the TCP header is only 16 bits, which means a maximum window size of only 64 KB.
- On gigabit networks, this is clearly not enough.
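A quick calculation shows how far a 64 KB window falls short. This is a minimal sketch; the 10 ms RTT is an assumed example value and the function names are illustrative.

```python
def max_throughput_bps(send_buf: int, recv_buf: int, rtt_s: float) -> float:
    """Upper bound on throughput: min(send buffer, receive buffer) / RTT, in bits/s."""
    return min(send_buf, recv_buf) * 8 / rtt_s


def window_needed_bytes(link_bps: float, rtt_s: float) -> float:
    """Window required to keep a link of link_bps full for one RTT."""
    return link_bps * rtt_s / 8


if __name__ == "__main__":
    rtt = 0.010  # assumed RTT of 10 ms
    limit = max_throughput_bps(1 << 20, 65535, rtt)
    print(f"64 KB window, 10 ms RTT: at most {limit / 1e6:.1f} Mbps")
    print(f"window needed for 1 Gbps: {window_needed_bytes(1e9, rtt) / 1e6:.2f} MB")
```

With these assumed numbers the 16-bit window caps the connection at roughly 5% of a gigabit link, no matter how large the socket buffers are.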
28. TCP Window Scale Option
- In this scheme, the definition of the TCP window is enlarged from 16 to 32 bits.
- The window field in the header still uses 16 bits, but an option is defined that applies a scaling operation to the 16-bit value.
  - During the three-way handshake, this option is carried in the SYN and SYN-ACK packets to indicate whether the option is supported.
  - In the TCP implementation, the real window size is maintained internally as a 32-bit value.
- The shift field in the option is 1 byte long.
  - 0 means no scaling is performed.
  - 14 is the maximum, allowing a window of 64 KB * 2^14 (about 1 GB).
29. TCP Window Scale Option
- This option can appear only in a SYN packet. Therefore, the scale factor is fixed in each direction when the connection is established.
- The shift count is calculated automatically by TCP, based on the size of the socket receive buffer.
  - Keep dividing the buffer size by 2 until the resulting number is less than 64 KB.
- Each host thus maintains two shift counts: S for sending and R for receiving.
  - Every 16-bit advertised window received from the other end is left-shifted by R bits to obtain the real advertised window.
  - When we need to send a window advertisement to the other end, the real window size is right-shifted by S bits, and the resulting 16-bit value is placed in the TCP header.
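The shift-count calculation and the two shift operations can be sketched as follows. The constant and function names are illustrative and not taken from any particular TCP implementation.

```python
MAX_SHIFT = 14          # largest shift count the option allows
MAX_FIELD = 0xFFFF      # largest value the 16-bit window field can hold


def shift_for_buffer(recv_buf_bytes: int) -> int:
    """Halve the buffer size until it fits in 16 bits; return the halving count."""
    shift = 0
    while shift < MAX_SHIFT and (recv_buf_bytes >> shift) > MAX_FIELD:
        shift += 1
    return shift


def encode_window(real_window: int, s: int) -> int:
    """Sending: right-shift the real window by S to get the 16-bit header value."""
    return min(real_window >> s, MAX_FIELD)


def decode_window(field: int, r: int) -> int:
    """Receiving: left-shift the 16-bit header value by R to get the real window."""
    return field << r


if __name__ == "__main__":
    shift = shift_for_buffer(1 << 20)          # a 1 MB receive buffer
    field = encode_window(1 << 20, shift)
    print(shift, field, decode_window(field, shift))
```

Note that the right shift discards the low-order bits, so the advertised window has a granularity of 2^S bytes.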
30. Congestion Control on Gigabit Networks
31. Congestion Control
- Congestion control is more difficult on gigabit networks.
- Although the absolute control delay (a connection's RTT) may remain about the same regardless of the link bandwidth (10, 100, or 1000 Mbps), the cost of the control delay becomes higher on a higher-bandwidth network.
  - Why? During the same RTT, a larger amount of data has already been injected into the network before the control packet arrives at the traffic source to reduce its sending rate.
- For example, in the Gigabit Ethernet 802.3x PAUSE flow control scheme, one RTT is required for the pause packet to take effect. Therefore, as the bandwidth increases, more data is sent before the congestion control starts to take effect.
- The result is that congestion control becomes less and less effective on high-bandwidth networks. (No solution!)
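The growing cost of a fixed control delay can be put into numbers. This is a minimal sketch; the 1 ms RTT is an assumed example value and the function name is illustrative.

```python
def bytes_sent_before_control(link_bps: float, rtt_s: float) -> float:
    """Data injected during one RTT, before a control packet can take effect."""
    return link_bps * rtt_s / 8


if __name__ == "__main__":
    rtt = 0.001  # assumed RTT of 1 ms, the same at every link speed
    for name, bps in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1000 Mbps", 1e9)]:
        print(f"{name}: {bytes_sent_before_control(bps, rtt):.0f} bytes already sent")
```

With the delay held constant, the amount of data that escapes control grows in direct proportion to the bandwidth.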