WHAT'S INSIDE YOUR TCP OPTIMIZATION TOOLBOX?
BY ALEKSANDR KUPRIYANOV AND PADDY GANTI
Note: this post delves fairly deep into the inner workings of TCP, so a quick refresher on TCP beforehand might be helpful.
INTRODUCTION
In Paddy's previous post, he talked about how to load resources intelligently given the constraints of last-mile options (ISP/cell tower). Before content is delivered from the edge to the end user, it must first get to the edge from the original source by traversing the global IP infrastructure, which we call the "middle mile" in industry jargon. Today's post talks specifically about this phase of the content's journey, and focuses primarily on TCP optimizations. Most people think that TCP optimization means "set some values for some parameters and be done with it." For example, there is undue attention focused on the initial congestion window parameter (initcwnd). We try to show here that a holistic approach, examining each detail of data transfer, is what's needed for sustained and consistent TCP performance.
Since we are performance-obsessed at Instart Logic, let's start by taking a look at the impact of bandwidth and latency on a given web page's load time.
[Figure: page load time as a function of bandwidth and latency]
What this demonstrates is that content reduction techniques are important in low-bandwidth contexts, while request reduction shines in high-latency regimes. To demonstrate the same thing in a different way, consider the following breakdown of the Google home page:
Domain            Requests   Bytes
www.google.com    9          321105
www.gstatic.com   1          131172
apis.google.com   1          49394
ssl.gstatic.com   1          14290
google.com        1          576
At high speeds the number of requests becomes the most critical factor for performance, but at low speeds the bytes dictate what the end user experiences. Most synthetic performance measurement services such as Keynote, Gomez, and Catchpoint will be more sensitive to the number of requests due to their high-speed connectivity, whereas Real User Monitoring (RUM) tools like New Relic, SOASTA, and WebPageTest (using throttling) will be more sensitive to the volume of content delivered. Be sure to test on both types of platforms to get a realistic view of the performance experienced by your end users.
WHICH LAYER TO FOCUS ON: HTTP VS. TCP
Now let's say you want to optimize the number of requests, and you have read that the standard best practice for requesting fewer HTTP resources is to package multiple small resources into one bundle. For example, rather than sending three individual resources, the same content can be sent in one resource bundle. This way you expect to save two round trips, or at least this is the accepted wisdom. However, depending on conditions, three separate connections each downloading n bytes can complete much faster than one connection downloading 3n bytes. This is because there is no one-to-one correlation between a given HTTP request and a TCP round trip. HTTP relies on TCP, which segments your request/response into packets and sends a certain number of packets in a "train." Just because we have one big HTTP request does not necessarily mean the browser receives all of the bytes of the combined request in less time, as the slow-start sketch below illustrates.
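To make this concrete, here is a back-of-the-envelope calculation in Python under idealized assumptions: loss-free slow start, an initial window of 10 segments, and a 1500-byte MSS. The function and constants are ours, purely for illustration:

    MSS = 1500          # bytes per segment (assumed)
    INIT_CWND = 10      # a common initial congestion window (assumed)

    def flights(total_bytes: int, cwnd: int = INIT_CWND) -> int:
        """Round trips needed to deliver total_bytes under loss-free
        slow start, where the window doubles every RTT."""
        count, sent = 0, 0
        while sent < total_bytes:
            sent += cwnd * MSS   # one flight of cwnd segments
            cwnd *= 2            # slow start doubles the window
            count += 1
        return count

    n = 50_000
    print(flights(3 * n))   # one connection, one 150KB bundle: 4 flights
    print(flights(n))       # three parallel 50KB downloads: 2 flights each

Under these assumptions, the bundle costs two extra round trips compared to the parallel downloads.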
The answer to the question of the optimal bundling strategy lies in TCP mechanics, which dictate the delivery dynamics of any web resource. Based on this information, we can reinterpret the latency graphic above as a reduction in the number of back-and-forth exchanges between transacting TCP peers. This should help convince you to focus on optimizing TCP round trips rather than on reducing HTTP requests.
WHY TCP SUCKS FOR LONG-DISTANCE DATA TRANSFER
It turns out that most default TCP stacks aren't set up for use over today's WAN and satellite links, or even gigabit Ethernet: anything with high bandwidth, high delay, or both. To understand why, let's see how much TCP can transfer in one second using basic arithmetic (more complicated models are out there, but the following is sufficient for understanding the concept):
Throughput < BufferSize / NetLatency
which rearranges to
NetLatency < BufferSize / Throughput
This tells you that for a target throughput of 100Mbps, the network latency must stay below roughly 5ms even with the maximum standard receive buffer of 64KB (the limit without TCP window scaling); any higher latency caps throughput below the link rate. For example, a 100ms link with a 32KB receive buffer caps throughput at 2.56Mbps regardless of the available capacity.
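As a quick sanity check, here is a minimal sketch of that arithmetic in Python (the 32KB figure uses decimal kilobytes, which matches the 2.56Mbps number above):

    def tcp_throughput_cap_mbps(buffer_bytes: float, rtt_seconds: float) -> float:
        """Window-limited TCP throughput bound: BufferSize / NetLatency."""
        return buffer_bytes * 8 / rtt_seconds / 1e6

    print(tcp_throughput_cap_mbps(32_000, 0.100))   # 32KB buffer, 100ms RTT -> 2.56
    print(tcp_throughput_cap_mbps(64_000, 0.005))   # 64KB buffer, 5ms RTT -> ~102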
This should convince you that something is broken with TCP for long-haul delivery. Given this situation, we employ the following five heuristics to specifically overcome this handicap of TCP on high bandwidth-delay paths. Please note that due to their interdependencies, you need all of these working in unison, rather than deploying any single one of them.
CONGESTION FLOOR
The bandwidth-delay product for a 100Mbps link across the US is 90KB, which means you need to keep that much data in flight to fully utilize the link capacity. Given that the middle-mile nodes have greater-than-1Gbps links between them, and given their geographical dispersion, we want to set a minimum value for the TCP congestion window and never fall below it, so as to ensure maximum network utilization. Even in slow start, and after a timeout, the congestion window has to remain at least at this value. When we set it to 30 or more, we can ensure that most HTML/JSON responses get sent in a single flight of packets, even after slow start or packet loss. (30 x 1500 bytes = 45KB, more than 90 percent of the Top 1000 sites' HTML response sizes.)
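Stock TCP stacks don't expose a congestion floor directly, so purely as an illustration of the heuristic's shape, here is a clamp applied wherever the stack would shrink the window (the constant and function names are ours, not a real stack API):

    CWND_FLOOR = 30  # segments; never let the window drop below this

    def clamp_cwnd(proposed_cwnd: int) -> int:
        """Apply the congestion floor after any window reduction
        (timeout, loss recovery, or idle restart)."""
        return max(CWND_FLOOR, proposed_cwnd)

    print(clamp_cwnd(1))    # after an RTO, stock TCP restarts at 1 segment -> 30
    print(clamp_cwnd(64))   # a healthy window is left untouched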
DELAYED ACKNOWLEDGEMENTS
Simply adjusting the congestion floor won't do, as we will be limited by the number of acknowledgements we receive. When a TCP receiver uses delayed acknowledgements, it slows down the growth of the sender's congestion window and reduces the sender's throughput. Moreover, for HTTP-type request/response traffic, there is no hope of piggy-backing the ACK on data anyway. So disabling delayed acknowledgements on our edge PoPs ensures that we can sustain the data transfer as fast as the sender can send it, without bogging it down.
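On Linux, a receiver can opt out of delayed ACKs per-socket with the TCP_QUICKACK option. Here is a minimal sketch (Linux-only; the flag is not sticky, so it is re-armed before each read; the host and port are placeholders):

    import socket

    # Connect to an origin (placeholder address) and ACK immediately.
    sock = socket.create_connection(("origin.example.com", 80))
    sock.sendall(b"GET / HTTP/1.0\r\nHost: origin.example.com\r\n\r\n")
    while True:
        # The kernel may clear TCP_QUICKACK, so set it before each read.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_QUICKACK, 1)
        data = sock.recv(65536)
        if not data:
            break
    sock.close()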
RETRANSMISSION TIMER OPTIMIZATION
Once we have a floor on the window and have removed delayed ACKs, we have primed the pump to send data at high throughput. However, TCP timeouts are unavoidable (full window loss, lost re-transmit), so we should try reducing the time spent waiting for a timeout. While this can visibly improve throughput, it should be applied with caution because it also increases the probability of premature timeouts. So, estimating the right re-transmission timeout (RTO) value is important for achieving a timely response to packet losses while avoiding premature timeouts.
A premature timeout has two negative effects:
- It leads to a spurious re-transmission.
- With every timeout, TCP enters slow start even though no packets were lost. Since there is no congestion, TCP thus underestimates the link capacity, and throughput suffers.
TCP has a conservative minimum RTO (RTOmin) value to guard against spurious re-transmissions. The Linux TCP stack uses an RTOmin value of 200ms. Unfortunately, this value may at times be greater than the round-trip times of end-user connections (which are typically about 20-50ms). To fix this situation, the following approach may be employed (a sketch follows below):
- Reduce RTOmin to 20ms.
- Estimate the current RTO value as 3x the current smoothed RTT.
Because we disable delayed acknowledgements, we don't need the minimum to be at 200ms. Our tests with mobile clients have shown that this strategy achieves a timely response to packet losses, while retaining a rather small risk of spurious re-transmissions in the case of RTT spikes.
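Here is an illustrative sketch of that estimator in Python. It assumes the standard smoothed-RTT gain from RFC 6298; it is arithmetic only, not a drop-in stack implementation:

    ALPHA = 0.125        # standard smoothed-RTT gain (RFC 6298)
    RTO_MIN = 0.020      # the reduced 20ms floor described above

    class RtoEstimator:
        def __init__(self) -> None:
            self.srtt = None
        def on_rtt_sample(self, rtt: float) -> float:
            """Feed one measured RTT (seconds); return the RTO to arm."""
            if self.srtt is None:
                self.srtt = rtt
            else:
                self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt
            return max(RTO_MIN, 3 * self.srtt)

    est = RtoEstimator()
    for sample in (0.030, 0.028, 0.090, 0.031):   # an RTT spike in the middle
        print(round(est.on_rtt_sample(sample), 4))

Note how the 3x multiplier absorbs the spike without the estimate collapsing to the 200ms default floor afterwards.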
REDUNDANT PACKETS
While the above techniques optimize for a train of packets, the last packet in a train is not eligible for fast recovery, and hence will time out in the classic sense. The only way to avoid a "classic" RTO re-transmit, and to trigger either slow start or fast re-transmit when the last-sent packet (or a bunch of last-sent packets) is lost, is to resend it if we have not received its ACK within a time slightly longer than a single RTT. Two copies of a packet have a higher probability of at least one arriving at the destination, so we resend the last packet in a train. The same tactic can be used for SYN and SYN/ACK packets to speed up connection establishment.
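Conceptually, the tail probe looks like the following sketch; sock.send, the srtt value, and the acked event are stand-ins for sender-internal state, not a real stack API:

    import threading

    def arm_tail_probe(sock, last_packet: bytes, srtt: float,
                       acked: threading.Event) -> threading.Timer:
        """Resend the tail packet if its ACK hasn't arrived within ~1 RTT,
        instead of waiting for a full retransmission timeout."""
        def probe() -> None:
            if not acked.is_set():        # no ACK yet: duplicate the tail packet
                sock.send(last_packet)
        timer = threading.Timer(1.25 * srtt, probe)  # a bit longer than one RTT
        timer.start()
        return timer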
REORDERING OPTIMIZATION
As network speeds increase, there is a greater chance that packets won't arrive in the order we sent them. This occurs when the order of packets is inverted due to multi-path routing or parallelism at routers and communicating hosts.
Reordering can affect performance in two ways:
- It causes unnecessary re-transmissions. When the TCP receiver gets packets out of order, it sends duplicate ACKs to trigger the fast re-transmit algorithm at the sender. These ACKs make the TCP sender infer that a packet has been lost and retransmit it. If the temporary sequence-number gap is caused by reordering, then the duplicate ACKs and the fast re-transmission are unnecessary and a waste of bandwidth.
- It limits transmission speed. When fast re-transmission is triggered by duplicate ACKs, the TCP sender assumes it is an indication of network congestion. It reduces its congestion window to limit the transmission speed, which then has to grow from slow start again. If reordering happens frequently, the congestion window stays small and can hardly grow larger. As a result, TCP transmits packets at a limited speed and cannot efficiently utilize the bandwidth.
Results of our measurements demonstrate that packet reordering is highly prevalent relative to packet losses across high-speed backbone networks, with a degree of reordering of up to 90 packets. Investigations of real IP flows also show that most reordered packets arrive at the receiver with time lags of less than 10ms. To take this fact into account, the following strategy can be employed to blunt the impact of this phenomenon on performance (a sketch follows the list):
- When the first dupACK is detected, the stack is blocked from taking any action on this event for a certain time.
- If actual packet reordering took place, this timeout is enough for self-recovery.
- If a packet loss took place, the "standard" fast re-transmit algorithm starts.
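Here is an illustrative sketch of that hold-off; the 10ms constant mirrors the measured lag above, and the class is a simplification, not a real stack:

    import time

    REORDER_WAIT = 0.010   # 10ms: most reordered packets arrive within this lag

    class DupAckFilter:
        """Hold off fast re-transmit on the first duplicate ACK; only react
        if duplicates persist beyond the reordering window."""
        def __init__(self) -> None:
            self.first_dupack_at = None
        def should_fast_retransmit(self) -> bool:
            now = time.monotonic()
            if self.first_dupack_at is None:
                self.first_dupack_at = now   # start the hold-off window
                return False
            if now - self.first_dupack_at < REORDER_WAIT:
                return False                 # still plausibly just reordering
            self.first_dupack_at = None      # persistent dupACKs: treat as loss
            return True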
RESULTS
Download times are in seconds; Diff = Direct time / Instart time, expressed as a percentage.

Intranode latency 180ms:

Client bandwidth        1MB Direct   1MB Instart   Diff   4MB Direct   4MB Instart   Diff
0.5Mbps                 18           17.2          105%   69           67            103%
1.5Mbps                 7            6.1           115%   26           23.3          112%
100Mbps                 7            0.9           778%   18           3.2           563%
Throughput at 100Mbps   1.1 Mbps     8.9 Mbps             1.8 Mbps     10 Mbps

Intranode latency 340ms:

Client bandwidth        1MB Direct   1MB Instart   Diff   4MB Direct   4MB Instart   Diff
0.5Mbps                 18.5         17.3          107%   68           67            101%
1.5Mbps                 12           6             200%   35           22            159%
100Mbps                 12           1.7           706%   34           6.5           523%
Throughput at 100Mbps   0.7 Mbps     4.7 Mbps             0.9 Mbps     4.9 Mbps
As you can see, the benefits are material and
significant. All of Instart Logic's customers
have access to these TCP benefits by virtue of
our Global Network Accelerator.
CONCLUSION
Now, let's circle back to our original question: how should you package individual resources for high-performance, end-to-end application delivery? The answer is to treat each resource like a packet and model its delivery on TCP dynamics. We have a lot more to say on this topic. Stay tuned to hear how this theory helps you better package and bundle your assets.
REFERENCES
Robert T. Morris gives you some magic numbers, such as why TCP won't work if the packet loss climbs above 2 percent, among other things. For those interested in measuring the Internet, Vern Paxson did a landmark study which remains unparalleled.