Title: Reliable Stream Transport Service TCP
1Reliable Stream Transport Service (TCP)
2- We've looked at
- Unreliable connectionless packet delivery service
- And the IP protocol that defines it
- Now we will examine
- Reliable stream delivery
- And the Transmission Control Protocol that defines it
- TCP is presented as a part of TCP/IP
- Is an independent, general-purpose protocol
- Can be adapted for use with other delivery systems
3Need for Stream Delivery
- At low levels, have unreliable packets
- Lost, destroyed, discarded, duplicated, delayed
- Size constraints affect efficient transfer
- Applications need to send lots of data
- Unreliability is tedious and annoying
- Programmers must worry about errors
- Goal of network protocol research
- General purpose reliable stream delivery method
4Properties of the Service
- Interface between applications and TCP/IP has five characteristic features
- Stream Orientation
- Sender provides stream of bits divided into bytes
- Receiver is passed exact same sequence
- Virtual Circuit Connection
- Service provides illusion of dedicated circuit
- Call setup from one application to the other
- Two OSs talk and settle details
- Continue to communicate during transfer
- If error, detect and report to applications
5- Buffered Transfer
- Applications send the stream in whatever size pieces they want
- May be as small as a single octet
- Protocol software wants efficient transfer
- Small blocks of data are buffered until there is enough for a datagram
- Large blocks of data are broken into smaller pieces
- Push mechanism
- Used when transfer needs to happen before the buffer is full
- Application invokes a push
- Data generated until then is sent immediately
- At receiving end, is delivered without delay
- Protocol software may divide stream in unexpected ways
6- Unstructured Stream
- Applications cannot mark record boundaries
- Must agree that stream service will be unstructured
- Full Duplex Connection
- Connections allow concurrent transfer both ways
- Appears as two independent streams in opposite directions
- Can terminate one direction without affecting the other
- Control information can be piggybacked on data
8Providing Reliability
- Want reliable transfer out of unreliable packet delivery system
- Most reliable protocols use a single technique
- Positive acknowledgement with retransmission
- Recipient must send ACK message as it gets data
- Sender keeps record of each packet sent
- If timer expires before the ACK arrives, retransmits packet
9Figure 12.1
11- Can also have duplicate packets
- Network delays may cause premature retransmission
- Both packets and ACKs can be duplicated
- Usually solve by assigning sequence numbers
- Receiver must remember which sequence numbers it has received
- ACKs include the sequence numbers as well
12Sliding Windows
- Sending one packet and waiting for ACK wastes time
- Full duplex circuit has lots of idle time
- Sliding window technique used
- More complex form of positive acknowledgement with retransmission
- Uses bandwidth more efficiently
- Sender transmits multiple packets before waiting for an ACK
14- Number of unacknowledged packets limited by window size
- Performance depends upon window size
- Size of 1 is the same as simple positive acknowledgement protocol
- Increase size with goal of sending packets as fast as the network can handle
- Conceptually, separate timer for each packet
- Only unacked packets are retransmitted
- Receiver has a similar window
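The sender side of the sliding window described above can be sketched in a few lines. This is a toy model, not TCP itself: it assumes packets are numbered from 0 and acknowledged individually, and the class and method names are invented for illustration.

```python
class SlidingWindowSender:
    """Toy sender: at most `window_size` packets may be unacknowledged."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.base = 0        # oldest unacknowledged packet number
        self.next_seq = 0    # next packet number to transmit
        self.acked = set()   # packets acknowledged ahead of `base`

    def can_send(self):
        # Sending is allowed only while the window is not full.
        return self.next_seq < self.base + self.window_size

    def send(self):
        """Transmit the next packet and return its number."""
        if not self.can_send():
            raise RuntimeError("window full; wait for an ACK")
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def ack(self, seq):
        """Record an ACK and slide the window past acknowledged packets."""
        self.acked.add(seq)
        while self.base in self.acked:
            self.acked.remove(self.base)
            self.base += 1

    def unacked(self):
        """Packets that would be retransmitted if their timers expired."""
        return [s for s in range(self.base, self.next_seq)
                if s not in self.acked]
```

With a window size of 1 the sketch degenerates to the simple send-and-wait protocol; larger windows let the sender keep several packets in flight.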
16TCP
- Is a communication protocol
- NOT a piece of software
- TCP is the standard
- Various TCP software implements the standard
- Standard includes
- Format of data and acknowledgments
- Procedures for reliability
- Distinguish multiple destinations on a machine
- Error recovery procedures
- Initiation and closing a TCP stream transfer
17- Standard does not include
- Details of application/TCP interface
- Does not discuss exact procedures to invoke for operations
- Left unspecified for flexibility
- TCP usually implemented in OS
- Can use whatever interface given OS provides
- Single specification for variety of machines
- TCP assumes little about underlying system
- Can be used with variety of packet delivery systems (including IP)
- Dialup lines, LANs, high-speed fiber, low-speed WANs
18Ports, Connections, Endpoints
- TCP resides above IP in the layering scheme
19- Multiple applications can communicate concurrently
- Multiplexes and demultiplexes incoming messages
- Uses port numbers (like UDP discussion)
- TCP ports more complex
- Uses the connection abstraction
- Objects are virtual circuits, not ports
- Connections identified by a pair of endpoints
- Endpoint is pair of integers (host, port)
- host is IP address for a host
- port is TCP port on that host
20- Pair of endpoints defines connection
- (128.9.0.32, 1184) and (128.10.2.3, 53)
- A single TCP port can be shared by multiple connections on the same machine
- (128.2.254.139, 1012) and (128.10.2.3, 53)
- No ambiguity
- Incoming messages associated with connection, not port
- Both endpoints used to identify appropriate connection
- Makes things easier for programmers
- Can provide concurrent service without unique ports
- Example: Email
- Multiple computers can send mail concurrently
- Accepting program needs only one TCP port
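A small sketch may help show why two connections can share port 53 without ambiguity. It assumes a connection table keyed by the pair of endpoints; the table, function names, and connection labels are hypothetical and only the addresses come from the slides.

```python
# Connection table keyed by (local endpoint, remote endpoint).
connections = {}

def open_connection(local, remote, name):
    """Register a connection identified by its pair of endpoints."""
    connections[(local, remote)] = name

def demultiplex(dst, src):
    """Find the connection for a segment sent from `src` to `dst`.

    Both endpoints are used for the lookup, so a shared local port
    does not cause ambiguity. Returns None for an unknown pair.
    """
    return connections.get((dst, src))

server = ("128.10.2.3", 53)
open_connection(server, ("128.9.0.32", 1184), "connection A")
open_connection(server, ("128.2.254.139", 1012), "connection B")
```

Looking up by the local port alone would match both entries, which is why the pair of endpoints, not the port, names the connection.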
21Passive and Active Opens
- TCP is connection-oriented
- Both endpoints must agree to participate
- Passive open
- Application at one end tells OS it will accept connection
- OS assigns a TCP port number for its end
- Active open
- Done by application wishing to connect
- Tells OS to establish a connection
- Two TCP modules communicate
- Establish and verify the connection, then pass data
22Segments, Streams, Sequence Numbers
- TCP views the data stream in segments
- Segment contains sequence of octets
- Usually each segment in one IP datagram
- Two important problems
- Efficient transmission
- Good use of available network
- Flow control
- End-to-end problem
- Cannot overflow the receiver's buffer
23- Special sliding window protocol used
- Solves both problems
- Octets of the data stream are numbered sequentially
- 1st pointer: boundary between octets sent and ACKed vs. sent and not ACKed
- 2nd pointer: end of window
- 3rd pointer: boundary between sent and unsent octets
24- Receiver maintains a similar window
- Full duplex: software at each end maintains 2 windows
- Also allows window size to vary over time
- Each ACK has window advertisement
- Tells how many more octets the receiver is willing to accept
- Increased advertisement
- Sender can increase size of sliding window, send more
- Decreased advertisement
- Sender decreases size of sliding window, stops at boundary
- Extreme case: receiver sends advertisement of zero, stops all transmission
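As a rough illustration of how a window advertisement limits the sender, the sketch below computes how many more octets may be sent. The function and parameter names are assumptions made for this example; positions are octet numbers in the stream.

```python
def usable_window(first_unacked, next_to_send, advertised_window):
    """Octets the sender may still transmit without exceeding the
    receiver's most recent window advertisement.

    first_unacked: position of the oldest octet sent but not ACKed
    next_to_send:  position of the next octet that has not been sent
    """
    in_flight = next_to_send - first_unacked
    # A shrunken advertisement can leave more data in flight than the
    # window allows; the sender then simply stops (never a negative value).
    return max(0, advertised_window - in_flight)
```

For example, with 300 octets in flight and an advertisement of 1000, the sender may transmit 700 more; an advertisement of zero stops transmission entirely.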
25- This provides flow control
- Essential in internet environment
- Two independent flow problems
- End-to-end
- Minicomputer communicating with mainframe
- Intermediate systems
- Routers need to control flow, too
- Overloaded router condition is congestion
- No explicit congestion control mechanism; uses sliding window
- Good TCP implementation can detect and recover
- Poor implementation can make it worse
26TCP Segment Format
- Unit of transfer between TCP software is the segment
- Establish connections
- Transfer data
- Send ACKs
- May piggyback on a segment carrying data
- Advertise window size
- Close connections
27Figure 12.7
28- Code Bits field reveals type of segment
29Out of Band Data
- Out of Band
- Data sent without waiting for octets in the stream to be consumed by the receiver
- Ex: to interrupt or abort a program
- Use urgent bit and URGENT POINTER field
- This data is consumed first, regardless of stream position
30Maximum Segment Size Option
- Not all segments will be of same size
- But, must agree on a maximum size
- Uses OPTIONS field
- Can specify MSS (maximum segment size)
- If on same network, may use size such that resulting datagrams match network MTU
- If not, will attempt to discover the minimum MTU along the path
- Or use 536 (default datagram size, minus IP and TCP headers)
31- Choosing good MSS is difficult
- Too large or too small are both bad
- Too small: network utilization is low
- Segments travel in datagrams, datagrams in frames
- At least 40 octets of headers
- Small amount of data gives poor utilization
- Too large: large IP datagrams
- Probably get fragmented somewhere
- Cannot ACK partial segment
- Must receive all fragments
- More fragments increases probability of losing one
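The arithmetic behind the MSS trade-off can be worked through briefly. The sketch assumes 20-octet IP and TCP headers with no options (the "at least 40 octets" above) and a default datagram size of 576 octets; frame headers are ignored, and the function names are illustrative.

```python
IP_HEADER = 20    # octets, assuming no IP options
TCP_HEADER = 20   # octets, assuming no TCP options

def mss_for_mtu(mtu):
    """Largest segment payload that fits in one datagram of size `mtu`."""
    return mtu - IP_HEADER - TCP_HEADER

def utilization(data_octets):
    """Fraction of each datagram that is application data."""
    return data_octets / (data_octets + IP_HEADER + TCP_HEADER)
```

Under these assumptions `mss_for_mtu(576)` gives the default of 536, and a segment carrying a single octet of data spends 40 of its 41 octets on headers.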
32- In theory, best MSS is when IP datagrams are as large as possible without being fragmented
- Difficult to figure out
- Most implementations do not have a mechanism for doing so
- Routes can change dynamically
- This may change the MTU of the path
- Optimum size depends on lower level headers
- Segment size must be reduced to account for IP options
33Window Scaling Option
- WINDOW field is 16 bits
- Limits max window size to 64 Kbytes
- Ok in early networks
- Need more for networks with large delay
- Option allows a larger size
- Do not need to know details.
34Timestamp Option
- Used to
- Help compute delay on underlying network
- Handle wrap around sequence numbers
- Process
- Sender
- Places timestamp from its clock in message
- Receiver
- Copies timestamp field into ack
- Allows sender to compute elapsed time
35TCP Checksum
- CHECKSUM contains 16-bit integer
- Uses a pseudo header like UDP
- Purpose is just the same
- Verify segment has reached correct destination
36ACKs and Retransmission
- Hard to refer to datagrams or segments
- Variable length segments
- Retransmitted segments may have more data than original
- Instead, use position in stream
- Based on sequence numbers
37- Cumulative acknowledgement scheme
- Receiver collects arriving data octets
- Reconstructs stream of sender
- May have to reorder segments due to delivery
- Will have reconstructed zero or more octets
- May have other stream pieces present but out of order
- Receiver ACKs longest contiguous prefix
- ACK specifies the next octet expected to be received
- Advantages
- ACKs easy to generate and unambiguous
- Lost ACKs may not force retransmission
- Disadvantage
- ACKs only carry info about a single position in the stream
38- Lack of information is inefficient
- Imagine window that spans 5000 octets
- Starts with position 101 in the stream
- Sender has sent all data in five segments
- Suppose first segment got lost
- Receiver sends ACK as each segment arrives
- All ACKs specify octet 101 as next expected
- No way to tell sender that all the other data is there
- Sender has two choices upon timeout
- Send all five segments over
- Send only first segment, then wait for ACK to do anything else
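The five-segment scenario can be reproduced with a short sketch of how a receiver might compute a cumulative ACK. It assumes five equal segments of 1000 octets each (the slides give only the 5000-octet total), and the function name and data layout are illustrative.

```python
def next_expected(start, received):
    """Return the cumulative ACK: the first octet position not yet received.

    start:    next octet position expected before these segments arrived
    received: list of (first_octet, length) pairs, in any order
    """
    expected = start
    for first, length in sorted(received):
        if first > expected:
            break  # hole in the stream: the contiguous prefix ends here
        expected = max(expected, first + length)
    return expected

# Window starts at position 101 and spans 5000 octets in five segments.
segments = [(101 + 1000 * i, 1000) for i in range(5)]
```

If the first segment is lost, `next_expected(101, segments[1:])` is still 101 no matter how many later segments arrive, which is exactly the information gap the slide describes.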
39Timeout and Retransmission
- TCP has a timer for each segment
- If timer goes off before ACK is received, retransmit
- Different algorithm than other protocols
- Due to internet environment
- Cannot know how quickly ACKs should come
- May span one or many networks
- May encounter router delays
- Must accommodate vast time differences
40Figure 12.10
41- Adaptive Retransmission Algorithm
- Used to accommodate varying delays
- Monitors performance of each connection
- Deduces reasonable values for timeouts
- As performance changes, timeout value revised
- Must collect data for the algorithm
- Records time each segment is sent and time its ACK arrives
- Computes elapsed time (sample round trip time)
- With each new sample, adjusts average round trip time for the connection
- RTT stored as weighted average (usually)
- New round trip samples change the average slowly
42- Example
- RTT = (a * Old_RTT) + ((1 - a) * New_Round_Trip_Sample)
- where
- a is the constant weighting factor, 0 < a < 1
- Choosing a value close to 1
- Weighted average only changed small amount
- Immune to changes that last a short time
- Choosing a value close to 0
- Weighted average responds quickly to changes in delay
43- Timeout value is a function of the current RTT
- Early implementations used constant weighting factor, B (B > 1)
- Timeout = B * RTT
- Choosing a value for B is hard
- Close to 1
- Timeout close to current RTT
- Detects packet loss quickly
- Any small delay may cause unnecessary retransmissions
- Original specification recommended B = 2
- Will look at better techniques for timeout
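A minimal sketch of the original estimator follows: a weighted average of round trip samples, with the timeout set to a constant multiple of the estimate. The default a = 0.9 is only an illustrative choice; B = 2 is the value the slides attribute to the original specification.

```python
def update_rtt(old_rtt, sample, a=0.9):
    """Weighted average of round trip samples.

    a close to 1: the average changes slowly.
    a close to 0: the average follows each new sample closely.
    """
    return a * old_rtt + (1 - a) * sample

def timeout(rtt, b=2.0):
    """Retransmission timeout as a constant multiple of the estimate."""
    return b * rtt
```

With an old estimate of 100 ms and a sample of 200 ms, a = 0.9 moves the estimate to about 110 ms, while a = 0.1 moves it to about 190 ms.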
44Measuring Round Trip Samples
- Measuring round trip sample seems trivial
- But, TCP uses cumulative acknowledgement
- ACK refers to data received, not datagram that carried it
- Consider a retransmission
- Form segment, put in datagram, send; timer expires
- Send again in second datagram
- Get ACK for which datagram?
- Called acknowledgement ambiguity
45- Assume ACK belongs to earliest datagram
- Makes estimated round trip time grow
- Incorrect if the original datagram was really lost
- If many lost, estimate grows arbitrarily large
- Assume ACK belongs to latest datagram
- Send retransmission just before ACK arrives
- Decreases the timeout time
- Makes things worse: more retransmissions
- Estimate will eventually stabilize
- RTT will be slightly less than ½ of the correct value
- Every segment sent twice even though no loss occurs
46Karn's Algorithm
- If associating ACK with earliest or most recent are both wrong, what to do?
- Do not update on retransmitted segments
- Idea known as Karn's Algorithm
- Avoids ambiguous acknowledgement problem
- Simplistic implementation can be a problem
- Get sharp increase in delay, do some retransmissions
- Ignore ACKs for retransmissions, so no new estimate is ever computed
47- Must also use a timer backoff strategy
- Compute initial timeout with round trip estimate
- If timer expires and causes retransmission, increase the timeout (within a bound)
- Most implementations multiply timeout by 2
- Next segment timed with new timeout
- Continues backoff until a segment is sent without retransmitting
- Computes new round trip estimate
- Resets timeout accordingly
- Shown to work well even with high packet loss
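Ignoring ambiguous samples and backing off the timer fit together as sketched below. This is a simplified model under stated assumptions: the class name, the 64-second bound, and the weights a = 0.9 and b = 2 are illustrative, and each ACK is assumed to cover exactly one timed segment.

```python
class KarnTimer:
    """Round trip estimate with Karn-style sample filtering and backoff."""

    def __init__(self, rtt, a=0.9, b=2.0, max_timeout=64.0):
        self.rtt = rtt
        self.a = a
        self.b = b
        self.max_timeout = max_timeout   # assumed upper bound on backoff
        self.timeout = b * rtt

    def on_timeout(self):
        """Timer expired and the segment is retransmitted: double the
        timeout (within the bound) but leave the RTT estimate alone."""
        self.timeout = min(self.timeout * 2, self.max_timeout)

    def on_ack(self, sample, retransmitted):
        """An ACK arrived; `retransmitted` says whether the segment it
        covers was ever resent."""
        if retransmitted:
            return  # ambiguous sample: ignore it, keep backed-off timeout
        # Unambiguous sample: update the estimate and reset the timeout.
        self.rtt = self.a * self.rtt + (1 - self.a) * sample
        self.timeout = self.b * self.rtt
```

The backed-off timeout stays in force until a segment gets through without retransmission, at which point a valid sample resets it from the new estimate.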
48High Variance in Delay
- Computations do not respond well to wide range of variation in delay
- Variation in RTT
- Proportional to 1/(1 - network load)
- Original TCP standard estimated RTT as shown earlier
- Limiting B to 2 can adapt to loads of at most 30%
- 1989 spec requires estimates of both average RTT and variance
- Must use variance in place of constant B
49- Approximations are computationally easy
- DIFF = SAMPLE - Old_RTT
- Smoothed_RTT = Old_RTT + d * DIFF
- DEV = Old_DEV + p * (|DIFF| - Old_DEV)
- Timeout = Smoothed_RTT + e * DEV
- Where
- DEV is the estimated mean deviation
- d is a fraction between 0 and 1 that controls effect on weighted average
- p is a fraction between 0 and 1 that controls effect on mean deviation
- e is a factor controlling how much deviation affects round trip timeout
- (Research suggests d and p to be inverse powers of 2, scaling by 2^n and using integer arithmetic, with d = 1/(2^3), p = 1/(2^2), n = 3, and e = 4)
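The four approximations can be checked with a direct transcription into code. The sketch uses floating point rather than the scaled integer arithmetic the slide mentions, and it assumes the deviation update uses the magnitude of DIFF, as in the usual formulation; the function name is illustrative.

```python
def update_estimates(old_rtt, old_dev, sample, d=1/8, p=1/4, e=4):
    """One step of the mean-deviation estimator.

    Returns (smoothed_rtt, dev, timeout).
    d, p, and e default to the values quoted on the slide.
    """
    diff = sample - old_rtt
    smoothed_rtt = old_rtt + d * diff
    dev = old_dev + p * (abs(diff) - old_dev)   # assumes |DIFF|
    timeout = smoothed_rtt + e * dev
    return smoothed_rtt, dev, timeout
```

For example, starting from an RTT of 100 and a deviation of 10, a sample of 140 gives a smoothed RTT of 105, a deviation of 17.5, and a timeout of 175. A run of steady samples shrinks the deviation and pulls the timeout back toward the smoothed RTT.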
50Figure 12.11
51Figure 12.12
52Response to Congestion
- TCP software must deal with congestion
- Severe delay caused by an overload of datagrams
- Congestion occurs at routers
- Routers have finite storage
- When run out of storage, start dropping datagrams
- Endpoints do not know where congestion is
- Just see increased delay
- Timeouts cause more datagrams to be sent (retransmissions)
- May cause congestion collapse
53- TCP must reduce transmission rate
- ICMP source quench messages inform hosts of congestion
- TCP needs to help
- Want to automatically reduce transmission rates when congestion occurs
- TCP standard recommends two techniques
- Slow-start
- Multiplicative Decrease
54- Multiplicative Decrease
- TCP must already use receiver's window size
- Keep another window size to use during congestion
- Called congestion window
- At any time, the allowed window is
- min(receiver_advertisement, congestion_window)
- During non-congestion, both are same
- To estimate congestion window size, TCP assumes most datagram loss comes from congestion
- Upon segment loss
- Reduce congestion window by half (min of one segment)
- For segments still in window, back off timer exponentially
- Done for every loss, so router traffic clears quickly
55- Slow-start
- How to recover when congestion ends?
- Doing the reverse (doubling the congestion window) is unstable
- Use slow-start recovery
- When starting traffic on connection or after congestion
- Start window at size of single segment
- Increase by one segment every time an ACK arrives
- Avoids swamping
- Not so slow actually
- log2(N) round trips until can send N segments
- One other restriction: congestion avoidance phase
- When congestion window reaches ½ original size, increase by 1 segment only if all segments have been ACKed
- Overall, known as Additive Increase Multiplicative Decrease (AIMD)
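The interplay of slow-start, congestion avoidance, and multiplicative decrease can be sketched as a small state machine. This is one simplified reading of the rules above, counted in whole segments: on loss it remembers half the old window as a threshold and restarts from one segment. The class, attribute names, and the initial threshold are assumptions for illustration, not a description of any particular implementation.

```python
class CongestionControl:
    """Toy congestion window, measured in segments."""

    def __init__(self, advertised):
        self.advertised = advertised    # receiver advertisement
        self.cwnd = 1                   # slow-start begins at one segment
        self.threshold = advertised     # assumed starting threshold
        self.acked_in_window = 0

    def allowed_window(self):
        return min(self.advertised, self.cwnd)

    def on_ack(self):
        if self.cwnd < self.threshold:
            self.cwnd += 1              # slow-start: one segment per ACK
        else:
            # Congestion avoidance: grow by one segment only after a
            # whole window of segments has been acknowledged.
            self.acked_in_window += 1
            if self.acked_in_window >= self.cwnd:
                self.cwnd += 1
                self.acked_in_window = 0

    def on_loss(self):
        # Multiplicative decrease: remember half the window (at least
        # one segment), then recover with slow-start from one segment.
        self.threshold = max(1, self.cwnd // 2)
        self.cwnd = 1
        self.acked_in_window = 0
```

Because every ACK adds a segment during slow-start, the window roughly doubles each round trip, which is where the log2(N) figure comes from.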
56- Techniques powerful when combined
- Slow-start increase
- Multiplicative decrease
- Additive Increase
- Measurement of variation
- Exponential timer backoff
- Improve TCP performance dramatically
- Add very little computational overhead
- Performance improves by factors of 2 to 10
57Fast Recovery and Other Modifications
- Heuristic used where loss is infrequent
- Uses info from cumulative ack scheme
- Can resend data before timer expires
- Do not need to know details
58Explicit Feedback Mechanisms
- Most TCP versions use implicit techniques
- Timeout and duplicate ACKs to detect loss
- Changes in RTT to detect congestion
- Two explicit techniques have been proposed
- Selective Acknowledgement (SACK)
- Explicit Congestion Notification (ECN)
59- SACK
- Can specify exactly which data has been received and which is missing
- Sender knows which segment(s) to retransmit
- TCP provides two options for SACK
- Do not need to know details
- Does not replace cumulative ack mechanism
- Nor is it mandatory
60- ECN
- Used to notify TCP about congestion
- As a TCP segment goes through routers
- Two bits in IP header used to record congestion
- When segment arrives, receiver knows
- Sender needs to know, so receiver uses ACK to tell it
- IP header bits
- Taken from TOS field
- TCP header bits
- Taken from reserved area
61Congestion, Tail Drop, and TCP
- Protocols are layered
- Layers operate in isolation
- TCP at source/destination cannot interact with lower layer elements along the path
- TCP does not know condition of network
- TCP does not notify lower layers before transferring data
- Policies used by routers can affect TCP
- Both a single connection and aggregate of all connections
62- Example
- Router delays some datagrams more than others
- TCP backs off retransmission timer
- If delay exceeds timer, TCP assumes congestion
- Layers are defined independently, but they interact
- Thus, try to define mechanisms in one layer to work well with protocols in others
63- Important interaction between TCP and IP
- Router is overrun and begins to drop datagrams
- Early router software used tail-drop policy
- If input queue is full when datagram arrives, drop it
- Interesting effect on TCP
- If segments are from a single TCP connection
- TCP enters slow-start until it begins receiving ACKs
- If segments are from multiple TCP connections
- All N instances of TCP enter slow-start at same time
- Causes global synchronization
64Random Early Detection
- Routers need to avoid global synchronization
- Use scheme to avoid tail-drop when possible
- Called Random Early Detection (RED)
- (or Random Early Discard or Random Early Drop)
- Uses two markers in queue: Tmin and Tmax
- Three rules
- If queue contains fewer than Tmin datagrams, add new one
- If queue contains more than Tmax datagrams, discard new one
- If queue contains between Tmin and Tmax datagrams, randomly discard the datagram with probability p
65- Randomness keeps from waiting for overflow
- Router slowly and randomly drops datagrams as congestion increases
- Keeps from putting all TCP connections in slow-start
- Key is in choice of the thresholds and p
- Tmin must be large enough to utilize output link
- Tmax must be larger than typical increase in queue size during round trip time
- Discard probability is most complex choice
- Do not use a constant; compute p for each datagram
- Can vary probability linearly from 0 (queue size = Tmin) to 1 (queue size = Tmax)
66- Linear scheme forms the basis of probability p
- Must avoid overreacting to bursty traffic
- If short burst
- Do not drop datagrams because queue will not overflow
- But, cannot postpone discard indefinitely
- Long burst
- Will overflow queue and start tail-drop
- Use weighted average technique
- Do not use actual queue size at any instant
- Compute weighted average queue size
- Update each time a datagram arrives
- Avg = (1 - g) * Old_avg + g * Current_queue_size
- where
- g is a value between 0 and 1
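The RED rules can be combined into a short sketch: a weighted average of the queue size drives a discard probability that rises linearly between the two thresholds. Function and parameter names are illustrative, the thresholds are assumed to satisfy t_min < t_max, and the random source is passed in so the behaviour can be checked deterministically.

```python
import random

def update_average(old_avg, current_queue_size, g):
    """Weighted average queue size, updated as each datagram arrives."""
    return (1 - g) * old_avg + g * current_queue_size

def drop_probability(avg, t_min, t_max):
    """0 below t_min, 1 above t_max, linear in between."""
    if avg < t_min:
        return 0.0
    if avg > t_max:
        return 1.0
    return (avg - t_min) / (t_max - t_min)

def should_drop(avg, t_min, t_max, rng=random.random):
    """Decide whether to discard the arriving datagram.

    `rng` must return a float in [0, 1), like random.random.
    """
    return rng() < drop_probability(avg, t_min, t_max)
```

Because the decision uses the average rather than the instantaneous queue size, a short burst barely moves the probability, while sustained growth raises it steadily.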
67- Some details glossed over
- Computations very efficient if
- Choose constants as powers of 2
- Use integer arithmetic
- Measurement of queue size
- Time required to forward datagram proportional to size
- Measure queue size in octets rather than datagrams
- Affects type of traffic dropped
- Discard probability proportional to amount of data
- Not based on number of segments
- Smaller datagrams have lower probability of being dropped
- Good for ACKs, remote login traffic, etc.
- Analysis and simulation show RED works
68Establishing a TCP Connection
- Use a 3-way handshake
- Is both necessary and sufficient for correct synchronization
- Also uses rule that additional requests for connection are ignored if connection is established
- Can initiate connection from both ends simultaneously
69Figure 12.13 The sequence of messages in a three-way handshake. Time proceeds down the page; diagonal lines represent segments sent between sites. SYN segments carry initial sequence number information.
70Initial Sequence Numbers
- 3-way handshake accomplishes 2 functions
- Guarantees both sides ready to transfer data
- Sets up agreement on initial sequence numbers
- Each machine can choose initial number at random
- Cannot start at 1 each time
- Numbers set in three messages
- First machine sends x
- Second machine records x, sends y and ACKs x
- First machine ACKs y
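The exchange of initial sequence numbers can be written out as data. The sketch assumes the usual TCP conventions that sequence numbers are 32 bits wide and that an ACK carries the next sequence number expected (x + 1 acknowledges a SYN numbered x); the message layout and the names A and B are invented for illustration.

```python
import random

SEQ_SPACE = 2 ** 32   # assumes 32-bit sequence numbers

def three_way_handshake(rng=random):
    """Return the three handshake messages as dictionaries.

    `rng` is any object with a randrange method (the random module
    or a random.Random instance).
    """
    x = rng.randrange(SEQ_SPACE)   # first machine's initial number
    y = rng.randrange(SEQ_SPACE)   # second machine's initial number
    return [
        {"from": "A", "flags": {"SYN"}, "seq": x},
        {"from": "B", "flags": {"SYN", "ACK"}, "seq": y,
         "ack": (x + 1) % SEQ_SPACE},
        {"from": "A", "flags": {"ACK"}, "seq": (x + 1) % SEQ_SPACE,
         "ack": (y + 1) % SEQ_SPACE},
    ]
```

After the third message each side has both chosen its own starting number and seen the other side confirm it.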
71- Possible to send data with handshake segments
- Included with the initial sequence numbers
- TCP software must buffer until handshake done
- Once connection established, can release the data to the application program quickly
72Closing a TCP Connection
- Close operation used to terminate gracefully
- Connections are full duplex
- When application tells TCP it is done, TCP closes the connection in one direction
- Sending TCP sends remaining data
- Waits for receiver ACK
- Sends segment with FIN bit set
- Receiver ACKs the FIN segment and informs its application that data is done
73- Can still send data in opposite direction
- When both directions closed, TCP deletes its record of the connection
- Modified 3-way handshake is used to close
74Figure 12.14 The modified three-way handshake used to close connections. The site that receives the first FIN segment acknowledges it immediately, and then delays before sending the second FIN segment.
75TCP Connection Reset
- Close operation used for normal shutdown
- Sometimes abnormal conditions arise
- Force the connection to be broken
- TCP has a reset for such conditions
- One side sends segment with RST bit set
- Other side responds immediately by aborting connection
- TCP informs application that connection was reset
- Transfer in both directions ceases immediately
76TCP State Machine
- Operation of TCP can be explained with a theoretical model called finite state machine
- Circles represent states
- Arrows represent transitions between them
77Figure 12.15
84Forcing Data Delivery
- Data stream usually buffered
- Accumulate enough octets for efficient transfer
- May need to send data before much has accumulated
- Example: interactive terminal keystrokes
- Push operation forces delivery of octets
- Also sets PSH bit in segment code field
- Causes delivery of data to destination application
85Reserved TCP Port Numbers
- Combines static and dynamic port binding
- Like UDP
- Many of the port numbers are the same for services accessible by both TCP and UDP
- See Figure 12.16
86Figure 12.16
87TCP Performance
- TCP is a complex protocol
- Handles wide variety of underlying technologies
- Generality does not hinder TCP performance
- Research done at Berkeley
- Shows that same TCP that gives efficient internet operation can sustain 8 Mbps throughput between two stations on 10 Mbps Ethernet
- Cray Research: TCP throughput approaching a gigabit per second (Gbps)
88Silly Window Syndrome
- TCP can have serious performance problem
- Caused when sender and receiver operate at different speeds
- If receiver reads data one octet at a time
- Sender quickly fills buffer
- Must wait for window advertisement
- Gets advertisement for one octet
- Results in many small segments
- Inefficient use of bandwidth and lots of overhead
89- If sender sends data one octet at a time
- Ends up with same problem
- Known as silly window syndrome
- Early TCP implementations exhibited the problem
- Each ACK advertises small amount of space
- Causes each segment to carry a small amount of data
90Avoiding Silly Window Syndrome
- TCP specs include heuristics to avoid SWS
- On sender, avoids sending small data amounts
- On receiver, avoids sending small advertisements
- TCP software should contain both
91- Receive-side silly window avoidance
- Receiver maintains currently available window
- Delays advertising until can advance window a significant amount
- Minimum of ½ of the receiver's buffer, or
- Number of octets in a maximum-sized segment
- Summary of technique
- Before sending an updated advertisement after advertising a zero window
- Wait for space
- 50% of total buffer or maximum-sized segment
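The receive-side rule reduces to a single comparison, sketched below for the case the summary describes: the receiver has advertised a zero window and must decide when to reopen it. The function and parameter names are assumptions made for the example.

```python
def advertisement_after_zero_window(free_space, buffer_size, mss):
    """Window to advertise once a zero window has been advertised.

    Keeps advertising zero until the free space reaches the smaller of
    half the buffer and one maximum-sized segment.
    """
    threshold = min(buffer_size // 2, mss)
    if free_space < threshold:
        return 0            # too small to be worth advertising
    return free_space
```

With an 8192-octet buffer and a 1460-octet maximum segment, a single freed octet is not advertised; the window reopens only once 1460 octets are free.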
92- Two approaches for implementation
- ACK each arriving segment, but do not advertise until allowed
- Delay sending ACK if window too small to advertise
- Standard recommends using delayed ACKs
- Advantage: delayed ACKs decrease traffic, increase throughput
- One ACK for all data received during delay
- May get outgoing data segment to piggyback on
- If data read quickly, ACK and advertisement can go in one segment
- Disadvantages
- May get retransmissions if delay too long
- Bad round trip time estimates
- Cannot delay more than 500 ms
- Recommend receiver ACK every other data segment
93- Send-side silly window avoidance
- Goal is to avoid sending small segments
- Use clumping
- Delay sending until get reasonable amount of data
- How long should TCP wait?
- Too long: application has large delays
- Cannot know when application will send more data
- Not long enough: get small segments
- Fixed delay not optimal for all applications
- Uses an adaptive algorithm
- Delay depends on current internet performance
94- Does not compute delays
- Uses arrival of ACK to trigger transmission of additional packets
- Heuristic
- Application generates more data to send
- Buffer if previous data sent but not ACKed
- Wait until get enough for maximum-sized segment
- If waiting when ACK arrives, send all data in buffer
- Apply rule even when push operation requested
- If application fast compared to network
- Successive segments have many octets
- If application slow compared to network
- Small segments get sent without long delay
95- Known as the Nagle algorithm
- Elegant due to little computational overhead
- Adapts to arbitrary combinations of
- network delay
- maximum segment size
- application speed
- But does not lower throughput in normal cases
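The send-side heuristic is compact enough to sketch. The model below is deliberately simplified: a single ACK is assumed to acknowledge all outstanding data, segments are recorded in a list instead of being transmitted, and the class and method names are invented for illustration.

```python
class NagleSender:
    """Toy model of send-side silly window avoidance."""

    def __init__(self, mss):
        self.mss = mss
        self.buffer = b""
        self.unacked = False   # True while sent data awaits an ACK
        self.sent = []         # segments handed to the network

    def _transmit(self, data):
        self.sent.append(data)
        self.unacked = True

    def write(self, data):
        """Application generates more data to send."""
        self.buffer += data
        # A maximum-sized segment is always worth sending.
        while len(self.buffer) >= self.mss:
            self._transmit(self.buffer[:self.mss])
            self.buffer = self.buffer[self.mss:]
        # Smaller data goes out only if nothing is awaiting an ACK.
        if self.buffer and not self.unacked:
            self._transmit(self.buffer)
            self.buffer = b""

    def on_ack(self):
        """ACK arrived: send whatever accumulated while waiting."""
        self.unacked = False
        if self.buffer:
            self._transmit(self.buffer)
            self.buffer = b""
```

A slow application sees its first small write go out at once, while a fast one has its later writes clumped into a single segment per ACK, with no timer involved.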
96Summary
- TCP defines reliable stream delivery service
- Full duplex connection
- Exchange large volumes of data efficiently
- Sliding window gives efficient network use
- Few assumptions of underlying network
- Flexible for wide variety of delivery systems
- Has flow control
- Flexible for systems with differing speeds
97- Basic unit of transfer is a segment
- Pass data or control information
- Permits piggyback of ACKs
- Flow control
- Implemented by receiver advertisements
- Urgent facility supports out-of-band messages
- Push mechanism forces delivery
98- TCP standard specifies
- Exponential backoff for retransmission timers
- Congestion avoidance algorithms
- Slow-start
- Multiplicative decrease
- Additive increase
- Uses heuristics to avoid small packets
- Recommends using RED rather than tail-drop
- Avoids TCP synchronization
- Improves throughput