Title: AHBHA: Managing Congestion through Adaptive HopByHop Aggregation
1AHBHA: Managing Congestion through Adaptive Hop-By-Hop Aggregation
- Michael Greenwald, University of Pennsylvania
2What is congestion?
- Congestion: Applications/clients present a larger aggregate load than intermediate nodes in the network can handle.
- Congestion Control: mechanism that ensures the network remains manageable under overload.
- Much more difficult than the related problem of Flow Control: participants are unaware that they share a resource
3What causes congestion? (Isn't bandwidth cheap?)
- Persistent congestion: solved by adequate provisioning (True, bandwidth is cheap).
- Cause: Intermittent high load
- Intermittent emergency (earthquake, 9/11)
- Extreme loads (expected: Mother's Day. Unexpected: Pathfinder pictures)
- Periods of growth
- DOS or DDOS attack
- Effect Congestion
- Bursty traffic (statistical multiplexing)
- Many sources converge on a single link
- Low capacity link becomes bottleneck
- subset of multicast destinations
4Why must congestion be controlled?
- Congestion collapse
- Links clogged with useless packets that will be dropped anyway, or are retransmissions, or are out-of-date
- Long Delay
- Relevant mostly for short transactions over long distances.
- Variability in delay (jitter)
- Drop rate
- Not a problem in itself, since packets are only dropped if they can't make it through the bottleneck anyway, but
- Use up bandwidth on other links before being dropped.
- Control over which packets get dropped?
- Low Utilization (inefficiency)
- Fairness
5How is congestion controlled?
- Slow-start/congestion avoidance
- Losses-per-epoch/Fast Retransmit
- Full Buffers, tail-drop => RED
- Non-compliant flows => FRED, Penalty box etc.
- Pkt drop/buffer size is noisy signal => Vegas
- Adjust parameters => BLUE, ARED
- Avoid packet loss => ECN
- Explicit feedback => XCP
- Fat pipes => TCP FAST
- Fairness, lossy wireless => TCP Westwood
- Mice, Fairness, QOS, bad RTE
6Response
- Concern with robustness, efficiency, and fairness
- Control theoretic approach
- Stability, convergence
- Make world safe for control theory
- Controller reacts as quickly as signal changes
- Know RTT, react quickly, change slowly
- Response to feedback must be predictable
- Behavior of aggregate independent of number of flows.
- Behavior of client/application/transport
predictable.
7A Different View
- Complexity: Epicycles on epicycles
- Fragility
- The end-to-end argument misinterpreted
- Trapped by success, religious dogma, need for field testing
- Congestion control common to all clients
- Don't optimize for a particular application, even TCP
8A Different View
- View from routers: predictable response to feedback
- View from hosts: delivery fabric with predictable congestion feedback
- Stark contrast with current system
- Extreme example: Aggregated small TCP flows do not exponentially decrease or linearly increase.
- 1,000,000 flows, so window for each flow is small (approx. 1)
- Congestion notifies 10% of the flows, decrease by at most 10% of packets.
- Regardless, each of 1,000,000 flows increases cwnd by 1 each RTT, effectively doubling rate. (Alternatively, larger fraction in SlowStart.)
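The arithmetic behind the extreme example can be sketched as follows. This is only an illustration under stated assumptions: every flow starts with cwnd = 1, a notified flow can at worst send nothing for one RTT, and every flow that was not notified adds one packet to its window. The function name and parameters are hypothetical.

```python
def aggregate_after_one_rtt(n_flows, notified_percent):
    """Aggregate packets per RTT before and after one round of
    decrease/increase when every flow has cwnd = 1."""
    before = n_flows * 1
    notified = n_flows * notified_percent // 100
    # Assumption: a window of 1 cannot be halved, so the largest possible
    # decrease is the notified flows sending nothing for one RTT.
    lowest_after_decrease = before - notified
    # Assumption: each flow that was not notified grows from cwnd 1 to cwnd 2.
    after_increase = (n_flows - notified) * 2
    return before, lowest_after_decrease, after_increase


before, low, after = aggregate_after_one_rtt(1_000_000, 10)
print(before, low, after)
```

Under these assumptions the aggregate drops by at most 10% and then grows to 1.8 times its starting value in a single RTT, which is neither a multiplicative decrease nor an additive increase of the aggregate.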
9AHBHA: Adaptive Hop-by-Hop Aggregation: A simple idea
[Figure: Router architecture (Interconnect)]
- Hop-by-hop feedback and controller at each node.
- Why any different than CreditNet (Kung) or HBH (Kanakia)?
- Aggregate flows based on purely local characteristics: (next-hop X TOS X QOS) X input
- What about head-of-line blocking? Local vs. global behavior? Isolation of congestion?
- Transitive renaming of congested links
- Based on observation that most of the net is adequately provisioned.
10Controlling Utilization of a Resource
- Consider a Flow Queue with a current queue length, a known rate capacity, and a set of input flows
- Capacity may be a physical limit for a physical link, or a rate limit imposed by a neighbor for finer grained flow.
- Input flows may be local flows sharing a single physical output link, or may be flows coming in from neighbors.
- Queue length > threshold triggers congestion control (at most once per RTT)
- Must determine whether queue growth is due to burstiness or due to input rate exceeding output rate
- If former, smooth inputs; if latter, throttle neighbors.
11Controlling Utilization of a Resource
- MAIR (Mean Aggregate Input Rate) = Sum (over inputs) of (smoothed) number of packets / interval
- If MAIR < output capacity, then input flows need only be smoothed.
- Pacing: 1/pkts-per-RTT
- If MAIR > output capacity, then input rates must be reduced
- Acceptable rates computed in 2 passes.
- BaseAllocation = Capacity/Nflows. // Can use weights per flow instead.
- ExcessAllocation = 0; UncontrolledFlows = 0; for flow in inputs, if 2*flowRate < BaseAllocation, then ExcessAllocation += BaseAllocation - 2*flowRate; UncontrolledFlows++; end
- FairAllocation = BaseAllocation + ExcessAllocation/(Nflows - UncontrolledFlows)
- Send FairAllocation if input rate > FairAllocation
[Figure labels: Input from fairness controller; Drain queue in 2 RTT]
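The two-pass allocation on this slide can be sketched in a few lines. This is a minimal reading of the slide, not the AHBHA implementation: the function name, the unweighted per-flow split, and the handling of the case where no flow needs control are assumptions.

```python
def fair_allocation(capacity, flow_rates):
    """Return (fair_rate, indices of flows that should be sent fair_rate)."""
    n_flows = len(flow_rates)
    base = capacity / n_flows            # BaseAllocation (unweighted)
    excess = 0.0                         # ExcessAllocation
    uncontrolled = 0                     # UncontrolledFlows
    # Pass 1: a flow stays uncontrolled if even a doubled rate fits in its
    # base share; whatever it leaves unused is pooled as excess.
    for rate in flow_rates:
        if 2 * rate < base:
            excess += base - 2 * rate
            uncontrolled += 1
    controlled = n_flows - uncontrolled
    if controlled == 0:
        # Assumption: nothing to throttle when every flow is small.
        return base, []
    # Pass 2: split the pooled excess among the remaining flows.
    fair = base + excess / controlled
    throttled = [i for i, rate in enumerate(flow_rates) if rate > fair]
    return fair, throttled


fair, throttled = fair_allocation(100.0, [5.0, 10.0, 60.0, 70.0])
print(fair, throttled)
```

In this example the two small flows may double (to 10 and 20) while the two large flows are each held to 35, so the worst-case total equals the capacity of 100.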
12Renaming to Isolate Congestion
R1
R2
- R1->R2 becomes congested.
- QuarantineSet = {N : NextHop(N)@R1 = R2}
13Renaming to Isolate Congestion
R1
R2
R1
- Create artificial node R1'
- QuarantineSet = {N : NextHop(N)@R1 = R2}
- Routing update to all neighbors of R1 advertising R1' as best path to N.
14Renaming to Isolate Congestion
R1
R2
R1
- If queue to R1 is congested, recurse and split
artificial node and advertise to input queues.
15Releasing control
- Record time of transition from uncontrolled to controlled.
- Record time of most recent (last) congestion event.
- NewCongestionInterval = LastCongestionEvent - FirstCongestionEvent
- CongestionInterval = max(NewCongestionInterval, OldCongestionInterval)
- If ((now - LastCongestionEvent) > CongestionInterval)
- Release control
- OldCongestionInterval = max(OldCongestionInterval/2, NewCongestionInterval)
- // Release state after 10 * CongestionInterval w/ no congestion
- // OldCongestionInterval > 4 * max(IRTT, ORTT)
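The release rule can be read as a small state machine. The sketch below follows the bullets above; the class name, the method names, and the use of a `min_interval` argument to stand in for the 4 * max(IRTT, ORTT) floor are assumptions made for illustration.

```python
class CongestionState:
    """Tracks when a controlled aggregate may be released."""

    def __init__(self, first_event, min_interval):
        self.first_event = first_event   # transition uncontrolled -> controlled
        self.last_event = first_event    # most recent congestion event
        self.min_interval = min_interval # assumed floor, e.g. 4 * max(IRTT, ORTT)
        self.old_interval = min_interval # OldCongestionInterval
        self.controlled = True

    def on_congestion(self, now):
        if not self.controlled:
            self.first_event = now
            self.controlled = True
        self.last_event = now

    def maybe_release(self, now):
        """Release control after a quiet period longer than the interval."""
        new_interval = self.last_event - self.first_event
        interval = max(new_interval, self.old_interval)
        if self.controlled and now - self.last_event > interval:
            self.controlled = False
            # Decay the remembered interval, but never below the floor.
            self.old_interval = max(self.old_interval / 2, new_interval,
                                    self.min_interval)
        return not self.controlled


state = CongestionState(first_event=0.0, min_interval=4.0)
state.on_congestion(10.0)
print(state.maybe_release(15.0), state.maybe_release(21.0))
```

Halving the old interval on release gives the hysteresis the slide hints at: a link that was congested for a long time must stay quiet for a comparably long time before control is dropped.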
16Miscellaneous Details
- Uncontrolled flows
- XMIT: If X packets sent in RTT interval t0, then at most 2X packets in interval t1
- Periodic CC packets between immediate neighbors
- List controlled flows and rate
- Once per Max(RTT/2, 20 PacketTimes)
- If no CC pkt from N in RTT interval, then all flows are controlled at X/2.
- CC packets high priority
- Compute RTT
- Assume max rate known neighbors.
- Good assumption for dedicated lines
- May need to be estimated for Ether/shared channel
or multi-hop neighbors
17Advantages
- Works for Mice, Elephants, Non-TCP flows
- Long-delay flows ramp up in log(n) round-trips: high utilization
- Doesn't treat loss as congestion signal
- Not sensitive to parameters
- Fairness decoupled from CC mechanism: agnostic on policy, or policy delivery mechanism. Can work with either packet marking (diffserv) or flow-weighting (periodic packets from src to dst, providing per-flow weights).
- Aggregation & per-hop control makes flows smoother and less self-similar
- Response time to source comparable to e2e packet loss.
18Serendipity
- Buffer sizes
- Per-link rather than cross-network
- 1 buffer per neighbor, rather than 1 per flow
- Broken routers, misbehaving hosts, DOS attacks
- Multicast
- Simplifies TCP
19Preliminary Observations
- Significantly simpler than current world (let alone more complex world)
- AHBHA comparable in all cases. Never significantly better. Simulations using ns2
- RED (varying capacities & loads), Floyd TCP Friendly, FAST (), TCP WESTWOOD (), XCP (), AHBHA. Regions: Legacy, Defective routers, DOS, Short flows (1 pkt), Mbottle, SimpleTCP, w/ECN to source
- () Compared to results in paper, () needed to compile separate versions of ns2
- Works with non-cooperating clients and routers.
20Preliminary Non-Results
- Stability not proven
- Convergence not proven
- Convergence-time not established
- (On the other hand, intuitive reasons to believe stable, e.g. bounded input increase, superposition of stable systems, RTTs are equal (not just by assumption))
- If many congestion points, then lose congestion isolation
21Unresolved Issues with Naming Controlled Aggregates
- 1 bit for renaming next hop? B bits? Exhaustive list?
- Aggregate by next hop? Or 2nd hop (horizon effect)?
- More hysteresis in determining CongestionInterval, rather than Releasing control after quiet period.
- Right choices depend on patterns of congestion in real network.
- Measurement required.
22cing: Measuring Network-Internal Delays using only Existing Infrastructure
- joint work with Kostas Anagnostakis (University of Pennsylvania) and Raphael Ryger (Yale University)
23Remote measurement of per-link delays
- Network measurement techniques
- Understanding of control mechanisms (such as TCP congestion control) --- both results and workload
- Gain insights into network performance
- Fault Isolation, Error reporting
- Curiosity: switch providers?
- Network parameters such as delay, loss, and throughput are easy to measure end-to-end
- Network parameters such as delay, loss, and throughput are difficult to measure on individual links inside the network.
RESEARCH
MANAGEMENT
USER
24Understand your tools: Know yourself
- How accurate are the results?
- Why do we believe it is accurate?
- What are its limitations?
- Answering these questions is difficult, sometimes surprising, and results in a much better tool.
25Network Delay Tomography A Brief History
[Figure: probe tree rooted at source S, with internal nodes 1 and 2, nodes X1, X2, X3, A, B, C, and links a1-a5. From a remote source, S, estimate the distribution of link delays.]
- Direct measurement, using existing tools (e.g. pathchar)
- <RTT to tail> - <RTT to head> yields RTT on link (TTL-expired responses)
- Only existing infrastructure: measure anywhere w/o cooperation.
- But
- ICMP responses representative?
- Asymmetric paths: return paths vary, so (tail - head) may not be meaningful.
- Round trip vs. one-way delay?
26Network Delay Tomography A Brief History
- Indirect inference methods (e.g. minc project)
- One packet to multiple sources, and correlate behavior on links in the resulting tree
- But
- Deployability (works best with multicast, need cooperating rcvrs)
- Accuracy (assumes independence of delay, quality of estimates degrades over longer paths)
- Robustness (high variance in error)
- Computational complexity
- Need for many samples, therefore much time
27Network Delay Tomography A Brief History
- Direct methods (e.g. cing project)
- f(<Timestamp to tail>, <Timestamp to head>) yields delay on link
- No infrastructure required, highly accurate, strong experimental validation
- But
- Packet pair may not encounter equal queues
- ICMP processing may not be representative
- Clocks are unsynchronized
- Routing irregularity, so not always applicable
28Network tomography a direct method
- Use router ICMP Timestamp messages and
packet-pair probes to directly estimate queuing
delay
[Figure: packet-pair probes to adjacent routers A and B. Link delay = propagation delay (fixed) + queueing delay (variable).]
Account for fixed, by subtracting min time over set of observations
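The min-subtraction step can be illustrated with a short sketch. It assumes each probe pair yields one timestamp from the head router and one from the tail router of a link, in milliseconds, and that the clock offset between the two routers is constant over the sample (skew and jumps are treated separately in the slides that follow). The function name and the sample values are hypothetical.

```python
def queueing_delays(head_ts, tail_ts):
    """Estimate per-probe queueing delay on one link.

    head_ts[i], tail_ts[i]: timestamps (ms) returned by the routers at the
    head and tail of the link for the i-th packet pair.
    """
    # Raw difference = propagation delay + clock offset + queueing delay.
    diffs = [tail - head for head, tail in zip(head_ts, tail_ts)]
    # The smallest observed difference approximates the fixed part
    # (propagation plus constant offset), assuming some probe saw an
    # empty queue.
    fixed = min(diffs)
    return [d - fixed for d in diffs]


delays = queueing_delays([100, 200, 300], [112, 210, 325])
print(delays)
```

The estimate is only as good as the minimum: if no probe in the sample crossed the link with an empty queue, every queueing delay is underestimated by the same amount.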
29Question your assumptions
- Feasibility: basic mechanism supported? Accuracy? Stability of routing? etc. etc.
- Do back-to-back packets really experience the same delay on their shared path?
- Are ICMP processing times indicative of processing time for normal packets?
- How to account for differing offset and skew on clocks?
- Are the paths to adjacent nodes coincident?
30Back-to-back packets
- Do packets arrive back-to-back?
- Do back-to-back packets experience identical queuing delays and processing time? (and stay back-to-back?)
- Distinctions are irrelevant to algorithm: the issue is simply difference in timestamped value.
- Experiment: Probe routers with varied load and varied path length from source
7300 routers
This issue common to all algorithms
31ICMP Processing Time
[Figure: ICMP processing-time experiment. Labels: Cooperating rcvr; Allows spoofed src; spoofed request (A->2, req); response.]
- Send direct first, so queuing delays err conservatively and overestimate ICMP processing time.
- Median processing time always negligible
- 95th percentile usually negligible
[Plot legend: dot = median, box = interquartile range, bars = 5th-95th percentile, dots are outliers.]
- This issue common to all direct measurements that use ICMP
- Variation in processing time between head & tail
- Comparison w/non-ICMP traffic
32ICMP Processing Time
- Spoofing and cooperation limit scope of experiment (7 targets, 20 routers).
- Broader study? If processing delays are significant on head of link, then estimated queuing delay for link should sometimes be negative.
- Occasionally present: in 9.9% of sample (1,368)
[Plot legend: dot = median, box = interquartile range, bars = 5th-95th percentile, dots are outliers.]
33Unsynchronized Clocks
[Figure: probe timeline showing clock offsets OA,2 and OA,1 and their difference OA,2 - OA,1.]
- Clock offsets may vary because of clock skew or jumps to adjust for skew.
- Both src & dst may jump, and may be skewed in opposite directions
- May distort individual observations, as well as provide an erroneous minimum for d2prop
- Impossible to tell for individual observation whether Δt is due to queuing or clock artifact
34Unsynchronized Clocks
Local clock
- Post processing looks at multiple observations
- RTT provides valuable clues: queuing and max
- Can recover skew; only care if jump occurred between request & response
- Look for collinear regions; label others "can't tell"
35Routing Issues
[Figure: example topology with source S, internal nodes 1, 2, 3, nodes X1, X2, X3, A, B, C, and links a1-a5.]
A routing map, R, is regular over a graph G if Rs(m) is a prefix of Rs(d) for all m on the path s->d
- Reachability: It is easy to see that most links are not measurable by the direct method from a single source: many links connect to a node, but only one is on the path from S.
- (In some sense, S is mainly interested in links reachable from S.)
- Regularity: If the path to the head of a link is not a prefix of the path to the tail of the link, then we cannot meaningfully subtract the timestamp responses.
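The regularity condition amounts to a prefix test on measured paths. The sketch below assumes paths have already been collected (for example with traceroute-style probes) into a dictionary from destination to hop list; the function names, the data layout, and the sample topology are all hypothetical.

```python
def is_prefix(head_path, tail_path):
    """True if head_path is a leading segment of tail_path."""
    return (len(head_path) <= len(tail_path)
            and tail_path[:len(head_path)] == head_path)


def measurable_links(paths):
    """Return (head, tail) links usable by the direct method.

    paths: dict mapping a node to the hop list from source S to that node,
    ending with the node itself.
    """
    links = []
    for tail, tail_path in paths.items():
        if len(tail_path) < 2:
            continue
        head = tail_path[-2]              # last hop before the tail
        head_path = paths.get(head)
        # Usable only if probes to the head follow the same route that
        # probes to the tail take up to that point.
        if head_path is not None and is_prefix(head_path, tail_path):
            links.append((head, tail))
    return links


paths = {
    "1": ["S", "1"],
    "2": ["S", "1", "2"],
    "3": ["S", "X", "3"],
    "C": ["S", "1", "3", "C"],   # reaches 3 by a different route than above
}
print(measurable_links(paths))
```

Here the link 1->2 is usable, while 3->C is rejected because probes addressed to node 3 travel via X rather than via 1.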
36Irregular routing
Internet routing is irregular.
- Nevertheless
- Coverage for single links ranges from 20% (SRI) to 53% (LIACS)
- Multiple sources increase the likelihood measurably
- Multi-hop segments increase the likelihood of coverage
37Simpler approach? TTL vs. Timestamp
- Why aren't TTL-limited RTT measurements (a la pathchar) sufficient?
- TTL-limiting removes the routing problem
- RTT measurements remove the clock problem.
- Accuracy
- Asymmetry
- One-way vs. round-trip
- Not back-to-back on return path
10,931 links
38A hybrid solution
- Indirect inference is accurate for small trees: look only at small, isolated trees.
- Timestamps and TTL-limited probes make every router a cooperating receiver.
- Indirect inference can isolate return-path delays from forward path delays.
- Indirect inference can determine delay distribution in shared portion of overlapping segments
- Deconvolution
39Putting it all together
- By combination/choice of Timestamps, RTT, TTL-expiration and using either Indirect methods of MINC or deconvolution we can cover just about every link in the Internet, often by many methods
- But which methods to use?
40Putting it all together
- Shared link is most accurate for MINC
- Deconvolution is only as accurate as the least accurate segment.
- But which methods to use?
41Relative Accuracy
- Estimated vs. Actual Mean delay
- 2nd row shows effect of divergent paths: 200ms extra delay on path of 2nd pair
42Increased Coverage
Multiple sources
Granularity vs. accuracy
43Collecting vast quantities of data
- Individual delay measurements over thousands of paths, tens of thousands of nodes, millions of samples
- Long running simulation of AHBHA can generate petabytes of data for moderate size networks.
- How can we accurately collect these measurements without sinking under their weight?
44Space-Efficient Online Computation of Quantile
Summaries
- joint work with Sanjeev Khanna, University of Pennsylvania
45Summarizing extremely large data sets
- The problem
- Vast quantities of data, perhaps ephemeral
- Memory is limited and observations are lost once observed
- Therefore construct a proxy data structure of manageable size, able to return needed information
- What kind of information do we need? Distribution of values
- Quantile queries: Given a quantile, φ, return the value whose rank is ⌈φN⌉
- e.g. min, max, median, 90th percentile, 99th percentile
- Munro & Paterson 1980 (Pohl 1969): p-pass algorithm to compute exact quantile requires Ω(N^(1/p)) space.
46Trading off accuracy for space
- Explicit a priori guarantee on precision of the approximation, but try to use the smallest memory footprint possible.
- Explicit and tunable a priori guarantee on maximum memory footprint, and make the approximation as accurate as possible.
47Trading off accuracy for space
- Explicit a priori guarantee on precision of the approximation, but try to use the smallest memory footprint possible.
- Explicit and tunable a priori guarantee on maximum memory footprint, and make the approximation as accurate as possible.
Histograms
48Trading off accuracy for space
ε-approximate quantile summary
- Explicit a priori guarantee on precision of the approximation, but try to use the smallest memory footprint possible.
- An ε-approximate quantile summary can answer any quantile query to within a precision of ε
- Given a quantile, φ, return a value whose rank is guaranteed to be within the interval [(φ - ε)N, (φ + ε)N]
49Requirements
- Explicit & tunable a priori guarantees on the precision of the approximation
- As small a memory footprint as possible
- Online: Single pass over the data
- Data Independent Performance: guarantees should be unaffected by arrival order, distribution of values, or cardinality of observations.
- Data Independent Setup: no a priori knowledge required about data set (size, range, distribution, order).
50Related Work
- Manku, Rajagopalan, and Lindsay generalize a class of 1-pass algorithms (e.g. Agrawal & Swami COMAD95; Alsabti, Ranka & Singh VLDB97)
- SIGMOD98
- a priori knowledge of size of data set
- O((1/ε) log²(εN)) worst case space
- does not exploit any structure in observations
- SIGMOD99
- Give up deterministic guarantee in exchange for dropping the requirement of a priori knowledge of size of data set
- Gibbons, Matias, Poosala VLDB97; Chaudhuri, Motwani, Narasayya SIGMOD98
- Multiple passes (CMN only probabilistic guarantee)
51Our epsilon-approximate quantile summary
52Overview of Summary Data Structure
[Figure: summary for ε = .01, N = 1750]
v    rmin,rmax
192  501,503
201  529,536
204  539,540
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation.
- vi: value of ith observation stored in the summary
- <v0, v1, ..., vi, ..., vS-1>; S can be << N
- rmin(vi): minimum possible rank of vi
- rmax(vi): maximum possible rank of vi
53Overview of Summary Data Structure
[Figure: summary for ε = .01, N = 1750; query φ = .3, r = φN = 525]
v    rmin,rmax  g,Δ
192  501,503    15,2
201  529,536    28,7
204  539,540    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Quantile φ = .3? Compute r and choose best vi
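A query against such a summary can be sketched as follows. The tuple layout (v, g, Δ) and the numbers come from the slide; the function names are hypothetical, and the leading tuple (180, 486, 0) is an invented stand-in for the part of the summary not shown on the slide, chosen only so that the cumulative ranks match.

```python
import math


def rank_bounds(summary):
    """Turn (v, g, delta) tuples into (v, rmin, rmax) triples."""
    bounds = []
    rmin = 0
    for v, g, delta in summary:
        rmin += g                        # rmin(vi) = sum of g up to i
        bounds.append((v, rmin, rmin + delta))
    return bounds


def query(summary, phi, n, eps):
    """Return a stored value whose rank is within eps*n of phi*n, if any."""
    r = math.ceil(phi * n)
    for v, rmin, rmax in rank_bounds(summary):
        if r - rmin <= eps * n and rmax - r <= eps * n:
            return v
    return None


summary = [(180, 486, 0),   # hypothetical placeholder for earlier tuples
           (192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(rank_bounds(summary)[1:])
print(query(summary, phi=0.3, n=1750, eps=0.01))
```

With r = 525 and εN = 17.5, the value 192 is rejected (its minimum rank 501 is too far below r), while 201, with rank bounds 529 and 536, is acceptable.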
54Overview of Summary Data Structure
[Figure: summary for ε = .01, N = 1750; query φ = .3, r = φN = 525; 2εN = 35]
v    rmin,rmax  g,Δ
192  501,503    15,2
201  529,536    28,7
204  539,540    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- If (rmax(vi+1) - rmin(vi) - 1) < 2εN, then ε-approximate summary.
- Our goal: always maintain this property.
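The property on this slide can be checked mechanically. The sketch below uses the equivalent per-tuple form g + Δ - 1 < 2εN, since rmax of a tuple minus rmin of its predecessor equals g + Δ for that tuple; the function name is hypothetical and the tuples are the three shown on the slide.

```python
def is_eps_approximate(summary, n, eps):
    """Check g + delta - 1 < 2*eps*n for every (v, g, delta) tuple."""
    return all(g + delta - 1 < 2 * eps * n for _, g, delta in summary)


shown = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(is_eps_approximate(shown, n=1750, eps=0.01))
```

The tuple for 201 is the tight case here: 28 + 7 - 1 = 34, just under 2εN = 35.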
55Overview of Summary Data Structure
[Figure: summary for ε = .01, N = 1750; query φ = .3, r = φN = 525; 2εN = 35]
v    rmin,rmax  g,Δ
192  501,503    15,2
201  529,536    28,7
204  539,540    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Goal: always maintain ε-approximate summary: (rmax(vi+1) - rmin(vi) - 1) = (gi + Δi - 1) < 2εN
- Insert new observations into summary
56Overview of Summary Data Structure
[Figure: summary for ε = .01, N = 1750; 2εN = 35; new observation 197 arrives]
v    rmin,rmax  g,Δ
192  501,503    15,2
197  502,536    (new)
201  529,536    28,7
204  539,540    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Goal: always maintain ε-approximate summary: (rmax(vi+1) - rmin(vi) - 1) = (gi + Δi - 1) < 2εN
- Insert new observations into summary
57Overview of Summary Data Structure
[Figure: summary after inserting 197; ε = .01, N = 1751; 2εN = 35.02]
v    rmin,rmax  g,Δ
192  501,503    15,2
197  502,536    1,34
201  530,537    28,7
204  540,541    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Goal: always maintain ε-approximate summary: (rmax(vi+1) - rmin(vi) - 1) = (gi + Δi - 1) < 2εN
- Insert new observations into summary
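The insertion shown in the figure can be sketched as below. The rule used for the new tuple's Δ (successor's g + Δ - 1) is inferred from the slide's example, where 197 receives (g, Δ) = (1, 34); it is an assumption rather than a statement of the published algorithm, and the function name and the treatment of a new minimum or maximum are likewise illustrative.

```python
import bisect


def insert(summary, v):
    """Insert observation v into a sorted list of (value, g, delta) tuples."""
    values = [t[0] for t in summary]
    i = bisect.bisect_right(values, v)
    if i == 0 or i == len(summary):
        # Assumption: a new minimum or maximum has an exactly known rank.
        new = (v, 1, 0)
    else:
        # v lies just below summary[i]; its rank is at least one more than
        # its predecessor's rmin and at most the successor's old rmax.
        _, g_next, delta_next = summary[i]
        new = (v, 1, g_next + delta_next - 1)
    summary.insert(i, new)
    return new


summary = [(192, 15, 2), (201, 28, 7), (204, 10, 1)]
print(insert(summary, 197))
print(summary)
```

Because rmin is a running sum of g, the new tuple's g = 1 automatically shifts the rank bounds of 201 and 204 up by one, matching the 530,537 and 540,541 shown after the insertion.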
58Overview of Summary Data Structure
[Figure: summary after inserting 197; ε = .01, N = 1751; 2εN = 35.02]
v    rmin,rmax  g,Δ
192  501,503    15,2
197  502,536    1,34
201  530,537    28,7
204  540,541    10,1
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Goal: always maintain ε-approximate summary: (rmax(vi+1) - rmin(vi) - 1) = (gi + Δi - 1) < 2εN
- Insert new observations into summary
- Delete all superfluous entries.
59Overview of Summary Data Structure
[Figure: summary after deleting 197; ε = .01, N = 1751; 2εN = 35.02. The g,Δ labels visible in the figure are 15,2; 28,7; 1,34; 10,1.]
v    rmin,rmax
192  501,503
201  530,537
204  540,541
- Keep a data structure that stores vi, rmin(vi), and rmax(vi) for each observation. Tuple (vi, gi, Δi): gi = rmin(vi) - rmin(vi-1), Δi = rmax(vi) - rmin(vi)
- Goal: always maintain ε-approximate summary: (rmax(vi+1) - rmin(vi) - 1) = (gi + Δi - 1) < 2εN
- Insert new observations into summary
- Delete all superfluous entries.
60Reducing space requirement of summary
- Delete all superfluous entries: What do we mean by superfluous entries?
- Goal: minimizing workspace --- not size of final summary
- Can always reduce the final summary to size O(1/ε).
- Deletion rule (COMPRESS) will reduce summary size, but will take care to keep workspace small regardless of incoming observations.
- To explain COMPRESS operation, we need to develop some more terminology
61Terminology
- Full tuple: A tuple is full if gi + Δi = 2εN
- Full tuple pair: A pair of tuples is full if deleting the left-hand tuple would overfill the right one
- Capacity: number of observations that can be counted by gi before the tuple becomes full (= 2εN - Δi)
- We say that ti and tj have similar capacities if log capacity(ti) ≈ log capacity(tj) (intuition, not defn)
- Similarity partitions the possible values of Δ into bands.
62More Terminology: Tree Representation
ε = .001, N = 7,000, 2εN = 14
Δ-range  Capacity  Band
0-7      8-15      3
8-11     4-7       2
12-13    2-3       1
14       1         0
[Figure: summary S drawn as a sequence of tuples (vi, gi, Δi)]
- The bands can be used to impose a tree structure over the tuples.
- Group tuples with similar capacities into bands
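The band table on this slide can be reproduced with a one-line rule. This sketch follows the intuition stated in the terminology (similar log capacity), not the formal band definition; in particular the "+ 1" in the capacity is an assumption chosen so that the output matches the table's Capacity column, and the function name is hypothetical.

```python
def band(delta, two_eps_n):
    """Band of a tuple with the given delta: floor(log2(capacity))."""
    # Assumption: capacity as tabulated on the slide, i.e. 2*eps*N - delta + 1.
    capacity = two_eps_n - delta + 1
    return capacity.bit_length() - 1


table = {}
for delta in range(0, 15):
    table.setdefault(band(delta, 14), []).append(delta)
for b in sorted(table, reverse=True):
    print(b, table[b][0], table[b][-1])
```

Tuples with small Δ (inserted early, when 2εN was small) land in high bands and have large capacity, which is why COMPRESS prefers to keep them and merge low-band tuples into them.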
65More Terminology: Tree Representation
ε = .001, N = 7,000, 2εN = 14
Δ-range  Capacity  Band
0-7      8-15      3
8-11     4-7       2
12-13    2-3       1
14       1         0
- The bands can be used to impose a tree structure over the tuples.
- Group tuples with similar capacities into bands
- First (least index) node to the right with higher capacity band becomes parent.
67COMPRESS operation
- General strategy: delete tuples with small capacity and preserve tuples with large capacity.
- 1) Deletion cannot leave descendants unmerged --- it must delete entire subtrees
- 2) Deletion can only merge a tuple with small capacity into a tuple with similar or larger capacity.
- 3) Deletion cannot create an over-full tuple (i.e. with g + Δ > floor(2εN))
68Analysis
- Theorem
- At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn)
- Sketch of proof
- Each tuple requires the support of many observations in order to survive a COMPRESS
- Only n observations
- Therefore only a relatively small number of tuples can survive
69Useful Lemmas
- A tuple that survives insertion at time m must have Δ ≈ floor(2εm) (else it would be immediately deleted: it has no descendants, and if Δ is smaller then the parent has capacity to absorb it).
- If Δi and Δj are ever in the same band, they will always be in the same band. (Technical details on defn of band: band boundaries are only deleted, never created.)
- The number of observations covered cumulatively by tuples in bands 0..α is bounded by 2^α/ε
70Limited number of full tuple pairs in each band
- For any given α, at most 4/ε nodes from band α are right partners in a full tuple pair.
- Defn: If neighbors are a full tuple pair, then gj-1 + gj + Δj > 2εn
- Assume p pairs exist. Sum over all such pairs: Σgj-1 + Σgj + ΣΔj > 2pεn
- 2Σgj + ΣΔj > 2pεn
- Σgj is bounded by # of observations in bands 0..α: (2^α)/ε
- ΣΔj is bounded by p times the max Δj in band α, 2εn - 2^(α-1)
- (2^(α+1))/ε + p(2εn - 2^(α-1)) > 2pεn
- 4/ε > p
- What about non-full tuple pairs? At most 1 per parent.
71Each parent requires many descendants to survive
COMPRESS
Vi
Vj
- At time n, for any α, at most 3/(2ε) nodes have a child in band α
- Choose a parent Vi with a child in band α. Choose the rightmost child, Vj. Let mj (< n - 2^(α-1)/(2ε)) be the time Vj was inserted.
- Red nodes, descendants of Vj, and anything merged into Vj, must have arrived after n - 2^(α+1)/(2ε)
- Σg in picture + gi + Δi > 2εn; gi(mj) + Δi < 2εmj
- Σg in picture + gi (since n - 2^(α+1)/(2ε)) > 2ε(n - (n - 2^(α-1)/(2ε)))
- At most 2^(α+1)/(2ε) observations avail, each Vi needs > 2ε(2^(α-1)/(2ε))
- Therefore at most 2/ε parents of nodes in band α (more complexity needed to get to 3/(2ε))
72Analysis
- Theorem
- At any time n, the total number of tuples stored in S(n) is at most (11/(2ε)) log(2εn)
- Combining Lemmas
- 4/ε pairs per band
- At most 3/(2ε) parents of children in band
- At most 1 singleton per parent
- 11/(2ε) tuples per band
- At most log(2εn) bands
73Experimental Results
- Measurement
- S
- Observed ε (vs. desired ε): max, avg, and for 16 representative quantiles
- Optimal max observed ε
- Compared 3 algorithms
- MRL
- Preallocated (1/3 number of stored observations as MRL)
- Adaptive: allocate a new quantile only when observed error is about to exceed desired ε
- Optimization in algorithm
- Keep entries up to high water mark (can only help)
74Random Input
Space
Error
75Handling Deletions
- Artificial data set
- AT&T CDR: median length of active phone call?
76Summarizing Quantile Summaries
- Empirically, behaves very well indeed
- On average, for random input, seems to use constant space
- Best-known worst-case guarantees
- GK used as a black box to improve other algorithms
- Munro & Paterson's classic p-pass algorithm for computing median exactly. GK reduces space/number of passes by a factor of Omega(log n)
- Probabilistic quantile summaries
- The basic data structure has applications to other problems
- Order statistics in sensor networks
77Concluding remarks
- AHBHA
- seems very promising but a lot of work is needed to evaluate it properly.
- Cing
- Identified problems with existing techniques
- The hybrid approach was an obvious idea, but required a lot of work and care to succeed
- As accurate as cing, almost universally applicable.
- Quantile Summaries
- Exploit as much information as possible.
- Proof is unsatisfying & inelegant because of complexity; notion of bands and COMPRESS non-intuitive
- Result is significant improvement with several unexpected applications.
78General remarks
- A small shift in view can sometimes yield large reductions in complexity
- Even simple solutions to large scale problems are extremely difficult to evaluate --- many details, many cases, unexpected interactions, many metrics. As a discipline we do not have a good methodology for evaluation.
- Experimental results are surprisingly difficult to obtain, confirm, and evaluate. It is worth persevering.
- Formal analysis of sub-problems can give us solid ground to stand on even when large problem is analytically intractable. It can also yield significant practical improvements.
- Successful systems research needs vision, experimental technique, and formal analytic skills
79Ongoing projects
- AHBHA: congestion control, network architecture
- Cing: network delay tomography, large scale measurement studies
- Streaming data summaries
- Sensor networks: balanced power, order statistics, communication optimizations
- Coverage: cooperative virus defense w/ untrusted peers
- EXCHANGE: peer2peer incentives
- Harmony: generic, safe, reconciliation of OTS apps
- Canon: consistent security for heterogeneous systems
- NBS: practical non-blocking algs, contention in distributed algorithms
80The End
81Simpler approach? TTL vs. Timestamp
- Why aren't TTL-limited RTT measurements (a la pathchar) sufficient?
- TTL-limiting removes the routing problem
- RTT measurements remove the clock problem.
- Accuracy
- Asymmetry
- One-way vs. round-trip
- Not back-to-back on return path
10,931 links
82Network Tomography feasibility
- TIMESTAMP support? 96% response to TIMESTAMPs
- TIMESTAMP indicative of normal packets? within ms resolution
- Clock synchronization? robust post-facto algorithm
- Irregular routing: limits choice of nodes
Example Path structure, Penn to Sprintlabs
Corresponding feasible measurement partitions,
Penn to Sprintlabs
83Network tomography feasibility (2)
- Data: 10k paths from 5 different sources
- Metric: fraction of nodes usable for tomography
- Results: 50% of nodes are usable, more difficult as distance from source increases, better when probing from multiple sources
84Why is this not ideal?Accepted quibbles
- Non-TCP: Assumes everything TCP-Friendly
- Packet loss due to errors (e.g. wireless) considered congestion signal
- Bad RTE can also cause false signals
- Mice (congestion control only kicks in after 6 packets or so)
- Large bandwidth-delay pipes
- Self-similarity of traffic (bursty)
- Buffer occupancy (high)
- RED hard to configure to perform well (different parameters for different scenarios)
- Fairness
- QOS
85Conjecture Internet is at local maximum with
very steep slopes
Some locally bad ideas
- Hop by hop feedback
- Head of line blocking
- Local, so can't achieve global fairness
- Aggregation
- Fractal nature of traffic
- Rate-based congestion control
- unbounded input, oscillatory
- Explicit out-of-band congestion notification packets
- adds to load under congestion, wastes bandwidth, and unstable
- Most new ideas, taken by themselves, make matters worse than Standard TCP.