Title: The Cache Location Problem
1The Cache Location Problem
2Overview
- TERCs Vs. Proxies
- Stability
- Cache location
3Proxy Web Caching is Good
- Saves network bandwidth
- Reduces delay
- Reduces server load
- But it is not perfect
- not everybody uses it (configuration)
- may become a bottleneck and increase delay
- increases delay for unsatisfied pages
4Transparent En-Route Caches (TERCs)
- Caches are located along routes from clients to servers, and are transparent to both server and client
- Requests are intercepted by the TERC on their way to the server, and either
- answered by the cache if the information exists
- otherwise, forwarded to the server
- Advantages
- No configuration required! No management!
- No change required in current network infrastructure
- Can be deployed independently within an ISP subnetwork
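The intercept-or-forward behavior described on this slide can be sketched as a toy class. This is only an illustration: the class name, the `fetch_from_server` callback, and the choice to store every forwarded response are assumptions, not details from the slides.

```python
from typing import Callable, Dict


class TransparentCache:
    """Toy en-route cache: answer from the local store, else forward upstream."""

    def __init__(self, fetch_from_server: Callable[[str], bytes]):
        self._store: Dict[str, bytes] = {}
        self._fetch = fetch_from_server  # hypothetical upstream callback
        self.hits = 0
        self.misses = 0

    def intercept(self, url: str) -> bytes:
        # Hit: the request never reaches the server.
        if url in self._store:
            self.hits += 1
            return self._store[url]
        # Miss: forward to the server and (by assumption) keep a copy.
        self.misses += 1
        body = self._fetch(url)
        self._store[url] = body
        return body
```

Neither the client nor the server calls the cache explicitly; in a real deployment the interception happens on the route, which this sketch does not model.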
5TERCs (-)
- Must be on the route from client to server
- sensitive to route changes
- hierarchies are much harder to implement
- Needs to intercept traffic
- implementation problem
- more complex
- can TERCs work at line speed?
- Depends on routing stability, and flow stability
Where should TERCs be placed?
6Route Stability
- Published results indicate that routing is stable (Paxson, Labovitz)
- We need stability only during the connection lifetime (1 min.)
- KRS00 measurements to more than 13,000 destinations show that >93% of connections were stable
- real numbers are probably higher
- TCP route caching
- equivalent of IP addresses
7Stability of Flows
- We built the flow tree from servers
- Data from Bell-Labs servers (www.bell-labs.com, www.multimedia.bell-labs.com)
- Nov. 97 - Jan. 98
- 14000 different hosts, 1 Gbytes, 200k cacheable requests (per week)
- From log files to results
- extract unique hosts
- run traceroute for each host
- obtain the routing tree (or is it a DAG?)
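The last step of the pipeline above (traceroute paths to a routing tree with flows) might look like the following sketch. The input shapes are assumptions: `paths` maps each host to its hop list from the server, `requests` maps each host to its request count from the logs, and paths are assumed loop-free.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_flow_graph(paths: Dict[str, List[str]], requests: Dict[str, int]):
    """Accumulate per-node and per-edge flow from server-to-host hop lists."""
    node_flow: Dict[str, int] = defaultdict(int)
    edge_flow: Dict[Tuple[str, str], int] = defaultdict(int)
    for host, hops in paths.items():
        weight = requests.get(host, 0)
        for hop in hops:
            node_flow[hop] += weight
        for parent, child in zip(hops, hops[1:]):
            edge_flow[(parent, child)] += weight
    return dict(node_flow), dict(edge_flow)


def is_tree(edge_flow) -> bool:
    """True if every node is reached from exactly one parent (else it is a DAG)."""
    parents = defaultdict(set)
    for parent, child in edge_flow:
        parents[child].add(parent)
    return all(len(p) == 1 for p in parents.values())
```

A router that appears under two different parents makes `is_tree` return False, which is the "or is it a DAG?" case.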
8Stability - Visual
9Client return rate between days
day   0111  0112  0113  0114  0115  0116  0117  1130  1201  1202  1203  1204  1205  1206
0111  --    4.35  4     3.78  3.69  3.55  3.73  3.25  3.12  3.21  2.96  3.01  2.79  3.36
0112  4.35  --    6.93  6.06  5.66  5.34  3.58  2.77  4.4   3.85  3.86  3.87  4.02  3.33
0113  4     6.93  --    7.48  6.1   6.12  4.26  3.28  4.58  4.25  4.16  4.34  4.25  2.96
0114  3.78  6.06  7.48  --    7.33  6.48  4.07  3.03  4.21  4.23  4.28  4.34  4.25  3.15
0115  3.69  5.66  6.1   7.33  --    7.41  4.3   2.77  3.71  4.02  4.25  3.98  4.2   2.88
0116  3.55  5.34  6.12  6.48  7.41  --    5.38  3.13  4.21  4.56  4.12  4.1   4.36  3.25
0117  3.73  3.58  4.26  4.07  4.3   5.38  --    3.36  2.99  3.14  2.86  2.88  3.18  3.46
1130  3.25  2.77  3.28  3.03  2.77  3.13  3.36  --    4.32  4.08  4.15  3.42  3.49  4.23
1201  3.12  4.4   4.58  4.21  3.71  4.21  2.99  4.32  --    7     6.34  6.06  4.97  3.58
1202  3.21  3.85  4.25  4.23  4.02  4.56  3.14  4.08  7     --    6.88  5.89  5.35  3.94
1203  2.96  3.86  4.16  4.28  4.25  4.12  2.86  4.15  6.34  6.88  --    7.01  5.58  3.48
1204  3.01  3.87  4.34  4.34  3.98  4.1   2.88  3.42  6.06  5.89  7.01  --    7.15  3.95
1205  2.79  4.02  4.25  4.25  4.2   4.36  3.18  3.49  4.97  5.35  5.58  7.15  --    4.82
1206  3.36  3.33  2.96  3.15  2.88  3.25  3.46  4.23  3.58  3.94  3.48  3.95  4.82  --
10Stability (3)
- The relative flow in the tree is stable in time, although the client population changes significantly
- Routing is stable for the lifetime of the connection
- Placing caches based on past traffic yields good results
11How Fixed is the Hit Ratio?
12How Fixed is the Hit Ratio?(2)
13Where Should the TERCs be Placed?
14The Model
- Wide area network
- Requests are represented by a set of demands (of client i from server j)
- Goal: minimize average delay, minimize total flow
- The hit ratio (P) abstracts cache behavior
- most hits due to small number of popular pages
- full dependency: the same pages are cached everywhere
- But part of the flow can come from proxies, so each flow is associated with a hit ratio p_{i,j}
15The General k-cache Location Problem
- Instance
- an undirected graph G(V,E)
- a set of demands F = {f_{i,j}}
- a set of hit ratios P = {p_{i,j}}
- k: the number of caches
- Solution: K, a subset of V of size k
- Objective: minimize the total flow

$$\sum_{i,j} f_{i,j} \min_{v \in K \cup \{j\}} \Big[ p_{i,j}\, d(i,v) + (1 - p_{i,j}) \big( d(i,v) + d(v,j) \big) \Big]$$
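The objective of the general problem can be evaluated directly for a candidate set K. The sketch below is a brute-force reference for tiny instances only, not the authors' method. The data layout is an assumption: `dist[a][b]` holds pairwise distances, `demands` and `hit` are keyed by (client i, server j).

```python
from itertools import combinations


def total_flow(dist, demands, hit, caches):
    """Sum over demands of f * min over v in K or {j} of the per-request cost."""
    total = 0.0
    for (i, j), f in demands.items():
        p = hit[(i, j)]
        best = min(
            p * dist[i][v] + (1 - p) * (dist[i][v] + dist[v][j])
            for v in set(caches) | {j}  # v = j means "served by the server"
        )
        total += f * best
    return total


def brute_force_k_cache(nodes, dist, demands, hit, k):
    """Try every k-subset of nodes; exponential, for sanity checks only."""
    return min(
        combinations(nodes, k),
        key=lambda K: total_flow(dist, demands, hit, K),
    )
```

On a 4-node line with unit edges, demands of 10 from client 0 and 4 from client 1 to server 3, and every hit ratio 0.5, the single best cache is node 0 with total flow 23 (38 without any cache).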
16The k-TERC Location Problem
- Instance
- an undirected graph G(V,E)
- a set of demands F = {f_{i,j}}
- a set of hit ratios P = {p_{i,j}}
- k: the number of caches
- Solution: K, a subset of V of size k
- Objective: minimize the total flow

$$\sum_{i,j} f_{i,j} \min_{v} \Big[ p_{i,j}\, d(i,v) + (1 - p_{i,j}) \big( d(i,v) + d(v,j) \big) \Big], \quad v \in K \cup \{j\},\ v \text{ on the path from } j \text{ to } i$$
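The only change from the general problem is that a cache counts for a demand only if it lies on that demand's route. A minimal evaluator, assuming each route is given explicitly as a node list from server j to client i (the `routes` dictionary is a hypothetical input format):

```python
def terc_total_flow(routes, dist, demands, hit, caches):
    """Total flow when only caches on the server-to-client route can serve a demand."""
    cache_set = set(caches)
    total = 0.0
    for (i, j), f in demands.items():
        p = hit[(i, j)]
        # Caches that sit on this demand's route, plus the server itself.
        usable = [v for v in routes[(i, j)] if v in cache_set] + [j]
        best = min(
            p * dist[i][v] + (1 - p) * (dist[i][v] + dist[v][j])
            for v in usable
        )
        total += f * best
    return total
```

A cache placed off the route of a demand is simply ignored for that demand, which is why TERC placement is sensitive to routing.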
17Remarks
- A generalization of the p-median problem (in the p-median problem we want to minimize the total cost of serving a set of demands from at most p centers)
- In the k-TERC location problem
- it is enough to solve the problem for fixed p (p_{i,j} = p)
- The optimal set K does not depend on p.
- (not true in general)
- The k-TERC location problem is a special case of
the general k-location problem(p1/n)
18The independence of p_{s,c}
TERC
constant
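Why the optimal TERC set does not depend on the hit ratio can be seen from a short derivation. It is a sketch under two assumptions: all demands share one hit ratio p > 0, and distances add up along a route, so that d(i,v) + d(v,j) = d(i,j) for every v on the path from j to i. The notation f_{i,j}, d, K follows the problem statement.

$$
\begin{aligned}
p\, d(i,v) + (1-p)\big(d(i,v) + d(v,j)\big) &= d(i,v) + (1-p)\, d(v,j) \\
&= d(i,j) - p\, d(v,j), \\[4pt]
\sum_{i,j} f_{i,j} \min_{v} \big[ d(i,j) - p\, d(v,j) \big]
&= \sum_{i,j} f_{i,j}\, d(i,j) \;-\; p \sum_{i,j} f_{i,j} \max_{v} d(v,j).
\end{aligned}
$$

The first sum does not depend on K. The second depends on K only through the maximum, and p merely scales it, so the same set K is optimal for every p > 0. In the general problem v need not lie on the route, the first identity fails, and the argument breaks down.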
19Hardness Results
             line     tree      general graph
one server   Poly.    Poly.     NP-hard
m servers    Poly.    NP-hard   NP-hard
20Placement on a line
[Figure: a line of nodes 0, 1, 2, ..., n-1]
- Topology: a line of n nodes
- Every node may be a server, a client, or both.
- FR(i): the flow demand on the segment (i-1, i)
- FR can be easily computed from the input.
- FC(i, lo, li): the flow on the segment (i-1, i) when the closest caches to i are in lo and li.
- FC can be computed from the input with p = 1.
- Note: FR(i) = FC(i, n-1, 0)
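Computing FR from the input takes one pass over the demands. A sketch, assuming demands are given as a dictionary keyed by (client, server) positions on the line; the function name and the difference-array trick are illustrative choices, not taken from the slides.

```python
def segment_demand(n, demands):
    """Return fr where fr[i] is the total demand crossing segment (i-1, i).

    fr[0] is unused and left at 0, since segments are numbered 1..n-1.
    """
    diff = [0.0] * (n + 1)
    for (client, server), f in demands.items():
        lo, hi = sorted((client, server))
        # This demand crosses segments lo+1, ..., hi.
        diff[lo + 1] += f
        diff[hi + 1] -= f
    fr = [0.0] * n
    running = 0.0
    for i in range(1, n):
        running += diff[i]
        fr[i] = running
    return fr
```

With unit-length segments, the sum of fr equals the total flow when no caches are placed.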
21Placement on a line
- C(j, lo, li, k): the overall flow in segment [0, j] when k caches are located optimally inside the segment, and the closest caches to j are in lo and li.
22The dynamic Program
23The Algorithm
- Compute C(1, li, 1, 1) and C(1, li, 0, 0) for 1 <= li <= n-1
- For each j > 1, compute C(j, lo, li, k') for all 0 <= k' <= k and 0 <= li <= j <= lo <= n-1
- Complexity: O(n^3 k)
24Optimizing for a single server
- The routes from the server to all clients form a tree (actually a DAG)
- We'll use dynamic programming to find the optimal cache locations
25The Greedy Algorithm
- Optimal algorithm using bottom-up dynamic programming
- not trivial
- complexity O(n k^2 h)
- Greedy
- repeat k times: find the best cache location
- complexity O(n k)
- How bad can it be?
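The greedy rule can be compared with an exhaustive search on a toy tree. This is a naive sketch: it re-evaluates the full cost for every candidate, so it does not achieve the O(n k) bound quoted above. The tree encoding (`parent` map with the server as the root, `length` of the edge to the parent, per-client `demand`) is an assumption.

```python
from itertools import combinations


def tree_cost(parent, length, demand, p, caches):
    """Total flow on a single-server tree; the server is the node with parent None."""
    caches = set(caches)
    total = 0.0
    for client, f in demand.items():
        # Walk up to the first cache on the route (or to the server).
        to_cache, node = 0.0, client
        while parent[node] is not None and node not in caches:
            to_cache += length[node]
            node = parent[node]
        # Misses continue from that point to the server.
        to_root, up = 0.0, node
        while parent[up] is not None:
            to_root += length[up]
            up = parent[up]
        total += f * (to_cache + (1 - p) * to_root)
    return total


def greedy_placement(parent, length, demand, p, k):
    """Repeat k times: add the single location that lowers the cost the most."""
    candidates = [v for v in parent if parent[v] is not None]
    chosen = []
    for _ in range(k):
        best = min(
            (v for v in candidates if v not in chosen),
            key=lambda v: tree_cost(parent, length, demand, p, chosen + [v]),
        )
        chosen.append(best)
    return chosen


def optimal_placement(parent, length, demand, p, k):
    """Exhaustive search over all k-subsets; only for tiny trees."""
    candidates = [v for v in parent if parent[v] is not None]
    return min(
        combinations(candidates, k),
        key=lambda K: tree_cost(parent, length, demand, p, K),
    )
```

On a tree where server s has children a (edge 2) and d (edge 3), a has children b and c (edge 1 each), demands are b: 4, c: 3, d: 2, and p = 0.5, greedy with k = 2 picks a then d (cost 17), while the optimum is {b, c} (cost 16.5). Greedy commits to the shared ancestor first and cannot undo that choice.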
26Greedy Vs. Optimal
27Dynamic Programming for Tree
- First we convert the tree to a binary tree by adding dummy nodes.
- Sort all nodes in reverse BFS order: a node's descendants are numbered before the node itself.
- Children of node i are i_R and i_L
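The two preprocessing steps can be sketched as follows. The representation is an assumption: `children` maps a node to its list of children, and dummy nodes are tuples `("dummy", n)`. In a cost computation the dummies would presumably carry zero demand and zero-length edges so that distances are unchanged; the slides do not spell this out.

```python
from collections import deque


def binarize(children, root):
    """Return a children map in which every node has at most two children."""
    out = {}
    counter = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        kids = list(children.get(node, []))
        current = node
        # Peel off one child at a time, chaining the rest under a dummy node.
        while len(kids) > 2:
            dummy = ("dummy", counter)
            counter += 1
            out[current] = [kids[0], dummy]
            queue.append(kids[0])
            kids = kids[1:]
            current = dummy
        out[current] = kids
        queue.extend(kids)
    return out


def reverse_bfs_order(children, root):
    """List nodes so that every descendant appears before its ancestors."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))
    order.reverse()
    return order
```

Processing nodes in this order guarantees that both children's tables are ready when a node's own table is filled in.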
28Notations
- C(i,k,l) is the cost of a subtree rooted at i with k optimally located caches, where the next cache up the tree is at distance l from i.
- F(i,k,l) is the sum of demands in the subtree i that do not pass through a cache in the solution C(i,k,l).
29The Dynamic Program
30The DP Formula for C(i,k,l)
- The cost if a cache is not placed at node i
- The cost if a cache is placed at node i
- Complexity
- O(nhk) variables => O(nhk^2) time complexity
- Finer analysis yields O(nhk) time complexity
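A memoized recursion over the same state C(i, k, l) gives a compact reference implementation. It is a sketch and not the formula from the slides: it works with "at most k caches", merges children with a small knapsack instead of requiring a binary tree, and makes no attempt to match the stated complexity. The inputs (`children`, `length` of the edge to the parent, `demand`, single hit ratio `p`) are assumed encodings.

```python
from functools import lru_cache


def optimal_tree_cost(children, length, demand, p, k, root):
    """Minimum total flow on a single-server tree using at most k caches."""
    # Distance from every node to the server at the root.
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            depth[child] = depth[node] + length[child]
            stack.append(child)

    @lru_cache(maxsize=None)
    def cost(i, caches, l):
        own = demand.get(i, 0)
        # No cache at i: requests travel l to the next cache up (or the
        # server); misses travel the remaining depth[i] - l to the server.
        best = own * (l + (1 - p) * (depth[i] - l)) + split(i, caches, l)
        # Cache at i: hits cost nothing, misses travel depth[i].
        if caches >= 1 and i != root:
            with_cache = own * (1 - p) * depth[i] + split(i, caches - 1, 0)
            best = min(best, with_cache)
        return best

    def split(i, caches, l):
        # table[c]: best cost of the children seen so far with at most c caches.
        table = [0.0] * (caches + 1)
        for child in children.get(i, ()):
            up = l + length[child]
            table = [
                min(table[c - t] + cost(child, t, up) for t in range(c + 1))
                for c in range(caches + 1)
            ]
        return table[caches]

    return cost(root, k, 0)
```

On a tree where server s has children a (edge 2) and d (edge 3), a has children b and c (edge 1 each), demands are b: 4, c: 3, d: 2, and p = 0.5, the recursion gives 27 for k = 0, 20 for k = 1 and 16.5 for k = 2.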
31The Server's Point of View
32Traffic Reduction
33TERCs Vs. Edge Caches
34The Server's Point of View (2)
35Popularity Stability