Title: Towards a Scalable, Adaptive and Network-aware Content Distribution Network
1. Towards a Scalable, Adaptive and Network-aware Content Distribution Network
Yan Chen, EECS Department, UC Berkeley
2. Outline
- Motivation and Challenges
- Our Contributions: the SCAN system
- Case Study: Tomography-based overlay network monitoring system
- Conclusions
3. Motivation
- The Internet has evolved into a commercial infrastructure for service delivery
  - Web delivery, VoIP, streaming media
- Challenges for Internet-scale services
  - Scalability: 600M users, 35M Web sites, 2.1 Tb/s
  - Efficiency: bandwidth, storage, management
  - Agility: dynamic clients/network/servers
  - Security, etc.
- Focus on content delivery: the Content Distribution Network (CDN)
  - 4 billion Web pages in total, growing by 7M pages daily
  - Annual traffic growth of 200% projected for the next 4 years
4. How a CDN Works
5. Challenges for CDNs
- Replica location
  - Find nearby replicas with good DoS attack resilience
- Replica deployment
  - Dynamics, efficiency
  - Client QoS and server capacity constraints
- Replica management
  - Scalability of replica index state maintenance
- Adaptation to network congestion/failures
  - Scalability and accuracy of overlay monitoring
6. SCAN: Scalable Content Access Network
[Architecture diagram]
- Provisioning: dynamic replication, update multicast tree building
- Replica management: (incremental) content clustering
- Network: DoS-resilient replica location (Tapestry)
- Network end-to-end distance monitoring: Internet Iso-bar (latency), TOM (loss rate)
7. Replica Location
- Existing work and problems
  - Centralized, replicated, and distributed directory services
  - No security benchmarking: which one has the best DoS attack resilience?
- Solution
  - Proposed the first simulation-based network DoS resilience benchmark
  - Applied it to compare the three directory services
  - DHT-based distributed directory services have the best resilience in practice
- Publication
  - 3rd Int. Conf. on Information and Communications Security (ICICS), 2001
8. Replica Placement/Maintenance
- Existing work and problems
  - Static placement
  - Dynamic but inefficient placement
  - No coherence support
- Solution
  - Dynamically place a close-to-optimal number of replicas under client QoS (latency) and server capacity constraints (see the sketch after this slide)
  - Self-organize replicas into a scalable application-level multicast tree for disseminating updates
  - Uses overlay network topology only
- Publication
  - IPTPS 2002, Pervasive Computing 2002
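As a rough illustration of the placement problem only (not the dissertation's actual algorithm), a greedy set-cover-style heuristic that respects a latency bound and per-server capacity could look like the sketch below; every input name and structure is hypothetical.

```python
def place_replicas(clients, servers, latency, latency_bound, capacity):
    """Greedy sketch of QoS/capacity-constrained replica placement.

    latency[s][c] is the latency from server s to client c; this is an
    illustrative heuristic, not SCAN's actual placement algorithm.
    """
    uncovered, candidates = set(clients), set(servers)
    placement = {}                                   # server -> clients it serves
    while uncovered and candidates:
        def coverage(s):
            return [c for c in uncovered if latency[s][c] <= latency_bound]
        best = max(candidates, key=lambda s: len(coverage(s)))
        served = coverage(best)[:capacity[best]]     # respect server capacity
        if not served:
            break                                    # remaining clients cannot meet QoS
        placement[best] = served
        uncovered -= set(served)
        candidates.remove(best)
    return placement
```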
9. Replica Management
- Existing work and problems
  - Cooperative access gives good efficiency but requires maintaining replica indices
  - Per-Website replication: scalable, but poor performance
  - Per-URL replication: good performance, but unscalable
- Solution
  - Clustering-based replication reduces the overhead significantly without sacrificing much performance
  - Proposed a unique online Web object popularity prediction scheme based on hyperlink structures
  - Online incremental clustering and replication to push replicas before they are accessed
- Publication
  - ICNP 2002, IEEE J-SAC 2003
10. Adaptation to Network Congestion/Failures
- Existing work and problems
  - Latency estimation
    - Clustering-based: relies on network proximity, inaccurate
    - Coordinate-based: symmetric distances, unscalable to update
  - General metrics: n^2 measurements for n end hosts
- Solution
  - Latency: Internet Iso-bar - clustering based on latency similarity to a small number of landmarks (see the sketch after this list)
  - Loss rate: Tomography-based Overlay Monitoring (TOM) - selectively monitor a basis set of O(n log n) paths to infer the loss rates of all other paths
- Publication
  - Internet Iso-bar: SIGMETRICS PER 2002
  - TOM: SIGCOMM IMC 2003
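A minimal sketch of the Internet Iso-bar idea, clustering hosts by the similarity of their latency vectors to a few landmarks. The use of k-means and the RTT numbers below are stand-ins, not the system's actual clustering method or data.

```python
import numpy as np
from sklearn.cluster import KMeans

def isobar_clusters(rtt_to_landmarks, n_clusters):
    """Group end hosts whose latency vectors to the landmarks are similar;
    one monitor per cluster can then represent its members."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(rtt_to_landmarks)
    return km.labels_, km.cluster_centers_

# Hypothetical 6 hosts x 3 landmarks RTT matrix (ms)
rtt = np.array([[12, 80, 150], [15, 78, 160], [90, 20, 140],
                [95, 25, 135], [160, 140, 30], [155, 150, 28]])
labels, centers = isobar_clusters(rtt, n_clusters=3)
```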
11. SCAN Architecture
- Leverages a Distributed Hash Table (Tapestry) for
  - Distributed, scalable location with guaranteed success
  - Search with locality
[Architecture figure: data plane (data source, Web server, SCAN servers) performing dynamic replication/update, replica management, and replica location over a network plane with overlay network monitoring]
12. Methodology
- Analytical evaluation
- Realistic simulation
  - Network topology
  - Web workload
  - Network end-to-end latency measurement
- PlanetLab tests
13. Case Study: Tomography-based Overlay Network Monitoring
14. TOM Outline
- Goal and Problem Formulation
- Algebraic Modeling and Basic Algorithms
- Scalability Analysis
- Practical Issues
- Evaluation
- Application: Adaptive Overlay Streaming Media
- Conclusions
15. Existing Work
Goal: a scalable, adaptive, and accurate overlay monitoring system to detect end-to-end congestion/failures
- General metrics: RON (n^2 measurements)
- Latency estimation
  - Clustering-based: IDMaps, Internet Iso-bar, etc.
  - Coordinate-based: GNP, ICS, Virtual Landmarks
- Network tomography
  - Focuses on inferring the characteristics of physical links rather than E2E paths
  - Limited measurements -> under-constrained system, unidentifiable links
16. Problem Formulation
- Given an overlay of n end hosts and O(n^2) paths, how can we select a minimal subset of paths to monitor so that the loss rates/latency of all other paths can be inferred?
- Assumptions
  - The topology is measurable
  - We can only measure E2E paths, not individual links
17. Our Approach
- Select a basis set of k paths that fully describe all O(n^2) paths (k << O(n^2))
- Monitor the loss rates of the k paths, and infer the loss rates of all other paths
- Applicable to any additive metric, such as latency
18. Algebraic Model
[Figure: four end hosts A, B, C, D connected by links 1, 2, 3]
- Path loss rate p, link loss rate l (model restated below)
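In the TOM formulation, link loss rates compose multiplicatively along a path, so a log transform makes the model linear; the display below restates that model with the slide's symbols:

```latex
1 - p_i \;=\; \prod_{j \in \text{path } i} (1 - l_j)
\quad\Longrightarrow\quad
b_i \;=\; \log\frac{1}{1 - p_i} \;=\; \sum_{j=1}^{s} G_{ij}\, x_j ,
\qquad x_j = \log\frac{1}{1 - l_j},
```

where G is the r x s binary path matrix (G_ij = 1 iff path i traverses link j), so that b = Gx.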
19. Putting All Paths Together
[Figure: the same topology with all overlay paths overlaid]
In total r = O(n^2) paths and s links, with s << r
20. Sample Path Matrix
- x1 - x2 is unknown => cannot compute x1, x2 individually
- The set of vectors v satisfying Gv = 0 forms the null space
- To separate identifiable vs. unidentifiable components: x = xG + xN
21. Intuition through Topology Virtualization
- Virtual links
  - Minimal path segments whose loss rates can be uniquely identified
  - Can fully describe all paths
  - xG is composed of virtual links
- All E2E paths lie in the path space, i.e., G xN = 0 (small numerical example below)
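A tiny numerical illustration of the xG / xN split; the toy matrix and numbers are hypothetical.

```python
import numpy as np

# Toy path matrix: 2 monitored paths over 3 links (illustrative only).
# Path 1 traverses links 1,2; path 2 traverses links 2,3.
G = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

x = np.array([0.02, 0.05, 0.01])   # hypothetical link "log loss" vector

# Split x into the identifiable part x_G (row space of G, i.e. the
# virtual links) and the unidentifiable part x_N (null space, G @ x_N = 0).
x_G = np.linalg.pinv(G) @ (G @ x)
x_N = x - x_G

assert np.allclose(G @ x_N, 0)       # x_N is invisible to any E2E path
assert np.allclose(G @ x, G @ x_G)   # path metrics depend only on x_G
```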
22. More Examples
[Figure: virtualization - real links (solid) with all overlay paths (dotted) traversing them, and the resulting virtual links]
23. Basic Algorithms
- Select k = rank(G) linearly independent paths to monitor (sketch after this list)
  - Use QR decomposition
  - Leverage sparse matrices: time O(rk^2) and memory O(k^2)
  - E.g., 79 sec for n = 300 (r = 44,850) and k = 2,541
- Compute the loss rates of the other paths
  - Time O(k^2) and memory O(k^2)
  - E.g., 1.89 sec for the example above
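A dense-matrix sketch of the two steps above using off-the-shelf QR with pivoting; the real system uses sparse, incremental decompositions, and the toy topology below (two links, three paths) is only for illustration.

```python
import numpy as np
from scipy.linalg import qr, lstsq

def select_basis_paths(G, tol=1e-10):
    """Pick k = rank(G) linearly independent rows (paths) of G to monitor."""
    Q, R, piv = qr(G.T, pivoting=True)
    diag = np.abs(np.diag(R))
    k = int(np.sum(diag > tol * (diag[0] if diag.size else 1.0)))
    return piv[:k]

def infer_all_paths(G, basis_idx, b_basis):
    """Given measured log loss b on the basis paths, infer b for every path."""
    x_hat, *_ = lstsq(G[basis_idx], b_basis)   # any solution works: x_N drops out
    return G @ x_hat

# Toy example: links 1,2; paths A->B, B->C, and A->C relayed through B.
G = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x_true = np.array([0.03, 0.05])                # hypothetical link log-loss values
basis = select_basis_paths(G)                  # k = 2 paths suffice here
b_hat = infer_all_paths(G, basis, (G @ x_true)[basis])
assert np.allclose(b_hat, G @ x_true)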
24. Scalability Analysis
- Is k << O(n^2)?
- For a power-law Internet topology
  - When the majority of end hosts are on the overlay: k = O(n) (with proof)
  - When a small portion of end hosts are on the overlay: for reasonably large n (e.g., n >= 100), k = O(n log n) (extensive linear regression tests on both synthetic and real topologies)
- Intuition
  - If the Internet were a pure hierarchical structure (tree): k = O(n)
  - If the Internet had no hierarchy at all (worst case, a clique): k = O(n^2)
  - The Internet has a moderate hierarchical structure [TGJ02]
25. TOM Outline
- Goal and Problem Formulation
- Algebraic Modeling and Basic Algorithms
- Scalability Analysis
- Practical Issues
- Evaluation
- Application: Adaptive Overlay Streaming Media
- Summary
26. Practical Issues
- Tolerance of topology measurement errors
  - Router aliases
  - Incomplete routing info
- Measurement load balancing
  - Randomly order the paths for scanning and selection
- Adaptation to topology changes
  - Designed efficient algorithms for incremental updates (sketch below)
  - Add/remove a path: O(k^2) time (vs. O(n^2 k^2) to reinitialize)
  - Add/remove end hosts and routing changes
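A simplified sketch of the incremental rank check behind path addition. The actual algorithm updates an R factor to get the O(k^2) bound; the least-squares projection here is just the easiest way to show the idea.

```python
import numpy as np

def add_path(basis_rows, v, tol=1e-10):
    """Decide whether new path-row v is independent of the monitored basis.

    Returns the (possibly extended) basis and whether v must be monitored.
    """
    v = np.asarray(v, float)
    if basis_rows.size == 0:
        return np.atleast_2d(v), True
    coef, *_ = np.linalg.lstsq(basis_rows.T, v, rcond=None)
    residual = v - basis_rows.T @ coef
    if np.linalg.norm(residual) > tol * max(np.linalg.norm(v), 1.0):
        return np.vstack([basis_rows, v]), True   # independent: monitor it
    return basis_rows, False                      # dependent: infer it instead
```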
27. Evaluation Metrics
- Path loss rate estimation accuracy (see the sketch after this list)
  - Absolute error |p̂ - p|
  - Error factor [BDPT02]
- Lossy path inference: coverage and false positive ratio
- Measurement load balancing
  - Coefficient of variation (CV)
  - Maximum-to-mean ratio (MMR)
- Speed of setup, update, and adaptation
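A small sketch of how these metrics can be computed; the eps floor in the error factor follows the spirit of [BDPT02], but the exact constant used here is an assumption.

```python
import numpy as np

def accuracy_metrics(p_hat, p, eps=0.001):
    """Absolute error |p_hat - p| and a [BDPT02]-style error factor."""
    abs_err = np.abs(p_hat - p)
    pe, pe_hat = np.maximum(p, eps), np.maximum(p_hat, eps)
    error_factor = np.maximum(pe_hat / pe, pe / pe_hat)
    return abs_err, error_factor

def load_balance_metrics(load_per_node):
    """Coefficient of variation (CV) and maximum-to-mean ratio (MMR)."""
    mean = np.mean(load_per_node)
    return np.std(load_per_node) / mean, np.max(load_per_node) / mean
```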
28. Evaluation
- Extensive simulations
- Experiments on PlanetLab
  - 51 hosts, each from a different organization
  - 51 x 50 = 2,550 paths
  - On average, k = 872
- Results on accuracy
  - Average real loss rate: 0.023
  - Absolute error: mean 0.0027, 90th percentile < 0.014
  - Error factor: mean 1.1, 90th percentile < 2.0
- On average, 248 out of 2,550 paths had no or incomplete routing information
- No router aliases were resolved
| Areas and domains | # of hosts |
| US (40) - .edu | 33 |
| US (40) - .org | 3 |
| US (40) - .net | 2 |
| US (40) - .gov | 1 |
| US (40) - .us | 1 |
| International (11) - Europe (6) - France | 1 |
| International (11) - Europe (6) - Sweden | 1 |
| International (11) - Europe (6) - Denmark | 1 |
| International (11) - Europe (6) - Germany | 1 |
| International (11) - Europe (6) - UK | 2 |
| International (11) - Asia (2) - Taiwan | 1 |
| International (11) - Asia (2) - Hong Kong | 1 |
| International (11) - Canada | 2 |
| International (11) - Australia | 1 |
29. Evaluation (cont'd)
- Results on speed
  - Path selection (setup): 0.75 sec
  - Path loss rate calculation: 0.16 sec for all 2,550 paths
- Results on load balancing
  - Significantly reduced CV and MMR, by up to a factor of 7.3
30. TOM Outline
- Goal and Problem Formulation
- Algebraic Modeling and Basic Algorithms
- Scalability Analysis
- Practical Issues
- Evaluation
- Application: Adaptive Overlay Streaming Media
- Conclusions
31. Motivation
- Traditional streaming media systems treat the network as a black box
  - Adaptation is performed only at the transmission end points
- Overlay relay can effectively bypass congestion/failures
- Built an adaptive streaming media system that leverages
  - TOM for real-time path info
  - An overlay network for adaptive packet buffering and relay
32. Adaptive Overlay Streaming Media
[Demo topology: overlay nodes at UC Berkeley, Stanford, UC San Diego, and HP Labs; the X marks a congested/failed direct path that is bypassed via relay]
- Implemented with a Winamp client and a SHOUTcast server
- Congestion introduced with a Packet Shaper
- Skip-free playback: server-side buffering and rewinding
- Total adaptation time < 4 seconds
33. Adaptive Streaming Media Architecture
34. Summary
- A tomography-based overlay network monitoring system
  - Selectively monitors a basis set of O(n log n) paths to infer the loss rates of O(n^2) paths
  - Works in real time, adapts to topology changes, has good load balancing, and tolerates topology errors
- Both simulation and real Internet experiments are promising
- Built an adaptive overlay streaming media system on top of TOM
  - Bypasses congestion/failures for smooth playback within seconds
35. Tie Back to SCAN
[SCAN architecture diagram, as on slide 6]
- Provisioning: dynamic replication, update multicast tree building
- Replica management: (incremental) content clustering
- Network: DoS-resilient replica location (Tapestry)
- Network end-to-end distance monitoring: Internet Iso-bar (latency), TOM (loss rate)
36. Contributions of My Thesis
- Replica location: proposed the first simulation-based network DoS resilience benchmark and quantified three types of directory services
- Dynamically place a close-to-optimal number of replicas
- Self-organize replicas into a scalable app-level multicast tree for disseminating updates
- Cluster objects to significantly reduce management overhead with little performance sacrifice
- Online incremental clustering and replication to adapt to changes in users' access patterns
- Scalable overlay network monitoring
37. Thank you!
38. Backup Materials
39. Existing CDNs Fail to Address These Challenges
- No coherence for dynamic content
- Unscalable network monitoring: O(M x N), where M = # of client groups and N = # of server farms
- Non-cooperative replication is inefficient
40. Network Topology and Web Workload
- Network topology
  - Pure-random, Waxman, and transit-stub synthetic topologies
  - An AS-level topology from 7 widely dispersed BGP peers
- Web workload

| Web site | Period | Duration | Requests (avg, min-max) | Clients (avg, min-max) | Client groups (avg, min-max) |
| MSNBC | Aug-Oct 1999 | 10-11am | 1.5M, 642K-1.7M | 129K, 69K-150K | 15.6K, 10K-17K |
| NASA | Jul-Aug 1995 | All day | 79K, 61K-101K | 5,940, 4,781-7,671 | 2,378, 1,784-3,011 |

- Aggregate MSNBC Web clients by BGP prefix
  - BGP tables from a BBNPlanet router
- Aggregate NASA Web clients by domain name
- Map the client groups onto the topology
41. Network E2E Latency Measurement
- NLANR Active Measurement Project data set
  - 111 sites in America, Asia, Australia, and Europe
  - Round-trip time (RTT) between every pair of hosts every minute
  - 17M measurements daily
  - Raw data: Jun.-Dec. 2001, Nov. 2002
- Keynote measurement data
  - Measures TCP performance from about 100 worldwide agents
  - Heterogeneous core network: various ISPs
  - Heterogeneous access networks
    - Dial-up 56K, DSL, and high-bandwidth business connections
  - Targets: 40 most popular Web servers and 27 Internet Data Centers
  - Raw data: Nov.-Dec. 2001, Mar.-May 2002
42. Internet Content Delivery Systems

| Properties | Web caching (client-initiated) | Web caching (server-initiated) | Conventional CDNs (e.g., Akamai) | SCAN |
| Replica access | Non-cooperative | Cooperative (Bloom filter) | Non-cooperative | Cooperative |
| Load balancing | No | No | Yes | Yes |
| Pull/push | Pull | Push | Pull | Push |
| Transparent to clients | No | No | Yes | Yes |
| Coherence support | No | No | No | Yes |
| Network-awareness | No | No | Yes, unscalable monitoring system | Yes, scalable monitoring system |
43. Absolute and Relative Errors
- For each experiment, take the 95th percentile of the absolute and relative errors over the estimates for all 2,550 paths
44. Lossy Path Inference Accuracy
- 90 out of 100 runs have coverage over 85% and a false positive ratio below 10%
- Many errors are caused by boundary effects around the 5% loss-rate threshold
45. PlanetLab Experiment Results
- Loss rate distribution (table below)
- Metrics
  - Absolute error |p̂ - p|: average 0.0027 for all paths, 0.0058 for lossy paths
  - Relative error [BDPT02]
  - Lossy path inference: coverage and false positive ratio
- On average, k = 872 out of 2,550 paths

| Loss rate | [0, 0.05) | [0.05, 0.1) | [0.1, 0.3) | [0.3, 0.5) | [0.5, 1.0) | 1.0 |
| % of paths | 95.9 | 15.2 | 31.0 | 23.9 | 4.3 | 25.6 |

Lossy paths ([0.05, 1.0]) make up 4.1% of all paths; the last five columns give the breakdown within the lossy paths.
46. Experiments on PlanetLab
[Host distribution: same table as on slide 28]
- 51 hosts, each from a different organization
- 51 x 50 = 2,550 paths
- Simultaneous loss rate measurement
  - 300 trials, 300 msec each
  - In each trial, send a 40-byte UDP packet to every other host
- Simultaneous topology measurement
  - Traceroute
- Experiments: 6/24-6/27
- 100 experiments in peak hours
47. Motivation
- With single-node relay
  - Loss rate improvement
    - Among 10,980 lossy paths:
    - 5,705 paths (52.0%) have their loss rate reduced by 0.05 or more
    - 3,084 paths (28.1%) change from lossy to non-lossy
  - Throughput improvement
    - Estimated with a TCP throughput model (see the note below)
    - 60,320 paths (24%) have a non-zero loss rate, for which throughput is computable
    - Among them, 32,939 paths (54.6%) have improved throughput, and 13,734 paths (22.8%) have their throughput doubled or more
- Implication: use overlay paths to bypass congestion or failures
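The exact formula used on the original slide is not shown above; as an assumption, a common loss-based TCP throughput approximation (Mathis et al.) that such an estimate could be based on is:

```latex
\text{throughput} \;\approx\; \frac{\mathrm{MSS}}{\mathrm{RTT}} \cdot \frac{\sqrt{3/2}}{\sqrt{p}},
```

where p is the path loss rate; lowering p via a one-hop overlay relay then directly raises the estimated throughput.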
48. SCAN
- Coherence for dynamic content
- Cooperative clustering-based replication (servers s1, s4, s5 in the figure)
- Scalable network monitoring: O(M + N)
49. Problem Formulation
- Subject to a certain total replication cost (e.g., # of URL replicas)
- Find a scalable, adaptive replication strategy that reduces the average access cost (see the formulation sketch below)
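One way to write the optimization implied by this slide; the notation below is mine, not taken from the dissertation.

```latex
\min_{\{R_o\}} \; \sum_{o} \sum_{c} w_{c,o}\, d\!\left(c, R_o\right)
\qquad \text{s.t.} \qquad \sum_{o} |R_o| \;\le\; C,
```

where R_o is the replica set for object o, w_{c,o} is client c's request rate for o, d(c, R_o) is the access cost from c to its closest replica of o, and C is the total replication budget (e.g., the allowed # of URL replicas).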
50. SCAN: Scalable Content Access Network
[Architecture diagram (red = my work, black = out of scope):]
- CDN applications (e.g., streaming media)
- Provisioning: cooperative clustering-based replication
- Coherence: update multicast tree construction
- Network distance/congestion/failure estimation
- User behavior/workload monitoring
- Network performance monitoring
51. Evaluation of an Internet-scale System
- Analytical evaluation
- Realistic simulation
  - Network topology
  - Web workload
  - Network end-to-end latency measurement
- Network topology
  - Pure-random, Waxman, and transit-stub synthetic topologies
  - A real AS-level topology from 7 widely dispersed BGP peers
52. Web Workload

| Web site | Period | Duration | Requests (avg, min-max) | Clients (avg, min-max) | Client groups (avg, min-max) |
| MSNBC | Aug-Oct 1999 | 10-11am | 1.5M, 642K-1.7M | 129K, 69K-150K | 15.6K, 10K-17K |
| NASA | Jul-Aug 1995 | All day | 79K, 61K-101K | 5,940, 4,781-7,671 | 2,378, 1,784-3,011 |
| World Cup | May-Jul 1998 | All day | 29M, 1M-73M | 103K, 13K-218K | N/A |

- Aggregate MSNBC Web clients by BGP prefix
  - BGP tables from a BBNPlanet router
- Aggregate NASA Web clients by domain name
- Map the client groups onto the topology
53. Simulation Methodology
- Network topology
  - Pure-random, Waxman, and transit-stub synthetic topologies
  - An AS-level topology from 7 widely dispersed BGP peers
- Web workload: MSNBC and NASA traces (see the table on slide 52)
  - Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
  - Aggregate NASA Web clients by domain name
  - Map the client groups onto the topology
54. Online Incremental Clustering
- Predict access patterns based on semantics
- Simplify to popularity prediction
- Which groups of URLs have similar popularity? Use hyperlink structures!
  - Groups of siblings
  - Groups at the same hyperlink depth, i.e., the smallest # of links from the root (see the sketch after this list)
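A minimal sketch of the hyperlink-depth heuristic: a BFS over a hypothetical link graph, then grouping URLs by depth as a cheap proxy for similar popularity. Grouping by depth is only one of the groupings the slide mentions.

```python
from collections import deque, defaultdict

def hyperlink_depths(links, root):
    """Smallest # of hyperlinks from the root page to each URL (BFS).

    `links` maps a URL to the URLs it links to (hypothetical input format).
    """
    depth, queue = {root: 0}, deque([root])
    while queue:
        u = queue.popleft()
        for v in links.get(u, ()):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return depth

def group_by_depth(links, root):
    """Group URLs with equal hyperlink depth as candidates for similar popularity."""
    groups = defaultdict(list)
    for url, d in hyperlink_depths(links, root).items():
        groups[d].append(url)
    return dict(groups)
```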
55. Challenges for CDNs
- Over-provisioning for replication
  - Provide good QoS to clients (e.g., latency bound, coherence)
  - Small # of replicas with small delay and bandwidth consumption for updates
- Replica management
  - Scalability: billions of replicas if replicating per URL
    - O(10^4) URLs/server, O(10^5) CDN edge servers in O(10^3) networks
  - Adaptation to the dynamics of content providers and customers
- Monitoring
  - User workload monitoring
  - End-to-end network distance/congestion/failure monitoring
    - Measurement scalability
    - Inference accuracy and stability
56. SCAN Architecture
- Leverages Decentralized Object Location and Routing (DOLR) - Tapestry - for
  - Distributed, scalable location with guaranteed success
  - Search with locality
- Soft-state maintenance of the dissemination tree (for each object)
[Architecture figure: data plane (data source, Web server, SCAN servers) handling dynamic replication/update, content management, and request location over the network plane]
57. Wide-area Network Measurement and Monitoring System (WNMMS)
- Select a subset of SCAN servers to be monitors
- E2E estimation of
  - Distance
  - Congestion
  - Failures
[Figure: network plane partitioned into clusters A, B, and C, showing monitors, SCAN edge servers, and clients]
58. Dynamic Provisioning
- Dynamic replica placement
  - Meets client latency and server capacity constraints
  - Close-to-minimal # of replicas
- Self-organizes replicas into an application-level multicast tree
  - Small delay and bandwidth consumption for update multicast
  - Each node only maintains state for its parent and direct children
- Evaluated with simulations of
  - Synthetic traces with various sensitivity analyses
  - Real traces from NASA and MSNBC
- Publication
  - IPTPS 2002
  - Pervasive Computing 2002
59. Effects of the Non-Uniform Size of URLs
[Figure: four plots (panels 1-4)]
- Replication cost constraint: bytes
- Similar trends exist
  - Per-URL replication outperforms per-Website replication dramatically
  - Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective
60. SCAN: Scalable Content Access Network
61. Web Proxy Caching
[Figure: Web proxy caching with clients in ISP 1 and ISP 2]
62. Conventional CDN: Non-cooperative Pull
[Figure: client 1 pulls content from the Web content server via edge servers in ISP 1 and ISP 2]
- Inefficient replication
63. SCAN: Cooperative Push
[Figure: client 1, CDN name server, and edge servers in ISP 1 and ISP 2]
- Significantly reduces the # of replicas and the update cost
64. Internet Content Delivery Systems

| Properties | Web caching (client-initiated) | Web caching (server-initiated) | Pull-based CDNs (e.g., Akamai) | Push-based CDNs | SCAN |
| Efficiency (# of caches or replicas) | No cache sharing among proxies | Cache sharing | No replica sharing among edge servers | Replica sharing | Replica sharing |
| Scalability of request redirection | Pre-configured in the browser | Bloom filters to exchange replica locations | Centralized CDN name server | Centralized CDN name server | Decentralized P2P location |
| Coherence support | No | No | Yes | No | Yes |
| Network-awareness | No | No | Yes, unscalable monitoring system | No | Yes, scalable monitoring system |
65. Previous Work: Update Dissemination
- No inter-domain IP multicast
- Application-level multicast (ALM) is unscalable
  - The root maintains state for all children (Narada, Overcast, ALMI, RMX)
  - The root handles all join requests (Bayeux)
  - Root splitting is a common solution, but it suffers consistency overhead
66. Comparison of Content Delivery Systems (cont'd)

| Properties | Web caching (client-initiated) | Web caching (server-initiated) | Pull-based CDNs (e.g., Akamai) | Push-based CDNs | SCAN |
| Distributed load balancing | No | Yes | Yes | No | Yes |
| Dynamic replica placement | Yes | Yes | Yes | No | Yes |
| Network-awareness | No | No | Yes, unscalable monitoring system | No | Yes, scalable monitoring system |
| No global network topology assumption | Yes | Yes | Yes | No | Yes |