Title: Berkeley RAD Lab Technical Vision
1. Berkeley RAD Lab Technical Vision
- Armando Fox, Randy Katz, Michael Jordan, Dave Patterson, Scott Shenker, Ion Stoica
- RADS Retreat, June 2005
2. Outline
- Overall Vision
- Internet Services Vision (ServRADS)
- Network Vision (NetRADS)
- Internet Services Network architecture
- Principles and Summary
3. Overarching Mantra
- Enable a faster pace of network service innovation through new distributed system architectures that reduce operations cost by 2-3 orders of magnitude
- The Challenge
  - Software systems: too much information => make sense of it through statistical learning and control theory
  - Network systems: too little information => exploit better observation and monitoring in the network infrastructure to drive management processes
4. In practice this means
- Single person can write, deploy, operate the next-generation IT business (the "Fortune 1 million")
- Do for Internet apps what the Web did for individual publishing
- Gray's challenge: planetary-scale distributed system operated by a single part-time operator
- Goal: programmers focus on functionality; put the "ility" in the platform
- Could be built on utility computing, giving access to distributed physical resources
- Integrated approach to network and server/service management
- Requires 100x-1000x reduction in TCO from today's levels
5. What things are like today
- World-scale services created and operated by expert teams
- Google-sized organization to create a Google
- Amazon's book browsing, designed by programmers, is cumbersome
- Browsing for housewares, designed by domain experts on mature infrastructure, is more usable
- We don't know what the next killer app will be!
- NOW project didn't predict Internet search as a killer app for NOWs
- If we succeed, the next killer Internet app will be written, deployed, and operated, at Google-like scales, by a single programmer
6. Focusing on lowering cost of ownership
- Standard way to account for "where the money goes" in operating a deployed distributed application
- Definition independent of who is operating the app
- Operators per byte of storage or per CPU? No, doesn't scale with technology changes
- Operators per end-user served? (This is the figure of merit for e-tailers)
- Operators per geographic region served?
- Operators per $ spent on capital cost?
- Operators per $ of revenue?
8. Enabling Technologies for Reducing TCO in ServRADS
- Past successes
  - Microrebooting: fast recovery makes false positives tolerable
  - Pinpoint: using SLT to detect and localize fine-grain failures
  - Visualization + SLT: helps build operators' trust
- Elements of technical vision
  - SLT and machine learning
  - Operator-centric visualization
  - Control theory
  - Open source failures database (sanitized, open failures forensics repository)
9. Example scenarios
- Helping operators make sense of instrumentation
  - Using ML techniques to localize failures (P. Bodik, E. Kiciman)
  - Using automatically-induced statistical models to identify likely causes of performance problems (S. Zhang, I. Cohen et al.)
  - Combining SLT with visualization for cross-checking problem reports and rapidly spotting potential problems visually
  - Automating problem identification based on stored signatures (S. Zhang, M. Goldszmidt, I. Cohen et al.)
- Facilitating self-tuning/configuration
  - Using control theory to improve performance of a distributed streaming database (W. Xu)
  - Service placement in wide-area distributed system (D. Oppenheimer)
- Microreboots (G. Candea) and microreplacement (S. Kawamoto) as low-cost prevention/repair strategies
- If false positive cost can be kept low, automate. Otherwise, help operator do her job.
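The "automate if false positives are cheap" rule on this slide can be illustrated with a small expected-cost comparison. This is only a sketch under stated assumptions: the cost model, the `Alarm` type, and the function and parameter names are hypothetical and do not come from the RADS projects.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    component: str
    p_failure: float  # detector's estimated probability that the failure is real


def choose_response(alarm: Alarm, recovery_cost_s: float, operator_delay_s: float) -> str:
    """Return "automate" or "escalate" (hypothetical cost model).

    Automating always pays the recovery cost (for example, one microreboot),
    even when the alarm is a false positive.
    Escalating pays the operator's response delay only when the failure is real.
    """
    expected_cost_automate = recovery_cost_s
    expected_cost_escalate = alarm.p_failure * operator_delay_s
    if expected_cost_automate <= expected_cost_escalate:
        return "automate"
    return "escalate"


# A cheap microreboot is worth trying even on a weak alarm ...
print(choose_response(Alarm("cart-bean", 0.01), recovery_cost_s=0.5, operator_delay_s=600))
# ... while an expensive full restart on the same kind of evidence is left to the operator.
print(choose_response(Alarm("database", 0.05), recovery_cost_s=60, operator_delay_s=600))
```

Under this model, lowering the recovery cost is what moves a response from "escalate" to "automate", which is the role the slide assigns to microreboots.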
10. Services example: combining viz + SLT
11. Reduce TCO via Planetary-scale Abstractions
- Inspiration: narrowly-focused planetary-scale abstractions whose design and implementation...
  - scale well: understand distributed scheduling, locality, symptoms of wide-area failures
  - are monitorable and controllable (using SLT + linear CT)
  - retain precisely-quantifiable and acceptable semantics under partial-failure conditions
- Examples of existing narrow but powerful services
  - MapReduce in Google: understands data locality
  - Can easily imagine a "lossy MapReduce", like online aggregation
  - Queues/messaging in Yahoo, Amazon, others
  - User information database in Yahoo
  - Instrumentation collection and analysis services using Telegraph-CQ
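One way to picture a "lossy" aggregation that keeps quantifiable semantics under partial failure is sketched below. It is not MapReduce or online aggregation as implemented anywhere; the function name, the use of `None` to mark a lost partition, and the scaling rule (which assumes partitions are of similar size) are all illustrative assumptions.

```python
def lossy_sum(partitions, map_fn):
    """Sum map_fn(record) over all reachable partitions.

    A partition that is unavailable because of a failure is passed as None.
    The result reports how much of the input was covered, so the caller can
    judge whether the degraded answer is acceptable.
    """
    if not partitions:
        raise ValueError("need at least one partition")
    partial_sum = 0.0
    reachable = 0
    for part in partitions:
        if part is None:  # partition lost to a failure: skip it, do not abort
            continue
        reachable += 1
        for record in part:
            partial_sum += map_fn(record)
    coverage = reachable / len(partitions)
    # Scale up the partial answer, assuming similar-sized partitions.
    estimate = partial_sum / coverage if coverage > 0 else None
    return {"partial_sum": partial_sum, "coverage": coverage, "scaled_estimate": estimate}


# Four partitions, one of them unreachable.
result = lossy_sum([[1, 2], [3, 4], None, [5, 6]], map_fn=float)
print(result)
```

The point of returning `coverage` alongside the estimate is that the degradation is stated explicitly rather than hidden.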
13. RADS Network Problem
- Internet routing has proven to be robust
- But
  - Poor visibility: hard to determine health of the network
  - Routing policy interactions defeat propagation of useful diagnostic info: difficult to identify root-cause problems
  - Slow reaction times to connectivity failures: operator intervention (across admin domains) increases cost of ownership
- Key observation: network service failures attributed to unexpected traffic patterns
- Approach: identify and protect "good" traffic
- Mechanism: deployed in network edge
  - It's where the servers and clients are located
  - Greatest need for lowering management costs
  - Administrative scope and responsibility is well-defined
14. iBoxes: New network element for Observe, Analyze, Act
[Figure: Enterprise Network Architecture]
- Inspection-and-Action Boxes: deep multiprotocol packet inspection
- No routing; observation and marking
- Policing points: drop, fence, block
15. Network-Level Observe-Analyze-Act
- Observe
  - Packet, path, protocol, service invocation statistical collection and sampling: frequencies, latencies, completion rates
  - Construct the collection infrastructure
- Analyze
  - Determine correlations among observations
  - Normal model discovery + anomaly detection
  - Exploit SLT
- Act
  - Experiment to test correlations
  - Prioritize and throttle
  - Mark and annotate
- Control theory? Distributed analyses and actions
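The observe-analyze-act cycle can be sketched as a toy loop: learn a "normal" model from baseline observations, flag observations that deviate from it, and choose an action. The z-score test below is only a stand-in for the SLT techniques the slide has in mind, and all names, thresholds, and sample values are assumptions for illustration.

```python
import statistics


def fit_normal_model(baseline):
    """Normal model discovery: summarize baseline observations as (mean, stdev)."""
    return statistics.mean(baseline), statistics.stdev(baseline)


def is_anomalous(model, observation, threshold=3.0):
    """Anomaly detection: flag observations more than `threshold` stdevs from the mean."""
    mean, stdev = model
    if stdev == 0:
        return observation != mean
    return abs(observation - mean) / stdev > threshold


def act(flow, anomalous):
    """Act: throttle anomalous flows, forward the rest."""
    return ("throttle", flow) if anomalous else ("forward", flow)


def observe_analyze_act(baseline, samples, threshold=3.0):
    """Run one pass over (flow_name, request_rate) samples."""
    model = fit_normal_model(baseline)
    return [act(flow, is_anomalous(model, rate, threshold)) for flow, rate in samples]


# Hypothetical request rates (requests/s): baseline, then one normal and one surging flow.
decisions = observe_analyze_act([100, 102, 98, 101, 99], [("web", 103), ("dns", 500)])
print(decisions)
```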
16. Network Layer Mechanism: Annotations
- Enhance network visibility: disseminate observations, communicate actions, provide in-band network management actions, iBox-to-iBox communications
- iBoxes label packets at annotation layer but do not rewrite packet contents
- Annotations stack, must be removed from packets before delivery to A-layer-unaware end nodes
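The annotation behavior described above (labels are stacked alongside the packet, the payload is never rewritten, and labels are stripped before delivery to an unaware end node) can be sketched as follows. The `Packet` type, the function names, and the label format are hypothetical; the slides do not specify a wire format.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    annotations: list = field(default_factory=list)  # stack of (ibox_id, note)


def annotate(packet: Packet, ibox_id: str, note: str) -> Packet:
    """An iBox pushes a label onto the annotation stack; the payload is untouched."""
    packet.annotations.append((ibox_id, note))
    return packet


def deliver(packet: Packet, annotation_aware: bool) -> Packet:
    """Strip all annotations before handing the packet to an A-layer-unaware node."""
    if annotation_aware:
        return packet
    return Packet(payload=packet.payload, annotations=[])


pkt = Packet(payload=b"DNS query")
annotate(pkt, "I-S", "high DNS rate observed")
annotate(pkt, "I-I", "low priority")
to_end_node = deliver(pkt, annotation_aware=False)
to_next_ibox = deliver(pkt, annotation_aware=True)
print(to_end_node, to_next_ibox)
```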
17. Scenario: Traffic Surge Inhibiting Network Services
[Figure: enterprise network with Internet Edge, Distribution Tier, Access Edge, and Server Edge; elements labeled R, S, E, II, IA, IS; servers shown: Primary and Secondary DNS Servers, Mail Server, Spam Appliance]
- DNS Server swamped by excessive request traffic
- Observe: DNS timeouts, Web access traffic slowed, but also higher than normal mail delivery latency, implying busy server edge (correlation between Mail Server and DNS Server utilization?)
- Root cause: high DNS request rates generated by Spam Appliance, triggered by mail surge
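The correlation question raised in this scenario can be checked with a plain Pearson coefficient over paired utilization samples. The sketch below uses made-up utilization figures and a hypothetical function name; it illustrates the kind of test an analysis step might run, not the method used in the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical utilization samples (percent) taken at the same instants.
mail_server_util = [10, 20, 30, 40, 50]
dns_server_util = [15, 25, 35, 45, 55]
r = pearson(mail_server_util, dns_server_util)
print(f"Mail/DNS utilization correlation: {r:.2f}")
```

A coefficient near 1 only suggests a link; the deck's "experiment" action is what tests whether the relationship is causal.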
18. Scenario
[Figure: same enterprise network diagram as on the previous slide]
- How Diagnosed?
  - I-S detects high link utilization but abnormally high DNS traffic
  - Stats from I-I: high mail traffic, low outgoing web traffic, incoming traffic high but link utilization not high
  - Stats from I-A: lower web traffic, no unusual mail origination
  - Problem localized to Server Edge, but visibility limited: RADS can help
19. Scenario
[Figure: same enterprise network diagram as on the previous slide]
- Possible Action Responses
  - Experiment: redirect local DNS requests to Secondary DNS server; if these complete, can infer the server is the problem, not the network
  - Throttle: due to MS-DNS correlation, block/slow email traffic at Server Edge; should expect reduced DNS server utilization
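Both responses can be sketched in a few lines: an inference rule for the redirect experiment, and a token bucket as one common way to slow mail traffic. The slides do not say how throttling is implemented, so the token bucket, the class and function names, and the rates are assumptions chosen for illustration.

```python
def interpret_dns_experiment(primary_completes: bool, secondary_completes: bool) -> str:
    """Infer the likely fault from redirecting local DNS requests to the secondary."""
    if primary_completes:
        return "no fault observed"
    if secondary_completes:
        return "primary DNS server is the problem, not the network"
    return "network path or both servers may be at fault"


class TokenBucket:
    """Allow at most `rate_per_s` messages per second, with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: int):
        self.rate = rate_per_s
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a message arriving at time `now` (seconds) may pass."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


print(interpret_dns_experiment(primary_completes=False, secondary_completes=True))

# Throttle mail at the Server Edge to 1 message/s with a burst of 2 (made-up numbers).
mail_throttle = TokenBucket(rate_per_s=1.0, burst=2)
print([mail_throttle.allow(t) for t in (0.0, 0.0, 0.0, 1.0)])
```

Passing the clock in as `now` keeps the sketch deterministic; a deployed policer would read a monotonic clock instead.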
21. Embodying principles in a prototype
- Platform architecture and prototype to enable rapid innovation in network services by non-experts
  - automatically accommodates scaling, provisioning, failure management
  - multi-datacenter (geoplexed)
  - observable networks connecting datacenters
  - potentially planetary scale
  - runs with minimal operator oversight
- Prototype keeps various research projects focused on common goal and allows ongoing testing
- Participation in standards processes to promote best practices in platform as open standards
22. Reliable Adaptive Distributed Systems
[Figure labels: Operator; User; Prototype Applications; Programming Abstractions for Roll-back and wide-area distributed computations; SLT Services; Crash-only services; Observation Infrastructure for System SLT; Application-Specific Overlay Network; Checkable Protocols; Fast Detection and Route Recovery; Observation Infrastructure for network SLT; two iBoxes, each at an Edge Network; Commodity Internet]
23. Generic iBox Architecture
[Figure labels: Tag Mem; Rules; Programs]
24. Possible architecture of a rack
[Figure labels: app. server, application (e.g. J2EE); microrecovery actions; high-level effectors; control loops; high-level sensor data; SLT algorithms; T-CQ engine; preprocessed data; sanitized data; visualization; syndrome identification; externally-induced failures, workload changes, etc.; datacenter boundary, with links from and to other datacenters]
26. ServRADS Observations Summary
- SLT algorithms make sense of large amounts of data
  - Classification, outlier/anomaly detection, clustering, etc.
- Viz helps operator use visual pattern recognition to quickly spot problems and cross-check SLT models
  - Enables operator expertise to be quickly brought to bear
  - Builds operators' trust in statistical/machine learning models
- Challenge
  - Fundamental challenges associated with applying SLT to problem determination (coming up next session)
  - Unifying many techniques into a coherent approach: prototype platform as unifying artifact
- Idea: capture best practices in TCO-optimized, planetary-scale abstractions
27. NetRADS Observations Summary
- COPS: paradigm for (more) automatically protecting critical resources when network is under stress
  - Checkable protocols: visible semantics
  - Observe network behavior: good (easy), bad (hard), suspicious
  - Protect services: throttle, redirect
- Network management: major contributor to TCO
- NetRADS built on
  - iBoxes: pervasive infrastructure for observation and action at the network level
  - Annotation Layer for marking, control, inter-iBox communications
- Integration with Internet service approach for service/server-level visibility and integrated management