Title: ACME: a platform for benchmarking distributed applications
1. ACME: a platform for benchmarking distributed applications
- David Oppenheimer, Vitaliy Vatkovskiy, and David Patterson
- ROC Retreat
- 12 Jan 2003
2. Motivation
- Benchmarking large-scale distributed apps (peer-to-peer, Grid, CDNs, ...) is difficult
  - very large (1000s-10,000s of nodes)
    - need scalable measurement and control
  - nodes and network links will fail
    - need robust measurement and control
  - large variety of possible applications
    - need standard interfaces for measurement and control
- ACME: a platform that developers can use to benchmark their distributed applications
3. ACME benchmark lifecycle
- User describes benchmark scenario
  - node requirements, workload, faultload, metrics
- System finds the appropriate nodes and starts up the benchmarked application on those nodes
- System then executes the scenario (sketched below)
  - collects measurements
  - injects workload and faults
- Note: the same infrastructure works for self-management (just replace "fault" with "control action" and "benchmark scenario" with "self-management rules or recovery actions")
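A minimal Python sketch of this lifecycle, using the kill-until-latency-doubles scenario from slide 11. All names here (find_nodes, read_latency, kill_nodes) are illustrative stubs, not ACME's actual API:

    # Hypothetical sketch of the ACME benchmark lifecycle; every function is
    # an illustrative stub, not ACME's actual API.
    import random
    import time

    def find_nodes(n):                          # resource discovery (stub)
        return [f"node{i}" for i in range(n)]

    def read_latency(nodes):                    # sensor (stub): avg latency in ms,
        base = 100 * (100 / max(len(nodes), 1)) # rising as nodes are killed
        return base * random.uniform(0.9, 1.1)

    def kill_nodes(nodes, k):                   # actuator (stub): fault injection
        del nodes[:k]

    nodes = find_nodes(100)                     # 1. find nodes, deploy the app
    baseline = read_latency(nodes)              # 2. collect a baseline measurement
    while nodes and read_latency(nodes) < 2 * baseline:
        kill_nodes(nodes, 10)                   # 3. execute scenario: inject faults
        time.sleep(0.1)                         #    (3 minutes apart in the real run)
    print(f"latency doubled with {len(nodes)} nodes left")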
4. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
5. Sensors and actuators
- Source/sink for monitoring/control (a minimal sensor sketch follows this slide)
- Application-external, node-level
  - sensors
    - load, memory usage, network traffic, ...
  - actuators
    - start/kill processes
    - reboot physical nodes
    - modify emulated network topology
- Application-embedded, application-level
  - initial application type: peer-to-peer overlay networks
  - sensors
    - number of application-level msgs sent/received
  - actuators
    - application-specific fault injection
    - change parameters of workload generation
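The architecture figures later in the deck show sensors being queried by HTTP URL and replying with CSV data. Under that assumption, a minimal node-level load sensor might look like the sketch below; the /sensor/load path and CSV field names are invented for illustration, and the port matches the node="ALL:3333" setting in the slide 11 example:

    # Minimal sketch of a node-level "load" sensor served over HTTP as CSV,
    # matching the HTTP URL / HTTP CSV data interface in the architecture
    # figures. URL path and CSV fields are illustrative, not ACME's schema.
    import os, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SensorHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/sensor/load":
                load1, _, _ = os.getloadavg()   # 1-minute load avg (Unix only)
                body = f"timestamp,load\n{time.time():.0f},{load1:.2f}\n"
                self.send_response(200)
                self.send_header("Content-Type", "text/csv")
                self.end_headers()
                self.wfile.write(body.encode())
            else:
                self.send_error(404)

    HTTPServer(("", 3333), SensorHandler).serve_forever()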
6. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
7. Query processor architecture
[Figure: ISING query processor architecture. A query arrives at the root ISING node as an HTTP URL and is propagated down a SenTree of nodes via SenTreeDown messages; each node queries its local sensors over HTTP and receives CSV data, children's values flow back up via SenTreeUp, and the root returns the aggregated response.]
8. Query processor (cont.)
- Scalability
  - efficiently collect monitoring data from thousands of nodes
  - in-network data aggregation and reduction
- Robustness
  - handle failures in the monitoring system and the monitored application
  - query processor based on a self-healing peer-to-peer net
  - partial aggregates on failure (sketched below)
- Extensibility
  - easy way to incorporate new monitoring data sources as the system evolves
  - sensor interface
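To make "partial aggregates on failure" concrete, here is a toy sketch of the idea (not ISING's actual code): each node in the aggregation tree folds in whichever children respond, so a failed subtree shrinks the aggregate instead of killing the query:

    # Sketch of in-network aggregation with partial results on failure (the
    # idea behind ISING/SenTree, not its implementation). Each node averages
    # its own reading with whatever children respond.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        latency: float                      # this node's sensor reading (ms)
        children: list = field(default_factory=list)
        alive: bool = True

    def aggregate_avg(node):
        """Return (sum, count) up the tree; dead subtrees just drop out."""
        if not node.alive:
            return (0.0, 0)                 # failed node: contributes nothing
        total, count = node.latency, 1
        for child in node.children:
            s, c = aggregate_avg(child)     # real system: remote call + timeout
            total, count = total + s, count + c
        return (total, count)

    leaves = [Node(100 + i) for i in range(4)]
    leaves[2].alive = False                 # simulate one failed node
    root = Node(100, children=leaves)
    s, c = aggregate_avg(root)
    print(f"partial AVG over {c} of 5 nodes: {s / c:.1f} ms")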
9. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
10. Describing a benchmark scenario
- Key is usability: want an easy way to define when and what actions to trigger
  - "kill half of the nodes after ten minutes"
  - "kill nodes until response latency doubles"
- Declarative XML-based rule system
  - conditions over sensors -> invoke actuators
11. Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles

<action ID="1" name="startNode" timerName="T">
  <params numToStart="100"/>
  <conditions>
    <condition type="timer" value="0"/>
  </conditions>
</action>
<action ID="2" name="stopSensor" timerName="T">
  <params sensorName="oldVal"/>
  <conditions>
    <condition type="timer" value="600000"/>
  </conditions>
</action>
<action ID="3" name="killNode" timerName="T">
  <params killNumber="10"/>
  <repeat period="180000"/>
  <conditions>
    <condition type="timer" value="600000"/>
    <condition type="sensor" ID="oldVal" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1" isSecondary="true"/>
    <condition type="sensor" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1"
        operator="lt" ID="oldVal" scalingFactor="2"/>
  </conditions>
</action>
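Reading these rules: timer values are in milliseconds (600000 = 10 minutes, 180000 = 3 minutes). The first sensor condition, marked isSecondary="true" with ID="oldVal", appears to capture the baseline latency; the second compares the current AVG latency against that baseline using operator="lt" and scalingFactor="2", so killNode keeps firing every 3 minutes while latency remains below double the baseline.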
12. ACME architecture
[Figure: ACME architecture. An XML experiment spec. / sys. mgmt. policy drives a controller. The controller issues queries to the ISING/SenTree query processor (HTTP URLs travel down the tree via SenTreeDown, sensors return HTTP CSV data, children's values are aggregated back up via SenTreeUp into an aggregated response) and invokes actuators over the same HTTP URL / CSV interface.]
13. ACME recap
- Taken together, the parts of ACME provide
  - application deployment and process management
  - data collection infrastructure
  - workload generation
  - fault injection
  - ...all driven by a user-specified policy
- Future work (with Stanford)
  - scaling down: integrate cluster applications
    - sensors/actuators for J2EE middleware
    - targeted towards statistical monitoring
  - use rule system to invoke recovery routines
  - benchmark diagnosis techniques, not just apps
  - new, user-friendly policy language
    - including expressing statistical algorithms
14. Benchmarking diagnosis techniques
[Figure: Architecture for benchmarking diagnosis techniques. An XML experiment spec. drives the controller, which performs fault injection and issues queries to ISING or another query processor for monitoring metrics. Monitoring data, events, and queries flow through a pub/sub layer (with history) to rule-based and statistical diagnosis modules, which publish diagnosis events and subscription requests back.]
15. Revamping the language
- Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles

when (timer_T > 0)
    startNode(number=100)
when ((timer_T > 600000) AND sensorCond_CompLatency)
    killNode(number=10) repeat(period=180000)
when (timer_T > 610000)
    stopSensor(name=oldVal)

define sensorCond CompLatency hist1 < 2 * hist2
define history hist1 sensor=lat, size=1
define history hist2 sensor=oldVal, size=1
define sensor lat
    name="latency"
    hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2"
    node="ALL:3333"
    period="10000" sensorAgg="AVG"
define sensor oldVal = lat
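Compared with the XML on slide 11, the when/define split separates triggering logic from sensor plumbing: CompLatency reads as a comparison of a one-element history of the live latency sensor (hist1) against the frozen baseline copy (hist2 over oldVal), so killing continues while latency stays under double the baseline.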
16. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
17. Resource discovery and mapping
- When benchmarking: map desired emulated topology to available topology
  - example: "find me 100 P4-Linux nodes with inter-node bandwidth, latency, and loss rates characteristic of the Internet as a whole, and that are lightly loaded"
- When deploying a service: find the set of nodes on which to execute to achieve desired performance, cost, and availability
  - example: "find me the cheapest 50 nodes that will give me at least 3 9s of availability, that are geographically well-dispersed, and that have at least 100 Kb/sec of bandwidth between them"
18. Current RDM architecture
- Each node that is offering resources periodically reports to a central server
  - single-node statistics
  - inter-node statistics expressed as an N-element vector
  - central server builds an NxN inference matrix
  - currently statistic values are generated randomly
- When desired, a node issues a resource discovery request to the central server
  - MxM constraint matrix, e.g.
    - load=[0,2]  latency=[10ms,20ms],[200ms,300ms]
    - load=[0,2]  latency=[10ms,20ms],[200ms,300ms]
    - load=[0,2]  latency=[200ms,300ms],[200ms,300ms]
- Central server finds the M best nodes and returns them to the querying node (matching step sketched below)
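The slides don't spell out how the server matches the MxM constraint matrix against the NxN inference matrix, but a brute-force version of the matching step might look like this sketch (exponential in M, so purely illustrative; the real problem is NP-hard, as the next slide notes). All names here are invented:

    # Illustrative sketch of the RDM matching step: given an NxN inferred
    # latency matrix and an MxM constraint matrix of (lo, hi) ranges, find M
    # nodes satisfying every pairwise constraint. Brute force for clarity.
    from itertools import combinations, permutations

    def satisfies(cand, latency, constraints):
        """cand: tuple of M node ids; constraints[i][j]: (lo, hi) ms or None."""
        for i, a in enumerate(cand):
            for j, b in enumerate(cand):
                rng = constraints[i][j]
                if rng and not (rng[0] <= latency[a][b] <= rng[1]):
                    return False
        return True

    def find_mapping(n, latency, constraints):
        m = len(constraints)
        for combo in combinations(range(n), m):
            for cand in permutations(combo):   # constraint rows are per-slot
                if satisfies(cand, latency, constraints):
                    return cand
        return None

    # toy example: 3 nodes, want 2 with 10-20 ms latency between them
    lat = [[0, 15, 250], [15, 0, 260], [250, 260, 0]]
    cons = [[None, (10, 20)], [(10, 20), None]]
    print(find_mapping(3, lat, cons))          # -> (0, 1)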
19. RDM next steps
- Decentralized resource discovery/mapping
  - replicate needed statistics close to querying nodes
  - improves avail. and perf. over the centralized approach
- Better mapping functions
  - NP-hard problem
  - provide best mapping within cost/precision constraints
  - give user an indication of accuracy and cost
- Integrate with experiment description language
- Integrate with PlanetLab resource allocation
- Evaluation
20. Conclusion
- Platform for benchmarking distributed apps
  - Collect metrics and events
    - sensors
    - ISING query processor
  - Describe & implement a benchmark scenario
    - actuators
    - controller/rule system: process mgmt., fault injection
    - XML-based (to be replaced)
- Next steps
  - resource discovery/node mapping
  - improved benchmark descr./resource discovery lang.
  - incorporating Grid applications
  - incorporating cluster applications and using ACME to benchmark diagnosis techniques (with Stanford)