Title: ACME: a platform for benchmarking distributed applications
1. ACME: a platform for benchmarking distributed applications
- David Oppenheimer, Vitaliy Vatkovskiy, and David Patterson
- ROC Retreat
- 12 Jan 2003
2. Motivation
- Benchmarking large-scale distributed apps (peer-to-peer, Grid, CDNs, ...) is difficult
  - very large (1000s-10,000s of nodes)
    - need scalable measurement and control
  - nodes and network links will fail
    - need robust measurement and control
  - large variety of possible applications
    - need standard interfaces for measurement and control
- ACME: a platform that developers can use to benchmark their distributed applications
3. ACME benchmark lifecycle
- User describes benchmark scenario
  - node requirements, workload, faultload, metrics
- System finds the appropriate nodes and starts up the benchmarked application on those nodes
- System then executes the scenario (sketched below)
  - collects measurements
  - injects workload and faults
- Note: the same infrastructure works for self-management (just replace "fault" with "control action" and "benchmark scenario" with "self-management rules or recovery actions")
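A minimal Python sketch of this lifecycle, using the kill-until-latency-doubles scenario from slide 11. All names here (find_nodes, read_latency, kill_nodes) are illustrative stubs, not ACME's actual API:

    # Hypothetical sketch of the ACME benchmark lifecycle; every function is
    # an illustrative stub, not ACME's actual API.
    import random
    import time

    def find_nodes(n):                          # resource discovery (stub)
        return [f"node{i}" for i in range(n)]

    def read_latency(nodes):                    # sensor (stub): avg latency in ms,
        base = 100 * (100 / max(len(nodes), 1)) # rising as nodes are killed
        return base * random.uniform(0.9, 1.1)

    def kill_nodes(nodes, k):                   # actuator (stub): fault injection
        del nodes[:k]

    nodes = find_nodes(100)                     # 1. find nodes, deploy the app
    baseline = read_latency(nodes)              # 2. collect a baseline measurement
    while nodes and read_latency(nodes) < 2 * baseline:
        kill_nodes(nodes, 10)                   # 3. execute scenario: inject faults
        time.sleep(0.1)                         #    (3 minutes apart in the real run)
    print(f"latency doubled with {len(nodes)} nodes left")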
4. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
5. Sensors and actuators
- Source/sink for monitoring/control (a minimal sensor sketch follows this slide)
- Application-external, node-level
  - sensors
    - load, memory usage, network traffic, ...
  - actuators
    - start/kill processes
    - reboot physical nodes
    - modify emulated network topology
- Application-embedded, application-level
  - initial application type: peer-to-peer overlay networks
  - sensors
    - number of application-level msgs sent/received
  - actuators
    - application-specific fault injection
    - change parameters of workload generation
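The architecture figures later in the deck show sensors being queried by HTTP URL and replying with CSV data. Under that assumption, a minimal node-level load sensor might look like the sketch below; the /sensor/load path and CSV field names are invented for illustration, and the port matches the node="ALL:3333" setting in the slide 11 example:

    # Minimal sketch of a node-level "load" sensor served over HTTP as CSV,
    # matching the HTTP URL / HTTP CSV data interface in the architecture
    # figures. URL path and CSV fields are illustrative, not ACME's schema.
    import os, time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class SensorHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/sensor/load":
                load1, _, _ = os.getloadavg()   # 1-minute load avg (Unix only)
                body = f"timestamp,load\n{time.time():.0f},{load1:.2f}\n"
                self.send_response(200)
                self.send_header("Content-Type", "text/csv")
                self.end_headers()
                self.wfile.write(body.encode())
            else:
                self.send_error(404)

    HTTPServer(("", 3333), SensorHandler).serve_forever()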
6. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
7. Query processor architecture
[Figure: ISING query processor architecture. A query arrives at the root ISING node as an HTTP URL and is propagated down a SenTree of nodes via SenTreeDown messages; each node queries its local sensors over HTTP and receives CSV data, children's values flow back up via SenTreeUp, and the root returns the aggregated response.]
8. Query processor (cont.)
- Scalability
  - efficiently collect monitoring data from thousands of nodes
  - in-network data aggregation and reduction
- Robustness
  - handle failures in the monitoring system and the monitored application
  - query processor based on a self-healing peer-to-peer net
  - partial aggregates on failure (sketched below)
- Extensibility
  - easy way to incorporate new monitoring data sources as the system evolves
  - sensor interface
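To make "partial aggregates on failure" concrete, here is a toy sketch of the idea (not ISING's actual code): each node in the aggregation tree folds in whichever children respond, so a failed subtree shrinks the aggregate instead of killing the query:

    # Sketch of in-network aggregation with partial results on failure (the
    # idea behind ISING/SenTree, not its implementation). Each node averages
    # its own reading with whatever children respond.
    from dataclasses import dataclass, field

    @dataclass
    class Node:
        latency: float                      # this node's sensor reading (ms)
        children: list = field(default_factory=list)
        alive: bool = True

    def aggregate_avg(node):
        """Return (sum, count) up the tree; dead subtrees just drop out."""
        if not node.alive:
            return (0.0, 0)                 # failed node: contributes nothing
        total, count = node.latency, 1
        for child in node.children:
            s, c = aggregate_avg(child)     # real system: remote call + timeout
            total, count = total + s, count + c
        return (total, count)

    leaves = [Node(100 + i) for i in range(4)]
    leaves[2].alive = False                 # simulate one failed node
    root = Node(100, children=leaves)
    s, c = aggregate_avg(root)
    print(f"partial AVG over {c} of 5 nodes: {s / c:.1f} ms")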
9. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
10. Describing a benchmark scenario
- Key is usability: want an easy way to define when and what actions to trigger
  - "kill half of the nodes after ten minutes"
  - "kill nodes until response latency doubles"
- Declarative XML-based rule system
  - conditions over sensors -> invoke actuators
11. Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles

<action ID="1" name="startNode" timerName="T">
  <params numToStart="100"/>
  <conditions>
    <condition type="timer" value="0"/>
  </conditions>
</action>
<action ID="2" name="stopSensor" timerName="T">
  <params sensorName="oldVal"/>
  <conditions>
    <condition type="timer" value="600000"/>
  </conditions>
</action>
<action ID="3" name="killNode" timerName="T">
  <params killNumber="10"/>
  <repeat period="180000"/>
  <conditions>
    <condition type="timer" value="600000"/>
    <condition type="sensor" ID="oldVal" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1" isSecondary="true"/>
    <condition type="sensor" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1"
        operator="lt" ID="oldVal" scalingFactor="2"/>
  </conditions>
</action>
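Reading these rules: timer values are in milliseconds (600000 = 10 minutes, 180000 = 3 minutes). The first sensor condition, marked isSecondary="true" with ID="oldVal", appears to capture the baseline latency; the second compares the current AVG latency against that baseline using operator="lt" and scalingFactor="2", so killNode keeps firing every 3 minutes while latency remains below double the baseline.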
12. ACME architecture
[Figure: ACME architecture. An XML experiment spec. / sys. mgmt. policy drives a controller. The controller issues queries to the ISING/SenTree query processor (HTTP URLs travel down the tree via SenTreeDown, sensors return HTTP CSV data, children's values are aggregated back up via SenTreeUp into an aggregated response) and invokes actuators over the same HTTP URL / CSV interface.]
13. ACME recap
- Taken together, the parts of ACME provide
  - application deployment and process management
  - data collection infrastructure
  - workload generation
  - fault injection
  - ...all driven by a user-specified policy
- Future work (with Stanford)
  - scaling down: integrate cluster applications
    - sensors/actuators for J2EE middleware
    - targeted towards statistical monitoring
  - use rule system to invoke recovery routines
  - benchmark diagnosis techniques, not just apps
  - new, user-friendly policy language
    - including expressing statistical algorithms
14. Benchmarking diagnosis techniques
[Figure: Architecture for benchmarking diagnosis techniques. An XML experiment spec. drives the controller, which performs fault injection and issues queries to ISING or another query processor for monitoring metrics. Monitoring data, events, and queries flow through a pub/sub layer (with history) to rule-based and statistical diagnosis modules, which publish diagnosis events and subscription requests back.]
15. Revamping the language
- Start 100 nodes. Starting 10 minutes later, kill 10 nodes every 3 minutes until latency doubles

when (timer_T > 0)
    startNode(number=100)
when ((timer_T > 600000) AND sensorCond_CompLatency)
    killNode(number=10) repeat(period=180000)
when (timer_T > 610000)
    stopSensor(name=oldVal)

define sensorCond CompLatency hist1 < 2 * hist2
define history hist1 sensor=lat, size=1
define history hist2 sensor=oldVal, size=1
define sensor lat
    name="latency"
    hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2"
    node="ALL:3333"
    period="10000" sensorAgg="AVG"
define sensor oldVal = lat
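Compared with the XML on slide 11, the when/define split separates triggering logic from sensor plumbing: CompLatency reads as a comparison of a one-element history of the live latency sensor (hist1) against the frozen baseline copy (hist2 over oldVal), so killing continues while latency stays under double the baseline.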
16. Outline
- Motivation and System Environment
- Interacting with apps: sensors & actuators
- Data collection architecture
- Describing and executing a benchmark scenario
- Resource discovery: finding appropriate nodes in shared Internet-distributed environments
- Conclusion
17. Resource discovery and mapping
- When benchmarking: map desired emulated topology to available topology
  - example: "find me 100 P4-Linux nodes with inter-node bandwidth, latency, and loss rates characteristic of the Internet as a whole, and that are lightly loaded"
- When deploying a service: find the set of nodes on which to execute to achieve desired performance, cost, and availability
  - example: "find me the cheapest 50 nodes that will give me at least 3 9s of availability, that are geographically well-dispersed, and that have at least 100 Kb/sec of bandwidth between them"
18. Current RDM architecture
- Each node that is offering resources periodically reports to a central server
  - single-node statistics
  - inter-node statistics expressed as an N-element vector
  - central server builds an NxN inference matrix
  - currently statistic values are generated randomly
- When desired, a node issues a resource discovery request to the central server
  - MxM constraint matrix, e.g.
    - load=[0,2]  latency=[10ms,20ms],[200ms,300ms]
    - load=[0,2]  latency=[10ms,20ms],[200ms,300ms]
    - load=[0,2]  latency=[200ms,300ms],[200ms,300ms]
- Central server finds the M best nodes and returns them to the querying node (matching step sketched below)
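The slides don't spell out how the server matches the MxM constraint matrix against the NxN inference matrix, but a brute-force version of the matching step might look like this sketch (exponential in M, so purely illustrative; the real problem is NP-hard, as the next slide notes). All names here are invented:

    # Illustrative sketch of the RDM matching step: given an NxN inferred
    # latency matrix and an MxM constraint matrix of (lo, hi) ranges, find M
    # nodes satisfying every pairwise constraint. Brute force for clarity.
    from itertools import combinations, permutations

    def satisfies(cand, latency, constraints):
        """cand: tuple of M node ids; constraints[i][j]: (lo, hi) ms or None."""
        for i, a in enumerate(cand):
            for j, b in enumerate(cand):
                rng = constraints[i][j]
                if rng and not (rng[0] <= latency[a][b] <= rng[1]):
                    return False
        return True

    def find_mapping(n, latency, constraints):
        m = len(constraints)
        for combo in combinations(range(n), m):
            for cand in permutations(combo):   # constraint rows are per-slot
                if satisfies(cand, latency, constraints):
                    return cand
        return None

    # toy example: 3 nodes, want 2 with 10-20 ms latency between them
    lat = [[0, 15, 250], [15, 0, 260], [250, 260, 0]]
    cons = [[None, (10, 20)], [(10, 20), None]]
    print(find_mapping(3, lat, cons))          # -> (0, 1)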
19. RDM next steps
- Decentralized resource discovery/mapping
  - replicate needed statistics close to querying nodes
  - improves avail. and perf. over the centralized approach
- Better mapping functions
  - NP-hard problem
  - provide best mapping within cost/precision constraints
  - give user an indication of accuracy and cost
- Integrate with experiment description language
- Integrate with PlanetLab resource allocation
- Evaluation
20. Conclusion
- Platform for benchmarking distributed apps
  - Collect metrics and events
    - sensors
    - ISING query processor
  - Describe & implement a benchmark scenario
    - actuators
    - controller/rule system: process mgmt., fault injection
    - XML-based (to be replaced)
- Next steps
  - resource discovery/node mapping
  - improved benchmark descr./resource discovery lang.
  - incorporating Grid applications
  - incorporating cluster applications and using ACME to benchmark diagnosis techniques (with Stanford)