Transcript and Presenter's Notes

Title: ACME: a platform for benchmarking distributed applications


1
ACME: a platform for benchmarking distributed
applications
  • David Oppenheimer, Vitaliy Vatkovskiy, and David
    Patterson
  • ROC Retreat
  • 12 Jan 2003

2
Motivation
  • Benchmarking large-scale distributed apps
    (peer-to-peer, Grid, CDNs, ...) is difficult
    • very large (1000s-10,000s of nodes)
      • need scalable measurement and control
    • nodes and network links will fail
      • need robust measurement and control
    • large variety of possible applications
      • need standard interfaces for measurement and control
  • ACME: a platform that developers can use to
    benchmark their distributed applications

3
ACME benchmark lifecycle
  • User describes benchmark scenario
    • node requirements, workload, faultload, metrics
  • System finds the appropriate nodes and starts up the
    benchmarked application on those nodes
  • System then executes the scenario
    • collects measurements
    • injects workload and faults
  • Note: the same infrastructure supports self-management
    (just replace "fault" with "control action" and "benchmark
    scenario" with "self-management rules or recovery actions")

4
Outline
  • Motivation and System Environment
  • Interacting with apps: sensors and actuators
  • Data collection architecture
  • Describing and executing a benchmark scenario
  • Resource discovery: finding appropriate nodes in
    shared Internet-distributed environments
  • Conclusion

5
Sensors and actuators
  • Source/sink for monitoring/control
  • Application-external (node-level)
    • sensors
      • load, memory usage, network traffic, ...
    • actuators
      • start/kill processes
      • reboot physical nodes
      • modify emulated network topology
  • Application-embedded (application-level)
    • initial application type: peer-to-peer overlay networks
    • sensors
      • number of application-level msgs sent/received
    • actuators
      • application-specific fault injection
      • change parameters of workload generation
  • (a minimal node-level sensor sketch follows this slide)
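
Since ACME's sensors answer an HTTP request with CSV data (this is the
sensor interface shown in the query processor figure), a node-level sensor
can be very small. Below is a minimal sketch in Python, not the authors'
code; the /load path, the CSV column layout, and the use of port 3333
(echoing the ALL:3333 address in the scenario example) are illustrative
assumptions.

import os
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class LoadSensor(BaseHTTPRequestHandler):
    # Answers GET /load with one CSV row: timestamp,name,value
    def do_GET(self):
        if self.path == "/load":
            load = os.getloadavg()[0]  # 1-minute load average (Unix only)
            body = f"timestamp,name,value\n{time.time()},load,{load}\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/csv")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("", 3333), LoadSensor).serve_forever()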

6
Outline
  • Motivation and System Environment
  • Interacting with apps: sensors and actuators
  • Data collection architecture
  • Describing and executing a benchmark scenario
  • Resource discovery: finding appropriate nodes in
    shared Internet-distributed environments
  • Conclusion

7
Query processor architecture
[Figure: ISING query processor architecture. A query enters the root
ISING node as an HTTP URL and is pushed down the SenTree via SenTreeDown
messages; each leaf fetches HTTP CSV data from its local sensor;
children's values flow back up via SenTreeUp messages, and the root
returns the aggregated response as HTTP CSV data.]
8
Query processor (cont.)
  • Scalability
    • efficiently collect monitoring data from thousands of nodes
    • in-network data aggregation and reduction
  • Robustness
    • handle failures in the monitoring system and in the
      monitored application
    • query processor is built on a self-healing peer-to-peer network
    • partial aggregates on failure (sketch below)
  • Extensibility
    • easy way to incorporate new monitoring data sources as
      the system evolves
    • sensor interface
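
A minimal sketch of the robustness idea, partial aggregates on failure:
aggregate in-network and skip unreachable children rather than failing the
whole query. The Node class and the local recursion are stand-ins; in
ISING this is a SenTreeDown/SenTreeUp exchange over a self-healing
peer-to-peer network.

from dataclasses import dataclass, field
from typing import Optional, Tuple

@dataclass
class Node:
    value: float                          # latest local sensor reading
    children: list = field(default_factory=list)
    alive: bool = True

def query_subtree(node: Node) -> Optional[Tuple[float, int]]:
    # Returns (sum, count) for this subtree, or None if the node is down.
    if not node.alive:
        return None
    total, count = node.value, 1
    for child in node.children:
        part = query_subtree(child)       # failed children are skipped,
        if part is not None:              # so the root still returns a
            total += part[0]              # partial aggregate
            count += part[1]
    return total, count

root = Node(10.0, [Node(20.0), Node(30.0, alive=False), Node(40.0)])
s, n = query_subtree(root)
print(f"AVG over {n} reachable nodes = {s / n:.1f}")   # 70/3 = 23.3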

9
Outline
  • Motivation and System Environment
  • Interacting with apps: sensors and actuators
  • Data collection architecture
  • Describing and executing a benchmark scenario
  • Resource discovery: finding appropriate nodes in
    shared Internet-distributed environments
  • Conclusion

10
Describing a benchmark scenario
  • Key is usability: want an easy way to define when and
    what actions to trigger
    • "kill half of the nodes after ten minutes"
    • "kill nodes until response latency doubles"
  • Declarative XML-based rule system
    • conditions over sensors -> invoke actuators

11
  • Start 100 nodes. Starting 10 minutes later, kill
    10 nodes every 3 minutes until latency doubles

<action ID="1" name="startNode" timerName="T">
  <params numToStart="100"/>
  <conditions>
    <condition type="timer" value="0"/>
  </conditions>
</action>
<action ID="2" name="stopSensor" timerName="T">
  <params sensorName="oldVal"/>
  <conditions>
    <condition type="timer" value="600000"/>
  </conditions>
</action>
<action ID="3" name="killNode" timerName="T">
  <params killNumber="10"/>
  <repeat period="180000"/>
  <conditions>
    <condition type="timer" value="600000"/>
    <condition type="sensor" ID="oldVal" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1" isSecondary="true"/>
    <condition type="sensor" datatype="double" name="latency"
        hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2" node="ALL:3333"
        period="10000" sensorAgg="AVG" histSize="1"
        operator="lt" ID="oldVal" scalingFactor="2"/>
  </conditions>
</action>
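
A minimal sketch of what the controller does with such rules: fire each
action when all of its conditions hold, honoring the repeat period. The
Action class, the state dictionary, and invoke_actuator are hypothetical
names; the real controller parses the XML above.

class Action:
    def __init__(self, name, conditions, period_ms=None):
        self.name = name                # actuator to invoke, e.g. killNode
        self.conditions = conditions    # list of callables: state -> bool
        self.period_ms = period_ms      # from <repeat period="..."/>
        self.last_fired_ms = None

    def maybe_fire(self, state, invoke_actuator):
        now = state["timer_ms"]
        if self.last_fired_ms is not None:
            if self.period_ms is None:  # one-shot action already fired
                return
            if now - self.last_fired_ms < self.period_ms:
                return                  # repeat period not yet elapsed
        if all(cond(state) for cond in self.conditions):
            invoke_actuator(self.name)
            self.last_fired_ms = now

# Conditions mirroring action ID=3: 10 minutes elapsed AND latency doubled.
past_10min = lambda s: s["timer_ms"] >= 600000
latency_2x = lambda s: s["latency"] >= 2 * s["old_latency"]
kill = Action("killNode", [past_10min, latency_2x], period_ms=180000)

state = {"timer_ms": 600000, "latency": 210.0, "old_latency": 100.0}
kill.maybe_fire(state, invoke_actuator=lambda name: print("invoke", name))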
12
ACME architecture
[Figure: full ACME architecture. A controller reads an XML experiment
spec./system-management policy, issues queries (HTTP URLs) to the ISING
query processor, which aggregates sensor data through the SenTree as in
the earlier figure and returns HTTP CSV data; the controller then drives
actuators over the same HTTP URL / HTTP CSV interface.]
13
ACME recap
  • Taken together, the parts of ACME provide
    • application deployment and process management
    • data collection infrastructure
    • workload generation
    • fault injection
    • ...all driven by a user-specified policy
  • Future work (with Stanford)
    • scaling down: integrate cluster applications
      • sensors/actuators for J2EE middleware
      • targeted towards statistical monitoring
    • use rule system to invoke recovery routines
      • benchmark diagnosis techniques, not just apps
    • new, user-friendly policy language
      • including expressing statistical algorithms

14
Benchmarking diagnosis techniques
[Figure: benchmarking diagnosis techniques. The controller takes an XML
experiment spec., drives fault injection, and issues queries to ISING or
another query processor; monitoring metrics and a history of monitoring
data flow as events/queries through a pub/sub layer to rule-based and
statistical diagnosis modules, which send back diagnosis events and
subscription requests.]
15
Revamping the language
  • Start 100 nodes. Starting 10 minutes later, kill
    10 nodes every 3 minutes until latency doubles

when (timer_T > 0) startNode(number=100)
when ((timer_T > 600000) AND sensorCond_CompLatency)
    killNode(number=10) repeat(period=180000)
when (timer_T > 610000) stopSensor(name=oldVal)
define sensorCond CompLatency: hist1 < 2 * hist2
define history hist1: sensor=lat, size=1
define history hist2: sensor=oldVal, size=1
define sensor lat:
    name="latency"
    hosts="ibm4.CS.Berkeley.EDU:34794 host2:port2"
    node="ALL:3333"
    period="10000"
    sensorAgg="AVG"
define sensor oldVal = lat
16
Outline
  • Motivation and System Environment
  • Interacting with apps: sensors and actuators
  • Data collection architecture
  • Describing and executing a benchmark scenario
  • Resource discovery: finding appropriate nodes in
    shared Internet-distributed environments
  • Conclusion

17
Resource discovery and mapping
  • When benchmarking, map the desired emulated topology
    onto the available topology
    • example: "find me 100 P4-Linux nodes with inter-node
      bandwidth, latency, and loss rates characteristic of the
      Internet as a whole, and that are lightly loaded"
  • When deploying a service, find the set of nodes on which
    to execute so as to achieve desired performance, cost,
    and availability
    • example: "find me the cheapest 50 nodes that will give
      me at least 3 9s of availability, that are geographically
      well-dispersed, and that have at least 100 Kb/sec of
      bandwidth between them"

18
Current RDM architecture
  • Each node that is offering resources periodically
    reports to a central server
    • single-node statistics
    • inter-node statistics, expressed as an N-element vector
    • the central server builds an N×N inference matrix
    • currently, statistic values are generated randomly
  • When desired, a node issues a resource discovery
    request to the central server
    • M×M constraint matrix, e.g. for M=3:
      • load [0,2]  latency [10ms,20ms], [200ms,300ms]
      • load [0,2]  latency [10ms,20ms], [200ms,300ms]
      • load [0,2]  latency [200ms,300ms], [200ms,300ms]
  • Central server finds the M best nodes and returns
    them to the querying node (sketch below)
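
A minimal sketch of the matching step, under stated assumptions: the
statistics are randomly generated (as the slide notes), the constraint
matrix uses the load/latency ranges from the example above, and the
search is a brute-force scan (the next slide notes the real mapping
problem is NP-hard).

import itertools
import random

N, M = 8, 3
load    = [random.uniform(0, 4) for _ in range(N)]      # single-node stats
latency = [[random.uniform(5, 400) for _ in range(N)]   # N x N inference
           for _ in range(N)]                           # matrix

def satisfies(cand, load_rng, lat_rng):
    # cand: tuple of M node ids; lat_rng[a][b]: (lo, hi) in ms, or None
    if any(not (load_rng[0] <= load[i] <= load_rng[1]) for i in cand):
        return False
    for a, i in enumerate(cand):
        for b, j in enumerate(cand):
            rng = lat_rng[a][b]
            if rng and not (rng[0] <= latency[i][j] <= rng[1]):
                return False
    return True

lat_rng = [[None, (10, 20), (200, 300)],     # the M x M constraint matrix
           [(10, 20), None, (200, 300)],     # from the example above
           [(200, 300), (200, 300), None]]
for cand in itertools.permutations(range(N), M):
    if satisfies(cand, (0, 2), lat_rng):
        print("match:", cand)
        break
else:
    print("no feasible assignment in this random instance")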

19
RDM next steps
  • Decentralized resource discovery/mapping
    • replicate needed statistics close to querying nodes
    • improves availability and performance over the
      centralized approach
  • Better mapping functions
    • the mapping problem is NP-hard
    • provide the best mapping within cost/precision constraints
  • Give user an indication of accuracy and cost
  • Integrate with experiment description language
  • Integrate with PlanetLab resource allocation
  • Evaluation

20
Conclusion
  • Platform for benchmarking distributed apps
  • Collect metrics and events
    • sensors
    • ISING query processor
  • Describe and implement a benchmark scenario
    • actuators
    • controller/rule system: process mgmt., fault injection
    • XML-based (to be replaced)
  • Next steps
    • resource discovery/node mapping
    • improved benchmark description / resource discovery language
    • incorporating Grid applications
    • incorporating cluster applications and using ACME to
      benchmark diagnosis techniques (with Stanford)