Title: CompSci 296.2 Self-Managing Systems
1. CompSci 296.2 Self-Managing Systems
2. Today
- Some current work in self-managing systems
- Ideas and resources for projects
- IBM
- ROC (Discussion deferred to next class)
- Our projects at Duke
- HP
3. Project
- Group size < 2
- Identify general topic by end of January, meet Shivnath
- Feb 7: Scope problem and give 15-minute talk
- Feb 21: 3-minute talk
- March 7: 15-minute talk
- March 28: 3-minute talk
- April 4/6: 15-minute talk
- April 20/24: 15-minute final in-class presentation (+ demo)
4. Work on Self-Managing Systems
- IBM
- IBM Journal, Volume 42, Number 1, 2003
- Autonomic computing home page
- IBM autonomic home library, demos
- Autonomic computing toolkit
- IBM Tivoli
5. Work on Self-Managing Systems
- Berkeley-Stanford ROC project
- Reading for this class
- Interesting source of project ideas and source code
- Sample project reports/presentations (follow the CS444A/294-4 link)
6. The past research goals and assumptions of last 15 years
- Goal 1: Improve performance
- Goal 2: Improve performance
- Goal 3: Improve cost-performance
7. New research goals for a New Century: ACME
- Availability
- Changeability
- support rapid deployment of new software, apps, UI
- Maintainability
- reduce burden on system administrators
- provide helpful, forgiving SysAdmin environments
- Evolutionary Growth
- allow easy system expansion over time
- Also Security/Privacy
8. Recovery-Oriented Computing (ROC) Philosophy
- "If a problem has no solution, it may not be a problem, but a fact, not to be solved, but to be coped with over time" - Shimon Peres ("Peres's Law")
- People/HW/SW failures are facts, not problems
- Recovery/repair is how we cope with the above facts
- Since a major Sys Admin job is recovery after failure, ROC also helps with maintenance/TCO
- ROC focus is on fast repair vs. the old focus on longer time between failures
9. An Example Project in ROC
- Undo functionality for system administrators (useful for self-managing components as well)
- To recover from human errors
- To recover from failed operations like software upgrades, installs, and configuration updates
- An interesting mechanism project for self-healing
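The undo idea above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory configuration store, `UndoableConfig`, that logs the previous value before each change; the class and its method names are illustrative and are not taken from the ROC project.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

_MISSING = object()  # marks "this key did not exist before the change"


@dataclass
class UndoableConfig:
    """Hypothetical config store that records the old value before every change."""
    settings: Dict[str, Any] = field(default_factory=dict)
    _undo_log: List[Tuple[str, Any]] = field(default_factory=list)

    def set(self, key: str, value: Any) -> None:
        # Save the previous value (or the "missing" marker) so the change can be reversed.
        self._undo_log.append((key, self.settings.get(key, _MISSING)))
        self.settings[key] = value

    def undo(self) -> bool:
        """Reverse the most recent change; return False if there is nothing to undo."""
        if not self._undo_log:
            return False
        key, old_value = self._undo_log.pop()
        if old_value is _MISSING:
            del self.settings[key]          # the key was newly added, so remove it
        else:
            self.settings[key] = old_value  # restore the earlier value
        return True


config = UndoableConfig({"buffer_size": 64})
config.set("buffer_size", 8)       # a mistaken configuration update
config.set("log_level", "debug")   # a newly added setting
config.undo()                      # removes log_level
config.undo()                      # restores buffer_size to 64
```

A real undo facility would also have to cover side effects outside the store (installed packages, restarted services), which this sketch ignores.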
10. Mechanism Projects
- Required/useful mechanisms for self-managing systems
- Take a goal related to self-managing (e.g., self-optimization, predicting problems), take a system (e.g., a database), then ask: What mechanisms are needed? Will current mechanisms suffice?
- Ex: Data collection
- nonintrusive, distributed, active probing
11. Our Projects at Duke
- QueS: Querying Systems (as data)
- Better tools for system administrators and self-managing system components
- CoD: Cluster on Demand
- Allocate virtual clusters to applications on demand
12. Querying Systems as Data
13. Querying Systems as Data
[Figure; only the label "WAN" is recoverable]
14. Querying Systems as Data
- What are probable causes of the Service-Level-Agreement (SLA) violations rising to 12%?
- Root-cause query
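One naive way to approximate a root-cause query is to rank monitored metrics by how strongly they move with the SLA violation rate. The sketch below uses plain Pearson correlation as a stand-in technique; the metric names and numbers are made up for illustration, and correlation only suggests candidates, it does not establish causes.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_probable_causes(
    metrics: Dict[str, List[float]], sla_violation_rate: List[float]
) -> List[Tuple[str, float]]:
    """Rank metrics by absolute correlation with the SLA violation rate, strongest first."""
    scored = [
        (name, abs(pearson(series, sla_violation_rate)))
        for name, series in metrics.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical monitoring data: five samples per metric.
metrics = {
    "db_cpu_util": [0.30, 0.35, 0.60, 0.85, 0.90],
    "app_mem_util": [0.50, 0.52, 0.49, 0.51, 0.50],
    "wan_latency_ms": [20, 22, 21, 40, 38],
}
sla_violation_rate = [0.02, 0.03, 0.06, 0.11, 0.12]  # rises to 12%

ranking = rank_probable_causes(metrics, sla_violation_rate)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With this data the ranking places `db_cpu_util` first, then `wan_latency_ms`, then `app_mem_util`.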
15. Queries: What if
- Given today's workload, how will average response time change if my database fails?
- If I double the memory on my application servers, how will SLA violation rate change?
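A what-if query needs some model of the system to evaluate the hypothetical change. As a minimal sketch, assume (this is not from the slides) that a server behaves like an M/M/1 queue, whose mean response time is 1 / (service rate - arrival rate) when the arrival rate is below the service rate. The function names and rates below are illustrative.

```python
from typing import Tuple


def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue, in seconds (rates are requests/second).

    Returns infinity when the server is saturated (arrival_rate >= service_rate).
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


def what_if(
    arrival_rate: float, current_service_rate: float, new_service_rate: float
) -> Tuple[float, float]:
    """Answer "how will average response time change?" for a hypothetical capacity change."""
    before = mm1_response_time(arrival_rate, current_service_rate)
    after = mm1_response_time(arrival_rate, new_service_rate)
    return before, after


# Hypothetical workload: 80 req/s arriving, capacity doubled from 100 to 200 req/s.
before, after = what_if(
    arrival_rate=80.0, current_service_rate=100.0, new_service_rate=200.0
)
print(f"before: {before * 1000:.1f} ms, after: {after * 1000:.1f} ms")
```

Under these assumptions the mean response time drops from 50 ms to about 8.3 ms; a real system would need a model fitted to its own measurements.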
16. Queries: Let me know
- Let me know if, with 75% probability, average response time will exceed 5 seconds in next 30 minutes
- Prediction
- Continuous query
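A continuous query like the one above is re-evaluated as new measurements arrive. The sketch below is a deliberately naive stand-in for prediction: it estimates the probability of exceeding the threshold as the fraction of recent samples that exceeded it. The function name, window size, and sample stream are assumptions for illustration.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def let_me_know(
    samples: Iterable[float],
    threshold_s: float = 5.0,
    probability: float = 0.75,
    window: int = 4,
) -> Iterator[int]:
    """Yield the index of each sample at which the alert condition holds.

    The probability estimate is the fraction of the last `window` samples
    above `threshold_s`; nothing is reported until the window is full.
    """
    recent: Deque[float] = deque(maxlen=window)
    for index, value in enumerate(samples):
        recent.append(value)
        if len(recent) < window:
            continue
        estimate = sum(v > threshold_s for v in recent) / window
        if estimate >= probability:
            yield index


# Hypothetical stream of average response times, in seconds.
stream = [1.2, 2.0, 6.1, 5.5, 7.0, 6.3, 2.1, 1.9]
alerts = list(let_me_know(stream))
print(alerts)  # [4, 5, 6]
```

A production version would replace the empirical fraction with a forecasting model and run over a live feed rather than a list.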
17. Queries: What should I do?
- What should I do to reduce SLA violations of requests A to <1%, without increasing violations of other requests?
- Root-cause + What-if
18. Querying Systems as Data
- Instrumented traces, logs
- System activity data
- Data from active probing
- Workload
- System configuration data (e.g., buffer size, indexes)
- Source code
- Models
- Analytic performance models
- Machine learning models
- Rules from system experts
- Simulators
19. Querying Systems with QueS (30,000 ft)
20. Challenges: Query Complexity
- Support for complex queries
- Rank probable causes of SLA violation rising to 12%?
- "What should I do" queries
- Queries are ad-hoc
- Queries may be acquisitional
21. Challenges: Query Specification
- Declarative query language
- Expressibility of language
- Composition
- Snapshot queries and continuous queries
22. Challenges: Query Processing
- Model-based query processing
- Many types of data sources
- Structured, semi-structured, and unstructured
- Uncertainty in input data
- E.g., legacy systems may have partial/no instrumentation
- Imprecise answers
- Answers may include quantification of accuracy
- Ranking
23. Challenges: Run-time Overhead
- Real-time service for 24x7 systems
- Tunable data acquisition
- Active probing
24. Work in Progress
- With Piyush Shivam
- Models for answering queries about expected performance given a resource assignment, feasible resource assignments to meet SLA, what-if queries for scientific applications
- With Songyun Duan
- Use of Bayesian Networks for performance prediction and root-cause queries
- With Wanhong Xu
- What-if queries on configuration-parameter settings
25. Projects at HP Research
- Project 1: Predicting performance problems, finding root causes of problems
- Project 2: Debugging complex systems
- Project 3: Designing adaptive systems