Title: RADS Conceptual Architecture
1. RADS Conceptual Architecture
User
Operator
Prototype Applications: E-voting, Messaging, E-Mail, etc.
Programming Abstractions for Roll-back (Necula)
Benchmarks, Tools for Human Operators (Patterson)
Crash-Only Middleware Servers, System OC Infrastructure (Fox)
SLT Services (x2)
Online Statistical Learning Algorithms (Jordan)
Application-Specific Overlay Network
PNE (x2)
Edge Network (x2)
Protocols Enabling Fast Detection and Route Recovery, Network OC Infrastructure (Katz, Stoica)
Router (x2)
Commodity Internet, IP networks
- Reduction to practice of online SLT and observe/analyze/act infrastructure
- Reusable embeddable components
2. Apps and Science
- Messaging (Randy's scenario)
- Voting systems
- Online medical records system
- Volunteer coordination for disaster response
3. What are SLT Services?
- SLT clients are client or server apps, middleware or OS layer, machine hardware, programmable network elements, ...
- Monitoring hooks for SLT clients
- Control hooks for SLT clients
- Database(s) for aggregating SLT client data
- Plug-ins for online and offline analysis
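The four pieces listed above (monitoring hooks, control hooks, an aggregating database, and analysis plug-ins) can be sketched as one observe/analyze/act loop. This is only an illustrative sketch: the class names `SLTClient` and `SLTService`, the `latency_ms` metric, the `restart` command, and the threshold are all assumptions made for the example, not part of the RADS design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Observation = Dict[str, float]
Action = Tuple[str, str]  # (client name, command); hypothetical format


@dataclass
class SLTClient:
    """Anything that exposes a monitoring hook and a control hook."""
    name: str
    monitor: Callable[[], Observation]   # monitoring hook
    control: Callable[[str], None]       # control hook


@dataclass
class SLTService:
    clients: Dict[str, SLTClient] = field(default_factory=dict)
    database: List[dict] = field(default_factory=list)  # aggregated client data
    plugins: List[Callable[[List[dict]], List[Action]]] = field(default_factory=list)

    def register(self, client: SLTClient) -> None:
        self.clients[client.name] = client

    def observe(self) -> None:
        """Poll every monitoring hook and store one row per client."""
        for client in self.clients.values():
            row = {"client": client.name}
            row.update(client.monitor())
            self.database.append(row)

    def analyze_and_act(self) -> List[Action]:
        """Run each analysis plug-in, then push its decisions through control hooks."""
        issued = []
        for plugin in self.plugins:
            for name, command in plugin(self.database):
                self.clients[name].control(command)
                issued.append((name, command))
        return issued


def slow_latency_plugin(rows, threshold_ms=500.0):
    """Toy analysis plug-in: ask for a restart when the latest latency is too high."""
    latest = {}
    for row in rows:
        latest[row["client"]] = row
    return [(name, "restart") for name, row in latest.items()
            if row.get("latency_ms", 0.0) > threshold_ms]


received = []  # commands delivered through the control hooks
service = SLTService(plugins=[slow_latency_plugin])
service.register(SLTClient("mail-server", lambda: {"latency_ms": 900.0}, received.append))
service.register(SLTClient("web-client", lambda: {"latency_ms": 40.0}, received.append))
service.observe()
actions = service.analyze_and_act()
```

A real plug-in would presumably run a statistical learning algorithm over the stored rows; the fixed threshold here only stands in for that step.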
4. Macroscopic behaviors
- Application diversity
- Fail over to another whole infrastructure
- Completely separate app architecture (client, server, middleware, ...)
- Free provisioning across different services (e.g., messaging)
- Use VM/appliance based migration for the servers
5. Reflections from 9/11 (from Douglas Yoshida, MD, Bellevue Hosp NYU Med Ctr)
- In a crisis, patients needing medical attention are brought to the closest hospital, not the most appropriate hospital (absent better information)
- Baseline EMS comms in NYC: no direct contact between EDs and ambulances; sometimes doctors would scramble to clear out ERs, then wait for hours for patients to arrive
- Cell phone and landline failure impeded communication between hospitals
- Needed separate inter-hospital radio comms with direct link to onsite command center
6. More reflections
- Families flooding hospitals trying to find out about their loved ones
- No other way to get the info out
- Creates potential security nightmare for hospital (if terrorists had wanted to attack hospitals, it would have been easy)
- Lack of info leads to frustration and disaster voyeurism
- Med students and attendings flocked down to Ground Zero because they were frustrated at not being able to help within their own hospital
- Too many doctors around each stretcher: poor allocation/distribution of resources
7. Multiple communication channels
- Closed: inter-hospital
- Semi-closed: hospital/command site/firefighters etc.
- Open/unidirectional: communication to public about condition of victims (can be largely unidirectional)
- Open/bidirectional: volunteer coordination
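One way to read the four channel types above is as a small access policy: each channel names who may send and who may receive. The sketch below is an assumption-laden illustration, not a design from the slides. The role names, the channel names, and the choice that only hospitals post to the public status channel are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Role(Enum):
    # Hypothetical participant roles, loosely taken from the slide text.
    HOSPITAL = auto()
    COMMAND_SITE = auto()
    FIREFIGHTER = auto()
    VOLUNTEER = auto()
    PUBLIC = auto()


EVERYONE = frozenset(Role)


@dataclass(frozen=True)
class Channel:
    name: str
    senders: frozenset
    receivers: frozenset

    def can_send(self, role: Role) -> bool:
        return role in self.senders

    def can_receive(self, role: Role) -> bool:
        return role in self.receivers


hospitals = frozenset({Role.HOSPITAL})
responders = frozenset({Role.HOSPITAL, Role.COMMAND_SITE, Role.FIREFIGHTER})

CHANNELS = {
    # Closed: only hospitals talk to each other.
    "inter-hospital": Channel("inter-hospital", hospitals, hospitals),
    # Semi-closed: hospitals, command site, firefighters.
    "responders": Channel("responders", responders, responders),
    # Open/unidirectional: anyone may read, only hospitals post (assumption).
    "public-status": Channel("public-status", hospitals, EVERYONE),
    # Open/bidirectional: volunteer coordination.
    "volunteers": Channel("volunteers", EVERYONE, EVERYONE),
}
```

Under this sketch, `CHANNELS["public-status"].can_send(Role.PUBLIC)` is false while `can_receive(Role.PUBLIC)` is true, which is what makes that channel unidirectional.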
8. Fault injection
Client app
Server app
Middleware
OS
Overlay/PNEs
Internet
Fault injection (x4)
Results fusion, Policy selection
Recovery policy DB
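The flow in this diagram (inject faults at several layers, fuse the results, select a policy, store it in the recovery policy DB) might look roughly like the following. The trial records, fault names, recovery actions, and the 90% success threshold are invented for illustration; the slides do not say how fusion or selection is actually performed.

```python
from collections import defaultdict

# Hypothetical fault-injection trials:
# (layer, fault, recovery action, recovered?, seconds to recover)
TRIALS = [
    ("middleware", "hung-thread", "restart-component", True, 2.0),
    ("middleware", "hung-thread", "restart-component", True, 3.0),
    ("middleware", "hung-thread", "restart-server", True, 40.0),
    ("os", "memory-leak", "restart-component", False, 2.0),
    ("os", "memory-leak", "restart-component", False, 2.5),
    ("os", "memory-leak", "reboot", True, 90.0),
    ("overlay", "route-flap", "reroute", True, 1.0),
    ("overlay", "route-flap", "reroute", False, 1.0),
]


def fuse(trials):
    """Results fusion: aggregate trials per (layer, fault, action)."""
    stats = defaultdict(lambda: {"n": 0, "ok": 0, "time": 0.0})
    for layer, fault, action, recovered, seconds in trials:
        s = stats[(layer, fault, action)]
        s["n"] += 1
        s["ok"] += int(recovered)
        s["time"] += seconds
    return stats


def select_policies(stats, min_success=0.9):
    """Policy selection: per fault, the fastest action that is reliable enough."""
    best = {}
    for (layer, fault, action), s in stats.items():
        rate = s["ok"] / s["n"]
        mean_time = s["time"] / s["n"]
        if rate < min_success:
            continue  # action did not recover often enough to be trusted
        key = (layer, fault)
        if key not in best or mean_time < best[key][1]:
            best[key] = (action, mean_time)
    return {key: action for key, (action, _) in best.items()}


# The "recovery policy DB" is just a dict in this sketch.
policy_db = select_policies(fuse(TRIALS))
```

With these made-up trials, the hung-thread fault maps to the cheaper component restart, the memory leak maps to a reboot, and the route flap gets no entry because no action met the threshold.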
9. From JBoss to JAGR
Client Requests
Stall Proxy
HTTP Server
Servlet/JSP Container
Application Server (JBoss)
Persistence tier
Fault Injector
Recovery Agent
Recovery Map
Internal Monitors
External Monitors
- Stalls user requests during recovery
- Builds fault propagation map, based on observed failures
- Restarts single EJBs, redeploys apps, or restarts whole app-server
- E2EMon detects app-specific, end-to-end failures in requests (also app-generic using character histograms)
- Before deployment, use controlled faults to build Recovery Map
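The recovery behavior described on this slide (a map built from controlled faults, then restarts that widen from single EJBs to the app to the whole server) can be sketched as follows. This is a simplified illustration and not JAGR's actual algorithm: the EJB names, the rule for ranking candidate EJBs, and the `restart` and `healthy` callbacks are assumptions.

```python
def build_recovery_map(observations):
    """Map each injected component to every component seen failing because of it."""
    recovery_map = {}
    for injected, failed in observations:
        reach = recovery_map.setdefault(injected, {injected})
        reach.update(failed)
    return recovery_map


def likely_roots(failed, recovery_map):
    """EJBs whose faults reached all failed components, smallest reach first."""
    candidates = [c for c, reach in recovery_map.items() if failed <= reach]
    return sorted(candidates, key=lambda c: (len(recovery_map[c]), c))


def recover(failed, recovery_map, restart, healthy):
    """Try the narrowest restart first and widen until monitors report healthy."""
    for ejb in likely_roots(failed, recovery_map):
        restart("ejb", ejb)
        if healthy():
            return ("ejb", ejb)
    for scope in ("app", "server"):  # redeploy app, then restart app-server
        restart(scope, None)
        if healthy():
            return (scope, None)
    return None  # nothing helped; a human operator has to step in


# Hypothetical observations gathered before deployment with a fault injector.
recovery_map = build_recovery_map([
    ("CartEJB", {"CheckoutEJB"}),
    ("UserEJB", set()),
    ("DbPoolEJB", {"CartEJB", "CheckoutEJB", "UserEJB"}),
])

# Simulated incident: CheckoutEJB is failing, the real culprit is DbPoolEJB.
log = []
result = recover(
    {"CheckoutEJB"},
    recovery_map,
    restart=lambda scope, target: log.append((scope, target)),
    healthy=lambda: ("ejb", "DbPoolEJB") in log,
)
```

Here the sketch restarts `CartEJB` first because its known failure reach is smaller, finds the system still unhealthy, and then restarts `DbPoolEJB`, which resolves the simulated fault without redeploying the app.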
10. From JBoss to JAGR
Client Requests
Stall Proxy
HTTP Server
Servlet/JSP Container
Application Server (JBoss)
Persistence tier
Fault Injector
Recovery Agent
Recovery Map
Internal Monitors
External Monitors
- Stalls user requests during recovery
- Builds fault propagation map, based on observed failures
- Restarts single EJBs, redeploys apps, or restarts whole app-server
- ExcMon detects Java exceptions in the application/app server
- PPMon detects anomalous behaviors
- E2EMon detects app-specific, end-to-end failures in requests (also app-generic using character histograms)
- Before deployment, use controlled faults to build Recovery Map
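The app-generic check attributed to E2EMon above relies on character histograms. The slides do not say how histograms are compared, so the sketch below assumes one simple option: build a normalized character-frequency histogram of a known-good response and flag responses whose cosine similarity to it falls below a threshold. The function names, the sample page, and the 0.8 threshold are all hypothetical.

```python
import math
from collections import Counter


def char_histogram(text):
    """Relative frequency of each character in a response body."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse histograms (0.0 if either is empty)."""
    dot = sum(value * q.get(ch, 0.0) for ch, value in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def looks_anomalous(response, baseline, threshold=0.8):
    """Flag a response whose character mix is far from the known-good baseline."""
    return cosine_similarity(char_histogram(response), baseline) < threshold


# Hypothetical known-good page used to build the baseline histogram.
GOOD_PAGE = "<html><body><h1>Inbox</h1><p>You have new messages</p></body></html>"
baseline = char_histogram(GOOD_PAGE)

ok = looks_anomalous(GOOD_PAGE, baseline)        # same page, not flagged
empty = looks_anomalous("", baseline)            # empty response, flagged
garbage = looks_anomalous("00000 00000", baseline)  # unrelated bytes, flagged
```

A check like this needs no knowledge of the application, which is what makes it app-generic; the trade-off is that a failure page with a character mix similar to normal output would slip through, so it complements rather than replaces app-specific checks.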