Title: Design Environment for Fault-Adaptive Systems
1 Design Environment for Fault-Adaptive Systems
- Ted Bapty
- Sandeep Neema
- Sweta Shetty, Steve Nordstrom, Divya Vashishtha, Jason Scott, Jason Overdorf
- Vanderbilt Univ.
2 BTeV RTES Team, NSF/ITR
- Fermilab
  - Building BTeV Trigger Hardware
  - Domain Experts, Define Goals, Constraints, etc.
- Vanderbilt
  - RTES Lead (Physics)
  - Design Environment, System Synthesis, System Integration, Prototype Hardware
- UIUC
  - ARMOR, Fault Tolerant Middleware
- Syracuse, Pitt
  - Very Lightweight Agents, Diagnostics, Load Balancing
3 High Energy Physics
BTeV Experiment
Fermilab Accelerator
4 Particle Measurement
Detector Grids
- Problem
  - Massive amounts of data (Terabytes/Sec)
  - Determine the set of particle trajectories
  - Decide if it is interesting, keep or toss
- Hardware: > 2500 DSPs, 2500 PCs
- Never Fail (ok to degrade)
5 Trigger System (20,000 ft. view)
[Figure: trigger pipeline. Labels: Pre-Process (FPGA); 1st Level (DSP), 2000 Nodes; Memory Queue, ms; 2nd Level (PC), 2000 Nodes; Store]
6 System Constraints
- Triple-Mode Redundancy Too Expensive
  - Some Over-capacity designed in
- Parallel System, Real-Time
  - Heterogeneous Processors
  - RT Constraints: Queue Length
- No Generic Response to Faults
  - Based on application requirements
  - Based on system state
  - Based on available resources
7 Fault Mitigation
- System has excess capacity
  - But not much (10)
  - Cannot pre-plan use of redundancy
  - Excess capacity may be used for disposable tasks
- Fault Occurs
  - React quickly to regain minimal function
  - Rearrange Resources to make Best Use of Remaining Resources
- User-defined recovery behavior
8 Reflex & Healing
- Reflex Action
  - Simple
  - Rapid
  - Real-Time, Guaranteed Response Time
  - Sub-Optimal
  - Handle a Single Failure
- Healing
  - Re-Evaluate Resources & Tasks
  - Re-Balance/Re-Allocate Resources
  - Recover Failed Resources (After Testing)
  - Generate New Reflex Actions
9 Reflex: Mitigation Example
User-Defined Mitigation Actions
1. Normal Operation
2. Processor Failure
3. Subdivide Primary Task
4. Migrate to Adjacent Processors
5. Replace Secondary Task
6. Reset/Test Failed Processor
[Figure legend: Primary Task, Secondary Task]
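The reflex sequence listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the BTeV implementation: the `Processor` class, the `reflex_on_failure` function, the ring layout used to define "adjacent", and the even split of the primary load are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Processor:
    name: str
    primary: float = 1.0                            # share of primary work (hypothetical units)
    secondary: list = field(default_factory=list)   # disposable secondary tasks
    failed: bool = False

def reflex_on_failure(ring, idx):
    """Apply the reflex steps to a ring of processors (hypothetical layout)."""
    dead = ring[idx]
    dead.failed = True                                # processor failure
    neighbors = [ring[(idx - 1) % len(ring)], ring[(idx + 1) % len(ring)]]
    share = dead.primary / len(neighbors)             # subdivide primary task
    for n in neighbors:
        n.primary += share                            # migrate to adjacent processors
        n.secondary.clear()                           # secondary task gives way to primary work
    dead.primary = 0.0
    dead.secondary.clear()
    # Resetting and testing the failed processor would follow; not modeled here.
    return neighbors

ring = [Processor(f"P{i}", secondary=[f"S{i}"]) for i in range(4)]
reflex_on_failure(ring, 1)
```

The response is deliberately local and fixed: only the two neighbors are touched, which is what keeps the reaction time bounded, at the cost of a sub-optimal load distribution.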
10 Healing: Mitigation Example
Mitigation Actions
1. Normal Operation
2. Processor Failure, Reflex Action
3. Update Models
4. Re-Evaluate Resources
5. Re-Plan System
6. Rearrange tasks
[Figure labels: Re-Eval, Re-Plan; Primary Task; Secondary Task]
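The re-evaluate, re-plan, and rearrange steps of healing can be illustrated with a simple load balancer. The sketch below uses a greedy "heaviest task first, least-loaded processor" heuristic, which is a stand-in chosen for brevity and not the planner described in the talk; the function name `replan` and the task costs are hypothetical.

```python
import heapq

def replan(tasks, processors, failed):
    """Assign every task to a healthy processor, heaviest task first."""
    healthy = [p for p in processors if p not in failed]   # re-evaluate resources
    if not healthy:
        raise RuntimeError("no healthy processors left")
    heap = [(0.0, p) for p in healthy]                     # (current load, processor)
    heapq.heapify(heap)
    plan = {p: [] for p in healthy}
    for task, cost in sorted(tasks.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)                      # least-loaded processor
        plan[p].append(task)                               # rearrange tasks
        heapq.heappush(heap, (load + cost, p))
    return plan

tasks = {"T1": 4.0, "T2": 3.0, "T3": 2.0, "T4": 1.0}       # made-up task costs
plan = replan(tasks, ["P0", "P1", "P2"], failed={"P1"})
```

Unlike the reflex step, this looks at the whole system, so it can take longer but leaves the remaining processors evenly loaded.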
11 Design Issues
- Complex System
  - Thousands of Processors
  - High Data Rates
  - Real-Time Constraints
- User-Defined Behaviors
  - Domain-Specific Design Tool
  - System-Specific Implementation
- Run-Time Implementation
  - Heterogeneous Architecture
  - Real-Time: Execution, Mitigation
  - Fault-Tolerant
12 Model Integrated Computing
[Figure: Model Integrated Computing flow with three stages: Design and Analysis, Synthesis, Runtime.
Design and Analysis labels: System Models (Algorithm, Resource, Fault Behavior, Reconfig Behavior); Analysis (Performance Simulation, Diagnosability Analysis, Reliability Analysis); Feedback.
Runtime labels: Experiment Interface; Global Operations Manager; Global Fault Manager; Region Operations Mgr; Region Fault Mgr; Local Oper. Manager and Local Fault Mgr on L1/DSP nodes (ARMOR/RTOS) and on L2,3/RISC nodes (ARMOR/Linux), each running several Trig Algo. tasks; Logical Control Network; Logical Data Net; Hard and Soft Real-Time.]
13 Modeling Language
- Processing Data Flow
  - Concepts: Processes, streams, data channels, functions, data types, communication
- Hardware Resources
  - Concepts: Processors, Memory, Topology, Reliability, Failure Modes
- Hierarchical Fault Management
  - Concepts: Recovery Strategies, Modes of Operation, goals/importance
[Figure labels: Full, Recov. Mode 1, Recov. Mode 2, Recov. Mode 3]
14 Resource Models
- Capture Hardware Resources
- Nodes
- Networks
- Attributes
- Hierarchy
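A resource model of this kind (nodes, networks, attributes, hierarchy) could be represented roughly as follows. The class names, the `mem_kb` attribute, and the example values are hypothetical; the actual modeling language of the tool is graphical and is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                   # e.g. "DSP", "PC", "FPGA"
    attributes: dict = field(default_factory=dict)

@dataclass
class Network:
    name: str
    endpoints: tuple                            # names of the connected nodes

@dataclass
class Group:
    """One level of the hierarchy: holds nodes, networks, and sub-groups."""
    name: str
    nodes: list = field(default_factory=list)
    networks: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_nodes(self):
        # Walk the hierarchy depth-first and yield every node.
        yield from self.nodes
        for child in self.children:
            yield from child.all_nodes()

board = Group("board0",
              nodes=[Node("dsp0", "DSP", {"mem_kb": 512}), Node("dsp1", "DSP")],
              networks=[Network("link0", ("dsp0", "dsp1"))])
farm = Group("l2_farm", nodes=[Node("pc0", "PC")])
system = Group("trigger", children=[board, farm])
kinds = sorted(n.kind for n in system.all_nodes())
```

Keeping the hierarchy explicit lets a tool reason about thousands of processors one board or farm at a time.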
15 Algorithm Models
- Processes
- Info Flow
- Interfaces
- Hierarchy
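As a rough illustration of processes, information flow, and interfaces, the sketch below models each process with typed ports and checks that a flow connects ports carrying the same data type. The names `Port`, `Process`, `check_flow`, and the data types are assumptions made for the example, and hierarchy is left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Port:
    name: str
    dtype: str                       # data type carried by the port

@dataclass
class Process:
    name: str
    inputs: tuple = ()
    outputs: tuple = ()

def check_flow(src, out_name, dst, in_name):
    """Return True if src.out_name can feed dst.in_name (matching data types)."""
    out = next(p for p in src.outputs if p.name == out_name)
    inp = next(p for p in dst.inputs if p.name == in_name)
    return out.dtype == inp.dtype

pre = Process("preprocess", outputs=(Port("hits", "HitList"),))
l1 = Process("l1_trigger", inputs=(Port("hits", "HitList"),),
             outputs=(Port("tracks", "TrackList"),))
l2 = Process("l2_trigger", inputs=(Port("tracks", "TrackList"),))

ok = check_flow(pre, "hits", l1, "hits") and check_flow(l1, "tracks", l2, "tracks")
bad = check_flow(pre, "hits", l2, "tracks")   # HitList into a TrackList port
```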
16 Fault Mitigation Models
[Figure labels: Local Manager, Regional Manager, State, Transition, Conditions, Mitigation Actions]
- Finite State Machine
- Parallel, Hierarchical
- Events, Transitions
- Mitigation Actions
- Time Specs
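A hierarchical state machine with events, transitions, mitigation actions, and time specs might look like the following sketch. Everything here is illustrative: the `Manager` class, the state and event names, and the deadlines in milliseconds are invented, and the deadlines are only recorded, not enforced. Parallel machines are not shown.

```python
class Manager:
    """One FSM in the hierarchy; unhandled events go to the parent manager."""
    def __init__(self, name, transitions, state="NOMINAL", parent=None):
        self.name = name
        self.state = state
        self.parent = parent
        # (state, event) -> (next_state, mitigation_actions, deadline_ms)
        self.transitions = transitions
        self.log = []

    def handle(self, event):
        key = (self.state, event)
        if key in self.transitions:
            next_state, actions, deadline_ms = self.transitions[key]
            self.log.append((self.name, actions, deadline_ms))
            self.state = next_state
            return self.name
        if self.parent is not None:
            return self.parent.handle(event)      # escalate up the hierarchy
        return None                               # nobody handles this event

regional = Manager("regional", {
    ("NOMINAL", "NODE_DEAD"): ("REBALANCING", ["REROUTE_LINK"], 100),
})
local = Manager("local", {
    ("NOMINAL", "LINK_ERROR"): ("RETRYING", ["RETRY_LINK"], 5),
    ("RETRYING", "LINK_OK"): ("NOMINAL", [], 5),
}, parent=regional)

handled_by = [local.handle(e) for e in ("LINK_ERROR", "NODE_DEAD", "LINK_OK")]
```

The local manager resolves what it can; anything it has no transition for is passed to the regional manager, mirroring the local/regional split in the figure.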
17 System Generation
[Figure: system generation. Inputs: Algorithms, Resources. Outputs: SW Loads, Comm Maps, Schedules, Boot Maps, Task Assign, OS Cfg]
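To make the idea of generating artifacts from the algorithm and resource models concrete, here is a toy generator that derives a task assignment and a communication map. Round-robin placement is a placeholder for whatever allocation the real tool performs, and the task and processor names are hypothetical.

```python
def generate(tasks, flows, processors):
    """Derive a task assignment and a comm map from the two models."""
    # Task Assign: round-robin placement, a stand-in for a real allocator.
    assign = {t: processors[i % len(processors)] for i, t in enumerate(tasks)}
    # Comm Map: a flow needs a link only if its endpoints sit on different processors.
    comm_map = sorted({(assign[a], assign[b]) for a, b in flows
                       if assign[a] != assign[b]})
    return assign, comm_map

tasks = ["unpack", "track", "vertex", "decide"]
flows = [("unpack", "track"), ("track", "vertex"), ("vertex", "decide")]
assign, comm_map = generate(tasks, flows, ["dsp0", "dsp1"])
```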
18 Generation of Reflex Networks
[Figure: System Fault State machine with State A, State B, and State C; transitions labeled Action AB1, Action AB2, Action AB3 and Action AC1, Action AC2, Action AC3. Other labels: Primary Struct., Reflex Struct.]
Reflex Scripts: 1 set for each processor and failure type
ON (L76 Fail) DO Del P1 Conn P1,S22 Map S22, C3 Kill T22 Migrate T33,
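Because one set of reflex scripts is produced per processor and failure type, the scripts can be generated ahead of time as a lookup table. The sketch below assumes a simple rule (migrate every task on the failed processor to a designated spare) and renders entries in an ON ... DO ... form loosely modeled on the slide; the exact script grammar is not given in the source, and the `assign` and `spare_of` inputs are hypothetical.

```python
def generate_reflex_table(assign, spare_of, failure_types=("Fail",)):
    """Build {(processor, failure_type): [action, ...]} ahead of time."""
    table = {}
    for proc in sorted(set(assign.values())):
        for ftype in failure_types:
            actions = []
            for task in sorted(t for t, p in assign.items() if p == proc):
                actions.append(f"Migrate {task},{spare_of[proc]}")
            table[(proc, ftype)] = actions
    return table

def render(table):
    """Render each entry as text, one script per (processor, failure type)."""
    return [f"ON ({proc} {ftype}) DO " + " ".join(actions)
            for (proc, ftype), actions in sorted(table.items())]

assign = {"T22": "C1", "T33": "C1", "T40": "C2"}    # task -> processor (made up)
spare_of = {"C1": "C3", "C2": "C3"}                 # processor -> spare (made up)
scripts = render(generate_reflex_table(assign, spare_of))
```

Precomputing the table means the runtime only has to look up and execute a script when a fault is reported, with no planning on the critical path.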
19 Model-Based Healing
[Figure: MIC Healing Controller loop. Labels: Nominal, Faults, Update Model, Re-Balance, New Reflex, Reflex, Interface, System Hardware]
20 Runtime Environment
[Figure: three-level runtime hierarchy; Actions and Feed Back links connect adjacent levels.]
- Global Manager: Model Interface, Experiment Interface, Mitigation Engine, Reflex Actions
- Regional Manager: Mitigation Engine, Reflex Actions
- Local Manager: Mitigation Engine, Reflex Actions, DSP Kernel, DSP Hardware
21 Fault Mitigation Interface
- Fault Mitigation Interface
  - The FMA interfaces with the local diagnostics facility (receive local status, clear errors, trigger rediagnosis, set diagnosis mode, etc.)
- Commands
  - RETRY_LINK(link_id)
    - Function: Reset/resync a comm link
    - Returns: failure or success
  - REROUTE_LINK(link_id)
    - Function: Reroute communications through a separate link
  - ADD_TASK(task_id, link_id)
    - Function: Adds a task to the task list, to operate on data from link_id
  - TEST_MEMORY(memory_bank)
    - Function: Intensive test on memory bank
  - RELOCATE_DATA(from_bank, to_bank)
    - Function: Moves data, marks source memory bank as unused/unavail
  - GET_LOCAL_STATUS
    - Function: Reports status of a resource on a local node
  - SEND_MESSAGE
  - RECEIVE_MESSAGE
  - . . .
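One way to picture how mitigation scripts use these commands is a small in-memory node that implements a few of them and a dispatcher that runs a list of commands by name. Only the command names and their listed arguments come from the slide; the `SimulatedNode` class, the return values, the status format, and the assumption that a retry always succeeds are invented for illustration.

```python
class SimulatedNode:
    """In-memory stand-in for a node exposing a few of the listed commands."""
    def __init__(self, links, banks):
        self.links = dict(links)        # link_id -> "ok" or "error"
        self.banks = dict(banks)        # bank -> {"data": ..., "avail": bool}
        self.tasks = []

    def RETRY_LINK(self, link_id):
        self.links[link_id] = "ok"      # pretend the resync always works
        return True                     # success (False would mean failure)

    def ADD_TASK(self, task_id, link_id):
        self.tasks.append((task_id, link_id))

    def RELOCATE_DATA(self, from_bank, to_bank):
        self.banks[to_bank]["data"] = self.banks[from_bank]["data"]
        self.banks[from_bank] = {"data": None, "avail": False}   # mark source unavailable

    def GET_LOCAL_STATUS(self):
        return {"links": dict(self.links),
                "unavail_banks": [b for b, v in self.banks.items() if not v["avail"]]}

def execute(node, script):
    """Run a list of (command, args) pairs, dispatching by command name."""
    return [getattr(node, cmd)(*args) for cmd, args in script]

node = SimulatedNode({"L0": "error"},
                     {"B0": {"data": [1, 2], "avail": True},
                      "B1": {"data": None, "avail": True}})
results = execute(node, [("RETRY_LINK", ("L0",)),
                         ("RELOCATE_DATA", ("B0", "B1")),
                         ("ADD_TASK", ("T1", "L0")),
                         ("GET_LOCAL_STATUS", ())])
```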
22 Synthesis Analysis/Offline
- Simulation
  - Functional (e.g. Matlab)
  - Performance (Timing, Discrete Event)
  - Interfacing/generating to Swarms/Jackal/TAEMS
- Diagnosability
  - Failure Modes, Sensors
  - Predict ability to Detect/Isolate Failures
- Reliability Analysis
  - Predict MTBF, Maximum Failures
- Robustness
  - Stability Analysis
  - Reconfiguration Strategies/Control System
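A back-of-the-envelope version of the MTBF prediction shows why failures must be treated as routine at this scale. The sketch assumes identical, independent nodes with constant failure rates, so the rates add and the time to first failure in the system is the per-node MTBF divided by the node count. The per-node MTBF of 50,000 hours is a made-up figure; the node count follows the roughly 2500 DSPs plus 2500 PCs mentioned earlier in the talk.

```python
def system_mtbf_hours(node_mtbf_hours, n_nodes):
    """Mean time to the first failure among n identical, independent nodes
    with constant failure rates (rates add, so MTBF divides by n)."""
    return node_mtbf_hours / n_nodes

def expected_failures(node_mtbf_hours, n_nodes, run_hours):
    """Expected number of node failures during a run, assuming failed nodes
    are repaired or replaced so the total failure rate stays constant."""
    return n_nodes * run_hours / node_mtbf_hours

NODE_MTBF = 50_000.0     # hours per node, hypothetical
N_NODES = 5000           # about 2500 DSPs plus 2500 PCs

mtbf = system_mtbf_hours(NODE_MTBF, N_NODES)
fails_per_day = expected_failures(NODE_MTBF, N_NODES, 24.0)
```

Under these assumptions some node fails about every 10 hours, or 2.4 times per day, even though each individual node would run for years.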
23 System Simulation
System Model
Task Model
Communication Model
24 Summary
- Developing Model-Based Approach
  - Capture Algorithm, Resource, and Mitigation Aspects
- Generation of Software
  - Normal Application Code
  - Fault Mitigation Code
- Two Fault Mitigation Approaches
  - Reflex: Fast, Limited Response
  - Healing: Slower, system re-design
- Analysis, Simulation
- Runtime Infrastructure