Title: RT FT CORBA Survey Results
1 RT FT CORBA Survey Results
- Bob Kukura
- Maureen Mayer
- 8/25/2004
- realtime/04-08-16
2 RT FT RFP Status
- Existing FT CORBA not widely implemented or used
- RT FT not addressed
- Existing CORBA FT products not compliant
- Draft RT FT CORBA RFP presented in November 2003
- Draft discussed in April 2004
- Too broad
- Need roadmap
- Informal survey proposed in June 2004
- Conducted by Raytheon
- Volunteers canvassed from RTESS, vendors, etc.
- Results to be presented anonymously in Montreal
- Here they are
3 A Collaborative Effort
- 22 Questions, 18 Respondents
- Represented companies include MITRE, Boeing,
Lockheed Martin, PrismTech, Navy NSWC, Open
Group, BAE Systems, DARPA, Raytheon, Telcordia,
CMU SEI, Borland, BBN, Semantic Designs, and
IONA
- Applications represented include Commercial
(including mentions of Automotive Financial),
C4I, Ship, Radar, Telecommunications and
Avionics.
- Following slides show questions, answers, and
interesting comments from respondents.
- Our observations follow
4 Question 1
- Characterize the system as hard real-time, soft
real-time, or non-real-time and what does that
mean to you?
- NRT - 3, SRT - 4, HRT - 4, SRT/HRT - 4, SRT/NRT - 1,
NRT/SRT/HRT - 2.
- Do deadlines apply to distributed invocations in
both normal and fault cases?
- Yes - 8, No - 2.
- Do deadlines apply to individual invocations, or
to an overall mission thread made up of a
series of steps?
- Both - 5, Thread - 6.
- How are CORBA invocation timeouts used?
- 11 don't use; 3 not sure; 2 use for particular
communication patterns; 2 use to detect failure
and trigger recovery logic.
- Comment: CORBA invocation timeouts are used
poorly. They are used to establish a deadline:
when the timer goes off, the process or component
is presumed dead. Unilateral detection of failures
in a distributed system is an unreliable fault
detection method.
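The comment above objects to declaring a component dead on the strength of one client's timeout. As a minimal sketch of one possible alternative (not something the survey respondents described), a timeout can be recorded as a suspicion only, with failure declared once a majority of observers agree. The class `SuspicionBoard`, its method names, and the majority rule are all assumptions made for illustration.

```python
from collections import defaultdict


class SuspicionBoard:
    """Collects timeout suspicions from several observers (hypothetical helper).

    A single observer's timeout only marks a target as suspected; the target
    is declared failed once a strict majority of observers suspect it.
    """

    def __init__(self, observers):
        self.observers = set(observers)
        self.suspicions = defaultdict(set)  # target -> observers suspecting it

    def report_timeout(self, observer, target):
        if observer not in self.observers:
            raise ValueError(f"unknown observer: {observer}")
        self.suspicions[target].add(observer)

    def report_reply(self, observer, target):
        # A successful reply withdraws that observer's suspicion.
        self.suspicions[target].discard(observer)

    def is_declared_failed(self, target):
        return len(self.suspicions[target]) * 2 > len(self.observers)


board = SuspicionBoard(["client_a", "client_b", "client_c"])
board.report_timeout("client_a", "track_server")
one_vote = board.is_declared_failed("track_server")   # one timeout is not enough
board.report_timeout("client_b", "track_server")
two_votes = board.is_declared_failed("track_server")  # a majority now suspects
```

A real detector would also have to handle observers that are themselves partitioned from each other, which this sketch ignores.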
5 Question 2
What types of faults (host or process crash,
network partition, lost messages, missed
deadlines, software bug, etc.) need to be
tolerated? What types are explicitly not of
concern? What about multiple faults?
- Comments
- Most important failures are clustered in space
and time.
- Protocols should be written assuming that
everything is broken all the time, instead of
starting from the happy path.
- 7 Multiple (1 roll up, 2 of secondary
concern, 1 not simultaneously)
- 6 care about All (except SW bugs - 3)
- 5 NW Partitions
- 4 Processor or Process Crash (1 only some)
- 4 Lost Messages
- 3 Missed Deadline (1 only some; 3 NRTs don't
care)
- 3 Battle Damage
- 1 each (HW faults, Object Failure, Common Mode,
Msgs out of Order, Out of Range Data)
6 Question 3
- What types of operating systems and languages are
involved?
- Is the deployment environment highly
resource-constrained (i.e. embedded)?
- Resource Constrained: Yes - 8, No - 5.
- Embedded: Yes - 2, No - 4.
- Comment: If timing requirements are not being
met, throwing more processors at the problem won't
help without a change in SW Architecture.
7 Question 4
- Is an RT CORBA implementation used? If so, what
features are used? If not, why?
- No - 5
- Yes - 13
- Comments
- Priorities should be used for performance and for
fine thread tuning but not for correct behavior
because otherwise the application is not
portable.
- RT CORBA is used for human computer interface
only because it can't meet HRT requirements.
- Used to establish system wide priorities. The
ability to use dynamic scheduling would be good
but it is not currently available.
- RT CORBA is used only in facets of the system.
8 Question 5
- Is an FT CORBA implementation used? If so, what
features are used (e.g. Property Management,
Replication Management, Fault Detection and
Notification, Logging and Recovery)? If not, why?
- No - 18
9 Question 6
- Are ORBs from multiple vendors involved?
- Yes - 11
- No - 7
- Is interoperability from a client ORB to a
different vendor's server required to tolerate
faults?
- Yes - 6
- No - 11
- Not Sure - 1
- Are the requirements different than when the
client and server ORBs are from the same vendor?
- Yes - 6
- No - 11
- Not Sure - 1
10 Question 7
- Are other fault tolerant infrastructures (DBs,
networks, OSes, other middleware, etc) used in
conjunction with CORBA?
- Yes - 14
- DBs - 6
- Network - 2
- OS - 2 (including 1 RADEX)
- In House Development - 1
- No - 3
- NA - 1
- Comment
- It would be good if FT CORBA could provide a
mechanism to failover to other communication
links.
11 Question 8
- Are services replicated for fault tolerance?
- Yes - 14
- No - 4
- Are these coarse-grained service interfaces or
fine-grained object interfaces?
- Very Coarse (Whole System) - 1
- Coarse - 8
- Medium - 1
- Fine - 4
- All - 1
- Are chained invocations (where server is also
client) used?
- Yes - 12
- Eliminated using Staged Arch (HRT sys) where each
stage has pure clients/servers. - 1
- NA - 5
12 Question 9
- At what granularity do failovers occur
(datacenter, host, process, container, ORB, POA,
object, etc.)?
13 Question 10
- What replication style (active, warm passive,
cold passive, etc.) is used? Why?
- Comments
- Active because of speed to recover - 3.
- When you expect a lot of failures you use active.
When you don't expect a lot of failures and can
afford the slower recovery time you use passive.
- Passive is less touchy and easier to
implement.
- Cost drives the choice, including:
- Criticality over time and space
- Dollars
- CPU Availability
- Behavior Over Time (e.g. mission apps only
critical for one mode).
- Active replication with FT CORBA can have no
out-of-band communications unless you use
application-controlled consistency, at which point
so much development work is required that you may
as well not have used CORBA.
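The trade-off the respondents describe between active and passive styles can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not FT CORBA API usage: `Counter`, `ActiveGroup`, and `WarmPassiveGroup` are hypothetical names, the service is assumed deterministic, and the passive group transfers state after every update purely to keep the example short.

```python
class Counter:
    """Toy deterministic service used as the replicated object."""

    def __init__(self):
        self.value = 0

    def add(self, n):
        self.value += n
        return self.value


class ActiveGroup:
    """Active style: every live replica executes every request."""

    def __init__(self, size):
        self.replicas = [Counter() for _ in range(size)]

    def add(self, n):
        results = [r.add(n) for r in self.replicas]
        return results[0]  # replies match only if the service is deterministic

    def fail_one(self):
        self.replicas.pop(0)  # survivors already hold the current state


class WarmPassiveGroup:
    """Warm passive style: only the primary executes; backups receive state."""

    def __init__(self, size):
        self.primary = Counter()
        self.backups = [Counter() for _ in range(size - 1)]

    def add(self, n):
        result = self.primary.add(n)
        for b in self.backups:
            b.value = self.primary.value  # state transfer after each update
        return result

    def fail_one(self):
        self.primary = self.backups.pop(0)  # promote a backup to primary


active, passive = ActiveGroup(3), WarmPassiveGroup(3)
for group in (active, passive):
    group.add(5)
    group.fail_one()
    group.add(2)
```

Both groups end in the same state after the failure; the difference is where the cost falls. The active group spends CPU on every replica for every request, while the passive group pays at failover time and for state transfer.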
14 Question 11
- Are replicated services stateful or stateless?
- Stateful 8
- Stateless 5
- Both 3 (includes 1 non-active only stateful, and
1 with 20-30% of services stateful)
- How important is maintaining consistency of state
among replicas?
- Important 7
- Sometimes 1
- DB Consistency is Important 1
- How is state consistency maintained?
- With time lag 2
- Application Transparent using protocols 1
- Checkpoint / Restore 5
- Active (Built-in) 1
- NA 4
- Is persistence of state required even when no
replicas are active?
- Yes 4
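Checkpoint / Restore was the most frequently cited way of maintaining state consistency. A minimal sketch of the idea follows, assuming a hypothetical `StableStore` that survives a replica crash and a hypothetical `TrackService` whose updates are logged before being applied; none of these names come from the survey or from the FT CORBA specification.

```python
import copy


class StableStore:
    """Stands in for storage that survives a replica crash (hypothetical)."""

    def __init__(self):
        self.checkpoint = {}
        self.log = []  # updates applied since the last checkpoint


class TrackService:
    """Toy stateful service that logs updates and takes checkpoints."""

    def __init__(self, store):
        self.store = store
        self.tracks = {}

    def update(self, track_id, position):
        self.store.log.append((track_id, position))  # log before applying
        self.tracks[track_id] = position

    def take_checkpoint(self):
        self.store.checkpoint = copy.deepcopy(self.tracks)
        self.store.log.clear()

    @classmethod
    def recover(cls, store):
        """Start a new replica from the last checkpoint, then replay the log."""
        replica = cls(store)
        replica.tracks = copy.deepcopy(store.checkpoint)
        for track_id, position in store.log:
            replica.tracks[track_id] = position
        return replica


store = StableStore()
primary = TrackService(store)
primary.update("T1", (10, 20))
primary.take_checkpoint()
primary.update("T2", (30, 40))
primary.update("T1", (11, 21))

backup = TrackService.recover(store)  # the primary is assumed to have crashed
```

Recovery time in this scheme grows with the length of the log, which is one reason checkpoint frequency becomes a tuning knob.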
15 Question 12
- Are replicated service implementations
multi-threaded?
- Yes - 13
- Minimized to meet Comm I/O Requirements - 1
- Why?
- Throughput - 2, Efficiency/performance - 7
- What other sources of non-determinism (i.e. local
timers, non-CORBA events, hardware interfaces)
exist?
- (See Diagram to Right)
- Comments
- SW Arch Rule: Implementing a CORBA call will
spawn a thread that doesn't block the client.
- Goal is to meet real time deadlines even when
losing a track file.
- One philosophy is to have as much concurrency as
possible to provide better performance and lower
level of granularity.
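One reading of the architecture rule quoted above is that an invocation is handed to a worker thread so the caller can keep running and collect the reply later. The sketch below shows that pattern with Python's standard `concurrent.futures`; `slow_invocation` is a hypothetical stand-in for a remote call, and nothing here is CORBA-specific.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_invocation(x):
    """Stands in for a remote call that takes a while (hypothetical)."""
    time.sleep(0.05)
    return x * 2


with ThreadPoolExecutor(max_workers=4) as pool:
    started = time.monotonic()
    future = pool.submit(slow_invocation, 21)   # hands the call to a worker thread
    submit_cost = time.monotonic() - started    # time the caller spent submitting
    result = future.result(timeout=1.0)         # block only when the reply is needed
```

The concurrency that makes this responsive is also a source of non-determinism: the order in which worker threads complete is not fixed, which matters when replicas must stay consistent.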
16 Question 13
- Is CCM or any other component framework used?
What services related to fault tolerance does it
provide?
- Yes - 8 (CCM - 3, J2EE - 2, CCM Derivative - 2,
Component Framework - 1)
- No - 10
- Are CORBA services such as naming, trader, or
events used?
- Yes - 11
- Notification 1
- Naming 11
- Trader 5
- Event 4
- No 7
- Are these fault-tolerant?
- Yes - 4
- No - 8
17 Question 14
- How are faults detected and recovery initiated?
See Chart
- At what granularities are faults detected?
- 8 Process, 1 Component, 3 Dependent on Fault
Type, 1 High Level, 2 Host, 1 Data Center
- Is this middleware-specific or global?
- Global 10
- Do these aspects need to be pluggable?
- Yes 3, No 4, Application Dependent 3
- How are dependencies handled?
- 5 Configuration/Design
- 1 Application Management Tool
- 3 Unknown / Not Considered
- Comment: Used probabilistic heuristic trees, an
RM Grammar, Borland Deployment Op-Center or
Higher Level Models as dependency tools.
18 Question 15
- To what extent is client application code
involved in recovering from failures?
- 3 None
- 4 Some
- 7 High
- Are application-transparent exactly-once
semantics needed (1), or are at-most-once (1) or
at-least-once semantics (3) sufficient in the
presence of faults? All (2).
- Are there safety issues or other issues that
require handling certain faults at application
level?
- Yes 11
- Comments
- Currently FT products are geared toward data
servers which are quite different from radar
applications.
- Whether or not the system server is idempotent
determines which semantics are used.
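The link between idempotence and invocation semantics can be made concrete. In the sketch below, which rests on assumptions not taken from the survey, a client retries blindly (at-least-once delivery). An idempotent operation tolerates the retries as-is, while a non-idempotent one needs server-side duplicate suppression keyed by a request id to get an exactly-once effect. `Account`, `call_with_retry`, and the request-id scheme are hypothetical.

```python
class Account:
    """Toy server with one idempotent and one non-idempotent operation."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> cached reply, for duplicate suppression

    def set_balance(self, amount):
        # Idempotent: repeating the call leaves the same state.
        self.balance = amount
        return self.balance

    def deposit(self, request_id, amount):
        # Not idempotent, so a retried request must not be applied twice.
        if request_id in self.replies:
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


def call_with_retry(operation, attempts=3):
    """At-least-once client: behaves as if every reply but the last was lost."""
    reply = None
    for _ in range(attempts):
        reply = operation()
    return reply


account = Account()
call_with_retry(lambda: account.set_balance(100))
final = call_with_retry(lambda: account.deposit("req-1", 25))
```

Without the cached reply, the three delivery attempts of the deposit would each be applied, which is why at-least-once alone is only sufficient for idempotent servers.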
19 Question 16
- Are resource assignments and fault tolerance
properties set statically or dynamically?
- Static 10 (2 having some dynamic properties)
- Dynamic 4
- Both 2
- Do they vary with changing modes of operation?
- Yes 9
- No 2
- How are they determined? Design 7, Proprietary 3
- Can hardware be added dynamically? Yes 6, No 6
- Are services expected to be continuously
available?
- Yes 11
- No 2
20 Question 17
- How are tradeoffs between meeting deadlines and
maintaining consistency in the presence of faults
handled?
- Case-by-Case Basis - 1
- Design - 5 (2 They Aren't)
- Stored Doctrine - 1 (1 Hierarchical Mode
Driven)
- To what extent can performance or resource
utilization be traded off against fault
tolerance?
- RM handles - 1
- On a missed-deadline fault, the system is
designed to continue - 2
- FT Recovery Deadlines prevail over RT deadlines,
which are pushed aside, upon failure - 1
- Comments
- Resource Utilization and Meeting Deadlines have a
higher priority than FT
- FT should not add too much overhead
21 Question 18
- What features of the FT CORBA specification (e.g.
Property Management, Replication Management,
Fault Detection and Notification, Logging and
Recovery) would be most valuable if only they
were available and usable in the ORB
implementations you use?
- Provide an FT CORBA which has
- Process Level (1 Multilevel inc. Object)
Replication - 5
- Higher Level Fault Detection - 4
- Replication Management - 3, if HRT ORB
available - 1
- Fault Detection and Notification - 2, Will always
implement own - 1
- Logging and Recovery - 3
- A good checkpoint/recovery service (options for
boundaries: periodic and on event (e.g. out of
band communication)) - 2
- Priority awareness - 1
- A toolkit containing knobs, switches, and options
(granularity, policies) rather than a
take-it-or-leave-it approach - 3
- An option to not replay CORBA invocations on
Recovery - 1
- Deals with Databases and the replication of
them - 1
- Better control of non-determinism - 1
22 Question 19
- What additions or improvements to the current FT
CORBA specification would be most valuable?
- Provide new heartbeat (HB) mechanism (auto HB
between Processes) - 2
- Specifiable degree of determinism
- Have it meet deadlines while meeting FT needs
- Allow replication and recovery across multiple
LANs
- Application Transparent Fault Isolation
(Detection and Identification)
- Isolate network partition faults
- Application specifies fault criteria and leads
recovery
- Simplify replication to passive and cold restart
- Add semi-active replication
- Guidance on maintaining state consistency in the
presence of non-determinism
- Fault detection with minimal network impact
- Automatic slave promotion
- Optimize startup time upon state retrieval from
persistent database
- Allow for different platforms and OSes within
groups (different OSes don't work for PSS DB
abstraction layer)
- Be pluggable
- Address transactions: how to tolerate failures
of these
- Mode Driven
- Fault tolerant ORB services
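Several of the requests above (an automatic heartbeat between processes, fault detection with minimal network impact) revolve around the same basic mechanism. A minimal sketch of a heartbeat monitor is shown here under stated assumptions: `HeartbeatMonitor` and its methods are hypothetical, and timestamps are passed in explicitly so the example is deterministic, where a real monitor would read a monotonic clock.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per process (hypothetical helper)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, process, now):
        self.last_seen[process] = now

    def suspects(self, now):
        """Return processes whose last heartbeat is older than the timeout."""
        return sorted(p for p, t in self.last_seen.items()
                      if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=2.0)
monitor.beat("radar_proc", now=0.0)
monitor.beat("display_proc", now=0.0)
monitor.beat("radar_proc", now=1.5)
late = monitor.suspects(now=3.0)  # display_proc last reported 3.0 s ago
```

The timeout and the heartbeat period trade detection latency against network load, which is the tension behind the "minimal network impact" request.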
23 Question 20
- What specific RFPs related to fault tolerance
would you like to see issued by the OMG in the
near-term or medium-term future?
- I live in fear of OMG RFPs
- RT-FT RFP impractical: one size fits all versus
state of the art (tech. immature) lego-block
fault tolerant communities
- FT needs to be reconciled with Load Distribution
RFP first
- One issued by SBC DTF to support FT in SW Radio
space
- RT transaction specification separates into 4
components of CORBA ones
- Reduced language mappings for RT FT CORBA
services
- RT FT RFP with
- Merge RT-FT spec with minimal work while
satisfying community
- Provide Object image consistency within delta T
- Clean FT schema that covers 80-90% of
applications' reliability requirements.
- Replace interfaces with run-time policies set
through configuration which is multilevel and
managed at the appropriate level.
- Interoperability of Mechanisms across ORBs.
- Database system based FT
- Better use of QoS and Mode driven FT
24 Question 21
- Is proactive dependability a feature that would
be of interest to you and your customer?
- As a cost constrained low priority - 4
- Yes - 13 (includes 3 with trust maturity, 1 at
application level)
- Comments
- Currently do it with a diagnostic thresholding
scheme.
- Not realistic
- How would it handle faults outside of the
middleware?
- At system level to be used for diagnostics
25 Question 22
- Rank the following seven items in terms of
importance to your customer for the most critical
part of your application:
- Ease of implementation to the application
developer
- Fast Execution Time
- Bounding Execution Times
- Fast Recovery Time
- Bounding Recovery
- State Synchronization
- Efficient use of Resources
26 Average Rankings by System Type
Logical clusters: respondents are clustered by the
perceived intended use of the system, as determined
using questions 1-3.
Recall that lower ranking is more important!
27 Observations
- Current FT CORBA spec has little relevance in
current practice
- RT FT systems are being built on RT CORBA
- Unit of failover is typically process or host,
not object
- Passive replication is more commonly used than
active
- Active replication perceived as more capable, but
too hard
- Majority of applications described as soft real
time
- COTS platforms common
- Requirements vary dramatically
- Toolkit approaches are preferred
- Everyone needs fast normal execution
- Fast recovery most important in soft real time
- Some need interoperable RT FT
28 Potential RFP Topics
- Lightweight Real Time Fault Tolerant CORBA
- Passive replication
- No message logging or checkpointing
- Coarse grained (ORB or process) failover
- Integrate ORB with external resource management
- Interoperable RT Active CORBA Replication
- RT FT Group GIOP
- Real Time Fault Tolerant CCM
- Real Time Fault Tolerant Transactions
- Real Time Fault Tolerant State Management
- Real Time Fault Tolerant Resource Management
- Or leave to domains (i.e. C4I's CMS Application
Management RFP)?