RT FT CORBA Survey Results - PowerPoint PPT Presentation

1
RT FT CORBA Survey Results

Bob Kukura
Maureen Mayer
8/25/2004
realtime/04-08-16

2
RT FT RFP Status

Existing FT CORBA not widely implemented or used
RT FT not addressed
Existing CORBA FT products not compliant
Draft RT FT CORBA RFP presented in November 2003
Draft discussed in April 2004
Too broad
Need roadmap
Informal survey proposed in June 2004
Conducted by Raytheon
Volunteers canvassed from RTESS, vendors, etc.
Results to be presented anonymously in Montreal
Here they are

3
A Collaborative Effort

22 Questions, 18 Respondents
Represented companies include MITRE, Boeing,
Lockheed Martin, PrismTech, Navy NSWC, Open
Group, BAE Systems, DARPA, Raytheon, Telcordia,
CMU SEI, Borland, BBN, Semantic Designs, and
IONA
Applications represented include Commercial (including mentions of Automotive and Financial), C4I, Ship, Radar, Telecommunications, and Avionics.
The following slides show questions, answers, and interesting comments from respondents.
Our observations follow.

4
Question 1

Characterize the system as hard real-time, soft real-time, or non-real-time, and what does that mean to you?
NRT - 3, SRT - 4, HRT - 4, SRT/HRT - 4, SRT/NRT - 1, NRT/SRT/HRT - 2
Do deadlines apply to distributed invocations in both normal and fault cases?
Yes - 8, No - 2
Do deadlines apply to individual invocations, or to an overall mission thread made up of a series of steps?
Both - 5, Thread - 6
How are CORBA invocation timeouts used?
Don't use - 11, Not sure - 3, Used for particular communication patterns - 2, Used to detect failure and trigger recovery logic - 2
Comment: CORBA invocation timeouts are used poorly. They are used to establish a deadline: when the timer goes off, the process or component is presumed dead. Unilateral detection of failures in a distributed system is an unreliable fault detection method.
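The comment about unilateral detection can be illustrated with a minimal Python sketch. It assumes a hypothetical `presumed_dead` helper that models a client treating an invocation timeout as a failure verdict; the latency and timeout values are made up for illustration and are not survey data. Two clients with different timeouts reach opposite conclusions about the same slow but live server.

```python
from typing import Optional


def presumed_dead(reply_latency_s: Optional[float], timeout_s: float) -> bool:
    """Hypothetical model of a client using an invocation timeout as a failure detector.

    reply_latency_s is None when the server has crashed and never replies.
    """
    if reply_latency_s is None:
        return True  # crashed server: the timeout fires and the verdict is correct
    # Live server: the verdict depends only on this client's local timeout
    return reply_latency_s > timeout_s


# An overloaded but live server, observed by two clients with different timeouts
server_latency_s = 0.8
client_timeouts_s = {"client_a": 0.5, "client_b": 1.0}
views = {name: presumed_dead(server_latency_s, t) for name, t in client_timeouts_s.items()}
print(views)  # {'client_a': True, 'client_b': False}
```

Because each verdict is made locally, client_a may start recovery while client_b keeps using the original server; an agreement or group-membership step across the system is one common way to avoid such split decisions.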

5
Question 2
What types of faults (host or process crash,
network partition, lost messages, missed
deadlines, software bug, etc.) need to be
tolerated? What types are explicitly not of
concern? What about multiple faults?

Comments
Most important failures are clustered in space
and time.
Protocols should be written assuming everything is broken all the time, instead of designing for the happy path first.

7 Multiple (1 roll up, 2 of secondary concern, 1 not simultaneously)
6 care about All (3 except SW bugs)
5 NW Partitions
4 Processor or Process Crash (1 only some)
4 Lost Messages
3 Missed Deadline (1 only some; the 3 NRTs don't care)
3 Battle Damage
1 each (HW Faults, Object Failure, Common Mode, Msgs out of Order, Out of Range Data)

6
Question 3

What types of operating systems and languages are
involved?

Is the deployment environment highly
resource-constrained (i.e. embedded)?
Resource Constrained: Yes - 8, No - 5
Embedded: Yes - 2, No - 4
Comment: If timing requirements are not being met, throwing more processors at the problem won't help without a change in SW architecture.

7
Question 4

Is an RT CORBA implementation used? If so, what
features are used? If not, why?
No - 5
Yes - 13

Comments
Priorities should be used for performance and for fine thread tuning, but not for correct behavior, because otherwise the application is not portable.
RT CORBA is used for the human-computer interface only, because it can't meet HRT requirements.
Used to establish system-wide priorities. The ability to use dynamic scheduling would be good, but it is not currently available.
RT CORBA is used only in some facets of the system.

8
Question 5

Is an FT CORBA implementation used? If so, what features are used (e.g. Property Management, Replication Management, Fault Detection and Notification, Logging and Recovery)? If not, why?
No - 18

9
Question 6

Are ORBs from multiple vendors involved?
Yes - 11
No - 7
Is interoperability from a client ORB to a different vendor's server required to tolerate faults?
Yes - 6
No - 11
Not Sure - 1
Are the requirements different than when the
client and server ORBs are from the same vendor?
Yes - 6
No - 11
Not Sure - 1

10
Question 7

Are other fault tolerant infrastructures (DBs,
networks, OSes, other middleware, etc) used in
conjunction with CORBA?
Yes - 14
DBs - 6
Network - 2
OS - 2 (including 1 RADEX)
In House Development - 1
No - 3
NA - 1
Comment
It would be good if FT CORBA could provide a
mechanism to failover to other communication
links.

11
Question 8

Are services replicated for fault tolerance?
Yes - 14
No - 4
Are these coarse-grained service interfaces or
fine-grained object interfaces?
Very Coarse (Whole System) - 1
Coarse - 8
Medium - 1
Fine - 4
All - 1
Are chained invocations (where server is also
client) used?
Yes - 12
Eliminated using a staged architecture (HRT system) in which each stage has pure clients/servers - 1
NA - 5

12
Question 9

At what granularity do failovers occur
(datacenter, host, process, container, ORB, POA,
object, etc.)?

13
Question 10

What replication style (active, warm passive,
cold passive, etc.) is used? Why?

Comments
Active, because of speed of recovery - 3
When you expect a lot of failures, you use active. When you don't expect a lot of failures and can afford the slower recovery time, you use passive.
Passive is less touchy and easier to implement.
Cost drives the choice, including:
Criticality over time and space
Dollars
CPU availability
Behavior over time (e.g. mission apps only critical for one mode)
Active replication with FT CORBA can have no out-of-band communications unless you use application-controlled consistency, at which point so much development work is required that you may as well not have used CORBA.
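The recovery-speed argument for active versus passive replication can be sketched in Python. This is a toy model under stated assumptions, not the FT CORBA Replication Management interfaces: the service is a simple counter, and `ActiveReplica`, `PassiveBackup`, and `checkpoint_every` are hypothetical names. The active replica executes every request, so it has nothing to catch up on at failover; the warm passive backup holds only the last checkpoint plus a log of later requests, which it must replay before taking over.

```python
class ActiveReplica:
    """Hypothetical active replica: executes every request, in the same order as the primary."""

    def __init__(self):
        self.state = 0

    def deliver(self, amount):
        self.state += amount

    def failover(self):
        # Already up to date: nothing to replay
        return self.state, 0  # (recovered state, requests replayed)


class PassiveBackup:
    """Hypothetical warm passive backup: last checkpoint plus a log of later requests."""

    def __init__(self, checkpoint_every):
        self.checkpoint_every = checkpoint_every
        self.checkpoint = 0
        self.log = []

    def record(self, amount, primary_state):
        # Log the request; every checkpoint_every requests, store the primary's state
        # as the new checkpoint and truncate the log
        self.log.append(amount)
        if len(self.log) == self.checkpoint_every:
            self.checkpoint = primary_state
            self.log.clear()

    def failover(self):
        # Restore the checkpoint, then replay logged requests to catch up
        state = self.checkpoint
        for amount in self.log:
            state += amount
        return state, len(self.log)


primary_state = 0
active = ActiveReplica()
passive = PassiveBackup(checkpoint_every=4)
for amount in [5, 1, 2, 7, 3, 4]:
    primary_state += amount                # primary executes the request
    active.deliver(amount)                 # active replica executes it as well
    passive.record(amount, primary_state)  # passive backup only logs it
# Suppose the primary crashes at this point
print(active.failover())   # (22, 0)
print(passive.failover())  # (22, 2)
```

Both replicas recover the same state, but the passive backup pays for catching up at failover time, and that cost grows with `checkpoint_every`. Replaying the log reproduces the primary's state only if request processing is deterministic.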

14
Question 11

Are replicated services stateful or stateless?
Stateful 8
Stateless 5
Both 3 (includes 1 non-active only stateful, and 1 with 20-30% of service stateful)
How important is maintaining consistency of state
among replicas?
Important 7
Sometimes 1
DB Consistency is Important 1
How is state consistency maintained?
With time lag 2
Application Transparent using protocols 1
Checkpoint / Restore 5
Active (Built-in) 1
NA 4
Is persistence of state required even when no
replicas are active? Yes 4

15
Question 12

Are replicated service implementations
multi-threaded?
Yes - 13
Minimized to meet Comm I/O Requirements - 1
Why?
Throughput - 2, Efficiency/performance - 7
What other sources of non-determinism (e.g. local timers, non-CORBA events, hardware interfaces) exist?
(See Diagram to Right)
Comments
SW architecture rule: implementing a CORBA call will spawn a thread that doesn't block the client.
The goal is to meet real-time deadlines even when losing a track file.
One philosophy is to have as much concurrency as possible to provide better performance and a lower level of granularity.

16
Question 13

Is CCM or any other component framework used?
What services related to fault tolerance does it
provide?
Yes - 8 (CCM - 3, J2EE - 2, CCM Derivative - 2, Component Framework - 1)
No - 10
Are CORBA services such as naming, trader, or events used?
Yes - 11
Notification 1
Naming 11
Trader 5
Event 4
No 7
Are these fault-tolerant?
Yes - 4
No - 8

17
Question 14

How are faults detected and recovery initiated?
See Chart
At what granularities are faults detected?
8 Process, 1 Component, 3 Dependent on Fault Type, 1 High Level, 2 Host, 1 Data Center
Is this middleware-specific or global?
Global 10
Do these aspects need to be pluggable?
Yes 3, No 4, Application Dependent 3
How are dependencies handled?
5 Configuration/Design
1 Application Management Tool
3 Unknown / Not Considered
Comment: Probabilistic heuristic trees, an RM grammar, Borland Deployment Op-Center, or higher-level models are used as dependency tools.

18
Question 15

To what extent is client application code
involved in recovering from failures?
3 None
4 Some
7 High
Are application-transparent exactly-once
semantics needed (1), or are at-most-once (1) or
at-least-once semantics (3) sufficient in the
presence of faults? All (2).
Are there safety issues or other issues that
require handling certain faults at application
level? Yes 11
Comments
Currently FT products are geared toward data
servers which are quite different from radar
applications.
Whether or not the system server is idempotent determines which semantics are used.
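The link between idempotency and delivery semantics can be sketched in Python. The server, its `set_value` and `increment` operations, and the request-ID table are hypothetical illustrations, not a CORBA API. Under at-least-once semantics a lost reply makes the client resend the request, so the server may execute it twice: harmless for an idempotent operation, wrong for a non-idempotent one. Caching replies by request ID is one common way to suppress the duplicate execution.

```python
class Server:
    """Hypothetical server with one idempotent and one non-idempotent operation."""

    def __init__(self, dedupe=False):
        self.value = 0
        self.dedupe = dedupe
        self.seen = {}  # request_id -> cached reply (used only when dedupe is True)

    def handle(self, request_id, op, arg):
        if self.dedupe and request_id in self.seen:
            return self.seen[request_id]  # duplicate: return cached reply, skip re-execution
        if op == "set_value":
            self.value = arg   # idempotent: repeating it leaves the same state
        elif op == "increment":
            self.value += arg  # not idempotent: every repetition changes the state
        else:
            raise ValueError(f"unknown operation: {op}")
        if self.dedupe:
            self.seen[request_id] = self.value
        return self.value


def call_with_retry(server, request_id, op, arg):
    """At-least-once client: the first reply is lost, so the same request is resent."""
    server.handle(request_id, op, arg)         # executed, but the reply never arrives
    return server.handle(request_id, op, arg)  # retry with the same request_id


results = {}
for op in ("set_value", "increment"):
    for dedupe in (False, True):
        server = Server(dedupe=dedupe)
        call_with_retry(server, request_id=1, op=op, arg=5)
        results[(op, dedupe)] = server.value
print(results)
# {('set_value', False): 5, ('set_value', True): 5, ('increment', False): 10, ('increment', True): 5}
```

Only the non-idempotent `increment` without de-duplication ends in the wrong state (10 instead of 5), which is why an idempotent server can settle for at-least-once semantics while a non-idempotent one needs something stronger.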

19
Question 16

Are resource assignments and fault tolerance
properties set statically or dynamically?
Static 10 (2 having some dynamic properties)
Dynamic 4
Both 2
Do they vary with changing modes of operation?
Yes 9
No 2
How are they determined? Design 7, Proprietary 3
Can hardware be added dynamically? Yes 6, No 6
Are services expected to be continuously
available?
Yes 11
No 2

20
Question 17

How are tradeoffs between meeting deadlines and
maintaining consistency in the presence of faults
handled?
Case-by-Case Basis - 1
Design - 5 (2: they aren't)
Stored Doctrine - 1 (1 Hierarchical, Mode Driven)
To what extent can performance or resource utilization be traded off against fault tolerance?
RM handles - 1
On a missed-deadline fault, the system is designed to continue - 2
FT recovery deadlines prevail over RT deadlines, which are pushed aside, upon failure - 1
Comments
Resource utilization and meeting deadlines have a higher priority than FT.
FT should not add too much overhead.

21
Question 18

What features of the FT CORBA specification (e.g. Property Management, Replication Management, Fault Detection and Notification, Logging and Recovery) would be most valuable if only they were available and usable in the ORB implementations you use?
Provide an FT CORBA which has:
Process-level replication (1 multilevel, incl. object) - 5
Higher-level fault detection - 4
Replication Management - 3, if HRT ORB available - 1
Fault Detection and Notification - 2, will always implement own - 1
Logging and Recovery - 3
A good checkpoint/recovery service (options for boundaries: periodic and on event, e.g. out-of-band communication) - 2
Priority awareness - 1
A toolkit containing knobs, switches, and options (granularity, policies) rather than a take-it-or-leave-it approach - 3
An option to not replay CORBA invocations on recovery - 1
Deals with databases and their replication - 1
Better control of non-determinism - 1

22
Question 19

What additions or improvements to the current FT
CORBA specification would be most valuable?
Provide new heartbeat (HB) mechanism (auto HB between processes) - 2
Specifiable degree of determinism
Have it meet deadlines while meeting FT needs
Allow replication and recovery across multiple
LANs
Application Transparent Fault Isolation
(Detection and Identification)
Isolate network partition faults
Application specifies fault criteria and leads
recovery
Simplify replication to passive and cold restart
Add semi-active replication
Guidance on maintaining state consistency in the
presence of non-determinism
Fault detection with minimal network impact
Automatic slave promotion
Upon state retrieval from a persistent database, optimize startup time
Allow for different platforms and OSes within groups (different OSes don't work for the PSS DB abstraction layer)
Be pluggable
Address transactions: how to tolerate failures of these
Mode-driven
Fault-tolerant ORB services
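The heartbeat request can be illustrated with a minimal Python sketch of the usual heartbeat-with-threshold idea. It assumes a hypothetical `HeartbeatMonitor` class driven by explicit time values (no real network, ORB, or timers): a process is suspected only after it has been silent for more than `max_missed` heartbeat periods, which tolerates an occasional late or lost heartbeat at the cost of slower detection.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: suspects a process after max_missed silent periods."""

    def __init__(self, period_s, max_missed):
        self.period_s = period_s
        self.max_missed = max_missed
        self.last_seen = {}  # process name -> time of its last heartbeat

    def heartbeat(self, process, now):
        self.last_seen[process] = now

    def suspects(self, now):
        # Suspect a process once it has been silent for more than max_missed periods
        limit = self.period_s * self.max_missed
        return sorted(p for p, t in self.last_seen.items() if now - t > limit)


monitor = HeartbeatMonitor(period_s=1.0, max_missed=3)
history = {}
for t in range(7):
    now = float(t)
    monitor.heartbeat("nav", now)        # "nav" keeps beating every second
    if t <= 1:
        monitor.heartbeat("radar", now)  # "radar" goes silent after t = 1
    history[t] = monitor.suspects(now)
print(history)
# {0: [], 1: [], 2: [], 3: [], 4: [], 5: ['radar'], 6: ['radar']}
```

At t = 4 "radar" has been silent for exactly three periods and is not yet suspected; it is suspected from t = 5 on. A heartbeat monitor is still a local judgment, so on its own it reduces, rather than removes, the unilateral-detection concern raised under Question 1.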

23
Question 20

What specific RFPs related to fault tolerance
would you like to see issued by the OMG in the
near-term or medium-term future?
I live in fear of OMG RFPs
RT-FT RFP impractical: one size fits all versus state-of-the-art (tech. immature) Lego-block fault-tolerant communities
FT needs to be reconciled with Load Distribution RFP first
One issued by SBC DTF to support FT in SW Radio space
RT transaction specification separates into 4 components of CORBA ones
Reduced language mappings for RT FT CORBA services
RT FT RFP with:
Merge RT-FT spec with minimal work while satisfying community
Provide object image consistency within delta T
Clean FT schema that covers 80-90% of applications' reliability requirements
Replace interfaces with run-time policies set
through configuration which is multilevel and
managed at the appropriate level.
Interoperability of Mechanisms across ORBs.
Database system based FT
Better use of QoS and Mode driven FT

24
Question 21

Is proactive dependability a feature that would
be of interest to you and your customer?
As a cost-constrained, low priority - 4
Yes - 13 (includes 3 with trust maturity, 1 at application level)
Comments
Currently do it with a diagnostic thresholding scheme.
Not realistic
How would it handle faults outside of the middleware?
At system level to be used for diagnostics

25
Question 22

Rank the following seven items in terms of importance to your customer for the most critical part of your application
Ease of implementation to the application
developer
Fast Execution Time
Bounding Execution Times
Fast Recovery Time
Bounding Recovery
State Synchronization
Efficient use of Resources

26
Average Rankings by System Type
Logical clusters are formed by the perceived intended use of the system, as determined using questions 1-3.
Recall that a lower ranking means more important!

27
Observations