RT FT CORBA Survey Results

1
RT FT CORBA Survey Results
  • Bob Kukura
  • Maureen Mayer
  • 8/25/2004
  • realtime/04-08-16

2
RT FT RFP Status
  • Existing FT CORBA not widely implemented or used
  • RT FT not addressed
  • Existing CORBA FT products not compliant
  • Draft RT FT CORBA RFP presented in November 2003
  • Draft discussed in April 2004
  • Too broad
  • Need roadmap
  • Informal survey proposed in June 2004
  • Conducted by Raytheon
  • Volunteers canvassed from RTESS, vendors, etc.
  • Results to be presented anonymously in Montreal
  • Here they are

3
A Collaborative Effort
  • 22 Questions, 18 Respondents
  • Represented companies include MITRE, Boeing,
    Lockheed Martin, PrismTech, Navy NSWC, Open
    Group, BAE Systems, DARPA, Raytheon, Telcordia,
    CMU SEI, Borland, BBN, Semantic Designs, and
    IONA
  • Applications represented include Commercial
    (including mentions of Automotive and Financial),
    C4I, Ship, Radar, Telecommunications, and
    Avionics.
  • Following slides show questions, answers, and
    interesting comments from respondents.
  • Our observations follow

4
Question 1
  • Characterize the system as hard real-time, soft
    real-time, or non-real-time and what does that
    mean to you?
  • NRT - 3, SRT - 4, HRT - 4, SRT+HRT - 4, SRT+NRT - 1,
    NRT+SRT+HRT - 2.
  • Do deadlines apply to distributed invocations in
    both normal and fault cases?
  • Yes - 8, No - 2.
  • Do deadlines apply to individual invocations, or
    to an overall mission thread made up of a
    series of steps?
  • Both - 5, Thread - 6.
  • How are CORBA invocation timeouts used?
  • 11 don't use; 3 not sure; 2 use for particular
    communication patterns; 2 use to detect failure
    and trigger recovery logic.
  • Comment: CORBA invocation timeouts are used
    poorly. They are used to establish a deadline:
    by the time the timer goes off, the process or
    component is presumed dead. Unilateral detection
    of failures in a distributed system is an
    unreliable fault detection method.
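
The comment above argues that one observer's timeout is a weak basis for declaring a peer dead. A minimal sketch of the alternative, assuming a timeout is treated only as a suspicion that must be corroborated by a majority of observers. `Observation`, `unilateral_verdict`, and `quorum_verdict` are hypothetical names for illustration, not part of any CORBA API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One observer's view of a target: did its invocation time out?"""
    observer: str
    timed_out: bool


def unilateral_verdict(obs: Observation) -> bool:
    """Declare the target dead as soon as a single observer times out."""
    return obs.timed_out


def quorum_verdict(observations: list[Observation]) -> bool:
    """Declare the target dead only if a strict majority of observers timed out."""
    if not observations:
        return False
    suspects = sum(1 for o in observations if o.timed_out)
    return suspects > len(observations) / 2


# Hypothetical scenario: only the link between client-a and the target is slow.
views = [
    Observation("client-a", timed_out=True),
    Observation("client-b", timed_out=False),
    Observation("client-c", timed_out=False),
]
print(unilateral_verdict(views[0]))  # client-a alone would presume the target dead
print(quorum_verdict(views))         # the group as a whole would not
```

In this scenario the unilateral check reports a failure that the other two observers do not see, which is the kind of false positive the respondent warns about.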

5
Question 2
What types of faults (host or process crash,
network partition, lost messages, missed
deadlines, software bug, etc.) need to be
tolerated? What types are explicitly not of
concern? What about multiple faults?
  • Comments
  • Most important failures are clustered in space
    and time.
  • Protocols should be written assuming that
    everything is broken all the time, instead of
    starting from the happy path.
  • 7 - Multiple (1 roll up, 2 of secondary
    concern, 1 not simultaneously)
  • 6 - care about All (except SW bugs - 3)
  • 5 - NW Partitions
  • 4 - Processor or Process Crash (1 only some)
  • 4 - Lost Messages
  • 3 - Missed Deadline (1 only some; 3 NRTs don't
    care)
  • 3 - Battle Damage
  • 1 each - HW faults, Object Failure, Common Mode,
    Msgs out of Order, Out of Range Data

6
Question 3
  • What types of operating systems and languages are
    involved?

  • Is the deployment environment highly
    resource-constrained (i.e. embedded)?
  • Resource Constrained: Yes - 8, No - 5
  • Embedded: Yes - 2, No - 4
  • Comment: If timing requirements are not being met,
    throwing more processors at the problem won't help
    without a change in SW architecture.

7
Question 4
  • Is an RT CORBA implementation used? If so, what
    features are used? If not, why?
  • No - 5
  • Yes - 13
  • Comments
  • Priorities should be used for performance and for
    fine thread tuning but not for correct behavior
    because otherwise the application is not
    portable.
  • RT CORBA is used for the human-computer interface
    only, because it can't meet HRT requirements.
  • Used to establish system-wide priorities. The
    ability to use dynamic scheduling would be good,
    but it is not currently available.
  • RT CORBA is used only in facets of the system.
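
The comment about system-wide priorities rests on RT CORBA's platform-independent priority range (0 to 32767), which each ORB maps onto native thread priorities. A minimal sketch of a linear mapping, assuming a native range of 1 to 99. That range and the function name `to_native` are assumptions chosen for illustration; real mappings are ORB- and platform-specific.

```python
CORBA_MIN, CORBA_MAX = 0, 32767  # RT CORBA priority range
NATIVE_MIN, NATIVE_MAX = 1, 99   # assumed native range; platform-specific in practice


def to_native(corba_priority: int) -> int:
    """Map a CORBA priority linearly onto the assumed native priority range."""
    if not CORBA_MIN <= corba_priority <= CORBA_MAX:
        raise ValueError(f"CORBA priority out of range: {corba_priority}")
    span = NATIVE_MAX - NATIVE_MIN
    return NATIVE_MIN + (corba_priority * span) // CORBA_MAX


for p in (0, 16384, 32767):
    print(p, "->", to_native(p))
```

Because the mapping is monotonic, relative ordering of priorities is preserved across hosts that share it, even though the native values themselves are not portable.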

8
Question 5
  • Is an FT CORBA implementation used? If so, what
    features are used (e.g. Property Management,
    Replication Management, Fault Detection and
    Notification, Logging and Recovery)? If not, why?
  • No - 18

9
Question 6
  • Are ORBs from multiple vendors involved?
  • Yes - 11
  • No - 7
  • Is interoperability from a client ORB to a
    different vendor's server required to tolerate
    faults?
  • Yes - 6
  • No - 11
  • Not Sure - 1
  • Are the requirements different than when the
    client and server ORBs are from the same vendor?
  • Yes - 6
  • No - 11
  • Not Sure - 1

10
Question 7
  • Are other fault tolerant infrastructures (DBs,
    networks, OSes, other middleware, etc) used in
    conjunction with CORBA?
  • Yes - 14
  • DBs - 6
  • Network - 2
  • OS - 2 (including 1 RADEX)
  • In House Development - 1
  • No - 3
  • NA - 1
  • Comment
  • It would be good if FT CORBA could provide a
    mechanism to failover to other communication
    links.

11
Question 8
  • Are services replicated for fault tolerance?
  • Yes - 14
  • No - 4
  • Are these coarse-grained service interfaces or
    fine-grained object interfaces?
  • Very Coarse (Whole System) - 1
  • Coarse - 8
  • Medium - 1
  • Fine - 4
  • All - 1
  • Are chained invocations (where server is also
    client) used?
  • Yes - 12
  • Eliminated using Staged Arch (HRT sys) where each
    stage has pure clients/servers. - 1
  • NA - 5

12
Question 9
  • At what granularity do failovers occur
    (datacenter, host, process, container, ORB, POA,
    object, etc.)?

13
Question 10
  • What replication style (active, warm passive,
    cold passive, etc.) is used? Why?
  • Comments
  • Active because of speed to recover - 3.
  • When you expect a lot of failures you use active.
    When you don't expect a lot of failures and can
    afford the slower recovery time you use passive.
  • Passive is less touchy and easier to
    implement.
  • Cost drives the choice, including:
  • Criticality over time and space
  • Dollars
  • CPU Availability
  • Behavior Over Time (e.g. mission apps only
    critical for one mode).
  • Active replication with FT CORBA can have no
    out-of-band communications unless you use
    application-controlled consistency, at which
    point so much development work is required that
    you may as well not have used CORBA.

14
Question 11
  • Are replicated services stateful or stateless?
  • Stateful 8
  • Stateless 5
  • Both 3 (includes 1 non-active only stateful, and
    1 with 20-30% of services stateful)
  • How important is maintaining consistency of state
    among replicas?
  • Important 7
  • Sometimes 1
  • DB Consistency is Important 1
  • How is state consistency maintained?
  • With time lag 2
  • Application Transparent using protocols 1
  • Checkpoint / Restore 5
  • Active (Built-in) 1
  • NA 4
  • Is persistence of state required even when no
    replicas are active?
  • Yes 4
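
Checkpoint / Restore was the most common answer for maintaining state consistency, and two respondents accept a time lag. A minimal sketch of how the two fit together in a passive scheme, assuming a primary that copies its state to a backup every few requests. `Replica`, `PassiveGroup`, and `checkpoint_every` are hypothetical names, and the "tracks" counter stands in for real application state.

```python
import copy


class Replica:
    """A stateful service replica holding a simple counter as its state."""

    def __init__(self, name):
        self.name = name
        self.state = {"tracks": 0}

    def handle(self, request):
        self.state["tracks"] += request
        return self.state["tracks"]

    def get_state(self):
        return copy.deepcopy(self.state)

    def set_state(self, snapshot):
        self.state = copy.deepcopy(snapshot)


class PassiveGroup:
    """Primary serves requests; the backup only receives periodic checkpoints."""

    def __init__(self, primary, backup, checkpoint_every):
        self.primary = primary
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self._since_checkpoint = 0

    def invoke(self, request):
        result = self.primary.handle(request)
        self._since_checkpoint += 1
        if self.backup is not None and self._since_checkpoint >= self.checkpoint_every:
            self.backup.set_state(self.primary.get_state())
            self._since_checkpoint = 0
        return result

    def fail_over(self):
        """Promote the backup; updates made since the last checkpoint are lost."""
        self.primary, self.backup = self.backup, None
        self._since_checkpoint = 0


group = PassiveGroup(Replica("primary"), Replica("backup"), checkpoint_every=2)
for _ in range(3):
    group.invoke(1)   # third update happens after the last checkpoint
group.fail_over()
print(group.primary.name, group.primary.state)
```

The promoted backup resumes from the last checkpoint, so the third update is missing. Shortening the checkpoint interval narrows that lag at the cost of more state transfer.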

15
Question 12
  • Are replicated service implementations
    multi-threaded?
  • Yes - 13
  • Minimized to meet Comm I/O Requirements - 1
  • Why?
  • Throughput - 2, Efficiency/performance - 7
  • What other sources of non-determinism (i.e. local
    timers, non-CORBA events, hardware interfaces)
    exist?
  • (See Diagram to Right)
  • Comments
  • SW Arch Rule: Implementing a CORBA call will
    spawn a thread that doesn't block the client.
  • Goal is to meet real time deadlines even when
    losing a track file.
  • One philosophy is to have as much concurrency as
    possible to provide better performance and lower
    level of granularity.

16
Question 13
  • Is CCM or any other component framework used?
    What services related to fault tolerance does it
    provide?
  • Yes - 8 (CCM - 3, J2EE - 2, CCM Derivative - 2,
    Component Framework - 1)
  • No - 10
  • Are CORBA services such as naming, trader, or
    events used?
  • Yes - 11
  • Notification 1
  • Naming 11
  • Trader 5
  • Event 4
  • No 7
  • Are these fault-tolerant?
  • Yes - 4
  • No - 8

17
Question 14
  • How are faults detected and recovery initiated?
    See Chart
  • At what granularities are faults detected?
  • 8 Process; 1 Component; 3 Dependent on Fault
    Type; 1 High Level; 2 Host; 1 Data Center
  • Is this middleware-specific or global?
  • Global 10
  • Do these aspects need to be pluggable?
  • Yes 3, No 4, Application Dependent 3
  • How are dependencies handled?
  • 5 Configuration/Design
  • 1 Application Management Tool
  • 3 Unknown / Not Considered
  • Comment: Used probabilistic heuristic trees, an
    RM Grammar, Borland Deployment Op-Center, or
    Higher Level Models for dependency tools.

18
Question 15
  • To what extent is client application code
    involved in recovering from failures?
  • 3 None
  • 4 Some
  • 7 High
  • Are application-transparent exactly-once
    semantics needed, or are at-most-once or
    at-least-once semantics sufficient in the
    presence of faults?
  • Exactly-once 1, At-most-once 1,
    At-least-once 3, All 2
  • Are there safety issues or other issues that
    require handling certain faults at application
    level?
  • Yes 11
  • Comments
  • Currently FT products are geared toward data
    servers which are quite different from radar
    applications.
  • Whether or not the system server is idempotent
    determines which semantics are used.
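
The last comment links idempotence to the choice of invocation semantics. A minimal sketch of why, assuming a client that retries until it receives a reply (at-least-once) and a server whose operation is not idempotent. `Server`, `invoke_at_least_once`, and the request IDs are hypothetical; caching replies by request ID is one common way to get an exactly-once effect and is not taken from the survey.

```python
class Server:
    """Applies non-idempotent 'add' requests, optionally de-duplicating by request ID."""

    def __init__(self, deduplicate):
        self.deduplicate = deduplicate
        self.total = 0
        self._replies = {}  # request_id -> cached reply

    def add(self, request_id, amount):
        if self.deduplicate and request_id in self._replies:
            return self._replies[request_id]  # retry detected: replay the cached reply
        self.total += amount
        self._replies[request_id] = self.total
        return self.total


def invoke_at_least_once(server, request_id, amount, lose_first_reply):
    """Retry until a reply arrives; the first reply may be lost after execution."""
    attempts = 0
    while True:
        attempts += 1
        reply = server.add(request_id, amount)
        if lose_first_reply and attempts == 1:
            continue  # simulate a lost reply, so the client retries
        return reply


plain = Server(deduplicate=False)
dedup = Server(deduplicate=True)
invoke_at_least_once(plain, "r1", 10, lose_first_reply=True)
invoke_at_least_once(dedup, "r1", 10, lose_first_reply=True)
print(plain.total, dedup.total)
```

Without de-duplication the retried request is applied twice. An idempotent operation would not need the reply cache at all, which is why at-least-once is sufficient for such servers.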

19
Question 16
  • Are resource assignments and fault tolerance
    properties set statically or dynamically?
  • Static 10 (2 having some dynamic properties)
  • Dynamic 4
  • Both 2
  • Do they vary with changing modes of operation?
  • Yes 9
  • No 2
  • How are they determined? Design 7, Proprietary 3

  • Can hardware be added dynamically? Yes 6, No 6
  • Are services expected to be continuously
    available?
  • Yes 11
  • No 2

20
Question 17
  • How are tradeoffs between meeting deadlines and
    maintaining consistency in the presence of faults
    handled?
  • Case-by-Case Basis - 1
  • Design - 5 (2 - They Aren't)
  • Stored Doctrine - 1 (1 Hierarchical Mode
    Driven)
  • To what extent can performance or resource
    utilization be traded off against fault
    tolerance?
  • RM handles - 1
  • On a missed-deadline fault, the system is
    designed to continue - 2
  • FT Recovery Deadlines prevail over RT deadlines,
    which are pushed aside, upon failure - 1
  • Comments
  • Resource Utilization and Meeting Deadlines have a
    higher priority than FT
  • FT should not add too much overhead

21
Question 18
  • What features of the FT CORBA specification (e.g.
    Property Management, Replication Management,
    Fault Detection and Notification, Logging and
    Recovery) would be most valuable if only they
    were available and usable in the ORB
    implementations you use?
  • Provide an FT CORBA which has:
  • Process Level (1 Multilevel inc. Object)
    Replication - 5
  • Higher Level Fault Detection - 4
  • Replication Management - 3, if HRT ORB
    available - 1
  • Fault Detection and Notification - 2, Will always
    implement own - 1
  • Logging and Recovery - 3
  • A good checkpoint/recovery service (options for
    boundaries: periodic and on event, e.g. out-of-band
    communication) - 2
  • Priority awareness - 1
  • A toolkit containing knobs, switches, and options
    (granularity, policies) rather than a
    take-it-or-leave-it approach - 3
  • An option to not replay CORBA invocations on
    Recovery - 1
  • Deals with Databases and the replication of them
    - 1
  • Better control of non-determinism - 1

22
Question 19
  • What additions or improvements to the current FT
    CORBA specification would be most valuable?
  • Provide new heartbeat (HB) mechanism (auto HB
    between Processes) - 2
  • Specifiable degree of determinism
  • Have it meet deadlines while meeting FT needs
  • Allow replication and recovery across multiple
    LANs
  • Application Transparent Fault Isolation
    (Detection and Identification)
  • Isolate network partition faults
  • Application specifies fault criteria and leads
    recovery
  • Simplify replication to passive and cold restart
  • Add semi-active replication
  • Guidance on maintaining state consistency in the
    presence of non-determinism
  • Fault detection with minimal network impact
  • Automatic slave promotion
  • Upon state retrieval from persistent database
    optimize startup time
  • Allow for different platforms and OSes within
    groups (different OSes don't work for the PSS DB
    abstraction layer)
  • Be pluggable
  • Address transactions: how to tolerate failures
    of these
  • Mode Driven
  • Fault tolerant ORB services
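
The first request on this slide is an automatic heartbeat between processes. A minimal sketch of the bookkeeping such a mechanism needs, assuming each process reports a heartbeat at a fixed interval and is suspected after missing several in a row. `HeartbeatMonitor`, `interval`, and `missed_limit` are hypothetical names, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per process and reports those that have gone quiet."""

    def __init__(self, interval, missed_limit):
        self.interval = interval          # expected seconds between heartbeats
        self.missed_limit = missed_limit  # consecutive misses tolerated before suspicion
        self._last_seen = {}

    def beat(self, process, now):
        """Record a heartbeat from `process` at time `now` (seconds)."""
        self._last_seen[process] = now

    def suspected(self, now):
        """Return the processes whose silence exceeds interval * missed_limit."""
        limit = self.interval * self.missed_limit
        return sorted(p for p, seen in self._last_seen.items() if now - seen > limit)


monitor = HeartbeatMonitor(interval=1.0, missed_limit=3)
monitor.beat("tracker", now=0.0)
monitor.beat("display", now=0.0)
monitor.beat("display", now=2.0)
monitor.beat("display", now=4.0)
print(monitor.suspected(now=4.5))  # tracker has been silent for 4.5 s, above the 3 s limit
```

Raising `missed_limit` trades slower detection for fewer false suspicions, and a longer `interval` reduces network traffic, which relates to the request for fault detection with minimal network impact.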

23
Question 20
  • What specific RFPs related to fault tolerance
    would you like to see issued by the OMG in the
    near-term or medium-term future?
  • I live in fear of OMG RFPs
  • RT-FT RFP impractical: one size fits all versus
    state of the art (tech. immature) lego-block
    fault tolerant communities
  • FT needs to be reconciled with Load Distribution
    RFP first
  • One issued by SBC DTF to support FT in SW Radio
    space
  • RT transaction specification separates into 4
    components of CORBA ones
  • Reduced language mappings for RT FT CORBA
    services
  • RT FT RFP with:
  • Merge RT-FT spec with minimal work while
    satisfying community
  • Provide Object image consistency within delta T
  • Clean FT schema that covers 80-90% of
    applications' reliability requirements.
  • Replace interfaces with run-time policies set
    through configuration which is multilevel and
    managed at the appropriate level.
  • Interoperability of Mechanisms across ORBs.
  • Database system based FT
  • Better use of QoS and Mode driven FT

24
Question 21
  • Is proactive dependability a feature that would
    be of interest to you and your customer?
  • As a cost constrained low priority - 4
  • Yes - 13 (includes 3 with trust maturity, 1 at
    application level)
  • Comments
  • Currently do it with a diagnostic thresholding
    scheme.
  • Not realistic
  • How would it handle faults outside of the
    middleware?
  • At system level to be used for diagnostics

25
Question 22
  • Rank the following seven items in terms of
    importance to your customer for the most critical
    part of your application:
  • Ease of implementation to the application
    developer
  • Fast Execution Time
  • Bounding Execution Times
  • Fast Recovery Time
  • Bounding Recovery
  • State Synchronization
  • Efficient use of Resources

26
Average Rankings by System Type
Logical Clusters: clustered by the perceived
intended use of the system as determined using
questions 1-3.
Recall that a lower ranking is more important!

27
Observations
  • Current FT CORBA spec has little relevance in
    current practice
  • RT FT systems are being built on RT CORBA
  • Unit of failover is typically process or host,
    not object
  • Passive replication is more commonly used than
    active
  • Active replication perceived as more capable, but
    too hard
  • Majority of applications described as soft real
    time
  • COTS platforms common
  • Requirements vary dramatically
  • Toolkit approaches are preferred
  • Everyone needs fast normal execution
  • Fast recovery most important in soft real time
  • Some need interoperable RT FT

28
Potential RFP Topics
  • Lightweight Real Time Fault Tolerant CORBA
  • Passive replication
  • No message logging or checkpointing
  • Coarse grained (ORB or process) failover
  • Integrate ORB with external resource management
  • Interoperable RT Active CORBA Replication
  • RT FT Group GIOP
  • Real Time Fault Tolerant CCM
  • Real Time Fault Tolerant Transactions
  • Real Time Fault Tolerant State Management
  • Real Time Fault Tolerant Resource Management
  • Or leave to domains (i.e. C4I's CMS Application
    Management RFP)?
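
The lightweight topic above combines passive replication, no logging or checkpointing, and coarse-grained failover. A minimal sketch of what that leaves for the client side, assuming the client holds an ordered list of replica endpoints and moves to the next one when a call fails. `ReplicaUnavailable`, `invoke_with_failover`, and the replica callables are hypothetical stand-ins, not ORB APIs.

```python
class ReplicaUnavailable(Exception):
    """Stand-in for a communication failure reported by the ORB."""


def invoke_with_failover(replicas, request):
    """Try each (name, servant) in order; return (name, reply) from the first that answers."""
    last_error = None
    for name, servant in replicas:
        try:
            return name, servant(request)
        except ReplicaUnavailable as exc:
            last_error = exc  # remember the failure and fall through to the next replica
    raise ReplicaUnavailable("no replica answered") from last_error


def crashed(request):
    raise ReplicaUnavailable("primary process down")


def healthy(request):
    return request.upper()


replicas = [("primary", crashed), ("backup", healthy)]
print(invoke_with_failover(replicas, "ping"))
```

With no state transfer, this only suits stateless services or services that rebuild their state elsewhere, which is the trade-off the lightweight option accepts.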