1
High-Performance, Dependable Multiprocessor
  • John David Eriksen
  • Jamie Unger-Fink

2
Background and Motivation
  • Traditional space computing limited primarily to
    mission-critical applications
      • Spacecraft control
      • Life support
  • Data collected in space and processed on the
    ground
  • Data sets in space applications continue to grow

3
Background and Motivation
  • Communication bandwidth not growing fast enough
    to cope with increasing size of data sets
  • Instruments and sensors grow in capability
  • Increasing need for on-board data processing
  • Perform data filtering and other operations
    on-board
  • Autonomous systems demand more computing power

4
Related Work
  • Advanced Onboard Signal Processor (AOSP)
      • Developed in the 1970s and 1980s
      • Helped develop an understanding of the effects
        of radiation on computing systems and components
  • Advanced Architecture Onboard Processor (AAOP)
      • Engineered new approaches to onboard data
        processing

5
Related Work
  • Space Touchstone
      • First COTS-based, fault-tolerant (FT),
        high-performance system
  • Remote Exploration and Experimentation
      • Extended FT techniques to parallel and cluster
        computing
      • Focused on low-cost, high-performance compute
        cluster designs with a good performance-to-power
        ratio

6
Goal
  • Address growing on-board data processing
    requirements
  • Bring COTS systems to space
  • COTS (Commercial Off-The-Shelf)
      • Less expensive
      • General-purpose
  • Need special considerations to meet requirements
    of aerospace environments
      • Fault tolerance
      • High reliability
      • High availability

7
Dependable Multiprocessor is
  • A reconfigurable cluster computer with
    centralized control.

8
Dependable Multiprocessor is
  • A hardware architecture
      • High-performance characteristics
      • Scalable
      • Upgradable (thanks to reliance on COTS)
  • A parallel processing environment
      • Supports a common scientific computing
        development environment (FEMPI)
  • A fault-tolerant computing platform
      • System controllers provide FT properties
  • A toolset for predicting application behavior
      • Fault behavior, performance, availability

9
Hardware Architecture
  • Redundant radiation-hardened system controller
  • Cluster of COTS-based reconfigurable data
    processors
  • Redundant COTS-based packet-switched networks
  • Radiation-hardened mass data store
  • Redundancy available in
      • System controller
      • Network
      • Configurable N-of-M sparing in compute nodes
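
As a rough illustration of what N-of-M sparing buys, the sketch below computes the probability that at least N of M identical compute nodes are working. It assumes nodes fail independently and that each node is up with the same probability; the function name and the numbers in the usage note are illustrative assumptions, not figures from the DM project.

```python
from math import comb


def n_of_m_availability(n_required: int, m_total: int, p_node: float) -> float:
    """Probability that at least n_required of m_total nodes are up.

    Assumes independent node failures, each node being up with
    probability p_node (a simplifying assumption for illustration).
    """
    return sum(
        comb(m_total, k) * p_node**k * (1.0 - p_node) ** (m_total - k)
        for k in range(n_required, m_total + 1)
    )
```

For example, `n_of_m_availability(3, 4, 0.95)` models a hypothetical cluster that needs three working nodes and carries one spare.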

11
Hardware Architecture
  • Scalability
      • Variable number of compute nodes
      • Cluster-of-clusters
  • Compute nodes
      • IBM PowerPC 750FX general-purpose processor
      • Xilinx Virtex-II 6000 FPGA co-processor
          • Reconfigurable to fulfill various roles
              • Digital signal processing (DSP)
              • Data compression
              • Vector processing
          • Applications implemented in hardware can be
            very fast
      • Memory and other support chips

14
Hardware Architecture
  • Network Interconnect
      • Gigabit Ethernet for data exchange
      • A low-latency, low-bandwidth bus used for control
  • Mission Interface
      • Provides interface to the rest of the space
        vehicle's computer systems
      • Radiation-hardened

15
Hardware Architecture
  • Current hardware implementation
      • Four data processors
      • Two redundant system controllers
      • One mass data store
      • Two Gigabit Ethernet networks, including two
        network switches
      • Software-controlled instrumented power supply
      • Workstation running spacecraft system emulator
        software

18
Software Architecture
  • Platform layer is the lowest layer; it interfaces
    hardware to middleware and holds hardware-specific
    software and network drivers
  • Uses Linux, which allows use of many existing
    software tools
  • Mission layer
  • Middleware includes DM System Services: fault
    tolerance, job management, etc.

20
Middleware
  • DM Framework is application-independent and
    platform-independent
  • API to communicate with the mission layer; SAL
    (System Abstraction Layer) for the platform layer
  • Allows for future applications by facilitating
    porting to new platforms

21
High Availability Middleware
  • HA Middleware foundation includes Availability
    Management (AMS), Distributed Messaging (DMS),
    Cluster Management (CMS)
  • Primary functions
      • Resource monitoring
      • Fault detection, diagnosis, recovery, and
        reporting
      • Cluster configuration
      • Event logging
      • Distributed messaging
  • Based on small, cross-platform kernel

22
Availability Management Service
  • Hosted on the cluster's system controller
  • Managed resources include
      • Applications
      • Operating system
      • Chassis
      • I/O cards
      • Redundant CPUs
      • Networks
      • Peripherals
      • Clusters
      • Other middleware

23
Distributed Messaging Service
  • Provides a reliable messaging layer for
    communications in DM cluster
  • Used for checkpointing, client/server
    communications, event notification, fault
    management, and time-critical communications
  • Application opens a DMS connection (channel) to
    pass data to interested subscribers
  • Because messaging is in middleware instead of
    lower layers, the application does not have to
    specify a destination address explicitly
  • Messages are classified, and machines choose to
    receive messages of a certain type
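
The type-based delivery described above resembles a publish/subscribe pattern. The following is a minimal in-process sketch of that idea, not the DMS API: the `Channel` class, its method names, and the message type strings are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Channel:
    """Toy stand-in for a DMS channel (hypothetical API, single process)."""

    def __init__(self) -> None:
        # Maps a message type to the handlers that asked to receive it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, msg_type: str, handler: Handler) -> None:
        """Register interest in one message type."""
        self._subscribers[msg_type].append(handler)

    def publish(self, msg_type: str, payload: Any) -> int:
        """Deliver payload to every subscriber of msg_type; return the count.

        The sender names only the message type, never a destination.
        """
        handlers = self._subscribers.get(msg_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)
```

A sender would call something like `channel.publish("node_failure", {"node": 3})`, and only handlers subscribed to `"node_failure"` would see the payload.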

24
Cluster Management Service
  • Manages physical nodes or instances of HA
    middleware
  • Discovers and monitors nodes in a cluster
  • Passes node failures to AMS and FT Manager via DMS

25
Other Middleware
  • Database Management
  • Logging Services
  • Tracing

26
Control Process
  • Interface to control computer or ground station
  • Communicates with system via DMS
  • Monitors system health with FT Manager
  • Heartbeat

27
Fault-tolerance Manager
  • Detects and recovers from system faults
  • FTM refers to a set of recovery policies at runtime
  • Relies on distributed software agents to gather
    system and application liveness information
      • Avoids monitoring bottleneck
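
One common way to turn liveness reports into fault detection is a heartbeat timeout. The sketch below shows that general technique under stated assumptions: the `HeartbeatMonitor` class, the agent names, and the timeout value are hypothetical, and timestamps are passed in explicitly so the logic stays deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time reported for each agent (hypothetical)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, agent: str, now: float) -> None:
        """Record that an agent reported in at time `now` (seconds)."""
        self._last_seen[agent] = now

    def suspects(self, now: float) -> List[str]:
        """Agents whose last heartbeat is older than the timeout, sorted by name."""
        return sorted(
            agent
            for agent, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )
```

A manager could poll `suspects()` periodically and look up a recovery policy for each agent it returns.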

28
Job Manager
  • Provides application scheduling and resource
    allocation
  • Opportunistic load balancing scheduler
  • Jobs are registered and tracked by the JM via
    tables
  • Checkpointing to allow seamless recovery of the
    JM
  • Sends heartbeats to the FT Manager via middleware
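
Opportunistic load balancing (OLB) is usually described as assigning each job, in arrival order, to whichever node becomes free next, without considering how long the job will run there. The sketch below simulates that textbook heuristic; it is not the JM's actual scheduler, and the function name, job names, and run times are illustrative assumptions.

```python
import heapq
from typing import Dict, List, Tuple


def olb_schedule(jobs: List[Tuple[str, float]], nodes: List[str]) -> Dict[str, str]:
    """Assign each job, in arrival order, to the node that becomes free first.

    `jobs` is a list of (job_name, run_time) pairs. The run time only
    advances the simulated clock; OLB ignores it when choosing a node.
    """
    # Min-heap of (time the node becomes free, node name).
    free_at = [(0.0, node) for node in nodes]
    heapq.heapify(free_at)

    assignment: Dict[str, str] = {}
    for name, run_time in jobs:
        ready, node = heapq.heappop(free_at)
        assignment[name] = node
        heapq.heappush(free_at, (ready + run_time, node))
    return assignment
```

With two nodes and one long job followed by several short ones, the short jobs all land on the second node while the first stays busy.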

29
FEMPI
  • Fault-Tolerant Embedded Message Passing Interface
  • Application independent FT middleware
  • Message Passing Interface (MPI) Standard
  • Built on top of HA middleware

30
FEMPI Interface
  • Recovery from failure should be automatic, with
    minimal impact
  • Needs to maintain global awareness of the
    processes in parallel applications
  • 3 Stages
      • Fault detection
      • Notification
      • Recovery
  • Process failures vs Network failures
  • Survives the crash of n-1 processes in an
    n-process job

31
FPGA Co-Processor Services
  • Proprietary nature of FPGA industry
  • USURP: USURP's Standard for Unified
    Reconfigurable Platforms
      • Standard to interact with hardware
      • Provides middleware for portability
  • Black box IP cores
  • Wrappers mask FPGA board

32
USURP
  • Not a universal tool for mapping high-level code
    to hardware designs
  • OpenFPGA
  • Adaptive Computing System (ACS) vs USURP
  • Object Oriented Models vs Software APIs
  • IGOL
  • BLAST
  • CARMA

33
USURP HW/SW Interface
  • Responsible for
      • Unifying vendor APIs
      • Standardizing HW interface
      • Organization of data for the user application
        core
      • Exposing the developer to common FPGA resources

34
Checkpoint Interface
  • User-level protocol for system recovery
  • Consists of
      • Server process that runs on the Mass Data Store
      • DMS
      • API for applications
      • C-type interfaces
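
To make the save/restore idea concrete, here is a minimal sketch of an application-facing checkpoint API. It is only an assumption about the general shape: the real interface is C-style and talks to a server on the Mass Data Store over DMS, whereas this `CheckpointStore` class, its method names, and the in-memory dictionary are hypothetical stand-ins.

```python
import pickle
from typing import Any, Dict, Optional


class CheckpointStore:
    """In-memory stand-in for a checkpoint server (hypothetical API)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def save(self, job_id: str, state: Any) -> None:
        """Store a snapshot of `state` under `job_id`, replacing any older one."""
        # Serializing makes the stored copy independent of the live object.
        self._blobs[job_id] = pickle.dumps(state)

    def restore(self, job_id: str) -> Optional[Any]:
        """Return the last saved state for `job_id`, or None if there is none."""
        blob = self._blobs.get(job_id)
        return None if blob is None else pickle.loads(blob)
```

An application would call `save()` at safe points in its computation and `restore()` once after a restart to resume from the last snapshot.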

35
ABFT Library
  • Algorithm-based Fault Tolerance Library
  • Collection of mathematical routines that can
    detect and correct faults
  • BLAS-3 Library
  • Matrix multiply, LU decomposition, QR
    decomposition, singular value decomposition (SVD),
    and fast Fourier transform (FFT)
  • Uses checksums
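
The classic checksum scheme for matrix multiply appends a row of column sums to A and a column of row sums to B, so that the product carries checksums for both its rows and its columns. The sketch below shows that well-known technique in plain Python for a single corrupted element; it is not the DM ABFT library, and all function names are illustrative.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def checked_product(a: Matrix, b: Matrix) -> Matrix:
    """Product with an extra checksum row and checksum column."""
    a_c = a + [[sum(col) for col in zip(*a)]]   # append column-sum row to A
    b_r = [row + [sum(row)] for row in b]       # append row-sum column to B
    return matmul(a_c, b_r)


def locate_error(c_full: Matrix, tol: float = 1e-9) -> Optional[Tuple[int, int]]:
    """Return (row, col) of a single corrupted data element.

    Returns None when the checks do not point at exactly one element.
    """
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if abs(sum(c_full[i][:n_cols]) - c_full[i][n_cols]) > tol]
    bad_cols = [j for j in range(n_cols)
                if abs(sum(c_full[k][j] for k in range(n_rows)) - c_full[n_rows][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(c_full: Matrix, pos: Tuple[int, int]) -> None:
    """Rebuild the element at pos from its row checksum, in place."""
    i, j = pos
    n_cols = len(c_full[0]) - 1
    others = sum(c_full[i][:n_cols]) - c_full[i][j]
    c_full[i][j] = c_full[i][n_cols] - others
```

A failing row check and a failing column check intersect at the faulty element, which is why one corrupted value can be both located and repaired.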

36
Replication Services
  • Triple Modular Redundancy
  • Process Level Replication
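
Triple modular redundancy runs the same computation three times and takes the majority result. The voter below is a minimal sketch of that idea at the process level; the function name is hypothetical, and it assumes replica results are hashable values that can be compared for equality.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(results: Sequence[Hashable]) -> Hashable:
    """Return the value reported by a strict majority of replicas.

    Raises ValueError when no value has a strict majority, which in a
    three-replica setup means all three results disagree.
    """
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise ValueError("no majority among replica results")
    return value
```

With three replicas, a single faulty result is outvoted by the two that agree.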

37
Conclusion
  • System architecture has been defined
  • Testbench has been assembled
  • Improvements
      • More aggressively address power consumption
        issues
      • Add support for other scientific computing
        platforms such as Fortran