ILC Global Control System - PowerPoint PPT Presentation

About This Presentation
Title:

ILC Global Control System

Description:

State machines. Online models. Front-End Tier. Technical Systems Interfaces. Control-point level ... 16 Slot Dual Star. Backplane. 4 Hot-Swappable Fans. Shelf ... – PowerPoint PPT presentation

Slides: 34
Provided by: geral161

Transcript and Presenter's Notes

Title: ILC Global Control System


1
ILC Global Control System
  • John Carwardine, ANL

2
ILC Accelerator overview
  • Major accelerator systems
  • Polarized PC gun electron source and
    undulator-based positron source.
  • 5-GeV electron and positron damping rings, 6.7km
    circumference.
  • Beam transport from damping rings to bunch
    compressors.
  • Two 11km long 250-GeV linacs with 15,000
    cavities and 600 RF units.
  • A 4.5-km beam delivery system with a single
    interaction point.

J. Bagger
3
Control System Requirements and Challenges
  • General requirements are largely similar to those
    of any large-scale experimental physics machine,
    but there are some challenges:
  • Scalability
  • 100,000 devices, several million control points.
  • Large geographic scale: 31 km end to end.
  • Multi-region, multi-lab development team.
  • Support ILC accelerator availability goal of
    85%.
  • Intrinsic control system availability of 99% by
    design.
  • Cannot rely on a fix-in-place approach.
  • May require 99.999% (five nines) availability
    from each crate.
  • Functionality to help minimize overall
    accelerator downtime.
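
A rough worked example of why each crate may need five nines: if the control system is modeled as a series chain in which every crate must be up, the system availability is the product of the crate availabilities. The crate count of 1,000 and the function name below are illustrative assumptions, not figures from the slides.

```python
def series_availability(unit_availability: float, n_units: int) -> float:
    """Availability of n independent units that must all be up at once."""
    return unit_availability ** n_units


# Hypothetical crate count, chosen only for illustration.
N_CRATES = 1000

for a in (0.999, 0.9999, 0.99999):
    system = series_availability(a, N_CRATES)
    print(f"crate availability {a:.5f} -> system availability {system:.4f}")
```

Under these assumptions only the five-nines case keeps the system at about 99%; a real design would also credit redundancy, which a pure series model ignores.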

4
Requirements and Challenges (2)
  • Precision timing synchronization
  • Distribute precision timing and RF phase
    references to many technical systems throughout
    the accelerator complex.
  • Requirements consistent with LLRF requirements of
    0.1% amplitude and 0.1 degree phase stability.
  • Support remote operations / remote access (GAN /
    GDN)
  • Allow collaborators to participate in machine
    commissioning, operation, optimization, and
    troubleshooting.
  • At technical equipment level there is little
    difference between on-site and off-site access -
    Control Room is already remote.
  • There are both technical and sociological
    challenges.

5
Requirements and Challenges (3)
  • Extensive reliance on machine automation
  • Manage operation of the many accelerator
    systems, e.g. 15,000 cavities, 600 RF
    units.
  • Automate machine startup, cavity conditioning,
    tuning, etc.
  • Extensive reliance on beam-based feedback
  • Multiple beam-based feedback loops at 5 Hz, e.g.:
  • Trajectory control, orbit control
  • Dispersion measurement control
  • Beam energies
  • Emittance correction
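
A minimal sketch of what one such pulse-to-pulse loop does, assuming a simple integral controller and a toy plant in which the measured offset is a static disturbance plus the applied correction. The function names, gain, and plant model are assumptions for illustration; the slides do not specify the feedback algorithm.

```python
def integral_feedback_step(correction: float, measured: float,
                           target: float, gain: float) -> float:
    """One pulse of an integral controller: accumulate a fraction of the error."""
    error = measured - target
    return correction - gain * error


def simulate(n_pulses: int, disturbance: float, gain: float = 0.5) -> list[float]:
    """Toy plant: measured offset = static disturbance + applied correction."""
    correction = 0.0
    history = []
    for _ in range(n_pulses):
        measured = disturbance + correction
        history.append(measured)
        correction = integral_feedback_step(correction, measured, 0.0, gain)
    return history


if __name__ == "__main__":
    # At 5 Hz, 10 pulses correspond to 2 seconds of beam time.
    for pulse, offset in enumerate(simulate(10, disturbance=1.0)):
        print(f"pulse {pulse}: offset = {offset:.4f}")
```

With a gain of 0.5 the residual offset halves on every pulse in this toy model; a real loop would act on many BPMs and correctors at once.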

6
Control System Functional Model
  • Client Tier
  • GUIs
  • Scripting
  • Services Tier
  • Business Logic
  • Device abstraction
  • Feedback engine
  • State machines
  • Online models
  • Front-End Tier
  • Technical Systems Interfaces
  • Control-point level

7
Physical Model as applied to main linac
(Front-end)
8
Some representative component counts
9
Which Control System?
  • Established accelerator control system?
  • EPICS, DOOCS, TANGO, ACNET, ...
  • Development from scratch?
  • Commercial solution?
  • Too early to down-select for ILC, and there are
    benefits to not down-selecting during the R&D phase.

10
Availability Design Philosophy for the ILC
  • Design for Availability up front.
  • Budget 15% downtime total. Keep an extra 10% as
    contingency.
  • Try to get the high availability for the minimum
    cost.
  • Will need to iterate as design progresses.
  • Quantities are not final
  • Engineering studies may show that the cost
    minimum would be attained by moving some of the
    unavailability budget from one item to another.
  • This means some MTBFs may be allowed to go down,
    but others will have to go up.
  • Availability/reliability modeling (Availsim)

11
Availability budgets by system (percentage of total
downtime)
12
MTBF/MTTR requirements from Availsim
13
High Availability primer
  • Availability A = MTBF / (MTBF + MTTR)
  • MTBF = Mean Time Before Failure
  • MTTR = Mean Time To Repair
  • If MTBF approaches infinity, A approaches 1
  • If MTTR approaches zero, A approaches 1
  • Both are impossible on a unit basis
  • Both are possible on a system basis.
  • Key features for HA, i.e. A approaching 1:
  • Modular design
  • Built-in 1/n redundancy
  • Hot standby systems
  • Hot-swap capable at subsystem unit or subunit
    level
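
The primer above can be turned into a few lines of arithmetic. This is a sketch only: the MTBF and MTTR values are hypothetical, redundancy is modeled as "at least k of n identical, independent units are up", and the function names are not from the slides.

```python
from math import comb


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def required_mtbf(target_availability: float, mttr_hours: float) -> float:
    """Invert the formula: MTBF = MTTR * A / (1 - A)."""
    return mttr_hours * target_availability / (1.0 - target_availability)


def k_of_n_availability(unit_a: float, k: int, n: int) -> float:
    """Probability that at least k of n independent, identical units are up."""
    return sum(comb(n, i) * unit_a**i * (1.0 - unit_a)**(n - i)
               for i in range(k, n + 1))


unit = availability(mtbf_hours=10_000, mttr_hours=10)   # hypothetical numbers
print(f"single unit:        {unit:.6f}")
print(f"1-of-2 (hot spare): {k_of_n_availability(unit, 1, 2):.8f}")
print(f"MTBF for five nines at 1 h MTTR: {required_mtbf(0.99999, 1.0):.0f} h")
```

The hot-spare line illustrates the slide's point that availability close to 1 is reachable on a system basis even when no single unit achieves it.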

14
Systems That Never Shut Down
  • Any large telecom system will have a few
    redundant shelves, so loss of a whole unit does
    not bring down the system, like the RF system in
    the linac.
  • Load is auto-rerouted to a hot spare, again like
    the linac.
  • Key: all equipment is always accessible for hot
    swap.
  • Other features:
  • Open system, non-proprietary: very important for
    non-telecom customers like ILC.
  • Developed by an industry consortium¹ of major
    companies sharing in a $100B market.
  • A market 20X larger than that of any of the old
    standards, including VME, leads to competitive prices.

¹ PICMG -- PCI Industrial Computer Manufacturers
Group
15
Controls Cluster
Dual Star/Loop/Mesh
FEATURES
  • Dual Star 1/N Redundant Backplanes
  • Redundant Fabric Switches
  • Dual Star/Loop/Mesh Serial Links
  • Dual Star Serial Links To/From Level 2 Sector Nodes
Applications Modules
Dual Fabric Switches
Dual Star to/From Sector Nodes
16
HA Concept: DR Kicker Systems
  • Approx. 50 unit drivers
  • n/N redundancy at system level (extra kickers)
  • n/N redundancy at unit level (extra cards)
  • Diagnostics on each card, networked, local
    wireless

17
Physical Model as applied to main linac
(Front-end)
18
High Availability Control System
  • Control system itself must be highly available
  • Redundant and hot-swap hardware platform
    (baseline ATCA).
  • Redundancy functionality in control system
    software.
  • In many cases, redundancy and
    hot-swap/hot-reconfigure can only be implemented
    at the accelerator system level, e.g.:
  • Rebalance RF systems if a klystron fails.
  • Modify control algorithm on loss of critical
    sensor.
  • Control System will provide High Availability
    functionality at the accelerator system level.
  • Technical systems must provide high level of
    diagnostics to support remote troubleshooting and
    re-configuration.
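
As an illustration of system-level redundancy, the sketch below rebalances RF setpoints when one unit fails by spreading its contribution evenly over the remaining units, subject to a headroom limit. The unit names, setpoint values, limit, and the even-split policy are all assumptions for illustration; the slides do not describe the actual rebalancing algorithm.

```python
def rebalance(setpoints: dict[str, float], failed: str,
              max_setpoint: float) -> dict[str, float]:
    """Spread the failed unit's contribution evenly over the remaining units.

    Raises ValueError if any remaining unit would exceed max_setpoint.
    """
    remaining = {name: v for name, v in setpoints.items() if name != failed}
    if not remaining:
        raise ValueError("no units left to rebalance")
    share = setpoints[failed] / len(remaining)
    new = {name: v + share for name, v in remaining.items()}
    over = [name for name, v in new.items() if v > max_setpoint]
    if over:
        raise ValueError(f"insufficient headroom on: {over}")
    return new


# Hypothetical RF units with energy-gain setpoints in arbitrary units.
units = {"rf01": 10.0, "rf02": 10.0, "rf03": 10.0, "rf04": 10.0, "rf05": 10.0}
after = rebalance(units, failed="rf03", max_setpoint=13.0)
print(after)
print("total before:", sum(units.values()), "total after:", sum(after.values()))
```

The explicit headroom check reflects the design point that this kind of failover only works if spare capacity was budgeted at the accelerator system level.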

19
ATCA as a reference platform
5-Slot Crate w/ Shelf Manager Fabric Switch
Dual IOC Processors
4 Hot-Swappable Fans
16 Slot Dual Star Backplane
Shelf Manager
Dual 48VDC Power Interface
Dual IOCs Fabric Switch
Rear View
R. Larsen
20
(No Transcript)
21
ATCA as reference platform for Front-end
electronics
  • Representative of the breadth of
    high-availability functions needed
  • Hot-swappable components: circuit boards, fans,
    power supplies, ...
  • Remote power management: power on/off each
    circuit board
  • Supports redundancy: processors, comms links,
    power supplies, ...
  • Remote resource management through Shelf Manager
  • µTCA offers lower cost but with reduced feature
    set.
  • There is growing interest in the physics
    community in exploring ATCA for instrumentation
    and DAQ applications.
  • As candidate technology for the ILC, ATCA/µTCA
    have strong potential; currently it is an
    emerging standard.

22
Read Out evolution: LHC -> ILC
[Diagram; labels as transcribed: Subdetector; Digital Buffer; Read Out Crate (VME 9U); Read Out Driver; 92; AMC; SLink; 400 ROBin (PCI); Read Out Buffer (3 ROBin); ROS (150 PCs); ATCA Module; ATCA Crate; Gbit link to GbE switch (60 PCs).]
23
Cost/Benefit Analysis of HA Techniques
Chart axes: availability (benefit) versus cost (some effort laden, some materials laden)
Techniques, numbered as on the chart:
13. Automatic failover
12. Model-based automated diagnosis
11. Manual failover (e.g. bad memory, live patching)
10. Hot swap hardware
9. Application design (error code checking, etc.)
8. Development methodology (testing, standards, patterns)
7. Adaptive machine control (detect failed BPM, modify feedback)
6. Model-based configuration management (change management)
5. Extensive monitoring (hardware and software)
4. COTS redundancy (switches, routers, NFS, RAID disks, database, etc.)
3. Automation (supporting RF tune-up, magnet conditioning, etc.)
2. Disk volume management
1. Good administrative practices
24
HA R&D objectives
  • Learn about HA (High Availability) in context of
    accelerator controls
  • Bring in expertise (RTES, training, NASA,
    military, ...)
  • Develop (adopt) a methodology for examining
    control system failures
  • Fault tree analysis
  • FMEA or scenario-based FMEA
  • Supporting software (CAFTA, SAPPHIRE, ...)
  • Others?
  • Develop policies for detecting and managing
    identified failure modes
  • Development and testing methodology
  • Workaround
  • Redundancy
  • Develop a full vertical prototype
    implementation
  • I.e. how we might implement the above policies
  • Integrate portions of vertical prototype with
    test stands (LLRF)
  • Feed some software-oriented data to SLAC
    availability simulation?

25
High Availability Software
  • What are the most common and critical failure
    modes in control system software?
  • Mis-configuration
  • Network buffer overruns
  • Application logic bugs
  • Task deadlock
  • Accepting conflicting commands
  • Ungraceful handling of failed sensors/actuators
  • Flying blind (lack of monitoring)
  • Introduction of untested features
  • More
  • How do we mitigate these, and what is the
    cost/benefit?

26
Sample of Techniques: Shelf Management
Client Tier
Services Tier
Controls Protocol
  • Shelf Manager
  • Identify all boards on shelf
  • Power cycle boards (individually)
  • Reset boards
  • Monitor voltages/temps
  • Manage Hot-Swap LED state
  • Switch to backup flash mem bank
  • More

Custom
CPU1
I/O
CPU2
Front-end tier
sensor
27
SAF Availability Management Framework
A simple example of software component runtime
lifecycle management
[Diagram: AMF logical entities (Node U, Node V, Service Group, Service Units, Components, with active and standby assignments) and Service Unit administrative states (Unlocked, Locked, Locked-Instantiation, Shutting down).]
A Service Instance is the work assigned to a Service Unit.
1. Service unit starts out un-instantiated.
2. State changed to locked, meaning software is
instantiated on node, but not assigned work.
3. State changed to unlocked, meaning software is
assigned work (Service Instance).
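
The lifecycle steps above can be sketched as a small state machine. This is a simplified model for illustration, not the SAF AIS API: the class, the operation names, and the transition table are assumptions, with state names taken from the slide's diagram labels and the slide's "un-instantiated" starting point mapped to the Locked-Instantiation state.

```python
from enum import Enum


class AdminState(Enum):
    LOCKED_INSTANTIATION = "locked-instantiation"  # software not instantiated
    LOCKED = "locked"                # instantiated, no work assigned
    UNLOCKED = "unlocked"            # instantiated, work assigned
    SHUTTING_DOWN = "shutting-down"  # finishing current work, taking no new work


# (operation, current state) -> next state, for this simplified model only.
TRANSITIONS = {
    ("unlock_instantiation", AdminState.LOCKED_INSTANTIATION): AdminState.LOCKED,
    ("unlock", AdminState.LOCKED): AdminState.UNLOCKED,
    ("shutdown", AdminState.UNLOCKED): AdminState.SHUTTING_DOWN,
    ("lock", AdminState.UNLOCKED): AdminState.LOCKED,
    ("lock", AdminState.SHUTTING_DOWN): AdminState.LOCKED,
    ("lock_instantiation", AdminState.LOCKED): AdminState.LOCKED_INSTANTIATION,
}


class ServiceUnit:
    """Hypothetical service unit that tracks only its administrative state."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = AdminState.LOCKED_INSTANTIATION

    def apply(self, operation: str) -> AdminState:
        key = (operation, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"{operation!r} not allowed in state {self.state.value}")
        self.state = TRANSITIONS[key]
        return self.state


su = ServiceUnit("ioc-a")
for op in ("unlock_instantiation", "unlock"):
    print(op, "->", su.apply(op).value)
```

Rejecting undefined transitions is the point of the table: a unit cannot be handed work before its software has been instantiated on the node.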
28
SAF (Service Availability Forum) Specifications
Application Interface Specification
HA Applications
Other Middleware and Application Services
HPI Middleware
AIS Middleware
Carrier Grade Operating System
Managed Hardware Platform
Hardware Platform Interface
Diagram courtesy of Service Availability Forum
29
SAF Availability Management Framework
  • AMF: Availability Management Framework
  • Manages software runtime lifecycle, fault
    reporting, failover policies, etc.
  • Works in combination with a collection of
    well-defined services to provide a powerful
    environment for application software components.
  • CLM: Cluster Membership Service
  • LOG: Log Service
  • CKPT: Checkpoint Service
  • EVT: Event Service
  • LCK: Lock Service
  • More
  • An open standard from telecom industry geared
    towards supporting a highly available, highly
    distributed system.
  • Potential application to critical core control
    system software such as IOCs, device servers,
    gateways, nameservers, data reduction, etc.
  • Know exactly what software is running where.
  • Be able to gracefully restart components, or
    manage state while hot-swapping underlying
    hardware.
  • Uniform diagnostics to troubleshoot problems.

30
An HA software framework is just the start
  • SAF (Service Availability Forum) implementations
    won't solve the HA problem
  • You still have to determine what you want to do
    and encode it in the framework; this is where
    the work lies
  • What are the failures?
  • How to identify a failure
  • How to compensate (failover, adaptation,
    hot-swap)
  • Is the resultant software complexity manageable?
  • Potential for the fix to be worse than the problem
  • Always evaluate: am I actually improving
    availability?

31
R&D Engineering Design (EDR) Phase
  • Main focus of R&D efforts is on high
    availability
  • Gain experience with high availability tools and
    techniques to be able to make value-based
    judgments of cost versus benefit.
  • Four broad categories
  • Control system failure mode analysis
  • High-availability electronics platforms (ATCA)
  • High-availability integrated control systems
  • Conflict avoidance, failover, model-based
    resource monitoring.
  • Control System as a tool for implementing
    system-level HA
  • Fault detection methods, failure modes and effects

32
HA means doing things differently
  • ILC must apply techniques not typically used at
    an accelerator, particularly in software
  • Development culture must be different this time.
  • Cannot build ad-hoc with in-situ testing.
  • Build modeling, simulation, testing, and
    monitoring into hardware and software methodology
    up front.
  • Reliable hardware
  • Instrumentation electronics to servers and disks.
  • Redundancy where feasible, otherwise adapt in
    software.
  • Modeling and simulation (T. Himel).
  • Reliable software
  • Equally important.
  • Software has many more internal states and is
    difficult to predict.
  • Modeling and simulation needed here for
    networking and software.

33
Controls topic areas
  • LLRF algorithms
  • RF phase and timing distribution, synchronization
  • Machine automation, beam-based feedback
  • ATCA evaluation as front-end instrumentation
    platform
  • ATCA evaluation for control system integration
  • HA integrated control system
  • Integrated Control System as a tool for
    system-level HA
  • Remote access, remote operations (GAN/GDN)
  • Failure modes analysis
  • Lots of opportunities to get involved