Large Farm 'Real Life Problems' and their Solutions

1
Large Farm 'Real Life Problems' and their
Solutions
  • Thorsten Kleinwort
  • CERN IT/FIO
  • HEPiX II/2004
  • BNL

2
Outline
  • Farms at the CERN CC
  • The Tools Framework
  • The Working Teams
  • Real Life Use Cases
  • Collaborations
  • Summary
  • Useful Links

3
The Tools Framework
  • ELFms
  • Quattor
  • Installation (Kickstart, SWREP)
  • Configuration (CDB, NCM)
  • Management (SPMA, NCM)
  • Lemon
  • Monitoring
  • Batch system statistics
  • LEAF
  • State management (SMS)
  • Hardware management (HMS)

4
The Tools Framework (contd)
  • The evolution of the ELFms tools is described in
    various previous presentations
  • HEPiX II/2003 (Vancouver)
  • The new Fabric Management Tools in Production at
    CERN
  • HEPiX I/2004 (Edinburgh)
  • ELFms, status, deployment by German Cancio
  • Lemon Web Monitoring by Miroslav Siket
  • CHEP 2004 (Interlaken)
  • Current Status of Fabric Management at CERN by
    German Cancio
  • This HEPiX
  • Experience in the use of quattor tool suite
    outside CERN
  • => Progress has been made, improvements are
    ongoing, and Quattor is increasingly used outside
    CERN

5
Tools (contd)
  • Other tools interfacing with CDB
  • Script: PrepareInstall.pl
  • Does all necessary steps to prepare a machine
    install
  • Can run with a list of hosts (for mass installs)
  • Gets all the necessary information from CDB
  • Creates a kickstart file for each node
  • Local script: maintenance
  • Script to run down a node
  • Drains batch nodes
  • Warns users on interactive nodes
  • Can execute configurable script at the end, e.g.
    reboot
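
The PrepareInstall.pl steps above (take a list of hosts, look each one up in CDB, write one kickstart file per node) can be illustrated with a minimal Python sketch. This is not the actual script: the dictionary standing in for CDB, the host names, the profile fields and the three-directive template are all assumptions made for illustration.

```python
from string import Template

# Hypothetical stand-in for the per-host information held in CDB.
FAKE_CDB = {
    "lxb0001": {"os": "slc3", "cluster": "lxbatch", "disk": "hda"},
    "lxb0002": {"os": "slc3", "cluster": "lxbatch", "disk": "sda"},
}

# Deliberately tiny template; a real kickstart file has many more directives.
KICKSTART_TEMPLATE = Template(
    "# kickstart for $host (cluster $cluster, OS $os)\n"
    "install\n"
    "clearpart --all --drives=$disk\n"
    "reboot\n"
)


def render_kickstart(host, cdb):
    """Return kickstart text for one host; raises KeyError for unknown hosts."""
    profile = cdb[host]
    return KICKSTART_TEMPLATE.substitute(host=host, **profile)


def prepare_install(hosts, cdb):
    """Render one kickstart file per host and return {host: text}."""
    return {host: render_kickstart(host, cdb) for host in hosts}


if __name__ == "__main__":
    for host, text in prepare_install(["lxb0001", "lxb0002"], FAKE_CDB).items():
        print(text)
```

Because every value comes from the one configuration source, a mass install is just the same call with a longer host list.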

6
Tools (contd)
  • Automated Fabric: LEAF
  • State Management System (SMS)
  • Other CDB changes are done by SMS
  • Change OS/Cluster
  • Systems have state
  • production or standby
  • Hardware Management System (HMS)
  • Workflow to track hardware changes; interfaces
    with CDB
  • New machine arrival
  • Machine moves
  • Machine interventions (Vendor calls), retirements
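
The idea that every system carries a state (production or standby) which is changed through one tool and recorded in the configuration database can be sketched as follows. This is only an illustration, not the SMS implementation: the dictionary layout, the `history` field and the function names are assumptions.

```python
VALID_STATES = {"production", "standby"}


def set_state(cdb, host, new_state, reason):
    """Record a state change for host in the cdb dict; return the old state."""
    if new_state not in VALID_STATES:
        raise ValueError(f"unknown state: {new_state}")
    entry = cdb[host]
    old_state = entry["state"]
    if old_state != new_state:
        entry["state"] = new_state
        # Keep an audit trail so that every change can be traced back.
        entry.setdefault("history", []).append((old_state, new_state, reason))
    return old_state


def in_production(cdb, cluster):
    """List the hosts of a cluster that are currently in production."""
    return sorted(
        host
        for host, entry in cdb.items()
        if entry["cluster"] == cluster and entry["state"] == "production"
    )


if __name__ == "__main__":
    # Hypothetical hosts and cluster name.
    cdb = {
        "node01": {"cluster": "batch", "state": "production"},
        "node02": {"cluster": "batch", "state": "production"},
    }
    set_state(cdb, "node02", "standby", "vendor call")
    print(in_production(cdb, "batch"))
```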

7
The Working Teams
Customers
Service Manager
SysAdmins
Operator
8
Another Management Tool
  • Remedy
  • The problem tracking tool in CERN IT
  • Used in different workflows, e.g. by
  • The Operator to open tickets following up on
    alarms
  • The Service Managers to ask for machine
    interventions
  • The SysAdmins to follow up on problems/general
    issues
  • HMS is implemented as a Remedy Workflow as well
  • Recently started to get statistics on hardware
    failures

9
Real Life Use Cases
  • Kernel upgrade (on LXBATCH, 1500 hosts)
  • Put the new software into the repository (SWREP,
    precaching)
  • Put the new kernel RPM on the nodes: SPMA, with
    multi-package option (old kernel is still
    running!)
  • Configure the new kernel version for the cluster
    in CDB, and run the GRUB NCM component for
    configuring the node
  • Drain the nodes by disabling new batch jobs
    (maintenance)

10
Real Life Use Cases
  • Kernel upgrade (contd)
  • Node reboots when it is drained (could be at any
    time)
  • New machine comes up with new kernel, and goes
    back into production immediately
  • Least downtime for each node. Capacity is always
    available
  • First reboot instantaneous, last one can be
    several days later
  • Everything runs automatically; some cleanup has
    to be done for a few machines (don't shut down or
    h/w failure on startup) => caught by the
    monitoring/alarm
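
The drain-then-reboot behaviour described above can be illustrated with a toy simulation: each node is closed to new jobs at time 0, reboots as soon as its last running job ends, and returns to production right after the reboot. The node names, job lengths and the 15-minute reboot time are assumptions chosen for the example, not measurements.

```python
def reboot_schedule(remaining_job_hours, reboot_hours=0.25):
    """Return {node: (drained_at, back_in_production_at)} in hours from now.

    A node is closed to new jobs at time 0, reboots as soon as its longest
    running job has finished, and rejoins production after the reboot.
    """
    schedule = {}
    for node, jobs in remaining_job_hours.items():
        drained_at = max(jobs, default=0.0)  # an idle node drains immediately
        schedule[node] = (drained_at, drained_at + reboot_hours)
    return schedule


def nodes_down_at(schedule, t):
    """Count nodes that are in the middle of their reboot at time t."""
    return sum(1 for start, end in schedule.values() if start <= t < end)


if __name__ == "__main__":
    # Hypothetical snapshot: hours left for the jobs running on each node.
    jobs = {"node_a": [], "node_b": [2.0, 5.5], "node_c": [72.0]}
    for node, (start, end) in reboot_schedule(jobs).items():
        print(f"{node}: reboots at t={start}h, back at t={end}h")
```

In this model an idle node reboots at once while a node with a three-day job reboots three days later, and at any moment only the nodes that are actually rebooting are out of capacity.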

11
Real Life Use Cases (contd)
  • Configure batch resources (LSF)
  • LSF resources are defined, depending on
    availability, power and cluster of machines
  • Resources are defined in CDB
  • Configured on the node using NCM
  • The master file is generated from CDB2SQL in a
    cron job every day (reconfig takes several
    minutes)
  • Consistency of client/master due to CDB
  • Resource assignments are done in CDB on (sub-)
    cluster level (template structure)
  • Reassignments of (sub-)clusters in CDB are done
    with SMS tools
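
A rough sketch of why a single source keeps client and master consistent: resources are assigned per cluster and per sub-cluster, each host inherits both, and the master listing is generated from the same lookup the node would use. The cluster, sub-cluster, host and resource names are invented, and the output lines are a generic listing, not real LSF configuration syntax.

```python
# Hypothetical resource assignments at cluster and sub-cluster level.
CLUSTER_RESOURCES = {"lxbatch": ["batch", "slc3"]}
SUBCLUSTER_RESOURCES = {("lxbatch", "cms"): ["cms_dedicated"]}

# Hypothetical membership: host -> (cluster, sub-cluster or None).
HOSTS = {
    "lxb0001": ("lxbatch", None),
    "lxb0002": ("lxbatch", "cms"),
}


def resources_for(host):
    """Resolve a host's resources: cluster level first, then sub-cluster."""
    cluster, subcluster = HOSTS[host]
    resources = list(CLUSTER_RESOURCES.get(cluster, []))
    if subcluster is not None:
        resources += SUBCLUSTER_RESOURCES.get((cluster, subcluster), [])
    return resources


def master_lines():
    """Generate one listing line per host from the same lookup."""
    return [f"{host} ({' '.join(resources_for(host))})" for host in sorted(HOSTS)]


if __name__ == "__main__":
    print("\n".join(master_lines()))
```

Reassigning a sub-cluster then means changing one entry in the source, after which both the node view and the master listing follow.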

12
Real Life Use Cases (contd)
  • Emptying the Computer Centre
  • For the refurbishment of the CERN Computer Centre
    all machines had to be moved, either from one
    side to the other, or downstairs (vault)
  • 2000 machines had to be moved
  • Taking the opportunity to add machines to CDB
  • As quattor and non-quattor nodes
  • Batch machines were moved in racks: 44 nodes
  • HMS was used to steer the moves
  • SMS/maintenance to shut down the machines
  • Rename/PrepareInstall to bring machines back

14
Real Life Use Cases (contd)
  • New h/w arrival => mass installation
  • New machines (400) arrive at CERN (in bunches of
    50-100)
  • Racks have to be prepared
  • Network equipment
  • Power supply
  • (Console service)
  • Plan machine membership (cluster)
  • Put machine into CDB
  • h/w type
  • Cluster type/OS

15
Real Life Use Cases
  • New h/w arrival (contd)
  • Physical machine installation (HMS)
  • New DNS entry
  • OS installation: PrepareInstall
  • Installation by the SysAdmin
  • Burn-in test (h/w test, several days to weeks)
  • Follow up on h/w problems with Vendor
  • Add the machines to the alarm display (SURE)
  • Put machines into production
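
The arrival procedure above is an ordered workflow: a machine should not reach production before its burn-in test, for example. A minimal way to enforce such an order is sketched below. The step names paraphrase the bullets, and the tracking function is hypothetical, not part of HMS.

```python
# Step names paraphrase the bullets above; the order is the one listed there.
STEPS = [
    "physical_install",
    "dns_entry",
    "os_install",
    "burn_in",
    "alarm_display",
    "production",
]


def advance(done, step):
    """Return the completed-step list extended by step, enforcing the order."""
    expected = STEPS[len(done)] if len(done) < len(STEPS) else None
    if step != expected:
        raise ValueError(f"expected {expected!r}, got {step!r}")
    return done + [step]


if __name__ == "__main__":
    progress = []
    for step in STEPS:
        progress = advance(progress, step)
    print(progress[-1])
```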

17
Collaborations
  • External Customers
  • EGEE, LCG, and other groups at CERN are now using
    Quattor managed machines
  • They benefit from standard, manageable, and
    reproducible machine setups
  • They are able to, or should learn to, make
    modifications themselves
  • External sites using Quattor
  • IN2P3, NIKHEF, UAM Madrid are discussing using
    Quattor or already use it => see Rafael's talk
  • This helps to enhance the tools
  • Service nodes (for LCG-2)
  • Having a wider usage
  • Generalizing components

18
Summary
  • ELFms is deployed in production at CERN
  • Established technology from Prototype to
    Production
  • Though enhancements are ongoing
  • Fundamental part of our infrastructure
  • Merged with our existing environment
  • Quattor and Lemon are generic software
  • Used by others inside/outside CERN
  • Hopefully a fruitful collaboration in the future

19
Useful Links
  • ELFms: http://cern.ch/elfms
  • Quattor: http://quattor.org/
  • Lemon: http://cern.ch/lemon
  • LEAF: http://cern.ch/leaf
  • Previous presentations
  • HEPiX II/2003 (Vancouver): http://www.triumf.ca/hepix2003
  • The new Fabric Management Tools in Production at
    CERN
  • HEPiX I/2004 (Edinburgh): http://www.nesc.ac.uk/esi/events/291/
  • ELFms, status, deployment by German Cancio
  • Lemon Web Monitoring by Miroslav Siket
  • CHEP 2004 (Interlaken): http://chep2004.web.cern.ch/chep2004/
  • Current Status of Fabric Management at CERN by
    German Cancio

20
Questions?