1
BNL Site Report
  • Ofer Rind
  • Brookhaven National Laboratory
  • rind@bnl.gov
  • Spring HEPiX Meeting, CASPUR
  • April 3, 2006

3
(Brief) Facility Overview
  • RHIC/ATLAS Computing Facility is operated by BNL
    Physics Dept. to support the scientific computing
    needs of two large user communities
  • RCF is the Tier-0 facility for the four RHIC
    expts.
  • ACF is the Tier-1 facility for ATLAS in the U.S.
  • Both are full-service facilities
  • >2400 users, 31 FTE
  • RHIC Run6 (Polarized Protons) started March 5th

4
Mass Storage
  • Soon to be in full production....
  • Two SL8500s: 2 x 6.5K tape slots, 5 PB
    capacity
  • LTO-3 drives: 30 x 80 MB/sec, 400 GB/tape
    (native)
  • All Linux movers: 30 RHEL4 machines, each with 7
    Gbps ethernet connectivity and an aggregate 4 Gbps
    direct-attached connection to DataDirect S2A
    fibre channel disk
  • This is in addition to the 4 STK Powderhorn silos
    already in service (4 PB, 20K 9940B tapes)
  • Transition to HPSS 5.1 is complete
  • It's different... a learning curve due to numerous
    changes
  • PFTP client incompatibilities and cosmetic
    changes
  • Improvements to Oak Ridge Batch System optimizer
  • Code fixed to remove long-time source of
    instability (no crashes since)
  • New features being designed to improve access
    control

5
Centralized Storage
  • NFS: currently 220 TB of FC SAN served by 37
    Solaris 9 servers
  • Over the next year, plan to retire 100 TB of
    mostly NFS served storage (MTI, ZZYZX)
  • AFS: RHIC and USATLAS cells
  • Looking at Infortrend disk (SATA with FC frontend,
    RAID6) for an additional 4 TB (raw) per cell
  • Future upgrade to OpenAFS 1.4
  • Panasas: 20 shelves, 100 TB, heavily used by RHIC

6
Panasas Issues
  • Panasas DirectFlow (version 2.3.2)
  • High performance and fairly stable,
    but...problematic from an administrative
    perspective
  • Occasional stuck client-side processes left in
    uninterruptible sleep
  • DirectFlow module causes kernel panics from time
    to time
  • Can always panic a kernel with panfs mounted by
    running a Nessus scan on the host
  • Changes in ActiveScale server configuration (e.g.
    changing the IP addresses of non-primary director
    blades), which the company claims are innocuous,
    can cause clients to hang.
  • Server-side NFS limitations
  • NFS mounting was tried as a fallback option and
    found to be unfeasible with our configuration:
    heavy NFS traffic causes director blades to crash.
    Panasas suggests limiting to <100 clients per
    director blade.

7
Update on Security
  • Nessus scanning program implemented as part of
    ongoing DOE CA process
  • Constant low-level scanning
  • Quarterly scanning is more intensive, with a port
    exclusion scheme to protect sensitive processes
  • Samhain
  • Filesystem integrity checker (akin to Tripwire)
    with central management of monitored systems
  • Currently deployed on all administrative systems

8
Linux Farm Hardware
  • >4000 processors, >3.5 MSI2K
  • 700 TB of local storage (SATA, SCSI, PATA)
  • SL 3.0.5 (3.0.3) for RHIC (ATLAS)
  • Evaluated dual-core Opteron and Xeon for upcoming
    purchase
  • Recently encountered problems with Bonnie I/O
    tests using RHEL4 64-bit with software RAID and
    LVM on Opteron (see the sketch below)
  • Xeon (Paxville) gives poor SI/watt performance
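
    The Bonnie runs themselves are not spelled out in the slides; the
    following is a minimal sketch of how such a test could be driven,
    assuming the bonnie++ variant of the benchmark and a placeholder
    mount point on the software RAID/LVM volume. All names and sizes
    here are illustrative, not the facility's actual test setup.

    # Minimal sketch of driving a Bonnie-style I/O test over a test
    # filesystem. Assumes the bonnie++ variant is installed; the mount
    # point, working-set size, and user below are placeholder values.
    import os

    MOUNT_POINT = "/data1/bonnie"   # hypothetical software RAID+LVM volume
    SIZE_MB = 16384                 # should exceed RAM to defeat caching

    def run_bonnie():
        """Run bonnie++ against MOUNT_POINT and return its exit status."""
        cmd = "bonnie++ -d %s -s %d -u nobody" % (MOUNT_POINT, SIZE_MB)
        return os.system(cmd)

    if __name__ == "__main__":
        status = run_bonnie()
        print("bonnie++ exited with status %d" % status)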

9
Power & Cooling
  • Power and cooling are now significant factors in
    purchasing
  • Added 240 kW to the facility for '06 upgrades
  • Long term: possible site expansion
  • Liebert XDV Vertical Top Cooling Modules to be
    installed on new racks
  • CPU and ambient temperature monitoring via
    dtgraph and custom Python scripts
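
    The monitoring scripts are not shown in the slides; the sketch below
    illustrates one way such a poller could look, assuming lm_sensors
    output is parsed and readings are appended to a flat log that a
    plotting front end such as dtgraph could graph. The log path and
    parsing details are assumptions, not the facility's actual tooling.

    # Minimal sketch of a CPU/ambient temperature poller (assumed tooling):
    # parse `sensors` (lm_sensors) output and append timestamped readings
    # to a flat file for later plotting.
    import os
    import re
    import time

    LOGFILE = "/var/log/farm-temps.log"   # hypothetical output path

    def read_temps():
        """Return a dict of sensor label -> degrees C parsed from `sensors`."""
        temps = {}
        for line in os.popen("sensors").readlines():
            # typical line: "temp1:      +41.0 C  (high = +60.0 C ...)"
            m = re.match(r"^([^:]+):\s*\+?([0-9][0-9.]*)", line)
            if m and "temp" in m.group(1).lower():
                temps[m.group(1).strip()] = float(m.group(2))
        return temps

    def main():
        out = open(LOGFILE, "a")
        stamp = int(time.time())
        for label, value in read_temps().items():
            out.write("%d %s %.1f\n" % (stamp, label.replace(" ", "_"), value))
        out.close()

    if __name__ == "__main__":
        main()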

10
Distributed Storage
  • Two large dCache instances (v1.6.6) deployed in a
    hybrid server/client model
  • PHENIX: 25 TB disk, 128 servers, >240 TB of data
  • ATLAS: 147 TB disk, 330 servers, >150 TB of data
  • Two custom HPSS backend interfaces
  • Performance tuning on ATLAS write pools
  • Peak transfer rates of >50 TB/day
  • Other large deployments: Xrootd (STAR), rootd, and
    anatrain (PHENIX)

11
Batch Computing
  • All reconstruction and analysis batch systems
    have been migrated to Condor, except for STAR
    analysis (which still awaits features like
    global job-level resource reservation) and
    some ATLAS distributed analysis; these still
    use LSF 6.0
  • Configuration
  • Five Condor (6.6.x) pools on two central managers
  • 113 available submit nodes
  • One monitoring/Condorview server and one backup
    central manager
  • Lots of performance tuning
  • Autoclustering of jobs for scheduling, timeouts,
    negotiation cycle, socket cache, collector query
    forking, etc.

12
Condor Usage
  • Use of heavily modified CondorView client to
    display historical usage.

13
Condor Flocking
  • Goal: full utilization of computing resources on
    the farm
  • Increasing use of a general queue which allows
    jobs to run on idle resources belonging to other
    experiments, provided that there are no local
    resources available to run the job
  • Currently, such opportunistic jobs are
    immediately evicted if a local job places a claim
    on the resource
  • >10K jobs completed so far

14
Condor Monitoring
  • Nagios and custom scripts provide live monitoring
    of critical daemons
  • Place job history from 100 submit nodes into
    central database
  • This model will be replaced by Quill.
  • Custom statistics extracted from the database
    (e.g. general queue, throughput, etc.)
  • Custom startd, schedd, and startd cron ClassAds
    allow for quick viewing of the state of the pool
    using Condor commands (a query sketch follows
    this list)
  • Some information accessible via web interface
  • Custom startd ClassAds allow for remote and
    peaceful turn off of any node
  • This capability is not natively available in
    Condor
  • Note that the condor_off -peaceful command
    (v6.8) cannot be canceled; one must wait until
    running jobs exit
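
    As a rough illustration of inspecting such custom attributes with
    standard Condor commands, a minimal Python sketch follows. The
    attribute name NODE_RETIRING is a hypothetical stand-in for the
    site's actual custom ClassAd; only condor_status and its
    -constraint/-format options are standard Condor.

    # Minimal sketch: list startds that advertise a custom "retiring"
    # ClassAd attribute. NODE_RETIRING is a hypothetical attribute name;
    # the condor_status options used here are standard.
    import os

    def retiring_nodes(attr="NODE_RETIRING"):
        """Return hostnames whose startd ClassAd sets the given attribute."""
        cmd = ("condor_status -constraint '%s =?= TRUE' "
               "-format '%%s\\n' Machine" % attr)
        return [h.strip() for h in os.popen(cmd).readlines() if h.strip()]

    if __name__ == "__main__":
        for host in retiring_nodes():
            print("%s is draining; running jobs will finish first" % host)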

15
Nagios Monitoring
  • 13,958 services total on 1,963 hosts, an average
    of about 7 services checked per host
  • Originally had one Nagios server (dual 2.4
    GHz)...
  • Tremendous latency: services reported down many
    minutes after the fact
  • Web interface completely unusable (due to number
    of hosts and services)
  • ...all of this despite a lot of nagios and system
    tuning...
  • Nagios data written to ramdisk
  • Increased no. of file descriptors and no. of
    processes allowed
  • Monitoring data read from MySQL database on
    separate host
  • Web interface replaced with a lightweight
    interface to the database server (see the sketch
    after this list)
  • Solution: split services roughly in half between
    two Nagios servers
  • Latency is now very good
  • Events from both servers logged to one MySQL
    server
  • With two servers there is still room for many
    more hosts and a handful of service checks.
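
    The lightweight database interface itself is not shown; the following
    is a minimal sketch of the idea, assuming a hypothetical
    service_events table (host, service, state, last_check) in the
    central MySQL server and the MySQLdb module. The hostname,
    credentials, and schema are placeholders, not the production setup.

    # Sketch of a lightweight, read-only status query against the central
    # MySQL server that both Nagios hosts log to. Table and column names
    # (service_events: host, service, state, last_check) are assumptions.
    import MySQLdb

    def current_problems(dbhost="nagios-db.example.com"):
        conn = MySQLdb.connect(host=dbhost, user="nagios_ro",
                               passwd="********", db="nagios")
        cur = conn.cursor()
        # Nagios convention: state 0 = OK, anything else is a problem
        cur.execute("SELECT host, service, state, last_check "
                    "FROM service_events WHERE state <> 0 "
                    "ORDER BY last_check DESC")
        rows = cur.fetchall()
        conn.close()
        return rows

    if __name__ == "__main__":
        for host, service, state, last_check in current_problems():
            print("%s / %s: state=%s at %s" % (host, service, state, last_check))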

16
Nagios Monitoring
17
ATLAS Tier-1 Activities
  • OSG 0.4, LCG 2.7 (this wk.)
  • ATLAS Panda (Production And Distributed Analysis)
    used for production since Dec. 05
  • Good performance in scaling tests, with low
    failure rate and manpower requirements
  • Network Upgrade
  • 2 X 10 Gig LAN and WAN
  • Terapath QoS/MPLS (BNL, UM, FNAL, SLAC, ESNET)
  • DOE-supported project to introduce an end-to-end
    QoS network into data management
  • Ongoing intensive development w/ESNET

SC 2005
18
ATLAS Tier-1 Activities
  • SC3 Service Phase (Oct-Dec 05)
  • Functionality validated for full production chain
    to Tier-1
  • Exposed some interoperability problems between
    BNL dCache and FTS (fixed now)
  • Needed further improvement in operation,
    performance and monitoring.
  • SC3 Rerun Phase (Jan-Feb 06)
  • Achieved performance (disk-disk, disk-tape) and
    operations benchmarks

19
ATLAS Tier-1 Activities
  • SC4 Plan
  • Deployment of storage element, grid middleware
    (LFC, LCG, FTS), and ATLAS VO box
  • April: data throughput phase (disk-disk and
    disk-tape); goal is T0-to-T1 operational
    stability
  • May: T1-to-T1 data exercise
  • June: ATLAS data distribution from T0 to T1 to
    select T2s
  • July-August: limited distributed data processing,
    plus analysis
  • Remainder of '06: increasing scale of data
    processing and analysis

20
Recent P-P Collision in STAR