1
BNL Site Report
  • Ofer Rind
  • Brookhaven National Laboratory
  • rind@bnl.gov
  • Spring HEPiX Meeting, CASPUR
  • April 3, 2006

3
(Brief) Facility Overview
  • RHIC/ATLAS Computing Facility is operated by BNL
    Physics Dept. to support the scientific computing
    needs of two large user communities
  • RCF is the Tier-0 facility for the four RHIC
    expts.
  • ACF is the Tier-1 facility for ATLAS in the U.S.
  • Both are full-service facilities
  • >2400 users, 31 FTE
  • RHIC Run6 (Polarized Protons) started March 5th

4
Mass Storage
  • Soon to be in full production....
  • Two SL8500s: 2 x 6.5K tape slots, 5 PB
    capacity
  • LTO-3 drives: 30 x 80 MB/sec, 400 GB/tape
    (native)
  • All Linux movers: 30 RHEL4 machines, each with 7
    Gbps ethernet connectivity and an aggregate 4 Gbps
    direct-attached connection to DataDirect S2A
    fibre channel disk
  • This is in addition to the 4 STK Powderhorn silos
    already in service (4 PB, 20K 9940B tapes)
  • Transition to HPSS 5.1 is complete
  • It's different... a learning curve due to numerous
    changes
  • PFTP client incompatibilities and cosmetic
    changes
  • Improvements to Oak Ridge Batch System optimizer
  • Code fixed to remove long-time source of
    instability (no crashes since)
  • New features being designed to improve access
    control

5
Centralized Storage
  • NFS: currently 220 TB of FC SAN served by 37
    Solaris 9 servers
  • Over the next year, plan to retire 100 TB of
    mostly NFS served storage (MTI, ZZYZX)
  • AFS: RHIC and USATLAS cells
  • Looking at Infortrend disk (SATA with FC frontend,
    RAID6) for an additional 4 TB (raw) per cell
  • Future upgrade to OpenAFS 1.4
  • Panasas: 20 shelves, 100 TB, heavily used by RHIC

6
Panasas Issues
  • Panasas DirectFlow (version 2.3.2)
  • High performance and fairly stable,
    but...problematic from an administrative
    perspective
  • Occasional stuck client-side processes left in
    uninterruptible sleep
  • DirectFlow module causes kernel panics from time
    to time
  • Can always panic a kernel with panfs mounted by
    running a Nessus scan on the host
  • Changes in ActiveScale server configuration (e.g.
    changing the IP addresses of non-primary director
    blades), which the company claims are innocuous,
    can cause clients to hang.
  • Server-side NFS limitations
  • NFS mounting was tried as a fallback option and
    found to be unfeasible with our configuration:
    heavy NFS traffic causes director blades to crash.
    Panasas suggests limiting to <100 clients per
    director blade.

7
Update on Security
  • Nessus scanning program implemented as part of
    ongoing DOE CA process
  • Constant low-level scanning
  • Quarterly scanning is more intensive, with a port
    exclusion scheme to protect sensitive processes
  • Samhain
  • Filesystem integrity checker (akin to Tripwire)
    with central management of monitored systems
  • Currently deployed on all administrative systems

8
Linux Farm Hardware
  • >4000 processors, >3.5 MSI2K
  • 700 TB of local storage (SATA, SCSI, PATA)
  • SL 3.0.5 (3.0.3) for RHIC (ATLAS)
  • Evaluated dual-core Opteron and Xeon for upcoming
    purchase
  • Recently encountered problems with Bonnie I/O
    tests using RHEL4 64-bit with software RAID and
    LVM on Opteron (see the sketch below)
  • Xeon (Paxville) gives poor SI/watt performance
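
    The Bonnie runs themselves are not spelled out in the slides; the
    following is a minimal sketch of how such a test could be driven,
    assuming the bonnie++ variant of the benchmark and a placeholder
    mount point on the software RAID/LVM volume. All names and sizes
    here are illustrative, not the facility's actual test setup.

    # Minimal sketch of driving a Bonnie-style I/O test over a test
    # filesystem. Assumes the bonnie++ variant is installed; the mount
    # point, working-set size, and user below are placeholder values.
    import os

    MOUNT_POINT = "/data1/bonnie"   # hypothetical software RAID+LVM volume
    SIZE_MB = 16384                 # should exceed RAM to defeat caching

    def run_bonnie():
        """Run bonnie++ against MOUNT_POINT and return its exit status."""
        cmd = "bonnie++ -d %s -s %d -u nobody" % (MOUNT_POINT, SIZE_MB)
        return os.system(cmd)

    if __name__ == "__main__":
        status = run_bonnie()
        print("bonnie++ exited with status %d" % status)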

9
Power & Cooling
  • Power and cooling are now significant factors in
    purchasing
  • Added 240 kW to the facility for '06 upgrades
  • Long term: possible site expansion
  • Liebert XDV Vertical Top Cooling Modules to be
    installed on new racks
  • CPU and ambient temperature monitoring via
    dtgraph and custom Python scripts
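
    The monitoring scripts are not shown in the slides; the sketch below
    illustrates one way such a poller could look, assuming lm_sensors
    output is parsed and readings are appended to a flat log that a
    plotting front end such as dtgraph could graph. The log path and
    parsing details are assumptions, not the facility's actual tooling.

    # Minimal sketch of a CPU/ambient temperature poller (assumed tooling):
    # parse `sensors` (lm_sensors) output and append timestamped readings
    # to a flat file for later plotting.
    import os
    import re
    import time

    LOGFILE = "/var/log/farm-temps.log"   # hypothetical output path

    def read_temps():
        """Return a dict of sensor label -> degrees C parsed from `sensors`."""
        temps = {}
        for line in os.popen("sensors").readlines():
            # typical line: "temp1:      +41.0 C  (high = +60.0 C ...)"
            m = re.match(r"^([^:]+):\s*\+?([0-9][0-9.]*)", line)
            if m and "temp" in m.group(1).lower():
                temps[m.group(1).strip()] = float(m.group(2))
        return temps

    def main():
        out = open(LOGFILE, "a")
        stamp = int(time.time())
        for label, value in read_temps().items():
            out.write("%d %s %.1f\n" % (stamp, label.replace(" ", "_"), value))
        out.close()

    if __name__ == "__main__":
        main()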

10
Distributed Storage
  • Two large dCache instances (v1.6.6) deployed in a
    hybrid server/client model
  • PHENIX: 25 TB disk, 128 servers, >240 TB of data
  • ATLAS: 147 TB disk, 330 servers, >150 TB of data
  • Two custom HPSS backend interfaces
  • Performance tuning on ATLAS write pools
  • Peak transfer rates of >50 TB/day
  • Other large deployments: Xrootd (STAR), rootd, and
    anatrain (PHENIX)

11
Batch Computing
  • All reconstruction and analysis batch systems
    have been migrated to Condor, except for STAR
    analysis (which still awaits features like
    global job-level resource reservation) and
    some ATLAS distributed analysis; these still
    use LSF 6.0
  • Configuration
  • Five Condor (6.6.x) pools on two central managers
  • 113 available submit nodes
  • One monitoring/Condorview server and one backup
    central manager
  • Lots of performance tuning
  • Autoclustering of jobs for scheduling, timeouts,
    negotiation cycle, socket cache, collector query
    forking, etc.

12
Condor Usage
  • Use of heavily modified CondorView client to
    display historical usage.

13
Condor Flocking
  • Goal: full utilization of computing resources on
    the farm
  • Increasing use of a general queue which allows
    jobs to run on idle resources belonging to other
    experiments, provided that there are no local
    resources available to run the job
  • Currently, such opportunistic jobs are
    immediately evicted if a local job places a claim
    on the resource
  • >10K jobs completed so far

14
Condor Monitoring
  • Nagios and custom scripts provide live monitoring
    of critical daemons
  • Place job history from 100 submit nodes into
    central database
  • This model will be replaced by Quill.
  • Custom statistics extracted from the database
    (e.g. general queue, throughput, etc.)
  • Custom startd, schedd, and startd cron ClassAds
    allow for quick viewing of the state of the pool
    using Condor commands (a query sketch follows
    this list)
  • Some information accessible via web interface
  • Custom startd ClassAds allow for remote and
    peaceful turn off of any node
  • This capability is not natively available in
    Condor
  • Note that the condor_off -peaceful command
    (v6.8) cannot be canceled; one must wait until
    running jobs exit
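
    As a rough illustration of inspecting such custom attributes with
    standard Condor commands, a minimal Python sketch follows. The
    attribute name NODE_RETIRING is a hypothetical stand-in for the
    site's actual custom ClassAd; only condor_status and its
    -constraint/-format options are standard Condor.

    # Minimal sketch: list startds that advertise a custom "retiring"
    # ClassAd attribute. NODE_RETIRING is a hypothetical attribute name;
    # the condor_status options used here are standard.
    import os

    def retiring_nodes(attr="NODE_RETIRING"):
        """Return hostnames whose startd ClassAd sets the given attribute."""
        cmd = ("condor_status -constraint '%s =?= TRUE' "
               "-format '%%s\\n' Machine" % attr)
        return [h.strip() for h in os.popen(cmd).readlines() if h.strip()]

    if __name__ == "__main__":
        for host in retiring_nodes():
            print("%s is draining; running jobs will finish first" % host)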

15
Nagios Monitoring
  • 13,958 services total on 1,963 hosts, an average
    of about 7 services checked per host
  • Originally had one Nagios server (dual 2.4
    GHz)...
  • Tremendous latency: services reported down many
    minutes after the fact
  • Web interface completely unusable (due to number
    of hosts and services)
  • ...all of this despite a lot of nagios and system
    tuning...
  • Nagios data written to ramdisk
  • Increased no. of file descriptors and no. of
    processes allowed
  • Monitoring data read from MySQL database on
    separate host
  • Web interface replaced with a lightweight
    interface to the database server (see the sketch
    after this list)
  • Solution: split services roughly in half between
    two Nagios servers
  • Latency is now very good
  • Events from both servers logged to one MySQL
    server
  • With two servers there is still room for many
    more hosts and a handful of service checks.
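
    The lightweight database interface itself is not shown; the following
    is a minimal sketch of the idea, assuming a hypothetical
    service_events table (host, service, state, last_check) in the
    central MySQL server and the MySQLdb module. The hostname,
    credentials, and schema are placeholders, not the production setup.

    # Sketch of a lightweight, read-only status query against the central
    # MySQL server that both Nagios hosts log to. Table and column names
    # (service_events: host, service, state, last_check) are assumptions.
    import MySQLdb

    def current_problems(dbhost="nagios-db.example.com"):
        conn = MySQLdb.connect(host=dbhost, user="nagios_ro",
                               passwd="********", db="nagios")
        cur = conn.cursor()
        # Nagios convention: state 0 = OK, anything else is a problem
        cur.execute("SELECT host, service, state, last_check "
                    "FROM service_events WHERE state <> 0 "
                    "ORDER BY last_check DESC")
        rows = cur.fetchall()
        conn.close()
        return rows

    if __name__ == "__main__":
        for host, service, state, last_check in current_problems():
            print("%s / %s: state=%s at %s" % (host, service, state, last_check))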

16
Nagios Monitoring
17
ATLAS Tier-1 Activities
  • OSG 0.4, LCG 2.7 (this wk.)
  • ATLAS Panda (Production And Distributed Analysis)
    used for production since Dec. 05
  • Good performance in scaling tests, with low
    failure rate and manpower requirements
  • Network Upgrade
  • 2 X 10 Gig LAN and WAN
  • Terapath QoS/MPLS (BNL, UM, FNAL, SLAC, ESNET)
  • DOE-supported project to introduce an end-to-end
    QoS network into data management
  • Ongoing intensive development w/ESNET

SC 2005
18
ATLAS Tier-1 Activities
  • SC3 Service Phase (Oct-Dec 05)
  • Functionality validated for full production chain
    to Tier-1
  • Exposed some interoperability problems between
    BNL dCache and FTS (fixed now)
  • Needed further improvement in operation,
    performance and monitoring.
  • SC3 Rerun Phase (Jan-Feb 06)
  • Achieved performance (disk-disk, disk-tape) and
    operations benchmarks

19
ATLAS Tier-1 Activities
  • SC4 Plan
  • Deployment of storage element, grid middleware
    (LFC, LCG, FTS), and ATLAS VO box
  • April: data throughput phase (disk-disk and
    disk-tape); goal is T0-to-T1 operational
    stability
  • May: T1-to-T1 data exercise
  • June: ATLAS data distribution from T0 to T1 to
    select T2s
  • July-August: limited distributed data processing,
    plus analysis
  • Remainder of '06: increasing scale of data
    processing and analysis

20
Recent P-P Collision in STAR