Title: Site Report: The Linux Farm at the RCF
1 Site Report: The Linux Farm at the RCF
- October 22-25, 2002
- Ofer Rind
- RHIC Computing Facility
- Brookhaven National Laboratory
2 RCF - Overview
- Provide computing facilities for RHIC users
- General computing environment
- General interactive tasks (email, document processing, web)
- Data analysis facility
- Computing infrastructure for RHIC experiments
- Code development, repository distribution
- Raw data recording and reconstruction
- Data analysis
- ACF: US ATLAS Tier 1 Computing Facility
- Shared infrastructure and synergy with RCF
- Support staff: 25 FTEs (4 dedicated to Linux)
Ofer Rind - RHIC Computing Facility Site Report
3 RCF - Structure
4 RCF - Component Summary
- Mass Storage Subsystem
- StorageTek library managed by HPSS
- 4 Silos, 1.2PB capacity (expanding to 4.5PB)
- In Run-2, raw data recorded at a common rate of 70MB/sec for a total of 170TB
- Total data store 300TB
- Disk Storage
- Fibre channel SAN served by NFS
- 110TB Raid5
- 14 Sun 450, Solaris 8 2/02 (5 Sun 480 coming online)
- IBM AFS servers (AIX)
- Linux Server Farm
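As a rough sanity check on the Run-2 figures above, the quoted rate and volume can be related by simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and a constant recording rate; the slide states neither, so the result is only an order-of-magnitude guide.

```python
# Figures quoted on the slide; decimal units are an assumption.
RATE_MB_PER_S = 70
RUN2_RAW_TB = 170


def seconds_to_record(total_tb: float, rate_mb_per_s: float) -> float:
    """Time needed to write total_tb at a constant rate_mb_per_s."""
    return (total_tb * 10**12) / (rate_mb_per_s * 10**6)


days = seconds_to_record(RUN2_RAW_TB, RATE_MB_PER_S) / 86400
print(f"{days:.1f} days of continuous recording")  # 28.1 days of continuous recording
```

In other words, 170TB at 70MB/sec corresponds to roughly four weeks of uninterrupted recording.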
5 Linux Farm Hardware
- 840 1U and 2U servers (pre-'99 towers have been retired)
- 69 kSPECint95, expanding to 100 kSPECint95 (2 TFLOPS)
- Most have 1GB mem (at least 500MB)
- Local SCSI disks up to 140GB/node
- Allocated by experiment
- Further allocated for Raw Data Reconstruction (CRS) and Reconstructed Data Analysis (CAS)
Vendor    CPU           Nodes  Date
VA Linux  PIII 450MHz   148    Jun 99
VA Linux  PIII 700MHz    48    Aug 00
VA Linux  PIII 800MHz   168    Nov 00
IBM       PIII 1000MHz  316    Aug 01
IBM       PIII 1400MHz  160    Oct 02
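The node counts in the purchase list above can be tallied to cross-check the 840-server total quoted on this slide. This is a minimal sketch; the tuple layout and the names `PURCHASES` and `by_vendor` are illustrative, and the third column is assumed to be a node count.

```python
# (vendor, cpu_mhz, nodes, date) rows as listed on the slide.
PURCHASES = [
    ("VA Linux", 450, 148, "Jun 99"),
    ("VA Linux", 700, 48, "Aug 00"),
    ("VA Linux", 800, 168, "Nov 00"),
    ("IBM", 1000, 316, "Aug 01"),
    ("IBM", 1400, 160, "Oct 02"),
]

total_nodes = sum(nodes for _, _, nodes, _ in PURCHASES)

# Break the total down by vendor.
by_vendor = {}
for vendor, _, nodes, _ in PURCHASES:
    by_vendor[vendor] = by_vendor.get(vendor, 0) + nodes

print(total_nodes)  # 840
print(by_vendor)    # {'VA Linux': 364, 'IBM': 476}
```

The sum matches the 840 servers stated above, which suggests the total already includes the 160 nodes dated Oct 02.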
6 Linux Farm Software Configuration
- Red Hat 7.2, upgraded to the 2.4.9-31 kernel
- Image(s) installed via Kickstart server and customized for RCF environment via rpm
- NFS/AFS home directory and file access
- Interactive login allowed on selected nodes
- Job management
- (CAS) LSF 4.2, slightly re-architected for robustness. Peak throughput before summer conferences was >150K jobs/week.
- (CRS) Locally produced Perl-based batch system (AIX needed for HPSS API). Approx. 670K jobs processed for Run-2.
- Expanding use of distributed disk models (rootd, ??)
- ATLAS Grid testbed
7 Tracking LSF Usage
[Figure: STAR queues weekly job statistics (week of Oct. ...). Panels: Job starts/hr; Avg runtime/hr]
8 Security and Monitoring
- Security
- RCF firewall within BNL site firewall
- SSH2-only access through gateway bastion nodes (Solaris x86)
- User access restricted to a subset of systems (CAS only)
- Monitoring
- 24 hr. on-call staff for critical systems during RHIC operation
- Cluster mgmt. software
- VACM (VA Linux)
- xCAT (IBM, http://www.x-cat.org)
- Cron scripts to "clean" nodes and head off possible problems (memory leaks, full disks, etc.)
- CTS system for problem reports
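The slide does not show the cron scripts themselves. As an illustration of the "full disks" case only, a node-cleaning job might look something like the sketch below. The function names, the 90% threshold, and the idea of pruning old files from a scratch directory are all assumptions, not the RCF implementation.

```python
import os
import shutil
import time
from pathlib import Path


def disk_fraction_used(path="/"):
    """Fraction (0.0-1.0) of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def stale_files(directory, max_age_days, now=None):
    """List regular files under directory not modified for max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return [
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    ]


def clean_scratch(directory, max_age_days, threshold=0.90):
    """Delete stale files, but only when the disk is at least threshold full.

    Returns the list of removed paths (empty if nothing was done).
    """
    if disk_fraction_used(directory) < threshold:
        return []
    removed = []
    for path in stale_files(directory, max_age_days):
        path.unlink()
        removed.append(path)
    return removed
```

A crontab entry would then call `clean_scratch` on each node's scratch area at a fixed interval; the directory and schedule would be site-specific.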
9 Farm Alert System
- Web monitoring (user-accessible) plus paging/email alerts
- Python scripts running locally transfer node status information to a MySQL database
- Notification of problems with NFS/AFS (e.g. stale file handles), LSF daemons, high load, etc.
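The flow described above (a local script samples node status, writes it to a database, and problems are flagged from there) can be sketched as follows. The slide names MySQL; this sketch substitutes Python's built-in `sqlite3` so that it runs without a server. The `node_status` table, its columns, and the load limit are hypothetical, and only the high-load check is shown.

```python
import os
import socket
import sqlite3
import time

# Hypothetical schema; the slide does not describe the real tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node_status (
    host  TEXT    NOT NULL,
    ts    INTEGER NOT NULL,   -- Unix time of the sample
    load1 REAL    NOT NULL    -- 1-minute load average
)
"""


def collect_status():
    """Gather a status sample for the local node (Unix only)."""
    load1, _, _ = os.getloadavg()
    return {"host": socket.gethostname(), "ts": int(time.time()), "load1": load1}


def store_status(conn, status):
    """Append one sample to the status table, creating it if needed."""
    conn.execute(SCHEMA)
    conn.execute(
        "INSERT INTO node_status (host, ts, load1) VALUES (:host, :ts, :load1)",
        status,
    )
    conn.commit()


def overloaded(conn, limit):
    """Return (host, load1) for nodes whose most recent sample exceeds limit."""
    query = """
        SELECT s.host, s.load1
        FROM node_status AS s
        JOIN (SELECT host, MAX(ts) AS ts FROM node_status GROUP BY host) AS latest
          ON s.host = latest.host AND s.ts = latest.ts
        WHERE s.load1 > ?
        ORDER BY s.host
    """
    return conn.execute(query, (limit,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_status(conn, collect_status())
    for host, load in overloaded(conn, limit=10.0):
        print(f"ALERT {host}: load {load:.1f}")
```

Only the latest sample per host is checked, so a node that has recovered stops alerting. A paging or email hook would replace the `print` call.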
10 Network Operation Status
Perl scripts monitor network service connectivity
for all nodes (ssh, yp, etc.)
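The RCF scripts are written in Perl and are not shown. As a rough Python illustration of the same idea, the sketch below probes a TCP port on each node and reports which services fail to answer. The `SERVICES` mapping is an assumption: ssh normally listens on TCP 22, while NIS ("yp") is RPC-based, so the portmapper port 111 is used here only as a stand-in.

```python
import socket

# Assumed service-to-port mapping (see the note on NIS above).
SERVICES = {"ssh": 22, "portmap": 111}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_nodes(nodes, services=SERVICES, probe=port_open):
    """Return {node: [unreachable service names]} for failing nodes only."""
    failures = {}
    for node in nodes:
        down = [name for name, port in services.items() if not probe(node, port)]
        if down:
            failures[node] = down
    return failures
```

Passing the probe as a parameter keeps the reporting logic testable without touching the network. A successful TCP connect only shows that something is listening, not that the service is healthy.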
11 Load Monitoring and History
- MySQL database for usage history
- History available back to Sept. '01 via web interface
- CPU load averaged over (98) PHENIX machines during the month of September
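An average like the one described above, taken over a group of machines and a date range, maps naturally onto a SQL aggregate. The sketch below assumes a hypothetical `node_status(host, ts, load1)` table and uses `sqlite3` in place of the MySQL server named on the slide; the real schema and query are not given in the source.

```python
import sqlite3


def daily_average_load(conn, hosts, start_ts, end_ts):
    """Average load1 per day across hosts for start_ts <= ts < end_ts.

    Returns [(day_number, average_load), ...] where day_number counts
    whole days since the Unix epoch (UTC).
    """
    placeholders = ",".join("?" for _ in hosts)
    query = f"""
        SELECT ts / 86400 AS day, AVG(load1)
        FROM node_status
        WHERE host IN ({placeholders}) AND ts >= ? AND ts < ?
        GROUP BY day
        ORDER BY day
    """
    return conn.execute(query, (*hosts, start_ts, end_ts)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node_status (host TEXT, ts INTEGER, load1 REAL)")
    conn.executemany(
        "INSERT INTO node_status VALUES (?, ?, ?)",
        [("n1", 0, 1.0), ("n2", 0, 3.0), ("n1", 86400, 2.0), ("n3", 0, 50.0)],
    )
    # n3 is excluded because it is not in the requested host list.
    print(daily_average_load(conn, ["n1", "n2"], 0, 2 * 86400))
    # [(0, 2.0), (1, 2.0)]
```

Grouping by day is one possible granularity; a web plot could equally use hourly bins by changing the divisor.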
12 Plans for the Near Future
- 160 newly delivered IBM nodes to be brought online
- Expect purchase bid to go out for 220 more nodes at beginning of FY03 (pending funding approval)
- Scaling up data storage capacity and throughput for Run-3 (up to 10X data increase over Run-2, starting in December)
- Evaluation of LSF 5 and Condor ongoing, with an eye towards distributed disk services
- Expanding ATLAS Grid services