Title: CLRC-RAL Site Report
1 CLRC-RAL Site Report
- John Gordon
- CLRC eScience Centre
2 - General PP Facilities
- New UK Supercomputer
- BaBar TierA Centre
- Networking
3 UK GridPP Tier1/A Centre at CLRC
- Prototype Tier 1 centre for CERN LHC and FNAL experiments
- Tier A centre for SLAC BaBar experiment
- Testbed for EU DataGrid project
Computing Farm: This year's new hardware consists of 4 racks holding 156 dual-CPU PCs, a total of 312 1.4GHz Pentium III Tualatin CPUs. Each box has 1GB of memory, a 40GB internal disk and 100Mb Ethernet.
Disk Farm: The new 40TByte disk-based mass storage unit can store 40TB of raw data after the RAID 5 overhead. The PCs are clustered on network switches with up to 8x1000Mbit Ethernet out of each rack.
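The slide quotes 40TB usable after the RAID 5 overhead but does not give the array geometry. As a rough sketch of how that overhead works, RAID 5 sacrifices one disk's worth of capacity per array to parity; the disk count and drive size below are illustrative assumptions, not figures from the slide:

```python
def raid5_usable_tb(disks_per_array: int, disk_size_tb: float) -> float:
    """RAID 5 stores one disk's worth of parity per array,
    so usable capacity is (n - 1) * disk_size."""
    if disks_per_array < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    return (disks_per_array - 1) * disk_size_tb

# Hypothetical layout: arrays of 8 disks using 0.12 TB (120 GB) drives.
per_array = raid5_usable_tb(8, 0.12)
```

With larger arrays the parity overhead shrinks proportionally, which is why the fraction lost to RAID 5 depends entirely on the chosen array width.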
Inside The Tape Robot: The tape robot was upgraded last year and now uses 60GB STK 9940 tapes. It currently holds 45TB but could hold 330TB when full.
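The quoted robot figures are consistent with simple cartridge arithmetic; the slot and cartridge counts below are derived from the slide's numbers, not stated on it:

```python
TAPE_GB = 60     # STK 9940 cartridge capacity from the slide
FULL_TB = 330    # robot capacity when full
HELD_TB = 45     # currently stored

slots = FULL_TB * 1000 // TAPE_GB         # cartridge slots implied when full
tapes_in_use = HELD_TB * 1000 // TAPE_GB  # cartridges needed for current holdings
```

This works out to roughly 5500 slots, of which about 750 would be occupied at today's 45TB.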
6 [Chart: tape robot occupancy - free, free and out of robot, 36TB HEP data]
7 HPCx
- UK SuperComputer for next 6 years
- Collaboration of CLRC Daresbury Laboratory, Edinburgh EPCC and IBM
- Sited at CLRC-DL
- http://www.hpcx.ac.uk
- Double in performance every 2 years, i.e. 2 upgrades
- Capability computing
- Target to get 50% of the jobs using 50% of the machine
- Hardware
- 40x32 IBM pSeries 690 Regatta-H nodes (Power4 CPUs)
- 1280 1.3GHz CPUs, estimated peak performance 6.6TeraFLOPS
- IBM Colony switch connects blocks of 8 CPUs (i.e. looks like 160x8, not 40x32)
- 1280 GB of memory
- 2x32 already in place as a migration aid
- Service testing mid November, service December
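The quoted 6.6 TeraFLOPS peak follows from the CPU count and clock if one assumes 4 floating-point operations per cycle per POWER4 core (two fused multiply-add units, each counting as 2 flops); that per-cycle figure is an assumption for illustration, not stated on the slide:

```python
cpus = 1280           # from the slide
clock_ghz = 1.3       # from the slide
flops_per_cycle = 4   # assumed: two FMA units per POWER4 core, 2 flops each

# Peak in TeraFLOPS: cpus * GHz gives giga-cycles/s across the machine
peak_tflops = cpus * clock_ghz * flops_per_cycle / 1000
```

The result comes out at about 6.66 TFLOPS, matching the slide's quoted 6.6TeraFLOPS estimate.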
8 HPCx
- Software
- Capability computing on around 1000 high performance CPUs
- Terascale Applications team
- Parallelising applications for 1000s of CPUs
- Different architecture compared to T3E etc.
- HPCx is a cluster of 32-processor machines compared to the MPP style of the T3E
- Some MPI operations now very slow (e.g. barriers, all-to-all communications)
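A toy message-count model (not measured HPCx behaviour) illustrates why all-to-all communication degrades so much faster than a tree-based barrier as the CPU count approaches 1000:

```python
def messages_all_to_all(p: int) -> int:
    """Naive all-to-all: every one of p ranks sends to the other p-1."""
    return p * (p - 1)

def messages_barrier(p: int) -> int:
    """Tree-based barrier: roughly one fan-in and one fan-out message per rank."""
    return 2 * (p - 1)

# At ~1000 CPUs the all-to-all message count is hundreds of times larger.
ratio = messages_all_to_all(1024) / messages_barrier(1024)  # ratio == 512.0
```

Real MPI implementations use smarter all-to-all algorithms, but the quadratic growth in total traffic is the underlying reason these collectives hurt at scale.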
9 RAL Tier A
- RAL is TierA Centre for BaBar
- Like CC-IN2P3 but concentrating on different data
- Shared resource with LHC and other experiments
- Use
10 Hardware
- 104 noma-like machines allocated to BaBar
- 156 old-farm machines shared with other experiments
- 6 BaBar Suns (4-6 CPUs each)
- 20 TB disk for BaBar
- Also using 10 TB of pool disk for data transfers
- All disk servers on Gigabit ethernet
- Pretty good server performance
- as well as existing RAL facilities
- 622 Mbits/s network to SLAC and elsewhere
- AFS cell
- 100TB Tape robot
- Many years' experience running BaBar software
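For a sense of what the 622 Mbit/s link to SLAC means for bulk data movement, here is a back-of-envelope transfer-time estimate; the 50% link efficiency used in the example is an illustrative assumption, not a measured figure:

```python
def transfer_hours(data_tb: float, link_mbit_s: float, efficiency: float = 1.0) -> float:
    """Hours to move data_tb terabytes over a link of link_mbit_s megabits/s,
    derated by an assumed end-to-end efficiency factor."""
    bits = data_tb * 1e12 * 8                 # terabytes -> bits
    return bits / (link_mbit_s * 1e6 * efficiency) / 3600

# 1 TB over the 622 Mb/s RAL-SLAC link, assuming the link runs at 50%
hours = transfer_hours(1.0, 622, efficiency=0.5)
```

At full rate a terabyte takes about three and a half hours, so daily multi-terabyte copies from SLAC are plausible on this link, consistent with the 1-2 day turnaround mentioned later.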
11 Problems
- Disk problems tracked down to a bad batch of drives
- All drives are now being replaced by the manufacturer - our disks should be done in 1 month
- By using spare servers, replacement shouldn't interrupt service
- Initially suffered from lack of support staff and out-of-hours support (for US hours)
- Two new system managers now in post
- Two more being recruited (one just for BaBar)
- Additional staff have been able to help with problems at weekends
- Discussing more formal arrangements
13 RAL Batch CPU Use
14 RAL Batch Users (running at least one non-trivial job each week)
A total of 113 new BaBar users registered since December
15 Data at RAL
- All data in Kanga format is at RAL
- 19 TB currently on disk
- Series-8, series-10, reskimmed series-10
- AllEvents streams
- data, signal and generic MC
- New data copied from SLAC within 1-2 days
- RAL is now the primary Kanga analysis site
- New data is archived to tape at SLAC and then deleted from disk
16 Changes since July
- Two new RedHat 6 front-end machines
- Dedicated to BaBar use
- Login to babar.gridpp.rl.ac.uk
- Trial RedHat 7.2 service
- One front-end and (currently) 5 batch workers
- Once we are happy with the configuration, many/all of the rest of the batch workers will be rapidly upgraded
- ssh AFS token passing installed on front-ends
- So, your local (e.g. SLAC) token is available when you log in
- Trial Grid Gatekeeper available (EDG 1.2)
- Allows job submission from the Grid
- Improved new user registration procedures
17 Plans
- Upgrade full farm to RedHat 7.2
- Leave RedHat 6 front-end for use with older releases
- Upgrade Suns to Solaris 8 and integrate into PBS queues
- Install dedicated data import/export machines
- Fast (Gigabit) network connection
- Special firewall rules to allow scp, bbftp, bbcp, etc.
- AFS authentication improvements
- PBS token passing and renewal
- Integrated login (AFS token on login, like SLAC)
18 Plans
- Objectivity support
- Works now for private federations, but no data import
- Support Grid generic accounts, so special RAL user registration is no longer necessary
- Procure next batch of hardware
- Delivery probably early 2003
19 Network
- Tier1 internal networking will be a hybrid of
- 100Mb to nodes of cpu farms with 1Gb up from switches
- 1Gb to disk servers
- 1Gb to tape servers
- UK academic network SuperJANET4
- 2.5Gbit backbone upgrading to 10Gb in 2002
- RAL 622Mb into SJ4, upgraded to 2.5Gb June 02
- SJ4 has 2.5Gb interconnect to Geant
- 2.5Gb links to ESnet and Abilene just for research users
- UK involved in networking development
- internal with Cisco on QoS
- external with DataTAG
- Lambda CERN -> StarLight
- Private connections