Title: Tier-1 Status
1 Tier-1 Status
- Andrew Sansum
- GRIDPP20
- 12 March 2008
2 Tier-1 Capacity delivered to WLCG (2007)
[Chart: Tier-1 capacity delivered to WLCG in 2007, with RAL's contribution highlighted]
3 Tier-1 CPU Share by 2007 MoU
4 Wall Time
5 CPU Use by VO (2007)
[Chart: CPU use by VO in 2007 - ATLAS, ALICE, CMS, LHCb]
6 Experiment Shares (2008)
7 Grid Only
- Non-Grid access to the Tier-1 has now ended. Only special cases (contact us if you believe you are one) still have access to:
  - UIs
  - Job submission
- Until end of May 2008:
  - IDs will be maintained (disabled)
  - Home directories will be maintained online
  - Mail forwarding will be maintained
- After end of May 2008:
  - IDs will be deleted
  - Home filesystem will be backed up
  - Mail spool will be backed up
  - Mail forwarding will stop
- AFS service continues for BaBar (and just in case)
8 Reliability
- Feb: mainly due to the power failure, plus 8 hours of network downtime
- Jan/Dec: mainly CASTOR problems over the Christmas period (despite multiple callouts)
- Out-of-hours on-call will help, but some problems take time to diagnose/fix
9 Power Failure: Thursday 7th February, 13:00
- Work on the power supply since December
  - Down to 1 transformer (from 2) for extended periods (weeks); increased risk of disaster
  - Single transformer running at maximum operating load
  - No problems until the work finished and the casing was closed: a control line was crushed and the power supply tripped
- Total loss of power to the whole building
  - First power interruption for over 3 years
- Restart (effort > 200 FTE hours)
  - Most global/national/Tier-1 core systems up by Thursday evening
  - Most of CASTOR and part of batch up by Friday
  - Remaining batch on Saturday
  - Still problems to iron out in CASTOR on Monday/Tuesday
- Lessons
  - Communication was prompt and sufficient, but ad hoc
  - Broadcast unavailable as RAL runs the GOCDB (now fixed by caching)
  - Careful restart of disk servers was slow and labour intensive (it worked, but will not scale)
- See http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/
10 Hardware: Disk
- Production capacity: 138 servers, 2800 drives, 850TB (usable)
- 1.6PB of capacity delivered in January by Viglen:
  - 91 Supermicro 3U servers with dual AMD 2220E (2.8GHz) dual-core CPUs, 8GB RAM, IPMI
    - 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 250GB WD HDD
    - 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB WD HDD
  - 91 Supermicro 3U servers with dual Intel E5310 (1.6GHz) quad-core CPUs, 8GB RAM, IPMI
    - 1 x 3ware 9650 4-port PCIe RAID controller with 2 x 400GB Seagate HDD
    - 1 x 3ware 9650 16-port PCIe RAID controller with 14 x 750GB Seagate HDD
- Acceptance tests running; scheduled to be available by the end of March
- 5400 spinning drives after the planned phase-out in April (expect a drive failure every 3 days; see the rough estimate below)
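As a rough sanity check of that failure-every-3-days figure (a sketch only: the ~2.5% annualised drive failure rate used here is an assumed value, not a measured Tier-1 number):

```python
# Rough check of "one drive failure every ~3 days" for 5400 spinning drives.
# The annualised failure rate (AFR) below is an assumption for illustration only.
drives = 5400
afr = 0.025                                    # assumed: ~2.5% of drives fail per year
failures_per_year = drives * afr               # ~135 failures per year
days_between_failures = 365 / failures_per_year
print(f"~{failures_per_year:.0f} failures/year, one every ~{days_between_failures:.1f} days")
# -> ~135 failures/year, one every ~2.7 days
```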
11 Hardware: CPU
- Production: about 1500 KSI2K on 600 systems
- Recently upgraded about 50% of capacity to 2GB/core
- Recent procurement (approximately 3000 KSI2K, but YMMV) delivered and under test:
  - Streamline: 57 x 1U servers (114 systems, 3 racks), each system with
    - dual Intel E5410 (2.33GHz) quad-core CPUs
    - 2GB/core, 1 x 500GB HDD
  - Clustervision: 56 x 1U servers (112 systems, 4 racks), each system with
    - dual Intel E5440 (2.83GHz) quad-core CPUs
    - 2GB/core, 1 x 500GB HDD
12 Hardware: Tape
- Tape drives
  - 8 x 9940B drives
    - Used on the legacy ADS/dCache services; phase-out soon
  - 18 T10K tape drives and associated servers delivered; 15 in production, remainder soon
    - Planned bandwidth: 50MB/s per drive
    - Actual bandwidth: 8-80MB/s (a work in progress)
- Media
  - Approximately 2PB on site
13 Hardware: Network
[Diagram: RAL site and Tier-1 network. Tier-1 CPU and disk switch stacks (Nortel 5510/5530) connect to a Force10 C300 8-slot router (64 x 10Gb); 10Gb/s links run to the OPN router (10Gb/s to CERN) with a firewall bypass, and to the Site Access Router (10Gb/s to SJ5). Also shown: RAL Tier 2 (N x 1Gb/s), ADS caches, Oracle systems, and a 1Gb/s test link to Lancaster]
14 RAL links
[Diagram: RAL network links, marked as implemented, to be implemented soon, or never]
15 Backplane Failures (Supermicro)
- 3 servers burnt out a backplane
  - 2 of which set off VESDA
  - 1 called out the fire brigade
- Safety risk assessment: urgent rectification needed
- Good response from supplier/manufacturer
  - PCB fault in a bad batch
  - Replacement nearly complete
16 Machine Rooms
- Existing machine room
  - Approximately 100 racks of equipment
  - Getting close to power/cooling capacity
- New machine room
  - Work still proceeding close to schedule
  - 800 m2 can accommodate 300 racks and 5 robots
  - 2.3MW power/cooling capacity (some UPS)
  - Scheduled to be available for September 2008
17 CASTOR Memory Lane
[Timeline figure, 4Q05 to 1Q08 ("Happy days!"), with annotations including:]
- CASTOR1 tests OK
- CASTOR2 core running; hard to install dependencies
- CMS on CASTOR for CSA06. Encouraging. Declare production service.
- 2.1.2 bad
- 2.1.3 good but missing functionality
- ATLAS on CASTOR? LHCb on CASTOR?
- Problems with functionality and performance: it doesn't work!
- OC committees note improvement but are concerned
- Service stopped for extended upgrade
- 2.1.4 upgrade goes well. Disk1 support!
- CSA07 encouraging
- CCRC08 reasonably successful
18 Growth in Use of CASTOR
19 Test Architecture
[Diagram: CASTOR test architecture - Development, Preproduction and Certification Testbed instances, each built from shared services (Oracle NS/vmgr), Oracle stager/DLF/repack databases, stager, DLF, repack and LSF components, a tape server, and 1 diskserver (variable)]
20 CASTOR Production Architecture
[Diagram: production CASTOR layout - shared services (Oracle NS/vmgr, Name Server 2) and tape servers, plus four stager instances (CMS, ATLAS, LHCb, and Repack and Small User), each with its own Oracle stager and DLF databases, stager, DLF and LSF components, and diskservers]
21 ATLAS Data Flow Model
[Diagram: ATLAS data flow between the T0, T1s (including a partner T1) and T2s, showing RAW, ESD, AODm and TAG datasets (ESD1/AODm1, ESD2/AODm2, AODm2/TAG), plus simulated RAW (simRaw), T0 RAW and stripping (StripInput/simStrip) flows]
22 CMS Dataflow
[Diagram: CMS data flow through three disk pools, all disk0tape1 - WanIn (8 LSF slots per server, fed from T0/T1/T2), WanOut (16 LSF slots per server, serving T1/T2) and FarmRead (50 LSF slots per server, serving the batch farm) - with disk-to-disk copies between pools and tape recall]
23 CMS Disk Server Tuning (CSA06/CSA07)
- Problem: network performance too low (network fixes sketched after this list)
  - Increase default/maximum TCP window size
  - Increase TCP ring buffers and tx queue
  - Ext3 journal changed to data=writeback
- Problem: performance still too low
  - Reduce number of gridftp slots per server
  - Reduce number of streams per file
- Problem: PhEDEx transfers now time out
  - Reduce FTS slots to match disk pools
- Problem: servers sticky or crash with OOM
  - Limit total TCP buffer space
  - Protect low memory
  - Aggressive cache flushing
- See http://www.gridpp.ac.uk/wiki/RAL_Tier1_Disk_Server_Tuning
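To make the first group of changes concrete, TCP window/buffer sizes and the transmit queue are normally adjusted via kernel sysctls and the interface queue length. The sketch below shows the general shape only; the parameter values and the eth0 interface name are assumptions, not the settings documented on the RAL tuning wiki.

```python
# Illustrative sketch only: the kind of TCP window/buffer and tx-queue tuning
# described above. Values are assumptions, not the RAL production settings.
import subprocess

TCP_SYSCTLS = {
    "net.core.rmem_max": "16777216",             # max socket receive buffer (bytes)
    "net.core.wmem_max": "16777216",             # max socket send buffer (bytes)
    "net.ipv4.tcp_rmem": "4096 87380 16777216",  # min/default/max TCP receive window
    "net.ipv4.tcp_wmem": "4096 65536 16777216",  # min/default/max TCP send window
    "net.core.netdev_max_backlog": "30000",      # larger ingress packet backlog
}

def apply_sysctls(settings):
    """Apply each setting with 'sysctl -w key=value' (requires root)."""
    for key, value in settings.items():
        subprocess.check_call(["sysctl", "-w", f"{key}={value}"])

if __name__ == "__main__":
    apply_sysctls(TCP_SYSCTLS)
    # Longer transmit queue on the data interface (interface name is an assumption):
    subprocess.check_call(["ip", "link", "set", "dev", "eth0", "txqueuelen", "10000"])
    # The ext3 change is a mount option, e.g.: mount -o remount,data=writeback <filesystem>
```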
24 3Ware Write Throughput
25 CCRC08 Disk Server Tuning
- Migration rate to tape very bad (5-10MB/s) when concurrent with writing data to disk
  - Was OK in CSA06 (50MB/s per server) on the Areca servers
- 3Ware 9550 performance terrible under concurrent read/write (2MB/s read, 120MB/s write)
  - 3Ware appears to prioritise writes
- Tried many tweaks, most with little success, except (both options sketched after this list):
  - Either changing the elevator to anticipatory
    - Downside: write throughput reduced
    - Good under benchmarking; testing in production this week
  - Or increasing block device read-ahead
    - Read throughput high but erratic under test
    - But seems OK in production (30MB/s per server)
- See http://www.gridpp.rl.ac.uk/blog/2008/02/29/3ware-raid-controllers-and-tape-migration-rates/
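For illustration, the two mitigations above correspond to standard Linux knobs on 2.6-era kernels: the per-device I/O scheduler in sysfs and the block-device read-ahead set via blockdev. A minimal sketch, assuming a device name and a read-ahead size that are not the RAL values:

```python
# Minimal sketch of the two mitigations discussed above (2.6-era kernels).
# The device name and read-ahead size are assumptions, not RAL production values.
import subprocess

def set_io_scheduler(device, scheduler="anticipatory"):
    """Switch the I/O elevator for one block device, e.g. cfq -> anticipatory."""
    with open(f"/sys/block/{device}/queue/scheduler", "w") as f:
        f.write(scheduler)

def set_readahead(device, sectors=8192):
    """Raise the block-device read-ahead (in 512-byte sectors), here ~4MB."""
    subprocess.check_call(["blockdev", "--setra", str(sectors), f"/dev/{device}"])

if __name__ == "__main__":
    set_io_scheduler("sda", "anticipatory")   # option 1: anticipatory elevator
    set_readahead("sda", 8192)                # option 2: larger read-ahead
```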
26 CCRC (CMS WanIn)
[Charts: CMS WanIn pool during CCRC08 - network in/out, PhEDEx rate, migration queue and CPU load, with a 300MB/s marker and the Tier-0 rate shown]
27 CCRC (WanOut)
[Charts: WanOut pool during CCRC08 - network in/out, PhEDEx rate and CPU load, with a 300MB/s marker, before and after replication]
28 CASTOR Plans for May CCRC08
- Still problems
  - Optimising end-to-end transfer performance remains a balancing act
  - Hard to manage the complex configuration
- Working on
  - ALICE/xrootd deployment
  - Preparation for the 2.1.6 upgrade
  - Installation of Oracle RACs (resilient Oracle services for CASTOR)
  - Provisioning and configuration management
29 dCache Closure
- Agreed with the UB that we would give 6 months' notice before terminating the dCache service
- dCache closure announced to the UB as May 2008
- ATLAS and LHCb working to migrate their data
  - Migration slower than hoped
- Service much reduced in size now (10-12 servers remain) and operational overhead much lower
- Migration of the remaining non-LHC experiments delayed by the low priority of non-CCRC work
  - Work on the Gen instance of CASTOR will recommence shortly
- Pragmatically, closure may be delayed by several months until Minos and the tiny VOs have migrated
30 Termination of GridPP Use of ADS Service
- GridPP funding and use of the old legacy Atlas Datastore service scheduled to end at the end of March 2008
  - No GridPP access via the tape command after this
  - Also no access via the C-callable VTP interface
- RAL will continue to operate the ADS service and experiments are free to purchase capacity directly from the Datastore team
- Pragmatically, closure cannot happen until
  - dCache ends (it uses the ADS back end)
  - CASTOR is available for small VOs
  - Probably 6 months away
31 Conclusions
- Hardware for the 2008 MoU is in the machine room and moving satisfactorily through acceptance
  - Volume not yet a problem, but warning signs are starting to appear
- CASTOR situation continues to improve
  - Reliable during CCRC08
  - Hardware performance improving; the tape migration problem is reasonably understood and partly solved, with scope for further improvement
  - Progressing various upgrades
- Remaining Tier-1 infrastructure essentially problem free
- Availability fair but stagnating; need to progress:
  - Incident response staff
  - On-call
  - Disaster planning and national/global/cluster resilience
- Concerned that we have still not seen all experiment use cases