1
Tier-1 Status
  • Andrew Sansum
  • GRIDPP20
  • 12 March 2008

2
Tier-1 Capacity delivered to WLCG (2007)
[Chart]
3
Tier-1 CPU Share by 2007 MoU
4
Wall Time
5
CPU Use by VO (2007)
[Chart: CPU use split between ATLAS, ALICE, CMS and LHCb]
6
Experiment Shares (2008)
7
Grid Only
  • Non-Grid access to the Tier-1 has now ended. Only
    special cases (contact us if you believe you are one)
    still have access to:
    • UIs
    • Job submission
  • Until end of May 2008:
    • IDs will be maintained (but disabled)
    • Home directories will be maintained online
    • Mail forwarding will be maintained
  • After end of May 2008:
    • IDs will be deleted
    • Home filesystems will be backed up
    • Mail spool will be backed up
    • Mail forwarding will stop
  • AFS service continues for BaBar (and just in case)

8
Reliability
  • February: mainly the power failure, plus 8 hours of
    network downtime
  • January/December: mainly CASTOR problems over the
    Christmas period (despite multiple callouts)
  • Out-of-hours on-call will help, but some problems
    take time to diagnose/fix

9
Power Failure: Thursday 7th February, 13:00
  • Work on the power supply ongoing since December
  • Down to 1 transformer (from 2) for extended
    periods (weeks). Increased risk of disaster.
  • Single transformer running at maximum operating load
  • No problems until the work finished and the casing
    was closed: a control line was crushed and the power
    supply tripped
  • Total loss of power to the whole building
  • First power interruption for over 3 years
  • Restart (effort > 200 FTE hours)
    • Most Global/National/Tier-1 core systems up by
      Thursday evening
    • Most of CASTOR and part of batch up by Friday
    • Remaining batch on Saturday
    • Still problems to iron out in CASTOR on
      Monday/Tuesday
  • Lessons
    • Communication was prompt and sufficient, but
      ad hoc
    • Broadcast unavailable as RAL runs the GOCDB (now
      fixed by caching)
    • Careful restart of disk servers was slow and
      labour-intensive (but worked); it will not scale
  • See http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/

10
Hardware: Disk
  • Production capacity: 138 servers, 2800 drives,
    850TB (usable)
  • 1.6PB capacity delivered in January by Viglen
  • 91 Supermicro 3U servers with dual AMD 2220E
    (2.8GHz) dual-core CPUs, 8GB RAM, IPMI
    • 1 x 3ware 4-port 9650 PCIe RAID controller with
      2 x 250GB WD HDD
    • 1 x 3ware 16-port 9650 PCIe RAID controller with
      14 x 750GB WD HDD
  • 91 Supermicro 3U servers with dual Intel E5310
    (1.6GHz) quad-core CPUs, 8GB RAM, IPMI
    • 1 x 3ware 4-port 9650 PCIe RAID controller with
      2 x 400GB Seagate HDD
    • 1 x 3ware 16-port 9650 PCIe RAID controller with
      14 x 750GB Seagate HDD
  • Acceptance tests running; scheduled to be
    available end of March
  • 5400 spinning drives after planned phase-out in
    April (expect a drive failure every 3 days; see the
    rough calculation after this list)
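The "every 3 days" estimate can be reproduced with a quick back-of-envelope calculation. In the sketch below the annualised failure rate is an assumed, illustrative value chosen only to show the arithmetic; the drive count is taken from the slide.

```python
# Rough sanity check on the expected drive-failure interval.
# The annualised failure rate (AFR) is an assumed, illustrative value;
# the drive count comes from the slide above.

drives = 5400          # spinning drives after the April phase-out
afr = 0.02             # assumed ~2% annualised failure rate per drive

failures_per_year = drives * afr              # expected failures across the farm
days_between_failures = 365 / failures_per_year

print(f"Expected failures per year: {failures_per_year:.0f}")
print(f"Expected days between failures: {days_between_failures:.1f}")
# With a 2% AFR this gives a failure roughly every 3.4 days,
# in line with the "every 3 days" estimate on the slide.
```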

11
Hardware: CPU
  • Production: about 1500 KSI2K on 600 systems
  • Recently upgraded about 50% of capacity to
    2GB/core
  • Recent procurement (approximately 3000 KSI2K, but
    YMMV) delivered and under test (see the rough
    estimate after this list)
  • Streamline
    • 57 x 1U servers (114 systems, 3 racks), each
      system:
      • dual Intel E5410 (2.33GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
  • Clustervision
    • 56 x 1U servers (112 systems, 4 racks), each
      system:
      • dual Intel E5440 (2.83GHz) quad-core CPUs
      • 2GB/core, 1 x 500GB HDD
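A rough sketch of where the "approximately 3000 KSI2K" figure could come from. The per-core KSI2K rating below is an assumption (hence the "YMMV" on the slide); the chassis and core counts are taken from the list above.

```python
# Back-of-envelope capacity estimate for the 2008 CPU procurement.
# The per-core KSI2K rating is an assumption for illustration; actual
# SPECint2000 results depend on compiler flags and memory configuration.

streamline_systems = 57 * 2      # 57 twin 1U chassis = 114 systems (E5410, 2.33GHz)
clustervision_systems = 56 * 2   # 56 twin 1U chassis = 112 systems (E5440, 2.83GHz)
cores_per_system = 2 * 4         # dual quad-core CPUs

total_cores = (streamline_systems + clustervision_systems) * cores_per_system

ksi2k_per_core = 1.7             # assumed average rating per core (illustrative)
total_ksi2k = total_cores * ksi2k_per_core

print(f"Total cores: {total_cores}")                   # 1808
print(f"Estimated capacity: {total_ksi2k:.0f} KSI2K")  # roughly 3000 KSI2K
```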

12
Hardware: Tape
  • Tape drives
    • 8 x 9940B drives, used on the legacy ADS/dCache
      services; phase-out soon
    • 18 T10K tape drives and associated servers
      delivered; 15 in production, remainder soon
    • Planned bandwidth: 50MB/s per drive
    • Actual bandwidth: 8-80MB/s - a work in progress
  • Media
    • Approximately 2PB on site

13
Hardware: Network
[Network diagram: RAL site and Tier-1 networks. Tier-1 CPU
and disk stacks of Nortel 5510/5530 switches aggregate into
a Force10 C300 8-slot router over 10Gb/s links; 10Gb/s to
SJ5 via the site access router and firewall (with a 10Gb/s
firewall bypass); 10Gb/s to CERN via the OPN router; 1Gb/s
test-network link to Lancaster; RAL Tier 2 and ADS caches
connected at N x 1Gb/s; Oracle systems on the Tier-1
network.]
14
RAL links
[Diagram: RAL network links, keyed as implemented, to be
implemented soon, or not planned.]
15
Backplane Failures (Supermicro)
  • 3 servers suffered burnt-out backplanes
    • 2 of these set off the VESDA smoke detection
    • 1 called out the fire brigade
  • Safety risk assessment: urgent rectification
    needed
  • Good response from supplier/manufacturer
    • PCB fault in a bad batch
    • Replacement nearly complete

16
Machine Rooms
  • Existing machine room
    • Approximately 100 racks of equipment
    • Getting close to power/cooling capacity
  • New machine room
    • Work still proceeding close to schedule
    • 800m2 can accommodate 300 racks and 5 robots
    • 2.3MW power/cooling capacity (some UPS)
    • Scheduled to be available for September 2008

17
CASTOR Memory Lane
[Timeline diagram, 4Q05 to 1Q08, of CASTOR milestones,
including: CASTOR1 tests OK; CASTOR2 core running but
dependencies hard to install; CMS on CASTOR for CSA06,
encouraging, production service declared; ATLAS on CASTOR?;
LHCb on CASTOR?; problems with functionality and
performance ("it doesn't work!"); 2.1.2 bad; 2.1.3 good but
missing functionality; OC committees note improvement but
remain concerned; service stopped for extended upgrade;
2.1.4 upgrade goes well, with Disk1 support; CSA07
encouraging; CSA08 reasonably successful. Happy days!]
18
Growth in Use of CASTOR
19
Test Architecture
[Diagram: CASTOR test architecture. Three instances
(Development, Preproduction, Certification Testbed), each
with its own stager, DLF, repack and LSF components backed
by Oracle databases, sharing common services including the
Oracle name server (NS) and vmgr and a tape server; each
instance has 1 disk server (variable).]
20
CASTOR Production Architecture
[Diagram: CASTOR production architecture. Four stager
instances (CMS, Atlas, LHCb, and Repack/Small User), each
with its own stager, DLF and LSF components backed by
Oracle databases. Shared services include the Oracle name
server (NS) and vmgr, a second name server, and tape
servers. The experiment instances each have their own disk
servers; the Repack/Small User instance has 1 disk server.]
21
Atlas Data Flow Model
[Diagram: ATLAS data flows between the Tier-0, this Tier-1,
the Tier-2s and a partner Tier-1, covering RAW, ESD1/ESD2,
AODm1/AODm2, TAG, simulated RAW and stripping input.]
22
CMS Dataflow
[Diagram: CMS CASTOR pools, all disk0tape1. WanIn (8 LSF
slots per server) receives from T0, T1 and T2 sites; WanOut
(16 LSF slots per server) serves T1 and T2 sites; FarmRead
(50 LSF slots per server) serves the batch farm. Data moves
between pools by disk-to-disk copy and by recall from tape.
A schematic summary follows.]
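As a reading aid, the diagram above can be condensed into a small data structure. This is a plain schematic for illustration only, not CASTOR or LSF configuration syntax; the slot counts are the per-server LSF limits quoted above.

```python
# Schematic summary of the CMS CASTOR pool layout shown above.
# Illustration only: not CASTOR/LSF configuration syntax.

cms_pools = {
    "WanIn":    {"class": "disk0tape1", "lsf_slots_per_server": 8,
                 "peers": ["T0", "T1", "T2"]},
    "WanOut":   {"class": "disk0tape1", "lsf_slots_per_server": 16,
                 "peers": ["T1", "T2"]},
    "FarmRead": {"class": "disk0tape1", "lsf_slots_per_server": 50,
                 "peers": ["batch farm"]},
}

# disk0tape1: data is migrated to tape and the disk copy is a cache
# that can be garbage-collected; FarmRead is refilled by disk-to-disk
# copy from the WAN pools or by recall from tape.
for name, cfg in cms_pools.items():
    print(f"{name}: {cfg['lsf_slots_per_server']} LSF slots/server ({cfg['class']})")
```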
23
CMS Disk Server Tuning: CSA06/CSA07
  • Problem: network performance too low
    • Increase default/maximum TCP window size
    • Increase TCP ring buffers and tx queue
    • Ext3 journal changed to data=writeback
  • Problem: performance still too low
    • Reduce number of gridftp slots per server
    • Reduce number of streams per file
  • Problem: PhEDEx transfers now time out
    • Reduce FTS slots to match disk pools
  • Problem: servers sticky or crash with OOM
    • Limit total TCP buffer space
    • Protect low memory
    • Aggressive cache flushing
  • See http://www.gridpp.ac.uk/wiki/RAL_Tier1_Disk_Server_Tuning
    (an illustrative sketch of this kind of tuning follows
    this list)
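A minimal sketch of how this class of kernel tuning is typically applied, assuming root access and writing directly under /proc/sys. The numeric values are placeholders rather than the settings actually used at RAL; the wiki page above has the real recipe.

```python
#!/usr/bin/env python
# Illustrative sketch of the kernel tuning described on this slide, applied
# by writing sysctl values under /proc/sys (run as root). The values are
# placeholders, not the settings actually deployed at RAL.

TUNABLES = {
    # Larger default/maximum TCP window sizes for high-latency WAN transfers
    "net/core/rmem_max": "16777216",
    "net/core/wmem_max": "16777216",
    "net/ipv4/tcp_rmem": "4096 87380 16777216",   # min default max (bytes)
    "net/ipv4/tcp_wmem": "4096 65536 16777216",
    # Cap total TCP buffer space (in pages) so many concurrent gridftp
    # streams cannot exhaust memory
    "net/ipv4/tcp_mem": "196608 262144 393216",
    # Keep a reserve of free pages so heavy I/O does not starve the kernel
    "vm/min_free_kbytes": "65536",
    # Flush dirty pages more aggressively to keep write-back bounded
    "vm/dirty_background_ratio": "5",
    "vm/dirty_ratio": "20",
}
# NIC ring buffers and the tx queue length are set with "ethtool -G" and
# "ip link set ... txqueuelen", not via sysctl, so they are not shown here.

def apply_tunables(tunables):
    for key, value in tunables.items():
        path = "/proc/sys/" + key
        try:
            with open(path, "w") as f:
                f.write(value + "\n")
            print(f"set {key} = {value}")
        except OSError as err:
            print(f"skipped {key}: {err}")

if __name__ == "__main__":
    apply_tunables(TUNABLES)
```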

24
3Ware Write Throughput
25
CCRC08 Disk Server Tuning
  • Migration rate to tape very bad (5-10MB/s) when
    concurrent with writing data to disk
    • Was OK in CSA06 (50MB/s per server) on Areca servers
  • 3ware 9550 performance terrible under concurrent
    read/write (2MB/s read, 120MB/s write)
    • 3ware appears to prioritise writes
  • Tried many tweaks, most with little success, except:
    • Either changing the I/O elevator to anticipatory
      • Downside: write throughput reduced
      • Good under benchmarking; testing in production
        this week
    • Or increasing the block device read-ahead
      • Read throughput high but erratic under test
      • But seems OK in production (30MB/s per server)
  • See http://www.gridpp.rl.ac.uk/blog/2008/02/29/3ware-raid-controllers-and-tape-migration-rates/
    (a sketch of applying these two tweaks follows this
    list)
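A minimal sketch of how the two helpful tweaks would typically be applied through sysfs, assuming root access and the 2.6-era kernels in use at the time (where the anticipatory elevator exists). The device glob and read-ahead value are illustrative, not the exact production settings.

```python
#!/usr/bin/env python
# Illustrative sketch: switch the I/O elevator to "anticipatory" and/or
# raise the block-device read-ahead via sysfs. Run as root; device names
# and values are illustrative, and "anticipatory" only exists on 2.6-era
# kernels.

import glob

SCHEDULER = "anticipatory"   # favour reads against the controller's write bias
READ_AHEAD_KB = 4096         # raise read-ahead from the 128KB default

def write_sysfs(path, value):
    """Write one sysfs attribute, reporting rather than aborting on failure."""
    try:
        with open(path, "w") as f:
            f.write(str(value) + "\n")
        print(f"set {path} = {value}")
    except OSError as err:
        print(f"skipped {path}: {err}")

if __name__ == "__main__":
    # Apply to every sd* block device (the 3ware data arrays on these servers).
    for dev in sorted(glob.glob("/sys/block/sd*")):
        write_sysfs(dev + "/queue/scheduler", SCHEDULER)
        write_sysfs(dev + "/queue/read_ahead_kb", READ_AHEAD_KB)
```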

26
CCRC (CMS WANIN)
[Monitoring plots: network in/out (around 300MB/s), PhEDEx
transfers, tape migration queue, CPU load, Tier-0 rate.]
27
CCRC (WANOUT)
[Monitoring plots: network in/out (around 300MB/s), PhEDEx
transfers before and after replication, CPU load.]
28
CASTOR Plans for May CCRC08
  • Still problems:
    • Optimising end-to-end transfer performance
      remains a balancing act
    • Hard to manage the complex configuration
  • Working on:
    • ALICE/xrootd deployment
    • Preparation for the 2.1.6 upgrade
    • Installation of Oracle RACs (resilient Oracle
      services for CASTOR)
    • Provisioning and configuration management

29
dCache Closure
  • Agreed with the UB that we would give 6 months'
    notice before terminating the dCache service
  • dCache closure announced to the UB to be May 2008
  • ATLAS and LHCb working to migrate their data
    • Migration slower than hoped
  • Service much reduced in size now (10-12 servers
    remain) and operational overhead much lower
  • Remaining non-LHC experiments' migration delayed
    by the low priority of non-CCRC work
    • Work on the Gen instance of CASTOR will recommence
      shortly
  • Pragmatically, closure may be delayed by several
    months until MINOS and the tiny VOs have migrated

30
Termination of GRIDPP use of ADS Service
  • GRIDPP funding and use of the legacy Atlas
    Datastore (ADS) service is scheduled to end at the
    end of March 2008
    • No GRIDPP access via the tape command after this
    • Also no access via the C-callable VTP interface
  • RAL will continue to operate the ADS service, and
    experiments are free to purchase capacity
    directly from the Datastore team
  • Pragmatically, closure cannot happen until:
    • dCache ends (it uses the ADS back end)
    • CASTOR is available for small VOs
    • Probably 6 months away

31
Conclusions
  • Hardware for the 2008 MoU is in the machine room and
    moving satisfactorily through acceptance
    • Volume not yet a problem, but warning signs
      starting to appear
  • CASTOR situation continues to improve
    • Reliable during CCRC08
    • Hardware performance improving. Tape migration
      problem reasonably understood and partly solved;
      scope for further improvement
    • Progressing various upgrades
  • Remaining Tier-1 infrastructure essentially
    problem free
  • Availability fair, but stagnating; need to progress:
    • Incident response staff
    • On-call
    • Disaster planning and national/global/cluster
      resilience
  • Concerned that we have still not seen all experiment
    use cases