1
London Tier 2
  • Status Report
  • GridPP 13, Durham, 4th July 2005
  • Owen Maroney, David Colling

2
Brunel
  • 2 WN PBS @ LCG-2_4_0
  • R-GMA and APEL installed
  • RH7.3 LCFG installed
  • Additional farm being installed
  • SL3
  • Private networked WN
  • 16 nodes
  • Expected to move into production after 2_6_0
    upgrade
  • Hoping to bring further resources over the summer
  • Recruiting support post with RHUL (Job offer
    made)

3
Imperial College London
  • Appointment of Mona Aggarwal to GridPP Hardware
    Support Post
  • 52 CPU Torque HEP farm @ LCG-2_5_0
  • RGMA and APEL installed
  • OS RHEL 3
  • IC HEP participating in SC3 as the UK CMS site
  • dCache SRM installed with 2.6TB storage, 6TB on
    order
  • Another 6TB on order
  • Numerous power outages (scheduled and
    unscheduled) have caused availability problems
  • London e-Science Centre
  • SAMGrid installed across HEP and LeSC
  • Certified for D0 data reprocessing
  • 186 Job Slots
  • SGE farm, 64bit RHEL
  • Globus-jobmanager installed
  • Beta version of SGE plug-in to the generic
    information provider (see the sketch after this
    slide)
  • Firewall issues had blocked progress, but these
    have now been resolved. Testing will start soon.
  • Community of Interest mailing list established
    for sites interested in SGE integration with LCG
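
The SGE plug-in to the generic information provider mentioned above is only at the beta stage and the slides give no implementation detail, so what follows is a minimal illustrative sketch rather than the LeSC code: it assumes a plug-in script that summarises SGE queue occupancy with "qstat -g c" and prints a GLUE-style LDIF fragment for the generic information provider to merge. The CE identifier and the choice of GLUE attributes are placeholders, not the actual beta plug-in.

    #!/usr/bin/env python3
    # Illustrative sketch only: a hypothetical SGE plug-in for the generic
    # information provider. It sums queue occupancy from "qstat -g c" and
    # prints a GLUE-style LDIF fragment; the CE identifier is a placeholder.
    import subprocess

    CE_ID = "sge-ce.example.ac.uk:2119/jobmanager-sge-default"  # assumed value

    def sge_cluster_summary():
        """Return (used, available) job slot counts from qstat -g c."""
        out = subprocess.run(["qstat", "-g", "c"],
                             capture_output=True, text=True, check=True).stdout
        lines = out.splitlines()
        header = lines[0].split()
        # The first column header is "CLUSTER QUEUE" (two tokens), so data
        # columns sit one position to the left of their header token.
        used_col = header.index("USED") - 1
        avail_col = header.index("AVAIL") - 1
        used = avail = 0
        for line in lines[2:]:                 # skip header and separator
            fields = line.split()
            if len(fields) > avail_col:
                used += int(fields[used_col])
                avail += int(fields[avail_col])
        return used, avail

    if __name__ == "__main__":
        used, avail = sge_cluster_summary()
        print("dn: GlueCEUniqueID=%s,mds-vo-name=local,o=grid" % CE_ID)
        print("GlueCEStateRunningJobs: %d" % used)
        print("GlueCEStateFreeCPUs: %d" % avail)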

4
Queen Mary
  • 320 CPU Torque farm
  • After difficulties with Fedora 2, have moved LCG
    WN to SL3
  • Departure of key staff member just as LCG-2_4_0
    released led to manpower problems
  • GridPP Hardware Support post filled
  • Giuseppe Mazza started 1st July
  • RGMA and APEL installed early in June.

5
Royal Holloway
  • Little change: 148 CPU Torque farm
  • LCG 2_4_0
  • OS SL3
  • RGMA installed
  • Problems with APEL default installation
  • Gatekeeper and batch server on separate nodes
  • Little manpower available
  • Shared GridPP Hardware Support post with Brunel
    still in recruitment process
  • Job offer made?

6
University College London
  • UCL-HEP 20 CPU PBS farm @ LCG-2_4_0
  • OS SL3
  • RGMA installed
  • Problems with APEL default installation
  • Batch server on a separate node from the
    gatekeeper
  • UCL-CCC 88 CPU Torque farm @ LCG-2_4_0
  • OS SL3
  • RGMA and APEL installed
  • Main cluster is SGE farm
  • Interest in putting the SGE farm into LCG and
    integrating the nodes into a single farm

7
Current site status summary
Site | Service nodes | Worker nodes | Local network connectivity | Site connectivity | SRM | Days SFT failed | Days in scheduled maintenance
Brunel | RH7.3 LCG2.4.0 | RH7.3 LCG2.4.0 | 1Gb | 100Mb | No | 21 | 16
Imperial | RHEL3 LCG2.5.0 | RHEL3 LCG2.5.0 | 1Gb | 1Gb | dCache | 26 | 28
QMUL | SL3 LCG2.4.0 | SL3 LCG2.4.0 | 1Gb | 100Mb | No | 45 | 12
RHUL | RHEL3 LCG2.4.0 | RHEL3 LCG2.4.0 | 1Gb | 1Gb | No | 22 | 29
UCL (HEP) | SL3 LCG2.4.0 | SL3 LCG2.4.0 | 1Gb | 1Gb | No | 9 | 30
UCL (CCC) | SL3 LCG2.4.0 | SL3 LCG2.4.0 | 1Gb | 1Gb | No | 12 | 9
  1. Local network connectivity is that to the site SE.
  2. It is understood that SFT failures do not always
    result from site problems, but it is the best
    measure currently available.

8
LCG resources
Site | Estimated for LCG: total job slots | CPU (kSI2K) | Storage (TB) | Currently delivering to LCG: total job slots | CPU (kSI2K) | Storage (TB)
Brunel | 60 | 60 | 1 | 4 | 4 | 0.4
IC | 66 | 33 | 16 | 52 | 26 | 3
QMUL | 572 | 247 | 13.5 | 464 | 200 | 0.1
RHUL | 142 | 167 | 3.2 | 148 | 167 | 7.7
UCL | 204 | 108 | 0.8 | 186 | 98 | 0.8
Total | 1044 | 615 | 34.5 | 854 | 495 | 12
1) The estimated figures are those that were projected for LCG planning purposes: http://lcg-computing-fabric.web.cern.ch/LCG-Computing-Fabric/GDB_resource_infos/Summary_Institutes_2004_2005_v11.htm
2) Current total job slots are those reported by the EGEE/LCG gstat page.
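
For context, here is a short worked sketch, using only the Total row of the table above, of how much of the estimated LT2 capacity is currently being delivered to LCG (plain division; no additional data assumed):

    # Delivered vs estimated LT2 resources, from the Total row of the table.
    estimated  = {"job slots": 1044, "CPU (kSI2K)": 615, "storage (TB)": 34.5}
    delivering = {"job slots": 854,  "CPU (kSI2K)": 495, "storage (TB)": 12}

    for key in estimated:
        share = 100.0 * delivering[key] / estimated[key]
        print("%s: %.0f%% of the estimate delivered" % (key, share))
    # job slots: 82% of the estimate delivered
    # CPU (kSI2K): 80% of the estimate delivered
    # storage (TB): 35% of the estimate delivered
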
9
Resources used per VO over quarter (kSI2K hours)
Site | ALICE | ATLAS | BABAR | CMS | LHCB | ZEUS | Total
Brunel | - | - | 6 | 149 | - | - | 155
Imperial | 19 | 848 | 221 | 4,863 | 312 | - | 6,263
QMUL | - | 41 | 116 | 82,697 | - | - | 82,854
RHUL | 1,124 | 1,840 | 79 | 42,218 | - | - | 45,261
UCL | - | 6,982 | 126 | 14,115 | - | - | 21,223
Total | 1,143 | 9,711 | 548 | 144,042 | 312 | - | 155,756
Data taken from APEL
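
As a quick consistency check of the accounting figures, a short sketch (plain Python, using only the numbers in the table above) that re-derives each site's total and the grand total from the per-VO usage:

    # Per-VO CPU usage over the quarter (kSI2K hours), from the table above.
    usage = {
        "Brunel":   {"BABAR": 6, "CMS": 149},
        "Imperial": {"ALICE": 19, "ATLAS": 848, "BABAR": 221, "CMS": 4863,
                     "LHCB": 312},
        "QMUL":     {"ATLAS": 41, "BABAR": 116, "CMS": 82697},
        "RHUL":     {"ALICE": 1124, "ATLAS": 1840, "BABAR": 79, "CMS": 42218},
        "UCL":      {"ATLAS": 6982, "BABAR": 126, "CMS": 14115},
    }

    site_totals = {site: sum(vo.values()) for site, vo in usage.items()}
    print(site_totals)                # {'Brunel': 155, 'Imperial': 6263, ...}
    print(sum(site_totals.values()))  # 155756, matching the Total row
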
10
Njobs: percentage numbers of jobs, expressed as a pie chart
[Chart not reproduced; 51,209 jobs in total according to APEL]
11
Site Experiences
  • LCG-2_4_0 was the first release with a scheduled
    release date
  • Despite a slippage of 1 week in the release (and
    an overlap with the EGEE conference), all LT2
    sites upgraded within 3 weeks
  • Some configuration problems for a week afterwards
  • Overall experience was better than in the past
  • Farms are not fully utilised
  • This is true of the grid as a whole
  • Will extend the range of VOs supported
  • Overall improvement in Scheduled Downtime (SD)
    compared to the previous quarter
  • QMUL had manpower problems
  • NB: Although QMUL had the highest number of SFT
    failure and SD days, it provided the most actual
    processing power during the quarter!
  • IC had several scheduled power outages, plus two
    unscheduled power failures
  • Caused knock-on failures for sites using BDII
    hosted at IC
  • IC installed dCache SRM in preparation for SC3
  • Installation and configuration not simple: the
    default configuration was not suitable for most
    Tier 2 sites, and changing from the default was
    hard
  • Some security concerns: installations are not
    secure by default
  • Coordinator: Owen leaving in two weeks
  • Have made an offer