Title: David Colling
1. Performance of the LHC Computing Grid (LCG)
2. Thanks
Slides/pictures/text taken from several people, including Les Robertson, Jamie Shears, Bob Jones, Gabriel Zaquine, Jeremy Coles and Gidon Moont.
Caveat: LCG means different things to different people and funding bodies.
3. Contents
- Description of the LCG: what the targets are and how it works
- The monitoring that is currently in place
- The current and future metrics
- The Service Challenges
- The testing and release procedure
4. The LHC
5. What is the LHC?
- The LHC will collide beams of protons at an energy of 14 TeV
- Using the latest superconducting technologies, it will operate at about −270 ºC, just above the absolute zero of temperature
- With its 27 km circumference, the accelerator will be the largest superconducting installation in the world
- Four detectors, constructed and operated by international collaborations of thousands of physicists, engineers and technicians
The largest terrestrial scientific endeavour ever undertaken. The LHC is due to switch on and start taking data in 2007.
Four experiments, with detectors as big as cathedrals: ALICE, ATLAS, CMS, LHCb
6. Data Volume
Data will accumulate at 15 PetaBytes/year, equivalent to writing a CD every 2 seconds (a rough check follows).
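As a rough order-of-magnitude check (assuming a ~700 MB CD and ~3.15 x 10^7 seconds in a year; neither figure is from the talk):

15 PB/year = 15 x 10^15 bytes / 3.15 x 10^7 s ≈ 475 MB/s
700 MB per CD / 475 MB/s ≈ 1.5 s per CD

i.e. roughly one CD every couple of seconds.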
7. The Role of LCG
- LCG is the system on which this data will be analysed, and on which similar volumes of Monte Carlo simulation will be generated
- High Energy Physics jobs have particular characteristics, e.g. they are, thankfully, parallel
- However, LCG and EGEE are very closely linked, and EGEE has a more general remit, covering e.g. biomed and earth observation applications as well as HEP
8. Middleware and Deployment
- Current middleware is based on EDG, but hardened and extended
- New middleware is being developed within the EGEE project
- Deployment and monitoring are also done jointly with EGEE
9. The System (ATLAS Case)
[Diagram of the ATLAS computing model: the 10 Tier-1s reprocess data and house simulation; group analysis; workstations.]
10. The World as seen by the EDG
This is the world without Grids: a confused and unhappy user. Sites are not identical; each consists of different computers, different storage, different files and different usage policies.
So let's introduce some grid infrastructure: security and an information system, together with a VO server.
Now the user knows what machines are out there and can communicate with them; however, where to submit the job is too complex a decision for the user alone. What is needed is an automated system.
Enter the Workload Management System (Resource Broker), the Replica Location Service (Replica Catalogue) and Logging & Bookkeeping: the WMS, using the Replica Catalogue, decides on the execution location.
The user retrieves the results with edg-job-get-output <edg-job-id>, and is now a happy user. (A minimal submission sketch follows.)
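To make the workflow concrete, here is a minimal sketch of driving the EDG/LCG-2 user interface from the command line; the file name and executable below are illustrative placeholders, not taken from the talk.

hello.jdl, a minimal job description:

Executable    = "/bin/hostname";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};

Then submit, track and collect the job from the User Interface machine:

edg-job-submit hello.jdl          # the Resource Broker returns an edg-job-id
edg-job-status <edg-job-id>       # state is tracked by Logging & Bookkeeping
edg-job-get-output <edg-job-id>   # retrieve the output sandbox once the job is Done

The Resource Broker matches the job against what sites publish in the information system and, consulting the Replica Catalogue for any input data, chooses where the job runs.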
11. So what is actually there now?
- Currently, 138 sites in 36 countries
- 14K CPUs, 10 PB of storage
- 1000 registered users (>100 active users)
12. Monitoring LCG/EGEE
- Four forms of monitoring (demos):
  - What is the state of a given site
  - What is currently being used
  - Accounting: how many resources have been used by a given Virtual Organisation
  - EGEE quality assurance
These different activities are not always well connected.
13. What is the state of a site?
- A series of site functional tests is run automatically at every site; some involve asking a site questions, some involve running jobs
- These tests are defined as critical or non-critical. If a site consistently fails critical tests, automated messages are sent to the site and it will be removed from the information system if the error is not corrected
- The information published by a site is also analysed (an illustrative query follows)
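As one hedged illustration of inspecting what a site publishes, assuming the LDAP-based BDII information system used by LCG-2 (the host name below is a placeholder, and the GLUE attribute names are given to the best of my knowledge):

# Ask a BDII which Computing Elements it publishes and how many free CPUs they report
ldapsearch -x -H ldap://bdii.example.org:2170 -b 'o=grid' \
    '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateStatus GlueCEStateFreeCPUs

The site functional tests combine this kind of published information with the results of small test jobs actually run at the site.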
14. What is the state of a site?
Information is gathered at two GOCs: http://goc.grid.sinica.edu.tw/ and http://goc.grid-support.ac.uk/gridsite/gocmain/
15. What is the state of a site?
Maps as well.
16. What is currently being used? GridIce
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
Kind of like the gstat asked for earlier today.
17. Accounting: APEL
Uses the local batch system logs and publishes the information over R-GMA.
18. Quality assurance
Interrogates the Logging and Bookkeeping service
- Overall job success, from January 2005
- Job success rate = Done(OK) / (Submitted - Cancelled); a worked example follows
- Results should be validated
- http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
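To show how the metric behaves with purely hypothetical numbers (not results from the talk): if a VO submitted 10,000 jobs, 500 were cancelled by the users, and 7,600 finished in the Done(OK) state, then

success rate = 7,600 / (10,000 - 500) = 7,600 / 9,500 = 80%

Cancelled jobs are excluded from the denominator so that user-initiated cancellations are not counted as failures of the Grid.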
19. Quality assurance
- VOs' job throughput and success rate, from January until May 2005
20. Quality assurance
Next step is to understand these failures.
By the end of June we will also measure the overhead caused by running via the LCG, by measuring the running time / total time.
21. Quality assurance
- Many other metrics have been suggested (especially in the UK), including:
  - Number of users (from different communities)
  - Training quality
  - Maintenance and reliability (already measured)
  - Upgrade time, etc.
UK sites only; the target was 3 weeks.
22. Demos
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
23. How will we know if we are going to get there?
- There is an ongoing set of Service Challenges
- Each Service Challenge grows in complexity, approaching the full production service
- Currently we are between SC2 and SC3
- SC2 involved only the T0 and T1s
- SC3 will involve 5 T2s as well, and SC4 will involve all T2 sites
24. Service Challenge 2
- The goal for throughput was a >600 MB/s daily average for 10 days, and this was achieved: midday 23rd March to midday 2nd April (see the arithmetic below)
- Not without outages, but the system showed it could recover its rate again after outages
- Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites)
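For scale, sustaining the target rate for the whole period corresponds to roughly

600 MB/s x 86,400 s/day x 10 days ≈ 5.2 x 10^14 bytes ≈ 0.5 PB

moved out of the Tier-0 over the challenge (a back-of-the-envelope figure, not one quoted in the talk).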
25. Service Challenge 3 and beyond
26. Testing and deployment
- Multi-stage release:
  - New components are first tested on the testing testbed, with rapid feedback to developers. This testing is to include performance/scalability testing. Currently this runs at only 4 (5) sites: CERN, NIKHEF, RAL, Imperial (two installations)
  - Pre-production testbed
  - Releases onto production every 3 months
27. Conclusions
- There is a very hard deadline by which this must work
- We are monitoring as much as we can to try to understand where our current failures come from
- We have a release process that will hopefully improve the performance of future releases
28. Links
http://goc.grid-support.ac.uk/gridsite/monitoring/
http://goc.grid.sinica.edu.tw/gstat/
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html