Title: LHC Computing Grid Project LCG
1 - LHC Computing Grid Project LCG
- CERN, European Organisation for Nuclear Research
- Geneva, Switzerland
Project Status
Les Robertson, LCG Project Leader
EGEE Conference, Den Haag, 26 November 2004
2 - Summary
- Key points about LCG, EGEE and other grid infrastructures
- Status and Concerns
- Planning for LHC Startup
3 - LCG Project Activity Areas
- Applications: development environment and common libraries, frameworks and tools for the LHC experiments
- CERN Fabric: construction and operation of the central LHC computing facility at CERN
- Networking: planning the availability of the high-bandwidth network services needed to interconnect the major computing centres used for LHC data analysis
4 - Risks and Opportunities
- LCG and EGEE are combining resources to build an operation that is wider in scope and ambition than LCG would be able to tackle on its own.
- LCG has all of its middleware eggs in the EGEE basket.
- If we can use the real needs and real resources of the LHC experiments to establish a general science grid infrastructure with long-term support, we will all benefit -- that is why we are in this project.
- EGEE stops in March 2006! LHC starts in 2007!
- This is an enormous risk for LCG.
- I am not sure that any other application has shown this level of confidence in the EGEE project.
- I am sure that the LCG reviewers would not agree entirely with some of the views of the EGEE reviewers.
- The risk we are taking deserves considerable priority from the EGEE project.
5 - LCG Service Hierarchy
- Tier-2: 100 centres in 40 countries
  - Simulation
  - End-user analysis, batch and interactive
6 - Networking
- Latest estimates are that Tier-1s will need connectivity at 10 Gbps, with 70 Gbps at CERN (see the rough arithmetic sketch after this list)
- There is no real problem for the technology, as has been demonstrated by a succession of Land Speed Records
- But LHC will be one of the few applications needing this level of performance as a service on a global scale
- We have to ensure that there will be an effective international backbone that reaches through the national research networks to the Tier-1s
- LCG has to be pro-active in working with service providers
  - Pressing our requirements and our timetable
  - Exercising pilot services
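A back-of-the-envelope sketch (Python, not part of the talk) relating the quoted figures: the 10 Gbps per Tier-1 and 70 Gbps at CERN come from the slide, while the idea that the CERN aggregate is simply the sum of concurrently active Tier-1 links, and the bits-to-bytes conversion, are illustrative assumptions:

    # Figures quoted on the slide: 10 Gbps per Tier-1 link, 70 Gbps at CERN.
    TIER1_LINK_GBPS = 10
    CERN_AGGREGATE_GBPS = 70

    # Assumption (not stated in the talk): the CERN aggregate is roughly the
    # sum of concurrently active full-rate Tier-1 flows.
    concurrent_links = CERN_AGGREGATE_GBPS / TIER1_LINK_GBPS
    print(f"~{concurrent_links:.0f} concurrent 10 Gbps Tier-1 flows at CERN")

    # 10 Gbps expressed as a payload transfer rate (1 byte = 8 bits),
    # ignoring protocol overhead.
    mb_per_sec = TIER1_LINK_GBPS * 1000 / 8
    print(f"10 Gbps is roughly {mb_per_sec:.0f} MB/sec of payload")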
7 - LHC Computing Resources
- Most of the LHC resources around the world are organised as national and regional grid projects, integrated into the combined LCG-2/EGEE operation
- There are separate infrastructures in the US (Grid3) and the Nordic countries (NorduGrid) that use different middleware
- The LCG project has a dual role
  - Operating the LCG-2/EGEE grid - a joint LCG-EGEE activity
  - Coordinating the wider set of resources available to LHC
- There is an active programme aimed at compatibility and inter-working of LCG-2/EGEE and Grid3
- And on-going technical discussions with similar aims with NorduGrid
- Lack of standards is a major headache for the LHC experiments
  - In practice, the standard is most likely to be set by a winning middleware implementation
8 - Status and Concerns
9 - Grid Deployment - going well
- The grid deployment process (LCG-2) is working well
  - Integration, certification, debugging
  - Distribution, installation
- Rapid reaction to problems encountered during the LHC experiments' data challenges, leading to incremental releases of LCG-2 and significant improvements in reliability, performance and scalability - within the limits of the current architecture
- Scalability is much better than planned or expected a year ago: 90 sites and 9,000 processors, close to the final scale of the LCG grid!
- Heavily used during the data challenges in 2004
  - lots of real work done for real physicists - these are not tests or demos
  - many small sites have contributed to simulation runs
  - one experiment (LHCb) has run up to 3,500 concurrent jobs
10 - Grid Deployment - concerns
- The basic issues of middleware reliability and scalability that we were struggling with a year ago have been overcome
- BUT there are many issues of functionality, usability and performance to be resolved - soon
- Overall job success rate is 60-75% (a rough illustration of what this means in practice follows this list)
  - Can be tolerated for production work submitted by small teams with automatic job generation and bookkeeping systems
  - Unacceptable for end-user data analysis
11 - Urgent to improve operations coordination and management
- EGEE support resources now in place
- Core operations centres established: CLRC Oxford, IN2P3 Lyon, CNAF Bologna, ASCC Taipei, CERN
- Global Grid User Support centre at Forschungszentrum Karlsruhe
- Operations workshop at CERN, 2-4 November
- The new, improved middleware from EGEE is awaited with impatience
12 - LCG-2 and Next Generation Middleware
(Timeline graphic: LCG-2 and gLite across 2004-2005, prototyping leading to product)
- LCG-2: focus on production, large-scale data handling
  - The service for the 2004/5 data challenges
  - Provides experience on operating and managing a global grid service - middleware neutral
  - Continuing, modest development programme driven by data challenge experience
  - Will be supported until gLite is able to replace it (functionality, scaling, reliability, performance)
- gLite: focus on analysis
  - LHC applications and users closely involved in prototyping and development (ARDA/NA4 project)
  - Short development cycles
  - Deployed alongside LCG-2 (co-existence)
  - Hope to be able to replace some LCG-2 components with gLite components at an early stage
13 - Middleware from EGEE
- We have a rapidly growing number of sites connecting to the LCG-2/EGEE grid - but there are major holes in the functionality, especially in data management, and concerns about workload management
- The first gLite prototype was made available in a development environment in May (6 weeks after EGEE started!)
  - Good experience with this leads to strong pressure for extended access - more users, more data
- But there are difficulties in getting the product out
  - the first pieces are only being delivered to the pre-production testbed this month
  - key components will only arrive next year
- The absolute priority must now be to get the basic gLite functionality out on the pre-production testbed - and to establish the process of short development cycles
- The LHC experiments have a pressing time-line - I do not want them to be forced to employ alternative solutions
14 - Planning for LHC Startup
15 - Planning for LHC Startup
To what extent will there be experience of the new middleware before these major decisions are made?
- The agreements between the centres that will implement the LHC computing environment will be mapped out over the next 6-9 months
- December 2004
  - Experiment requirements and computing models published
- First quarter 2005
  - Establish resource plans for Tier-0, Tier-1 and major Tier-2s
  - Initial plan for Tier-0/1/2 networking
- April 2005
  - Formal collaboration framework - memorandum of understanding
- July 2005 - Technical Design Report
  - Detailed plan for installation and commissioning of the LHC computing environment
16 - Service Challenge Programme to Ramp Up to LHC Startup
- Dec04 - Service Challenge 1
  - Basic high-performance data transfer - 2 weeks sustained
  - CERN + 3 Tier-1s, 500 MB/sec between CERN and the Tier-1s (rough data-volume arithmetic in the sketch after this list)
- Mar05 - Service Challenge 2
  - Reliable file transfer service
  - Mass store (disk) to mass store (disk)
  - CERN + 5 sites, 500 MB/sec between sites, 1 month sustained
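A back-of-the-envelope sketch (Python, not part of the talk) of the data volumes implied by the 500 MB/sec targets quoted above; the 14-day and 30-day durations follow the "2 weeks" and "1 month" figures on the slide:

    SECONDS_PER_DAY = 24 * 3600

    def volume_tb(rate_mb_per_sec: float, days: float) -> float:
        """Total data moved at a sustained rate, in terabytes (1 TB = 10**6 MB)."""
        return rate_mb_per_sec * days * SECONDS_PER_DAY / 1e6

    # Service Challenge 1: 500 MB/sec sustained for 2 weeks between CERN and Tier-1s.
    print(f"SC1: 500 MB/s for 14 days is roughly {volume_tb(500, 14):.0f} TB")
    # Service Challenge 2: the same rate sustained for a month between sites.
    print(f"SC2: 500 MB/s for 30 days is roughly {volume_tb(500, 30):.0f} TB")
    # For reference, 500 MB/s is 4 Gbps - well within a single 10 Gbps Tier-1 link.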
17 - Service Challenge Programme to Ramp Up to LHC Startup
- Jul05 - Service Challenge 3
  - Tier-0/Tier-1 base service - CERN + 5 Tier-1s, 300 MB/sec, including mass store (disk to tape) - sustained 1 month
  - 5 Tier-2 centres at lower bandwidth
- Preparation for:
  - Tier-0/1 model verification - two experiments concurrently at 50% of the nominal data rate
(Timeline graphic: LHC first beams, then full physics run in 2008)
18 - Service Challenge Programme to Ramp Up to LHC Startup
- Apr06 - Service Challenge 4
  - Tier-0, ALL Tier-1s and the major Tier-2s operational at full target data rates (1.2 GB/sec at Tier-0)
- Preparation for:
  - Tier-0/1/2 full model test - all experiments
  - 100% of the nominal data rate, with processing load scaled to 2006 CPUs
  - Sustained 1 month
19 - Service Challenge Programme to Ramp Up to LHC Startup
- Nov06 - Service Challenge 5
  - Infrastructure ready at ALL Tier-1s and selected Tier-2s
  - Tier-0/1/2 operation - sustained 1 month
  - Twice the target data rates (2.5 GB/sec at Tier-0)
- Preparation for:
  - Feb07 - ATLAS, CMS, LHCb and ALICE (proton mode)
  - Tier-0/1 100% full model test
20 - Summary
- Grid Operation
  - Very good progress during the past year
  - Large-scale deployment
  - Real work performed for the experiments
  - Much work still to be done to improve the job success rate - operations management, site discipline, middleware
- Grid Middleware
  - Some of the missing functionality can be provided through short-term developments of LCG-2
  - But we are looking to the EGEE/gLite work for middleware adapted to end-user analysis
  - Urgent to deliver the base set of gLite components
- LCG needs a permanent, increasingly stable service for the experiments to do physics
  - And in addition has a tight schedule of service and computing-model readiness tests