1The LHC Computing Challenge
- GRID Workshop
- Maxwell Institute Edinburgh
- Les Robertson
- CERN - IT Division
- 21 September 2000
- les.robertson@cern.ch
2Summary
- HEP offline computing -- the current model
- LHC computing requirements
- The wide area computing model
- A place for Grid technology?
- The DataGRID project
- Conclusions
3Data Handling and Computation for Physics Analysis
[Diagram: data handling and computation for physics analysis -- detector, event filter (selection, reconstruction), raw data, event summary data, processed data, event reprocessing, batch physics analysis, analysis objects (extracted by physics topic), event simulation, interactive physics analysis]
4HEP Computing Characteristics
- Large numbers of independent events
- trivial parallelism
- Very large data sets
- smallish records
- mostly read-only
- Modest I/O rates
- Few MB/sec per fast processor
- Modest floating-point requirement
- SPECint rather than SPECfp performance is what matters (the event-level parallelism is illustrated below)
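The "trivial parallelism" above means that each event can be processed with no reference to any other, so capacity scales by simply adding machines. A minimal Python sketch of that pattern, where reconstruct() is a hypothetical stand-in for real reconstruction code:

```python
# Minimal sketch of HEP-style trivial parallelism: events are independent,
# so a farm can simply spread them across processes. reconstruct() is a
# hypothetical stand-in for real reconstruction code.
from multiprocessing import Pool
from random import Random

def reconstruct(event_id: int) -> dict:
    """Process one event in isolation -- no communication with other events."""
    rng = Random(event_id)                     # reproducible per-event work
    energy = sum(rng.random() for _ in range(1000))
    return {"event": event_id, "energy": energy}

if __name__ == "__main__":
    events = range(10_000)                     # stand-in for an input file of events
    with Pool() as pool:                       # one worker per CPU by default
        results = pool.map(reconstruct, events, chunksize=100)
    print(len(results), "events reconstructed")
```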
5The SHIFT Software Model
[Diagram: application servers and stage (migration) servers connected by an IP network]
- Storage access API which can be implemented over IP (a toy sketch follows below)
- all data available to all processes
- replicated components -- scalable, heterogeneous, distributed
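To illustrate the idea of a single storage access API that hides where the bytes actually live, here is a toy sketch; the class names are invented for illustration and are not the real SHIFT interface:

```python
# Toy illustration of a storage access API that hides whether a file is on a
# local disk or on a server reached over an IP network. Class names are
# invented for this sketch; they are not the actual SHIFT interface.
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def read(self, path: str) -> bytes: ...

class LocalDisk(StorageBackend):
    def read(self, path: str) -> bytes:
        with open(path, "rb") as f:
            return f.read()

class RemoteStageServer(StorageBackend):
    """Stand-in for a stage (migration) server reached over IP."""
    def __init__(self, host: str):
        self.host = host
    def read(self, path: str) -> bytes:
        # A real implementation would speak a remote file-access protocol here.
        raise NotImplementedError(f"would fetch {path} from {self.host} over IP")

def open_data(backend: StorageBackend, path: str) -> bytes:
    # Application code sees one call regardless of where the bytes live,
    # which is what lets components be replicated and replaced freely.
    return backend.read(path)
```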
6Generic computing farm
[Diagram: generic computing farm -- network servers, application servers, disk servers, tape servers]
7History
- 1960s thru 1980s
- The largest scientific supercomputers and mainframes (Control Data, Cray, IBM, Siemens/Fujitsu)
- Time-sharing interactive services on IBM and DEC-VMS
- Scientific workstations from 1982 (Apollo) for development and final analysis
- 1989 -- First batch services on RISC -- joint project with HP (Apollo DN10000)
- 1990 -- Central Simulation Facility (CSF) -- 4 x mainframe capacity
- 1991 -- SHIFT -- data-intensive applications, distributed model
- 1993 -- First central interactive service on RISC
- 1996 -- Last of the mainframes de-commissioned
- 1997 -- First batch services on PCs
- 1998 -- NA48 records 70 TeraBytes of data
- 2000 -- more than 75% of capacity from PCs
8High Throughput Computing
- High Throughput Computing
- mass of modest problems
- throughput rather than performance
- resilience rather than ultimate reliability (see the dispatcher sketch below)
- HEP can exploit inexpensive mass-market components to build large computing/data clusters
- scalable, extensible, flexible, heterogeneous, ...
- and as a result -- really hard to manage
- We should have much in common with data mining and Internet computing facilities
- Chaotic workload
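A minimal sketch of the throughput-and-resilience point: keep a pool of workers saturated and resubmit failed jobs, rather than trying to make any single node perfectly reliable. The failure rate, worker count and job function are invented for illustration:

```python
# Sketch of "throughput over peak performance, resilience over ultimate
# reliability": keep every worker busy and retry jobs that fail instead of
# preventing every failure. All names and rates are illustrative.
import concurrent.futures as cf
import random

def run_job(job_id: int) -> int:
    if random.random() < 0.05:                  # a node occasionally fails...
        raise RuntimeError(f"job {job_id} lost")
    return job_id

def run_with_retries(jobs, workers=8, max_attempts=3):
    done = []
    with cf.ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(run_job, j): (j, 1) for j in jobs}
        while pending:
            for fut in cf.as_completed(list(pending)):
                job_id, attempt = pending.pop(fut)
                try:
                    done.append(fut.result())
                except RuntimeError:
                    if attempt < max_attempts:  # ...so resubmit instead of stopping
                        pending[pool.submit(run_job, job_id)] = (job_id, attempt + 1)
    return done

if __name__ == "__main__":
    print(len(run_with_retries(range(200))), "of 200 jobs completed")
```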
9Architectures and operating systems supported at end 1999
Are we sure that flexibility is an advantage?
[Chart: architectures -- SPARC, MIPS, Alpha, Power PC, Intel IA-32, PA-RISC; operating systems -- Digital Unix, Solaris, AIX, Irix, HP-UX, Linux, Windows 95, Windows NT, Windows 2000, MAC-OS]
10Physics Data Handling -- Evolution of capacity and cost through the nineties
[Chart: growth of installed capacity since LEP startup -- CPU capacity at roughly 50% annual growth, with a second curve at 80% annual growth]
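For a feel of what such growth rates compound to, a two-line calculation (the ten-year span is chosen to match "through the nineties"):

```python
# What 50% and 80% annual growth compound to over ten years.
for rate in (0.50, 0.80):
    print(f"{rate:.0%}/year over 10 years -> factor of {(1 + rate) ** 10:,.0f}")
# 50%/year -> roughly x58; 80%/year -> roughly x357
```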
12LHC Computing Requirements
13Online system -- multi-level trigger: filter out background, reduce data volume
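A toy sketch of the multi-level idea: each successive level examines fewer events in more detail, so most background never reaches storage. The selection functions and rates here are invented, not those of a real trigger:

```python
# Toy multi-level trigger: each level looks harder at fewer events, so most
# background is filtered out before anything is written to storage.
import random

def level1(evt): return evt["et"] > 0.90        # fast, coarse selection
def level2(evt): return evt["tracks"] > 0.95    # more detailed reconstruction
def level3(evt): return evt["fullreco"] > 0.97  # near-offline quality filter

def make_event():
    return {"et": random.random(), "tracks": random.random(), "fullreco": random.random()}

if __name__ == "__main__":
    events = [make_event() for _ in range(100_000)]
    for name, select in (("L1", level1), ("L2", level2), ("L3", level3)):
        events = [e for e in events if select(e)]
        print(f"after {name}: {len(events)} events kept")
```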
14The LHC Detectors
[Diagram: the CMS, ATLAS and LHCb detectors -- 3.5 PetaBytes/year, 10^8 events/year]
15LHC Computing Fabric at CERN
Estimated computing resources required at CERN for the LHC experiments in 2006:

                                     ALICE     ATLAS       CMS      LHCb      Total
CPU capacity (SPECint95), 2006     420 000   520 000   600 000   220 000  1 760 000
Estimated CPUs in 2006               3 000     3 000     3 000     1 500     10 500
Disk capacity (TB), 2006               800       750       650       450      2 650
Mag. tape capacity (PB), 2006          3.7       3.0       1.8       0.6        9.1
Aggregate disk I/O rate (GB/sec)       100       100        40       100        340
Aggregate tape I/O rate (GB/sec)       1.2       0.8       0.8       0.2        3.0

(The aggregate disk I/O rate sets the effective throughput required of the LAN backbone.)
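The Total column is simply the sum over the four experiments; a quick cross-check of the figures, using the per-experiment assignment laid out in the table above:

```python
# Cross-check of the totals in the table above (values in the order
# ALICE, ATLAS, CMS, LHCb, as laid out in the table).
rows = {
    "CPU capacity (SPECint95)": [420_000, 520_000, 600_000, 220_000],
    "Estimated CPUs":           [3_000, 3_000, 3_000, 1_500],
    "Disk capacity (TB)":       [800, 750, 650, 450],
    "Tape capacity (PB)":       [3.7, 3.0, 1.8, 0.6],
    "Disk I/O (GB/sec)":        [100, 100, 40, 100],
    "Tape I/O (GB/sec)":        [1.2, 0.8, 0.8, 0.2],
}
for name, values in rows.items():
    print(f"{name}: total = {round(sum(values), 1)}")
```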
16Less than 50% of the main analysis capacity will be at CERN
17[Charts: disk and tape capacity for LHC experiments vs. other experiments -- Jan 2000: 30 TB disk, 1 PB tape]
18Components to Fabrics
- Commodity components are just fine for HEP
- Masses of experience with inexpensive farms
- LAN technology is going the right way
- Inexpensive high performance PC attachments
- Compatible with hefty backbone switches
- Good ideas for improving automated operation and
management
19Two Problems
20Funding
- Requirements growing faster than Moore's law (illustrated below)
- CERN's overall budget is fixed
[Chart: estimated cost of the facility at CERN (about 30% of the offline requirements) against the budget level in 2000 for all physics data handling; assumes physics in July 2005 and a rapid ramp-up of luminosity]
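A rough illustration of what "faster than Moore's law at a fixed budget" implies; the 18-month doubling time and the requirement growth rates below are assumptions made for the arithmetic, not figures from the talk:

```python
# If price/performance doubles every ~18 months but required capacity grows
# faster, the affordable fraction of the need shrinks every year at flat budget.
# The doubling time and the example growth rates are assumptions.
years = 5
moore = 2 ** (12 / 18)                  # ~1.59x more capacity per unit cost per year
for req_growth in (1.6, 2.0, 2.5):      # hypothetical annual growth in requirements
    affordable_fraction = (moore / req_growth) ** years
    print(f"requirements x{req_growth}/yr -> can afford {affordable_fraction:.0%} "
          f"of the need after {years} years at flat budget")
```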
21World Wide Collaboration -> distributed computing and storage capacity
CMS: 1800 physicists, 150 institutes, 32 countries
22The Wide Area Computing Model
23Solution? - Regional Computing Centres
- Exploit established computing expertise and infrastructure in national labs and universities
- Reduce dependence on links to CERN
- full summary data available nearby, through a fat, fast, reliable network link
- Tap funding sources not otherwise available to HEP
- Devolve control over resource allocation
- national interests?
- regional interests?
- at the expense of physics interests?
24Regional Centres - a Multi-Tier Model
25More realistically - a Grid Topology
26The Basic Problem - Summary
- Scalability -> cost -> management
- Thousands of processors, thousands of disks, PetaBytes of data, Terabits/second of I/O bandwidth, ...
- Wide-area distribution
- WANs are, and will remain, only about 1% of LANs
- Distribute, replicate, cache, synchronise the data (a toy replica catalogue is sketched below)
- Multiple ownership, policies, ...
- Integration of this amorphous collection of Regional Centres ...
- ... with some attempt at optimisation
- Adaptability
- We shall only know how analysis is done once the data arrives
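A toy sketch of the "distribute, replicate, cache" point: a replica catalogue records where copies of each dataset live, and access goes to the cheapest copy rather than across the WAN. All site names, dataset names and costs below are invented:

```python
# Toy replica catalogue: datasets are replicated at several sites, and a job
# reads from the closest available copy instead of pulling everything over
# the wide-area network. Sites, datasets and costs are invented.
REPLICAS = {                      # dataset -> sites holding a copy
    "esd-2006-run1": {"CERN", "RAL", "Lyon"},
    "aod-2006-top":  {"CERN", "INFN-CNAF"},
}
WAN_COST = {                      # relative access cost as seen from one site
    ("RAL", "RAL"): 1, ("RAL", "CERN"): 100, ("RAL", "Lyon"): 80,
    ("RAL", "INFN-CNAF"): 90,
}

def best_source(dataset: str, from_site: str) -> str:
    sites = REPLICAS[dataset]
    return min(sites, key=lambda s: WAN_COST.get((from_site, s), 1000))

print(best_source("esd-2006-run1", "RAL"))   # -> RAL (local copy wins)
print(best_source("aod-2006-top", "RAL"))    # -> INFN-CNAF (cheapest remote copy)
```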
27A place for Grid technology?
28Are Grids a solution?
- Change of orientation of the meta-computing activity
- from inter-connected super-computers towards a more general concept of a computational Grid ("The Grid", Ian Foster and Carl Kesselman)
- Has found resonance with the press and funding agencies
- and initiated a flurry of activity in HEP
- US Particle Physics Data Grid (PPDG)
- Grid technology evaluation project in INFN (Italy)
- GriPhyN data grid R&D funded by NSF
- Several national initiatives with a solid HEP component (e.g. UK)
- DataGRID proposal to the European Union
- HEP, Earth Observation, Biology
29Current state
- Globus project (http://www.globus.org)
- Basic middleware
- Authentication
- Information service
- Resource management
- Good basis to build on
- Active collaborative community
- Open approach -- Grid Forum (http://www.gridforum.org)
- Who is handling lots of data?
- How many production-quality implementations?
30R&D required
- Local fabric
- Issues of scalability, management and reliability of the local computing fabric
- Adaptation of these amorphous computing fabrics to the Grid
- Wide-area mass storage
- Grid technology in an environment that is high-throughput, data-intensive, and has a chaotic workload
- Grid scheduling
- Data management
- Monitoring -- reliability and performance
31The DataGRID Project
32The Data Grid Project
- Proposal for EC Fifth Framework funding
- Principal goals
- Middleware for fabric and Grid management
- Large-scale testbed
- Production-quality demonstrations
- mock data, simulation and analysis, current experiments
- Three-year programme of phased developments and demos
- Collaborate with and complement other European and US projects
- Open source and communication
- GRID Forum
- Industry and Research Forum
33DataGRID Partners
- Managing partners
- UK: PPARC; Italy: INFN; France: CNRS; Holland: NIKHEF; Italy: ESA/ESRIN; CERN
- Industry
- IBM (UK), Compagnie des Signaux (F), Datamat (I)
- Associate partners
- Istituto Trentino di Cultura, Helsinki Institute of Physics, Swedish Science Research Council, Zuse Institut Berlin, University of Heidelberg, CEA/DAPNIA (F), IFAE Barcelona, CNR (I), CESNET (CZ), KNMI (NL), SARA (NL), SZTAKI (HU)
34Preliminary programme of work
- Middleware
- Grid Workload Management (C. Vistoli/INFN-CNAF)
- Grid Data Management (B. Segal/CERN)
- Grid Monitoring Services (R. Middleton/RAL)
- Fabric Management (T. Smith/CERN)
- Mass Storage Management (J. Gordon/RAL)
- Testbed
- Testbed Integration (F. Etienne/CNRS-Marseille)
- Network Services (C. Michau/CNRS)
- Scientific Applications
- HEP Applications (F. Carminati/CERN)
- Earth Observation Applications (L. Fusco/ESA-ESRIN)
- Biology Applications (C. Michau/CNRS)
35Middleware
- Wide-area -- building on an existing framework (Globus)
- Workload management
- The workload is chaotic -- unpredictable job arrival rates and data access patterns
- The goal is to maximise the global system throughput (events processed per second); a toy scheduling sketch follows below
- Data management
- Management of petabyte-scale data volumes, in an environment with limited network bandwidth and heavy use of mass storage (tape)
- Caching, replication, synchronisation, object database model
- Application monitoring
- Tens of thousands of components, thousands of jobs and individual users
- End-user -- tracking of the progress of jobs and aggregates of jobs
- Understanding application- and grid-level performance
- Administrator -- understanding which global-level applications were affected by failures, and whether and how to recover
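As a toy illustration of the workload-management goal (maximise global throughput given chaotic arrivals and data location), the sketch below picks a site for each job by combining queue length and data locality. The sites, weights and jobs are invented; this is not the project's actual scheduler:

```python
# Toy grid workload manager: for each incoming job, pick the site that
# minimises a cost combining current queue length and whether the job's
# input data is local. Sites, weights and jobs are invented.
SITES = {"CERN": 40, "RAL": 10, "Lyon": 25}          # site -> jobs already queued
DATA_AT = {"run1": {"CERN", "RAL"}, "run2": {"Lyon"}}

def choose_site(dataset: str) -> str:
    def cost(site: str) -> float:
        queue_penalty = SITES[site]
        transfer_penalty = 0 if site in DATA_AT[dataset] else 50   # WAN copy is expensive
        return queue_penalty + transfer_penalty
    return min(SITES, key=cost)

for job, dataset in [("j1", "run1"), ("j2", "run2"), ("j3", "run1")]:
    site = choose_site(dataset)
    SITES[site] += 1                                  # job joins that site's queue
    print(job, "->", site)
```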
36Middleware
- Local fabric
- Effective local site management of giant computing fabrics
- Automated installation, configuration management, system maintenance
- Automated monitoring and error recovery -- resilience, self-healing
- Performance monitoring
- Characterisation, mapping and management of local Grid resources
- Mass storage management
- multi-PetaByte data storage
- real-time data recording requirement
- active tape layer -- 1,000s of users
- uniform mass storage interface (sketched below)
- exchange of data and meta-data between mass storage systems
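A toy sketch of what a uniform mass storage interface could look like from the application's side; the classes are invented for illustration and do not represent any real mass storage system's API:

```python
# Toy uniform mass-storage interface: the same call works whether the bytes
# are in a disk pool or in an active tape layer that stages files on demand.
from abc import ABC, abstractmethod

class MassStorage(ABC):
    @abstractmethod
    def stage_in(self, name: str) -> str:
        """Make the file available on disk and return a local path."""

class DiskPool(MassStorage):
    def stage_in(self, name: str) -> str:
        return f"/pool/{name}"                 # already on disk

class TapeLibrary(MassStorage):
    def __init__(self):
        self.staged: set[str] = set()
    def stage_in(self, name: str) -> str:
        if name not in self.staged:
            self.staged.add(name)              # a real system would recall from tape here
        return f"/stage/{name}"

def analyse(store: MassStorage, name: str) -> str:
    return store.stage_in(name)                # application code is MSS-agnostic

print(analyse(DiskPool(), "raw-0001.dat"))
print(analyse(TapeLibrary(), "raw-0002.dat"))
```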
37Infrastructure
- Operate a production-quality trans-European testbed interconnecting clusters at several sites
- Initial testbed participants: CERN, RAL, INFN (several sites), IN2P3-Lyon, ESRIN (ESA, Italy), SARA/NIKHEF (Amsterdam), ZUSE Institut (Berlin), CESNET (Prague), IFAE (Barcelona), LIP (Lisbon), IFCA (Santander)
- Define, integrate and build successive releases of the project middleware
- Define, negotiate and manage the network infrastructure
- assume that this is largely TEN-155 and then Géant
- Stage demonstrations and data challenges
- Monitor, measure, evaluate, report
38Applications
- HEP
- The four LHC experiments
- Live testbed for the Regional Centre model
- Earth Observation
- ESA-ESRIN
- KNMI (Dutch meteo) -- climatology
- Processing of atmospheric ozone data derived from the ERS GOME and ENVISAT SCIAMACHY sensors
- Biology
- CNRS (France), Karolinska (Sweden)
- Application being defined
39Data Grid Challenges
40DataGRID Challenges (ii)
- Large, diverse, dispersed project
- but coordinating this European activity is one of the project's raisons d'être
- Collaboration and convergence with US and other Grid activities -- this area is very dynamic
- Organising adequate network bandwidth -- a vital ingredient for the success of a Grid
- Keeping the feet on the ground -- the GRID is a good idea, but not the panacea suggested by some recent press articles
41Conclusions on LHC Computing
- The scale of the computing needs of the LHC experiments is large compared with current experiments
- each experiment is one to two orders of magnitude greater than the TOTAL capacity installed at CERN today
- We believe that the hardware technology will be there to evolve the current architecture of commodity clusters into large-scale computing fabrics
- But there are many management problems -- workload, computing fabric, data, storage -- in a wide-area distributed environment
- Disappointingly, solutions for local site management on this scale are not emerging from industry
- The Grid technologies look very promising to deliver a major step forward in wide-area computing usability and effectiveness
- But a great deal of work will be required to make this a reality