Title: Tony Doyle, University of Glasgow
1. Particle Physics and Grid Development
- Joint Edinburgh/Glasgow SHEFC JREI-funded project to develop a prototype Tier-2 centre for LHC computing.
- UK-wide project to develop a prototype Grid for Particle Physics applications.
- EU-wide project to develop middleware for Particle Physics, Bioinformatics and Earth Observation applications.
- Emphasis on local developments.
2. Outline
- Introduction
- Grid Computing Context
- Challenges
- ScotGrid
- Starting Point
- Timelines
- UK Context: Tier-1 and -2 Centre Resources
- How Does the Grid Work?
- Middleware Development
- Grid Data Management
- Testbed
- ScotGrid Tour
- Hardware
- Software
- Web-Based Monitoring
- Summary(s)
3. Grid Computing Context
- LHC computing investment will be massive: the LHC Review estimated 240 MCHF, plus 80 MCHF/year afterwards.
- Europe: 267 institutes, 4603 users; elsewhere: 208 institutes, 1632 users.
[Diagram: Hardware, Middleware, Applications; total investment 1.5m]
4. Rare Phenomena, Huge Background
[Plot: all interactions vs. the Higgs signal; 9 orders of magnitude between them]
"When you are face to face with a difficulty you are up against a discovery." (Lord Kelvin)
5. Challenges: Event Selection
[Plot: all interactions vs. the Higgs signal; 9 orders of magnitude between them]
6. Challenges: Complexity
- Many events
  - 10^9 events/experiment/year
  - >1 MB/event raw data
  - several passes required
- Worldwide Grid computing requirement (2007): 300 TeraIPS
  (100,000 of today's fastest processors connected via a Grid)
[Data-flow diagram: detectors (16 million channels, 40 MHz collision rate, 3 Gigacell buffers) -> Level-1 trigger (charge, time, pattern; 100 kHz) -> 500 readout memories (energy, tracks; 1 MByte event data; 1 Terabit/s into 200 GByte buffers over 50,000 data channels) -> event builder (500 Gigabit/s networks) -> event filter (20 TeraIPS) -> Grid computing service (Gigabit/s service LAN, PetaByte archive, 300 TeraIPS)]
- Understand/interpret data via numerically intensive simulations
- e.g. ATLAS Monte Carlo (gg -> H -> bb): 182 sec per 3.5 MB event on a 1 GHz Linux box (current ScotGrid nodes); a rough scaling of these figures is sketched below
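To put the slide's numbers in perspective, here is a back-of-envelope calculation using only the figures quoted above (event count, event size and per-event simulation time); the 1% simulated sample size is an assumption chosen purely for illustration.

```python
# Back-of-envelope estimate from the figures quoted above (assumptions, not
# official experiment numbers): 10^9 events/experiment/year at >1 MB/event,
# and 182 s of CPU per simulated event on a 1 GHz node.

EVENTS_PER_YEAR = 1e9          # events per experiment per year
RAW_EVENT_SIZE_MB = 1.0        # lower bound, MB per raw event
SIM_CPU_S_PER_EVENT = 182.0    # ATLAS MC, seconds per event on a 1 GHz node
SECONDS_PER_YEAR = 3.15e7

raw_data_pb = EVENTS_PER_YEAR * RAW_EVENT_SIZE_MB / 1e9   # MB -> PB
print(f"Raw data: ~{raw_data_pb:.0f} PB per experiment per year")

# CPU needed to simulate a 1% sample (10^7 events, an illustrative choice)
# within one calendar year:
sim_events = 1e7
nodes_needed = sim_events * SIM_CPU_S_PER_EVENT / SECONDS_PER_YEAR
print(f"~{nodes_needed:.0f} 1 GHz nodes busy for a year to simulate {sim_events:.0e} events")
```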
7. LHC Computing Challenge
[Tier-structure diagram:]
- Detector (PBytes/sec) -> Event Builder (500 Gigabit/s) -> Event Filter (20 TIPS)
  - one bunch crossing per 25 ns
  - 100 triggers per second
  - each event is 1 MByte
- Tier 0 (100 Gigabit/s link): CERN Computer Centre, >20 TIPS, with HPSS mass storage
- Tier 1 (Gigabit/s links): RAL, US, French and Italian Regional Centres, each with HPSS
- Tier 2 (Gigabit/s links): Tier-2 centres of ~1 TIPS each
- Tier 3: institute servers of ~0.25 TIPS with physics data caches; physicists work on analysis channels, each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
- Tier 4 (100-1000 Mbits/sec links): workstations
8. Starting Point (Dec 2000)
[Timeline chart spanning 2001-2003]
9. Timelines
- IBM equipment arrived at Edinburgh and Glasgow for Phase 1.
- Phase 0: equipment is tested and set up in a basic configuration, networking the two sites.
- Phase 1: prototyping of the integrated local computing fabric, with emphasis on scaling, reliability and resilience to errors.
- IBM equipment arrives at Edinburgh and Glasgow for Phase 2.
- LHC Global Grid TDR.
- 50% prototype (LCG-3) available.
- ScotGRID: 300 CPUs, 50 TBytes.
- LCG-1 reliability and performance targets.
- First Global Grid Service (LCG-1) available.
10. The Spirit of the Project
The JREI funds make it possible to commission and fully exercise a prototype LHC computing centre in Scotland.
- The Centre would develop, support and test:
  - a technical service based on the grid
  - a DataStore to handle samples of data for user analysis
  - significant simulation production capability
  - network connectivity (internal and external)
  - Grid middleware
  - core software within LHCb and ATLAS
  - user applications in other scientific areas
- This will enable us to answer:
  - Is the grid a viable solution for the LHC computing challenge?
  - Can a two-site Tier-2 centre be set up and operated effectively?
  - How can network use between Edinburgh, Glasgow, RAL and CERN be improved?
11. Tier-1 and -2 Centre Resource Planning
- Estimated resources at the start of GridPP2 (Sept. 2004):
  - Tier-2: ~6000 CPUs, 400 TB
  - Tier-1: 1000 CPUs, 500 TB
- Shared distributed resources are required to meet experiment requirements, connected by network and grid.
12. Testbed to Production: Total Resources
- Dynamic Grid optimisation via the Replica Optimisation Service
  2004: 7,000 1 GHz CPUs, 400 TB disk
  2007: 30,000 1 GHz CPUs, 2,200 TB disk
  (note: x2 scale change between the plots)
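The implied growth factors between the two snapshots follow directly from the numbers above (a trivial calculation, shown only for scale):

```python
# Growth factors implied by the 2004 and 2007 totals quoted above.
cpu_2004, cpu_2007 = 7_000, 30_000      # 1 GHz CPUs
disk_2004, disk_2007 = 400, 2_200       # TB

print(f"CPU grows by x{cpu_2007 / cpu_2004:.1f}")    # ~x4.3
print(f"Disk grows by x{disk_2007 / disk_2004:.1f}") # ~x5.5
```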
13. Experiment Requirements (UK only)
[Chart: total requirement by experiment]
14. From Testbed to Production
[Release-workflow diagram: the Work Packages (WPs) add unit-tested code to the repository as tagged packages; the build system runs nightly builds and automated tests on the development testbed (15 CPUs); the integration team runs individual WP tests and overall release tests, producing release candidates; tagged releases selected for certification go to the test group for grid certification on the certification testbed (40 CPUs), with application representatives doing application certification and problems fed back for fixing; certified releases are then selected for deployment as certified public releases for use by the applications on the production testbed (1000 CPUs), which is operated 24x7 with problem reports returned to the developers.]
15. How Does the Grid Work?
0. Web user interface
1. Authentication: grid-proxy-init
2. Job submission: edg-job-submit
3. Monitoring and control: edg-job-status, edg-job-cancel, edg-job-get-output
4. Data publication and replication: Replica Location Service, Replica Optimisation Service
5. Resource scheduling: use of Mass Storage Systems, JDL, sandboxes, storage elements
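The same command sequence (steps 1 to 3) can be scripted. The sketch below is illustrative only: the JDL attributes are a generic example and the way the job identifier is parsed from the submit output is an assumption, not something taken from the slides.

```python
"""Sketch of the EDG job-submission flow (steps 1-3 above).

Assumes the EDG user-interface commands are on PATH; the JDL contents and
the job-id parsing are illustrative assumptions.
"""
import subprocess

JDL = """\
Executable    = "/bin/sh";
Arguments     = "myjob.sh";
StdOutput     = "stdout.log";
StdError      = "stderr.log";
InputSandbox  = {"myjob.sh"};
OutputSandbox = {"stdout.log", "stderr.log"};
"""

def main() -> None:
    with open("myjob.jdl", "w") as f:
        f.write(JDL)

    # 1. Authentication: create a short-lived proxy from the user's X.509 cert.
    subprocess.run(["grid-proxy-init"], check=True)

    # 2. Job submission: the broker matches the JDL against available resources.
    out = subprocess.run(["edg-job-submit", "myjob.jdl"],
                         check=True, capture_output=True, text=True).stdout
    # Assumption: the returned job identifier is the line starting "https://".
    job_id = next(line.strip() for line in out.splitlines()
                  if line.strip().startswith("https://"))

    # 3. Monitoring and output retrieval.
    subprocess.run(["edg-job-status", job_id], check=True)
    subprocess.run(["edg-job-get-output", job_id], check=True)

if __name__ == "__main__":
    main()
```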
16. Middleware Development
17. Grid Data Management
- Secure access to metadata
  - metadata: where are the files on the grid?
  - database client interface
  - grid service using standard web services
  - developed with the UK e-Science programme
  - input to OGSA-DAI
- Optimised file replication
  - simulations required
  - economic models using CPU, disk and network inputs
  - OptorSim
- "Large increases in cost with questionable increases in performance can be tolerated only in race horses and fancy women." (Lord Kelvin)
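As a concrete reading of "secure access to metadata as a grid service using standard web services", the sketch below shows a client asking a catalogue where a logical file lives over mutually authenticated HTTPS. The endpoint, payload format and certificate paths are hypothetical; this is not the actual Spitfire interface.

```python
"""Sketch of 'secure access to metadata as a web service'.

Everything here is illustrative: the endpoint URL, the query payload and
the certificate paths are hypothetical, not the real Spitfire interface.
The point is simply that the client authenticates with its grid (X.509)
certificate over HTTPS and asks a catalogue where a logical file lives.
"""
import json
import ssl
import urllib.request

METADATA_URL = "https://metadata.example.org/catalogue/query"  # hypothetical

def lookup_replicas(logical_file_name: str):
    """Return the storage locations recorded for a logical file name."""
    # Mutual TLS: present the user's grid certificate, trust the service CA.
    ctx = ssl.create_default_context(cafile="/etc/grid-security/ca.pem")
    ctx.load_cert_chain(certfile="usercert.pem", keyfile="userkey.pem")

    payload = json.dumps({"lfn": logical_file_name}).encode()
    req = urllib.request.Request(METADATA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)["replicas"]

if __name__ == "__main__":
    print(lookup_replicas("lfn:/atlas/mc/higgs_bb_0001.root"))
```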
18. MetaData: Spitfire
[Architecture diagram: a servlet container fronts the RDBMS. The SSLServletSocketFactory and TrustManager check client certificates against the trusted CAs and the revoked-certificates repository; the security servlet passes requests to an authorization module, which checks whether the user specifies a role against the role repository; a translator servlet maps the role to a connection id via the role-to-connection mappings and a connection pool.]
Secure? At the level required in Particle Physics.
Glasgow authors: Will Bell, Gavin McCance
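The authorization step in the diagram, mapping the requester's role onto a database connection with matching privileges, can be illustrated with a small sketch. The role names and connection identifiers are invented; this is not Spitfire's code.

```python
"""Sketch of the role-based authorization step shown in the diagram.

Not Spitfire's implementation; it only illustrates mapping an authenticated
user's role onto a database connection with correspondingly limited rights.
Role names and connection ids are made up.
"""
from typing import Optional

# Hypothetical role -> connection-id mappings (the "role repository" and
# "connection mappings" boxes in the diagram).
ROLE_CONNECTIONS = {
    "reader": "dbpool/readonly",
    "writer": "dbpool/readwrite",
    "admin":  "dbpool/admin",
}
DEFAULT_ROLE = "reader"   # used when the user does not specify a role

def connection_for(requested_role: Optional[str], allowed_roles: set) -> str:
    """Pick the connection id for a request, enforcing the user's allowed roles."""
    role = requested_role or DEFAULT_ROLE
    if role not in allowed_roles:
        raise PermissionError(f"role '{role}' not granted to this certificate")
    return ROLE_CONNECTIONS[role]

if __name__ == "__main__":
    # A certificate mapped to {reader, writer} asking for the writer role:
    print(connection_for("writer", {"reader", "writer"}))  # dbpool/readwrite
```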
19. OptorSim File Replication Simulation
- Test P2P file replication strategies, e.g. economic models
1. Optimisation principles applied to the GridPP 2004 testbed with realistic PP use patterns/policies.
2. Job scheduling: queue access cost takes into account queue length and network connectivity. Anticipate replicas needed at close sites using three replication algorithms.
3. Build in realistic JANET background traffic.
4. Replication algorithms optimise CPU use/job time as replicas are built up on the Grid.
Glasgow authors: Will Bell, David Cameron, Paul Millar, Caitriana Nicholson
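To give a flavour of the "economic model" decisions OptorSim explores, the toy sketch below has a site weigh repeated remote reads against the cost of hosting a local replica. The cost function and all numbers are invented for illustration and are not OptorSim's actual algorithms.

```python
"""Toy sketch of an 'economic model' replication decision of the kind
OptorSim simulates. The cost model, numbers and link parameters are
invented for illustration; they are not OptorSim's algorithms.
"""
from dataclasses import dataclass

@dataclass
class Link:
    bandwidth_mbps: float   # available bandwidth to the remote replica
    file_size_mb: float

def remote_read_cost(link: Link) -> float:
    """Time (s) to read the file over the network once."""
    return 8 * link.file_size_mb / link.bandwidth_mbps

def should_replicate(link: Link, expected_accesses: int,
                     local_storage_cost_s: float) -> bool:
    """Replicate if repeated remote reads would cost more than storing locally.

    local_storage_cost_s is the 'price' of the disk space expressed in the
    same units (seconds of saved time): a deliberately crude stand-in for
    the value of whatever files would have to be evicted.
    """
    one_transfer = remote_read_cost(link)
    return expected_accesses * one_transfer > one_transfer + local_storage_cost_s

if __name__ == "__main__":
    link = Link(bandwidth_mbps=155.0, file_size_mb=1000.0)   # ~1 GB file
    # 1 expected access: keep reading remotely; 20 accesses: pull a local replica.
    print(should_replicate(link, expected_accesses=1, local_storage_cost_s=300.0))
    print(should_replicate(link, expected_accesses=20, local_storage_cost_s=300.0))
```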
20. Testbed Status, Summer 2003
Tier-2 Regional Centres: ScotGrid, NorthGrid, SouthGrid, London Grid
UK-wide development using EU-DataGrid tools (v1.47), deployed during Sept 02-03 and currently being upgraded to v2.0. See http://www.gridpp.ac.uk/map/
21. Sequential Access via Metadata (SAM)
The SAM system went into production mode for CDF on June 3, 2002.
"Treat the WAN as an abundant file transfer resource." (Rick St Denis, Run II Computing Review, June 4-6, 2002)
Grid theme: metadata is required to enable distributed resources, e.g. CDF@Glasgow, to work coherently.
Glasgow authors: Morag Burgon-Lyon, Rick St. Denis, Stan Thompson
22. Tour of ScotGrid
- Hardware
  - 59 IBM X Series 330: dual 1 GHz Pentium III with 2 GB memory
  - 2 IBM X Series 340: dual 1 GHz Pentium III with 2 GB memory and dual ethernet
  - 3 IBM X Series 340: dual 1 GHz Pentium III with 2 GB memory and 100/1000 Mbit/s ethernet
  - 1 TB disk
  - LTO/Ultrium tape library
  - Cisco ethernet switches
- New:
  - IBM X Series 370: PIII Xeon with 32 x 512 MB RAM
  - 5 TB FastT500 disk: 70 x 73.4 GB IBM FC hot-swap HDD
  - eDIKT: 28 IBM blades, dual 2.4 GHz Xeon with 1.5 GB memory
  - eDIKT: 6 IBM X Series 335, dual 2.4 GHz Xeon with 1.5 GB memory
  - CDF: 10 Dell PowerEdge 2650, 2.4 GHz Xeon with 1.5 GB memory
  - CDF: 7.5 TB RAID disk
Shared resources: 15 TB disk, 330 x 1 GHz CPUs
23. Tour of ScotGrid
Ongoing Upgrade Programme
- Software
  - OpenPBS batch system
    - the job description language is shell scripts with special comments
    - jobs are submitted using the qsub command (see the sketch below)
    - the location of job output is determined by the PBS shell script
    - jobs are scheduled using the Maui plugin to OpenPBS
  - Frontend machines provide a Grid-based entry mechanism to this system
    - remote users prepare jobs and submit them from e.g. their desktop
    - users authenticate using X.509 certificates
    - users do not have personal accounts on e.g. ScotGRID-Glasgow but use pool accounts
Development/deployment system manager: David Martin
[Diagram: EDG 1.4 site layout with a Compute Element (CE), a Storage Element (SE) and 59 Worker Nodes (WN)]
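To make "shell scripts with special comments" concrete, the sketch below writes a minimal PBS job script and hands it to qsub. The resource request, file names and payload are placeholders, not ScotGrid's actual configuration.

```python
"""Minimal sketch of ScotGrid-style batch submission: a shell script whose
'#PBS' comment lines carry the job description, handed to qsub. The
resource request, paths and payload are placeholders.
"""
import subprocess

JOB_SCRIPT = """\
#!/bin/sh
# The '#PBS' comment lines below are the job description read by qsub.
#PBS -N atlas-mc-test
#PBS -l nodes=1:ppn=2
#PBS -l walltime=04:00:00
#PBS -j oe
#PBS -o mc_test.log
cd $PBS_O_WORKDIR
./run_simulation.sh   # hypothetical payload
"""

def submit() -> str:
    """Write the job script and submit it; qsub prints the job id."""
    with open("mc_test.pbs", "w") as f:
        f.write(JOB_SCRIPT)
    result = subprocess.run(["qsub", "mc_test.pbs"],
                            check=True, capture_output=True, text=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("Submitted:", submit())
```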
24. Web-Based Monitoring
[Screenshots: accumulated CPU use, total disk use, instantaneous CPU use, documentation and prototype pages]
ScotGrid reached its 500,000th processing hour on Thursday 17th July 2003.
25. Delivered CPU
- Duty cycle typically 70%, with large fluctuations
- Contention control? Via target shares
- LHC targets met; significant non-LHC application use; Bioinformatics/CDF resources being integrated
[Chart: total delivered CPU, including UKQCD]
26. Summary in UK-EU-World Context
- 50/50 Edinburgh/Glasgow funding model, funded by SHEFC
  - compute-intensive jobs performed at Glasgow
  - data-intensive jobs performed at Edinburgh
- Leading R&D in Grid Data Management in the UK
- Open policy on usage and target shares; open monitoring system
- Meeting real requirements of applications: currently HEP (experiment and theory), Bioinformatics, Computing Science
- Open-source research (all code) and open-source systems (IBM Linux-based system)
- Part of a worldwide grid infrastructure through GridPP
  - GridPP project (£17m over three years -> Sept 04)
  - dedicated people actively developing a Grid, all with personal certificates
  - using the largest UK grid testbed (16 sites and more than 100 servers)
  - deployed within an EU-wide programme and linked to worldwide Grid testbeds
  - LHC Grid deployment programme defined; first international testbed in July
  - active Tier-1/A production centre already meeting international requirements
  - latent Tier-2 resources being monitored; ScotGRID recognised as leading developments
  - significant middleware development programme; importance of Grid Data Management
27. Summary
[Diagram: Hardware, Middleware, Applications]
- PPARC/SHEFC/University strategic investment
- Software prototyping (Grid Data Management) and stress-testing (Applications)
- Long-term commitment to Grid computing (LHC era)
- Partnership with Bioinformatics, Computing Science, Edinburgh, Glasgow, IBM and Particle Physics
- Working locally as part of national, European and international Grid development
  - middleware testbed linked to real applications via ScotGrid
  - development/deployment: ScotGRID
"Radio has no future. X-rays will prove to be a hoax." (Lord Kelvin)
28. Where Are We?
- Depends on your perspective
- (What tangled webs we weave when first we practise... building grids)