Title: The Data Deluge
1. The Data Deluge and the Grid
- The Data Deluge
- The Large Hadron Collider
- The LHC Data Challenge
- The Grid
- Grid Applications
- GridPP
- Conclusion
Steve Lloyd, Queen Mary University of London, s.l.lloyd@qmul.ac.uk
2. The Data Deluge
- Expect massive increases in the amount of data being collected in several diverse fields over the next few years:
- Astronomy - massive sky surveys
- Biology - genome databases etc.
- Earth observation
- Digitisation of paper, film, tape records etc. to create digital libraries, museums . . .
- Particle Physics - the Large Hadron Collider
- . . .
- 1 PByte = 1,000 TBytes = 1M GBytes ≈ 1.4M CDs (at ~700 MB per CD)
3. Digital Sky Project
Federating new astronomical surveys: 40,000 square degrees, 1/2 trillion pixels (1 arcsecond), 1 TB × multiple wavelengths, > 1 billion sources
- Integrated catalogue and image database
- Digital Palomar Observatory Sky Survey
- 2μ All Sky Survey
- NRAO VLA Sky Survey
- VLA FIRST Radio Survey
- Later
- ROSAT
- IRAS
- Westerbork 327 MHz Survey
4. Sloan Digital Sky Survey
Survey 10,000 square degrees of Northern Sky over
5 years
- 1 million spectra
- positions and images of 100 million objects
- 5 wavelength bands
- 40 TB
5. VISTA
Visible and Infrared Survey Telescope for
Astronomy
6. Virtual Observatories
[Images: the jet in M87 seen by Chandra (X-ray), HST (optical), Gemini (mid-IR) and the VLA (radio)]
7. NASA's Earth Observing System
Galapagos Oil Spill
1 TB/day
8. ESA EO Facilities
[Diagram: ESA Earth-observation ground segment - data from LANDSAT 7, TERRA/MODIS, AVHRR, SEAWIFS, SPOT and IRS-P3 flows through ground stations (KIRUNA (S) - ESRANGE, TROMSO (N), MATERA (I), MASPALOMAS (E), NEUSTRELITZ (D)) into historical archives and standard production chains, and out via ESRIN as products to users]
GOME analysis detected ozone thinning over Europe, 31 Jan 2002
9. Species 2000
To enumerate all 1.7 million known species of
plants, animals, fungi and microbes on Earth
A federation of initially 18 taxonomic databases
- eventually 200 databases
10. Genomics
11. The LHC
- The Large Hadron Collider (LHC) will be a 14 TeV centre-of-mass proton-proton collider operating in the existing 26.7 km LEP tunnel at CERN, due to start operation > 2006
- 1,232 superconducting main dipoles of 8.3 Tesla
- 788 quadrupoles
- 2,835 bunches of 10^11 protons per bunch, spaced by 25 ns
12. Particle Physics Questions
- Need to discover (confirm) Higgs Particle
- Study its properties
- Prove that Higgs couplings depend on masses
- Other unanswered questions
- Does Supersymmetry exist?
- How are quarks and leptons related?
- Why are there 3 sets of quarks and leptons?
- What about Gravity?
- Anything unexpected?
13. The LHC
14. The LEP/LHC Tunnel
15. LHC Experiments
- The LHC will house 4 experiments:
- ATLAS and CMS are large 'General Purpose' detectors designed to detect anything and everything
- LHCb is a specialised experiment designed to study CP violation in the b-quark system
- ALICE is a dedicated heavy-ion physics detector
16. Schematic View of the LHC
17. The ATLAS Experiment
- ATLAS consists of:
- An inner tracker to measure the momentum of each charged particle
- A calorimeter to measure the energies carried by the particles
- A muon spectrometer to identify and measure muons
- A huge magnet system for bending charged particles for momentum measurement
- A total of > 10^8 electronic channels
18. The ATLAS Detector
19. Simulated ATLAS Higgs Event
20. LHC Event Rates
- The LHC proton bunches collide every 25 ns and each collision yields ~20 proton-proton interactions superimposed in the detector, i.e.
- 40 MHz × 20 = 8×10^8 pp interactions/sec
- The (110 GeV) Higgs cross section is 24.2 pb
- A good channel is H → γγ, with a branching ratio of 0.19% and a detector acceptance of 50%
- At full (10^34 cm^-2 s^-1) LHC luminosity this gives 10^34 × 24.2×10^-12 × 10^-24 × 0.0019 × 0.5 ≈ 2×10^-4 H → γγ per second
- A 2×10^-4 needle in an 8×10^8 haystack
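A quick back-of-the-envelope check of these rates, as a minimal Python sketch; all the numbers are the ones quoted above, and the constant names are just for readability:

```python
# Rough LHC event-rate arithmetic, using the figures quoted on this slide.

BUNCH_CROSSING_RATE = 40e6   # Hz (one bunch crossing every 25 ns)
PILEUP = 20                  # pp interactions superimposed per crossing

LUMINOSITY = 1e34            # cm^-2 s^-1, full LHC luminosity
XSEC_HIGGS_PB = 24.2         # pb, 110 GeV Higgs production cross section
PB_TO_CM2 = 1e-12 * 1e-24    # 1 pb = 1e-12 barn, 1 barn = 1e-24 cm^2
BR_GAMMAGAMMA = 0.0019       # H -> gamma gamma branching ratio (0.19%)
ACCEPTANCE = 0.5             # detector acceptance (50%)

pp_rate = BUNCH_CROSSING_RATE * PILEUP
higgs_rate = LUMINOSITY * XSEC_HIGGS_PB * PB_TO_CM2 * BR_GAMMAGAMMA * ACCEPTANCE

print(f"pp interactions:  {pp_rate:.1e} per second")     # 8.0e+08
print(f"H -> gamma gamma: {higgs_rate:.1e} per second")  # ~2.3e-04
```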
21. 'Online' Data Reduction
- Collision rate: 40 MHz (40 TB/sec)
- Level 1 - special hardware trigger: 10^4-10^5 Hz (10-100 GB/sec)
- Level 2 - embedded processor trigger: 10^2-10^3 Hz (1-10 GB/sec)
- Level 3 - processor farm: 10-100 Hz (100-200 MB/sec)
- Raw data storage, followed by offline data reconstruction
- Each level selects interesting events based on progressively more detector information
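The cascade amounts to an overall rejection of several hundred thousand to one before anything reaches tape. A minimal sketch of that arithmetic, taking one representative rate from each range above:

```python
# Per-level rejection factors for the online trigger cascade, using
# representative rates picked from the ranges quoted on this slide (Hz).
stages = [
    ("collision rate", 40e6),
    ("Level 1 (special hardware)", 1e5),
    ("Level 2 (embedded processors)", 1e3),
    ("Level 3 (processor farm)", 1e2),
]

for (name, rate), (next_name, next_rate) in zip(stages, stages[1:]):
    print(f"{name} -> {next_name}: keep 1 in {rate / next_rate:,.0f}")

overall = stages[0][1] / stages[-1][1]
print(f"overall: 1 collision in {overall:,.0f} is stored")  # 400,000
```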
22. Offline Analysis
- Raw data from detector: 1-2 MB/event @ 100-400 Hz
- Total data per year from one experiment: 1 to 8 PBytes (10^15 bytes)
- Data reconstruction (digits to energy/momentum etc.)
- Event Summary Data: 0.5 MB/event
- Analysis event selection
- Analysis Object Data: 10 kB/event
- Physics analysis
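The 1-8 PB/year figure follows directly from the event rate and size once you assume a number of live seconds per year; the ~10^7 s used below is a common rule of thumb, not a number from the slide:

```python
# Annual raw-data volume for one experiment: rate x event size x live time.
LIVE_SECONDS_PER_YEAR = 1e7   # assumed accelerator live time (~4 months)

def raw_data_pb_per_year(rate_hz: float, event_size_mb: float) -> float:
    bytes_per_year = rate_hz * event_size_mb * 1e6 * LIVE_SECONDS_PER_YEAR
    return bytes_per_year / 1e15   # bytes -> PBytes

print(raw_data_pb_per_year(100, 1.0))  # 1.0 PB/year (low end of the ranges)
print(raw_data_pb_per_year(400, 2.0))  # 8.0 PB/year (high end)
```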
23. Computing Resources Required
- CPU power (reconstruction, simulation, user analysis etc.): 2 million SpecInt95
- (A 1 GHz PC is rated at 40 SpecInt95, i.e. 50,000 of today's PCs)
- 'Tape' storage: 20,000 TB
- Disk storage: 2,500 TB
- Analysis carried out throughout the world by hundreds of physicists
24. Worldwide Collaboration
CMS: 1,800 physicists, 150 institutes, 32 countries
25. Solutions
- Centralised solution:
- Put all resources at CERN
- Funding agencies certainly won't place all their investment at CERN
- Sociological problems
- Distributed solution:
- Exploit established computing expertise and infrastructure in national labs and universities
- Reduce dependence on links to CERN
- Tap additional funding sources (spin-off)
- Is the Grid the solution?
26. What is the Grid?
- Analogy with the electricity power grid
- Unlimited, ubiquitous, distributed computing
- Transparent access to multi-petabyte distributed databases
- Easy to plug in
- Complexity of infrastructure hidden
27. The Grid
- Five emerging models:
- Distributed Computing - synchronous processing
- High-Throughput Computing - asynchronous processing
- On-Demand Computing - dynamic resources
- Data-Intensive Computing - databases
- Collaborative Computing - scientists
Ian Foster and Carl Kesselman, editors, "The Grid: Blueprint for a New Computing Infrastructure", Morgan Kaufmann, 1999, http://www.mkp.com/grids
28. The Grid
- Ian Foster / Carl Kesselman
- "A computational Grid is a hardware and
software infrastructure that provides dependable,
consistent, pervasive and inexpensive access to
high-end computational capabilities."
29. The Grid
- Dependable - need to rely on remote equipment as much as the machine on your desk
- Consistent - machines need to communicate, so they need consistent environments and interfaces
- Pervasive - the more resources that participate in the same system, the more useful they all are
- Inexpensive - important for pervasiveness, i.e. built using commodity PCs and disks
30. The Grid
- You simply submit your job to the 'Grid' - you shouldn't have to know where the data you want is or where the job will run. The Grid software (middleware) takes care of either:
- running the job where the data is, or
- moving the data to where there is CPU power available (a toy version of this decision is sketched below)
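As an illustration only (this is not GridPP's actual middleware, and the sites, costs and thresholds are invented), a toy scheduler making that run-here-or-move-data decision might look like:

```python
# Toy Grid scheduler: prefer running where the data already is, unless a
# site with free CPUs can fetch the data faster than the local queue wait.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    has_data: bool
    transfer_time_s: float   # estimated time to copy the input data here

def place_job(sites: list[Site], queue_wait_s: float = 3600.0) -> Site:
    """Pick the site with the smallest estimated time-to-start."""
    def cost(site: Site) -> float:
        wait = 0.0 if site.free_cpus > 0 else queue_wait_s
        move = 0.0 if site.has_data else site.transfer_time_s
        return wait + move
    return min(sites, key=cost)

sites = [
    Site("RAL-Tier1", free_cpus=0, has_data=True, transfer_time_s=0.0),
    Site("QMUL-Tier2", free_cpus=50, has_data=False, transfer_time_s=900.0),
]
print(place_job(sites).name)  # QMUL-Tier2: moving the data beats queueing
```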
31. The Grid for the Scientist
Putting the bottleneck back in the scientist's mind
32. Grid Tiers
- For the LHC we envisage a 'hierarchical' structure based on several 'tiers', since the data mostly originates at one place:
- Tier-0 - CERN, the source of the data
- Tier-1 - 10 major regional centres (including the UK)
- Tier-2 - smaller, more specialised regional centres (4 in the UK?)
- Tier-3 - university groups
- Tier-4 - my laptop? Mobile phone?
- It doesn't need to be hierarchical; for biologists, say, a hierarchy is probably not desirable (a sketch of the fan-out follows)
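A minimal sketch of that fan-out as a data structure; the labels are placeholders, not a statement of the actual centres:

```python
# Illustrative tier hierarchy: data originates at Tier-0 and fans out downwards.
tiers = {
    "Tier-0: CERN": {
        "Tier-1: UK regional centre": {
            "Tier-2: regional centre": {
                "Tier-3: university group": {
                    "Tier-4: laptop / phone": {},
                },
            },
        },
        # ...plus ~9 further Tier-1 centres in other countries
    },
}

def walk(node: dict, depth: int = 0) -> None:
    """Print one line per centre, indented by tier."""
    for name, children in node.items():
        print("  " * depth + name)
        walk(children, depth + 1)

walk(tiers)
```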
33. Grid Services
[Diagram: the layered Grid architecture]
- Applications: cosmology, chemistry, environment, biology, particle physics
- Application Toolkits: data-intensive, remote visualization, problem solving, remote instrumentation, collaborative and distributed computing applications toolkits
- Grid Services (middleware): resource-independent and application-independent services - authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
- Grid Fabric (resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass
34. Problems
- Scalability
- Will it scale to thousands of processors, thousands of disks, petabytes of data, terabits/sec of I/O?
- Wide-area distribution
- How to distribute, replicate, cache, synchronise and catalogue the data? (see the sketch after this list)
- How to balance local ownership of resources with the requirements of the whole?
- Adaptability/flexibility
- Need to adapt to rapidly changing hardware and costs, new analysis methods etc.
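To make the cataloguing problem concrete, here is a toy replica catalogue. The idea (mapping a logical file name to its physical copies and preferring a nearby one) is the standard one; the class itself and the URLs are invented for the example:

```python
# Toy replica catalogue: one logical file name (LFN) -> many physical copies.
from typing import Optional

class ReplicaCatalogue:
    def __init__(self) -> None:
        self._replicas: dict[str, list[str]] = {}

    def register(self, lfn: str, pfn: str) -> None:
        """Record a physical copy (PFN) of a logical file."""
        self._replicas.setdefault(lfn, []).append(pfn)

    def locate(self, lfn: str, prefer_site: Optional[str] = None) -> str:
        """Return a physical copy, preferring one at the caller's site."""
        copies = self._replicas[lfn]
        if prefer_site:
            for pfn in copies:
                if prefer_site in pfn:
                    return pfn
        return copies[0]   # fall back to any replica

rc = ReplicaCatalogue()
rc.register("atlas/run42.raw", "gsiftp://cern.ch/data/run42.raw")
rc.register("atlas/run42.raw", "gsiftp://ral.ac.uk/data/run42.raw")
print(rc.locate("atlas/run42.raw", prefer_site="ral.ac.uk"))
```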
35. SETI@home
- A distributed computing project - not really a Grid project
- You pull the data from them, rather than them submitting the job to you
- A total of 3,864,230 users
- 564,194,228 results received
- 1,063,104 years of CPU time
- 1.8×10^21 floating point operations
- 77 different CPU types
- 100 different operating systems
Arecibo telescope in Puerto Rico
36. SETI@home
37. Entropia
- Uses idle cycles on home PCs for profit and non-profit projects
- Mersenne Prime Search
- 146,622 machines
- 784,360,165 CPU hours
- FightAIDS@Home
- 13,944 machines
- 1,652,126 CPU hours
38. NASA Information Power Grid
- Knit together widely distributed computing, data, instrumentation and human resources
- to address complex, large-scale computing and data analysis problems
39. Collaborative Engineering
Unitary Plan Wind Tunnel
Multi-source Data Analysis
Real-time collection
Archival storage
40. Other Grid Applications
- Distributed supercomputing
- Simultaneous execution across multiple supercomputers
- Smart instruments
- Enhance the power of scientific instruments by providing access to data archives, online processing capabilities and visualisation, e.g. coupling Argonne's Photon Source to a supercomputer
41. GridPP
http://www.gridpp.ac.uk
42. GridPP Overview
Provide architecture and middleware
Future LHC Experiments
Running US Experiments
Build prototype Tier-1 and Tier-2s in the UK and
implement middleware in experiments
Use the Grid with simulation data
Use the Grid with real data
43. The Prototype UK Tier-1
- Jan 2002 - central facilities used by all experiments:
- 250 CPUs (450 MHz-1 GHz)
- 10 TB disk
- 35 TB tape in use (theoretical tape capacity 330 TB)
- March 2002 - extra resources for LHC and BaBar:
- 312 CPUs
- 40 TB disk
- an extra 36 TB of tape and three new drives
44. Conclusions
- Enormous data challenges in the next few years.
- The Grid is the likely solution.
- The Web gives ubiquitous access to distributed information.
- The Grid will give ubiquitous access to computing resources, and hence knowledge.
- Many Grid projects and testbeds are starting to take off.
- GridPP is building a UK Grid for particle physicists to prepare for future LHC data.