Title: Grids for 21st Century Data Intensive Science
1. Grids for 21st Century Data Intensive Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/
avery@phys.ufl.edu
University of Michigan, May 8, 2003
2. Grids and Science
3. The Grid Concept
- Grid: geographically distributed computing resources configured for coordinated use
- Fabric: physical resources and networks provide raw capability
- Middleware: software ties it all together (tools, services, etc.)
- Goal: transparent resource sharing
4. Fundamental Idea: Resource Sharing
- Resources for complex problems are distributed
  - Advanced scientific instruments (accelerators, telescopes, ...)
  - Storage, computing, people, institutions
- Communities require access to common services
  - Research collaborations (physics, astronomy, engineering, ...)
  - Government agencies, health care organizations, corporations, ...
- Virtual Organizations
  - Create a VO from geographically separated components
  - Make all community resources available to any VO member
  - Leverage strengths at different institutions
- Grids require a foundation of strong networking
  - Communication tools, visualization
  - High-speed data transmission, instrument operation
5. Some (Realistic) Grid Examples
- High energy physics
  - 3,000 physicists worldwide pool Petaflops of CPU resources to analyze Petabytes of data
- Fusion power (ITER, etc.)
  - Physicists quickly generate 100 CPU-years of simulations of a new magnet configuration to compare with data
- Astronomy
  - An international team remotely operates a telescope in real time
- Climate modeling
  - Climate scientists visualize, annotate, and analyze Terabytes of simulation data
- Biology
  - A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
6. Grids Enhancing Research & Learning
- Fundamentally alters the conduct of scientific research
  - Central model: people and resources flow inward to labs
  - Distributed model: knowledge flows between distributed teams
- Strengthens universities
  - Couples universities to data intensive science
  - Couples universities to national and international labs
  - Brings front-line research and resources to students
  - Exploits intellectual resources of formerly isolated schools
  - Opens new opportunities for minority and women researchers
- Builds partnerships to drive advances in IT/science/engineering
  - Application sciences ↔ Computer Science
  - Physics ↔ Astronomy, biology, etc.
  - Universities ↔ Laboratories
  - Scientists ↔ Students
  - Research community ↔ IT industry
7. Grid Challenges
- Operate a fundamentally complex entity
  - Geographically distributed resources
  - Each resource under different administrative control
  - Many failure modes
- Manage workflow across the Grid
  - Balance policy vs. instantaneous capability to complete tasks
  - Balance effective resource use vs. fast turnaround for priority jobs
  - Match resource usage to policy over the long term
  - Goal-oriented algorithms steering requests according to metrics
- Maintain a global view of resources and system state
  - Coherent end-to-end system monitoring
  - Adaptive learning for execution optimization
- Build high-level services and an integrated user environment
8. Data Grids
9. Data Intensive Science 2000-2015
- Scientific discovery increasingly driven by data collection
  - Computationally intensive analyses
  - Massive data collections
  - Data distributed across networks of varying capability
  - Internationally distributed collaborations
- Dominant factor: data growth (1 Petabyte = 1000 TB)
  - 2000: 0.5 Petabyte
  - 2005: 10 Petabytes
  - 2010: 100 Petabytes
  - 2015: 1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for Data Grids to handle the additional dimension of data access and movement.
10. Data Intensive Physical Sciences
- High energy and nuclear physics
  - Including new experiments at CERN's Large Hadron Collider
- Astronomy
  - Digital sky surveys: SDSS, VISTA, other Gigapixel arrays
  - VLBI arrays: multiple-Gbps data streams
  - Virtual Observatories (multi-wavelength astronomy)
- Gravity wave searches
  - LIGO, GEO, VIRGO, TAMA
- Time-dependent 3-D systems (simulation data)
  - Earth observation, climate modeling
  - Geophysics, earthquake modeling
  - Fluids, aerodynamic design
  - Dispersal of pollutants in atmosphere
11. Data Intensive Biology and Medicine
- Medical data
  - X-ray, mammography data, etc. (many petabytes)
  - Radiation oncology (real-time display of 3-D images)
- X-ray crystallography
  - Bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
  - Human Genome, other genome databases
  - Proteomics (protein structure, activities, ...)
  - Protein interactions, drug delivery
- Brain scans (1-10 µm, time dependent)
12. Driven by LHC Computing Challenges
- Complexity: millions of individual detector channels
- Scale: PetaOps (CPU), Petabytes (data)
- Distribution: global distribution of people and resources
1800 physicists, 150 institutes, 32 countries
13. CMS Experiment at LHC
Compact Muon Solenoid at the LHC (CERN)
[Figure: detector cutaway, with the Smithsonian "standard man" for scale]
14. LHC Data Rates: Detector to Storage
[Data-flow diagram, physics filtering in three trigger stages:]
- Detector output: 40 MHz, 1000 TB/sec
- Level 1 Trigger (special hardware): 75 KHz, 75 GB/sec
- Level 2 Trigger (commodity CPUs): 5 KHz, 5 GB/sec
- Level 3 Trigger (commodity CPUs): 100 Hz, 100-1500 MB/sec
- Raw data to storage
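As a quick sanity check on the cascade above, the short Python sketch below (using only the rates quoted on this slide) computes the rejection factor at each trigger level and the overall fraction of events that survives to storage:

```python
# Back-of-the-envelope check of the LHC trigger cascade,
# using the event rates quoted on the slide above.

stages = [
    ("Detector output", 40e6),   # 40 MHz collision rate
    ("Level 1 Trigger", 75e3),   # 75 KHz after special hardware
    ("Level 2 Trigger", 5e3),    # 5 KHz after commodity-CPU filtering
    ("Level 3 Trigger", 100.0),  # 100 Hz written to storage
]

for (_, rate_in), (name_out, rate_out) in zip(stages, stages[1:]):
    print(f"{name_out}: {rate_in / rate_out:,.0f}x rejection "
          f"({rate_in:.3g} Hz -> {rate_out:.3g} Hz)")

print(f"Overall: 1 event in {stages[0][1] / stages[-1][1]:,.0f} reaches storage")
```

Only about one collision in 400,000 is recorded; everything else must be rejected in real time.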
15. LHC Higgs Decay into 4 Muons
16. Hierarchy of LHC Data Grid Resources
[Tiered-architecture diagram for the CMS experiment; capacity ratio Tier0 : (Σ Tier1) : (Σ Tier2) ≈ 1:1:1]
- Online system → Tier 0 (CERN Computer Center, 20 TIPS): 100-1500 MBytes/s
- Tier 0 → Tier 1 centers: 10-40 Gbps
- Tier 1 → Tier 2 centers: 2.5-10 Gbps
- Tier 2 → Tier 3 (institute servers, physics cache): 1-2.5 Gbps
- Tier 3 → Tier 4 (PCs): 1-10 Gbps
- 10s of Petabytes by 2007-8; 1000 Petabytes in 5-7 years
17. Digital Astronomy
- Future dominated by detector improvements
  - Moore's Law growth in CCDs
  - Gigapixel arrays on horizon
  - Growth in CPU/storage tracking data volumes
[Chart: "Glass" vs. "MPixels"]
- Total area of 3m telescopes in the world in m²; total number of CCD pixels in Mpixels
- 25 year growth: 30x in glass, 3000x in pixels
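As a sanity check on the growth figures above, this small sketch converts the quoted 25-year factors into doubling times:

```python
import math

# Convert the 25-year growth factors quoted above into doubling times.
YEARS = 25
for name, factor in [("glass (telescope area)", 30), ("CCD pixels", 3000)]:
    doubling = YEARS / math.log2(factor)
    print(f"{name}: {factor}x over {YEARS} yr -> doubles every {doubling:.1f} yr")
```

Pixel counts double roughly every two years, i.e. they track Moore's Law, while collecting area doubles only about every five years: detectors, not glass, drive the data volume.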
18. The Age of Astronomical Mega-Surveys
- Next generation mega-surveys will change astronomy
  - Large sky coverage
  - Sound statistical plans, uniform systematics
- The technology to store and access the data is here
  - Following Moore's law
- Integrating these archives for the whole community
  - Astronomical data mining will lead to stunning new discoveries
- Virtual Observatory (next slides)
19. Virtual Observatories
Multi-wavelength astronomy, multiple surveys
20. Virtual Observatory Data Challenge
- Digital representation of the sky
  - All-sky and deep fields
  - Integrated catalog and image databases
  - Spectra of selected samples
- Size of the archived data
  - 40,000 square degrees
  - Resolution: 50 trillion pixels
  - One band (2 bytes/pixel): 100 Terabytes
  - Multi-wavelength: 500-1000 Terabytes
  - Time dimension: many Petabytes
- Large, globally distributed database engines
  - Multi-Petabyte data size, distributed widely
  - Thousands of queries per day, GByte/s I/O speed per site
- Data Grid computing infrastructure
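The quoted sizes follow from simple arithmetic; the sketch below uses only the figures on this slide, except the 5-10 band count, which is an assumption inferred from the 500-1000 TB multi-wavelength estimate:

```python
import math

SQ_DEGREES = 40_000        # sky coverage quoted above
PIXELS = 50e12             # 50 trillion pixels
BYTES_PER_PIXEL = 2        # one band, 2 bytes/pixel

# Storage for a single band
one_band_tb = PIXELS * BYTES_PER_PIXEL / 1e12
print(f"One band: {one_band_tb:.0f} TB")                     # 100 TB

# Implied pixel scale: pixels per square degree -> arcsec per pixel
arcsec_per_pixel = 3600 / math.sqrt(PIXELS / SQ_DEGREES)
print(f"Pixel scale: ~{arcsec_per_pixel:.2f} arcsec/pixel")  # ~0.1 arcsec

# Assumed 5-10 bands for multi-wavelength coverage
print(f"Multi-wavelength: {5 * one_band_tb:.0f}-{10 * one_band_tb:.0f} TB")
```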
21. Sloan Sky Survey Data Grid
22. International Grid/Networking Projects
US, EU, E. Europe, Asia, S. America, ...
23. Global Context: Data Grid Projects
- U.S. projects
  - Particle Physics Data Grid (PPDG): DOE
  - GriPhyN: NSF
  - International Virtual Data Grid Laboratory (iVDGL): NSF
  - TeraGrid: NSF
  - DOE Science Grid: DOE
  - NSF Middleware Initiative (NMI): NSF
- EU and Asia: major projects
  - European Data Grid (EU, EC)
  - LHC Computing Grid (LCG) (CERN)
  - EU national projects (UK, Italy, France, ...)
  - CrossGrid (EU, EC)
  - DataTAG (EU, EC)
  - Japanese project
  - Korea project
24. Particle Physics Data Grid
- Funded 2001-2004 at US$9.5M (DOE)
- Driven by HENP experiments: D0, BaBar, STAR, CMS, ATLAS
25. PPDG Goals
- Serve high energy and nuclear physics (HENP) experiments
  - Unique challenges, diverse test environments
- Develop advanced Grid technologies
  - Focus on end-to-end integration
  - Maintain practical orientation
  - Networks, instrumentation, monitoring
  - DB file/object replication, caching, catalogs, end-to-end movement
- Make tools general enough for a wide community
  - Collaboration with GriPhyN, iVDGL, EDG, LCG
  - ESNet Certificate Authority work, security
26. GriPhyN and iVDGL
- Both funded through NSF ITR program
  - GriPhyN: $11.9M (NSF) + $1.6M (matching), 2000-2005
  - iVDGL: $13.7M (NSF) + $2M (matching), 2001-2006
- Basic composition
  - GriPhyN: 12 funded universities, SDSC, 3 labs (80 people)
  - iVDGL: 16 funded institutions, SDSC, 3 labs (80 people)
  - Experiments: US-CMS, US-ATLAS, LIGO, SDSS/NVO
  - Large overlap of people, institutions, management
- Grid research vs. Grid deployment
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - iVDGL: Grid laboratory deployment
  - 4 physics experiments provide frontier challenges
  - VDT in common
27. GriPhyN Computer Science Challenges
- Virtual data (more later)
  - Data, programs (content), and program executions
  - Representation, discovery, and manipulation of workflows and their associated data and programs
- Planning
  - Mapping workflows in an efficient, policy-aware manner to distributed resources
- Execution
  - Executing workflows, including data movements, reliably and efficiently
- Performance
  - Monitoring system performance for scheduling and troubleshooting
28. Goal: PetaScale Virtual-Data Grids
[Architecture diagram: users (production teams, workgroups, single researchers) work through Interactive User Tools, which drive Request Execution & Management Tools, Request Planning & Scheduling Tools, and Virtual Data Tools; beneath them sit Resource Management Services, Security and Policy Services, and other Grid Services, running over distributed resources (code, storage, CPUs, networks), transforms, and raw data sources]
- Targets: PetaOps, Petabytes, Performance
29. GriPhyN/iVDGL Science Drivers
- US-CMS and US-ATLAS
  - HEP experiments at LHC/CERN
  - 100s of Petabytes
- LIGO
  - Gravity wave experiment
  - 100s of Terabytes
- Sloan Digital Sky Survey
  - Digital astronomy (1/4 of sky)
  - 10s of Terabytes
- Massive CPU
- Large, distributed datasets
- Large, distributed communities
30. Virtual Data: Derivation and Provenance
- Most scientific data are not simple measurements
  - They are computationally corrected/reconstructed
  - They can be produced by numerical simulation
- Science and engineering projects are ever more CPU and data intensive
  - Programs are significant community resources (transformations)
  - So are the executions of those programs (derivations)
- Management of dataset transformations is important!
  - Derivation: instantiation of a potential data product
  - Provenance: exact history of any existing data product
We already do this, but manually!
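To make the transformation/derivation distinction concrete, here is a minimal Python sketch of how a catalog might record them; it is illustrative only (hypothetical class and field names, not the Chimera API described on a later slide):

```python
from dataclasses import dataclass, field

@dataclass
class Transformation:
    """A registered program: a community resource."""
    name: str
    version: str

@dataclass
class Derivation:
    """One execution of a transformation: inputs -> output."""
    transformation: Transformation
    inputs: list          # logical file names consumed
    output: str           # logical file name produced
    arguments: dict = field(default_factory=dict)

def provenance(product, derivations):
    """Walk backwards from a product to every derivation behind it."""
    chain = [d for d in derivations if d.output == product]
    for d in list(chain):
        for inp in d.inputs:
            chain.extend(provenance(inp, derivations))
    return chain

# Example: a calibrated run derived from raw data
calib = Transformation("muon_calibration", "v2.1")
d1 = Derivation(calib, inputs=["raw_run_1042"], output="calib_run_1042",
                arguments={"field_map": "2003a"})
print([d.transformation.name for d in provenance("calib_run_1042", [d1])])
```

With such records in place, the questions on the next slide ("what must be recomputed?", "what corrections were applied?") become catalog queries rather than detective work.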
31. Virtual Data Motivations (1)
- "I've detected a muon calibration error and want to know which derived data products need to be recomputed."
- "I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
- "I want to search a database for 3-muon SUSY events. If a program that does this analysis exists, I won't have to write one from scratch."
- "I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."
[Diagram: a Derivation is an execution-of a Transformation; Data is consumed-by/generated-by a Derivation and a product-of a Transformation]
32. Virtual Data Motivations (2)
- Data track-ability and result audit-ability
  - Universally sought by scientific applications
- Facilitates tool and data sharing and collaboration
  - Data can be sent along with its recipe
- Repair and correction of data
  - Rebuild data products (cf. make; sketched below)
- Workflow management
  - Organizing, locating, specifying, and requesting data products
- Performance optimizations
  - Ability to re-create data rather than move it
Manual/error-prone → Automated/robust
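The "cf. make" analogy can be sketched in a few lines; the helper names below are hypothetical, purely to illustrate recipe-driven rebuilding (re-creating a product when it is absent rather than moving or hand-repairing it):

```python
import os

RECIPES = {}  # product -> (inputs, rebuild function)

def recipe(product, inputs):
    """Register the recipe that derives a product from its inputs."""
    def register(fn):
        RECIPES[product] = (inputs, fn)
        return fn
    return register

def materialize(product):
    """make-style: reuse the file if present, else rebuild from its recipe."""
    if os.path.exists(product):
        return product            # cheaper to reuse than to recompute or move
    inputs, build = RECIPES[product]
    for inp in inputs:
        materialize(inp)          # recursively satisfy dependencies
    build()
    return product

@recipe("events.dat", inputs=[])
def make_events():
    with open("events.dat", "w") as out:
        out.writelines(f"event {i}\n" for i in range(100))

@recipe("histogram.dat", inputs=["events.dat"])
def make_histogram():
    with open("events.dat") as f, open("histogram.dat", "w") as out:
        out.write(f"{sum(1 for _ in f)} events\n")

print(materialize("histogram.dat"))   # rebuilds events.dat first if needed
```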
33. Chimera Virtual Data System
- Virtual Data API
  - A Java class hierarchy to represent transformations and derivations
- Virtual Data Language
  - Textual: for people, illustrative examples
  - XML: for machine-to-machine interfaces
- Virtual Data Database
  - Makes the objects of a virtual data definition persistent
- Virtual Data Service (future)
  - Provides a service interface (e.g., OGSA) to persistent objects
- Version 1.0 available
  - To be put into VDT 1.1.7
34. Chimera Application: SDSS Analysis
[Workflow figure: galaxy cluster data → cluster size distribution]
Stack: Chimera Virtual Data System + GriPhyN Virtual Data Toolkit + iVDGL Data Grid (many CPUs)
35. Virtual Data and LHC Computing
- US-CMS
  - Chimera prototype tested with CMS MC (200K events)
  - Currently integrating Chimera into standard CMS production tools
  - Integrating virtual data into Grid-enabled analysis tools
- US-ATLAS
  - Integrating Chimera into ATLAS software
- HEPCAL document includes first virtual data use cases
  - Very basic cases, need elaboration
  - Discuss with LHC experiments: requirements, scope, technologies
- New proposal to NSF ITR program ($15M)
  - "Dynamic Workspaces for Scientific Analysis Communities"
- Continued progress requires collaboration with CS groups
  - Distributed scheduling, workflow optimization, ...
  - Need collaboration with CS to develop robust tools
36. iVDGL Goals and Context
- International Virtual-Data Grid Laboratory
  - A global Grid laboratory (US, EU, E. Europe, Asia, S. America, ...)
  - A place to conduct Data Grid tests at scale
  - A mechanism to create common Grid infrastructure
  - A laboratory for other disciplines to perform Data Grid tests
  - A focus of outreach efforts to small institutions
- Context of iVDGL in the US-LHC computing program
  - Develop and operate proto-Tier2 centers
  - Learn how to do Grid operations (GOC)
- International participation
  - DataTag
  - UK e-Science programme: supports 6 CS Fellows per year in the U.S.
37. US-iVDGL Sites (Spring 2003)
- Partners
  - EU
  - CERN
  - Brazil
  - Australia
  - Korea
  - Japan
38. US-CMS Grid Testbed
39. US-CMS Testbed Success Story
- Production run for Monte Carlo data production
  - Assigned 1.5 million events for "eGamma Bigjets"
  - 500 sec per event on a 750 MHz processor; all production stages from simulation to ntuple
  - 2 months continuous running across 5 testbed sites
- Demonstrated at Supercomputing 2002
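The quoted figures imply a substantial aggregate resource; a quick check using only the numbers above:

```python
EVENTS = 1.5e6          # assigned events
SEC_PER_EVENT = 500     # on a 750 MHz processor, all stages
RUN_MONTHS = 2          # continuous running

cpu_seconds = EVENTS * SEC_PER_EVENT
cpu_years = cpu_seconds / (365 * 24 * 3600)
avg_cpus = cpu_seconds / (RUN_MONTHS * 30 * 24 * 3600)

print(f"Total work: {cpu_years:.0f} CPU-years")              # ~24 CPU-years
print(f"Sustained: ~{avg_cpus:.0f} CPUs busy for 2 months")  # ~145 CPUs
```

Roughly 24 CPU-years delivered in two months means about 145 processors were kept continuously busy across the 5 sites.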
40. Creation of WorldGrid
- Joint iVDGL/DataTag/EDG effort
  - Resources from both sides (15 sites)
  - Monitoring tools (Ganglia, MDS, NetSaint, ...)
  - Visualization tools (Nagios, MapCenter, Ganglia)
- Applications: ScienceGrid
  - CMS: CMKIN, CMSIM
  - ATLAS: ATLSIM
- Submit jobs from US or EU
  - Jobs can run on any cluster
- Demonstrated at IST2002 (Copenhagen)
- Demonstrated at SC2002 (Baltimore)
41. WorldGrid Sites
42. Grid Coordination
43. U.S. Project Coordination: Trillium
- Trillium = GriPhyN + iVDGL + PPDG
  - Large overlap in leadership, people, experiments
  - Driven primarily by HENP, particularly the LHC experiments
- Benefits of coordination
  - Common software base and packaging: VDT + PACMAN
  - Collaborative/joint projects: monitoring, demos, security, ...
  - Wide deployment of new technologies, e.g. Virtual Data
  - Stronger, broader outreach effort
- Forum for US Grid projects
  - Joint view, strategies, meetings and work
  - Unified entity to deal with EU and other Grid projects
44. International Grid Coordination
- Global Grid Forum (GGF)
  - International forum for general Grid efforts
  - Many working groups, standards definitions
- Close collaboration with EU DataGrid (EDG)
  - Many connections with EDG activities
- HICB: HEP Inter-Grid Coordination Board
  - Non-competitive forum, strategic issues, consensus
  - Cross-project policies, procedures and technology; joint projects
- HICB-JTB: Joint Technical Board
  - Definition, oversight and tracking of joint projects
  - GLUE interoperability group
- Participation in LHC Computing Grid (LCG)
  - Software & Computing Committee (SC2)
  - Project Execution Board (PEB)
  - Grid Deployment Board (GDB)
45. HEP and International Grid Projects
- HEP continues to be the strongest science driver
  - (In collaboration with computer scientists)
  - Many national and international initiatives
  - LHC a particularly strong driving function
- US-HEP committed to working with international partners
  - Many networking initiatives with EU colleagues
  - Collaboration on LHC Grid Project
  - Grid projects driving and linked to network developments
  - DataTag, SCIC, US-CERN link, Internet2
- New partners being actively sought
  - Korea, Russia, China, Japan, Brazil, Romania, ...
  - Participate in US-CMS and US-ATLAS Grid testbeds
  - Link to WorldGrid, once some software is fixed
46. New Grid Efforts
47. An Inter-Regional Center for High Energy Physics Research and Educational Outreach (CHEPREO) at Florida International University
- Status
  - Proposal submitted Dec. 2002
  - Presented to NSF review panel
  - Project Execution Plan submitted
  - Funding in June?
- E/O Center in Miami area
- iVDGL Grid activities
- CMS research
- AMPATH network (S. America)
- International activities (Brazil, etc.)
48. A Global Grid-Enabled Collaboratory for Scientific Research (GECSR)
- $4M ITR proposal from
  - Caltech (HN PI, JB Co-PI)
  - Michigan (Co-PI, Co-PI)
  - Maryland (Co-PI)
- Plus senior personnel from
  - Lawrence Berkeley Lab
  - Oklahoma
  - Fermilab
  - Arlington (U. Texas)
  - Iowa
  - Florida State
- First Grid-enabled Collaboratory
- Tight integration between
  - Science of Collaboratories
  - Globally scalable work environment
  - Sophisticated collaborative tools (VRVS, VNC, next-gen)
  - Agent-based monitoring and decision support system (MonALISA)
- Initial targets are the global HENP collaborations, but GECSR is expected to be widely applicable to other large-scale collaborative scientific endeavors
- Gives scientists from all world regions the means to function as full partners in the process of search and discovery
49. Large ITR Proposal ($15M)
"Dynamic Workspaces: Enabling Global Analysis Communities"
50. UltraLight Proposal to NSF
- 10 Gb/s network
  - Caltech, UF, FIU, UM, MIT
  - SLAC, FNAL
  - International partners
  - Cisco
- Applications
  - HEP
  - VLBI
  - Radiation oncology
  - Grid projects
51. GLORIAD
- New 10 Gb/s network linking US-Russia-China
- Plus Grid component linking science projects
- H. Newman, P. Avery participating
- Meeting at NSF April 14 with US-Russia-China reps
  - HEP people (Hesheng, et al.)
  - Broad agreement that HEP can drive the Grid portion
  - More meetings planned
52. Summary
- Progress on many fronts in PPDG/GriPhyN/iVDGL
  - Packaging: Pacman + VDT
  - Testbeds (development and production)
  - Major demonstration projects
  - Productions based on Grid tools using iVDGL resources
  - WorldGrid providing excellent experience
- Excellent collaboration with EU partners
  - Building links to our Asian and other partners
  - Excellent opportunity to build lasting infrastructure
- Looking to collaborate with more international partners
  - Testbeds, monitoring, deploying VDT more widely
- New directions
  - Virtual data: a powerful paradigm for LHC computing
  - Emphasis on Grid-enabled analysis
53. Grid References
- Grid Book: www.mkp.com/grids
- Globus: www.globus.org
- Global Grid Forum: www.gridforum.org
- PPDG: www.ppdg.net
- GriPhyN: www.griphyn.org
- iVDGL: www.ivdgl.org
- TeraGrid: www.teragrid.org
- EU DataGrid: www.eu-datagrid.org