Title: Global Data Grids for 21st Century Science
1. Global Data Grids for 21st Century Science
Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/
avery@phys.ufl.edu
Florida International University, Sept. 12, 2002
2. The Grid Concept
- Grid: Geographically distributed computing resources configured for coordinated use
- Fabric: Physical resources and networks provide raw capability
- Middleware: Software ties it all together (tools, services, etc.)
- Goal: Transparent resource sharing
3. Fundamental Idea: Resource Sharing
- Resources for complex problems are distributed
- Advanced scientific instruments (accelerators, telescopes, …)
- Storage and computing
- Groups of people
- Communities require access to common services
- Scientific collaborations (physics, astronomy, biology, engineering, …)
- Government agencies
- Health care organizations, large corporations, …
- Goal is to build Virtual Organizations
- Make all community resources available to any VO member
- Leverage strengths at different institutions
- Add people and resources dynamically
4. What Are Grids Good For?
- Climate modeling
- Climate scientists visualize, annotate, and analyze Terabytes of simulation data
- Biology
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- High energy physics
- 3,000 physicists worldwide pool Petaflops (1M GigaFlops) of CPU resources to analyze Petabytes of data
- Engineering
- Civil engineers collaborate to design, execute, and analyze shake table experiments
- A multidisciplinary analysis in aerospace couples code and data in four companies to design a new airframe
From Ian Foster
5. What Are Grids Good For?
- Application Service Providers
- A home user invokes architectural design functions at an application service provider, which purchases computing cycles from cycle providers
- Commercial
- Scientists at a multinational toy company design a new product
- Cities, communities
- An emergency response team couples real-time data, weather model, population data
- A community group pools members' PCs to analyze alternative designs for a local road
- Health
- Hospitals and international agencies collaborate on stemming a major disease outbreak
From Ian Foster
6. Proto-Grid: SETI@home
- Community: SETI researchers and enthusiasts
- Arecibo radio data sent to users (250 KB data chunks)
- Over 2M PCs used
7. More Advanced Proto-Grid: Evaluation of AIDS Drugs
- Community
- Research group (Scripps)
- 1000s of PC owners
- Vendor (Entropia)
- Common goal
- Drug design
- Advance AIDS research
8. Grids: Why Now?
- Moore's law improvements in computing
- Highly functional end systems
- Universal wired and wireless Internet connections
- Universal connectivity
- Changing modes of working and problem solving
- Teamwork, computation
- Network exponentials
- (Next slide)
9. Network Exponentials and Collaboration
- Network vs. computer performance
- Computer speed doubles every 18 months
- Network speed doubles every 12 months (revised)
- Difference: an order of magnitude per 10 years
- Other factor: network connectivity
- 1986 to 2001
- Computers: × 1,000
- Networks: × 50,000
- 2001 to 2010?
- Computers: × 60
- Networks: × 500
Scientific American (Jan-2001)
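The "order of magnitude per 10 years" claim follows directly from the two doubling times on this slide; a quick sketch of the arithmetic (the 18-month and 12-month figures are from the slide, everything else is illustration):

```python
# Multiplicative growth factor after `years`, given a doubling time in months.
def growth(years: float, doubling_months: float) -> float:
    return 2 ** (years * 12 / doubling_months)

computers = growth(10, 18)    # ~100x per decade (2^6.67)
networks = growth(10, 12)     # 1024x per decade (2^10)
print(networks / computers)   # ~10x: the order-of-magnitude gap per 10 years
```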
10. Grid Challenges
- Overall goal: Coordinated sharing of resources
- Technical problems to overcome
- Authentication, authorization, policy, auditing
- Resource discovery, access, allocation, control
- Failure detection and recovery
- Resource brokering
- Additional issue: lack of central control and knowledge
- Preservation of local site autonomy
- Policy discovery and negotiation important
11. Layered Grid Architecture (Analogy to Internet Architecture)
- User: Specialized services; application-specific distributed services
- Collective: Managing multiple resources; ubiquitous infrastructure services
- Resource: Sharing single resources; negotiating access, controlling use
- Connectivity: Talking to things; communications, security
- Fabric: Controlling things locally; accessing, controlling resources
From Ian Foster
12. Globus Project and Toolkit
- Globus Project (Argonne and USC/ISI)
- O(40) researchers and developers
- Identify and define core protocols and services
- Globus Toolkit 2.0
- A major product of the Globus Project
- Reference implementation of core protocols and services
- Growing open-source developer community
- Globus Toolkit used by all Data Grid projects today
- US: GriPhyN, PPDG, TeraGrid, iVDGL
- EU: EU-DataGrid and national projects
- Recent announcement of applying web services to Grids
- Keeps Grids in the commercial mainstream
- GT 3.0
13. Globus General Approach
- Define Grid protocols and APIs
- Protocol-mediated access to remote resources
- Integrate and extend existing standards
- Develop reference implementation
- Open-source Globus Toolkit
- Client and server SDKs, services, tools, etc.
- Grid-enable a wide variety of tools
- FTP, SSH, Condor, SRB, MPI, …
- Learn about real-world problems
- Deployment
- Testing
- Applications
(Diagram: Applications, over diverse global services, over core services, over diverse resources)
14. Data Grids
15. Data Intensive Science 2000-2015
- Scientific discovery increasingly driven by IT
- Computationally intensive analyses
- Massive data collections
- Data distributed across networks of varying capability
- Geographically distributed collaboration
- Dominant factor: data growth (1 Petabyte = 1000 TB)
- 2000: 0.5 Petabyte
- 2005: 10 Petabytes
- 2010: 100 Petabytes
- 2015: 1000 Petabytes?
How to collect, manage, access and interpret this quantity of data?
Drives demand for Data Grids to handle the additional dimension of data access and movement
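The projection above, from 0.5 Petabyte in 2000 to roughly 1000 Petabytes in 2015, implies a steep compound growth rate; a quick check of the implied figure (numbers from the slide):

```python
# Implied compound annual growth behind the slide's projection.
start_pb, end_pb, years = 0.5, 1000.0, 15
factor = (end_pb / start_pb) ** (1 / years)    # annual growth factor
print(f"~{(factor - 1) * 100:.0f}% per year")  # prints ~66% per year
```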
16. Data Intensive Physical Sciences
- High energy and nuclear physics
- Including new experiments at CERN's Large Hadron Collider
- Gravity wave searches
- LIGO, GEO, VIRGO
- Astronomy: digital sky surveys
- Sloan Digital Sky Survey, VISTA, other Gigapixel arrays
- Virtual Observatories (multi-wavelength astronomy)
- Time-dependent 3-D systems (simulation data)
- Earth observation, climate modeling
- Geophysics, earthquake modeling
- Fluids, aerodynamic design
- Pollutant dispersal scenarios
17. Data Intensive Biology and Medicine
- Medical data
- X-ray, mammography data, etc. (many petabytes)
- Digitizing patient records (ditto)
- X-ray crystallography
- Bright X-ray sources, e.g. Argonne Advanced Photon Source
- Molecular genomics and related disciplines
- Human Genome, other genome databases
- Proteomics (protein structure, activities, …)
- Protein interactions, drug delivery
- Brain scans (3-D, time dependent)
- Virtual Population Laboratory (proposed)
- Database of populations, geography, transportation corridors
- Simulate likely spread of disease outbreaks
Craig Venter keynote @ SC2001
18. Example: High Energy Physics
Compact Muon Solenoid at the LHC (CERN)
(Smithsonian standard man, for scale)
19. LHC Computing Challenges
- Complexity of LHC interaction environment and resulting data
- Scale: Petabytes of data per year (100 PB by 2010-12)
- Global distribution of people and resources: 1800 physicists, 150 institutes, 32 countries
20Global LHC Data Grid
Tier0 CERNTier1 National LabTier2 Regional
Center (University, etc.)Tier3 University
workgroupTier4 Workstation
- Key ideas
- Hierarchical structure
- Tier2 centers
21. Example: Global LHC Data Grid
Experiment (e.g., CMS); Tier0 / (all Tier1) / (all Tier2) capacity roughly 1:1:1
- Online System → Tier 0 (CERN Computer Center, > 20 TIPS) at 100 MBytes/sec
- Tier 0 → Tier 1 (France, Italy, UK, USA) at 2.5 Gbits/sec
- Tier 1 → Tier 2 at 2.5 Gbits/sec
- Tier 2 → Tier 3 (institute workgroups, ~0.25 TIPS each) at 0.6 Gbits/sec
- Tier 3 → Tier 4 (physics data caches, PCs, other portals) at 0.1-1 Gbits/sec
22. Sloan Digital Sky Survey Data Grid
23. LIGO (Gravity Wave) Data Grid
(Diagram: the Hanford and Livingston observatories connect over OC3 and OC12 links to Caltech (Tier1), which connects to MIT over OC48 links)
24. Data Grid Projects
25. Data Grid Projects
- Particle Physics Data Grid (US, DOE)
- Data Grid applications for HENP experiments
- GriPhyN (US, NSF)
- Petascale Virtual-Data Grids
- iVDGL (US, NSF)
- Global Grid lab
- TeraGrid (US, NSF)
- Distributed supercomputing resources (13 TFlops)
- European Data Grid (EU, EC)
- Data Grid technologies, EU deployment
- CrossGrid (EU, EC)
- Data Grid technologies, EU emphasis
- DataTAG (EU, EC)
- Transatlantic network, Grid applications
- Japanese Grid Projects (APGrid?) (Japan)
- Grid deployment throughout Japan
- Collaborations of application scientists and computer scientists
- Infrastructure development and deployment
- Globus based
26. GriPhyN = App. Science + CS + Grids
- GriPhyN: Grid Physics Network
- US-CMS: High Energy Physics
- US-ATLAS: High Energy Physics
- LIGO/LSC: Gravity wave research
- SDSS: Sloan Digital Sky Survey
- Strong partnership with computer scientists
- Design and implement production-scale grids
- Develop common infrastructure, tools and services
- Integration into the 4 experiments
- Broad application to other sciences via Virtual Data Toolkit
- Strong outreach program
- Multi-year project
- R&D for grid architecture (funded at $11.9M + $1.6M)
- Integrate Grid infrastructure into experiments through VDT
27. GriPhyN Institutions
- UC San Diego
- San Diego Supercomputer Center
- Lawrence Berkeley Lab
- Argonne
- Fermilab
- Brookhaven
- U Florida
- U Chicago
- Boston U
- Caltech
- U Wisconsin, Madison
- USC/ISI
- Harvard
- Indiana
- Johns Hopkins
- Northwestern
- Stanford
- U Illinois at Chicago
- U Penn
- U Texas, Brownsville
- U Wisconsin, Milwaukee
- UC Berkeley
28. GriPhyN: PetaScale Virtual-Data Grids
(Diagram: production teams, individual investigators, and workgroups access a ~1 Petaflop, ~100 Petabyte grid through interactive user tools for request planning and request execution; these build on virtual data tools, request planning and scheduling tools, and request execution management tools, which rely on resource management services, security and policy services, and other grid services, all running over distributed resources (code, storage, CPUs, networks), transforms, and raw data sources)
29. GriPhyN Research Agenda
- Virtual Data technologies (fig.)
- Derived data, calculable via algorithm
- Instantiated 0, 1, or many times (e.g., caches)
- Fetch value vs. execute algorithm
- Potentially complex (versions, consistency, cost calculation, etc.)
- LIGO example
- Get gravitational strain for 2 minutes around each of 200 gamma-ray bursts over the last year
- For each requested data value, need to
- Locate item location and algorithm
- Determine costs of fetching vs. calculating
- Plan data movements and computations required to obtain results
- Execute the plan
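The fetch-vs-calculate decision above amounts to a cost comparison per requested item; a toy sketch of that planning step (this is purely illustrative, not the actual GriPhyN planner; all names and the time-based cost model are made up):

```python
# Hypothetical request planner: for each requested data item, compare the
# cost of fetching an existing replica with the cost of re-deriving it.
from dataclasses import dataclass

@dataclass
class Replica:
    url: str        # physical location of a cached copy
    size_gb: float  # transfer size

def fetch_cost(replica: Replica, bandwidth_gb_s: float) -> float:
    """Seconds to transfer the replica at the given bandwidth."""
    return replica.size_gb / bandwidth_gb_s

def plan(replicas, compute_cost_s, bandwidth_gb_s):
    """Return ('fetch', replica) or ('compute', None), whichever is cheaper."""
    if replicas:
        best = min(replicas, key=lambda r: fetch_cost(r, bandwidth_gb_s))
        if fetch_cost(best, bandwidth_gb_s) < compute_cost_s:
            return ("fetch", best)
    return ("compute", None)

# A cached 10 GB product over a 0.25 GB/s link (40 s) beats a 120 s recomputation.
action, choice = plan([Replica("gsiftp://tier2.example/strain.dat", 10.0)], 120.0, 0.25)
```

With no replica available (or a slow enough link) the same call falls through to `("compute", None)`.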
30. Virtual Data in Action
- A data request may
- Compute locally
- Compute remotely
- Access local data
- Access remote data
- Scheduling based on
- Local policies
- Global policies
- Cost
(Diagram: major facilities and archives, regional facilities and caches, local facilities and caches)
31. Chimera Virtual Data System
- Virtual data language
- Transformations, derivations, data
- Virtual data catalog
- Persistent definitions
- Query capability
- Data production and analysis applications
32. Transformations and Derivations
- Transformation
- Abstract template of program invocation
- Similar to "function definition"
- Derivation
- Formal invocation of a Transformation
- Similar to "function call"
- Store past and future
- A record of how data products were generated
- A recipe of how data products can be generated
- Invocation (future)
- Record of each Derivation execution
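The definition/call analogy above maps naturally onto code; a hypothetical sketch of the two notions (the class and field names are illustrative, not Chimera's actual virtual data language):

```python
# A Transformation is a reusable template ("function definition"); a
# Derivation binds it to concrete arguments ("function call") and serves as
# both a record of how a product was made and a recipe for remaking it.
from dataclasses import dataclass

@dataclass
class Transformation:
    name: str
    executable: str
    arg_names: list

@dataclass
class Derivation:
    transformation: Transformation
    args: dict

    def command(self) -> str:
        """Reconstruct the concrete invocation from template + bound args."""
        t = self.transformation
        return f"{t.executable} " + " ".join(str(self.args[a]) for a in t.arg_names)

extract = Transformation("extract_strain", "/bin/extract", ["start", "duration"])
d = Derivation(extract, {"start": "2002-09-12T00:00", "duration": 120})
print(d.command())
```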
33. GriPhyN Research Agenda (cont.)
- Execution management
- Co-allocation of resources (CPU, storage, network transfers)
- Fault tolerance, error reporting
- Interaction, feedback to planning
- Performance analysis (with PPDG)
- Instrumentation and measurement of all grid components
- Understand and optimize grid performance
- Virtual Data Toolkit (VDT)
- VDT = virtual data services + virtual data tools
- One of the primary deliverables of the R&D effort
- Technology transfer mechanism to other scientific domains
34. GriPhyN/PPDG Data Grid Architecture
(Diagram: an Application hands a DAG to a Planner, which uses Catalog Services, Monitoring, and Info Services; an Executor runs the DAG through a Reliable Transfer Service and Replica Management against Compute and Storage Resources, all subject to Policy/Security. An initial solution is operational.)
35. Catalog Architecture
Transparency with respect to location.
Metadata Catalog (Name → Logical Object Name):
- X → logO1
- Y → logO2
- F.X → logO3
- G(1).Y → logO4
Object names map (via GCMS) to a Logical Container Name.
Replica Catalog (Logical Container Name → PFNs):
- logC1 → URL1
- logC2 → URL2, URL3
- logC3 → URL4
- logC4 → URL5, URL6
The URLs give the physical file locations in physical file storage.
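The two-level lookup can be sketched with the slide's own example entries; note the object-to-container mapping in the middle is an assumption for illustration, since the slide only shows the two endpoints of the chain:

```python
# Metadata catalog: application-level name -> logical object name (from the slide).
metadata_catalog = {"X": "logO1", "Y": "logO2", "F.X": "logO3", "G(1).Y": "logO4"}

# Assumed object -> logical-container mapping (not shown on the slide).
object_to_container = {"logO1": "logC1", "logO2": "logC2",
                       "logO3": "logC3", "logO4": "logC4"}

# Replica catalog: logical container name -> physical file URLs (from the slide).
replica_catalog = {"logC1": ["URL1"], "logC2": ["URL2", "URL3"],
                   "logC3": ["URL4"], "logC4": ["URL5", "URL6"]}

def resolve(name: str) -> list:
    """Name -> logical object -> logical container -> physical URLs."""
    lobj = metadata_catalog[name]
    lcn = object_to_container[lobj]
    return replica_catalog[lcn]

print(resolve("Y"))  # a replicated container resolves to multiple URLs
```

The point of the indirection is the location transparency named at the top of the slide: applications ask for "Y" and never see which URL serves it.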
36. iVDGL: A Global Grid Laboratory
"We propose to create, operate and evaluate, over a sustained period of time, an international research laboratory for data-intensive science." (From NSF proposal, 2001)
- International Virtual-Data Grid Laboratory
- A global Grid laboratory (US, EU, South America, Asia, …)
- A place to conduct Data Grid tests at scale
- A mechanism to create common Grid infrastructure
- A facility to perform production exercises for LHC experiments
- A laboratory for other disciplines to perform Data Grid tests
- A focus of outreach efforts to small institutions
- Funded for $13.65M by NSF
37. iVDGL Components
- Computing resources
- Tier1, Tier2, Tier3 sites
- Networks
- USA (TeraGrid, Internet2, ESNET), Europe (Géant, …)
- Transatlantic (DataTAG), Transpacific, AMPATH, …
- Grid Operations Center (GOC)
- Indiana (2 people)
- Joint work with TeraGrid on GOC development
- Computer Science support teams
- Support, test, upgrade GriPhyN Virtual Data Toolkit
- Outreach effort
- Integrated with GriPhyN
- Coordination, interoperability
38. Current iVDGL Participants
- Initial experiments (funded by NSF proposal)
- CMS, ATLAS, LIGO, SDSS, NVO
- U.S. universities and laboratories
- (Next slide)
- Partners
- TeraGrid
- EU DataGrid and EU national projects
- Japan (AIST, TITECH)
- Australia
- Complementary EU project: DataTAG
- 2.5 Gb/s transatlantic network
39. Initial U.S. iVDGL Participants
- U Florida: CMS
- Caltech: CMS, LIGO
- UC San Diego: CMS, CS
- Indiana U: ATLAS, GOC
- Boston U: ATLAS
- U Wisconsin, Milwaukee: LIGO
- Penn State: LIGO
- Johns Hopkins: SDSS, NVO
- U Chicago/Argonne: CS
- U Southern California: CS
- U Wisconsin, Madison: CS
- Salish Kootenai: Outreach, LIGO
- Hampton U: Outreach, ATLAS
- U Texas, Brownsville: Outreach, LIGO
- Fermilab: CMS, SDSS, NVO
- Brookhaven: ATLAS
- Argonne Lab: ATLAS, CS
(Legend: T2 / Software, CS support, T3 / Outreach, T1 / Labs (funded elsewhere))
40. Possible Participant: TeraGrid (13 TeraFlops, 40 Gb/s)
(Diagram: four sites, NCSA/PACI (8 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, and Argonne, each with site resources such as HPSS or UniTree archival storage and external network connections, joined by a 40 Gb/s backbone)
41. US-iVDGL Data Grid (Dec. 2002)
SKC
Boston U
Wisconsin
Michigan
PSU
BNL
Fermilab
LBL
Argonne
J. Hopkins
NCSA
Indiana
Hampton
Caltech
Oklahoma
Vanderbilt
UCSD/SDSC
FSU
Arlington
UF
Plus other sites in 2002
FIU
Brownsville
42. iVDGL Map (2002-2003)
Surfnet
DataTAG
- New partners
- Brazil T1
- Russia T1
- Chile T2
- Pakistan T2
- China T2
- Romania ?
43. FIU Participation in iVDGL
- Immediate participation in GriPhyN-iVDGL outreach effort
- Extend outreach effort to new participants
- iVDGL/GriPhyN outreach leaders enthusiastic about this idea
- Connections to South America: AMPATH
- HEP in Brazil, etc.
- New astronomy projects for iVDGL?
- Outreach to South America?
- Connections with Florida neighbors UF and FSU
- Major CMS leadership at UF and FSU
- Extend the UF-FSU CMS collaboration to UF-FSU-FIU
- Extend the FSU-FIU nuclear physics connection to iVDGL
- Connections with Caltech CMS and Grid projects
- H. Newman actively developing new Grid collaborations with several countries (Brazil, Romania, Pakistan, etc.)
44. iVDGL Management and Coordination
(Org chart: the US Project Directors, advised by the US External Advisory Committee and US Project Steering Group, join the international piece and Collaborating Grid Projects in a Project Coordination Group, which oversees the Facilities, Core Software, Operations, Applications, GLUE Interoperability, and Outreach teams; FIU participates in the Facilities, Core Software, Applications, and Outreach teams)
45. Need for Common Grid Infrastructure
- Grid computing sometimes compared to the electric grid
- You plug in to get a resource (CPU, storage, …)
- You don't care where the resource is located
- This analogy is more appropriate than originally intended
- It expresses a USA viewpoint: a uniform power grid
- What happens when you travel around the world?
Different frequencies (60 Hz, 50 Hz), different voltages (120 V, 220 V), different sockets! (USA 2-pin, France, UK, etc.)
Want to avoid this situation in Grid computing
46. Role of Grid Infrastructure
- Provide essential common Grid services
- Cannot afford to develop separate infrastructures (manpower, timing, immediate needs, etc.)
- Meet needs of high-end scientific and engineering collaborations
- HENP, astrophysics, GVO, earthquake, climate, space, biology, …
- Already international and even global in scope
- Drive future requirements
- Be broadly applicable outside science
- Government agencies: national, regional (EU), UN
- Non-governmental organizations (NGOs)
- Corporations, business networks (e.g., suppliers, R&D)
- Other virtual organizations (see "Anatomy of the Grid")
- Be scalable to the global level
47. Coordination of U.S. Grid Projects
- Three closely coordinated U.S. projects
- PPDG: HENP experiments, short-term tools, deployment
- GriPhyN: Data Grid research, Virtual Data, VDT deliverable
- iVDGL: Global Grid laboratory
- Coordination of PPDG, GriPhyN, iVDGL
- Common experiments, personnel, management integration
- iVDGL as joint PPDG and GriPhyN laboratory
- Joint meetings (Jan. 2002, April 2002, Sept. 2002)
- Joint architecture creation (GriPhyN, PPDG)
- Adoption of VDT as common core Grid infrastructure
- Common Outreach effort (GriPhyN and iVDGL)
- New TeraGrid project (Aug. 2001)
- 13 TFlops across 4 sites, 40 Gb/s networking
- Aim to integrate into iVDGL, adopt VDT, common Outreach
48. Grid Coordination Efforts
- Global Grid Forum (GGF)
- www.gridforum.org
- International forum for general Grid efforts
- Many working groups, standards definitions
- Next one in Toronto, Feb. 17-20
- HICB (High energy physics)
- Represents HEP collaborations, primarily LHC experiments
- Joint development and deployment of Data Grid middleware
- GriPhyN, PPDG, TeraGrid, iVDGL, EU-DataGrid, LCG, DataTAG, CrossGrid
- Common testbed, open source software model
- Several meetings so far
- New infrastructure Data Grid projects?
- Fold into existing Grid landscape (primarily US and EU)
49. Worldwide Grid Coordination
- Two major clusters of physics Grid projects
- US based: GriPhyN Virtual Data Toolkit (VDT)
- EU based: different packaging of similar components
- MAGIC coordination workshop in Chicago in August
- Organized by NSF and DOE
- Final report in a few weeks
- Determine Grid coordination strategy over a broad range
- Many activities
50. Summary
- Data Grids will qualitatively and quantitatively change the nature of collaborations and approaches to computing
- The iVDGL will provide vast experience for new collaborations
- Many challenges during the coming transition
- New grid projects will provide rich experience and lessons
- Difficult to predict the situation even 3-5 years ahead
51. Grid References
- Grid Book
- www.mkp.com/grids
- Globus
- www.globus.org
- Global Grid Forum
- www.gridforum.org
- TeraGrid
- www.teragrid.org
- EU DataGrid
- www.eu-datagrid.org
- PPDG
- www.ppdg.net
- GriPhyN
- www.griphyn.org
- iVDGL
- www.ivdgl.org