Global Data Grids for 21st Century Science

1
  • Global Data Grids for 21st Century Science

Paul Avery, University of Florida
http://www.phys.ufl.edu/avery/  avery@phys.ufl.edu
Florida International University, Sept. 12, 2002
2
The Grid Concept
  • Grid: Geographically distributed computing
    resources configured for coordinated use
  • Fabric: Physical resources & networks provide
    raw capability
  • Middleware: Software ties it all together (tools,
    services, etc.)
  • Goal: Transparent resource sharing

3
Fundamental Idea: Resource Sharing
  • Resources for complex problems are distributed
  • Advanced scientific instruments (accelerators,
    telescopes, ...)
  • Storage and computing
  • Groups of people
  • Communities require access to common services
  • Scientific collaborations (physics, astronomy,
    biology, eng., ...)
  • Government agencies
  • Health care organizations, large corporations, ...
  • Goal is to build Virtual Organizations
  • Make all community resources available to any VO
    member
  • Leverage strengths at different institutions
  • Add people & resources dynamically

4
What Are Grids Good For?
  • Climate modeling
  • Climate scientists visualize, annotate, analyze
    Terabytes of simulation data
  • Biology
  • A biochemist exploits 10,000 computers to screen
    100,000 compounds in an hour
  • High energy physics
  • 3,000 physicists worldwide pool Petaflops (1M
    GigaFlops) of CPU resources to analyze Petabytes
    of data
  • Engineering
  • Civil engineers collaborate to design, execute,
    analyze shake table experiments
  • A multidisciplinary analysis in aerospace couples
    code and data in four companies to design a new
    airframe

From Ian Foster
5
What Are Grids Good For?
  • Application Service Providers
  • A home user invokes architectural design
    functions at an application service provider
  • which purchases computing cycles from cycle
    providers
  • Commercial
  • Scientists at a multinational toy company design
    a new product
  • Cities, communities
  • An emergency response team couples real time
    data, weather model, population data
  • A community group pools members' PCs to analyze
    alternative designs for a local road
  • Health
  • Hospitals and international agencies collaborate
    on stemming a major disease outbreak

From Ian Foster
6
Proto-Grid: SETI@home
  • Community: SETI researchers & enthusiasts
  • Arecibo radio data sent to users (250KB data
    chunks)
  • Over 2M PCs used

7
More Advanced Proto-Grid: Evaluation of AIDS Drugs
  • Community
  • Research group (Scripps)
  • 1000s of PC owners
  • Vendor (Entropia)
  • Common goal
  • Drug design
  • Advance AIDS research

8
Grids: Why Now?
  • Moore's law improvements in computing
  • Highly functional end systems
  • Universal wired and wireless Internet connections
  • Universal connectivity
  • Changing modes of working and problem solving
  • Teamwork, computation
  • Network exponentials
  • (Next slide)

9
Network Exponentials & Collaboration
  • Network vs. computer performance
  • Computer speed doubles every 18 months
  • Network speed doubles every 12 months (revised)
  • Difference: order of magnitude per 10 years
  • Other factor: network connectivity
  • 1986 to 2001
  • Computers: × 1,000
  • Networks: × 50,000
  • 2001 to 2010?
  • Computers: × 60
  • Networks: × 500

Scientific American (Jan-2001)
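
The growth factors on this slide follow from the two doubling times. A minimal sketch of the arithmetic, assuming constant doubling times of 18 months (computers) and 12 months (networks) as quoted above; the function and constant names are illustrative only:

```python
# Growth implied by a fixed doubling time (assumption: the doubling
# times quoted on the slide hold constant over the whole period).

COMPUTER_DOUBLING_MONTHS = 18.0  # from the slide
NETWORK_DOUBLING_MONTHS = 12.0   # from the slide ("revised")


def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years` with the given doubling time."""
    return 2.0 ** (years * 12.0 / doubling_months)


def gap_per_decade() -> float:
    """How much faster networks grow than computers over 10 years."""
    return (growth_factor(10, NETWORK_DOUBLING_MONTHS)
            / growth_factor(10, COMPUTER_DOUBLING_MONTHS))


if __name__ == "__main__":
    for label, years in (("1986 to 2001", 15), ("2001 to 2010", 9)):
        computers = growth_factor(years, COMPUTER_DOUBLING_MONTHS)
        networks = growth_factor(years, NETWORK_DOUBLING_MONTHS)
        print(f"{label}: computers x{computers:,.0f}, networks x{networks:,.0f}")
    print(f"network/computer gap per decade: x{gap_per_decade():.1f}")
```

With these assumptions the computer figures (about 1,000 and 60) and the 2001 to 2010 network figure (about 500) are reproduced. The historical network factor of 50,000 for 1986 to 2001 is larger than a 12-month doubling gives, so it implies a doubling time somewhat under 12 months for that period.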
10
Grid Challenges
  • Overall goal: Coordinated sharing of resources
  • Technical problems to overcome
  • Authentication, authorization, policy, auditing
  • Resource discovery, access, allocation, control
  • Failure detection & recovery
  • Resource brokering
  • Additional issue: lack of central control &
    knowledge
  • Preservation of local site autonomy
  • Policy discovery and negotiation important

11
Layered Grid Architecture (Analogy to Internet
Architecture)
  • User: Specialized services (app.-specific
    distributed services)
  • Collective: Managing multiple resources
    (ubiquitous infrastructure services)
  • Resource: Sharing single resources (negotiating
    access, controlling use)
  • Connectivity: Talking to things (communications,
    security)
  • Fabric: Controlling things locally (accessing,
    controlling resources)

From Ian Foster
12
Globus Project and Toolkit
  • Globus Project (Argonne & USC/ISI)
  • O(40) researchers & developers
  • Identify and define core protocols and services
  • Globus Toolkit 2.0
  • A major product of the Globus Project
  • Reference implementation of core protocols &
    services
  • Growing open source developer community
  • Globus Toolkit used by all Data Grid projects
    today
  • US: GriPhyN, PPDG, TeraGrid, iVDGL
  • EU: EU-DataGrid and national projects
  • Recent announcement of applying web services to
    Grids
  • Keeps Grids in the commercial mainstream
  • GT 3.0

13
Globus: General Approach
  • Define Grid protocols & APIs
  • Protocol-mediated access to remote resources
  • Integrate and extend existing standards
  • Develop reference implementation
  • Open source Globus Toolkit
  • Client & server SDKs, services, tools, etc.
  • Grid-enable wide variety of tools
  • Globus Toolkit
  • FTP, SSH, Condor, SRB, MPI, ...
  • Learn about real world problems
  • Deployment
  • Testing
  • Applications

[Figure: Applications on top of diverse global
services, core services, and diverse resources]
14
Data Grids
15
Data Intensive Science: 2000-2015
  • Scientific discovery increasingly driven by IT
  • Computationally intensive analyses
  • Massive data collections
  • Data distributed across networks of varying
    capability
  • Geographically distributed collaboration
  • Dominant factor: data growth (1 Petabyte = 1000
    TB)
  • 2000: 0.5 Petabyte
  • 2005: 10 Petabytes
  • 2010: 100 Petabytes
  • 2015: 1000 Petabytes?

How to collect, manage, access and interpret
this quantity of data?
Drives demand for Data Grids to handle the
additional dimension of data access &
movement
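
For comparison with the doubling times usually quoted for computing and networking, the data volumes on this slide can be turned into an implied doubling time. A rough sketch, using only the four data points listed above; the helper name is hypothetical:

```python
import math

# Data volumes from the slide, in Petabytes (the 2015 value is marked
# with a question mark on the slide, so treat it as a projection).
DATA_PB = {2000: 0.5, 2005: 10.0, 2010: 100.0, 2015: 1000.0}


def doubling_time_months(year0: int, year1: int) -> float:
    """Doubling time implied by exponential growth between two years."""
    ratio = DATA_PB[year1] / DATA_PB[year0]
    return 12.0 * (year1 - year0) / math.log2(ratio)


if __name__ == "__main__":
    years = sorted(DATA_PB)
    for y0, y1 in zip(years, years[1:]):
        print(f"{y0}-{y1}: doubles every {doubling_time_months(y0, y1):.1f} months")
    print(f"2000-2015 overall: {doubling_time_months(2000, 2015):.1f} months")
```

Over the full 2000 to 2015 span the implied doubling time is a little over 16 months.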
16
Data Intensive Physical Sciences
  • High energy & nuclear physics
  • Including new experiments at CERN's Large Hadron
    Collider
  • Gravity wave searches
  • LIGO, GEO, VIRGO
  • Astronomy: Digital sky surveys
  • Sloan Digital Sky Survey, VISTA, other Gigapixel
    arrays
  • Virtual Observatories (multi-wavelength
    astronomy)
  • Time-dependent 3-D systems (simulation data)
  • Earth Observation, climate modeling
  • Geophysics, earthquake modeling
  • Fluids, aerodynamic design
  • Pollutant dispersal scenarios

17
Data Intensive Biology and Medicine
  • Medical data
  • X-Ray, mammography data, etc. (many petabytes)
  • Digitizing patient records (ditto)
  • X-ray crystallography
  • Bright X-Ray sources, e.g. Argonne Advanced
    Photon Source
  • Molecular genomics and related disciplines
  • Human Genome, other genome databases
  • Proteomics (protein structure, activities, ...)
  • Protein interactions, drug delivery
  • Brain scans (3-D, time dependent)
  • Virtual Population Laboratory (proposed)
  • Database of populations, geography,
    transportation corridors
  • Simulate likely spread of disease outbreaks

Craig Venter keynote @ SC2001
18
Example: High Energy Physics
Compact Muon Solenoid at the LHC (CERN)
[Figure: CMS detector, with the Smithsonian
standard man shown for scale]
19
LHC Computing Challenges
  • Complexity of LHC interaction environment &
    resulting data
  • Scale: Petabytes of data per year (100 PB by
    2010-12)
  • Global distribution of people and resources

1800 Physicists, 150 Institutes, 32 Countries
20
Global LHC Data Grid
  • Tier0: CERN
  • Tier1: National Lab
  • Tier2: Regional Center (University, etc.)
  • Tier3: University workgroup
  • Tier4: Workstation
  • Key ideas
  • Hierarchical structure
  • Tier2 centers

21
Example: Global LHC Data Grid
Experiment (e.g., CMS)
Tier0 / (Σ Tier1) / (Σ Tier2) ~ 1:1:1
  • Online System, feeding Tier 0 at 100 MBytes/sec
  • Tier 0: CERN Computer Center, > 20 TIPS
  • Tier 0 to Tier 1: 2.5 Gbits/sec
  • Tier 1: France, Italy, UK, USA
  • Tier 1 to Tier 2: 2.5 Gbits/sec
  • Tier 2
  • Tier 2 to Tier 3: 0.6 Gbits/sec
  • Tier 3: Institutes (0.25 TIPS), physics data
    cache
  • Tier 3 to Tier 4: 0.1 - 1 Gbits/sec
  • Tier 4: PCs, other portals
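
To get a feel for these bandwidths, the sketch below estimates how long a dataset takes to move between tiers and how much data the online system produces per year. It assumes the bandwidth figures in the diagram label the links between successive tiers, that links run at full nominal rate, and that the online system runs continuously; all three are simplifying assumptions, and the names are illustrative.

```python
# Nominal link speeds in Gbit/s, read off the diagram (assumed to sit
# between successive tiers).
LINKS_GBPS = {
    "Tier 0 -> Tier 1": 2.5,
    "Tier 1 -> Tier 2": 2.5,
    "Tier 2 -> Tier 3": 0.6,
}

SECONDS_PER_YEAR = 365 * 24 * 3600


def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours to move `size_tb` terabytes over a link (1 TB = 1e12 bytes)."""
    bits = size_tb * 1e12 * 8
    return bits / (link_gbps * 1e9 * efficiency) / 3600.0


def online_pb_per_year(mbytes_per_sec: float) -> float:
    """Upper bound on yearly volume if the source ran without pause."""
    return mbytes_per_sec * 1e6 * SECONDS_PER_YEAR / 1e15


if __name__ == "__main__":
    for link, gbps in LINKS_GBPS.items():
        print(f"1 TB over {link}: {transfer_hours(1, gbps):.2f} h")
    print(f"Online system at 100 MBytes/sec: "
          f"{online_pb_per_year(100):.2f} PB/year at most")
```

The second figure is consistent with the "Petabytes of data per year" scale quoted for the LHC, since 100 MBytes/sec sustained for a year is about 3 PB.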
22
Sloan Digital Sky Survey Data Grid
23
LIGO (Gravity Wave) Data Grid
[Figure: network diagram connecting Hanford
Observatory, Livingston Observatory, Caltech
(Tier1) and MIT; link labels OC3, OC12, OC48]
24
Data Grid Projects
25
Data Grid Projects
  • Particle Physics Data Grid (US, DOE)
  • Data Grid applications for HENP expts.
  • GriPhyN (US, NSF)
  • Petascale Virtual-Data Grids
  • iVDGL (US, NSF)
  • Global Grid lab
  • TeraGrid (US, NSF)
  • Dist. supercomp. resources (13 TFlops)
  • European Data Grid (EU, EC)
  • Data Grid technologies, EU deployment
  • CrossGrid (EU, EC)
  • Data Grid technologies, EU emphasis
  • DataTAG (EU, EC)
  • Transatlantic network, Grid applications
  • Japanese Grid Projects (APGrid?) (Japan)
  • Grid deployment throughout Japan
  • Collaborations of application scientists &
    computer scientists
  • Infrastructure devel. & deployment
  • Globus based

26
GriPhyN = App. Science + CS + Grids
  • GriPhyN = Grid Physics Network
  • US-CMS: High Energy Physics
  • US-ATLAS: High Energy Physics
  • LIGO/LSC: Gravity wave research
  • SDSS: Sloan Digital Sky Survey
  • Strong partnership with computer scientists
  • Design and implement production-scale grids
  • Develop common infrastructure, tools and services
  • Integration into the 4 experiments
  • Broad application to other sciences via Virtual
    Data Toolkit
  • Strong outreach program
  • Multi-year project
  • R&D for grid architecture (funded at $11.9M +
    $1.6M)
  • Integrate Grid infrastructure into experiments
    through VDT

27
GriPhyN Institutions
  • UC San Diego
  • San Diego Supercomputer Center
  • Lawrence Berkeley Lab
  • Argonne
  • Fermilab
  • Brookhaven
  • U Florida
  • U Chicago
  • Boston U
  • Caltech
  • U Wisconsin, Madison
  • USC/ISI
  • Harvard
  • Indiana
  • Johns Hopkins
  • Northwestern
  • Stanford
  • U Illinois at Chicago
  • U Penn
  • U Texas, Brownsville
  • U Wisconsin, Milwaukee
  • UC Berkeley

28
GriPhyN: PetaScale Virtual-Data Grids
[Figure: layered architecture, top to bottom]
  • Users: Production Team, Individual Investigator,
    Workgroups (1 Petaflop, 100 Petabytes)
  • Interactive User Tools
  • Virtual Data Tools; Request Planning & Scheduling
    Tools; Request Execution & Management Tools
  • Resource Management Services; Security and
    Policy Services; Other Grid Services
  • Transforms
  • Distributed resources (code, storage, CPUs,
    networks)
  • Raw data source
29
GriPhyN Research Agenda
  • Virtual Data technologies (fig.)
  • Derived data, calculable via algorithm
  • Instantiated 0, 1, or many times (e.g., caches)
  • Fetch value vs execute algorithm
  • Potentially complex (versions, consistency, cost
    calculation, etc)
  • LIGO example
  • Get gravitational strain for 2 minutes around
    each of 200 gamma-ray bursts over the last year
  • For each requested data value, need to
  • Locate item location and algorithm
  • Determine costs of fetching vs calculating
  • Plan data movements & computations required to
    obtain results
  • Execute the plan
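
The four steps above (locate, cost, plan, execute) can be sketched as a small planner that compares fetching an existing replica with recomputing the item. This is a toy illustration, not GriPhyN code: the `Replica`, `Recipe` and `Plan` classes, the cost model (transfer time vs. CPU time, both in seconds) and all field names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    """An already materialized copy of the data item (hypothetical model)."""
    site: str
    size_gb: float
    bandwidth_gbps: float  # nominal bandwidth from that site to the requester

    def fetch_seconds(self) -> float:
        return self.size_gb * 8.0 / self.bandwidth_gbps


@dataclass
class Recipe:
    """The algorithm that can regenerate the item (hypothetical model)."""
    cpu_seconds: float


@dataclass
class Plan:
    action: str            # "fetch" or "compute"
    cost_seconds: float
    site: Optional[str] = None


def plan_request(replicas: List[Replica], recipe: Optional[Recipe]) -> Plan:
    """Pick the cheapest way to obtain one virtual data item."""
    # Step 1 (locate) is assumed done: `replicas` and `recipe` are its result.
    # Step 2: cost every option.
    options = [Plan("fetch", r.fetch_seconds(), r.site) for r in replicas]
    if recipe is not None:
        options.append(Plan("compute", recipe.cpu_seconds))
    if not options:
        raise LookupError("no replica and no recipe: item cannot be produced")
    # Step 3: the plan is simply the cheapest option in this sketch.
    return min(options, key=lambda p: p.cost_seconds)


if __name__ == "__main__":
    replicas = [Replica("remote-archive", size_gb=10, bandwidth_gbps=0.1)]
    print(plan_request(replicas, Recipe(cpu_seconds=120)))
```

A real planner would also weigh local and global policies and resource availability; here cost alone decides, which is the simplest case of "fetch value vs. execute algorithm".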

30
Virtual Data in Action
  • Data request may
  • Compute locally
  • Compute remotely
  • Access local data
  • Access remote data
  • Scheduling based on
  • Local policies
  • Global policies
  • Cost

Major facilities, archives
Regional facilities, caches
Local facilities, caches
31
Chimera Virtual Data System
  • Virtual data language
  • Transformations, derivations, data
  • Virtual data catalog
  • Persistent definitions
  • Query capability
  • Data production & analysis applications

32
Transformations and Derivations
  • Transformation
  • Abstract template of program invocation
  • Similar to "function definition"
  • Derivation
  • Formal invocation of a Transformation
  • Similar to "function call"
  • Store past and future
  • A record of how data products were generated
  • A recipe of how data products can be generated
  • Invocation (future)
  • Record of each Derivation execution
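
The function-definition and function-call analogy can be made concrete with three small record types. This is an illustrative Python model only, not the Chimera virtual data language; every class, field and file name below is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Transformation:
    """Abstract template of a program invocation ("function definition")."""
    name: str
    executable: str
    params: Tuple[str, ...]  # formal parameter names, in command-line order


@dataclass
class Derivation:
    """Binding of actual arguments to a Transformation ("function call")."""
    transformation: Transformation
    args: Dict[str, str]

    def command_line(self) -> List[str]:
        missing = [p for p in self.transformation.params if p not in self.args]
        if missing:
            raise ValueError(f"unbound parameters: {missing}")
        return [self.transformation.executable] + [
            self.args[p] for p in self.transformation.params
        ]


@dataclass
class Invocation:
    """Record of one execution of a Derivation."""
    derivation: Derivation
    host: str
    exit_code: int


if __name__ == "__main__":
    calibrate = Transformation("calibrate", "/bin/calibrate", ("raw", "out"))
    recipe = Derivation(calibrate, {"raw": "run42.raw", "out": "run42.cal"})
    print(recipe.command_line())
    record = Invocation(recipe, host="node01", exit_code=0)
    print(record.exit_code)
```

A stored Derivation serves both purposes named on the slide: before it runs it is a recipe, and afterwards, together with its Invocation records, it documents how the data product was generated.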

33
GriPhyN Research Agenda (cont.)
  • Execution management
  • Co-allocation of resources (CPU, storage, network
    transfers)
  • Fault tolerance, error reporting
  • Interaction, feedback to planning
  • Performance analysis (with PPDG)
  • Instrumentation and measurement of all grid
    components
  • Understand and optimize grid performance
  • Virtual Data Toolkit (VDT)
  • VDT = virtual data services + virtual data tools
  • One of the primary deliverables of R&D effort
  • Technology transfer mechanism to other scientific
    domains

34
GriPhyN/PPDG Data Grid Architecture
[Figure: component diagram; legend: initial
solution is operational]
  • Main path: Application, DAG, Planner, DAG,
    Executor, Compute Resource and Storage Resource
  • Supporting services: Catalog Services,
    Monitoring, Info Services, Repl. Mgmt.,
    Policy/Security, Reliable Transfer Service
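
In this architecture the work handed to the Executor is described as a DAG (directed acyclic graph) of jobs. A minimal sketch of what executing such a DAG means, assuming each job lists the jobs it depends on; it uses Python's standard `graphlib` module (Python 3.9+) and hypothetical job names, and it ignores scheduling, data transfer and failure handling.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def execute(dag: Dict[str, Set[str]], run: Callable[[str], None]) -> List[str]:
    """Run every job after all of its predecessors; return the order used.

    `dag` maps each job name to the set of jobs it depends on.
    A cyclic graph raises graphlib.CycleError.
    """
    order = []
    for job in TopologicalSorter(dag).static_order():
        run(job)
        order.append(job)
    return order


if __name__ == "__main__":
    # Hypothetical three-step request: simulate, move the output, analyze it.
    dag = {
        "simulate": set(),
        "transfer": {"simulate"},
        "analyze": {"transfer"},
    }
    execute(dag, lambda job: print("running", job))
```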
35
Catalog Architecture
Transparency wrt location

Metadata Catalog (Object Name to logical object name)
  Name      LObjN
  X         logO1
  Y         logO2
  F.X       logO3
  G(1).Y    logO4

Replica Catalog (Logical Container Name to physical file names)
  LCN       PFNs
  logC1     URL1
  logC2     URL2 URL3
  logC3     URL4
  logC4     URL5 URL6

Other diagram labels: GCMS
URLs for physical file location
Physical file storage
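
Location transparency here means a user names an object and the catalogs resolve it to physical copies. The sketch below chains the two lookups using the entries shown in the diagram. The slide does not show how logical objects (logO...) map to logical containers (logC...), so `OBJECT_TO_CONTAINER` is a made-up one-to-one mapping added purely to make the example run.

```python
from typing import Dict, List

# Entries as they appear in the diagram.
METADATA_CATALOG: Dict[str, str] = {
    "X": "logO1",
    "Y": "logO2",
    "F.X": "logO3",
    "G(1).Y": "logO4",
}
REPLICA_CATALOG: Dict[str, List[str]] = {
    "logC1": ["URL1"],
    "logC2": ["URL2", "URL3"],
    "logC3": ["URL4"],
    "logC4": ["URL5", "URL6"],
}

# ASSUMPTION: not on the slide. Object-to-container mapping invented
# for illustration (one object per container, matched by index).
OBJECT_TO_CONTAINER: Dict[str, str] = {
    "logO1": "logC1",
    "logO2": "logC2",
    "logO3": "logC3",
    "logO4": "logC4",
}


def resolve(object_name: str) -> List[str]:
    """Return every physical location (URL) holding the named object."""
    logical_object = METADATA_CATALOG[object_name]
    container = OBJECT_TO_CONTAINER[logical_object]
    return REPLICA_CATALOG[container]


if __name__ == "__main__":
    for name in METADATA_CATALOG:
        print(name, "->", resolve(name))
```

The caller never handles a URL directly: adding or moving a replica only changes the Replica Catalog, not the names users work with.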
36
iVDGL: A Global Grid Laboratory
"We propose to create, operate and evaluate, over
a sustained period of time, an international
research laboratory for data-intensive
science." From NSF proposal, 2001
  • International Virtual-Data Grid Laboratory
  • A global Grid laboratory (US, EU, South America,
    Asia, ...)
  • A place to conduct Data Grid tests at scale
  • A mechanism to create common Grid infrastructure
  • A facility to perform production exercises for
    LHC experiments
  • A laboratory for other disciplines to perform
    Data Grid tests
  • A focus of outreach efforts to small institutions
  • Funded for $13.65M by NSF

37
iVDGL Components
  • Computing resources
  • Tier1, Tier2, Tier3 sites
  • Networks
  • USA (TeraGrid, Internet2, ESNET), Europe (Géant,
    ...)
  • Transatlantic (DataTAG), Transpacific, AMPATH, ...
  • Grid Operations Center (GOC)
  • Indiana (2 people)
  • Joint work with TeraGrid on GOC development
  • Computer Science support teams
  • Support, test, upgrade GriPhyN Virtual Data
    Toolkit
  • Outreach effort
  • Integrated with GriPhyN
  • Coordination, interoperability

38
Current iVDGL Participants
  • Initial experiments (funded by NSF proposal)
  • CMS, ATLAS, LIGO, SDSS, NVO
  • U.S. Universities and laboratories
  • (Next slide)
  • Partners
  • TeraGrid
  • EU DataGrid + EU national projects
  • Japan (AIST, TITECH)
  • Australia
  • Complementary EU project: DataTAG
  • 2.5 Gb/s transatlantic network

39
Initial U.S. iVDGL Participants
  • U Florida: CMS
  • Caltech: CMS, LIGO
  • UC San Diego: CMS, CS
  • Indiana U: ATLAS, GOC
  • Boston U: ATLAS
  • U Wisconsin, Milwaukee: LIGO
  • Penn State: LIGO
  • Johns Hopkins: SDSS, NVO
  • U Chicago/Argonne: CS
  • U Southern California: CS
  • U Wisconsin, Madison: CS
  • Salish Kootenai: Outreach, LIGO
  • Hampton U: Outreach, ATLAS
  • U Texas, Brownsville: Outreach, LIGO
  • Fermilab: CMS, SDSS, NVO
  • Brookhaven: ATLAS
  • Argonne Lab: ATLAS, CS

Legend: T2 / Software; CS support; T3 / Outreach;
T1 / Labs (funded elsewhere)
40
Possible Participant: TeraGrid (13 TeraFlops, 40
Gb/s)
[Figure: sites joined by a 40 Gb/s network, each
with site resources, archival storage (HPSS or
UniTree) and external network links]
  • Caltech
  • Argonne
  • NCSA/PACI: 8 TF, 240 TB
  • SDSC: 4.1 TF, 225 TB
41
US-iVDGL Data Grid (Dec. 2002)
[Map of sites: SKC, Boston U, Wisconsin, Michigan,
PSU, BNL, Fermilab, LBL, Argonne, J. Hopkins, NCSA,
Indiana, Hampton, Caltech, Oklahoma, Vanderbilt,
UCSD/SDSC, FSU, Arlington, UF, FIU, Brownsville]
Plus other sites in 2002
42
iVDGL Map (2002-2003)
Surfnet
DataTAG
  • New partners
  • Brazil T1
  • Russia T1
  • Chile T2
  • Pakistan T2
  • China T2
  • Romania ?

43
FIU Participation in iVDGL
  • Immediate participation in GriPhyN-iVDGL outreach
    effort
  • Extend outreach effort to new participants
  • iVDGL/GriPhyN outreach leaders enthusiastic about
    this idea
  • Connections to South America AMPATH
  • HEP in Brazil, etc.
  • New astronomy projects for iVDGL?
  • Outreach to South America?
  • Connections with Florida neighbors: UF and FSU
  • Major CMS leadership at UF and FSU
  • Extend UF-FSU CMS collaboration to UF-FSU-FIU
  • Extend nuclear physics FSU-FIU connection to
    iVDGL
  • Connections with Caltech CMS and Grid projects
  • H. Newman actively developing new Grid
    collaborations with several countries (Brazil,
    Romania, Pakistan, etc.)

44
iVDGL Management and Coordination
[Figure: organization chart with a U.S. piece and
an international piece; "FIU" marks teams labeled
with FIU in the chart]
  • US Project Directors
  • US External Advisory Committee
  • US Project Steering Group
  • Project Coordination Group
  • Collaborating Grid Projects
  • Facilities Team (FIU)
  • Core Software Team (FIU)
  • Operations Team
  • Applications Team (FIU)
  • GLUE Interoperability Team
  • Outreach Team (FIU)
45
Need for Common Grid Infrastructure
  • Grid computing sometimes compared to electric
    grid
  • You plug in to get a resource (CPU, storage, ...)
  • You don't care where the resource is located
  • This analogy is more appropriate than originally
    intended
  • It expresses a USA viewpoint: uniform power grid
  • What happens when you travel around the world?

Different frequencies: 60 Hz, 50 Hz
Different voltages: 120 V, 220 V
Different sockets! USA, 2 pin, France, UK, etc.
Want to avoid this situation in Grid computing
46
Role of Grid Infrastructure
  • Provide essential common Grid services
  • Cannot afford to develop separate
    infrastructures (manpower, timing, immediate
    needs, etc.)
  • Meet needs of high-end scientific & engineering
    collaborations
  • HENP, astrophysics, GVO, earthquake, climate,
    space, biology, ...
  • Already international and even global in scope
  • Drive future requirements
  • Be broadly applicable outside science
  • Government agencies: National, regional (EU), UN
  • Non-governmental organizations (NGOs)
  • Corporations, business networks (e.g., suppliers,
    R&D)
  • Other virtual organizations (see "Anatomy of the
    Grid")
  • Be scalable to the Global level

47
Coordination of U.S. Grid Projects
  • Three closely coordinated U.S. projects
  • PPDG: HENP experiments, short term tools,
    deployment
  • GriPhyN: Data Grid research, Virtual Data, VDT
    deliverable
  • iVDGL: Global Grid laboratory
  • Coordination of PPDG, GriPhyN, iVDGL
  • Common experiments + personnel, management
    integration
  • iVDGL as joint PPDG + GriPhyN laboratory
  • Joint meetings (Jan. 2002, April 2002, Sept.
    2002)
  • Joint architecture creation (GriPhyN, PPDG)
  • Adoption of VDT as common core Grid
    infrastructure
  • Common Outreach effort (GriPhyN + iVDGL)
  • New TeraGrid project (Aug. 2001)
  • 13 TFlops across 4 sites, 40 Gb/s networking
  • Aim to integrate into iVDGL, adopt VDT, common
    Outreach

48
Grid Coordination Efforts
  • Global Grid Forum (GGF)
  • www.gridforum.org
  • International forum for general Grid efforts
  • Many working groups, standards definitions
  • Next one in Toronto, Feb. 17-20
  • HICB (High energy physics)
  • Represents HEP collaborations, primarily LHC
    experiments
  • Joint development & deployment of Data Grid
    middleware
  • GriPhyN, PPDG, TeraGrid, iVDGL, EU-DataGrid, LCG,
    DataTAG, CrossGrid
  • Common testbed, open source software model
  • Several meetings so far
  • New infrastructure Data Grid projects?
  • Fold into existing Grid landscape (primarily US +
    EU)

49
Worldwide Grid Coordination
  • Two major clusters of physics Grid projects
  • US based: GriPhyN Virtual Data Toolkit (VDT)
  • EU based: Different packaging of similar
    components
  • MAGIC coordination workshop in Chicago in August
  • Organized by NSF and DOE
  • Final report in a few weeks
  • Determine Grid coordination strategy over broad
    range
  • Many activities

50
Summary
  • Data Grids will qualitatively and quantitatively
    change the nature of collaborations and
    approaches to computing
  • The iVDGL will provide vast experience for new
    collaborations
  • Many challenges during the coming transition
  • New grid projects will provide rich experience
    and lessons
  • Difficult to predict situation even 3-5 years
    ahead

51
Grid References
  • Grid Book
  • www.mkp.com/grids
  • Globus
  • www.globus.org
  • Global Grid Forum
  • www.gridforum.org
  • TeraGrid
  • www.teragrid.org
  • EU DataGrid
  • www.eu-datagrid.org
  • PPDG
  • www.ppdg.net
  • GriPhyN
  • www.griphyn.org
  • iVDGL
  • www.ivdgl.org