Title: CS Buzzwords / The Grid and the Future of Computing
1. CS Buzzwords / The Grid and the Future of Computing
- Scott A. Klasky
- sklasky@pppl.gov
2. Why?
- Why do you have to program in a language which doesn't let you program in equations?
- Why do you have to care about the machine you are programming on?
- Why do you care which machine the code runs on?
- Why can't you visualize/analyze your data as soon as the data is produced?
- Why do you run your codes at NERSC?
- Silly question for those who use 100s/1000s of processors.
- Why don't the results of your analysis always get stored in a database?
- Why can't the computer do the data analysis for you, and have it ask you questions?
- Why are people still talking about vector computers?
- I just don't have TIME!!!
- COLLABORATION IS THE KEY!
3. Scott's view of computing (HYPE)
- Why can't we program in high-level languages?
- RNPL (Rapid Numerical Prototyping Language): http://godel.ph.utexas.edu/Members/marsa/rnpl/users_guide/node4.html
- Mathematica/Maple
- Use object-oriented programming to manage memory, state, etc.
- This is the framework for your code.
- You write modules in this framework.
- Use F90/F77/C for the modules of the code.
- These modules can be reused across multiple codes and multiple authors.
- Compute fundamental variables on the main computers, other variables on secondary computers.
- The Cactus code is a good example (2001 Gordon Bell Prize winner). A sketch of the framework/module split follows below.
- What are the benefits?
- Let the CS people worry about memory management, data I/O, visualization, security, and machine locations.
- Why should you care about the machine you are running on?
- All you should care about is running your code and getting accurate results as fast as possible.
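
To make the framework/module split concrete, here is a minimal sketch in C. It is entirely hypothetical (not the API of Cactus or any real framework): the framework owns the memory and the run loop, and a module contributes only its numerics.

    #include <stdio.h>
    #include <stdlib.h>

    /* A "module" is just a named pair of callbacks the framework invokes. */
    typedef struct {
        const char *name;
        void (*init)(double *u, int n);            /* set initial data      */
        void (*step)(double *u, int n, double dt); /* advance one time step */
    } Module;

    /* Example module: placeholder numerics standing in for real physics. */
    static void wave_init(double *u, int n) {
        for (int i = 0; i < n; i++) u[i] = 0.0;
    }
    static void wave_step(double *u, int n, double dt) {
        for (int i = 0; i < n; i++) u[i] += dt;
    }

    int main(void) {
        Module wave = { "wave", wave_init, wave_step };
        int n = 100;
        double *u = malloc(n * sizeof *u);   /* framework owns the memory   */
        wave.init(u, n);
        for (int t = 0; t < 10; t++)         /* framework owns the run loop */
            wave.step(u, n, 0.01);
        printf("module %s finished\n", wave.name);
        free(u);
        return 0;
    }

The point of the split: the same wave_step could be reused unchanged in another code, because it never touches allocation, I/O, or machine details.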
4. Buzzwords
- Fortran, HPF, C, C++, Java
- MPI, MPICH-G2, OpenMP
- Python, Perl, Tcl/Tk
- HTML, SGML, XML
- JavaScript, DHTML
- FLTK (Fast Light Toolkit)
- The Grid
- Globus
- Web Services
- Data Mining
- WireGL, Chromium
- AccessGrid
- Portals (Discover Portal)
- CCA
- SOAP (Simple Object Access Protocol)
- A way to create widely distributed, complex computing environments that run over the Internet using existing infrastructure.
- It is about applications communicating directly with each other over the Internet in a very rich way. (A minimal SOAP message is sketched after this list.)
- HTC (High Throughput Computing)
- Deliver large amounts of processing capacity over long periods of time.
- Condor (http://www.cs.wisc.edu/condor/)
- Goal: develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources. (A sample submit file also follows below.)
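
Since SOAP is just XML carried over existing transports (usually HTTP), a request message is easy to read. Below is a minimal sketch of a SOAP 1.1 envelope; the service and element names (a fusion-diagnostics lookup) are invented for illustration.

    <?xml version="1.0"?>
    <soap:Envelope
        xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
      <soap:Body>
        <!-- hypothetical remote call: fetch a temperature profile -->
        <getTemperature xmlns="urn:example:fusion-diagnostics">
          <shotNumber>104500</shotNumber>
        </getTemperature>
      </soap:Body>
    </soap:Envelope>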
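
And to give a feel for HTC, here is a sketch of a Condor submit description file (the executable and file names are hypothetical); "queue 100" asks Condor to farm out 100 instances across whatever idle machines it manages.

    # analyze.sub -- submit with: condor_submit analyze.sub
    universe   = vanilla
    executable = analyze
    arguments  = shot104500.dat
    output     = analyze.$(Process).out
    error      = analyze.$(Process).err
    log        = analyze.log
    queue 100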
5. Cactus (http://www.cactuscode.org) (Allen, Dramlitsch, Seidel, Shalf, Radke)
- Modular, portable framework for parallel, multidimensional simulations
- Construct codes by linking:
- Small core (flesh): management services
- Selected modules (thorns): numerical methods, grids & domain decompositions, visualization and steering, etc. (a thorn is sketched below)
- Custom linking/configuration tools
- Developed for astrophysics, but not astrophysics-specific
- They have:
- Cactus Worms
- Remote monitoring and steering of an application from any web browser
- Streaming of isosurfaces from a simulation, which can then be viewed on a local machine
- Remote visualization of 2D slices of any grid function in a simulation, as JPEGs in a web browser
- Accessible MPI-based parallelism for finite-difference grids
- Access to a variety of supercomputing architectures and clusters
- Several parallel I/O layers
- Fixed and adaptive mesh refinement (under development)
- Elliptic solvers
- Parallel interpolators and reductions
- Metacomputing and distributed-computing thorns
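
As a rough sketch of what a thorn looks like (based on the public Cactus documentation; macro and variable details vary by version, and the thorn and routine names here are hypothetical), a thorn routine is ordinary C that the flesh schedules and hands the already-decomposed grid to:

    /* MyThorn.c -- sketch of a Cactus thorn routine */
    #include "cctk.h"
    #include "cctk_Arguments.h"

    void MyThorn_Evolve(CCTK_ARGUMENTS)
    {
        DECLARE_CCTK_ARGUMENTS;  /* flesh-provided grid info and variables */

        /* Loop only over this processor's patch; the flesh/driver has
           already done the domain decomposition and ghost-zone exchange. */
        for (int k = 0; k < cctk_lsh[2]; k++)
          for (int j = 0; j < cctk_lsh[1]; j++)
            for (int i = 0; i < cctk_lsh[0]; i++)
            {
                /* update grid functions here */
            }
    }

A few lines in the thorn's CCL configuration files then tell the flesh when to call the routine (e.g., during evolution) and which grid variables it needs.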
6. Discover Portal
- http://tassl-pc-5.rutgers.edu/discover/main.php
- Discover is a virtual, interactive, and collaborative PSE (problem-solving environment).
- Enables geographically distributed scientists and engineers to collaboratively monitor and control high-performance parallel/distributed applications using web-based portals.
- Its primary objective is to transform high-performance simulation into true research and instructional modalities.
- Brings large distributed simulations to the scientist's/engineer's desktop by providing collaborative web-based portals for interaction and control.
- Provides a 3-tier architecture composed of detachable thin clients at the front end, a network of web servers in the middle, and a control network of sensors, actuators, and interaction agents superimposed on the application at the back end.
7. MPICH-G2 (http://www.hpclab.niu.edu/mpi/)
- What is MPICH-G2?
- It is a grid-enabled implementation of the MPI v1.1 standard (see the sketch below).
- Using Globus services (job startup, security), MPICH-G2 allows you to couple multiple machines.
- MPICH-G2 automatically converts data in messages sent between machines of different architectures, and supports multiprotocol communication by automatically selecting TCP for intermachine messaging and vendor-supplied MPI for intramachine messaging.
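
Because MPICH-G2 implements the standard MPI interface, nothing grid-specific appears in the source. A minimal MPI program like this sketch could, in principle, run with its ranks spread across several Globus-coupled machines; only the job startup changes.

    /* hello.c -- plain MPI 1.1; compiles against any MPI implementation */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* my rank                 */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total ranks, possibly   */
        MPI_Get_processor_name(host, &len);    /* spanning many machines  */
        printf("rank %d of %d on %s\n", rank, size, host);
        MPI_Finalize();
        return 0;
    }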
8. AccessGrid
Supporting group-to-group interaction across the Grid. http://www.accessgrid.org. Over 70 AG sites (PPPL will be next!)
- Extending the Computational Grid
- Group-to-group interactions are different from, and more complex than, individual-to-individual interactions.
- Large-scale scientific and technical collaborations often involve multiple teams working together.
- The Access Grid concept complements and extends the concept of the Computational Grid.
- The Access Grid project aims at exploring and supporting this more complex set of requirements and functions.
- An Access Grid node involves 3-20 people per site.
- Access Grid nodes are designed spaces that support the high-end audio/video technology needed to provide a compelling and productive user experience.
- The Access Grid consists of large-format multimedia display, presentation, and interaction software environments; interfaces to grid middleware; and interfaces to remote visualization environments.
- With these resources, the Access Grid supports large-scale distributed meetings, collaborative teamwork sessions, seminars, lectures, tutorials, and training.
- Providing New Capabilities
- The Alliance Access Grid project has prototyped a number of Access Grid nodes and uses these nodes to conduct remote meetings, site visits, training sessions, and educational events.
- Capabilities will include:
- high-quality multichannel digital video and audio,
- prototypic large-format displays,
- integrated presentation technologies (PowerPoint slides, MPEG movies, shared OpenGL windows),
- prototypic recording capabilities,
- integration with Globus for basic services (directories, security, network resource management),
- macroscreen management,
- integration of local desktops into the Grid,
- multiple-session capability.
9. Access Grid
10. Chromium
- http://graphics.stanford.edu/humper/chromium_documentation/
- Chromium is a new system for interactive rendering on clusters of workstations.
- It is a completely extensible architecture, so parallel rendering algorithms can be implemented on clusters with ease.
- We are still using WireGL, but will be switching to Chromium.
- Basically, it will allow us to run a program which uses OpenGL and have it display on a cluster tiled display wall (see the sketch below).
- There are parallel APIs!
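
The point is that the application itself is ordinary OpenGL; Chromium (like WireGL) intercepts the GL calls and routes them to the tile servers. A minimal GLUT program like this sketch needs no Chromium-specific code:

    /* tri.c -- ordinary GLUT/OpenGL; unmodified, its output can be
       redirected to a tiled wall by Chromium's GL interception layer */
    #include <GL/glut.h>

    static void display(void)
    {
        glClear(GL_COLOR_BUFFER_BIT);
        glBegin(GL_TRIANGLES);               /* one colored triangle */
        glColor3f(1, 0, 0); glVertex2f(-0.5f, -0.5f);
        glColor3f(0, 1, 0); glVertex2f( 0.5f, -0.5f);
        glColor3f(0, 0, 1); glVertex2f( 0.0f,  0.5f);
        glEnd();
        glutSwapBuffers();
    }

    int main(int argc, char **argv)
    {
        glutInit(&argc, argv);
        glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
        glutCreateWindow("triangle");
        glutDisplayFunc(display);
        glutMainLoop();
        return 0;
    }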
11. Common Component Architecture (http://www.acl.lanl.gov/cca/)
- Goal: provide interoperable components and frameworks for rapid construction of complex, high-performance applications.
- CCA is needed because existing component standards (EJB, CORBA, COM) are not designed for large-scale, high-performance computing or parallel components.
- The CCA will leverage existing standards' infrastructure such as name services, event models, builders, security, and tools.
12. Requirements of Component Architectures for High-Performance Computing
- Component characteristics. The CCA will be used primarily for high-performance components of both coarse and fine grain, implemented according to different paradigms such as SPMD-style as well as shared-memory multithreaded models.
- Heterogeneity. Whenever technically possible, the CCA should be able to combine, within one multi-component application, components executing on multiple architectures, implemented in different languages, and using different run-time systems. Furthermore, design priorities should be geared towards addressing the software needs most common in HPC environments; for example, interoperability with languages popular in scientific programming, such as Fortran, C, and C++, should be given priority.
- Local and remote components. Whenever possible we would like to support interoperability of both local and remote components and be able to seamlessly change interactions from local to remote. We will address the needs both of remote components running over a local area network and of wide-area component applications running over the HPC grid, which should be able to satisfy real-time constraints and interact with diverse supercomputing schedulers.
- Integration. We will try to make the integration of components as smooth as possible. In general it should not be necessary to develop a component specially to integrate with the framework, or to rewrite an existing component substantially.
- High performance. It is essential that the set of standard features agreed on contain mechanisms for supporting high-performance interactions: whenever possible we should be able to avoid extra copies, extra communication, or synchronization, and encourage efficient implementations such as parallel data transfers.
- Openness. The CCA specification should be open, and usable with open software. In HPC this flexibility is needed to keep pace with the ever-changing demands of the scientific programming world.
13. The Grid (http://www.globus.org)
- The Grid Problem:
- Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources
- From "The Anatomy of the Grid: Enabling Scalable Virtual Organizations"
- Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals -- assuming the absence of:
- central location,
- central control,
- omniscience,
- existing trust relationships.
14. Elements of the Problem
- Resource sharing
- Computers, storage, sensors, networks, ...
- Sharing is always conditional: issues of trust, policy, negotiation, payment, ...
- Coordinated problem solving
- Beyond client-server: distributed data analysis, computation, collaboration, ...
- Dynamic, multi-institutional virtual organizations
- Community overlays on classic organizational structures
- Large or small, static or dynamic
15. Why Grids?
- A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour
- 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data
- Civil engineers collaborate to design, execute, and analyze shake-table experiments
- Climate scientists visualize, annotate, and analyze terabyte simulation datasets
- An emergency response team couples real-time data, a weather model, and population data
- A multidisciplinary analysis in aerospace couples code and data in four companies
- A home user invokes architectural design functions at an application service provider
- An application service provider purchases cycles from compute cycle providers
- Scientists working for a multinational soap company design a new product
- A community group pools members' PCs to analyze alternative designs for a local road
16. Online Access to Scientific Instruments
[Figure: Advanced Photon Source pipeline -- real-time collection, archival storage, tomographic reconstruction, wide-area dissemination, desktop VR clients with shared controls. DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago]
17. Data Grids for High Energy Physics
[Image courtesy Harvey Newman, Caltech]
18. Broader Context
- Grid computing has much in common with major industrial thrusts:
- business-to-business, peer-to-peer, application service providers, storage service providers, distributed computing, Internet computing
- Sharing issues not adequately addressed by existing technologies
- Complicated requirements: "run program X at site Y subject to community policy P, providing access to data at Z according to policy Q"
- High performance: unique demands of advanced high-performance systems
19. Why Now?
- Moore's-law improvements in computing produce highly functional end systems
- The Internet and burgeoning wired and wireless networks provide universal connectivity
- Changing modes of working and problem solving emphasize teamwork and computation
- Network exponentials produce dramatic changes in geometry and geography
20. Network Exponentials
- Network vs. computer performance
- Computer speed doubles every 18 months
- Network speed doubles every 9 months
- Difference: an order of magnitude every 5 years (over 60 months, computers gain 60/18 ≈ 3.3 doublings, about 10x, while networks gain 60/9 ≈ 6.7 doublings, about 100x)
- 1986 to 2000:
- Computers: x500
- Networks: x340,000
- 2001 to 2010:
- Computers: x60
- Networks: x4,000
Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett; source: Vinod Khosla, Kleiner, Caufield and Perkins.
21. The Globus Project: Making Grid Computing a Reality
- Close collaboration with real Grid projects in science and industry
- Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure
- Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing
- The Globus Toolkit: open-source, reference software base for building Grid infrastructure and applications (a taste of the command-line tools appears below)
- Global Grid Forum: development of standard protocols and APIs for Grid computing
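
For a flavor of the toolkit in practice, a Globus Toolkit 2 session might look roughly like the sketch below. The host name is borrowed from the summary slide and is hypothetical here; exact options vary by installation.

    $ grid-proxy-init
    # creates a short-lived GSI proxy credential from your certificate

    $ globus-job-run petrel.pppl.gov /bin/hostname
    # runs a command on a remote resource through GRAM, authenticated
    # with the proxy rather than a per-machine password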
22. One View of Requirements
- Identity & authentication
- Authorization & policy
- Resource discovery
- Resource characterization
- Resource allocation
- (Co-)reservation, workflow
- Distributed algorithms
- Remote data access
- High-speed data transfer
- Performance guarantees
- Monitoring
- Adaptation
- Intrusion detection
- Resource management
- Accounting & payment
- Fault management
- System evolution
- Etc.
23. Three Obstacles to Making Grid Computing Routine
- New approaches to problem solving
- Data Grids, distributed computing, peer-to-peer, collaboration grids, ...
- Structuring and writing programs
- Abstractions, tools
- Enabling resource sharing across distinct institutions
- Resource discovery, access, reservation, allocation; authentication, authorization, policy; communication; fault detection and notification
The first two obstacles make up the Programming Problem; the third is the Systems Problem.
24. Programming & Systems Problems
- The programming problem
- Facilitate development of sophisticated applications
- Facilitate code sharing
- Requires programming environments: APIs, SDKs, tools
- The systems problem
- Facilitate coordinated use of diverse resources
- Facilitate infrastructure sharing, e.g., certificate authorities, information services
- Requires systems: protocols, services
- E.g., port/service/protocol for accessing information, allocating resources
25. The Systems Problem: Resource-Sharing Mechanisms That ...
- Address security and policy concerns of resource owners and users
- Are flexible enough to deal with many resource types and sharing modalities
- Scale to large numbers of resources, many participants, many program components
- Operate efficiently when dealing with large amounts of data & computation
26. Aspects of the Systems Problem
- Need for interoperability when different groups want to share resources
- Diverse components, policies, mechanisms
- E.g., standard notions of identity, means of communication, resource descriptions
- Need for shared infrastructure services to avoid repeated development and installation
- E.g., one port/service/protocol for remote access to computing, not one per tool/application
- E.g., certificate authorities are expensive to run
- A common need for protocols & services
27. Hence, a Protocol-Oriented View of Grid Architecture That Emphasizes ...
- Development of Grid protocols & services
- Protocol-mediated access to remote resources
- New services, e.g., resource brokering
- "On the Grid" = speak Intergrid protocols
- Mostly (extensions to) existing protocols
- Development of Grid APIs & SDKs
- Interfaces to Grid protocols & services
- Facilitate application development by supplying higher-level abstractions
- The (hugely successful) model is the Internet
28. The Data Grid Problem
- "Enable a geographically distributed community of thousands to perform sophisticated, computationally intensive analyses on petabytes of data"
29. Major Data Grid Projects
Name | URL / Sponsor | Focus
Grid Application Dev. Software | hipersoft.rice.edu/grads / NSF | Research into program development technologies for Grid applications
Grid Physics Network | griphyn.org / NSF | Technology R&D for data analysis in physics experiments: ATLAS, CMS, LIGO, SDSS
Information Power Grid | ipg.nasa.gov / NASA | Create and apply a production Grid for aerosciences and other NASA missions
International Virtual Data Grid Laboratory | ivdgl.org / NSF | Create an international Data Grid to enable large-scale experimentation on Grid technologies & applications
Network for Earthquake Eng. Simulation Grid | neesgrid.org / NSF | Create and apply a production Grid for earthquake engineering
Particle Physics Data Grid | ppdg.net / DOE Science | Create and apply production Grids for data analysis in high energy and nuclear physics experiments
TeraGrid | teragrid.org / NSF | U.S. science infrastructure linking four major resource sites at 40 Gb/s
UK Grid Support Center | grid-support.ac.uk / U.K. eScience | Support center for Grid projects within the U.K.
Unicore | BMBFT | Technologies for remote access to supercomputers
FusionGrid? | ??? | Link TBs of data from NERSC, generated by fusion codes, to clusters at PPPL
30. Data-Intensive Issues Include ...
- Harness potentially large numbers of data, storage, and network resources located in distinct administrative domains
- Respect local and global policies governing what can be used for what
- Schedule resources efficiently, again subject to local and global constraints
- Achieve high performance, with respect to both speed and reliability
- Catalog software and virtual data
31. Data-Intensive Computing and Grids
- The term "Data Grid" is often used
- Unfortunate, as it implies a distinct infrastructure, which it isn't; but it is easy to say
- Data-intensive computing shares numerous requirements with collaboration, instrumentation, computation, ...
- Security, resource management, information services, etc.
- Important to exploit commonalities, as it is very unlikely that multiple infrastructures can be maintained
- Fortunately, this seems easy to do!
32. Examples of Desired Data Grid Functionality
- High-speed, reliable access to remote data
- Automated discovery of the best copy of data
- Manage replication to improve performance
- Co-schedule compute, storage, and network
- Transparency with respect to delivered performance
- Enforce access control on data
- Allow representation of global resource allocation policies
- Central question: How must Grid architecture be extended to support these functions?
33. Grid Protocols, Services, Tools: Enabling Sharing in Virtual Organizations
- Protocol-mediated access to resources
- Mask local heterogeneities
- Extensible to allow for advanced features
- Negotiate multi-domain security and policy
- "Grid-enabled" resources speak the protocols
- Multiple implementations are possible
- Broad deployment of protocols facilitates creation of services that provide an integrated view of distributed resources
- Tools use protocols and services to enable specific classes of applications
34. A Model Architecture for Data Grids
[Diagram: the application hands an attribute specification to the Metadata Catalog, which resolves it to a logical collection and logical file name; the Replica Catalog maps these to multiple physical locations; Replica Selection chooses a replica using performance information and predictions from the Metacomputing Directory Service and the Network Weather Service; the data then moves over GridFTP control and data channels among the disk caches, disk array, and tape library at replica locations 1, 2, and 3.]
35. Globus Toolkit Components
- Two major Data Grid components:
- 1. Data transport and access
- Common protocol
- Secure, efficient, flexible, extensible data movement
- Family of tools supporting this protocol
- 2. Replica management architecture
- Simple scheme for managing:
- multiple copies of files
- collections of files
- APIs, white papers: http://www.globus.org
36. Motivation for a Common Data Access Protocol
- Existing distributed data storage systems:
- DPSS, HPSS: focus on high-performance access; utilize parallel data transfer and striping
- DFS: focus on high-volume usage, dataset replication, local caching
- SRB: connects heterogeneous data collections; uniform client interface; metadata queries
- Problems:
- Incompatible (and proprietary) protocols
- Each requires a custom client
- Partitions available data sets and storage devices
- Each protocol has a subset of the desired functionality
37. A Common, Secure, Efficient Data Access Protocol
- Common, extensible transfer protocol
- A common protocol means all systems can interoperate
- Decouples low-level data transfer mechanisms from the storage service
- Advantages:
- New, specialized storage systems are automatically compatible with existing systems
- Existing systems gain richer data transfer functionality
- Interfaces to many storage systems
- HPSS, DPSS, file systems
- Plan for SRB integration
38. A Universal Access/Transport Protocol
- Suite of communication libraries and related tools that support:
- GSI and Kerberos security
- Third-party transfers
- Parameter set/negotiation
- Partial file access
- Reliability/restart
- Large file support
- Data channel reuse
- All based on a standard, widely deployed protocol
39. And the Universal Protocol is ... GridFTP
- Why FTP?
- Ubiquity enables interoperation with many commodity tools
- Already supports many desired features; easily extended to support others
- Well understood and supported
- We use the term GridFTP to refer to:
- the transfer protocol, which meets the requirements above
- the family of tools which implement the protocol
- Note: GridFTP > FTP
- Note that despite the name, GridFTP is not restricted to file transfer! (A sample transfer command follows below.)
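
As a sketch of GridFTP in use (host names and paths are hypothetical), the Globus globus-url-copy client can drive a third-party transfer directly between two GridFTP servers, with parallel streams requested via -p:

    $ globus-url-copy -p 4 \
        gsiftp://source.site.gov/data/run01.h5 \
        gsiftp://dest.site.gov/scratch/run01.h5
    # -p 4 asks for four parallel TCP streams; since neither endpoint
    # is the local host, this is a third-party transfer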
40. Summary
[Diagram of the proposed PPPL setup: a supercomputer and PPPL's petrel feed a PPPL tiled display wall; web services (data analysis, data mining) run on the back end; the AccessGrid runs here, alongside Chromium, XPLIT, SCIRun, or VTK; desktop CPUs run AVS/Express and IDL, with HTTP/AccessGrid docking.]