Title: The Grid, Globus Toolkit, Condor
1 The Grid, Globus Toolkit, Condor-G and What?
2 Outline
- The Grid Problem
- The Grid Architecture
- The Globus Toolkit
- Condor-G
3 Culled From
- http://www.globus.org
- "The Anatomy of the Grid: Enabling Scalable Virtual Organizations." I. Foster, C. Kesselman, S. Tuecke. International J. Supercomputer Applications, 15(3), 2001
- "Globus: A Metacomputing Infrastructure Toolkit." I. Foster, C. Kesselman. Intl J. Supercomputer Applications, 11(2):115-128, 1997
- "Condor-G: A Computation Management Agent for Multi-Institutional Grids." J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke. Proc. of HPDC10, 2001
- http://charm.cs.uiuc.edu/ppl_research/faucets/
4 "The Grid"
- Coined in the 1990s to denote a proposed distributed computing architecture.
- "Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources" (from "The Anatomy of the Grid")
- Resource sharing
  - Computers, storage, sensors, networks, scientific instruments
  - Sharing is highly controlled: providers and consumers define
    - What is shared
    - Who is allowed to share
    - Conditions for sharing
- Coordinated problem solving
  - Beyond client-server: distributed data analysis, visualization, computation, collaboration
- Similar to the power grid, faucets (water supply), and the nationwide phone system.
5 Virtual Organization
- A set of individuals and/or institutions defined by some set of sharing rules forms a Virtual Organization (VO).
- A VO may contain
  - Application Service Providers (ASPs)
  - Storage Service Providers (SSPs)
  - Cycle providers
- VOs lack
  - Central control
  - Central location
  - Existing trust relationships
6 Why Grids?
- Civil engineers collaborate to design, execute, and analyze shake-table experiments.
- Climate scientists visualize, annotate, and analyze terabytes of simulation datasets.
- An emergency response team couples real-time data, a weather model, and population data.
- An application service provider purchases cycles from a compute cycle provider.
- A biomedical engineer exploits 10K computers to screen 100K compounds in an hour.
- NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other (Argonne, NCSA, Michigan, UIUC, USC).
7 Online Access to Scientific Instruments
[Figure: Advanced Photon Source. Labels: real-time collection, tomographic reconstruction, archival storage, wide-area dissemination, desktop and VR clients with shared controls]
DOE X-ray grand challenge: ANL, USC/ISI, NIST, U. Chicago
8 Data Grids for High Energy Physics
Image courtesy Harvey Newman, Caltech
9 Is it Really NEW?
- Grid computing has much in common with existing industrial thrusts
  - Application and storage service providers (ASPs, SSPs)
  - Internet peer-to-peer computing
  - Enterprise computing systems
  - Business-to-business exchanges
- SSPs and ASPs allow organizations to outsource storage and computing requirements to other parties (typically via a VPN).
- Enterprise distributed computing technologies (CORBA, Enterprise Java) enable resource sharing within a single organization.
- Business-to-business virtual enterprise technologies and exchanges focus on information sharing via central servers.
10 Is it NEW?
- Sharing is not adequately addressed by these technologies.
- Complicated requirements: "run program X at site Y subject to community policy P, using data at Z according to policy Q".
- High-performance requirements of most of the applications.
- Users may not care where their program runs, as long as it satisfies their QoS requirements (Faucets).
- Controlled, dynamic sharing within VOs
- Current technology either does not accommodate the range of resource types or does not provide the flexibility and control over sharing relationships needed to establish VOs.
11 But Why Now?
- The Internet and the increasing use of wireless devices provide universal connectivity.
- Many current research projects need teamwork and collaboration.
- Network vs. computer performance
  - Computer speed doubles every 18 months
  - Network speed doubles every 9 months
[Figure: Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan. 2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.]
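The two doubling times quoted on this slide imply a gap that widens quickly. The following is a minimal sketch of that arithmetic, assuming steady exponential growth at exactly the quoted rates. The function name and the five-year horizon are illustrative choices, not taken from the slides.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative improvement after `months`, given a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


HORIZON_MONTHS = 60  # illustrative five-year horizon (an assumption)

cpu_gain = growth_factor(HORIZON_MONTHS, 18)  # computer speed doubles every 18 months
net_gain = growth_factor(HORIZON_MONTHS, 9)   # network speed doubles every 9 months

if __name__ == "__main__":
    print(f"computers: x{cpu_gain:.1f}")
    print(f"networks:  x{net_gain:.1f}")
    print(f"network gain relative to computers: x{net_gain / cpu_gain:.1f}")
```

Under these assumptions networks improve by roughly two orders of magnitude over five years while computers improve by roughly one, which is the argument for treating remote resources as if they were nearby.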
12 Major Grid Projects
Name | URL | Sponsors | Focus
BlueGrid | | IBM | Grid testbed linking IBM laboratories
DISCOM | www.cs.sandia.gov/discom | DOE Defense Programs | Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid | sciencegrid.org | DOE Office of Science | Create operational Grid providing access to resources and applications at U.S. DOE science laboratories and partner universities
Earth System Grid (ESG) | earthsystemgrid.org | DOE Office of Science | Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid | eu-datagrid.org | European Union | Create and apply an operational grid for applications in high energy physics, environmental science, bioinformatics
13 Major Grid Projects
Name | URL | Sponsors | Focus
EuroGrid, Grid Interoperability (GRIP) | eurogrid.org | European Union | Create technologies for remote access to supercomputer resources and simulation codes; in GRIP, integrate with the Globus Toolkit
Fusion Collaboratory | fusiongrid.org | DOE Office of Science | Create a national computational collaboratory for fusion research
Globus Project | globus.org | DARPA, DOE, NSF, NASA, Microsoft | Research on Grid technologies; development and support of the Globus Toolkit; application and deployment
GridPP | gridpp.ac.uk | U.K. eScience | Create and apply an operational grid within the U.K. for particle physics research
Information Power Grid | ipg.nasa.gov | | Create and apply a production Grid for aerosciences and other NASA missions
Grid Research Integration Dev. and Support Center | grids-center.org | NSF | Integration, deployment, and support of the NSF Middleware Infrastructure for research and education
14 www.teragrid.org
15 iVDGL: International Virtual Data Grid Laboratory
www.ivdgl.org
16 Grid Architecture
17 Why Do We Need It?
- To structure the development of new technology
  - Common vocabulary, guidance, perspective
- To
  - Identify the fundamental system components
  - Specify the purpose and function of these components
  - Indicate how these components interact
- Emphasizes
  - Identification and definition of protocols and services
  - APIs and SDKs
- A protocol architecture: a mechanism for VO users and resources to negotiate and manage sharing relationships
- Facilitates
  - Extensibility
  - Interoperability
18 Why is interoperability a fundamental concern?
- Standard protocols enable standard services that abstract away resource-specific details.
- To accelerate application development in complex and dynamic execution environments we need APIs and SDKs.
- These technologies and services constitute middleware.
19 Layered Grid Architecture
- Analogy to the Internet architecture
- The high-level description places few constraints on design and implementation
20 Features
- Open and extensible
- Built on Internet protocols and services
  - Communication, routing, name resolution, etc.
- Layering is conceptual: no constraints on who can call what
- Protocols/services/APIs/SDKs will (ideally) be largely self-contained
- Things like communication and security are very fundamental
- Advantageous for higher-layer functions to use common lower-level functions
21 The Hourglass Model
- Resource and connectivity protocols form the neck of the hourglass
- Designed to be implemented on top of a diverse range of resource types (fabric layer)
- Can be used to construct a wide range of global and application-specific services (collective layer)
[Figure: hourglass, top to bottom: Applications, Diverse global services, Core services, Local OS]
22 What's the Status?
- No official standards yet.
- The Globus Toolkit is the unofficial standard for several connectivity, resource, and collective protocols.
- The Global Grid Forum (GGF) has a Grid Protocol Architecture group.
- In security, some RFCs are available.
- In scheduling and resource management, some working documents are available.
23 Globus Toolkit
- A software toolkit (GTK) that addresses key issues to pave the road
- Offers a set of technologies (NCSA's Grid-in-a-Box)
- Tries to standardize the Grid protocols and APIs
- Open architecture and open source (reference implementation)
- Defines Grid protocols and APIs
- Integrates and extends existing protocols
- Provides software tools that make it easier to build computational grids and grid-based applications
- Learns from the experience gained through deployment and applications
24 Key Components
- Security
- Communication
- Information Infrastructure
- Fault Detection
- Resource Management
- Portability
- Data Management
25 Fabric Layer
- What do you expect? Diverse resources that may be shared
- A resource can be a logical entity such as a distributed file system or a computer cluster, but the Grid architecture doesn't care
- Components implement the local, resource-specific operations that occur on specific resources (physical or logical)
- Trade-off: rich fabric functionality vs. ease of deployment
  - Richer functionality enables more sophisticated sharing operations (e.g., reservation)
  - Fewer demands simplify Grid infrastructure deployment
- Should implement enquiry mechanisms and resource management
26 Fabric in Globus Toolkit
- Is designed to use existing fabric components
- If a vendor doesn't provide the necessary behavior, GTK includes the missing piece
- Resource management is generally the domain of local resource managers.
- GARA (General-purpose Architecture for Reservation and Allocation) can provide QoS for different types of resources
27 Connectivity
- Communication
  - Enables exchange of data between fabric-layer resources
  - Includes transport, routing, naming (TCP, IP, DNS)
- Authentication
  - Builds on communication protocols
  - Uniform authentication, authorization, and message protection in multi-institutional scenarios
  - Based on existing standards whenever possible
- Various requirements
  - Single sign-on (access to multiple Grid resources without further user intervention)
  - Delegation (a user's program is able to access the resources on which the user is authorized)
  - Integration with various local security solutions (identity mapping)
  - User-based trust relationships (must not require the various resource providers to cooperate in configuring the security environment)
  - Stakeholders should have final control over authorization decisions
28 Security in GTK
- The Grid Security Infrastructure (GSI) is based on
  - Public-key cryptography
  - X.509 certificates
  - SSL/TLS communication
  - GSS-API (Generic Security Service API)
- Extensions are added for single sign-on and delegation
- Stakeholder control is supported via GAA (Generic Authorization and Access).
- GSI adheres to the IETF's standard GSS-API
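Single sign-on and delegation both rest on short-lived proxy credentials derived from the user's long-lived identity. The sketch below models only the chain-and-lifetime logic, and it performs no cryptography. The class, the function names, and the subject strings are hypothetical and are not part of GSI or any Globus API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    """Toy stand-in for a certificate: no real cryptography is performed."""
    subject: str
    issuer: Optional["Credential"]  # None marks the user's long-lived identity
    not_after: float                # expiry time, seconds on an arbitrary clock


def delegate(parent: Credential, lifetime: float, now: float) -> Credential:
    """Issue a short-lived proxy that is (conceptually) signed by `parent`."""
    expiry = min(now + lifetime, parent.not_after)  # a proxy never outlives its issuer
    return Credential(subject=parent.subject + "/proxy", issuer=parent, not_after=expiry)


def is_valid(cred: Credential, now: float, trusted_subject: str) -> bool:
    """Walk the chain back to the user identity, checking every expiry on the way."""
    link: Optional[Credential] = cred
    while link is not None:
        if now >= link.not_after:
            return False
        if link.issuer is None:
            return link.subject == trusted_subject
        link = link.issuer
    return False


user = Credential("/O=Grid/CN=Alice", None, not_after=365 * 86400.0)
login_proxy = delegate(user, lifetime=12 * 3600.0, now=0.0)   # single sign-on
job_proxy = delegate(login_proxy, lifetime=3600.0, now=60.0)  # delegation to a job
```

The user authenticates once to create `login_proxy`; every later hop presents a proxy derived from it, so no further password or key prompt is needed, and a stolen proxy is only useful until it expires.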
29 GSI Example Scenario: Create Processes at A and B that Communicate and Access Files at C
[Figure labels: User; Site A (Kerberos), computer; Site B (Unix), computer; Site C (Kerberos), storage system]
30 Resource Layer
- Concerned entirely with individual resources
- Addresses
  - Resource discovery
  - Reservation and allocation
  - Monitoring and control
  - Secure negotiation
- Two primary classes of protocols
  - Information protocols: obtain information about the structure and state of a resource
  - Management protocols: act as a policy application point
- Sits in the neck of the hourglass, so the number of such protocols should be limited to a small and focused set.
31 Resource Layer Components in GTK
- GRIP (Grid Resource Information Protocol)
  - Based on LDAP
  - Defines standard resource information
- GRRP (Grid Resource Registration Protocol)
  - Registers resources with Grid Index Information Servers
- GRAM (Grid Resource Access and Management)
  - HTTP-based RPC
  - Used for allocation of computational resources
  - Used for monitoring and control of computation on these resources
- GridFTP
  - Allows grid applications to have secure, ubiquitous, high-performance access to data
  - Uses GSI for authentication
  - New extensions to the FTP protocol for
    - parallel data transfer
    - partial file transfer
    - third-party (server-to-server) data transfer
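Parallel and partial transfer combine naturally: a file is divided into byte ranges, each range is fetched as a partial transfer on its own stream, and the pieces are reassembled in order. The sketch below illustrates that idea on an in-memory byte string. It is not the GridFTP protocol or a Globus client API, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def split_ranges(size: int, streams: int) -> List[Tuple[int, int]]:
    """Divide `size` bytes into at most `streams` contiguous (offset, length) ranges."""
    if size <= 0 or streams <= 0:
        return []
    chunk = -(-size // streams)  # ceiling division
    return [(off, min(chunk, size - off)) for off in range(0, size, chunk)]


def fetch_range(source: bytes, offset: int, length: int) -> bytes:
    """Stand-in for one partial transfer; a real client would read from a server."""
    return source[offset:offset + length]


def parallel_fetch(source: bytes, streams: int) -> bytes:
    """Fetch all ranges concurrently and reassemble them in offset order."""
    ranges = split_ranges(len(source), streams)
    with ThreadPoolExecutor(max_workers=max(1, streams)) as pool:
        parts = pool.map(lambda r: fetch_range(source, *r), ranges)
        return b"".join(parts)
```

`ThreadPoolExecutor.map` returns results in input order, so the reassembled bytes match the source even when individual ranges finish out of order.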
32 Collective Layer
- Coordinates multiple resources
- Protocols and services (APIs, SDKs) are not associated with any one resource but rather are global in nature
- Capture interactions across collections of resources
- Examples
  - Index servers, aka metadirectory services: custom views on dynamic resource collections
  - Resource brokers (e.g., Condor-G Matchmaker, AppLeS, Nimrod-G, DRM broker): resource discovery and allocation
  - Replica catalog services
  - Co-reservation and co-allocation services
  - Software discovery services: select the best software implementation and platform based on the problem parameters (e.g., NetSolve, Ninf)
  - Community accounting and payment services
  - Collaboratory services (e.g., CAVERNsoft)
33 Information Infrastructure in GTK
- An infrastructure that provides coherent system information spanning virtual organizations is necessary
- MDS (Metacomputing Directory Service)
  - Uses LDAP (Lightweight Directory Access Protocol)
  - Provides a uniform means of querying system information from a rich variety of system components
  - Uniform namespace for resource information across a system that may involve many organizations
- GRIS (Grid Resource Information Service): a uniform means of querying resources for their current configuration, capabilities, and status
- GIIS (Grid Index Information Service): a means of knitting together arbitrary GRIS services to provide a coherent system image
  - GIISes provide a mechanism for identifying "interesting" resources
34 Resource Management Services in GTK
- Three main components
  - RSL (Resource Specification Language): to communicate resource requirements
  - GRAM (Grid Resource Allocation and Management): a standardized interface to all of the various local resource management tools (e.g., Condor, LSF, PBS)
  - DUROC (Dynamically-Updated Request Online Co-allocator): coordinates a single request that may span multiple GRAMs
- Resource broker: handles the mapping of high-level application requests into requests to individual resource managers
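The division of labor on this slide can be sketched as a broker that turns one high-level request into per-manager sub-requests, which are then rendered as RSL-style strings. This is an illustration only. The attribute names `executable`, `count`, and `resourceManagerContact` follow common RSL usage, but the exact grammar should be checked against the RSL documentation. The broker policy (largest free pool first) and the host names are assumptions.

```python
from typing import Dict, List, Union

Value = Union[str, int]


def to_rsl(request: Dict[str, Value]) -> str:
    """Render attribute/value pairs as a conjunction: &(attr=value)(attr=value)..."""
    return "&" + "".join(f"({key}={value})" for key, value in request.items())


def to_multirequest(parts: List[Dict[str, Value]]) -> str:
    """Render several sub-requests as one multi-request for a co-allocator."""
    return "+" + "".join(f"({to_rsl(part)})" for part in parts)


def broker(executable: str, total_cpus: int, free_cpus: Dict[str, int]) -> List[Dict[str, Value]]:
    """Toy broker: spread a CPU request over managers, largest free pool first."""
    plan: List[Dict[str, Value]] = []
    remaining = total_cpus
    for contact, free in sorted(free_cpus.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            plan.append({"resourceManagerContact": contact,
                         "executable": executable, "count": take})
            remaining -= take
    if remaining:
        raise ValueError("not enough free CPUs for the request")
    return plan
```

For example, `to_multirequest(broker("/bin/sim", 10, {"pbs.site-a.example": 8, "lsf.site-b.example": 4}))` yields one string with two sub-requests, which a co-allocator in the role of DUROC would split and hand to the individual GRAMs.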
35 Resource Management Architecture
[Figure labels: Application; RSL; RSL specialization; Information Service (queries, info); simple ground RSL; GRAM (three instances); local resource managers: LSF, Condor, PBS]
36 GRAM Components
[Figure: The client makes MDS client API calls to the MDS Grid Index Info Server to locate resources and to the MDS Grid Resource Info Server to get resource info, then makes GRAM client API calls to request resource allocation and process creation. Inside the site boundary, behind the Grid Security Infrastructure, the Gatekeeper creates a Job Manager. The Job Manager parses the request with the RSL Library, asks the Local Resource Manager to allocate and create processes, and monitors and controls them. State changes return to the client as GRAM client API callbacks. The Grid Resource Info Server queries the current status of the resource.]
37 GRAM
- GRAM-1: HTTP-based RPC
- GRAM-1.5 (reliability improvements)
  - Once-and-only-once submission
  - Reliable termination detection
- GRAM-2: towards integration with web services (SOAP)
  - OGSA (Open Grid Services Architecture)
- Gatekeeper
  - Single point of entry: a secure inetd
- Job Manager
  - Layers on top of the local resource management system (e.g., PBS, LSF)
  - Handles remote interaction with the job
38 Data Management in GTK
- "Access to distributed data is typically as important as access to distributed computational resources" (Globus)
- Tools for managing data in Grid systems and applications
- Also called the Data Grid.
- GridFTP
- Data replication
  - Two tools for managing data replicas (multiple copies of data stored in different systems to improve access across geographically distributed Grids)
  - Replica Catalog: based on an LDAP directory
  - Replica Manager: combines the Replica Catalog with GridFTP to manage data replication
- GASS (Global Access to Secondary Storage)
  - Allows applications to access data stored in any remote file system by specifying a URL
  - The URL can be HTTP or FTP
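A replica catalog is essentially a mapping from a logical file name to the URLs of its physical copies, and a replica manager picks one of them. The sketch below keeps the catalog in memory rather than in LDAP. The class, the cost table, and the host names are hypothetical.

```python
from typing import Dict, List
from urllib.parse import urlparse


class ReplicaCatalog:
    """Toy catalog: maps a logical file name to the URLs of its physical copies."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = {}

    def add(self, logical_name: str, url: str) -> None:
        self._replicas.setdefault(logical_name, []).append(url)

    def locations(self, logical_name: str) -> List[str]:
        return list(self._replicas.get(logical_name, []))


def pick_replica(catalog: ReplicaCatalog, logical_name: str, cost: Dict[str, float]) -> str:
    """Choose the replica on the cheapest host; unknown hosts count as most expensive."""
    urls = catalog.locations(logical_name)
    if not urls:
        raise KeyError(logical_name)
    return min(urls, key=lambda u: cost.get(urlparse(u).hostname or "", float("inf")))
```

The `cost` table could hold measured latency or inverse bandwidth per host. The application only ever names the logical file, and the choice of physical copy stays with the data management layer.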
39 Condor-G: A Computation Management Agent for Multi-Institutional Grids
40 What is Condor-G?
- Condor enhanced with Globus Toolkit components to harness multi-domain resources as if they all belonged to one personal domain
- An example of applying the general-purpose GTK components to solve a particular problem (i.e., high-throughput computing on the Grid)
41 Separation of Concerns
- Remote resource access
  - Secure remote resource discovery, allocation, and management
  - Uses Globus Toolkit components
- Computation management
  - Via a user computation management agent responsible for job submission, job management, and error recovery
  - Taken from the Condor system
- Remote execution
  - Via mobile sandboxing to create a user-tailored execution environment on a remote node
  - Taken from the Condor system
42 Why Condor for Grid Jobs?
- Advantages of using Condor-G to manage Grid jobs
  - Can query a job's status or cancel a job
  - Credential management
  - Get informed of job termination or problems via callbacks or asynchronous mechanisms such as email
  - Access to detailed logs with a complete history of the job's execution
  - Fault tolerance and exactly-once semantics
43 Job Execution
- Stages a job's standard I/O and executable using GASS
- Submits jobs to remote machines using the revised GRAM (1.5)
  - Job manager checkpoint and restart
  - Two-phase commit during job submission
- Monitors job status and recovers from remote failures using revised GRAM callbacks and status calls
- Condor-G handles resubmission of failed jobs, communication with the user concerning unusual and erroneous conditions, and recording of computation status to support restart
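Two-phase commit is what makes submission exactly-once: the client can resend a request whose reply was lost without risking a duplicate job. The sketch below illustrates the idea with an in-process stand-in for the remote side. It is not the GRAM wire protocol, and the class and method names are hypothetical.

```python
from typing import Dict, List


class ToyGatekeeper:
    """Remote side: remembers request IDs so retries never create a second job."""

    def __init__(self) -> None:
        self.state: Dict[str, str] = {}   # request id -> "pending" | "running"
        self.started: List[str] = []      # one entry per job actually started

    def submit(self, request_id: str) -> str:
        # Phase 1: record the request; a duplicate ID is acknowledged, not re-created.
        self.state.setdefault(request_id, "pending")
        return request_id

    def commit(self, request_id: str) -> None:
        # Phase 2: start the job only on the first commit for this ID.
        if self.state.get(request_id) == "pending":
            self.state[request_id] = "running"
            self.started.append(request_id)


def submit_exactly_once(gk: ToyGatekeeper, request_id: str, lost_replies: int) -> None:
    """Client side: resend the same ID after each lost reply, then commit."""
    for _ in range(lost_replies + 1):
        gk.submit(request_id)   # safe to repeat: the ID makes it idempotent
    gk.commit(request_id)
    gk.commit(request_id)       # even a repeated commit does not start a second job
```

The design choice is that the client, not the server, generates the request ID, so the client can persist it before the first send and reuse it after its own crash.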
44 Execution Mechanism
45 Fault Tolerance
- Tolerates four types of failure
- Local crash
  - Crash of the host on which the GridManager is running (or crash of the GridManager alone)
  - Queue state is stored on disk
  - Reconnect to the JobManagers that were running at the time of the crash
- Remote crash
  - Crash of the Globus JobManager
    - Start a new JobManager
  - Crash of the machine that hosts the remote resource (Gatekeeper, JobManager)
    - Wait until connectivity returns
    - Start a new JobManager
- Network failures
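Recovery from a local crash depends on the queue state being on disk before the crash happens. The sketch below shows a minimal version of that pattern, assuming a JSON file as the persistent store. The file layout, the class, and the contact strings are hypothetical and do not reflect Condor-G's actual on-disk format.

```python
import json
import os
import tempfile
from typing import Dict, List


class PersistentQueue:
    """Toy job queue whose state is rewritten to disk after every change."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.jobs: Dict[str, str] = {}  # job id -> contact of its remote JobManager
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                self.jobs = json.load(fh)

    def _save(self) -> None:
        # Write to a temporary file and rename, so a crash never leaves a half-written queue.
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as fh:
            json.dump(self.jobs, fh)
        os.replace(tmp, self.path)

    def add(self, job_id: str, jobmanager_contact: str) -> None:
        self.jobs[job_id] = jobmanager_contact
        self._save()

    def contacts_to_reconnect(self) -> List[str]:
        """After a restart: the JobManagers that were running at the time of the crash."""
        return sorted(set(self.jobs.values()))


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "queue.json")
        before = PersistentQueue(path)
        before.add("1.0", "https://site-a.example:4001/jm1")
        before.add("2.0", "https://site-b.example:4002/jm7")
        del before                     # simulate the local crash
        after = PersistentQueue(path)  # restart: state comes back from disk
        return after.contacts_to_reconnect()
```

Calling `demo()` returns the contacts recorded before the simulated crash, which is the list a restarted agent would use to reconnect to its JobManagers.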
46 Credential Management
- Authentication is done with limited-lifetime X.509 proxies
- A credential may expire before the job finishes execution
- The Condor-G agent periodically analyzes the credentials of all users with currently queued jobs
  - Can put jobs on hold and e-mail the user to refresh the proxy
  - Can forward new credentials to execution sites
- Using the MyProxy system, which lets a user store a long-lived proxy credential on a secure server
  - Remote services acting on behalf of the user can obtain short-lived proxies
  - Condor-G can use these to refresh the user credential automatically
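The periodic check described on this slide can be sketched as one pass over the users with queued jobs: proxies with enough lifetime are left alone, expiring ones are refreshed when a renewal service is available, and otherwise the user's jobs are held. The `renew` callback stands in for a MyProxy-style service. The names and the threshold are assumptions, not Condor-G's real interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Proxy:
    owner: str
    expires_at: float  # seconds on the same clock as `now`


def check_credentials(
    proxies: Dict[str, Proxy],
    now: float,
    min_remaining: float,
    renew: Optional[Callable[[str], Proxy]] = None,
) -> List[str]:
    """Refresh proxies that are about to expire; return users whose jobs must be held.

    `renew` stands in for a MyProxy-style service; when it is None (or fails),
    the user's jobs are held so the user can be asked to refresh by hand.
    """
    held: List[str] = []
    for user, proxy in proxies.items():
        if proxy.expires_at - now >= min_remaining:
            continue  # still has enough lifetime left
        fresh = None
        if renew is not None:
            try:
                fresh = renew(user)
            except Exception:
                fresh = None
        if fresh is not None and fresh.expires_at - now >= min_remaining:
            proxies[user] = fresh  # would also be forwarded to the execution sites
        else:
            held.append(user)      # hold jobs and e-mail the user
    return held
```

An agent would run this on a timer and act on the returned list, holding the jobs of each listed user and sending the reminder e-mail.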
47 Resource Discovery and Scheduling
- Simple: user supplies a list of GRAM servers
- Resource broker in the Condor-G agent: Condor Matchmaking
- Flood candidate resources: GlideIns
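Matchmaking pairs a job with a machine only when each side's requirements accept the other, then uses the job's rank to choose among the acceptable machines. The sketch below captures that two-sided test with Python callables. It is not the ClassAd language, and the class and attribute names are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]


class ClassAdLike:
    """Toy advertisement: attributes plus a requirement and a rank over the other party."""

    def __init__(self, attrs: Ad,
                 requirements: Callable[[Ad], bool],
                 rank: Callable[[Ad], float] = lambda other: 0.0) -> None:
        self.attrs = attrs
        self.requirements = requirements
        self.rank = rank


def match(job: ClassAdLike, machines: List[ClassAdLike]) -> Optional[ClassAdLike]:
    """Return the machine the job ranks highest among mutually acceptable ones."""
    candidates = [m for m in machines
                  if job.requirements(m.attrs) and m.requirements(job.attrs)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job.rank(m.attrs))
```

Because the machine's requirements are evaluated against the job as well, a resource owner keeps control over who may use the resource, which matches the stakeholder-control requirement stated for the connectivity layer.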
48 GlideIn Mechanism
- Uses the same Condor mechanism to start on a remote node not a user job, but a daemon
- The daemon traps system calls made by the user's job and redirects them back to the originating system
- Periodically checkpoints the job and migrates it to another location if requested
- The Condor-G GlideIn mechanism uses Grid protocols to dynamically create a personal Condor pool out of Grid resources by gliding in Condor daemons to remote resources
- Allows the binding of an application to a resource to be delayed
  - Avoids queuing delays
  - Can guarantee optimal queuing times to users
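The benefit of delayed binding can be shown with a very small model. With early binding the job waits in the queue of the one site chosen at submit time. With GlideIns a daemon is queued at every candidate site and the job runs on whichever daemon starts first. The sketch below assumes the queue wait per site is known, which a real agent would not know in advance. The function names and numbers are illustrative.

```python
from typing import Dict, Tuple


def early_binding(queue_wait: Dict[str, float], chosen_site: str) -> float:
    """Job is bound to one site at submit time and waits in that site's queue."""
    return queue_wait[chosen_site]


def late_binding(queue_wait: Dict[str, float]) -> Tuple[str, float]:
    """A glide-in daemon is queued at every site; the job runs on the first to start."""
    site = min(queue_wait, key=lambda s: queue_wait[s])
    return site, queue_wait[site]
```

With waits of 3600, 120, and 900 seconds at three sites, a job bound early to the first site waits an hour, while the late-bound job starts after two minutes on the second. The price is that daemons at the slower sites occupy queue slots they may never use.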
49 Remote Execution via GlideIn
50 Grid Architecture in Practice
App: High Throughput Computing System
Collective (App): Dynamic checkpoint, job management, failover, staging
Collective (Generic): Brokering, certificate authorities
Resource: Access to data, access to computers, access to network performance data
Connect: Communication, service discovery (DNS), authentication, authorization, delegation
Fabric: Storage systems, schedulers
51 Future and Conclusions
- Evolution
  - Past to present: O(10^2) computers, Mb/s networks, local (centralized) control
  - Present: O(10^4-10^6) data systems and computers, Gb/s networks, restricted decentralized control
  - Future: O(10^6-10^9) data systems, sensors, computers, instruments; highly flexible policy and control
- A computer (including software) is a dynamically, often collaboratively constructed collection of processors, data sources, networks, sensors, and instruments
- Open the faucet, get the water; connect to the Grid, get the compute power
- We need a powerful computational economy model (bidding systems, new optimization algorithms)
52 Summary
- The Grid problem: resource sharing and coordinated problem solving in dynamic, multi-institutional virtual organizations
- Grid architecture: emphasizes protocol and service definition to enable interoperability and resource sharing
- Globus Toolkit: a source of protocols and APIs, and a reference implementation
- Condor-G: applies the general-purpose Globus Toolkit to solve high-throughput computing on the Grid
53 Thanks for Your Patience