The Grid, Globus Tool Kit, Condor - PowerPoint PPT Presentation

Provided by: charmC
Learn more at: http://charm.cs.uiuc.edu

1
The Grid, Globus Toolkit, Condor-G and What?
2
Outline
  1. The Grid Problem
  2. The Grid Architecture
  3. The Globus Toolkit
  4. The Condor-G

3
Culled From
  • http://www.globus.org
  • The Anatomy of the Grid: Enabling Scalable
    Virtual Organizations. I. Foster, C. Kesselman,
    S. Tuecke. International J. Supercomputer
    Applications, 15(3), 2001
  • Globus: A Metacomputing Infrastructure Toolkit.
    I. Foster, C. Kesselman. Intl J. Supercomputer
    Applications, 11(2):115-128, 1997
  • Condor-G: A Computation Management Agent for
    Multi-Institutional Grids. J. Frey, T. Tannenbaum,
    I. Foster, M. Livny, S. Tuecke. Proc. of HPDC10, 2001
  • http://charm.cs.uiuc.edu/ppl_research/faucets/

4
"The Grid"
  • Coined in the 1990s to denote a proposed distributed
    computing architecture.
  • "Flexible, secure, coordinated resource sharing
    among dynamic collections of individuals,
    institutions and resources" (from "The
    Anatomy of the Grid")
  • Resource sharing
      • Computers, storage, sensors, networks, scientific
        instruments
      • Sharing is highly controlled: providers and
        consumers define
          • What is shared
          • Who is allowed to share
          • Conditions for sharing
  • Coordinated problem solving
      • Beyond client-server: distributed data analysis,
        visualization, computation, collaboration
  • Similar to the power grid, faucets (water
    supply), the nationwide phone system.
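
The three things providers and consumers define (what is shared, who may share it, and under which conditions) can be pictured as a small rule check. This is only an illustrative sketch: the class names, the `max_cpu_hours` condition and the resource names are assumptions made for the example, not part of any Grid software.

```python
from dataclasses import dataclass, field


@dataclass
class SharingRule:
    """One provider-defined rule: what is shared, with whom, and when."""
    resource: str            # what is shared
    allowed_members: set     # who is allowed to share
    max_cpu_hours: float     # a condition for sharing (hypothetical)


@dataclass
class Provider:
    rules: list = field(default_factory=list)

    def may_use(self, member: str, resource: str, cpu_hours: float) -> bool:
        """True only if some rule covers the member, resource and condition."""
        return any(
            rule.resource == resource
            and member in rule.allowed_members
            and cpu_hours <= rule.max_cpu_hours
            for rule in self.rules
        )


# Hypothetical provider sharing one cluster with two VO members.
provider = Provider([SharingRule("cluster-a", {"alice", "bob"}, max_cpu_hours=100.0)])
```

A request is granted only when all three parts of a rule are satisfied at once, which is the sense in which sharing is "highly controlled".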

5
Virtual Organization
  • A set of individuals and/or institutions defined by some
    set of sharing rules forms a Virtual Organization
    (VO).
  • A VO may contain
      • Application Service Providers (ASPs)
      • Storage Service Providers (SSPs)
      • Cycle providers
  • VOs lack
      • Central control
      • Central location
      • Existing trust relationships

6
Why Grids?
  • Civil engineers collaborate to design, execute &
    analyze shake table experiments
  • Climate scientists visualize, annotate & analyze
    terabytes of simulation datasets
  • An emergency response team couples real-time
    data, weather models, population data
  • An application service provider purchases cycles
    from a compute cycle provider
  • A biomedical engineer exploits 10K computers to
    screen 100K compounds in an hour.
  • NEESgrid: national infrastructure to couple
    earthquake engineers with experimental
    facilities, databases, computers & each other
    (Argonne, NCSA, Michigan, UIUC, USC)

7
Online Access to Scientific Instruments
[Figure: Advanced Photon Source workflow: real-time
collection, tomographic reconstruction, archival storage,
wide-area dissemination, desktop & VR clients with shared
controls]
DOE X-ray grand challenge: ANL, USC/ISI, NIST,
U.Chicago
8
Data Grids for High Energy Physics
[Figure] Image courtesy: Harvey Newman, Caltech
9
Is it Really NEW?
  • Grid computing has much in common with
    existing industrial thrusts
      • Application & Storage Service Providers (ASPs,
        SSPs)
      • Internet peer-to-peer computing
      • Enterprise computing systems
      • Business-to-business exchanges
  • SSPs & ASPs allow organizations to outsource
    storage & computing requirements to other parties
    (typically via a VPN)
  • Enterprise distributed computing technologies
    (CORBA, Enterprise Java) enable resource sharing
    within a single organization
  • Business-to-business virtual enterprise
    technologies & exchanges focus on information
    sharing via central servers.

10
Is it NEW?
  • Sharing is not adequately addressed by these
    technologies.
      • Complicated requirements: run program X at site
        Y subject to community policy P, using data at Z
        according to policy Q
      • High performance requirements of most
        applications.
      • Users may not care where their program runs,
        as long as it satisfies their QoS requirements
        (Faucets)
      • Controlled, dynamic sharing within VOs
  • Current technology either doesn't accommodate the range
    of resource types or doesn't provide the flexibility and
    control over sharing relationships needed to establish VOs

11
But Why Now???
  • The Internet & the increasing use of wireless
    devices provide universal connectivity.
  • Many current research projects need teamwork &
    collaboration.
  • Network vs. computer performance
      • Computer speed doubles every 18 months
      • Network speed doubles every 9 months

[Figure: Moore's Law vs. storage improvements vs. optical
improvements. Graph from Scientific American
(Jan-2001) by Cleo Vilett; source: Vinod Khosla,
Kleiner Perkins Caufield & Byers.]
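
To see what the two doubling periods imply, the following sketch compounds them over a chosen time span. The function names are made up for illustration; the 18-month and 9-month figures are the ones quoted on the slide.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Factor by which a quantity grows if it doubles every given period."""
    return 2.0 ** (months / doubling_period_months)


def network_to_compute_ratio(months: float) -> float:
    """How much further networks improve than computers over `months`."""
    compute = growth_factor(months, 18.0)  # slide: doubles every 18 months
    network = growth_factor(months, 9.0)   # slide: doubles every 9 months
    return network / compute
```

Under these assumptions, three years gives a 4x gain in computer speed and a 16x gain in network speed, so the network pulls ahead by a factor of 4.
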
12
Major Grid Projects
Name | URL | Sponsors | Focus
BlueGrid | (none listed) | IBM | Grid testbed linking IBM laboratories
DISCOM | www.cs.sandia.gov/discom | DOE Defense Programs | Create operational Grid providing access to resources at three U.S. DOE weapons laboratories
DOE Science Grid | sciencegrid.org | DOE Office of Science | Create operational Grid providing access to resources & applications at U.S. DOE science laboratories & partner universities
Earth System Grid (ESG) | earthsystemgrid.org | DOE Office of Science | Delivery and analysis of large climate model datasets for the climate research community
European Union (EU) DataGrid | eu-datagrid.org | European Union | Create & apply an operational grid for applications in high energy physics, environmental science, bioinformatics
13
Major Grid Projects
Name | URL | Sponsors | Focus
EuroGrid, Grid Interoperability (GRIP) | eurogrid.org | European Union | Create technologies for remote access to supercomputer resources & simulation codes; in GRIP, integrate with Globus Toolkit
Fusion Collaboratory | fusiongrid.org | DOE Office of Science | Create a national computational collaboratory for fusion research
Globus Project | globus.org | DARPA, DOE, NSF, NASA, Microsoft | Research on Grid technologies; development and support of Globus Toolkit; application and deployment
GridPP | gridpp.ac.uk | U.K. eScience | Create & apply an operational grid within the U.K. for particle physics research
Information Power Grid | ipg.nasa.gov | NASA | Create & apply a production Grid for aero sciences and other NASA missions
Grid Research Integration Dev. & Support Center | grids-center.org | NSF | Integration, deployment, support of the NSF Middleware Infrastructure for research & education
14
www.teragrid.org
15
iVDGL: International Virtual Data Grid Laboratory
www.ivdgl.org
16
Grid Architecture
17
Why Do We Need It?
  • To structure the development of new technology
      • Common vocabulary, guidance, perspective
  • To
      • Identify the fundamental system components
      • Specify the purpose and function of these
        components
      • Indicate how these components interact
  • Emphasizes
      • Identification and definition of protocols and
        services
      • APIs and SDKs
      • A protocol architecture: mechanisms for VO users
        and resources to negotiate & manage sharing
        relationships
  • Facilitates
      • Extensibility
      • Interoperability

18
  • Why is interoperability a fundamental concern?
      • Standard protocols → standard services that
        abstract away resource-specific details.
      • To accelerate application development in complex
        and dynamic execution environments we need
        APIs & SDKs
      • The technology & services → middleware

19
Layered Grid Architecture
  • Analogy to Internet Architecture
  • High-level description: places few constraints
    on design & implementation

20
Features
  • Open and extensible
  • Built on Internet protocols & services
      • Communication, routing, name resolution, etc.
  • Layering is conceptual → no constraints on who
    can call what
      • Protocols/services/APIs/SDKs will (ideally) be
        largely self-contained
      • Things like communication & security are very
        fundamental
      • Advantageous for higher-layer functions to use
        common lower-level functions

21
The Hourglass Model
  • Resource and connectivity layers form the neck of
    the hourglass
  • Designed to be implemented on top of a diverse range
    of resource types (fabric layer)
  • Can be used to construct a wide range of global and
    application-specific services (collective layer)

[Figure: hourglass, top to bottom: Applications, Diverse
global services, Core services, Local OS]
22
What's the Status?
  • No official standards yet.
  • Globus Toolkit is the unofficial standard for
    several connectivity, resource & collective
    protocols
  • Global Grid Forum (GGF) has a Grid Protocol
    Architecture group
  • In security, some RFCs are available
  • In scheduling & resource management, some working
    documents are available

23
Globus Toolkit
  • A software toolkit (GTK) that addresses key issues to
    pave the road
  • Offers a set of technologies (NCSA's
    Grid-in-a-box)
  • Tries to standardize the Grid protocols and APIs
  • Open architecture & open source (reference
    implementation)
  • Defines Grid protocols & APIs
  • Integrates and extends existing protocols
  • Provides software tools that make it easier to
    build computational grids and grid-based
    applications
  • Learns from the experiences gained through
    deployment and applications

24
Key Components
  • Security
  • Communication
  • Information Infrastructure
  • Fault Detection
  • Resource Management
  • Portability
  • Data Management

25
Fabric Layer
  • What do you expect? A diverse range of resources that may
    be shared
  • A resource can be a logical entity such as a distributed
    file system or computer cluster, but the Grid
    architecture doesn't care
  • Components implement the local, resource-specific
    operations that occur on specific resources
    (physical or logical)
  • Trade-off: rich fabric functionality vs. ease of
    deployment
      • Richer functionality → more sophisticated sharing
        operations (e.g., reservation)
      • Fewer demands → simplified Grid infrastructure
        deployment.
  • Should implement enquiry mechanisms and resource
    management

26
Fabric in Globus Toolkit
  • Is designed to use existing fabric components
  • If a vendor doesn't provide the necessary
    behavior, GTK includes the missing piece
  • Resource management is generally the domain of
    local resource managers.
  • GARA (General-purpose Architecture for
    Reservation and Allocation) can provide QoS for
    different types of resources

27
Connectivity
  • Communication
      • Enables exchange of data between fabric layer
        resources
      • Includes transport, routing, naming (TCP, IP, DNS)
  • Authentication
      • Builds on communication protocols
      • Uniform authentication, authorization and message
        protection in multi-institutional scenarios
      • Based on existing standards whenever possible
  • Various requirements
      • Single sign-on (access to multiple Grid resources
        without further user intervention)
      • Delegation (a user's program is able to access
        the resources on which the user is authorized)
      • Integration with various local security solutions
        (identity mapping)
      • User-based trust relationships (must not require
        the various resources to cooperate in configuring
        the security environment)
  • Stakeholders should have final control over
    authorization decisions

28
Security in GTK
  • Grid Security Infrastructure (GSI) is based on
      • Public key cryptography
      • X.509 certificates
      • SSL/TLS communication
      • GSS-API (Generic Security Service API)
  • Extensions are added for single sign-on and
    delegation
  • Stakeholder control is supported via GAA (Generic
    Authorization and Access).
  • GSI adheres to the IETF's standard GSS-API
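
The single sign-on and delegation extensions can be pictured as a chain of short-lived proxies hanging off one long-term credential. The sketch below is a toy model with no cryptography: the `Credential` class, its fields and the example subject name are all assumptions for illustration, and real proxy certificates carry more structure than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    subject: str                     # identity the credential speaks for
    issuer: Optional["Credential"]   # None for the user's long-term credential
    expires_at: float                # absolute time in seconds

    def delegate(self, lifetime: float, now: float) -> "Credential":
        """Issue a proxy that can never outlive the credential that signed it."""
        return Credential(self.subject, self, min(self.expires_at, now + lifetime))

    def is_valid(self, now: float) -> bool:
        """A proxy is valid only if every link in its chain is unexpired."""
        cred = self
        while cred is not None:
            if now >= cred.expires_at:
                return False
            cred = cred.issuer
        return True


# Hypothetical usage: sign on once, then let a remote job delegate further.
user = Credential("/O=Grid/CN=Alice", None, expires_at=1000.0)
proxy = user.delegate(lifetime=100.0, now=0.0)
job_proxy = proxy.delegate(lifetime=500.0, now=10.0)
```

The user signs on once to create `proxy`; processes started on their behalf can then delegate again without asking the user, but only within the lifetime of the proxy above them.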

29
GSI Example Scenario: Create Processes at A and B
that Communicate & Access Files at C
[Figure: a user, computers at Site A (Kerberos) and
Site B (Unix), and a storage system at Site C (Kerberos)]
30
Resource Layer
  • Concerned entirely with individual resources
  • Addresses
      • Resource discovery
      • Reservation & allocation
      • Monitoring & control
      • Secure negotiation
  • Two primary classes of protocols
      • Information protocols: to obtain information
        about the structure & state of a resource
      • Management protocols: policy application point
  • Sits in the neck of the hourglass → the number of such
    protocols should be limited to a small and focused
    set.

31
Resource Layer Components in GTK
  • GRIP (Grid Resource Information Protocol)
      • Based on LDAP
      • Defines standard resource information
  • GRRP (Grid Resource Registration Protocol)
      • Registers resources with Grid Index Information
        Servers
  • GRAM (Grid Resource Access and Management)
      • HTTP-based RPC
      • Used for allocation of computational resources and
        for monitoring and controlling computation on
        these resources
  • GridFTP
      • Allows grid applications to have secure,
        ubiquitous, high-performance access to data
      • Uses the GSI for authentication
      • New extensions to the FTP protocol for
          • parallel data transfer
          • partial file transfer
          • third-party (server-to-server) data transfer
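
Parallel and partial transfer both rest on addressing a file by byte range. As a rough sketch of how a client might plan such a transfer (this is not the GridFTP wire protocol, and `split_into_ranges` is a hypothetical helper), a file can be cut into one contiguous chunk per stream:

```python
def split_into_ranges(file_size: int, streams: int) -> list:
    """Split [0, file_size) into contiguous (offset, length) chunks, one per stream."""
    if file_size < 0 or streams < 1:
        raise ValueError("file_size must be >= 0 and streams >= 1")
    base, extra = divmod(file_size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)  # spread the remainder evenly
        if length:                               # skip empty chunks for tiny files
            ranges.append((offset, length))
        offset += length
    return ranges
```

Each `(offset, length)` pair is a partial transfer in its own right; issuing them concurrently gives the parallel case.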

32
Collective Layer
  • Coordinates multiple resources
  • Protocols & services (APIs, SDKs) are not
    associated with any one resource but rather are
    global in nature
  • Captures interactions across collections of
    resources
  • Examples
      • Index servers, a.k.a. metadirectory services: custom
        views on dynamic resource collections
      • Resource brokers (e.g., Condor-G Matchmaker,
        AppLeS, Nimrod-G, DRM broker): resource
        discovery & allocation
      • Replica catalog services
      • Co-reservation & co-allocation services
      • Software discovery services: select the best s/w
        implementation and platform based on the problem
        parameters (e.g., NetSolve, Ninf)
      • Community accounting and payment services
      • Collaboratory services (e.g., CAVERNsoft)

33
Information Infrastructure in GTK
  • An infrastructure that provides coherent system
    information spanning virtual organizations is
    necessary
  • MDS (Metacomputing Directory Service)
      • Uses LDAP (Lightweight Directory Access Protocol)
      • Provides a uniform means of querying system
        information from a rich variety of system
        components
      • Uniform namespace for resource information across
        a system that may involve many organizations.
  • GRIS (Grid Resource Information Service): a
    uniform means of querying resources for their
    current configuration, capabilities, and status
  • GIIS (Grid Index Information Service): a means
    of knitting together arbitrary GRIS services to
    provide a coherent system image.
      • GIISes provide a mechanism for identifying
        "interesting" resources

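The division of labour between GRIS and GIIS can be sketched as per-resource services registered with an index that answers collection-wide queries. This is a toy in-memory model only: the real MDS is queried over LDAP, and the class and attribute names below are assumptions for illustration.

```python
class GRIS:
    """Toy stand-in for a per-resource information service."""

    def __init__(self, name: str, attributes: dict):
        self.name = name
        self.attributes = attributes

    def query(self) -> dict:
        """Report the resource's current configuration and status."""
        return {"name": self.name, **self.attributes}


class GIIS:
    """Toy stand-in for an index that knits several GRIS services together."""

    def __init__(self):
        self._registered = []

    def register(self, gris: GRIS) -> None:
        self._registered.append(gris)

    def find(self, predicate) -> list:
        """Return names of registered resources whose current info matches."""
        return [g.name for g in self._registered if predicate(g.query())]


# Hypothetical resources registered with one index.
index = GIIS()
index.register(GRIS("cluster-a", {"free_cpus": 64, "os": "linux"}))
index.register(GRIS("cluster-b", {"free_cpus": 4, "os": "linux"}))
```

Finding "interesting" resources then amounts to handing the index a predicate, for example `index.find(lambda info: info["free_cpus"] >= 16)`.
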
34
Resource Management Services in GTK
  • Three main components
      • RSL (Resource Specification Language): used to
        communicate resource requirements
      • GRAM (Grid Resource Allocation and Management):
        standardized interface to all of the various
        local resource management tools (e.g.,
        Condor, LSF, PBS)
      • DUROC (Dynamically-Updated Request Online
        Co-allocator): coordinates a single request that
        may span multiple GRAMs
  • Resource broker: handles the mapping of high-level
    application requests into requests to
    individual resource managers
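
To give a feel for what travels between these components, the sketch below renders a resource request as an RSL-style string: a conjunction of `(attribute=value)` relations introduced by `&`, and a multi-request introduced by `+` for the co-allocation case. It covers only a simplified subset of RSL; the helper names and the choice to quote every value are assumptions made for the example.

```python
def rsl_conjunction(attributes: dict) -> str:
    """Render a dict as an RSL-style conjunction of (name="value") relations."""
    relations = "".join(f'({name}="{value}")' for name, value in attributes.items())
    return "&" + relations


def rsl_multirequest(requests: list) -> str:
    """Render several conjunctions as one multi-request (the '+' operator)."""
    return "+" + "".join(f"({rsl_conjunction(r)})" for r in requests)
```

For instance, `rsl_conjunction({"executable": "/bin/date", "count": 2})` yields `&(executable="/bin/date")(count="2")`, the kind of fully specified ("ground") request a single GRAM would receive, while a multi-request is what a co-allocator such as DUROC would split across several GRAMs.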

35
Resource Management Architecture
[Figure: the application issues RSL; RSL specialization
uses queries to, and info from, the Information Service;
the resulting simple ground RSL is passed to three GRAMs,
which front the local resource managers LSF, Condor and
PBS]
36
GRAM Components
[Figure: the client makes MDS client API calls to locate
resources (MDS Grid Index Info Server) and to get resource
info (MDS Grid Resource Info Server), then GRAM client API
calls to request resource allocation and process creation,
and receives GRAM client API state change callbacks. Across
the site boundary, behind the Grid Security Infrastructure,
the Gatekeeper creates a Job Manager; the Job Manager parses
the request with the RSL Library, asks the Local Resource
Manager to allocate & create processes, and monitors &
controls them. The Grid Resource Info Server queries the
current status of the resource.]
37
GRAM
  • GRAM-1: HTTP-based RPC
  • GRAM-1.5 (reliability improvements)
      • Once-and-only-once submission
      • Reliable termination detection
  • GRAM-2: towards integration with web services
    (SOAP)
      • OGSA (Open Grid Services Architecture)
  • Gatekeeper
      • Single point of entry: a secure inetd
  • Job Manager
      • Layers on top of the local resource management system
        (e.g., PBS, LSF, etc.)
      • Handles remote interaction with the job

38
Data Management in GTK
  • "Access to distributed data is typically as
    important as access to distributed computational
    resources" (Globus)
  • Tools for managing data in Grid systems and
    applications
  • Also called Data Grid.
  • GridFTP
  • Data replication
      • Two tools for managing data replicas (multiple
        copies of data stored in different systems to
        improve access across geographically distributed
        Grids)
      • Replica Catalog: based on an LDAP directory
      • Replica Manager: combines the Replica Catalog
        with GridFTP to manage data replication
  • GASS (Global Access to Secondary Storage)
      • Allows applications to access data stored in any
        remote file system by specifying a URL
      • URL can be HTTP or FTP
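
The replica catalog idea, one logical name mapped to several physical copies, can be sketched with a dictionary. This is an in-memory toy rather than the LDAP-based catalog named above; the class, the `pick_replica` helper and the cost function are assumptions for illustration.

```python
class ReplicaCatalog:
    """Toy mapping from a logical file name to its physical copies."""

    def __init__(self):
        self._locations = {}

    def add_replica(self, logical_name: str, url: str) -> None:
        self._locations.setdefault(logical_name, []).append(url)

    def replicas(self, logical_name: str) -> list:
        return list(self._locations.get(logical_name, []))


def pick_replica(catalog: ReplicaCatalog, logical_name: str, cost) -> str:
    """Choose the replica with the lowest caller-supplied cost (e.g. latency)."""
    candidates = catalog.replicas(logical_name)
    if not candidates:
        raise KeyError(f"no replica registered for {logical_name!r}")
    return min(candidates, key=cost)
```

A replica manager in this picture would look up the copies, pick the cheapest one for the requesting site, and hand the chosen URL to the transfer tool.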

39
Condor-G: A Computation Management Agent for
Multi-Institutional Grids
40
What is Condor-G?
  • Condor enhanced with Globus Toolkit components to
    harness multi-domain resources as if they all
    belong to one personal domain
  • Example of applying the general purpose GTK
    components to solve a particular problem (i.e.,
    high-throughput computing on the Grid)

41
Separation of Concerns
  • Remote resource access
      • Secure remote resource discovery, allocation,
        management
      • Uses Globus Toolkit components
  • Computation management
      • Via a user computation management agent responsible
        for job submission, job management, error
        recovery
      • Taken from the Condor system
  • Remote execution
      • Via mobile sandboxing to create a user-tailored
        execution environment on a remote node
      • Taken from the Condor system

42
Why Condor for Grid Jobs?
  • Advantages of using Condor-G to manage Grid jobs
      • Can query a job's status or cancel a job
      • Credential management
      • Get informed of job termination or problems via
        callbacks or asynchronous mechanisms such as
        email
      • Access to detailed logs with the complete history of
        the job's execution
      • Fault tolerance and exactly-once semantics

43
Job Execution
  • Stages a job's standard I/O and executable using
    GASS
  • Submits jobs to remote machines using the revised
    GRAM (1.5)
      • Job manager checkpoint & restart
      • Two-phase commit during job submission
  • Monitors job status & recovers from remote
    failures using revised GRAM callbacks and status
    calls
  • Condor-G handles resubmission of failed jobs,
    communications with the user concerning unusual &
    erroneous conditions, and recording of computation
    status to support restart
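
Two-phase commit during submission is what lets a client retry safely after a lost reply without running the job twice. The sketch below models only that idea: `JobManagerStub`, the request identifiers and the method names are hypothetical and do not reflect the actual GRAM message format.

```python
class JobManagerStub:
    """Toy remote side: a job runs only after an explicit commit."""

    def __init__(self):
        self.pending = {}   # request_id -> job description
        self.started = []   # request_ids of jobs actually started

    def submit(self, request_id: str, job: str) -> str:
        """Phase 1: record the request; a retried duplicate changes nothing."""
        if request_id not in self.started:
            self.pending.setdefault(request_id, job)
        return request_id

    def commit(self, request_id: str) -> bool:
        """Phase 2: start the job once; repeated commits are harmless."""
        if request_id in self.pending:
            del self.pending[request_id]
            self.started.append(request_id)
        return request_id in self.started
```

Because both calls are idempotent for a given request identifier, the submitting side can simply resend whichever message it is unsure about.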

44
Execution Mechanism
45
Fault Tolerance
  • Tolerates four types of failure
  • Local crash
      • Crash of the host on which the GridManager is running
        (or crash of the GridManager alone)
      • Queue state is stored on disk
      • Reconnect to the JobManagers that were running at
        the time of the crash
  • Remote crash
      • Crash of the Globus JobManager
          • Start a new JobManager
      • Crash of the machine that hosts the remote
        resource (Gatekeeper, JobManager)
          • Wait until connectivity returns
          • Start a new JobManager
  • Network failures
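
The local-crash and remote-crash cases can be sketched together: queue state is written to disk, and after a restart each recorded JobManager is either reconnected to or replaced. The file layout, the function names and the two callbacks are assumptions for illustration, not Condor-G's actual recovery code.

```python
import json


def save_queue(path: str, jobs: dict) -> None:
    """Persist job_id -> JobManager contact so a restart can find running jobs."""
    with open(path, "w") as f:
        json.dump(jobs, f)


def recover(path: str, is_reachable, restart_job_manager) -> dict:
    """After a local crash, reconnect to each JobManager or start a new one."""
    with open(path) as f:
        jobs = json.load(f)
    recovered = {}
    for job_id, contact in jobs.items():
        if is_reachable(contact):
            recovered[job_id] = contact                      # reconnect
        else:
            recovered[job_id] = restart_job_manager(job_id)  # remote crash
    return recovered
```

A network failure would look the same as an unreachable JobManager here; a fuller version would wait for connectivity to return before deciding to start a new one.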

46
Credential Management
  • Authentication is done with limited-lifetime
    X.509 proxies
  • A credential may expire before the job finishes
    execution
  • The Condor-G agent periodically analyzes the
    credentials for all users with currently queued
    jobs
      • Can put jobs on hold and e-mail the user to refresh
        the proxy
      • Can forward new credentials to execution sites
  • Using the MyProxy system, a user can store
    a long-lived proxy credential on a secure server.
      • Remote services acting on behalf of the user can
        obtain short-lived proxies
      • Condor-G can use these to refresh the user
        credential automatically
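
The periodic credential check can be sketched as a pure decision function over the queued jobs. The action names, the `warn_window` parameter and the `can_refresh` callback are hypothetical; the callback stands in for "a fresh proxy can be fetched automatically", for example from a MyProxy-style store.

```python
def plan_credential_actions(jobs: dict, now: float, warn_window: float,
                            can_refresh) -> dict:
    """Decide per job whether to do nothing, refresh the proxy, or hold the job.

    `jobs` maps job_id -> (owner, proxy_expiry_time); all times are in seconds.
    """
    actions = {}
    for job_id, (owner, expires_at) in jobs.items():
        if expires_at - now > warn_window:
            actions[job_id] = "ok"
        elif can_refresh(owner):
            actions[job_id] = "refresh"
        else:
            actions[job_id] = "hold_and_email"
    return actions
```

Running this on a timer, then forwarding refreshed proxies or holding jobs according to the result, mirrors the behaviour listed on the slide.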

47
Resource Discovery & Scheduling
  • Simple: user supplies a list of GRAM servers
  • Resource broker in the Condor-G agent: Condor
    Matchmaking
  • Flood candidate resources: GlideIns
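
Matchmaking pairs a job with a machine only when each side's requirements accept the other, then ranks the acceptable machines. The sketch below uses plain Python dicts and lambdas in place of Condor's own advertisement language, so every attribute name here is an assumption for illustration.

```python
def matchmake(job: dict, machines: list):
    """Return the best machine both sides accept, or None.

    Each ad is a dict with plain attributes plus a `requirements` predicate
    over the other party's ad; the job ad also carries a `rank` function.
    """
    mutual = [
        m for m in machines
        if job["requirements"](m) and m["requirements"](job)
    ]
    return max(mutual, key=job["rank"], default=None)
```

The symmetric check is the point of the design: resource owners keep control over who may run on their machines, while users state what they need and what they prefer.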

48
GlideIn Mechanism
  • Uses the same Condor mechanism to start on a remote
    node not a user job, but a daemon
  • The daemon traps system calls made by the user's job
    and redirects them back to the originating system
  • Periodically checkpoints the job and migrates it
    to another location if requested
  • The Condor-G GlideIn mechanism uses Grid
    protocols to dynamically create a personal Condor
    pool out of Grid resources by gliding in Condor
    daemons to remote resources
  • Allows the binding of an application to
    a resource to be delayed
  • Prevents queuing delays
  • Can guarantee optimal queuing times for users

49
Remote Execution via GlideIn
50
Grid Architecture in Practice
App: High Throughput Computing System
Collective (App): Dynamic checkpoint, job management, failover,
staging
Collective (Generic): Brokering, certificate authorities
Resource: Access to data, access to computers, access to
network performance data
Connect: Communication, service discovery (DNS),
authentication, authorization, delegation
Fabric: Storage systems, schedulers
51
Future & Conclusions
  • Evolution
      • Past to present: O(10^2) computers, Mb/s networks,
        local (centralized) control
      • Present: O(10^4-10^6) data systems, computers; Gb/s
        networks; restricted decentralized control
      • Future: O(10^6-10^9) data, sensors, computers,
        instruments; highly flexible policy & control
  • A computer (including software) is a dynamically,
    often collaboratively constructed collection of
    processors, data sources, networks, sensors,
    instruments
  • Open the faucet, get the water. Connect to the
    Grid, get the compute power.
  • We need a powerful computational economy model
    (bidding systems & new optimization algorithms)

52
Summary
  • The Grid problem: resource sharing & coordinated
    problem solving in dynamic, multi-institutional
    virtual organizations
  • Grid architecture: emphasizes protocol and
    service definition to enable interoperability and
    resource sharing
  • Globus Toolkit: a source of protocols and APIs,
    reference implementation
  • Condor-G: applies the general-purpose Globus Toolkit
    to solve high-throughput computing on the Grid

53
Thanks for Your Patience