Title: The Grid Middleware (Fundamentals of Grid Computing)
1The Grid Middleware (Fundamentals of Grid Computing)
- Adam Belloum
- Computer Architecture Parallel Systems group
- University of Amsterdam
- adam_at_science.uva.nl
2What did we learn in the introductory course?
- Introduction to Grid computing
3What the Grid Can do
- Exploiting underutilized resources
- Parallel CPU capacity
- Virtual resources and virtual organizations
- Access to additional resources
- Resource Balancing
- Reliability
- Management
4The Best of Two Worlds
[Figure: Open Grid Services Architecture, built on Grid Protocols and Web Services. Labels: manage, share, access; resources on demand; applications on demand; global accessibility; secure and universal access; business integration; vast resource scalability]
- Open Grid Services Architecture Evolution, J.P. Prost, IBM Montpellier, France, Ecole Bruide 2004
5Grid and Web Services Convergence?
[Figure labels: Grid, Web]
- However, despite enthusiasm for OGSI, adoption within the Web community turned out to be problematic
- WS-Resource Framework: Globus Alliance Perspectives, I. Foster
6Grid and Web Services Convergence: Yes!
[Figure labels: Grid, Web]
- The definition of WSRF means that the Grid and Web communities can move forward on a common base
- WS-Resource Framework: Globus Alliance Perspectives, I. Foster
7Summary
- Initial exploration (1996-1999, GT 1.0)
- Extensive application experiments; core protocols
- Data Grids (1999-2002, GT 2.0)
- Large-scale data management and analysis
- Open Grid Services Architecture (2001-2003, GT 3.0)
- Integration with Web services, hosting environments, resource virtualization
- Databases, higher-level services
- Radically scalable systems (2003-??)
- Sensors, wireless, ubiquitous computing
8Outline
- Type of Grid Resources
- Scheduling, reservation, and scavenging
- Grid Construction
- Grid Middleware components
- Deploying the Grid Software
- Configuring the Grid Middleware
9Grid resources: Computation
- CPU cycles: there are three primary ways to exploit the computation resources of a grid.
- Execute an existing application on a grid host
- Parallel execution on several grid hosts
- Execute an application many times on many grid hosts
- Scalability is a measure of how efficiently the multiple processors on a grid are used.
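One common way to put a number on scalability is parallel efficiency: the speedup over a single host divided by the number of hosts used. The sketch below assumes that definition; the function names and the timings in the example are hypothetical and do not come from any grid toolkit.

```python
def speedup(serial_seconds: float, parallel_seconds: float) -> float:
    """Ratio of the single-host run time to the run time on the grid."""
    return serial_seconds / parallel_seconds


def efficiency(serial_seconds: float, parallel_seconds: float, hosts: int) -> float:
    """Speedup divided by the host count; 1.0 would mean perfect scaling."""
    return speedup(serial_seconds, parallel_seconds) / hosts


if __name__ == "__main__":
    # Hypothetical timings: 120 s on one host, 20 s on 8 grid hosts.
    print(speedup(120.0, 20.0))        # 6.0
    print(efficiency(120.0, 20.0, 8))  # 0.75
```

An efficiency well below 1.0 suggests that adding hosts is paying for communication or idle time rather than useful work.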
10Grid resources: Storage
- A grid provides an integrated view of data storage
- Each machine on the grid provides some quantity of storage
- Temporary storage attached to the processor: fast, but volatile
- Secondary storage increases capacity, performance, sharing, and reliability of data
- Capacity can be increased by using the storage on multiple machines with a unifying file system.
- Data sets can span several storage machines, eliminating size restrictions often imposed by file systems
11Grid resources: Storage
- Advanced file systems on a grid can
- automatically duplicate sets of data, to provide redundancy for increased reliability and increased performance.
- An intelligent scheduler can select the storage location based on usage patterns
12Grid resources: Communication
- Communications are important for
- sending jobs and their required data to points within the grid.
- Communication bandwidth
- Is the most important feature needed to set up an effective grid.
- Redundant communications are sometimes needed to
- better handle potential network failures and excessive data traffic.
13Grid resources: Software licenses
- The grid may have software installed that is too expensive to install on every grid machine.
- Using a grid, the jobs requiring this software are sent to the particular machines on which it is installed
14Grid resources: Special equipment
- Platforms on the grid often have different architectures, operating systems, capacities, and equipment.
- Each of these attributes represents a different kind of resource that the grid can use as a criterion for assigning jobs to machines
- Such attributes must be considered when assigning jobs to resources in the grid
15How to use the Grid: Jobs and Applications
- Grid resources are accessed by executing an application or job.
- The term application refers to the highest level of work on the grid.
- Applications may be broken down into any number of individual jobs.
- Those, in turn, can be further broken down into sub-jobs.
- Jobs are programs that are executed at an appropriate point on the grid. They may
- compute something,
- execute one or more system commands,
- move or collect data,
- or operate machinery.
16How to use the Grid: Jobs and Applications
- A grid application that is organized as a collection of jobs is usually designed to
- have the jobs execute in parallel on different machines in the grid.
- The jobs may have specific dependencies that may prevent them from executing in parallel in all cases.
- Jobs may require some specific input data that must be copied to the machine on which the job is to run.
- Jobs may require the output produced by certain other jobs and cannot be executed until those prerequisite jobs have completed executing.
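The dependency constraint described above can be sketched as a topological ordering over the job graph. This minimal example uses Python's standard `graphlib` module; the job names and the `parallel_batches` helper are hypothetical, and a real grid scheduler would also weigh data placement and machine availability.

```python
from graphlib import TopologicalSorter


def parallel_batches(prerequisites):
    """Group jobs into batches; every job in a batch has all prerequisites done.

    `prerequisites` maps each job name to the set of jobs it must wait for.
    Jobs inside one batch could be dispatched to different machines in parallel.
    """
    sorter = TopologicalSorter(prerequisites)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    # Hypothetical application: two simulations need staged input, then a merge.
    jobs = {
        "stage_input": set(),
        "simulate_a": {"stage_input"},
        "simulate_b": {"stage_input"},
        "merge": {"simulate_a", "simulate_b"},
    }
    print(parallel_batches(jobs))
    # [['stage_input'], ['simulate_a', 'simulate_b'], ['merge']]
```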
17How to use the Grid: Jobs and Applications
- Jobs may spawn additional sub-jobs, depending on the data they process.
- This workflow can create a hierarchy of jobs and sub-jobs.
- The results of the jobs must be collected and assembled to produce the ultimate answer for the application
18Resource Management
- Scheduling, Reservation, Scavenging
19Scheduling, Reservation, Scavenging
- In the simplest of grid systems
- The user selects a machine for running his job and then executes a grid command that sends the job to the selected machine.
- More advanced grid systems would include a job scheduler
- that automatically finds the most appropriate machine.
20Scheduling, Reservation, Scavenging
- In a scavenging grid system
- Any machine that becomes idle reports its idle status to the grid management node.
- The management node assigns to the idle machine the next job whose requirements are satisfied by the machine's resources.
- Scavenging is usually implemented in a way that is unobtrusive to the normal machine user.
- If the machine becomes busy with local non-grid work, the grid job is usually suspended or delayed.
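The scavenging loop above can be sketched as a small matchmaker: an idle machine reports in, and the management node hands it the first queued job its resources can satisfy. All class and field names here are hypothetical, and memory and operating system stand in for whatever attributes a real grid would match on.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    min_memory_gb: int
    required_os: str


@dataclass
class Machine:
    name: str
    memory_gb: int
    os: str


class ManagementNode:
    """Toy grid management node holding a first-come, first-served job queue."""

    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def report_idle(self, machine):
        """Return the first queued job the idle machine can run, or None."""
        for job in list(self.queue):  # iterate over a copy so removal is safe
            fits = (machine.memory_gb >= job.min_memory_gb
                    and machine.os == job.required_os)
            if fits:
                self.queue.remove(job)
                return job
        return None


if __name__ == "__main__":
    node = ManagementNode()
    node.submit(Job("big_render", min_memory_gb=64, required_os="linux"))
    node.submit(Job("small_fit", min_memory_gb=4, required_os="linux"))
    desktop = Machine("desk-17", memory_gb=8, os="linux")
    print(node.report_idle(desktop).name)  # small_fit
```

Suspending the job when the owner returns to the machine is left out; it would need a hook from the donor software back to the management node.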
21Scheduling, Reservation, Scavenging
- Grid resources can be reserved in advance for a designated set of jobs.
- This is done to meet deadlines and guarantee quality of service.
- When policies permit, resources reserved in advance could also be scavenged to run lower priority jobs when they are not busy during a reservation period.
- Various combinations of scheduling, reservation, and scavenging can be used to more completely utilize the grid.
22Grid construction
23Deployment planning
- Security
- It is important to understand exactly which components of the grid must be rigorously secured to deter any kind of attack.
- It is important to understand the issues involved in authenticating users and properly executing the responsibilities of a certificate authority (CA).
- For example in an organization
- It is important to understand how the departments in an organization interact, operate, and contribute to the whole
24Grids are by definition Heterogeneous
- Grid is about legacy resources, infrastructure, applications, policies, and procedures
- The grid and its administrators must integrate in stealth mode with
- Firewalls
- Filesystems
- Queuing systems
- Grumpy systems administrators
- Tried and true applications
25Challenges in Grid Computing
- Reliable performance
- Trust relationships between multiple security domains
- Deployment and maintenance of grid middleware across hundreds or thousands of nodes
- Access to data across WANs
- Access to state information of remote processes
- Workflow / dependency management
- Distributed software and license management
- Accounting and billing
26Grid Middleware Components
- Submission software
- Any machine of a grid can be used to submit jobs and initiate grid queries.
- Some grid systems use a separate component installed on submission nodes.
- Management software
- Large grids have a hierarchical organization matching the connectivity topology.
- Machines locally connected together with a LAN form a cluster
- Scheduling software
- The simplest schedulers assign jobs in round-robin fashion
- More advanced schedulers implement a job priority system.
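The two scheduling policies named above can be contrasted in a few lines. This is a minimal sketch with hypothetical names; in the priority queue, a lower number means higher priority, which is a convention chosen here and is not taken from any particular grid scheduler.

```python
import heapq
from itertools import count, cycle


def round_robin(jobs, machines):
    """Assign jobs to machines in turn, ignoring load and priority."""
    return list(zip(jobs, cycle(machines)))


class PriorityJobQueue:
    """Jobs leave in priority order; ties keep their submission order."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker so equal priorities stay FIFO

    def submit(self, job, priority):
        heapq.heappush(self._heap, (priority, next(self._order), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    print(round_robin(["j1", "j2", "j3"], ["m1", "m2"]))
    # [('j1', 'm1'), ('j2', 'm2'), ('j3', 'm1')]

    queue = PriorityJobQueue()
    queue.submit("nightly_report", priority=5)
    queue.submit("urgent_fix", priority=1)
    print(queue.next_job())  # urgent_fix
```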
27Grid software components
- Communications
- A grid system may include software to help jobs communicate.
- The open standard MPI is often included as part of the grid system
- Observation, management, and measurement
- The donor software typically includes tools that measure the current load on a machine, using either OS tools or direct measurement.
- This software is sometimes referred to as a load sensor.
- Some grid systems provide the means for implementing custom load sensors for resources other than CPU or storage.
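A load sensor can be thought of as a named function that returns a number, which makes custom sensors easy to plug in. The sketch below assumes that model; `cpu_load_sensor` relies on `os.getloadavg`, which is only available on Unix-like systems, and the registry, the thresholds, and the `license_seats_in_use` sensor are hypothetical.

```python
import os
import shutil


def cpu_load_sensor():
    """1-minute load average per CPU, read from the OS (Unix-like systems only)."""
    return os.getloadavg()[0] / (os.cpu_count() or 1)


def disk_sensor(path="."):
    """Fraction of the disk holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def read_sensors(sensors):
    """Call every registered sensor and collect the readings by name."""
    return {name: sensor() for name, sensor in sensors.items()}


def is_idle(readings, thresholds):
    """A machine counts as idle when every reading is at or below its threshold."""
    return all(readings[name] <= limit for name, limit in thresholds.items())


if __name__ == "__main__":
    sensors = {
        "cpu": cpu_load_sensor,
        "disk": disk_sensor,
        # Hypothetical custom sensor for a resource other than CPU or storage.
        "license_seats_in_use": lambda: 0,
    }
    readings = read_sensors(sensors)
    print(readings)
    print(is_idle(readings, {"cpu": 0.2, "disk": 0.9, "license_seats_in_use": 3}))
```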
28Interacting with and Using the Grid
- User Perspective
- Application developer Perspective
- Administrator Perspective
29What the Grid Means To
- The end user
- Can transparently access resources in multiple VOs
- Can more easily collaborate with other researchers
- The IT administrator
- Has a secure framework for implementing distributed resource sharing
- Local resource administrators can control access to their resources
- The manager
- Sees better utilization of capital resources
- Has a tool that helps break down organizational barriers
30Using a grid: A user's perspective
- Enrolling and installing grid software
- A user enrolls as a grid user and installs the provided grid software on his own machine.
- He may optionally enroll his machine as a donor.
- Enrolling in the grid requires authentication
- the user establishes his identity with a certificate authority
- Installing the grid software on a machine
- for the purposes of using the grid
- as well as donating to the grid.
31Using a grid: A user's perspective
- For users, the primary requirement is simplicity: access to the virtual organization's resources should not be significantly different from access to the local organization's resources.
- There should be a single sign-on, where users need to log on only once to access all permitted resources.
- Programs running on a user's behalf should possess a subset of the user's rights and have access to the permitted resources.
[Figure labels: protected channel, passwd]
- Randy Butler et al., A National-Scale Authentication Infrastructure
32Using a grid: A user's perspective
- Logging onto the Grid
- To use the grid, most grid systems require the user to log on to a system using a user ID that is enrolled in the grid
- Globus, for example, implements a proxy login model that keeps the user logged in for a specified amount of time, even if he logs off and back on the operating system and even if the machine is rebooted.
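Two ideas from the user's perspective, a time-limited proxy login and programs that hold only a subset of the user's rights, can be modelled together in a toy data structure. This is an illustration only: it is not the Globus API, it involves no cryptography, and every name in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyCredential:
    user: str
    rights: frozenset
    expires_at: float  # seconds on the same clock as `now`

    def allows(self, right, now):
        """A right is usable only while the proxy is unexpired and holds it."""
        return now < self.expires_at and right in self.rights


def create_proxy(user, user_rights, requested_rights, lifetime_seconds, now):
    """Grant only rights the user actually has, for a limited lifetime."""
    granted = frozenset(user_rights) & frozenset(requested_rights)
    return ProxyCredential(user, granted, now + lifetime_seconds)


if __name__ == "__main__":
    proxy = create_proxy(
        user="alice",
        user_rights={"submit", "read", "admin"},
        requested_rights={"submit", "delete"},  # "delete" is not hers to delegate
        lifetime_seconds=3600,
        now=1000.0,
    )
    print(sorted(proxy.rights))                # ['submit']
    print(proxy.allows("submit", now=2000.0))  # True
    print(proxy.allows("submit", now=5000.0))  # False, proxy expired
```

Because the expiry is stored in the credential itself, the login survives the user logging off the operating system, which matches the behavior described above.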
33Using a grid: A user's perspective
- Queries and submitting jobs
- Grid systems provide command line tools as well as GUIs for queries.
- Job submission consists of three parts, even if there is only one command required.
- Input data and possibly the executable program or script are sent to the machine to execute the job.
- The job is executed on the grid (inside a protective sandbox)
- Results of the job are sent back to the submitter.
[Figure labels: (1), (2) Configuration, (3) Monitor/Control]
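The three parts of job submission can be sketched locally: stage the script and input into a scratch directory, run the job there, and read the result back. The temporary directory only stands in for the protective sandbox and provides no real isolation; the file names `job.py`, `input.txt`, and `output.txt` and the `run_job` helper are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical job script: upper-cases its input file into an output file.
UPPERCASE_JOB = (
    "from pathlib import Path\n"
    "Path('output.txt').write_text(Path('input.txt').read_text().upper())\n"
)


def run_job(script_text, input_text):
    """Stage in, execute, and stage out a job in a scratch directory."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)

        # Part 1: send the script and its input data to the execution site.
        (work / "job.py").write_text(script_text)
        (work / "input.txt").write_text(input_text)

        # Part 2: execute the job; a non-zero exit raises CalledProcessError.
        subprocess.run(
            [sys.executable, "job.py"],
            cwd=work, check=True, capture_output=True, text=True, timeout=60,
        )

        # Part 3: send the results back to the submitter.
        return (work / "output.txt").read_text()


if __name__ == "__main__":
    print(run_job(UPPERCASE_JOB, "hello grid"))  # HELLO GRID
```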
34Using a grid: A user's perspective
- Data configuration
- The data accessed by the grid jobs may simply be staged in and out by the grid system
- this can add up to a large amount of data traffic
- There are many considerations in efficiently planning the distribution and sharing of data
- Data replication and reallocation
35Using a grid: A user's perspective
- Monitoring progress and recovery
- The user can query the grid system to see how his application and its sub-jobs are progressing
- A grid system, in conjunction with its job scheduler, often provides some degree of recovery for sub-jobs that fail. A job may fail due to
- Programming error
- Hardware or power failure
- Communications interruption
- Excessive slowness
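One simple recovery strategy is to resubmit a failed sub-job on another machine, up to a fixed number of attempts. The sketch below assumes that strategy; the function names and the failure model are hypothetical. Note that resubmission helps with hardware or communication failures but not with a programming error, which would fail on every machine.

```python
from itertools import cycle


def run_with_recovery(job, machines, max_attempts=3):
    """Try the job on successive machines until it succeeds.

    `job` is any callable taking a machine name. Returns the result together
    with a list of (machine, error message) pairs for the failed attempts.
    """
    failures = []
    for _, machine in zip(range(max_attempts), cycle(machines)):
        try:
            return job(machine), failures
        except Exception as error:  # a real system would classify the failure
            failures.append((machine, str(error)))
    raise RuntimeError(f"job failed on all {max_attempts} attempts: {failures}")


if __name__ == "__main__":
    def flaky_job(machine):
        # Hypothetical failure: the link to machine m1 is down.
        if machine == "m1":
            raise ConnectionError("link down")
        return f"ok on {machine}"

    print(run_with_recovery(flaky_job, ["m1", "m2"]))
    # ('ok on m2', [('m1', 'link down')])
```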
36Using a grid: A user's perspective
- Reserving resources
- A user may arrange to reserve a set of resources in advance for his exclusive or high priority use.
- Planned hardware and software maintenance events must be taken into account
- In a scavenging grid, it may not be possible to reserve specific machines in advance.
- The grid management system may allocate a larger fraction of its capacity for a given reservation to allow for the likelihood of some of the resources becoming unavailable.
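How much extra capacity to set aside depends on how likely each machine is to stay available. The sketch below uses a deliberately simple sizing rule, assumed here for illustration: if each machine is available with probability p, reserve enough machines that the expected number available meets the request. The function name and the numbers are hypothetical.

```python
import math


def machines_to_reserve(machines_needed: int, availability: float) -> int:
    """Machines to reserve so the expected number available meets the need.

    Assumes each machine is independently available with probability
    `availability` during the reservation period.
    """
    if not 0.0 < availability <= 1.0:
        raise ValueError("availability must be in (0, 1]")
    return math.ceil(machines_needed / availability)


if __name__ == "__main__":
    # Hypothetical scavenging grid where desktops stay idle 80% of the time.
    print(machines_to_reserve(10, 0.8))  # 13
```

Meeting the request on average is a weak guarantee; a stricter reservation would size for a target probability of having enough machines, which needs a binomial calculation.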
37Using a grid: An administrator's perspective
- The concerns of resource-providing sites constrain an authentication and authorization infrastructure in two ways
- Sites have their own local (intra-domain) security policy
- and typically cannot easily replace or modify their intra-domain security solution.
- How they see inter-domain security
- as a distinct inter-domain solution that interoperates with local security solutions, is at least as strong as local solutions (will not weaken site security), and is easy to understand (administrators can trust it).
- Administrators must have tight control over policies governing access to their resources.
- Randy Butler et al., A National-Scale Authentication Infrastructure
38Using a grid: An administrator's perspective
- Planning
- The administrator should understand the organization's requirements for the grid to better choose the grid technologies that satisfy those requirements.
- Installation
- The selected grid system must be installed on an appropriately configured set of machines.
- Of prime importance is understanding the fail-over scenarios for the given grid system
39Using a grid: An administrator's perspective
- The administrator may need root access to both the node managing the grid and the donor machines
- Software to be installed on the donor machines needs to be customized so that it can find the grid management machines automatically.
- Application software and data may need to be installed on donor machines as well.
- This software may have specific licensing restrictions that should be understood and adhered to.
40Using a grid: An administrator's perspective
- Managing enrollment of donors and users
- Machines donating resources: donor machines may have access rights that require management
- Users: the administrator controls the rights of the users in the grid.
- As users join the grid,
- identities must be positively established and entered in the CA.
- In some cases, the administrator must propagate the user information to several or all grid machines.
- A new donor machine has to be added to the grid.
41Using a grid: An administrator's perspective
- Certification authority
- An important aspect of maintaining strong grid security.
- You may choose to use an external CA or operate one.
- You must be able to trust the CA to strictly adhere to its responsibilities.
- The responsibilities of a certificate authority are
- Serving signed certificates to those needing to authenticate entities
- Maintaining a namespace of unique names for certificate owners
- Positively identifying entities requesting certificates
- Issuing, removing, and archiving certificates
- Protecting the certificate authority server
- Logging activity
42Using a grid: An administrator's perspective
- Resource management includes
- setting permissions for grid users to use the resources
- tracking resource usage
- implementing a corresponding accounting or billing system.
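The three resource-management duties above (permissions, usage tracking, billing) fit in one small ledger. This is a sketch under simple assumptions: usage is measured in CPU-hours and billed at a flat rate. The class name, the rate, and the user names are hypothetical.

```python
from collections import defaultdict


class UsageLedger:
    """Toy accounting: permissions, per-user CPU-hours, and a flat-rate bill."""

    def __init__(self, rate_per_cpu_hour: float):
        self.rate = rate_per_cpu_hour
        self.allowed = set()
        self.cpu_hours = defaultdict(float)

    def grant(self, user):
        """Give a user permission to use the grid's resources."""
        self.allowed.add(user)

    def record(self, user, cpu_hours):
        """Track usage; users without permission are rejected."""
        if user not in self.allowed:
            raise PermissionError(f"{user} may not use grid resources")
        self.cpu_hours[user] += cpu_hours

    def bill(self, user):
        """Amount owed for all recorded usage."""
        return self.cpu_hours.get(user, 0.0) * self.rate


if __name__ == "__main__":
    ledger = UsageLedger(rate_per_cpu_hour=0.5)
    ledger.grant("alice")
    ledger.record("alice", 2.0)
    ledger.record("alice", 4.0)
    print(ledger.bill("alice"))  # 3.0
```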
43Using a grid: An administrator's perspective
- Data sharing
- For small grids, the sharing of data can be fairly easy, using existing networked file systems, databases, or standard data transfer protocols.
- As a grid grows and the users become dependent on any of the data storage repositories, the administrator should consider
- procedures to maintain backup copies and replicas to improve performance.
[Figure labels: Requested Data Management System?; Services specific to the data Grid infrastructure; Low-level services (shared with other Grid components)]
44Using a grid: An application developer's perspective
- Grid applications fall into one of the following three categories:
- Applications that are not enabled for using multiple processors but can be executed on different machines.
- Applications that are already designed to use the multiple processors of a grid setting.
- Applications that need to be modified or rewritten to exploit a grid.
- Grid developers are developing tools for debugging and measuring the behavior of grid applications.
45The present and the future
- Most grid systems include some job schedulers, but as grids span wider areas
- there will be a need for more meta-schedulers that can manage variously configured collections of clusters and smaller grids
- They will also extend their reach to implement better QoS, using
- reservations,
- redundancy,
- and history profiles of jobs and grid performance.
46The present and the future
- Providing reliable, well-performing, and automatically recoverable virtual data sharing and storage.
- Projects are taking on this task, federating data, and achieving better performance, integration with scheduling, reliability, and capacity.
- Autonomic computing has the goal to make the administrator's job easier by automating the various complicated tasks involved in managing a grid.
- These include identifying problems in real time and quickly initiating corrective actions before they seriously impair the grid.
47A word of caution
- The grid is not a silver bullet that can take any application and run it 1000 times faster without the need for buying any more machines or software.
- Not every application is suitable for running on a grid.
- Some kinds of applications simply cannot be parallelized.
- For others, it can take a large amount of work to modify them to achieve faster throughput.
- The configuration of a grid can affect the performance, reliability, and security of an organization's computing infrastructure.
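The limit on speedup can be made concrete with Amdahl's law, which the slides do not name but which captures the same caution: if only a fraction of an application's work can run in parallel, the serial remainder bounds the speedup no matter how many hosts are added. The sketch below is illustrative and ignores communication and scheduling overhead; the fractions in the example are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, hosts: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / hosts)


if __name__ == "__main__":
    # Hypothetical application in which half of the work is inherently serial:
    # even 1000 hosts cannot make it twice as fast.
    print(amdahl_speedup(0.5, 1000) < 2.0)  # True
    # An application that cannot be parallelized at all gains nothing.
    print(amdahl_speedup(0.0, 1000))        # 1.0
```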
48References
- V. Berstis, Fundamentals of Grid Computing, IBM Redbooks paper