Title: CSS434: Parallel
1 CSS434 Grid Computing
Textbook: No Corresponding Chapters
Professor Munehiro Fukuda
A portion of these slides was compiled from The Grid: Blueprint for a New Computing Infrastructure.
2 Network Infrastructure
- Users first log in to their organization's systems, locally or remotely.
- If they are affiliated with other organizations,
  - They can log in from their main system to some other systems. (This gives them an opportunity to use those resources in parallel.)
- Problems
  - They must orchestrate job execution among the resources they use.
  - Should those resources be limited to such a small handful of researchers?
3 Purposes of Computational Grid
- Use computing resources connected to a high-speed information highway as if we were using an electric power grid.
  - Only 30% utilization in academic/commercial environments.
  - Many applications have only episodic requirements. So, why don't we share computation resources?
  - Computational results and data should also be made available to all users.
- Users
  - Computational scientists and engineers
  - Experimental scientists
  - Associations and corporations
  - Training and education
  - Consumers (E-commerce)
4 Grid Applications
Category | Examples | Characteristics
Distributed supercomputing | DIS and stellar dynamics | Very large problems needing lots of computing resources at a time
High throughput | Chip design and parameter studies | Harnessing many idle resources to increase aggregate throughput
On demand | Medical instrumentation | Allocating special resources dynamically
Data intensive | Sky survey | Using distributed data and needing high-volume data flows
Collaborative | Collaborative design, education | Supporting communication or collaborative work
5 Grid Services Architecture (from www.globus.org)
[Figure: layered architecture diagram, listed from top to bottom]
- Applications: High-energy physics data analysis, Collaborative engineering, On-line instrumentation, Regional climate studies, Parameter studies
- Application Toolkit Layer: Distributed computing, Collab. design, Remote control, Data-intensive, Remote viz
- Grid Services Layer: Information, Resource mgmt, Security, Data access, Fault detection, ...
- Grid Fabric Layer: Transport, Multicast, Instrumentation, Control interfaces, QoS mechanisms, ...
6 Programming Model: Uniform Access
- Paradigm
  - Bag of tasks or master-workers (Condor-MW)
  - Client-server (NetSolve)
  - Object-oriented (Legion)
  - Synchronous applications (not suited for massively parallel computation)
- Language Support
  - MPI-G message passing (Globus)
  - OpenMP shared memory
  - Math library remote procedure (NetSolve)
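The bag-of-tasks (master-workers) paradigm listed above can be sketched in a few lines. This is a minimal, single-machine illustration that uses threads in place of grid nodes; it is not Condor-MW's API, and the names `run_bag_of_tasks` and `square` are hypothetical.

```python
import queue
import threading


def run_bag_of_tasks(tasks, work, num_workers=4):
    """Master side: fill a shared bag, start workers, collect results by task id."""
    bag = queue.Queue()
    for task_id, payload in enumerate(tasks):
        bag.put((task_id, payload))

    results = {}
    lock = threading.Lock()

    def worker():
        # Each worker repeatedly pulls a task until the bag is empty.
        while True:
            try:
                task_id, payload = bag.get_nowait()
            except queue.Empty:
                return
            value = work(payload)
            with lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Return results in the original task order.
    return [results[i] for i in range(len(tasks))]


def square(x):
    # Stand-in for an independent unit of work (assumption for illustration).
    return x * x
```

Because tasks are independent, workers can pull them at their own pace, which is why this paradigm tolerates slow or heterogeneous resources better than tightly synchronized applications.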
7 Resource Management: Discovery, Allocation, and Scheduling
- Centralized resource manager
  - Easy to manage
  - A bottleneck
- Decentralized resource manager
  - A collection of centralized managers (Condor's gate flocking)
  - A combination of meta and local schedulers
Systems | Resource descriptions | Front-end process | Resource manager | Job launcher
Globus | RSL (resource spec. language) | Broker and MDS | GRAM |
Condor | ClassAd and DAGMan | Schedd (Agent) | Matchmaker and startd | Sandbox (Starter)
Legion | IDL (interface def. language) | Scheduler | Collection | Enactor
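As a rough illustration of how a centralized resource manager pairs resource descriptions with job requests, the sketch below matches jobs against machines. It is a simplified assumption for teaching purposes: the dictionary format, `matches`, and `matchmake` are hypothetical and are not the ClassAd language, RSL, or any system's real interface.

```python
def matches(job, machine):
    """A job matches a machine when every requested minimum is met."""
    return all(machine.get(key, 0) >= minimum
               for key, minimum in job["requirements"].items())


def matchmake(jobs, machines):
    """Centralized matchmaker: give each job the first free machine that fits."""
    free = list(machines)
    assignment = {}
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                assignment[job["name"]] = machine["name"]
                free.remove(machine)  # a machine runs one job at a time here
                break
        else:
            assignment[job["name"]] = None  # no suitable resource right now
    return assignment


# Hypothetical resource and job descriptions.
machines = [
    {"name": "m1", "memory_mb": 512, "cpus": 1},
    {"name": "m2", "memory_mb": 4096, "cpus": 8},
]
jobs = [
    {"name": "big", "requirements": {"memory_mb": 2048, "cpus": 4}},
    {"name": "small", "requirements": {"memory_mb": 256}},
    {"name": "extra", "requirements": {"cpus": 1}},
]
```

Every request passes through the single `matchmake` loop, which makes the policy easy to manage but also shows where the bottleneck of a centralized manager comes from.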
8 Fault Tolerance
- Check-pointing
  - At the master (Condor)
  - At each node but collected at the master (Catalina)
  - Use a whiteboard (Optimal Grid)
- Re-execution of faulty worker jobs from the beginning (Bayanihan, Optimal Grid)
- Error code (NetSolve)
  - The user is responsible for handling errors.
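The re-execution strategy above can be sketched as a master that simply restarts a failed worker job from the beginning. This is a minimal illustration under stated assumptions: a worker fault is modeled as a Python exception, and `run_with_reexecution` and `FlakyWorker` are hypothetical names, not part of Bayanihan or Optimal Grid.

```python
def run_with_reexecution(job, args, max_attempts=3):
    """Re-run a failed job from the beginning, up to max_attempts times."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # No partial state is reused: each attempt starts from scratch.
            return job(*args), attempt
        except Exception as error:  # a worker fault, modeled as an exception
            last_error = error
    raise RuntimeError(f"job failed after {max_attempts} attempts") from last_error


class FlakyWorker:
    """Hypothetical worker that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self, x):
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("worker lost")
        return x + 1
```

Re-execution needs no saved state, but it wastes all work done before the fault, which is the cost that check-pointing is meant to reduce.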
9 Security
- Resources covered with security layers
  - Legion (Message/MayI layers)
  - Entropia (intercepting all system calls)
- Use of commodity tools
  - SSL
  - Public key
  - Security certificate
  - Java sandbox
  - Kerberos
10 NetSolve (http://icl.cs.utk.edu/netsolve/)
- RPC-based approach
- Clients
  - Include a set of APIs invoked as (asynchronous) RPCs
- Agents
  - Match clients' requests for services with servers
- Servers
  - Encapsulate remotely accessed numerical libraries
[Figure: NetSolve architecture. Labels: Client, Agent, network of servers (scalar server, MPP servers); arrows: choice, request, reply.]
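The client, agent, and server roles described on this slide can be sketched as follows. This is a minimal in-process illustration, assuming the agent picks the least-loaded server that offers the requested service; the classes, the `load` metric, and `client_call` are hypothetical and do not reproduce NetSolve's actual API or its selection policy.

```python
class Server:
    """Wraps local numerical routines so they can be called on request."""

    def __init__(self, name, services, load):
        self.name = name
        self.services = services  # service name -> callable
        self.load = load          # lower means less busy (assumed metric)

    def handle(self, service, *args):
        return self.services[service](*args)


class Agent:
    """Matches a client's service request with a suitable server."""

    def __init__(self, servers):
        self.servers = servers

    def choose(self, service):
        candidates = [s for s in self.servers if service in s.services]
        if not candidates:
            raise LookupError(f"no server offers {service!r}")
        return min(candidates, key=lambda s: s.load)


def client_call(agent, service, *args):
    """Client side: get the agent's choice, then send the request to that server."""
    server = agent.choose(service)
    return server.name, server.handle(service, *args)


def dot(a, b):
    # Stand-in for a numerical library routine (assumption for illustration).
    return sum(x * y for x, y in zip(a, b))


scalar = Server("scalar", {"dot": dot}, load=0.9)
mpp = Server("mpp", {"dot": dot, "sum": sum}, load=0.2)
agent = Agent([scalar, mpp])
```

The call here is synchronous for brevity; an asynchronous variant would return a handle immediately and let the client collect the reply later.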
11 Legion (http://legion.virginia.edu/)
- Legion classes
  - Act as managers and make policy
- Core objects
  - Provide mechanisms that classes use to implement policies: hosts (processors), vaults (memory), contexts, binding agents, etc.
- Per-program scheduling
  - Participating sites can enforce their local policies.
  - Users can choose a scheduling policy.
[Figure: Legion scheduling. Labels: Prog, Scheduler, Enactor, Collection (resource database), Resources (Class, Host, tty); arrows: request, search, reserve. Annotations: converted to a Legion object ID by context objects; converted to a Legion object address by binding agents.]
12 Condor (http://www.cs.wisc.edu/condor/)
[Figure legend: A = user's local agent; R = each computer resource; M = central manager]
I/O forwarded to the user's home
13 AgentTeamwork at UWB: Architecture
14 Paper Review by Students
- Globus
- Legion
- Condor
- NetSolve
- Discussions
  - What programming or execution model is each system based on?
  - What resource allocation and scheduling algorithm does each system use?
  - Are they fault-tolerant?
  - Do they have any special security features of their own?