Grid job management architecture

Provided by: Stoke9

1
Grid job management architecture

Challenge
How to efficiently connect large volumes of widely distributed data, computing resources, and physics jobs.
Proposal
An LHC Computing Grid-compatible job management system which is:
based on a services model
decoupled
decentralised
suitable for interactive, analysis, and production jobs

2
Overview

Grid Characteristics and Components
Identify what we are trying to achieve
EDG Architecture
Understand what has been done to date and how
well it has worked
Research Plan
Proposal for a distributed job management system

3
Grid Components

4
Primary Grid Characteristics

Federated
limited or no central control
Distributed
data, users, and computing power
Heterogeneous
different base systems
Scalable
large clusters, and large grids

5
Secondary Grid Characteristics

What is a Grid?
"Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations"
(Foster, Kesselman, Tuecke, The Anatomy of the Grid, 2001)

Long Running
jobs taking hours, days, weeks
High Job Volume
many users, many jobs per user
High Data Traffic
jobs, software, and data
Execution Delay
users may accept long delays before execution
commences

6
EDG Architecture

Submit job from UI to RB
Match job by RB checking RC and II
Schedule job on CE
Retrieve data from SE
Execute on WN
Update status in II
Register output with SE
Return results to RB
User fetches output from RB

UI: User Interface
RB: Resource Broker
RC: Replica Catalogue
II: Information Index
CE: Computing Element
WN: Worker Node
SE: Storage Element

7
EDG Architecture

Federated?
Limited capacity for site specific control
Security policies
Scheduling/Usage policies
Distributed?
Centralised scheduling, bookkeeping, and replica
management
Data storage and job execution are distributed

8
EDG Architecture

Heterogeneous?
Every site must have exactly the same software installed with exactly the same operating system, and near-identical configuration
Strong coupling and dependency between EDG components
Scalable?
Doesn't work well with large clusters, and the centralised scheduler and bookkeeping systems quickly become overloaded
Fit for Purpose?
Doesn't support physics production style jobs, which are better suited to job pools and job pull type mechanisms

9
Typical Failure Modes

Central services do not scale well, even with service replication
Failure or overloading of network connections to the central services isolates users or resources
Failure of central services incapacitates the entire Grid

10
Distributed Job Management

Grid services augment decision making process
Resources and Jobs communicate directly
Accept the possibility of sub-optimal scheduling, because information gathering, scheduling, and job management are not cost-free operations

[Diagram: Job Submitter, Grid Resources, Grid Services]

11
Publish/Subscribe Directory

Job directory: jobs are published, and resources can subscribe to them to accept them
Resource directory: available resources are published, and jobs can subscribe to them to submit a job
Directory entries will have an expiry date
Directory is only a guide, not a guarantee
Ideally the job submitter (JS) and resources will remember where they have published and will retract publications if they become invalid

[Diagram: Directory, Service Seeker, Service Provider; steps 1. Publish, 2. Search/Query, 3. Subscribe]
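
The directory behaviour described on this slide (publish, search, expiry, retraction) can be sketched roughly as below. This is only an illustrative sketch, not the presentation's design: the `Directory` and `Entry` classes, the `ttl` parameter, and the payload fields are all hypothetical names chosen for the example, and the subscribe step is left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Entry:
    owner: str          # who published the entry (a JS or a resource)
    payload: dict       # free-form description of the job or resource
    expires_at: float   # absolute expiry time, in seconds


@dataclass
class Directory:
    entries: dict = field(default_factory=dict)  # entry id -> Entry
    next_id: int = 0

    def publish(self, owner, payload, ttl, now=None):
        """Add an entry that expires ttl seconds from now; return its id."""
        now = time.time() if now is None else now
        entry_id = self.next_id
        self.next_id += 1
        self.entries[entry_id] = Entry(owner, payload, now + ttl)
        return entry_id

    def retract(self, entry_id):
        """Withdraw an entry early, e.g. when the publisher knows it is invalid."""
        self.entries.pop(entry_id, None)

    def search(self, predicate, now=None):
        """Return ids of unexpired entries whose payload satisfies predicate."""
        now = time.time() if now is None else now
        # Drop expired entries lazily on each query.
        self.entries = {i: e for i, e in self.entries.items() if e.expires_at > now}
        return [i for i, e in self.entries.items() if predicate(e.payload)]


# Hypothetical job directory with two published jobs (times passed explicitly).
jobs = Directory()
a = jobs.publish("js-1", {"kind": "mc-production", "cpu_hours": 40}, ttl=60, now=0)
b = jobs.publish("js-2", {"kind": "analysis", "cpu_hours": 2}, ttl=10, now=0)

short_jobs = lambda p: p["cpu_hours"] <= 40
found_early = jobs.search(short_jobs, now=5)    # both entries still valid
found_late = jobs.search(short_jobs, now=30)    # the analysis entry has expired
jobs.retract(a)
found_after_retract = jobs.search(short_jobs, now=30)
```

Because entries can expire or be retracted between a search and any later contact, a search result here is only a hint, which matches the "guide, not a guarantee" point above.
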
12
Multiple Directories

Nothing stops a JS from publishing a job in many directories, or a resource from announcing its availability in many directories
Different directories may have different characteristics or access permissions
Jobs may search preferred directories first for available resources, and resources may do likewise when searching for jobs
Allows a variety of scheduling paradigms to operate concurrently
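
Searching preferred directories first might look something like the sketch below. The directory names, the entry fields, and the `find_first` helper are assumptions made for illustration; the slides do not specify a search order or a data format.

```python
def find_first(directories, predicate):
    """Search directories in preference order; return (name, matches) for the
    first directory with any match, or (None, []) if none has one."""
    for name, entries in directories:
        matches = [e for e in entries if predicate(e)]
        if matches:
            return name, matches
    return None, []


# Hypothetical directories, listed in the job's order of preference.
directories = [
    ("site-local", [{"resource": "ce-a", "free_cpus": 0}]),
    ("collaboration", [{"resource": "ce-b", "free_cpus": 16},
                       {"resource": "ce-c", "free_cpus": 4}]),
    ("public", [{"resource": "ce-d", "free_cpus": 128}]),
]

# The job wants at least 8 free CPUs; the site-local directory has none to
# offer, so the search falls through to the next preferred directory.
name, matches = find_first(directories, lambda e: e["free_cpus"] >= 8)
```

A resource looking for jobs could use the same helper with job directories and a predicate over job descriptions.
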

13
Middleman Not Required

JS can contact resources directly, while services can still be used to augment JS decision making
Analogy 1: I can buy my DVD drive from PC World directly, or I can use Google or Price Guide UK to find the cheapest drive
Analogy 2: I can contact the Oxford website directly via its IP address, or use a DNS lookup for www.ox.ac.uk using my local cache or network DNS server

14
Support for Job Push and Pull

Job push: the job submitter uses a scheduler to push the job to an available resource
Good for jobs needing quick feedback and a short execution delay
Typically the case where the job submitter owns the resource which executes the job

Job pull: the resource acquires jobs from a job pool
Good for avoiding scheduler overload with simulation type jobs (e.g. MC production)
Expect long execution delay (but must avoid starvation)
Good for allowing the resource owner to control which jobs are executed
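
One possible shape for a job pull pool is sketched below. The slides only say that starvation must be avoided; priority aging is a technique swapped in here as one common way to do that, and `JobPool`, `PooledJob`, `aging_rate`, and the `accepts` filter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PooledJob:
    name: str
    priority: int       # larger means more urgent
    submitted_at: int   # pool clock tick at submission


class JobPool:
    def __init__(self, aging_rate=1):
        self.jobs = []
        self.aging_rate = aging_rate  # priority gained per tick spent waiting

    def submit(self, job):
        self.jobs.append(job)

    def pull(self, now, accepts=lambda job: True):
        """Called by a resource: take the best job it is willing to run."""
        candidates = [j for j in self.jobs if accepts(j)]
        if not candidates:
            return None
        # Effective priority grows with waiting time, so old jobs eventually win.
        best = max(
            candidates,
            key=lambda j: j.priority + self.aging_rate * (now - j.submitted_at),
        )
        self.jobs.remove(best)
        return best


pool = JobPool(aging_rate=1)
pool.submit(PooledJob("old-low", priority=1, submitted_at=0))
pool.submit(PooledJob("new-high", priority=5, submitted_at=9))

# At tick 10: old-low scores 1 + 10 = 11, new-high scores 5 + 1 = 6.
first = pool.pull(now=10)
second = pool.pull(now=10)
third = pool.pull(now=10)   # pool is empty by now
```

The `accepts` argument is where a resource owner could apply local policy, for example only pulling jobs from a particular experiment.
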

15
Further Details

Base job, resource, and directory entries on the Condor ClassAd system
Make use of the Globus toolkit (foundation for EDG/LCG software)
Evaluate possibility of a Grid Services implementation
Extend the DIRAC LHCb MC production architecture
Prepare trial system for Data Challenge 04
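
The idea behind ClassAd-style matchmaking is that each side states its own requirements and a match needs both to hold. The sketch below imitates that idea with plain Python dictionaries and functions. It is not the Condor ClassAd language or API, and every attribute name in it (`memory_mb`, `mips`, `owner`, and so on) is made up for the example.

```python
def matches(job, resource):
    """Two-way match: each side's requirements must hold for the other side."""
    return job["requirements"](resource) and resource["requirements"](job)


def best_resource(job, resources):
    """Among mutually acceptable resources, pick the one the job ranks highest."""
    acceptable = [r for r in resources if matches(job, r)]
    return max(acceptable, key=job["rank"], default=None)


# Hypothetical job entry: needs Linux and 512 MB, prefers faster machines.
job = {
    "owner": "lhcb",
    "memory_mb": 512,
    "requirements": lambda r: r["os"] == "linux" and r["memory_mb"] >= 512,
    "rank": lambda r: r["mips"],
}

# Hypothetical resource entries, each with its own acceptance policy.
resources = [
    {"name": "wn-1", "os": "linux", "memory_mb": 1024, "mips": 900,
     "requirements": lambda j: j["owner"] == "lhcb"},
    {"name": "wn-2", "os": "linux", "memory_mb": 2048, "mips": 1500,
     "requirements": lambda j: j["owner"] == "atlas"},   # rejects this job
    {"name": "wn-3", "os": "linux", "memory_mb": 256, "mips": 2000,
     "requirements": lambda j: True},                    # too little memory
]

chosen = best_resource(job, resources)
```

Since both sides state requirements, the same entries could serve job push (the submitter picks a resource) and job pull (the resource picks a job).
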

16
Research Plans

Base requirements and design on existing use cases, requirements, and available software
Description and analysis of process and protocol interaction in CSP
Expand DistGrid: Jobs, Resources, Directories, Services, Data
Use SimGrid to evaluate different strategies
Benchmark against EDG and DIRAC during DC04

17
Overview