Title: Grid job management architecture
1Grid job management architecture
- Challenge
- How to efficiently connect large volumes of
widely distributed data, computing resources, and
physics jobs.
- Proposal
- An LHC Computing Grid compatible job management
system which is
- based on a services model
- decoupled
- decentralised
- suitable for interactive, analysis, and
production jobs
2Overview
- Grid Characteristics and Components
- Identify what we are trying to achieve
- EDG Architecture
- Understand what has been done to date and how
well it has worked
- Research Plan
- Proposal for a distributed job management system
3Grid Components
4Primary Grid Characteristics
- Federated
- limited or no central control
- Distributed
- data, users, and computing power
- Heterogeneous
- different base systems
- Scalable
- large clusters, and large grids
5Secondary Grid Characteristics
- What is a Grid?
- "coordinated resource sharing and problem solving
in dynamic, multi-institutional virtual
organizations" (Foster, Kesselman, Tuecke,
The Anatomy of the Grid, 2001)
- Long Running
- jobs taking hours, days, weeks
- High Job Volume
- many users, many jobs per user
- High Data Traffic
- jobs, software, and data
- Execution Delay
- users may accept long delays before execution
commences
6EDG Architecture
- Submit job from UI to RB
- Match job by RB checking RC and II
- Schedule job on CE
- Retrieve data from SE
- Execute on WN
- Update status in II
- Register output with SE
- Return results to RB
- User fetches output from RB
Key: UI = User Interface, RB = Resource Broker,
RC = Replica Catalogue, II = Information Index,
CE = Computing Element, WN = Worker Node,
SE = Storage Element
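The nine steps above can be written down as a small trace to see how often a single job touches the brokering and catalogue components. This is a minimal Python sketch, not part of the EDG software: the tuple layout, the `CENTRAL` set, and the `central_steps` helper are illustrative assumptions, and each step lists only the components the slide names.

```python
# Each step lists only the components named on the slide.
EDG_JOB_FLOW = [
    ("submit job", {"UI", "RB"}),
    ("match job", {"RB", "RC", "II"}),
    ("schedule job", {"CE"}),
    ("retrieve data", {"SE"}),
    ("execute", {"WN"}),
    ("update status", {"II"}),
    ("register output", {"SE"}),
    ("return results", {"RB"}),
    ("fetch output", {"RB"}),
]

# Assumption: RB, RC and II are treated as the shared, central services.
CENTRAL = {"RB", "RC", "II"}


def central_steps(flow, central=CENTRAL):
    """Return the actions that involve at least one central service."""
    return [action for action, parts in flow if parts & central]
```

Under this assumption, `central_steps(EDG_JOB_FLOW)` returns five of the nine steps, which gives a rough sense of how much of each job's life cycle depends on shared services.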
7EDG Architecture
- Federated?
- Limited capacity for site specific control
- Security policies
- Scheduling/Usage policies
- Distributed?
- Centralised scheduling, bookkeeping, and replica
management
- Data storage and job execution are distributed
8EDG Architecture
- Heterogeneous?
- Every site must have exactly the same software
installed with exactly the same operating system,
and near-identical configuration
- Strong coupling and dependency between EDG
components
- Scalable?
- Does not work well with large clusters, and
centralised scheduler and bookkeeping systems
quickly become overloaded
- Fit for Purpose?
- Does not support physics production style jobs,
which are better suited to job pools and job pull
type mechanisms
9Typical Failure Modes
- Central services do not scale well, even with
service replication
- Failure or overloading of network connections to
the central services isolates users or resources
- Failure of central services incapacitates the entire
Grid
10Distributed Job Management
- Grid services augment decision making process
- Resources and Jobs communicate directly
- Accept possibility of sub-optimal scheduling
because
- Information Gathering, Scheduling, and Job
Management are not cost free operations
(Diagram: Job Submitter, Grid Resources, Grid Services)
11Publish/Subscribe Directory
- Job directory: jobs are published and resources
can subscribe to them to accept them
- Resource directory: available resources are
published and jobs can subscribe to them to
submit a job
- Directory entries will have an expiry date
- Directory is only a guide, not a guarantee
- Ideally the job submitter (JS) and resources will
remember where they have published and will
retract publications if they become invalid
(Diagram: Service Provider, Directory, Service Seeker;
steps: 1. Publish, 2. Search/Query, 3. Subscribe)
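The directory behaviour described above (publish, search, expiry, retraction) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a proposed implementation: the `Directory` and `Entry` names, the `ttl` parameter, and the attribute dictionaries are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Entry:
    owner: str          # who published the entry (job submitter or resource)
    attrs: dict         # free-form description of the job or resource
    expires_at: float   # time after which the entry is ignored


class Directory:
    """A guide to what is on offer, not a guarantee that it still is."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def publish(self, entry_id: str, owner: str, attrs: dict,
                now: float, ttl: float) -> None:
        self._entries[entry_id] = Entry(owner, dict(attrs), now + ttl)

    def retract(self, entry_id: str) -> None:
        # Publishers should retract entries that are no longer valid.
        self._entries.pop(entry_id, None)

    def search(self, wanted: Callable[[dict], bool], now: float) -> List[str]:
        # Expired entries are dropped lazily, at query time.
        self._entries = {k: e for k, e in self._entries.items()
                         if e.expires_at > now}
        return [k for k, e in self._entries.items() if wanted(e.attrs)]


def needs_four(attrs: dict) -> bool:
    return attrs["free_cpus"] >= 4


resources = Directory()
resources.publish("site-a", owner="site-a", attrs={"free_cpus": 8},
                  now=0.0, ttl=60.0)
resources.publish("site-b", owner="site-b", attrs={"free_cpus": 2},
                  now=0.0, ttl=600.0)

fresh = resources.search(needs_four, now=30.0)   # site-a is still listed
stale = resources.search(needs_four, now=120.0)  # site-a has expired
```

Because an entry can outlive the offer it describes, a seeker still has to contact the provider to confirm; the expiry time only bounds how stale the guide can become.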
12Multiple Directories
- Nothing to stop JS from publishing a job in many
directories, or a resource announcing availability
in many directories
- Different directories may have different
characteristics or access permissions
- Jobs may search preferred directories first for
available resources, and similarly for resources
searching for jobs
- Allows a variety of scheduling paradigms to
operate concurrently
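Searching preferred directories first amounts to walking an ordered list and stopping at the first suitable entry. A minimal Python sketch, assuming a hypothetical layout in which each directory maps a resource name to its number of free CPUs (the directory and resource names are invented for illustration):

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: each directory maps a resource name to its free CPUs.
DirectoryView = Dict[str, int]


def find_resource(directories: List[Tuple[str, DirectoryView]],
                  cpus_needed: int) -> Optional[Tuple[str, str]]:
    """Return (directory name, resource name) for the first suitable entry."""
    for dir_name, entries in directories:      # in order of preference
        for resource, free_cpus in entries.items():
            if free_cpus >= cpus_needed:
                return dir_name, resource
    return None


# The job submitter lists its own site's directory before a wider one.
preferred_first = [
    ("local", {"desktop-pool": 2}),
    ("experiment", {"farm-1": 16, "farm-2": 64}),
]
```

A small request is satisfied from the first directory, a larger one falls through to the second, and a request nothing can satisfy returns `None`. The same loop works for a resource searching job directories.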
13Middleman Not Required
- JS can contact resources directly, while services
can still be used to augment JS decision making
- Analogy 1: I can buy my DVD drive from PC World
directly, or I can use Google or Price Guide UK
to find the cheapest drive
- Analogy 2: I can contact the Oxford website
directly via its IP address, or use a DNS lookup
for www.ox.ac.uk using my local cache or network
DNS server
14Support for Job Push and Pull
- Job push: based on job submitter using scheduler
to push job to available resource
- Good for quick feedback, short execution delay
jobs
- Typically the case where job submitter owns
resource which executes job
- Job pull: based on resource acquiring jobs from
job pool
- Good for avoiding scheduler overload with
simulation type jobs (e.g. MC production)
- Expect long execution delay (but must avoid
starvation)
- Good for allowing resource owner to control which
jobs are executed
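The two modes differ mainly in who makes the decision. A minimal Python sketch of both, with hypothetical names (`JobPool`, `push`, the job dictionaries) and two assumed policies: the pool is scanned oldest-first as one simple way to limit starvation, and push sends the job to the shortest queue.

```python
from typing import Callable, Dict, List, Optional


class JobPool:
    """Jobs wait here until a resource chooses to pull one."""

    def __init__(self) -> None:
        self._jobs: List[dict] = []   # oldest job first

    def add(self, job: dict) -> None:
        self._jobs.append(job)

    def pull(self, accepts: Callable[[dict], bool]) -> Optional[dict]:
        # The resource owner's policy decides which jobs are acceptable.
        # Scanning oldest-first is one simple way to limit starvation.
        for i, job in enumerate(self._jobs):
            if accepts(job):
                return self._jobs.pop(i)
        return None


def push(job: dict, queues: Dict[str, List[dict]]) -> str:
    """Submitter-side scheduling: send the job to the shortest queue."""
    target = min(queues, key=lambda name: len(queues[name]))
    queues[target].append(job)
    return target
```

With pull, no scheduler has to track every resource; a resource that only runs MC production would call `pool.pull(lambda job: job["type"] == "mc")` whenever it has capacity. With push, the submitter needs up-to-date knowledge of the queues, which is cheap when it owns the resource.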
15Further Details
- Base job, resource, and directory entries on
Condor ClassAd system
- Make use of Globus toolkit (foundation for
EDG/LCG software)
- Evaluate possibility of Grid Services
implementation
- Extend DIRAC LHCb MC production architecture
- Prepare trial system for Data Challenge 04
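Condor ClassAds describe both jobs and resources as attribute lists, each carrying a Requirements expression that the other side must satisfy and a Rank expression used to order acceptable matches. The Python sketch below imitates that two-sided matching with plain dictionaries and lambdas. It does not use the Condor ClassAd language or any Condor API, and every attribute name and value in it is a made-up example.

```python
from typing import List, Optional

# Stand-in for a ClassAd: a dict of attributes plus callables for
# "requirements" (and, on the job side, "rank"). Purely illustrative.
Ad = dict


def matches(job: Ad, resource: Ad) -> bool:
    """Both sides must accept each other."""
    return bool(job["requirements"](resource)) and \
        bool(resource["requirements"](job))


def best_match(job: Ad, resources: List[Ad]) -> Optional[Ad]:
    """Among mutually acceptable resources, pick the one the job ranks highest."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


job = {
    "owner": "lhcb_prod",
    "type": "mc",
    "requirements": lambda r: r["memory_mb"] >= 512,
    "rank": lambda r: r["free_cpus"],
}

resources = [
    {"name": "farm-1", "memory_mb": 1024, "free_cpus": 4,
     "requirements": lambda j: j["type"] == "mc"},
    {"name": "farm-2", "memory_mb": 2048, "free_cpus": 16,
     "requirements": lambda j: j["owner"] == "local_user"},
    {"name": "farm-3", "memory_mb": 256, "free_cpus": 32,
     "requirements": lambda j: True},
]

chosen = best_match(job, resources)
```

Here `farm-2` rejects the job's owner and `farm-3` fails the job's memory requirement, so only `farm-1` is mutually acceptable. Using the same entry format for jobs, resources, and directory entries would let a directory search and a direct job-to-resource negotiation share one matching rule.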
16Research Plans
- Base requirements and design on existing use
cases, requirements, and available software
- Description and analysis of process and protocol
interaction in CSP
- Expand DistGrid: Jobs, Resources, Directories,
Services, Data
- Use SimGrid to evaluate different strategies
- Benchmark against EDG and DIRAC during DC04
17Overview
- European Data Grid software relies on
- centralised servers
- complete knowledge of system state
- homogeneous systems (same software, same
configuration)
- This has resulted in problems with
- scalability
- robustness
- fitness for purpose
- Propose investigating alternative architecture
- which is suitable for physics analysis and
production jobs
- can be implemented on top of current EDG/LCG
software
- can be trialled during LHCb Data Challenge 2004
(March to May)