Grid job management architecture

Slides: 18
Provided by: Stoke9
1
Grid job management architecture
  • Challenge
  • How to efficiently connect large volumes of
    widely distributed data, computing resources, and
    physics jobs.
  • Proposal
  • An LHC Computing Grid compatible job management
    system which is
  • based on a services model
  • decoupled
  • decentralised
  • suitable for interactive, analysis, and
    production jobs

2
Overview
  • Grid Characteristics and Components
  • Identify what we are trying to achieve
  • EDG Architecture
  • Understand what has been done to date and how
    well it has worked
  • Research Plan
  • Proposal for a distributed job management system

3
Grid Components
4
Primary Grid Characteristics
  • Federated
  • limited or no central control
  • Distributed
  • data, users, and computing power
  • Heterogeneous
  • different base systems
  • Scalable
  • large clusters, and large grids

5
Secondary Grid Characteristics
  • What is a Grid?
  • "coordinated resource sharing and problem solving
    in dynamic, multi-institutional virtual
    organizations" (Foster, Kesselman, Tuecke,
    The Anatomy of the Grid, 2001)
  • Long Running
  • jobs taking hours, days, weeks
  • High Job Volume
  • many users, many jobs per user
  • High Data Traffic
  • jobs, software, and data
  • Execution Delay
  • users may accept long delays before execution
    commences

6
EDG Architecture
  • Submit job from UI to RB
  • Match job by RB checking RC and II
  • Schedule job on CE
  • Retrieve data from SE
  • Execute on WN
  • Update status in II
  • Register output with SE
  • Return results to RB
  • User fetches output from RB

UI: User Interface
RB: Resource Broker
RC: Replica Catalogue
II: Information Index
CE: Computing Element
WN: Worker Node
SE: Storage Element
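
The nine steps above can be read as a fixed pipeline through the EDG components. The following is a minimal sketch that only tabulates that sequence; the `EDG_STEPS` table and the `steps_touching` helper are hypothetical names introduced for illustration and are not part of the EDG software.

```python
# Step order and component abbreviations follow the slide; the EDG_STEPS
# table and steps_touching() helper are hypothetical, for illustration only.
EDG_STEPS = [
    ("Submit job from UI to RB",           {"UI", "RB"}),
    ("Match job by RB checking RC and II", {"RB", "RC", "II"}),
    ("Schedule job on CE",                 {"CE"}),
    ("Retrieve data from SE",              {"SE"}),
    ("Execute on WN",                      {"WN"}),
    ("Update status in II",                {"II"}),
    ("Register output with SE",            {"SE"}),
    ("Return results to RB",               {"RB"}),
    ("User fetches output from RB",        {"RB"}),
]


def steps_touching(component):
    """Return the 1-based step numbers that involve the given component."""
    return [
        number
        for number, (_, parts) in enumerate(EDG_STEPS, start=1)
        if component in parts
    ]


if __name__ == "__main__":
    for name in ("RB", "II", "SE"):
        print(name, steps_touching(name))
```

In this tabulation the Resource Broker appears in four of the nine steps, which gives a rough feel for how much of a job's lifecycle depends on it.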
7
EDG Architecture
  • Federated?
  • Limited capacity for site-specific control
  • Security policies
  • Scheduling/Usage policies
  • Distributed?
  • Centralised scheduling, bookkeeping, and replica
    management
  • Data storage and job execution are distributed

8
EDG Architecture
  • Heterogeneous?
  • Every site must have exactly the same software
    installed with exactly the same operating system,
    and near identical configuration
  • Strong coupling and dependency between EDG
    components
  • Scalable?
  • Doesn't work well with large clusters, and
    centralised scheduler and bookkeeping systems
    quickly become overloaded
  • Fit for Purpose?
  • Doesn't support physics production-style jobs,
    which are better suited to job pools and job-pull
    type mechanisms

9
Typical Failure Modes
  • Central services do not scale well, even with
    service replication
  • Failure or overloading of network connections to
    the central services isolates users or resources
  • Failure of central services incapacitates the
    entire Grid
10
Distributed Job Management
  • Grid services augment decision making process
  • Resources and Jobs communicate directly
  • Accept the possibility of sub-optimal scheduling,
    because information gathering, scheduling, and
    job management are not cost-free operations

[Diagram: Job Submitter, Grid Resources, Grid Services]
11
Publish/Subscribe Directory
  • Job directory: jobs are published, and resources
    can subscribe to them to accept them
  • Resource directory: available resources are
    published, and jobs can subscribe to them to
    submit a job
  • Directory entries will have an expiry date
  • Directory is only a guide, not a guarantee
  • Ideally the job submitter (JS) and resources will
    remember where they have published and will
    retract publications if they become invalid

[Diagram: Service Provider, Directory, Service Seeker;
steps 1. Publish, 2. Search/Query, 3. Subscribe]
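
A directory with expiring entries, as described on this slide, might be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the `Directory` and `Entry` classes, the `ttl` parameter, and the attribute dictionaries are all hypothetical. Subscription is left out on purpose, since the slide treats the directory as a guide only; a seeker would contact the entry's `owner` directly, and the owner may still refuse.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    owner: str         # who published: a job submitter or a resource
    description: dict  # free-form attributes describing the job or resource
    expires_at: float  # absolute expiry time in seconds


class Directory:
    """Hypothetical publish/search directory whose entries expire."""

    def __init__(self):
        self._entries = {}
        self._next_id = 0

    def publish(self, owner, description, ttl, now=None):
        """Add an entry that lives for `ttl` seconds; return its id."""
        now = time.time() if now is None else now
        entry_id = self._next_id
        self._next_id += 1
        self._entries[entry_id] = Entry(owner, description, now + ttl)
        return entry_id

    def retract(self, entry_id):
        """Withdraw a publication that is no longer valid."""
        self._entries.pop(entry_id, None)

    def search(self, predicate, now=None):
        """Return unexpired entries whose description satisfies `predicate`."""
        now = time.time() if now is None else now
        # Purge expired entries first: the directory is a guide, not a guarantee.
        self._entries = {
            i: e for i, e in self._entries.items() if e.expires_at > now
        }
        return [e for e in self._entries.values() if predicate(e.description)]
```

Passing `now` explicitly keeps the sketch deterministic for testing; a real service would rely on the wall clock.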
12
Multiple Directories
  • Nothing stops a JS from publishing a job in many
    directories, or a resource from announcing its
    availability in many directories
  • Different directories may have different
    characteristics or access permissions
  • Jobs may search preferred directories first for
    available resources and similarly for resources
    searching for jobs
  • Allows a variety of scheduling paradigms to
    operate concurrently
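
Searching preferred directories first could look like the sketch below. The function name, the directory names, and the entry attributes are hypothetical and chosen only to illustrate the idea of an ordered search across several directories.

```python
def find_in_preferred_order(directories, matches):
    """Search directories in the caller's preference order.

    `directories` is a list of (name, entries) pairs, most preferred first.
    Returns (name, hits) for the first directory with any match,
    or (None, []) if nothing matches anywhere.
    """
    for name, entries in directories:
        hits = [entry for entry in entries if matches(entry)]
        if hits:
            return name, hits
    return None, []


# Hypothetical directory contents, for illustration only.
directories = [
    ("site-local", [{"resource": "farm-a", "free_cpus": 0}]),
    ("experiment", [{"resource": "farm-b", "free_cpus": 16}]),
    ("grid-wide", [{"resource": "farm-c", "free_cpus": 64}]),
]

name, hits = find_in_preferred_order(
    directories, lambda entry: entry["free_cpus"] >= 8
)
```

Here the job settles for the first directory that offers a usable resource, even though a later directory lists a larger one. That is the kind of sub-optimal but cheap decision the distributed approach accepts.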

13
Middleman Not Required
  • JS can contact resources directly, while services
    can still be used to augment JS decision making
  • Analogy 1: I can buy my DVD drive from PC World
    directly, or I can use Google or Price Guide UK
    to find the cheapest drive
  • Analogy 2: I can contact the Oxford website
    directly via its IP address, or use a DNS lookup
    for www.ox.ac.uk using my local cache or network
    DNS server

14
Support for Job Push and Pull
  • Job push: the job submitter uses a scheduler
    to push the job to an available resource
  • Good for jobs needing quick feedback and a short
    execution delay
  • Typically the case where the job submitter owns
    the resource which executes the job
  • Job pull: the resource acquires jobs from a
    job pool
  • Good for avoiding scheduler overload with
    simulation type jobs (e.g. MC production)
  • Expect long execution delay (but must avoid
    starvation)
  • Good for allowing resource owner to control which
    jobs are executed
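
The pull model can be sketched as a pool from which a resource takes the job it prefers. The slide requires that starvation be avoided but does not say how; the sketch below assumes priority aging (a job's effective priority grows with its waiting time), which is one common technique and not necessarily the one intended. The `Job` and `JobPool` classes, `aging_rate`, and the `accepts` filter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # base priority assigned at submission
    submitted_at: float  # submission time in seconds
    owner: str


class JobPool:
    """Hypothetical job pool from which resources pull work."""

    def __init__(self, aging_rate=1.0):
        self.jobs = []
        self.aging_rate = aging_rate  # priority gained per second of waiting

    def add(self, job):
        self.jobs.append(job)

    def pull(self, now, accepts=lambda job: True):
        """Remove and return the best job the resource is willing to run.

        `accepts` is the resource owner's policy; jobs it rejects stay
        in the pool for other resources.
        """
        candidates = [job for job in self.jobs if accepts(job)]
        if not candidates:
            return None
        # Aging: waiting time raises effective priority, so old jobs
        # eventually overtake newer high-priority ones.
        best = max(
            candidates,
            key=lambda j: j.priority + self.aging_rate * (now - j.submitted_at),
        )
        self.jobs.remove(best)
        return best
```

The `accepts` filter is where the resource owner controls which jobs are executed, while the aging term keeps long-waiting jobs from being passed over indefinitely.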

15
Further Details
  • Base job, resource, and directory entries on
    Condor ClassAd system
  • Make use of Globus toolkit (foundation for
    EDG/LCG software)
  • Evaluate possibility of Grid Services
    implementation
  • Extend DIRAC LHCb MC production architecture
  • Prepare trial system for Data Challenge 04
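
In Condor's ClassAd matchmaking, a job ad and a resource ad each carry a requirements expression and a rank expression, and a match needs both sides' requirements to hold. The sketch below imitates that two-way check with plain Python dictionaries and callables. It does not use ClassAd syntax or any Condor API, and the attribute names and example ads are hypothetical.

```python
def matches(ad_a, ad_b):
    """Two-way match: each ad's requirements must accept the other ad."""
    return (ad_a["requirements"](ad_a, ad_b)
            and ad_b["requirements"](ad_b, ad_a))


def best_match(job, resources):
    """Return the matching resource the job ranks highest, or None."""
    candidates = [r for r in resources if matches(job, r)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: job["rank"](job, r))


# Hypothetical ads, for illustration only.
job = {
    "owner": "lhcb",
    "requirements": lambda me, other: (
        other["os"] == "linux" and other["memory_mb"] >= 512
    ),
    "rank": lambda me, other: other["mips"],  # prefer faster machines
}

resources = [
    {"name": "r1", "os": "linux", "memory_mb": 1024, "mips": 100,
     "requirements": lambda me, other: True},
    # r2 is faster, but its owner only accepts jobs from another group.
    {"name": "r2", "os": "linux", "memory_mb": 2048, "mips": 200,
     "requirements": lambda me, other: other["owner"] == "atlas"},
    # r3 is fastest, but has too little memory for the job.
    {"name": "r3", "os": "linux", "memory_mb": 256, "mips": 300,
     "requirements": lambda me, other: True},
]

chosen = best_match(job, resources)
```

Because both sides state requirements, the same mechanism can express a job's needs and a site's usage policy, which fits the federated control the earlier slides ask for.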

16
Research Plans
  • Base requirements and design on existing use
    cases, requirements, and available software
  • Description and analysis of process and protocol
    interaction in CSP
  • Expand DistGrid: Jobs, Resources, Directories,
    Services, Data
  • Use SimGrid to evaluate different strategies
  • Benchmark against EDG and DIRAC during DC04

17
Overview
  • European Data Grid software relies on
  • centralised servers
  • complete knowledge of system state
  • homogeneous systems (same software, same
    configuration)
  • This has resulted in problems with
  • scalability
  • robustness
  • fitness for purpose
  • Propose investigating alternative architecture
  • which is suitable for physics analysis and
    production jobs
  • can be implemented on top of current EDG/LCG
    software
  • can be trialled during LHCb Data Challenge 2004
    (March to May)