Title: Parallel Processing and Interactivity over the (Int.Eu.) Grid
1 Parallel Processing and Interactivity over the (Int.Eu.) Grid
EGEE / Int.EU.Grid Tutorial, Lisbon, 14th November 2007
- Gonçalo Borges, Mário David, Jorge Gomes
- LIP
2 MPI/Interactivity on the Grid
- "It must be difficult to run parallel/interactive applications over the Grid."
- Why do people think that?
  - MPI and interactivity have been neglected by the larger Grid projects, which are aimed at sequential jobs
  - Support in the WMS and APIs is not flexible enough
  - Matchmaking problems for MPI tasks in Grid environments
  - Different models for normal MPI submission, sometimes incompatible with standard site configurations
  - No support for configuring sites for MPI
    - How to manage/control local cluster MPI support?
    - How to set up central MPI support?
  - HEP sites without an MPI background can't/won't invest time in it
3 Batch execution on Grids
[Diagram: batch jobs sent over the Internet to individual remote sites]
4 Parallel Job Execution
- Use resources from different sites
- Search for sets of resources
- Co-allocation and synchronization
[Diagram: parallel tasks spread over remote sites across the Internet]
5 Interactive Job Execution
- Fast start-up
- Execution in high-occupancy situations
[Diagram: interactive job running on a remote site across the Internet]
7 Parallel job support in other Grids
- The current EGEE Grid middleware only supports:
  - Normal JobType
    - Execution of a sequential program
    - At most 1 process is allocated
  - MPICH JobType
    - Only MPICH is supported
    - Hard-coded into the EGEE WMS/RB
    - For every new implementation that needs to be supported, the middleware has to be modified
    - The Grid middleware (WMS/RB) produces a wrapper that executes the binary with (MPICH) mpirun
8 CrossBroker
- CrossBroker: the Int.Eu.Grid meta-scheduler
- Offers the same functionality as the EGEE Resource Broker
- Plus some new features:
  - Full support for parallel applications
    - Inter-cluster (PACX-MPI, MPICH-G2) and intra-cluster (Open MPI, MPICH-P4)
    - Scheduling of intra-cluster and inter-cluster jobs
    - Flexible MPI job start-up based on MPI-START
  - Support for interactive applications via GVid/glogin
    - Fast start-up; I/O forwarding from the application to the user
9 CrossBroker
10 Brokering Problems for MPI
- There is no guarantee that all parallel tasks will start at the same time
- The CrossBroker can adopt two options:
  - 1st choice: select only sites with free resources
  - 2nd choice: allocate a resource temporarily and wait until all other tasks show up
    - Timeshare the resource with a backfilling policy to avoid resource idleness
11 Time Sharing
[Diagram: an MPI job arrives at the CrossBroker (Scheduling Agent, Condor-G); the Grid Resource runs an LRMS]
- A parallel application arrives at the CrossBroker.
12 Time Sharing
[Diagram: as above, with an Application Launcher inside the CrossBroker]
- The CrossBroker submits an agent,
13 Time Sharing
[Diagram: the agent starts on the Grid resource and creates two virtual machines, VM1 and VM2]
- which is created in a temporarily-acquired Grid resource.
14 Time Sharing
[Diagram: as above]
- The agent reports back to the CrossBroker,
15 Time Sharing
[Diagram: the MPI task occupies VM1, waiting for the rest of the tasks]
- and waits until the other MPI agents are started.
16 Time Sharing
[Diagram: a new job arrives at the CrossBroker while the MPI task waits in VM1]
17 Time Sharing
[Diagram: backfilling while the MPI task waits; the job runs in VM2]
- The broker sends (directly) a backfilling job to the agent,
18 Time Sharing
[Diagram: all tasks ready]
- which runs it while it waits for the other MPI agents to start.
19 More Grid-specific problems
- The cluster where the MPI job is supposed to run may not have a shared file system
  - How to distribute the binary and input files? (a manual-staging sketch follows below)
  - How to gather the output?
- Different clusters on the Grid are managed by different Local Resource Management Systems (PBS, LSF, SGE, ...)
  - What is the correct machinefile format?
- How to compile the MPI program?
  - How can a physicist working on a Windows workstation compile his code for/with an Itanium MPI implementation?
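For illustration only, a minimal sketch of what manual staging would look like without a shared file system; the machinefile variable and file names are hypothetical, and this is precisely the step MPI-START automates:

  # Sketch: stage the binary and input to every worker node by hand
  # ($MACHINEFILE, cpi-openmpi, input.dat and output.dat are illustrative names)
  for host in $(sort -u "$MACHINEFILE"); do
      scp cpi-openmpi input.dat "$host:$PWD/"
  done
  # ... run the parallel job ...
  # and gather the output afterwards
  for host in $(sort -u "$MACHINEFILE"); do
      scp "$host:$PWD/output.dat" "output.$host.dat"
  done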
20 MPI-START
- MPI-START: an abstraction layer for MPI jobs
  - Portable: must run under any supported OS
  - Modular: plugin/component architecture
  - Relocatable: independent of absolute paths, to adapt to different site configurations
  - Unique interface to the upper layer
- Sits between the CrossBroker, the LRMS schedulers and the MPI implementations
- Deployed on the WNs and invoked by the CrossBroker wrapper
21 MPI-Start Architecture
- Supporting a new MPI implementation doesn't require any change in the Grid middleware
- Provides a uniform method to start jobs independently of:
  - LRMS (PBS/Torque, SGE, ...)
  - MPI implementation (Open MPI, PACX-MPI, MPICH, ...)
- Hides MPI job start-up details and local infrastructure details
  - Shared/non-shared home directories, location of MPI libraries
- Provides some support to help the user manage his data
22 MPI-START variables
- Interface for intra-cluster MPI:
  - I2G_MPI_APPLICATION
    - The executable
  - I2G_MPI_APPLICATION_ARGS
    - The parameters to be passed to the executable
  - I2G_MPI_TYPE
    - The MPI implementation to use (e.g. openmpi, ...)
  - I2G_MPI_VERSION
    - Specifies which version of the MPI implementation to use; if not defined, the default version is used
23 MPI-START variables
- Interface for intra-cluster MPI:
  - I2G_MPI_PRECOMMAND
    - Specifies a command that is prepended to mpirun (e.g. time)
  - I2G_MPI_PRE_RUN_HOOK
    - Points to a shell script that must contain a pre_run_hook function
    - This function is called before the parallel application is started (typical usage: compilation of the executable); see the sketch below
  - I2G_MPI_POST_RUN_HOOK
    - Like I2G_MPI_PRE_RUN_HOOK, but the script must define a post_run_hook function that is called after the parallel application has finished (typical usage: upload of results)
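As an illustration only (not taken from the tutorial), a hook script pointed to by I2G_MPI_PRE_RUN_HOOK / I2G_MPI_POST_RUN_HOOK might look like the sketch below; the file name, the use of mpicc and the source file name are assumptions:

  #!/bin/sh
  # mpi-hooks.sh (hypothetical file name)
  # pre_run_hook is called by mpi-start before the application starts;
  # here it compiles the executable on the target architecture.
  pre_run_hook () {
      echo "Compiling ${I2G_MPI_APPLICATION}"
      mpicc -o "${I2G_MPI_APPLICATION}" "${I2G_MPI_APPLICATION}.c" || return 1
      return 0
  }
  # post_run_hook is called after the parallel application has finished;
  # a real hook would typically collect or upload the results here.
  post_run_hook () {
      echo "Parallel application finished; gathering results"
      return 0
  }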
24 MPI-START invocation
- The submission (in SGE):

imain179@i2g-ce01 $ cat test2mpistart.sh
#!/bin/sh
# This is a script to show how mpi-start is called
# Set environment variables needed by mpi-start
export I2G_MPI_APPLICATION=/bin/hostname
export I2G_MPI_APPLICATION_ARGS=
export I2G_MPI_NP=2
export I2G_MPI_TYPE=openmpi
export I2G_MPI_FLAVOUR=openmpi
export I2G_MPI_JOB_NUMBER=0
export I2G_MPI_STARTUP_INFO=/home/imain179
export I2G_MPI_PRECOMMAND=time
export I2G_MPI_RELAY=
export I2G_MPI_START=/opt/i2g/bin/mpi-start
# Execute mpi-start
$I2G_MPI_START

imain179@i2g-ce01 $ qsub -S /bin/bash -pe openmpi 2 -l allow_slots_egee=0 ./test2mpistart.sh

- The StdOut:

imain179@i2g-ce01 $ cat test2mpistart.sh.o114486
Scientific Linux CERN SLC release 4.5 (Beryllium)
Scientific Linux CERN SLC release 4.5 (Beryllium)
lflip30.lip.pt
lflip31.lip.pt

- The StdErr:

lflip31 /home/imain179 > cat test2mpistart.sh.e114486
Scientific Linux CERN SLC release 4.5 (Beryllium)
Scientific Linux CERN SLC release 4.5 (Beryllium)
real    0m0.731s
user    0m0.021s
sys     0m0.013s

- MPI commands are transparent to the user
  - No explicit mpiexec/mpirun instruction
  - Start the script via a normal LRMS submission
25 Debug Support in MPI-START
- The debugging support is also controlled via environment variables; the default is not to produce any additional output (see the snippet below)
  - I2G_MPI_START_VERBOSE
    - If set to 1, only very basic information is produced
  - I2G_MPI_START_DEBUG
    - If set to 1, information about the internal flow is output
  - I2G_MPI_START_TRACE
    - If set to 1, set -x is enabled at the beginning of the script
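A minimal sketch of turning the debug output on in the submission script, using the variables above with the values described (this snippet is not from the original slides):

  # Enable mpi-start debugging before calling $I2G_MPI_START in the wrapper script
  export I2G_MPI_START_VERBOSE=1   # basic information only
  export I2G_MPI_START_DEBUG=1     # information about the internal flow
  export I2G_MPI_START_TRACE=1     # enables "set -x" at the beginning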
26 MPI Flow and Remote Debugging
[Flow diagram: START -> Scheduler Plugin ("Is there an LRMS plugin?"; ask the LRMS plugin for the machinefile) -> MPI Plugin ("Is there an MPI plugin?"; detect EGEE environment; activate MPI plugin; prepare MPI run) -> trigger pre-run hooks -> start MPI run -> trigger post-run hooks -> EXIT; if no plugin is found, dump the environment]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
UID      imain179
HOST     lflip25.lip.pt
DATE     Tue Oct 30 16:49:53 WET 2007
VERSION  0.0.34
mpi-start DEBUG: Configuration
mpi-start DEBUG:  > I2G_MPI_APPLICATION=/bin/hostname
mpi-start DEBUG:  > I2G_MPI_APPLICATION_ARGS=
mpi-start DEBUG:  > I2G_MPI_TYPE=openmpi
mpi-start DEBUG:  > I2G_MPI_VERSION=
mpi-start DEBUG:  > I2G_MPI_PRE_RUN_HOOK=
mpi-start DEBUG:  > I2G_MPI_POST_RUN_HOOK=
mpi-start DEBUG:  > I2G_MPI_PRECOMMAND=time
mpi-start DEBUG:  > I2G_MPI_FLAVOUR=openmpi
mpi-start DEBUG:  > I2G_MPI_JOB_NUMBER=0
mpi-start DEBUG:  > I2G_MPI_STARTUP_INFO=/home/imain179
mpi-start DEBUG:  > I2G_MPI_RELAY=
(...)
27 MPI Flow and Remote Debugging
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: enable debugging
mpi-start INFO:  search for scheduler
mpi-start DEBUG: source /opt/i2g/bin/../etc/mpi-start/lsf.scheduler
mpi-start DEBUG: checking for scheduler support: lsf
mpi-start DEBUG: checking for LSF_HOSTS
mpi-start DEBUG: source /opt/i2g/bin/../etc/mpi-start/pbs.scheduler
mpi-start DEBUG: checking for scheduler support: pbs
mpi-start DEBUG: checking for PBS_NODEFILE
mpi-start DEBUG: source /opt/i2g/bin/../etc/mpi-start/sge.scheduler
mpi-start DEBUG: checking for scheduler support: sge
mpi-start DEBUG: checking for PE_HOSTFILE
mpi-start INFO:  activate support for sge
mpi-start DEBUG: convert PE_HOSTFILE into standard format
mpi-start DEBUG: Dump machinefile
mpi-start DEBUG:  > lflip25.lip.pt
mpi-start DEBUG:  > lflip25.lip.pt
mpi-start DEBUG: starting with 2 processes.
(...)
28 MPI Flow and Remote Debugging
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: check for EGEE environment
mpi-start DEBUG: check for default MPI
mpi-start DEBUG: found default MPI in /opt/openmpi/1.1
mpi-start INFO:  activate support for openmpi
mpi-start DEBUG: source /opt/i2g/bin/../etc/mpi-start/openmpi.mpi
29 MPI Flow and Remote Debugging
- MPI Plugin
- Prepare MPI run
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: check for EGEE environment
mpi-start DEBUG: check for default MPI
mpi-start DEBUG: found default MPI in /opt/openmpi/1.1
mpi-start INFO:  activate support for openmpi
mpi-start DEBUG: source /opt/i2g/bin/../etc/mpi-start/openmpi.mpi
(...)

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: use user provided prefix /opt/i2g/openmpi
mpi-start DEBUG: prepend Open MPI to PATH and LD_LIBRARY_PATH
mpi-start INFO:  call backend MPI implementation
mpi-start INFO:  start program with mpirun
(...)
30 MPI Flow and Remote Debugging
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: mpi_start_pre_run_hook
mpi-start DEBUG: mpi_start_pre_run_hook_generic
mpi-start DEBUG: detect shared filesystem
mpi-start DEBUG: found local fs ext2/ext3
mpi-start DEBUG: mpi_start_post_run_hook_copy_ssh
mpi-start DEBUG: fs not shared -> distribute binary
mpi-start DEBUG: distribute "/bin/hostname" to remote node lflip25.lip.pt
mpi-start DEBUG: skip local machine
(...)
31 MPI Flow and Remote Debugging
- Pre-run hook
- Start MPI run
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: mpi_start_pre_run_hook
mpi-start DEBUG: mpi_start_pre_run_hook_generic
mpi-start DEBUG: detect shared filesystem
mpi-start DEBUG: found local fs ext2/ext3
mpi-start DEBUG: mpi_start_post_run_hook_copy_ssh
mpi-start DEBUG: fs not shared -> distribute binary
mpi-start DEBUG: distribute "/bin/hostname" to remote node lflip25.lip.pt
mpi-start DEBUG: skip local machine
(...)

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
START
mpi-start DEBUG: time /opt/i2g/openmpi/bin/mpiexec -x X509_USER_PROXY --prefix /opt/i2g/openmpi -machinefile /tmp/114707.1.imaingridsdj/tmp.XrItm23029 -np 2 /bin/hostname
lflip25.lip.pt
lflip25.lip.pt
FINISHED
(...)
32 MPI Flow and Remote Debugging
[Flow diagram as on slide 26]

imain179@i2g-ce01 $ more test2mpistart.sh.o114707
mpi-start DEBUG: mpi_start_post_run_hook
mpi-start DEBUG: mpi_start_post_run_hook_generic
mpi-start DEBUG: mpi_start_post_run_hook_generic
mpi-start DEBUG: fs not shared -> cleanup binary
mpi-start DEBUG: cleanup "/bin/hostname" to remote node lflip25.lip.pt
mpi-start DEBUG: skip local machine
(...)
33 JDL Parallel jobs
- JOBTYPE
  - "Parallel": more than one CPU
- SUBJOBTYPE
  - openmpi
  - pacx-mpi
  - mpich
  - mpich-g2
- JOBSTARTER
  - If not defined, defaults to mpi-start
- JOBSTARTERARGUMENTS
34JDL Normal vs Parallel (OpenMpi) Job
goncalo_at_i2g-ui02 cat simple.jdl Executable
"simple.sh" StdOutput
"simple.out" StdError
"simple.err" InputSandbox
"simple.sh" OutputSandbox
"simple.out","simple.err" goncalo_at_i2g-ui02
i2g-job-submit simple.jdl
goncalo_at_i2g-ui02 cat cpi_openmpi.jdl Type
"job" JobType
"Parallel" SubJobType
"openmpi" NodeNumber 4 Executable
"cpi-openmpi" StdOutput
"cpi-openmpi.out" StdError
"cpi.opnempi.err" InputSandbox
"cpi.openmpi" OutputSandbox
"cpi-openmpi.out","cpi-openmpi.err" Environment
"OMPI_MCA_mpi_yield_when_idle1" go
ncalo_at_i2g-ui02 i2g-job-submit cpi_openmpi.jdl
35MPI-START over the Grid
Cross
UI
Broker
Replica
Info
Index
Manager
Internet
SERVICES
CE
WN
WN
36 Open MPI
- HLRS is simultaneously an Open MPI project member and an Int.Eu.Grid project member
  - Grid/parallel support built in from the ground up
- State-of-the-art MPI implementation
  - Full support of the MPI-2 standard
  - Full thread support
  - Avoidance of old legacy code
  - Profits from long experience in MPI implementations
  - Avoids the forking problem
- Production-quality research platform
- Rapid deployment on new platforms
37 Pacx-MPI
- Middleware to run MPI applications on a network of parallel computers
  - Starts an MPI job in each cluster
  - Emulates a single, bigger MPI job towards the application
- Optimized, conforming MPI implementation
- Applications just need to be recompiled
38 Pacx-MPI communication (1)
- PACX-MPI maps the MPI ranks of the big job onto the MPI processes running on each cluster
- PACX-MPI adds 2 hidden processes to each local MPI job for external communication
  - Rank 0 of the local MPI job is always the "out" daemon
  - Rank 1 of the local MPI job is always the "in" daemon
39 Pacx-MPI communication (2)
- Internal communication
  - Communication between processes running inside the same local cluster is performed via the local, optimized MPI implementation
- External communication
  - The sender passes the message to the "out" daemon using the local MPI
  - The "out" daemon sends the message to the destination host over the network using a protocol such as TCP
  - The "in" daemon delivers the message to the destination process using the local MPI
40 Pacx-MPI over the Grid
- To run a multi-cluster PACX-MPI job over the Grid:
  - The job has to go through a matchmaking process where the appropriate CEs are selected
  - The different CEs have to know about each other
    - A StartupServer is used by the PACX-MPI job to establish the initial communication
  - Once the initial communication is established, the different CEs communicate directly
41 MPI Across Sites
42 Final Pacx-MPI framework
[Diagram: UI, CrossBroker, Replica Manager, Information Index and other services, CEs and WNs, connected over the Internet]
43JDL Open Mpi vs PACX-MPI Job
goncalo_at_i2g-ui02 cat cpi_openmpi.jdl Type
"job" JobType
"Parallel" SubJobType
"openmpi" NodeNumber 4 Executable
"cpi-openmpi" StdOutput
"cpi-opempi.out" StdError
"cpi-openmpi.err" InputSandbox
"cpi-openmpi" OutputSandbox
"cpi-openmpi.out","cpi-openmpi.err" Environment
"OMPI_MCA_mpi_yield_when_idle1" go
ncalo_at_i2g-ui02 i2g-job-submit cpi_openmpi.jdl
goncalo_at_i2g-ui02 cat cpi_pacxmpi.jdl Type
"job" JobType
"Parallel" SubJobType
pacxmpi" NodeNumber 4 Executable
cpi-pacxmpi" StdOutput
cpi-pacxmpi.out" StdError
"cpi-pacxmpi.err" InputSandbox
"cpi-pacxnmpi" OutputSandbox
"cpi-pacxmpi.out","cpi-pacxmpi.err" Environment
"OMPI_MCA_mpi_yield_when_idle1" go
ncalo_at_i2g-ui02 i2g-job-submit cpi_pacxmpi.jdl
45 Interactivity Requirements
- Fast start-up
  - Possibility of starting the application immediately,
  - even in scenarios in which all computing resources are running batch jobs
- Online input-output streaming
  - Ability to have the application input and output online
- Scheduling policies
  - Interactive jobs are sent to sites with available machines
  - If there are no available machines, use time sharing
46 Interactive Jobs Time Sharing
[Diagram: a batch job arrives at the CrossBroker (Scheduling Agent, Condor-G); the Grid Resource runs an LRMS]
- A normal batch job is submitted and arrives at the CrossBroker.
47 Interactive Jobs Time Sharing
[Diagram: as above, with an Application Launcher inside the CrossBroker]
- The CrossBroker submits an agent,
48 Interactive Jobs Time Sharing
[Diagram: the agent runs on the Grid resource with VM1 and VM2]
- creating two Virtual Machines.
49 Interactive Jobs Time Sharing
[Diagram: as above]
- The agent reports back to the CrossBroker,
50 Interactive Jobs Time Sharing
[Diagram: the batch job runs in VM1]
- and the batch job is submitted directly.
51 Interactive Time Sharing
[Diagram: an interactive job arrives while the batch job runs in VM1]
- An interactive job arrives at the CrossBroker
52 Interactive Jobs Time Sharing
[Diagram: the interactive job goes straight to the running agent; start-up time is reduced because only one layer is involved]
- and is directly submitted using the running agent.
53 Response Time

Mechanism                           | Resource Searching | Resource Selection | Submission (Campus Grid) | Submission (Remote Site)
------------------------------------|--------------------|--------------------|--------------------------|-------------------------
Free machine submission             | 3 s                | 0.5 s              | 17.2 s                   | 22.3 s
Glidein submission to free machine  | 3 s                | 0.5 s              | 29.3 s                   | 33.25 s
Virtual Machine submission          | 0.5 s              | 0.5 s              | 6.79 s                   | 8.12 s

(Resource searching and selection take place in the CrossBroker; the submission columns are measured up to the CE/WN.)
54 JDL Interactive jobs
- INTERACTIVE = true/false
  - Indicates that the job is interactive and that the broker should treat it with higher priority
- INTERACTIVEAGENT = i2glogin
- INTERACTIVEAGENTARGUMENTS
  - These attributes specify the command (and its arguments) used to communicate with the user
    - -r: remote connection
    - -p 195.168.105.65:23433: destination port and IP where the output is forwarded
    - -t: for handling new lines
    - -c: for specifying the real command
55A JDL for Interactive jobs
- i2g-ui01_at_goncalo cat interactive.jdl
- Type "Job"
- VirtualOrganisation "imain"
- JobType "Parallel"
- SubJobType openmpi"
- NodeNumber 11
- Interactive TRUE
- InteractiveAgent glogin
- InteractiveAgentArguments -r p
195.168.105.6523433 - Executable "test-app"
- InputSandbox "test-app", "inputfile"
- OutputSanbox "std.out", "std.err"
- StdErr "std.err
- StdOutput "std.out"
- Rank other.GlueHostBenchmarkSI00
- Requirements other.GlueCEStateStatus
"Production"
56 I/O streaming

i2g-job-submit interactive.jdl

VirtualOrganisation       = "imain"
JobType                   = "Normal"
Interactive               = TRUE
InteractiveAgent          = "i2glogin"
InteractiveAgentArguments = "-r -p 24599:158.109.65.149 -c"
Executable                = "/bin/sh"
InputSandbox              = "i2glogin"

i2glogin -p 24599:158.109.65.149
57 I/O streaming
[Diagram: the job, carrying i2glogin, goes through the CrossBroker to the WN; "i2glogin -p 24599:158.109.65.149" runs on the user side, the user application and i2glogin on the WN]
58 I/O streaming
[Diagram: as above; the user-side i2glogin and the i2glogin next to the user application on the WN are connected]
59 I/O streaming
[Diagram: WN with the user application and i2glogin; user-side session shown below]

i2glogin -p 24599:158.109.65.149
sh-2.05b$ hostname
aow5grid.uab.es
sh-2.05b$ exit
exit
Connection closed by foreign host