Title: Integrating Gaudi with the Grid
1 K. Harrison, CERN, 23rd October 2002

HOW TO COMMISSION A NEW CENTRE FOR LHCb PRODUCTION

- Overview of LHCb distributed production system
- Configuration of access machine
- Job handling
- Setting up Cambridge as a (small-scale) production centre
  - Configuration for summer 2002
  - Problems encountered
  - Future plans

2 LHCb distributed production system

- Production manager stores details of participating sites in two places (illustrated in the sketch at the end of this slide):
  - in a Java servlet that produces job scripts
  - in the PVSS system used for job management
- Each production site must define and configure an access machine
  - Access machine deals with requests from PVSS, and distributes jobs among all machines available at the site
  - In EDG terms, the access machine acts as a Computing Element, and the machines where jobs are run act as Worker Nodes
- The Servlet Runner used to produce job scripts must have write access to the area where a site's job scripts are created
  - May be able to use the CERN Servlet Runner (AFS access), or may need a Servlet Runner installed at the remote site

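Neither the servlet's nor PVSS's data model is given on this slide; purely as a hypothetical illustration, the Python sketch below shows the kind of per-site information that ends up registered in both places (site identity, access machine, submission command, output handling). All field names and values are invented.

    # Hypothetical illustration only: the kind of per-site information the
    # production manager has to hold (once for the Java servlet that makes
    # job scripts, once for PVSS).  Field names are invented for clarity.
    from dataclasses import dataclass

    @dataclass
    class ProductionSite:
        name: str             # site identity, e.g. "Cambridge"
        access_machine: str   # host acting as the Computing Element
        submit_command: str   # how the access machine hands jobs to Worker Nodes
        output_action: str    # what to do with output

    cambridge = ProductionSite(
        name="Cambridge",
        access_machine="lhcb-access.example.ac.uk",   # hypothetical host name
        submit_command="qsub -q lhcb_pipe",           # hypothetical local batch submission
        output_action="copy to CASTOR, update central database",
    )

    print(cambridge)
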
3 Configuration of access machine

- Main steps for configuring the access machine are as follows (sketched in code at the end of this slide):
  - Install PVSS tools
  - Define environment variable LHCBPRODROOT to point to root directory of production area
  - Download and run mcsetup installation script
  - Customise site-specific scripts
    - Customisation basically defines site identity, command for job submission, and what to do with output
  - Set up Servlet Runner if not using CERN Servlet Runner
- More details available at:
  http://lhcb-wdqa.web.cern.ch/lhcb-wdqa/distribution
  http://lhcb-comp.web.cern.ch/lhcb-comp/ComputingModel/datachallenges/slice.doc

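A minimal sketch of the local steps, assuming mcsetup has already been downloaded into the production area as a shell script; the paths and the way mcsetup is invoked are assumptions rather than the documented procedure (PVSS tool installation and the hand-editing of the site-specific scripts are not shown):

    # Sketch of the access-machine setup, assuming mcsetup has already been
    # downloaded into the production area; how mcsetup is obtained and what
    # options it takes are not shown, and all paths are examples only.
    import os
    import subprocess

    PROD_ROOT = "/opt/lhcb/prod"                    # example location

    # Point LHCBPRODROOT at the root directory of the production area.
    os.makedirs(PROD_ROOT, exist_ok=True)
    os.environ["LHCBPRODROOT"] = PROD_ROOT

    # Run the mcsetup installation script (assumed here to be a shell
    # script sitting in the production area after download).
    subprocess.run(["sh", os.path.join(PROD_ROOT, "mcsetup")],
                   check=True, env=os.environ)

    # The site-specific scripts installed by mcsetup then need to be
    # customised by hand: site identity, job-submission command, output handling.
    print("Now edit the site-specific scripts under", PROD_ROOT)
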
4 Job handling

- Basic job handling is as follows (using CERN Servlet Runner); the local steps are sketched in code at the end of this slide:
  - Specify job request by filling in web form at
    http://lhcb-comp.web.cern.ch/lhcb-comp/SICB/pcsf/html/mcbrunel.htm
  - Parameters are passed to the Servlet Runner, which produces job scripts
  - Submit jobs either through PVSS or locally, using script submit-all-scripts installed by mcsetup
  - When jobs are completed, update central database and transfer data to CASTOR using script transfer-all installed by mcsetup

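The web form and the Servlet Runner run at CERN, so only the submission and transfer steps touch the site directly. A minimal sketch of how those local steps might be driven, assuming submit-all-scripts and transfer-all are on the access machine's PATH and take no arguments (which may not match the real installation):

    # Sketch of the local part of the job-handling cycle on the access
    # machine.  submit-all-scripts and transfer-all are the scripts named
    # on this slide; that they are on PATH and need no arguments is an
    # assumption for this illustration.
    import subprocess

    def submit_locally():
        # Alternative to submission through PVSS: run the script installed
        # by mcsetup that submits all pending job scripts.
        subprocess.run(["submit-all-scripts"], check=True)

    def publish_results():
        # After the jobs have finished: update the central database and
        # transfer the data to CASTOR.
        subprocess.run(["transfer-all"], check=True)

    if __name__ == "__main__":
        submit_locally()
        # ... wait for the batch system to drain ...
        publish_results()
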
5 Cambridge Summer 2002 (1)

- Jobs for summer production were run on 10 desktop machines with Red Hat Linux 7.1 installed
  - 5 x P3 (0.9-1.0 GHz, 256-512 MB)
  - 5 x P4 (1.8-2.0 GHz, 256-512 MB)
- Desktop machines are used by people who work interactively, and may submit other jobs; production jobs were therefore run on low-priority batch queues
  - Made use of otherwise-idle CPU cycles
- Each machine used has 10-20 GB of local scratch space; in addition, 20 GB was available for LHCb production on the central file server
- LHCb production tools and software were installed only on the access machine
- Access machine submitted jobs to an NQS pipe queue, for distribution among all production nodes (see the sketch at the end of this slide)

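As an illustration of the last point only: with an NQS pipe queue, the access machine can hand each generated job script to qsub and let NQS route it to whichever production node has capacity. The queue name, script directory, and qsub options below are assumptions, not the actual Cambridge setup.

    # Illustrative only: submitting generated job scripts to an NQS pipe
    # queue from the access machine.  Queue name, script directory and
    # qsub options are assumptions for this sketch.
    import glob
    import subprocess

    PIPE_QUEUE = "lhcb_pipe"                 # NQS pipe queue feeding all nodes
    SCRIPT_DIR = "/opt/lhcb/prod/scripts"    # where the job scripts land

    for script in sorted(glob.glob(SCRIPT_DIR + "/*.sh")):
        # Jobs placed on a pipe queue are routed by NQS to the batch queues
        # of the individual production nodes, so the access machine never
        # has to pick a node itself.
        subprocess.run(["qsub", "-q", PIPE_QUEUE, script], check=True)
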
6 Cambridge Summer 2002 (2)

- A script executed at job startup determined where to run the applications (sketched in code at the end of this slide)
  - If the local scratch area had at least 5 GB free, the LHCb software was copied to a new directory in this area, and run there
  - If there was insufficient free space locally, the LHCb software was copied to a new directory in the LHCb area of the central file server, and run there
- When a job completed, its output was stored on the file server, then the directory where the job was run was deleted
- Log files and DSTs were copied to CERN, using bbftp and locally written tools

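A minimal sketch of the startup and clean-up logic described above; the 5 GB threshold and the clean-up behaviour come from the slide, while all paths, the software source location, and the bbftp step (only indicated in a comment) are assumptions:

    # Sketch of the job-startup decision on this slide: run in local scratch
    # if at least 5 GB is free, otherwise in the LHCb area of the central
    # file server.  Paths and the software source are invented; the real
    # production script is not reproduced here.
    import os
    import shutil
    import uuid

    LOCAL_SCRATCH = "/scratch"
    FILE_SERVER_AREA = "/nfs/lhcb/prod"       # 20 GB LHCb area on file server
    SOFTWARE_SRC = "/nfs/lhcb/software"       # hypothetical software source
    MIN_FREE_BYTES = 5 * 1024**3              # 5 GB threshold from the slide

    def choose_work_area():
        free = shutil.disk_usage(LOCAL_SCRATCH).free
        base = LOCAL_SCRATCH if free >= MIN_FREE_BYTES else FILE_SERVER_AREA
        workdir = os.path.join(base, "lhcb_job_" + uuid.uuid4().hex[:8])
        # Copy the LHCb software into a fresh directory and run from there.
        shutil.copytree(SOFTWARE_SRC, workdir)
        return workdir

    def finish_job(workdir, output_files):
        # Store the output on the file server, then delete the run directory.
        for name in output_files:
            shutil.copy2(os.path.join(workdir, name), FILE_SERVER_AREA)
        shutil.rmtree(workdir)
        # Log files and DSTs are then copied to CERN with bbftp and locally
        # written tools (not shown here).
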
7 Cambridge: Problems encountered (1)

- Configuration process was very drawn out, as all changes had to be made centrally
  - With new installation tools, site configuration is simpler and almost everything is done locally
- Information concerning production was not always communicated quickly to sites outside CERN
  - Situation has improved now that the lhcb-production mailing list has been set up

8 Cambridge: Problems encountered (2)

- Had problems during production when AFS was unavailable, with the sequence as follows (a defensive check is sketched at the end of this slide):
  - Job fails to retrieve parameter files needed by SICBMC
  - SICBMC complains, but runs anyway
  - Job fails to retrieve options files needed by Brunel
  - Brunel core dumps
  - Large amounts of CPU time wasted (SICBMC producing unusable events); human intervention needed after job crash
  - Problem solved with new system, where reliance on AFS is removed
- Brunel v13r1 used a lot of memory (around 200 MB)
  - Some jobs had to be killed as they prevented other users from working
  - Improvements with newer versions of Brunel?

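The new system removes the AFS dependence altogether; purely as an illustration of the failure mode above, a job wrapper could fail fast when the AFS-resident inputs are unreadable, instead of letting SICBMC burn CPU and Brunel core-dump afterwards. The file paths below are invented examples.

    # Illustration of the failure mode above: if the AFS-resident inputs
    # cannot be read, abort before any CPU time is spent on unusable events.
    # File paths are hypothetical.
    import os
    import sys

    REQUIRED_INPUTS = [
        "/afs/cern.ch/lhcb/params/sicbmc.params",   # hypothetical path
        "/afs/cern.ch/lhcb/options/brunel.opts",    # hypothetical path
    ]

    if not all(os.access(path, os.R_OK) for path in REQUIRED_INPUTS):
        sys.exit("Required AFS files unreadable; not starting SICBMC/Brunel")
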
9 Cambridge: Future plans

- Participation in summer 2002 production has been a positive experience
  - Gained experience with production tools, and with running simulation and reconstruction jobs using the latest versions of the software
  - Produced 37k events that have been copied to CASTOR, and are being used locally in physics studies
- Aim to maintain participation in data challenges at least at current (low) level
- An additional 20 x P3 (1.1 GHz, 256 MB) are available in the Cambridge HEP Group if we are able to use Grid tools (Globus or EDG)
  - Will be exploring possibilities in the coming months