Title: First Use of the UK e-Science Grid
1. First Use of the UK e-Science Grid
- Matthew Palmer, HEP Group, Cambridge University
CLRC e-Science Centre
ATLAS
2. Overview
- The Grid and Globus: an overview
- The Physics
- Using the Grid: from certificates to vast quantities of data
- Results
- Other Grid usage in Particle Physics
- Wish list
- Conclusion
3. Overview of the Grid
- Foster, Kesselman, Tuecke: "The Anatomy of the Grid"
- The problem:
- Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations
- The solution requires:
- Resource management
- Single sign-on
- Delegation
- Integration with local security solutions
- User-based trust relationships
- The solution should be protocol based.
4. Globus
- The lowest level of software in a Grid
- Services provided:
- Security: Globus Security Infrastructure (GSI)
- Authentication based on certificates
- Authorization based on local files (e.g. gridmap files)
- Job submission and interface to the local batch system
- Information services
- Provided through Meta-Directory Services (MDS)
- File transfer
- GridFTP
- Global Access to Secondary Storage (GASS)
- Resource Management
- Globus Resource Allocation Manager (GRAM)
5. The Physics: The ATLAS experiment
6. The Physics: The ATLAS experiment
- Part of the Large Hadron Collider (LHC) at CERN.
- Will produce 1 Petabyte of data per year (1 billion events).
- Need to be able to discover processes that result in a handful (100) of events per year.
- Therefore must simulate all relevant processes completely, otherwise we could get caught out by a few unusual events from well-known/boring processes.
- Simulation is trivially parallelizable: each simulated event is entirely independent of all others.
7. The Physics: Graviton Resonances
- My work last year involved looking at decays of massive graviton resonances predicted by some higher-dimensional theories.
- But the background is very large and therefore needs to be simulated accurately.
- With a locally generated background sample:
- 2 million events
- the error in the simulation is typically larger than the signal!
- So, use the Grid to generate the events we need (see the sketch at the end of this slide).
- 150 million of them!
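Because each event is independent, generation can be split into many jobs that differ only in random seed and output name. A minimal sketch of such a split, assuming a hypothetical generator executable generate_events with --seed, --nevents and --output options (the real generator and its options are not shown in these slides), and submitting everything to one farm for simplicity:

    #!/bin/sh
    # Split the 150 million events into independent chunks, one job per chunk.
    # generate_events and the chunk size are illustrative placeholders.
    NEVENTS=150000000
    CHUNK=1000000
    i=0
    while [ $((i * CHUNK)) -lt $NEVENTS ]; do
        SEED=$((1000 + i))    # a distinct random seed for every job
        globus-job-submit farm003.hep.phy.cam.ac.uk/jobmanager-pbs \
            ./generate_events --seed $SEED --nevents $CHUNK --output events_$i.dat
        i=$((i + 1))
    done

In practice the chunks would be spread across all the sites listed later, not a single farm.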
8. The Physics: Graviton Resonances
9. Using the Grid: Getting access
- Acquire a certificate.
- Get user accounts on all the machines you want to use.
- This involves sending pieces of paper to people.
- Administrative effort is O(n).
- Result is a series of accounts across the UK e-Science network that you aren't supposed to use directly
- and some unwanted email addresses that get spammed!
- Speak to the sysadmin and add your certificate to the local gridmap file (an example entry is sketched at the end of this slide).
- Now you can use the Grid. Yes, it's that simple!
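A gridmap file is just a list of lines mapping a certificate subject to a local account. An illustrative entry (the distinguished name and account name below are made up, not the real ones):

    "/C=UK/O=eScience/OU=Cambridge/L=UCS/CN=some user"  localuser

Once the sysadmin has added such a line, jobs authenticated with that certificate run as the mapped local user.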
10. Using the Grid: Or maybe not
- Firewall issues:
- Client machines have to be added into firewall tables.
- Specific port numbers must be used (see the sketch at the end of this slide).
- Configuration problems/requests.
- Fortunately, system administrators are really helpful!
- Now that the systems are beginning to be used, most issues should be resolved.
- Not too many users yet, though, so sysadmins respond quickly!
- Continuous updates of software mean that there'll be issues for some time.
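On the client side the port-number constraint can be handled by restricting the range of ports the Globus tools listen on, so the firewall only needs a known range opened. A sketch, assuming the site has agreed to open ports 40000-40100 (the range is illustrative):

    # Restrict the callback ports used by the Globus client tools
    GLOBUS_TCP_PORT_RANGE=40000,40100
    export GLOBUS_TCP_PORT_RANGE
    globus-job-run herschel.amtp.cam.ac.uk /bin/ls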
11. Using the Grid: Practicalities 1
- Start a Grid session (obtain a proxy certificate):
- grid-proxy-init
- Run a job:
- E.g. globus-job-run herschel.amtp.cam.ac.uk /bin/ls
- Lists the contents of your home directory on herschel
- Submit a job:
- E.g. globus-job-submit farm003.hep.phy.cam.ac.uk/jobmanager-pbs myjob
- Runs myjob on the HEP group farm using the PBS job manager
- Returns a contact URL, e.g. https://farm003.hep.phy.cam.ac.uk:2045/53546/21305646/
- Job status:
- globus-job-status https://farm003.hep.phy.cam.ac.uk:2045/53546/21305646/
- Returns PENDING, RUNNING, DONE (these steps are combined in the sketch at the end of this slide)
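Put together, a minimal session looks like the sketch below; myjob is the same placeholder executable as above, assumed to already exist on the remote machine:

    # Authenticate once for the session
    grid-proxy-init
    # Quick sanity check that the gatekeeper accepts us
    globus-job-run herschel.amtp.cam.ac.uk /bin/ls
    # Submit to the batch system and keep the contact URL it returns
    CONTACT=$(globus-job-submit farm003.hep.phy.cam.ac.uk/jobmanager-pbs myjob)
    echo "Contact URL: $CONTACT"
    # Poll the job later using that URL
    globus-job-status "$CONTACT"    # PENDING, RUNNING or DONE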
12. Using the Grid: Practicalities 2
- Retrieve stdout and stderr:
- globus-job-get-output -out https://farm003.hep.phy.cam.ac.uk:2045/53546/21305646/
- But a bug means that stdout and stderr get deleted after about 20 minutes, so don't rely on them!
- Redirect stdout and stderr to files instead (a wrapper-script sketch is at the end of this slide).
- Find out information:
- You can find out static information about what machines are available, and their capabilities and configuration, using:
- grid-info-search -x -h farm003.hep.phy.cam.ac.uk
- But the list of machines is fixed (you need accounts), so why bother?
- What about dynamic information, e.g. number of jobs in the queue, expected time before running and so on? This would be really useful.
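One workaround for the disappearing output is to submit a small wrapper script (already copied to the remote site) that redirects everything to files, then fetch those files by GridFTP afterwards. A sketch with illustrative names and paths:

    #!/bin/sh
    # run_myjob.sh: submitted in place of myjob so stdout/stderr survive
    # the 20-minute cleanup; directory and file names are illustrative.
    cd $HOME/gridjobs || exit 1
    ./myjob > myjob.out 2> myjob.err

    # submitted with:
    #   globus-job-submit farm003.hep.phy.cam.ac.uk/jobmanager-pbs ./run_myjob.sh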
13. Using the Grid: Practicalities 3
- File transfer:
- Use the GSI-enabled ncFTP client
- Has many nice features (later)
- Don't use globus-url-copy: it's very primitive
- Binary issues:
- May have to compile locally
- May have library problems
- Simple solution: restrict use to ix86 Linux machines and compile statically! (See the sketch at the end of this slide.)
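A sketch of the "compile statically" approach, assuming the job is a single C program (the source file name is illustrative; real simulation code would typically be larger and often Fortran or C++):

    # Build a statically linked ix86 Linux executable so the same binary
    # runs on every site regardless of the installed library versions.
    gcc -static -O2 -o myjob myjob.c
    file myjob    # should report "statically linked"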
14. Using the Grid: Site Homogeneity
- Not all sites are the same!
- Disk quotas: large files are expected to be placed in certain locations.
- Firewall constraints: certain port numbers have to be used.
- Different job managers for each system.
- This has major implications for scripts and programs.
- Ideally want to just copy the same set of scripts, programs and data to each site.
- But it doesn't work because of these issues (a per-site lookup sketch is at the end of this slide).
- Can use MDS to find out most things and hide this, but it can still cause problems and adds a lot of extra development time. Plus, the setup can need debugging on each system.
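Until the MDS route is robust, site differences can be hidden behind a small hand-maintained lookup, as in the hypothetical sketch below (the scratch paths and job-manager names are illustrative, not the real site configurations):

    #!/bin/sh
    # Pick per-site settings, then submit the job via the right job manager.
    SITE=$1
    case $SITE in
        farm003.hep.phy.cam.ac.uk) SCRATCH=/data/scratch; JM=jobmanager-pbs ;;
        herschel.amtp.cam.ac.uk)   SCRATCH=/scratch;      JM=jobmanager ;;
        *) echo "unknown site: $SITE" >&2; exit 1 ;;
    esac
    globus-job-submit $SITE/$JM $SCRATCH/myjob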
15. Using the Grid: GSIncFTP
- Currently, using the Grid can feel very remote.
- Time to list files in a directory can be many seconds, as authorization is done for each command.
- This is a problem, as things have a tendency to go wrong!
- Greatly reduces productivity in two ways:
- Latency: lots of time spent waiting for a directory listing.
- Psychological: the Grid becomes hard to use and you end up fighting it.
- GSIncFTP is a GSI-enabled version of the popular ncFTP client
- Has many nice and usable features, e.g. tab completion, listing of file contents, transfer of directories, wild-carding, etc.
- Operations are fast, as authentication/authorization are only done once at logon.
- Look forward to using GSI-SSH
- will give fast, normal access to Grid machines.
16. Using the Grid: Information issues
- MDS gives a wealth of information about processor speeds, RAM, and so on.
- But dynamic information is lacking. In particular, the status of currently running jobs is inadequate.
- Callback features would be useful:
- E.g. e-mail when a job is finished or if it crashes.
- Usage statistics would be useful but are missing.
- These can be implemented manually using scripts (see the sketch at the end of this slide).
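For example, a job-finished e-mail "callback" can be faked with a polling script like the sketch below (the contact URL argument and address are illustrative):

    #!/bin/sh
    # Poll a submitted job and send mail when it leaves the queue.
    CONTACT=$1
    ADDRESS=someone@example.ac.uk
    while true; do
        STATUS=$(globus-job-status "$CONTACT")
        case $STATUS in
            DONE|FAILED)
                echo "Job $CONTACT finished: $STATUS" | mail -s "Grid job $STATUS" $ADDRESS
                break ;;
        esac
        sleep 300    # check every five minutes
    done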
17. Using the Grid: Summary
- Using the Grid via Globus is not too dissimilar to using a batch system.
- There are just more hoops to jump through and different tools to use.
- Potentially it gives you access to much more computing power than through a batch system.
- But, with the current system of obtaining accounts, the UK e-Science Grid does not scale: 10 systems is probably the most it would scale to.
- Middle-ware is being developed to introduce single sign-on across the e-Science Grid.
18. Sites used
- RAL e-Science Centre
- Hrothgar: 16 x dual 1.2 GHz Athlon MP
- London e-Science Centre
- Pioneer: 20 x 1 GHz Athlon
- Southampton e-Science Centre
- Metropolis: dual 450 MHz Pentium III
- Cambridge e-Science Centre
- Herschel: 16 x dual 450 MHz Pentium II
- Tempo: 1 GHz Pentium III
- Cambridge HEP Group
- Farm: 16 x 1.2 GHz Pentium III
- Also has EDG software installed and no direct access (my Globus certificate maps onto a pooled account)
- Total: 104 CPUs of varying speeds
19. The Results
- Generated 150 million events (1200 CPU hours) in less than 24 real hours.
- This corresponds to all of the relevant events in 1 year of operation.
- Then had to analyse them: this took many hours of interactive analysis.
- The results of this have just been submitted to JHEP.
- I have instructed a colleague on how to use the Grid; he is now using it for his own intensive simulations after relatively little effort.
20. The Results
21. Comparison with EDG
- The European Data-Grid (EDG) is middle-ware that runs on top of Globus.
- Users join virtual organisations (VOs). The system gridmap file is automatically updated from VO member lists. Users are mapped to pooled accounts.
- Pro: single sign-on for all EDG sites.
- Con: the pooled account may change, giving security and persistency issues.
- Jobs are not submitted directly to a machine. They are submitted to a resource broker (RB), which queries the currently available machines and submits the job to the most appropriate one (a submission sketch is at the end of this slide).
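For comparison, EDG submission goes through a job description file handed to the broker rather than a named machine. A rough sketch, assuming the edg-job-* command names and standard JDL attributes of EDG releases from this period (the details varied between versions):

    # Describe the job for the resource broker (illustrative JDL)
    cat > myjob.jdl <<'EOF'
    Executable    = "myjob";
    StdOutput     = "myjob.out";
    StdError      = "myjob.err";
    InputSandbox  = {"myjob"};
    OutputSandbox = {"myjob.out", "myjob.err"};
    EOF
    # The broker chooses the most appropriate site and queue
    edg-job-submit myjob.jdl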
22. Comparison with EDG 2
- EDG has support for Grid-enabled storage, including replica catalogues.
- Cons:
- Machine administration is handed over to a program: security issues, and incompatible with existing systems.
- Requires a very particular system version, only available for commodity Linux clusters.
- Other middle-ware is available, e.g. NorduGrid (designed to be run on existing systems). UK e-Science is also developing some of its own middle-ware.
- Middle-ware is essential for the e-Science Grid to reach its potential.
23. Current and Future Particle Physics Grid Usage
- Large-scale data challenges, a portion of which is being generated on various Grids. Typical size:
- 3000 CPUs
- 100 Terabytes of data
- 10^7 events
- Full simulation:
- 40 minutes per event
- 150 million events > 11,000 CPU years!
- High-multiplicity events:
- Large numbers of sub-processes split into separate jobs
- 1000 jobs submitted in a short time
- User analysis of large data sets:
- Distributed, interactive analysis of TB of data
- Many challenges in the current Grid environment
24. Wish list
- Single sign-on: the highest priority
- Not really a Grid until this is achieved
- Site homogeneity
- Or robust abstractions of all variations
- GSI-ssh
- Debugging will be much easier
- More friendly
- Available now
- Book-keeping tools
- To keep track of currently running jobs, status information, etc.
25. Conclusion
- The UK e-Science Grid can be used now to do physics
- To the point where a colleague was able to use the Grid relatively painlessly, with some education
- But:
- It cannot be used to its full potential
- It does not scale above about 10 sites
- Need for middle-ware:
- There are many middle-ware packages available now
- e-Science are developing more for their own needs
- Soon the UK e-Science Grid will reach its potential