Title: ATLAS and the Grid
1 ATLAS and the Grid
ACAT02, Moscow, June 2002
RWL Jones, Lancaster University
2 The ATLAS Computing Challenge
- Running conditions at startup:
- 0.8x10^9 event sample → 1.3 PB/year, before data processing
- Reconstructed events plus Monte Carlo data → 10 PB/year (3 PB on disk)
- CPU: 1.6M SpecInt95, including analysis
- CERN alone can handle only a fraction of these resources
3 The Solution: The Grid
- Note: this is truly HPC, but it requires more than raw computing power
- The Grid is not designed for tightly-coupled problems, but there are many spin-offs
4 ATLAS Needs Grid Applications
- The ATLAS OO software framework is Athena, which co-evolves with the LHCb Gaudi framework
- ATLAS is truly intercontinental
- In particular, it is present on both sides of the Atlantic
- Opportunity: the practical convergence between US and European Grid projects will come through the transatlantic applications
- Threat: there is an inevitable tendency towards fragmentation/divergence of effort, which must be resisted
- Other relevant talks:
- Nick Brook: co-development with LHCb, especially through the UK GridPP collaboration (or rather, I'll present this later)
- Alexandre Vaniachine: describing work for the ATLAS Data Challenges
5 Data Challenges
Test Bench: Data Challenges
- Prototype I (May 2002)
- Performance and scalability testing of components of the computing fabric (clusters, disk storage, mass storage system, system installation, system monitoring) using straightforward physics applications. Test job scheduling and data replication software (DataGrid release 1.2).
- Prototype II (Mar 2003)
- Prototyping of the integrated local computing fabric, with emphasis on scaling, reliability and resilience to errors. Performance testing of LHC applications. Distributed application models (DataGrid release 2).
- Prototype III (Mar 2004)
- Full-scale testing of the LHC computing model with fabric management and Grid management software for Tier-0 and Tier-1 centres, with some Tier-2 components (DataGrid release 3).
6 The Hierarchical View
[Diagram: the tiered LHC computing model]
- Online System: one bunch crossing per 25 ns, 100 triggers per second, each event is 1 MByte; PBytes/sec off the detector, 100 MBytes/sec to the Offline Farm (20 TIPS; 1 TIPS = 25,000 SpecInt95, a 1999 PC is about 15 SpecInt95)
- Tier 0: CERN Computer Centre (>20 TIPS), fed at 100 MBytes/sec; linked to Tier 1 at Gbits/sec (or by air freight)
- Tier 1: regional centres, e.g. UK Regional Centre (RAL), US, French and Italian Regional Centres
- Tier 2: Tier-2 centres (1 TIPS each), linked at Gbits/sec
- Tier 3: institute servers (e.g. Lancaster 0.25 TIPS, Sheffield, Manchester, Liverpool) with a physics data cache; physicists work on analysis channels, each institute has 10 physicists working on one or more channels, and data for these channels should be cached by the institute server; 100-1000 Mbits/sec to Tier 4
- Tier 4: workstations
7 A More Grid-like Model
[Diagram: the LHC Computing Facility]
8 Features of the Cloud Model
- All regional facilities have 1/3 of the full reconstructed data
- Allows more on-disk/fast-access space, saves tape
- Multiple copies mean no need for tape backup
- All regional facilities have all of the analysis data (AOD)
- The resource broker can still keep jobs fairly local (see the sketch after this list)
- Centres are Regional and NOT National
- Physicists from other regions should also have access to the computing resources
- Cost sharing is an issue
- Implications for the Grid middleware on accounting:
- Between experiments
- Between regions
- Between analysis groups
- Also, different activities will require different priorities
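A minimal sketch of how a resource broker might keep jobs "fairly local" by ranking candidate sites on data locality and then load. This is illustrative only: the site list, load figures and ranking rule are assumptions, not EDG or ATLAS broker code.

    # Illustrative only: rank candidate sites for a job by whether they hold
    # the required dataset replica and by their current load.
    def rank_sites(sites, dataset):
        """Return sites ordered best-first for running a job on `dataset`."""
        def score(site):
            has_replica = dataset in site["replicas"]        # prefer local data
            return (0 if has_replica else 1, site["load"])   # then lower load
        return sorted(sites, key=score)

    if __name__ == "__main__":
        sites = [
            {"name": "CERN", "replicas": {"esd_part1"}, "load": 0.9},
            {"name": "RAL",  "replicas": {"esd_part2"}, "load": 0.4},
            {"name": "BNL",  "replicas": {"esd_part1"}, "load": 0.3},
        ]
        for s in rank_sites(sites, "esd_part1"):
            print(s["name"])   # BNL, CERN, RAL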
9 Resource Estimates
10 Resource Estimates
- Analysis resources?
- 20 analysis groups
- 20 jobs/group/day → 400 jobs/day
- Sample size: 10^8 events
- 2.5 SI95·s/event → 10^11 SI95·s/day ≈ 1.2x10^6 SI95 (quick check below)
- Additional 20% for activities on smaller samples
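The arithmetic behind the estimate above, written out as a quick check (pure illustration of the slide's numbers):

    # Quick check of the analysis CPU estimate.
    jobs_per_day = 20 * 20          # 20 groups x 20 jobs/group/day
    events_per_job = 1e8            # sample size
    si95_sec_per_event = 2.5

    si95_seconds_per_day = jobs_per_day * events_per_job * si95_sec_per_event
    sustained_si95 = si95_seconds_per_day / 86400   # seconds in a day

    print("%.1e SI95*s/day" % si95_seconds_per_day)  # 1.0e+11
    print("%.2e SI95 sustained" % sustained_si95)    # about 1.2e6, matching the slide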
11 Rough Architecture
[Diagram of the rough architecture; components:]
- User
- User interface to the Grid / experiment framework
- Middleware: resource broker (RB), Grid information service (GIS)
- Data catalogue
- Job configuration / VDC / metadata
- Installation of software and environment
12 Test Beds
- EDG Test Bed 1
- Common to all LHC experiments
- Using/testing EDG Test Bed 1 release code
- Already running boxed fast simulation and installed full simulation
- US ATLAS Test Bed
- Demonstrate the success of the Grid computing model for HEP:
- in data production
- in data access
- in data analysis
- Develop and deploy Grid middleware and applications
- wrap layers around apps
- simplify deployment
- Evolve into a fully functioning, scalable, distributed, tiered Grid
- NorduGrid
- Developing a regional test bed
- Light-weight Grid user interface, working prototypes, etc.
- See the talk by Aleksandr Konstantinov
13 EDG Release 1.2
- EDG has a strong emphasis on middleware development; applications come second
- ATLAS has been testing the stable releases of the EDG software as they become available, as part of WP8 (ATLAS key contact: Silvia Resconi)
- EDG Release 1.2 is under test by Integration Team people plus the Loose Cannons (experiment-independent people) on the development testbed at CERN
- Standard requirements must be met before the ATLAS Applications people test a release:
- The development testbed must consist of at least 3 sites in 3 different countries (e.g. CERN, CNAF, RAL)
- There must be a long (>24 hours) unattended period with a low error rate (<1% of jobs failed)
http://pcatl0a.mi.infn.it/resconi/validation/valid.html
14 EDG TestBed 1 Status (28 May 2002, 17:03)
- Web interface showing the status of ~400 servers at Testbed 1 sites
- 5 main production centres
15 GridPP Sites in Testbed(s)
16 NorduGrid Overview
- Launched in spring 2001, with the aim of creating a Grid infrastructure in the Nordic countries
- Partners from Denmark, Norway, Sweden, and Finland
- Initially the Nordic branch of the EU DataGrid (EDG) project testbed
- Independent developments
- Relies on funding from NorduNet2
http://www.nordugrid.org
17 US Grid Test Bed Sites
U Michigan
Lawrence Berkeley National Laboratory
Boston University
Argonne National Laboratory
Brookhaven National Laboratory
Indiana University
Oklahoma University
University of Texas at Arlington
US-ATLAS testbed launched February 2001
18 US Hardware and Deployment
- 8 gatekeepers: ANL, BNL, LBNL, BU, IU, UM, OU, UTA
- Farms: BNL, LBNL, IU, UTA; multiple R&D gatekeepers
- Uniform OS through kickstart
- Running RH 7.2
- First stage deployment:
- Pacman, Globus 2.0b, cernlib (installations)
- Simple application package
- Second stage deployment:
- Magda, Chimera, GDMP (Grid data management)
- Third stage:
- MC production software, VDC
- Many US names mentioned later; thanks also to Craig Tull, Dan Engh, Mark Sosebee
19 Important Components
- GridView - a simple script tool to monitor the status of the test bed (Java version being developed)
- Gripe - unified user accounts
- Magda - MAnager for Grid Data
- Pacman - package management and distribution tool
- Grappa - web portal based on active notebook technology
20 Grid User Interface
- Several prototype interfaces:
- GRAPPA
- EDG
- NorduGrid
- Lightweight
- Nothing experiment-specific
- GRAT
- Line mode (and we will always need to retain line mode!)
- Now defining a joint ATLAS/LHCb user interface, GANGA
- Co-evolution with Grappa
- Knowledge of the experiment OO architecture needed (Athena/Gaudi)
21 Interfacing Athena/Gaudi to the GRID
[Diagram: the GANGA/Grappa GUI mediating between the Athena/GAUDI application (jobOptions / virtual data, algorithms; histograms, monitoring, results) and the GRID services]
22 EDG GUI for Job Submission
23 GRAPPA
- Based on the XCAT Science Portal, a framework for building personal science portals
- A science portal is an application-specific Grid portal
- Active notebook:
- HTML pages to describe the features of the notebook and how to use it
- HTML forms which can be used to launch parameterizable scripts (transformation)
- Parameters stored in a sub-notebook (derivation)
- Very flexible:
- Jython - access to Java classes (see the short example after this slide)
- Globus Java CoG kit
- XCAT
- XMESSAGES
- Not every user has to write scripts
- Notebooks can be shared among users
- Import/export capability
- Shava Smallen, Rob Gardner
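For readers unfamiliar with the "Jython - access to Java classes" point: in Jython, Java classes are imported and used with Python syntax, which is what lets notebook scripts drive Java-based Grid components such as the CoG kit. A minimal, generic illustration using only standard Java classes (no GRAPPA- or CoG-specific names; the file name and parameters are invented):

    # Jython: Java classes are imported like Python modules and used directly.
    from java.util import Properties
    from java.io import FileOutputStream

    # Build a small job-parameter set the way a notebook script might,
    # then persist it with an ordinary Java API.
    params = Properties()
    params.setProperty("executable", "athena")
    params.setProperty("jobOptions", "AtlfastOptions.txt")
    params.store(FileOutputStream("job.properties"), "example GRAPPA-style parameters")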
24 GRAPPA/XCAT Science Portal Architecture
The prototype can:
- Submit Athena jobs to Grid computing elements
- Manage jobOptions, record sessions
- Staging and output collection supported
- Tested on the US ATLAS Grid Testbed
25 GANGA/Grappa Development Strategy
- Completed existing-technology requirement survey
- Must be Grid-aware but not Grid-dependent
- Still want to be able to pack and go to a standalone laptop
- Must be component-based
- Interface technologies (standards needed → GGF)
- Programmatic API (e.g. C, C++, etc.)
- Scripting as glue, a la Stallman (e.g. Python)
- Others, e.g. SOAP, CORBA, RMI, DCOM, .NET, etc.
- Defining the experiment software services to capture and present the functionality of the Grid service
26 Possible Designs
- Two ways of implementation:
- Based on one of the general-purpose Grid portals (not tied to a single application/framework):
- Alice Environment (AliEn)
- Grid Enabled Web eNvironment for Site-Independent User Job Submission (GENIUS)
- Grid access portal for physics applications (Grappa)
- Based on the concept of a Python bus (P. Mato):
- use whichever modules are required to provide the full functionality of the interface
- use Python to glue these modules, i.e., allow interaction and communication between them (see the sketch below)
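A minimal sketch of the "Python bus" idea: independent modules are plugged into a common Python object that routes requests between them. This is illustrative only; the module names and registration API are assumptions, not GANGA code.

    # Illustrative "Python bus": modules register services on a shared bus and
    # call each other through it, so the GUI, job-configuration and Grid-access
    # pieces stay interchangeable.
    class Bus:
        def __init__(self):
            self._services = {}

        def register(self, name, service):
            self._services[name] = service

        def call(self, name, method, *args):
            return getattr(self._services[name], method)(*args)

    class JobConfig:
        def build(self, options_file):
            return {"executable": "athena", "jobOptions": options_file}

    class GridSubmitter:
        def submit(self, job):
            # A real module would talk to the middleware here.
            print("submitting", job)
            return "job-id-001"

    bus = Bus()
    bus.register("config", JobConfig())
    bus.register("grid", GridSubmitter())

    job = bus.call("config", "build", "AtlfastOptions.txt")
    job_id = bus.call("grid", "submit", job)
    print(job_id)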
27 Python Bus
[Diagram: the Python bus connecting interface modules to the GRID and the Internet]
28 Modules description
- GUI module
- Provides basic functionality
- Can be implemented using:
- the wxPython extension module
- the Qt/Desktop C++ toolkit, etc.
29 Installation Tools
- To use the Grid, deployable software must be deployed on the Grid fabrics, and the deployable run-time environment established (Unix and Windows)
- Installable code and run-time environment/configuration
- Both ATLAS and LHCb use CMT for software management and environment configuration
- CMT knows the package interdependencies and external dependencies → it is the obvious tool to prepare the deployable code and to expose the dependencies to the deployment tool (Christian Arnault, Chas Loomis); a sketch of this hand-off follows the list
- Grid-aware tool to deploy the above:
- PACMAN (Saul Youssef) is a candidate which seems fairly easy to interface with CMT
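How the CMT-to-deployment-tool hand-off could look is sketched below. This is purely hypothetical: the input text only mimics the shape of output from CMT's "cmt show uses", and the emitted lines only imitate a dependency stanza; none of this is the actual CMT or PACMAN tooling described on the slide.

    # Hypothetical illustration: turn a CMT-style "show uses" listing into a
    # flat dependency list that a deployment tool could consume.
    cmt_show_uses = """\
    use AtlasPolicy AtlasPolicy-01-02-03
    use GaudiKernel v12 (/afs/cern.ch/atlas/software)
    use CLHEP 1.8.0.0 External
    """

    def parse_uses(text):
        """Yield (package, version) pairs from lines beginning with 'use'."""
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 3 and parts[0] == "use":
                yield parts[1], parts[2]

    for package, version in parse_uses(cmt_show_uses):
        print("depends %s %s" % (package, version))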
30 Installation Issues
- Most Grid projects seem to assume either that code is pre-installed or that it can be dumped each time into the input sandbox
- The only route for installing software through the Grid seems to be as data in Storage Elements
- In general these are non-local
- Hard to introduce directory trees etc. this way (file-based)
- How do we advertise installed code?
- Check it is installed by a preparation task sent to the remote fabric before/with the job (see the sketch below)
- Advertise that the software is installed in your information service, for use by the resource broker
- Probably need both!
- The local environment and external packages will always be a problem
- Points to a virtual machine idea eventually: Java?
- Options?
- DAR: mixed reports, but CMS are interested
- PACKMAN from AliEn
- LCFG, OSCAR: not really suitable, more for site management?
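A minimal sketch of the "preparation task" idea: a small probe sent ahead of, or with, the job checks whether the required release is present on the remote fabric and reports back so the submitter (or broker) can react. The release number, paths and environment variable here are made-up examples, not ATLAS conventions.

    # Illustrative preparation task: probe a worker node for an installed release
    # before running the real job. Paths and variable names are hypothetical.
    import os
    import sys

    REQUIRED_RELEASE = "3.0.1"
    CANDIDATE_ROOTS = [
        os.environ.get("ATLAS_SW_ROOT", "/opt/atlas"),   # hypothetical env var
        "/usr/local/atlas",
    ]

    def release_installed(release):
        for root in CANDIDATE_ROOTS:
            if os.path.isdir(os.path.join(root, release)):
                return True
        return False

    if __name__ == "__main__":
        if release_installed(REQUIRED_RELEASE):
            print("OK: release %s found" % REQUIRED_RELEASE)
            sys.exit(0)
        print("MISSING: release %s not installed" % REQUIRED_RELEASE)
        sys.exit(1)   # non-zero status tells the submitter to install or resubmit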
31 CMT and deployable code
- Christian Arnault and Charles Loomis have a beta release of CMT that will produce package rpms, which is a large step along the way
- Still need to have minimal dependencies / clean code!
- Need to make the package dependencies explicit
- rpm requires root to install into the system database (but not for a private installation)
- Developer and binary installations are being produced; probably needs further refinement
- Work to expose dependencies as PACMAN cache files is ongoing
- Note there is much work elsewhere on producing rpms of ATLAS code, notably in Copenhagen; this effort has the advantage that the full dependency knowledge in CMT can be exposed
32 pacman
- Package manager for the Grid, in development by Saul Youssef (Boston U, GriPhyN/iVDGL)
- A single tool to easily manage installation and environment setup for the long list of ATLAS, Grid and other software components needed to Grid-enable a site
- fetch, install, configure, add to login environment, update
- Sits on top of (and is compatible with) the many software packaging approaches (rpm, tar.gz, etc.)
- Uses a dependency hierarchy, so one command can drive the installation of a complete environment of many packages
- Packages are organized into caches hosted at various sites
- The "how to fetch" can be cached rather than the desired object itself
- Includes a web interface (for each cache) as well as command-line tools
33 An encryption package needed by Globus
- name: SSLeay
- description: Encryption
- url: http://www.psy.uq.oz.au/ftp/Crypto/ssleay
- source: http://www.psy.uq.oz.au/ftp/Crypto/ssleay
- systems: linux-i386: SSLeay-0.9.0b.tar.gz, SSLeay-0.9.0b; linux2: SSLeay-0.9.0b.tar.gz, SSLeay-0.9.0b; sunos5: SSLeay-0.9.0b.tar.gz, SSLeay-0.9.0b
- depends:
- exists: /usr/local/bin/perl
- inpath: gcc
- bins:
- paths:
- enviros:
- localdoc: README
- daemons:
- install: root: ./Configure linux-elf, make clean, ...
34 Grid Applications Toolkit
- Horst Severini, Kaushik De, Ed May, Wensheng Deng, Jerry Gieraltowski (US Test Bed)
- Repackaged Athena-Atlfast (OO fast detector simulation) for the grid testbed (building on Julian Phillips and the UK effort)
- Script 1 can run on any Globus-enabled node (requires transfer of 17 MB of source)
- Script 2 runs on machines with the packaged software preinstalled on the grid site
- Script 3 runs on AFS-enabled sites (the latest version of the software is used)
- Other user toolkit contents:
- check the status of grid nodes
- submit jobs (without worrying about the underlying middleware or ATLAS software)
- uses only basic RSL and globus-url-copy (see the wrapper sketch below)
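A rough illustration of how such wrapper scripts typically work, not the GRAT scripts themselves: drive the standard Globus 2 command-line tools from a thin script so the user never sees RSL or data-transfer details. The gatekeeper host, job script and file paths are invented for the example.

    # Illustrative wrapper in the spirit of the GRAT scripts (hostnames and
    # paths are made up). Uses the standard Globus 2 command-line tools.
    import subprocess

    GATEKEEPER = "atlas00.example.edu"          # hypothetical gatekeeper

    def run_remote(executable, *args):
        """Run an executable on the remote gatekeeper via globus-job-run."""
        cmd = ["globus-job-run", GATEKEEPER, executable] + list(args)
        return subprocess.call(cmd)

    def fetch_output(remote_path, local_path):
        """Copy a result file back with globus-url-copy (GridFTP)."""
        src = "gsiftp://%s%s" % (GATEKEEPER, remote_path)
        dst = "file://%s" % local_path
        return subprocess.call(["globus-url-copy", src, dst])

    if __name__ == "__main__":
        run_remote("/bin/sh", "run_atlfast.sh")          # hypothetical job script
        fetch_output("/tmp/atlfast.ntuple", "/tmp/atlfast.ntuple")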
35 Monitoring Tool
- GridView - a simple visualization tool using the Globus Toolkit
- First native Globus application for the ATLAS grid (March 2001)
- Collects information using Globus tools. Archival information is stored in a MySQL server on a different machine. Data are published through a web server on a third machine (see the sketch below).
- Plans:
- Java version
- Better visualization
- Historical plots
- Hierarchical MDS information
- Graphical view of system health
- New MDS schemas
- Optimize archived variables
- Publishing historical information through GIIS servers??
- Explore discovery tools
- Explore scalability to large systems
- Patrick McGuigan
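A minimal sketch of the "archive to MySQL" step. Illustrative only: the table layout, column names, hostnames and credentials are assumptions, and the real GridView gathers its input from Globus/MDS tools rather than a hard-coded snapshot.

    # Illustrative archiver: store one monitoring snapshot per node in MySQL.
    # Table layout and credentials are invented for the example.
    import time
    import MySQLdb  # classic MySQL driver for Python

    snapshot = {"atlas00.example.edu": "up", "atlas01.example.edu": "down"}

    db = MySQLdb.connect(host="archive.example.edu", user="gridview",
                         passwd="secret", db="monitoring")
    cur = db.cursor()
    cur.execute("""CREATE TABLE IF NOT EXISTS node_status
                   (ts INT, node VARCHAR(64), status VARCHAR(16))""")
    now = int(time.time())
    for node, status in snapshot.items():
        cur.execute("INSERT INTO node_status (ts, node, status) VALUES (%s, %s, %s)",
                    (now, node, status))
    db.commit()
    db.close()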
36 [Image-only slide, no transcript]
37 MDS Information
- Listing of available object classes
38 More Details
39 MDS Team
- Dantong Yu, Patrick McGuigan, Craig Tull, KD, Dan Engh
- Site monitoring
- publish BNL acas information ?
- Glue schema testbed ?
- Software installation
- pacman information provider
- Application monitoring
- Grid monitoring
- GridView, Ganglia
- hierarchical GIIS server ?
- GriPhyN-PPDG GIIS server ?
40 Data Management Architecture
- AMI: ATLAS Metadata Interface. Query LFNs, associated attributes and values
- MAGDA: MAnager for Grid-based Data. Manages replication, physical location
- VDC: Virtual Data Catalog. Derive and transform LFNs
41 Managing Data - Magda
- MAnager for Grid-based Data (essentially the replica catalogue tool)
- Designed for managed production and chaotic end-user usage
- Designed for rapid development of components to support users quickly, with components later replaced by Grid Toolkit elements
- Deployed as an evolving production tool and as a testing ground for Grid Toolkit components
- GDMP will be incorporated
- Application in DCs:
- Logical files can optionally be organized into collections
- File management in production: replication between BNL and CERN, BNL data access
- GDMP integration, replication and end-user data access in DC1
- Developments:
- Interface with AMI (ATLAS Metadata Interface, allows queries on Logical File Name collections by users; a Grenoble project)
- Interface to the Virtual Data Catalogue (see A. Vaniachine's talk)
- Interfacing with the hybrid ROOT/RDBMS event store
- Athena (ATLAS offline framework) integration, further grid integration
Info: http://www.usatlas.bnl.gov/magda/info
Engine: http://www.usatlas.bnl.gov/magda/dyShowMain.pl
T. Wenaus, W. Deng
42 Magda Architecture Schema
- MySQL database at the core of the system
- DB access via perl, C++, java and cgi (perl) scripts; C++ and Java APIs auto-generated off the MySQL DB schema
- User interaction via web interface and command line
- Principal components (see the illustrative sketch after this slide):
- File catalog covering any file types
- Data repositories organized into sites, each with its locations
- Computers with repository access; a host can access a set of sites
- Replication operations organized into tasks
- Interfacing to the Data Catalogue efforts at Grenoble
- Will replace with standard components (e.g. GDMP) as they become available for production
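An illustrative reading of the component list above as a relational layout. The table and column names are guesses made for explanation only, not the real Magda schema, and the example runs in an in-memory SQLite database rather than MySQL purely so the relationships can be exercised.

    # Illustrative relational layout for the components listed above.
    import sqlite3

    SCHEMA = """
    CREATE TABLE logical_file (lfn TEXT PRIMARY KEY, collection TEXT);
    CREATE TABLE site      (site_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE location  (location_id INTEGER PRIMARY KEY, site_id INTEGER, path TEXT);
    CREATE TABLE host_site (host TEXT, site_id INTEGER);       -- hosts access sites
    CREATE TABLE replica   (lfn TEXT, location_id INTEGER);    -- physical copies
    CREATE TABLE task      (task_id INTEGER PRIMARY KEY, src_site INTEGER,
                            dst_site INTEGER, state TEXT);     -- replication tasks
    """

    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.execute("INSERT INTO site VALUES (1, 'BNL')")
    db.execute("INSERT INTO location VALUES (1, 1, '/data/dc1')")
    db.execute("INSERT INTO logical_file VALUES ('dc1.evgen.0001.root', 'dc1')")
    db.execute("INSERT INTO replica VALUES ('dc1.evgen.0001.root', 1)")
    print(db.execute("""SELECT lfn, name, path FROM replica
                        JOIN location USING (location_id)
                        JOIN site USING (site_id)""").fetchall())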
43 Magda Architecture
- DB access via perl, C++, java and cgi (perl) scripts
- C++ and Java APIs auto-generated off the MySQL DB schema
- User interaction via web interface and command line
44 Magda Sites
45 Conclusion
- The Grid is the only viable solution to the ATLAS computing problem
- The problems of coherence across the Atlantic are large
- ATLAS (and CMS etc.) are at the sharp end, so we will force the divide to be bridged
- Many applications have been developed, but they need to be refined/merged
- These revise our requirements; we must use LCG/GGF and any other forum to ensure the middleware projects satisfy the real needs: this is not a test bed!
- The progress so far is impressive and encouraging
- Good collaborations (especially ATLAS/LHCb)
- The real worry is scaling up to the full system
- Money!
- Manpower!
- Diplomacy?!