Title: Status and evolution of the EGEE Project and its Grid Middleware
1. Status and evolution of the EGEE Project and its Grid Middleware
- By Frédéric Hemmer
- Middleware Manager
- CERN, Geneva, Switzerland
International Conference on Next Generation Networks, Brussels, Belgium, June 2, 2005
2. Contents
- The EGEE Project
- Overview and Structure
- Grid Operations
- Middleware
- Networking Activities
- Applications
- High Energy Physics
- Biomedical
- Summary and Conclusions
3. EGEE goals
- Goal of EGEE: develop a service grid infrastructure which is available to scientists 24 hours a day
- The project concentrates on:
  - building a consistent, robust and secure Grid network that will attract additional computing resources
  - continuously improving and maintaining the middleware in order to deliver a reliable service to users
  - attracting new users from industry as well as science and ensuring they receive the high standard of training and support they need
4. EGEE
- EGEE is the largest Grid infrastructure project in Europe
- 70 leading institutions in 27 countries, federated in regional Grids
- Leveraging national and regional grid activities
- 32 M Euros of EU funding for an initial 2 years, starting 1st April 2004
- EU review, February 2005: successful
- Preparing the 2nd phase of the project: proposal to the EU Grid call, September 2005
- Promoting scientific partnership outside the EU
5. EGEE Geographical Extensions
- EGEE is a truly international undertaking
- Collaborations with other existing European projects, in particular:
  - GÉANT, DEISA, SEE-GRID
- Relations to other projects/proposals:
  - OSG: Open Science Grid (USA)
  - Asia: Korea, Taiwan, EU-ChinaGrid
  - BalticGrid: Lithuania, Latvia, Estonia
  - EELA: Latin America
  - EUMedGrid: Mediterranean Area
- Expansion of the EGEE infrastructure in these regions is a key element for the future of the project and of international science
6. EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
8. Computing Resources, April 2005
9. Infrastructure metrics
Countries, sites, and CPUs available in the EGEE production service

Region                Countries  Sites  CPU M6 (TA)  CPU M15 (TA)  CPU actual
CERN                      0         1        900          1800         1841
UK/Ireland                2        19        100          2200         2398
France                    1         8        400           895         1172
Italy                     1        21        553           679         2164
South East                5        16        146           322          159
South West                2        13        250           250          498
Central Europe            5        10        385           730          629
Northern Europe           2         4        200          2000          427
Germany/Switzerland       2        10        100           400         1733
Russia                    1         9         50           152          276
EGEE total               21       111       3084          9428        11297
USA                       1         3          -             -          555
Canada                    1         6          -             -          316
Asia-Pacific              6         8          -             -          394
Hewlett-Packard           1         3          -             -          172
Total other               9        20          -             -         1437
Grand Total              30       131          -             -        12734

Rows from CERN to "EGEE total" are EGEE partner regions; the remaining rows are other collaborating sites.
10. Service Usage
- VOs and users on the production service
  - Active HEP experiments:
    - the 4 LHC experiments, D0, CDF, Zeus, BaBar
  - Other active VOs:
    - Biomed, ESR (Earth Sciences), Compchem, Magic (Astronomy), EGEODE (Geophysics)
  - 6 disciplines
  - Registered users in these VOs: 600
  - In addition there are many VOs that are local to a region, supported by their ROCs, but not yet visible across EGEE
- Scale of work performed:
  - LHC data challenges 2004:
    - >1 M SI2K years of CPU time (about 1000 CPU years)
    - 400 TB of data generated, moved and stored
    - 1 VO achieved 4000 simultaneous jobs (4 times the CERN grid capacity)
(Chart: number of jobs processed per month)
11. Grid Operations
- The grid is flat, but:
  - There is a hierarchy of responsibility
    - Essential to scale the operation
  - CICs act as a single Operations Centre
    - Operational oversight (grid operator) responsibility rotates weekly between CICs (illustrated by the sketch below)
    - Problems are reported to the ROC/RC
    - The ROC is responsible for ensuring the problem is resolved
    - The ROC oversees regional RCs
  - ROCs are responsible for organising the operations in a region
    - Coordinate deployment of middleware, etc.
  - CERN coordinates sites not associated with a ROC
RC = Resource Centre; ROC = Regional Operations Centre; CIC = Core Infrastructure Centre
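The weekly rotation of operator duty can be pictured with a minimal round-robin sketch; this is not from the talk, and the CIC names and the rotation rule are placeholders rather than the project's actual schedule.

```python
# Minimal sketch (assumption, not EGEE's real schedule) of a weekly rotation
# of grid-operator duty between Core Infrastructure Centres.
import datetime

CICS = ["CIC-A", "CIC-B", "CIC-C", "CIC-D"]  # hypothetical list of CICs

def operator_on_duty(day: datetime.date) -> str:
    """Return the CIC holding grid-operator duty during the ISO week containing `day`."""
    iso_week = day.isocalendar()[1]       # ISO week number (1..53)
    return CICS[iso_week % len(CICS)]     # simple weekly round-robin

print(operator_on_duty(datetime.date(2005, 6, 2)))
```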
12. Grid monitoring
- Operation of the Production Service: real-time display of grid operations
- Accounting information
- Selection of monitoring tools:
  - GIIS Monitor and Monitor Graphs
  - Site Functional Tests
  - GOC Data Base
  - Scheduled Downtimes
  - Live Job Monitor
  - GridIce: VO and fabric view
  - Certificate Lifetime Monitor
13. LCG Deployment Schedule
14. LCG Service Challenges
- Service Challenge 2
  - Throughput test from the LCG Tier-0 to LCG Tier-1 sites
  - Started 14th March
  - Infrastructure set up to 7 sites: NL, IN2P3, FNAL, BNL, FZK, INFN, RAL
  - 100 MB/s to each site
  - 500 MB/s combined to all sites at the same time
  - 500 MB/s to a few sites individually
  - Goal: by end of March 2005, sustained 500 MB/s at CERN
15. SC2 met its throughput targets
- >600 MB/s daily average for 10 days was achieved, midday 23rd March to midday 2nd April (see the volume arithmetic below)
- Not without outages, but the system showed it could recover its rate after outages
- Load was reasonably evenly divided over the sites (given the network bandwidth constraints of the Tier-1 sites)
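To put the sustained rate in perspective, a back-of-the-envelope calculation (not from the slides) converts the 10-day average into a total transferred volume:

```python
# Illustrative arithmetic only: total volume implied by a 600 MB/s daily average
# sustained over the 10-day SC2 window (decimal units).
RATE_MB_PER_S = 600
SECONDS_PER_DAY = 86_400
DAYS = 10

total_tb = RATE_MB_PER_S * SECONDS_PER_DAY * DAYS / 1_000_000  # MB -> TB
print(f"~{total_tb:.0f} TB moved in {DAYS} days")              # ~518 TB
```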
16. Service Challenge 3
- Throughput phase
  - 2 weeks sustained in July 2005
  - Primary goals:
    - 150 MB/s disk-to-disk to Tier-1s
    - 60 MB/s disk (T0) to tape (T1s)
  - Secondary goals:
    - Include a few named T2 sites (T2 -> T1 transfers)
    - Encourage the remaining T1s to start disk-to-disk transfers
- Service phase
  - September to the end of 2005
  - Start with ALICE and CMS, add ATLAS and LHCb in October/November
  - All offline use cases except for analysis
  - More components: WMS, VOMS, catalogues, experiment-specific solutions
  - Implies a production setup (CE, SE, ...)
17. EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
18. Future EGEE Middleware: gLite
- Intended to replace the present middleware (LCG-2)
- Developed mainly from existing components
- Aims to address present shortcomings and advanced needs from applications
- Regular, iterative updates for fast user feedback
- Makes use of web services where currently feasible
(Diagram: evolution from LCG-1 and LCG-2, based on Globus 2, towards gLite-1 and gLite-2, based on web services)
- Application requirements: http://egee-na4.ct.infn.it/requirements/
19. Architecture Design
- A design team including representatives from middleware providers (AliEn, Condor, EDG, Globus, ...) and Operations, including US partners, produced the middleware architecture and design
- Takes into account input and experiences from applications, operations, and related projects
- Focus on the medium term (a few months) and on commonalities with other projects (e.g. OSG):
  - Effective exchange of ideas, requirements, solutions and technologies
  - Coordinated development of new capabilities
  - Open communication channels
  - Joint deployment and testing of middleware
  - Early detection of differences and disagreements
- The 2nd release of gLite (v1.1) was made in May 2005
  - http://cern.ch/glite/packages/R1.1/R20050430/default.asp
  - http://cern.ch/glite/documentation
gLite is not just a software stack, it is a new framework for international collaborative middleware development. Much has been accomplished in the first year. However, this is just the first step.
20. gLite Services in Release 1: software stack and origin (simplified)
- Computing Element
  - Gatekeeper (Globus)
  - Condor-C (Condor)
  - CE Monitor (EGEE)
  - Local batch system (PBS, LSF, Condor)
- Workload Management
  - WMS (EDG)
  - Logging and Bookkeeping (EDG)
  - Condor-C (Condor)
- Information and Monitoring
  - R-GMA (EDG)
- Storage Element
  - gLite I/O (AliEn)
  - Reliable File Transfer (EGEE)
  - GridFTP (Globus)
  - SRM: Castor (CERN), dCache (FNAL, DESY), other SRMs
- Catalog
  - File/Replica/Metadata Catalogs (EGEE)
- Security
  - GSI (Globus)
  - VOMS (DataTAG/EDG)
  - Authentication for C- and Java-based (web) services (EDG)
Now doing rigorous scalability and performance tests on the pre-production service (a job-submission sketch follows below)
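As an illustration of how a job reaches the EDG-derived Workload Management System listed above, the sketch below writes a minimal JDL (Job Description Language) file; the job contents, file names and the submission command mentioned in the comment are assumptions, not taken from the slides.

```python
# Minimal, illustrative JDL file for submission to the gLite WMS.
# All attribute values are hypothetical.
jdl = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

with open("hello.jdl", "w") as f:
    f.write(jdl)

# Submission would then go through the WMS user interface, e.g. (assumption):
#   glite-job-submit hello.jdl
```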
21. Software Process
- The JRA1 software process is based on an iterative method
- It comprises two main 12-month development cycles, divided into shorter development-integration-test-release cycles lasting 1 to 4 weeks
- The two main cycles start with full Architecture and Design phases, but the architecture and design are periodically reviewed and verified
- The process is documented in a number of standard documents:
  - Software Configuration Management (SCM) Plan
  - Test Plan
  - Quality Assurance Plan
  - Developers' Guide
22. Release Process
Software code flows through Development, Integration and Testing before becoming deployment packages:
- Development produces the software code
- Integration runs integration tests; on failure the code goes back for a fix, on pass it moves on to testing
- Testing deploys the software on a testbed; on failure it goes back for a fix, on pass it is packaged
- Deployment packages are released together with an installation guide, release notes, etc.
(A minimal sketch of this pass/fail flow follows below.)
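The following sketch mirrors only the control flow of the diagram above; the stage checks are mocked and every name is a placeholder, not part of the actual EGEE tooling.

```python
# Minimal sketch of the pass/fail release flow; stage behaviour is mocked.
def integration_tests(build):   return "integrated" in build
def testbed_deployment(build):  return "tested" in build
def fix(build, stage):          return build | {stage}   # pretend the failure is fixed

def release(build):
    while not integration_tests(build):    # Integration: fail -> fix, pass -> Testing
        build = fix(build, "integrated")
    while not testbed_deployment(build):   # Testing: fail -> fix, pass -> Deployment
        build = fix(build, "tested")
    return "deployment packages + installation guide + release notes"

print(release({"software code"}))
```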
23. Bug Counts and Trends
- Defect density as of May 18, 2005: 2.01 defects/KLOC (worked example below)
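For readers unfamiliar with the metric, the calculation below shows how a defect density such as 2.01 defects/KLOC is obtained; the defect and line counts are hypothetical, since the slide gives only the resulting density.

```python
# Illustrative only: defect density = defects / (lines of code / 1000).
open_defects = 2010
lines_of_code = 1_000_000                      # hypothetical code-base size
defects_per_kloc = open_defects / (lines_of_code / 1000)
print(f"{defects_per_kloc:.2f} defects/KLOC")  # 2.01
```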
24. gLite: What's next?
- Focus and priority:
  - Bug fixing
  - Support for Service Challenge 3
    - File Transfer Service (FTS)
- New planned features in 1.2:
  - VOMS 1.5 (Oracle support)
  - CE Condor support (in addition to PBS/LSF)
  - WMProxy (web services interface including bulk job submission)
  - Service discovery interface to BDII
  - File Transfer Service improvements
  - R-GMA aligned with the LCG-2 version
- Beyond 1.2:
  - DGAS accounting system
  - Job Provenance
  - Globus Workspace Services
  - Harmonization of security models
  - Integration with Service Discovery
25. EGEE Activities
- 48% service activities (Grid Operations, Support and Management, Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management, Dissemination and Outreach, User Training and Education, Application Identification and Support, Policy and International Cooperation)
Emphasis in EGEE is on operating a production grid and supporting the end-users
26. Outreach and Training
- Public and technical websites constantly evolving to expand the information available and keep it up to date
- 3 conferences organised:
  - 300 at Cork, 400 at Den Haag, 500 at Athens
  - Pisa: 4th project conference, 24-28 October 2005
- More than 70 training events (including the GGF grid school) across many countries
  - 1000 people trained
  - Induction, application developer, advanced, retreats
  - Material archive with more than 100 presentations
- Strong links with the GILDA testbed and GENIUS portal developed in EU DataGrid
27. Deployment of applications
- Pilot applications:
  - High Energy Physics
  - Biomedical applications
    - http://egee-na4.ct.infn.it/biomed/applications.html
- Generic applications: deployment under way
  - Computational Chemistry
  - Earth science research
  - EGEODE: first industrial application
  - Astrophysics
- With interest from:
  - Hydrology
  - Seismology
  - Grid search engines
  - Stock market simulators
  - Digital video, etc.
  - Industry (provider, user, supplier)
28. HEP (ATLAS) Utilisation
- 660K jobs in total (across LCG, NorduGrid, US Grid3)
- 400 kSI2k years of CPU
- In the latest period, an average of 7K jobs/day, with 5K on LCG
29. Bioinformatics
- GPS@: Grid Protein Sequence Analysis
  - NPSA is a web portal offering protein databases and sequence-analysis algorithms to bioinformaticians (3000 hits per day)
  - GPS@ is a gridified version with increased computing power
  - Needs large databases and a large number of short jobs
- xmipp_MLrefine
  - 3D structure analysis of macromolecules from (very noisy) electron-microscopy images
  - Maximum-likelihood approach for finding the optimal model
  - Very compute intensive
- Drug discovery
  - Health-related area with high-performance computation needs
  - An application currently being ported in Germany (Fraunhofer Institute)
30. Medical imaging
- GATE
  - Radiotherapy planning
  - Improvement of precision by Monte Carlo simulation
  - Processing of DICOM medical images
  - Objective: very short computation time, compatible with clinical practice
  - Status: development and performance testing
- CDSS
  - Clinical Decision Support System
  - Assembling of knowledge databases
  - Wider deployment of image classification engines
  - Objective: access to knowledge databases from hospitals
  - Status: from development to deployment, some medical end users
31. Medical imaging
- SiMRI3D
  - 3D Magnetic Resonance Image Simulator
  - MRI physics simulation, parallel implementation
  - Very compute intensive
  - Objective: offering an image-simulator service to the research community
  - Status: parallelised and now running on EGEE resources
- gPTM3D
  - Interactive tool for medical image segmentation and analysis
  - A non-gridified version is distributed in several hospitals
  - Needs very fast scheduling of interactive tasks
  - Objective: shorten computation time using the grid
  - Status: development of the gridified version being finalized
32. Status of the Biomedical VO
(Map of Biomedical VO sites, including Padova and Bari)
33. Grid conclusions
- e-Infrastructure deployment creates a powerful new tool for science, as well as for applications from other fields
- Investments in grid projects and e-Infrastructure are growing world-wide
- Applications are already benefiting from Grid technologies
- Open Source is the right approach for publicly funded projects and necessary for fast and wide adoption
- Europe is strong in the development of e-Infrastructure, also thanks to the initial success of EGEE
- Collaboration across national and international programmes is very important
34. Summary
- EGEE is the first attempt to build a worldwide Grid infrastructure for data-intensive applications from many scientific domains
- A large-scale production grid service is already deployed and being used for HEP and BioMed applications, with new applications being ported
- Resources and user groups are expanding
- A process is in place for migrating new applications to the EGEE infrastructure
- A training programme has started, with many events already held
- Next-generation middleware (gLite) is being tested
- The first project review by the EU was successfully passed in February 2005
- Plans for a follow-on project are being prepared
35. Contacts
- EGEE Web Site
  - http://www.eu-egee.org
- How to join
  - http://public.eu-egee.org/join/
- gLite Web Site
  - http://www.glite.org
- EGEE Project Office
  - project-eu-egee-po@cern.ch