Title: LCG Middleware and Operations Status
1. LCG Middleware and Operations Status
- Simon C. Lin
- Academia Sinica Grid Computing Center
- Taipei, Taiwan 11529
2. Management and Direction
3. Evolution to LCG phase 2
- TDR published (6/05) and reviewed (10/05)
- MoU approved by C-RRB (10/05)
  - Annexes 1 & 2: Regional Centres (CERN, Tier-1s, Tier-2 Centres/Federations)
  - Annex 3: Service level definitions
    - CERN, Tier-1, Tier-2 services
    - Grid Operations Services
  - Annex 4: List of Funding Agencies
  - Annex 5: WLCG Collaboration organisational structure
  - Annex 6: Resource commitments and plans
  - Annex 9: Resources Scrutiny Group (RSG)
- Funding agencies will sign in the next few months
- Metric for service definition still needs to be fully implemented (SFT framework)
- More experience will be gained in SC3 and SC4
4. Longer-term support of middleware
- Started to address the problem
5. Organization
6. Organizational changes
- PEB → Management Board
  - Composition
    - Experiment Computing Coordinators
    - Tier-1 management (new entry)
    - Area Managers
  - Meetings: weekly phone, monthly face-to-face
- New: Collaboration Board
  - Representatives of all centres signing the MoU
    - Tier-0, Tier-1s, Tier-2 centres / federations
  - Experiment spokespersons
  - First meeting early next year
- Task Forces organised by each experiment to
  - focus on the usage and experience of the LCG services and the implications for the experiment
  - provide a communication point between the experiment and LCG deployment/service management
  - coordinate evaluation and testing of new middleware and services relevant for LCG
  - carry out medium-term planning of the evolution of the usage of the LCG services
  - act as a single point for reporting progress with service usage to the Management Board
7. Mode of operation
- Collaboration Board (CB)
  - Meets annually or when the need arises
- Overview Board (POB)
  - Four times per year
- Management Board (MB)
  - Weekly phone meetings (for 1 hour) on Tuesdays at 16:00
  - Monthly (for 2 hours) face-to-face meeting on the Tuesday before the GDB
    - At the location where the GDB takes place
- Grid Deployment Board (GDB)
  - Meets monthly for a full day
  - Optionally preceded by a technical pre-GDB meeting when the need arises
- Architects Forum (AF)
  - Meets every 2 weeks
- Experiment Related Task Forces
  - Defined by the task forces
8. VO-BOX
- A general framework for long-running processes
  - Without special needs like root access
  - Tries to remove the need for WNs to have outbound access
- Technical details (see the sketch after this list)
  - gsissh server
  - Proxy renewal feature
  - CLI / API for job submission
  - Requires an SGM account with access to the software directory
  - Has outbound connectivity
  - And specific inbound port access
- Concerns
  - Sysadmin load at small sites
  - Clash with security policies
  - Preference for generic services that meet experiment requirements
    - Data transfer
    - Secure message passing
- GDB
  - A meeting will be held in January to better define the use cases for the VO-BOX
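To make the "long-running process with proxy renewal" idea concrete, here is a minimal, hypothetical sketch of an agent that could run on a VO box. It is not the actual VO-BOX implementation; the command names, flags, renewal service and intervals are assumptions made purely for illustration.

```python
#!/usr/bin/env python
"""Illustrative sketch of a long-running VO-BOX agent that keeps its
proxy credential fresh.  NOT the real VO-BOX code: the commands, flags
and thresholds below are assumptions for illustration."""

import subprocess
import time

CHECK_INTERVAL = 600   # seconds between checks (assumed)
MIN_LIFETIME = 3600    # renew when less than one hour remains (assumed)

def proxy_timeleft():
    """Return the remaining proxy lifetime in seconds (0 on any error)."""
    try:
        out = subprocess.run(["voms-proxy-info", "--timeleft"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())
    except (subprocess.CalledProcessError, ValueError, FileNotFoundError):
        return 0

def renew_proxy():
    """Ask an (assumed) renewal service for a fresh delegated proxy."""
    subprocess.run(["myproxy-get-delegation", "-s", "myproxy.example.org"],
                   check=False)

def main():
    while True:                       # long-running process on the VO box
        if proxy_timeleft() < MIN_LIFETIME:
            renew_proxy()
        # ... experiment-specific agent work (job submission, transfers) ...
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```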
9. New VO: Geant4
- What it is
  - Geant4 is a Monte Carlo simulation toolkit
  - Used by HEP, astrophysics and biomed
- Production
  - Runs 2 large productions a year, requiring 3 CPU years each
  - Also planning small-scale mini-productions throughout the year
  - Small output (15-20 GB), all stored at CERN
  - Good efficiencies achieved with LCG (85%)
- VO request
  - All centralized services provided at CERN
  - Requests CE access to computing resources
  - 2 GB software directory for software installation
10. Distributed Databases Deployment
- Distributed databases are needed for
  - Conditions, calibration and alignment data
    - Many databases in this category
  - Event TAG information
- General architecture
  - Online/T0: autonomous, reliable service
  - T1s: full data replication, reliable service
  - T2s: partial replication, local caching
- Focus on large-scale deployment
  - October workshop with Tier sites to work out the details
- Not a software development project; not running the DB service
11. Distributed Databases Technologies
- Oracle Real Application Clusters (RAC)
  - Only viable solution, allowing scalability
- Oracle Streams
  - Proprietary replication mechanism
- FroNtier
  - DB caching based on squid servers (sketched below)
- No vendor-neutral replication so far
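As a rough illustration of the FroNtier idea (database queries carried over HTTP so that ordinary squid proxies can cache the answers close to the jobs), the sketch below shows a read-through client. The server URL, proxy address and query encoding are invented for the example; they are not the real FroNtier protocol or client API.

```python
"""Rough sketch of FroNtier-style read-through caching: conditions data
is requested over HTTP and an intermediate squid proxy caches the
response, so repeated queries at a Tier-2 never reach the Oracle
backend.  Server URL, proxy host and query encoding are assumptions."""

import urllib.request

# Route all HTTP requests through the local squid cache (assumed address).
proxy = urllib.request.ProxyHandler({"http": "http://squid.example-t2.org:3128"})
opener = urllib.request.build_opener(proxy)

def get_conditions(run_number):
    """Fetch a conditions payload for a run; identical URLs hit the cache."""
    url = ("http://frontier.example.org/Frontier/query"
           f"?table=calibration&run={run_number}")
    with opener.open(url) as resp:
        return resp.read()           # cached by squid after the first request

if __name__ == "__main__":
    payload = get_conditions(12345)  # first call: reaches the DB via FroNtier
    payload = get_conditions(12345)  # second call: served from the squid cache
    print(len(payload), "bytes")
```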
12. 3D Plans
- CERN
  - DB service up and running
  - Ramp up capacity in 2006 (from 2 nodes to 4 nodes/experiment)
- Tier-1s
  - November: h/w setup defined and plan proposed to the GDB
  - January: h/w acceptance tests, RAC setup
  - Beginning of February: Tier-1 DB readiness workshop
  - February: apps and streams setup at Tier-0
  - March: Tier-1 service starts
  - End of May: service review → h/w defined for full production
  - September: full LCG database service in place
13. Issues and concerns
- Experiments do not have clear access patterns to the distributed DBs
  - Hard to design a system
  - DBs are notoriously hard to get to work right
  - → Clarify access patterns and quantity of data
- Database service deployment is coming very late in the game
  - Pre-production March 06, production Sept 06
  - It is an essential service that has not been extensively tested
  - Experiments have gone down different paths (not too much, luckily)
  - → Test extensively and measure performance
- Tier sites are just beginning to get involved
  - It should be clear what the commitment is
    - MoU (xii): administration of databases required by experiments at Tier-1 centres
  - What if each experiment requires a different database service?
  - → Agree on service requirements for Tier-1s and Tier-2s
14. Milestones
15. Proposed High-Level Milestones: SC3
- Sept 1: include 9 T1 and 10 T2
- Dec 31: for success
  - 5 T1 and 5 T2
  - Appropriate baseline services operational
  - Service availability > 80%
  - Success rate of test jobs > 80%
  - Maintain 1 week of 1 GB/s from CERN to T1
16. Proposed High-Level Milestones: SC4
- Feb 28: all baseline services deployed and operational
  - All T1 and 20 T2
- April 30: setup complete and services demonstrated
  - Run experiment test jobs and data distribution tested
- May 31: start of the stable service phase
  - Support the full computing model of each experiment, including simulation and analysis
  - All T1 and 40 T2
- Sept 30: successful completion of the service phase
  - Service availability > 90%
  - Success rate of test jobs > 90% (a sketch of how such figures can be computed follows this list)
  - Maintain 1 week of 1.6 GB/s from CERN to T1
  - T1s must reach the nominal rate for LHC operations
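The availability and success-rate thresholds in the SC3/SC4 milestones are, in essence, simple ratios over functional-test results. The sketch below shows one plausible way to compute them; the record layout and field names are assumptions for illustration, not the actual SFT data model.

```python
"""Minimal sketch of computing the milestone metrics (service
availability and test-job success rate) from functional-test records.
The record layout below is assumed for illustration only."""

from dataclasses import dataclass

@dataclass
class TestResult:
    site: str
    passed: bool        # did the test job succeed?
    service_up: bool    # was the service reachable at test time?

def milestone_metrics(results):
    """Return (availability, success_rate) as fractions in [0, 1]."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    availability = sum(r.service_up for r in results) / total
    success_rate = sum(r.passed for r in results) / total
    return availability, success_rate

if __name__ == "__main__":
    sample = [TestResult("T1-A", True, True),
              TestResult("T1-A", False, True),
              TestResult("T2-B", True, True),
              TestResult("T2-B", False, False)]
    avail, success = milestone_metrics(sample)
    # Milestone check: SC4 requires both figures to exceed 90%.
    print(f"availability={avail:.0%} success={success:.0%} "
          f"meets SC4: {avail > 0.9 and success > 0.9}")
```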
17. Service Challenge Schedule
- SC1 (Nov 04 - Jan 05): data transfer between CERN and three Tier-1s (FNAL, NIKHEF, FZK)
- SC2 (Apr 05): data distribution from CERN to 7 Tier-1s, 600 MB/sec sustained for 10 days (one third of the final nominal rate)
- SC = Service Challenge
18. CERN Data Recording Challenge Schedule
(timeline shown alongside the Service Challenge milestones SC1, SC2 and the SC4 service phase)
- DRC1 (Mar 05): data recording at CERN sustained at 450 MB/sec for one week
- DRC2 (Dec 05): data recording sustained at 750 MB/sec for one week
- DRC3 (Apr 06): data recording sustained at 1 GB/sec for one week
- DRC4 (Sep 06): data recording sustained at 1.6 GB/sec
- SC = Service Challenge; DRC = Data Recording Challenge (the weekly data volumes these rates imply are worked out below)
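For a sense of scale, sustaining these rates for one week corresponds to the following recorded volumes (plain arithmetic in decimal units):

```python
# Data volume recorded when each DRC rate is sustained for one week
# (decimal units: 1 GB = 10^9 bytes, 1 TB = 1000 GB).
seconds_per_week = 7 * 24 * 3600                        # 604800 s
for name, rate_gb_per_s in [("DRC1", 0.45), ("DRC2", 0.75),
                            ("DRC3", 1.0), ("DRC4", 1.6)]:
    volume_tb = rate_gb_per_s * seconds_per_week / 1000
    print(f"{name}: ~{volume_tb:.0f} TB per week")      # DRC4 is roughly 1 PB
```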
19. High-Performance T0/T1 Network Milestones
(timeline shown alongside the SC and DRC milestones: DRC1 450 MB/s, DRC2 750 MB/s, DRC3 1 GB/s, DRC4 1.6 GB/s)
- OPN1: T0/T1 high-performance network operational at 3 T1 sites
- OPN2: T0/T1 high-performance network operational at 6 T1 sites, at least half via GEANT
- SC = Service Challenge; DRC = Data Recording Challenge; OPN = Optical Private Network
20. Plans under development
(timeline shown alongside the SC, DRC and OPN milestones: DRC1 450 MB/s, DRC2 750 MB/s, DRC3 1 GB/s, DRC4 1.6 GB/s; OPN1 at 3 T1 sites, OPN2 at 6 T1 sites inc. GEANT)
- DAQ-T0 architectures and test schedule
- DAQ-T0-T1 architecture and test schedule
- Distributed database service deployment schedule
- SC = Service Challenge; DRC = Data Recording Challenge; OPN = Optical Private Network
21. Middleware Status
22. Guiding Principles
- Service Oriented Architecture
- Web Services
- Interoperability
- Portability
- Modularity
- Scalability
- Building on existing components in a lightweight manner (AliEn, LCG, Condor, Globus, SRM, ...)
23. gLite Architecture Overview
(architecture diagram: Resource Broker node hosting the Workload Manager (WM), job status flow, and the Storage Element)
24. gLite Middleware Services
(service decomposition diagram; the legend marks the services with an available gLite implementation)
- Access: API, CLI
- Security Services: Authentication, Authorization, Auditing
- Information & Monitoring Services: Information & Monitoring, Application Monitoring, Service Discovery
- Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
- Workload Management Services: Accounting, Computing Element, Workload Management, Job Provenance, Package Manager
- Connectivity
25. gLite Processes
- Architecture definition
  - Based on Design Team work
  - Associated implementation work plan
  - Design description of each service defined in the Architecture document
    - Really is a definition of interfaces
  - Yearly cycle
- Testing Team
  - Tests release candidates on a distributed testbed (CERN, Hannover, Imperial College)
  - Raises critical bugs as needed
  - Iterates with integrators and developers
  - Once a release candidate has passed the functional tests
    - Integration Team produces documentation, release notes and final packaging
    - The release is announced on the gLite web site and the glite-discuss mailing list
- Implementation work plan
  - Prototype testbed deployment for early feedback
  - Progress tracked monthly at the EMT
- EMT defines release contents
  - Based on work plan progress
  - Based on essential items needed
    - So far mainly for the HEP experiments and BioMed
  - Decides on target dates for tags
    - Taking into account enough time for integration testing
- Deployment on the Pre-production Service and/or Service Challenges
  - Feedback from a larger number of sites with different levels of competence
  - Raise critical bugs as needed
    - Critical bugs fixed with Quick Fixes when possible
- Integration Team produces release candidates based on received tags
  - Build, smoke test, deployment modules, configuration
  - Iterates with developers
- Deployment on Production of a selected set of services
  - Based on the needs (deployment, applications)
  - Today: FTS clients, R-GMA, VOMS
26. gLite Releases and Planning
(release timeline, functionality vs. date, April 2005 - February 2006, with "Today" marked)
- gLite 1.0 (April 2005): Condor-C CE, gLite I/O, R-GMA, WMS, LB, VOMS, Single Catalog
- gLite 1.1: File Transfer Service, Metadata catalog
- gLite 1.1.1 and 1.1.2: special releases for the SC File Transfer Service
- gLite 1.2: File Transfer Agents, secure Condor-C
- gLite 1.3: File Placement Service, multi-VO FTS, refactored R-GMA, CE
- gLite 1.4: VOMS for Oracle, SRMcp for FTS, WMproxy, LBproxy, DGAS
- gLite 1.4.1: service release
- gLite 1.5: functionality freeze reached; release date marked on the timeline
- Numerous quick-fix releases (QF1.0.x - QF1.3.x) issued between the main releases
27. gLite Documentation and Information Sources
- Installation Guide
- Release Notes
  - General
  - Individual components
- User Manuals
  - With Quick Guide sections
- CLI man pages
- APIs and WSDL
- Beginners Guide and sample code
- Bug tracking system
- Mailing lists
  - glite-discuss
  - Pre-Production Service
- Other
  - Data Management (FTS) Wiki
  - Pre-Production Services Wiki (public and private)
  - Presentations
28. gLite Testing Status
29. Pre-production Service Status
(diagram of the pre-production sites and the gLite services deployed at each: WMS+LB, VOMS, BDII, R-GMA, MyProxy, FTS, gLite I/O on DPM or Castor, Fireman catalogue; legend groups cover workload management, VO management, information system, authentication, data management and catalogues; sites include ASGC, Taipei)
30. Pre-production sites
31. Summary
- gLite releases have been produced
  - Tested, documented, with installation and release notes
- Subsystems used on
  - Service Challenges
  - Pre-Production Services
  - Production Service
  - And by other communities (e.g. DILIGENT)
- gLite processes are in place
  - Closely monitored by various bodies
  - Hiding many technical problems from the end user
- gLite is more than just software; it is also about
  - Processes, tools and documentation
  - International collaboration
32. NA4 Application Pilot: Biomed
- Production: Biomed Data Challenge on molecular docking
  - >80 CPU years in 6 weeks
  - >40 Kjobs, 60 Kfiles produced (1 TB)
  - 46 M docked ligands
  - High cost in human resources
- In progress: medical file management (DICOM), metadata management (AMGA), security (file encryption), grid workflow processing
- Next priorities: efficient processing of short jobs, fine-grained access control (data and metadata)
33. What happened since Athens?
- The number of users in VOs related to NA4 activity kept growing regularly
  - From 500 at PM9 to 1000 at PM18
- More than 20 applications are deployed on the production infrastructure
- The usage of the grid by pilot applications has evolved significantly during the summer
  - From data challenge to service challenge (HEP)
  - First biomedical data challenge (WISDOM)
- Several existing applications have been migrated to the new middleware by the HEP, biomedical and generic teams
  - With support of the NA4 test team
34. Production
(production plots for LHCb and ATLAS shown on the slide)
- Fundamental activity in preparation for LHC start-up
  - Physics
  - Computing systems
- Examples
  - LHCb: 700 CPU years in 2005 on the EGEE infrastructure
  - ATLAS: over 10,000 jobs per day
  - Comprehensive analysis: see S. Campana et al., "Analysis of the ATLAS Rome Production Experience on the EGEE Computing Grid", e-Science 2005, Melbourne, Australia
- A lot of activity in all involved applications (including, as usual, a lot of activity within non-LHC experiments such as BaBar, CDF and D0)
- A lot more details in DNA4.3.2 (internal review)
35. Production Middleware
36. LCG Releases
- LCG-2_6_0
  - Pre-release sent to IT, UKI and SE for testing
  - New Glue schema: glue-1.2
    - New BDII required to handle both new and old Glue
  - R-GMA from gLite 1.2
  - VO-BOX
  - Upgrade deadline of 3 weeks removed
  - LCG-2_6_0 client libs available as a TAR-ball
    - Shared file system
    - Able to install on 2_4_0 WNs
- LCG-2_7_0
  - Expected before Christmas
  - Info system
    - Make sure Glue 1.2 is fully utilized
    - Publish service versions
  - LFC with VOMS support and performance enhancements
  - DPM with SRM-copy support
  - New R-GMA version
  - Update to new VDT and Globus
37. New Release Plans
- Upgrade and introduce services as soon as they are ready
  - Version tracking via the information system
- When sufficient changes have been made
  - Create a new release
  - Starting point for new sites
- Pre-production service will change
  - From pure gLite to LCG plus new gLite components
38. LCG Porting
- Ports needed when SL cannot be installed
  - Existing batch system OS
  - Hardware support
  - Architecture support
- WN ports are required
  - Service nodes are few, so use SL
- Ports available
  - CentOS 4.1, SuSE 9.3, RedHat 7.3/9
  - MacOS X, Debian
- In progress
  - Solaris, EM64T, FC4, AIX, IRIX
- Consists of contributions from various sites
- For more details: http://www.grid.ie/autobuild
39. Operations
40. EGEE/LCG-2 Grid Sites, September 2005
- EGEE/LCG-2 grid: 160 sites, 36 countries, >15,000 processors, 5 PB storage
- Other national and regional grids: 60 sites, 6,000 processors
41. Close to 10,000 jobs/day
- Over 2.2 million jobs in 2005 so far
- Daily averages, sustained over a month, Jan-Oct 2005: 2183, 2940, 8052, 9432, 10112, 8327, 8352, 9492, 6554, 6103
- 6 M kSI2K.cpu.hours ≈ 700 CPU years (quick check below)
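As a quick sanity check on the last figure (treating the kSI2K normalisation factor as roughly one per CPU), 6 million CPU-hours does indeed correspond to about 700 CPU-years:

```python
# Quick check: 6 M CPU-hours expressed in CPU-years
# (assumes the kSI2K normalisation is roughly 1 per CPU of the time).
hours_per_year = 24 * 365                  # 8760 hours
print(6_000_000 / hours_per_year)          # ~685, i.e. roughly 700 CPU-years
```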
42. Working and usable grid: >10K jobs per day, 15 active VOs
43. EGEE Operations
- Steady EGEE operations across 6 sites (weekly shifts started end 2004): IN2P3, RAL, INFN, Russia, Taipei and CERN
- Site Functional Test (SFT) framework in place since June to monitor in detail the level of services from a given site vs the MoU
  - Will be extended to more sites and services
- One can observe increased site availability with time
(plots: total grid sites; number of sites passing the SFT tests)
44. CIC Operations
- 1 year ago
  - 4 federations involved: CERN, Italy, UK, France
  - 40 sites pass the functional tests
  - Disparate tools
- Now
  - 6 federations: Russia joined March 25th and Taiwan joined October 17th
  - CIC portal integrated with monitoring and ticketing systems
  - Streamlined processes
  - Average of 100 ticket operations each week
  - >80 sites pass the functional tests
45. Operations coordination
- Weekly operations meetings
- Regular ROC and CIC managers meetings
- Series of EGEE Operations Workshops
  - Nov 04, May 05, Sep 05
  - Last one was a joint workshop with Open Science Grid
  - These have been extremely useful
- Will continue in Phase II
  - Bring in related infrastructure projects: coordination point
  - Continue to arrange joint workshops with OSG (and others?)
46. A Selection of Monitoring Tools
1. GIIS Monitor
2. GIIS Monitor graphs
3. Sites Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce VO view
8. GridIce fabric view
9. Certificate Lifetime Monitor
47. Integration of monitoring information
- Monitoring information from the various tools is collected in the R-GMA archiver
- A summary generator calculates the overall status of each monitored object (site, CE, ...); updated every hour
- A metric generator calculates a numerical value for each monitored object, with aggregation CE → site → region → grid; updated once a day (aggregation sketched below)
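A minimal sketch of the kind of roll-up the metric generator performs is shown below. The aggregation rule (a plain average of child metrics) and the example topology are assumptions for illustration; the real generator works off the R-GMA archive and its own metric definitions.

```python
"""Illustrative sketch of hierarchical metric aggregation
(CE -> site -> region -> grid).  The roll-up rule (plain average) and
the example topology are assumptions, not the actual EGEE generator."""

from statistics import mean

# Example topology: region -> site -> CE -> metric in [0, 1]
grid = {
    "Region-A": {"Site-1": {"ce01": 1.0, "ce02": 0.5},
                 "Site-2": {"ce03": 1.0}},
    "Region-B": {"Site-3": {"ce04": 0.0}},
}

def site_metric(ces):
    return mean(ces.values())

def region_metric(sites):
    return mean(site_metric(ces) for ces in sites.values())

def grid_metric(regions):
    return mean(region_metric(sites) for sites in regions.values())

if __name__ == "__main__":
    for region, sites in grid.items():
        print(region, f"{region_metric(sites):.2f}")
    print("grid", f"{grid_metric(grid):.2f}")
```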
48. Summary
- The infrastructure continues to grow moderately
- The operation is becoming more stable with time
- The problem of bad sites is under control
  - Operational oversight is quite strict, through SFT and VO tools
  - The effect of bad sites is much reduced
- Significant workloads are being managed and significant resources are being delivered to applications
- User support is also becoming stable and usable
- Successful interoperability with OSG
  - With strong interest in building inter-operation
- EGEE-II must consolidate and build on this work
49. Conclusion
- This unprecedented way of collaborating on a day-to-day basis will change the sociology of academic life, the eco-system of the business world, and eventually everyone in society
- Collaboration is the keyword; how the AP communities can collaborate and coordinate is our shared challenge
- Examples may include a new AP-X VO, ROCs, MW enhancement, etc.