Title: LCG Middleware and Operations Status
1. LCG Middleware and Operations Status
- Simon C. Lin
- Academia Sinica Grid Computing Center
- Taipei, Taiwan 11529
2. Management and Direction
3. Evolution to LCG phase 2
- TDR published (6/05) and reviewed (10/05)
- MoU approved by C-RRB (10/05)
  - Annexes 1 & 2: Regional Centres (CERN, Tier-1s, Tier-2 Centres/Federations)
  - Annex 3: Service level definitions
    - CERN, Tier-1, Tier-2 services
    - Grid Operations Services
  - Annex 4: List of Funding Agencies
  - Annex 5: WLCG Collaboration organisational structure
  - Annex 6: Resource commitments and plans
  - Annex 9: Resources Scrutiny Group (RSG)
- Funding agencies will sign in the next few months
- Metric for service definition still needs to be fully implemented (SFT framework)
- More experience will be gained in SC3 and SC4
4. Longer-term support of middleware
- Started to address the problem
5. Organization
6. Organizational changes
- PEB → Management Board
  - Composition
    - Experiment Computing Coordinators
    - Tier-1 management (new entry)
    - Area Managers
  - Meetings: weekly phone, monthly face-to-face
- New: Collaboration Board
  - Representatives of all centres signing the MoU
    - Tier-0, Tier-1s, Tier-2 centres / federations
  - Experiment spokespersons
  - First meeting early next year
- Task Forces organised by each experiment to
  - focus on the usage and experience of the LCG services and the implications for the experiment
  - provide a communication point between the experiment and LCG deployment/service management
  - coordinate evaluation and testing of new middleware and services relevant for LCG
  - carry out medium-term planning of the evolution of the usage of the LCG services
  - act as a single point for reporting progress with service usage to the Management Board
7. Mode of operation
- Collaboration Board (CB)
  - Meets annually or when the need arises
- Overview Board (POB)
  - Four times per year
- Management Board (MB)
  - Weekly phone meetings (for 1 hour) on Tuesdays at 16:00
  - Monthly (for 2 hours) face-to-face meeting on the Tuesday before the GDB
    - At the location where the GDB takes place
- Grid Deployment Board (GDB)
  - Meets monthly for a full day
  - Optionally preceded by a technical pre-GDB meeting when the need arises
- Architects Forum (AF)
  - Meets every 2 weeks
- Experiment Related Task Forces
  - Defined by the task forces
8. VO-BOX
- A general framework for long-running processes
  - Without special needs like root access
  - Tries to remove the need for WNs to have outbound access
- Technical details (see the sketch after this list)
  - gsissh server
  - Proxy renewal feature
  - CLI / API for job submission
  - Requires an SGM account with access to the software directory
  - Has outbound connectivity
  - And specific inbound port access
- Concerns
  - Sysadmin load at small sites
  - Clash with security policies
  - Preference for generic services that meet experiment requirements
    - Data transfer
    - Secure message passing
- GDB
  - A meeting will be held in January to better define the use cases for the VO-BOX
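To make the "long-running process with proxy renewal" idea concrete, here is a minimal, hypothetical sketch of an agent that could run on a VO box. It is not the actual VO-BOX implementation; the command names, flags, renewal service and intervals are assumptions made purely for illustration.

```python
#!/usr/bin/env python
"""Illustrative sketch of a long-running VO-BOX agent that keeps its
proxy credential fresh.  NOT the real VO-BOX code: the commands, flags
and thresholds below are assumptions for illustration."""

import subprocess
import time

CHECK_INTERVAL = 600   # seconds between checks (assumed)
MIN_LIFETIME = 3600    # renew when less than one hour remains (assumed)

def proxy_timeleft():
    """Return the remaining proxy lifetime in seconds (0 on any error)."""
    try:
        out = subprocess.run(["voms-proxy-info", "--timeleft"],
                             capture_output=True, text=True, check=True)
        return int(out.stdout.strip())
    except (subprocess.CalledProcessError, ValueError, FileNotFoundError):
        return 0

def renew_proxy():
    """Ask an (assumed) renewal service for a fresh delegated proxy."""
    subprocess.run(["myproxy-get-delegation", "-s", "myproxy.example.org"],
                   check=False)

def main():
    while True:                       # long-running process on the VO box
        if proxy_timeleft() < MIN_LIFETIME:
            renew_proxy()
        # ... experiment-specific agent work (job submission, transfers) ...
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    main()
```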
9. New VO: Geant4
- What it is
  - Geant4 is a Monte Carlo simulation toolkit
  - Used by HEP, astrophysics and biomed
- Production
  - Runs 2 large productions a year, requiring 3 CPU years each
  - Also planning small-scale mini-productions throughout the year
  - Small output (15-20 GB), all stored at CERN
  - Good efficiencies achieved with LCG (85%)
- VO request
  - All centralized services provided at CERN
  - Requests CE access to computing resources
  - 2 GB software directory for software installation
10. Distributed Databases Deployment
- Distributed databases are needed for
  - Conditions, calibration and alignment data
    - Many databases in this category
  - Event TAG information
- General architecture
  - Online/T0: autonomous, reliable service
  - T1s: full data replication, reliable service
  - T2s: partial replication, local caching
- Focus on large-scale deployment
  - October workshop with Tier sites to work out the details
- Not a software development project; not running the DB service
11. Distributed Databases Technologies
- Oracle Real Application Clusters (RAC)
  - Only viable solution, allowing scalability
- Oracle Streams
  - Proprietary replication mechanism
- FroNtier
  - DB caching based on squid servers (sketched below)
- No vendor-neutral replication so far
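As a rough illustration of the FroNtier idea (database queries carried over HTTP so that ordinary squid proxies can cache the answers close to the jobs), the sketch below shows a read-through client. The server URL, proxy address and query encoding are invented for the example; they are not the real FroNtier protocol or client API.

```python
"""Rough sketch of FroNtier-style read-through caching: conditions data
is requested over HTTP and an intermediate squid proxy caches the
response, so repeated queries at a Tier-2 never reach the Oracle
backend.  Server URL, proxy host and query encoding are assumptions."""

import urllib.request

# Route all HTTP requests through the local squid cache (assumed address).
proxy = urllib.request.ProxyHandler({"http": "http://squid.example-t2.org:3128"})
opener = urllib.request.build_opener(proxy)

def get_conditions(run_number):
    """Fetch a conditions payload for a run; identical URLs hit the cache."""
    url = ("http://frontier.example.org/Frontier/query"
           f"?table=calibration&run={run_number}")
    with opener.open(url) as resp:
        return resp.read()           # cached by squid after the first request

if __name__ == "__main__":
    payload = get_conditions(12345)  # first call: reaches the DB via FroNtier
    payload = get_conditions(12345)  # second call: served from the squid cache
    print(len(payload), "bytes")
```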
12. 3D Plans
- CERN
  - DB service up and running
  - Ramp up capacity in 2006 (from 2 nodes to 4 nodes/experiment)
- Tier-1s
  - November: h/w setup defined and plan proposed to the GDB
  - January: h/w acceptance tests, RAC setup
  - Beginning of February: Tier-1 DB readiness workshop
  - February: apps and streams setup at Tier-0
  - March: Tier-1 service starts
  - End of May: service review → h/w defined for full production
  - September: full LCG database service in place
13. Issues and concerns
- Experiments do not have clear access patterns to the distributed DBs
  - Hard to design a system
  - DBs are notoriously hard to get to work right
  - → Clarify access patterns and quantity of data
- Database service deployment is coming very late in the game
  - Pre-production March 06, production Sept 06
  - It is an essential service that has not been extensively tested
  - Experiments have gone down different paths (not too much, luckily)
  - → Test extensively and measure performance
- Tier sites are just beginning to get involved
  - It should be clear what the commitment is
    - MoU (xii): administration of databases required by experiments at Tier-1 centres
  - What if each experiment requires a different database service?
  - → Agree on service requirements for Tier-1s and Tier-2s
14. Milestones
15. Proposed High-Level Milestones: SC3
- Sept 1: include 9 T1 and 10 T2
- Dec 31: for success
  - 5 T1 and 5 T2
  - Appropriate baseline services operational
  - Service availability > 80%
  - Success rate of test jobs > 80%
  - Maintain 1 week of 1 GB/s from CERN to T1
16. Proposed High-Level Milestones: SC4
- Feb 28: all baseline services deployed and operational
  - All T1 and 20 T2
- April 30: setup complete and services demonstrated
  - Run experiment test jobs and data distribution tested
- May 31: start of the stable service phase
  - Support the full computing model of each experiment, including simulation and analysis
  - All T1 and 40 T2
- Sept 30: successful completion of the service phase
  - Service availability > 90%
  - Success rate of test jobs > 90% (a sketch of how such figures can be computed follows this list)
  - Maintain 1 week of 1.6 GB/s from CERN to T1
  - T1s must reach the nominal rate for LHC operations
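The availability and success-rate thresholds in the SC3/SC4 milestones are, in essence, simple ratios over functional-test results. The sketch below shows one plausible way to compute them; the record layout and field names are assumptions for illustration, not the actual SFT data model.

```python
"""Minimal sketch of computing the milestone metrics (service
availability and test-job success rate) from functional-test records.
The record layout below is assumed for illustration only."""

from dataclasses import dataclass

@dataclass
class TestResult:
    site: str
    passed: bool        # did the test job succeed?
    service_up: bool    # was the service reachable at test time?

def milestone_metrics(results):
    """Return (availability, success_rate) as fractions in [0, 1]."""
    total = len(results)
    if total == 0:
        return 0.0, 0.0
    availability = sum(r.service_up for r in results) / total
    success_rate = sum(r.passed for r in results) / total
    return availability, success_rate

if __name__ == "__main__":
    sample = [TestResult("T1-A", True, True),
              TestResult("T1-A", False, True),
              TestResult("T2-B", True, True),
              TestResult("T2-B", False, False)]
    avail, success = milestone_metrics(sample)
    # Milestone check: SC4 requires both figures to exceed 90%.
    print(f"availability={avail:.0%} success={success:.0%} "
          f"meets SC4: {avail > 0.9 and success > 0.9}")
```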
17. Service Challenge Schedule
- SC1 (Nov 04 - Jan 05): data transfer between CERN and three Tier-1s (FNAL, NIKHEF, FZK)
- SC2 (Apr 05): data distribution from CERN to 7 Tier-1s, 600 MB/sec sustained for 10 days (one third of the final nominal rate)
- SC = Service Challenge
18. CERN Data Recording Challenge Schedule
(timeline shown alongside the Service Challenge milestones SC1, SC2 and the SC4 service phase)
- DRC1 (Mar 05): data recording at CERN sustained at 450 MB/sec for one week
- DRC2 (Dec 05): data recording sustained at 750 MB/sec for one week
- DRC3 (Apr 06): data recording sustained at 1 GB/sec for one week
- DRC4 (Sep 06): data recording sustained at 1.6 GB/sec
- SC = Service Challenge; DRC = Data Recording Challenge (the weekly data volumes these rates imply are worked out below)
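For a sense of scale, sustaining these rates for one week corresponds to the following recorded volumes (plain arithmetic in decimal units):

```python
# Data volume recorded when each DRC rate is sustained for one week
# (decimal units: 1 GB = 10^9 bytes, 1 TB = 1000 GB).
seconds_per_week = 7 * 24 * 3600                        # 604800 s
for name, rate_gb_per_s in [("DRC1", 0.45), ("DRC2", 0.75),
                            ("DRC3", 1.0), ("DRC4", 1.6)]:
    volume_tb = rate_gb_per_s * seconds_per_week / 1000
    print(f"{name}: ~{volume_tb:.0f} TB per week")      # DRC4 is roughly 1 PB
```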
19. High-Performance T0/T1 Network Milestones
(timeline shown alongside the SC and DRC milestones: DRC1 450 MB/s, DRC2 750 MB/s, DRC3 1 GB/s, DRC4 1.6 GB/s)
- OPN1: T0/T1 high-performance network operational at 3 T1 sites
- OPN2: T0/T1 high-performance network operational at 6 T1 sites, at least half via GEANT
- SC = Service Challenge; DRC = Data Recording Challenge; OPN = Optical Private Network
20. Plans under development
(timeline shown alongside the SC, DRC and OPN milestones: DRC1 450 MB/s, DRC2 750 MB/s, DRC3 1 GB/s, DRC4 1.6 GB/s; OPN1 at 3 T1 sites, OPN2 at 6 T1 sites inc. GEANT)
- DAQ-T0 architectures and test schedule
- DAQ-T0-T1 architecture and test schedule
- Distributed database service deployment schedule
- SC = Service Challenge; DRC = Data Recording Challenge; OPN = Optical Private Network
21. Middleware Status
22. Guiding Principles
- Service Oriented Architecture
- Web Services
- Interoperability
- Portability
- Modularity
- Scalability
- Building on existing components in a lightweight manner (AliEn, LCG, Condor, Globus, SRM, ...)
23. gLite Architecture Overview
(architecture diagram: Resource Broker node hosting the Workload Manager (WM), job status flow, and the Storage Element)
24. gLite Middleware Services
(service decomposition diagram; the legend marks the services with an available gLite implementation)
- Access: API, CLI
- Security Services: Authentication, Authorization, Auditing
- Information & Monitoring Services: Information & Monitoring, Application Monitoring, Service Discovery
- Data Management: Metadata Catalog, File & Replica Catalog, Storage Element, Data Movement
- Workload Management Services: Accounting, Computing Element, Workload Management, Job Provenance, Package Manager
- Connectivity
25. gLite Processes
- Architecture definition
  - Based on Design Team work
  - Associated implementation work plan
  - Design description of each service defined in the Architecture document
    - Really is a definition of interfaces
  - Yearly cycle
- Testing Team
  - Tests release candidates on a distributed testbed (CERN, Hannover, Imperial College)
  - Raises critical bugs as needed
  - Iterates with integrators and developers
  - Once a release candidate has passed the functional tests
    - Integration Team produces documentation, release notes and final packaging
    - The release is announced on the gLite web site and the glite-discuss mailing list
- Implementation work plan
  - Prototype testbed deployment for early feedback
  - Progress tracked monthly at the EMT
- EMT defines release contents
  - Based on work plan progress
  - Based on essential items needed
    - So far mainly for the HEP experiments and BioMed
  - Decides on target dates for tags
    - Taking into account enough time for integration testing
- Deployment on the Pre-production Service and/or Service Challenges
  - Feedback from a larger number of sites with different levels of competence
  - Raise critical bugs as needed
    - Critical bugs fixed with Quick Fixes when possible
- Integration Team produces release candidates based on received tags
  - Build, smoke test, deployment modules, configuration
  - Iterates with developers
- Deployment on Production of a selected set of services
  - Based on the needs (deployment, applications)
  - Today: FTS clients, R-GMA, VOMS
26. gLite Releases and Planning
(release timeline, functionality vs. date, April 2005 - February 2006, with "Today" marked)
- gLite 1.0 (April 2005): Condor-C CE, gLite I/O, R-GMA, WMS, LB, VOMS, Single Catalog
- gLite 1.1: File Transfer Service, Metadata catalog
- gLite 1.1.1 and 1.1.2: special releases for the SC File Transfer Service
- gLite 1.2: File Transfer Agents, secure Condor-C
- gLite 1.3: File Placement Service, multi-VO FTS, refactored R-GMA, CE
- gLite 1.4: VOMS for Oracle, SRMcp for FTS, WMproxy, LBproxy, DGAS
- gLite 1.4.1: service release
- gLite 1.5: functionality freeze reached; release date marked on the timeline
- Numerous quick-fix releases (QF1.0.x - QF1.3.x) issued between the main releases
27. gLite Documentation and Information Sources
- Installation Guide
- Release Notes
  - General
  - Individual components
- User Manuals
  - With Quick Guide sections
- CLI man pages
- APIs and WSDL
- Beginners Guide and sample code
- Bug tracking system
- Mailing lists
  - glite-discuss
  - Pre-Production Service
- Other
  - Data Management (FTS) Wiki
  - Pre-Production Services Wiki (public and private)
  - Presentations
28. gLite Testing Status
29. Pre-production Service Status
(diagram of the pre-production sites and the gLite services deployed at each: WMS+LB, VOMS, BDII, R-GMA, MyProxy, FTS, gLite I/O on DPM or Castor, Fireman catalogue; legend groups cover workload management, VO management, information system, authentication, data management and catalogues; sites include ASGC, Taipei)
30. Pre-production sites
31. Summary
- gLite releases have been produced
  - Tested, documented, with installation and release notes
- Subsystems used on
  - Service Challenges
  - Pre-Production Services
  - Production Service
  - And by other communities (e.g. DILIGENT)
- gLite processes are in place
  - Closely monitored by various bodies
  - Hiding many technical problems from the end user
- gLite is more than just software; it is also about
  - Processes, tools and documentation
  - International collaboration
32. NA4 Application Pilot: Biomed
- Production: Biomed Data Challenge on molecular docking
  - >80 CPU years in 6 weeks
  - >40 Kjobs, 60 Kfiles produced (1 TB)
  - 46 M docked ligands
  - High cost in human resources
- In progress: medical file management (DICOM), metadata management (AMGA), security (file encryption), grid workflow processing
- Next priorities: efficient processing of short jobs, fine-grained access control (data and metadata)
33. What happened since Athens?
- The number of users in VOs related to NA4 activity kept growing regularly
  - From 500 at PM9 to 1000 at PM18
- More than 20 applications are deployed on the production infrastructure
- The usage of the grid by pilot applications has evolved significantly during the summer
  - From data challenge to service challenge (HEP)
  - First biomedical data challenge (WISDOM)
- Several existing applications have been migrated to the new middleware by the HEP, biomedical and generic teams
  - With support of the NA4 test team
34. Production
(production plots for LHCb and ATLAS shown on the slide)
- Fundamental activity in preparation for LHC start-up
  - Physics
  - Computing systems
- Examples
  - LHCb: 700 CPU years in 2005 on the EGEE infrastructure
  - ATLAS: over 10,000 jobs per day
  - Comprehensive analysis: see S. Campana et al., "Analysis of the ATLAS Rome Production Experience on the EGEE Computing Grid", e-Science 2005, Melbourne, Australia
- A lot of activity in all involved applications (including, as usual, a lot of activity within non-LHC experiments such as BaBar, CDF and D0)
- A lot more details in DNA4.3.2 (internal review)
35. Production Middleware
36. LCG Releases
- LCG-2_6_0
  - Pre-release sent to IT, UKI and SE for testing
  - New Glue schema: glue-1.2
    - New BDII required to handle both new and old Glue
  - R-GMA from gLite 1.2
  - VO-BOX
  - Upgrade deadline of 3 weeks removed
  - LCG-2_6_0 client libs available as a TAR-ball
    - Shared file system
    - Able to install on 2_4_0 WNs
- LCG-2_7_0
  - Expected before Christmas
  - Info system
    - Make sure Glue 1.2 is fully utilized
    - Publish service versions
  - LFC with VOMS support and performance enhancements
  - DPM with SRM-copy support
  - New R-GMA version
  - Update to new VDT and Globus
37. New Release Plans
- Upgrade and introduce services as soon as they are ready
  - Version tracking via the information system
- When sufficient changes have been made
  - Create a new release
  - Starting point for new sites
- Pre-production service will change
  - From pure gLite to LCG plus new gLite components
38. LCG Porting
- Ports needed when SL cannot be installed
  - Existing batch system OS
  - Hardware support
  - Architecture support
- WN ports are required
  - Service nodes are few, so use SL
- Ports available
  - CentOS 4.1, SuSE 9.3, RedHat 7.3/9
  - MacOS X, Debian
- In progress
  - Solaris, EM64T, FC4, AIX, IRIX
- Consists of contributions from various sites
- For more details: http://www.grid.ie/autobuild
39. Operations
40. EGEE/LCG-2 Grid Sites, September 2005
- EGEE/LCG-2 grid: 160 sites, 36 countries, >15,000 processors, 5 PB storage
- Other national and regional grids: 60 sites, 6,000 processors
41. Close to 10,000 jobs/day
- Over 2.2 million jobs in 2005 so far
- Daily averages, sustained over a month, Jan-Oct 2005: 2183, 2940, 8052, 9432, 10112, 8327, 8352, 9492, 6554, 6103
- 6 M kSI2K.cpu.hours ≈ 700 CPU years (quick check below)
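As a quick sanity check on the last figure (treating the kSI2K normalisation factor as roughly one per CPU), 6 million CPU-hours does indeed correspond to about 700 CPU-years:

```python
# Quick check: 6 M CPU-hours expressed in CPU-years
# (assumes the kSI2K normalisation is roughly 1 per CPU of the time).
hours_per_year = 24 * 365                  # 8760 hours
print(6_000_000 / hours_per_year)          # ~685, i.e. roughly 700 CPU-years
```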
42. Working and usable grid: >10K jobs per day, 15 active VOs
43. EGEE Operations
- Steady EGEE operations across 6 sites (weekly shifts started end 2004): IN2P3, RAL, INFN, Russia, Taipei and CERN
- Site Functional Test (SFT) framework in place since June to monitor in detail the level of services from a given site vs the MoU
  - Will be extended to more sites and services
- One can observe increased site availability with time
(plots: total grid sites; number of sites passing the SFT tests)
44. CIC Operations
- 1 year ago
  - 4 federations involved: CERN, Italy, UK, France
  - 40 sites pass the functional tests
  - Disparate tools
- Now
  - 6 federations: Russia joined March 25th and Taiwan joined October 17th
  - CIC portal integrated with monitoring and ticketing systems
  - Streamlined processes
  - Average of 100 ticket operations each week
  - >80 sites pass the functional tests
45. Operations coordination
- Weekly operations meetings
- Regular ROC and CIC managers meetings
- Series of EGEE Operations Workshops
  - Nov 04, May 05, Sep 05
  - Last one was a joint workshop with Open Science Grid
  - These have been extremely useful
- Will continue in Phase II
  - Bring in related infrastructure projects: coordination point
  - Continue to arrange joint workshops with OSG (and others?)
46. A Selection of Monitoring Tools
1. GIIS Monitor
2. GIIS Monitor graphs
3. Sites Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce VO view
8. GridIce fabric view
9. Certificate Lifetime Monitor
47. Integration of monitoring information
- Monitoring information from the various tools is collected in the R-GMA archiver
- A summary generator calculates the overall status of each monitored object (site, CE, ...); updated every hour
- A metric generator calculates a numerical value for each monitored object, with aggregation CE → site → region → grid; updated once a day (aggregation sketched below)
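A minimal sketch of the kind of roll-up the metric generator performs is shown below. The aggregation rule (a plain average of child metrics) and the example topology are assumptions for illustration; the real generator works off the R-GMA archive and its own metric definitions.

```python
"""Illustrative sketch of hierarchical metric aggregation
(CE -> site -> region -> grid).  The roll-up rule (plain average) and
the example topology are assumptions, not the actual EGEE generator."""

from statistics import mean

# Example topology: region -> site -> CE -> metric in [0, 1]
grid = {
    "Region-A": {"Site-1": {"ce01": 1.0, "ce02": 0.5},
                 "Site-2": {"ce03": 1.0}},
    "Region-B": {"Site-3": {"ce04": 0.0}},
}

def site_metric(ces):
    return mean(ces.values())

def region_metric(sites):
    return mean(site_metric(ces) for ces in sites.values())

def grid_metric(regions):
    return mean(region_metric(sites) for sites in regions.values())

if __name__ == "__main__":
    for region, sites in grid.items():
        print(region, f"{region_metric(sites):.2f}")
    print("grid", f"{grid_metric(grid):.2f}")
```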
48. Summary
- The infrastructure continues to grow moderately
- The operation is becoming more stable with time
- The problem of bad sites is under control
  - Operational oversight is quite strict, through SFT and VO tools
  - The effect of bad sites is much reduced
- Significant workloads are being managed and significant resources are being delivered to applications
- User support is also becoming stable and usable
- Successful interoperability with OSG
  - With strong interest in building inter-operation
- EGEE-II must consolidate and build on this work
49. Conclusion
- This unprecedented way of collaborating on a day-to-day basis will change the sociology of academic life, the eco-system of the business world, and eventually everyone in society
- Collaboration is the keyword; how the AP communities can collaborate and coordinate is our shared challenge
- Examples may include a new AP-X VO, ROCs, MW enhancement, etc.