LCG Middleware and Operations Status

Transcript and Presenter's Notes
1
LCG Middleware and Operations Status
  • Simon C. Lin
  • Academia Sinica Grid Computing Center
  • Taipei, Taiwan 11529

2
Management and Direction
3
Evolution to LCG phase 2
  • TDR published (6/05) and reviewed (10/05)
  • MOU approved by C-RRB (10/05)
  • Annexes 1 & 2: Regional Centres (CERN, Tier-1s,
    Tier-2 Centres/Federations)
  • Annex 3: Service level definitions
  • CERN, Tier-1, Tier-2 services
  • Grid Operations Services
  • Annex 4: List of Funding Agencies
  • Annex 5: WLCG Collaboration organisational
    structure
  • Annex 6: Resource commitments and plans
  • Annex 9: Resources Scrutiny Group (RSG)
  • Funding agencies will sign in the next few months
  • Metric for service definition still needs to be
    fully implemented (SFT framework)
  • More experience will be gained in SC3 & SC4

4
Longer term support of middleware
  • Started to address the problem

5
Organization
6
Organizational changes
  • PEB → Management Board
  • Composition
  • Experiment Computing Coordinators
  • Tier-1 management (new entry)
  • Area Managers
  • Meetings: weekly phone, monthly face-to-face
  • New - Collaboration Board
  • Representatives of all centres signing the MoU
  • Tier-0, Tier-1s, Tier-2 centres / federations
  • Experiment spokespersons
  • First meeting early next year
  • Task Forces organised by each experiment to
  • focus on the usage and experience of the LCG
    services and implications for the experiment
  • provide a communication point between the
    experiment and LCG deployment/service management
  • coordinate evaluation and testing of new
    middleware and services  relevant for LCG
  • medium term planning of the evolution of usage of
    the LCG services
  • single point for reporting progress with service
    usage to the Management Board

7
Mode of operation
  • Collaboration Board (CB)
  • Meets annually or when need arises
  • Overview Board (POB)
  • Four times per year
  • Management Board (MB)
  • Weekly phone meetings (for 1 hour) on Tuesdays at
    1600 hrs.
  • Monthly (for 2 hours) face-to-face meeting on
    Tuesday before the GDB
  • At the location where the GDB takes place
  • Grid Deployment Board (GDB)
  • Meets monthly for a full day
  • Optionally preceded by a technical pre-GDB meeting
    when need arises
  • Architects Forum (AF)
  • Meets every 2 weeks
  • Experiment Related Task Forces
  • Defined by the task forces

8
VO-BOX
  • This is a general framework for long-running
    processes
  • without special needs like root access
  • avoids the need for WNs to require outbound
    access
  • Technical Details (a sketch of a typical agent
    follows this list)
  • gsissh server
  • proxy renewal feature
  • CLI / API for job submission
  • requires SGM account with access to the
    software directory
  • has outbound connectivity
  • and specific inbound port access
  • Concerns
  • Sys admin load at small sites
  • Clash with security policies
  • Prefer generic services that meet experiment
    requirements
  • Data Transfer
  • Secure message passing
  • GDB
  • A meeting will be held in January to better define
    the use cases for the VO-BOX
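As a rough illustration of the points above (a long-running process that renews its proxy and submits work through a CLI), here is a minimal Python sketch of the kind of agent a VO might run on a VO-BOX. The command names, server, and JDL file are illustrative placeholders, not taken from the slides; a real deployment would use whatever proxy-renewal and submission tools the installed middleware provides.

import subprocess
import time

# Placeholder commands: the real proxy-renewal mechanism and submission CLI
# depend on the middleware release installed on the VO-BOX (this is only a
# sketch, not the actual VO-BOX interface).
PROXY_RENEW_CMD = ["myproxy-logon", "-s", "myproxy.example.org", "-n"]
SUBMIT_CMD = ["glite-job-submit", "pilot.jdl"]
RENEW_INTERVAL = 6 * 3600          # refresh the proxy every 6 hours (illustrative)

def run(cmd):
    # Run a CLI command, raise if it fails, return its stdout.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def main():
    last_renewal = 0.0
    while True:                    # long-running agent, no root privileges needed
        if time.time() - last_renewal > RENEW_INTERVAL:
            run(PROXY_RENEW_CMD)   # keep the VO proxy valid
            last_renewal = time.time()
        print(run(SUBMIT_CMD))     # hand work to the grid through the CLI
        time.sleep(600)            # pace submissions

if __name__ == "__main__":
    main()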

9
New VO: Geant4
  • What it is
  • Geant4 is a Monte Carlo simulation toolkit
  • Used by HEP, Astrophysics and Biomed
  • Production
  • Runs 2 large productions a year requiring 3 CPU
    years each
  • Also planning small scale mini-production
    throughout the year
  • Small output (15-20 GB), all stored at CERN
  • Good efficiencies achieved with LCG (85%)
  • VO Request
  • All centralized services provided at CERN
  • Request CE access to computing resources
  • 2GB Software directory for software installation

10
Distributed Databases Deployment
  • Distributed databases are needed for
  • Conditions, calibration and alignment data
  • Many databases in this category
  • Event TAG information
  • General architecture
  • Online/T0: autonomous, reliable service
  • T1s: full data replication, reliable service
  • T2s: partial replication, local caching
  • Focus on large scale deployment
  • October workshop with Tier sites to work out
    details

Not a software development project; not running
the DB service
11
Distributed Databases Technologies
  • Oracle Real Application Clusters (RAC)
  • Only viable solution, allowing scalability
  • Oracle Streams
  • Proprietary replication mechanism
  • FroNtier
  • DB caching based on squid servers
  • No vendor-neutral replication so far
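To illustrate the FroNtier idea sketched above (read-only database queries carried over HTTP so that ordinary squid caches can serve repeated requests), here is a minimal Python sketch; the proxy and servlet URLs are invented placeholders and the real FroNtier client encodes its queries differently.

import urllib.request

# Invented endpoints: a FroNtier-style deployment puts squid caches between
# the clients and a servlet that translates HTTP requests into DB queries.
SQUID_PROXY  = "http://squid.example-t2.org:3128"                   # local cache
FRONTIER_URL = "http://frontier.example-t0.org/Frontier?query=..."  # DB servlet

# Route the read-only query through the local squid cache, so repeated
# lookups of the same conditions data never reach the central database.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": SQUID_PROXY})
)
with opener.open(FRONTIER_URL) as resp:
    payload = resp.read()    # encoded result set returned over plain HTTP
print(len(payload), "bytes (served from cache on repeat requests)")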

12
3D Plans
  • CERN
  • DB service up and running
  • Ramp up capacity in 2006 (from 2 nodes to 4
    nodes/experiment)
  • Tier1s
  • November: h/w setup defined and plan proposed to
    GDB
  • January: h/w acceptance tests, RAC setup
  • Begin February: Tier 1 DB readiness workshop
  • February: Apps and streams setup at Tier 0
  • March: Tier 1 service starts
  • End May: Service review → h/w defined for full
    production
  • September: Full LCG database service in place

13
Issues and concerns
  • Experiments do not have clear access patterns to
    distributed DBs
  • Hard to design a system
  • DBs are notoriously hard to get to work right
  • → Clarify access patterns and quantity of data
  • Database service deployment is coming very late
    in the game
  • Pre-production March 06, production Sept 06.
  • It is an essential service that has not been
    extensively tested
  • Experiments have gone on different paths (not too
    much, luckily).
  • → Test extensively and measure performance
  • Tiers are just beginning to get involved
  • It should be clear what the commitment is
  • MoU (xii): "administration of databases required
    by Experiments at Tier1 Centres"
  • What if each experiment requires a different
    database service?
  • → Agree on service requirements for Tier1s and
    Tier2s

14
Milestones
15
Proposed High Level Milestones: SC3
  • Sept 1 - Include 9 T1 and 10 T2
  • Dec 31 - For success:
  • 5 T1 and 5 T2
  • Appropriate baseline services operational
  • Service availability > 80%
  • Success rate of test jobs > 80%
  • Maintain 1 week of 1 GB/s from CERN to T1

16
Proposed High Level Milestones: SC4
  • Feb 28: All baseline services deployed and
    operational
  • All T1 and 20 T2
  • April 30: setup complete and services
    demonstrated
  • Run experiment test jobs and data distribution
    tested
  • May 31: start of stable service phase
  • Support full computing model of each experiment
    including simulation and analysis
  • All T1 and 40 T2
  • Sept 30: successful completion of service phase
  • Service availability > 90%
  • Success rate of test jobs > 90%
  • Maintain 1 week of 1.6 GB/s from CERN to T1 (see
    the volume estimate after this list)
  • T1s must reach nominal rate for LHC operations
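For scale, a quick back-of-the-envelope calculation (added here, not part of the original slides) of the data volume implied by sustaining 1.6 GB/s for one week:

# Rough volume estimate for the SC4 transfer target (illustrative check).
rate_gb_per_s = 1.6
seconds_per_week = 7 * 24 * 3600            # 604,800 s
total_gb = rate_gb_per_s * seconds_per_week
print(f"{total_gb:,.0f} GB ~= {total_gb / 1e6:.2f} PB in one week")
# -> 967,680 GB, i.e. roughly 1 PB distributed from CERN to the Tier-1s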

17
Service Challenge Schedule
SC1 (Nov 04 - Jan 05): data transfer between CERN
and three Tier-1s (FNAL, NIKHEF, FZK)
SC2 (Apr 05): data distribution from CERN to 7
Tier-1s, 600 MB/sec sustained for 10 days (one
third of the final nominal rate)
(SC = Service Challenge)
18
CERN Data Recording Challenge Schedule
SC1
DRC1 (Mar 05): data recording at CERN sustained
at 450 MB/sec for one week
SC2
DRC2 (Dec 05): data recording sustained at 750
MB/sec for one week
DRC3 (Apr 06): data recording sustained at 1
GB/sec for one week
SC4 service phase
DRC4 (Sep 06): data recording sustained at 1.6
GB/sec
(SC = Service Challenge, DRC = Data Recording
Challenge)
19
High-Performance T0/T1 Network Milestones
SC1
DRC1 (450 MB/s)
OPN1: T0/1 high-performance network operational
at 3 T1 sites
SC2
SC3 service
DRC2 (750 MB/s)
OPN2: T0/1 high-performance network operational
at 6 T1 sites, at least half via GEANT
DRC3 (1 GB/s)
DRC4 (1.6 GB/s)
(SC = Service Challenge, DRC = Data Recording
Challenge, OPN = Optical Private Network)
20
Plans under development
SC1
DAQ-T0 architectures and test schedule
DRC1 (450 MB/s)
SC2
DAQ-T0-T1 architecture and test schedule
OPN1: 3 T1 sites
SC3 service
Distributed Database Service deployment schedule
DRC2 (750 MB/s)
DRC3 (1 GB/s)
OPN2: 6 T1 sites incl. GEANT
DRC4 (1.6 GB/s)
(SC = Service Challenge, DRC = Data Recording
Challenge, OPN = Optical Private Network)
21
Middleware Status
22
Guiding Principles
Service Oriented Architecture, Web Services
Interoperability, Portability, Modularity,
Scalability
Building on existing components in a lightweight
manner (AliEn, LCG, Condor, Globus, SRM, ...)
23
gLite Architecture Overview
[Architecture diagram: Resource Broker node
(Workload Manager, WM), job status flow, Storage
Element]
24
gLite Middleware Services
(Available gLite implementations)
Access: API, CLI
Security Services: Authentication, Authorization,
Auditing
Information & Monitoring Services: Information &
Monitoring, Application Monitoring, Service
Discovery
Data Management: Metadata Catalog, File & Replica
Catalog, Storage Element, Data Movement
Workload Management Services: Computing Element,
Workload Management, Job Provenance, Package
Manager, Accounting
(plus network Connectivity)
25
gLite Processes
  • Architecture Definition
  • Based on Design Team work
  • Associated implementation work plan
  • Design description of Service defined in the
    Architecture document
  • Really is a definition of interfaces
  • Yearly cycle
  • Testing Team
  • Test Release candidates on a distributed testbed
    (CERN, Hannover, Imperial College)
  • Raise Critical bugs as needed
  • Iterate with Integrators & Developers
  • Once Release Candidate passed functional tests
  • Integration Team produces documentation, release
    notes and final packaging
  • Announce the release on the glite Web site and
    the glite-discuss mailing list.
  • Implementation Work plan
  • Prototype testbed deployment for early feedback
  • Progress tracked monthly at the EMT
  • EMT defines release contents
  • Based on work plan progress
  • Based on essential items needed
  • So far mainly for HEP experiments and BioMed
  • Decide on target dates for tags
  • Taking into account enough time for integration
    testing
  • Deployment on Pre-production Service and/or
    Service Challenges
  • Feedback from a larger number of sites and
    different levels of competence
  • Raise Critical bugs as needed
  • Critical bugs fixed with Quick Fixes when possible
  • Integration Team produces Release Candidates
    based on received tags
  • Build, Smoke Test, Deployment Modules,
    configuration
  • Iterate with developers
  • Deployment on Production of selected set of
    Services
  • Based on the needs (deployment, applications)
  • Today: FTS clients, R-GMA, VOMS

26
gLite Releases and Planning
(Timeline of functionality vs. release date, April
2005 - February 2006, with "Today" near the end)
gLite 1.0: Condor-C CE, gLite I/O, R-GMA, WMS, LB,
VOMS, Single Catalog
gLite 1.1: File Transfer Service, Metadata catalog
gLite 1.1.1 / 1.1.2: Special releases for the SC
File Transfer Service
gLite 1.2: File Transfer Agents, Secure Condor-C
gLite 1.3: File Placement Service, FTS multi-VO,
refactored R-GMA, CE
gLite 1.4: VOMS for Oracle, SRMcp for FTS,
WMproxy, LBproxy, DGAS
gLite 1.4.1: Service Release
gLite 1.5: Functionality freeze reached; release
date upcoming
Plus a steady stream of quick fixes between
releases (QF1.0.12_01_2005 ... QF1.3.0_24_2005)
27
gLite Documentation and Information Sources
  • Installation Guide
  • Release Notes
  • General
  • Individual Components
  • User Manuals
  • With Quick Guide sections
  • CLI Man pages
  • APIs and WSDL
  • Beginners Guide and Sample Code
  • Bug Tracking System
  • Mailing Lists
  • gLite-discuss
  • Pre-Production Service
  • Other
  • Data Management (FTS) Wiki
  • Pre-Production Services Wiki
  • Public and Private
  • Presentations

28
gLite Testing Status
29
Pre-production Service Status
  • Core Services

[Map of pre-production sites and the services each
runs: FTS, Workload Management (WMS & LB), VO
Management (VOMS), Information System (BDII),
Catalogues (Fireman/MySQL), Authentication
(MyProxy), Data Management / gLite I/O (DPM,
Castor), R-GMA; e.g. ASGC, Taipei]
30
Pre-production sites
31
Summary
  • gLite releases have been produced
  • Tested, Documented, with Installation and Release
    notes
  • Subsystems used on
  • Service Challenges
  • Pre-Production Services
  • Production Service
  • And by other communities (e.g. DILIGENT)
  • gLite processes are in place
  • Closely monitored by various bodies
  • Hiding many technical problems from the end user
  • gLite is more than just software; it is also about
  • Processes, Tools and Documentation
  • International Collaboration

32
NA4 Application Pilot: Biomed
  • Production: Biomed Data Challenge on molecular
    docking
  • >80 CPU years in 6 weeks
  • >40 Kjobs, 60 Kfiles produced (1 TB)
  • 46M docked ligands
  • High cost in human resources

In progress: medical files management (DICOM),
metadata management (AMGA), security (file
encryption), grid workflow processing
Next priorities: efficient processing of short
jobs, fine-grained access control (data and
metadata)
33
What happened since Athens ?
  • The number of users in VOs related to NA4
    activity kept growing regularly
  • from 500 at PM9 to 1000 at PM18
  • More than 20 applications are deployed on the
    production infrastructure
  • The usage of the grid by pilot applications has
    significantly evolved during the summer
  • From data challenge to service challenge (HEP)
  • First biomedical data challenge (WISDOM)
  • Several existing applications have been migrated
    to the new middleware by the HEP, biomedical and
    generic teams
  • Support of NA4 test team

34
Production
[Plots: LHCb and ATLAS production activity]
  • Fundamental activity in preparation of LHC start
    up
  • Physics
  • Computing systems
  • Examples
  • LHCb: 700 CPU-years in 2005 on the EGEE
    infrastructure
  • ATLAS: over 10,000 jobs per day
  • Comprehensive analysis: see S. Campana et al.,
    "Analysis of the ATLAS Rome Production experience
    on the EGEE Computing Grid", e-Science 2005,
    Melbourne, Australia
  • A lot of activity in all involved applications
    (including as usual a lot of activity within
    non-LHC experiments like BaBar, CDF and D0)
  • A lot more details in DNA4.3.2 (internal review)

35
Production Middleware
36
LCG Releases
  • LCG-2_6_0
  • Prerelease sent to IT, UKI and SE for testing
  • New Glue schema: Glue 1.2
  • New BDII required to handle both new and old Glue
  • RGMA from gLite-1.2
  • VO-BOX
  • Upgrade deadline of 3 weeks removed
  • LCG-2_6_0 client libs available using TAR-ball
  • Shared file system
  • Able to install on 2_4_0 WNs
  • LCG-2_7_0
  • Expected before Christmas
  • Info System
  • Make sure Glue 1.2 is fully utilized
  • Publish service version
  • LFC with VOMS support and performance
    enhancements
  • DPM with SRM-copy support
  • New R-GMA version
  • Update to new VDT and Globus

37
New Release Plans
  • Upgrade and introduce services as soon as ready
  • Version tracking via the information system (see
    the query sketch after this list)
  • When sufficient changes made
  • Create a new release
  • Starting point for new sites
  • Preproduction service will change
  • From pure gLite to LCG plus new gLite components
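As a rough illustration of version tracking through the information system, the sketch below queries a Glue-schema BDII over LDAP; the host name is a placeholder and the chosen attribute names are assumptions in the style of Glue 1.2, not a verified recipe.

import subprocess

# Placeholder BDII endpoint; sites publish Glue data over LDAP (port 2170).
BDII = "ldap://bdii.example.org:2170"

# Ask the information system which services it knows about and which
# versions they publish. Attribute names are assumed Glue 1.x-style names
# for this sketch.
cmd = [
    "ldapsearch", "-x", "-LLL",
    "-H", BDII,
    "-b", "o=grid",
    "(objectClass=GlueService)",
    "GlueServiceName", "GlueServiceVersion",
]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stdout or result.stderr)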

38
LCG Porting
  • Ports needed when SL cannot be installed
  • Existing batch system OS
  • Hardware support
  • Architecture support
  • WN ports are required
  • Service nodes are few so use SL
  • Ports available
  • CentOS 4.1, Suse 9.3, RedHat 7.3/9
  • MacOS X, Debian
  • In Progress
  • Solaris, EMT64, FC4, AIX, IRIX
  • Consist of contributions from various sites
  • For more details: http://www.grid.ie/autobuild

39
Operations
40
EGEE/LCG-2 Grid Sites, September 2005
EGEE/LCG-2 grid: 160 sites, 36 countries,
>15,000 processors, 5 PB storage
Other national & regional grids: 60 sites, 6,000
processors
41
Close to 10,000 jobs/day
→ Over 2.2 million jobs in 2005 so far
Daily averages, sustained over a month, Jan - Oct
2005: 2183, 2940, 8052, 9432, 10112, 8327, 8352,
9492, 6554, 6103
→ 6 M kSI2K.cpu.hours → ~700 CPU years
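These headline numbers can be cross-checked from the monthly averages; the short calculation below is added for clarity and is not part of the original slides.

# Consistency check of the slide's figures (illustrative, added here).
daily_avg = [2183, 2940, 8052, 9432, 10112, 8327, 8352, 9492, 6554, 6103]  # Jan-Oct 2005
days_in_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31]

total_jobs = sum(a * d for a, d in zip(daily_avg, days_in_month))
print(f"jobs Jan-Oct 2005 ~ {total_jobs:,}")              # ~2.2 million

cpu_hours = 6e6                                           # 6 M kSI2K.cpu.hours
print(f"~{cpu_hours / (24 * 365):.0f} kSI2K CPU years")   # ~685, i.e. roughly 700
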
42
Working and usable grid: >10K jobs per day, 15
active VOs
43
EGEE Operations
Steady EGEE operations across 6 sites (weekly
shifts started end 2004): IN2P3, RAL, INFN,
Russia, Taipei & CERN
[Chart: total grid sites]
Site Functional Test (SFT) framework in place
since June to monitor in detail the level of
services from a given site vs. the MoU. Will be
extended to more sites & services. One can observe
an increased site availability with time.
[Chart: number of sites passing the SFT tests]
44
CIC Operations
  • 1 Year Ago
  • 4 federations involved: CERN, Italy, UK, France
  • 40 sites pass functional tests
  • Disparate tools
  • Now
  • 6 federations - Russia joined March 25th and
    Taiwan joined October 17th
  • CIC portal integrated with monitoring and
    ticketing system
  • Streamlined processes
  • Average of 100 ticket operations each week
  • >80 sites pass functional tests

45
Operations coordination
  • Weekly operations meetings
  • Regular ROC, CIC managers meetings
  • Series of EGEE Operations Workshops
  • Nov 04, May 05, Sep 05
  • Last one was a joint workshop with Open Science
    Grid
  • These have been extremely useful
  • Will continue in Phase II
  • Bring in related infrastructure projects
    (coordination point)
  • Continue to arrange joint workshops with OSG (and
    others?)

46
A Selection of Monitoring tools
1. GIIS Monitor
2. GIIS Monitor graphs
3. Sites Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce VO view
8. GridIce fabric view
9. Certificate Lifetime Monitor
47
Integration of monitoring information
  • Monitoring information from various tools is
    collected in the R-GMA archiver
  • Summary generator calculates the overall status of
    each monitored object (site, CE, ...) - updated
    every hour
  • Metric generator calculates a numerical value for
    each monitored object aggregation (CE → site →
    region → grid) - updated daily (a small
    aggregation sketch follows this list)
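A minimal Python sketch of the CE → site → region → grid aggregation idea described above; the data layout and pass/fail statuses are invented for illustration and do not reflect the actual SFT/R-GMA schema.

from collections import defaultdict

# Each CE has a pass/fail status from its functional tests; higher levels
# report the fraction of their children that pass (illustrative data).
ce_status = {
    # (region, site, ce): True if the CE passed its tests
    ("Asia", "ASGC", "ce01"): True,
    ("Asia", "ASGC", "ce02"): False,
    ("UKI",  "RAL",  "ce01"): True,
}

def aggregate(statuses):
    site_ok, site_tot = defaultdict(int), defaultdict(int)
    region_ok, region_tot = defaultdict(int), defaultdict(int)
    for (region, site, _ce), ok in statuses.items():
        site_tot[(region, site)] += 1
        region_tot[region] += 1
        if ok:
            site_ok[(region, site)] += 1
            region_ok[region] += 1
    sites = {k: site_ok[k] / site_tot[k] for k in site_tot}
    regions = {r: region_ok[r] / region_tot[r] for r in region_tot}
    grid = sum(statuses.values()) / len(statuses)   # grid-wide pass fraction
    return sites, regions, grid

sites, regions, grid = aggregate(ce_status)
print(sites, regions, f"grid metric = {grid:.2f}")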

48
Summary
  • The infrastructure continues to grow moderately
  • The operation is becoming more stable with time
  • Problem of bad sites is under control
  • Operational oversight quite strict
  • Through SFT, VO tools
  • Effect of bad sites is much reduced
  • Significant workloads are being managed and
    significant resources being delivered to
    applications
  • User support is also becoming stable and usable
  • Successful interoperability with OSG
  • With strong interest in building inter-operation
  • EGEE-II must consolidate and build on this work

49
Conclusion
  • This unprecedented way of collaborating on a
    day-to-day basis will change the sociology of
    academic life, the ecosystem of the business
    world, and eventually everyone in society
  • Collaboration is the keyword; how the AP
    communities can collaborate and coordinate is our
    shared challenge
  • Examples may include a new AP-X VO, ROCs,
    middleware enhancements, etc.