1
Facilities and Fabrics Research and Development
  • Michael Ernst, Fermilab
  • DOE/NSF Review, January 16, 2003

2
UF US CMS Prototypes and Testbeds
  • Tier-1 and Tier-2 Prototypes and Testbeds
    operational
  • Facilities for event simulation, including
    reconstruction
  • Sophisticated processing for pile-up simulation
  • User cluster and hosting of data samples for
    physics studies
  • Facilities and Grid R&D

Storage Servers: about 10 TeraByte
Wide Area Network: 622 Mbps (shared)
Production Farm: 80 Dual CPU Nodes
Tape Library: 60 TeraByte
User Cluster: 16 Nodes
R&D Cluster
3
CMS Milestones
  • DC04

4
CMS Data Challenge DC04
5
US CMS UF Resources
  • Roughly 50% (Disk) to 75% (CPU) of the
    (estimated) resources planned for end of 2003

6
Major Tier 1 Activities
  • Data Serving and dCache project rather
    successful: standardized data server, scalability
  • Now focusing on standardization of Storage
    Management Protocols (with PPDG, EDG)
  • Resource Management (dynamic partitioning of farm
    resources)
  • 3 different Grid Systems, Analysis Cluster
  • User Analysis Cluster R&D facility deployed,
    work on load balancing
  • Monitoring systems deployed, from fabric
    monitoring to Grid monitoring
  • User Management (Virtual Organization tools)
  • Throughput studies and development, domestic and
    transatlantic
  • Software distribution and deployment system
    evaluation, Grid tools considered
  • Documentation, collaboratory, web servers, etc.
  • There are many things we do not do due to lack of
    manpower
  • Down-scoped for 2002 due to lack of funding
    (should have 13 FTE, have 6)
  • R&D slowed down and in danger of being ad hoc,
    missing pieces
  • Global Data Catalogs (essential pre-requisite for
    DC04)
  • Physics Analysis Cluster
  • User support has somewhat suffered
  • With the planned ramp-up of the UF (and a good
    plan!) these issues will be resolved

7
Adapted Scope
  • We have adapted the UF scope
  • In our embracing of Grid Technology and how it
    comes into the project, in order to deal with the
    LCG
  • We also changed the scope by stretching R&D,
    leaving less functionality
  • Things we should have developed and working by
    now, but haven't had the resources
  • E.g. Storage Resource Manager, File Catalogs,
    Global File System
  • We bought less equipment than planned and
    therefore production requires more manpower
  • The Project WBS has been re-worked
  • Adapt the project plan to the changed scope
    (below the original plans for FY03/04), leading to
    a strong US CMS participation in the CMS 5% data
    challenge DC04: validating core software,
    developing the computing model
  • US CMS will take part in the LCG Production
    Grid milestone in 2003
  • WBS reflects reality, has a structure that
    should work with the new Fermilab project
    accounting scheme, and should allow tracking of
    effort and progress
  • A level of 13 FTE is required to provide the
    effort for research, development, deployment,
    integration, management, operation, support
  • Modest hardware procurements for Tier-1 center
  • Typically $500k/year in FY2003, FY2004
  • Need more ($700k) in 2005 for participation in
    the 10% data challenge DC05, Regional Center TDR

8
Set of High Level Milestones
  • Integration Grid Testbed deployed, running PRS
    production, demonstrations
  • October, November, December 2002
  • Review of IGT, promotion and termination
  • December 2002, January 2003
  • Farm Configuration Definition and Deployment
  • February 2003.
  • Fully Functional Production Grid on a National
    Scale
  • February 2003
  • Migration of Testbed functionality to Production
    Facilities
  • March 2003
  • Start of LCG 24x7 Production Grid, with US CMS
    participating
  • June 2003 -- this needs definition from the
    LCG/GDB as to what it actually means
  • Start of CMS DC04 production preparation, PCP04
  • July 2003 -- enormous resources with sustained
    running over 6 months
  • Running of DC04
  • Feb 2004 -- again the computing model etc being
    developed inside CMS

9
General Approach to UF WBS
  • Goals: aggressive prototyping, early roll-out,
    strong QC/Docu, track external practices
  • Approach: Rolling Prototypes (evolution of the
    facility and data systems)
  • Test stands for various hardware components (and
    fabric-related software components) -- this
    allows us to sample emerging technologies with
    small risks (WBS 1.1.1)
  • Setup of a test(bed) system out of
    next-generation components -- always keeping a
    well-understood and functional production system
    intact (WBS 1.1.2)
  • Deployment of a production-quality facility ---
    comprised of well-defined components with
    well-defined interfaces that can be upgraded
    component-wise with a well-defined mechanism for
    changing the components to minimize risks (WBS
    1.1.3)
  • This matches the general strategy of rolling
    replacements, thereby upgrading facility capacity
    making use of Moore's law
  • Correspondingly our approach to developing the
    software systems for the distributed data
    processing environment adopts rolling
    prototyping
  • Analyze current practices in distributed systems
    processing and of external software, like Grid
    middleware (WBS 1.3.1, 1.3.2)
  • Prototyping of the distributed processing
    environment (WBS 1.3.3)
  • Software Support and Transitioning, including use
    of testbeds (WBS 1.3.4)
  • Servicing external milestones like data
    challenges to exercise the new functionality and
    get feedback (WBS 1.3.5)
  • Next prototype system to be delivered is the US
    CMS contribution to the LCG Production Grid (June
    2003)
  • CMS will run a large Data Challenge on that
    system to prove and further develop the computing
    model

10
UF WBS Level 3
Full WBS at http://heppc16.ucsd.edu/Planning_new/
  • Note the large effort captured from Grids and/or
    at Universities

11
UF WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
12
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
1.2     Physics Analysis Facility Support    1.0   0.0   2.0   0.0
1.2.1   Facility Planning and Procurement    0.5   0.5
1.2.2   Desktop Support                      0.3   1.0
1.2.3   Documentation                        0.3   0.5
1.2.4   Collaborative Tools
1.2.5   Development Environment
1.2.6   Administration, Monitoring & Problem Resolution Software
1.2.7   Administrative Support
1.2.8   User Helpdesk
1.2.9   Training Office
1.2.10  Virtual Control Room
13
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
14
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
15
Estimated Resource Needs at T1

16
Projects
  • Compute and Storage Elements: R&D on Components
    and Systems
  • Cluster Management: Generic Farms, Dynamic
    Partitioning
  • Storage Management and Access
  • Interface Standardization and its implementation
  • Data set catalogs, metadata, replication, robust
    file transfers
  • Networking: Terabyte throughput to T2, to CERN
  • Need High-Performance Network Infrastructure
  • R&D on Network Protocol Stack
  • Physics Analysis Center
  • Analysis cluster
  • Desktop support
  • Software Distribution, Software Support, User
    Support / Helpdesk
  • Collaborative tools
  • VO management and security
  • Worked out a plan for R&D and Deployment
  • Need to develop operations scenario
  • In addition
  • Operation and Facilities-related R&D

17
Current System Architecture at Tier-1
[Diagram: current Tier-1 system architecture. A Cisco 6509 switch
interconnects the Production Cluster, the US-CMS Grid Testbed, R&D,
batch, and user-analysis hosts (POPCRN, GALLO, VELVEETA, GYOZA, FRY,
BIGMAC, RAMEN, BURRITO, CHALUPA, WONDER, CHOCOLAT, WHOPPER, SNICKERS,
CMSUN1), each server group with 250 GB to 1 TB of attached disk,
plus ENSTORE (15 drives, shared) as the tape back-end and wide-area
connectivity via MREN (OC3, shared) and ESNET (OC12, shared).]
18
Dynamic Partitioning and Farm Configuration
[Diagram: two front-end nodes serve an Analysis and a Production
partition of worker nodes; nodes can be reassigned between the
partitions, and both partitions share the same Network Attached
Storage, large-scale caching system (dCache), and tertiary storage
system (ENSTORE).]
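To illustrate the dynamic partitioning idea, the sketch below moves worker nodes between a production and an analysis partition according to their pending-job backlog. It is a hypothetical illustration only: the Partition and Node types and the rebalance() function are invented for this sketch and do not come from the actual farm configuration tools.

```cpp
// Hypothetical sketch of dynamic farm partitioning (not the actual
// Tier-1 configuration tools): worker nodes are reassigned between
// the production and analysis partitions according to queue demand,
// while both partitions keep using the same shared storage layers
// (NAS, dCache, ENSTORE).
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Node { std::string name; };

struct Partition {
    std::string label;
    std::vector<Node> nodes;
    std::size_t pendingJobs;   // jobs queued at this partition's front-end node
};

// Move one node at a time from the lightly loaded partition to the
// heavily loaded one until the per-node backlogs are roughly balanced.
void rebalance(Partition& a, Partition& b) {
    auto load = [](const Partition& p) {
        return p.nodes.empty() ? 0.0
                               : static_cast<double>(p.pendingJobs) / p.nodes.size();
    };
    while (load(a) > 2.0 * load(b) && b.nodes.size() > 1) {
        a.nodes.push_back(b.nodes.back());
        b.nodes.pop_back();
    }
    while (load(b) > 2.0 * load(a) && a.nodes.size() > 1) {
        b.nodes.push_back(a.nodes.back());
        a.nodes.pop_back();
    }
}

int main() {
    Partition production{"production", {{"node01"}, {"node02"}, {"node03"}, {"node04"}}, 400};
    Partition analysis{"analysis", {{"node05"}, {"node06"}}, 10};
    rebalance(production, analysis);
    std::cout << production.label << ": " << production.nodes.size() << " nodes, "
              << analysis.label << ": " << analysis.nodes.size() << " nodes\n";
}
```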
19
Anticipated System Architecture
[Diagram: anticipated Tier-1 system architecture. The Production
Cluster (80 dual nodes), the IGT (GYOZA), R&D (FRY), POPCRN, and the
user-analysis cluster connect through a Cisco 6509 switch to NAS,
dCache (about 6 TB), and ENSTORE (17 drives, shared), with wide-area
connectivity via MREN (OC3, shared) and ESNET (OC12, shared).]
20
User access to Tier-1 (JetMet, Muons)
  • ROOT/POOL interface (TDCacheFile); see the
    sketch below
  • AMS server
  • AMS/Enstore interface
  • AMS/dCache interface

[Diagram: objects are served over the network from dCache, Enstore,
and NAS/RAID to users in Wisconsin, CERN, FNAL, and Texas.]
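The ROOT/dCache access path listed above can be sketched as follows. This is a minimal illustration, assuming a ROOT build with dcap support (in which case TFile::Open instantiates a TDCacheFile for dcap:// URLs); the door host, port, PNFS path, and tree name are hypothetical placeholders, not the actual Tier-1 configuration.

```cpp
// Minimal sketch of reading objects through dCache from ROOT.
// The door host, port, PNFS path, and tree name are hypothetical
// placeholders; requires a ROOT build with dcap support, in which
// case TFile::Open returns a TDCacheFile for dcap:// URLs.
#include "TFile.h"
#include "TTree.h"
#include <iostream>

int main() {
    // Open a file via a dCache door; dCache stages it from Enstore
    // tape if it is not already cached on disk.
    TFile* f = TFile::Open(
        "dcap://dcache-door.example.fnal.gov:22125"
        "/pnfs/example/cms/jetmet/sample.root");
    if (!f || f->IsZombie()) {
        std::cerr << "could not open file via dCache\n";
        return 1;
    }

    // Read a tree exactly as one would from a local file.
    TTree* tree = nullptr;
    f->GetObject("Events", tree);
    if (tree)
        std::cout << "entries: " << tree->GetEntries() << std::endl;

    f->Close();
    return 0;
}
```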
21
VO AA Support
  • Very complex issue (and large possible impacts)
  • US CMS is working on developing a more complete
    view on the subject: assess current registration
    schemes, databases etc., develop an operational
    model
  • This plan was developed in consultation with the
    Fermilab security team; it identifies the specific
    US-CMS pieces of work
  • There are many stakeholders in this project,
    coming from rather different sides -- needs some
    more discussion on project structure and scope
  • Person in CD/CCF has been identified
  • There are also many existing and emerging
    components developed by different projects
  • There is a strong operational component, and
    existing regulations, practices, constraints
  • Technical issues we can start working on
    (WBS/schedule exists)
  • Registration Schema definition, VO Membership
    Service, Registration Mechanisms
  • Authentication: replacing group certificates,
    deployment of KCA at Tier-1 with some
    documentation on the experience, then deployment
    in the DGT, performance studies
  • Authorization: deploy EDG gatekeeper with LCAS,
    ensure documentation of the VOMS client, mechanisms
    to update/access VOMS information, distributed
    VOMS, re-auth of long jobs, Documentation and
    Helpdesk
  • Eventually also Accounting issues

22
Storage and Data Access
  • Viewpoints of CMS Event Data in the CMS Data Grid
    System
  • High-level data views in the minds of physicists
  • High-level data views in physics analysis tools
  • Virtual data product collections
    (highest-level common view across CMS)
  • Materialized data product collections
  • File sets
    (sets of logical files with high-level significance)

  • Logical files
  • Physical files on sites
    (device location independent view)
  • Physical files on storage devices
    (lowest-level generic view of files)
  • Device-specific files

R&D focusing on common interfaces (sketched below)
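A minimal sketch of this layering, with hypothetical type and function names that do not correspond to any actual CMS catalog implementation: each level refines the one above it, and the common interfaces under R&D sit between adjacent levels (e.g. resolving a logical file into its site- and device-specific replicas).

```cpp
// Hypothetical model of the data-view layers; names are illustrative
// only and do not come from an actual CMS catalog implementation.
#include <iostream>
#include <string>
#include <vector>

// Highest-level common views across CMS.
struct VirtualCollection      { std::string datasetName; };
struct MaterializedCollection { std::string datasetName; std::vector<std::string> fileSets; };

// A file set groups logical files with high-level significance;
// logical files are site and device independent.
struct FileSet     { std::string name; std::vector<std::string> lfns; };
struct LogicalFile { std::string lfn; };

// Physical views add a site, a storage device, and finally a
// device-specific path at the lowest level.
struct PhysicalFile { std::string lfn; std::string site; std::string device; std::string path; };

// Interface between the layers: resolve a logical file to its
// replicas. A real implementation would consult a replica catalog.
std::vector<PhysicalFile> resolve(const LogicalFile& lf) {
    return {
        {lf.lfn, "Tier1-FNAL", "dCache", "/pnfs/example/" + lf.lfn},
        {lf.lfn, "Tier2-UCSD", "dFarm",  "/dfarm/example/" + lf.lfn},
    };
}

int main() {
    for (const auto& pf : resolve({"run1234/evts_0001.root"}))
        std::cout << pf.site << " " << pf.device << " " << pf.path << "\n";
}
```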
23
Storage Resource Management
[Diagram: at the site hosting the application, clients issue a
logical query; a Request Interpreter uses a property index to map it
to logical files, request planning resolves these to site-specific
files via the Replica Catalog and MDS, and a Request Executer sends
site-specific file requests (pinning and file transfer requests) to
the local DRM. Across the network, the Tier-1 site runs an HRM in
front of dCache and Enstore, while a Tier-2 site runs a DRM in front
of a store still to be chosen (dCache, dFarm, DRM, NeST). Today this
is the responsibility of the application, invoking some higher-level
middleware components (e.g. Condor).]
24
Storage Resource Management
[Diagram: the same Storage Resource Management architecture as on
the previous slide.]
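The request flow in the diagram can be summarized by the hypothetical sketch below; none of these function or type names belong to a real SRM client API. The Request Interpreter maps a logical query to logical files, request planning turns them into site-specific file requests using the Replica Catalog (and MDS for resource information), and the Request Executer pins the files at the remote DRM/HRM and issues the transfer requests.

```cpp
// Hypothetical sketch of the SRM request flow shown above; none of
// these names belong to a real SRM client API.
#include <iostream>
#include <string>
#include <vector>

struct SiteFileRequest { std::string site; std::string path; };

// Request Interpreter: logical query -> logical files (property index).
std::vector<std::string> interpretQuery(const std::string& query) {
    (void)query;   // toy stand-in for the property index lookup
    return {"lfn:run1234/evts_0001.root", "lfn:run1234/evts_0002.root"};
}

// Request planning: logical files -> site-specific file requests,
// using the Replica Catalog (and MDS for resource information).
std::vector<SiteFileRequest> planRequest(const std::vector<std::string>& lfns) {
    std::vector<SiteFileRequest> plan;
    for (const auto& lfn : lfns)
        plan.push_back({"Tier1-HRM", "/pnfs/example/" + lfn.substr(4)});
    return plan;
}

// Request Executer: pin each file at the remote DRM/HRM (staging it
// from Enstore if necessary), then request the transfer.
void executeRequest(const std::vector<SiteFileRequest>& plan) {
    for (const auto& r : plan) {
        std::cout << "pin      " << r.site << ":" << r.path << "\n";
        std::cout << "transfer " << r.site << ":" << r.path << " -> local DRM cache\n";
    }
}

int main() {
    executeRequest(planRequest(interpretQuery("dataset = JetMET AND run = 1234")));
}
```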
25
R&D on Components for Data Storage and Data
Access
  • Approach: develop a Storage Architecture, define
    Components and Interfaces
  • This will include Storage and Data Management,
    Data Access, Catalogs, Robust Data Movement, etc.
  • Storage System related R&D issues:
  • Detailed Analysis of the SRM and GridFTP
    specifications, including identification of the
    initial version of the protocols to be used and
    discussion of any connective middleware w.r.t.
    interoperability. Coordination with Tier0/1/2 and
    LCG. Goal is to effect transfer and support
    replica managers.
  • Protocol elements include features from GridFTP,
    SRM
  • At Tier-2 centers: selection of a Temporary Store
    implementation supporting SRM and GridFTP
    (incl. evaluation of interoperability issues with
    the Tier-1 center)
  • dCache, dFarm, DRM, NeST, DAP
  • At the Tier-1 center: provide an SRM/dCache
    interface for the FNAL dCache implementation,
    compatible with the criteria above
  • Track compatibility with LCG (incl. Tier0 center
    at CERN) as their plan evolves
  • Further planning required to incorporate Replica
    Managers / Replica Location Service
  • Also Fabric-Level Storage Services: Cluster
    File Systems, Object Storage Devices
  • Dedicated presentations in the break-out
    session

26
Networking
  • Immediate needs for R&D in three topic areas
  • End-to-End Performance / Network Performance and
    Prediction (see the sketch after this list)
  • Closely related to work on the Storage Grid (SRM,
    etc.)
  • Alternative implementations of TCP/IP Stack
  • QoS and Differentiated Services, Bandwidth
    Brokering
  • Evaluate and eventually utilize the differentiated
    services framework being implemented in Abilene
    and ESnet
  • Evaluate bandwidth brokers (e.g. GARA)
  • Virtual Private Networks (VPN)
  • Evaluate and eventually implement VPN technology
    over public network infrastructure for the CMS
    Production Grid
  • Other parties involved are CERN, Caltech,
    DataTAG, Internet2, ESnet, ...
  • Dedicated presentations in the break-out
    session
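To make the end-to-end performance item concrete: on a high bandwidth-delay-product path (e.g. the 622 Mbps link with an assumed round-trip time of roughly 60 ms, giving a bandwidth-delay product near 5 MB), standard-stack throughput is bounded by the TCP window, which in turn is bounded by the socket buffer sizes. The generic POSIX sketch below is illustrative only; it is not tied to any specific US CMS transfer tool, and the buffer size is an assumption.

```cpp
// Generic illustration of standard TCP tuning for a high
// bandwidth-delay-product WAN path: 622 Mbit/s * 0.060 s / 8 is
// roughly 4.7 MB of in-flight data, so the socket buffers must be
// at least that large. Sizes are illustrative; real transfers would
// go through tools such as GridFTP.
#include <cstdio>
#include <iostream>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    const int bufBytes = 8 * 1024 * 1024;   // headroom above the ~5 MB BDP

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    // Enlarge send and receive buffers before connecting, so the
    // kernel can fill and advertise a window large enough for the path.
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufBytes, sizeof(bufBytes)) < 0)
        perror("setsockopt SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufBytes, sizeof(bufBytes)) < 0)
        perror("setsockopt SO_RCVBUF");

    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len);
    std::cout << "effective send buffer: " << actual
              << " bytes (the kernel may cap or double the request)\n";

    close(fd);
    return 0;
}
```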

27
UF Summary
  • We have successfully deployed a Tier-1 center
    prototype at Fermilab and Tier-2 prototype
    facilities at Caltech, U. Florida and UCSD, with
    high-throughput data transfers
  • This Tier-1/Tier-2 distributed User Facility was
    very successfully used for a large-scale,
    world-wide production challenge
  • By aggressively utilizing Grid Technology for
    Production at the US CMS facilities we are
    stimulating the development process on a
    worldwide scale
  • Successfully running an R&D program on Data
    Storage and Data Access: fully implemented dCache
    for Production and Analysis, and will focus on
    implementation of standardized Storage Management
  • Have set up an R&D program on high-performance
    networking in the production environment and
    successfully optimized throughput during
    Production for the DAQ TDR
  • Large data samples (Objectivity and nTuples)
    have been made available to the physics
    community - DAQ TDR
  • Major upcoming milestones
  • be operational as part of the LCG Production Grid
    in June 2003
  • participation in the CMS 5% data challenge DC04