1
Facilities and Fabrics Research and Development
  • Michael Ernst, Fermilab
  • DOE/NSF Review, January 16, 2003

2
UF US CMS Prototypes and Testbeds
  • Tier-1 and Tier-2 Prototypes and Testbeds
    operational
  • Facilities for event simulation, including
    reconstruction
  • Sophisticated processing for pile-up simulation
  • User cluster and hosting of data samples for
    physics studies
  • Facilities and Grid R&D

Storage Servers: about 10 TeraByte
Wide Area Network: 622 Mbps (shared)
Production Farm: 80 Dual CPU Nodes
Tape Library: 60 TeraByte
User Cluster: 16 Nodes
R&D Cluster
3
CMS Milestones
  • DC04

4
CMS Data Challenge DC04
5
US CMS UF Resources
  • Roughly 50% (Disk) to 75% (CPU) of the
    (estimated) resources planned for end of 2003

6
Major Tier 1 Activities
  • Data Serving and dCache project rather
    successful: standardized data server, scalability
  • Now focusing on standardization of Storage
    Management Protocols (with PPDG, EDG)
  • Resource Management (dynamic partitioning of farm
    resources)
  • 3 different Grid Systems, Analysis Cluster
  • User Analysis Cluster R&D facility deployed,
    work on load balancing
  • Monitoring systems deployed, from fabric
    monitoring to Grid monitoring
  • User Management (Virtual Organization tools)
  • Throughput studies and development, domestic and
    transatlantic
  • Software distribution and deployment system
    evaluation, Grid tools considered
  • Documentation, collaboratory, web servers, etc.
  • There are many things we do not do due to lack of
    manpower
  • Down-scoped for 2002 due to lack of funding
    (should have 13 FTE, have 6)
  • R&D slowed down and in danger of being ad hoc,
    missing pieces
  • Global Data Catalogs (essential pre-requisite for
    DC04)
  • Physics Analysis Cluster
  • User support has somewhat suffered
  • With the planned ramp-up of the UF (and a good
    plan!) these issues will be resolved

7
Adapted Scope
  • We have adapted the UF scope
  • In our embracing of Grid Technology and how it
    comes into the project, in order to deal with the
    LCG
  • We also changed the scope by stretching R&D,
    leaving less functionality
  • Things we should have developed and working by
    now, but haven't had the resources
  • E.g. Storage Resource Manager, File Catalogs,
    Global File System
  • We bought less equipment than planned and
    therefore production requires more manpower
  • The Project WBS has been re-worked
  • Adapt the project plan to the changed scope
    (below the original plans for FY03/04), leading to
    a strong US CMS participation in the CMS 5% data
    challenge DC04: validating core software,
    developing the computing model
  • US CMS will take part in the LCG Production
    Grid milestone in 2003
  • WBS reflects reality, has a structure that
    should work with the new Fermilab project
    accounting scheme, and should allow tracking of
    effort and progress
  • A level of 13 FTE is required to provide the
    effort for research, development, deployment,
    integration, management, operation, support
  • Modest hardware procurements for Tier-1 center
  • Typically $500k/year in FY2003, FY2004
  • Need more ($700k) in 2005 for participation in
    the 10% data challenge DC05, Regional Center TDR

8
Set of High Level Milestones
  • Integration Grid Testbed deployed, running PRS
    production, demonstrations
  • October, November, December 2002
  • Review of IGT, promotion and termination
  • December 2002, January 2003
  • Farm Configuration Definition and Deployment
  • February 2003.
  • Fully Functional Production Grid on a National
    Scale
  • February 2003
  • Migration of Testbed functionality to Production
    Facilities
  • March 2003
  • Start of LCG 24x7 Production Grid, with US CMS
    participating
  • June 2003 -- this needs definition from the
    LCG/GDB as to what it actually means
  • Start of CMS DC04 production preparation, PCP04
  • July 2003 -- enormous resources with sustained
    running over 6 months
  • Running of DC04
  • Feb 2004 -- again the computing model etc being
    developed inside CMS

9
General Approach to UF WBS
  • Goals: aggressive prototyping, early roll-out,
    strong QC/Docu, track external practices
  • Approach: Rolling Prototypes (evolution of the
    facility and data systems)
  • Test stands for various hardware components (and
    fabric-related software components) -- this
    allows us to sample emerging technologies with
    small risks (WBS 1.1.1)
  • Setup of a test(bed) system out of
    next-generation components -- always keeping a
    well-understood and functional production system
    intact (WBS 1.1.2)
  • Deployment of a production-quality facility ---
    comprised of well-defined components with
    well-defined interfaces that can be upgraded
    component-wise with a well-defined mechanism for
    changing the components to minimize risks (WBS
    1.1.3)
  • This matches the general strategy of rolling
    replacements, thereby upgrading facility capacity
    making use of Moore's law
  • Correspondingly our approach to developing the
    software systems for the distributed data
    processing environment adopts rolling
    prototyping
  • Analyze current practices in distributed systems
    processing and of external software, like Grid
    middleware (WBS 1.3.1, 1.3.2)
  • Prototyping of the distributed processing
    environment (WBS 1.3.3)
  • Software Support and Transitioning, including use
    of testbeds (WBS 1.3.4)
  • Servicing external milestones like data
    challenges to exercise the new functionality and
    get feedback (WBS 1.3.5)
  • Next prototype system to be delivered is the US
    CMS contribution to the LCG Production Grid (June
    2003)
  • CMS will run a large Data Challenge on that
    system to prove and further develop the computing
    model

10
UF WBS Level 3
Full WBS at http://heppc16.ucsd.edu/Planning_new/
  • Note the large effort captured from Grids and/or
    at Universities

11
UF WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
12
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
1.2     Physics Analysis Facility Support    1.0   0.0   2.0   0.0
1.2.1   Facility Planning and Procurement    0.5   0.5
1.2.2   Desktop Support                      0.3   1.0
1.2.3   Documentation                        0.3   0.5
1.2.4   Collaborative Tools
1.2.5   Development Environment
1.2.6   Administration, Monitoring & Problem Resolution Software
1.2.7   Administrative Support
1.2.8   User Helpdesk
1.2.9   Training Office
1.2.10  Virtual Control Room
13
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
14
US CMS WBS to Level 4
Full WBS at http://heppc16.ucsd.edu/Planning_new/
15
Estimated Resource Needs at T1

16
Projects
  • Compute and Storage Elements: R&D on Components
    and Systems
  • Cluster Management: Generic Farms, Dynamic
    Partitioning
  • Storage Management and Access
  • Interface Standardization and its implementation
  • Data set catalogs, metadata, replication, robust
    file transfers
  • Networking: Terabyte throughput to T2, to CERN
  • Need High-Performance Network Infrastructure
  • R&D on Network Protocol Stack
  • Physics Analysis Center
  • Analysis cluster
  • Desktop support
  • Software Distribution, Software Support, User
    Support / Helpdesk
  • Collaborative tools
  • VO management and security
  • Worked out a plan for R&D and Deployment
  • Need to develop operations scenario
  • In addition
  • Operation and Facilities-related R&D

17
Current System Architecture at Tier-1
[Diagram: current Tier-1 system architecture. A Cisco 6509 switch
interconnects the Production Cluster, the US-CMS Grid Testbed, R&D,
batch, and user-analysis hosts (POPCRN, GALLO, VELVEETA, GYOZA, FRY,
BIGMAC, RAMEN, BURRITO, CHALUPA, WONDER, CHOCOLAT, WHOPPER, SNICKERS,
CMSUN1), each server group with 250 GB to 1 TB of attached disk,
plus ENSTORE (15 drives, shared) as the tape back-end and wide-area
connectivity via MREN (OC3, shared) and ESNET (OC12, shared).]
18
Dynamic Partitioning and Farm Configuration
[Diagram: two front-end nodes serve an Analysis and a Production
partition of worker nodes; nodes can be reassigned between the
partitions, and both partitions share the same Network Attached
Storage, large-scale caching system (dCache), and tertiary storage
system (ENSTORE).]
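To illustrate the dynamic partitioning idea, the sketch below moves worker nodes between a production and an analysis partition according to their pending-job backlog. It is a hypothetical illustration only: the Partition and Node types and the rebalance() function are invented for this sketch and do not come from the actual farm configuration tools.

```cpp
// Hypothetical sketch of dynamic farm partitioning (not the actual
// Tier-1 configuration tools): worker nodes are reassigned between
// the production and analysis partitions according to queue demand,
// while both partitions keep using the same shared storage layers
// (NAS, dCache, ENSTORE).
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

struct Node { std::string name; };

struct Partition {
    std::string label;
    std::vector<Node> nodes;
    std::size_t pendingJobs;   // jobs queued at this partition's front-end node
};

// Move one node at a time from the lightly loaded partition to the
// heavily loaded one until the per-node backlogs are roughly balanced.
void rebalance(Partition& a, Partition& b) {
    auto load = [](const Partition& p) {
        return p.nodes.empty() ? 0.0
                               : static_cast<double>(p.pendingJobs) / p.nodes.size();
    };
    while (load(a) > 2.0 * load(b) && b.nodes.size() > 1) {
        a.nodes.push_back(b.nodes.back());
        b.nodes.pop_back();
    }
    while (load(b) > 2.0 * load(a) && a.nodes.size() > 1) {
        b.nodes.push_back(a.nodes.back());
        a.nodes.pop_back();
    }
}

int main() {
    Partition production{"production", {{"node01"}, {"node02"}, {"node03"}, {"node04"}}, 400};
    Partition analysis{"analysis", {{"node05"}, {"node06"}}, 10};
    rebalance(production, analysis);
    std::cout << production.label << ": " << production.nodes.size() << " nodes, "
              << analysis.label << ": " << analysis.nodes.size() << " nodes\n";
}
```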
19
Anticipated System Architecture
[Diagram: anticipated Tier-1 system architecture. The Production
Cluster (80 dual nodes), the IGT (GYOZA), R&D (FRY), POPCRN, and the
user-analysis cluster connect through a Cisco 6509 switch to NAS,
dCache (about 6 TB), and ENSTORE (17 drives, shared), with wide-area
connectivity via MREN (OC3, shared) and ESNET (OC12, shared).]
20
User access to Tier-1 (JetMet, Muons)
  • ROOT/POOL interface (TDCacheFile); see the
    sketch below
  • AMS server
  • AMS/Enstore interface
  • AMS/dCache interface

[Diagram: objects are served over the network from dCache, Enstore,
and NAS/RAID to users in Wisconsin, CERN, FNAL, and Texas.]
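The ROOT/dCache access path listed above can be sketched as follows. This is a minimal illustration, assuming a ROOT build with dcap support (in which case TFile::Open instantiates a TDCacheFile for dcap:// URLs); the door host, port, PNFS path, and tree name are hypothetical placeholders, not the actual Tier-1 configuration.

```cpp
// Minimal sketch of reading objects through dCache from ROOT.
// The door host, port, PNFS path, and tree name are hypothetical
// placeholders; requires a ROOT build with dcap support, in which
// case TFile::Open returns a TDCacheFile for dcap:// URLs.
#include "TFile.h"
#include "TTree.h"
#include <iostream>

int main() {
    // Open a file via a dCache door; dCache stages it from Enstore
    // tape if it is not already cached on disk.
    TFile* f = TFile::Open(
        "dcap://dcache-door.example.fnal.gov:22125"
        "/pnfs/example/cms/jetmet/sample.root");
    if (!f || f->IsZombie()) {
        std::cerr << "could not open file via dCache\n";
        return 1;
    }

    // Read a tree exactly as one would from a local file.
    TTree* tree = nullptr;
    f->GetObject("Events", tree);
    if (tree)
        std::cout << "entries: " << tree->GetEntries() << std::endl;

    f->Close();
    return 0;
}
```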
21
VO AA Support
  • Very complex issue (and large possible impacts)
  • US CMS is working on developing a more complete
    view on the subject: assess current registration
    schemes, databases etc., develop an operational
    model
  • This plan was developed in consultation with the
    Fermilab security team; it identifies the specific
    US-CMS pieces of work
  • There are many stakeholders in this project,
    coming from rather different sides -- needs some
    more discussion on project structure and scope
  • Person in CD/CCF has been identified
  • There are also many existing and emerging
    components developed by different projects
  • There is a strong operational component, and
    existing regulations, practices, constraints
  • Technical issues we can start working on
    (WBS/schedule exists)
  • Registration Schema definition, VO Membership
    Service, Registration Mechanisms
  • Authentication: replacing group certificates,
    deployment of KCA at Tier-1 with some
    documentation on the experience, then deployment
    in the DGT, performance studies
  • Authorization: deploy EDG gatekeeper with LCAS,
    ensure documentation of the VOMS client, mechanisms
    to update/access VOMS information, distributed
    VOMS, re-auth of long jobs, Documentation and
    Helpdesk
  • Eventually also Accounting issues

22
Storage and Data Access
  • Viewpoints of CMS Event Data in the CMS Data Grid
    System
  • High-level data views in the minds of physicists
  • High-level data views in physics analysis tools
  • Virtual data product collections
    (highest-level common view across CMS)
  • Materialized data product collections
  • File sets
    (sets of logical files with high-level significance)

  • Logical files
  • Physical files on sites
    (device location independent view)
  • Physical files on storage devices
    (lowest-level generic view of files)
  • Device-specific files

R&D focusing on common interfaces (sketched below)
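A minimal sketch of this layering, with hypothetical type and function names that do not correspond to any actual CMS catalog implementation: each level refines the one above it, and the common interfaces under R&D sit between adjacent levels (e.g. resolving a logical file into its site- and device-specific replicas).

```cpp
// Hypothetical model of the data-view layers; names are illustrative
// only and do not come from an actual CMS catalog implementation.
#include <iostream>
#include <string>
#include <vector>

// Highest-level common views across CMS.
struct VirtualCollection      { std::string datasetName; };
struct MaterializedCollection { std::string datasetName; std::vector<std::string> fileSets; };

// A file set groups logical files with high-level significance;
// logical files are site and device independent.
struct FileSet     { std::string name; std::vector<std::string> lfns; };
struct LogicalFile { std::string lfn; };

// Physical views add a site, a storage device, and finally a
// device-specific path at the lowest level.
struct PhysicalFile { std::string lfn; std::string site; std::string device; std::string path; };

// Interface between the layers: resolve a logical file to its
// replicas. A real implementation would consult a replica catalog.
std::vector<PhysicalFile> resolve(const LogicalFile& lf) {
    return {
        {lf.lfn, "Tier1-FNAL", "dCache", "/pnfs/example/" + lf.lfn},
        {lf.lfn, "Tier2-UCSD", "dFarm",  "/dfarm/example/" + lf.lfn},
    };
}

int main() {
    for (const auto& pf : resolve({"run1234/evts_0001.root"}))
        std::cout << pf.site << " " << pf.device << " " << pf.path << "\n";
}
```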
23
Storage Resource Management
[Diagram: at the site hosting the application, clients issue a
logical query; a Request Interpreter uses a property index to map it
to logical files, request planning resolves these to site-specific
files via the Replica Catalog and MDS, and a Request Executer sends
site-specific file requests (pinning and file transfer requests) to
the local DRM. Across the network, the Tier-1 site runs an HRM in
front of dCache and Enstore, while a Tier-2 site runs a DRM in front
of a store still to be chosen (dCache, dFarm, DRM, NeST). Today this
is the responsibility of the application, invoking some higher-level
middleware components (e.g. Condor).]
24
Storage Resource Management
[Diagram: the same Storage Resource Management architecture as on
the previous slide.]
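The request flow in the diagram can be summarized by the hypothetical sketch below; none of these function or type names belong to a real SRM client API. The Request Interpreter maps a logical query to logical files, request planning turns them into site-specific file requests using the Replica Catalog (and MDS for resource information), and the Request Executer pins the files at the remote DRM/HRM and issues the transfer requests.

```cpp
// Hypothetical sketch of the SRM request flow shown above; none of
// these names belong to a real SRM client API.
#include <iostream>
#include <string>
#include <vector>

struct SiteFileRequest { std::string site; std::string path; };

// Request Interpreter: logical query -> logical files (property index).
std::vector<std::string> interpretQuery(const std::string& query) {
    (void)query;   // toy stand-in for the property index lookup
    return {"lfn:run1234/evts_0001.root", "lfn:run1234/evts_0002.root"};
}

// Request planning: logical files -> site-specific file requests,
// using the Replica Catalog (and MDS for resource information).
std::vector<SiteFileRequest> planRequest(const std::vector<std::string>& lfns) {
    std::vector<SiteFileRequest> plan;
    for (const auto& lfn : lfns)
        plan.push_back({"Tier1-HRM", "/pnfs/example/" + lfn.substr(4)});
    return plan;
}

// Request Executer: pin each file at the remote DRM/HRM (staging it
// from Enstore if necessary), then request the transfer.
void executeRequest(const std::vector<SiteFileRequest>& plan) {
    for (const auto& r : plan) {
        std::cout << "pin      " << r.site << ":" << r.path << "\n";
        std::cout << "transfer " << r.site << ":" << r.path << " -> local DRM cache\n";
    }
}

int main() {
    executeRequest(planRequest(interpretQuery("dataset = JetMET AND run = 1234")));
}
```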
25
R&D on Components for Data Storage and Data
Access
  • Approach: develop a Storage Architecture, define
    Components and Interfaces
  • This will include Storage and Data Management,
    Data Access, Catalogs, Robust Data Movement, etc.
  • Storage System related R&D issues:
  • Detailed Analysis of the SRM and GridFTP
    specifications, including identification of the
    initial version of the protocols to be used and
    discussion of any connective middleware w.r.t.
    interoperability. Coordination with Tier0/1/2 and
    LCG. Goal is to effect transfer and support
    replica managers.
  • Protocol elements include features from GridFTP,
    SRM
  • At Tier-2 centers: selection of a Temporary Store
    implementation supporting SRM and GridFTP
    (incl. evaluation of interoperability issues with
    the Tier-1 center)
  • dCache, dFarm, DRM, NeST, DAP
  • At the Tier-1 center: provide an SRM/dCache
    interface for the FNAL dCache implementation,
    compatible with the criteria above
  • Track compatibility with LCG (incl. Tier0 center
    at CERN) as their plan evolves
  • Further planning required to incorporate Replica
    Managers / Replica Location Service
  • Also Fabric-Level Storage Services: Cluster
    File Systems, Object Storage Devices
  • Dedicated presentations in the break-out
    session

26
Networking
  • Immediate needs for R&D in three topic areas
  • End-to-End Performance / Network Performance and
    Prediction (see the sketch after this list)
  • Closely related to work on the Storage Grid (SRM,
    etc.)
  • Alternative implementations of TCP/IP Stack
  • QoS and Differentiated Services, Bandwidth
    Brokering
  • Evaluate and eventually utilize the differentiated
    services framework being implemented in Abilene
    and ESnet
  • Evaluate bandwidth brokers (e.g. GARA)
  • Virtual Private Networks (VPN)
  • Evaluate and eventually implement VPN technology
    over public network infrastructure for the CMS
    Production Grid
  • Other parties involved are CERN, Caltech,
    DataTAG, Internet2, ESnet, ...
  • Dedicated presentations in the break-out
    session
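To make the end-to-end performance item concrete: on a high bandwidth-delay-product path (e.g. the 622 Mbps link with an assumed round-trip time of roughly 60 ms, giving a bandwidth-delay product near 5 MB), standard-stack throughput is bounded by the TCP window, which in turn is bounded by the socket buffer sizes. The generic POSIX sketch below is illustrative only; it is not tied to any specific US CMS transfer tool, and the buffer size is an assumption.

```cpp
// Generic illustration of standard TCP tuning for a high
// bandwidth-delay-product WAN path: 622 Mbit/s * 0.060 s / 8 is
// roughly 4.7 MB of in-flight data, so the socket buffers must be
// at least that large. Sizes are illustrative; real transfers would
// go through tools such as GridFTP.
#include <cstdio>
#include <iostream>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    const int bufBytes = 8 * 1024 * 1024;   // headroom above the ~5 MB BDP

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    // Enlarge send and receive buffers before connecting, so the
    // kernel can fill and advertise a window large enough for the path.
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufBytes, sizeof(bufBytes)) < 0)
        perror("setsockopt SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufBytes, sizeof(bufBytes)) < 0)
        perror("setsockopt SO_RCVBUF");

    int actual = 0;
    socklen_t len = sizeof(actual);
    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len);
    std::cout << "effective send buffer: " << actual
              << " bytes (the kernel may cap or double the request)\n";

    close(fd);
    return 0;
}
```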

27
UF Summary
  • We have successfully deployed a Tier-1 center
    prototype at Fermilab and Tier-2 prototype
    facilities at Caltech, U. Florida and UCSD, with
    high-throughput data transfers
  • This Tier-1/Tier-2 distributed User Facility was
    very successfully used for a large-scale,
    world-wide production challenge
  • By aggressively utilizing Grid Technology for
    Production at the US CMS facilities we are
    stimulating the development process on a
    worldwide scale
  • Successfully running an R&D program on Data
    Storage and Data Access: fully implemented dCache
    for Production and Analysis, and will focus on
    implementation of standardized Storage Management
  • Have set up an R&D program on high-performance
    networking in the production environment and
    successfully optimized throughput during
    Production for the DAQ TDR
  • Large data samples (Objectivity and nTuples)
    have been made available to the physics
    community - DAQ TDR
  • Major upcoming milestones
  • be operational as part of the LCG Production Grid
    in June 2003
  • participation in the CMS 5% data challenge DC04