Title: Towards a US and LHC Grid
1 Towards a US (and LHC) Grid Environment for HENP Experiments
- CHEP 2000 Grid Workshop, Padova
- Harvey B. Newman, Caltech
- February 12, 2000
2 Data Grid Hierarchy: Integration, Collaboration, Marshal Resources
[Diagram: the LHC Data Grid hierarchy, Tier 0 through Tier 4; 1 TIPS = 25,000 SpecInt95, and a PC today is 10-15 SpecInt95]
- Detector: one bunch crossing every 25 nsec; ~100 triggers per second; each event is ~1 MByte in size (see the arithmetic check below)
- Detector -> Online System at ~PBytes/sec; Online System -> Offline Farm (~20 TIPS) at ~100 MBytes/sec
- Offline Farm -> Tier 0 (CERN Computer Center) at ~100 MBytes/sec
- Tier 0 -> Tier 1 at 622 Mbits/sec, or by air freight
- Tier 1 Regional Centers: Fermilab (~4 TIPS), and the France, Italy and Germany Regional Centers
- Tier 1 -> Tier 2 at 2.4 Gbits/sec; Tier 2 -> Tier 3 at 622 Mbits/sec
- Tier 3: Institute servers (~0.25 TIPS each) with a physics data cache. Physicists work on analysis channels; each institute has ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server
- Tier 3 -> Tier 4 (workstations) at 100 - 1000 Mbits/sec
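- The trigger rate and event size above fix the raw data volume entering the hierarchy; a minimal arithmetic check is sketched below (the ~1e7 live seconds per year is an assumed HENP rule of thumb, not a number taken from this slide):

    # Back-of-the-envelope data-volume check for the Tier 0 input rate.
    # Numbers from the slide: ~100 triggers/sec, ~1 MByte/event.
    # The 1e7 live seconds/year is an assumption (a common HENP rule of thumb).
    trigger_rate_hz = 100          # accepted events per second
    event_size_mb = 1.0            # MBytes per event
    live_seconds_per_year = 1e7    # assumed annual live time

    rate_mb_per_s = trigger_rate_hz * event_size_mb        # ~100 MBytes/sec, as on the slide
    volume_pb_per_year = rate_mb_per_s * live_seconds_per_year / 1e9
    print(f"{rate_mb_per_s:.0f} MB/s -> ~{volume_pb_per_year:.1f} PB/year of raw data")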
3 To Solve the LHC Data Problem
- The proposed LHC computing and data handling will not support FREE access, transport or processing for more than a small part of the data
- Balance between proximity to large computational and data handling facilities, and proximity to end users and more local resources for frequently-accessed datasets
- Strategies must be studied and prototyped, to ensure both acceptable turnaround times and efficient resource utilisation
- Problems to be Explored
- How to meet the demands of hundreds of users who need transparent access to local and remote data, in disk caches and tape stores
- Prioritise hundreds of requests from local and remote communities, consistent with local and regional policies
- Ensure that the system is dimensioned/used/managed optimally, for the mixed workload
4 Regional Center Architecture (Example by I. Gaines, MONARC)
[Diagram: a Regional Center linked upward to CERN (tapes), downward to Tier 2 centers, local institutes and desktops, with data import and data export paths]
- Core fabric: tape mass storage, disk servers, database servers
- Production Reconstruction: Raw/Sim -> ESD; scheduled, predictable; for the experiment and physics groups
- Production Analysis: ESD -> AOD, AOD -> DPD; scheduled; for physics groups
- Individual Analysis: AOD -> DPD and plots; chaotic; for individual physicists
- Support services: physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk
5 Grid Services Architecture
- Applications: HEP data-analysis related applications
- Application Toolkits: remote visualization toolkit, remote computation toolkit, remote data toolkit, remote sensors toolkit, remote collaboration toolkit, ...
- Grid Services: protocols, authentication, policy, resource management, instrumentation, data discovery, etc.
- Grid Fabric: networks, data stores, computers, display devices, etc., and the associated local services (local implementations)
- Adapted from Ian Foster
6 Grid Hierarchy Goals: Better Resource Use and Faster Turnaround
- Grid integration and (de facto standard) common services to ease development, operation, management and security
- Efficient resource use and improved responsiveness through
- Treatment of the ensemble of site and network resources as an integrated (loosely coupled) system
- Resource discovery, query estimation (redirection), co-scheduling, prioritization, local and global allocations
- Network and site instrumentation: performance tracking, monitoring, forward prediction, problem trapping and handling
7 GriPhyN: First Production-Scale Grid Physics Network
- Develop a New Integrated Distributed System, while Meeting the Primary Goals of the US LIGO, SDSS and LHC Programs
- Unified Grid System Concept; Hierarchical Structure
- Twenty Centers, with Three Sub-Implementations
- 5-6 each in the US for LIGO, CMS, ATLAS; 2-3 for SDSS
- Emphasis on Training, Mentoring and Remote Collaboration
- Focus on LIGO, SDSS (BaBar and Run2) handling of real data, and LHC Mock Data Challenges with simulated data
- Making the Process of Discovery Accessible to Students Worldwide
- GriPhyN Web Site: http://www.phys.ufl.edu/avery/mre/
- White Paper: http://www.phys.ufl.edu/avery/mre/white_paper.html
8 Grid Development Issues
- Integration of applications with Grid middleware
- A performance-oriented user application software architecture is required, to deal with the realities of data access and delivery
- Application frameworks must work with system state and policy information (instructions) from the Grid
- O(R)DBMSs must be extended to work across networks
- E.g. data transport and catalog updates that are invisible to the DBMS
- Interfacility cooperation at a new level, across world regions
- Agreement on choice and implementation of standard Grid components, services, security and authentication
- Interface the common services locally, to match heterogeneous resources, performance levels, and local operational requirements
- Accounting and exchange-of-value software, to enable cooperation
9 Roles of Projects for HENP Distributed Analysis
- RD45, GIOD: Networked Object Databases
- Clipper/GC: High-speed access to Object or File data; FNAL/SAM for processing and analysis
- SLAC/OOFS: Distributed File System with an Objectivity Interface
- NILE, Condor: Fault-Tolerant Distributed Computing with Heterogeneous CPU Resources
- MONARC: LHC Computing Models (Architecture, Simulation, Strategy, Politics)
- PPDG: First Distributed Data Services and Data Grid System Prototype
- ALDAP: OO Database Structures and Access Methods for Astrophysics and HENP Data
- GriPhyN: Production-Scale Data Grid
- APOGEE: Simulation/Modeling, Application and Network Instrumentation, System Optimization/Evaluation
10 Other ODBMS Tests
- Tests with Versant (fallback ODBMS)
- DRO WAN tests with CERN
- Production on CERN's PCSF, and file movement to Caltech
- Objectivity/DB: creation of a 32,000-database federation
11 The China Clipper Project: A Data-Intensive Grid (ANL-SLAC-Berkeley)
- China Clipper Goal
- Develop and demonstrate middleware allowing applications transparent, high-speed access to large data sets distributed over wide-area networks
- Builds on expertise and assets at ANL, LBNL and SLAC
- NERSC, ESnet
- Builds on Globus middleware and a high-performance distributed storage system (DPSS from LBNL)
- Initial focus on large DOE HENP applications
- RHIC/STAR, BaBar
- Demonstrated data rates up to 57 MBytes/sec
12 Grand Challenge Architecture
- An order-optimized prefetch architecture for data retrieval from multilevel storage in a multiuser environment
- Queries select events and specific event components based upon tag attribute ranges
- Query estimates are provided prior to execution
- Queries are monitored for progress and multi-use
- Because event components are distributed over several files, processing an event requires delivery of a bundle of files
- Events are delivered in an order that takes advantage of what is already on disk, with multiuser policy-based prefetching of further data from tertiary storage (see the ordering sketch below)
- GCA intercomponent communication is CORBA-based, but physicists are shielded from this layer
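- A minimal sketch of the order-optimized idea (not the actual GCA code; bundles and the cache are simplified to sets of file names, and the ranking rule is an illustrative stand-in for the policy module):

    # Minimal sketch of order-optimized bundle delivery: serve first the file bundles already
    # fully on disk, then prefetch the rest, preferring files wanted by the most pending bundles.
    from collections import Counter

    def order_bundles(bundles, on_disk):
        """bundles: list of (bundle_id, set_of_files); on_disk: set of cached files."""
        ready = [b for b in bundles if b[1] <= on_disk]
        pending = [b for b in bundles if not b[1] <= on_disk]
        # Rank missing files by how many pending bundles need them.
        demand = Counter(f for _, files in pending for f in files if f not in on_disk)
        fetch_order = [f for f, _ in demand.most_common()]
        # Pending bundles are served in order of how little data they still need.
        pending.sort(key=lambda b: len(b[1] - on_disk))
        return ready, pending, fetch_order

    bundles = [("q1-evt001", {"f1", "f2"}), ("q1-evt002", {"f2", "f3"}), ("q2-evt001", {"f3", "f4"})]
    ready, pending, fetch_order = order_bundles(bundles, on_disk={"f1", "f2"})
    print(ready)        # the bundle already on disk is delivered immediately
    print(fetch_order)  # ['f3', 'f4']: f3 is wanted by two pending bundles, so it is fetched first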
13 GCA System Overview
[Diagram: the GCA client layer working with STACS; components shown include the File Catalog, the Index, Event Tags, staged event files and (other) disk-resident event data, with pftp transfers out of HPSS]
14 STorage Access Coordination System (STACS)
[Diagram: STACS components and their interactions]
- Query Estimator: uses a bit-sliced index over the event tags to return an estimate before a query executes (a toy bitmap-index estimate follows below)
- Query Monitor: tracks query status, the lists of file bundles and events, and the cache map
- Policy Module: turns those lists into requests for file caching and purging, under multiuser policy
- Cache Manager: issues pftp and file-purge commands, and works with the File Catalog
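- A toy bitmap-index estimate, as a stand-in for the bit-sliced index in the Query Estimator (the attribute, binning and values are invented for illustration):

    # Illustrative bitmap-index query estimate. One bitmap per attribute bin; a range query
    # ORs the overlapping bins and counts set bits, giving an event-count estimate without
    # touching the event data itself.
    events_pt = [3.2, 7.8, 12.5, 0.9, 22.1, 15.4, 8.8]     # toy tag attribute (e.g. pT in GeV)
    bins = [(0, 5), (5, 10), (10, 20), (20, 100)]           # illustrative binning

    bitmaps = []
    for lo, hi in bins:
        bm = 0
        for i, v in enumerate(events_pt):
            if lo <= v < hi:
                bm |= 1 << i                                # set bit i if event i falls in this bin
        bitmaps.append(bm)

    def estimate(lo, hi):
        """Upper-bound estimate of events with lo <= pT < hi, using whole bins that overlap the range."""
        mask = 0
        for (blo, bhi), bm in zip(bins, bitmaps):
            if bhi > lo and blo < hi:                       # bin overlaps the query range
                mask |= bm
        return bin(mask).count("1")

    print(estimate(10, 100))   # 3 events fall in the bins covering [10, 100)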
15 The Particle Physics Data Grid (PPDG)
- ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
[Diagram: the two first-year services]
- Site-to-Site Data Replication Service at 100 MBytes/sec, between a PRIMARY SITE (data acquisition, CPU, disk, tape robot) and a SECONDARY SITE (CPU, disk, tape robot)
- Multi-Site Cached File Access Service
- First Year Goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of order one Petabyte
16 The Particle Physics Data Grid (PPDG)
- The ability to query and partially retrieve hundreds of terabytes across Wide Area Networks within seconds
- PPDG uses advanced services in three areas
- Distributed caching, to allow for rapid data delivery in response to multiple requests
- Matchmaking and request/resource co-scheduling, to manage workflow and use computing and network resources efficiently to achieve high throughput
- Differentiated Services, to allow particle-physics bulk data transport to coexist with interactive and real-time remote collaboration sessions, and other network traffic
17 PPDG Architecture for Reliable High-Speed Data Delivery
[Diagram: the service stack, laid out across a site boundary / security domain]
- Object-based and file-based application services, over resource management
- Matchmaking service, cost estimation, file replication index
- File access service, cache manager, file fetching service, mass storage manager, and file movers on each side of the site boundary
- End-to-end network services
- Future: file and object export, cache state tracking, forward prediction
18 First Year PPDG System Components
- Middleware Components (Initial Choice): see the PPDG Proposal
- Object- and File-Based Application Services: Objectivity/DB (SLAC-enhanced); GC Query Object, Event Iterator, Query Monitor; FNAL SAM system
- Resource Management: start with human intervention (but begin to deploy resource discovery and management tools: Condor, SRB)
- File Access Service: components of OOFS (SLAC)
- Cache Manager: GC Cache Manager (LBNL)
- Mass Storage Manager: HPSS, Enstore, OSM (site-dependent)
- Matchmaking Service: Condor (U. Wisconsin)
- File Replication Index: MCAT (SDSC)
- Transfer Cost Estimation Service: Globus (ANL)
- File Fetching Service: components of OOFS
- File Mover(s): SRB (SDSC); site specific
- End-to-End Network Services: Globus tools for QoS reservation
- Security and Authentication: Globus (ANL)
19 CONDOR Matchmaking: A Resource Allocation Paradigm
- Parties use ClassAds to advertise properties, requirements and ranking to a matchmaker (a simplified sketch of the idea follows below)
- ClassAds are self-describing (no separate schema)
- ClassAds combine query and data
- High Throughput Computing: http://www.cs.wisc.edu/condor
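- A minimal Python sketch of the matchmaking idea (this is not the ClassAd language or the Condor matchmaker; the ads are plain dictionaries and the attribute names are invented):

    # Each ad is a self-describing dictionary; Requirements and Rank are expressions evaluated
    # against the other party's ad; the matchmaker pairs ads whose requirements hold both ways.
    job_ad = {
        "Type": "Job", "ImageSize_MB": 300,
        "Requirements": lambda other: other["OpSys"] == "LINUX" and other["Memory_MB"] >= 300,
        "Rank": lambda other: other["Mips"],            # prefer faster machines
    }
    machine_ads = [
        {"Type": "Machine", "Name": "node1", "OpSys": "LINUX", "Memory_MB": 256, "Mips": 350,
         "Requirements": lambda other: other["ImageSize_MB"] <= 256, "Rank": lambda other: 0},
        {"Type": "Machine", "Name": "node2", "OpSys": "LINUX", "Memory_MB": 512, "Mips": 500,
         "Requirements": lambda other: True, "Rank": lambda other: 0},
    ]

    def match(job, machines):
        """Return the machine with the highest job Rank whose requirements are mutually satisfied."""
        candidates = [m for m in machines
                      if job["Requirements"](m) and m["Requirements"](job)]
        return max(candidates, key=job["Rank"], default=None)

    best = match(job_ad, machine_ads)
    print(best["Name"] if best else "no match")   # node2: node1 fails the memory checks both ways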
20 Remote Execution in Condor
[Diagram: agents for remote execution in CONDOR, spanning a submission side and an execution side]
- Submission side: the request queue, the Owner Agent and the Customer Agent, with the job's object files
- Execution side: the Execution Agent and Application Agent, managing the Application Process with its data/object files and checkpoint (Ckpt) files
- Remote I/O and checkpointing link the two sides
21 Beyond Traditional Architectures: Mobile Agents (Java Aglets)
- "Agents are objects with rules and legs" (D. Taylor)
[Diagram: a mobile Agent moving between an Application and a Service]
- Mobile Agents
- Execute Asynchronously
- Reduce Network Load: Local Conversations
- Overcome Network Latency, and Some Outages
- Adaptive -> Robust, Fault Tolerant
- Naturally Heterogeneous
- Extensible Concept: Agent Hierarchies
22 Using the Globus Tools
- Tests with gsiftp, a modified ftp server/client that allows control of the TCP buffer size (see the window-size sketch below)
- Transfers of Objy database files from the Exemplar to
- Itself
- An O2K at Argonne (via CalREN2 and Abilene)
- A Linux machine at INFN (via the US-CERN transatlantic link)
- Target /dev/null, in multiple streams (1 to 16 parallel gsiftp sessions)
- Aggregate throughput measured as a function of the number of streams and the send/receive buffer sizes
- 25 MB/sec on the HiPPI loop-back
- 4 MB/sec to Argonne by tuning the TCP window size: saturating the available bandwidth to Argonne
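- Why the buffer size matters: the TCP window must cover the path's bandwidth-delay product. A back-of-the-envelope sketch follows; the bandwidths and round-trip times in it are assumed, illustrative values, not measurements from these tests:

    # Bandwidth-delay product: the TCP window needed to keep a path full.
    # Bandwidths and RTTs below are illustrative assumptions, not numbers from the slide.
    def window_bytes(bandwidth_mbps, rtt_ms):
        """Minimum TCP window (bytes) to saturate a path with the given bandwidth and round-trip time."""
        return bandwidth_mbps * 1e6 / 8 * rtt_ms / 1e3

    paths = [("HiPPI loop-back (assume 800 Mbps, 1 ms)", 800, 1),
             ("Caltech to Argonne (assume 155 Mbps, 60 ms)", 155, 60),
             ("US to CERN (assume 45 Mbps, 170 ms)", 45, 170)]
    for name, bw, rtt in paths:
        print(f"{name}: ~{window_bytes(bw, rtt) / 1024:.0f} KB window needed")
    # With a default window of ~64 KB, long-RTT paths stall well below the available bandwidth,
    # which is why gsiftp exposes the buffer size and why multiple parallel streams help.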
23 Distributed Data Delivery and LHC Software Architecture
- Software Architectural Choices
- Traditional, single-threaded applications
- Wait for data location, arrival and reassembly, OR
- Performance-Oriented (Complex)
- I/O requests up-front; multi-threaded; data-driven; respond to an ensemble of (changing) cost estimates (see the sketch below)
- Possible code movement as well as data movement
- Loosely coupled, dynamic
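- A small sketch of the data-driven style (illustrative only; the replica names, costs and thread-pool choice are assumptions, not part of any LHC framework):

    # Issue all I/O requests up front, let a thread pool fetch pieces concurrently, and process
    # each piece as it arrives rather than in file order. Replica costs stand in for a Grid
    # cost-estimation service.
    import random, time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    REPLICA_COST = {"local_disk": 0.01, "regional_center": 0.05, "tape_store": 0.3}  # assumed costs (s)

    def fetch(piece):
        """Fetch one event component from the currently cheapest replica (simulated by a sleep)."""
        site, cost = min(REPLICA_COST.items(), key=lambda kv: kv[1])
        time.sleep(cost * random.uniform(0.5, 1.5))        # pretend transfer time
        return piece, site

    pieces = [f"event-{i}" for i in range(8)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(fetch, p) for p in pieces]  # all requests issued up front
        for done in as_completed(futures):                 # process in arrival order, not request order
            piece, site = done.result()
            print(f"processed {piece} (from {site})")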
24 GriPhyN Foundation
- Build on the Distributed System Results of the GIOD, MONARC, NILE, Clipper/GC and PPDG Projects
- Long Term Vision in Three Phases
- 1. Read/write access to high-volume data and processing power
- Condor/Globus/SRB and NetLogger components to manage jobs and resources
- 2. WAN-distributed data-intensive Grid computing system
- Tasks move automatically to the most effective node in the Grid
- Scalable implementation using mobile agent technology
- 3. Virtual Data concept for multi-PB distributed data management, with large-scale Agent Hierarchies
- Transparently match data to sites, manage data replication or transport, co-schedule data and compute resources
- Build on VRVS Developments for Remote Collaboration
25 GriPhyN/APOGEE: Production Design of a Data Analysis Grid
- INSTRUMENTATION, SIMULATION, OPTIMIZATION, COORDINATION
- SIMULATION of a Production-Scale Grid Hierarchy
- Provide a toolset for HENP experiments to test and optimize their data analysis and resource usage strategies
- INSTRUMENTATION of Grid Prototypes
- Characterize the performance of the Grid components under load
- Validate the simulation
- Monitor, track and report system state, trends and events
- OPTIMIZATION of the Data Grid
- Genetic algorithms, or other evolutionary methods
- Deliver an optimization package for HENP distributed systems
- Applications to other experiments, to accelerator and other control systems, and to other fields
- COORDINATE with Experiment-Specific Projects: CMS, ATLAS, BaBar, Run2
26 Grid (IT) Issues to be Addressed
- Dataset compaction; data caching and mirroring strategies
- Using large time-quanta or very high bandwidth bursts for large data transactions
- Query estimators, query monitors (cf. GCA work)
- Enable flexible, resilient prioritisation schemes (marginal utility; see the sketch below)
- Query redirection, fragmentation, priority alteration, etc.
- Pre-emptive and real-time data/resource matchmaking
- Resource discovery
- Data and CPU Location Brokers
- Co-scheduling and queueing processes
- State, workflow and performance-monitoring instrumentation; tracking and forward prediction
- Security: authentication (for resource allocation/usage and priority); running a certificate authority
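- One possible reading of marginal-utility prioritisation, as an illustrative sketch (the utility form and the request weights are invented, not taken from the slide):

    # Each pending request has a concave utility in the resource share it receives; the next
    # unit of resource goes to whichever request gains the most from it.
    import math

    requests = {                      # hypothetical requests: utility weight per community
        "local_analysis": 3.0,
        "regional_reprocessing": 2.0,
        "remote_skim": 1.0,
    }
    share = {name: 1.0 for name in requests}          # start each with one resource unit

    def marginal_gain(weight, current):
        """Gain in weight*log(share) utility from one more unit of resource."""
        return weight * (math.log(current + 1) - math.log(current))

    for _ in range(10):                               # hand out 10 further resource units
        best = max(requests, key=lambda r: marginal_gain(requests[r], share[r]))
        share[best] += 1

    print(share)   # higher-weight communities end up with proportionally larger shares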
27 CMS Example: Data Grid Program of Work (I)
- FY 2000
- Build basic services; 1-million-event samples on proto-Tier2s
- For HLT milestones and detector/physics studies with ORCA
- MONARC Phase 3 simulations for study/optimization
- FY 2001
- Set up the initial Grid system, based on PPDG deliverables, at the first Tier2 centers and Tier1-prototype centers
- High-speed site-to-site file replication service
- Multi-site cached file access
- CMS Data Challenges in support of the DAQ TDR
- Shakedown of preliminary PPDG (plus MONARC and GIOD) system strategies and tools
- FY 2002
- Deploy the Grid system at the second set of Tier2 centers
- CMS Data Challenges for the Software and Computing TDR and the Physics TDR
28 Data Analysis Grid Program of Work (II)
- FY 2003
- Deploy Tier2 centers at the last set of sites
- 5%-Scale Data Challenge in support of the Physics TDR
- Production-prototype test of the Grid Hierarchy System, with the first elements of the production Tier1 Center
- FY 2004
- 20% Production (Online and Offline) CMS Mock Data Challenge, with all Tier2 Centers and the partly completed Tier1 Center
- Build the production-quality Grid System
- FY 2005 (Q1 - Q2)
- Final Production CMS (Online and Offline) Shakedown
- Full distributed-system software and instrumentation
- Using the full capabilities of the Tier2 and Tier1 Centers
29 Summary
- The HENP/LHC data handling problem
- Multi-Petabyte scale, binary pre-filtered data, resources distributed worldwide
- Has no analog now, but will be increasingly prevalent in research, and in industry, by 2005
- Development of a robust PB-scale networked data access and analysis system is mission-critical
- An effective partnership exists, HENP-wide, through many R&D projects: RD45, GIOD, MONARC, Clipper, GLOBUS, CONDOR, ALDAP, PPDG, ...
- An aggressive R&D program is required to develop
- Resilient, self-aware systems, for data access, processing and analysis across a hierarchy of networks
- Solutions that could be widely applicable to data problems in other scientific fields and in industry, by LHC startup
- Focus on Data Grids for Next Generation Physics
30 LHC Data Models: 1994-2000
- HEP data models are complex!
- Rich hierarchy of hundreds of complex data types (classes)
- Many relations between them
- Different access patterns (Multiple Viewpoints)
- OO technology
- OO applications deal with networks of objects (and containers)
- Pointers (or references) are used to describe relations (see the toy illustration below)
- Existing solutions do not scale
- Solution suggested by RD45: an ODBMS coupled to a Mass Storage System
- Construction of Compact Datasets for Analysis: Rapid Access/Navigation/Transport
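- A toy illustration of such an object network (the classes and attributes are invented for illustration, not the actual RD45 or experiment schema):

    # Objects form a network, and analysis code navigates it by following references rather
    # than decoding flat records; different "viewpoints" traverse different relations.
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Track:
        pt: float                      # transverse momentum, GeV
        charge: int

    @dataclass
    class Vertex:
        z: float                       # position along the beam line, cm
        tracks: List[Track] = field(default_factory=list)   # references to associated tracks

    @dataclass
    class Event:
        run: int
        number: int
        vertices: List[Vertex] = field(default_factory=list)

    evt = Event(run=1, number=42,
                vertices=[Vertex(z=0.3, tracks=[Track(12.5, +1), Track(8.1, -1)])])

    # Navigation by reference: select high-pT tracks from all vertices of the event.
    hard_tracks = [t for v in evt.vertices for t in v.tracks if t.pt > 10]
    print(len(hard_tracks))            # 1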
31 Content Delivery Networks (CDN)
- Web-Based Server-Farm Networks Circa 2000: Dynamic (Grid-Like) Content Delivery Engines
- Akamai, Adero, Sandpiper
- 1200 -> Thousands of Network-Resident Servers
- 25 -> 60 ISP Networks
- 25 -> 30 Countries
- 40 Corporate Customers
- $25 B Capitalization
- Resource Discovery
- Build a Weathermap of the Server Network (State Tracking)
- Query Estimation; Matchmaking/Optimization; Request Rerouting
- Virtual IP Addressing: One Address per Server Farm
- Mirroring, Caching
- (1200) Autonomous-Agent Implementation
32 Strawman Tier 2 Evolution
-                                 2000                     2005
- Linux Farm                      1,200 SI95               20,000 SI95
- Disks on CPUs                   4 TB                     50 TB
- RAID Array                      1 TB                     30 TB
- Tape Library                    1-2 TB                   50-100 TB
- LAN Speed                       0.1 - 1 Gbps             10 - 100 Gbps
- WAN Speed                       155 - 622 Mbps           2.5 - 10 Gbps
- Collaborative Infrastructure    MPEG2 VGA (1.5-3 Mbps)   Realtime HDTV (10-20 Mbps)
- Reflects lower Tier 2 component costs due to less demanding usage. Some of the CPU will be used for simulation.
33 USCMS S&C Spending Profile
- 2006 is a model year for the operations phase of CMS
34 GriPhyN Cost
- System support       $8.0 M
- R&D                  $15.0 M
- Software             $2.0 M
- Tier 2 networking    $10.0 M
- Tier 2 hardware      $50.0 M
- Total                $85.0 M
35 Grid Hierarchy Concept: Broader Advantages
- Partitioning of users into proximate communities, for support, troubleshooting and mentoring
- Partitioning of facility tasks, to manage and focus resources
- Greater flexibility to pursue different physics interests, priorities, and resource allocation strategies by region
- Lower tiers of the hierarchy -> more local control
36 Storage Request Brokers (SRB)
- Name Transparency: access to data by attributes stored in an RDBMS (MCAT)
- Location Transparency: logical collections (by attributes) spanning multiple physical resources
- Combined location and name transparency means that datasets can be replicated across multiple caches and data archives (PPDG)
- Data Management Protocol Transparency: SRB with custom-built drivers in front of each storage system
- The user does not need to know how the data is accessed; SRB deals with the local file system managers (a toy illustration of logical-to-physical resolution follows below)
- SRBs (agents) authenticate themselves and users, using the Grid Security Infrastructure (GSI)
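- A toy illustration of the transparency idea (this is not the SRB or MCAT API; the catalog contents, driver functions and preference rule are invented):

    # A catalog maps a logical dataset name to physical replicas; per-storage-system "drivers"
    # hide how each copy is actually read, so the caller never sees the protocol or the location.
    CATALOG = {   # logical name -> replicas (storage system, physical path); contents are invented
        "run1/aod/higgs-candidates": [
            ("hpss", "/hpss/cms/run1/aod/higgs.db"),
            ("unix", "/data/cache/run1/higgs.db"),
        ],
    }

    DRIVERS = {   # one access routine per storage-system type
        "unix": lambda path: f"open local file {path}",
        "hpss": lambda path: f"stage {path} from tape, then open",
    }

    PREFERENCE = ["unix", "hpss"]     # prefer disk-resident replicas over tape

    def broker_open(logical_name):
        """Resolve a logical name through the catalog and open the preferred replica."""
        replicas = CATALOG[logical_name]
        system, path = min(replicas, key=lambda r: PREFERENCE.index(r[0]))
        return DRIVERS[system](path)

    print(broker_open("run1/aod/higgs-candidates"))   # uses the disk cache copy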
37 Role of Simulation for Distributed Systems
- Simulations are widely recognized and used as essential tools for the design, performance evaluation and optimisation of complex distributed systems
- From battlefields to agriculture; from the factory floor to telecommunications systems
- Discrete event simulations, with an appropriate and high level of abstraction (a minimal example follows below)
- Just beginning to be part of the HEP culture
- Some experience in trigger, DAQ and tightly coupled computing systems: CERN CS2 models (event-oriented)
- MONARC (process-oriented; Java 2 threads and class library)
- These simulations are very different from HEP Monte Carlos
- Time intervals and interrupts are the essentials
- Simulation is a vital part of the study of site architectures, network behavior, and data access/processing/delivery strategies, for HENP Grid Design and Optimization
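- A minimal discrete-event example, far simpler than the MONARC process-oriented models (the farm parameters are arbitrary): simulated time jumps from event to event, driven by a priority queue of timestamped actions:

    # Minimal discrete-event simulation of a processing farm.
    import heapq

    def simulate(n_jobs=20, n_cpus=4, arrival_gap=1.0, service_time=3.0):
        events = [(i * arrival_gap, "arrive", i) for i in range(n_jobs)]   # job arrivals
        heapq.heapify(events)
        free_cpus, queue, finish_times = n_cpus, [], {}
        while events:
            now, kind, job = heapq.heappop(events)
            if kind == "arrive":
                queue.append((now, job))
            else:                                   # "finish": a CPU becomes free
                free_cpus += 1
                finish_times[job] = now
            while free_cpus and queue:              # start queued jobs on free CPUs
                arrival, j = queue.pop(0)
                free_cpus -= 1
                heapq.heappush(events, (now + service_time, "finish", j))
        avg_turnaround = sum(finish_times[j] - j * arrival_gap for j in finish_times) / n_jobs
        return avg_turnaround

    print(f"average turnaround: {simulate():.1f} time units")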
38 Monitoring Architecture: Use of NetLogger in CLIPPER
- End-to-end monitoring of Grid assets is necessary to
- Resolve network throughput problems
- Dynamically schedule resources
- Add precision-timed event monitor agents to
- ATM switches
- Storage servers
- Testbed computational resources
- Produce trend-analysis modules for the monitor agents
- Make results available to applications (a minimal sketch of timestamped event logging follows below)
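- A minimal sketch of precision-timestamped event monitoring in the spirit of NetLogger (this is not the NetLogger library or its log format; the stage names and timings are invented):

    # Each stage of a transfer emits a timestamped record; differencing the timestamps
    # localizes where the time is going and feeds trend analysis.
    import time

    log = []

    def emit(event, **fields):
        """Record a monitoring event with a high-resolution timestamp."""
        log.append({"ts": time.time(), "event": event, **fields})

    def transfer(block_id, nbytes):
        emit("request.start", block=block_id)
        time.sleep(0.02)                       # pretend disk read
        emit("disk.read.done", block=block_id, bytes=nbytes)
        time.sleep(0.05)                       # pretend network send
        emit("net.send.done", block=block_id, bytes=nbytes)

    transfer("blk-0001", 8 * 1024 * 1024)

    # Trend analysis: time spent per stage, and effective network throughput for the block.
    t0, t1, t2 = (rec["ts"] for rec in log)
    print(f"disk stage: {t1 - t0:.3f} s, network stage: {t2 - t1:.3f} s")
    print(f"network throughput: {8 / (t2 - t1):.1f} MB/s (8 MB block)")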