Title: U.S. ATLAS Computing Facilities
1 U.S. ATLAS Computing Facilities
- U.S. ATLAS Physics Computing Review
- Bruce G. Gibbard, BNL
- 10-11 January 2000
2 US ATLAS Computing Facilities
- Facilities procured, installed, and operated
- to meet US MOU obligations
- Direct IT responsibility (Monte Carlo, for example)
- Support for detector construction, testing, calibration
- Support for software development and testing
- to enable effective participation by US physicists in the ATLAS physics program
- Direct access to and analysis of physics data sets
- Support for simulation, re-reconstruction, and reorganization of data associated with that analysis
3 Setting the Scale
- Uncertainties in Defining Requirements
- Five years of detector and algorithm software development
- Five years of computer technology evolution
- Start from ATLAS estimate rules of thumb
- Adjust for US ATLAS perspective (experience and priorities)
- Adjust for details of architectural model of US ATLAS facilities
4 ATLAS Estimate Rules of Thumb
- Tier 1 Center in '05 should include ...
- 30,000 SPECint95 for Analysis
- 10,000-20,000 SPECint95 for Simulation
- 50-100 TBytes/year of On-line (Disk) Storage
- 200 TBytes/year of Near-line (Robotic Tape) Storage
- 100 Mbit/sec connectivity to CERN
- Assume no major raw data processing or handling outside of CERN
5 US ATLAS Perspective
- US ATLAS facilities must be adequate to meet any reasonable U.S. ATLAS computing needs (the U.S. role in ATLAS should not be constrained by a computing shortfall; rather, the U.S. role should be enhanced by computing strength)
- Store and re-reconstruct 10-30% of events
- Take high end of simulation capacity range
- Take high end of disk capacity range
- Augment analysis capacity
- Augment CERN link bandwidth
6 Adjusted for US ATLAS Perspective
- US ATLAS Tier 1 Center in '05 should include ...
- 10,000 SPECint95 for Re-reconstruction
- 50,000 SPECint95 for Analysis
- 20,000 SPECint95 for Simulation
- 100 TBytes/year of On-line (Disk) Storage
- 300 TBytes/year of Near-line (Robotic Tape) Storage
- Dedicated OC12 (622 Mbit/sec) to CERN
7 Architectural Model
- Consists of transparent, hierarchically distributed, Grid-connected computing resources (nominal tier scaling sketched below)
- Primary ATLAS Computing Centre at CERN
- US ATLAS Tier 1 Computing Center at BNL
- National in scope, at 20% of CERN
- US ATLAS Tier 2 Computing Centers
- Six, each regional in scope, at 20% of Tier 1
- Likely one of them at CERN
- US ATLAS Institutional Computing Facilities
- Local (LAN) in scope, not project supported
- US ATLAS Individual Desktop Systems
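As a rough illustration of the 20% scaling rule above, the minimal Python sketch below works out the nominal scale of each tier relative to CERN. The fractions are only the planning rule of thumb quoted on this slide, not measured or budgeted capacities.

```python
# Nominal scale of each tier relative to the primary CERN facility,
# using the planning rule of thumb quoted above:
#   Tier 1 ~ 20% of CERN, each of six Tier 2s ~ 20% of Tier 1.
CERN = 1.0                 # reference scale
TIER1 = 0.20 * CERN        # US ATLAS Tier 1 at BNL
TIER2_EACH = 0.20 * TIER1  # one regional Tier 2 center
N_TIER2 = 6

tier2_total = N_TIER2 * TIER2_EACH
us_total = TIER1 + tier2_total   # Tier 1 plus all six Tier 2s

print(f"Tier 1               : {TIER1:.2f} of CERN scale")
print(f"Each Tier 2          : {TIER2_EACH:.2f} of CERN scale")
print(f"Six Tier 2s combined : {tier2_total:.2f} of CERN scale")
print(f"US Tier 1 + Tier 2s  : {us_total:.2f} of CERN scale")
```

Under this rule the six Tier 2 centers together (0.24 of CERN scale) are roughly comparable in aggregate to the Tier 1 center itself.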
8 Schematic of Model
9 Distributed Model
- Rationale (benefits)
- Improved user access to computing resources
- Local geographic travel
- Higher performance regional networks
- Enable local autonomy
- Less widely shared
- More locally managed resources
- Increased capacities
- Encourage integration of other equipment and expertise
- Institutional, base program
- Additional funding options
- Com Sci, NSF
10 Distributed Model
- But increased vulnerability (risk)
- Increased dependence on network
- Increased dependence on GRID infrastructure R&D
- Increased dependence on facility modeling tools
- More complex management
- Risk / benefit analysis must yield positive result
11 Adjusted for Architectural Model
- US ATLAS facilities in '05 should include ... (implied per-Tier-2 increments are sketched below)
- 10,000 SPECint95 for Re-reconstruction
- 85,000 SPECint95 for Analysis
- 35,000 SPECint95 for Simulation
- 190 TBytes/year of On-line (Disk) Storage
- 300 TBytes/year of Near-line (Robotic Tape) Storage
- Dedicated OC12 (622 Mbit/sec) Tier 1 connectivity to each Tier 2
- Dedicated OC12 (622 Mbit/sec) to CERN
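For orientation, the short sketch below back-computes the capacity increment implied for each of the six Tier 2 centers by differencing the Tier-1-only numbers (slide 6) from the full US ATLAS numbers above. The per-center values are simple arithmetic from those two slides, not figures quoted in the talk.

```python
# Capacities from slide 6 (Tier 1 only) and slide 11 (Tier 1 plus six Tier 2s).
tier1_only = {"analysis_SPECint95": 50_000, "simulation_SPECint95": 20_000,
              "disk_TBytes_per_year": 100}
us_total   = {"analysis_SPECint95": 85_000, "simulation_SPECint95": 35_000,
              "disk_TBytes_per_year": 190}
N_TIER2 = 6

# Implied increment contributed by each Tier 2 center (difference / 6).
for key in tier1_only:
    per_tier2 = (us_total[key] - tier1_only[key]) / N_TIER2
    print(f"{key:22s}: ~{per_tier2:,.0f} per Tier 2 center")
```

Re-reconstruction capacity (10,000 SPECint95) and near-line tape (300 TBytes/year) are unchanged between the two slides, consistent with those functions remaining at Tier 1.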
12 GRID Infrastructure
- GRID infrastructure software must supply
- Efficiency (optimizing hardware use)
- Transparency (optimizing user effectiveness)
- Projects
- PPDG: distributed data services (later talk by D. Malon)
- APOGEE: complete GRID infrastructure including distributed resource management, modeling, instrumentation, etc.
- GriPhyN: staged development toward delivery of a production system
- The alternative to success with these projects is a difficult-to-use and/or inefficient overall system
- U.S. ATLAS involvement includes ANL and LBNL
13 Facility Modeling
- Performance of a complex distributed system is difficult but necessary to predict
- MONARC: LHC-centered project
- Provide toolset for modeling such systems
- Develop guidelines for designing such systems
- Currently capable of relevant analyses
- U.S. ATLAS involvement
- Later talk by K. Sliwa
14 Components of Model: Tier 1
- Full Function Facility
- Dedicated Connectivity to CERN
- Primary Site for Storage/Serving
- Cache/Replicate CERN data needed by US ATLAS
- Archive and serve via WAN all data of interest to US ATLAS
- Computation
- Primary Site for Re-reconstruction (perhaps only site)
- Major Site for Simulation & Analysis (2 x Tier 2)
- Repository of Technical Expertise and Support
- Hardware, OSs, utilities, and other standard elements of U.S. ATLAS
- Network, AFS, GRID, and other infrastructure elements of WAN model
15 Components of Model: Tier 2
- Limit personnel and maintenance support costs
- Focused Function Facility
- Excellent connectivity to Tier 1 (Network & GRID)
- Tertiary storage via network at Tier 1 (none local)
- Primary analysis site for its region
- Major simulation capabilities
- Major online storage cache for its region
- Leverage local expertise and other resources
- Part of site selection criteria; 1 FTE contributed, for example
16 Technology Trends & Choices
- CPU
- Range: commodity processors -> SMP servers
- Factor 2 decrease in price/performance in 1.5 years
- Disk
- Range: commodity disk -> RAID disk
- Factor 2 decrease in price/performance in 1.5 years
- Tape Storage
- Range: desktop storage -> high-end storage
- Factor 2 decrease in price/performance in 1.5 - 2 years
17 Price/Performance Evolution
[Chart: price/performance evolution, as of Dec. 1996; from Harvey Newman's presentation, Third LCB Workshop, Marseilles, Sept. 1999]
18 Technology Trends & Choices
- For Costing Purposes
- Start with familiar, established technologies
- Project by observed exponential slopes (as sketched below)
- This is a Conservative Approach
- There are no known near-term show stoppers for these established technologies
- A new technology would have to be more cost effective to supplant the projection of an established technology
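As a rough illustration of projecting by the observed exponential slopes, the sketch below scales a unit FY2000 price/performance figure forward using the halving times quoted on the earlier trends slide (1.5 years for CPU and disk, taken here at the conservative 2-year end for tape). The baseline cost of 1.0 is a placeholder, not an actual project cost input.

```python
# Project price/performance forward by the observed exponential slopes:
# cost per unit capacity halves every `halving_years`.
def projected_cost(cost_fy2000: float, year: int, halving_years: float) -> float:
    """Cost per unit capacity in `year`, relative to an FY2000 baseline."""
    return cost_fy2000 * 2.0 ** (-(year - 2000) / halving_years)

# Halving times quoted on the technology-trends slide.
HALVING = {"CPU": 1.5, "Disk": 1.5, "Tape": 2.0}  # tape at the conservative end

for tech, t_half in HALVING.items():
    factor_2005 = projected_cost(1.0, 2005, t_half)
    print(f"{tech}: FY2005 cost is ~{factor_2005:.2f} x FY2000 cost "
          f"(~{1/factor_2005:.0f}x more capacity per dollar)")
```

Under these slopes, a dollar in 2005 buys roughly ten times the CPU or disk capacity, and about six times the tape capacity, that it buys in FY2000.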
19 Technology Trends & Choices
- CPU Intensive processing
- Farms of commodity processors - Intel/Linux
- I/O Intensive Processing and Serving
- Mid-scale SMPs (SUN, IBM, etc.)
- Online Storage (Disk)
- Fibre Channel Connected RAID
- Nearline Storage (Robotic Tape System)
- STK / 9840 / HPSS
- LAN
- Gigabit Ethernet
20 Composition of Tier 1
- Commodity processor farms (Intel/Linux)
- Mid-scale SMP servers (SUN)
- Fibre Channel connected RAID disk
- Robotic tape / HSM system (STK / HPSS)
21 Current Tier 1 Status
- The U.S. ATLAS Tier 1 facility is currently operating as a small (5%) adjunct to the RHIC Computing Facility (RCF)
- Deployment includes
- Intel/Linux farms (28 CPUs)
- Sun E450 server (2 CPUs)
- 200 GBytes of Fibre Channel RAID Disk
- Intel/Linux web server
- Archiving via low priority HPSS Class of Service
- Shared use of an AFS server (10 GBytes)
22 Current Tier 1 Status
- These RCF-chosen platforms/technologies are common to ATLAS
- Allows a wide range of services with only 1 FTE of sys admin contributed (plus US ATLAS librarian)
- Significant divergence of direction between US ATLAS and RHIC has been allowed for
- Complete divergence, extremely unlikely, would exceed current staffing estimates
23 (No transcript)
24 RAID Disk Subsystem
25 Intel/Linux Processor Farm
26 Intel/Linux Nodes
27 Composition of Tier 2 (Initial One)
- Commodity processor farms (Intel/Linux)
- Mid-scale SMP servers
- Fibre Channel connected RAID disk
28 Staff Estimate (In Pseudo Detail)
29 Time Evolution of Facilities
- Tier 1 functioning as early prototype
- Ramp up to meet needs and validate design
- Assume 2 years for a Tier 2 to become fully established
- Initiate first Tier 2 in 2001
- True Tier 2 prototype
- Demonstrate Tier 1 - Tier 2 interaction
- Second Tier 2 initiated in 2002 (CERN?)
- Four remaining initiated in 2003
- Fully operational by 2005
- Six are to be identical (CERN exception?)
30 Staff Evolution
31 Network
- Tier 1 connectivity to CERN and to Tier 2s is critical
- Must be guaranteed and allocable (dedicated and differentiated)
- Must be adequate (triage of functions is disruptive)
- Should grow with need; OC12 should be practical by 2005, when serious data will flow (see the bandwidth sketch below)
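For scale, the minimal sketch below converts a dedicated OC12 link (622 Mbit/sec, as quoted for the CERN and Tier 2 connections) into rough data-movement capability and compares it with the 300 TBytes/year near-line storage figure from the requirements slides. The 50% sustained-utilization fraction is an illustrative assumption, not a number from the talk.

```python
# Rough data-movement capability of a dedicated OC12 link (622 Mbit/s),
# compared with the ~300 TBytes/year near-line storage growth quoted earlier.
OC12_MBIT_S = 622.0
UTILIZATION = 0.5          # illustrative assumption: ~50% sustained usable throughput

bytes_per_s = OC12_MBIT_S * 1e6 / 8 * UTILIZATION
tb_per_day  = bytes_per_s * 86_400 / 1e12
tb_per_year = tb_per_day * 365

print(f"Sustained throughput : {bytes_per_s/1e6:.0f} MBytes/s")
print(f"Data moved per day   : {tb_per_day:.1f} TBytes")
print(f"Data moved per year  : {tb_per_year:.0f} TBytes "
      f"(vs ~300 TBytes/year near-line storage growth)")
```

Even at half utilization, a single OC12 could in principle move roughly four times the projected annual near-line data volume, which is consistent with the claim that OC12 should be practical by 2005.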
32 WAN Configurations and Cost (FY 2000 k$)
33 Annual Equipment Costs for Tier 1 Center (FY 2000 k$)
34 Annual Equipment Costs for Tier 2 Center (FY 2000 k$)
35 Integrated Facility Capacities by Year
36 US ATLAS Facilities Annual Costs (FY 2000 k$)
37 Major Milestones