Title: TRIUMF Site Report, Corrie Kost
1. TRIUMF Site Report
Corrie Kost
Update since HEPiX Fall 2005
2. Devolving Server Functions
NEW: Windows print server cluster - 2 Dell PowerEdge SC1425 machines sharing an external SCSI disk holding printer data
NEW: 2 Dell PowerEdge SC1425 machines acting as primary and secondary Windows domain controllers
OLD: Windows Print Server and Windows Domain Controller
3. Update since HEPiX Fall 2005
4. Update since HEPiX Fall 2005
Waiting for 10 Gb/s DWDM XFP
5. Update since HEPiX Fall 2005
40 km DWDM link, 64 wavelengths/fibre; CH34 = 193.4 THz (1550.116 nm); ~US$10k each
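As a cross-check of these DWDM numbers, a minimal sketch assuming the standard ITU-T 100 GHz C-band grid (channel n at 190.0 THz + n x 0.1 THz) reproduces the quoted frequency and wavelength:

```python
# Cross-check of the quoted DWDM channel: CH34 -> 193.4 THz -> ~1550.116 nm.
C = 299_792_458.0  # speed of light in vacuum, m/s

def itu_channel_to_freq_thz(channel: int) -> float:
    """100 GHz C-band grid: f(THz) = 190.0 + 0.1 * channel."""
    return 190.0 + 0.1 * channel

def freq_thz_to_wavelength_nm(f_thz: float) -> float:
    """Vacuum wavelength in nm for a frequency given in THz."""
    return C / (f_thz * 1e12) * 1e9

f = itu_channel_to_freq_thz(34)
print(f"CH34: {f:.1f} THz, {freq_thz_to_wavelength_nm(f):.3f} nm")  # 193.4 THz, ~1550.116 nm
```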
6. Update since HEPiX Fall 2005
7. Update since HEPiX Fall 2005
8. Servers / Data Centre
[Data-centre diagram; servers shown: GPS TIME, TRPRINT, CMMS, TGATE, WINPRINT1/2, DOCUMENTS, CONDORG, TNT2K3, TRWEB, TRWINDATA, TRSHARE, TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV, LCG STORAGE, IBM CLUSTER, LCG WORKER NODES, IBM 2TB STORAGE, RH-FC-SL MIRROR, KOPIODOC; other labels: 1GB, 1GB, TEMP TRSHARE]
9. Update since HEPiX Fall 2005
[Data-centre diagram; servers shown: GPS TIME, TRPRINT, CMMS, TGATE, WINPRINT1, DOCUMENTS, WINPRINT2, TNT2K3, CONDORG, TRWEB, TRWINDATA, TRSHARE, TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV]
10. TRIUMF-CERN ATLAS
- ATLAS Tier-1 prototype (Service Challenge)
- Lightpath - International Grid Testbed (CANet IGT)
- Equipment:
- Amanda Backup
- Worker nodes (evaluation units): blades, dual/dual 64-bit 3 GHz Xeons, 4 GB RAM, 80 GB SATA
- VOBOX: 2 GB RAM, 3 GHz 64-bit Xeon, 2x160 GB SATA
- LFC: 2 GB RAM, 3 GHz 64-bit Xeon, 2x160 GB SATA
- FTS: 2 GB RAM, 3 GHz 64-bit Xeon, 3x73 GB SCSI
- SRM head node: 2 GB RAM, 64-bit Opteron, 2x232 GB RAID1
- sc1-sc3 dCache Storage Elements: 2 GB RAM, 3 GHz 64-bit Xeon, 8x232 GB RAID5
- 2 SDLT 160 GB drives / 26 cartridges; 2 SDLT 300 GB drives / 26 cartridges
11. ATLAS/CERN → TRIUMF
12. Tier0 → Tier1 Tests, Apr 3-30
- Any MB/s rate below 90% of nominal needs explanation and compensation in the days following (rough illustration below)
- Maintain rates unattended over the Easter weekend (April 14-16)
- Tape tests April 18-24
- Experiment-driven transfers April 25-30
- The nominal rate for PIC is 100 MB/s, but it will be limited by the WAN until November 2006
https://twiki.cern.ch/twiki/bin/view/LCG/LCGServiceChallenges
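A minimal sketch of the "90% of nominal" bookkeeping described above (hypothetical daily rates; not an official SC4 tool):

```python
# Flag days below 90% of the nominal rate and estimate the backlog (in TB)
# that would have to be made up in the following days. Daily rates are made up.
NOMINAL_MB_S = 100.0            # e.g. the PIC nominal rate quoted above
SECONDS_PER_DAY = 86_400

daily_avg_mb_s = [98.0, 85.0, 102.0, 70.0]   # hypothetical daily averages

threshold = 0.9 * NOMINAL_MB_S
backlog_tb = 0.0
for day, rate in enumerate(daily_avg_mb_s, start=1):
    if rate < threshold:
        shortfall_tb = (NOMINAL_MB_S - rate) * SECONDS_PER_DAY / 1e6
        backlog_tb += shortfall_tb
        print(f"Day {day}: {rate} MB/s < 90% of nominal, shortfall ~{shortfall_tb:.1f} TB")
print(f"Total backlog to compensate: ~{backlog_tb:.1f} TB")
```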
13. Update from HEPiX Fall 2005
- ATLAS SC4 plans extracted from Mumbai Workshop, 17 Feb 2006 (1)
- March-April (pre-SC4)
  - 3-4 weeks for internal Tier-0 tests (Phase 0)
- April-May (pre-SC4)
  - Tests of distributed operations on a small testbed (the pre-production system)
- Last 3 weeks of June
  - Tier-0 test (Phase 1) with data distribution to Tier-1s (720 MB/s full ESD to BNL; rough volume estimate below)
- 3 weeks in July
  - Distributed processing tests (Part 1)
- 2 weeks in July-August
  - Distributed analysis tests (Part 1)
- 3-4 weeks in September-October
  - Tier-0 test (Phase 2) with data to Tier-2s
- 3 weeks in October
  - Distributed processing tests (Part 2)
- 3-4 weeks in November
  - Distributed analysis tests (Part 2)
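To put the Phase 1 rate in context, a back-of-the-envelope estimate (my own arithmetic, not a figure from the slides) of the volume implied by 720 MB/s sustained over the ~3-week June window:

```python
# Rough volume implied by 720 MB/s of full ESD to BNL over ~3 weeks
# (decimal units; assumes a continuous transfer, which is optimistic).
rate_mb_s = 720.0
weeks = 3
total_pb = rate_mb_s * 1e6 * 86_400 * 7 * weeks / 1e15
print(f"~{total_pb:.1f} PB over {weeks} weeks")   # ~1.3 PB
```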
14-15. Repeated reads on the same set of (typically 16) files at 600 MB/s over 150 days: ~7 PB read (13 PB in total since the start, to March 30; no reboot for 134 days)
[Throughput plots shown on slides 14 and 15]
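A quick consistency check of these numbers (assuming decimal units and a continuous 600 MB/s read rate):

```python
# 600 MB/s sustained for 150 days is of order the quoted ~7 PB;
# a duty cycle a little under 100% brings the ideal figure down to ~7 PB.
rate_mb_s = 600.0
days = 150
total_pb = rate_mb_s * 1e6 * 86_400 * days / 1e15
print(f"~{total_pb:.1f} PB")   # ~7.8 PB at a perfect duty cycle
```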
16. Keeping it Cool
- Central Computing Room isolation fixed
- Combined two 11-ton air conditioners to even out the load
- Adding a heating coil to improve stability
- Blades for ATLAS! 30% less heat, 20% less TCO
- 100 W/sq-ft → 200 W/sq-ft → 400 W/sq-ft means cooling costs are a significant cost factor (see the rough estimate after this list)
- Note: electrical/cooling costs estimated at CAN$150k/yr
- Water-cooled systems for (multicore/multi-CPU) blade systems?
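As a rough illustration of why the power-density trend matters, the sketch below estimates the annual electricity bill for the IT load alone; the area and tariff are assumptions chosen for illustration, not TRIUMF figures:

```python
# Illustrative annual electricity cost of the IT load, before cooling overhead.
# All inputs are assumptions, not TRIUMF figures.
W_PER_SQFT = 200          # assumed power density
AREA_SQFT = 1_000         # assumed machine-room area
CAD_PER_KWH = 0.08        # assumed electricity rate
HOURS_PER_YEAR = 8_760

it_kw = W_PER_SQFT * AREA_SQFT / 1_000
annual_cad = it_kw * HOURS_PER_YEAR * CAD_PER_KWH
print(f"{it_kw:.0f} kW IT load -> ~CAN${annual_cad:,.0f}/yr before cooling overhead")
# Doubling the density to 400 W/sq-ft doubles this, which is why cooling and
# power become a significant fraction of total cost.
```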
17. Keeping it Cool (2)
- HP offers the Modular Cooling System (MCS)
- Used when a rack exceeds 10-15 kW
- ~US$30k
- Chilled (5-10°C) water
- Max load 30 kW/rack (17 GPM / 65 LPM @ 5°C water @ 20°C air); see the heat-balance check below
- Water cannot reach the servers
- Door open → cold air out the front, hot out the back
- Significantly less noise with the doors closed
- HWD 1999 x 909 x 1295 mm (79 x 36 x 51 in), 513 kg / 1130 lbs (empty)
- Not certified for seismic or Zone 4
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00613691/c00613691.pdf
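As a sanity check on the quoted MCS figures, a simple heat balance (Q = m_dot * c_p * dT; my own estimate, not from the HP manual) shows the quoted water flow can indeed carry 30 kW with a modest temperature rise:

```python
# Heat balance: how much the chilled water warms when absorbing a 30 kW rack load.
Q_W = 30_000            # 30 kW rack load
FLOW_LPM = 65           # 65 L/min chilled water (~17 GPM)
C_P = 4186              # specific heat of water, J/(kg*K)
RHO = 1.0               # ~1 kg per litre

m_dot = FLOW_LPM * RHO / 60           # kg/s (~1.08 kg/s)
dT = Q_W / (m_dot * C_P)              # temperature rise of the water
print(f"water warms by ~{dT:.1f} K")  # ~6.6 K, i.e. 5 C in, ~12 C out
```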
18. Amanda Backup at TRIUMF
Details by Steve McDonald, Thursday 4:30 pm
19. End of Presentation
- Extra slides on SC4 plans for reference
20. Service Challenge Servers Details
- fts: FTS Server (FTS = File Transfer Service; homepage: http://egee-jra1-dm.web.cern.ch/egee%2Djra1%2Ddm/FTS/). Oracle database used. 64-bit Intel Xeon 3 GHz, 73 GB SCSI disks (3), 2 GB RAM. IBM 4560-SLX tape library attached (will have 2 SDLT-II drives attached when they arrive, probably next week; SDLT-II does 300 GB native, 600 GB compressed). Running SL 3.0.5, 64-bit.
- lfc: LFC Server (LFC = LCG File Catalog; info page: https://uimon.cern.ch/twiki/bin/view/LCG/LfcAdminGuide). MySQL database used. 64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1, 2 GB RAM. Running SL 3.0.5, 64-bit.
- vobox: VO Box (Virtual Organization Box; info page: http://agenda.nikhef.nl/askArchive.php?base=agenda&categ=a0613&id=a0613s3t1/transparencies). 64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1, 2 GB RAM. Running SL 3.0.5, 64-bit.
- sc1-sc3: dCache Storage Elements. 64-bit Intel Xeons 3 GHz, 3ware RAID controller, 8x 232 GB disks in H/W RAID-5 giving 1.8 TB storage, 2 GB RAM. Running SL 4.1, 64-bit. (Capacity cross-check below.)
- sc4: SRM endpoint, dCache admin node and Storage Element. 64-bit Opteron 246, 3ware RAID controller, 2x 232 GB disks in H/W RAID-1 giving 250 GB storage, 2 GB RAM. Running SL 4.1, 64-bit. IBM 4560-SLX tape library attached; we are moving both SDLT-I drives to this unit (SDLT-I does 160 GB native, 300 GB compressed).
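For reference, a rough cross-check of the RAID capacities quoted above (assuming n-1 usable disks for RAID-5, mirroring for RAID-1, and that the "232 GB" disks are ~250 GB nominal drives reported in GiB):

```python
# Rough usable-capacity check for the sc1-sc3 and sc4 RAID sets.
GIB = 2**30

def raid5_usable(n_disks: int, disk_bytes: float) -> float:
    return (n_disks - 1) * disk_bytes      # one disk's worth goes to parity

def raid1_usable(n_disks: int, disk_bytes: float) -> float:
    return disk_bytes                      # mirrored: one disk's capacity

disk = 232 * GIB                           # ~250 GB in decimal units
print(f"sc1-sc3: {raid5_usable(8, disk)/1e12:.2f} TB usable")   # ~1.74 TB (~1.8 TB quoted)
print(f"sc4:     {raid1_usable(2, disk)/1e9:.0f} GB usable")    # ~249 GB (~250 GB quoted)
```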
21. ATLAS SC4 Tests
- Complete Tier-0 test
  - Internal data transfer from Event Filter farm to Castor disk pool, Castor tape, CPU farm
  - Calibration loop and handling of conditions data
    - Including distribution of conditions data to Tier-1s (and Tier-2s)
  - Transfer of RAW, ESD, AOD and TAG data to Tier-1s
  - Transfer of AOD and TAG data to Tier-2s
  - Data and dataset registration in DB (add meta-data information to meta-data DB)
- Distributed production
  - Full simulation chain run at Tier-2s (and Tier-1s)
  - Data distribution to Tier-1s, other Tier-2s and CAF
  - Reprocessing raw data at Tier-1s
    - Data distribution to other Tier-1s, Tier-2s and CAF
- Distributed analysis
  - Random job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
  - Tests of performance of job submission, distribution and output retrieval
22. ATLAS SC4 Plans (1)
- Tier-0 data flow tests
- Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
  - Explore limitations of current setup
  - Run real algorithmic code
  - Establish infrastructure for calib/align loop and conditions DB access
  - Study models for event streaming and file merging
  - Get input from SFO simulator placed at Point 1 (ATLAS pit)
  - Implement system monitoring infrastructure
- Phase 1: last 3 weeks of June, with data distribution to Tier-1s
  - Run integrated data flow tests using the SC4 infrastructure for data distribution
  - Send AODs to (at least) a few Tier-2s
  - Automatic operation for O(1 week)
  - First version of shifters' interface tools
  - Treatment of error conditions
- Phase 2: 3-4 weeks in September-October
  - Extend data distribution to all (most) Tier-2s
  - Use 3D tools to distribute calibration data
  - The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006
    - but is not incompatible with other distributed operations
  - No external data transfer during this phase(?)
23. ATLAS SC4 Plans (2)
- ATLAS CSC includes continuous distributed simulation productions
  - We will continue running distributed simulation productions all the time
  - Using all Grid computing resources we have available for ATLAS
  - The aim is to produce 2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
  - We can currently manage 1M events/week, ramping up gradually
- SC4 distributed reprocessing tests
  - Test of the computing model using the SC4 data management infrastructure
  - Needs file transfer capabilities between Tier-1s and back to the CERN CAF
  - Also distribution of conditions data to Tier-1s (3D)
  - Storage management is also an issue
  - Could use 3 weeks in July and 3 weeks in October
- SC4 distributed simulation intensive tests
  - Once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions
  - As they would use the same setup both from our ProdSys and the SC4 side
  - First separately, then concurrently
24. ATLAS SC4 Plans (3)
- Distributed analysis tests
  - Random job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
  - Generate groups of jobs and simulate analysis job submission by users at home sites
    - Direct jobs needing only AODs as input to Tier-2s
    - Direct jobs needing ESDs or RAW as input to Tier-1s
    - Make preferential use of ESD and RAW samples available on disk at Tier-2s
  - Tests of performance of job submission, distribution and output retrieval
  - Test job priority and site policy schemes for many user groups and roles
  - Distributed data and dataset discovery and access through metadata, tags, data catalogues
  - Need same SC4 infrastructure as needed by distributed productions
  - Storage of job outputs for private or group-level analysis may be an issue
- Tests can be run during Q3-4 2006
  - First a couple of weeks in July-August (after distributed production tests)
  - Then another longer period of 3-4 weeks in November