Title: TRIUMF Site Report, Corrie Kost
1. TRIUMF Site Report
Corrie Kost
Update since HEPiX Fall 2005
2. Devolving Server Functions
NEW: Windows print server cluster - 2 Dell PowerEdge SC1425 machines sharing an external SCSI disk holding printer data
NEW: 2 Dell PowerEdge SC1425 machines acting as primary and secondary Windows domain controllers
OLD: Windows Print Server and Windows Domain Controller
3. Update since HEPiX Fall 2005
4. Update since HEPiX Fall 2005
Waiting for 10 Gb/s DWDM XFP
5. Update since HEPiX Fall 2005
40 km DWDM link, 64 wavelengths/fibre; CH34 = 193.4 THz (1550.116 nm); ~US$10k each
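As a cross-check of these DWDM numbers, a minimal sketch assuming the standard ITU-T 100 GHz C-band grid (channel n at 190.0 THz + n x 0.1 THz) reproduces the quoted frequency and wavelength:

```python
# Cross-check of the quoted DWDM channel: CH34 -> 193.4 THz -> ~1550.116 nm.
C = 299_792_458.0  # speed of light in vacuum, m/s

def itu_channel_to_freq_thz(channel: int) -> float:
    """100 GHz C-band grid: f(THz) = 190.0 + 0.1 * channel."""
    return 190.0 + 0.1 * channel

def freq_thz_to_wavelength_nm(f_thz: float) -> float:
    """Vacuum wavelength in nm for a frequency given in THz."""
    return C / (f_thz * 1e12) * 1e9

f = itu_channel_to_freq_thz(34)
print(f"CH34: {f:.1f} THz, {freq_thz_to_wavelength_nm(f):.3f} nm")  # 193.4 THz, ~1550.116 nm
```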
6. Update since HEPiX Fall 2005
7. Update since HEPiX Fall 2005
8. Servers / Data Centre
[Data-centre diagram; servers shown: GPS TIME, TRPRINT, CMMS, TGATE, WINPRINT1/2, DOCUMENTS, CONDORG, TNT2K3, TRWEB, TRWINDATA, TRSHARE, TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV, LCG STORAGE, IBM CLUSTER, LCG WORKER NODES, IBM 2TB STORAGE, RH-FC-SL MIRROR, KOPIODOC; other labels: 1GB, 1GB, TEMP TRSHARE]
9. Update since HEPiX Fall 2005
[Data-centre diagram; servers shown: GPS TIME, TRPRINT, CMMS, TGATE, WINPRINT1, DOCUMENTS, WINPRINT2, TNT2K3, CONDORG, TRWEB, TRWINDATA, TRSHARE, TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV]
10. TRIUMF-CERN ATLAS
- ATLAS Tier-1 prototype (Service Challenge)
- Lightpath - International Grid Testbed (CANet IGT)
- Equipment:
- Amanda Backup
- Worker nodes (evaluation units): blades, dual/dual 64-bit 3 GHz Xeons, 4 GB RAM, 80 GB SATA
- VOBOX: 2 GB RAM, 3 GHz 64-bit Xeon, 2x160 GB SATA
- LFC: 2 GB RAM, 3 GHz 64-bit Xeon, 2x160 GB SATA
- FTS: 2 GB RAM, 3 GHz 64-bit Xeon, 3x73 GB SCSI
- SRM head node: 2 GB RAM, 64-bit Opteron, 2x232 GB RAID1
- sc1-sc3 dCache Storage Elements: 2 GB RAM, 3 GHz 64-bit Xeon, 8x232 GB RAID5
- 2 SDLT 160 GB drives / 26 cartridges; 2 SDLT 300 GB drives / 26 cartridges
11. ATLAS/CERN → TRIUMF
12. Tier0 → Tier1 Tests, Apr 3-30
- Any MB/s rate below 90% of nominal needs explanation and compensation in the days following (rough illustration below)
- Maintain rates unattended over the Easter weekend (April 14-16)
- Tape tests April 18-24
- Experiment-driven transfers April 25-30
- The nominal rate for PIC is 100 MB/s, but it will be limited by the WAN until November 2006
https://twiki.cern.ch/twiki/bin/view/LCG/LCGServiceChallenges
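A minimal sketch of the "90% of nominal" bookkeeping described above (hypothetical daily rates; not an official SC4 tool):

```python
# Flag days below 90% of the nominal rate and estimate the backlog (in TB)
# that would have to be made up in the following days. Daily rates are made up.
NOMINAL_MB_S = 100.0            # e.g. the PIC nominal rate quoted above
SECONDS_PER_DAY = 86_400

daily_avg_mb_s = [98.0, 85.0, 102.0, 70.0]   # hypothetical daily averages

threshold = 0.9 * NOMINAL_MB_S
backlog_tb = 0.0
for day, rate in enumerate(daily_avg_mb_s, start=1):
    if rate < threshold:
        shortfall_tb = (NOMINAL_MB_S - rate) * SECONDS_PER_DAY / 1e6
        backlog_tb += shortfall_tb
        print(f"Day {day}: {rate} MB/s < 90% of nominal, shortfall ~{shortfall_tb:.1f} TB")
print(f"Total backlog to compensate: ~{backlog_tb:.1f} TB")
```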
13. Update from HEPiX Fall 2005
- ATLAS SC4 plans extracted from Mumbai Workshop, 17 Feb 2006 (1)
- March-April (pre-SC4)
  - 3-4 weeks for internal Tier-0 tests (Phase 0)
- April-May (pre-SC4)
  - Tests of distributed operations on a small testbed (the pre-production system)
- Last 3 weeks of June
  - Tier-0 test (Phase 1) with data distribution to Tier-1s (720 MB/s full ESD to BNL; rough volume estimate below)
- 3 weeks in July
  - Distributed processing tests (Part 1)
- 2 weeks in July-August
  - Distributed analysis tests (Part 1)
- 3-4 weeks in September-October
  - Tier-0 test (Phase 2) with data to Tier-2s
- 3 weeks in October
  - Distributed processing tests (Part 2)
- 3-4 weeks in November
  - Distributed analysis tests (Part 2)
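To put the Phase 1 rate in context, a back-of-the-envelope estimate (my own arithmetic, not a figure from the slides) of the volume implied by 720 MB/s sustained over the ~3-week June window:

```python
# Rough volume implied by 720 MB/s of full ESD to BNL over ~3 weeks
# (decimal units; assumes a continuous transfer, which is optimistic).
rate_mb_s = 720.0
weeks = 3
total_pb = rate_mb_s * 1e6 * 86_400 * 7 * weeks / 1e15
print(f"~{total_pb:.1f} PB over {weeks} weeks")   # ~1.3 PB
```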
14-15. Repeated reads on the same set of (typically 16) files at 600 MB/s over 150 days: ~7 PB read (13 PB in total since the start, to March 30; no reboot for 134 days)
[Throughput plots shown on slides 14 and 15]
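A quick consistency check of these numbers (assuming decimal units and a continuous 600 MB/s read rate):

```python
# 600 MB/s sustained for 150 days is of order the quoted ~7 PB;
# a duty cycle a little under 100% brings the ideal figure down to ~7 PB.
rate_mb_s = 600.0
days = 150
total_pb = rate_mb_s * 1e6 * 86_400 * days / 1e15
print(f"~{total_pb:.1f} PB")   # ~7.8 PB at a perfect duty cycle
```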
16. Keeping it Cool
- Central Computing Room isolation fixed
- Combined two 11-ton air conditioners to even out the load
- Adding a heating coil to improve stability
- Blades for ATLAS! 30% less heat, 20% less TCO
- 100 W/sq-ft → 200 W/sq-ft → 400 W/sq-ft means cooling costs are a significant cost factor (see the rough estimate after this list)
- Note: electrical/cooling costs estimated at CAN$150k/yr
- Water-cooled systems for (multicore/multi-CPU) blade systems?
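As a rough illustration of why the power-density trend matters, the sketch below estimates the annual electricity bill for the IT load alone; the area and tariff are assumptions chosen for illustration, not TRIUMF figures:

```python
# Illustrative annual electricity cost of the IT load, before cooling overhead.
# All inputs are assumptions, not TRIUMF figures.
W_PER_SQFT = 200          # assumed power density
AREA_SQFT = 1_000         # assumed machine-room area
CAD_PER_KWH = 0.08        # assumed electricity rate
HOURS_PER_YEAR = 8_760

it_kw = W_PER_SQFT * AREA_SQFT / 1_000
annual_cad = it_kw * HOURS_PER_YEAR * CAD_PER_KWH
print(f"{it_kw:.0f} kW IT load -> ~CAN${annual_cad:,.0f}/yr before cooling overhead")
# Doubling the density to 400 W/sq-ft doubles this, which is why cooling and
# power become a significant fraction of total cost.
```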
17. Keeping it Cool (2)
- HP offers the Modular Cooling System (MCS)
- Used when a rack exceeds 10-15 kW
- ~US$30k
- Chilled (5-10°C) water
- Max load 30 kW/rack (17 GPM / 65 LPM @ 5°C water @ 20°C air); see the heat-balance check below
- Water cannot reach the servers
- Door open → cold air out the front, hot out the back
- Significantly less noise with the doors closed
- HWD 1999 x 909 x 1295 mm (79 x 36 x 51 in), 513 kg / 1130 lbs (empty)
- Not certified for seismic or Zone 4
- http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00613691/c00613691.pdf
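As a sanity check on the quoted MCS figures, a simple heat balance (Q = m_dot * c_p * dT; my own estimate, not from the HP manual) shows the quoted water flow can indeed carry 30 kW with a modest temperature rise:

```python
# Heat balance: how much the chilled water warms when absorbing a 30 kW rack load.
Q_W = 30_000            # 30 kW rack load
FLOW_LPM = 65           # 65 L/min chilled water (~17 GPM)
C_P = 4186              # specific heat of water, J/(kg*K)
RHO = 1.0               # ~1 kg per litre

m_dot = FLOW_LPM * RHO / 60           # kg/s (~1.08 kg/s)
dT = Q_W / (m_dot * C_P)              # temperature rise of the water
print(f"water warms by ~{dT:.1f} K")  # ~6.6 K, i.e. 5 C in, ~12 C out
```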
18. Amanda Backup at TRIUMF
Details by Steve McDonald, Thursday 4:30 pm
19. End of Presentation
- Extra slides on SC4 plans for reference
20. Service Challenge Servers Details
- fts: FTS Server (FTS = File Transfer Service; homepage: http://egee-jra1-dm.web.cern.ch/egee%2Djra1%2Ddm/FTS/). Oracle database used. 64-bit Intel Xeon 3 GHz, 73 GB SCSI disks (3), 2 GB RAM. IBM 4560-SLX tape library attached (will have 2 SDLT-II drives attached when they arrive, probably next week; SDLT-II does 300 GB native, 600 GB compressed). Running SL 3.0.5, 64-bit.
- lfc: LFC Server (LFC = LCG File Catalog; info page: https://uimon.cern.ch/twiki/bin/view/LCG/LfcAdminGuide). MySQL database used. 64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1, 2 GB RAM. Running SL 3.0.5, 64-bit.
- vobox: VO Box (Virtual Organization Box; info page: http://agenda.nikhef.nl/askArchive.php?base=agenda&categ=a0613&id=a0613s3t1/transparencies). 64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1, 2 GB RAM. Running SL 3.0.5, 64-bit.
- sc1-sc3: dCache Storage Elements. 64-bit Intel Xeons 3 GHz, 3ware RAID controller, 8x 232 GB disks in H/W RAID-5 giving 1.8 TB storage, 2 GB RAM. Running SL 4.1, 64-bit. (Capacity cross-check below.)
- sc4: SRM endpoint, dCache admin node and Storage Element. 64-bit Opteron 246, 3ware RAID controller, 2x 232 GB disks in H/W RAID-1 giving 250 GB storage, 2 GB RAM. Running SL 4.1, 64-bit. IBM 4560-SLX tape library attached; we are moving both SDLT-I drives to this unit (SDLT-I does 160 GB native, 300 GB compressed).
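For reference, a rough cross-check of the RAID capacities quoted above (assuming n-1 usable disks for RAID-5, mirroring for RAID-1, and that the "232 GB" disks are ~250 GB nominal drives reported in GiB):

```python
# Rough usable-capacity check for the sc1-sc3 and sc4 RAID sets.
GIB = 2**30

def raid5_usable(n_disks: int, disk_bytes: float) -> float:
    return (n_disks - 1) * disk_bytes      # one disk's worth goes to parity

def raid1_usable(n_disks: int, disk_bytes: float) -> float:
    return disk_bytes                      # mirrored: one disk's capacity

disk = 232 * GIB                           # ~250 GB in decimal units
print(f"sc1-sc3: {raid5_usable(8, disk)/1e12:.2f} TB usable")   # ~1.74 TB (~1.8 TB quoted)
print(f"sc4:     {raid1_usable(2, disk)/1e9:.0f} GB usable")    # ~249 GB (~250 GB quoted)
```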
21. ATLAS SC4 Tests
- Complete Tier-0 test
  - Internal data transfer from Event Filter farm to Castor disk pool, Castor tape, CPU farm
  - Calibration loop and handling of conditions data
    - Including distribution of conditions data to Tier-1s (and Tier-2s)
  - Transfer of RAW, ESD, AOD and TAG data to Tier-1s
  - Transfer of AOD and TAG data to Tier-2s
  - Data and dataset registration in DB (add meta-data information to meta-data DB)
- Distributed production
  - Full simulation chain run at Tier-2s (and Tier-1s)
  - Data distribution to Tier-1s, other Tier-2s and CAF
  - Reprocessing raw data at Tier-1s
    - Data distribution to other Tier-1s, Tier-2s and CAF
- Distributed analysis
  - Random job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
  - Tests of performance of job submission, distribution and output retrieval
22. ATLAS SC4 Plans (1)
- Tier-0 data flow tests
- Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
  - Explore limitations of current setup
  - Run real algorithmic code
  - Establish infrastructure for calib/align loop and conditions DB access
  - Study models for event streaming and file merging
  - Get input from SFO simulator placed at Point 1 (ATLAS pit)
  - Implement system monitoring infrastructure
- Phase 1: last 3 weeks of June, with data distribution to Tier-1s
  - Run integrated data flow tests using the SC4 infrastructure for data distribution
  - Send AODs to (at least) a few Tier-2s
  - Automatic operation for O(1 week)
  - First version of shifters' interface tools
  - Treatment of error conditions
- Phase 2: 3-4 weeks in September-October
  - Extend data distribution to all (most) Tier-2s
  - Use 3D tools to distribute calibration data
  - The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006
    - but is not incompatible with other distributed operations
  - No external data transfer during this phase(?)
23. ATLAS SC4 Plans (2)
- ATLAS CSC includes continuous distributed simulation productions
  - We will continue running distributed simulation productions all the time
  - Using all Grid computing resources we have available for ATLAS
  - The aim is to produce 2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
  - We can currently manage 1M events/week, ramping up gradually
- SC4 distributed reprocessing tests
  - Test of the computing model using the SC4 data management infrastructure
  - Needs file transfer capabilities between Tier-1s and back to the CERN CAF
  - Also distribution of conditions data to Tier-1s (3D)
  - Storage management is also an issue
  - Could use 3 weeks in July and 3 weeks in October
- SC4 distributed simulation intensive tests
  - Once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions
  - As they would use the same setup both from our ProdSys and the SC4 side
  - First separately, then concurrently
24. ATLAS SC4 Plans (3)
- Distributed analysis tests
  - Random job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
  - Generate groups of jobs and simulate analysis job submission by users at home sites
    - Direct jobs needing only AODs as input to Tier-2s
    - Direct jobs needing ESDs or RAW as input to Tier-1s
    - Make preferential use of ESD and RAW samples available on disk at Tier-2s
  - Tests of performance of job submission, distribution and output retrieval
  - Test job priority and site policy schemes for many user groups and roles
  - Distributed data and dataset discovery and access through metadata, tags, data catalogues
  - Need same SC4 infrastructure as needed by distributed productions
  - Storage of job outputs for private or group-level analysis may be an issue
- Tests can be run during Q3-4 2006
  - First a couple of weeks in July-August (after distributed production tests)
  - Then another longer period of 3-4 weeks in November