TRIUMF SITE REPORT - Corrie Kost
Transcript and Presenter's Notes

1
TRIUMF SITE REPORT
Corrie Kost
Update since Hepix Fall 2005
2
Devolving Server Functions
  • NEW: Windows print server cluster - 2 Dell PowerEdge SC1425 machines
    sharing an external SCSI disk holding printer data
  • NEW: 2 Dell PowerEdge SC1425 - primary and secondary Windows domain
    controllers
  • OLD: combined Windows Print Server / Windows Domain Controller

3
Update since Hepix Fall 2005
4
Update since Hepix Fall 2005
Waiting for 10Gb/sec DWDM XFP
5
Update since Hepix Fall 2005
40 km DWDM, 64 wavelengths/fibre
CH34: 193.4 THz (1550.116 nm)
US$10k each
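The channel/frequency/wavelength figures above follow the usual 100 GHz ITU DWDM grid; as a quick cross-check (my own sketch, not from the slides):

```python
# Cross-check of the DWDM channel quoted above, assuming the common
# 100 GHz ITU C-band grid convention: f(THz) = 190.0 + 0.1 * channel.
C = 299_792_458  # speed of light, m/s

channel = 34
f_thz = 190.0 + 0.1 * channel                 # 193.4 THz, as on the slide
wavelength_nm = C / (f_thz * 1e12) * 1e9      # lambda = c / f
print(f"CH{channel}: {f_thz:.1f} THz, {wavelength_nm:.3f} nm")
# -> CH34: 193.4 THz, 1550.116 nm
```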
6
Update since Hepix Fall 2005
7
Update since Hepix Fall 2005
8
Servers / Data Centre
[Data centre diagram - servers shown: GPS TIME, TRPRINT, CMMS, TGATE,
WINPRINT1/2, DOCUMENTS, CONDORG, TNT2K3, TRWEB, TRWINDATA, TRSHARE,
TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV, LCG STORAGE, IBM CLUSTER,
LCG WORKER NODES, IBM 2TB STORAGE, RH-FC-SL MIRROR, KOPIODOC,
1GB links, TEMP TRSHARE]
9
Update since Hepix Fall 2005
[Data centre diagram (updated) - servers shown: GPS TIME, TRPRINT, CMMS,
TGATE, WINPRINT1, DOCUMENTS, WINPRINT2, TNT2K3, CONDORG, TRWEB, TRWINDATA,
TRSHARE, TRSWINAPPS, TRMAIL, PROMISE STORAGE, TRSERV]
10
TRIUMF-CERN ATLAS
ATLAS TIER1 prototype (Service Challenge)
  • Lightpath - International Grid Testbed (CANet IGT) equipment
  • Amanda Backup
  • Worker nodes (evaluation units): Blades, Dual/Dual 64-bit 3GHz Xeons,
    4 GB RAM, 80GB SATA
  • VOBOX: 2GB RAM, 3GHz 64-bit Xeon, 2x160GB SATA
  • LFC: 2GB RAM, 3GHz 64-bit Xeon, 2x160GB SATA
  • FTS: 2GB RAM, 3GHz 64-bit Xeon, 3x73GB SCSI
  • SRM head node: 2GB RAM, 64-bit Opteron, 2x232GB RAID1
  • sc1-sc3 dCache Storage Elements: 2GB RAM, 3GHz 64-bit Xeon, 8x232GB RAID5
  • 2 SDLT 160GB drives / 26 cartridges; 2 SDLT 300GB drives / 26 cartridges
11
ATLAS/CERN → TRIUMF
12
Tier0→Tier1 Tests Apr 3-30
  • Any MB/sec rate below 90% of nominal needs explanation and
    compensation in the days following (see the rate-check sketch below).
  • Maintain rates unattended over Easter weekend
    (April 14-16)
  • Tape tests April 18-24
  • Experiment-driven transfers April 25-30
  • The nominal rate for PIC is 100MB/s, but will
    be limited by the WAN until November 2006.

https://twiki.cern.ch/twiki/bin/view/LCG/LCGServiceChallenges
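A minimal sketch of the 90%-of-nominal rule above (my own illustration, not part of the Service Challenge tooling; the nominal rate and daily figures are placeholders):

```python
# Flag days whose average Tier-0 -> Tier-1 rate fell below 90% of nominal;
# such days need explanation and compensation in the days that follow.
NOMINAL_MB_S = 100.0                   # placeholder nominal rate (MB/s)
THRESHOLD = 0.9 * NOMINAL_MB_S

# placeholder daily averages (MB/s), e.g. taken from monitoring plots
daily_rates = {"2006-04-14": 104.2, "2006-04-15": 81.7, "2006-04-16": 97.9}

for day, rate in sorted(daily_rates.items()):
    if rate < THRESHOLD:
        backlog_gb = (NOMINAL_MB_S - rate) * 86400 / 1024   # data to make up
        print(f"{day}: {rate:.1f} MB/s < 90% of nominal "
              f"-> explain and recover ~{backlog_gb:.0f} GB")
```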
13
Update from Hepix Fall 2005
  • ATLAS SC4 plans extracted from Mumbai Workshop, 17 Feb 2006 (1)
  • March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
  • April-May (pre-SC4): tests of distributed operations on a small
    testbed (the pre-production system)
  • Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to
    Tier-1s (720MB/s full ESD to BNL)
  • 3 weeks in July: distributed processing tests (Part 1)
  • 2 weeks in July-August: distributed analysis tests (Part 1)
  • 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to
    Tier-2s
  • 3 weeks in October: distributed processing tests (Part 2)
  • 3-4 weeks in November: distributed analysis tests (Part 2)

14
Repeated reads on the same set of (typically 16) files at 600MB/sec over
150 days: ~7 PB read (13 PB total since start, as of March 30; no reboot
for 134 days)
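A quick arithmetic check of the caption above (my own, assuming a sustained 600 MB/s and decimal units):

```python
# 600 MB/s sustained over 150 days, in petabytes (decimal: 1 PB = 1e9 MB)
rate_mb_s = 600
seconds = 150 * 86400
total_pb = rate_mb_s * seconds / 1e9
print(f"~{total_pb:.1f} PB")   # ~7.8 PB, consistent with the ~7 PB quoted
```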
15
Repeated reads on the same set of (typically 16) files at 600MB/sec over
150 days: ~7 PB read (13 PB total since start, as of March 30; no reboot
for 134 days)
16
Keeping it Cool
  • Central Computing Room isolation fixed
  • Combined two 11-Ton air-conditioners to even out
    load
  • Adding heating coil to improve stability
  • Blades for ATLAS! 30% less heat, 20% less TCO
  • 100 W/sq-ft → 200 W/sq-ft → 400 W/sq-ft means cooling is a
    significant cost factor
  • Note: electrical/cooling costs estimated at Can$150k/yr (see the
    rough estimate below)
  • Water cooled systems for (multicore/multicpu)
    blade systems?
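As a rough illustration of what Can$150k/yr implies, a sketch under an assumed electricity price (the Can$0.06/kWh figure is my assumption, not from the slides):

```python
# What average electrical + cooling load corresponds to ~Can$150k/yr?
# The electricity price is assumed for illustration only.
annual_cost_cad = 150_000
price_per_kwh = 0.06                # assumed Can$/kWh
avg_load_kw = annual_cost_cad / price_per_kwh / 8760   # hours per year
print(f"~{avg_load_kw:.0f} kW average load")
# ~285 kW; at 100 W/sq-ft that is roughly 2900 sq-ft of machine-room floor,
# which is why rising W/sq-ft makes cooling a significant cost factor
```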

17
Keeping it Cool (2)
  • HP offers Modular Cooling System (MCS)
  • Used when rack > 10-15 kW
  • US$30k
  • Chilled (5-10°C) water
  • Max load 30 kW/rack (17 GPM / 65 LPM @ 5°C water, 20°C air) - see the
    sanity check below
  • Water cannot reach servers
  • Door open? - Cold air out front, hot out back
  • Significantly less noise with doors closed
  • HWD 1999x909x1295 mm (79x36x51 in), 513 kg / 1130 lbs (empty)
  • Not certified for Seismic or Zone 4
  • http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00613691/c00613691.pdf
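A quick heat-balance sanity check on the 30 kW per rack / 65 LPM figure above (my own sketch; the outlet temperature is derived, not taken from the HP document):

```python
# Q = m_dot * c_p * dT : how much does 65 LPM of water warm while
# carrying away 30 kW?
flow_kg_s = 65 / 60        # 65 LPM of water, ~1 kg per litre
c_p = 4186                 # specific heat of water, J/(kg K)
heat_w = 30_000            # 30 kW per rack
delta_t = heat_w / (flow_kg_s * c_p)
print(f"water warms by ~{delta_t:.1f} C")
# ~6.6 C, so 5 C inlet water leaves at ~12 C, still well below the 20 C air
```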

18
Amanda Backup at TRIUMF
Details by Steve McDonald, Thursday 4:30pm
19
End of Presentation
  • Extra Slides on SC4 plans for reference

20
Service Challenge Servers Details
  • fts - FTS Server (FTS = File Transfer Service)
    homepage: http://egee-jra1-dm.web.cern.ch/egee-jra1-dm/FTS/
    Oracle database used
    64-bit Intel Xeon 3 GHz, 73 GB SCSI disks (3), 2 GB RAM
    IBM 4560-SLX Tape Library attached (will have 2 SDLT-II drives
    attached when they arrive, probably next week); SDLT-II does 300 GB
    native, 600 GB compressed
    Running SL 3.0.5 64-bit
  • lfc - LFC Server (LFC = LCG File Catalog)
    info page: https://uimon.cern.ch/twiki/bin/view/LCG/LfcAdminGuide
    MySQL database used
    64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1,
    2 GB RAM
    Running SL 3.0.5 64-bit
  • vobox - VO Box (Virtual Organization Box)
    info page: http://agenda.nikhef.nl/askArchive.php?baseagendacatega0613ida0613s3t1/transparencies
    64-bit Intel Xeon 3 GHz, 160 GB SATA disks (2) in software RAID-1,
    2 GB RAM
    Running SL 3.0.5 64-bit
  • sc1-sc3 - dCache Storage Elements
    64-bit Intel Xeons 3 GHz
    3ware RAID controller, 8x 232 GB disks in H/W RAID-5 giving 1.8 TB
    storage (capacity check below)
    2 GB RAM
    Running SL 4.1 64-bit
  • sc4 - SRM endpoint, dCache admin node and Storage Element
    64-bit Opteron 246
    3ware RAID controller, 2x 232 GB disks in H/W RAID-1 giving 250 GB
    storage
    2 GB RAM
    Running SL 4.1 64-bit
    IBM 4560-SLX Tape Library attached; we are moving both SDLT-I drives
    to this unit. SDLT-I does 160 GB native, 300 GB compressed.
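The usable-capacity figures above follow from the RAID layouts; a small check (my own sketch; treating each "232 GB" disk as a 250 GB marketed drive reported in binary GiB is an assumption):

```python
# Usable capacity for the storage nodes above.
# Assumption: a "232 GB" disk is a 250 GB (decimal) drive, since
# 250e9 bytes / 2**30 ~ 233 GiB.
disk_gb = 250                                  # assumed decimal GB per disk

raid5_usable_tb = (8 - 1) * disk_gb / 1000     # sc1-sc3: 8 disks, 1 for parity
print(f"RAID-5: ~{raid5_usable_tb:.2f} TB")    # ~1.75 TB, matching "1.8 TB"

raid1_usable_gb = disk_gb                      # sc4: mirrored pair
print(f"RAID-1: ~{raid1_usable_gb} GB")        # ~250 GB, matching "250 GB"
```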

21
ATLAS SC4 Tests
  • Complete Tier-0 test
  • Internal data transfer from Event Filter farm
    to Castor disk pool, Castor tape, CPU farm
  • Calibration loop and handling of conditions data
  • Including distribution of conditions data to
    Tier-1s (and Tier-2s)
  • Transfer of RAW, ESD, AOD and TAG data to Tier-1s
  • Transfer of AOD and TAG data to Tier-2s
  • Data and dataset registration in DB (add
    meta-data information to meta-data DB)
  • Distributed production
  • Full simulation chain run at Tier-2s (and
    Tier-1s)
  • Data distribution to Tier-1s, other Tier-2s and
    CAF
  • Reprocessing raw data at Tier-1s
  • Data distribution to other Tier-1s, Tier-2s and
    CAF
  • Distributed analysis
  • Random job submission accessing data at Tier-1s
    (some) and Tier-2s (mostly)
  • Tests of performance of job submission,
    distribution and output retrieval

22
ATLAS SC4 Plans (1)
  • Tier-0 data flow tests
  • Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
  • Explore limitations of current setup
  • Run real algorithmic code
  • Establish infrastructure for calib/align loop and
    conditions DB access
  • Study models for event streaming and file merging
  • Get input from SFO simulator placed at Point 1
    (ATLAS pit)
  • Implement system monitoring infrastructure
  • Phase 1: last 3 weeks of June with data distribution to Tier-1s
  • Run integrated data flow tests using the SC4
    infrastructure for data distribution
  • Send AODs to (at least) a few Tier-2s
  • Automatic operation for O(1 week)
  • First version of shifters interface tools
  • Treatment of error conditions
  • Phase 2: 3-4 weeks in September-October
  • Extend data distribution to all (most) Tier-2s
  • Use 3D tools to distribute calibration data
  • The ATLAS TDAQ Large Scale Test in
    October-November prevents further Tier-0 tests in
    2006
  • but is not incompatible with other distributed
    operations

No external data transfer during this phase(?)
23
ATLAS SC4 Plans (2)
  • ATLAS CSC includes continuous distributed
    simulation productions
  • We will continue running distributed simulation
    productions all the time
  • Using all Grid computing resources we have
    available for ATLAS
  • The aim is to produce 2M fully simulated (and
    reconstructed) events/week from April onwards,
    both for physics users and to build the datasets
    for later tests
  • We can currently manage 1M events/week ramping
    up gradually
  • SC4 distributed reprocessing tests
  • Test of the computing model using the SC4 data
    management infrastructure
  • Needs file transfer capabilities between Tier-1s
    and back to CERN CAF
  • Also distribution of conditions data to Tier-1s
    (3D)
  • Storage management is also an issue
  • Could use 3 weeks in July and 3 weeks in October
  • SC4 distributed simulation intensive tests
  • Once reprocessing tests are OK, we can use the
    same infrastructure to implement our computing
    model for simulation productions
  • As they would use the same setup both from our
    ProdSys and the SC4 side
  • First separately, then concurrently

24
ATLAS SC4 Plans (3)
  • Distributed analysis tests
  • Random job submission accessing data at Tier-1s
    (some) and Tier-2s (mostly)
  • Generate groups of jobs and simulate analysis job
    submission by users at home sites
  • Direct jobs needing only AODs as input to Tier-2s
  • Direct jobs needing ESDs or RAW as input to
    Tier-1s
  • Make preferential use of ESD and RAW samples
    available on disk at Tier-2s
  • Tests of performance of job submission,
    distribution and output retrieval
  • Test job priority and site policy schemes for
    many user groups and roles
  • Distributed data and dataset discovery and access
    through metadata, tags, data catalogues.
  • Need same SC4 infrastructure as needed by
    distributed productions
  • Storage of job outputs for private or group-level
    analysis may be an issue
  • Tests can be run during Q3-4 2006
  • First a couple of weeks in July-August (after
    distributed production tests)
  • Then another longer period of 3-4 weeks in
    November