Title: NIKHEF Data Processing Facility
1NIKHEF Data Processing Facility
- Status Overview per 2004.10.27
- David Groep, NIKHEF
2A historical view
- Started in 2000 with a dedicated farm for DØ
- 50 Dual P3-800 MHz
- tower model Dell Precision 220
- 800 GByte 3ware disk array
3Many different farms
- 2001 EU DataGrid WP6 Application test bed
- 2002 addition of the development test bed
- 2003 LCG-1 production facility
- April 2004 amalgamation of all nodes into LCG-2
- September 2004 addition of
- EGEE PPS
- VL-E P4 CTB
- EGEE JRA1 LTB
4Growth of resources
- Intel Pentium III 800 MHz: 100 CPUs (2000)
- Intel Pentium III 933 MHz: 40 CPUs (2001)
- AMD Athlon MP2000 2 GHz: 132 CPUs (2002)
- Intel XEON 2.8 GHz: 54 CPUs (2003)
- Intel XEON 2.8 GHz: 20 CPUs (2003)
- Total WN resources (raw): 353 THz·hr/mo, 200 kSI2k
- Total on-line disk cache: 7 TByte
5Node types
- 2U pizza boxes: PIII 933 MHz, 1 GByte RAM, 43 GByte disk
- 1U GFRC (NCF): AMD MP2000, 1 GByte RAM, 60 GByte disk; thermodynamic challenges
- 1U Halloween: XEON 2.8 GHz, 2 GByte RAM, 80 GByte disk; first GigE nodes
6Connecting things together
- Collapsed backbone strategy
- Foundry Networks BigIron 15000
- 14 GigE SX, 2x GigE LX
- 16 1000BaseTX
- 48 100BaseTX
- Service nodes directly GigE connected
- Farms connected via local switches
- WN oversubscription typically 1:5 to 1:7
- Dynamic re-assignment of nodes to facilities
- DHCP Relay
- built-in NAT support (for worker nodes)
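The worker-node oversubscription mentioned above is the ratio of the nodes' aggregate access bandwidth to the uplink of their local switch. A minimal sketch of that arithmetic follows; the node count and link speeds are hypothetical and not taken from the slides.

```python
def oversubscription(n_nodes: int, node_mbps: float, uplink_mbps: float) -> float:
    """Ratio of aggregate worker-node access bandwidth to switch uplink bandwidth."""
    if uplink_mbps <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return (n_nodes * node_mbps) / uplink_mbps


# Hypothetical example: 48 Fast Ethernet worker nodes behind one GigE uplink.
ratio = oversubscription(n_nodes=48, node_mbps=100, uplink_mbps=1000)
print(f"oversubscription {ratio:.1f}:1")
```

A ratio well above 1 only hurts when many nodes transfer data at the same time, which is why service nodes are kept on direct GigE ports.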
7NIKHEF Farm Network
8Network Uplinks
- NIKHEF links
- 1 Gb/s IPv4 and 1 Gb/s IPv6 (SURFnet)
- 2 Gb/s WTCW (to SARA)
- SURFnet links
9NDPF Usage
- Analyzed production batch logs since May 2002
- total of 1.94 PHz·hours provided in 306 000 jobs
[Figure: usage chart annotated with "Added NCF GFRC", "Added Halloween" and "LHC Data Challenges"; experimental use and tests not shown]
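The PHz·hours unit used above weights the time a job ran by the clock speed of the CPU it ran on. A minimal sketch of that bookkeeping follows. It assumes that wall-clock time is what gets counted and uses a hypothetical list of (wall-clock seconds, clock in GHz) records, not the actual batch log format.

```python
from typing import Iterable, Tuple


def total_phz_hours(jobs: Iterable[Tuple[float, float]]) -> float:
    """Sum wall-clock hours weighted by clock speed; jobs are (wall_seconds, clock_ghz)."""
    ghz_hours = sum(wall_s / 3600.0 * ghz for wall_s, ghz in jobs)
    return ghz_hours / 1e6  # 1 PHz = 1e6 GHz


# Hypothetical records: two jobs on 2.8 GHz nodes, one on a 0.8 GHz node.
sample = [(7200, 2.8), (3600, 2.8), (36000, 0.8)]
print(f"{total_phz_hours(sample) * 1e6:.1f} GHz·hours")
```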
10Usage per Virtual Organisation
- Real-time web info: www.nikhef.nl/grid/ and www.dutchgrid.nl/Org/Nikhef/farmstats.html
- Dzero acts as background fill
- Usage doesn't (yet) reflect shares
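Whether usage reflects shares can be checked by comparing each VO's fraction of consumed hours with its target share. The sketch below illustrates the comparison; the VO names, hours and shares are hypothetical and not taken from the NDPF logs.

```python
from collections import defaultdict


def usage_fractions(jobs):
    """Return each VO's fraction of total consumed hours; jobs are (vo, hours)."""
    per_vo = defaultdict(float)
    for vo, hours in jobs:
        per_vo[vo] += hours
    total = sum(per_vo.values())
    return {vo: h / total for vo, h in per_vo.items()} if total else {}


def share_deviation(fractions, shares):
    """Used fraction minus target share, per VO (positive = over its share)."""
    return {vo: fractions.get(vo, 0.0) - target for vo, target in shares.items()}


# Hypothetical numbers for illustration only.
jobs = [("dzero", 60.0), ("atlas", 25.0), ("lhcb", 15.0)]
shares = {"dzero": 0.2, "atlas": 0.5, "lhcb": 0.3}
for vo, delta in share_deviation(usage_fractions(jobs), shares).items():
    print(f"{vo}: {delta:+.2f}")
```

A VO that acts as background fill would show a large positive deviation whenever the other VOs leave capacity idle.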
11Usage monitoring
- Live viewgraphs
- farm occupancy
- per-VO distribution
- network loads
- Tools
- Cricket (network)
- home-grown scripts + rrdtool
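A home-grown script feeding rrdtool could look roughly like the sketch below, which builds an `rrdtool update` command line for a farm-occupancy sample. The RRD file name, its two data sources (busy and total CPUs) and the sample values are assumptions for illustration. The actual NDPF scripts are not shown in the slides.

```python
from typing import List, Optional

# Hypothetical file; it must exist already (made with `rrdtool create`, two data sources).
RRD_FILE = "farm_occupancy.rrd"


def update_command(busy_cpus: int, total_cpus: int,
                   timestamp: Optional[int] = None) -> List[str]:
    """Build an `rrdtool update` call for an RRD with data sources (busy, total)."""
    if not 0 <= busy_cpus <= total_cpus:
        raise ValueError("busy_cpus must lie between 0 and total_cpus")
    when = "N" if timestamp is None else str(timestamp)  # "N" means "now" to rrdtool
    return ["rrdtool", "update", RRD_FILE, f"{when}:{busy_cpus}:{total_cpus}"]


if __name__ == "__main__":
    # Hypothetical sample: 120 of 200 CPUs busy.
    print(" ".join(update_command(busy_cpus=120, total_cpus=200)))
```

On a host with rrdtool installed, the returned list can be passed to `subprocess.run(cmd, check=True)` from a periodic job.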
12Central services
- VO-LDAP services for LHC VOs
- DutchGrid CA
- edg-testbed-stuff
- Torque & Maui distribution
- installation support components
13Some of the issues
- Data access patterns in Grids
- jobs tend to clutter CWD
- high load when shared over NFS
- shared homes required for traditional batch MPI
- Garbage collection for foreign jobs
- OpenPBS/Torque transient TMPDIR patch
- Policy management
- maui fair-share policies
- CPU capping
- max-queued-jobs capping
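Fair-share policies of the kind listed above weigh recent usage more heavily than older usage. The sketch below shows a decayed usage calculation in the spirit of Maui's FSDEPTH and FSDECAY parameters. It is an illustration, not Maui's own code, and the window values and target are hypothetical.

```python
from typing import Sequence


def decayed_usage(windows: Sequence[float], decay: float) -> float:
    """Weighted usage over fair-share windows; windows[0] is the most recent."""
    return sum(u * decay ** i for i, u in enumerate(windows))


def fairshare_fraction(vo_windows: Sequence[float],
                       total_windows: Sequence[float],
                       decay: float) -> float:
    """A VO's decayed usage as a fraction of the decayed total usage."""
    total = decayed_usage(total_windows, decay)
    return decayed_usage(vo_windows, decay) / total if total else 0.0


# Hypothetical CPU-hours per window (newest first) for one VO and the whole farm.
vo = [100.0, 400.0, 400.0]
farm = [1000.0, 1000.0, 1000.0]
target = 0.30  # hypothetical fair-share target
used = fairshare_fraction(vo, farm, decay=0.5)
print(f"effective usage {used:.3f}, {'over' if used > target else 'under'} target")
```

With no decay (`decay=1.0`) the same VO would sit exactly at 30% of the farm; the decay is what lets its recent light usage bring it under the target.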
14Developments & work in progress
- Parallel Virtual File Systems
- From LCFGng to Quattor (Jeff)
- Monitoring and disaster recovery (Davide)
15Team