Title: ScotGRID-Glasgow
ScotGRID-Glasgow
ScotGRID-Glasgow - Timeline
- Dec 2001 - delivery of kit
- Feb 2002 - xCAT tutorial by Chris Turcksin and David McLauchlin
- Mar 2002 - attempt trial masternode on a Netvista workstation
- Apr 2002 - ScotGRID room handed over to builders
- May 2002 - prepare initial production xCAT configuration files offline
- Jun 2002 - building work complete - xCAT reinstallation
- Jul 2002 - user registration
- ? - further development and trial production
- Dec 2002 - group disk (re)organisation to match project aims
ScotGRID-Glasgow - Schematic
[Schematic: masternode, head nodes, storage nodes and compute nodes; campus backbone; Internet VLAN; 10.0.0.0 VLAN; 100 Mbps and 1000 Mbps links]
ScotGRID-Glasgow - Front View
ScotGRID-Glasgow - Facts/Figures
- RedHat 7.2
- xCAT-dist-1.1.RC8.1
- OpenPBS_2_3_16
- Maui-3.0.7
- OpenAFS-1.2.2 on masternode
- RAL virtual tape access
- IP masquerading on masternode for Internet access from compute nodes
- Intel Fortran Compiler 7.0 for Linux
- HEPiX login scripts
- gcc-2.95.2
- j2sdk-1_4_1
- 59 x330 dual PIII 1 GHz/2 Gbyte compute nodes
- 2 x340 dual PIII 1 GHz/2 Gbyte head nodes
- 3 x340 dual PIII 1 GHz/2 Gbyte storage nodes, each with 11 x 34 Gbytes in RAID 5 (see the capacity sketch after this list)
- 1 x340 dual PIII 1 GHz/0.5 Gbyte masternode
- 3 x 48-port Cisco 3500 series 100 Mbit/s Ethernet switches
- 1 x 8-port Cisco 3500 series 1000 Mbit/s Ethernet switch
- 4 x 16-port Equinox ELS terminal servers
- 150,000 dedicated Maui processor hours
- 38 names in NIS passwd map
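A rough cross-check of the figures above, relating the storage and node counts on this slide to the "some 800 Gbytes available to users" and "7 Gbytes of shared disk per processor" quoted on the issues slide; the single-parity-disk RAID 5 layout is an assumption, the rest is arithmetic from this slide.

    # Hedged back-of-the-envelope sketch using the figures on this slide.
    # Assumes one parity disk per 11-disk RAID 5 array; filesystem and
    # system overheads (which bring ~1020 GB down to the ~800 GB quoted
    # later in the deck) are not modelled.

    DISK_GB = 34             # capacity of each exp300 drive
    DISKS_PER_ARRAY = 11     # drives per storage node
    STORAGE_NODES = 3
    JOB_NODES = 59 + 2       # compute nodes plus head nodes, all dual PIII
    CPUS_PER_NODE = 2

    usable_gb = STORAGE_NODES * (DISKS_PER_ARRAY - 1) * DISK_GB   # 1020 GB
    processors = JOB_NODES * CPUS_PER_NODE                        # 122 CPUs

    print(f"usable RAID 5 capacity: {usable_gb} GB")
    print(f"shared disk per processor: {usable_gb / processors:.1f} GB")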
ScotGRID-Glasgow - Ethernetry
[Network diagram: Internet (194.36.1.0); CDFA 194.36.1.91; Masternode 194.36.1.61; Headnode1 194.36.1.62; Headnode2 194.36.1.63; Storage nodes 194.36.1.64,65,66 and 10.0.1.1,2,3; 10.x.y.z addresses on master, head and compute nodes; one 8-port Gigabit switch (GBIC) and three 48/2-port switches]
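The same addressing restated as a small lookup table; only addresses actually shown in the diagram are included, and the 10.x.y.z cluster addresses of the master, head and compute nodes are left as unknowns since the slide does not give them.

    # Hedged restatement of the diagram above; values are copied from the
    # slide, nothing is inferred beyond it.

    ADDRESSES = {
        "masternode": {"public": "194.36.1.61"},
        "headnode1":  {"public": "194.36.1.62"},
        "headnode2":  {"public": "194.36.1.63"},
        "storage1":   {"public": "194.36.1.64", "cluster": "10.0.1.1"},
        "storage2":   {"public": "194.36.1.65", "cluster": "10.0.1.2"},
        "storage3":   {"public": "194.36.1.66", "cluster": "10.0.1.3"},
        # master, head and compute nodes also sit on 10.x.y.z cluster
        # addresses; the exact values are not given on the slide
    }

    for node, addrs in ADDRESSES.items():
        print(node, addrs)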
ScotGRID-Glasgow - Wiring View
ScotGRID-Glasgow - Experience
- xCAT installed and manages the cluster just fine once it is understood - the style is Linux
  - documentation is nearly correct
  - RedHat Linux updates cause surprises wrt things not working as described
  - it is a toolkit with HOWTOs
  - it is mostly text files that one should feel free to modify
- The xCAT tutorial was most valuable
  - at the level of both detail and general outlook
  - input from Chris and David during procurement would have helped
- The hardware has produced few surprises
  - a number of disks in the exp300s stopped spinning early on
  - a number of x330 power supplies have given intermittent fan faults
- IBM hardware/software maintenance via the Call Centre has worked fine, but the terms and conditions are not known - especially wrt non-IBM kit
ScotGRID-Glasgow - Work in progress
- Amanda backup
- EDG LCFG/CE/SE/UI installation
- xCAT front-end (see the sketch after this list)
  - define a scheme to manually add a node to the xCAT cluster
  - added node has access to accounts, files and batch queues
  - added node is an EDG node or perhaps a CDF SAM station
  - front-ending minimises interference between the purview of xCAT and other grid or protogrid systems like EDG or SAM
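The slides do not say how an added node would be validated; as a hedged illustration only, the sketch below spot-checks the three requirements in the list above - accounts, files and batch queues - using standard NIS and OpenPBS client commands (ypmatch, qstat); the test user and shared path are placeholders, not names from the cluster.

    # Hedged sketch: spot-check that a manually added node sees accounts,
    # shared files and batch queues. ypmatch and qstat are the standard
    # NIS and OpenPBS clients; the user and path below are placeholders.
    import subprocess

    def ok(cmd):
        """Run a command quietly and report whether it succeeded."""
        return subprocess.run(cmd, capture_output=True).returncode == 0

    def check_added_node(test_user="someuser", shared_path="/afs"):
        checks = {
            "accounts (NIS passwd map)": ["ypmatch", test_user, "passwd"],
            "files (shared/AFS space)": ["ls", shared_path],
            "batch queues (PBS server)": ["qstat", "-q"],
        }
        for name, cmd in checks.items():
            print(f"{name}: {'ok' if ok(cmd) else 'FAILED'}")

    if __name__ == "__main__":
        check_added_node()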
ScotGRID-Glasgow - Issues
- only 7 Gbytes of shared disk per processor
  - some 800 Gbytes available to users
- no 1000 Mbit/s path between storage and the Internet
- ext3 file system corruption on ServeRAID
  - kernel crashes when RaidMan running
- relation to ScotGRID-Edinburgh unclear
  - file sharing over WAN hard at Gbit/s speeds but would be good
  - too little disk at Glasgow and too little cpu at Edinburgh not good
- Security
  - too many roots
  - need implemented security policy wrt root access, firewalls, exports, ...
- continuing air conditioning problems
  - ¼ of cooling out of use for > 2 months - now apparently fixed
  - humidity still out of spec
  - N+1 rule broken - more load just arrived and more expected
- Other minor accommodation problems
  - no proper earthing of cooling grilles in floor
  - unprotected emergency off buttons
  - over-temperature trip prevents auto restart after power failure
  - no UPS
  - no datasafe
- Both ServeRAID channels already used on storage nodes (?)
ScotGRID-Glasgow - Upgrade
- Disks (see the sketch after this list)
  - 33 of 42 slots in the exp300s used
  - 34 GB disks
  - replace with 150 GB drives
    - avoids associated infrastructure costs
    - wastes existing drives
    - transition via tape backup
- 3 modest server-class systems for LCFG, EDG-style CE and SE
  - SE needs as good Gigabit Ethernet as the storage nodes
  - use to learn about e1000 driver trunking of VLANs as a possible storage node solution
  - only 1 Gigabit card needed if duplexing effective and the e1000 driver can do trunking, otherwise 2 cards needed
- Extra Gigabit Ethernetry if required - hubs, NICs for storage nodes
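As a rough illustration of what the drive swap buys, keeping the 11-disk RAID 5 arrays from the Facts/Figures slide (single parity disk per array assumed, drive counts unchanged):

    # Hedged sketch: usable capacity before and after replacing the 34 GB
    # exp300 drives with 150 GB drives, same 11-disk RAID 5 array per
    # storage node, one parity disk per array assumed.

    STORAGE_NODES = 3
    DISKS_PER_ARRAY = 11
    OLD_GB, NEW_GB = 34, 150

    def usable(disk_gb):
        return STORAGE_NODES * (DISKS_PER_ARRAY - 1) * disk_gb

    print(f"current:       {usable(OLD_GB)} GB usable")    # 1020 GB
    print(f"after upgrade: {usable(NEW_GB)} GB usable")    # 4500 GB
    print(f"gain:          {usable(NEW_GB) / usable(OLD_GB):.1f}x")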
ScotGRID-Glasgow - SysAdmin personal comments
- More effort should have gone into matching the solution to the problem
  - In the early days one went to IBM, CDC, DEC ... and asked what you could get for xxxK. Now the number of permutations is enormous and more effort is needed to match the list of ordered items to the requirements. Understanding the nature of one's problem and the capabilities of the technologies is no easy task.
- It would be nice if reliability and manageability were extra-cost options
  - High Energy Physics only really cares about high average throughput of good data
  - 99% availability is quite adequate
  - HEP is very interested in storage capacity but is not so concerned about reliability provided no bad data creeps in undetected. Most data is statistical and minor loss would only be a bookkeeping issue.
- Some of the Glasgow ScotGRID cluster is overkill for our needs
- Getting our act together is trivially harder than we think (in the sense that mathematicians use the word trivial). We are skipping the boring/tedious bits - analysing requirements, developing policies/procedures ...