Title: ScotGRID- Glasgow


1
ScotGRID- Glasgow
  • SysAdmin perspective

2
ScotGRID-Glasgow Timeline
  • Dec 2001 - delivery of kit
  • Feb 2002 - xCAT tutorial (Chris Turcksin & David
    McLauchlin)
  • Mar 2002 - Attempt trial masternode on Netvista
    workstation
  • Apr 2002 - ScotGRID room handed over to builders
  • May 2002 - prepare initial production xCAT
    configuration files offline
  • Jun 2002 - Building work complete - xCAT
    reinstallation
  • Jul 2002 - user registration
  • ? - further development and trial production
  • Dec 2002 - group disk (re)organisation to match
    project aims

3
ScotGRID-Glasgow - Schematic
[Figure: network schematic. Labels: Masternode, Storage Nodes, Head Nodes, Compute Nodes, Campus Backbone, Internet VLAN, 10.0.0.0 VLAN; link speeds 100 Mbps and 1000 Mbps]

4
ScotGRID-Glasgow - Front View

5
ScotGRID-Glasgow Facts/Figures
  • RedHat 7.2
  • xCAT-dist-1.1.RC8.1
  • OpenPBS_2_3_16
  • Maui-3.0.7
  • OpenAFS-1.2.2 on masternode
  • RAL virtual tape access
  • IP Masquerading on masternode for Internet
    access from compute nodes
  • Intel Fortran Compiler 7.0 for Linux
  • HEPiX login scripts
  • gcc-2.95.2
  • j2sdk-1_4_1
  • 59 x330 dual PIII/1 GHz/2 Gbyte compute nodes
  • 2 x340 dual PIII/1 GHz/2 Gbyte head nodes
  • 3 x340 dual PIII/1 GHz/2 Gbyte storage nodes,
    each with 11 by 34 Gbytes in RAID 5
  • 1 x340 dual PIII/1 GHz/0.5 Gbyte masternode
  • Three 48-port Cisco 3500 series 100 Mbit/sec
    Ethernet switches
  • One 8-port Cisco 3500 series 1000 Mbit/sec
    Ethernet switch
  • Four 16-port Equinox ELS Terminal Servers
  • 150,000 dedicated maui processor hours
  • 38 names in NIS passwd map

6
ScotGRID-Glasgow - Ethernetry
[Figure: Ethernet layout. Labels recovered from the diagram:]
  • INTERNET (194.36.1.0)
  • Masternode 194.36.1.61
  • Headnode1 194.36.1.62
  • Headnode2 194.36.1.63
  • Storage 194.36.1.64,5,6
  • CDFA 194.36.1.91
  • STORAGE NODES 10.0.1.1,2,3
  • 10.x.y.z head, compute (two links); 10.x.y.z master
  • Switches: 8 ports; 48/2 ports (three); GBIC
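
The addressing on this slide can be checked with a short Python sketch. It is illustrative only: the prefix lengths (/24 for 194.36.1.0 and /8 for 10.0.0.0) are assumptions, since the slide gives only network addresses, and the host labels such as storage1-public are hypothetical names, not taken from the cluster configuration.

```python
import ipaddress

# Assumed prefix lengths; the slide shows only "194.36.1.0" and "10.0.0.0".
INTERNET_NET = ipaddress.ip_network("194.36.1.0/24")
CLUSTER_NET = ipaddress.ip_network("10.0.0.0/8")

# Addresses as listed on the slide; the dictionary keys are hypothetical labels.
HOSTS = {
    "masternode": "194.36.1.61",
    "headnode1": "194.36.1.62",
    "headnode2": "194.36.1.63",
    "storage1-public": "194.36.1.64",
    "storage2-public": "194.36.1.65",
    "storage3-public": "194.36.1.66",
    "cdfa": "194.36.1.91",
    "storage1-private": "10.0.1.1",
    "storage2-private": "10.0.1.2",
    "storage3-private": "10.0.1.3",
}


def classify(addr: str) -> str:
    """Return which of the two networks an address falls in."""
    ip = ipaddress.ip_address(addr)
    if ip in INTERNET_NET:
        return "internet"
    if ip in CLUSTER_NET:
        return "cluster"
    return "unknown"


if __name__ == "__main__":
    for name, addr in HOSTS.items():
        print(f"{name:18s} {addr:14s} {classify(addr)}")
```

Hosts on the cluster network are not routable from outside, which is why the Facts/Figures slide lists IP masquerading on the masternode for Internet access from compute nodes.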
7
ScotGRID-Glasgow - Wiring View

8
ScotGRID-Glasgow - Experience
  • xCAT installs and manages the cluster just fine
    once it is understood
  • the style is Linux
  • documentation is nearly correct
  • RedHat Linux updates cause surprises wrt things
    not working as described
  • it is a toolkit with HOWTOs
  • it is mostly text files that one should feel free
    to modify
  • The xCAT tutorial was most valuable, at the level
    of both detail and general outlook
  • input from Chris and David during procurement
    would have helped
  • The hardware has produced few surprises
  • a number of disks in the exp300s stopped
    spinning early on
  • a number of x330 power supplies have given
    intermittent fan faults
  • IBM Hardware & Software maintenance via Call
    Centre has worked fine but the terms and
    conditions are not known - especially wrt non-IBM
    kit

9
ScotGRID-Glasgow - Work in progress
  • Amanda Backup
  • EDG LCFG/CE/SE/UI installation
  • xCAT front-end
  • define scheme to manually add node to xCAT
    cluster
  • added node has access to accounts, files and
    batch queues
  • added node is an EDG node or perhaps a CDF SAM
    station
  • front-ending minimises interference between
    purview of xCAT and other grid or protogrid
    systems like EDG or SAM

10
ScotGRID-Glasgow - issues
  • only 7 Gbytes of shared disk per processor
  • some 800 Gbytes available to users
  • no 1000 Mbit/sec path between storage and
    internet
  • ext3 file system corruption on ServeRAID
  • kernel crashes when RaidMan running
  • relation to ScotGRID-Edinburgh unclear
  • file sharing over WAN hard at Gbit/sec speeds but
    would be good
  • too little disk at Glasgow and too little cpu at
    Edinburgh not good
  • Security
  • too many roots
  • need implemented security policy wrt root
    access, firewalls, exports, ...
  • continuing air conditioning problems
  • ¼ cooling out of use for > 2 months - now
    apparently fixed
  • humidity still out of spec
  • N+1 rule broken - more load just arrived and more
    expected
  • Other minor accommodation problems
  • no proper earthing of cooling grilles in floor
  • unprotected emergency off buttons
  • over-temperature trip prevents auto restart after
    power failure
  • no UPS
  • no datasafe
  • Both ServeRAID channels already used on storage
    nodes (?)
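
The "7 Gbytes per processor" figure can be reproduced from the numbers on the Facts/Figures slide. The sketch below is a rough check, not a statement of how the disks are actually configured: it assumes that only the 59 dual-CPU compute nodes count as processors, and that each storage node runs one RAID 5 array over its 11 disks with one disk's worth of parity and no hot spare.

```python
# Figures quoted in the slides.
COMPUTE_NODES = 59
CPUS_PER_NODE = 2        # dual PIII
USER_DISK_GB = 800       # "some 800 Gbytes available to users"
STORAGE_NODES = 3
DISKS_PER_NODE = 11
DISK_GB = 34


def raid5_usable_gb(disks: int, disk_gb: int) -> int:
    """Usable space of one RAID 5 array: one disk's capacity goes to parity."""
    return (disks - 1) * disk_gb


# Assumption: only compute-node CPUs are counted as "processors".
compute_cpus = COMPUTE_NODES * CPUS_PER_NODE
# Assumption: one array per storage node, no hot spare.
raid_total_gb = STORAGE_NODES * raid5_usable_gb(DISKS_PER_NODE, DISK_GB)
disk_per_cpu_gb = USER_DISK_GB / compute_cpus

if __name__ == "__main__":
    print(f"compute processors:      {compute_cpus}")
    print(f"RAID 5 usable (assumed): {raid_total_gb} GB")
    print(f"user disk per processor: {disk_per_cpu_gb:.1f} GB")
```

Under these assumptions the arrays hold about 1020 Gbytes, so the 800 Gbytes quoted for users leaves room for system and scratch areas, and 800 Gbytes over 118 processors comes to just under 7 Gbytes each.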

11
ScotGRID-Glasgow - upgrade
  • Disks
  • 33 of 42 slots in exp300 used
  • 34 GB disks
  • replace with 150 GB drives
  • avoids associated infrastructure costs
  • wastes existing drives
  • transition via tape backup
  • 3 modest server class systems for LCFG, EDG style
    CE and SE
  • SE needs as good Gigabit Ethernet as Storage
    nodes
  • Use to learn about e1000 driver trunking of VLANs
    as possible storage node solution
  • only 1 Gigabit card needed if duplexing is
    effective and the e1000 driver can do trunking;
    otherwise 2 cards needed
  • Extra Gigabit Ethernetry if required - hubs, NICs
    for storage nodes
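
The disk options on this slide can be compared with simple arithmetic. The sketch below uses only the slot counts and drive sizes quoted above and reports raw capacity, before any RAID overhead; treating "fill the 9 empty slots with 34 GB disks" as the alternative to replacement is an assumption made for comparison, not an option stated on the slide.

```python
# Figures quoted on the slide.
TOTAL_SLOTS = 42      # slots across the exp300 enclosures
SLOTS_USED = 33
OLD_DISK_GB = 34
NEW_DISK_GB = 150

# Raw capacity today.
current_raw_gb = SLOTS_USED * OLD_DISK_GB

# Hypothetical alternative: fill the empty slots with more 34 GB disks.
fill_raw_gb = TOTAL_SLOTS * OLD_DISK_GB

# Option on the slide: replace the 33 existing disks with 150 GB drives.
replace_raw_gb = SLOTS_USED * NEW_DISK_GB

if __name__ == "__main__":
    print(f"current raw capacity:      {current_raw_gb} GB")
    print(f"all slots with 34 GB:      {fill_raw_gb} GB")
    print(f"33 slots with 150 GB:      {replace_raw_gb} GB")
    print(f"gain from replacing disks: {replace_raw_gb / current_raw_gb:.1f}x")
```

Replacement gives roughly a fourfold increase in raw space within the existing enclosures, which is the sense in which it avoids associated infrastructure costs while wasting the existing drives.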

12
ScotGRID-Glasgow SysAdmin personal comments
  • More effort should have gone into matching the
    solution to the problem
  • In the early days one went to IBM, CDC, DEC ...
    and asked what you could get for xxxK. Now the
    number of permutations is enormous and more
    effort is needed to match the list of ordered
    items to the requirements. Understanding the
    nature of one's problem and the capabilities of
    the technologies is no easy task.
  • It would be nice if reliability and manageability
    were extra cost options
  • High Energy Physics only really cares about high
    average throughput of good data
  • 99% availability is quite adequate
  • HEP is very interested in storage capacity but is
    not so concerned about reliability provided no
    bad data creeps in undetected. Most data is
    statistical and minor loss would only be a
    bookkeeping issue.
  • Some of the Glasgow ScotGRID cluster is overkill
    for our needs
  • Getting our act together is trivially harder
    than we think (in the sense that mathematicians
    use the word trivial). We are skipping the
    boring/tedious bits - analysing requirements,
    developing policies/procedures ...
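
To put the availability comment in concrete terms, the following sketch converts an availability fraction into hours of downtime per year. It assumes the "99" figure is a percentage measured over a full 365-day year; neither assumption is stated on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year, round-the-clock service


def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a service may be down at a given availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    for level in (0.99, 0.999):
        hours = downtime_hours_per_year(level)
        print(f"{level:.1%} availability -> {hours:.1f} h/year down")
```

At 99% this allows about 87.6 hours, or a little over three and a half days, of outage per year.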