Title: Technologies for the Future: CLUSTERS
1Technologies for the Future: CLUSTERS
- Anne C. Elster
- Dept. of Computer and Information Science (IDI)
- Norwegian Univ. of Science and Technology (NTNU)
- Trondheim, Norway
NOTUR 2003
2Clusters (Networks of PCs/Workstations)
- Are they suitable for HPC?
- Advantage
- Cost-effective hardware, since it uses COTS (Commercial Off-The-Shelf) parts
- BUT
- Typically much slower processor interconnects than traditional HPC systems
- What about usability?
NTNU IDI's 40-node AMD 1.46 GHz cluster: 2 GB RAM, 40 GB disk, Fast Ethernet
3Cluster Technologies
NOTUR Emerging Technology project
Collaboration between NTNU and Univ. of
- Goal
- Analyze the suitability of cluster technologies for HPC by looking at some of the most interesting NOTUR applications
- The results will provide a foundation for decisions regarding future HPC programs
4Main Collaborators include
- Anne C. Elster (IDI, NTNU), project leader
- Otto Anshus and Tore Larsen (CS, U of Tromsø)
- Tor Johansen and staff (CC, U of Tromsø)
- Torbjørn Hallgren (IDI, NTNU)
- Einar Rønquist (IMF, NTNU)
- Master's and Ph.D. students and postdocs at NTNU and Univ. of Tromsø
5General Issues to Consider
- Why a cluster vs. a powerful desktop vs. large SMPs?
- What are the total costs associated with clusters (hardware, software, support, usability)?
- 32-bit vs. 64-bit architectures
6Cluster Project ACTIVITIES
- A.1 Profiling and Tuning Selected Applications
- A.1.a/b Physics and Chemistry Codes
- (Elster and students, Dept. of Computer Science, NTNU)
- A.1.2a Profiling and User Analysis of Amber, Dalton and Gaussian
- (Tor Johansen and staff, Comp. Center, U of Tromsø)
- A.1.2b Optimization tool analysis of Dalton
- (Anshus and PostDoc/student, Dept. of Comp. Sci., U of Tromsø)
7Cluster Project ACTIVITIES continued
- A.2 Execution Monitoring
- (Anshus, Tore Larsen and students, CS, U of T)
- A.3 Visualization servers, etc.
- (Hallgren, Elster and students, CS, NTNU)
- A.4 Impact of future numerical algorithms
- (Rønquist and student, Dept. of Mathematics, NTNU)
- A.5 Interface with NOTUR ET Grid Project
- (Elster, Harald Simonsen and colleagues, staff and students associated with the NOTUR ET Cluster and Grid projects)
8A.1.a/b Physics and Chemistry Codes (Elster and students, Dept. of CS, NTNU)
Lessons Learned so far -- Paul Sacks work on a
Physics application (report available on the
- FORTRAN problems
- Different FORTRAN implementations have non-standard add-ons (e.g. FORTRAN 90)
- This makes it very difficult to port code to a different platform with a different Fortran compiler (e.g. from a different vendor)
9A.1.a/b Physics and Chemistry Codes, continued
- The performance of individual programs can vary from machine to machine
- Åsmund Østvold wrote a project report on porting PROTOMOL from an SMP with MPI one-sided communication primitives (MPI put/get) to a cluster (available on WWW)
- He also did an MS study with SCALI on various MPI broadcast algorithms and benchmarking
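As an illustration of why the choice of broadcast algorithm matters, the following minimal sketch compares the number of communication rounds in a naive linear broadcast with a binomial-tree broadcast. It is a plain Python simulation, not MPI code and not the algorithms from the study mentioned above; the function names are hypothetical.

```python
def linear_broadcast(num_ranks, root=0):
    """Root sends to every other rank in turn: one send per round."""
    return [[(root, dst)] for dst in range(num_ranks) if dst != root]


def binomial_broadcast(num_ranks):
    """Rank 0 is the root. In each round, every rank that already holds
    the data forwards it to one rank that does not, so the number of
    holders doubles per round."""
    rounds = []
    have = 1  # ranks 0 .. have-1 currently hold the data
    while have < num_ranks:
        sends = [(src, src + have) for src in range(have) if src + have < num_ranks]
        rounds.append(sends)
        have *= 2
    return rounds


if __name__ == "__main__":
    for p in (8, 32, 40):
        print(p, "ranks:",
              len(linear_broadcast(p)), "rounds linear,",
              len(binomial_broadcast(p)), "rounds binomial tree")
```

On a 40-node cluster the tree needs 6 rounds where the linear scheme needs 39, which is why the per-message latency of the interconnect weighs so heavily on the naive version.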
10A.1.a/b Physics and Chemistry Codes, continued (2)
- Ongoing work with Snorre Boasson and Jan Christian Meyer on porting a PIC code from Pthreads (SMP primitives) to MPI
- Preliminary report will be available later this week
- "Recent Trends in Cluster Computing", presented at ParCo 2003 by Elster et al., includes hardware trends and a survey of libraries and performance
11A.1.2a Profiling and User Analysis of Amber, Dalton and Gaussian (Tor Johansen and staff, Comp. Center, U of Tromsø)
- Coordination work
- Travel to NOTUR 2003
- Porting and testing of Amber and Scali software
12A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/students, CS, U of
- Performance measurements made on DALTON
- A Report for the NOTUR Project Emerging Technologies Cluster
- Daniel Stødle, Otto J. Anshus, John Markus Bjørndalen
- Survey of optimizing techniques for parallel programs running on computer clusters
- Espen S. Johnsen, Otto J. Anshus, John Markus Bjørndalen, Lars Ailo Bongo (September 29, 2003)
13A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/student, IFI, U of Tromsø)
- Dalton scales pretty well: 25x speedup on 32 nodes
- NOTE: Only without caching of temporary data. With caching, only 3-5x speedup on 32 nodes!
- Even though the 8-way cluster had no local disk (only a network file system), the sequential Dalton code was significantly faster.
- This indicates that network bandwidth may not be a problem if caching is used in the parallel
- Communication pattern: master-slave, "bag-of-tasks" oriented programs with little communication and synchronization, and generally good utilization of the slave nodes.
- The master does relatively little work and is blocked most of the time
- Finally, we checked whether the master node could be a bottleneck, but could not detect differences in execution time when the master was placed on a slow node vs. a fast node. NOTE: Only tested up to 32 nodes; using a larger number of nodes may limit performance by overloading the master node.
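To put the speedup figures above in perspective, the sketch below converts them to parallel efficiency (speedup divided by node count) and to the Karp-Flatt experimentally determined serial fraction. Only the speedups (25x, 3x, 5x) and the node count (32) come from the slide; the choice of metrics and the function names are assumptions made for illustration.

```python
def efficiency(speedup, nodes):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / nodes


def karp_flatt(speedup, nodes):
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p).
    Estimates the serial (non-scaling) fraction implied by a measured speedup."""
    return (1.0 / speedup - 1.0 / nodes) / (1.0 - 1.0 / nodes)


if __name__ == "__main__":
    nodes = 32
    cases = [("without caching", 25.0),
             ("with caching (low)", 3.0),
             ("with caching (high)", 5.0)]
    for label, s in cases:
        print(f"{label}: efficiency {efficiency(s, nodes):.1%}, "
              f"serial fraction {karp_flatt(s, nodes):.3f}")
```

A 25x speedup on 32 nodes corresponds to roughly 78% efficiency, while 3-5x corresponds to roughly 9% to 16%. Note that the two cases are measured against different sequential baselines, so a lower efficiency with caching does not by itself mean a longer run time.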
14A.1.2b Optimization tool analysis of Dalton (Anshus and PostDoc/student, IFI, U of Tromsø)
- Thanks to
- Kenneth Ruud, Chemistry, UiT
- Roy Dragseth, CC UiT, for support on the Itanium at U of Tromsø.
15A.2 Execution Monitoring (Anshus, Tore Larsen and students, CS, U of T)
- Survey of execution monitoring tools for computer clusters
- Espen S. Johnsen, Otto J. Anshus, John Markus Bjørndalen, Lars Ailo Bongo, Sept. 2003
- Performance Monitoring
- Lars Ailo Bongo, Otto J. Anshus, John Markus
16A.3 Visualization servers, etc. (Hallgren, Elster and students, CS, NTNU)
- Ongoing work with Torbjørn Vik
- Preliminary report on a survey of how clusters are currently used in visualization
- Two types of cluster usage
- Off-line (non-real-time rendering). Often called "render farms", with lots of nodes that each work on one frame of a larger animation.
- Typically used in the film industry and other areas where interactivity and/or real-time rendering is not needed.
- All larger 3D modelling programs such as Lightwave, 3DStudio and Maya have functionality for this.
- On-line (real-time). Most interesting from a technical viewpoint...
17A.3 Visualization servers, etc., continued
- Clusters are used in interactive visualization software to
- increase performance,
- enable larger data sets,
- avoid limitations in local hardware.
- Most visualization clusters work, in principle, by having the user sit at a client machine that has little capacity of its own. The cluster handles all computation and sends only the finished images to the client. The client machine also receives input from the user and forwards it to the cluster. Data sets for this kind of visualization are often very large, and, depending on the situation, both polygon-based and voxel-based rendering are used.
- The main problem in making clusters usable for interactive visualization programs is network-induced delay. This is addressed by reducing the time spent transferring images between cluster and client, either by
- reducing the amount of data (compression methods) or
- increasing network performance. Or both.
- Parallelism within the cluster itself is based on independence between different data. There may be independence between different parts of the same data set, or between different frames in a 4D data set.
- Load balancing often becomes a problem in such settings and is an important research area.
- Which load-balancing method is used is most often highly context dependent.
- Cluster software for visualization still lacking??
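The trade-off between data volume and network performance described above can be made concrete with a back-of-the-envelope estimate. The sketch below assumes an uncompressed 1024x768 frame at 3 bytes per pixel and a nominal 100 Mbit/s Fast Ethernet link with no protocol overhead; the resolution, the compression ratio, and the function names are illustrative assumptions, not figures from the project.

```python
def frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel


def transfer_seconds(num_bytes, bandwidth_bits_per_s, latency_s=0.0):
    """Time to move num_bytes over a link: fixed latency plus serialization time."""
    return latency_s + num_bytes * 8 / bandwidth_bits_per_s


def max_fps(width, height, bandwidth_bits_per_s, compression_ratio=1.0, latency_s=0.0):
    """Upper bound on frames per second if the link is the only bottleneck."""
    payload = frame_bytes(width, height) / compression_ratio
    return 1.0 / transfer_seconds(payload, bandwidth_bits_per_s, latency_s)


if __name__ == "__main__":
    fast_ethernet = 100e6  # nominal bits per second (assumption: no overhead)
    print(f"uncompressed:     {max_fps(1024, 768, fast_ethernet):.1f} fps")
    print(f"10:1 compression: {max_fps(1024, 768, fast_ethernet, compression_ratio=10):.1f} fps")
```

Under these assumptions the uncompressed stream tops out at about 5 frames per second, well below interactive rates, while a 10:1 compression ratio or a ten times faster network would each raise the bound to roughly 53.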
18A.4 Impact of future numerical algorithms (Rønquist and student, Dept. of Mathematics, NTNU)
- Rønquist's student Staff (now at Simulasenteret) wrote a report based on his summer job
- May add in experiences from Elster's group fall
19A.5 Interface with NOTUR ET Grid Project (Elster, Harald Simonsen and colleagues, staff and students associated with the NOTUR ET Cluster and Grid projects)
- Test node established at NTNU
- Andreas Botnen (USIT) and
- Robin Holtet (IDI, now ITEA)
- May use IDI's 30-40-node cluster in the test grid
- Meetings
- Between Elster's and Simonsen's groups
- Robin Holtet and Elster's student Thorvald Natvig to Linköping meeting this month.
- Collaborations re. National GRID and EEGE
- Student from NTNU and UiO at CERN
20Main cluster issues
- Global operations have a more severe impact on cluster performance than on traditional supercomputers, since communication between processors takes relatively more of the total execution time
- SCALABILITY!!
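A toy cost model shows how global operations limit scalability when interconnect latency is high. This is a minimal sketch under stated assumptions: perfectly parallel compute, each global operation costing ceil(log2 p) latency-bound steps (as in a tree-based reduction), and made-up workload and latency values. None of the numbers are measurements from the project.

```python
def tree_steps(p):
    """Steps in a tree-based global operation over p processes: ceil(log2(p))."""
    return (p - 1).bit_length()


def run_time(p, compute_s, global_ops, latency_s):
    """Compute time divides evenly by p; every global operation adds latency-bound steps."""
    return compute_s / p + global_ops * tree_steps(p) * latency_s


def speedup(p, compute_s, global_ops, latency_s):
    """Speedup of the p-process run relative to the single-process run."""
    return (run_time(1, compute_s, global_ops, latency_s)
            / run_time(p, compute_s, global_ops, latency_s))


if __name__ == "__main__":
    compute_s, global_ops = 100.0, 10_000  # hypothetical workload
    for name, latency in [("high-latency network (100 us)", 100e-6),
                          ("low-latency interconnect (5 us)", 5e-6)]:
        for p in (8, 32, 128):
            print(f"{name}, p={p}: speedup {speedup(p, compute_s, global_ops, latency):.1f}")
```

In this model the communication term grows with p while the compute term shrinks, so the high-latency case flattens out long before the low-latency one.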
21Lessons learned
- Clusters generally have cheap hardware, but may cause increased hidden costs regarding
- More incompatible compilers, especially Fortran 90 (also C)
- Some applications are non-trivial to port from a shared-memory paradigm to a distributed-memory paradigm
- Some applications require high-bandwidth interconnects, which drive up costs (e.g. SGI Altix)
- Power and cooling costs (ref. Brian Vinter)
- Stability, recovery
- Overall costs and scalability should be further
22The Ideal Cluster -- Hardware
- High-bandwidth network
- Low-latency network
- Low operating system overhead (TCP causes slow start)
- Great floating-point performance
- (64-bit processors or more?)
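The remark about TCP slow start can be illustrated with an idealized model: a new connection starts with a small congestion window and doubles it every round trip, so a medium-sized message needs several round trips before the link is used fully. The sketch below assumes no packet loss, no receiver-window limit, a 1460-byte segment size, and an initial window of one segment; real TCP stacks differ in all of these, and the function name is hypothetical.

```python
def slow_start_round_trips(message_bytes, mss=1460, initial_cwnd=1):
    """Round trips needed to send a message when the congestion window
    starts at initial_cwnd segments and doubles after every round trip."""
    segments = -(-message_bytes // mss)  # ceiling division
    cwnd, sent, round_trips = initial_cwnd, 0, 0
    while sent < segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # idealized exponential growth
        round_trips += 1
    return round_trips


if __name__ == "__main__":
    for size in (1_460, 65_536, 1_048_576):
        print(f"{size} bytes: {slow_start_round_trips(size)} round trips")
```

Under these assumptions a 64 KiB message takes 6 round trips, so the round-trip latency of the network is paid several times over for each such message.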
23The Ideal Cluster -- Software
- Compiler that is
- Portable
- Optimizing
- Do extra work to save communication
- Self-tuning / load-balanced
- Automatic selection of best algorithm
- One-sided communication support?
- Optimized middleware
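One way to read "self-tuning" and "automatic selection of best algorithm" is empirical tuning: time several interchangeable implementations on a representative input and keep the fastest. The following is a minimal sketch of that idea, not a description of any existing compiler or library; `pick_fastest` and the two candidate functions are hypothetical.

```python
import time


def pick_fastest(candidates, sample_input, repeats=5):
    """Time each candidate on sample_input (best of `repeats` runs) and
    return the name of the fastest one together with all timings."""
    timings = {}
    for name, func in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func(sample_input)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings


def sum_builtin(values):
    """Candidate 1: the built-in sum."""
    return sum(values)


def sum_loop(values):
    """Candidate 2: an explicit accumulation loop."""
    total = 0
    for v in values:
        total += v
    return total


if __name__ == "__main__":
    candidates = {"builtin": sum_builtin, "loop": sum_loop}
    winner, timings = pick_fastest(candidates, list(range(100_000)))
    print("selected:", winner, timings)
```

The same pattern could be applied to communication routines, for example choosing between broadcast algorithms per message size and node count, which is where the selection would pay off on a cluster.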
24For more information
- A dozen or more reports associated with this project will be made available on the web at
- http://www.idi.ntnu.no/elster
- Email: elster_at_idi.ntnu.no