Title: Microsoft HPC Institute Bi-Annual Meeting
1 Microsoft HPC Institute Bi-Annual Meeting
Jack Dongarra
Innovative Computing Laboratory
University of Tennessee and Oak Ridge National Laboratory
2 Our Microsoft Cluster
- 24 custom-built nodes from TeamHPC
  - Dual-socket AMD Opteron 265 (dual-core) 1.8 GHz processors (total of 96 processors)
  - 4 GB RAM / node
  - 80 GB SATA hard drive / node
- Windows Server 2003 R2 x64 Edition
- Microsoft Compute Cluster Edition 2003
- NForce Gigabit NIC
- SilverStorm 10 Gb/s InfiniBand NICs
- Coming soon
  - Mellanox 20 Gb/s DDR InfiniBand NICs
    - Drivers don't support dual cards today
  - Myricom 10G NICs and 16-port switch
3 Three Thrust Research Areas
- Numerical Linear Algebra Algorithms and Software
  - BLAS, LAPACK, ScaLAPACK, PBLAS, ATLAS
  - Numerical libraries for multicore
  - Self-Adapting Numerical Algorithms (SANS) effort
    - Generic Code Optimization, ATLAS
  - LAPACK for Clusters: easy access to clusters
    - Access to clusters for linear algebra software via Matlab, Mathematica, Python, etc. on the front end
- Heterogeneous Distributed Computing
  - PVM, MPI
  - GridSolve, FT-MPI, Open-MPI
- Performance Evaluation
  - PAPI, HPC Challenge
4 GridSolve
- Grid-based hardware/software/data server
- RPC-style (GridRPC) clients
  - Do not need to know about services in advance
- Agent, servers, proxy
  - Service discovery, dynamic problem solving, load balancing, fault tolerance, asynchronicity, disconnected operation, NAT tolerance
- Easy, transparent access to resources
- Clients: Matlab, C, Fortran, Mathematica, Octave, Java (NetSolve)
- Ease of use is the paramount goal (a GridRPC client sketch follows this list)
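The sketch below shows the shape of a C client using the standard GridRPC calls (grpc_initialize, grpc_function_handle_default, grpc_call), which GridSolve implements. The header name, the "ddot" service name, and the argument order are illustrative assumptions, not GridSolve specifics; a real service defines its own interface.

```c
/* Minimal GridRPC client lifecycle: a sketch, not a GridSolve recipe.
 * Assumptions: header "grpc.h", service name "ddot", argument order. */
#include <stdio.h>
#include "grpc.h"

int main(int argc, char **argv)
{
    grpc_function_handle_t h;
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    double result = 0.0;
    int n = 3;

    /* contact the agent; config file path is implementation-specific */
    if (grpc_initialize(argc > 1 ? argv[1] : NULL) != GRPC_NO_ERROR)
        return 1;

    /* the agent selects a server offering the named service */
    grpc_function_handle_default(&h, "ddot");

    /* blocking remote call; arguments follow the service's interface */
    if (grpc_call(&h, n, x, y, &result) == GRPC_NO_ERROR)
        printf("ddot = %f\n", result);

    grpc_function_handle_destruct(&h);
    grpc_finalize();
    return 0;
}
```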
5 GridSolve Architecture
[Diagram: the client sends a request to the agent and receives a server list; the client then sends data to a cluster server and gets the result back. The agent handles resource discovery, scheduling, load balancing, and fault tolerance.]
6 GridSolve on Windows Cluster
- Several efforts to get GridSolve working with Windows
- Native Windows client (note: client only)
- Options
  - Using Cygwin (agent and server)
  - Using SUA (Subsystem for UNIX-based Applications; Interix, SFU)
- Under development: native Windows agent and server
7 Performance Evaluation Tools (http://icl.cs.utk.edu/papi)
- Performance Application Programming Interface (PAPI)
  - A portable library to access hardware counters found on processors
  - Provides a standardized list of performance metrics (a usage sketch follows this list)
- KOJAK (joint with Felix Wolf)
  - Software package for the automatic performance analysis of parallel apps
  - Message passing and multi-threading (MPI and/or OpenMP)
  - Parallel performance
  - CPU and memory performance
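As a rough illustration of how an application reads hardware counters through PAPI, the sketch below counts cycles and floating-point operations around a loop. Event availability (and whether these two events can be counted together) depends on the processor, so real code should check PAPI_query_event first.

```c
/* Hedged PAPI sketch: measure a code region with the low-level API. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];
    double a = 0.0;
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(eventset, PAPI_FP_OPS);    /* floating-point operations */

    PAPI_start(eventset);
    for (i = 0; i < 1000000; i++)             /* region being measured */
        a += 1.0e-6 * i;
    PAPI_stop(eventset, values);

    printf("cycles = %lld, fp ops = %lld (a = %g)\n",
           values[0], values[1], a);
    return 0;
}
```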
8 Where's PAPI?
- PAPI runs on most modern processors and operating systems of interest to HPC
  - IBM POWER3, 4, 5 / AIX
  - POWER4, 5 / Linux
  - PowerPC-32 and -64 / Linux
  - Blue Gene / CNK
  - Intel Pentium II, III, 4, M, D, EM64T, etc. / Linux
  - Intel Itanium
  - AMD Athlon, Opteron / Linux
  - Cray T3E, X1, XD3, XT3 / Catamount
  - Altix, Sparc, ...
9 Perfometer
[Screenshot: code regions instrumented with Perfometer calls, e.g. "Call Perfometer(red)" and "Call Perfometer(blue)".]
10 Tools Using PAPI, e.g. Perfometer
11 (No Transcript)
12 PAPI Design
13 PAPI / Windows Limitations
- Counter state isn't saved on context switch
  - Can only count CPU-wide
  - Can't migrate tasks (using processor affinity)
  - Can't share counters among users
- Need kernel modifications
  - To preserve counter state on context switch
14 Linear Algebra Software Packages (http://icl.cs.utk.edu/lapack/)
- LAPACK (a minimal call from C is sketched after this list)
  - Used by Matlab, Mathematica, Numeric Python, ...
  - Tuned versions provided by vendors (AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics) and by most Linux distributions (Fedora, Debian, Cygwin, ...)
  - Ongoing work: multicore, performance, accuracy, extended precision, ease of use
- ScaLAPACK
  - Parallel implementation of LAPACK, scaling on parallel hardware from 10s to 100s to 1000s of processors
  - Ongoing work: target new architectures and new parallel environments, for example a port to the Microsoft HPC cluster solution
- LAPACK for Clusters (LFC)
  - Most of ScaLAPACK's functionality from serial clients (Matlab, Python, Mathematica)
  - Ongoing work: looking at sparse data and I/O scenarios, web services
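For readers new to LAPACK, the sketch below solves a small Ax = b system by calling dgesv from C, assuming the common Fortran calling convention (trailing underscore, arguments by reference, column-major storage) and linking against any LAPACK/BLAS, e.g. MKL.

```c
/* Hedged sketch: solve Ax = b with LAPACK's dgesv from C. */
#include <stdio.h>

extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* 2x2 system in column-major order: A = [3 1; 1 2], b = [9; 8] */
    double A[4] = {3.0, 1.0, 1.0, 2.0};
    double b[2] = {9.0, 8.0};
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);   /* expect x = (2, 3) */
    else
        printf("dgesv failed: info = %d\n", info);
    return 0;
}
```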
15 Parallelism in LAPACK / ScaLAPACK
[Diagram: two software stacks. Shared memory: LAPACK on top of ATLAS and specialized parallel BLAS (threads). Distributed memory: ScaLAPACK on top of the PBLAS and BLACS (MPI).]
16 Installation and Testing of LAPACK, BLACS, and ScaLAPACK
- Uses Intel ifort and icl; BLAS from Intel MKL
- No problems at installation: LAPACK, BLACS, and ScaLAPACK use Makefiles with make.inc files to set the environment variables
- Tests are fine
  - (Note that the BLACS testing is a fairly hard test for MPI, and LAPACK a fairly hard test for IEEE arithmetic and compiler semantics.)
- Can be used from Microsoft Visual Studio
- DLL coming soon
17 LFC Overview
- Client/Server
  - Separation via sockets
  - Server objects are large
  - Server runs in parallel
  - Client objects hold references to server objects
- Process spawning
  - Separation via system calls
  - mpirun/mpiexec starts the parallel job
[Diagram: client and parallel server processes connected through a tunnel (IP, TCP, ...), possibly crossing a firewall.]
18 FT-MPI (http://icl.cs.utk.edu/ft-mpi/)
- Defines the behavior of MPI in the event a failure occurs at the process level
- FT-MPI is based on MPI 1.3 (plus some MPI-2 features) with a fault-tolerant model similar to what was done in PVM
- Complete reimplementation, not based on other implementations
- Gives the application the possibility to recover from a process failure (an error-handling sketch follows this list)
- A regular, non-fault-tolerant MPI program will run under FT-MPI
- What FT-MPI does not do:
  - Recover user data (e.g. automatic checkpointing)
  - Provide transparent fault tolerance
- Open-MPI for MS
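The sketch below is plain MPI, not FT-MPI's own recovery API: it shows only the first step of application-level fault handling, making errors non-fatal so the program sees the error code and can react. FT-MPI extends this by letting the surviving processes rebuild the communicator, which standard MPI-1 cannot do.

```c
/* Hedged sketch: non-fatal MPI error handling (the hook FT-MPI builds on). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, rc;
    double x = 1.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Default is MPI_ERRORS_ARE_FATAL; return errors to the caller instead. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Under FT-MPI the application could recover (e.g. rebuild the
         * communicator) here; in plain MPI we can only report and abort. */
        fprintf(stderr, "rank %d: Allreduce failed (rc = %d)\n", rank, rc);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    if (rank == 0)
        printf("sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}
```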
19 Open-MPI Collaborators
- Los Alamos National Lab (LA-MPI)
- Sandia National Lab
- Indiana U (LAM/MPI)
- U of Tennessee (FT-MPI)
- HLRS, U of Stuttgart (PACX-MPI)
- U of Houston
- Cisco Systems
- Mellanox
- Voltaire
- Sun
- IBM
- Myricom
- URL: www.open-mpi.org
20 Open-MPI: A Convergence of Ideas
[Diagram: Open MPI draws on FT-MPI (U of TN), LA-MPI (LANL), LAM/MPI (IU), and PACX-MPI (HLRS); OpenRTE draws on fault detection (LANL, industry), FDDP (semiconductor manufacturing industry), resilient computing systems, robustness (CSU), grid (many), and autonomous computing (many).]
21 HPC Challenge Goals
- To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
  - HPL works well on all architectures, even cache-based, distributed-memory multiprocessors, due to:
    - Extensive memory reuse
    - Scalability with respect to the amount of computation
    - Scalability with respect to the communication volume
    - Extensive optimization of the software
- To complement the Top500 list
- To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g. spatial and temporal locality
22 HPCS/HPCC Performance Targets
- HPL: linear system solve, Ax = b
- STREAM: vector operations, A = B + s * C
- FFT: 1-D Fast Fourier Transform, Z = FFT(X)
- RandomAccess: integer update, T[i] = XOR(T[i], rand)
(C sketches of the STREAM and RandomAccess kernels follow this list.)
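To make the memory-access contrast concrete, here are rough sketches of two of the kernels' inner loops. Array sizes and the pseudo-random stream are simplified stand-ins; the official HPCC benchmark fixes these by rule (e.g. its own RandomAccess generator).

```c
/* Hedged sketches of HPCC-style kernel loops, not the official benchmark. */
#include <stdint.h>
#include <stddef.h>

/* STREAM triad: A = B + s*C, streams through memory with no reuse. */
void stream_triad(double *A, const double *B, const double *C,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = B[i] + s * C[i];
}

/* RandomAccess-style update: T[i] ^= rand, with essentially no locality.
 * A simple xorshift generator stands in for the benchmark's generator. */
void random_access(uint64_t *T, size_t table_size, size_t n_updates)
{
    uint64_t r = 0x123456789abcdef0ULL;
    for (size_t k = 0; k < n_updates; k++) {
        r ^= r << 13; r ^= r >> 7; r ^= r << 17;   /* xorshift64 */
        T[r & (table_size - 1)] ^= r;              /* table_size: power of 2 */
    }
}
```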
[Diagram: memory hierarchy (registers, cache(s), local memory, remote memory, disk, tape) with the data units moved between levels (instructions/operands, lines/blocks, messages, pages); the HPC Challenge kernels and the HPCS performance targets are mapped onto this hierarchy.]
- HPCC was developed by HPCS to assist in testing new HEC systems
- Each benchmark focuses on a different part of the memory hierarchy
- HPCS performance targets attempt to:
  - Flatten the memory hierarchy
  - Improve real application performance
  - Make programming easier
23 (No Transcript)
24 Testbed for Benchmarking
- Would like to set up a cluster with different interconnects
  - GigE
  - Various InfiniBand
  - Myrinet
  - Etc.
- Different OSs
  - Linux
  - Windows
- Make it available to the community for testing