Title: Microsoft HPC Institute Bi-Annual Meeting
1 Microsoft HPC Institute Bi-Annual Meeting
Jack Dongarra
Innovative Computing Laboratory
University of Tennessee and Oak Ridge National Laboratory
2 Our Microsoft Cluster
- 24 custom-built nodes from TeamHPC
  - Dual-socket AMD Opteron 265 (dual-core) 1.8 GHz processors (total of 96 processors)
  - 4 GB RAM / node
  - 80 GB SATA hard drive / node
- Windows Server 2003 R2 x64 Edition
- Microsoft Compute Cluster Edition 2003
- NForce Gigabit NIC
- SilverStorm 10 Gb/s InfiniBand NICs
- Coming soon
  - Mellanox 20 Gb/s DDR InfiniBand NICs
    - Drivers don't support dual cards today
  - Myricom 10G NICs and 16-port switch
3 Three Thrust Research Areas
- Numerical Linear Algebra Algorithms and Software
  - BLAS, LAPACK, ScaLAPACK, PBLAS, ATLAS
  - Numerical libraries for multicore
  - Self-Adapting Numerical Algorithms (SANS) effort
    - Generic Code Optimization, ATLAS
  - LAPACK for Clusters: easy access to clusters
    - Access to clusters for linear algebra software via Matlab, Mathematica, Python, etc. on the front end
- Heterogeneous Distributed Computing
  - PVM, MPI
  - GridSolve, FT-MPI, Open-MPI
- Performance Evaluation
  - PAPI, HPC Challenge
4 GridSolve
- Grid-based hardware/software/data server
- RPC-style (GridRPC) clients
  - Do not need to know about services in advance
- Agent, servers, proxy
  - Service discovery, dynamic problem solving, load balancing, fault tolerance, asynchronicity, disconnected operation, NAT tolerance
- Easy, transparent access to resources
- Clients: Matlab, C, Fortran, Mathematica, Octave, Java (NetSolve)
- Ease of use is the paramount goal (a GridRPC client sketch follows this list)
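The sketch below shows the shape of a C client using the standard GridRPC calls (grpc_initialize, grpc_function_handle_default, grpc_call), which GridSolve implements. The header name, the "ddot" service name, and the argument order are illustrative assumptions, not GridSolve specifics; a real service defines its own interface.

```c
/* Minimal GridRPC client lifecycle: a sketch, not a GridSolve recipe.
 * Assumptions: header "grpc.h", service name "ddot", argument order. */
#include <stdio.h>
#include "grpc.h"

int main(int argc, char **argv)
{
    grpc_function_handle_t h;
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    double result = 0.0;
    int n = 3;

    /* contact the agent; config file path is implementation-specific */
    if (grpc_initialize(argc > 1 ? argv[1] : NULL) != GRPC_NO_ERROR)
        return 1;

    /* the agent selects a server offering the named service */
    grpc_function_handle_default(&h, "ddot");

    /* blocking remote call; arguments follow the service's interface */
    if (grpc_call(&h, n, x, y, &result) == GRPC_NO_ERROR)
        printf("ddot = %f\n", result);

    grpc_function_handle_destruct(&h);
    grpc_finalize();
    return 0;
}
```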
5 GridSolve Architecture
[Diagram: the client sends a request to the agent and receives a server list; the client then sends data to a cluster server and gets the result back. The agent handles resource discovery, scheduling, load balancing, and fault tolerance.]
6 GridSolve on Windows Cluster
- Several efforts to get GridSolve working with Windows
- Native Windows client (note: client only)
- Options
  - Using Cygwin (agent and server)
  - Using SUA (Subsystem for UNIX-based Applications; Interix, SFU)
- Under development: native Windows agent and server
7 Performance Evaluation Tools (http://icl.cs.utk.edu/papi)
- Performance Application Programming Interface (PAPI)
  - A portable library to access hardware counters found on processors
  - Provides a standardized list of performance metrics (a usage sketch follows this list)
- KOJAK (joint with Felix Wolf)
  - Software package for the automatic performance analysis of parallel apps
  - Message passing and multi-threading (MPI and/or OpenMP)
  - Parallel performance
  - CPU and memory performance
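As a rough illustration of how an application reads hardware counters through PAPI, the sketch below counts cycles and floating-point operations around a loop. Event availability (and whether these two events can be counted together) depends on the processor, so real code should check PAPI_query_event first.

```c
/* Hedged PAPI sketch: measure a code region with the low-level API. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long values[2];
    double a = 0.0;
    int i;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(eventset, PAPI_FP_OPS);    /* floating-point operations */

    PAPI_start(eventset);
    for (i = 0; i < 1000000; i++)             /* region being measured */
        a += 1.0e-6 * i;
    PAPI_stop(eventset, values);

    printf("cycles = %lld, fp ops = %lld (a = %g)\n",
           values[0], values[1], a);
    return 0;
}
```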
8 Where's PAPI?
- PAPI runs on most modern processors and operating systems of interest to HPC
  - IBM POWER3, 4, 5 / AIX
  - POWER4, 5 / Linux
  - PowerPC-32 and -64 / Linux
  - Blue Gene / CNK
  - Intel Pentium II, III, 4, M, D, EM64T, etc. / Linux
  - Intel Itanium
  - AMD Athlon, Opteron / Linux
  - Cray T3E, X1, XD3, XT3 / Catamount
  - Altix, Sparc, ...
9 Perfometer
[Screenshot: code regions instrumented with Perfometer calls, e.g. "Call Perfometer(red)" and "Call Perfometer(blue)".]
10 Tools Using PAPI, e.g. Perfometer
11 (No Transcript)
12 PAPI Design
13 PAPI / Windows Limitations
- Counter state isn't saved on context switch
  - Can only count CPU-wide
  - Can't migrate tasks (using processor affinity)
  - Can't share counters among users
- Need kernel modifications
  - To preserve counter state on context switch
14 Linear Algebra Software Packages (http://icl.cs.utk.edu/lapack/)
- LAPACK (a minimal call from C is sketched after this list)
  - Used by Matlab, Mathematica, Numeric Python, ...
  - Tuned versions provided by vendors (AMD, Apple, Compaq, Cray, Fujitsu, Hewlett-Packard, Hitachi, IBM, Intel, MathWorks, NAG, NEC, PGI, SUN, Visual Numerics) and by most Linux distributions (Fedora, Debian, Cygwin, ...)
  - Ongoing work: multicore, performance, accuracy, extended precision, ease of use
- ScaLAPACK
  - Parallel implementation of LAPACK, scaling on parallel hardware from 10s to 100s to 1000s of processors
  - Ongoing work: target new architectures and new parallel environments, for example a port to the Microsoft HPC cluster solution
- LAPACK for Clusters (LFC)
  - Most of ScaLAPACK's functionality from serial clients (Matlab, Python, Mathematica)
  - Ongoing work: looking at sparse data and I/O scenarios, web services
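For readers new to LAPACK, the sketch below solves a small Ax = b system by calling dgesv from C, assuming the common Fortran calling convention (trailing underscore, arguments by reference, column-major storage) and linking against any LAPACK/BLAS, e.g. MKL.

```c
/* Hedged sketch: solve Ax = b with LAPACK's dgesv from C. */
#include <stdio.h>

extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* 2x2 system in column-major order: A = [3 1; 1 2], b = [9; 8] */
    double A[4] = {3.0, 1.0, 1.0, 2.0};
    double b[2] = {9.0, 8.0};
    int n = 2, nrhs = 1, lda = 2, ldb = 2, ipiv[2], info;

    dgesv_(&n, &nrhs, A, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);   /* expect x = (2, 3) */
    else
        printf("dgesv failed: info = %d\n", info);
    return 0;
}
```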
15 Parallelism in LAPACK / ScaLAPACK
[Diagram: two software stacks. Shared memory: LAPACK on top of ATLAS and specialized parallel BLAS (threads). Distributed memory: ScaLAPACK on top of the PBLAS and BLACS (MPI).]
16 Installation and Testing of LAPACK, BLACS, and ScaLAPACK
- Uses Intel ifort and icl; BLAS from Intel MKL
- No problems at installation: LAPACK, BLACS, and ScaLAPACK use Makefiles with make.inc files to set the environment variables
- Tests are fine
  - (Note that the BLACS testing is a fairly hard test for MPI, and LAPACK a fairly hard test for IEEE arithmetic and compiler semantics.)
- Can be used from Microsoft Visual Studio
- DLL coming soon
17 LFC Overview
- Client/Server
  - Separation via sockets
  - Server objects are large
  - Server runs in parallel
  - Client objects hold references to server objects
- Process spawning
  - Separation via system calls
  - mpirun/mpiexec starts the parallel job
[Diagram: client and parallel server processes connected through a tunnel (IP, TCP, ...), possibly crossing a firewall.]
18 FT-MPI (http://icl.cs.utk.edu/ft-mpi/)
- Defines the behavior of MPI in the event a failure occurs at the process level
- FT-MPI is based on MPI 1.3 (plus some MPI-2 features) with a fault-tolerant model similar to what was done in PVM
- Complete reimplementation, not based on other implementations
- Gives the application the possibility to recover from a process failure (an error-handling sketch follows this list)
- A regular, non-fault-tolerant MPI program will run under FT-MPI
- What FT-MPI does not do:
  - Recover user data (e.g. automatic checkpointing)
  - Provide transparent fault tolerance
- Open-MPI for MS
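The sketch below is plain MPI, not FT-MPI's own recovery API: it shows only the first step of application-level fault handling, making errors non-fatal so the program sees the error code and can react. FT-MPI extends this by letting the surviving processes rebuild the communicator, which standard MPI-1 cannot do.

```c
/* Hedged sketch: non-fatal MPI error handling (the hook FT-MPI builds on). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, rc;
    double x = 1.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Default is MPI_ERRORS_ARE_FATAL; return errors to the caller instead. */
    MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Allreduce(&x, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        /* Under FT-MPI the application could recover (e.g. rebuild the
         * communicator) here; in plain MPI we can only report and abort. */
        fprintf(stderr, "rank %d: Allreduce failed (rc = %d)\n", rank, rc);
        MPI_Abort(MPI_COMM_WORLD, rc);
    }

    if (rank == 0)
        printf("sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}
```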
19 Open-MPI Collaborators
- Los Alamos National Lab (LA-MPI)
- Sandia National Lab
- Indiana U (LAM/MPI)
- U of Tennessee (FT-MPI)
- HLRS, U of Stuttgart (PACX-MPI)
- U of Houston
- Cisco Systems
- Mellanox
- Voltaire
- Sun
- IBM
- Myricom
- URL: www.open-mpi.org
20 Open-MPI: A Convergence of Ideas
[Diagram: Open MPI draws on FT-MPI (U of TN), LA-MPI (LANL), LAM/MPI (IU), and PACX-MPI (HLRS); OpenRTE draws on fault detection (LANL, industry), FDDP (semiconductor manufacturing industry), resilient computing systems, robustness (CSU), grid (many), and autonomous computing (many).]
21 HPC Challenge Goals
- To examine the performance of HPC architectures using kernels with more challenging memory access patterns than HPL
  - HPL works well on all architectures, even cache-based, distributed-memory multiprocessors, due to:
    - Extensive memory reuse
    - Scalability with respect to the amount of computation
    - Scalability with respect to the communication volume
    - Extensive optimization of the software
- To complement the Top500 list
- To provide benchmarks that bound the performance of many real applications as a function of memory access characteristics, e.g. spatial and temporal locality
22 HPCS/HPCC Performance Targets
- HPL: linear system solve, Ax = b
- STREAM: vector operations, A = B + s * C
- FFT: 1-D Fast Fourier Transform, Z = FFT(X)
- RandomAccess: integer update, T[i] = XOR(T[i], rand)
(C sketches of the STREAM and RandomAccess kernels follow this list.)
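To make the memory-access contrast concrete, here are rough sketches of two of the kernels' inner loops. Array sizes and the pseudo-random stream are simplified stand-ins; the official HPCC benchmark fixes these by rule (e.g. its own RandomAccess generator).

```c
/* Hedged sketches of HPCC-style kernel loops, not the official benchmark. */
#include <stdint.h>
#include <stddef.h>

/* STREAM triad: A = B + s*C, streams through memory with no reuse. */
void stream_triad(double *A, const double *B, const double *C,
                  double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = B[i] + s * C[i];
}

/* RandomAccess-style update: T[i] ^= rand, with essentially no locality.
 * A simple xorshift generator stands in for the benchmark's generator. */
void random_access(uint64_t *T, size_t table_size, size_t n_updates)
{
    uint64_t r = 0x123456789abcdef0ULL;
    for (size_t k = 0; k < n_updates; k++) {
        r ^= r << 13; r ^= r >> 7; r ^= r << 17;   /* xorshift64 */
        T[r & (table_size - 1)] ^= r;              /* table_size: power of 2 */
    }
}
```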
[Diagram: memory hierarchy (registers, cache(s), local memory, remote memory, disk, tape) with the data units moved between levels (instructions/operands, lines/blocks, messages, pages); the HPC Challenge kernels and the HPCS performance targets are mapped onto this hierarchy.]
- HPCC was developed by HPCS to assist in testing new HEC systems
- Each benchmark focuses on a different part of the memory hierarchy
- HPCS performance targets attempt to:
  - Flatten the memory hierarchy
  - Improve real application performance
  - Make programming easier
23 (No Transcript)
24 Testbed for Benchmarking
- Would like to set up a cluster with different interconnects
  - GigE
  - Various InfiniBand
  - Myrinet
  - Etc.
- Different OSs
  - Linux
  - Windows
- Make it available to the community for testing