Title: High Performance Computing
1 High Performance Computing
- R&D
- rnd@ciit.net.pk
- http://rnd.ciit.net.pk
2 Objectives
- At the end of this talk you will
- Understand HPC concepts
- Be able to describe various HPC paradigms
- Distinguish between HPC programming technologies
- Understand the physical cluster architecture
- Understand application building
- Be informed about the CIIT Cluster resources
3Outline
- Introduction
- Why we need powerful computers
- Why powerful computers are parallel
- Parallel computers, yesterday and today
- Issues in parallel performance
- What are Clusters
- Parallel application programming APIs
- CIIT computation resources
- Application areas
- Q&A
4Why do we need powerful computers?
5 Simulation: The Third Pillar of Science
- Traditional scientific and engineering paradigm
- Do theory or paper design.
- Perform experiments or build the system.
- Limitations
- Too difficult -- build large wind tunnels.
- Too expensive -- build a throw-away passenger jet.
- Too slow -- wait for climate or galactic evolution.
- Too dangerous -- weapons, drug design, climate experiments.
- Computational science paradigm
- Use high performance computer systems to simulate the phenomenon.
- Based on known physical laws and efficient numerical methods.
6Some Challenging Computations
- Science
- Global climate modeling
- Astrophysical modeling
- Biology: genomics, protein folding, drug design
- Computational Chemistry
- Computational Material Sciences and Nanosciences
- Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
- Business
- Financial and economic modeling
- Transaction processing, web services and search engines
- Defense
- Nuclear weapons -- test by simulation
- Cryptography
7 Units of Measure in HPC
- High Performance Computing (HPC) units are
- Flop/s: floating point operations per second
- Typical sizes are millions, billions, trillions
- Mega   Mflop/s = 10^6 flop/sec    Mbyte = 10^6 bytes
- Giga   Gflop/s = 10^9 flop/sec    Gbyte = 10^9 bytes
- Tera   Tflop/s = 10^12 flop/sec   Tbyte = 10^12 bytes
- Peta   Pflop/s = 10^15 flop/sec   Pbyte = 10^15 bytes
- Exa    Eflop/s = 10^18 flop/sec   Ebyte = 10^18 bytes
8 Global Climate Modeling Problem
- Problem is to compute
- f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity
- Approach
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict the weather at time t+1 given time t
- Uses
- Predict major events, e.g., El Niño
- Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
9 Global Climate Modeling Computation
- One piece is modeling the fluid flow in the atmosphere
- Solve the Navier-Stokes equations
- Roughly 100 flops per grid point with a 1-minute timestep
- Computational requirements
- To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours) -> 56 Gflop/s
- Climate prediction (50 years in 30 days) -> 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours) -> 288 Tflop/s
- To double the grid resolution, computation is at least 8x
- State-of-the-art models require integration of atmosphere, ocean, sea-ice and land models, plus possibly carbon cycle, geochemistry and more
- Current models are coarser than this
- http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
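The rates above follow directly from the 5 x 10^11 flops-per-simulated-minute figure; the short C sketch below (not part of the original slides) redoes the arithmetic. It does not round, so its values come out slightly above the slide's, which uses 8 Gflop/s as the base rate.

/* Back-of-the-envelope check of the climate-model rates quoted above.
 * The only input taken from the slide is ~5e11 flops per simulated minute. */
#include <stdio.h>

int main(void)
{
    const double flops_per_sim_minute = 5e11;
    const double realtime = flops_per_sim_minute / 60.0;  /* flop/s to keep pace with real time */

    /* ratio = simulated time / allowed wall-clock time */
    struct { const char *scenario; double ratio; } cases[] = {
        { "match real time",              1.0 },
        { "weather, 7 days in 24 hours",  7.0 },
        { "climate, 50 years in 30 days", 50.0 * 365.0 / 30.0 },
        { "policy, 50 years in 12 hours", 50.0 * 365.0 * 24.0 / 12.0 },
    };

    for (int i = 0; i < 4; i++)
        printf("%-32s needs ~%.1f Gflop/s\n",
               cases[i].scenario, realtime * cases[i].ratio / 1e9);
    return 0;
}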
10Why are powerful computers parallel?
11 Tunnel Vision by Experts
- "I think there is a world market for maybe five computers." - Thomas Watson, chairman of IBM, 1943.
- "There is no reason for any individual to have a computer in their home." - Ken Olson, president and founder of Digital Equipment Corporation, 1977.
- "640K of memory ought to be enough for anybody." - Bill Gates, chairman of Microsoft, 1981.
Slide source: Warfield et al.
12 Technology Trends: Microprocessor Capacity (Moore's Law)
Moore's Law: transistors per chip double every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
13 How fast can a serial computer be?
- Consider a 1 Tflop/s, 1 TB sequential machine (a sphere of radius r = 0.3 mm)
- data must travel some distance, r, to get from memory to CPU
- to get 1 data element per cycle, this means 10^12 trips per second at the speed of light, c = 3 x 10^8 m/s
- so r < c / 10^12 = (3 x 10^8 m/s) / (10^12 /s) = 0.3 mm
- Now put 1 TB of storage in a 0.3 mm x 0.3 mm area
- each word then occupies about 3 square Angstroms, the size of a small atom
- (1 Angstrom = 0.0000001 mm)
14Automatic Parallelism in Modern Machines
- Bit level parallelism
- within floating point operations, etc.
- Instruction level parallelism
- multiple instructions execute per clock cycle
- Memory system parallelism
- overlap of memory operations with computation
- OS parallelism
- multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high
performance, user must identify, schedule and
coordinate parallel tasks
15 Number of transistors per processor chip
16 Number of transistors per processor chip
[Chart: transistor count per chip over time, annotated with the eras of Bit-Level Parallelism, Instruction-Level Parallelism, and Thread-Level Parallelism(?)]
17Parallel computers, yesterday and today
18 Various Competing Computer Architectures
- Vector Computers (VC) -- proprietary systems
- provided the breakthrough needed for the emergence of computational science, but they were only a partial answer
- Massively Parallel Processors (MPP) -- proprietary systems
- high cost and a low performance/price ratio
- Symmetric Multiprocessors (SMP)
- suffer from limited scalability
- Clusters -- gaining popularity
- High Performance Computing -- commodity supercomputing
- High Availability Computing -- mission-critical applications
19High Performance Computing
- Models
- Shared Memory
- Distributed Memory
20 Machine Architectures: Shared Memory
[Diagram: CPU1, CPU2, ..., CPUN all connected through a NETWORK to a single shared MEMORY]
FEATURES: 1) All CPUs share the memory  2) CPUs access memory using the interconnection network
21 Machine Architectures: Distributed Memory
[Diagram: nodes, each with its own CPU and local memory, connected by a network]
FEATURES: 1) Each node has its own local memory  2) Nodes share data by passing it over the network
22Issues in parallel performance
23 Locality and Parallelism
[Diagram: conventional storage hierarchy - each processor sits behind its own cache, L2 cache, L3 cache and memory, with potential interconnects linking the nodes]
- Large memories are slow, fast memories are small
- Storage hierarchies are large and fast on average
- Parallel processors, collectively, have large, fast caches
- The slow accesses to remote data are what we call communication
- The algorithm should do most of its work on local data
24 Finding Enough Parallelism: Amdahl's Law
How many processors can we really use? Let's say we have a legacy code such that it is only feasible to convert half of its heavily used routines to run in parallel.
25 Finding Enough Parallelism: Amdahl's Law
- If we run this on a parallel machine with five processors (take the serial time to be 100 s), the parallelizable half drops from 50 s to 10 s, so our code now takes about 60 s: a reduction of about 40% in run time.
- Now let's say we use a thousand processors: the parallel half takes almost no time, the code runs in just over 50 s, and we have sped it up by only about a factor of two.
26 Finding Enough Parallelism: Amdahl's Law
- Suppose only part of an application can be parallelized
- Amdahl's law
- Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized
- Let P = number of processors
Speedup(P) = Time(1) / Time(P)
           <= 1 / (s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, the sequential part limits overall performance.
27 Finding Enough Parallelism: Amdahl's Law
- Amdahl's Law: speedup and efficiency
- Speedup:    S(N) = Ts / Tp
- Efficiency: E(N) = S(N) / N
- If the best known serial algorithm takes 8 seconds (i.e. Ts = 8) while a parallel algorithm takes 2 seconds using 5 processors, then
- S(N) = Ts / Tp = 8 / 2 = 4 and
- E(N) = S(N) / N = 4 / 5 = 0.8 = 80%
- i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors, giving 80% efficiency (a short code sketch reproducing these numbers follows).
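The formulas above are easy to check in a few lines of C; the sketch below (not from the original slides) reproduces both the legacy-code example and the Ts = 8 s / Tp = 2 s example.

/* Minimal sketch of the Amdahl's-law numbers used above. */
#include <stdio.h>

/* Speedup(P) = 1 / (s + (1-s)/P), bounded above by 1/s */
static double amdahl_speedup(double s, int P)
{
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void)
{
    /* Legacy-code example: half the work stays sequential (s = 0.5). */
    printf("s = 0.5, P = 5    -> speedup %.2f\n", amdahl_speedup(0.5, 5));    /* ~1.67 */
    printf("s = 0.5, P = 1000 -> speedup %.2f\n", amdahl_speedup(0.5, 1000)); /* ~2.00 */

    /* Measured speedup/efficiency example: Ts = 8 s, Tp = 2 s on N = 5 processors. */
    double Ts = 8.0, Tp = 2.0;
    int N = 5;
    double S = Ts / Tp;   /* speedup    = 4.0 */
    double E = S / N;     /* efficiency = 0.8 */
    printf("S(N) = %.1f, E(N) = %.0f%%\n", S, E * 100.0);
    return 0;
}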
28Load Imbalance
- Load imbalance is the time that some processors
in the system are idle due to - insufficient parallelism (during that phase)
- unequal size tasks
- Examples of the latter
- adapting to interesting parts of a domain
- tree-structured computations
- fundamentally unstructured problems
- Algorithm needs to balance the load
29 Some of the Fastest Supercomputers
30 Parallel Computing Today
IBM BlueGene @ 280 TFlops
31 Parallel Computing Today
Mini BlueGene @ 91 TFlops
32 Parallel Computing Today
ASC Purple @ 63 TFlops
33 Parallel Computing Today
Columbia SGI Altix @ 51 TFlops
34 Parallel Computing Today
Earth Simulator @ 35 TFlops
35
36 Parallel Computing @ home
Small class Beowulf cluster
37Clusters
The Modern Choice
38 What is a cluster?
- A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource.
- A typical cluster
- Network: faster, closer connection than a typical network (LAN)
- Low-latency communication protocols
- Looser connection than an SMP
39Cluster Architecture
40Backbone/Communication Topology
41Token-Ring/Ethernet with Workstations
42Complete Connectivity
43Star Topology
44Binary Tree
45INTEL Paragon (2-D Mesh)
46 The Need for Alternative Supercomputing Resources
- Cannot afford to buy Big Iron machines
- due to their high cost and short life span
- cut-down of funding
- they don't fit well into today's funding model
- Paradox: the time required to develop a parallel application for solving a GCA (Grand Challenge Application) is equal to half the life of the parallel supercomputer.
47 Clusters are the best alternative!
- Supercomputing-class commodity components are available
- They fit very well with today's/future funding models
- Can leverage future technological advances
- VLSI, CPUs, networks, disks, memory, cache, OS, programming tools, applications, ...
48 Best of Both Worlds!
- High Performance Computing
- parallel computers / supercomputer-class workstation clusters
- dependable parallel computers
- High Availability Computing
- mission-critical systems
- fault-tolerant computing
49 So What's So Different about Clusters?
- Commodity Parts?
- Communications Packaging?
- Incremental Scalability?
- Independent Failure?
- Intelligent Network Interfaces?
- Complete System on every node
- virtual memory
- scheduler
- files
- Nodes can be used individually or combined...
501984 Computer Food Chain
Mainframe
PC
Workstation
Mini Computer
Vector Supercomputer
51Original Food Chain
Mainframe
Vector Supercomputer
Mini Computer
Workstation
PC
Before
52Computer Food Chain (Now and Future)
53 Why Clusters Now? (Beyond Technology and Cost)
- The building block is big enough
- complete computers (HW + SW) shipped in millions: killer micros, killer RAM, killer disks, killer OS, killer networks, killer apps
- Workstation performance is doubling every 18 months
- Networks are faster
- Higher link bandwidth (vs. 10 Mbit Ethernet)
- Switch-based networks coming (ATM)
- Interfaces are simple and fast (Active Messages)
- Demise of mainframes, supercomputers, MPPs
54 Architectural Drivers
- Node architecture dominates performance
- processor, cache, bus, and memory
- design and engineering => performance
- Greatest demand for performance is on large systems
- must track the leading edge of technology without lag
- MPP network technology => mainstream
- system area networks
- A complete system on every node is a powerful enabler
- very high speed I/O, virtual memory, scheduling, ...
55 ...Architectural Drivers
- Clusters can be grown: incremental scalability (up, down, and across)
- Individual node performance can be improved by adding additional resources (new memory blocks/disks)
- New nodes can be added or nodes can be removed
- Clusters of clusters and metacomputing
- Complete software tools
- Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, compilers, debuggers, OS, etc.
- Wide class of applications
- Sequential and grand challenge parallel applications
56 Top500 Supercomputers: Statistics
57 Top500 Supercomputers List: Manufacturers
58 Top500 Supercomputers List: Continents
59 Top500 Supercomputers List: Countries/Performance
60 Top500 Supercomputers List: Asian Countries/Systems
61 Top500 Supercomputers List: Customer Segments/Performance
62 Top500 Supercomputers List: Architecture/Performance
63 Top500 Supercomputers List: Interconnect/Performance
64 Top500 Supercomputers List: Operating Systems/Systems
65How do I write parallel apps
66Available APIs
67 What is OpenMP?
- A standard developed under the review of many major software and hardware developers, government, and academia
- Facilitates simple development of programs that take advantage of SMP architectures
- SMP: symmetric multi-processing, where access time to memory is approximately equal for all processors (usually 2-16 processors)
- Shared memory: memory local to all processors in an SMP domain
- Distributed memory: remote (non-local) memory access, NUMA (clusters, grids)
68 What is OpenMP?
- The OpenMP API is comprised of (see the minimal example after this list)
- Compiler directives
- Library routines
- Environment variables
- OpenMP language support
- Fortran, C, C++
- Compilers supporting OpenMP
- Intel compilers, Portland Group (PGI), IBM, Compaq
- Omni, OdinMP can be used with gcc
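As a concrete illustration of those three ingredients, here is a minimal OpenMP program (an added sketch, not from the original slides); the file name and compiler flags are only examples.

/* A compiler directive parallelizes the loop, library routines query the
 * threads, and the OMP_NUM_THREADS environment variable controls how many
 * threads are used.  Compile with an OpenMP-capable compiler, e.g.
 *   icc -openmp hello_omp.c   or   gcc -fopenmp hello_omp.c            */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;

    /* compiler directive: split the loop across threads, combine the sums */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    /* library routines: every thread reports its id */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    printf("sum = %f\n", sum);
    return 0;
}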
69 OpenMP (behind the scenes)
- Threads communicate through shared variables (shared memory)
- Threads can be carried through from one parallel region to the next
- Important: amortize the thread fork cost and minimize thread joins (see the sketch after this list)
- The number of threads can be dynamically altered at runtime
- Support for nested parallelism exists in some compilers
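A small sketch of the fork-amortization point above (an assumed example, not from the slides): one parallel region encloses two independent loops, so the team of threads is created and joined only once.

/* One parallel region, two worksharing loops: threads are forked and joined
 * once instead of twice.  The nowait clause drops the barrier after the
 * first loop, which is safe here because the loops touch different arrays. */
#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel          /* threads forked once */
    {
        #pragma omp for nowait
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = i * i;
    }                             /* single implicit join here */

    printf("a[10] = %.1f, b[10] = %.1f\n", a[10], b[10]);
    return 0;
}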
70 What is OpenMosix?
- An open-source enhancement to the Linux kernel
- Provides adaptive (on-line) load balancing between the machines
- Uses preemptive process migration to assign and reassign processes among the nodes to take the best advantage of the available resources
71 OpenMosix architecture (1/5)
- Network transparency
- The interactive user and the application-level programs are provided with a virtual machine that looks like a single MP machine
- Preemptive process migration
- Any user's process can, transparently and at any time, migrate to any available node
- The migrating process is divided into two contexts
- system context (deputy), which may not be migrated from the home workstation
- user context (remote), which can be migrated to a diskless node
72 OpenMosix architecture (2/5)
- Preemptive process migration
[Diagram: a process's user context migrates from the master node to a diskless node]
73 OpenMosix architecture (3/5)
- Dynamic load balancing
- Initiates process migrations in order to balance the load of the farm
- Responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds
- Makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load
74 OpenMosix architecture (4/5)
- Memory sharing
- Places the maximal number of processes in the farm's main memory, even if that implies an uneven load distribution among the nodes
- Delays swapping out of pages as much as possible
- The decision of which process to migrate, and where to migrate it, is based on knowledge of the amount of free memory on other nodes
- Efficient kernel communication
- specifically developed to reduce the overhead of internal kernel communications (e.g. between the process and its home site when it is executing on a remote site)
- a fast and reliable protocol with low startup latency and high throughput
75 OpenMosix architecture (5/5)
- Probabilistic information dissemination algorithms
- provide each node with sufficient knowledge about available resources in other nodes, without polling
- measure the amount of available resources on each node
- each node sends its resource indices at regular intervals to a randomly chosen subset of nodes
- the use of a randomly chosen subset of nodes supports dynamic configuration and overcomes partial node failures
- Decentralized control and autonomy
- each node makes its own control decisions independently; there is no master-slave relationship between nodes
- each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption
76 OpenMosix Conclusions (1/2)
- Noticeable features of OpenMosix are
- load balancing
- process migration algorithms
- These are most useful in time-sharing, multi-user environments, where users do not have the means (and usually are not interested) to track the status (e.g. load) of the nodes
- A parallel application can be executed by forking many processes, just like on an SMP, and OpenMosix continuously attempts to optimize the resource allocation
77 OpenMosix Conclusions (2/2)
- Building up farms with the OpenMosix + ClusterNFS approach requires no more than 2 hours
- With this approach, management of a farm = management of a single server
78 Message Passing Interface (MPI)
- Available for numerous hardware and OS platforms
- Most popular
- Well supported
- Supports C/C++, Fortran 77/90/95
- Processes coordinate their activities by passing and receiving messages
- Extensive API and wrapper functions available
- Gives substantial control over the application's design/architecture
79 MPI continued
- Basic calls
- #include "mpi.h" provides basic MPI definitions and types
- MPI_Init starts MPI
- MPI_Finalize exits MPI
- MPI_Comm_rank( MPI_COMM_WORLD, &rank )
- MPI_Comm_size( MPI_COMM_WORLD, &size )
- MPI_Send()
- MPI_Recv()
80 MPI continued
- Basic program structure

#include "mpi.h"

int main(int argc, char *argv[])
{
    /* No MPI functions before this */
    MPI_Init(&argc, &argv);   /* allows the system to do any special setup */

    . . .

    MPI_Finalize();           /* frees memory used by MPI */
    /* No MPI functions called after this */
}   /* main */
81 MPI continued
- Sample Hello World in MPI

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello world\n");
    MPI_Finalize();
    return 0;
}
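The hello-world program never actually exchanges data. The following minimal sketch (an added example, not from the original slides) uses the MPI_Send/MPI_Recv calls listed earlier: rank 1 sends one integer to rank 0.

/* Minimal point-to-point example: rank 1 sends an integer to rank 0. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received %d from rank 1\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}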
82 Parallel Virtual Machine (PVM)
- Runs on every UNIX and WinNT/Win95 system
- Runs over most physical networks (Ethernet, FDDI, Myrinet, ATM, shared memory)
- A heterogeneous collection of machines can be assembled and used as a supercomputer
- Programming is completely portable
- Supports C/C++, Fortran 77/90/95
- The underlying machine and network are transparent to the programmer/user
- Each user has his/her own private VM
83 PVM continued
- Basic calls
- Spawning tasks
- num = pvm_spawn(child, arguments, flag, where, howmany, tids)
- Send (one receiver)
- info = pvm_send(tid, tag)
- Receiving
- bufid = pvm_recv(tid, tag)
- Broadcast (multiple receivers)
- info = pvm_mcast(tids, n, tag)
- info = pvm_bcast(group_name, tag)
84 PVM continued
- Sample Hello World in PVM

#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int cc, tid;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    /* spawn 1 copy of hello_other on any machine */
    cc = pvm_spawn("hello_other", (char **)0, PvmTaskDefault, "", 1, &tid);
    if (cc == 1) {
        cc = pvm_recv(-1, -1);   /* receive a message from any source */
        /* get info about the sender */
        pvm_bufinfo(cc, (int *)0, (int *)0, &tid);
        pvm_upkstr(buf);
        printf("from t%x: %s\n", tid, buf);
    } else {
        printf("can't start hello_other\n");
    }
    pvm_exit();
    return 0;
}
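The example spawns a task called hello_other, which the slides do not show. A sketch of the standard companion program is given below; it packs a greeting and sends it back to its parent (the exact message text is illustrative).

/* "hello_other": the task spawned by the program above.  It sends a string
 * back to its parent with message tag 1 (the parent receives with
 * wildcards, so any tag would do). */
#include <string.h>
#include <unistd.h>
#include "pvm3.h"

int main(void)
{
    int ptid = pvm_parent();        /* task id of the spawning task */
    char buf[100];

    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);

    pvm_initsend(PvmDataDefault);   /* prepare the send buffer */
    pvm_pkstr(buf);                 /* pack the string */
    pvm_send(ptid, 1);              /* send it to the parent, tag 1 */
    pvm_exit();
    return 0;
}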
85CIIT Computational Cluster
86CIIT Computational Cluster
87..CIIT Computational Cluster
- Cluster specs
- Master node
- Dual P-II 500 MHz, 512 MB RAM, 20 GB Ultra Wide SCSI disk
- Compute node x 32
- P-II 333 MHz, 96 MB RAM, 4 GB Ultra Wide SCSI disk
- Ethernet 10/100 interconnect
- 9-12 GFlops
88..CIIT Computational Cluster
- Available libraries/APIs
- LAM MPI / MPICH
- PVM
- BLAS (Basic Linear Algebra Subprograms)
- LAPACK (Linear Algebra PACKage)
- ScaLAPACK (Scalable Linear Algebra PACKage)
- BLACS (Basic Linear Algebra Communication Subprograms)
- Intel MKL (Intel's Math Kernel Library)
89..CIIT Computational Cluster
- Available programming languages/compilers
- C/C++
- FORTRAN 77
- GNU, Intel, Lahey, Fujitsu
- Misc
- OpenPBS/Torque
90..CIIT Computational Cluster
91..CIIT Computational Cluster
- How do I access it?
- Use any SSH client
- OpenSSH or PuTTY (for Windows)
- Connect from
- Faculty: IP 172.16.4.19
- Student labs: 172.16.0.45
92..CIIT Computational Cluster
93..CIIT Computational Cluster
94..CIIT Computational Cluster
- Code compilation
- mpicc, mpif77, gcc, g77
- E.g. to compile hello.c
- mpicc -o <binary output> <source file.c>
- Or
- mpicc -o hello hello.c
- mpif77 -o hello hello.f
- gcc -I /opt/pvm3/include myprogram.c -L /opt/pvm3/lib/LINUX/ -lpvm3 -o myprogramexe
95..CIIT Computational Cluster
96..CIIT Computational Cluster
97..CIIT Computational Cluster
- Submitting your jobs: a sample PBS job script

# start of script
#PBS -N HelloJob
#PBS -q workq
#PBS -l nodes=32
echo "start it"
echo "HOME=$HOME"
lamboot $PBS_NODEFILE
echo "- LAM is ready"
cd $PBS_O_WORKDIR
mpirun C hello
lamhalt $PBS_NODEFILE
echo "done"
# end of script

- Submit it with: qsub myjobscript
98..CIIT Computational Cluster
99..CIIT Computational Cluster
100..CIIT Computational Cluster
101..CIIT Computational Cluster
102Applications of Parallel Computing revisited
- Science
- Global climate modeling
- Astrophysical modeling
- Biology: genomics, protein folding, drug design
- Computational Chemistry
- Computational Material Sciences and Nano Sciences
- Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
103Applications of Parallel Computing revisited
- Business
- Financial and economic modeling
- Transaction processing, web services and search engines
- Defense
- Nuclear weapons
- Cryptography
104 PDC Hot Topics for E-commerce
Applications of Parallel Computing revisited
- Cluster-based web servers, search engines, portals
- Scheduling and Single System Image
- Heterogeneous computing
- Reliability, high availability and data recovery
- Parallel databases and high-performance, reliable mass-storage systems
- CyberGuard! Data mining for detection of cyber attacks, frauds, etc.; detection and online control
- Data mining for identifying sales patterns and automatically tuning portals to special sessions/festival sales
- eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on
- Data/site replication and caching techniques
- Compute power market
- Infowares (yahoo.com, AOL.com)
- ASPs (application service providers)
- . . .
105 Q/A
- References
- www.tldp.com
- www-unix.mcs.anl.gov/mpi/
- www.netlib.org/pvm3/
- www.beowulf.org
- www.putty.nl/
- www.top500.org
- OpenMP: www.openmp.org
- Introduction to OpenMP: http://www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
- http://www.openmosix.org
End of Day 6
106 Backup Slides
107..Introduction of PDC
- Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized
- Let P = number of processors
Speedup(P) = Time(1) / Time(P)
           <= 1 / (s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, the sequential part limits overall performance.
108 Backup Slides
109 Introduction of PDC
- Sequential hardware
- Turing Machine view
- Tape + TM state
- Sequential change of state and tape position
- Von Neumann view
- Program counter + registers = thread/process
- Sequential change of machine state
- Sequence is the essence of computation
110 High Resolution Climate Modeling on NERSC-3, P. Duffy, et al., LLNL
111 A 1000-Year Climate Simulation
- Demonstration of the Community Climate System Model (CCSM2)
- A 1000-year simulation shows long-term, stable representation of the earth's climate
- 760,000 processor hours used
- Temperature change shown
- Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Laboratory; et al.
- http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
112 Climate Modeling on the Earth Simulator System
- Development of ES started in 1997 with the goal of enabling a comprehensive understanding of global environmental changes such as global warming.
- Construction was completed in February 2002 and practical operation started March 1, 2002.
- 35.86 Tflop/s (87.5% of peak performance) on the Linpack benchmark.
- 26.58 Tflop/s on a global atmospheric circulation code.
113 Fully Connected: CM-2
114 Scaling Microprocessors
- What happens when the feature size shrinks by a factor of x?
- Clock rate goes up by x
- actually a little less
- Transistors per unit area go up by x^2
- Die size also tends to increase
- typically by another factor of x
- Raw computing power of the chip goes up by about x^4!
- of which x^3 is devoted either to parallelism or locality