Title: High Performance Computing
1 High Performance Computing
- R&D
- rnd@ciit.net.pk
- http://rnd.ciit.net.pk
2 Objectives
- At the end of this talk you will
- Understand HPC concepts
- Be able to describe various HPC paradigms
- Distinguish between HPC programming technologies
- Understand the physical cluster architecture
- Understand application building
- Be informed about the CIIT Cluster resources
3Outline
- Introduction
- Why we need powerful computers
- Why powerful computers are parallel
- Parallel computers, yesterday and today
- Issues in parallel performance
- What are Clusters
- Parallel application programming APIs
- CIIT computation resources
- Application areas
- Q&A
4Why do we need powerful computers?
5 Simulation: The Third Pillar of Science
- Traditional scientific and engineering paradigm
- Do theory or paper design.
- Perform experiments or build the system.
- Limitations
- Too difficult -- build large wind tunnels.
- Too expensive -- build a throw-away passenger jet.
- Too slow -- wait for climate or galactic evolution.
- Too dangerous -- weapons, drug design, climate experiments.
- Computational science paradigm
- Use high performance computer systems to simulate the phenomenon.
- Based on known physical laws and efficient numerical methods.
6Some Challenging Computations
- Science
- Global climate modeling
- Astrophysical modeling
- Biology: genomics, protein folding, drug design
- Computational Chemistry
- Computational Material Sciences and Nanosciences
- Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
- Business
- Financial and economic modeling
- Transaction processing, web services and search engines
- Defense
- Nuclear weapons -- test by simulation
- Cryptography
7 Units of Measure in HPC
- High Performance Computing (HPC) units are
- Flop/s: floating point operations per second
- Typical sizes are millions, billions, trillions
- Mega   Mflop/s = 10^6 flop/sec    Mbyte = 10^6 bytes
- Giga   Gflop/s = 10^9 flop/sec    Gbyte = 10^9 bytes
- Tera   Tflop/s = 10^12 flop/sec   Tbyte = 10^12 bytes
- Peta   Pflop/s = 10^15 flop/sec   Pbyte = 10^15 bytes
- Exa    Eflop/s = 10^18 flop/sec   Ebyte = 10^18 bytes
8 Global Climate Modeling Problem
- Problem is to compute
- f(latitude, longitude, elevation, time) -> temperature, pressure, humidity, wind velocity
- Approach
- Discretize the domain, e.g., a measurement point every 10 km
- Devise an algorithm to predict the weather at time t+1 given time t
- Uses
- Predict major events, e.g., El Niño
- Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
9 Global Climate Modeling Computation
- One piece is modeling the fluid flow in the atmosphere
- Solve the Navier-Stokes equations
- Roughly 100 flops per grid point with a 1-minute timestep
- Computational requirements
- To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
- Weather prediction (7 days in 24 hours) -> 56 Gflop/s
- Climate prediction (50 years in 30 days) -> 4.8 Tflop/s
- To use in policy negotiations (50 years in 12 hours) -> 288 Tflop/s
- To double the grid resolution, computation is at least 8x
- State-of-the-art models require integration of atmosphere, ocean, sea-ice and land models, plus possibly carbon cycle, geochemistry and more
- Current models are coarser than this
- http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
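The rates above follow directly from the 5 x 10^11 flops-per-simulated-minute figure; the short C sketch below (not part of the original slides) redoes the arithmetic. It does not round, so its values come out slightly above the slide's, which uses 8 Gflop/s as the base rate.

/* Back-of-the-envelope check of the climate-model rates quoted above.
 * The only input taken from the slide is ~5e11 flops per simulated minute. */
#include <stdio.h>

int main(void)
{
    const double flops_per_sim_minute = 5e11;
    const double realtime = flops_per_sim_minute / 60.0;  /* flop/s to keep pace with real time */

    /* ratio = simulated time / allowed wall-clock time */
    struct { const char *scenario; double ratio; } cases[] = {
        { "match real time",              1.0 },
        { "weather, 7 days in 24 hours",  7.0 },
        { "climate, 50 years in 30 days", 50.0 * 365.0 / 30.0 },
        { "policy, 50 years in 12 hours", 50.0 * 365.0 * 24.0 / 12.0 },
    };

    for (int i = 0; i < 4; i++)
        printf("%-32s needs ~%.1f Gflop/s\n",
               cases[i].scenario, realtime * cases[i].ratio / 1e9);
    return 0;
}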
10Why are powerful computers parallel?
11 Tunnel Vision by Experts
- "I think there is a world market for maybe five computers." - Thomas Watson, chairman of IBM, 1943.
- "There is no reason for any individual to have a computer in their home." - Ken Olson, president and founder of Digital Equipment Corporation, 1977.
- "640K of memory ought to be enough for anybody." - Bill Gates, chairman of Microsoft, 1981.
Slide source: Warfield et al.
12 Technology Trends: Microprocessor Capacity (Moore's Law)
Moore's Law: transistors per chip double every 1.5 years.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Microprocessors have become smaller, denser, and more powerful.
Slide source: Jack Dongarra
13 How fast can a serial computer be?
- Consider a 1 Tflop/s, 1 TB sequential machine (a sphere of radius r = 0.3 mm)
- data must travel some distance, r, to get from memory to CPU
- to get 1 data element per cycle, this means 10^12 trips per second at the speed of light, c = 3 x 10^8 m/s
- so r < c / 10^12 = (3 x 10^8 m/s) / (10^12 /s) = 0.3 mm
- Now put 1 TB of storage in a 0.3 mm x 0.3 mm area
- each word then occupies about 3 square Angstroms, the size of a small atom
- (1 Angstrom = 0.0000001 mm)
14Automatic Parallelism in Modern Machines
- Bit level parallelism
- within floating point operations, etc.
- Instruction level parallelism
- multiple instructions execute per clock cycle
- Memory system parallelism
- overlap of memory operations with computation
- OS parallelism
- multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high
performance, user must identify, schedule and
coordinate parallel tasks
15 Number of transistors per processor chip
16 Number of transistors per processor chip
[Chart: transistor count per chip over time, annotated with the eras of Bit-Level Parallelism, Instruction-Level Parallelism, and Thread-Level Parallelism(?)]
17Parallel computers, yesterday and today
18 Various Competing Computer Architectures
- Vector Computers (VC) -- proprietary systems
- provided the breakthrough needed for the emergence of computational science, but they were only a partial answer
- Massively Parallel Processors (MPP) -- proprietary systems
- high cost and a low performance/price ratio
- Symmetric Multiprocessors (SMP)
- suffer from limited scalability
- Clusters -- gaining popularity
- High Performance Computing -- commodity supercomputing
- High Availability Computing -- mission-critical applications
19High Performance Computing
- Models
- Shared Memory
- Distributed Memory
20 Machine Architectures: Shared Memory
[Diagram: CPU1, CPU2, ..., CPUN all connected through a NETWORK to a single shared MEMORY]
FEATURES: 1) All CPUs share the memory  2) CPUs access memory using the interconnection network
21 Machine Architectures: Distributed Memory
[Diagram: nodes, each with its own CPU and local memory, connected by a network]
FEATURES: 1) Each node has its own local memory  2) Nodes share data by passing it over the network
22Issues in parallel performance
23 Locality and Parallelism
[Diagram: conventional storage hierarchy - each processor sits behind its own cache, L2 cache, L3 cache and memory, with potential interconnects linking the nodes]
- Large memories are slow, fast memories are small
- Storage hierarchies are large and fast on average
- Parallel processors, collectively, have large, fast caches
- The slow accesses to remote data are what we call communication
- The algorithm should do most of its work on local data
24 Finding Enough Parallelism: Amdahl's Law
How many processors can we really use? Let's say we have a legacy code such that it is only feasible to convert half of its heavily used routines to run in parallel.
25 Finding Enough Parallelism: Amdahl's Law
- If we run this on a parallel machine with five processors (take the serial time to be 100 s), the parallelizable half drops from 50 s to 10 s, so our code now takes about 60 s: a reduction of about 40% in run time.
- Now let's say we use a thousand processors: the parallel half takes almost no time, the code runs in just over 50 s, and we have sped it up by only about a factor of two.
26 Finding Enough Parallelism: Amdahl's Law
- Suppose only part of an application can be parallelized
- Amdahl's law
- Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized
- Let P = number of processors
Speedup(P) = Time(1) / Time(P)
           <= 1 / (s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, the sequential part limits overall performance.
27 Finding Enough Parallelism: Amdahl's Law
- Amdahl's Law: speedup and efficiency
- Speedup:    S(N) = Ts / Tp
- Efficiency: E(N) = S(N) / N
- If the best known serial algorithm takes 8 seconds (i.e. Ts = 8) while a parallel algorithm takes 2 seconds using 5 processors, then
- S(N) = Ts / Tp = 8 / 2 = 4 and
- E(N) = S(N) / N = 4 / 5 = 0.8 = 80%
- i.e. the parallel algorithm exhibits a speedup of 4 with 5 processors, giving 80% efficiency (a short code sketch reproducing these numbers follows).
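The formulas above are easy to check in a few lines of C; the sketch below (not from the original slides) reproduces both the legacy-code example and the Ts = 8 s / Tp = 2 s example.

/* Minimal sketch of the Amdahl's-law numbers used above. */
#include <stdio.h>

/* Speedup(P) = 1 / (s + (1-s)/P), bounded above by 1/s */
static double amdahl_speedup(double s, int P)
{
    return 1.0 / (s + (1.0 - s) / P);
}

int main(void)
{
    /* Legacy-code example: half the work stays sequential (s = 0.5). */
    printf("s = 0.5, P = 5    -> speedup %.2f\n", amdahl_speedup(0.5, 5));    /* ~1.67 */
    printf("s = 0.5, P = 1000 -> speedup %.2f\n", amdahl_speedup(0.5, 1000)); /* ~2.00 */

    /* Measured speedup/efficiency example: Ts = 8 s, Tp = 2 s on N = 5 processors. */
    double Ts = 8.0, Tp = 2.0;
    int N = 5;
    double S = Ts / Tp;   /* speedup    = 4.0 */
    double E = S / N;     /* efficiency = 0.8 */
    printf("S(N) = %.1f, E(N) = %.0f%%\n", S, E * 100.0);
    return 0;
}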
28Load Imbalance
- Load imbalance is the time that some processors
in the system are idle due to - insufficient parallelism (during that phase)
- unequal size tasks
- Examples of the latter
- adapting to interesting parts of a domain
- tree-structured computations
- fundamentally unstructured problems
- Algorithm needs to balance the load
29 Some of the Fastest Supercomputers
30 Parallel Computing Today
IBM BlueGene @ 280 TFlops
31 Parallel Computing Today
Mini BlueGene @ 91 TFlops
32 Parallel Computing Today
ASC Purple @ 63 TFlops
33 Parallel Computing Today
Columbia SGI Altix @ 51 TFlops
34 Parallel Computing Today
Earth Simulator @ 35 TFlops
35
36 Parallel Computing @ home
Small class Beowulf cluster
37Clusters
The Modern Choice
38 What is a cluster?
- A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource.
- A typical cluster
- Network: faster, closer connection than a typical network (LAN)
- Low-latency communication protocols
- Looser connection than an SMP
39Cluster Architecture
40Backbone/Communication Topology
41Token-Ring/Ethernet with Workstations
42Complete Connectivity
43Star Topology
44Binary Tree
45INTEL Paragon (2-D Mesh)
46 The Need for Alternative Supercomputing Resources
- Cannot afford to buy Big Iron machines
- due to their high cost and short life span
- cut-down of funding
- they don't fit well into today's funding model
- Paradox: the time required to develop a parallel application for solving a GCA (Grand Challenge Application) is equal to half the life of the parallel supercomputer.
47 Clusters are the best alternative!
- Supercomputing-class commodity components are available
- They fit very well with today's/future funding models
- Can leverage future technological advances
- VLSI, CPUs, networks, disks, memory, cache, OS, programming tools, applications, ...
48 Best of Both Worlds!
- High Performance Computing
- parallel computers / supercomputer-class workstation clusters
- dependable parallel computers
- High Availability Computing
- mission-critical systems
- fault-tolerant computing
49 So What's So Different about Clusters?
- Commodity Parts?
- Communications Packaging?
- Incremental Scalability?
- Independent Failure?
- Intelligent Network Interfaces?
- Complete System on every node
- virtual memory
- scheduler
- files
- Nodes can be used individually or combined...
501984 Computer Food Chain
Mainframe
PC
Workstation
Mini Computer
Vector Supercomputer
51Original Food Chain
Mainframe
Vector Supercomputer
Mini Computer
Workstation
PC
Before
52Computer Food Chain (Now and Future)
53 Why Clusters Now? (Beyond Technology and Cost)
- The building block is big enough
- complete computers (HW + SW) shipped in millions: killer micros, killer RAM, killer disks, killer OS, killer networks, killer apps
- Workstation performance is doubling every 18 months
- Networks are faster
- Higher link bandwidth (vs. 10 Mbit Ethernet)
- Switch-based networks coming (ATM)
- Interfaces are simple and fast (Active Messages)
- Demise of mainframes, supercomputers, MPPs
54 Architectural Drivers
- Node architecture dominates performance
- processor, cache, bus, and memory
- design and engineering => performance
- Greatest demand for performance is on large systems
- must track the leading edge of technology without lag
- MPP network technology => mainstream
- system area networks
- A complete system on every node is a powerful enabler
- very high speed I/O, virtual memory, scheduling, ...
55 ...Architectural Drivers
- Clusters can be grown: incremental scalability (up, down, and across)
- Individual node performance can be improved by adding additional resources (new memory blocks/disks)
- New nodes can be added or nodes can be removed
- Clusters of clusters and metacomputing
- Complete software tools
- Threads, PVM, MPI, DSM, C, C++, Java, Parallel C++, compilers, debuggers, OS, etc.
- Wide class of applications
- Sequential and grand challenge parallel applications
56 Top500 Supercomputers: Statistics
57 Top500 Supercomputers List: Manufacturers
58 Top500 Supercomputers List: Continents
59 Top500 Supercomputers List: Countries/Performance
60 Top500 Supercomputers List: Asian Countries/Systems
61 Top500 Supercomputers List: Customer Segments/Performance
62 Top500 Supercomputers List: Architecture/Performance
63 Top500 Supercomputers List: Interconnect/Performance
64 Top500 Supercomputers List: Operating Systems/Systems
65How do I write parallel apps
66Available APIs
67 What is OpenMP?
- A standard developed under the review of many major software and hardware developers, government, and academia
- Facilitates simple development of programs that take advantage of SMP architectures
- SMP: symmetric multi-processing, where access time to memory is approximately equal for all processors (usually 2-16 processors)
- Shared memory: memory local to all processors in an SMP domain
- Distributed memory: remote (non-local) memory access, NUMA (clusters, grids)
68 What is OpenMP?
- The OpenMP API is comprised of (see the minimal example after this list)
- Compiler directives
- Library routines
- Environment variables
- OpenMP language support
- Fortran, C, C++
- Compilers supporting OpenMP
- Intel compilers, Portland Group (PGI), IBM, Compaq
- Omni, OdinMP can be used with gcc
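As a concrete illustration of those three ingredients, here is a minimal OpenMP program (an added sketch, not from the original slides); the file name and compiler flags are only examples.

/* A compiler directive parallelizes the loop, library routines query the
 * threads, and the OMP_NUM_THREADS environment variable controls how many
 * threads are used.  Compile with an OpenMP-capable compiler, e.g.
 *   icc -openmp hello_omp.c   or   gcc -fopenmp hello_omp.c            */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;

    /* compiler directive: split the loop across threads, combine the sums */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1.0);

    /* library routines: every thread reports its id */
    #pragma omp parallel
    printf("hello from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    printf("sum = %f\n", sum);
    return 0;
}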
69 OpenMP (behind the scenes)
- Threads communicate through shared variables (shared memory)
- Threads can be carried through from one parallel region to the next
- Important: amortize the thread fork cost and minimize thread joins (see the sketch after this list)
- The number of threads can be dynamically altered at runtime
- Support for nested parallelism exists in some compilers
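A small sketch of the fork-amortization point above (an assumed example, not from the slides): one parallel region encloses two independent loops, so the team of threads is created and joined only once.

/* One parallel region, two worksharing loops: threads are forked and joined
 * once instead of twice.  The nowait clause drops the barrier after the
 * first loop, which is safe here because the loops touch different arrays. */
#include <stdio.h>

#define N 1000

int main(void)
{
    static double a[N], b[N];
    int i;

    #pragma omp parallel          /* threads forked once */
    {
        #pragma omp for nowait
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = i * i;
    }                             /* single implicit join here */

    printf("a[10] = %.1f, b[10] = %.1f\n", a[10], b[10]);
    return 0;
}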
70 What is OpenMosix?
- An open-source enhancement to the Linux kernel
- Provides adaptive (on-line) load balancing between the machines
- Uses preemptive process migration to assign and reassign processes among the nodes to take the best advantage of the available resources
71 OpenMosix architecture (1/5)
- Network transparency
- The interactive user and the application-level programs are provided with a virtual machine that looks like a single MP machine
- Preemptive process migration
- Any user's process can, transparently and at any time, migrate to any available node
- The migrating process is divided into two contexts
- system context (deputy), which may not be migrated from the home workstation
- user context (remote), which can be migrated to a diskless node
72 OpenMosix architecture (2/5)
- Preemptive process migration
[Diagram: a process's user context migrates from the master node to a diskless node]
73 OpenMosix architecture (3/5)
- Dynamic load balancing
- Initiates process migrations in order to balance the load of the farm
- Responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds
- Makes continuous attempts to reduce the load differences between pairs of nodes by dynamically migrating processes from nodes with a higher load to nodes with a lower load
74 OpenMosix architecture (4/5)
- Memory sharing
- Places the maximal number of processes in the farm's main memory, even if that implies an uneven load distribution among the nodes
- Delays swapping out of pages as much as possible
- The decision of which process to migrate, and where to migrate it, is based on knowledge of the amount of free memory on other nodes
- Efficient kernel communication
- specifically developed to reduce the overhead of internal kernel communications (e.g. between the process and its home site when it is executing on a remote site)
- a fast and reliable protocol with low startup latency and high throughput
75 OpenMosix architecture (5/5)
- Probabilistic information dissemination algorithms
- provide each node with sufficient knowledge about available resources in other nodes, without polling
- measure the amount of available resources on each node
- each node sends its resource indices at regular intervals to a randomly chosen subset of nodes
- the use of a randomly chosen subset of nodes supports dynamic configuration and overcomes partial node failures
- Decentralized control and autonomy
- each node makes its own control decisions independently; there is no master-slave relationship between nodes
- each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption
76 OpenMosix Conclusions (1/2)
- Noticeable features of OpenMosix are
- load balancing
- process migration algorithms
- These are most useful in time-sharing, multi-user environments, where users do not have the means (and usually are not interested) to track the status (e.g. load) of the nodes
- A parallel application can be executed by forking many processes, just like on an SMP, and OpenMosix continuously attempts to optimize the resource allocation
77 OpenMosix Conclusions (2/2)
- Building up farms with the OpenMosix + ClusterNFS approach requires no more than 2 hours
- With this approach, management of a farm = management of a single server
78 Message Passing Interface (MPI)
- Available for numerous hardware and OS platforms
- Most popular
- Well supported
- Supports C/C++, Fortran 77/90/95
- Processes coordinate their activities by passing and receiving messages
- Extensive API and wrapper functions available
- Gives substantial control over the application's design/architecture
79 MPI continued
- Basic calls
- #include "mpi.h" provides basic MPI definitions and types
- MPI_Init starts MPI
- MPI_Finalize exits MPI
- MPI_Comm_rank( MPI_COMM_WORLD, &rank )
- MPI_Comm_size( MPI_COMM_WORLD, &size )
- MPI_Send()
- MPI_Recv()
80 MPI continued
- Basic program structure

#include "mpi.h"

int main(int argc, char *argv[])
{
    /* No MPI functions before this */
    MPI_Init(&argc, &argv);   /* allows the system to do any special setup */

    . . .

    MPI_Finalize();           /* frees memory used by MPI */
    /* No MPI functions called after this */
}   /* main */
81 MPI continued
- Sample Hello World in MPI

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello world\n");
    MPI_Finalize();
    return 0;
}
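The hello-world program never actually exchanges data. The following minimal sketch (an added example, not from the original slides) uses the MPI_Send/MPI_Recv calls listed earlier: rank 1 sends one integer to rank 0.

/* Minimal point-to-point example: rank 1 sends an integer to rank 0. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received %d from rank 1\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}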
82 Parallel Virtual Machine (PVM)
- Runs on every UNIX and WinNT/Win95 system
- Runs over most physical networks (Ethernet, FDDI, Myrinet, ATM, shared memory)
- A heterogeneous collection of machines can be assembled and used as a supercomputer
- Programming is completely portable
- Supports C/C++, Fortran 77/90/95
- The underlying machine and network are transparent to the programmer/user
- Each user has his/her own private VM
83 PVM continued
- Basic calls
- Spawning tasks
- num = pvm_spawn(child, arguments, flag, where, howmany, tids)
- Send (one receiver)
- info = pvm_send(tid, tag)
- Receiving
- bufid = pvm_recv(tid, tag)
- Broadcast (multiple receivers)
- info = pvm_mcast(tids, n, tag)
- info = pvm_bcast(group_name, tag)
84 PVM continued
- Sample Hello World in PVM

#include <stdio.h>
#include <pvm3.h>

int main(void)
{
    int cc, tid;
    char buf[100];

    printf("i'm t%x\n", pvm_mytid());

    /* spawn 1 copy of hello_other on any machine */
    cc = pvm_spawn("hello_other", (char **)0, PvmTaskDefault, "", 1, &tid);
    if (cc == 1) {
        cc = pvm_recv(-1, -1);   /* receive a message from any source */
        /* get info about the sender */
        pvm_bufinfo(cc, (int *)0, (int *)0, &tid);
        pvm_upkstr(buf);
        printf("from t%x: %s\n", tid, buf);
    } else {
        printf("can't start hello_other\n");
    }
    pvm_exit();
    return 0;
}
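The example spawns a task called hello_other, which the slides do not show. A sketch of the standard companion program is given below; it packs a greeting and sends it back to its parent (the exact message text is illustrative).

/* "hello_other": the task spawned by the program above.  It sends a string
 * back to its parent with message tag 1 (the parent receives with
 * wildcards, so any tag would do). */
#include <string.h>
#include <unistd.h>
#include "pvm3.h"

int main(void)
{
    int ptid = pvm_parent();        /* task id of the spawning task */
    char buf[100];

    strcpy(buf, "hello, world from ");
    gethostname(buf + strlen(buf), 64);

    pvm_initsend(PvmDataDefault);   /* prepare the send buffer */
    pvm_pkstr(buf);                 /* pack the string */
    pvm_send(ptid, 1);              /* send it to the parent, tag 1 */
    pvm_exit();
    return 0;
}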
85CIIT Computational Cluster
86CIIT Computational Cluster
87..CIIT Computational Cluster
- Cluster specs
- Master node
- Dual P-II 500 MHz, 512 MB RAM, 20 GB Ultra Wide SCSI disk
- Compute node x 32
- P-II 333 MHz, 96 MB RAM, 4 GB Ultra Wide SCSI disk
- Ethernet 10/100 interconnect
- 9-12 GFlops
88..CIIT Computational Cluster
- Available libraries/APIs
- LAM MPI / MPICH
- PVM
- BLAS (Basic Linear Algebra Subprograms)
- LAPACK (Linear Algebra PACKage)
- ScaLAPACK (Scalable Linear Algebra PACKage)
- BLACS (Basic Linear Algebra Communication Subprograms)
- Intel MKL (Intel's Math Kernel Library)
89..CIIT Computational Cluster
- Available programming languages/compilers
- C/C++
- FORTRAN 77
- GNU, Intel, Lahey, Fujitsu
- Misc
- OpenPBS/Torque
90..CIIT Computational Cluster
91..CIIT Computational Cluster
- How do I access it?
- Use any SSH client
- OpenSSH or PuTTY (for Windows)
- Connect from
- Faculty: IP 172.16.4.19
- Student labs: 172.16.0.45
92..CIIT Computational Cluster
93..CIIT Computational Cluster
94..CIIT Computational Cluster
- Code compilation
- mpicc, mpif77, gcc, g77
- E.g. to compile hello.c
- mpicc -o <binary output> <source file.c>
- Or
- mpicc -o hello hello.c
- mpif77 -o hello hello.f
- gcc -I /opt/pvm3/include myprogram.c -L /opt/pvm3/lib/LINUX/ -lpvm3 -o myprogramexe
95..CIIT Computational Cluster
96..CIIT Computational Cluster
97..CIIT Computational Cluster
- Submitting your jobs: a sample PBS job script

# start of script
#PBS -N HelloJob
#PBS -q workq
#PBS -l nodes=32
echo "start it"
echo "HOME=$HOME"
lamboot $PBS_NODEFILE
echo "- LAM is ready"
cd $PBS_O_WORKDIR
mpirun C hello
lamhalt $PBS_NODEFILE
echo "done"
# end of script

- Submit it with: qsub myjobscript
98..CIIT Computational Cluster
99..CIIT Computational Cluster
100..CIIT Computational Cluster
101..CIIT Computational Cluster
102Applications of Parallel Computing revisited
- Science
- Global climate modeling
- Astrophysical modeling
- Biology: genomics, protein folding, drug design
- Computational Chemistry
- Computational Material Sciences and Nano Sciences
- Engineering
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
- Computational fluid dynamics (airplane design)
- Combustion (engine design)
103Applications of Parallel Computing revisited
- Business
- Financial and economic modeling
- Transaction processing, web services and search engines
- Defense
- Nuclear weapons
- Cryptography
104 PDC Hot Topics for E-commerce
Applications of Parallel Computing revisited
- Cluster-based web servers, search engines, portals
- Scheduling and Single System Image
- Heterogeneous computing
- Reliability, high availability and data recovery
- Parallel databases and high-performance, reliable mass-storage systems
- CyberGuard! Data mining for detection of cyber attacks, frauds, etc.; detection and online control
- Data mining for identifying sales patterns and automatically tuning portals to special sessions/festival sales
- eCash, eCheque, eBank, eSociety, eGovernment, eEntertainment, eTravel, eGoods, and so on
- Data/site replication and caching techniques
- Compute power market
- Infowares (yahoo.com, AOL.com)
- ASPs (application service providers)
- . . .
105 Q/A
- References
- www.tldp.com
- www-unix.mcs.anl.gov/mpi/
- www.netlib.org/pvm3/
- www.beowulf.org
- www.putty.nl/
- www.top500.org
- OpenMP: www.openmp.org
- Introduction to OpenMP: http://www.llnl.gov/computing/tutorials/workshops/workshop/openMP/MAIN.html
- http://www.openmosix.org
End of Day 6
106 Backup Slides
107..Introduction of PDC
- Let s be the fraction of work done sequentially, so (1-s) is the fraction that can be parallelized
- Let P = number of processors
Speedup(P) = Time(1) / Time(P)
           <= 1 / (s + (1-s)/P)
           <= 1/s
- Even if the parallel part speeds up perfectly, the sequential part limits overall performance.
108 Backup Slides
109 Introduction of PDC
- Sequential hardware
- Turing Machine view
- Tape + TM state
- Sequential change of state and tape position
- Von Neumann view
- Program counter + registers = thread/process
- Sequential change of machine state
- Sequence is the essence of computation
110 High Resolution Climate Modeling on NERSC-3, P. Duffy, et al., LLNL
111 A 1000-Year Climate Simulation
- Demonstration of the Community Climate System Model (CCSM2)
- A 1000-year simulation shows long-term, stable representation of the earth's climate
- 760,000 processor hours used
- Temperature change shown
- Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Laboratory; et al.
- http://www.nersc.gov/aboutnersc/pubs/bigsplash.pdf
112 Climate Modeling on the Earth Simulator System
- Development of ES started in 1997 with the goal of enabling a comprehensive understanding of global environmental changes such as global warming.
- Construction was completed in February 2002 and practical operation started March 1, 2002.
- 35.86 Tflop/s (87.5% of peak performance) on the Linpack benchmark.
- 26.58 Tflop/s on a global atmospheric circulation code.
113 Fully Connected: CM-2
114 Scaling Microprocessors
- What happens when the feature size shrinks by a factor of x?
- Clock rate goes up by x
- actually a little less
- Transistors per unit area go up by x^2
- Die size also tends to increase
- typically by another factor of x
- Raw computing power of the chip goes up by about x^4!
- of which x^3 is devoted either to parallelism or locality