Title: Lecture 1: Introduction to High Performance Computing
Grand Challenge Problem
- A grand challenge problem is one that cannot be solved in a reasonable amount of time with today's computers.
Weather Forecasting
- Cells of size 1 mile x 1 mile x 1 mile
  - Whole global atmosphere: about 5 x 10^8 cells
- If each calculation requires 200 Flops
  - About 10^11 Flops per time step
- To forecast the weather over 7 days using 1-minute intervals (about 10^4 time steps), a computer operating at 100 Mflop/s (10^8 Flop/s) would take 10^7 seconds, over 100 days.
- To perform the calculation in 10 minutes would require a computer operating at 1.7 Tflop/s (1.7 x 10^12 Flop/s).
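The estimate above is simple arithmetic; a short sketch makes it reproducible (the cell count, Flops per cell, forecast length, and machine speeds are the slide's assumptions):

```python
# Back-of-envelope arithmetic for the weather-forecasting estimate.
cells = 5e8                                  # ~5 x 10^8 atmospheric cells
flops_per_cell = 200                         # Flops per cell per time step
flops_per_step = cells * flops_per_cell      # ~1e11 Flops per time step

steps = 7 * 24 * 60                          # 1-minute intervals over 7 days
total_flops = flops_per_step * steps         # ~1e15 Flops for the forecast

serial_seconds = total_flops / 1e8           # on a 100 Mflop/s machine
required_rate = total_flops / 600            # rate needed to finish in 10 min

print(f"serial time: {serial_seconds:.1e} s (~{serial_seconds/86400:.0f} days)")
print(f"required rate: {required_rate/1e12:.1f} Tflop/s")
```

Running this recovers the slide's figures: roughly 10^7 seconds serially, and about 1.7 Tflop/s to finish in ten minutes.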
Some Grand Challenge Applications
- Science
  - Global climate modeling
  - Astrophysical modeling
  - Biology: genomics, protein folding, drug design
  - Computational chemistry
  - Computational material sciences and nanosciences
- Engineering
  - Crash simulation
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
- Business
  - Financial and economic modeling
  - Transaction processing, web services and search engines
- Defense
  - Nuclear weapons -- test by simulations
  - Cryptography
Units of High Performance Computing
- Speed
  - 1 Mflop/s = 1 Megaflop/s = 10^6 Flop/second
  - 1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/second
  - 1 Tflop/s = 1 Teraflop/s = 10^12 Flop/second
  - 1 Pflop/s = 1 Petaflop/s = 10^15 Flop/second
- Capacity
  - 1 MB = 1 Megabyte = 10^6 Bytes
  - 1 GB = 1 Gigabyte = 10^9 Bytes
  - 1 TB = 1 Terabyte = 10^12 Bytes
  - 1 PB = 1 Petabyte = 10^15 Bytes
Moore's Law
- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
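Doubling every 18 months compounds quickly; a one-line calculation shows the growth factor over any horizon (the 18-month doubling time is the slide's figure):

```python
# Compound growth under Moore's law: density doubles every 18 months,
# so growth over n years is 2**(12*n / 18).
def moore_factor(years, doubling_months=18):
    return 2 ** (years * 12 / doubling_months)

print(f"growth over a decade: {moore_factor(10):.0f}x")
```

Ten years of 18-month doublings gives roughly a hundredfold increase, which is why the gap between generations of machines is so dramatic.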
Moore's Law also holds for performance and capacity

                                     1945 (ENIAC)    2002 (Laptop)
  Vacuum tubes / transistors         18,000          6,000,000,000
  Weight (kg)                        27,200          0.9
  Size (m^3)                         68              0.0028
  Power (watts)                      20,000          60
  Cost ($)                           4,630,000       1,000
  Memory (bytes)                     200             1,073,741,824
  Performance (Flop/s)               800             5,000,000,000
Peak Performance
- A contemporary RISC processor delivers about 10% of its peak performance
- Two primary reasons behind this low efficiency:
  - IPC inefficiency
  - Memory inefficiency
Instructions per Cycle (IPC) Inefficiency
- Today the theoretical IPC is 4-6
- Detailed analysis for a spectrum of applications indicates that the average IPC is 1.2-1.4
  - About 75% of the performance is not used
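The 75% figure follows directly from the two IPC numbers; a quick check (using the midpoints of the slide's ranges):

```python
# Issue-slot utilization: sustained IPC divided by theoretical IPC.
theoretical_ipc = 5.0        # midpoint of the 4-6 range above
measured_ipc = 1.3           # midpoint of the 1.2-1.4 range above
unused = 1 - measured_ipc / theoretical_ipc
print(f"{unused:.0%} of peak issue bandwidth goes unused")
```

With these midpoints, about 74% of the issue slots are wasted, matching the slide's "75% of the performance is not used".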
Reasons for IPC Inefficiency
- Latency
  - Waiting for access to memory or other parts of the system
- Overhead
  - Extra work that has to be done to manage program concurrency and parallel resources, beyond the real work you want to perform
- Starvation
  - Not enough work to do, due to insufficient parallelism or poor load balancing among distributed resources
- Contention
  - Delays due to fighting over which task gets to use a shared resource next; network bandwidth is a major constraint
Memory Hierarchy
Processor-Memory Problem
- Processors issue instructions roughly every nanosecond
- DRAM can be accessed roughly every 100 nanoseconds
- The gap is growing:
  - processors are getting faster by about 60% per year
  - DRAM is getting faster by about 7% per year
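Because the two improvement rates compound, the relative gap itself grows exponentially; a small sketch using the slide's 60%/year and 7%/year figures:

```python
# Relative processor-DRAM gap after n years of compounding improvement.
def gap_after(years, cpu_rate=0.60, dram_rate=0.07):
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years

print(f"gap after 5 years: {gap_after(5):.1f}x")
```

The ratio grows by about 1.5x per year, i.e. nearly an order of magnitude every five to six years, which is what drives the deep cache hierarchies on the previous slide.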
How Fast Can a Serial Computer Be?
- Consider a 1 Tflop/s sequential machine:
  - data must travel some distance, r, to get from memory to the CPU
  - to get 1 data element per cycle, data must travel 10^12 times per second at the speed of light, c = 3x10^8 m/s
  - so r < c / 10^12 = 0.3 mm
- For 1 TB of storage in a 0.3 mm x 0.3 mm area:
  - each word occupies about 3 Angstroms on a side, the size of a small atom
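Both bounds above come from two divisions; a short sketch of the arithmetic (taking 1 TB as 10^12 words, as the slide's area figure implies):

```python
# Light-speed bound on a 1 Tflop/s sequential machine.
c = 3.0e8                  # speed of light, m/s
cycles = 1e12              # one memory fetch per cycle at 1 Tflop/s
r = c / cycles             # max distance signal can travel per cycle, m

area = r * r               # usable memory area, m^2
per_word = area / 1e12     # area per word for ~1e12 words (1 TB)
side = per_word ** 0.5     # side length of one word's cell, m

print(f"r = {r*1e3:.1f} mm, word cell ~ {side*1e10:.0f} Angstrom on a side")
```

This gives r = 0.3 mm and about 3 Angstroms per word, i.e. atomic scale, which is the point of the slide: a serial machine this fast is physically implausible.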
So, we need Parallel Computing!
High Performance Computers
- In the 1980s
  - 1x10^6 Floating Point Ops/sec (Mflop/s)
  - Scalar based
- In the 1990s
  - 1x10^9 Floating Point Ops/sec (Gflop/s)
  - Vector, shared memory computing
- Today
  - 1x10^12 Floating Point Ops/sec (Tflop/s)
  - Highly parallel, distributed processing, message passing
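The decade-by-decade progression above is a factor of 1000 per decade; it is worth noting how fast a doubling time that implies (a quick calculation, not a figure from the slide):

```python
import math

# A 1000x improvement per decade implies a performance doubling time
# of 10 * log(2) / log(1000) years -- about one year.
factor_per_decade = 1e3
doubling_years = 10 * math.log(2) / math.log(factor_per_decade)
print(f"performance doubles roughly every {doubling_years*12:.0f} months")
```

That is considerably faster than Moore's 18-month transistor doubling, because architecture and parallelism improved on top of the raw devices.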
What is a Supercomputer?
- A supercomputer is a hardware and software system that provides close to the maximum performance that can currently be achieved.
Top500 Computers
- Over the last 10 years the range of the Top500 has increased faster than Moore's law:
- 1993:
  - #1: 59.7 GFlop/s
  - #500: 422 MFlop/s
- 2004:
  - #1: 70 TFlop/s
  - #500: 850 GFlop/s
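The "faster than Moore's law" claim can be checked from the two #1 figures on this slide:

```python
import math

# Growth of the #1 Top500 system, 1993 -> 2004:
# 59.7 Gflop/s -> 70 Tflop/s, an ~1170x improvement in 11 years.
factor = 70e12 / 59.7e9
years = 2004 - 1993
doubling_months = years * 12 * math.log(2) / math.log(factor)
print(f"#1 performance doubled every ~{doubling_months:.0f} months")
```

A doubling time of about 13 months is well ahead of Moore's 18 months, the extra factor coming from ever-larger processor counts.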
Top500 List, June 2005

  Rank  Manuf.  Computer     Installation Site         Country  Year  Rmax (Tflop/s)  Procs
  1     IBM     BlueGene/L   LLNL                      USA      2005  136.8           65536
  2     IBM     BlueGene/L   IBM Watson Res. Center    USA      2005  91.3            40960
  3     SGI     Altix        NASA                      USA      2004  51.9            10160
  4     NEC     Vector       Earth Simulator Center    Japan    2002  35.9            5120
  5     IBM     Cluster      Barcelona Supercomp. C.   Spain    2005  27.9            4800
Performance Development
Increasing CPU Performance
- Manycore chip
  - Composed of hybrid cores:
    - some general purpose
    - some graphics
    - some floating point
What is Next?
- Board composed of multiple manycore chips sharing memory
- Rack composed of multiple boards
- A room full of these racks
  - Millions of cores
  - Exascale systems (10^18 Flop/s)
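How far off is exascale? A rough projection (the ~100 Tflop/s baseline and the 13-month doubling time are assumptions based on the 2005 #1 system and historical Top500 growth, not figures from this slide):

```python
import math

# Years to reach a target performance, assuming steady exponential growth.
def years_to(target, current, doubling_months=13):
    return math.log2(target / current) * doubling_months / 12

# From ~100 Tflop/s (2005's #1 system) to 1 Eflop/s = 1e18 Flop/s:
print(f"~{years_to(1e18, 100e12):.0f} years to exascale")
```

Closing the remaining factor of 10,000 takes about 13 doublings, i.e. on the order of 15 years at historical rates.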
Moore's Law Reinterpreted
- The number of cores per chip doubles every 2 years, while clock speed decreases (not increases)
- Need to deal with systems with millions of concurrent threads
- The number of threads of execution doubles every 2 years
Performance Projection
Directions
- Move toward shared memory
  - SMPs and distributed shared memory
  - Shared address space with deep memory hierarchy
- Clustering of shared memory machines for scalability
- Efficiency of message passing and data parallel programming
  - MPI and HPF
Future of HPC
- Yesterday's HPC is today's mainframe is tomorrow's workstation.