Title: APC523/AST523 Scientific Computation in Astrophysics
1. APC523/AST523 Scientific Computation in Astrophysics
- Lecture 2
- Computer Architecture
2. Is it really necessary to study CA to be proficient in scientific computation?
After all, you don't need to know how an internal combustion engine works to drive a car.
3. If you plan to run only canned software packages, then you probably do not need to know anything about CA (and you probably shouldn't be taking this course!). If you plan to write efficient code on modern parallel processors, you have to understand how those processors work.
4. Current trends in CA
- Desktop systems: driven by price/performance.
- Servers: driven by reliability and scalability.
- Embedded processors: driven by price/power consumption.
We will focus on desktop systems only.
5. Measuring Performance
Price/performance is the key design issue for scientific computation. (Caveat: power consumption is an important issue for large clusters.)
Execution time: time between the start and end of an event.
Throughput: total work done in a given amount of time.
We are more interested in execution time than throughput.
CPU time: time the CPU is computing.
Wall-clock time: total execution time, including CPU time, I/O, OS overhead, everything.
The quantum of time on a computer is the clock period.
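For concreteness, here is a minimal sketch (not from the lecture) of measuring CPU time versus wall-clock time in C with the standard clock() and time() functions; the summation loop is just filler work.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        clock_t cpu0  = clock();      /* CPU time used by this process       */
        time_t  wall0 = time(NULL);   /* wall-clock time, 1-second precision */

        double s = 0.0;               /* filler work: a floating-point loop  */
        for (long i = 1; i <= 100000000L; i++)
            s += 1.0 / (double)i;

        clock_t cpu1  = clock();
        time_t  wall1 = time(NULL);

        printf("sum        = %g\n", s);
        printf("CPU time   = %g s\n", (double)(cpu1 - cpu0) / CLOCKS_PER_SEC);
        printf("wall clock = %g s\n", difftime(wall1, wall0));
        return 0;
    }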
6. Measuring performance
Benchmark: model program with which to measure performance.
Real applications: the best choice for a given user, but how to weight the importance of different applications? --> benchmark suites.
Kernel: small, compute-intensive pieces of real programs used as a benchmark, e.g. Linpack.
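As a hedged illustration (the real Linpack source is far more elaborate), the heart of a Linpack-style kernel is essentially a DAXPY loop; the function name and signature below are illustrative only.

    /* Sketch of a DAXPY loop, the compute-intensive core of Linpack-style
       benchmarks: y <- a*x + y over n elements.  Timing repeated passes and
       counting 2*n floating-point operations per pass gives a Mflop/s rate. */
    void daxpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }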
7. We've all been spoiled by Moore's Law
"The transistor density in integrated circuits will double every two years." Prediction by Gordon Moore, 1965.
Since performance scales with transistor density, Moore's Law has been interpreted as a prediction about performance as well. In practice, performance doubles about every 18 months. It shows no sign of ending.
Imagine if astronomical observatories doubled their capabilities every 18 months, all with no financing from the NSF! But we are forced to use mass-produced commodity processors that were not designed for scientific computation.
8. Amdahl's Law
Defines the speedup that can be gained from a particular performance enhancement. Let
  a  = fraction of the program that can use the enhancement
  S  = speedup of the entire code
  Sa = speedup of the enhanced portion of code by itself
Then
  S = 1 / ((1 - a) + a/Sa)
E.g., suppose your program takes 20 secs to execute, and you reduce the execution time of some portion of it from 10 secs to 5 secs. Then a = 0.5, Sa = 2, and S = 4/3.
Amdahl's law expresses a law of diminishing returns: overall program performance is limited by the slowest step, so improving the performance of one part may not lead to much improvement overall.
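A minimal numerical check of the example above, assuming the form of Amdahl's law given on this slide; the function name amdahl is purely illustrative.

    #include <stdio.h>

    /* Amdahl's law sketch: a = fraction of run time that is enhanced,
       Sa = speedup of that fraction by itself. */
    double amdahl(double a, double Sa)
    {
        return 1.0 / ((1.0 - a) + a / Sa);
    }

    int main(void)
    {
        /* slide's example: 20 s total, a 10 s portion sped up to 5 s */
        printf("S = %g\n", amdahl(0.5, 2.0));   /* prints 1.33333, i.e. 4/3 */
        return 0;
    }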
9. Top 500 list (www.top500.org)
Since 1993, the Linpack benchmark has been used to measure the performance of machines worldwide. Every six months, a list of the 500 best-performing machines is released. Of course, there is much competition to be at the top of the list, but the utility of the list is that it reveals important trends in the architecture of high-performance computers.
10. Performance increase since 1993 is mind-boggling (chart: Top 500 performance, 1993 to 2005)
11. (chart, no transcript)
12. Scalar processors completely dominate the list today
13. Clusters are beginning to dominate
14. Growing dominance by machines with 256 or more processors
15. The story of mario: a lesson in the rapid pace of progress
- Cray C90, 16 Gflops, 4 GB main memory, 130 GB disks
- Installed at the Pittsburgh Supercomputing Center (PSC) in 1993
- List price $35M
- By 1996, the Cray T3E at PSC outperformed mario by a factor of 10
- Decommissioned in 1999
- Sold on eBay in 2000 for $50k as living-room furniture
- Today, a quad Opteron with more memory and disk space is $5k
16. Basic components of a computer
- Processor
- Memory
- Communication channels
- Input/output (I/O) channels
17. 1. Processor
The brains of the computer: performs operations on data according to instructions controlled by the programmer. Generally, the programmer writes an algorithm in a high-level language (e.g., C, C++, F90). The language is translated into instructions by a compiler (producing an executable).
Interpreted languages (e.g. Java) are also popular. These are translated into instructions at runtime. They are more flexible and easier to code, but are generally much slower and may have unreliable floating-point arithmetic. It is probably a bad idea to use Java for large-scale scientific computations.
18. The instruction set
Each processor has a unique, specific instruction set. Increasingly complex instruction sets were developed up to the 1980s (e.g. VAX, x86). In the mid-80s, reduced instruction set computers (RISC) were introduced (e.g. MIPS R4000). By focusing on a smaller set of simple instructions, more sophisticated hardware optimization strategies could be implemented. Today, almost all processors are RISC; the x86 instruction set survives only to retain binary compatibility.
19. Basic architecture
Virtually all processors use a register-to-register architecture. All data processed by the CPU enters and leaves via registers (usually 64 bits in size).
C = A + B becomes:
  Load  R1, A
  Load  R2, B
  Add   R3, R1, R2
  Store R3, C
Operands: most 32-bit processors support 8-, 16- and 32-bit integer operands, and 32- and 64-bit floating-point arithmetic. 64-bit architectures support 64-bit integers as well.
See HP Appendix G on the web to see how floating-point arithmetic actually works.
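As a quick check of operand sizes (not part of the slide), this C snippet prints the sizes on whatever machine compiles it; the value for long in particular differs between platforms.

    #include <stdio.h>

    int main(void)
    {
        /* on a typical 64-bit Linux system this prints 4, 8, 4, 8 */
        printf("int    : %zu bytes\n", sizeof(int));
        printf("long   : %zu bytes\n", sizeof(long));
        printf("float  : %zu bytes\n", sizeof(float));
        printf("double : %zu bytes\n", sizeof(double));
        return 0;
    }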
20. Pipelining: the most important RISC optimization
Often the same sequence of instructions is repeated many times (loops). We can optimize by designing the processor to overlap the different steps in the sequence, like a pipeline.

  Clock cycle        1   2   3   4   5   6   7   8   9
  instruction i      IF  ID  EX  ME  WB
  instruction i+1        IF  ID  EX  ME  WB
  instruction i+2            IF  ID  EX  ME  WB
  instruction i+3                IF  ID  EX  ME  WB
  instruction i+4                    IF  ID  EX  ME  WB

IF = instruction fetch, ID = instruction decode, EX = execute, ME = memory reference, WB = write back.
The pipeline in this example takes 9 clock cycles to complete 5 instructions; an un-pipelined processor would take 5 x 6 = 30 cycles. Pipelining is an example of instruction-level parallelism (ILP).
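As a small worked illustration (assuming an ideal pipeline with no stalls), n instructions on a k-stage pipeline finish in k + (n - 1) cycles, which reproduces the 9 cycles in the diagram above.

    #include <stdio.h>

    int main(void)
    {
        int k = 5, n = 5;                              /* stages, instructions */
        printf("pipelined: %d cycles\n", k + (n - 1)); /* 5 + 4 = 9 cycles     */
        return 0;
    }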
21. Hazards to pipelining
- Data hazards: data for instruction i+1 depends on data produced by instruction i.
- Control hazards: the pipeline contains a conditional branch.
- Most processors limit branching penalties by a variety of techniques: branch prediction, predicted-not-taken, etc.
- Lessons for the programmer (see the sketch below):
  - Isolate recursion formulae from other work, since they will interrupt pipelining of other instructions.
  - Avoid conditional branches in pipelined code.
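A minimal C sketch of the first lesson, with illustrative function names: the recurrence loop carries a data hazard from one iteration to the next, while the second loop has independent iterations and pipelines freely.

    /* data hazard: a[i] depends on a[i-1] from the previous iteration */
    void recurrence(int n, double *a, const double *b)
    {
        for (int i = 1; i < n; i++)
            a[i] = a[i-1] + b[i];
    }

    /* no dependence between iterations: pipelines (and vectorizes) well */
    void independent(int n, double *d, const double *b, const double *c)
    {
        for (int i = 0; i < n; i++)
            d[i] = b[i] * c[i];
    }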
22. The role of compilers in CA
- Today, most code is written in one of a small number of languages. Hardware is now designed to optimize the instructions produced by the compilers of those languages.
- Compilers can (see the hand-written sketch below):
  - Integrate procedures into the calling code
  - Eliminate common sub-expressions (do algebra!)
  - Eliminate unnecessary temporary variables (reduces loads/stores)
  - Change the order of instructions (e.g. move code outside a loop)
  - Pipeline
  - Optimize register allocation
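A hand-written sketch of two of these transformations, common sub-expression elimination and hoisting loop-invariant code; the functions and formula are made up for illustration, and an optimizing compiler performs the same rewrite automatically.

    /* before: x*y is computed twice and 2.0*pi is recomputed every pass */
    void before(int n, double *a, double x, double y, double pi)
    {
        for (int i = 0; i < n; i++)
            a[i] = (x * y) * i + (x * y) + 2.0 * pi * i;
    }

    /* after: common sub-expression and loop-invariant terms are hoisted */
    void after(int n, double *a, double x, double y, double pi)
    {
        double xy    = x * y;
        double twopi = 2.0 * pi;
        for (int i = 0; i < n; i++)
            a[i] = xy * i + xy + twopi * i;
    }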
23. 2. Memory
Stores both data and instructions. Organized into bits, bytes, words, and double words. The size of a word is now standardized (32 bits), but byte order is still not standardized. Two possibilities:
  Little endian: stores the leading byte last (at the "little end"): 7 6 5 4 3 2 1 0 (e.g. Intel)
  Big endian: stores the leading byte first (at the "big end"): 0 1 2 3 4 5 6 7 (Sparc, PowerPC)
Data transferred from one architecture to another (e.g. for visualization) must be byte-swapped.
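A minimal C sketch (not from the slides) of detecting byte order at run time and byte-swapping a 4-byte word; the function names are illustrative.

    #include <stdio.h>
    #include <stdint.h>

    int is_little_endian(void)
    {
        uint32_t x = 1;
        return *(unsigned char *)&x == 1;   /* is the first byte the "little end"? */
    }

    uint32_t byteswap32(uint32_t x)         /* reverse the four bytes of x */
    {
        return  (x >> 24)                |
               ((x >>  8) & 0x0000ff00U) |
               ((x <<  8) & 0x00ff0000U) |
                (x << 24);
    }

    int main(void)
    {
        printf("this machine is %s-endian\n", is_little_endian() ? "little" : "big");
        printf("0x%08x -> 0x%08x\n", 0x01234567U, byteswap32(0x01234567U));
        return 0;
    }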
24. Memory design
Dynamic Random Access Memory (DRAM): bits are stored in a 2D array, accessed by rows and columns (reduces the number of address pins). Typical access time: 100 ns. DRAM must be refreshed; typically 5% of reads have to wait for a refresh to finish. Reading destroys data in DRAM, so it must be re-written after a read. Both introduce latency.
DRAM comes on Dual Inline Memory Modules (DIMMs). Since 1998, memory on DIMMs doubles every 2 yrs, slower than Moore's Law, which is leading to a memory/processor performance mismatch.
Synchronous DRAM (SDRAM): contains a clock that synchronizes with the CPU to increase memory bandwidth. The memory bus operates at 100-150 MHz with (usually) 8-bit wide channels, which means 800-1200 Mb/s. Double Data Rate (DDR) SDRAM is now available, with 2 bits transferred each clock cycle (see the arithmetic sketch below).
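The bandwidth figures quoted above are just the bus clock times the channel width; a trivial sketch of the arithmetic with the slide's numbers (DDR doubles the rate by transferring on both clock edges).

    #include <stdio.h>

    int main(void)
    {
        double mhz = 100.0, width_bits = 8.0;             /* slide's low-end numbers */
        printf("SDRAM: %g Mb/s\n", mhz * width_bits);         /*  800 Mb/s */
        printf("DDR  : %g Mb/s\n", 2.0 * mhz * width_bits);   /* 1600 Mb/s */
        return 0;
    }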
25. Hierarchical Memory
- Ideally, the entire memory system would be built using the fastest chips possible. Not practical.
- Instead, exploit the principle of locality: most programs access adjacent memory locations sequentially.
  - location M at time t --> location M+1 at time t+1
- Design solution: hierarchical memory
  - Memory closest to the processor uses the fastest chips (cache)
  - Main memory is built from DDR SDRAM
  - Additional memory can be built from disks (virtual memory)
- Usually the cache is subdivided into several levels (L1, L2, L3).
- Data is transferred between levels in blocks:
  - between cache and main memory: cache line
  - between main and virtual memory: page
26. How does hierarchical memory work?
If the processor needs the item at address A, but it is not in cache, the memory system moves a cache line containing A (and A+1, A+2, etc.) from main memory into cache. Then, if the processor needs A+1 on the next cycle, it is already in cache.
If the memory location needed by the processor is in cache: cache hit.
If the memory location needed by the processor is not in cache: cache miss.
The fraction of requests which are hits is the hit rate.
Goal of CA: design the cache to maximize the hit rate of a typical program by optimizing cache size and cache-line size, and by using prefetch and non-blocking caches.
Goal of the programmer: write code to maximize the hit rate.
27. Effective access time for hierarchical memory is
  t_eff = H * t_cache + (1 - H) * t_main
where
  t_eff   = effective access time
  t_cache = access time of cache
  t_main  = access time of main memory
  H       = hit rate
Suppose t_cache = 10 ns, t_main = 100 ns, H = 98%. Then
  t_eff = (0.98)(10) + (1 - 0.98)(100) = 11.8 ns
almost as fast as the cache! In reality, t_main can vary greatly depending on latency, the OS, etc.
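The same calculation as a short C sketch, using the slide's numbers.

    #include <stdio.h>

    /* effective access time for a two-level memory hierarchy */
    double t_eff(double H, double t_cache, double t_main)
    {
        return H * t_cache + (1.0 - H) * t_main;
    }

    int main(void)
    {
        printf("t_eff = %g ns\n", t_eff(0.98, 10.0, 100.0));   /* 11.8 ns */
        return 0;
    }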
28. Write code that maximizes cache hits
For example, always access data contiguously. Order loops so that the inner loop runs over neighboring data elements. Avoid strides not equal to one.

    for (i=0; i<100; i++)
      for (j=0; j<100; j++)        /* BAD */
        a[j][i] = b[j][i] + c[j][i];

    for (i=0; i<100; i++)
      for (j=0; j<100; j++)        /* GOOD */
        a[i][j] = b[i][j] + c[i][j];

Note: exactly the OPPOSITE ordering is necessary in FORTRAN.
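A self-contained, compilable version of the comparison above (the "+" operator and 2000 x 2000 array size are illustrative, and exact timings depend on the machine); the stride-one ordering typically runs several times faster.

    #include <stdio.h>
    #include <time.h>

    #define N 2000
    static double a[N][N], b[N][N], c[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { b[i][j] = 1.0; c[i][j] = 2.0; }

        clock_t t0 = clock();
        for (int i = 0; i < N; i++)          /* BAD: stride-N access in C    */
            for (int j = 0; j < N; j++)
                a[j][i] = b[j][i] + c[j][i];

        clock_t t1 = clock();
        for (int i = 0; i < N; i++)          /* GOOD: stride-one access in C */
            for (int j = 0; j < N; j++)
                a[i][j] = b[i][j] + c[i][j];

        clock_t t2 = clock();
        printf("a[0][0] = %g\n", a[0][0]);   /* keep the arrays "live"       */
        printf("column order: %g s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("row order   : %g s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }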
29. The role of compilers in memory organization
- Compilers organize memory into:
  - stack: local variables
  - heap: dynamic variables addressed by pointers
  - global data space: statically declared global variables/constants
- Register allocation by the compiler is impossible for the heap.
- If there are multiple ways to reference the address of a variable, they are said to be aliased, and the variable cannot be allocated to a register:
    p = &a;      /* gets address of a      */
    a = 2;       /* assigns to a directly  */
    *p = 1;      /* uses p to assign to a  */
    c = a + b;   /* accesses a             */
  Here a cannot be allocated to a register.
- Moral for the programmer: use pointer references sparingly! (A small sketch follows below.)
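A small sketch of the moral (illustrative function): copying the pointed-to value into a local variable lets the compiler keep it in a register across the loop, since the local cannot be aliased by anything else.

    /* read the scale factor once into a local so it can live in a register */
    double sum_scaled(int n, const double *x, const double *scale_ptr)
    {
        double scale = *scale_ptr;
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += scale * x[i];
        return s;
    }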
30. Interleaved memory
To increase bandwidth to memory, and to reduce the effect of latency, we can divide memory into N banks and distribute the data across the banks, with successive addresses in successive banks:

    bank 0   bank 1   ...   bank N-1
       1        2     ...      N
      N+1      N+2    ...     2N
      ...      ...    ...     ...

Provided the data is accessed with a stride not equal to N, successive memory references go to different banks.
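A trivial sketch of the address-to-bank mapping implied by the layout above, assuming N = 8 banks (the bank count is illustrative): stride-one access cycles through all the banks, while stride-N access hits the same bank every time.

    #include <stdio.h>

    int main(void)
    {
        int N = 8;                                   /* assumed number of banks */
        for (int addr = 1; addr <= 16; addr++)
            printf("address %2d -> bank %d\n", addr, (addr - 1) % N);
        return 0;
    }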