Title: CS 612: Software Design for High-performance Architectures
1. CS 612: Software Design for High-performance Architectures
2. Administration

- Instructor: Keshav Pingali
  - 457 Rhodes Hall
  - pingali_at_cs.cornell.edu
- TA: Milind Kulkarni
  - 490 Rhodes Hall
  - milind_at_cs.cornell.edu
3. Course content

- Understand high-end programming paradigms, compilers, and runtime systems:
  - Applications requirements
  - Shared-memory programming
  - Optimistic and pessimistic parallelization
  - Transactional memory
  - Memory hierarchy optimization
  - Self-optimizing systems
- Focus on the software problem for multicore processors
4. Problem

- Silicon designers can choose from a variety of methods to increase processor performance
- Commercial end-customers are demanding:
  - More capable systems with more capable processors
  - That new systems stay within their existing power/thermal infrastructure
- Processor frequency and power consumption seem to be scaling in lockstep
- How can the industry-standard PC and server industries stay on our historic performance curve without burning a hole in our motherboards?
5. What is a processor?

- A single chip package that fits in a socket
- ≥1 core (not much point in <1 core)
  - Cores can have functional units, cache, etc. associated with them, just as today
  - Cores can be fast or slow, just as today
- Shared resources
  - More cache
  - Other integration: memory controllers, high-speed serial links, etc.
- One system interface no matter how many cores
  - Number of signal pins doesn't scale with the number of cores
6. ILP Problem

- Functional units
  - Superscalar is known territory
  - Diminishing returns for adding more functional blocks
  - Alternatives like VLIW have been considered and rejected by the market
  - Single-threaded architectural performance is pegged
- Data paths
  - Increasing bandwidth between functional units in a core makes a difference
  - Such as a comprehensive 64-bit design, but then where to?
7. ILP Problem (contd.)

- Pipeline
  - Deeper pipeline buys frequency at the expense of increased cache-miss penalty and lower instructions per clock
  - Shallow pipeline gives better instructions per clock at the expense of frequency scaling
  - Max frequency per core requires deeper pipelines
  - Industry converging on a middle ground: 9 to 11 stages
  - Successful RISC CPUs are in the same range
- Cache
  - Cache size buys performance at the expense of die size
  - Deep-pipeline cache-miss penalties are reduced by larger caches
8. Power problem

- Moore's Law isn't dead: more transistors for everyone!
- But it doesn't really mention scaling transistor power
- Chemistry and physics at the nano-scale
  - Stretching materials science
  - Transistor leakage current is increasing
- As manufacturing economies and frequency increase, power consumption is increasing disproportionately
- There are no process or architectural quick-fixes
9. Static Current vs. Frequency

[Figure: static current vs. frequency, non-linear as processors approach max frequency; curves for "Fast, High Power" and "Fast, Low Power" parts; static current axis from 0 to 15, frequency axis from 1.0 to 1.5.]
10. Power vs. Frequency

- In AMD's process, for 200MHz frequency steps, two steps back on frequency cuts power consumption by ~40% from maximum frequency
- Substantially lower power with lower frequency
- Result: a dual-core running at n-2 frequency steps fits in the same thermal envelope as a single-core running at top speed
11. AMD Multi-Core Processor

- Dual-core AMD Opteron processor is 199 mm² in 90nm
- Single-core AMD Opteron processor is 193 mm² in 130nm
12. Multi-Core Processor Architecture
13. Multi-Core Software

- More aggregate performance for:
  - Multi-threaded apps
  - Transactions: many instances of the same app
  - Multi-tasking
- Problem
  - Most apps are not multithreaded
  - Writing multithreaded code increases software costs dramatically (factor of 3 for some game engines)
14. First problem: parallelization

"We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require. I have talked with a few people at Microsoft Research who say this is also at or near the top of their list of critical CS research problems."
- Justin Rattner, Senior Fellow, Intel
15. Second problem: memory hierarchy

"The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. Controlling memory access patterns will drive hardware and software designs for the foreseeable future."
- Richard Sites, DEC
16. Memory Hierarchy of SGI Octane

  Level     | Size                | Access time (cycles)
  ----------|---------------------|---------------------
  Registers | 64                  | -
  L1 cache  | 32KB (I) + 32KB (D) | 2
  L2 cache  | 1MB                 | 10
  Memory    | 128MB               | 70
- R10K processor
  - 4-way superscalar, 2 fpops/cycle, 195MHz
  - Peak performance: 390 Mflops
- Experience: sustained performance is less than 10% of peak
- Processor often stalls waiting for the memory system to load data
17. Memory-wall solutions

- Latency avoidance
  - multi-level memory hierarchies (caches)
- Latency tolerance
  - pre-fetching
  - multi-threading
- Techniques are not mutually exclusive:
  - Most microprocessors have caches and pre-fetching
  - Modest multi-threading is coming into vogue
- Our focus: memory hierarchies
18. Hiding latency in numerical codes

- Most numerical kernels: O(n^3) work, O(n^2) data
  - all factorization codes
    - Cholesky factorization: A = LL^T (A is spd)
    - LU factorization: A = LU
    - LU factorization with pivoting: A = LU
    - QR factorization: A = QR (Q is orthogonal)
    - BLAS-3 matrix multiplication
  - use latency-avoidance techniques
- Matrix-vector product: O(n^2) work, O(n^2) data
  - use latency-tolerance techniques such as pre-fetching (see the sketch below)
  - particularly important for iterative solution of large sparse systems
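As a concrete illustration of latency tolerance (an addition, not from the original slides), here is a minimal C sketch of software prefetching in a matrix-vector product. It uses GCC's __builtin_prefetch; the prefetch distance PF_DIST is an assumed tuning parameter that would have to be calibrated to the target machine.

#include <stddef.h>

/* Matrix-vector product y = A*x with software prefetching.
   A is n x n, stored in row-major order. */
void matvec_prefetch(size_t n, const double *A, const double *x, double *y)
{
    const size_t PF_DIST = 16;  /* assumed prefetch distance, in elements */
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        const double *row = &A[i * n];
        for (size_t j = 0; j < n; j++) {
            /* hint: start fetching the data needed PF_DIST iterations ahead */
            if (j + PF_DIST < n)
                __builtin_prefetch(&row[j + PF_DIST], 0, 0);
            sum += row[j] * x[j];
        }
        y[i] = sum;
    }
}

The prefetch overlaps memory latency with the multiply-adds that follow; since this kernel does only O(n^2) work on O(n^2) data, there is no reuse for a cache to exploit, which is why tolerance rather than avoidance is the right tool here.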
19. Software problem

- Caches are useful only if programs have locality of reference
  - temporal locality: program references to a given memory address are clustered together in time
  - spatial locality: program references clustered in address space are clustered in time
- Problem
  - Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
  - Worrying about locality when coding algorithms complicates the software process enormously
20. Example: matrix multiplication

DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

- Great algorithmic data reuse: each array element is touched O(N) times!
- All six loop permutations are computationally equivalent (even modulo round-off error)
- However, execution times of the six versions can be very different if the machine has a cache (two of the versions are sketched in C below)
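To make the loop-order point concrete, here is a C sketch of two of the six permutations (the slides use Fortran-style loops; C's row-major arrays match the slide's storage assumption, and the size N is a placeholder). Both functions compute exactly the same result; they differ only in how they walk B and C.

#define N 1024

double A[N][N], B[N][N], C[N][N];

/* IJK order: the inner K loop walks B down a column (stride N),
   giving poor spatial locality on B. */
void mmm_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* IKJ order: the inner J loop walks B and C along rows (stride 1),
   giving good spatial locality on both. */
void mmm_ikj(void)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}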
21. IJK version (large cache)

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the three matrices A, B, and C, with the K dimension marked on A and B.]

- Large-cache scenario:
  - Matrices are small enough to fit into cache
  - Only cold misses, no capacity misses
- Miss ratio:
  - Data size = 3N^2
  - Each miss brings in b floating-point numbers
  - Miss ratio = (3N^2/b) / 4N^3 = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)
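Written out (an elaboration of the slide's numbers, not text from the original), the large-cache miss ratio is the cold misses divided by the total memory accesses; in LaTeX:

\[
\text{miss ratio} = \frac{3N^2/b}{4N^3} = \frac{0.75}{bN} \approx 0.019 \qquad (b = 4,\; N = 10)
\]

Here 4N^3 counts the four array references (read C, read A, read B, write C) in each of the N^3 innermost iterations, and 3N^2/b is the number of cold misses needed to bring in all three matrices.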
22. IJK version (small cache)

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the three matrices A, B, and C, with the K dimension marked on A and B.]

- Small-cache scenario:
  - Matrices are large compared to cache; row-major storage
  - Cold and capacity misses
- Miss ratio:
  - C: N^2/b misses (good temporal locality)
  - A: N^3/b misses (good spatial locality)
  - B: N^3 misses (poor temporal and spatial locality)
  - Miss ratio ≈ 0.25(b+1)/b = 0.3125 (for b = 4)
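Summing the misses of the three arrays over the same 4N^3 accesses gives the slide's formula (the N^2/b term for C is negligible for large N):

\[
\text{miss ratio} \approx \frac{N^2/b + N^3/b + N^3}{4N^3} \approx \frac{1/b + 1}{4} = \frac{0.25\,(b+1)}{b} = 0.3125 \qquad (b = 4)
\]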
23. MMM Experiments

- Simulated L1 cache miss ratio for Intel Pentium III
  - MMM with N = 1 ... 1300
  - 16KB cache, 32B/block, 4-way set-associative, 8-byte elements
24. Quantifying performance differences

DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

- Octane
  - L2 cache hit: 10 cycles; cache miss: 70 cycles
- Time to execute IKJ version:
  - 2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
- Time to execute JKI version:
  - 2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3
- Speed-up = 2.2
- Key transformation: loop permutation
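The arithmetic behind these figures, written out (assuming, as on the slide, 2N^3 cycles of computation, 4N^3 memory accesses at 10 cycles per hit and 70 per miss, and miss ratios of 0.13 for IKJ and 0.5 for JKI):

\begin{align*}
T_{\mathrm{IKJ}} &= 2N^3 + 70 \cdot 0.13 \cdot 4N^3 + 10 \cdot 0.87 \cdot 4N^3 = 73.2\,N^3 \\
T_{\mathrm{JKI}} &= 2N^3 + 70 \cdot 0.5 \cdot 4N^3 + 10 \cdot 0.5 \cdot 4N^3 = 162\,N^3 \\
\text{speed-up} &= 162/73.2 \approx 2.2
\end{align*}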
25. Even better...

- Break MMM into a bunch of smaller MMMs so that the large-cache model is true for each small MMM
- ⇒ large-cache model is valid for the entire computation
- ⇒ miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size
26. Loop tiling

DO It = 1, N, t
  DO Jt = 1, N, t
    DO Kt = 1, N, t
      DO I = It, It+t-1
        DO J = Jt, Jt+t-1
          DO K = Kt, Kt+t-1
            C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: A, B, and C divided into t x t tiles, indexed by (It, Kt), (Kt, Jt), and (It, Jt).]

- Break big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t x t (see the C sketch below)
- Parameter t (tile size) must be chosen carefully:
  - as large as possible
  - but the working set of the small matrix multiplication must fit in cache
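Here is a C sketch of the tiled loop nest above (an illustration; the tile size T is an assumed value that must be tuned to the cache, and min_sz handles matrix sizes that are not a multiple of T):

#include <stddef.h>

#define T 64  /* assumed tile size; choose so the tiles' working set fits in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled MMM, C = C + A*B, on row-major n x n arrays. */
void mmm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t it = 0; it < n; it += T)
        for (size_t jt = 0; jt < n; jt += T)
            for (size_t kt = 0; kt < n; kt += T)
                /* one small MMM whose working set fits in cache */
                for (size_t i = it; i < min_sz(it + T, n); i++)
                    for (size_t j = jt; j < min_sz(jt + T, n); j++) {
                        double sum = C[i*n + j];
                        for (size_t k = kt; k < min_sz(kt + T, n); k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}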
27. Speed-up from tiling

- Miss ratio for the blocked computation
  - = miss ratio for the large-cache model
  - = 0.75/(bt)
  - ≈ 0.001 (b = 4, t = 200) for the Octane
- Time to execute the tiled version:
  - 2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 = 42.3 N^3
- Speed-up over JKI version: 162/42.3 ≈ 4
28. Observations

- Locality-optimized code is more complex than the high-level algorithm
- Loop orders and tile size must be chosen carefully
  - cache size is the key parameter
  - associativity matters
- Actual code is even more complex: must also optimize for processor resources (see the sketch after this list)
  - registers: register tiling
  - pipeline: loop unrolling
- Optimized MMM code can be 1000 lines of C code
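As an illustration of register tiling and unrolling (a sketch, not the course's reference code), the 2x2 tile below keeps four accumulators of C in registers across the whole K loop, and the straight-line body gives the compiler's unroller and scheduler independent multiply-adds to overlap. It assumes n is even; a production kernel would combine this with cache tiling.

/* 2x2 register-tiled inner kernel for C = C + A*B (row-major, n even). */
void mmm_regtile(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            /* four accumulators live in registers for the entire K loop */
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (int k = 0; k < n; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i*n + j] = c00;      C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10;  C[(i+1)*n + j + 1] = c11;
        }
}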
29. One solution to both problems: restructuring compilers (1985-)

- Programmer writes high-level, architecture-independent code
- Restructuring compiler optimizes the program for:
  - Number of cores
  - Number of registers
  - Cache organization
  - Instruction set: multiply-add? vector extensions?
30. Two key issues

[Figure: program P transformed (1) into a set of equivalent programs P1, P2, P3, ...; (2) one of them is selected for the target architecture.]

- Program restructuring: given program P, determine a set of equivalent programs P1, P2, P3, ...
- Program selection: determine which program performs best on the target architecture
31. Automatic parallelization

- Pessimistic parallelization (see the sketch after this list)
  - Compiler determines a partial order on program operations by determining dependences
  - At run-time, execute operations in parallel, respecting dependences
  - Works reasonably well for array programs but not for irregular data structures like trees and graphs
- Optimistic parallelization
  - Execute operations speculatively in parallel, assuming that dependences do not exist
  - Check at runtime if dependences are violated
  - If so, roll back execution to a safe point and re-execute sequentially
  - Works only if optimism is warranted
  - Lots of interest in transactional memory, which is one model of optimistic parallelization
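For the pessimistic case, here is a minimal OpenMP sketch in C (an illustration, not from the slides): dependence analysis can prove that each iteration writes a distinct y[i] and reads only x, so the iterations can safely run in parallel. For a tree or graph traversal, no such proof is generally possible at compile time, which is what motivates the optimistic approach.

#include <omp.h>

/* Each iteration touches disjoint data, so the compiler or programmer
   can establish that there are no loop-carried dependences. */
void scale(int n, const double *x, double *y, double alpha)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i];
}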
32. Automatic locality enhancement

- Some methodology exists for array programs, but little is known for irregular programs
- Many compilers can perform tiling and permutation automatically (gcc)
- Choosing parameter values (tile sizes, etc.):
  - Compiler can use architectural models
  - Self-optimizing systems: the system determines the best values using some kind of heuristic search (ATLAS, FFTW) (see the sketch below)
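Here is a minimal sketch of what ATLAS-style empirical search might look like (the candidate sizes, timing method, and helper functions are assumptions for illustration, not ATLAS's actual code): instead of predicting the best tile size from an architectural model, time each candidate on the machine at hand and keep the fastest.

#include <stddef.h>
#include <time.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled MMM with the tile size as a runtime parameter. */
static void mmm_tiled_t(size_t n, size_t t,
                        const double *A, const double *B, double *C)
{
    for (size_t it = 0; it < n; it += t)
        for (size_t jt = 0; jt < n; jt += t)
            for (size_t kt = 0; kt < n; kt += t)
                for (size_t i = it; i < min_sz(it + t, n); i++)
                    for (size_t j = jt; j < min_sz(jt + t, n); j++) {
                        double s = C[i*n + j];
                        for (size_t k = kt; k < min_sz(kt + t, n); k++)
                            s += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = s;
                    }
}

/* Empirical search: run each candidate tile size and keep the fastest. */
size_t pick_tile(size_t n, const double *A, const double *B, double *C)
{
    const size_t cand[] = { 16, 32, 64, 128, 256 };  /* assumed candidates */
    size_t best = cand[0];
    double best_time = 1e30;
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++) {
        clock_t t0 = clock();
        mmm_tiled_t(n, cand[i], A, B, C);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (dt < best_time) { best_time = dt; best = cand[i]; }
    }
    return best;
}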
33. Course outline

- Applications requirements
  - Scientific and engineering applications
  - Commercial workloads
- Shared-memory programming
  - Memory consistency models
  - OpenMP
- Optimistic and pessimistic parallelization
  - Dependence analysis techniques for array and irregular programs
  - Transactional memory: models and implementations
- Automatic locality enhancement
- Self-optimizing systems
34. Course work

- Small number of programming assignments
- Paper presentations and class participation
  - We will have papers online by next Monday
  - Sign up for a presentation by next Thursday
- Substantial course project
  - independent reading
  - implementation work
  - presentation