Title: CS 612: Software Design for High-performance Architectures
1. CS 612: Software Design for High-performance Architectures
2. Administration

- Instructor: Keshav Pingali
  - 457 Rhodes Hall
  - pingali_at_cs.cornell.edu
- TA: Milind Kulkarni
  - 490 Rhodes Hall
  - milind_at_cs.cornell.edu
3. Course content

- Understand high-end programming paradigms, compilers, and runtime systems:
  - Applications requirements
  - Shared-memory programming
  - Optimistic and pessimistic parallelization
  - Transactional memory
  - Memory hierarchy optimization
  - Self-optimizing systems
- Focus on the software problem for multicore processors
4. Problem

- Silicon designers can choose from a variety of methods to increase processor performance
- Commercial end-customers are demanding:
  - More capable systems with more capable processors
  - That new systems stay within their existing power/thermal infrastructure
- Processor frequency and power consumption seem to be scaling in lockstep
- How can the industry-standard PC and server industries stay on our historic performance curve without burning a hole in our motherboards?
5. What is a processor?

- A single chip package that fits in a socket
- ≥1 core (not much point in <1 core)
  - Cores can have functional units, cache, etc. associated with them, just as today
  - Cores can be fast or slow, just as today
- Shared resources
  - More cache
  - Other integration: memory controllers, high-speed serial links, etc.
- One system interface no matter how many cores
  - Number of signal pins doesn't scale with the number of cores
6. ILP Problem

- Functional units
  - Superscalar is known territory
  - Diminishing returns for adding more functional blocks
  - Alternatives like VLIW have been considered and rejected by the market
  - Single-threaded architectural performance is pegged
- Data paths
  - Increasing bandwidth between functional units in a core makes a difference
  - Such as a comprehensive 64-bit design, but then where to?
7. ILP Problem (contd.)

- Pipeline
  - Deeper pipeline buys frequency at the expense of increased cache-miss penalty and lower instructions per clock
  - Shallow pipeline gives better instructions per clock at the expense of frequency scaling
  - Max frequency per core requires deeper pipelines
  - Industry converging on a middle ground: 9 to 11 stages
  - Successful RISC CPUs are in the same range
- Cache
  - Cache size buys performance at the expense of die size
  - Deep-pipeline cache-miss penalties are reduced by larger caches
8. Power problem

- Moore's Law isn't dead: more transistors for everyone!
- But it doesn't really mention scaling transistor power
- Chemistry and physics at the nano-scale
  - Stretching materials science
  - Transistor leakage current is increasing
- As manufacturing economies and frequency increase, power consumption is increasing disproportionately
- There are no process or architectural quick-fixes
9. Static Current vs. Frequency

[Figure: static current vs. frequency, non-linear as processors approach max frequency; curves for "Fast, High Power" and "Fast, Low Power" parts; static current axis from 0 to 15, frequency axis from 1.0 to 1.5.]
10. Power vs. Frequency

- In AMD's process, for 200MHz frequency steps, two steps back on frequency cuts power consumption by ~40% from maximum frequency
- Substantially lower power with lower frequency
- Result: a dual-core running at n-2 frequency steps fits in the same thermal envelope as a single-core running at top speed
11. AMD Multi-Core Processor

- Dual-core AMD Opteron processor is 199 mm² in 90nm
- Single-core AMD Opteron processor is 193 mm² in 130nm
12. Multi-Core Processor Architecture
13. Multi-Core Software

- More aggregate performance for:
  - Multi-threaded apps
  - Transactions: many instances of the same app
  - Multi-tasking
- Problem
  - Most apps are not multithreaded
  - Writing multithreaded code increases software costs dramatically (factor of 3 for some game engines)
14. First problem: parallelization

"We are at the cusp of a transition to multicore, multithreaded architectures, and we still have not demonstrated the ease of programming the move will require. I have talked with a few people at Microsoft Research who say this is also at or near the top of their list of critical CS research problems."
- Justin Rattner, Senior Fellow, Intel
15. Second problem: memory hierarchy

"The CPU chip industry has now reached the point that instructions can be executed more quickly than the chips can be fed with code and data. Future chip design is memory design. Future software design is also memory design. Controlling memory access patterns will drive hardware and software designs for the foreseeable future."
- Richard Sites, DEC
16. Memory Hierarchy of SGI Octane

  Level     | Size                | Access time (cycles)
  ----------|---------------------|---------------------
  Registers | 64                  | -
  L1 cache  | 32KB (I) + 32KB (D) | 2
  L2 cache  | 1MB                 | 10
  Memory    | 128MB               | 70
- R10K processor
  - 4-way superscalar, 2 fpops/cycle, 195MHz
  - Peak performance: 390 Mflops
- Experience: sustained performance is less than 10% of peak
- Processor often stalls waiting for the memory system to load data
17. Memory-wall solutions

- Latency avoidance
  - multi-level memory hierarchies (caches)
- Latency tolerance
  - pre-fetching
  - multi-threading
- Techniques are not mutually exclusive:
  - Most microprocessors have caches and pre-fetching
  - Modest multi-threading is coming into vogue
- Our focus: memory hierarchies
18. Hiding latency in numerical codes

- Most numerical kernels: O(n^3) work, O(n^2) data
  - all factorization codes
    - Cholesky factorization: A = LL^T (A is spd)
    - LU factorization: A = LU
    - LU factorization with pivoting: A = LU
    - QR factorization: A = QR (Q is orthogonal)
    - BLAS-3 matrix multiplication
  - use latency-avoidance techniques
- Matrix-vector product: O(n^2) work, O(n^2) data
  - use latency-tolerance techniques such as pre-fetching (see the sketch below)
  - particularly important for iterative solution of large sparse systems
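As a concrete illustration of latency tolerance (an addition, not from the original slides), here is a minimal C sketch of software prefetching in a matrix-vector product. It uses GCC's __builtin_prefetch; the prefetch distance PF_DIST is an assumed tuning parameter that would have to be calibrated to the target machine.

#include <stddef.h>

/* Matrix-vector product y = A*x with software prefetching.
   A is n x n, stored in row-major order. */
void matvec_prefetch(size_t n, const double *A, const double *x, double *y)
{
    const size_t PF_DIST = 16;  /* assumed prefetch distance, in elements */
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        const double *row = &A[i * n];
        for (size_t j = 0; j < n; j++) {
            /* hint: start fetching the data needed PF_DIST iterations ahead */
            if (j + PF_DIST < n)
                __builtin_prefetch(&row[j + PF_DIST], 0, 0);
            sum += row[j] * x[j];
        }
        y[i] = sum;
    }
}

The prefetch overlaps memory latency with the multiply-adds that follow; since this kernel does only O(n^2) work on O(n^2) data, there is no reuse for a cache to exploit, which is why tolerance rather than avoidance is the right tool here.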
19. Software problem

- Caches are useful only if programs have locality of reference
  - temporal locality: program references to a given memory address are clustered together in time
  - spatial locality: program references clustered in address space are clustered in time
- Problem
  - Programs obtained by expressing most algorithms in the straightforward way do not have much locality of reference
  - Worrying about locality when coding algorithms complicates the software process enormously
20. Example: matrix multiplication

DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

- Great algorithmic data reuse: each array element is touched O(N) times!
- All six loop permutations are computationally equivalent (even modulo round-off error)
- However, execution times of the six versions can be very different if the machine has a cache (two of the versions are sketched in C below)
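To make the loop-order point concrete, here is a C sketch of two of the six permutations (the slides use Fortran-style loops; C's row-major arrays match the slide's storage assumption, and the size N is a placeholder). Both functions compute exactly the same result; they differ only in how they walk B and C.

#define N 1024

double A[N][N], B[N][N], C[N][N];

/* IJK order: the inner K loop walks B down a column (stride N),
   giving poor spatial locality on B. */
void mmm_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* IKJ order: the inner J loop walks B and C along rows (stride 1),
   giving good spatial locality on both. */
void mmm_ikj(void)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}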
21. IJK version (large cache)

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the three matrices A, B, and C, with the K dimension marked on A and B.]

- Large-cache scenario:
  - Matrices are small enough to fit into cache
  - Only cold misses, no capacity misses
- Miss ratio:
  - Data size = 3N^2
  - Each miss brings in b floating-point numbers
  - Miss ratio = (3N^2/b) / 4N^3 = 0.75/(bN) ≈ 0.019 (b = 4, N = 10)
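Written out (an elaboration of the slide's numbers, not text from the original), the large-cache miss ratio is the cold misses divided by the total memory accesses; in LaTeX:

\[
\text{miss ratio} = \frac{3N^2/b}{4N^3} = \frac{0.75}{bN} \approx 0.019 \qquad (b = 4,\; N = 10)
\]

Here 4N^3 counts the four array references (read C, read A, read B, write C) in each of the N^3 innermost iterations, and 3N^2/b is the number of cold misses needed to bring in all three matrices.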
22. IJK version (small cache)

DO I = 1, N
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: the three matrices A, B, and C, with the K dimension marked on A and B.]

- Small-cache scenario:
  - Matrices are large compared to cache; row-major storage
  - Cold and capacity misses
- Miss ratio:
  - C: N^2/b misses (good temporal locality)
  - A: N^3/b misses (good spatial locality)
  - B: N^3 misses (poor temporal and spatial locality)
  - Miss ratio ≈ 0.25(b+1)/b = 0.3125 (for b = 4)
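Summing the misses of the three arrays over the same 4N^3 accesses gives the slide's formula (the N^2/b term for C is negligible for large N):

\[
\text{miss ratio} \approx \frac{N^2/b + N^3/b + N^3}{4N^3} \approx \frac{1/b + 1}{4} = \frac{0.25\,(b+1)}{b} = 0.3125 \qquad (b = 4)
\]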
23. MMM Experiments

- Simulated L1 cache miss ratio for Intel Pentium III
  - MMM with N = 1 ... 1300
  - 16KB cache, 32B/block, 4-way set-associative, 8-byte elements
24. Quantifying performance differences

DO I = 1, N    // assume arrays stored in row-major order
  DO J = 1, N
    DO K = 1, N
      C(I,J) = C(I,J) + A(I,K)*B(K,J)

- Octane
  - L2 cache hit: 10 cycles; cache miss: 70 cycles
- Time to execute IKJ version:
  - 2N^3 + 70*0.13*4N^3 + 10*0.87*4N^3 = 73.2 N^3
- Time to execute JKI version:
  - 2N^3 + 70*0.5*4N^3 + 10*0.5*4N^3 = 162 N^3
- Speed-up = 2.2
- Key transformation: loop permutation
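The arithmetic behind these figures, written out (assuming, as on the slide, 2N^3 cycles of computation, 4N^3 memory accesses at 10 cycles per hit and 70 per miss, and miss ratios of 0.13 for IKJ and 0.5 for JKI):

\begin{align*}
T_{\mathrm{IKJ}} &= 2N^3 + 70 \cdot 0.13 \cdot 4N^3 + 10 \cdot 0.87 \cdot 4N^3 = 73.2\,N^3 \\
T_{\mathrm{JKI}} &= 2N^3 + 70 \cdot 0.5 \cdot 4N^3 + 10 \cdot 0.5 \cdot 4N^3 = 162\,N^3 \\
\text{speed-up} &= 162/73.2 \approx 2.2
\end{align*}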
25. Even better...

- Break MMM into a bunch of smaller MMMs so that the large-cache model is true for each small MMM
- ⇒ large-cache model is valid for the entire computation
- ⇒ miss ratio will be 0.75/(bt) for the entire computation, where t is the tile size
26. Loop tiling

DO It = 1, N, t
  DO Jt = 1, N, t
    DO Kt = 1, N, t
      DO I = It, It+t-1
        DO J = Jt, Jt+t-1
          DO K = Kt, Kt+t-1
            C(I,J) = C(I,J) + A(I,K)*B(K,J)

[Figure: A, B, and C divided into t x t tiles, indexed by (It, Kt), (Kt, Jt), and (It, Jt).]

- Break big MMM into a sequence of smaller MMMs, where each smaller MMM multiplies sub-matrices of size t x t (see the C sketch below)
- Parameter t (tile size) must be chosen carefully:
  - as large as possible
  - but the working set of the small matrix multiplication must fit in cache
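Here is a C sketch of the tiled loop nest above (an illustration; the tile size T is an assumed value that must be tuned to the cache, and min_sz handles matrix sizes that are not a multiple of T):

#include <stddef.h>

#define T 64  /* assumed tile size; choose so the tiles' working set fits in cache */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled MMM, C = C + A*B, on row-major n x n arrays. */
void mmm_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t it = 0; it < n; it += T)
        for (size_t jt = 0; jt < n; jt += T)
            for (size_t kt = 0; kt < n; kt += T)
                /* one small MMM whose working set fits in cache */
                for (size_t i = it; i < min_sz(it + T, n); i++)
                    for (size_t j = jt; j < min_sz(jt + T, n); j++) {
                        double sum = C[i*n + j];
                        for (size_t k = kt; k < min_sz(kt + T, n); k++)
                            sum += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = sum;
                    }
}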
27. Speed-up from tiling

- Miss ratio for the blocked computation
  - = miss ratio for the large-cache model
  - = 0.75/(bt)
  - ≈ 0.001 (b = 4, t = 200) for the Octane
- Time to execute the tiled version:
  - 2N^3 + 70*0.001*4N^3 + 10*0.999*4N^3 = 42.3 N^3
- Speed-up over JKI version: 162/42.3 ≈ 4
28. Observations

- Locality-optimized code is more complex than the high-level algorithm
- Loop orders and tile size must be chosen carefully
  - cache size is the key parameter
  - associativity matters
- Actual code is even more complex: must also optimize for processor resources (see the sketch after this list)
  - registers: register tiling
  - pipeline: loop unrolling
- Optimized MMM code can be 1000 lines of C code
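As an illustration of register tiling and unrolling (a sketch, not the course's reference code), the 2x2 tile below keeps four accumulators of C in registers across the whole K loop, and the straight-line body gives the compiler's unroller and scheduler independent multiply-adds to overlap. It assumes n is even; a production kernel would combine this with cache tiling.

/* 2x2 register-tiled inner kernel for C = C + A*B (row-major, n even). */
void mmm_regtile(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i += 2)
        for (int j = 0; j < n; j += 2) {
            /* four accumulators live in registers for the entire K loop */
            double c00 = C[i*n + j],     c01 = C[i*n + j + 1];
            double c10 = C[(i+1)*n + j], c11 = C[(i+1)*n + j + 1];
            for (int k = 0; k < n; k++) {
                double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            C[i*n + j] = c00;      C[i*n + j + 1] = c01;
            C[(i+1)*n + j] = c10;  C[(i+1)*n + j + 1] = c11;
        }
}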
29. One solution to both problems: restructuring compilers (1985-)

- Programmer writes high-level, architecture-independent code
- Restructuring compiler optimizes the program for:
  - Number of cores
  - Number of registers
  - Cache organization
  - Instruction set: multiply-add? vector extensions?
30. Two key issues

[Figure: program P transformed (1) into a set of equivalent programs P1, P2, P3, ...; (2) one of them is selected for the target architecture.]

- Program restructuring: given program P, determine a set of equivalent programs P1, P2, P3, ...
- Program selection: determine which program performs best on the target architecture
31. Automatic parallelization

- Pessimistic parallelization (see the sketch after this list)
  - Compiler determines a partial order on program operations by determining dependences
  - At run-time, execute operations in parallel, respecting dependences
  - Works reasonably well for array programs but not for irregular data structures like trees and graphs
- Optimistic parallelization
  - Execute operations speculatively in parallel, assuming that dependences do not exist
  - Check at runtime if dependences are violated
  - If so, roll back execution to a safe point and re-execute sequentially
  - Works only if optimism is warranted
  - Lots of interest in transactional memory, which is one model of optimistic parallelization
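For the pessimistic case, here is a minimal OpenMP sketch in C (an illustration, not from the slides): dependence analysis can prove that each iteration writes a distinct y[i] and reads only x, so the iterations can safely run in parallel. For a tree or graph traversal, no such proof is generally possible at compile time, which is what motivates the optimistic approach.

#include <omp.h>

/* Each iteration touches disjoint data, so the compiler or programmer
   can establish that there are no loop-carried dependences. */
void scale(int n, const double *x, double *y, double alpha)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i];
}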
32. Automatic locality enhancement

- Some methodology exists for array programs, but little is known for irregular programs
- Many compilers can perform tiling and permutation automatically (gcc)
- Choosing parameter values (tile sizes, etc.):
  - Compiler can use architectural models
  - Self-optimizing systems: the system determines the best values using some kind of heuristic search (ATLAS, FFTW) (see the sketch below)
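Here is a minimal sketch of what ATLAS-style empirical search might look like (the candidate sizes, timing method, and helper functions are assumptions for illustration, not ATLAS's actual code): instead of predicting the best tile size from an architectural model, time each candidate on the machine at hand and keep the fastest.

#include <stddef.h>
#include <time.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled MMM with the tile size as a runtime parameter. */
static void mmm_tiled_t(size_t n, size_t t,
                        const double *A, const double *B, double *C)
{
    for (size_t it = 0; it < n; it += t)
        for (size_t jt = 0; jt < n; jt += t)
            for (size_t kt = 0; kt < n; kt += t)
                for (size_t i = it; i < min_sz(it + t, n); i++)
                    for (size_t j = jt; j < min_sz(jt + t, n); j++) {
                        double s = C[i*n + j];
                        for (size_t k = kt; k < min_sz(kt + t, n); k++)
                            s += A[i*n + k] * B[k*n + j];
                        C[i*n + j] = s;
                    }
}

/* Empirical search: run each candidate tile size and keep the fastest. */
size_t pick_tile(size_t n, const double *A, const double *B, double *C)
{
    const size_t cand[] = { 16, 32, 64, 128, 256 };  /* assumed candidates */
    size_t best = cand[0];
    double best_time = 1e30;
    for (size_t i = 0; i < sizeof cand / sizeof cand[0]; i++) {
        clock_t t0 = clock();
        mmm_tiled_t(n, cand[i], A, B, C);
        double dt = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (dt < best_time) { best_time = dt; best = cand[i]; }
    }
    return best;
}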
33. Course outline

- Applications requirements
  - Scientific and engineering applications
  - Commercial workloads
- Shared-memory programming
  - Memory consistency models
  - OpenMP
- Optimistic and pessimistic parallelization
  - Dependence analysis techniques for array and irregular programs
  - Transactional memory: models and implementations
- Automatic locality enhancement
- Self-optimizing systems
34. Course work

- Small number of programming assignments
- Paper presentations and class participation
  - We will have papers online by next Monday
  - Sign up for a presentation by next Thursday
- Substantial course project
  - independent reading
  - implementation work
  - presentation