Title: CSCI-4320/6360: Parallel Programming
1. CSCI-4320/6360 Parallel Programming & Computing
AE 215, Tues./Fri. 12-1:20 p.m.
Introduction, Syllabus & Prelims
- Prof. Chris Carothers
- Computer Science Department
- Lally 306
- Office Hrs: Tuesdays, 1:30-3:30 p.m.
- chrisc@cs.rpi.edu
- www.cs.rpi.edu/chrisc/COURSES/PARALLEL/SPRING-2009
2. Course Prereqs
- Some programming experience in Fortran, C, or C++
- Java is great, but not for HPC
- You'll have a choice to do your assignments in C, C++, or Fortran, subject to the language support of the programming paradigm.
- Assume you've never touched a parallel or distributed computer.
- If you have MPI experience, great... it will help you, but it is not necessary.
- If you love to write software...
- Both practice and theory are presented, but there is a strong focus on getting your programs to work.
3. Course Optional Textbook
- Introduction to Parallel Computing, by Grama, Gupta, Karypis and Kumar
- Make sure you have the 2nd edition!
- Available online through the Pearson/Addison-Wesley publisher, Amazon.com, etc.
- Written in 2003, so somewhat out of date.
4. Course Topics
- Prelims & Motivation
- Memory Hierarchy
- CPU Organization
- Parallel Architectures
- Message Passing / SMP / Vector-SIMD
- Communication Networks
- Basic Communication Operations
- MPI Programming
- Principles of Parallel Algorithm Design
- Thread Programming
- Pthreads
- OpenMP
5. Course Topics (cont.)
- Parallel Debuggers
- Parallel Operating Systems
- Parallel Filesystems
- Parallel Algorithms
- Matrix Algorithms
- Search
- Graph
- Other Programming Paradigms
- MapReduce
- Transactional Memory
- Fault Tolerance
- Applications (Guest Lectures)
- Computational Fluid Dynamics
- Mesh Adaptivity
- Parallel Discrete-Event Simulation
6. Course Grading Criteria
- You must read the lecture slides and find papers on that topic - FOR EACH CLASS
- Find an academic paper written in 2003 or later that relates to the class topic.
- You will write a 1 to 2 page paper that summarizes the class and the paper, and states any questions you might have.
- If you're a grad student, you must find and review 2 papers!
- What's it worth?
- 1 grade point per class, up to 25 points total
- There are 27/28 lectures, so you can pick 3 to miss
- 4 programming assignments worth 10 pts each
- MPI, Pthreads, OpenMP, Parallel I/O
- Parallel Computing Research Project worth 35 pts
- Yes, that's right: no mid-term or final exam
- May sound good, but when it's 4 a.m. and your parallel program doesn't work and you've spent the past 30 hours debugging it, an exam doesn't sound so bad
- For a course like this, you'll need to manage your time well... do a little each day and don't get behind on the assignments or projects!
7. Where to Find Papers
- Use Google Scholar or the ACM and IEEE Digital Libraries.
- You can access ACM or IEEE publications online from any on-campus RPI computer system.
- Publications to look for: any IEEE, ACM, or SIAM parallel computing conference or journal.
- A few to consider: Supercomputing (SC), IPDPS, ICPP, IEEE TPDS, IEEE TOC, ACM TOPLAS, ACM TOCS, JPDC, Concurrency: Practice and Experience, Cluster Computing, SPAA, HPCA, HPDC, Parallel Computing, IBM Journal of R&D.
- Try to find the most recent paper relating to a particular topic.
- Don't re-use a paper if it potentially covers multiple topics.
- Summaries/reviews are due the next lecture, w/o exception.
- Include a copy of your paper(s) with your summary/review, plus a reference citation of the paper and where you found it.
8. To Make a Fast Parallel Computer, You Need a Faster Serial Computer... Well, Sorta
- Review of:
- Instructions
- Instruction processing
- Put it together: why the heck do we care about or need a parallel computer?
- i.e., they are really cool pieces of technology, but can they really do anything useful besides compute Pi to a few billion more digits?
9. Processor Instruction Sets
- In general, a computer needs a few different kinds of instructions:
- mathematical and logical operations
- data movement (access memory)
- jumping to new places in memory
- if the right conditions hold
- I/O (sometimes treated as data movement)
- All these instructions involve using registers to store data as close as possible to the CPU
- E.g., $t0 and $s0 in MIPS, or eax and ebx in x86
10. a = (b + c) - (d + e)
Register mapping: $s0 = a, $s1 = b, $s2 = c, $s3 = d, $s4 = e
- add $t0, $s1, $s2   # t0 = b + c
- add $t1, $s3, $s4   # t1 = d + e
- sub $s0, $t0, $t1   # a = t0 - t1
11. lw destreg, const(addrreg)
Load Word:
- destreg: name of the register to put the value in
- addrreg: name of the register to get the base address from
- const: a number
- Loads from address = (contents of addrreg) + const
12. Array Example: a = b + c[8]
Register mapping: $s0 = a, $s1 = b, $s2 = base address of c
- lw $t0, 8($s2)      # t0 = c[8]
- add $s0, $s1, $t0   # s0 = s1 + t0
- (yeah, this is not quite right)
14. sw srcreg, const(addrreg)
Store Word:
- srcreg: name of the register to get the value from
- addrreg: name of the register to get the base address from
- const: a number
- Stores to address = (contents of addrreg) + const
15. Example: sw $s0, 4($s3)
- If $s3 has the value 100, this will copy the word in register $s0 to memory location 104.
- Memory[104] <- $s0
19. Instruction Formats
[Figure: 32-bit instruction layout]
This format is used for many MIPS instructions that involve calculations on values already in registers, e.g., add $t0, $s0, $s1.
20. How Are Instructions Processed?
- In the simple case:
- Fetch the instruction from memory
- Decode it (read the opcode, and use registers based on what instruction the opcode says)
- Execute the instruction
- Write back any results to registers or memory
- Complex cases:
- Pipeline: overlap instruction processing
- Superscalar: multi-instruction issue per clock cycle
21. Simple (Relative Term) CPU: Multicycle Datapath & Control
22. Simple (Yeah, Right!) Instruction Processing FSM!
23. Pipeline Processing w/ Laundry
- While the first load is drying, put the second load in the washing machine.
- When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
- NOTE: unrealistic scenario for CS students, as most only own 1 load of clothes.
24. (No transcript)
25. Pipelined DP w/ Signals
26. Pipelined Instructions... But Wait, We've Got Dependencies!
27. Pipeline w/ Forwarding Values
28. Where Forwarding Fails: Must Stall
29. How Stalls Are Inserted
30. What About Those Crazy Branches?
Problem: if the branch is taken, the PC goes to address 72, but we don't know that until after 3 other instructions are processed.
31. Dynamic Branch Prediction
- From the phrase "there is no such thing as a typical program," it follows that programs will branch in different ways, and so there is no one-size-fits-all branch algorithm.
- Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.
- Implementation: a branch prediction buffer, or branch history table.
- Indexed by the lower bits of the branch address
- A single bit indicates whether the branch at that address was last taken or not (1 or 0)
- But single-bit predictors tend to lack sufficient history.
32. Solution: 2-bit Branch Predictor
Must be wrong twice before changing the prediction. Learns whether the branch is more biased towards taken or not taken.
33. Even More Performance
- Ultimately we want greater and greater Instruction-Level Parallelism (ILP)
- How?
- Multiple instruction issue.
- Results in CPIs of less than one.
- Here, instructions are grouped into "issue slots."
- So we usually talk about IPC (instructions per cycle)
- Static: uses the compiler to assist with grouping instructions and hazard resolution. The compiler MUST remove ALL hazards.
- Dynamic (i.e., superscalar): the hardware creates the instruction schedule based on dynamically detected hazards.
34. Example: Static 2-Issue Datapath
- Additions:
- 32 more bits from instruction memory
- Two more read ports and one more write port on the register file
- 1 more ALU (the top one handles address calculation)
35. Ex. 2-Issue Code Schedule
Loop: lw   $t0, 0($s1)       # t0 = array element
      addu $t0, $t0, $s2     # add scalar in s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if s1 != 0
It takes 4 clock cycles for 5 instructions, or an IPC of 1.25.
36. More Performance: Loop Unrolling
- A technique where multiple copies of the loop body are made.
- Makes more ILP available by removing dependencies.
- How? The compiler introduces additional registers via register renaming.
- This removes "name" (or "anti") dependences,
- where an instruction ordering is purely a consequence of the reuse of a register and not a real data dependence.
- No data values flow between one pair and the next pair.
- Let's assume we unroll a block of 4 iterations of the loop.
37. Loop Unrolling Schedule
Now it takes 8 clock cycles for 14 instructions, or an IPC of 1.75!! This is a 40% performance boost!
38. Dynamic Scheduled Pipeline
39. Intel P4 Dynamic Pipeline: Looks Like a Cluster... Just Much, Much Smaller
40. Summary of Pipeline Technology
We've exhausted this!! IPC just won't go much higher. Why??
41. More Speed 'til It Hertz!
- So, if no more ILP is available, why not just increase the clock frequency?
- E.g., why don't we have 100 GHz processors today?
- ANSWER: POWER & HEAT!!
- With current CMOS technology, power needs increase polynomially with a linear increase in clock speed.
- Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
42. (No transcript)
43. CPU Power Consumption
Typically, 100 watts is the magic limit.
44. Where Do We Go from Here? (Actually, We've Arrived Here!)
- Current industry trend: multi-core CPUs
- Typically lower clock rates (i.e., < 3 GHz)
- 2, 4, and now 8 cores in a single-socket package
- Because of smaller VLSI design processes (e.g., < 45 nm), can reduce power & heat
- Potential for large, lucrative contracts in turning old, dusty sequential codes into multi-core-capable ones
- Salesman: "Here's your new $200 CPU; oh, BTW, you'll need this $1 million consulting contract to port your code to take advantage of those extra cores!"
- Best business model since the mainframe!
- More cores require greater and greater exploitation of the available parallelism in an application, which gets harder and harder as you scale to more processors.
- Due to cost, this will force in-house development of a talent pool.
- You could be that talent pool!
45. Examples: Multicore CPUs
- Brief listing of the recently released 45 nm processors, based on Intel's site (Processor Model - Cache - Clock Speed - Front-Side Bus):
- Desktop Dual Core
  - E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz
  - E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz
  - E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz
- Laptop Dual Core
  - T9500 - 6 MB L2 - 2.60 GHz - 800 MHz
  - T9300 - 6 MB L2 - 2.50 GHz - 800 MHz
  - T8300 - 3 MB L2 - 2.40 GHz - 800 MHz
  - T8100 - 3 MB L2 - 2.10 GHz - 800 MHz
- Desktop Quad Core
  - Q9550 - 12 MB L2 - 2.83 GHz - 1333 MHz
  - Q9450 - 12 MB L2 - 2.66 GHz - 1333 MHz
  - Q9300 - 6 MB L2 - 2.50 GHz - 1333 MHz
- Desktop Extreme Series
  - QX9650 - 12 MB L2 - 3 GHz - 1333 MHz
- Note: Intel's 45 nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35 W thermal envelope.
46. Nov. 2008 TOP 5 Supercomputers (www.top500.org)
- DOE/LANL "RoadRunner": IBM QS22 Opteron cluster w/ PowerXCell 8i, 3.2 GHz, 129,600 procs, 1105 TFlops
- ORNL "Jaguar": Cray XT5, 2.3 GHz Opterons, 150,152 procs, 1059 TFlops
- NASA "Pleiades": SGI Altix ICE, 3.0/2.66 GHz Xeons, 51,200 procs, 487 TFlops
- DOE/LLNL IBM Blue Gene/L, 700 MHz PPC 440, 212,992 procs, 478 TFlops
- ANL "Intrepid": IBM Blue Gene/P, 850 MHz PPC 450, 163,840 procs, 450 TFlops
RPI/CCNI is currently #34 with "Fen," an IBM Blue Gene/L, 32,768 procs, 73 TFlops.
47. (No transcript)
48. (No transcript)
49. What Are SCs Used For??
- Can you say "fever for the flavor"?
- Yes, Pringles used an SC to model the airflow of chips as they entered The Can.
- Improved the overall yield of good chips in The Can, with fewer chips on the floor.
50. Patient-Specific Vascular Surgical Planning
- Virtual flow facility for patient-specific surgical planning
- High-quality patient-specific flow simulations needed quickly
- Image patient, create model, run adaptive flow simulation
- Simulation on massively parallel computers
- Cost: only $600 on a 32K-processor Blue Gene/L vs. $50K for a repeat open-heart surgery
51. Summary
- Current uni-core speed has peaked
- No more ILP to exploit
- Can't make CPU cores any faster w/ current CMOS technology
- Must go massively parallel in order to increase IPC (instructions per clock cycle)
- The only way for a large application to go really fast is to use lots and lots of processors
- Today's systems have 10s of thousands of processors
- By 2012, systems will emerge w/ > 200K to 1 million processors and 10 PFlops of compute power! (e.g., Blue Waters @ UIUC)