Title: CSCI-4320/6360: Parallel Programming
1. CSCI-4320/6360 Parallel Programming & Computing
AE 215, Tues./Fri. 12-1:20 p.m.
Introduction, Syllabus & Prelims
- Prof. Chris Carothers
- Computer Science Department
- Lally 306
- Office Hrs: Tuesdays, 1:30-3:30 p.m.
- chrisc@cs.rpi.edu
- www.cs.rpi.edu/chrisc/COURSES/PARALLEL/SPRING-2009
2. Course Prereqs
- Some programming experience in Fortran, C, or C++
- Java is great, but not for HPC
- You'll have a choice to do your assignments in C, C++, or Fortran, subject to the language support of the programming paradigm.
- Assume you've never touched a parallel or distributed computer.
- If you have MPI experience, great... it will help you, but it is not necessary.
- If you love to write software...
- Both practice and theory are presented, but there is a strong focus on getting your programs to work.
3. Course Optional Textbook
- Introduction to Parallel Computing, by Grama, Gupta, Karypis and Kumar
- Make sure you have the 2nd edition!
- Available online through the Pearson/Addison-Wesley publisher, Amazon.com, etc.
- Written in 2003, so somewhat out of date.
4. Course Topics
- Prelims & Motivation
- Memory Hierarchy
- CPU Organization
- Parallel Architectures
- Message Passing / SMP / Vector-SIMD
- Communication Networks
- Basic Communication Operations
- MPI Programming
- Principles of Parallel Algorithm Design
- Thread Programming
- Pthreads
- OpenMP
5. Course Topics (cont.)
- Parallel Debuggers
- Parallel Operating Systems
- Parallel Filesystems
- Parallel Algorithms
- Matrix Algorithms
- Search
- Graph
- Other Programming Paradigms
- MapReduce
- Transactional Memory
- Fault Tolerance
- Applications (Guest Lectures)
- Computational Fluid Dynamics
- Mesh Adaptivity
- Parallel Discrete-Event Simulation
6. Course Grading Criteria
- You must read the lecture slides and find papers on that topic - FOR EACH CLASS
- Find an academic paper written in 2003 or later that relates to the class topic.
- You will write a 1 to 2 page paper that summarizes the class and the paper, and states any questions you might have.
- If you're a grad student, you must find and review 2 papers!
- What's it worth?
- 1 grade point per class, up to 25 points total
- There are 27/28 lectures, so you can pick 3 to miss
- 4 programming assignments worth 10 pts each
- MPI, Pthreads, OpenMP, Parallel I/O
- Parallel Computing Research Project worth 35 pts
- Yes, that's right: no mid-term or final exam
- May sound good, but when it's 4 a.m. and your parallel program doesn't work and you've spent the past 30 hours debugging it, an exam doesn't sound so bad
- For a course like this, you'll need to manage your time well... do a little each day and don't get behind on the assignments or projects!
7. Where to Find Papers
- Use Google Scholar or the ACM and IEEE Digital Libraries.
- You can access ACM or IEEE publications online from any on-campus RPI computer system.
- Publications to look for: any IEEE, ACM, or SIAM parallel computing conference or journal.
- A few to consider: Supercomputing (SC), IPDPS, ICPP, IEEE TPDS, IEEE TOC, ACM TOPLAS, ACM TOCS, JPDC, Concurrency: Practice and Experience, Cluster Computing, SPAA, HPCA, HPDC, Parallel Computing, IBM Journal of R&D.
- Try to find the most recent paper relating to a particular topic.
- Don't re-use a paper if it potentially covers multiple topics.
- Summaries/reviews are due the next lecture, w/o exception.
- Include a copy of your paper(s) with your summary/review, plus a reference citation of the paper and where you found it.
8. To Make a Fast Parallel Computer, You Need a Faster Serial Computer... Well, Sorta
- Review of:
- Instructions
- Instruction processing
- Put it together: why the heck do we care about or need a parallel computer?
- i.e., they are really cool pieces of technology, but can they really do anything useful besides compute Pi to a few billion more digits?
9. Processor Instruction Sets
- In general, a computer needs a few different kinds of instructions:
- mathematical and logical operations
- data movement (access memory)
- jumping to new places in memory
- if the right conditions hold
- I/O (sometimes treated as data movement)
- All these instructions involve using registers to store data as close as possible to the CPU
- E.g., $t0 and $s0 in MIPS, or eax and ebx in x86
10. a = (b + c) - (d + e)
Register mapping: $s0 = a, $s1 = b, $s2 = c, $s3 = d, $s4 = e
- add $t0, $s1, $s2   # t0 = b + c
- add $t1, $s3, $s4   # t1 = d + e
- sub $s0, $t0, $t1   # a = t0 - t1
11. lw destreg, const(addrreg)
Load Word:
- destreg: name of the register to put the value in
- addrreg: name of the register to get the base address from
- const: a number
- Loads from address = (contents of addrreg) + const
12. Array Example: a = b + c[8]
Register mapping: $s0 = a, $s1 = b, $s2 = base address of c
- lw $t0, 8($s2)      # t0 = c[8]
- add $s0, $s1, $t0   # s0 = s1 + t0
- (yeah, this is not quite right)
14. sw srcreg, const(addrreg)
Store Word:
- srcreg: name of the register to get the value from
- addrreg: name of the register to get the base address from
- const: a number
- Stores to address = (contents of addrreg) + const
15. Example: sw $s0, 4($s3)
- If $s3 has the value 100, this will copy the word in register $s0 to memory location 104.
- Memory[104] <- $s0
19. Instruction Formats
[Figure: 32-bit instruction layout]
This format is used for many MIPS instructions that involve calculations on values already in registers, e.g., add $t0, $s0, $s1.
20. How Are Instructions Processed?
- In the simple case:
- Fetch the instruction from memory
- Decode it (read the opcode, and use registers based on what instruction the opcode says)
- Execute the instruction
- Write back any results to registers or memory
- Complex cases:
- Pipeline: overlap instruction processing
- Superscalar: multi-instruction issue per clock cycle
21. Simple (Relative Term) CPU: Multicycle Datapath & Control
22. Simple (Yeah, Right!) Instruction Processing FSM!
23. Pipeline Processing w/ Laundry
- While the first load is drying, put the second load in the washing machine.
- When the first load is being folded and the second load is in the dryer, put the third load in the washing machine.
- NOTE: unrealistic scenario for CS students, as most only own 1 load of clothes.
24. (No transcript)
25. Pipelined DP w/ Signals
26. Pipelined Instructions... But Wait, We've Got Dependencies!
27. Pipeline w/ Forwarding Values
28. Where Forwarding Fails: Must Stall
29. How Stalls Are Inserted
30. What About Those Crazy Branches?
Problem: if the branch is taken, the PC goes to address 72, but we don't know that until after 3 other instructions are processed.
31. Dynamic Branch Prediction
- From the phrase "there is no such thing as a typical program," it follows that programs will branch in different ways, and so there is no one-size-fits-all branch algorithm.
- Alt approach: keep a history (1 bit) on each branch instruction and see if it was last taken or not.
- Implementation: a branch prediction buffer, or branch history table.
- Indexed by the lower bits of the branch address
- A single bit indicates whether the branch at that address was last taken or not (1 or 0)
- But single-bit predictors tend to lack sufficient history.
32. Solution: 2-bit Branch Predictor
Must be wrong twice before changing the prediction. Learns whether the branch is more biased towards taken or not taken.
33. Even More Performance
- Ultimately we want greater and greater Instruction-Level Parallelism (ILP)
- How?
- Multiple instruction issue.
- Results in CPIs of less than one.
- Here, instructions are grouped into "issue slots."
- So we usually talk about IPC (instructions per cycle)
- Static: uses the compiler to assist with grouping instructions and hazard resolution. The compiler MUST remove ALL hazards.
- Dynamic (i.e., superscalar): the hardware creates the instruction schedule based on dynamically detected hazards.
34. Example: Static 2-Issue Datapath
- Additions:
- 32 more bits from instruction memory
- Two more read ports and one more write port on the register file
- 1 more ALU (the top one handles address calculation)
35. Ex. 2-Issue Code Schedule
Loop: lw   $t0, 0($s1)       # t0 = array element
      addu $t0, $t0, $s2     # add scalar in s2
      sw   $t0, 0($s1)       # store result
      addi $s1, $s1, -4      # decrement pointer
      bne  $s1, $zero, Loop  # branch if s1 != 0
It takes 4 clock cycles for 5 instructions, or an IPC of 1.25.
36. More Performance: Loop Unrolling
- A technique where multiple copies of the loop body are made.
- Makes more ILP available by removing dependencies.
- How? The compiler introduces additional registers via register renaming.
- This removes "name" (or "anti") dependences,
- where an instruction ordering is purely a consequence of the reuse of a register and not a real data dependence.
- No data values flow between one pair and the next pair.
- Let's assume we unroll a block of 4 iterations of the loop.
37. Loop Unrolling Schedule
Now it takes 8 clock cycles for 14 instructions, or an IPC of 1.75!! This is a 40% performance boost!
38. Dynamic Scheduled Pipeline
39. Intel P4 Dynamic Pipeline: Looks Like a Cluster... Just Much, Much Smaller
40. Summary of Pipeline Technology
We've exhausted this!! IPC just won't go much higher. Why??
41. More Speed 'til It Hertz!
- So, if no more ILP is available, why not just increase the clock frequency?
- E.g., why don't we have 100 GHz processors today?
- ANSWER: POWER & HEAT!!
- With current CMOS technology, power needs increase polynomially with a linear increase in clock speed.
- Power leads to heat, which will ultimately turn your CPU into a heap of melted silicon!
42. (No transcript)
43. CPU Power Consumption
Typically, 100 watts is the magic limit.
44. Where Do We Go from Here? (Actually, We've Arrived Here!)
- Current industry trend: multi-core CPUs
- Typically lower clock rates (i.e., < 3 GHz)
- 2, 4, and now 8 cores in a single-socket package
- Because of smaller VLSI design processes (e.g., < 45 nm), can reduce power & heat
- Potential for large, lucrative contracts in turning old, dusty sequential codes into multi-core-capable ones
- Salesman: "Here's your new $200 CPU; oh, BTW, you'll need this $1 million consulting contract to port your code to take advantage of those extra cores!"
- Best business model since the mainframe!
- More cores require greater and greater exploitation of the available parallelism in an application, which gets harder and harder as you scale to more processors.
- Due to cost, this will force in-house development of a talent pool.
- You could be that talent pool!
45. Examples: Multicore CPUs
- Brief listing of the recently released 45 nm processors, based on Intel's site (Processor Model - Cache - Clock Speed - Front-Side Bus):
- Desktop Dual Core
  - E8500 - 6 MB L2 - 3.16 GHz - 1333 MHz
  - E8400 - 6 MB L2 - 3.00 GHz - 1333 MHz
  - E8300 - 6 MB L2 - 2.66 GHz - 1333 MHz
- Laptop Dual Core
  - T9500 - 6 MB L2 - 2.60 GHz - 800 MHz
  - T9300 - 6 MB L2 - 2.50 GHz - 800 MHz
  - T8300 - 3 MB L2 - 2.40 GHz - 800 MHz
  - T8100 - 3 MB L2 - 2.10 GHz - 800 MHz
- Desktop Quad Core
  - Q9550 - 12 MB L2 - 2.83 GHz - 1333 MHz
  - Q9450 - 12 MB L2 - 2.66 GHz - 1333 MHz
  - Q9300 - 6 MB L2 - 2.50 GHz - 1333 MHz
- Desktop Extreme Series
  - QX9650 - 12 MB L2 - 3 GHz - 1333 MHz
- Note: Intel's 45 nm Penryn-based Core 2 Duo and Core 2 Extreme processors were released on January 6, 2008. The new processors launch within a 35 W thermal envelope.
46. Nov. 2008 TOP 5 Supercomputers (www.top500.org)
- DOE/LANL "RoadRunner": IBM QS22 Opteron cluster w/ PowerXCell 8i, 3.2 GHz, 129,600 procs, 1105 TFlops
- ORNL "Jaguar": Cray XT5, 2.3 GHz Opterons, 150,152 procs, 1059 TFlops
- NASA "Pleiades": SGI Altix ICE, 3.0/2.66 GHz Xeons, 51,200 procs, 487 TFlops
- DOE/LLNL IBM Blue Gene/L, 700 MHz PPC 440, 212,992 procs, 478 TFlops
- ANL "Intrepid": IBM Blue Gene/P, 850 MHz PPC 450, 163,840 procs, 450 TFlops
RPI/CCNI is currently #34 with "Fen," an IBM Blue Gene/L, 32,768 procs, 73 TFlops.
47. (No transcript)
48. (No transcript)
49. What Are SCs Used For??
- Can you say "fever for the flavor"?
- Yes, Pringles used an SC to model the airflow of chips as they entered The Can.
- Improved the overall yield of good chips in The Can, with fewer chips on the floor.
50. Patient-Specific Vascular Surgical Planning
- Virtual flow facility for patient-specific surgical planning
- High-quality patient-specific flow simulations needed quickly
- Image patient, create model, run adaptive flow simulation
- Simulation on massively parallel computers
- Cost: only $600 on a 32K-processor Blue Gene/L vs. $50K for a repeat open-heart surgery
51. Summary
- Current uni-core speed has peaked
- No more ILP to exploit
- Can't make CPU cores any faster w/ current CMOS technology
- Must go massively parallel in order to increase IPC (instructions per clock cycle)
- The only way for a large application to go really fast is to use lots and lots of processors
- Today's systems have 10s of thousands of processors
- By 2012, systems will emerge w/ > 200K to 1 million processors and 10 PFlops of compute power! (e.g., Blue Waters @ UIUC)