Title: Keynote: Parallel Programming for High Schools
1. Keynote: Parallel Programming for High Schools
- Uzi Vishkin, University of Maryland
- Ron Tzur, Purdue University
- David Ellison, University of Maryland and University of Indiana
- George Caragea, University of Maryland
- CS4HS Workshop, Carnegie Mellon University, July 26, 2009
2. Why are we here?
- It's a time of emerging update to what literacy in CS means
- Parallel Algorithmic Thinking (PAT)
3. Goals
- Nurture your:
  - Sense of urgency about the shift to parallel within computational thinking
  - Sense of PAT and of potential student understandings
  - Confidence, competence, and enthusiasm in your ability to take on the challenge of promoting PAT in your students
- At the end we hope you'll say: "I understand, I want to do it, I can, and I know it will not happen without me" (irreplaceable member of the jury)
4. Outline (RT)
- Intro: What's all the fuss about parallelism? (UV)
- Teaching-learning activities in XMT (DE)
- A teacher's voice (1): It's the future; it's teachable (ST)
- PAT module: goals, plan, hands-on pedagogy, learning theory (RT)
- A teacher's voice (2): XMT approach/content (ST)
- Hands-on: the merge-sort problem (UV)
- A teacher's voice (3): To begin PAT, use XMTC (ST)
- How to (start)? (GC)
- Q&A (all, 12 min)
5. Intro: Commodity Computer Systems (UV)
- Serial general-purpose computing:
  - 1946-2003: 5 KHz → 4 GHz
  - 2004: clock frequency growth turns flat
- 2004 onward: parallelism is the only game in town
  - If you want your program to run significantly faster, you're going to have to parallelize it
  - General-purpose computing goes parallel
  - Transistors/chip, 1980-2011: 29K → 30B!
  - #cores: ~d^(y-2003)
6. Intro: Commodity Computer Systems
- 40 years of parallel computing:
  - Never a successful general-purpose parallel computer (easy to program, good speedups)
- Grade from the NSF Blue-Ribbon Panel on Cyberinfrastructure: F!!! "Programming existing parallel computers is as intimidating and time consuming as programming in assembly language."
7. Intro: Second Paradigm Shift, Within Parallel
- Existing parallel paradigm: decomposition-first
  - Too painful to program
- Needed paradigm: express only what can be done in parallel
  - Natural (parallel) algorithm: Parallel Random-Access Model (PRAM)
  - Build both the machine (HW) and the programming (SW) around this model
8. Middle School Summer Camp Class Picture, July '09 (20 of 22 students)
9. Demonstration: Exchange Problem (DE)
10. Let's Look at our First Step
[Figure: cells A = 2, B = 5, and an empty working cell X]
Our first step: X ← A

11. Let's Look at our Second Step
Our second step: A ← B

12. Let's Look at our Third Step
Our third step: B ← X
- Our first algorithm, in pseudo programming code:
  - X ← A
  - A ← B
  - B ← X
- Serial exchange: 3 steps, 3 operations, 1 working memory space

How many steps? 3
How many operations? 3
What's the connection between the number of steps and the number of operations? They are equal.
How much working memory space is consumed? 1 space

Hands-on challenge: Can we exchange the contents of A (= 2) and B (= 5) in fewer steps?
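The three-step recipe above can be written as a tiny C function (a sketch in plain C rather than the deck's XMTC; the function name is mine):

```c
#include <assert.h>

/* Serial exchange: 3 steps, 3 operations, 1 working memory cell X. */
void serial_exchange(int *a, int *b) {
    int x;       /* the single working memory space X */
    x = *a;      /* step 1: X <- A */
    *a = *b;     /* step 2: A <- B */
    *b = x;      /* step 3: B <- X */
}
```

With A = 2 and B = 5, calling serial_exchange(&a, &b) leaves A = 5 and B = 2.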
13. [Figure]
What is the hint in this figure?
14. First Step in a Parallel Algorithm
X ← A and simultaneously Y ← B
Can you anticipate the next step?

15. Second Step in a Parallel Algorithm
A ← Y and simultaneously B ← X

How many steps? 2
How many operations? 4
How much working memory space is consumed? 2

Can you make any generalizations with respect to serial and parallel problem solving?
Parallel algorithms tend to involve fewer steps, but may cost more operations and may consume more working memory.

The parallel algorithm:
X ← A and Y ← B
A ← Y and B ← X
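Plain C on one core can only simulate the two parallel steps one assignment at a time, but the pattern, and the 2-step / 4-operation / 2-cell accounting, is still visible. A sketch, with my own function name:

```c
#include <assert.h>

/* Parallel-style exchange: 2 logical steps, 4 operations, 2 cells.
 * On a PRAM each commented "step" is one time unit, because its two
 * assignments touch disjoint cells and can run simultaneously. */
void parallel_style_exchange(int *a, int *b) {
    int x, y;
    x = *a; y = *b;   /* step 1: X <- A and Y <- B */
    *a = y; *b = x;   /* step 2: A <- Y and B <- X */
}
```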
16. Array Exchange: A and B as arrays with indices 0-9 and input state as shown. Using a single working memory space X, devise an algorithm to exchange the contents of cells with the same index (e.g., replace A[0] = 22 with B[0] = 12). Consider the number of steps and operations.

Step 1: X ← A[0]
Step 2: A[0] ← B[0]
Step 3: B[0] ← X
Step 4: X ← A[1]
...

For i = 0 to 9 do
  X ← A[i]; A[i] ← B[i]; B[i] ← X
end

How many steps are needed to complete the exchange? 30
How many operations? 30
How much working memory space? 1

Your homework asks for the general case of arrays A and B of length n.
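The 30-step serial loop translates directly to C (a sketch; names are mine):

```c
#include <assert.h>

/* Serial array exchange: the single cell X is reused for every index,
 * so for arrays of length n this takes 3n steps and 3n operations
 * while consuming only 1 working memory space. */
void serial_array_exchange(int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        int x = a[i];   /* X <- A[i] */
        a[i] = b[i];    /* A[i] <- B[i] */
        b[i] = x;       /* B[i] <- X */
    }
}
```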
17. Array Exchange Problem: Can you parallelize it?

Step 1: X[0-9] ← A[0-9]
Step 2: A[0-9] ← B[0-9]
Step 3: B[0-9] ← X[0-9]

Parallel algorithm:
For i = 0 to n-1 pardo
  X(i) ← A(i); A(i) ← B(i); B(i) ← X(i)
end

XMTC program:
spawn(0, n-1) { var x; x = A[$]; A[$] = B[$]; B[$] = x; }

How many steps? 3
How many operations? 30
How much working memory space is consumed? 10

For the general case of arrays A and B of length n: 3 steps, 3n operations, n spaces.
18. Array Exchange Algorithm: A Highly Parallel Approach

Step 1: X[0-9] ← A[0-9] and Y[0-9] ← B[0-9]
Step 2: A[0-9] ← Y[0-9] and B[0-9] ← X[0-9]

For i = 1 to n pardo
  X(i) ← A(i) and Y(i) ← B(i)
  A(i) ← Y(i) and B(i) ← X(i)
end

How many steps? 2
How many operations? 40
How much working memory space is consumed? 20

For the general case of arrays A and B of length n: 2 steps, 4n operations, 2n spaces.
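Both parallel variants can be sketched in C by writing each pardo step as its own loop; on one core the loops run serially, but every iteration of a given loop is independent, which is exactly what pardo expresses (function names are mine):

```c
#include <assert.h>

/* Variant 1 (slide 17): 3 steps, 3n operations, n extra cells X[]. */
void exchange_3step(int *a, int *b, int *x, int n) {
    for (int i = 0; i < n; i++) x[i] = a[i];   /* step 1, pardo */
    for (int i = 0; i < n; i++) a[i] = b[i];   /* step 2, pardo */
    for (int i = 0; i < n; i++) b[i] = x[i];   /* step 3, pardo */
}

/* Variant 2 (slide 18): 2 steps, 4n operations, 2n extra cells. */
void exchange_2step(int *a, int *b, int *x, int *y, int n) {
    for (int i = 0; i < n; i++) { x[i] = a[i]; y[i] = b[i]; } /* step 1 */
    for (int i = 0; i < n; i++) { a[i] = y[i]; b[i] = x[i]; } /* step 2 */
}
```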
19. Intro: Second Paradigm Shift (cont.)
- Late 1970s: THEORY
  - Figure out how to think algorithmically in parallel
  - Huge success. But...
- 1997 onward: PRAM-On-Chip @ UMD
  - Derive specs for the architecture, then design and build
- The above premises contrast with the build-first, figure-out-how-to-program-later approach
- J. Hennessy: "Many of the early parallel ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use."
20. Pre Many-Core Parallelism: Three Thrusts
- Improving single-task completion time for general-purpose parallelism was not the main target of parallel machines
- 1. Application-specific machines
  - Computer graphics
  - Limiting origin
  - GPUs: great performance, if you figure out how
- 2. Parallel machines for high throughput (of serial programs)
- 3. Only choice for HPC → language standards, but many issues (F!)
  - HW designers (who dominate vendors): YOU figure out how to program (their machines) for locality.
21. Pre Many-Core Parallelism: 3 Thrusts (cont.)
- Currently, how the future computer will look is unknown
  - SW vendor impasse: what can a non-HW entity do without betting on the wrong horse?
- Needed: a successor to the Pentium for the multi-core era that:
  - Is easy to program (hence, easy to learn; hence, easy to teach)
  - Gives good performance with any amount of parallelism
  - Supports application programming (VHDL/Verilog, OpenGL, MATLAB) AND performance programming
  - Fits current chip technology and scales with it (particularly strong speed-ups for single-task completion time)
- Hindsight is always 20/20:
  - Should have used the benchmark of programmability → TEACHABILITY!!!
22. Pre Many-Core Parallelism: 3 Thrusts (cont.)
- PRAM algorithmic theory
  - Started with a clean-slate target: programmability and single-task completion time for general-purpose parallel computing
  - Currently the theory common to all parallel approaches; a necessary level for understanding parallelism
  - As simple as it gets; ahead of its time, avant-garde
  - 1990s common wisdom (LogP): never implementable
- UMD built the eXplicit Multi-Threaded (XMT) parallel computer
  - 100x speedups for 1000 processors on chip
  - XMTC programming language
  - Linux-based simulator: download to any machine
- Most importantly: TAUGHT IT
  - Graduate → seniors → freshmen → high school → middle school
- Reality check: the human factor → YOU
  - Teachers → Students
23. One Teacher's Voice (RT)
- Mr. Shane Torbert (could not join us; his sister is getting married!)
- Thomas Jefferson (TJ) High School
- Two years of trial
- Interview question: Why did you give Vishkin's XMT a try?
- Observe video segment 1:
  - http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/01 Shane Why XMT.m4v (requires iTunes or another m4v player)
24. Summary of Shane's Thesis
- It's the future, and it's teachable!!!
25. Teaching PAT with XMT-C
- Overarching goal:
  - Nurture a (50-year) generation of CS enthusiasts ready to think/work in parallel (programmers, developers, engineers, theoreticians, etc.)
- Module goals for student learning:
  - Understand what parallel algorithms are
  - Understand the differences, and links, between parallel and serial algorithms (serial as a special case of parallel: a single processor)
  - Understand and master how to:
    - Analyze a given problem into the shortest sequence of steps within which all possible concurrent operations are performed
    - Program (code, run, debug, improve, etc.) parallel algorithms
  - Understand and use measures of algorithm efficiency:
    - Run-time
    - Work: distinguish number of operations vs. number of steps
    - Complexity
26. Teaching PAT with XMT-C (cont.)
- Objectives: students will be able to
  - Program parallel algorithms (that run) in XMTC
  - Solve general-purpose, genuinely parallel problems
  - Compare and choose the best parallel (and serial) algorithms
  - Explain why an algorithm is serial/parallel
  - Propose and execute reasoned improvements to their own and/or others' parallel algorithms
  - Reason about correctness of algorithms: why does an algorithm provide a solution to a given problem?
27. Hands-on: The Bill Gates Intro Problem (from Baltimore Polytechnic Institute)
- Please form small groups (3-4)
- Consider Bill Gates, the richest person on earth
  - Well, he can hire as many helpers as he wants for any task in his life
- Suggest an algorithm to accomplish the following morning tasks in the least number of steps and go out to work
28. 10-Year-Old Solves Bill Gates
29. A Solution for Bill Gates
Moral: Parallelism introduces both constraints and opportunities.
- Constraints: We can't just assume we can accomplish everything at once!
- Opportunities: Can be much faster than serial! 5 parallel steps versus 11 serial steps.
30. Pedagogical Considerations (1)
- In your small groups, discuss:
  - How might solving the Bill Gates problem help students in learning PAT?
  - Will you use it as an intro to a PAT module? Why?
- Be ready to share your ideas with the whole group
- Whole-group discussion of the Bill Gates problem to initiate PAT
31. A Brain-Based Learning Theory
- Understanding: anticipation and reasoning about an invariant relationship between an activity and its effects (AER)
- Learning: transformation in such anticipation, commencing with the available and proceeding to the intended
- Mechanism: reflection (two types) on the activity-effect relationship (RefAER)
  - Type-I: comparison between goal and actual effect
  - Type-II: comparison across records of experiences/situations in which AER has been used consistently
- Stages:
  - Participatory (provisional, "oops"), Anticipatory (transfer-enabling, "succeed")
- For more, see www.edci.purdue.edu/faculty_profiles/tzur/index.html
32. A Teacher's Voice: XMT Approach/Content
- Pay attention to his emphasis on student development of anticipation of run-time using complexity analysis (a deep level of understanding, even for serial thinking)
- Play video segments 2 and 3 (5:30 min):
  - http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/02 Shane Ease of Use.m4v
  - http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/03 Shane Content Focus.m4v
- Shane's suggested first trial with teaching this material:
  - Where: your CS AP class (you most likely ask "when")
  - When: between the AP exam and the end of the school year
33. PAT Module Plan
- Intro tasks: create informal algorithmic solutions for problems students can relate to; parallelize
  - Bill Gates; a way out of a maze; train a dog to fetch a ball; standing blindfolded in line; the toddler problem; building a sand castle; etc.
- Discussion
  - What is serial? Parallel? How do they differ? Advantages and disadvantages of both (tradeoffs)? Steps vs. operations? Breadth-first vs. depth-first searches?
- Establish the XMT environment
  - Installation (Linux, simulator)
  - Programming syntax (Logo? C? XMT-C?): "Hello World" and beyond
- Algorithms for meaningful problems
  - For each problem: create parallel and serial algorithms that solve it; analyze and compare them (individual, pairs, small groups, whole class)
- Revisit the discussion of how serial and parallel differ
34. Problem Sequence
- Exchange problems
- Ranking problems
- Summation and prefix-sums (application: compaction)
- Matrix multiplication problems
- Sorting problems (including merge-sort, integer-sort, and sample-sort)
- Selection problems (finding the median)
- Minimum problems
- Nearest-one problems
- See also:
  - www.umiacs.umd.edu/users/vishkin/XMT/index.shtml
  - www.umiacs.umd.edu/users/vishkin/XMT/index.shtml#tutorial
  - www.umiacs.umd.edu/users/vishkin/XMT/sw-release.html
  - www.umiacs.umd.edu/users/vishkin/XMT/teaching-platform.html
35. PRAM-On-Chip Silicon: 64-Processor, 75MHz Prototype
FPGA prototype built: n=4, #TCUs=64, m=8, 75MHz. The system consists of 3 FPGA chips: 2 Virtex-4 LX200 and 1 Virtex-4 FX100. (Thanks, Xilinx!)
[Block diagram of XMT]
36. Some Experimental Results (UV)
- AMD Opteron 2.6 GHz, RedHat Linux Enterprise 3, 64KB+64KB L1 cache, 1MB L2 cache (none in XMT), memory bandwidth 6.4 GB/s (2.67x that of XMT)
- M-Mult was 2000x2000; QSort was 20M
- XMT enhancements: broadcast, prefetch buffer, non-blocking store, non-blocking caches

XMT wall clock time (in seconds):
App.     XMT Basic   XMT      Opteron
M-Mult   179.14      63.7     113.83
QSort    16.71       6.59     2.61

Assume (arbitrary yet conservative): ASIC XMT at 800MHz and 6.4 GB/s; reduced bandwidth to 0.6 GB/s and projected back by 800/75.

XMT projected time (in seconds):
App.     XMT Basic   XMT      Opteron
M-Mult   23.53       12.46    113.83
QSort    1.97        1.42     2.61

- Simulation of 1024 processors: 100X on a standard benchmark suite for VHDL gate-level simulation for 1024 processors [Gu-V06]
- Silicon area of the 64-processor XMT is the same as 1 commodity processor (core)
37. Hands-On Example: Merging
- Input:
  - Two arrays A[1..n], B[1..n]
  - Elements from a totally ordered domain S
  - Each array is monotonically non-decreasing
- Merging task (output):
  - Map each of these elements into a monotonically non-decreasing array C[1..2n]
- Serial merging algorithm:
  - SERIAL-RANK(A[1..n]; B[1..n])
  - Starting from A(1) and B(1), in each round:
    - Compare an element from A with an element of B
    - Determine the rank of the smaller among them
  - Complexity: O(n) time (hence, also O(n) work)
- Hands-on: How will you parallelize this algorithm?
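A C sketch of the serial round-by-round merge in the spirit of SERIAL-RANK (the C names are mine). Each round compares the front elements of A and B and emits the smaller, so 2n elements take O(n) time:

```c
#include <assert.h>

/* Merge sorted a[0..n-1] and b[0..n-1] into c[0..2n-1] in O(n) time. */
void serial_merge(const int *a, const int *b, int n, int *c) {
    int i = 0, j = 0, k = 0;
    while (i < n && j < n)                     /* one comparison per round */
        c[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < n) c[k++] = a[i++];             /* one array exhausted: */
    while (j < n) c[k++] = b[j++];             /* copy the remainder */
}
```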
38. Partitioning Approach
- Input size for a problem: n. Design a 2-stage parallel algorithm:
  - Partition the input in each array into a large number, say p, of independent small jobs
    - Size of the largest small job is roughly n/p
  - Actual work: do the small jobs concurrently, using a separate (possibly serial) algorithm for each
- Surplus-log parallel algorithm for merging/ranking:
  - for 1 ≤ i ≤ n pardo
    - Compute RANK(i, B) using standard binary search
    - Compute RANK(i, A) using binary search
- Complexity: W = O(n log n), T = O(log n)
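The surplus-log algorithm drops into C as follows: when all 2n elements are distinct, the rank of an element places it directly in the output, and each binary search is independent of the others, so on a PRAM all n searches form one O(log n)-time step. A sketch under that distinctness assumption, names mine:

```c
#include <assert.h>

/* rank_in(v, b, n): number of elements of sorted b[] that are < v. */
int rank_in(int v, const int *b, int n) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (b[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Surplus-log merge: a[i] lands at index i + rank_in(a[i], b, n).
 * Every loop iteration is independent, so both loops are pardo steps:
 * W = O(n log n), T = O(log n). Assumes all elements are distinct. */
void surplus_log_merge(const int *a, const int *b, int n, int *c) {
    for (int i = 0; i < n; i++) c[i + rank_in(a[i], b, n)] = a[i];
    for (int i = 0; i < n; i++) c[i + rank_in(b[i], a, n)] = b[i];
}
```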
39. Middle School Students Experiment with Merge/Rank
40. Linear-Work Parallel Merging Using a Single Spawn
- Stage 1 of the algorithm (partitioning): for 1 ≤ i ≤ n/p pardo, with p < n/log n and p dividing n:
  - b(i) = RANK(p(i-1)+1, B), using binary search
  - a(i) = RANK(p(i-1)+1, A), using binary search
- Stage 2 of the algorithm (actual work):
  - Observe: the overall ranking task is broken into 2p independent slices
  - Example of a slice:
    - Start at A(p(i-1)+1) and B(b(i))
    - Using serial ranking, advance until the termination condition: either some A(pi+1) or some B(jp+1) loses
- Parallel program: 2p concurrent threads, using a single spawn-join for the whole algorithm
- Example: the thread of 20 binary-searches B. Its rank is 11 (index of 15 in B) + 9 (index of 20 in A). Then compare 21 to 22 and rank 21; compare 23 to 22 to rank 22; compare 23 to 24 to rank 23; compare 24 to 25, but terminate, since the thread of 24 will rank 24.
41. Linear-Work Parallel Merging (cont'd)
- Observation: 2p slices; none has more than 2n/p elements
  - (not too bad, since the average is 2n/2p = n/p elements)
- Complexity: partitioning takes W = O(p log n) and T = O(log n) time, i.e., O(n) work and O(log n) time for p < n/log n
  - The actual work employs 2p serial algorithms, each taking O(n/p) time
  - Total: W = O(n) and T = O(n/p), for p < n/log n
- IMPORTANT: Correctness and complexity of parallel programs
  - Same as for the algorithm
  - This is a big deal. Other parallel programming approaches do not have a simple concurrency model, and need to reason with respect to the program
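A serial C simulation of the two-stage, linear-work ranking scheme above (helper names, the choice of p as the number of slices, and the fixed bound p ≤ 64 are my own): stage 1 binary-searches only the p slice boundaries; stage 2 lets each "thread" rank its n/p elements by advancing serially from that boundary.

```c
#include <assert.h>

static int rank_of(int v, const int *b, int n) {  /* #(b[] < v) */
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (b[mid] < v) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* rank[j] = rank of a[j] in b[]; assumes p divides n and p <= 64.
 * Stage 1: O(p log n) work. Stage 2: p independent serial jobs,
 * each an O(n/p + slice-of-B) serial ranking; both loops are pardo
 * steps on a PRAM / one spawn in XMTC. */
void two_stage_rank(const int *a, const int *b, int n, int p, int *rank) {
    int start_b[64];
    for (int i = 0; i < p; i++)                    /* stage 1, pardo */
        start_b[i] = rank_of(a[i * (n / p)], b, n);
    for (int i = 0; i < p; i++) {                  /* stage 2, pardo */
        int k = start_b[i];
        for (int j = i * (n / p); j < (i + 1) * (n / p); j++) {
            while (k < n && b[k] < a[j]) k++;      /* serial ranking */
            rank[j] = k;
        }
    }
}
```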
42. A Teacher's Voice: Start PAT with XMT
- Observe Shane's video segment 4:
  - http://www.umiacs.umd.edu/users/vishkin/TEACHING/SHANE-TORBERT-INTERVIEW7-09/04 Shane Word to Teachers.m4v
43. How to (Start)? (GC)
- Contact us!!!
- Observe online teaching sessions (more to be added soon)
- Contact us
- Download and install the simulator
- Read the manual
- Google "XMT" or visit www.umiacs.umd.edu/users/vishkin/XMT/index.shtml
- Solve a few problems on your own
- Try programming a parallel algorithm in XMTC for prefix-sums
- Contact us
- Follow the teaching plan (slides 29-30)
- Did we already say CONTACT US?!?! (the entire team is waiting for your call)
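For the suggested prefix-sums exercise, the serial baseline is a few lines of C (a warm-up sketch with my own names; a parallel version would then use the classic O(log n)-step balanced-tree algorithm):

```c
#include <assert.h>

/* out[i] = in[0] + in[1] + ... + in[i]  (inclusive prefix sums). */
void prefix_sums(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += in[i];      /* running total so far */
        out[i] = sum;
    }
}
```

For example, input {1, 2, 3, 4} yields {1, 3, 6, 10}.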
44. ???
45. Additional Intro Problems
46. (No transcript)
47. How can we direct the computer to search this maze and help the cat get to the milk? A parallel algorithm (depth-first search?)

[Maze figure with junctions labeled A-H]
- We might imagine locations in the maze that force a decision, and call these "junctions"
- We might say the computer is forced to make a decision at junction A
- And progresses in a left-handed fashion to B
- Until it reaches a blockage at C
- And must return to B
- And proceeds to the next junction D, but that returns to itself
- Back to B; back to A; over to E; back to A; over to F; back to A; over to G; back to A; over to H
48. Back-Up Slide: FPGA 64-Processor, 75MHz Prototype. Specs and Aspirations
- Multi-GHz clock rate
- FPGA prototype built: n=4, #TCUs=64, m=8, 75MHz
- The system consists of 3 FPGA chips: 2 Virtex-4 LX200 and 1 Virtex-4 FX100. (Thanks, Xilinx!)
[Block diagram of XMT]
- Cache coherence defined away: local cache only at the master thread control unit (MTCU)
- Prefix-sum functional unit (fetch-and-add-like) with global register file (GRF)
- Reduced global synchrony
- Overall design idea: no-busy-wait FSMs