Title: Teaching Parallelism Panel, SPAA11
1Teaching Parallelism Panel, SPAA11
- Uzi Vishkin, University of Maryland
-
2Dream opportunity
- Limited interest in parallel computing evolved
into quest for general-purpose parallel computing
in mainstream computers. Alas - Only heroic programmers can exploit the vast
parallelism in todays mainstream computers. - Rejection of their par prog by most programmers
all but certain. - Widespread working assumption Programming models
for larger-scale mainstream systems - similar.
Not so in serial days. - Parallel computing plagued with prog
difficulties. build-first figure-out-how-to-progr
am-later ? fitting parallel languages to these
arbitrary arch ? standardization of language fits
? doomed later parallel arch - Working assumption ? import parallel computing
ills to mainstream - Shock and awe example 1st par prog trauma ASAP?
Start par prog course with tile-based parallel
algorithm for matrix mult. How many tiles to fit
1000X1000 matrices in cache of modern PC? Teach
later OK - Missing Many-Core Understanding Comparison of
many-core platforms for Ease-of-programming
achieving hard speedups. The Economist I (F)
3Summary of my thoughts
- 1. In class Parallel PRAM algorithmic theory
- -2nd in magnitude only to serial algorithmic
theory - -Won the battle of ideas in the 1980s.
Repeatedly - -Challenged without success ? no real
alternative! - ?Is this another the older we get the better we
were? ? - 2. Parallel programming experience for
concreteness - In Homework Extensive programming assignments
- The XMT HW/SW/Alg solution Programming for
locality 2nd order consideration - Must be Trauma-free providing Hard speedups/Best
serial - 3. Tread carefully Consider non-parallel
computing colleague instructors. Limited line of
credit. Future change is certain. Pushing may
backfire when need cooperation in future
4Parallel Random-Access Machine/Model
n synchronous processors all having unit time
access to a shared memory.
- Reactions
- Important to convey plurality, plus coherent
approach(es) - You got to be kidding, this is way
- - Too easy
- Too difficult
- Why even mention processors? What to do with
n processors? How to allocate processors to
instructions?
5Immediate Concurrent Execution
- Work-Depth framework SV82, Adopted in Par Alg
texts J92,KKT01. - ICE basis for architecture specs
- V, Using simple abstraction to reinvent computing
for parallelism, CACM 1/2011 - Similar to role of stored-program
program-counter in arch specs for serial comp
6Algorithms-aware many-core is feasible
- PRAM-On-Chip HW Prototypes
- 64-core, 75MHz FPGA of XMT SPAA98..CF08
- Toolchain Compiler
- simulator HIPS11
- 128-core interconnection network
IBM 90nm 9mmX5mm,
-
400 MHz HotI07 -
- FPGA design?ASIC
- IBM 90nm 10mmX10mm
- 150 MHz
-
- Programming
- Programmers workflow
- Rudimentary yet stable
- compiler
Architecture scales to 1000
cores on-chip
XMT homepage www.umiacs.umd.edu/users/vishkin/XMT
/index.shtml or search XMT. considerable
material/suggestions for teaching class notes,
tool chain, lecture videos, programming
assignment.
7Elements in My education platform
- Identify thinking in parallel with the basic
abstraction behind the SV82b work-depth
framework. Note presentation framework in PRAM
texts J92, KKT01. - Teach as much PRAM algorithms as timing and
developmental stage of the students permit
extensive dry theory homework required from
graduate students. Little from high-school
students. - Students self-study programming in XMTC (standard
C plus 2 commands, spawn and prefix-sum) and do
demanding programming assignments - Provide a programmers workflow links the simple
PRAM abstraction with XMTC (even tuned)
programming. The synchronous PRAM provides ease
of algorithm design and reasoning about
correctness and complexity. Multi-threaded
programming relaxes this synchrony for
implementation. Since reasoning directly about
soundness and performance of multi-threaded code
is known to be error prone, the workflow only
tasks the programmer with establish that the
code behavior matches the PRAM-like algorithm - Unlike PRAM, XMTC incorporates locality as 2nd
order. Unlike many approaches, XMTC preempts harm
of locality on programmers productivity. - If XMT architecture is presented only at the end
of the course parallel programming more
difficult than serial, which does not require
architecture.
8Anecdotal Validation (?)
- Breadth-first-search (BFS) example 42 students
joint UIUC/UMD course - lt1X speedups using OpenMP on 8-processor SMP
- 7x-25x speedups on 64-processor XMT FPGA
prototype Built at UMD - Whats the big deal of 64 processors beating
8? - Silicon area of 64 XMT processors 1-2 SMP
processors - Questionnaire Rank approaches for achieving
(hard) speedups All students, but one XMTC
ahead of OpenMP - Order-of-magnitude teachability/learnabilty (MS,
HS up, SIGCSE10) - SPAA11 gt100X speedup on max-flow relative to
2.5X on GPU (IPDPS10) - Fleck/Kuhn research too esoteric to be reliable
? exoteric validation! - Reward alert Try to publish a paper boasting
easy to obtain results - EoP 1. Badly needed. Yet, 2. A lose-lose
proposition.
9Where to find a machine that supports effectively
such parallel algorithms?
- Parallel algorithms researchers realized decades
ago that the main reason that parallel machines
are difficult to program has been that bandwidth
between processors/memories is limited. Lower
bounds VW85,MNV94. - BMM94 1. HW vendors see the cost benefit of
lowering performance of interconnects, but
grossly underestimate the programming
difficulties and the high software development
costs implied. 2. Their exclusive focus on
runtime benchmarks misses critical costs,
including (i) the time to write the code, and
(ii) the time to port the code to different
distribution of data or to different machines
that require different distribution of data. - G. Blelloch, B. Maggs G. Miller. The hidden
cost of low bandwidth communication. In
Developing a CS Agenda for HPC (Ed. U. Vishkin).
ACM Press, 1994 - Patterson, CACM04 Latency Lags Bandwidth. HP12
as latency improved by 30-80X, bandwidth improved
by 10-25KX - ? Isnt this great news cost benefit of low
bandwidth drastically decreasing - Not so fast. X86Gen Senior Eng, 1/2011 Okay, you
do have a convenient way to do parallel
programming so whats the big deal?! - Commodity HW ? Decomposition-first programming
doctrine ? heroic programmers ? sigh Has the
bw ? ease-of-programming opportunity got lost?
10Sociologists of science
- Debates between adherents of different thought
styles consist almost entirely of
misunderstandings. Members of both parties are
talking of different things (though they are
usually under an illusion that they are talking
about the same thing). They are applying
different methods and criteria of correctness
(although they are usually under an illusion that
their arguments are universally valid and if
their opponents do not want to accept them, then
they are either stupid or malicious)
11Comment on need for breadth of knowledge
- Where are your specs?
- One example what is your par alg abstraction?
- First-specs then-build is not uncommon.. for
engineering - 2 options for architects WRT example
- 1. Learn parallel algorithms 2. Develop
abstraction that meets EoP 3. Develop specs 4.
Build - Start from abstraction with proven EoP
- It is similarly important for algorithms people
to learn architecture and applications