Transcript: Teaching Parallelism Panel, SPAA11
1
Teaching Parallelism Panel, SPAA11
  • Uzi Vishkin, University of Maryland

2
Dream opportunity
  • Limited interest in parallel computing evolved
    into a quest for general-purpose parallel computing
    in mainstream computers. Alas:
  • Only heroic programmers can exploit the vast
    parallelism in today's mainstream computers.
  • Rejection of their parallel programming by most
    programmers is all but certain.
  • Widespread working assumption: programming models
    for larger-scale mainstream systems will be similar.
    Not so in the serial days.
  • Parallel computing is plagued with programming
    difficulties: build-first, figure-out-how-to-program-later
    → fitting parallel languages to these arbitrary
    architectures → standardization of the language fits
    → doomed later parallel architectures.
  • Working assumption → importing parallel computing's
    ills to the mainstream.
  • Shock-and-awe example: first parallel-programming
    trauma ASAP? Start a parallel programming course with
    a tile-based parallel algorithm for matrix
    multiplication. How many tiles are needed to fit
    1000x1000 matrices in the cache of a modern PC?
    Teaching this later is OK.
  • Missing many-core understanding: comparison of
    many-core platforms for ease-of-programming and for
    achieving hard speedups. The Economist I (F)
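As a back-of-the-envelope check on the tiling question above, here is a short sketch; the 8 MB last-level cache size, double-precision elements, and the three-operand (A, B, C blocks) working set are my assumptions, not from the slide:

```python
def tiles_needed(n, cache_bytes, elem_bytes=8, operands=3):
    """Pick the largest tile size t such that `operands` t x t
    blocks fit in cache at once, then count how many t x t tiles
    partition an n x n matrix."""
    # Largest t with operands * t*t * elem_bytes <= cache_bytes
    t = int((cache_bytes / (operands * elem_bytes)) ** 0.5)
    tiles_per_dim = -(-n // t)  # ceiling division
    return t, tiles_per_dim ** 2

# Example: 1000x1000 doubles, assumed 8 MB last-level cache
t, num_tiles = tiles_needed(1000, 8 * 1024 * 1024)
```

With these assumptions the tile side comes out near 591, i.e. a 2x2 grid of tiles covers each 1000x1000 matrix; the pedagogical point of the slide is exactly that the answer depends on such machine details.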

3
Summary of my thoughts
  • 1. In class: parallel (PRAM) algorithmic theory
    - 2nd in magnitude only to serial algorithmic theory
    - Won the battle of ideas in the 1980s. Repeatedly.
    - Challenged without success → no real alternative!
    → Is this another "the older we get, the better we
    were"?
  • 2. Parallel programming experience for concreteness
    - In homework: extensive programming assignments
    - The XMT HW/SW/algorithms solution: programming for
      locality is a 2nd-order consideration
    - Must be trauma-free, providing hard speedups over
      the best serial implementation
  • 3. Tread carefully: consider non-parallel-computing
    colleague instructors. Limited line of credit. Future
    change is certain. Pushing may backfire when their
    cooperation is needed in the future.

4
Parallel Random-Access Machine/Model (PRAM)
  • n synchronous processors, all having unit-time
    access to a shared memory.
  • Reactions:
  • Important to convey plurality, plus coherent
    approach(es)
  • "You've got to be kidding, this is way..."
    - Too easy
    - Too difficult
  • Why even mention processors? What to do with
    n processors? How to allocate processors to
    instructions?
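To make the model concrete, here is a minimal sketch of a classic PRAM-style algorithm, balanced-tree summation, simulated serially; each loop iteration stands for one synchronous PRAM step in which all active processors operate at once (the serial simulation itself is mine, not from the slides):

```python
def pram_sum(a):
    """Sum n values in O(log n) synchronous PRAM rounds.

    In each round, processor i conceptually adds a[2i] and
    a[2i+1]; serially we just loop over the pairs."""
    a = list(a)
    rounds = 0
    while len(a) > 1:
        if len(a) % 2:  # pad with 0 so pairs divide evenly
            a.append(0)
        # One PRAM step: all these additions happen in parallel
        a = [a[2 * i] + a[2 * i + 1] for i in range(len(a) // 2)]
        rounds += 1
    return a[0], rounds

total, depth = pram_sum(range(8))  # 28 in 3 rounds
```

Note that the processor-allocation question the slide raises never appears here: the algorithm is stated in terms of what happens per step, not which processor does it.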

5
Immediate Concurrent Execution
  • Work-Depth framework [SV82], adopted in parallel
    algorithms texts [J92, KKT01].
  • ICE is the basis for architecture specs.
  • V., Using simple abstraction to reinvent computing
    for parallelism, CACM 1/2011
  • Similar to the role of the stored-program,
    program-counter abstraction in architecture specs
    for serial computing.
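The Work-Depth accounting above can be illustrated by measuring both quantities on one computation; in this sketch (the instrumentation is mine) each loop iteration is one time step in which any number of operations may execute, so depth counts steps and work counts total operations:

```python
def prefix_sums_work_depth(a):
    """Inclusive parallel prefix sums in the Work-Depth style
    (pointer-doubling scan), simulated serially."""
    s = list(a)
    n = len(s)
    work = depth = 0
    d = 1
    while d < n:
        # One concurrent step: every position i >= d updates at once
        s = [s[i] + (s[i - d] if i >= d else 0) for i in range(n)]
        work += n - d   # operations performed in this step
        depth += 1
        d *= 2
    return s, work, depth
```

For n = 8 this runs in depth 3 with work 17, i.e. O(log n) steps, with no mention of processors or scheduling.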

6
Algorithms-aware many-core is feasible
  • PRAM-On-Chip HW prototypes
    - 64-core, 75 MHz FPGA prototype of XMT
      [SPAA98..CF08]
    - 128-core interconnection network: IBM 90nm,
      9mm x 5mm, 400 MHz [HotI07]
    - FPGA design → ASIC: IBM 90nm, 10mm x 10mm,
      150 MHz
  • Algorithms
  • Toolchain: compiler + simulator [HIPS11]
  • Programming: programmer's workflow; rudimentary
    yet stable compiler
  • Architecture scales to 1000 cores on-chip
  • XMT homepage:
    www.umiacs.umd.edu/users/vishkin/XMT/index.shtml
    (or search "XMT"). Considerable material/suggestions
    for teaching: class notes, tool chain, lecture
    videos, programming assignments.
7
Elements in my education platform
  • Identify "thinking in parallel" with the basic
    abstraction behind the [SV82b] work-depth framework.
    Note the presentation framework in PRAM texts
    [J92, KKT01].
  • Teach as many PRAM algorithms as the timing and
    developmental stage of the students permit;
    extensive dry-theory homework is required from
    graduate students, little from high-school students.
  • Students self-study programming in XMTC (standard
    C plus 2 commands, spawn and prefix-sum) and do
    demanding programming assignments.
  • Provide a programmer's workflow that links the
    simple PRAM abstraction with XMTC (even tuned)
    programming. The synchronous PRAM provides ease of
    algorithm design and of reasoning about correctness
    and complexity. Multi-threaded programming relaxes
    this synchrony for implementation. Since reasoning
    directly about the soundness and performance of
    multi-threaded code is known to be error-prone, the
    workflow only tasks the programmer with establishing
    that the code behavior matches the PRAM-like
    algorithm.
  • Unlike PRAM, XMTC incorporates locality, as a
    2nd-order concern. Unlike many approaches, XMTC
    preempts the harm of locality on programmers'
    productivity.
  • The XMT architecture is presented only at the end
    of the course; otherwise parallel programming would
    appear more difficult than serial programming, which
    does not require knowing the architecture.
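To give a feel for the XMTC style named above, here is a serial Python emulation of the canonical array-compaction pattern; `spawn` and `ps` (prefix-sum) below are my stand-in emulations of the two XMTC commands, not real XMTC syntax. Each virtual thread keeps its element iff it is nonzero, claiming an output slot via an atomic prefix-sum on a shared counter:

```python
def array_compaction(a):
    """Serial emulation of XMTC-style array compaction.

    XMTC's ps(inc, base) atomically returns the old value of a
    shared base variable and adds inc to it; spawn launches one
    virtual thread per index. Here an integer emulates the ps
    base and a plain loop emulates spawn (the threads are
    independent, so serial order gives one valid outcome)."""
    out = [0] * len(a)
    base = 0                   # shared prefix-sum base
    for i in range(len(a)):    # emulates: spawn(0, n-1), thread i
        if a[i] != 0:
            slot = base        # emulates: slot = ps(1, base)
            base += 1
            out[slot] = a[i]
    return out[:base]
```

In a real parallel run the kept elements may land in any order, since ps grants slots in whatever order threads reach it; the serial emulation happens to preserve input order.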

8
Anecdotal Validation (?)
  • Breadth-first search (BFS) example: 42 students,
    joint UIUC/UMD course
    - <1X speedups using OpenMP on an 8-processor SMP
    - 7x-25x speedups on the 64-processor XMT FPGA
      prototype built at UMD
  • What's the big deal about 64 processors beating 8?
    Silicon area of 64 XMT processors is about that of
    1-2 SMP processors.
  • Questionnaire: rank approaches for achieving (hard)
    speedups. All students but one ranked XMTC ahead of
    OpenMP.
  • Order-of-magnitude teachability/learnability (MS
    and HS students up, SIGCSE10)
  • SPAA11: >100X speedup on max-flow, relative to 2.5X
    on GPU (IPDPS10)
  • Fleck/Kuhn: research too esoteric to be reliable
    → exoteric validation!
  • Reward alert: try to publish a paper boasting
    easy-to-obtain results.
  • Ease-of-programming (EoP): 1. Badly needed. Yet,
    2. A lose-lose proposition.
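For reference, the BFS benchmark above parallelizes naturally in the PRAM/work-depth style as a level-synchronous traversal: each round expands the entire frontier at once, so the depth is the number of BFS levels. A serial sketch of that structure (the edge-list graph representation is my assumption):

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS on an undirected graph.

    Each while-iteration is one parallel round in which all
    frontier vertices scan their edges concurrently (simulated
    serially here)."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {source: 0}
    frontier = [source]
    d = 0
    while frontier:
        d += 1
        nxt = []
        for u in frontier:          # all of these run in parallel
            for v in adj[u]:
                if v not in level:  # PRAM view: arbitrary-CRCW write
                    level[v] = d
                    nxt.append(v)
        frontier = nxt
    return level

lv = bfs_levels([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)], 0)
```

The per-round work is proportional to the frontier's edge count, which is why fine-grained, low-overhead parallelism (rather than 8 heavyweight SMP threads) pays off on this benchmark.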

9
Where to find a machine that effectively supports
such parallel algorithms?
  • Parallel-algorithms researchers realized decades
    ago that the main reason parallel machines are
    difficult to program is that bandwidth between
    processors/memories is limited. Lower bounds:
    [VW85, MNV94].
  • [BMM94]: 1. HW vendors see the cost benefit of
    lowering the performance of interconnects, but
    grossly underestimate the programming difficulties
    and the high software-development costs implied.
    2. Their exclusive focus on runtime benchmarks
    misses critical costs, including (i) the time to
    write the code, and (ii) the time to port the code
    to a different distribution of data or to different
    machines that require a different distribution of
    data.
  • G. Blelloch, B. Maggs and G. Miller. The hidden
    cost of low bandwidth communication. In Developing
    a CS Agenda for HPC (Ed. U. Vishkin). ACM Press,
    1994.
  • Patterson, CACM04: Latency lags bandwidth. HP12:
    as latency improved by 30-80X, bandwidth improved
    by 10-25KX.
  • → Isn't this great news? The cost benefit of low
    bandwidth is drastically decreasing.
  • Not so fast. X86Gen senior engineer, 1/2011:
    "Okay, you do have a convenient way to do parallel
    programming, so what's the big deal?!"
  • Commodity HW → decomposition-first programming
    doctrine → heroic programmers → sigh. Has the
    bandwidth → ease-of-programming opportunity been
    lost?
10
Sociologists of science
  • "Debates between adherents of different thought
    styles consist almost entirely of misunderstandings.
    Members of both parties are talking of different
    things (though they are usually under an illusion
    that they are talking about the same thing). They
    are applying different methods and criteria of
    correctness (although they are usually under an
    illusion that their arguments are universally valid,
    and that if their opponents do not want to accept
    them, then they are either stupid or malicious)."

11
Comment on the need for breadth of knowledge
  • Where are your specs?
  • One example: what is your parallel-algorithms
    abstraction?
  • First-specs-then-build is not uncommon... for
    engineering.
  • 2 options for architects WRT the example:
    - 1. Learn parallel algorithms. 2. Develop an
      abstraction that meets EoP. 3. Develop specs.
      4. Build.
    - Start from an abstraction with proven EoP.
  • It is similarly important for algorithms people
    to learn architecture and applications.