Title: The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer
1. The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer
- Uzi Vishkin
- www.umiacs.umd.edu/users/vishkin/XMT
- Students: just remember to take ENEE459P Parallel Algorithms, Fall 2010
- What is a parallel algorithm?
- Why should I care?
2. Taste of a Parallel Algorithm: Exchange Problem
- 2 bins, A and B. Exchange the contents of A and B. Example: A=2, B=5 becomes A=5, B=2.
- Algorithm (serial or parallel): X:=A; A:=B; B:=X. 3 ops. 3 steps. Space: 1.
- Array Exchange Problem
- 2n bins A[1..n], B[1..n]. Exchange the contents of A(i) and B(i), for i=1..n.
- Serial Alg: For i=1 to n do /* serial exchange through the eye of a needle */ X:=A(i); A(i):=B(i); B(i):=X
- 3n ops. 3n steps. Space: 1.
- Parallel Alg: For i=1 to n pardo /* 2-bin exchange in parallel */ X(i):=A(i); A(i):=B(i); B(i):=X(i)
- 3n ops. 3 steps. Space: n. (See the code sketch after this list.)
- Discussion
- Parallelism tends to require some extra space.
- The parallel algorithm is clearly faster than the serial algorithm.
- Which is simpler and more natural: serial or parallel?
- Small sample of people: serial, but only if you majored in CS.
- Eye-of-a-needle: a metaphor for the von Neumann mental/operational bottleneck.
- Reflects extreme scarcity of HW. Less acute now.
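A minimal code sketch of the two array-exchange algorithms above. The serial version is plain C; the parallel version uses XMTC-style spawn and the thread-ID symbol $, whose exact syntax here is illustrative rather than verbatim compiler input, and the function names are mine.

/* Serial array exchange: 3n ops, 3n steps, O(1) extra space. */
void serial_exchange(int A[], int B[], int n) {
    for (int i = 0; i < n; i++) {
        int x = A[i];        /* X := A(i) */
        A[i] = B[i];         /* A(i) := B(i) */
        B[i] = x;            /* B(i) := X */
    }
}

/* Parallel array exchange, XMTC-style sketch: 3n ops, 3 steps, O(n) extra space.
   spawn(0, n-1) launches one virtual thread per index; $ is the thread ID. */
void parallel_exchange(int A[], int B[], int X[], int n) {
    spawn(0, n - 1) {
        X[$] = A[$];
        A[$] = B[$];
        B[$] = X[$];
    }
}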
3. Commodity computer systems
- Chapter 1, 1946-2003: serial. 5KHz to 4GHz.
- Chapter 2, 2004--: parallel. #cores: ~d^(y-2003)
- Apple 2004: 1 core
- 2013: >100 cores
- Windows 7 scales to 256 cores
- How to use the other 255?
- Did I mention ENEE459P?
- BIG NEWS
- Clock frequency growth: flat.
- If you want your program to run significantly faster, you're going to have to parallelize it. Parallelism is the only game in town.
- Transistors/chip, 1980-2011: 29K to 30B!
- Programmers' IQ? Flat.
- 40 years of parallel computing?
- The world is yet to see a successful general-purpose parallel computer: easy to program and good speedups.

Source: Intel Platform 2015, March 2005
4. Is performance at a plateau?
[Chart omitted; source: published SPECInt data]
- Students: make yourself ready for the job market.
- Serial computing: <1% of computing power. Will serial computing be taught only to history majors?
5. Welcome to the 2010 Impasse
- All vendors are committed to multi-cores. Yet their architecture, and how to program them for single-program completion time, is not clear.
- The software spiral (HW improvements → SW improvements → HW improvements) has been the growth engine for IT (A. Grove, Intel). Alas, now broken!
- SW vendors avoid investment in long-term SW development since they may bet on the wrong horse. The impasse is bad for business.
- Parallel programming education: does a CSE degree mean being trained for a 50-year career dominated by parallelism by programming yesterday's serial computers?
- ENEE459P: teach (i) the common denominator, and (ii) the main approaches.
6. Serial Abstraction and a Parallel Counterpart
- The rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.
- It abstracts away different execution times for different operations (e.g., the memory hierarchy). Used by programmers to conceptualize serial computing, and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).
- The rudimentary abstraction for making parallel computing simple: indefinitely many instructions, which are available for concurrent execution, execute immediately. Dubbed Immediate Concurrent Execution (ICE).
- → Step-by-step (inductive) explication of the instructions available next for concurrent execution. Processors are not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step. (A small example follows.)
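To make ICE concrete, here is a minimal sketch of summing n numbers in log2(n) concurrent steps under the ICE/work-depth view. The spawn/$ notation is XMTC-flavored and illustrative; the function name and the power-of-two assumption are mine, not from the slides.

/* ICE/work-depth view of summing A[0..n-1], n a power of two:
   at each step, all additions that are ready execute concurrently;
   processors are never mentioned.  log2(n) steps, n-1 operations total. */
int sum_ice(int A[], int n) {
    for (int len = n / 2; len >= 1; len /= 2) {
        spawn(0, len - 1) {              /* len independent additions this step */
            A[$] = A[$] + A[$ + len];    /* thread $ writes A[$] and reads A[$+len]: no conflicts */
        }
    }
    return A[0];                         /* with 1 instruction per step this degenerates to serial code */
}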
7. Explicit Multi-Threading (XMT)
- 1979-: THEORY. Figure out how to think algorithmically in parallel.
- Outcome in a nutshell: the abstraction above.
- 1997-: XMT@UMD. Derive specs for the architecture; design and build.
- UV, Using Simple Abstraction to Guide the Reinvention of Computing for Parallelism, http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf, to appear in CACM.
8. Not just talking
- PRAM-On-Chip HW prototypes:
- 64-core, 75MHz FPGA of the XMT (Explicit Multi-Threaded) architecture [SPAA98..CF08]
- 128-core interconnection network, IBM 90nm, 9mm x 5mm, 400 MHz [HotI07]
- FPGA design → ASIC: IBM 90nm, 10mm x 10mm, 150 MHz
- PRAM parallel algorithmic theory. "Natural selection." A latent, though not widespread, knowledge base.
- Work-depth. [SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill.
- Lots of evidence that work-depth works. Used as the framework in the main PRAM algorithms texts: JaJa92, KKT01.
- Programming workflow
- Rudimentary yet stable compiler
- Architecture scales to 1000 cores on-chip
9. Participants
- Grad students: Aydin Balkan (PhD), George Caragea, James Edwards, David Ellison, Mike Horak (MS), Fuat Keceli, Beliz Saybasili, Alex Tzannes, Xingzhi Wen (PhD)
- Industry design experts (pro bono)
- Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
- Gang Qu, VLSI and power. Co-advisor.
- Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
- Ron Tzur, Purdue U., K-12 education. Co-advisor. 2008 NSF seed funding.
- K-12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project Middle School 2009 Summer Camp; Montgomery County Public Schools
- Marc Olano, UMBC, computer graphics. Co-advisor.
- Tali Moreshet, Swarthmore College, power. Co-advisor.
- Marty Peckerar, microelectronics
- Igor Smolyaninov, electro-optics
- Funding: NSF, NSA (2008 deployed XMT computer), NIH
- Industry partner: Intel
- Started from core CS. Built the HW+compiler foundation. Ready for 10 timely CS PhD theses, 2 in Education, and 10 in ECE.
10. More on ENEE459P, Fall 2010
- Parallel algorithmic thinking (PAT) based on first principles. More challenging to self-study.
- Mainstream computing's move to parallelism is chaotic. Hence pluralism is valuable.
- ENEE459 is jointly taught by 2 instructors, via video conferencing, with U. Illinois.
- CS@Illinois: top 5. Parallel@Illinois: #1.
- A joint course on a timely topic is an extremely rare opportunity.
- More than 2 for the price of one: 2 courses, each with 1 instructor, would lack the interaction.
- Advanced by Google, Intel and Microsoft, the introduction of parallelism into the curriculum dominated the recent flagship Computer Science Education Conference. Several speakers, including a keynote by the Director of Education at Intel, reported that:
- (1) in job interviews, employers now expect an intelligent discussion of parallelism, and
- (2) international competition: 85% of the people that have been trained in parallel programming are outside the U.S.
11. Membership in the Intel Academic Community
- Implementing parallel computing into the CS curriculum: 85% outside the USA.
- Source: M. Wrinn, Intel
12. The Pain of Parallel Programming
- Parallel programming is currently too difficult:
- "To many users programming existing parallel computers is as intimidating and time consuming as programming in assembly language" (NSF Blue-Ribbon Panel on Cyberinfrastructure).
- AMD/Intel: "Need a PhD in CS to program today's multicores."
- The real problem: parallel architectures were built using the methodology "build first, figure out how to program later."
- J. Hennessy: "Many of the early ideas were motivated by observations of what was easy to implement in the hardware rather than what was easy to use."
13. 2nd Example of a PRAM-like Algorithm
- Parallel algorithm + parallel data-structures.
- Inherent serialization: S.
- Gain relative to serial (first cut): T/S!
- Decisive also relative to coarse-grained parallelism.
- Note: (i) "concurrently", as in natural BFS, is the only change to the serial algorithm;
- (ii) no decomposition/partition;
- → speed-up w.r.t. a GPU of the same silicon area, for a highly parallel input: 5.4X!
- (iii) but, on a SMALL CONFIG with a 20-way parallel input: 109X w.r.t. the same GPU.
- Mental effort of PRAM-like programming:
- 1. sometimes easier than serial;
- 2. considerably easier than for any parallel computer currently sold. Understanding falls within the common denominator of other approaches.
- Input: (i) all world airports;
- (ii) for each, all its non-stop flights.
- Find the smallest number of flights from DCA to every other airport.
- Basic (actually parallel) algorithm:
- Step i:
- For all airports requiring i-1 flights,
- for all their outgoing flights,
- mark (concurrently!) all yet-unvisited airports as requiring i flights (note the nesting).
- The serial version forces an eye-of-the-needle queue; one then needs to prove that it still computes the same result as the parallel version.
- O(T) time, where T = total number of flights. (A code sketch follows.)
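A hedged XMTC-style sketch of the flights BFS above. The graph layout (airports numbered 0..n-1, with the outgoing flights of airport u stored in flight_dest[first[u]..first[u+1]-1]), the function and variable names, and the spawn/$ syntax are illustrative assumptions. For brevity this version rescans all airports at each step; the O(#flights) bound on the slide would additionally require frontier compaction (e.g., with XMT's prefix-sum).

/* dist[v] = fewest flights from `source` to airport v; -1 means not yet reached. */
void bfs_flights(int n_airports, const int first[], const int flight_dest[],
                 int source, int dist[])
{
    for (int v = 0; v < n_airports; v++) dist[v] = -1;
    dist[source] = 0;

    int step = 1, changed = 1;
    while (changed) {                       /* one iteration per BFS level ("Step i") */
        changed = 0;
        spawn(0, n_airports - 1) {          /* all airports in parallel */
            if (dist[$] == step - 1) {      /* reached with i-1 flights */
                for (int e = first[$]; e < first[$ + 1]; e++) {
                    int w = flight_dest[e];
                    if (dist[w] == -1) {    /* mark concurrently; same-value write races are benign */
                        dist[w] = step;
                        changed = 1;
                    }
                }
            }
        }
        step++;
    }
}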
Slides 14-17: figures (no transcript).
18. Back to the education crisis
- The CTO of NVidia and the official leader of multi-cores at Intel: teach parallelism as early as you can.
- Reason: we don't only under-teach; we mis-teach, since students acquire bad habits.
- The current situation is unacceptable. A sort of malpractice.
- Some possibilities:
- Teach it as a major elective.
- Teach all CSE undergrads.
- Teach CSE freshmen and invite all Engineering, Math, and Science students; this sends the message that CSE is where the action is.
19. Need
- A general-purpose parallel computer framework ("successor to the Pentium" for the multi-core era) that:
- (i) is easy to program;
- (ii) gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code;
- (iii) supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming; and
- (iv) fits current chip technology and scales with it.
- (In particular: strong speed-ups for single-task completion time.)
- Main point of the talk: PRAM-On-Chip@UMD is addressing (i)-(iv).
20. The PRAM Rollercoaster Ride
- Late 1970s: theory work began.
- UP: won the battle of ideas on parallel algorithmic thinking. No silver or bronze!
- Model of choice in all theory/algorithms communities. 1988-90: big chapters in standard algorithms textbooks.
- DOWN: FCRC'93: "PRAM is not feasible." The '93 despair → no good alternative! Where did vendors expect good-enough alternatives to come from in 2008? The device changed it all.
- UP: highlights: the eXplicit multi-threaded (XMT) FPGA-prototype computer (not a simulator) [SPAA07, CF08]; 90nm ASIC tape-outs: interconnection network [HotI07], XMT; on-chip transistors.
- How come? A crash course on parallel computing:
- How much processors-to-memories bandwidth?
- Enough: ideal programming model (PRAM).
- Limited: programming difficulties.
21. How does it work?
- Work-depth algorithms methodology (source: SV82): state all the operations you can do in parallel. Repeat. Minimize total operations and rounds. The rest is skill.
- Program: single-program multiple-data (SPMD). Short (not OS) threads. Independence of order semantics (IOS). XMTC: C plus 3 commands: Spawn/Join and Prefix-Sum. Unique: first parallelism, then decomposition. (A sketch follows this list.)
- Programming methodology: algorithms → effective programs.
- Extend the SV82 work-depth framework from PRAM to XMTC.
- Or: established APIs (VHDL/Verilog, OpenGL, MATLAB): a win-win proposition.
- Compiler: minimize the length of the sequence of round-trips to memory; take advantage of architecture enhancements (e.g., prefetch). Ideally, given an XMTC program, the compiler provides the decomposition: "teach the compiler."
- Architecture: dynamically load-balance concurrent threads over processors. "The OS of the language." (Prefix-sum to registers and to memory.)
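A minimal sketch of XMTC's additions to C (Spawn/Join, the thread ID $, and the ps prefix-sum primitive), shown on array compaction in the spirit of the XMTC tutorial. The exact spelling of psBaseReg and ps, and the function name, are assumptions that may differ slightly from the released compiler.

/* Array compaction: copy the nonzero elements of A[0..n-1] into B, in any
   order (Independence of Order Semantics).  Returns how many were copied. */
psBaseReg count;                 /* base register for the ps prefix-sum primitive (assumed XMTC type) */

int compact_nonzeros(int A[], int B[], int n)
{
    count = 0;
    spawn(0, n - 1) {            /* n virtual threads, one per element; join is implicit at the closing brace */
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, count);      /* atomically: inc gets the old count, count increases by 1 */
            B[inc] = A[$];       /* each thread thus gets a unique slot, regardless of ordering */
        }
    }
    return count;
}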
22. PERFORMANCE PROGRAMMING AND ITS PRODUCTIVITY (workflow figure)
- XMT workflow: basic algorithm (sometimes informal) → add parallel data-structures (for the PRAM-like algorithm) → parallel program (XMT-C) → XMT computer (or simulator). Low overheads!
- Serial workflow: basic algorithm → add data-structures (for the serial algorithm) → serial program (C) → standard computer.
- Conventional parallel programming workflow (Culler-Singh): serial program (C) → decomposition → assignment → orchestration → mapping → parallel computer.
- Claims (the numbers 1-4 label paths in the original figure): 4 is easier than 2; there are problems with 3; 4 is competitive with 1 in cost-effectiveness, and is natural.
23. APPLICATION PROGRAMMING AND ITS PRODUCTIVITY (workflow figure)
- Application programmer's interfaces (APIs: OpenGL, VHDL/Verilog, MATLAB) feed a compiler that produces either a serial program (C) for a standard computer or a parallel program (XMT-C) for the XMT architecture (simulator). The figure marks these translations as automatic: "Yes" / "Maybe" / "Yes".
- The conventional alternative is again the Culler-Singh parallel-programming workflow: decomposition → assignment → orchestration → mapping → parallel computer.
24. Naming Contest for the New Computer
- "Paraleap", chosen out of 6000 submissions.
- A single (hard-working) person (X. Wen) completed the synthesizable Verilog description AND the new FPGA-based XMT computer in slightly more than two years, with no prior design experience. This attests to the basic simplicity of the XMT architecture → faster time to market, lower implementation cost.
25. Experience with High School Students, Fall '07
- Gave a 1-day parallel algorithms tutorial to 12 HS students. Some (two 10th-graders) managed 8 programming assignments, including 5 of the 6 in the grad course. Only help: 1 office hour/week with an undergrad TA. No school credit. Part of a computer club after an 8-period day.
- May-June '08: 23 HS students, taught by a self-taught HS teacher, Alexandria, VA.
- Spring '08: course for non-major freshmen (UMD Honors): how will programmers have to think by the time you graduate?
- Spring '08: course for seniors.
26. NEW Software Release
- Allows you to use your own computer for programming in an XMT environment and for experimenting with it, including:
- a cycle-accurate simulator of the XMT machine;
- a compiler from XMTC to that machine.
- Also provided: extensive material for teaching or self-studying parallelism, including:
- a tutorial and manual for XMTC (150 pages);
- class notes on parallel algorithms (100 pages);
- a video recording of the 9/15/07 HS tutorial (300 minutes).
- Next major objective:
- An industry-grade chip and a production-quality compiler. Requires 10X in funding.