The eXplicit MultiThreading (XMT) Easy-To-Program Parallel Computer

Transcript and Presenter's Notes

1
The eXplicit MultiThreading (XMT)
Easy-To-Program Parallel Computer
  • Uzi Vishkin
  • www.umiacs.umd.edu/users/vishkin/XMT
  • Students: just remember to take ENEE459P,
    Parallel Algorithms, Fall 2010
  • - What is a parallel algorithm?
  • - Why should I care?

2
Taste of a Parallel Algorithm: the Exchange
Problem
  • 2 bins, A and B. Exchange the contents of A and B.
    Ex.: A=2, B=5 → A=5, B=2.
  • Algorithm (serial or parallel): X:=A; A:=B; B:=X.
    3 Ops. 3 Steps. Space 1.
  • Array Exchange Problem
  • 2n bins: A(1..n), B(1..n). Exchange A(i) and B(i),
    i=1..n.
  • Serial Alg: For i=1 to n do /* serial exchange
    through the eye of a needle */
  • X:=A(i); A(i):=B(i); B(i):=X
  • 3n Ops. 3n Steps. Space 1.
  • Parallel Alg: For i=1 to n pardo /* 2-bin
    exchange in parallel; see the XMTC sketch below */
  • X(i):=A(i); A(i):=B(i); B(i):=X(i)
  • 3n Ops. 3 Steps. Space n.
  • Discussion
  • Parallelism tends to require some extra space.
  • The parallel alg is clearly faster than the serial alg.
  • Which is simpler and more natural, serial or
    parallel?
  • Small sample of people: serial, but only if you ..
    majored in CS.
  • Eye-of-a-needle: metaphor for the von Neumann
    mental/operational bottleneck.
  • Reflects extreme scarcity of HW. Less acute now.
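
A minimal XMTC sketch of the parallel array exchange above. XMTC is
introduced later in this deck; the conventions assumed here, per the XMTC
tutorial, are that spawn(low, high) launches one virtual thread per index,
$ names the current thread's index, and the join is implicit at the end of
the spawn block. Treating the bins as C arrays is an illustration, not
part of the slide:

    // Parallel array exchange in XMTC: one virtual thread per pair of bins.
    int A[N], B[N];          // the 2n bins, with N = n

    spawn(0, N - 1) {
        int x = A[$];        // per-thread temporary X($): the extra Space n
        A[$] = B[$];
        B[$] = x;
    }                        // implicit join: 3n Ops, 3 Steps, Space n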

3
Commodity computer systems
  • Chapter 1: 1946-2003. Serial. 5 KHz → 4 GHz.
  • Chapter 2: 2004--. Parallel. #cores: growing
    roughly as d^(y-2003) in year y.
  • Apple 2004: 1 core
  • 2013: >100 cores
  • Windows 7 scales to 256 cores
  • How to use the other 255?
  • Did I mention ENEE459P?
  • BIG NEWS
  • Clock frequency growth: flat.
  • If you want your program to run significantly
    faster, you're going to have to parallelize it →
    parallelism is the only game in town.
  • Transistors/chip, 1980 → 2011: 29K → 30B!
  • Programmers' IQ? Flat..
  • 40 years of parallel computing?
  • The world is yet to see a successful
    general-purpose parallel computer: easy to
    program and good speedups.

Source: Intel Platform 2015, March '05
4
Is performance at a plateau?
(Figure omitted; source: published SPECInt data.)
Students: make yourself ready for the job market.
Serial computing: <1% of computing power. Will
serial computing be taught for history majors?
5
Welcome to the 2010 Impasse
  • All vendors committed to multi-cores. Yet their
    architecture, and how to program them for
    single-program completion time, are not clear.
  • → The software spiral (HW improvements → SW
    improvements → HW improvements), the growth engine
    of IT (A. Grove, Intel), is, alas, now broken!
  • → SW vendors avoid investment in long-term SW
    development since they may bet on the wrong horse.
    The impasse is bad for business.
  • Parallel programming education: does a CSE degree
    mean being trained for a 50-year career dominated
    by parallelism by programming yesterday's serial
    computers?
  • ENEE459P: teach (i) the common denominator, and
    (ii) the main approaches.

6
Serial Abstraction, and a Parallel Counterpart
  • The rudimentary abstraction that made serial
    computing simple: any single instruction
    available for execution in a serial program
    executes immediately.
  • Abstracts away different execution times for
    different operations (e.g., the memory hierarchy).
    Used by programmers to conceptualize serial
    computing, and supported by hardware and
    compilers. The program provides the instruction
    to be executed next (inductively).
  • A rudimentary abstraction for making parallel
    computing simple: indefinitely many
    instructions, which are available for concurrent
    execution, execute immediately; dubbed Immediate
    Concurrent Execution (ICE).
  • → Step-by-step (inductive) explication of the
    instructions available next for concurrent
    execution. Processors are not even mentioned.
    Falls back on the serial abstraction if there is
    1 instruction/step.

7
Explicit Multi-threading (XMT)
  • 1979-: THEORY: figure out how to think
    algorithmically in parallel
  • Outcome, in a nutshell: the above abstraction
  • 1997-: XMT@UMD: derive specs for architecture;
    design and build
  • UV, Using Simple Abstraction to Guide the
    Reinvention of Computing for Parallelism,
  • http://www.umiacs.umd.edu/users/vishkin/XMT/cacm2010.pdf,
    to appear in CACM

8
Not just talking
  • Algorithms
  • PRAM-On-Chip HW prototypes:
  • 64-core, 75 MHz FPGA prototype of the XMT
  • (Explicit Multi-Threaded) architecture
  • [SPAA98..CF08]
  • 128-core interconnection network,
    IBM 90nm, 9mm x 5mm, 400 MHz [HotI07]
  • FPGA design → ASIC:
  • IBM 90nm, 10mm x 10mm
  • 150 MHz
  • PRAM parallel algorithmic theory. Natural
    selection. Latent, though not widespread,
    knowledge base.
  • Work-depth. SV82 conjectured: the rest (the full
    PRAM algorithm) is just a matter of skill.
  • Lots of evidence that work-depth works. Used as
    the framework in the main PRAM algorithms texts:
    JaJa92, KKT01.
  • Programming workflow
  • Rudimentary yet stable compiler

The architecture scales to 1000 cores on-chip.
9
Participants
  • Grad students: Aydin Balkan (PhD), George
    Caragea, James Edwards, David Ellison, Mike
    Horak (MS), Fuat Keceli, Beliz Saybasili, Alex
    Tzannes, Xingzhi Wen (PhD)
  • Industry design experts (pro bono):
  • Rajeev Barua, compiler. Co-advisor of 2 CS grad
    students. 2008 NSF grant.
  • Gang Qu, VLSI and power. Co-advisor.
  • Steve Nowick, Columbia U., asynchronous computing.
    Co-advisor. 2008 NSF team grant.
  • Ron Tzur, Purdue U., K12 education. Co-advisor.
    2008 NSF seed funding.
  • K12: Montgomery Blair Magnet HS, MD; Thomas
    Jefferson HS, VA; Baltimore (inner city)
    Ingenuity Project Middle School 2009 Summer Camp;
    Montgomery County Public Schools
  • Marc Olano, UMBC, computer graphics. Co-advisor.
  • Tali Moreshet, Swarthmore College, power.
    Co-advisor.
  • Marty Peckerar, microelectronics
  • Igor Smolyaninov, electro-optics
  • Funding: NSF, NSA (2008 deployed XMT computer), NIH
  • Industry partner: Intel
  • Started from core CS. Built the HW + compiler
    foundation. Ready for 10 timely CS PhD theses,
    2 in Education, and 10 in ECE.

10
More on ENEE459P, fall 2010
  • Parallel algorithmic thinking (PAT) based on
    first principles. More challenging to self-study.
  • The mainstream-computing → parallelism transition
    is chaotic. Hence pluralism is valuable.
  • ENEE459 jointly taught by 2 instructors, via video
    conferencing, with U. Illinois.
  • CS@Illinois: top 5. Parallel@Illinois: #1.
  • A joint course on a timely topic: an extremely rare
    opportunity.
  • More than 2 for the price of one: 2 courses,
    each with 1 instructor, would lack the
    interaction.
  • Advanced by Google, Intel and Microsoft, the
    introduction of parallelism into the curriculum
    dominated the recent flagship Computer Science
    Education Conference. Several speakers, including
    a keynote by the Director of Education at Intel,
    reported that:
  • (1) in job interviews, employers now expect an
    intelligent discussion of parallelism; and
  • (2) international competition: it is recognized
    that 85% of the people that have been trained in
    parallel programming are outside the U.S.

11
Membership in the Intel Academic Community
Implementing parallel computing into the CS curriculum:
85% outside the USA
Source: M. Wrinn, Intel
12
The Pain of Parallel Programming
  • Parallel programming is currently too difficult.
  • To many users, programming existing parallel
    computers is "as intimidating and time consuming
    as programming in assembly language" (NSF
    Blue-Ribbon Panel on Cyberinfrastructure).
  • AMD/Intel: "Need a PhD in CS to program today's
    multicores."
  • The real problem: parallel architectures built
    using the following methodology: build first,
    figure out how to program later.
  • J. Hennessy: "Many of the early ideas were
    motivated by observations of what was easy to
    implement in the hardware rather than what was
    easy to use."

13
2nd Example of a PRAM-like Algorithm
  • Input: (i) all world airports;
  • (ii) for each, all its non-stop flights.
  • Find the smallest number of flights from DCA to
    every other airport.
  • Basic (actually parallel) algorithm:
  • Step i:
  • For all airports requiring i-1 flights
  • For all their outgoing flights
  • Mark (concurrently!) all yet-unvisited
    airports as requiring i flights (note the nesting;
    see the XMTC sketch after this slide)
  • Serial: forces an eye-of-a-needle queue; need to
    prove that it is still the same as the parallel
    version.
  • O(T) time; T = total # of flights.
  • Parallel: parallel data-structures.
  • Inherent serialization: S.
  • Gain relative to serial (first cut): ~T/S!
  • Decisive also relative to coarse-grained
    parallelism.
  • Note: (i) "concurrently", as in natural BFS, is the
    only change to the serial algorithm;
  • (ii) no decomposition/partition;
  • → speed-up w.r.t. a GPU of the same silicon area,
    for highly parallel input: 5.4X!
  • (iii) but, for a SMALL CONFIG on 20-way parallel
    input: 109X w.r.t. the same GPU.
  • Mental effort of PRAM-like programming:
  • 1. sometimes easier than serial;
  • 2. considerably easier than for any parallel
    computer currently sold. Understanding falls
    within the common denominator of other
    approaches.
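
A hedged XMTC sketch of one step of the flights algorithm above.
Everything beyond the slide is an assumption for illustration: begin/adj
store each airport's outgoing flights CSR-style; frontier/newFrontier hold
the airports of levels i-1 and i; gatekeeper[] guarantees each airport is
marked only once; psBaseReg and the ps/psm prefix-sum operations (to a
register base and to a memory word, respectively) follow the XMTC
tutorial's conventions; and the nested parallelism is flattened into a
per-thread serial loop:

    // One step i of parallel BFS: expand every airport reached in i-1 flights.
    // level[w] = number of flights to airport w; gatekeeper[w] starts at 0.
    psBaseReg newSize;                  // prefix-sum base for the next frontier
    newSize = 0;
    spawn(0, frontierSize - 1) {
        int v = frontier[$];            // an airport requiring i-1 flights
        int j;
        for (j = begin[v]; j < begin[v + 1]; j++) {  // its outgoing flights
            int w = adj[j];             // destination airport
            int e = 1;
            psm(e, &gatekeeper[w]);     // atomic: e gets old value, cell += 1
            if (e == 0) {               // first thread to reach w
                level[w] = i;           // mark (concurrently!) as i flights
                int k = 1;
                ps(k, newSize);         // claim a unique slot in the next frontier
                newFrontier[k] = w;
            }
        }
    }                                   // implicit join; repeat while non-empty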

14
(No Transcript)
15
(No Transcript)
16
(No Transcript)
17
(No Transcript)
18
Back to the education crisis
  • The CTO of NVidia and the official leader of
    multi-cores at Intel: teach parallelism as early
    as you can.
  • Reason: we don't only under-teach; we mis-teach,
    since students acquire bad habits.
  • The current situation is unacceptable. Sort of
    malpractice.
  • Some possibilities:
  • Teach as a major elective.
  • Teach all CSE undergrads.
  • Teach CSE freshmen and invite all Engineering,
    Math, and Science: sends the message that CSE is
    where the action is.

19
Need
  • A general-purpose parallel computer framework
    ("successor to the Pentium" for the multi-core
    era) that:
  • (i) is easy to program;
  • (ii) gives good performance with any amount of
    parallelism provided by the algorithm; namely,
    up- and down-scalability, including backwards
    compatibility on serial code;
  • (iii) supports application programming (VHDL/Verilog,
    OpenGL, MATLAB) and performance programming; and
  • (iv) fits current chip technology and scales with it
    (in particular, strong speed-ups for single-task
    completion time).
  • Main point of the talk: PRAM-On-Chip@UMD is
    addressing (i)-(iv).

20
The PRAM Rollercoaster Ride
  • Late 1970s: theory work began.
  • UP: won the battle of ideas on parallel
    algorithmic thinking. No silver or bronze!
  • Model of choice in all theory/algorithms
    communities. 1988-90: big chapters in standard
    algorithms textbooks.
  • DOWN: FCRC'93: "PRAM is not feasible." '93
    despair → no good alternative! Where did vendors
    expect good-enough alternatives to come from in
    2008? The device changed it all.
  • UP: highlights: the eXplicit Multi-Threaded (XMT)
    FPGA-prototype computer (not a simulator)
    [SPAA07, CF08]; 90nm ASIC tape-outs: the
    interconnection network [HotI07] and XMT.
    #on-chip transistors.
  • How come? A crash course on parallel computing:
  • How much processors-to-memories bandwidth?
  • Enough: ideal programming model (PRAM).
  • Limited: programming difficulties.

21
How Does It Work?
  • Work-depth algorithms methodology (source: SV82):
    state all the ops you can do in parallel; repeat.
    Minimize total operations and rounds. The rest is
    skill.
  • Program: single-program multiple-data (SPMD).
    Short (not OS) threads. Independence of order
    semantics (IOS). XMTC: C plus 3 commands:
    Spawn-Join and Prefix-Sum (see the sketch after
    this slide). Unique: first parallelism, then
    decomposition.
  • Programming methodology: algorithms → effective
    programs.
  • Extend the SV82 work-depth framework from PRAM to
    XMTC.
  • Or: established APIs (VHDL/Verilog, OpenGL,
    MATLAB): a win-win proposition.
  • Compiler: minimize the length of the sequence of
    round-trips to memory; take advantage of
    architecture enhancements (e.g., prefetch).
    Ideally, given an XMTC program, the compiler
    provides the decomposition: "teach the compiler".

Architecture: dynamically load-balance concurrent
threads over processors. The "OS of the language".
(Prefix-sum to registers and to memory.)
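
To make "C plus 3 commands" concrete, here is the canonical
array-compaction use of Prefix-Sum, sketched under the XMTC conventions
above (psBaseReg and the exact ps signature are assumptions taken from
the XMTC tutorial, not from this slide):

    // Array compaction in XMTC: copy the non-zero elements of
    // A[0..n-1] into a dense prefix of B, in arbitrary order (IOS).
    psBaseReg base;              // hardware-supported prefix-sum base
    base = 0;
    spawn(0, n - 1) {            // one short virtual thread per element
        if (A[$] != 0) {
            int e = 1;
            ps(e, base);         // atomic: e gets the old base, base += 1
            B[e] = A[$];         // each thread writes a unique slot
        }
    }                            // implicit join; no locks, no decomposition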
22
PERFORMANCE PROGRAMMING AND ITS PRODUCTIVITY
(Workflow diagram flattened to text; bracketed numbers
are the path labels from the slide.)
  • Basic algorithm (sometimes informal) → add
    data-structures (for serial algorithm) → serial
    program (C) [3] → standard computer.
  • Basic algorithm → add parallel data-structures
    (for PRAM-like algorithm) → parallel program
    (XMT-C) [1] → XMT computer (or simulator) [4].
    Low overheads!
  • Parallel programming (Culler-Singh): decomposition
    → assignment → orchestration → mapping → parallel
    computer [2].
  • 4 easier than 2
  • Problems with 3
  • 4 competitive with 1; cost-effectiveness; natural
23
APPLICATION PROGRAMMING AND ITS PRODUCTIVITY
(Workflow diagram flattened to text; the slide marks
the arrows "Automatic? Yes / Maybe / Yes".)
  • Application programmer's interfaces (APIs)
    (OpenGL, VHDL/Verilog, Matlab) → compiler →
    serial program (C) → standard computer.
  • APIs → compiler → parallel program (XMT-C) →
    XMT architecture (simulator).
  • Parallel programming (Culler-Singh): decomposition
    → assignment → orchestration → mapping → parallel
    computer.
24
Naming Contest for New Computer
  • "Paraleap,"
  • chosen out of about 6000 submissions.
  • A single (hard-working) person (X. Wen) completed
    the synthesizable Verilog description AND the new
    FPGA-based XMT computer in slightly more than two
    years. No prior design experience. Attests to the
    basic simplicity of the XMT architecture → faster
    time to market, lower implementation cost.

25
Experience with High School Students, Fall '07
  • Gave a 1-day parallel algorithms tutorial to 12 HS
    students. Some (2 10th-graders) managed 8
    programming assignments, including 5 of the 6 in
    the grad course. Only help: 1 office hour/week by
    an undergrad TA. No school credit. Part of a
    computer club, after 8 periods/day.
  • May-June '08: 23 HS students, taught by a
    self-taught HS teacher, Alexandria, VA.
  • Spring '08: course for non-major freshmen (UMD
    Honors): how will programmers have to think by the
    time you graduate?
  • Spring '08: course for seniors.

26
NEW: Software release
  • Allows you to use your own computer for
    programming in an XMT environment and
    experimenting with it, including:
  • a cycle-accurate simulator of the XMT machine
  • a compiler from XMTC to that machine
  • Also provided: extensive material for teaching or
    self-studying parallelism, including:
  • tutorial and manual for XMTC (150 pages)
  • class notes on parallel algorithms (100 pages)
  • video recording of the 9/15/07 HS tutorial (300
    minutes)
  • Next major objective:
  • an industry-grade chip and a production-quality
    compiler. Requires 10X in funding.