
1
Programming Models for Manycore Systems
Parallel Hardware
Parallel Applications
Parallel Software
  • Kathy Yelick
  • U.C. Berkeley

2
Par Lab Research Overview
Easy to write correct programs that run
efficiently on manycore
[Diagram: the Par Lab stack.
Applications: Personal Health, Image Retrieval,
Hearing/Music, Speech, Parallel Browser; characterized
by Motifs/Dwarfs.
Productivity Layer: Composition & Coordination
Language (C&CL), C&CL Compiler/Interpreter, Parallel
Libraries, Parallel Frameworks; correctness via Static
Verification and Type Systems.
Efficiency Layer: Efficiency Languages, Sketching,
Autotuners, Schedulers, Communication & Synch.
Primitives, Efficiency Language Compilers, Legacy
Code; correctness via Directed Testing, Dynamic
Checking, and Debugging with Replay.
OS: Legacy OS, OS Libraries & Services, Hypervisor.
Arch.: Multicore/GPGPU, RAMP Manycore.]
3
Applications: What are the problems?
  • Who needs 100 cores to run MS Word?
  • Need compelling apps that use 100s of cores
  • How did we pick applications?
  • An enthusiastic expert application partner, a leader
    in the field, who promises to help design, use, and
    evaluate our technology
  • Compelling in terms of likely market or social
    impact, with short-term feasibility and longer-term
    potential
  • Requires significant speed-up, or a smaller, more
    efficient platform, to work as intended
  • As a whole, the applications cover the most important
    platforms (handheld, laptop, games) and markets
    (consumer, business, health)

4
Compelling Laptop/Handheld Apps (David Wessel)
  • Musicians have an insatiable appetite for
    computation
  • More channels, instruments, more processing,
    more interaction!
  • Latency must be low (under 5 ms)
  • Must be reliable (no clicks)
  • Music Enhancer
  • Enhanced sound delivery systems for home sound
    systems using large microphone and speaker arrays
  • Laptop/Handheld recreates 3D sound over earbuds
  • Hearing Augmenter
  • Handheld as accelerator for hearing aid
  • Novel Instrument User Interface
  • New composition and performance systems beyond
    keyboards
  • Input device for Laptop/Handheld

The Berkeley Center for New Music and Audio
Technology (CNMAT) created a compact loudspeaker
array: a 10-inch-diameter icosahedron
incorporating 120 tweeters.
5
Content-Based Image Retrieval (Kurt Keutzer)
[Figure: CBIR pipeline. A query by example is matched
against an image database (1000s of images) using a
similarity metric; candidate results are refined via
relevance feedback into the final result.]
  • Built around key characteristics of personal
    databases
  • Very large number of pictures (>5K)
  • Non-labeled images
  • Many pictures of few people
  • Complex pictures including people, events,
    places, and objects

6
Coronary Artery Disease (Tony Keaveny)
[Figure: before and after simulation views.]
  • Modeling to help patient compliance?
  • 450k deaths/year, 16M with symptoms, 72M with high
    blood pressure
  • Massively parallel, real-time variations
  • CFD + FE: solid (non-linear), fluid (Newtonian),
    pulsatile flow
  • Blood pressure, activity, habitus, cholesterol

7
Meeting Diarist and Teleconference Aid (Nelson
Morgan)
  • Meeting Diarist
  • Laptops/Handhelds at a meeting coordinate to
    create a speaker-identified, partially transcribed
    text diary of the meeting
  • Teleconference speaker identifier, speech
    helper
  • L/Hs used for teleconference; identifies who is
    speaking, with a closed-caption hint of what is
    being said

8
Parallel Browser
  • Goal: desktop-quality browsing on handhelds
  • Enabled by 4G networks, better output devices
  • Bottlenecks to parallelize:
  • Parsing, rendering, scripting
  • SkipJax
  • Parallel replacement for JavaScript/AJAX
  • Based on Brown's FlapJax

9
Broader Coverage of Applications through Motifs
  • How do we invent the parallel systems of the future
    while tied to the old code, programming models, and
    CPUs of the past?
  • Look for common computational patterns
  • Embedded Computing (42 EEMBC benchmarks)
  • Desktop/Server Computing (28 SPEC2006)
  • Database / Text Mining Software
  • Games/Graphics/Vision
  • Machine Learning
  • High Performance Computing (original 7 dwarfs)
  • Result: 13 dwarfs (use "motif" instead now that we
    have gone from 7 to 13?)

10
Motif/"Dwarf" Popularity (Red = Hot, Blue = Cool)
  • How do compelling apps relate to the 13
    motifs/dwarfs?
[Figure: red/blue heat map relating application areas
to the 13 motifs/dwarfs.]

11
Roles of Motifs/Dwarfs
  • "Anti-benchmarks"
  • Motifs are not tied to code or language artifacts,
    which encourages innovation in algorithms,
    languages, data structures, and/or hardware
  • Universal, understandable vocabulary, at least at a
    high level
  • To talk across disciplinary boundaries
  • Bootstrapping: "parallelize" parallel research
  • Allow analysis of HW and SW design without waiting
    years for full apps
  • Targets for libraries

12
Par Lab Research Overview
Easy to write correct programs that run
efficiently on manycore
[Diagram: the Par Lab stack, as on slide 2.]
13
Developing Parallel Software
  • 2 types of programmers → 2 layers
  • Efficiency Layer (10% of today's programmers)
  • Expert programmers build frameworks, libraries,
    hypervisors, ...
  • "Bare metal" efficiency possible at the Efficiency
    Layer
  • Productivity Layer (90% of today's programmers)
  • Domain experts / naïve programmers productively
    build parallel apps using frameworks and libraries
  • Frameworks and libraries are composed to form app
    frameworks
  • Effective composition techniques allow the
    efficiency programmers to be highly leveraged →
    create a language for Composition and Coordination
    (C&C)

14
Composition to Build Applications
[Figure: an application assembled by composing serial
code and parallel code components.]
15
Composition
  • Composition is key to software reuse
  • Solutions exist for libraries with hidden
    parallelism
  • Partitions in the OS help runtime composition
  • Instantiating parallel frameworks is harder
  • The framework specifies the required independence
  • E.g., operations in map or divide-and-conquer must
    not interfere
  • Guaranteed independence through types
  • Type system extensions (side effects, ownership)
  • E.g., an Image typed as a read-only Array of
    doubles
  • Data decomposition may be implicit or explicit:
  • Partition: Array[T] → List[Array[T]]
    (well understood)
  • Partition: Graph[T] → List[Graph[T]]
  • Efficiency-layer code has these specifications at
    interfaces, which are verified, tested, or
    asserted
  • Independence is proven by checking side effects
    and overlap at instantiation (see the sketch after
    this list)
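
A minimal C++ sketch of this guarantee, assuming plain
std::thread rather than the C&CL (Span, partition, and
parallel_map are hypothetical names): the partition
returns non-overlapping views of one array, so mapping
a side-effect-confined function over the parts is
race-free by construction.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

// A "part" is a [begin, begin+len) view into one underlying array.
template <typename T>
struct Span { T* begin; std::size_t len; };

// Partition: Array[T] -> List[Array[T]], non-overlapping by construction.
template <typename T>
std::vector<Span<T>> partition(std::vector<T>& a, std::size_t nparts) {
    std::vector<Span<T>> parts;
    std::size_t chunk = (a.size() + nparts - 1) / nparts;
    for (std::size_t i = 0; i < a.size(); i += chunk)
        parts.push_back({a.data() + i, std::min(chunk, a.size() - i)});
    return parts;
}

// Map over the parts in parallel; safe because the spans are disjoint
// and f writes only through its own span (the "required independence").
template <typename T, typename F>
void parallel_map(std::vector<Span<T>>& parts, F f) {
    std::vector<std::thread> ts;
    for (auto& p : parts) ts.emplace_back([&p, f] { f(p); });
    for (auto& t : ts) t.join();
}

int main() {
    std::vector<double> image(16, 1.0);        // stand-in for an Image
    auto parts = partition(image, 4);
    parallel_map(parts, [](Span<double>& p) {  // side effects confined to p
        for (std::size_t i = 0; i < p.len; ++i) p.begin[i] *= 2.0;
    });
    for (double x : image) std::cout << x << ' ';
    std::cout << '\n';
}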

16
Coordination
  • Coordination is used to create parallelism:
  • Support the parallelism patterns of applications
  • Data parallelism (degree = data size, not core
    count)
  • May be nested: forall images, forall blocks in
    image
  • Divide-and-conquer (parallelism from recursion)
  • Event-driven: nondeterminism at the algorithm level
  • Branch-and-Bound dwarf, etc.
  • Serial semantics with limited nondeterminism
  • Choose the solution that fits your domain:
  • Data parallelism comes from array/aggregate
    operations and loops without side effects
  • Divide-and-conquer parallelism comes from
    recursive functions with non-overlapping side
    effects (see the sketch after this list)
  • Event-driven programs are written as guarded
    atomic commands, which may be implemented as
    transactions
  • Discovered parallelism is mapped to available
    resources
  • Techniques include static scheduling, autotuning,
    dynamic scheduling, and possibly hints
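
A hedged C++17 sketch of the divide-and-conquer
pattern, using std::async in place of the proposed C&C
language: the two halves of the recursion touch
disjoint ranges, so they may run in parallel while
preserving serial semantics.

#include <cstddef>
#include <future>
#include <iostream>
#include <numeric>
#include <vector>

// Divide-and-conquer: the halves read disjoint ranges and have no
// other side effects, so running them in parallel cannot change the
// serial result.
double parallel_sum(const double* a, std::size_t n) {
    if (n < (1 << 14))                    // small case: plain serial loop
        return std::accumulate(a, a + n, 0.0);
    auto left = std::async(std::launch::async, parallel_sum, a, n / 2);
    double right = parallel_sum(a + n / 2, n - n / 2);
    return left.get() + right;
}

int main() {
    // Degree of parallelism follows the data size, not the core count;
    // the runtime (here std::async) maps work onto available cores.
    std::vector<double> data(1 << 18, 0.5);
    std::cout << parallel_sum(data.data(), data.size()) << '\n';  // 131072
}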

17
C&C Language Strategy
  • Application-driven: domain-specific languages
  • Ensure usefulness for at least one application
  • Music language
  • Image framework
  • Browser language
  • Health application language
  • Bottom-up implementation strategy
  • Ensure it is efficiently implementable
  • Grow a language from one that is efficient but
    not productive by adding abstraction levels
  • Identify common features across DSLs
  • Cross-language meetings/discussions

18
Coordination & Composition in the CBIR Application
  • Parallelism in CBIR is hierarchical
  • Mostly independent tasks/data with reductions

[Figure: CBIR feature-extraction pipeline. An output
stream of images fans out, task-parallel, to the
extraction algorithms (face recognition, DCT, DWT,
...); their results are collected into an output
stream of feature vectors. Stream-parallel over
images, task-parallel over extraction algorithms.]
19
Using Map Reduce for Image Retrieval
  • Map Reduce can mean various things
  • To us, it means:
  • A map stage, where threads compute independently
  • A reduce stage, where the results of the map
    stage are summarized
  • This is a pattern of computation and
    communication
  • Not an implementation involving key/value pairs,
    parallel I/O, ...
  • We consider Map Reduce computations where:
  • A map function produces a set of outputs
  • Each of a set of reduce functions, gated by
    per-element predicates, produces a set of outputs
    (sketched below)

Work by B. Catanzaro, N. Sundaram & K. Keutzer
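
A hedged C++ sketch of this pattern (plain threads,
not the group's actual GPU framework; one thread per
element is purely illustrative): a map stage of
independent computations, then reduce functions gated
by per-element predicates.

#include <cstddef>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    std::vector<double> inputs(8);
    for (std::size_t i = 0; i < inputs.size(); ++i) inputs[i] = double(i);

    // Map stage: threads compute independently (here each squares its
    // own input; no thread touches another's output slot).
    std::vector<double> mapped(inputs.size());
    std::vector<std::thread> ts;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        ts.emplace_back([&, i] { mapped[i] = inputs[i] * inputs[i]; });
    for (auto& t : ts) t.join();

    // Reduce stage: each reducer summarizes only the elements its
    // predicate admits ("gated by per-element predicates").
    auto reduce_if = [&](auto pred) {
        double acc = 0.0;
        for (double v : mapped) if (pred(v)) acc += v;
        return acc;
    };
    std::cout << "sum of small outputs: "
              << reduce_if([](double v) { return v < 10.0; }) << '\n';
    std::cout << "sum of large outputs: "
              << reduce_if([](double v) { return v >= 10.0; }) << '\n';
}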
20
SVM Classification Results
  • Average 100x speedup (180x max)
  • The Map Reduce framework reduced kernel LOC by 64%

Work by B. Catanzaro, N. Sundaram & K. Keutzer
21
C&C Language for Health
  • ...and the applications to go with it
  • The personalized medicine application has large
    amounts of data parallelism
  • Irregular data structures / accesses: sparse
    matrices, particles
  • But most of the code could be expressed in a
    data-parallel way, meaning serial semantics
  • Note that parallelism over data is essential at
    O(100) cores
  • Composition across languages is still key
  • Calls to optimized (not data-parallel) libraries
  • Supported by static analysis for phase-based
    computations

22
Partitioned Global Address Space
  • Global address space: any thread/process may
    directly read/write data allocated by another
  • Partitioned: data is designated as local or global
  • By default:
  • Object heaps are shared
  • Program stacks are private

[Figure: threads p0 ... pn share one global address
space; each thread's private local data and pointers
(l) sit next to global references (g) into shared heap
objects (x, y).]
  • 3 current languages: UPC, CAF, and Titanium
  • All three use an SPMD execution model
  • Designed for large-scale (cluster) and
    scientific computing
  • 3 emerging languages: X10, Fortress, and Chapel
    (a minimal sketch of the memory model follows)
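
A hedged C++ sketch that only mimics the PGAS memory
picture with threads (UPC, CAF, and Titanium provide
this model as language semantics): a shared heap any
thread may read or write directly, alongside
per-thread private stack data, under SPMD execution.

#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int THREADS = 4;
    std::vector<int> shared_heap(THREADS, 0);  // "global": visible to all
    std::vector<std::thread> ts;
    for (int me = 0; me < THREADS; ++me)
        ts.emplace_back([&, me] {      // SPMD: every thread runs this code
            int x = me * me;           // "private": on this thread's stack
            shared_heap[me] = x;       // direct write into the shared space
        });
    for (auto& t : ts) t.join();
    for (int v : shared_heap) std::cout << v << ' ';   // prints: 0 1 4 9
    std::cout << '\n';
}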

23
Arrays in a Global Address Space
  • Key features of Titanium arrays:
  • Generality: indices may start/end at any point
  • The domain calculus allows slicing, subarrays,
    transposes, and other operations without data
    copies
  • Use the domain calculus to identify ghost cells
    and iterate:
  • foreach (p in gridA.shrink(1).domain()) ...
  • Array copies automatically work on the
    intersection (a 1-D analogue is sketched below):
  • gridB.copy(gridA.shrink(1))

[Figure: gridA and gridB overlap; the copy covers the
intersection of gridB with gridA's restricted
(non-ghost) cells, leaving the ghost cells untouched.]
Joint work with Titanium group
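
A hedged 1-D C++ analogue of these idioms (Titanium's
multi-dimensional domain calculus is far more general;
the Grid type here is a hypothetical stand-in):
shrink() drops a ghost layer and copy() operates on
the intersection of the two domains.

#include <algorithm>
#include <iostream>
#include <vector>

struct Grid {                 // a grid over index domain [lo, hi)
    int lo, hi;
    std::vector<double> data; // data[i - lo] holds cell i
    Grid(int l, int h, double v) : lo(l), hi(h), data(h - l, v) {}
    Grid shrink(int g) const {        // interior: drop g ghost cells/side
        Grid s(lo + g, hi - g, 0.0);
        std::copy(data.begin() + g, data.end() - g, s.data.begin());
        return s;
    }
    void copy(const Grid& src) {      // copy on intersection of domains
        int l = std::max(lo, src.lo), h = std::min(hi, src.hi);
        for (int i = l; i < h; ++i) data[i - lo] = src.data[i - src.lo];
    }
};

int main() {
    Grid gridA(0, 10, 1.0), gridB(4, 14, 0.0);
    gridB.copy(gridA.shrink(1));      // copies cells [4, 9) only
    for (int i = gridB.lo; i < gridB.hi; ++i)
        std::cout << gridB.data[i - gridB.lo] << ' ';
    std::cout << '\n';                // 1 1 1 1 1 0 0 0 0 0
}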
24
Language Support Helps Productivity
  • C/Fortran/MPI AMR
  • Chombo package from LBNL
  • Bulk-synchronous communication
  • Pack boundary data between procs
  • All optimizations done by the programmer
  • Titanium AMR
  • Entirely in Titanium
  • Finer-grained communication
  • No explicit pack/unpack code
  • Automated in the runtime system
  • General approach:
  • The language allows programmer optimizations
  • The compiler/runtime does some automatically

Work by Tong Wen and Philip Colella
Communication optimizations joint with Jimmy Su
25
Particle/Mesh Method: Heart Simulation
  • Elastic structures in an incompressible fluid
  • Blood flow, clotting, inner ear, embryo growth, ...
  • Complicated parallelization
  • Particle/Mesh method, but particles are connected
    into materials (1D or 2D structures)
  • Communication patterns are irregular between
    particles (structures) and mesh (fluid)

[Figure: 2D Dirac delta function.]
Code size in lines: Fortran 8000, Titanium 4000
(note: the Fortran code is not parallel)
Joint work with Ed Givelberg, Armando
Solar-Lezama, Charlie Peskin, Dave McQueen
26
Titanium Experience: Composition
  • Data parallelism could have been used
  • Parallel over n, rather than p
  • The compiler can generate SPMD code
  • Most code could be written as pure data
    parallelism (serial semantics) and translated (Su)
  • Can we mix data and other parallelism?
  • Compiler analysis makes this possible
  • Barriers are restricted: all threads must reach
    the same barrier (proven by "single" analysis,
    Gay and Aiken)
  • Single analysis identifies global execution
    points
  • Allows global optimizations (across threads)
  • Creates natural points to switch in and out of
    data-parallel or serial code (see the sketch
    below)
  • These may also be points for heterogeneous
    processor switches in code

Joint work with the Titanium group
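
A hedged C++20 sketch of this discipline using
std::barrier (Titanium proves the property at compile
time via "single" analysis; here the code structure
simply makes every thread reach the same barrier): a
data-parallel phase, a global synchronization point
that doubles as a brief serial section, then a second
phase that sees the combined result.

#include <barrier>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int THREADS = 4;
    std::vector<int> partial(THREADS), total(THREADS);
    // The completion function runs once per phase boundary: a natural
    // point to switch into a short "serial" section.
    std::barrier sync(THREADS, [&]() noexcept {
        int s = 0;
        for (int v : partial) s += v;
        for (int& t : total) t = s;
    });
    std::vector<std::thread> ts;
    for (int me = 0; me < THREADS; ++me)
        ts.emplace_back([&, me] {
            partial[me] = me + 1;   // phase 1: data parallel
            sync.arrive_and_wait(); // all threads reach this same barrier
            // phase 2: every thread sees the reduced result (10)
            if (me == 0) std::cout << "total = " << total[me] << '\n';
        });
    for (auto& t : ts) t.join();
}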
27
Efficiency layer
  • Remember why we are here

28
Efficiency Layer: Selective Virtualization
  • The efficiency layer is an abstract machine model
    plus selective virtualization
  • Libraries provide add-ons:
  • Schedulers
  • Add a runtime with dynamic scheduling
    for a dynamic task tree
  • Memory movement / sharing primitives
  • Synchronization primitives
  • E.g., fast barriers, atomic operations
  • More on this in Krste Asanovic's talk
  • The division of layers allows us to explore the
    execution model separately from the programming
    model

[Figure: a general task graph with weights and
structure.]
29
Synthesis
  • Extensive tuning knobs at the efficiency level
  • Performance feedback from hardware and OS
  • Sketching: correct by construction
  • More on this in Ras Bodik's talk

[Figure: a spec (simple 3-loop 3D stencil
implementation) is synthesized into optimized code
(tiled, prefetched, time-skewed); the spec-vs-tiled
contrast is sketched below.]
  • Autotuning: efficient by search
  • Examples: spectral (FFTW, SPIRAL), dense (PHiPAC,
    ATLAS), sparse (OSKI), structured grids
    (stencils)
  • Can select from algorithms/data structures:
    changes not producible by compiler
    transformations
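
A hedged sketch of the figure's contrast (C++;
illustrative sizes N and T; prefetching and time
skewing omitted): the simple 3-loop 3D stencil "spec"
and a cache-tiled variant that computes the same
answer with a different iteration schedule.

#include <algorithm>
#include <iostream>
#include <vector>

const int N = 32, T = 8;   // grid size and tile size (illustrative)
inline int idx(int i, int j, int k) { return (i * N + j) * N + k; }

// Spec: simple 3-loop 7-point stencil over interior points.
void stencil_spec(const std::vector<double>& a, std::vector<double>& b) {
    for (int i = 1; i < N - 1; ++i)
      for (int j = 1; j < N - 1; ++j)
        for (int k = 1; k < N - 1; ++k)
          b[idx(i, j, k)] = (a[idx(i - 1, j, k)] + a[idx(i + 1, j, k)]
                           + a[idx(i, j - 1, k)] + a[idx(i, j + 1, k)]
                           + a[idx(i, j, k - 1)] + a[idx(i, j, k + 1)]) / 6.0;
}

// Tiled: identical arithmetic, iterated tile by tile for locality.
void stencil_tiled(const std::vector<double>& a, std::vector<double>& b) {
    for (int ii = 1; ii < N - 1; ii += T)
      for (int jj = 1; jj < N - 1; jj += T)
        for (int kk = 1; kk < N - 1; kk += T)
          for (int i = ii; i < std::min(ii + T, N - 1); ++i)
            for (int j = jj; j < std::min(jj + T, N - 1); ++j)
              for (int k = kk; k < std::min(kk + T, N - 1); ++k)
                b[idx(i, j, k)] = (a[idx(i - 1, j, k)] + a[idx(i + 1, j, k)]
                                 + a[idx(i, j - 1, k)] + a[idx(i, j + 1, k)]
                                 + a[idx(i, j, k - 1)] + a[idx(i, j, k + 1)]) / 6.0;
}

int main() {
    std::vector<double> a(N * N * N), b1(N * N * N, 0.0), b2(N * N * N, 0.0);
    for (int i = 0; i < N * N * N; ++i) a[i] = i % 7;
    stencil_spec(a, b1);
    stencil_tiled(a, b2);
    std::cout << "tiled matches spec: " << (b1 == b2 ? "yes" : "no") << '\n';
}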

30
Autotuning: 21st-Century Code Generation
  • Problem: generating optimal code is like
    searching for a needle in a haystack
  • Manycore → even more diverse
  • New approach: auto-tuners
  • 1st, generate program variations of combinations
    of optimizations (blocking, prefetching, ...) and
    data structures
  • Then compile and run to heuristically search for
    the best code for that computer (a miniature
    search harness is sketched below)
  • Examples: PHiPAC (BLAS), ATLAS (BLAS), SPIRAL
    (DSP), FFTW (FFT), OSKI (sparse matrices)
  • Search space for block sizes (dense matrix):
  • Axes are block dimensions
  • Temperature is speed

Sparse example: 50% more zeros, 50% faster
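
A hedged miniature of the auto-tuning loop (real
autotuners such as PHiPAC, ATLAS, and OSKI generate
and compile many code variants; this sketch only
searches one runtime parameter): time a cache-blocked
transpose at several candidate block sizes and keep
the fastest for the machine at hand.

#include <chrono>
#include <iostream>
#include <vector>

const int N = 1024;

void transpose_blocked(const std::vector<double>& a,
                       std::vector<double>& b, int B) {
    for (int ii = 0; ii < N; ii += B)          // visit the matrix block
        for (int jj = 0; jj < N; jj += B)      // by block for locality
            for (int i = ii; i < ii + B; ++i)
                for (int j = jj; j < jj + B; ++j)
                    b[j * N + i] = a[i * N + j];
}

int main() {
    std::vector<double> a(N * N, 1.0), b(N * N);
    int best_b = 0; double best_t = 1e30;
    for (int B : {8, 16, 32, 64, 128}) {       // candidate block sizes
        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 10; ++rep) transpose_blocked(a, b, B);
        double t = std::chrono::duration<double>(
                       std::chrono::steady_clock::now() - t0).count();
        std::cout << "B=" << B << "  " << t << " s\n";
        if (t < best_t) { best_t = t; best_b = B; }
    }
    std::cout << "best block size on this machine: " << best_b << '\n';
}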
31
LBMHD: Structured Grid Application
  • Plasma turbulence simulation
  • Two distributions:
  • Momentum distribution (27 components)
  • Magnetic distribution (15 vector components)
  • Three macroscopic quantities:
  • Density
  • Momentum (vector)
  • Magnetic field (vector)
  • Must read 73 doubles and update (write) 79
    doubles per point in space
  • Requires about 1300 floating-point operations per
    point in space
  • Just over 1.0 flops/byte (ideal)
  • No temporal locality between points in space
    within one time step
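  • (Arithmetic check, assuming 8-byte doubles:
    (73 + 79) doubles × 8 bytes = 1216 bytes per point,
    so 1300 flops / 1216 bytes ≈ 1.07 flops/byte, i.e.,
    just over 1.0)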

Joint work with Sam Williams, Lenny Oliker, John
Shalf, and Jonathan Carter
32
Autotuned Performance (Cell/SPE version)
  • First attempt at a Cell implementation
  • VL, unrolling, and reordering fixed
  • Exploits DMA and double buffering to load vectors
  • Goes straight to SIMD intrinsics
  • Despite the relative performance, Cell's
    double-precision implementation severely impairs
    performance

[Chart: performance broken down by optimization:
Naïve+NUMA, Padding, Vectorization, Unrolling,
SW Prefetching, SIMDization; collision() only.]
33
Productivity
  • Niagara2 required significantly less work to
    deliver good performance
  • For LBMHD, Clovertown, Opteron, and Cell all
    required SIMD (which hampers productivity) for
    best performance
  • Virtually every optimization was required (sooner
    or later) for Opteron and Cell
  • Cache-based machines required search for some
    optimizations, while Cell relied solely on
    heuristics (less time to tune)

34
PGAS Languages & Autotuning for Multicore with DMA
  • PGAS languages are a good fit for shared memory
    machines, including multicore
  • The global address space is implemented as
    reads/writes
  • It may also be exploited for processors with an
    explicit local store rather than a cache, e.g.,
    Cell, GPUs, ...
  • Open question in architecture:
  • Cache-coherent shared memory, or
  • Software-controlled local memory (or a hybrid)

[Figure: private on-chip memories backed by shared
off-chip DRAM.]
35
Correctness
36
Ensuring Correctness
  • Productivity Layer:
  • Enforce independence of tasks using decomposition
    (partitioning) and copying operators
  • Goal: remove the chance of concurrency errors
    (e.g., nondeterminism from execution order, not
    just low-level data races)
  • Efficiency Layer: check for subtle concurrency
    bugs (races, deadlocks, and so on)
  • Mixture of verification and automated directed
    testing
  • Error detection on frameworks with sequential
    code as the specification

37
Software Correctness
  • At the Productivity layer, many concurrency
    errors are not permitted
  • At the Efficiency layer, we need more tools for
    correctness
  • Both concurrency errors and (eventually)
    numerical errors
  • Traditional approach to correct software: testing
  • Low probability of finding an error; lots of
    manual effort
  • Symbolic model checking
  • Many recent successes in security, control
    systems, etc.
  • Ideas from theorem proving applied to specific
    classes of errors
  • Can't handle libraries, complex data types, ...
  • Concolic testing combines:
  • Concrete execution + symbolic analysis
  • Uses state-of-the-art theorem proving to find
    inputs that reach all program paths
  • Ideas applied successfully to concurrent programs
    (a toy sketch follows)
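
A hedged toy of the concolic loop in C++ (real tools
use constraint solvers; here brute-force search over a
small input range stands in for the theorem prover,
and branch, program, and solve are hypothetical
names): each concrete run records the branch decisions
taken, and negating one decision yields constraints
whose solution steers execution down a new path.

#include <functional>
#include <iostream>
#include <set>
#include <utility>
#include <vector>

using Pred = std::function<bool(int)>;
std::vector<std::pair<Pred, bool>> path;  // (branch predicate, direction)

bool branch(Pred p, int x) {      // instrumented "if": record and decide
    bool taken = p(x);
    path.push_back({p, taken});
    return taken;
}

void program(int x) {             // program under test (illustrative)
    if (branch([](int v) { return v > 10; }, x)) {
        if (branch([](int v) { return v % 2 == 0; }, x))
            std::cout << "path A, x=" << x << '\n';
        else std::cout << "path B, x=" << x << '\n';
    } else std::cout << "path C, x=" << x << '\n';
}

// "Solve" for an input matching a decision prefix with one flipped branch.
bool solve(const std::vector<std::pair<Pred, bool>>& want, int& out) {
    for (int x = -100; x <= 100; ++x) {   // brute force replaces a solver
        bool ok = true;
        for (auto& [p, dir] : want) if (p(x) != dir) { ok = false; break; }
        if (ok) { out = x; return true; }
    }
    return false;
}

int main() {
    std::vector<int> worklist = {0};      // start from an arbitrary input
    std::set<int> tried;
    while (!worklist.empty()) {
        int x = worklist.back(); worklist.pop_back();
        if (!tried.insert(x).second) continue;
        path.clear();
        program(x);                       // concrete run, recording branches
        for (std::size_t i = 0; i < path.size(); ++i) {
            std::vector<std::pair<Pred, bool>> want(
                path.begin(), path.begin() + i + 1);
            want.back().second = !want.back().second;  // negate branch i
            int nx;
            if (solve(want, nx)) worklist.push_back(nx);
        }
    }
}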

38
Why Languages at All?
  • Most of the work is in runtimes and libraries
  • Do we need a language? And a compiler?
  • If higher-level syntax is needed for productivity:
  • We need a language
  • If static analysis is needed to help with
    correctness:
  • We need a compiler (front-end)
  • If static optimizations are needed to get
    performance:
  • We need a compiler (back-end)
  • All of these decisions will be driven by
    application need