Title: Programming Models for Manycore Systems
1. Programming Models for Manycore Systems
Parallel Hardware
Parallel Applications
Parallel Software
- Kathy Yelick
- U.C. Berkeley
2. Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Layer diagram, top to bottom:]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Type Systems
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
3. Applications: What Are the Problems?
- Who needs 100 cores to run MS Word?
  - Need compelling apps that use 100s of cores
- How did we pick applications?
  - Enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
  - Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  - Requires significant speed-up, or a smaller, more efficient platform, to work as intended
- As a whole, the applications cover the most important
  - Platforms (handheld, laptop, games)
  - Markets (consumer, business, health)
4. Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation
  - More channels, more instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
- Music Enhancer
  - Enhanced sound delivery for home sound systems using large microphone and speaker arrays
  - Laptop/handheld recreates 3D sound over ear buds
- Hearing Augmenter
  - Handheld as an accelerator for a hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for laptop/handheld
Berkeley's Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.
5. Content-Based Image Retrieval (Kurt Keutzer)
[Pipeline diagram: query by example over an image database (1000s of images) → similarity metric → candidate results → relevance feedback → final result]
- Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects
6. Coronary Artery Disease (Tony Keaveny)
[Before / After images]
- Modeling to help patient compliance?
  - 450k deaths/year, 16M with symptoms, 72M with high blood pressure
- Massively parallel, real-time variations
  - CFD + FE: solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
7. Meeting Diarist and Teleconference Aid (Nelson Morgan)
- Meeting Diarist
  - Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
- Teleconference speaker identifier, speech helper
  - Laptops/handhelds used for teleconference; identifies who is speaking, with a closed-caption hint of what is being said
8. Parallel Browser
- Goal: desktop-quality browsing on handhelds
  - Enabled by 4G networks and better output devices
- Bottlenecks to parallelize
  - Parsing, rendering, scripting
- SkipJax
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown's FlapJax
9. Broader Coverage of Applications through Motifs
- How do we invent the parallel systems of the future while tied to old code, programming models, and CPUs of the past?
- Look for common computational patterns
  - Embedded computing (42 EEMBC benchmarks)
  - Desktop/server computing (28 SPEC2006)
  - Database / text mining software
  - Games/graphics/vision
  - Machine learning
  - High performance computing (original 7 Dwarfs)
- Result: 13 "Dwarfs" (we use "motif" instead after going from 7 to 13)
10. Motif/Dwarf Popularity (Red = Hot, Blue = Cool)
- How do the compelling apps relate to the 13 motifs/dwarfs?
[Heat-map of motifs vs. applications omitted]
11. Roles of Motifs/Dwarfs
- "Anti-benchmarks"
  - Motifs are not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware
- Universal, understandable vocabulary, at least at a high level
  - To talk across disciplinary boundaries
- Bootstrapping: parallelize parallel research
  - Allow analysis of HW and SW design without waiting years for full apps
- Targets for libraries
12. Par Lab Research Overview (revisited)
Easy to write correct programs that run efficiently on manycore
[Layer diagram, top to bottom:]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Type Systems
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
13. Developing Parallel Software
- 2 types of programmers → 2 layers
- Efficiency Layer (10% of today's programmers)
  - Expert programmers build frameworks, libraries, hypervisors, ...
  - Bare-metal efficiency possible at the Efficiency Layer
- Productivity Layer (90% of today's programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks and libraries
  - Frameworks and libraries are composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged → create a language for Composition and Coordination (CC)
14. Composition to Build Applications
[Diagram: serial code composed with parallel code]
15. Composition
- Composition is key to software reuse
- Solutions exist for libraries with hidden parallelism
  - Partitions in the OS help runtime composition
- Instantiating parallel frameworks is harder
  - Framework specifies required independence
    - E.g., operations in map or divide-and-conquer must not interfere
- Guaranteed independence through types
  - Type system extensions (side effects, ownership)
    - E.g., Image → READ Array[double]
  - Data decomposition may be implicit or explicit:
    - Partition(Array[T]) → List[Array[T]] (well understood)
    - Partition(Graph[T]) → List[Graph[T]]
- Efficiency-layer code has these specifications at interfaces, which are verified, tested, or asserted
- Independence is proven by checking side effects and overlap at instantiation
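The contract above can be illustrated with a small sketch (names and API are illustrative, not the Par Lab type system): a partition operator yields provably disjoint index ranges, so a framework may run user code on each piece in parallel, and the non-overlap check stands in for the independence proof done at instantiation.

```python
# Sketch of partition-based independence, assuming a simple block
# decomposition; the overlap assertion models what a type system
# would verify statically.
from concurrent.futures import ThreadPoolExecutor

def partition(data, nblocks):
    """Partition(Array[T]) -> List[Array[T]] as disjoint index ranges."""
    n = len(data)
    bounds = [(i * n // nblocks, (i + 1) * n // nblocks) for i in range(nblocks)]
    # Independence check: consecutive ranges must not overlap.
    for (_, hi), (lo, _) in zip(bounds, bounds[1:]):
        assert hi <= lo, "blocks overlap"
    return bounds

def scale_block(data, lo, hi, factor):
    # Writes only its own [lo, hi) range -- the side effect the
    # framework's independence specification requires.
    for i in range(lo, hi):
        data[i] *= factor

data = list(range(10))
with ThreadPoolExecutor() as pool:
    for lo, hi in partition(data, 3):
        pool.submit(scale_block, data, lo, hi, 2)
print(data)  # each element doubled, with no write-write conflicts
```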
16. Coordination
- Coordination is used to create parallelism
- Support the parallelism patterns of applications
  - Data parallelism (degree = data size, not core count)
    - May be nested: forall images, forall blocks in image
  - Divide-and-conquer (parallelism from recursion)
  - Event-driven: nondeterminism at the algorithm level
    - Branch & Bound dwarf, etc.
    - Serial semantics with limited nondeterminism
- Choose the solution that fits your domain
  - Data parallelism comes from array/aggregate operations and loops without side effects
  - Divide-and-conquer parallelism comes from recursive functions with non-overlapping side effects
  - Event-driven programs are written as guarded atomic commands, which may be implemented as transactions
- Discovered parallelism is mapped to available resources
  - Techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
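Two of the coordination styles listed above can be sketched side by side (plain Python, not the actual CC language): data parallelism from a side-effect-free aggregate operation, and divide-and-conquer parallelism from recursion over disjoint halves. Both keep serial semantics, so their results must agree.

```python
# Illustrative sketch: the same sum of squares expressed in two
# parallelizable styles with serial semantics.

def data_parallel_sum(xs):
    # forall-style: every term is independent, so a scheduler may map
    # iterations to any number of cores; degree = data size, not cores.
    return sum(x * x for x in xs)

def divide_and_conquer_sum(xs, lo=0, hi=None):
    # Parallelism comes from recursion: the two halves touch disjoint
    # index ranges (non-overlapping side effects), so they could run
    # concurrently.
    if hi is None:
        hi = len(xs)
    if hi - lo <= 2:                      # small base case, run serially
        return sum(x * x for x in xs[lo:hi])
    mid = (lo + hi) // 2
    return (divide_and_conquer_sum(xs, lo, mid) +
            divide_and_conquer_sum(xs, mid, hi))

xs = list(range(8))
assert data_parallel_sum(xs) == divide_and_conquer_sum(xs)
```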
17. CC Language Strategy
- Application-driven: domain-specific languages
  - Ensure usefulness for at least one application
  - Music language
  - Image framework
  - Browser language
  - Health application language
- Bottom-up implementation strategy
  - Ensure it is efficiently implementable
  - Grow a language from one that is efficient but not productive, by adding abstraction levels
- Identify common features across DSLs
  - Cross-language meetings/discussions
18. Coordination & Composition in the CBIR Application
- Parallelism in CBIR is hierarchical
  - Mostly independent tasks/data, with a reducing collector
[Diagram: output stream of images → feature extraction (Face Recog, ..., DCT, DWT, ...) → collector → output stream of feature vectors; stream parallel over images, task parallel over extraction algorithms]
19. Using Map Reduce for Image Retrieval
- Map Reduce can mean various things
- To us, it means:
  - A map stage, where threads compute independently
  - A reduce stage, where the results of the map stage are summarized
- This is a pattern of computation and communication
  - Not an implementation involving key/value pairs, parallel I/O, ...
- We consider Map Reduce computations where
  - A map function produces a set of outputs
  - Each of a set of reduce functions, gated by per-element predicates, produces a set of outputs
Work by B. Catanzaro, N. Sundaram, and K. Keutzer
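The pattern as the slide defines it can be modeled in a few lines (names here are illustrative, not the actual CBIR framework API): one map stage of independent computations, then several reduce functions, each gated by a per-element predicate.

```python
# Minimal model of map + predicate-gated reduces as a pattern of
# computation, not an implementation with key/value pairs or I/O.

def map_reduce(items, map_fn, reducers):
    mapped = [map_fn(x) for x in items]          # map: independent per item
    results = {}
    for name, (predicate, reduce_fn, init) in reducers.items():
        acc = init
        for y in mapped:
            if predicate(y):                     # per-element gate
                acc = reduce_fn(acc, y)
        results[name] = acc
    return results

out = map_reduce(
    range(10),
    map_fn=lambda x: x * x,
    reducers={
        "sum_even": (lambda y: y % 2 == 0, lambda a, y: a + y, 0),
        "max_odd":  (lambda y: y % 2 == 1, max, 0),
    },
)
print(out)  # {'sum_even': 120, 'max_odd': 81}
```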
20. SVM Classification Results
- Average 100x speedup (180x max)
- Map Reduce framework reduced kernel LOC by 64%
Work by B. Catanzaro, N. Sundaram, and K. Keutzer
21. CC Language for Health
- ...and the applications to go with it
- The personalized medicine application has large amounts of data parallelism
  - Irregular data structures / access: sparse matrices, particles
  - But most of the code could be expressed in a data-parallel way, meaning serial semantics
  - Note that parallelism over data is essential at O(100) cores
- Composition across languages is still key
  - Calls to optimized (not data-parallel) libraries
  - Supported by static analysis for phase-based computations
22. Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local or global
- By default:
  - Object heaps are shared
  - Program stacks are private
[Diagram: a global address space spanning processors p0..pn; each processor holds private (local, l) and shared (global, g) data, with pointers to variables x and y in other partitions]
- Three current languages: UPC, CAF, and Titanium
  - All three use an SPMD execution model
  - Designed for large-scale (cluster) and scientific computing
- Three emerging languages: X10, Fortress, and Chapel
23. Arrays in a Global Address Space
- Key features of Titanium arrays
  - Generality: indices may start/end at any point
  - Domain calculus allows slicing, subarrays, transposes, and other operations without data copies
- Use the domain calculus to identify ghost cells and iterate:
  - foreach (p in gridA.shrink(1).domain()) ...
- Array copies automatically work on the intersection:
  - gridB.copy(gridA.shrink(1))
[Diagram: gridA and gridB overlap; the copied area is the intersection of gridB with gridA's restricted (non-ghost) cells, leaving the ghost cells around each grid]
Joint work with Titanium group
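The ghost-cell idiom on this slide can be sketched in plain Python (Titanium expresses this with domain operations and no explicit copies; this toy 1-D version copies explicitly): shrink(1) drops the boundary layer, and copying the intersection fills the destination's interior without touching ghost cells.

```python
# Sketch of shrink(1) + copy-on-intersection for a 1-D grid with
# one ghost cell on each side. Function names are illustrative.

def shrink_domain(n, width=1):
    """Interior indices of a length-n array, excluding `width` ghost cells."""
    return range(width, n - width)

def copy_intersection(dst, src):
    # Analogous to gridB.copy(gridA.shrink(1)): copy only the interior.
    for i in shrink_domain(len(src)):
        dst[i] = src[i]

gridA = [10, 11, 12, 13, 14]
gridB = [0, 0, 0, 0, 0]
copy_intersection(gridB, gridA)
print(gridB)  # [0, 11, 12, 13, 0] -- ghost cells left for a later exchange
```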
24. Language Support Helps Productivity
- C/Fortran/MPI AMR
  - Chombo package from LBNL
  - Bulk-synchronous communication
    - Pack boundary data between procs
  - All optimizations done by the programmer
- Titanium AMR
  - Entirely in Titanium
  - Finer-grained communication
    - No explicit pack/unpack code
    - Automated in the runtime system
- General approach
  - Language allows programmer optimizations
  - Compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su
25. Particle/Mesh Method: Heart Simulation
- Elastic structures in an incompressible fluid
  - Blood flow, clotting, inner ear, embryo growth, ...
- Complicated parallelization
  - Particle/mesh method, but particles are connected into materials (1D or 2D structures)
  - Communication patterns are irregular between particles (structures) and mesh (fluid)
[Figure: 2D Dirac delta function]
Code size in lines: Fortran 8000, Titanium 4000 (note: the Fortran code is not parallel)
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen
26. Titanium Experience: Composition
- Data parallelism could have been used
  - Parallel over n, rather than p
  - The compiler can generate SPMD code
  - Most code could be written as pure data parallelism (serial semantics) and translated [Su]
- Can we mix data and other parallelism?
  - Compiler analysis makes this possible
  - Barriers are restricted: all threads must reach the same barrier (proven by "single" analysis [Gay and Aiken])
  - Single analysis identifies global execution points
- Allows global optimizations (across threads)
  - Creates natural points to switch in and out of data-parallel or serial code
  - These may also be points for heterogeneous processor switches in code
Joint work with the Titanium group
27. Efficiency Layer
28. Efficiency Layer: Selective Virtualization
- The efficiency layer is an abstract machine model with selective virtualization
- Libraries provide add-ons
  - Schedulers
    - Add a runtime with dynamic scheduling for a dynamic task tree
  - Memory movement / sharing primitives
  - Synchronization primitives
    - E.g., fast barriers, atomic operations
- More on this in Krste Asanovic's talk
- The division of layers allows us to explore the execution model separately from the programming model
[Figure: general task graph with weights and structure]
29. Synthesis
- Extensive tuning knobs at the efficiency level
  - Performance feedback from hardware and OS
- Sketching: correct by construction
  - More on this in Ras Bodik's talk
  - Spec: simple implementation (3-loop 3D stencil)
  - Optimized code (tiled, prefetched, time-skewed)
- Autotuning: efficient by search
  - Examples: spectral (FFTW, SPIRAL), dense (PHiPAC, Atlas), sparse (OSKI), structured grids (stencils)
  - Can select from algorithm/data structure changes not producible by compiler transformations
30. Autotuning: 21st Century Code Generation
- Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore → even more diverse
- New approach: auto-tuners
  - First generate program variations from combinations of optimizations (blocking, prefetching, ...) and data structures
  - Then compile and run them to heuristically search for the best code for that computer
- Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W (FFT), OSKI (sparse matrices)
- Search space for block sizes (dense matrix):
  - Axes are block dimensions; temperature is speed
  - 50% more zeros can be 50% faster
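A miniature auto-tuner in the PHiPAC/ATLAS spirit can be sketched as follows (the kernel and candidate set are illustrative; real autotuners search far larger spaces of unrolling, prefetching, and data-structure choices): generate variants, run each on this machine, keep the fastest.

```python
# Sketch: search over block sizes by timing each generated variant.
import time

def blocked_sum(xs, block):
    # The "kernel" being tuned: a blocked traversal of xs.
    total = 0
    for lo in range(0, len(xs), block):
        total += sum(xs[lo:lo + block])
    return total

def autotune(xs, candidates, trials=3):
    # Exhaustive empirical search: run each variant, keep the fastest.
    best_block, best_time = None, float("inf")
    for block in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            blocked_sum(xs, block)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block

xs = list(range(10000))
best = autotune(xs, candidates=[8, 64, 512, 4096])
assert blocked_sum(xs, best) == sum(xs)   # every variant computes the same result
```

The key point the slide makes is exactly this structure: correctness is fixed across variants, and the winner is chosen by measurement on the target machine, not by a compiler model.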
31. LBMHD: Structured Grid Application
- Plasma turbulence simulation
- Two distributions:
  - Momentum distribution (27 components)
  - Magnetic distribution (15 vector components)
- Three macroscopic quantities:
  - Density
  - Momentum (vector)
  - Magnetic field (vector)
- Must read 73 doubles and update (write) 79 doubles per point in space
- Requires about 1300 floating-point operations per point in space
- Just over 1.0 flops/byte (ideal)
- No temporal locality between points in space within one time step
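The arithmetic-intensity claim follows directly from the counts above: 73 doubles read plus 79 written, at 8 bytes per double, against roughly 1300 flops per lattice point.

```python
# Checking the slide's "just over 1.0 flops/byte" figure for LBMHD.
bytes_per_point = (73 + 79) * 8          # 1216 bytes moved per point
flops_per_point = 1300
intensity = flops_per_point / bytes_per_point
print(round(intensity, 2))               # ~1.07 flops/byte: memory-bound
```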
Joint work with Sam Williams, Lenny Oliker, John
Shalf, and Jonathan Carter
32. Autotuned Performance (Cell/SPE version)
- First attempt at a Cell implementation
  - VL, unrolling, and reordering fixed
  - Exploits DMA and double buffering to load vectors
  - Straight to SIMD intrinsics
- Despite the relative performance, Cell's double-precision implementation severely impairs performance
[Chart legend, optimizations applied cumulatively: Naïve + NUMA, Padding, Vectorization, Unrolling, SW Prefetching, SIMDization; plus collision() only]
33. Productivity
- Niagara2 required significantly less work to deliver good performance
- For LBMHD, Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance
- Virtually every optimization was required (sooner or later) for Opteron and Cell
- Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
34. PGAS Languages: Autotuning for Multicore + DMA
- PGAS languages are a good fit for shared memory machines, including multicore
  - The global address space is implemented as reads/writes
- They may also be exploited for processors with explicit local stores rather than caches, e.g., Cell, GPUs, ...
- Open question in architecture:
  - Cache-coherent shared memory, or
  - Software-controlled local memory (or a hybrid)?
[Diagram: private on-chip memories above shared off-chip DRAM]
35. Correctness
36. Ensuring Correctness
- Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove the chance of concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks with sequential code as the specification
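"Sequential code as the specification" can be sketched concretely (function names and the random-testing loop are illustrative, not the Par Lab tooling): run the parallel version on many generated inputs and require its output to match the serial semantics.

```python
# Sketch: checking a parallel implementation against its sequential
# specification over randomly generated inputs.
import random
from concurrent.futures import ThreadPoolExecutor

def serial_spec(xs):
    # The sequential specification of the computation.
    return [x + 1 for x in xs]

def parallel_impl(xs):
    # The parallel version under test.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda x: x + 1, xs))

random.seed(0)
for _ in range(100):   # automated testing loop
    xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    assert parallel_impl(xs) == serial_spec(xs), f"mismatch on {xs}"
print("all trials match the sequential spec")
```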
37. Software Correctness
- At the Productivity layer, many concurrency errors are not permitted
- At the Efficiency layer, we need more tools for correctness
  - Both concurrency errors and (eventually) numerical errors
- The traditional approach to correct software: testing
  - Low probability of finding an error; lots of manual effort
- Symbolic model checking
  - Many recent successes in security, control systems, etc.
  - Ideas from theorem proving applied to specific classes of errors
  - Can't handle libraries, complex data types, ...
- Concolic testing combines:
  - Concrete execution + symbolic analysis
  - Uses state-of-the-art theorem proving to find inputs that reach all program paths
  - Ideas applied successfully to concurrent programs
38. Why Languages at All?
- Most of the work is in runtimes and libraries
  - Do we need a language? And a compiler?
- If higher-level syntax is needed for productivity
  - We need a language
- If static analysis is needed to help with correctness
  - We need a compiler (front-end)
- If static optimizations are needed to get performance
  - We need a compiler (back-end)
- All of these decisions will be driven by application need