Title: Programming Models for Manycore Systems
1. Programming Models for Manycore Systems
Parallel Hardware
Parallel Applications
Parallel Software
- Kathy Yelick
- U.C. Berkeley
2. Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore
[Layer diagram, top to bottom:]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Type Systems
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
3. Applications: What Are the Problems?
- Who needs 100 cores to run MS Word?
  - Need compelling apps that use 100s of cores
- How did we pick applications?
  - Enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
  - Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  - Requires significant speed-up, or a smaller, more efficient platform, to work as intended
- As a whole, the applications cover the most important
  - Platforms (handheld, laptop, games)
  - Markets (consumer, business, health)
4. Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation
  - More channels, more instruments, more processing, more interaction!
  - Latency must be low (5 ms)
  - Must be reliable (no clicks)
- Music Enhancer
  - Enhanced sound delivery for home sound systems using large microphone and speaker arrays
  - Laptop/handheld recreates 3D sound over ear buds
- Hearing Augmenter
  - Handheld as an accelerator for a hearing aid
- Novel Instrument User Interface
  - New composition and performance systems beyond keyboards
  - Input device for laptop/handheld
Berkeley's Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters.
5. Content-Based Image Retrieval (Kurt Keutzer)
[Pipeline diagram: query by example over an image database (1000s of images) → similarity metric → candidate results → relevance feedback → final result]
- Built around key characteristics of personal databases
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, places, and objects
6. Coronary Artery Disease (Tony Keaveny)
[Before / After images]
- Modeling to help patient compliance?
  - 450k deaths/year, 16M with symptoms, 72M with high blood pressure
- Massively parallel, real-time variations
  - CFD + FE: solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
7. Meeting Diarist and Teleconference Aid (Nelson Morgan)
- Meeting Diarist
  - Laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
- Teleconference speaker identifier, speech helper
  - Laptops/handhelds used for teleconference; identifies who is speaking, with a closed-caption hint of what is being said
8. Parallel Browser
- Goal: desktop-quality browsing on handhelds
  - Enabled by 4G networks and better output devices
- Bottlenecks to parallelize
  - Parsing, rendering, scripting
- SkipJax
  - Parallel replacement for JavaScript/AJAX
  - Based on Brown's FlapJax
9. Broader Coverage of Applications through Motifs
- How do we invent the parallel systems of the future while tied to old code, programming models, and CPUs of the past?
- Look for common computational patterns
  - Embedded computing (42 EEMBC benchmarks)
  - Desktop/server computing (28 SPEC2006)
  - Database / text mining software
  - Games/graphics/vision
  - Machine learning
  - High performance computing (original 7 Dwarfs)
- Result: 13 "Dwarfs" (we use "motif" instead after going from 7 to 13)
10. Motif/Dwarf Popularity (Red = Hot, Blue = Cool)
- How do the compelling apps relate to the 13 motifs/dwarfs?
[Heat-map of motifs vs. applications omitted]
11. Roles of Motifs/Dwarfs
- "Anti-benchmarks"
  - Motifs are not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware
- Universal, understandable vocabulary, at least at a high level
  - To talk across disciplinary boundaries
- Bootstrapping: parallelize parallel research
  - Allow analysis of HW and SW design without waiting years for full apps
- Targets for libraries
12. Par Lab Research Overview (revisited)
Easy to write correct programs that run efficiently on manycore
[Layer diagram, top to bottom:]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser; Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Type Systems
- Correctness: Directed Testing, Dynamic Checking, Debugging with Replay
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch.: Multicore/GPGPU, RAMP Manycore
13. Developing Parallel Software
- 2 types of programmers → 2 layers
- Efficiency Layer (10% of today's programmers)
  - Expert programmers build frameworks, libraries, hypervisors, ...
  - Bare-metal efficiency possible at the Efficiency Layer
- Productivity Layer (90% of today's programmers)
  - Domain experts / naïve programmers productively build parallel apps using frameworks and libraries
  - Frameworks and libraries are composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged → create a language for Composition and Coordination (CC)
14. Composition to Build Applications
[Diagram: serial code composed with parallel code]
15. Composition
- Composition is key to software reuse
- Solutions exist for libraries with hidden parallelism
  - Partitions in the OS help runtime composition
- Instantiating parallel frameworks is harder
  - Framework specifies required independence
    - E.g., operations in map or divide-and-conquer must not interfere
- Guaranteed independence through types
  - Type system extensions (side effects, ownership)
    - E.g., Image → READ Array[double]
  - Data decomposition may be implicit or explicit:
    - Partition(Array[T]) → List[Array[T]] (well understood)
    - Partition(Graph[T]) → List[Graph[T]]
- Efficiency-layer code has these specifications at interfaces, which are verified, tested, or asserted
- Independence is proven by checking side effects and overlap at instantiation
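The contract above can be illustrated with a small sketch (names and API are illustrative, not the Par Lab type system): a partition operator yields provably disjoint index ranges, so a framework may run user code on each piece in parallel, and the non-overlap check stands in for the independence proof done at instantiation.

```python
# Sketch of partition-based independence, assuming a simple block
# decomposition; the overlap assertion models what a type system
# would verify statically.
from concurrent.futures import ThreadPoolExecutor

def partition(data, nblocks):
    """Partition(Array[T]) -> List[Array[T]] as disjoint index ranges."""
    n = len(data)
    bounds = [(i * n // nblocks, (i + 1) * n // nblocks) for i in range(nblocks)]
    # Independence check: consecutive ranges must not overlap.
    for (_, hi), (lo, _) in zip(bounds, bounds[1:]):
        assert hi <= lo, "blocks overlap"
    return bounds

def scale_block(data, lo, hi, factor):
    # Writes only its own [lo, hi) range -- the side effect the
    # framework's independence specification requires.
    for i in range(lo, hi):
        data[i] *= factor

data = list(range(10))
with ThreadPoolExecutor() as pool:
    for lo, hi in partition(data, 3):
        pool.submit(scale_block, data, lo, hi, 2)
print(data)  # each element doubled, with no write-write conflicts
```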
16. Coordination
- Coordination is used to create parallelism
- Support the parallelism patterns of applications
  - Data parallelism (degree = data size, not core count)
    - May be nested: forall images, forall blocks in image
  - Divide-and-conquer (parallelism from recursion)
  - Event-driven: nondeterminism at the algorithm level
    - Branch & Bound dwarf, etc.
    - Serial semantics with limited nondeterminism
- Choose the solution that fits your domain
  - Data parallelism comes from array/aggregate operations and loops without side effects
  - Divide-and-conquer parallelism comes from recursive functions with non-overlapping side effects
  - Event-driven programs are written as guarded atomic commands, which may be implemented as transactions
- Discovered parallelism is mapped to available resources
  - Techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
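Two of the coordination styles listed above can be sketched side by side (plain Python, not the actual CC language): data parallelism from a side-effect-free aggregate operation, and divide-and-conquer parallelism from recursion over disjoint halves. Both keep serial semantics, so their results must agree.

```python
# Illustrative sketch: the same sum of squares expressed in two
# parallelizable styles with serial semantics.

def data_parallel_sum(xs):
    # forall-style: every term is independent, so a scheduler may map
    # iterations to any number of cores; degree = data size, not cores.
    return sum(x * x for x in xs)

def divide_and_conquer_sum(xs, lo=0, hi=None):
    # Parallelism comes from recursion: the two halves touch disjoint
    # index ranges (non-overlapping side effects), so they could run
    # concurrently.
    if hi is None:
        hi = len(xs)
    if hi - lo <= 2:                      # small base case, run serially
        return sum(x * x for x in xs[lo:hi])
    mid = (lo + hi) // 2
    return (divide_and_conquer_sum(xs, lo, mid) +
            divide_and_conquer_sum(xs, mid, hi))

xs = list(range(8))
assert data_parallel_sum(xs) == divide_and_conquer_sum(xs)
```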
17. CC Language Strategy
- Application-driven: domain-specific languages
  - Ensure usefulness for at least one application
  - Music language
  - Image framework
  - Browser language
  - Health application language
- Bottom-up implementation strategy
  - Ensure it is efficiently implementable
  - Grow a language from one that is efficient but not productive, by adding abstraction levels
- Identify common features across DSLs
  - Cross-language meetings/discussions
18. Coordination & Composition in the CBIR Application
- Parallelism in CBIR is hierarchical
  - Mostly independent tasks/data, with a reducing collector
[Diagram: output stream of images → feature extraction (Face Recog, ..., DCT, DWT, ...) → collector → output stream of feature vectors; stream parallel over images, task parallel over extraction algorithms]
19. Using Map Reduce for Image Retrieval
- Map Reduce can mean various things
- To us, it means:
  - A map stage, where threads compute independently
  - A reduce stage, where the results of the map stage are summarized
- This is a pattern of computation and communication
  - Not an implementation involving key/value pairs, parallel I/O, ...
- We consider Map Reduce computations where
  - A map function produces a set of outputs
  - Each of a set of reduce functions, gated by per-element predicates, produces a set of outputs
Work by B. Catanzaro, N. Sundaram, and K. Keutzer
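The pattern as the slide defines it can be modeled in a few lines (names here are illustrative, not the actual CBIR framework API): one map stage of independent computations, then several reduce functions, each gated by a per-element predicate.

```python
# Minimal model of map + predicate-gated reduces as a pattern of
# computation, not an implementation with key/value pairs or I/O.

def map_reduce(items, map_fn, reducers):
    mapped = [map_fn(x) for x in items]          # map: independent per item
    results = {}
    for name, (predicate, reduce_fn, init) in reducers.items():
        acc = init
        for y in mapped:
            if predicate(y):                     # per-element gate
                acc = reduce_fn(acc, y)
        results[name] = acc
    return results

out = map_reduce(
    range(10),
    map_fn=lambda x: x * x,
    reducers={
        "sum_even": (lambda y: y % 2 == 0, lambda a, y: a + y, 0),
        "max_odd":  (lambda y: y % 2 == 1, max, 0),
    },
)
print(out)  # {'sum_even': 120, 'max_odd': 81}
```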
20. SVM Classification Results
- Average 100x speedup (180x max)
- Map Reduce framework reduced kernel LOC by 64%
Work by B. Catanzaro, N. Sundaram, and K. Keutzer
21. CC Language for Health
- ...and the applications to go with it
- The personalized medicine application has large amounts of data parallelism
  - Irregular data structures / access: sparse matrices, particles
  - But most of the code could be expressed in a data-parallel way, meaning serial semantics
  - Note that parallelism over data is essential at O(100) cores
- Composition across languages is still key
  - Calls to optimized (not data-parallel) libraries
  - Supported by static analysis for phase-based computations
22. Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local or global
- By default:
  - Object heaps are shared
  - Program stacks are private
[Diagram: a global address space spanning processors p0..pn; each processor holds private (local, l) and shared (global, g) data, with pointers to variables x and y in other partitions]
- Three current languages: UPC, CAF, and Titanium
  - All three use an SPMD execution model
  - Designed for large-scale (cluster) and scientific computing
- Three emerging languages: X10, Fortress, and Chapel
23. Arrays in a Global Address Space
- Key features of Titanium arrays
  - Generality: indices may start/end at any point
  - Domain calculus allows slicing, subarrays, transposes, and other operations without data copies
- Use the domain calculus to identify ghost cells and iterate:
  - foreach (p in gridA.shrink(1).domain()) ...
- Array copies automatically work on the intersection:
  - gridB.copy(gridA.shrink(1))
[Diagram: gridA and gridB overlap; the copied area is the intersection of gridB with gridA's restricted (non-ghost) cells, leaving the ghost cells around each grid]
Joint work with Titanium group
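The ghost-cell idiom on this slide can be sketched in plain Python (Titanium expresses this with domain operations and no explicit copies; this toy 1-D version copies explicitly): shrink(1) drops the boundary layer, and copying the intersection fills the destination's interior without touching ghost cells.

```python
# Sketch of shrink(1) + copy-on-intersection for a 1-D grid with
# one ghost cell on each side. Function names are illustrative.

def shrink_domain(n, width=1):
    """Interior indices of a length-n array, excluding `width` ghost cells."""
    return range(width, n - width)

def copy_intersection(dst, src):
    # Analogous to gridB.copy(gridA.shrink(1)): copy only the interior.
    for i in shrink_domain(len(src)):
        dst[i] = src[i]

gridA = [10, 11, 12, 13, 14]
gridB = [0, 0, 0, 0, 0]
copy_intersection(gridB, gridA)
print(gridB)  # [0, 11, 12, 13, 0] -- ghost cells left for a later exchange
```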
24. Language Support Helps Productivity
- C/Fortran/MPI AMR
  - Chombo package from LBNL
  - Bulk-synchronous communication
    - Pack boundary data between procs
  - All optimizations done by the programmer
- Titanium AMR
  - Entirely in Titanium
  - Finer-grained communication
    - No explicit pack/unpack code
    - Automated in the runtime system
- General approach
  - Language allows programmer optimizations
  - Compiler/runtime does some automatically
Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su
25. Particle/Mesh Method: Heart Simulation
- Elastic structures in an incompressible fluid
  - Blood flow, clotting, inner ear, embryo growth, ...
- Complicated parallelization
  - Particle/mesh method, but particles are connected into materials (1D or 2D structures)
  - Communication patterns are irregular between particles (structures) and mesh (fluid)
[Figure: 2D Dirac delta function]
Code size in lines: Fortran 8000, Titanium 4000 (note: the Fortran code is not parallel)
Joint work with Ed Givelberg, Armando Solar-Lezama, Charlie Peskin, Dave McQueen
26. Titanium Experience: Composition
- Data parallelism could have been used
  - Parallel over n, rather than p
  - The compiler can generate SPMD code
  - Most code could be written as pure data parallelism (serial semantics) and translated [Su]
- Can we mix data and other parallelism?
  - Compiler analysis makes this possible
  - Barriers are restricted: all threads must reach the same barrier (proven by "single" analysis [Gay and Aiken])
  - Single analysis identifies global execution points
- Allows global optimizations (across threads)
  - Creates natural points to switch in and out of data-parallel or serial code
  - These may also be points for heterogeneous processor switches in code
Joint work with the Titanium group
27. Efficiency Layer
28. Efficiency Layer: Selective Virtualization
- The efficiency layer is an abstract machine model with selective virtualization
- Libraries provide add-ons
  - Schedulers
    - Add a runtime with dynamic scheduling for a dynamic task tree
  - Memory movement / sharing primitives
  - Synchronization primitives
    - E.g., fast barriers, atomic operations
- More on this in Krste Asanovic's talk
- The division of layers allows us to explore the execution model separately from the programming model
[Figure: general task graph with weights and structure]
29. Synthesis
- Extensive tuning knobs at the efficiency level
  - Performance feedback from hardware and OS
- Sketching: correct by construction
  - More on this in Ras Bodik's talk
  - Spec: simple implementation (3-loop 3D stencil)
  - Optimized code (tiled, prefetched, time-skewed)
- Autotuning: efficient by search
  - Examples: spectral (FFTW, SPIRAL), dense (PHiPAC, Atlas), sparse (OSKI), structured grids (stencils)
  - Can select from algorithm/data structure changes not producible by compiler transformations
30. Autotuning: 21st Century Code Generation
- Problem: generating optimal code is like searching for a needle in a haystack
  - Manycore → even more diverse
- New approach: auto-tuners
  - First generate program variations from combinations of optimizations (blocking, prefetching, ...) and data structures
  - Then compile and run them to heuristically search for the best code for that computer
- Examples: PHiPAC (BLAS), Atlas (BLAS), Spiral (DSP), FFT-W (FFT), OSKI (sparse matrices)
- Search space for block sizes (dense matrix):
  - Axes are block dimensions; temperature is speed
  - 50% more zeros can be 50% faster
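A miniature auto-tuner in the PHiPAC/ATLAS spirit can be sketched as follows (the kernel and candidate set are illustrative; real autotuners search far larger spaces of unrolling, prefetching, and data-structure choices): generate variants, run each on this machine, keep the fastest.

```python
# Sketch: search over block sizes by timing each generated variant.
import time

def blocked_sum(xs, block):
    # The "kernel" being tuned: a blocked traversal of xs.
    total = 0
    for lo in range(0, len(xs), block):
        total += sum(xs[lo:lo + block])
    return total

def autotune(xs, candidates, trials=3):
    # Exhaustive empirical search: run each variant, keep the fastest.
    best_block, best_time = None, float("inf")
    for block in candidates:
        t0 = time.perf_counter()
        for _ in range(trials):
            blocked_sum(xs, block)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_block, best_time = block, elapsed
    return best_block

xs = list(range(10000))
best = autotune(xs, candidates=[8, 64, 512, 4096])
assert blocked_sum(xs, best) == sum(xs)   # every variant computes the same result
```

The key point the slide makes is exactly this structure: correctness is fixed across variants, and the winner is chosen by measurement on the target machine, not by a compiler model.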
31. LBMHD: Structured Grid Application
- Plasma turbulence simulation
- Two distributions:
  - Momentum distribution (27 components)
  - Magnetic distribution (15 vector components)
- Three macroscopic quantities:
  - Density
  - Momentum (vector)
  - Magnetic field (vector)
- Must read 73 doubles and update (write) 79 doubles per point in space
- Requires about 1300 floating-point operations per point in space
- Just over 1.0 flops/byte (ideal)
- No temporal locality between points in space within one time step
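The arithmetic-intensity claim follows directly from the counts above: 73 doubles read plus 79 written, at 8 bytes per double, against roughly 1300 flops per lattice point.

```python
# Checking the slide's "just over 1.0 flops/byte" figure for LBMHD.
bytes_per_point = (73 + 79) * 8          # 1216 bytes moved per point
flops_per_point = 1300
intensity = flops_per_point / bytes_per_point
print(round(intensity, 2))               # ~1.07 flops/byte: memory-bound
```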
Joint work with Sam Williams, Lenny Oliker, John
Shalf, and Jonathan Carter
32. Autotuned Performance (Cell/SPE version)
- First attempt at a Cell implementation
  - VL, unrolling, and reordering fixed
  - Exploits DMA and double buffering to load vectors
  - Straight to SIMD intrinsics
- Despite the relative performance, Cell's double-precision implementation severely impairs performance
[Chart legend, optimizations applied cumulatively: Naïve + NUMA, Padding, Vectorization, Unrolling, SW Prefetching, SIMDization; plus collision() only]
33. Productivity
- Niagara2 required significantly less work to deliver good performance
- For LBMHD, Clovertown, Opteron, and Cell all required SIMD (which hampers productivity) for best performance
- Virtually every optimization was required (sooner or later) for Opteron and Cell
- Cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune)
34. PGAS Languages: Autotuning for Multicore + DMA
- PGAS languages are a good fit for shared memory machines, including multicore
  - The global address space is implemented as reads/writes
- They may also be exploited for processors with explicit local stores rather than caches, e.g., Cell, GPUs, ...
- Open question in architecture:
  - Cache-coherent shared memory, or
  - Software-controlled local memory (or a hybrid)?
[Diagram: private on-chip memories above shared off-chip DRAM]
35. Correctness
36. Ensuring Correctness
- Productivity Layer
  - Enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove the chance of concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency Layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - Mixture of verification and automated directed testing
  - Error detection on frameworks with sequential code as the specification
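"Sequential code as the specification" can be sketched concretely (function names and the random-testing loop are illustrative, not the Par Lab tooling): run the parallel version on many generated inputs and require its output to match the serial semantics.

```python
# Sketch: checking a parallel implementation against its sequential
# specification over randomly generated inputs.
import random
from concurrent.futures import ThreadPoolExecutor

def serial_spec(xs):
    # The sequential specification of the computation.
    return [x + 1 for x in xs]

def parallel_impl(xs):
    # The parallel version under test.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda x: x + 1, xs))

random.seed(0)
for _ in range(100):   # automated testing loop
    xs = [random.randint(-50, 50) for _ in range(random.randint(0, 20))]
    assert parallel_impl(xs) == serial_spec(xs), f"mismatch on {xs}"
print("all trials match the sequential spec")
```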
37. Software Correctness
- At the Productivity layer, many concurrency errors are not permitted
- At the Efficiency layer, we need more tools for correctness
  - Both concurrency errors and (eventually) numerical errors
- The traditional approach to correct software: testing
  - Low probability of finding an error; lots of manual effort
- Symbolic model checking
  - Many recent successes in security, control systems, etc.
  - Ideas from theorem proving applied to specific classes of errors
  - Can't handle libraries, complex data types, ...
- Concolic testing combines:
  - Concrete execution + symbolic analysis
  - Uses state-of-the-art theorem proving to find inputs that reach all program paths
  - Ideas applied successfully to concurrent programs
38. Why Languages at All?
- Most of the work is in runtimes and libraries
  - Do we need a language? And a compiler?
- If higher-level syntax is needed for productivity
  - We need a language
- If static analysis is needed to help with correctness
  - We need a compiler (front-end)
- If static optimizations are needed to get performance
  - We need a compiler (back-end)
- All of these decisions will be driven by application need