Compiling with multicore - PowerPoint PPT Presentation

Slides: 26

Provided by: Suig

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Compiling with multicore

Jeehyung Lee
15-745 Spring 2009

2
Papers

Automatic Thread Extraction with Decoupled
Software Pipelining
Fully automatic
Fine-grained pipelining
A Practical Approach to Exploiting Coarse-Grained
Pipeline Parallelism in C Programs
Semi-automatic
Coarse-grained pipelining

3
First paper

Automatic Thread Extraction with Decoupled
Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler and
David August
From Princeton University

4
What is the paper about?

Despite the increasing use of multiprocessors, many
single-threaded applications do not benefit
Let the compiler automatically extract threads
and exploit latent pipeline parallelism
Extract non-speculative and truly decoupled
threads through Decoupled Software
Pipelining (DSWP)

5
Why decoupled pipelining?

Example

Linked list traversal
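A minimal Python sketch of such a loop, for illustration only. The `Node` class and the helper names are assumptions, not taken from the slides. Each iteration performs a pointer-chasing load (`node.next`) that the next iteration depends on, plus some work on the current node that does not feed back into the traversal.

```python
class Node:
    """Singly linked list node (hypothetical, for illustration)."""

    def __init__(self, val, next=None):
        self.val = val
        self.next = next


def build_list(values):
    """Build a linked list holding `values` in order and return its head."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


def increment_all(head):
    """Traverse the list and add 1 to every node's value."""
    node = head
    while node is not None:
        node.val = node.val + 1  # work on the current node
        node = node.next         # pointer-chasing load: loop-carried dependence


def to_values(head):
    """Collect the node values into a Python list."""
    out = []
    node = head
    while node is not None:
        out.append(node.val)
        node = node.next
    return out
```

The `node = node.next` chain is the critical path: no iteration can start before the previous load completes, while the update of `node.val` only consumes the pointer.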

6
Why decoupled pipelining?

DOACROSS

Time per iteration: LD latency + communication latency

7
Why decoupled pipelining?

DSWP

One way pipelining
Time per iteration: LD latency

8
DSWP

The flow of data (dependences) among cores is acyclic
With inter-core queues, threads can be
decoupled
Efficiency and high tolerance for latency
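The decoupling idea can be sketched with two Python threads and a bounded queue. This is an illustration of the structure only, not the compiler's output: the names `traverse_stage`, `work_stage`, and `run_pipeline` are hypothetical, `queue.Queue` stands in for the inter-core queue, and CPython threads will not show a real speedup here because of the global interpreter lock.

```python
import queue
import threading


class Node:
    """Singly linked list node (hypothetical, for illustration)."""

    def __init__(self, val, next=None):
        self.val = val
        self.next = next


_DONE = object()  # sentinel that marks the end of the stream


def traverse_stage(head, q):
    """Stage 1: only chases pointers and pushes each node downstream."""
    node = head
    while node is not None:
        q.put(node)
        node = node.next
    q.put(_DONE)


def work_stage(q, results):
    """Stage 2: consumes nodes and does the per-node work."""
    while True:
        node = q.get()
        if node is _DONE:
            break
        results.append(node.val * 2)


def run_pipeline(head, queue_size=32):
    """Run both stages concurrently; data flows one way, stage 1 to stage 2."""
    q = queue.Queue(maxsize=queue_size)
    results = []
    t1 = threading.Thread(target=traverse_stage, args=(head, q))
    t2 = threading.Thread(target=work_stage, args=(q, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Because nothing flows back from stage 2 to stage 1, the traversal thread can run ahead until the queue fills, which is what lets the pipeline absorb communication latency.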

9
DSWP Algorithm

Build dependence graph
Find strongly connected components (SCC)
Create DAG of SCCs
Partition DAG
Split code into partitions
Add flows to partitions

10
Build dependence graph

Include every traditional dependence (data,
control, and memory) plus extensions

11
Find SCC

SCC: instructions that form a dependency cycle
in a loop
Instructions in the same SCC cannot be parallelized
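One standard way to find SCCs is Tarjan's algorithm. The slides do not say which algorithm the paper uses, so the following is a generic sketch. The example dependence graph is hypothetical: it models a list-traversal loop in which the pointer update and the loop test depend on each other, and an edge `u -> v` means that `v` depends on `u`.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. `graph` maps each node to a list of successors."""
    index = {}        # discovery order of each visited node
    lowlink = {}      # smallest index reachable from the node's subtree
    stack = []
    on_stack = set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of an SCC: pop the whole component
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(sorted(comp))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical dependence graph for a list-traversal loop.
example_graph = {
    "next": ["test", "next", "load"],          # pointer update feeds the test and itself
    "test": ["next", "load", "add", "store"],  # loop branch controls everything
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
```

In this example `next` and `test` form one SCC and must stay in the same thread; `load`, `add`, and `store` are singleton SCCs that can be moved to another stage.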

12
Create DAG of SCCs

Merge instructions within each SCC and update
dependency arrows
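A sketch of this merging step, assuming the SCCs have already been computed and are given as lists of instruction names. The function name `condense` and the example graph are hypothetical. Edges inside one SCC disappear, and edges between different SCCs become edges between the merged nodes.

```python
def condense(graph, sccs):
    """Collapse each SCC into one node and return the resulting DAG.

    `graph` maps node -> list of successors; `sccs` is a list of node lists.
    The DAG maps an SCC index to the set of SCC indices that depend on it.
    """
    comp_of = {v: i for i, comp in enumerate(sccs) for v in comp}
    dag = {i: set() for i in range(len(sccs))}
    for u, succs in graph.items():
        for v in succs:
            cu, cv = comp_of[u], comp_of[v]
            if cu != cv:  # drop edges that stay inside one SCC
                dag[cu].add(cv)
    return dag


# Hypothetical dependence graph and its SCCs.
example_graph = {
    "next": ["test", "next", "load"],
    "test": ["next", "load", "add", "store"],
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
example_sccs = [["next", "test"], ["load"], ["add"], ["store"]]
```

The result is acyclic by construction: a cycle between two merged nodes would mean they belonged to the same SCC in the first place.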

13
Partition DAG

Partition DAG nodes into n partitions
(n ≤ # of processors)
Use a heuristic to maximize load balance
Decide # of partitions (threads)
Start filling in from partition 1 with nodes from
the top of the DAG.
When the partition is full (estimated by # of
cycles), move on to the next partition
Find the best # of threads and its partitioning
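A simplified sketch of this kind of greedy heuristic. It is not the paper's exact heuristic: the "full" threshold (total cycles divided by n) and the choice of the best thread count by smallest bottleneck stage, ignoring communication cost, are assumptions made for illustration. Nodes are assumed to arrive in topological order with a cycle estimate each.

```python
def greedy_partition(topo_nodes, cycles, n):
    """Fill partitions in order; start a new one once the current one is full."""
    total = sum(cycles[v] for v in topo_nodes)
    target = total / n  # assumed "full" threshold per partition
    partitions = [[]]
    load = 0
    for v in topo_nodes:
        if load >= target and len(partitions) < n:
            partitions.append([])
            load = 0
        partitions[-1].append(v)
        load += cycles[v]
    return partitions


def best_partition(topo_nodes, cycles, max_threads):
    """Try every thread count and keep the one with the lightest bottleneck."""
    best = None
    for n in range(1, max_threads + 1):
        parts = greedy_partition(topo_nodes, cycles, n)
        bottleneck = max(sum(cycles[v] for v in p) for p in parts)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, parts)
    return best[1]


# Hypothetical cycle estimates for four DAG nodes in topological order.
example_nodes = [0, 1, 2, 3]
example_cycles = {0: 10, 1: 4, 2: 3, 3: 3}
```

Since nodes are consumed in topological order and each partition is a contiguous run, every cross-partition dependence points from an earlier partition to a later one, which keeps the flow between threads acyclic.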

14
Split code and insert flows (done!)

For each partition, insert the basic blocks
relevant to the SCC nodes it contains
Add code for the dependency flows
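To illustrate which flows need to be inserted, the sketch below lists every dependence that crosses a partition boundary. The function name and the example data are hypothetical. Each entry says that the value or branch outcome produced by an instruction must be sent through a queue from the producing partition to the consuming one.

```python
def cross_partition_flows(graph, part_of):
    """Return sorted (producer, from_partition, to_partition) triples.

    `graph` maps an instruction to the instructions that depend on it;
    `part_of` maps an instruction to its partition number.
    A producer is listed once per consuming partition, not once per consumer.
    """
    flows = set()
    for u, succs in graph.items():
        for v in succs:
            if part_of[u] != part_of[v]:
                flows.add((u, part_of[u], part_of[v]))
    return sorted(flows)


# Hypothetical dependence graph and a two-way partitioning.
example_graph = {
    "next": ["test", "next", "load"],
    "test": ["next", "load", "add", "store"],
    "load": ["add"],
    "add": ["store"],
    "store": [],
}
example_part_of = {"next": 0, "test": 0, "load": 1, "add": 1, "store": 1}
```

In this example only two flows are needed: the pointer produced by `next` and the branch outcome produced by `test`, both sent from partition 0 to partition 1.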

15
Result

19.4% speedup on important benchmark loops, 9.2%
overall
When core bandwidth is halved
Single-threaded code slows down by 17.1%
DSWP code is still slightly faster than
single-threaded code running on a full-bandwidth
core
Promising enabler for thread-level parallelism (TLP)?

16
Second Paper

A Practical Approach to Exploiting Coarse-Grained
Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar and Saman
Amarasinghe
From MIT

17
What is the paper about?

Despite the increasing use of multiprocessors, many
single-threaded applications do not benefit (repeated)
Coarse-grained pipelining is more desirable, but
is especially hard with obfuscated C code
Let people define the pipeline, and learn the practical
dependences at runtime

18
What is the paper about?

Despite the increasing use of multiprocessors, many
single-threaded applications do not benefit (repeated)
Coarse-grained pipelining is more desirable, but
is especially hard with obfuscated C code
Let people define the stages, and learn the practical
dependences at runtime for streaming
applications

19
Interface