Compiling with multicore - PowerPoint PPT Presentation

About This Presentation
Title:

Compiling with multicore

Slides: 26
Provided by: Suig
Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes


1
Compiling with multicore
  • Jeehyung Lee
  • 15-745 Spring 2009

2
Papers
  • Automatic Thread Extraction with Decoupled
    Software Pipelining
      • Fully automatic
      • Fine-grained pipelining
  • A Practical Approach to Exploring Coarse-Grained
    Pipeline Parallelism in C Programs
      • Semi-automatic
      • Coarse-grained pipelining

3
First paper
  • Automatic Thread Extraction with Decoupled
    Software Pipelining
  • Guilherme Ottoni, Ram Rangan, Adam Stoler and
    David August
  • From Princeton University

4
What is the paper about?
  • Despite the increasing use of multiprocessors,
    many single-threaded applications do not benefit
  • Let the compiler automatically extract threads
    and exploit lurking pipeline parallelism
  • Extract non-speculative and truly decoupled
    threads through Decoupled Software
    Pipelining (DSWP)

5
Why decoupled pipelining?
  • Example: linked list traversal

6
Why decoupled pipelining?
  • DOACROSS
  • Time per iteration = LD latency + communication
    latency

7
Why decoupled pipelining?
  • DSWP
  • One-way pipelining
  • Time per iteration = LD latency

8
DSWP
  • Flow of data (dependences) is acyclic among cores
  • With the use of inter-core queues, threads can be
    decoupled
  • Efficiency: high tolerance for latency
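
The decoupling idea on this slide can be sketched with two threads joined by a bounded queue. This is only an illustration of the structure, not the paper's implementation: the names `Node`, `traverse_stage`, `work_stage`, `run_pipeline`, and the `queue_depth` parameter are all assumptions, the "work" (squaring a value) is a placeholder, and Python threads do not provide true multicore parallelism.

```python
import queue
import threading


class Node:
    """Singly linked list node (hypothetical example data structure)."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def build_list(values):
    """Build a linked list holding `values` in order."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head


SENTINEL = object()  # marks the end of the stream


def traverse_stage(head, out_q):
    # Stage 1: the pointer-chasing recurrence. It only produces values
    # and never waits for anything from stage 2.
    node = head
    while node is not None:
        out_q.put(node.value)
        node = node.next
    out_q.put(SENTINEL)


def work_stage(in_q, results):
    # Stage 2: per-node work. Data flows one way, so the dependence
    # between the two stages is acyclic.
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # placeholder for real work


def run_pipeline(values, queue_depth=32):
    # The bounded queue lets stage 1 run ahead by up to `queue_depth`
    # items, which is what hides communication latency.
    q = queue.Queue(maxsize=queue_depth)
    results = []
    t1 = threading.Thread(target=traverse_stage, args=(build_list(values), q))
    t2 = threading.Thread(target=work_stage, args=(q, results))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return results
```

Calling `run_pipeline([1, 2, 3, 4])` returns the squared values in list order, since a single consumer drains a FIFO queue.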

9
DSWP Algorithm
  • Build the dependence graph
  • Find strongly connected components (SCCs)
  • Create the DAG of SCCs
  • Partition the DAG
  • Split the code into partitions
  • Add flows to the partitions

10
Build dependence graph
  • Include every traditional dependence (data,
    control, and memory), plus extensions

11
Find SCC
  • SCC: instructions that form a dependency cycle
    in a loop
  • Instructions in the same SCC cannot be
    parallelized; they must stay in one thread

12
Create DAG of SCCs
  • Merge instructions within each SCC and update
    dependency arrows
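
The merge step can be sketched as follows, assuming the SCCs are already known. The function name `condense`, the instruction names, and the example graph are illustrative assumptions, not taken from the paper.

```python
def condense(graph, sccs):
    """Collapse each SCC into one node and keep only cross-SCC edges.

    graph: dict mapping instruction -> iterable of dependent instructions
    sccs:  list of sets of instructions, one set per SCC
    Returns a dict mapping SCC index -> set of successor SCC indices.
    """
    owner = {node: i for i, comp in enumerate(sccs) for node in comp}
    dag = {i: set() for i in range(len(sccs))}
    for src, dsts in graph.items():
        for dst in dsts:
            if owner[src] != owner[dst]:
                # Edges inside an SCC disappear; edges between SCCs
                # become the "updated dependency arrows".
                dag[owner[src]].add(owner[dst])
    return dag


# Hypothetical loop body and its SCCs.
deps = {
    "advance": ["advance", "exit_test", "load"],
    "exit_test": ["advance", "load", "accumulate"],
    "load": ["accumulate"],
    "accumulate": ["accumulate"],
}
sccs = [{"advance", "exit_test"}, {"load"}, {"accumulate"}]
dag = condense(deps, sccs)
```

Because every cycle is contained inside a single SCC, the resulting graph is guaranteed to be acyclic.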

13
Partition DAG
  • Partition DAG nodes into n partitions
    (n ≤ number of processors)
  • Use a heuristic to maximize load balance
      • Decide the number of partitions (threads)
      • Start filling partition 1 with nodes from
        the top of the DAG
      • When the partition is full (estimated by
        number of cycles), move on to the next
        partition
  • Find the best number of threads and its partition
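
One way to read this heuristic is the greedy sketch below. It is an assumption-laden simplification: the "full" threshold (total cycles divided by the thread count), the scoring function (cycles in the slowest stage), and all function names are invented for illustration, and the paper's actual heuristic may weigh other costs such as communication.

```python
def fill_partitions(nodes, k):
    """Greedily split `nodes` into at most k contiguous partitions.

    nodes: list of (name, estimated_cycles) in topological order, so
    every cross-partition dependence points from an earlier partition
    to a later one and the pipeline stays acyclic.
    """
    total = sum(cycles for _, cycles in nodes)
    target = total / k  # assumed "full" threshold per partition
    parts = [[]]
    load = 0
    for name, cycles in nodes:
        # Move on once the current partition would overflow,
        # as long as there is another partition left to use.
        if parts[-1] and load + cycles > target and len(parts) < k:
            parts.append([])
            load = 0
        parts[-1].append(name)
        load += cycles
    return parts


def bottleneck(parts, cost):
    """Cycles of the slowest stage, which bounds pipeline throughput."""
    return max(sum(cost[n] for n in part) for part in parts)


def best_partition(nodes, max_threads):
    """Try every thread count up to max_threads and keep the best split."""
    cost = dict(nodes)
    best = None
    for k in range(1, max_threads + 1):
        parts = fill_partitions(nodes, k)
        score = bottleneck(parts, cost)
        if best is None or score < best[0]:  # ties keep fewer threads
            best = (score, parts)
    return best[1]


# Hypothetical SCC nodes with estimated cycle counts.
example = [("A", 4), ("B", 3), ("C", 3), ("D", 2)]
two_way = best_partition(example, 2)
four_way = best_partition(example, 4)
```

With two threads the greedy fill yields `[["A"], ["B", "C", "D"]]`, which is not the best possible balance; that is the usual trade-off of a one-pass heuristic.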

14
Split code and insert flows (done!)
  • For each partition, insert the basic blocks
    relevant to the SCC nodes it contains
  • Add code for the dependency flows

15
Result
  • 19.4% speedup on important benchmark loops, 9.2%
    overall
  • When core bandwidth is halved
      • Single-threaded code slows down by 17.1%
      • DSWP code is still slightly faster than
        single-threaded code running on a
        full-bandwidth core
  • Promising enabler for Thread-Level Parallelism
    (TLP)?

16
Second Paper
  • A Practical Approach to Exploring Coarse-Grained
    Pipeline Parallelism in C Programs
  • William Thies, Vikram Chandrasekhar and Saman
    Amarasinghe
  • From MIT

17
What is the paper about?
  • Despite the increasing use of multiprocessors,
    many single-threaded applications do not benefit
    (same motivation as the first paper)
  • Coarse-grained pipelining is more desirable, but
    is especially hard with obfuscated C code
  • Let people define the pipeline, and learn
    practical dependencies at run time

18
What is the paper about?
  • Despite the increasing use of multiprocessors,
    many single-threaded applications do not benefit
    (same motivation as the first paper)
  • Coarse-grained pipelining is more desirable, but
    is especially hard with obfuscated C code
  • Let people define the stages, and learn practical
    dependencies at run time for streaming
    applications

19
Interface
  • Add annotations in the body of the top-level loop

20
Dynamic analysis
  • The system creates a stream graph according to
    annotations.

How do they find dependencies?
21
Dynamic analysis
  • Streaming applications tend to have a fixed
    pattern of dataflow (stable flow) among pipeline
    stages

22
Dynamic analysis
  • Run the application on training examples, and
    record every relevant store-load pair across
    pipeline boundaries

This gives us practical dependencies
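
The recording step can be sketched as a pass over a memory trace. The trace format, stage names, and the function `cross_stage_dependences` are assumptions made for illustration; the slides do not describe how the real system instruments the program.

```python
def cross_stage_dependences(trace):
    """Collect store-load pairs that cross pipeline stage boundaries.

    trace: iterable of (stage, op, address) tuples in execution order,
    where op is "store" or "load". This format is hypothetical.
    Returns a set of (producer_stage, consumer_stage, address).
    """
    last_store_stage = {}  # address -> stage that last wrote it
    deps = set()
    for stage, op, addr in trace:
        if op == "store":
            last_store_stage[addr] = stage
        elif op == "load":
            producer = last_store_stage.get(addr)
            # Only loads that read another stage's store matter.
            if producer is not None and producer != stage:
                deps.add((producer, stage, addr))
    return deps


# Hypothetical trace from one training run of a three-stage pipeline.
trace = [
    ("read", "store", 0x10),
    ("read", "load", 0x10),     # same stage: ignored
    ("decode", "load", 0x10),   # read -> decode
    ("decode", "store", 0x20),
    ("write", "load", 0x20),    # decode -> write
    ("write", "load", 0x30),    # never stored in the trace: ignored
]
observed = cross_stage_dependences(trace)
```

The result only reflects what the training inputs exercised, so a dependence that never occurs during training would be missed.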
23
Interface
  • The program shows a complete stream graph
  • The user decides whether they like this
    pipelining or not
      • If yes, done!
      • Else, redo the annotations and iterate until
        satisfied

24
Actual pipelining
  • When compiled, the annotation macros emit code
    that forks the original program for each
    pipeline stage

25
Result
  • Average 2.78x speedup, max 3.89x on 4-core
  • Seems unsound but practical (?)