Title: Compiler-directed Data Partitioning for Multicluster Processors
1. Compiler-directed Data Partitioning for Multicluster Processors
- Michael Chu and Scott Mahlke
- Advanced Computer Architecture Lab
- University of Michigan
- March 28, 2006
2. Multicluster Architectures
- Addresses the register file bottleneck
- Decentralizes architecture
- Compilation focuses on partitioning operations
- Most previous work assumes a unified memory
[Figure: clustered datapath with a per-cluster Register File and Data Memory]
3. Problem: Partitioning of Data
[Figure: candidate data objects int x[100], struct foo, int y[100]]
- Determine object placement into data memories
- Limited by:
  - Memory sizes/capacities
  - Computation operations related to the data
- Partitioning relevant to caches and scratchpad memories
4. Architectural Model
- This work focuses on the use of scratchpad-like static local memories
- Each cluster has one local memory
- Each object placed in one specific memory
- Data object available in that memory throughout the lifetime of the program
5. Data Unaware Partitioning
- Lose an average of 30% performance by ignoring data
6. Our Objective
- Goal: Produce efficient code
- Strategy:
  - Partition both data objects and computation operations
  - Balance memory size across clusters
  - Improve memory bandwidth
  - Maximize parallelism
[Figure: objects int y[100], int x[100], struct foo assigned across cluster memories]
7. First Try: Greedy Approach

| Scheme | Data Partition | Computation Partition |
| Data Unaware | None, profile-based placement | Region-view |
| Greedy | Region-view, greedy, profile-based | Region-view |

- Computation-centric partition of data
- Place data where the computation references it most often
- Greedy approach:
  - Pass 1: Region-view computation partition, greedy data cluster assignment
  - Pass 2: Region-view computation repartition with full knowledge of data location
8. Greedy Approach Results
- 2 clusters
- One Integer, Float, Memory, and Branch unit per cluster
- Relative to a unified, dual-ported memory
- Improvement over Data Unaware, but still room for improvement
9. Second Try: Global Data Partition
- Data-centric partition of computation
- Hierarchical technique:
  - Pass 1: Global-view for data
    - Consider memory relationships throughout the program
    - Lock memory operations to clusters
  - Pass 2: Region-view for computation
    - Partition computation based on data location
10. Pass 1: Global Data Partitioning
- Determine memory relationships
  - Pointer analysis and profiling of memory
- Build a program-level graph representation of all operations
- Perform data object / memory operation merging
- Respect correctness constraints of the program
11. Global Data Graph Representation
- Nodes: operations, either memory or non-memory
  - Memory operations: loads, stores, malloc callsites
- Edges: data flow between operations
- Node weight: data object size
  - Sum of data sizes for referenced objects
- Object size determined by:
  - Globals/locals: pointer analysis
  - Malloc callsites: memory profile
[Figure: graph nodes annotated with objects int x[100], struct foo, malloc site 1]
12. Global Data Partitioning Example
[Figure: BB1 references 2 objects (80 Kb); BB2 references 2 objects (200 Kb) and 1 object (100 Kb)]
13. Pass 2: Computation Partitioning
- Observation: a global-level data partition is only half the answer
  - Doesn't account for operation resource usage
  - Doesn't consider code scheduling regions
- Second pass of partitioning on each scheduling region
  - Memory operations from the first phase are locked in place
14. Experimental Methodology
- 2 clusters, one Integer, Float, Memory, and Branch unit per cluster
- All results relative to a unified, dual-ported memory
- Schemes compared:

| Scheme | Data Partitioning | Computation Partition |
| Global | Global-view, data-centric | Knows data location |
| Greedy | Region-view, greedy, computation-centric | Knows data location |
| Data Unaware | None, assume unified memory | Assumes unified memory |
| Unified Memory | N/A | Unified memory |
15. Performance: 1-cycle Remote Access
[Chart: performance relative to Unified Memory]
16. Performance: 10-cycle Remote Access
[Chart: performance relative to Unified Memory]
17. Case Study: rawcaudio
[Figure: rawcaudio data placement under Greedy Profile-based vs. Global Data Partition]
18. Summary
- Global Data Partitioning
  - Data placement as a first-order design principle
  - Global, data-centric partition of computation
- Phase-ordered approach:
  - Global-view for decisions on data
  - Region-view for decisions on computation
- Achieves 96% of the performance of a unified memory on partitioned memories
- Future work: apply to cache memories
19. Data Partitioning for Multicores
- Adapt global data partitioning for the cache memory domain
- Similar goals:
  - Increase data bandwidth
  - Maximize parallel computation
- Different goals:
  - Reduce coherence traffic
  - Keep the working set within the cache size
20. Questions?
- http://cccp.eecs.umich.edu
21. Backup
22. Future Work: Cache Memories
- Adapt global data partitioning for the cache memory domain
- Similar goals:
  - Increase data bandwidth
  - Maximize parallel computation
- Different goals:
  - Reduce coherence traffic
  - Balance the working set
23. Memory Operation Merging
- Interprocedural pointer analysis determines memory relationships

int x;
int foo[100];
int bar[100];

void main() {
  int *a = malloc(...);
  int *b, *c;
  if (cond) {
    c = &foo[1];
    b = a;
  } else {
    c = &bar[1];
    b = &bar[1];
  }
  *b = 100;
  foo[0] = *c;
}

[Graph nodes: malloc, load bar, load foo, store (malloc or bar), store foo]
24. Multicluster Compilation
- Previous techniques focused on operation partitioning [cite some papers]
- Ignores the issue of data object placement in memory
- Assumes shared memory accessible from each cluster
25. Phase 2: Computation Partitioning
- Observation: a global-level data partition is only half the solution
  - Doesn't properly account for resource usage details
  - Doesn't consider code scheduling regions
- Second pass of partitioning is done locally on each basic block of the program
  - Memory operations locked into specific clusters
- Uses the Region-based Hierarchical Operation Partitioner (RHOP)
26. Computation Partitioning Example
- Memory operations from the first phase are locked in place
- RHOP performs a detailed, resource-cognizant computation partition
  - Modified multi-level Kernighan-Lin algorithm using schedule estimates
[Figure: BB1 with two loads (L) and a store (S) locked to clusters]