Title: Compiler-directed Data Partitioning for Multicluster Processors
1. Compiler-directed Data Partitioning for Multicluster Processors
- Michael Chu and Scott Mahlke
- Advanced Computer Architecture Lab
- University of Michigan
- March 28, 2006
2. Multicluster Architectures
- Addresses the register file bottleneck
- Decentralizes architecture
- Compilation focuses on partitioning operations
- Most previous work assumes a unified memory
[Figure: clustered datapath with a per-cluster Register File and Data Memory]
3. Problem: Partitioning of Data
[Figure: candidate data objects int x[100], struct foo, int y[100]]
- Determine object placement into data memories
- Limited by:
  - Memory sizes/capacities
  - Computation operations related to the data
- Partitioning relevant to caches and scratchpad memories
4. Architectural Model
- This work focuses on the use of scratchpad-like static local memories
- Each cluster has one local memory
- Each object placed in one specific memory
- Data object available in that memory throughout the lifetime of the program
5. Data Unaware Partitioning
- Lose an average of 30% performance by ignoring data
6. Our Objective
- Goal: Produce efficient code
- Strategy:
  - Partition both data objects and computation operations
  - Balance memory size across clusters
  - Improve memory bandwidth
  - Maximize parallelism
[Figure: objects int y[100], int x[100], struct foo assigned across cluster memories]
7. First Try: Greedy Approach

| Scheme | Data Partition | Computation Partition |
| Data Unaware | None, profile-based placement | Region-view |
| Greedy | Region-view, greedy, profile-based | Region-view |

- Computation-centric partition of data
- Place data where the computation references it most often
- Greedy approach:
  - Pass 1: Region-view computation partition, greedy data cluster assignment
  - Pass 2: Region-view computation repartition with full knowledge of data location
8. Greedy Approach Results
- 2 clusters
- One Integer, Float, Memory, and Branch unit per cluster
- Relative to a unified, dual-ported memory
- Improvement over Data Unaware, but still room for improvement
9. Second Try: Global Data Partition
- Data-centric partition of computation
- Hierarchical technique:
  - Pass 1: Global-view for data
    - Consider memory relationships throughout the program
    - Lock memory operations to clusters
  - Pass 2: Region-view for computation
    - Partition computation based on data location
10. Pass 1: Global Data Partitioning
- Determine memory relationships
  - Pointer analysis and profiling of memory
- Build a program-level graph representation of all operations
- Perform data object / memory operation merging
- Respect correctness constraints of the program
11. Global Data Graph Representation
- Nodes: operations, either memory or non-memory
  - Memory operations: loads, stores, malloc callsites
- Edges: data flow between operations
- Node weight: data object size
  - Sum of data sizes for referenced objects
- Object size determined by:
  - Globals/locals: pointer analysis
  - Malloc callsites: memory profile
[Figure: graph nodes annotated with objects int x[100], struct foo, malloc site 1]
12. Global Data Partitioning Example
[Figure: BB1 references 2 objects (80 Kb); BB2 references 2 objects (200 Kb) and 1 object (100 Kb)]
13. Pass 2: Computation Partitioning
- Observation: a global-level data partition is only half the answer
  - Doesn't account for operation resource usage
  - Doesn't consider code scheduling regions
- Second pass of partitioning on each scheduling region
  - Memory operations from the first phase are locked in place
14. Experimental Methodology
- 2 clusters, one Integer, Float, Memory, and Branch unit per cluster
- All results relative to a unified, dual-ported memory
- Schemes compared:

| Scheme | Data Partitioning | Computation Partition |
| Global | Global-view, data-centric | Knows data location |
| Greedy | Region-view, greedy, computation-centric | Knows data location |
| Data Unaware | None, assume unified memory | Assumes unified memory |
| Unified Memory | N/A | Unified memory |
15. Performance: 1-cycle Remote Access
[Chart: performance relative to Unified Memory]
16. Performance: 10-cycle Remote Access
[Chart: performance relative to Unified Memory]
17. Case Study: rawcaudio
[Figure: rawcaudio data placement under Greedy Profile-based vs. Global Data Partition]
18. Summary
- Global Data Partitioning
  - Data placement as a first-order design principle
  - Global, data-centric partition of computation
- Phase-ordered approach:
  - Global-view for decisions on data
  - Region-view for decisions on computation
- Achieves 96% of the performance of a unified memory on partitioned memories
- Future work: apply to cache memories
19. Data Partitioning for Multicores
- Adapt global data partitioning for the cache memory domain
- Similar goals:
  - Increase data bandwidth
  - Maximize parallel computation
- Different goals:
  - Reduce coherence traffic
  - Keep the working set within the cache size
20. Questions?
- http://cccp.eecs.umich.edu
21. Backup
22. Future Work: Cache Memories
- Adapt global data partitioning for the cache memory domain
- Similar goals:
  - Increase data bandwidth
  - Maximize parallel computation
- Different goals:
  - Reduce coherence traffic
  - Balance the working set
23. Memory Operation Merging
- Interprocedural pointer analysis determines memory relationships

int x;
int foo[100];
int bar[100];

void main() {
  int *a = malloc(...);
  int *b, *c;
  if (cond) {
    c = &foo[1];
    b = a;
  } else {
    c = &bar[1];
    b = &bar[1];
  }
  *b = 100;
  foo[0] = *c;
}

[Graph nodes: malloc, load bar, load foo, store (malloc or bar), store foo]
24. Multicluster Compilation
- Previous techniques focused on operation partitioning [cite some papers]
- Ignores the issue of data object placement in memory
- Assumes shared memory accessible from each cluster
25. Phase 2: Computation Partitioning
- Observation: a global-level data partition is only half the solution
  - Doesn't properly account for resource usage details
  - Doesn't consider code scheduling regions
- Second pass of partitioning is done locally on each basic block of the program
  - Memory operations locked into specific clusters
- Uses the Region-based Hierarchical Operation Partitioner (RHOP)
26. Computation Partitioning Example
- Memory operations from the first phase are locked in place
- RHOP performs a detailed, resource-cognizant computation partition
  - Modified multi-level Kernighan-Lin algorithm using schedule estimates
[Figure: BB1 with two loads (L) and a store (S) locked to clusters]