Title: Generalized Data Transfers At Memory Bandwidth
1Generalized Data Transfers At Memory Bandwidth
- Peter A. Dinda, David R. O'Hallaron
- Carnegie Mellon University
- http://www.cs.cmu.edu/~pdinda
- http://www.cs.cmu.edu/~droh
2Generalized Data Transfers
[Figure: data items at addresses A, B, C in sending node memory are transferred to addresses D, E, F in receiving node memory]
3Address Relations
[Figure: sending node memory holds data items at addresses A, B, C; receiving node memory has addresses D, E, F]
Address relation: (A,F),(B,D),(C,E)
R(x,y): data item at address x on sender is copied to address y on receiver
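The definition of R(x,y) above can be sketched in C as a list of pairs that is walked once per transfer. This is a minimal illustration, not the authors' library: the names `addr_pair` and `apply_relation` are hypothetical, array indices stand in for addresses, and both memories are plain local arrays.

```c
#include <assert.h>
#include <stddef.h>

/* One element of an address relation R(x, y): the item at sender
   index x is copied to receiver index y. Indices stand in for
   addresses; the type name is illustrative, not from the slides. */
typedef struct {
    size_t x;  /* position in sender memory */
    size_t y;  /* position in receiver memory */
} addr_pair;

/* Apply every pair of the relation: recv[y] = send[x]. */
void apply_relation(const addr_pair *rel, size_t n,
                    const double *send, double *recv)
{
    assert(rel != NULL && send != NULL && recv != NULL);
    for (size_t i = 0; i < n; i++)
        recv[rel[i].y] = send[rel[i].x];
}
```

With A, B, C at sender indices 0, 1, 2 and D, E, F at receiver indices 0, 1, 2, the relation (A,F),(B,D),(C,E) becomes the pairs (0,2), (1,0), (2,1).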
4Send/Recv Implementation
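[Figure: Message Assembly gathers the items at A, B, C from sending node memory into the message contents; Data Transfer moves the message; Message Disassembly scatters the contents into D, E, F in receiving node memory according to (A,F), (B,D), (C,E)]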
(also put and get communication models)
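A send/recv implementation splits the copy into a gather on the sender and a scatter on the receiver, with a contiguous message in between. The sketch below assumes both sides hold the same relation in the same order; `assemble`, `disassemble`, and `addr_pair` are hypothetical names, and the network transfer itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pair type: x indexes sender memory, y receiver memory. */
typedef struct { size_t x, y; } addr_pair;

/* Sender side (message assembly): gather the items named by the x
   halves into a contiguous message buffer, in relation order. */
void assemble(const addr_pair *rel, size_t n,
              const double *send_mem, double *msg)
{
    assert(rel != NULL && send_mem != NULL && msg != NULL);
    for (size_t i = 0; i < n; i++)
        msg[i] = send_mem[rel[i].x];
}

/* Receiver side (message disassembly): scatter message item i to
   the y half of pair i. */
void disassemble(const addr_pair *rel, size_t n,
                 const double *msg, double *recv_mem)
{
    assert(rel != NULL && msg != NULL && recv_mem != NULL);
    for (size_t i = 0; i < n; i++)
        recv_mem[rel[i].y] = msg[i];
}
```

Only the x halves are needed on the sender and only the y halves on the receiver, so each node could store just its own half of the relation.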
5Storing Address Relations
Compute Address Relation - Inspector (Done Once)
  while not done
    compute_address_pair(x,y)
    store_address_pair(x,y)
  end while
Assemble Message - Executor (Repeated Many Times)
  while not done
    get_address_pair(x,y)
    buffer[i] = data[x]
  end while
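The inspector and executor loops above could look like the following in C. This is a sketch under stated assumptions: the relation is a local n-by-n transpose in row-major storage, chosen only because it is easy to check, and `stored_relation`, `inspect_transpose`, `execute_assemble`, and `release_relation` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { size_t x, y; } addr_pair;

/* The stored address relation produced by the inspector. */
typedef struct {
    addr_pair *pairs;
    size_t     count;
} stored_relation;

/* Inspector: run once. Computes and stores the pairs for
   B = TRANSPOSE(A) on an n-by-n row-major array.
   Returns 0 on success, -1 if allocation fails. */
int inspect_transpose(size_t n, stored_relation *out)
{
    assert(out != NULL && n > 0);
    out->count = 0;
    out->pairs = malloc(n * n * sizeof *out->pairs);
    if (out->pairs == NULL)
        return -1;
    for (size_t r = 0; r < n; r++)
        for (size_t c = 0; c < n; c++) {
            addr_pair p = { r * n + c, c * n + r };
            out->pairs[out->count++] = p;
        }
    return 0;
}

/* Executor: run for every transfer. Replays the stored pairs to
   assemble the message; no address computation happens here. */
void execute_assemble(const stored_relation *rel,
                      const double *data, double *buffer)
{
    assert(rel != NULL && data != NULL && buffer != NULL);
    for (size_t i = 0; i < rel->count; i++)
        buffer[i] = data[rel->pairs[i].x];
}

/* Free the stored relation when no more transfers are needed. */
void release_relation(stored_relation *rel)
{
    assert(rel != NULL);
    free(rel->pairs);
    rel->pairs = NULL;
    rel->count = 0;
}
```

The cost of `inspect_transpose` is paid once, while `execute_assemble` is the loop whose speed matters on every iteration.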
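6Inspector/Executor (Saltz, et al.)
do i=1,1000
  call Work()
  call COPY()
  call Work()
enddo
[Figure: timelines for iterations i=1, i=2, i=3 comparing In-line Computation with Inspector/Executor. With Inspector/Executor, the Inspector runs once at i=1 and only the Executor runs in each iteration.]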
7Context: Array Assignments
B=A
Array A
Array B
Abstraction
We concentrate on B=A and B=TRANSPOSE(A). More general forms exist.
8Distributed Arrays
Regular block-cyclic distributions, as in High Performance Fortran (HPF)
(*,CYCLIC)
(*,BLOCK)
(*,CYCLIC(k))
[Figure columns: Distribution; Elements Processor 0 Owns; Local Array on Processor 0]
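For one distributed dimension, the usual HPF-style block-cyclic mapping can be written as two small index functions. This is a sketch of the standard CYCLIC(k) arithmetic, not code from the slides; `bc_owner`, `bc_local_index`, and `block_size` are hypothetical names, and indices are zero-based.

```c
#include <assert.h>
#include <stddef.h>

/* Owner of global index i under CYCLIC(k) over nprocs processors:
   blocks of k elements are dealt out round-robin. */
size_t bc_owner(size_t i, size_t k, size_t nprocs)
{
    assert(k > 0 && nprocs > 0);
    return (i / k) % nprocs;
}

/* Position of global index i within its owner's local array. */
size_t bc_local_index(size_t i, size_t k, size_t nprocs)
{
    assert(k > 0 && nprocs > 0);
    return (i / (k * nprocs)) * k + i % k;
}

/* BLOCK over n elements behaves like CYCLIC(k) with
   k = ceil(n / nprocs); CYCLIC is CYCLIC(1). */
size_t block_size(size_t n, size_t nprocs)
{
    assert(nprocs > 0);
    return (n + nprocs - 1) / nprocs;
}
```

An address relation for an array assignment follows by pairing the sender's local index of each element with the receiver's local index of the same element under the destination distribution.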
9Representative Assignments
(CYCLIC,*)
(BLOCK,*)
(BLOCK,*)
(*,BLOCK)
(*,CYCLIC)
(BLOCK,*)
(CYCLIC,*)
(CYCLIC,*)
Data Transpose
10Representing Address Relations
- General Purpose
- Space Efficiency
- Hardware Limited Performance
- In-line expansion
11AAPAIR: Simple Representation
[Figure: sending node memory (A, B, C) and receiving node memory (D, E, F), with the relation stored as the pointer pairs (A,F), (B,D), (C,E)]
(A,F),(B,D),(C,E)
Simple sequence of pointer pairs
PROBLEM: Space Efficiency
PROBLEM: Performance
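12AABLK: Run-length Encoding
[Figure: two-item blocks starting at A, B, C in sending node memory map to blocks starting at F, D, E in receiving node memory; each block is stored with length 2]
(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)
Sequence of (pointer, pointer, length) triples
PROBLEM: Strided Access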
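Run-length encoding of pointer pairs into (pointer, pointer, length) triples can be sketched as below. The type and function names (`addr_block`, `encode_blocks`, `copy_blocks`) are hypothetical, indices stand in for pointers, and the encoder is a simple greedy merge of consecutive pairs rather than the library's actual algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct { size_t x, y; } addr_pair;
typedef struct { size_t x, y, len; } addr_block;  /* (pointer, pointer, length) */

/* Compress a pair sequence: consecutive pairs (x,y), (x+1,y+1), ...
   collapse into one block. Returns the number of blocks written;
   blocks must have room for n entries in the worst case. */
size_t encode_blocks(const addr_pair *pairs, size_t n, addr_block *blocks)
{
    size_t nb = 0;
    assert(pairs != NULL && blocks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (nb > 0 &&
            pairs[i].x == blocks[nb - 1].x + blocks[nb - 1].len &&
            pairs[i].y == blocks[nb - 1].y + blocks[nb - 1].len) {
            blocks[nb - 1].len++;          /* extends the current run */
        } else {
            addr_block b = { pairs[i].x, pairs[i].y, 1 };
            blocks[nb++] = b;              /* starts a new run */
        }
    }
    return nb;
}

/* Copy each block with one contiguous memcpy. */
void copy_blocks(const addr_block *blocks, size_t nb,
                 const double *send, double *recv)
{
    assert(blocks != NULL && send != NULL && recv != NULL);
    for (size_t i = 0; i < nb; i++)
        memcpy(recv + blocks[i].y, send + blocks[i].x,
               blocks[i].len * sizeof(double));
}
```

Six pairs that form three runs of length two shrink to three triples, but a stride-two access pattern would still produce one triple per element, which is the strided-access problem noted above.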
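13DMRLE: Handling Strides
[Figure: items at A, B, C in sending node memory, spaced g apart, map to F, E, D in receiving node memory, spaced h apart]
(A,F),(B,E),(C,D) with B-A = C-B = g and E-F = D-E = h
Sequence of (offset, offset, length) triples
PROBLEM: Repeated Strides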
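One plausible reading of the (offset, offset, length) triples is a pair of cursors that each triple advances `len` times, copying one item per step. The slides do not spell out the decoding rule, so the semantics below are an assumption, as are the names `dmrle_triple` and `dmrle_copy`; both cursors are assumed to start at zero.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed meaning of one triple: repeat len times
   { sender cursor += dx; receiver cursor += dy; copy one item }. */
typedef struct {
    ptrdiff_t dx;   /* step added to the sender cursor */
    ptrdiff_t dy;   /* step added to the receiver cursor */
    size_t    len;  /* how many times to step and copy */
} dmrle_triple;

/* Decode the triples and copy in a single pass. */
void dmrle_copy(const dmrle_triple *t, size_t nt,
                const double *send, double *recv)
{
    ptrdiff_t x = 0, y = 0;
    assert(t != NULL && send != NULL && recv != NULL);
    for (size_t i = 0; i < nt; i++)
        for (size_t j = 0; j < t[i].len; j++) {
            x += t[i].dx;
            y += t[i].dy;
            assert(x >= 0 && y >= 0);  /* cursors must stay in range */
            recv[y] = send[x];
        }
}
```

Under this reading the relation (A,F),(B,E),(C,D) needs only two triples: (A, F, 1) to position the cursors and (g, h, 2) for the strided remainder.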
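14DMRLEC: Repeated Strides
[Figure: two strided groups of three items in sending and receiving node memory, encoded as a table of triples indexed 0 to 2 and a sequence of indices into that table]
(A,F),(B,E),(C,D), (A',F'),(B',E'),(C',D') with B-A = C-B = B'-A' = C'-B' = g, E-F = D-E = E'-F' = D'-E' = h, A'-C = u, and F'-D = v
Sequence of indices into a table of (offset, offset, length) triples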
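When the same strides recur, each distinct triple can be stored once in a table and the relation becomes a sequence of small table indices. The sketch below assumes the same cursor-stepping semantics for a triple as a plain offset encoding would use (advance both cursors, copy one item, repeat `len` times); that rule and the names `dmrle_triple` and `dmrlec_copy` are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    ptrdiff_t dx;   /* step added to the sender cursor */
    ptrdiff_t dy;   /* step added to the receiver cursor */
    size_t    len;  /* how many times to step and copy */
} dmrle_triple;

/* Walk a sequence of one-byte indices into a table of triples,
   applying each referenced triple to the running cursors. */
void dmrlec_copy(const dmrle_triple *table, size_t table_len,
                 const unsigned char *seq, size_t nseq,
                 const double *send, double *recv)
{
    ptrdiff_t x = 0, y = 0;
    assert(table != NULL && seq != NULL && send != NULL && recv != NULL);
    for (size_t i = 0; i < nseq; i++) {
        assert(seq[i] < table_len);
        const dmrle_triple *t = &table[seq[i]];
        for (size_t j = 0; j < t->len; j++) {
            x += t->dx;
            y += t->dy;
            assert(x >= 0 && y >= 0);  /* cursors must stay in range */
            recv[y] = send[x];
        }
    }
}
```

For a strided run with steps (g, h) that repeats after a jump of (u, v), the table holds three triples and the sequence 0, 1, 2, 1 reuses entry 1 for both runs, so longer repetitions add one byte each instead of one triple each.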
15Address Relation Storage Costs
16Copying: Superscalar Plateau
[Figure: issue timeline of a copy loop. Loads and stores issue at time t and are separated by stalls; the stalls leave free issue slots, with dimensions labeled n and p.]
Plateau = n*p = 2*3 = 6
Plateau: maximum number of non load/store instructions before copy bandwidth suffers
17Paragon: No Superscalar Plateau
18Pentium 90: Clear Plateau
19DEC 3K/400a: Complex Plateau
20Measurement Details
- Portable library written in C
- Four representative assignments
- 512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
- Six machines
- Assembly and disassembly rates
21Measurement Testcases
(CYCLIC,*)
(BLOCK,*)
(BLOCK,*)
(*,BLOCK)
(*,CYCLIC)
(BLOCK,*)
(CYCLIC,*)
(CYCLIC,*)
Data Transpose
22Performance: DEC 3K/400a
23Performance: IBM 250 (PPC601)
24Performance: IBM SP2 (PWR2)
25Performance: Paragon
26Performance: Pentium 90
27Performance: Pentium 133
28Conclusions
- Exploit Superscalar Plateau using compact address relation encodings
  - Cheap enough even for scalar machines
- Generalized data transfer with hardware-limited throughput
  - Many possible applications
29Copying with Address Relations
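[Figure: a Copy Engine moves Data Items from sender to receiver, driven by Sender Data Addresses and Receiver Data Addresses. An Address Relation Decoder produces those addresses; it issues Address Relation Addresses and reads back Address Relation Data.]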
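The split between copy engine and address relation decoder can be mimicked in software with a callback interface: the engine only moves data, and the decoder alone understands the encoding. Everything named here (`decoder_fn`, `copy_engine`, `stride_state`, `stride_next`) is hypothetical, and the single strided run is just one example decoder.

```c
#include <assert.h>
#include <stddef.h>

/* Decoder interface: yields the next (sender, receiver) address pair.
   Returns 1 and fills *x and *y, or returns 0 when exhausted. */
typedef int (*decoder_fn)(void *state, size_t *x, size_t *y);

/* Copy engine: knows nothing about the encoding, only how to move
   one item per decoded pair. Returns the number of items copied. */
size_t copy_engine(decoder_fn next, void *state,
                   const double *send, double *recv)
{
    size_t x = 0, y = 0, copied = 0;
    assert(next != NULL && send != NULL && recv != NULL);
    while (next(state, &x, &y)) {
        recv[y] = send[x];
        copied++;
    }
    return copied;
}

/* Example decoder state: one strided run of `remaining` items. */
typedef struct {
    ptrdiff_t x, y;       /* next addresses to emit */
    ptrdiff_t dx, dy;     /* strides on sender and receiver */
    size_t    remaining;  /* items left in the run */
} stride_state;

/* Example decoder: emits the current pair, then steps both cursors. */
int stride_next(void *state, size_t *x, size_t *y)
{
    stride_state *s = state;
    if (s->remaining == 0)
        return 0;
    assert(s->x >= 0 && s->y >= 0);
    *x = (size_t)s->x;
    *y = (size_t)s->y;
    s->x += s->dx;
    s->y += s->dy;
    s->remaining--;
    return 1;
}
```

Swapping in a different decoder changes the encoding without touching the copy loop, though a function call per item costs more than an in-line decode would.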
30A Simple Copy Engine
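[Figure: sender and receiver each pair a Copy Engine with a Decoder. Data flows from the sender's Copy Engine through the Comm. System to the receiver's Copy Engine. The sender's Decoder supplies Sender Data Adx and the receiver's Decoder supplies Receiver Data Adx; each Decoder issues Address Relation Addresses and reads Address Relation Data.]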