Generalized Data Transfers At Memory Bandwidth

Compute Address Relation (Inspector); Assemble Message (Executor). SIGMETRICS 96. Exploit the Superscalar Plateau using compact address relation encodings.

1
Generalized Data Transfers At Memory Bandwidth

Peter A. Dinda, David R. O'Hallaron
Carnegie Mellon University
http://www.cs.cmu.edu/~pdinda
http://www.cs.cmu.edu/~droh

2
Generalized Data Transfers
[Figure: data items A, B, C in Sending Node Memory and locations D, E, F in Receiving Node Memory.]
3
Address Relations
[Figure: Sending Node Memory holds A, B, C; Receiving Node Memory holds D, E, F.]
R = {(A,F), (B,D), (C,E)}
R(x,y): the data item at address x on the sender is copied to address y on the receiver
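The definition above can be sketched in a few lines. This is an illustration only: memories are modeled as Python lists, addresses as list indices, and the name copy_by_relation is hypothetical, not taken from the talk.

```python
def copy_by_relation(relation, sender_mem, receiver_mem):
    """For every pair (x, y) in the relation, copy sender_mem[x] to receiver_mem[y]."""
    for x, y in relation:
        receiver_mem[y] = sender_mem[x]

# Assumed addresses: A, B, C are indices 0, 1, 2 on the sender;
# D, E, F are indices 0, 1, 2 on the receiver.
A, B, C = 0, 1, 2
D, E, F = 0, 1, 2
relation = [(A, F), (B, D), (C, E)]

sender = ["a", "b", "c"]
receiver = [None, None, None]
copy_by_relation(relation, sender, receiver)
print(receiver)  # ['b', 'c', 'a']
```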
4
Send/Recv Implementation
[Figure: Message Assembly gathers A, B, C from Sending Node Memory into the Message Contents; Data Transfer moves the message; Message Disassembly places the items into Receiving Node Memory according to (A,F), (B,D), (C,E).]
(also put and get communication models)
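A minimal sketch of the three stages named on this slide, assuming memories are Python lists and the data transfer is simply handing the message list from one function to the other. The function names assemble and disassemble are hypothetical.

```python
def assemble(relation, sender_mem):
    """Sender side: gather items into a contiguous message, in relation order."""
    return [sender_mem[x] for x, _ in relation]

def disassemble(relation, message, receiver_mem):
    """Receiver side: store the i-th message item at the i-th receiver address."""
    for item, (_, y) in zip(message, relation):
        receiver_mem[y] = item

# (A,F), (B,D), (C,E) with A, B, C and D, E, F taken as indices 0, 1, 2 (an assumption).
relation = [(0, 2), (1, 0), (2, 1)]
sender = [10, 20, 30]
receiver = [0, 0, 0]

message = assemble(relation, sender)       # message contents
disassemble(relation, message, receiver)   # after the (simulated) data transfer
print(message, receiver)  # [10, 20, 30] [20, 30, 10]
```

Note that the sender only reads the x half of each pair and the receiver only reads the y half, so each side needs only its own half of the relation.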
5
Storing Address Relations
Compute Address Relation - Inspector (Done Once)
  while not done
    compute_address_pair(x,y)
    store_address_pair(x,y)
  end while
Assemble Message - Executor (Repeated Many Times)
  while not done
    get_address_pair(x,y)
    buffer[i] = data[x]
  end while
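The two loops above might look as follows in runnable form. This is a sketch under stated assumptions: the address computation (every second element of the sender's data) is a made-up stand-in for compute_address_pair, and the stored relation is a plain list of pairs.

```python
def inspector(n):
    """Done once: compute and store every address pair.
    Hypothetical mapping: sender element 2*i goes to receiver slot i."""
    stored = []
    for x in range(0, n, 2):
        y = x // 2             # stand-in for compute_address_pair(x, y)
        stored.append((x, y))  # store_address_pair(x, y)
    return stored

def executor(stored, data):
    """Repeated many times: walk the stored pairs and fill the message buffer."""
    buffer = [None] * len(stored)
    for i, (x, _y) in enumerate(stored):  # get_address_pair(x, y)
        buffer[i] = data[x]
    return buffer

pairs = inspector(6)                   # computed once
for step in range(3):                  # reused for every message
    data = [step * 100 + k for k in range(6)]
    print(executor(pairs, data))
```

The cost of inspector is paid once, so it can be amortized over every later call to executor that uses the same relation.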
6
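Inspector/Executor (Saltz, et al.)
  do i=1,1000
    call Work()
    call COPY()
    call Work()
  enddo
[Figure: timelines for iterations i=1, i=2, i=3 under two schemes. In-line Computation: the address computation is part of every iteration. Inspector/Executor: the Inspector runs once at i=1, and only the Executor runs in each iteration.]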
7
Context: Array Assignments
B=A
[Figure: Array A, Array B, and the Abstraction.]
We concentrate on B=A and B=TRANSPOSE(A). More general forms exist.
8
Distributed Arrays
Regular block-cyclic distributions, as in High Performance Fortran (HPF)
(*,CYCLIC)
(*,BLOCK)
(*,CYCLIC(k))
[Figure columns: Distribution; Elements Processor 0 Owns; Local Array on Processor 0]
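To make the three distributions concrete, here is a one-dimensional sketch of the usual HPF-style ownership rules, with 0-based indices. The function names and the example sizes (8 elements, 2 processors) are assumptions for illustration; the slide's figure itself is not reproduced here.

```python
def block(i, N, P):
    """(owner, local index) of global index i under BLOCK: N elements, P processors."""
    b = -(-N // P)          # block size = ceil(N / P)
    return i // b, i % b

def cyclic_k(i, k, P):
    """(owner, local index) of global index i under CYCLIC(k); CYCLIC is k = 1."""
    owner = (i // k) % P
    local = (i // (k * P)) * k + i % k
    return owner, local

N, P = 8, 2
owned_block   = [i for i in range(N) if block(i, N, P)[0] == 0]
owned_cyclic  = [i for i in range(N) if cyclic_k(i, 1, P)[0] == 0]
owned_cyclic2 = [i for i in range(N) if cyclic_k(i, 2, P)[0] == 0]
print(owned_block)    # [0, 1, 2, 3]
print(owned_cyclic)   # [0, 2, 4, 6]
print(owned_cyclic2)  # [0, 1, 4, 5]
```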
9
Representative Assignments
(CYCLIC,*)
(BLOCK,*)
(BLOCK,*)
(*,BLOCK)
(*,CYCLIC)
(BLOCK,*)
(CYCLIC,*)
(CYCLIC,*)
Data Transpose
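As a rough illustration of where an address relation comes from in such an assignment, the sketch below redistributes a one-dimensional array from BLOCK to CYCLIC. The slide's cases are two-dimensional; this simplification, the function names, and the sizes are assumptions, not the authors' test cases.

```python
def block_map(i, N, P):
    """(owner, local index) under BLOCK."""
    b = -(-N // P)  # ceil(N / P)
    return i // b, i % b

def cyclic_map(i, N, P):
    """(owner, local index) under CYCLIC."""
    return i % P, i // P

def address_relation(N, P, src_map, dst_map, sender, receiver):
    """Pairs (x, y): local address x on `sender` is copied to local address y on `receiver`."""
    pairs = []
    for i in range(N):
        s, x = src_map(i, N, P)
        r, y = dst_map(i, N, P)
        if s == sender and r == receiver:
            pairs.append((x, y))
    return pairs

# B = A with A distributed BLOCK and B distributed CYCLIC, 8 elements on 2 processors.
print(address_relation(8, 2, block_map, cyclic_map, sender=0, receiver=1))  # [(1, 0), (3, 1)]
print(address_relation(8, 2, block_map, cyclic_map, sender=1, receiver=0))  # [(0, 2), (2, 3)]
```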
10
Representing Address Relations

General Purpose
Space Efficiency
Hardware Limited Performance
In-line expansion

11
AAPAIR: Simple Representation
[Figure: Sending Node Memory (A, B, C) and Receiving Node Memory (D, E, F), with the stored pointer pairs A F, B D, C E.]
(A,F),(B,D),(C,E)
Simple sequence of pointer pairs
PROBLEM: Space Efficiency
PROBLEM: Performance
12
AABLK: Run-length Encoding
[Figure: three blocks of length 2, stored as the triples (A,F,2), (B,D,2), (C,E,2).]
(A,F),(A+1,F+1), (B,D),(B+1,D+1), (C,E),(C+1,E+1)
Sequence of (pointer, pointer, length) triples
PROBLEM: Strided Access
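One way to read the AABLK idea is as a run-length encoding of unit-stride runs. The sketch below is an interpretation of the slide, not the authors' library: the function names are hypothetical, and the concrete addresses (A=0, B=10, C=20, D=0, E=10, F=20) are chosen only so that the three blocks are not adjacent.

```python
def encode_aablk(pairs):
    """Merge consecutive pairs whose sender and receiver addresses both advance by 1."""
    triples = []
    for x, y in pairs:
        if triples:
            sx, sy, n = triples[-1]
            if x == sx + n and y == sy + n:   # this pair extends the current block
                triples[-1] = (sx, sy, n + 1)
                continue
        triples.append((x, y, 1))
    return triples

def copy_aablk(triples, src, dst):
    """Copy each block of n contiguous items from src[sx:] to dst[sy:]."""
    for sx, sy, n in triples:
        dst[sy:sy + n] = src[sx:sx + n]

pairs = [(0, 20), (1, 21), (10, 0), (11, 1), (20, 10), (21, 11)]
triples = encode_aablk(pairs)
print(triples)  # [(0, 20, 2), (10, 0, 2), (20, 10, 2)]

src = list(range(22))
dst = [None] * 22
copy_aablk(triples, src, dst)
```

A strided pattern never extends a block, so every pair becomes its own triple of length 1, which is the problem the slide names.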
13
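DMRLE: Handling Strides
[Figure: sender items A, B, C are separated by stride g and receiver items F, E, D by stride h; stored as the triples (A,F,1), (g,h,2).]
(A,F),(B,E),(C,D)
B-A = C-B = g, E-F = D-E = h
Sequence of (offset, offset, length) triples
PROBLEM: Repeated Strides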
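The offset triples can be sketched as a run-length encoding of the differences between consecutive pairs, with the first difference taken from (0, 0). That starting convention, the function names, and the concrete addresses (A=100, g=4, F=220, h=-10) are assumptions made for this illustration.

```python
def encode_dmrle(pairs):
    """Run-length encode the (sender, receiver) differences between consecutive pairs."""
    triples = []
    px, py = 0, 0                      # differences start from address (0, 0)
    for x, y in pairs:
        dx, dy = x - px, y - py
        if triples and triples[-1][:2] == (dx, dy):
            triples[-1] = (dx, dy, triples[-1][2] + 1)
        else:
            triples.append((dx, dy, 1))
        px, py = x, y
    return triples

def decode_dmrle(triples):
    """Regenerate the address pairs by applying each offset pair `length` times."""
    x, y = 0, 0
    for dx, dy, n in triples:
        for _ in range(n):
            x += dx
            y += dy
            yield x, y

pairs = [(100, 220), (104, 210), (108, 200)]   # (A,F), (B,E), (C,D)
triples = encode_dmrle(pairs)
print(triples)                      # [(100, 220, 1), (4, -10, 2)]
print(list(decode_dmrle(triples)))  # [(100, 220), (104, 210), (108, 200)]
```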
14
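DMRLEC: Repeated Strides
[Figure: the stride pattern of the previous slide occurs twice. A table holds the triples 0: (A,F,1), 1: (g,h,2), 2: (u,v,1); the stored sequence of indices is 0, 1, 2, 1.]
(A,F),(B,E),(C,D), (A',F'),(B',E'),(C',D')
B-A = C-B = B'-A' = C'-B' = g
E-F = D-E = E'-F' = D'-E' = h
A'-C = u and F'-D = v
Sequence of indices into a table of (offset, offset, length) triples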
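The table-plus-indices idea can be sketched as a simple dictionary compression of the offset triples. The names compress_dmrlec and decode_dmrlec, and the concrete values standing in for A, F, g, h, u, v, are assumptions for illustration only.

```python
def compress_dmrlec(triples):
    """Replace each (offset, offset, length) triple by an index into a table of distinct triples."""
    table, index_of, indices = [], {}, []
    for t in triples:
        if t not in index_of:
            index_of[t] = len(table)
            table.append(t)
        indices.append(index_of[t])
    return table, indices

def decode_dmrlec(table, indices):
    """Regenerate the address pairs from the index sequence and the table."""
    x, y = 0, 0
    for k in indices:
        dx, dy, n = table[k]
        for _ in range(n):
            x += dx
            y += dy
            yield x, y

# Assumed values: (A,F,1) = (100,220,1), (g,h,2) = (4,-10,2), (u,v,1) = (42,120,1).
triples = [(100, 220, 1), (4, -10, 2), (42, 120, 1), (4, -10, 2)]
table, indices = compress_dmrlec(triples)
print(table)    # [(100, 220, 1), (4, -10, 2), (42, 120, 1)]
print(indices)  # [0, 1, 2, 1]
decoded = list(decode_dmrlec(table, indices))
```

The repeated stride triple is stored once in the table, so a long relation with few distinct strides shrinks to a short table plus one small index per run.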
15
Address Relation Storage Costs
16
Copying: Superscalar Plateau
[Figure: instruction issue slots over time for a copy loop, showing loads and stores issued at time t, stalls, and Free Issue Slots, labeled n and p.]
Plateau = n x p = 2 x 3 = 6
Maximum number of non load/store instructions before copy bandwidth suffers
17
Paragon: No Superscalar Plateau
18
Pentium 90: Clear Plateau
19
DEC 3K/400a: Complex Plateau
20
Measurement Details

Portable library written in C
Four representative assignments
512x512, 1Kx1K, 2Kx2K arrays of doubles distributed on four processors
Six machines
Assembly and disassembly rates
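For a sense of scale, the array sizes listed above work out as follows. The sketch assumes 8-byte doubles, 1K = 1024, and an even split over the four processors; none of these is stated on the slide.

```python
BYTES_PER_DOUBLE = 8   # assumption: 8-byte double precision
PROCESSORS = 4         # from the slide: distributed on four processors

def array_bytes(n):
    """Total size in bytes of an n x n array of doubles."""
    return n * n * BYTES_PER_DOUBLE

for n in (512, 1024, 2048):
    total = array_bytes(n)
    per_proc = total // PROCESSORS
    print(f"{n}x{n}: {total // 2**20} MiB total, {per_proc // 2**10} KiB per processor")
```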

21
Measurement Testcases
(CYCLIC,*)
(BLOCK,*)
(BLOCK,*)
(*,BLOCK)
(*,CYCLIC)
(BLOCK,*)
(CYCLIC,*)
(CYCLIC,*)
Data Transpose
22
Performance: DEC 3K/400a
23
Performance: IBM 250 (PPC601)
24
Performance: IBM SP2 (PWR2)
25
Performance: Paragon
26
Performance: Pentium 90
27
Performance: Pentium 133
28
Conclusions

Exploit the Superscalar Plateau using compact address relation encodings
Cheap enough even for scalar machines
Generalized data transfer with hardware-limited throughput
Many possible applications

29
Copying with Address Relations
[Figure: block diagram. An Address Relation Decoder issues Address Relation Addresses and reads Address Relation Data; it supplies Sender Data Addresses and Receiver Data Addresses to a Copy Engine, which moves the Data Items.]
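The split between decoder and copy engine can be mimicked in software by letting the copy loop consume any stream of address pairs. This is only a loose analogy to the block diagram; the generator-based design and all names below are assumptions.

```python
def pair_decoder(pairs):
    """Decoder for a plain sequence of pointer pairs."""
    yield from pairs

def block_decoder(triples):
    """Decoder for (pointer, pointer, length) triples."""
    for sx, sy, n in triples:
        for k in range(n):
            yield sx + k, sy + k

def copy_engine(decoder, src, dst):
    """Copy one item per decoded (sender address, receiver address) pair."""
    count = 0
    for x, y in decoder:
        dst[y] = src[x]
        count += 1
    return count

src = list("abcdef")
dst1 = [None] * 6
dst2 = [None] * 6
copy_engine(pair_decoder([(0, 4), (1, 5), (2, 0), (3, 1)]), src, dst1)
copy_engine(block_decoder([(0, 4, 2), (2, 0, 2)]), src, dst2)
print(dst1 == dst2)  # True
```

Because copy_engine only sees pairs, the encoding can be changed without touching the copy loop.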
30
A Simple Copy Engine
[Figure: a sender-side Copy Engine and a receiver-side Copy Engine connected by the Comm. System, which carries the Data. Each Copy Engine has its own Decoder that issues Address Relation Addresses, reads Address Relation Data, and supplies the Sender Data Adx or Receiver Data Adx.]
