Title: Send and Receive Based Message-Passing for SCMP
1. Send and Receive Based Message-Passing for SCMP
- Charles W. Lewis, Jr.
- Thesis Defense
- Virginia Tech
- April 28th, 2004
2. This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system.
1. Overview of SCMP
2. Original Message-Passing System
3. New Message-Passing System
4. Performance Comparisons
3. Problems with current design trends motivate the SCMP concept.
- As transistor sizes shrink, so do communication wires, leading to higher cross-chip communication latencies.
- ILP faces diminishing returns.
- Large and complex uni-processors require extensive design and verification effort.
4. SCMP provides PLP through replication.
- Up to 64 identical nodes on-chip
- Replicated nodes reduce complexity
- 2-D network eliminates cross-chip wires
[Figure: SCMP network with 64 nodes]
5. SCMP provides TLP through multi-thread hardware support.
- Up to 16 threads
- Round-robin thread scheduling by hardware
- On every node:
  - 4-stage RISC pipeline
  - 8 MB memory
  - Networking hardware
[Figure: SCMP node]
6. The original messaging system has two message types.

Thread Message (one row per flit; H = head flit bit, T = tail flit bit)
  H T   Payload        Payload        Payload        Payload
  1 0   X              Y              THREAD         1
  0 0   Address        Address        Address        Address
  0 0   Register Data  Register Data  Register Data  Register Data
  . .   ...
  0 1   Register Data  Register Data  Register Data  Register Data

Data Message
  H T   Payload   Payload   Payload   Payload
  1 0   X         Y         DATA      Stride
  0 0   Address   Address   Address   Address
  0 0   Data      Data      Data      Data
  . .   ...
  0 1   Data      Data      Data      Data

Because they contain handling information, these message formats borrow from the Active Messages message-passing system.
7. Network uses wormhole and dimension-order routing.
[Figure: dimension-order route across the 8x8 node network]
- Every router multiplexes virtual channel buffers over physical channels.
- Head flits claim virtual channel resources as they travel.
- If one message blocks, other messages may still continue as long as enough virtual channels are free.
- Messages move along the X axis, then the Y axis.
- Tail flits release virtual channel resources as they travel.
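Dimension-order routing reduces to a small next-hop rule: fully correct the X offset before touching Y. The sketch below is a minimal software model of that rule, not SCMP's router implementation; the port names, coordinate encoding, and the example route are assumptions made for illustration.

#include <stdio.h>

/* Possible output ports of a 2-D mesh router (names are illustrative). */
typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } port_t;

/* Dimension-order (X then Y) next-hop selection: correct the X offset
 * completely before moving in Y. */
static port_t next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return EAST;   /* still moving along X */
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH;  /* X done, now move along Y */
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;                     /* arrived: eject to the NIU */
}

int main(void)
{
    /* Walk a route from node (1,6) to node (5,2) one hop at a time. */
    int x = 1, y = 6;
    const int dx = 5, dy = 2;
    static const char *names[] = { "LOCAL", "EAST", "WEST", "NORTH", "SOUTH" };

    for (;;) {
        port_t p = next_hop(x, y, dx, dy);
        printf("(%d,%d) -> %s\n", x, y, names[p]);
        if (p == LOCAL) break;
        x += (p == EAST) - (p == WEST);
        y += (p == NORTH) - (p == SOUTH);
    }
    return 0;
}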
8. Dimension-order routing is deadlock free as long as messages eventually drain.
- Even with VCs, the network can still deadlock if messages don't drain.
- If all contexts are consumed, thread messages block at the NIU.
- Threads may not release until a data message is received.
- Data messages must not be stopped by congested thread messages.
- Data messages must have a separate path through the network.
[Figure: router with separate thread VCs and data VCs on the physical channels]
9. The NIU bears most of the messaging load.
[Figure: NIU block diagram - thread buffer with per-context entries, data buffer, receive buffer, and the injection/ejection channels between memory and the router]
10. Messages are built through assembly instructions.
Instruction Arguments Description
sendh d_node, type, d_address, d_stride send a header flit
send data send one data flit
send2 data, data send two data flits
sende data send one data flit and end message
send2e data, data send two data flits and end message
sendm l_address, l_stride, count send data block from memory
sendme l_address, l_stride, count send data block from memory and end message
11. The thread library facilitates thread messages.

Operation      Arguments                   Description
createThread   int dst_node,               create a thread on dst_node
               void (*addr)(),
               void (*callback)()
parExecute     int num_nodes,              create threads on num_nodes nodes
               void (*addr)()
getBlock       unsigned int node_id,       request a block of values
               char *dst_addr,             from node node_id
               unsigned int dst_stride,
               char *src_addr,
               unsigned int src_offset,
               unsigned int src_stride,
               unsigned int num_words
12. The send library facilitates data messages.

Operation           Arguments          Description
sendDataIntValue    int dst_node,      send an integer to dst_node
                    int dst_addr,
                    int value
sendDataFloatValue  int dst_node,      send a double to dst_node
                    int dst_addr,
                    double value
sendDataBlock       int dst_node,      send a block of values from
                    int dst_addr,      memory to dst_node (blocking)
                    int dst_stride,
                    int src_addr,
                    int src_stride,
                    int num_words
sendDataBlockNB     int dst_node,      send a block of values from
                    int dst_addr,      memory to dst_node (non-blocking)
                    int dst_stride,
                    int src_addr,
                    int src_stride,
                    int num_words
13. The original message-passing system uses requests and replies.
- Node A requires data held by Node B.
- Node A creates a thread on Node B.
- The new thread on Node B sends the data to Node A.
- The new thread on Node B sends a SYNC message when done.
[Figure: A sends a Thread message to B; B replies to A with a Data message followed by a Sync message]
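To make the request/reply flow concrete, here is a minimal sketch of the consumer side using the getBlock operation from the thread library (slide 11), which wraps the thread/data/SYNC exchange. The prototype is transcribed from that slide; the void return type, the stride units, and the statically allocated buffers are assumptions made for illustration.

/* Prototype as documented on slide 11; the void return type is assumed. */
void getBlock(unsigned int node_id, char *dst_addr, unsigned int dst_stride,
              char *src_addr, unsigned int src_offset, unsigned int src_stride,
              unsigned int num_words);

#define N 256

/* Buffers are statically allocated so their addresses are the same on
 * every node -- the "global pointer" requirement of the original system. */
static int remote_rows[N];   /* source data, resident on Node B */
static int local_copy[N];    /* destination buffer on Node A    */

void fetch_rows_from(unsigned int node_b)
{
    /* Node A asks Node B to spawn a request thread that streams back
     * N words from remote_rows into local_copy.  Node A must already
     * know Node B's source address and stride (units assumed to be words). */
    getBlock(node_b,
             (char *)local_copy, 1,     /* destination address and stride  */
             (char *)remote_rows, 0, 1, /* source address, offset, stride  */
             N);
    /* Completion is signalled by B's SYNC message; slide 15 shows why
     * relying on its arrival order is dangerous. */
}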
14. Dynamic memory is a problem.
- The request thread on Node B must know:
  - Source Address
  - Source Stride
  - Destination Address
  - Destination Stride
  - Number of Values to Send
- How can Node A know the source address and stride if Node B allocates the buffer dynamically?
- The program must contain global pointers.
(Data message format repeated from slide 6.)
15. In-order delivery of messages is a problem.
- The SCMP network does not guarantee in-order delivery of messages.
- The SYNC message may reach Node A before the data message.
- Node A will then read bad values from memory.
[Figure: B sends a Data message and a Sync message to A; the Sync may arrive first]
16. Request threads and finite thread contexts are a problem.
- If a node holds highly demanded data, request threads may consume all of its contexts.
- Additional thread messages will then block in the network.
[Figure: a node's thread contexts filled entirely with request threads at the same handler address, with further thread messages backed up at the NIU]
17. Send-and-Receive message-passing eliminates all of these problems.
- A thread must execute a receive before data will be accepted.
  - Don't need request messages.
- Messages are identified abstractly.
  - Don't need global pointers.
- Completion notification occurs locally.
  - Don't need SYNC messages.
18. Rendezvous mode uses an RTS/CTS handshake.
- Node B holds data required by Node A.
- Node B sends Node A an RTS message when the send is executed.
- After the receive is executed, Node A sends Node B a CTS message.
- Node B sends the data after receiving the CTS.
[Figure: B sends RTS to A; A replies with CTS; B then sends the Data message]
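Under the new library (slides 26 and 27) this handshake is hidden behind a matching send/receive pair. The sketch below is a minimal illustration under stated assumptions: the prototypes are transcribed from those slides, the void return types are guesses, and the message id is simply a tag both sides agree on.

#include <stdint.h>

/* Prototypes as documented on slides 26-27; return types assumed void.
 * Addresses are passed as int, following the slides' signatures. */
void SCMPSend(int dst_node, int message_id, int address, int stride, int num_words);
void SCMPReceive(int message_id, int address, int stride);

#define VECTOR_TAG 42      /* message id: any tag agreed on by both sides */
#define N 128

static int chunk[N];

/* Runs on Node B, which holds the data: the blocking send completes the
 * RTS/CTS handshake and the data transfer before returning. */
void producer(int node_a)
{
    SCMPSend(node_a, VECTOR_TAG, (int)(intptr_t)chunk, 1, N);
}

/* Runs on Node A: posting the receive triggers the CTS, and the blocking
 * call returns once the data message has been stored. */
void consumer(void)
{
    SCMPReceive(VECTOR_TAG, (int)(intptr_t)chunk, 1);
}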
19. Ready mode foregoes the handshake to reduce message latency.
- Node B holds data required by Node A.
- Node B sends data when the send is executed.
- The user must ensure that the receive has executed on Node A.
[Figure: B sends the Data message directly to A]
20. The implementation centers around two tables.

Send Table Entry
  bits 33..2   id
  bits 1..0    state

Receive Table Entry
  bits 83..50  id
  bits 49..29  address
  bits 28..13  stride
  bits 12..7   r_node
  bits 6..3    r_cntxt
  bits 2..0    state
21. Send Table Entries may be in 4 states, and Receive Table Entries may be in 5 states.

Send Table Entry States
  00   Empty
  01   In Use
  10   In Progress
  11   Complete

Receive Table Entry States
  000   Empty
  001   In Use
  010   In Progress
  011   RTS Received
  10X   NOT USED
  110   NOT USED
  111   Complete
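The bit layout on slide 20 and the encodings on slide 21 can be captured in a few lines of C. The sketch below is a software model for illustration only (the real tables live in the NIU hardware); the field widths follow the slides, while the helper names and the example values are invented here.

#include <stdint.h>
#include <stdio.h>

/* Send table entry states (slide 21). */
enum send_state { S_EMPTY = 0, S_IN_USE = 1, S_IN_PROGRESS = 2, S_COMPLETE = 3 };

/* Receive table entry states (slide 21); encodings 100, 101, and 110 are unused. */
enum recv_state { R_EMPTY = 0, R_IN_USE = 1, R_IN_PROGRESS = 2,
                  R_RTS_RECEIVED = 3, R_COMPLETE = 7 };

/* Send table entry (slide 20): message id in bits 33..2, state in bits 1..0.
 * (The 84-bit receive table entry would need two machine words.) */
static uint64_t pack_send_entry(uint64_t id, enum send_state st)
{
    return ((id & 0xFFFFFFFFull) << 2) | (uint64_t)st;
}

static enum send_state send_entry_state(uint64_t entry)
{
    return (enum send_state)(entry & 0x3);
}

int main(void)
{
    uint64_t e = pack_send_entry(0xDEADBEEF, S_IN_USE);
    printf("state = %d\n", (int)send_entry_state(e));   /* prints 1 (In Use) */
    return 0;
}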
22. The new messaging system has four message types.

Thread Message
  H T   Payload          Payload          Payload          Payload
  1 0   X                Y                THREAD
  1 1   Handler Address  Handler Address  Handler Address  Handler Address
  0 0   Register Data    Register Data    Register Data    Register Data
  . .   ...
  0 1   Register Data    Register Data    Register Data    Register Data

Data Message
  H T   Payload     Payload     Payload     Payload
  1 0   X           Y           DATA
  1 1   Message ID  Message ID  Message ID  Message ID
  0 0   Data        Data        Data        Data
  . .   ...
  0 1   Data        Data        Data        Data

RTS Message
  H T   Payload     Payload     Payload     Payload     Payload
  1 0   X           Y           RTS         cntxt       node
  0 1   Message ID  Message ID  Message ID  Message ID  Message ID

CTS Message
  H T   Payload     Payload     Payload     Payload
  1 0   X           Y           CTS         cntxt
  0 1   Message ID  Message ID  Message ID  Message ID
23. The NIU now contains a data queue for every context.
[Figure: NIU block diagram - thread buffer, per-context data buffers, RTS buffer, CTS buffer, and receive buffer between memory and the router's injection and ejection channels]
24. Only five new instructions and one modified instruction are needed.
Instruction Arguments Description
sendh d_node, type, d_address send a header flit
send data send one data flit
send2 data, data send two data flits
sende data send one data flit and end message
send2e data, data send two data flits and end message
sendm l_address, l_stride, count send data block from memory
sendme l_address, l_stride, count send data block from memory and end message
ldss r, message_id poll send operation status
ldsr r, message_id poll receive operation status
str message_id, address, stride store a receive to table
rms message_id clear a send operation
rmr message_id clear a receive operation
25. The thread library remains nearly the same.
Operation      Arguments                   Description
createThread   int dst_node,               create a thread on dst_node
               void (*addr)(),
               void (*callback)()
parExecute     int num_nodes,              create threads on num_nodes nodes
               void (*addr)()
getBlock       unsigned int node_id,       request a block of values
               char *dst_addr,             from node node_id
               unsigned int dst_stride,
               char *src_addr,
               unsigned int src_offset,
               unsigned int src_stride,
               message_id,
               unsigned int num_words
26. The new send library is more familiar.

Operation       Arguments           Description
SCMPSendInt     int dst_node,       send an integer to dst_node
                int message_id,
                int value
SCMPSendFloat   int dst_node,       send a double to dst_node
                int message_id,
                double value
SCMPSend        int dst_node,       send a block of values from
                int message_id,     memory to dst_node (blocking)
                int address,
                int stride,
                int num_words
SCMPSendNB      int dst_node,       send a block of values from
                int message_id,     memory to dst_node (non-blocking)
                int address,
                int stride,
                int num_words
SCMPPollSend    int message_id      poll status of send operation
SCMPWaitSend    int message_id      suspend until the message sends
SCMPClearSend   int message_id      clear send operation
27. The receive library is all new.

Operation         Arguments          Description
SCMPReceive       int message_id,    receive a message and store it
                  int address,       at address (blocking)
                  int stride
SCMPReceiveNB     int message_id,    receive a message and store it
                  int address,       at address (non-blocking)
                  int stride
SCMPPollReceive   int message_id     poll status of receive operation
SCMPWaitReceive   int message_id     suspend until the message arrives
SCMPClearReceive  int message_id     clear receive operation
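A typical use of the new libraries overlaps communication with computation by pairing the non-blocking calls with the wait and clear operations. The sketch below is illustrative only: the prototypes are transcribed from slides 26 and 27, the return types are assumptions, and do_local_work() stands in for application code.

#include <stdint.h>

/* Prototypes as documented on slides 26-27; return types are assumed. */
void SCMPSendNB(int dst_node, int message_id, int address, int stride, int num_words);
void SCMPWaitSend(int message_id);
void SCMPClearSend(int message_id);
void SCMPReceiveNB(int message_id, int address, int stride);
void SCMPWaitReceive(int message_id);
void SCMPClearReceive(int message_id);

static void do_local_work(void) { /* application computation elided */ }

#define ROW_TAG 7
#define N 512

static int row[N];

/* Sender side: start the transfer, keep computing, then reclaim the
 * send table entry once the message has gone out. */
void send_row(int dst_node)
{
    SCMPSendNB(dst_node, ROW_TAG, (int)(intptr_t)row, 1, N);
    do_local_work();             /* overlap with the transfer        */
    SCMPWaitSend(ROW_TAG);       /* suspend until the message sends  */
    SCMPClearSend(ROW_TAG);      /* free the send table entry        */
}

/* Receiver side: post the receive early so the arriving RTS or data can
 * be handled immediately, then wait for the data and clean up. */
void recv_row(void)
{
    SCMPReceiveNB(ROW_TAG, (int)(intptr_t)row, 1);
    do_local_work();
    SCMPWaitReceive(ROW_TAG);    /* suspend until the message arrives */
    SCMPClearReceive(ROW_TAG);   /* free the receive table entry      */
}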
28. Rendezvous Mode Operation at the Sender
[Flowchart: executing sendh creates a send table entry (In Use), queues the head flit with its tag, and sends an RTS. When the matching CTS arrives, the queued flits are sent (entry In Progress) until the tail flit goes out and the entry is marked Complete; a CTS with no matching entry is an error.]
29. Rendezvous Mode Operation at the Receiver
[Flowchart: an RTS that arrives before the matching str is recorded and the entry is marked RTS Received; when str executes (or an RTS arrives for an existing In Use entry), the receive information is recorded, a CTS is sent, and the entry moves to In Progress. Arriving data flits are stored until the tail flit marks the entry Complete; RTS and data flits that find no usable entry are held back.]
30. RTS and CTS messages also need separate VC paths.
- RTS messages can block in the network.
- For a given RTS message to leave the network, the RTS messages ahead of it must be satisfied:
  - a CTS message to the source
  - a data message back
- RTS and CTS messages have their own VC paths.
[Figure: router with separate thread, data, RTS, and CTS VCs on the physical channels]
31. Ready Mode Operation at the Sender
[Flowchart: executing sendh creates a send table entry (In Use) and queues the head flit with its tag; the flits are then sent immediately (entry In Progress), with no RTS, until the tail flit goes out and the entry is marked Complete.]
32. Ready Mode Operation at the Receiver
33. Stressmark testing was used to verify that performance was not hurt.
- DIS Stressmark Suite:
  - Neighborhood Stressmark
  - Matrix Stressmark
  - Transitive Closure Stressmark
  - LU Factorization Stressmark
34. The neighborhood stressmark measures image texture.
- Every node owns a portion of the total rows.
- Every row owns complete sum and difference histograms.
- Each node determines, and requests, the pairs for pixels in its rows.
- Each node fills in the sum and difference histograms (see the sketch below).
- Histograms are shared:
  - Each node manages only a portion of each histogram.
  - Only the correct portion is sent to a node.
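For reference, the sum and difference histograms at the heart of the stressmark can be accumulated as below. This is a minimal single-node sketch under assumptions not stated on the slide: 8-bit pixels, a fixed horizontal pixel-pair offset, and no distribution across nodes (the parallel version splits the rows and the histogram ranges among nodes as described above).

#include <stdint.h>
#include <stddef.h>

#define LEVELS 256            /* assumed 8-bit pixels */

/* Accumulate sum and difference histograms over horizontally offset
 * pixel pairs (p, q) with q = image[r][c + dx].  hist_sum and hist_diff
 * each have 2*LEVELS-1 bins; a difference of 0 lands in bin LEVELS-1. */
void neighborhood_histograms(const uint8_t *image, size_t rows, size_t cols,
                             size_t dx, unsigned hist_sum[2 * LEVELS - 1],
                             unsigned hist_diff[2 * LEVELS - 1])
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c + dx < cols; c++) {
            int p = image[r * cols + c];
            int q = image[r * cols + c + dx];
            hist_sum[p + q]++;                    /* sum histogram        */
            hist_diff[(p - q) + (LEVELS - 1)]++;  /* difference histogram */
        }
    }
}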
35. Queues with 16 flits perform best.
36. The new system outperforms the old under the neighborhood stressmark.
37. The matrix stressmark solves a linear system of equations using the Conjugate Gradient Method.
- Additional vectors r and p are used for intermediate steps (see the sketch below).
- Every node has:
  - rows of A
  - elements of b and r
  - complete x and p
- After each iteration p must be globally redistributed:
  - share with columns
  - share with rows
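As a reference for what each SCMP node computes a slice of, the sketch below is a plain single-node conjugate gradient iteration for a dense symmetric positive-definite system Ax = b, using the residual r and search direction p mentioned above. The tolerance, iteration cap, and dense row-major storage are assumptions; the parallel version distributes the rows of A and redistributes p after each iteration as described.

#include <math.h>
#include <stddef.h>

/* Solve Ax = b for dense symmetric positive-definite A (n x n, row-major).
 * x must be zero-initialized by the caller; r, p, and Ap are scratch vectors. */
void conjugate_gradient(size_t n, const double *A, const double *b,
                        double *x, double *r, double *p, double *Ap)
{
    double rs_old = 0.0;
    for (size_t i = 0; i < n; i++) {        /* r = b - A*x = b (x = 0), p = r */
        r[i] = b[i];
        p[i] = r[i];
        rs_old += r[i] * r[i];
    }

    for (size_t it = 0; it < n && sqrt(rs_old) > 1e-10; it++) {
        double pAp = 0.0;
        for (size_t i = 0; i < n; i++) {    /* Ap = A*p: the rows of A are   */
            double s = 0.0;                 /* what each node owns in the    */
            for (size_t j = 0; j < n; j++)  /* parallel version              */
                s += A[i * n + j] * p[j];
            Ap[i] = s;
            pAp += p[i] * s;
        }

        double alpha = rs_old / pAp;
        double rs_new = 0.0;
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * p[i];           /* advance the solution          */
            r[i] -= alpha * Ap[i];          /* update the residual           */
            rs_new += r[i] * r[i];
        }

        double beta = rs_new / rs_old;
        for (size_t i = 0; i < n; i++)      /* new search direction; this is */
            p[i] = r[i] + beta * p[i];      /* the p that must be shared     */
        rs_old = rs_new;
    }
}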
38. The new system provides marginal improvement over the original under the matrix stressmark.
39. The transitive closure stressmark solves the all-pairs shortest-path problem.
- Floyd-Warshall Algorithm
- Adjacency matrix D with entries D[i][j]
- Iterative improvements: D[i][j] = min(D[i][j], D[i][k] + D[k][j]) (see the sketch below)
- Each node owns a sub-block of the adjacency matrix.
- Each node needs a portion of row k.
- Each node needs a portion of column k.
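The update rule above is the entire inner kernel of the Floyd-Warshall algorithm. A minimal single-node sketch follows; the parallel version gives each node a sub-block of D and distributes its slice of row k and column k each iteration. The INF "no edge" marker and integer edge weights are assumptions made for illustration.

#include <limits.h>

#define INF (INT_MAX / 2)   /* large enough to avoid overflow when added */

/* D is the n x n adjacency/distance matrix in row-major order:
 * D[i*n+j] is the edge weight from i to j, or INF if there is no edge. */
void floyd_warshall(int n, int *D)
{
    for (int k = 0; k < n; k++)                    /* pivot vertex k          */
        for (int i = 0; i < n; i++)                /* each node's rows need   */
            for (int j = 0; j < n; j++) {          /* row k and column k      */
                int through_k = D[i * n + k] + D[k * n + j];
                if (through_k < D[i * n + j])      /* D[i][j] = min(D[i][j],  */
                    D[i * n + j] = through_k;      /*   D[i][k] + D[k][j])    */
            }
}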
40. The new system provides marginal improvement over the original under the transitive closure stressmark.
41. The LU factorization stressmark is used by linear system solvers.
- Factors a matrix into a lower triangular matrix and an upper triangular matrix.
- The matrix is divided into blocks (see the sketch below):
  - The pivot block is factored.
  - The pivot column and row blocks are divided by the pivot.
  - The inner active matrix blocks are modified by the pivot row and column blocks.
[Figure: block layout of the matrix - pivot block, pivot row, pivot column, and inner active matrix]
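For reference, the sketch below is a plain, unblocked, in-place LU factorization without pivot-row exchanges; it only shows which updates the stressmark performs. In the blocked parallel version described above, the k loop runs over pivot blocks, and the column and trailing-submatrix updates become the pivot-column, pivot-row, and inner-active-matrix steps.

/* In-place LU factorization of an n x n row-major matrix A (no row
 * exchanges): on return the strict lower triangle holds L (unit diagonal
 * implied) and the upper triangle holds U. */
void lu_factor(int n, double *A)
{
    for (int k = 0; k < n; k++) {                       /* pivot element     */
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];               /* pivot-column step */
            for (int j = k + 1; j < n; j++)             /* inner active      */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];  /* matrix update */
        }
    }
}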
42. The new system outperforms the original under the LU factorization stressmark.
43. Send-and-Receive messaging for SCMP is worthwhile.
- Fixes problems with the original SCMP messaging system:
  - Global buffer pointers
  - Races between Data and SYNC messages
  - Request thread storms
- The programming model is more familiar.
- Performance is better.

Questions?