Title: Send and Receive Based Message-Passing for SCMP
1. Send and Receive Based Message-Passing for SCMP
- Charles W. Lewis, Jr.
- Thesis Defense
- Virginia Tech
- April 28th, 2004
2. This presentation introduces the SCMP architecture, discusses problems with the current SCMP message-passing system, and focuses on the design and performance of a new SCMP message-passing system.
1. Overview of SCMP
2. Original Message-Passing System
3. New Message-Passing System
4. Performance Comparisons
3. Problems with current design trends motivate the SCMP concept.
- As transistor sizes shrink, so do communication wires, leading to higher cross-chip communication latencies.
- ILP faces diminishing returns.
- Large and complex uni-processors require extensive design and verification effort.
4. SCMP provides PLP through replication.
- Up to 64 identical nodes on-chip
- Replicated nodes reduce complexity
- 2-D network eliminates cross-chip wires
[Figure: SCMP network with 64 nodes]
5. SCMP provides TLP through multi-thread hardware support.
- Up to 16 threads
- Round-robin thread scheduling by hardware
- On every node:
  - 4-stage RISC pipeline
  - 8 MB memory
  - Networking hardware
[Figure: SCMP node]
6. The original messaging system has two message types.

Thread Message (one row per flit; H = head flit bit, T = tail flit bit)
  H T   Payload        Payload        Payload        Payload
  1 0   X              Y              THREAD         1
  0 0   Address        Address        Address        Address
  0 0   Register Data  Register Data  Register Data  Register Data
  . .   ...
  0 1   Register Data  Register Data  Register Data  Register Data

Data Message
  H T   Payload   Payload   Payload   Payload
  1 0   X         Y         DATA      Stride
  0 0   Address   Address   Address   Address
  0 0   Data      Data      Data      Data
  . .   ...
  0 1   Data      Data      Data      Data

Because they contain handling information, these message formats borrow from the Active Messages message-passing system.
7. Network uses wormhole and dimension-order routing.
[Figure: dimension-order route across the 8x8 node network]
- Every router multiplexes virtual channel buffers over physical channels.
- Head flits claim virtual channel resources as they travel.
- If one message blocks, other messages may still continue as long as enough virtual channels are free.
- Messages move along the X axis, then the Y axis.
- Tail flits release virtual channel resources as they travel.
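Dimension-order routing reduces to a small next-hop rule: fully correct the X offset before touching Y. The sketch below is a minimal software model of that rule, not SCMP's router implementation; the port names, coordinate encoding, and the example route are assumptions made for illustration.

#include <stdio.h>

/* Possible output ports of a 2-D mesh router (names are illustrative). */
typedef enum { LOCAL, EAST, WEST, NORTH, SOUTH } port_t;

/* Dimension-order (X then Y) next-hop selection: correct the X offset
 * completely before moving in Y. */
static port_t next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (dst_x > cur_x) return EAST;   /* still moving along X */
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH;  /* X done, now move along Y */
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;                     /* arrived: eject to the NIU */
}

int main(void)
{
    /* Walk a route from node (1,6) to node (5,2) one hop at a time. */
    int x = 1, y = 6;
    const int dx = 5, dy = 2;
    static const char *names[] = { "LOCAL", "EAST", "WEST", "NORTH", "SOUTH" };

    for (;;) {
        port_t p = next_hop(x, y, dx, dy);
        printf("(%d,%d) -> %s\n", x, y, names[p]);
        if (p == LOCAL) break;
        x += (p == EAST) - (p == WEST);
        y += (p == NORTH) - (p == SOUTH);
    }
    return 0;
}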
8. Dimension-order routing is deadlock free as long as messages eventually drain.
- Even with VCs, the network can still deadlock if messages don't drain.
- If all contexts are consumed, thread messages block at the NIU.
- Threads may not release until a data message is received.
- Data messages must not be stopped by congested thread messages.
- Data messages must have a separate path through the network.
[Figure: router with separate thread VCs and data VCs on the physical channels]
9. The NIU bears most of the messaging load.
[Figure: NIU block diagram - thread buffer with per-context entries, data buffer, receive buffer, and the injection/ejection channels between memory and the router]
10. Messages are built through assembly instructions.
Instruction Arguments Description
sendh d_node, type, d_address, d_stride send a header flit
send data send one data flit
send2 data, data send two data flits
sende data send one data flit and end message
send2e data, data send two data flits and end message
sendm l_address, l_stride, count send data block from memory
sendme l_address, l_stride, count send data block from memory and end message
11. The thread library facilitates thread messages.

Operation      Arguments                   Description
createThread   int dst_node,               create a thread on dst_node
               void (*addr)(),
               void (*callback)()
parExecute     int num_nodes,              create threads on num_nodes nodes
               void (*addr)()
getBlock       unsigned int node_id,       request a block of values
               char *dst_addr,             from node node_id
               unsigned int dst_stride,
               char *src_addr,
               unsigned int src_offset,
               unsigned int src_stride,
               unsigned int num_words
12. The send library facilitates data messages.

Operation           Arguments          Description
sendDataIntValue    int dst_node,      send an integer to dst_node
                    int dst_addr,
                    int value
sendDataFloatValue  int dst_node,      send a double to dst_node
                    int dst_addr,
                    double value
sendDataBlock       int dst_node,      send a block of values from
                    int dst_addr,      memory to dst_node (blocking)
                    int dst_stride,
                    int src_addr,
                    int src_stride,
                    int num_words
sendDataBlockNB     int dst_node,      send a block of values from
                    int dst_addr,      memory to dst_node (non-blocking)
                    int dst_stride,
                    int src_addr,
                    int src_stride,
                    int num_words
13. The original message-passing system uses requests and replies.
- Node A requires data held by Node B.
- Node A creates a thread on Node B.
- The new thread on Node B sends the data to Node A.
- The new thread on Node B sends a SYNC message when done.
[Figure: A sends a Thread message to B; B replies to A with a Data message followed by a Sync message]
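To make the request/reply flow concrete, here is a minimal sketch of the consumer side using the getBlock operation from the thread library (slide 11), which wraps the thread/data/SYNC exchange. The prototype is transcribed from that slide; the void return type, the stride units, and the statically allocated buffers are assumptions made for illustration.

/* Prototype as documented on slide 11; the void return type is assumed. */
void getBlock(unsigned int node_id, char *dst_addr, unsigned int dst_stride,
              char *src_addr, unsigned int src_offset, unsigned int src_stride,
              unsigned int num_words);

#define N 256

/* Buffers are statically allocated so their addresses are the same on
 * every node -- the "global pointer" requirement of the original system. */
static int remote_rows[N];   /* source data, resident on Node B */
static int local_copy[N];    /* destination buffer on Node A    */

void fetch_rows_from(unsigned int node_b)
{
    /* Node A asks Node B to spawn a request thread that streams back
     * N words from remote_rows into local_copy.  Node A must already
     * know Node B's source address and stride (units assumed to be words). */
    getBlock(node_b,
             (char *)local_copy, 1,     /* destination address and stride  */
             (char *)remote_rows, 0, 1, /* source address, offset, stride  */
             N);
    /* Completion is signalled by B's SYNC message; slide 15 shows why
     * relying on its arrival order is dangerous. */
}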
14. Dynamic memory is a problem.
- The request thread on Node B must know:
  - Source Address
  - Source Stride
  - Destination Address
  - Destination Stride
  - Number of Values to Send
- How can Node A know the source address and stride if Node B allocates the buffer dynamically?
- The program must contain global pointers.
(Data message format repeated from slide 6.)
15. In-order delivery of messages is a problem.
- The SCMP network does not guarantee in-order delivery of messages.
- The SYNC message may reach Node A before the data message.
- Node A will then read bad values from memory.
[Figure: B sends a Data message and a Sync message to A; the Sync may arrive first]
16. Request threads and finite thread contexts are a problem.
- If a node holds highly demanded data, request threads may consume all of its contexts.
- Additional thread messages will then block in the network.
[Figure: a node's thread contexts filled entirely with request threads at the same handler address, with further thread messages backed up at the NIU]
17. Send-and-Receive message-passing eliminates all of these problems.
- A thread must execute a receive before data will be accepted.
  - Don't need request messages.
- Messages are identified abstractly.
  - Don't need global pointers.
- Completion notification occurs locally.
  - Don't need SYNC messages.
18. Rendezvous mode uses an RTS/CTS handshake.
- Node B holds data required by Node A.
- Node B sends Node A an RTS message when the send is executed.
- After the receive is executed, Node A sends Node B a CTS message.
- Node B sends the data after receiving the CTS.
[Figure: B sends RTS to A; A replies with CTS; B then sends the Data message]
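Under the new library (slides 26 and 27) this handshake is hidden behind a matching send/receive pair. The sketch below is a minimal illustration under stated assumptions: the prototypes are transcribed from those slides, the void return types are guesses, and the message id is simply a tag both sides agree on.

#include <stdint.h>

/* Prototypes as documented on slides 26-27; return types assumed void.
 * Addresses are passed as int, following the slides' signatures. */
void SCMPSend(int dst_node, int message_id, int address, int stride, int num_words);
void SCMPReceive(int message_id, int address, int stride);

#define VECTOR_TAG 42      /* message id: any tag agreed on by both sides */
#define N 128

static int chunk[N];

/* Runs on Node B, which holds the data: the blocking send completes the
 * RTS/CTS handshake and the data transfer before returning. */
void producer(int node_a)
{
    SCMPSend(node_a, VECTOR_TAG, (int)(intptr_t)chunk, 1, N);
}

/* Runs on Node A: posting the receive triggers the CTS, and the blocking
 * call returns once the data message has been stored. */
void consumer(void)
{
    SCMPReceive(VECTOR_TAG, (int)(intptr_t)chunk, 1);
}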
19. Ready mode foregoes the handshake to reduce message latency.
- Node B holds data required by Node A.
- Node B sends data when the send is executed.
- The user must ensure that the receive has executed on Node A.
[Figure: B sends the Data message directly to A]
20. The implementation centers around two tables.

Send Table Entry
  bits 33..2   id
  bits 1..0    state

Receive Table Entry
  bits 83..50  id
  bits 49..29  address
  bits 28..13  stride
  bits 12..7   r_node
  bits 6..3    r_cntxt
  bits 2..0    state
21. Send Table Entries may be in 4 states, and Receive Table Entries may be in 5 states.

Send Table Entry States
  00   Empty
  01   In Use
  10   In Progress
  11   Complete

Receive Table Entry States
  000   Empty
  001   In Use
  010   In Progress
  011   RTS Received
  10X   NOT USED
  110   NOT USED
  111   Complete
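The bit layout on slide 20 and the encodings on slide 21 can be captured in a few lines of C. The sketch below is a software model for illustration only (the real tables live in the NIU hardware); the field widths follow the slides, while the helper names and the example values are invented here.

#include <stdint.h>
#include <stdio.h>

/* Send table entry states (slide 21). */
enum send_state { S_EMPTY = 0, S_IN_USE = 1, S_IN_PROGRESS = 2, S_COMPLETE = 3 };

/* Receive table entry states (slide 21); encodings 100, 101, and 110 are unused. */
enum recv_state { R_EMPTY = 0, R_IN_USE = 1, R_IN_PROGRESS = 2,
                  R_RTS_RECEIVED = 3, R_COMPLETE = 7 };

/* Send table entry (slide 20): message id in bits 33..2, state in bits 1..0.
 * (The 84-bit receive table entry would need two machine words.) */
static uint64_t pack_send_entry(uint64_t id, enum send_state st)
{
    return ((id & 0xFFFFFFFFull) << 2) | (uint64_t)st;
}

static enum send_state send_entry_state(uint64_t entry)
{
    return (enum send_state)(entry & 0x3);
}

int main(void)
{
    uint64_t e = pack_send_entry(0xDEADBEEF, S_IN_USE);
    printf("state = %d\n", (int)send_entry_state(e));   /* prints 1 (In Use) */
    return 0;
}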
22. The new messaging system has four message types.

Thread Message
  H T   Payload          Payload          Payload          Payload
  1 0   X                Y                THREAD
  1 1   Handler Address  Handler Address  Handler Address  Handler Address
  0 0   Register Data    Register Data    Register Data    Register Data
  . .   ...
  0 1   Register Data    Register Data    Register Data    Register Data

Data Message
  H T   Payload     Payload     Payload     Payload
  1 0   X           Y           DATA
  1 1   Message ID  Message ID  Message ID  Message ID
  0 0   Data        Data        Data        Data
  . .   ...
  0 1   Data        Data        Data        Data

RTS Message
  H T   Payload     Payload     Payload     Payload     Payload
  1 0   X           Y           RTS         cntxt       node
  0 1   Message ID  Message ID  Message ID  Message ID  Message ID

CTS Message
  H T   Payload     Payload     Payload     Payload
  1 0   X           Y           CTS         cntxt
  0 1   Message ID  Message ID  Message ID  Message ID
23. The NIU now contains a data queue for every context.
[Figure: NIU block diagram - thread buffer, per-context data buffers, RTS buffer, CTS buffer, and receive buffer between memory and the router's injection and ejection channels]
24. Only five new instructions and one modified instruction are needed.
Instruction Arguments Description
sendh d_node, type, d_address send a header flit
send data send one data flit
send2 data, data send two data flits
sende data send one data flit and end message
send2e data, data send two data flits and end message
sendm l_address, l_stride, count send data block from memory
sendme l_address, l_stride, count send data block from memory and end message
ldss r, message_id poll send operation status
ldsr r, message_id poll receive operation status
str message_id, address, stride store a receive to table
rms message_id clear a send operation
rmr message_id clear a receive operation
25. The thread library remains nearly the same.
Operation      Arguments                   Description
createThread   int dst_node,               create a thread on dst_node
               void (*addr)(),
               void (*callback)()
parExecute     int num_nodes,              create threads on num_nodes nodes
               void (*addr)()
getBlock       unsigned int node_id,       request a block of values
               char *dst_addr,             from node node_id
               unsigned int dst_stride,
               char *src_addr,
               unsigned int src_offset,
               unsigned int src_stride,
               message_id,
               unsigned int num_words
26. The new send library is more familiar.

Operation       Arguments           Description
SCMPSendInt     int dst_node,       send an integer to dst_node
                int message_id,
                int value
SCMPSendFloat   int dst_node,       send a double to dst_node
                int message_id,
                double value
SCMPSend        int dst_node,       send a block of values from
                int message_id,     memory to dst_node (blocking)
                int address,
                int stride,
                int num_words
SCMPSendNB      int dst_node,       send a block of values from
                int message_id,     memory to dst_node (non-blocking)
                int address,
                int stride,
                int num_words
SCMPPollSend    int message_id      poll status of send operation
SCMPWaitSend    int message_id      suspend until the message sends
SCMPClearSend   int message_id      clear send operation
27. The receive library is all new.

Operation         Arguments          Description
SCMPReceive       int message_id,    receive a message and store it
                  int address,       at address (blocking)
                  int stride
SCMPReceiveNB     int message_id,    receive a message and store it
                  int address,       at address (non-blocking)
                  int stride
SCMPPollReceive   int message_id     poll status of receive operation
SCMPWaitReceive   int message_id     suspend until the message arrives
SCMPClearReceive  int message_id     clear receive operation
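A typical use of the new libraries overlaps communication with computation by pairing the non-blocking calls with the wait and clear operations. The sketch below is illustrative only: the prototypes are transcribed from slides 26 and 27, the return types are assumptions, and do_local_work() stands in for application code.

#include <stdint.h>

/* Prototypes as documented on slides 26-27; return types are assumed. */
void SCMPSendNB(int dst_node, int message_id, int address, int stride, int num_words);
void SCMPWaitSend(int message_id);
void SCMPClearSend(int message_id);
void SCMPReceiveNB(int message_id, int address, int stride);
void SCMPWaitReceive(int message_id);
void SCMPClearReceive(int message_id);

static void do_local_work(void) { /* application computation elided */ }

#define ROW_TAG 7
#define N 512

static int row[N];

/* Sender side: start the transfer, keep computing, then reclaim the
 * send table entry once the message has gone out. */
void send_row(int dst_node)
{
    SCMPSendNB(dst_node, ROW_TAG, (int)(intptr_t)row, 1, N);
    do_local_work();             /* overlap with the transfer        */
    SCMPWaitSend(ROW_TAG);       /* suspend until the message sends  */
    SCMPClearSend(ROW_TAG);      /* free the send table entry        */
}

/* Receiver side: post the receive early so the arriving RTS or data can
 * be handled immediately, then wait for the data and clean up. */
void recv_row(void)
{
    SCMPReceiveNB(ROW_TAG, (int)(intptr_t)row, 1);
    do_local_work();
    SCMPWaitReceive(ROW_TAG);    /* suspend until the message arrives */
    SCMPClearReceive(ROW_TAG);   /* free the receive table entry      */
}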
28. Rendezvous Mode Operation at the Sender
[Flowchart: executing sendh creates a send table entry (In Use), queues the head flit with its tag, and sends an RTS. When the matching CTS arrives, the queued flits are sent (entry In Progress) until the tail flit goes out and the entry is marked Complete; a CTS with no matching entry is an error.]
29. Rendezvous Mode Operation at the Receiver
[Flowchart: an RTS that arrives before the matching str is recorded and the entry is marked RTS Received; when str executes (or an RTS arrives for an existing In Use entry), the receive information is recorded, a CTS is sent, and the entry moves to In Progress. Arriving data flits are stored until the tail flit marks the entry Complete; RTS and data flits that find no usable entry are held back.]
30. RTS and CTS messages also need separate VC paths.
- RTS messages can block in the network.
- For a given RTS message to leave the network, the RTS messages ahead of it must be satisfied:
  - a CTS message to the source
  - a data message back
- RTS and CTS messages have their own VC paths.
[Figure: router with separate thread, data, RTS, and CTS VCs on the physical channels]
31. Ready Mode Operation at the Sender
[Flowchart: executing sendh creates a send table entry (In Use) and queues the head flit with its tag; the flits are then sent immediately (entry In Progress), with no RTS, until the tail flit goes out and the entry is marked Complete.]
32. Ready Mode Operation at the Receiver
33. Stressmark testing was used to verify that performance was not hurt.
- DIS Stressmark Suite:
  - Neighborhood Stressmark
  - Matrix Stressmark
  - Transitive Closure Stressmark
  - LU Factorization Stressmark
34. The neighborhood stressmark measures image texture.
- Every node owns a portion of the total rows.
- Every row owns complete sum and difference histograms.
- Each node determines, and requests, the pairs for pixels in its rows.
- Each node fills in the sum and difference histograms (see the sketch below).
- Histograms are shared:
  - Each node manages only a portion of each histogram.
  - Only the correct portion is sent to a node.
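For reference, the sum and difference histograms at the heart of the stressmark can be accumulated as below. This is a minimal single-node sketch under assumptions not stated on the slide: 8-bit pixels, a fixed horizontal pixel-pair offset, and no distribution across nodes (the parallel version splits the rows and the histogram ranges among nodes as described above).

#include <stdint.h>
#include <stddef.h>

#define LEVELS 256            /* assumed 8-bit pixels */

/* Accumulate sum and difference histograms over horizontally offset
 * pixel pairs (p, q) with q = image[r][c + dx].  hist_sum and hist_diff
 * each have 2*LEVELS-1 bins; a difference of 0 lands in bin LEVELS-1. */
void neighborhood_histograms(const uint8_t *image, size_t rows, size_t cols,
                             size_t dx, unsigned hist_sum[2 * LEVELS - 1],
                             unsigned hist_diff[2 * LEVELS - 1])
{
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c + dx < cols; c++) {
            int p = image[r * cols + c];
            int q = image[r * cols + c + dx];
            hist_sum[p + q]++;                    /* sum histogram        */
            hist_diff[(p - q) + (LEVELS - 1)]++;  /* difference histogram */
        }
    }
}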
35. Queues with 16 flits perform best.
36. The new system outperforms the old under the neighborhood stressmark.
37. The matrix stressmark solves a linear system of equations using the Conjugate Gradient Method.
- Additional vectors r and p are used for intermediate steps (see the sketch below).
- Every node has:
  - rows of A
  - elements of b and r
  - complete x and p
- After each iteration p must be globally redistributed:
  - share with columns
  - share with rows
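As a reference for what each SCMP node computes a slice of, the sketch below is a plain single-node conjugate gradient iteration for a dense symmetric positive-definite system Ax = b, using the residual r and search direction p mentioned above. The tolerance, iteration cap, and dense row-major storage are assumptions; the parallel version distributes the rows of A and redistributes p after each iteration as described.

#include <math.h>
#include <stddef.h>

/* Solve Ax = b for dense symmetric positive-definite A (n x n, row-major).
 * x must be zero-initialized by the caller; r, p, and Ap are scratch vectors. */
void conjugate_gradient(size_t n, const double *A, const double *b,
                        double *x, double *r, double *p, double *Ap)
{
    double rs_old = 0.0;
    for (size_t i = 0; i < n; i++) {        /* r = b - A*x = b (x = 0), p = r */
        r[i] = b[i];
        p[i] = r[i];
        rs_old += r[i] * r[i];
    }

    for (size_t it = 0; it < n && sqrt(rs_old) > 1e-10; it++) {
        double pAp = 0.0;
        for (size_t i = 0; i < n; i++) {    /* Ap = A*p: the rows of A are   */
            double s = 0.0;                 /* what each node owns in the    */
            for (size_t j = 0; j < n; j++)  /* parallel version              */
                s += A[i * n + j] * p[j];
            Ap[i] = s;
            pAp += p[i] * s;
        }

        double alpha = rs_old / pAp;
        double rs_new = 0.0;
        for (size_t i = 0; i < n; i++) {
            x[i] += alpha * p[i];           /* advance the solution          */
            r[i] -= alpha * Ap[i];          /* update the residual           */
            rs_new += r[i] * r[i];
        }

        double beta = rs_new / rs_old;
        for (size_t i = 0; i < n; i++)      /* new search direction; this is */
            p[i] = r[i] + beta * p[i];      /* the p that must be shared     */
        rs_old = rs_new;
    }
}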
38. The new system provides marginal improvement over the original under the matrix stressmark.
39. The transitive closure stressmark solves the all-pairs shortest-path problem.
- Floyd-Warshall Algorithm
- Adjacency matrix D with entries D[i][j]
- Iterative improvements: D[i][j] = min(D[i][j], D[i][k] + D[k][j]) (see the sketch below)
- Each node owns a sub-block of the adjacency matrix.
- Each node needs a portion of row k.
- Each node needs a portion of column k.
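The update rule above is the entire inner kernel of the Floyd-Warshall algorithm. A minimal single-node sketch follows; the parallel version gives each node a sub-block of D and distributes its slice of row k and column k each iteration. The INF "no edge" marker and integer edge weights are assumptions made for illustration.

#include <limits.h>

#define INF (INT_MAX / 2)   /* large enough to avoid overflow when added */

/* D is the n x n adjacency/distance matrix in row-major order:
 * D[i*n+j] is the edge weight from i to j, or INF if there is no edge. */
void floyd_warshall(int n, int *D)
{
    for (int k = 0; k < n; k++)                    /* pivot vertex k          */
        for (int i = 0; i < n; i++)                /* each node's rows need   */
            for (int j = 0; j < n; j++) {          /* row k and column k      */
                int through_k = D[i * n + k] + D[k * n + j];
                if (through_k < D[i * n + j])      /* D[i][j] = min(D[i][j],  */
                    D[i * n + j] = through_k;      /*   D[i][k] + D[k][j])    */
            }
}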
40. The new system provides marginal improvement over the original under the transitive closure stressmark.
41. The LU factorization stressmark is used by linear system solvers.
- Factors a matrix into a lower triangular matrix and an upper triangular matrix.
- The matrix is divided into blocks (see the sketch below):
  - The pivot block is factored.
  - The pivot column and row blocks are divided by the pivot.
  - The inner active matrix blocks are modified by the pivot row and column blocks.
[Figure: block layout of the matrix - pivot block, pivot row, pivot column, and inner active matrix]
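For reference, the sketch below is a plain, unblocked, in-place LU factorization without pivot-row exchanges; it only shows which updates the stressmark performs. In the blocked parallel version described above, the k loop runs over pivot blocks, and the column and trailing-submatrix updates become the pivot-column, pivot-row, and inner-active-matrix steps.

/* In-place LU factorization of an n x n row-major matrix A (no row
 * exchanges): on return the strict lower triangle holds L (unit diagonal
 * implied) and the upper triangle holds U. */
void lu_factor(int n, double *A)
{
    for (int k = 0; k < n; k++) {                       /* pivot element     */
        for (int i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];               /* pivot-column step */
            for (int j = k + 1; j < n; j++)             /* inner active      */
                A[i * n + j] -= A[i * n + k] * A[k * n + j];  /* matrix update */
        }
    }
}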
42. The new system outperforms the original under the LU factorization stressmark.
43. Send-and-Receive messaging for SCMP is worthwhile.
- Fixes problems with the original SCMP messaging system:
  - Global buffer pointers
  - Races between Data and SYNC messages
  - Request thread storms
- The programming model is more familiar.
- Performance is better.

Questions?