Title: Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
1. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
- Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
- Presented by Sarah Lynn Bird
2. Scalar Operand Networks
- A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph
- Physical interconnection network
- Operation-operand matching system
3. Example Scalar Operand Networks
- Register file
- Raw microprocessor
4. Design Issues
- Delay scalability
  - Intra-component delay
  - Inter-component delay
  - Managing latency
- Bandwidth scalability
- Deadlock and starvation
- Efficient operation-operand matching
- Handling exceptional events
5. Operation-Operand Matching
- 5-tuple of costs: <SO, SL, NHL, RL, RO>
- SO: send occupancy
  - The number of cycles that the ALU wastes in sending
- SL: send latency
  - The number of cycles of delay for the message on the send side of the network
- NHL: network hop latency
  - The number of cycles of delay per network hop
- RL: receive latency
  - The number of cycles of delay between the arrival of the final input and the issue of the consuming instruction
- RO: receive occupancy
  - The number of cycles that the ALU wastes in employing a remote value
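A minimal sketch of how these five components might be combined, assuming the common reading that SO and RO consume issue slots at the endpoint ALUs while SL, NHL, and RL contribute in-flight delay; the class name, helper methods, and 3-hop example below are my own illustration, not code or a formula from the paper.

```python
# Sketch of the <SO, SL, NHL, RL, RO> cost model described above.
# Assumption: occupancies (SO, RO) burn issue slots at the two endpoint ALUs,
# while SL + hops*NHL + RL is the in-flight delay of the operand itself.

from dataclasses import dataclass

@dataclass(frozen=True)
class SonCosts:
    so: int   # send occupancy: cycles the sending ALU wastes on the send
    sl: int   # send latency: delay on the send side of the network
    nhl: int  # network hop latency: delay per network hop
    rl: int   # receive latency: delay from last-input arrival to instruction issue
    ro: int   # receive occupancy: cycles the receiving ALU wastes to employ the value

    def transport_latency(self, hops: int) -> int:
        """In-flight delay for an operand crossing `hops` network hops."""
        return self.sl + hops * self.nhl + self.rl

    def endpoint_occupancy(self) -> int:
        """Issue slots lost at the sender and receiver, independent of distance."""
        return self.so + self.ro

# <0, 1, 1, 1, 0> is the baseline the later occupancy sweeps vary around;
# the 3-hop distance is purely illustrative.
base = SonCosts(so=0, sl=1, nhl=1, rl=1, ro=0)
print(base.transport_latency(hops=3))   # 1 + 3*1 + 1 = 5 cycles
print(base.endpoint_occupancy())        # 0 wasted issue slots
```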
6. Raw Design
- 2 static networks
  - Switch instructions from a 64KB cache
  - Point-to-point operand transport
- 2 dynamic networks
  - Memory traffic, interrupts, user-level messages
- 8-stage in-order single-issue pipeline
- 4-stage pipelined FPU
- 32KB data cache
- 32KB instruction cache
- 16 cores on a chip
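The parameters on this slide can be collected in one place; the sketch below simply records the numbers above (the dataclass and field names are my own, not an interface from the paper or the Raw toolchain).

```python
# Summary of the Raw tile/chip parameters listed on this slide.
# Field names are mine; values come directly from the bullets above.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class RawTile:
    pipeline_stages: int = 8        # in-order, single-issue
    fpu_stages: int = 4             # pipelined FPU
    dcache_kb: int = 32
    icache_kb: int = 32
    static_networks: int = 2        # point-to-point operand transport
    switch_icache_kb: int = 64      # instructions for the static networks
    dynamic_networks: int = 2       # memory traffic, interrupts, user-level messages

@dataclass(frozen=True)
class RawChip:
    tiles: int = 16
    tile: RawTile = field(default_factory=RawTile)

print(RawChip())
```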
7. Experiments
- Beetle, a cycle-accurate simulator
  - Actual scalar operand network
  - Parameterized scalar operand network without contention
  - Data cache misses modeled correctly
  - Assumes no instruction cache misses
- Memory model
  - Compiler maps memory to tiles
  - Each location has one home site
- Benchmarks
  - From Spec92, Spec95, and the Raw benchmark suite
  - Dense matrix codes and one secure hash algorithm
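To make the "one home site per location" idea concrete, here is a sketch using a simple low-order-interleaved placement; the real placement is chosen statically by the Raw compiler, so the granularity and policy below are purely illustrative assumptions.

```python
# Every memory location is owned by exactly one home tile.
# The low-order interleaving of cache lines below is a stand-in policy;
# the Raw compiler actually assigns homes statically, so apart from the
# tile count, nothing here reflects the compiler's real placement algorithm.

NUM_TILES = 16      # 16-tile Raw chip, as on the earlier slide
LINE_BYTES = 32     # assumed home-assignment granularity

def home_tile(addr: int) -> int:
    """Return the single tile responsible for the line containing addr."""
    return (addr // LINE_BYTES) % NUM_TILES

# Addresses in the same line share a home site; neighboring lines do not.
assert home_tile(0x1000) == home_tile(0x1004)
assert home_tile(0x1000) != home_tile(0x1020)
```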
8. Benchmark Scaling

  Benchmark      2 tiles  4 tiles  8 tiles  16 tiles  32 tiles  64 tiles
  cholesky        1.622    3.234    5.995     9.185    11.898    12.934
  vpenta          1.714    3.112    6.093    12.132    24.172    44.872
  mxm             1.933    3.731    6.207     8.900    14.836    20.472
  fpppp-kernel    1.511    3.336    5.724     6.143     5.988     6.536
  sha             1.123    1.955    1.976     2.321     2.536     2.523
  swim            1.601    2.624    4.691     8.301    17.090    28.889
  jacobi          1.430    2.757    4.953     9.304    15.881    22.756
  life            1.807    3.365    6.436    12.049    21.081    36.095

- Benchmark speedups on many tiles relative to the speed of the benchmark on one tile
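As a quick way to read the table, the sketch below converts two of its rows into parallel efficiency (speedup divided by tile count); the speedups are copied from the table, while the efficiency framing is my own addition.

```python
# Parallel efficiency = speedup / tiles, for two rows of the table above.
tile_counts = [2, 4, 8, 16, 32, 64]
speedups = {
    "vpenta": [1.714, 3.112, 6.093, 12.132, 24.172, 44.872],
    "sha":    [1.123, 1.955, 1.976, 2.321, 2.536, 2.523],
}

for name, s in speedups.items():
    efficiency = [round(x / n, 2) for x, n in zip(s, tile_counts)]
    print(name, efficiency)

# vpenta holds roughly 0.70-0.86 efficiency out to 64 tiles, while sha falls
# from about 0.56 at 2 tiles to about 0.04 at 64, matching the scaling above.
```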
9. Effect of Send and Receive Occupancy
- 64 tiles
- Parameterized network without contention
- <n, 1, 1, 1, 0> and <0, 1, 1, 1, n>
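A small standalone sketch of what these two sweeps vary, assuming the tuple components combine as on slide 5; the helper function and the chosen values of n are illustrative, not the experiment's actual code.

```python
# <n,1,1,1,0> sweeps send occupancy; <0,1,1,1,n> sweeps receive occupancy,
# both around the <0,1,1,1,0> baseline. Either way, each communicated operand
# costs n extra issue slots at an endpoint ALU.

def tuple_cost(so, sl, nhl, rl, ro, hops=1):
    latency = sl + hops * nhl + rl   # in-flight delay of the operand
    occupancy = so + ro              # issue slots lost at the endpoints
    return latency, occupancy

for n in (0, 1, 2, 4, 8):
    print(n, tuple_cost(n, 1, 1, 1, 0), tuple_cost(0, 1, 1, 1, n))
```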
10. Effect of Send and Receive Latencies
- Applications with coarser-grain parallelism are less sensitive to send/receive latencies
- Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies
11. Other Experiments
- Comparison with other networks
12. Conclusions
- Many difficult issues arise in designing scalar operand networks
- Send and receive occupancies have the biggest impact on performance
- Network contention, multicast, and send/receive latencies have a smaller impact