Title: Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
1. Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
- Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
- Presented by Sarah Lynn Bird
2. Scalar Operand Networks
- A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph
- Physical interconnection network
- Operation-operand matching system
3. Example Scalar Operand Networks
- Register file
- Raw microprocessor
4. Design Issues
- Delay scalability
  - Intra-component delay
  - Inter-component delay
  - Managing latency
- Bandwidth scalability
- Deadlock and starvation
- Efficient operation-operand matching
- Handling exceptional events
5. Operation-Operand Matching
- 5-tuple of costs: <SO, SL, NHL, RL, RO>
- SO: send occupancy
  - The number of cycles that the ALU wastes in sending
- SL: send latency
  - The number of cycles of delay for the message on the send side of the network
- NHL: network hop latency
  - The number of cycles of delay per network hop
- RL: receive latency
  - The number of cycles of delay between the arrival of the final input and the issue of the consuming instruction
- RO: receive occupancy
  - The number of cycles that the ALU wastes in employing a remote value
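A minimal sketch of how these five components might be combined, assuming the common reading that SO and RO consume issue slots at the endpoint ALUs while SL, NHL, and RL contribute in-flight delay; the class name, helper methods, and 3-hop example below are my own illustration, not code or a formula from the paper.

```python
# Sketch of the <SO, SL, NHL, RL, RO> cost model described above.
# Assumption: occupancies (SO, RO) burn issue slots at the two endpoint ALUs,
# while SL + hops*NHL + RL is the in-flight delay of the operand itself.

from dataclasses import dataclass

@dataclass(frozen=True)
class SonCosts:
    so: int   # send occupancy: cycles the sending ALU wastes on the send
    sl: int   # send latency: delay on the send side of the network
    nhl: int  # network hop latency: delay per network hop
    rl: int   # receive latency: delay from last-input arrival to instruction issue
    ro: int   # receive occupancy: cycles the receiving ALU wastes to employ the value

    def transport_latency(self, hops: int) -> int:
        """In-flight delay for an operand crossing `hops` network hops."""
        return self.sl + hops * self.nhl + self.rl

    def endpoint_occupancy(self) -> int:
        """Issue slots lost at the sender and receiver, independent of distance."""
        return self.so + self.ro

# <0, 1, 1, 1, 0> is the baseline the later occupancy sweeps vary around;
# the 3-hop distance is purely illustrative.
base = SonCosts(so=0, sl=1, nhl=1, rl=1, ro=0)
print(base.transport_latency(hops=3))   # 1 + 3*1 + 1 = 5 cycles
print(base.endpoint_occupancy())        # 0 wasted issue slots
```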
6. Raw Design
- 2 static networks
  - Switch instructions from a 64KB cache
  - Point-to-point operand transport
- 2 dynamic networks
  - Memory traffic, interrupts, user-level messages
- 8-stage in-order single-issue pipeline
- 4-stage pipelined FPU
- 32KB data cache
- 32KB instruction cache
- 16 cores on a chip
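The parameters on this slide can be collected in one place; the sketch below simply records the numbers above (the dataclass and field names are my own, not an interface from the paper or the Raw toolchain).

```python
# Summary of the Raw tile/chip parameters listed on this slide.
# Field names are mine; values come directly from the bullets above.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class RawTile:
    pipeline_stages: int = 8        # in-order, single-issue
    fpu_stages: int = 4             # pipelined FPU
    dcache_kb: int = 32
    icache_kb: int = 32
    static_networks: int = 2        # point-to-point operand transport
    switch_icache_kb: int = 64      # instructions for the static networks
    dynamic_networks: int = 2       # memory traffic, interrupts, user-level messages

@dataclass(frozen=True)
class RawChip:
    tiles: int = 16
    tile: RawTile = field(default_factory=RawTile)

print(RawChip())
```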
7. Experiments
- Beetle, a cycle-accurate simulator
  - Actual scalar operand network
  - Parameterized scalar operand network without contention
  - Data cache misses modeled correctly
  - Assumes no instruction cache misses
- Memory model
  - Compiler maps memory to tiles
  - Each location has one home site
- Benchmarks
  - From Spec92, Spec95, and the Raw benchmark suite
  - Dense matrix codes and one secure hash algorithm
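To make the "one home site per location" idea concrete, here is a sketch using a simple low-order-interleaved placement; the real placement is chosen statically by the Raw compiler, so the granularity and policy below are purely illustrative assumptions.

```python
# Every memory location is owned by exactly one home tile.
# The low-order interleaving of cache lines below is a stand-in policy;
# the Raw compiler actually assigns homes statically, so apart from the
# tile count, nothing here reflects the compiler's real placement algorithm.

NUM_TILES = 16      # 16-tile Raw chip, as on the earlier slide
LINE_BYTES = 32     # assumed home-assignment granularity

def home_tile(addr: int) -> int:
    """Return the single tile responsible for the line containing addr."""
    return (addr // LINE_BYTES) % NUM_TILES

# Addresses in the same line share a home site; neighboring lines do not.
assert home_tile(0x1000) == home_tile(0x1004)
assert home_tile(0x1000) != home_tile(0x1020)
```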
8. Benchmark Scaling

  Benchmark      2 tiles  4 tiles  8 tiles  16 tiles  32 tiles  64 tiles
  cholesky        1.622    3.234    5.995     9.185    11.898    12.934
  vpenta          1.714    3.112    6.093    12.132    24.172    44.872
  mxm             1.933    3.731    6.207     8.900    14.836    20.472
  fpppp-kernel    1.511    3.336    5.724     6.143     5.988     6.536
  sha             1.123    1.955    1.976     2.321     2.536     2.523
  swim            1.601    2.624    4.691     8.301    17.090    28.889
  jacobi          1.430    2.757    4.953     9.304    15.881    22.756
  life            1.807    3.365    6.436    12.049    21.081    36.095

- Benchmark speedups on many tiles relative to the speed of the benchmark on one tile
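As a quick way to read the table, the sketch below converts two of its rows into parallel efficiency (speedup divided by tile count); the speedups are copied from the table, while the efficiency framing is my own addition.

```python
# Parallel efficiency = speedup / tiles, for two rows of the table above.
tile_counts = [2, 4, 8, 16, 32, 64]
speedups = {
    "vpenta": [1.714, 3.112, 6.093, 12.132, 24.172, 44.872],
    "sha":    [1.123, 1.955, 1.976, 2.321, 2.536, 2.523],
}

for name, s in speedups.items():
    efficiency = [round(x / n, 2) for x, n in zip(s, tile_counts)]
    print(name, efficiency)

# vpenta holds roughly 0.70-0.86 efficiency out to 64 tiles, while sha falls
# from about 0.56 at 2 tiles to about 0.04 at 64, matching the scaling above.
```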
9. Effect of Send and Receive Occupancy
- 64 tiles
- Parameterized network without contention
- <n, 1, 1, 1, 0> and <0, 1, 1, 1, n>
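A small standalone sketch of what these two sweeps vary, assuming the tuple components combine as on slide 5; the helper function and the chosen values of n are illustrative, not the experiment's actual code.

```python
# <n,1,1,1,0> sweeps send occupancy; <0,1,1,1,n> sweeps receive occupancy,
# both around the <0,1,1,1,0> baseline. Either way, each communicated operand
# costs n extra issue slots at an endpoint ALU.

def tuple_cost(so, sl, nhl, rl, ro, hops=1):
    latency = sl + hops * nhl + rl   # in-flight delay of the operand
    occupancy = so + ro              # issue slots lost at the endpoints
    return latency, occupancy

for n in (0, 1, 2, 4, 8):
    print(n, tuple_cost(n, 1, 1, 1, 0), tuple_cost(0, 1, 1, 1, n))
```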
10. Effect of Send and Receive Latencies
- Applications with coarser-grain parallelism are less sensitive to send/receive latencies
- Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies
11. Other Experiments
- Comparison with other networks
12. Conclusions
- Many difficult issues arise in designing scalar operand networks
- Send and receive occupancies have the biggest impact on performance
- Network contention, multicast, and send/receive latencies have a smaller impact