Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures

1
Scalar Operand Networks: On-Chip Interconnect
for ILP in Partitioned Architectures
  • Michael Bedford Taylor, Walter Lee, Saman
    Amarasinghe, Anant Agarwal
  • Presented By Sarah Lynn Bird

2
Scalar Operand Networks
  • A set of mechanisms that joins the dynamic
    operands and operations of a program in space to
    enact the computation specified by a program
    graph
  • Physical Interconnection Network
  • Operation-operand matching system

3
Example Scalar Operand Networks
  • Register file
  • Raw microprocessor
4
Design Issues
  • Delay Scalability
  • Intra-component delay
  • Inter-component delay
  • Managing latency
  • Bandwidth Scalability
  • Deadlock and Starvation
  • Efficient Operation-Operand Matching
  • Handling Exceptional Events

5
Operation-Operand Matching
  • 5-tuple of costs <SO, SL, NHL, RL, RO> (a worked
    cost sketch follows this list)
  • SO (send occupancy): the number of cycles the
    sending ALU wastes injecting an operand into the
    network
  • SL (send latency): the number of cycles of delay a
    message incurs on the send side of the network
  • NHL (network hop latency): the number of cycles of
    delay per network hop
  • RL (receive latency): the number of cycles of delay
    between the arrival of the final input and its
    consumption by the waiting instruction
  • RO (receive occupancy): the number of cycles the
    receiving ALU wastes before it can use a remote
    value
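To make the five cost terms concrete, here is a minimal sketch (not taken from the slides) of how they might be combined for a single remote operand: the latency terms add to the dependent instruction's critical path, while the occupancy terms consume issue slots on the communicating ALUs. The hop count and the example tuple values are assumptions for illustration.

```python
# Hedged sketch of the <SO, SL, NHL, RL, RO> cost model for one remote operand.
# 'hops' is the network distance between the producer and consumer tiles.

def operand_transport_cost(so, sl, nhl, rl, ro, hops):
    """Return (critical_path_latency, alu_issue_slots_lost) in cycles."""
    latency = sl + nhl * hops + rl    # delay seen by the dependent instruction
    occupancy = so + ro               # cycles wasted on the send/receive ALUs
    return latency, occupancy

# Example with the <0, 1, 1, 1, 0> baseline implied by slide 9 and an assumed
# 3-hop transfer: 1 + 3*1 + 1 = 5 cycles of latency, no issue slots lost.
print(operand_transport_cost(0, 1, 1, 1, 0, hops=3))   # -> (5, 0)
```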

6
Raw Design
  • 2 Static Networks
  • Instructions from a 64KB cache
  • Point-to-point for operand transport
  • 2 Dynamic networks
  • Memory traffic, interrupts, user-level messages
  • 8-stage in-order single-issue pipeline
  • 4-stage pipelined FPU
  • 32KB data cache
  • 32KB instruction cache
  • 16 Cores on a Chip

7
Experiments
  • Beetle, a cycle-accurate simulator
  • Actual scalar operand network
  • Parameterized scalar operand network without
    contention
  • Data cache misses are modeled accurately
  • Instruction cache misses are assumed not to occur
  • Memory model
  • Compiler maps memory to tiles
  • Each location has one home site (see the sketch
    after this list)
  • Benchmarks
  • From Spec92, Spec95, and the Raw benchmark suite
  • Dense matrix codes and one secure hash algorithm
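The "one home site per location" memory model can be pictured with a small sketch. The mapping below is purely hypothetical: the Raw compiler assigns home tiles statically at compile time, whereas this stand-in uses simple cache-line interleaving so the example is runnable. The line size is an assumption, not taken from the slides.

```python
# Hypothetical illustration of "each memory location has one home site".
# The modulo interleaving is a stand-in, not the Raw compiler's actual policy.

NUM_TILES = 16           # Raw has 16 tiles on a chip (slide 6)
CACHE_LINE_BYTES = 32    # assumed line size, not taken from the slides

def home_tile(address):
    """Return the single tile responsible for the line containing 'address'."""
    line = address // CACHE_LINE_BYTES
    return line % NUM_TILES

# Two adjacent cache lines land on different home tiles.
print(home_tile(0x1000), home_tile(0x1020))   # -> 0 1
```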

8
Benchmark Scaling
  Benchmark          2      4      8     16     32     64
  cholesky       1.622  3.234  5.995  9.185 11.898 12.934
  vpenta         1.714  3.112  6.093 12.132 24.172 44.872
  mxm            1.933  3.731  6.207  8.900 14.836 20.472
  fpppp-kernel   1.511  3.336  5.724  6.143  5.988  6.536
  sha            1.123  1.955  1.976  2.321  2.536  2.523
  swim           1.601  2.624  4.691  8.301 17.090 28.889
  jacobi         1.430  2.757  4.953  9.304 15.881 22.756
  life           1.807  3.365  6.436 12.049 21.081 36.095
  • Benchmark speedups on many tiles relative to the
    speed of the benchmark on one tile
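As a quick way to read the table, the sketch below recomputes parallel efficiency (speedup divided by tile count) at 64 tiles. The numbers are copied from the table above; the efficiency metric itself is added here for illustration.

```python
# Parallel efficiency at 64 tiles, derived from the speedup table above.

speedup_64 = {
    "cholesky": 12.934, "vpenta": 44.872, "mxm": 20.472,
    "fpppp-kernel": 6.536, "sha": 2.523, "swim": 28.889,
    "jacobi": 22.756, "life": 36.095,
}

for name, s in sorted(speedup_64.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} speedup {s:6.2f}  efficiency {s / 64:.2f}")
# vpenta reaches roughly 0.70 efficiency, while sha stays near 0.04,
# reflecting how much exploitable parallelism each benchmark exposes.
```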

9
Effect of Send and Receive Occupancy
  • 64 tiles
  • Parameterized network without contention
  • Cost tuples <n, 1, 1, 1, 0> and <0, 1, 1, 1, n>
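These two tuples sweep send and receive occupancy around the <0, 1, 1, 1, 0> baseline while keeping all latency terms fixed. The sketch below (reusing the cost composition assumed after slide 5; the values of n are illustrative) shows that per-operand critical-path latency stays constant while wasted issue slots grow linearly with n.

```python
# Sweeps from slide 9: <n,1,1,1,0> charges n extra cycles to the sending ALU,
# <0,1,1,1,n> charges n extra cycles to the receiving ALU; latency is unchanged.

def per_operand_cost(so, sl, nhl, rl, ro, hops=1):
    return {"latency": sl + nhl * hops + rl, "occupancy": so + ro}

for n in (0, 1, 2, 4, 8):
    print(n, per_operand_cost(n, 1, 1, 1, 0), per_operand_cost(0, 1, 1, 1, n))
```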

10
Effect of Send or Receive Latencies
  • Applications with coarser-grained parallelism are
    less sensitive to send/receive latencies
  • Overall, applications are less sensitive to
    send/receive latencies than to send/receive
    occupancies

11
Other Experiments
  • Increasing Hop Latency
  • Removing Contention
  • Comparing with Other networks

12
Conclusions
  • Many difficult issues arise in designing scalar
    operand networks
  • Send and receive occupancies have the biggest
    impact on performance
  • Network contention, multicast, and send/receive
    latencies have a smaller impact