Title: Prediction Router: Yet another low-latency on-chip router architecture
1. Prediction Router
Yet another low-latency on-chip router architecture
- Hiroki Matsutani (Keio Univ., Japan)
- Michihiro Koibuchi (NII, Japan)
- Hideharu Amano (Keio Univ., Japan)
- Tsutomu Yoshinaga (UEC, Japan)
2. Why is a low-latency router needed?
- Tile architecture
- Many cores (e.g., processors and caches)
- On-chip interconnection network [Dally, DAC'01]
[Figure: 16-core tile architecture; each tile has a core and a router, and the routers form a packet-switched network]
The on-chip router affects the performance and cost of the chip
3. Why is a low-latency router needed? (cont.)
Low-latency router architectures have been extensively studied
4. Outline: Prediction router for low-latency NoC
- Existing low-latency routers
  - Speculative router
  - Look-ahead router
  - Bypassing router
- Prediction router
  - Architecture and the prediction algorithms
  - Hit rate analysis
- Evaluations
  - Hit rate, gate count, and energy consumption
  - Case study 1: 2-D mesh (small core size)
  - Case study 2: 2-D mesh (large core size)
  - Case study 3: Fat tree network
5. Wormhole router: Hardware structure
[Figure: wormhole router with five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, an arbiter, and a 5x5 crossbar connected to five output ports (X+, X-, Y+, Y-, CORE)]
Routing, arbitration, and switch traversal are performed in a pipelined manner
6. Pipeline structure: 3-cycle router
- At least 3 cycles to traverse a router
  - RC (Routing computation)
  - VSA (Virtual channel and switch allocations)
  - ST (Switch traversal)
- A packet transfer from router (a) to router (c)
VA and SA are speculatively performed in parallel
[Figure: pipeline timing diagram at routers A, B, and C. The HEAD flit goes through RC, VSA, ST at each router; DATA 1-3 go through SA, ST at each router. x-axis: elapsed time [cycles], 1 to 12]
To perform RC and VSA in parallel, look-ahead routing is used
At least 12 cycles to transfer a packet from router (a) to router (c)
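The 12-cycle figure can be checked with a small calculation. This is a minimal sketch, assuming no contention, no link delay, and data flits that follow the head flit one cycle apart; the function name `packet_latency` is made up for illustration.

```python
def packet_latency(num_routers: int, cycles_per_router: int, num_flits: int) -> int:
    """Zero-load cycles until the last flit leaves the last router.

    Assumption: the head flit spends cycles_per_router in every router and
    each following flit trails the previous one by exactly one cycle.
    """
    if min(num_routers, cycles_per_router, num_flits) < 1:
        raise ValueError("all arguments must be positive")
    head_latency = num_routers * cycles_per_router
    return head_latency + (num_flits - 1)


# Routers A, B, C; 3-cycle pipeline; HEAD + DATA 1-3
print(packet_latency(num_routers=3, cycles_per_router=3, num_flits=4))  # 12
```

With `cycles_per_router=2` the same call gives 9, which matches the 2-cycle router discussed on a later slide.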
7. Look-ahead router: RC/VA in parallel
- At least 3 cycles to traverse a router
  - NRC (Next routing computation)
  - VSA (Virtual channel and switch allocations)
  - ST (Switch traversal)
VSA can be performed without waiting for NRC
Routing computation for the next hop → the output port of router (i+1) is selected by router i
[Figure: pipeline timing diagram at routers A, B, and C. The HEAD flit goes through NRC, VSA, ST at each router; DATA 1-3 go through SA, ST at each router. x-axis: elapsed time [cycles], 1 to 12]
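Look-ahead routing can be sketched in a few lines. The example below assumes a 2-D mesh with XY (dimension-order) routing; the port names, the coordinate convention, and the function names `xy_route` and `next_routing_computation` are illustrative and not taken from the slides.

```python
from typing import Tuple

Coord = Tuple[int, int]

# Illustrative port names; "CORE" is the local ejection port.
STEP = {"X+": (1, 0), "X-": (-1, 0), "Y+": (0, 1), "Y-": (0, -1)}


def xy_route(current: Coord, dest: Coord) -> str:
    """Dimension-order (XY) routing: finish the X dimension, then Y."""
    (cx, cy), (dx, dy) = current, dest
    if cx != dx:
        return "X+" if dx > cx else "X-"
    if cy != dy:
        return "Y+" if dy > cy else "Y-"
    return "CORE"


def next_routing_computation(current: Coord, out_port: str, dest: Coord) -> str:
    """NRC at router i: compute the output port that router i+1 will use."""
    if out_port == "CORE":
        raise ValueError("packet is ejected here; there is no next router")
    sx, sy = STEP[out_port]
    neighbor = (current[0] + sx, current[1] + sy)
    return xy_route(neighbor, dest)


# A packet at (0, 0) leaving through X+ toward destination (2, 1):
print(next_routing_computation((0, 0), "X+", (2, 1)))  # X+
```

Because the output port for router i is already carried by the packet, router i can start VSA immediately while NRC prepares the port for router i+1.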
8. Look-ahead router: RC/VA in parallel
- At least 2 cycles to traverse a router
  - NRC + VSA (Next routing computation / arbitrations)
  - ST (Switch traversal)
No dependency between NRC and VSA → NRC and VSA are performed in parallel [Dally's book, 2004]
Typical example of a 2-cycle router
[Figure: pipeline timing diagram at routers A, B, and C. The HEAD flit goes through NRC and VSA in one cycle, then ST, at each router; DATA 1-3 follow. x-axis: elapsed time [cycles], 1 to 9]
Packing NRC, VSA, and ST into a single stage → the operating frequency is harmed
At least 9 cycles to transfer a packet from router (a) to router (c)
9. Bypassing router: skip some stages
- Bypassing between intermediate nodes
  - E.g., Express VCs [Kumar, ISCA'07]
[Figure: path from SRC to DST through routers labeled 3-cycle]
10. Bypassing router: skip some stages
- Bypassing between intermediate nodes
  - E.g., Express VCs [Kumar, ISCA'07]
- Pipeline bypassing utilizing the regularity of DOR
  - E.g., Mad postman [Izu, PDP'94]
- Pipeline stages on frequently used paths are skipped
  - E.g., Dynamic fast path [Park, HOTI'07]
- Pipeline stages on user-specific paths are skipped
  - E.g., Preferred path [Michelogiannakis, NOCS'07]
  - E.g., DBP [Koibuchi, NOCS'08]
[Figure: path from SRC to DST through routers labeled 3-cycle]
We propose a low-latency router based on multiple predictors
12. Prediction router for 1-cycle transfer [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
- Each input channel has predictors
- When an input channel is idle,
  - Predict an output port to be used (RC pre-execution)
  - Arbitrate for the predicted port (SA pre-execution)
RC and VSA are skipped if the prediction hits → 1-cycle transfer
[Figure: pipeline timing diagram at routers A, B, and C. The HEAD flit goes through RC, VSA, ST at each router; DATA 1-3 go through ST. x-axis: elapsed time [cycles], 1 to 12]
E.g., we can expect a 1.6-cycle transfer if 70% of predictions hit
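The 1.6-cycle estimate is the expected per-router latency when a hit costs 1 cycle and a miss costs 3 cycles. A minimal sketch of that arithmetic follows; the function name and default arguments are illustrative.

```python
def expected_cycles_per_router(hit_rate: float,
                               hit_cycles: int = 1,
                               miss_cycles: int = 3) -> float:
    """Expected cycles per router for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


print(f"{expected_cycles_per_router(0.70):.1f}")  # 1.6
```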
13. Prediction router for 1-cycle transfer (cont.)
[Figure: the same timing diagram with one router marked MISS; the HEAD flit still goes through RC, VSA, ST at every router]
14. Prediction router for 1-cycle transfer (cont.)
[Figure: the same timing diagram with routers marked HIT and MISS; the HEAD flit needs only ST at the router marked HIT]
15. Prediction router for 1-cycle transfer (cont.)
[Figure: the same timing diagram with routers marked HIT, HIT, and MISS; the HEAD flit goes through RC, VSA, ST at one router and ST only at the two routers marked HIT]
16. Prediction router: Prediction algorithms [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
- An efficient predictor is key
- Prediction router
  - Multiple predictors for each input channel
  - Select one of them in response to a given network environment
A single predictor isn't enough for applications with different traffic patterns
17. Basic operation @ Correct prediction
2nd cycle: the next flit is transferred to X+ without RC and VSA
1-cycle transfer using the reserved crossbar port when the prediction hits
18. Basic operation @ Misprediction
2nd/3rd cycle: the dead flit is removed; the flit is retransmitted to the correct port
More energy is consumed for the retransmission
Even with a misprediction, a flit is transferred in 3 cycles, as in the original router
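The hit and miss cases on the last two slides can be summarized as a small behavioral model. This is only a sketch under simplifying assumptions: one dead flit is created per misprediction, and the names `TransferResult`, `transfer_head_flit`, and `transfer_along_path` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class TransferResult:
    cycles: int      # cycles spent in the router(s)
    dead_flits: int  # flits sent to a wrong port and later killed


def transfer_head_flit(predicted_port: str, actual_port: str,
                       hit_cycles: int = 1, miss_cycles: int = 3) -> TransferResult:
    """One router: 1 cycle on a hit; 3 cycles plus a killed flit on a miss."""
    if predicted_port == actual_port:
        return TransferResult(hit_cycles, 0)
    return TransferResult(miss_cycles, 1)


def transfer_along_path(hops: Iterable[Tuple[str, str]]) -> TransferResult:
    """Accumulate results over (predicted_port, actual_port) pairs."""
    results = [transfer_head_flit(p, a) for p, a in hops]
    return TransferResult(sum(r.cycles for r in results),
                          sum(r.dead_flits for r in results))


# One miss followed by two hits
path = [("X+", "Y+"), ("Y+", "Y+"), ("Y+", "Y+")]
print(transfer_along_path(path))  # TransferResult(cycles=5, dead_flits=1)
```

The dead-flit count is a rough proxy for the extra retransmission energy mentioned above; the slides do not give a per-flit energy model.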
20. Prediction hit rate analysis
- Formulas to calculate the prediction hit rates on
  - 2-D torus (Random, LP, SS, FCM, and SPM)
  - 2-D mesh (Random, LP, SS, FCM, and SPM)
  - Fat tree (Random and LRU)
- To forecast which prediction algorithm is suited for a given network environment without simulations
- Accuracy of the analytical model is confirmed through simulations
The derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)
22. Evaluation items
[Figure: evaluation flow. Flit-level network simulation → hit rate / communication latency (how many cycles?). Design Compiler (synthesis), Astro (place and route), NC-Verilog (simulation), and Power Compiler with the Fujitsu 65nm library, exchanging SDF and SAIF files → area (gate count) and energy consumption [pJ/bit]]
Table 1: Router and network parameters
Table 2: Process library
Table 3: CAD tools used
Topology and traffic are mentioned later
23. 3 case studies of prediction router
[Figure: evaluation flow, as on slide 22]
2-D mesh network (case studies 1 and 2)
- The most popular network topology
  - MIT's RAW [Taylor, ISCA'04]
  - Intel's 80-core [Vangal, ISSCC'07]
- Dimension-order routing (XY routing)
- → Here, we show the results of case studies 1 and 2 together
Fat tree network (case study 3)
24. Case study 1: Zero-load communication latency
- Original router
- Pred router (SS)
- Pred router (100% hit)
Uniform random traffic on 4x4 to 16x16 meshes
(*) 1-cycle transfer for a correct prediction, 3-cycle transfer for a wrong prediction
Simulation results (the analytical model also shows the same result)
[Figure: communication latency [cycles] vs. network size (k-ary 2-mesh)]
The latency reduction grows as the network size increases (48% for k=16)
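The trend (larger meshes benefit more from SS) can be illustrated with a toy enumeration. This is not the analytical model or the simulator from the talk: it counts router cycles only, assumes XY routing over all source/destination pairs, and treats injection from the core, turns, and ejection to the core as SS misses. All names are hypothetical.

```python
from itertools import product


def xy_path_ports(src, dst):
    """Yield (in_port, out_port) for each router on the XY path src -> dst.

    Ports are named by travel direction, so going straight means
    in_port == out_port. "CORE" is the local injection/ejection port.
    """
    x, y = src
    in_port = "CORE"
    while (x, y) != dst:
        if x != dst[0]:                      # finish the X dimension first
            out_port = "X+" if dst[0] > x else "X-"
            x += 1 if out_port == "X+" else -1
        else:
            out_port = "Y+" if dst[1] > y else "Y-"
            y += 1 if out_port == "Y+" else -1
        yield in_port, out_port
        in_port = out_port
    yield in_port, "CORE"                    # ejection at the destination


def zero_load_stats(k, hit_cycles=1, miss_cycles=3):
    """Return (SS hit rate, latency reduction) over all src != dst pairs."""
    routers = hits = 0
    nodes = list(product(range(k), repeat=2))
    for src, dst in product(nodes, repeat=2):
        if src == dst:
            continue
        for in_port, out_port in xy_path_ports(src, dst):
            routers += 1
            hits += in_port == out_port      # SS hits only when going straight
    original = routers * miss_cycles
    predicted = hits * hit_cycles + (routers - hits) * miss_cycles
    return hits / routers, 1 - predicted / original


for k in (4, 8, 16):
    rate, reduction = zero_load_stats(k)
    print(f"k={k}: SS hit rate {rate:.2f}, latency reduction {reduction:.2f}")
```

Because link and core delays are ignored, the printed reductions should not be expected to match the 48% reported for k=16; the sketch only shows why longer straight paths raise the SS hit rate.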
25. Case study 2: Hit rate @ 8x8 mesh
- SS: go straight
- LP: the last one
- FCM: frequently used pattern
[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
26. Case study 2: Hit rate @ 8x8 mesh
- SS (go straight): efficient for long straight communication
- LP (the last one): efficient for short repeated communication
- FCM: frequently used pattern
[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
27. Case study 2: Hit rate @ 8x8 mesh
- SS (go straight): efficient for long straight communication
- LP (the last one): efficient for short repeated communication
- FCM (frequently used pattern): all-rounder
[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
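The three predictors can be sketched as small classes that share a `predict`/`update` interface. This is an illustrative software model, not the hardware design: the class names are made up, and FCM is interpreted here as a table indexed by the recent output-port history (an assumption; the slide only says "frequently used pattern").

```python
from collections import Counter, defaultdict, deque


class StraightPredictor:
    """SS: predict that the packet keeps going in its travel direction."""

    def __init__(self, travel_direction):
        self.travel_direction = travel_direction

    def predict(self):
        return self.travel_direction

    def update(self, actual_port):
        pass  # stateless


class LastPortPredictor:
    """LP: predict the output port used by the previous packet."""

    def __init__(self):
        self.last = None

    def predict(self):
        return self.last

    def update(self, actual_port):
        self.last = actual_port


class FrequentPatternPredictor:
    """FCM-style: predict the port that most often followed the recent history."""

    def __init__(self, order=2):
        self.history = deque(maxlen=order)
        self.table = defaultdict(Counter)

    def predict(self):
        counts = self.table.get(tuple(self.history))
        return counts.most_common(1)[0][0] if counts else None

    def update(self, actual_port):
        self.table[tuple(self.history)][actual_port] += 1
        self.history.append(actual_port)


def hit_rate(predictor, ports):
    """Replay a sequence of actual output ports and return the hit rate."""
    hits = 0
    for actual in ports:
        hits += predictor.predict() == actual
        predictor.update(actual)
    return hits / len(ports)


straight = ["X+"] * 8
alternating = ["X+", "Y+"] * 4
print(hit_rate(StraightPredictor("X+"), straight))         # 1.0
print(hit_rate(LastPortPredictor(), alternating))          # 0.0
print(hit_rate(FrequentPatternPredictor(), alternating))   # 0.5
```

On the alternating sequence LP always predicts the wrong port, while the history-based predictor starts hitting once it has seen each context, which is one way to read the "all-rounder" remark.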
28. Case study 2: Area & Energy
- Area (gate count)
  - Original router
  - Pred router (SS + LP)
  - Pred router (SS + LP + FCM)
Light-weight (small overhead)
Verilog-HDL designs, synthesized with a 65nm library
[Figure: router area [kilo gates] for each router]
Area increases by 6.4% to 15.9%, depending on the type and number of predictors
29. Case study 2: Area & Energy
- Area (gate count)
  - Original router
  - Pred router (SS + LP)
  - Pred router (SS + LP + FCM)
- Energy consumption
  - Original router
  - Pred router (70% hit)
  - Pred router (100% hit)
- This estimation is pessimistic.
  - More energy is consumed in links → the effect of the router energy overhead is reduced
  - The application will finish earlier → more energy is saved
[Figure: router area [kilo gates] and flit switching energy [pJ/bit] for each router]
Area increases by 6.4% to 15.9%, depending on the type and number of predictors
Mispredictions consume power: energy increases by 9.5% if the hit rate is 70%
Latency: 35.8% to 48.2% saved with reasonable area/energy overheads
30. 3 case studies of prediction router
[Figure: evaluation flow, as on slide 22]
2-D mesh network (case studies 1 and 2)
Fat tree network (case study 3)
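31. Case study 3: Fat tree network
[Figure: fat tree with upward (Up) and downward (Down) transfers]
1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
2. LRU + LP algorithm: in addition, LP is used for downward transfer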
32. Case study 3: Fat tree network
- Comm. latency @ uniform
  - Original router
  - Pred router (LRU)
  - Pred router (LRU + LP)
1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
2. LRU + LP algorithm: in addition, LP is used for downward transfer
[Figure: communication latency [cycles] vs. network size (# of cores); fat tree with Up and Down transfers]
Latency: 30.7% reduced @ 256 cores
Small area overhead (7.8%)
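The LRU choice for upward transfer can be sketched with an ordered map that keeps the upward ports from least to most recently used. The class and port names are hypothetical, and the sketch covers only the port-selection bookkeeping, not the router pipeline.

```python
from collections import OrderedDict


class LruUpPortPredictor:
    """Predict the least recently used upward output port."""

    def __init__(self, up_ports):
        if not up_ports:
            raise ValueError("at least one upward port is required")
        # Keys are ordered from least to most recently used.
        self._order = OrderedDict((port, None) for port in up_ports)

    def predict(self):
        return next(iter(self._order))

    def update(self, used_port):
        # Raises KeyError if used_port is not an upward port.
        self._order.move_to_end(used_port)


predictor = LruUpPortPredictor(["UP0", "UP1"])
first = predictor.predict()
predictor.update(first)
print(first, predictor.predict())  # UP0 UP1
```

In the LRU + LP variant, downward transfers would use a last-port predictor instead, since the downward port is determined by the destination.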
33. Summary of the prediction router
- Prediction router for low-latency NoCs
  - Multiple predictors, which can be switched in a cycle
  - Architecture and six prediction algorithms
  - Analytical model of prediction hit rates
- Evaluations of prediction router
  - Case study 1: 2-D mesh (small core size)
  - Case study 2: 2-D mesh (large core size)
  - Case study 3: Fat tree network
- Results (from three case studies)
  1. Prediction router can be applied to various NoCs
  2. Communication latency is reduced with small overheads
  3. Prediction router with multiple predictors can accelerate a wider range of applications
34. Thank you for your attention
It would be very helpful if you would speak
slowly. Thank you in advance.
35. Prediction router: New modifications
- Predictors for each input channel
- Kill mechanism to remove dead flits
- Two-level arbiter
  - "Reservation" → higher priority
  - "Tentative reservation" by the pre-execution of VSA
[Figure: prediction router with input ports X+, X-, Y+, Y-, CORE (each with a FIFO), an arbiter, KILL signals, and a 5x5 crossbar to the output ports]
Currently, the critical path is related to the arbiter
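One way to read the two-level arbiter is that normal reservations always win an output port over tentative reservations made by pre-executed VSA. The sketch below models only that priority rule; the fixed ordering inside each level and the function name are assumptions, since the slide does not describe the arbitration policy within a level.

```python
def two_level_arbitrate(reservations, tentative):
    """Return {output_port: winning_input_channel}.

    reservations: {input_channel: output_port} requested by normal VSA
    tentative:    {input_channel: output_port} requested by VSA pre-execution
    """
    grants = {}
    for level in (reservations, tentative):   # the first level has priority
        for channel in sorted(level):         # fixed order within a level (assumption)
            grants.setdefault(level[channel], channel)
    return grants


grants = two_level_arbitrate(
    reservations={"Y-": "X+"},
    tentative={"X-": "X+", "CORE": "Y+"},
)
print(grants)  # {'X+': 'Y-', 'Y+': 'CORE'}
```

Here the tentative reservation of X+ by input channel X- loses to the normal reservation from Y-, while the uncontended tentative reservation of Y+ is granted.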
36. Prediction router: Predictor selection
- Static scheme
  - A predictor is selected by the user per application
  - Simple, but pre-analysis is needed
- Dynamic scheme
  - A predictor is adaptively selected
  - A counter for each predictor counts up when that predictor hits
  - A predictor is selected every n cycles (e.g., n = 10,000)
  - Flexible, but consumes more energy
[Figure: predictors A, B, and C in each scheme; labels: configuration table, hit counters]
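The dynamic scheme can be sketched as hit counters plus a periodic selection step. The class name, the method names, and the choice to reset the counters after each selection are assumptions made for illustration; the slides only state that counters count up on hits and that a predictor is selected every n cycles.

```python
class DynamicSelector:
    """Pick the predictor with the most hits every `interval` cycles."""

    def __init__(self, predictor_names, interval=10_000):
        self.hits = {name: 0 for name in predictor_names}
        self.interval = interval
        self.cycle = 0
        self.active = predictor_names[0]

    def record(self, predictions, actual_port):
        """predictions: {name: predicted_port}; count up every predictor that hit."""
        for name, port in predictions.items():
            if port == actual_port:
                self.hits[name] += 1

    def tick(self):
        """Advance one cycle; re-select the active predictor at interval boundaries."""
        self.cycle += 1
        if self.cycle % self.interval == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)  # reset (assumption)
        return self.active


selector = DynamicSelector(["A", "B", "C"], interval=4)
for _ in range(3):
    selector.record({"A": "X+", "B": "Y+", "C": "X-"}, actual_port="Y+")
for _ in range(4):
    active = selector.tick()
print(active)  # B
```

All predictors must run in parallel so that their counters can be updated, which is consistent with the note that the dynamic scheme costs more energy than the static one.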
37. Case study 1: Router critical path
- RC: Routing computation
- VSA: Arbitration
- ST: Switch traversal
ST can occur in these stages of the prediction router
[Figure: stage delay [FO4s] for the original router and the pred router (SS)]
The critical path delay increases by 6.2% compared with the original router
38. Case study 2: Hit rate @ 8x8 mesh
- SS (go straight): efficient for long straight communication
- LP (the last one): efficient for short repeated communication
- FCM (frequently used pattern): all-rounder
- Custom (user-specific path): efficient for simple communication
[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffic patterns]
39. Case study 4: Spidergon network
- Spidergon topology [Coppola, ISSOC'04]
  - Ring + across links
  - Each router has 3 ports
  - Mesh-like 2-D layout
  - Across-first routing
40. Case study 4: Spidergon network
- Spidergon topology [Coppola, ISSOC'04]
  - Ring + across links
  - Each router has 3 ports
  - Mesh-like 2-D layout
  - Across-first routing
- Hit rate @ Uniform
  - SS: Go straight
  - LP: Last used one
  - FCM: Frequently used one
[Figure: prediction hit rate vs. network size (# of cores)]
Hit rates of SS and FCM are almost the same
A high hit rate is achieved (80% for 64 cores, 94% for 256 cores)
41. 4 case studies of prediction router
[Figure: evaluation flow, as on slide 22]
2-D mesh network (case studies 1 and 2)
Fat tree network (case study 3)
Spidergon network (case study 4)