Program Mapping onto Network Processors by Recursive Bipartitioning and Refining


Slides: 21

Provided by: carl290


Transcript and Presenter's Notes


1
Program Mapping onto Network Processors by
Recursive Bipartitioning and Refining

Jia Yu (1), Jingnan Yao (1), Laxmi Bhuyan (1), Jun Yang (2)
(1) University of California, Riverside
(2) University of Pittsburgh

2
Outline

Introduction to Network Processors
Program Mapping Problem
Previous Work
The Algorithm
Recursive Bipartitioning
Refinement
Experiment Framework and Results
Conclusion

3
Packet Processing in the Future Internet
More packets; complex packet processing

Network processors
High processing capability
Support wire speed
Programmable
Scalable
Optimized for network applications

4
Typical Network Processor Architecture
SDRAM (packet buffer)
SRAM (routing table)
Network interfaces
Limited instruction memory
IXP1200: 6 MEs; IXP2400: 8 MEs; IXP2800: 16 MEs
[Figure: network processor with PEs, co-processor, H/W accelerator, and bus]
5
Program Dependence Graphs (PDGs)
6
Programming Model in NP

Context Pipeline
Multiprocessing
Hybrid of pipeline and multiprocessing

[Figure: tasks T1 to T4 mapped under each of the three models]
7
Related Work

Difficulty: NP-hard problem; the optimal throughput or stage time is unknown during program mapping
Instruction memory size
Multi-core and multithreading
Previous work
Greedily pack code according to code size [J. Yao, Globecom '05]
Guarantees the minimum number of stages
Does not consider communication cost in the program mapping
Intel IXP auto-partitioning C compiler, balanced cut [J. Dai, PLDI '05]
k-cut: cut the program (k-1) times into k sequential parts of equal size
Not applicable to the hybrid parallel and pipeline topology, which can tolerate unbalanced workloads through replication
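
The greedy idea above can be illustrated with a minimal sketch. It is not the cited algorithm: it only assumes that tasks arrive in sequential order, that each task has a known code size, and that a stage may hold code up to the instruction memory size. The names greedy_pack, code_sizes, and im_size are hypothetical.

```python
def greedy_pack(code_sizes, im_size):
    """Pack sequential tasks into pipeline stages by code size only.

    code_sizes: code size of each task, in program order (assumed input).
    im_size: instruction memory capacity of one stage (assumed input).
    Returns a list of stages, each a list of task indices.
    """
    stages, current, used = [], [], 0
    for index, size in enumerate(code_sizes):
        if size > im_size:
            raise ValueError("a single task exceeds the instruction memory")
        # Open a new stage when the next task no longer fits.
        if used + size > im_size:
            stages.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        stages.append(current)
    return stages


print(greedy_pack([400, 500, 300, 600, 200], 1000))
```

Note that nothing in this packing looks at the data exchanged between tasks, which is the communication-cost limitation listed above.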

8
Hybrid Parallel and Pipeline Topology
Processing resource mapping
9
Throughput of Hybrid Parallel and Pipeline Topology

Original stage time
Effective stage time
Throughput
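
The formulas for this slide did not survive in the transcript. The sketch below is one plausible reading, stated as an assumption: replicating a stage on n PEs divides its original stage time by n, and pipeline throughput is the reciprocal of the largest effective stage time. The function names are hypothetical.

```python
def effective_stage_time(stage_time, num_pes):
    # Assumption: n replicated PEs process n packets in parallel,
    # so the stage behaves as if it were n times faster.
    return stage_time / num_pes


def pipeline_throughput(stage_times, pes_per_stage):
    # Assumption: the slowest effective stage bounds the pipeline rate.
    effective = [
        effective_stage_time(t, n)
        for t, n in zip(stage_times, pes_per_stage)
    ]
    return 1.0 / max(effective)


# Two stages with original stage times 2 and 3, given 1 and 2 PEs.
print(pipeline_throughput([2, 3], [1, 2]))
```

With one PE on the first stage and two on the second, the effective stage times are 2 and 1.5, so the first stage is the bottleneck.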

10
Bipartitioning Algorithm

Divide-and-conquer approach: recursive bipartitioning
Two objectives
Optimize the number of pipeline stages
Minimize the communication cost: use the r-Balanced Min-Cut algorithm, based on the push-relabel Max-Flow-Min-Cut algorithm
PE assignment
Allocate enough PEs to the two subgraphs to satisfy the static code requirement
Compute the actual stage time of the two subgraphs
Allocate the remaining PE resources in proportion to the actual stage time to form parallel execution
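
The three PE assignment steps can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the minimum PE count of a subgraph is taken to be its code size divided by the instruction memory size, rounded up, and the proportional split is rounded to whole PEs. All names are hypothetical.

```python
import math


def assign_pes(code_a, code_b, time_a, time_b, im_size, total_pes):
    """Split total_pes between two subgraphs A and B."""
    # Step 1: enough PEs to hold each subgraph's static code (assumption:
    # one PE holds at most im_size instructions).
    min_a = math.ceil(code_a / im_size)
    min_b = math.ceil(code_b / im_size)
    remaining = total_pes - min_a - min_b
    if remaining < 0:
        raise ValueError("not enough PEs to hold the code")
    # Steps 2 and 3: time_a and time_b are the stage times of the two
    # subgraphs; the remaining PEs are shared in proportion to them.
    extra_a = round(remaining * time_a / (time_a + time_b))
    extra_b = remaining - extra_a
    return min_a + extra_a, min_b + extra_b


print(assign_pes(1500, 800, 300, 200, im_size=1000, total_pes=8))
```

In the example call, subgraph A needs 2 PEs for code and B needs 1; the 5 remaining PEs are split 3 to 2 following the 300 to 200 stage time ratio.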

11
Resource Balanced Bipartitioning
[Figure labels: Code, PE]
12
Refinement

Previous work relies solely on allocating more PEs to heavily loaded stages to achieve a balanced pipeline
Example: ST1 : ST2 = 2 : 3; given 3 PEs, how to divide them?
Optimal case: PE_NUM1 : PE_NUM2 = 1 : 1.5 = 1.2 : 1.8
Practical case: PE_NUM1 : PE_NUM2 = 1 : 2
Our approach: PE_NUM1 : PE_NUM2 = 1 : 2, and we migrate tasks from stage 1 to stage 2 to balance the pipeline stage time
Time complexity of the whole algorithm: O(T^2 E)
Acceptable when performed offline by compilers
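
The 2 : 3 example can be worked through in a small sketch. It assumes that effective stage time is the sum of task times divided by the PE count, that only the last task of stage 1 may move to the front of stage 2, and that communication cost and instruction memory limits are ignored. The task times are made up to give the 2 : 3 ratio, and the function names are hypothetical.

```python
def effective_time(tasks, num_pes):
    # Assumption: replicas share the load evenly.
    return sum(tasks) / num_pes


def migrate_tasks(stage1, stage2, pes1, pes2):
    """Move tasks from stage 1 to stage 2 while the bottleneck shrinks."""
    stage1, stage2 = list(stage1), list(stage2)
    while len(stage1) > 1:
        current = max(effective_time(stage1, pes1),
                      effective_time(stage2, pes2))
        # Candidate move: last task of stage 1 becomes first of stage 2.
        cand1, cand2 = stage1[:-1], [stage1[-1]] + stage2
        moved = max(effective_time(cand1, pes1),
                    effective_time(cand2, pes2))
        if moved >= current:
            break  # no further improvement
        stage1, stage2 = cand1, cand2
    return stage1, stage2


# Stage times 20 and 30 (ratio 2 : 3), with 1 and 2 PEs.
print(migrate_tasks([8, 8, 4], [10, 10, 10], pes1=1, pes2=2))
```

Before migration the effective stage times are 20 and 15. Moving the task of time 4 gives 16 and 17, so the bottleneck drops from 20 to 17; moving a second task would raise it to 21, so the loop stops.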

13
Refinement
14
Experiment Framework

Augment the Machine SUIF tool with a PDG pass
Intel IXA Architecture Tool: NP thread-level simulation tool
8 to 16 PEs, 1K to 5K instruction memory sizes
Specify tasks according to mapping and compiler-profiled information
I/O memory references
Code blocks (computation)
Signals (asynchronous operations)
Waits (asynchronous operations)
Benchmarks (PacketBench and NetBench)

15
Benchmarks
16
Throughput (8 PEs, 1K insns/IM)
Improvement: 6% to 108%, 22% on average
17
Throughput (16 PEs, 1K insns/IM)
18
Effect of IM sizes (1K to 5K insns)
19
Effect of Number of PEs (4 to 16)
20
Summary

Program partitioning and mapping onto the hybrid parallel and pipeline topology in NP systems
Recursive bipartitioning
Optimize the number of stages under the instruction memory size constraint
Minimize the communication cost
Hierarchical refinement: task migration
About 20% throughput improvement on average

21
Questions ?
