Title: Program Mapping onto Network Processors by Recursive Bipartitioning and Refining
1. Program Mapping onto Network Processors by Recursive Bipartitioning and Refining
- Jia Yu (1), Jingnan Yao (1), Laxmi Bhuyan (1), Jun Yang (2)
- (1) University of California, Riverside
- (2) University of Pittsburgh
2. Outline
- Introduction to Network Processors
- Program Mapping Problem
- Previous Work
- The Algorithm
- Recursive Bipartitioning
- Refinement
- Experiment Framework and Results
- Conclusion
3. Packet Processing in the Future Internet
More packets; complex packet processing
- Network processors
- High processing capability
- Support wire speed
- Programmable
- Scalable
- Optimized for network applications
4. Typical Network Processor Architecture
[Figure: network processor architecture. Labels: PE, co-processor, H/W accelerator, bus, SDRAM (packet buffer), SRAM (routing table), network interfaces]
- Limited instruction memory
- IXP1200: 6 MEs; IXP2400: 8 MEs; IXP2800: 16 MEs
5. Program Dependence Graphs (PDGs)
6. Programming Model in NP
- Context Pipeline
- Multiprocessing
- Hybrid of pipeline and multiprocessing
[Figure: tasks T1 to T4 arranged as a context pipeline, as multiprocessing, and as a hybrid of the two]
7. Related Work
- Difficulty
- NP-hard problem; the optimal throughput or stage time is unknown during program mapping
- Instruction memory size
- Multi-core and multithreading
- Previous work
- Greedily pack code according to code size (J. Yao, Globecom '05)
- Guarantees the minimum number of stages
- Does not consider communication cost in the program mapping
- Intel IXP auto-partitioning C compiler, balanced cut (J. Dai, PLDI '05)
- k-cut: cut the program (k-1) times into k sequential parts of equal size
- Not applicable to the hybrid parallel and pipeline topology, which can tolerate an unbalanced workload through replication
8. Hybrid Parallel and Pipeline Topology
[Figure: processing resource mapping]
9. Throughput of Hybrid Parallel and Pipeline Topology
- Original stage time
- Effective stage time
- Throughput
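The formulas for these three quantities did not survive on this slide. As a rough sketch only, assume that a stage replicated on n PEs has an effective stage time equal to its original stage time divided by n, and that pipeline throughput is the reciprocal of the largest effective stage time. The function names and this division model are assumptions for illustration, not the authors' definitions.

```python
from fractions import Fraction

def effective_stage_times(stage_times, pes_per_stage):
    """Assumed model: replicating a stage on n PEs divides its time by n."""
    return [Fraction(t) / n for t, n in zip(stage_times, pes_per_stage)]

def throughput(stage_times, pes_per_stage):
    """Packets per time unit, limited by the slowest effective stage."""
    return 1 / max(effective_stage_times(stage_times, pes_per_stage))

# Two stages with original stage times 2 and 3, mapped onto 1 and 2 PEs.
times = effective_stage_times([2, 3], [1, 2])
rate = throughput([2, 3], [1, 2])
print(times, rate)
```

Under this model the first stage (effective time 2) is the bottleneck even though the second stage has the larger original stage time.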
10. Bipartitioning Algorithm
- Divide-and-conquer approach: recursive bipartitioning
- Two objectives
- Optimize the number of pipeline stages
- Minimize the communication cost: use the r-balanced min-cut algorithm, based on the push-relabel max-flow min-cut algorithm
- PE assignment
- Allocate enough PEs to the two subgraphs to satisfy the static code requirement
- Compute the actual stage time of the two subgraphs. Allocate the remaining PE resources in proportion to the actual stage time to form parallel execution.
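The two PE assignment steps can be sketched as follows. This is a minimal illustration under stated assumptions: the minimum PE count of a subgraph is taken to be its code size divided by the instruction memory size, rounded up, and "in proportion to the actual stage time" is approximated by handing out spare PEs one at a time to the subgraph with the largest per-PE stage time. The function `assign_pes` and its parameters are hypothetical names, not taken from the paper.

```python
import math

def assign_pes(code_sizes, stage_times, total_pes, im_size):
    """Split total_pes between subgraphs (illustrative sketch).

    Step 1: give each subgraph enough PEs to hold its static code
            (assumption: ceil(code size / instruction memory size)).
    Step 2: give each remaining PE to the subgraph whose stage time per
            PE is currently largest, approximating a proportional split.
    """
    pes = [max(1, math.ceil(size / im_size)) for size in code_sizes]
    if sum(pes) > total_pes:
        raise ValueError("not enough PEs to hold the code")
    for _ in range(total_pes - sum(pes)):
        busiest = max(range(len(pes)), key=lambda i: stage_times[i] / pes[i])
        pes[busiest] += 1
    return pes

# Two subgraphs: 800 and 1500 instructions, stage times 2 and 3,
# 8 PEs with 1000-instruction memories.
allocation = assign_pes([800, 1500], [2, 3], total_pes=8, im_size=1000)
print(allocation)
```

For stage times in the ratio 2 : 3, an exact proportional split of 8 PEs would be 3.2 : 4.8; the sketch rounds this to whole PEs.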
11. Resource Balanced Bipartitioning
[Figure: resource-balanced bipartitioning. Labels: Code, PE]
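To make the objective of a balanced, communication-minimizing cut concrete, the sketch below finds the cheapest bipartition of a toy task graph by exhaustive search. It is not the push-relabel based algorithm named in the talk; it only illustrates what such a cut optimizes. It assumes that "r-balanced" means each side holds at least a fraction r of the total code size, and that edges may only cross from the first part to the second so the parts can run as consecutive pipeline stages. The graph, weights, and function name are hypothetical.

```python
from itertools import combinations

def balanced_min_cut(weights, edges, r):
    """Exhaustively search for the cheapest r-balanced bipartition.

    weights: {task: code size}; edges: {(src, dst): communication cost}.
    Returns (cut cost, set of tasks in the first part), or None.
    """
    nodes = sorted(weights)
    total = sum(weights.values())
    best = None
    for k in range(1, len(nodes)):
        for first in map(set, combinations(nodes, k)):
            size = sum(weights[n] for n in first)
            if not (r * total <= size <= (1 - r) * total):
                continue  # violates the assumed balance constraint
            if any(u not in first and v in first for u, v in edges):
                continue  # a backward edge would break the pipeline order
            cost = sum(c for (u, v), c in edges.items()
                       if u in first and v not in first)
            if best is None or cost < best[0]:
                best = (cost, first)
    return best

weights = {"a": 2, "b": 2, "c": 2, "d": 2}
edges = {("a", "b"): 5, ("b", "c"): 1, ("c", "d"): 4, ("a", "c"): 1}
cut = balanced_min_cut(weights, edges, r=0.25)
print(cut)
```

Exhaustive search is exponential in the number of tasks, so it is only usable on toy inputs; a max-flow based method is what makes the approach practical.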
12. Refinement
- Previous work relies solely on allocating more PEs to heavily loaded stages to achieve a balanced pipeline
- Example: ST1 : ST2 = 2 : 3; given 3 PEs, how to divide?
- Optimal case: PE_NUM1 : PE_NUM2 = 1 : 1.5 = 1.2 : 1.8
- Practical case: PE_NUM1 : PE_NUM2 = 1 : 2
- Our approach: PE_NUM1 : PE_NUM2 = 1 : 2, and we migrate tasks from stage 1 to stage 2 to balance the pipeline stage time
- Time complexity of the whole algorithm: O(T^2 E)
- Acceptable when performed offline by compilers
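The 2 : 3 example can be worked through with a small greedy sketch. It assumes the effective stage time is the stage time divided by the PE count, and it moves tasks from the end of stage 1 to the front of stage 2 while that lowers the bottleneck. The task times below are hypothetical, scaled by 3 so that stage 1 totals 6 and stage 2 totals 9 (the 2 : 3 ratio from the slide); the function `refine` is an illustration, not the authors' refinement procedure.

```python
from fractions import Fraction

def refine(stage1, stage2, pes1, pes2):
    """Greedily migrate tasks from stage 1 to stage 2.

    stage1, stage2: lists of integer task times in pipeline order.
    A task is moved only if it strictly lowers the larger effective
    stage time (assumed to be stage time divided by PE count).
    """
    s1, s2 = list(stage1), list(stage2)

    def bottleneck(a, b):
        return max(Fraction(sum(a), pes1), Fraction(sum(b), pes2))

    while len(s1) > 1:
        moved1, moved2 = s1[:-1], [s1[-1]] + s2
        if bottleneck(moved1, moved2) >= bottleneck(s1, s2):
            break
        s1, s2 = moved1, moved2
    return s1, s2, bottleneck(s1, s2)

# 1 PE for stage 1, 2 PEs for stage 2 (the practical 1 : 2 split).
result = refine([3, 2, 1], [9], pes1=1, pes2=2)
print(result)
```

Before migration the bottleneck is stage 1 with effective time 6; after one task moves, both stages have effective time 5, which is the balanced point for these task times.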
13. Refinement
14. Experiment Framework
- Add a PDG pass to the Machine SUIF tool
- Intel IXA Architecture Tool: an NP thread-level simulation tool
- 8 to 16 PEs, 1K to 5K instruction memory sizes
- Specify tasks according to the mapping and compiler-profiled information
- I/O memory references
- Code blocks (computation)
- Signals (asynchronous operations)
- Waits (asynchronous operations)
- Benchmarks (PacketBench and NetBench)
15. Benchmarks
16. Throughput (8 PEs, 1K insns/IM)
Improvement: 6% to 108%, 22% on average
17. Throughput (16 PEs, 1K insns/IM)
18. Effect of IM Sizes (1K to 5K insns)
19. Effect of Number of PEs (4 to 16)
20. Summary
- Program partitioning and mapping onto the hybrid parallel and pipeline topology in NP systems
- Recursive bipartitioning
- Optimize the number of stages, considering the instruction memory size constraint
- Minimize the communication costs
- Hierarchical refinement: task migration
- On average 20% throughput improvement
21. Questions?