Program Mapping onto Network Processors by Recursive Bipartitioning and Refining

Transcript and Presenter's Notes

1
Program Mapping onto Network Processors by
Recursive Bipartitioning and Refining
  • Jia Yu¹, Jingnan Yao¹, Laxmi Bhuyan¹, Jun Yang²
  • ¹ University of California, Riverside
  • ² University of Pittsburgh

2
Outline
  • Introduction to the Network Processor program
    mapping problem
  • Previous Work
  • The Algorithm
  • Recursive Bipartitioning
  • Refinement
  • Experiment Framework and Results
  • Conclusion

3
Packet Processing in the Future Internet
More packets and more complex packet processing
  • Network processors:
  • High processing capability
  • Support wire speed
  • Programmable
  • Scalable
  • Optimized for network applications

4
Typical Network Processor Architecture
(Architecture diagram: processing elements (PEs) with limited instruction
memory, a co-processor and hardware accelerators, SDRAM as the packet
buffer, SRAM holding the routing table, and network interfaces, all
connected by a bus.)
IXP1200: 6 MEs, IXP2400: 8 MEs, IXP2800: 16 MEs
5
Program Dependence Graphs (PDGs)
6
Programming Model in NP
  • Context Pipeline
  • Multiprocessing
  • Hybrid of pipeline and multiprocessing

(Figure: example assignments of tasks T1 to T4 onto PEs, illustrating the
context pipeline, multiprocessing, and hybrid models.)
7
Related Work
  • Difficulty: NP-hard problem; the optimal throughput
    or stage time is unknown during program mapping
  • Instruction memory size constraint
  • Multi-core and multithreading
  • Previous work
  • Greedily pack code according to code size [J.
    Yao, Globecom '05]
  • Guarantees the minimum number of stages
  • Does not consider communication cost in the
    program mapping
  • Intel IXP auto-partitioning C compiler, balanced
    cut [J. Dai, PLDI '05]
  • k-cut: cut the program (k-1) times into k
    sequential parts of equal size
  • Not applicable to a hybrid parallel and pipeline
    topology, which can tolerate unbalanced workload
    through replication

8
Hybrid Parallel and Pipeline Topology
Processing resource mapping
9
Throughput of Hybrid Parallel and Pipeline
Topology
  • Original stage time
  • Effective stage time
  • Throughput
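The formulas on this slide are part of the slide image and were not captured
in the transcript. A minimal reconstruction, assuming stage i has original
(sequential) stage time T_i and is replicated across n_i PEs, and consistent
with the numeric example on the Refinement slide, is

\[
  T_i^{\mathrm{eff}} = \frac{T_i}{n_i},
  \qquad
  \text{Throughput} = \frac{1}{\max_i T_i^{\mathrm{eff}}}
                    = \frac{1}{\max_i T_i / n_i}.
\]

(The authors' definitions may additionally fold the inter-stage communication
cost into the stage time; minimizing that cost is the objective of the
min-cut step on the next slide.)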

10
Bipartitioning Algorithm
  • Divide-and-conquer approach: recursive
    bipartitioning
  • Two objectives
  • Optimize the number of pipeline stages
  • Minimize the communication cost, using the r-Balanced
    Min-Cut algorithm, based on the push-relabel
    Max-Flow/Min-Cut algorithm
  • PE assignment
  • Allocate enough PEs to the two subgraphs to satisfy
    the static code requirement
  • Compute the actual stage time of the two subgraphs.
    Allocate the remaining PE resource in proportion to
    the actual stage time to form parallel execution
    (sketched below)
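A minimal sketch of one bipartitioning step follows, assuming the PDG is a
NetworkX DiGraph whose nodes carry static code size and execution time and
whose edges carry communication cost as the capacity. The function and
attribute names are illustrative assumptions, not the authors' code, and the
r-balance constraint of the r-Balanced Min-Cut is omitted; only the use of a
push-relabel max-flow/min-cut mirrors the algorithm class named above.

# Sketch of one bipartition step plus the PE assignment rule (assumptions
# as noted in the lead-in; not the authors' implementation).
import networkx as nx
from networkx.algorithms.flow import preflow_push  # push-relabel max-flow


def _total(g, key):
    """Sum a node attribute ('code' size or execution 'time') over a subgraph."""
    return sum(data[key] for _, data in g.nodes(data=True))


def bipartition(pdg, src, dst):
    """Cut the PDG into two subgraphs along a src-dst minimum cut, so the
    communication cost crossing the pipeline boundary is minimized."""
    cut_cost, (left, right) = nx.minimum_cut(
        pdg, src, dst, capacity="capacity", flow_func=preflow_push)
    return pdg.subgraph(left).copy(), pdg.subgraph(right).copy(), cut_cost


def assign_pes(sub_a, sub_b, total_pes, imem_size):
    """Give each subgraph enough PEs to hold its static code, then split the
    remaining PEs in proportion to the two actual stage times (replication)."""
    need_a = -(-_total(sub_a, "code") // imem_size)   # ceiling division
    need_b = -(-_total(sub_b, "code") // imem_size)
    spare = total_pes - need_a - need_b
    time_a, time_b = _total(sub_a, "time"), _total(sub_b, "time")
    share_a = round(spare * time_a / (time_a + time_b))
    return need_a + share_a, need_b + (spare - share_a)

Recursing on each subgraph with its own PE budget, until every stage fits the
PEs assigned to it, yields the full mapping.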

11
Resource Balanced Bipartitioning
(Figure: balancing code size and PE resources between the two subgraphs.)
12
Refinement
  • Previous work relies solely on allocating more PEs
    to heavily loaded stages to achieve a balanced
    pipeline
  • Example: ST1 : ST2 = 2 : 3; given 3 PEs, how to
    divide?
  • Optimal case: PE_NUM1 : PE_NUM2 = 1 : 1.5 = 1.2 : 1.8
  • Practical case: PE_NUM1 : PE_NUM2 = 1 : 2
  • Our approach: PE_NUM1 : PE_NUM2 = 1 : 2, and we
    migrate tasks from stage 1 to stage 2 to balance
    the pipeline stage time (worked out below)
  • Time complexity of the whole algorithm: O(T²E)
  • Acceptable when performed offline by compilers
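A worked check of this example (the migrated amount \Delta is an
illustration; the slide only states that tasks move from stage 1 to
stage 2): with the integer split PE_NUM1 : PE_NUM2 = 1 : 2, the effective
stage times are 2/1 = 2 and 3/2 = 1.5, so stage 1 is the bottleneck.
Migrating \Delta units of work from stage 1 to stage 2 balances the
pipeline when

\[
  \frac{2 - \Delta}{1} = \frac{3 + \Delta}{2}
  \;\Rightarrow\; \Delta = \tfrac{1}{3},
  \qquad
  T_1^{\mathrm{eff}} = T_2^{\mathrm{eff}} = \tfrac{5}{3} \approx 1.67,
\]

which matches the effective stage time of the optimal fractional allocation
(2/1.2 = 3/1.8 ≈ 1.67).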

13
Refinement
14
Experiment Framework
  • Augment a PDG pass in the Machine SUIF tool
  • Intel IXA Architecture Tool: NP thread-level
    simulation tool
  • 8 to 16 PEs, 1K to 5K instruction memory sizes
  • Specify tasks according to the mapping and
    compiler-profiled information
  • I/O memory references
  • Code blocks (computation)
  • Signals (asynchronous operations)
  • Waits (asynchronous operations)
  • Benchmarks (Packetbench and Netbench)

15
Benchmarks
16
Throughput (8 PEs, 1K insns/IM)
Improvement: 6% to 108%, 22% on average
17
Throughput (16 PEs, 1K insns/IM)
18
Effect of IM sizes (1K to 5K insns)
19
Effect of Number of PEs (4 to 16)
20
Summary
  • Program partitioning and mapping onto hybrid
    parallel and pipeline topology in NP systems
  • Recursive bipartitioning
  • Optimize the number of stages, considering the
    instruction memory size constraint
  • Minimize the communication costs
  • Hierarchical refinement: task migration
  • On average, 20% throughput improvement

21
Questions?