Title: Experimental Evaluation of Load Balancers in Packet Processing Systems
1. Experimental Evaluation of Load Balancers in Packet Processing Systems
- Taylor L. Riché, Jayaram Mudigonda, and Harrick M. Vin
2. Background
- Packet processing systems must support
- High-bandwidth links
- Increasingly complex applications
- Implications
- Processing time dominated by memory access latency
- Packet processing time > packet inter-arrival time (IAT)
- Utilize multi-core parallel architectures
- Load balancers: a key building block
- Scale the throughput of a system with parallel resources
3. Flow-level Load Balancer
- Maps a flow to a processor
- Advantage
- Localization of per-flow state to a processor
- Disadvantages
- Coarse-grain load balancing ⇒ limits concurrency
- Non-uniformity in flow characteristics ⇒ load imbalance
[Diagram: a load balancer feeds several processors, each with its own local memory, all sharing a global memory]
4. Packet-level Load Balancer
- Independently maps each packet to a processor
- Advantage
- Fine-grain load balancing
- Disadvantages
- Higher overhead for accessing per-flow state
- Lock overhead for ensuring mutual exclusion
- Higher latency to access shared memory levels
[Diagram: a load balancer dispatches individual packets to processors with local memories and a shared global memory]
5. How Does One Select a Load Balancer?
- Relative performance depends upon
- System characteristics
- Memory access latency
- Application characteristics
- Application length
- Length of critical code segment
- Flow definition
- Traffic characteristics
- Inter-arrival time of packets
- Arrival rate of flows
- Holding time for flows
- Size of flows
6. Outline
- Introduction
- Methodology
- Simulation Model
- Performance Metric
- Experimental Evaluation
- Setup
- Results
- Concluding Remarks
7. Simulation Model
- System model
- Homogeneous multi-processors
- Memory system
- Local memory: single-cycle accesses
- Shared global memory: various access latencies
- Application model
- A = ⟨A1, A2, A3, ..., An⟩
- Ai: non-critical segment; Aj: critical segment
- Ai = (ci, mi), where
- ci = number of computation instructions
- mi = number of memory access instructions
8. Performance Metric
- Processor provisioning to meet trace throughput
- Depends on
- Processor capacity Ci(n)
- Number of packets processed within the average IAT
- Per-processor offered load Oi(n)
- Number of packets that arrive at a processor within the average IAT
- Formal definition
- Pscheme = min { n : ∀ i < n, Oi(n) < Ci(n) }
- Performance metric
- Processor provisioning ratio: Pflow / Ppkt
9. Outline
- Introduction
- Methodology
- Simulation Model
- Performance Metric
- Experimental Evaluation
- Setup
- Results
- Concluding Remarks
10. Experimental Setup
- System
- Local memory access: 1 cycle
- Shared memory access latencies: 50, 100, 200 cycles
- Application
- A = ⟨A1, A2, A3⟩
- Computation-to-memory-access ratio selected using benchmarks
- Effective Critical Segment (ECS) size = c2 + M · m2
- Traces: UNC (edge), MRA (core)
11. Experimental Results Preview
- Per processor offered load and capacity
- How do these quantities change with
- Number of processors
- Application length
- ECS
- Present guidelines for load balancer selection
- System Properties
- Trace Properties
- Application Properties
12. Processor Capacity (UNC)
For the packet-level scheme, large ECS ⇒ small processor capacity.
For the flow-level scheme, packet processing time is independent of the number of processors ⇒ processor capacity is constant.
For the packet-level scheme, lock overhead increases with the number of processors (N) ⇒ processor capacity decreases with N.
13. Lock Delay vs. Number of Processors
Lock delay increases with both the number of processors and the ECS.
14. Per-Processor Offered Load (UNC)
For the flow-level scheme, non-uniformity in flows ⇒ per-processor offered load decreases sub-linearly.
For the packet-level scheme, per-processor offered load decreases linearly with the number of processors.
15. Determining Processor Provisioning
[Plot: packets per average IAT (log scale, 0.01–1) vs. number of processors (1–32); curves for FL capacity, FL load, PL load, and PL capacity at ECS = 0.14, 1.14, 2.28, 4.30, and 8.59 IAT]
16. Provisioning Ratio for UNC Trace
[Plot annotation: processor capacity = 0.5 pkts/IAT]
With lower processor capacities, the packet-level scheme outperforms the flow-level scheme in more cases.
17. Provisioning Ratio for MRA Trace
For MRA, the crossover ECS value is lower than that of the UNC trace for the same processor capacity.
18. Flow Characteristics: UNC vs. MRA
- Key characteristics
- Per-flow packet inter-arrival time distribution
- Measure of non-uniformity in flow characteristics
- Observation
- MRA flows are more uniform than UNC flows
- Flow-level scheme does better for MRA than UNC
19. Provisioning Ratio for Flow Types
Coarser flow definitions increase lock delay (hurting the packet-level scheme) but also create fatter flows (hurting the flow-level scheme); both changes are small ⇒ the crossover point does not change much.
20. Concluding Remarks
- Load balancers are a key building block
- Key question
- How to select between a packet-level and a flow-level load balancer?
- Answer depends on system, application, and trace characteristics
[Plot: crossover ECS (in multiples of IAT, 0–20) vs. processor capacity (packets per average IAT, 0–0.3) for the UNC and MRA traces; the flow-level scheme wins above each curve, the packet-level scheme below]
21. Thank You!
- For more information on this work
- http://www.cs.utexas.edu/users/riche/
- For more information on Shangri-La
- http://www.cs.utexas.edu/users/vin/research/shangrila.shtml
- Questions?
22. Backup Slides
23. Hypothetical Load Balancer
- Balances load on a packet-by-packet basis
- Is not hindered by
- Global memory access latency
- Synchronization costs
24. Ratios for Hypothetical System
25. Conclusions (Packet-Level)
- Small capacities
- Processing time dominated by non-critical segments
- Performs very similarly to the hypothetical system
- Large capacities
- Protected code is a greater portion of processing time
- Increase in processing time is higher
26. Conclusions (Flow-Level)
- Small capacities
- Each processor can only service a small number of flows
- Non-uniformity results in large provisioning
- Large capacities
- Load balancer has a better opportunity to even out imbalance
27. Simulator Design
- Event-driven model
- Implemented in C and driven by TCL scripts
- Four main components
- Packet reader
- Load distributor
- A set of processors
- Lock manager
28. System Performance
- Performance is based on many parameters
- System characteristics
- Memory access latency: global vs. local
- Application characteristics
- Application length
- Length of critical code segment
- Flow definition
- Workload characteristics
- Inter-arrival time of packets
- Arrival rate of flows
- Holding time for flows
- Size of flows
29. Average Packets/Flow in a Window
Only a small number of packets per flow arrive within an ECS window.
30. Experimental Results
- Determine which factors affect Pflow/Ppkt
- I.e., where does one scheme outperform the other?
- Packet-level capacity is reduced by
- Lock delay
- Memory latency
- Flow-level load does not scale due to
- The non-uniformity of flows
- Trace properties directly affect the ratio
- Show the effect of the flow definition