Title: MEMORY PERFORMANCE EVALUATION
1. MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Yau Isa
Master's Thesis Oral Defense
Computer Engineering, King Fahd University of Petroleum and Minerals
Saturday, 7th June 2003
2. Outline
- Introduction
- Problem Statement
- Analysis of Memory Accesses
- Measurement Based Performance Evaluation
- Design and Implementation of Prototype
- Contributions
- Conclusions
- Future Work
3. Introduction
- Processor and memory performance discrepancy
- Growing network bandwidth
- Data rates in Terabits per second possible
- Gigabit per second LANs already deployed
- High throughput servers in network infrastructure
- Streaming media servers
- Web servers
- Software Routers
4. Dealing with the Performance Gap
- Hierarchical memory architecture
- temporal locality
- spatial locality
- Constraints
- Characteristics of network payload data
- Large → won't fit into cache
- Hardly reusable → poor temporal locality
5. Problem Statement
- Network servers should
- Deliver high throughput
- Respond to requests with low latency
- Respond to a large number of clients
- Our goal
- Identify specific conditions at which server memory becomes a bottleneck
- Includes cache, main memory, and virtual memory
- Benefits
- Better server design that alleviates memory bottlenecks
- Optimal performance can be achieved
- Constraints
- Large amount of data flowing through CPU and memory
- Writing code to optimize memory utilization is a challenge
6. Analysis of Memory Accesses: Data Flow Analysis
- Four data transfer paths
- Memory-CPU
- Memory-memory
- Memory-I/O
- Memory-network
7. Latency Model and Memory Overhead
- Each transaction involves
- CPU cycles
- Data transfers: one or more of the four identified types
- Transaction latency
- Ttrans = Tcpu + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net
- Tcpu → total CPU time needed for the transaction
- Tm-c → time to transfer the entire PDU from memory to CPU for processing
- Tm-m → latency of a memory-memory copy of a PDU
- Tm-disk → latency of a memory-I/O read/write of a block of data
- Tm-net → latency of a memory-network read/write of a PDU
- ni → number of data movement operations of each type
8. Memory-CPU Transfers
- PDU Processing
- checksum computation and header updating
- Typically, one-way data flow (memory to CPU via cache)
- Memory stall cycles
- Number of memory stall cycles = (IC)(AR)(MR)(MP)
- Cache miss rate
- Worst case: MR = 1 (not as bad as it sounds!)
- Best case: MR = 0 (trivial)
9. Memory-CPU Transfers (cont.)
- Cache overhead in various cases
- Worst case: MR = 1, MP = 10, and (MR)(MP) = 10
- Best case: MR = 0 → trivial
- Average case: MR = 0.1, MP = 10, and (MR)(MP) = 1
- Memory-CPU latency dependent on internal bus bandwidth
- Tm-c = S/(32·Bi) usec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
10. Memory-Memory Transfers
- Memory-memory transfer
- Due to memory copy of the PDU between protocol layers
- Transfers through caches and CPU
- Stride 1 (contiguous)
- Transfer involves memory→cache→CPU→cache→memory data movement
- Latency
- Dependent on internal (system) bus bandwidth
- Tm-m = 2S/Bi usec
11. Memory-I/O and Memory-Network Transfers
- Memory-network transfers
- Pass over the I/O bus
- DMA can be used
- Again, stride 1 (contiguous)
- Latency
- Limiting factor is the I/O bus bandwidth
- Tm-net = S/Be usec, where Be is the external (I/O) bus bandwidth in MB/s
12. Latency of Reference Applications
[Equations (1), (2), and (3): transaction latencies of the three reference applications — IP forwarding, HTTP, and RTP streaming]
13. Peak Throughputs
- Assumptions
- CPU usage latency is negligible compared to data transfer latency and can be ignored
- Bus contention from multiple simultaneously executed transactions does not result in any additional overhead
- Server throughput = S/T
- S = size of transaction data
- T = latency of a transaction, given by equations (1), (2), and (3)
14. Peak Throughputs (cont.)

Processor                     Internal bus BW (MB/s)   IP forwarding (Mbit/s)   HTTP (Mbit/s)   RTP streaming (Mbit/s)
Intel Pentium IV 3.06 GHz     3200                     4264                     3640            3640
AMD Athlon XP 3000+           2700                     4264                     3291            3291
MIPS R16000 700 MHz           3200                     4264                     3640            3640
Sun UltraSPARC III 900 MHz    1200                     4264                     1862            1862
15. Measurement-Based Performance Evaluation
- Experimental Testbed
- Dual-boot server (Pentium IV 2.0 GHz)
- 256 MB RAM
- 1.0 Gbps NIC
- Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
- Tools
- Intel VTune
- Windows Performance Monitor
- Netstat
- Linux tools: vmstat, sar, iostat
16. Platforms and Applications
- Platforms
- Linux (kernel 2.4.7-10)
- Windows 2000
- Applications
- Streaming media servers
- Darwin streaming server
- Windows media server
- Web servers
- Apache web server
- Microsoft Internet Information server
- Software router
- Linux kernel IP forwarding
17. Analysis of Operating System Role
- Memory Throughput Test
- ECT (extended copy transfer), memperf
- Locality of reference
- temporal locality: varying working set size (block size)
- spatial locality: varying access pattern (strides)
18. Analysis of Operating System Role (cont.)
- Context switching overhead
19. Streaming Media Servers
- Experimental Design
- Factors
- Number of streams (streaming clients)
- Media encoding rate (56 kbps and 300 kbps)
- Stream distribution (unique and multiple media)
- Metrics
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Benchmarking Tools
- DSS: streaming load tool
- WMS: media load simulator
20. Cache Performance
21. Cache Performance (cont.)
- L1 cache misses (300 kbps)
22. Memory Performance
23. Throughput
24. Summary: Streaming Media Server Memory Performance
- Highest degradation in cache performance (both L1 and L2) occurs when the number of clients is large and the encoding rate is 300 kbps with multiple multimedia objects.
- When clients demand unique media objects, the page fault rate is constant. However, if the requests are for multiple objects, the page fault rate increases with the number of clients.
- Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields more throughput. The Darwin streaming server has lower throughput than the Windows media server.
25. Web Servers
- Experimental Design
- Factors
- Number of web clients
- Document size
- Metrics
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Transactions/sec (connection rate)
- Average latency
- Benchmarking Tool
- WebStone
26. Transactions
27. L1 Cache Misses
28. Page Faults
29. Throughput
30. Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10 KB:

Attribute                          Apache   IIS
Max. transaction rate (conn/sec)   2586     4178 (58% more than Apache)
Max. throughput (Mbps)             217      349 (62% more than Apache)
CPU utilization (%)                71       63
L1 misses (millions)               424      200
L2 misses (millions)               1673     117
Page fault rate (pfs/sec)          < 10     < 10
31. Software Router
- Experimental Design
- Factors
- Routing configurations
- TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)
- Metrics
- Throughput
- Number of context switches
- Number of active pages
- Benchmarking Tool
- Netperf
32. Software Router Throughput
33. CPU Utilization
34. Context Switching
35. Active Pages
36. Summary: Software Router Performance Evaluation
- Maximum throughput of 449 Mbps for configuration number 2: full-duplex, one-to-one communication.
- Highest CPU utilization was 84%
- Highest context switching rate was 5378/sec
- Number of active pages is fairly uniformly distributed, indicating low memory activity.
37. Design, Implementation and Evaluation of Prototype DB-RTP Server
- Architecture
- Implementation
- Linux platform (C)
- Our implementation of RTSP/RTP (why?)
38. Double Buffering and Synchronization
Buffer read
Buffer write
39. RTP Server Throughput
40. Jitter
41. Summary: DB-RTP Server Performance Evaluation
- Throughput
- DB-RTP server: 63.85 Mbps
- RTP server: 59 Mbps
- Both servers exhibit steady jitter, but the DB-RTP server has relatively lower jitter than the RTP server.
42. Contributions
- Cache overhead analysis
- Memory latency and bandwidth analysis
- Measurement-based performance evaluation
- Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server
43. Conclusions
- High throughput is possible with server design enhancement.
- Server throughput is significantly degraded by excessive cache misses and page faults.
- Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.
44. Future Work
- Server Development
- hybrid multiplexing/multithreading
- Special Architectures (network processors, ASICs)
- resource scheduling
- investigation of the role of I/O
- use of IRAM (intelligent RAM) architectures
- integrated network infrastructure server
45. Thank you
46. Array Restructuring
- Array padding
- Loop nest transformation
47. Testbeds
- Software router testbed
- Streaming media/web server testbed
48. Communication Configurations
49. Backup Slides
50. Memory Performance
[Charts: page fault rate at 300 kbps and 56 kbps]
51. Streaming Server CPU Utilization
52. Cache Performance (cont.)
53. Cache Performance (cont.)
- L2 cache misses (300 kbps)
54. Web Servers
Transaction
Cache performance
L2 cache misses
L1 cache misses
55. Web Servers
Latency
CPU Utilization
56. DB-RTP Server
L2 cache misses
L1 cache misses
CPU Utilization
57. Memory Performance Evaluation Methodologies
- Analytical
- Requires just paper and pencil
- Accuracy?
- Simulation
- Requires programming
- Time and cost?
- Measurement
- Real system or a prototype required
- Using on-chip counters
- Benchmarking tools
- More accurate
58. Server Performance Tuning
- Memory performance tuning
- Array padding
- Array restructuring
- Loop nest transformation
- Latency hiding and multithreading
- EPIC (IA-64)
- VIRAM
- Impulse
- Multiprocessing and clustering
- Task parallelization
- E.g. Panama cluster router
- Special Architectures
- Network processors
- ASICs and Data flow architectures
59. Temporal vs. Spatial Locality
- A PDU lacks temporal locality
- Observation: PDU processing exhibits excellent spatial locality
- Suppose a data cache line is 32 bytes (or 16 words) long
- Sequential accesses with stride 1
- Accessing one word brings in the other 15 words as well
- Thus, effective MR = 1/16 = 6.25% → better than even scientific apps
- Thus, generally MR = W/L
- W: width of each memory access (in bytes)
- L: length of each cache line (in bytes)
- Validation of the above observation
- Similar spatial locality characteristics reported via measurements
- S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," in Proc. of ACM SIGMETRICS 2001
- MR for a streaming media player better than SPEC benchmark apps!
60. Memory-CPU Transfers
- PDU Processing
- checksum computation and header updating
- Typically, one-way data flow (memory to CPU via cache)
- Memory stall cycles
- Number of memory stall cycles = (IC)(AR)(MR)(MP)
- IC: instruction count per transaction
- AR: number of memory accesses per instruction (AR ≈ 1)
- MR: ratio of cache misses to memory accesses
- MP: miss penalty in terms of clock cycles
- Cache miss rate
- Worst case MR = 1, while typically MP = 10
- Stall cycles = 10 × IC
61. Memory-CPU Transfers (cont.)
- Determine cache overhead w.r.t. execution time
- (Execution time)no-cache = (IC)(CPI)(CC)
- (Execution time)with-cache = (IC)(CPI)(CC)[1 + (MR)(MP)]
- Cache overhead = 1 + (MR)(MP)
- Cache overhead in various cases
- Worst case: MR = 1 and MP = 10
- Cache misses result in 11 times higher latency for each transaction!
- Best case: MR = 0 → trivial
- Average case: MR = 0.1, MP = 10, and (MR)(MP) ≈ 1
- Latency due to stalls ≈ ideal execution time without stalls
- Memory-CPU latency dependent on internal bus bandwidth
- Tm-c = S/(32·Bi) usec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
62. Open Questions
- Role of special-purpose architectures (e.g., network processors) in the performance of high throughput servers
- Role of memory compression