Title: MEMORY PERFORMANCE EVALUATION
1. MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Yau Isa
Master's Thesis Oral Defense
Computer Engineering, King Fahd University of Petroleum and Minerals
Saturday, 7th June 2003
2. Outline
- Introduction
- Problem Statement
- Analysis of Memory Accesses
- Measurement Based Performance Evaluation
- Design and Implementation of Prototype
- Contributions
- Conclusions
- Future Work
3. Introduction
- Processor and memory performance discrepancy
- Growing network bandwidth
- Data rates in Terabits per second possible
- Gigabit per second LANs already deployed
- High throughput servers in network infrastructure
- Streaming media servers
- Web servers
- Software Routers
4. Dealing with the Performance Gap
- Hierarchical memory architecture
- temporal locality
- spatial locality
- Constraints
- Characteristics of network payload data
- Large → won't fit into cache
- Hardly reusable → poor temporal locality
5. Problem Statement
- Network servers should
- Deliver high throughput
- Respond to requests with low latency
- Respond to a large number of clients
- Our goal
- Identify specific conditions at which server memory becomes a bottleneck
- Includes cache, main memory, and virtual memory
- Benefits
- Better server design that alleviates memory bottlenecks
- Optimal performance can be achieved
- Constraints
- Large amount of data flowing through CPU and memory
- Writing code to optimize memory utilization is a challenge
6. Analysis of Memory Accesses: Data Flow Analysis
- Four data transfer paths
- Memory-CPU
- Memory-memory
- Memory-I/O
- Memory-network
7. Latency Model and Memory Overhead
- Each transaction involves
- CPU cycles
- Data transfers: one or more of the four identified types
- Transaction latency
- Ttrans = Tcpu + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net
- Tcpu → total CPU time needed for the transaction
- Tm-c → time to transfer the entire PDU from memory to CPU for processing
- Tm-m → latency of a memory-memory copy of a PDU
- Tm-disk → latency of a memory-I/O read/write of a block of data
- Tm-net → latency of a memory-network read/write of a PDU
- ni → number of data movement operations of each type
8. Memory-CPU Transfers
- PDU Processing
- checksum computation and header updating
- Typically, one-way data flow (memory to CPU via cache)
- Memory stall cycles
- Number of memory stall cycles = (IC)(AR)(MR)(MP)
- Cache miss rate
- Worst case: MR = 1 (not as bad as it sounds!)
- Best case: MR = 0 (trivial)
9. Memory-CPU Transfers (cont.)
- Cache overhead in various cases
- Worst case: MR = 1, MP = 10, and (MR)(MP) = 10
- Best case: MR = 0 → trivial
- Average case: MR = 0.1, MP = 10, and (MR)(MP) = 1
- Memory-CPU latency dependent on internal bus bandwidth
- Tm-c = S/(32·Bi) usec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
10. Memory-Memory Transfers
- Memory-memory transfer
- Due to memory copy of the PDU between protocol layers
- Transfers through caches and CPU
- Stride 1 (contiguous)
- Transfer involves memory→cache→CPU→cache→memory data movement
- Latency
- Dependent on internal (system) bus bandwidth
- Tm-m = 2S/Bi usec
11. Memory-I/O and Memory-Network Transfers
- Memory-network transfers
- Pass over the I/O bus
- DMA can be used
- Again, stride 1 (contiguous)
- Latency
- Limiting factor is the I/O bus bandwidth
- Tm-net = S/Be usec, where Be is the external (I/O) bus bandwidth in MB/s
12. Latency of Reference Applications
[Equations (1), (2), and (3): transaction latencies of the three reference applications — IP forwarding, HTTP, and RTP streaming]
13. Peak Throughputs
- Assumptions
- CPU usage latency is negligible compared to data transfer latency and can be ignored
- Bus contention from multiple simultaneously executed transactions does not result in any additional overhead
- Server throughput = S/T
- S = size of transaction data
- T = latency of a transaction, given by equations (1), (2), and (3)
14. Peak Throughputs (cont.)

Processor                     Internal bus BW (MB/s)   IP forwarding (Mbit/s)   HTTP (Mbit/s)   RTP streaming (Mbit/s)
Intel Pentium IV 3.06 GHz     3200                     4264                     3640            3640
AMD Athlon XP 3000+           2700                     4264                     3291            3291
MIPS R16000 700 MHz           3200                     4264                     3640            3640
Sun UltraSPARC III 900 MHz    1200                     4264                     1862            1862
15. Measurement-Based Performance Evaluation
- Experimental Testbed
- Dual-boot server (Pentium IV 2.0 GHz)
- 256 MB RAM
- 1.0 Gbps NIC
- Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
- Tools
- Intel VTune
- Windows Performance Monitor
- Netstat
- Linux tools: vmstat, sar, iostat
16. Platforms and Applications
- Platforms
- Linux (kernel 2.4.7-10)
- Windows 2000
- Applications
- Streaming media servers
- Darwin streaming server
- Windows media server
- Web servers
- Apache web server
- Microsoft Internet Information server
- Software router
- Linux kernel IP forwarding
17. Analysis of Operating System Role
- Memory Throughput Test
- ECT (extended copy transfer), memperf
- Locality of reference
- temporal locality: varying working set size (block size)
- spatial locality: varying access pattern (strides)
18. Analysis of Operating System Role (cont.)
- Context switching overhead
19. Streaming Media Servers
- Experimental Design
- Factors
- Number of streams (streaming clients)
- Media encoding rate (56 kbps and 300 kbps)
- Stream distribution (unique and multiple media)
- Metrics
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Benchmarking Tools
- DSS: streaming load tool
- WMS: media load simulator
20. Cache Performance
21. Cache Performance (cont.)
- L1 cache misses (300 kbps)
22. Memory Performance
23. Throughput
24. Summary: Streaming Media Server Memory Performance
- Highest degradation in cache performance (both L1 and L2) occurs when the number of clients is large and the encoding rate is 300 kbps with multiple multimedia objects.
- When clients demand unique media objects, the page fault rate is constant. However, if the requests are for multiple objects, the page fault rate increases with the number of clients.
- Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields more throughput. The Darwin streaming server has lower throughput than the Windows media server.
25. Web Servers
- Experimental Design
- Factors
- Number of web clients
- Document size
- Metrics
- Cache misses (L1 and L2 cache)
- Page fault rate
- Throughput
- Transactions/sec (connection rate)
- Average latency
- Benchmarking Tool
- WebStone
26. Transactions
27. L1 Cache Misses
28. Page Faults
29. Throughput
30. Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10 KB:

Attribute                          Apache   IIS
Max. transaction rate (conn/sec)   2586     4178 (58% more than Apache)
Max. throughput (Mbps)             217      349 (62% more than Apache)
CPU utilization (%)                71       63
L1 misses (millions)               424      200
L2 misses (millions)               1673     117
Page fault rate (pfs/sec)          < 10     < 10
31. Software Router
- Experimental Design
- Factors
- Routing configurations
- TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)
- Metrics
- Throughput
- Number of context switches
- Number of active pages
- Benchmarking Tool
- Netperf
32. Software Router Throughput
33. CPU Utilization
34. Context Switching
35. Active Pages
36. Summary: Software Router Performance Evaluation
- Maximum throughput of 449 Mbps for configuration number 2: full-duplex, one-to-one communication.
- Highest CPU utilization was 84%
- Highest context switching rate was 5378/sec
- Number of active pages is fairly uniformly distributed, indicating low memory activity.
37. Design, Implementation and Evaluation of Prototype DB-RTP Server
- Architecture
- Implementation
- Linux platform (C)
- Our implementation of RTSP/RTP (why?)
38. Double Buffering and Synchronization
Buffer read
Buffer write
39. RTP Server Throughput
40. Jitter
41. Summary: DB-RTP Server Performance Evaluation
- Throughput
- DB-RTP server: 63.85 Mbps
- RTP server: 59 Mbps
- Both servers exhibit steady jitter, but the DB-RTP server has relatively lower jitter than the RTP server.
42. Contributions
- Cache overhead analysis
- Memory latency and bandwidth analysis
- Measurement-based performance evaluation
- Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server
43. Conclusions
- High throughput is possible with server design enhancement.
- Server throughput is significantly degraded by excessive cache misses and page faults.
- Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.
44. Future Work
- Server Development
- hybrid multiplexing/multithreading
- Special Architectures (network processors, ASICs)
- resource scheduling
- investigation of the role of I/O
- use of IRAM (intelligent RAM) architectures
- integrated network infrastructure server
45. Thank you
46. Array Restructuring
- Array padding
- Loop nest transformation
47. Testbeds
- Software router testbed
- Streaming media/web server testbed
48. Communication Configurations
49. Backup Slides
50. Memory Performance
[Charts: page fault rate at 300 kbps and 56 kbps]
51. Streaming Server CPU Utilization
52. Cache Performance (cont.)
53. Cache Performance (cont.)
- L2 cache misses (300 kbps)
54. Web Servers
Transaction
Cache performance
L2 cache misses
L1 cache misses
55. Web Servers
Latency
CPU Utilization
56. DB-RTP Server
L2 cache misses
L1 cache misses
CPU Utilization
57. Memory Performance Evaluation Methodologies
- Analytical
- Requires just paper and pencil
- Accuracy?
- Simulation
- Requires programming
- Time and cost?
- Measurement
- Real system or a prototype required
- Using on-chip counters
- Benchmarking tools
- More accurate
58. Server Performance Tuning
- Memory performance tuning
- Array padding
- Array restructuring
- Loop nest transformation
- Latency hiding and multithreading
- EPIC (IA-64)
- VIRAM
- Impulse
- Multiprocessing and clustering
- Task parallelization
- E.g. Panama cluster router
- Special Architectures
- Network processors
- ASICs and Data flow architectures
59. Temporal vs. Spatial Locality
- A PDU lacks temporal locality
- Observation: PDU processing exhibits excellent spatial locality
- Suppose a data cache line is 32 bytes (or 16 words) long
- Sequential accesses with stride 1
- Accessing one word brings in the other 15 words as well
- Thus, effective MR = 1/16 = 6.25% → better than even scientific apps
- Thus, generally MR = W/L
- W: width of each memory access (in bytes)
- L: length of each cache line (in bytes)
- Validation of the above observation
- Similar spatial locality characteristics reported via measurements
- S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," in Proc. of ACM SIGMETRICS 2001
- MR for a streaming media player better than SPEC benchmark apps!
60. Memory-CPU Transfers
- PDU Processing
- checksum computation and header updating
- Typically, one-way data flow (memory to CPU via cache)
- Memory stall cycles
- Number of memory stall cycles = (IC)(AR)(MR)(MP)
- IC: instruction count per transaction
- AR: number of memory accesses per instruction (AR ≈ 1)
- MR: ratio of cache misses to memory accesses
- MP: miss penalty in terms of clock cycles
- Cache miss rate
- Worst case MR = 1, while typically MP = 10
- Stall cycles = 10 × IC
61. Memory-CPU Transfers (cont.)
- Determine cache overhead w.r.t. execution time
- (Execution time)no-cache = (IC)(CPI)(CC)
- (Execution time)with-cache = (IC)(CPI)(CC)[1 + (MR)(MP)]
- Cache overhead = 1 + (MR)(MP)
- Cache overhead in various cases
- Worst case: MR = 1 and MP = 10
- Cache misses result in 11 times higher latency for each transaction!
- Best case: MR = 0 → trivial
- Average case: MR = 0.1, MP = 10, and (MR)(MP) ≈ 1
- Latency due to stalls ≈ ideal execution time without stalls
- Memory-CPU latency dependent on internal bus bandwidth
- Tm-c = S/(32·Bi) usec, where S is the PDU size and Bi is the internal bus bandwidth in MB/s
62. Open Questions
- Role of special-purpose architectures (e.g., network processors) in the performance of high throughput servers
- Role of memory compression