Memory Performance and Scalability of Intel - PowerPoint PPT Presentation

Memory Performance and Scalability of Intel's and AMD's Dual-Core Processors: A Case Study. Lu Peng, Jih-Kwon Peir, ... Company: LSU

Learn more at: https://www.cise.ufl.edu

1
Memory Performance and Scalability of Intel's and
AMD's Dual-Core Processors: A Case Study
  • Lu Peng (1), Jih-Kwon Peir (2), Tribuvan K. Prakash (1),
  • Yen-Kuang Chen (3) and David Koppelman (1)
  • (1) Louisiana State University
  • (2) University of Florida
  • (3) Intel Corporation

2
Motivation
  • Dual-core processors are popular.
  • Understand the impact of the memory hierarchy on
    overall performance.
  • Which factors matter most for memory hierarchy
    performance?
  • What speedups do two threads achieve?

3
Selected Three Dual-Core Processors
[Figure: Intel Core 2 Duo, Intel Pentium D and AMD
Athlon 64X2; diagram labels: On-Chip, Off-Chip, Shared]
  • Shared cache vs. private cache
  • On-chip vs. off-chip inter-core communication
  • On-chip vs. off-chip memory controller

4
Selected Three Dual-Core Processors
  • Core 2 Duo
  • Shared L2 cache: no L2 coherence needed and
    beneficial when only one core is active, but
    higher latency and a fairness issue.
  • On an L1 miss, the L2 and the other core's L1 are
    searched simultaneously, giving fast
    cache-to-cache transfer and L1 coherence (like a
    bus).
  • Memory controller is off-chip; aggressive memory
    dependence prediction.

5
Selected Three Dual-Core Processors
  • Pentium D
  • Two Pentium 4 cores on one chip, using a
    technology-remap approach (SMP).
  • Private L2 caches with MESI coherence, which
    requires a memory update on an M→S transition.
  • The memory update uses the off-chip FSB; L1
    coherence traffic also goes through the FSB.
  • Memory controller is off-chip: longer delay, but
    adaptable to new DRAM.

6
Selected Three Dual-Core Processors
  • Athlon 64X2
  • Private L2 caches, connected through
    HyperTransport.
  • Uses the System Request Queue for internal
    communication between the two cores.
  • The MOESI coherence protocol allows a
    shared-modified block in the O (Owned) state.
  • No memory update is needed when reading a remote
    Modified block.
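
The difference between the two coherence protocols described on the Pentium D and Athlon 64X2 slides can be sketched as a tiny state-transition function. This is an illustrative model only: the function name and return format are assumptions made for this sketch, and it covers just the one case the slides mention (a core reading a block that another core holds in Modified state).

```python
def remote_read_of_modified(protocol):
    """Return (owner_state, reader_state, memory_updated) after another
    core reads a block that the owner currently holds in Modified (M)."""
    if protocol == "MESI":
        # Dirty data is written back to memory; both copies become Shared.
        return ("S", "S", True)
    if protocol == "MOESI":
        # The owner keeps the dirty data in Owned (O) state and supplies it
        # to the reader directly; memory is not updated at this point.
        return ("O", "S", False)
    raise ValueError(f"unknown protocol: {protocol}")


for proto in ("MESI", "MOESI"):
    owner, reader, wrote_back = remote_read_of_modified(proto)
    print(f"{proto}: owner M->{owner}, reader I->{reader}, "
          f"memory updated: {wrote_back}")
```

On Pentium D the write-back in the MESI case travels over the off-chip FSB, which is consistent with the higher cache-to-cache latency reported later in the deck.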

7
Specifications of the selected processors
8
Methodology
  • Same platform: SUSE Linux 10.1 with kernel
    2.6.16-smp
  • Micro-benchmarks:
  • Memory bandwidth and latency measured by lmbench
  • A lockless program [19] measuring cache-to-cache
    latency
  • Real workloads:
  • Single-threaded: SPEC CPU2000 and CPU2006
  • Multi-threaded: blastp, hmmpfam, SPECjbb2005 and
    SPLASH2

9
Memory operations from lmbench
  • Memory read: measures the time to read every
    4-byte word from memory.
  • Memory write: measures the time to write every
    4-byte word to memory.
  • Other operations, such as memory bzero: refer to
    the paper for details.
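
As a rough illustration of what a "memory read" measurement looks like, the sketch below times one pass over every word of a buffer and converts the result to MB/s. It is not lmbench's code: the function name, the `repeats` parameter and the use of Python's `array` module are assumptions for this sketch, and interpreter overhead dominates the timing, so the absolute numbers are not comparable to the results in the slides.

```python
import time
from array import array

WORD_BYTES = array("i").itemsize  # 4 bytes on common platforms


def read_bandwidth(size_bytes, repeats=3):
    """Read every word of a buffer once per pass; return (checksum, MB/s)."""
    buf = array("i", [1]) * (size_bytes // WORD_BYTES)
    best = float("inf")
    total = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(buf)                      # touches every word once
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)                    # guard against a zero reading
    return total, len(buf) * WORD_BYTES / best / 1e6


if __name__ == "__main__":
    # Sweep array sizes, as a bandwidth-vs-size plot would.
    for size in (16 * 1024, 1024 * 1024, 8 * 1024 * 1024):
        _, mb_s = read_bandwidth(size)
        print(f"{size:>9} bytes: {mb_s:8.1f} MB/s")
```

Taking the best of several passes reduces noise from other activity on the machine; sweeping the buffer size is what exposes the cache-size boundaries discussed on the bandwidth slide.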

10
Lockless program measuring cache-to-cache latency
  • Does not employ expensive read-modify-write
    atomic primitives.
  • Maintains a lockless counter for each thread.
  • pPong is in a different cache line from pPing.
  • C2C latency for Core 2 Duo, Pentium D and Athlon
    64X2: 33 ns, 133 ns and 68 ns, respectively.
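
The ping-pong structure described above can be sketched as follows. This shows only the synchronization pattern (one counter per thread, each with a single writer, no locks or atomic read-modify-write). It is not the program cited in the slides: the function and variable names are assumptions, Python cannot place the counters in chosen cache lines, and the interpreter's global lock means the measured time reflects thread hand-off cost rather than hardware cache-to-cache latency.

```python
import threading
import time


def ping_pong(rounds=200):
    """Bounce a value between two threads; return (counters, s per hand-off)."""
    # Each counter has exactly one writer, so no lock is required.
    counters = {"ping": 0, "pong": 0}

    def pong_thread():
        for i in range(1, rounds + 1):
            while counters["ping"] < i:   # wait for the ping side to advance
                time.sleep(0)             # yield so the other thread can run
            counters["pong"] = i          # reply; only this thread writes it

    worker = threading.Thread(target=pong_thread)
    worker.start()
    start = time.perf_counter()
    for i in range(1, rounds + 1):
        counters["ping"] = i              # only the main thread writes it
        while counters["pong"] < i:       # wait for the reply
            time.sleep(0)
    elapsed = time.perf_counter() - start
    worker.join()
    # One round trip is two hand-offs (ping -> pong -> ping).
    return counters, elapsed / (2 * rounds)


if __name__ == "__main__":
    _, per_handoff = ping_pong()
    print(f"average hand-off time: {per_handoff * 1e6:.1f} us")
```

A native version would pin the two threads to different cores and pad the two counters into separate cache lines, so that each hand-off forces exactly one line to move between caches.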

11
Memory bandwidth collected from the lmbench suite
1. In general, Core 2 Duo and Athlon 64X2 have
better bandwidth than Pentium D.
2. Pentium D shows the best memory read bandwidth
when the array size is less than its L2 size.
3. Athlon 64X2 provides doubled memory read
bandwidth when two copies of lmbench run, benefiting
from its on-chip memory controller.
Figure callouts: "Private cache is faster!", "Doubled!!"

12
SPEC CPU2000 and CPU2006 benchmarks execution
time
1. Core 2 Duo runs fastest for almost all
workloads, especially art and mcf.
2. Athlon 64X2 shows the best performance for ammp,
which has a large working set, resulting in a high
L2 miss rate.
3. When mixed with another program, the execution
time of memory-intensive programs increases
substantially.
4. When mixed with another program, the execution
time of CPU-bound programs increases only slightly.

13
Multi-programmed speedup of mixed SPEC CPU
2000/2006 benchmarks
1. Athlon 64X2 achieves the best speedup for all
workloads.
2. CPU-bound programs show the best speedup.
3. Memory-bound programs show the worst speedup.

14
Multithreaded Program Behaviors
1. Core 2 Duo's single-thread performance benefits
from its larger L2 cache.
2. Core 2 Duo shows the best speedup for ocean
due to a high cache-to-cache transfer ratio
(verified with the Intel VTune Analyzer).
3. Pentium D shows the best speedup for barnes
because of its low cache miss rate.

15
Conclusions
  • Analyzed the memory hierarchy of selected Intel
    and AMD dual-core processors.
  • For the best performance and scalability, the
    following factors are important:
  • fast cache-to-cache communication
  • large L2 or shared capacity
  • fast front side bus
  • on-chip memory controller
  • fair resource (cache) sharing

16
  • Thank you!
  • Questions?

17
Backup Slides (Memory load latency collected from
the lmbench suite)
18
Memory latency collected from the lmbench suite
(continued)
  • Latencies for all configurations jump once the
    array size exceeds the L2 size.
  • When the stride size is equal to 128 bytes,
    Pentium D still benefits partially, but the L2
    prefetchers of Core 2 Duo and Athlon 64X2 are not
    triggered.
  • When the stride size is larger than 128 bytes,
    Athlon 64X2's on-die memory controller and
    separate I/O HyperTransport show their advantage.
  • Running two copies of the lmbench suite puts
    more pressure on Pentium D.
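
The role of array size and stride in these latency results is easier to see with a small pointer-chasing sketch, the general technique used by load-latency tests such as lmbench's: each slot stores the location of the next slot to visit, a fixed stride away, so every load depends on the previous one. The function names and slot-based indexing below are assumptions for illustration, and a Python list is not a flat byte array, so the timings do not reflect real cache behavior.

```python
import time


def build_chain(n_slots, stride_slots):
    """chain[i] holds the index of the next slot, stride_slots further on."""
    return [(i + stride_slots) % n_slots for i in range(n_slots)]


def chase(chain, steps):
    """Follow the chain for `steps` dependent loads; return (index, ns/load)."""
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = chain[idx]          # next address comes from the previous load
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps * 1e9


if __name__ == "__main__":
    for stride in (1, 16, 32):
        chain = build_chain(1 << 16, stride)
        _, ns = chase(chain, 200_000)
        print(f"stride {stride:>2} slots: {ns:6.1f} ns per load")
```

Because each load must finish before the next address is known, the loop exposes latency rather than bandwidth; changing the stride changes how much help a hardware prefetcher can give, which is the effect the bullets above describe.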

19
Backup Slides (Bandwidth for STREAM / STREAM2)
  • The add operation is a loop of c[i] = a[i] +
    b[i], which can easily take advantage of SSE2
    packed operations. It shows higher bandwidth.
  • Intel Core 2 Duo shows the best bandwidth for
    all operations because of its L1 data prefetchers
    and faster front side bus.
  • Athlon 64X2 has better bandwidth than Pentium D
    due to its faster on-chip memory controller.
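
For reference, the STREAM add kernel and its bandwidth accounting can be sketched as below. STREAM counts three array elements of traffic per iteration for add (two reads and one write). The function name and array initial values are assumptions for this sketch, and a Python loop is far slower than the compiled, SSE2-vectorized kernel the slide refers to, so only the structure and the byte counting carry over.

```python
import time
from array import array


def stream_add(n):
    """Run c[i] = a[i] + b[i] over n doubles; return (c, MB/s)."""
    a = array("d", [1.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [0.0]) * n
    start = time.perf_counter()
    for i in range(n):
        c[i] = a[i] + b[i]
    elapsed = max(time.perf_counter() - start, 1e-9)
    bytes_moved = 3 * n * a.itemsize      # two reads + one write per element
    return c, bytes_moved / elapsed / 1e6


if __name__ == "__main__":
    _, mb_s = stream_add(1_000_000)
    print(f"add: {mb_s:.1f} MB/s")
```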