Simultaneous%20Multithreading:%20Maximizing%20On-Chip%20Parallelism - PowerPoint PPT Presentation

About This Presentation
Title:

Simultaneous%20Multithreading:%20Maximizing%20On-Chip%20Parallelism

Description:

The high-speed execution core of the. AMD Athlon XP processor includes multiple x86 instruction. decoders, a dual-ported 128-Kbyte split level-one (L1) cache, an ... – PowerPoint PPT presentation

Number of Views:168
Avg rating:3.0/5.0
Slides: 27
Provided by: stude817
Category:

less

Transcript and Presenter's Notes

Title: Simultaneous%20Multithreading:%20Maximizing%20On-Chip%20Parallelism


1
Simultaneous Multithreading Maximizing On-Chip
Parallelism
  • Presented By
  • Daron Shrode
  • Shey Liggett

2
Introduction
  • Simultaneous Multithreading a technique
    permitting several independent threads to issue
    instructions to a superscalars multiple
    functional units in a single cycle.
  • The objective of SM is to substantially increase
    processor utilization in the face of both long
    memory latencies and limited available
    parallelism per thread.

3
(No Transcript)
4
Overview
  • Introduce several SM models
  • Evaluate the performance of those models relative
    to superscalar and fine-grain multithreading
  • Show how to tune the cache hierarchy for SM
    processors
  • Demonstrate the potential for performance and
    real-estate advantages of SM architectures over
    small-scale, on chip multiprocessors

5
Simulation Environment
  • Developed a simulation environment that defines
    an implementation of SM architecture
  • Uses emulation-based instructions-level
    simulation, similar to Tango and g88
  • Models the execution pipelines, the memory
    hierarchy (both in terms of hit rates and
    bandwidths), the TLBs, and the branch prediction
    logic of a wide superscalar processor
  • Based on the Alpha AXP 21164, augmented first for
    wider superscalar execution and then for
    multithreaded execution

6
Simulation Environment (cont.)
  • Typical Simulated configuration contains 10
    functional units of four types (four integer, two
    floating point, three load/store and 1 branch)
    and a maximum issue rate of 8 instructions per
    cycle.

7
Simulation Environment (cont.)
8
(No Transcript)
9
Superscalar Bottlenecks
  • No dominant source of wasted issue bandwidth,
    therefore, no dominant solution
  • No single latency-tolerating technique will
    produce a dramatic increase in the performance of
    these programs if it only attacks specific types
    of latencies

10
SM Machine Models
11
SM Machine Models (cont.)
12
SM Machine Models (cont.)
13
SM Machine Models (cont.)
14
SM Machine Models (cont.)
15
SM Machine Models (cont.)
16
SM Machine Models (cont.)
  • In summary, the results show that simultaneous
    multithreading surpasses limits on the
    performance attainable through either
    single-thread execution or fine-grain
    multithreading, when run on a wide superscalar.
  • Simplified implementations of SM with limited
    per-thread capabilities can still attain high
    instruction throughput.
  • These improvements come without any significant
    tuning of the architecture for multithreaded
    execution.

17
Cache Design
  • Cache sharing caused performance degradation in
    SM processors.
  • Different cache configurations were simulated to
    determine optimum configurations

18
(No Transcript)
19
Cache Design
  • Two configurations appear to be good choices
  • 64s.64s
  • 64p.64s
  • Important Note cache sizes today are larger than
    those at the time of the paper (1995).

20
Simultaneous Multithreading vs Single-Chip
Multiprocessing
  • On organizational level, the two are similar
  • Multiple register sets
  • Multiple functional units
  • High issue bandwidth on a single chip

21
Simultaneous Multithreading vs Single-Chip
Multiprocessing (cont.)
  • Key difference is the way resources are
    partitioned and scheduled
  • MP statically partitions resources
  • SM allows partitions to change every cycle
  • MP and SM tested in similar configurations to
    compare performance

22
Simultaneous Multithreading vs Single-Chip
Multiprocessing (cont.)
23
Conclusion
24
Pentium 4
  • Product Features
  • Available at 1.50, 1.60, 1.70, 1.80, 1.90 and 2
    GHz
  • Binary compatible with applications running on
    previous members of the Intel microprocessor line
  • Intel NetBurst micro-architecture
  • System bus frequency at 400 MHz
  • Rapid Execution Engine Arithmetic Logic Units
    (ALUs) run at twice the processor core frequency
  • Hyper Pipelined Technology
  • Advance Dynamic Execution
  • Very deep out-of-order execution
  • Enhanced branch prediction
  • Level 1 Execution Trace Cache stores 12K
    micro-ops and removes decoder latency from main
    execution loops
  • 8 KB Level 1 data cache
  • 256 KB Advanced Transfer Cache (on- die, full
    speed Level 2 (L2) cache) with 8-way
    associativity and Error Correcting Code (ECC)
  • 144 new Streaming SIMD Extensions 2 (SSE2)
    instructions
  • Enhanced floating point and multimedia unit for
    enhanced video, audio, encryption, and 3D
    performance
  • Power Management capabilities
  • System Management mode
  • Multiple low-power states
  • Optimized for 32-bit applications running on
    advanced 32-bit operating systems

25
AMD Athlon
  • The AMD Athlon XP processor features a
    seventh-generation
  • microarchitecture with an integrated, exclusive
    L2 cache, which
  • supports the growing processor and system
    bandwidth
  • requirements of emerging software, graphics, I/O,
    and memory
  • technologies. The high-speed execution core of
    the
  • AMD Athlon XP processor includes multiple x86
    instruction
  • decoders, a dual-ported 128-Kbyte split level-one
    (L1) cache, an
  • exclusive 256-Kbyte L2 cache, three independent
    integer
  • pipelines, three address calculation pipelines,
    and a superscalar,
  • fully pipelined, out-of-order, three-way
    floating-point engine.
  • The floating-point engine is capable of
    delivering outstanding
  • performance on numerically complex applications.

26
AMD Athlon (cont.)
  • The following features summarize the AMD Athlon
    XP
  • processor QuantiSpeed architecture
  • An advanced nine-issue, superpipelined,
    superscalar x86
  • processor microarchitecture designed for
    increased
  • Instructions Per Cycle (IPC) and high clock
    frequencies
  • Fully pipelined floating-point unit that executes
    all x87
  • (floating-point), MMX, SSE and 3DNow!
    instructions
  • Hardware data pre-fetch that increases and
    optimizes
  • performance on high-end software applications
    utilizing
  • high-bandwidth system capability
  • Advanced two-level Translation Look-aside Buffer
    (TLB)
  • structures for both enhanced data and
    instruction address
  • translation. The AMD Athlon XP processor with
  • QuantiSpeed architecture incorporates three TLB
  • optimizations the L1 DTLB increases from 32 to
    40 entries,
  • the L2 ITLB and L2 DTLB both use exclusive
    architecture,
  • and the TLB entries can be speculatively loaded.
Write a Comment
User Comments (0)
About PowerShow.com