Intel Pentium 4 - PowerPoint PPT Presentation

Title:

Intel Pentium 4

Description:

Intel Pentium 4. ENCM 515 - 2002. Jonathan Bienert, Tyson Marchuk. Overview: Product review; Specialized architectural features (NetBurst); SIMD instructional capabilities ...

Transcript and Presenter's Notes

Title: Intel Pentium 4


1
Intel Pentium 4
  • ENCM 515 - 2002
  • Jonathan Bienert
  • Tyson Marchuk

2
Overview
  • Product review
  • Specialized architectural features (NetBurst)
  • SIMD instructional capabilities (MMX, SSE2)
  • SHARC 2106x comparison

3
Intel Pentium 4
  • Reworked micro-architecture for high-bandwidth
    applications
  • Internet audio and streaming video, image
    processing, video content creation, speech, 3D,
    CAD, games, multi-media, and multi-tasking user
    environments
  • These are DSP intensive applications!
  • What about uses other than in PC?

4
Hardware Features (NetBurst micro-architecture)
  • Hyper pipelined technology
  • Advanced dynamic execution
  • Cache (data, L1, L2)
  • Rapid ALU execution engines
  • 400 MHz bus
  • Out-of-order execution (OOE)
  • Microcode ROM

5

6
Hyper Pipeline
  • 20-stage pipeline!!!
  • breaks down complex CISC instructions
  • sub-stages mimic RISC
  • faster execution

7
Filling the pipeline...
  • Review of next 126 instructions to be executed
  • Branch prediction
  • on a mispredict, must flush the 20-stage pipeline!!!
  • branch target buffer (BTB)
  • 4K branch history table (BHT)
  • assembly instruction hints
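
The slides do not say how the branch history table makes its predictions. As a rough illustration only, the sketch below assumes the textbook scheme of one 2-bit saturating counter per entry, indexed by the low bits of the branch address. The class name, the indexing, and the sample branch address are hypothetical, not the Pentium 4's actual predictor.

```python
class BranchHistoryTable:
    """Textbook 2-bit saturating counters (an assumption, not the P4 design)."""

    def __init__(self, entries: int = 4096):  # 4K entries, as on the slide
        self.entries = entries
        # Counter values 0-1 predict "not taken", 2-3 predict "taken".
        self.counters = [1] * entries

    def _index(self, address: int) -> int:
        return address % self.entries

    def predict(self, address: int) -> bool:
        return self.counters[self._index(address)] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredicts(bht: BranchHistoryTable, address: int, outcomes) -> int:
    """Replay a sequence of branch outcomes and count wrong predictions."""
    misses = 0
    for taken in outcomes:
        if bht.predict(address) != taken:
            misses += 1
        bht.update(address, taken)
    return misses


# A loop-closing branch: taken 9 times, then falls through, repeated 3 times.
outcomes = ([True] * 9 + [False]) * 3
misses = count_mispredicts(BranchHistoryTable(), 0x401000, outcomes)
print(misses, "mispredicts out of", len(outcomes))
```

Under this model the loop branch is mispredicted once while the counter warms up and once per loop exit. Each of those would cost a pipeline flush.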

8
Cache
  • 8KB Data Cache
  • L1 Execution Trace Cache
  • stores 12K previously decoded micro-ops
  • saves having to translate them again
  • L2 Advanced Transfer Cache
  • 256KB for data
  • 256-bit transfer every cycle
  • allows ~77GB/s data transfer at 2.4GHz
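
The 77GB/s figure follows from the two numbers on the slide. A minimal check, assuming one full 256-bit transfer on every core clock cycle (the function name is just for illustration):

```python
def l2_bandwidth_gb_per_s(bus_bits: int, clock_hz: float) -> float:
    """Peak bandwidth if one full-width transfer completes every cycle."""
    bytes_per_cycle = bus_bits // 8          # 256 bits -> 32 bytes
    return bytes_per_cycle * clock_hz / 1e9  # bytes/s -> GB/s (10^9 bytes)


bandwidth = l2_bandwidth_gb_per_s(256, 2.4e9)
print(f"{bandwidth:.1f} GB/s")  # 32 bytes x 2.4 GHz, which the slide rounds to 77
```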

9
Rapid ALU Execution Engines
  • 2 ALUs
  • allow parallel operations
  • Many arithmetic operations take 1/2 cycle
  • each 2X ALU can have 2 operations per cycle

10
Software Features
  • Multimedia Extensions (MMX)
  • 8 MMX registers
  • Streaming SIMD Extensions (SSE2)
  • 8 SSE/SSE2 registers
  • Standard x86 Registers
  • EAX, EBX, ECX, EDX, ESI, etc.
  • Register renaming to over 100 physical registers

11
MMX (Multimedia Extensions)
  • Accelerated performance through SIMD
  • multimedia, communication, internet applications
  • 64-bit packed INTEGER data
  • signed/unsigned

12
SSE2 (Streaming SIMD Extensions)
  • Accelerate a broad range of applications
  • video, speech, image and photo processing,
    encryption, financial, engineering, and
    scientific applications
  • 128-bit SIMD instruction formats
  • 4 single precision FP values
  • 2 double precision FP values
  • 16 byte values
  • 8 word values
  • 4 double word values
  • 2 quad word values
  • 1 128-bit integer value
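
The element counts above are just 128 bits divided by the element width. The sketch below reproduces them with Python's `struct` module, whose format codes stand in for the packed types here. The mapping of names to format codes is an illustration, not an Intel definition.

```python
import struct

REGISTER_BYTES = 16  # one 128-bit register

# struct format codes standing in for each packed element type
FORMATS = {
    "single precision FP": "f",
    "double precision FP": "d",
    "byte": "B",
    "word": "H",
    "double word": "I",
    "quad word": "Q",
}


def lanes(fmt: str) -> int:
    """How many elements of this type fit in one 128-bit register."""
    return REGISTER_BYTES // struct.calcsize("<" + fmt)


for name, fmt in FORMATS.items():
    print(f"{lanes(fmt):2d} x {name}")

# The same 16 bytes can be viewed as 4 floats or as 8 words.
packed = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
as_words = struct.unpack("<8H", packed)
print(len(packed), "bytes,", len(as_words), "words")
```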

13
SIMD Example (16-tap FIR filter - Real numbers)
  • Applications for real FIR filters
  • general purpose filters in image processing,
    audio, and communication algorithms
  • Will utilize SSE2 SIMD instruction set

14
Thinking about SIMD
  • SSE2 instruction format is 128-bits
  • 128-bit SSE2 registers
  • Many data formats!
  • What precision do we want?
  • Let's use 32-bit floating point for coefficients,
    input, output
  • 4 data values x 32 bits = 128 bits

15
Parallelizing
  • Require many single multiplications (coefficients
    x inputs), then add the results for output!
  • Multiplications
  • then need to perform additions...

16
Using SSE2 format
  • Can hold 4 elements of an array (of 32-bit data)
    in each 128-bit register
  • 4 single precision floating point ops per cycle
    (32-bit)

17
Additions...
  • In both registers, now have 4 32-bit results
  • First add the results into an accumulator
    register
  • 4 single precision floating point ops per cycle
    (32-bit)

18
Additions...
  • In a register, now have 4 32-bit results
  • however, NO SSE2 instruction to add these 4!
  • But can use other instructions
  • Some BIT INTERTWINING, then add
  • This will give results for several output values!
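
The data flow of slides 15 to 18 can be modelled in plain Python, treating a 4-element list as one 128-bit register. This is a sketch of the idea only. The helper names are hypothetical, it computes a single output sample rather than several at once, and Python floats are 64-bit, so it shows the lane structure and not 32-bit rounding.

```python
LANES = 4  # four 32-bit values per 128-bit register


def mul4(a, b):
    """Lane-wise multiply of two 4-element 'registers'."""
    return [x * y for x, y in zip(a, b)]


def add4(a, b):
    """Lane-wise add of two 4-element 'registers'."""
    return [x + y for x, y in zip(a, b)]


def shuffle4(a, order):
    """Rearrange lanes: the 'intertwining' step."""
    return [a[i] for i in order]


def horizontal_sum(acc):
    """Add the 4 lanes together using only shuffles and lane-wise adds."""
    t = add4(acc, shuffle4(acc, (2, 3, 0, 1)))  # [a0+a2, a1+a3, a2+a0, a3+a1]
    t = add4(t, shuffle4(t, (1, 0, 3, 2)))      # every lane = a0+a1+a2+a3
    return t[0]


def fir_output(coeffs, window):
    """One output sample: sum of coeffs[k] * window[k] over all taps.

    The caller is assumed to arrange `window` so window[k] pairs with coeffs[k].
    """
    assert len(coeffs) == len(window) and len(coeffs) % LANES == 0
    acc = [0.0] * LANES
    for k in range(0, len(coeffs), LANES):
        # 4 multiplies, then 4 accumulating adds, per pass
        acc = add4(acc, mul4(coeffs[k:k + LANES], window[k:k + LANES]))
    return horizontal_sum(acc)


coeffs = [1.0] * 16                     # 16 taps, all ones for easy checking
window = [float(i) for i in range(16)]  # inputs 0..15
result = fir_output(coeffs, window)
print(result)  # 120.0, the sum of 0..15
```

For 16 taps the loop runs 4 passes, so the final cross-lane reduction is needed only once per output sample.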

19
ADI SHARC 21k vs. P4
  • Disadvantages
  • Slower clock speed (40MHz vs 2400MHz)
  • Fewer opportunities for parallelism (5 vs 11)
  • Much less memory (Cache and System)
  • Limited algorithm applicability
  • Limited applications
  • Older (less compiler support)
  • 1994 vs 2001

20
ADI SHARC 21k vs. P4
  • Advantages
  • Hardware loops
  • Easier to program for optimal speed
  • Cheaper
  • Lower power consumption
  • Runs cooler

21
FIR Performance
  • Hard to obtain P4 performance numbers
  • Can estimate based on 2 FP multiplies per clock,
    clock rate and assumption that pipeline can be
    kept full.
  • 2 x 2.4GHz = 4.8 billion multiplies per second
  • If 4 multiplies per element, at 44000 samples/s
  • FIR length > 25k taps
  • SHARC > 200 taps (Lab 4)
  • Factor of 125x
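
The estimate can be checked with a few lines of arithmetic. The function and parameter names are illustrative, and like the slide, the calculation assumes the pipeline is kept full the whole time.

```python
def max_fir_taps(mults_per_clock, clock_hz, mults_per_tap, sample_rate_hz):
    """Longest real-time FIR the multiplier throughput could sustain."""
    mults_per_second = mults_per_clock * clock_hz
    return mults_per_second / (mults_per_tap * sample_rate_hz)


p4_taps = max_fir_taps(2, 2.4e9, 4, 44_000)
print(f"P4 estimate: about {p4_taps:,.0f} taps")  # just over the slide's 25k

# Ratio of the rounded figures quoted on the slide
ratio = 25_000 / 200
print(f"P4 vs SHARC: {ratio:.0f}x")
```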

22
IIR Performance
  • Hard to obtain P4 performance numbers
  • No hardware circular buffers
  • Does have BTB, BHT, etc.
  • Prefetches 256 bytes ahead of current position in
    code.

23
FFT Performance
  • Hard to obtain P4 performance numbers
  • Prime95 uses FFT to calculate Lucas-Lehmer test
    for Mersenne Primes
  • Involves FFT, squaring and iFFT, etc.
  • 256k points on P4 2.3GHz: 10.517ms
  • Compare to SHARC 2048 point FFT: 0.37ms
  • If SHARC could do 256k: 46.25ms (But)
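
The 46.25ms figure matches scaling the 2048-point time linearly, reading 256k as 256,000 points. FFT work usually grows as N log N rather than N, so the sketch below also shows that scaling as an alternative assumption. Neither number is a measurement, and the P4 timing covers FFT, squaring and iFFT, so the comparison is only indicative.

```python
import math


def scale_linear(t_ms: float, n_from: int, n_to: int) -> float:
    """Scale a timing assuming cost proportional to N."""
    return t_ms * n_to / n_from


def scale_nlogn(t_ms: float, n_from: int, n_to: int) -> float:
    """Scale a timing assuming cost proportional to N * log2(N)."""
    return t_ms * (n_to * math.log2(n_to)) / (n_from * math.log2(n_from))


sharc_2048_ms = 0.37
linear_ms = scale_linear(sharc_2048_ms, 2048, 256_000)
nlogn_ms = scale_nlogn(sharc_2048_ms, 2048, 2**18)

print(f"linear scaling:  {linear_ms:.2f} ms")
print(f"N log N scaling: {nlogn_ms:.1f} ms")
```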

24
Optimization Example
  • Hard to optimize Pentium 4 assembly
  • Example of multiplying by a constant, 10
  • Taken mainly from
    www.emulators.com/docs/pentium_1.htm

25
Multiplying by 10
  • Slowest way
  • IMUL EAX, 10
  • Usually optimal way (Visual C++ 6.0)
  • LEA EAX, [EAX+EAX*4]
  • SHL EAX, 1
  • Shift Add Shift
  • On most x86 processors takes 2 cycles
  • Pentium MMX and before 3 cycles
  • On Pentium 4 takes 6 cycles!

26
Multiplying by 10
  • Optimal for Pentium 4
  • LEA ECX, [EAX+EAX]
  • LEA EAX, [ECX+EAX*8]
  • On most x86 still takes 2 cycles
  • On Pentium 4 takes 3 cycles (OOE - µOps)
  • But on older processors Pentium MMX and before
    this now takes 4 cycles!

27
Multiplying by 10
  • Best generic case
  • LEA EAX, [EAX+EAX*4]
  • ADD EAX, EAX
  • On most x86 still takes 2 cycles
  • On older processors Pentium MMX and before this
    now takes 3 cycles again
  • On Pentium 4 this takes 4 cycles
  • Obviously really hard to optimize
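
All three sequences should give the same result as IMUL EAX, 10. The sketch below models them on 32-bit registers, reading each LEA operand as base + index*scale. It checks correctness only and says nothing about the cycle counts quoted on the slides. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # registers wrap at 32 bits


def lea(base: int, index: int = 0, scale: int = 1) -> int:
    """Model of LEA dst, [base + index*scale]."""
    assert scale in (1, 2, 4, 8)  # the only scale factors x86 allows
    return (base + index * scale) & MASK32


def times10_lea_shl(eax: int) -> int:
    eax = lea(eax, eax, 4)       # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax << 1) & MASK32   # SHL EAX, 1            -> 10x


def times10_two_lea(eax: int) -> int:
    ecx = lea(eax, eax)          # LEA ECX, [EAX+EAX]    -> 2x
    return lea(ecx, eax, 8)      # LEA EAX, [ECX+EAX*8]  -> 2x + 8x = 10x


def times10_lea_add(eax: int) -> int:
    eax = lea(eax, eax, 4)       # LEA EAX, [EAX+EAX*4]  -> 5x
    return (eax + eax) & MASK32  # ADD EAX, EAX          -> 10x


for x in (0, 1, 7, 12345, 0xFFFFFFFF):
    expected = (x * 10) & MASK32  # what IMUL EAX, 10 leaves in EAX
    assert times10_lea_shl(x) == expected
    assert times10_two_lea(x) == expected
    assert times10_lea_add(x) == expected
print("all three sequences match IMUL EAX, 10")
```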

28
REFERENCES
  • Intel application note AP 809 - Real and Complex
    Filter Using Streaming SIMD Extensions
  • graphics from
    http://www6.tomshardware.com/cpu/00q4/001120/p4-01.html