1
AMD K7 Processor Architecture
  • CMPE 511
  • prepared by Özsun S. Sönmez

2
Introduction
  • AMD K7 is the first 7th-generation PC CPU. The
    first six generations were 8086, 80286, 80386,
    80486, Pentium (AMD K5/K6) and Pentium II (AMD
    K6-2/K6-3). It is designed to operate above
    500 MHz.
  • AMD K7, also known as AMD Athlon, was introduced
    in the first half of 1999, and its architecture
    forms the basis for the subsequent Athlon XP
    versions until the release of K8 (AMD Hammer).
  • Its competitor, the Intel Pentium III, was also
    released in the same year; the two processors
    are compared wherever possible throughout the
    presentation.

3
Main Features
  • Out-of-order, 3-way superscalar x86 microprocessor
  • 9 independent execution pipelines, with a 10-stage
    integer and a 15-stage FP pipeline
  • 3 Integer Execution Units
  • 3 Address Calculation Units
  • 3 Floating Point Execution Units
  • 64 kB instruction and 64 kB data L1 caches
  • Integrated L2 cache controller supporting up to
    8 MB
  • Extended 3DNow! instructions

4
Main Features
  • K7 uses the Digital Alpha EV6 system bus
    interface. This is probably the most important
    architectural difference from the previous
    generations. EV6 provides:
  • - Data transfer on both rising and falling clock
    edges, doubling the effective bus speed
  • - Scalability beyond a 200 MHz clock (beyond
    400 MHz effective bus speed)
  • - The highest bandwidth of its time:
  • Athlon at 100 MHz (x2) → 1.60 GB/s
  • PIII at 133 MHz → 1.06 GB/s
  • - 72-bit (64 data + 8 ECC) data bus
  • - Independent address bus able to address 8
    terabytes
  • - Independent snoop bus
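The bandwidth figures above follow from clock rate, transfers per clock and bus width. The short Python sketch below reproduces that arithmetic. The function name is hypothetical, the 64-bit data width assumed for the PIII bus is not stated on the slide, and the results are nominal peaks (1 GB taken as 10^9 bytes), so they may differ slightly from quoted figures.

```python
def peak_bandwidth_gb_s(clock_mhz, transfers_per_clock, data_bits=64):
    """Nominal peak bus bandwidth in GB/s (1 GB = 10**9 bytes).

    data_bits counts payload bits only; ECC bits carry no user data.
    """
    bytes_per_transfer = data_bits // 8
    return clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# EV6 moves data on both clock edges: 2 transfers per 100 MHz clock.
athlon = peak_bandwidth_gb_s(100, 2)
# Assumption: the PIII bus is 64 bits wide with 1 transfer per 133 MHz clock.
piii = peak_bandwidth_gb_s(133, 1)

# An 8-terabyte address space corresponds to 43 address bits.
address_bits = (8 * 2**40).bit_length() - 1

print(f"Athlon EV6: {athlon:.2f} GB/s")
print(f"PIII:       {piii:.2f} GB/s")
print(f"Address bits for 8 TB: {address_bits}")
```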

5
Main Features - EV6 (cont.)
  • - Low-voltage signaling for low-cost motherboard
    implementations
  • → Motherboards with GeForce, Dolby and
    Ethernet available below 80.
  • - Point-to-point topology with clock forwarding
    for scalable multiprocessing.

6
AMD K7 Processor Block Diagram

7
Cache Architecture
  • Separate L1 instruction and data caches
  • Both are 64 kB, 64-bit, 2-way set associative
    and dual ported. Each has a 24-entry L1 TLB
    (32-entry for the data cache) and a 256-entry
    L2 TLB.
  • The instruction cache stores predecode
    information to assist the multiple instruction
    decoders.
  • The L2 cache controller can interface to up to
    8 MB of industry-standard SDR or DDR SRAM and
    provides a full tag for a 512 kB cache or a
    partial tag for larger caches. The interface is
    64 data + 8 ECC bits wide.

8
Cache Competition
  • AMD Athlon (1999)
  • - L1: 2x64 kB, 64-bit, 2-way, 3-cycle latency,
    64-byte lines
  • - L2: 512 kB, 64-bit, 2-way, 18-cycle latency,
    off-chip, 64-byte lines
  • Intel PIII Katmai (1999)
  • - L1: 2x16 kB, 64-bit, 4-way, 3-cycle latency,
    32-byte lines
  • - L2: 512 kB, 64-bit, 2-way, 21-cycle latency,
    off-chip, 32-byte lines
  • Intel PIII Coppermine (1999): L2 changed to
  • - 256 kB, 256-bit, 8-way, 4-cycle latency, on-chip
  • AMD Athlon Thunderbird (2000): L2 changed to
  • - 256 kB, 64-bit, 16-way, 7-cycle latency, on-chip
  • - Exclusive cache structure: the L1 and L2
    caches hold different data, so a line stored in
    L1 is not duplicated in L2
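The cache parameters above determine the number of sets in each cache, and the exclusive structure determines how much distinct data the hierarchy can hold. The Python sketch below works through both. The helper names are hypothetical, and the inclusive case is a comparison baseline introduced here, not a claim about any specific processor.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def unique_capacity_kb(l1_total_kb, l2_kb, exclusive):
    """Distinct data the L1 + L2 hierarchy can hold, in kB.

    Exclusive: L1 and L2 never hold the same line, so capacities add.
    Inclusive (hypothetical baseline): every L1 line is also in L2,
    so L2 alone bounds the capacity (assumes L2 >= L1).
    """
    return l1_total_kb + l2_kb if exclusive else max(l1_total_kb, l2_kb)


KB = 1024

# Athlon L1: 64 kB, 2-way, 64-byte lines
l1_sets = cache_sets(64 * KB, 2, 64)
# Thunderbird L2: 256 kB, 16-way, 64-byte lines
l2_sets = cache_sets(256 * KB, 16, 64)

# Thunderbird: 2 x 64 kB of L1 plus 256 kB of exclusive L2
exclusive_kb = unique_capacity_kb(128, 256, exclusive=True)
inclusive_kb = unique_capacity_kb(128, 256, exclusive=False)

print(l1_sets, l2_sets, exclusive_kb, inclusive_kb)
```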

9
Cache Competition

10
Pipeline Architecture - Decoders
  • 3-way decoders convert x86 instructions into
    fixed-length Macro-Ops (MOPs) and send them to
    the Instruction Control Unit (ICU)
  • The ICU contains 72 entries vs. 20 entries for
    the PIII → superior out-of-order execution
    performance

11
Pipeline - Integer Execution Units
  • 3 IEU, 3 AGU
  • 15-entry integer scheduler
  • 24-entry, 32-bit register file with 9 read and
    8 write ports

12
Pipeline - Floating Point Unit
  • The floating point units execute MMX, x87 (FP)
    and 3DNow! instructions
  • 36-entry FP scheduler
  • 88-entry, 90-bit register file with 5 read and
    5 write ports
  • Some stages of the MUL pipeline may be unused
    during DIV/SQRT iterations. The ICU informs the
    FP scheduler in such cases so that there is
    sufficient time to schedule independent MULs in
    the unused cycles.
  • - DIV by an exact power of two (2^n) or by zero
    takes 11 cycles

13
Pipeline - Load/Store Unit
  • 44-entry Load/Store queue
  • Data forwarding from stores to dependent loads
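Store-to-load forwarding means a load can take its value from an older store that has not yet been written to memory. The Python sketch below is a simplified software model of that idea, not the K7 circuit: the class and method names are hypothetical, addresses are matched exactly (no partial overlaps), and the 44-entry capacity is borrowed from the slide purely for illustration.

```python
from collections import deque


class StoreQueue:
    """Toy model of pending stores that can forward data to later loads."""

    def __init__(self, capacity=44):
        self.capacity = capacity
        self.entries = deque()  # (address, value), oldest first

    def store(self, addr, value):
        """Buffer a store that has not yet reached memory."""
        if len(self.entries) >= self.capacity:
            raise OverflowError("store queue full")
        self.entries.append((addr, value))

    def load(self, addr, memory):
        """Return (value, forwarded). The youngest matching store wins."""
        for a, v in reversed(self.entries):
            if a == addr:
                return v, True
        return memory.get(addr, 0), False

    def retire_oldest(self, memory):
        """Commit the oldest buffered store to memory."""
        addr, value = self.entries.popleft()
        memory[addr] = value


memory = {0x10: 1}
queue = StoreQueue()
queue.store(0x10, 7)
print(queue.load(0x10, memory))  # value comes from the queue, not memory
queue.retire_oldest(memory)
print(queue.load(0x10, memory))  # now read from memory
```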

14
Pipeline - Stages

15
Branch Prediction
  • The dynamic branch prediction logic is composed
    of:
  • Branch prediction table (BPT): two-way,
    2048-entry (512 for PIII). The BPT stores
    prediction information used for predicting the
    direction of conditional branches.
  • Branch target address table: stores target
    addresses of conditional and unconditional
    branches.

16
Branch Prediction
  • Return address stack: 12-entry, optimizes
    CALL/RET instruction pairs
  • The BPT is accessed during the Fetch stage and
    the prediction is made during the Scan stage
    using the Smith prediction algorithm (2-bit
    counters)
  • Misprediction penalty is 10 cycles
  • Approximate rate of correct branch predictions:
  • - AMD Athlon: 95%
  • - Intel Pentium III: 90-92%
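The 2-bit counter scheme mentioned above can be sketched in a few lines of Python. This is a generic saturating-counter predictor, not the K7 implementation: the class name, the modulo indexing by branch address and the initial counter value are assumptions made for illustration, and only the 2048-entry table size is taken from the slides.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (Smith-style predictor).

    Counter values 0-1 predict not taken, 2-3 predict taken.
    """

    def __init__(self, entries=2048):
        self.entries = entries
        self.counters = [1] * entries  # assumption: start weakly not taken

    def _index(self, pc):
        # Assumption: simple modulo indexing by branch address.
        return pc % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def accuracy(predictor, pc, outcomes):
    """Fraction of outcomes predicted correctly for one branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(outcomes)


# A loop branch: taken 9 times, then not taken once, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(TwoBitPredictor(), 0x400, loop_branch))
```

Because the counter needs two consecutive wrong outcomes to flip its prediction, a single loop exit costs one misprediction instead of two, which is the main advantage over a 1-bit scheme.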

17
3DNow! Technology
  • 3DNow! is a set of SIMD instructions designed to
    accelerate the FP-intensive multimedia
    applications.
  • Instructions operate on two packed
    single-precision values (32-bit doublewords)
    simultaneously:
  • Dst[63:32] = Dst[63:32] op Src[63:32]
  • Dst[31:0] = Dst[31:0] op Src[31:0]
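The two-lane operation above can be modeled in Python by treating a 64-bit integer as a register holding two single-precision values. This is a minimal sketch of the data layout only. The helper names are hypothetical, and the arithmetic is done in Python floats and rounded to single precision when repacked, so results may not match the hardware bit for bit in every case.

```python
import operator
import struct


def pack2(lo, hi):
    """Pack two floats into a 64-bit value: lo in bits 31:0, hi in bits 63:32."""
    return struct.unpack("<Q", struct.pack("<2f", lo, hi))[0]


def unpack2(reg):
    """Split a 64-bit value into its (lo, hi) single-precision lanes."""
    return struct.unpack("<2f", struct.pack("<Q", reg))


def packed_op(dst, src, op):
    """Apply op lane by lane: Dst[63:32] op Src[63:32], Dst[31:0] op Src[31:0]."""
    d_lo, d_hi = unpack2(dst)
    s_lo, s_hi = unpack2(src)
    # Results are rounded to single precision when repacked.
    return pack2(op(d_lo, s_lo), op(d_hi, s_hi))


dst = pack2(1.0, 2.0)
src = pack2(0.5, 4.0)
print(unpack2(packed_op(dst, src, operator.add)))
print(unpack2(packed_op(dst, src, operator.mul)))
```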

18
3DNow! Technology
  • After significant code analysis, AMD engineers
    found that there are two compelling
    implementation alternatives:
  • - extending MMX with 3DNow! instructions
  • - using wide registers separate from MMX, a
    4-operand instruction format and support for
    MAC (multiply-accumulate)
  • - Anything in between requires significantly
    greater hardware area or complexity without
    providing a corresponding performance benefit.
  • AMD chose the first, which achieves most of the
    performance benefit with significantly less area
    and power. Since no additional registers are
    used, no new state is introduced →
    compatibility with existing OSs.
  • The second alternative is implemented in the
    PowerPC G4 under the name AltiVec.

19
3DNow! Technology
  • Instead of division and square root, reciprocal
    and reciprocal square root are implemented in
    AMD K7, since they are encountered more often in
    multimedia applications.
  • MMX and 3DNow! instructions have at most 4-cycle
    latency (reached only by 3DNow! Add and Mul) and
    1-cycle throughput. This is much faster than
    single-precision FP division (13 cycles) and
    sqrt (16 cycles).
  • Using 2 FP pipelines simultaneously, the maximum
    throughput is 4 FP operations per cycle.
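A reciprocal can replace a division because n / d equals n * (1 / d), and a rough reciprocal estimate can be sharpened with multiplies and subtracts alone. The Python sketch below shows the standard Newton-Raphson refinement as one way to do this. The function names, the seed value and the step count are illustrative assumptions, not the K7 lookup table or instruction sequence.

```python
def refine_reciprocal(a, x, steps=2):
    """Newton-Raphson refinement of x toward 1/a.

    Each step computes x * (2 - a * x) and roughly doubles the number
    of correct bits, using only multiply and subtract.
    """
    for _ in range(steps):
        x = x * (2.0 - a * x)
    return x


def divide_via_reciprocal(n, d, seed, steps=3):
    """Compute n / d as n * (1 / d), starting from a rough seed for 1/d."""
    return n * refine_reciprocal(d, seed, steps)


# Rough seed 0.3 for 1/3: relative error 0.1, squared on every step.
print(refine_reciprocal(3.0, 0.3, steps=3))
print(divide_via_reciprocal(6.0, 3.0, seed=0.3))

# Peak rate from the slide: 4 FP operations per cycle at a given clock.
peak_gflops = 4 * 500e6 / 1e9  # assumes a 500 MHz clock
print(peak_gflops)
```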

20
Integer Performance of AMD Athlon

21
Floating Point Performance of AMD Athlon

23
Conclusion
  • As the first 7th-generation CPU, AMD K7 was a
    major leap forward in CPU history.
  • It had both performance and cost benefits
    compared to the Intel PIII and started the
    competition that led to today's AMD Athlon XP
    and P4 processors.

24
References
  • Hesley, S., V. Andrade, B. Burd, G. Constant, J.
    Correll, M. Crowley, M. Golden, N. Hopkins, S.
    Islam, S. Johnson, R. Khondker, D. Meyer, J.
    Moench, H. Partovi, R. Posey, F. Weber and J.
    Yong, "A 7th Generation x86 Microprocessor",
    IEEE International Solid State Circuits
    Conference, pp. 92-93, 1999.
  • Scherer, A., M. Golden, N. Juffa, S. Meier, S.
    Oberman, H. Partovi and F. Weber, "An
    Out-of-Order Three-Way Superscalar Multimedia
    Floating Point Unit", IEEE International Solid
    State Circuits Conference, pp. 94-95, 1999.
  • Oberman, S., "Floating Point Division and Square
    Root Algorithms and Implementation in the AMD-K7
    Microprocessor", 14th IEEE Symposium on Computer
    Arithmetic, pp. 106-115, 1999.
  • Oberman, S., G. Favor and F. Weber, "AMD 3DNow!
    Technology: Architecture and Implementations",
    IEEE Micro, 1999.
  • AMD Athlon Processor Datasheet and Technical
    Brief, from www.amd.com
  • Intel PIII Processor Datasheet, from www.intel.com

25
Questions?