Title: The AMD Opteron


1
The AMD Opteron
  • Henry Cook
  • Kum Sackey
  • Andrew Weatherton

2
Presentation Outline
  • History and Goals
  • Improvements
  • Pipeline Structure
  • Performance Comparisons

3
K8 Architecture Development
  • The Nx586, March 1994
  • Superscalar
  • Designed by NexGen
  • Manufactured by IBM
  • 70-111MHz
  • 32KB L1 cache
  • 3.5 million transistors
  • .5 micron process

4
K8 Architecture Development
  • AMD SSA/5 (K5)
  • March 1996
  • Built by AMD from the ground up
  • Superscalar architecture
  • Out-of-order speculative execution
  • branch prediction
  • integrated FPU
  • power-management
  • 75-117MHz
  • Ran hot
  • 24KB L1 cache (16KB instruction + 8KB data)
  • 4.5 million transistors
  • .35 micron process

5
K8 Architecture Development
  • AMD K6 (1997)
  • Based on NexGen's RISC86 core (in the Nx586)
  • 166-300MHz
  • 64KB L1 cache
  • 8.8 million transistors
  • .25 micron process

6
K8 Architecture Development
  • AMD K6 (1997) continued
  • Advantages of K6 over K5
  • RISC86 core translates complex x86 instructions
    into shorter ones, allowing the K6 to reach
    higher frequencies than the K5 core.
  • Larger L1 cache.
  • New MMX instructions.
  • AMD produced both desktop and mobile K6
    processors; the only difference was a lower
    processor core voltage for the mobile part
7
K8 Architecture Development
  • First AMD Athlons, K7 (June 23, 1999)
  • Based on the K6 core
  • Improved on the K6's FPU
  • 128 KB (2x64 KB) L1 cache
  • Initially 500-700MHz
  • 22 million transistors
  • .25 micron process

8
K8 Architecture Development
  • AMD Athlons, K7 continued
  • 1999-2002 held fastest x86 title off and on
  • First to 1GHz clock speed
  • Intel suffered a series of major production,
    design, and quality control issues at this time.
  • Changed from slot to socket format
  • Athlon XP desktop
  • Athlon XP-M laptop
  • Athlon MP server

9
K8 Architecture Development
  • AMD Athlons, K7 continued
  • Final (5th) revision, the Barton
  • 400 MHz FSB (up from 200 MHz)
  • Up to 2.2 GHz clock
  • 512 KB L2 cache, on-die
  • 54.3 million transistors
  • .13 micron process
  • In 2004 AMD began using 90nm process on XP-M

10
The AMD Opteron
  • Built on the K8 Core
  • Released April 22, 2003
  • AMD's AMD64 (x86-64) ISA
  • Direct Connect Architecture
  • Integrated memory controllers
  • HyperTransport interface
  • Native execution of x86 64-bit apps
  • Native execution of x86 32-bit apps with no speed
    penalty!

11
Opteron vs. Intel Offerings
  • Targeted at the server market
  • 64-bit computing
  • Registered memory
  • Initial direct competitor was the Itanium
  • Itanium was the only other 64-bit processor
    architecture with 32-bit x86 compatibility
  • But, 32-bit software support was not native
  • Emulated 32-bit performance took a significant hit

12
Opteron vs. ???
  • Opteron had no real competition
  • Near 1:1 multi-processor scaling
  • In competing systems, CPUs share a single common bus
  • Contention for the shared bus leads to decreased
    efficiency; this is not an issue for the Opteron
  • With the integrated memory controller, each CPU can
    access local RAM without using the HyperTransport
    bus for processor-memory communication
  • Still did not dominate the market

13
Opteron Layout
14
Other New Opteron Features
  • 48-bit virtual address space and a 40-bit
    physical address space
  • ECC (error correcting code) protection for L1
    cache data, L2 cache data and tags
  • DRAM with hardware scrubbing of all ECC-protected
    arrays

15
Other New Opteron Features
  • Lower thermal output, improved frequency scaling
    via .13 micron SOI (silicon-on-insulator) process
    technology
  • Two additional pipeline stages (compared to K7)
    for increased performance and frequency
    scalability
  • Higher IPC (instructions-per-clock) with larger
    TLBs, flush filters, and enhanced branch
    prediction algorithms

16
64-bit Computing
  • Move beyond the 4GB virtual-address space ceiling
    32-bit systems impose
  • Servers and apps like databases, content
    creation, MCAD, and design-automation tools push
    that boundary.
  • AMD's implementation allows:
  • Up to 256TB of virtual-address space
  • Up to 1TB of physical memory
  • No performance penalty
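
The address-space figures above follow directly from the address widths quoted earlier in the deck (48-bit virtual, 40-bit physical). A small Python check of that arithmetic, where the helper name `addressable_bytes` is only illustrative:

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable in a byte-addressed space of the given width."""
    return 2 ** address_bits

GIB = 2 ** 30  # the slides' "GB"
TIB = 2 ** 40  # the slides' "TB"

print(addressable_bytes(32) // GIB, "GB: the 32-bit ceiling")
print(addressable_bytes(48) // TIB, "TB: 48-bit virtual address space")
print(addressable_bytes(40) // TIB, "TB: 40-bit physical address space")
```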

17
64-bit Computing (cont'd)
  • AMD believes the following desktop apps stand to
    benefit the most from its architecture, once
    64-bit becomes more widespread
  • 3D gaming
  • Codecs
  • Compression algorithms
  • Encryption
  • Internet content serving
  • Rendering

18
AMD and 64-bit Computing
  • Goal is not immediate transition to 64-bit
    operation
  • Like Intel's transition to 32-bit with the 386
  • AMD's Brunner: "The transition will occur at the
    pace of demand for its benefits."
  • Sets foundation and encourages development of
    64-bit applications while fully supporting
    current 32-bit standard

19
AMD64
  • AMD's 64-bit ISA
  • 64-bit software support with zero-penalty 32-bit
    backward compatibility
  • x86 based, with extensions
  • Cleans up x86-32 idiosyncrasies
  • Updated since release, e.g. with SSE3

20
AMD64 - Features
  • All benefits of 64-bit processing (e.g.
    virtual-address space)
  • Added registers
  • Like Pentium 4 in 32-bit mode, but 8 more 64-bit
    GPRs available for 64-bit
  • 8 more XMM registers
  • Native 32-bit compatibility
  • Low translation overhead (unlike Intel)
  • Both 32- and 64-bit apps can be run under a 64-bit
    OS

21
Register Map for AMD64
22
AMD64 More Features
  • RIP-relative data access: instructions can
    reference data relative to the PC, which makes code
    in shared libraries more efficient and able to be
    mapped anywhere in the virtual address space.
  • NX bit: not required for 64-bit computing, but
    provides for a more tightly controlled software
    environment. Hardware-set permission levels make
    it much more difficult for malicious code to take
    control of the system.

23
AMD64 Operating Modes
  • Legacy mode supports 16- and 32-bit OSes and
    apps, while long mode enables 64-bit OSes to
    accommodate both 32- and 64-bit apps.
  • Legacy: OS, device drivers, and apps will run
    exactly as they did prior to upgrading.
  • Long: drivers and apps have to be recompiled, so
    software selection will be limited, at least
    initially.
  • Most likely scenario is a 64-bit OS with 64-bit
    drivers, running 64-bit apps natively and 32-bit
    apps in compatibility mode.

26
Direct Connect Architecture
  • I/O Architecture for Opteron and Athlon64
  • Microprocessors are connected to
  • Memory through an integrated memory controller.
  • A high performance I/O subsystem via
    Hypertransport bus
  • To other CPUs via HyperTransport bus

27
Onboard Memory Control
  • Processors do not have to go through a
    northbridge to access memory
  • 128-bit memory bus
  • Latency reduced and bandwidth doubled
  • In multiprocessor systems, each processor has its
    own memory interface and its own memory
  • Available memory scales with the number of
    processors

28
More Onboard Memory Control
  • DDR-SDRAM only
  • Up to 8 registered DDR DIMMs per processor
  • Memory bandwidth of up to 5.3 Gbytes/s (with
    PC2700) per processor.
  • 20% improvement over Athlon just due to
    integrated memory
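
The 5.3 Gbytes/s figure can be reproduced from the bus parameters. The sketch below assumes PC2700 means DDR333 (a clock of roughly 166.67 MHz with two transfers per cycle) on the 128-bit dual-channel bus mentioned on the previous slide; the function name is illustrative.

```python
def peak_bandwidth_gb_s(transfers_per_second: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    return transfers_per_second * (bus_width_bits / 8) / 1e9

# Assumption: PC2700 = DDR333, i.e. two transfers per ~166.67 MHz clock cycle.
ddr333_transfers = 2 * 166.67e6

single_channel = peak_bandwidth_gb_s(ddr333_transfers, 64)
dual_channel = peak_bandwidth_gb_s(ddr333_transfers, 128)

print(f"single channel (64-bit):  {single_channel:.1f} GB/s")
print(f"dual channel (128-bit):   {dual_channel:.1f} GB/s")
```

These are peak numbers only; sustained bandwidth depends on access patterns and memory timings.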

29
HyperTransport
  • Bidirectional, serial/parallel, scalable,
    high-bandwidth low-latency bus
  • Packet based
  • 32-bit words regardless of physical width
  • Facilitates power management and low latencies

30
HyperTransport in the Opteron
  • 16-bit-wide CAD HyperTransport (CAD = Command,
    Address, Data)
  • Processor-to-processor and processor-to-chipset
  • Bandwidth of up to 6.4 GB/s (per HT port)
  • 50% more than the latest Pentium 4 or Xeon
    processors offer
  • 8-bit wide HyperTransport for components such as
    normal I/O-Hubs

31
More Opteron HyperTransport
  • Number of HyperTransport channels (up to 3) is
    determined by the number of CPUs
  • 19.2 Gbytes/s of peak bandwidth per processor
  • All are bi-directional, double data rate
  • Low power consumption (1.2 W) reduces system
    thermal budget

34
More HyperTransport
  • Auto-negotiated bus widths
  • Devices negotiate sizes during initialization
  • 2-bit lines to 32-bit lines.
  • Busses of various widths can be mixed together in
    a single application
  • Allows for high speed busses between main memory
    and the CPU and lower speed busses to peripherals
    as appropriate
  • PCI compatible but 80x faster

35
DCA InterCPU Connections
  • Multiple CPUs connected through a proprietary
    extension running on additional HyperTransport
    interfaces
  • Allows support of a cache-coherent, Non-Uniform
    Memory Access, multi-CPU memory access protocol

36
DCA InterCPU Connections
  • Non-Uniform Memory Access
  • Separate cache memory for each processor
  • Memory access time depends on memory location.
    (i.e. local faster than non-local)
  • Cache coherence
  • Integrity of data stored in local caches of a
    shared resource
  • Each CPU can access the main memory of another
    processor, transparent to the programmer

37
DCA Enables Multiple CPUs
  • Integrated memory controller allows local memory
    access without using HyperTransport
  • For non-local memory access and interprocessor
    communication, only the initiator and target are
    involved, keeping bus-utilization to a minimum.
  • All CPUs in multiprocessor Intel Xeon systems
    share a single common bus for both
  • Contention for shared bus reduces efficiency

38
Multicore vs Multi-Processor
  • In multi-processor systems (more than one Opteron
    on a single motherboard), the CPUs communicate
    using the Direct Connect Architecture
  • Most retail motherboards offer one or two CPU
    sockets
  • The Opteron CPU directly supports up to an 8-way
    configuration (found in mid-level servers)

39
Multicore vs Multi-Processor
  • With multicore each physical Opteron chip
    contains two separate processor cores (more
    someday soon?)
  • Doubles the compute power available to each
    motherboard socket. One socket can deliver the
    performance of two processors, two deliver a
    four-processor equivalent, etc.

40
Future Improvements
  • Dual-Core vs Double Core
  • Dual core: two processors on a single die
  • Double core: two single-core processors in one
    package
  • Better for manufacturing
  • Intel Pentium D 900 Presler
  • Combined L2 cache
  • Quad-core, etc.

41
K7 vs. K8 Changes
42
Summary of Changes From K7 to K8
  • Deeper Wider Pipeline
  • Better Branch Predictor
  • Large workload TLB
  • HyperTransport capabilities eliminate Northbridge
    and allow low latency communication between
    processors as well as I/O
  • Larger L2 cache with higher bandwidth and lower
    latency
  • AMD 64 ISA allowing for 64-bit operation

43
The K7 Basics
  • 3 x86 decoding units
  • 3 integer units (ALU)
  • 3 floating point units (FPU)
  • A 128KB L1 cache
  • Designed with an efficiency aim
  • IPC mark (Instructions Per Cycle)
  • K7 units can handle up to 9 instructions per
    clock cycle

44
The K8 Basics
  • 3 x86 decoding units
  • 3 integer units (ALU)
  • 3 floating point units (FPU)
  • A 128KB L1 cache (64KB code + 64KB data)
  • Up to 1MB of L2 cache

45
The K7 Core
46
The K8 Core
47
Things To Note About the K8
  • Schedules a large number of instructions
    simultaneously
  • 3 8-entry schedulers for integer instructions
  • A 36-entry scheduler for floating point
    instructions
  • Compared to the K7, the K8 allows for more
    integer instructions to be active in the
    pipeline. How is this possible?

48
Processor Constraints
  • A 'bigger' processor has more execution units
    (width) and more stages in the pipeline (depth)
  • Processor 'size' is limited by the accuracy of
    the branch predictor
  • Accuracy determines how many instructions can be
    active in the pipeline before an incorrect branch
    prediction occurs
  • In theory, the CPU should only accommodate the
    number of instructions that can be sent down the
    pipe before a misprediction

49
The K8 Branch Predictor
50
The K8 Branch Predictor Details
  • Compared to the K7, the K8 has improved branch
    prediction
  • Global history counter (GHC) is 4x its previous size
  • The GHC is a massive array of 2-bit (0-3) counters,
    indexed by part of an instruction's address
  • If the value is 2 or more, the branch is predicted
    as "taken"
  • Taken branches increment the counter
  • Untaken branches decrement it
  • The larger global history counter means more
    instruction addresses can be tracked, thus
    increasing branch predictor accuracy
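
The counter scheme described above can be sketched as a table of 2-bit saturating counters. This is a minimal model, not the K8's actual predictor: the 16-entry table, the indexing by low address bits, the starting counter value, and the names `TwoBitPredictor` and `accuracy` are all assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low address bits."""

    def __init__(self, table_bits: int = 4):
        self.mask = (1 << table_bits) - 1
        # Every counter starts at 1 ("weakly not taken"); an arbitrary choice.
        self.counters = [1] * (1 << table_bits)

    def predict(self, address: int) -> bool:
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[address & self.mask] >= 2

    def update(self, address: int, taken: bool) -> None:
        i = address & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)  # saturate at 3
        else:
            self.counters[i] = max(0, self.counters[i] - 1)  # saturate at 0


def accuracy(predictor: TwoBitPredictor, address: int, outcomes: list) -> float:
    """Fraction of outcomes predicted correctly for one branch address."""
    hits = 0
    for taken in outcomes:
        hits += predictor.predict(address) == taken
        predictor.update(address, taken)
    return hits / len(outcomes)


# A loop branch: taken 9 times, then falls through, repeated 10 times.
loop_branch = ([True] * 9 + [False]) * 10
print(accuracy(TwoBitPredictor(), 0x400123, loop_branch))
```

Once warmed up, this model mispredicts only the loop exit, because a single untaken branch moves the counter from 3 to 2 and the next prediction is still "taken". A larger table means fewer distinct branches collide on the same counter.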

51
Translation Look-aside Buffer
  • The number of TLB entries has been increased
  • Helps performance in servers with large memory
    requirements
  • Desktop performance impact will be limited to a
    small boost when running 3D rendering software

52
HyperTransport: Typical CPU to Memory Set-Up
  • CPU sends a 200MHz clock to the north bridge; this
    is the FSB.
  • The bus between the north bridge and the CPU is 64
    bits wide at 200MHz (quad-pumped for 4 packets
    per cycle), giving an effective rate of 800MHz
  • The memory bus is also 200MHz and 64 or 128 bits
    wide (single or dual channel). As it is DDR
    memory, two 64- or 128-bit packets are sent every
    clock cycle.

53
HyperTransport: Opteron Memory Set-Up
  • The integrated memory controller does not improve
    memory bandwidth, but drastically reduces memory
    request time
  • HyperTransport uses a 16-bit-wide bus at 800MHz
    with double data rate signaling, which gives a
    3.2GB/s peak bandwidth one-way
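
The HyperTransport figures quoted in this deck (3.2 GB/s one-way, 6.4 GB/s per port, 19.2 GB/s per processor) are consistent with each other, as the sketch below shows. It assumes a port is one link in each direction and that a processor has three ports; the function name is illustrative.

```python
def ht_one_way_gb_s(width_bits: int, clock_hz: float, transfers_per_clock: int = 2) -> float:
    """Peak one-way bandwidth of a HyperTransport link in GB/s (10**9 bytes)."""
    return (width_bits / 8) * clock_hz * transfers_per_clock / 1e9

one_way = ht_one_way_gb_s(16, 800e6)   # 16-bit link, 800MHz, double data rate
per_port = 2 * one_way                 # both directions of a bidirectional port
per_processor = 3 * per_port           # assumed: three ports per Opteron

print(f"one-way:       {one_way:.1f} GB/s")
print(f"per port:      {per_port:.1f} GB/s")
print(f"per processor: {per_processor:.1f} GB/s")
```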

55
Pros & Cons
  • Pros
  • The performance of the integrated controller of
    the K8 increases as the CPU speed increases and
    so does the request speed.
  • The addressable memory size and the total
    bandwidth increase with the number of CPUs
  • Cons
  • Memory controller is customized to use a specific
    memory, and is not very flexible about upgrading

56
Caches
57
L1 Cache Comparison
CPU                              K8               Pentium 4 Prescott
Size                             code: 64KB       TC: 12Kµops
                                 data: 64KB       data: 16KB
Associativity                    code: 2-way      TC: 8-way
                                 data: 2-way      data: 8-way
Cache line size                  code: 64 bytes   TC: n.a.
                                 data: 64 bytes   data: 64 bytes
Write policy                     Write back       Write through
Latency (given by manufacturer)  3 cycles         4 cycles
58
K8 L1 Cache
  • Compared to the Intel machine, the large size of
    the L1 cache allows for bigger block size
  • Pros: a big range of data or code in the same
    memory area
  • Cons: low associativity tends to create conflicts
    during the caching phase.

60
L2 Cache Comparison
CPU                              K8                  Pentium 4 Prescott
Size                             512KB (NewCastle)   1024KB
                                 1024KB (Hammer)     1024KB
Associativity                    16-way              8-way
Cache line size                  64 bytes            64 bytes
Latency (given by manufacturer)  11 cycles           11 cycles
Bus width                        128 bits            256 bits
L1 relationship                  exclusive           inclusive
61
K8 L2 cache
  • The L2 cache of the K8 shares a lot of common
    features with the K7.
  • The K8's L2 cache uses 16-way set associativity
    to partially compensate for the low
    associativity of the L1.
  • Although the bus width in the K8 is double what
    the K7 offered, it is still smaller than the
    Intel model
  • The K8 also includes hardware prefetch logic
    that fetches data from memory into the L2
    cache during memory bus idle time.

63
Inclusive vs. Exclusive Caching
  • Inclusive caching: used by the Intel P4
  • L1 cache contains a subset of the L2 cache
  • During an L1 miss/L2 hit, data is copied into
    the L1 cache and forwarded to the CPU
  • During an L1/L2 miss, data is copied from memory
    into both L1 and L2 caches

64
Inclusive vs. Exclusive Caching
  • Exclusive caching: used by the Opteron
  • L1 and L2 caches cannot contain the same data
  • During an L1 miss/L2 hit:
  • One line is evicted from the L1 cache into the L2
  • L2 cache copies the requested data into the L1 cache
  • During an L1/L2 miss, data is copied into the L1
    cache alone
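
The exclusive policy above can be modeled in a few lines. This is a toy sketch under loud assumptions: both levels are fully associative with LRU replacement and track line addresses only, which is not how the K8 organizes its caches (2-way L1, 16-way L2). The class name `ExclusiveHierarchy` is hypothetical.

```python
from collections import OrderedDict


class ExclusiveHierarchy:
    """Toy exclusive L1/L2 pair: a line lives in at most one level."""

    def __init__(self, l1_lines: int, l2_lines: int):
        self.l1_lines, self.l2_lines = l1_lines, l2_lines
        self.l1 = OrderedDict()  # least recently used line first
        self.l2 = OrderedDict()

    def _evict_l1_into_l2(self) -> None:
        victim, _ = self.l1.popitem(last=False)   # LRU line leaves the L1
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)           # L2 victim goes back to memory
        self.l2[victim] = True

    def access(self, line: int) -> str:
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]                     # the line moves out of the L2
            result = "L2 hit"
        else:
            result = "miss"                       # fetched from memory into L1 only
        if len(self.l1) >= self.l1_lines:
            self._evict_l1_into_l2()
        self.l1[line] = True
        return result


cache = ExclusiveHierarchy(l1_lines=2, l2_lines=4)
results = [cache.access(line) for line in [1, 2, 3, 1, 4, 5, 6, 7, 1]]
print(results)
print("L1:", list(cache.l1), "L2:", list(cache.l2))
```

Because no line is duplicated, the two levels together hold six distinct lines here, the sum of their sizes. An inclusive pair of the same sizes would hold only four.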

65
Drawback of Exclusive Caching and its solution
  • Problem: a line from the L1 must be copied to the
    L2 before getting back the data from the L2.
  • This takes a lot of clock cycles, adding to the
    time needed to get data from the L2
  • Solution
  • A victim buffer (VB): a very small, fast memory
    between L1 and L2.
  • The line evicted from L1 is copied into the
    VB rather than into the L2.
  • At the same time, the L2 read request is started,
    so the L1-to-VB write is hidden by the L2 latency
  • Then, if by chance the next requested data is in
    the VB, getting it back from there is much
    quicker than getting it from the L2.
  • The VB is a good improvement, but it is very
    limited by its small size (generally between 8
    and 16 cache lines). Moreover, when the VB is
    full, it must be flushed into the L2, which is an
    additional step and needs some extra cycles.

66
Drawback of Inclusive
  • The constraint on the L1/L2 size ratio needs the
    L1 to be small,
  • but a small size will result in reducing its
    success rate, and consequently its performance.
  • On the other hand, if it is too big, the ratio
    will be too large for good performance of the L2.
  • Reduces flexibility when deciding size of L1 and
    L2 caches
  • It is very hard to build a CPU line with such
    constraints. Intel released the Celeron P4 as a
    budget CPU, but its 128KB L2 cache crippled its
    performance.
  • Total useful cache size is reduced since data is
    duplicated over the caches

67
Inclusive vs. Exclusive Caching
            Pros                             Cons
Exclusive   No constraint on the L2 size.    L2 performance decreases.
            Total cache size is the sum of
            the sub-level sizes.
Inclusive   L2 performance.                  Constraint on the L1/L2 size ratio.
                                             Total cache size is effectively
                                             reduced.
68
The Pipeline
69
K7 vs. K8 Pipeline Comparison
70
The Fetch Stage
  • Two Cycles Long
  • Feeds 3 decoders with 16 instruction bytes each
    cycle
  • Uses the L1 code cache and the branch prediction
    logic.

71
The Decode Stage
  • The decoders convert x86 instructions into
    fixed-length micro-operations (µOPs).
  • Can generate 3 µOPs per cycle
  • FastPath: "simple" instructions, which decode
    into 1-2 µOPs, are decoded by hardware, then
    packed and dispatched
  • Microcoded path: complex instructions are decoded
    using the internal ROM
  • Compared to the K7, more instructions in the K8
    use the fast path, especially SSE instructions.
  • AMD claims that the number of microcoded
    instructions decreased by 8% for integer and 28%
    for floating point instructions.

72
Instruction Dispatch
  • There are:
  • Three address generation units (AGU)
  • Three integer units (ALU). Most operations
    complete within a cycle, in both 32 and 64 bits:
    addition, rotation, shift, logical operations
    (and, or).
  • Integer multiplication has a 3-cycle latency in
    32 bits, and a 5-cycle latency in 64 bits.
  • Three floating point units (FPU), which handle
    x87, MMX, 3DNow!, SSE and SSE2.

73
Load/Store Stage
  • Last stage of the pipeline process
  • uses the L1 data cache.
  • the L1 is dual-ported to handle two 32/64 bits
    reads or writes each clock cycle.

74
Cache Summary
  • Compared to the K7, the K8 cache provides higher
    bandwidth and lower latencies
  • Compared to the Intel P4, the K8 caches are
    write-back and exclusive

75
AMD 64 GPR encoding
  • IA32 instruction encoding uses a special byte
    called the ModRM (Mode / Register / Memory), in
    which the source and destination registers of the
    instruction are encoded.
  • 3 bits encode the source register, 3 bits encode
    the destination
  • There's no way to change the ModRM byte since
    that would break IA32 compatibility. So to allow
    instructions to use the 8 new GPRs, an additional
    prefix byte named REX is added outside the ModRM;
    it supplies the extra register bits.
  • The REX prefix is used only in long (64-bit) mode,
    and only when an instruction needs 64-bit operands
    or the new registers
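
To make the encoding concrete, here is a sketch that assembles one instruction form by hand: the 64-bit register-to-register `mov` (opcode 0x89). The helper `encode_mov_reg_reg` and the `REGISTERS` list are illustrative names, and only this single form is covered.

```python
# Register numbers 0-15; numbers 8 and up need a fourth bit that ModRM cannot hold.
REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def encode_mov_reg_reg(dst: str, src: str) -> bytes:
    """Encode `mov dst, src` for two 64-bit registers (opcode 0x89)."""
    rm = REGISTERS.index(dst)     # destination goes in the ModRM.rm field
    reg = REGISTERS.index(src)    # source goes in the ModRM.reg field
    rex = 0x40 | 0x08             # fixed 0100 pattern plus REX.W (64-bit operands)
    rex |= (reg >> 3) << 2        # REX.R: fourth bit of the reg field
    rex |= rm >> 3                # REX.B: fourth bit of the rm field
    modrm = 0xC0 | ((reg & 7) << 3) | (rm & 7)   # mod=11: register-direct
    return bytes([rex, 0x89, modrm])


print(encode_mov_reg_reg("rax", "rbx").hex())   # 4889d8
print(encode_mov_reg_reg("r8", "r9").hex())     # 4d89c8
```

The ModRM byte keeps its 3-bit fields in both cases. Only the REX prefix changes when the new registers r8-r15 are involved.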

76
AMD 64 SSE
  • Abandoned the original MMX and 3DNow! instruction
    sets because they operated on the same physical
    registers
  • Supports SSE/SSE2 using eight SSE-dedicated
    80-bit registers
  • If a 128-bit instruction is processed, it will
    take two steps to complete
  • Intel's P4 allows for the use of 128-bit
    registers, so 128-bit instructions only take a
    single step
  • However, C/C++ compilers still usually output
    scalar SSE instructions that only use 32/64 bits,
    so the Opteron can process most SSE
    instructions in one step and thus remain
    competitive with the P4

77
AMD 64 One Last Trick
  • Suppose we want to write 1 in a register, which is
    written in pseudo-code as
  • mov register, 1
  • In the case of a 32-bit register, the immediate
    value 1 will be encoded on 32 bits
  • mov eax, 00000001h
  • In the case of a 64-bit register
  • mov rax, 0000000000000001h
  • Problem: the 64-bit instruction takes 5 more
    bytes to encode the same number, thus wasting space.

78
AMD 64 One Last Trick
  • Under AMD64, the default size for operand bits is
    32.
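
The size difference can be seen by building the byte sequences directly. The sketch below writes out three standard encodings of "put 1 in the accumulator"; the variable names are illustrative. In 64-bit mode a write to eax zero-extends into rax, so the short 32-bit form is usually enough for small constants.

```python
import struct

# mov eax, 1: opcode B8 followed by a 32-bit immediate
mov_eax_imm32 = bytes([0xB8]) + struct.pack("<I", 1)

# mov rax, 1 with a full 64-bit immediate: REX.W prefix, opcode B8, 8-byte immediate
mov_rax_imm64 = bytes([0x48, 0xB8]) + struct.pack("<Q", 1)

# mov rax, 1 with a sign-extended 32-bit immediate: REX.W, opcode C7 /0, imm32
mov_rax_simm32 = bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", 1)

for name, code in [("mov eax, imm32", mov_eax_imm32),
                   ("mov rax, imm64", mov_rax_imm64),
                   ("mov rax, simm32", mov_rax_simm32)]:
    print(f"{name:16} {len(code):2} bytes  {code.hex()}")
```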

79
AMD 64 One Last Trick
  • For memory addressing a more complicated table is
    used.

80
AMD 64 Code Size
  • Cpuid.org estimated that 64-bit code will be
    20-25% bigger than the same code built from IA32
    instructions.
  • However, the use of sixteen GPRs will tend to
    reduce the number of instructions, and perhaps
    make 64-bit code shorter than 32-bit code.
  • The K8 is able to handle the code size increase,
    thanks to its 3 decoding units and its big L1
    code cache. The use of big 32KB blocks in the L1
    organization now seems very useful

81
AMD 64 32-bit code vs. 64-bit Code
82
  • HardOCP
  • AthlonXP 3200+ got outpaced by the Athlon64
    3200+. The P4 and the P4EE came in at a dead tie,
    which suggests that the extra CPU cache is not a
    factor in this benchmark... pipeline enhancements
    made to the new K8 core certainly did impact
    instructions per clock.

Note: Athlon64 3200+ runs at 2.0GHz; AthlonXP
3200+ runs at 2.2GHz
83
AMD 64 Conclusions
  • Allows for a larger addressable memory size
  • Allows for wider GPRs and 8 more of them
  • Allows the use of all x86 instructions that were
    available on the AMD64 by default
  • Can lead to small code that is faster as a result
    of less memory shuffling

84
Opteron vs. Xeon
85
Opteron vs Xeon in a nutshell
  • Opteron offers better computing and per-Watt
    performance at a roughly equivalent per-device
    price
  • Opteron scales much better when moving from one
    to two or even more CPUs
  • Fundamental limitation
  • Xeon processors must share one front side bus and
    one memory array

86
FSB Bottleneck
Intel's Xeon
AMD's Opteron
87
Xeon and the FSB Bottleneck
  • External north bridge makes implementing multiple
    FSB interfaces expensive and hard
  • Intel just has all the processors share one FSB
  • Relies on large on-die L3 caches to hide issue
  • Problem grows with number of CPUs

88
The AMD Solution
  • Recall Each processor has own integrated memory
    controller and three HyperTransport ports
  • No NB required for memory interaction
  • 6.4 GB/s bandwidth between all CPUs
  • No scaling issue!
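
A back-of-the-envelope model shows why the shared bus limits scaling. The numbers are assumptions taken from earlier slides: one 6.4 GB/s front-side bus (64 bits, 200MHz, quad-pumped) shared by all CPUs, versus 5.3 GB/s of local dual-channel PC2700 memory per Opteron. It counts idealized peak bandwidth only and ignores caches and remote-memory traffic.

```python
FSB_GB_S = 6.4          # assumed: one shared front-side bus for every CPU
LOCAL_MEM_GB_S = 5.3    # assumed: dual-channel PC2700 attached to each Opteron


def shared_bus_per_cpu(cpus: int) -> float:
    """Peak memory bandwidth per CPU when all CPUs share one bus."""
    return FSB_GB_S / cpus


def integrated_total(cpus: int) -> float:
    """Aggregate peak bandwidth when each CPU has its own memory controller."""
    return LOCAL_MEM_GB_S * cpus


for cpus in (1, 2, 4, 8):
    print(f"{cpus} CPUs: shared bus {shared_bus_per_cpu(cpus):.1f} GB/s per CPU, "
          f"integrated {LOCAL_MEM_GB_S:.1f} GB/s per CPU "
          f"({integrated_total(cpus):.1f} GB/s total)")
```

In this model the per-CPU share of a common bus shrinks as CPUs are added, while the integrated-controller design keeps per-CPU bandwidth constant and grows the total.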

89
Further Xeon Notes
  • Even 64-bit extensions would not solve the
    fundamental performance bottleneck imposed by the
    current architecture
  • Xeon can make use of Hyperthreading
  • Found to improve performance by 3-5%

90
AnandTech Database Benchmarks
  • SQL workload based on the site's forum usage; the
    database was the forums themselves
  • i.e. trying to be real world
  • Two categories 2-way and 4-way setups
  • Labels
  • Xeon Clock Speed / FSB Speed / L3 Cache Size
  • Opteron Clock Speed / L2 Cache Size

91
AnandTech Average Load 2-way
  • Longer line is better
  • Opterons at 2.2 GHz maintain a 5% lead
    over Xeons at 3.2 GHz

92
AnandTech Average Load 4-way
  • With two more processors, the best Opteron system
    increases its performance lead to 11%
  • Opterons at 1.8 GHz nearly equal Xeons at 3.0
    GHz

93
AnandTech Enterprise benchmarks
Stored Procedures / Second
  • 2-way: Xeon at 3GHz with a large L3 cache does
    better
  • 4-way: Opteron jumps ahead (8.5% lead)

94
AnandTech Test Conclusions
  • Opteron is the clear winner for systems with more
    than 2 processors
  • Even for dual-processors, Xeon essentially only
    ties
  • Clearly illustrates the scaling bottleneck
  • Xeons are using most of their huge (4MB) L3 cache
    to keep traffic off the FSB
  • Also Opteron systems used in tests cost ½ as much

95
Tom's Hardware Benchmarks
  • AMD's Opteron 250 vs. Intel's Xeon 3.6GHz
  • Xeon Nocona (i.e. 64-bit processing)
  • Results enhanced by chipset used (875P) which has
    improved memory controller
  • Still suffers from lack in memory performance
  • Workstation applications rather than server based
    tests

96
Tom's Hardware
97
Tom's Hardware
98
Tom's Hardware Conclusions
  • AMD has memory benefits, as before
  • Opteron better in video, Intel better with 3D but
    only when 875P-chipset is used
  • Otherwise Opteron wins in spite of inferior
    graphics hardware
  • Still undecided re 64-bit, no good applications
    to benchmark on

99
K8 in Different Packages
100
K8 in Different Packages
  • Opteron
  • Server Market
  • Registered memory
  • 940 pin count
  • Three HyperTransport links
  • Multi-cpu configurations (1,2,4, or 8 cpus)
  • Multiple multi-core cpus supported as well
  • Up to 8 1GB DIMMs

101
K8 in Different Packages
  • Athlon 64
  • Desktop market
  • Unregistered memory
  • 754 or 939 pin count
  • Up to 4 1GB DIMMs
  • Single HyperTransport link
  • Single slot configurations
  • X2 has multiple cores in one slot
  • Athlon 64 FX
  • Same feature set as Athlon 64
  • Unlocked multiplier for overclocking
  • Offered at higher clock speeds (2.8GHz vs.
    2.6GHz)

103
K8 in Different Packages
  • Turion 64
  • Named to evoke the touring concept
  • 90nm Lancaster Athlon 64 core
  • 64bit computing
  • SSE3 support
  • High quality core yields, can run at high clock
    speeds with low voltage
  • Similar process for low-wattage Opterons
  • On-chip memory controller
  • Saves power by running in single channel mode
  • Compares well against the Pentium M's separate
    controller on the motherboard

105
Thermal Design Points
  • Pentium 4's TDP: 130W
  • Athlon 64's TDP: 89-104W
  • Opteron HE: 50W, EE: 30W
  • Athlon 64 mobiles: 50W
  • DTR market sector
  • Pentium M: 27W
  • Turion 64: 25W

106
K8 in Different Packages
  • Turion 64 continued
  • Uses PowerNow! Technology
  • Similar to Intel's SpeedStep
  • Identical to desktop Cool'n'Quiet
  • Dynamic voltage and clock frequency modulation
  • Operates on demand
  • Run Cooler and Quieter even when plugged in

109
K8 in Different Packages
  • AMD uses Mobile Technology name
  • Intel has a monopoly on Centrino
  • Supplies wireless, chipset and CPU
  • Invested $300 million in Centrino advertising
  • Some consumers think Centrino is the only way to
    get wireless connectivity in a notebook
  • AMD supplies only the cpu
  • Chipset and wireless are left up to the
    motherboard manufacturer/OEM

110
Marketing
  • VS

111
Intel's Marketing
  • Men who are Blue
  • Moore's Law
  • Megahertz
  • Most importantly Money
  • Beginning with In order to correctly
    communicate the benefits of new processors to PC
    buyers it became important that Intel transfer
    any brand equity from the ambiguous and
    unprotected processor numbers to the company
    itself

112
Industry on AMD vs. Intel
  • Intel spends more on R&D in one quarter than AMD
    makes in a year
  • Intel still has a tremendous amount of arrogance
  • Has been shamed technologically by a flea-sized
    (relatively speaking) firm
  • Humbling? Intel is still grudgingly turning to
    the high IPC, low clock rate, dual-core, x86-64,
    on-die memory controller design pioneered by its
    diminutive rival.
  • Geek.com

113
AMD's Marketing
  • Mascot: The AMD Arrow
  • AMD makes superior CPUs, but the marketing
    department is acting like they are still selling
    the K6 -theinquirer.net
  • Guilty with Intel on poor metrics
  • AMD made all the marketing hay it could on the
    historically significant clock-speed number. By
    trying to turn attention away from that number
    now, it runs the risk of appearing to want to
    change the subject when it no longer has the
    perceived advantage. In marketing, appearance is
    everything. And no one wants to look like a sore
    loser, even when they aren't. - Forbes

115
AnandTech on AMD's Marketing
  • AMD argued that they didn't have to talk about a
    new architecture, as Intel is just playing
    catch-up to their current architecture.
  • However, we look at it like this - AMD has the
    clear advantage today, and for a variety of
    reasons, their stance in the marketplace has not
    changed all that much.

116
Conclusion
  • Improvements over K7
  • 64-bit
  • Integrated memory controller
  • HyperTransport
  • Pipeline
  • Multiprocessor scaling better than Xeon
  • K8 is dominant in every market performance-wise
  • K8 is trounced in every market in sales

117
Reason for 64-bit in Consumer Market
  • If there aren't widespread, consumer-priced
    64-bit machines available in three years, we're
    going to have a hard time developing games that
    are more compelling than last year's games. - Tim
    Sweeney, Founder & President, Epic Games

118
Questions?
119
  • http://www.people.virginia.edu/avw6s/opteron.html