The Transmeta Crusoe Processor Architecture


1
The Transmeta Crusoe Processor Architecture
  • Combining Low-Power VLIW
  • With x86 Compatibility
  • by Jeryl Contemprato
  • Advanced Computer Architecture
  • 23 October 2002

2
Transmeta Crusoe
  • Background
  • The Processor
  • Code Morphing
  • Hardware Support
  • Example Program
  • Power Consumption
  • Benchmarks

3
Background
  • Mobile computing necessitates low-power
    processing, yet requires sufficient processing
    capability to run standard applications
  • Existing superscalar designs that deliver
    required performance are limited in possibilities
    for power savings due to the complexity of
    dynamic scheduling
  • To achieve additional power savings, new
    processor design approaches are necessary, but
    compatibility with the predominant x86 platform
    is an issue

4
Background
  • Transmeta introduced the Crusoe processor family
    in Jan 2000
  • Targeted mobile devices, with goals of strong
    performance, low power consumption, and x86
    compatibility
  • Two-part solution:
  • HW: 4-issue native VLIW CPU (not x86 compatible)
  • SW: Code Morphing x86-to-VLIW translation layer

5
The Processor (TM5800)
http://www.transmeta.com/crusoe_docs/TM5800_ProductBrief_7-18-02.pdf
6
The Processor
  • Latest processor, TM5800, clocked at 900 MHz and
    fabricated in a conventional 0.13 µm CMOS process
  • Simple 128-bit VLIW engine, 4 operations per
    instruction word, not compatible with x86
    instructions
  • Atom: a RISC-like instruction/operation
  • Molecule: the instruction word, composed of
    atoms
  • Functional units: 2 integer, 1 floating point, 1
    load/store, 1 branch
  • 7-stage pipeline for integer ops, 10-stage for
    floating point
  • 64-entry integer register file for x86 state,
    internal state, and SW register renaming

7
The Processor
http://www.arstechnica.com/cpu/1q00/crusoe/crusoe-3.html
8
The Processor
  • By choosing VLIW over traditional superscalar
    dynamic instruction scheduling, we can drop the
    massive numbers of transistors and wiring
    associated with dynamic scheduling in hardware
  • Assuming native instruction words, how does the
    Crusoe fundamentally differ from Itanium?
  • Despite Itanium's use of VLIW, it still does some
    dynamic scheduling, in hardware, of the
    operations in the instruction word for further
    run-time optimization, even if the IA-64 compiler
    has already optimized construction of the word.
    The Crusoe does no dynamic scheduling in HW.

9
Code Morphing
  • Dynamic translation and instruction scheduling
    software that converts from x86 target ISA to
    VLIW host ISA

10
Code Morphing
  • Software is resident in a serial ROM that is
    transferred to RAM on startup for faster access
  • Code Morphing SW is the only native code written
    for the VLIW core; it is essentially an on-the-fly
    compiler/interpreter for x86 instructions
  • Breaks all x86 instructions (including BIOS and
    OS) into atoms, then schedules them into
    molecules just as a VLIW compiler would for ILP.
    A group of molecules generated on one run of the
    Code Morphing SW is called a translation
  • Translation is more than a semantic mapping from
    x86 to VLIW
  • Performs typical compiler optimizations and Code
    Morphing-specific optimizations
  • Out-of-order scheduling and register renaming
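
The scheduling step above can be sketched as a greedy list scheduler that packs atoms into molecules. This is only an illustration: the per-molecule limits come from the functional-unit list on the processor slide, while the Atom class, the schedule function, and the greedy policy are assumptions, not Transmeta's actual algorithm.

```python
from dataclasses import dataclass, field

# Per-molecule limits taken from the processor slide; everything else here
# is a hypothetical greedy list scheduler, not Transmeta's algorithm.
UNIT_LIMITS = {"int": 2, "fp": 1, "mem": 1, "br": 1}
MAX_ATOMS = 4  # 4 operations per instruction word


@dataclass
class Atom:
    name: str
    unit: str                                  # one of the keys of UNIT_LIMITS
    deps: list = field(default_factory=list)   # names of atoms that must run earlier


def schedule(atoms):
    """Greedily pack atoms into molecules, honouring dependencies and unit limits."""
    done_in = {}      # atom name -> index of the molecule that holds it
    molecules = []
    pending = list(atoms)
    while pending:
        used = {unit: 0 for unit in UNIT_LIMITS}
        molecule = []
        for atom in pending:
            # An atom is ready only if its producers sit in earlier molecules.
            ready = all(dep in done_in for dep in atom.deps)
            if ready and len(molecule) < MAX_ATOMS and used[atom.unit] < UNIT_LIMITS[atom.unit]:
                molecule.append(atom)
                used[atom.unit] += 1
        if not molecule:
            raise ValueError("dependency cycle or unknown dependency")
        for atom in molecule:
            done_in[atom.name] = len(molecules)
        molecules.append([atom.name for atom in molecule])
        pending = [atom for atom in pending if atom.name not in done_in]
    return molecules


atoms = [
    Atom("addr1", "int"),
    Atom("addr2", "int"),
    Atom("ld1", "mem", ["addr1"]),
    Atom("ld2", "mem", ["addr2"]),
    Atom("cmp", "int", ["ld1", "ld2"]),
    Atom("br", "br", ["cmp"]),
]
molecules = schedule(atoms)
```

With a single load/store unit the two loads land in separate molecules, which shows how functional-unit limits, and not only data dependencies, bound the ILP a translation can reach.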

11
Code Morphing: Overhead
  • The catch: while the Code Morphing software is
    the layer between x86 instructions and the VLIW
    engine, it does not physically sit between the
    two
  • Being software itself, Code Morphing has to run
    on the VLIW core
  • While the Code Morphing software is translating,
    the core is not doing useful application work
  • Must offset the cost of translation to maintain
    acceptable performance
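
One way to read "offset the cost of translation" is as a break-even count: how many times a block must run before the one-time translation cost is repaid. The sketch below assumes a simple cost model (a slow path before translation, a faster path after it). The function name and all cycle counts are hypothetical and do not come from the slides.

```python
import math


def break_even_runs(translate_cost, slow_cost, fast_cost):
    """Executions needed before translating a block pays for itself.

    translate_cost: one-time cycles spent translating the block (assumed)
    slow_cost:      cycles per execution without the translation (assumed)
    fast_cost:      cycles per execution of the translated block (assumed)
    """
    if fast_cost >= slow_cost:
        raise ValueError("translation never pays off if it is not faster")
    return math.ceil(translate_cost / (slow_cost - fast_cost))


# Hypothetical numbers: 10,000 cycles to translate, 50 vs 10 cycles per run.
runs = break_even_runs(10_000, 50, 10)
```

Under these made-up numbers the block must execute 250 times before translation is a net win, which is why rarely executed code is not worth translating aggressively.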

12
Code Morphing: Optimizations
  • Translation cache, in a protected memory space,
    stores translations for oft-used instruction
    groups, so they do not have to be retranslated
  • Filtering allows the Crusoe to take advantage of
    the 10-90 rule of execution
  • 90% of the time is spent on 10% of the code
  • Because further optimization incurs extra
    overhead, we only want to fully optimize code we
    are sure we will use again and again
  • Heuristics and feedback help decide what to spend
    time on
  • Predictions and path selections can be made
    through feedback, such as block execution
    frequency and branch history, to drive
    speculative execution
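
The translation cache and filtering ideas above can be sketched together. This is a simplified model under stated assumptions: the Translator class, the execution-count threshold, and the interpret-first policy are hypothetical stand-ins for whatever heuristics the real Code Morphing software uses.

```python
class Translator:
    """Interpret cold code; translate and cache a block once it turns hot."""

    def __init__(self, hot_threshold=3):
        self.hot_threshold = hot_threshold   # hypothetical tuning knob
        self.exec_count = {}                 # x86 block address -> times executed
        self.tcache = {}                     # x86 block address -> cached translation

    def run_block(self, addr):
        if addr in self.tcache:
            return "cached"                  # reuse: no retranslation needed
        self.exec_count[addr] = self.exec_count.get(addr, 0) + 1
        if self.exec_count[addr] >= self.hot_threshold:
            # Stand-in for generating and optimizing real VLIW molecules.
            self.tcache[addr] = f"translation@{addr:#x}"
            return "translated"
        return "interpreted"


t = Translator(hot_threshold=3)
hot = [t.run_block(0x1000) for _ in range(5)]
cold = t.run_block(0x2000)
```

The hot block is interpreted twice, translated on its third execution, and served from the cache afterwards, while the block that runs once never incurs translation cost.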

13
Hardware Support
  • Dynamic translation presents several problems
    that must be dealt with in hardware to achieve
    acceptable performance
  • Precise exceptions and speculation
  • Memory instruction latency
  • Self-modifying code

14
Hardware Support: Precise Exceptions and Speculation
  • x86 ISA enforces precise exceptions, so that when
    an instruction causes an exception, all prior
    in-order instructions complete before reporting
    the exception, while flushing all subsequent
    instructions
  • Code Morphing schedules atoms, the
    micro-operations of x86 instructions,
    out-of-order to achieve ILP
  • Crusoe optimizes for the common case of no
    exceptions in order to achieve greater normal
    throughput

15
Hardware Support: Precise Exceptions and Speculation
  • x86 state registers are shadowed so that normal
    atoms modify only working registers
  • Store atoms do not write to memory, but to a gated
    store buffer
  • If a translation executes without causing any
    exceptions, a follow-on commit atom executes,
    copying working registers to shadow registers
    and committing store buffer contents to memory
  • If an exception happens, a rollback operation
    copies shadow register values back into working
    registers and discards store buffer contents,
    then re-executes x86 instructions in order
    without scheduling
  • Same technique applies to speculative execution
    of mispredicted branches
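
The commit and rollback behavior described above can be modelled in a few lines. This is a software sketch of what the slides describe as a hardware mechanism. The SpeculativeState class and its method names are illustrative assumptions.

```python
class SpeculativeState:
    """Working/shadow registers plus a gated store buffer (illustrative model)."""

    def __init__(self, regs, memory):
        self.shadow = dict(regs)     # last committed x86 register state
        self.working = dict(regs)    # registers that normal atoms modify
        self.memory = memory         # committed memory: address -> value
        self.store_buffer = []       # gated stores, not yet visible in memory

    def write_reg(self, name, value):
        self.working[name] = value

    def store(self, addr, value):
        self.store_buffer.append((addr, value))

    def commit(self):
        """Translation finished without exceptions: make its effects permanent."""
        self.shadow = dict(self.working)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self):
        """Exception or misprediction: restore the last committed state."""
        self.working = dict(self.shadow)
        self.store_buffer.clear()


mem = {}
state = SpeculativeState({"eax": 0}, mem)

state.write_reg("eax", 5)
state.store(0x40, 7)
state.rollback()                     # as if an exception had occurred
after_rollback = (state.working["eax"], dict(mem))

state.write_reg("eax", 9)
state.store(0x40, 1)
state.commit()                       # as if the translation completed cleanly
```

After the rollback neither the register write nor the store is visible. After the commit both are, which is the state from which in-order re-execution of the x86 instructions would resume.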

16
Hardware Support: Alias HW
  • Code Morphing translator can generate more ILP if
    it has more freedom to move atoms around
  • Memory operation dependencies can hamper this
    freedom, especially in reordering loads ahead of
    stores
  • Alias hardware allows the scheduler to move loads
    ahead of stores even when the two might refer to
    the same address
  • Those load atoms that are moved ahead of stores
    are converted into load-and-protect (ldp), which
    records load address/size on execution
  • Store atoms converted to store-under-alias-mask
    (stam) to check, on execution, if the store will
    overwrite any memory accessed by an ldp
  • If the store does overwrite previously
    out-of-order loaded data, an exception is raised.
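
The ldp/stam check can be sketched as an address-range overlap test. This model is deliberately simplified: the AliasUnit class is hypothetical, and it checks every store against every protected range, whereas the "mask" in store-under-alias-mask suggests the real hardware selects which protected loads a given store is checked against.

```python
class AliasFault(Exception):
    """Raised when a store overlaps memory read early by a protected load."""


class AliasUnit:
    def __init__(self):
        self.protected = []   # (address, size) ranges recorded by ldp

    def ldp(self, memory, addr, size=4):
        """Load-and-protect: perform the load and record the range it read."""
        self.protected.append((addr, size))
        return memory.get(addr, 0)

    def stam(self, memory, addr, value, size=4):
        """Store-under-alias-mask: fault if the store overlaps a protected range."""
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < addr + size:
                raise AliasFault(f"store to {addr:#x} overlaps load at {p_addr:#x}")
        memory[addr] = value

    def clear(self):
        """Forget all protected ranges, e.g. once a translation commits."""
        self.protected.clear()


mem = {0x308: 11}
unit = AliasUnit()
loaded = unit.ldp(mem, 0x308)        # load hoisted above the stores
unit.stam(mem, 0x40, 0)              # disjoint address: store proceeds
try:
    unit.stam(mem, 0x30A, 1)         # overlaps the bytes read by the ldp
    faulted = False
except AliasFault:
    faulted = True
```

The fault corresponds to the exception in the last bullet. Presumably it then triggers the same rollback used for precise exceptions, so the block can be re-executed with the load in its original position.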

17
Hardware Support: Self-Modifying Code
  • x86 instructions in memory may be overwritten due
    to the OS loading a new program or a program using
    self-modifying code
  • Code Morphing relies on translation cache to
    increase performance, which may contain stale
    translations if corresponding x86 instructions
    have been overwritten
  • Maintain a "translated" bit for each page entry in
    the memory management unit
  • Crusoe knows code may have been modified if
    writes are made to pages with the translated bit
    set
  • Simplest remedy is invalidating translations when
    these writes are made to force retranslation
  • More sophisticated strategies are used as the
    system learns about the program behavior
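
The "simplest remedy" above, invalidating translations when a write hits a page whose translated bit is set, can be sketched as follows. The TranslationGuard class and the 4 KB page granularity are assumptions for illustration, and the more sophisticated strategies mentioned in the last bullet are not modelled.

```python
PAGE_SIZE = 4096  # assumed x86 4 KB pages; the rest of this sketch is hypothetical


class TranslationGuard:
    def __init__(self):
        self.translated_pages = set()   # pages whose "translated" bit is set
        self.tcache = {}                # x86 address -> cached translation

    def add_translation(self, addr, translation):
        self.tcache[addr] = translation
        self.translated_pages.add(addr // PAGE_SIZE)

    def on_write(self, addr):
        """Handle a store; return the addresses whose translations were dropped."""
        page = addr // PAGE_SIZE
        if page not in self.translated_pages:
            return []                   # common case: ordinary data write
        stale = [a for a in self.tcache if a // PAGE_SIZE == page]
        for a in stale:
            del self.tcache[a]          # force retranslation on next use
        self.translated_pages.discard(page)
        return stale


guard = TranslationGuard()
guard.add_translation(0x1000, "t1")
guard.add_translation(0x1800, "t2")
guard.add_translation(0x5000, "t3")
data_write = guard.on_write(0x9000)     # page holds no translated code
code_write = guard.on_write(0x1234)     # page holds t1 and t2
```

Page-granularity tracking is conservative: a write to data that merely shares a page with code also discards that page's translations, which is one reason finer-grained strategies are attractive.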

18
Example Program
x86 code:
   1.         movl ecx,0x3
   2.         jmp lbl1
              ...
   3. lbl1:   movl edx,0x2fc(ebp)
   4.         movl eax,0x304(ebp)
   5.         movl esi,0x0
   6.         cmpl edx,eax
   7.         movl 0x40(esp,1),0x0
   8.         jle skip1
   9.         movl esi,0x1
  10. skip1:  movl 0x6c(esp,1),esi
  11.         cmpl edx,eax
  12.         movl eax,0x1
  13.         jl skip2
  14.         xorl eax,eax
  15. skip2:  movl esi,0x308(ebp)
  16.         movl edi,0x300(ebp)
  17.         movl 0x7c(esp,1),eax
  18.         cmpl esi,edi
  19.         movl eax,0x0
  20.         jnl exit1
      exit2:

VLIW translation (one molecule per numbered line):
   1. addi r39,ebp,0x2fc
   2. addi r38,ebp,0x304
   3. ld edx,r39      add r27,r38,4      add r26,r38,-4
   4. ld r31,r38      add r35,0,1        add r36,esp,0x40
   5. ldp esi,r27     add r33,esp,0x6c   sub.c null,edx,r31
   6. ldp edi,r26     sel le,r32,0,r35
   7. stam 0,r36      sel l,r24,r35,0    add r25,esp,0x7c
   8. stam r32,r33    add ecx,0,3        sub.c null,esi,edi
   9. st r24,r25      or eax,0,0         brcc lt,<exit2>
  10. br <exit1>
19
Example Program
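  • Path selector eliminates the JMP: in the slide 18
    listing, x86 instruction 2 (jmp lbl1) has no
    counterpart in the VLIW translation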
20
Example Program
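  • Predicated execution eliminates branches: in the
    slide 18 listing, the jle and jl (x86 instructions
    8 and 13) become sel atoms in molecules 6 and 7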
21
Example Program
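  • Speculative, advanced loads through the alias
    hardware allow flexibility in load scheduling: in
    the slide 18 listing, the ldp atoms in molecules 5
    and 6 execute ahead of the stam stores in
    molecules 7 and 8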
22
Power Consumption
  • Simple VLIW engine itself has fewer transistors
    and buses than traditional dynamic superscalar
    processors, lowering power consumption
  • Number of transistors roughly corresponds to die
    size

Klaiber, Alexander. "The Technology Behind Crusoe Processors."
23
Power Consumption
  • Less power means less heat, so no external
    cooling device is needed, saving even more power

24
Power Consumption: LongRun Power Management
  • Conventional mobile x86 CPUs regulate power
    consumption by alternating between full speed
    execution and effectively stopping execution,
    with varying performance levels obtained by
    varying the duty cycle between the 2 states
  • $P = \frac{1}{2} C f V^2$, where $C$ is total
    capacitance, $f$ is clock frequency, and $V$ is
    supply voltage
  • Crusoe can actually adjust clock frequency on the
    fly to lower power consumption when fast
    performance is less critical, with more
    granularity than Intel's SpeedStep technology for
    its mobile Pentium IIIs
  • Crusoe can also adjust voltage to save even more
    power, since lower frequency requires less
    voltage -> potential cubic power reduction
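
The "cubic" claim follows from the power formula on this slide: if voltage can be lowered in proportion to frequency, power scales with the cube of the scaling factor. The sketch below compares that with duty-cycle throttling. Values are normalized, and proportional voltage scaling is an idealized assumption, since real parts have a minimum operating voltage.

```python
def dynamic_power(capacitance, frequency, voltage):
    """P = C * f * V^2 / 2, the formula from the slide."""
    return capacitance * frequency * voltage ** 2 / 2


def duty_cycle_power(capacitance, frequency, voltage, duty):
    """Average power when alternating between full speed and stopped."""
    return duty * dynamic_power(capacitance, frequency, voltage)


# Normalized full-speed operating point (hypothetical units).
C, F, V = 1.0, 1.0, 1.0
full = dynamic_power(C, F, V)

# Two ways to deliver roughly half the performance:
throttled = duty_cycle_power(C, F, V, duty=0.5)      # conventional duty cycling
scaled = dynamic_power(C, 0.5 * F, 0.5 * V)          # frequency and voltage scaling

throttled_ratio = throttled / full
scaled_ratio = scaled / full
```

At half speed, duty cycling draws half of full power, while scaling frequency and voltage together draws one eighth (0.5 cubed) in this idealized model.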

25
Benchmarks
26
Conclusion
  • Transmeta Crusoe targets the mobile computing
    market
  • Incorporates hardware and software to achieve
    best energy efficiency (faster performance given
    a certain power consumption)
  • VLIW engine reduces number of transistors
  • Code Morphing performs x86 translation to VLIW
    and optimally schedules molecules
  • Code Morphing software techniques and hardware
    support in the processor help compensate for
    translation overhead
  • Power consumption reduced even more using LongRun
    real-time frequency/voltage adjustments
  • We do pay a performance price for our power
    savings