Title: The Transmeta Crusoe Processor Architecture
1The Transmeta Crusoe Processor Architecture
- Combining Low-Power VLIW
- With x86 Compatibility
- by Jeryl Contemprato
- Advanced Computer Architecture
- 23 October 2002
2Transmeta Crusoe
- Background
- The Processor
- Code Morphing
- Hardware Support
- Example Program
- Power Consumption
- Benchmarks
3Background
- Mobile computing necessitates low-power processing, yet requires sufficient processing capability to run standard applications
- Existing superscalar designs that deliver the required performance are limited in their possibilities for power savings due to the complexity of dynamic scheduling
- To achieve additional power savings, new processor design approaches are necessary, but compatibility with the predominant x86 platform is an issue
4Background
- Transmeta introduced the Crusoe processor family in Jan 2000
- Targeted mobile devices, with goals of strong performance, low power consumption, and x86 compatibility
- 2-part solution:
  - HW: 4-issue native VLIW CPU (not x86 compatible)
  - SW: Code Morphing x86-to-VLIW translation layer
5The Processor (TM5800)
http://www.transmeta.com/crusoe_docs/TM5800_ProductBrief_7-18-02.pdf
6The Processor
- Latest processor, TM5800, clocked at 900 MHz and fabricated using a conventional 0.13 µm CMOS process
- Simple 128-bit VLIW engine, 4 operations per instruction word, not compatible with x86 instructions
  - atom: RISC-like instruction/operation
  - molecule: the instruction word, composed of atoms
- Functional units: 2 integer, 1 floating point, 1 load/store, 1 branch
- 7-stage pipeline for integer ops, 10-stage for floating point
- Integer register file with 64 registers for x86 state, internal state, and SW register renaming
7The Processor
http://www.arstechnica.com/cpu/1q00/crusoe/crusoe-3.html
8The Processor
- By choosing VLIW over traditional superscalar dynamic instruction scheduling, we can drop the massive numbers of transistors and wiring associated with dynamic scheduling in hardware
- Assuming native instruction words, how does the Crusoe fundamentally differ from Itanium?
- Despite Itanium's use of VLIW, it still does some dynamic scheduling, in hardware, of the operations in the instruction word for further run-time optimization, even if the IA-64 compiler had optimized construction of the word. The Crusoe doesn't do any dynamic scheduling in HW.
9Code Morphing
- Dynamic translation and instruction scheduling
software that converts from x86 target ISA to
VLIW host ISA
10Code Morphing
- Software is resident in a serial ROM and is transferred to RAM on startup for faster access
- Code Morphing SW is the only native code written for the VLIW core: essentially an on-the-fly compiler/interpreter for x86 instructions
- Breaks all x86 instructions (including BIOS and OS) into atoms, then schedules them into molecules just as a VLIW compiler would for ILP. A group of molecules generated on a run of the Code Morphing SW is a "translation"
- Translation is more than just a semantic mapping from x86 to VLIW
  - Does typical compiler optimizations and Code Morphing-specific optimizations
  - Out-of-order scheduling and register renaming
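The packing of atoms into molecules can be illustrated with a minimal greedy scheduler. This is only a sketch, not Transmeta's algorithm: the `Atom` class, the `schedule` function, and the one-molecule result latency are hypothetical, registers are assumed to be already renamed (each destination written once), and memory and control dependencies are ignored. The per-molecule limits come from the functional-unit list and the 4-operation word described on the processor slide.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Atom:
    name: str
    unit: str                       # "int", "fp", "mem", or "br"
    dest: Optional[str] = None      # register written (assumed renamed)
    srcs: Tuple[str, ...] = ()      # registers read

# Limits per molecule, taken from the slides: 4 atoms per word,
# 2 integer, 1 floating point, 1 load/store, 1 branch unit.
MAX_ATOMS = 4
UNIT_SLOTS = {"int": 2, "fp": 1, "mem": 1, "br": 1}

def schedule(atoms):
    """Place each atom in the earliest molecule that has a free slot
    and comes after the molecules producing its source registers."""
    molecules = []   # list of molecules, each a list of atoms
    ready_at = {}    # register -> first molecule index that may read it
    for atom in atoms:
        i = max((ready_at.get(s, 0) for s in atom.srcs), default=0)
        while True:
            while i >= len(molecules):
                molecules.append([])
            mol = molecules[i]
            same_unit = sum(1 for a in mol if a.unit == atom.unit)
            if len(mol) < MAX_ATOMS and same_unit < UNIT_SLOTS[atom.unit]:
                mol.append(atom)
                break
            i += 1
        if atom.dest is not None:
            # Assumption: a result is usable in the next molecule.
            ready_at[atom.dest] = i + 1
    return molecules
```

With this sketch, two independent integer atoms share a molecule, while two loads are forced into separate molecules because there is only one load/store unit.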
11Code Morphing Overhead
- The catch: while the Code Morphing software is the layer between x86 instructions and the VLIW engine, it does not physically sit between the two
- Being software itself, Code Morphing has to run on the VLIW core
- While the Code Morphing software is translating, the core is not doing useful work for the x86 program
- Must offset the cost of translation to maintain acceptable performance
12Code Morphing Optimizations
- Translation cache: kept in a protected memory space, stores translations for oft-used instruction groups so they do not have to be retranslated
- Filtering: allows the Crusoe to take advantage of the 10-90 rule of execution
  - 90% of the time is spent on 10% of the code
  - Because further optimization incurs extra overhead, we only want to fully optimize code we are sure we will use again and again
- Heuristics and feedback help decide what to spend time on
  - Predictions and path selections can be made through feedback, such as block execution frequency and branch history, to drive speculative execution
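The interplay of the translation cache and filtering can be sketched as follows. This is an illustrative model only: the `Translator` class, the `HOT_THRESHOLD` value, and the idea of interpreting a block until its execution count crosses the threshold are assumptions made for the example, not documented Crusoe parameters.

```python
HOT_THRESHOLD = 3  # assumed value, for illustration only

class Translator:
    def __init__(self):
        self.cache = {}            # x86 block address -> stored translation
        self.exec_count = {}       # x86 block address -> executions seen so far
        self.translations_made = 0

    def run_block(self, addr):
        """Return how the block at addr was executed this time."""
        if addr in self.cache:
            return "cached"        # reuse: no translation cost at all
        count = self.exec_count.get(addr, 0) + 1
        self.exec_count[addr] = count
        if count >= HOT_THRESHOLD:
            # The block has proven to be hot, so pay the translation cost once.
            self.cache[addr] = "translation@%#x" % addr
            self.translations_made += 1
            return "translated"
        return "interpreted"       # cold code: not worth optimizing yet
```

Running the same block five times gives two interpreted runs, one translation, and two cache hits, so the translation cost is paid only for code that keeps coming back.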
13Hardware Support
- Dynamic translation presents several problems that must be dealt with in hardware to achieve acceptable performance:
  - Precise exceptions & speculation
  - Memory instruction latency
  - Self-modifying code
14Hardware Support: Precise Exceptions & Speculation
- The x86 ISA requires precise exceptions: when an instruction causes an exception, all prior instructions in program order complete before the exception is reported, and all subsequent instructions are flushed
- Code Morphing schedules atoms, the micro-operations of x86 instructions, out of order to achieve ILP
- Crusoe optimizes for the common case of no exceptions in order to achieve greater normal throughput
15Hardware Support: Precise Exceptions & Speculation
- x86 state registers are shadowed so that normal atoms modify only working registers
- Store atoms don't write to memory, but to a gated store buffer
- If a translation executes without causing any exceptions, a follow-on commit atom executes, copying working registers to shadow registers and committing store buffer contents to memory
- If an exception happens, a rollback operation copies shadow register values back into working registers and discards store buffer contents, then re-executes the x86 instructions in order without scheduling
- Same technique applies to speculative execution of mispredicted branches
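The commit and rollback behavior described above can be modeled in a few lines. This is a software sketch of the idea only; the `SpeculativeState` class and its method names are hypothetical, and memory is simplified to a dictionary of address to value.

```python
class SpeculativeState:
    def __init__(self, regs, memory):
        self.shadow = dict(regs)    # last committed x86 register state
        self.working = dict(regs)   # registers that normal atoms modify
        self.memory = memory        # committed memory: address -> value
        self.store_buffer = []      # gated stores: (address, value)

    def write_reg(self, name, value):
        self.working[name] = value  # never touches the shadow copy

    def store(self, addr, value):
        self.store_buffer.append((addr, value))  # held back from memory

    def commit(self):
        """Translation finished without exceptions: make everything visible."""
        self.shadow = dict(self.working)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self):
        """Exception or misprediction: restore the last committed state."""
        self.working = dict(self.shadow)
        self.store_buffer.clear()
```

After a rollback, neither the register writes nor the buffered stores of the failed translation are visible, which is what lets the x86 instructions be re-executed in order from a precise state.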
16Hardware Support Alias HW
- Code Morphing translator can generate more ILP if it has more freedom to move atoms around
- Memory operation dependencies can hamper this freedom, especially in reordering loads ahead of stores
- Alias hardware allows the scheduler to move loads ahead of stores even when a dependency is possible
- Load atoms that are moved ahead of stores are converted into load-and-protect (ldp), which records the load address/size on execution
- Store atoms are converted to store-under-alias-mask (stam), which checks on execution whether the store will overwrite any memory accessed by an ldp
- If the store does overwrite data that was loaded out of order, an exception is raised
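A small software model makes the ldp/stam check concrete. The `AliasHardware` class, the `AliasException` name, and the default 4-byte access size are assumptions for illustration; memory is simplified to a dictionary and the gated store buffer is left out.

```python
class AliasException(Exception):
    """Raised when a store overlaps a region protected by an earlier ldp."""

class AliasHardware:
    def __init__(self):
        self.protected = []   # (address, size) pairs recorded by ldp

    def ldp(self, memory, addr, size=4):
        # Load-and-protect: remember which bytes were read early.
        self.protected.append((addr, size))
        return memory.get(addr, 0)

    def stam(self, memory, addr, value, size=4):
        # Store-under-alias-mask: check for overlap before storing.
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < addr + size:
                raise AliasException("store at %#x hits protected load" % addr)
        memory[addr] = value

    def clear(self):
        # Assumed to happen at the end of a translation (commit or rollback).
        self.protected.clear()
```

A store to an unrelated address passes the check, while a store that overlaps a protected load raises the exception, which would trigger the rollback path described earlier.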
17Hardware Support: Self-Modifying Code
- x86 instructions in memory may be overwritten when the OS loads a new program or when a program uses self-modifying code
- Code Morphing relies on the translation cache to increase performance, and the cache may contain stale translations if the corresponding x86 instructions have been overwritten
- Maintain a "translated" bit for each page entry in the memory management unit
- Crusoe knows code may have been modified if writes are made to pages with the "translated" bit set
- Simplest remedy is invalidating translations when these writes are made, to force retranslation
- More sophisticated strategies are used as the system learns about the program behavior
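The simplest remedy, invalidating every translation on a written page, can be sketched like this. The `TranslationCache` class, its method names, and the 4 KB page size are assumptions made for the example; the per-page "translated" bit is modeled by whether the page has any cached translations.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch

class TranslationCache:
    def __init__(self):
        self.by_page = {}   # page number -> {x86 address: translation}

    def add(self, addr, translation):
        self.by_page.setdefault(addr // PAGE_SIZE, {})[addr] = translation

    def translated_bit(self, addr):
        # Set when at least one translation came from this page.
        return addr // PAGE_SIZE in self.by_page

    def lookup(self, addr):
        return self.by_page.get(addr // PAGE_SIZE, {}).get(addr)

    def on_write(self, addr):
        """Called for a memory write; returns how many translations were dropped."""
        dropped = self.by_page.pop(addr // PAGE_SIZE, {})
        return len(dropped)   # 0 means the page held no translated code
```

A write anywhere on a translated page drops all of that page's translations, which is coarse but always safe; translations from other pages are untouched.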
18Example Program
x86 code:
        movl ecx,0x3
        jmp lbl1
        ...
lbl1:   movl edx,0x2fc(ebp)
        movl eax,0x304(ebp)
        movl esi,0x0
        cmpl edx,eax
        movl 0x40(esp,1),0x0
        jle skip1
        movl esi,0x1
skip1:  movl 0x6c(esp,1),esi
        cmpl edx,eax
        movl eax,0x1
        jl skip2
        xorl eax,eax
skip2:  movl esi,0x308(ebp)
        movl edi,0x300(ebp)
        movl 0x7c(esp,1),eax
        cmpl esi,edi
        movl eax,0x0
        jnl exit1
exit2:

Translated VLIW code (one molecule per line, atoms separated by ";"):
        addi r39,ebp,0x2fc
        addi r38,ebp,0x304
        ld edx,r39 ; add r27,r38,4 ; add r26,r38,-4
        ld r31,r38 ; add r35,0,1 ; add r36,esp,0x40
        ldp esi,r27 ; add r33,esp,0x6c ; sub.c null,edx,r31
        ldp edi,r26 ; sel le,r32,0,r35
        stam 0,r36 ; sel l,r24,r35,0 ; add r25,esp,0x7c
        stam r32,r33 ; add ecx,0,3 ; sub.c null,esi,edi
        st r24,r25 ; or eax,0,0 ; brcc lt,<exit2>
        br <exit1>
19Example Program
- Path selector eliminates the JMP: the x86 "jmp lbl1" has no counterpart in the translated code
20Example Program
- Predicated execution eliminates branches: the x86 jle and jl are replaced by sel (select) atoms in the translated code
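The effect of replacing the two conditional branches with select atoms can be checked with a small model. This sketch assumes that "cmpl edx,eax" compares the first operand against the second (so jle is taken when edx <= eax) and that "sel cond,dst,a,b" writes a when the condition holds and b otherwise; both readings are interpretations of the listing, and the function names are hypothetical.

```python
def branching(edx, eax):
    """Models the x86 sequence with jle / jl skipping over assignments."""
    esi = 0                  # movl esi,0x0
    if not (edx <= eax):     # jle skip1 not taken
        esi = 1              # movl esi,0x1
    new_eax = 1              # movl eax,0x1
    if not (edx < eax):      # jl skip2 not taken
        new_eax = 0          # xorl eax,eax
    return esi, new_eax

def sel(cond, if_true, if_false):
    """Select atom: picks one of two values, no change of control flow."""
    return if_true if cond else if_false

def predicated(edx, eax):
    """Models the translated sequence: one compare, two selects."""
    le = edx <= eax          # condition codes from sub.c null,edx,r31
    lt = edx < eax
    r32 = sel(le, 0, 1)      # sel le,r32,0,r35 (r35 holds 1)
    r24 = sel(lt, 1, 0)      # sel l,r24,r35,0
    return r32, r24
```

Both functions return the same pair of values for every input, but the predicated form has no branches, so the scheduler is free to place the selects in whichever molecules have room.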
21Example Program
- Speculative advanced loads through the alias hardware allow flexibility in load scheduling: the loads of esi and edi become ldp atoms scheduled ahead of the stam stores
22Power Consumption
- Simple VLIW engine itself has fewer transistors and buses than traditional dynamic superscalar processors, lowering power consumption
- Number of transistors roughly corresponds to die size
Klaiber, Alexander. "The Technology Behind Crusoe Processors"
23Power Consumption
- Less power -> less heat -> no need for an external cooling device, saving even more power
24Power Consumption: LongRun Power Management
- Conventional mobile x86 CPUs regulate power consumption by alternating between full-speed execution and effectively stopping execution, with varying performance levels obtained by varying the duty cycle between the 2 states
- Power = (Total Capacitance x Frequency x Voltage^2) / 2
- Crusoe can actually adjust clock frequency on the fly to lower power consumption when fast performance is less critical, with more granularity than Intel's SpeedStep technology for its mobile Pentium IIIs
- Crusoe can also adjust voltage to save even more power, since a lower frequency requires less voltage -> potential cubic power reduction
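The "cubic" claim follows directly from the power formula on this slide. The sketch below assumes, as an idealization, that voltage can be scaled by the same factor as frequency; the function names are hypothetical and the units are arbitrary.

```python
def dynamic_power(capacitance, frequency, voltage):
    """Power = (Total Capacitance x Frequency x Voltage^2) / 2."""
    return capacitance * frequency * voltage ** 2 / 2

def power_ratio(freq_scale, volt_scale):
    """Power after scaling, relative to running at full speed and voltage."""
    full = dynamic_power(1.0, 1.0, 1.0)
    return dynamic_power(1.0, freq_scale, volt_scale) / full

# Frequency alone halved: power falls linearly.
freq_only = power_ratio(0.5, 1.0)
# Frequency and voltage both halved (idealized): power falls with the cube.
freq_and_volt = power_ratio(0.5, 0.5)
```

Under this idealization, halving only the frequency halves the power, while halving frequency and voltage together cuts power to one eighth. Real parts have a minimum operating voltage, so the achievable reduction is smaller.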
25Benchmarks
26Conclusion
- Transmeta Crusoe targets the mobile computing market
- Incorporates hardware and software to achieve the best energy efficiency (faster performance given a certain power consumption)
- VLIW engine reduces the number of transistors
- Code Morphing performs x86 translation to VLIW and schedules atoms into molecules
- Code Morphing software techniques and hardware support in the processor help compensate for translation overhead
- Power consumption is reduced even more using LongRun real-time frequency/voltage adjustments
- We do pay a performance price for our power savings