The Transmeta Crusoe Processor Architecture

1
The Transmeta Crusoe Processor Architecture

Combining Low-Power VLIW
With x86 Compatibility
by Jeryl Contemprato
Advanced Computer Architecture
23 October 2002

2
Transmeta Crusoe

Background
The Processor
Code Morphing
Hardware Support
Example Program
Power Consumption
Benchmarks

3
Background

Mobile computing necessitates low-power
processing, yet requires sufficient processing
capability to run standard applications
Existing superscalar designs that deliver
required performance are limited in possibilities
for power savings due to the complexity of
dynamic scheduling
To achieve additional power savings, new
processor design approaches are necessary, but
compatibility with the predominant x86 platform
is an issue

4
Background

Transmeta introduced the Crusoe processor family
in Jan 2000
Targeted mobile devices, with goals of strong
performance, low power consumption, and x86
compatibility
2-part solution:
HW: 4-issue native VLIW CPU (not x86 compatible)
SW: Code Morphing x86-to-VLIW translation layer

5
The Processor (TM5800)
http://www.transmeta.com/crusoe_docs/TM5800_ProductBrief_7-18-02.pdf

6
The Processor

Latest processor, TM5800, clocked at 900 MHz and
fabricated using a conventional 0.13 µm CMOS process
Simple 128-bit VLIW engine, 4 operations per
instruction word, not compatible with x86
instructions
Atom: a RISC-like instruction/operation
Molecule: the instruction word, composed of
atoms
Functional units: 2 integer, 1 floating point, 1
load/store, 1 branch
7-stage pipeline for integer ops, 10-stage for
floating point
64-entry integer register file for x86 state,
internal state, and SW register renaming
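
To make the atom/molecule terminology concrete, here is a minimal Python sketch that models a molecule as a bundle of up to four atoms and refuses bundles that oversubscribe the functional units listed on this slide. The class names, the `unit` labels, and the checking rule are illustrative assumptions and do not reflect Transmeta's actual instruction encoding.

```python
from collections import Counter
from dataclasses import dataclass, field

# Functional units per the slide: 2 integer, 1 floating point, 1 load/store, 1 branch.
UNITS = {"int": 2, "fp": 1, "mem": 1, "br": 1}
MAX_ATOMS = 4  # 4 operations per instruction word


@dataclass(frozen=True)
class Atom:
    op: str    # mnemonic, e.g. "add", "ld", "brcc"
    unit: str  # hypothetical unit label: "int", "fp", "mem" or "br"


@dataclass
class Molecule:
    atoms: list = field(default_factory=list)

    def can_add(self, atom: Atom) -> bool:
        """True if a slot is left and the atom's functional unit is still free."""
        if len(self.atoms) >= MAX_ATOMS:
            return False
        used = Counter(a.unit for a in self.atoms)
        return used[atom.unit] < UNITS[atom.unit]

    def add(self, atom: Atom) -> None:
        if not self.can_add(atom):
            raise ValueError(f"no free slot or unit for {atom.op}")
        self.atoms.append(atom)


m = Molecule()
m.add(Atom("ld", "mem"))
m.add(Atom("add", "int"))
m.add(Atom("add", "int"))
third_int_fits = m.can_add(Atom("sub", "int"))   # both integer units are taken
branch_fits = m.can_add(Atom("brcc", "br"))      # the branch unit is still free
print(third_int_fits, branch_fits)
```

In this model a molecule holding one load and two integer atoms can still accept a branch atom but not a third integer atom.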

7
The Processor
http://www.arstechnica.com/cpu/1q00/crusoe/crusoe-3.html

8
The Processor

By choosing VLIW over traditional superscalar
dynamic instruction scheduling, we can drop the
massive numbers of transistors and wiring
associated with dynamic scheduling in hardware
Assuming native instruction words, how does the
Crusoe fundamentally differ from Itanium?
Despite Itanium's use of VLIW, it still does some
dynamic scheduling, in hardware, of the
operations in the instruction word for further
run-time optimization, even if the IA-64 compiler
has already optimized construction of the word.
The Crusoe doesn't do any dynamic scheduling in HW.

9
Code Morphing

Dynamic translation and instruction scheduling
software that converts from x86 target ISA to
VLIW host ISA

10
Code Morphing

Software is resident in a serial ROM and is
transferred to RAM on startup for faster access
Code Morphing SW is the only native code written
for the VLIW core: essentially an on-the-fly
compiler/interpreter for x86 instructions
Breaks all x86 instructions (including BIOS and
OS) into atoms, then schedules them into
molecules just as a VLIW compiler would for ILP.
A group of molecules generated on a run of the
Code Morphing SW is a translation
Translation is more than just a semantic mapping
from x86 to VLIW
Performs typical compiler optimizations and Code
Morphing-specific optimizations
Out-of-order scheduling and register renaming
11
Code Morphing: Overhead

The catch: while the Code Morphing software is
the layer between x86 instructions and the VLIW
engine, it does not physically sit between the
two
Being software itself, Code Morphing has to run
on the VLIW core
While the Code Morphing software is translating,
the core is not doing useful work for the x86
program
Must offset the cost of translation to maintain
acceptable performance
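
As a rough worked example of offsetting translation cost, the sketch below computes how many times a block must run before a one-time translation pays for itself, compared with executing the block without translating it. All cycle counts are made-up illustrative values, and the simple cost model (a fixed translation cost plus a fixed per-run cost) is an assumption, not a measurement of Crusoe.

```python
def break_even_runs(translate_cost, untranslated_cost, translated_cost):
    """Smallest n with translate_cost + n * translated_cost < n * untranslated_cost.

    All arguments are integer cycle counts. Returns None if translation
    never pays off.
    """
    saving = untranslated_cost - translated_cost
    if saving <= 0:
        return None
    return translate_cost // saving + 1


# Hypothetical numbers: 1000 cycles to translate, 50 cycles per run without
# a translation, 10 cycles per run with one.
runs = break_even_runs(1000, 50, 10)
print(runs)   # 26
```

With these numbers the translation only wins from the 26th execution onward, which is why it matters to spend translation effort on frequently executed code.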

12
Code Morphing: Optimizations

Translation cache, in a protected memory space,
stores translations for oft-used instruction
groups, so they do not have to be retranslated
Filtering allows the Crusoe to take advantage of
the 10-90 rule of execution:
90% of the time is spent on 10% of the code
Because further optimization incurs extra
overhead, we only want to fully optimize code we
are sure we will use again and again
Heuristics and feedback help decide what to spend
time on
Predictions and path selections can be made
through feedback, such as block execution
frequency and branch history, to drive
speculative execution
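
The interplay of the translation cache and filtering can be sketched as follows. This is a minimal model under stated assumptions: a block is run without translation until it has executed `threshold` times, then translated once and served from the cache afterwards. The `TranslationCache` class, the `translate` and `interpret` callbacks, and the threshold value are all hypothetical.

```python
class TranslationCache:
    """Translate a block only after it has proven to be frequently executed."""

    def __init__(self, translate, interpret, threshold=3):
        self.translate = translate    # hypothetical: block address -> callable translation
        self.interpret = interpret    # hypothetical: runs one block without translating
        self.threshold = threshold    # illustrative value, not a Crusoe parameter
        self.counts = {}              # block address -> executions seen so far
        self.cache = {}               # block address -> stored translation

    def run(self, addr):
        if addr in self.cache:
            return self.cache[addr]()          # reuse the stored translation
        self.counts[addr] = self.counts.get(addr, 0) + 1
        if self.counts[addr] >= self.threshold:
            self.cache[addr] = self.translate(addr)
            return self.cache[addr]()
        return self.interpret(addr)


translated = []

def translate(addr):
    translated.append(addr)                    # record each translation request
    return lambda: f"translated {addr:#x}"

def interpret(addr):
    return f"interpreted {addr:#x}"

tc = TranslationCache(translate, interpret, threshold=3)
results = [tc.run(0x1000) for _ in range(5)]
print(results)
print(translated)   # the block was translated exactly once
```

The first two runs take the slow path, the third triggers the translation, and later runs hit the cache.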

13
Hardware Support

Dynamic translation presents several problems
that must be dealt with in hardware to achieve
acceptable performance
Precise exceptions and speculation
Memory instruction latency
Self-modifying code

14
Hardware Support: Precise Exceptions and Speculation

x86 ISA enforces precise exceptions, so that when
an instruction causes an exception, all prior
in-order instructions complete before reporting
the exception, while flushing all subsequent
instructions
Code Morphing schedules atoms, the
micro-operations of x86 instructions,
out-of-order to achieve ILP
Crusoe optimizes for the common case of no
exceptions in order to achieve greater normal
throughput

15
Hardware Support: Precise Exceptions and Speculation

x86 state registers are shadowed so that normal
atoms modify only working registers
Store atoms don't write to memory, but to a gated
store buffer
If a translation executes without causing any
exceptions, a follow-on commit atom executes,
copying working registers to shadow registers and
committing store buffer contents to memory
If an exception happens, a rollback operation
copies shadow register values back into working
registers and discards store buffer contents,
then re-executes x86 instructions in order
without scheduling
Same technique applies to speculative execution
of mispredicted branches
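
The commit/rollback scheme above can be modelled in a few lines. This is a software sketch under simplifying assumptions: registers and memory are plain dictionaries, and the in-order re-execution that follows a rollback is not modelled. The `SpeculativeState` class and its method names are hypothetical.

```python
class SpeculativeState:
    def __init__(self, regs, memory):
        self.shadow = dict(regs)    # last committed x86 register state
        self.working = dict(regs)   # registers that normal atoms modify
        self.memory = memory        # committed memory: address -> value
        self.store_buffer = []      # gated stores, not yet visible in memory

    def store(self, addr, value):
        """Store atom: buffer the write instead of touching memory."""
        self.store_buffer.append((addr, value))

    def commit(self):
        """Translation finished without exceptions: make its effects permanent."""
        self.shadow = dict(self.working)
        for addr, value in self.store_buffer:
            self.memory[addr] = value
        self.store_buffer.clear()

    def rollback(self):
        """Exception or misprediction: restore committed state, drop buffered stores."""
        self.working = dict(self.shadow)
        self.store_buffer.clear()


mem = {}
state = SpeculativeState({"eax": 0}, mem)

state.working["eax"] = 5
state.store(0x40, 7)
state.rollback()                    # pretend an exception occurred
after_rollback = (state.working["eax"], dict(mem))

state.working["eax"] = 9
state.store(0x40, 1)
state.commit()                      # no exception this time
after_commit = (state.shadow["eax"], dict(mem))
print(after_rollback, after_commit)
```

After the rollback neither the register update nor the store is visible; after the commit both are.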

16
Hardware Support: Alias HW

Code Morphing translator can generate more ILP if
it has more freedom to move atoms around
Memory operation dependencies can hamper this
freedom, especially in reordering loads ahead of
stores
Alias hardware allows the scheduler to move loads
ahead of stores even in case of dependency
Those load atoms that are moved ahead of stores
are converted into load-and-protect (ldp), which
records load address/size on execution
Store atoms converted to store-under-alias-mask
(stam) to check, on execution, if the store will
overwrite any memory accessed by an ldp
If the store does overwrite previously
out-of-order loaded data, an exception is raised.
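
The ldp/stam check can be illustrated with a small model. It is a sketch under assumptions: memory is a dictionary keyed by address, every stam is checked against every recorded ldp (the alias mask that selects which protected loads to check is not modelled), and the `AliasHardware` and `AliasFault` names are hypothetical.

```python
class AliasFault(Exception):
    """Raised when a store overlaps data that was loaded early."""


class AliasHardware:
    def __init__(self):
        self.protected = []   # (address, size) ranges recorded by ldp

    def ldp(self, memory, addr, size=4):
        """load-and-protect: load the value and record its address and size."""
        self.protected.append((addr, size))
        return memory.get(addr, 0)

    def stam(self, memory, addr, value, size=4):
        """store-under-alias-mask: fault if the store overlaps a protected range."""
        for p_addr, p_size in self.protected:
            if addr < p_addr + p_size and p_addr < addr + size:
                raise AliasFault(f"store to {addr:#x} overlaps load at {p_addr:#x}")
        memory[addr] = value

    def clear(self):
        """Forget all protected ranges, e.g. at the end of a translation."""
        self.protected.clear()


mem = {0x308: 11}
hw = AliasHardware()
value = hw.ldp(mem, 0x308)      # load hoisted above the stores below
hw.stam(mem, 0x40, 0)           # disjoint address: the store proceeds
try:
    hw.stam(mem, 0x30A, 1)      # overlaps bytes 0x308..0x30B: must fault
    faulted = False
except AliasFault:
    faulted = True
print(value, mem[0x40], faulted)
```

The non-overlapping store completes normally, while the overlapping one raises the exception that tells the system the early load was unsafe.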

17
Hardware Support: Self-Modifying Code

x86 instructions in memory may be overwritten due
to OS loading new program or a program using
self-modifying code
Code Morphing relies on translation cache to
increase performance, which may contain stale
translations if corresponding x86 instructions
have been overwritten
Maintain a "translated" bit for each page entry
in the memory management unit
Crusoe knows code may have been modified if
writes are made to pages with the translated bit
set
Simplest remedy is invalidating translations when
these writes are made, to force retranslation
More sophisticated strategies are used as the
system learns about the program's behavior
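
The simplest remedy, invalidating on a write to a page whose translated bit is set, can be sketched as below. The page size, the `TranslatedPages` class, and the assumption that each translation is tied to a single x86 address within one page are illustrative simplifications.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class TranslatedPages:
    def __init__(self):
        self.translated = set()    # page numbers whose "translated" bit is set
        self.translations = {}     # x86 address -> translation (any object)

    def add_translation(self, addr, translation):
        self.translations[addr] = translation
        self.translated.add(addr // PAGE_SIZE)

    def on_write(self, addr):
        """Handle a memory write; return how many translations were invalidated."""
        page = addr // PAGE_SIZE
        if page not in self.translated:
            return 0                               # no translated code on this page
        stale = [a for a in self.translations if a // PAGE_SIZE == page]
        for a in stale:
            del self.translations[a]               # force retranslation later
        self.translated.discard(page)
        return len(stale)


pages = TranslatedPages()
pages.add_translation(0x1000, "T1")
pages.add_translation(0x1010, "T2")
pages.add_translation(0x3000, "T3")

dropped = pages.on_write(0x1FFC)     # same page as T1 and T2
untouched = pages.on_write(0x5000)   # page without translations
print(dropped, untouched, sorted(pages.translations))
```

Only the translations on the written page are discarded; translations on other pages survive.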

18
Example Program

x86 code:
 1.         movl ecx,0x3
 2.         jmp lbl1
            ...
 3. lbl1:   movl edx,0x2fc(ebp)
 4.         movl eax,0x304(ebp)
 5.         movl esi,0x0
 6.         cmpl edx,eax
 7.         movl 0x40(esp,1),0x0
 8.         jle skip1
 9.         movl esi,0x1
10. skip1:  movl 0x6c(esp,1),esi
11.         cmpl edx,eax
12.         movl eax,0x1
13.         jl skip2
14.         xorl eax,eax
15. skip2:  movl esi,0x308(ebp)
16.         movl edi,0x300(ebp)
17.         movl 0x7c(esp,1),eax
18.         cmpl esi,edi
19.         movl eax,0x0
20.         jnl exit1
    exit2:

Translated VLIW code (one molecule per line):
 1. addi r39,ebp,0x2fc
 2. addi r38,ebp,0x304
 3. ld edx,r39       add r27,r38,4        add r26,r38,-4
 4. ld r31,r38       add r35,0,1          add r36,esp,0x40
 5. ldp esi,r27      add r33,esp,0x6c     sub.c null,edx,r31
 6. ldp edi,r26      sel le,r32,0,r35
 7. stam 0,r36       sel l,r24,r35,0      add r25,esp,0x7c
 8. stam r32,r33     add ecx,0,3          sub.c null,esi,edi
 9. st r24,r25       or eax,0,0           brcc lt,<exit2>
10. br <exit1>

19
Example Program
Path selector eliminates the JMP (x86
instruction 2)

20
Example Program
Predicated execution eliminates branches (the
jle skip1 and jl skip2 branches become sel atoms)

21
Example Program
Speculative advanced loads through aliasing
(ldp/stam) allow flexibility in load scheduling
22
Power Consumption

Simple VLIW engine itself has fewer transistors
and buses than traditional dynamic superscalar
processors, lowering power consumption
Number of transistors roughly corresponds to die
size

Klaiber, Alexander. The Technology Behind
Crusoe Processors
23
Power Consumption

Less power means less heat, so no external
cooling device is needed, saving even more power

24
Power Consumption: LongRun Power Management

Conventional mobile x86 CPUs regulate power
consumption by alternating between full speed
execution and effectively stopping execution,
with varying performance levels obtained by
varying duty cycle between 2 states
Power = (Total Capacitance x Frequency x
Voltage^2) / 2
Crusoe can actually adjust clock frequency on the
fly to lower power consumption when fast
performance is less critical, with more
granularity than Intel's SpeedStep technology for
its mobile Pentium IIIs
Crusoe can also adjust voltage to save even more
power, since lower frequency requires less
voltage: a potential cubic power reduction
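
The cubic claim follows directly from the power formula above. The sketch below evaluates it with normalized values (capacitance, frequency and voltage all 1.0 at full speed) and makes the idealized assumption that voltage can be lowered in proportion to frequency; real frequency/voltage operating points are not modelled.

```python
def dynamic_power(capacitance, frequency, voltage):
    """Power = (Total Capacitance x Frequency x Voltage^2) / 2, as on the slide."""
    return capacitance * frequency * voltage ** 2 / 2


full = dynamic_power(1.0, 1.0, 1.0)
half_freq = dynamic_power(1.0, 0.5, 1.0)            # halve frequency only
half_freq_and_volt = dynamic_power(1.0, 0.5, 0.5)   # halve frequency and voltage

print(half_freq / full)            # 0.5   (linear in frequency)
print(half_freq_and_volt / full)   # 0.125 (0.5 cubed)
```

Halving frequency alone halves power, while halving frequency and voltage together cuts power to one eighth under this idealized model.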

25
Benchmarks
26
Conclusion

Transmeta Crusoe targets the mobile computing
market
Incorporates hardware and software to achieve
best energy efficiency (faster performance given
a certain power consumption)
VLIW engine reduces number of transistors
Code Morphing performs x86 translation to VLIW
and schedules atoms into molecules
Code Morphing software techniques and hardware
support in the processor help compensate for
translation overhead
Power consumption reduced even more using LongRun
real-time frequency/voltage adjustments
We do pay a performance price for our power
savings