Title: Application of Binary Translation to Java Reconfigurable Architectures
- Antonio Carlos S. Beck Filho
- caco_at_inf.ufrgs.br
- Luigi Carro
- carro_at_inf.ufrgs.br
- Instituto de Informática - GME
- Universidade Federal do Rio Grande do Sul
Introduction
- The embedded system market is expanding
More performance is required
- Moreover:
- Design cycles are getting shorter
- The complexity of these embedded systems is increasing as well
- Many of them are battery dependent
- These embedded systems are adopting Java
- Devices with Java, such as cellular phones and PDAs:
- 176 million in 2001
- 721 million in 2006 [1]
- 80% of cellular phones will support Java [2]
- 10 times more embedded system developers than general-purpose software developers by the year 2010 [3]
[1] D. Takahashi, "Java Chips Make a Comeback," Red Herring, 2001.
[2] G. Lawton, "Moving Java into Mobile Phones," Computer, vol. 35, no. 6, 2002, pp. 17-20.
[3] R.W. Atherton, "Moving Java to the Factory," IEEE Spectrum, 1998, pp. 18-23.
- Java is object oriented, which helps with:
- Modeling
- Programming
- Validation
- Widely spread
- Safe
- Small ROM memory footprint (CISC)
- Multiplatform
Motivation
- How can performance be increased while keeping power consumption low?
- Using a reconfigurable array!
Special tools and compilers are needed!
No software portability!
And the design cycle?
Outline
- Java processors
- Using Binary Translation with reconfigurable arrays
- The reconfigurable array
- Results
- Area
- Performance
- Power consumption
- Conclusions and Future Work
Femtojava Low-Power
[Figure: pipeline with five stages: Instruction Fetch, Decoder, Operand Fetch, Execution and Write Back]
- Instruction Fetch: a 9-byte instruction queue handles the variable-size instructions
- Decoder: responsible for generating the microOPs and for checking data dependences
[Figure: IADD decoded into 11011]
- Operand Fetch: it has a register bank with two ports
- The stack and the local variable storage are implemented in this register file
Allows comparisons with RISC machines!
- Execution: six functional units: multiplier, ALU, shifter, constant generator, branch and LD/ST
- Write Back: writes the results back to the stack or to the local variable storage
VLIW Architecture
- 2 instructions per VLIW packet
- The VLIW packet has a variable size
- In this case, a packet can hold 1 or 2 instructions
- Instruction 1 goes to Decoder 1 and Instruction 2 goes to Decoder 2
- Decoder 2 does not support method calls and returns
- Each flow has its own operand stack (one in Register Bank 1, one in Register Bank 2)
- The local variable pool of the method is shared
- No mechanism is necessary for communication among the flows!
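The organization above (private operand stacks, one shared local variable pool) can be illustrated with a small software model. This is only a sketch under stated assumptions: the class `TwoFlowMachine`, its methods and the packet encoding are hypothetical names invented for illustration, only a handful of bytecodes are modelled, and hazards between the two instructions of a packet are ignored.

```python
class TwoFlowMachine:
    """Toy model of two flows with private operand stacks and shared locals."""

    def __init__(self):
        self.stacks = [[], []]  # one private operand stack per flow
        self.locals = {}        # local variable pool shared by both flows

    def step(self, flow, opcode, arg=None):
        """Execute one bytecode on the operand stack of the given flow."""
        stack = self.stacks[flow]
        if opcode == "bipush":
            stack.append(arg)
        elif opcode == "imul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif opcode == "iadd":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif opcode == "istore":
            self.locals[arg] = stack.pop()
        elif opcode == "iload":
            stack.append(self.locals[arg])
        else:
            raise ValueError(f"unsupported opcode: {opcode}")

    def run_packets(self, packets):
        """Each packet holds 1 or 2 instructions; slot i is executed by flow i."""
        for packet in packets:
            for flow, instr in enumerate(packet):
                self.step(flow, *instr)


machine = TwoFlowMachine()
machine.run_packets([
    (("bipush", 2), ("bipush", 4)),
    (("bipush", 3), ("bipush", 5)),
    (("imul",), ("iadd",)),
    (("istore", 0), ("istore", 1)),
    (("iload", 0),),  # a packet with a single instruction
])
```

In this model the two flows never read each other's stack. The only state they have in common is `machine.locals`, which mirrors the shared local variable pool.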
- Six functional units: multiplier, ALU, shifter, constant generator, branch and LD/ST
- They are replicated in each flow
- Write Back: the results go to the operand stack of each flow or to the local variable storage of the first register bank
Why use a reconfigurable array?
- Hypothesis: substituting a sequence of instructions with a combinational circuit saves power (we lose area)
- Let us see the multiplication algorithm example:
- $T_{C_{alg}} = n\,(T_{P_{FF}} + n\,T_{?} + T_{set})$
- $T_{C_{CC}} = n \cdot n\,T_{?}$ (very pessimistic)
The Binary Translation
- BT takes a binary code and produces another binary for a different machine
- BT advantages when used with reconfiguration:
- One can detect parallelism and reconfigure the array at run-time
- No need for special tools or compilers anymore!
- We solve the software-compatibility problem
- How does it work?
- Observe the bytecodes, looking for frequently executed sequences
- Save each such sequence in a special cache
- When this sequence of instructions is found again, the array is reconfigured and set as the active functional unit
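The three steps above can be sketched in software. This is a minimal model, not the hardware detector of the presentation: the class `SequenceDetector`, the `threshold` parameter and the string used as a stand-in for an array configuration are all assumptions introduced for illustration.

```python
from collections import Counter


class SequenceDetector:
    """Toy model: observe sequences, cache the frequent ones, reuse them."""

    def __init__(self, threshold=2):
        # threshold is a hypothetical tuning knob, not a value from the slides
        self.threshold = threshold
        self.seen = Counter()  # how many times each sequence was observed
        self.cache = {}        # sequence -> stand-in for an array configuration

    def observe(self, sequence):
        """Return 'array' if the sequence runs on the array, else 'pipeline'."""
        key = tuple(sequence)
        if key in self.cache:
            # The sequence was found again: the array would be reconfigured
            # and used as the active functional unit.
            return "array"
        self.seen[key] += 1
        if self.seen[key] >= self.threshold:
            # Frequently executed: save the sequence in the special cache.
            self.cache[key] = f"config-{len(self.cache)}"
        return "pipeline"


detector = SequenceDetector(threshold=2)
loop_body = ["iload", "bipush", "imul", "istore"]
trace = [detector.observe(loop_body) for _ in range(4)]
```

With a threshold of 2, the first two executions of `loop_body` go through the normal pipeline and the later ones are served by the cached configuration.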
Bytecodes Detection
Considering these bytecodes:
bipush 10
bipush 5
imul
bipush 3
bipush 4
ishl
iadd
istore
bipush 6
bipush 7
imul
The instructions depend on each other!
These two blocks are independent!
- Operand Block 1: first sequence
- Operand Block 2: second sequence
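One way to find such blocks in software is to track the operand stack depth. The sketch below assumes that an operand block ends whenever the depth returns to its starting value. That rule is an assumption made for illustration, not something the slides state, and the function name `split_operand_blocks` and the `STACK_EFFECT` table are hypothetical.

```python
# Net effect of each bytecode on the operand stack depth (pushes minus pops).
STACK_EFFECT = {
    "bipush": +1,  # pushes one constant
    "imul": -1,    # pops two values, pushes one
    "ishl": -1,    # pops two values, pushes one
    "iadd": -1,    # pops two values, pushes one
    "istore": -1,  # pops one value into a local variable
}


def split_operand_blocks(bytecodes):
    """Split a bytecode list at the points where the stack depth is back to 0."""
    blocks, current, depth = [], [], 0
    for instr in bytecodes:
        opcode = instr.split()[0].lower()
        current.append(instr)
        depth += STACK_EFFECT[opcode]
        if depth == 0:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing block whose result is still on the stack
    return blocks


program = [
    "bipush 10", "bipush 5", "imul",
    "bipush 3", "bipush 4", "ishl",
    "iadd", "istore",
    "bipush 6", "bipush 7", "imul",
]
blocks = split_operand_blocks(program)
```

For the bytecodes on the slide this gives two blocks: the first runs up to `istore`, and the second starts at `bipush 6`.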
The Reconfigurable Array
- The array is coarse-grained
- This allows a large number of sequences to be saved in the cache
- The reconfiguration is fast
- It is formed by one or more basic cells
- Each cell has one multiplier and a sequence of seven sets of basic functional units
General Overview
[Figure: general overview showing the Detector Unit, the Reconfiguration Cache and the Array]
Power Simulator
- CACO-PS
- Cycle Accurate COnfigurable Power Simulator
- Based on the switching activity
- $P_d = \alpha \cdot f_c \cdot C \cdot V_{dd}^2$
- The result is given in number of gate capacitances that switch
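As a worked example of the switching-activity formula above, the sketch below reads the first factor as the switching activity, the second as the clock frequency, and the remaining two as the switched capacitance and the supply voltage. The function name `dynamic_power` and all numeric values are illustrative assumptions, not figures from the presentation.

```python
def dynamic_power(activity, clock_hz, capacitance_f, vdd_v):
    """Dynamic power: activity * clock frequency * switched capacitance * Vdd^2."""
    return activity * clock_hz * capacitance_f * vdd_v ** 2


# Illustrative numbers only: same circuit and clock, supply voltage halved.
p_full = dynamic_power(activity=0.5, clock_hz=50e6, capacitance_f=10e-12, vdd_v=3.3)
p_half = dynamic_power(activity=0.5, clock_hz=50e6, capacitance_f=10e-12, vdd_v=1.65)
ratio = p_half / p_full
```

Because the supply voltage enters squared, halving it cuts the dynamic power to one quarter when everything else stays the same.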
Results
- A set of algorithms was executed on the architectures:
- Sine calculation
- Bubble sort
- Selection sort
- Quicksort (10 and 100 elements)
- Binary search
- Sequential search
- IMDCT (plus three unrolled versions)
- Floating-point sums emulation
- Full MP3 player
Performance
[Figure: performance charts, not recoverable from the extracted text; the annotations on them follow]
- The same number of different sequences of instructions
- Parallelism exposed by loop unrolling
- No more parallelism available!
- There is room for improvement!
[Figures: Energy in memory accesses; Energy in the cores; Total Energy Consumption; Area (label: VLIW 2); Final Results (label: VLIW 2)]
Conclusions
- With BT, a reconfigurable array and Java we achieve at the same time:
- The Java concept of write once, run everywhere
- Software portability for different machines
- Performance
- Low energy consumption
- thanks to combinational circuits and parallelism
- we can still reduce Vdd
- HW upgrades with SW compatibility
Future Work (I)
- Use Binary Translation with CMP
- At run-time, detect which core is best suited to execute the software at a given time
Future Work (II)
- Implement the BT and the reconfigurable array in traditional RISC machines
- What are the differences in implementation?
The end...
- Questions?
- carro_at_inf.ufrgs.br
- caco_at_inf.ufrgs.br