Application of Binary Translation to Java Reconfigurable Architectures

1
Application of Binary Translation to Java
Reconfigurable Architectures
  • Antonio Carlos S. Beck Filho
  • caco_at_inf.ufrgs.br
  • Luigi Carro
  • carro_at_inf.ufrgs.br
  • Instituto de Informática - GME
  • Universidade Federal do Rio Grande do Sul

3
Introduction
  • The embedded system market is expanding
  • More performance is required

4
Introduction
  • Moreover:
  • Shorter design cycles
  • The complexity of these embedded systems is
    increasing as well
  • Battery dependence

5
Introduction
  • These embedded systems are adopting Java
  • Devices with Java, such as cellular phones and PDAs:
  • 176 million in 2001
  • 721 million in 2006 [1]
  • 80% of cellular phones will support Java [2]
  • 10 times more embedded system developers than
    general-purpose software developers by the year 2010 [3]

[1] D. Takahashi, "Java Chips Make a Comeback," Red Herring, 2001.
[2] G. Lawton, "Moving Java into Mobile Phones," Computer, vol. 35, no. 6, 2002, pp. 17-20.
[3] R. W. Atherton, "Moving Java to the Factory," IEEE Spectrum, 1998, pp. 18-23.

6
Introduction
  • The Java language...
  • Object oriented
  • Modeling
  • Programming
  • Validation
  • Widespread
  • Safe
  • Small code size in ROM (CISC)
  • Multiplatform

10
Motivation
  • How can performance be increased with
    low power consumption?
  • Using a reconfigurable array!

Special tools and compilers are needed!
No software portability!
And what about the design cycle?

12
Outline
  • Java processors
  • Using Binary Translation with reconfigurable
    arrays
  • The reconfigurable array
  • Results: area, performance, power consumption
  • Conclusions and Future Work

13
Femtojava Low-Power
4
7
14
Femtojava Low-Power
  • Five pipeline stages:
  • Instruction Fetch
  • Decoder
  • Operand Fetch
  • Execution
  • Write Back

15
Femtojava Low-Power
[Figure: five-stage pipeline diagram; example instruction: IADD]
  • An instruction queue 9 bytes long handles
    the variable-size instructions

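To illustrate why a fetch queue is useful with variable-size instructions, the sketch below splits a byte stream into individual bytecodes. The opcode values are the standard JVM encodings, but the table covers only a handful of opcodes, and the name split_instructions is hypothetical; this is not a description of the Femtojava fetch hardware.

```python
# Operand byte counts for a few standard JVM opcodes (small subset, illustration only).
OPERAND_BYTES = {
    0x10: 1,  # bipush <byte>
    0x11: 2,  # sipush <byte1> <byte2>
    0x15: 1,  # iload <index>
    0x36: 1,  # istore <index>
    0x60: 0,  # iadd
    0x68: 0,  # imul
    0x78: 0,  # ishl
}

def split_instructions(code):
    """Split a bytecode stream into individual variable-size instructions."""
    instructions = []
    pc = 0
    while pc < len(code):
        length = 1 + OPERAND_BYTES[code[pc]]  # opcode byte plus its operand bytes
        instructions.append(code[pc:pc + length])
        pc += length
    return instructions

# bipush 10; bipush 5; imul; istore 1
stream = bytes([0x10, 10, 0x10, 5, 0x68, 0x36, 1])
print([ins.hex() for ins in split_instructions(stream)])
# prints ['100a', '1005', '68', '3601']
```

Because instruction boundaries are only known after the opcode is read, the fetch stage has to buffer more bytes than a single instruction needs, which is the role the queue plays on this slide.
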
16
Femtojava Low-Power
[Figure: five-stage pipeline diagram; labels: IADD, 11011]
  • The decoder is responsible for generating the microOPs
    and for checking data dependences

18
Femtojava Low-Power
[Figure: five-stage pipeline diagram; register bank with Top of Stack pointer; example instruction: POP]
  • It has a register bank with two ports
  • Stack and local variable storage are implemented in
    this register file

This allows comparisons with RISC machines!

19
Femtojava Low-Power
[Figure: five-stage pipeline diagram]
  • Six functional units: multiplier, ALU, shifter,
    constant generator, branch and LD/ST

20
Femtojava Low-Power
[Figure: five-stage pipeline diagram; register bank with Top of Stack pointer]
  • Results are written back to the stack or to local
    variable storage

21
VLIW Architecture
  • 2 instructions/VLIW packet

[Figure: five-stage pipeline diagram with Instruction 1 and Instruction 2]
  • The VLIW packet has a variable size
  • In this case, a VLIW packet can hold 1 or 2
    instructions

22
VLIW Architecture
[Figure: pipeline diagram with two decoders (Decoder 1, Decoder 2) for Instruction 1 and Instruction 2]
  • Decoder 2 doesn't support method calls and
    returns

23
VLIW Architecture
[Figure: pipeline diagram; Register Bank 1 with an operand stack and the local variable pool, Register Bank 2 with an operand stack]
  • Each flow has its own operand stack
  • The local variable pool of the method is shared

No mechanism is necessary for communication among
the flows!

24
VLIW Architecture
[Figure: five-stage pipeline diagram]
  • Six functional units: multiplier, ALU, shifter,
    constant generator, branch and LD/ST
  • They are replicated in each flow

25
VLIW Architecture
[Figure: five-stage pipeline diagram]
  • Results are written back to the operand stack of
    each flow or to the local variable storage of the first
    register bank

26
Why use a reconfigurable array?
  • Hypothesis: substituting a sequence of
    instructions with a combinational circuit saves
    power (at the cost of area)
  • Consider the multiplication algorithm as an example:
  • T_Calg = n (T_PFF + n T_? + T_set)
  • T_CCC = n · n T_? (very pessimistic)

27
The Binary Translation
  • Binary translation (BT) takes binary code and produces another binary
    for a different machine
  • Advantages of BT when used with reconfiguration:
  • Parallelism can be detected and the array
    reconfigured at run time
  • No need for special tools or compilers anymore!
  • The software compatibility problem is solved

28
The Binary Translation
  • How does it work?
  • Observe the bytecodes, looking for frequently
    executed sequences
  • Save each such sequence in a special cache
  • When the sequence of instructions is found
    again, the array is reconfigured and set as the
    active functional unit

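The three steps above can be sketched as a toy software model. Everything here is an assumption made for illustration: the class name SequenceDetector, the HOT_THRESHOLD value, and the use of the start address as the cache key are hypothetical, and the real detector is a hardware unit whose policy is not given on the slides.

```python
from collections import Counter

HOT_THRESHOLD = 2  # hypothetical: executions needed before a sequence is cached

class SequenceDetector:
    """Toy model of the observe / cache / reconfigure flow."""

    def __init__(self):
        self.counts = Counter()   # how often each sequence start address was seen
        self.config_cache = {}    # start address -> saved sequence (the "configuration")

    def execute(self, start_pc, sequence):
        """Return which unit would run the sequence: 'array' or 'pipeline'."""
        if start_pc in self.config_cache:
            # Sequence found again: the array is reconfigured and used.
            return "array"
        self.counts[start_pc] += 1
        if self.counts[start_pc] >= HOT_THRESHOLD:
            # Frequently executed: save the sequence in the cache for next time.
            self.config_cache[start_pc] = tuple(sequence)
        return "pipeline"

detector = SequenceDetector()
loop_body = ["bipush 10", "bipush 5", "imul"]
print([detector.execute(0x40, loop_body) for _ in range(4)])
# prints ['pipeline', 'pipeline', 'array', 'array']
```

The first executions run on the normal pipeline while the sequence is being observed; once it is cached, later occurrences are dispatched to the array.
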
29
Bytecodes Detection
Considering these bytecodes:
bipush 10
bipush 5
imul
bipush 3
bipush 4
ishl
iadd
istore
bipush 6
bipush 7
imul

33
Bytecodes Detection
The instructions depend on each other!

35
Bytecodes Detection
These two blocks are independent!

36
Bytecodes Detection
Operand Block 1: first sequence
Operand Block 2: second sequence

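One way to picture the split into operand blocks is to track the operand stack depth and close a block whenever the stack returns to empty. This boundary rule is an assumption made for the sketch, since the slides do not define it, and the function name split_operand_blocks is hypothetical. The stack effects listed are the standard ones for these JVM instructions.

```python
# Net operand-stack effect of each mnemonic used in the example sequence.
STACK_EFFECT = {"bipush": +1, "imul": -1, "ishl": -1, "iadd": -1, "istore": -1}

def split_operand_blocks(bytecodes):
    """Group bytecodes into blocks that start and end with an empty operand stack."""
    blocks, current, depth = [], [], 0
    for instr in bytecodes:
        mnemonic = instr.split()[0].lower()
        current.append(instr)
        depth += STACK_EFFECT[mnemonic]
        if depth == 0:          # stack is empty again: the block is complete
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)  # trailing instructions whose block is still open
    return blocks

program = ["bipush 10", "bipush 5", "imul", "bipush 3", "bipush 4",
           "ishl", "iadd", "istore", "bipush 6", "bipush 7", "imul"]
for number, block in enumerate(split_operand_blocks(program), start=1):
    print(number, block)
```

Under this assumption the first eight instructions form one block (it ends at istore) and the last three start a second one. Since neither block consumes a stack value produced by the other, they could be dispatched to separate flows.
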
38
The Reconfigurable Array
  • The array is coarse-grained
  • This allows a large number of sequences to be saved in
    the cache
  • Reconfiguration is fast
  • It is formed by one or more basic cells
  • Each cell has one multiplier and a sequence of seven sets
    of basic functional units

39
General Overview
[Figure: block diagram with the Detector Unit, the Reconfiguration Cache and the Array]

40
Power Simulator
  • CACO-PS
  • Cycle-Accurate COnfigurable Power Simulator
  • Based on the switching activity
  • Pd = α · fc · C · Vdd²
  • The result is given as the number of gate capacitances
    that switch

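The dynamic power relation on this slide can be evaluated directly. In the sketch below the function name dynamic_power and all numeric values are hypothetical, chosen only to show the arithmetic; they are not results from CACO-PS.

```python
def dynamic_power(alpha, f_clk, capacitance, vdd):
    """Dynamic power: switching activity * clock frequency * capacitance * Vdd^2."""
    return alpha * f_clk * capacitance * vdd ** 2

# Hypothetical operating point: 20% activity, 50 MHz, 1 nF switched capacitance, 3.3 V.
p_full = dynamic_power(alpha=0.2, f_clk=50e6, capacitance=1e-9, vdd=3.3)
print(f"{p_full:.4f} W")  # prints 0.1089 W

# Halving Vdd divides dynamic power by four, since power scales with Vdd squared.
print(dynamic_power(1.0, 1.0, 1.0, 1.0) / dynamic_power(1.0, 1.0, 1.0, 2.0))  # prints 0.25
```

The quadratic dependence on Vdd is why any performance headroom gained elsewhere can be traded for a lower supply voltage and a disproportionate energy saving.
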
41
Results
  • A set of algorithms was executed on the
    architectures:
  • Sine calculation
  • Bubble sort
  • Selection sort
  • Quicksort (10 and 100 elements)
  • Binary search
  • Sequential search
  • IMDCT (plus three unrolled versions)
  • Floating-point sum emulation
  • Full MP3 player

42
Performance
[Figure: performance results]

44
Performance
The same number of different sequences of
instructions

45
Performance
Parallelism exposed by loop unrolling

47
Performance
No more parallelism available!

49
Performance
There is room for improvement!

50
Performance
[Figure: performance results]

51
Energy in memory accesses
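[Figure: energy in memory accesses]

52
Energy in the cores
[Figure: energy in the cores]

53
Total Energy Consumption
[Figure: total energy consumption]

54
Area
[Figure: area results; label: VLIW 2]

55
Final Results
[Figure: final results; label: VLIW 2]
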
56
Conclusions
  • With BT, a reconfigurable array and Java, we
    achieve at the same time:
  • The Java concept of write once, run everywhere
  • Software portability across different machines
  • Performance
  • Low energy consumption
  • thanks to combinational circuits and parallelism
  • Vdd can still be reduced
  • HW upgrades with SW compatibility

57
Future Work (I)
  • Use Binary Translation with CMP
  • Detect at run time which core is best suited to
    execute the software at a given moment

58
Future Work (II)
  • Implement BT and the reconfigurable array on
    traditional RISC machines
  • What are the implementation differences?

59
The end...
  • Questions?
  • carro_at_inf.ufrgs.br
  • caco_at_inf.ufrgs.br
