Title: Programmable CPU cores
1Programmable CPU cores
- introduction
- architecture of the MIPS core
- discussed as an example
- pipelining
- application examples
- software issues
- comparison between different CPU cores
- towards application specific architectures
- discussion
2Introduction
- rationale: as high a multiplex factor R as possible
- consequence: often manual, handcrafted design optimised for clock rate
- problem: fast changes in the IC process technology
- examples (embedded):
  - MIPS (first one, licensing the instruction set architecture)
  - ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one, licensing also the micro-architecture as hard or soft IP)
  - Sparc
- derivatives from general purpose CPUs:
  - Intel, NEC, Hitachi, National, PowerPC
3Introduction
Instruction set architectures
implicit operands
explicit operands
4Introduction
C = A + B
5Architecture of the MIPS core
Hennessy Patterson
6MIPS instruction formats ( 32 bits )
Hennessy Patterson
- op: operation of the instruction
- rs, rt, rd: source and destination registers
- shamt: shift amount
- funct: operation of the instruction, part 2
- imm: for program constants
- addr: target address of a jump
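The field list above can be made concrete with a small decoder. This is a minimal sketch, assuming the standard MIPS32 field widths (6-bit op, 5-bit rs/rt/rd/shamt, 6-bit funct, 16-bit imm, 26-bit addr). The function name `decode` is illustrative, and treating every opcode other than 0, 2 and 3 as I-type is a simplification.

```python
def decode(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into its named fields."""
    word &= 0xFFFFFFFF
    op = (word >> 26) & 0x3F            # bits 31..26
    fields = {"op": op}
    if op == 0:                          # R-type: op rs rt rd shamt funct
        fields.update(
            rs=(word >> 21) & 0x1F,
            rt=(word >> 16) & 0x1F,
            rd=(word >> 11) & 0x1F,
            shamt=(word >> 6) & 0x1F,
            funct=word & 0x3F,
        )
    elif op in (2, 3):                   # J-type (j, jal): op addr
        fields.update(addr=word & 0x03FFFFFF)
    else:                                # simplification: everything else is I-type
        fields.update(
            rs=(word >> 21) & 0x1F,
            rt=(word >> 16) & 0x1F,
            imm=word & 0xFFFF,
        )
    return fields


r_type = decode(0x02324020)   # add $8, $17, $18
i_type = decode(0x8D280004)   # lw $8, 4($9)
j_type = decode(0x08100000)   # j with a 26-bit target field
```

The same 32-bit word is read three different ways depending only on op, which is why the op field sits in the same position in all three formats.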
7Example 1 R - type add instruction
Hennessy Patterson
8Critical path R-type operation
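Hennessy Patterson
[Figure: single-cycle datapath. PC -> Instruction address -> Instruction Memory -> Instruction; instruction fields Rd, Rt, Rs (5 bits each) and Imm (16 bits); register file of 32 32-bit registers with ports Rw, Ra, Rb; Data Memory with Data address, Data in and Data out (32 bits); state elements clocked by Clk]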
9Critical path R-type operation
[Timing diagram: after each delay the listed signal changes from its old value to its new value]
- Clock-to-Q: PC
- Instruction memory access time: Rs, Rt, Rd, op, funct
- RFile access time: Bus A, B
- ALU delay: Bus W
- Set-up and skew: write into RFile
10Example 2 I-type load word
Hennessy Patterson
- lw rs, rt, imm16
- mem[PC]
- addr <- R[rs] + ext(imm16)
- R[rt] <- mem[addr]
- PC <- PC + 4
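The register transfers of the load can be traced in a few lines of Python. This is only a sketch under assumptions: `ext` is taken to be sign extension (as it is for MIPS loads), memory is modelled as a dictionary keyed by byte address, and the names `sign_extend16` and `execute_lw` are illustrative.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_lw(regs, mem, pc, rs, rt, imm16):
    """addr <- R[rs] + ext(imm16); R[rt] <- mem[addr]; return PC + 4."""
    addr = (regs[rs] + sign_extend16(imm16)) & 0xFFFFFFFF
    regs[rt] = mem[addr]
    return (pc + 4) & 0xFFFFFFFF


regs = [0] * 32
regs[9] = 0x1000                 # base address in R[rs]
mem = {0x0FFC: 42}               # one word stored just below the base
next_pc = execute_lw(regs, mem, pc=0x400000, rs=9, rt=8, imm16=0xFFFC)
```

Here imm16 = 0xFFFC is the offset -4, so the word at 0x0FFC ends up in register 8.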
11Critical path load operation
[Timing diagram: after each delay the listed signal changes from its old value to its new value]
- Clock-to-Q: PC
- Instruction memory access time: Rs, Rt, Rd, op, funct
- RFile access time: Bus A, B
- ALU delay: address
- Mem access time: Bus W
- Set-up and skew
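The two critical paths differ only in the data memory access. The following sketch adds up the delays along each path. All delay values are hypothetical and chosen only for illustration; the stage names follow the timing diagrams.

```python
# Hypothetical delays in ns (illustration only, not taken from the slides).
DELAY_NS = {
    "clock_to_q": 1,
    "instruction_memory": 10,
    "rfile_access": 5,
    "alu": 8,
    "data_memory": 10,
    "setup_and_skew": 2,
}

R_TYPE_PATH = ["clock_to_q", "instruction_memory", "rfile_access",
               "alu", "setup_and_skew"]
LOAD_PATH = ["clock_to_q", "instruction_memory", "rfile_access",
             "alu", "data_memory", "setup_and_skew"]


def path_delay(path, delay=DELAY_NS):
    """Sum the delays of all elements along one path."""
    return sum(delay[element] for element in path)


r_type_ns = path_delay(R_TYPE_PATH)
load_ns = path_delay(LOAD_PATH)
# In a single-cycle design the clock period must cover the slowest instruction.
min_clock_period_ns = max(r_type_ns, load_ns)
```

With these assumed numbers the load needs 36 ns against 26 ns for the R-type instruction, so every instruction pays for the load.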
12Example 3 I-type branch
Hennessy Patterson
- beq rs, rt, imm16
- mem[PC]
- cond <- R[rs] - R[rt]
- if (cond == 0)
-   PC <- PC + 4 + ext(imm16) * 4
- else
-   PC <- PC + 4
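The next-PC selection of the branch can be sketched as follows, assuming `ext` is sign extension so that backward branches are possible. The function names are illustrative.

```python
def sign_extend16(imm16: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(regs, pc, rs, rt, imm16):
    """Return the next PC for beq rs, rt, imm16."""
    cond = (regs[rs] - regs[rt]) & 0xFFFFFFFF
    if cond == 0:
        # Taken: the offset counts instructions, hence the factor 4.
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


regs = [0] * 32
regs[1] = regs[2] = 7
regs[3] = 1
taken = next_pc_beq(regs, pc=0x100, rs=1, rt=2, imm16=0xFFFE)      # offset -2
not_taken = next_pc_beq(regs, pc=0x100, rs=1, rt=3, imm16=0xFFFE)
```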
13Example 3 I-type branch
Hennessy Patterson
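[Figure: datapath for the branch. Register file of 32 32-bit registers with ports Rw, Ra, Rb, inputs Rs, Rt, Rd (5 bits each) and Bus W, outputs BusA and BusB (32 bits); Extender from Imm16 to 32 bits; ALU with Zero output; Next Address Logic driving the PC, to Instruction Memory; control signals RegDst (dc (Rt)), RegWr, Branch, ALUctr, ALUSrc, ExtOp; Clk]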
14Example 3 I-type branch
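Hennessy Patterson
[Figure: next address logic. 30-bit PC clocked by Clk; Addr<31:2> from the PC and Addr<1:0> = "00" address the Instruction Memory; Imm16 = Instruction<15:0> passes through SignExt to 30 bits; adders and a multiplexer (inputs 0 and 1) controlled by Branch and Zero; Instruction<31:0> out]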
15Example 3 I-type branch
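Hennessy Patterson
[Figure: next address logic, variant with the adder carry-in set to 1 (c_in). 30-bit PC clocked by Clk; Addr<31:2> and Addr<1:0> = "00" address the Instruction Memory; Imm16 = Instruction<15:0> passes through SignExt; multiplexer (inputs 0 and 1) controlled by Branch and Zero; Instruction<31:0> out]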
16Architecture of the MIPS core
- problem: long critical path
- defined by the slowest instruction (load)
- solution?
- pipelining
- break the instruction into smaller steps
- all steps have about the same critical path
17Pipelining lw instructions
Hennessy Patterson
      cycle 1  cycle 2  cycle 3  cycle 4  cycle 5   cycle 6   cycle 7
lw    Ifetch   RF read  ALU      dmem     RF write
lw             Ifetch   RF read  ALU      dmem      RF write
lw                      Ifetch   RF read  ALU       dmem      RF write
- One instruction enters the pipeline every clock cycle
- One instruction leaves the pipeline every clock cycle
- => CPI = 1 (Cycles per Instruction)
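The overlap of back-to-back instructions in the five stages can be generated programmatically. The sketch below assumes an ideal pipeline without stalls. The helper names `schedule`, `total_cycles` and `cpi` are illustrative.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]


def schedule(n):
    """Map each cycle (starting at 1) to the (instruction, stage) pairs active in it."""
    table = {}
    for i in range(n):                       # instruction i enters in cycle i + 1
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s + 1, []).append((i, stage))
    return table


def total_cycles(n):
    """Cycles needed to complete n instructions: fill time plus one per instruction."""
    return n + len(STAGES) - 1


def cpi(n):
    """Average cycles per instruction over a run of n instructions."""
    return total_cycles(n) / n


three_loads = schedule(3)
```

Three instructions take 7 cycles. For long runs the four fill cycles are amortised and the CPI approaches 1, for example cpi(1000) is 1.004.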
18Pipelining lw instructions
[Figure: successive lw instructions overlapping in the pipeline stages I, R, A, M, W; labels: Instructions, Data, Current CPU cycle]
19The 4 stages of an R-type instruction
Hennessy Patterson
           cycle 1  cycle 2  cycle 3  cycle 4
E.g. ADD   Ifetch   RF read  ALU      RF write
20Pipelining lw and R-type instructions
Hennessy Patterson
      cycle 1  cycle 2  cycle 3  cycle 4  cycle 5
lw    Ifetch   RF read  ALU      dmem     RF write
add            Ifetch   RF read  ALU      RF write
(lw and add both reach RF write in cycle 5)
21Solution: stretch R-type to 5 stages
Hennessy Patterson
- Ifetch, RF read, ALU, dmem, RF write
- dmem is a dummy op (noop)
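22[Figure (Hennessy Patterson): pipelined datapath with the stages Ifetch, Reg/dec, exec, mem, wr. Blocks: Next PC logic (branch, constant 4), Prog mem, Rfile (Ra, Rb, Rw, Di; inputs Rs, Rt, Rd; outputs BusA, BusB), ext. of Imm16, ALU with flags, Data mem (adr, Din, Dout). Control signals: RegWr, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg]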
23Data dependencies R-type instructions
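Hennessy Patterson
[Figure: pipeline diagram of five consecutive R-type instructions involving R1]
24Data dependencies R-type instructions
Hennessy Patterson
[Figure: pipeline diagram of five consecutive R-type instructions involving R1]
Solution: bypasses
25Bypasses
Hennessy Patterson
[Figure: datapath with bypasses; labels: adr, Data mem]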
26Data dependencies load instruction
Hennessy Patterson
[Figure: pipeline diagram of a lw into R1 followed by three instructions involving R1]
27Data dependencies load instruction
Hennessy Patterson
[Figure: pipeline diagram of a lw into R1 followed by three instructions involving R1]
Bypass is no solution for the instruction that directly follows the lw
28Data dependencies load instruction
Hennessy Patterson
[Figure: pipeline diagram of a lw into R1 followed by three instructions involving R1; stage labels IM, RF, DM, RF]
Solution: pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared
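The detection part of such an interlock can be sketched as a scan over adjacent instructions. This is only an illustration under assumptions: a five-stage pipeline with full bypassing, where only a load directly followed by a consumer of its result costs one stall cycle. The `Instr` type and the function name are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[int]       # destination register number, None if no result
    srcs: Tuple[int, ...]     # source register numbers


def count_load_use_stalls(program):
    """Count stall cycles caused by a load directly followed by a consumer."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "lw" and prev.dest is not None and prev.dest in cur.srcs:
            stalls += 1       # assumed penalty: one cycle per load-use pair
    return stalls


dependent = [Instr("lw", 10, (2,)), Instr("add", 8, (9, 10))]
separated = [Instr("lw", 10, (2,)), Instr("sub", 5, (6, 7)), Instr("add", 8, (9, 10))]
```

Placing an independent instruction between the load and its consumer removes the stall, which is what a scheduling compiler tries to do.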
29Instructions:
i1) lw r10, r2, r0
i2) add r8, r9, r10
[Figure: i1 and i2 in the pipeline stages I, R, A, M, W; i2 is held in R (interlocked) until the data is available from the data cache]
30Instructions:
i1) MULT r3, r2, r1
i2) ADD r5, r4, r3
[Figure: i1 and i2 in the pipeline stages I, R, A, M, W; i2 is held in R (interlocked)]
31Control hazards
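[Figure (Hennessy Patterson): pipelined datapath with branch and Next PC logic (constant 4), Prog mem, Rfile (Ra, Rb, Rw, Di; inputs Rs, Rt, Rd; outputs BusA, BusB), ext. of Imm16, ALU with flags, Data mem (adr, Din, Dout)]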
32Control hazards
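[Figure (Hennessy Patterson): the same pipelined datapath with an added block labelled "0?"; branch and Next PC logic (constant 4), Prog mem, Rfile (Ra, Rb, Rw, Di; inputs Rs, Rt, Rd; outputs BusA, BusB), ext. of Imm16, ALU with flags, Data mem (adr, Din, Dout)]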
33Control hazards
i1) beq r10, r2, 1b
i2) nop/independent instructions
i3) add r8, r9, r10
[Figure: pipeline diagram of i1, i2 and i3, marking when the address is available for instr. fetch]
Solution: compiler action, possibly filling the branch delay slot
34PR3930 CPU
35TCP chip TV controller
- PR3930 peripherals
- Gfx, SDRAM controller,
- Serial interconnect bus,
- I2C, UART, timers
- PI bus architecture
- 80 mm2
- 352 pins
- 0.35 micron process
- 48 MHz (96 for gfx)
36Programmable CPU cores
- introduction
- architecture of the MIPS core
- discussed as an example
- pipelining
- application examples
- software issues
- comparison between different CPU cores
- towards application specific architectures
- discussion
37Application examples (1)
38Application examples (1)
39Application examples (2)
Bit level operations: finite field arithmetic
40Application examples (2)
Bit level operations: DES example
41Application examples (2)
Bit level operations: A5 example (GSM encryption)
42Application examples (3)
Video conferencing: H.263
- CIF format: 352 x 288 px, 2:1:1, 8 bits/sample
- QCIF: 1/4 CIF
- SQCIF: 96 x 128
- Process: 0.25 micron
- Power consumption: 100 mW @ 10 Hz
- 96 x 128 x 1.5 x 10 Hz = 180 KB/s
- factor 72 -> 20 Kb/s
- Compare: 852 x 576 x 16 b/p x 50 = 49 MB/s
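The data-rate figures on this slide can be rechecked with a few multiplications. In this sketch the factor 1.5 is read as bytes per pixel and 16 b/p as 2 bytes per pixel. Reading 72 as a reduction factor from the raw rate to the channel rate is an assumption, and so is counting a KB as 1024 bytes for the 180 KB/s figure.

```python
def raw_rate_bytes_per_s(width, height, bytes_per_pixel, frames_per_s):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_s


# SQCIF at 10 Hz with 1.5 bytes per pixel
sqcif_bytes = raw_rate_bytes_per_s(96, 128, 1.5, 10)
sqcif_kbytes = sqcif_bytes / 1024            # assumption: 1 KB = 1024 bytes

# Assumed reading of "72": raw rate in Kbit/s divided by 72
channel_kbits = sqcif_kbytes * 8 / 72

# Comparison case: 852 x 576 at 16 bits per pixel and 50 Hz
tv_bytes = raw_rate_bytes_per_s(852, 576, 2, 50)
tv_mbytes = tv_bytes / 1e6                   # here 1 MB = 10**6 bytes
```

Under these assumptions the raw SQCIF stream is 180 KB/s, the factor 72 brings it to 20 Kb/s, and the comparison stream is about 49 MB/s.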
43Application examples (3)
H.263 video encoder
44Application examples (3)
[Figure: PR3940 with blocks I and D, and memory]
10 Hz => 140 MHz CPU
45Application examples (3)
In which process can the H263 video encoder be
executed on a single MIPS processor ?
46Source code of func(), split by the parser into basic blocks (BB):
- BB1: a = x.value & 0x3; then branch on a != 0 or a == 0
- BB2 (a != 0): b = <expression in a, c, d>
- BB3 (a == 0): b = <...>
- BB4: y.post(b)
Compile each BB to instructions, e.g. BB1:
  ldi 0x3, R5
  and R4,R5,R6
  cmp R0,R6,R7
  br R7,true
  ba false
Arch. model: ldi = 2 cycles, nop = 1 cycle, ...
Generate new C with delay counts, then compile and run:
  func() {
    a = x.value & 0x3;                   DelayCycles(7);
    if (a != 0) { b = <expression in a, c, d>;  DelayCycles(8); }
    else        { b = <...>;             DelayCycles(5); }
    y.post(b);                           DelayCycles(4);
  }
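The delay count of a basic block is the sum of the cycle counts of its instructions under the architecture model. The sketch below illustrates this for BB1. Only the costs of ldi (2 cycles) and nop (1 cycle) come from the slide. The costs for and, cmp, br and ba are hypothetical and were picked so that BB1 adds up to the 7 cycles shown. The function names are illustrative.

```python
# Cycle cost per mnemonic. ldi and nop are from the slide; the rest are assumed.
CYCLES = {"ldi": 2, "nop": 1, "and": 1, "cmp": 1, "br": 2, "ba": 1}

BB1 = [
    "ldi 0x3, R5",
    "and R4,R5,R6",
    "cmp R0,R6,R7",
    "br R7,true",
    "ba false",
]


def block_delay(instructions, cycles=CYCLES):
    """Sum the cycle cost of every instruction in one basic block."""
    return sum(cycles[line.split()[0]] for line in instructions)


def annotate(statement, instructions):
    """Append the delay call for a basic block to its source statement."""
    return f"{statement} DelayCycles({block_delay(instructions)});"


annotated_bb1 = annotate("a = x.value & 0x3;", BB1)
```

The annotated source can then be compiled and run natively while still accumulating the cycle count of the target processor.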
47Comparison between different CPU cores
48Comparison between different CPU cores
http://bwrc.eecs.berkeley.edu/cic
49Comparison between different CPU cores
50Comparison between different CPU cores
51Comparison between different CPU cores
52Power Consumption in microprocessors
- Power consumption is (becoming) the limiting factor in processor design
- Solution in direction of:
  - Hardware acceleration
  - Instruction Level Parallelism instead of clock speed
  - Code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel
53Towards application specific architectures ConCISe
Bernardo Kastrup
54Towards application specific architectures
Example equation for one output bit (12) is shown!
55Towards application specific architectures
56Towards application specific architectures
ConCISe integrated tool-set
[Figure: tool flow. Tools: core compiler, hardware/software partitioning, translator, hardware compiler, assembler/linker. Data: source code, assembly code, profile data, modified assembly with ASIs, hardware partition, HDL file, hardware netlist, simulator executable. Decision: does it fit? Y/N]
57Towards application specific architectures ConCISe
Bernardo Kastrup
- Advantages: faster execution, smaller code size, lower power
- The Configurable Functional Unit (CFU) can be:
  - Standard cell
  - Field-Programmable Logic (FPL)
    - Considerably bigger in silicon (4 to 5 mm2 in C075)
    - But it is reconfigurable: reprogrammable for different application programs
58Some benchmarks
59Amdahl's law
- Impact of an improvement on the execution time of a program depends on 2 parameters:
  - f: fraction of the original computation time that is affected by the improvement
  - s: speedup factor (local)
- exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
- speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
- if s >> 1 then speedup_overall ≈ 1 / (1 - f)
- Example: 40% of program can be executed 10 x faster: speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
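The formula is easy to evaluate for other values of f and s. A minimal sketch, with the function name chosen after the symbol used on the slide:

```python
def speedup_overall(f: float, s: float) -> float:
    """Amdahl's law: overall speedup when a fraction f is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)


example = speedup_overall(0.4, 10)       # the example from the slide
limit = speedup_overall(0.4, 1e12)       # s >> 1: approaches 1 / (1 - f)
```

Even with an arbitrarily fast accelerator for 40% of the program, the overall speedup stays below 1 / 0.6, roughly 1.67.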
60Towards application specific architectures
www.tensilica.com
61Conclusions
- Programmable CPU cores are important for the control parts of the application.
- They are well supported with tools to support the development of end-user software (vs. deeply embedded sw).
- Keep it Simple heuristic (RISC vs. CISC):
  - Make frequent cases fast and rare cases correct.
  - Regular (orthogonal) instruction set
  - No special features that match a high level language construct.
  - At least 16 registers to ease register allocation.
- Embedded cores are often light cores which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores which are optimised for performance).