Programmable CPU cores - PowerPoint PPT Presentation

Title: Programmable CPU cores


1
Programmable CPU cores
  • introduction
  • architecture of the MIPS core (discussed as an example)
  • pipelining
  • application examples
  • software issues
  • comparison between different CPU cores
  • towards application specific architectures
  • discussion

2
Introduction
  • rationale: as high a multiplex factor R as possible
  • consequence: often manual, handcrafted design
    optimised for clock rate
  • problem: fast changes in the IC process
    technology
  • examples (embedded):
  • MIPS (first one; licenses the instruction set
    architecture)
  • ARM (Advanced RISC Machines; telecom, low power,
    small code size; most popular one; also licenses
    the micro-architecture as hard or soft IP)
  • Sparc
  • derivatives from general purpose CPUs:
    Intel, NEC, Hitachi, National, PowerPC

3
Introduction
Instruction set architectures
implicit operands
explicit operands
4
Introduction
C = A + B
5
Architecture of the MIPS core
Hennessy & Patterson
6
MIPS instruction formats (32 bits)
Hennessy & Patterson
  • op: operation of the instruction
  • rs, rt, rd: source and destination registers
  • shamt: shift amount
  • funct: operation of the instruction, part 2
  • imm: for program constants
  • addr: target address of a jump
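The field list can be made concrete with a small packing routine. This is a minimal sketch, assuming the standard R-type field widths of the 32-bit MIPS format (op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 bits) and the usual add encoding (op = 0, funct = 0x20); the function names and the register numbers in the example are illustrative and do not come from the slides.

```python
# Field widths of the 32-bit MIPS R-type format: op | rs | rt | rd | shamt | funct.
R_FIELDS = (("op", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6))


def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields into one 32-bit instruction word."""
    values = {"op": op, "rs": rs, "rt": rt, "rd": rd, "shamt": shamt, "funct": funct}
    word = 0
    for name, width in R_FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word):
    """Split a 32-bit word back into its R-type fields."""
    fields = {}
    shift = 32
    for name, width in R_FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# Illustrative example: add with rd = 8, rs = 9, rt = 10 (op = 0, funct = 0x20).
ADD_WORD = encode_r_type(op=0, rs=9, rt=10, rd=8, shamt=0, funct=0x20)
print(hex(ADD_WORD))
print(decode_r_type(ADD_WORD))
```

The widths add up to 32 bits, so every R-type instruction occupies exactly one word.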
7
Example 1: R-type add instruction
Hennessy & Patterson
8
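Critical path: R-type operation
Hennessy & Patterson
[Figure: single-cycle datapath with PC, instruction memory (instruction address, instruction), register file (ports Rw, Ra, Rb; 32 32-bit registers) and data memory (data address, data in, data out). Instruction fields Rd, Rt, Rs (5 bits each) and Imm (16 bits); 32-bit data buses; Clk on the PC, register file and data memory.]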
9
Critical path: R-type operation
[Figure: timing diagram showing, per step, the change from old value to new value: Clock; Clock-to-Q (PC); instruction memory access time (Rs, Rt, Rd, op, funct); RFile access time (Bus A, B); ALU delay (Bus W); setup and skew; write into RFile.]
10
Example 2: I-type load word
Hennessy & Patterson
  • lw rs, rt, imm16
  • mem[PC]
  • addr ← R[rs] + ext(imm16)
  • R[rt] ← mem[addr]
  • PC ← PC + 4
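The register transfers of the load (address = R[rs] + ext(imm16), R[rt] = mem[address], PC = PC + 4) can be traced on a toy machine state. This is a minimal sketch: the dictionary-based state, the function names and the example values are assumptions made for illustration, and ext is taken to be sign extension.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_lw(state, rs, rt, imm16):
    """Apply the register transfers of lw to a toy machine state."""
    addr = (state["R"][rs] + sign_extend16(imm16)) & 0xFFFFFFFF  # addr <- R[rs] + ext(imm16)
    state["R"][rt] = state["dmem"][addr]                         # R[rt] <- mem[addr]
    state["PC"] = (state["PC"] + 4) & 0xFFFFFFFF                 # PC <- PC + 4
    return state


# Hypothetical state: one data word at 0x1FFC, base address in register 2.
state = {"PC": 0x100, "R": [0] * 32, "dmem": {0x1FFC: 0xCAFE}}
state["R"][2] = 0x2000
execute_lw(state, rs=2, rt=10, imm16=0xFFFC)  # imm16 encodes an offset of -4
print(hex(state["R"][10]), hex(state["PC"]))
```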

11
Critical path: load operation
[Figure: timing diagram showing, per step, the change from old value to new value: Clock; Clock-to-Q (PC); instruction memory access time (Rs, Rt, Rd, op, funct); RFile access time (Bus A, B); ALU delay (address); memory access time (Bus W); setup and skew.]
12
Example 3: I-type branch
Hennessy & Patterson
  • beq rs, rt, imm16
  • mem[PC]
  • cond ← R[rs] - R[rt]
  • if (cond == 0)
  • PC ← PC + 4 + (ext(imm16) × 4)
  • else
  • PC ← PC + 4
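The next-address rule of the branch (subtract the two registers; if the result is zero, add the sign-extended immediate times 4 to PC + 4, otherwise continue at PC + 4) can be sketched as follows. The function names and the example register contents are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(pc, regs, rs, rt, imm16):
    """Return the next PC for beq rs, rt, imm16."""
    cond = (regs[rs] - regs[rt]) & 0xFFFFFFFF      # cond <- R[rs] - R[rt]
    if cond == 0:                                  # registers are equal: branch taken
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF                   # not taken: fall through


regs = [0] * 32
regs[1] = regs[2] = 7   # equal, so a branch on (1, 2) is taken
regs[3] = 1             # differs from regs[1], so a branch on (1, 3) is not taken
print(hex(next_pc_beq(0x100, regs, 1, 2, 3)))
print(hex(next_pc_beq(0x100, regs, 1, 3, 3)))
```

The immediate counts instructions, not bytes, which is why it is multiplied by 4 before being added.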

13
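Example 3: I-type branch
Hennessy & Patterson
[Figure: datapath for the branch. Register file (ports Rw, Ra, Rb; 32 32-bit registers) addressed by Rs, Rt, Rd (5 bits each), with Rt as don't care (dc) on the destination mux; BusA, BusB and Bus W are 32 bits wide; the 16-bit Imm passes through the extender; the ALU Zero output and the Branch signal feed the next address logic, which drives the PC to the instruction memory. Control signals: RegDst, Branch, ALUctr, RegWr, ALUSrc, ExtOp; Clk.]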
14
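Example 3: I-type branch
Hennessy & Patterson
[Figure: next address logic. A 30-bit PC supplies Addr<31:2> to the instruction memory, with Addr<1:0> fixed at 00; Instruction<15:0> (Imm 16) is sign-extended to 30 bits; adders and a 0/1 mux controlled by Branch and Zero select the next PC; Instruction<31:0> is the memory output; Clk.]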
15
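Example 3: I-type branch
Hennessy & Patterson
[Figure: next address logic, variant using an adder with carry-in (c_in = 1). A 30-bit PC supplies Addr<31:2> to the instruction memory, with Addr<1:0> fixed at 00; Instruction<15:0> (Imm 16) is sign-extended; a 0/1 mux controlled by Branch and Zero selects the value added to the PC; Instruction<31:0> is the memory output; Clk.]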
16
Architecture of the MIPS core
  • problem: long critical path
  • defined by the slowest instruction (load)
  • solution?
  • pipelining
  • break the instruction into smaller steps
  • all steps have about the same critical path
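The effect on the clock period can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions: the stage delays below are hypothetical values chosen for illustration (they are not from the slides), and register overhead such as clock-to-Q, setup and skew is ignored.

```python
# Hypothetical stage delays in picoseconds (assumed values, not from the slides).
STAGE_DELAY_PS = {"ifetch": 200, "rf_read": 100, "alu": 200, "dmem": 200, "rf_write": 100}

# Steps each instruction class needs; only the load uses the data memory.
STEPS = {
    "add": ["ifetch", "rf_read", "alu", "rf_write"],
    "lw": ["ifetch", "rf_read", "alu", "dmem", "rf_write"],
}


def path_delay(instr):
    """Total delay of one instruction when it executes in a single clock cycle."""
    return sum(STAGE_DELAY_PS[step] for step in STEPS[instr])


def single_cycle_period():
    """Without pipelining, the slowest instruction defines the clock period."""
    return max(path_delay(instr) for instr in STEPS)


def pipelined_period():
    """With pipelining, the slowest single step defines the clock period."""
    return max(STAGE_DELAY_PS.values())


print(path_delay("add"), path_delay("lw"))
print(single_cycle_period(), pipelined_period())
```

With these assumed numbers the load sets the single-cycle period, while the pipelined period is set by the slowest step, which is why the steps should have about the same delay.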

17
Pipelining lw instructions
Hennessy & Patterson
      cycle 1  cycle 2  cycle 3  cycle 4  cycle 5   cycle 6   cycle 7
lw    Ifetch   RF read  ALU      dmem     RF write
lw             Ifetch   RF read  ALU      dmem      RF write
lw                      Ifetch   RF read  ALU       dmem      RF write
  • One instruction enters the pipeline every clock
    cycle
  • One instruction leaves the pipeline every clock
    cycle
  • => CPI = 1 (Cycles per Instruction)
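The staggered schedule of back-to-back lw instructions, and why the CPI approaches 1, can be reproduced with a short script. This is a minimal sketch; the function names are illustrative, and it models an ideal pipeline without hazards or stalls.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]


def schedule(n_instructions):
    """Map each instruction to {cycle: stage}; instruction i is issued in cycle i + 1."""
    return [
        {issue + offset + 1: stage for offset, stage in enumerate(STAGES)}
        for issue in range(n_instructions)
    ]


def total_cycles(n_instructions):
    """Cycles until the last instruction has left the pipeline."""
    return n_instructions + len(STAGES) - 1


def cpi(n_instructions):
    """Average cycles per instruction, including the pipeline fill time."""
    return total_cycles(n_instructions) / n_instructions


# Print the three-instruction diagram, one row per lw.
for row in schedule(3):
    cells = [row.get(cycle, "").ljust(9) for cycle in range(1, total_cycles(3) + 1)]
    print("lw  " + "".join(cells).rstrip())
print(cpi(3), cpi(1000))
```

Three instructions need 7 cycles because of the fill time; for long instruction streams the fill time is amortised and the CPI tends to 1.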

18
Pipelining lw instructions
[Figure: instructions and data moving through the stages I, R, A, M, W, with the current CPU cycle marked.]
19
4 stages of R-type instruction
Hennessy & Patterson
            cycle 1  cycle 2  cycle 3  cycle 4
E.g. ADD    Ifetch   RF read  ALU      RF write
20
Pipelining lw and R-type instructions
Hennessy & Patterson
      cycle 1  cycle 2  cycle 3  cycle 4  cycle 5
lw    Ifetch   RF read  ALU      dmem     RF write
add            Ifetch   RF read  ALU      RF write
Both instructions reach RF write in cycle 5.
21
Solution: stretch R-type to 5 stages
Hennessy & Patterson
R-type    Ifetch   RF read  ALU      dmem     RF write
The dmem stage is a dummy op (noop) for R-type instructions.
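Why the extra stage helps can be checked mechanically: with a 4-stage R-type instruction issued one cycle after an lw, both reach the register-file write in the same cycle, whereas a 5-stage R-type instruction writes one cycle later. The sketch below assumes one instruction is issued per cycle and a single register-file write port; the function name is illustrative.

```python
LW = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]
R_TYPE_4 = ["Ifetch", "RF read", "ALU", "RF write"]
R_TYPE_5 = ["Ifetch", "RF read", "ALU", "noop", "RF write"]  # dummy op in the dmem slot


def write_port_conflicts(program):
    """Return the cycles in which more than one instruction is in 'RF write'.

    program is a list of stage lists; instruction i is issued in cycle i + 1.
    """
    writers = {}
    for issue, stages in enumerate(program):
        cycle = issue + 1 + stages.index("RF write")
        writers[cycle] = writers.get(cycle, 0) + 1
    return sorted(cycle for cycle, count in writers.items() if count > 1)


print(write_port_conflicts([LW, R_TYPE_4]))  # lw followed by a 4-stage add
print(write_port_conflicts([LW, R_TYPE_5]))  # lw followed by a stretched add
```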
22
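[Figure: pipelined datapath with the stages Ifetch, Reg/dec, exec, mem, wr. Blocks: Next PC logic (with +4 and branch), Prog mem, Rfile (Ra, Rb, Rw, Di; BusA, BusB), extender (Imm16), ALU (flags), Data mem (adr, Din, Dout). Register fields Rs, Rt, Rd. Control signals: RegWr, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg.]
Hennessy & Patterson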
23
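Data dependencies: R-type instructions
Hennessy & Patterson
[Figure: pipeline diagram of five consecutive R-type instructions that involve R1.]
24
Data dependencies: R-type instructions
Hennessy & Patterson
[Figure: the same five instructions involving R1, with the dependencies resolved.]
Solution: bypasses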
25
Bypasses
Hennessy & Patterson
[Figure: pipelined datapath with bypass paths added, including the data memory (adr) path.]
26
Data dependencies: load instruction
Hennessy & Patterson
[Figure: pipeline diagram of an lw that loads R1, followed by three instructions involving R1.]
27
Data dependencies: load instruction
Hennessy & Patterson
[Figure: the same sequence with bypasses.]
Bypass is no solution for the instruction
immediately after the lw.
28
Data dependencies: load instruction
Hennessy & Patterson
[Figure: the same sequence, each instruction drawn as IM, RF, DM, RF.]
Solution: a pipeline interlock detects a data
hazard and stalls the pipeline until the hazard
is cleared
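The interlock decision can be sketched as a check between neighbouring instructions. This is a simplified model, not the PR39xx or textbook hardware: it assumes bypasses are present, so that only an instruction that reads a register loaded by the immediately preceding lw needs a stall, and it assumes a single stall cycle. The Instr tuple and the function name are hypothetical.

```python
from collections import namedtuple

# op: mnemonic, dest: register written, srcs: registers read.
Instr = namedtuple("Instr", "op dest srcs")
STALL = Instr("stall", None, ())


def insert_load_use_stalls(program):
    """Insert one stall after each lw whose result the next instruction reads."""
    result = []
    for index, instr in enumerate(program):
        result.append(instr)
        following = program[index + 1] if index + 1 < len(program) else None
        if instr.op == "lw" and following is not None and instr.dest in following.srcs:
            result.append(STALL)  # the loaded value is not available in time
    return result


dependent = [Instr("lw", "r10", ("r2", "r0")), Instr("add", "r8", ("r9", "r10"))]
independent = [Instr("lw", "r10", ("r2", "r0")), Instr("add", "r8", ("r9", "r11"))]
print([i.op for i in insert_load_use_stalls(dependent)])
print([i.op for i in insert_load_use_stalls(independent)])
```

A compiler can often avoid the stall by moving an independent instruction between the load and its first use.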
29
Instructions:
  i1) lw r10, r2, r0
  i2) add r8, r9, r10
[Figure: pipeline diagram with the stages I, R, A, M, W. i2 is interlocked in its R stage until the data of i1 is available from the data cache.]
30
Instructions:
  i1) MULT r3, r2, r1
  i2) ADD r5, r4, r3
[Figure: pipeline diagram with the stages I, R, A, M, W. i2 is interlocked in its R stage.]
31
Control hazards
[Figure: pipelined datapath with Next PC logic (+4, branch), Prog mem, Rfile (Ra, Rb, Rw, Di; BusA, BusB), extender (Imm16), ALU (flags), Data mem (adr, Din, Dout); register fields Rs, Rt, Rd.]
Hennessy & Patterson
32
Control hazards
[Figure: the same datapath with a compare-with-zero block (0?) feeding the Next PC logic.]
Hennessy & Patterson
33
Control hazards
  i1) beq r10, r2, 1b
  i2) nop / independent instructions
  i3) add r8, r9, r10
[Figure: pipeline diagram of i1 to i3, marking the point where the address is available for instruction fetch.]
Solution: compiler action, possibly filling the
branch delay slot
34
PR3930 CPU
35
TCP chip TV controller
  • PR3930 peripherals
  • Gfx, SDRAM controller,
  • Serial interconnect bus,
  • I2C, UART, timers
  • PI bus architecture
  • 80 mm²
  • 352 pins
  • 0.35 micron process
  • 48 MHz (96 for gfx)

36
Programmable CPU cores
  • introduction
  • architecture of the MIPS core (discussed as an example)
  • pipelining
  • application examples
  • software issues
  • comparison between different CPU cores
  • towards application specific architectures
  • discussion

37
Application examples (1)
38
Application examples (1)
39
Application examples (2)
Bit level operations: finite field arithmetic
40
Application examples (2)
Bit level operations: DES example
41
Application examples (2)
Bit level operations: A5 example (GSM encryption)
42
Application examples (3)
Video conferencing: H.263
  • CIF format: 352 × 288 px, 2:1:1, 8 bits/sample
  • QCIF = 1/4 CIF
  • SQCIF: 96 × 128
  • Process: 0.25 micron
  • Power consumption: 100 mW @ 10 Hz
  • 96 × 128 × 1.5 × 10 Hz = 180 KB/s
  • ÷ 72 → 20 Kb/s
  • Compare: 852 × 576 × 16 b/p × 50 = 49 MB/s
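The data rates on this slide can be re-derived with a few multiplications. The sketch below assumes the figures read as 96 × 128 pixels at 1.5 bytes per pixel and 10 frames per second, a reduction by a factor of 72, and 852 × 576 pixels at 16 bits per pixel and a rate of 50 for the comparison; it also assumes 1 KB = 1024 bytes. The function and variable names are illustrative.

```python
def raw_rate_bytes_per_s(width, height, bytes_per_pixel, frames_per_s):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * frames_per_s


# Small video-conferencing format at 10 frames per second.
sqcif_bytes = raw_rate_bytes_per_s(96, 128, 1.5, 10)
sqcif_kbytes = sqcif_bytes / 1024

# Bit rate after reducing the raw stream by a factor of 72.
compressed_kbits = sqcif_kbytes * 8 / 72

# Comparison case: 16 bits (2 bytes) per pixel at a rate of 50 per second.
compare_bytes = raw_rate_bytes_per_s(852, 576, 2, 50)

print(sqcif_kbytes, "KB/s raw")
print(compressed_kbits, "Kb/s compressed")
print(round(compare_bytes / 1e6, 1), "MB/s for the comparison case")
```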
43
Application examples (3)
H.263 video encoder
44
Application examples (3)
[Figure: PR3940 with I and D caches, connected to memory.]
10 Hz => 140 MHz CPU
45
Application examples (3)
In which process can the H.263 video encoder be
executed on a single MIPS processor?
46
  • source: func() { a = x.value & 0x3; if (a != 0) b = a … c … d; else b = …; y.post(b); }
  • parser: splits the source into basic blocks (BB)
  • BB1: a = x.value & 0x3, then branch on a == 0 or a != 0
  • BB2: b = a … c … d
  • BB3: b = …
  • BB4: y.post(b)
  • compile each BB to instructions, e.g.
    ldi 0x3, R5; and R4,R5,R6; cmp R0,R6,R7; br R7,true; ba false
  • Arch. model: ldi 2 cycles, nop 1 cycle, ...
  • generate new C with delay counts:
    func() { a = x.value & 0x3; DelayCycles(7);
    if (a != 0) { b = a … c … d; DelayCycles(8); }
    else { b = …; DelayCycles(5); }
    y.post(b); DelayCycles(4); }
  • compile and run
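The annotation step (look up the cycle cost of each instruction of a basic block in the architecture model, sum the costs, and emit a DelayCycles call) can be sketched as below. Only the costs for ldi (2 cycles) and nop (1 cycle) come from the slide; the costs for and, cmp, br and ba are hypothetical and were picked so that the first block sums to the 7 cycles shown on the slide. The function names are illustrative.

```python
# Architecture model: cycles per instruction mnemonic.
# ldi and nop are from the slide; the other entries are assumed values.
CYCLES = {"ldi": 2, "nop": 1, "and": 1, "cmp": 1, "br": 2, "ba": 1}


def block_delay(instructions):
    """Sum the model cost of every instruction in one basic block."""
    return sum(CYCLES[line.split()[0]] for line in instructions)


def annotate(c_statement, instructions):
    """Append a DelayCycles call with the block's cycle count to its C source."""
    return f"{c_statement} DelayCycles({block_delay(instructions)});"


# Instruction sequence shown on the slide for the first basic block.
BB1 = ["ldi 0x3, R5", "and R4,R5,R6", "cmp R0,R6,R7", "br R7,true", "ba false"]
print(annotate("a = x.value & 0x3;", BB1))
```

Running the annotated C natively then gives timing estimates without simulating every instruction.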
47
Comparison between different CPU cores
48
Comparison between different CPU cores
http://bwrc.eecs.berkeley.edu/cic
49
Comparison between different CPU cores
50
Comparison between different CPU cores
51
Comparison between different CPU cores
52
Power Consumption in microprocessors
  • Power consumption is (becoming) the limiting
    factor in processor design
  • Solution in the direction of:
  • Hardware acceleration
  • Instruction Level Parallelism instead of clock
    speed
  • Code size efficiency

source: ISSCC 2001, Patrick Gelsinger, Intel
53
Towards application specific architectures: ConCISe
Bernardo Kastrup
54
Towards application specific architectures
Example equation for one output bit (12) is shown!
55
Towards application specific architectures
56
Towards application specific architectures
ConCISe integrated tool-set
[Figure: tool flow. Elements: source code, translator, core compiler, assembly code, profile data, hardware/software partitioning, hardware partition, hardware compiler, HDL file, hardware netlist, "Does it fit? Y/N", modified assembly with ASIs, assembler/linker, simulator executable.]
57
Towards application specific architectures: ConCISe
Bernardo Kastrup
  • Advantages: faster execution, smaller code size,
    lower power
  • The Configurable Functional Unit (CFU) can be:
  • Standard cell
  • Field-Programmable Logic (FPL)
  • Considerably bigger in silicon (4 to 5 mm² in
    C075)
  • But it is reconfigurable, i.e. reprogrammable for
    different application programs

58
Some benchmarks
59
Amdahl's law
  • Impact of an improvement on the execution time of
    a program depends on 2 parameters:
  • f: fraction of the original computation time
    that is affected by the improvement
  • s: speedup factor (local)
  • exec_time_new = exec_time_old * (1 - f) +
    exec_time_old * f / s
  • speedup_overall = exec_time_old /
    exec_time_new = 1 / (1 - f + f / s)
  • if s >> 1 then speedup_overall ≈ 1 / (1 - f)
  • Example: 40% of the program can be executed 10×
    faster: speedup_overall = 1 / (0.6 + 0.4 / 10)
    = 1.56
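The formula is easy to check numerically. This is a minimal sketch with f as the affected fraction and s as the local speedup factor; the function name is illustrative.

```python
def overall_speedup(f, s):
    """Amdahl's law: overall speedup for affected fraction f and local speedup s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be a fraction between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Example from the slide: 40% of the program runs 10 times faster.
print(round(overall_speedup(0.4, 10), 2))

# For very large s the result approaches the limit 1 / (1 - f).
print(overall_speedup(0.4, 1e9), 1 / (1 - 0.4))
```

Even an unbounded local speedup of 40% of the program yields less than a factor of 1.67 overall, which is why the unaffected fraction dominates.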

60
Towards application specific architectures
www.tensilica.com
61
Conclusions
  • Programmable CPU cores are important for the
    control parts of the application.
  • They are well supported by tools for developing
    end-user software (vs. deeply embedded
    software).
  • Keep it Simple heuristic (RISC vs. CISC)
  • Make frequent cases fast and rare cases correct.
  • Regular (orthogonal) instruction set
  • No special features that match a high level
    language construct.
  • At least 16 registers to ease register
    allocation.
  • Embedded cores are often light cores which are a
    compromise between performance, area and power
    dissipation (vs. stand-alone CPU cores, which
    are optimised for performance).