Title: PowerPoint Presentation
1Hardware platforms for Embedded computing
2The energy/flexibility conflict - Intrinsic Power Efficiency -
[Figure: power efficiency in Operations/Watt (MOPS/mW, log scale from 0.01 to 10) versus technology (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ). Labelled regions: Ambient Intelligence, hardwired muxed ASIC, DSP-ASIPs, Reconfigurable Computing, Processors (µPs).]
Necessary to optimize HW/SW, otherwise the price for software flexibility cannot be paid!
H. de Man, Keynote, DATE02; T. Claasen, ISSCC99
3Architectural Choices
Flexibility
1/Efficiency (power, speed)
4The Processor Design Space
[Figure: processor classes placed on Performance and Cost axes.]
- Embedded processors: application specific architectures for performance
- Microprocessors: performance is everything, software rules
- Microcontrollers: cost is everything
5Area of processor cores = Cost
Nintendo processor
Cellular phones
6Another figure of merit: Computation per unit area
Nintendo processor
Cellular phones
???
7Embedded vs. general-purpose processors
- Embedded processors may be optimized for a category of applications.
- Customization may be narrow or broad.
- We may judge embedded processors using different metrics:
- Code size.
- Memory system performance.
- Predictability.
8Microcontrollers
A single chip containing:
- CPU
- Memory: ROM, RAM
- I/O
- Subsystems: timers, counters, analog interfaces, I/O interfaces
9Microcontroller Architectures
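- Von Neumann Architecture: one memory (addresses 0 to 2^n) holds both program and data; the CPU accesses it over a single address bus and data bus.
- Harvard Architecture: separate program and data memories; the CPU reads the program memory over an address bus and fetch bus, and the data memory over its own address bus and data bus.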
10MCS-51 Family of Microcontrollers
- 8051 introduced by Intel in late 1970s
- Now produced by many companies in many variations
- The most popular microcontroller, with about 40% of market share
- 8-bit microcontroller
11Original 8051 Microcontroller
[Block diagram: blocks connected by an internal data bus.]
- 8051 CPU
- Oscillator and timing
- 4096 bytes program memory
- 128 bytes data memory
- Two 16-bit timer/event counters
- 64 KByte bus expansion control
- Programmable I/O
- Programmable serial port: full-duplex UART, synchronous shifter
- Interrupts: subsystem interrupts, external interrupts
- External connections: parallel ports, address/data bus, I/O pins, control, serial output, serial input
12Microcontrollers - MHS 80C51 as an example -
- 8-bit CPU optimised for control applications
- Extensive Boolean processing capabilities
- 64 k Program Memory address space
- 64 k Data Memory address space
- 4 k bytes of on chip Program Memory
- 128 bytes of on chip data RAM
- 32 bi-directional and individually addressable I/O lines
- Two 16-bit timers/counters
- Full duplex UART
- 6 sources/5-vector interrupt structure with 2 priority levels
- On chip clock oscillators
- Very popular CPU with many different variations
13RISC processors
- RISC generally means highly-pipelinable, one instruction per cycle.
- Pipelines of embedded RISC processors have grown over time:
- ARM7 has 3-stage pipeline.
- ARM9 has 5-stage pipeline.
- ARM11 has 8-stage pipeline.
[Figure: ARM11 pipeline, from ARM05.]
14RISC processor families
- ARM: ARM7 is relatively simple, no memory management; ARM11 has memory management, other features.
- MIPS: MIPS32 4K has 5-stage pipeline; 4KE family has DSP extension; 4KS is designed for security.
- PowerPC: 400 series includes several embedded processors; MPC7410 is two-issue machine; 970FX has 16-stage pipeline.
15DSP Applications
- Audio applications
- MPEG Audio
- Portable audio
- Digital cameras
- Wireless
- Cellular telephones
- Base station
- Networking
- Cable modems
- ADSL
- VDSL
16Another Look at DSP Applications
- High-end
- Wireless Base Station - TMS320C6000
- Cable modem
- gateways
- Mid-end
- Cellular phone - TMS320C540
- Fax/ voice server
- Low end
- Storage products - TMS320C27
- Digital camera - TMS320C5000
- Portable phones
- Wireless headsets
- Consumer audio
- Automobiles, toasters, thermostats, ...
Increasing Cost
Increasing volume
17DSP vs. General Purpose MPU
- The MIPS/MFLOPS of DSPs is speed of Multiply-Accumulate (MAC).
- DSPs are judged by whether they can keep the multipliers busy 100% of the time.
- The "SPEC" of DSPs is 4 algorithms:
- Infinite Impulse Response (IIR) filters
- Finite Impulse Response (FIR) filters
- FFT, and
- convolvers
- In DSPs, algorithms are king!
- Binary compatibility not an issue
- Software is not (yet) king in DSPs.
- People still write in assembly language for a product to minimize the die area for ROM in the DSP chip.
18Architectural Features of DSPs
- Data path configured for DSP
- Fixed-point arithmetic
- MAC: Multiply-accumulate
- Multiple memory banks and buses
- Harvard Architecture
- Multiple data memories
- Specialized addressing modes
- Bit-reversed addressing
- Circular buffers
- Specialized instruction set and execution control
- Zero-overhead loops
- Support for MAC
- Specialized peripherals for DSP
- THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE
DESIGN!!!
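Bit-reversed addressing, listed above among the specialized addressing modes, lets a radix-2 FFT walk its data in permuted order without extra index arithmetic in software. The following is a minimal Python sketch of the index mapping only; the function name `bit_reverse` is illustrative and is not taken from any DSP toolchain.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the current least significant bit
        index >>= 1
    return result


# Access order for an 8-point radix-2 FFT (3 address bits).
order = [bit_reverse(i, 3) for i in range(8)]
print(order)  # [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP address generation unit produces this sequence in hardware as a side effect of the address update, so the permutation costs no extra instructions.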
19Domain-oriented architectures
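Application: $y_j = \sum_{i=0}^{n-1} x_{j-i}\, a_i$
$\forall i,\ 0 \le i \le n-1:\quad y_i(j) = y_{i-1}(j) + x_{j-i}\, a_i$
Architecture: Example: Data path ADSP210x
- Parallelism
- Dedicated registers
MR := 0; A1 := 1; A2 := n-2; MX := x[n-1]; MY := a[0];
for (j := 1 to n) { MR := MR + MX*MY; MY := a[A1]; MX := x[A2]; A1++; A2-- }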
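The register-level loop on this slide can be traced in software. The sketch below is a plain Python rendering, assuming that MR is the accumulator, MX and MY are the multiplier input registers, and A1 and A2 are address registers; the function name `fir_output` and the bounds guard are additions for illustration (hardware would simply prefetch one element past the end).

```python
def fir_output(x, a):
    """Compute sum over i of x[n-1-i] * a[i], one multiply-accumulate per iteration."""
    n = len(a)
    mr = 0                    # MR: accumulator
    a1, a2 = 1, n - 2         # A1, A2: address registers
    mx, my = x[n - 1], a[0]   # MX, MY: multiplier inputs
    for _ in range(n):
        mr += mx * my         # MR := MR + MX*MY
        if a1 < n:            # guard added for Python list bounds only
            my = a[a1]        # MY := a[A1]
            mx = x[a2]        # MX := x[A2]
        a1 += 1               # A1++
        a2 -= 1               # A2--
    return mr


print(fir_output([1, 2, 3], [10, 20, 30]))  # 100
```

On the DSP the MAC, both loads and both address updates fit in one instruction, which is why the loop body costs a single cycle per tap.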
20DSP - Features (1)
- Multiply/accumulate (MAC) and zero-overhead loop (ZOL) instructions (as shown)
- Heterogeneous registers (as shown)
- Separate address generation units (AGUs) (as in ADSP 210x)
21DSP - Features (2)
- Modulo addressing: Am++ means Am := (Am+1) mod n (implements ring or circular buffer in memory)
[Figure: sliding window over signal x at times t1 and t2 = t1+1.
Memory at t = t1: .. x[n-2] x[n-1] x[0] x[1] ..
Memory at t2 = t1+1: .. x[n-3] x[n-2] x[n-1] x[n] x[1] ..]
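Modulo addressing can be mimicked in software to show how a sliding window is kept in a fixed block of memory. This is a minimal sketch; the class name `CircularBuffer` and its methods are hypothetical, and `am` stands in for the address register Am of the slide.

```python
class CircularBuffer:
    """Ring buffer of n samples updated with modulo addressing."""

    def __init__(self, n):
        self.n = n
        self.mem = [0] * n
        self.am = 0  # Am: address of the oldest sample, which is the next one overwritten

    def push(self, sample):
        self.mem[self.am] = sample          # overwrite the oldest sample
        self.am = (self.am + 1) % self.n    # Am := (Am + 1) mod n

    def window(self):
        """Return the n most recent samples, oldest first."""
        return [self.mem[(self.am + k) % self.n] for k in range(self.n)]


buf = CircularBuffer(4)
for sample in range(6):
    buf.push(sample)
print(buf.mem)       # [4, 5, 2, 3]
print(buf.window())  # [2, 3, 4, 5]
```

Only one memory word changes per new sample; the window slides because the address wraps, not because data is moved.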
22Multiple memory banks or memories
Simplifies parallel fetches
23Very long instruction word (VLIW) processors
Key idea: detection of possible parallelism to be done by compiler, not by hardware at run-time (inefficient).
VLIW: parallel operations (instructions) encoded in one long word (instruction packet), each instruction controlling one functional unit.
24The Texas Instruments TMS 320C6xx as an example
Bit in each instruction encodes end of parallel execution
[Figure: seven 32-bit instructions (bits 31 to 0), Instr. A to Instr. G, each tagged with its parallel bit: A 0, B 1, C 1, D 0, E 1, F 1, G 0.]
Instructions B, C and D use disjoint functional
units, cross paths and other data path resources.
The same is also true for E, F and G.
Parallel execution cannot span several packets.
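The grouping rule can be sketched in a few lines of Python. The sketch assumes that a parallel bit of 1 means "the next instruction executes in parallel with this one" and that a 0 closes the group; the bit values below are chosen to match the statement that B, C, D and E, F, G run in parallel, and the function name `execute_packets` is illustrative.

```python
def execute_packets(instructions):
    """Group (name, p_bit) pairs into sets of instructions issued together.

    p_bit == 1: the next instruction runs in parallel with this one.
    p_bit == 0: this instruction ends the current group.
    """
    packets, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:  # a group never continues into the next fetch packet
        packets.append(current)
    return packets


fetch_packet = [("A", 0), ("B", 1), ("C", 1), ("D", 0),
                ("E", 1), ("F", 1), ("G", 0)]
print(execute_packets(fetch_packet))
# [['A'], ['B', 'C', 'D'], ['E', 'F', 'G']]
```

Because the grouping is encoded by the compiler, the hardware only has to scan one bit per instruction instead of checking dependences at run time.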
25Partitioned register files
- Many memory ports are required to supply enough operands per cycle.
- Memories with many ports are expensive.
- ⇒ Registers are partitioned into (typically 2) sets, e.g. for TI C60x:
[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; address bus and data bus.]
26Instruction types are mapped to functional unit types
- There are 4 functional unit (FU) types:
- M: Memory Unit
- I: Integer Unit
- F: Floating-Point Unit
- B: Branch Unit
- Instruction types → corresponding FU type, except type A (mapping to either I or M functional units).
27Large number of delay slots, a problem of VLIW processors
[Figure: instruction packets in flight: add sub and or | sub mult xor div | ld st mv beq]
The execution of many instructions has been started before it is realized that a branch was required.
Nullifying those instructions would waste compute power.
⇒ Executing those instructions is declared a feature, not a bug.
⇒ How to fill all delay slots with useful instructions?
⇒ Avoid branches wherever possible.
28Predicated execution: Implementing IF-statements branch-free
- Conditional instruction [c] I consists of:
- condition c
- instruction I
c = true ⇒ I executed; c = false ⇒ NOP
29Predicated execution: Implementing IF-statements branch-free: TI C6x
Source: if (c) { a = x + y; b = x + z; } else { a = x - y; b = x - z; }
Conditional branch (max. 12 cycles):
    [c] B L1
        NOP 5
        B L2
        NOP 4
        SUB x,y,a
        SUB x,z,b
L1:     ADD x,y,a
        ADD x,z,b
L2:
Predicated execution (1 cycle):
    [c]  ADD x,y,a
 || [c]  ADD x,z,b
 || [!c] SUB x,y,a
 || [!c] SUB x,z,b
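The equivalence of the two versions can be checked in software. The sketch below is a Python stand-in, not C6x code: all four operations are computed and the predicate selects which results are kept, whereas real predicated hardware suppresses the write-back of nullified instructions. The function names `branchy` and `predicated` are illustrative.

```python
def branchy(c, x, y, z):
    """Reference version with an explicit branch."""
    if c:
        a, b = x + y, x + z
    else:
        a, b = x - y, x - z
    return a, b


def predicated(c, x, y, z):
    """Branch-free version: every operation is issued, the predicate picks the results."""
    p = int(bool(c))                 # predicate as 0 or 1
    add_a, add_b = x + y, x + z      # [c]  ADD x,y,a and [c]  ADD x,z,b
    sub_a, sub_b = x - y, x - z      # [!c] SUB x,y,a and [!c] SUB x,z,b
    a = p * add_a + (1 - p) * sub_a
    b = p * add_b + (1 - p) * sub_b
    return a, b


print(predicated(True, 5, 2, 1))   # (7, 6)
print(predicated(False, 5, 2, 1))  # (3, 4)
```

The predicated form does more arithmetic but has no control transfer, so it leaves no delay slots to fill.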
30Architecture Evolution
- Roadmap continues: 90 → 65 → 45 nm
- Traditional Bus-based SoCs fit in one tile !!
- Communication demand is staggering, but unevenly
distributed, because of architectural
heterogeneity
31Multicores Are Here!
[Figure: # of cores per chip (log scale, 1 to 512) versus year (1970 to 20??). Processors plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon. Source: Amarasinghe06.]
32MPSoC 2005 ITRS roadmap
Martin06
33Power is the Challenge!
[Figure: Power (W) and Power Density (W/cm²) for a 10 mm die, scale 0 to 1400, across technology nodes 90nm, 65nm, 45nm, 32nm, 22nm, 16nm. Components: Active, SD Lkg, SiO2 Lkg.]
34Near Term Solutions
- Move away from Frequency alone to deliver performance
- More on-die memory
- Multi-everywhere
- Multi-threading
- Chip level multi-processing
- Throughput oriented designs
- Performance by higher level of integration
35µArchitecture Techniques
Increase on-die Memory
36Multi-Core
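[Figure: bar charts of power and performance (scale 1 to 4) comparing a large core with several small cores. Annotations for the small core: Power = 1/4, Performance = 1/2.]
Multi-Core: power efficient. Better power and thermal management.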
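The argument behind this slide is simple arithmetic, sketched below in Python. It assumes that the annotations mean a small core draws 1/4 of the power of the large core and delivers 1/2 of its performance, and that throughput scales perfectly with the number of cores; both are assumptions for illustration, as are the names `LARGE_CORE`, `SMALL_CORE` and `chip_totals`.

```python
from fractions import Fraction

# Values normalized to the large core (assumed reading of the figure annotations).
LARGE_CORE = {"power": Fraction(1), "performance": Fraction(1)}
SMALL_CORE = {"power": Fraction(1, 4), "performance": Fraction(1, 2)}


def chip_totals(core, count):
    """Total power and ideal throughput of `count` identical cores (perfect scaling assumed)."""
    return core["power"] * count, core["performance"] * count


power, throughput = chip_totals(SMALL_CORE, 4)
print(power, throughput)  # 1 2
```

Under these assumptions four small cores fit the power budget of one large core and deliver twice its throughput, provided the workload has enough parallelism to keep all four busy.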
37Embedded vs. General Purpose
- Server Applications
- Symmetric Multi-Processing
- Homogeneous cores
- General tasks known late
- Tasks run on any core
- High-performance, high-speed microprocessors
- Communication
- large coherent memory space on multi-core die or bus
- SMT programming models (Simultaneous Multi-Threading)
- Examples: large server chips (e.g. Sun Niagara, 8x4 threads), scientific multi-processors
- Embedded Applications
- Asymmetric Multi-Processing
- Differentiated Processors
- Specific tasks known early
- Mapped to dedicated processors
- Configurable and extensible processors: performance, power efficiency
- Communication
- Coherent memory
- Shared local memories
- HW FIFOS, other direct connections
- Dataflow programming models
- Classical example: smart mobile (RISC + DSP + media processors)
38MPSoC architectures
39Example system platforms
- Generic
- Automotive
- Wireless
- Multimedia
40PC-based platform
- Basic hardware components
- CPU
- memory
- timers
- DMA
- minimal I/O devices.
- Basic software
- BIOS.
41PC-style hardware architecture
[Block diagram components: CPU, memory, system bus, bus interface, DMA controller, timers, high-speed bus with I/O, bridge, low-speed bus with I/O.]
42Strong ARM
- StrongARM system includes
- CPU chip (3.686 MHz clock)
- system control module (32.768 kHz clock).
- Real-time clock
- operating system timer
- general-purpose I/O
- interrupt controller
- power manager controller
- reset controller.
43Pros and cons
- Plentiful hardware options.
- Simple programming semantics.
- Good software development environments.
- Performance-limited.
44TI Open Wireless Multimedia Applications Platform
- Dual-processor shared memory system
external memory
General-purpose processor
DSP
Mem ctrl
DSP task I/O ctrl
GPP OS
DSP manager
DSP OS
bridge
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
45TI OMAP Hardware platform
- ARM9 core
- 16KB I-cache
- 8KB D-cache
- 2-way set associative
- 150 MHz
- C55x DSP core
- 16KB I-cache
- 8KB RAM set
- 2-way set associative
- 200 MHz
[Block diagram components: RISC core with I-MMU, D-MMU, I-cache and D-cache; DSP core and application coprocessors with MMU, I-cache and internal RAM/ROM; DMA; memory traffic controller; program memory; SDRAM; peripherals (LCD controller, interrupt handlers, timers, GPIO, UARTs, ...).]
46OMAPI Standard (ST/TI)
- Goal: standardize the interfaces between application processor and peripheral devices in a mobile product
- Provide standard services (APIs) in the OS that can be used by application developers
47STMicro Nomadik platform
Main Core
I/Os
HW Accelerators
Memory System
48Nomadik SW platform
- Compliant with OMAPI standard
49Philips Digital Video Nexperia Platform
- TriMedia: scalable VLIW media processor
- 100 to 300 MHz
- 32-bit or 64-bit
- Nexperia system buses
- 32-128 bit
- MIPS: general-purpose scalable RISC processor
- 50 to 300 MHz
- 32-bit or 64-bit
- Library of device IP blocks
- Image coprocessors
- DSPs
- UART
- 1394
- USB
- and more
[Block diagram: DVP system silicon. Components: MIPS CPU (PRxxxx) and TriMedia CPU (TM-xxxx), each with I and D caches; MMI; SDRAM; DVP memory bus; two PI buses; device IP blocks.]
50Nexperia-DVP Software
- Nexperia-DVP Software Architecture
- Supports multiple OSs and middleware software
- Abstracts platform functionality via consistent APIs
- Nexperia-DVP Streaming Software
- Encapsulates implementation of streaming media components (hardware and software)
- Nexperia Platform Software
- OS independent device drivers for on-chip and off-chip devices
[Figure: software stack with the layers Applications; Middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); Kernel (pSOS, Win-CE, JavaOS); Streaming and Platform Software; Nexperia Hardware.]
51Infineon Automotive Platform
- Applications
- High performance drives / servo drives
- Industrial control, robotics
- Features
- 32-bit super-scalar TriCore™ V1.3 CPU, 4-stage pipeline
- Fully integrated DSP capabilities
- Single precision floating point unit (FPU)
- 80 MHz at full industrial temperature range
- 32-bit peripheral control processor with single cycle instruction (PCP2)
- Memories
- 1.5 MByte embedded program flash with ECC
- 32 KByte data flash (EEPROM emulation)
- 56 KB SRAM, 8 KB I, 16 KB Imem
- 8-channel DMA controller
- Interrupt system with 2 x 255 hardware priority arbitration levels serviced by CPU and PCP2 coprocessor
- Triple bus structure: 64-bit local memory buses to internal flash and data memory, 32-bit system peripheral bus, 32-bit remote peripheral bus
TC1166
52MOSAIC SW Architecture Components for Automotive Dashboard and Body Control
[Figure: layered software architecture.]
- Application Platform layer (@ 10% of total SW)
- SW Platform layer (> 60% of total SW)
- HW layer
- SW Platform Reuse: > 70% of total SW
- Components shown: Customer Libraries, Application Specific Software, OSEK RTOS, OSEK COM, CCP, KWP 2000, Transport, Application Programming Interface, I/O drivers & handlers (> 20 configurable modules), µControllers Library
53Architecture trends
High performance for narrow application field
Special Purpose processor
Dedicated hardware
DSP
Stream processor
Graphic processor
Multiple Cores
Network processor
Heterogeneous Multiprocessor
Programmable Hardware
Configurable Processor
FPGA → Reconfigurable systems
Dynamically Reconfigurable Processors
Special instructions
Tile Processor
Homogeneous Chip-multiprocessor
General purpose CPU
Multiple Cores
High performance for wide application field
54Task Specific (configurable) Processors
- RWTH AACHEN → Lisatek (CoWare)
- IMEC → Target Compiler T, ARM OptimoDE
- PHILIPS → Siliconhive
- TENSILICA, PicoChip
[Figure labels: DP, ISA. Courtesy: Target Compilers T]
55Parallelism at Three Levels in Extensible Instructions
- Three forms of instruction-set parallelism:
- Very Long Instruction Word (VLIW)
- Single Instruction Multiple Data (SIMD), aka vectors
- Fused operations, aka complex operations
Parallelism: L x M x N. Example: 3 x 4 x 3 = 36 ops/cycle
56Example: SAD (sum of absolute differences)
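A software reference for SAD makes it easier to see where the three forms of parallelism could apply. The sketch below is plain Python, not processor code. It assumes that L counts VLIW slots, M counts SIMD lanes and N counts the primitive operations fused into one instruction (subtract, absolute value, accumulate); that mapping and all names are assumptions for illustration.

```python
VLIW_SLOTS = 3   # L: operations issued per instruction word (assumed mapping)
SIMD_LANES = 4   # M: data elements processed per operation (assumed mapping)
FUSED_OPS = 3    # N: subtract, absolute value and accumulate fused together


def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length pixel sequences."""
    total = 0
    for pa, pb in zip(block_a, block_b):
        total += abs(pa - pb)  # one fused operation: subtract, abs, accumulate
    return total


ops_per_cycle = VLIW_SLOTS * SIMD_LANES * FUSED_OPS
print(sad([10, 20, 30, 40], [12, 18, 33, 40]))  # 7
print(ops_per_cycle)  # 36
```

An extensible processor would replace the loop body with one custom instruction working on several pixels at once, and issue several such instructions per cycle.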
57 Dynamically Reconfigurable Processors
- Reconfigurable systems → Previous lesson
- Flexible, but dynamic reconfiguration takes tens of milliseconds.
- Dynamically Reconfigurable Processors
- Improve area efficiency by changing hardware structure.
- IPs used in various SoCs.
- History
- Reconfigurable co-processors: Garp (1997), CHIMAERA (2000)
- Multicontext reconfigurable devices: WASMII (1992), Time-multiplexing FPGA (1997), PipeRench (1998), DRL (1998)
- Functional-level synthesis
- Various commercial products have been available since 2000
- IPFlex DAPDNA-2, NEC electronics DRP-1, PACT Xpp, Elixent DFabrix
- SONY's VME (Virtual Mobile Engine) is embedded in Network Walkman and PSP
- Recently, many Japanese vendors have started to develop commercial products
- Fujitsu
- Hitachi
- Lucent
- Sanyo
- Toshiba (MepD-Fabrix)
58What is Configurable Computing?
Spatially-programmed connection of processing elements.
Hardware customized to specifics of problem.
Direct map of problem specific dataflow, control.
Circuits adapted as problem requirements change.
59Spatial vs. Temporal Computing
Temporal
Spatial
60Processor vs. FPGA Area
61Processing Element
- Specialized for media/stream processing
- Coarse grain, in contrast to the fine-grain LUTs of FPGAs
- Components
- ALU
- Shifter, Mask unit
- Multiplexers
- Registers
- Operations and interconnection between components are changeable
- No instruction fetch mechanism: a part of a large datapath
62Reconfigurable HW (DSP fabric)
- Targets signal processing and arithmetic-intensive applications
- Reconfigurable array of simple DSP cores (CNode)
- Low power architecture
- Hierarchical clock gating
- Distributed leakage control (fine grain power gating)
- Programmable DMA engine
- Reconfigurable at run time, multi task
63Mapping Flow
[Figure: mapping flow from behavioral code (Procedure(In, Out, inout) ... Begin ... End) through a DFG and partitioning/static scheduling to a coarse grained configuration. Target: clusters of nodes (N0, N1, N2, each with _i and _o ports) joined by mux levels 0, 1 and 2, with data in and data out. ILP: software pipelining.]
- ALUs execute a cyclic micro-sequence
- Data exchanges through hierarchical clustered interconnect
- Configuration step is sequence loading and interconnect programming
64Mapping Flow
- 3D optimization problem (place/route/schedule)
- Traditional scheduling techniques for VLIW or clustered VLIW don't apply
- These solutions don't take into account the spatial dimension of the problem
- Traditional P&R used in FPGAs doesn't apply either, because it doesn't consider the time dimension
65Putting it all together
Year | 2004 | 2006 | 2008 | 2010 | 2012
Technology Node (nm) | 90 | 65 | 45 | 32 | 22
Loosely coupled Sub-Systems | 2 | 4 | 6 | 8 | 12
General Purpose CPU | Single / Multiple | Single / Multiple | Single / Multiple | Single / Multiple | Single / Multiple
Hardware Accelerator | Hardwired / Reconfigurable | Hardwired / Reconfigurable | Hardwired / Reconfigurable | Hardwired / Reconfigurable | Hardwired / Reconfigurable
- Constant SoC die size
- Slow evolution of peripherals (area decrease)
- GP CPU sub-system complexity: 2x at each node (constant area)
- Embedded memory capacity: 2x at each node (constant area)
- Loosely coupled DSP sub-system complexity increases by 30% at each node (30% area decrease)
66What can fit in 45mm² in 45nm
Programmable Multimedia Accelerator
Imaging H/W
192 CNode (40 GOPS)
Video H/W
Interconnect
4MB Multi-port Embedded Memory
L2
Peripherals analog
L1
L1
Host Core 2
Host Core 1