PowerPoint Presentation

Slides: 67

Provided by: con99

Transcript and Presenter's Notes

1
Hardware platforms for Embedded computing
2
The energy/flexibility conflict: Intrinsic Power Efficiency
[Figure: operations per watt (MOPS/mW, log scale from 0.01 to 10) versus technology (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ). Labels: Ambient Intelligence, hardwired muxed ASIC, DSP-ASIPs, Reconfigurable Computing, Processors, µPs.]
Necessary to optimize HW/SW, otherwise the price
for software flexibility cannot be paid!
H. de Man, Keynote, DATE02; T. Claasen, ISSCC99
3
Architectural Choices
Flexibility
1/Efficiency (power, speed)
4
The Processor Design Space
Embedded processors: application specific architectures for performance
Microprocessors: performance is everything, software rules
Microcontrollers: cost is everything
[Figure axes: Performance, Cost]
5
Area of processor cores = Cost
Nintendo processor
Cellular phones
6
Another figure of merit: Computation per unit area
Nintendo processor
Cellular phones
???
7
Embedded vs. general-purpose processors

Embedded processors may be optimized for a
category of applications.
Customization may be narrow or broad.
We may judge embedded processors using different
metrics:
Code size.
Memory system performance.
Predictability.

8
Microcontrollers
Memory: ROM, RAM
CPU
I/O
Subsystems: Timers, Counters, Analog Interfaces,
I/O interfaces
A single chip
9
Microcontroller Architectures
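Von Neumann Architecture: one memory (addresses 0 to 2^n) holds both program and data; the CPU reaches it over a single address bus and data bus.
Harvard Architecture: separate program and data memories, each starting at address 0 and each with its own address bus; the CPU reads instructions over a fetch bus and data over a data bus.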
10
MCS-51 Family of Microcontrollers

8051 introduced by Intel in late 1970s
Now produced by many companies in many variations
The most popular microcontroller: about 40% of
market share
8-bit microcontroller

11
Original 8051 Microcontroller
4096 Bytes Program Memory
128 Bytes Data Memory
Two 16 Bit Timer/Event Counters
Oscillator and timing
Internal data bus
8051 CPU
Programmable I/O
Programmable Serial Port Full Duplex UART
Synchronous Shifter
64 K Byte Bus Expansion Control
subsystem interrupts
External interrupts
Parallel ports Address Data Bus I/O pins
Control
Serial Output
Serial Input
12
Microcontrollers: MHS 80C51 as an example

8-bit CPU optimised for control applications
Extensive Boolean processing capabilities
64 k Program Memory address space
64 k Data Memory address space
4 k bytes of on chip Program Memory
128 bytes of on chip data RAM
32 bi-directional and individually addressable
I/O lines
Two 16-bit timers/counters
Full duplex UART
6 sources/5-vector interrupt structure with 2
priority levels
On chip clock oscillators
Very popular CPU with many different variations

13
RISC processors

RISC generally means highly-pipelinable, one
instruction per cycle.
Pipelines of embedded RISC processors have grown
over time
ARM7 has 3-stage pipeline.
ARM9 has 5-stage pipeline.
ARM11 has eight-stage pipeline.

ARM11 pipeline [ARM05].
14
RISC processor families

ARM: ARM7 is relatively simple, with no memory
management; ARM11 has memory management and other
features.
MIPS: MIPS32 4K has a 5-stage pipeline; the 4KE family
has a DSP extension; 4KS is designed for security.
PowerPC: the 400 series includes several embedded
processors; MPC7410 is a two-issue machine; 970FX
has a 16-stage pipeline.

15
DSP Applications

Audio applications
MPEG Audio
Portable audio
Digital cameras
Wireless
Cellular telephones
Base station

Networking
Cable modems
ADSL
VDSL

16
Another Look at DSP Applications

High-end
Wireless Base Station - TMS320C6000
Cable modem
gateways
Mid-end
Cellular phone - TMS320C540
Fax/ voice server
Low end
Storage products - TMS320C27
Digital camera - TMS320C5000
Portable phones
Wireless headsets
Consumer audio
Automobiles, toasters, thermostats, ...

Increasing Cost
Increasing volume
17
DSP vs. General Purpose MPU

The MIPS/MFLOPS of DSPs is the speed of
Multiply-Accumulate (MAC).
DSPs are judged by whether they can keep the
multipliers busy 100% of the time.
The "SPEC" of DSPs is 4 algorithms:
Infinite Impulse Response (IIR) filters
Finite Impulse Response (FIR) filters
FFT, and
convolvers
In DSPs, algorithms are king!
Binary compatibility is not an issue
Software is not (yet) king in DSPs.
People still write in assembly language for a
product to minimize the die area for ROM in the
DSP chip.

18
Architectural Features of DSPs

Data path configured for DSP
Fixed-point arithmetic
MAC: Multiply-accumulate
Multiple memory banks and buses -
Harvard Architecture
Multiple data memories
Specialized addressing modes
Bit-reversed addressing
Circular buffers
Specialized instruction set and execution control
Zero-overhead loops
Support for MAC
Specialized peripherals for DSP
THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE
DESIGN!!!

19
Domain-oriented architectures
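Application: $y_j = \sum_{i=0}^{n-1} x_{j-i} \, a_i$
$\forall i,\ 0 \le i \le n-1:\ y_i(j) = y_{i-1}(j) + x_{j-i} \, a_i$
Architecture example: data path of the ADSP210x
- Parallelism
- Dedicated registers
MR := 0; A1 := 1; A2 := n-2; MX := x[n-1]; MY := a[0];
for (j := 1 to n) { MR := MR + MX*MY; MY := a[A1]; MX := x[A2]; A1++; A2-- }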
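
The multiply-accumulate loop on this slide can be sketched in Python. This is a minimal illustration rather than ADSP210x code: the variables `mr`, `mx`, `my`, `a1` and `a2` mirror the register names on the slide, the function names are hypothetical, and the bounds check on the final prefetch is an assumption added so the sketch runs on ordinary lists.

```python
def fir_direct(x, a, j):
    """Direct evaluation of y_j = sum over i of x[j-i] * a[i].

    Assumes j >= len(a) - 1 so that every index j - i is non-negative.
    """
    return sum(x[j - i] * a[i] for i in range(len(a)))


def fir_mac_loop(x, a):
    """Register-style MAC loop computing y_(n-1) for n = len(a) taps."""
    n = len(a)
    mr = 0                    # accumulator register MR
    a1, a2 = 1, n - 2         # address registers A1 and A2
    mx, my = x[n - 1], a[0]   # operand registers MX and MY
    for _ in range(n):
        mr += mx * my         # multiply-accumulate step
        if a1 < n:            # assumption: skip the prefetch past the last tap
            my = a[a1]
            mx = x[a2]
        a1 += 1
        a2 -= 1
    return mr
```

Both functions give the same result for the same taps and samples; the loop version shows why dedicated operand and address registers let one multiply-accumulate and two operand fetches proceed in every iteration.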
20
DSP - Features (1)

Multiply/accumulate (MAC) and zero-overhead loop
(ZOL) instructions (as shown)
Heterogeneous registers (as shown)
Separate address generation units (AGUs)(as in
ADSP 210x)

21
DSP - Features (2)
Modulo addressing: Am++ ≡ Am := (Am + 1) mod n
(implements ring or circular buffer in memory)
[Figure: sliding window over the sample stream x at times t1 and t2 = t1 + 1, with the memory contents of the circular buffer at t = t1 and at t2 = t1 + 1.]
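
Modulo addressing can be illustrated with a small ring buffer. The sketch below is a software model only, under the assumption that the address register wraps to 0 after reaching n - 1; the class name `CircularBuffer` and its methods are hypothetical and do not correspond to any DSP library.

```python
class CircularBuffer:
    """Fixed-size ring buffer holding the last n samples of a stream."""

    def __init__(self, n):
        self.n = n
        self.data = [0] * n
        self.am = 0  # models address register Am: next write position

    def push(self, sample):
        """Overwrite the oldest sample and advance Am with wrap-around."""
        self.data[self.am] = sample
        self.am = (self.am + 1) % self.n  # Am := (Am + 1) mod n

    def window(self):
        """Return the stored samples, oldest first."""
        return [self.data[(self.am + k) % self.n] for k in range(self.n)]
```

Each new sample replaces the oldest one in place, so the sliding window advances without moving any data in memory.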
22
Multiple memory banks or memories
Simplifies parallel fetches
23
Very long instruction word (VLIW) processors
Key idea: detection of possible parallelism is
done by the compiler, not by hardware at run-time
(inefficient).
VLIW: parallel operations (instructions) are encoded
in one long word (instruction packet), each instruction
controlling one functional unit.
24
The Texas Instruments TMS 320C6xx as an example
Bit in each instruction encodes end of parallel
execution
[Figure: seven 32-bit instruction words (bits 31..0), each with its parallel bit: Instr. A = 0, Instr. B = 1, Instr. C = 1, Instr. D = 0, Instr. E = 1, Instr. F = 1, Instr. G = 0.]
Instructions B, C and D use disjoint functional
units, cross paths and other data path resources.
The same is also true for E, F and G.
Parallel execution cannot span several packets.
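
The grouping rule can be sketched as follows, assuming a bit value of 1 chains an instruction to its successor and a value of 0 closes the current group. The function name is hypothetical, and the sketch ignores the packet boundary restriction stated on the slide.

```python
def group_execute_packets(instructions):
    """Split (name, p_bit) pairs into groups that execute in parallel.

    Assumption: p_bit == 1 means the next instruction runs in parallel
    with this one; p_bit == 0 ends the current group.
    """
    packets, current = [], []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:
            packets.append(current)
            current = []
    if current:  # trailing instructions whose group was never closed
        packets.append(current)
    return packets
```

With bit values 0, 1, 1, 0, 1, 1, 0 for instructions A to G, this yields A alone, then B, C and D together, then E, F and G together, which matches the grouping described on the slide.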
25
Partitioned register files

Many memory ports are required to supply enough
operands per cycle.
Memories with many ports are expensive.
→ Registers are partitioned into (typically 2)
sets, e.g. for TI C60x:

[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; address bus and data bus.]
26
Instruction types are mapped to functional unit
types

There are 4 functional unit (FU) types:
M: Memory Unit
I: Integer Unit
F: Floating-Point Unit
B: Branch Unit
Instruction types → corresponding FU type, except
type A (mapping to either I or M functional
units).

27
Large number of delay slots, a problem of VLIW
processors
add sub and or
sub mult xor div
ld st mv beq
The execution of many instructions has been
started before it is realized that a branch was
required. Nullifying those instructions would
waste compute power.
→ Executing those instructions is declared a feature, not a bug.
→ How to fill all delay slots with useful instructions?
→ Avoid branches wherever possible.
28
Predicated execution: implementing IF-statements
"branch-free"

Conditional Instruction "[c] I" consists of
condition c
instruction I

c = true => I executed; c = false => NOP
29
Predicated execution: implementing IF-statements
"branch-free" on the TI C6x

if (c) { a = x + y; b = x + z; }
else   { a = x - y; b = x - z; }

Conditional branch (max. 12 cycles):
      [c]  B L1
           NOP 5
           B L2
           NOP 4
           SUB x,y,a
           SUB x,z,b
L1:        ADD x,y,a
           ADD x,z,b
L2:

Predicated execution (1 cycle):
      [c]  ADD x,y,a
   || [c]  ADD x,z,b
   || [!c] SUB x,y,a
   || [!c] SUB x,z,b
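
The difference between the two versions can be modelled in Python. This is only a behavioural sketch: the function names are hypothetical, and the `if predicate` test stands in for the hardware write-enable that suppresses a predicated instruction, not for a branch in the instruction stream.

```python
def with_branch(c, x, y, z):
    """Reference semantics of the IF-statement, using control flow."""
    if c:
        a, b = x + y, x + z
    else:
        a, b = x - y, x - z
    return a, b


def predicated(c, x, y, z):
    """Issue all four operations; each commits only if its predicate is true."""
    regs = {"a": None, "b": None}
    ops = [
        (c, "a", x + y),      # [c]  ADD x,y,a
        (c, "b", x + z),      # [c]  ADD x,z,b
        (not c, "a", x - y),  # [!c] SUB x,y,a
        (not c, "b", x - z),  # [!c] SUB x,z,b
    ]
    for predicate, dest, value in ops:
        if predicate:         # a false predicate turns the operation into a NOP
            regs[dest] = value
    return regs["a"], regs["b"]
```

Both functions return the same pair for any input; the predicated form always does the work of all four operations, which is acceptable when spare functional units would otherwise sit idle during branch delay slots.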
30
Architecture Evolution

Roadmap continues: 90 → 65 → 45 nm
Traditional Bus-based SoCs fit in one tile !!

Communication demand is staggering, but unevenly
distributed, because of architectural
heterogeneity

31
Multicores Are Here!
[Figure: number of cores per chip (log scale, 1 to 512) versus year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon. Source: Amarasinghe06.]
32
MPSoC: 2005 ITRS roadmap
[Martin06]
33
Power is the Challenge!
[Figure: power (W) and power density (W/cm²) for a 10 mm die, scale 0 to 1400, at the 90nm, 65nm, 45nm, 32nm, 22nm and 16nm nodes. Components: Active, SD Lkg, SiO2 Lkg.]
34
Near Term Solutions

Move away from Frequency alone to deliver
performance
More on-die memory
Multi-everywhere
Multi-threading
Chip level multi-processing
Throughput oriented designs
Performance by higher level of integration

35
µArchitecture Techniques
Increase on-die Memory
36
Multi-Core
Power = 1/4
Performance = 1/2
Multi-Core: power efficient, better power and
thermal management
[Figure: bar charts comparing power and performance on a 1 to 4 scale.]
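
The arithmetic behind the multi-core argument can be checked with a short calculation. It reads the slide's figures as "a small core uses 1/4 of the power and delivers 1/2 of the performance of a large core", which is an interpretation of the flattened chart, and it assumes throughput scales perfectly with the number of cores. The function name is hypothetical.

```python
from fractions import Fraction


def multicore_totals(n_cores, core_power, core_perf):
    """Total power and throughput of n identical cores, assuming perfect scaling."""
    return n_cores * core_power, n_cores * core_perf


# One large core, normalised to power 1 and performance 1.
large = multicore_totals(1, Fraction(1), Fraction(1))

# Four small cores, each at 1/4 power and 1/2 performance (assumed reading).
small_x4 = multicore_totals(4, Fraction(1, 4), Fraction(1, 2))
```

Under these assumptions the four small cores fit in the power budget of one large core while doubling throughput, provided the workload can be spread across them.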
37
Embedded vs. General Purpose

Server Applications
Symmetric Multi-Processing
Homogeneous cores
General tasks known late
Tasks run on any core
High-performance, high-speed microprocessors
Communication
large coherent memory space on multi-core die or
bus
SMT programming models (Simultaneous
Multi-Threading)
Examples: large server chips (e.g. Sun Niagara, 8x4
threads), scientific multi-processors

Embedded Applications
Asymmetric Multi-Processing
Differentiated Processors
Specific tasks known early
Mapped to dedicated processors
Configurable and extensible processors
performance, power efficiency
Communication
Coherent memory
Shared local memories
HW FIFOS, other direct connections
Dataflow programming models
Classical example: Smart mobile: RISC + DSP +
Media processors

38
MPSoC architectures
39
Example system platforms

Generic
Automotive
Wireless
Multimedia

40
PC-based platform

Basic hardware components
CPU
memory
timers
DMA
minimal I/O devices.
Basic software
BIOS.

41
PC-style hardware architecture
CPU
memory
I/O
system bus
bridge
high-speed bus
DMA controller
timers
low-speed bus
bus interface
I/O
42
StrongARM

StrongARM system includes
CPU chip (3.686 MHz clock)
system control module (32.768 kHz clock).
Real-time clock
operating system timer
general-purpose I/O
interrupt controller
power manager controller
reset controller.

43
Pros and cons

Plentiful hardware options.
Simple programming semantics.
Good software development environments.
Performance-limited.

44
TI Open Wireless Multimedia Applications Platform

Dual-processor shared memory system

external memory
General-purpose processor
DSP
Mem ctrl
DSP task I/O ctrl
GPP OS
DSP manager
DSP OS
bridge
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
45
TI OMAP Hardware platform
Program Memory
SDRAM
Memory Traffic Controller

ARM9 core
16KB I-cache
8KB D-cache
2-way set associative
150 MHz

C55x DSP core
16KB I-cache
8KB RAM set
2-way set associative
200 MHz

I-MMU
D-MMU
MMU
Internal RAM/ROM
I-Cache
D-Cache
I-Cache
DMA
RISC Core
DSP Core Appl Coprocessors
Peripherals
LCD Controller, Interrupt Handlers, Timers, GPIO,
UARTs, ...
46
OMAPI Standard (ST/TI)

Goal: standardize the interfaces between
application processor and peripheral devices in a
mobile product
Provide standard services (APIs) in the OS that
can be used by application developers

47
STMicro Nomadik platform
Main Core
I/Os
HW Accelerators
Memory System
48
Nomadik SW platform

Compliant with OMAPI standard

49
Philips Digital Video Nexperia Platform
TriMedia
MIPS
SDRAM

Scalable VLIW Media Processor
100 to 300 MHz
32-bit or 64-bit
Nexperia
System Buses
32-128 bit

General-purpose Scalable RISC Processor
50 to 300 MHz
32-bit or 64-bit
Library of Device IP Blocks
Image coprocessors
DSPs
UART
1394
USB
and more

[Figure: DVP system silicon block diagram: TriMedia CPU (TM-xxxx) and MIPS CPU (PRxxxx), each with I and D blocks, MMI, device IP blocks attached to two PI buses, and the DVP memory bus.]
50
Nexperia-DVP Software

Nexperia-DVP Software Architecture
Supports multiple OSs and middleware software
Abstracts platform functionality via consistent
APIs
Nexperia-DVP Streaming Software
Encapsulates implementation of streaming media
components (hardware and software)
Nexperia Platform Software
OS independent device drivers for on-chip and
off-chip devices

Applications
Middleware: JavaTV, TVPAK, OpenTV, MHP/Java,
proprietary ...
Kernel: pSOS, Win-CE, JavaOS
Streaming and Platform Software
Nexperia Hardware
51
Infineon Automotive Platform

Applications
High Performance drives / servo drives,
Industrial control Robotics
Features
32-bit super-scalar TriCore™ V1.3 CPU, 4 stage
pipeline
Fully integrated DSP capabilities
Single precision floating point unit (FPU)
80 MHz at full industrial temperature range
32-bit peripheral control processor with single
cycle instruction (PCP2)
Memories
1.5 MByte embedded progr. flash with ECC
32 KByte data flash - EEPROM emulation
56 KB SRAM, 8 KB I, 16 KB Imem
8-channel DMA controller
Interrupt system with 2 x 255 hardware priority
arbitration levels serviced by CPU and PCP2
Coprocessor
Triple bus structure: 64-bit local memory buses
to internal flash and data memory, 32-bit system
peripheral bus, 32-bit remote peripheral bus

TC1166
52
MOSAIC SW Architecture Components for
Automotive Dashboard and Body Control
Application Platform layer (@ 10% of total SW)
Customer Libraries
OSEK RTOS
CCP
Application Specific Software
KWP 2000
Transport
SW Platform Reuse > 70% of total SW
SW Platform layer (> 60% of total SW)
OSEK COM
Application Programming Interface
I/O drivers and handlers (> 20 configurable modules)
µControllers Library
HW layer
53
Architecture trends
High performance for narrow application field
Special Purpose processor
Dedicated hardware
DSP
Stream processor
Graphic processor
Multiple Cores
Network processor
Heterogeneous Multiprocessor
Programmable Hardware
Configurable Processor
FPGA / Reconfigurable systems
Dynamically Reconfigurable Processors
Special instructions
Tile Processor
Homogeneous Chip-multiprocessor
General purpose CPU
Multiple Cores
High performance for wide application field
54
Task Specific (configurable) Processors
RWTH Aachen → LISATek (CoWare); IMEC → Target
Compiler T; ARM OptimoDE; Philips → Silicon Hive;
Tensilica; PicoChip
Courtesy Target Compilers T
DP
ISA
55
Parallelism at Three Levels in Extensible
Instructions

Three forms of instruction-set parallelism
Very Long Instruction Word (VLIW)
Single Instruction Multiple Data (SIMD) aka
vectors
Fused operations aka complex operations

Parallelism: L x M x N. Example: 3 x 4 x 3 = 36
ops/cycle
56
Example: SAD (sum of absolute differences)
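
A scalar reference for SAD, together with the L x M x N product from the previous slide, can be sketched as below. The function names are hypothetical, and mapping L, M and N to VLIW slots, SIMD lanes and fused operations is an assumption based on the order in which the three forms are listed.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized sample blocks."""
    return sum(abs(p - q) for p, q in zip(block_a, block_b))


def ops_per_cycle(vliw_slots, simd_lanes, fused_ops):
    """Peak operations per cycle when the three forms of parallelism multiply."""
    return vliw_slots * simd_lanes * fused_ops
```

In SAD, each element needs a subtract, an absolute value and an accumulate, which is the kind of chain a fused operation can merge, while the per-element independence suits SIMD lanes.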
57
Dynamically Reconfigurable Processors

Reconfigurable systems → previous lesson
Flexible, but dynamic reconfiguration takes
tens of milliseconds.
Dynamically Reconfigurable Processors
Improve area efficiency by changing hardware
structure.
IPs used in various SoCs.
History
Reconfigurable co-processors: Garp (1997),
CHIMAERA (2000)
Multicontext reconfigurable devices:
WASMII (1992), Time-multiplexing FPGA (1997),
PipeRench (1998), DRL (1998)
Functional-level synthesis
Various commercial products have been available since
2000
IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp,
Elixent DFabrix
Sony's VME (Virtual Mobile Engine) is embedded in
Network Walkman and PSP
Recently, many Japanese vendors have started to develop
commercial products
Fujitsu
Hitachi
Lucent
Sanyo
Toshiba (MepD-Fabrix)

58
What is Configurable Computing?
Spatially-programmed connection of processing
elements
Hardware customized to specifics of
problem. Direct map of problem specific dataflow,
control. Circuits adapted as problem
requirements change.
59
Spatial vs. Temporal Computing
Temporal
Spatial
60
Processor vs. FPGA Area
61
Processing Element

Specialized for media/stream processing
Coarse grain (vs. fine grain LUT of FPGAs)
Components
ALU
Shifter, Mask unit
Multiplexers
Registers
Operations and interconnection between components
are changeable
No instruction fetch mechanism: a part of a large
datapath

62
Reconfigurable HW (DSP fabric)

Target signal processing and arithmetic intensive
applications
Reconfigurable array of simple DSP core (CNode)
Low power architecture
Hierarchical clock gating
Distributed leakage control (fine grain power
gating)
Programmable DMA engine
Reconfigurable at run time, multi task

63
Mapping Flow
[Figure: mapping flow from behavioral code (Procedure(In, Out, inout) ... Begin ... End) via a DFG and partitioning/static scheduling to a coarse grained configuration; clusters connected through mux levels 0, 1 and 2, with node ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o between Data in and Data out.]

ALUs execute a cyclic micro-sequence
Data exchanges through hierarchical clustered
interconnect
Configuration step is sequence loading and
interconnect programming

ILP, software pipelining
64
Mapping Flow

3D optimization problem (place/route/schedule)
Traditional scheduling techniques for VLIW or
clustered VLIW do not apply:
these solutions do not take into account the spatial
dimension of the problem
Traditional place and route (P&R) used in FPGAs does not apply either,
because it does not consider the time dimension

65
Putting it all together
Year                          2004  2006  2008  2010  2012
Technology Node (nm)            90    65    45    32    22
Loosely coupled Sub-Systems      2     4     6     8    12
General Purpose CPU           Single / Multiple (in every column)
Hardware Accelerator          Hardwired / Reconfigurable (in every column)

Constant SoC Die Size
Slow evolution of peripherals (area decrease)
GP CPU sub-system complexity 2x at each node
(constant area)
Embedded Memory capacity 2x at each node
(constant area)
Loosely coupled DSP sub-system complexity
increases by 30% at each node (30% area decrease)

66
What can fit in 45mm² in 45nm
Programmable Multimedia Accelerator
Imaging H/W
192 CNode (40 GOPS)
Video H/W
Interconnect
4MB Multi-port Embedded Memory
L2
Peripherals analog
L1
L1
Host Core 2
Host Core 1
