Presentazione di PowerPoint - PowerPoint PPT Presentation

Provided by: con99
1
Hardware platforms for Embedded computing
2
The energy/flexibility conflict: intrinsic power efficiency
[Figure: power efficiency in operations per watt (MOPS/mW, log scale from 0.01 to 10) plotted against technology node (1.0µ, 0.5µ, 0.25µ, 0.13µ, 0.07µ) for hardwired muxed ASICs, DSP-ASIPs, reconfigurable computing, processors and µPs, with "Ambient Intelligence" marked.]
It is necessary to optimize HW/SW; otherwise the price
for software flexibility cannot be paid!
H. de Man, Keynote, DATE02; T. Claasen, ISSCC99
3
Architectural Choices
Flexibility
1/Efficiency (power, speed)
4
The Processor Design Space
Embedded processors: application-specific architectures for performance
Microprocessors: performance is everything, and software rules
Microcontrollers: cost is everything
[Figure: the three processor classes placed on performance and cost axes.]
5
Area of processor cores Cost
Nintendo processor
Cellular phones
6
Another figure of merit: computation per unit area
Nintendo processor
Cellular phones
???
7
Embedded vs. general-purpose processors
  • Embedded processors may be optimized for a
    category of applications.
  • Customization may be narrow or broad.
  • We may judge embedded processors using different
    metrics:
    • Code size.
    • Memory system performance.
    • Predictability.

8
Microcontrollers
Memory
ROM RAM
CPU
I/O
Subsystems Timers, Counters, Analog Interfaces,
I/O interfaces
A single chip
9
Microcontroller Architectures
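[Figure: Von Neumann architecture. One memory (addresses 0 to 2^n) holds both program and data and connects to the CPU through a single address bus and data bus.]
[Figure: Harvard architecture. A program memory connects to the CPU through its own address bus and fetch bus, and a separate data memory through its own address bus and data bus.]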
10
MCS-51 Family of Microcontrollers
  • 8051 introduced by Intel in late 1970s
  • Now produced by many companies in many variations
  • The most popular microcontroller, with about 40% of
    market share
  • 8-bit microcontroller

11
Original 8051 Microcontroller
4096 Bytes Program Memory
128 Bytes Data Memory
Two 16 Bit Timer/Event Counters
Oscillator and timing
Internal data bus
8051 CPU
Programmable I/O
Programmable Serial Port Full Duplex UART
Synchronous Shifter
64 K Byte Bus Expansion Control
subsystem interrupts
External interrupts
Parallel ports Address Data Bus I/O pins
Control
Serial Output
Serial Input
12
Microcontrollers- MHS 80C51 as an example -
  • 8-bit CPU optimised for control applications
  • Extensive Boolean processing capabilities
  • 64 k Program Memory address space
  • 64 k Data Memory address space
  • 4 k bytes of on chip Program Memory
  • 128 bytes of on chip data RAM
  • 32 bi-directional and individually addressable
    I/O lines
  • Two 16-bit timers/counters
  • Full duplex UART
  • 6 sources/5-vector interrupt structure with 2
    priority levels
  • On chip clock oscillators
  • Very popular CPU with many different variations

13
RISC processors
  • RISC generally means highly-pipelinable, one
    instruction per cycle.
  • Pipelines of embedded RISC processors have grown
    over time
  • ARM7 has 3-stage pipeline.
  • ARM9 has 5-stage pipeline.
  • ARM11 has eight-stage pipeline.

[Figure: ARM11 pipeline. Source: ARM05]
14
RISC processor families
  • ARM: ARM7 is relatively simple, with no memory
    management; ARM11 has memory management and other
    features.
  • MIPS: MIPS32 4K has a 5-stage pipeline; the 4KE family
    has a DSP extension; 4KS is designed for security.
  • PowerPC: the 400 series includes several embedded
    processors; MPC7410 is a two-issue machine; 970FX
    has a 16-stage pipeline.

15
DSP Applications
  • Audio applications
    • MPEG Audio
    • Portable audio
  • Digital cameras
  • Wireless
    • Cellular telephones
    • Base station
  • Networking
    • Cable modems
    • ADSL
    • VDSL

16
Another Look at DSP Applications
  • High-end
    • Wireless Base Station - TMS320C6000
    • Cable modem
    • Gateways
  • Mid-end
    • Cellular phone - TMS320C540
    • Fax/voice server
  • Low end
    • Storage products - TMS320C27
    • Digital camera - TMS320C5000
    • Portable phones
    • Wireless headsets
    • Consumer audio
    • Automobiles, toasters, thermostats, ...

Increasing Cost
Increasing volume
17
DSP vs. General Purpose MPU
  • The MIPS/MFLOPS of DSPs is the speed of
    Multiply-Accumulate (MAC).
    • DSPs are judged by whether they can keep the
      multipliers busy 100% of the time.
  • The "SPEC" of DSPs is 4 algorithms:
    • Infinite Impulse Response (IIR) filters
    • Finite Impulse Response (FIR) filters
    • FFT, and
    • convolvers
  • In DSPs, algorithms are king!
    • Binary compatibility is not an issue
  • Software is not (yet) king in DSPs.
    • People still write in assembly language for a
      product to minimize the die area for ROM in the
      DSP chip.

18
Architectural Features of DSPs
  • Data path configured for DSP
    • Fixed-point arithmetic
    • MAC: Multiply-accumulate
  • Multiple memory banks and buses
    • Harvard Architecture
    • Multiple data memories
  • Specialized addressing modes
    • Bit-reversed addressing
    • Circular buffers
  • Specialized instruction set and execution control
    • Zero-overhead loops
    • Support for MAC
  • Specialized peripherals for DSP
  • THE ULTIMATE IN BENCHMARK DRIVEN ARCHITECTURE
    DESIGN!!!
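
Bit-reversed addressing, listed above among the specialized addressing modes, is typically used to reorder FFT inputs or outputs. A minimal Python sketch of the index computation follows; the function names `bit_reverse` and `bit_reversed_order` are hypothetical, and a real DSP performs this in its address generation hardware rather than in software.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Return `index` with its lowest `bits` bits in reversed order."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)  # shift in the current lowest bit
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list[int]:
    """Bit-reversed index sequence for a power-of-two length n."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a positive power of two")
    bits = n.bit_length() - 1  # number of address bits needed for n entries
    return [bit_reverse(i, bits) for i in range(n)]
```

For an 8-point transform, `bit_reversed_order(8)` gives the order in which a radix-2 FFT would touch its samples.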

19
Domain-oriented architectures
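Application: $y_j = \sum_{i=0}^{n-1} x_{j-i} \, a_i$
$\forall i,\ 0 \le i \le n-1:\quad y_i[j] = y_{i-1}[j] + x_{j-i} \, a_i$
Architecture example: data path of the ADSP210x
- Parallelism - Dedicated registers
MR := 0; A1 := 1; A2 := n-2; MX := x[n-1]; MY := a[0];
for (j := 1 to n) { MR := MR + MX*MY; MY := a[A1]; MX := x[A2]; A1++; A2-- }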
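
The sum on this slide maps onto one multiply-accumulate per filter tap. A minimal Python sketch, assuming the samples and coefficients are plain lists; the function name `fir_output` and the variable `mr` (standing in for the MR accumulator register) are illustrative choices, and the loop does not model the ADSP210x address registers or its parallel operand loads.

```python
def fir_output(x, a, j):
    """Compute y[j] = sum over i of x[j-i] * a[i], one MAC per tap."""
    n = len(a)
    if j < n - 1 or j >= len(x):
        raise IndexError("need n-1 <= j < len(x) so every x[j-i] exists")
    mr = 0  # accumulator, playing the role of the MR register
    for i in range(n):
        mr += x[j - i] * a[i]  # multiply-accumulate step
    return mr
```

On a DSP with a MAC unit and zero-overhead loops, each iteration of this loop would ideally cost a single cycle.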
20
DSP - Features (1)
  • Multiply/accumulate (MAC) and zero-overhead loop
    (ZOL) instructions (as shown)
  • Heterogeneous registers (as shown)
  • Separate address generation units (AGUs) (as in
    ADSP 210x)

21
DSP - Features (2)
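  • Modulo addressing: Am++ ≡ Am := (Am+1) mod n
    (implements a ring or circular buffer in memory)

[Figure: sliding window over the signal x at times t1 and t2. Memory at t = t1 holds .., x[n-2], x[n-1], x[0], x[1], ..; memory at t2 = t1 + 1 holds .., x[n-3], x[n-2], x[n-1], x[n], x[1], ..]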
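
The modulo update of the address register can be sketched in a few lines of Python. This is a minimal illustration, not DSP code: the class name `CircularBuffer` and its methods are hypothetical, and the attribute `am` stands in for the address register Am.

```python
class CircularBuffer:
    """Fixed-size ring buffer holding the n most recent samples."""

    def __init__(self, n: int):
        if n <= 0:
            raise ValueError("n must be positive")
        self.n = n
        self.data = [0] * n
        self.am = 0  # write pointer, playing the role of address register Am

    def push(self, sample) -> None:
        """Overwrite the oldest sample, then advance Am modulo n."""
        self.data[self.am] = sample
        self.am = (self.am + 1) % self.n

    def window(self) -> list:
        """Return the stored samples ordered from oldest to newest."""
        return self.data[self.am:] + self.data[:self.am]
```

Pushing a new sample never moves existing data; only the pointer wraps around, which is what makes the sliding window cheap.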
22
Multiple memory banks or memories
Simplifies parallel fetches
23
Very long instruction word (VLIW) processors
Key idea: detection of possible parallelism is to be
done by the compiler, not by hardware at run-time
(inefficient). VLIW: parallel operations
(instructions) are encoded in one long word
(instruction packet), each instruction
controlling one functional unit.
24
The Texas Instruments TMS 320C6xx as an example
A bit in each instruction encodes the end of parallel
execution.
[Figure: seven 32-bit instruction words (bits 31 to 0), Instr. A through Instr. G, each carrying one bit that marks whether parallel execution continues with the next instruction.]
Instructions B, C and D use disjoint functional
units, cross paths and other data path resources.
The same is also true for E, F and G.
Parallel execution cannot span several packets.
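
A minimal Python sketch of how such a per-instruction bit could split an instruction stream into groups that issue together. It assumes the convention that a bit value of 1 means the next instruction executes in parallel with the current one; the function name `execute_packets` and the bit values in the usage example are illustrative and are not taken from the slide's figure.

```python
def execute_packets(instructions):
    """Group (name, p_bit) pairs into lists of instructions issued together."""
    packets = []
    current = []
    for name, p_bit in instructions:
        current.append(name)
        if p_bit == 0:  # parallel execution ends after this instruction
            packets.append(current)
            current = []
    if current:  # stream ended while a group was still open
        packets.append(current)
    return packets


# Assumed bit values, chosen so that B, C, D and E, F, G run in parallel.
stream = [("A", 0), ("B", 1), ("C", 1), ("D", 0), ("E", 1), ("F", 1), ("G", 0)]
packets = execute_packets(stream)
```

With these assumed bits, A issues alone while B, C, D and E, F, G form two parallel groups, matching the grouping described on the slide.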
25
Partitioned register files
  • Many memory ports are required to supply enough
    operands per cycle.
  • Memories with many ports are expensive.
  • ⇒ Registers are partitioned into (typically 2)
    sets, e.g. for TI C60x:

[Figure: data path A with register file A and functional units L1, S1, M1, D1; data path B with register file B and functional units D2, M2, S2, L2; both connect to the address bus and data bus.]
26
Instruction types are mapped to functional unit
types
  • There are 4 functional unit (FU) types:
    • M: Memory Unit
    • I: Integer Unit
    • F: Floating-Point Unit
    • B: Branch Unit
  • Instruction types → corresponding FU type, except
    type A (mapping to either I or M functional
    units).

27
Large number of delay slots, a problem of VLIW
processors
[Figure: instruction packets in flight in the pipeline: add sub and or / sub mult xor div / ld st mv beq]
The execution of many instructions has been
started before it is realized that a branch was
required. Nullifying those instructions would
waste compute power.
⇒ Executing those instructions is declared a feature, not a bug.
⇒ How to fill all delay slots with useful instructions?
⇒ Avoid branches wherever possible.
28
Predicated execution: implementing IF-statements
branch-free
  • Conditional instruction "[c] I" consists of:
    • condition c
    • instruction I

c = true ⇒ I executed; c = false ⇒ NOP
29
Predicated execution: implementing IF-statements
branch-free (TI C6x)
Source code:
    if (c) { a = x + y; b = x + z; }
    else   { a = x - y; b = x - z; }
Conditional branch (max. 12 cycles):
        [c] B L1
            NOP 5
            B L2
            NOP 4
            SUB x,y,a
         || SUB x,z,b
    L1:     ADD x,y,a
         || ADD x,z,b
    L2:
Predicated execution (1 cycle):
           [c]  ADD x,y,a
        || [c]  ADD x,z,b
        || [!c] SUB x,y,a
        || [!c] SUB x,z,b
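
The same transformation can be imitated in a high-level language. A minimal Python sketch, assuming integer operands; the function names are hypothetical, and the arithmetic select below only mimics predication, since real predicated hardware issues all guarded instructions and discards the results of those whose condition is false.

```python
def branching(c, x, y, z):
    """Reference version with an explicit if/else."""
    if c:
        a, b = x + y, x + z
    else:
        a, b = x - y, x - z
    return a, b


def predicated(c, x, y, z):
    """Branch-free version: every operation is computed, the predicate selects."""
    p = int(bool(c))  # 1 when the condition holds, 0 otherwise
    add_a, add_b = x + y, x + z  # results of the [c] instructions
    sub_a, sub_b = x - y, x - z  # results of the [!c] instructions
    a = p * add_a + (1 - p) * sub_a
    b = p * add_b + (1 - p) * sub_b
    return a, b
```

Both functions return the same pair for any condition; the second does so without a control-flow decision, which is the property that removes the branch delay slots.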
30
Architecture Evolution
  • Roadmap continues: 90 → 65 → 45 nm
  • Traditional Bus-based SoCs fit in one tile !!
  • Communication demand is staggering, but unevenly
    distributed, because of architectural
    heterogeneity

31
Multicores Are Here!
[Figure: number of cores per chip (1 to 512, doubling scale) versus year (1970 to 20??), with data points including the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2 and Athlon. Source: Amarasinghe06]
32
MPSoC 2005 ITRS roadmap
Martin06
33
Power is the Challenge!
[Figure: power (W) and power density (W/cm²) of a 10 mm die, on a scale from 0 to 1400, for the 90nm, 65nm, 45nm, 32nm, 22nm and 16nm nodes, broken down into active power, SD leakage (SD Lkg) and SiO2 leakage (SiO2 Lkg).]
34
Near Term Solutions
  • Move away from Frequency alone to deliver
    performance
  • More on-die memory
  • Multi-everywhere
  • Multi-threading
  • Chip level multi-processing
  • Throughput oriented designs
  • Performance by higher level of integration

35
µArchitecture Techniques
Increase on-die Memory
36
Multi-Core
[Figure: relative power and performance, on scales from 1 to 4, annotated "Power = 1/4" and "Performance = 1/2".]
Multi-Core: power efficient, with better power and
thermal management
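
One common reading of the figure's annotations, stated here as an assumption rather than something the slide text confirms, is that a small core draws about 1/4 of the power of a large core while delivering about 1/2 of its performance. The Python sketch below works through that arithmetic for several small cores, under the further assumption that the workload parallelizes perfectly; the function name and parameters are illustrative.

```python
from fractions import Fraction


def multicore_totals(cores, power_per_core=Fraction(1, 4), perf_per_core=Fraction(1, 2)):
    """Total power and throughput of `cores` small cores, relative to one large core."""
    total_power = cores * power_per_core
    total_perf = cores * perf_per_core  # assumes perfect parallel scaling
    return total_power, total_perf


power, perf = multicore_totals(4)
```

Under these assumptions, four small cores fit the power budget of one large core (total 1) while offering twice its throughput (total 2).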
37
Embedded vs. General Purpose
  • Server Applications
    • Symmetric Multi-Processing
    • Homogeneous cores
    • General tasks known late
    • Tasks run on any core
    • High-performance, high-speed microprocessors
    • Communication
      • Large coherent memory space on multi-core die or
        bus
    • SMT programming models (Simultaneous
      Multi-Threading)
    • Examples: large server chips (e.g. Sun Niagara, 8x4
      threads), scientific multi-processors
  • Embedded Applications
    • Asymmetric Multi-Processing
    • Differentiated Processors
    • Specific tasks known early
    • Mapped to dedicated processors
    • Configurable and extensible processors:
      performance, power efficiency
    • Communication
      • Coherent memory
      • Shared local memories
      • HW FIFOs, other direct connections
    • Dataflow programming models
    • Classical example: smart mobile (RISC, DSP,
      media processors)

38
MPSoC architectures
39
Example system platforms
  • Generic
  • Automotive
  • Wireless
  • Multimedia

40
PC-based platform
  • Basic hardware components
  • CPU
  • memory
  • timers
  • DMA
  • minimal I/O devices.
  • Basic software
  • BIOS.

41
PC-style hardware architecture
CPU
memory
I/O
system bus
bridge
high-speed bus
DMA controller
timers
low-speed bus
bus interface
I/O
42
StrongARM
  • StrongARM system includes:
    • CPU chip (3.686 MHz clock)
    • system control module (32.768 kHz clock):
      • real-time clock
      • operating system timer
      • general-purpose I/O
      • interrupt controller
      • power manager controller
      • reset controller.

43
Pros and cons
  • Plentiful hardware options.
  • Simple programming semantics.
  • Good software development environments.
  • Performance-limited.

44
TI Open Wireless Multimedia Applications Platform
  • Dual-processor shared memory system

external memory
General-purpose processor
DSP
Mem ctrl
DSP task I/O ctrl
GPP OS
DSP manager
DSP OS
bridge
http://www.ti.com/sc/docs/apps/wireless/omap/overview.htm
45
TI OMAP Hardware platform
Program Memory
SDRAM
Memory Traffic Controller
  • ARM9 core
    • 16KB I-cache
    • 8KB D-cache
    • 2-way set associative
    • 150 MHz
  • C55x DSP core
    • 16KB I-cache
    • 8KB RAM set
    • 2-way set associative
    • 200 MHz

I-MMU
D-MMU
MMU
Internal RAM/ROM
I-Cache
D-Cache
I-Cache
DMA
RISC Core
DSP Core Appl Coprocessors
Peripherals
LCD Controller, Interrupt Handlers, Timers, GPIO,
UARTs, ...
46
OMAPI Standard (ST/TI)
  • Goal: standardize the interfaces between the
    application processor and peripheral devices in a
    mobile product
  • Provide standard services (APIs) in the OS that
    can be used by application developers

47
STMicro Nomadik platform
Main Core
I/Os
HW Accelerators
Memory System
48
Nomadik SW platform
  • Compliant with OMAPI standard

49
Philips Digital Video Nexperia Platform
TriMedia
MIPS
SDRAM
  • Scalable VLIW Media Processor
    • 100 to 300 MHz
    • 32-bit or 64-bit
  • Nexperia System Buses
    • 32-128 bit
  • General-purpose Scalable RISC Processor
    • 50 to 300 MHz
    • 32-bit or 64-bit
  • Library of Device IP Blocks
    • Image coprocessors
    • DSPs
    • UART
    • 1394
    • USB
    • and more

[Figure: DVP system silicon, containing a TriMedia CPU (TM-xxxx) and a MIPS CPU (PRxxxx), each with blocks labeled I and D, an MMI block, a DVP memory bus, two PI buses, and multiple device IP blocks.]
50
Nexperia-DVP Software
  • Nexperia-DVP Software Architecture
    • Supports multiple OSs and middleware software
    • Abstracts platform functionality via consistent
      APIs
  • Nexperia-DVP Streaming Software
    • Encapsulates implementation of streaming media
      components (hardware and software)
  • Nexperia Platform Software
    • OS-independent device drivers for on-chip and
      off-chip devices

Software stack (top to bottom):
Applications
Middleware: JavaTV, TVPAK, OpenTV, MHP/Java,
proprietary ...
Kernel: pSOS, Win-CE, JavaOS
Streaming and Platform Software
Nexperia Hardware
51
Infineon Automotive Platform
  • Applications
    • High-performance drives / servo drives
    • Industrial control, robotics
  • Features
    • 32-bit super-scalar TriCore(TM) V1.3 CPU, 4-stage
      pipeline
    • Fully integrated DSP capabilities
    • Single precision floating point unit (FPU)
    • 80 MHz at full industrial temperature range
    • 32-bit peripheral control processor with single
      cycle instruction (PCP2)
    • Memories
      • 1.5 MByte embedded program flash with ECC
      • 32 KByte data flash - EEPROM emulation
      • 56 KB SRAM, 8 KB I, 16 KB Imem
    • 8-channel DMA controller
    • Interrupt system with 2 x 255 hardware priority
      arbitration levels serviced by CPU and PCP2
      coprocessor
    • Triple bus structure: 64-bit local memory buses
      to internal flash and data memory, 32-bit system
      peripheral bus, 32-bit remote peripheral bus

TC1166
52
MOSAIC SW Architecture: Components for
Automotive Dashboard and Body Control
Application Platform layer (@ 10% of total SW)
Customer Libraries
OSEK RTOS
CCP
Application Specific Software
KWP 2000
Transport
SW Platform Reuse > 70% of total SW
SW Platform layer (> 60% of total SW)
OSEK COM
Application Programming Interface
I/O drivers, handlers (> 20 configurable modules)
µControllers Library
HW layer
53
Architecture trends
High performance for narrow application field
Special Purpose processor
Dedicated hardware
DSP
Stream processor
Graphic processor
Multiple Cores
Network processor
Heterogeneous Multiprocessor
Programmable Hardware
Configurable Processor
FPGA, reconfigurable systems
Dynamically Reconfigurable Processors
Special instructions
Tile Processor
Homogeneous Chip-multiprocessor
General purpose CPU
Multiple Cores
High performance for wide application field
54
Task Specific (configurable) Processors
RWTH AACHEN → Lisatek (CoWare); IMEC → Target
Compiler T, ARM OptimoDE; PHILIPS → Siliconhive;
TENSILICA, PicoChip
Courtesy Target Compilers T
DP
ISA
55
Parallelism at Three Levels in Extensible
Instructions
  • Three forms of instruction-set parallelism
  • Very Long Instruction Word (VLIW)
  • Single Instruction Multiple Data (SIMD) aka
    vectors
  • Fused operations aka complex operations

Parallelism: L x M x N. Example: 3 x 4 x 3 = 36
ops/cycle
56
Example: SAD (sum of absolute differences)
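
A minimal Python sketch of the SAD kernel, assuming two equally sized blocks of pixel values given as flat lists; the function name `sad` is illustrative. In an extensible processor, the subtract, absolute value and accumulate steps in the loop body are the kind of operation chain that could be fused into one instruction and applied to several pixels at once, though the slide text does not spell out that mapping.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    total = 0
    for pa, pb in zip(block_a, block_b):
        total += abs(pa - pb)  # subtract, absolute value, accumulate
    return total
```

A lower SAD means the two blocks are more alike, which is why the kernel is common in video motion estimation.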
57
Dynamically Reconfigurable Processors
  • Reconfigurable systems → Previous lesson
    • Flexible, but dynamic reconfiguration takes tens
      of milliseconds.
  • Dynamically Reconfigurable Processors
    • Improve area efficiency by changing the hardware
      structure.
    • IPs used in various SoCs.
  • History
    • Reconfigurable co-processors: Garp (1997),
      CHIMAERA (2000)
    • Multicontext reconfigurable devices:
      WASMII (1992), Time-multiplexing FPGA (1997),
      PipeRench (1998), DRL (1998)
    • Functional-level synthesis
  • Various commercial products have been available since
    2000
    • IPFlex DAPDNA-2, NEC Electronics DRP-1, PACT Xpp,
      Elixent DFabrix
    • SONY's VME (Virtual Mobile Engine) is embedded in
      Network Walkman and PSP
  • Recently, many Japanese vendors have started to develop
    commercial products
    • Fujitsu
    • Hitachi
    • Lucent
    • Sanyo
    • Toshiba (MepD-Fabrix)

58
What is Configurable Computing?
Spatially-programmed connection of processing
elements
Hardware customized to specifics of
problem. Direct map of problem specific dataflow,
control. Circuits adapted as problem
requirements change.
59
Spatial vs. Temporal Computing
Temporal
Spatial
60
Processor vs. FPGA Area
61
Processing Element
  • Specialized for media/stream processing
  • Coarse grain (vs. the fine-grain LUTs of FPGAs)
  • Components
    • ALU
    • Shifter, Mask unit
    • Multiplexers
    • Registers
  • Operations and interconnection between components
    are changeable
  • No instruction fetch mechanism: a part of a large
    datapath

62
Reconfigurable HW (DSP fabric)
  • Target signal processing and arithmetic intensive
    applications
  • Reconfigurable array of simple DSP core (CNode)
  • Low power architecture
  • Hierarchical clock gating
  • Distributed leakage control (fine grain power
    gating)
  • Programmable DMA engine
  • Reconfigurable at run time, multi task

63
Mapping Flow
[Figure: mapping flow. Behavioral code (Procedure(In,Out,inout) Constant A,b,c, Begin Xa-in0 .. End) is turned into a DFG, then partitioned and statically scheduled (ILP, software pipelining) into a coarse-grained configuration for clusters joined by a hierarchical mux interconnect (levels 0, 1 and 2; ports N0_i/N0_o, N1_i/N1_o, N2_i/N2_o; data in and data out).]
  • ALUs execute a cyclic micro-sequence
  • Data exchanges go through a hierarchical clustered
    interconnect
  • The configuration step is sequence loading and
    interconnect programming
64
Mapping Flow
  • 3D optimization problem (place/route/schedule)
  • Traditional scheduling techniques for VLIW or
    clustered VLIW don't apply
    • Those solutions don't take into account the spatial
      dimension of the problem
  • Traditional P&R used in FPGAs doesn't apply either,
    because it doesn't consider the time dimension

65
Putting it all together
Year                           2004    2006    2008    2010    2012
Technology Node (nm)           90      65      45      32      22
Loosely coupled Sub-Systems    2       4       6       8       12
General Purpose CPU            Single / Multiple (listed for every year)
Hardware Accelerator           Hardwired / Reconfigurable (listed for every year)
  • Constant SoC Die Size
  • Slow evolution of peripherals (area decrease)
  • GP CPU sub-system complexity 2x at each node
    (constant area)
  • Embedded Memory capacity 2x at each node
    (constant area)
  • Loosely coupled DSP sub-system complexity
    increases by 30% at each node (30% area decrease)

66
What can fit in 45mm² in 45nm
Programmable Multimedia Accelerator
Imaging H/W
192 CNode (40 GOPS)
Video H/W
Interconnect
4MB Multi-port Embedded Memory
L2
Peripherals analog
L1
L1
Host Core 2
Host Core 1