Computer Architecture Embedded Computing

Title: EECS 252 Graduate Computer Architecture Lec XX - TOPIC
Last modified by: WangYX
Created Date: 2/8/2005 3:17:21 AM

Slides: 41
Provided by: netclassD

1
Computer Architecture Embedded Computing
2
Recap: Multithreaded Processors
[Figure: issue-slot occupancy over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing, and simultaneous multithreading designs; each slot is marked as belonging to one of Threads 1-5 or as an idle slot.]
3
Embedded Computing
Sensor Nets
Cameras
Games
Set-top boxes
Media Players
Printers
Robots
Smart phones
Routers
Aircraft
Automobiles
4
What is an Embedded Computer?
  • A computer not used to run general-purpose
    programs, but instead used as a component of a
    larger system. Usually, user does not change the
    computer program (except for manufacturer
    upgrades).
  • Example applications
  • Toasters
  • Cellphone
  • Digital camera (some have several processors)
  • Games machines
  • Set-top boxes (DVD players, personal video
    recorders, ...)
  • Televisions
  • Dishwashers
  • Car (some have dozens of processors)
  • Internet router (some have hundreds to thousands
    of processors)
  • Cellphone basestation
  • .... many more

5
Early Embedded Computing Examples
6
Reducing Cost of Transistors Drives Spread of
Embedded Computing
  • When individuals could afford a single
    transistor, the killer application was the
    transistor radio
  • When individuals could afford thousands of
    transistors, the killer app was the personal
    computer
  • Now that individuals will soon be able to afford
    thousands of processors, what will be the killer apps?
  • In 2007
  • human population growth per day > 200,000
  • cellphones sold per day > 2,000,000

7
What is different about embedded computers?
  • Embedded processors usually optimized to perform
    one fixed task with software from system
    manufacturer
  • General-purpose processors designed to run
    flexible, extensible software systems with code
    from third-party suppliers
  • applications not known at design time
  • Note, many products contain both embedded and
    general-purpose processors
  • e.g., smartphone has embedded processors for
    radio baseband signal processing, and
    general-purpose processors to run third-party
    software applications

8
Lesser emphasis on software portability in
embedded applications
  • Embedded systems
  • can usually recompile/rewrite source code for
    different ISA, and/or use assembler code for new
    application-specific instructions
  • processor pipeline microarchitecture and memory
    capacity and hierarchy known to
    programmer/compiler
  • mix of tasks known to writer of each task,
    usually static; uses custom run-time system
  • each task usually trusts others, can run in
    same address space
  • General-purpose systems
  • must have standard binary interface for
    third-party software
  • compiler doesn't know about this particular
    microarchitecture or memory capacity or hierarchy
    (compiled for general model)
  • unknown mix of tasks, tasks dynamically added and
    deleted from mix; uses general-purpose operating
    system
  • tasks written by various third-parties, mutually
    distrustful, need separate address spaces or
    protection domains

9
Embedded application requirements and constraints
  • Real-time performance
  • hard real-time: if deadline missed, system has
    failed (car brakes!)
  • soft real-time: missing deadline degrades
    performance (skipping frames on DVD playback)
  • Real-world I/O with multiple concurrent events
  • sensors and actuators require continuous I/O
    (can't batch process)
  • non-deterministic concurrent interactions with
    outside world
  • Cost
  • includes cost of supporting structures,
    particularly memory
  • static code size very important (cost of ROM/RAM)
  • often ship millions of copies (worth engineer
    time to optimize cost down)
  • Power
  • expensive package and cooling affects cost,
    system size, weight, noise, temperature

10
What is Performance?
  • Latency (or response time, or execution time)
  • time to complete one task
  • Bandwidth (or throughput)
  • tasks completed per unit time

11
Performance Measurement
  • Average rate: A > B > C
  • Worst-case rate: A < B < C

Which is best for desktop performance? _______
Which is best for hard real-time task? _______
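
One way to read the two questions above is as a choice of metric: throughput-oriented desktop work is usually judged on average rate, while a hard real-time task has to be judged on the worst case. The sketch below makes that concrete. The rate values are hypothetical, chosen only so that they satisfy the slide's orderings (A > B > C on average, A < B < C in the worst case), and `best_by` is an illustrative helper name.

```python
# Hypothetical rates (tasks/second), chosen only to match the slide's orderings:
# average rate A > B > C, worst-case rate A < B < C.
systems = {
    "A": {"average": 100.0, "worst_case": 10.0},
    "B": {"average": 80.0, "worst_case": 30.0},
    "C": {"average": 60.0, "worst_case": 50.0},
}

def best_by(metric):
    """Return the name of the system with the highest rate for the given metric."""
    return max(systems, key=lambda name: systems[name][metric])

desktop_choice = best_by("average")      # judged on typical-case throughput
hard_rt_choice = best_by("worst_case")   # deadlines must hold in the worst case
print(desktop_choice, hard_rt_choice)
```

With these assumed numbers the two metrics pick opposite ends of the range, which is the point of the comparison.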
12
Processors for real-time software
  • Simpler pipelines and memory hierarchies make it
    easier (possible?) to determine the worst-case
    execution time (WCET) of a piece of code
  • Would like to guarantee task completed by
    deadline
  • Out-of-order execution, caches, prefetching,
    branch prediction, make it difficult to determine
    worst-case run time
  • Have to pad WCET estimates for unlikely but
    possible cases, resulting in over-provisioning of
    processor (wastes resources)

13
Power Measurement
  • Energy measured in Joules
  • Power is rate of energy consumption measured in
    Watts (Joules/second)
  • Instantaneous power is Volts × Amps
  • Battery Capacity Measured in Joules
  • 720 Joules/gram for Lithium-Ion batteries
  • 1 instruction on Intel XScale processor takes
    about 1 nJ
  • ⇒ 1 billion executed instructions weigh roughly 1 mg
    (1 J at 720 J/g is about 1.4 mg of battery)
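
The battery arithmetic above can be checked in a few lines. This is a minimal sketch that uses only the slide's two figures (720 J/g and about 1 nJ per instruction); the function and constant names are illustrative.

```python
BATTERY_J_PER_GRAM = 720.0   # lithium-ion energy density from the slide
ENERGY_PER_INSTR_J = 1e-9    # about 1 nJ per instruction (XScale figure from the slide)

def battery_mass_mg(instructions):
    """Battery mass in milligrams consumed by executing the given instruction count."""
    energy_j = instructions * ENERGY_PER_INSTR_J
    return energy_j / BATTERY_J_PER_GRAM * 1000.0

mass_mg = battery_mass_mg(1_000_000_000)
print(f"{mass_mg:.2f} mg")   # on the order of 1 mg
```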

14
Power versus Energy
[Figure: power versus time for systems A and B, with Peak A and Peak B marked. Integrate the power curve to get energy.]
  • System A has higher peak power, but lower total
    energy
  • System B has lower peak power, but higher total
    energy
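
Integrating a sampled power curve can be sketched with the trapezoidal rule. The two traces below are hypothetical, shaped only to match the slide's description: A has the higher peak but the lower total energy, and B has the lower peak but the higher total energy.

```python
def energy_joules(times_s, power_w):
    """Trapezoidal integration of sampled power (W) over time (s)."""
    total = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total

times = [0.0, 1.0, 2.0, 3.0, 4.0]
power_a = [0.0, 4.0, 0.0, 0.0, 0.0]   # hypothetical: tall, short burst
power_b = [0.0, 2.0, 2.0, 2.0, 0.0]   # hypothetical: lower peak, longer run

energy_a = energy_joules(times, power_a)
energy_b = energy_joules(times, power_b)
print(max(power_a), energy_a, max(power_b), energy_b)
```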

15
Power Impacts on Computer System
  • Energy consumed per task determines battery life
  • Second order effect is that higher current draws
    decrease effective battery energy capacity
    (higher power also lowers battery life)
  • Current draw causes IR drops in power supply
    voltage
  • Requires more power/ground pins to reduce
    resistance R
  • Requires thick and wide on-chip metal wires or
    dedicated metal layers
  • Switching current (dI/dt) causes inductive power
    supply voltage bounce ∝ L dI/dt
  • Requires more pins/shorter pins to reduce
    inductance L
  • Requires on-chip/on-package decoupling
    capacitance to help bypass pins during switching
    transients
  • Power dissipated as heat, higher temps reduce
    speed and reliability
  • Requires more expensive packaging and cooling
    systems
  • Fan noise
  • Laptop/handheld case temperature

16
Power Dissipation in CMOS
[Figure: current paths in a CMOS gate driving load capacitance CL: capacitor charging current, short-circuit current, diode leakage current, gate leakage current, and subthreshold leakage current.]
  • Primary Components
  • Capacitor Charging (85% of active power)
  • Energy is 1/2 CV² per transition
  • Short-Circuit Current (10% of active power)
  • When both p and n transistors turn on during
    signal transition
  • Subthreshold Leakage (dominates when inactive)
  • Transistors don't turn off completely, getting
    worse with technology scaling
  • For Intel Pentium-4/Prescott, around 60% of power
    is leakage
  • Optimal setting for lowest total power is when
    leakage around 30-40%
  • Gate Leakage (becoming significant)
  • Current leaks through gate of transistor
  • Diode Leakage (negligible)
  • Parasitic source and drain diodes leak to
    substrate

17
Reducing Switching Power
  • Power ∝ activity × 1/2 CV² × frequency
  • Reduce activity
  • Reduce switched capacitance C
  • Reduce supply voltage V
  • Reduce frequency
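
The proportionality above can be written out as a small function to see how each knob scales power. This is a sketch only: the activity factor, capacitance, voltage, and frequency values are hypothetical, and `switching_power_w` is an illustrative name.

```python
def switching_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """Switching power: activity * 1/2 * C * V^2 * f."""
    return activity * 0.5 * capacitance_f * vdd_v ** 2 * freq_hz

# Hypothetical operating point: 10% activity, 1 nF switched capacitance, 1.8 V, 400 MHz
base = switching_power_w(0.1, 1e-9, 1.8, 400e6)
half_voltage = switching_power_w(0.1, 1e-9, 0.9, 400e6)
half_frequency = switching_power_w(0.1, 1e-9, 1.8, 200e6)

print(half_voltage / base)     # voltage enters quadratically
print(half_frequency / base)   # frequency enters linearly
```

Halving the supply voltage cuts switching power to a quarter, while halving frequency, activity, or capacitance only halves it.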

18
Reducing Activity
  • Bus Encodings
  • choose encodings that minimize transitions on
    average (e.g., Gray code for address bus)
  • compression schemes (move fewer bits)
  • Remove Glitches
  • balance logic paths to avoid glitches during
    settling
  • use monotonic logic (domino)
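
To illustrate the bus-encoding point, the sketch below counts bit transitions on a 4-bit address bus that steps through sequential addresses, once in plain binary and once in Gray code. The sequential access pattern is an assumption; real address streams also contain jumps, where the benefit is smaller.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)

def bus_transitions(words):
    """Count the total number of bit flips between consecutive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

addresses = list(range(16))   # sequential 4-bit address stream (assumed pattern)
binary_flips = bus_transitions(addresses)
gray_flips = bus_transitions([to_gray(a) for a in addresses])
print(binary_flips, gray_flips)
```

In Gray code, consecutive values differ in exactly one bit, so a sequential stream causes one transition per step.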

19
Reducing Switched Capacitance
  • Reduce switched capacitance C
  • Different logic styles (logic, pass transistor,
    dynamic)
  • Careful transistor sizing
  • Tighter layout
  • Segmented structures

20
Reducing Frequency
  • Doesn't save energy, just reduces rate at which
    it is consumed
  • Some saving in battery life from reduction in
    rate of discharge

21
Reducing Supply Voltage
  • Quadratic savings in energy per transition (BIG
    effect)
  • Circuit speed is reduced
  • Must lower clock frequency to maintain correctness

22
Voltage Scaling for Reduced Energy
  • Scaling supply voltage by 0.5 reduces energy
    per transition to 0.25 of original (energy ∝ V²)
  • Performance is reduced: need to use slower clock
  • Can regain performance with parallel architecture
  • Alternatively, can trade surplus performance for
    lower energy by reducing supply voltage until
    just enough performance
  • Dynamic Voltage Scaling

23
Just Enough Performance
  • Save energy by reducing frequency and voltage to
    minimum necessary (usually done in O.S.)
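
A "just enough performance" policy can be sketched as picking the slowest operating point that still meets a deadline. The frequency/voltage pairs below are the Transmeta Crusoe TM5400 points listed elsewhere in this deck. The selection policy, the function name, and the cycle counts are assumptions for illustration, not how any particular operating system implements it.

```python
# (frequency MHz, supply voltage V), slowest first; Crusoe TM5400 points from this deck
OPERATING_POINTS = [
    (200, 1.10), (300, 1.25), (400, 1.40),
    (500, 1.50), (600, 1.60), (700, 1.65),
]

def pick_operating_point(cycles_needed, deadline_s):
    """Return the slowest (lowest-voltage) point that still meets the deadline."""
    for freq_mhz, vdd in OPERATING_POINTS:
        if cycles_needed / (freq_mhz * 1e6) <= deadline_s:
            return freq_mhz, vdd
    return OPERATING_POINTS[-1]   # deadline cannot be met: run at full speed

# A task needing 2.5 million cycles within 10 ms requires at least 250 MHz.
print(pick_operating_point(2_500_000, 0.010))
```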

24
Voltage Scaling on Transmeta Crusoe TM5400
Frequency (MHz)  Relative Performance (%)  Voltage (V)  Relative Energy (%)  Relative Power (%)
700              100.0                     1.65         100.0                100.0
600              85.7                      1.60         94.0                 80.6
500              71.4                      1.50         82.6                 59.0
400              57.1                      1.40         72.0                 41.4
300              42.9                      1.25         57.4                 24.6
200              28.6                      1.10         44.4                 12.7
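
The energy and power columns in the Crusoe table are consistent with a simple model: relative energy scales as (V/Vmax)² and relative power as relative energy times f/fmax. The sketch below recomputes both columns from the frequency and voltage columns under that assumed model. It agrees with the table to within a few tenths of a percentage point (the 400 MHz power entry comes out near 41.1 rather than 41.4).

```python
# (frequency MHz, voltage V) operating points from the Crusoe TM5400 table
points = [(700, 1.65), (600, 1.60), (500, 1.50), (400, 1.40), (300, 1.25), (200, 1.10)]
F_MAX, V_MAX = points[0]

def relative_energy(v):
    """Energy per operation relative to the fastest point, in percent (scales as V^2)."""
    return 100.0 * (v / V_MAX) ** 2

def relative_power(f, v):
    """Power relative to the fastest point, in percent (energy times clock ratio)."""
    return relative_energy(v) * f / F_MAX

rows = [(f, round(relative_energy(v), 1), round(relative_power(f, v), 1)) for f, v in points]
for row in rows:
    print(row)
```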
25
Chip energy versus frequency for various supply
voltages
MIT Scale Vector-Thread Processor, TSMC 0.18µm
CMOS process, 2006
26
Chip energy versus frequency for various supply
voltages
2x Reduction in Supply Voltage
4x Reduction in Energy
MIT Scale Vector-Thread Processor, TSMC 0.18µm
CMOS process, 2006
27
Parallel Architectures Reduce Energy at Constant
Throughput
  • 8-bit adder/comparator
  • 40MHz at 5V, area 530k µm²
  • Base power, Pref
  • Two parallel interleaved adder/compare units
  • 20MHz at 2.9V, area 1,800k µm² (3.4x)
  • Power = 0.36 Pref
  • One pipelined adder/compare unit
  • 40MHz at 2.9V, area 690k µm² (1.3x)
  • Power = 0.39 Pref
  • Pipelined and parallel
  • 20MHz at 2.0V, area 1,961k µm² (3.7x)
  • Power = 0.2 Pref
  • Chandrakasan et al., "Low-Power CMOS Digital
    Design,"
  • IEEE JSSC 27(4), April 1992
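
The power figures above can be related to the switching-power model P ∝ C V² f. The sketch below solves that relation for the effective switched capacitance each design would need, relative to the reference, given only the slide's voltages, clock rates, and power ratios. Treating these designs as pure switching power is an assumption, and the function and variable names are illustrative.

```python
V_REF, F_REF = 5.0, 40.0   # reference design: 5 V, 40 MHz

# (name, supply V, clock MHz, power relative to Pref) from the slide
designs = [
    ("parallel", 2.9, 20.0, 0.36),
    ("pipelined", 2.9, 40.0, 0.39),
    ("pipelined+parallel", 2.0, 20.0, 0.20),
]

def implied_capacitance_ratio(vdd, freq_mhz, power_ratio):
    """Solve P/Pref = (C/Cref) * (V/Vref)^2 * (f/fref) for C/Cref."""
    return power_ratio / ((vdd / V_REF) ** 2 * (freq_mhz / F_REF))

ratios = {name: implied_capacitance_ratio(v, f, p) for name, v, f, p in designs}
for name, ratio in ratios.items():
    print(f"{name}: C/Cref = {ratio:.2f}")
```

Under this model the parallel designs switch more than twice the reference capacitance, yet still use less power because the V² term shrinks faster.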

28
CS252 Administrivia
  • Next project meetings Nov 12, 13, 15
  • Should have interesting results by then
  • Only three weeks left after this to finish
    project
  • Second midterm Tuesday Nov 20 in class
  • Focus on multiprocessor/multithreading issues
  • We'll assume you'll have worked through practice
    questions

29
Embedded memory hierarchies
  • Scratchpad RAMs often used instead, or as well
    as, caches
  • RAM has predictable access latency, simplifies
    execution time analysis for real-time
    applications
  • RAM has lower energy/access (no tag access or
    comparison/multiplexing logic)
  • RAM is cheaper than same size cache (no tags or
    cache logic)
  • Typically no memory protection or translation
  • Code uses physical addresses
  • Often embedded processors will not have direct
    access to off-chip memory (only on-chip RAM)
  • Often no disk or secondary storage (but printers,
    iPods, digital cameras, sometimes have hard
    drives)
  • No swapping or demand-paged virtual memory
  • Often, flash EEPROM storage of application code,
    copied to system RAM/DRAM at boot

30
Reconfigurable lockable caches
  • Many embedded systems allow cache lines to be
    locked in cache to provide RAM-like predictable
    access
  • Lock by set
  • E.g., in an 8KB direct-mapped cache with 32B
    lines (2^13/2^5 = 2^8 = 256 sets), lock half the sets,
    leaving a 4KB cache with 128 sets
  • Have to flush entire cache before changing
    locking by set
  • Lock by way
  • E.g., in a 2-way cache, lock one way so it is
    never evicted
  • Can quickly change amount of cache that is locked
    (doesn't change cache index function)
  • Can be used in both instruction and data caches
  • Lock instructions for interrupt handlers
  • Lock data used by handlers
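
The lock-by-set example can be worked through numerically. The sketch below computes the set counts for the 8KB direct-mapped cache with 32B lines, and shows one reason a flush is needed when locking by set: if the unlocked half is indexed with one fewer index bit (an assumption about the hardware, not something the slide specifies), the same address maps to a different set before and after locking.

```python
def num_sets(cache_bytes, line_bytes, ways=1):
    """Number of sets in a cache of the given size, line size, and associativity."""
    return cache_bytes // (line_bytes * ways)

def set_index(address, line_bytes, sets):
    """Index function: drop the line-offset bits, then take the line number modulo the set count."""
    return (address // line_bytes) % sets

LINE_BYTES = 32
total_sets = num_sets(8 * 1024, LINE_BYTES)   # 2^13 / 2^5 sets
unlocked_sets = total_sets // 2               # lock half the sets
unlocked_bytes = unlocked_sets * LINE_BYTES

# The same address indexes a different set once half the sets are locked.
before = set_index(0x1000, LINE_BYTES, total_sets)
after = set_index(0x1000, LINE_BYTES, unlocked_sets)
print(total_sets, unlocked_sets, unlocked_bytes, before, after)
```

Locking by way leaves the set count, and therefore the index function, unchanged, which matches the slide's note that it can be changed quickly.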

31
Code Size
  • Cost of memory big factor in cost of many
    embedded systems
  • RISC core about same size as 16KB of SRAM
  • Techniques to reduce code size
  • Variable length and complex instructions
  • Compressed Instructions
  • Compressed in memory then uncompressed in cache
  • compressed in cache

[Figure: Intel XScale (2001) die, 16.8 mm² in 180nm, with two 32KB blocks labeled.]
32
Embedded Processor Architectures
  • Wide variety of embedded architectures, but
    mostly based on combinations of techniques
    originally pioneered in supercomputers
  • VLIW instruction issue
  • SIMD/vector instructions
  • Multithreading
  • VLIW more popular here than in general-purpose
    computing
  • Binary code compatibility not as important,
    recompile for new configuration OK
  • Memory latencies are more predictable in embedded
    systems, hence more amenable to static scheduling
  • Lower cost and power compared to out-of-order ILP
    core.

33
System-on-a-Chip Environment
  • Often, a single chip will contain multiple
    embedded cores with multiple private or shared
    memory banks, and multiple hardware accelerators
    for application-specific tasks
  • Multiple dedicated memory banks provide high
    bandwidth, predictable access latency
  • Hardware accelerators can be 100x higher
    performance and/or lower power than software for
    certain tasks
  • Off-chip I/O ports have autonomous data movement
    engines to move data in and out of on-chip memory
    banks
  • Complex on-chip interconnect to connect cores,
    RAMs, accelerators, and I/O ports together

34
Block Diagram of Cellphone SoC (TI OMAP 2420)
35
Classic DSP Processors
36
TI C6x VLIW DSP
VLIW fetch of up to 8 operations/instruction
Dual symmetric ALU/Register clusters (each
4-issue)
37
TI C6x regfile/ALU datapath clusters
32b/40b arithmetic
32b arithmetic, 32b/40b shifts
16b x 16b multiplies
32b arithmetic, address generation
38
Intel IXP Network Processors
[Block diagram labels: RISC control processor; 16 multithreaded microengines; 10Gb/s network interface; buffer RAMs; DRAM0-DRAM2; SRAM0-SRAM3.]
39
Programming Embedded Computers
  • Embedded applications usually involve many
    concurrent processes and handle multiple
    concurrent I/O streams
  • Microcontrollers, DSPs, network processors, media
    processors usually have complex, non-orthogonal
    instruction sets with specialized instructions
    and special memory structures
  • poor compiled code quality ( peak with compiled
    code)
  • high static code efficiency
  • high MIPS/$ and MIPS/W
  • usually assembly-coded in critical routines
  • Worth one engineer-year in code development to
    save $1 on system that will ship 1,000,000 units
  • Assembly coding easier than ASIC chip design
  • But much room for improvement

40
Discussion: Memory consistency models
  • Tutorial on consistency models; Mark Hill's
    position paper
  • Conflict between simpler memory models and
    simpler/faster hardware