Title: SOC Design: From System to Transistor
1SOC Design From System to Transistor
Zoran Stamenkovic
2Outline
- Modeling Systems
- Simulation and Verification
- Analog Integrated Circuits
- Digital Integrated Circuits
- Embedded Memories
- Logic Synthesis
- Design for Testability
- Layout Generation
- Design for Manufacturability
- SOC Example
3Modeling Systems
- Domains and Levels
- ESL Design
- Basics of HDL
- Gate Modeling
- Delay Modeling
- Power Modeling
- Effects of Parasitics
- Logic Optimization
4Domains and Levels
- Open Systems Interconnection (OSI) model of network communication
- Local area network (LAN) technologies are defined by standards that describe unique functions at both the Physical and the Data Link layers
5Domains and Levels
- 802.11 Wireless LAN modem
- Modulates outgoing digital signals from a computer or other digital device to an analogue (radio) signal
- Demodulates the incoming analogue (radio) signal and converts it to a digital signal for the digital device
6MIMO and MIMAX WLAN Modems
Domains and Levels
Signal processing performed in the analogue RF domain
Number of the digital basebands reduced to a single one
Signal processing performed in the digital baseband
7Domains and Levels
8Behavioral Domain
9Structural Domain
10Physical Domain
11Electronic System Level Design
- The point of a system level model is to capture the intent of the design
- Design does exactly what it is defined to do, and the model is the definition of what the design does
- It allows software developers to test their code on a working model
- The value of system level modeling is in helping us to understand the implications of our intent
- To explore responses to the stimulus in a useful way
- ESL Languages
- UML, SystemC, SystemVerilog
- ESL Verification
- "No amount of experimentation can ever prove me right; a single experiment can prove me wrong." (Albert Einstein)
- The system level testbench languages and methodologies that exist today are woefully inadequate
- If one tries to capture enough information in ESL to verify RTL, then one might as well write RTL
12Electronic System Level Design
- The environment provides models of memories, connectors, and queues that can be interconnected with configured processors into an overall system model
- Processor and device interfaces are at the transaction level
- Transaction-level modeling requires SOC architecture assembly and simulation tools
- If RTL IP blocks are present, HW/SW co-verification tools are needed
13Electronic System Level Design
14Hardware Description Languages
- Motivation for HDL
- Increased hardware complexity
- Design space exploration
- Inexpensive alternative to prototyping
- General features
- Support for describing circuit connectivity
- High-level programming language support for describing behavior
- Support for timing information (constraints, etc.)
- Support for concurrency
- VHDL
- IEEE Standard 1076-1987
- IEEE Standard 1076-1993
- Extension VHDL-AMS-1999
- Verilog
- IEEE Standard 1364-1995
- IEEE Standard 1364-2001
15Modeling Interfaces
- Entity (VHDL) or Module (Verilog) declaration
- Describes the input/output ports of a module
16Modeling Behavior
- Architecture Body (VHDL)
- Describes an implementation of an entity
- May be several per entity
- Module (Verilog)
- Is unique
- Behavioral Architecture
- Describes the algorithm performed by the module
- Contains
- Procedural Statements, each containing
- Sequential Statements, including
- Assignment Statements and
- Wait Statements
17Behavior Example
VHDL

entity reg3 is
  port ( d0, d1, d2, en, clk : in bit;
         q0, q1, q2 : out bit );
end;

architecture behav of reg3 is
begin
  process ( d0, d1, d2, en, clk )
  begin
    if en = '1' and clk = '1' then
      q0 <= d0 after 5 ns;
      q1 <= d1 after 5 ns;
      q2 <= d2 after 5 ns;
    end if;
  end process;
end;

Verilog

`timescale 1ns/10ps
module reg3 ( d0, d1, d2, en, clk, q0, q1, q2 );
  input d0, d1, d2, en, clk;
  output q0, q1, q2;
  reg q0, q1, q2;
  always @( d0 or d1 or d2 or en or clk )
    if ( en & clk ) begin
      q0 <= #5 d0;
      q1 <= #5 d1;
      q2 <= #5 d2;
    end
endmodule
18Modeling Structure
- Structural Architecture
- Implements the module as a composition of components
- Contains
- Signal Declarations (entity ports are also signals)
- Declare internal connections
- Component Instances
- Instantiate previously declared entity/architecture pairs
- Port Maps in component instances
- Connect signals to component ports
- Wait Statements
- Suspend a process or procedure
19Structure Example
20Structure Example
21Structure Example
22Mixing Behavior and Structure
- An architecture can contain both behavioral and structural parts
- Process Statements and Component Instances
- Collectively called Concurrent Statements
- Processes can read and assign to signals
- Example: Register-transfer-language model
- Data-path described structurally
- Control section described behaviorally
23Mixed Example
24Mixed Example
entity multiplier is
  port ( clk, reset : in bit;
         multiplicand, multiplier : in integer;
         product : out integer );
end;

architecture mixed of multiplier is
  signal partial_product, full_product : integer;
  signal arith_control, result_en, mult_bit, mult_load : bit;
begin
  arith_unit : entity work.shift_adder(behavior)
    port map ( addend => multiplicand, augend => full_product,
               sum => partial_product, add_control => arith_control );
  result : entity work.reg(behavior)
    port map ( d => partial_product, q => full_product,
               en => result_en, reset => reset );
  ...
25Mixed Example
  multiplier_sr : entity work.shift_reg(behavior)
    port map ( d => multiplier, q => mult_bit,
               load => mult_load, clk => clk );
  product <= full_product;
  control_section : process is
    -- variable declarations for control_section
    -- ...
  begin
    -- sequential statements to assign values to control signals
    -- ...
    wait on clk, reset;
  end process control_section;
end;
26Logic Functions
- Function
- $f = a'b + ab'$: a is a variable, a and a' are literals, ab' is a term
- Irredundant Function
- No literal can be removed without changing its value
- Implementing logic functions is non-trivial
- No logic gates in the library for all logic expressions
- A logic expression may map into gates that consume a lot of area, time, or power
- A set of functions f1, f2, ... is complete if every Boolean function can be generated by a combination of the functions from the set
- NAND is a complete set
- NOR is a complete set
- AND and OR are not complete
- Transmission gates are not complete
- Incomplete set of logic gates
- No way to design arbitrary logic
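The completeness claim for NAND can be checked by construction. The sketch below, with hypothetical helper names, builds NOT, AND, OR and the XOR-style function f = a'b + ab' from a two-input NAND only, and compares each against its truth table.

```python
from itertools import product

def nand(a, b):
    """Two-input NAND on bits 0/1."""
    return 1 - (a & b)

def not_(a):
    return nand(a, a)                 # NAND with tied inputs inverts

def and_(a, b):
    return not_(nand(a, b))           # invert the NAND output

def or_(a, b):
    return nand(not_(a), not_(b))     # De Morgan: a + b = (a'b')'

def f(a, b):
    """f = a'b + ab', built from NAND-derived gates only."""
    return or_(and_(not_(a), b), and_(a, not_(b)))

# Exhaustive check against the reference truth tables.
for a, b in product((0, 1), repeat=2):
    assert not_(a) == 1 - a
    assert and_(a, b) == (a & b)
    assert or_(a, b) == (a | b)
    assert f(a, b) == (a ^ b)
```

Since NOT, AND and OR suffice to write any sum-of-products expression, this is enough to show that NAND alone is a complete set.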
27Inverter
28Inverter
29Switches
- Complementary switch produces full-supply voltages for both logic 0 and logic 1
- n-type transistor conducts logic 0
- p-type transistor conducts logic 1
30NAND Gate
31NOR Gate
32AOI/OAI Gates
- AOI: and/or/invert
- OAI: or/and/invert
- Implement larger functions
- Pull-up and pull-down networks are compact
- Smaller area, higher speed than NAND/NOR network equivalents
- AOI312
- And 3 inputs
- And 1 input (dummy)
- And 2 inputs
- Or together these terms
- Invert
out abc
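The AOI312 naming rule above can be written as a small behavioral model. This is a sketch only; the input names a to f are assumptions chosen for illustration and do not come from a cell library.

```python
def aoi312(a, b, c, d, e, f):
    """AOI312: AND of 3 inputs, a single (dummy AND) input, AND of 2 inputs,
    ORed together and inverted. All values are bits 0/1."""
    and3 = a & b & c      # first group: 3-input AND
    and1 = d              # second group: 1-input "AND" is just the input
    and2 = e & f          # third group: 2-input AND
    return 1 - (and3 | and1 | and2)

# Output is 1 only when none of the three AND groups is satisfied.
print(aoi312(0, 0, 0, 0, 0, 0))
print(aoi312(1, 1, 1, 0, 0, 0))
```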
33Logic Levels
- Solid logic 0/1 defined by VSS/VDD
- Inner bounds of logic values VL/VH are not
directly determined by circuit properties, as in
some other logic families
- Levels at output of one gate must be sufficient
to drive next gate
34Inverter Transfer Curve
- Choose threshold voltages at points where slope of transfer curve is -1
- Inverter has
- High gain between VIL and VIH points
- Low gain at outer regions of transfer curve
- Note that logic 0 and 1 regions are not equally sized
- In this case, high pull-up resistance leads to smaller logic 1 range
- Noise margins are VDD-VIH and VIL-VSS
- Noise must exceed noise margin to make second gate produce wrong output
35Inverter Delay
- Only one transistor is on at a time
- Rise time (pull-up on)
- Fall time (pull-up off)
- Resistor model of transistor
- Ignores saturation region
- Mischaracterizes linear region
- Gives acceptable results
36RC Model for Delay
- Delay
- Time required for gate's output to reach 50% of final value
- Transition time
- Time required for gate's output to reach 10% (logic 0) or 90% (logic 1) of final value
- Gate delay based on RC time constant
- $V_{out}(t) = V_{DD} \exp\{-t / [(R_n + R_L) C_L]\}$
- $t_d = 0.69\, R_n C_L$
- $t_f = 2.3\, R_n C_L$
- 0.5 µm process
- $R_n = 3.9\ \mathrm{k\Omega}$
- $C_L = 0.68\ \mathrm{fF}$
- $t_d = 0.69 \times 3.9 \times 10^{3} \times 0.68 \times 10^{-15} = 1.8\ \mathrm{ps}$
- $t_f = 2.3 \times 3.9 \times 10^{3} \times 0.68 \times 10^{-15} = 6.1\ \mathrm{ps}$
- For pull-up time, use pull-up resistance
- Current source model (in power/delay studies)
- $t_f \approx C_L (V_{DD} - V_{SS}) / [0.5\, k' (W/L) (V_{DD} - V_{SS} - V_t)^2]$
- Fitted model
- Fit curve to measured circuit characteristics
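The RC-model numbers for the 0.5 µm example (pull-down resistance 3.9 kΩ, load 0.68 fF) can be reproduced with a few lines. In this sketch the factors 0.69 and 2.3 are taken as ln 2 and ln 10, which is what an exponential discharge to 50% and to 10% of the initial value gives; the function names are hypothetical.

```python
import math

def rc_delay(r_ohm, c_farad):
    """Time for an RC discharge to reach 50% of its initial value: ln(2) * R * C."""
    return math.log(2) * r_ohm * c_farad

def rc_fall_time(r_ohm, c_farad):
    """Time for an RC discharge to reach 10% of its initial value: ln(10) * R * C."""
    return math.log(10) * r_ohm * c_farad

R_N = 3.9e3       # pull-down resistance in ohms (value from the slide)
C_L = 0.68e-15    # load capacitance in farads (value from the slide)

t_d = rc_delay(R_N, C_L)
t_f = rc_fall_time(R_N, C_L)
print(f"t_d = {t_d * 1e12:.1f} ps, t_f = {t_f * 1e12:.1f} ps")
```

For the pull-up transition the same functions apply with the pull-up resistance substituted for `R_N`.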
37Step Input (VGS VDD) Approximation
38Body Effect
- Source voltage of gates in middle of network may not equal substrate voltage
- Difference between source and substrate voltages causes body effect
- To minimize body effect
- Put early arriving signals at transistors closest
to power supply
39Power Consumption
- Clock frequency
- $f = 1/t$
- Energy
- $E = C_L (V_{DD} - V_{SS})^2$
- Power
- $E \times f = f\, C_L (V_{DD} - V_{SS})^2$
- Almost all power consumption comes from switching behavior
- A single cycle requires one charge and one discharge of capacitor
- Static power dissipation
- Comes from leakage currents
- Surprising result
- Resistance of the pull-up/pull-down transistor drops out of energy calculation
- Power consumption is independent of the sizes of the pull-up and pull-down transistors
- Static CMOS power-delay product is independent of frequency
- Voltage scaling depends on this fact
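A short numeric sketch of the switching-power relation follows. The supply voltage (3.3 V) and clock frequency (100 MHz) are assumptions picked only for illustration; the load capacitance reuses the 0.68 fF value from the delay example.

```python
def switching_energy(c_load, v_dd, v_ss=0.0):
    """Energy per cycle (one charge plus one discharge): C_L * (VDD - VSS)^2."""
    return c_load * (v_dd - v_ss) ** 2

def switching_power(c_load, v_dd, freq, v_ss=0.0):
    """Dynamic power: f * C_L * (VDD - VSS)^2. No transistor resistance appears."""
    return freq * switching_energy(c_load, v_dd, v_ss)

C_L = 0.68e-15    # farads
V_DD = 3.3        # volts (assumed)
F_CLK = 100e6     # hertz (assumed)

p_full = switching_power(C_L, V_DD, F_CLK)
p_half = switching_power(C_L, V_DD / 2, F_CLK)
print(f"P at VDD:   {p_full:.3e} W")
print(f"P at VDD/2: {p_half:.3e} W (ratio {p_half / p_full:.2f})")
```

Halving the supply voltage at the same frequency quarters the switching power, which is the motivation for voltage scaling.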
40Effects of Parasitics
- Capacitance on power supply is not bad
- Can be good in absence of inductance
- Resistance slows down static gates
- May cause pseudo-nMOS circuits to fail
- Increasing capacitance/resistance
- Reduces input slope
- Resistance near source is more damaging
- It must charge more capacitance
41Optimal Sizing
- Sometimes, large loads must be driven
- Off-chip or by long wires on-chip
- Sizing up the driver transistors only pushes back the problem
- Driver now presents larger capacitance to earlier stage
- Use a chain of inverters
- Each stage has transistors larger than previous stage
- $\alpha$ is the driver size ratio, $C_{big}/C_d = \alpha^n$, $\ln(C_{big}/C_d) = n \ln\alpha$
- Minimize total delay through the driver chain
- $t_{tot} = \ln(C_{big}/C_d)\,(\alpha/\ln\alpha)\, t_d$
- Optimal driver size ratio is $\alpha_{opt} = e$
- Optimal number of stages is $n_{opt} = \ln(C_{big}/C_d)$
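The sizing rule can be turned into a small calculation. Because the number of stages must be an integer, this sketch rounds ln(Cbig/Cd) to the nearest integer and then recomputes the per-stage ratio so that the chain still spans the full capacitance ratio; that rounding step is an assumption added here, not part of the slide.

```python
import math

def size_driver_chain(c_big, c_d, t_d=1.0):
    """Return (n, alpha, t_tot) for an exponentially tapered inverter chain.

    n     : integer number of stages, rounded from ln(c_big / c_d)
    alpha : per-stage size ratio so that alpha**n == c_big / c_d
    t_tot : total delay n * alpha * t_d (t_d = delay of a minimum stage
            driving an identical stage)
    """
    ratio = c_big / c_d
    n = max(1, round(math.log(ratio)))
    alpha = ratio ** (1.0 / n)
    t_tot = n * alpha * t_d
    return n, alpha, t_tot

n, alpha, t_tot = size_driver_chain(c_big=1000.0, c_d=1.0)
print(f"stages = {n}, ratio per stage = {alpha:.3f}, delay = {t_tot:.2f} t_d")
```

With a capacitance ratio of 1000 the rounded chain has seven stages with a ratio close to e, and the total delay stays close to the ideal value e·ln(1000)·t_d.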
42Driving Large Fan-Out
- Fan-out adds capacitance
- Increase sizes of driver transistors
- Must take into account rules for driving large loads
- Add intermediate buffers
- This may require/allow restructuring of the logic
43Path Delay
- Network delay is measured over paths through network
- Can trace a causality chain from inputs to worst-case output
- Critical path creates longest delay
- Can trace transitions which cause delays that are elements of the critical path delay
- To reduce circuit delay, speed up the critical path
- Reducing delay off the path doesn't help
- There may be more than one path of the same delay
- Must speed up all equivalent paths to speed up circuit
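Finding the critical path amounts to a longest-path search over the gate network. The sketch below assumes a purely combinational (acyclic) network, a fixed delay per gate, and zero delay at primary inputs; the netlist and delay values are made up for illustration. It ignores false paths, so it gives the pessimistic topological estimate.

```python
from graphlib import TopologicalSorter

def critical_path(fanin, delay):
    """fanin: node -> list of driving nodes; delay: node -> gate delay.
    Returns (worst arrival time, list of nodes on one critical path)."""
    arrival, worst_pred = {}, {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(fanin).static_order():
        preds = fanin.get(node, [])
        pred = max(preds, key=lambda p: arrival[p], default=None)
        worst_pred[node] = pred
        start = arrival[pred] if pred is not None else 0.0
        arrival[node] = start + delay.get(node, 0.0)
    end = max(arrival, key=arrival.get)
    total = arrival[end]
    path = []
    while end is not None:          # walk back along the latest-arriving inputs
        path.append(end)
        end = worst_pred[end]
    return total, path[::-1]

# Hypothetical network: primary inputs a, b, c; gates g1..g4.
fanin = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"], "g4": ["g3", "c"]}
delay = {"g1": 2.0, "g2": 3.0, "g3": 1.0, "g4": 2.0}
print(critical_path(fanin, delay))
```

Speeding up `g1` in this example would not change the result, because `g1` is not on the critical path.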
44False Paths
- Logic gates are not simple nodes
- Some input changes don't cause output changes
- A false path is a path which cannot be exercised due to Boolean gate conditions
- False paths cause pessimistic delay estimates
45Logic Transformations
- Rewrite by using sub-expressions
- Logic rewrites may affect gate placement
- Flattening logic
- Increases gate fan-in
- Logic synthesis programs
- Transform Boolean expressions into logic gate
networks in a particular library
Deep Logic
Shallow Logic
46Logic Optimization
- Optimization goals
- Minimize area, meet delay constraint
- Technology-independent optimization
- Works on Boolean expression equivalent
- Estimates size based on number of literals
- Uses factorization, resubstitution, minimization, etc.
- Uses simple delay models
- Technology-dependent optimization
- Maps Boolean expressions into a particular cell library
- May perform some optimizations in addition to simple mapping
- Allows more accurate delay models
47Simulation and Verification
- Simulation
- Verification
- Annotation
48Simulation
- Simulation
- Tests the functionality of a design's elaborated model
- Needs a test bench and a simulation tool
- Advances in discrete time steps
- Test Bench
- Includes an instance of the design under test
- Applies sequences of test values to inputs
- Monitors signal values on outputs using simulator
- Simulation Tools
- NCSIM (Cadence)
- VSIM (Mentor Graphics)
- VCS (Synopsys)
49Event-Driven Simulation
- Event-driven simulation is designed for digital circuit characteristics
- Small number of signal values
- Relatively sparse activity over time
- Event-driven simulators try to update only those signals which change in order to reduce CPU time requirements
- An event is a change in a signal value
- A time-wheel is a queue of events
- Simulator traces structure of circuit to determine causality of events
- Event at input of one gate may cause new event at gate's output
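The event-queue idea can be illustrated with a very small simulator. This is a minimal sketch, not how a commercial simulator is built: it uses a priority queue in place of a time-wheel, a fixed delay per gate, and no cancellation of pending events; the netlist format and gate names are hypothetical.

```python
import heapq
from itertools import count

def simulate(gates, values, stimuli):
    """gates: output signal -> (function, input signal names, delay).
    values: initial value of every signal. stimuli: list of (time, signal, value).
    Returns the list of events (time, signal, new value) in time order."""
    fanout = {}
    for out, (_, ins, _) in gates.items():
        for sig in ins:
            fanout.setdefault(sig, []).append(out)
    seq = count()                                   # tie-breaker for equal times
    queue = [(t, next(seq), s, v) for t, s, v in stimuli]
    heapq.heapify(queue)
    trace = []
    while queue:
        t, _, sig, val = heapq.heappop(queue)
        if values[sig] == val:
            continue                                # no change, so no event
        values[sig] = val
        trace.append((t, sig, val))
        for out in fanout.get(sig, []):             # only affected gates are re-evaluated
            func, ins, d = gates[out]
            new = func(*(values[i] for i in ins))
            heapq.heappush(queue, (t + d, next(seq), out, new))
    return trace

nand = lambda a, b: 1 - (a & b)
inv = lambda a: 1 - a
gates = {"n1": (nand, ("a", "b"), 2), "y": (inv, ("n1",), 1)}
values = {"a": 0, "b": 1, "n1": 1, "y": 0}
for event in simulate(gates, values, [(0, "a", 1), (10, "b", 0)]):
    print(event)
```

Between the stimulus times nothing is evaluated at all, which is where the saving in CPU time comes from when activity is sparse.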
50Switch Simulation
- Special type of event-driven simulation optimized for MOS transistors
- Treats the transistor as a switch
- Takes capacitance into account to model charge sharing
- Can also be enhanced to model the transistor as a resistive switch
51Test Bench Example
entity test_bench is
end;

architecture test_reg3 of test_bench is
  signal d0, d1, d2, en, clk, q0, q1, q2 : bit;
begin
  dut : entity work.reg3(behav)
    port map ( d0, d1, d2, en, clk, q0, q1, q2 );
  stimulus : process is
  begin
    d0 <= '1'; d1 <= '1'; d2 <= '1';
    wait for 20 ns;
    en <= '0'; clk <= '0';
    wait for 20 ns;
    en <= '1';
    wait for 20 ns;
    clk <= '1';
    wait for 20 ns;
    d0 <= '0'; d1 <= '0'; d2 <= '0';
    wait for 20 ns;
    wait;
  end process stimulus;
end;
52Verification
- To test a refinement of a design
- Low-level structural model must be functionally the same as a corresponding behavioral model
- To include two instances of a design in the test bench
- To stimulate both with same test values on inputs
- To compare values of outputs for equality
- To take account of timing differences
- Zero delay
- Unit delay
- Gate delay
- RC delay
53Verification Example
architecture regression of test_bench is
  signal d0, d1, d2, d3, en, clk : bit;
  signal q0a, q1a, q2a, q3a, q0b, q1b, q2b, q3b : bit;
begin
  dut_a : entity work.reg4(struct)
    port map ( d0, d1, d2, d3, en, clk, q0a, q1a, q2a, q3a );
  dut_b : entity work.reg4(behav)
    port map ( d0, d1, d2, d3, en, clk, q0b, q1b, q2b, q3b );
  stimulus : process is
  begin
    d0 <= '1'; d1 <= '1'; d2 <= '1'; d3 <= '1';
    wait for 20 ns;
    en <= '0'; clk <= '0';
    wait for 20 ns;
    en <= '1';
    wait for 20 ns;
    clk <= '1';
    wait for 20 ns;
    wait;
  end process stimulus;
  ...
54Verification Example
  verify : process is
  begin
    wait for 10 ns;
    assert q0a = q0b and q1a = q1b and q2a = q2b and q3a = q3b
      report "implementations have different outputs"
      severity error;
    wait on d0, d1, d2, d3, en, clk;
  end process verify;
end architecture regression;
55Annotation
- Standard Delay Format (SDF) annotation
- Design timing is stored in an SDF file
- Used to iteratively improve design
- Updates a more-abstract design with information from later design stages
- Annotation of logic schematic with extracted parasitic resistances and capacitances
- Back annotation requires tools to know more about each other
- Simulation tools
- Synthesis tools
- Layout tools
56Standard Delay Format
(DELAYFILE
  (SDFVERSION "OVI 1.0")
  (DESIGN "tcp_1_chip")
  (DATE "Fri Apr 30 09:48:22 2004")
  (VENDOR "cdr3synPwcslV225T125")
  (PROGRAM "Synopsys Design Compiler cmos")
  (VERSION "2003.06")
  (DIVIDER /)
  (VOLTAGE 2.25:2.25:2.25)
  (PROCESS)
  (TEMPERATURE 125.00:125.00:125.00)
  (TIMESCALE 1ns)
  (CELL
    (CELLTYPE "tcp_1_chip")
    (INSTANCE)
    (DELAY
      (ABSOLUTE
        (INTERCONNECT U5/x U81/a (0.000:0.000:0.000))
        (INTERCONNECT U73/x U74/a (0.000:0.000:0.000))
        ...
  (CELL
    (CELLTYPE "exnor2_1")
    (INSTANCE i_aes_wr/U_ALG/U6533)
    (DELAY
      (ABSOLUTE
        (IOPATH a x (0.662:1.045:1.045) (0.682:1.076:1.076))
        (IOPATH b x (1.379:1.416:1.416) (1.454:1.492:1.492))
      )
    )
  )
  ...
  (CELL
    (CELLTYPE "mux2_2")
    (INSTANCE i_mips/u0/ejt_tap\/pa_addr_reg_next\/bit_00i/U1)
    (DELAY
      (ABSOLUTE
        (IOPATH d0 x (0.395:0.395:0.395) (0.464:0.464:0.464))
        (IOPATH d1 x (0.387:0.403:0.403) (0.447:0.477:0.477))
        (IOPATH sl x (1.768:1.781:1.781) (1.879:1.892:1.892))
      )
    )
  )
)
57Analog Integrated Circuits
- Filters
- Amplifiers
- Phase-Locked Loop
- Voltage-Controlled Oscillator
- Modulator/Demodulator
58Fairchild Semiconductor µA741 Op-Amp
- In 1963, a 26-year-old engineer named Robert Widlar designed the first monolithic op-amp IC, the µA702
- Price at the beginning was US $300
- Fairchild and competitors have sold it in the hundreds of millions
- Now, for US $300 you can get about a thousand of today's 741 chips
59Signetics NE555 Timer
- A simple IC from 1971 that could function as a timer or an oscillator
- It would become a best seller in analog semiconductors
- Kitchen appliances
- Toys
- Spacecraft
- A few thousand other things
- Many billions have been sold
60Intersil ICL8038 Waveform Generator
- A generator of sine, square, triangular, sawtooth, and pulse waveforms from 1983
- Countless applications
- Music synthesizers
- Blue boxes
- Hundreds of millions sold
- Intersil discontinued the production in 2002
61LNA in BiCMOS Technology
62PLL for 802.11a WLAN
63Oscillator
64Modulator
65Digital Integrated Circuits
- Adders
- Multipliers
- Shifters
- Carry Units
- Arithmetic-Logic Units
66Full Adder
- Computes one-bit sum and carry
- $s_i = a_i \oplus b_i \oplus c_{in}$
- $c_{out} = a_i b_i + a_i c_{in} + b_i c_{in}$
- Ripple-carry adder: n-bit adder built from full adders
- Delay of ripple-carry adder goes through all carry bits
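The full-adder equations and the ripple-carry structure can be modelled bit by bit. In this sketch the operands are lists of bits with the least significant bit first, which is a representation chosen here for convenience.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: sum = a xor b xor c_in, carry = majority(a, b, c_in)."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (a & c_in) | (b & c_in)
    return s, c_out

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """n-bit adder built from full adders; bit lists are LSB first.
    The carry ripples through every stage, which sets the worst-case delay."""
    sum_bits, carry = [], c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry

# 6 + 9 with 4-bit operands (LSB first): 0110 + 1001
print(ripple_carry_add([0, 1, 1, 0], [1, 0, 0, 1]))
```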
67Combinational Multiplier
        0 1 1 0    multiplicand
      x 1 0 0 1    multiplier
        0 1 1 0
      0 0 0 0
      0 0 1 1 0
    0 0 0 0
    0 0 0 1 1 0
  0 1 1 0
  0 1 1 0 1 1 0
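The worked example (0110 × 1001) follows the usual shift-and-add scheme: each multiplier bit selects either the shifted multiplicand or zero, and the partial products are accumulated. A minimal sketch, with a hypothetical function name and unsigned operands assumed:

```python
def shift_add_multiply(multiplicand, multiplier, n_bits=4):
    """Unsigned multiplication by accumulating shifted partial products."""
    partial_sum = 0
    for i in range(n_bits):
        if (multiplier >> i) & 1:                 # multiplier bit i selects the row
            partial_sum += multiplicand << i      # row is the multiplicand shifted by i
    return partial_sum

product = shift_add_multiply(0b0110, 0b1001)
print(format(product, "07b"))
```

A combinational multiplier computes all rows at once in hardware; the loop here only mirrors the order in which the rows are summed in the example.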
68Array Multiplier
- Array multiplier is an efficient layout of a combinational multiplier
- Array multipliers may be pipelined to decrease clock period at the expense of latency
69Wallace Tree
- Reduces depth of adder chain
- Built from carry-save adders
- Three inputs a, b, c
- Produces two outputs y, z
- $y + z = a + b + c$
- Carry-save equations
- $y_i = \mathrm{parity}(a_i, b_i, c_i)$
- $z_i = \mathrm{majority}(a_i, b_i, c_i)$
- At each stage, i numbers are combined to form $\lceil 2i/3 \rceil$ sums
- Final adder completes the summation
- Wiring is more complex
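The carry-save step and the tree reduction can be sketched on ordinary integers, treating each integer as a bit vector. One detail is made explicit here as an assumption about weighting: the majority bits are carries, so the word z is shifted left by one position before it is added, which is what makes y + z = a + b + c hold.

```python
def carry_save_add(a, b, c):
    """Reduce three numbers to two without propagating carries.
    y holds the per-bit parity, z the per-bit majority shifted to carry weight."""
    y = a ^ b ^ c
    z = ((a & b) | (a & c) | (b & c)) << 1
    return y, z

def wallace_sum(numbers):
    """Repeatedly combine groups of three operands until two remain,
    then let a final (carry-propagating) adder complete the sum."""
    nums = list(numbers)
    while len(nums) > 2:
        full = len(nums) - len(nums) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(carry_save_add(*nums[i:i + 3]))
        reduced.extend(nums[full:])       # leftovers pass to the next stage
        nums = reduced
    return sum(nums)

print(carry_save_add(5, 6, 7))
print(wallace_sum([3, 5, 7, 9, 11, 13]))
```

Six operands shrink to four, then three, then two, so only the last addition has a carry chain.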
70Serial-Parallel Multiplier
- Used in serial-arithmetic operations
- Multiplicand can be held in place by register
- Multiplier is shifted into array
71Barrel Shifter
- Can perform n-bit shifts in a single cycle
- Accepts 2n data inputs and n control signals, producing n data outputs
- Selects arbitrary contiguous n bits out of 2n input bits
- Examples
- Right shift: data into top, 0 into bottom
- Left shift: 0 into top, data into bottom
- Rotate: data into top and bottom
72Barrel Shifter
- Two-dimensional array of 2n vertical × n horizontal cells
- Input data travels diagonally upward
- Output wires travel horizontally
- Control signals run vertically
- Exactly one control signal is set to 1, turning on all transmission gates in that column
- Large number of cells, but each one is small
- Delay is large, considering long wires and transmission gates
73Carry-Lookahead Unit
- First computes carry propagate and generate
- $P_i = a_i + b_i$
- $G_i = a_i b_i$
- Computes sum and carry from P and G
- $s_i = c_i \oplus P_i \oplus G_i$
- $c_{i+1} = G_i + P_i c_i$
- Can recursively expand carry formula
- $c_{i+1} = G_i + P_i (G_{i-1} + P_{i-1} c_{i-1})$
- $c_{i+1} = G_i + P_i G_{i-1} + P_i P_{i-1} (G_{i-2} + P_{i-2} c_{i-2})$
- Expanded formula does not depend on intermediate carries
- Allows carry for each bit to be computed independently
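The expanded carry formula can be checked against ordinary addition. In this sketch each carry is computed only from the P and G signals and the carry-in of the whole adder, never from a neighbouring carry; propagate is taken as the OR of the operand bits and generate as the AND. The bit-list helpers are hypothetical conveniences.

```python
def to_bits(x, n):
    """Integer to list of n bits, LSB first."""
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    """List of bits (LSB first) to integer."""
    return sum(b << i for i, b in enumerate(bits))

def cla_add(a_bits, b_bits, c0=0):
    """Carry-lookahead addition: every carry uses the fully expanded formula
    c[i+1] = G[i] + P[i]G[i-1] + ... + P[i]...P[0]c0."""
    p = [a | b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    carries = [c0]
    for i in range(len(p)):
        c = g[i]
        prod = p[i]                       # running product P[i]...P[j+1]
        for j in range(i - 1, -1, -1):
            c |= prod & g[j]
            prod &= p[j]
        c |= prod & c0
        carries.append(c)
    s = [carries[i] ^ p[i] ^ g[i] for i in range(len(p))]
    return s, carries[-1]

s, c_out = cla_add(to_bits(6, 4), to_bits(9, 4))
print(from_bits(s), c_out)
```

The inner loop grows with the bit position, which mirrors the growing fan-in of the gates needed for the deepest carry.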
74Depth-4 Carry-Lookahead Unit
- Deepest carry expansion requires gates with large fan-in
- Large and slow
- Carry-lookahead unit requires complex wiring between adders and lookahead unit
- Values must be routed back from lookahead unit to adder
75Carry-Skip Adder
- Looks for cases in which carry out of a set of bits is identical to carry in
- Typically organized into m-bit stages
- If $a_i \neq b_i$ for every bit in stage, then bypass gate sends stage's carry input directly to carry output
76Carry-Select Adder
- Computes two results in parallel, each for different carry input assumptions
- Uses actual carry in to select correct result
- Reduces delay to multiplexer
77Manchester Carry Chain
- Precharged carry chain which uses P and G signals
- Propagate signal connects adjacent carry bits
- Generate signal discharges carry bit
- Worst-case discharge path goes through entire
carry chain
78Serial Adder
- May be used in signal-processing arithmetic where fast computation is important but latency is unimportant
- LSB control signal clears the carry shift register
79Arithmetic-Logic Unit
- Computes a variety of logical and arithmetic functions based on opcode
- May offer complete set of functions of two variables or a subset
- Built around adder, since carry chain determines delay
- Function block may be used to compute required intermediate signals for a full-function ALU
- Transmission gates may introduce significant delay
80Arithmetic-Logic Unit
- P and G compute intermediate values from inputs
- May not correspond to carry lookahead P and G for non-addition functions
- Add unit is adder of choice
- Output unit computes from sum, propagate signal
81Acorn Computers ARM1 Processor
- 32-bit RISC microprocessor from 1985
- The simplicity made all the difference
- Small, low power, and easy to program
- ARM architecture has become the dominant embedded processor
- More than 10 billion ARM cores have been used in all sorts of gadgetry, including the iPhone
82Computer Cowboys Sh-Boom Processor
- In 1988, Russell Fish and Chuck Moore found a way to have the processor run its own super fast internal clock while still staying synchronized with the rest of the computer
- In the years since Sh-Boom's invention, the speed of processors has by far surpassed that of motherboards, and so practically every maker of computers and consumer electronics wound up using the same solution
- Since 2006, Patriot Scientific (and Moore) have reaped over US $125 million in licensing fees from Intel, AMD, Sony, Olympus, and others
838-bit Microprocessors
- Microchip Technology PIC16C84 8-bit microcontroller in 1993
- Incorporates EEPROM
- Does not need UV light to be erased as EPROM needs
- Radiation-hardened RCA CDP 1802 8-bit microprocessor in 1976
- One of the first, if not the first, CMOS processors
- Low power consumption, wide range of operating voltages and military operating temperature range
84Embedded Memories
- Read-Only Memory
- Static Random-Access Memory
- Dynamic Random-Access Memory
- Memory Generators
85Memory Architecture
- Address is divided into row and column
- Row may contain full word or more than one word
- Selected row drives/senses bit lines in columns
- Amplifiers/drivers read/write bit lines
86Read-Only Memory (ROM)
- ROM core is organized as an array of NOR gates
- Pull-down transistors of NOR determine programming
- Erasable ROMs require special processing that is not typically available
- ROMs on digital ICs are generally mask-programmed
- Placement of pull-downs determines ROM contents
87Static Random-Access Memory (SRAM)
- Core cell uses six-transistor circuit to store value
- Value is stored symmetrically
- Both true and complement are stored on cross-coupled transistors
- SRAM retains value as long as power is applied to circuit
- Read
- Precharge bit and bit' high
- Set select line high from row decoder
- One bit line will be pulled down
- Write
- Set bit/bit' to desired (complementary) values
- Set select line high
- Drive on bit lines will flip state if necessary
88SRAM Sense Amplifier
- Differential pair
- Takes advantage of complementarity of bit lines
- One bit line goes low
- One arm of diff pair reduces its current, causing compensating increase in current of another arm
- Sense amp can be cross-coupled to increase speed
89Dynamic Random-Access Memory (DRAM)
- Cell can easily be made with a CMOS digital technology process
- Dynamic RAM loses value due to charge leakage
- Must be refreshed
- Value is stored on gate capacitance of transistor t1
- Read
- read = 1, write = 0, read_data is precharged
- t1 will pull down read_data if 1 is stored
- Write
- read = 0, write = 1, write_data = value
- Guard transistor writes value onto gate capacitance
- Modern commercial DRAMs use one-transistor cell
90Toshiba NAND Flash Memory
- In 1980, Fujio Masuoka recruited four engineers to a project aimed at designing a memory chip that could store lots of data and would be affordable
- Team came up with a variation of EEPROM that featured a memory cell consisting of a single transistor (at the time, conventional EEPROM needed two transistors per cell)
- Why is it named flash?
- Because of the chip's ultrafast erasing capability
- In 1984 Masuoka presented a paper at the IEEE International Electron Devices Meeting
- In 1988 Intel developed a type of flash based on NOR logic gates (a 256-kilobit chip)
- Toshiba's first NAND flash (greater storage densities but trickier to manufacture) hit the market in 1989
91Memory Generators
- A software tool which can create memories (ROM or RAM blocks) in a range of sizes as needed
- The customer usually wants a particular number of words (depth) and bits (width) for each memory ordered
- Each of the final building blocks (physical layout) will be implemented as a stand-alone, densely packed, pitch-matched array
- Complex layout generators and state-of-the-art logic and circuit design techniques offer
- Embedded memories of extreme density and performance
- Each memory generator is a set of various, parameterized generators
- Layout generator generates an array of custom, pitch-matched leaf cells
- Schematic generator and Net-lister extracts a net-list used for both layout vs. schematic and functional verification
- Function and Timing model generators create models for gate level simulation, dynamic/static timing analysis and synthesis
- Symbol generator generates schematic
- Critical Path generator is used for both circuit design and timing characterization
92Logic Synthesis
- Logic Synthesis Flow
- Optimization
- Technology Mapping
- Low-Power Techniques
93Logic Synthesis Flow
- Goal is to create a logic gate network which performs a given set of functions
- Input is Boolean formulae
- Output is gates implementing Boolean functions
- Several iterations needed for generation of the optimized gate-level description
- Logic synthesis
- Maps onto available gates
- Restructures for delay, area, testability, power, etc.
- Automated logic synthesis has enabled
- Enormous reduction of the time needed for conversion of a design from high-level to gate-level description
- Saving of designer resources for architectural and RTL descriptions, and optimization of the standard cell library
94High-Level Synthesis
- Scheduling determines
- Number of clock cycles required
- As-soon-as-possible (ASAP) schedule puts every operation as early in time as possible
- As-late-as-possible (ALAP) schedule puts every operation as late in schedule as possible
- Binding determines
- Area and cycle time
- Area tradeoffs must consider
- Shared function units vs. multiplexers and control
- Delay tradeoffs must consider
- Cycle time vs. number of cycles
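ASAP and ALAP schedules can be computed from the data dependences alone. The sketch below assumes every operation takes exactly one clock cycle and that resources are unlimited; the operation names and the dependence graph are invented for illustration.

```python
from graphlib import TopologicalSorter

def asap_schedule(deps):
    """deps: operation -> list of operations it depends on.
    Each operation is placed one step after its latest predecessor."""
    step = {}
    for op in TopologicalSorter(deps).static_order():
        step[op] = 1 + max((step[d] for d in deps.get(op, ())), default=0)
    return step

def alap_schedule(deps, n_steps):
    """Each operation is placed one step before its earliest successor,
    with operations that have no successors placed in the last step."""
    succs = {}
    for op, ds in deps.items():
        for d in ds:
            succs.setdefault(d, []).append(op)
    step = {}
    for op in reversed(list(TopologicalSorter(deps).static_order())):
        step[op] = min((step[s] for s in succs.get(op, ())), default=n_steps + 1) - 1
    return step

# Hypothetical data flow: a2 = (m1 + m2) + m3, with m1..m3 multiplications.
deps = {"m1": [], "m2": [], "m3": [], "a1": ["m1", "m2"], "a2": ["a1", "m3"]}
asap = asap_schedule(deps)
alap = alap_schedule(deps, n_steps=max(asap.values()))
print(asap)
print(alap)
```

The difference between the two step numbers of an operation is its slack: here `m3` can be moved from step 1 to step 2, so two multipliers would suffice if one is shared.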
95Logic Synthesis Phases
- Technology-independent optimizations
- A Boolean network is the main representation of the logic functions
- Each node can be represented as sum-of-products (or product-of-sums)
- Functions in the network need not correspond to logic gates
- Technology mapping (library binding)
- Design transformation from technology-independent to technology-dependent
- Technology-dependent optimizations
- Work in the available set of logic gates
96Technology-Independent Optimization
- Area is estimated by number of literals
- Literal is true or complement form of a variable
- Simplification
- Rewrites a node to reduce the number of literals in it
- Network restructuring
- Introduces new nodes for common factors
- Collapses several nodes into one new node
- Delay restructuring
- Changes factorization to reduce path length
97Covers and Cubes
- Function is defined by
- On-set: set of inputs for which output is 1
- Off-set: set of inputs for which output is 0
- Don't-care-set: set of inputs for which output is don't-care
- Each way to write a function as a sum-of-products is a cover
- It covers the on-set
- A cover is composed of cubes
- Cubes are product terms that define a subspace cube in the function space
98Covers and Optimizations
- Larger cover
- $x_1' x_2' x_3' + x_1 x_2' x_3' + x_1' x_2 x_3' + x_1 x_2 x_3$
- Requires four cubes (12 literals)
- Smaller cover
- $x_2' x_3' + x_1' x_3' + x_1 x_2 x_3$
- Requires three cubes (7 literals)
- $x_1' x_2' x_3'$ is covered by two cubes
- Don't-cares
- Can be implemented in either on-set or off-set
- Provide the greatest opportunities for minimization in many cases
- Espresso
- A two-level logic optimizer
- Expands, makes irredundant and reduces
- Optimization loop refines cover to reduce its size
99Factorization
- Based on division
- Formulate candidate divisor
- Test how it divides into the function
- If g = f/c, we can use c as an intermediate function
- Algebraic division
- Doesn't take into account Boolean simplification
- Less expensive than Boolean division
- Three steps
- Generate potential common factors and compute literal savings if used
- Choose factors to substitute into network
- Restructure the network to use the new factors
- Algebraic/Boolean division is used to implement first step
100Technology Mapping
- Rewrites Boolean network
- In terms of available logic functions
- Optimizes for
- Area
- Delay
- Can be viewed as a pattern matching problem
- Find pattern match which minimizes area/delay cost
- Procedure
- Write Boolean network in canonical NAND form
- Write each library gate in canonical NAND form
- Assign cost to each library gate
- Use dynamic programming to select minimum-cost cover of network by library gates
101Breaking into Trees
not optimal, but reasonable cuts usually work well
102Mapping Example
after three levels of matching
103Mapping Example
after four levels of matching
104Low Power Techniques
- Architecture-driven supply voltage scaling
- Add extra logic to increase parallelism so that system can run at lower frequency
- Power improvement for n parallel units over Vref
- $P_n(n) = [1 + C_i(n)/(n C_{ref}) + C_x(n)/C_{ref}]\,(V/V_{ref})$
- Dynamic voltage and frequency scaling
- Supply voltage and clock frequency are decreased for parts of the circuit where this does not adversely affect the performance
- Dynamic scaling is regulated by software based on system load
- Reducing capacitances
- Parasitic capacitances of the transistors
- Parasitic capacitances of the wires
105Low Power Techniques
- Reducing switching activity
- Deactivate the clock to unused registers (clock gating)
- Deactivate signals if not used (signal gating)
- Deactivate VDD for unused hardware blocks (power gating)
- Distributed clocks: Globally Asynchronous Locally Synchronous
- Eliminating centrally synchronous clocks and utilizing local clocks
- Distinct local clocks, possibly running at different frequencies
106Design for Testability
- DFT Methods
- Scan Design
- Test Pattern Generation
- Built-In Self-Test
107Design for Testability Methods
- Make the system as testable as possible
- Keep minimum cost in hardware and testing time
- Use knowledge of architecture to help in selection of testability points
- Modify architecture to improve testability
- DFT for digital circuits
- Ad-hoc methods
- Avoid asynchronous feedback
- Make flip-flops initializable
- Avoid redundant gates, large fan-in gates and gated clocks
- Provide test control for difficult-to-control signals
- Consider ATE requirements (tri-states, etc.)
- Structured methods
- Scan Design
- Built-in self-test (BIST)
- Boundary scan
108Scan Design
- Circuit is designed using pre-specified design rules
- Test structure (hardware) is added to the verified design
- Add a test control (TC) primary input
- Replace flip-flops by scan flip-flops (SFF) and connect to form one or more shift registers (scan-chains) in the test mode
- Make input/output of each scan-chain controllable/observable from primary input/primary output
- Use combinational ATPG to obtain tests for all testable faults in the combinational logic
- Add shift register tests and convert ATPG tests into scan sequences for use in manufacturing test
- Full scan is expensive
- Must roll out and roll in state many times during a set of tests
- Partial scan selects some registers (not all) for scanability to reduce the chain length
- Analysis is required to choose which registers are best for scan
109Scannable Flip-Flop
110Level-Sensitive Scannable Flip-Flop
111Scan Structure
112Combinational Test Vectors
113Testing Scan Chain
- Scan-chain must be tested prior to application of scan test sequences
- A shift sequence 00110011... of length n_sff + 4 in scan mode (TC = 0)
- Produces 00, 01, 11 and 10 transitions in all flip-flops
- Observes the result at SCANOUT output
- Total scan test length
- (n_comb + 2) n_sff + n_comb + 4 clock periods
- Example
- 2,000 scan flip-flops, 500 comb. vectors, total scan test length of about 10^6 clocks
- Multiple scan-chains reduce test length
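The scan-chain test and the test-length estimate can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the length formula is taken as (n_comb + 2) * n_sff + n_comb + 4 clock periods.

```python
def scan_chain_test_sequence(n_sff):
    """Return the 0011... shift sequence of length n_sff + 4 as a string."""
    length = n_sff + 4
    return ("0011" * (length // 4 + 1))[:length]


def transitions(sequence):
    """Set of consecutive bit pairs (transitions) present in the sequence."""
    return {sequence[i:i + 2] for i in range(len(sequence) - 1)}


def scan_test_length(n_comb, n_sff):
    """Total scan test length in clock periods for a single scan chain."""
    return (n_comb + 2) * n_sff + n_comb + 4


seq = scan_chain_test_sequence(4)          # "00110011"
covered = transitions(seq)                 # all of 00, 01, 11, 10
total = scan_test_length(n_comb=500, n_sff=2000)
print(seq, sorted(covered), total)         # total is 1004504, roughly 10^6
```

Because the n_sff term dominates, splitting the flip-flops across several chains that shift in parallel shortens the test roughly in proportion to the number of chains.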
114Testing and Faults
- Errors are introduced during manufacturing
- Testing weeds out infant mortality
- Varieties of testing
- Functional testing
- Performance testing
- Fault model
- Possible locations of faults
- I/O behavior produced by the fault
- With a fault model, we can test the network for every possible instantiation of that type of fault
- It is difficult to enumerate all types of manufacturing faults
- Testing procedure
- Set inputs
- Observe output
- Compare fault-free and observed output
115Stuck-At-0/1 Faults
- Logic gate output is always stuck at 0 or 1 independently of input values
- Correspondence to manufacturing defects depends on logic family
- Experiments show that 100% stuck-at-0/1 fault coverage corresponds to high overall fault coverage
- Testing NAND
- Three ways to test it for stuck-at-0
- Only one way to test it for stuck-at-1
- Testing NOR
- Three ways to test it for stuck-at-1
- Only one way to test it for stuck-at-0
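The counts above can be reproduced by enumeration. The sketch below assumes two-input gates and a fault on the gate output; a vector detects the fault when the fault-free output differs from the stuck value. The helper names are illustrative.

```python
from itertools import product


def nand(a, b):
    return 1 - (a & b)


def nor(a, b):
    return 1 - (a | b)


def output_stuck_at_tests(gate, stuck_value):
    """Input vectors whose fault-free output differs from the stuck value."""
    return [(a, b) for a, b in product((0, 1), repeat=2)
            if gate(a, b) != stuck_value]


print("NAND SA0:", output_stuck_at_tests(nand, 0))  # three vectors
print("NAND SA1:", output_stuck_at_tests(nand, 1))  # only (1, 1)
print("NOR  SA1:", output_stuck_at_tests(nor, 1))   # three vectors
print("NOR  SA0:", output_stuck_at_tests(nor, 0))   # only (0, 0)
```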
116Multiple Test Example
- Can test both NANDs for stuck-at-0 simultaneously
- abc = 000
- Cannot test both NANDs for stuck-at-1 simultaneously due to inverter
- Must use two vectors
- Must also test inverter
117Stuck-At-Open/Closed Model
- Transistors always on/off
- t1 is stuck open (switch cannot be closed)
- No path from VDD to output capacitance
- Testing requires two cycles
- Must discharge capacitor
- Try to operate t1 to charge capacitor
118Combinational Testing Example
- Two parts of testing
- Controlling the inputs of (possibly interior) gates
- Observing the outputs of (possibly interior) gates
- Delay faults
- Gate delay model assumes that all delays are lumped into one gate
- Path delay model takes into account the delay of a path through network
- Performance problems
- Functional problems in some types of circuits
119Testing Procedure
- Goal
- Test gate D for stuck-at-0 fault
- First step
- Justify 0 values on gate inputs
- Work backward from gate to primary inputs
- w1 = 0 (A output = 0)
- i1 = i2 = 1
- Observe the fault at a primary output
- o1 gives different values depending on whether D is fault-free or faulty
- Work forward and backward
- F's other input must be 0 to distinguish fault-free from faulty
- Justify 0 at E's output
- In general, may have to propagate fault through multiple levels of logic to primary outputs
120Redundancy and Testing
- Redundant logic can mask faults
- Testing NOR for SA0 requires setting both inputs to 0
- Network topology ensures that one NOR input (for instance b) will always be 1
- Function reduces to 0
- f = ((ab)' + b)' = (ab)b' = 0
- Redundant logic can introduce delay faults and other problems
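A small enumeration shows how redundancy hides a fault. The sketch assumes the network is a NAND of a and b feeding a NOR whose second input is b. That is one reading of the slide's example, not a confirmed netlist, and the function and parameter names are illustrative.

```python
from itertools import product


def f_out(a, b, nor_stuck_at=None):
    """NOR(NAND(a, b), b); optionally force the NOR output to a stuck value."""
    nand_out = 1 - (a & b)
    nor_out = 1 - (nand_out | b)
    return nor_out if nor_stuck_at is None else nor_stuck_at


# The fault-free function is constant 0 for every input combination.
always_zero = all(f_out(a, b) == 0 for a, b in product((0, 1), repeat=2))

# Vectors that would expose a stuck-at-0 on the NOR output: none exist.
detecting_vectors = [(a, b) for a, b in product((0, 1), repeat=2)
                     if f_out(a, b) != f_out(a, b, nor_stuck_at=0)]
print(always_zero, detecting_vectors)
```

Since no input vector distinguishes the faulty circuit from the good one, the stuck-at-0 fault is untestable under this assumed topology.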
121Sequential Testing
- Much harder than combinational testing
- Can't set memory element values directly
- Must apply sequences
- To put machine in proper state for test
- To observe value of test
- Testing of NAND for stuck-at-1
- Set both NAND inputs to 1
- Primary input i1 can be controlled directly
- Lower input is 1 if ps0/ps1 = 1
122Time-Frame Expansion
- A model for sequential test
- Unroll machine in time
- A single-stuck-at fault in the sequential machine appears as a multiple-stuck-at fault in the unrolled network
123Test Pattern Generation
- Automatic test pattern generator (ATPG) generates a set of test vectors
- Boolean network (combinational ATPG)
- Sequential machine (sequential ATPG)
- D (from Discrepancy) notation allows us to write a fault compactly
- D value on a node means that good and faulty circuits have different values at that point
- If a test for a particular fault exists, D-algorithm will find it by an exhaustive search of all sensitized paths
- Start at the faulty gate
- Suppose initially a stuck-at fault on gate output
- Primitive D-cube of failure (PDCF) of gate summarizes minimal assignment of input values to highlight fault
- Propagation D-cube (PDC) has D or D' (the complement of D) on output and on at least one input
- Summarizes non-controlling values for other inputs to allow propagation of D signal
124PODEM Algorithm
- PODEM stands for Path-Oriented DEcision Making
- Circuit-based, fault-oriented ATPG algorithm
- Goal
- Propagate D value to primary outputs
- Signal values are explicitly assigned at primary inputs only
- Other values are computed by implication
- Backtracking means reassigning primary inputs when a contradiction occurs
- Uses implicit enumeration
- Uses five values: 0, 1, D, D', and X
- Start all values at X
- In worst case, must examine all possible inputs
- Can be implemented to run quickly
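The five signal values can be modeled as a pair (value in the good circuit, value in the faulty circuit), which makes implication through gates mechanical. This is a minimal sketch of that value algebra only, not of the PODEM search itself. The encoding and names are assumptions, with D' standing for the complement of D.

```python
ZERO, ONE, D, DBAR, X = "0", "1", "D", "D'", "X"

# (good-circuit value, faulty-circuit value) for each fully known symbol.
PAIRS = {ZERO: (0, 0), ONE: (1, 1), D: (1, 0), DBAR: (0, 1)}
NAMES = {pair: name for name, pair in PAIRS.items()}


def _lift(symbol):
    return PAIRS.get(symbol, (None, None))   # X is unknown in both circuits


def _name(pair):
    return NAMES.get(pair, X)                # any partially unknown pair is X


def _and3(p, q):
    """Three-valued AND on 0, 1, None (unknown)."""
    if p == 0 or q == 0:
        return 0
    if p is None or q is None:
        return None
    return 1


def and5(a, b):
    (ag, af), (bg, bf) = _lift(a), _lift(b)
    return _name((_and3(ag, bg), _and3(af, bf)))


def not5(a):
    good, faulty = _lift(a)
    if good is None or faulty is None:
        return X
    return _name((1 - good, 1 - faulty))


def or5(a, b):
    return not5(and5(not5(a), not5(b)))      # De Morgan


# D passes through an AND only when the other input is non-controlling (1).
print(and5(D, ONE), and5(D, ZERO), and5(D, X), or5(D, ZERO), not5(D))
```

The printed values show why propagation requires non-controlling values on the side inputs: a 0 on the other AND input blocks D, and an X leaves the output undetermined.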
125Fault Propagation Example
126Built-In Self-Test (BIST)
- Includes on-chip machine responsible for
- Generating tests
- Evaluating correctness of tests
- Allows many tests to be applied
- Can't afford large memory for test results
- Rely on compression and statistical analysis
- Uses a linear-feedback shift register (LFSR) to
generate a pseudo-random sequence of bit vectors
127BIST Architecture
- One LFSR generates test sequence
- Another LFSR captures/compresses results
- Can store a small number of signatures which contain expected compressed results for valid system
- Usually used for testing memory blocks
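The two-LFSR arrangement can be sketched in software. The example assumes a 4-bit Fibonacci LFSR with feedback from bits 3 and 2 (a maximal-length choice), a serial signature register with the same taps, and a hypothetical circuit under test (a 4-input AND). None of these specifics come from the slides.

```python
def lfsr_states(seed, taps, width, count):
    """Return `count` successive states of a left-shifting Fibonacci LFSR."""
    mask = (1 << width) - 1
    state, states = seed, []
    for _ in range(count):
        states.append(state)
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask
    return states


def signature(bits, taps=(3, 2), width=4):
    """Compress a response bit stream into a `width`-bit signature."""
    mask = (1 << width) - 1
    state = 0
    for bit in bits:
        feedback = bit
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask
    return state


def and4(pattern):
    """Hypothetical circuit under test: AND of the four pattern bits."""
    return 1 if pattern == 0b1111 else 0


patterns = lfsr_states(seed=1, taps=(3, 2), width=4, count=15)
good_signature = signature(and4(p) for p in patterns)
faulty_signature = signature(0 for _ in patterns)   # output stuck-at-0
print(len(set(patterns)), good_signature != faulty_signature)
```

Only the short signature has to be stored on chip, at the price of a small probability that a faulty response aliases to the expected signature.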
128Layout Generation
- Layout Generation Flow
- Design Rules
- Layout Tools
- Standard Cells
- Floorplanning
- Placement
- Routing
- Clock Tree
- Pads
129Layout Generation Flow
- Library Exchange Format (LEF) files
- To create a library database (standard cells, I/O cells, and macro blocks)
- Timing Library Format (TLF) file
- Timing constraints
- General Constraints Format (GCF) file
- Design constraints
- Verilog net-list
- To create a design database
130Layout Generation Flow
- Floorplanning
- To create a core area with rows (or columns) and I/O rows around the core area
- Power planning and routing
- To plan, modify and route power paths, power rings and power stripes
- Placement
- An I/O constraints file may be used to place the I/O pads
- Block placement
- Cell placement
- Size adjustment
- To estimate the die size
- To resize the design to make it routable
131Layout Generation Flow
- Generating clock trees
- The clock buffer space and clock net must be defined
- Generating clock trees is an iterative process
- At this point, the physical net-list differs from the logical (original) net-list
- Placement optimization
- To resize gates and insert buffers to correct timing and electrical violations
- Routing
- To perform both global and final route on a placed design
- Verification
- To check for shorts and design rule violations
132Design Rules
- Masks are tools for manufacturing
- Manufacturing processes have inherent limitations in accuracy
- Design rules specify geometry of masks which will provide reasonable yields
- Design rules are determined by experience
- MOSIS SCMOS
- Designed to scale across a wide range of technologies
- Designed to support multiple vendors
- Designed for educational use
- Fairly conservative
- Lambda (λ) design rules
- Size of a minimum feature defines λ
- Specifying λ particularizes the scalable rules
- Parasitics are generally not specified in λ units
133Wires
134Transistors
135Vias
- Types of via
- Metal1/diff
- Metal1/poly
- Metal2/metal1
- Metal3/metal2
- ...
- Highest via
- Cut 3 x 3
- Overlap by metal2 1
- Minimum spacing 3
- Minimum spacing to via1 2
136Spacings
- Diffusion/diffusion
- 3
- Poly/poly
- 2
- Poly/diffusion
- 1
- Via/via
- 2
- Metal1/metal1
- 3
- Metal2/metal2
- 4
- Metal3/metal3
- 4
137Overglass
- Cut in passivation layer
- Connection for bonding wire
- Minimum bonding pad
- 100
- Pad overlap of glass opening
- 6
- Minimum pad spacing to unrelated metal2/3
- 30
- Minimum pad spacing to unrelated metal1, poly, active
- 15
138Layout Tools
- Layout editors are interactive tools
- Design rule checkers identify errors on the layout
- Circuit extractors extract the net-list from the layout
- Connectivity verification systems (CVS) compare extracted and original net-lists
- CADENCE Virtuoso's Layout-versus-Schematic (LVS) tool
- Standard cell layouts are created from pre-designed cells using the custom routing
- Silicon Ensemble (CADENCE)
- Encounter (CADENCE)
- Physical Compiler (SYNOPSYS)
139Standard Cell Layout
- Layout made of small cells
- Gates, flip-flops, etc.
- Cells are hand-designed
- Assembly of cells is automatic
- Cells arranged in rows
- Wires routed between and through cells
- Pitch is the height of a cell
- All cells have same pitch, may have different widths
- VDD/VSS connections are designed to run through cells
- A feedthrough area allows wires to be routed over the cell
140Floorplanning Strategy
- Floorplanning must take into account
- Blocks of varying function, size, and shape
- Space allocation
- Signal routing
- Power supply routing
- Clock distribution
141Floorplanning Tips
- Develop a wiring plan
- Think about how layers will be used to distribute important wires
- Draw separate wiring plans for power and clocking
- These are important design tasks which should be tackled early
- Sweep small components into larger blocks
- A floorplan with a single NAND gate in the middle will be hard to work with
- Design wiring that looks simple
- If it looks complicated, it is complicated
- Design planar wiring
- Planarity is the essence of simplicity
- Do it where feasible (and where it doesn't introduce unacceptable delay)
142Placement Metrics
- Placement of components interacts with routing of wires
- Quality metrics for layout
- Area and delay
- Area and delay determined in part by
- Wiring
- How do we judge a placement without wiring?
- Estimate wire length without actually performing
routing
[Figure: bad placement versus good placement]
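One common way to estimate wire length without routing is the half-perimeter of each net's bounding box. This estimator is swapped in here as an illustration, since the slides do not name a specific one. Cell names, coordinates, and nets below are hypothetical.

```python
def half_perimeter_wire_length(placement, nets):
    """Sum over nets of the half-perimeter of the bounding box of its cells.

    placement: cell name -> (x, y); nets: iterable of tuples of cell names.
    """
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


nets = [("a", "b"), ("b", "c")]
good = {"a": (0, 0), "b": (1, 0), "c": (2, 0)}   # connected cells are adjacent
bad = {"a": (0, 0), "b": (2, 0), "c": (1, 0)}    # b is pushed away from a

good_estimate = half_perimeter_wire_length(good, nets)
bad_estimate = half_perimeter_wire_length(bad, nets)
print(good_estimate, bad_estimate)   # 2 versus 3
```

The estimate is cheap enough to evaluate after every candidate move, which is what makes iterative placement improvement practical.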
143Placement Techniques
- To construct an initial solution
- To improve an existing solution
- Pairwise interchange is a simple improvement method
- Interchange a pair, keep the swap if it helps wire length
- Heuristic determines which two components to swap
- Placement by partitioning
- Works well for components of fairly uniform size
- Partition net-list to minimize total wire length using min-cut criterion
- Kernighan-Lin Algorithm
- Computes min-cut criterion, counting total net-cut change
- Exchanges sets of nodes to perform hill-climbing, finding improvements where no single swap will improve the cut
- Recursively subdivide to determine placement detail
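Pairwise interchange can be sketched for cells placed in a single row. The example assumes wire length is measured as the span of each net along the row and tries every pair rather than using a selection heuristic. Cell and net names are hypothetical.

```python
from itertools import combinations


def wire_length(order, nets):
    """Total span of all nets for cells placed left to right in `order`."""
    position = {cell: i for i, cell in enumerate(order)}
    return sum(max(position[c] for c in net) - min(position[c] for c in net)
               for net in nets)


def pairwise_interchange(order, nets):
    """Swap pairs of cells, keeping a swap only if it shortens the wiring."""
    order = list(order)
    best = wire_length(order, nets)
    improved = True
    while improved:
        improved = False
        for i, j in combinations(range(len(order)), 2):
            order[i], order[j] = order[j], order[i]
            cost = wire_length(order, nets)
            if cost < best:
                best, improved = cost, True
            else:
                order[i], order[j] = order[j], order[i]   # undo the swap
    return order, best


nets = [("a", "b"), ("b", "c"), ("c", "d")]
initial = ["a", "c", "b", "d"]
initial_cost = wire_length(initial, nets)          # 5
final_order, final_cost = pairwise_interchange(initial, nets)
print(initial_cost, final_order, final_cost)       # cost drops to 3
```

The loop stops at a local optimum where no single swap helps, which is the limitation that exchanging sets of nodes, as in Kernighan-Lin, is meant to overcome.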
144Routing
- Major phases in routing
- Global routing assigns nets to routing areas
- Detailed routing designs the routing areas
- Net ordering determines quality of result
- Net ordering is a heuristic
- Blocks and wiring
- Blocks divide wiring area into routing channels
- Large wiring areas may force rearrangement of block placement
- Channel routing
- Channel grows in one dimension to accommodate wires
- Pins generally on only two sides
- Switchbox routing
- Box cannot grow in any dimension
- Pins are on all four sides
145Routing Channels
- Tracks form a grid for routing
- Spacing between tracks is center-to-center distance between wires
- Track spacing depends on wire layer used
- Density (vertical and horizontal)
- Gives the number of wire segments crossing a vertical/horizontal grid segment
- Different layers are used for horizontal and vertical wires
- Horizontal and vertical wires can be routed relatively independently
- Placement of cells determines placement of pins
- Pin placement determines difficulty of routing problem
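Channel density can be computed directly from the horizontal extent of each net. In this sketch a net is modeled as a (left column, right column) interval, and density is the largest number of intervals crossing any one column. The net names and coordinates are hypothetical.

```python
def channel_density(nets):
    """Maximum number of horizontal net segments crossing any column.

    nets: net name -> (left_column, right_column), both inclusive.
    """
    first = min(left for left, _ in nets.values())
    last = max(right for _, right in nets.values())
    return max(
        sum(1 for left, right in nets.values() if left <= column <= right)
        for column in range(first, last + 1)
    )


nets = {"A": (0, 3), "B": (2, 5), "C": (4, 7), "D": (6, 8), "E": (1, 6)}
density = channel_density(nets)
print(density)   # 3: a lower bound on the number of tracks needed
```

Density is a lower bound on channel height, since every net crossing the busiest column needs its own track there.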
146Left-Edge Algorithm
- Assumes one horizontal segment per net
- Sweep pins from left to right
- Assign horizontal segment to lowest available track
- Limitations
- Some combinations of nets require more than one horizontal segment per net (a dog-leg wire)
- Aligned pins form vertical constraints
- Wire to lower pin must be on lower track
- Wire to upper pin must be above lower pin's wire
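The left-edge sweep can be sketched as follows. The example assumes one horizontal segment per net, given as a (left, right) column interval, and deliberately ignores vertical constraints, so it illustrates only the basic track assignment. Net names and coordinates are hypothetical.

```python
def left_edge(nets):
    """Assign each net's horizontal segment to the lowest available track.

    nets: net name -> (left_column, right_column).
    Returns net name -> track index, where 0 is the lowest track.
    """
    track_right_edge = []   # rightmost occupied column of each track
    assignment = {}
    # Sweep nets in order of their left edge.
    for name, (left, right) in sorted(nets.items(), key=lambda kv: kv[1][0]):
        for track, edge in enumerate(track_right_edge):
            if left > edge:                      # track is free at this column
                track_right_edge[track] = right
                assignment[name] = track
                break
        else:                                    # no free track: open a new one
            track_right_edge.append(right)
            assignment[name] = len(track_right_edge) - 1
    return assignment


nets = {"A": (0, 3), "B": (2, 5), "C": (4, 7), "D": (6, 8), "E": (1, 6)}
tracks = left_edge(nets)
print(tracks, "tracks used:", max(tracks.values()) + 1)
```

For these intervals the sweep uses three tracks. Handling aligned pins would require also ordering the tracks according to the vertical constraints.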
147Global and Detailed Routing
- Global routing
- Assign wires to paths through channels
- Don't worry about exact routing of wires within channel
- Can estimate channel height using congestion
- Detailed routing
- Dog-leg router breaks net into m