Title: Field Programmable Gate Arrays
1Field Programmable Gate Arrays
- MAS863
- How To Make (almost) Anything
- Andrew "bunnie" Huang, E. Rehmi Post
2Agenda
- Lecture
- Motivation and Application
- Theory and Architecture
- System Integration
- Design Demo
- How to use the tools and features
- In-class Project
- VGA display of moving ball
3Introduction
- Field Programmable Gate Array
- Field as in field operations -- programmable in the field, as opposed to in the factory
- Gate array
- array of logic gates and storage elements
- When and why would you use such a device?
4Motivation
[Figure: charts titled "Raw Speed and Interrupt Latency". Axis labels: response time (latency), cost, cost / volume, complexity. Plotted items: PIC, embedded processors, PCs and workstations, FPGAs, simple gates, ASIC (full-custom IC), and an "oops region".]
5Motivation
- FPGAs span the middle ground
- Fast design cycles
- IP cores
- reconfigurability
- late binding decisions--hardware is no longer cast in concrete
- High performance
- excellent in latency-limited situations (network routers, real-time systems, timing generators), i.e. situations that need fine time resolution together with a good degree of complexity
- Can be cost effective
- Especially in low-volume scenarios vs. ASIC
6Applications
- Fast-turn, low volume ASIC (uninteresting)
- Reconfigurable Hardware Processors
- One-connector I/O solutions
- Rapid prototyping
7Applications RHP
- Direct implementation of algorithms in hardware
- circumvents instruction fetch, decode, issue overhead
- unrestricted parallelism
- disadvantage: little hardware abstraction, difficult to use
- RISC framework with reconfigurable instruction set
- user-defined instructions depending on process context
- prevents the MMX disease
- easier to use, more hardware abstraction, but lower performance
8Applications RHP
- Optimal ISA
- compiler analyzes code and chooses an ISA optimal for the problem, and bundles the hardware description for the ISA with the code object
- Configurable memory management and caching
- useful for implementing special OS features
- VM paging schemes directly in hardware
- Ultimate RHP: one processor, any ISA
- In the future: possibly adaptive processors which automatically optimize their architecture per application
9Applications Direct Hardware
- Ideal for implementing simple, repetitive operations (overhead operations)
- time synchronization on Novell networks
- CAM lookup tables for IP routing and neural nets
- encryption/decryption
- FEA (finite element analysis)
- Relaxation networks
- database searching
- higher performance with special architectures (embedded RAM)
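The CAM (content-addressable memory) lookup mentioned above can be modeled in software. This is a minimal Python sketch, not taken from the lecture: it assumes a masked CAM used for longest-prefix IP route lookup, and the class name, routes, and port names are hypothetical. In hardware every entry would be compared against the key at the same time; the list comprehension stands in for that parallel compare.

```python
import ipaddress


class CamTable:
    """Software model of a masked CAM: every entry is compared against the key."""

    def __init__(self):
        self.entries = []  # (value, mask, prefix_length, result)

    def add_route(self, cidr, result):
        net = ipaddress.ip_network(cidr)
        self.entries.append(
            (int(net.network_address), int(net.netmask), net.prefixlen, result)
        )

    def lookup(self, address):
        key = int(ipaddress.ip_address(address))
        # Hardware evaluates all comparisons at once; this is the serial stand-in.
        hits = [(plen, result)
                for value, mask, plen, result in self.entries
                if key & mask == value]
        if not hits:
            return None
        # A priority encoder picks one hit; here the longest prefix wins.
        return max(hits, key=lambda hit: hit[0])[1]


table = CamTable()
table.add_route("10.0.0.0/8", "port A")
table.add_route("10.1.0.0/16", "port B")
table.add_route("0.0.0.0/0", "default")
```

With this table, `table.lookup("10.1.2.3")` matches all three entries and the /16 route is selected because it has the longest prefix.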
10Applications I/O solutions
- One-connector I/O solutions
- use a single connector with any protocol desired
- ex: a DB-25 which can do SCSI, IEEE 1284 parallel, serial
- ideal for space-limited applications
- Object oriented hardware
- devise a system such that a device plugged into the I/O port uploads the hardware configuration necessary to implement the communications protocol
- protocol upgrades are a cinch
- limited by electrical signalling compatibility issues
- drawback: can be confusing to users, potentially damaging to hardware
11Applications EA
- Evolutionary algorithms
- some research done on FPGAs already
- tone recognition application
- possibly requires intimate knowledge of FPGA hardware
- vendor licensing issues
- EA apps do not map well into current FPGA architectures
- however, with the right FPGA, EA could yield very interesting results
12Applications Rapid Prototyping
- FPGAs are a handy thing to have on the lab bench
- simple digital circuits no longer require wiring or parts ordering
- modification and duplication of existing designs is relatively straightforward
- with the right design tools, hardware design re-use is an additional benefit
13General Architecture
[Figure: block diagram built from elements labeled compute, remember, and connect, with a CONFIGURE label.]
- Terminology: Granularity, Configuration, and Routing
14Architecture Varieties
- Primary classifications for FPGAs
- configuration method
- granularity
- routing architecture
- Other practical considerations
- density
- speed
- cost
- design tools
- vertical migration
15Architecture Varieties
- EPAC
- Electrically Programmable Analog Circuit
- Contains programmable gain amplifiers, comparators, multiplexers, DACs, track-and-hold, filtering components
- Made by iMP
16Architecture Configuration
- Configuration method
- In-circuit programmable methods
- SRAM based (Xilinx 2K/3K/4K, Altera 8K/10K, Lucent Orca)
- volatile, but fast configuration times
- must reprogram on every power-up
- some architectures offer partial reconfiguration (Atmel)
- most expensive in terms of area and timing costs
- standard CMOS process
- EEPROM based (Altera 7K/9K)
- nonvolatile, slow config; sometimes requires extra voltages for programming and erasing
- special silicon processing required
17Architecture Configuration
- Configuration method (contd)
- Pre-assembly programmable methods
- Antifuse based (Actel, QuickLogic FPGAs)
- nonvolatile, very fast links
- permanent configuration (OTP)
- smallest link size (lower cost)
- special silicon processing technology required
- (E)EPROM based (Altera 5K, 7K, Xilinx 7200, 7300)
- nonvolatile, moderate performance
- reprogrammable after special erase cycle
- medium-sized link
- special silicon processing technology required
18Architecture Granularity
- Granularity
- Defined as ratio of logic per cell versus routing
- Very fine-grained architectures
- Partial set of n-input boolean functions per cell
- Roughly 6:1 ratio of logic inputs to registers per cell
- Atmel, Actel
- Fine-grained architectures
- Full set of n-input boolean functions per cell
- Sometimes multiple n-input boolean functions per cell
- Roughly 8:1 ratio of logic inputs to registers per cell
- Well-suited for state machines, simple arithmetic, pipelined applications
- Xilinx 3K/4K, Altera 8K/10K
19Architecture Granularity
- Granularity (contd)
- Coarse-grained architectures
- PLD-style product term arrays
- Roughly 32:1 ratio of logic inputs to registers per cell
- Well-suited for address decoding, complicated arithmetic operations, datapath operators, complex state machines
- Poorly suited for pipelined applications and simple operations
- Altera 5K/7K, Xilinx 7K
- Dual-grained or hierarchical architectures
- Combine coarse and fine-grained features
- Often exhibit separate local and global routing resources
- Lucent Orca, Altera 9K
20Architecture Routing
- Routing method
- Fine-grained
- Short hops (1 to 8 logic cells spanned per track)
- Path-dependent timing
- Exhibits high density
- Flexible switch matrices
- Fewer logic placement constraints
- Coarse-grained
- Tracks span entire chip
- Fixed timing regardless of logic placement
- Lower density
- Logic placement constrained by routing
availability
21Architecture Routing
- Routing method (contd)
- Hierarchical routing
- Local, fine-grained routing between cells
- Global, coarse-grained routing between groups of cells
- Usually path-dependent timing
- Best of both worlds, but can be difficult to utilize efficiently
22Architecture Other Practical
- Density, speed and vertical migration
- Altera FLEX 8K is targeted at density-driven apps
- Altera MAX 7K is targeted at performance-driven apps
- Xilinx 4K series targets both speed and performance, with good vertical migration from 3K gates to 250K gates (Altera 10K is the Xilinx 4K competitor)
- Xilinx 6200 series targets reconfigurable hardware applications
23Architecture Design Tools
- Design tools - the other half of the equation
- FPGA is useless without good design tools
- Design tools slowly progressing to acceptable levels
- Entry methods include HDL, schematic
- Compilers are improving! Xilinx's most recent compiler can place and route reasonably tough designs in about fifteen minutes; very tough designs will finish in a half hour or not at all.
- Xilinx Foundation Series / M1 technology
- Altera MAX+PLUS
24Architecture Cost
- Cost formulas for FPGAs are complex
- OTP FPGAs tend to be cheaper
- Established lines are cheaper than new lines
- Cost increases exponentially with performance and density
- Some lines are targeted at cost-sensitive applications (Altera 7K)
- Not all speed grade-density combos are available from manufacturers
25Detailed Architecture Xilinx 4000E
- Fine-grained logic, SRAM based, with fine-grained routing
- Array of CLBs embedded in single length / double length / quad length / longline routing resources
- CLB: Configurable Logic Block
- Two 4-input LUTs (LookUp Tables) and one 3-input LUT
- Two SR D-type flip flops
- Bypass paths and carry/cascade logic
- PSM: Programmable Switch Matrix
- 10 interconnect points per matrix
- Each interconnect contains six pass transistors for full connectivity between four directions
- Located at intersections of single and double length lines
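The LUT-based CLB described above can be mimicked in a few lines of Python. This is a sketch under assumptions, not the vendor's model: it assumes the 3-input LUT takes the two 4-input LUT outputs plus one extra input, and uses that arrangement to XOR nine bits. The class and function names are made up for illustration. The last line counts the ways to pair four directions, which is consistent with the six pass transistors per interconnect point.

```python
from itertools import combinations


class Lut:
    """n-input lookup table: a 2**n-entry truth table addressed by the inputs."""

    def __init__(self, n_inputs, truth_table):
        if len(truth_table) != 2 ** n_inputs:
            raise ValueError("truth table must have 2**n entries")
        self.n_inputs = n_inputs
        self.table = list(truth_table)

    @classmethod
    def from_function(cls, n_inputs, func):
        """Fill the table by evaluating func on every input combination."""
        rows = []
        for address in range(2 ** n_inputs):
            bits = [(address >> i) & 1 for i in range(n_inputs)]
            rows.append(int(bool(func(*bits))))
        return cls(n_inputs, rows)

    def __call__(self, *bits):
        address = sum(bit << i for i, bit in enumerate(bits))
        return self.table[address]


# Two 4-input LUTs feeding one 3-input LUT (assumed wiring, see lead-in).
f_lut = Lut.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
g_lut = Lut.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
h_lut = Lut.from_function(3, lambda f, g, extra: f ^ g ^ extra)


def xor9(bits):
    """XOR of nine bits computed by the three LUTs."""
    return h_lut(f_lut(*bits[0:4]), g_lut(*bits[4:8]), bits[8])


# Four directions at a switch point give six direction pairs.
direction_pairs = list(combinations(["north", "south", "east", "west"], 2))
```

Any boolean function of four inputs fits in one 4-input LUT simply by changing its 16 table entries, which is what "full set of n-input boolean functions per cell" means in practice.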
31Detailed Architecture Xilinx 4000E
- Configuration
- total (device) reconfiguration (no partial reconfig)
- several configuration modes available
- parallel and serial modes
- master and slave modes
- daisy chain ability
- device bitstreams between 50Kbits and 400Kbits
- config rate around 10 Mbit/sec
- fastest reconfiguration takes a few tens of milliseconds
- typical reconfig in a couple of seconds
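The bitstream sizes and configuration rate above determine the best-case load time. The arithmetic can be checked with a short Python sketch; it assumes "Kbits" means 1000 bits and that the full 10 Mbit/sec rate is sustained, and the function name is illustrative.

```python
def config_time_ms(bitstream_bits, rate_bits_per_s):
    """Time to shift a bitstream into the device, in milliseconds."""
    return 1000.0 * bitstream_bits / rate_bits_per_s


RATE = 10_000_000  # about 10 Mbit/sec, from the slide

for bits in (50_000, 400_000):
    print(f"{bits} bits -> {config_time_ms(bits, RATE):.0f} ms")
# prints:
# 50000 bits -> 5 ms
# 400000 bits -> 40 ms
```

The largest bitstream therefore loads in about 40 ms at the full rate, which matches the "few tens of milliseconds" figure.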
32Detailed Architecture Xilinx 4000E
- Other features
- distributed RAM
- CLB LUTs can function as a 32x1, 16x1, or 16x2 RAM
- synchronous RAM options available
- internal tri-state buffers
- global routing resources
- JTAG boundary scan
- configuration readback
- programmable slew rate and logic levels in IOBs
- common per-package pinout for all devices
- allows for easy vertical migration
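The distributed RAM modes listed above follow from the fact that a 4-input LUT is just 16 bits of storage. The Python sketch below illustrates one way the 16x2 and 32x1 organizations could share the same two LUT memories; the class, method names, and the choice of which LUT holds the upper half are assumptions, not the device's documented behavior.

```python
class ClbRam:
    """Two 16-bit LUT memories (F and G) used as 16x2 or 32x1 RAM."""

    def __init__(self):
        self.f = [0] * 16
        self.g = [0] * 16

    # 16x2: both LUTs share a 4-bit address and hold one data bit each.
    def write_16x2(self, addr, d1, d0):
        self.g[addr & 0xF] = d1 & 1
        self.f[addr & 0xF] = d0 & 1

    def read_16x2(self, addr):
        return self.g[addr & 0xF], self.f[addr & 0xF]

    # 32x1: the fifth address bit selects which LUT holds the bit.
    def write_32x1(self, addr, d):
        bank = self.g if addr & 0x10 else self.f
        bank[addr & 0xF] = d & 1

    def read_32x1(self, addr):
        bank = self.g if addr & 0x10 else self.f
        return bank[addr & 0xF]
```

A 16x1 RAM is the degenerate case that uses only one of the two LUTs.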
33System Integration
- FPGAs offer flexible I/O solutions
- laying out a board around an FPGA is very nice
- newest FPGAs, esp. Virtex, have multi-standard I/O support
- Requires a source of configuration data
- Host computer, parallel or serial
- Serial ROM
- fewest wires--CCLK, DIN, INIT, PROG, sometimes DOUT
- FLASH ROM controlled by dedicated config circuitry
- Combination of both
34Serial Programming
Slave and Master modes
35Programming From a ROM
36That's Nice. How Do I Use It?!
- Present basic design flow
- Work through a demo implementation
37Design Tools Process
Libraries
IP Cores
Design description (HDL, schematic)
Technology mapping
Place
Route
Errors
Timing Analysis
Bitstream
FPGA
38Design Tools Design Entry
- HDL
- Verilog, VHDL or proprietary language (AHDL, etc.)
- Verilog is like C with multithreading and strict typing
- VHDL stands for VHSIC HDL; intended for detailed simulations; commissioned by the military; very complex
- Ideal for large designs because of well-defined scoping and instantiation rules; top down design
- Also ideal for state machines, decoders/encoders, and odd or awkward busses
- Hardware mapping is difficult
- very easy to make inefficient designs; subtle semantics choices can lead to drastic performance variations
- hard to specify hardware-specific features such as carry chains
- hard to specify placement and routing info
39Design Tools Design Entry
- Schematic entry
- more intuitive, easier to observe design flow
- helpful when trying to optimize designs for speed or area
- difficult when implementing large amounts of miscellaneous logic (state machines)
- hierarchical schematic tools help make large designs more manageable
- global changes difficult (hard to change global mistakes)
- hardware mapping is much easier
- schematic primitives for special hardware features
- schematic attributes for routing info
- WYSIWYG design entry
40Design Tools Hardware Mapping
- Many options for HDL to hardware mapping
- vendor-specific options
- third party tools
- EDIF is the most common intermediate language
- When using HDLs, good hardware mapping tools are critical for performance and device utilization
- Deep understanding of HDL is also useful
- Schematic hardware mapping is much easier - very close to WYSIWYG editing
- Hardware mappers often perform aggressive logic optimizations
- watch your assumptions! (hazards)
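The hazard warning above can be made concrete with a classic case, sketched here in Python. This example is an illustration chosen for these notes, not one from the lecture: a 2-to-1 multiplexer with a redundant cover term `a & b`. An optimizer may drop that term because the truth table is unchanged, yet the term is what keeps the output steady while the select line switches. The one-step inverter lag is a simplifying assumption.

```python
from itertools import product


def mux_with_cover(a, b, s, not_s):
    return (a & s) | (b & not_s) | (a & b)  # a & b is the redundant cover term


def mux_minimized(a, b, s, not_s):
    return (a & s) | (b & not_s)


def same_truth_table():
    """Both forms agree for every static input, so an optimizer may drop the cover."""
    return all(
        mux_with_cover(a, b, s, 1 - s) == mux_minimized(a, b, s, 1 - s)
        for a, b, s in product((0, 1), repeat=3)
    )


def output_during_switch(mux):
    """a = b = 1 while s falls from 1 to 0; the inverter output lags by one step."""
    trace = []
    for s, not_s in [(1, 0), (0, 0), (0, 1)]:  # middle step: inverter not yet updated
        trace.append(mux(1, 1, s, not_s))
    return trace
```

Here `same_truth_table()` returns `True`, `output_during_switch(mux_with_cover)` returns `[1, 1, 1]`, and `output_during_switch(mux_minimized)` returns `[1, 0, 1]`: the minimized form glitches low for one step.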
41Design Tools Place and Route
- Place and route tools are always vendor-specific
- Much progress remains to be made in place and route tools
- typical P&R times for a reasonably complex design are around 30 minutes to an hour
- device utilization and performance still well below that of hand-placed and routed designs
- Many vendors offer hand-placement or tweaking tools for speed and area critical applications
- Partial compilation of macros in the works
42Design Tools Timing Analysis
- Especially important for path-dependent delay devices
- Designs often iterate at this point - critical path is extracted and optimized
- Timing analysis tools also have a ways to go
- difficult to analyze designs with multiple clock domains
- impossible to analyze designs with combinational loops
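Critical path extraction, and the reason combinational loops defeat it, can be sketched in Python. This is a simplified illustration rather than how any vendor tool works: it assumes each node has a single fixed delay and a single clock domain, and the netlist and delay values are invented. It finds the slowest path by visiting nodes in topological order, which is only possible when the netlist has no cycle.

```python
from graphlib import CycleError, TopologicalSorter


def critical_path(delay_ns, fanin):
    """Return (total_delay, path) of the slowest path through a netlist.

    delay_ns: {node: delay in ns}; fanin: {node: [nodes driving it]}.
    Raises CycleError if the netlist contains a combinational loop.
    """
    graph = {node: fanin.get(node, []) for node in delay_ns}
    arrival, prev = {}, {}
    # static_order yields drivers before the nodes they drive.
    for node in TopologicalSorter(graph).static_order():
        drivers = fanin.get(node, [])
        best = max(drivers, key=lambda d: arrival[d], default=None)
        arrival[node] = delay_ns[node] + (arrival[best] if best is not None else 0.0)
        prev[node] = best
    end = max(arrival, key=arrival.get)
    path = [end]
    while prev[path[-1]] is not None:
        path.append(prev[path[-1]])
    return arrival[end], path[::-1]


# Hypothetical netlist with made-up delays.
delays = {"in_reg": 1.0, "lut_a": 2.0, "lut_b": 2.0, "carry": 0.5, "out_reg": 1.0}
fanin = {
    "lut_a": ["in_reg"],
    "lut_b": ["lut_a"],
    "carry": ["in_reg"],
    "out_reg": ["lut_b", "carry"],
}
```

For this netlist the critical path is `in_reg`, `lut_a`, `lut_b`, `out_reg` with a total of 6.0 ns. If two nodes drive each other, no topological order exists and `CycleError` is raised, mirroring the last bullet above.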
43Design Tools Bitstream management
- Bitstreams can be merged
- daisy-chained devices
44Configuration
- Applies to ISP devices only
- Many options
- serial ROM
- master mode with standard ROM
- slave of intelligent host or another FPGA in a daisy chain
- Serial ROM
- very popular in ASIC-style applications
- low pin and parts count, but sometimes slower
45Configuration
- Master mode with parallel ROM device
- FPGA drives a ROM's address bits and reads data from ROM
- expensive in terms of pins, but pins can be reused in some designs
- sometimes faster than serial methods
- Slave modes
- intelligent host configures FPGA
- host can be a PC or another FPGA in master mode
- most flexible method
- many FPGA architectures allow daisy chaining
46Other Considerations
- Design for the future
- vertical migration
- largest FPGA for your budget
- Be wary of logic interface levels and new low voltage devices
- Pin-locking
- some FPGA architectures perform very poorly under pin-locking (Altera 8K in particular)
- all architectures experience some performance loss under pin-locking
- I/O count
- many designs are I/O limited, not logic limited
- System performance, not just logic performance
- includes I/O times and routing times, clock skew
- compare to FF toggle rates often quoted by vendors
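The gap between a quoted flip-flop toggle rate and real system speed can be shown with a rough calculation. The Python sketch below is a simplified model with made-up numbers: it assumes the clock period is the sum of flip-flop, logic, routing, I/O, and skew delays, and none of the values come from a data sheet.

```python
def max_clock_mhz(ff_ns, logic_ns=0.0, routing_ns=0.0, io_ns=0.0, skew_ns=0.0):
    """Clock rate allowed by the sum of delays along one register-to-register path."""
    period_ns = ff_ns + logic_ns + routing_ns + io_ns + skew_ns
    return 1000.0 / period_ns


# Illustrative values only.
toggle_rate = max_clock_mhz(ff_ns=4.0)  # flip-flop alone
system_rate = max_clock_mhz(ff_ns=4.0, logic_ns=6.0, routing_ns=6.0,
                            io_ns=8.0, skew_ns=1.0)
print(f"toggle rate {toggle_rate:.0f} MHz, system rate {system_rate:.0f} MHz")
# prints: toggle rate 250 MHz, system rate 40 MHz
```

With these assumed delays the usable system clock is several times slower than the toggle rate, which is why the toggle figure alone says little about system performance.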