Title: Welcome to the ECE 449 Computer Design Lab
1 Lecture 18
FPGA Boards
FPGA-based Supercomputers
High Level Language (HLL) Design Methodology
2 Resources
- PCI: http://en.wikipedia.org/wiki/Peripheral_Component_Interconnect
- PCI-X: http://en.wikipedia.org/wiki/PCI-X
- Reconfigurable Supercomputing, T. El-Ghazawi, K. Gaj, D. Buell, D. Pointer. Tutorial at the Supercomputing 2005 conference: http://hpcl.seas.gwu.edu/openfpga/tutorial_html/index.html
3 FPGA Device Capacity Trends
Xilinx Device Complexity (chart; the Year axis runs from 1985 to 2006). Devices shown, from largest to smallest:
- Virtex-5: 550 MHz, 24M gates
- Virtex-4: 500 MHz, 16M gates
- Virtex-II Pro: 450 MHz, 8M gates
- Virtex-II: 450 MHz, 8M gates
- Spartan-3: 326 MHz, 5M gates
- Virtex-E: 240 MHz, 4M gates
- Virtex: 200 MHz, 1M gates
- XC4000: 100 MHz, 250K gates
- Spartan-II: 200 MHz, 200K gates
- Spartan: 80 MHz, 40K gates
- XC5200: 50 MHz, 23K gates
- XC3000: 85 MHz, 7.5K gates
- XC2000: 50 MHz, 1K gates
Source: http://class.ece.iastate.edu/cpre583/lectures/Lect-01.ppt
4 FPGA Boards
5 General Architecture of an FPGA-Based Board
6 Reconfigurable Computing Boards
- Boards may have one or several interconnected FPGA chips
- Support different bus standards, e.g. PCI, PCI-X, USB, etc.
- May have direct real-time data I/O through a daughter board
- Boards may have local onboard memory (OBM) to handle large data while avoiding the system bus (e.g. PCI) bottleneck
7 Reconfigurable Computing Boards
- Many boards per node can be supported
- A host program (e.g. in C) interfaces the user (and the µP) with the board via a board API
- Driver API functions may include Reset, Open, Close, Set Clocks, DMA, Read, Write, Download Configurations, Interrupt, and Readback
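To make the host-side call sequence concrete, here is a minimal sketch of how a host program might drive such a board API. Everything in it is hypothetical: `MockBoard` and its method names only mirror the function names listed above and do not correspond to any vendor's driver; the "board" is a plain Python list standing in for onboard memory.

```python
class MockBoard:
    """Hypothetical stand-in for a board API; no real driver is involved."""

    def __init__(self, memory_words=1024):
        self.is_open = False
        self.configured = False
        self.memory = [0] * memory_words  # models onboard memory (OBM)

    def _require_open(self):
        if not self.is_open:
            raise RuntimeError("board is not open")

    def _require_ready(self):
        self._require_open()
        if not self.configured:
            raise RuntimeError("no configuration downloaded")

    def open(self):
        self.is_open = True

    def reset(self):
        self._require_open()
        self.configured = False
        self.memory = [0] * len(self.memory)

    def download_configuration(self, bitstream):
        self._require_open()
        if not bitstream:
            raise ValueError("empty configuration")
        self.configured = True

    def write(self, address, words):
        self._require_ready()
        if address < 0 or address + len(words) > len(self.memory):
            raise IndexError("write outside onboard memory")
        self.memory[address:address + len(words)] = list(words)

    def read(self, address, count):
        self._require_ready()
        return self.memory[address:address + count]

    def close(self):
        self.is_open = False


# Typical order of calls: open, configure, move data, close.
board = MockBoard()
board.open()
board.download_configuration(b"\x00\x01")  # placeholder bytes, not a real bitstream
board.write(0, [1, 2, 3])
result = board.read(0, 3)
board.close()
```

The point of the sketch is only the ordering constraint: the board must be opened and configured before any data transfer, which is why the mock rejects reads and writes in any other state.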
8 Common Interface: PCI
- PCI: Peripheral Component Interconnect
(Figure labels: 64-bit bus, 32-bit bus)
9 PCI: Conventional hardware specifications
- 32-bit or 64-bit bus width
- 33.33 MHz clock with synchronous transfers
- peak transfer rate of 133 MB per second for 32-bit bus width (33.33 MHz × 32 bits ÷ (8 bits per byte) = 133 MB/s)
- peak transfer rate of 266 MB/s for 64-bit bus width
- 32-bit address space (4 gigabytes)
- 32-bit port space
- 5-volt signaling
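The peak-rate arithmetic above can be checked with a short sketch. The function name is an illustrative choice, and the model assumes one transfer per clock cycle with 1 MB taken as 10^6 bytes.

```python
def peak_rate_mb_per_s(clock_mhz, bus_width_bits):
    """Peak transfer rate in MB/s, assuming one transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return clock_mhz * bytes_per_transfer


pci32 = peak_rate_mb_per_s(33.33, 32)  # 32-bit conventional PCI
pci64 = peak_rate_mb_per_s(33.33, 64)  # 64-bit conventional PCI

print(f"32-bit PCI: {pci32:.1f} MB/s")
print(f"64-bit PCI: {pci64:.1f} MB/s")
```

These are upper bounds; sustained throughput on a shared bus is lower because of arbitration and protocol overhead.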
10 PCI-X (PCI eXtended)
- PCI-X doubles the width to 64-bit, revises the protocol, and increases the maximum signaling frequency to 133 MHz (peak transfer rate of 1014 MiB/s, roughly 1.06 GB/s)
- PCI-X 2.0 permits a 266 MHz rate (peak transfer rate of 2035 MiB/s, roughly 2.13 GB/s) and also a 533 MHz rate, adds a 16-bit bus variant and allows for 1.5 volt signaling
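The 1014 and 2035 figures quoted for PCI-X are lower than a 64-bit bus at these clocks delivers in decimal megabytes; they match if a megabyte is read as 2^20 bytes. The sketch below shows the conversion. The exact clock values (133 MHz and 266.67 MHz) are assumptions chosen because they reproduce the quoted numbers.

```python
MIB = 2 ** 20  # bytes in one binary megabyte (MiB)


def peak_rate_bytes_per_s(clock_hz, bus_width_bits):
    """Peak rate in bytes per second, assuming one transfer per clock."""
    return clock_hz * bus_width_bits // 8


def to_mb(rate_bytes):
    return rate_bytes / 10 ** 6  # decimal megabytes


def to_mib(rate_bytes):
    return rate_bytes / MIB  # binary megabytes


pcix_133 = peak_rate_bytes_per_s(133_000_000, 64)
pcix_266 = peak_rate_bytes_per_s(266_670_000, 64)

for name, rate in (("PCI-X 133", pcix_133), ("PCI-X 266", pcix_266)):
    print(f"{name}: {to_mb(rate):.0f} MB/s = {to_mib(rate):.1f} MiB/s")
```

When comparing boards, it is worth confirming which unit a datasheet uses, since the two differ by about 5%.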
11 Some Reconfigurable Board Vendors
- ANNAPOLIS MICRO SYSTEMS, INC. (www.annapmicro.com)
- University of Southern California, USC/ISI (http://www.east.isi.edu)
- AMONTEC (www.amontec.com/chameleon.shtml)
- XESS Corporation (www.xess.com)
- CELOXICA (www.celoxica.com)
- CESYS (www.cesys.com)
- TRAQUAIR (www.traquair.com)
- SILICON SOFTWARE (www.silicon-software.com)
- COMPAQ (www.research.compaq.com/SRC/pamette/)
- ALPHA DATA (www.alpha-data.com)
- Associated Professional Systems (www.associatedpro.com)
- NALLATECH (www.nallatech.com)
12 WILDSTAR II Pro
(Figure reproduced and displayed with permission)
13 WILDSTAR II Pro
(Figure reproduced and displayed with permission)
14 Reconfigurable Supercomputers
15 Scalable Reconfigurable Systems
- Large numbers of reconfigurable processors and microprocessors
- Everything can be configured
  - Functional units
  - Interconnects
  - Interfaces
- High level of scalability
- Suitable for a wide range of applications
- Everything can be reconfigured over and over at run time (Run-Time Reconfiguration) to suit the underlying applications
- Can be easily programmed by application scientists, at least as easily as conventional parallel computers
16 Early Reconfigurable Architecture
17 Current Reconfigurable Architecture
(Figure labels: µP . . . µP, memory, shared memory and/or NIC)
18 Possible Classes of Reconfigurable Supercomputers
- Independent Board Design: µP 1 ... µP N on a µP board, RP 1 ... RP N on a separate RP board
- Joint Board Design: µP 1 ... µP N and RP 1 ... RP N on a joint µP/RP board
(Figure label: Tighter Integration)
19 Possible Classes of Reconfigurable Supercomputers (cont.)
- µP inside of RP Design: µP 1 ... µP N within RP 1 ... RP N on a joint µP/RP board
- RP inside of µP Design: RP 1 ... RP N within µP 1 ... µP N on a joint µP/RP board
(Figure label: Tighter Integration)
20 FPGA-based supercomputers
Machine (year released):
- SRC 6 from SRC Computers (2002)
- Cray XD1 from Cray (2005)
- SGI Altix from SGI (2005)
- SRC 7 from SRC Computers, Inc. (2006)
21 How to choose the system that best suits your needs?
Typical user criteria
1. Clock speed
2. Amount of memory
3. Cost of ownership
22 How to choose the system that best suits your needs?
Recommended user criteria
1. Tools
   - right level of abstraction
   - ease of development and verification
   - progress and backward compatibility
2. Libraries
   - basic operations
   - examples of full applications
3. Technical support
23 How to choose the system that best suits your needs?
Recommended user criteria (cont.)
4. Data bandwidth
(Figure labels: Reconfigurable Processor System, µP system, external I/O devices)
24 How to choose the system that best suits your needs?
Recommended user criteria (cont.)
5. Scalability
   - variable power and price
   - efficient communication among the modules
25 Recommended user criteria (cont.)
6. Transfer of control overhead
(Figure: timelines of theoretical vs. actual behavior as execution alternates between the µP and the FPGA; the actual timeline includes control transfer overhead. Axis: time)
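The gap between theoretical and actual behavior can be estimated with a simple model: each hand-off between the µP and the FPGA adds a fixed delay on top of the useful compute time. The sketch below is an illustration under that assumption; the function names and the sample numbers are hypothetical and not measurements from any system.

```python
def total_time_us(compute_us, transfers, overhead_us):
    """Execution time when every µP<->FPGA hand-off costs a fixed overhead."""
    return compute_us + transfers * overhead_us


def efficiency(compute_us, transfers, overhead_us):
    """Fraction of the total time spent on useful computation."""
    return compute_us / total_time_us(compute_us, transfers, overhead_us)


# Illustrative numbers: 1000 us of useful work split over 20 hand-offs.
theoretical = total_time_us(1000.0, 20, 0.0)   # no transfer cost
actual = total_time_us(1000.0, 20, 10.0)       # 10 us per hand-off
print(theoretical, actual, round(efficiency(1000.0, 20, 10.0), 3))
```

Under this model the overhead grows with the number of hand-offs, so batching more work per call to the FPGA reduces its share of the run time.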
26 High Level Language (HLL) Design Methodology: Handel-C
27 Behavioral Synthesis
(Flow diagram: Algorithm, I/O Behavior and Target Library feed Behavioral Synthesis, which produces an RTL Design; the Classic RTL Design Flow then applies Logic Synthesis to produce a Gate-level Netlist)
28 Need for High-Level Design
- Higher level of abstraction
- Modeling complex designs
- Reduce design efforts
- Fast turnaround time
- Technology independence
- Ease of HW/SW partitioning
29 Advantages of Behavioral Synthesis
- Easy to model higher levels of complexity
- Source is smaller than the equivalent RTL code
- Generates RTL much faster than the manual method
- Multi-cycle functionality
- Loops
- Memory Access
30 High-Level Languages
- C/C++-based
  - Handel-C: Celoxica Ltd., UK
  - Impulse C: Impulse Accelerated Technologies
  - Catapult C: Mentor Graphics
  - SystemC: The Open SystemC Initiative
- Java-based
  - Forge: Xilinx
  - JHDL: Brigham Young University
31 Other High-Level Design Flows
- Matlab-based
  - System Generator for DSP: Xilinx
  - AccelChip DSP Synthesis: AccelChip
- GUI data-flow based
  - CoreFire: Annapolis Micro Systems
  - RC Toolbox: DSPlogic
32 Handel-C Design Flow
33 Design Flow
(Flow diagram labels: Executable Specification, Handel-C, VHDL, Synthesis, EDIF, EDIF, Place & Route)
34 Handel-C/ANSI-C Comparisons
(Diagram comparing the ANSI-C and Handel-C feature sets. Labels:)
- Handel-C Standard Library
- ANSI-C Standard Library
- Preprocessors, i.e. #define
- Parallelism
- Pointers
- Structures
- Channels
- Side effects, i.e. X = Y++
- ANSI-C constructs: for, while, if, switch
- Arbitrary width variables
- Arrays
- Bitwise logical operators
- Enhanced bit manipulation
- Recursion
- Logical operators
- Arithmetic operators
- RAM, ROM
- Signals
- Functions
- Floating point
- Interfaces
35 Variables
- Only one fundamental type for variables: int
  - int 5 x;
  - unsigned int 13 y;
- Default types
  - char: 8 bits
  - short: 16 bits
  - long: 32 bits
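One practical consequence of arbitrary-width variables is that values are confined to the declared number of bits. The sketch below models this for the two declarations above, assuming two's complement representation and that bits beyond the declared width are simply discarded; both assumptions, and the helper names, are mine rather than taken from the slides.

```python
def wrap_unsigned(value, width):
    """Keep only the low `width` bits, as an unsigned value."""
    return value & ((1 << width) - 1)


def wrap_signed(value, width):
    """Keep only the low `width` bits, read as two's complement."""
    masked = value & ((1 << width) - 1)
    sign_bit = 1 << (width - 1)
    return masked - (1 << width) if masked & sign_bit else masked


# int 5 x holds -16..15; unsigned int 13 y holds 0..8191.
x_overflow = wrap_signed(15 + 1, 5)
y_overflow = wrap_unsigned(8191 + 1, 13)
print(x_overflow, y_overflow)
```

Choosing the narrowest width that covers the value range matters in hardware because every extra bit costs a register bit and wider arithmetic logic.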
36 Type Summary
37 Arrays
- Same way as in ANSI-C
- int 6 x[7];
  - 7 registers, each 6 bits wide
- unsigned int 6 x[4][5][6];
  - 120 registers, each 6 bits wide
- Index must be a compile time constant. If random access is required, consider using RAM or ROM
38 Internal RAMs and ROMs
- Using ram and rom keywords
- ram int 6 a[43];
  - a RAM consisting of 43 entries, each 6 bits wide
- rom int 16 b[4];
  - a ROM consisting of 4 entries, each 16 bits wide
- RAMs and ROMs are accessed the same way that arrays are accessed in ANSI-C
- Index need not be a compile time constant
39 Restrictions on RAMs and ROMs
- RAMs and ROMs are restricted to performing operations sequentially. Only one element may be addressed in any given clock cycle
- ram unsigned int 8 x[4];
- x[1] = x[3] + 1; (illegal)
- if (x[0] == 0)
-     x[1] = 1; (illegal)
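The one-access-per-cycle rule can be illustrated with a small software model. This is a sketch only: `SinglePortRam` and its `tick` method are hypothetical names, and the check happens at run time here, whereas a hardware compiler would reject the illegal statements before any circuit is built.

```python
class SinglePortRam:
    """Toy model of a RAM that allows one access per clock cycle."""

    def __init__(self, entries):
        self.data = [0] * entries
        self.accessed_this_cycle = False

    def _access(self):
        if self.accessed_this_cycle:
            raise RuntimeError("second RAM access in the same clock cycle")
        self.accessed_this_cycle = True

    def read(self, index):
        self._access()
        return self.data[index]

    def write(self, index, value):
        self._access()
        self.data[index] = value

    def tick(self):
        """Advance to the next clock cycle."""
        self.accessed_this_cycle = False


ram = SinglePortRam(4)
ram.write(3, 5)
ram.tick()

# Legal rewrite of x[1] = x[3] + 1: read in one cycle, write in the next.
temp = ram.read(3)
ram.tick()
ram.write(1, temp + 1)
ram.tick()

# Illegal form: read and write in the same cycle.
try:
    ram.write(1, ram.read(3) + 1)
    conflict_detected = False
except RuntimeError:
    conflict_detected = True
```

Splitting the statement through a temporary register costs one extra clock cycle, which is the trade-off for using a RAM instead of an array of registers.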
40 Multi-port RAMs
- static mpram Fred
- {
-     ram <unsigned 8> ReadWrite[256]; (read/write port)
-     rom <unsigned 8> Read[256]; (read only port)
- };
- Now we can read and write in a given clock cycle
41 Dual Port Memory
42 Handel-C Language (1)
- A subset of ANSI-C
- Sequential software style with a par construct to implement parallelism
- A channel (chan) statement allows for communication and synchronization between parallel branches
- Level of design abstraction is above RTL but below behavioral
43 Handel-C Language (2)
- Each assignment and delay statement takes one clock cycle
- Automatic generation of the state machine from an algorithmic description of the circuit in terms of parallel and sequential blocks
- Automatic scheduling of parallel and sequential blocks, that is, the code following a group is scheduled only after that whole group has completed
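The timing rules above lend themselves to a simple cycle count: a sequential block takes the sum of its parts, and a parallel block finishes when its slowest branch does. The sketch below is a model under those rules only; the tuple-based program representation is a hypothetical encoding chosen for illustration, and it ignores constructs such as channels or loops.

```python
def cycles(block):
    """Count clock cycles for 'assign', 'delay', ('seq', [...]) or ('par', [...])."""
    if block in ("assign", "delay"):
        return 1  # each assignment or delay takes one clock cycle
    kind, children = block
    counts = [cycles(child) for child in children]
    if kind == "seq":
        return sum(counts)  # statements run one after another
    if kind == "par":
        return max(counts, default=0)  # group completes with its slowest branch
    raise ValueError(f"unknown block kind: {kind}")


program = ("seq", [
    "assign",
    ("par", [
        "assign",
        ("seq", ["assign", "assign", "delay"]),
    ]),
    "assign",
])
total_cycles = cycles(program)
print(total_cycles)
```

In this example the par group costs three cycles because its second branch is a three-statement sequence, so the statement after the group starts only in the fifth cycle.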
44 Parallelism
(Figure labels: Statement, Parallel blocks)
45 Par construct: Examples
46 Par constructs: timing
47 Par construct: shift register
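The slide's own figure is not part of this transcript, so the following is only a sketch of the idea behind a par-based shift register. It assumes that all assignments in a parallel block read the values from before the clock edge and commit together, and it uses a hypothetical three-stage register with stages named a, b and c.

```python
def par_step(state, assignments):
    """Evaluate every right-hand side on the old state, then commit together."""
    new_values = {target: expr(state) for target, expr in assignments.items()}
    updated = dict(state)
    updated.update(new_values)
    return updated


def shift_register_step(state, data_in):
    """One clock cycle of a three-stage shift register a -> b -> c."""
    return par_step(state, {
        "a": lambda s: data_in,
        "b": lambda s: s["a"],
        "c": lambda s: s["b"],
    })


state = {"a": 0, "b": 0, "c": 0}
history = []
for sample in [1, 2, 3, 4]:
    state = shift_register_step(state, sample)
    history.append((state["a"], state["b"], state["c"]))
print(history)
```

If the three assignments ran sequentially instead, the input would reach all three stages in a single step; evaluating them against the old state is what produces the one-stage-per-cycle shift.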
48 Handel-C vs. C: functions
- Functions may not be called recursively, since all logic must be expanded at compile-time to generate hardware
- You can only call functions in expression statements. These statements must not contain any other calls or assignments.
- Variable length parameter lists are not supported.
- Old-style ANSI-C function declarations (where the type of the parameters is not specified) are not supported.
- main() functions take no arguments and return no values.
- Each main() function is associated with a clock. If you have more than one main() function in the same source file, they must all use the same clock.
49 Handel-C Overview
- High-level language based on ISO/ANSI-C for the implementation of algorithms in hardware
- Allows software engineers to design hardware without retraining
- Clean extensions for hardware design including flexible data widths, parallelism and communications
- Based on the Communicating Sequential Processes (CSP) model
  - Independent parallel processes
  - par construct to specify parallel computation blocks within a process
- Well defined timing model
  - Each statement takes a single clock cycle
- Includes extended operators for bit manipulation, and high-level mathematical macros (including floating point)