The Design and Application of Berkeley Emulation Engines - PowerPoint PPT Presentation

Slides: 27
Provided by: chen190
Learn more at: http://www.fdis.org

1
The Design and Application of Berkeley
Emulation Engines
  • John Wawrzynek
  • Bob Brodersen
  • Chen Chang
  • University of California, Berkeley
  • Berkeley Wireless Research Center

2
Berkeley Emulation Engine (BEE), 2002
  • FPGA-based system for real-time hardware
    emulation
  • Emulation speeds up to 60 MHz
  • Emulation capacity of 10 Million ASIC
    gate-equivalents (although not a logic gate
    emulator), corresponding to 600 Gops (16-bit
    adds)
  • 2400 external parallel I/Os providing 192 Gbps
    raw bandwidth.
  • 20 Xilinx VirtexE 2000 chips, 16 1MB ZBT SRAM
    chips.
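
The headline figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a sanity check on the quoted numbers; the constant and function names are illustrative and not part of the BEE design.

```python
# Sanity check of the BEE (2002) headline figures quoted on the slide.
# All names here are illustrative; only the numbers come from the slide.

EMULATION_CLOCK_HZ = 60_000_000      # "up to 60 MHz"
PEAK_OPS_PER_SEC = 600_000_000_000   # "600 Gops (16-bit adds)"
EXTERNAL_IOS = 2400                  # "2400 external parallel I/Os"
RAW_IO_BITS_PER_SEC = 192_000_000_000  # "192 Gbps raw bandwidth"


def adds_per_cycle() -> int:
    """16-bit adds that must complete every clock to reach the peak rate."""
    return PEAK_OPS_PER_SEC // EMULATION_CLOCK_HZ


def bits_per_sec_per_pin() -> int:
    """Raw bandwidth carried by each external I/O pin."""
    return RAW_IO_BITS_PER_SEC // EXTERNAL_IOS


if __name__ == "__main__":
    print(f"{adds_per_cycle()} parallel 16-bit adds per cycle")
    print(f"{bits_per_sec_per_pin() / 1e6:.0f} Mbit/s per I/O pin")
```

In other words, the 600 Gops figure implies 10,000 adders working in parallel at 60 MHz, and the I/O figure implies 80 Mbit/s per pin.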

3
Realtime Processing Allows In-System Emulation
4
Matlab/Simulink Programming Tools
Discrete-Time-Block-Diagrams with FSMs
  • Tool flow developed by Mathworks, Xilinx, and
    UCB.
  • User specifies design as block diagrams (for
    datapaths) and finite state machines for control.
  • Tools automatically map to both FPGAs and ASIC
    implementation.
  • User assisted partitioning with automatic system
    level routing.

5
BEE Status
  • Four BEE processing units built
      • Three in near continuous production use
  • Other supported universities
      • CMU, USC, Tampere, UMass, Stanford
  • Successful tapeout of
      • 3.2M transistor pico-radio chip
      • 1.8M transistor LDPC decoder chip
  • Systems emulated
      • QPSK radio transceiver
      • BCJR decoder
      • MPEG IDCT
  • On-going projects
      • UWB mixed-signal SOC
      • MPEG/PRISM transcoder
      • Pico radio multi-node system
      • Infineon SIMD processor for SDR

6
Lessons from BEE
  • Real-time performance vastly eases the
    debugging/verification/tuning process.
  • Simulink based tool-flow very effective FPGA
    programming model in DSP domain.
  • System emulation tasks are significant
    computations in their own right, so
    high-performance emulation hardware makes for
    high-performance general computing.
  • Is this the right way to build high-end (super)
    computers?

BEE could be scaled up with latest FPGAs and by
using multiple boards → BEE2 (and beyond).
7
BEE2 Hardware
  1. Modular design scalable from a few to hundreds of
    FPGAs.
  2. High memory capacity and bandwidth to support
    general computing applications.
  3. High bandwidth / low-latency inter-module
    communication to support massive parallelism.
  4. All off-the-shelf components; no custom chips.
  • Thanks to Xilinx for engineering assistance,
    FPGAs, and interaction on application
    development.

8
Basic Computing Element
  • Single Xilinx Virtex 2 Pro 70 FPGA
  • 130nm technology
  • 70K logic cells
  • 1704 package with 996 user I/O pins
  • 2 PowerPC405 cores
  • 326 dedicated multipliers (18-bit)
  • 5.8 Mbit on-chip SRAM
  • 20X 3.125-Gbit/s duplex serial communication
    links (MGTs)
  • 4 physical DDR2-400 banks
  • Per FPGA up to 12.8 Gbyte/s memory bandwidth and
    maximum 8 GByte capacity.
  • Virtex 4 (90nm) out now, 2x capacity, 2x
    frequency.
  • Virtex 5 (65nm) next spring.
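
The memory and serial-link figures above follow from simple arithmetic. The sketch below reproduces them under one assumption that the slide does not state: each DDR2-400 bank has a 64-bit (8-byte) data bus, as on a standard DIMM. The names are illustrative.

```python
# Back-of-the-envelope check of the per-FPGA bandwidth figures.
# ASSUMPTION (not on the slide): each DDR2-400 bank has a 64-bit
# (8-byte) data bus, as on a standard DIMM.

DDR2_TRANSFERS_PER_SEC = 400_000_000  # DDR2-400: 400 million transfers/s
BUS_WIDTH_BYTES = 8                   # assumed 64-bit data bus
NUM_BANKS = 4                         # "4 physical DDR2-400 banks"

MGT_LINKS = 20                        # "20X ... serial communication links"
MGT_RATE_GBIT = 3.125                 # Gbit/s per link, per direction


def memory_bandwidth_gbytes() -> float:
    """Peak DRAM bandwidth per FPGA in Gbyte/s."""
    return DDR2_TRANSFERS_PER_SEC * BUS_WIDTH_BYTES * NUM_BANKS / 1e9


def mgt_bandwidth_gbit() -> float:
    """Aggregate raw MGT bandwidth per direction in Gbit/s."""
    return MGT_LINKS * MGT_RATE_GBIT


if __name__ == "__main__":
    print(f"DRAM: {memory_bandwidth_gbytes():.1f} Gbyte/s")
    print(f"MGT:  {mgt_bandwidth_gbit():.1f} Gbit/s per direction")
```

Under that assumption the four banks give the 12.8 Gbyte/s quoted on the slide, and the 20 MGTs add up to 62.5 Gbit/s of raw serial bandwidth in each direction.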

9
Compute Module Diagram
10GigE or Infiniband
10
Compute Module
Completed 12/04.
  • Module also includes I/O for administration and
    maintenance
  • 10/100 Ethernet
  • HDMI / DVI
  • USB

14 x 17 inch, 22-layer PCB
11
Inter-Module Connections
Global Communication Tree
Stream Packets
Admin, UI, NFS
12
Alternative topology 3D mesh or torus
  • The 4 compute FPGAs can be used to extend to a 3D
    mesh/torus
  • 6 directional links
  • 4 off-board MGT links
  • 2 on-board LVCMOS links
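
To make the "6 directional links" concrete, here is a minimal sketch of neighbor addressing in a 3D torus. The coordinate scheme and the function name are hypothetical; the slide does not say which axes map to the MGT links and which to the LVCMOS links, so no such mapping is modeled.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six neighbors of `node` in a 3D torus of size `dims`.

    Each axis contributes two links (-1 and +1); the modulo provides the
    wraparound that distinguishes a torus from a plain mesh.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, +1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append(tuple(coord))
    return neighbors


if __name__ == "__main__":
    # A corner node of a 4 x 4 x 4 torus still has six neighbors.
    for n in torus_neighbors((0, 0, 0), (4, 4, 4)):
        print(n)
```

For a mesh instead of a torus, the modulo would be replaced by a bounds check, and edge nodes would have fewer than six neighbors.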

13
19-inch Rack Cabinet Capacity
  • 40 compute modules in 5 chassis (8U) per rack
  • ~40 TeraOPS, ~1.5 TeraFLOPS
  • 150 Watt AC/DC power supply to each blade
  • 6 kW power consumption
  • Hardware cost $500K
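
The rack-level numbers are consistent with the per-module figures, as the following sketch shows. It only multiplies out the values quoted on the slide; the names are illustrative.

```python
# Rack-level arithmetic from the per-module figures on the slide.
# Names are illustrative; only the numbers come from the slide.

MODULES_PER_RACK = 40
CHASSIS_PER_RACK = 5
CHASSIS_HEIGHT_U = 8
WATTS_PER_MODULE = 150
RACK_TERAOPS = 40  # approximate peak quoted on the slide


def modules_per_chassis() -> int:
    return MODULES_PER_RACK // CHASSIS_PER_RACK


def rack_height_u() -> int:
    return CHASSIS_PER_RACK * CHASSIS_HEIGHT_U


def rack_power_kw() -> float:
    return MODULES_PER_RACK * WATTS_PER_MODULE / 1000


def teraops_per_module() -> float:
    return RACK_TERAOPS / MODULES_PER_RACK


if __name__ == "__main__":
    print(f"{modules_per_chassis()} modules per chassis")
    print(f"{rack_height_u()}U of chassis per rack")
    print(f"{rack_power_kw():.1f} kW per rack")
    print(f"about {teraops_per_module():.1f} TeraOPS peak per module")
```

So each chassis holds 8 modules, the five chassis occupy 40U, and 40 supplies of 150 W each account for the 6 kW total.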

14
Why are these systems interesting?
  • Best solution in several domains
      • Emulation for custom chip design
      • Extreme real-time signal processing tasks
      • Scientific and Supercomputing
  • Good model on how to build future chips and
    systems
      • Massively parallel
      • Fine-grained reconfigurability enables
          • Robust performance/power efficiency on a
            wide-range of problems.
          • Manufacturing defect tolerance.

15
Moore's Law in FPGA world
100X higher performance, 100X more efficient than
microprocessors
FPGA performance doubles every 12 months
16
Extreme Digital-Signal-Processing
BEE2 is a promising computing platform for the
Allen Telescope Array (ATA) (350 antennas) and the
proposed Square Kilometer Array (SKA) (1K
antennas): SETI spectrometer, image formation for
radio astronomy research
  • Massive arithmetic operations per second
    requirement.
  • Stream-based computation model
  • Real-time requirement
  • High-bandwidth data I/O
  • Low numerical precision requirements
  • Mostly fixed-point operations
  • Rarely needs floating point
  • Data-flow processing dominated
  • few control branch points

17
SETI Spectrometer
  • Target 0.7Hz channels over 800MHz → 1 billion
    channel real-time spectrometer
  • Result
  • One BEE2 module meets target and yields 333GOPS
    (16-bit mults, 32-bit adds), at 150Watts (similar
    to desk-top computer)
  • >100x peak throughput of current Pentium-4 system
    on integer performance, >100x better throughput
    per energy.
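
The "1 billion channel" target and the efficiency claim can be reproduced from the numbers on this slide. This is a rough check, not a model of the spectrometer itself; the names are illustrative.

```python
# Rough check of the SETI spectrometer figures quoted on the slide.
# Names are illustrative; only the numbers come from the slide.

BANDWIDTH_HZ = 800e6       # total bandwidth covered
CHANNEL_WIDTH_HZ = 0.7     # target channel resolution
THROUGHPUT_GOPS = 333      # achieved on one BEE2 module
MODULE_POWER_W = 150       # power of one BEE2 module


def channel_count() -> float:
    """Number of channels needed to cover the band at the target width."""
    return BANDWIDTH_HZ / CHANNEL_WIDTH_HZ


def gops_per_watt() -> float:
    """Throughput per unit power for one module."""
    return THROUGHPUT_GOPS / MODULE_POWER_W


if __name__ == "__main__":
    print(f"{channel_count() / 1e9:.2f} billion channels")
    print(f"{gops_per_watt():.2f} GOPS per watt")
```

Dividing 800 MHz by 0.7 Hz gives roughly 1.14 billion channels, which the slide rounds to 1 billion, and 333 GOPS at 150 W works out to a little over 2 GOPS per watt.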

18
FPGA versus DSP Chips
  • Spectrometer polyphase filter bank (PFB): 18-bit
    mult; Correlator: 4-bit mult, 32-bit acc.
  • Cost based on street price.
  • Assume peak numbers for DSPs, mapped for FPGAs
    (automatic Simulink tools).
  • TI DSPs
  • C6415-7E, 130nm (720MHz)
  • C6415T-1G, 90nm (1 GHz)
  • FPGAs 130nm, freq. 200-250MHz.

Charts: Performance, Energy Efficiency, Cost-Performance
Metrics include chips only (not system). FPGAs
provide extra benefit at the PC board level.
19
Active Application Areas
  • High-performance DSP
  • SETI Spectroscopy, ATA / SKA Image Formation
  • Scientific computation and simulation
  • E&M simulation for antenna design
  • Communication systems development Platform
  • Algorithms for SDR and Cognitive radio
  • Large wireless Ad-Hoc sensor networks
  • In-the-loop emulation of SOCs and Reconfigurable
    Architectures
  • Bioinformatics
  • BLAST (Basic Local Alignment Search Tool)
    biosequence alignment
  • System design acceleration
  • Full Chip Transistor-Level Circuit Simulation
    (Xilinx)
  • RAMP (Research Accelerator for MultiProcessing)

20
Opportunity for a New Research Platform
RAMP(Research Accelerator for Multiple
Processors)
  • Krste Asanovic (MIT), Christos Kozyrakis
    (Stanford), Dave Patterson (UCB), Jan Rabaey
    (UCB), John Wawrzynek (UCB)
  • July 2005

21
Change in Computer Landscape
  • Old Conventional Wisdom: Uniprocessor performance
    2X / 1.5 yrs (Moore's Law)
  • New Conventional Wisdom: 2X CPUs per socket /
    2 years
  • Problem: Compilers, operating systems,
    architectures not ready for 1000s of CPUs per
    chip, but that's where we're headed
  • How do we do research on 1000 CPU systems in
    compilers, OS, architecture?

22
FPGA Boards as New Research Platform
  • Given 25 soft CPUs can fit in an FPGA, what if we
    made a 1000-CPU system from 40 FPGAs?
  • 64-bit simple RISC at 100 MHz
  • Research community does logic design (gate
    shareware) to create out-of-the-box Massively
    Parallel Processor that runs standard binaries of
    OS and applications
  • Processors, Caches, Coherency, Switches, Ethernet
    Interfaces,
  • Recreate synergy of old VAX BSD Unix?
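
The sizing argument is a one-line calculation, sketched below. The module count rests on an assumption: that only the 4 compute FPGAs per BEE2 module (the figure given on the topology slide) host soft CPUs. The names are illustrative.

```python
import math

# Sizing a 1000-CPU RAMP system from the figures on the slide.
SOFT_CPUS_PER_FPGA = 25
TARGET_CPUS = 1000
# ASSUMPTION: only the 4 compute FPGAs on each BEE2 module host CPUs.
COMPUTE_FPGAS_PER_MODULE = 4


def fpgas_needed(target_cpus: int, cpus_per_fpga: int) -> int:
    """Smallest number of FPGAs that can hold `target_cpus` soft CPUs."""
    return math.ceil(target_cpus / cpus_per_fpga)


def modules_needed(fpgas: int, fpgas_per_module: int) -> int:
    """Smallest number of modules that provide `fpgas` compute FPGAs."""
    return math.ceil(fpgas / fpgas_per_module)


if __name__ == "__main__":
    fpgas = fpgas_needed(TARGET_CPUS, SOFT_CPUS_PER_FPGA)
    modules = modules_needed(fpgas, COMPUTE_FPGAS_PER_MODULE)
    print(f"{fpgas} FPGAs on {modules} modules")
```

With 25 CPUs per FPGA, 1000 CPUs need 40 FPGAs, which under the stated assumption is 10 modules.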

23
Why RAMP Attractive? Priorities for Research
Parallel Computers
  • 1a. Cost of purchase
  • 1b. Cost of ownership (staff to administer it)
  • 1c. Scalability (1000 much better than 100 CPUs)
  • 4. Observability (measure, trace everything)
  • 5. Reproducibility (to debug, run experiments)
  • 6. Community synergy (share code, )
  • 7. Flexibility (change for different experiments)
  • 8. Performance

24
Why RAMP Attractive? Grading SMP vs. Cluster vs. RAMP

                                      SMP             Cluster         RAMP
Cost of purchase (1 CPU, 1 GB DRAM)   D ($40k, $4k)   B ($2k, $0.4k)  A ($0.1k, $0.2k)
Cost of ownership                     A               D               B
Scalability                           C               A               A
Observability                         D               C               A
Reproducibility                       B               D               A
Community                             D               A               A
Flexibility                           D               C               A
Performance (clock)                   A (2 GHz)       A (3 GHz)       D (0.2 GHz)

Costs from TPC-C Benchmark: IBM eServer P5 595 (SMP),
IBM eServer x346 / Apple Xserve (Cluster), BWRC BEE2 (RAMP)
25
Internet in a Box?
  • Could RAMP radically change research in
    distributed computing? (Armando Fox, Ion Stoica,
    Scott Shenker)
  • Existing distributed environments (like
    PlanetLab) very hard to use for development
  • The computers are live on the Internet and
    subject to all kinds of problems (security, ...)
    and there is no reproducibility.
  • You cannot reserve the whole thing for yourself
    and change OS or routing or ....
  • Very expensive to support - the reason the
    biggest ones are order 200 to 300 nodes, and
    there are lots of restrictions on using them.

26
Internet in a Box?
  • RAMP promises a private "internet in a box" for
    $50k to $100k.
  • A collection of 1000 computers running
    independent OS that could do real checkpoints and
    have reproducible behavior.
  • We can set parameters for network delays,
    bandwidth, number of disks, disk latency and
    bandwidth, ...
  • Could have every board running synchronously to
    the same clock cycle,
  • so that we could do a checkpoint at clock cycle
    4,000,000,000, and then reload later from that
    point and cause the network interrupt to occur
    exactly at clock cycle 4,000,000,100 for CPU 104
    every single time.
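
The checkpoint-and-replay idea above can be illustrated with a toy lockstep model. This is a minimal sketch, not RAMP's mechanism: the class, the register update rule, and the interrupt payload are all hypothetical, and the cycle numbers are scaled down so the example runs quickly.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class LockstepSystem:
    """Toy model: every CPU advances exactly once per global clock cycle."""
    num_cpus: int
    cycle: int = 0
    regs: List[int] = field(default_factory=list)
    # (cycle, cpu) -> payload, delivered at exactly that cycle to that CPU
    interrupts: Dict[Tuple[int, int], int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.regs:
            self.regs = [0] * self.num_cpus

    def step(self) -> None:
        for cpu in range(self.num_cpus):
            # Deterministic stand-in for "work" done by the CPU this cycle.
            self.regs[cpu] = (self.regs[cpu] * 1103515245 + cpu + 12345) % 2**31
            payload = self.interrupts.get((self.cycle, cpu))
            if payload is not None:
                self.regs[cpu] ^= payload  # interrupt perturbs CPU state
        self.cycle += 1

    def run_until(self, cycle: int) -> None:
        while self.cycle < cycle:
            self.step()

    def checkpoint(self) -> "LockstepSystem":
        """Snapshot the whole system state."""
        return copy.deepcopy(self)


def replay(snapshot: LockstepSystem, irq_cycle: int, cpu: int,
           payload: int, end_cycle: int) -> List[int]:
    """Reload from a snapshot, inject one interrupt, run to end_cycle."""
    system = snapshot.checkpoint()  # copy so the snapshot stays reusable
    system.interrupts[(irq_cycle, cpu)] = payload
    system.run_until(end_cycle)
    return system.regs


if __name__ == "__main__":
    base = LockstepSystem(num_cpus=128)
    base.run_until(1000)                   # stands in for cycle 4,000,000,000
    snap = base.checkpoint()
    first = replay(snap, 1100, 104, 0xBEE2, 1200)
    second = replay(snap, 1100, 104, 0xBEE2, 1200)
    print("identical replays:", first == second)
```

Because every CPU is driven by the same cycle counter and the interrupt is keyed on an exact (cycle, CPU) pair, replaying from the snapshot gives the same final state on every run.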