PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator

1
PROTOFLEX: FPGA-Accelerated Hybrid Functional Simulator
  • Eric S. Chung, Eriko Nurvitadhi, James C. Hoe,
    Babak Falsafi, Ken Mai
  • {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX/SIMFLEX
2
Multiprocessor Functional Simulation
  • Functionally simulating one processor in software
    is slow
  • Simulating many processors is of course even
    slower
  • Parallelism of FPGAs can scale up functional MP
    simulation perf → conduct large-scale (>64-way)
    SW research, cache simulations, perf sampling
    studies, etc.
  • But we can't forfeit full-ISA, full-system
    fidelity (run stock OS)

FPGAs: unprecedented level of scalability, but
full-system building effort can outweigh any
benefits
Figure labels: CPU, FPGAs, Memory
3
Combining FPGAs and simulators
Figure labels: Target design, FPGA, Simulator, Cpu, Mem, Disk
  • Advantages
  • Leverage full-system simulators for reference
    designs
  • Infrequent, complex behaviors remain simulated
  • TLB misses, block memory instrs, disk I/O
    instrs, SCSI disks, graphics, ...
  • 3 ways to map a target object to a hybrid-simulation
    host
  • Emulation-only, Simulation-only,
    Transplantable
  • Transplant runtime system
  • target processors switch modes between FPGA and
    simulator hosts
  • processors need not execute 100% in FPGA mode
  • e.g., implement only the frequently used ISA
    subset in FPGA
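
The transplant idea above can be illustrated with a minimal sketch. It is not the PROTOFLEX runtime: the opcode names, the `FPGA_SUBSET` set, and the `TargetCpu` class are hypothetical, chosen only to show a target processor switching hosts when it reaches an instruction outside the hardware-implemented subset.

```python
from dataclasses import dataclass

# Hypothetical set of frequently used opcodes implemented in the FPGA.
FPGA_SUBSET = {"add", "sub", "ld", "st", "branch"}


@dataclass
class TargetCpu:
    host: str = "fpga"    # which host currently executes this target CPU
    transplants: int = 0  # number of switches to the software simulator


def run(cpu, trace):
    """Walk a trace of opcode names and record which host runs each one."""
    hosts = []
    for op in trace:
        if op in FPGA_SUBSET:
            cpu.host = "fpga"
        else:
            # Not implemented in hardware: transplant to the simulator for
            # this instruction, then resume on the FPGA afterwards.
            cpu.host = "simulator"
            cpu.transplants += 1
        hosts.append(cpu.host)
    return hosts


cpu = TargetCpu()
print(run(cpu, ["add", "ld", "scsi_io", "add"]), cpu.transplants)
```

Only the rare `scsi_io` opcode (a made-up name) leaves the FPGA here, which is the sense in which processors need not execute 100% in FPGA mode.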

4
It Really Works
Figure labels: Virtutech Simics (commercial simulator), Xilinx XUP Virtex-II Pro 30, our SPARC V9 core, Simics UltraSPARC, embedded PowerPC, transplant message interface
  • BlueSPARC specs
  • 7k lines Bluespec
  • UltraSPARC III ISA
  • Validated against Simics w/ real apps (e.g.,
    Solaris 8, SPEC2000, DB2, Oracle, etc.)
  • 41% of all instr groups implemented, plus MMU
  • 8kB I/D direct-mapped caches
  • multi-cycle func model (CPI_ideal = 5 @ 100 MHz)
  • 16K LUTs (50% of XUP Virtex-II Pro 30)

Simulated target devices
DDR memory
Ethernet
SUN 3800 Server (1x UltraSPARC III, Solaris 8)

Developed in 6 months; x86 also works
5
Is this the best we can do?
  • Reality check: transplants are expensive!
    (10 ms = 1,000,000 cycles)
  • given CPI = 1 @ 100 MHz (100 MIPS), 1 transplant
    per 1 million instructions increases CPI to 2
    (50 MIPS)
  • Recall lessons in hierarchical cache design
  • Hierarchical transplants
  • Run simulator kernel on nearby embedded
    PowerPC
  • write SW to cover the entire ISA
  • only I/O operations need full transplant to
    SIMICS (a 10x reduction in our case)

  • Advantages
  • Now it makes sense to optimize towards CPI_raw = 1
  • You actually need fewer instructions in
    hardware (especially beneficial for x86)

Figure: host hierarchy and effective CPI
FPGA fabric: coverage = 99.9999%, CPI_raw = 1
Embedded PPC ISA sim: coverage = 99.99999%, CPI = 1,000
Full-system SIMICS: coverage = 100%, CPI_tplant = 1,000,000
CPI_effective = 2 (FPGA fabric with full transplants to SIMICS only)
CPI_effective = 1.1 (with the embedded PPC level in between)
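
The effective-CPI figures on this slide can be reproduced with a short calculation. This is a sketch under one assumption: the coverage percentages are read as cumulative fractions of instructions handled at or above each level, so each level pays its CPI only for the instructions the faster levels miss. The function name `effective_cpi` is illustrative.

```python
def effective_cpi(levels):
    """levels: list of (cumulative_coverage, cpi), fastest host first.

    Each level is charged for the slice of instructions it covers that no
    faster level already covered.
    """
    total = 0.0
    covered = 0.0
    for coverage, cpi in levels:
        total += (coverage - covered) * cpi
        covered = coverage
    return total


# FPGA fabric backed directly by full-system SIMICS transplants.
flat = [(0.999999, 1), (1.0, 1_000_000)]

# FPGA fabric, then embedded PPC ISA simulator, then SIMICS.
hierarchical = [(0.999999, 1), (0.9999999, 1_000), (1.0, 1_000_000)]

print(round(effective_cpi(flat), 2))          # about 2
print(round(effective_cpi(hierarchical), 2))  # about 1.1
```

As with a cache hierarchy, the middle level matters less for its own cost (about 0.001 CPI here) than for cutting tenfold the traffic that reaches the most expensive level.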
6
  • Demo

7
How to build a 1024-node MP functional emulator,
without building 1024 nodes?
8
How fast do you need to simulate?
fast enough for 1024-way arch. studies
Aggregate Throughput
  • In the uniprocessor world
  • up to 100x slowdown for interactive software
    research (e.g. Simics)
  • 1,000x to 10,000x slowdown for design exploration (e.g.
    cache simulation)

9
Different approaches to scale to 1K
  • Even for 1K-node MP, only 1000 to 10,000 MIPS
    (aggregate) needed to do useful work
  • The obvious approach
  • build fast ISA core (estimate 100 MIPS per core)
  • physically replicate the core 1000 times
  • → 10x to 100x faster than needed; why spend
    effort and area on perf I don't need?
  • The better approach: think in terms of MIPS
  • build 100 MIPS ISA emulation engine supporting
    multiple contexts
  • map 100 simulated processors onto single engine
  • with just 10 physical engines, I can emulate
    1000-way system (10 x 100 MIPS = 1000 MIPS)
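
The sizing argument above is simple arithmetic, sketched here with hypothetical helper names. The 100 MIPS per engine figure is the slide's own estimate; everything else follows from the aggregate-throughput target.

```python
import math


def engines_needed(target_aggregate_mips, mips_per_engine=100):
    """Engines required to reach an aggregate throughput target."""
    return math.ceil(target_aggregate_mips / mips_per_engine)


def contexts_per_engine(num_target_cpus, num_engines):
    """Simulated processors multiplexed onto each engine."""
    return math.ceil(num_target_cpus / num_engines)


nodes = 1000
engines = engines_needed(1000)                # low end of the 1000 to 10,000 MIPS range
print(engines)                                # 10
print(contexts_per_engine(nodes, engines))    # 100

# Replicating a 100 MIPS core per node instead yields 100,000 MIPS,
# which is 100x the low-end target and 10x the high-end target.
replicated_mips = nodes * 100
print(replicated_mips // 1000, replicated_mips // 10_000)  # 100 10
```

The engine count depends only on the throughput target, not on the number of target nodes; the node count only sets how many contexts each engine must hold.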

10
PROTOFLEXMP
  • Build 1000-MIPS simulator from 10s of emulation
    engines
  • multiplex large number of emulated contexts onto few
    emulation engines
  • Decide number of emulation engines to build from
    desired performance, not from number of nodes to emulate
N-way target system
11
Interleaved Emulation Engine
  • Statically interleaved emulation engine (à la HEP)
  • issue new instr from new context per cycle →
    maximize engine throughput
  • simple pipeline (no fwding or interlock if
    number of contexts > pipe stages)
  • deeper pipelines for higher frequency (or complex
    x86 instrs)
  • hide the latency of memory and transplants
  • It is actually easier to optimize instruction
    throughput
  • Open issues
  • How to manage very large number of contexts? Do we
    have to dynamically page clusters of contexts
    in and out of the engine?
  • How to fake memory capacity? How much DRAM to
    emulate 1000-node system?
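
The claim that a statically interleaved pipeline needs no forwarding or interlocks once there are more contexts than pipe stages can be checked with a toy model. This is a sketch, not the engine's design: it assumes strict round-robin issue and treats an instruction as in flight for `pipe_depth` cycles after it issues. The function name `count_hazards` is illustrative.

```python
from collections import deque


def count_hazards(num_contexts, pipe_depth, cycles):
    """Count issues where a context already has an instruction in flight.

    One instruction issues per cycle, taken from contexts in round-robin
    order. A nonzero result means the pipeline would need forwarding or
    interlocks to handle same-context dependences.
    """
    in_flight = deque(maxlen=pipe_depth)  # contexts of the last pipe_depth issues
    hazards = 0
    for cycle in range(cycles):
        ctx = cycle % num_contexts
        if ctx in in_flight:
            hazards += 1
        in_flight.append(ctx)
    return hazards


print(count_hazards(num_contexts=8, pipe_depth=5, cycles=100))  # 0
print(count_hazards(num_contexts=4, pipe_depth=5, cycles=100))  # nonzero
```

With more contexts than stages, each context's previous instruction has left the pipeline before its next one issues, so the engine still issues every cycle while the per-context latency is hidden.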

12
Conclusion
  • Contributions
  • hybrid transplant simulation reduces FPGA
    development effort
  • proof-of-concept demonstrates up to 16 MIPS on
    select SPECINT
  • → plan to run TPC-C on DB2 and Oracle on BEE2
    (not enough DRAM on XUP)
  • Future work
  • 1024-way system on 10-way interleaved emulation
    engines
  • Thanks! Questions? echung@ece.cmu.edu
  • PROTOFLEX/SIMFLEX (http://www.ece.cmu.edu/simflex)