1
ProtoFlex: Status Update and Design Experiences
  • Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi,
    James C. Hoe, Babak Falsafi, Ken Mai
  • {echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu

PROTOFLEX
Our work in this area has been supported in part
by NSF, IBM, Intel, and Xilinx.
2
Full-system Functional Simulation
  • Effective substitute for real (or non-existent) HW
  • Can boot an OS and run commercial apps
  • Important in SW research and computer architecture
  • But too slow for large-scale MP studies
  • Multicore won't help existing tools
  • A serious challenge for large-MP (1000-way)
    simulation

3
Alternative: FPGA-based simulation
  • Only 10x slower in clock frequency than custom HW
  • But FPGAs are harder to use than software
  • Simulating large MPs (100- to 1000-way) → can't be
    done trivially
  • Simulating full-system support → need devices and the
    entire ISA

The build-all strategy in FPGAs → significant
effort and resources
4
Reducing complexity w/ virtualization
Virtualization: making a single physical resource appear as multiple
logical resources, or making multiple physical resources appear as a
single logical resource.
  • Hybrid Full-System Simulation: target full-system behaviors are
    split across FPGA and software host resources; only frequent
    behaviors are hosted in the FPGA, while infrequent ones are
    relegated to SW.
  • Virtualized MP Simulation: logical target CPUs are multiplexed
    onto fewer physical host CPUs (e.g., many target CPUs onto one
    FPGA CPU).
5
Outline
  • Hybrid Full-System Simulation
  • Virtualized Multiprocessor Simulation
  • BlueSPARC Implementation
  • Design Experiences
  • Future Work

6
Hybrid Full-System Simulation
[Diagram: target system components (CPUs, MMU, memory, NIC, PCI,
SCSI, graphics, terminal, Fibre Channel) are mapped onto an FPGA host
and a software full-system simulator host, with transplants moving
CPU state between the two.]
Hybrid Simulation
  • 3 ways to map a target component to the hybrid
    simulation host:
  • FPGA-only, Simulation-only, or Transplantable
  • CPUs can fall back to SW by transplanting between
    hosts (sketched below)
  • Only common-case instructions/behaviors are
    implemented in the FPGA
  • Remaining behaviors are relegated to SW (which turn
    out to include many of the complex ones)

Transplants reduce full-system design effort
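
As a rough illustration of the transplant idea (a sketch only, not the
actual ProtoFlex code; FpgaEngine, SwSimulator, and ArchState are
hypothetical names), the hybrid host's per-CPU execution loop can be
pictured in C++ as follows:

  #include <cstdint>

  // Sketch: frequent instructions execute on the FPGA engine; when the
  // FPGA hits a behavior it does not implement, the CPU's architectural
  // state is transplanted to the software full-system simulator, executed
  // there, and then copied back so FPGA execution can resume.
  struct ArchState { uint64_t pc = 0; /* registers, MMU state, ... */ };

  struct FpgaEngine {
      // Returns false when the next instruction needs a transplant.
      bool step(ArchState& s) { s.pc += 4; return true; /* placeholder */ }
  };

  struct SwSimulator {
      void execute_one(ArchState& s) { s.pc += 4; /* full ISA + devices */ }
  };

  void run_cpu(FpgaEngine& fpga, SwSimulator& sw, ArchState& state) {
      for (;;) {
          if (!fpga.step(state)) {    // rare/complex behavior encountered
              sw.execute_one(state);  // handled in software, state copied
          }                           // back; FPGA execution then resumes
      }
  }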
7
Outline
  • Hybrid Full-System Simulation
  • Virtualized Multiprocessor Simulation
  • BlueSPARC Implementation
  • Design Experiences
  • Future Work

8
Virtualized Multiprocessor Simulation
  • Problem: large-scale simulation configurations are
    challenging to implement in FPGAs using
    structurally-accurate approaches

Structural accuracy: a 1-to-1 mapping between target and host CPUs
(each processor in the target model is a host processor implemented
in the FPGA), about 10x slower than real HW.
Pros: fastest possible solution, only 10x slower than real HW.
Cons: difficult to build for large-scale configs (e.g., >100-way).
9
Virtualized Multiprocessor Simulation
Host interleaving: multiplex the processors in the target model onto
fewer host engines implemented in the FPGA (e.g., 4-to-1), about 40x
slower than real HW (see the slowdown sketch after the list below).
  • Advantages
  • Decouple logical target system size from FPGA
    host size
  • Scale FPGA host as-needed to deliver required
    performance
  • A high target-to-host (T:H) ratio simplifies and
    consolidates HW (e.g., fewer nodes in cache
    coherence and the interconnect)
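
As a back-of-the-envelope check on the numbers quoted here (10x for a
1-to-1 mapping, 40x for 4-to-1), the slowdown can be modeled as the base
FPGA slowdown multiplied by the T:H ratio; the C++ below is only that
simple linear model, not a measured result:

  #include <cstdio>

  int main() {
      const double base_slowdown = 10.0;   // 1-to-1 FPGA host vs. real HW
      // The 16-to-1 line is a linear extrapolation of the same model.
      for (int t_per_h : {1, 4, 16}) {
          std::printf("T:H = %2d-to-1  ->  ~%.0fx slower than real HW\n",
                      t_per_h, base_slowdown * t_per_h);
      }
      return 0;
  }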

10
What's inside an FPGA host processor?
  • An engine that architecturally executes
    multiple contexts
  • Existing multithreaded designs are good
    candidates
  • Choice is influenced by the T:H (target-to-host)
    ratio
  • We propose an interleaved pipeline (e.g.,
    TERA-style), modeled in the sketch below
  • Best suited for a high T:H ratio
  • Switch in a new CPU context each cycle
  • Simple, efficient design w/ no stalling or
    forwarding
  • Long-latency tolerance (e.g., cache misses,
    transplants)
  • Coherence is free between CPUs mapped onto the same
    engine
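
A minimal software model of this interleaving (illustrative only; the
real engine is an FPGA pipeline, and Context/InterleavedEngine are
hypothetical names) might look like:

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Round-robin context interleaving: a different CPU context is switched
  // in every cycle, so back-to-back instructions never come from the same
  // context (no forwarding or stall logic), and a context waiting on a
  // long-latency event (cache miss, transplant) simply skips its turns.
  struct Context {
      bool     ready = true;  // false while waiting on a long-latency event
      uint64_t pc    = 0;     // per-context architectural state lives here
  };

  class InterleavedEngine {
  public:
      explicit InterleavedEngine(std::size_t n_contexts) : ctxs(n_contexts) {}

      void cycle() {
          Context& c = ctxs[next];      // switch in a new context this cycle
          next = (next + 1) % ctxs.size();
          if (c.ready) issue_from(c);   // otherwise tolerate the latency
      }

  private:
      void issue_from(Context& c) { c.pc += 4; }  // stand-in for fetch/execute

      std::vector<Context> ctxs;        // e.g., 16 target CPUs on one engine
      std::size_t next = 0;
  };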

11
Outline
  • Hybrid Full-System Simulation
  • Virtualized Multiprocessor Simulation
  • BlueSPARC Implementation
  • Design Experiences
  • Future Work

12
Implementation: the BlueSPARC simulator
Target: 16-CPU shared-memory UltraSPARC III server
(SunFire 3800)
Host: BEE2 platform
13
BlueSPARC Simulator (continued)
14
BlueSPARC host microarchitecture
64-bit ISA, SW-visible MMU, complex memory → high # of
pipeline stages
15
Hybrid host partitioning choices
[Diagram: behaviors partitioned across the on-chip FPGA (BlueSPARC
engine), on-chip micro-transplants (PowerPC 405), and off-chip
transplants (Simics on a PC).]
16
Performance
Perf comparable to Simics-fast; 39x speedup on
average over Simics-trace
17
Outline
  • Hybrid Full-System Simulation
  • Virtualized Multiprocessor Simulation
  • BlueSPARC Implementation
  • Design Experiences
  • Future Work

18
Design experiences
  • 2007 Timeline

To appear in FPGA'08
19
Design experiences (cont)
  • What was important
  • Developing effective validation strategies (more
    on next slide)
  • Existing reference model (Simics) to study and
    compare against
  • Efficient mapping of state to FPGA resources
    (e.g., 16 PCs → 16-bit LUT-based distributed
    RAM)
  • Coping with long Xilinx builds by easing up on
    timing constraints
  • Judicious use of Bluespec
  • What was NOT important
  • Meeting 100MHz timing for every Xilinx build
    (i.e., deep pipelining)
  • Implementing every piece of functionality as
    efficiently/fast as possible

20
Validation
  • THE most challenging aspect of this project
  • Strategies used
  • Auto-generated torture tests and hand-written test
    cases
  • Auto-ported test cases from the OpenSPARC T1
    framework to UltraSPARC III
  • Validated single-threaded and multithreaded ISA
    execution against Simics (both in Verilog
    simulations and in FPGA)
  • Flight data recorder for non-deterministic
    interleaving of CPUs
  • Batched Verilog simulations w/ varying parameters
  • Validated the non-blocking memory system with shadow
    flat memories during Verilog simulation → caught
    self-modifying-code bugs (illustrated below)
  • > 200 synthesizable assertions (to ChipScope)
  • Built-in deadlock/error detectors
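
A simplified model of the shadow flat-memory check mentioned above (not
the actual testbench; ShadowMemory and its hooks are hypothetical names):

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Every committed store also updates a flat byte array, and every
  // committed load is compared against it, so reordering bugs in the
  // non-blocking memory system (including self-modifying-code cases)
  // are caught at the moment they commit.
  class ShadowMemory {
  public:
      explicit ShadowMemory(std::size_t bytes) : mem(bytes, 0) {}

      void on_commit_store(uint64_t addr, uint8_t data) {
          mem.at(addr) = data;          // keep the reference copy up to date
      }

      void on_commit_load(uint64_t addr, uint8_t observed) {
          // The value the design under test returned must match the flat copy.
          assert(observed == mem.at(addr) && "memory system returned stale data");
      }

  private:
      std::vector<uint8_t> mem;
  };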

21
In retrospect
  • What I would have done differently to begin with
  • Write entire USIII functional model myself in
    software first
  • Take more advantage of Verilog PLI for validation
    (interface to C)
  • Don't over-engineer HDL
  • Don't upgrade tools unless necessary (e.g., trial
    license runs out)
  • Validation infrastructure w/ batching
    capabilities (do earlier!)
  • Automated binary search tool for bug hunting
  • Re-write DDR2 Async FIFOs without BRAMs
  • Fast memory checkpoint loader (3GB images per run,
    25m)
  • Simple, correct >> Fast, buggy

22
Future Work
  • Scalability
  • Burden-of-proof for 1000-way simulation?
  • Investigate cache-coherence/interconnect
    mechanisms for combining multiple interleaved
    pipelines
  • Virtualization design spaces
  • On-chip storage virtualization (e.g.,
    architectural state)
  • Memory/disk capacity (e.g., HW-based demand
    paging?)
  • Virtualizing instrumentation (e.g., paging
    functional cache tags)
  • Fast instrumentation tools
  • Understanding systems at multiple levels of
    abstraction (beyond ISA)
  • Validation/analysis beyond ISA, how to
    sanity-check app/sys behavior?

23
BlueSPARC Demo on BEE2
4 DDR2 controllers, 4 GB memory
  • Demo application
  • On-Line Transaction Processing benchmark (TPC-C)
    in Oracle
  • Runs in Solaris 8 (unmodified binary)
  • FPGA Memory directly loaded from Simics
    checkpoint

[Photo: BEE2 platform with a Virtex-II Pro 70 (PowerPC + BlueSPARC),
Ethernet to Simics on a PC, and RS232 for debugging.]
24
Conclusion
  • The build-all simulation approach in FPGAs is
    challenging
  • Two virtualization techniques for reducing
    complexity
  • Hybrid: attains full-system support by deferring rare
    behaviors to SW
  • Virtualized MP: decouples target system size from
    host size
  • BlueSPARC proof-of-concept
  • Models a 16-CPU UltraSPARC III server
  • Performance comparable to Simics-fast, 39x faster on
    average than Simics-trace
  • Thanks! Questions? echung@ece.cmu.edu
  • PROTOFLEX (http://www.ece.cmu.edu/simflex/protoflex.html)
