Title: ProtoFlex: Status Update and Design Experiences
1 ProtoFlex: Status Update and Design Experiences
- Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai
- echung, enurvita, jhoe, babak, kenmai_at_ece.cmu.edu
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
2 Full-system Functional Simulation
- Effective substitute for real (or non-existent) HW
- Can boot OS, run commercial apps
- Important in SW research and computer architecture
- But too slow for large-scale MP studies
- Multicore won't help existing tools
- Is a serious challenge for large-MP (1000-way) simulation
3 Alternative: FPGA-based simulation
- Only 10x slower in clock freq than custom HW
- But FPGAs are harder to use than software
- Simulating large-MP (100- to 1000-way) -> can't be done trivially
- Simulating full-system support -> need devices and entire ISA
The build-all strategy in FPGAs -> significant effort and resources
4 Reducing complexity w/ virtualization
Making a single physical resource appear as multiple logical resources
Making multiple physical resources appear as a single logical resource
Hybrid Full-System Simulation
- Only frequent behaviors hosted in FPGA. Relegate infrequent ones to SW.
[Figure: target full-system behaviors split into frequent (FPGA host resources) and infrequent (software host resources)]
Virtualized MP Simulation
- Logical CPUs multiplexed onto fewer physical CPUs.
[Figure: multiple target CPUs mapped onto 1 FPGA CPU]
5 Outline
- Hybrid Full-System Simulation
- Virtualized Multiprocessor Simulation
- BlueSPARC Implementation
- Design Experiences
- Future Work
6 Hybrid Full-System Simulation
[Figure: target components (CPUs, MMU, Memory, PCI, NIC, SCSI, Fibre, Graphics, Terminal) divided between a software full-system simulator host and an FPGA host; CPUs transplant between the two]
Hybrid Simulation
- 3 ways to map a target component to the hybrid simulation host: FPGA-only, simulation-only, transplantable
- CPUs can fall back to SW by transplanting between hosts
- Only common-case instructions/behaviors implemented in FPGA
- Remaining behaviors relegated to SW (turns out to be many of the complex ones)
Transplants reduce full-system design effort
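The transplant idea can be sketched in a few lines of Python. This is only an illustrative model, not the BlueSPARC mechanism: the opcode names, the `FPGA_OPS` set, and the `CpuState`/`HybridHost` classes are all hypothetical. It shows common-case instructions staying on the FPGA model while anything else is counted as a transplant to the software simulator.

```python
from dataclasses import dataclass, field

# Hypothetical split: opcodes the FPGA host implements directly.
FPGA_OPS = {"add", "load", "store", "branch"}


@dataclass
class CpuState:
    """Architectural state that would move between hosts on a transplant."""
    pc: int = 0
    regs: dict = field(default_factory=dict)


@dataclass
class HybridHost:
    fpga_count: int = 0        # instructions handled by the FPGA model
    transplant_count: int = 0  # instructions relegated to software

    def execute(self, state: CpuState, opcode: str) -> str:
        """Run one instruction and report which host handled it."""
        if opcode in FPGA_OPS:
            self.fpga_count += 1
            host = "fpga"
        else:
            # Transplant: the CPU state would be handed to the software
            # simulator, which executes the rare instruction and hands
            # the state back so execution resumes in the FPGA.
            self.transplant_count += 1
            host = "software"
        state.pc += 4  # toy model: every instruction is 4 bytes
        return host


def run(trace):
    """Execute a list of opcodes and return (host, state, per-op hosts)."""
    host, state = HybridHost(), CpuState()
    handled = [host.execute(state, op) for op in trace]
    return host, state, handled


if __name__ == "__main__":
    host, state, handled = run(["add", "load", "fsqrt", "store"])
    print(handled, host.fpga_count, host.transplant_count)
```

The point of the split is that the FPGA design only has to be correct for the opcodes in the implemented set; everything else reuses the existing software simulator.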
8 Virtualized Multiprocessor Simulation
- Problem: large-scale simulation configurations are challenging to implement in FPGAs using structurally-accurate approaches
Structural accuracy: 1-to-1 mapping between target and host CPUs
[Figure: processors in target model mapped 1-to-1 onto host processors implemented in FPGA; 10x slower than real HW]
- Pros: fastest possible solution, only 10x slower than real HW
- Cons: difficult to build for large-scale configs (e.g., >100-way)
9 Virtualized Multiprocessor Simulation
Host interleaving: multiplex target processors onto fewer FPGA-hosted processors
[Figure: processors in target model mapped 4-to-1 onto host engines implemented in FPGA; 40x slower than real HW]
- Advantages
  - Decouple logical target system size from FPGA host size
  - Scale FPGA host as needed to deliver required performance
  - High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer nodes in cache coherence, interconnect)
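The 10x and 40x figures on these two slides suggest a simple back-of-the-envelope relation: slowdown is roughly the base FPGA slowdown times the number of target CPUs sharing each host engine. The helper below is a sketch under that assumption (ideal round-robin interleaving, no extra stalls); the function name and its parameters are hypothetical.

```python
def host_slowdown(base_slowdown: float, targets: int, hosts: int) -> float:
    """Estimated slowdown vs. real HW when `targets` CPUs share `hosts` engines.

    Assumes ideal interleaving: each engine serves ceil(targets / hosts)
    contexts in round-robin, so slowdown scales linearly with that ratio.
    """
    if targets <= 0 or hosts <= 0:
        raise ValueError("targets and hosts must be positive")
    contexts_per_host = -(-targets // hosts)  # ceiling division
    return base_slowdown * contexts_per_host


if __name__ == "__main__":
    # (4, 4) is the 1-to-1 case and (4, 1) the 4-to-1 case from the slides.
    for targets, hosts in [(4, 4), (4, 1), (16, 1), (16, 4)]:
        print(targets, hosts, host_slowdown(10, targets, hosts))
```

Under this model, adding host engines buys back performance without changing the target system size, which is the decoupling the slide describes.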
10 What's inside an FPGA host processor?
- An engine that architecturally executes multiple contexts
- Existing multithreaded designs are good candidates
- Choice is influenced by TH ratio (target-to-host ratio)
- We propose an interleaved pipeline (e.g., TERA-style)
- Best suited for high TH ratio
- Switch in new CPU context on each cycle
- Simple, efficient design w/ no stalling or forwarding
- Long-latency tolerance (e.g., cache miss, transplants)
- Coherence is free between CPUs mapped onto same engine
[Figure: multiple target CPUs mapped onto one host CPU]
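A toy model of the interleaved pipeline makes the latency-tolerance argument concrete. The sketch below assumes strict round-robin issue with one context per cycle; the `Context` fields, the `miss_at` set, and the rule that a waiting context simply forfeits its slot are illustrative assumptions, not the BlueSPARC design.

```python
from dataclasses import dataclass


@dataclass
class Context:
    cpu_id: int
    pc: int = 0
    stalled_until: int = 0  # cycle at which a long-latency event completes
    retired: int = 0


def run_interleaved(contexts, cycles, miss_at=None, miss_latency=8):
    """Issue from one context per cycle in strict round-robin order.

    `miss_at` is a hypothetical set of (cpu_id, pc) pairs that trigger a
    long-latency event (e.g., a cache miss or transplant). A waiting
    context forfeits its slot; the other contexts keep issuing.
    """
    miss_at = set(miss_at or ())
    issue_log = []
    for cycle in range(cycles):
        ctx = contexts[cycle % len(contexts)]
        if cycle < ctx.stalled_until:
            issue_log.append(None)  # this context's slot goes unused
            continue
        if (ctx.cpu_id, ctx.pc) in miss_at:
            miss_at.discard((ctx.cpu_id, ctx.pc))
            ctx.stalled_until = cycle + miss_latency
            issue_log.append(None)
            continue
        ctx.pc += 4
        ctx.retired += 1
        issue_log.append(ctx.cpu_id)
    return issue_log


if __name__ == "__main__":
    ctxs = [Context(cpu_id=i) for i in range(4)]
    log = run_interleaved(ctxs, cycles=16, miss_at={(1, 4)})
    print(log)
    print([c.retired for c in ctxs])
```

Because consecutive instructions from the same context are issued several cycles apart, a real pipeline built this way needs no forwarding between them, and a miss in one context costs only that context's slots.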
12 Implementation: BlueSPARC simulator
- Target: 16-CPU shared-memory UltraSPARC III server (SunFire 3800)
- Host: BEE2 platform
13 BlueSPARC Simulator (continued)
14 BlueSPARC host microarchitecture
64-bit ISA, SW-visible MMU, complex memory -> high number of pipeline stages
15 Hybrid host partitioning choices
- On-chip (FPGA): BlueSPARC, micro-transplants (PowerPC405)
- Off-chip: transplants (Simics on PC)
16 Performance
Perf comparable to Simics-fast; 39x speedup on average over Simics-trace
18 Design experiences
To appear in FPGA'08
19 Design experiences (cont.)
- What was important
  - Developing effective validation strategies (more on next slide)
  - Existing reference model (Simics) to study and compare against
  - Efficient mapping of state to FPGA resources (e.g., 16 PCs -> 16-bit LUT-based distributed RAM)
  - Coping with long Xilinx builds by easing up on timing constraints
  - Judicious Bluespec
- What was NOT important
  - Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
  - Implementing every functionality as efficiently/fast as possible
20 Validation
- THE most challenging aspect of this project
- Strategies used
  - Auto-generated torture tests and hand-written test cases
  - Auto-port test cases from OpenSPARC T1 framework to UltraSPARC III
  - Validated single-threaded and multithreaded ISA execution against Simics (both in Verilog simulations and in FPGA)
  - Flight data recorder for non-deterministic interleaving of CPUs
  - Batched Verilog simulations w/ varying parameters
  - Validate non-blocking memory system with shadow flat memories during Verilog simulation -> caught self-modifying code bugs
  - > 200 synthesizable assertions to Chipscope
  - Built-in deadlock/error detectors
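The shadow flat memory strategy can be illustrated with a small software analogue. This is a sketch only: the deck describes the check being done during Verilog simulation, whereas here both the checker (`ShadowCheckedMemory`) and the deliberately buggy memory (`StaleCacheMemory`) are hypothetical Python stand-ins. The bug modeled is a store that fails to invalidate a cached copy, which is the kind of error that shows up with self-modifying code.

```python
class ShadowCheckedMemory:
    """Wraps a memory model and cross-checks every load against a flat shadow."""

    def __init__(self, memory_under_test):
        self.dut = memory_under_test  # hypothetical model with read/write
        self.shadow = {}              # flat address -> value map
        self.mismatches = []          # (addr, expected, got) tuples

    def write(self, addr, value):
        self.dut.write(addr, value)
        self.shadow[addr] = value

    def read(self, addr):
        got = self.dut.read(addr)
        expected = self.shadow.get(addr, 0)
        if got != expected:
            self.mismatches.append((addr, expected, got))
        return got


class StaleCacheMemory:
    """Toy buggy memory: a one-entry read cache that stores never invalidate."""

    def __init__(self):
        self.backing = {}
        self.cached = None  # (addr, value) of the last read, if any

    def write(self, addr, value):
        self.backing[addr] = value  # bug: cached entry is left stale

    def read(self, addr):
        if self.cached is not None and self.cached[0] == addr:
            return self.cached[1]
        value = self.backing.get(addr, 0)
        self.cached = (addr, value)
        return value


if __name__ == "__main__":
    mem = ShadowCheckedMemory(StaleCacheMemory())
    mem.write(0x100, 1)
    mem.read(0x100)      # fills the one-entry cache
    mem.write(0x100, 2)  # e.g., code overwriting its own instruction
    mem.read(0x100)      # stale value returned; shadow flags it
    print(mem.mismatches)
```

The shadow copy is trivially correct because it has no caching or reordering, so any disagreement points at the complex memory system rather than at the checker.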
21 In retrospect
- What I would have done differently to begin with
  - Write entire USIII functional model myself in software first
  - Take more advantage of Verilog PLI for validation (interface to C)
  - Don't over-engineer HDL
  - Don't upgrade tools unless necessary (e.g., trial license runs out)
  - Validation infrastructure w/ batching capabilities (do earlier!)
  - Automated binary search tool for bug hunting
  - Re-write DDR2 Async FIFOs without BRAMs
  - Fast memory checkpoint loader (3GB images per run, 25m)
  - Simple, correct >> Fast, buggy
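One way an automated binary search tool for bug hunting could work is to bisect on instruction count, comparing the design under test with a reference model (such as Simics) at each probe. The sketch below is an assumption about how such a tool might look, not the authors' tool: `run_dut` and `run_ref` are hypothetical callables that deterministically return the architectural state after n instructions, and the search assumes that once the two states diverge they stay diverged.

```python
def first_divergence(run_dut, run_ref, max_steps):
    """Return the smallest n in [1, max_steps] where run_dut(n) != run_ref(n).

    Returns None if the states still match at max_steps. Assumes both
    runs start from the same state (so they match at n = 0) and that a
    divergence, once present, persists (monotonic predicate).
    """
    if run_dut(max_steps) == run_ref(max_steps):
        return None
    lo, hi = 0, max_steps  # invariant: match at lo, mismatch at hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_dut(mid) == run_ref(mid):
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Toy models: state is a single number; the "DUT" goes wrong at step 37.
    def ref(n):
        return n * 3

    def dut(n):
        return n * 3 + (1 if n >= 37 else 0)

    print(first_divergence(dut, ref, 1000))
```

Each probe costs one run of both models, so locating a bug in a million-instruction run takes about twenty probes instead of a step-by-step comparison.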
22 Future Work
- Scalability
  - Burden-of-proof for 1000-way simulation?
  - Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
- Virtualization design spaces
  - On-chip storage virtualization (e.g., architectural state)
  - Memory and disk capacity (e.g., HW-based demand paging?)
  - Virtualizing instrumentation (e.g., paging functional cache tags)
- Fast instrumentation tools
  - Understanding systems at multiple levels of abstraction (beyond ISA)
  - Validation and analysis beyond ISA: how to sanity-check application and system behavior?
23 BlueSPARC Demo on BEE2
- Demo application
  - On-Line Transaction Processing benchmark (TPC-C) in Oracle
  - Runs in Solaris 8 (unmodified binary)
  - FPGA memory directly loaded from Simics checkpoint
[Figure: BEE2 platform with Virtex-II Pro 70 (PowerPC and BlueSPARC), 4 DDR2 controllers and 4 GB memory, Ethernet (to Simics on PC), RS232 (debugging)]
24 Conclusion
- Build-all simulation approach in FPGAs is challenging
- Two virtualization techniques for reducing complexity
  - Hybrid: attain full-system by deferring rare behaviors to SW
  - Virtualized MP: decouples target system size from host size
- BlueSPARC proof-of-concept
  - Models 16-CPU UltraSPARC III server
  - Comparable perf to Simics-fast, 39x faster on avg than Simics-trace
- Thanks! Questions? echung_at_ece.cmu.edu
- PROTOFLEX (http://www.ece.cmu.edu/simflex/protoflex.html)