FPGA-based Fast, Cycle-Accurate Full System Simulators

1
FPGA-based Fast, Cycle-Accurate Full System
Simulators
  • Derek Chiou, Huzefa Sanjeliwala, Dam Sunwoo,
    John Xu and Nikhil Patil
  • University of Texas at Austin

2
Wouldn't it be nice to have a simulator that is
  • Fast
      • 10M cycles per second, fast enough to run real
        datasets to completion
  • Accurate
      • Produce cycle-accurate numbers
  • Complete
      • Run real operating systems, applications
  • Transparent
      • Can see everything in processor, no performance
        hit
  • Inexpensive
      • Need thousands
  • Usable
      • Quick changes, easy to see performance

3
Software?
  • Software-based simulators inherently cannot
    achieve this speed and be cycle-accurate at the
    same time
  • A 128-entry, fully-associative TLB at the limit
    requires 128 load-and-compare operations
  • Arbitration requires first looking across
    multiple bidders
  • There are lots of these structures in a complex
    processor!
  • Thousands to tens of thousands of events
  • Even with perfect parallelism, need a lot of CPUs
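
The TLB bullet can be made concrete with a minimal Python sketch. The names (SoftwareTlb, TlbEntry, compares) are illustrative assumptions and do not come from the slides. It shows only the worst case: a software model walks the entries one at a time, whereas hardware compares every entry in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TlbEntry:
    valid: bool = False
    vpn: int = 0  # virtual page number (the tag)
    ppn: int = 0  # physical page number


class SoftwareTlb:
    """Fully-associative TLB modeled in software (illustrative only)."""

    def __init__(self, entries: int = 128) -> None:
        self.entries: List[TlbEntry] = [TlbEntry() for _ in range(entries)]
        self.compares = 0  # running count of load-and-compare steps

    def insert(self, slot: int, vpn: int, ppn: int) -> None:
        self.entries[slot] = TlbEntry(True, vpn, ppn)

    def lookup(self, vpn: int) -> Optional[int]:
        # Sequential walk: one load and one compare per entry visited.
        for entry in self.entries:
            self.compares += 1
            if entry.valid and entry.vpn == vpn:
                return entry.ppn
        return None


tlb = SoftwareTlb(128)
tlb.insert(127, vpn=0x42, ppn=0x9000)  # worst case: match sits in the last slot
hit = tlb.lookup(0x42)
compares_after_hit = tlb.compares
miss = tlb.lookup(0x43)                # a miss always visits every entry
```

Every simulated access to such a structure costs up to 128 host operations, and a complex processor has many such structures active in each simulated cycle.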

4
Hardware
  • Clearly, hardware is necessary
  • Reconfigurability (read FPGAs) is required for
    flexibility
  • But how?

5
Full Implementation?
  • Take RTL code, compile for FPGA
  • Implementing full system in FPGA is prohibitively
    large
  • Shih-Lin Lu's group has single original Pentium
    (586, 3.1M transistors) in largest Xilinx FPGA
  • Emulate Pentium M in a single FPGA?
  • 140M transistors
  • Instead, what about
  • Accurately (to cycle resolution) simulate its
    behavior
  • Running real, unmodified applications, OS
  • With full visibility at full speed?
  • If execution speeds are reasonable, do I care?

Derek Chiou, UTexas, Austin
6
Can I Partition the Problem?
  • 64b adder way too big to be implemented as a
    single monolithic entity
  • But, I can implement 64 1b adders very easily
    with very little state and complexity
  • Partitioning is good if possible
  • But, how to partition?
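
The adder analogy can be sketched in Python as follows. This assumes the 1b adders are chained ripple-carry style, which the slide does not specify; full_adder and add64 are hypothetical names.

```python
from typing import Tuple


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """One 1-bit adder: returns (sum bit, carry out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add64(x: int, y: int) -> Tuple[int, int]:
    """64-bit add built from 64 chained 1-bit adders.

    Returns (64-bit result, final carry out).
    """
    carry = 0
    result = 0
    for i in range(64):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


small = add64(5, 7)
wrapped = add64(2**64 - 1, 1)  # overflows into the carry out
```

Each slice holds almost no state, and the only coupling between neighboring slices is a single carry bit.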

7
Classic Partitioning
  • On module boundary
  • Caches, memories, ALUs, processors, memory
    controllers
  • Partitioning doesn't save state or complexity,
    but enables design to be partitioned over
    multiple FPGAs and software
  • Problems?

[Figure: pipelined processor datapath. Recoverable labels: PC, Instruction Mem, GPR File, ALU, Immed. Extend, Data Memory, bypass.]

8
Functional/Timing Partition
  • Functional model simulates ISA
  • Timing model simulates micro-architecture
  • Asim and SimpleScalar are written like this
  • Software
  • One processor
  • Lots of interaction between functional and timing
  • Intended to avoid rollback of any component
  • Put timing model in FPGA???
  • Parallel component executed in hardware!
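
A rough Python sketch of the functional/timing split follows. The three-instruction ISA and the latency table are invented for illustration and are not taken from Asim or SimpleScalar. The functional model computes values and knows nothing about cycles; the timing model counts cycles and computes no values.

```python
from typing import List, Tuple

# (op, dst, src1, src2_or_imm) for a toy ISA assumed for this sketch
Instr = Tuple[str, int, int, int]


class FunctionalModel:
    """Simulates the ISA: updates architectural state only."""

    def __init__(self, program: List[Instr]) -> None:
        self.program = program
        self.regs = [0] * 8
        self.pc = 0

    def done(self) -> bool:
        return self.pc >= len(self.program)

    def step(self) -> str:
        op, d, a, b = self.program[self.pc]
        if op == "addi":
            self.regs[d] = self.regs[a] + b
        elif op == "add":
            self.regs[d] = self.regs[a] + self.regs[b]
        elif op == "mul":
            self.regs[d] = self.regs[a] * self.regs[b]
        else:
            raise ValueError(f"unknown op {op}")
        self.pc += 1
        return op


class TimingModel:
    """Simulates the micro-architecture: assigns cycles only."""

    LATENCY = {"addi": 1, "add": 1, "mul": 3}  # hypothetical latencies

    def __init__(self) -> None:
        self.cycles = 0

    def account(self, op: str) -> None:
        self.cycles += self.LATENCY[op]


def simulate(program: List[Instr]) -> Tuple[List[int], int]:
    functional, timing = FunctionalModel(program), TimingModel()
    while not functional.done():
        # The two models interact on every instruction.
        timing.account(functional.step())
    return functional.regs, timing.cycles


regs, cycles = simulate([
    ("addi", 1, 0, 6),
    ("addi", 2, 0, 7),
    ("mul", 3, 1, 2),
    ("add", 4, 3, 1),
])
```

In this software form both models run in lockstep on one processor; the timing half is the part the slide proposes moving into the FPGA.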

9
UT FAST Partitioning
  • On ISA/micro-architecture boundary (ISA FPGA)
  • Instruction trace generated by ISA simulator
    (e.g., Bochs, Simics)
  • Fast, full system but no timing information
    (could be hardware!!!)
  • What do we need to simulate in the timing model?
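
As a sketch of the trace-driven arrangement, the Python below feeds a stream of trace records into a timing model that never sees data values. The record fields, the hard-coded program standing in for Bochs or Simics output, and the one-cycle load-use stall are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class TraceRecord:
    pc: int
    op: str
    dst: Optional[int]      # destination register number, if any
    srcs: Tuple[int, ...]   # source register numbers (no values)


ProgramEntry = Tuple[str, Optional[int], Tuple[int, ...]]


def functional_trace(program: List[ProgramEntry]) -> Iterator[TraceRecord]:
    """Stand-in for the ISA simulator: one record per executed instruction."""
    for index, (op, dst, srcs) in enumerate(program):
        yield TraceRecord(pc=index * 4, op=op, dst=dst, srcs=tuple(srcs))


def timing_model(trace: Iterable[TraceRecord]) -> int:
    """In-order, single-issue pipeline with an assumed 1-cycle load-use stall."""
    cycles = 0
    prev_load_dst: Optional[int] = None
    for rec in trace:
        cycles += 1
        if prev_load_dst is not None and prev_load_dst in rec.srcs:
            cycles += 1  # bubble: consumer directly follows the load
        prev_load_dst = rec.dst if rec.op == "load" else None
    return cycles


program: List[ProgramEntry] = [
    ("load", 1, (2,)),
    ("add", 3, (1, 4)),     # uses r1 right after the load: stalls
    ("add", 5, (3, 1)),
    ("load", 6, (5,)),
    ("store", None, (7, 8)),
]
total_cycles = timing_model(functional_trace(program))
```

The timing model needs only register numbers and opcodes from the trace to decide when each instruction completes.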

[Figure: pipelined processor datapath with a Trace input. Recoverable labels: PC, Instruction Memory, Trace, GPR File, ALU, Immed. Extend, Data Memory, bypass.]

10
UT FAST Complex Processors
  • Straight pipelines are easy; what about
  • Caches/TLBs?
  • Keep tags, pass address (virtual and physical if
    necessary)
  • Hits, misses determined but don't need data
  • Superscalar (multiple issue)?
  • Fetch and issue multiple instructions assuming
    they meet boundary constraints
  • Multiple functional units
  • Reservation stations
  • Reorder buffer
  • Pipeline control along with instructions
  • NO DATAPATH!!!
  • Timing Model speed almost unimportant!
  • Multi-cycle memories to create more ports
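
The "keep tags, pass address" idea can be sketched as a cache model with a tag array and no data array. Direct mapping, the sizes, and the name TagOnlyCache are assumptions for this example; the slides do not give an organization.

```python
from typing import List, Optional


class TagOnlyCache:
    """Direct-mapped cache model that stores tags only, never data."""

    def __init__(self, num_sets: int = 64, line_bytes: int = 32) -> None:
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags: List[Optional[int]] = [None] * num_sets
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Return True on a hit; on a miss, install the new tag."""
        line = addr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = TagOnlyCache(num_sets=4, line_bytes=16)
# 0x00 and 0x40 map to the same set, so the final 0x00 is a conflict miss.
outcomes = [cache.access(a) for a in (0x00, 0x04, 0x10, 0x40, 0x00)]
```

Hit and miss outcomes, and therefore timing, are fully determined by addresses, so the model's state is a small fraction of what a real cache holds.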

11
Example of Complication: Branch Prediction
  • Must process mis-speculated instructions in
    timing model
  • Implement BP in timing model
  • Timing model forces ISA simulator to
    mis-speculate
  • Rollback, restore
  • Requires support from ISA simulator
  • Branch predictor predictor in ISA simulator?
  • BP only works in processor if it's fairly
    accurate
  • FAST simulators take advantage of the fact that
    most of the time micro-architecture is on the
    right path
  • Most complexity (BP, parallelism) can be handled
    this way
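
The mis-speculate, rollback, restore sequence could look roughly like the Python sketch below. The two-instruction toy ISA, the static predict-taken policy, the fixed wrong-path length, and all names are assumptions for illustration; they are not the FAST implementation.

```python
from typing import List, Tuple


class FunctionalModel:
    """Toy ISA simulator with the checkpoint/restore support the slide calls for."""

    def __init__(self, program: List[tuple]) -> None:
        self.program = program
        self.regs = [0] * 4
        self.pc = 0

    def checkpoint(self) -> Tuple[int, List[int]]:
        return self.pc, list(self.regs)

    def restore(self, saved: Tuple[int, List[int]]) -> None:
        self.pc, self.regs = saved[0], list(saved[1])

    def step(self) -> int:
        """Execute one instruction and return the architecturally correct next pc."""
        ins = self.program[self.pc]
        if ins[0] == "addi":
            _, d, s, imm = ins
            self.regs[d] = self.regs[s] + imm
            self.pc += 1
        elif ins[0] == "bnez":
            _, s, target = ins
            self.pc = target if self.regs[s] != 0 else self.pc + 1
        else:
            raise ValueError(f"unknown op {ins[0]}")
        return self.pc


def run(program: List[tuple], wrong_path_len: int = 2):
    f = FunctionalModel(program)
    cycles = 0
    wrong_path: List[int] = []
    while f.pc < len(program):
        pc = f.pc
        actual_next = f.step()
        cycles += 1
        if program[pc][0] == "bnez":
            predicted = program[pc][2]      # static predict-taken (assumed)
            if predicted != actual_next:
                saved = f.checkpoint()      # correct-path state
                f.pc = predicted            # force the mis-speculation
                for _ in range(wrong_path_len):
                    if f.pc >= len(program):
                        break
                    wrong_path.append(f.pc)
                    f.step()                # wrong-path work still costs cycles
                    cycles += 1
                f.restore(saved)            # rollback
    return f.regs, cycles, wrong_path


loop = [
    ("addi", 1, 0, 3),   # r1 = 3
    ("addi", 1, 1, -1),  # r1 -= 1
    ("bnez", 1, 1),      # loop while r1 != 0
    ("addi", 2, 0, 99),  # r2 = 99
]
final_regs, total_cycles, wrong_path_pcs = run(loop)
```

Only the final loop exit is mispredicted here, so the rollback path runs once; the wrong-path decrement of r1 is undone by the restore while its cycles remain counted.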

12
Status and Conclusions
  • 1MHz to 100MHz, cycle-accurate, full-system,
    multiprocessor simulator
  • Well, not quite that fast right now, but we are
    using embedded 300MHz PowerPC 405 to simplify
  • X86, boots Linux, Windows, targeting 80486 to
    Pentium D-like and beyond (Dam Sunwoo, Nikhil
    Patil)
  • Bochs functional model (looking at much faster
    models)
  • Heavily modified instruction trace and rollback
  • Branch-predicted superscalar model almost done in
    Bluespec and Verilog (John Xu, Huzefa
    Sanjeliwala)
  • Have straight pipeline 486 model with TLBs and
    caches
  • Statistics gathered in hardware
  • Very little if any probe effect
  • Tools to semi-automate micro-architectural and
    ISA level exploration
  • Orthogonality of models makes both simpler