Title: RAMP Models and Platforms
1RAMP Models and Platforms
- Krste Asanovic
- UC Berkeley
- RAMP Retreat, Berkeley, CA
- January 15, 2009
2Much confusion about RAMP
- Frequently asked questions
- When will RAMP be finished/usable?
- What ISA does RAMP use?
- Can RAMP model my new feature X?
- How accurate is RAMP?
- Why so many different RAMP projects?
- Why is there not more sharing among projects?
3Not much confusion about software simulators
- Rarely asked questions
- When will software simulation be finished/usable?
- What ISA do software simulators use?
- Can a software simulator model my new feature X?
- How accurate is software simulation?
- Why so many software simulators?
- Why is there not more sharing among software simulators?
4RAMP is a consortium, not a project
- Many projects with different goals
- sometimes multiple per site
- So far, much sharing of ideas and techniques
- Very healthy and active community
- Some sharing of low-level infrastructure
- Boards, platform-level interfaces to DRAM, Ethernet, etc.
- Not a single complete infrastructure that everyone uses
- and that's been OK, and might continue to be OK
5Run Model of Target on Host Platform
Hard Work
6RAMP Projects Goals
- Model some target machine, trading off:
- Fidelity
- Model design effort
- Emulation speed (and capacity)
7Space of Target Machines
- Which ISA?
- x86, SPARC, PowerPC, Alpha, ARM, MIPS?
- In-order or out-of-order cores?
- How many cores?
- 1, 16, 256, 1M?
- Processor + memory of general-purpose machine, or whole SoC including I/O devices?
- Accelerators, GPUs?
- Which operating system? Hypervisor?
8ISA Wars
- Original pick to standardize around was SPARC
- Open standard
- Available verification suite
- Simplest ISA with extensive general-purpose software support (i.e., desktop/server development environment available)
- SGI/MIPS sorely missed
- Leon implementation for FPGA
- Simics
- But the intent was always to support multiple ISAs
9ISA usage in RAMP models
- UCB RAMP Blue: MicroBlaze
- Xilinx soft core modified to add 64-bit FPU
- Stanford RAMP Red: PowerPC
- Used Virtex-II Pro hard cores
- UT FAST: x86
- Functional simulation in software on front-end machine (or on PowerPC hard core)
- UT RAMP White: PowerPC -> SPARC
- Initial version used hard PowerPC cores, moving to Leon soft cores
- MIT/Intel HASIM: Alpha -> x86?
- Initially Alpha ISA, eventually to form basis of x86/uOP machine
- CMU ProtoFLEX: SPARC
- SPARC three ways (own core, emulation on hard PowerPC core, emulation on front-end machine)
- UCB RAMP Gold, Internet-in-a-Box: SPARC
- Own core design
- UCB/LBNL Green Flash: Tensilica
- RTL generated from Tensilica tools
10Supporting new ISAs
- x86 still very desirable, but difficult
- FAST software functional model is probably the current best approach if you want to play with different timings
- Microcoded functional model would be a good way to go if resources were available (HASIM?)
- Even with a working functional model, timing model is difficult? Adding new features difficult?
- ARM also desirable for mobile device modeling
- Renewed interest in engaging here
- MIT/IBM PowerPC work in progress, could form functional model
- But nobody does this for fun, only to advance their own research goals
11Commercial/Existing RTL Cores
- Originally seen as big benefit of RAMP
- But didn't turn out that way in practice (except for prototyping usage model, see later)
- Cores don't provide features we need, too big, too difficult to modify
- For simple ISAs (i.e., non-x86), biggest help is ISA verification suites, and/or really simple synthesizable ISA pipeline to form basis of functional model
12Operating System Support
- Currently only ProtoFLEX, FAST, RAMP-White support OS
- Others can run one application with proxy mechanism for I/O
- Reflects interests of groups. OS is not primary subject of research for groups building models so far.
- RAMP Gold to add support for ParLab OS work (Tessellation)
- Green Flash to add support for HPC-style microkernel
13Target systems
- From a few, to millions of cores
- Scaling simulation to 100s of cores was a shared goal
- But smaller core counts (16-128) very interesting also
- Huge core counts (>1E6) also of interest
- Single node versus clusters
- RAMP Blue and Internet-in-a-Box are message-passing clusters
- Rest are shared-memory systems
- Memory hierarchy and cache coherence protocols
- Wide variety of possibilities
- Desktop/Laptop/Server versus Handheld or SoC
- What is important to model for given research topic?
- Accelerators/GPUs
- Even wider variety than CPU ISAs/microarchitectures
14Wide variety, how to reuse?
- Proposal
- ISA functional models
- also FPU across ISAs
- Perhaps even common uOP engine across all ISAs?
- CPU Microarchitecture timing model
- E.g., in-order superscalar, out-of-order with unified physical register file
- Memory functional model
- Host-level caches, memory interleaving
- Memory hierarchy timing models
- On-chip network types as subset
- I/O bus shims
- To allow random RTL to be attached for I/O devices and non-GPU accelerators
- This won't be easy, as we have to agree on interfaces between these components, might need further specialization
- Definitely need more experience doing all of the above
15Simulator Types
- Functional model only (no timing)
- RTL models (functional includes timing)
- Also used for chip prototyping
- Split functional and timing models
- Hybrids of above
16Simulator Mapping Styles
- Gate-level emulator (Quickturn, Palladium)
- 1MHz
- Direct RTL emulator
- 5-20MHz
- FPGA-tuned RTL emulator
- 20-50MHz
- Virtualized RTL emulator
- 50-100MHz
- Host-multithreaded models
- >100MHz
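To put the emulation speeds in this list in perspective, the sketch below computes the slowdown of each mapping style relative to a target machine. It assumes, purely for illustration, a 1 GHz target clock and reads the listed figures as target cycles emulated per second; the 1 GHz value and all names in the code are hypothetical and do not come from the slides.

```python
# Assumed target clock, for illustration only (not from the slides).
TARGET_HZ = 1_000_000_000

# Emulation rates quoted in the list, as (low, high) in MHz.
# A high bound of None marks an open-ended range (e.g. ">100MHz").
MAPPING_STYLES_MHZ = {
    "gate-level emulator": (1, 1),
    "direct RTL emulator": (5, 20),
    "FPGA-tuned RTL emulator": (20, 50),
    "virtualized RTL emulator": (50, 100),
    "host-multithreaded model": (100, None),
}


def slowdown(emulation_mhz, target_hz=TARGET_HZ):
    """Return how many times slower the emulation runs than the target."""
    return target_hz / (emulation_mhz * 1_000_000)


def slowdown_range(style):
    """Return (worst, best) slowdown; best is None for open-ended ranges."""
    low, high = MAPPING_STYLES_MHZ[style]
    worst = slowdown(low)
    best = slowdown(high) if high is not None else None
    return worst, best


if __name__ == "__main__":
    for style in MAPPING_STYLES_MHZ:
        worst, best = slowdown_range(style)
        best_text = f"{best:.0f}x" if best is not None else "lower"
        print(f"{style}: {worst:.0f}x down to {best_text}")
```

Under these assumptions, each step down the list buys roughly a small-integer factor in simulated time per experiment, which is the motivation for the later mapping styles.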
18- RAMP Blue Release 2/25/2008
- design available from RAMP website
- ramp.eecs.berkeley.edu
19Climate System Design Concept: Strawman Design Study
- 10 PF sustained, 120 m2, <3 MWatts, <75M
20Virtualized RTL Improves FPGA Resource Usage
- RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
- Example 1: Multiported register file
- For example, Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
- If RTL mapped directly, requires 48K flip-flops
- Slow cycle time, large area
- If mapping into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs
- Faster cycle time (3X) and far fewer resources
- Example 2: Large L2/L3 caches
- Current FPGAs only have 1MB of on-chip SRAM
- Use on-chip SRAM to build cache of active piece of L2/L3 cache, stall target cycle if access misses and fetch data from off-chip DRAM
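The arithmetic behind the multiported register file example can be checked with a short sketch. The function names are hypothetical, and the model assumes that each block RAM offers exactly one read and one write per host cycle, as stated in the slide, with the register storage split across 2KB block RAMs.

```python
import math


def flip_flops_needed(storage_bytes):
    """Direct RTL mapping: one flip-flop per bit of register storage."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(read_ports, write_ports,
                                 reads_per_cycle=1, writes_per_cycle=1):
    """Host cycles needed to serve all target ports from a 1R1W memory."""
    read_cycles = math.ceil(read_ports / reads_per_cycle)
    write_cycles = math.ceil(write_ports / writes_per_cycle)
    return max(read_cycles, write_cycles)


def block_rams_needed(storage_bytes, bram_bytes):
    """Number of block RAMs needed to hold the register storage."""
    return math.ceil(storage_bytes / bram_bytes)


if __name__ == "__main__":
    storage = 6 * 1024  # 6KB of register storage, from the slide
    print(flip_flops_needed(storage) // 1024, "K flip-flops")
    print(host_cycles_per_target_cycle(3, 2), "host cycles per target cycle")
    print(block_rams_needed(storage, 2 * 1024), "block RAMs of 2KB")
```

With 3 read ports and 2 write ports, the reads are the bottleneck, so the virtualized mapping spends 3 host cycles per target cycle in exchange for replacing 48K flip-flops with 3 block RAMs.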
21Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
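The latency-hiding claim can be illustrated with a minimal sketch of a host engine that interleaves target threads in fixed round-robin order. This is a toy model under stated assumptions, not the RAMP Gold or ProtoFLEX design: the function name, the fixed interleave, and the single uniform operation latency are all hypothetical.

```python
def host_utilization(num_threads, op_latency, host_cycles):
    """Fraction of host cycles in which the engine issues an operation.

    Each host cycle the engine visits one target thread in round-robin
    order. A thread that issued an operation cannot issue again until
    `op_latency` host cycles later (e.g. while a cross-FPGA message is
    in flight). If the visited thread is not ready, the cycle is wasted.
    """
    ready_at = [0] * num_threads  # first cycle each thread may issue again
    issued = 0
    for cycle in range(host_cycles):
        thread = cycle % num_threads
        if ready_at[thread] <= cycle:
            ready_at[thread] = cycle + op_latency
            issued += 1
    return issued / host_cycles


if __name__ == "__main__":
    for threads in (1, 2, 4, 8):
        print(threads, "threads:", host_utilization(threads, 4, 400))
```

In this model a single thread keeps the engine busy only one cycle in four, while four or more threads cover the 4-cycle latency completely, which is the sense in which multithreading improves emulator throughput.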
22Split Functional/Timing Models (HASIM: Emer (MIT/Intel), FAST: Chiou (UT Austin))
Functional Model
Timing Model
- Functional model executes CPU ISA correctly, no timing information
- Only need to develop functional model once for each ISA
- Timing model captures pipeline timing details, does not need to execute code
- Much easier to change timing model for architectural experimentation
- Without RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional model
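A minimal sketch of the split may help. It uses a trace-driven partition, which is only one of the many possible splits and is not claimed to match HASIM or FAST. The two-opcode ISA, the function names, and the branch penalty value are all hypothetical.

```python
def functional_model(program, regs):
    """Execute `program` on `regs`, yielding one trace record per instruction.

    Each record is (opcode, branch_taken). No timing information is produced.
    """
    pc = 0
    while pc < len(program):
        op, a, b = program[pc]
        taken = False
        if op == "addi":      # regs[a] += b
            regs[a] += b
            pc += 1
        elif op == "bnez":    # jump to instruction index b if regs[a] != 0
            taken = regs[a] != 0
            pc = b if taken else pc + 1
        else:
            raise ValueError(f"unknown opcode: {op}")
        yield op, taken


def simple_timing(trace):
    """Timing model A: every instruction takes one cycle."""
    return sum(1 for _ in trace)


def branch_penalty_timing(trace, penalty=2):
    """Timing model B: each taken branch costs `penalty` extra cycles."""
    return sum(1 + (penalty if taken else 0) for _, taken in trace)


PROGRAM = [
    ("addi", 1, 2),    # r1 += 2
    ("addi", 0, -1),   # r0 -= 1
    ("bnez", 0, 0),    # loop back to index 0 while r0 != 0
]


def run(timing_model):
    """Run PROGRAM once and return (final registers, cycle count)."""
    regs = [3, 0]      # r0 = loop counter, r1 = accumulator
    cycles = timing_model(functional_model(PROGRAM, regs))
    return regs, cycles


if __name__ == "__main__":
    print(run(simple_timing))
    print(run(branch_penalty_timing))
```

Both runs end with the same register state because the functional model is shared; only the cycle count changes when the timing model is swapped, and the timing models never execute an instruction themselves.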
23RAMP White: Hari Angepat, Derek Chiou (UT Austin)
- Scalable Coherent Shared Memory Multiprocessor
- Support standard shared memory programming models
[Figure: RAMP-White block diagram with Leon3 shims, Intersection Units, NIUs, Routers, AHB shims, AHB buses, MP IntCntrl, DSU, Eth, and DDR2]
24Multithreaded Func. & Timing Models (RAMP Gold, UCB)
Timing Model Pipeline
MT-Channels
MT-Unit
- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link
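The idea of carrying several target channels over one host link can be sketched as tagged time-multiplexing. This is an illustrative toy, not the RAMP Gold implementation: the class name, the one-message-per-host-cycle link, and the tagging scheme are all assumptions.

```python
from collections import deque


class MTChannel:
    """Toy model: many target channels share one host link.

    Messages are tagged with their target channel id, sent over a single
    FIFO link, and demultiplexed into per-channel queues at the receiver.
    """

    def __init__(self, num_target_channels):
        self.link = deque()  # the single host link
        self.rx = [deque() for _ in range(num_target_channels)]

    def send(self, channel_id, payload):
        """Enqueue a tagged message on the shared host link."""
        self.link.append((channel_id, payload))

    def host_cycle(self):
        """Move at most one message across the link per host cycle."""
        if self.link:
            channel_id, payload = self.link.popleft()
            self.rx[channel_id].append(payload)

    def receive(self, channel_id):
        """Return the next message for a target channel, or None if empty."""
        queue = self.rx[channel_id]
        return queue.popleft() if queue else None


if __name__ == "__main__":
    ch = MTChannel(num_target_channels=2)
    ch.send(0, "a")
    ch.send(1, "b")
    ch.send(0, "c")
    for _ in range(3):
        ch.host_cycle()
    print(ch.receive(0), ch.receive(0), ch.receive(1))
```

Because the link is a FIFO, message order within each target channel is preserved even though the channels share the link.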
25CMU Simics/RAMP Simulator
16-CPU Shared-memory UltraSPARC III Server
(SunFire 3800)
BEE2 Platform
26What Hardware Platforms?
- RTL mapping approaches
- Need large amounts of logic
- Selected BEE2, and then designed BEE3 for this emulation style
- Observed that we don't need much interconnect bandwidth (memory, inter-board links) because RTL cores are slow and latency sensitive
- Host-multithreading allows large systems to be mapped to small (one?) FPGA (e.g., 64-128 cores on ML505)
- Logic gate count not as critical, need to focus on on-chip capacity, off-chip memory bandwidth and total memory capacity per FPGA (conventional processor memory hierarchy issues multiplied by multithreading factor)
- One big FPGA with lots of fast memory channels would be ideal
- Software functional emulation (FAST) or transplant (ProtoFLEX)
- Focus on fast coherent connection to front-end x86 CPU
- HyperTransport, FSB, QPI interfaces better than PCI I/O connections
27Summary
- Many reasons for great divergence in RAMP projects
- Different ISAs, different target machines, different research topics, different emulation styles
- Sharing possible, but hard work and more experience needed
- Questions?