Title: RAMP Gold: ParLab InfiniCore Model
Slide 1: RAMP Gold: ParLab InfiniCore Model
- Krste Asanovic
- UC Berkeley
RAMP Retreat, January 16, 2008
Slide 2: Outline
- UCB Parallel Computing Laboratory (ParLab) overview
- InfiniCore: UCB's manycore prototype architecture
- RAMP Gold: a RAMP model for InfiniCore
Slide 3: UCB ParLab Overview
Easy to write correct software that runs efficiently on manycore
[Figure: ParLab software/hardware stack, from applications down to architecture]
- Applications: Personal Health, Image Retrieval, Hearing & Music, Speech, Parallel Browser
- Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- Correctness: Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch: Multicore/GPGPU, InfiniCore/RAMP Gold
Slide 4: Manycore Covers a Huge Design Space

[Figure: manycore design space]
- Thin cores, fat cores, and special-purpose cores
- L2 interconnect with multiple on-chip L2/RAM banks
- Many alternative memory hierarchies
- Fast serial I/O ports; memory/I/O interconnect
- Multiple off-chip DRAM and Flash channels
Slide 5: Narrowing Our Search Space
- Laptops/handhelds ⇒ single-socket systems
  - Don't expect >1 manycore chip per platform
  - Servers/HPC will probably use multiple single-socket blades
- Homogeneous, general-purpose cores
  - Present most of the interesting design challenges
  - Resulting designs can later be specialized for improved efficiency
- Simple in-order cores
  - Want a low energy/op floor
  - Want a high performance/area ceiling
  - More predictable performance
- A tiled physical design
  - Reduces logical/physical design verification costs
  - Enables design reuse across a large family of parts
  - Provides natural locality to reduce latency and energy/op
  - Natural redundancy for yield enhancement and surviving failures
Slide 6: InfiniCore
- ParLab strawman manycore architecture
  - A playground (punching bag?) for trying out architecture ideas
- Highlights:
  - Flexible hardware partitioning and protected communication
  - Latency-tolerant CPUs
  - Fast and flexible synchronization primitives
  - Configurable memory hierarchy and user-level DMA
  - Pervasive QoS and performance counters
Slide 7: InfiniCore Architecture Overview
- Four separate on-chip network types:
  - Control networks combine 1-bit signals in a combinational tree for interrupts and barriers
  - Active-message networks carry register-to-register messages between cores (sketched below)
  - L2/coherence network connects L1 caches to L2 slices and, indirectly, to memory
  - Memory network connects L2 slices to memory controllers
- I/O and accelerators potentially attach to all network types
- Flash replaces rotating disks
- Only high-speed I/O is network and display
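As a rough illustration of the active-message style (a sketch only; the in-memory "network", the handler table, and the send_am/drain_am names are invented for this example, not InfiniCore's actual interface):

```python
# Minimal sketch of active-message dispatch: a message names a handler and
# carries a few register values; the receiving core runs the handler
# directly rather than buffering the data in memory. All names illustrative.

from collections import defaultdict

net = defaultdict(list)          # dst core id -> queued messages
counters = [0, 0, 0, 0]          # per-core state touched by handlers

def am_add(core_id, val):        # an example handler
    counters[core_id] += val

HANDLERS = {"add": am_add}

def send_am(dst_core, handler, *reg_args):
    net[dst_core].append((handler, reg_args))   # register-register message

def drain_am(core_id):
    for handler, reg_args in net.pop(core_id, []):
        HANDLERS[handler](core_id, *reg_args)

send_am(3, "add", 5)             # core 0 asks core 3 to bump its counter
drain_am(3)
assert counters[3] == 5
```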
Slide 8: Physical View of Tiled Architecture

[Figure: physical view of the tiled chip with off-chip DRAM and Flash channels]
Slide 9: Core Internals
- RISC-style 64-bit instruction set
  - SPARC V9 used for pragmatic reasons
- In-order pipeline with decoupled single-lane (64-bit) vector unit (VU)
  - Integer control unit generates/checks addresses in order, giving precise exceptions on vector loads/stores
  - VU runs behind, executing queued instructions on queued load data
  - VU executes both scalar and vector instructions, and can mix them (e.g., vector load plus scalar ALU op)
  - Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64b)
- Vector regfile configurable to trade reduced I-fetch for fewer register spills (sketched below)
  - 256 total registers (e.g., 32 regs × 8 elements, or 8 regs × 32 elements)
- Decoupling is a cheap way to tolerate memory latency within a thread (scalar and vector)
- Vectors increase performance, reduce energy/op, and increase the effective decoupling queue size
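The regfile reconfiguration above is one fixed element budget split two ways; a toy sketch, with a hypothetical configure_vregfile helper standing in for the real control interface:

```python
# Toy model of the configurable vector regfile: a fixed budget of 256 total
# register elements is split as (architectural regs x elements per reg),
# e.g. 32x8 or 8x32. The helper name is invented for illustration.

TOTAL_ELEMENTS = 256

def configure_vregfile(num_vregs):
    assert TOTAL_ELEMENTS % num_vregs == 0
    vector_length = TOTAL_ELEMENTS // num_vregs
    return num_vregs, vector_length

for n in (32, 16, 8):
    regs, vlen = configure_vregfile(n)
    # Longer vectors amortize instruction fetch; more regs cut spills.
    print(f"{regs} vector regs x {vlen} elements = {regs * vlen} total")
```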
[Figure: core internals. A 1-3-issue control processor (Int 64b) with L1I and TLB/PLB feeds a command queue to the vector unit (Int/FP 64b, 2x64b FLOPS/clock) holding GPRs and VRegs; virtual addresses and load-data queues pass through a TLB/PLB to the L1D and on to outer levels of the memory hierarchy; store queues not shown]
Slide 10: Cache Coherence
- L1 cache coherence tracked at L2 memory managers (set of readers per line)
- All cases except a write to a currently read-shared line are handled in pure hardware
  - The writer gets a trap on the memory response and invokes a handler (see sketch below)
- The same mechanism is used for transactional memory (TM)
- Cache tags are visible to user-level software in a partition, useful for TM swapping
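A minimal sketch of the trap-on-shared-write policy described above; the readers map and function names are invented, and the real tracking lives in the L2 memory managers:

```python
# Sketch: the L2 manager tracks the set of L1 readers per line. Everything
# is handled in hardware except a write to a currently read-shared line,
# which returns a "trap" to the writer so a software handler (also used for
# transactional memory) can intervene.

readers = {}   # line address -> set of reader core ids

def l2_read(line, core):
    readers.setdefault(line, set()).add(core)
    return "data"

def l2_write(line, core):
    sharers = readers.get(line, set()) - {core}
    if sharers:
        # The one case hardware doesn't resolve: writer traps on the
        # memory response; software invalidates or negotiates.
        return ("trap", sharers)
    readers[line] = {core}
    return ("ok", set())

l2_read(0x100, core=1)
l2_read(0x100, core=2)
assert l2_write(0x100, core=3)[0] == "trap"
```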
Slide 11: RAMP Gold: A Model of the ParLab InfiniCore Target
- Target is a single-socket tiled manycore system
  - Based on the SPARC ISA (v8 → v9)
  - Distributed coherent caches
  - Multiple on-chip networks (barrier, active message, coherence, memory)
  - Multiple DRAM channels
- Split timing/functional models, both in hardware
- Host multithreading of both timing and functional models
- Expect to model up to 1024 64-bit cores per system (8 BEE3 boards)
- Predict peak performance around 1-10 GIPS with full timing models
Slide 12: Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs), as in the sketch below
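In spirit, host multithreading time-multiplexes many target CPU contexts over one physical host pipeline, so a long host-side latency for one target thread hides behind the others; a rough sketch with invented names and made-up latencies:

```python
# Sketch of a host-multithreaded emulation engine: one host "pipeline"
# round-robins over many target thread contexts, so a thread waiting on a
# long host-side latency (e.g., a cross-FPGA access) just skips its turn.

class TargetContext:
    def __init__(self, tid):
        self.tid, self.pc, self.stalled_until = tid, 0, 0

def run_engine(contexts, host_cycles):
    for cycle in range(host_cycles):
        ctx = contexts[cycle % len(contexts)]   # fine-grained interleave
        if cycle < ctx.stalled_until:
            continue                            # latency hidden by others
        ctx.pc += 4                             # "execute" one instruction
        if ctx.pc % 64 == 0:                    # pretend miss: 20-cycle stall
            ctx.stalled_until = cycle + 20

contexts = [TargetContext(t) for t in range(64)]   # 64 targets per engine
run_engine(contexts, host_cycles=10_000)
print(sum(c.pc // 4 for c in contexts), "target instructions emulated")
```

Because each context is revisited only every 64 host cycles, the 20-cycle stall is completely hidden, which is the throughput argument on the slide.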
Slide 13: Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))

[Figure: functional model paired with timing model]

- Functional model executes the CPU ISA correctly, with no timing information
  - Only need to develop a functional model once per ISA
- Timing model captures pipeline timing details and does not need to execute code
  - Much easier to change the timing model for architectural experimentation
  - Without an RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional models (one caricature below)
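The division of labor can be caricatured in a few lines (a sketch under an assumed 2-cycles-per-instruction timing model; the interfaces are invented): the functional model computes what each instruction does, while the timing model only decides when it commits.

```python
# Sketch of the functional/timing split: the functional model alone knows
# ISA semantics; the timing model only models pipeline occupancy and tells
# the functional model when it may commit the next instruction.

def functional_step(state):
    # Correct ISA execution, no notion of time (one model per ISA).
    state["pc"] += 4
    state["retired"] += 1

def timing_step(pipe, cycle):
    # Pipeline timing only; swap this out to explore microarchitectures.
    return cycle % pipe["cpi"] == 0          # commit allowed this cycle?

state = {"pc": 0, "retired": 0}
pipe = {"cpi": 2}                            # pretend 2 cycles/instruction
for cycle in range(100):
    if timing_step(pipe, cycle):
        functional_step(state)
print(state["retired"], "instructions in 100 cycles")   # -> 50
```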
Slide 14: RAMP Gold Approach
- Split (and decoupled) functional and timing models
- Host multithreading of both functional and timing models
Slide 15: Multithreaded Functional & Timing Models

[Figure: MT-Unit and MT-Channels]

- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link (sketched below)
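An MT-Channel is essentially tag-based multiplexing: each message on the shared host link carries its target-channel id and is demultiplexed at the far end. A sketch, with illustrative names only:

```python
# Sketch of an MT-Channel: many target channels share one host link by
# tagging every message with its target channel id and demultiplexing at
# the far end.

from collections import defaultdict

def mux(host_link, target_channel, payload):
    host_link.append((target_channel, payload))   # tag and share the link

def demux(host_link):
    per_target = defaultdict(list)
    for target_channel, payload in host_link:
        per_target[target_channel].append(payload)
    return per_target

link = []
for ch in range(4):                # 4 target channels, one host link
    mux(link, ch, f"msg-from-{ch}")
assert demux(link)[2] == ["msg-from-2"]
```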
Slide 16: RAMP Gold CPU Model (v0.1)

[Figure: CPU model pipeline. Fetch commands drive a PC/fetch functional unit through the instruction memory interface; decode/issue, execute, and commit timing blocks exchange status with the ALU functional unit and the data memory interface; per-thread timing state and multiple GPR banks back the host-multithreaded pipeline; addresses, immediates, instructions, PC values, and load data flow between the functional units]
Slide 17: RAMP Gold Memory Model (v0.1)

[Figure: CPU models (with duplicate paths for the instruction and data interfaces) connect to the memory model, which is backed by a host DRAM cache over BEE DRAM]
Slide 18: Matching Physical Resources to Utilization
- Only implement sufficient functional units to match expected utilization, e.g., for a single-issue core with expected IPC of 0.6:
  - Regfile read ports (1.2 operands/instruction): 0.6 × 1.2 = 0.72 per timing model
  - Regfile write ports (0.8 operands/instruction): 0.6 × 0.8 = 0.48 per timing model
- Instruction mix:
  - Mem 0.3
  - FPU 0.1
  - Int 0.5
  - Branch 0.1
- Therefore only need (per timing model):
  - 0.6 × 0.3 = 0.18 memory ports
  - 0.6 × 0.1 = 0.06 FPUs
  - 0.6 × 0.5 = 0.30 integer execution units
  - 0.6 × 0.1 = 0.06 branch execution units
  (this provisioning arithmetic is spelled out in the sketch below)
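The same provisioning arithmetic in executable form (the IPC and instruction-mix numbers are the slide's estimates; the dictionary layout is just for illustration):

```python
# Provisioning functional units to expected utilization: multiply target
# IPC by per-instruction demand to get the average number of each resource
# needed per timing model, then share pooled units across models.

IPC = 0.6
demand = {
    "regfile_read_ports":  1.2,   # operands read per instruction
    "regfile_write_ports": 0.8,   # operands written per instruction
    "mem_ports":           0.3,   # instruction-mix fractions below
    "fpus":                0.1,
    "int_units":           0.5,
    "branch_units":        0.1,
}

per_model = {unit: IPC * rate for unit, rate in demand.items()}
for unit, need in per_model.items():
    print(f"{unit}: {need:.2f} per timing model")

# Inverting the per-model demand gives a sharing ratio, e.g. one FPU
# (0.06 per model) can serve roughly sixteen timing models.
print(int(1 / per_model["fpus"]), "timing models per shared FPU")
```

Inverting the per-model demand is exactly the pooling the next slide's figure depicts.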
Slide 19: Balancing Resource Utilization

[Figure: twenty host-multithreaded timing models share four regfiles through an operand interconnect that feeds one FPU, one memory port, five integer units, and one branch unit]
Slide 20: RAMP Gold Capacity Estimates
- For a SPARC v8 (32-bit) pipeline:
  - Purely functional, no timing model
  - Integer only
- For BEE3, predict 64 CPUs/engine and 8 engines/FPGA (LX110), or 512 CPUs/FPGA
- Throughput of 150 MHz × 8 engines = 1200 MIPS/FPGA
- 8 BEE3 boards × 4 FPGAs/board ⇒ ≈38 GIPS/system
- Perhaps a 4× reduction in capacity with v9, FPU, and timing models (arithmetic spelled out below)
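And the capacity arithmetic spelled out (all figures are the presentation's estimates, not measurements):

```python
# RAMP Gold capacity estimate for the integer-only, untimed SPARC v8 case,
# using the slide's numbers.

cpus_per_engine  = 64
engines_per_fpga = 8
clock_mhz        = 150
fpgas_per_board  = 4
boards           = 8

cpus_per_fpga = cpus_per_engine * engines_per_fpga              # 512
mips_per_fpga = clock_mhz * engines_per_fpga                    # 1200 MIPS
gips_system   = mips_per_fpga * fpgas_per_board * boards / 1e3  # 38.4 GIPS

print(cpus_per_fpga, "CPUs/FPGA,", mips_per_fpga, "MIPS/FPGA,",
      gips_system, "GIPS/system")
# Adding v9 features, the FPU, and timing models may cost ~4x of this.
```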