Title: RAMP Gold: ParLab InfiniCore Model
Slide 1: RAMP Gold: ParLab InfiniCore Model
- Krste Asanovic
- UC Berkeley
RAMP Retreat, January 16, 2008
Slide 2: Outline
- UCB Parallel Computing Laboratory (ParLab) overview
- InfiniCore: UCB's manycore prototype architecture
- RAMP Gold: a RAMP model for InfiniCore
Slide 3: UCB ParLab Overview
Easy to write correct software that runs efficiently on manycore
[Figure: ParLab software/hardware stack, from applications down to architecture]
- Applications: Personal Health, Image Retrieval, Hearing & Music, Speech, Parallel Browser
- Motifs/Dwarfs
- Productivity Layer: Composition & Coordination Language (CCL), Static Verification, CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks
- Efficiency Layer: Efficiency Languages, Sketching, Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers
- Correctness: Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay
- OS: Legacy OS, OS Libraries & Services, Hypervisor
- Arch: Multicore/GPGPU, InfiniCore/RAMP Gold
Slide 4: Manycore Covers a Huge Design Space

[Figure: manycore design space]
- Thin cores, fat cores, and special-purpose cores
- L2 interconnect with multiple on-chip L2/RAM banks
- Many alternative memory hierarchies
- Fast serial I/O ports; memory/I/O interconnect
- Multiple off-chip DRAM and Flash channels
Slide 5: Narrowing Our Search Space
- Laptops/handhelds ⇒ single-socket systems
  - Don't expect >1 manycore chip per platform
  - Servers/HPC will probably use multiple single-socket blades
- Homogeneous, general-purpose cores
  - Present most of the interesting design challenges
  - Resulting designs can later be specialized for improved efficiency
- Simple in-order cores
  - Want a low energy/op floor
  - Want a high performance/area ceiling
  - More predictable performance
- A tiled physical design
  - Reduces logical/physical design verification costs
  - Enables design reuse across a large family of parts
  - Provides natural locality to reduce latency and energy/op
  - Natural redundancy for yield enhancement and surviving failures
Slide 6: InfiniCore
- ParLab strawman manycore architecture
  - A playground (punching bag?) for trying out architecture ideas
- Highlights:
  - Flexible hardware partitioning and protected communication
  - Latency-tolerant CPUs
  - Fast and flexible synchronization primitives
  - Configurable memory hierarchy and user-level DMA
  - Pervasive QoS and performance counters
Slide 7: InfiniCore Architecture Overview
- Four separate on-chip network types:
  - Control networks combine 1-bit signals in a combinational tree for interrupts and barriers
  - Active-message networks carry register-to-register messages between cores (sketched below)
  - L2/coherence network connects L1 caches to L2 slices and, indirectly, to memory
  - Memory network connects L2 slices to memory controllers
- I/O and accelerators potentially attach to all network types
- Flash replaces rotating disks
- Only high-speed I/O is network and display
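As a rough illustration of the active-message style (a sketch only; the in-memory "network", the handler table, and the send_am/drain_am names are invented for this example, not InfiniCore's actual interface):

```python
# Minimal sketch of active-message dispatch: a message names a handler and
# carries a few register values; the receiving core runs the handler
# directly rather than buffering the data in memory. All names illustrative.

from collections import defaultdict

net = defaultdict(list)          # dst core id -> queued messages
counters = [0, 0, 0, 0]          # per-core state touched by handlers

def am_add(core_id, val):        # an example handler
    counters[core_id] += val

HANDLERS = {"add": am_add}

def send_am(dst_core, handler, *reg_args):
    net[dst_core].append((handler, reg_args))   # register-register message

def drain_am(core_id):
    for handler, reg_args in net.pop(core_id, []):
        HANDLERS[handler](core_id, *reg_args)

send_am(3, "add", 5)             # core 0 asks core 3 to bump its counter
drain_am(3)
assert counters[3] == 5
```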
Slide 8: Physical View of Tiled Architecture

[Figure: physical view of the tiled chip with off-chip DRAM and Flash channels]
Slide 9: Core Internals
- RISC-style 64-bit instruction set
  - SPARC V9 used for pragmatic reasons
- In-order pipeline with decoupled single-lane (64-bit) vector unit (VU)
  - Integer control unit generates/checks addresses in order, giving precise exceptions on vector loads/stores
  - VU runs behind, executing queued instructions on queued load data
  - VU executes both scalar and vector instructions, and can mix them (e.g., vector load plus scalar ALU op)
  - Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64b)
- Vector regfile configurable to trade reduced I-fetch for fewer register spills (sketched below)
  - 256 total registers (e.g., 32 regs × 8 elements, or 8 regs × 32 elements)
- Decoupling is a cheap way to tolerate memory latency within a thread (scalar and vector)
- Vectors increase performance, reduce energy/op, and increase the effective decoupling queue size
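The regfile reconfiguration above is one fixed element budget split two ways; a toy sketch, with a hypothetical configure_vregfile helper standing in for the real control interface:

```python
# Toy model of the configurable vector regfile: a fixed budget of 256 total
# register elements is split as (architectural regs x elements per reg),
# e.g. 32x8 or 8x32. The helper name is invented for illustration.

TOTAL_ELEMENTS = 256

def configure_vregfile(num_vregs):
    assert TOTAL_ELEMENTS % num_vregs == 0
    vector_length = TOTAL_ELEMENTS // num_vregs
    return num_vregs, vector_length

for n in (32, 16, 8):
    regs, vlen = configure_vregfile(n)
    # Longer vectors amortize instruction fetch; more regs cut spills.
    print(f"{regs} vector regs x {vlen} elements = {regs * vlen} total")
```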
[Figure: core internals. A 1-3-issue control processor (Int 64b) with L1I and TLB/PLB feeds a command queue to the vector unit (Int/FP 64b, 2x64b FLOPS/clock) holding GPRs and VRegs; virtual addresses and load-data queues pass through a TLB/PLB to the L1D and on to outer levels of the memory hierarchy; store queues not shown]
Slide 10: Cache Coherence
- L1 cache coherence tracked at L2 memory managers (set of readers per line)
- All cases except a write to a currently read-shared line are handled in pure hardware
  - The writer gets a trap on the memory response and invokes a handler (see sketch below)
- The same mechanism is used for transactional memory (TM)
- Cache tags are visible to user-level software in a partition, useful for TM swapping
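A minimal sketch of the trap-on-shared-write policy described above; the readers map and function names are invented, and the real tracking lives in the L2 memory managers:

```python
# Sketch: the L2 manager tracks the set of L1 readers per line. Everything
# is handled in hardware except a write to a currently read-shared line,
# which returns a "trap" to the writer so a software handler (also used for
# transactional memory) can intervene.

readers = {}   # line address -> set of reader core ids

def l2_read(line, core):
    readers.setdefault(line, set()).add(core)
    return "data"

def l2_write(line, core):
    sharers = readers.get(line, set()) - {core}
    if sharers:
        # The one case hardware doesn't resolve: writer traps on the
        # memory response; software invalidates or negotiates.
        return ("trap", sharers)
    readers[line] = {core}
    return ("ok", set())

l2_read(0x100, core=1)
l2_read(0x100, core=2)
assert l2_write(0x100, core=3)[0] == "trap"
```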
Slide 11: RAMP Gold: A Model of the ParLab InfiniCore Target
- Target is a single-socket tiled manycore system
  - Based on the SPARC ISA (v8 → v9)
  - Distributed coherent caches
  - Multiple on-chip networks (barrier, active message, coherence, memory)
  - Multiple DRAM channels
- Split timing/functional models, both in hardware
- Host multithreading of both timing and functional models
- Expect to model up to 1024 64-bit cores per system (8 BEE3 boards)
- Predict peak performance around 1-10 GIPS with full timing models
Slide 12: Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
- Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs), as in the sketch below
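In spirit, host multithreading time-multiplexes many target CPU contexts over one physical host pipeline, so a long host-side latency for one target thread hides behind the others; a rough sketch with invented names and made-up latencies:

```python
# Sketch of a host-multithreaded emulation engine: one host "pipeline"
# round-robins over many target thread contexts, so a thread waiting on a
# long host-side latency (e.g., a cross-FPGA access) just skips its turn.

class TargetContext:
    def __init__(self, tid):
        self.tid, self.pc, self.stalled_until = tid, 0, 0

def run_engine(contexts, host_cycles):
    for cycle in range(host_cycles):
        ctx = contexts[cycle % len(contexts)]   # fine-grained interleave
        if cycle < ctx.stalled_until:
            continue                            # latency hidden by others
        ctx.pc += 4                             # "execute" one instruction
        if ctx.pc % 64 == 0:                    # pretend miss: 20-cycle stall
            ctx.stalled_until = cycle + 20

contexts = [TargetContext(t) for t in range(64)]   # 64 targets per engine
run_engine(contexts, host_cycles=10_000)
print(sum(c.pc // 4 for c in contexts), "target instructions emulated")
```

Because each context is revisited only every 64 host cycles, the 20-cycle stall is completely hidden, which is the throughput argument on the slide.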
Slide 13: Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))

[Figure: functional model paired with timing model]

- Functional model executes the CPU ISA correctly, with no timing information
  - Only need to develop a functional model once per ISA
- Timing model captures pipeline timing details and does not need to execute code
  - Much easier to change the timing model for architectural experimentation
  - Without an RTL design, cannot be 100% certain that timing is accurate
- Many possible splits between timing and functional models (one caricature below)
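The division of labor can be caricatured in a few lines (a sketch under an assumed 2-cycles-per-instruction timing model; the interfaces are invented): the functional model computes what each instruction does, while the timing model only decides when it commits.

```python
# Sketch of the functional/timing split: the functional model alone knows
# ISA semantics; the timing model only models pipeline occupancy and tells
# the functional model when it may commit the next instruction.

def functional_step(state):
    # Correct ISA execution, no notion of time (one model per ISA).
    state["pc"] += 4
    state["retired"] += 1

def timing_step(pipe, cycle):
    # Pipeline timing only; swap this out to explore microarchitectures.
    return cycle % pipe["cpi"] == 0          # commit allowed this cycle?

state = {"pc": 0, "retired": 0}
pipe = {"cpi": 2}                            # pretend 2 cycles/instruction
for cycle in range(100):
    if timing_step(pipe, cycle):
        functional_step(state)
print(state["retired"], "instructions in 100 cycles")   # -> 50
```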
Slide 14: RAMP Gold Approach
- Split (and decoupled) functional and timing models
- Host multithreading of both functional and timing models
Slide 15: Multithreaded Functional & Timing Models

[Figure: MT-Unit and MT-Channels]

- MT-Unit multiplexes multiple target units on a single host engine
- MT-Channel multiplexes multiple target channels over a single host link (sketched below)
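An MT-Channel is essentially tag-based multiplexing: each message on the shared host link carries its target-channel id and is demultiplexed at the far end. A sketch, with illustrative names only:

```python
# Sketch of an MT-Channel: many target channels share one host link by
# tagging every message with its target channel id and demultiplexing at
# the far end.

from collections import defaultdict

def mux(host_link, target_channel, payload):
    host_link.append((target_channel, payload))   # tag and share the link

def demux(host_link):
    per_target = defaultdict(list)
    for target_channel, payload in host_link:
        per_target[target_channel].append(payload)
    return per_target

link = []
for ch in range(4):                # 4 target channels, one host link
    mux(link, ch, f"msg-from-{ch}")
assert demux(link)[2] == ["msg-from-2"]
```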
Slide 16: RAMP Gold CPU Model (v0.1)

[Figure: CPU model pipeline. Fetch commands drive a PC/fetch functional unit through the instruction memory interface; decode/issue, execute, and commit timing blocks exchange status with the ALU functional unit and the data memory interface; per-thread timing state and multiple GPR banks back the host-multithreaded pipeline; addresses, immediates, instructions, PC values, and load data flow between the functional units]
Slide 17: RAMP Gold Memory Model (v0.1)

[Figure: CPU models (with duplicate paths for the instruction and data interfaces) connect to the memory model, which is backed by a host DRAM cache over BEE DRAM]
Slide 18: Matching Physical Resources to Utilization
- Only implement sufficient functional units to match expected utilization, e.g., for a single-issue core with expected IPC of 0.6:
  - Regfile read ports (1.2 operands/instruction): 0.6 × 1.2 = 0.72 per timing model
  - Regfile write ports (0.8 operands/instruction): 0.6 × 0.8 = 0.48 per timing model
- Instruction mix:
  - Mem 0.3
  - FPU 0.1
  - Int 0.5
  - Branch 0.1
- Therefore only need (per timing model):
  - 0.6 × 0.3 = 0.18 memory ports
  - 0.6 × 0.1 = 0.06 FPUs
  - 0.6 × 0.5 = 0.30 integer execution units
  - 0.6 × 0.1 = 0.06 branch execution units
  (this provisioning arithmetic is spelled out in the sketch below)
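The same provisioning arithmetic in executable form (the IPC and instruction-mix numbers are the slide's estimates; the dictionary layout is just for illustration):

```python
# Provisioning functional units to expected utilization: multiply target
# IPC by per-instruction demand to get the average number of each resource
# needed per timing model, then share pooled units across models.

IPC = 0.6
demand = {
    "regfile_read_ports":  1.2,   # operands read per instruction
    "regfile_write_ports": 0.8,   # operands written per instruction
    "mem_ports":           0.3,   # instruction-mix fractions below
    "fpus":                0.1,
    "int_units":           0.5,
    "branch_units":        0.1,
}

per_model = {unit: IPC * rate for unit, rate in demand.items()}
for unit, need in per_model.items():
    print(f"{unit}: {need:.2f} per timing model")

# Inverting the per-model demand gives a sharing ratio, e.g. one FPU
# (0.06 per model) can serve roughly sixteen timing models.
print(int(1 / per_model["fpus"]), "timing models per shared FPU")
```

Inverting the per-model demand is exactly the pooling the next slide's figure depicts.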
Slide 19: Balancing Resource Utilization

[Figure: twenty host-multithreaded timing models share four regfiles through an operand interconnect that feeds one FPU, one memory port, five integer units, and one branch unit]
Slide 20: RAMP Gold Capacity Estimates
- For a SPARC v8 (32-bit) pipeline:
  - Purely functional, no timing model
  - Integer only
- For BEE3, predict 64 CPUs/engine and 8 engines/FPGA (LX110), or 512 CPUs/FPGA
- Throughput of 150 MHz × 8 engines = 1200 MIPS/FPGA
- 8 BEE3 boards × 4 FPGAs/board ⇒ ≈38 GIPS/system
- Perhaps a 4× reduction in capacity with v9, FPU, and timing models (arithmetic spelled out below)
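And the capacity arithmetic spelled out (all figures are the presentation's estimates, not measurements):

```python
# RAMP Gold capacity estimate for the integer-only, untimed SPARC v8 case,
# using the slide's numbers.

cpus_per_engine  = 64
engines_per_fpga = 8
clock_mhz        = 150
fpgas_per_board  = 4
boards           = 8

cpus_per_fpga = cpus_per_engine * engines_per_fpga              # 512
mips_per_fpga = clock_mhz * engines_per_fpga                    # 1200 MIPS
gips_system   = mips_per_fpga * fpgas_per_board * boards / 1e3  # 38.4 GIPS

print(cpus_per_fpga, "CPUs/FPGA,", mips_per_fpga, "MIPS/FPGA,",
      gips_system, "GIPS/system")
# Adding v9 features, the FPU, and timing models may cost ~4x of this.
```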