1
RAMP Gold: ParLab InfiniCore Model
  • Krste Asanovic
  • UC Berkeley

RAMP Retreat, January 16, 2008
2
Outline
  • UCB Parallel Computing Laboratory (ParLab)
    overview
  • InfiniCore: UCB's manycore prototype architecture
  • RAMP Gold: a RAMP model for InfiniCore

3
UCB ParLab Overview
Easy to write correct software that runs
efficiently on manycore
[Figure: ParLab stack. Applications (Personal Health, Image Retrieval,
Hearing/Music, Speech, Parallel Browser); Motifs/Dwarfs; Productivity
Layer (Composition & Coordination Language (CCL), Static Verification,
CCL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks);
Correctness (Type Systems, Directed Testing, Dynamic Checking, Debugging
with Replay); Efficiency Layer (Efficiency Languages, Sketching,
Autotuners, Legacy Code, Schedulers, Communication & Synch. Primitives,
Efficiency Language Compilers); OS (Legacy OS, OS Libraries & Services,
Hypervisor); Arch. (Multicore/GPGPU, InfiniCore/RAMP Gold)]
4
Manycore covers huge design space
[Figure: design-space sketch showing thin cores, fat cores, and
special-purpose cores; multiple on-chip L2/RAM banks over an L2
interconnect; many alternative memory hierarchies; fast serial I/O
ports; and a memory/I/O interconnect to multiple off-chip DRAM and
Flash channels]
5
Narrowing our search space
  • Laptops/Handhelds -> single-socket systems
    - Don't expect >1 manycore chip per platform
    - Servers/HPC will probably use multiple
      single-socket blades
  • Homogeneous, general-purpose cores
    - Presents most of the interesting design
      challenges
    - Resulting designs can later be specialized for
      improved efficiency
  • Simple in-order cores
    - Want low energy/op floor
    - Want high performance/area ceiling
    - More predictable performance
  • A tiled physical design
    - Reduces logical/physical design verification
      costs
    - Enables design reuse across large family of parts
    - Provides natural locality to reduce latency and
      energy/op
    - Natural redundancy for yield enhancement &
      surviving failures

6
InfiniCore
  • ParLab strawman manycore architecture
  • A playground (punching bag?) for trying out
    architecture ideas
  • Highlights
  • Flexible hardware partitioning & protected
    communication
  • Latency-tolerant CPUs
  • Fast and flexible synchronization primitives
  • Configurable memory hierarchy and user-level DMA
  • Pervasive QoS and performance counters

7
InfiniCore Architecture Overview
  • Four separate on-chip network types
  • Control networks combine 1-bit signals in a
    combinational tree for interrupts & barriers
  • Active message networks carry register-register
    messages between cores
  • L2/Coherence network connects L1 caches to L2
    slices and indirectly to memory
  • Memory network connects L2 slices to memory
    controllers
  • I/O and accelerators potentially attach to all
    network types.
  • Flash replaces rotating disks.
  • Only high-speed I/O is network & display.

8
Physical View of Tiled Architecture
[Figure: tiled chip layout with off-chip DRAM and Flash channels]
9
Core Internals
  • RISC-style 64-bit instruction set
  • SPARC V9 used for pragmatic reasons
  • In-order pipeline with decoupled single-lane
    (64-bit) vector unit (VU)
  • Integer control unit generates/checks addresses
    in-order to give precise exceptions on vector
    loads/stores
  • VU runs behind, executing queued instructions on
    queued load data
  • VU executes both scalar & vector instructions, and
    can mix them (e.g., vector load plus scalar ALU)
  • Each VU cycle: 2 ALU ops, 1 load, 1 store (all 64b)
  • Vector regfile configurable to trade reduced
    I-fetch for fewer register spills
  • 256 total registers (e.g., 32 regs. x 8 elements,
    or 8 regs. x 32 elements); see the sketch below
  • Decoupling is a cheap way to tolerate memory
    latency inside a thread (scalar & vector)
  • Vectors increase performance, reduce energy/op,
    and increase effective decoupling queue size
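
To make the register-file trade-off concrete, here is a minimal Python
sketch (not the actual RTL; the power-of-two enumeration is an
assumption) of how 256 physical registers can be split between
architectural vector registers and elements per register:

```python
# Hypothetical sketch: split 256 physical 64-bit registers between
# architectural vector registers and elements per register. Longer
# vectors amortize instruction fetch; more registers reduce spills.
TOTAL_REGS = 256  # per-VU register storage, from the slide

def vector_configs(total=TOTAL_REGS):
    """Enumerate power-of-two (num_vregs, elements_per_vreg) splits."""
    n = 1
    while n <= total:
        yield (n, total // n)
        n *= 2

for vregs, vlen in vector_configs():
    print(f"{vregs:3d} vector regs x {vlen:3d} elements")
```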

[Figure: core internals. A control processor (Int 64b, 1-3 issue?) with
L1I and TLB/PLB feeds a decoupled vector unit (Int/FP 64b, 2x64b
FLOPS/clock) through a command queue; GPRs and VRegs; virtual addresses
pass through a TLB/PLB to the L1D, returning via load data queues
(store queues not shown); connections continue to outer levels of the
memory hierarchy]
10
Cache Coherence
  • L1 cache coherence is tracked at the L2 memory
    managers (as a set of readers per line)
  • All cases except a write to a currently shared
    (read) line are handled purely in hardware
  • The writer gets a trap on the memory response and
    invokes a software handler (sketched below)
  • The same process is used for transactional memory (TM)
  • Cache tags are visible to user-level software in a
    partition, useful for TM swapping
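
A minimal Python sketch of this policy; the directory structure and
handler signature are illustrative assumptions, not the actual hardware
interface:

```python
class L2Directory:
    def __init__(self):
        self.readers = {}  # line address -> set of core ids with L1 copies

    def read(self, core, addr):
        # Pure hardware case: record another reader of the line.
        self.readers.setdefault(addr, set()).add(core)

    def write(self, core, addr, trap_handler):
        sharers = self.readers.get(addr, set()) - {core}
        if sharers:
            # The one software-handled case: a write to a line other
            # cores are currently reading traps to a handler (TM reuses
            # this same path).
            trap_handler(core, addr, sharers)
        self.readers[addr] = {core}  # writer becomes the sole holder

def handle_write_trap(core, addr, sharers):
    print(f"core {core} writes {addr:#x}: handler invalidates {sorted(sharers)}")

d = L2Directory()
d.read(0, 0x1000)
d.read(1, 0x1000)
d.write(2, 0x1000, handle_write_trap)  # only this call takes the trap
```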

11
RAMP Gold: A Model of the ParLab InfiniCore Target
  • Target is single-socket tiled manycore system
  • Based on SPARC ISA (v8 -> v9)
  • Distributed coherent caches
  • Multiple on-chip networks (barrier, active
    message, coherence, memory)
  • Multiple DRAM channels
  • Split timing/functional models, both in hardware
  • Host multithreading of both timing and functional
    models
  • Expect to model up to 1024 64-bit cores in system
    (8 BEE3 boards)
  • Predict peak performance around 1-10 GIPS, with
    full timing models

12
Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
  • Multithreading emulation engine reduces FPGA
    resource use and improves emulator throughput
  • Hides emulation latencies (e.g., communicating
    across FPGAs); see the sketch below
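
A minimal Python sketch of the idea, under assumptions (the round-robin
policy, the latency value, and the every-fifth-issue stall are invented
for illustration): one host pipeline interleaves many target-core
contexts, so a long host-level operation for one target overlaps with
useful work for the others.

```python
from collections import deque

class HostEngine:
    """One host pipeline time-multiplexed over many target contexts."""
    def __init__(self, n_targets, stall_latency=8):
        self.ready = deque(range(n_targets))  # targets ready to issue
        self.waiting = []                     # (wakeup_cycle, target id)
        self.stall_latency = stall_latency

    def step(self, cycle):
        # Wake any target whose long-latency host operation finished.
        woken = [w for w in self.waiting if w[0] <= cycle]
        self.waiting = [w for w in self.waiting if w[0] > cycle]
        self.ready.extend(t for _, t in woken)
        if not self.ready:
            return None  # bubble: every target is stalled
        t = self.ready.popleft()
        if cycle % 5 == 0:
            # Pretend this issue needs a cross-FPGA access: park the
            # target instead of stalling the host pipeline.
            self.waiting.append((cycle + self.stall_latency, t))
        else:
            self.ready.append(t)  # round-robin back into the queue
        return t  # which target was emulated this host cycle

eng = HostEngine(n_targets=4)
print([eng.step(c) for c in range(16)])
```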

13
Split Functional/Timing Models (HASIM: Emer
(MIT/Intel); FAST: Chiou (UT Austin))
[Figure: timing model coupled to a functional model]
  • Functional model executes CPU ISA correctly, no
    timing information
  • Only need to develop functional model once for
    each ISA
  • Timing model captures pipeline timing details,
    does not need to execute code
  • Much easier to change timing model for
    architectural experimentation
  • Without an RTL design, cannot be 100% certain that
    timing is accurate
  • Many possible splits between timing and
    functional model (one minimal split is sketched below)
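
A minimal Python sketch of such a split, under stated assumptions: the
toy three-instruction ISA and the per-class latency table are invented
stand-ins for a real functional model and pipeline timing model.

```python
def functional_step(state, instr):
    """Architecturally execute one instruction (correctness only)."""
    op, dst, a, b = instr
    if op == "li":
        state[dst] = a
    elif op == "add":
        state[dst] = state[a] + state[b]
    elif op == "mul":
        state[dst] = state[a] * state[b]
    return state

LATENCY = {"li": 1, "add": 1, "mul": 3}  # assumed timing parameters

def timing_step(cycles, instr):
    """Charge time for one instruction; touches no architectural state."""
    return cycles + LATENCY[instr[0]]

prog = [("li", "r1", 6, None), ("li", "r2", 7, None),
        ("mul", "r3", "r1", "r2")]
state, cycles = {}, 0
for i in prog:
    state = functional_step(state, i)   # written once per ISA
    cycles = timing_step(cycles, i)     # easy to swap for experiments
print(state["r3"], "in", cycles, "modeled cycles")  # 42 in 5 cycles
```

Changing only the latency table (or the timing function) models a
different microarchitecture without rewriting the functional model.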

14
RAMP Gold Approach
  • Split (and decoupled) functional and timing
    models
  • Host multithreading of both functional and timing
    models

15
Multithreaded Func. & Timing Models
[Figure: MT-Units connected by MT-Channels]
  • MT-Unit multiplexes multiple target units on a
    single host engine
  • MT-Channel multiplexes multiple target channels
    over a single host link

16
RAMP Gold CPU Model (v0.1)
[Figure: CPU model datapath. PC/fetch and ALU functional blocks with
instruction and data memory interfaces; per-thread GPRs and timing
state; decode/issue, execute, and commit timing stages exchanging PC
values, instructions, addresses, immediates, load data, fetch commands,
execute/memory commands, and status signals]
17
RAMP Gold Memory Model (v0.1)
[Figure: CPU models (with duplicate paths for the instruction and data
interfaces) connected to a memory model backed by a host DRAM cache and
BEE DRAM]
18
Matching physical resources to utilization
  • Only implement sufficient functional units to
    match expected utilization, e.g.:
  • For a single-issue core, expected IPC = 0.6
  • Regfile read ports (1.2 operands/instruction):
    0.6 x 1.2 = 0.72 per timing model
  • Regfile write ports (0.8 operands/instruction):
    0.6 x 0.8 = 0.48 per timing model
  • Instruction mix:
    - Mem: 0.3
    - FPU: 0.1
    - Int: 0.5
    - Branch: 0.1
  • Therefore only need (per timing model; see the
    sketch below):
    - 0.6 x 0.3 = 0.18 memory ports
    - 0.6 x 0.1 = 0.06 FPUs
    - 0.6 x 0.5 = 0.30 integer execution units
    - 0.6 x 0.1 = 0.06 branch execution units
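
The same arithmetic as a runnable Python snippet; the inverse of each
per-timing-model demand suggests how many timing models could share one
physical unit (that sharing gloss is my interpretation, not a slide
claim):

```python
# Per-timing-model demand for each physical resource, from expected IPC
# and the instruction mix above. 1/demand estimates how many timing
# models one shared unit could serve (assuming no contention).
IPC = 0.6
demand = {
    "regfile read ports": IPC * 1.2,   # 1.2 operands read/instruction
    "regfile write ports": IPC * 0.8,  # 0.8 operands written/instruction
    "memory ports": IPC * 0.3,
    "FPUs": IPC * 0.1,
    "integer units": IPC * 0.5,
    "branch units": IPC * 0.1,
}
for name, need in demand.items():
    print(f"{name}: {need:.2f}/timing model "
          f"(~{1 / need:.1f} timing models per shared unit)")
```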

19
Balancing Resource Utilization
[Figure: many timing models (20 shown) share four regfiles through an
operand interconnect to a small pool of function units: one FPU, one
memory port, five integer units, and one branch unit]
20
RAMP Gold Capacity Estimates
  • For a SPARC v8 (32-bit) pipeline:
  • Purely functional, no timing model
  • Integer only
  • For BEE3, predict 64 CPUs/engine, 8 engines/FPGA
    (LX110), or 512 CPUs/FPGA
  • Throughput of 150 MHz x 8 engines = 1200
    MIPS/FPGA
  • 8 BEE3 boards x 4 FPGAs/board = ~38 GIPS/system
  • Perhaps 4x reduction in capacity with v9, FPU,
    and timing models (see the sketch below)
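
The capacity arithmetic as a quick check; the 4x derating is the
slide's own rough estimate:

```python
# Worked version of the capacity estimate above.
MHZ = 150                # host clock; MIPS contribution per engine
ENGINES_PER_FPGA = 8
CPUS_PER_ENGINE = 64
FPGAS_PER_BOARD = 4
BOARDS = 8

cpus_per_fpga = CPUS_PER_ENGINE * ENGINES_PER_FPGA  # 512 CPUs/FPGA
mips_per_fpga = MHZ * ENGINES_PER_FPGA              # 1200 MIPS/FPGA
gips_system = mips_per_fpga * FPGAS_PER_BOARD * BOARDS / 1000
print(f"{cpus_per_fpga} CPUs/FPGA, {mips_per_fpga} MIPS/FPGA, "
      f"{gips_system:.1f} GIPS/system, "
      f"~{gips_system / 4:.1f} GIPS with the 4x derating")
```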