M-Machine and Grids: Parallel Computer Architectures

1
M-Machine and Grids: Parallel Computer
Architectures
  • Navendu Jain

2
Readings
  • The M-Machine Multicomputer, Fillo et al.,
    MICRO 1995
  • Exploiting Fine-Grain Thread Level Parallelism
    on the MIT Multi-ALU Processor, Keckler et al.,
    ISCA 1998
  • A Design Space Evaluation of Grid Processor
    Architectures, Nagarajan et al., MICRO 2001

3
Outline
  • The M-Machine Multicomputer
  • Thread Level Parallelism on M-Machine
  • Grid Processor Architectures
  • Review and Discussion

4
The M-Machine Multicomputer
5
Design Motivation
  • Achieve higher throughput from memory resources
  • Increase the chip area devoted to processors
  • Arithmetic-to-bandwidth ratio of 12
    operations/word
  • Minimize global communication (local sync.)
  • Faster execution of fixed-size problems
  • Easier programmability of parallel computers
  • Incremental approach

6
Architecture
  • A bi-directional 3-D mesh network of
    multi-threaded processing nodes
  • Each chip comprises a multi-ALU processor (MAP)
    and 128KB of on-chip synchronous DRAM
  • A user-accessible message-passing system (SEND)
  • Single global virtual address space
  • Target clock 100 MHz (control logic 40 MHz)

7
Multi-ALU Processor (MAP)
  • A MAP chip comprises
  • Three 64-bit, 3-issue clusters
  • A 2-way interleaved on-chip cache
  • A memory switch
  • A cluster switch
  • An external memory interface
  • On-chip network interfaces and routers

8
A MAP Cluster
  • A 64-bit, three-issue pipelined processor
  • 2 integer ALUs
  • 1 floating-point ALU
  • Register files
  • A 4KB instruction cache
  • A MAP instruction holds 1, 2, or 3 operations

9
MAP Chip Die (18 mm on a side, 5M transistors)
10
Exploiting Parallelism on M-Machine
11
Threads
  • Exploit ILP both within and across the clusters
  • Horizontal Threads (H-Threads)
  • Instruction-level parallelism
  • Execute on a single MAP cluster
  • 3-wide instruction stream
  • Communication/synchronization through
    messages/registers/memory
  • Up to 6 H-Threads can be interleaved dynamically
    on a cycle-by-cycle basis

12
Threads (contd.)
  • Vertical Threads (V-Threads)
  • Thread-level parallelism (a standard process)
  • Contains up to 4 H-Threads (one per cluster)
  • Flexible scheduling (compiler/run-time)
  • Communication/synchronization through registers
  • At most 6 resident V-Threads
  • 4 user slots, 1 event slot, 1 exception slot
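
As a rough illustration of this two-level hierarchy, the toy scheduler below (an illustrative sketch only; the class names, instruction lists, and strict round-robin policy are invented here, not taken from the MAP ISA) interleaves the H-Threads of several V-Threads on each cluster, cycle by cycle:

```python
# Toy model of the M-Machine's two-level thread hierarchy.
# Numbers follow the slides: 4 clusters, at most 6 resident V-Threads.

NUM_CLUSTERS = 4      # one H-Thread slot per cluster within a V-Thread
MAX_VTHREADS = 6      # 4 user slots + 1 event slot + 1 exception slot

class HThread:
    """An H-Thread: an instruction stream bound to one cluster."""
    def __init__(self, name, ops):
        self.name, self.ops = name, list(ops)

class VThread:
    """A V-Thread (a standard process): up to one H-Thread per cluster."""
    def __init__(self, name, hthreads):
        assert len(hthreads) <= NUM_CLUSTERS
        self.name, self.hthreads = name, hthreads

def run(vthreads):
    assert len(vthreads) <= MAX_VTHREADS
    trace = []
    # Each cluster interleaves its resident H-Threads cycle by cycle.
    for cluster in range(NUM_CLUSTERS):
        resident = [vt.hthreads[cluster] for vt in vthreads
                    if cluster < len(vt.hthreads)]
        while any(ht.ops for ht in resident):
            for ht in resident:            # round-robin, zero-cost switch
                if ht.ops:
                    trace.append((cluster, ht.name, ht.ops.pop(0)))
    return trace

trace = run([VThread("A", [HThread("A0", ["add", "mul"])]),
             VThread("B", [HThread("B0", ["ld", "st"])])])
# Cluster 0 alternates between A0 and B0 on successive cycles.
```

With two single-cluster V-Threads, cluster 0 alternates between them every cycle, which is the dynamic cycle-by-cycle interleaving the slides describe.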

14
Concurrency Model: Three Levels of Parallelism
  • Instruction-level parallelism (~1 instruction)
  • VLIW, superscalar processors
  • Issues: control flow, data dependency,
    scalability
  • Thread-level parallelism (~1000 instructions)
  • Chip multiprocessors
  • Issues: limited to coarse TLP; inner cores
    non-optimal
  • Fine-grain parallelism (~50–1000 instructions)

15
Mapping
(Figure: matching program granularity to the architecture)
16
Fine-grain Overheads
  • Thread creation (11 cycles, hfork)
  • Communication
  • Register-to-register reads/writes
  • Message passing / on-chip cache
  • Synchronization
  • Blocking on a register (full/empty bit)
  • Barrier (cbar instruction)
  • Memory (sync bit)
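
The register full/empty bit can be pictured with the following software sketch. The hardware blocks the consuming thread directly; here that blocking is modeled (as an assumption of this sketch, not the real mechanism) with a Python condition variable:

```python
# Illustrative model of full/empty-bit synchronization: a register
# whose reader blocks until a writer marks it full.
import threading

class SyncRegister:
    """A register with a full/empty presence bit."""
    def __init__(self):
        self._full = False
        self._value = None
        self._cv = threading.Condition()

    def write(self, value):
        # Producer: deposit the value and set the full bit.
        with self._cv:
            self._value = value
            self._full = True
            self._cv.notify_all()

    def read(self):
        # Consumer: block until the register is full, then read.
        with self._cv:
            self._cv.wait_for(lambda: self._full)
            return self._value

r = SyncRegister()
threading.Thread(target=lambda: r.write(42)).start()
result = r.read()   # blocks until the producer's write lands
```

This is the producer/consumer pattern behind "blocking on a register": synchronization costs one register access rather than a heavyweight OS primitive.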

17
Grid Processor Architecture
18
Design Motivation
  • Continued scaling of the clock rate
  • Scalability of the processing core
  • Higher ILP and instruction throughput (IPC)
  • Mitigate global wire-delay overheads
  • Closer coupling of architecture and compiler

19
Architecture
  • An interconnected 2-D array of ALU nodes
  • Each node has an instruction buffer (IB) and an
    execution unit
  • A single control thread maps instructions to
    nodes
  • Block-atomic execution model
  • Blocks of statically scheduled instructions are
    mapped as a unit
  • Dynamic execution in data-flow order
  • Temporary values are forwarded to the consumer
    ALUs
  • The critical path is scheduled along the
    shortest physical path
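
The block-atomic, data-flow execution model can be sketched as a tiny interpreter. The block encoding here (instruction id mapped to an operation and its source ids) is invented for illustration, not the GPA's actual instruction format; the point is the firing rule, where a node executes once all its operands have arrived and forwards its result to its consumers:

```python
# Minimal dataflow-firing sketch of block-atomic execution.
import operator

OPS = {"add": operator.add, "mul": operator.mul}

def execute_block(instrs, inputs):
    """instrs: id -> (op, [source ids]); inputs: id -> initial value."""
    values = dict(inputs)
    # A node is ready once every source operand has a value.
    ready = [i for i, (op, srcs) in instrs.items()
             if all(s in values for s in srcs)]
    while ready:
        i = ready.pop(0)
        op, srcs = instrs[i]
        values[i] = OPS[op](*(values[s] for s in srcs))   # node fires
        # Forwarding: wake any consumer whose operands just completed.
        for j, (_, jsrcs) in instrs.items():
            if (j not in values and j not in ready
                    and all(s in values for s in jsrcs)):
                ready.append(j)
    return values

block = {"t1": ("add", ["a", "b"]),    # t1 = a + b
         "t2": ("mul", ["t1", "a"])}   # t2 = t1 * a
out = execute_block(block, {"a": 2, "b": 3})
# out["t2"] == (2 + 3) * 2 == 10
```

No centralized issue logic is consulted inside the block: execution order emerges from operand arrival, which is what lets the grid avoid global wires on the critical path.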

20
GPA Architecture
21
Example Block-Atomic Mapping
22
Implementation
  • Instruction fetch and map
  • Predicated hyper-blocks; move instructions
  • Execution control logic
  • Operand routing: max. 3 destinations; split
    instructions
  • Hyper-block control
  • Predication (execute-all approach); cmove
    instructions
  • Block commit
  • Block stitching
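
The execute-all predication style can be illustrated with a brief sketch of cmove semantics (the helper names below are invented for this example): both arms of a branch execute unconditionally, and the predicate merely selects which result commits.

```python
# Sketch of execute-all predication with a conditional move.
def cmove(pred, a, b):
    """Hardware-style conditional move: returns a if pred else b."""
    return a if pred else b

def abs_hyperblock(x):
    # Both candidate results are computed unconditionally...
    pos, neg = x, -x
    # ...and the predicate selects the one that commits.
    return cmove(x >= 0, pos, neg)
```

Trading redundant work for the removal of branches keeps the mapped hyper-block free of control-flow joins inside the grid.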

23
Review and Discussion
24
Key Ideas: Convergence
  • Microprocessor as a number of superscalar
    processors; comm./sync. via registers; low
    overheads
  • Exploiting both ILP and TLP granularities
  • Dependences mapped onto a grid of ALUs
  • Replication reduces design/verification effort
  • Point-to-point communication
  • Exposing architecture partitioning and the flow
    of operations to the compiler
  • Avoid wire and routing delays and the
    memory-wall problem

25
Ideas: Divergence
  • M-Machine
  • On-chip cache and register-based mechanisms:
    delays
  • Broadcast and point-to-point communication
  • GPA
  • Register set replaced by grid chaining:
    scalability
  • Point-to-point communication only
  • TERA
  • Fine-grain threads; memory comm./sync.
    (full/empty bits)
  • No support for single-threaded code

26
Drawbacks (Unresolved Issues)
  • M-Machine
  • Scalability
  • Clock speeds
  • Memory synchronization (use hfork)
  • Grid Processor Arch.
  • Data caches far from the ALUs
  • Delays between dependent operations due to
    network routers and wires
  • Complex frame management and block stitching
  • Explicit compiler dependence

27
Challenges/Future Directions
  • Architectural support to extract TLP
  • Parallelizing compiler technology
  • How many cores/threads?
  • No. of threads vs. memory latency and wire
    delays (Flynn)
  • Inter-thread communication
  • Grid height of 8 gives IPC of 5–6 (GPA; Peter)
  • Optimization as f(comm., delays, memory costs)

28
Challenges (contd.)
  • On-the-fly data-dependence detection (RAW/WAR)
  • TLP/ILP balance on the M-Machine multicomputer

29
Thanks