Satisfying Your Dependencies with SuperMatrix (presentation transcript)
1
Satisfying Your Dependencies with SuperMatrix
  • Ernie Chan

2
Motivation
  • Transparent Parallelization of Matrix Operations
    for SMP and Multi-Core Architectures
  • Schedule submatrix operations out-of-order via
    dependency analysis
  • Programmability
  • High-level abstractions to hide details of
    parallelization from user

3
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

4
SuperMatrix
5
SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ********************* */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*----------------------------------------------------------------*/
    FLA_LU_nopiv( A11 );
    FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
    FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    /* ... (GEMM update of A22 and FLA_Cont_with_3x3_to_2x2 truncated
       on the original slide) ... */
  }

6
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 1

[Figure: 3 x 3 matrix of blocks; iteration 1 creates 1 LU task on A11, 4 TRSM tasks on the blocks of A12 and A21, and 4 GEMM tasks updating the blocks of A22]
7
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 2

[Figure: iteration 2 creates 1 LU task, 2 TRSM tasks, and 1 GEMM task on the remaining 2 x 2 submatrix of blocks]
8
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 3

[Figure: iteration 3 creates the final LU task on the last diagonal block]
9
SuperMatrix
  • FLASH
  • Matrix of matrices (a conceptual sketch follows)
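As a rough illustration of a matrix of matrices (the types and helper below are hypothetical, not libflame's actual FLASH structures), each element of the top-level matrix can simply point to a separately stored submatrix block:

  #include <stdlib.h>

  /* Hypothetical block type: one b x b submatrix, stored contiguously. */
  typedef struct {
      int     m, n;      /* dimensions of this block */
      double *buffer;    /* column-major block storage */
  } Block;

  /* Hypothetical hierarchical matrix: an mb x nb matrix whose
     "elements" are pointers to blocks. */
  typedef struct {
      int     mb, nb;    /* dimensions in blocks */
      Block **blocks;    /* column-major array of mb*nb block pointers */
  } HierMatrix;

  /* Create an mb x nb matrix of b x b blocks, initialized to zero. */
  static HierMatrix *hier_create( int mb, int nb, int b )
  {
      HierMatrix *H = malloc( sizeof( *H ) );
      H->mb = mb;
      H->nb = nb;
      H->blocks = malloc( mb * nb * sizeof( Block * ) );
      for ( int j = 0; j < nb; j++ )
          for ( int i = 0; i < mb; i++ ) {
              Block *blk = malloc( sizeof( *blk ) );
              blk->m = b;
              blk->n = b;
              blk->buffer = calloc( (size_t) b * b, sizeof( double ) );
              H->blocks[ j * mb + i ] = blk;
          }
      return H;
  }

Storing each block contiguously is what lets a single task operate on one block as an ordinary dense matrix.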

10
SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ********************* */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*----------------------------------------------------------------*/
    FLASH_LU_nopiv( A11 );
    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );
    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12,   /* arguments after the transposes  */
                FLA_ONE, A22 );            /* were cut off on the slide;      */
                                           /* reconstructed from the standard */
                                           /* update A22 -= A21 * A12         */
    /* ... (FLA_Cont_with_3x3_to_2x2 truncated on the original slide) ... */
  }

11
SuperMatrix
  • Analyzer
  • Delay execution and place tasks on queue
  • Tasks are function pointers annotated with
    input/output information
  • Compute dependence information (flow, anti,
    output) between all tasks
  • Create DAG of tasks (a sketch of a possible task record follows)
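A minimal sketch of what such an annotated task might look like (all names here are illustrative, not SuperMatrix's actual data structures):

  #define MAX_OPERANDS   4     /* enough for LU, TRSM, and GEMM tasks */
  #define MAX_DEPENDENTS 16

  typedef struct task Task;

  struct task {
      void (*func)( Task *t );            /* executes the operation, e.g. a TRSM */
      void  *operand[ MAX_OPERANDS ];     /* submatrix blocks the task touches   */
      int    is_output[ MAX_OPERANDS ];   /* nonzero if the block is written     */
      int    n_operands;
      Task  *dependent[ MAX_DEPENDENTS ]; /* outgoing DAG edges                  */
      int    n_dependents;
      int    n_unsatisfied;               /* incoming edges not yet satisfied    */
  };

The input/output annotations are what allow the analyzer to classify each pair of accesses to the same block as a flow, anti, or output dependence.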

12
SuperMatrix
  • Dispatcher
  • Use DAG to execute tasks out-of-order in parallel
  • Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
  • SuperScalar vs. SuperMatrix

13
SuperMatrix
  • Dispatcher
  • 4 threads
  • 5 x 5 matrix of blocks
  • 55 tasks (count verified below)
  • 18 stages

[Figure: DAG of the 55 LU, TRSM, and GEMM tasks for LU factorization of a 5 x 5 matrix of blocks, arranged into 18 dependence stages]
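As a check on these numbers (an added verification, not from the slides): iteration k of LU without pivoting on a p x p matrix of blocks generates one LU task, 2(p - k) TRSM tasks, and (p - k)^2 GEMM tasks, so in LaTeX notation:

  \text{tasks}(p) = \sum_{k=1}^{p} \left[ 1 + 2(p-k) + (p-k)^2 \right],
  \qquad \text{tasks}(5) = 25 + 16 + 9 + 4 + 1 = 55 .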
14
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

15
Implementation
  • Analyzer

[Figure: the analyzer converts the linear Task Queue of LU, TRSM, and GEMM tasks (in program order) into a DAG of tasks linked by their dependencies]
16
Implementation
  • Analyzer
  • FLASH routines enqueue tasks onto a global task queue
  • Dependencies between tasks are calculated and stored in each task structure
  • Each submatrix block stores the last task enqueued that writes to it
  • Flow dependencies occur when a subsequent task reads that block
  • DAG is embedded in the task queue (a sketch of this dependence analysis follows)
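A minimal sketch of this last-writer bookkeeping, reusing the hypothetical Task record sketched earlier (AnnotatedBlock and analyze_task are illustrative names; anti- and output-dependence handling is elided for brevity):

  /* Each block remembers the last enqueued task that writes it. */
  typedef struct {
      Task   *last_writer;   /* NULL until some task writes the block */
      double *buffer;        /* block storage (details omitted)       */
  } AnnotatedBlock;

  /* Called for each task as it is enqueued, in program order. */
  static void analyze_task( Task *t )
  {
      for ( int i = 0; i < t->n_operands; i++ ) {
          AnnotatedBlock *blk = t->operand[ i ];
          Task *w = blk->last_writer;

          if ( w != NULL && w != t ) {
              /* t reads or overwrites a block last written by w,
                 so t must wait for w: add a DAG edge w -> t. */
              w->dependent[ w->n_dependents++ ] = t;
              t->n_unsatisfied++;
          }
          if ( t->is_output[ i ] )
              blk->last_writer = t;   /* t is now the block's last writer */
      }
  }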

17
Implementation
  • Dispatcher

[Figure: the dispatcher moves ready tasks from the Task Queue onto the Waiting Queue; idle Threads dequeue tasks from its head and execute them]
18
Implementation
  • Dispatcher
  • Place ready and available tasks on a global waiting queue
  • The first task on the task queue is always ready and available
  • Threads asynchronously dequeue tasks from the head of the waiting queue
  • Once a task completes execution, notify its dependent tasks and update the waiting queue
  • Loop until all tasks complete execution (a sketch of this loop follows)
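A minimal pthreads sketch of this loop (the queue helpers and counter are assumed to exist and are hypothetical; the actual SuperMatrix dispatcher differs in its details):

  #include <pthread.h>

  extern pthread_mutex_t queue_lock;      /* protects queues and counter    */
  extern Task *waiting_dequeue( void );   /* head of waiting queue, or NULL */
  extern void  waiting_enqueue( Task *t );
  extern int   tasks_remaining;           /* tasks not yet executed         */

  static void *dispatcher_thread( void *arg )
  {
      (void) arg;
      for ( ;; ) {
          pthread_mutex_lock( &queue_lock );
          if ( tasks_remaining == 0 ) {          /* all tasks done: exit    */
              pthread_mutex_unlock( &queue_lock );
              break;
          }
          Task *t = waiting_dequeue();           /* dequeue from the head   */
          pthread_mutex_unlock( &queue_lock );

          if ( t == NULL )                       /* nothing ready yet       */
              continue;

          t->func( t );                          /* execute, e.g., a GEMM   */

          pthread_mutex_lock( &queue_lock );
          tasks_remaining--;
          for ( int i = 0; i < t->n_dependents; i++ ) {
              Task *d = t->dependent[ i ];
              if ( --d->n_unsatisfied == 0 )     /* last dependence met     */
                  waiting_enqueue( d );          /* d is now ready          */
          }
          pthread_mutex_unlock( &queue_lock );
      }
      return NULL;
  }

A real implementation would block on a condition variable rather than spin when the waiting queue is momentarily empty.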

19
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

20
Performance Results
21
Performance Results
  • GotoBLAS 1.13 installed on all machines
  • Supported Operations
  • LAPACK-level functions
  • Cholesky factorization
  • LU factorization without pivoting
  • All level-3 BLAS
  • GEMM, TRMM, TRSM
  • SYMM, SYRK, SYR2K
  • HEMM, HERK, HER2K

22
Performance Results
  • Implementations
  • SuperMatrix + serial BLAS
  • FLAME + multithreaded BLAS
  • LAPACK + multithreaded BLAS
  • Block size = 192
  • Processing elements = 8

23
Performance Results
  • SuperMatrix Implementation
  • Fixed block size
  • Varying block sizes can lead to better performance
  • Experiments show 192 is generally the best
  • Simplest scheduling
  • No sorting of tasks to execute those on the critical path earlier
  • No attempt to improve data locality in these experiments

24-29
Performance Results
[Figure slides: performance graphs for the supported operations; plots not preserved in this transcript]
30
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

31
Conclusion
  • Apply out-of-order execution techniques to
    schedule tasks
  • The whole is greater than the sum of the parts
  • Exploit parallelism between operations
  • Despite having to calculate dependencies, SuperMatrix incurs only a small performance penalty

32
Conclusion
  • Programmability
  • Code at a high level without needing to deal with
    aspects of parallelization

33
Authors
  • Ernie Chan
  • Field G. Van Zee
  • Enrique S. Quintana-Ortí
  • Gregorio Quintana-Ortí
  • Robert van de Geijn
  • The University of Texas at Austin
  • Universidad Jaume I

34
Acknowledgements
  • We thank the Texas Advanced Computing Center
    (TACC) for access to their machines and their
    support
  • Funding
  • NSF Grants
  • CCF-0540926
  • CCF-0702714

35
References
  • [1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
  • [2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
  • [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.

36
Conclusion
  • More Information
  • http://www.cs.utexas.edu/users/flame/
  • Questions?
  • echan@cs.utexas.edu