Satisfying Your Dependencies with SuperMatrix (presentation transcript)
1
Satisfying Your Dependencies with SuperMatrix
  • Ernie Chan

2
Motivation
  • Transparent Parallelization of Matrix Operations
    for SMP and Multi-Core Architectures
  • Schedule submatrix operations out-of-order via
    dependency analysis
  • Programmability
  • High-level abstractions to hide details of
    parallelization from user

3
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

4
SuperMatrix
5
SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    b = min( FLA_Obj_length( ABR ), nb_alg );

    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ********************* */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           b, b, FLA_BR );
    /*----------------------------------------------------------------*/
    FLA_LU_nopiv( A11 );
    FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
              FLA_ONE, A11, A12 );
    FLA_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
              FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
              FLA_ONE, A11, A21 );
    /* ... (GEMM update of A22 and FLA_Cont_with_3x3_to_2x2 truncated
       on the original slide) ... */
  }

6
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 1

[Figure: 3 x 3 matrix of blocks; iteration 1 creates 1 LU task on A11, 4 TRSM tasks on the blocks of A12 and A21, and 4 GEMM tasks updating the blocks of A22]
7
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 2

[Figure: iteration 2 creates 1 LU task, 2 TRSM tasks, and 1 GEMM task on the remaining 2 x 2 submatrix of blocks]
8
SuperMatrix
  • LU Factorization Without Pivoting
  • Iteration 3

[Figure: iteration 3 creates the final LU task on the last diagonal block]
9
SuperMatrix
  • FLASH
  • Matrix of matrices (a conceptual sketch follows)
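As a rough illustration of a matrix of matrices (the types and helper below are hypothetical, not libflame's actual FLASH structures), each element of the top-level matrix can simply point to a separately stored submatrix block:

  #include <stdlib.h>

  /* Hypothetical block type: one b x b submatrix, stored contiguously. */
  typedef struct {
      int     m, n;      /* dimensions of this block */
      double *buffer;    /* column-major block storage */
  } Block;

  /* Hypothetical hierarchical matrix: an mb x nb matrix whose
     "elements" are pointers to blocks. */
  typedef struct {
      int     mb, nb;    /* dimensions in blocks */
      Block **blocks;    /* column-major array of mb*nb block pointers */
  } HierMatrix;

  /* Create an mb x nb matrix of b x b blocks, initialized to zero. */
  static HierMatrix *hier_create( int mb, int nb, int b )
  {
      HierMatrix *H = malloc( sizeof( *H ) );
      H->mb = mb;
      H->nb = nb;
      H->blocks = malloc( mb * nb * sizeof( Block * ) );
      for ( int j = 0; j < nb; j++ )
          for ( int i = 0; i < mb; i++ ) {
              Block *blk = malloc( sizeof( *blk ) );
              blk->m = b;
              blk->n = b;
              blk->buffer = calloc( (size_t) b * b, sizeof( double ) );
              H->blocks[ j * mb + i ] = blk;
          }
      return H;
  }

Storing each block contiguously is what lets a single task operate on one block as an ordinary dense matrix.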

10
SuperMatrix
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) &&
          FLA_Obj_width ( ATL ) < FLA_Obj_width ( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,       &A00, /**/ &A01, &A02,
                        /* ************* */   /* ********************* */
                                                &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,       &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    /*----------------------------------------------------------------*/
    FLASH_LU_nopiv( A11 );
    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_UNIT_DIAG,
                FLA_ONE, A11, A12 );
    FLASH_Trsm( FLA_RIGHT, FLA_UPPER_TRIANGULAR,
                FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A11, A21 );
    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE,
                FLA_MINUS_ONE, A21, A12,   /* arguments after the transposes  */
                FLA_ONE, A22 );            /* were cut off on the slide;      */
                                           /* reconstructed from the standard */
                                           /* update A22 -= A21 * A12         */
    /* ... (FLA_Cont_with_3x3_to_2x2 truncated on the original slide) ... */
  }

11
SuperMatrix
  • Analyzer
  • Delay execution and place tasks on queue
  • Tasks are function pointers annotated with
    input/output information
  • Compute dependence information (flow, anti,
    output) between all tasks
  • Create DAG of tasks (a sketch of a possible task record follows)
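A minimal sketch of what such an annotated task might look like (all names here are illustrative, not SuperMatrix's actual data structures):

  #define MAX_OPERANDS   4     /* enough for LU, TRSM, and GEMM tasks */
  #define MAX_DEPENDENTS 16

  typedef struct task Task;

  struct task {
      void (*func)( Task *t );            /* executes the operation, e.g. a TRSM */
      void  *operand[ MAX_OPERANDS ];     /* submatrix blocks the task touches   */
      int    is_output[ MAX_OPERANDS ];   /* nonzero if the block is written     */
      int    n_operands;
      Task  *dependent[ MAX_DEPENDENTS ]; /* outgoing DAG edges                  */
      int    n_dependents;
      int    n_unsatisfied;               /* incoming edges not yet satisfied    */
  };

The input/output annotations are what allow the analyzer to classify each pair of accesses to the same block as a flow, anti, or output dependence.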

12
SuperMatrix
  • Dispatcher
  • Use DAG to execute tasks out-of-order in parallel
  • Akin to Tomasulo's algorithm and instruction-level parallelism, applied to blocks of computation
  • SuperScalar vs. SuperMatrix

13
SuperMatrix
  • Dispatcher
  • 4 threads
  • 5 x 5 matrix of blocks
  • 55 tasks (count verified below)
  • 18 stages

[Figure: DAG of the 55 LU, TRSM, and GEMM tasks for LU factorization of a 5 x 5 matrix of blocks, arranged into 18 dependence stages]
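As a check on these numbers (an added verification, not from the slides): iteration k of LU without pivoting on a p x p matrix of blocks generates one LU task, 2(p - k) TRSM tasks, and (p - k)^2 GEMM tasks, so in LaTeX notation:

  \text{tasks}(p) = \sum_{k=1}^{p} \left[ 1 + 2(p-k) + (p-k)^2 \right],
  \qquad \text{tasks}(5) = 25 + 16 + 9 + 4 + 1 = 55 .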
14
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

15
Implementation
  • Analyzer

[Figure: the analyzer converts the linear Task Queue of LU, TRSM, and GEMM tasks (in program order) into a DAG of tasks linked by their dependencies]
16
Implementation
  • Analyzer
  • FLASH routines enqueue tasks onto a global task queue
  • Dependencies between tasks are calculated and stored in each task structure
  • Each submatrix block stores the last task enqueued that writes to it
  • Flow dependencies occur when a subsequent task reads that block
  • DAG is embedded in the task queue (a sketch of this dependence analysis follows)
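A minimal sketch of this last-writer bookkeeping, reusing the hypothetical Task record sketched earlier (AnnotatedBlock and analyze_task are illustrative names; anti- and output-dependence handling is elided for brevity):

  /* Each block remembers the last enqueued task that writes it. */
  typedef struct {
      Task   *last_writer;   /* NULL until some task writes the block */
      double *buffer;        /* block storage (details omitted)       */
  } AnnotatedBlock;

  /* Called for each task as it is enqueued, in program order. */
  static void analyze_task( Task *t )
  {
      for ( int i = 0; i < t->n_operands; i++ ) {
          AnnotatedBlock *blk = t->operand[ i ];
          Task *w = blk->last_writer;

          if ( w != NULL && w != t ) {
              /* t reads or overwrites a block last written by w,
                 so t must wait for w: add a DAG edge w -> t. */
              w->dependent[ w->n_dependents++ ] = t;
              t->n_unsatisfied++;
          }
          if ( t->is_output[ i ] )
              blk->last_writer = t;   /* t is now the block's last writer */
      }
  }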

17
Implementation
  • Dispatcher

[Figure: the dispatcher moves ready tasks from the Task Queue onto the Waiting Queue; idle Threads dequeue tasks from its head and execute them]
18
Implementation
  • Dispatcher
  • Place ready and available tasks on a global waiting queue
  • The first task on the task queue is always ready and available
  • Threads asynchronously dequeue tasks from the head of the waiting queue
  • Once a task completes execution, notify its dependent tasks and update the waiting queue
  • Loop until all tasks complete execution (a sketch of this loop follows)
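A minimal pthreads sketch of this loop (the queue helpers and counter are assumed to exist and are hypothetical; the actual SuperMatrix dispatcher differs in its details):

  #include <pthread.h>

  extern pthread_mutex_t queue_lock;      /* protects queues and counter    */
  extern Task *waiting_dequeue( void );   /* head of waiting queue, or NULL */
  extern void  waiting_enqueue( Task *t );
  extern int   tasks_remaining;           /* tasks not yet executed         */

  static void *dispatcher_thread( void *arg )
  {
      (void) arg;
      for ( ;; ) {
          pthread_mutex_lock( &queue_lock );
          if ( tasks_remaining == 0 ) {          /* all tasks done: exit    */
              pthread_mutex_unlock( &queue_lock );
              break;
          }
          Task *t = waiting_dequeue();           /* dequeue from the head   */
          pthread_mutex_unlock( &queue_lock );

          if ( t == NULL )                       /* nothing ready yet       */
              continue;

          t->func( t );                          /* execute, e.g., a GEMM   */

          pthread_mutex_lock( &queue_lock );
          tasks_remaining--;
          for ( int i = 0; i < t->n_dependents; i++ ) {
              Task *d = t->dependent[ i ];
              if ( --d->n_unsatisfied == 0 )     /* last dependence met     */
                  waiting_enqueue( d );          /* d is now ready          */
          }
          pthread_mutex_unlock( &queue_lock );
      }
      return NULL;
  }

A real implementation would block on a condition variable rather than spin when the waiting queue is momentarily empty.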

19
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

20
Performance Results
21
Performance Results
  • GotoBLAS 1.13 installed on all machines
  • Supported Operations
  • LAPACK-level functions
  • Cholesky factorization
  • LU factorization without pivoting
  • All level-3 BLAS
  • GEMM, TRMM, TRSM
  • SYMM, SYRK, SYR2K
  • HEMM, HERK, HER2K

22
Performance Results
  • Implementations
  • SuperMatrix + serial BLAS
  • FLAME + multithreaded BLAS
  • LAPACK + multithreaded BLAS
  • Block size = 192
  • Processing elements = 8

23
Performance Results
  • SuperMatrix Implementation
  • Fixed block size
  • Varying block sizes can lead to better performance
  • Experiments show 192 is generally the best
  • Simplest scheduling
  • No sorting of tasks to execute those on the critical path earlier
  • No attempt to improve data locality in these experiments

24-29
Performance Results
[Figure slides: performance graphs for the supported operations; plots not preserved in this transcript]
30
Outline
  • SuperMatrix
  • Implementation
  • Performance Results
  • Conclusion

31
Conclusion
  • Apply out-of-order execution techniques to
    schedule tasks
  • The whole is greater than the sum of the parts
  • Exploit parallelism between operations
  • Despite having to calculate dependencies, SuperMatrix incurs only a small performance penalty

32
Conclusion
  • Programmability
  • Code at a high level without needing to deal with
    aspects of parallelization

33
Authors
  • Ernie Chan
  • Field G. Van Zee
  • Enrique S. Quintana-Ortí
  • Gregorio Quintana-Ortí
  • Robert van de Geijn
  • The University of Texas at Austin
  • Universidad Jaume I

34
Acknowledgements
  • We thank the Texas Advanced Computing Center
    (TACC) for access to their machines and their
    support
  • Funding
  • NSF Grants
  • CCF-0540926
  • CCF-0702714

35
References
  • [1] Ernie Chan, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix Out-of-Order Scheduling of Matrix Operations for SMP and Multi-Core Architectures. In SPAA '07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 2007.
  • [2] Ernie Chan, Field G. Van Zee, Paolo Bientinesi, Enrique S. Quintana-Ortí, Gregorio Quintana-Ortí, and Robert van de Geijn. SuperMatrix: A Multithreaded Runtime Scheduling System for Algorithms-by-Blocks. Submitted to PPoPP 2008.
  • [3] Gregorio Quintana-Ortí, Enrique S. Quintana-Ortí, Ernie Chan, Robert A. van de Geijn, and Field G. Van Zee. Scheduling of QR Factorization Algorithms on SMP and Multi-Core Architectures. Submitted to Euromicro PDP 2008.

36
Conclusion
  • More Information
  • http://www.cs.utexas.edu/users/flame/
  • Questions?
  • echan@cs.utexas.edu