Title: Runtime Data Flow Scheduling of Matrix Computations
1. Runtime Data Flow Scheduling of Matrix Computations
2. Motivation
- Solving Linear Systems
- Solve for x:
- A x = b
- Factorize A, at O(n^3) cost:
- P A = L U
- Forward and backward substitution, at O(n^2) cost:
- L y = P b
- U x = y
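As a concrete illustration of the two O(n^2) solves, here is a minimal C sketch that assumes the factorization P A = L U has already been computed, with the permutation stored LAPACK-style as a pivot vector. The function name lu_solve, the fixed problem size, and the row-major layout are illustrative assumptions, not anything from the deck.

  #include <stdio.h>

  #define N 2

  /* Solve A x = b given the factorization P A = L U.
     L is unit lower triangular, U is upper triangular, and piv[i]
     records the row interchanged with row i (LAPACK-style pivots). */
  void lu_solve( double L[N][N], double U[N][N], int piv[N],
                 double b[N], double x[N] )
  {
      double y[N];

      /* Apply P to b by replaying the row interchanges. */
      for ( int i = 0; i < N; i++ )
      {
          double t = b[i]; b[i] = b[piv[i]]; b[piv[i]] = t;
      }

      /* Forward substitution: solve L y = P b in O(n^2) flops. */
      for ( int i = 0; i < N; i++ )
      {
          y[i] = b[i];
          for ( int j = 0; j < i; j++ ) y[i] -= L[i][j] * y[j];
      }

      /* Backward substitution: solve U x = y in O(n^2) flops. */
      for ( int i = N - 1; i >= 0; i-- )
      {
          x[i] = y[i];
          for ( int j = i + 1; j < N; j++ ) x[i] -= U[i][j] * x[j];
          x[i] /= U[i][i];
      }
  }

  int main( void )
  {
      /* P A = L U for A = [ 4 3; 6 3 ]: rows 0 and 1 are swapped. */
      double L[N][N] = { { 1.0,     0.0 }, { 2.0/3.0, 1.0 } };
      double U[N][N] = { { 6.0,     3.0 }, { 0.0,     1.0 } };
      int    piv[N]  = { 1, 1 };
      double b[N]    = { 7.0, 9.0 }, x[N];

      lu_solve( L, U, piv, b, x );
      printf( "x = [ %g, %g ]\n", x[0], x[1] );  /* prints [ 1, 1 ] */
      return 0;
  }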
3. Goals
- Programmability
- Use tools provided by FLAME
- Parallelism
- Directed acyclic graph (DAG) scheduling
4. Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
- P A = L U
5. LU Factorization with Partial Pivoting
- Formal Linear Algebra Methods Environment (FLAME)
- High-level abstractions for expressing linear algebra algorithms
- Application programming interfaces (APIs) for seamlessly implementing algorithms in code
- Library of commonly used linear algebra operations in libflame
7. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 1
- [Diagram: A partitioned into blocks A11, A12, A21, A22]
8. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 1
- [Diagram: LUpiv factors the panel A11/A21 via LU with partial pivoting; A12 and A22 are not yet updated]
9. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 1
- [Diagram: PIV applies the panel's row interchanges to A12/A22; the factored A11 and A21 are unchanged]
10. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 1
- [Diagram: TRSM performs the triangular solve A12 := L11^-1 A12; A11, A21, A22 unchanged]
11. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 1
- [Diagram: GEMM updates the trailing block A22 := A22 - A21 A12]
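To make the last two updates concrete, here is a minimal C sketch of iteration 1's TRSM and GEMM written as plain triple loops in place of tuned BLAS calls. The function names trsm_unit_lower and gemm_update, the row-major layout, and the dimension parameters are assumptions for illustration only.

  /* TRSM: A12 := L11^{-1} A12, where L11 is the unit lower
     triangular factor stored in A11 after LUpiv. */
  void trsm_unit_lower( int b, int n, double A11[b][b], double A12[b][n] )
  {
      for ( int j = 0; j < n; j++ )
          for ( int i = 0; i < b; i++ )
              for ( int k = 0; k < i; k++ )
                  A12[i][j] -= A11[i][k] * A12[k][j];
  }

  /* GEMM: A22 := A22 - A21 * A12, the rank-b update of the
     trailing submatrix. */
  void gemm_update( int m, int b, int n,
                    double A21[m][b], double A12[b][n], double A22[m][n] )
  {
      for ( int i = 0; i < m; i++ )
          for ( int j = 0; j < n; j++ )
              for ( int k = 0; k < b; k++ )
                  A22[i][j] -= A21[i][k] * A12[k][j];
  }

In the deck's actual implementations these updates are Level-3 BLAS calls (dtrsm and dgemm), which is where the O(n^3) work, and hence the performance, concentrates.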
12. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 2
- [Diagram: A repartitioned into blocks A00 through A22]
13. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 2
- [Diagram: LUpiv factors the current panel A11/A21]
14. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 2
- [Diagram: PIV applies the panel's row interchanges to the blocks to the left (A10/A20) and to the right (A12/A22) of the panel]
15. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 2
- [Diagram: TRSM performs the triangular solve on A12]
16. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 2
- [Diagram: GEMM updates the trailing block A22]
17. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 3
- [Diagram: remaining blocks A00, A01, A10, A11]
18. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 3
- [Diagram: LUpiv factors the final block A11]
19. LU Factorization with Partial Pivoting
- Blocked Algorithm
- Iteration 3
- [Diagram: PIV applies the final row interchanges to A10]
20. Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
- P A = L U
21. Algorithm-by-Blocks
22.

  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  FLA_Part_2x1( p,    &pT,
                      &pB,            0, FLA_TOP );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
     FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                         /* ************* */  /* ******************** */
                                                &A10, /**/ &A11, &A12,
                            ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                            1, 1, FLA_BR );
     FLA_Repart_2x1_to_3x1( pT,      &p0,
                         /* ** */  /* ** */
                                     &p1,
                            pB,      &p2,
                            1, FLA_BOTTOM );
     /*------------------------------------------------------------*/

     /* Merge A11 and A21 into a single view of the current panel. */
     FLA_Merge_2x1( A11,
                    A21,    &AB1 );

     /* Factor the panel via LU with partial pivoting. */
     FLASH_LU_piv( AB1, p1 );
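In this fragment, FLA_Merge_2x1 creates a single view, AB1, spanning the A11 and A21 blocks so that FLASH_LU_piv can factor the whole current panel of blocks in one call, recording its row interchanges in p1. Because A is stored hierarchically as a matrix of blocks, the repartitionings advance one block at a time, which is why the step arguments are 1, 1 rather than an element-level block size.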
23. Algorithm-by-Blocks
- LU Factorization with Partial Pivoting
- Iteration 1
- [Diagram: LUpiv0 factors the first panel of blocks; PIV1 and PIV2 apply its pivots across the trailing block columns; TRSM3 and TRSM4 update the top blocks of those columns; GEMM5 through GEMM8 update the four trailing blocks]
24. Algorithm-by-Blocks
- LU Factorization with Partial Pivoting
- Iteration 2
- [Diagram: LUpiv9 factors the second panel; PIV10 and PIV11 apply its pivots to the block columns left and right of the panel; TRSM12 and GEMM13 update the trailing blocks]
25. Algorithm-by-Blocks
- LU Factorization with Partial Pivoting
- Iteration 3
- [Diagram: LUpiv14 factors the final block; PIV15 and PIV16 apply its pivots to the block columns on the left]
26. [Diagram: the directed acyclic graph over all seventeen tasks: LUpiv0, PIV1, PIV2, TRSM3, TRSM4, GEMM5 through GEMM8, LUpiv9, PIV10, PIV11, TRSM12, GEMM13, LUpiv14, PIV15, PIV16]
27. Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
- P A = L U
28. SuperMatrix Runtime System
- Separation of Concerns
- Analyzer
- Decomposes subproblems into component tasks
- Stores tasks sequentially in a global task queue
- Internally calculates all dependencies between tasks, which form a directed acyclic graph (DAG), using only the input and output parameters of each task (sketched below)
- Dispatcher
- Spawns threads
- Schedules and dispatches tasks to threads in parallel
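The analyzer's dependence test can be illustrated with a small sketch: a hypothetical task_t records which blocks a task reads and writes, and a quadratic scan over the sequential task order adds a DAG edge wherever a later task touches a block an earlier task writes, or writes a block an earlier task reads. This is a sketch under those assumptions, not the SuperMatrix implementation; the names task_t, overlaps, and analyze are invented for illustration.

  #include <stdbool.h>

  #define MAX_OPS   4   /* max blocks read or written per task   */
  #define MAX_TASKS 64  /* max tasks and successors, for brevity */

  typedef struct task task_t;
  struct task
  {
      const char *name;                          /* e.g. "GEMM5"       */
      const void *in [MAX_OPS];  int n_in;       /* blocks read        */
      const void *out[MAX_OPS];  int n_out;      /* blocks written     */
      task_t     *succ[MAX_TASKS]; int n_succ;   /* outgoing DAG edges */
      int         n_pred;                        /* unmet dependencies */
  };

  static bool overlaps( const void **a, int na, const void **b, int nb )
  {
      for ( int i = 0; i < na; i++ )
          for ( int j = 0; j < nb; j++ )
              if ( a[i] == b[j] ) return true;
      return false;
  }

  /* Scan tasks in their sequential program order; add an edge t -> u
     whenever a later task u reads or writes a block that an earlier
     task t writes, or writes a block that t reads. */
  void analyze( task_t *tasks, int n )
  {
      for ( int i = 0; i < n; i++ )
          for ( int j = i + 1; j < n; j++ )
          {
              task_t *t = &tasks[i], *u = &tasks[j];
              if ( overlaps( t->out, t->n_out, u->in,  u->n_in  ) ||
                   overlaps( t->out, t->n_out, u->out, u->n_out ) ||
                   overlaps( t->in,  t->n_in,  u->out, u->n_out ) )
              {
                  t->succ[t->n_succ++] = u;      /* flow, output, or */
                  u->n_pred++;                   /* anti-dependence  */
              }
          }
  }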
29. SuperMatrix Runtime System
- Dispatcher: Single Queue
- Set of all ready and available tasks
- FIFO, priority (a worker-loop sketch follows below)
- [Diagram: one shared ready queue feeding processing elements PE0, PE1, ..., PEp-1]
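Here is a minimal worker-loop sketch of such a dispatcher, reusing the hypothetical task_t and MAX_TASKS from the analyzer sketch above. The single unnamed critical section, the spin-wait, and the bounded array queue are deliberate simplifications, and dispatch is an invented name; the deck's OpenMP usage is the only element taken from the source.

  #include <omp.h>

  /* A single shared FIFO of ready tasks (every queued task has
     n_pred == 0). */
  static task_t *queue[MAX_TASKS];
  static int     head = 0, tail = 0, n_done = 0;

  void dispatch( task_t *tasks, int n, void (*execute)( task_t * ) )
  {
      for ( int i = 0; i < n; i++ )             /* seed with root tasks */
          if ( tasks[i].n_pred == 0 ) queue[tail++] = &tasks[i];

      #pragma omp parallel                      /* spawn worker threads */
      {
          int finished = 0;
          while ( !finished )
          {
              task_t *t = NULL;
              #pragma omp critical
              {
                  if ( head < tail )      t = queue[head++];
                  else if ( n_done == n ) finished = 1;
              }
              if ( t == NULL ) continue;        /* nothing ready: retry */

              execute( t );                     /* run the task body    */

              #pragma omp critical
              {
                  n_done++;                     /* completing a task    */
                  for ( int k = 0; k < t->n_succ; k++ )  /* wakes succs */
                      if ( --t->succ[k]->n_pred == 0 )
                          queue[tail++] = t->succ[k];
              }
          }
      }
  }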
30. [Diagram: the task DAG from slide 26, LUpiv0 through PIV16, shown again]
31. SuperMatrix Runtime System
- Lookahead
- Schedule the GEMM5 and GEMM6 tasks first so that LUpiv9 can be computed ahead, in parallel with GEMM7 and GEMM8
- Implemented directly within the code, which increases its complexity and detracts from programmability
- Example: High-Performance LINPACK
32. SuperMatrix Runtime System
- Scheduling
- Sorting tasks by the height of each task within the DAG mimics lookahead (sketched below)
- Multiple queues
- Data affinity
- Work stealing
- Macroblocks
- Tasks overwriting more than one block at a time
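A sketch of the height heuristic, again using the hypothetical task_t from the analyzer sketch: the height of a task is the longest path from it to a sink of the DAG, and dequeuing ready tasks in decreasing height order naturally favors the critical path. In the iteration-1 DAG, GEMM5 and GEMM6 feed LUpiv9 and everything after it, so they outrank GEMM7 and GEMM8, reproducing the lookahead effect without touching the algorithm code. The function height is an invented name.

  /* Height of a task: the longest path from it to a sink of the DAG.
     (Exponential as written; memoize one height per task in practice.) */
  int height( task_t *t )
  {
      int h = 0;
      for ( int k = 0; k < t->n_succ; k++ )
      {
          int hk = 1 + height( t->succ[k] );
          if ( hk > h ) h = hk;
      }
      return h;
  }

Using height as the priority when inserting into the ready queue turns the single-queue dispatcher above into a priority queue that schedules GEMM5 and GEMM6 ahead of GEMM7 and GEMM8.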
33. Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
- P A = L U
34. Performance
- Implementations
- SuperMatrix with serial BLAS
- Partial and incremental pivoting
- LAPACK dgetrf with multithreaded BLAS
- Multithreaded dgetrf
- Multithreaded dgemm
- Double-precision real floating-point arithmetic
- Block size tuned per problem size
35. Performance
- Target Architecture: Linux
- 4-socket 2.3 GHz AMD Opteron Quad-Core
- ranger.tacc.utexas.edu
- 3936 SMP nodes
- 16 cores per node
- 2 MB shared L3 cache per socket
- OpenMP
- Intel compiler 10.1
- BLAS
- GotoBLAS2 1.00
36-37. Performance
- [Graphs: performance results, not transcribed]
38. Performance
- Target Architecture: Windows
- 4-socket 2.4 GHz Intel Xeon E7330 Quad-Core
- Windows Server 2008 R2 Enterprise
- 16-core UMA machine
- Two 3 MB shared L2 caches per socket
- OpenMP
- Microsoft Visual C++ 2010
- BLAS
- Intel MKL 10.2
39. Performance
- [Graph: performance results, not transcribed]
40. Performance
- Results
- SuperMatrix is competitive with GotoBLAS and MKL
- Incremental pivoting ramps up in performance faster, but partial pivoting provides better asymptotic performance
- The Linux and Windows platforms attain similar performance curves
41. Performance
- Target Architecture: Windows and Linux
- 4-socket 2.66 GHz Intel Dunnington, 24 cores
- Windows Server 2008 R2 Enterprise
- Red Hat 4.1.2-46
- 16 MB shared L3 cache per socket
- OpenMP
- Intel compiler 11.1
- BLAS
- Intel MKL 11.1, 10.2
42-51. Performance
- [Graphs: performance results, not transcribed]
52. Outline
- LU Factorization with Partial Pivoting
- Algorithm-by-Blocks
- SuperMatrix Runtime System
- Performance
- Conclusion
- P A = L U
53. Conclusion
- Separation of Concerns
- Programmability
- Allows us to experiment with different scheduling algorithms
54. Acknowledgements
- Andrew Chapman, Robert van de Geijn
- I thank the other members of the FLAME team for their support
- Funding
- Microsoft
- NSF grants
- CCF-0540926
- CCF-0702714
55. Conclusion
- More Information
- http://www.cs.utexas.edu/flame
- Questions?
- echan@cs.utexas.edu