Runtime Data Flow Scheduling of Matrix Computations

1
Runtime Data Flow Scheduling of Matrix
Computations
  • Ernie Chan

2
Motivation
  • Solving Linear Systems
  • Solve for x:
  • A x = b
  • Factorize A: O(n³)
  • P A = L U
  • Forward and backward substitution: O(n²)
  • L y = P b
  • U x = y
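
The two phases above map directly onto standard LAPACK routines. As a minimal sketch (not from the slides, assuming the conventional Fortran LAPACK interface), dgetrf computes P A = L U and dgetrs performs the forward and backward substitutions:

  #include <stdlib.h>

  /* Fortran LAPACK routines: factorization and triangular solves. */
  extern void dgetrf_( int *m, int *n, double *A, int *lda,
                       int *ipiv, int *info );
  extern void dgetrs_( char *trans, int *n, int *nrhs, double *A, int *lda,
                       int *ipiv, double *b, int *ldb, int *info );

  /* Solve A x = b: O(n^3) factorization, then O(n^2) substitutions.
     A is overwritten with its LU factors and b with the solution x. */
  int solve( int n, double *A, double *b )
  {
    int *ipiv = malloc( n * sizeof( int ) );  /* row interchanges (P)  */
    int info, nrhs = 1;
    dgetrf_( &n, &n, A, &n, ipiv, &info );    /* P A = L U             */
    if ( info == 0 )                          /* L y = P b;  U x = y   */
      dgetrs_( "N", &n, &nrhs, A, &n, ipiv, b, &n, &info );
    free( ipiv );
    return info;
  }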

3
Goals
  • Programmability
  • Use tools provided by FLAME
  • Parallelism
  • Directed acyclic graph (DAG) scheduling

4
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

5
LU Factorization with Partial Pivoting
  • Formal Linear Algebra Methods Environment (FLAME)
  • High-level abstractions for expressing linear
    algebra algorithms
  • Application programming interfaces (APIs) for
    seamlessly implementing algorithms in code
  • Library of commonly used linear algebra
    operations in libflame

7
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

  [Figure: A partitioned into 2x2 blocks A11, A12, A21, A22]
8
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

  [Figure: LUpiv factorizes the column panel of A11 and A21]
9
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

  [Figure: PIV applies the row interchanges to A12 and A22]
10
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

  [Figure: TRSM updates A12]
11
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 1

  [Figure: GEMM updates A22]
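
The four updates of an iteration correspond to standard LAPACK/BLAS kernels. Below is a hedged sketch of one iteration (assumptions not in the slides: column-major storage, Fortran-interface LAPACK/BLAS, panel width b starting at index k, leading dimension lda); dgetrf plays the role of LUpiv, dlaswp of PIV, dtrsm of TRSM, and dgemm of GEMM:

  extern void dgetrf_( int *m, int *n, double *A, int *lda,
                       int *ipiv, int *info );
  extern void dlaswp_( int *n, double *A, int *lda, int *k1, int *k2,
                       int *ipiv, int *incx );
  extern void dtrsm_( char *side, char *uplo, char *trans, char *diag,
                      int *m, int *n, double *alpha,
                      double *A, int *lda, double *B, int *ldb );
  extern void dgemm_( char *ta, char *tb, int *m, int *n, int *k,
                      double *alpha, double *A, int *lda,
                      double *B, int *ldb, double *beta,
                      double *C, int *ldc );

  /* One iteration of blocked LU with partial pivoting via LAPACK/BLAS. */
  void lu_iteration( int n, int k, int b, double *A, int lda, int *ipiv )
  {
    int    m = n - k;        /* rows of the current panel    */
    int    t = n - k - b;    /* order of the trailing matrix */
    int    one = 1, info;
    double p1 = 1.0, m1 = -1.0;
    double *A11 = &A[  k      +  k      * lda ];
    double *A12 = &A[  k      + (k + b) * lda ];
    double *A21 = &A[ (k + b) +  k      * lda ];
    double *A22 = &A[ (k + b) + (k + b) * lda ];

    dgetrf_( &m, &b, A11, &lda, &ipiv[k], &info );      /* LUpiv of [A11; A21] */
    dlaswp_( &t, A12, &lda, &one, &b, &ipiv[k], &one ); /* PIV on [A12; A22]   */
    /* In iterations after the first, the same interchanges are also
       applied to the blocks left of the panel (cf. iteration 2).      */
    dtrsm_( "L", "L", "N", "U", &b, &t, &p1,
            A11, &lda, A12, &lda );                     /* TRSM: A12 update    */
    dgemm_( "N", "N", &t, &t, &b, &m1, A21, &lda,
            A12, &lda, &p1, A22, &lda );                /* GEMM: A22 update    */
  }
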
12
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

  [Figure: A repartitioned into 3x3 blocks A00 through A22]
13
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

  [Figure: LUpiv factorizes the column panel of A11 and A21]
14
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

  [Figure: PIV applies the row interchanges to the blocks left (A10, A20) and right (A12, A22) of the panel]
15
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

  [Figure: TRSM updates A12]
16
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 2

  [Figure: GEMM updates A22]
17
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

  [Figure: repartitioning for the final iteration; A11 is the last diagonal block]
18
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

  [Figure: LUpiv factorizes the final block A11]
19
LU Factorization with Partial Pivoting
  • Blocked Algorithm
  • Iteration 3

  [Figure: PIV applies the row interchanges to A10]
20
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

21
Algorithm-by-Blocks
  • FLASH
  • Storage-by-blocks: the matrix is stored hierarchically as a matrix of
    submatrix blocks (see the sketch below)
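
As a rough illustration of storage-by-blocks (hypothetical types and names, not the FLASH API), the matrix can be kept as a grid of pointers to contiguous tiles, so each task reads or writes whole blocks:

  #include <stdlib.h>

  /* Hypothetical storage-by-blocks layout: an mb x nb grid of pointers,
     each referring to a contiguous b x b tile. Tasks such as LUpiv,
     TRSM, and GEMM then operate on whole tiles.                        */
  typedef struct {
    int      mb, nb;    /* number of block rows and columns */
    int      b;         /* block (tile) dimension           */
    double **blocks;    /* blocks[i * nb + j] is tile (i,j) */
  } BlockedMatrix;

  BlockedMatrix *blocked_create( int mb, int nb, int b )
  {
    BlockedMatrix *M = malloc( sizeof( *M ) );
    M->mb = mb;  M->nb = nb;  M->b = b;
    M->blocks = malloc( mb * nb * sizeof( double * ) );
    for ( int i = 0; i < mb * nb; i++ )
      M->blocks[i] = calloc( (size_t) b * b, sizeof( double ) );
    return M;
  }

Because each tile is a separate contiguous allocation, a task touches identifiable units of data, which is what lets a runtime track dependencies block by block.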

22
  FLA_Part_2x2( A,    &ATL, &ATR,
                      &ABL, &ABR,     0, 0, FLA_TL );
  FLA_Part_2x1( p,    &pT,
                      &pB,            0, FLA_TOP );

  while ( FLA_Obj_length( ATL ) < FLA_Obj_length( A ) )
  {
    FLA_Repart_2x2_to_3x3( ATL, /**/ ATR,      &A00, /**/ &A01, &A02,
                        /* ************** */  /* ********************** */
                                               &A10, /**/ &A11, &A12,
                           ABL, /**/ ABR,      &A20, /**/ &A21, &A22,
                           1, 1, FLA_BR );
    FLA_Repart_2x1_to_3x1( pT,       &p0,
                        /* ** */    /* ** */
                                     &p1,
                           pB,       &p2,      1, FLA_BOTTOM );
    /*------------------------------------------------------------*/

    FLA_Merge_2x1( A11,
                   A21,    &AB1 );

    FLASH_LU_piv( AB1, p1 );

    /* ... remaining updates of the loop body, then
       FLA_Cont_with_3x3_to_2x2 and FLA_Cont_with_3x1_to_2x1 ... */
  }

23
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 1

  [Figure: iteration 1 tasks on the 3x3 blocked matrix: LUpiv0 factorizes the first column panel; PIV1 and PIV2 apply its row interchanges; TRSM3 and TRSM4 update the first block row; GEMM5 through GEMM8 update the trailing 2x2 blocks]
24
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 2

  [Figure: iteration 2 tasks: LUpiv9 factorizes the second panel; PIV10 and PIV11 apply its row interchanges; TRSM12 and GEMM13 update the trailing block]
25
Algorithm-by-Blocks
  • LU Factorization with Partial Pivoting
  • Iteration 3

  [Figure: iteration 3 tasks: LUpiv14 factorizes the final block; PIV15 and PIV16 apply its row interchanges]
26
  [Figure: the DAG over all tasks of the 3x3 blocked LU: LUpiv0; PIV1, PIV2; TRSM3, TRSM4; GEMM5 through GEMM8; LUpiv9; PIV10, PIV11; TRSM12; GEMM13; LUpiv14; PIV15, PIV16]
27
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

28
SuperMatrix Runtime System
  • Separation of Concerns
  • Analyzer
  • Decomposes subproblems into component tasks
  • Stores tasks sequentially in a global task queue
  • Internally calculates all dependencies between tasks, using only the
    input and output parameters of each task; the dependencies form a
    directed acyclic graph (DAG) (see the sketch below)
  • Dispatcher
  • Spawns threads
  • Schedules and dispatches tasks to threads in parallel
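
To make the dependence analysis concrete, here is a minimal sketch (hypothetical structures and names, not the SuperMatrix internals) in which each newly enqueued task depends on any earlier task that overwrites a block the new task reads or writes; anti-dependencies (write-after-read) and bounds checks are omitted for brevity:

  #define MAX_TASKS 256
  #define MAX_DEPS   16

  typedef struct Task {
    const char  *name;             /* e.g. "GEMM5"                    */
    void        *in[2], *out;      /* blocks read / block overwritten */
    struct Task *deps[MAX_DEPS];   /* predecessor tasks in the DAG    */
    int          ndeps;
  } Task;

  static Task tasks[MAX_TASKS];    /* global task queue               */
  static int  ntasks = 0;

  /* Analyzer: tasks arrive sequentially; flow (read-after-write) and
     output (write-after-write) dependencies are found by scanning the
     earlier tasks for writes to the same blocks.                      */
  void enqueue( const char *name, void *in0, void *in1, void *out )
  {
    Task *t = &tasks[ntasks++];
    t->name  = name;
    t->in[0] = in0;  t->in[1] = in1;  t->out = out;
    t->ndeps = 0;
    for ( int i = 0; i < ntasks - 1; i++ ) {
      Task *s = &tasks[i];
      if ( s->out == out || s->out == in0 || s->out == in1 )
        t->deps[t->ndeps++] = s;
    }
  }

Enqueuing GEMM5 after TRSM3, for instance, records a dependence because TRSM3 overwrites a block that GEMM5 reads.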

29
SuperMatrix Runtime System
  • Dispatcher: Single Queue
  • Set of all ready and available tasks
  • FIFO or priority ordering (see the sketch below)

  [Figure: a single task queue feeding processing elements PE0, PE1, ..., PEp-1]
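
A self-contained sketch of the dispatch loop each processing element could run (hypothetical and heavily simplified: one predecessor per task, busy-waiting, FIFO scan order):

  #include <pthread.h>
  #include <stdio.h>

  /* Simplified task record: dep is the index of one predecessor
     (-1 if none); the real runtime allows several per task.      */
  enum { NOTREADY, RUNNING, DONE };
  typedef struct { const char *name; int dep; int state; } Task;

  static Task tasks[] = { { "LUpiv0", -1 }, { "PIV1", 0 }, { "TRSM3", 1 } };
  static const int ntasks = sizeof( tasks ) / sizeof( tasks[0] );
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  /* Each processing element runs this loop: claim the first ready task
     under a lock, execute it, and mark it done so dependents unblock.  */
  void *dispatcher( void *arg )
  {
    for ( ;; ) {
      Task *t = NULL;
      int   remaining = 0;
      pthread_mutex_lock( &lock );
      for ( int i = 0; i < ntasks; i++ )
        if ( tasks[i].state != DONE ) remaining++;
      for ( int i = 0; i < ntasks && t == NULL; i++ )
        if ( tasks[i].state == NOTREADY &&
             ( tasks[i].dep < 0 || tasks[tasks[i].dep].state == DONE ) ) {
          t = &tasks[i];
          t->state = RUNNING;
        }
      pthread_mutex_unlock( &lock );
      if ( remaining == 0 ) return NULL;   /* all tasks executed        */
      if ( t == NULL ) continue;           /* nothing ready yet: spin   */
      printf( "executing %s\n", t->name ); /* would run LUpiv/TRSM/...  */
      pthread_mutex_lock( &lock );
      t->state = DONE;                     /* releases dependent tasks  */
      pthread_mutex_unlock( &lock );
    }
  }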
30
  [Figure: the task DAG from slide 26]
31
SuperMatrix Runtime System
  • Lookahead
  • Schedule the GEMM5 and GEMM6 tasks first so that LUpiv9 can be
    computed ahead, in parallel with GEMM7 and GEMM8
  • Traditionally implemented directly within the code, which increases
    complexity and detracts from programmability
  • Used, for example, by High-Performance LINPACK

32
SuperMatrix Runtime System
  • Scheduling
  • Sorting the ready tasks by the height of each task in the DAG
    mimics lookahead (see the sketch below)
  • Multiple queues
  • Data affinity
  • Work stealing
  • Macroblocks
  • Tasks that overwrite more than one block at a time
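
One way to realize the height-based priority (hypothetical structures, not the SuperMatrix code): define a task's height as the longest chain of dependents below it in the DAG, computed once and then used to sort the ready queue in decreasing order, so the critical path (the LUpiv panel factorizations) stays ahead:

  #define MAX_SUCC 8

  typedef struct Task {
    struct Task *succ[MAX_SUCC];   /* tasks that depend on this one */
    int          nsucc;
    int          height;           /* memoized; must be initialized
                                      to -1 before the first call   */
  } Task;

  /* Height = length of the longest path of dependents below the task. */
  int height( Task *t )
  {
    if ( t->height >= 0 ) return t->height;
    int h = 0;
    for ( int i = 0; i < t->nsucc; i++ ) {
      int hi = 1 + height( t->succ[i] );
      if ( hi > h ) h = hi;
    }
    return t->height = h;
  }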

33
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A L U

34
Performance
  • Implementations
  • SuperMatrix with serial BLAS
  • Partial and incremental pivoting
  • LAPACK dgetrf with multithreaded BLAS
  • Multithreaded dgetrf
  • Multithreaded dgemm
  • Double-precision real floating-point arithmetic
  • Block size tuned for each problem size

35
Performance
  • Target Architecture: Linux
  • 4-socket 2.3 GHz AMD Opteron Quad-Core
  • ranger.tacc.utexas.edu
  • 3936 SMP nodes
  • 16 cores per node
  • 2 MB shared L3 cache per socket
  • OpenMP
  • Intel compiler 10.1
  • BLAS
  • GotoBLAS2 1.00

36
Performance
  [Slides 36 and 37: performance graphs for this platform]
38
Performance
  • Target Architecture: Windows
  • 4-socket 2.4 GHz Intel Xeon E7330 Quad-Core
  • Windows Server 2008 R2 Enterprise
  • 16-core UMA machine
  • Two 3 MB shared L2 caches per socket
  • OpenMP
  • Microsoft Visual C++ 2010
  • BLAS
  • Intel MKL 10.2

39
Performance
  [Slide 39: performance graphs for this platform]
40
Performance
  • Results
  • SuperMatrix is competitive with GotoBLAS and MKL
  • Incremental pivoting ramps up in performance
    faster but partial pivoting provides better
    asymptotic performance
  • Linux and Windows platforms attain similar
    performance curves

41
Performance
  • Target Architecture: Windows and Linux
  • 4-socket 2.66 GHz Intel Dunnington, 24 cores total
  • Windows Server 2008 R2 Enterprise
  • Red Hat 4.1.2-46
  • 16 MB shared L3 cache per socket
  • OpenMP
  • Intel compiler 11.1
  • BLAS
  • Intel MKL 11.1, 10.2

42
Performance
  [Slides 42 through 51: performance graphs on this platform]
52
Outline
  • LU Factorization with Partial Pivoting
  • Algorithm-by-Blocks
  • SuperMatrix Runtime System
  • Performance
  • Conclusion
  • P A = L U

53
Conclusion
  • Separation of Concerns
  • Programmability
  • Allows us to experiment with different scheduling
    algorithms

54
Acknowledgements
  • Andrew Chapman, Robert van de Geijn
  • I thank the other members of the FLAME team for
    their support
  • Funding
  • Microsoft
  • NSF grants
  • CCF-0540926
  • CCF-0702714

55
Conclusion
  • More Information
  • http://www.cs.utexas.edu/flame
  • Questions?
  • echan@cs.utexas.edu