Title: Optimizing Matrix Multiplication with a Classifier Learning System
1. Optimizing Matrix Multiplication with a Classifier Learning System
- Xiaoming Li (presenter)
- María Jesús Garzarán
- University of Illinois at Urbana-Champaign
2. Tuning library for recursive matrix multiplication
- Uses cache-aware algorithms that take architectural features into account
  - Memory hierarchy
  - Register file
- Takes input characteristics into account
  - Matrix sizes
- The tuning process is automatic.
3. Recursive Matrix Partitioning
- Previous approaches
  - Multiple recursive steps
  - Only divide by half
- [Figure: matrices A and B, undivided]
4. Recursive Matrix Partitioning
- [Figure: matrices A and B after Step 1 of divide-by-half]
5. Recursive Matrix Partitioning
- [Figure: matrices A and B after Step 2 of divide-by-half]
6. Recursive Matrix Partitioning
- Our approach is more general
  - No need to divide by half
  - May reach the same partition in a single step
  - Faster and more general
- [Figure: matrices A and B after Step 1]
7. Our approach
- A general framework that describes a family of recursive matrix multiplication algorithms; given the input dimensions of the matrices, we determine
  - The number of partition levels
  - How to partition at each level
- An intelligent search method based on a classifier learning system
  - Searches for the best partitioning strategy in a huge search space
8. Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
9. Recursive layout framework
- Row-major layout of an 8 x 8 matrix:
 1  2  3  4  5  6  7  8
 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24
25 26 27 28 29 30 31 32
33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48
49 50 51 52 53 54 55 56
57 58 59 60 61 62 63 64
- Multiple levels of recursion
- Takes the cache hierarchy into account
10-12. Recursive layout in our framework
- [Animation over the same matrix: it is divided recursively into quadrants (first level: four 4 x 4 quadrants, second level: 2 x 2 blocks)]
- Multiple levels of recursion
- Takes the cache hierarchy into account
13. Recursive layout framework
- The same matrix stored in recursive layout (elements renumbered in storage order):
 1  2  5  6 17 18 21 22
 3  4  7  8 19 20 23 24
 9 10 13 14 25 26 29 30
11 12 15 16 27 28 31 32
33 34 37 38 49 50 53 54
35 36 39 40 51 52 55 56
41 42 45 46 57 58 61 62
43 44 47 48 59 60 63 64
- Multiple levels of recursion
- Takes the cache hierarchy into account
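The renumbering above can be reproduced programmatically. Below is a minimal sketch (not the authors' actual code) that computes the storage index of element (i, j) under a divide-by-half recursive layout with 2 x 2 row-major leaves:

```python
def recursive_index(i, j, n=8, leaf=2):
    """Storage index of element (i, j) of an n x n matrix laid out
    recursively: halve until blocks are leaf x leaf, visit quadrants
    in TL, TR, BL, BR order, and store leaves row-major."""
    idx = 0
    while n > leaf:
        h = n // 2
        quadrant = (i >= h) * 2 + (j >= h)  # 0=TL, 1=TR, 2=BL, 3=BR
        idx += quadrant * h * h             # skip earlier quadrants
        i, j, n = i % h, j % h, h
    return idx + i * leaf + j

# First row of the 8 x 8 example (1-based labels as on the slide):
row0 = [recursive_index(0, j) + 1 for j in range(8)]
# row0 == [1, 2, 5, 6, 17, 18, 21, 22]
```

With two levels of halving this reproduces the slide's layout exactly; deeper recursion or other leaf sizes only change the loop bound.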
14. Padding
- Necessary when the partition factor is not a divisor of the matrix dimension.
- Example: divide a dimension of 2000 by 3.
15. Padding
- 2000 is not divisible by 3: pad to 2001, giving tiles of size 667.
16. Padding
- Next level: divide the 667-wide tiles by 4; 667 is not divisible by 4.
17. Padding
- Padding so both levels divide evenly grows the dimension to 2004 and the first-level tiles to 668 (668 = 4 x 167).
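The arithmetic behind these numbers can be sketched as follows (an illustrative helper, not the tool's actual code): pad the dimension to the next multiple of the product of all partition factors, so that every recursion level divides evenly.

```python
import math

def padded_size(n, factors):
    """Pad dimension n up to the next multiple of the product of the
    per-level partition factors, so each level divides evenly."""
    p = math.prod(factors)
    return math.ceil(n / p) * p

padded_size(2000, (3,))    # 2001 -> level-1 tiles of 667
padded_size(2000, (3, 4))  # 2004 -> level-1 tiles of 668, level-2 tiles of 167
```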
18. Recursive layout in our framework
- Multiple levels of recursion
- Supports the cache hierarchy
- Square tiles → rectangular tiles
  - Fit non-square matrices
19. Recursive layout in our framework
- [Figure: matrix with dimensions 8 and 9]
20. Recursive layout in our framework
- [Figure: the matrix padded to dimensions 8 and 10]
21. Recursive layout in our framework
- [Figure: a rectangular 4 x 3 tile]
22. Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
23. Two methods to partition matrices
- Partition by Block (PB)
  - Specify the size of each tile
  - Example
    - Dimensions (M, N, K) = (100, 100, 40)
    - Tile size (bm, bn, bk) = (50, 50, 20)
    - Partition factors (pm, pn, pk) = (2, 2, 2)
  - Tiles need not be square
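In code, the PB primitive amounts to deriving the partition factors from the requested tile size (a sketch with a hypothetical helper name; rounding up leaves room for padding when the tile does not divide the dimension):

```python
import math

def partition_by_block(dims, tile):
    """Partition factors = number of tiles along each dimension."""
    return tuple(math.ceil(d / t) for d, t in zip(dims, tile))

partition_by_block((100, 100, 40), (50, 50, 20))  # (2, 2, 2)
```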
24. Two methods to partition matrices
- Partition by Size (PS)
  - Specify the maximum size of the three tiles
  - Keep the ratios between dimensions constant
  - Example
    - (M, N, K) = (100, 100, 50)
    - Maximum tile size for M, N = 1250
    - (pm, pn, pk) = (2, 2, 1)
  - Generalization of the divide-by-half approach
    - Tile size = 1/4 of the matrix size
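As a quick check of the divide-by-half claim (illustrative arithmetic only; the slide does not spell out its exact size metric): with partition factors (2, 2), each tile holds exactly a quarter of the matrix's elements.

```python
# Divide-by-half as a special case of Partition-by-Size:
M, N = 100, 100                    # dimensions from the example
pm, pn = 2, 2                      # divide each dimension in half
tile_elems = (M // pm) * (N // pn)
assert tile_elems == (M * N) // 4  # tile size = 1/4 of matrix size
```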
25. Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
26. Classifier Learning System
- Uses the two partition primitives to determine how the input matrices are partitioned
  - Determines the partition factors at each level:
    f(M, N, K) → (pm_i, pn_i, pk_i), i = 0, 1, 2 (only three levels are considered)
- The partition factors depend on the matrix size
  - E.g., the partition factors of a 1000 x 1000 matrix should differ from those of a 50 x 1000 matrix.
- The partition factors also depend on architectural characteristics, such as cache size.
27. Determining the best partition factors
- The search space is huge → exhaustive search is infeasible
- Our proposal: use a multi-step classifier learning system
  - Builds a table that, given the matrix dimensions, determines the partition factors
28. Classifier Learning System
- The result of the classifier learning system is a table with two columns
  - Column 1 (Pattern): a string of 0, 1, and * (don't-care) that encodes the dimensions of the matrices
  - Column 2 (Action): the partition method for one step
    - Built from the partition-by-block and partition-by-size primitives with different parameters.
29-36. Learning with the Classifier System
- Rule table (5 bits per dimension; * is a don't-care):

Pattern            Action
(10***, 11***)     PS 100
(010**, 011**)     PB (4, 4)

- [Animation: a matrix with dimensions 24 and 16 matches the first rule (24 = 11000, 16 = 10000) and is partitioned into tiles with dimensions 12 and 8; those tiles match the second rule (12 = 01100, 8 = 01000) and are partitioned into 4 x 4 tiles.]
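A minimal sketch of how such a table could be consulted (function names and the two rules are illustrative, mirroring the slide; '*' is a don't-care bit):

```python
def encode(n, bits=5):
    """Fixed-width binary encoding of a dimension (5 bits per dim)."""
    return format(n, f"0{bits}b")

def matches(pattern, bitstring):
    """'*' matches any bit; 0/1 must match exactly."""
    return all(p in ("*", b) for p, b in zip(pattern, bitstring))

# Illustrative two-rule table in the spirit of the slide:
table = [
    (("10***", "11***"), ("PS", 100)),
    (("010**", "011**"), ("PB", (4, 4))),
]

def lookup(dims, table):
    """Return the action of the first rule whose patterns all match."""
    for patterns, action in table:
        if all(matches(p, encode(d)) for p, d in zip(patterns, dims)):
            return action
    return None

lookup((16, 24), table)  # ('PS', 100): 16 = 10000, 24 = 11000
lookup((8, 12), table)   # ('PB', (4, 4))
```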
37. How does the classifier learning algorithm work?
- Changes the table based on performance and accuracy feedback from previous runs.
- Mutates the condition part of the table to adjust the range of matching matrix dimensions.
- Mutates the action part to find the best partition method for the matching matrices.
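One way the condition mutation could look (purely illustrative; the system's actual operators may differ): generalizing a concrete bit to '*' widens the range of matrix sizes a rule matches, while specializing a '*' narrows it.

```python
import random

def mutate_condition(pattern, rate=0.2):
    """Randomly generalize bits to '*' (widening the matching range)
    or specialize '*' back to a concrete bit (narrowing it)."""
    out = []
    for c in pattern:
        if random.random() < rate:
            out.append("*" if c in "01" else random.choice("01"))
        else:
            out.append(c)
    return "".join(out)

mutate_condition("10***")  # e.g. '1****' or '10*1*'
```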
38. Outline
- Background
- Partition Methods
- Classifier Learning System
- Experimental Results
39. Experimental Results
- Experiments on three platforms
  - Sun UltraSPARC III
  - Intel Pentium 4 Xeon
  - Intel Itanium 2
- Matrices of sizes from 1000 x 1000 to 5000 x 5000
40. Algorithms
- Classifier MMM: our approach
  - Includes the overhead of copying into and out of the recursive layout
- ATLAS: library generated by ATLAS using the search procedure, without hand-written codes
  - Has some type of blocking for L2
- L1: one level of tiling
  - Tile size is the same as ATLAS's L1 tile
- L2: two levels of tiling
  - L1 tile and L2 tile are the same as ATLAS's L1 tile
41-42. [Performance result figures; no transcript available]
43. Conclusion and Future Work
- Preliminary results demonstrate the effectiveness of our approach
  - Sun UltraSPARC III and Xeon: 18% and 5% improvement, respectively
  - Itanium 2: 14% slowdown
- Need to improve the padding mechanism
  - Reduce the amount of padding
  - Avoid unnecessary computation on the padding
44. Thank you!