Title: Matrix Multiply Statistics using the Cyclops64 Simulator
1 Matrix Multiply Statistics using the Cyclops64 Simulator
2 Introduction / Overview
- Cyclops64 Software Infrastructure
- C Compiler
- Thread Library
- Execution Control Tools (Linker/Loader, Debugger, etc.)
- Simulator
- Focus: C64 Simulator
- Basic knowledge of Input Commands / Simulator Output
- Case Study: (Matrix) x (Matrix) C programs
3 Cyclops64 Simulator Basics
- Data Needed
- Statistics Information (number of cycles, number of instructions, frequency of each type of instruction)
- Program Execution Trace
- Sample Compile / Simulation
4 Cyclops64 Simulator Basics
1896 00000C50 11FE8010 LDW R7,R62,64
1897 00000C54 11BE8011 LDW R6,R62,68
1898 00000C58 F18718A8 ADDW R6,R7,R6
1901 00000C5C 21BE8012 STW R6,R62,72
1902 00000C60 11FE8010 LDW R7,R62,64
1903 00000C64 11BE8011 LDW R6,R62,68
1904 00000C68 F18718B8 SUBW R6,R7,R6
1907 00000C6C 21BE8013 STW R6,R62,76
1908 00000C70 11FE8010 LDW R7,R62,64
1909 00000C74 11BE8011 LDW R6,R62,68
1910 00000C78 F187180C MULS R6,R7,R6
1913 00000C7C 21BE8014 STW R6,R62,80
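The per-instruction frequencies mentioned under "Statistics Information" could be gathered from a trace like the one above. The following is a minimal sketch, assuming each trace record has the form "counter address encoding mnemonic operands"; the interpretation of the first column as a cycle count, the struct, and the function names parse_trace_line and count_mnemonic are illustrative assumptions, not part of the simulator.

```c
#include <stdio.h>
#include <string.h>

/* One decoded trace record. The field meanings are an assumption based
   on the layout of the listing, not on simulator documentation. */
struct trace_entry {
    unsigned long cycle;     /* first column, assumed to be a cycle count */
    unsigned long address;   /* second column, hexadecimal address */
    unsigned long encoding;  /* third column, hexadecimal instruction word */
    char mnemonic[16];       /* fourth column, e.g. "LDW" */
};

/* Parse one trace record; returns 1 on success, 0 on failure. */
int parse_trace_line(const char *line, struct trace_entry *out)
{
    return sscanf(line, "%lu %lx %lx %15s",
                  &out->cycle, &out->address, &out->encoding,
                  out->mnemonic) == 4;
}

/* Count how many of the n records carry the given mnemonic. */
int count_mnemonic(const char *const lines[], int n, const char *mnemonic)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        struct trace_entry e;
        if (parse_trace_line(lines[i], &e) &&
            strcmp(e.mnemonic, mnemonic) == 0)
            count++;
    }
    return count;
}
```

Applied to the twelve records in the listing, a count like this would report six LDW, three STW, and one each of ADDW, SUBW, and MULS.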
5 Matrix Multiply Programs
- Original Matrix Multiply Source Code
- Run at O0, O1, O2, O3 optimization levels
- Variable Update
- Loop Unrolling
- Unroll 4 times, 8 times
- Software Cache
- Share Cache, Split Cache, Cache Line
- Blocking, Loop Interchange
- Use old Memcpy function, new Memcpy (LDM)
- Multithreaded version
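The slides do not include the source of these variants, so the following is only a sketch of what three of them might look like: a baseline triple loop, an inner loop unrolled 4 times with the sum kept in a local variable, and a blocked version that also interchanges the loops. The function names, the use of double, and the row-major layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Baseline: c = a * b for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            c[i * n + j] = 0.0;
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
}

/* Inner loop unrolled 4 times, accumulating in a local variable.
   A remainder loop handles n not divisible by 4. */
void matmul_unroll4(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            size_t k = 0;
            for (; k + 4 <= n; k += 4) {
                sum += a[i * n + k]     * b[k * n + j];
                sum += a[i * n + k + 1] * b[(k + 1) * n + j];
                sum += a[i * n + k + 2] * b[(k + 2) * n + j];
                sum += a[i * n + k + 3] * b[(k + 3) * n + j];
            }
            for (; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Blocked version with bs x bs tiles and an i-k-j loop order inside
   each tile, so the innermost loop walks rows of b and c contiguously. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;
    for (size_t ii = 0; ii < n; ii += bs)
        for (size_t kk = 0; kk < n; kk += bs)
            for (size_t jj = 0; jj < n; jj += bs)
                for (size_t i = ii; i < ii + bs && i < n; i++)
                    for (size_t k = kk; k < kk + bs && k < n; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < jj + bs && j < n; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

All three compute the same product; they differ only in how much loop overhead and memory traffic each multiply-add incurs, which is what the simulator statistics are meant to expose.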
6 Matrix Multiply Programs
- Sample Simulation (Loop Unrolling, 4 Times)
- Statistics Calculations
- Execution Time: 4,727,243 cycles / (500,000,000 cycles/sec) = 0.00945 seconds = 9.45 ms
- MIPS: 3,064,268 instructions / 0.00945 seconds = 324,107,299 instructions/sec = 324 MIPS
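The two calculations above can be written as small helper functions. This is a sketch using the 500,000,000 cycles/sec clock rate from the slide; the names C64_CLOCK_HZ, exec_time_ms, and mips are hypothetical and chosen only for this example.

```c
/* Clock rate taken from the slide: 500,000,000 cycles per second. */
#define C64_CLOCK_HZ 500000000.0

/* Execution time in milliseconds for a given simulated cycle count. */
double exec_time_ms(unsigned long cycles)
{
    return (double)cycles / C64_CLOCK_HZ * 1000.0;
}

/* Millions of instructions per second for a run with the given
   instruction and cycle counts. */
double mips(unsigned long instructions, unsigned long cycles)
{
    double seconds = (double)cycles / C64_CLOCK_HZ;
    return (double)instructions / seconds / 1e6;
}
```

For the loop-unrolling run, exec_time_ms(4727243) gives about 9.45 and mips(3064268, 4727243) gives about 324, in line with the figures on the slide.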
7 Matrix Multiply Results
Program                            Instructions  Cycles    MIPS  ms     % float
Original Code (-O0 Optimization)   22988512      44038638  261   88.07  2.28
Original Code (-O1 Optimization)   6956179       15615387  223   31.23  3.77
Original Code (-O2 Optimization)   6632543       13710695  242   27.42  3.95
Original Code (-O3 Optimization)   5862220       12166226  241   24.33  4.47
Variable Update                    3457481       8462795   204   16.92  7.58
Loop Unrolling (4 Times)           3064268       4727243   324   9.45   8.55
Loop Unrolling (8 Times)           3547617       10429218  170   20.85  7.39
Software Cache (Share)             15404266      27033258  285   54.06  1.70
Software Cache (Split)             15275584      26815883  285   53.63  1.72
Software Cache (Cache Line)        19469407      28907683  337   57.81  1.35
Blocking / Unrolling (MEMCPY)      1351970       1751386   385   3.50   20.00
Blocking / Unrolling (Using LDM)   1316405       1442606   456   2.88   20.54
Multithreaded Version              1316857       721803    912   1.44   20.53
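One way to compare rows of the results is the ratio of cycle counts against a chosen baseline. The helper below is a sketch of that calculation; the name speedup and the choice of the -O0 run as baseline are assumptions for illustration, not something the slides define.

```c
/* Speedup of a variant over a baseline, as a ratio of cycle counts.
   A value above 1.0 means the variant needs fewer cycles. */
double speedup(unsigned long baseline_cycles, unsigned long variant_cycles)
{
    return (double)baseline_cycles / (double)variant_cycles;
}
```

With the -O0 run as baseline, speedup(44038638, 721803) comes out at roughly 61 for the multithreaded version.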
8 Future Work
- Speed Up
- Examine source code, find weaknesses / bottlenecks
- Parallelize for C64 Architecture