Title: Compiling for VIRAM
1. Compiling for VIRAM
- Dave Judd
- Kathy Yelick
- Rich Fromm
- David Martin
- Computer Science Division
- UC Berkeley
2. VIRAM/Cray Compiler (vcc)
- VIRAM/Cray vectorizing compiler
- Production compiler
- Used on the T90, C90, as well as the T3D and T3E
- Being ported (by SGI/Cray) to the SV2 architecture
- Has C, C++, and Fortran front-ends (focus on C)
- Extensive vectorization capability
- VIRAM code generator based on new SV2 code generator
- SV2 code gen being developed in parallel w/ VIRAM
- SV2 vector architecture similar to VIRAM
- SV2 scalar similar to T90 with more registers
3. VIRAM/Cray Compiler Status
[Figure: vcc structure. Frontends (C, C++, Fortran95) feed the PDGCS optimizer, which feeds code generators for T3D/T3E, C90/T90, and SV2/VIRAM.]
- MIPS backend developed by last retreat
- Compiled code for a commercial test suite
- VIRAM vector support developed since last retreat
- Compiles and executes a commercial test suite
- Compiles and executes a Cray vector test suite
- Remaining issues with scheduler and memory
consistency model
4. Codegen/optimizer issues for VIRAM
- Variable virtual processor width (VPW)
- Variable vector register length (MVL)
- Vector flag registers treated as 1-bit-wide vector registers
- Multiple base, incr, and stride registers; autoincrement
- Fixed point arithmetic (saturating add, etc.)
- Memory Consistency
5. Vector Architectural State
- Number of VPs given by the vector length register vl
- Width of each VP given by the register vpw
- vpw is one of 8b, 16b, 32b, 64b
- Maximum vector length is given by a read-only register mvl
- mvl depends on implementation and vpw
- mvl in VIRAM-1: N/A (8b), 128 (16b), 64 (32b), 32 (64b)
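The VIRAM-1 mvl values follow one pattern: at every supported vpw, a vector register holds the same number of bits. This is arithmetic on the listed values rather than a statement from the ISA, but it gives a compact way to derive the assumed mvl from vpw:

$$\mathrm{mvl} \times \mathrm{vpw} = 128 \times 16 = 64 \times 32 = 32 \times 64 = 2048 \text{ bits}, \qquad \mathrm{mvl} = \frac{2048}{\mathrm{vpw}}$$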
6. Generating Code for Variable VPW
- Strategy: vectorizer determines minimum correct vpw for each loop nest
- Vectorizer assumes vpw = 64 initially
- At end of vectorization, discard vectorized copy of loop if greatest width encountered is less than 64, and start vectorization over with new vpw
- Code gen checks vpw for each loop nest
- Limitation: a single loop nest will run at the speed of the widest type
- Reason: simplicity and performance of the common case
- No attempt to split/combine loops based on vpw
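The two-pass vpw choice above can be sketched as a small C function. This is a hypothetical illustration, not vcc's code: it assumes each loop nest is summarized by the bit widths of its data types, and `choose_vpw` and its `min_vpw` parameter (the narrowest width the target supports) are invented names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the per-loop-nest vpw choice.
   type_bits: bit widths of the data types used in the nest.
   min_vpw:   narrowest vpw the target supports (assumed parameter;
              16 for VIRAM-1, where the 8b setting is listed as N/A). */
static int choose_vpw(const int *type_bits, size_t ntypes, int min_vpw) {
    static const int legal[] = {8, 16, 32, 64};
    int widest = 0;

    assert(min_vpw == 8 || min_vpw == 16 || min_vpw == 32 || min_vpw == 64);

    /* The widest type in the nest sets the speed of the whole nest. */
    for (size_t i = 0; i < ntypes; i++)
        if (type_bits[i] > widest)
            widest = type_bits[i];

    /* First pass assumes vpw = 64; keep it if something needs 64 bits. */
    if (widest == 0 || widest >= 64)
        return 64;

    /* Redo: the narrowest legal vpw that still holds the widest type. */
    for (size_t k = 0; k < sizeof legal / sizeof legal[0]; k++)
        if (legal[k] >= widest && legal[k] >= min_vpw)
            return legal[k];
    return 64;
}
```

For a nest mixing 16-bit and 32-bit data, the sketch picks vpw = 32, so the 16-bit operations also run at 32-bit speed, which is the limitation noted above.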
7. Generating Code for Variable MVL
- Maximum vector length is not specified in the IRAM ISA
- However, the compiler assumes mvl at compile time
- mvl based on vpw
- mvl assumption dependent on VIRAM-1 hardware implementation
- Recompiling required for future hardware versions if mvl changes
- MVL knowledge useful for code gen and vectorizer
- register spilling
- short loop vectorization
- length-dependent vectorization (may also eliminate the safe vector length computation at run time), e.g.
    for (i = 0; i < n; i++)
        a[i] = a[i+32];
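Why knowing mvl at compile time helps length-dependent vectorization can be sketched in C. The sketch uses a hypothetical variant of the slide's example, a loop with a true dependence of distance d, `a[i] = a[i - d]`, and emulates strip-mined vector execution in scalar C: each strip loads all vl elements before storing any. The result matches the scalar loop only when every strip is at most d elements long, so a compiler that knows mvl <= d can vectorize without computing a safe vector length at run time. All function names are illustrative assumptions.

```c
#include <assert.h>

#define MAX_VL 128  /* largest VIRAM-1 mvl (vpw = 16b) */

/* Scalar reference: each iteration reads the value written d iterations earlier. */
static void scalar_loop(int *a, int n, int d) {
    for (int i = d; i < n; i++)
        a[i] = a[i - d];
}

/* Emulated strip-mined vector loop: a full vector load of vl elements
   into a temporary "register", then a full vector store. */
static void vector_loop(int *a, int n, int d, int vl_max) {
    int vreg[MAX_VL];
    assert(vl_max > 0 && vl_max <= MAX_VL);
    for (int i = d; i < n; i += vl_max) {
        int vl = (n - i < vl_max) ? n - i : vl_max;          /* set vl */
        for (int k = 0; k < vl; k++) vreg[k] = a[i + k - d]; /* vector load */
        for (int k = 0; k < vl; k++) a[i + k] = vreg[k];     /* vector store */
    }
}

/* Compile-time test possible once mvl is known: strips of at most d
   elements preserve the dependence, so no run-time check is needed. */
static int vl_is_safe(int mvl, int d) {
    return mvl <= d;
}
```

For d = 32, a strip length of 32 (VIRAM-1's mvl at vpw = 64) reproduces the scalar result, while 64 (mvl at vpw = 32) does not.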
8. Vector Flag Registers
- Vector flag (mask) register treated as a vector register
- Bit width of 1
- Flag register under control of vector length
- Can spill/reload directly to memory
- Optimizer and code gen issues to be handled correctly
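Treating a flag register as a 1-bit-wide vector register can be sketched in C: only the first vl flags are meaningful, a spill writes them to memory, and a masked operation reads them back. This is a hypothetical model; the type and function names are invented, and packing eight flags per byte is an assumed spill layout, not VIRAM's documented format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VL 128

/* A flag register: one 1-bit element per VP; only the first vl count. */
typedef struct {
    uint8_t bit[MAX_VL];
} flag_reg;

/* Vector compare: flag k = (a[k] > 0) for k < vl. */
static void vcmp_gt0(flag_reg *f, const int *a, int vl) {
    assert(vl > 0 && vl <= MAX_VL);
    for (int k = 0; k < vl; k++) f->bit[k] = (uint8_t)(a[k] > 0);
}

/* Spill: pack vl flags into ceil(vl / 8) bytes; returns the byte count. */
static int spill_flags(const flag_reg *f, int vl, uint8_t *mem) {
    int nbytes = (vl + 7) / 8;
    memset(mem, 0, (size_t)nbytes);
    for (int k = 0; k < vl; k++)
        if (f->bit[k]) mem[k / 8] |= (uint8_t)(1u << (k % 8));
    return nbytes;
}

/* Reload: unpack vl flags from memory. */
static void reload_flags(flag_reg *f, int vl, const uint8_t *mem) {
    for (int k = 0; k < vl; k++)
        f->bit[k] = (uint8_t)((mem[k / 8] >> (k % 8)) & 1u);
}

/* Masked vector move: b[k] = a[k] only where the flag is set. */
static void vmov_masked(int *b, const int *a, const flag_reg *f, int vl) {
    for (int k = 0; k < vl; k++)
        if (f->bit[k]) b[k] = a[k];
}
```

Because the flag register obeys vl, a spill only needs ceil(vl / 8) bytes under this layout rather than a full vector register's worth.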
9. Multiple Base, Incr, Stride Registers
- Dedicated registers for vector memory references
- 16 vbase, 8 vinc and 8 vstride registers
- Optional automatic increment of base register
- Vectorizer/Codegen strategy
- Changed from computing the base address each time through the loop to incrementing the base address by vl * stride * multiplier
- Define compiler temporary for each base address
- Teach codegen to assign vbase, vinc and vstride registers as needed
- Trick code gen into handling multiple results for a single vector load/store instruction
- Results in very clean vector loops: only vector instructions plus the vl computation in the inner loop
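The base-register strategy can be sketched by emulating the registers in C. This is a hypothetical model, not VIRAM's instruction semantics: `base_reg`, `vld_strided_inc`, and `vst_unit_inc` are invented names, and the increment of vl * stride * multiplier assumes the multiplier is the element size in bytes.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VL 128

/* A vbase register holding a byte address (hypothetical model). */
typedef struct {
    char *vbase;
} base_reg;

/* Strided vector load: dst[k] = element at vbase + k * stride * size,
   then the autoincrement vbase += vl * stride * size. */
static void vld_strided_inc(int *dst, base_reg *b, ptrdiff_t stride, int vl) {
    const ptrdiff_t size = (ptrdiff_t)sizeof(int);
    for (int k = 0; k < vl; k++)
        dst[k] = *(const int *)(b->vbase + k * stride * size);
    b->vbase += vl * stride * size;
}

/* Unit-stride vector store with the same kind of autoincrement. */
static void vst_unit_inc(base_reg *b, const int *src, int vl) {
    const ptrdiff_t size = (ptrdiff_t)sizeof(int);
    for (int k = 0; k < vl; k++)
        *(int *)(b->vbase + k * size) = src[k];
    b->vbase += vl * size;
}

/* y[i] = x[i * stride] for i < n, strip-mined by mvl. Apart from the vl
   computation, each strip is one load and one store; no base address is
   recomputed from i. x must hold at least n * stride ints. */
static void strided_copy(int *y, int *x, int n, ptrdiff_t stride, int mvl) {
    int vreg[MAX_VL];
    base_reg src = { (char *)x }, dst = { (char *)y };
    assert(mvl > 0 && mvl <= MAX_VL);
    for (int i = 0; i < n; i += mvl) {
        int vl = (n - i < mvl) ? n - i : mvl;   /* vl computation */
        vld_strided_inc(vreg, &src, stride, vl);
        vst_unit_inc(&dst, vreg, vl);
    }
}
```

Each memory instruction in this model updates its own base register as a second result, which is the multiple-result case the code generator had to be taught to handle.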
10. C Compiler Testing
- vector regression test suite (CRAY)
- Specifically tests for vectorization
- Compares vector and scalar results
- Easy to isolate problems
- Status
- 56 of 62 tests pass
- Some minor numerical differences
- 3 failures w/ wrong answers
- 1 failure causes assembler abort on bad
instruction (caused by vinc autoincrement
feature)
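A scalar-vs-vector comparison like the one this suite performs can be sketched in C. One plausible source of minor numerical differences (an assumption; the slide does not say what caused them) is that a vectorized reduction adds the same values in a different order, and floating-point addition is not associative. The function names and the tolerance check are illustrative, not the Cray suite's actual method.

```c
#include <assert.h>

#define MAX_VL 128

/* Scalar reference: additions performed strictly left to right. */
static double scalar_sum(const double *x, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i];
    return s;
}

/* Vector-style reduction: one partial sum per element of a vl-long
   register, combined at the end. Same operands, different order. */
static double vector_sum(const double *x, int n, int vl) {
    double part[MAX_VL] = {0.0};
    double s = 0.0;
    assert(vl > 0 && vl <= MAX_VL);
    for (int i = 0; i < n; i++) part[i % vl] += x[i];
    for (int k = 0; k < vl; k++) s += part[k];
    return s;
}

static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Accept results that agree to a relative tolerance, scaled by at least
   1.0 so that values near zero get an absolute test instead. */
static int results_match(double scalar, double vector, double rel_tol) {
    double scale = abs_d(scalar) > abs_d(vector) ? abs_d(scalar) : abs_d(vector);
    if (scale < 1.0) scale = 1.0;
    return abs_d(scalar - vector) <= rel_tol * scale;
}
```

For example, summing {1e16, 1.0, -1e16, 1.0} left to right gives 1.0 in IEEE double precision, while two partial sums give 2.0.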
11. C Compiler Testing (cont.)
- C regression test suite (industry standard suite)
- Scalar emphasis, C conformance
- All tests pass except
- errors with functions returning a structure larger than 16 bytes
- errors with long double constants (128-bit floating point)
12. What Essential Features Remain
- Finish instruction scheduler
- Implement sync strategy
- Support -n32 ABI
- Double / long double
13. Instruction Scheduler
- Instruction scheduler working, but needs
- Functional unit layout for VIRAM
- Instruction latency and busy time for VIRAM
- Support for chaining of vector instructions, including mask registers
- Scheduler responsible for sync processing
- Vector-vector sync analysis is working
- Vector-scalar and scalar-vector analysis needed
- Sync instructions currently emitted only as comments in assembler output
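The pairwise check behind sync analysis can be sketched in C. This is a hypothetical simplification, not vcc's implementation: it assumes each access is summarized by its issuing unit, its direction, and a conservative byte range, and it labels results with the SaV/VaS/VaV and RaW/WaR/WaW terms read as "later after earlier". A real compiler would also have to handle unknown or aliased ranges and ordering within a VP.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { SCALAR, VECTOR } unit_t;
typedef enum { READ, WRITE } op_t;

/* One memory access: issuing unit, direction, byte range [lo, hi). */
typedef struct {
    unit_t unit;
    op_t op;
    uintptr_t lo, hi;
} access_t;

/* Writes the sync needed between an earlier and a later access into out
   (at least 8 bytes), e.g. "SaV RaW" for a scalar read after a vector
   write; writes "" when no sync is needed. */
static void sync_needed(const access_t *earlier, const access_t *later, char *out) {
    assert(earlier != NULL && later != NULL && out != NULL);
    out[0] = '\0';
    /* Assumption: the scalar core keeps its own accesses in order. */
    if (earlier->unit == SCALAR && later->unit == SCALAR) return;
    /* Disjoint address ranges cannot conflict. */
    if (earlier->hi <= later->lo || later->hi <= earlier->lo) return;
    /* Two reads never form a hazard. */
    if (earlier->op == READ && later->op == READ) return;

    out[0] = later->unit == VECTOR ? 'V' : 'S';    /* later unit       */
    out[1] = 'a';                                  /* "after"          */
    out[2] = earlier->unit == VECTOR ? 'V' : 'S';  /* earlier unit     */
    out[3] = ' ';
    out[4] = later->op == WRITE ? 'W' : 'R';       /* hazard, e.g. RaW */
    out[5] = 'a';
    out[6] = earlier->op == WRITE ? 'W' : 'R';
    out[7] = '\0';
}
```

Vector-vector pairs fall out of the same check as "VaV" results; the scalar-vector and vector-scalar cases are the ones the slide lists as still needed.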
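14. Chaining
vadd.s vr1,vr3,vr4
vabs.s vr2,vr1
[Figure: timing diagram with chaining over cycles 0, 1, 2, ..., 7. Rows V10 to V13 (vadd.s) each show R X X X X W; rows V20 to V23 (vabs.s) each show R X X W. With chaining, vabs.s starts on elements of vr1 before vadd.s has written the whole vector.]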
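The benefit of chaining can be estimated with a simple timing model, sketched in C. The model is an assumption for illustration, not the VIRAM-1 pipeline: each vector instruction issues one group of elements per cycle across a fixed number of lanes, each group's latency is taken from the diagram's stage counts (R, four X, W = 6 cycles for vadd.s; R, two X, W = 4 for vabs.s), and the lane count is a free parameter.

```c
#include <assert.h>

/* Number of element groups: one group of `lanes` elements issues per cycle. */
static int groups(int n, int lanes) {
    assert(n > 0 && lanes > 0);
    return (n + lanes - 1) / lanes;
}

/* Without chaining, the dependent instruction issues its first group only
   after the last group of the first instruction has been written.
   Returns the cycle at which the dependent instruction's last group is written. */
static int finish_unchained(int n, int lanes, int lat1, int lat2) {
    int g = groups(n, lanes);
    int first_done = (g - 1) + lat1;
    return first_done + (g - 1) + lat2;
}

/* With chaining, group k of the dependent instruction issues as soon as
   group k of the first instruction has been written. */
static int finish_chained(int n, int lanes, int lat1, int lat2) {
    int g = groups(n, lanes);
    return (g - 1) + lat1 + lat2;
}
```

With 64 elements and an assumed 4 lanes, the model gives 40 cycles without chaining and 25 with it: chaining saves one issue cycle for every group after the first.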
15. Convert from -64 to -n32 ABI
- Pointer and long types revert to 32 bits from 64
- VIRAM tools were n32 originally
- Switched to -64 to accommodate the compiler, to match SV2
- Revert to n32 to match vector addressing hardware
- Change will allow some gather/scatter loops to execute faster (vpw = 32 instead of vpw = 64)
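Why a narrower vpw helps gather/scatter can be sketched with the VIRAM-1 mvl values listed earlier. The sketch assumes a gather/scatter loop's vpw is set by its address (pointer) width, 64 bits under -64 and 32 bits under -n32, and that the data elements are no wider than that; the function names are hypothetical.

```c
#include <assert.h>

/* VIRAM-1 mvl as listed for vpw = 16b/32b/64b (128/64/32), i.e. 2048 / vpw.
   vpw = 8b is listed as N/A, so it is rejected here. */
static int viram1_mvl(int vpw) {
    assert(vpw == 16 || vpw == 32 || vpw == 64);
    return 2048 / vpw;
}

/* Strip-mined iterations for a gather/scatter over n elements, assuming
   the index/address vector forces vpw up to the pointer width. */
static int gather_strips(int n, int pointer_bits) {
    int mvl = viram1_mvl(pointer_bits);
    return (n + mvl - 1) / mvl;
}
```

For 1,000 elements, the -64 case needs 32 strips (mvl = 32) and the -n32 case needs 16 (mvl = 64), so each gather or scatter instruction covers twice as many elements.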
16. C++
- All components being generated now
- Include files and libC library differences? (SGI vs. Cray)
- C++ testing on SV2 simulator now
- Testing/problem isolation needed
17. Fortran
- Fortran 95 frontend
- FCD (Fortran character descriptor) code gen support needed for VIRAM
- Differences between IRIX and UNICOS libraries for I/O and array intrinsics
- Testing needed
18. Other Future Compiler Features?
- VIRAM machine target
- Support for speculative execution
- Support Cray additions
- peephole optimizer
- vector loop unrolling/tiling
- Compiler extensions for fixed point hardware
19. Vector functions
- Define calling sequence conventions
- Must be coded in assembler
- Take advantage of C versions of Cray routines
- Needed for some key benchmarks?
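20. Memory consistency
[Table: ordering cases SaV (scalar after vector), VaS (vector after scalar), VaV (vector after vector), and vp (within a virtual processor), against hazard types RaW, WaR, and WaW.]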
21. VIRAM Tools
- vas assembler
- vdis disassembler
- vsim-isa simulator
- vsim-db debugger
- vsim-p performance simulator
- vsim-sync memory consistency simulator
22. vsim-sync
- Intended for debugging and optimizing syncs
- Tells you when there is a data hazard (sync needed)
- Tells you when a sync executed that didn't prevent a hazard
- The sync may not be needed, according to this dynamic execution
- But the sync may be needed on some other execution path
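The two kinds of report can be illustrated with a small trace checker. This is a hypothetical sketch, not vsim-sync's implementation: it assumes a trace of vector writes, scalar reads, and syncs with known byte ranges, and it covers only the scalar-read-after-vector-write case. As noted above, a sync it flags was unnecessary only for this execution; another path might still need it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trace events (only the SaV RaW case is modeled). */
typedef enum { EV_VEC_WRITE, EV_SCALAR_READ, EV_SYNC } ev_kind;

typedef struct {
    ev_kind kind;
    uintptr_t lo, hi;          /* byte range [lo, hi); ignored for EV_SYNC */
} event_t;

typedef struct {
    int hazards;               /* reads of a vector write no sync had ordered */
    int unneeded_syncs;        /* syncs that ordered nothing later read */
} report_t;

#define MAX_WRITES 64
#define MAX_SYNCS 64

static report_t check_trace(const event_t *ev, int n) {
    report_t r = {0, 0};
    event_t w[MAX_WRITES];     /* vector writes seen so far */
    int owner[MAX_WRITES];     /* first sync after each write, or -1 */
    int useful[MAX_SYNCS];     /* did the sync protect a later read? */
    int nw = 0, ns = 0;

    for (int i = 0; i < n; i++) {
        if (ev[i].kind == EV_VEC_WRITE) {
            assert(nw < MAX_WRITES);
            w[nw] = ev[i];
            owner[nw++] = -1;
        } else if (ev[i].kind == EV_SYNC) {
            assert(ns < MAX_SYNCS);
            useful[ns] = 0;
            for (int k = 0; k < nw; k++)
                if (owner[k] < 0) owner[k] = ns;  /* sync orders pending writes */
            ns++;
        } else {                                  /* EV_SCALAR_READ */
            int hazard = 0;
            for (int k = 0; k < nw; k++) {
                if (ev[i].lo < w[k].hi && w[k].lo < ev[i].hi) {
                    if (owner[k] < 0) hazard = 1;    /* sync needed here */
                    else useful[owner[k]] = 1;       /* that sync prevented a hazard */
                }
            }
            r.hazards += hazard;
        }
    }
    for (int s = 0; s < ns; s++)
        if (!useful[s]) r.unneeded_syncs++;       /* not needed on this run */
    return r;
}
```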
23. Summary
- vcc is a reasonably robust compiler for VIRAM
- All of the basic compiler elements are present now
- Need to prioritize remaining work
- Finish and tune scheduler
- Implement sync strategy
- C++
- Fortran
- IRAM target