Title: vcc C/C++ Compiler for VIRAM
1 vcc C/C++ Compiler for VIRAM
- Sam Williams
- CS 265
- samw_at_cs.berkeley.edu
2 Topics
- Introduction
- Simulation Methodology
- Vectorization
- Speedup
- Quality of codegen
- Instruction usage
3 Introduction
- vcc is the C/C++ compiler for VIRAM
- It quickly became evident that many features
  haven't been implemented, including
  - inlining
  - scheduling
  - loop unrolling
  - code motion
4 Simulation Methodology
- In order to separate micro-architectural
  performance from the ability of the compiler to
  take full advantage of the ISA and find potential
  parallelism, I assumed
  - The processor is a single-issue machine
  - No stalls will occur due to the number of
    functional units or bandwidth
  - All instructions take a single cycle to execute
- Thus the vsim-isa simulator could be used.
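Under these assumptions every instruction costs exactly one cycle, so a cycle count is just a dynamic instruction count and an ISA-level simulator is enough. A minimal sketch of how a speedup figure then follows; the function name and the idea of comparing a scalar run against a vectorized run of the same loop are assumptions for illustration, not something the slides state.

```c
#include <assert.h>

/* Hypothetical helper: with single issue, no stalls, and one cycle per
 * instruction, cycles == dynamic instruction count, so speedup is the
 * ratio of the two counts reported by an ISA-level simulator. */
double speedup_from_counts(unsigned long scalar_instrs,
                           unsigned long vector_instrs)
{
    assert(vector_instrs > 0);  /* avoid dividing by zero */
    return (double)scalar_instrs / (double)vector_instrs;
}
```

For example, a loop that retires 800 instructions in scalar form and 100 in vector form would be reported as an 8x speedup under this model.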
5 Vectorization
- The compiler was able to vectorize most of the loops.
  - Primary reason for failing: data dependence
  - Additionally: function calls, non-existent vector
    version of a library function
- Some loops were skipped entirely since they
  didn't produce any results.
- Some loops were conditionally vectorized
- There were a couple of bugs in the benchmark,
  which initially skewed the results.
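The failure reasons listed above can be illustrated with small loops. These are hypothetical examples written for this note, not loops taken from the benchmark, and `scalar_only` is a made-up stand-in for a library routine with no vector version.

```c
#include <stddef.h>

/* Independent iterations: each a[i] depends only on b[i] and c[i],
 * so a vectorizing compiler can process a whole strip at once.
 * restrict promises the arrays do not overlap. */
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Loop-carried data dependence: iteration i reads a[i-1], which
 * iteration i-1 has just written, so the iterations cannot simply
 * be executed side by side. */
void running_sum(float *a, const float *b, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* Hypothetical scalar-only routine standing in for a library
 * function that has no vector version. */
float scalar_only(float x)
{
    return x < 0.0f ? -x : x;
}

/* A call in the loop body: without inlining or a vector version of
 * the callee, the loop typically stays scalar. */
void map_call(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = scalar_only(b[i]);
}
```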
6 Speedup
7 Quality
- It appears the compiler does not consistently
  take full advantage of auto-increments found in
  the ISA.
- It also doesn't keep track of vl/mvl efficiently
  - This resulted in a great deal of unnecessary loop
    overhead in each strip-mined loop.
- Furthermore, there were many instances where code
  motion out of the loop should have been applied.
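To make the strip-mining overhead concrete, here is a scalar C model of a strip-mined loop with only the bookkeeping that is strictly needed per strip. It is a sketch under assumptions, not vcc output: the function name and the choice of a[i] = b[i] + c[i] as the loop body are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a strip-mined a[i] = b[i] + c[i].
 * mvl is the maximum vector length; vl is the length of the current
 * strip. The inner loop stands in for one group of vector
 * instructions. Returns the number of strips executed. */
size_t strip_mined_add(float *a, const float *b, const float *c,
                       size_t n, size_t mvl)
{
    assert(mvl > 0);
    size_t strips = 0;
    size_t remaining = n;
    while (remaining > 0) {
        /* vl differs from mvl only on a final, partial strip. */
        size_t vl = remaining < mvl ? remaining : mvl;

        for (size_t k = 0; k < vl; k++)   /* one vector add of length vl */
            a[k] = b[k] + c[k];

        /* Pointer bumps model auto-increment addressing: the base
         * registers advance as a side effect of the loads and stores
         * rather than through separate add instructions. */
        a += vl;
        b += vl;
        c += vl;
        remaining -= vl;
        strips++;
    }
    return strips;
}
```

With n = 10 and mvl = 4 the model runs three strips of lengths 4, 4, and 2. Anything the generated code does per strip beyond the vl update and the decrement of the remaining count is overhead of the kind described above.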
8 ISA usage
- The loops are primarily single-precision FP;
  however, integer and vector processing
  instructions can be used effectively in
  calculating addresses.
- Relatively few of the vector processing
  instructions were used.
- About half of the flag processing instructions
  were used.
- Only 4 of the 16 FP compare predicates were used
- No surprise that saturating and the more complex
  integer arithmetic instructions were not used.
9 Examples: loop 72 (21.1x)
  for (i = 0; i < n; i++) if (a[i] > 0) { b[j] = a[i]; j++; }
When compiled, each strip would load mvl elements
of a, compare them to 0, generate an index vector
for the elements greater than 0, use that in an
indexed load of a, then store the result to b.
What it should do is load a strip of a, compare
to 0, use vcompress to compress the strip, and
store to b.
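The two strategies for loop 72 can be modelled in scalar C, reading the loop as a compress of the positive elements of a into b. This is a sketch under assumptions: MVL = 8 and both function names are made up for illustration, and the inner loops stand in for vector instructions rather than reproducing vcc's actual output.

```c
#include <stddef.h>

#define MVL 8  /* assumed maximum vector length for this model */

/* What the compiled code did, per strip: load, compare, build an
 * index vector of the passing elements, then load a again through
 * those indices and store to b. Returns the element count in b. */
size_t compress_by_gather(const float *a, float *b, size_t n)
{
    size_t j = 0;
    for (size_t base = 0; base < n; base += MVL) {
        size_t vl = n - base < MVL ? n - base : MVL;
        size_t index[MVL];
        size_t count = 0;
        for (size_t k = 0; k < vl; k++)       /* compare + build indices */
            if (a[base + k] > 0.0f)
                index[count++] = base + k;
        for (size_t k = 0; k < count; k++)    /* second, indexed load of a */
            b[j + k] = a[index[k]];
        j += count;
    }
    return j;
}

/* Suggested form: the strip that was already loaded is compressed
 * under the compare mask (the role of vcompress) and stored
 * unit-stride, so a is read only once. */
size_t compress_in_register(const float *a, float *b, size_t n)
{
    size_t j = 0;
    for (size_t base = 0; base < n; base += MVL) {
        size_t vl = n - base < MVL ? n - base : MVL;
        float strip[MVL], packed[MVL];
        size_t count = 0;
        for (size_t k = 0; k < vl; k++)       /* single load of a */
            strip[k] = a[base + k];
        for (size_t k = 0; k < vl; k++)       /* compare + compress */
            if (strip[k] > 0.0f)
                packed[count++] = strip[k];
        for (size_t k = 0; k < count; k++)    /* unit-stride store */
            b[j + k] = packed[k];
        j += count;
    }
    return j;
}
```

Both functions produce the same contents in b; the difference is that the second touches a once per strip and needs no index vector.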
10 Examples: loop 100 (31.8x)
  for (i = 0; i < n; i++) a[i] = b[i] + c[i/2];
Here the compiler maintains the base for c in a
vector register and uses a vdiv to generate an
indexing vector to load strips of c; furthermore,
it then has to increment all elements in the
addressing register each iteration. All that's
needed is to break the loop into even and odd
parts, and use a stride-2 load for b and a
unit-stride load for c.
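The even/odd split can be sketched in C as follows. The loop is read here as a[i] = b[i] + c[i/2]; the addition is an assumption, since the operator is not legible in the slide text, and both function names are hypothetical. That a is also written with stride 2 is an inference from the split, not something the slide states.

```c
#include <stddef.h>

/* Reference loop, assuming the operator is addition. */
void loop100_ref(float *a, const float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i / 2];
}

/* Even/odd split: for i = 2k and i = 2k + 1 the c operand is c[k],
 * so c is read unit-stride while a and b are accessed with stride 2.
 * No division and no index vector are needed. */
void loop100_split(float *a, const float *b, const float *c, size_t n)
{
    size_t n_even = (n + 1) / 2;   /* number of even i in [0, n) */
    size_t n_odd  = n / 2;         /* number of odd i in [0, n)  */

    for (size_t k = 0; k < n_even; k++)      /* even elements */
        a[2 * k] = b[2 * k] + c[k];
    for (size_t k = 0; k < n_odd; k++)       /* odd elements */
        a[2 * k + 1] = b[2 * k + 1] + c[k];
}
```

Each of the two inner loops maps onto a strided load of b, a unit-stride load of c, one vector add, and a strided store of a.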