vcc C/C++ Compiler for VIRAM


Provided by: csBer


1
vcc C/C++ Compiler for VIRAM

Sam Williams
CS 265
samw_at_cs.berkeley.edu

2
Topics

Introduction
Simulation Methodology
Vectorization
Speedup
Quality of codegen
Instruction usage

3
Introduction

vcc is the C/C++ compiler for VIRAM.
It quickly became evident that many features
haven't been implemented, including:
inlining
scheduling
loop unrolling
code motion

4
Simulation Methodology

In order to separate micro-architectural
performance from the ability of the compiler to
take full advantage of the ISA and find potential
parallelism, I assumed:
The processor is a single-issue machine.
No stalls will occur due to the number of
functional units or bandwidth.
All instructions take a single cycle to execute.
Thus the vsim-isa simulator could be used.
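
Under the three assumptions above, the cycle count of a run equals its dynamic instruction count, which is why an ISA-level simulator is enough. A minimal sketch of that arithmetic follows; the function names are hypothetical, and defining speedup as the ratio of scalar to vector instruction counts is an assumption, since the slides do not say how the reported speedups were computed.

```c
#include <assert.h>

/* Hypothetical helper: with a single-issue machine, no stalls, and
   one cycle per instruction, cycles == dynamic instruction count. */
static unsigned long cycles_from_instructions(unsigned long dynamic_instructions)
{
    return dynamic_instructions;
}

/* Assumed definition (not stated in the slides):
   speedup = scalar cycles / vectorized cycles. */
static double speedup(unsigned long scalar_instructions,
                      unsigned long vector_instructions)
{
    assert(vector_instructions > 0);
    return (double)cycles_from_instructions(scalar_instructions) /
           (double)cycles_from_instructions(vector_instructions);
}
```

For instance, a loop that retires 200 instructions in scalar form and 10 in vector form would score 20x under this model, regardless of how long the vector instructions would really take on hardware.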

5
Vectorization

The compiler was able to vectorize most of the
loops.
Primary reason for failing: data dependence.
Additionally: function calls and non-existent vector
versions of library functions.
Some loops were skipped entirely since they
didn't produce any results.
Some loops were conditionally vectorized.
There were a couple of bugs in the benchmark,
which initially skewed the results.

6
Speedup

7
Quality

It appears the compiler does not consistently
take full advantage of auto-increments found in
the ISA.
It also doesn't keep track of vl/mvl efficiently.
This resulted in a great deal of unnecessary loop
overhead in each strip-mined loop.
Furthermore, there were many instances where code
motion out of the loop should have been applied.
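
To make the overhead concrete, here is a scalar C sketch of what a lean strip-mined loop might look like. It is only a model: `MVL` is an assumed constant, the inner `for` stands in for one vector load, multiply, and store, and the loop body (`a[i] = b[i] * s`) is a made-up example rather than one of the benchmark loops.

```c
#include <assert.h>
#include <stddef.h>

enum { MVL = 64 }; /* assumed maximum vector length, illustrative only */

/* Computes a[i] = b[i] * s for i in [0, n), one strip at a time.
   Returns the number of strips executed. */
static size_t scale_strip_mined(float *a, const float *b, float s, size_t n)
{
    size_t strips = 0;
    while (n > 0) {
        /* vl is computed once per strip and reused. */
        size_t vl = n < MVL ? n : MVL;

        /* Stands in for one vector load, multiply, and store. */
        for (size_t k = 0; k < vl; k++)
            a[k] = b[k] * s;

        /* Pointer bumps model auto-increment addressing: no separate
           index arithmetic is needed inside the strip. */
        a += vl;
        b += vl;
        n -= vl;
        strips++;
    }
    return strips;
}
```

The per-strip overhead here is one `vl` computation, two pointer bumps, and one count update. Anything loop-invariant, such as the scalar `s`, stays outside the loop.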

8
ISA usage

Loops are primarily single-precision FP;
however, integer and vector processing
instructions can be used effectively in
calculating addresses.
Relatively few of the vector processing
instructions were used.
About half of the flag processing instructions
were used.
Only 4 of the 16 FP compare predicates were used.
No surprise that saturating and the more complex
integer arithmetic instructions were not used.

9
Examples: loop 72 (21.1x)

for (i = 0; i < n; i++) if (a[i] > 0) { b[j] = a[i]; j++; }

When compiled, each strip
would load mvl elements of a, compare to 0,
generate an index to the greater-than-0 elements,
use that in an indexed load of a, then store
that to b. What it should do is load a strip of
a, compare to 0, use vcompress to compress the
strip, and store to b.
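
The preferred sequence can be modeled in scalar C as below. This is a sketch under assumptions, not the compiler's output: `MVL`, the `float` element type, and the function name are illustrative, and each inner loop stands in for a single vector operation (compare, vcompress, unit-stride store).

```c
#include <assert.h>
#include <stddef.h>

enum { MVL = 64 }; /* assumed maximum vector length, illustrative only */

/* Copies the positive elements of a[0..n) into b, preserving order.
   Returns the number of elements written. */
static size_t compress_positive(float *b, const float *a, size_t n)
{
    size_t j = 0;
    for (size_t base = 0; base < n; base += MVL) {
        size_t vl = (n - base < MVL) ? n - base : MVL;
        unsigned char flag[MVL];
        float packed[MVL];
        size_t count = 0;

        /* Vector compare: set a flag where the element is > 0. */
        for (size_t k = 0; k < vl; k++)
            flag[k] = a[base + k] > 0.0f;

        /* Models vcompress: pack flagged elements to the front. */
        for (size_t k = 0; k < vl; k++)
            if (flag[k])
                packed[count++] = a[base + k];

        /* Unit-stride store of the packed elements. */
        for (size_t k = 0; k < count; k++)
            b[j + k] = packed[k];
        j += count;
    }
    return j;
}
```

Note that the strip of a is loaded only once. The compiled version described above touches a twice per strip, and the second access is an indexed load.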

10
Examples: loop 100 (31.8x)

for (i = 0; i < n; i++) a[i] = b[i] + c[i/2];

Here the compiler maintains the base
for c in a vector register, and uses a vdiv to
generate an indexing vector to load strips of
c; furthermore, it then has to increment all
elements in the addressing register each
iteration. All that's needed is to break the
loop into even and odd parts, and use a stride-2
load for b and a unit-stride load for c.
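
The even/odd split can be checked in scalar C. In the sketch below, the operator joining b[i] and c[i/2] is taken to be +, which is an assumption (the transcript does not show it legibly), as are the `float` element type and the function names.

```c
#include <assert.h>
#include <stddef.h>

/* Reference form of the loop, with the assumed + operator. */
static void loop100_reference(float *a, const float *b, const float *c,
                              size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i / 2];
}

/* Even/odd split: each pass reads b with stride 2 and c with unit
   stride, so no division or index vector is needed. */
static void loop100_split(float *a, const float *b, const float *c,
                          size_t n)
{
    size_t evens = (n + 1) / 2; /* i = 0, 2, 4, ... */
    size_t odds = n / 2;        /* i = 1, 3, 5, ... */

    assert(evens + odds == n);

    for (size_t k = 0; k < evens; k++)
        a[2 * k] = b[2 * k] + c[k];

    for (size_t k = 0; k < odds; k++)
        a[2 * k + 1] = b[2 * k + 1] + c[k];
}
```

One consequence worth noting: the stores to a also become stride-2 in each pass, and c is read twice overall (once per pass).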
