Title: Bitwidth Analysis with Application to Silicon Compilation
1Bitwidth Analysis with Application to Silicon
Compilation
Amit Chaudhari
- Paper by Mark Stephenson, Jonathan Babb, and Saman Amarasinghe
- MIT Laboratory for Computer Science
- Princeton
- Presented at the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Vancouver, British Columbia, June 2000
2Goal
- For a program written in a high-level language, automatically find the minimum number of bits needed to represent:
  - Each static variable in the program
  - Each operation in the program
3Usefulness of Bitwidth Analysis
- Higher Language Abstraction
- Enables other compiler optimizations
- Synthesizing application-specific processors
- Optimizing for power-aware processors
- Extracting more parallelism for SIMD processors
4Bitwidth Opportunities
- Runtime profiling reveals plenty of bitwidth opportunities.
- For the SPECint95 benchmark suite, over 50% of operands use less than half the number of bits specified by the programmer.
5Analysis Constraints
- Bitwidth results must maintain program correctness for all input data sets.
  - Results are not runtime/data dependent.
- A static analysis can do very well, even in light of this constraint.
6Bitwidth Extraction
- Use abundant hints in the source language to discover bitwidths with near-optimal precision.
- Caveats
  - Analysis limited to fixed-point variables.
  - The hints assume source program correctness.
7The Hints
- Bitwidth refining constructs
- Arithmetic operations
- Boolean operations
- Bitmask operations
- Loop induction variable bounding
- Clamping operations
- Type castings
- Static array index bounding
81. Arithmetic Operations
- Example
    int a;
    unsigned b;
    a = random();
    b = random();    /* a: 32 bits, b: 32 bits */
    a = a / 2;       /* a: 31 bits, b: 32 bits */
    b = b >> 4;      /* a: 31 bits, b: 28 bits */
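One way to see where these widths come from is to track the range of values each variable can hold and ask how many bits that range needs. The following is a minimal sketch under the assumption of 32-bit `int` and `unsigned`; the helper names `bits_for_unsigned` and `bits_for_signed` are hypothetical and are not taken from the Bitwise compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bits needed for an unsigned value in [0, max_value]. */
static unsigned bits_for_unsigned(uint64_t max_value) {
    unsigned bits = 1;                  /* even the value 0 occupies one bit */
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Hypothetical helper: bits needed for a two's-complement value in [lo, hi]. */
static unsigned bits_for_signed(int64_t lo, int64_t hi) {
    unsigned bits = 1;                  /* one bit covers the range [-1, 0] */
    assert(lo <= hi);
    assert(lo >= INT32_MIN && hi <= INT32_MAX);  /* sketch: 32-bit inputs only */
    while (lo < -(INT64_C(1) << (bits - 1)) ||
           hi > (INT64_C(1) << (bits - 1)) - 1)
        bits++;
    return bits;
}
```

With these helpers, `bits_for_signed(INT32_MIN / 2, INT32_MAX / 2)` models `a = a / 2` and `bits_for_unsigned(UINT32_MAX >> 4)` models `b = b >> 4`, giving the 31-bit and 28-bit widths shown in the example.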
92. Boolean Operations
a: 32 bits
a: 1 bit
103. Bitmask Operations
    int a;                  /* a: 32 bits */
    a = random() & 0xff;    /* a: 8 bits */
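The effect of a constant mask can be sketched as a width rule: the result of `x & mask` can never exceed the mask. This is an illustrative sketch only, assuming the masked operand is non-negative (as the result of `random()` is); `width_after_and` and `bits_for_unsigned` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bits needed for an unsigned value in [0, max_value]. */
static unsigned bits_for_unsigned(uint64_t max_value) {
    unsigned bits = 1;
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Width of (x & mask) for a non-negative x that fits in operand_width bits.
 * The result is bounded by both x and mask, so the narrower width wins. */
static unsigned width_after_and(unsigned operand_width, uint64_t mask) {
    unsigned mask_width = bits_for_unsigned(mask);
    assert(operand_width >= 1);
    return operand_width < mask_width ? operand_width : mask_width;
}
```

For the example above, `width_after_and(32, 0xff)` yields 8.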
114. Loop Induction Variable Bounding
- Applicable to for-loop induction variables.
- Example
    int i;                        /* i: 32 bits */
    for (i = 0; i < 6; i++) {
        ...
    }
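As a rough illustration of how a loop bound limits an induction variable, the sketch below computes the range of `i` for a loop of the form `for (i = init; i < bound; i += step)`, including the value that makes the loop exit. The function name `induction_range` and the restriction to positive constant steps are assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;   /* closed interval <lo, hi> */

/* Hypothetical helper: bits needed for an unsigned value in [0, max_value]. */
static unsigned bits_for_unsigned(uint64_t max_value) {
    unsigned bits = 1;
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Range of i over `for (i = init; i < bound; i += step)`, counting the
 * final value that fails the loop test. */
static range_t induction_range(int64_t init, int64_t bound, int64_t step) {
    range_t r = { init, init };
    assert(step > 0);
    if (init < bound) {
        int64_t trips = (bound - init + step - 1) / step;  /* ceiling division */
        r.hi = init + trips * step;
    }
    return r;
}
```

For the loop in the example, `induction_range(0, 6, 1)` gives `<0,6>`, and `bits_for_unsigned(6)` is 3.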
125. Clamping Optimization
- Multimedia codes often simulate saturating instructions.
- Example
    int valpred;                  /* valpred: 32 bits */
    if (valpred > 32767)
        valpred = 32767;
    else if (valpred < -32768)
        valpred = -32768;         /* valpred: 16 bits */
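The clamping pattern can be modeled as restricting a data-range to the clamp bounds. The sketch below is illustrative only; `clamp_range` and `bits_for_signed` are hypothetical names, and a 32-bit `int` is assumed.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;   /* closed interval <lo, hi> */

/* Hypothetical helper: bits needed for a two's-complement value in [lo, hi]. */
static unsigned bits_for_signed(int64_t lo, int64_t hi) {
    unsigned bits = 1;                  /* one bit covers the range [-1, 0] */
    assert(lo <= hi);
    assert(lo >= INT32_MIN && hi <= INT32_MAX);  /* sketch: 32-bit inputs only */
    while (lo < -(INT64_C(1) << (bits - 1)) ||
           hi > (INT64_C(1) << (bits - 1)) - 1)
        bits++;
    return bits;
}

/* Range of x after: if (x > hi) x = hi; else if (x < lo) x = lo; */
static range_t clamp_range(range_t in, int64_t lo, int64_t hi) {
    range_t out;
    assert(lo <= hi && in.lo <= in.hi);
    out.lo = in.lo < lo ? lo : (in.lo > hi ? hi : in.lo);
    out.hi = in.hi > hi ? hi : (in.hi < lo ? lo : in.hi);
    return out;
}
```

Clamping the full `int` range to `[-32768, 32767]` and passing the result to `bits_for_signed` gives 16, matching the annotation on `valpred`.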
136. Type Casting (Part I)
a: 32 bits, b: 8 bits
a: 8 bits, b: 8 bits
146. Type Casting (Part II)
a: 32 bits, b: 8 bits
a: 8 bits, b: 8 bits
a: 8 bits, b: 8 bits
157. Array Index Optimization
- The bitwidth of an index into an array can be set based on the bounds of the array.
- Example
    int a, b;             /* a: 32 bits, b: 32 bits */
    int X[1024];
    ...
    X[a] = X[4*b];        /* a: 10 bits, b: 8 bits */
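The arithmetic behind these widths can be sketched as follows: a valid index into an array of 1024 elements is at most 1023, and if the index is `4*b`, then `b` is at most 1023 / 4 = 255. This sketch assumes the program is correct (indices are in bounds and non-negative); `max_index_operand` and `bits_for_unsigned` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bits needed for an unsigned value in [0, max_value]. */
static unsigned bits_for_unsigned(uint64_t max_value) {
    unsigned bits = 1;
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Largest non-negative v such that scale * v is still a valid index into
 * an array with `length` elements. */
static uint64_t max_index_operand(uint64_t length, uint64_t scale) {
    assert(length > 0 && scale > 0);
    return (length - 1) / scale;
}
```

Here `bits_for_unsigned(max_index_operand(1024, 1))` is 10 for `a`, and `bits_for_unsigned(max_index_operand(1024, 4))` is 8 for `b`.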
16Propagating Data-Ranges
- Data-flow analysis
- Three candidate lattices
- Bitwidth
- Vector of bits
- Data-ranges
Propagating bitwidths
    a: 4 bits
    a = a + 1
    a: 5 bits
17Propagating Data-Ranges
Propagating bit vectors
    a: ??????1X
    a = a + 1
    a: ?????XXX
18Propagating Data-Ranges
Propagating data-ranges
    a: <0,13>
    a = a + 1
    a: <1,14>
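The difference in precision between the bitwidth lattice and the data-range lattice can be illustrated with the `a = a + 1` example. This is a minimal sketch, not the transfer functions from the paper; `width_add`, `range_add`, and `bits_for_unsigned` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;   /* closed interval <lo, hi> */

/* Hypothetical helper: bits needed for an unsigned value in [0, max_value]. */
static unsigned bits_for_unsigned(uint64_t max_value) {
    unsigned bits = 1;
    while ((max_value >>= 1) != 0)
        bits++;
    return bits;
}

/* Bitwidth lattice: only widths are known, so an addition must always
 * reserve one extra bit for a possible carry. */
static unsigned width_add(unsigned wa, unsigned wb) {
    return (wa > wb ? wa : wb) + 1;
}

/* Data-range lattice: add the bounds of the two intervals. */
static range_t range_add(range_t a, range_t b) {
    range_t r;
    assert(a.lo <= a.hi && b.lo <= b.hi);
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi;
    return r;
}
```

Starting from a 4-bit `a`, `width_add(4, 1)` grows to 5 bits, while `range_add` maps `<0,13>` to `<1,14>`, which still fits in 4 bits.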
19Propagating Data-Ranges
- Propagate data-ranges forward and backward over the control-flow graph using transfer functions described in the paper.
- Use Static Single Assignment (SSA) form with extensions to:
  - Gracefully handle pointers and arrays.
  - Extract data-range information from conditional statements.
20Example of Data-Range Propagation
    a0 = input()
    a1 = a0 + 1
    if (a1 < 0)              [true]
        a2 = a1:(a1<0)
        a3 = a2 + 1
    else
        a4 = a1:(a1≥0)
        c0 = a4
    a5 = φ(a3, a4)
    b0 = array[a5]
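To make the forward and backward passes concrete, the sketch below propagates data-ranges through the SSA fragment above. Several things here are assumptions and not stated on the slide: `array` is taken to have 10 elements, `a2` is taken to be the copy of `a1` on the taken (`a1 < 0`) branch and `a4` the copy on the other branch, and additions saturate at the limits of a 32-bit `int`. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;   /* closed interval <lo, hi> */

static range_t r_make(int64_t lo, int64_t hi) {
    range_t r = { lo, hi };
    return r;
}
static range_t r_meet(range_t a, range_t b) {          /* intersection */
    range_t r = r_make(a.lo > b.lo ? a.lo : b.lo, a.hi < b.hi ? a.hi : b.hi);
    assert(r.lo <= r.hi);                  /* sketch: empty ranges unsupported */
    return r;
}
static range_t r_join(range_t a, range_t b) {          /* data-range union */
    return r_make(a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi);
}
static range_t r_add_const(range_t a, int64_t c) {     /* saturating add */
    return r_meet(r_make(a.lo + c, a.hi + c), r_make(INT32_MIN, INT32_MAX));
}

/* Fills a[0..5] with the final ranges of a0..a5.
 * array_len is an assumed array size, not given on the slide. */
static void propagate_example(range_t a[6], int64_t array_len) {
    /* Forward pass: follow the definitions in program order. */
    a[0] = r_make(INT32_MIN, INT32_MAX);               /* a0 = input()     */
    a[1] = r_add_const(a[0], 1);                       /* a1 = a0 + 1      */
    a[2] = r_meet(a[1], r_make(INT32_MIN, -1));        /* a1 < 0 branch    */
    a[3] = r_add_const(a[2], 1);                       /* a3 = a2 + 1      */
    a[4] = r_meet(a[1], r_make(0, INT32_MAX));         /* a1 >= 0 branch   */
    a[5] = r_join(a[3], a[4]);                         /* phi(a3, a4)      */

    /* Backward pass: a5 indexes the array, so it must be a valid index. */
    a[5] = r_meet(a[5], r_make(0, array_len - 1));
    a[3] = r_meet(a[3], a[5]);
    a[4] = r_meet(a[4], a[5]);
    a[2] = r_meet(a[2], r_add_const(a[3], -1));
    a[1] = r_meet(a[1], r_join(a[2], a[4]));
    a[0] = r_meet(a[0], r_add_const(a[1], -1));
}
```

Under these assumptions, calling `propagate_example(a, 10)` narrows `a5` to `<0,9>` and, working backward, `a0` to `<-2,8>`, so every variable in the fragment fits in a handful of bits even though the input was declared as a full `int`.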
22What to do with Loops?
- Finding the fixed-point around back edges will often saturate data-ranges.
- Instructions in loops comprise the bulk of dynamically executed instructions!
23Their Loop Solution
- Find the closed-form solutions to commonly occurring sequences.
  - A sequence is a mutually dependent group of instructions.
- Use the closed-form solutions to determine final ranges.
24Finding the Closed-Form Solution
    a = 0
    for i = 1 to 10
        a = a + 1
        for j = 1 to 10
            a = a + 2
        for k = 1 to 10
            a = a + 3
    ... = a + 4
26Finding the Closed-Form Solution
    a = 0                     <0,0>
    for i = 1 to 10
        a = a + 1             <1,460>
        for j = 1 to 10
            a = a + 2         <3,480>
        for k = 1 to 10
            a = a + 3         <24,510>
    ... = a + 4               <510,510>
- Non-trivial to find the exact ranges
28Finding the Closed-Form Solution
    a = 0                     <0,0>
    for i = 1 to 10
        a = a + 1             <1,460>
        for j = 1 to 10
            a = a + 2         <3,480>
        for k = 1 to 10
            a = a + 3         <24,510>
    ... = a + 4               <510,510>
- Can easily find a conservative range of <0,510>
29Solving the Linear Sequence
    a = 0
    for i = 1 to 10
        a = a + 1
        for j = 1 to 10
            a = a + 2
        for k = 1 to 10
            a = a + 3
    ... = a + 4
- Figure out the iteration count of each loop.
30Solving the Linear Sequence
    a = 0
    for i = 1 to 10
        a = a + 1             iteration count <1,10>
        for j = 1 to 10
            a = a + 2         iteration count <1,100>
        for k = 1 to 10
            a = a + 3         iteration count <1,100>
    ... = a + 4
- Find out how much each instruction contributes to the sequence using its iteration count.
31Solving the Linear Sequence
    a = 0
    for i = 1 to 10
        a = a + 1             <1,10> × <1,1> = <1,10>
        for j = 1 to 10
            a = a + 2         <1,100> × <2,2> = <2,200>
        for k = 1 to 10
            a = a + 3         <1,100> × <3,3> = <3,300>
    ... = a + 4
    (<1,10> + <2,200> + <3,300>) ∪ <0,0> = <0,510>
- Sum all the contributions together, and take the data-range union with the initial value.
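The three steps (iteration counts, per-instruction contributions, sum and union) can be sketched in a few lines. This is an illustrative sketch, not the algorithm as implemented in Bitwise: it assumes non-negative iteration counts and constant non-negative increments, and the name `linear_sequence_range` is hypothetical. It adds the summed contributions to the initial range before taking the union, which for a zero initial value is exactly the expression on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } range_t;   /* closed interval <lo, hi> */

static range_t r_join(range_t a, range_t b) {          /* data-range union */
    range_t r;
    r.lo = a.lo < b.lo ? a.lo : b.lo;
    r.hi = a.hi > b.hi ? a.hi : b.hi;
    return r;
}

/* Conservative range of a variable updated by n instructions of the form
 * `a = a + increments[i]`, where instruction i executes counts[i] times. */
static range_t linear_sequence_range(range_t init, const range_t *counts,
                                     const int64_t *increments, size_t n) {
    range_t total = init;
    size_t i;
    for (i = 0; i < n; i++) {
        assert(counts[i].lo >= 0 && counts[i].lo <= counts[i].hi);
        assert(increments[i] >= 0);
        /* Contribution of instruction i: iteration count times increment. */
        total.lo += counts[i].lo * increments[i];
        total.hi += counts[i].hi * increments[i];
    }
    /* Union with the initial value covers the state before any update. */
    return r_join(total, init);
}
```

With an initial range of `<0,0>`, iteration counts `<1,10>`, `<1,100>`, `<1,100>`, and increments 1, 2, 3, the function returns `<0,510>`, the conservative range shown above.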
32Results
- Standalone Bitwise compiler.
- Bits cut from scalar variables
- Bits cut from array variables
- With the DeepC silicon compiler.
33Percentage of Original Scalar Bits
34Percentage of Original Array Bits
35DeepC Compiler Targeted to FPGAs
C/Fortran program
Suif Frontend
Pointer alias and other high-level analyses
Bitwidth Analysis
MachSuif Codegen
Raw parallelization
DeepC specialization
Verilog
Traditional CAD optimizations
Physical Circuit
36FPGA Area
[Bar chart: FPGA area in CLB count (axis 0 to 2000) per benchmark, without Bitwise vs. with Bitwise]
Benchmarks (main datapath width): life (1), sor (32), intfir (32), jacobi (8), newlife (1), parity (32), adpcm (8), median (32), pmatch (32), convolve (16), intmatmul (16), histogram (16), mpegcorr (16), bubblesort (32)
- On average, the bitwidth-optimized circuit used 57% less area.
37FPGA Clock Speed (50 MHz Target)
[Bar chart: XC4000-09 clock speed in MHz (axis 0 to 150) per benchmark, without Bitwise vs. with Bitwise]
Benchmarks: life, sor, intfir, parity, jacobi, adpcm, newlife, median, pmatch, convolve, intmatmul, mpegcorr, histogram, bubblesort
38Power Savings
[Bar chart: average dynamic power in mW (axis 0 to 5) for bubblesort, histogram, jacobi, and pmatch, without vs. with bitwidth analysis]
- On average, the analysis reduced power by 50%.
39Power Savings
- C → ASIC
- IBM SA27E process
- 0.15 micron drawn
- 200 MHz
- Methodology
- C → RTL
- RTL simulation → register switching activity
- Synthesis reports dynamic power
40Summary
- Bitwise: a scalable bitwidth analyzer
  - Standard data-flow analysis
  - Loop analysis
  - Incorporate pointer analysis
- Demonstrated savings when targeting silicon from high-level languages
  - 57% less area
  - Up to 86% improvement in clock speed
  - Less than 50% of the power