Title: Exploiting Vector Parallelism in Software Pipelined Loops
1Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
2Multimedia Extensions
- Short vector extensions in ILP processors
- AltiVec, 3DNow!, SSE, etc.
- Accelerate loops in multimedia and DSP codes
- New designs have floating-point support
3Multimedia Extensions
- Vector resources do not overwhelm the scalar resources
- Scalar: 2 FP ops / cycle
- Vector: 4 FP ops / cycle
- Full vectorization may underutilize scalar resources
- ILP techniques do not target vector resources
- Need both
4Modulo Scheduling
for (i=0; i<N; i++) s = s + X[i] * Y[i];
Cycle Slot 1 Slot 2 Slot 3
1 LOAD LOAD
2 MULT
3 LOAD LOAD ADD
4 MULT
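The overlap shown in the schedule can be imitated by hand in C. The sketch below is only an illustration, assuming the slide's loop is a multiply-accumulate over two arrays; the function names `dot` and `dot_pipelined` are hypothetical. In `dot_pipelined`, the loads for one iteration are issued alongside the multiply and add of the previous iteration, which is the effect a modulo scheduler achieves automatically.

```c
#include <stddef.h>

/* Straightforward form of the loop: s = s + X[i] * Y[i]. */
double dot(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + x[i] * y[i];
    return s;
}

/* Hand-pipelined form (illustrative): the LOADs of iteration i
 * overlap the MULT and ADD of iteration i-1. */
double dot_pipelined(const double *x, const double *y, size_t n) {
    if (n == 0)
        return 0.0;
    double s = 0.0;
    double xv = x[0], yv = y[0];        /* prologue: LOADs of iteration 0 */
    for (size_t i = 1; i < n; i++) {    /* kernel */
        double p = xv * yv;             /* MULT of iteration i-1 */
        xv = x[i];                      /* LOAD of iteration i */
        yv = y[i];                      /* LOAD of iteration i */
        s = s + p;                      /* ADD of iteration i-1 */
    }
    s = s + xv * yv;                    /* epilogue: finish last iteration */
    return s;
}
```

Both functions perform the same additions in the same order, so they should return the same value for the same inputs.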
5Traditional Vectorization
for (i=0; i<N; i+=2) S[i:i+1] = X[i:i+1] * Y[i:i+1];
for (i=0; i<N; i++) s = s + S[i];
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VSTORE
Cycle Slot 1 Slot 2 Slot 3
1 LOAD
2 LOAD ADD
3
4
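A plain-C rendering of the distributed form may help make the two schedules concrete. This is a sketch under assumptions: the vector length is 2, the trip count is even, and the name `dot_distributed` and the caller-supplied temporary array `tmp` are hypothetical. The first loop stands in for the vector loop (two products per trip), the second for the scalar reduction that remains after distribution.

```c
#include <assert.h>
#include <stddef.h>

/* Loop distribution: a vectorizable multiply loop that writes a
 * temporary array, followed by a scalar reduction loop. */
double dot_distributed(const double *x, const double *y,
                       double *tmp, size_t n) {
    assert(n % 2 == 0);                 /* assumption: even trip count */

    /* Loop 1: would become VLOAD, VLOAD, VMUL, VSTORE per trip. */
    for (size_t i = 0; i < n; i += 2) {
        tmp[i]     = x[i]     * y[i];
        tmp[i + 1] = x[i + 1] * y[i + 1];
    }

    /* Loop 2: scalar LOAD and ADD per trip; not vectorized here
     * because of the recurrence on s. */
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + tmp[i];
    return s;
}
```

The temporary array is the price of distribution: every product is stored by the first loop and reloaded by the second.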
6Vectorization without Distribution
for (i=0; i<N; i+=2) { S = X[i:i+1] * Y[i:i+1]; s = s + S[0]; s = s + S[1]; }
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VLOAD ADD
5 VLOAD ADD
6 VMUL
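The single-loop form can be sketched in C with a small stand-in for a 2-wide vector register. Everything named here (`vec2`, `vload`, `vmul`, `dot_no_distribution`) is hypothetical and only models the operations in the schedule; it is not tied to any real vector ISA. The vector product stays in a register-like value and its two lanes feed the scalar adds directly, so no temporary array is needed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double lane[2]; } vec2;   /* hypothetical 2-wide vector */

/* Models a VLOAD of two adjacent elements. */
static vec2 vload(const double *p) {
    vec2 v = {{ p[0], p[1] }};
    return v;
}

/* Models a VMUL: lane-wise multiply. */
static vec2 vmul(vec2 a, vec2 b) {
    vec2 r = {{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
    return r;
}

double dot_no_distribution(const double *x, const double *y, size_t n) {
    assert(n % 2 == 0);                 /* assumption: even trip count */
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        vec2 S = vmul(vload(&x[i]), vload(&y[i]));  /* VLOAD, VLOAD, VMUL */
        s = s + S.lane[0];                           /* scalar ADD */
        s = s + S.lane[1];                           /* scalar ADD */
    }
    return s;
}
```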
7Selective Vectorization
for (i=0; i<N; i+=2) { S = X[i:i+1] * {Y[i], Y[i+1]}; s = s + S[0]; s = s + S[1]; }
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD LOAD
2 LOAD
3 VLOAD LOAD
4 VMUL LOAD
5 VLOAD LOAD ADD
6 VMUL LOAD ADD
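As a rough C model of the mixed schedule, the sketch below loads X with one vector load and Y with two scalar loads that are then packed into a vector operand. The names `vec2`, `vload`, `vpack`, `vmul`, and `dot_selective` are hypothetical, and packing is treated as free, matching the example's assumption of no communication cost.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double lane[2]; } vec2;   /* hypothetical 2-wide vector */

/* Models a VLOAD of two adjacent elements. */
static vec2 vload(const double *p) {
    vec2 v = {{ p[0], p[1] }};
    return v;
}

/* Models packing two scalars into a vector (assumed free here). */
static vec2 vpack(double a, double b) {
    vec2 v = {{ a, b }};
    return v;
}

/* Models a VMUL: lane-wise multiply. */
static vec2 vmul(vec2 a, vec2 b) {
    vec2 r = {{ a.lane[0] * b.lane[0], a.lane[1] * b.lane[1] }};
    return r;
}

double dot_selective(const double *x, const double *y, size_t n) {
    assert(n % 2 == 0);                 /* assumption: even trip count */
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        vec2 xv = vload(&x[i]);         /* one VLOAD on the vector side */
        double y0 = y[i];               /* scalar LOAD */
        double y1 = y[i + 1];           /* scalar LOAD */
        vec2 S = vmul(xv, vpack(y0, y1));   /* VMUL */
        s = s + S.lane[0];              /* scalar ADD */
        s = s + S.lane[1];              /* scalar ADD */
    }
    return s;
}
```

Per trip this issues one vector load, two scalar loads, one vector multiply, and two scalar adds, which is the operation mix visible in the schedule.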
8Complications
- Complex scheduling requirements
- Particularly in statically scheduled machines
- Memory alignment
- Example assumes no communication cost
- In reality, explicit operations required
- Often through memory
- Reserve critical resources
- Potential long latency
- Performance improvement still possible
9Tomcatv main loop (50)
10Tomcatv (SpecFP 95)
Issue Width 6
Memory Units 2
ALUs 4
FPUs 2
Vector Units 1
Vector Length 2
1.7x Speedup over Modulo Scheduling
Technique                 ALU  MEM  FPU  VEC
Modulo Scheduling           6   22   46    0
Full Vectorization          7   13    0   46
Selective Vectorization     7   27   19   27
11Tomcatv (SpecFP 95)
12Selective Vectorization
- Balance computation among resources
- Minimize II when loop is modulo scheduled
- Carefully manage communication
- Incorporate alignment information
- Software pipelining hides latency
- Adapt a 2-cluster partitioning heuristic
- Fiduccia & Mattheyses '82
- Kernighan & Lin '70
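To give a feel for a two-way partitioning pass in the Fiduccia-Mattheyses style, here is a heavily simplified sketch. It is not the heuristic from the talk: the cost function `projected_ii` is a stand-in that only counts FPU and vector-unit pressure, it assumes a vector length of 2 (a scalar op issues twice per unrolled trip, a vector op once), and it ignores communication and alignment. All names (`MAX_OPS`, `projected_ii`, `partition_ops`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_OPS 64   /* hypothetical upper bound on loop-body size */

/* Stand-in cost: resource-bound II for one trip of the loop unrolled
 * by 2. Scalar ops issue twice per trip, vector ops once. */
static int projected_ii(const bool in_vec[], int n, int fpus, int vec_units) {
    int nv = 0;
    for (int i = 0; i < n; i++)
        nv += in_vec[i] ? 1 : 0;
    int scalar_ii = (2 * (n - nv) + fpus - 1) / fpus;
    int vector_ii = (nv + vec_units - 1) / vec_units;
    return scalar_ii > vector_ii ? scalar_ii : vector_ii;
}

/* Starts with every op scalar. Each pass repeatedly moves the unlocked,
 * vectorizable op whose move gives the lowest cost, locks it, and keeps
 * the best configuration seen. Passes repeat until none improves. */
int partition_ops(const bool vectorizable[], bool in_vec[], int n,
                  int fpus, int vec_units) {
    assert(n >= 0 && n <= MAX_OPS && fpus > 0 && vec_units > 0);
    memset(in_vec, 0, (size_t)n * sizeof in_vec[0]);
    int best = projected_ii(in_vec, n, fpus, vec_units);
    bool improved = true;
    while (improved) {
        improved = false;
        bool trial[MAX_OPS], locked[MAX_OPS] = { false };
        memcpy(trial, in_vec, (size_t)n * sizeof trial[0]);
        for (int step = 0; step < n; step++) {
            int pick = -1, pick_cost = 0;
            for (int i = 0; i < n; i++) {
                if (locked[i] || !vectorizable[i])
                    continue;
                trial[i] = !trial[i];           /* tentatively move op i */
                int c = projected_ii(trial, n, fpus, vec_units);
                trial[i] = !trial[i];           /* undo */
                if (pick < 0 || c < pick_cost) {
                    pick = i;
                    pick_cost = c;
                }
            }
            if (pick < 0)
                break;                          /* nothing left to move */
            trial[pick] = !trial[pick];         /* commit the move */
            locked[pick] = true;
            if (pick_cost < best) {             /* remember best prefix */
                best = pick_cost;
                memcpy(in_vec, trial, (size_t)n * sizeof trial[0]);
                improved = true;
            }
        }
    }
    return best;
}
```

With six vectorizable ops, 2 FPUs, and 1 vector unit, this sketch settles on three ops in each partition. Moves that temporarily worsen the cost are still taken within a pass, which is what lets this family of heuristics escape some local minima.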
13Selective Vectorization
scalar
vector
cost
14Cost Function
- Projected II due to resources (ResMII)
- Bin-packing approach (Rau, MICRO '94)
- With some modifications
- Can ignore operation latency
- Software pipelining hides latency
- Vectorizable ops not on dependence cycles
for (i=0; i<N; i++) X[i+4] = X[i];
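A resource-constrained bound on the initiation interval can be approximated with a simple per-class ceiling. The sketch below is a simplification, not the bin-packing procedure cited on the slide: it assumes every operation occupies exactly one unit of its resource class for one cycle, and the function name `res_mii` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified ResMII: for each resource class, the ops that need it
 * divided by the units available, rounded up; the bound is the
 * largest such value (and at least 1). */
int res_mii(const int ops[], const int units[], size_t classes) {
    int ii = 1;
    for (size_t r = 0; r < classes; r++) {
        assert(ops[r] >= 0 && units[r] > 0);
        int need = (ops[r] + units[r] - 1) / units[r];   /* ceiling */
        if (need > ii)
            ii = need;
    }
    return ii;
}
```

As a usage example, take the evaluated machine (4 ALUs, 2 memory units, 2 FPUs, 1 vector unit) and read the Modulo Scheduling row of the tomcatv table as per-iteration op counts, which is an assumption about what that table reports. Calling `res_mii` with ops `{6, 22, 46, 0}` and units `{4, 2, 2, 1}` returns 23, with the FPUs as the limiting resource.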
15Evaluation
C or Fortran
- SUIF front-end
- Dependence analysis
- Dataflow optimization
- Trimaran back-end
- Modulo scheduler
- Register allocator
- VLIW Simulator
- Added vector ops
Simulation Binary
16Evaluation
- Operands communicated through memory
- Software responsible for realignment
Issue Width 6
Memory Units 2
ALUs 4
FPUs 2
Vector Units 1
Vector Length 2
17Evaluation
- SpecFP 92, 95, 2000
- Easier to extract dependence information
- Detectable data parallelism
- 64-bit data means vector length of 2
- Considered amenable to vectorization and SWP
- Apply selective vectorization to DO loops
- No control flow, no function calls
- Fully simulate with training sets
18Traditional Vectorization
19Vectorization without Distribution
20Vectorization Free Communication
21Vectorization without Distribution
22Selective Vectorization
23Selective Vectorization
tomcatv
mgrid
su2cor
swim
24Communication Support
- Transfer through memory
- Register to register copy
- Uses fewer issue slots
- Frees memory resources
- Shared register file
- Vector elements addressable in scalar ops
- Requires no extra issue slots
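The first option, transfer through memory, can be modeled in C as a store followed by loads of the other kind. This is a sketch only: `vec2`, `vec_to_scalars`, `scalars_to_vec`, and the scratch buffer `buf` are hypothetical, and the two-element copy stands in for what would be a single vector store or load on a real machine. Under that reading, each direction costs three memory operations at vector length 2, which is why this option competes with the loop's own loads and stores for memory units.

```c
typedef struct { double lane[2]; } vec2;   /* hypothetical 2-wide vector */

/* Vector to scalar through memory: one vector store, two scalar loads. */
void vec_to_scalars(vec2 v, double *buf, double *a, double *b) {
    buf[0] = v.lane[0];     /* together these two lines model */
    buf[1] = v.lane[1];     /* one VSTORE to a scratch location */
    *a = buf[0];            /* scalar LOAD */
    *b = buf[1];            /* scalar LOAD */
}

/* Scalar to vector through memory: two scalar stores, one vector load. */
vec2 scalars_to_vec(double a, double b, double *buf) {
    buf[0] = a;                         /* scalar STORE */
    buf[1] = b;                         /* scalar STORE */
    vec2 v = {{ buf[0], buf[1] }};      /* models one VLOAD */
    return v;
}
```

A register-to-register copy or a shared register file would remove the scratch buffer from this picture entirely.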
25Through Memory
tomcatv
mgrid
su2cor
swim
26Reg to Reg Transfer Support
tomcatv
mgrid
su2cor
swim
27Shared Register File
tomcatv
mgrid
su2cor
swim
28Related Work
- Traditional vectorization
- Allen & Kennedy, Wolfe
- Software Pipelining
- Rau's iterative modulo scheduling
- Clustered VLIW
- Aleta MICRO34, Codina PACT01, Nystrom MICRO31, Sanchez MICRO33, Zalamea MICRO34
- Partitioning among clusters similar
- Ours is also an instruction selection problem
- No dedicated communication resources
29Conclusion
- Targeting all FUs improves performance
- Selective vectorization
- Vectorization better in the backend
- Cost analysis more accurate
- Software pipeline vectorized loops
- Good idea anyway
- Facilitates selective vectorization
- Hides communication and alignment latency