Exploiting Vector Parallelism in Software Pipelined Loops

Learn more at: https://groups.csail.mit.edu


1
Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
2
Multimedia Extensions

Short vector extensions in ILP processors
AltiVec, 3DNow!, SSE, etc.
Accelerate loops in multimedia and DSP codes
New designs have floating point support

3
Multimedia Extensions

Vector resources do not overwhelm the scalar resources
Scalar: 2 FP ops / cycle
Vector: 4 FP ops / cycle
Full vectorization may underutilize scalar resources
ILP techniques do not target vector resources
Need both

Courtesy of International Business Machines
Corporation. Unauthorized use not permitted.
4
Modulo Scheduling
for (i = 0; i < N; i++) { s = s + X[i] * Y[i]; }
Cycle Slot 1 Slot 2 Slot 3
1 LOAD LOAD
2 MULT
3 LOAD LOAD ADD
4 MULT
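
The repeating pattern in the schedule above can be reproduced with a small sketch. It assumes unit latencies and the issue offsets read off the table (two LOADs in the first cycle of an iteration, MULT one cycle later, ADD two cycles later), and it starts a new iteration every two cycles. The names `issue_offset` and `ops_issued_in_cycle` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Issue offsets (in cycles) of the four operations of one iteration of
 * s = s + X[i] * Y[i]: LOAD, LOAD, MULT, ADD. The offsets and the unit
 * latencies behind them are assumptions taken from the slide's table. */
enum { NUM_OPS = 4, II = 2 };
static const int issue_offset[NUM_OPS] = { 0, 0, 1, 2 };

/* Number of operations issued in a given 1-based cycle when
 * `iterations` loop iterations are started II cycles apart. */
int ops_issued_in_cycle(int cycle, int iterations)
{
    assert(cycle >= 1 && iterations >= 0);
    int count = 0;
    for (int k = 0; k < iterations; k++) {
        for (int op = 0; op < NUM_OPS; op++) {
            /* Iteration k starts at cycle 1 + k * II. */
            if (1 + k * II + issue_offset[op] == cycle)
                count++;
        }
    }
    return count;
}
```

With two iterations in flight, cycle 3 holds the two LOADs of the second iteration plus the ADD of the first, which is the overlap the table shows.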


5
Traditional Vectorization
for (i = 0; i < N; i += 2) { S[i:i+1] = X[i:i+1] * Y[i:i+1]; }
for (i = 0; i < N; i++) { s = s + S[i]; }
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VSTORE

Cycle Slot 1 Slot 2 Slot 3
1 LOAD
2 LOAD ADD
3
4

6
Vectorization without Distribution
for (i = 0; i < N; i += 2) {
  S = X[i:i+1] * Y[i:i+1];
  s = s + S[0];
  s = s + S[1];
}
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD
2 VLOAD
3 VMUL
4 VLOAD ADD
5 VLOAD ADD
6 VMUL
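
As a rough illustration of the loop on this slide, the sketch below writes it in plain C next to the scalar original. A two-element array stands in for the vector register, so this shows the structure of the transformation rather than real vector instructions. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: s = s + X[i] * Y[i]. */
double dot_scalar(const double *x, const double *y, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + x[i] * y[i];
    return s;
}

/* Vectorization without distribution, vector length 2: the multiply
 * works on a 2-element temporary (standing in for a vector register)
 * and the reduction stays scalar inside the same loop.
 * n must be even. */
double dot_vec2_no_distribution(const double *x, const double *y, size_t n)
{
    assert(n % 2 == 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i += 2) {
        double S[2];
        S[0] = x[i] * y[i];          /* lane 0 of the VMUL */
        S[1] = x[i + 1] * y[i + 1];  /* lane 1 of the VMUL */
        s = s + S[0];
        s = s + S[1];
    }
    return s;
}
```

Because the additions happen in the same order in both versions, the two functions return identical results for the same inputs.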

7
Selective Vectorization
for (i = 0; i < N; i += 2) {
  S = X[i:i+1] * [Y[i], Y[i+1]];
  s = s + S[0];
  s = s + S[1];
}
Cycle Slot 1 Slot 2 Slot 3
1 VLOAD LOAD
2 LOAD
3 VLOAD LOAD
4 VMUL LOAD
5 VLOAD LOAD ADD
6 VMUL LOAD ADD

8
Complications

Complex scheduling requirements
Particularly in statically scheduled machines
Memory alignment
Example assumes no communication cost
In reality, explicit operations required
Often through memory
Reserve critical resources
Potential long latency
Performance improvement still possible

9
Tomcatv main loop (50)
10
Tomcatv (SpecFP 95)
Issue Width 6
Memory Units 2
ALUs 4
FPUs 2
Vector Units 1
Vector Length 2
1.7x Speedup over Modulo Scheduling
Technique                 ALU  MEM  FPU  VEC
Modulo Scheduling           6   22   46    0
Full Vectorization          7   13    0   46
Selective Vectorization     7   27   19   27
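
One way to read the table is as operation counts per loop kernel, with the busiest resource class setting a lower bound on the initiation interval. The sketch below computes that bound for the machine listed on this slide. It rests on two assumptions that the slide does not state: that the counts are per kernel, and that the vectorized kernels cover two original iterations (vector length 2). Under that reading the numbers appear consistent with the 1.7x figure. The names `res_mii` and `cycles_per_iteration` are hypothetical.

```c
#include <assert.h>

/* Machine model from the slide: units available per resource class. */
enum { ALU = 0, MEM = 1, FPU = 2, VEC = 3, NUM_CLASSES = 4 };
static const int units[NUM_CLASSES] = { 4, 2, 2, 1 };

/* Resource-constrained lower bound on II: each class needs
 * ceil(ops / units) cycles and the busiest class decides. */
int res_mii(const int ops[NUM_CLASSES])
{
    int bound = 0;
    for (int c = 0; c < NUM_CLASSES; c++) {
        assert(ops[c] >= 0);
        int cycles = (ops[c] + units[c] - 1) / units[c];  /* ceiling */
        if (cycles > bound)
            bound = cycles;
    }
    return bound;
}

/* Cycles per original iteration when one kernel covers
 * `iterations_per_kernel` original iterations (assumed to be 2 for
 * the vectorized rows, 1 for plain modulo scheduling). */
double cycles_per_iteration(const int ops[NUM_CLASSES], int iterations_per_kernel)
{
    assert(iterations_per_kernel > 0);
    return (double)res_mii(ops) / iterations_per_kernel;
}
```

With these assumptions the modulo-scheduled row gives 23 cycles per iteration (FPU bound), full vectorization gives 46 / 2 = 23 (vector unit bound), and selective vectorization gives 27 / 2 = 13.5, a ratio of about 1.7 against modulo scheduling.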
11
Tomcatv (SpecFP 95)
12
Selective Vectorization

Balance computation among resources
Minimize II when loop is modulo scheduled
Carefully manage communication
Incorporate alignment information
Software pipelining hides latency
Adapt a 2-cluster partitioning heuristic
Fiduccia and Mattheyses '82
Kernighan and Lin '70
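
To make the idea of a 2-cluster partitioning concrete, here is a deliberately simplified sketch. It is not the authors' algorithm and not a faithful Kernighan-Lin or Fiduccia-Mattheyses implementation (those also accept non-improving moves and keep the best prefix). It only flips one operation at a time between the scalar and vector partitions while the projected II strictly improves. The toy cost model (floating-point operations only, 2 FPUs, 1 vector unit, loop unrolled by 2, no communication cost) and all names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum { FP_UNITS = 2, VEC_UNITS = 1, UNROLL = 2 };

/* Toy cost: projected II of the unrolled-by-2 loop body. A scalar op
 * is replicated UNROLL times on the FPUs; a vector op occupies the
 * vector unit once. Communication cost is ignored in this sketch. */
int projected_ii(int scalar_ops, int vector_ops)
{
    int fp_cycles  = (scalar_ops * UNROLL + FP_UNITS - 1) / FP_UNITS;
    int vec_cycles = (vector_ops + VEC_UNITS - 1) / VEC_UNITS;
    return fp_cycles > vec_cycles ? fp_cycles : vec_cycles;
}

/* Greedy 2-way partitioning: start with every op scalar, then flip
 * vectorizable ops between partitions while the cost strictly drops.
 * On return in_vector[i] is true for ops placed in the vector
 * partition; the return value is the final projected II. */
int partition_ops(const bool vectorizable[], bool in_vector[], int n)
{
    assert(n >= 0);
    int scalar_ops = n, vector_ops = 0;
    for (int i = 0; i < n; i++)
        in_vector[i] = false;

    int best = projected_ii(scalar_ops, vector_ops);
    bool improved = true;
    while (improved) {
        improved = false;
        for (int i = 0; i < n; i++) {
            if (!vectorizable[i])
                continue;
            /* Tentatively flip op i to the other partition. */
            int s = in_vector[i] ? scalar_ops + 1 : scalar_ops - 1;
            int v = in_vector[i] ? vector_ops - 1 : vector_ops + 1;
            int cost = projected_ii(s, v);
            if (cost < best) {  /* keep only strictly improving moves */
                in_vector[i] = !in_vector[i];
                scalar_ops = s;
                vector_ops = v;
                best = cost;
                improved = true;
            }
        }
    }
    return best;
}
```

For four vectorizable operations the all-scalar and all-vector assignments both cost 4 cycles in this model, while the sketch settles on two operations in each partition at a cost of 2, which is the balancing effect the slide describes.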

13
Selective Vectorization
scalar
vector
cost
14
Cost Function

Projected II due to resources (ResMII)
Bin-packing approach (Rau, MICRO '94)
With some modifications
Can ignore operation latency
Software pipelining hides latency
Vectorizable ops not on dependence cycles

for (i = 0; i < N; i++) { X[i+4] = X[i]; }
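
One possible reading of the example loop is that its loop-carried dependence has distance 4, which is larger than the vector length of 2 used in this talk, so processing two elements at a time does not change the result. The sketch below checks that reading in plain C. The helper names and the general distance parameter `d` are hypothetical additions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar loop: X[i + d] = X[i]. The array must hold n + d elements. */
void shift_scalar(double *x, size_t n, size_t d)
{
    for (size_t i = 0; i < n; i++)
        x[i + d] = x[i];
}

/* Same loop with vector length 2: both lanes are loaded before either
 * lane is stored, as a vector load followed by a vector store would
 * behave. n must be even. */
void shift_vec2(double *x, size_t n, size_t d)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        double lane0 = x[i];
        double lane1 = x[i + 1];
        x[i + d]     = lane0;
        x[i + 1 + d] = lane1;
    }
}
```

With d = 4 the two versions produce the same array. With d = 1 they diverge, because the second lane reads an element that the scalar loop would already have overwritten.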
15
Evaluation
C or Fortran

SUIF front-end
Dependence analysis
Dataflow optimization
Trimaran back-end
Modulo scheduler
Register allocator
VLIW Simulator
Added vector ops

Simulation Binary
16
Evaluation

Operands communicated through memory
Software responsible for realignment

Issue Width 6
Memory Units 2
ALUs 4
FPUs 2
Vector Units 1
Vector Length 2
17
Evaluation

SpecFP 92, 95, 2000
Easier to extract dependence information
Detectable data parallelism
64-bit data means vector length of 2
Considered amenable to vectorization and software pipelining (SWP)
Apply selective vectorization to DO loops
No control flow, no function calls
Fully simulate with training sets

18
Traditional Vectorization
19
Vectorization without Distribution
20
Vectorization Free Communication
21
Vectorization without Distribution
22
Selective Vectorization
23
Selective Vectorization
tomcatv
mgrid
su2cor
swim
24
Communication Support

Transfer through memory
Register to register copy
Uses fewer issue slots
Frees memory resources
Shared register file
Vector elements addressable in scalar ops
Requires no extra issue slots

25
Through Memory
tomcatv
mgrid
su2cor
swim
26
Reg to Reg Transfer Support
tomcatv
mgrid
su2cor
swim
27
Shared Register File
tomcatv
mgrid
su2cor
swim
28
Related Work