Code Optimization - PowerPoint PPT Presentation

About This Presentation

Title:

Code Optimization

Description:

Fetch, decode and execute only according to the branch prediction ... Begins fetching and decoding instructions at correct target. jl .L24. jl-taken cc.1 ... – PowerPoint PPT presentation

Slides: 50

Provided by: binyu5

Transcript and Presenter's Notes

Title: Code Optimization

1
Code Optimization
2
Outline

Optimization Blockers
Memory aliasing
Side effects in function calls
Understanding Modern Processors
Superscalar
Out-of-order execution
Suggested reading
5.1, 5.7

3
Example

void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

4
Example

void combine2(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    *dest = IDENT;
    for (i = 0; i < length; i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

5
Example

void combine3(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

6
Example

void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

7
Machine Independent Opt. Results

Optimizations
Reduce function calls and memory references
within loop
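
To try the first and last versions side by side, a vector type is needed. The sketch below assumes a hypothetical minimal `vec_rec` implementation (the slides do not show one), `int` data throughout, and `IDENT`/`OPER` chosen for addition. `new_vec` and the struct layout are illustrative names, not the presenter's code.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed for illustration: sum of ints. The slides leave IDENT/OPER open. */
#define IDENT 0
#define OPER +

/* Hypothetical minimal vector ADT; the slides do not show its definition. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

vec_ptr new_vec(int len)
{
    assert(len >= 0);
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v) return NULL;
    v->len = len;
    v->data = calloc(len > 0 ? len : 1, sizeof(int));
    if (!v->data) { free(v); return NULL; }
    return v;
}

int vec_length(vec_ptr v) { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked read: returns 0 on a bad index, 1 on success. */
int get_vec_element(vec_ptr v, int i, int *dest)
{
    if (i < 0 || i >= v->len) return 0;
    *dest = v->data[i];
    return 1;
}

/* combine1: calls vec_length and get_vec_element on every iteration. */
void combine1(vec_ptr v, int *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* combine4: length hoisted, direct array access, local accumulator. */
void combine4(vec_ptr v, int *dest)
{
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
```

When `dest` does not point into the vector, both versions store the same result; combine4 simply does so with one function call before the loop and one memory write after it.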

8
Optimizing Compilers

Provide efficient mapping of program to machine
register allocation
code selection and ordering
eliminating minor inefficiencies

9
Optimizing Compilers

Don't (usually) improve asymptotic efficiency
up to programmer to select best overall algorithm
big-O savings are (often) more important than constant factors
but constant factors also matter
Have difficulty overcoming optimization blockers
potential memory aliasing
potential procedure side-effects

10
Optimization Blockers: Memory aliasing

void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}
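
Why a compiler cannot rewrite twiddle1 as twiddle2 shows up as soon as both pointers refer to the same location. The following is a small sketch; the helper names `twiddle_aliased`, `twiddle_distinct`, and `check_twiddle` are made up for illustration.

```c
#include <assert.h>

void twiddle1(int *xp, int *yp) { *xp += *yp; *xp += *yp; }
void twiddle2(int *xp, int *yp) { *xp += 2 * *yp; }

/* Hypothetical helper: call a twiddle with both pointers aimed at one int. */
int twiddle_aliased(void (*twiddle)(int *, int *), int start)
{
    int x = start;
    twiddle(&x, &x);
    return x;
}

/* Hypothetical helper: call a twiddle on two separate ints, return new x. */
int twiddle_distinct(void (*twiddle)(int *, int *), int x, int y)
{
    twiddle(&x, &y);
    return x;
}

/* Self-check: the versions agree without aliasing and differ with it. */
void check_twiddle(void)
{
    assert(twiddle_distinct(twiddle1, 1, 3) == 7);  /* 1 + 3 + 3 */
    assert(twiddle_distinct(twiddle2, 1, 3) == 7);  /* 1 + 2*3 */
    assert(twiddle_aliased(twiddle1, 1) == 4);      /* 1 -> 2 -> 4 */
    assert(twiddle_aliased(twiddle2, 1) == 3);      /* 1 + 2*1 */
}
```

With distinct locations the two functions are interchangeable; with `xp == yp` twiddle1 quadruples the value while twiddle2 triples it, so the compiler has to keep both memory reads in twiddle1.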

11
Optimization Blockers: Function call and side effect

int f(int);
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}
int func2(int x)
{
    return 4 * f(x);
}

12
Optimization Blockers: Function call and side effect

int counter = 0;
int f(int x)
{
    return counter++;
}
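
Putting this f together with func1 and func2 from the previous slide makes the difference concrete. This is a minimal sketch; `check_side_effects` is a made-up name, and the explicit `int x` parameters are an assumption about the intended signatures.

```c
#include <assert.h>

int counter = 0;

int f(int x)
{
    (void)x;            /* argument unused: the result depends on global state */
    return counter++;
}

int func1(int x) { return f(x) + f(x) + f(x) + f(x); }
int func2(int x) { return 4 * f(x); }

/* Self-check: starting from counter == 0 the two functions disagree. */
void check_side_effects(void)
{
    counter = 0;
    assert(func1(0) == 6);   /* the four calls return 0, 1, 2, 3 in some order */
    assert(counter == 4);

    counter = 0;
    assert(func2(0) == 0);   /* a single call returns 0 */
    assert(counter == 1);
}
```

Because f modifies `counter`, replacing four calls with one changes both the returned value and the final global state, so the compiler must assume the worst for any function it cannot see into.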

13
Optimization Blocker: Memory Aliasing

Aliasing
Two different memory references specify a single location
Example
v: [3, 2, 17]
combine3(v, get_vec_start(v)+2) --> ?
combine4(v, get_vec_start(v)+2) --> ?
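
One way to answer the two question marks is to run both loops on a plain array. The sketch below assumes the combining operation is multiplication (`IDENT` 1, `OPER` *), which the slide does not state, and uses hypothetical array-based variants `combine3_arr` and `combine4_arr` in place of the vector ADT.

```c
#include <assert.h>

/* Assumed for illustration: product of ints. */
#define IDENT 1
#define OPER *

/* combine3 on a plain array: accumulates through dest on every iteration. */
void combine3_arr(int *data, int length, int *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

/* combine4 on a plain array: accumulates in a local, writes dest once. */
void combine4_arr(int *data, int length, int *dest)
{
    int i;
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

/* Self-check with dest aliased to the last element of {3, 2, 17}. */
void check_aliasing(void)
{
    int a[3] = {3, 2, 17};
    int b[3] = {3, 2, 17};
    combine3_arr(a, 3, a + 2);  /* a[2]: 1 -> 3 -> 6 -> 6*6 = 36 */
    combine4_arr(b, 3, b + 2);  /* x:    1 -> 3 -> 6 -> 6*17 = 102 */
    assert(a[2] == 36);
    assert(b[2] == 102);
}
```

Under that assumption combine3 leaves the vector as [3, 2, 36] because it overwrites the last element before reading it, while combine4 leaves [3, 2, 102].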

14
Optimization Blocker: Memory Aliasing

Observations
Easy to have happen in C
Since allowed to do address arithmetic
Direct access to storage structures
Get in the habit of introducing local variables
Accumulating within loops
Your way of telling the compiler not to check for aliasing

15
Limitations of Optimizing Compilers

Operate Under Fundamental Constraint
Must not cause any change in program behavior under any possible condition
Often prevents the compiler from making optimizations that would only affect behavior under pathological conditions

16
Limitations of Optimizing Compilers

Behavior that may be obvious to the programmer
can be obfuscated by languages and coding styles
e.g., data ranges may be more limited than
variable types suggest
e.g., using an int in C for what could be an
enumerated type

17
Limitations of Optimizing Compilers

Most analysis is performed only within procedures
whole-program analysis is too expensive in most
cases
Most analysis is based only on static information
compiler has difficulty anticipating run-time
inputs
When in doubt, the compiler must be conservative

18
Modern CPU Design
19
Modern Processor

Superscalar
Perform multiple operations on every clock cycle
Out-of-order execution
The order in which the instructions execute need
not correspond to their ordering in the assembly
program

20
[Figure: block diagram of a modern processor. Instruction control: fetch control, instruction cache, instruction decode, retirement unit, register file. Execution: functional units (integer/branch, general integer, FP add, FP mult/div, load, store) and data cache. Labeled signals: address, instructions, operations, register updates, "prediction OK?", operation results, addr, data.]
21
Modern Processor

Two main parts
Instruction Control Unit
Reads a sequence of instructions from memory
Generates from these instructions a set of primitive operations to perform on program data
Execution Unit

22
Instruction Control Unit

Instruction Cache
A special, high speed memory containing the most
recently accessed instructions.

23
Instruction Control Unit

Instruction Decoding Logic
Take actual program instructions

24
Instruction Control Unit

Instruction Decoding Logic
Take actual program instructions
Converts them into a set of primitive operations
Each primitive operation performs some simple task
Simple arithmetic, Load, Store
addl %eax, 4(%edx)
load 4(%edx) → t1
addl %eax, t1 → t2
store t2, 4(%edx)
Register renaming

25
Fetch Control

Fetch Ahead
Fetches well ahead of currently accessed
instructions
ICU has enough time to decode these
ICU has enough time to send decoded operations
down to the EU

26
Fetch Control

Branch Prediction
Branch taken or fall through
Guess whether branch is taken or not
Speculative Execution
Fetch, decode and execute according to the branch prediction
Before the branch outcome has been determined
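
The slides do not say how the guess is made. As one illustration only, the sketch below uses a 2-bit saturating counter, a common textbook scheme swapped in here as an assumption, and counts how often it guesses wrong on a loop-closing branch. All names (`predictor`, `predict`, `update`, `loop_mispredictions`) are hypothetical.

```c
#include <assert.h>

/* 2-bit saturating counter: states 0,1 predict not taken; 2,3 predict taken.
   This scheme is an assumption for illustration, not taken from the slides. */
typedef struct { int state; } predictor;

int predict(const predictor *p) { return p->state >= 2; }

/* Move one step toward the observed outcome, saturating at 0 and 3. */
void update(predictor *p, int taken)
{
    if (taken && p->state < 3) p->state++;
    if (!taken && p->state > 0) p->state--;
}

/* Count mispredictions for a loop-closing branch that is taken
   (iterations - 1) times and then falls through once. */
int loop_mispredictions(int iterations, int initial_state)
{
    assert(iterations > 0 && initial_state >= 0 && initial_state <= 3);
    predictor p = { initial_state };
    int misses = 0;
    for (int i = 0; i < iterations; i++) {
        int taken = (i < iterations - 1);
        if (predict(&p) != taken) misses++;
        update(&p, taken);
    }
    return misses;
}
```

For a 100-iteration loop this model mispredicts only the final fall-through when it starts in the strongest "taken" state, and three times when it starts in the strongest "not taken" state. Each of those misses is a point where speculatively executed work would have to be discarded.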

27
Multi-functional Units

Multiple Instructions Can Execute in Parallel
1 load
1 store
2 integer (one may be branch)
1 FP Addition
1 FP Multiplication or Division

28
[Figure: execution unit. Functional units (integer/branch, general integer, FP add, FP mult/div, load, store) exchange operation results; the load and store units send addr and data to the data cache.]
29
Multi-functional Units

Some Instructions Take > 1 Cycle, but Can be Pipelined

Instruction                  Latency   Cycles/Issue
Load / Store                 3         1
Integer Multiply             4         1
Integer Divide               36        36
Double/Single FP Multiply    5         2
Double/Single FP Add         3         1
Double/Single FP Divide      38        38
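
The two columns matter in different situations. As a rough model, assuming a single functional unit and no other stalls, a chain of n operations that each need the previous result takes about n × latency cycles, while n independent operations take about latency + (n − 1) × issue cycles. The sketch below encodes that model; the struct and function names are made up for illustration, and the timings are copied from the table.

```c
#include <assert.h>

typedef struct {
    const char *name;
    int latency;   /* cycles from start of an operation to its result */
    int issue;     /* minimum cycles between starting two operations */
} op_timing;

/* Values copied from the latency table on this slide. */
static const op_timing INT_MUL = { "Integer Multiply", 4, 1 };
static const op_timing FP_MUL  = { "Double/Single FP Multiply", 5, 2 };
static const op_timing INT_DIV = { "Integer Divide", 36, 36 };

/* n operations where each needs the previous result: no overlap possible. */
long dependent_chain_cycles(op_timing op, long n)
{
    assert(n >= 0);
    return n * op.latency;
}

/* n independent operations on one unit: a new one starts every
   op.issue cycles, and the last one finishes op.latency cycles later. */
long independent_cycles(op_timing op, long n)
{
    assert(n >= 0);
    if (n == 0) return 0;
    return op.latency + (n - 1) * op.issue;
}
```

Under this model 100 dependent integer multiplies cost 400 cycles but 100 independent ones cost 103, whereas integer divide gains nothing from independence because its issue time equals its latency.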

30
Execution Unit

Receives operations from ICU
Each cycle it may receive more than one operation
Operations are queued in buffer

31
Execution Unit

Operation is dispatched to one of the functional units whenever
All the operands of the operation are ready
A suitable functional unit is available
Execution results are passed among functional units
Data Cache
A high-speed memory containing the most recently accessed data values

32
Retirement Unit

Instructions need to commit in serial order
Misprediction, exception
Updates architectural state
Memory and register values

33
Translation Example

Assembly:
.L24:                            # Loop:
    imull (%eax,%edx,4),%ecx     # t *= data[i]
    incl %edx                    # i++
    cmpl %esi,%edx               # i:length
    jl .L24                      # if < goto Loop

Translated operations:
    load (%eax,%edx.0,4) → t.1
    imull t.1, %ecx.0 → %ecx.1
    incl %edx.0 → %edx.1
    cmpl %esi, %edx.1 → cc.1
    jl-taken cc.1

34
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1

Split into two operations
Load reads from memory to generate temporary
result t.1
Multiply operation just operates on registers

35
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1

Operands
Register %eax does not change in loop. Its value will be retrieved from the register file during decoding

36
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1

Operands
Register %ecx changes on every iteration.
Uniquely identify different versions as %ecx.0, %ecx.1, %ecx.2, ...
Register renaming
Values passed directly from producer to consumers

37
Understanding Translation Example
incl %edx
incl %edx.0 → %edx.1

Register %edx changes on each iteration
Renamed as %edx.0, %edx.1, %edx.2, ...

38
Understanding Translation Example
cmpl %esi, %edx
cmpl %esi, %edx.1 → cc.1

Condition codes are treated similarly to registers
Assign tag to define connection between producer and consumer

39
Understanding Translation Example
jl .L24
jl-taken cc.1

Instruction control unit determines destination of jump
Predicts whether branch will be taken
Starts fetching instructions at predicted destination

40
Understanding Translation Example
jl .L24
jl-taken cc.1

Execution unit simply checks whether or not
prediction was OK
If not, it signals instruction control
Instruction control then invalidates any
operations generated from misfetched instructions
Begins fetching and decoding instructions at
correct target

41
Visualizing Operations
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
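
The tagging pattern continues the same way for later iterations: each iteration reads the versions produced by the previous one and creates the next. The sketch below generates the operation text for a given iteration under that assumption. `format_iteration` and `check_renaming` are hypothetical names, and `->` stands in for the arrow.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the five operations of loop iteration `iter` (1-based) into buf.
   Iteration k reads versions k-1 of %edx and %ecx and produces versions k.
   Returns the number of characters snprintf would have written. */
int format_iteration(int iter, char *buf, size_t size)
{
    assert(iter >= 1);
    int in = iter - 1;   /* version tag of the values consumed */
    return snprintf(buf, size,
        "load (%%eax,%%edx.%d,4) -> t.%d\n"
        "imull t.%d, %%ecx.%d -> %%ecx.%d\n"
        "incl %%edx.%d -> %%edx.%d\n"
        "cmpl %%esi, %%edx.%d -> cc.%d\n"
        "jl-taken cc.%d\n",
        in, iter,
        iter, in, iter,
        in, iter,
        iter, iter,
        iter);
}

/* Self-check: iteration 2 consumes the .1 versions and produces .2. */
void check_renaming(void)
{
    char buf[256];
    format_iteration(2, buf, sizeof buf);
    assert(strstr(buf, "load (%eax,%edx.1,4) -> t.2") != NULL);
    assert(strstr(buf, "imull t.2, %ecx.1 -> %ecx.2") != NULL);
    assert(strstr(buf, "jl-taken cc.2") != NULL);
}
```

Calling `format_iteration(1, buf, sizeof buf)` reproduces the five operations listed on this slide; the only link between consecutive iterations is through the tagged values, which is what lets the hardware overlap them.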