Title: Code Optimization
1Code Optimization
2Outline
- Optimizing Blockers
- Memory alias
- Side effect in function call
- Understanding Modern Processor
- Super-scalar
- Out-of-order execution
- Suggested reading
- 5.1, 5.7
3Example
- void combine1(vec_ptr v, data_t *dest)
- {
-     int i;
-     *dest = IDENT;
-     for (i = 0; i < vec_length(v); i++) {
-         int val;
-         get_vec_element(v, i, &val);
-         *dest = *dest OPER val;
-     }
- }
4Example
- void combine2(vec_ptr v, int *dest)
- {
-     int i;
-     int length = vec_length(v);
-     *dest = IDENT;
-     for (i = 0; i < length; i++) {
-         int val;
-         get_vec_element(v, i, &val);
-         *dest = *dest OPER val;
-     }
- }
5Example
- void combine3(vec_ptr v, int *dest)
- {
-     int i;
-     int length = vec_length(v);
-     int *data = get_vec_start(v);
-     *dest = IDENT;
-     for (i = 0; i < length; i++)
-         *dest = *dest OPER data[i];
- }
6Example
- void combine4(vec_ptr v, int *dest)
- {
-     int i;
-     int length = vec_length(v);
-     int *data = get_vec_start(v);
-     int x = IDENT;
-     for (i = 0; i < length; i++)
-         x = x OPER data[i];
-     *dest = x;
- }
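The difference between the first and last version can be tried out with a small self-contained sketch. The vector type below (`vec_rec`, `new_vec`, `free_vec`) and the helper `sum_1_to_n` are hypothetical stand-ins for the course's vector library, and `OPER`/`IDENT` are assumed to be `+` and `0`; the slides leave all of these abstract.

```c
#include <stdlib.h>

/* Assumed configuration: integer sum. The slides leave OPER and IDENT abstract. */
#define IDENT 0
#define OPER +

/* Hypothetical minimal vector type standing in for the slides' vec_ptr. */
typedef struct {
    int len;
    int *data;
} vec_rec, *vec_ptr;

vec_ptr new_vec(int len) {
    vec_ptr v = malloc(sizeof(vec_rec));
    if (!v) return NULL;
    v->len = len;
    v->data = calloc(len > 0 ? len : 1, sizeof(int));
    if (!v->data) { free(v); return NULL; }
    return v;
}

void free_vec(vec_ptr v) { if (v) { free(v->data); free(v); } }

int vec_length(vec_ptr v) { return v->len; }
int *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked read: returns 0 on a bad index, 1 on success. */
int get_vec_element(vec_ptr v, int index, int *dest) {
    if (index < 0 || index >= v->len) return 0;
    *dest = v->data[index];
    return 1;
}

/* combine1: calls vec_length and get_vec_element on every iteration. */
void combine1(vec_ptr v, int *dest) {
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        int val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* combine4: length hoisted, direct data access, local accumulator. */
void combine4(vec_ptr v, int *dest) {
    int i;
    int length = vec_length(v);
    int *data = get_vec_start(v);
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}

/* Demo helper: combines 1..n with the chosen version; -1 on allocation failure. */
int sum_1_to_n(int n, int use_combine4) {
    vec_ptr v = new_vec(n);
    int i, result;
    if (!v) return -1;
    for (i = 0; i < n; i++) v->data[i] = i + 1;
    if (use_combine4) combine4(v, &result);
    else combine1(v, &result);
    free_vec(v);
    return result;
}
```

Both versions compute the same value when `dest` does not point into the vector; combine4 simply does less work per iteration (no calls, no memory write inside the loop).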
7Machine Independent Opt. Results
- Optimizations
- Reduce function calls and memory references within the loop
8Optimizing Compilers
- Provide efficient mapping of program to machine
- register allocation
- code selection and ordering
- eliminating minor inefficiencies
9Optimizing Compilers
- Don't (usually) improve asymptotic efficiency
- up to programmer to select best overall algorithm
- big-O savings are (often) more important than constant factors
- but constant factors also matter
- Have difficulty overcoming optimization blockers
- potential memory aliasing
- potential procedure side-effects
10Optimization Blockers: Memory aliasing
- void twiddle1(int *xp, int *yp)
- {
-     *xp += *yp;
-     *xp += *yp;
- }
- void twiddle2(int *xp, int *yp)
- {
-     *xp += 2 * *yp;
- }
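Why a compiler cannot replace twiddle1 with twiddle2 becomes visible when both pointers refer to the same location. The sketch below assumes the usual reading of the two functions (add `*yp` to `*xp` twice versus add `2 * *yp` once); `twiddle_aliased` is a hypothetical helper added only for the demonstration.

```c
/* Adds *yp to *xp twice: *yp is re-read after the first update of *xp. */
void twiddle1(int *xp, int *yp) {
    *xp += *yp;
    *xp += *yp;
}

/* Adds 2 * *yp to *xp in a single update. */
void twiddle2(int *xp, int *yp) {
    *xp += 2 * *yp;
}

/* Hypothetical demo helper: calls the chosen version with xp == yp. */
int twiddle_aliased(int start, int version) {
    int x = start;
    if (version == 1) twiddle1(&x, &x);
    else twiddle2(&x, &x);
    return x;
}
```

With distinct locations both functions add twice the value of `*yp`. With `xp == yp`, twiddle1 doubles the value twice (a factor of 4) while twiddle2 triples it, so the two are not interchangeable in general.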
11Optimization Blockers: Function call and side effect
- int f(int);
- int func1(x)
- {
-     return f(x) + f(x) + f(x) + f(x);
- }
- int func2(x)
- {
-     return 4 * f(x);
- }
12Optimization Blockers: Function call and side effect
- int counter = 0;
- int f(int x)
- {
-     return counter++;
- }
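A minimal sketch of why the two callers differ once `f` has a side effect. It assumes `f` returns the old value of `counter` and then increments it, and it gives `func1` and `func2` an explicit `int x` parameter so the code compiles as modern C.

```c
int counter = 0;

/* f has a side effect: it modifies global state on every call. */
int f(int x) {
    (void)x;            /* argument is unused, as on the slide */
    return counter++;
}

/* Four calls: the results are 0, 1, 2, 3 in some order when counter starts at 0. */
int func1(int x) { return f(x) + f(x) + f(x) + f(x); }

/* One call: the result is 4 times a single return value of f. */
int func2(int x) { return 4 * f(x); }
```

Starting from `counter == 0`, `func1` returns 6 and leaves `counter` at 4, while `func2` returns 0 and leaves `counter` at 1. The order in which C evaluates the four calls in `func1` is unspecified, but the sum is the same for every order.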
13Optimization Blocker Memory Aliasing
- Aliasing
- Two different memory references specify a single location
- Example
- v: [3, 2, 17]
- combine3(v, get_vec_start(v)+2) --> ?
- combine4(v, get_vec_start(v)+2) --> ?
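The question can be explored with a simplified sketch that works on a plain array instead of `vec_ptr`. The names `combine3_array` and `combine4_array` are hypothetical, and the example assumes `OPER` is `*` and `IDENT` is `1`; with a different operation the numbers change but the effect is the same.

```c
/* Assumed configuration: integer product. */
#define IDENT 1
#define OPER *

/* combine3-style loop: accumulates directly in *dest (a memory location). */
void combine3_array(int *data, int length, int *dest) {
    int i;
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

/* combine4-style loop: accumulates in a local variable, writes *dest once. */
void combine4_array(int *data, int length, int *dest) {
    int i;
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
```

Calling each function on `{3, 2, 17}` with `dest` pointing at the last element gives different results under these assumptions. The combine3-style loop overwrites the last element as it goes (1, 3, 6, then 6 * 6) and ends with `{3, 2, 36}`; the combine4-style loop reads the original 17 and ends with `{3, 2, 102}`.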
14Optimization Blocker Memory Aliasing
- Observations
- Easy to have happen in C
- Since allowed to do address arithmetic
- Direct access to storage structures
- Get in the habit of introducing local variables
- Accumulating within loops
- Your way of telling the compiler not to check for aliasing
15Limitations of Optimizing Compilers
- Operate under a fundamental constraint
- Must not cause any change in program behavior under any possible condition
- Often prevents it from making optimizations that would only affect behavior under pathological conditions
16Limitations of Optimizing Compilers
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles
- e.g., data ranges may be more limited than variable types suggest
- e.g., using an int in C for what could be an enumerated type
17Limitations of Optimizing Compilers
- Most analysis is performed only within procedures
- whole-program analysis is too expensive in most cases
- Most analysis is based only on static information
- compiler has difficulty anticipating run-time inputs
- When in doubt, the compiler must be conservative
18Modern CPU Design
19Modern Processor
- Superscalar
- Perform multiple operations on every clock cycle
- Out-of-order execution
- The order in which the instructions execute need
not correspond to their ordering in the assembly
program
20[Figure: block diagram of a modern processor. Instruction control: fetch control, instruction cache, instruction decode, retirement unit, register file. Execution: functional units (integer/branch, general integer, FP add, FP mult/div, load, store) and data cache. Labeled signals: address, instructions, operations, register updates, "Prediction OK?", operation results, addr, data.]
21Modern Processor
- Two main parts
- Instruction Control Unit
- Responsible for reading a sequence of instructions from memory
- Generates from these instructions a set of primitive operations to perform on program data
- Execution Unit
22Instruction Control Unit
- Instruction Cache
- A special, high speed memory containing the most
recently accessed instructions.
23Instruction Control Unit
- Instruction Decoding Logic
- Take actual program instructions
24Instruction Control Unit
- Instruction Decoding Logic
- Take actual program instructions
- Converts them into a set of primitive operations
- Each primitive operation performs some simple task
- Simple arithmetic, Load, Store
- addl %eax, 4(%edx)
- load 4(%edx) -> t1
- addl %eax, t1 -> t2
- store t2, 4(%edx)
- Register renaming
25Fetch Control
- Fetch Ahead
- Fetches well ahead of currently accessed instructions
- ICU has enough time to decode these
- ICU has enough time to send decoded operations down to the EU
26Fetch Control
- Branch Prediction
- Branch taken or fall through
- Guess whether branch is taken or not
- Speculative Execution
- Fetch, decode and execute along the predicted path
- Before the branch outcome has been determined
27Multi-functional Units
- Multiple Instructions Can Execute in Parallel
- 1 load
- 1 store
- 2 integer (one may be branch)
- 1 FP Addition
- 1 FP Multiplication or Division
28Functional units
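[Figure: execution unit. Functional units: integer/branch, general integer, FP add, FP mult/div, load, store. Operation results are passed among the units; the load and store units exchange addr and data with the data cache.]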
29Multi-functional Units
- Some instructions take > 1 cycle, but can be pipelined
- Instruction                 Latency   Cycles/Issue
- Load / Store                3         1
- Integer Multiply            4         1
- Integer Divide              36        36
- Double/Single FP Multiply   5         2
- Double/Single FP Add        3         1
- Double/Single FP Divide     38        38
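The two columns can be read together with a small worked calculation. The function below is a sketch under stated assumptions: the operations are independent of each other, they all use one functional unit, and that unit is otherwise idle. The name `cycles_for_independent_ops` is hypothetical.

```c
/* Cycles to finish n independent operations on one pipelined functional unit:
   the first result is ready after `latency` cycles, and each later operation
   can be issued `issue` cycles after the previous one. */
long cycles_for_independent_ops(long n, long latency, long issue) {
    if (n <= 0) return 0;
    return latency + (n - 1) * issue;
}
```

With the table's numbers, ten independent integer multiplies take 4 + 9 * 1 = 13 cycles, ten FP multiplies take 5 + 9 * 2 = 23 cycles, and ten integer divides take 36 + 9 * 36 = 360 cycles, because the divider is not pipelined (its issue time equals its latency).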
30Execution Unit
- Receives operations from ICU
- Each cycle it may receive more than one operation
- Operations are queued in buffer
31Execution Unit
- Operation is dispatched to one of the functional units whenever
- All the operands of an operation are ready
- Suitable functional units are available
- Execution results are passed among functional units
- Data Cache
- A high speed memory containing the most recently accessed data values
32Retirement Unit
- Instructions need to commit in serial order
- Misprediction, exception
- Updates architectural state
- Memory and register values
33Translation Example
.L24:                             # Loop:
  imull (%eax,%edx,4),%ecx        # t *= data[i]
  incl %edx                       # i++
  cmpl %esi,%edx                  # i:length
  jl .L24                         # if < goto Loop
Translated operations:
load (%eax,%edx.0,4) -> t.1
imull t.1, %ecx.0 -> %ecx.1
incl %edx.0 -> %edx.1
cmpl %esi, %edx.1 -> cc.1
jl-taken cc.1
34Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) -> t.1
imull t.1, %ecx.0 -> %ecx.1
- Split into two operations
- Load reads from memory to generate temporary result t.1
- Multiply operation just operates on registers
35Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) -> t.1
imull t.1, %ecx.0 -> %ecx.1
- Operands
- Register %eax does not change in the loop. Its value is retrieved from the register file during decoding
36Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) -> t.1
imull t.1, %ecx.0 -> %ecx.1
- Operands
- Register %ecx changes on every iteration
- Uniquely identify different versions as
- %ecx.0, %ecx.1, %ecx.2, ...
- Register renaming
- Values passed directly from producer to consumers
37Understanding Translation Example
incl %edx
incl %edx.0 -> %edx.1
- Register %edx changes on each iteration
- Renamed as %edx.0, %edx.1, %edx.2, ...
38Understanding Translation Example
cmpl %esi, %edx
cmpl %esi, %edx.1 -> cc.1
- Condition codes are treated similarly to registers
- Assign tag to define connection between producer and consumer
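The tagging scheme for `%ecx`, `%edx`, the load temporary, and the condition codes can be mimicked in a few lines. This is an illustrative sketch, not how hardware implements renaming: `rename_state` and `rename_iteration` are hypothetical names, the `->` arrow stands for the slides' "produces" arrow, and the operation text is fixed to the loop from the translation example.

```c
#include <stdio.h>

/* Current version tag of each value that changes inside the loop. */
typedef struct {
    int ecx;  /* tag of the latest %ecx version */
    int edx;  /* tag of the latest %edx version */
    int t;    /* tag of the latest load temporary */
    int cc;   /* tag of the latest condition-code result */
} rename_state;

/* Writes the operations for one loop iteration into buf. Source operands use
   the tags that are current on entry; every destination gets a fresh tag.
   Returns the value reported by snprintf. */
int rename_iteration(rename_state *s, char *buf, size_t size) {
    int edx_old = s->edx;      /* source tags read before any update */
    int ecx_old = s->ecx;
    int t_new   = ++s->t;      /* fresh destination tags */
    int ecx_new = ++s->ecx;
    int edx_new = ++s->edx;
    int cc_new  = ++s->cc;
    return snprintf(buf, size,
        "load (%%eax,%%edx.%d,4) -> t.%d\n"
        "imull t.%d, %%ecx.%d -> %%ecx.%d\n"
        "incl %%edx.%d -> %%edx.%d\n"
        "cmpl %%esi, %%edx.%d -> cc.%d\n"
        "jl-taken cc.%d\n",
        edx_old, t_new,
        t_new, ecx_old, ecx_new,
        edx_old, edx_new,
        edx_new, cc_new,
        cc_new);
}
```

Starting from a zero-initialized `rename_state`, the first call produces the `.0 -> .1` operations shown on the slides, and each further call advances every tag by one, so the multiply of iteration 2 consumes `%ecx.1` and produces `%ecx.2`. Registers that never change in the loop (`%eax`, `%esi`) carry no tag.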
39Understanding Translation Example
jl .L24
jl-taken cc.1
- Instruction control unit determines destination of jump
- Predicts whether the branch will be taken
- Starts fetching instructions at the predicted destination
40Understanding Translation Example
jl .L24
jl-taken cc.1
- Execution unit simply checks whether or not prediction was OK
- If not, it signals instruction control
- Instruction control then invalidates any operations generated from misfetched instructions
- Begins fetching and decoding instructions at correct target
41Visualizing Operations
load (%eax,%edx.0,4) -> t.1
imull t.1, %ecx.0 -> %ecx.1
incl %edx.0 -> %edx.1
cmpl %esi, %edx.1 -> cc.1
jl-taken cc.1
- Operations
- Vertical position denotes time at which executed
- Cannot begin operation until operands available
- Height denotes latency
- Operands
- Arcs shown only for operands that are passed within execution unit
[Figure: diagram of the operations above along a vertical time axis.]
42Multi-functional Units
- Some instructions take > 1 cycle, but can be pipelined
- Instruction                 Latency   Cycles/Issue
- Load / Store                3         1
- Integer Multiply            4         1
- Integer Divide              36        36
- Double/Single FP Multiply   5         2
- Double/Single FP Add        3         1
- Double/Single FP Divide     38        38
43Visualizing Operations
load (%eax,%edx.0,4) -> t.1
addl t.1, %ecx.0 -> %ecx.1
incl %edx.0 -> %edx.1
cmpl %esi, %edx.1 -> cc.1
jl-taken cc.1
- Operations
- Vertical position denotes time at which executed
- Cannot begin operation until operands available
- Height denotes latency
- Operands
- Arcs shown only for operands that are passed within execution unit
44[Figure: operations of Iteration 1, Iteration 2, and Iteration 3.]
453 Iterations of Combining Product
- Unlimited Resource Analysis
- Assume operation can start as soon as operands available
- Operations for multiple iterations overlap in time
- Performance
- Limiting factor becomes latency of integer multiplier
- Gives CPE (cycles per element) of 4.0
464 Iterations of Combining Sum
474 Iterations of Combining Sum
- Unlimited Resource Analysis
- Performance
- Can begin a new iteration on each clock cycle
- Should give CPE of 1.0
- Would require executing 4 integer operations in
parallel
48Combining Sum Resource Constraints
49Combining Sum Resource Constraints
- Only have two integer functional units
- Some operations delayed even though operands available
- Set priority based on program order
- Performance
- Sustain CPE of 2.0
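The CPE figures quoted for the product and sum loops can be reproduced with a rough two-bound estimate. This is a simplified model, not a scheduler: it assumes each iteration issues four integer operations (the combining operation, `incl`, `cmpl`, `jl`), that an integer add has a latency of 1 cycle, and that the only limits are the loop-carried latency and the number of integer units. `estimate_cpe` is a hypothetical name.

```c
/* Simplified CPE estimate: the larger of
   - the latency bound: latency of the operation on the loop-carried chain, and
   - the throughput bound: integer operations per iteration divided by the
     number of integer functional units. */
double estimate_cpe(double chain_latency, int int_ops_per_iter, int int_units) {
    double throughput_bound = (double)int_ops_per_iter / int_units;
    return chain_latency > throughput_bound ? chain_latency : throughput_bound;
}
```

For the product loop the multiplier latency dominates: `estimate_cpe(4.0, 4, 2)` gives 4.0. For the sum loop with two integer units the throughput bound dominates: `estimate_cpe(1.0, 4, 2)` gives 2.0, and only with four integer units, as in the unlimited-resource analysis, does it drop to 1.0.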