Code Optimization - PowerPoint PPT Presentation

Provided by: binyu5
Title: Code Optimization


1
Code Optimization
2
Outline
  • Optimization Blockers
  • Memory alias
  • Side effect in function call
  • Understanding Modern Processor
  • Super-scalar
  • Out-of-order execution
  • Suggested reading
  • 5.1, 5.7

3
Example
    void combine1(vec_ptr v, data_t *dest)
    {
        int i;
        *dest = IDENT;
        for (i = 0; i < vec_length(v); i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

4
Example
    void combine2(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        *dest = IDENT;
        for (i = 0; i < length; i++) {
            int val;
            get_vec_element(v, i, &val);
            *dest = *dest OPER val;
        }
    }

5
Example
    void combine3(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        *dest = IDENT;
        for (i = 0; i < length; i++)
            *dest = *dest OPER data[i];
    }

6
Example
    void combine4(vec_ptr v, int *dest)
    {
        int i;
        int length = vec_length(v);
        int *data = get_vec_start(v);
        int x = IDENT;
        for (i = 0; i < length; i++)
            x = x OPER data[i];
        *dest = x;
    }

7
Machine Independent Opt. Results
  • Optimizations
  • Reduce function calls and memory references
    within loop
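The combine functions above rely on a vector abstraction that the slides only name. The following is a minimal sketch of how it could look, assuming `int` elements and sum as the combining operation (`IDENT` is 0, `OPER` is `+`). The struct layout and the helpers `new_vec` and `free_vec` are hypothetical and exist only to make the sketch self-contained; `combine1` uses `data_t val` so the sketch stays type-consistent.

```c
#include <stdlib.h>

/* Assumptions for illustration: int elements, sum as the operation. */
typedef int data_t;
#define IDENT 0
#define OPER +

/* Hypothetical vector layout; the slides only name vec_ptr. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

/* Hypothetical constructor: zero-filled vector, NULL on failure. */
vec_ptr new_vec(int len)
{
    vec_ptr v = malloc(sizeof(vec_rec));
    if (v == NULL)
        return NULL;
    v->len = len;
    v->data = calloc(len > 0 ? (size_t)len : 1, sizeof(data_t));
    if (v->data == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

void free_vec(vec_ptr v)
{
    if (v != NULL) {
        free(v->data);
        free(v);
    }
}

int vec_length(vec_ptr v) { return v->len; }

data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Bounds-checked read: returns 1 on success, 0 if out of range. */
int get_vec_element(vec_ptr v, int index, data_t *dest)
{
    if (index < 0 || index >= v->len)
        return 0;
    *dest = v->data[index];
    return 1;
}

/* Baseline: function calls and a memory update on every iteration. */
void combine1(vec_ptr v, data_t *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < vec_length(v); i++) {
        data_t val;
        get_vec_element(v, i, &val);
        *dest = *dest OPER val;
    }
}

/* Optimized: calls hoisted out of the loop, local accumulator. */
void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *data = get_vec_start(v);
    data_t x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
```

When `dest` does not point into the vector, both versions compute the same result; `combine4` just does less work per iteration.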

8
Optimizing Compilers
  • Provide efficient mapping of program to machine
  • register allocation
  • code selection and ordering
  • eliminating minor inefficiencies

9
Optimizing Compilers
  • Don't (usually) improve asymptotic efficiency
  • up to programmer to select best overall algorithm
  • big-O savings are (often) more important than
    constant factors
  • but constant factors also matter
  • Have difficulty overcoming optimization
    blockers
  • potential memory aliasing
  • potential procedure side-effects

10
Optimization Blockers: Memory Aliasing
    void twiddle1(int *xp, int *yp)
    {
        *xp += *yp;
        *xp += *yp;
    }
    void twiddle2(int *xp, int *yp)
    {
        *xp += 2 * *yp;
    }
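A short sketch of why a compiler cannot rewrite twiddle1 as twiddle2: the two agree when `xp` and `yp` point to different objects, but differ when they alias. The helper `twiddle_aliased` is hypothetical and only exists to exercise the aliased case.

```c
/* Adds *yp to *xp twice, re-reading *yp each time. */
void twiddle1(int *xp, int *yp)
{
    *xp += *yp;
    *xp += *yp;
}

/* Adds twice *yp to *xp with a single read of *yp. */
void twiddle2(int *xp, int *yp)
{
    *xp += 2 * *yp;
}

/* Hypothetical helper: calls the chosen version with xp == yp. */
int twiddle_aliased(int start, int use_twiddle2)
{
    int x = start;
    if (use_twiddle2)
        twiddle2(&x, &x);
    else
        twiddle1(&x, &x);
    return x;
}
```

With `xp == yp`, twiddle1 doubles the value twice (x becomes 4x), while twiddle2 triples it (x becomes 3x).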

11
Optimization Blockers: Function Calls and Side Effects
    int f(int);
    int func1(int x)
    {
        return f(x) + f(x) + f(x) + f(x);
    }
    int func2(int x)
    {
        return 4 * f(x);
    }

12
Optimization Blockers: Function Calls and Side Effects
    int counter = 0;
    int f(int x)
    {
        return counter++;
    }
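Putting the two slides together, a minimal sketch shows that func1 and func2 are not interchangeable once f has a side effect. The definitions follow the slides; only the `(void)x` line is added to silence an unused-parameter warning.

```c
int counter = 0;

/* Returns the current counter value, then increments it (side effect). */
int f(int x)
{
    (void)x; /* the argument is unused in this example */
    return counter++;
}

/* Four calls: the results 0, 1, 2, 3 are summed in some order. */
int func1(int x)
{
    return f(x) + f(x) + f(x) + f(x);
}

/* One call: only the first counter value is used. */
int func2(int x)
{
    return 4 * f(x);
}
```

Starting from `counter == 0`, func1 returns 0 + 1 + 2 + 3 = 6 and leaves `counter` at 4, while func2 returns 4 * 0 = 0 and leaves `counter` at 1. The compiler therefore cannot replace one with the other.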

13
Optimization Blocker: Memory Aliasing
  • Aliasing
  • Two different memory references specify a single
    location
  • Example
  • v = [3, 2, 17]
  • combine3(v, get_vec_start(v)+2) --> ?
  • combine4(v, get_vec_start(v)+2) --> ?
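The example can be traced with a small sketch. It assumes product as the combining operation (`IDENT` is 1, `OPER` is `*`), which the slide does not state, and uses hypothetical array-based stand-ins `combine3_array` and `combine4_array` in place of the vector versions.

```c
/* Assumption: product is the combining operation. */
#define IDENT 1
#define OPER *

/* Stand-in for combine3: updates *dest on every iteration. */
void combine3_array(int *data, int length, int *dest)
{
    int i;
    *dest = IDENT;
    for (i = 0; i < length; i++)
        *dest = *dest OPER data[i];
}

/* Stand-in for combine4: accumulates in a local, stores once. */
void combine4_array(int *data, int length, int *dest)
{
    int i;
    int x = IDENT;
    for (i = 0; i < length; i++)
        x = x OPER data[i];
    *dest = x;
}
```

Under these assumptions, with `dest` pointing at the last element of [3, 2, 17], the combine3 stand-in overwrites that element as it goes ([3, 2, 1], [3, 2, 3], [3, 2, 6], [3, 2, 36]), while the combine4 stand-in reads the original 17 and stores 3 * 2 * 17 = 102 at the end.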

14
Optimization Blocker: Memory Aliasing
  • Observations
  • Easy to have happen in C
  • Since allowed to do address arithmetic
  • Direct access to storage structures
  • Get in habit of introducing local variables
  • Accumulating within loops
  • Your way of telling compiler not to check for
    aliasing

15
Limitations of Optimizing Compilers
  • Operate Under Fundamental Constraint
  • Must not cause any change in program behavior
    under any possible condition
  • Often prevents it from making optimizations that
    would only affect behavior under pathological
    conditions.

16
Limitations of Optimizing Compilers
  • Behavior that may be obvious to the programmer
    can be obfuscated by languages and coding styles
  • e.g., data ranges may be more limited than
    variable types suggest
  • e.g., using an int in C for what could be an
    enumerated type

17
Limitations of Optimizing Compilers
  • Most analysis is performed only within procedures
  • whole-program analysis is too expensive in most
    cases
  • Most analysis is based only on static information
  • compiler has difficulty anticipating run-time
    inputs
  • When in doubt, the compiler must be conservative

18
Modern CPU Design
19
Modern Processor
  • Superscalar
  • Perform multiple operations on every clock cycle
  • Out-of-order execution
  • The order in which the instructions execute need
    not correspond to their ordering in the assembly
    program

20
[Figure: block diagram of a modern processor. Instruction control: fetch control, instruction cache, instruction decode, retirement unit, register file. Execution: functional units (integer/branch, general integer, FP add, FP mult/div, load, store) and data cache. Labeled paths: address, instructions, operations, register updates, "prediction OK?", operation results, addr, data.]
21
Modern Processor
  • Two main parts
  • Instruction Control Unit
  • Responsible for reading a sequence of
    instructions from memory
  • Generating from above instructions a set of
    primitive operations to perform on program data
  • Execution Unit

22
Instruction Control Unit
  • Instruction Cache
  • A special, high speed memory containing the most
    recently accessed instructions.

23
Instruction Control Unit
  • Instruction Decoding Logic
  • Take actual program instructions

24
Instruction Control Unit
  • Instruction Decoding Logic
  • Take actual program instructions
  • Converts them into a set of primitive operations
  • Each primitive operation performs some simple
    task
  • Simple arithmetic, Load, Store
  • addl %eax, 4(%edx)
  • load 4(%edx) → t1
  • addl %eax, t1 → t2
  • store t2, 4(%edx)
  • Register renaming
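The decoding of `addl %eax, 4(%edx)` into three primitive operations can be mirrored in C. This is only an illustrative sketch: the function name `addl_to_memory` is hypothetical, and it assumes `%edx` points at an array of 4-byte ints, so that `4(%edx)` corresponds to `edx[1]`.

```c
/* Models addl %eax, 4(%edx) as load, add, store.
 * Assumption: int is 4 bytes, so 4(%edx) is edx[1]. */
void addl_to_memory(int eax, int *edx)
{
    int t1 = edx[1];    /* load  4(%edx)  -> t1 */
    int t2 = eax + t1;  /* addl  %eax, t1 -> t2 */
    edx[1] = t2;        /* store t2, 4(%edx)    */
}
```

The temporaries `t1` and `t2` play the role of the tags that carry values between the primitive operations.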

25
Fetch Control
  • Fetch Ahead
  • Fetches well ahead of currently accessed
    instructions
  • ICU has enough time to decode these
  • ICU has enough time to send decoded operations
    down to the EU

26
Fetch Control
  • Branch Prediction
  • Branch taken or fall through
  • Guess whether branch is taken or not
  • Speculative Execution
  • Fetch, decode and execute only according to the
    branch prediction
  • Before the branch outcome has been determined

27
Multi-functional Units
  • Multiple Instructions Can Execute in Parallel
  • 1 load
  • 1 store
  • 2 integer (one may be branch)
  • 1 FP Addition
  • 1 FP Multiplication or Division

28
[Figure: execution unit. Functional units: integer/branch, general integer, FP add, FP mult/div, load, store. The load and store units exchange addr and data with the data cache; operation results are passed among the units.]
29
Multi-functional Units
  • Some Instructions Take > 1 Cycle, but Can be
    Pipelined

    Instruction                  Latency   Cycles/Issue
    Load / Store                 3         1
    Integer Multiply             4         1
    Integer Divide               36        36
    Double/Single FP Multiply    5         2
    Double/Single FP Add         3         1
    Double/Single FP Divide      38        38
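The difference between latency and cycles per issue can be worked through with a small sketch. Both function names are hypothetical, and the model is deliberately simple: one functional unit, no other stalls.

```c
/* Cycles to finish n independent operations on one pipelined unit:
 * a new operation starts every `issue` cycles, and the last one
 * needs `latency` cycles to complete. */
long independent_cycles(long latency, long issue, long n)
{
    if (n <= 0)
        return 0;
    return (n - 1) * issue + latency;
}

/* Cycles to finish n operations where each needs the previous
 * result: nothing overlaps, so the latencies add up. */
long dependent_cycles(long latency, long n)
{
    if (n <= 0)
        return 0;
    return n * latency;
}
```

Using the integer multiply row (latency 4, issue 1), ten independent multiplies finish in 9 * 1 + 4 = 13 cycles, while ten multiplies that each depend on the previous result take 10 * 4 = 40 cycles. Integer divide (36, 36) is not pipelined, so three independent divides still take 2 * 36 + 36 = 108 cycles.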

30
Execution Unit
  • Receives operations from ICU
  • Each cycle it may receive more than one operation
  • Operations are queued in buffer

31
Execution Unit
  • An operation is dispatched to one of the
    functional units whenever
  • All the operands of an operation are ready
  • Suitable functional units are available
  • Execution results are passed among functional
    units
  • Data Cache
  • A high speed memory containing the most recently
    accessed data values

32
Retirement Unit
  • Instructions need to commit in serial order
  • Misprediction / Exception
  • Updates architectural state
  • Memory and register values

33
Translation Example
.L24:                           # Loop:
  imull (%eax,%edx,4),%ecx      # t *= data[i]
  incl %edx                     # i++
  cmpl %esi,%edx                # i:length
  jl .L24                       # if < goto Loop

load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
34
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
  • Split into two operations
  • Load reads from memory to generate temporary
    result t.1
  • Multiply operation just operates on registers

35
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
  • Operands
  • Register %eax does not change in the loop. Its
    value is retrieved from the register file during
    decoding

36
Understanding Translation Example
imull (%eax,%edx,4),%ecx
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
  • Operands
  • Register %ecx changes on every iteration.
  • Uniquely identify different versions as
  • %ecx.0, %ecx.1, %ecx.2, ...
  • Register renaming
  • Values passed directly from producer to consumers
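The version tags can be pictured as a small table that maps each register to its current version number. The sketch below is a hypothetical model, not how any particular processor implements renaming: a source operand reads the current version, and a destination operand creates the next one.

```c
#include <stdio.h>

/* Hypothetical identifiers for the registers used in the loop. */
enum reg { EAX, EDX, ECX, ESI, NUM_REGS };

static const char *reg_names[NUM_REGS] = { "eax", "edx", "ecx", "esi" };

/* Current version tag of each register; all start at version 0. */
typedef struct {
    int version[NUM_REGS];
} rename_table;

void rename_init(rename_table *t)
{
    int r;
    for (r = 0; r < NUM_REGS; r++)
        t->version[r] = 0;
}

/* A source operand reads the latest version of the register. */
int rename_source(const rename_table *t, enum reg r)
{
    return t->version[r];
}

/* A destination operand creates a new version and returns its tag. */
int rename_dest(rename_table *t, enum reg r)
{
    return ++t->version[r];
}

/* One loop iteration: only imull (writes %ecx) and incl (writes %edx)
 * produce new versions; %eax and %esi are only read. */
void rename_iteration(rename_table *t)
{
    rename_dest(t, ECX);  /* imull t.k, %ecx.i -> %ecx.(i+1) */
    rename_dest(t, EDX);  /* incl  %edx.i      -> %edx.(i+1) */
}

/* Writes a tagged name such as "%ecx.1" into buf. */
void format_tag(char *buf, size_t size, enum reg r, int version)
{
    snprintf(buf, size, "%%%s.%d", reg_names[r], version);
}
```

After three calls to `rename_iteration`, the table holds version 3 for `%ecx` and `%edx`, while `%eax` and `%esi` stay at version 0 because the loop never writes them.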

37
Understanding Translation Example
incl %edx
incl %edx.0 → %edx.1
  • Register %edx changes on each iteration
  • Renamed as %edx.0, %edx.1, %edx.2, ...

38
Understanding Translation Example
cmpl %esi, %edx
cmpl %esi, %edx.1 → cc.1
  • Condition codes are treated similarly to registers
  • Assign tag to define connection between producer
    and consumer

39
Understanding Translation Example
jl .L24
jl-taken cc.1
  • Instruction control unit determines destination
    of jump
  • Predicts whether branch will be taken
  • Starts fetching instruction at predicted
    destination

40
Understanding Translation Example
jl .L24
jl-taken cc.1
  • Execution unit simply checks whether or not
    prediction was OK
  • If not, it signals instruction control
  • Instruction control then invalidates any
    operations generated from misfetched instructions
  • Begins fetching and decoding instructions at
    correct target
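To get a feel for how often the check on `jl .L24` fails, the sketch below counts mispredictions under a static "always predict taken" guess. That policy is an assumption made for illustration; the slides do not say how the prediction is made, and the function name `count_mispredictions` is hypothetical.

```c
/* Counts mispredictions of the loop-closing jl when every
 * execution of the branch is predicted taken (an assumed policy).
 * A non-positive length is treated as "loop not entered". */
int count_mispredictions(int length)
{
    int i = 0;
    int mispredictions = 0;
    if (length <= 0)
        return 0;
    do {
        int taken;
        int predicted_taken = 1;   /* static guess: branch is taken */
        i++;                       /* incl %edx */
        taken = (i < length);      /* cmpl %esi,%edx ; jl .L24 */
        if (taken != predicted_taken)
            mispredictions++;      /* operations fetched past jl are invalidated */
    } while (i < length);
    return mispredictions;
}
```

Under this assumption the branch is mispredicted exactly once, on the final iteration when the loop exits, regardless of the vector length.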

41
Visualizing Operations
load (%eax,%edx.0,4) → t.1
imull t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
  • Operations
  • Vertical position denotes time at which executed
  • Cannot begin operation until operands available
  • Height denotes latency
  • Operands
  • Arcs shown only for operands that are passed
    within execution unit

43
Visualizing Operations
load (%eax,%edx.0,4) → t.1
addl t.1, %ecx.0 → %ecx.1
incl %edx.0 → %edx.1
cmpl %esi, %edx.1 → cc.1
jl-taken cc.1
  • Operations
  • Vertical position denotes time at which executed
  • Cannot begin operation until operands available
  • Height denotes latency
  • Operands
  • Arcs shown only for operands that are passed
    within execution unit

44
[Figure: Iteration 1, Iteration 2, Iteration 3]
45
3 Iterations of Combining Product
  • Unlimited Resource Analysis
  • Assume operation can start as soon as operands
    available
  • Operations for multiple iterations overlap in
    time
  • Performance
  • Limiting factor becomes latency of integer
    multiplier
  • Gives CPE of 4.0

46
4 Iterations of Combining Sum
47
4 Iterations of Combining Sum
  • Unlimited Resource Analysis
  • Performance
  • Can begin a new iteration on each clock cycle
  • Should give CPE of 1.0
  • Would require executing 4 integer operations in
    parallel

48
Combining Sum Resource Constraints
49
Combining Sum Resource Constraints
  • Only have two integer functional units
  • Some operations delayed even though operands
    available
  • Set priority based on program order
  • Performance
  • Sustain CPE of 2.0
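The CPE figures on these slides can be reproduced with a simplified bound: the loop can go no faster than the latency of its loop-carried operation, and no faster than the integer units can absorb the integer operations of one iteration. The function `cpe_bound` is a hypothetical helper, and the latency of 1 cycle for `addl` is an assumption (the latency table does not list integer add).

```c
/* Simplified lower bound on cycles per element (CPE).
 * chain_latency:    latency of the operation on the loop-carried
 *                   dependency (4 for imull; 1 assumed for addl).
 * int_ops_per_iter: integer operations per iteration (combine,
 *                   incl, cmpl, jl = 4 in the slides' loop).
 * int_units:        number of integer functional units.
 * Returns -1 if int_units is not positive. */
int cpe_bound(int chain_latency, int int_ops_per_iter, int int_units)
{
    int resource_bound;
    if (int_units <= 0)
        return -1;
    /* Round up: a partially filled cycle still costs a full cycle. */
    resource_bound = (int_ops_per_iter + int_units - 1) / int_units;
    return chain_latency > resource_bound ? chain_latency : resource_bound;
}
```

With four integer units the sum loop is bounded by max(1, 1) = 1.0; with the two units actually available it is bounded by max(1, 2) = 2.0. The product loop stays at max(4, 2) = 4.0, limited by the multiplier latency rather than by the number of units.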