Title: More Code Optimization
1More Code Optimization
2Outline
- Memory Performance
- Tuning Performance
- Suggested reading
- 5.12 5.14
3Load Performance
- The load unit can initiate only one load operation per clock cycle (issue time 1.0)

    typedef struct ELE {
        struct ELE *next;
        int data;
    } list_ele, *list_ptr;

    int list_len(list_ptr ls)
    {
        int len = 0;
        while (ls) {
            len++;
            ls = ls->next;
        }
        return len;
    }

    # len in %eax, ls in %rdi
    .L11:
        addl  $1, %eax
        movq  (%rdi), %rdi
        testq %rdi, %rdi
        jne   .L11
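As a minimal sketch of how list_len might be exercised, the block below builds a short list and counts its nodes. The helpers push_front and free_list are hypothetical names introduced only for illustration; they are not part of the original slides. Note that each load of ls->next needs the pointer produced by the previous load, so successive loads in this loop cannot overlap.

```c
#include <stdlib.h>

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

/* Count the nodes; each load of ls->next depends on the previous one. */
int list_len(list_ptr ls)
{
    int len = 0;
    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

/* Hypothetical helper: prepend a node and return the new head.
   Returns the old head unchanged if allocation fails. */
list_ptr push_front(list_ptr head, int data)
{
    list_ptr node = malloc(sizeof(list_ele));
    if (node == NULL)
        return head;
    node->data = data;
    node->next = head;
    return node;
}

/* Hypothetical helper: release every node of the list. */
void free_list(list_ptr ls)
{
    while (ls) {
        list_ptr next = ls->next;
        free(ls);
        ls = next;
    }
}
```

Calling push_front five times on an initially NULL head and then list_len on the result should report a length of 5.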
4Store Performance
- The store unit can initiate only one store operation per clock cycle (issue time 1.0)

    void array_clear(int *dest, int n)
    {
        int i;
        for (i = 0; i < n; i++)
            dest[i] = 0;
    }
5Store Performance
- The store unit can initiate only one store operation per clock cycle (issue time 1.0)

    void array_clear_4(int *dest, int n)
    {
        int i;
        int limit = n - 3;
        for (i = 0; i < limit; i += 4) {
            dest[i] = 0;
            dest[i+1] = 0;
            dest[i+2] = 0;
            dest[i+3] = 0;
        }
        for (; i < n; i++)
            dest[i] = 0;
    }
6Store Performance
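    void write_read(int *src, int *dest, int n)
    {
        int cnt = n;
        int val = 0;

        while (cnt--) {
            *dest = val;
            val = (*src) + 1;
        }
    }

Example A: write_read(&a[0], &a[1], 3)
[Table: values of cnt, a, and val after each iteration]
Example B: write_read(&a[0], &a[0], 3)
[Table: values of cnt, a, and val after each iteration]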
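The difference between Examples A and B can be traced with a small instrumented variant. This is only a sketch: write_read_val is a hypothetical name, it returns the final val so the result can be inspected, and the starting array contents {-10, 17} are an assumption chosen for illustration.

```c
#include <stdio.h>

/* Same loop as write_read, but prints each iteration and returns the
   final val. Hypothetical variant for illustration only. */
int write_read_val(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;        /* store */
        val = (*src) + 1;   /* load: sees the store above when src == dest */
        printf("cnt=%d  *src=%d  *dest=%d  val=%d\n", cnt, *src, *dest, val);
    }
    return val;
}
```

Starting from a = {-10, 17}, the call for Example A (src = &a[0], dest = &a[1]) leaves a = {-10, -9} with val = -9, because the load never reads the stored location. The call for Example B (src = dest = &a[0]) leaves a = {2, 17} with val = 3, because every load reads the value written by the store just before it.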
7Load and Store Units
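[Figure: the load unit and the store unit both connect to the data cache through address and data lines. The store unit contains a store buffer of address/data entries, and the address of each load is checked against the buffered store addresses (matching addresses).]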
8Graphical Representation
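    //inner-loop
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }

    Instruction          Operations
    movl %eax,(%ecx)     s_addr, s_data
    movl (%ebx),%eax     load
    addl $1,%eax         add
    subl $1,%edx         sub
    jne loop             jne

[Figure: data-flow graph of the inner loop, with registers %eax, %ebx, %ecx, %edx at the top and bottom and the intermediate value t between load and add]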
9Graphical Representation
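[Figure: rearranged and simplified data-flow graph of the inner loop. Operations shown: s_addr, s_data, load, add, sub, jg. Registers shown: %eax, %ebx, %ecx, %edx, reduced to the loop registers %eax and %edx.]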
10Graphical Representation(book)
[Figure (book version): data-flow graphs for Example A and Example B over two iterations, each built from s_data, load, sub, and add operations, with the critical path marked]
11Graphical Representation
[Figure: data-flow graphs for Example A and Example B over two iterations, each built from s_data, load, sub, and add operations, with the critical path marked]
12Getting High Performance
- High-level design
- Choose appropriate algorithms and data structures for the problem at hand
- Be especially vigilant to avoid algorithms or coding techniques that yield asymptotically poor performance
13Getting High Performance
- Basic coding principles
- Avoid optimization blockers so that a compiler can generate efficient code
- Eliminate excessive function calls
- Move computations out of loops when possible
- Consider selective compromises of program modularity to gain greater efficiency
- Eliminate unnecessary memory references
- Introduce temporary variables to hold intermediate results
- Store a result in an array or global variable only when the final value has been computed
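The advice about temporary variables can be sketched as follows. Both function names are hypothetical and chosen only for illustration. The first version updates *dest on every iteration, so the compiler generally has to load and store it each time because dest might alias a; the second accumulates in a local variable and writes the result once.

```c
/* Accumulates directly into *dest: *dest is read and written on
   every iteration. */
void sum_to_dest(const int *a, int n, int *dest)
{
    int i;
    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += a[i];
}

/* Accumulates in a local temporary and stores the final value once. */
void sum_with_temp(const int *a, int n, int *dest)
{
    int i;
    int acc = 0;
    for (i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;
}
```

The two versions give the same result as long as dest does not point into a. If it does, their behavior can differ, which is exactly why the compiler cannot make this transformation on its own.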
14Getting High Performance
- Low-level optimizations
- Unroll loops to reduce overhead and to enable further optimizations
- Find ways to increase instruction-level parallelism by techniques such as multiple accumulators and reassociation
- Rewrite conditional operations in a functional style to enable compilation via conditional data transfers
- Write cache-friendly code
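A minimal sketch of loop unrolling combined with multiple accumulators is shown below. The function name sum_2x2 is hypothetical, and the choice of an integer sum is an assumption made to keep the example exact. The two accumulators form independent dependency chains, so the additions of one need not wait for the other.

```c
/* Sum an array with 2x unrolling and two independent accumulators. */
long sum_2x2(const long *a, int n)
{
    long acc0 = 0;
    long acc1 = 0;
    int i;
    int limit = n - 1;

    /* Process two elements per iteration */
    for (i = 0; i < limit; i += 2) {
        acc0 += a[i];
        acc1 += a[i + 1];
    }

    /* Finish any remaining element */
    for (; i < n; i++)
        acc0 += a[i];

    return acc0 + acc1;
}
```

For floating-point data the same transformation changes the order of the additions, so the result may differ slightly from the sequential sum because of rounding.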
15Performance Tuning
16Performance Tuning
- Identify which part of the program is the hottest
- Use a very useful method: profiling
- Instrument the program
- Run it with typical input data
- Collect information from the result
- Analyze the result
17Program Example
- Task
- Analyze the n-gram statistics of a text document
- An n-gram is a sequence of n words occurring in a document
- The program reads a text file,
- creates a table of unique n-grams,
- specifying how many times each one occurs, and
- sorts the n-grams in descending order of occurrence
18Program Example
- Steps
- Convert strings to lowercase
- Apply hash function
- Read n-grams and insert into hash table
- Mostly list operations
- Maintain counter for each unique n-gram
- Sort results
- Data Set
- Collected works of Shakespeare
- 965,028 total words, 23,706 unique
- N = 2, called bigrams
- 363,039 unique bigrams
19Examples Timing
    unix> gcc -O1 -pg prog.c -o prog
    unix> ./prog file.txt
    unix> gprof prog

      %   cumulative    self               self    total
     time   seconds    seconds     calls  s/call  s/call  name
    97.58    173.05     173.05         1  173.05  173.05  sort_words
     2.36    177.24       4.19    965027    0.00    0.00  find_ele_rec
     0.12    177.46       0.22  12511031    0.00    0.00  Strlen
20Principle
- Interval counting
- Maintain a counter for each function
- Record the time spent executing this function
- The program is interrupted at regular intervals (1 ms)
- Check which function is executing when the interrupt occurs
- Increment the counter for this function
- The calling information is quite reliable
- By default, the timings for library functions are not shown
21Example Calling History
    index  % time   self  children   called               name
                                     158655725            find_ele_rec [5]
                    4.19    0.02     965027/965027        insert_string [4]
    [5]      2.4    4.19    0.02     965027+158655725     find_ele_rec [5]
                    0.01    0.01     363039/363039        new_ele [10]
                    0.00    0.01     363039/363039        save_string [13]
                                     158655725            find_ele_rec [5]

- Ratio: 158655725/965027 = 164.4
- The average length of a list in one hash bucket is 164
22Code Optimizations
- First step: use a more efficient sorting function
- Library function qsort
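A minimal sketch of sorting n-gram records with the standard qsort function follows. The record type ngram_entry and the names compare_desc and sort_ngrams are hypothetical; the slides do not show how the program stores its results.

```c
#include <stdlib.h>

/* Hypothetical record: one n-gram and its occurrence count. */
typedef struct {
    const char *text;
    int count;
} ngram_entry;

/* Comparison callback for qsort: larger counts sort first. */
static int compare_desc(const void *pa, const void *pb)
{
    const ngram_entry *a = pa;
    const ngram_entry *b = pb;
    /* Comparing instead of subtracting avoids integer overflow. */
    return (a->count < b->count) - (a->count > b->count);
}

/* Sort the entries in descending order of occurrence. */
void sort_ngrams(ngram_entry *entries, size_t n)
{
    qsort(entries, n, sizeof(ngram_entry), compare_desc);
}
```

qsort does not guarantee a stable order, so entries with equal counts may appear in any relative order.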
23Further Optimizations
24Optimizations
- Iter first: use an iterative function to insert elements in the linked list
- Causes code to slow down
- Iter last: iterative function, places new entry at end of list
- Tends to place most common words at front of list
- Big table: increase number of hash buckets
- Better hash: use a more sophisticated hash function
- Linear lower: move strlen out of loop
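The "Iter last" variant might look roughly like the sketch below. Everything here is an assumption made for illustration: the slides do not show the real element type or function, so h_rec, h_ptr, and find_ele_iter_last are hypothetical names. The function walks one bucket's list iteratively, increments the counter if the word is found, and otherwise appends a new entry at the end.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical hash bucket element. */
typedef struct HELE {
    char *word;
    int freq;
    struct HELE *next;
} h_rec, *h_ptr;

/* Look up word in the list; bump its count if present, otherwise
   append a new entry at the end. Returns the (possibly new) head.
   On allocation failure the list is returned unchanged. */
h_ptr find_ele_iter_last(h_ptr head, const char *word)
{
    h_ptr ele = head;
    h_ptr last = NULL;

    while (ele) {
        if (strcmp(ele->word, word) == 0) {
            ele->freq++;
            return head;
        }
        last = ele;
        ele = ele->next;
    }

    ele = malloc(sizeof(h_rec));
    if (ele == NULL)
        return head;
    ele->word = malloc(strlen(word) + 1);
    if (ele->word == NULL) {
        free(ele);
        return head;
    }
    strcpy(ele->word, word);
    ele->freq = 1;
    ele->next = NULL;

    if (last)
        last->next = ele;   /* append after the current tail */
    else
        head = ele;         /* list was empty */
    return head;
}
```

Because new entries go to the end, words that appear early in the text stay near the front of their list, which fits the observation that common words tend to end up at the front.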
25Code Motion
    #include <string.h>

    /* Convert string to lowercase: slow */
    void lower1(char *s)
    {
        int i;

        /* strlen(s) is re-evaluated on every iteration */
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
26Code Motion
    #include <string.h>

    /* Convert string to lowercase: faster */
    void lower2(char *s)
    {
        int i;
        int len = strlen(s);    /* computed once, outside the loop */

        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }
27Code Motion
    /* Sample implementation of library function strlen */
    /* Compute length of string */
    size_t strlen(const char *s)
    {
        int length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }
28Code Motion
29Performance Tuning
- Benefits
- Helps identify performance bottlenecks
- Especially useful for a complex system with many components
- Limitations
- Only shows performance for the data tested
- E.g., linear lower did not show a big gain, since words are short
- Quadratic inefficiency could remain lurking in code
- Timing mechanism is fairly crude
- Only works for programs that run for > 3 seconds
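32Limit: Amdahl's Law
- Let $\alpha$ be the fraction of the original execution time that is improved, and $k$ the factor by which it is improved
- $T_{new} = (1-\alpha)\,T_{old} + (\alpha\,T_{old})/k = T_{old}\,[(1-\alpha) + \alpha/k]$
- $S = T_{old}/T_{new} = 1/[(1-\alpha) + \alpha/k]$
- $S_{\infty} = 1/(1-\alpha)$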
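The speedup formula can be checked numerically with a short sketch. The function names amdahl_speedup and amdahl_limit are hypothetical, and the sample values (60% of the time improved by a factor of 3) are an assumption chosen for illustration.

```c
/* Overall speedup when a fraction alpha of the run time is
   improved by a factor k: S = 1 / ((1 - alpha) + alpha / k). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Upper bound on the speedup as k grows without limit:
   S = 1 / (1 - alpha). */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

With alpha = 0.6 and k = 3, the overall speedup is 1/(0.4 + 0.2), about 1.67, even though the improved part became three times faster. The limit for alpha = 0.6 is 1/0.4 = 2.5, no matter how large k becomes.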
33Profiling Tools
- Unix
- gprof
- Intel's VTune
- Valgrind
- Windows
- Intel's VTune
34Next