More Code Optimization

About This Presentation

Title:

More Code Optimization

Description:

Title: Introduction to Computer Systems Author: Binyu Zang Last modified by: Yi Li Created Date: 1/15/2000 7:54:11 AM Document presentation format – PowerPoint PPT presentation

Slides: 35

Provided by: Biny157

Transcript and Presenter's Notes

Title: More Code Optimization

1
More Code Optimization
2
Outline

Memory Performance
Tuning Performance
Suggested reading
Sections 5.12 to 5.14

3
Load Performance

The load unit can initiate only one load operation
per clock cycle (issue time 1.0)

typedef struct ELE {
    struct ELE *next;
    int data;
} list_ele, *list_ptr;

int list_len(list_ptr ls)
{
    int len = 0;

    while (ls) {
        len++;
        ls = ls->next;
    }
    return len;
}

len in %eax, ls in %rdi
.L11:
    addl  $1, %eax
    movq  (%rdi), %rdi
    testq %rdi, %rdi
    jne   .L11
4
Store Performance

The store unit can initiate only one store operation
per clock cycle (issue time 1.0)

void array_clear(int *dest, int n)
{
    int i;

    for (i = 0; i < n; i++)
        dest[i] = 0;
}
5
Store Performance

The store unit can initiate only one store operation
per clock cycle (issue time 1.0)

void array_clear_4(int *dest, int n)
{
    int i;
    int limit = n - 3;

    for (i = 0; i < limit; i += 4) {
        dest[i] = 0;
        dest[i+1] = 0;
        dest[i+2] = 0;
        dest[i+3] = 0;
    }
    for (; i < n; i++)
        dest[i] = 0;
}
6
Store Performance
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}
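Example A: write_read(&a[0], &a[1], 3)
(Figure: values of cnt, a, and val at each iteration)
Example B: write_read(&a[0], &a[0], 3)
(Figure: values of cnt, a, and val at each iteration)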
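The two calls can be traced with a small harness. This is only a sketch: the initial array contents {-10, 17} and the helper name run_example are assumptions chosen for illustration, since the values in the slide's figure did not survive in this transcript.

```c
#include <stdio.h>

/* Same function as on the slide: store val to *dest, then reload *src. */
void write_read(int *src, int *dest, int n)
{
    int cnt = n;
    int val = 0;

    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }
}

/* Hypothetical helper: runs one example on a fresh two-element array
   and copies the final contents into out[0..1]. The initial values
   -10 and 17 are assumed for illustration. */
void run_example(int same_address, int out[2])
{
    int a[2] = { -10, 17 };

    if (same_address)
        write_read(&a[0], &a[0], 3);   /* Example B: each load reads the previous store */
    else
        write_read(&a[0], &a[1], 3);   /* Example A: loads and stores use different addresses */

    out[0] = a[0];
    out[1] = a[1];
    printf("Example %s: a = [%d, %d]\n", same_address ? "B" : "A", a[0], a[1]);
}
```

In Example A the loaded value never changes, so val settles immediately. In Example B every load must wait for the data of the preceding store, so val grows by one per iteration.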
7
Load and Store Units
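(Figure: the store unit holds pending stores in a store buffer of address and data pairs; the load unit compares each load address against the buffered store addresses for a match; both units exchange addresses and data with the data cache.)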
8
Graphical Representation
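Inner loop:
    while (cnt--) {
        *dest = val;
        val = (*src) + 1;
    }

Registers: %eax (val), %ebx (src), %ecx (dest), %edx (cnt)

Instruction           Operations
movl %eax,(%ecx)      s_addr, s_data
movl (%ebx),%eax      load (result t)
addl $1,%eax          add
subl $1,%edx          sub
jne loop              jne

(Figure: data-flow graph of these operations, with registers %eax, %ebx, %ecx, %edx entering at the top and leaving at the bottom.)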
9
Graphical Representation
(Figure: simplified data-flow graph of the inner loop, showing only the operations s_addr, s_data, load, add, sub, and jg and the loop-carried registers %eax and %edx.)
10
Graphical Representation (book)
(Figure: data-flow graphs for Example A and Example B over successive iterations, built from the s_data, load, add, and sub operations, with the critical path marked.)
11
Graphical Representation
(Figure: data-flow graphs for Example A and Example B over successive iterations, built from the s_data, load, add, and sub operations, with the critical path marked.)
12
Getting High Performance

High-level design
Choose appropriate algorithms and data structures
for the problem at hand
Be especially vigilant to avoid algorithms or
coding techniques that yield asymptotically poor
performance

13
Getting High Performance

Basic coding principles
Avoid optimization blockers so that a compiler
can generate efficient code.
Eliminate excessive function calls
Move computations out of loops when possible
Consider selective compromises of program
modularity to gain greater efficiency
Eliminate unnecessary memory references.
Introduce temporary variables to hold
intermediate results
Store a result in an array or global variable
only when the final value has been computed.
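As an illustration of the last two points, here is a sketch of summing an array. The function names sum_to_memory and sum_to_local are hypothetical. The first version updates *dest on every iteration; the second accumulates in a local temporary and stores once.

```c
#include <stddef.h>

/* Reads and writes *dest on every iteration. Because dest might alias
   an element of data[], the compiler generally cannot keep the running
   total in a register. */
void sum_to_memory(const int *data, size_t n, int *dest)
{
    size_t i;

    *dest = 0;
    for (i = 0; i < n; i++)
        *dest += data[i];
}

/* Accumulates in a local temporary and stores the result once,
   after the final value has been computed. */
void sum_to_local(const int *data, size_t n, int *dest)
{
    size_t i;
    int acc = 0;

    for (i = 0; i < n; i++)
        acc += data[i];
    *dest = acc;
}
```

The two versions give the same result only when dest does not point into data, which is exactly the aliasing possibility that keeps a compiler from making this transformation on its own.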

14
Getting High Performance

Low-level optimizations
Unroll loops to reduce overhead and to enable
further optimizations
Find ways to increase instruction-level
parallelism by techniques such as multiple
accumulators and reassociation
Rewrite conditional operations in a functional
style to enable compilation via conditional data
transfers
Write cache-friendly code
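A minimal sketch of loop unrolling combined with multiple accumulators follows. The function name sum_two_accumulators is hypothetical, and an unrolling factor of 2 is chosen only to keep the example short.

```c
#include <stddef.h>

/* 2x unrolled sum with two independent accumulators. */
long sum_two_accumulators(const int *data, size_t n)
{
    size_t i;
    long acc0 = 0;
    long acc1 = 0;

    /* Main loop: two elements per iteration, each feeding its own
       dependency chain so the two additions can overlap. */
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += data[i];
        acc1 += data[i + 1];
    }
    /* Finish the remaining element when n is odd. */
    for (; i < n; i++)
        acc0 += data[i];

    return acc0 + acc1;
}
```

For integer addition the result matches a simple loop exactly. For floating-point data the changed order of operations can alter rounding, so the transformation needs more care there.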

15
Performance Tuning
16
Performance Tuning

Identify which part of the program is the hottest
Use a very useful method: profiling
Instrument the program
Run it with typical input data
Collect information from the result
Analyze the result

17
Program Example

Task
Analyze the n-gram statistics of a text
document
An n-gram is a sequence of n words occurring in a
document
The program reads a text file,
creates a table of unique n-grams
specifying how many times each one occurs, and
sorts the n-grams in descending order of
occurrence

18
Program Example

Steps
Convert strings to lowercase
Apply hash function
Read n-grams and insert into hash table
Mostly list operations
Maintain counter for each unique n-gram
Sort results
Data Set
Collected works of Shakespeare
965,028 total words, 23,706 unique
N = 2, called bigrams
363,039 unique bigrams
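The insertion step can be sketched as a chained hash table with one counter per unique n-gram. This is not the program measured in the slides: the names ngram_t, hash_string, and insert_ngram, the table size of 1021 buckets, and the character-sum hash are all assumptions made for illustration. Lowercase conversion is left out.

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1021   /* assumed table size, for illustration only */

typedef struct NGRAM {
    char *text;            /* the n-gram, e.g. "to be" */
    int count;             /* number of occurrences seen so far */
    struct NGRAM *next;    /* next entry in the same bucket */
} ngram_t;

static ngram_t *table[NBUCKETS];

/* Simple hash: sum of character codes modulo the table size. */
static unsigned hash_string(const char *s)
{
    unsigned h = 0;

    while (*s)
        h += (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Look up s in its bucket. If present, increment its counter;
   otherwise append a new entry with count 1 at the end of the list.
   Returns the updated count, or -1 if memory allocation fails. */
int insert_ngram(const char *s)
{
    ngram_t **link = &table[hash_string(s)];
    ngram_t *e;

    for (; *link != NULL; link = &(*link)->next)
        if (strcmp((*link)->text, s) == 0)
            return ++(*link)->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->text = malloc(strlen(s) + 1);
    if (e->text == NULL) {
        free(e);
        return -1;
    }
    strcpy(e->text, s);
    e->count = 1;
    e->next = NULL;
    *link = e;
    return 1;
}
```

Each lookup walks one bucket's list, so the cost of this step depends directly on how long those lists become.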

19
Example: Timing
unix> gcc -O1 -pg prog.c -o prog
unix> ./prog file.txt
unix> gprof prog

  %   cumulative     self                 self     total
 time    seconds  seconds      calls    s/call    s/call  name
97.58     173.05   173.05          1    173.05    173.05  sort_words
 2.36     177.24     4.19     965027      0.00      0.00  find_ele_rec
 0.12     177.46     0.22   12511031      0.00      0.00  Strlen
20
Principle

Interval counting
Maintain a counter for each function
Record the time spent executing this function
The program is interrupted at regular intervals (1 ms)
Check which function is executing when the interrupt
occurs
Increment the counter for this function
The calling information is quite reliable
By default, the timings for library functions are
not shown
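The bookkeeping behind interval counting can be sketched as a small simulation. This is not how gprof is implemented: the function estimate_times, the constant NUM_FUNCS, and the idea of passing in a pre-recorded array of samples are all assumptions made to keep the example self-contained.

```c
#include <stddef.h>

#define NUM_FUNCS 3          /* hypothetical number of profiled functions */
#define INTERVAL_SEC 0.001   /* 1 ms between interrupts, as on the slide */

/* running[t] is the index of the function that was executing when
   interrupt t occurred. Each interrupt charges one whole interval to
   that function; seconds[f] receives the resulting time estimate. */
void estimate_times(const int *running, size_t nticks,
                    double seconds[NUM_FUNCS])
{
    unsigned long counter[NUM_FUNCS] = { 0 };
    size_t t;
    int f;

    for (t = 0; t < nticks; t++)
        if (running[t] >= 0 && running[t] < NUM_FUNCS)
            counter[running[t]]++;

    for (f = 0; f < NUM_FUNCS; f++)
        seconds[f] = counter[f] * INTERVAL_SEC;
}
```

A function is charged a full interval whenever a sample lands in it, regardless of how long it actually ran within that interval, which is why the resulting times are only approximate for short runs.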

21
Example Calling History

index  % time    self  children    called               name
                                   158655725            find_ele_rec [5]
                 4.19      0.02    965027/965027        insert_string [4]
[5]       2.4    4.19      0.02    965027+158655725     find_ele_rec [5]
                 0.01      0.01    363039/363039        new_ele [10]
                 0.00      0.01    363039/363039        save_string [13]
                                   158655725            find_ele_rec [5]
Ratio: 158655725 / 965027 = 164.4 recursive calls per top-level call
On average, about 164 list elements are scanned
in a hash bucket for each lookup

22
Code Optimizations

First step: use a more efficient sorting function
Library function qsort
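A minimal sketch of sorting with the standard library's qsort follows. The record type entry_t and the names compare_count_desc and sort_entries are hypothetical; the slides do not show how the program stores its table.

```c
#include <stdlib.h>

/* Hypothetical record type: one entry per unique n-gram. */
typedef struct {
    const char *text;
    int count;
} entry_t;

/* Comparison function for qsort: larger counts sort first, giving
   descending order of occurrence. Returns negative, zero, or positive. */
static int compare_count_desc(const void *pa, const void *pb)
{
    const entry_t *a = pa;
    const entry_t *b = pb;

    return (a->count < b->count) - (a->count > b->count);
}

/* Sort n entries in place by descending count. */
void sort_entries(entry_t *entries, size_t n)
{
    qsort(entries, n, sizeof entries[0], compare_count_desc);
}
```

The comparison avoids subtracting the two counts directly, since that could overflow for values of opposite sign.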

23
Further Optimizations
24
Optimizations

Iter first: use an iterative function to insert
elements in the linked list
Causes the code to slow down
Iter last: iterative function that places each new entry
at the end of the list
Tends to place the most common words at the front of the list
Big table: increase the number of hash buckets
Better hash: use a more sophisticated hash function
Linear lower: move strlen out of the loop
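To make the "Better hash" item concrete, here is a sketch contrasting a character-sum hash with a shift-and-XOR hash. Both functions are illustrative assumptions; the slides do not give the hash functions that were actually measured.

```c
#include <stdint.h>

/* Simple hash: sum of character codes. Strings with the same letters
   in a different order always collide. nbuckets must be nonzero. */
unsigned hash_sum(const char *s, unsigned nbuckets)
{
    uint32_t h = 0;

    while (*s)
        h += (unsigned char)*s++;
    return h % nbuckets;
}

/* Illustrative shift-and-XOR hash: the running value is rotated left
   by 5 bits before each character is mixed in, so character position
   affects the result. nbuckets must be nonzero. */
unsigned hash_shift_xor(const char *s, unsigned nbuckets)
{
    uint32_t h = 0;

    while (*s)
        h = (h << 5) ^ (h >> 27) ^ (unsigned char)*s++;
    return h % nbuckets;
}
```

For example, "ab" and "ba" land in the same bucket under hash_sum but in different buckets under hash_shift_xor with 1021 buckets.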

25
Code Motion

#include <string.h>

/* Convert string to lowercase: slow */
void lower1(char *s)
{
    int i;

    /* strlen(s) is re-evaluated on every loop test */
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

26
Code Motion

#include <string.h>

/* Convert string to lowercase: faster */
void lower2(char *s)
{
    int i;
    int len = strlen(s);   /* computed once, outside the loop */

    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

27
Code Motion

#include <stddef.h>

/* Sample implementation of library function strlen */
/* Compute length of string */
size_t strlen(const char *s)
{
    int length = 0;

    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}
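Because strlen walks the whole string, calling it in every loop test makes lower1 quadratic in the string length, while lower2 stays linear. The sketch below makes that visible by counting characters examined instead of measuring time. The names counted_strlen, scans_lower1, and scans_lower2 are hypothetical instrumented variants, not part of the original program.

```c
#include <stddef.h>

static unsigned long chars_scanned;   /* characters examined by counted_strlen */

/* Same loop as the sample strlen, but counts every character it visits. */
static size_t counted_strlen(const char *s)
{
    size_t length = 0;

    while (*s != '\0') {
        s++;
        length++;
        chars_scanned++;
    }
    return length;
}

/* Length recomputed in every loop test, as in lower1.
   Returns the number of characters the length scans examined. */
unsigned long scans_lower1(char *s)
{
    size_t i;

    chars_scanned = 0;
    for (i = 0; i < counted_strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}

/* Length computed once before the loop, as in lower2. */
unsigned long scans_lower2(char *s)
{
    size_t i;
    size_t len;

    chars_scanned = 0;
    len = counted_strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
    return chars_scanned;
}
```

For a string of length n, the first variant examines n(n+1) characters in its length scans (n+1 loop tests of n characters each), and the second examines n.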

28
Code Motion
29
Performance Tuning

Benefits
Helps identify performance bottlenecks
Especially useful when you have a complex system with
many components
Limitations
Only shows performance for data tested
E.g., linear lower did not show big gain, since
words are short
Quadratic inefficiency could remain lurking in
code
Timing mechanism fairly crude
Only works for programs that run for > 3 seconds

30
Getting High Performance

High-level design
Choose appropriate algorithms and data structures
for the problem at hand
Be especially vigilant to avoid algorithms or
coding techniques that yield asymptotically poor
performance

31
Getting High Performance

Basic coding principles
Avoid optimization blockers so that a compiler
can generate efficient code.
Eliminate excessive function calls
Move computations out of loops when possible
Consider selective compromises of program
modularity to gain greater efficiency
Eliminate unnecessary memory references.
Introduce temporary variables to hold
intermediate results
Store a result in an array or global variable
only when the final value has been computed.

32
Limit: Amdahl's Law

$T_{new} = (1-\alpha)T_{old} + (\alpha T_{old})/k$
$T_{new} = T_{old}[(1-\alpha) + \alpha/k]$
$S = T_{old}/T_{new} = 1/[(1-\alpha) + \alpha/k]$
$S_{\infty} = 1/(1-\alpha)$
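The formulas can be evaluated with a few lines of C. In this sketch alpha is the fraction of the original running time that is improved and k is the factor by which that part is sped up; the function names amdahl_speedup and amdahl_limit are hypothetical.

```c
/* Overall speedup S = 1 / ((1 - alpha) + alpha / k).
   alpha: fraction of the original time affected (0 <= alpha <= 1);
   k: speedup factor of that part (k > 0). */
double amdahl_speedup(double alpha, double k)
{
    return 1.0 / ((1.0 - alpha) + alpha / k);
}

/* Limit of the speedup as k grows without bound; requires alpha < 1. */
double amdahl_limit(double alpha)
{
    return 1.0 / (1.0 - alpha);
}
```

For instance, speeding up 60% of a program by a factor of 3 gives an overall speedup of about 1.67, and no speedup of that 60% can push the overall figure past 2.5.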

33
Profiling Tools

Unix
gprof
Intel's VTune
Valgrind
Windows
Intel's VTune
