Performance Analysis and Optimization - PowerPoint PPT Presentation

About This Presentation

Title:

Performance Analysis and Optimization

Slides: 60

Provided by: kayhan7

Transcript and Presenter's Notes

Title: Performance Analysis and Optimization

1
Performance Analysis and Optimization

Kayhan Atesci
www.auburn.edu/atescka

2
Overview

Theoretical Preliminaries
NP-Completeness
Analyzing RT Systems
The Halting Problem
Amdahl's Law
Gustafson's Law

3
Overview (cont.)

Performance Analysis
Execution Time Estimation
Instruction Counting
Instruction Execution-Time Simulators
Using the System Clock
Analysis of Polled Loops
Analysis of Coroutines
Analysis of Round-Robin Systems
Analysis of Fixed-Period Systems

4
Overview (cont.)

Performance Analysis (cont.)
Analysis of Sporadic and Aperiodic Interrupt
Systems
Interrupt Latency
Instruction Completion Times
Deterministic Performance

5
Intro to Theoretical Preliminaries

Performance Analysis is one of the fields where
theory and practice do not coincide
Formulas typically:
ignore resource contention
use theoretically artificial hardware
assume zero context switch time
Not totally useless, but less realistic

6
NP-Completeness

NP: the class of decision problems for which a candidate solution can be verified in polynomial time, even though no polynomial-time algorithm for finding a solution may be known
NP-Complete: a problem in the NP class to which all other problems in NP are polynomially transformable
NP-Hard: a problem, not necessarily in the NP class, to which all problems in NP are polynomially transformable

7
Challenges in Analyzing RTS

30 years of research, strict practical constraints, NP-Complete problems, etc.
Mutual exclusion constraints make it impossible to find an optimal run-time scheduler
Earliest-deadline scheduling is not optimal in multiprocessing
Prior knowledge of deadlines, computation times, and task start times is needed

8
Challenges (cont.)

NP-Complete and NP-Hard problems
Is it possible to schedule a set of periodic processes that use semaphores only to enforce mutual exclusion?
Multiprocessor scheduling problems with two or three processors, one or no resource, and an arbitrary or specified partial order

9
The Halting Problem

Is there a computer program that takes an arbitrary program, $P_i$, and an arbitrary set of inputs, $I_k$, and determines whether or not $P_i$ will halt on $I_k$?
No, there is not.

10
The Halting Problem (cont.)

Cantor's diagonal argument (cont.)

[Figure: an oracle takes the source code of an arbitrary program $P_i$ and the set of inputs to the program, $I_k$, and produces a halt or no-halt decision.]
11
The Halting Problem (cont.)
12
The Halting Problem (cont.)

Why is it relevant to RT systems?
Schedulability analyzer
Takes an arbitrary program and the set of all possible inputs to that program and determines the best-, worst-, and average-case execution times
The running time can be determined given the specific language, a fixed set of inputs, and the execution times.
But this is NOT GENERALIZABLE: a fully general analyzer would have to solve the Halting Problem.

13
Amdahl's Law

Limits the level of parallelization that can be achieved by a parallel computer
$n$ = number of processors available for parallel processing
$s$ = the fraction of the code that cannot be parallelized
$1 - s$ = the fraction of the code that can be parallelized

14
Amdahl's Law (cont.)

$$\text{Speedup} = \frac{s + (1 - s)}{s + \frac{1 - s}{n}} = \frac{n}{1 + s(n - 1)}$$

15
Amdahl's Law (cont.)

$s = 0$: linear speedup (speedup $= n$)!
$s = 0.1$: the processors working on the remaining 90% (the parallelizable code) will end up waiting for the single processor to finish the last 10%.
Revealed an insoluble problem in the field of parallel computers: limited efficiency and limited application of parallelism
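
The speedup expression for Amdahl's Law can be checked numerically. The following is a minimal sketch in Python; the function name `amdahl_speedup` is an illustrative choice, not something taken from the slides.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup on n processors when a fraction s of the code is serial.

    Uses the closed form n / (1 + s * (n - 1)).
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + s * (n - 1))


if __name__ == "__main__":
    # Compare a fully parallel code (s = 0) with one that is 10% serial.
    for n in (2, 16, 1024):
        print(n, amdahl_speedup(0.0, n), round(amdahl_speedup(0.1, n), 2))
```

With $s = 0.1$ the result approaches but never exceeds $1/s = 10$, however many processors are added.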

16
Gustafson's Law

Demonstrated with a 1024-processor system that the basic presumptions in Amdahl's Law are inappropriate for parallelism
Found that the assumption of a fixed problem size is inappropriate: in practice the problem size scales with the number of processors, and with a more powerful processor the problem expands to make use of the increased facilities.

17
Gustafson's Law (cont.)

Demonstrated that only the parallel or vector part of a program scales with the problem size.
Times for the vector startup, program loading, serial bottlenecks, and I/O that make up the serial component of the run do not grow with the problem size.

18
Gustafson's Law (cont.)

Formulation
$s$ = serial time
$p$ = parallel time $= 1 - s$
$n$ = number of processors
Time required (on a single processor) $= s + pn$
Much more optimistic than Amdahl's Law: much easier to achieve parallelism
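
To see how optimistic this formulation is, the scaled speedup can be computed directly. This is a sketch under one assumption that the slide leaves implicit: $s$ and $p$ are normalized so that $s + p = 1$ is the run time on the $n$-processor machine, which makes the speedup $(s + pn)/(s + p) = s + pn$. The function name `gustafson_speedup` is illustrative.

```python
def gustafson_speedup(s: float, n: int) -> float:
    """Scaled speedup when the serial fraction of the parallel run is s.

    Assumes s + p = 1 on the n-processor machine, so a single processor
    would need s + p * n time units for the same (scaled) problem.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    p = 1.0 - s
    return s + p * n


if __name__ == "__main__":
    # The scaled speedup grows linearly with n instead of saturating.
    for n in (2, 16, 1024):
        print(n, round(gustafson_speedup(0.1, n), 2))
```

Unlike the fixed-size case, the result keeps growing with $n$ because the parallel part of the workload grows with the machine.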

19
So far

Theoretical Preliminaries
NP-Completeness
Challenges in Analyzing RTS
The Halting Problem
Amdahl's Law
Gustafson's Law

20
Performance Analysis

There is a natural desire to know in advance whether a system will meet its deadlines
Rarely possible due to NP-completeness and physical constraints
However, it is possible to get a handle on the system's performance
Important because CPU utilization requirements are stated as design goals, and knowing them up front helps in selecting the appropriate hardware and system design approach

21
Code Execution Time Estimation

Best method: logic analyzer (Ch. 8)
+ H/W latencies and other delays are taken into account
- System must be completely coded and the target H/W available
Usually employed in the late stages of coding, testing, and during system integration.

22
Instruction Counting

When it is too early for a logic analyzer, or one is not available, instruction counting is the best method of determining CPU utilization due to code execution time
Involves tracing the longest path through the code, adding up the instruction execution times along the way
Requirements:
The code needs to be already written, or an approximation of the final code must be available
Actual instruction times

23
Instruction Counting (cont.)
24
Instruction Counting (cont.)

Path 1:
7 instructions @ 6 µsec = 42 µsec
Paths 2 & 3:
9 instructions @ 6 µsec = 54 µsec
Utilization = 0.054 ms / 5 ms = 1.08%
Can be automated with a parser for the target assembly language
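
The path arithmetic in this example can be sketched in a few lines of Python. The instruction counts and the 6 µsec instruction time come from the slide; treating the 5 in the utilization figure as a 5 ms cycle time is an assumption, and all names are illustrative.

```python
INSTRUCTION_TIME_US = 6.0   # each instruction takes 6 microseconds (from the slide)
CYCLE_TIME_US = 5_000.0     # ASSUMPTION: the task runs once every 5 ms

# Number of instructions along each path through the code.
path_instruction_counts = {"path 1": 7, "paths 2 and 3": 9}


def worst_case_time_us(counts: dict, instruction_time_us: float) -> float:
    """Execution time of the longest path, in microseconds."""
    return max(counts.values()) * instruction_time_us


worst = worst_case_time_us(path_instruction_counts, INSTRUCTION_TIME_US)
utilization = worst / CYCLE_TIME_US
print(f"worst-case path: {worst} us, utilization: {utilization:.2%}")
```

Only the longest path matters for the utilization estimate, since an upper bound is wanted.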
25
Instruction Execution-Time Simulators

Requires more than just the information supplied in the CPU manufacturer's data books.
Dependent on memory access times and wait states.
Simulation programs
take as input the CPU type, memory speeds, and an instruction mix.
output the total instruction time and throughput

26
Using the System Clock

Code can be timed by reading the system clock before and after its execution
If the code takes only a few microseconds, it is better to execute it a few thousand times and divide the elapsed time by the number of iterations.
This helps to remove any inaccuracy introduced by the granularity of the clock.

27
Using the System Clock (cont.)
More iterations give better precision
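
The timing technique can be sketched as follows. This is a minimal Python illustration, not the code from the original slide: `time_per_call` is a hypothetical helper, and subtracting the time of an empty loop is one common way to remove loop overhead from the measurement.

```python
import time


def time_per_call(func, iterations: int = 10_000) -> float:
    """Estimate the execution time of func() in seconds per call."""
    # Time an empty loop first so the loop overhead can be subtracted.
    start = time.perf_counter()
    for _ in range(iterations):
        pass
    loop_time = time.perf_counter() - start

    # Time the same loop with the code under test inside it.
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    total_time = time.perf_counter() - start

    # Clamp at zero in case clock noise makes the difference negative.
    return max(total_time - loop_time, 0.0) / iterations


if __name__ == "__main__":
    per_call = time_per_call(lambda: sum(range(100)))
    print(f"about {per_call * 1e6:.3f} microseconds per call")
```

Raising `iterations` spreads the clock's granularity error over more calls, which is why more iterations give a more precise estimate.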
28
So far

Theoretical Preliminaries
NP-Completeness
Analyzing RT Systems
The Halting Problem
Amdahl's Law
Gustafson's Law
Performance Analysis
Execution Time Estimation
Instruction Counting
Instruction Execution-Time Simulators
Using the System Clock

29
Analysis of Polled Loops

Three components:
The H/W delays involved in setting the S/W flag by some external device
The time for the polled loop to test the flag
The time needed to process the event associated with the flag

30
Polled Loops (cont.)

First delay: on the order of nanoseconds, can be ignored
Second delay: on the order of several microseconds
Third delay: depends on the process involved
If events overlap, the response time is worse

31
Polled Loops (cont.)

$f$ = the time needed to check the flag
$P$ = the time to process the event
$n$ = number of overlapping events
Response time for the $n$th overlapping event $= n(f + P)$

32
Analysis of Coroutines

Absence of interrupts makes this easy
Response time found by tracing the worst-case
path through each of the tasks

33
Analysis of Round-Robin Systems

Assumptions
n processes in the ready queue
No new ones arrive after the system starts
None terminate prematurely
The release time is arbitrary
All processes have maximum end-to-end execution
time c
Timeslice of q

34
Round-Robin Systems (cont.)

In practice, if a process completes before the end of a time quantum, that slack time would be assigned to the next ready process.
However, assume it will not.
This does not hurt the analysis because an upper bound is desired.

35
Analysis of Round-Robin Systems (cont.)

Each process receives $1/n$ of the CPU time in chunks of $q$
Each process waits no longer than $(n - 1)q$ time units for its next quantum
Each process requires at most $\lceil c/q \rceil$ time quanta to complete
Each context switch takes $o$ time units
Waiting time $= ((n - 1)q + n o) \lceil c/q \rceil$

36
Round-Robin Systems (cont.)

Worst-case time from readiness to completion is the waiting time plus the undisturbed time to complete, $c$
$T = ((n - 1)q + n o) \lceil c/q \rceil + c$

37
Analysis of Round-Robin Systems (cont.)

Ex: Consider six processes, each with a maximum execution time of 600 ms. The time quantum, $q$, is 40 ms, and each context switch costs 2 ms.
$n = 6$, $c = 600$, $q = 40$, $o = 2$
$T = ((6 - 1) \cdot 40 + 6 \cdot 2) \lceil 600/40 \rceil + 600 = 212 \cdot 15 + 600 = 3780$ ms
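
The worst-case bound is easy to evaluate mechanically. Below is a small Python sketch of the formula; the function name `round_robin_worst_case` is illustrative, and the parameters follow the slide's symbols. With the values from the example, the expression works out to (200 + 12) × 15 + 600 = 3780 ms.

```python
import math


def round_robin_worst_case(n: int, c: float, q: float, o: float) -> float:
    """Worst-case time from readiness to completion under round-robin.

    n: number of processes, c: maximum execution time,
    q: time quantum, o: context switch time (all times in the same unit).
    """
    quanta_needed = math.ceil(c / q)                   # quanta to finish c units of work
    waiting = ((n - 1) * q + n * o) * quanta_needed    # time spent waiting for the CPU
    return waiting + c                                 # plus undisturbed execution time


if __name__ == "__main__":
    print(round_robin_worst_case(n=6, c=600, q=40, o=2))
```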

38
Round-Robin Systems (cont.)

To achieve fair behavior, $q$ must be less than $c$
Otherwise, the round-robin algorithm degenerates into a first-come, first-served algorithm in which each process executes to completion in order of arrival, which favors longer processes.

39
Response Time Analysis for Fixed-Period Systems

For the highest priority task, its worst case
response time will be equal to its own execution
time.
Other tasks in the system are subjected to
interference caused by execution of higher
priority tasks.

40
Analysis for Fixed-Period Systems (cont.)

For a general task, $T_i$, the response time, $R_i$, is given as $R_i = e_i + I_i$
where
$I_i$ is the maximum delay in execution caused by higher-priority tasks
$e_i$ is the execution time of the current task

41
Analysis for Fixed-Period Systems (cont.)

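The maximum interference $T_i$ will face from a single higher-priority task $T_j$ (execution time $e_j$, period $p_j$) is $\lceil R_i/p_j \rceil e_j$
Each task of higher priority interferes with task $T_i$, so $I_i = \sum_{j \in hp(i)} \lceil R_i/p_j \rceil e_j$, where $hp(i)$ is the set of tasks with higher priority than $T_i$
Which yields $R_i = e_i + \sum_{j \in hp(i)} \lceil R_i/p_j \rceil e_j$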

42
Analysis for Fixed-Period Systems (cont.)

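But this can be very difficult to solve for $R_i$, which appears on both sides
Recursive solution: $R_{n+1,i} = e_i + \sum_{j \in hp(i)} \lceil R_{n,i}/p_j \rceil e_j$
where $R_{n,i}$ is the response time at the $n$th iteration and $hp(i)$ is the set of tasks with higher priority than $T_i$.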

43
Analysis for Fixed-Period Systems (cont.)

To use the recurrence relation to find response times, compute $R_{n+1,i}$ iteratively until the first value $m$ is found such that $R_{m+1,i} = R_{m,i}$; $R_{m,i}$ is then the response time.
If the equation does not have a solution, the value of $R_{n,i}$ will continue to rise, as is the case when a task set has a utilization greater than 100%.

44
Response-Time Analysis: RMA Example

Highest-priority task, $T_1$, will have a response time equal to its execution time
$R_1 = 3$.

45
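Response-Time Analysis: RMA Example

$T_2$ response time, starting from its execution time of 4
$R_{1,2} = 4 + \lceil 4/9 \rceil \cdot 3 = 7$
$R_{2,2} = 4 + \lceil 7/9 \rceil \cdot 3 = 7$
Since $R_{1,2} = R_{2,2}$, the response time of $T_2$ is 7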

46
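Response-Time Analysis: RMA Example

$T_3$ response time, starting from its execution time of 2
$R_{1,3} = 2 + \lceil 2/9 \rceil \cdot 3 + \lceil 2/12 \rceil \cdot 4 = 9$
$R_{2,3} = 2 + \lceil 9/9 \rceil \cdot 3 + \lceil 9/12 \rceil \cdot 4 = 9$
Since $R_{1,3} = R_{2,3}$, the response time of the lowest-priority task is 9.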
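
The iteration in this example can be reproduced with a short Python sketch. The execution times (3, 4, 2) and the periods of the two higher-priority tasks (9, 12) are read off the calculations above; the function name `response_time` and the `max_iterations` cut-off are assumptions added for illustration, the latter to stop the loop when no fixed point exists.

```python
import math


def response_time(e_i, higher_priority, max_iterations=1000):
    """Iterate R = e_i + sum(ceil(R / p_j) * e_j) until it stops changing.

    higher_priority: list of (e_j, p_j) pairs for all higher-priority tasks.
    Returns the response time, or None if no fixed point is found within
    max_iterations (for example when utilization exceeds 100%).
    """
    r = e_i  # start from the task's own execution time
    for _ in range(max_iterations):
        r_next = e_i + sum(math.ceil(r / p_j) * e_j for e_j, p_j in higher_priority)
        if r_next == r:
            return r
        r = r_next
    return None


if __name__ == "__main__":
    t1 = (3, 9)    # (execution time, period) of T1
    t2 = (4, 12)   # (execution time, period) of T2
    print(response_time(3, []))         # T1
    print(response_time(4, [t1]))       # T2
    print(response_time(2, [t1, t2]))   # T3
```

In practice the iteration would usually be stopped as soon as the value exceeds the task's deadline rather than after a fixed number of steps.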

47
Analysis of Sporadic and Aperiodic Interrupt
Systems

Ideally modeled as a rate-monotonic system, but
with the non-periodic tasks modeled as having a
period equal to their worst-case expected
interarrival time.
May lead to unacceptably high utilizations
Response time calculation depends on interrupt
latency, dispatching times and context switch
times

48
Interrupt Latency

Period between when a device requests an
interrupt and when the first instruction for the
associated H/W interrupt service routine
executes.
Worst-case INT latency must be considered!
Typically occurs when all possible INTs in the
system are requested simultaneously.

49
Interrupt Latency contd

Worst-case latency is also affected by the number of threads or processes running.
Typically, an RTOS needs to disable INTs while it is processing a large number of threads or processes.
If the design of the system requires a large number of threads or processes, it is necessary to perform latency measurements to check that the scheduler is not disabling INTs for an unreasonably long period of time.

50
Instruction Completion Time

A contributor to interrupt latency
It is necessary to find the execution time of every macroinstruction by calculation, by measurement, or from the manufacturer's data sheets.

51
Instruction Completion Time (cont.)

The instruction with the longest execution time in the code will maximize the contribution to interrupt latency if it has just begun executing when the INT signal is received.
Ex: A system has instructions that take 10 microseconds, 50 microseconds, and 250 microseconds. The largest contribution of instruction completion to INT latency is therefore 250 microseconds.

52
Deterministic Performance

Cache, pipelines, DMA
All designed to improve average RT performance
But they destroy determinism, making RTS performance unanalyzable and unpredictable
Worst-Case Scenarios
Cache: it must be assumed that every instruction is fetched from memory, not from the cache.

53
Deterministic Performance (cont.)

Worst-Case Scenarios (cont.)
Pipelines: one must always assume that the pipeline is flushed at every possible opportunity
DMA: it must be assumed that cycle stealing is occurring at every opportunity
By making some reasonable assumptions about the impact of these effects, a rational approximation of performance is possible

54
In Review

Theoretical Preliminaries
NP-Completeness
Analyzing RT Systems
The Halting Problem
Amdahl's Law
Gustafson's Law

55
In Review

Performance Analysis
Execution Time Estimation
Instruction Counting
Instruction Execution-Time Simulators
Using the System Clock
Analysis of Polled Loops
Analysis of Coroutines
Analysis of Round-Robin Systems
Analysis of Fixed-Period Systems

56
In Review

Performance Analysis (cont.)
Analysis of Sporadic and Aperiodic Interrupt
Systems
Interrupt Latency
Instruction Completion Time
Deterministic Performance

57
Questions

What is an NP-hard problem? How does it differ from an NP-complete problem?
Using Amdahl's Law, calculate the speedup for a code that is 70 percent parallelizable on 2 processors.
Why is performance analysis for RTS important?
Why can one not do 100% realistic performance analysis?

58
Questions (cont.)

The execution time of the function to be timed on slide 27 was estimated using the system clock technique as follows:
Total time = 2 microseconds
Time1 = 10000 microseconds
Time2 = 6000 microseconds
No. of iterations = 1000
What is the loop time?

59
Questions (cont.)

Why must q be less than c in round-robin systems?
What is interrupt latency? Under what condition does the worst-case interrupt latency occur?
How should one act to maintain determinism in RTS to some extent? (Hint: 3 components)
Why should one ALWAYS assume the worst case when doing RTS performance analysis?
