Title: Performance Analysis and Optimization
1Performance Analysis and Optimization
- Kayhan Atesci
- www.auburn.edu/atescka
2Overview
- Theoretical Preliminaries
- NP-Completeness
- Analyzing RT Systems
- The Halting Problem
- Amdahl's Law
- Gustafson's Law
3Overview (contd)
- Performance Analysis
- Execution Time Estimation
- Instruction Counting
- Instruction Execution-Time Simulators
- Using the System Clock
- Analysis of Polled Loops
- Analysis of Coroutines
- Analysis of Round-Robin Systems
- Analysis of Fixed-Period Systems
4Overview (cont.)
- Performance Analysis (cont.)
- Analysis of Sporadic and Aperiodic Interrupt Systems
- Interrupt Latency
- Instruction Completion Times
- Deterministic Performance
5Intro to Theoretical Preliminaries
- Performance analysis is one of the fields where theory and practice do not coincide
- Formulas typically
- ignore resource contention
- use theoretically artificial hardware
- assume zero context-switch time
- Not totally useless, but less realistic
6NP-Completeness
- NP: the class of problems for which a candidate solution can be verified in polynomial time, even when no polynomial-time algorithm for finding a solution is known
- NP-Complete: a problem in the NP class to which all other problems in NP are transformable in polynomial time
- NP-Hard: a problem to which all problems in NP are transformable in polynomial time, but which need not itself be in NP
7Challenges in Analyzing RTS
- 30 years of research, strict practical constraints, NP-Complete problems, etc.
- Mutual exclusion constraints make it impossible to find an optimal scheduler
- Earliest-deadline scheduling is not optimal in multiprocessing
- Prior knowledge of deadlines, computation times, and task start times is needed
8Challenges contd
- NP-Complete, NP-Hard problems
- Is it possible to schedule a set of periodic processes that use semaphores only to enforce mutual exclusion?
- Multiprocessor scheduling problems with either two or three processors, either with one or no resource, and an arbitrary or specified partial order
9The Halting Problem
- Is there a computer program that takes an arbitrary program, Pi, and all possible combinations of inputs, Ik, and determines whether or not Pi will halt on Ik?
- No, there is not.
10The Halting Problem contd
[Figure: an oracle takes the source code of an arbitrary program, Pi, and the set of inputs to the program, Ik, and outputs a halt or no-halt decision]
11The Halting Problem contd
12The Halting Problem contd
- Why is it relevant to RT systems?
- Schedulability Analyzer
- Takes an arbitrary program and the set of all possible inputs to that program and determines the best, worst, and average case execution times
- The running time can be determined given the specific language, a fixed set of inputs, and the execution times
- But! NOT GENERALIZABLE!
13Amdahl's Law
- A statement of the level of parallelization that can be achieved by a parallel computer
- n: number of processors available for parallel processing
- s: the fraction of the code that cannot be parallelized
- 1 - s: the fraction of code that can be parallelized
14Amdahl's Law (cont.)
$$\text{Speedup} = \frac{s + (1 - s)}{s + \frac{1 - s}{n}} = \frac{1}{s + \frac{1 - s}{n}} = \frac{n}{1 + s(n - 1)}$$
15Amdahl's Law (cont.)
- s = 0: linear speedup!
- s = 0.1: the processors working on the remaining 90% (parallelizable code) will end up waiting for the single processor to finish the last 10%, so the speedup stays below 1/s = 10 no matter how many processors are added.
- Revealed an insoluble problem in the field of parallel computers: limited efficiency and application of parallelism
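To make the two cases concrete, here is a minimal Python sketch that evaluates the speedup n / (1 + s(n - 1)) for a few processor counts. The function name and the chosen values of n are illustrative assumptions, not part of the slides.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup on n processors when a fraction s of the code is serial."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1 + s * (n - 1))


if __name__ == "__main__":
    # Illustrative processor counts (assumed, not from the slides).
    for n in (2, 16, 1024):
        linear = amdahl_speedup(0.0, n)   # s = 0: speedup equals n
        capped = amdahl_speedup(0.1, n)   # s = 0.1: speedup stays below 10
        print(f"n={n:5d}  s=0.0 -> {linear:8.2f}   s=0.1 -> {capped:6.2f}")
```

With s = 0.1 the result approaches, but never reaches, 10 as n grows, which is the saturation effect described above.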
16Gustafson's Law
- Demonstrated with a 1024-processor system that the basic presumptions in Amdahl's Law are inappropriate for parallelism
- Found that, in practice, the problem size scales with the number of processors: given a more powerful processor, the problem expands to make use of the increased facilities. The assumption of a fixed problem size is therefore inappropriate.
17Gustafson's Law (cont.)
- Demonstrated that only the parallel or vector part of a program scales with the problem size.
- Times for the vector startup, program loading, serial bottlenecks, and I/O that make up the serial component of the run do not grow with the problem size
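18Gustafson's Law (cont.)
- Formulation
- s: serial time
- p: parallel time (1 - s)
- n: number of processors
- Time a single serial processor would require for the same work: $s + pn$ (against $s + p = 1$ on the parallel system)
- Much more optimistic than Amdahl's Law; parallelism is much easier to achieve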
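A short Python sketch of this formulation follows. It treats the serial-processor time s + pn, relative to a parallel run normalized to s + p = 1, as the scaled speedup. That reading is the usual interpretation of Gustafson's result and is assumed here. The function name is hypothetical.

```python
def gustafson_scaled_speedup(s: float, n: int) -> float:
    """Scaled speedup s + p*n, with p = 1 - s and the parallel run time normalized to 1."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    p = 1.0 - s
    return s + p * n


if __name__ == "__main__":
    # Same serial fraction as the Amdahl example (s = 0.1), on 1024 processors.
    print(gustafson_scaled_speedup(0.1, 1024))
```

For s = 0.1 and n = 1024 this evaluates to roughly 921.7, compared with a fixed-size speedup that stays below 10 under Amdahl's Law.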
19So far
- Theoretical Preliminaries
- NP-Completeness
- Challenges in Analyzing RTS
- The Halting Problem
- Amdahl's Law
- Gustafson's Law
20Performance Analysis
- Natural desire to know whether the system will meet its deadlines
- Rarely possible to prove, due to NP-completeness and physical constraints
- However, it is possible to get a handle on the system's performance
- Important because CPU utilization requirements are stated as design goals, and knowing them upfront is important in selecting the appropriate hardware and system design approach
21Code Execution Time Estimation
- Best method: logic analyzer (Ch. 8)
- Pro: H/W latencies and other delays are taken into account
- Con: the system must be completely coded and the target H/W available
- Usually employed in the late stages of coding, testing, and during system integration.
22Instruction Counting
- When it is too early for a logic analyzer, or one is not available, this is the best method of determining CPU utilization due to code execution time
- Involves tracing the longest path through the code, adding up the instruction execution times along the way
- Requirements
- The code must already be written, or an approximation of the final code must be available
- Actual instruction times
23Instruction Counting contd
24Instruction Counting contd
- Path 1
- 7 instructions @ 6 µsec = 42 µsec
- Paths 2 & 3
- 9 instructions @ 6 µsec = 54 µsec
- Utilization
- 0.054 ms / 5 ms = 1.08%
- Can be automated with a parser for the target assembly language
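The arithmetic of instruction counting can be sketched in a few lines of Python. The path data mirrors the example (7 and 9 instructions at 6 µsec each). The 5 ms cycle time is an assumption inferred from the 1.08% figure, and the function names are hypothetical.

```python
def worst_case_path_time(paths: dict[str, list[float]]) -> float:
    """Return the longest total execution time (in microseconds) over all paths."""
    return max(sum(instruction_times) for instruction_times in paths.values())


def utilization(exec_time_us: float, period_us: float) -> float:
    """Fraction of each cycle spent executing the code."""
    return exec_time_us / period_us


# Per-instruction times in microseconds along each path through the code.
paths = {
    "path 1": [6.0] * 7,
    "paths 2 and 3": [6.0] * 9,
}

worst = worst_case_path_time(paths)        # 54.0 microseconds
share = utilization(worst, 5000.0)         # assumed 5 ms cycle -> 0.0108

if __name__ == "__main__":
    print(f"worst-case path: {worst} usec, utilization: {share:.2%}")
```

A parser for the target assembly language would fill the per-path lists automatically. Here they are entered by hand.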
25Instruction Execution-Time Simulators
- Requires more than just the information supplied in the CPU manufacturer's data books.
- Dependent on memory access times and wait states.
- Simulation programs
- take as input CPU types, memory speeds, and an instruction mix.
- output total instruction time and throughput
26Using the System Clock
- Code can be timed by reading the system clock before and after its execution
- If the code takes only a few microseconds, it is better to execute it a few thousand times and divide the elapsed time by the number of iterations.
- This helps to remove any inaccuracy introduced by the granularity of the clock.
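The technique can be sketched in Python with the standard `time.perf_counter` clock. Timing an empty loop first and subtracting it is a common refinement that is assumed here rather than spelled out in the bullets above. The names `per_call_time`, `time_function`, and `function_to_be_timed` are hypothetical.

```python
import time


def per_call_time(time_with_code: float, time_empty_loop: float, iterations: int) -> float:
    """Average time per call after removing the cost of the loop itself."""
    return (time_with_code - time_empty_loop) / iterations


def time_function(func, iterations: int = 10_000) -> float:
    """Estimate the execution time of func() in seconds by repeating it many times."""
    clock = time.perf_counter

    start = clock()
    for _ in range(iterations):
        pass                      # empty loop: measures loop overhead only
    empty = clock() - start

    start = clock()
    for _ in range(iterations):
        func()                    # same loop, now calling the code under test
    loaded = clock() - start

    return per_call_time(loaded, empty, iterations)


def function_to_be_timed() -> None:
    """Stand-in workload (assumed); replace with the code of interest."""
    sum(range(100))


if __name__ == "__main__":
    print(f"{time_function(function_to_be_timed) * 1e6:.3f} usec per call")
```

On a real target the same pattern applies with the platform's own clock source. The result is only as good as that clock and is affected by anything else running during the measurement.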
27Using the System Clock contd
More iterations, better precision
28So far
- Theoretical Preliminaries
- NP-Completeness
- Analyzing RT Systems
- The Halting Problem
- Amdahl's Law
- Gustafson's Law
- Performance Analysis
- Execution Time Estimation
- Instruction Counting
- Instruction Execution-Time Simulators
- Using the System Clock
29Analysis of Polled Loops
- 3 components
- The H/W delays involved in setting the S/W flag by some external device
- The time for the polled loop to test the flag
- The time needed to process the event associated with the flag
30Polled Loops - contd
- First delay: on the order of nanoseconds, can be ignored
- Second delay: on the order of several microseconds
- Third delay: depends on the process involved
- If events overlap, the response time is worse
31Polled Loops contd
- f: the time needed to check the flag
- P: the time to process the event
- n: number of overlapping events
- Response time for the nth overlapping event is bounded by
- $n(f + P)$, since each overlapping event costs one flag check plus one processing pass
32Analysis of Coroutines
- Absence of interrupts makes this easy
- Response time found by tracing the worst-case
path through each of the tasks
33Analysis of Round-Robin Systems
- Assumptions
- n processes in the ready queue
- No new ones arrive after the system starts
- None terminate prematurely
- The release time is arbitrary
- All processes have maximum end-to-end execution time c
- Timeslice of q
34Round-Robin Systems contd
- In practice, if a process completes before the end of a time quantum, that slack time would be assigned to the next ready process.
- However, assume it will not.
- This does not hurt the analysis because an upper
bound is desired.
35Analysis of Round-Robin Systems (cont.)
- Each process receives 1/n of the CPU time in chunks of q
- Each process waits no longer than $(n - 1)q$ time units for its next time quantum
- Each process requires at most $\lceil c/q \rceil$ time quanta to complete
- Each context switch takes o time units
- Waiting time $= \left[(n - 1)q + n \cdot o\right] \lceil c/q \rceil$
36Round-Robin Systems contd
- Worst-case time from readiness to completion is waiting time plus undisturbed time to complete, c
- $T = \left[(n - 1)q + n \cdot o\right] \lceil c/q \rceil + c$
37Analysis of Round-Robin Systems (cont.)
- Ex: Consider six processes with a maximum execution time of 600 ms. The time quantum, q, is 40 ms, and each context switch costs 2 ms.
- n = 6, c = 600, q = 40, o = 2
- $T = \left[(6 - 1) \cdot 40 + 6 \cdot 2\right] \cdot (600/40) + 600 = 212 \cdot 15 + 600 = 3780$ ms
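The bound can be checked with a small Python sketch. It evaluates the formula exactly as stated, with n·o for the context-switch term and c/q rounded up to a whole number of quanta. For the six-process example this gives (5·40 + 6·2)·15 + 600 = 3780 ms. The function name is hypothetical.

```python
import math


def round_robin_worst_case(n: int, c: float, q: float, o: float = 0.0) -> float:
    """Worst-case time from readiness to completion under round-robin scheduling.

    n: number of processes, c: maximum execution time,
    q: time quantum, o: context-switch time (all times in the same unit).
    """
    quanta = math.ceil(c / q)                  # quanta needed to finish c
    waiting = ((n - 1) * q + n * o) * quanta   # time spent waiting for other processes
    return waiting + c


if __name__ == "__main__":
    print(round_robin_worst_case(n=6, c=600, q=40, o=2))  # 3780
```

Setting `o=0` recovers the bound without context-switch overhead.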
38Round-Robin Systems contd
- In order to achieve fair behavior, q must be less than c
- Otherwise, the round-robin algorithm becomes a first-come, first-served algorithm in which each process executes to completion in order of arrival, which favors longer processes.
39Response Time Analysis for Fixed-Period Systems
- For the highest priority task, its worst-case response time will be equal to its own execution time.
- Other tasks in the system are subjected to interference caused by execution of higher priority tasks.
40Analysis for Fixed-Period Systems (cont.)
- For a general task, Ti, the response time, Ri, is given as $R_i = e_i + I_i$
- Where
- Ii is the maximum amount of delay in execution caused by higher priority tasks
- ei is the execution time of the current task
41Analysis for Fixed-Period Systems (cont.)
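- The maximum interference Ti will face from one higher priority task Tj, with execution time ej and period pj, is $\lceil R_i/p_j \rceil e_j$
- Each task of higher priority is interfering with task Ti, so $I_i = \sum_{j \in hp(i)} \lceil R_i/p_j \rceil e_j$, where hp(i) is the set of tasks with higher priority than Ti
- Which yields $R_i = e_i + \sum_{j \in hp(i)} \lceil R_i/p_j \rceil e_j$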
42Analysis for Fixed-Period Systems (cont.)
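- But this can be very difficult to solve for Ri, since Ri appears on both sides
- Recursive solution: $R_{n+1,i} = e_i + \sum_{j \in hp(i)} \lceil R_{n,i}/p_j \rceil e_j$, with the sum taken over the higher priority tasks hp(i)
- Where $R_{n,i}$ is the response for the nth iteration.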
43Analysis for Fixed-Period Systems (cont.)
- To use the recurrence relation to find response times, it is necessary to compute $R_{n+1,i}$ iteratively until the first value m is found such that $R_{m+1,i} = R_{m,i}$. $R_{m,i}$ is then the response time.
- If the equation does not have a solution, then the value of Ri will continue to rise, as is the case when a task set has a utilization greater than 100%.
44Response-Time Analysis RMA Example
- Highest priority task, T1, will have a response time equal to its execution time
- $R_1 = 3$
45Response-Time Analysis RMA Example
- T2 response time
- $R_{1,2} = 4 + \lceil 4/9 \rceil \cdot 3 = 7$
- $R_{2,2} = 4 + \lceil 7/9 \rceil \cdot 3 = 7$
- Since $R_{1,2} = R_{2,2}$, the response time of T2 = 7
46Response-Time Analysis RMA Example
- T3 response time
- $R_{1,3} = 2 + \lceil 2/9 \rceil \cdot 3 + \lceil 2/12 \rceil \cdot 4 = 9$
- $R_{2,3} = 2 + \lceil 9/9 \rceil \cdot 3 + \lceil 9/12 \rceil \cdot 4 = 9$
- Since $R_{1,3} = R_{2,3}$, the response time of the lowest priority task is 9.
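The iteration used in this example can be written as a short Python sketch. It starts from the task's own execution time and repeats the recurrence until two successive values agree. The function name and the `limit` cut-off, which stops the loop when the value keeps rising, are assumptions added for illustration. The task parameters (e1 = 3, p1 = 9, e2 = 4, p2 = 12, e3 = 2) are read off the worked example.

```python
import math


def response_time(e_i: int, higher: list[tuple[int, int]], limit: int = 1000):
    """Worst-case response time of a task with execution time e_i.

    higher: (e_j, p_j) pairs for every higher priority task.
    Returns None if the iteration exceeds `limit` without converging,
    as happens when the task set is overloaded.
    """
    r = e_i                                    # first estimate: own execution time
    while r <= limit:
        r_next = e_i + sum(math.ceil(r / p_j) * e_j for e_j, p_j in higher)
        if r_next == r:                        # fixed point reached
            return r
        r = r_next
    return None


if __name__ == "__main__":
    t1 = (3, 9)    # (execution time, period) of T1
    t2 = (4, 12)   # (execution time, period) of T2
    print(response_time(3, []))          # T1 -> 3
    print(response_time(4, [t1]))        # T2 -> 7
    print(response_time(2, [t1, t2]))    # T3 -> 9
```

In practice `limit` would normally be the task's deadline or period, since a response time beyond that already means the task is unschedulable.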
47Analysis of Sporadic and Aperiodic Interrupt Systems
- Ideally modeled as a rate-monotonic system, but with the non-periodic tasks modeled as having a period equal to their worst-case expected interarrival time.
- May lead to unacceptably high utilizations
- Response time calculation depends on interrupt latency, dispatching times, and context switch times
48Interrupt Latency
- Period between when a device requests an interrupt and when the first instruction for the associated H/W interrupt service routine executes.
- Worst-case INT latency must be considered!
- Typically occurs when all possible INTs in the system are requested simultaneously.
49Interrupt Latency contd
- Worst-case latency is also affected by the number of threads or processes running.
- Typically an RTOS needs to disable INTs while it is processing a large number of threads or processes.
- If the design of the system requires a large number of threads or processes, it is necessary to perform latency measurements to check that the scheduler is not disabling INTs for an unreasonably long period of time.
50Instruction Completion Time
- Contributor to Interrupt latency
- Necessary to find the execution time of every macroinstruction by calculation, measurement, or the manufacturer's data sheets.
51Instruction Completion Time contd
- The instruction with the longest execution time in the code will maximize the contribution to interrupt latency if it has just begun executing when the INT signal is received.
- Ex: A system has instructions that take 10 microseconds, 50 microseconds, and 250 microseconds. The highest contribution of instruction completion to INT latency is 250 microseconds.
52Deterministic Performance
- Cache, Pipelines, DMA
- All designed to improve average RT performance
- But they destroy determinism, making RTS performance difficult to analyze and predict
- Worst-Case Scenarios
- Cache: It must be assumed that every instruction is fetched not from the cache but from memory.
53Deterministic Performance contd
- Worst-Case Scenarios (cont.)
- Pipelines: One must always assume that at every possible opportunity the pipeline is flushed
- DMA: It must be assumed that cycle stealing is occurring at every opportunity
- By making some reasonable assumptions about the impact of these effects, a rational approximation of performance is possible
54In Review
- Theoretical Preliminaries
- NP-Completeness
- Analyzing RT Systems
- The Halting Problem
- Amdahl's Law
- Gustafson's Law
55In Review
- Performance Analysis
- Execution Time Estimation
- Instruction Counting
- Instruction Execution-Time Simulators
- Using the System Clock
- Analysis of Polled Loops
- Analysis of Coroutines
- Analysis of Round-Robin Systems
- Analysis of Fixed-Period Systems
56In Review
- Performance Analysis (cont.)
- Analysis of Sporadic and Aperiodic Interrupt Systems
- Interrupt Latency
- Instruction Completion Time
- Deterministic Performance
57Questions
- What is an NP-hard problem? How does it differ from an NP-complete problem?
- Using Amdahl's law, calculate the speedup for code that is 70 percent parallelizable on 2 processors.
- Why is performance analysis for RTS important? Why can one not do 100% realistic performance analysis?
58Questions Contd
- The execution time of the function to be timed on slide 27 was estimated using the system clock technique as follows
- Total time = 2 microseconds
- Time1 = 10000 microseconds
- Time2 = 6000 microseconds
- No. of iterations = 1000
- What is the loop time?
59Questions Contd
- Why must q be less than c in Round Robin systems?
- What is interrupt latency? Under what condition does the worst-case interrupt latency occur?
- How should one act to maintain determinism in RTS to some extent? (Hint: 3 components)
- Why should one ALWAYS assume the worst case in general while doing RTS performance analysis?