1
Performance Analysis and Optimization
  • Kayhan Atesci
  • www.auburn.edu/atescka

2
Overview
  • Theoretical Preliminaries
  • NP-Completeness
  • Analyzing RT Systems
  • The Halting Problem
  • Amdahl's Law
  • Gustafson's Law

3
Overview (cont'd)
  • Performance Analysis
  • Execution Time Estimation
  • Instruction Counting
  • Instruction Execution-Time Simulators
  • Using the System Clock
  • Analysis of Polled Loops
  • Analysis of Coroutines
  • Analysis of Round-Robin Systems
  • Analysis of Fixed-Period Systems

4
Overview (cont.)
  • Performance Analysis (cont.)
  • Analysis of Sporadic and Aperiodic Interrupt
    Systems
  • Interrupt Latency
  • Instruction Completion Times
  • Deterministic Performance

5
Intro to Theoretical Preliminaries
  • Performance Analysis is one of the fields where
    theory and practice do not coincide
  • Formulas:
  • ignore resource contention
  • use theoretically artificial hardware
  • assume zero context-switch time
  • Not totally useless, but less realistic

6
NP-Completeness
  • NP: the class of problems that cannot (as far as is known) be
    solved in polynomial time, although a candidate solution can be
    verified in polynomial time
  • NP-Complete: a problem in NP to which every other problem in NP
    can be transformed (reduced) in polynomial time
  • NP-Hard: a problem, not necessarily in NP, to which every
    problem in NP can be transformed

7
Challenges in Analyzing RTS
  • Over 30 years of research, strict practical constraints,
    NP-complete problems, etc.
  • Mutual exclusion makes it impossible to find an optimal
    scheduler
  • Earliest-deadline scheduling is not optimal in multiprocessing
  • Prior knowledge of deadlines, computation times, and task start
    times is needed

8
Challenges cont'd
  • Examples of NP-complete and NP-hard problems:
  • Is it possible to schedule a set of periodic processes that use
    semaphores only to enforce mutual exclusion?
  • Multiprocessor scheduling with either two or three processors,
    one or no resource, and arbitrary or specified partial order

9
The Halting Problem
  • Is there a computer program that takes an
    arbitrary program, Pi, and all possible
    combinations of inputs, Ik, and determines
    whether or not Pi will halt on Ik?
  • No, there is not.

10
The Halting Problem cont'd
  • Cantor's diagonal argument (cont.)
  • [Diagram: an oracle takes the source code of an arbitrary
    program Pi and a set of inputs Ik, and outputs a halt / no-halt
    decision.]
11
The Halting Problem cont'd
12
The Halting Problem cont'd
  • Why is it relevant to RT systems?
  • Schedulability Analyzer
  • Takes an arbitrary program and the set of all possible inputs to
    that program, and determines the best-, worst-, and average-case
    execution times
  • The running time can be determined for a specific language, a
    fixed set of inputs, and known instruction execution times
  • But: NOT GENERALIZABLE!

13
Amdahl's Law
  • Bounds the speedup that can be achieved by a parallel computer
  • n = number of processors available for parallel processing
  • s = the fraction of the code that cannot be parallelized
  • 1 − s = the fraction of the code that can be parallelized

14
Amdahl's Law (cont.)
  • Speedup = (s + (1 − s)) / (s + (1 − s)/n)
  •         = n / (1 + s(n − 1))   (see the sketch below)
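A minimal C sketch of this formula; the function name and the example
values (10% serial code, 8 processors) are illustrative, not from the
slides.

    #include <stdio.h>

    /* Amdahl's law: speedup = n / (1 + s*(n - 1)),
       where s is the serial fraction and n is the processor count. */
    double amdahl_speedup(double s, int n)
    {
        return (double)n / (1.0 + s * (n - 1));
    }

    int main(void)
    {
        /* Illustrative values: 10% serial code on 8 processors. */
        printf("speedup = %.2f\n", amdahl_speedup(0.1, 8));  /* ~4.71 */
        return 0;
    }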

15
Amdahl's Law cont'd
  • s = 0: linear speedup!
  • s = 0.1:
  • The processors working on the remaining 90% (the parallelizable
    code) will end up waiting for a single processor to finish the
    last 10%.
  • Revealed what appeared to be an insoluble problem in the field
    of parallel computing: the limited efficiency and applicability
    of parallelism

16
Gustafson's Law
  • Demonstrated with a 1024-processor system that the basic
    presumptions in Amdahl's Law are inappropriate for parallelism
  • Found that, in practice, the problem size scales with the number
    of processors: given a more powerful machine, the problem
    expands to make use of the increased facilities

17
Gustafson's Law cont'd
  • Demonstrated that only the parallel or vector
    part of a program scales with the problem size.
  • Times for the vector startup, program loading,
    serial bottlenecks, and I/O that make up the
    serial component of the run do not grow with the
    problem size

18
Gustafson's Law cont'd
  • Formulation:
  • s = serial time
  • p = parallel time (1 − s)
  • n = number of processors
  • time required on a serial processor = s + p·n (this is the
    scaled speedup, since s + p = 1)
  • Much more optimistic than Amdahl's law: parallelism is much
    easier to exploit (see the sketch below)
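A companion sketch for Gustafson's scaled speedup, under the usual
assumption that s + p = 1 is measured on the parallel machine; the
function name and example values are illustrative.

    #include <stdio.h>

    /* Gustafson's law: scaled speedup = s + p*n = s + (1 - s)*n,
       where s is the serial fraction observed on the parallel system. */
    double gustafson_speedup(double s, int n)
    {
        return s + (1.0 - s) * n;
    }

    int main(void)
    {
        /* Illustrative values: 10% serial time on 8 processors. */
        printf("scaled speedup = %.2f\n", gustafson_speedup(0.1, 8)); /* 7.30 */
        return 0;
    }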

19
So far
  • Theoretical Preliminaries
  • NP-Completeness
  • Challenges in Analyzing RTS
  • The Halting Problem
  • Amdahl's Law
  • Gustafson's Law

20
Performance Analysis
  • Natural desire to determine whether a system will meet its
    deadlines
  • Rarely possible exactly, due to NP-completeness and physical
    constraints
  • However, it is possible to get a handle on the problem
  • Important because CPU utilization requirements are stated as
    design goals, and knowing them up front matters when selecting
    the appropriate hardware and system design approach

21
Code Execution Time Estimation
  • Best method: logic analyzer (Ch. 8)
  • + H/W latencies and other delays are taken into account
  • − The system must be completely coded and the target H/W
    available
  • Usually employed in the late stages of coding, testing, and
    during system integration

22
Instruction Counting
  • When it is too early for a logic analyzer, or one is not
    available, this is the best method of determining CPU
    utilization due to code execution time
  • Involves tracing the longest path through the code and adding up
    the instruction execution times along the way
  • Requirements:
  • The code, or a reasonable approximation of the final code, must
    already be written
  • Actual instruction execution times must be known

23
Instruction Counting cont'd
24
Instruction Counting cont'd
  • Path 1:
  • 7 instructions @ 6 µsec = 42 µsec
  • Paths 2 and 3:
  • 9 instructions @ 6 µsec = 54 µsec
  • Utilization: 0.054 / 1.08 = 5%
  • Can be automated with a parser for the target assembly language
    (see the sketch below)
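The path arithmetic above can be automated; here is a minimal sketch
that sums per-instruction execution times along a path. The
instruction-time table and the 1.08 ms cycle used for the utilization
figure are assumptions for illustration, taken from the numbers on
this slide rather than from the original code.

    #include <stdio.h>

    /* Sum the execution times (in microseconds) of the instructions
       along one path through the code. */
    double path_time_us(const double *instr_times_us, int count)
    {
        double total = 0.0;
        for (int i = 0; i < count; i++)
            total += instr_times_us[i];
        return total;
    }

    int main(void)
    {
        /* Worst-case path from the example: 9 instructions at 6 us each. */
        double worst_path[9] = {6, 6, 6, 6, 6, 6, 6, 6, 6};
        double wcet_us = path_time_us(worst_path, 9);       /* 54 us */

        /* Assumed cycle of 1.08 ms, matching the slide's 0.054/1.08 figure. */
        double utilization = (wcet_us / 1000.0) / 1.08;
        printf("WCET = %.0f us, utilization = %.0f%%\n",
               wcet_us, utilization * 100.0);               /* 54 us, 5% */
        return 0;
    }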
25
Instruction Execution-Time Simulators
  • Requires more than just the information supplied in the CPU
    manufacturer's data books
  • Instruction times are dependent on memory access times and wait
    states
  • Simulation programs:
  • take as input CPU type, memory speeds, and an instruction mix
  • output total instruction time and throughput

26
Using the System Clock
  • Code can be timed by reading the system clock before and after
    its execution
  • If the code takes only a few microseconds, it is better to
    execute it a few thousand times
  • This helps to remove any inaccuracy introduced by the
    granularity of the clock

27
Using the System Clock cont'd
  • More iterations → better precision (see the sketch below)
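A minimal sketch of the timing-loop technique described above, using
the standard C clock(); the function being timed is a placeholder,
and the loop overhead would still need to be measured separately and
subtracted.

    #include <stdio.h>
    #include <time.h>

    #define ITERATIONS 10000L

    /* Placeholder for the code being timed. */
    static volatile int sink;
    static void function_to_time(void) { sink++; }

    int main(void)
    {
        clock_t time1 = clock();                /* read the system clock   */
        for (long i = 0; i < ITERATIONS; i++)
            function_to_time();                 /* run the code many times */
        clock_t time2 = clock();                /* read the clock again    */

        /* Averaging over many iterations reduces the error caused by the
           clock's granularity; the loop overhead is still included. */
        double avg_us = 1e6 * (double)(time2 - time1)
                        / CLOCKS_PER_SEC / ITERATIONS;
        printf("average execution time = %.3f us\n", avg_us);
        return 0;
    }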
28
So far
  • Theoretical Preliminaries
  • NP-Completeness
  • Analyzing RT Systems
  • The Halting Problem
  • Amdahl's Law
  • Gustafson's Law
  • Performance Analysis
  • Execution Time Estimation
  • Instruction Counting
  • Instruction Execution-Time Simulators
  • Using the System Clock

29
Analysis of Polled Loops
  • Three components:
  • The H/W delays involved in setting the S/W flag
    by some external device
  • The time for the polled loop to test the flag
  • The time needed to process the event associated
    with the flag

30
Polled Loops cont'd
  • First delay: on the order of nanoseconds; can be ignored
  • Second delay: on the order of several microseconds
  • Third delay: depends on the processing involved
  • If events overlap, the response time is worse

31
Polled Loops cont'd
  • f = the time needed to check the flag
  • P = the time to process the event
  • n = the number of overlapping events
  • Response time for the nth overlapping event = n·f + P
    (see the sketch below)
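A minimal polled-loop skeleton in C. The flag variable and
process_event() are illustrative; the flag is assumed to be set by an
external device (for example from an ISR or a memory-mapped register).

    #include <stdbool.h>

    /* Flag set by some external device; volatile because it changes
       outside normal program flow. */
    static volatile bool event_flag = false;

    /* Handler whose execution time is the P term above. */
    static void process_event(void)
    {
        /* ... handle the event ... */
    }

    void polled_loop(void)
    {
        for (;;) {
            if (event_flag) {        /* testing the flag costs f */
                event_flag = false;
                process_event();     /* processing costs P       */
            }
        }
    }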

32
Analysis of Coroutines
  • Absence of interrupts makes this easy
  • Response time found by tracing the worst-case
    path through each of the tasks

33
Analysis of Round-Robin Systems
  • Assumptions
  • n processes in the ready queue
  • No new ones arrive after the system starts
  • None terminate prematurely
  • Release times are arbitrary
  • All processes have a maximum end-to-end execution time of c
  • The time quantum (timeslice) is q

34
Round-Robin Systems contd
  • In practice, if a process completes before the
    end of a time quantum, that slack time would be
    assigned to the next ready process.
  • However Assume it will not.
  • This does not hurt the analysis because an upper
    bound is desired.

35
Analysis of Round-Robin Systems (cont.)
  • Each process receives 1/n of the CPU time, in chunks of q
  • Each process waits no longer than (n − 1)q time units between
    chunks
  • Each process requires at most ⌈c/q⌉ time quanta to complete
  • Each context switch takes o time units
  • Waiting time = [(n − 1)q + n·o] · ⌈c/q⌉

36
Round-Robin Systems cont'd
  • Worst-case time from readiness to completion is the waiting time
    plus the undisturbed time to complete, c
  • T = [(n − 1)q + n·o] · ⌈c/q⌉ + c   (see the sketch below)
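A small C helper for this bound, assuming the formula as
reconstructed above; some formulations count (n − 1) rather than n
context switches per round, which would give a slightly smaller value.

    #include <math.h>
    #include <stdio.h>

    /* Worst-case time from readiness to completion under round-robin:
       T = ((n - 1)*q + n*o) * ceil(c/q) + c                          */
    double rr_completion_time(int n, double c, double q, double o)
    {
        double quanta_needed = ceil(c / q);
        return ((n - 1) * q + n * o) * quanta_needed + c;
    }

    int main(void)
    {
        /* Values from the example on the next slide:
           n = 6, c = 600 ms, q = 40 ms, o = 2 ms. */
        printf("T = %.0f ms\n", rr_completion_time(6, 600.0, 40.0, 2.0));
        return 0;                                           /* prints 3780 */
    }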

37
Analysis of Round-Robin Systems (cont.)
  • Ex: Consider six processes with a maximum execution time of
    600 ms. The time quantum, q, is 40 ms, and each context switch
    costs 2 ms.
  • n = 6, c = 600, q = 40, o = 2
  • T = ((6 − 1)·40 + 6·2) · ⌈600/40⌉ + 600
  •   = 3780 ms

38
Round-Robin Systems cont'd
  • To achieve fair behavior, q must be less than c
  • Otherwise, the round-robin algorithm degenerates into a
    first-come, first-served algorithm in which each process
    executes to completion in order of arrival, which favors longer
    processes

39
Response Time Analysis for Fixed-Period Systems
  • For the highest priority task, its worst case
    response time will be equal to its own execution
    time.
  • Other tasks in the system are subjected to
    interference caused by execution of higher
    priority tasks.

40
Analysis for Fixed-Period Systems (cont.)
  • For a general task Ti, the response time Ri is given as
    Ri = ei + Ii
  • where:
  • Ii is the maximum delay in execution caused by higher-priority
    tasks
  • ei is the execution time of the task itself

41
Analysis for Fixed-Period Systems (cont.)
  • The maximum interference that a single higher-priority task Tj
    can cause Ti is ⌈Ri/pj⌉·ej
  • Every task of higher priority interferes with task Ti, so
    Ii = Σj ⌈Ri/pj⌉·ej
  • Which yields Ri = ei + Σj ⌈Ri/pj⌉·ej

42
Analysis for Fixed-Period Systems (cont.)
  • But this can be very difficult to solve for Ri directly
  • Recursive solution: Rn+1,i = ei + Σj ⌈Rn,i/pj⌉·ej
    (see the sketch below)
  • where Rn,i is the response time at the nth iteration
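A minimal C sketch of this iterative calculation. Tasks are assumed
to be passed in priority order (highest first); the array names and
the placeholder period for the lowest-priority task are illustrative.

    #include <math.h>
    #include <stdio.h>

    /* Iterate Rn+1,i = e[i] + sum over j < i of ceil(Rn,i / p[j]) * e[j]
       until the value stops changing (converged) or exceeds the period
       (no solution). */
    double response_time(int i, const double e[], const double p[])
    {
        double r = e[i];                 /* seed with the task's own time */
        for (;;) {
            double next = e[i];
            for (int j = 0; j < i; j++)  /* higher-priority interference  */
                next += ceil(r / p[j]) * e[j];
            if (next == r)
                return r;                /* converged: this is Ri         */
            if (next > p[i])
                return -1.0;             /* utilization too high          */
            r = next;
        }
    }

    int main(void)
    {
        /* Task set from the example slides: e = {3, 4, 2}, p = {9, 12, ...};
           the lowest-priority period is not given, so a large placeholder
           is used. Prints R1 = 3, R2 = 7, R3 = 9. */
        double e[] = {3, 4, 2};
        double p[] = {9, 12, 1000};
        for (int i = 0; i < 3; i++)
            printf("R%d = %.0f\n", i + 1, response_time(i, e, p));
        return 0;
    }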

43
Analysis for Fixed-Period Systems (cont.)
  • To use the recurrence relation to find response times, compute
    Rn+1,i iteratively until the first value m is found such that
    Rm+1,i = Rm,i. Rm,i is then the response time.
  • If the equation has no solution, the value of Ri will continue
    to rise, as is the case when the task set has a utilization
    greater than 100%.

44
Response-Time Analysis RMA Example
  • The highest-priority task, T1, has a response time equal to its
    execution time
  • R1 = 3

45
Response-Time Analysis RMA Example
  • T2 response time:
  • R1,2 = 4 + ⌈4/9⌉·3 = 7
  • R2,2 = 4 + ⌈7/9⌉·3 = 7
  • Since R1,2 = R2,2, the response time of T2 = 7

46
Response-Time Analysis RMA Example
  • T3 response time:
  • R1,3 = 2 + ⌈2/9⌉·3 + ⌈2/12⌉·4 = 9
  • R2,3 = 2 + ⌈9/9⌉·3 + ⌈9/12⌉·4 = 9
  • Since R1,3 = R2,3, the response time of the lowest-priority task
    is 9

47
Analysis of Sporadic and Aperiodic Interrupt
Systems
  • Ideally modeled as a rate-monotonic system, but
    with the non-periodic tasks modeled as having a
    period equal to their worst-case expected
    interarrival time.
  • May lead to unacceptably high utilizations
  • Response time calculation depends on interrupt
    latency, dispatching times and context switch
    times

48
Interrupt Latency
  • Period between when a device requests an
    interrupt and when the first instruction for the
    associated H/W interrupt service routine
    executes.
  • Worst-case INT latency must be considered!
  • Typically occurs when all possible INTs in the
    system are requested simultaneously.

49
Interrupt Latency cont'd
  • Worst-case latency is also affected by the number of threads or
    processes running
  • Typically, an RTOS needs to disable INTs while it is processing
    a large number of threads or processes
  • If the design of the system requires a large
    number of threads or processes, it is necessary
    to perform latency measurements to check that the
    scheduler is not disabling INTs for an
    unreasonably long period of time.

50
Instruction Completion Time
  • A contributor to interrupt latency
  • It is necessary to find the execution time of every
    macroinstruction by calculation, by measurement, or from the
    manufacturer's data sheets

51
Instruction Completion Time cont'd
  • The instruction with the longest execution time in the code
    makes the largest contribution to interrupt latency if it has
    just begun executing when the INT signal is received
  • Ex: A system has instructions that take 10 microseconds, 50
    microseconds, and 250 microseconds. The largest contribution to
    INT latency from instruction completion is 250 microseconds.

52
Deterministic Performance
  • Cache, Pipelines, DMA
  • All designed to improve average RT performance
  • But they destroy determinism, making RTS
    performance unanalyzable and unpredictable
  • Worst-Case Scenarios
  • Cache: it must be assumed that every instruction is fetched not
    from the cache but from main memory

53
Deterministic Performance cont'd
  • Worst-Case Scenarios (cont.)
  • Pipelines: one must always assume that the pipeline is flushed
    at every possible opportunity
  • DMA: it must be assumed that cycle stealing is occurring at
    every opportunity
  • By making some reasonable assumptions about the impact of these
    effects, a rational approximation of performance is possible

54
In Review
  • Theoretical Preliminaries
  • NP-Completeness
  • Analyzing RT Systems
  • The Halting Problem
  • Amdahl's Law
  • Gustafson's Law

55
In Review
  • Performance Analysis
  • Execution Time Estimation
  • Instruction Counting
  • Instruction Execution-Time Simulators
  • Using the System Clock
  • Analysis of Polled Loops
  • Analysis of Coroutines
  • Analysis of Round-Robin Systems
  • Analysis of Fixed-Period Systems

56
In Review
  • Performance Analysis (cont.)
  • Analysis of Sporadic and Aperiodic Interrupt
    Systems
  • Interrupt Latency
  • Instruction Completion Time
  • Deterministic Performance

57
Questions
  • What is an NP-hard problem? How does it differ from an
    NP-complete problem?
  • Using Amdahl's law, calculate the speedup for code that is 70
    percent parallelizable on 2 processors.
  • Why is performance analysis for RTS important? Why can one not
    do 100% realistic performance analysis?

58
Questions cont'd
  • The execution time of the function to be timed on slide 27 was
    estimated using the system clock technique as follows:
  • Total time = 2 microseconds
  • Time1 = 10000 microseconds
  • Time2 = 6000 microseconds
  • No. of iterations = 1000
  • What is the loop time?

59
Questions cont'd
  • Why must q be less than c in round-robin systems?
  • What is interrupt latency? Under what condition does the
    worst-case interrupt latency occur?
  • How should one act to maintain determinism in RTS to some
    extent? (Hint: 3 components)
  • Why should one ALWAYS assume the worst case in
    general while doing RTS performance analysis?