Hardware and Software Tracing - PowerPoint PPT Presentation

About This Presentation
Title:

Hardware and Software Tracing

Description:

Hardware and Software Tracing David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA kaeli_at_ece.neu.edu Trace Collection ... – PowerPoint PPT presentation

Slides: 70
Provided by: DavidRK150

Transcript and Presenter's Notes

Title: Hardware and Software Tracing


1
Hardware and Software Tracing
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University
  • Boston, MA
  • kaeli_at_ece.neu.edu

2
Trace Collection Methodologies
  • Hardware
  • Monitors and instrumentation
  • Microcode
  • Software
  • Trap-based system
  • Emulators
  • Code annotation (source, object, executable)
  • Direct execution

3
Metrics for Evaluating Trace Collection
Methodologies
  • Speed: trace capture rate
  • Memory: extra memory used
  • Accuracy: address perturbation
  • Intrusiveness: tracing overhead
  • Completeness: OS, interrupts, libraries
  • Granularity: smallest traceable unit
  • Flexibility: ease of use
  • Portability: platform dependence
  • Capacity: trace storage space
  • Cost: $, time

4
Hardware Monitors
  • Capture trace at peak execution rates
  • Challenge - match storage media speed to tracing
    needs utilizing interleaving and multiplexing
  • Pros
  • Non-intrusive
  • Accurate
  • Complete
  • Cons
  • Expensive
  • Limited probeability
  • Limited trace length

5
Examples of Hardware Monitors
  • Monster (U. of Michigan 1992): R2000 traces
    using a DAS9200
  • BACH (BYU, 1992): i486, Pentium, SPARC, 68K;
    developed a customized pod being used by Intel
    today
  • Real-time Tracer (IBM 1992): customized SRAM
    array
  • National Instruments (2006): provides a family
    of programmable instrumentation monitors

6
Microcode-based Tracing
  • Places hooks in microcode to capture machine
    state
  • Pros
  • Complete (OS, application)
  • Minimal slowdown (2-10x)
  • Cons
  • Microcode is dated technology
  • Nonportable

7
Example Microcode-based Tracing
  • ATUM (Stanford 1986) VAX traces
  • PatchWrx (DEC WRL 1995, NU 1996) Complete
    OS-rich traces on Alpha running NT

8
Instrumenting NT-based Workloads
9
Participants
  • Chakib Ouarraoui EMC
  • Jason Casmira Intel
  • John Fraser US Air Force
  • David Hunter VMWare
  • Sharon Smith HP
  • Richard Sites Adobe Systems

10
Tracing tools that capture OS activity
11
OS Rich and NT-based Instrumentation Tools
  • SimOS
  • UNIX-based platforms (basis for VMWare)
  • OS, memory, I/O activity
  • High overhead (10X - 50,000X)
  • Etch
  • Intel x86-based platform
  • No OS activity
  • 35X slowdown

12
PatchWrx Overview
  • Dynamic execution tracing tool suite
  • Captures full system workloads
  • Traces branches executed by the processor
  • Reconstructs full instruction stream
  • DEC Alpha 21064 Windows NT 4.0 platforms
  • Low overhead with minimum slowdown
  • 2X while running
  • 4X while tracing

13
PatchWrx Components
  • PALcode: Alpha Privileged Architecture Library
  • Reserves trace buffer upon boot
  • Captures trace info
  • Facilitates long branches
  • Patch: instrument all NT images
  • Trace: collect runtime information
  • Reconstruct: reconstitute the information

14
Patching an Image
  • Instrument all WinNT binary image types
  • COM, EXE, DLL, SYS, DRV
  • Replace branch-type instructions with branches to
    PatchWrx PAL calls
  • Log trace entry of branch type into buffer
  • Branch to original target

15
Patching an Image
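[Figure: ORIGINAL IMAGE vs. PATCHED IMAGE. Blocks A and B appear in both; in the patched image, numbered steps 1-4 pass through a PATCH SECTION containing a PWX PAL call and a BR instruction.]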
16
Patching Large Images
  • Normal Alpha ISA branch instruction
  • (PC+4) + SEXT(disp21) * 4
  • New PatchWrx long branches
  • LBR: (PC+4) + SEXT(disp25) * 4
  • LBSR: (PC+4) + ZEXT(disp20) * 32
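
The target computation above can be sketched in a few lines. This is only an illustration, assuming each formula reads as target = (PC + 4) + EXT(disp) * scale; the function names and the example addresses are hypothetical and are not taken from PatchWrx.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def branch_target(pc: int, disp: int, bits: int = 21,
                  scale: int = 4, signed: bool = True) -> int:
    """Compute (PC + 4) + EXT(disp) * scale for a displacement field."""
    mask = (1 << bits) - 1
    offset = sign_extend(disp, bits) if signed else (disp & mask)
    return (pc + 4) + offset * scale


# Normal branch: 21-bit signed displacement counted in 4-byte instructions.
forward = branch_target(0x1000, 3)          # three instructions ahead
backward = branch_target(0x1000, 0x1FFFFF)  # displacement of -1

# Reach of each form under the assumed formulas, in bytes.
normal_reach = (1 << 20) * 4                # forward reach of disp21
lbr_reach = (1 << 24) * 4                   # forward reach of disp25
lbsr_reach = (1 << 20) * 32                 # zero-extended disp20
```

Under these assumptions the wider displacement and the larger scale factor are what let a patched branch reach a patch section that a 21-bit displacement cannot.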

17
Patching Large Images
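[Figure: PATCHED IMAGE with a LONG branch. Numbered steps 1-6 connect blocks A and B, a PAL call, and a PATCH SECTION containing a CAPTURE step, a PWX PAL call, and a BR instruction.]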
18
Tracing with PatchWrx
  • Trace
  • User controlled start/stop/dump
  • Dumps captured trace to binary file
  • Captures VA mapping snapshot of active processes
    during trace capture

19
Reconstructing Execution
[Figure: the RECONSTRUCT TOOL takes IMAGE 0 ... IMAGE n, the RAW TRACE, the VA MAP, and SYMBOL TABLE 0 ... SYMBOL TABLE n, and produces the I-STREAM and/or D-STREAM.]
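
The idea of rebuilding a full instruction stream from a branch-only trace can be sketched as follows. This is a simplified illustration, not the PatchWrx reconstruct tool: it assumes fixed 4-byte instructions, a trace that records only taken branches as (branch PC, target PC) pairs, and a single image with no VA map or symbol tables. All names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

INSTRUCTION_BYTES = 4  # fixed-width instructions, as on Alpha


def reconstruct(start_pc: int,
                taken_branches: Iterable[Tuple[int, int]]) -> Iterator[int]:
    """Yield every executed PC implied by a trace of taken branches."""
    pc = start_pc
    for branch_pc, target in taken_branches:
        if branch_pc < pc or (branch_pc - pc) % INSTRUCTION_BYTES:
            raise ValueError("trace record does not follow current PC")
        # Straight-line code from the current PC through the branch itself.
        yield from range(pc, branch_pc + INSTRUCTION_BYTES, INSTRUCTION_BYTES)
        pc = target  # execution resumes at the branch target


raw_trace = [(0x108, 0x200), (0x204, 0x100)]
stream = list(reconstruct(0x100, raw_trace))
```

Because only branches are logged, the trace stays small while the sequential instructions between them are filled in afterwards from the images.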
20
OS-Rich Workload Characterization
  • Execution domain analysis
  • Hot EXEs / DLLs (system resources)
  • Instruction mix
  • Application-only
  • Full system
  • Branching behavior
  • Branch frequency (average basic block size)
  • Branch prediction in presence of OS

21
Workloads Investigated
22
Five most frequently used images in each benchmark or application

Workload | 1st | 2nd | 3rd | 4th | 5th | Other
fourier | bytecpu.exe (99.5) | winsrv.dll (0.2) | win32k.sys (0.1) | ntoskrnl.exe (0.1) | user32.dll (.02) | (0.8)
li | li.exe (97.7) | win32k.sys (1.0) | ntoskrnl.exe (0.6) | user32.dll (0.1) | qv.dll (0.1) | (0.5)
go | go.exe (95.5) | win32k.sys (2.0) | ntoskrnl.exe (1.0) | hal.dll (0.4) | gv.dll (0.1) | (1.0)
ie | iexplore.exe (37.2) | win32k.sys (19.3) | ntoskrnl.exe (17.5) | fastfat.sys (6.1) | ntdll.dll (6.0) | (13.9)
vc50 | c1.exe (83.1) | ntoskrnl.exe (10.5) | msvcrt.dll (2.8) | nsfs.sys (1.2) | win32k.sys (1.1) | (1.3)
word | mssp232.dll (36.4) | msgren32.dll (34.0) | ntoskrnl.exe (10.2) | win32k.sys (7.7) | hal.dll (4.0) | (7.7)
fx!32 | hal.dll (42.5) | s3.dll (24.6) | opengl32.dll (12.2) | msvcrt.dll (11.7) | glu32.dll (2.7) | (6.3)
23
Average basic block lengths
24
Conditional Branch Prediction: 2-level BTB, 12-bit
PHR, 4096 entries, gshare
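
For readers unfamiliar with gshare, the configuration named in this slide title can be sketched as below. This is a generic textbook-style gshare, not the simulator used for the study: the 2-bit saturating counters, their initial value, and the way the PC is shifted before the XOR are assumptions, and the class and function names are hypothetical.

```python
class Gshare:
    """gshare predictor: pattern table indexed by PC XOR global history."""

    def __init__(self, history_bits: int = 12):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # pattern history register
        self.counters = [1] * (1 << history_bits)   # 2-bit counters, weakly not-taken

    def _index(self, pc: int) -> int:
        # Drop the two alignment bits of a 4-byte instruction address.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor: Gshare, branches) -> float:
    """Replay (pc, taken) pairs from a trace and return the hit fraction."""
    correct = total = 0
    for pc, taken in branches:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
        total += 1
    return correct / total


acc = accuracy(Gshare(), [(0x400, True)] * 1000)
```

Feeding the same predictor an application-only trace and a full-system trace is one way to observe how OS activity changes prediction accuracy.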
25
Summary of Results
  • Benchmarks execute almost entirely within the
    application domain
  • Desktop applications execute across many images
    and interact with the kernel and system DLLs
  • Branch prediction accuracy can change drastically
    (sometimes it can even improve) when the
    operating system interaction is considered
  • The instruction mix in desktop applications
    changes significantly in the presence of OS
  • Increased number of indirect branches and
    privileged instructions (e.g., PALcalls)

26
For Further Information
  • 1. "Tracing and Characterization of Windows
    NT-based System Workloads," J.P. Casmira, D.P.
    Hunter and D.R. Kaeli, Digital Technical Journal,
    Vol. 10, No. 1, 1998, pp. 6-21
    (www.digital.com/info/DTJ01/DTJ01HM.HTM).
  • 2. "Operating System Impact on Trace-Driven
    Simulation," J.P. Casmira, J. Fraser and D.R.
    Kaeli, Proceedings of the 31st Simulation
    Symposium, Boston, MA, April 1998, pp. 76-82.
  • 3. "A Code Annotation Tool for Capturing
    Operating System Execution," J. Fraser,
    Northeastern University Technical Report,
    NUCAR_6-97-1, June 1997 (on the NUCAR website).

http://www.ece.neu.edu/groups/nucar
27
And now back to tracing..
28
Trap Based
  • Interrupt the application at selected points in
    order to save trace records
  • Pros
  • Available on many CPUs
  • Portable
  • Inexpensive
  • Cons
  • Considerable slowdown (1000x)
  • Intrusive (ISR), especially when considering
    real-time events
  • How do we decide where to interrupt the processor
    and still maintain a representative trace?

29
Example Trap Based Systems
  • VAX-Tracer: Clark/Emer study on VAX
  • OS2-Tracer: Intel 386
  • Wisconsin Wind Tunnel: ECC error trapping, CM5
    (SPARC)
  • Tapeworm II: system ECC error trapping, OS trap
    handler

30
Emulators
  • Simulating the target ISA using one or more
    machine instructions on the host ISA
  • Pros
  • Minimal slowdown (10-100x)
  • Opportunity for JIT compilation
  • Portable
  • Flexible: software controlled
  • Cons
  • Serious programming effort needed
  • Extra memory needed
  • Typically single process tracing

31
Emulators
  • Shade (UW 1994): dynamic translation
  • Compiles emulated instructions to native
    instructions (many elements of Shade have shown
    up in Transmeta products)
  • Host: SPARC-V8
  • Targets: SPARC-V8, SPARC-V9, MIPS
  • Spa (Sun 1993): iterative interpretation
  • Reinterprets instructions on each occurrence
  • Host: MIPS-1
  • Targets: MIPS-1, MIPS-2
  • SPIM (U of Wisc 1991): predecoded interpretation
  • Provides pointers to instruction handler and
    operands to speed decoding
  • Hosts: SPARC, 680x0, MIPS, HP-PA
  • Target: MIPS-1
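
Predecoded interpretation, as described for SPIM above, can be illustrated with a toy interpreter. This is a minimal sketch over a hypothetical four-instruction ISA, not code from any of the tools listed: each opcode is resolved to its handler once, and the run loop then records a PC trace as it dispatches.

```python
from typing import Callable, List, Tuple

# Hypothetical toy target ISA: each instruction is (opcode, a, b).
Program = List[Tuple[str, int, int]]


def op_li(regs, a, b, pc):      # load immediate: r[a] = b
    regs[a] = b
    return pc + 1


def op_add(regs, a, b, pc):     # r[a] += r[b]
    regs[a] += regs[b]
    return pc + 1


def op_addi(regs, a, b, pc):    # r[a] += b
    regs[a] += b
    return pc + 1


def op_bnez(regs, a, b, pc):    # branch to instruction b if r[a] != 0
    return b if regs[a] != 0 else pc + 1


HANDLERS = {"li": op_li, "add": op_add, "addi": op_addi, "bnez": op_bnez}


def predecode(program: Program) -> List[Tuple[Callable, int, int]]:
    """Resolve each opcode to its handler once, before execution starts."""
    return [(HANDLERS[op], a, b) for op, a, b in program]


def run(decoded, n_regs: int = 4, max_steps: int = 10_000):
    """Execute predecoded instructions, recording the PC of each one."""
    regs, trace, pc = [0] * n_regs, [], 0
    while 0 <= pc < len(decoded) and len(trace) < max_steps:
        handler, a, b = decoded[pc]
        trace.append(pc)
        pc = handler(regs, a, b, pc)
    return regs, trace


program = [("li", 0, 0), ("li", 1, 3), ("add", 0, 1),
           ("addi", 1, -1), ("bnez", 1, 2)]
regs, trace = run(predecode(program))
```

An iterative interpreter would instead look up the opcode inside the loop on every executed instruction, which is the cost predecoding avoids.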

32
More Recent Emulators
  • VisualDSP (Analog Devices 1995-present)
  • Simulator for SHARC and BlackFin DSPs that runs
    on WinTel and Linux-x86
  • Provides C/C++ compilation environment
  • Statistical profiling
  • Cycle-accurate simulator
  • Provides a full visualization environment for
    machine performance
  • AMD Opteron X86-64 (2003)
  • Simulator for the new 64-bit X86 from AMD
  • Runs on 32-bit Linux-x86
  • Comes complete with a X86-64 version of gcc
  • http://www.x86-64.org/

33
MP Emulators
  • MINT (University of Rochester 1994)
  • Predecoded interpretation memory references
  • Host R3000 (SGI, DECstations)
  • Target R3000, (an Alpha-based derivative was
    developed called AINT)
  • RSim (Rice Univ 1997) Simulator for high-ILP
    Multiprocessors
  • Detailed cycle-based emulation
  • Host SPARC, SGI PowerChallenge
  • Target MIPS R10K

34
Machine Emulators
  • Simics (1996-present) Virtutech
  • Developed out of research work at SICS
  • Provides a large number of CPU targets
  • Alpha, ARM, Itanium, MIPS, Pentium, PowerPC,
    SPARC, X86-64
  • Provides both detailed simulation/emulation and
    high throughput
  • http://www.simics.com/
  • SimOS (1997) Stanford University
  • Originally designed to run on an SGI platform
  • Actually boots a full operating system (SGI IRIX
    and DEC UNIX)
  • Implementations on Alpha and MIPS platforms
  • Designed around the operating system, emulating
    IO and other system-related events
  • Provided the base technology for VMWare products

35
Code Annotation
  • Instrumented program produces trace while the
    application is run
  • Three levels of annotation
  • Source code modification
  • Object code modification
  • Binary code modification
  • Pros
  • Ease of implementation
  • Small slowdown (10x)
  • Inexpensive
  • Cons
  • Limited completeness (OS, multiprocessing)
  • May not capture DLLs
  • Memory dilation

36
Source Code Annotation
  • TRAPEDS (Univ. of Illinois 1989)
  • Adds a call upon exit from a basic block
  • MPTrace (Univ. of Washington 1990)
  • i386, instruments only MP-relevant events
  • Tangolite (Stanford 1993)
  • Annotates all memory events in an MP environment

37
Object Code Annotation
  • Epoxie (DEC WRL 1989) Titan MP
  • Epoxie2 (DEC WRL 1993) R3000
  • ATOM (DEC WRL 1994) Alpha
  • Alto (Univ. of Arizona 1996) Alpha
  • PLTO (Univ. of Arizona 2001) IA32

38
Binary Code Annotation
  • Pixie (DEC 1991) MIPS
  • Goblin (IBM/CMU 1991) RS/6000
  • IDtrace (Univ. of Mich.) i486
  • QPT (Univ. of Wisc.) MIPS, SPARC
  • EEL (Univ. of Wisc.) MIPS, SPARC
  • DSPTune (NEU) ADI SHARC DSP
  • Pin (Intel 2005) X86, XScale, Itanium

39
Embedded Systems Profiling Tools
  • Enhance current embedded system compilation
    environments, providing profile-driven analysis
    and feedback capabilities
  • DSPTune - instrumentation and analysis package
    for the SHARC family of DSPs
  • Allows for full instrumentation of C and C++
    codes at the source, assembly and ELF binary
    levels
  • Supported by Analog Devices and the NSF

40
The DSPTune Toolset
  • A set of library routines that enable the user to
    instrument C and assembly programs
  • Function calls can be inserted at various
    locations in the application code, enabling
    execution driven simulation
  • The user provides
  • instrumentation routines, which specify the
    selected instrumentation events (e.g., loads,
    branches, traps)
  • analysis routines, which carry out the desired
    simulation (e.g., caches, stacks, branch
    predictors)
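
The split between instrumentation routines and analysis routines can be sketched as follows. This is not DSPTune's API: the function and class names are hypothetical, Python stands in for the C and assembly code the tool actually targets, and a small direct-mapped cache model plays the role of the user's analysis routine.

```python
class DirectMappedCache:
    """Analysis routine: a direct-mapped cache model fed by load events."""

    def __init__(self, lines: int = 64, line_bytes: int = 16):
        self.lines = lines
        self.line_bytes = line_bytes
        self.tags = [None] * lines
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> None:
        block = address // self.line_bytes
        index = block % self.lines
        tag = block // self.lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag


def instrumented_sum(data_base: int, values, on_load) -> int:
    """Stand-in for application code with a call inserted before each load."""
    total = 0
    for i, value in enumerate(values):
        on_load(data_base + 4 * i)   # inserted instrumentation call
        total += value               # original application work
    return total


cache = DirectMappedCache()
result = instrumented_sum(0x8000, list(range(64)), cache.access)
```

Because the analysis routine runs while the application executes, the simulation is driven by execution rather than by a stored trace.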

41
User application code
Step I
Parser
Intermediate Representation
User instrumentation code
Step II
Instrumenting Tool
Instrumented IR
Step III
Code Generator
Instrumented application code
User analysis code
Step IV
Assembler
Linker
Instrumented application executable
42
BDSPTune
  • Provides similar capabilities as DSPTune
  • Allows ELF binaries to be instrumented
  • Enable instrumentation and profiling to include
    library routines

43
Summary of Tracing Methodologies
Method | Slowdown | OS coverage | Sample size | Cost
Source Code | 10X | NO | >GB | LOW
Object Code | 10X | SOME | >GB | LOW
Binary Code | 10X | NO | >GB | LOW
Microcode | 10X | YES | >GB | MEDIUM
I-Stepping | 1000X | YES | unlimited | MEDIUM
Emulation | 10-100X | YES | unlimited | MEDIUM
Real-time | 1X | YES | <GB | HIGH
44
Counter-based Profiling and Instrumentation
  • David Kaeli
  • Department of Electrical and Computer Engineering
  • Northeastern University
  • Boston, MA
  • kaeli_at_ece.neu.edu

45
Counters are used to
  • Identify Performance Bottlenecks
  • especially unpredictable dynamic stalls, e.g.
    cache misses, branch mispredicts, TLB misses,
    etc.
  • complex out-of-order processors make this
    difficult
  • Guide Optimizations
  • help programmers understand and improve code
  • automatic, profile-driven optimizations
  • Profile Production Workloads
  • low overhead
  • transparent
  • profile whole system

46
Performance Counters
  • Interfaced through a device driver and supporting
    GUI (e.g., VTune)
  • Counters increment based on a set of events of
    interest (e.g., cache misses, pipeline stalls)
  • Interrupt will occur that signals that the
    counter has overflowed
  • An interrupt service routine reads the counter
    information and tags it to a program counter (PC)
    value
  • Information is then available for offline
    analysis
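
The overflow-driven sampling loop described above can be simulated in software. This is a toy model under stated assumptions: the event stream is synthetic, the interrupt is delivered with no delay, and the function name and sampling period are hypothetical.

```python
from collections import Counter


def sample_profile(event_pcs, period: int) -> Counter:
    """Count events; on each counter overflow, record the current PC."""
    counter = 0
    histogram = Counter()
    for pc in event_pcs:
        counter += 1
        if counter == period:        # counter overflow raises an interrupt
            histogram[pc] += 1       # the ISR tags the sample with the PC
            counter = 0              # and re-arms the counter
    return histogram


# Synthetic workload: 90% of events at PC 0x100, 10% at PC 0x200.
events = ([0x100] * 9 + [0x200]) * 100
profile = sample_profile(events, period=7)

# Offline analysis: scale samples by the period to estimate event counts.
estimated = {pc: n * 7 for pc, n in profile.items()}
```

Only one sample per period is stored, which is where the low overhead comes from; the price is that per-PC counts are estimates rather than exact totals.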

47
Performance Counters
  • Low overhead method for obtaining performance and
    profiling information
  • Typically less than 5% slowdown
  • Requires no modification of the binary
  • May require root level access to system
  • Lacks precision in cause/effect analysis
  • Come for free on most ISAs
  • Commonly used today to measure performance and
    estimate power usage

48
Counter Library
  • A number of counter libraries are available to
    provide an API to program and access common
    architectures
  • Rabbit
  • for Intel/AMD Processors and Linux
  • URL: www.scl.ameslab.gov/Projects/Rabbit/
  • PAPI
  • Linux IA32, IA64
  • Allows counters to be captured on a per thread
    basis
  • URL: icl.cs.utk.edu/projects/papi/

49
Counters available on different ISAs
Category | PentiumII | 21064 | 21164 | IBM604e | R10K | Ultra2
Number of counters | 2 | 2 | 3 | 4 | 2 | 2
Counter Range (bits) | 40 | 8, 12, 16 | 14, 16 | 32 | 32 | 32
Variable Range | No | Yes | No | No | No | No
Sampling Freq | Variable | Fixed | Fixed | Variable | Variable | Fixed
R/W Access | Yes | No | Yes | Yes | Yes | Yes
Duration Counting | Yes | No | No | No | No | No
Counting Modes | Different Privilege Levels | Selected Processes | User, Kernel, PALmode | User, Kernel, Processes | User, Kernel | User, Kernel
50
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
L1 data cache read | Y | N | Y | N | Y
L1 data cache write | Y | N | N | N | Y
L1 data cache r/w | N | Y | N | N | N
L1 data cache miss | Y | Y | Y | Y | Y
L1 inst cache read | Y | N | N | N | N
L1 inst cache r/w | N | Y | N | N | Y
L1 inst cache hit | N | Y | N | N | Y
L1 inst cache miss | Y | Y | Y | Y | Y
51
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
TLB miss | N | N | Y | Y | N
Data TLB miss | N | Y | Y | N | N
Inst TLB miss | Y | Y | Y | N | N
Retired Branches | Y | N | N | Y | N
Mispredicted Branches | Y | Y | N | Y | N
Taken Branches | Y | N | N | N | N
Mispredicted Retired B | Y | N | N | N | N
52
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
Retired Instructions | Y | Y | Y | Y | Y
Issued Instructions | Y | N | Y | Y | N
Integer Inst Executed | N | Y | Y | N | N
FP Inst Executed | Y | Y | Y | Y | N
Load Inst Executed | N | Y | Y | Y | N
Store Inst Executed | N | Y | N | Y | N
Branch Inst Executed | Y | N | Y | N | N
53
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
Total cycles | Y | Y | Y | Y | Y
Cycles BPU is idle | N | N | Y | N | N
Cycles IU is idle | N | N | Y | N | N
Cycles LSU is idle | N | N | Y | N | N
Cycles LSU stalls | N | N | Y | N | N
Cycles FPU stalls | Y | N | Y | N | N
Cycles BPU stalls | N | N | Y | N | N
54
Existing Instruction-Level Sampling
  • Use Hardware Event Counters
  • small set of software-loadable counters
  • each counts a single event at a time, e.g. dcache
    miss
  • counter overflow generates interrupt
  • Advantages
  • low overhead vs. simulation and instrumentation
  • transparent vs. instrumentation
  • complete coverage, e.g. kernel, shared libs, etc.
  • Effective on In-Order Processors
  • analysis computes execution frequency
  • heuristics identify possible reasons for stalls
  • example: DIGITAL's Continuous Profiling
    Infrastructure

55
Problems with Event-Based Counters
  • Cannot simultaneously monitor all events
  • Limited information about events
  • event has occurred, but no additional
    context, e.g. cache miss latencies, recent
    execution path, ...
  • Blind spots in non-interruptible code
  • Key problem: imprecise attribution
  • interrupt delivers restart PC, not the PC that
    caused event
  • problem worse on out-of-order processors

56
Problem: Imprecise Attribution
  • Example: finding the single instruction that
    causes a long-latency event (e.g., cache miss,
    TLB miss, branch mispredict)
  • Most counter-based schemes provide the PC at the
    point a counter overflowed
  • In-order processors (Alpha 21164)
  • Imprecise exceptions/interrupts hinder our
    ability to quickly identify the cause of
    latencies during execution
  • It is possible to post-analyze the problem to
    attempt to identify the responsible instruction
  • Out-of-order processors (Alpha 21264, Pentium 4)
  • Due to the lack of sequentiality in the
    execution, the distance between the responsible
    instruction and the current PC could be far
  • It is nearly impossible to identify the cause of
    the latency

57
Profile-Me Profiling Strategy (DEC 1998)
  • PC + Retire Status → execution frequency
  • PC + Cache Miss Flag → cache miss rates
  • PC + Branch Mispredict → mispredict rates
  • PC + Event Flag → event rates
  • PC + Branch Direction → edge frequencies
  • PC + Branch History → path execution rates
  • PC + Latency → instruction stalls
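
How per-instruction sample records turn into the rates listed above can be sketched as follows. The record layout and field names are hypothetical and carry only a subset of the listed fields; the point is that each sample pairs a PC with what happened to that one instruction, so rates are plain per-PC averages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    pc: int             # PC of the sampled instruction
    retired: bool       # retire status
    dcache_miss: bool   # cache miss flag
    latency: int        # cycles spent in flight


def summarize(samples):
    """Aggregate instruction samples into per-PC rates."""
    by_pc = defaultdict(list)
    for s in samples:
        by_pc[s.pc].append(s)
    report = {}
    for pc, group in by_pc.items():
        n = len(group)
        report[pc] = {
            "samples": n,
            "retire_rate": sum(s.retired for s in group) / n,
            "miss_rate": sum(s.dcache_miss for s in group) / n,
            "mean_latency": sum(s.latency for s in group) / n,
        }
    return report


samples = [
    Sample(0x14, True, True, 40),
    Sample(0x14, True, False, 4),
    Sample(0x18, False, False, 1),
    Sample(0x14, True, True, 36),
]
report = summarize(samples)
```

Since every record already names the instruction it describes, no skid correction is needed before aggregating.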

58
Identifying True Bottlenecks
  • ProfileMe: Detailed Data for Single Instruction
  • In-Order Processors
  • ProfileMe PC + latency data identifies stalls
  • stalled instructions back up pipeline
  • Out-of-Order Processors
  • explicitly designed to mask stall latency, e.g.
    dynamic reordering, speculative execution
  • stall does not necessarily imply bottleneck
  • Example: Does This Stall Matter?
  • load r1,
  • add ,r1, average latency 35.0 cycles
  • other instructions

59
Example Retire Count Convergence
Estimate / Actual
60
How to handle concurrency and OOO?
  • Appropriate concurrency metrics
  • retired instructions per cycle
  • issue slots wasted while an instruction is in
    flight
  • pipeline stage utilization
  • How to measure concurrency?
  • Special-purpose hardware
  • some metrics difficult to measure, e.g. need
    retire/abort status
  • Sample potentially-concurrent instructions
  • aggregate info from pairs of samples
  • statistically estimate metrics

61
How to handle concurrency and OOO?
  • Sample Two Instructions
  • sample instructions, not events
  • may be in-flight simultaneously
  • replicate ProfileMe hardware, add intra-pair
    distance
  • Nested Sampling
  • sample window around first profiled instruction
  • randomly select second profiled instruction
  • statistically estimate frequency for F (first,
    second)
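
Nested sampling can be sketched with a toy model. This is an illustration under strong assumptions, not the ProfileMe hardware design: every instruction is given a known (start, end) in-flight interval, the second instruction is drawn uniformly from a window around the first, and the statistic estimated is simply the fraction of pairs that overlap in time. All names and parameters are hypothetical.

```python
import random


def overlaps(a, b) -> bool:
    """True if two (start, end) in-flight intervals overlap in time."""
    return a[0] < b[1] and b[0] < a[1]


def estimate_overlap(intervals, window: int, pairs: int, rng) -> float:
    """Estimate how often a windowed pair of instructions is concurrent."""
    offsets = [d for d in range(-window, window + 1) if d != 0]
    hits = 0
    for _ in range(pairs):
        # First profiled instruction, kept away from the trace edges.
        first = rng.randrange(window, len(intervals) - window)
        # Second profiled instruction, chosen at random inside the window.
        second = first + rng.choice(offsets)
        hits += overlaps(intervals[first], intervals[second])
    return hits / pairs


# Toy pipeline: instruction i is in flight during cycles [i, i + 5).
intervals = [(i, i + 5) for i in range(1000)]
estimate = estimate_overlap(intervals, window=20, pairs=20000,
                            rng=random.Random(0))
```

In this toy pipeline 8 of the 40 possible offsets overlap, so the estimate should settle near 0.2 as the number of sampled pairs grows.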

62
Other Uses of Paired Sampling
  • Path Profiling
  • two PCs close in time can identify execution path
  • identify control flow, e.g. indirect branches,
    calls, traps
  • Direct Latency Measurements
  • data load-to-use
  • loop iteration cost

63
VTune IA32 Instrumentation and Profiling
  • Supports all versions of IA32 Intel processors
  • Provides a rich GUI to ease programming and
    reading of hardware counters
  • Features include
  • Time and event-based sampling
  • Call graph profiling
  • Provides source-level tuning advice
  • Allows for integrated visualization of source and
    counter information
  • Supports C/C++, Fortran, Java and IA32 assembly

64
VTune Time Sample
65
VTune Call Graph
66
VTune Hot Spot Analyzer
67
VTune Tuning Assistant
68
Using Performance Counters for Power
Profiling/Estimation
  • Profile power-consuming events
  • Cache misses
  • TLB misses
  • Pipeline stalls
  • Opportunities to wait slower!
  • How can we tie high counts to when to adjust
    voltage/frequency? (more on this later in the
    class.)

69
Summary
  • Tracing/Instrumentation is still used today by
    industry and academia
  • The field has evolved significantly
  • Industry uses software-based tools for
    performance and hardware-based tools for
    power/energy
  • Most performance studies today use some form of
    emulation or virtualized execution to obtain
    trace data
  • Counters can be used effectively to capture
    performance data
  • The entry cost for using counters is low
  • Out-of-order (OOO) microarchitectures inhibit the use of counters
  • Paired sampling can be an effective technique for
    handling imprecision
  • A number of high-quality free and commercial
    tools are available (and we are going to use at
    least one of them)