Hardware and Software Tracing

1
Hardware and Software Tracing

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli_at_ece.neu.edu

2
Trace Collection Methodologies

Hardware
Monitors and instrumentation
Microcode
Software
Trap-based system
Emulators
Code annotation (source, object, executable)
Direct execution

3
Metrics for Evaluating Trace Collection
Methodologies

Speed: trace capture rate
Memory: extra memory used
Accuracy: address perturbation
Intrusiveness: tracing overhead
Completeness: OS, interrupts, libraries
Granularity: smallest traceable unit
Flexibility: ease of use
Portability: platform dependence
Capacity: trace storage space
Cost: $, time

4
Hardware Monitors

Capture trace at peak execution rates
Challenge: match storage media speed to tracing needs, using interleaving and multiplexing
Pros
Non-intrusive
Accurate
Complete
Cons
Expensive
Limited probeability
Limited trace length

5
Examples of Hardware Monitors

Monster (U. of Michigan, 1992): R2000 traces using a DAS9200
BACH (BYU, 1992): i486, Pentium, SPARC, 68K; developed a customized pod being used by Intel today
Real-time Tracer (IBM, 1992): customized SRAM array
National Instruments (2006): provides a family of programmable instrumentation monitors

6
Microcode-based Tracing

Places hooks in microcode to capture machine
state
Pros
Complete (OS, application)
Minimal slowdown (2-10x)
Cons
Microcode is dated technology
Nonportable

7
Example Microcode-based Tracing

ATUM (Stanford 1986): VAX traces
PatchWrx (DEC WRL 1995, NU 1996): complete OS-rich traces on Alpha running NT

8
Instrumenting NT-based Workloads
9
Participants

Chakib Ouarraoui, EMC
Jason Casmira, Intel
John Fraser, US Air Force
David Hunter, VMWare
Sharon Smith, HP
Richard Sites, Adobe Systems

10
Tracing tools that capture OS activity
11
OS Rich and NT-based Instrumentation Tools

SimOS
UNIX-based platforms (basis for VMWare)
OS, memory, I/O activity
High overhead (10X - 50,000X)
Etch
Intel x86-based platform
No OS activity
35X slowdown

12
PatchWrx Overview

Dynamic execution tracing tool suite
Captures full system workloads
Traces branches executed by the processor
Reconstructs full instruction stream
DEC Alpha 21064 Windows NT 4.0 platforms
Low overhead with minimum slowdown
2X while running
4X while tracing
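
The idea of tracing only branches and then rebuilding the full instruction stream can be pictured with a small sketch. It assumes a trace of taken branches stored as (branch PC, target PC) pairs and fixed 4-byte instructions; the record layout and the function name `reconstruct` are illustrative assumptions, not the PatchWrx format.

```python
INSTR_BYTES = 4  # assumption: fixed-width instructions, as on Alpha


def reconstruct(start_pc, branch_trace, end_pc):
    """Expand a taken-branch trace into the sequence of executed PCs.

    branch_trace holds (branch_pc, target_pc) for every taken branch, in
    execution order. Between recorded branches, execution is sequential.
    """
    stream = []
    pc = start_pc
    for branch_pc, target_pc in branch_trace:
        # Sequential instructions up to and including the taken branch.
        while pc <= branch_pc:
            stream.append(pc)
            pc += INSTR_BYTES
        pc = target_pc
    # Straight-line code after the last recorded branch.
    while pc <= end_pc:
        stream.append(pc)
        pc += INSTR_BYTES
    return stream


trace = [(0x108, 0x200), (0x204, 0x100)]
print([hex(pc) for pc in reconstruct(0x100, trace, 0x104)])
```

Only the branches need to be logged at run time; everything between two branches is recovered offline from the program image, which is why the run-time overhead can stay low.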

13
PatchWrx Components

PALcode: Alpha Privileged Architecture Library
Reserves trace buffer upon boot
Captures trace info
Facilitates long branches
Patch: instrument all NT images
Trace: collect runtime information
Reconstruct: reconstitute the information

14
Patching an Image

Instrument all WinNT binary image types
COM, EXE, DLL, SYS, DRV
Replace branch-type instructions with branches to
PatchWrx PAL calls
Log trace entry of branch type into buffer
Branch to original target

15
Patching an Image
[Figure: ORIGINAL IMAGE and PATCHED IMAGE, each with blocks A and B; in the patched image, steps 1-4 route control through the PATCH SECTION (PWX PAL call, then BR)]
16
Patching Large Images

Normal Alpha ISA branch instruction
target = (PC + 4) + SEXT(disp21) * 4
New PatchWrx long branches
LBR: target = (PC + 4) + SEXT(disp25) * 4
LBSR: target = (PC + 4) + ZEXT(disp20) * 32
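
A short worked example of the branch-target arithmetic, reading the slide's formulas as (PC + 4) plus a scaled displacement, where SEXT is sign extension and ZEXT is zero extension. The helper names are hypothetical, and the operators are an assumption recovered from the flattened slide text.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def zext(value, bits):
    """Zero-extend: keep only the low `bits` bits."""
    return value & ((1 << bits) - 1)


def br_target(pc, disp21):
    # Normal Alpha branch: 21-bit signed displacement, in instructions.
    return (pc + 4) + sext(disp21, 21) * 4


def lbr_target(pc, disp25):
    # Long branch as read from the slide: 25-bit signed displacement.
    return (pc + 4) + sext(disp25, 25) * 4


def lbsr_target(pc, disp20):
    # Long branch as read from the slide: 20-bit unsigned, scaled by 32.
    return (pc + 4) + zext(disp20, 20) * 32


def forward_reach_bytes(bits, scale):
    """Largest forward distance of a signed displacement field."""
    return ((1 << (bits - 1)) - 1) * scale


print(hex(br_target(0x1000, 1)))
print(forward_reach_bytes(21, 4), forward_reach_bytes(25, 4))
```

Under this reading, a 21-bit displacement reaches roughly 4 MB in either direction and a 25-bit displacement roughly 64 MB, which is the motivation for long branches when patching large images.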

17
Patching Large Images
[Figure: LONG PATCHED IMAGE with blocks A and B; steps 1-6 route control through a PAL call to the PATCH SECTION (CAPTURE, PWX PAL call, then BR)]
18
Tracing with PatchWrx

Trace
User controlled start/stop/dump
Dumps captured trace to binary file
Captures VA mapping snapshot of active processes
during trace capture

19
Reconstructing Execution
[Figure: the RECONSTRUCT TOOL takes IMAGE 0 ... IMAGE n, SYMBOL TABLE 0 ... SYMBOL TABLE n, the VA MAP, and the RAW TRACE as inputs and produces the I-STREAM and/or D-STREAM]
20
OS-Rich Workload Characterization

Execution domain analysis
Hot EXEs / DLLs (system resources)
Instruction mix
Application-only
Full system
Branching behavior
Branch frequency (average basic block size)
Branch prediction in presence of OS

21
Workloads Investigated
22
Five most frequently used images in each benchmark or application
Workload | 1st | 2nd | 3rd | 4th | 5th | Other
fourier | bytecpu.exe (99.5) | winsrv.dll (0.2) | win32k.sys (0.1) | ntoskrnl.exe (0.1) | user32.dll (.02) | (0.8)
li | li.exe (97.7) | win32k.sys (1.0) | ntoskrnl.exe (0.6) | user32.dll (0.1) | qv.dll (0.1) | (0.5)
go | go.exe (95.5) | win32k.sys (2.0) | ntoskrnl.exe (1.0) | hal.dll (0.4) | gv.dll (0.1) | (1.0)
ie | iexplore.exe (37.2) | win32k.sys (19.3) | ntoskrnl.exe (17.5) | fastfat.sys (6.1) | ntdll.dll (6.0) | (13.9)
vc50 | c1.exe (83.1) | ntoskrnl.exe (10.5) | msvcrt.dll (2.8) | nsfs.sys (1.2) | win32k.sys (1.1) | (1.3)
word | mssp232.dll (36.4) | msgren32.dll (34.0) | ntoskrnl.exe (10.2) | win32k.sys (7.7) | hal.dll (4.0) | (7.7)
fx!32 | hal.dll (42.5) | s3.dll (24.6) | opengl32.dll (12.2) | msvcrt.dll (11.7) | glu32.dll (2.7) | (6.3)
23
Average basic block lengths
24
Conditional Branch Prediction: 2-level BTB, 12-bit PHR, 4096 entries, gshare
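
For readers unfamiliar with gshare, here is a minimal sketch of a direction predictor with a 12-bit history register and 4096 entries, matching the parameters named on this slide. The 2-bit saturating counters, the initial counter value, and the PC hashing are assumptions; the BTB itself is not modeled, and this is not the simulator used for the reported results.

```python
class Gshare:
    """Gshare direction predictor: table index = PC bits XOR global history."""

    def __init__(self, history_bits=12):
        self.mask = (1 << history_bits) - 1
        self.history = 0
        # Assumption: 2-bit saturating counters, starting weakly not-taken.
        self.counters = [1] * (1 << history_bits)

    def _index(self, pc):
        # Drop the two low PC bits (4-byte instructions), then XOR with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(trace, predictor):
    """Fraction of correctly predicted branches in a (pc, taken) trace."""
    correct = 0
    for pc, taken in trace:
        correct += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return correct / len(trace)


loop_branch = [(0x400, True)] * 1000
print(accuracy(loop_branch, Gshare()))
```

Feeding the same predictor an application-only trace and a full-system trace is one way to see how operating system activity shifts prediction accuracy, since OS branches share and disturb the same history and counter table.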
25
Summary of Results

Benchmarks execute almost entirely within the
application domain
Desktop applications execute across many images
and interact with the kernel and system DLLs
Branch prediction accuracy can change drastically
(sometimes it can even improve) when the
operating system interaction is considered
The instruction mix in desktop applications
changes significantly in the presence of OS
Increased number of indirect branches and
privileged instructions (e.g., PALcalls)

26
For Further Information

1. Tracing and Characterization of Windows NT-based System Workloads, J.P. Casmira, D.P. Hunter and D.R. Kaeli, Digital Technical Journal, Vol. 10, No. 1, 1998, pp. 6-21 (www.digital.com/info/DTJ01/DTJ01HM.HTM).
2. Operating System Impact on Trace-Driven Simulation, J.P. Casmira, J. Fraser and D.R. Kaeli, Proceedings of the 31st Simulation Symposium, Boston, MA, April 1998, pp. 76-82.
3. A Code Annotation Tool for Capturing Operating System Execution, J. Fraser, Northeastern University Technical Report, NUCAR_6-97-1, June 1997 (on the NUCAR website).

http://www.ece.neu.edu/groups/nucar
27
And now back to tracing...
28
Trap Based

Interrupt the application at selected points in
order to save trace records
Pros
Available on many CPUs
Portable
Inexpensive
Cons
Considerable slowdown (1000x)
Intrusive (ISR), especially when considering
real-time events
How do we decide where to interrupt the processor and still maintain a representative trace?

29
Example Trap Based Systems

VAX-Tracer: Clark and Emer study on VAX
OS2-Tracer: Intel 386
Wisconsin Wind Tunnel: ECC error trapping, CM5 (SPARC)
Tapeworm II: system ECC error trapping, OS trap handler

30
Emulators

Simulate each target ISA instruction using one or more machine instructions on the host ISA
Pros
Minimal slowdown (10-100x)
Opportunity for JIT compilation
Portable
Flexible: software controlled
Cons
Serious programming effort needed
Extra memory needed
Typically single process tracing

31
Emulators

Shade (UW 1994): dynamic translation
Compiles emulated instructions to native instructions (many elements of Shade have shown up in Transmeta products)
Host: SPARC-V8
Targets: SPARC-V8, SPARC-V9, MIPS
Spa (Sun 1993): iterative interpretation
Reinterprets instructions on each occurrence
Host: MIPS-1
Targets: MIPS-1, MIPS-2
SPIM (U of Wisc 1991): predecoded interpretation
Provides pointers to instruction handler and operands to speed decoding
Hosts: SPARC, 680x0, MIPS, HP-PA
Target: MIPS-1
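
The difference between iterative and predecoded interpretation can be sketched on a toy target. The two-instruction ISA, its text encoding, and all names below are invented for illustration and do not correspond to Spa or SPIM internals; both interpreters emit the same PC trace, but the predecoded one resolves the handler and operands only once per instruction.

```python
def op_addi(regs, pc, reg, imm):
    regs[reg] += imm
    return pc + 1


def op_bnez(regs, pc, reg, target):
    return target if regs[reg] != 0 else pc + 1


HANDLERS = {"addi": op_addi, "bnez": op_bnez}


def decode(text):
    """Turn 'addi r1 5' into (handler, register index, value)."""
    name, reg, value = text.split()
    return HANDLERS[name], int(reg[1:]), int(value)


def run_iterative(program, regs):
    trace, pc = [], 0
    while pc < len(program):
        handler, reg, value = decode(program[pc])  # decoded on every execution
        trace.append(pc)
        pc = handler(regs, pc, reg, value)
    return trace


def run_predecoded(program, regs):
    decoded = [decode(text) for text in program]  # decoded once, up front
    trace, pc = [], 0
    while pc < len(program):
        handler, reg, value = decoded[pc]  # pointer to handler and operands
        trace.append(pc)
        pc = handler(regs, pc, reg, value)
    return trace


PROGRAM = ["addi r0 3", "addi r1 10", "addi r0 -1", "bnez r0 1"]
print(run_predecoded(PROGRAM, [0, 0]))
```

The predecoded table costs extra memory per target instruction, which is the "extra memory needed" trade-off listed among the cons of emulation.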

32
More Recent Emulators

VisualDSP++ (Analog Devices 1995-present)
Simulator for SHARC and BlackFin DSPs that runs on WinTel and Linux-x86
Provides C/C++ compilation environment
Statistical profiling
Cycle-accurate simulator
Provides a full visualization environment for machine performance
AMD Opteron X86-64 (2003)
Simulator for the new 64-bit X86 from AMD
Runs on 32-bit Linux-x86
Comes complete with an X86-64 version of gcc
http://www.x86-64.org/

33
MP Emulators

MINT (University of Rochester 1994)
Predecoded interpretation, memory references
Host: R3000 (SGI, DECstations)
Target: R3000 (an Alpha-based derivative called AINT was also developed)
RSim (Rice Univ 1997): simulator for high-ILP multiprocessors
Detailed cycle-based emulation
Host: SPARC, SGI PowerChallenge
Target: MIPS R10K

34
Machine Emulators

Simics (1996-present), Virtutech
Developed out of research work at SICS
Provides a large number of CPU targets: Alpha, ARM, Itanium, MIPS, Pentium, PowerPC, SPARC, X86-64
Provides both detailed simulation/emulation and high throughput
http://www.simics.com/
SimOS (1997), Stanford University
Originally designed to run on an SGI platform
Actually boots a full operating system (SGI IRIX and DEC UNIX)
Implementations on Alpha and MIPS platforms
Designed around the operating system, emulating I/O and other system-related events
Provided the base technology for VMWare products

35
Code Annotation

Instrumented program produces trace while the
application is run
Three levels of annotation
Source code modification
Object code modification
Binary code modification
Pros
Ease of implementation
Small slowdown (10x)
Inexpensive
Cons
Limited completeness (OS, multiprocessing)
May not capture DLLs
Memory dilation

36
Source Code Annotation

TRAPEDS (Univ. of Illinois 1989)
Adds a call upon exit from a basic block
MPTrace (Univ. of Washington 1990)
i386, instruments only MP-relevant events
Tangolite (Stanford 1993)
Annotates all memory events in an MP environment
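
As a rough picture of source-level annotation in the basic-block style listed above, the sketch below hand-inserts a trace call at the exit of each basic block of a small function. The `trace_block` helper and the block labels are hypothetical; a real annotation tool would insert such calls automatically.

```python
TRACE = []


def trace_block(block_id):
    """Trace hook: record that a basic block has just finished executing."""
    TRACE.append(block_id)


def count_positive(values):
    total = 0
    trace_block("entry")          # exit of the entry block
    for v in values:
        if v > 0:
            total += 1
            trace_block("then")   # exit of the if-body block
        trace_block("loop")       # exit of the loop-body tail block
    trace_block("exit")           # exit of the final block
    return total


print(count_positive([1, -1, 2]), TRACE)
```

The added calls change code size and layout, which is the memory dilation noted among the cons of code annotation.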

37
Object Code Annotation

Epoxie (DEC WRL 1989) Titan MP
Epoxie2 (DEC WRL 1993) R3000
ATOM (DEC WRL 1994) Alpha
Alto (Univ. of Arizona 1996) Alpha
PLTO (Univ. of Arizona 2001) IA32

38
Binary Code Annotation

Pixie (DEC 1991) MIPS
Goblin (IBM/CMU 1991) RS/6000
IDtrace (Univ. of Mich.) i486
QPT (Univ. of Wisc.) MIPS, SPARC
EEL (Univ. of Wisc.) MIPS, SPARC
DSPTune (NEU) ADI SHARC DSP
Pin (Intel 2005) X86, XScale, Itanium

39
Embedded Systems Profiling Tools

Enhance current embedded system compilation
environments, providing profile-driven analysis
and feedback capabilities
DSPTune - instrumentation and analysis package
for the SHARC family of DSPs
Allows for full instrumentation of C and C++ codes at the source, assembly and ELF binary levels
Supported by Analog Devices and the NSF

40
The DSPTune Toolset

A set of library routines that enable the user to
instrument C and assembly programs
Function calls can be inserted at various
locations in the application code, enabling
execution driven simulation
The user provides
instrumentation routines, which specify the
selected instrumentation events (e.g., loads,
branches, traps)
analysis routines, which carry out the desired
simulation (e.g., caches, stacks, branch
predictors)
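
The split between instrumentation routines (which events to select) and analysis routines (what to simulate) can be sketched as follows. This is not the DSPTune API: the event tuples, the `instrument` function, and the direct-mapped cache model with its sizes are all assumptions chosen for illustration.

```python
class DirectMappedCache:
    """Analysis routine: a tiny direct-mapped cache model counting hits and misses."""

    def __init__(self, num_lines=64, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def access(self, address):
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag


def instrument(events, selected_kinds, analysis):
    """Instrumentation routine: forward only the selected event kinds."""
    for kind, address in events:
        if kind in selected_kinds:
            analysis(address)


EVENTS = [("load", 0x00), ("load", 0x04), ("store", 0x100),
          ("load", 0x400), ("load", 0x00)]
cache = DirectMappedCache()
instrument(EVENTS, {"load"}, cache.access)
print(cache.hits, cache.misses)
```

Keeping the two roles separate lets the same instrumented program drive different simulations (caches, stacks, branch predictors) by swapping only the analysis routine.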

41
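Step I: the Parser turns the user application code into an Intermediate Representation (IR)
Step II: the Instrumenting Tool combines the IR with the user instrumentation code to produce an instrumented IR
Step III: the Code Generator produces the instrumented application code
Step IV: the Assembler and Linker combine the instrumented application code with the user analysis code into the instrumented application executable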
42
BDSPTune

Provides capabilities similar to DSPTune
Allows ELF binaries to be instrumented
Enables instrumentation and profiling to include library routines

43
Summary of Tracing Methodologies
Method | Slowdown | OS coverage | Sample size | Cost
Source Code | 10X | NO | >GB | LOW
Object Code | 10X | SOME | >GB | LOW
Binary Code | 10X | NO | >GB | LOW
Microcode | 10X | YES | >GB | MEDIUM
I-Stepping | 1000X | YES | unlimited | MEDIUM
Emulation | 10-100X | YES | unlimited | MEDIUM
Real-time | 1X | YES | <GB | HIGH
44
Counter-based Profiling and Instrumentation

David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
kaeli_at_ece.neu.edu

45
Counters are used to

Identify Performance Bottlenecks
especially unpredictable dynamic stalls, e.g. cache misses, branch mispredicts, TLB misses, etc.
complex out-of-order processors make this difficult
Guide Optimizations
help programmers understand and improve code
automatic, profile-driven optimizations
Profile Production Workloads
low overhead
transparent
profile whole system

46
Performance Counters

Interfaced through a device driver and supporting
GUI (e.g., VTune)
Counters increment based on a set of events of
interest (e.g., cache misses, pipeline stalls)
An interrupt signals that the counter has overflowed
An interrupt service routine reads the counter
information and tags it to a program counter (PC)
value
Information is then available for offline
analysis
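
The overflow-driven sampling loop described above can be simulated in a few lines. The event stream, the sampling period, and the function names are assumptions for illustration; real counters are programmed through a driver or library rather than modeled in software.

```python
from collections import Counter


def sample_on_overflow(event_pcs, period):
    """Record the PC each time the event counter overflows (every `period` events)."""
    samples = Counter()
    count = 0
    for pc in event_pcs:
        count += 1
        if count == period:     # overflow raises an interrupt
            samples[pc] += 1    # the ISR tags the sample with the PC
            count = 0           # the ISR re-arms the counter
    return samples


def estimate_events(samples, period):
    """Offline analysis: scale sample counts back to estimated event counts."""
    return {pc: n * period for pc, n in samples.items()}


# Hypothetical stream: 300 cache misses at PC 0x10, then 100 at PC 0x20.
misses = [0x10] * 300 + [0x20] * 100
samples = sample_on_overflow(misses, period=100)
print(estimate_events(samples, period=100))
```

A longer period lowers interrupt overhead but yields fewer samples, so the estimates converge more slowly.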

47
Performance Counters

Low overhead method for obtaining performance and
profiling information
Typically less than 5% slowdown
Requires no modification of the binary
May require root level access to system
Lacks precision in cause/effect analysis
Come for free on most ISAs
Commonly used today to measure performance and
estimate power usage

48
Counter Library

A number of counter libraries are available that provide an API to program and access the counters on common architectures
Rabbit
for Intel/AMD Processors and Linux
URL: www.scl.ameslab.gov/Projects/Rabbit/
PAPI
Linux IA32, IA64
Allows counters to be captured on a per thread basis
URL: icl.cs.utk.edu/projects/papi/

49
Counters available on different ISAs
Category | PentiumII | 21064 | 21164 | IBM604e | R10K | Ultra2
Number of counters | 2 | 2 | 3 | 4 | 2 | 2
Counter Range | 40 | 8, 12, 16 | 14, 16 | 32 | 32 | 32
Variable Range | No | Yes | No | No | No | No
Sampling Freq | Variable | Fixed | Fixed | Variable | Variable | Fixed
R/W Access | Yes | No | Yes | Yes | Yes | Yes
Duration Counting | Yes | No | No | No | No | No
Counting Modes | Different Privilege Levels | Selected Processes | User, Kernel, PALmode | User, Kernel, Processes | User, Kernel | User, Kernel
50
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
L1 data cache read | Y | N | Y | N | Y
L1 data cache write | Y | N | N | N | Y
L1 data cache r/w | N | Y | N | N | N
L1 data cache miss | Y | Y | Y | Y | Y
L1 inst cache read | Y | N | N | N | N
L1 inst cache r/w | N | Y | N | N | Y
L1 inst cache hit | N | Y | N | N | Y
L1 inst cache miss | Y | Y | Y | Y | Y
51
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
TLB miss | N | N | Y | Y | N
Data TLB miss | N | Y | Y | N | N
Inst TLB miss | Y | Y | Y | N | N
Retired Branches | Y | N | N | Y | N
Mispredicted Branches | Y | Y | N | Y | N
Taken Branches | Y | N | N | N | N
Mispredicted Retired B | Y | N | N | N | N
52
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
Retired Instructions | Y | Y | Y | Y | Y
Issued Instructions | Y | N | Y | Y | N
Integer Inst Executed | N | Y | Y | N | N
FP Inst Executed | Y | Y | Y | Y | N
Load Inst Executed | N | Y | Y | Y | N
Store Inst Executed | N | Y | N | Y | N
Branch Inst Executed | Y | N | Y | N | N
53
Events countable on different ISAs
Event | PentiumII | 21164 | IBM604e | R10K | Ultra2
Total cycles | Y | Y | Y | Y | Y
Cycles BPU is idle | N | N | Y | N | N
Cycles IU is idle | N | N | Y | N | N
Cycles LSU is idle | N | N | Y | N | N
Cycles LSU stalls | N | N | Y | N | N
Cycles FPU stalls | Y | N | Y | N | N
Cycles BPU stalls | N | N | Y | N | N
54
Existing Instruction-Level Sampling

Use Hardware Event Counters
small set of software-loadable counters
each counts a single event at a time, e.g. dcache
miss
counter overflow generates interrupt
Advantages
low overhead vs. simulation and instrumentation
transparent vs. instrumentation
complete coverage, e.g. kernel, shared libs, etc.
Effective on In-Order Processors
analysis computes execution frequency
heuristics identify possible reasons for stalls
example: DIGITAL's Continuous Profiling Infrastructure

55
Problems with Event-Based Counters

Cannot simultaneously monitor all events
Limited information about events
event has occurred, but no additional context, e.g. cache miss latencies, recent execution path, ...
Blind spots in non-interruptible code
Key problem: imprecise attribution
interrupt delivers restart PC, not the PC that caused event
problem worse on out-of-order processors

56
Problem: Imprecise Attribution

Example: finding the single instruction that causes a long-latency event (e.g., cache miss, TLB miss, branch mispredict)
Most counter-based schemes provide the PC at the point a counter overflowed
In-order processors (Alpha 21164)
Imprecise exceptions/interrupts hinder our ability to quickly identify the cause of latencies during execution
It is possible to post-analyze the problem to attempt to identify the responsible instruction
Out-of-order processors (Alpha 21264, Pentium 4)
Because execution is not sequential, the responsible instruction can be far from the current PC
It is nearly impossible to identify the cause of the latency

57
Profile-Me Profiling Strategy (DEC 1998)

PC + Retire Status -> execution frequency
PC + Cache Miss Flag -> cache miss rates
PC + Branch Mispredict -> mispredict rates
PC + Event Flag -> event rates
PC + Branch Direction -> edge frequencies
PC + Branch History -> path execution rates
PC + Latency -> instruction stalls
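
How per-instruction samples turn into the rates listed above can be sketched with a small aggregation. The sample layout (PC, retired flag, cache-miss flag, latency in cycles) and the function name are assumptions made for this example, not the ProfileMe record format.

```python
from collections import defaultdict


def summarize(samples):
    """Aggregate (pc, retired, cache_miss, latency_cycles) samples per PC."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[0]].append(sample)
    summary = {}
    for pc, group in groups.items():
        n = len(group)
        summary[pc] = {
            "samples": n,                                    # relative execution frequency
            "retire_rate": sum(s[1] for s in group) / n,     # PC + retire status
            "miss_rate": sum(s[2] for s in group) / n,       # PC + cache miss flag
            "avg_latency": sum(s[3] for s in group) / n,     # PC + latency
        }
    return summary


SAMPLES = [
    (0x40, True, True, 30),
    (0x40, True, False, 10),
    (0x44, False, False, 2),
]
for pc, stats in summarize(SAMPLES).items():
    print(hex(pc), stats)
```

Because each sample carries the PC of the instruction itself, the attribution problem of overflow interrupts does not arise in this scheme.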

58
Identifying True Bottlenecks

ProfileMe: detailed data for a single instruction
In-Order Processors
ProfileMe PC + latency data identifies stalls
stalled instructions back up pipeline
Out-of-Order Processors
explicitly designed to mask stall latency, e.g. dynamic reordering, speculative execution
stall does not necessarily imply bottleneck
Example: does this stall matter?
load r1,
add ,r1, (average latency 35.0 cycles)
other instructions

59
Example: Retire Count Convergence
[Figure: Estimate / Actual]
60
How to handle concurrency and OOO?

Appropriate concurrency metrics
retired instructions per cycle
issue slots wasted while an instruction is in
flight
pipeline stage utilization
How to measure concurrency?
Special-purpose hardware
some metrics difficult to measure, e.g. need retire/abort status
Sample potentially-concurrent instructions
aggregate info from pairs of samples
statistically estimate metrics

61
How to handle concurrency and OOO?

Sample Two Instructions
sample instructions, not events
may be in-flight simultaneously
replicate ProfileMe hardware, add intra-pair
distance
Nested Sampling
sample window around first profiled instruction
randomly select second profiled instruction
statistically estimate frequency for F (first,
second)

62
Other Uses of Paired Sampling

Path Profiling
two PCs close in time can identify execution path
identify control flow, e.g. indirect branches,
calls, traps
Direct Latency Measurements
data load-to-use
loop iteration cost

63
VTune IA32 Instrumentation and Profiling

Supports all versions of IA32 Intel processors
Provides a rich GUI to ease programming and
reading of hardware counters
Features include
Time and event-based sampling
Call graph profiling
Provides source-level tuning advice
Allows for integrated visualization of source and
counter information
Supports C/C++, Fortran, Java and IA32 assembly

64
VTune Time Sample
65
VTune Call Graph
66
VTune Hot Spot Analyzer
67
VTune Tuning Assistant
68
Using Performance Counters for Power
Profiling/Estimation

Profile power-consuming events
Cache misses
TLB misses
Pipeline stalls
Opportunities to wait slower!
How can we tie high counts to when to adjust
voltage/frequency? (more on this later in the
class.)

69
Summary

Tracing/Instrumentation is still used today by
industry and academia
The field has evolved significantly
Industry uses software-based tools for
performance and hardware-based tools for
power/energy
Most performance studies today use some form of
emulation or virtualized execution to obtain
trace data
Counters can be used effectively to capture
performance data
The entry cost for using counters is low
Out-of-order (OOO) microarchitectures inhibit the use of counters
Paired sampling can be an effective technique for
handling imprecision
A number of high-quality free and commercial
tools are available (and we are going to use at
least one of them)
