Title: Design and Evaluation of Architectures for Commercial Applications
1. Design and Evaluation of Architectures for Commercial Applications
Part II: Tools and Methods
2. Overview
- Evaluation methods/tools
- Introduction
- Software instrumentation (ATOM)
- Hardware measurement and profiling
- IPROBE
- DCPI
- ProfileMe
- Tracing and trace-driven simulation
- User-level simulators
- Complete machine simulators (SimOS)
3. Studying commercial applications: challenges
- Size of the data sets and programs
- Complex control flow
- Complex interactions with the operating system
- Difficult tuning process
- Lack of access to source code (?)
- Vendor restrictions on publications
- It is important to have a rich set of tools
4. Tools are useful in many phases
- Understanding behavior of workloads
- Tuning
- Performance measurements in existing systems
- Performance estimation for future systems
5. Using ordinary system tools
- Measuring CPU utilization and balance
- Determining user/system breakdown
- Detecting I/O bottlenecks
- Disks
- Networks
- Monitoring memory utilization and swap activity
6. Gathering symbol table information
- Most database programs are large, statically linked, stripped binaries
- Most tools require symbol table information
- However, distributions typically consist of object files with symbolic data
- Simple trick (see the sketch below): replace the system linker with a wrapper that removes the strip flag, then calls the real linker
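A minimal sketch of such a wrapper, assuming the real linker has been renamed to /usr/bin/ld.real and that -s is the strip flag to drop; both details are illustrative, not the exact setup used on the original systems.

/* ld_wrapper.c: sketch of a linker wrapper that drops the strip flag and
 * forwards everything else to the real linker.  Assumes the real linker
 * was renamed to /usr/bin/ld.real (illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char **args = malloc((argc + 1) * sizeof(char *));
    int i, n = 0;

    args[n++] = "/usr/bin/ld.real";          /* assumed path of the real linker */
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-s") == 0)      /* drop the strip flag */
            continue;
        args[n++] = argv[i];
    }
    args[n] = NULL;

    execv(args[0], args);                    /* hand off to the real linker */
    perror("execv");                         /* only reached on failure */
    return 1;
}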
7. ATOM: A Tool-Building System
- Developed at WRL by Alan Eustace and Amitabh Srivastava
- Easy to build new tools
- Flexible enough to build interesting tools
- Fast enough to run on real applications
- Compiler independent: works on existing binaries
8. Code Instrumentation
[Diagram: the "Trojan Horse" model - the TOOL's code is injected into the application]
- Application appears unchanged
- ATOM adds code and data to the application
- Information is collected as a side effect of execution
9. ATOM Programming Interface
- Given an application program:
- Navigation: move around
- Interrogation: ask questions
- Definition: define the interface to analysis procedures
- Instrumentation: add calls to analysis procedures
- Pass ANYTHING as arguments!
- PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc.
10. Navigation Primitives
- GetFirst/Last/Next/PrevObj
- GetFirst/Last/Next/PrevObjProc
- GetFirst/Last/Next/PrevBlock
- GetFirst/Last/Next/PrevInst
- GetInstBlock - Find enclosing block
- GetBlockProc - Find enclosing procedure
- GetProcObj - Find enclosing object
- GetInstBranchTarget - Find branch target
- ResolveTargetProc - Find subroutine destination
11. Interrogation
- GetProgramInfo(PInfo)
- number of procedures, blocks, and instructions
- text and data addresses
- GetProcInfo(Proc, ProcInfo)
- number of blocks or instructions
- procedure frame size, integer and floating-point save masks
- GetBlockInfo(Block, BlockInfo)
- number of instructions
- GetInstInfo(Inst, InstInfo)
- any piece of the instruction (opcode, ra, rb, displacement)
12. Interrogation (2)
- ProcFileName
- Returns the file name for this procedure
- InstLineNo
- Returns the line number of this instruction
- GetInstRegEnum
- Returns a unique register specifier
- GetInstRegUsage
- Computes Source and Destination masks
13. Interrogation (3)
- GetInstRegUsage
- Computes instruction source and destination masks
-   GetInstRegUsage(instFirst, &usageFirst);
-   GetInstRegUsage(instSecond, &usageSecond);
-   if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0])
-     /* set followed by a use */
- Exactly what you need to find static pipeline stalls! (see the sketch below)
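A hedged sketch of how this check could sit inside a complete ATOM instrumentation file that counts adjacent set-followed-by-use pairs (candidate static stalls). The loop structure follows the cache tool on slide 21; the register-usage type name is an assumption.

/* stallpairs.inst.c: sketch of an ATOM instrumentation file that counts
 * adjacent instruction pairs where a destination register is used by the
 * next instruction.  The type name InstRegUsageVector is assumed. */
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i, *next;
    InstRegUsageVector useFirst, useSecond;   /* assumed type name */
    long pairs = 0;

    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i)) {
                    next = GetNextInst(i);
                    if (next == NULL) continue;
                    GetInstRegUsage(i, &useFirst);
                    GetInstRegUsage(next, &useSecond);
                    if (useFirst.dreg_bitvec[0] & useSecond.ureg_bitvec[0])
                        pairs++;              /* set followed by a use */
                }
        WriteObj(o);
    }
    fprintf(stderr, "candidate set-use pairs: %ld\n", pairs);
    return (0);
}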
14. Definition
- AddCallProto(function(argument list))
- Constants
- Character strings
- Program counter
- Register contents
- Cycle counter
- Constant arrays
- Effective Addresses
- Branch Condition Values
15. Instrumentation
- AddCallProgram(ProgramBefore,After, name, args)
- AddCallProc(p, ProcBefore,After, name, args)
- AddCallBlock(b, BlockBefore,After, name, args)
- AddCallInst(i, InstBefore,After, name, args)
- ReplaceProc(p, new)
16. Example 1: Procedure Tracing
- What procedures are executed by the following mystery program?
#include <stdio.h>
main() { printf("Hello world!\n"); }
Hint: main -> printf -> ???
17. Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
> __start
> main
> printf
> _doprnt
> __getmbcurmax
< __getmbcurmax
> memcpy
< memcpy
> fwrite
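A hedged sketch of what the two ATOM files used above might look like; the ptrace tool shipped with ATOM may differ in detail, so treat this as illustrative.

/* ptrace.inst.c (sketch): call ProcEntry/ProcExit around every procedure. */
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p;

    AddCallProto("ProcEntry(char *)");
    AddCallProto("ProcExit(char *)");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p)) {
            AddCallProc(p, ProcBefore, "ProcEntry", ProcName(p));
            AddCallProc(p, ProcAfter, "ProcExit", ProcName(p));
        }
        WriteObj(o);
    }
    return (0);
}

/* ptrace.anal.c (sketch): print one line per procedure entry and exit. */
#include <stdio.h>

void ProcEntry(char *name) { fprintf(stderr, "> %s\n", name); }
void ProcExit(char *name)  { fprintf(stderr, "< %s\n", name); }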
18. Procedure Trace (2)
19. Example 2: Cache Simulator
- Write a tool that computes the miss rate of the application running in a 64 KB, direct-mapped data cache with 32-byte lines.
- > atom spice cache.inst.o cache.anal.o -o spice.cache
- > spice.cache < ref.in > ref.out
- > more cache.out
- 5,387,822,402 620,855,884 11.523  (references, misses, miss rate in %)
- Great use for 64-bit integers!
20. Cache Tool Implementation
[Diagram: application code before and after instrumentation - a Reference(-32592(gp)) call is inserted before each load/store, and PrintResults() is called at program exit]
Note: passes addresses as if uninstrumented!
21. Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;

    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
22. Cache Analysis File
#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address)
{
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

void Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
23. Example 3: TPC-B runtime information
- Statistics per transaction:
- Instructions: 180,398
- Loads (% shared): 47,643 (24%)
- Stores (% shared): 21,380 (22%)
- Lock/Unlock: 118
- MBs: 241
- Footprints/CPU:
- Instr.: 300 KB (1.6 MB in pages)
- Private data: 470 KB (4 MB in pages)
- Shared data: 7 MB (26 MB in pages)
- 50% of the shared data footprint is touched by at least one other process
24. TPC-B (2)
25. TPC-B (3)
26. Oracle SGA activity in TPC-B
27. ATOM wrap-up
- Very flexible, hack-it-yourself tool
- Discovers detailed information on the dynamic behavior of programs
- Especially good when you don't have source code
- Shipped with Digital Unix
- Can be used for tracing (more on this later)
28. Hardware measurement tools
- IPROBE
- interface to CPU event counters
- DCPI
- hardware-assisted profiling
- ProfileMe
- hardware-assisted profiling for complex CPU cores
29. IPROBE
- Developed by Digital's Performance Group
- Uses the event counters provided by Alpha CPUs
- Operation:
- set a counter to monitor a particular event (e.g., icache_miss)
- start the counter
- on every counter overflow, an interrupt wakes up a handler and events are accumulated
- stop the counter and read the total
- User can select:
- which processes to count
- user level, kernel level, or both
30. IPROBE: 21164 event types
cycles, issues, single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, long_stalls, replay_trap, ldu_replays, wb_maf_full_replays, branches, cond_branches, jsr_ret, branch_mispr, pc_mispr, integer_ops, float_ops, loads, stores, loads_merged, icache_access, icache_miss, itb_miss, dcache_access, dcache_miss, dtb_miss, scache_access, scache_read, scache_read_miss, scache_write, scache_write_miss, scache_sh_write, scache_miss, bcache_miss, sys_inv, sys_read_req
31. IPROBE: what you can do
- Directly measure relevant events (e.g., cache performance)
- Overall CPU cycle breakdown diagnosis (see the sketch below):
- microbenchmark the machine to estimate latencies
- combine latencies with event counts
- Main source of inaccuracy:
- load/store overlap in the memory system
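A hedged sketch of the "combine latencies with event counts" step: given counter totals read with IPROBE and microbenchmark-derived latencies, estimate how many CPU cycles each miss class accounts for. All event names, counts, and latencies below are placeholders, not measurements.

/* stall_breakdown.c: sketch of estimating a stall-cycle breakdown from
 * event counts and microbenchmark-derived latencies.  All numbers are
 * placeholders; real values come from IPROBE runs and microbenchmarks. */
#include <stdio.h>

int main(void)
{
    /* event counts read from IPROBE (placeholders) */
    long long cycles      = 1000000000LL;
    long long scache_miss =    2000000LL;    /* L2 (S-cache) misses    */
    long long bcache_miss =    1500000LL;    /* board-cache misses     */
    long long dtb_miss    =     500000LL;    /* data TLB misses        */

    /* latencies in cycles, estimated with microbenchmarks (placeholders) */
    double scache_lat = 20.0;
    double bcache_lat = 80.0;                /* nominal memory latency */
    double dtb_lat    = 50.0;

    printf("S-cache miss stalls: %5.1f%% of cycles\n",
           100.0 * scache_miss * scache_lat / cycles);
    printf("B-cache miss stalls: %5.1f%% of cycles\n",
           100.0 * bcache_miss * bcache_lat / cycles);
    printf("DTB miss stalls:     %5.1f%% of cycles\n",
           100.0 * dtb_miss * dtb_lat / cycles);
    /* note: load/store overlap in the memory system makes this an upper
     * bound on the true stall time (the main source of inaccuracy) */
    return 0;
}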
32. IPROBE example: 4-CPU SMP
[Charts: breakdown of CPU cycles and estimated breakdown of stall cycles]
33. Why did it run so badly?!?
- Nominal memory latencies were good: 80 cycles
- Micro-benchmarks determined that:
- latency under load is over 120 cycles on 4 processors
- base dirty-miss latency was over 130 cycles
- off-chip cache latency was high
- IPROBE data uncovered significant sharing:
- for P2, 15% of bcache misses are to dirty blocks
- for P4, 20% of bcache misses are to dirty blocks
34. Dirty miss latency on RISC SMPs
- The SPEC benchmarks have no significant sharing
- Current processors/systems optimize local cache access
- All RISC SMPs have high dirty-miss penalties
35. DCPI: continuous profiling infrastructure
- Developed by SRC and WRL researchers
- Based on periodic sampling
- Hardware generates periodic interrupts
- The OS handles the interrupts and stores data
- Program Counter (PC) and any extra info
- Analysis tools convert the data
- for users
- for compilers
- Other examples:
- SGI Speedshop, Unix's prof(), VTune
36. Sampling vs. Instrumentation
- Much lower overhead than instrumentation
- DCPI: program runs 1-3% slower
- Pixie: program runs 2-3 times slower
- Applicable to large workloads
- 100,000 TPS on Alpha
- AltaVista
- Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
- Instrumenting kernels is very tricky
- No source code needed
37. Information from Profiles
- DCPI estimates:
- Where CPU cycles went, broken down by:
- image, procedure, instruction
- How often code was executed
- basic blocks and CFG edges
- Where peak performance was lost and why
38. Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles     %    cum%   load file
2257103  37.03  37.03   /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21  64.24   /vmunix
 928318  15.23  79.47   /usr/shlib/X11/libmi.so
 650299  10.67  90.14   /usr/shlib/X11/libos.so

 cycles     %    cum%   procedure               load file
2064143  33.87  33.87   ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49  42.35   ReadRequestFromClient   /usr/shlib/X11/libos.so
 305072   5.01  47.36   miCreateETandAET        /usr/shlib/X11/libmi.so
 271158   4.45  51.81   miZeroArcSetup          /usr/shlib/X11/libmi.so
 245450   4.03  55.84   bcopy                   /vmunix
 209835   3.44  59.28   Dispatch                /usr/shlib/X11/libdix.so
 186413   3.06  62.34   ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80  65.14   in_checksum             /vmunix
 161326   2.65  67.78   miInsertEdgeInET        /usr/shlib/X11/libmi.so
 133768   2.19  69.98   miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so
39. Example: Using the Microscope
Where peak performance is lost and why
40. Example: Summarizing Stalls
I-cache (not ITB)     0.0 to  0.3
ITB/I-cache miss      0.0 to  0.0
D-cache miss         27.9 to 27.9
DTB miss              9.2 to 18.3
Write buffer          0.0 to  6.3
Synchronization       0.0 to  0.0
Branch mispredict     0.0 to  2.6
IMUL busy             0.0 to  0.0
FDIV busy             0.0 to  0.0
Other                 0.0 to  0.0
Unexplained stall     2.3 to  2.3
Unexplained gain     -4.3 to -4.3
---------------------------------------------
Subtotal dynamic              44.1

Slotting                       1.8
Ra dependency                  2.0
Rb dependency                  1.0
Rc dependency                  0.0
FU dependency                  0.0
---------------------------------------------
Subtotal static                4.8
---------------------------------------------
Total stall                   48.9
Execution                     51.2
Net sampling error            -0.1
---------------------------------------------
Total tallied                100.0   (35171, 93.1% of all samples)
41. Example: Sorting Stalls
   %   cum%   cycles   cnt   cpi   blame    PC     file:line
10.0   10.0   109885  4998  22.0   dcache   957c   comp.c:484
 9.9   19.8   108776  5513  19.7   dcache   9530   comp.c:477
 7.8   27.6    85668  3836  22.3   dcache   959c   comp.c:488
42. Typical Hardware Support
- Timers
- Clock interrupt after N units of time
- Performance Counters
- Interrupt after N events:
- cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
- Alpha 21064, 21164; PPro, PII
- Easy to measure total cycles, issues, CPI, etc.
- Only extra information is the restart PC
43. Problem: Inaccurate Attribution
- Experiment:
- count data loads
- loop: a single load followed by hundreds of nops
- In-Order Processor
- Alpha 21164
- skew
- large peak
- Out-of-Order Processor
- Intel Pentium Pro
- skew
- smear
[Figure: histograms of where the load samples land for each processor]
44. Ramifications of Misattribution
- No skew or smear
- Instruction-level analysis is easy!
- Skew is a constant number of cycles
- Instruction-level analysis is possible
- Adjust the sampling period by the amount of skew
- Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
- Smear
- Instruction-level analysis seems hopeless
- Examples: PII, StrongARM
45. Desired Hardware Support
- Sample fetched instructions
- Save the PC of the sampled instruction
- E.g., the interrupt handler reads an Internal Processor Register
- Makes skew and smear irrelevant
- Gather more information
46. ProfileMe: Instruction-Centric Profiling
[Diagram: pipeline (fetch, map, issue, exec, retire) with ProfileMe hardware. On a fetch-counter overflow an instruction is randomly selected and tagged ("ProfileMe tag!"); as it flows through the icache, dcache, branch predictor, and arithmetic units, its PC, effective address, cache-miss and mispredict flags, retire status, branch history, and stage latencies are captured into internal processor registers, and an interrupt is raised when it is done.]
47. Instruction-Level Statistics
- PC + Retire Status -> execution frequency
- PC + Cache Miss Flag -> cache miss rates
- PC + Branch Mispredict -> mispredict rates
- PC + Event Flag -> event rates
- PC + Branch Direction -> edge frequencies
- PC + Branch History -> path execution rates
- PC + Latency -> instruction stalls
- e.g., a 100-cycle dcache miss vs. just "a dcache miss" (see the aggregation sketch below)
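A hedged sketch of how these per-sample fields might be aggregated into per-PC statistics in a profiler's analysis code; the sample record layout is hypothetical, not the actual ProfileMe register format.

/* profileme_agg.c: sketch of aggregating instruction-centric samples into
 * per-PC statistics.  The sample layout below is hypothetical. */
#include <stdio.h>

typedef struct {                  /* one ProfileMe-style sample (hypothetical) */
    unsigned long pc;
    int retired;                  /* did the tagged instruction retire?   */
    int dcache_miss;              /* cache-miss flag                      */
    int mispredict;               /* branch-mispredict flag               */
    int latency;                  /* cycles spent in some pipeline stage  */
} Sample;

typedef struct {                  /* per-PC aggregate                      */
    long samples, retired, dmiss, mispred;
    long long latency_sum;
} PcStats;

static void accumulate(PcStats *s, const Sample *x)
{
    s->samples++;
    s->retired     += x->retired;
    s->dmiss       += x->dcache_miss;
    s->mispred     += x->mispredict;
    s->latency_sum += x->latency;
}

int main(void)
{
    /* two placeholder samples for the same PC, purely illustrative */
    Sample in[] = { { 0x12004ba90, 1, 1, 0, 22 },
                    { 0x12004ba90, 1, 0, 0,  3 } };
    PcStats s = { 0 };
    int i;

    for (i = 0; i < 2; i++)
        accumulate(&s, &in[i]);
    printf("pc=%#lx  retired=%ld  dmiss=%.1f%%  mispred=%.1f%%  avg latency=%.1f\n",
           in[0].pc, s.retired,
           100.0 * s.dmiss / s.samples,
           100.0 * s.mispred / s.samples,
           (double)s.latency_sum / s.samples);
    return 0;
}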
48. Data Analysis
- Cycle samples are proportional to the total time spent at the head of the issue queue (at least on in-order Alphas)
- Frequency indicates frequent paths
- CPI indicates stalls
49. Estimating Frequency from Samples
- Problem:
- given cycle samples, compute frequency and CPI
- Approach:
- Let F = Frequency / Sampling Period
- E(Cycle Samples) = F x CPI
- So F = E(Cycle Samples) / CPI
50. Estimating Frequency (cont.)
- F = E(Cycle Samples) / CPI
- Idea:
- If there is no dynamic stall, then the CPI is known, so F can be estimated
- So assume some instructions have no dynamic stalls
- Consider a group of instructions with the same frequency (e.g., a basic block)
- Identify instructions w/o dynamic stalls, then average their sample counts for better accuracy (see the sketch below)
- Key insight:
- Instructions without stalls have smaller sample counts
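A hedged sketch of this estimate for one basic block: each instruction has a statically known minimum CPI and a cycle-sample count, the instructions with the smallest samples/MinCPI ratios are assumed stall-free, and their ratios are averaged to estimate the block frequency. The selection rule (lowest quarter) is illustrative; the real DCPI heuristic is more careful.

/* freq_estimate.c: sketch of estimating basic-block frequency from cycle
 * samples, following F = E(samples) / CPI. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long samples;     /* cycle samples attributed to this instruction     */
    double min_cpi;   /* CPI assuming no dynamic stalls (static schedule) */
} InstSamples;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimated frequency in units of (executions / sampling period). */
double EstimateBlockFrequency(const InstSamples *inst, int n)
{
    double *ratio = malloc(n * sizeof *ratio);
    double sum = 0.0;
    int i, k;

    for (i = 0; i < n; i++)
        ratio[i] = inst[i].samples / inst[i].min_cpi;   /* candidate F */

    /* stall-free instructions have the smallest ratios; average the
     * lowest quarter of them (at least one) */
    qsort(ratio, n, sizeof *ratio, cmp_double);
    k = (n / 4 > 0) ? n / 4 : 1;
    for (i = 0; i < k; i++)
        sum += ratio[i];
    free(ratio);
    return sum / k;
}

int main(void)
{
    /* placeholder samples for a 4-instruction block; the third
     * instruction suffers a dynamic stall */
    InstSamples block[] = { { 102, 1.0 }, { 98, 1.0 }, { 560, 1.0 }, { 101, 1.0 } };
    printf("estimated frequency: %.0f samples/period\n",
           EstimateBlockFrequency(block, 4));
    return 0;
}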
51. Estimating Frequency (Example)
- Does badly when:
- there are few issue points
- all issue points stall
- Compute MinCPI from the code
- Compute Samples / MinCPI
- Select data to average
52. Frequency Estimate Accuracy
- Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
53. Explaining Stalls
- Static stalls:
- Schedule instructions in each basic block optimistically, using a detailed pipeline model for the processor
- Dynamic stalls:
- Start with all possible explanations
- I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
- Rule out unlikely explanations
- List the remaining possibilities
54. Ruling Out D-cache Misses
- Is the previous occurrence of an operand register the destination of a load instruction?
- Search backward, across basic block boundaries (see the sketch below)
- Prune by block and edge execution frequencies
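A hedged sketch of the backward dependence search on a control-flow graph: starting from a stalled instruction's operand register, walk predecessors until the most recent writer of that register is found, and keep "D-cache miss" as a possible explanation only if that writer may be a load. The data structures are simplified stand-ins for DCPI's real analysis, and the frequency-based pruning is omitted.

/* ruleout_dcache.c: sketch of ruling out D-cache misses by finding the
 * defining instruction of an operand register.  Simplified CFG. */
#include <stdio.h>

#define MAX_PRED 4

typedef struct Inst {
    int is_load;                  /* does this instruction load from memory? */
    int dest_reg;                 /* register written (-1 if none)           */
} Inst;

typedef struct Block {
    Inst *insts;                  /* instructions in program order           */
    int n;
    struct Block *pred[MAX_PRED];
    int npred;
    int visited;                  /* simple cycle guard for the sketch       */
} Block;

/* Return 1 if the most recent writer of `reg` before position `pos` in
 * block `b` (searching backward across predecessors) may be a load. */
int DefMayBeLoad(Block *b, int pos, int reg)
{
    int i;

    for (i = pos - 1; i >= 0; i--)
        if (b->insts[i].dest_reg == reg)
            return b->insts[i].is_load;      /* found the definition */

    if (b->visited) return 0;                /* already explored this block */
    b->visited = 1;
    for (i = 0; i < b->npred; i++)           /* keep searching in predecessors */
        if (DefMayBeLoad(b->pred[i], b->pred[i]->n, reg))
            return 1;
    return 0;
}

int main(void)
{
    /* one-block example: r2 = load ...; then a stalled use of r2 */
    Inst code[] = { { 1, 2 }, { 0, 3 } };
    Block b = { code, 2, { 0 }, 0, 0 };
    printf("stall on r2 may be a D-cache miss: %s\n",
           DefMayBeLoad(&b, 1, 2) ? "yes" : "no");
    return 0;
}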
55. DCPI wrap-up
- Very precise, non-intrusive profiling tool
- Gathers both user-level and kernel profiles
- Relates architectural events back to the original code
- Used for profile-based code optimizations
56. Simulation of commercial workloads
- Requires scaling down the workload
- Options:
- Trace-driven simulation
- User-level execution-driven simulation
- Complete machine simulation
57. Trace-driven simulation
- Methodology:
- create an ATOM instrumentation tool that logs a complete trace per Oracle server process
- instruction path
- data accesses
- synchronization accesses
- system calls
- run the atomized version to derive the trace
- feed the traces to the simulator
58. Trace-driven studies: limitations
- No OS activity (in OLTP the OS takes 10-15% of the time)
- Traces selected processes only (e.g., server processes)
- Time dilation alters system behavior
- I/O looks faster
- many places with hardwired timeout values have to be patched
- Capturing synchronization correctly is difficult
- need to reproduce correct concurrency for shared data structures
- the DB has a complex synchronization structure with many levels of procedures
59. Trace-driven studies: limitations (2)
- Scheduling traces onto simulated processors:
- need enough information in the trace to reproduce OS scheduling
- need to suspend processes for I/O and other blocking operations
- need to model the activity of background processes that are not traced (e.g., the log writer)
- Re-creating the OS virtual-to-physical mapping and page-coloring scheme
- Very difficult to simulate wrong-path execution
60. User-level execution-driven simulator
- Our experience was to modify AINT (MINT for Alpha)
- Problems:
- no OS activity measured
- Oracle/OS interactions are very complex
- the OS system call interface has to be virtualized
- that's a hard one to crack
- Our status:
- Oracle/TPC-B ran with only 1 server process
- we gave up...
61. Complete machine simulator
- Bite the bullet: model the machine at the hardware level
- The good news is:
- the hardware interface is cleaner and better documented than any software interface (including the OS)
- all software JUST RUNS!! Including the OS
- applications don't have to be ported to the simulator
- We ported SimOS (from Stanford) to Alpha
62. SimOS
- A complete machine simulator
- Speed-detail tradeoff for maximum flexibility
- Flexible data collection and classification
- Originally developed at Stanford University (MIPS ISA)
- The SimOS-Alpha effort started at WRL in Fall 1996
- Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
63. SimOS: Complete Machine Simulation
[Diagram: workloads (e.g., VCS) running on the operating system of the simulated machine, on top of SimOS's simulated hardware (CPUs, caches, Ethernet), all hosted on the host machine]
- Models CPUs, caches, buses, memory, disks, network, ...
- Complete enough to run the OS and any applications
64. Multiple Levels of Detail
- Tradeoff between the speed of simulation and the amount of detail that is simulated
- Multiple modes of CPU simulation:
- Fast on-the-fly compilation: 10X slowdown!
- Workload placement
- Simple pipeline emulator, no caches: 50-100X slowdown
- Rough characterization
- Simple pipeline emulator, full cache simulation: 100-200X slowdown
- More accurate characterization of workloads
65. Multiple Models for each Component
- Multiple models for CPU, cache, memory, and disk
- CPU:
- simple pipeline emulator: 100-200X slowdown (EV5)
- dynamically-scheduled processor: 1000-10000X slowdown (e.g., 21264)
- Caches:
- two-level set-associative caches
- shared caches
- Memory:
- perfect (0-latency), bus-based (Tlaser), NUMA (Wildfire)
- Disk:
- fixed latency or a more complex HP disk model
- Modular: add your own flavors
66. Checkpoint and Sampling
- Checkpoint capability for the entire machine state
- CPU state, main memory, and disk changes
- Important for positioning the workload for detailed simulation
- Switching detail level in a sampling study
- Run in faster modes, sample in more detailed modes
- Repeatability
- Change parameters between studies:
- cache size
- memory type and latencies
- disk models and latencies
- many others
- Debugging race conditions
67. Data Collection and Classification
- Exploits the visibility and non-intrusiveness offered by simulation
- Can observe low-level events such as cache misses, references, and TLB misses
- Tcl-based configuration and control provides ease of use
- Powerful annotation mechanism for triggering on events
- Hardware, OS, or application
- Apps and mechanisms to organize and classify data
- Some already provided (cache miss counts and classification)
- Mechanisms to do more (timing trees and detail tables)
68. Easy configuration
- Tcl-based configuration of the machine parameters
- Example:
set PARAM(CPU.Model) DELTA
set detailLevel 1
set PARAM(CPU.Clock) 1000
set PARAM(CPU.Count) 4
set PARAM(CACHE.2Level.L2Size) 1024
set PARAM(CACHE.2Level.L2Line) 64
set PARAM(CACHE.2Level.L2HitTime) 15
set PARAM(MEMSYS.MemSize) 1024
set PARAM(MEMSYS.Numa.NumMemories) $PARAM(CPU.Count)
set PARAM(MEMSYS.Model) Numa
set PARAM(DISK.Fixed.Latency) 10
69. Annotations: the building block
- Small procedures to be run on encountering certain events
- PC, hardware events (cache miss, TLB, ...), simulator events
annotation set pc vmunix:idle_thread:START {
    set PROCESS($CPU) idle
    annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
    log "$CYCLES ContextSwitch $CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
    incr tpcbTOGO -1
    console "TRANSACTION $CYCLES togo=$tpcbTOGO\n"
    if {$tpcbTOGO == 0} {simosExit}
}
70. Example: Kernel Detail (TPC-B)
71. SimOS Methodology
- Configure and tune the workload on an existing machine
- build the database schema, create indexes, load data, optimize queries
- more difficult if the simulated system is much different from the existing platform
- Create file(s) with disk images (dd) of the database disk(s)
- write-protect the dd files to prevent permanent modification (i.e., use copy-on-write)
- optionally, unmount the disks and let SimOS use them as raw devices
- Configure SimOS to see the dd files as raw disks
- Boot a SimOS configuration and mount the disks
72. SimOS Methodology (2)
- Boot and start up the database engine in fast mode
- Start up the workload
- When in steady state, create a checkpoint and exit
- Resume from the checkpoint with the complex (slower) simulator
73. Sample NUMA TPC-B Profile
74. Running from a Checkpoint
- What can be changed:
- processor model
- disk model
- cache sizes, hierarchy, organization, replacement
- how long to run the simulation
- What cannot be changed:
- number of processors
- size of physical memory
75. Tools wrap-up
- No single tool will get the job done
- Monitoring application execution in a real system is invaluable
- Complete machine simulation advantages:
- you see the whole thing
- portability of software is a non-issue
- the speed/detail trade-off is essential for detailed studies