Title: Design and Evaluation of Architectures for Commercial Applications
1. Design and Evaluation of Architectures for Commercial Applications
Part II: Tools and Methods
2. Overview
- Evaluation methods/tools
- Introduction
- Software instrumentation (ATOM)
- Hardware measurement and profiling
- IPROBE
- DCPI
- ProfileMe
- Tracing and trace-driven simulation
- User-level simulators
- Complete machine simulators (SimOS)
3. Studying commercial applications: challenges
- Size of the data sets and programs
- Complex control flow
- Complex interactions with the operating system
- Difficult tuning process
- Lack of access to source code (?)
- Vendor restrictions on publications
- It is important to have a rich set of tools
4. Tools are useful in many phases
- Understanding behavior of workloads
- Tuning
- Performance measurements in existing systems
- Performance estimation for future systems
5. Using ordinary system tools
- Measuring CPU utilization and balance
- Determining user/system breakdown
- Detecting I/O bottlenecks
- Disks
- Networks
- Monitoring memory utilization and swap activity
6. Gathering symbol table information
- Most database programs are large, statically linked, stripped binaries
- Most tools require symbol table information
- However, distributions typically consist of object files with symbolic data
- Simple trick (see the sketch below): replace the system linker with a wrapper that removes the strip flag, then calls the real linker
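A minimal sketch of such a wrapper, assuming the real linker has been renamed to /usr/bin/ld.real and that -s is the strip flag to drop; both details are illustrative, not the exact setup used on the original systems.

/* ld_wrapper.c: sketch of a linker wrapper that drops the strip flag and
 * forwards everything else to the real linker.  Assumes the real linker
 * was renamed to /usr/bin/ld.real (illustrative). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    char **args = malloc((argc + 1) * sizeof(char *));
    int i, n = 0;

    args[n++] = "/usr/bin/ld.real";          /* assumed path of the real linker */
    for (i = 1; i < argc; i++) {
        if (strcmp(argv[i], "-s") == 0)      /* drop the strip flag */
            continue;
        args[n++] = argv[i];
    }
    args[n] = NULL;

    execv(args[0], args);                    /* hand off to the real linker */
    perror("execv");                         /* only reached on failure */
    return 1;
}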
7. ATOM: A Tool-Building System
- Developed at WRL by Alan Eustace and Amitabh Srivastava
- Easy to build new tools
- Flexible enough to build interesting tools
- Fast enough to run on real applications
- Compiler independent: works on existing binaries
8. Code Instrumentation
[Diagram: the "Trojan Horse" model - the TOOL's code is injected into the application]
- Application appears unchanged
- ATOM adds code and data to the application
- Information is collected as a side effect of execution
9. ATOM Programming Interface
- Given an application program:
- Navigation: move around
- Interrogation: ask questions
- Definition: define the interface to analysis procedures
- Instrumentation: add calls to analysis procedures
- Pass ANYTHING as arguments!
- PC, effective addresses, constants, register values, arrays, function arguments, line numbers, procedure names, file names, etc.
10. Navigation Primitives
- GetFirst/Last/Next/PrevObj
- GetFirst/Last/Next/PrevObjProc
- GetFirst/Last/Next/PrevBlock
- GetFirst/Last/Next/PrevInst
- GetInstBlock - Find enclosing block
- GetBlockProc - Find enclosing procedure
- GetProcObj - Find enclosing object
- GetInstBranchTarget - Find branch target
- ResolveTargetProc - Find subroutine destination
11. Interrogation
- GetProgramInfo(PInfo)
- number of procedures, blocks, and instructions
- text and data addresses
- GetProcInfo(Proc, ProcInfo)
- number of blocks or instructions
- procedure frame size, integer and floating-point save masks
- GetBlockInfo(Block, BlockInfo)
- number of instructions
- GetInstInfo(Inst, InstInfo)
- any piece of the instruction (opcode, ra, rb, displacement)
12. Interrogation (2)
- ProcFileName
- Returns the file name for this procedure
- InstLineNo
- Returns the line number of this instruction
- GetInstRegEnum
- Returns a unique register specifier
- GetInstRegUsage
- Computes Source and Destination masks
13. Interrogation (3)
- GetInstRegUsage
- Computes instruction source and destination masks
-   GetInstRegUsage(instFirst, &usageFirst);
-   GetInstRegUsage(instSecond, &usageSecond);
-   if (usageFirst.dreg_bitvec[0] & usageSecond.ureg_bitvec[0])
-     /* set followed by a use */
- Exactly what you need to find static pipeline stalls! (see the sketch below)
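A hedged sketch of how this check could sit inside a complete ATOM instrumentation file that counts adjacent set-followed-by-use pairs (candidate static stalls). The loop structure follows the cache tool on slide 21; the register-usage type name is an assumption.

/* stallpairs.inst.c: sketch of an ATOM instrumentation file that counts
 * adjacent instruction pairs where a destination register is used by the
 * next instruction.  The type name InstRegUsageVector is assumed. */
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i, *next;
    InstRegUsageVector useFirst, useSecond;   /* assumed type name */
    long pairs = 0;

    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i)) {
                    next = GetNextInst(i);
                    if (next == NULL) continue;
                    GetInstRegUsage(i, &useFirst);
                    GetInstRegUsage(next, &useSecond);
                    if (useFirst.dreg_bitvec[0] & useSecond.ureg_bitvec[0])
                        pairs++;              /* set followed by a use */
                }
        WriteObj(o);
    }
    fprintf(stderr, "candidate set-use pairs: %ld\n", pairs);
    return (0);
}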
14. Definition
- AddCallProto(function(argument list))
- Constants
- Character strings
- Program counter
- Register contents
- Cycle counter
- Constant arrays
- Effective Addresses
- Branch Condition Values
15. Instrumentation
- AddCallProgram(ProgramBefore,After, name, args)
- AddCallProc(p, ProcBefore,After, name, args)
- AddCallBlock(b, BlockBefore,After, name, args)
- AddCallInst(i, InstBefore,After, name, args)
- ReplaceProc(p, new)
16. Example 1: Procedure Tracing
- What procedures are executed by the following mystery program?
#include <stdio.h>
main() { printf("Hello world!\n"); }
Hint: main -> printf -> ???
17. Procedure Tracing Example
> cc hello.c -non_shared -g1 -o hello
> atom hello ptrace.inst.c ptrace.anal.c -o hello.ptrace
> hello.ptrace
> __start
> main
> printf
> _doprnt
> __getmbcurmax
< __getmbcurmax
> memcpy
< memcpy
> fwrite
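A hedged sketch of what the two ATOM files used above might look like; the ptrace tool shipped with ATOM may differ in detail, so treat this as illustrative.

/* ptrace.inst.c (sketch): call ProcEntry/ProcExit around every procedure. */
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p;

    AddCallProto("ProcEntry(char *)");
    AddCallProto("ProcExit(char *)");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p)) {
            AddCallProc(p, ProcBefore, "ProcEntry", ProcName(p));
            AddCallProc(p, ProcAfter, "ProcExit", ProcName(p));
        }
        WriteObj(o);
    }
    return (0);
}

/* ptrace.anal.c (sketch): print one line per procedure entry and exit. */
#include <stdio.h>

void ProcEntry(char *name) { fprintf(stderr, "> %s\n", name); }
void ProcExit(char *name)  { fprintf(stderr, "< %s\n", name); }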
18. Procedure Trace (2)
19. Example 2: Cache Simulator
- Write a tool that computes the miss rate of the application running in a 64 KB, direct-mapped data cache with 32-byte lines.
- > atom spice cache.inst.o cache.anal.o -o spice.cache
- > spice.cache < ref.in > ref.out
- > more cache.out
- 5,387,822,402 620,855,884 11.523  (references, misses, miss rate in %)
- Great use for 64-bit integers!
20. Cache Tool Implementation
[Diagram: application code before and after instrumentation - a Reference(-32592(gp)) call is inserted before each load/store, and PrintResults() is called at program exit]
Note: passes addresses as if uninstrumented!
21. Cache Instrumentation File
#include <stdio.h>
#include <cmplrs/atom.inst.h>

unsigned InstrumentAll(int argc, char **argv)
{
    Obj *o; Proc *p; Block *b; Inst *i;

    AddCallProto("Reference(VALUE)");
    AddCallProto("Print()");
    for (o = GetFirstObj(); o != NULL; o = GetNextObj(o)) {
        if (BuildObj(o)) return (1);
        if (o == GetFirstObj()) AddCallObj(o, ObjAfter, "Print");
        for (p = GetFirstProc(); p != NULL; p = GetNextProc(p))
            for (b = GetFirstBlock(p); b != NULL; b = GetNextBlock(b))
                for (i = GetFirstInst(b); i != NULL; i = GetNextInst(i))
                    if (IsInstType(i, InstTypeLoad) || IsInstType(i, InstTypeStore))
                        AddCallInst(i, InstBefore, "Reference", EffAddrValue);
        WriteObj(o);
    }
    return (0);
}
22. Cache Analysis File
#include <stdio.h>
#define CACHE_SIZE 65536
#define BLOCK_SHIFT 5

long cache[CACHE_SIZE >> BLOCK_SHIFT], refs, misses;

void Reference(long address)
{
    int index = (address & (CACHE_SIZE - 1)) >> BLOCK_SHIFT;
    long tag = address >> BLOCK_SHIFT;
    if (cache[index] != tag) { misses++; cache[index] = tag; }
    refs++;
}

void Print()
{
    FILE *file = fopen("cache.out", "w");
    fprintf(file, "%ld %ld %.2f\n", refs, misses, 100.0 * misses / refs);
    fclose(file);
}
23. Example 3: TPC-B runtime information
- Statistics per transaction:
- Instructions: 180,398
- Loads (% shared): 47,643 (24%)
- Stores (% shared): 21,380 (22%)
- Lock/Unlock: 118
- MBs: 241
- Footprints/CPU:
- Instr.: 300 KB (1.6 MB in pages)
- Private data: 470 KB (4 MB in pages)
- Shared data: 7 MB (26 MB in pages)
- 50% of the shared data footprint is touched by at least one other process
24. TPC-B (2)
25. TPC-B (3)
26. Oracle SGA activity in TPC-B
27. ATOM wrap-up
- Very flexible, hack-it-yourself tool
- Discovers detailed information on the dynamic behavior of programs
- Especially good when you don't have source code
- Shipped with Digital Unix
- Can be used for tracing (more on this later)
28. Hardware measurement tools
- IPROBE
- interface to CPU event counters
- DCPI
- hardware-assisted profiling
- ProfileMe
- hardware-assisted profiling for complex CPU cores
29. IPROBE
- Developed by Digital's Performance Group
- Uses the event counters provided by Alpha CPUs
- Operation:
- set a counter to monitor a particular event (e.g., icache_miss)
- start the counter
- on every counter overflow, an interrupt wakes up a handler and events are accumulated
- stop the counter and read the total
- User can select:
- which processes to count
- user level, kernel level, or both
30. IPROBE: 21164 event types
cycles, issues, single_issue_cycles, dual_issue_cycles, triple_issue_cycles, quad_issue_cycles, split_issue_cycles, pipe_dry, pipe_frozen, long_stalls, replay_trap, ldu_replays, wb_maf_full_replays, branches, cond_branches, jsr_ret, branch_mispr, pc_mispr, integer_ops, float_ops, loads, stores, loads_merged, icache_access, icache_miss, itb_miss, dcache_access, dcache_miss, dtb_miss, scache_access, scache_read, scache_read_miss, scache_write, scache_write_miss, scache_sh_write, scache_miss, bcache_miss, sys_inv, sys_read_req
31. IPROBE: what you can do
- Directly measure relevant events (e.g., cache performance)
- Overall CPU cycle breakdown diagnosis (see the sketch below):
- microbenchmark the machine to estimate latencies
- combine latencies with event counts
- Main source of inaccuracy:
- load/store overlap in the memory system
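A hedged sketch of the "combine latencies with event counts" step: given counter totals read with IPROBE and microbenchmark-derived latencies, estimate how many CPU cycles each miss class accounts for. All event names, counts, and latencies below are placeholders, not measurements.

/* stall_breakdown.c: sketch of estimating a stall-cycle breakdown from
 * event counts and microbenchmark-derived latencies.  All numbers are
 * placeholders; real values come from IPROBE runs and microbenchmarks. */
#include <stdio.h>

int main(void)
{
    /* event counts read from IPROBE (placeholders) */
    long long cycles      = 1000000000LL;
    long long scache_miss =    2000000LL;    /* L2 (S-cache) misses    */
    long long bcache_miss =    1500000LL;    /* board-cache misses     */
    long long dtb_miss    =     500000LL;    /* data TLB misses        */

    /* latencies in cycles, estimated with microbenchmarks (placeholders) */
    double scache_lat = 20.0;
    double bcache_lat = 80.0;                /* nominal memory latency */
    double dtb_lat    = 50.0;

    printf("S-cache miss stalls: %5.1f%% of cycles\n",
           100.0 * scache_miss * scache_lat / cycles);
    printf("B-cache miss stalls: %5.1f%% of cycles\n",
           100.0 * bcache_miss * bcache_lat / cycles);
    printf("DTB miss stalls:     %5.1f%% of cycles\n",
           100.0 * dtb_miss * dtb_lat / cycles);
    /* note: load/store overlap in the memory system makes this an upper
     * bound on the true stall time (the main source of inaccuracy) */
    return 0;
}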
32. IPROBE example: 4-CPU SMP
[Charts: breakdown of CPU cycles and estimated breakdown of stall cycles]
33. Why did it run so badly?!?
- Nominal memory latencies were good: 80 cycles
- Micro-benchmarks determined that:
- latency under load is over 120 cycles on 4 processors
- base dirty-miss latency was over 130 cycles
- off-chip cache latency was high
- IPROBE data uncovered significant sharing:
- for P2, 15% of bcache misses are to dirty blocks
- for P4, 20% of bcache misses are to dirty blocks
34. Dirty miss latency on RISC SMPs
- The SPEC benchmarks have no significant sharing
- Current processors/systems optimize local cache access
- All RISC SMPs have high dirty-miss penalties
35. DCPI: continuous profiling infrastructure
- Developed by SRC and WRL researchers
- Based on periodic sampling
- Hardware generates periodic interrupts
- The OS handles the interrupts and stores data
- Program Counter (PC) and any extra info
- Analysis tools convert the data
- for users
- for compilers
- Other examples:
- SGI Speedshop, Unix's prof(), VTune
36. Sampling vs. Instrumentation
- Much lower overhead than instrumentation
- DCPI: program runs 1-3% slower
- Pixie: program runs 2-3 times slower
- Applicable to large workloads
- 100,000 TPS on Alpha
- AltaVista
- Easier to apply to whole systems (kernel, device drivers, shared libraries, ...)
- Instrumenting kernels is very tricky
- No source code needed
37. Information from Profiles
- DCPI estimates:
- Where CPU cycles went, broken down by:
- image, procedure, instruction
- How often code was executed
- basic blocks and CFG edges
- Where peak performance was lost and why
38. Example: Getting the Big Picture
Total samples for event type cycles = 6095201

 cycles     %    cum%   load file
2257103  37.03  37.03   /usr/shlib/X11/lib_dec_ffb_ev5.so
1658462  27.21  64.24   /vmunix
 928318  15.23  79.47   /usr/shlib/X11/libmi.so
 650299  10.67  90.14   /usr/shlib/X11/libos.so

 cycles     %    cum%   procedure               load file
2064143  33.87  33.87   ffb8ZeroPolyArc         /usr/shlib/X11/lib_dec_ffb_ev5.so
 517464   8.49  42.35   ReadRequestFromClient   /usr/shlib/X11/libos.so
 305072   5.01  47.36   miCreateETandAET        /usr/shlib/X11/libmi.so
 271158   4.45  51.81   miZeroArcSetup          /usr/shlib/X11/libmi.so
 245450   4.03  55.84   bcopy                   /vmunix
 209835   3.44  59.28   Dispatch                /usr/shlib/X11/libdix.so
 186413   3.06  62.34   ffb8FillPolygon         /usr/shlib/X11/lib_dec_ffb_ev5.so
 170723   2.80  65.14   in_checksum             /vmunix
 161326   2.65  67.78   miInsertEdgeInET        /usr/shlib/X11/libmi.so
 133768   2.19  69.98   miX1Y1X2Y2InRegion      /usr/shlib/X11/libmi.so
39. Example: Using the Microscope
Where peak performance is lost and why
40. Example: Summarizing Stalls
I-cache (not ITB)     0.0 to  0.3
ITB/I-cache miss      0.0 to  0.0
D-cache miss         27.9 to 27.9
DTB miss              9.2 to 18.3
Write buffer          0.0 to  6.3
Synchronization       0.0 to  0.0
Branch mispredict     0.0 to  2.6
IMUL busy             0.0 to  0.0
FDIV busy             0.0 to  0.0
Other                 0.0 to  0.0
Unexplained stall     2.3 to  2.3
Unexplained gain     -4.3 to -4.3
---------------------------------------------
Subtotal dynamic              44.1

Slotting                       1.8
Ra dependency                  2.0
Rb dependency                  1.0
Rc dependency                  0.0
FU dependency                  0.0
---------------------------------------------
Subtotal static                4.8
---------------------------------------------
Total stall                   48.9
Execution                     51.2
Net sampling error            -0.1
---------------------------------------------
Total tallied                100.0   (35171, 93.1% of all samples)
41. Example: Sorting Stalls
   %   cum%   cycles   cnt   cpi   blame    PC     file:line
10.0   10.0   109885  4998  22.0   dcache   957c   comp.c:484
 9.9   19.8   108776  5513  19.7   dcache   9530   comp.c:477
 7.8   27.6    85668  3836  22.3   dcache   959c   comp.c:488
42. Typical Hardware Support
- Timers
- Clock interrupt after N units of time
- Performance Counters
- Interrupt after N events:
- cycles, issues, loads, L1 Dcache misses, branch mispredicts, uops retired, ...
- Alpha 21064, 21164; PPro, PII
- Easy to measure total cycles, issues, CPI, etc.
- Only extra information is the restart PC
43. Problem: Inaccurate Attribution
- Experiment:
- count data loads
- loop: a single load followed by hundreds of nops
- In-Order Processor
- Alpha 21164
- skew
- large peak
- Out-of-Order Processor
- Intel Pentium Pro
- skew
- smear
[Figure: histograms of where the load samples land for each processor]
44. Ramifications of Misattribution
- No skew or smear
- Instruction-level analysis is easy!
- Skew is a constant number of cycles
- Instruction-level analysis is possible
- Adjust the sampling period by the amount of skew
- Infer execution counts, CPI, stalls, and stall explanations from cycle samples and the program
- Smear
- Instruction-level analysis seems hopeless
- Examples: PII, StrongARM
45. Desired Hardware Support
- Sample fetched instructions
- Save the PC of the sampled instruction
- E.g., the interrupt handler reads an Internal Processor Register
- Makes skew and smear irrelevant
- Gather more information
46. ProfileMe: Instruction-Centric Profiling
[Diagram: pipeline (fetch, map, issue, exec, retire) with ProfileMe hardware. On a fetch-counter overflow an instruction is randomly selected and tagged ("ProfileMe tag!"); as it flows through the icache, dcache, branch predictor, and arithmetic units, its PC, effective address, cache-miss and mispredict flags, retire status, branch history, and stage latencies are captured into internal processor registers, and an interrupt is raised when it is done.]
47. Instruction-Level Statistics
- PC + Retire Status -> execution frequency
- PC + Cache Miss Flag -> cache miss rates
- PC + Branch Mispredict -> mispredict rates
- PC + Event Flag -> event rates
- PC + Branch Direction -> edge frequencies
- PC + Branch History -> path execution rates
- PC + Latency -> instruction stalls
- e.g., a 100-cycle dcache miss vs. just "a dcache miss" (see the aggregation sketch below)
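A hedged sketch of how these per-sample fields might be aggregated into per-PC statistics in a profiler's analysis code; the sample record layout is hypothetical, not the actual ProfileMe register format.

/* profileme_agg.c: sketch of aggregating instruction-centric samples into
 * per-PC statistics.  The sample layout below is hypothetical. */
#include <stdio.h>

typedef struct {                  /* one ProfileMe-style sample (hypothetical) */
    unsigned long pc;
    int retired;                  /* did the tagged instruction retire?   */
    int dcache_miss;              /* cache-miss flag                      */
    int mispredict;               /* branch-mispredict flag               */
    int latency;                  /* cycles spent in some pipeline stage  */
} Sample;

typedef struct {                  /* per-PC aggregate                      */
    long samples, retired, dmiss, mispred;
    long long latency_sum;
} PcStats;

static void accumulate(PcStats *s, const Sample *x)
{
    s->samples++;
    s->retired     += x->retired;
    s->dmiss       += x->dcache_miss;
    s->mispred     += x->mispredict;
    s->latency_sum += x->latency;
}

int main(void)
{
    /* two placeholder samples for the same PC, purely illustrative */
    Sample in[] = { { 0x12004ba90, 1, 1, 0, 22 },
                    { 0x12004ba90, 1, 0, 0,  3 } };
    PcStats s = { 0 };
    int i;

    for (i = 0; i < 2; i++)
        accumulate(&s, &in[i]);
    printf("pc=%#lx  retired=%ld  dmiss=%.1f%%  mispred=%.1f%%  avg latency=%.1f\n",
           in[0].pc, s.retired,
           100.0 * s.dmiss / s.samples,
           100.0 * s.mispred / s.samples,
           (double)s.latency_sum / s.samples);
    return 0;
}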
48. Data Analysis
- Cycle samples are proportional to the total time spent at the head of the issue queue (at least on in-order Alphas)
- Frequency indicates frequent paths
- CPI indicates stalls
49. Estimating Frequency from Samples
- Problem:
- given cycle samples, compute frequency and CPI
- Approach:
- Let F = Frequency / Sampling Period
- E(Cycle Samples) = F x CPI
- So F = E(Cycle Samples) / CPI
50. Estimating Frequency (cont.)
- F = E(Cycle Samples) / CPI
- Idea:
- If there is no dynamic stall, then the CPI is known, so F can be estimated
- So assume some instructions have no dynamic stalls
- Consider a group of instructions with the same frequency (e.g., a basic block)
- Identify instructions w/o dynamic stalls, then average their sample counts for better accuracy (see the sketch below)
- Key insight:
- Instructions without stalls have smaller sample counts
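A hedged sketch of this estimate for one basic block: each instruction has a statically known minimum CPI and a cycle-sample count, the instructions with the smallest samples/MinCPI ratios are assumed stall-free, and their ratios are averaged to estimate the block frequency. The selection rule (lowest quarter) is illustrative; the real DCPI heuristic is more careful.

/* freq_estimate.c: sketch of estimating basic-block frequency from cycle
 * samples, following F = E(samples) / CPI. */
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long samples;     /* cycle samples attributed to this instruction     */
    double min_cpi;   /* CPI assuming no dynamic stalls (static schedule) */
} InstSamples;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Estimated frequency in units of (executions / sampling period). */
double EstimateBlockFrequency(const InstSamples *inst, int n)
{
    double *ratio = malloc(n * sizeof *ratio);
    double sum = 0.0;
    int i, k;

    for (i = 0; i < n; i++)
        ratio[i] = inst[i].samples / inst[i].min_cpi;   /* candidate F */

    /* stall-free instructions have the smallest ratios; average the
     * lowest quarter of them (at least one) */
    qsort(ratio, n, sizeof *ratio, cmp_double);
    k = (n / 4 > 0) ? n / 4 : 1;
    for (i = 0; i < k; i++)
        sum += ratio[i];
    free(ratio);
    return sum / k;
}

int main(void)
{
    /* placeholder samples for a 4-instruction block; the third
     * instruction suffers a dynamic stall */
    InstSamples block[] = { { 102, 1.0 }, { 98, 1.0 }, { 560, 1.0 }, { 101, 1.0 } };
    printf("estimated frequency: %.0f samples/period\n",
           EstimateBlockFrequency(block, 4));
    return 0;
}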
51. Estimating Frequency (Example)
- Does badly when:
- there are few issue points
- all issue points stall
- Compute MinCPI from the code
- Compute Samples / MinCPI
- Select data to average
52. Frequency Estimate Accuracy
- Compare frequency estimates for blocks to measured values obtained with a pixie-like tool
53. Explaining Stalls
- Static stalls:
- Schedule instructions in each basic block optimistically, using a detailed pipeline model for the processor
- Dynamic stalls:
- Start with all possible explanations
- I-cache miss, D-cache miss, DTB miss, branch mispredict, ...
- Rule out unlikely explanations
- List the remaining possibilities
54. Ruling Out D-cache Misses
- Is the previous occurrence of an operand register the destination of a load instruction?
- Search backward, across basic block boundaries (see the sketch below)
- Prune by block and edge execution frequencies
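A hedged sketch of the backward dependence search on a control-flow graph: starting from a stalled instruction's operand register, walk predecessors until the most recent writer of that register is found, and keep "D-cache miss" as a possible explanation only if that writer may be a load. The data structures are simplified stand-ins for DCPI's real analysis, and the frequency-based pruning is omitted.

/* ruleout_dcache.c: sketch of ruling out D-cache misses by finding the
 * defining instruction of an operand register.  Simplified CFG. */
#include <stdio.h>

#define MAX_PRED 4

typedef struct Inst {
    int is_load;                  /* does this instruction load from memory? */
    int dest_reg;                 /* register written (-1 if none)           */
} Inst;

typedef struct Block {
    Inst *insts;                  /* instructions in program order           */
    int n;
    struct Block *pred[MAX_PRED];
    int npred;
    int visited;                  /* simple cycle guard for the sketch       */
} Block;

/* Return 1 if the most recent writer of `reg` before position `pos` in
 * block `b` (searching backward across predecessors) may be a load. */
int DefMayBeLoad(Block *b, int pos, int reg)
{
    int i;

    for (i = pos - 1; i >= 0; i--)
        if (b->insts[i].dest_reg == reg)
            return b->insts[i].is_load;      /* found the definition */

    if (b->visited) return 0;                /* already explored this block */
    b->visited = 1;
    for (i = 0; i < b->npred; i++)           /* keep searching in predecessors */
        if (DefMayBeLoad(b->pred[i], b->pred[i]->n, reg))
            return 1;
    return 0;
}

int main(void)
{
    /* one-block example: r2 = load ...; then a stalled use of r2 */
    Inst code[] = { { 1, 2 }, { 0, 3 } };
    Block b = { code, 2, { 0 }, 0, 0 };
    printf("stall on r2 may be a D-cache miss: %s\n",
           DefMayBeLoad(&b, 1, 2) ? "yes" : "no");
    return 0;
}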
55. DCPI wrap-up
- Very precise, non-intrusive profiling tool
- Gathers both user-level and kernel profiles
- Relates architectural events back to the original code
- Used for profile-based code optimizations
56. Simulation of commercial workloads
- Requires scaling down the workload
- Options:
- Trace-driven simulation
- User-level execution-driven simulation
- Complete machine simulation
57. Trace-driven simulation
- Methodology:
- create an ATOM instrumentation tool that logs a complete trace per Oracle server process
- instruction path
- data accesses
- synchronization accesses
- system calls
- run the atomized version to derive the trace
- feed the traces to the simulator
58. Trace-driven studies: limitations
- No OS activity (in OLTP the OS takes 10-15% of the time)
- Traces selected processes only (e.g., server processes)
- Time dilation alters system behavior
- I/O looks faster
- many places with hardwired timeout values have to be patched
- Capturing synchronization correctly is difficult
- need to reproduce correct concurrency for shared data structures
- the DB has a complex synchronization structure with many levels of procedures
59. Trace-driven studies: limitations (2)
- Scheduling traces onto simulated processors:
- need enough information in the trace to reproduce OS scheduling
- need to suspend processes for I/O and other blocking operations
- need to model the activity of background processes that are not traced (e.g., the log writer)
- Re-creating the OS virtual-to-physical mapping and page-coloring scheme
- Very difficult to simulate wrong-path execution
60. User-level execution-driven simulator
- Our experience was to modify AINT (MINT for Alpha)
- Problems:
- no OS activity measured
- Oracle/OS interactions are very complex
- the OS system call interface has to be virtualized
- that's a hard one to crack
- Our status:
- Oracle/TPC-B ran with only 1 server process
- we gave up...
61. Complete machine simulator
- Bite the bullet: model the machine at the hardware level
- The good news is:
- the hardware interface is cleaner and better documented than any software interface (including the OS)
- all software JUST RUNS!! Including the OS
- applications don't have to be ported to the simulator
- We ported SimOS (from Stanford) to Alpha
62. SimOS
- A complete machine simulator
- Speed-detail tradeoff for maximum flexibility
- Flexible data collection and classification
- Originally developed at Stanford University (MIPS ISA)
- The SimOS-Alpha effort started at WRL in Fall 1996
- Ed Bugnion, Luiz Barroso, Kourosh Gharachorloo, Ben Verghese, Basem Nayfeh, and Jamey Hicks (CRL)
63. SimOS: Complete Machine Simulation
[Diagram: workloads (e.g., VCS) running on the operating system of the simulated machine, on top of SimOS's simulated hardware (CPUs, caches, Ethernet), all hosted on the host machine]
- Models CPUs, caches, buses, memory, disks, network, ...
- Complete enough to run the OS and any applications
64. Multiple Levels of Detail
- Tradeoff between the speed of simulation and the amount of detail that is simulated
- Multiple modes of CPU simulation:
- Fast on-the-fly compilation: 10X slowdown!
- Workload placement
- Simple pipeline emulator, no caches: 50-100X slowdown
- Rough characterization
- Simple pipeline emulator, full cache simulation: 100-200X slowdown
- More accurate characterization of workloads
65. Multiple Models for each Component
- Multiple models for CPU, cache, memory, and disk
- CPU:
- simple pipeline emulator: 100-200X slowdown (EV5)
- dynamically-scheduled processor: 1000-10000X slowdown (e.g., 21264)
- Caches:
- two-level set-associative caches
- shared caches
- Memory:
- perfect (0-latency), bus-based (Tlaser), NUMA (Wildfire)
- Disk:
- fixed latency or a more complex HP disk model
- Modular: add your own flavors
66. Checkpoint and Sampling
- Checkpoint capability for the entire machine state
- CPU state, main memory, and disk changes
- Important for positioning the workload for detailed simulation
- Switching detail level in a sampling study
- Run in faster modes, sample in more detailed modes
- Repeatability
- Change parameters between studies:
- cache size
- memory type and latencies
- disk models and latencies
- many others
- Debugging race conditions
67. Data Collection and Classification
- Exploits the visibility and non-intrusiveness offered by simulation
- Can observe low-level events such as cache misses, references, and TLB misses
- Tcl-based configuration and control provides ease of use
- Powerful annotation mechanism for triggering on events
- Hardware, OS, or application
- Apps and mechanisms to organize and classify data
- Some already provided (cache miss counts and classification)
- Mechanisms to do more (timing trees and detail tables)
68. Easy configuration
- Tcl-based configuration of the machine parameters
- Example:
set PARAM(CPU.Model) DELTA
set detailLevel 1
set PARAM(CPU.Clock) 1000
set PARAM(CPU.Count) 4
set PARAM(CACHE.2Level.L2Size) 1024
set PARAM(CACHE.2Level.L2Line) 64
set PARAM(CACHE.2Level.L2HitTime) 15
set PARAM(MEMSYS.MemSize) 1024
set PARAM(MEMSYS.Numa.NumMemories) $PARAM(CPU.Count)
set PARAM(MEMSYS.Model) Numa
set PARAM(DISK.Fixed.Latency) 10
69. Annotations: the building block
- Small procedures to be run on encountering certain events
- PC, hardware events (cache miss, TLB, ...), simulator events
annotation set pc vmunix:idle_thread:START {
    set PROCESS($CPU) idle
    annotation exec osEvent startIdle
}
annotation set osEvent switchIn {
    log "$CYCLES ContextSwitch $CPU,$PID($CPU),$PROCESS($CPU)\n"
}
annotation set pc 0x12004ba90 {
    incr tpcbTOGO -1
    console "TRANSACTION $CYCLES togo=$tpcbTOGO\n"
    if {$tpcbTOGO == 0} {simosExit}
}
70. Example: Kernel Detail (TPC-B)
71. SimOS Methodology
- Configure and tune the workload on an existing machine
- build the database schema, create indexes, load data, optimize queries
- more difficult if the simulated system is much different from the existing platform
- Create file(s) with disk images (dd) of the database disk(s)
- write-protect the dd files to prevent permanent modification (i.e., use copy-on-write)
- optionally, unmount the disks and let SimOS use them as raw devices
- Configure SimOS to see the dd files as raw disks
- Boot a SimOS configuration and mount the disks
72. SimOS Methodology (2)
- Boot and start up the database engine in fast mode
- Start up the workload
- When in steady state, create a checkpoint and exit
- Resume from the checkpoint with the complex (slower) simulator
73. Sample NUMA TPC-B Profile
74. Running from a Checkpoint
- What can be changed:
- processor model
- disk model
- cache sizes, hierarchy, organization, replacement
- how long to run the simulation
- What cannot be changed:
- number of processors
- size of physical memory
75. Tools wrap-up
- No single tool will get the job done
- Monitoring application execution in a real system is invaluable
- Complete machine simulation advantages:
- you see the whole thing
- portability of software is a non-issue
- the speed/detail trade-off is essential for detailed studies