Title: Magpie: Distributed Profiling for Performance Analysis
1Magpie Distributed Profiling for Performance
Analysis
- Paul Barham
- (joint work with Rebecca Isaacs and Richard
Mortier) - 11th November 2002
2What is Magpie?
- Bottom-up approach to characterising the workload
of a distributed system - Observe concurrency, communication latency
- Focus is on responsiveness, rather than
throughput - Resource consumption is accounted to individual
service invocations - CPU, disk and network b/w consumed by each
request on each machine in the distributed system - Low level accounting higher level annotations
- Online measurement, offline processing
- Why? Web Services wide area synchronous RPC
3Main idea Data-driven Performance Analysis
- Account the resources consumed by a service
invocation throughout its lifetime - This is not the same as monitoring performance
counters on a system - The bottleneck resource for all requests may not
explain the performance of a particular single
request - Goal is to understand the relationship between
throughput and response time
4Motivation Indy PerformanceModelling Toolkit
- Indy is a hybrid event/throughput based modelling
kernel - Inputs Topology, hardware, workload
- Outputs Utilizations, response times,
bottlenecks, etc. - Scope is multi-tier server farms running .NET web
sites - Our goal is to use Magpie to derive a generative
model of the system workload suitable for input
to Indy - Acquire a workload description with less human
effort / knowledge than current approach of
incremental micro-benchmarking - Extract a detailed model from a live,
representative system - Measure with a realistic mix of transaction types
caches! - Not just a long-term average across all
transactions - Build a probabilistic model of the usage profile
which includes hidden transaction types, eg
error conditions, session state - Complex behaviour may not be easily observable
manually - e.g. web transaction type discriminator is not
necessarily the URL
5Outline of rest of talk
- Instrumenting a .NET web site
- Some raw data and visualizations
- Data analysis by clustering
- Current status and future work
6Duwamish7 VS.Net Enterprise Sample Site
710,000ft view of an e-commerce site
Web Server
SQL Server
Client
http//someurl.aspx
SQL request
data
web page
8A little bit more detail
Web Server(s)
SQL Server(s)
Client(s)
w3wp
ASP.Net
Business Logic
Cache
ADO.Net
front end
Stored Procs
DBMS
Cache
CLR / Jit Compiler
- Site-specific content and code shown in purple
- Requests can be either static content, or active
execute code to generate HTML - Code may execute SQL queries on database.
9Internal Architecture of IIS6 / ASP.NET
aspnet_isapi.dll
aspnet_isapi.dll
- w3wp.exe
- Per-site worker proc
- ISAPI extension hosts CLR
- ASP.Net, ADO.Net
- Threadpools
- One for CLR, one for IIS
- Nested state machines
- One per-request, event driven
- Non-blocking upcalls
- All async I/O
- Uses single IOCompletionPort
- Demux using OVERLAPPEDs
- http.sys
- aka Universal listener
- Kernel-mode driver
- TransmitFile() on steroids
- HTTP fragment cache
Common Language Runtime
Application
ASP.Net
ADO.Net
CLR
Winsock
ISAPI
ECB
Type-specific Handlers
Static
ISAPI
CGI
Per-Request FSMs
URLINFO
AUTHENT
AUTHORIZ
RESPONSE
LOG
DONE
HANDLE
START
W3_MAIN_CONTEXT
Main W3 Threadpool
ULATQ
HTTPAPI
Winsock
User
I/O Compn Port
Kernel
http.sys
TCP/IP
10Low Level Profiling
- Kernel Tracing
- Windows XP has very efficient low-level event
tracing built in to the kernel (ETW) - Command-line tools for turning on or off tracing
of specific system activities - Magpie uses ETW on each machine to capture
- Every context switch
- File IO
- Disk IO
- Network send and receive completions
- Process and thread creation and deletion
- Page faults
- Overhead is surprisingly small! (8MB log for 5
mins)
11Stitching together Asynchronous I/O and Thread
Synchronization
- Detours
- Public tool written by Galen Hunt (MSR)
- Can be used to intercept calls to any DLL and
record their arguments etc. - Magpie usage
- Intercept calls to Winsock2 API
- Allows communication to be followed (WSASend,
WSARecv) - Intercept the HTTPAPI
- Follows the processing of HTTP requests by
http.sys - Intercept certain Win32 API calls
- Observe thread synchronization (e.g. SetEvent,
Waits) - Follow async I/O completions (GetQueuedCompletionS
tatus) - Directly observe threadpool.
12Higher Level Annotations
- ISAPI filter
-
- Extension DLLs loaded into IIS (web server)
process - Sees all HTTP requests
- Filter registers with IIS to receive particular
event notifications - Can examine and modify both incoming and outgoing
streams of data - Magpie ISAPI filter
- Allocates a unique identifier to each incoming
request and adds it to the HTTP headers - Generates additional trace events on entry and
exit - Records cycle counter, request ID resource
usage - Thread 274 is now working on Request CCC00017
13Following the active content
- HTTPModule
- Plugins for ASP.NET extensibility
- Sees all active requests
- Executes in the managed (CLR) environment
- Each active request is processed by multiple
HTTPModules, e.g. session, authentication, etc. - Magpie HTTPModule
- Stashes request identifier in (managed) thread
local state - Records cycle counter, managed thread id, request
ID resource usage
14Accounting Database Activity
- SQL Profiler
- Standard part of SQL Server 2000 distribution
- Logs selected events to table or file
- (events can be user defined)
- Magpie SQL Profiling
- Wraps original stored procedures using
reflection - Wrappers generate user-defined profiler events
before and after executing the original stored
proc. - Recorded by the SQL Profiler in output trace
- Data includes request ID, cycle counter
resource usage - Uses extended stored procedure to get cycle
counter resource usage stats
15Intercepting Outgoing RPCs
- Common Language Runtime Profiler
- Two COM interfaces
- ICorProfiler provides notifications, e.g.
function enter/leave, module/class load, thread
mappings, .Net remoting, JIT compilation - ICorProfilerCallback Runtime provides API which
allows profiler to examine and modify VM state. - Magpie CLR Profiler
- Monitors CLRgOS thread mappings
- Records thread ids, cycle counter resource
usage - Intercepts JIT compilation of relevant ADO.NET
classes/methods - Rewrites byte-code (IL) to insert calls to our
own profiling functions - Modifies SQL stored procedure invocations to call
wrapped versions and pass an extra argument (i.e.
request ID) - Lots of fun but very hairy code! ?
16Putting it all together
SQL
Server
SQL
Server
Web
Server
Web
Server
Re
-
Re
-
SQL Profiler
SQL Profiler
App Logic
App Logic
written
written
User
-
User
-
IL
IL
Filter
Filter
defined
defined
ADO.NET
ADO.NET
HTTPModule
HTTPModule
ASP.NET
events
events
IIS
IIS
ASP.NET
ISAPI
ISAPI
Exten
Exten
Stored
Stored
Wrappers
Wrappers
-
ded
-
ded
CLR
CLR
Procs
Procs
SPs
SPs
DBMS
CLR Profiler
CLR Profiler
DBMS
IL
Patcher
IL
Patcher
Intercept
Intercept
Intercept
Intercept
Intercept
Intercept
Intercept
Intercept
Intercept
HTTP API
WinSock2 API
HTTP API
HTTP API
WinSock2 API
WinSock2 API
WinSock2 API
WinSock2 API
WinSock2 API
HTTP
TDS
TDS
Kernel
Kernel
Kernel
Kernel
Pkt
Pkt
Pkt
Pkt
http.sys
Pkt
http.sys
Pkt
Tap
Capture
Tap
Capture
Tap
Capture
PerfInfo
PerfInfo
PerfInfo
17Raw Event Data Yuk!
incoming HTTP request
incoming HTTP request packet
Logs sorted together using cycle counter
...
...
574530113665473L 157.58.60.98.3173 gt
192.168.187.66.http P 9987
10368(381)
\
\
574530113665473L 157.58.60.98.3173 gt
192.168.187.66.http P 9987
10368(381)
\
\
ack 80470 win 64240 (DF) (ttl 127, id 27528, len
421)
ack 80470 win 64240 (DF) (ttl 127, id 27528, len
421)
574530113724014 TcpReceive svchost.exe 11c 381
192.168.187.0668
0 157.058.060.0983173
574530113724014 TcpReceive svchost.exe 11c 381
192.168.187.0668
0 157.058.060.0983173
...
...
574530114445762 f4c HttpReceiveHttpRequest(ReqQu
eueHandle1c0,
RequestId0,
\
\
574530114445762 f4c HttpReceiveHttpRequest(ReqQu
eueHandle1c0,
RequestId0,
\
\
Flags1, pRequestBufferfaae48,
RequestBufferLength9d0, pBytesR
eceived0,
Flags1, pRequestBufferfaae48,
RequestBufferLength9d0, pBytesR
eceived0,
pOverlappedfaae14) 0
pOverlappedfaae14) 0
574530114613968 f4c
-
HttpReceiveHttpRequest()
-
gt 3e5
574530114613968 f4c
-
HttpReceiveHttpRequest()
-
gt 3e5
IIS worker
IIS worker
574530114724562 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,0)
574530114724562 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,0)
thread picks
thread picks
574530114836287 f4c
-
GetQueuedCompletionStatus(,,,,)
-
gt 0 5a8d11fc 0
574530114836287 f4c
-
GetQueuedCompletionStatus(,,,,)
-
gt 0 5a8d11fc 0
574530114947322 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,1b7740)
574530114947322 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,1b7740)
up request
up request
574530115070456 CSwitch 0 w3wp.exe 658 f4c
Waiting WrQueue Sys
tem 4 e50 3913680
574530115070456 CSwitch 0 w3wp.exe 658 f4c
Waiting WrQueue Sys
tem 4 e50 3913680
from http.sys
from http.sys
574530115394668 CSwitch 0 System 4 e50 Ready
-
w3wp.exe 658 f4c 324212
574530115394668 CSwitch 0 System 4 e50 Ready
-
w3wp.exe 658 f4c 324212
574530115456005 f4c
-
GetQueuedCompletionStatus(,,,,)
-
gt 1 5a8d11fc fab854
574530115456005 f4c
-
GetQueuedCompletionStatus(,,,,)
-
gt 1 5a8d11fc fab854
(async I/O Completion)
574530115683276 f4c ! HttpReceiveHttpRequest Ov
fab854 CID 400
00284e2000000
\
\
574530115683276 f4c ! HttpReceiveHttpRequest Ov
fab854 CID 400
00284e2000000
\
\
ReqID 40000284e2000000 LA 192.168.187.6680 RA
157.58.60.983
173
\
\
IndyProf label
ReqID 40000284e2000000 LA 192.168.187.6680 RA
157.58.60.983
173
\
\
IndyProf label
Bytes 0 Flags 0 Verb 4 RawUrl
/duwamish7/categories.aspx?ID
843
Bytes 0 Flags 0 Verb 4 RawUrl
/duwamish7/categories.aspx?ID
843
574530115882433 12673200697.357422 ccc00663 f4c
/duwamish7/categ
ories.aspx?ID843
574530115882433 12673200697.357422 ccc00663 f4c
/duwamish7/categ
ories.aspx?ID843
574530116116697 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,0)
574530116116697 f4c GetQueuedCompletionStatus(1b
8,f5ff84,f5ff8
c,f5ff98,0)
ASP worker
ASP worker
...
...
574530116959382 CSwitch 0 w3wp.exe 658 f4c
Waiting WrQueue w3wp.
exe 658 6dc 1564714
574530116959382 CSwitch 0 w3wp.exe 658 f4c
Waiting WrQueue w3wp.
exe 658 6dc 1564714
thread takes
thread takes
HttpModule label
574530117373185 12673200697.3574 ccc00663 6dc
/duwamish7/categor
ies.aspx?ID843
574530117373185 12673200697.3574 ccc00663 6dc
/duwamish7/categor
ies.aspx?ID843
over
over
...
...
574530119411731 6dc send(s4f0, buf2739044,
len7f, flags0)
574530119411731 6dc send(s4f0, buf2739044,
len7f, flags0)
574530119656299L 192.168.187.66.4681 gt
192.168.187.68.ms
-
sql
-
s P 2357823705(127)
\
\
574530119656299L 192.168.187.66.4681 gt
192.168.187.68.ms
-
sql
-
s P 2357823705(127)
\
\
ack 227230 win 16735 (DF) (ttl 128, id 10136, len
167)
ack 227230 win 16735 (DF) (ttl 128, id 10136, len
167)
send request to SQL Server
send request to SQL Server
574530119721688 6dc
-
send()
-
gt 7f
574530119721688 6dc
-
send()
-
gt 7f
...
...
574530120169086 6dc WSARecv(s4f0,
lpBuffers499f4fc 16384, dw
BufferCount1,
\
\
574530120169086 6dc WSARecv(s4f0,
lpBuffers499f4fc 16384, dw
BufferCount1,
\
\
blocking wait
blocking wait
lpNumberOfBytesRecvd499f514, lpFlags499f510,
lpOverlapped0 0,
\
\
lpNumberOfBytesRecvd499f514, lpFlags499f510,
lpOverlapped0 0,
\
\
for reply from
for reply from
lpCompletionRoutine0)
lpCompletionRoutine0)
574530120325932 CSwitch 0 w3wp.exe 658 6dc
Waiting UserRequest w
3wp.exe 658 ec8 3366550
574530120325932 CSwitch 0 w3wp.exe 658 6dc
Waiting UserRequest w
3wp.exe 658 ec8 3366550
SQL Server
SQL Server
...
...
18Visualisation Tools PowerPoint Macros!
Time/s
19Transaction ccc000b9 /duwamish7/categories.aspx?
ID831
20Interleaving of Simultaneous Requests
(Each colour is a different transaction, grey
blocked)
21An assortment of magical tastes
22How similar are two requests?
23Mining Some Structure
- Issues include
- Multiple clocks (at least one per machine)
- Lots of concurrency
- Only partial orders provided by network traffic
- Noisy observations preemptive scheduling
- Aperiodic sampling irregular events
- Current approach
- Borrowing algorithms from gene-sequence
comparison and speech recognition - Construct a string representation of traces
- Cluster using variant of String Edit Distance
24Levenshtein String Edit Distance
- d('', '') 0
- d(s, '') d('', s) s
- d(s1ch1, s2ch2)
- min( d(s1, s2) (ch1ch2 ? 0 1),
- d(s1ch1, s2) 1,
- d(s1, s2ch2) 1 )
- Example
- Can be computed in O(s1s2) time using simple
dynamic programming algorithm
appropriate m-eaning
approximate matching d(s1,s2)7
25Partial ordering using packets
- HTTPREQ 46265
- -GetQueuedCompletionStatus a18
- !HttpReceiveHttpRequest a18
- IndyProfReq a18
- GetQueuedCompletionStatus a18
- CPU 4178433 Block
- -------------------------------
- HttpModBegin dc0
- ProcessRequestMain
- OnStateChange
- -OnStateChange
- FlushBuffer
- send dc0
- ------------------------------------------------
-------------- - TDSREQ 23389 23389 TDSREQ
- -send dc0 Unblock
- -FlushBuffer SQLStartProf
87c - ReadNetlib CPU 31605 Block
- WSARecv dc0 Unblock
Web Server
SQL Server
26Example alphabet for trace strings
27A Distance Metric for Magpie Traces
- Assign each Magpie instrumentation point a
discrete label - Each trace entry has an 8-tuple of resource usage
deltas - (Web CPU, Web DISK, WAN Rx, WAN Tx, LAN Rx, LAN
Tx, SQL CPU, SQL DISK) - Deterministically flatten the partial order into
a total order - Consider as a string of weighted characters,
where weight is the length of observation vector - e.g. !1 (0 1 gt5 1 0 lt1 4 B0 b4 0 )0 B0 b0 Q0
q0 - Extend string-edit-distance to use normalised
Euclidian distance between between observation
vectors - Insert/delete cost v Substitution cost
v1-v2
distance of point from origin
distance between two points in 8D space
28Example String Edit Distances
29Trivial Clustering Algorithm
- Doesnt need to be very fancy yet!
- Uses a representative trace as cluster centroid
- Pick the best of 5 as in quicksort
- Compute distance from each trace to each cluster
centroid - Add a best-so-far threshold to string-edit
algorithm - (dynamic programming algs are monotonic)
- Compare inter/intra-cluster mean distances to
decide when to create a new cluster - Periodically move singleton clusters to an
outliers list and try to merge back in at the end - Approx O(N C) - where C is clusters
30Typical clusters
Centroid !1(0R0r0)1B0b0Q0q0 0.000000
static_ccc006c9 !1(0R0r0)1B0b0Q0q0 0.088417
static_ccc00168 !1(0R0r0)1B0b0Q0q0 0.003326
static_ccc00618 !1(0R0r0)1B0b0Q0q0 0.060546
static_ccc006c3 !1(0R0r0)1B0b0Q0q0 0.013645
static_ccc00616 !1(0R0r0)1B0b0Q0q0 0.032970
static_ccc006c5 !1(0R0r0)0B0b0Q0q0 0.048854
static_ccc006c4 !1(0R0r0)1B0b0Q0q0 0.043589
static_ccc006c7 !1(0R1r0)1B0b0Q0q0 0.106195
static_ccc0038e !1(0R0r0)1B0b0Q0q0 0.057646
static_ccc00043 !1(0R0r0)1B0b0Q0q0 0.025930
static_ccc00556 !1(0R0r0)1B0b0Q0q0 0.008057
static_ccc0038f !1(0R0r0)1B0b0Q0q0 ...
Centroid !7(02gt1730lt03B0b30)0B0b0Q0q0 0.0000
00 chkout_ccc0041c !7(02gt1830lt03B0b30)0B0b0Q
0q0 0.473224 chkout_ccc00092 !7(02gt1230lt03B0b
40)0B0b0Q0q0 0.273929 chkout_ccc00246
!7(02gt1630lt03B0b40)1B0b0Q0q0 0.167314
chkout_ccc002a8 !7(02gt1730lt03B0b30)0B0b0Q0q0
0.185258 chkout_ccc004be !7(02gt1730lt03B0b30)
0B0b0Q0q0 0.123095 chkout_ccc00675
!7(02gt1830lt03B0b40)0B0b0Q0q0 0.100177
chkout_ccc005f4 !7(02gt1730lt03B0b40)0B0b0Q0q0
0.318734 chkout_ccc001a5 !7(02gt1230lt03B0b30)
0B0b0Q0q0 0.049554 chkout_ccc00165
!7(02gt1730lt03B0b30)0B0b0Q0q0 0.369109
chkout_ccc003ac !7(02gt1230lt03B0b30)0B0b0Q0q0
...
Centroid !1(025B0b40)0 0.000000
books_ccc00301 !1(025B0b40)0 0.214118
books_ccc00548 !1(025B0b40)0 0.204555
books_ccc006fd !1(025B0b40)0 0.150912
books_ccc0079f !1(025B0b40)0 0.019864
books_ccc00873 !1(025B0b40)0 0.208676
books_ccc00484 !1(025B0b40)0 0.212472
books_ccc0029f !1(025B0b40)0 0.036158
books_ccc00842 !1(025B0b40)0 0.210166
books_ccc00517 !1(025B0b40)0 0.171975
books_ccc00589 !1(025B0b40)0 0.288412
books_ccc003d3 !1(045B0b40)0 0.217915
books_ccc0069c !1(025B0b40)0 0.238472
books_ccc0076e !1(025B0b30)0
Centroid !1(01gt410lt14B0b30)0B0b0Q0q0 0.00000
0 books_ccc00748 !1(01gt510lt14B0b40)0B0b0Q0q0
0.396412 logon_ccc005be !2(02gt510lt04B0b30)0
B0b0Q0q0 0.273409 books_ccc005d4
!1(01gt410lt14B0b50)1B0b0Q0q0 0.069283
books_ccc006e7 !1(01gt410lt14B0b40)0B0b0Q0q0 0.
103720 books_ccc00442 !1(01gt410lt14B0b30)0B0b
0Q0q0 0.044051 books_ccc00040 !1(01gt410lt14B0b
30)0B0b0Q0q0 0.208350 books_ccc000b9
!1(01gt410lt15B0b40)0B0b0Q0q0 0.147137
books_ccc002cf !1(01gt410lt14B0b40)0B0b0Q0q0 0.
099524 books_ccc0082b !1(01gt410lt14B0b40)0B0b
0Q0q0 0.244620 books_ccc003ed !1(01gt710lt15B0b
40)0B0b0Q0q0 0.441445 logon_ccc001e0
!2(02gt510lt04B1b30)0B0b0Q0q0 0.116913
books_ccc00105 !1(01gt410lt14B0b40)0B0b0Q0q0 0.
096342 books_ccc00686 !1(01gt410lt14B0b40)0B0b
0Q0q0 ...
31Visualisation of all Clusters
n2 pairwise distances
Distance
Transaction ID
Transaction ID
32Just the active content
Transaction ID
Distance
Transaction ID
33A First-cut Workload Model
- Just use cluster centroids and sizes to generate
similar transaction mix? - Pretty good at capturing coarse differences
- e.g. Number of SQL requests
- Doesnt deal with continuous distributions very
well - e.g. Cache/memory performance, zipf file size
distributions -
- But is it better than just using URL-based
averages? - Evaluation metric
- Take the original trace, assume each transaction
is replaced by the centroid of its cluster and
add up the RMS error. - Evaluate with and without the outlier clusters.
34Evaluation of Magpie cluster-based model
- Results per-resource RMS errors (across all
transactions)
- RMS error improvement over just using URLs
35Evaluation of Magpie cluster-based model
- Results per-resource RMS errors (across all
transactions)
- RMS Errors (normalised by mean)
36Better Models using Bayesian Learning?
- Ongoing discussions with Michael Isard, Mike
Tipping and Chris Bishop - Learn probabilistic models of resource usage by
different request types - Construct the per-machine FSMs
- Possibly apply coupled hidden Markov models
(CHMMs)?
Web
Compute, Disk IO
Receive Pkt
SQL
Waiting
Send Pkt
(ignore fictitious details ? )
time
37Current Status
- Using Matlab Bayes Network Toolkit (BNT)
- Start by trying to fit simple HMM to just static
request cluster - One discrete hidden state, 8 continuous observed
variables with assumed Gaussian distns. - Priors computed from mean / var of observations
? unfortunately, none of the supplied learning
algorithms converge claiming our data is
infinitely improbable!
38Ongoing Work
- Investigating better ways of extracting models
from the data, esp. machine learning - Transfer Magpie ideas/tools to the Indy team
- Use Magpie to learn parameters in the live
system order to calibrate processor, memory
system and cache models (more speculative) - Exploring other types of distributed system,
e.g. GXA / Web Services v2 ? async messaging
39Tools for Systems Analysis
- Understanding the behaviour of distributed
systems is important for - Debugging
- Performance analysis
- Capacity planning
- What-if?
- Simulation
- Configuration and management
- Service level agreements, charging etc
- Complex systems require automated and
sophisticated tools
40Magpie Measurement Infrastructure
Store request id in TLS
Wrap stored procs with profiling
Modify SQL RPC
Tag each request
Web Server(s)
SQL Server(s)
Client(s)
Stored Procs
DBMS
Cache
Kernel
Kernel
Context switches, disk and file IO, network send
and receive
41Models of a simple request
Structure typically assumed for modelling
20
80
IIS
100
SQL
More realistic structure
IIS
SQL
42Does structure matter much?
- Compare SEQUENTIAL transaction with PIPELINED
transaction using simple Indy models -
- Saturation test with 1000 requests
- Equal resource demands (22ms comp IIS, 20ms
SQL, 3x1k net)