Title: Rebecca Isaacs
1 Magpie: Distributed request tracking for realistic performance modelling
- Rebecca Isaacs
- Paul Barham
- Richard Mortier
- Dushyanth Narayanan
- Microsoft Research Cambridge
- James Bulpin
- University of Cambridge
2 Performance in distributed systems
- Faults in distributed systems are notoriously hard to diagnose
- Performance problems are even more subtle to debug
- Often transient or affect only a subset of requests / users
- Frequently involve complex interactions between multiple machines
- Aggregate statistics (e.g. utilization) may look perfectly normal
3 Magpie Approach
- Track individual requests end to end
- Observe control flow (causality)
- Monitor resource consumption: CPU, bandwidth, disk
- Debug performance in the small
- Build a probabilistic workload model from the aggregate requests
- Cluster similar requests according to their observed behaviour
- Debug performance in the large
4 How do we use this information?
- Performance debugging
- Why did this request take much longer than that request?
- Fault detection
- Configuration and management
- Performance prediction
- Realistic workload models for capacity planning
- Obtain automatically on a live system
5 Magpie components
- Instrumentation
- System activity recorded to logs
- Generic request parser
- Extract individual requests from logs according to an event schema
- Model construction
- Behavioural clusters
- Probabilistic state machine
6 Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
7 What is a request?
- System activity which takes place in response to an action initiated by the application being traced
- HTTP request
- Database query
- File open request
- We describe a request as
- The sequence of application components involved in its processing
- The resources consumed at each stage
- CPU, bandwidth, disk transfer size, (latency)
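The description of a request above (a sequence of components plus the resources consumed at each stage) can be sketched as a small data structure. This is only an illustration: the class names `Stage` and `Request`, the field names and the numbers are hypothetical and do not come from Magpie.

```python
from dataclasses import dataclass, field


@dataclass
class Stage:
    """One step of request processing inside a single component."""
    component: str        # e.g. "IIS", "ASP.NET", "SQL"
    cpu_cycles: int = 0
    net_bytes: int = 0
    disk_bytes: int = 0


@dataclass
class Request:
    """A request: the ordered stages it passed through."""
    kind: str             # e.g. "HTTP request"
    stages: list = field(default_factory=list)

    def components(self):
        # The sequence of application components involved in processing.
        return [s.component for s in self.stages]

    def total(self, resource):
        # Sum one resource (a Stage field name) over all stages.
        return sum(getattr(s, resource) for s in self.stages)


# Made-up example values, for illustration only.
req = Request("HTTP request", [
    Stage("IIS", cpu_cycles=1200, net_bytes=512),
    Stage("ASP.NET", cpu_cycles=5400),
    Stage("SQL", cpu_cycles=3000, disk_bytes=8192),
])
```

Keeping resources per stage rather than per request is what lets two requests with the same total cost but different paths be told apart later.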
8 A typical e-commerce site (1)
[Diagram labels: Internet, Web Front Ends, SQL Servers, Storage]
9 A typical e-commerce site (2)
[Diagram labels: Web Server, IIS, Filter, Static Content, ASP.NET, CLR, Application Logic, ADO.NET, SQL Server, Stored procedures, Data, WinSock2 API, Kernel]
10 HTTP request: detailed view
[Figure: timeline of one HTTP request with rows labelled WEB.eec, WEB.398 and SQL.9c4 plus Disk, Net RX and Net TX rows; time axis marked 10.051s, 10.100s, 10.155s. Key: Blocked, IIS, ASP.NET, SQL, Disk, Other]
11 Why is request tracking hard?
- Many components, multiple machines
- Must track control flow across machines
- No globally unique request ID
- Components are developed independently
- Multiple thread pools
- Many threads participate in processing a request
- Asynchronous communication
- Must match send/recvs between threads/machines
- Hand-rolled synchronization primitives
- SQL server has user-mode scheduler
12 Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
13 Event Tracing for Windows
- Low-overhead event mechanism
- Events timestamped with cycle counter
- Global ordering on events on a single machine
- Can enable/disable sets of events at runtime
- Using ETW in Magpie
- Each instrumentation point posts an event
- Events are logged to disk
- Logs are post-processed to extract requests
- Can also consume events in real time
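The properties listed above (timestamped events, a total order per machine, providers switched on and off at runtime) can be illustrated with a toy in-memory log. This is a hedged sketch and not the ETW API: `EventLog`, its methods and the provider names are hypothetical, and a simple counter stands in for the cycle counter.

```python
import itertools


class EventLog:
    """Toy stand-in for a per-machine event trace (not the real ETW API)."""

    def __init__(self):
        self._clock = itertools.count()   # stands in for the cycle counter
        self.enabled = set()              # providers currently switched on
        self.events = []                  # (timestamp, provider, name, params)

    def enable(self, provider):
        self.enabled.add(provider)

    def disable(self, provider):
        self.enabled.discard(provider)

    def post(self, provider, name, **params):
        # Every instrumentation point calls post(); disabled providers are
        # dropped cheaply instead of being logged.
        ts = next(self._clock)
        if provider in self.enabled:
            self.events.append((ts, provider, name, params))


log = EventLog()
log.enable("kernel")
log.post("kernel", "cswitch", old_tid=0xA, new_tid=0xB)
log.post("iis", "request_start", url="/index")   # dropped: "iis" not enabled
log.enable("iis")
log.post("iis", "request_end", url="/index")
```

Because every event takes its timestamp from one monotonically increasing source, the log is already in a single global order for that machine, which is the property the request parser relies on.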
14 Instrumentation points
- Existing ETW event providers
- IIS, kernel
- App-specific hooks
- IIS, ASP.NET, SQL Server
- Detours
- Wrap DLLs to trap Win32 and WinSock2 calls
- WinPcap
- Capture packets on the wire
15 CPU usage from kernel events
- The ETW kernel logger records every context switch
- How do we know which cycles are used for which request?
- We can attribute cycles to a request by
- An application-specific event which occurs within a delimited sector of CPU time, or
- The current context of execution, e.g. thread id
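The second attribution rule above (charge cycles to whichever request owns the running thread) can be sketched as follows. It is a simplified illustration under stated assumptions: one CPU, a fixed thread-to-request mapping, and hypothetical names (`attribute_cycles`, `owner_of_thread`) and made-up timestamps. In a real trace the mapping would change as requests are bound to and released from threads.

```python
from collections import defaultdict


def attribute_cycles(cswitches, owner_of_thread, end_time):
    """Charge each interval between context switches to a request.

    cswitches: list of (timestamp, new_tid) for one CPU, sorted by timestamp.
    owner_of_thread: tid -> request id; threads not listed are unattributed.
    end_time: timestamp at which the trace ends.
    """
    usage = defaultdict(int)
    following = cswitches[1:] + [(end_time, None)]
    for (t0, tid), (t1, _) in zip(cswitches, following):
        request = owner_of_thread.get(tid)
        if request is not None:
            usage[request] += t1 - t0    # cycles the thread held the CPU
    return dict(usage)


# Made-up trace: thread 0xa works for request 1, thread 0xb for request 2,
# thread 0xc is unrelated background work.
cswitches = [(100, 0xA), (250, 0xB), (400, 0xA), (450, 0xC)]
owners = {0xA: "request 1", 0xB: "request 2"}
usage = attribute_cycles(cswitches, owners, end_time=500)
```

The first rule on the slide, an application event inside a delimited stretch of CPU time, would need the same interval arithmetic but with the owner looked up from the event rather than from the thread.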
16 Example: protocol processing in a DPC
[Figure: timeline of events (cswitch, DPC start, pkt recv, DPC end) against time, showing the cycle counts attributed to Request 1 and Request 2]
17 Application and middleware events
- Cover points where flow of control moves between components
- Cover points where resources are multiplexed and demultiplexed
- E.g. user-level scheduling primitives
- Propagation of a global request id is not required!
- Magpie used to do this but not any more
18 Instrumenting a web service
[Diagram labels: Web Server, IIS, Filter, ISAPI Filter, Static Content, ASP.NET, HTTPModule, CLR, CLR profiler, Application Logic, ADO.NET, SQL Server, Stored procedures, Extended SPs, Data; shown twice: Intercept, WinSock2 API, Kernel, Event Tracing for Windows, Packet capture]
19 Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
20 Generic request extraction
- No inbuilt assumptions about the system or the application
- No common unique identifier
- Schema specifies semantics of events
- Easy to add new event types
- Parser stitches events into requests based on event semantics
21 Terminology
- Namespace
- Event parameter which references an entity in the system, e.g. thread id
- Timeline
- Instantiation of a namespace with a unique value, e.g. thread id 0xa
- Events bind or unbind requests to timelines
- Bindings capture the semantics of each event for a particular request type
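The terminology above can be made concrete with a small sketch of a parser that stitches events into requests through bind and unbind operations on timelines. It is an illustration under assumptions, not Magpie's parser: the schema format, the function `stitch`, and the event and namespace names (`tcp_pkt`, `conn`, `tid`) are all hypothetical.

```python
def stitch(events, schema):
    """Group events into requests using timeline bindings.

    events: list of (name, params) in timestamp order; each parameter is a
            namespace -> value pair, so (namespace, value) names a timeline.
    schema: name -> {"start": bool, "bind": [namespaces], "unbind": [namespaces]}
    Returns request id -> list of event names.
    """
    bound = {}        # timeline (namespace, value) -> request id
    requests = {}
    next_id = 1
    for name, params in events:
        rule = schema[name]
        timelines = list(params.items())
        # Is any timeline this event touches already bound to a request?
        rid = next((bound[t] for t in timelines if t in bound), None)
        if rid is None:
            if not rule.get("start"):
                continue                 # event belongs to no known request
            rid = next_id
            next_id += 1
            requests[rid] = []
        requests[rid].append(name)
        for ns in rule.get("bind", []):
            bound[(ns, params[ns])] = rid
        for ns in rule.get("unbind", []):
            bound.pop((ns, params[ns]), None)
    return requests


schema = {
    "tcp_pkt": {"start": True, "bind": ["conn"]},
    "enter_recv": {"bind": ["tid"]},
    "compute": {},
    "send": {"unbind": ["tid", "conn"]},
}
events = [
    ("tcp_pkt", {"conn": 0xD}),
    ("enter_recv", {"tid": 0xA, "conn": 0xD}),
    ("tcp_pkt", {"conn": 0xE}),
    ("compute", {"tid": 0xA}),
    ("enter_recv", {"tid": 0xB, "conn": 0xE}),
    ("send", {"tid": 0xA, "conn": 0xD}),
    ("compute", {"tid": 0xA}),   # thread 0xa is unbound by now: ignored
]
requests = stitch(events, schema)
```

No request id is carried in any event here. The thread timeline picks up the request only because `enter_recv` mentions both the thread and a connection that is already bound, which is the sense in which a global request id is not required.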
22 Example: connecting events
[Figure: events (cswitch, DPC start, TCP pkt, DPC end, Enter Recv, Recv returns) placed on timelines Cpuid 0, Tid 0xa, Tid 0xb and Connid 0xd, connecting them to Request 1 and Request 2]
23 End-to-end request extraction
- An instance of the request parser runs on each machine in the distributed system
- Online or offline mode
- Offline post-processing connects request fragments from each node according to a globally unique namespace, e.g. packet IP identifier
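The offline joining step above can be sketched as grouping per-machine fragments that share a value in the common namespace. This is a minimal illustration that assumes packet identifiers are unique within the trace, as the slide's example suggests; the function `join_fragments`, the fragment layout and the identifier values are hypothetical.

```python
def join_fragments(fragments):
    """Merge per-machine request fragments that share a packet identifier.

    fragments: list of {"machine": str, "packet_ids": set}.
    Returns the machines in each end-to-end request, as a sorted list of lists.
    """
    parent = list(range(len(fragments)))   # union-find over fragment indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}                        # packet id -> first fragment index
    for i, frag in enumerate(fragments):
        for pid in frag["packet_ids"]:
            if pid in first_seen:
                parent[find(i)] = find(first_seen[pid])
            else:
                first_seen[pid] = i

    groups = {}
    for i, frag in enumerate(fragments):
        groups.setdefault(find(i), []).append(frag["machine"])
    return sorted(sorted(g) for g in groups.values())


fragments = [
    {"machine": "WEB", "packet_ids": {101, 102}},
    {"machine": "SQL", "packet_ids": {102}},     # same packet seen on both
    {"machine": "WEB", "packet_ids": {201}},     # never left the web server
]
joined = join_fragments(fragments)
```

Union-find is used so that a request spanning more than two machines is still merged into one group even when no single packet is seen by all of them.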
24 Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
25 Clustering for workload generation
- Target the Indy performance modelling tool
- Calculates throughput, bottlenecks
- Needs transaction mix, resource consumption
- Previously microbenchmark approach
- Run 10000 of each transaction type (URL)
- Divide aggregate resource usage by 10000
- Aim: provide realistic workload models
- From real, mixed workloads
- Derive transaction types automatically
26 Single request: cartoon view
- Partial ordering of events
- Annotated with resource usage
[Figure labels: IIS CPU, ASP.NET CPU, SQL Server CPU]
27 Behavioural clustering of requests
- Represent requests as event strings
- Flatten out any concurrency
- Use Levenshtein string edit distance
- Modified to factor in resource usage vectors
- Cluster requests based on this distance
- Linear-time algorithm
- Each cluster is a request type
- Select representative from near centroid
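The clustering steps above can be sketched as follows. The slides do not say how resource usage is folded into the edit distance or which linear-time algorithm is used, so both choices here are assumptions: the resource gap is simply added to the Levenshtein distance with a weight, and clustering is a single-pass "leader" scheme. The names `request_distance` and `cluster`, the threshold and the sample data are hypothetical.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance between two event strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def request_distance(r1, r2, weight=1.0):
    """Assumed combination: edit distance plus weighted resource gap."""
    (s1, u1), (s2, u2) = r1, r2
    resource_gap = sum(abs(x - y) for x, y in zip(u1, u2))
    return edit_distance(s1, s2) + weight * resource_gap


def cluster(requests, threshold):
    """Single pass: join the first cluster whose leader is close enough."""
    clusters = []                     # each cluster is a list; [0] is leader
    for r in requests:
        for c in clusters:
            if request_distance(c[0], r) <= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters


# Each request: (event string, resource usage vector). Made-up values.
requests = [
    ("ABCD", (1.0, 0.0)),
    ("ABCE", (1.0, 0.5)),
    ("XYZ", (9.0, 0.0)),
    ("ABCD", (1.2, 0.0)),
]
clusters = cluster(requests, threshold=2.0)
```

With this scheme the leader is just the first request seen, so it only approximates the "representative from near centroid" on the slide; picking the member with the smallest summed distance to the others would be one way to refine it.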
28 Build a workload model by clustering similar requests
- Requests in the same cluster often have different URLs, and one URL may appear in many clusters
[Figure: clusters labelled A, B, C, D, E]
29 Taking it further: work in progress
- Online and incremental modelling
- Detect component failure
- Detect sudden shifts in workload
- More sophisticated models
- Learn the probabilistic state machine for each request
- cf. flowcharts annotated with performance information
- Bayesian watchdogs
- Compute the likelihood of a request's behaviour as it moves through the system
- Deal with unlikely requests appropriately
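The watchdog idea above can be illustrated by scoring a request's event sequence against a probabilistic state machine and flagging sequences with low likelihood. This sketch only computes a likelihood and does not attempt any Bayesian updating; the model, its probabilities, the `floor` penalty and the function name `log_likelihood` are all made up for illustration.

```python
import math


def log_likelihood(events, transitions, start="start", floor=1e-6):
    """Log-probability of an event sequence under a state machine.

    transitions: state -> {event: (probability, next_state)}
    Events the model has never seen get the small probability `floor`.
    """
    state, total = start, 0.0
    for event in events:
        p, nxt = transitions.get(state, {}).get(event, (floor, state))
        total += math.log(p)
        state = nxt
    return total


# Hypothetical model of a web request; the numbers are invented.
model = {
    "start": {"recv": (1.0, "parse")},
    "parse": {"static": (0.7, "send"), "sql": (0.3, "query")},
    "query": {"sql": (0.2, "query"), "reply": (0.8, "send")},
    "send": {"send": (1.0, "done")},
}

common = ["recv", "static", "send"]
rare = ["recv", "sql", "sql", "sql", "reply", "send"]
ll_common = log_likelihood(common, model)
ll_rare = log_likelihood(rare, model)
```

Because the score is accumulated one event at a time, it could be checked against a threshold while the request is still in flight, which is what the online use on the slide would require.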
30 Outline
- Introduction
- What is a request?
- Instrumentation
- Request extraction
- Modelling
- Current status
31 Current status
- Recent focus has been developing a generic request extraction scheme
- Prototype for 2-machine e-commerce site
- TPC-W style workload
- Prototype for single machine SQL Server 2000
- Challenge is user mode scheduler
- TPC-C workload
- Other applications on the way
- Large-scale
- Real systems with real performance problems
32 Conclusion
- Magpie is a tool for performance analysis in a distributed system
- Bottom up, per-request approach
- Complementary to existing techniques
- Performance counters
- Program profiling
- Feeds into performance debugging and prediction tools
33 Work in progress: learning the probabilistic state machine
- Infer a stochastic context free grammar from a sample set of strings
- Each state transition emits a character and has an associated probability
- Use the Alergia algorithm (Carrasco & Oncina 94)
- Construct a prefix tree from the sample set
- Merge similar subtrees
- Apply to Magpie requests
- Just event strings
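The first two Alergia steps above can be sketched as a frequency prefix tree plus the Hoeffding-bound test commonly used to decide whether two subtrees are similar enough to merge. Alergia is usually described as learning a stochastic finite automaton. The merge (fold) step itself is omitted, and the dictionary layout, function names, sample strings and the choice of alpha are assumptions made for illustration.

```python
import math


def build_prefix_tree(samples):
    """Frequency prefix tree; each node is {"count", "ends", "children"}."""
    root = {"count": 0, "ends": 0, "children": {}}
    for s in samples:
        node = root
        node["count"] += 1
        for ch in s:
            node = node["children"].setdefault(
                ch, {"count": 0, "ends": 0, "children": {}})
            node["count"] += 1        # strings passing through this node
        node["ends"] += 1             # strings terminating at this node
    return root


def differ(f1, n1, f2, n2, alpha):
    """True if the frequencies f1/n1 and f2/n2 differ significantly."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound


def compatible(a, b, alpha=0.05):
    """Recursively compare termination and transition frequencies."""
    if differ(a["ends"], a["count"], b["ends"], b["count"], alpha):
        return False
    for ch in set(a["children"]) | set(b["children"]):
        ca, cb = a["children"].get(ch), b["children"].get(ch)
        fa = ca["count"] if ca else 0
        fb = cb["count"] if cb else 0
        if differ(fa, a["count"], fb, b["count"], alpha):
            return False
        if ca and cb and not compatible(ca, cb, alpha):
            return False
    return True


# Made-up event strings: a run of "a" events followed by one "b".
samples = ["ab"] * 60 + ["aab"] * 30 + ["aaab"] * 10
tree = build_prefix_tree(samples)
node_a = tree["children"]["a"]
node_aa = node_a["children"]["a"]
mergeable = compatible(node_a, node_aa)
```

Here the states reached after "a" and after "aa" are judged compatible, so a full implementation would merge them and produce a loop on "a". The alpha parameter is the similarity criterion that would need tuning.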
34 Ongoing work with Alergia
- Tuning the similarity criterion
- Factoring in resource usage information
- Can we identify event sequences with suspiciously low probability?
- Run online for anomaly detection?