Title: Reduce MTTR
1 Reduce MTTR
- with Combined Performance Management and Forensic Analysis
2 Introduction
- Brian Robertson
- Solutions Marketing Manager
- NetScout Systems, Inc.
- Jim Bauer, Director of Network Infrastructure Services
3 Agenda
- Network Performance Management Challenges
- Meeting the Challenge with Performance Management Solutions
- Top-Down Troubleshooting Approach
- Case Studies
- Survey Results
4 Balancing Proactive Performance Management with Post-Incident Investigation and Analysis
- Challenge
- Need proactive monitoring and analysis for everyday troubleshooting and capacity planning
- Intermittent problems wreak havoc on the performance of business-critical and revenue-generating applications
- Problem cause is often difficult to discover
- Needs
- Ability to automatically analyze traffic and simultaneously retain evidence to recreate a complex incident and discover its culprit without having to wait for the event to recur
5 Who Finds Performance Problems?
Source: NetScout Systems Survey, March 2007, N = 232
6 Performance Problem Lifecycle
7 The Ultimate Payoff: Reduced MTTR
Faster time to resolution/minimized user impact
[Figure: timeline (axis: time) with labels Problem Origin, Service Outage, User Calls Help Desk, End-Users Impacted, Final Verification, End Users Not Impacted!]
8 What Is Needed to Lower MTTR
- A Performance Management System that
- Combines key performance indicator (KPI) monitoring and analysis with continuous packet capture for post-event data mining
- Top-down approach to troubleshooting
- Application Fabric Monitoring
- Has an architecture built for high-performance recording and infrastructure monitoring
- Uses a high-capacity, highly available hardware platform
9 Unique Approach: Three-Tier Data Architecture for Rapid Top-Down Workflow
Top Down
Data Set
KPIs: Retransmits, VoIP QoS Errors, App Response Time, etc.
Flows (CDM): Applications/Services, Conversations, Utilization, Volume, etc.
Packets: Header and Payload
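The three tiers can be pictured as nested records, where each KPI points at the flows behind it and each flow points at its packets. The sketch below is only an illustration of that top-down drill-down; the class names, fields, and the drill_down helper are assumptions made for this example and are not taken from any NetScout product.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Packet:
    timestamp: float  # seconds since capture start
    header: bytes
    payload: bytes


@dataclass
class Flow:
    application: str
    conversation: Tuple[str, str]  # (client, server), hypothetical layout
    volume_bytes: int
    packets: List[Packet] = field(default_factory=list)


@dataclass
class Kpi:
    name: str
    value: float
    flows: List[Flow] = field(default_factory=list)


def drill_down(kpi: Kpi, application: str):
    """Go from a KPI to the flows of one application, then to their packets."""
    flows = [f for f in kpi.flows if f.application == application]
    packets = [p for f in flows for p in f.packets]
    return flows, packets


# Illustrative data only.
pkt = Packet(0.001, b"\x45", b"GET /")
http_flow = Flow("HTTP", ("192.0.2.5", "192.0.2.9"), 1200, [pkt])
voip_flow = Flow("VoIP", ("192.0.2.6", "192.0.2.7"), 300)
kpi = Kpi("retransmits", 42.0, [http_flow, voip_flow])

flows, packets = drill_down(kpi, "HTTP")
```

Keeping the context (which KPI, which application) while moving down a tier is what lets the packet view stay small: only packets belonging to the selected flows are returned.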
10 Step 1: Detection
Is there an issue on the network?
- Trending, alarming, and analytical data on key performance indicators will
- Notify you when link or application utilization increases
- Show when unwanted protocols are on the network
- Measure VoIP links for jitter, packet loss, or low MOS scores
- This information has to be available in real time and historically
- Collaboration with groups across the organization requires quick and easy distribution of reports
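As a rough illustration of KPI alarming in the detection step, the sketch below compares one sample of measurements against per-KPI thresholds. The KPI names and threshold values are assumptions chosen for the example, not recommended settings. Note that the MOS alarm fires when the score drops, since a higher MOS indicates better call quality.

```python
def check_kpis(sample, thresholds):
    """Return the names of KPIs whose value crosses its threshold.

    thresholds maps a KPI name to (limit, direction), where direction
    is "above" (alarm when value > limit) or "below" (value < limit).
    """
    alarms = []
    for name, (limit, direction) in thresholds.items():
        value = sample.get(name)
        if value is None:
            continue  # KPI not measured in this sample
        if direction == "above" and value > limit:
            alarms.append(name)
        elif direction == "below" and value < limit:
            alarms.append(name)
    return alarms


# Hypothetical thresholds, for illustration only.
THRESHOLDS = {
    "link_utilization_pct": (80.0, "above"),
    "jitter_ms": (30.0, "above"),
    "packet_loss_pct": (1.0, "above"),
    "mos": (3.5, "below"),
}

sample = {
    "link_utilization_pct": 91.0,
    "jitter_ms": 12.0,
    "packet_loss_pct": 0.2,
    "mos": 3.1,
}
alarms = check_kpis(sample, THRESHOLDS)
print(alarms)
```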
11 Understand how your business uses the network
Do you know all the applications running on your network? Application visibility provides business justification for IT decisions. Are there good reasons for an upgrade? Are there non-business uses of the network?
12 Baseline response time of key business applications
Looking at several parameters such as MPLS, VoIP, or QoS classes can provide critical data regarding the health of the network.
Response time provides insight into the end-user experience and should be an integral part of any performance management audit.
13 Top-Down Approach for Rapid Troubleshooting: Power Alarms
- Micro-Burst Alarming
- Based on traffic rates exceeding a 1 millisecond threshold
- Interval can be configured as low as 5 milliseconds up to 1 second
- An alarm is received when the burst starts and one when it ends
- Evidence is launched based on applications, hosts, conversations during the alarm interval
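One way to picture micro-burst alarming is to bucket traffic into short fixed intervals, compute the rate in each bucket, and raise one event when the rate first exceeds a threshold and another when it falls back. The sketch below follows that idea under stated assumptions: the function name, the integer-microsecond timestamps, and the 100 Mbit/s threshold are all illustrative and do not describe how any particular product implements its alarms.

```python
from collections import defaultdict


def detect_microbursts(packets, interval_us, threshold_bps):
    """Return [("start", t_us), ("end", t_us), ...] for bursts.

    packets: iterable of (timestamp_us, size_bytes).
    interval_us: bucket width in microseconds (e.g. 5000 for 5 ms).
    threshold_bps: rate in bits per second that counts as a burst.
    """
    bits_per_bucket = defaultdict(int)
    for ts_us, size_bytes in packets:
        bits_per_bucket[ts_us // interval_us] += size_bytes * 8
    if not bits_per_bucket:
        return []

    events = []
    in_burst = False
    first, last = min(bits_per_bucket), max(bits_per_bucket)
    # Scan one bucket past the last so a trailing burst is closed.
    for bucket in range(first, last + 2):
        rate_bps = bits_per_bucket.get(bucket, 0) * 1_000_000 / interval_us
        if rate_bps > threshold_bps and not in_burst:
            events.append(("start", bucket * interval_us))
            in_burst = True
        elif rate_bps <= threshold_bps and in_burst:
            events.append(("end", bucket * interval_us))
            in_burst = False
    return events


# Illustrative traffic: quiet, then 50 full-size packets inside one
# 5 ms bucket (120 Mbit/s), then quiet again.
packets = (
    [(1000, 1500)]
    + [(5000 + i * 50, 1500) for i in range(50)]
    + [(11000, 1500)]
)
events = detect_microbursts(packets, interval_us=5000, threshold_bps=100_000_000)
print(events)
```

The start and end timestamps bound the alarm interval, which is the window in which applications, hosts, and conversations would then be examined.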
14 Step 2: Diagnosis
What is the root cause of the issue?
- Look at the flows associated with the KPI affected by the issue
- Applications
- Conversations
- Utilization
- Volume
- Historical view of trends for flows shows
- When and where an issue commonly occurs
- Upward / downward trends for flows
15 Flow-Based Troubleshooting Example
Music downloads from 3 different sites, same user
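A finding like the one above comes from grouping flow records by client. The sketch below shows one way to do that: for a chosen application it counts the distinct servers each client talked to and sums the volume. The record layout, application label, and addresses (taken from the documentation address ranges) are assumptions for illustration only.

```python
from collections import defaultdict


def summarize_by_client(flows, application):
    """Return [(client, distinct_servers, total_bytes)], largest first.

    flows: iterable of dicts with keys client, server, application, bytes.
    """
    servers = defaultdict(set)
    volume = defaultdict(int)
    for flow in flows:
        if flow["application"] != application:
            continue
        servers[flow["client"]].add(flow["server"])
        volume[flow["client"]] += flow["bytes"]
    rows = [(client, len(servers[client]), volume[client]) for client in volume]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Made-up flow records.
flows = [
    {"client": "192.0.2.10", "server": "198.51.100.1", "application": "music", "bytes": 40_000_000},
    {"client": "192.0.2.10", "server": "198.51.100.2", "application": "music", "bytes": 25_000_000},
    {"client": "192.0.2.10", "server": "203.0.113.7", "application": "music", "bytes": 35_000_000},
    {"client": "192.0.2.11", "server": "198.51.100.1", "application": "music", "bytes": 2_000_000},
    {"client": "192.0.2.10", "server": "203.0.113.9", "application": "http", "bytes": 500_000},
]

report = summarize_by_client(flows, "music")
for client, server_count, total_bytes in report:
    print(client, server_count, total_bytes)
```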
16 Troubleshoot Intermittent or Subtle Problems with Application Fabric Monitoring
- Efficient on-board analysis
- Sends only the data requested by the client
- Does not require entire data capture files to be sent
- Deep forensic analysis
- Logically move from KPI to Flow to Packet
- View data with millisecond granularity
- Record packet-level audit trails to view subtle and intermittent issues
17 Highly Efficient Architecture
18 Highly Efficient Architecture
[Figure: comparison of On-Board Analysis (efficient) and Client Analysis (inefficient, client-based)]
19 Superior Forensic Analysis
- Expert Analysis
- Expert Zoom and Data Mining interface with multiple workspaces
- Broad support of over 1000 protocol decodes
- TCP Session Follow
- Bounce Diagrams
- Intuitive, flexible GUI
- Multi-user, Web-based console
- Eases search process
- Incident Reports for collaborating and communicating with others
20 Case in Point: Solution Architecture Wins Business
- Bank in North America has assets approaching 300 Billion
- 2 trading floors (Chicago, New York); NJ disaster recovery
- Pain Point: Needs continuous capture and monitoring on trading floors
- Need both flow-based monitoring and trending for troubleshooting and capacity planning, plus recording for in-depth analysis because of the value of the financial trades
- Current solution in Chicago not working: network managers constrained by time delays in viewing packet trace files when pulling them over the network
Able to perform top-down troubleshooting with detailed analysis of applications and conversations. When necessary for in-depth troubleshooting, the actual packets are available for event reconstruction and forensic data mining without adding load to the network.
21 High Definition Visibility: Deep Forensics
22 Quickly navigate to the needed level of granularity, consistently maintaining context as you go deeper
What do you do if you need a packet decode, but the offending traffic occurred two hours ago?
23 Step 3: Verification
Have we resolved the issue?
- Does the KPI meet expectations now that changes have been implemented?
- Re-evaluate response time of critical applications
- Are key applications being delivered within previous response time levels? Have there been negative or positive impacts? Have VoIP QoE metrics changed?
- Determine whether bandwidth utilization meets estimates
- If the changes resolve the issue, the baseline should be reset
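The verification step amounts to comparing a post-change measurement against the earlier baseline and, if the result is acceptable, adopting the new measurement as the baseline. The sketch below illustrates that with application response times. The 10% tolerance, the function name, and the sample values are assumptions for the example rather than recommended practice.

```python
from statistics import mean


def verify_fix(baseline_ms, post_change_ms, tolerance=0.10):
    """Compare post-change response times with the baseline.

    The issue counts as resolved when the post-change mean is no more
    than `tolerance` (as a fraction) above the baseline mean. When it is
    resolved, the baseline is reset to the post-change mean.
    """
    base = mean(baseline_ms)
    current = mean(post_change_ms)
    resolved = current <= base * (1 + tolerance)
    return {
        "baseline_ms": base,
        "current_ms": current,
        "resolved": resolved,
        "new_baseline_ms": current if resolved else base,
    }


# Illustrative response-time samples in milliseconds.
result = verify_fix(baseline_ms=[120, 130, 125], post_change_ms=[118, 122, 120])
print(result)
```

The same comparison can be repeated for other KPIs named in this step, such as bandwidth utilization against its estimate or VoIP QoE metrics.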
24 Case in Point: Troubleshooting employee remote access problem
- New England-based insurance company
- Pain Point: Remote employees having trouble accessing network resources
- LDAP servers source of many problems
- Intermittent issues were elusive
- Application Fabric Monitoring provided continuous capture and recording for in-depth troubleshooting forensics
Discovered two LDAP servers had their authentication databases out of sync and were spending their cycles trying to sync their databases
25 Step 4: On-Going Management
How is your network growing and changing over time?
- Converged networks need unified performance management
- Continuation of the tasks you performed in detection and post-deployment impact phases
- Troubleshooting, requiring real-time information
- Planning and traffic engineering, requiring longer-term historical information
- Communication to key constituents
- Easy to create, customizable reports
26 Visibility into how the network is used
Complete Application monitoring and profiling with CDM Virtualization
- Application identification
- Common matrix for multimedia voice, video and data
- Well-known, complex, custom, URL-based apps
- VoIP for RTP, SIP, MGCP, H.323, SCCP
- Industry specific, e.g. FIX protocol, IP Multicast and PACs
- Application discovery for unknown TCP and UDP applications
- No data reduction: all applications
- QoE / Response time analysis
- Proactive Alarms for thresholds, response times, time over threshold and microbursts
- Virtual interface analysis for VLANs, VRFs, QoS, or sites
- Post-capture filters by variety of metrics, not just by IP address, e.g. CDM port
27 Added Benefits of a Unified Performance Management System
- Minimize Total Cost of Ownership (TCO) by choosing Performance Management Systems that
- Support all applications, conversations and diverse network technologies that make up the network environment
- Present
- Future
- Use a common interface for all data sources
- Deep Forensic data
- Probe Flow data
- NetFlow / sFlow
28 Common Data Model: More Performance Metrics for services or applications
VoIP packet loss
Link Usage over Time
Details from drill down on spike
29 Responsiveness Tracking with QoE
- Quality of Experience Tracking (QoE)
- Key Features
- Adds support for Virtual Interfaces
- Granularity down to 1 minute
- Tracks TCP and HTTP Error counters
- Adds support for both passive and active VoIP metrics
- Adds support for IP-SLA transactions
TCP Errors
30 Responsiveness Tracking with QoE: Visibility with 1-Minute Granularity
Response Time with 1 minute resolution
QoE Virtual Interface Support
31 Application Discovery: Visibility into Unknown Traffic
- Identify Port-to-Port conversations for unknown applications
- Can be logged historically with 1-minute resolution
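To make the idea of logging port-to-port conversations at 1-minute resolution concrete, the sketch below drops traffic on recognized ports and totals the remaining bytes per minute, protocol, and port pair. The short list of "known" ports and the record layout are assumptions for illustration; real application recognition is far richer than a port lookup.

```python
from collections import Counter

# Hypothetical set of ports treated as already-identified applications.
KNOWN_PORTS = frozenset({25, 53, 80, 443})


def log_unknown_conversations(records, known_ports=KNOWN_PORTS):
    """Total bytes per (minute_start_s, proto, src_port, dst_port).

    records: iterable of (timestamp_s, proto, src_port, dst_port, nbytes).
    Records touching a known port are treated as identified and skipped.
    """
    log = Counter()
    for ts, proto, src_port, dst_port, nbytes in records:
        if src_port in known_ports or dst_port in known_ports:
            continue
        minute_start = int(ts) // 60 * 60  # 1-minute resolution
        log[(minute_start, proto, src_port, dst_port)] += nbytes
    return log


# Made-up records.
records = [
    (5, "TCP", 50123, 9000, 1000),
    (42, "TCP", 50123, 9000, 500),
    (65, "TCP", 50123, 9000, 700),
    (10, "UDP", 53211, 53, 80),  # DNS, skipped as known
]
log = log_unknown_conversations(records)
for key, nbytes in sorted(log.items()):
    print(key, nbytes)
```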
32 Survey Results
- In 2007 NetScout partnered with Ashton, Metzler and Associates
- Goal: To see what impact performance management systems had in diagnosing critical issues
- 138 participants were asked how long it took to diagnose a critical network issue
- Before implementing a Performance Management Solution
- After implementing a Performance Management Solution
33 Improved MTTR with a Performance Management System
Before Implementing the nGenius Performance Management System
After Implementing the nGenius Performance Management System
69% Time Savings, 6 Hours Time Savings
Source: NetScout Systems Survey, March 2007, N = 138
34 Ability to Diagnose Issues in the First 3 Hours
Before Implementing the nGenius Performance Management System
After Implementing the nGenius Performance Management System
Percentage Increased from 26% to 77%
Source: NetScout Systems Survey, March 2007, N = 138
35 Summary: Benefits of Proactive Management with Post-Incident Analysis
- Reduced MTTR with a Unified Solution
- Must have vision into the network during the detection phase through to the on-going management phase of the performance problem lifecycle
- Must be able to support real-time and historical reporting
- Top-down approach with context-sensitive data mining
- Store packets and report on flows concurrently
- Lower TCO with Architecture Advantages
- Flexible to support today's and tomorrow's applications and network technologies
- Provide vision into the entire network, not just certain pieces
- Highly Available Hardware Architecture
- Integrated automatic and ad-hoc reporting functionality for collaboration
36 About NetScout
- The most experienced team in the industry
- Founded in 1984
- Growing, profitable
- 102M revenues 2007
- World-wide distribution and support
- Winner of 2004 2006 Omega Northface Award for customer satisfaction
37 Case in Point: Bates College
- 3000 Users
- Needed to see
- Who was using large amounts of resources
- Who was using the available bandwidth
- Number of flows
- Pain Point: Needed to minimize impact of single users on resources
- Diagnostics to see service and applications
- Looking for heavily utilized links
- Needed more than layer 2 RMON devices
Please Welcome Jim Bauer, Director of Network Infrastructure Services
Company Confidential