Reduce MTTR - PowerPoint PPT Presentation


About This Presentation

Title:

Reduce MTTR

Description:

with Combined Performance Management and Forensic Analysis. Brian Robertson ... Does not require entire data capture files to be sent. Deep forensic analysis ... – PowerPoint PPT presentation

Slides: 38

Provided by: Robert1111

Transcript and Presenter's Notes

Title: Reduce MTTR

1
Reduce MTTR

with Combined Performance Management and Forensic
Analysis

2
Introduction

Brian Robertson
Solutions Marketing Manager
Netscout Systems, Inc.

Jim Bauer
Director of Network Infrastructure Services
3
Agenda

Network Performance Management Challenges
Meeting the Challenge with Performance Management
Solutions
Top-Down Troubleshooting Approach
Case Studies
Survey Results

4
Balancing Proactive Performance Management with
Post-Incident Investigation and Analysis

Challenge
Need proactive monitoring and analysis for
everyday troubleshooting and capacity planning
Intermittent problems wreak havoc on the
performance of business-critical and
revenue-generating applications
Problem cause is often difficult to discover
Needs
Ability to automatically analyze traffic and
simultaneously retain evidence to recreate a
complex incident and discover its culprit without
having to wait for the event to recur

5
Who Finds Performance Problems?
Source: NetScout Systems Survey, March 2007 (N = 232)
6
Performance Problem Lifecycle
7
The Ultimate Payoff: Reduced MTTR
Faster time to resolution / minimized user impact
Service Outage
User Calls Help Desk
End-Users Impacted
Final Verification
End Users Not Impacted!
time
Problem Origin
8
What is needed to Lower MTTR

A Performance Management System that
Combines key performance indicator (KPI)
monitoring and analysis with continuous packet
capture for post event data mining
Top down approach to troubleshooting
Application Fabric Monitoring
Has an architecture built for high performance
recording and infrastructure monitoring
Uses a high capacity, highly available hardware
platform

9
Unique Approach: Three-Tier Data Architecture
for Rapid Top-Down Workflow
Top Down
Data Set
KPIs: Retransmits, VoIP QoS Errors, App Response
Time, etc.
Flows (CDM): Applications/Services, Conversations,
Utilization, Volume, etc.
Packets: Header and Payload
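
The three tiers above can be pictured as linked record types, where each tier narrows the search for the next. The following is a minimal sketch under stated assumptions: the class names, fields and helper functions are hypothetical illustrations, not the data model of any NetScout product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiSample:
    """Tier 1: one KPI measurement over a time interval (hypothetical shape)."""
    name: str
    start: float   # interval start, seconds
    end: float     # interval end, seconds
    value: float


@dataclass(frozen=True)
class Flow:
    """Tier 2: one application conversation (hypothetical shape)."""
    flow_id: str
    application: str
    start: float
    end: float


@dataclass(frozen=True)
class Packet:
    """Tier 3: one captured packet (hypothetical shape)."""
    timestamp: float
    flow_id: str
    length: int    # bytes


def flows_for_kpi(sample, flows):
    """KPI -> Flow: keep flows whose lifetime overlaps the KPI interval."""
    return [f for f in flows if f.start < sample.end and f.end > sample.start]


def packets_for_flow(flow, sample, packets):
    """Flow -> Packet: keep this flow's packets inside the KPI interval."""
    return [
        p for p in packets
        if p.flow_id == flow.flow_id and sample.start <= p.timestamp < sample.end
    ]
```

Starting from a KPI sample that breached a threshold, `flows_for_kpi` selects the candidate conversations and `packets_for_flow` then pulls only the packets needed for a decode, which keeps the time context intact at each step.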
10
Step 1: Detection
Is there an issue on the network?

Trending, alarming and analytical data on key
performance indicators will provide:
Notification when link or application utilization
increases
Visibility into unwanted protocols on the network
Measurement of VoIP links for jitter, packet loss or
low MOS scores
This information has to be available in real-time
and historically
Collaboration with groups across the organization
requires quick and easy distribution of reports

11
Understand how your business uses the network
Do you know all the applications running on your
network? Application visibility provides
business justification for IT decisions.
Are there good reasons for an upgrade? Are there
non-business uses of the network?
12
Baseline response time of key business
applications
Looking at several parameters such as MPLS,
VoIP, or QoS classes can provide critical data
about the health of the network.
Response time provides insight into the end-user
experience and should be an integral part of any
performance management audit
13
Top Down Approach for Rapid Troubleshooting
Power Alarms - Micro-Burst Alarming

Based on traffic rates exceeding a 1 millisecond
threshold
Interval can be configured as low as 5
milliseconds up to 1 second
An alarm is received when the burst starts and
one when it ends
Evidence is launched based on applications,
hosts, conversations during the alarm interval
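
The micro-burst alarming described above can be sketched as follows: bucket traffic into short fixed intervals, compute the rate in each, and emit one alarm when the rate first exceeds the threshold and another when it drops back. The function name, input format and threshold value are assumptions for illustration only; timestamps are integer microseconds to avoid floating-point bucketing errors.

```python
from collections import defaultdict


def microburst_alarms(packets, interval_us, threshold_bps):
    """Detect bursts in short intervals.

    packets: iterable of (timestamp_us, length_bytes), timestamps as integers.
    interval_us: measurement interval in microseconds (e.g. 5000 for 5 ms).
    threshold_bps: rate, in bits per second, above which an interval is a burst.
    Returns a list of (event, interval_start_us) tuples.
    """
    bits = defaultdict(int)
    for ts, length in packets:
        bits[ts // interval_us] += length * 8
    if not bits:
        return []

    alarms = []
    in_burst = False
    # Scan one interval past the last packet so an open burst is always closed.
    for idx in range(min(bits), max(bits) + 2):
        rate_bps = bits.get(idx, 0) * 1_000_000 / interval_us
        if rate_bps > threshold_bps and not in_burst:
            alarms.append(("burst_start", idx * interval_us))
            in_burst = True
        elif rate_bps <= threshold_bps and in_burst:
            alarms.append(("burst_end", idx * interval_us))
            in_burst = False
    return alarms
```

The interval between the two alarms is the window in which applications, hosts and conversations would be collected as evidence. A link that looks lightly loaded on a one-second average can still exceed the threshold inside a single short interval, which is why the interval is kept small.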

14
Step 2: Diagnosis
What is the root cause of the issue?

Look at the flows associated with the KPI
affected by the issue
Applications
Conversations
Utilization
Volume
Historical view of trends for flows shows
When and where an issue commonly occurs
Upward / downward trends for flows

15
Flow Based Troubleshooting Example
Music downloads from 3 different sites, same user
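
A flow-based investigation like this example amounts to grouping flow records by client and application, then ranking by volume. The sketch below assumes a simple dictionary record format with hypothetical field names (`client`, `server`, `application`, `bytes`); real flow exports will differ.

```python
from collections import defaultdict


def summarize_by_client(flows):
    """Rank (client, application) pairs by total bytes.

    flows: iterable of dicts with keys client, server, application, bytes.
    Returns rows of (client, application, total_bytes, distinct_servers),
    largest volume first.
    """
    totals = defaultdict(lambda: {"bytes": 0, "servers": set()})
    for f in flows:
        entry = totals[(f["client"], f["application"])]
        entry["bytes"] += f["bytes"]
        entry["servers"].add(f["server"])
    rows = [
        (client, app, e["bytes"], len(e["servers"]))
        for (client, app), e in totals.items()
    ]
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

A single client appearing at the top of the ranking with several distinct servers for the same application is the pattern the slide describes: one user pulling the same kind of traffic from multiple sites.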
16
Troubleshoot Intermittent or Subtle Problems with
Application Fabric Monitoring

Efficient on-board analysis
Sends only the data requested by the client
Does not require entire data capture files to be
sent
Deep forensic analysis
Logically move from KPI to Flow to Packet
View data with millisecond granularity
Record packet level audit trails to view subtle
and intermittent issues

17
Highly Efficient Architecture
18
Highly Efficient Architecture
On Board Analysis
Client Analysis
Efficient On-board Analysis
Inefficient Client-based Analysis
19
Superior Forensic Analysis

Expert Analysis
Expert Zoom and Data Mining interface with
multiple workspaces
Broad support of over 1000 protocol decodes
TCP Session Follow
Bounce Diagrams
Intuitive, flexible GUI
Multi-user, Web-based console
Eases search process
Incident Reports for collaborating and
communicating with others

20
Case in Point: Solution Architecture Wins Business

Bank in North America has assets approaching 300
Billion
2 trading floors (Chicago, New York), NJ
disaster recovery
Pain Point: Needs continuous capture and
monitoring on trading floors
Needs both flow-based monitoring and trending for
troubleshooting and capacity planning, plus
recording for in-depth analysis because of the
value of the financial trades
Current solution in Chicago not working: network
managers constrained by time delays in viewing
packet trace files when pulling them over the
network.

Able to perform top-down troubleshooting with
detailed analysis of applications and
conversations. When necessary for in-depth
troubleshooting, the actual packets are available
for event reconstruction and forensic data
mining without adding load to the network.
21
High Definition Visibility Deep Forensics
22
Quickly navigate to the needed level of
granularity, consistently maintaining context as
you go deeper
What do you do if you need a packet decode, but
the offending traffic occurred two hours ago?
23
Step 3: Verification
Have we resolved the issue?

Does the KPI meet expectations now that changes
have been implemented?
Re-evaluate response time of critical
applications
Are key applications being delivered within
previous response time levels? Have there been
negative or positive impacts? Have VoIP QoE
metrics changed?
Determine whether bandwidth utilization meets
estimates
If the changes resolve the issue, the baseline
should be reset
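
The verification questions above reduce to comparing a KPI measured after the change against its earlier baseline. This is a minimal sketch assuming response times in milliseconds, the median as the comparison statistic, and a hypothetical 10% tolerance; none of these choices comes from the presentation.

```python
from statistics import median


def verify_against_baseline(baseline_ms, current_ms, tolerance=0.10):
    """Compare current response times with a baseline.

    baseline_ms, current_ms: non-empty lists of response times in milliseconds.
    tolerance: allowed relative increase over the baseline median (assumed value).
    Returns a dict with both medians, the relative change, and a pass/fail flag.
    """
    base = median(baseline_ms)
    now = median(current_ms)
    change = (now - base) / base
    return {
        "baseline_ms": base,
        "current_ms": now,
        "change": change,
        "within_baseline": change <= tolerance,
    }
```

If the check passes and the fix is accepted, the current samples become the new baseline for the next round of detection.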

24
Case in Point: Troubleshooting employee remote
access problem

New England based insurance company
Pain Point: Remote employees having trouble
accessing network resources
LDAP servers were the source of many problems
Intermittent issues were elusive
Application Fabric Monitoring provided continuous
capture and recording for in-depth
troubleshooting and forensics

Discovered two LDAP servers had their
authentication databases out of sync and were
spending their cycles trying to sync their
databases
25
Step 4: On-Going Management
How is your network growing and changing over time?

Converged networks need unified performance
management
Continuation of the tasks you performed in
detection and post-deployment impact phases
Troubleshooting - requiring real-time information
Planning and traffic engineering - requiring
longer-term historical information
Communication to key constituents
Easy to create, customizable reports

26
Visibility into how the network is used
Complete Application monitoring and profiling with CDM
Virtualization

Application identification - Common matrix for
multimedia voice, video and data
Well-known, complex, custom, URL-based apps
VoIP for RTP, SIP, MGCP, H.323, SCCP
Industry specific, e.g. FIX protocol, IP
Multicast and PACs
Application discovery for unknown TCP and UDP traffic
No data reduction: all applications
QoE / Response time analysis
Proactive Alarms for thresholds, response times,
time over threshold and microbursts
Virtual interface analysis for VLANs, VRFs, QoS,
or sites
Post-capture filters by a variety of metrics, not
just by IP address, e.g. CDM port

27
Added Benefits of a Unified Performance
Management System

Minimize Total Cost of Ownership (TCO) by
choosing Performance Management Systems that
Support all applications, conversations and
diverse network technologies that make up the
network environment
Present
Future
Uses a common interface for all data sources
Deep Forensic data
Probe Flow data
NetFlow / sFlow

28
Common Data Model: More Performance Metrics for
services or applications
VoIP packet loss
Link Usage over Time
Details from drill down on spike
29
Responsiveness Tracking with QoE

Quality of Experience Tracking (QoE)
Key Features
Adds support for Virtual Interfaces
Granularity down to 1-minute
Tracks TCP and HTTP Error counters
Adds support for both passive and active VoIP
metrics
Adds support for IP-SLA transactions

TCP Errors
30
Responsiveness Tracking with QoE: Visibility with
1 Minute Granularity
Response Time with 1 minute resolution
QoE Virtual Interface Support
31
Application Discovery: Visibility into Unknown
Traffic

Identify Port-to-Port conversations for unknown
applications
Can be logged historically with 1-minute
resolution
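
Port-to-port discovery of unknown applications can be sketched as filtering out flows that touch a recognized port and tallying the rest per port pair in one-minute buckets. The port table, record format and function name below are illustrative assumptions, not the product's classification logic.

```python
from collections import Counter

# Illustrative table of recognized ports; a real deployment would be far larger.
KNOWN_PORTS = {53: "dns", 80: "http", 443: "https"}


def unknown_conversations(flows, known_ports=KNOWN_PORTS):
    """Tally bytes for unrecognized port pairs at 1-minute resolution.

    flows: iterable of (timestamp_s, src_port, dst_port, n_bytes).
    Returns a Counter keyed by (minute_start_s, low_port, high_port).
    """
    buckets = Counter()
    for ts, sport, dport, n_bytes in flows:
        if sport in known_ports or dport in known_ports:
            continue  # already identified as a known application
        minute_start = int(ts) // 60 * 60
        low, high = sorted((sport, dport))
        buckets[(minute_start, low, high)] += n_bytes
    return buckets
```

Sorting the two ports makes both directions of a conversation land in the same bucket, so a recurring port pair stands out in the historical log even when its application has no name yet.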

32
Survey Results

In 2007 NetScout partnered with Ashton, Metzler
and Associates
Goal: To see what impact performance management
systems had in diagnosing critical issues
138 participants were asked how long it took to
diagnose a critical network issue
Before implementing a Performance Management
Solution
After implementing a Performance Management
Solution

33
Improved MTTR with a Performance Management System
Before Implementing the nGenius Performance
Management System
After Implementing the nGenius Performance
Management System
69% Time Savings / 6 Hours Time Savings
Source: NetScout Systems Survey, March 2007 (N = 138)
34
Ability to Diagnose Issues in the First 3 Hours
Before Implementing the nGenius Performance
Management System
After Implementing the nGenius Performance
Management System
Percentage Increased from 26% to 77%
Source: NetScout Systems Survey, March 2007 (N = 138)
35
Summary: Benefits of Proactive Management with
Post-incident analysis

Reduced MTTR with a Unified Solution
Must have vision into the network during the
detection phase through to the on-going
management phase of the performance problem
lifecycle
Must be able to support real-time and historical
reporting
Top down approach with context-sensitive data
mining
Store packets and report on flows concurrently
Lower TCO with Architecture Advantages
Flexible to support today's and tomorrow's
applications and network technologies
Provide vision into the entire network, not just
certain pieces
Highly Available Hardware Architecture
Integrated automatic and ad-hoc reporting
functionality for collaboration