Monitoring Performance - PowerPoint PPT Presentation

About This Presentation

Title:

Monitoring Performance

Slides: 45

Provided by: chrisci8

Transcript and Presenter's Notes

Title: Monitoring Performance

1
Monitoring Performance
Finisar Corporation
2
The growing SAN challenge
complexity
You don't know the source of SAN issues
heterogeneity
You don't know what you can't see
fabric blindness
virtualization
You don't know what to do when things go wrong
change
3
Fabric blindness leads to

Frantic fire-fighting
Internal finger-pointing
Application vs. network vs. storage
External finger-pointing
Vendors
Unacceptably long resolution times

Application brownouts or blackouts occur - and
have significant business impact

4
Information Highway

Network performance shares many similarities with
that of your daily commute
Storage Area Networks are no exception
Just like a large sprawling city, as SANs grow
performance becomes more difficult to ensure
Let's take a look at planning for a faster commute

5
Is Performance Important?

Bugatti Veyron: the fastest production car
0-60 mph in 3.2 seconds
Top speed well over 200 mph
Price more than $1,000,000
Chevy Matiz: one of the slowest cars
0-60 mph in 21.9 seconds
Top speed about 85 mph
Price about $10,000
The Difference
6.8 times the acceleration
3-4 times as fast
About 100 times the price

6
The Real Difference

In this environment they both go the same speed.
In fact, in most environments they would take
roughly the same time from A to B.
So maybe the right question is: when is
performance important, and how is it measured?

7
Rush Hour

In LA, which has the worst rush hour commute time,
there is an 81% average delay during rush hour
Often certain routes are congested while others
have limited traffic that is not affected

8
Rush Hour

On many SANs there is a 500% average delay during
peak times
There is no notification of a problem (time out)
until it is at 6000% of normal maximums and
75,000% of the low load average
Queues (just like on ramps) can fill even at low
bandwidth conditions
Often certain routes are congested while others
have limited traffic that is not affected
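
To put the ratios on this slide in concrete terms, the sketch below works backward from a timeout value to the baselines it implies. The 30-second timeout is an assumed figure chosen only for illustration; the slide supplies just the two ratios, read here as 6000% and 75,000%.

```python
def implied_baseline(timeout_s: float, percent_of_baseline: float) -> float:
    """Return the baseline response time (in seconds) for which
    timeout_s equals percent_of_baseline percent of that baseline."""
    return timeout_s / (percent_of_baseline / 100.0)


TIMEOUT_S = 30.0  # assumption for illustration, not a value from the slides

# Timeout fires at 6000% of the normal maximum response time.
normal_max_s = implied_baseline(TIMEOUT_S, 6000)
# Timeout fires at 75,000% of the low-load average response time.
low_load_avg_s = implied_baseline(TIMEOUT_S, 75000)

print(f"normal maximum:   {normal_max_s * 1000:.0f} ms")
print(f"low-load average: {low_load_avg_s * 1000:.0f} ms")
```

Under that assumed timeout, response times could degrade from tens of milliseconds to tens of seconds before anything reports a problem.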

9
The impact of accidents

The impact of accidents depends on their severity
Pileups can result in routes that are impassable
Minor accidents can cause delays that far exceed
even the impact of rush hour

10
The impact of errors

The impact of errors depends on the severity of
the issue
Physical errors can result in routes that are
unusable
Occasional errors can cause delays that far
exceed even the impact of rush hour

11
Patch Work

Often short term solutions to problems become
long term hazards

12
Patch Work

Often short term solutions to problems become
long term hazards

13
Planning for and monitoring the commute

City planners architect the roadways for what
they believe will be the commute demands
In some cases they use simulation to compare
various alternatives
Finally they monitor the traffic patterns to
prevent and resolve problems and better plan for
the future

14
Planning for and monitoring the SAN

SAN Architects plan the fabrics for what they
believe will be the storage demands
In some cases they use simulation and tests to
compare various alternatives
Finally they monitor the traffic patterns to
prevent and resolve problems and better plan for
the future

15
Planning

The roadways are designed for the expected
traffic loads
Often one of the biggest mistakes in the planning
is using information that is out of date or
incorrect assumptions.

16
Planning

The Fabrics are designed for the expected traffic
loads
Often one of the biggest mistakes in the planning
is using information that is out of date or
incorrect assumptions.

17
Simulations are sometimes used to compare changes
18
Simulations are sometimes used to compare changes
19
Monitoring I/Os Per Second

Which route has more cars passing by every
second?
In this scenario they could all be the same: some
with a few cars moving very fast, others
with many cars going slowly
So what, if anything, does that measurement tell us
about performance?

20
Monitoring I/Os Per Second

Which route has more MBs passing through every
second?
In this scenario they could all be the same: some
with no requests and some with slow requests due
to congestion
So what, if anything, does that measurement tell us
about performance?
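
The point of the last two slides can be sketched with Little's Law (outstanding I/Os = I/Os per second x average completion time). The two links and their numbers below are hypothetical; they only show that identical I/Os-per-second readings can hide very different latencies and queue build-up.

```python
from dataclasses import dataclass


@dataclass
class Link:
    name: str
    iops: float           # completed I/Os per second
    avg_latency_s: float  # average completion time per I/O, in seconds

    def outstanding(self) -> float:
        """Average number of I/Os in flight, by Little's Law (L = lambda * W)."""
        return self.iops * self.avg_latency_s


# Hypothetical links: same I/O rate, very different latency.
links = [
    Link("fast link", iops=5000, avg_latency_s=0.001),
    Link("congested link", iops=5000, avg_latency_s=0.020),
]

for link in links:
    print(f"{link.name}: {link.iops:.0f} IOPS, "
          f"{link.avg_latency_s * 1000:.0f} ms, "
          f"{link.outstanding():.0f} I/Os in flight")
```

Both links report the same rate, so a rate counter alone cannot distinguish them; the completion time does.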

21
Modern Monitoring

Looks at the real traffic flows
Can assess performance
Pinpoints the source of slow downs such as
accidents and congestion
Speeds resolution to many of the problems
In many cases helps to prevent issues from
becoming problems

22
Different methods of network monitoring

Software Monitoring
No interference on the physical link
Software agent needed
Affected by host system performance

Hardware Monitoring
Isolated from software and host issues
Intrusive on the physical link
Dedicated monitoring hardware

23
Modern Monitoring

Looks at the real traffic flows
Can assess performance
Pinpoints the source of slow downs such as
accidents and congestion
Speeds resolution to many of the problems
In many cases helps to prevent issues from
becoming problems

24
Performance Analysis and Tuning
[Chart legend: Queue 16, Queue 8, Queue 4, Queue 2]

Request size and queue depth are two key
contributors to performance tuning
Pre-production run with variable queue depth and
request size
A higher queue depth could increase throughput but
could also cause congestion and reduce throughput

25
Performance Analysis and Tuning

Read size 8 KB with variable queue depth setting
Response time ranges from 10 ms to 65 ms
The ideal queue depth for this system would be
8 with 8 KB I/O
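
One way to read a result like this is to convert each queue depth's response time into an I/O rate, assuming the queue is kept full (I/O rate = queue depth / response time). The per-depth response times below are invented for illustration; only the 10 ms and 65 ms endpoints and the conclusion that a depth of 8 is best come from the slide, and mapping the endpoints to depths 2 and 16 is itself an assumption.

```python
IO_SIZE_KB = 8

# Queue depth -> average response time in seconds (hypothetical values).
response_time_s = {2: 0.010, 4: 0.012, 8: 0.016, 16: 0.065}


def iops(queue_depth: int, rt_s: float) -> float:
    """I/O rate when queue_depth requests are always outstanding."""
    return queue_depth / rt_s


results = {qd: iops(qd, rt) for qd, rt in response_time_s.items()}
best_qd = max(results, key=results.get)

for qd, rate in results.items():
    mb_per_s = rate * IO_SIZE_KB / 1024
    print(f"queue depth {qd:>2}: {rate:6.0f} IOPS, {mb_per_s:5.2f} MB/s")
print(f"highest I/O rate at queue depth {best_qd}")
```

With these made-up numbers, going from a depth of 8 to 16 roughly quadruples the response time and lowers the I/O rate, which is the congestion effect the tuning run is meant to expose.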

26
Performance Analysis and Tuning

Queue depth of 4 with variable read size
Throughput is gained at the expense of latency
At 32 KB I/O, the throughput gain no longer keeps
up with the latency increase
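
A simple way to locate that point is to compare, step by step, the relative throughput gain with the relative latency increase. The measurements below are hypothetical and were chosen so that the crossover lands at 32 KB as on the slide; `find_knee` is an illustrative helper, not part of any tool.

```python
# (read size in KB, throughput in MB/s, latency in ms), hypothetical values
measurements = [
    (4, 16.0, 1.0),
    (8, 27.0, 1.2),
    (16, 40.0, 1.6),
    (32, 49.0, 2.6),
    (64, 47.0, 5.5),
]


def find_knee(points):
    """Return the first read size at which latency grows faster
    (relatively) than throughput, or None if that never happens."""
    for prev, cur in zip(points, points[1:]):
        throughput_gain = cur[1] / prev[1] - 1.0
        latency_gain = cur[2] / prev[2] - 1.0
        if throughput_gain < latency_gain:
            return cur[0]
    return None


knee_kb = find_knee(measurements)
print(f"throughput stops keeping up with latency at {knee_kb} KB")
```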

27
Good Performance Monitoring

Does not focus on the irrelevant
Alarm for known issues
Unless there is an increasing pattern
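
One reading of this slide is that a known, tolerated issue should stay quiet unless its counts are trending upward. The sketch below encodes that reading; the three-interval window and the function names are assumptions, not something the presentation specifies.

```python
def is_increasing(counts, window=3):
    """True if the last `window` interval counts are strictly increasing."""
    recent = counts[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:])
    )


def should_alarm(known_issue, counts, window=3):
    """Decide whether to raise an alarm for per-interval error counts.

    New issues alarm as soon as the latest interval shows errors.
    Known issues alarm only when the counts show an increasing pattern.
    """
    if not counts or counts[-1] == 0:
        return False
    if not known_issue:
        return True
    return is_increasing(counts, window)


print(should_alarm(True, [2, 2, 2, 2]))   # steady known issue
print(should_alarm(True, [2, 3, 5]))      # known issue getting worse
print(should_alarm(False, [0, 0, 1]))     # new issue
```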

28
Good Performance Monitoring

Does not focus on the irrelevant
Alarm for known issues
Unless there is an increasing pattern

29
Effects of SAN performance monitoring

Eliminate internal and
vendor finger-pointing
Receive advance warning of
potential problems
Reduce business risk

30
Two recent customer case studies
Case 1: A SAN problem was the root cause of an
application disruption
Case 2: A SAN problem was suspected as the root
cause of an application disruption - but it was
not the cause
31
Case 1 company profile

Large US insurance firm
Broad offering of insurance and financial
products
10,000 agents and employees
Large Microsoft Exchange implementation
Exchange data replicated to a remote site for
backup and disaster recovery

32
Case 1 customer crisis

Exchange application slowed and became
essentially unusable
User complaints flood IT
Business operations adversely impacted

33
Case 1 resolution efforts

Exchange server event log - no problems
Storage array log files - no problems
Primary and secondary DR links tested - no
problems
Switch fabric manager - no problems
Exchange throughput still low - pressure mounting
- but no way to diagnose the problem
Elapsed time: 8 hours

34
Case 1 Modern Performance monitoring

Probed the storage link: unusually high Exchange
completion times proved the SAN was the problem
Storage array response good
Remote replication acknowledgments took too long
Solution: re-route the DR traffic through the
secondary link - Exchange performance restored
Elapsed time: 30 minutes
Cause: the remote switch was busy dealing
with an RSCN storm caused by a bad HBA in an
unrelated application server at the remote site
Fix: replaced the HBA
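
The diagnosis on this slide amounts to comparing each component of the write path against its own baseline. A minimal sketch follows; the component names, the millisecond figures, and the 3x threshold are all hypothetical and stand in for whatever a link probe would actually report.

```python
def slow_components(measured_ms, baseline_ms, factor=3.0):
    """Return the components whose measured time exceeds
    `factor` times their baseline."""
    return [
        name
        for name, value in measured_ms.items()
        if value > factor * baseline_ms[name]
    ]


# Hypothetical baselines and measurements, in milliseconds.
baseline_ms = {"array_response": 2.0, "replication_ack": 5.0}
measured_ms = {"array_response": 2.2, "replication_ack": 180.0}

suspects = slow_components(measured_ms, baseline_ms)
print("components outside baseline:", suspects)
```

With synchronous replication a write is not complete until the remote site acknowledges it, so a slow acknowledgment shows up as a slow application even when the local array responds normally.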

35
Sync replication impact on production
Remote replication disabled
Remote replication enabled
36
Case 1 summary

Normal business operations were quickly restored
Conclusive data that prevented finger-pointing
Without deep SAN Performance monitoring/analysis
it would have taken extraordinary effort to get
to the root cause and resolution
Had deep SAN monitoring/analysis been in place, the
problem would have been prevented

37
Case 2 company profile

Large UK financial services firm
Assets of 540 billion
Over 20 million customers
Major UK mortgage and savings provider and credit
card issuer
Relies on Oracle databases for transaction
processing systems

38
Case 2 problem statement

Sudden, but intermittent slow down of
Oracle-based applications
Widespread user complaints driving high level of
internal visibility
Business operations adversely impacted
SAN was assumed to be the problem

39
Case 2 Modern Performance Monitoring

Deep SAN monitoring/analysis solution already in
place
Quickly determined that all SAN parameters were
within normal ranges - problem was not within the
SAN
Trending report indicated the time of problem
occurrence - IT tracked it back to an application
enhancement
Elapsed time: <30 mins
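
Finding the time of onset from a trend can be sketched as a simple threshold test against an earlier baseline. The hourly samples, the baseline length, and the three-sigma threshold below are assumptions for illustration; the presentation does not describe how its trending report works internally.

```python
from statistics import mean, pstdev


def find_onset(samples, baseline_count=4, k=3.0):
    """Return the timestamp of the first sample after the baseline period
    that exceeds the baseline mean by more than k standard deviations,
    or None if no sample does."""
    baseline = [value for _, value in samples[:baseline_count]]
    threshold = mean(baseline) + k * pstdev(baseline)
    for timestamp, value in samples[baseline_count:]:
        if value > threshold:
            return timestamp
    return None


# Hypothetical hourly link traffic in MB/s.
samples = [
    ("09:00", 100), ("10:00", 104), ("11:00", 98), ("12:00", 102),
    ("13:00", 101), ("14:00", 180), ("15:00", 190),
]

print("traffic first left its normal range at", find_onset(samples))
```

Knowing when the change began lets the team line it up against the change log, which is how an application enhancement could be identified as the trigger.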

40
Case 2 Modern Performance Monitoring
41
Case 2 Modern Performance Monitoring
Increased link traffic
42
Case 2 summary

Quickly identified that the SAN was not the root problem
Identifying the exact time of problem manifestation
helped identify the root cause: a poorly designed
database query
Quickly restored normal business operations
Customer acknowledgement: without a deep SAN
monitoring/analysis solution, it would have taken
days and many unproductive efforts to resolve

43
Where do you stand?

Are your networks being planned with the
appropriate timely information, or are they just
happening?
How are you monitoring performance? Do you know
if your response times are degrading? Are your
queue depth settings correct? How would you react
to a brown out?
What would the impact be to your business of
response times that were 6000% longer than you
are seeing now due to errors or congestion?
Does your monitoring alert you to conditions that
are irrelevant while not informing you of
conditions that are likely to impact your
business?
Are you flying blind when it comes to the health
and performance of your SAN?