Monitoring Performance - PowerPoint PPT Presentation

About This Presentation
Title:

Monitoring Performance

Description:

The growing SAN challenge. you don't know what to do when things go wrong ... Bugatti Veyron The fastest production car. 0 60 mph in 3.2 seconds ... – PowerPoint PPT presentation

Slides: 45
Provided by: chrisci8

1
Monitoring Performance
Finisar Corporation
2
The growing SAN challenge
complexity
you don't know the source of SAN issues
heterogeneity
you don't know what you can't see
fabric blindness
virtualization
you don't know what to do when things go wrong
change
3
Fabric blindness leads to
  • Frantic fire-fighting
  • Internal finger-pointing
  • Application vs. network vs. storage
  • External finger-pointing
  • Vendors
  • Unacceptably long resolution times
  • Application brownouts or blackouts occur - and
    have significant business impact

4
Information Highway
  • Network performance shares many similarities with
    that of your daily commute
  • Storage Area Networks are no exception
  • Just like a large sprawling city, as SANs grow
    performance becomes more difficult to ensure
  • Let's take a look at planning for a faster commute

5
Is Performance Important?
  • Bugatti Veyron: the fastest production car
  • 0-60 mph in 3.2 seconds
  • Top speed well over 200 mph
  • Price: more than $1,000,000
  • Chevy Matiz: one of the slowest cars
  • 0-60 mph in 21.9 seconds
  • Top speed about 85 mph
  • Price: about $10,000
  • The difference
  • 6.8 times the acceleration
  • 3 to 4 times as fast
  • More than 10 times the price

6
The Real Difference
  • In this environment they both go the same speed.
  • In fact, in most environments they would have
    roughly the same time from A to B.
  • So maybe the right question is: when is
    performance important, and how is it measured?

7
Rush Hour
  • In LA, which has the worst rush hour commute time,
    there is an 81% average delay during rush hour
  • Often certain routes are congested while others
    have limited traffic that is not affected

8
Rush Hour
  • On many SANs there is a 500% average delay during
    peak times
  • There is no notification of a problem (timeout)
    until it is at 6000% of normal maximums and
    75,000% of the low-load average
  • Queues (just like on-ramps) can fill even at low
    bandwidth conditions
  • Often certain routes are congested while others
    have limited traffic that is not affected
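
The gap between "normal" and "timeout" can be made concrete with a little arithmetic. The sketch below is only an illustration: it reads the slide's 6000 and 75,000 figures as percentages, and the 30-second timeout, the function names, and the status labels are assumptions, not values from the presentation.

```python
# Ratios taken from the slide, read as percentages.
TIMEOUT_VS_NORMAL_MAX = 60.0     # 6000% expressed as a ratio
TIMEOUT_VS_LOW_LOAD_AVG = 750.0  # 75,000% expressed as a ratio


def baselines_from_timeout(timeout_s):
    """Back out the baselines implied by a given timeout value."""
    normal_max_s = timeout_s / TIMEOUT_VS_NORMAL_MAX
    low_load_avg_s = timeout_s / TIMEOUT_VS_LOW_LOAD_AVG
    return normal_max_s, low_load_avg_s


def classify(latency_s, normal_max_s, timeout_s):
    """Label a response time relative to the normal maximum and the timeout."""
    if latency_s >= timeout_s:
        return "timeout"
    if latency_s > normal_max_s:
        return "degraded, no notification yet"
    return "normal"


timeout_s = 30.0  # assumed timeout, not from the slides
normal_max_s, low_load_avg_s = baselines_from_timeout(timeout_s)
for latency_s in (0.04, 0.4, 5.0, 30.0):
    print(latency_s, classify(latency_s, normal_max_s, timeout_s))
```

Under these assumptions, anything between the normal maximum and the timeout is degraded service that a timeout-only alert never reports.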

9
The impact of accidents
  • The impact of accidents depends on their severity
  • Pileups can result in routes that are impassable
  • Minor accidents can cause delays that far exceed
    even the impact of rush hour

10
The impact of errors
  • The impact of errors depends on the severity of
    the issue
  • Physical errors can result in routes that are
    unusable
  • Occasional errors can cause delays that far
    exceed even the impact of rush hour

11
Patch Work
  • Often short-term solutions to problems become
    long-term hazards

13
Planning for and monitoring the commute
  • City planners architect the roadways for what
    they believe will be the commute demands
  • In some cases they use simulation to compare
    various alternatives
  • Finally they monitor the traffic patterns to
    prevent and resolve problems and better plan for
    the future

14
Planning for and monitoring the SAN
  • SAN Architects plan the fabrics for what they
    believe will be the storage demands
  • In some cases they use simulation and tests to
    compare various alternatives
  • Finally they monitor the traffic patterns to
    prevent and resolve problems and better plan for
    the future

15
Planning
  • The roadways are designed for the expected
    traffic loads
  • Often one of the biggest mistakes in planning is
    relying on out-of-date information or incorrect
    assumptions.

16
Planning
  • The fabrics are designed for the expected traffic
    loads
  • Often one of the biggest mistakes in planning is
    relying on out-of-date information or incorrect
    assumptions.

17
Simulations are sometimes used to compare changes
19
Monitoring I/Os Per Second
  • Which route has more cars passing by every
    second?
  • In this scenario they could all be the same: some
    with a few cars moving very fast, others with
    many cars that are going slow
  • So what, if anything, does that measurement tell
    us about performance?

20
Monitoring I/Os Per Second
  • Which route has more MBs passing through every
    second?
  • In this scenario they could all be the same: some
    with no requests and some with slow requests due
    to congestion
  • So what, if anything, does that measurement tell
    us about performance?
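
One way to see why a rate alone says little about performance is Little's Law, which relates a rate to the time each request spends in the system. The sketch below is an illustration with made-up numbers: the link names, IOPS values, and response times are assumptions, not measurements from the presentation.

```python
def outstanding_ios(iops, avg_response_s):
    """Little's Law: average I/Os in flight = completion rate x response time."""
    return iops * avg_response_s


# Two hypothetical links that report the same I/Os per second.
links = {
    "link_a": {"iops": 2000, "avg_response_s": 0.002},  # fast responses
    "link_b": {"iops": 2000, "avg_response_s": 0.040},  # slow, congested responses
}

for name, metrics in links.items():
    in_flight = outstanding_ios(metrics["iops"], metrics["avg_response_s"])
    print(f"{name}: {metrics['iops']} IOPS, "
          f"{metrics['avg_response_s'] * 1000:.0f} ms per I/O, "
          f"about {in_flight:.0f} I/Os waiting")
```

Both links show the same IOPS, yet one keeps far more requests waiting, which is why response time has to be measured alongside the rate.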

21
Modern Monitoring
  • Looks at the real traffic flows
  • Can assess performance
  • Pinpoints the source of slowdowns such as
    accidents and congestion
  • Speeds resolution of many of the problems
  • In many cases helps to prevent issues from
    becoming problems

22
Different methods of Network Monitoring
  • Software monitoring
  • No interference on the physical link
  • Software agent needed
  • Affected by host system performance
  • Hardware monitoring
  • Isolated from software and host issues
  • Intrusive on the physical link
  • Dedicated monitoring hardware

23
Modern Monitoring
  • Looks at the real traffic flows
  • Can assess performance
  • Pinpoints the source of slowdowns such as
    accidents and congestion
  • Speeds resolution of many of the problems
  • In many cases helps to prevent issues from
    becoming problems

24
Performance Analysis and Tuning
[Chart legend: Queue 16, Queue 8, Queue 4, Queue 2]
  • Request size and queue depth are two key
    contributors to performance tuning
  • Pre-production runs with variable queue depth and
    request size
  • A higher queue depth could increase throughput
    but could also cause congestion and reduce
    throughput

25
Performance Analysis and Tuning
  • Read size of 8 KB with variable queue depth
    settings
  • Response time ranges from 10 ms to 65 ms
  • The ideal queue depth for this system would be 8
    with 8 KB I/O
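
A queue-depth sweep like this can be evaluated with a few lines of code. The sketch below is illustrative only: the 10 ms and 65 ms endpoints come from the slide, while the two middle response times, the function names, and the closed-loop estimate of IOPS are assumptions.

```python
# Hypothetical sweep for 8 KB reads: queue depth -> average response time (ms).
# Only the 10 ms and 65 ms endpoints appear on the slide; the rest is invented.
response_ms_by_depth = {2: 10.0, 4: 14.0, 8: 20.0, 16: 65.0}


def estimated_iops(queue_depth, response_ms):
    """Closed-loop estimate (Little's Law): IOPS = I/Os in flight / response time."""
    return queue_depth / (response_ms / 1000.0)


def best_queue_depth(sweep):
    """Return the queue depth with the highest estimated IOPS."""
    return max(sweep, key=lambda depth: estimated_iops(depth, sweep[depth]))


for depth, response_ms in response_ms_by_depth.items():
    print(depth, round(estimated_iops(depth, response_ms)))
print("best queue depth:", best_queue_depth(response_ms_by_depth))
```

With this invented data the estimate peaks at a queue depth of 8 and falls at 16, the pattern the slide describes: past some point, deeper queues add latency faster than they add completed I/Os.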

26
Performance Analysis and Tuning
  • Queue depth of 4 with variable read size
  • Throughput gain at the expense of latency
  • At 32 KB I/O, the throughput gain no longer keeps
    up with the increase in latency
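
The trade-off between throughput and latency can be checked step by step across a request-size sweep. This is a rough sketch with invented measurements: the sweep values, the function names, and the closed-loop throughput estimate are all assumptions chosen to illustrate the comparison.

```python
QUEUE_DEPTH = 4

# Hypothetical sweep: (read size in KB, average response time in ms).
sweep = [(4, 2.0), (8, 2.4), (16, 3.2), (32, 5.6), (64, 11.0)]


def throughput_mb_s(size_kb, response_ms, queue_depth=QUEUE_DEPTH):
    """Closed-loop estimate: queue_depth requests complete every response_ms.

    KB per ms is numerically equal to MB per s in decimal units.
    """
    return queue_depth * size_kb / response_ms


def first_size_where_latency_outpaces_throughput(points):
    """Return the first size whose relative throughput gain is below its latency gain."""
    for (size0, resp0), (size1, resp1) in zip(points, points[1:]):
        tput_gain = throughput_mb_s(size1, resp1) / throughput_mb_s(size0, resp0) - 1
        latency_gain = resp1 / resp0 - 1
        if tput_gain < latency_gain:
            return size1
    return None


print(first_size_where_latency_outpaces_throughput(sweep))
```

With this sample data the function returns 32, meaning the step to 32 KB is the first one where latency grows faster than throughput.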

27
Good Performance Monitoring
  • Does not focus on the irrelevant
  • Alarms for known issues
  • Unless there is an increasing pattern
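
An alarm rule along these lines could be sketched as follows. It is a minimal illustration, not a description of any product: the counter values, the threshold, the window length, and the function names are all hypothetical.

```python
def has_increasing_pattern(counts, windows=4):
    """True if the last `windows` samples are strictly increasing."""
    recent = counts[-windows:]
    if len(recent) < windows:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


def should_alarm(counts, known_issue_threshold, windows=4):
    """Alarm on a known-bad level, or on a low but steadily growing count."""
    if not counts:
        return False
    if counts[-1] >= known_issue_threshold:
        return True
    return has_increasing_pattern(counts, windows)


# Hypothetical per-interval error counts on three links.
print(should_alarm([2, 1, 2, 1, 2], known_issue_threshold=50))   # False: steady noise
print(should_alarm([1, 2, 4, 7, 11], known_issue_threshold=50))  # True: growing pattern
print(should_alarm([0, 0, 0, 0, 60], known_issue_threshold=50))  # True: known issue level
```

The first link is ignored as noise, while the second raises an alarm even though its counts are still far below the known-issue threshold.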

29
Effects of SAN performance monitoring
  • Eliminate internal and vendor finger-pointing
  • Receive advance warning of potential problems
  • Reduce business risk

30
Two recent customer case studies
Case 1: A SAN problem was the root cause of an
application disruption
Case 2: A SAN problem was suspected as the root
cause of an application disruption, but it was
not the cause
31
Case 1 company profile
  • Large US insurance firm
  • Broad offering of insurance and financial
    products
  • 10,000 agents and employees
  • Large Microsoft Exchange implementation
  • Exchange data replicated to a remote site for
    backup and disaster recovery

32
Case 1 customer crisis
  • Exchange application slowed and became
    essentially unusable
  • User complaints flood IT
  • Business operations adversely impacted

33
Case 1 resolution efforts
  • Exchange server event log - no problems
  • Storage array log files - no problems
  • Primary and secondary DR links tested - no
    problems
  • Switch fabric manager - no problems
  • Exchange throughput still low and pressure
    mounting, but no way to diagnose the problem.
    Elapsed time: 8 hours

34
Case 1 Modern Performance monitoring
  • Probed storage link: unusually high Exchange
    Completion Times proved the SAN was the problem
  • Storage array response good
  • Remote replication acknowledgments too long
  • Solution: re-route the DR traffic through the
    secondary link. Exchange performance restored.
    Elapsed time: 30 minutes
  • Cause and fix: the remote switch was busy dealing
    with an RSCN storm caused by a bad HBA in an
    unrelated application server at the remote site.
    Replaced the HBA
35
Sync replication impact on production
Remote replication disabled
Remote replication enabled
36
Case 1 summary
  • Normal business operations were quickly restored
  • Conclusive data that prevented finger-pointing
  • Without deep SAN performance monitoring/analysis,
    it would have taken extraordinary effort to get
    to the root cause and resolution
  • If deep SAN monitoring/analysis had been in place,
    the problem would have been prevented

37
Case 2 company profile
  • Large UK financial services firm
  • Assets of 540 billion
  • Over 20 million customers
  • Major UK mortgage and savings provider and credit
    card issuer
  • Relies on Oracle databases for transaction
    processing systems

38
Case 2 problem statement
  • Sudden but intermittent slowdown of
    Oracle-based applications
  • Widespread user complaints driving high level of
    internal visibility
  • Business operations adversely impacted
  • SAN was assumed to be the problem

39
Case 2 Modern Performance Monitoring
  • Deep SAN monitoring/analysis solution already in
    place
  • Quickly determined that all SAN parameters were
    within normal ranges - problem was not within the
    SAN
  • Trending report indicated the time of problem
    occurrence - IT tracked it back to an application
    enhancement
  • Elapsed time: <30 mins
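
Pinning down when a problem started can be as simple as comparing a trend against its own baseline. The sketch below is one possible approach, not the method used in the case study: the mean-plus-three-standard-deviations threshold, the sample link-traffic values, and the function name are assumptions.

```python
from statistics import mean, pstdev


def find_onset(samples, baseline_count, k=3.0):
    """Return the timestamp of the first post-baseline sample above mean + k * stdev.

    samples: list of (timestamp, value); the first baseline_count entries
    are treated as known-good history.
    """
    baseline = [value for _, value in samples[:baseline_count]]
    threshold = mean(baseline) + k * pstdev(baseline)
    for timestamp, value in samples[baseline_count:]:
        if value > threshold:
            return timestamp
    return None


# Hypothetical link traffic in MB/s, sampled every five minutes.
traffic = [
    ("09:00", 100), ("09:05", 104), ("09:10", 98), ("09:15", 102),
    ("09:20", 101), ("09:25", 180), ("09:30", 190),
]

print(find_onset(traffic, baseline_count=5))  # 09:25
```

Once an onset time is known, it can be matched against change records such as application releases.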

40
Case 2 Modern Performance Monitoring
41
Case 2 Modern Performance Monitoring
Increased link traffic
42
Case 2 summary
  • Quickly identified that the SAN was not the root
    problem
  • Identifying the exact time of problem
    manifestation helped identify the root cause: a
    poorly designed database query
  • Quickly restored normal business operations
  • Customer acknowledgement: without a deep SAN
    monitoring/analysis solution, it would have taken
    days and many unproductive efforts to resolve

43
Where do you stand?
  • Are your networks being planned with appropriate,
    timely information, or are they just happening?
  • How are you monitoring performance? Do you know
    if your response times are degrading? Are your
    queue depth settings correct? How would you react
    to a brownout?
  • What would the impact be to your business of
    response times that were 6000% longer than you
    are seeing now due to errors or congestion?
  • Does your monitoring alert you to conditions that
    are irrelevant while not informing you of
    conditions that are likely to impact your
    business?
  • Are you flying blind when it comes to the health
    and performance of your SAN?

44
Thank You. Questions? Or, contact us to
  • Get a Finisar SAN assessment of your availability
    and performance needs
  • Walk through detailed SAN diagnostic scenarios
  • Schedule a web briefing for your organization
  • Today's slides
  • www.finisar.com/webcast/NW1006.php