Title: Monitoring Performance
1Monitoring Performance
Finisar Corporation
2The growing SAN challenge
complexity
you don't know the source of SAN issues
heterogeneity
you don't know what you can't see
fabric blindness
virtualization
you don't know what to do when things go wrong
change
3Fabric blindness leads to
- Frantic fire-fighting
- Internal finger-pointing
- Application vs. network vs. storage
- External finger-pointing
- Vendors
- Unacceptably long resolution times
- Application brownouts or blackouts occur and have significant business impact
4Information Highway
- Network performance shares many similarities with your daily commute
- Storage Area Networks are no exception
- Just like a large sprawling city, as SANs grow, performance becomes more difficult to ensure
- Let's take a look at planning for a faster commute
5Is Performance Important?
- Bugatti Veyron: the fastest production car
- 0-60 mph in 3.2 seconds
- Top speed well over 200 mph
- Price: more than $1,000,000
- Chevy Matiz: one of the slowest cars
- 0-60 mph in 21.9 seconds
- Top speed about 85 mph
- Price: about $10,000
- The difference
- 6.8 times the acceleration
- 3-4 times as fast
- More than 10 times the price
6The Real Difference
- In this environment they both go the same speed.
- In fact, in most environments they would take roughly the same time to get from A to B.
- So maybe the right question is: when is performance important, and how is it measured?
7Rush Hour
- In LA, which has the worst rush hour commute time, there is an 81% average delay during rush hour
- Often certain routes are congested while others have limited traffic and are not affected
8Rush Hour
- On many SANs there is a 500% average delay during peak times
- There is no notification of a problem (a timeout) until response time reaches 6000% of the normal maximum and 75,000% of the low-load average
- Queues (just like on-ramps) can fill even under low-bandwidth conditions
- Often certain routes are congested while others have limited traffic and are not affected
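To put the timeout percentages above in concrete terms, here is a minimal sketch that converts a "percent of baseline" figure into the baseline it implies. The 60-second timeout is an assumed value chosen only for illustration; it does not come from the slides, and the function name is hypothetical.

```python
def baseline_from_percent(value, percent_of_baseline):
    """Return the baseline implied when `value` equals `percent_of_baseline` % of it."""
    return value / (percent_of_baseline / 100.0)

# Assumption for illustration only: the host reports a timeout after 60 seconds.
timeout_s = 60.0

# 6000% of the normal maximum means the timeout is 60 times that maximum.
normal_max_s = baseline_from_percent(timeout_s, 6000)
# 75,000% of the low-load average means the timeout is 750 times that average.
low_load_avg_s = baseline_from_percent(timeout_s, 75000)

print(f"normal maximum   : {normal_max_s * 1000:.0f} ms")
print(f"low-load average : {low_load_avg_s * 1000:.0f} ms")
```

Under that assumed timeout, response times could climb from an 80 ms low-load average all the way to 60 seconds before any timeout is reported.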
9The impact of accidents
- The impact of accidents depends on their severity
- Pileups can result in routes that are impassable
- Minor accidents can cause delays that far exceed even the impact of rush hour
10The impact of errors
- The impact of errors depends on the severity of the issue
- Physical errors can result in routes that are unusable
- Occasional errors can cause delays that far exceed even the impact of rush hour
11Patch Work
- Often short-term solutions to problems become long-term hazards
12Patch Work
- Often short-term solutions to problems become long-term hazards
13Planning for and monitoring the commute
- City planners architect the roadways for what they believe will be the commute demands
- In some cases they use simulation to compare various alternatives
- Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future
14Planning for and monitoring the SAN
- SAN architects plan the fabrics for what they believe will be the storage demands
- In some cases they use simulation and tests to compare various alternatives
- Finally they monitor the traffic patterns to prevent and resolve problems and better plan for the future
15Planning
- The roadways are designed for the expected traffic loads
- Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions
16Planning
- The fabrics are designed for the expected traffic loads
- Often one of the biggest mistakes in planning is relying on out-of-date information or incorrect assumptions
17Simulations are sometimes used to compare changes
18Simulations are sometimes used to compare changes
19Monitoring I/Os Per Second
- Which route has more cars passing by every second?
- In this scenario they could all be the same: some with a few cars moving very fast, others with many cars going slowly
- So what, if anything, does that measurement tell us about performance?
20Monitoring I/Os Per Second
- Which route has more MBs passing through every second?
- In this scenario they could all be the same: some with no requests and some with slow requests due to congestion
- So what, if anything, does that measurement tell us about performance?
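The point above, that a throughput counter alone says little about performance, can be illustrated with a minimal sketch. The two links and all of their numbers are made up for illustration, and `summarize` is a hypothetical helper, not part of any monitoring product.

```python
from statistics import mean

def summarize(io_records, window_s):
    """Summarize the I/Os completed in one measurement window.

    io_records: list of (bytes_transferred, response_time_ms) tuples.
    Returns (iops, mb_per_s, mean_response_ms).
    """
    iops = len(io_records) / window_s
    mb_per_s = sum(size for size, _ in io_records) / window_s / 1e6
    mean_response_ms = mean([rt for _, rt in io_records])
    return iops, mb_per_s, mean_response_ms

# Hypothetical one-second samples: both links complete 1000 I/Os of 8 KB each.
healthy_link = [(8192, 2.0)] * 1000      # every I/O answered in 2 ms
congested_link = [(8192, 40.0)] * 1000   # every I/O answered in 40 ms

for name, records in (("healthy", healthy_link), ("congested", congested_link)):
    iops, mbps, rt = summarize(records, window_s=1.0)
    print(f"{name:9s} IOPS={iops:.0f}  MB/s={mbps:.2f}  mean response={rt:.1f} ms")
```

Both links report identical IOPS and MB/s; only the response time column separates the healthy link from the congested one.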
21Modern Monitoring
- Looks at the real traffic flows
- Can assess performance
- Pinpoints the source of slowdowns such as accidents and congestion
- Speeds resolution of many of the problems
- In many cases helps to prevent issues from becoming problems
22Different methods of Network Monitoring
- Software monitoring
- Does not interfere with the physical link
- Requires a software agent
- Affected by host system performance
- Hardware monitoring
- Isolated from software and host issues
- Intrusive on the physical link
- Requires dedicated monitoring hardware
23Modern Monitoring
- Looks at the real traffic flows
- Can assess performance
- Pinpoints the source of slowdowns such as accidents and congestion
- Speeds resolution of many of the problems
- In many cases helps to prevent issues from becoming problems
24Performance Analysis and Tuning
[Chart legend: queue depths 16, 8, 4, and 2]
- Request size and queue depth are two key contributors to performance tuning
- Run pre-production tests with varying queue depth and request size
- A higher queue depth could increase throughput, but it could also cause congestion and reduce throughput
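The queue depth trade-off can be reasoned about with Little's law, a general queueing result that is not stated in the slides: with Q I/Os kept outstanding and an average response time R, the completion rate is Q / R. A minimal sketch follows; the response times are invented to show the shape of the effect, not measured on any system.

```python
def iops_from_queue_depth(queue_depth, response_time_ms):
    """Little's law: completion rate = outstanding I/Os / average response time."""
    return queue_depth / (response_time_ms / 1000.0)

# Hypothetical measurements: (queue depth, average response time in ms).
# Response time is assumed to rise slowly at first, then sharply once the path congests.
measurements = [(2, 1.0), (4, 1.2), (8, 2.0), (16, 8.0)]

for depth, rt_ms in measurements:
    rate = iops_from_queue_depth(depth, rt_ms)
    print(f"queue depth {depth:2d}: {rt_ms:4.1f} ms -> {rate:7.0f} IOPS")
```

With these illustrative numbers, doubling the queue depth from 8 to 16 halves the completion rate, because the response time quadruples.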
25Performance Analysis and Tuning
- Read size of 8 KB with a variable queue depth setting
- Response time ranges from 10 ms to 65 ms
- The ideal queue depth for this system would be 8 with 8 KB I/O
26Performance Analysis and Tuning
- Queue depth of 4 with variable read size
- Throughput gains come at the expense of latency
- At 32 KB I/O, the throughput gain no longer keeps up with the latency increase
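One way to locate the point where throughput gain stops keeping up with latency is to compare growth factors between successive request sizes. This is a minimal sketch under stated assumptions: the sample values are invented to mirror the shape described on the slide, not taken from its chart, and `first_size_past_knee` is a hypothetical helper.

```python
# Hypothetical (request size in KB, throughput in MB/s, latency in ms) at a fixed queue depth.
samples = [
    (4, 32.8, 0.5),
    (8, 54.6, 0.6),
    (16, 81.9, 0.8),
    (32, 109.2, 1.2),
]

def first_size_past_knee(samples):
    """Return the first request size whose latency grew by a larger factor than throughput.

    Returns None if throughput growth keeps pace with latency growth for every step.
    """
    for (_, prev_tp, prev_lat), (size_kb, tp, lat) in zip(samples, samples[1:]):
        throughput_gain = tp / prev_tp
        latency_growth = lat / prev_lat
        if latency_growth > throughput_gain:
            return size_kb
    return None

print(first_size_past_knee(samples))
```

With these made-up samples the function returns 32: moving from 16 KB to 32 KB raises throughput by about a third while latency grows by half.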
27Good Performance Monitoring
- Does not focus on the irrelevant
- Alarms for known issues
- Unless there is an increasing pattern
28Good Performance Monitoring
- Does not focus on the irrelevant
- Alarms for known issues
- Unless there is an increasing pattern
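One possible reading of the bullets above is that a minor metric stays quiet unless it breaches a hard limit or shows a sustained increase. A minimal sketch of such a rule uses a least-squares slope over a window of samples. The sample counts, thresholds, and function names are assumptions for illustration only.

```python
def slope(values):
    """Least-squares slope of values against their index (units per sample).

    Requires at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den

def should_alarm(values, hard_limit, min_slope):
    """Alarm if the latest sample reaches the hard limit, or if the window trends upward."""
    if values[-1] >= hard_limit:
        return True
    return slope(values) >= min_slope

# Hypothetical per-minute counts of a minor link error.
steady = [3, 2, 4, 3, 2, 3, 4, 3]      # noisy but flat
growing = [1, 2, 2, 4, 5, 7, 8, 11]    # far below the limit, but climbing

print(should_alarm(steady, hard_limit=50, min_slope=0.5))
print(should_alarm(growing, hard_limit=50, min_slope=0.5))
```

The flat series stays silent while the climbing series raises an alarm long before it reaches the hard limit.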
29Effects of SAN performance monitoring
- Eliminate internal and vendor finger-pointing
- Receive advance warning of potential problems
- Reduce business risk
30Two recent customer case studies
Case 1: A SAN problem was the root cause of an application disruption
Case 2: A SAN problem was suspected as the root cause of an application disruption, but it was not the cause
31Case 1 company profile
- Large US insurance firm
- Broad offering of insurance and financial products
- 10,000 agents and employees
- Large Microsoft Exchange implementation
- Exchange data replicated to a remote site for backup and disaster recovery
32Case 1 customer crisis
- Exchange application slowed and became essentially unusable
- User complaints flooded IT
- Business operations adversely impacted
33Case 1 resolution efforts
- Exchange server event log: no problems
- Storage array log files: no problems
- Primary and secondary DR links tested: no problems
- Switch fabric manager: no problems
- Exchange throughput still low and pressure mounting, but no way to diagnose the problem. Elapsed time: 8 hours
34Case 1 Modern Performance monitoring
- Probed storage link: unusually high Exchange Completion Times proved the SAN was the problem
- Storage array response: good
- Remote replication acknowledgments: too long
- Solution: re-route the DR traffic through the secondary link. Exchange performance restored. Elapsed time: 30 minutes
- Cause and fix: the remote switch was busy dealing with an RSCN storm caused by a bad HBA in an unrelated application server at the remote site. Replaced the HBA
35Sync replication impact on production
Remote replication disabled
Remote replication enabled
36Case 1 summary
- Normal business operations were quickly restored
- Conclusive data prevented finger-pointing
- Without deep SAN performance monitoring/analysis, it would have taken extraordinary effort to get to the root cause and resolution
- If deep SAN monitoring/analysis had been in place, the problem would have been prevented
37Case 2 company profile
- Large UK financial services firm
- Assets of 540 billion
- Over 20 million customers
- Major UK mortgage and savings provider and credit card issuer
- Relies on Oracle databases for transaction processing systems
38Case 2 problem statement
- Sudden but intermittent slowdown of Oracle-based applications
- Widespread user complaints driving a high level of internal visibility
- Business operations adversely impacted
- SAN was assumed to be the problem
39Case 2 Modern Performance Monitoring
- Deep SAN monitoring/analysis solution already in place
- Quickly determined that all SAN parameters were within normal ranges; the problem was not within the SAN
- Trending report indicated the time of problem occurrence
- IT tracked it back to an application enhancement
- Elapsed time: <30 minutes
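The trending step can be sketched as a search for the first sample that departs from an earlier baseline. The slides do not say how the report was produced, so the samples, the baseline length, and the three-sigma threshold below are all assumptions for illustration, and `first_deviation` is a hypothetical helper.

```python
from statistics import mean, stdev

def first_deviation(samples, baseline_len, n_sigma=3.0):
    """Return the timestamp of the first sample above baseline mean + n_sigma * stdev.

    samples: list of (timestamp, value); the first `baseline_len` entries define the baseline.
    Returns None if no later sample exceeds the threshold.
    """
    baseline = [v for _, v in samples[:baseline_len]]
    threshold = mean(baseline) + n_sigma * stdev(baseline)
    for ts, value in samples[baseline_len:]:
        if value > threshold:
            return ts
    return None

# Hypothetical link traffic in MB/s, sampled every 15 minutes.
traffic = [
    ("09:00", 41), ("09:15", 43), ("09:30", 40), ("09:45", 42),
    ("10:00", 44), ("10:15", 42), ("10:30", 71), ("10:45", 75),
]

print(first_deviation(traffic, baseline_len=6))
```

A timestamp found this way can then be matched against change records, such as an application release, to narrow the search for the cause.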
40Case 2 Modern Performance Monitoring
41Case 2 Modern Performance Monitoring
Increased link traffic
42Case 2 summary
- Quickly identified that the SAN was not the root problem
- Identifying the exact time of problem manifestation helped identify the root cause: a poorly designed database query
- Quickly restored normal business operations
- Customer acknowledgement: without a deep SAN monitoring/analysis solution, it would have taken days and many unproductive efforts to resolve
43Where do you stand?
- Are your networks being planned with appropriate, timely information, or are they just happening?
- How are you monitoring performance? Do you know if your response times are degrading? Are your queue depth settings correct? How would you react to a brownout?
- What would the impact be to your business of response times that were 6000% longer than you are seeing now due to errors or congestion?
- Does your monitoring alert you to conditions that are irrelevant while not informing you of conditions that are likely to impact your business?
- Are you flying blind when it comes to the health and performance of your SAN?
44Thank You. Questions? Or, Contact us to
- Get a Finisar SAN assessment of your availability and performance needs
- Walk through detailed SAN diagnostic scenarios
- Schedule a web briefing for your organization
- Today's slides
- www.finisar.com/webcast/NW1006.php