Title: Network planning considering reliability aspects
1 Network planning considering reliability aspects
2 Fundamentals of reliability issues in network
planning
- Dimension networks above the exact specification
calculated from the demand parameters. Networks
always need some spare capacity.
- Plan for unpredictable situations!
- Reliability and availability dimensioning are
important parts of network dimensioning and
planning.
3 Organizations are increasingly reliant on
computer networks for business or
mission-critical applications. The scope and size
of these networks have expanded so rapidly over
the past two decades that considerable effort and
expense are now targeted at keeping network
resources available, sometimes 24 hours a day,
all year. Traditionally this area of network
design has been the preserve of large mainframe
sites and those sites requiring high levels of
protection (such as nuclear power plants).
However, the explosion of Web-based business
methods means that many more organizations are
now eager to maintain high availability in order
to minimize service losses.
4 If the network is poorly designed, and
insufficient attention is paid to providing
availability in core systems, users can
experience anything from slow response times to
complete loss of service (referred to as
downtime) for extended periods. The technical
issues in maintaining high availability are both
complex and subtle, and it is the network
designer's job to balance loss probability
against cost, providing guidance to senior
management on the likelihood of failures and
their impact on the business.
5 Networks are rarely static environments, and
budgets are finite. In practice, network designers
are required to make a range of pragmatic and
technical decisions that address, accept,
mitigate, or transfer the risks of failure, all
within the constraints of a budget. The designer
must also ensure that the solutions provided are
scalable, so that additional nodes, services, and
capacity can be added without major upheaval and
without adversely affecting existing users.
Downtime for truly business- and mission-critical
systems can equate to losses of millions of
dollars per minute; these organizations,
therefore, demand high-availability (HA) networks
and are often prepared to go to extraordinary
lengths to achieve them.
7 Failure knows no boundaries in a network design,
and the smallest component failure can
effectively bring down a whole business without
warning (e.g., a failed hard disk controller on
your core e-business server could stop all
transactions). For practical reasons
organizations are invariably broken down into
teams responsible for different aspects of IT
(desktop support, communications, applications,
database, cabling, etc.). When a problem occurs,
it is all too common for application staff to
blame the network and vice versa. To maintain
HA networks, different disciplines must work
together, both at the design phase and
subsequently. Good diagnostic, monitoring, and
management tools can also help.
8 Planning for failure
When designing a reliable data network, network
designers are well advised to keep the following
quotations in mind at all times:
"Anything that can go wrong, will go wrong."
(Murphy)
"Whatever can go wrong will go wrong at the worst
possible time and in the worst possible way . . ."
"Expect the unexpected." (Douglas Adams, The
Hitchhiker's Guide to the Galaxy)
9
- Failure refers to a situation where the observed
behavior of a system differs from its specified
behavior. A failure occurs because of an error,
caused by a fault. The time lapse between the
error occurring and the resulting failure is
called the error latency.
- Faults can be
  - hard (permanent) or
  - soft (transient).
- For example, a cable break is a hard failure,
whereas intermittent noise on the line is a soft
failure.
10 Planning for failure
Single Point of Failure (SPOF) indicates that a
system or network can be rendered inoperable, or
significantly impaired in operation, by the
failure of one single component. For example, a
single hard disk failure could bring down a
server; a single router failure could break all
connectivity for a network.
Multiple points of failure indicate that a system
or network can be rendered inoperable through a
chain or combination of failures (as few as two).
For example, failure of a single router, plus
failure of a backup modem link, could mean that
all connectivity is lost for a network. In
general it is much more expensive to cope with
multiple points of failure, and it is often
financially impractical.
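One way to look for single points of failure in a topology is to remove each node in turn and check whether the remaining nodes can still reach each other. The sketch below is a brute-force illustration of that idea, not a method taken from the slides; the topology and node names (`hostA`, `switch1`, `router1`, `wan`, and so on) are hypothetical, and links are assumed to be bidirectional.

```python
from collections import deque


def is_connected(adjacency: dict, removed: str) -> bool:
    """Check whether the nodes other than `removed` can all reach each other.

    Assumes every neighbour listed in `adjacency` is also a key of it.
    """
    remaining = [n for n in adjacency if n != removed]
    if len(remaining) <= 1:
        return True
    seen = {remaining[0]}
    queue = deque(seen)
    while queue:  # breadth-first search that never passes through `removed`
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(remaining)


def single_points_of_failure(adjacency: dict) -> list:
    """Return the nodes whose failure alone splits the rest of the network."""
    return sorted(n for n in adjacency if not is_connected(adjacency, n))


# Hypothetical topology: two hosts on one switch, two routers to the WAN.
topology = {
    "hostA": ["switch1"],
    "hostB": ["switch1"],
    "switch1": ["hostA", "hostB", "router1", "router2"],
    "router1": ["switch1", "wan"],
    "router2": ["switch1", "wan"],
    "wan": ["router1", "router2"],
}
spofs = single_points_of_failure(topology)
```

In this made-up topology the routers are duplicated, so only the shared switch is reported. The check covers node failures only; link failures and combinations of two or more failures would need a separate pass.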
11 Fault tolerance indicates that every component in
the chain supporting the system has redundant
features or is duplicated. A fault-tolerant
system will not fail because any one component
fails (i.e., it has no single point of failure).
The system should also provide recovery from
multiple failures. Components are often
overengineered or purposely underutilized to
ensure that while performance may be affected
during an outage, the system will perform within
predictable, acceptable bounds.
12 Fault resilience implies that at least one of the
modules or components within a system is backed
up with a spare (e.g., a power supply). This may
be in hot standby, cold standby, or load-sharing
mode. In contrast with fault-tolerant systems,
not all modules or components are necessarily
redundant (i.e., there may be several single
points of failure). For example, a
fault-resilient router may have multiple power
supplies but only one routing processor. By
definition, one fault-resilient component does
not make the entire system fault tolerant.
13 Disaster recovery is the process of identifying
all potential failures, their impact on the
system/network as a whole, and planning the means
to recover from such failures.
14 Calculating the true cost of downtime
Network designers are largely unfamiliar with financial
models. It is, however, imperative in designing
reliable networks that the designer gathers some
basic financial data in order to cost-justify and
direct suitable technical solutions. The data may
come from line managers or financial support
staff and may not be readily collated. Without
these data the scale of the problem is undefined,
and it will be hard to convince senior financial
and operational management that additional
features are necessary.
15 To illustrate the point, let us consider a
hypothetical consumer-oriented business (such as
an airline, car rental, vacation, or hotel
reservation call center). The call center is
required to be online 24 hours a day, 7 days a
week, 365 days a year. The business has 800 staff
involved in call handling (transactions), each
with an average burdened cost of $25 an hour
(i.e., the cost of providing a desk, heating,
lighting, phone, data point, etc.). There is a
small profit made on each transaction, plus a
large profit on any actual sale that can be
closed. We assume here that there are on average
three sales closed per hour.
16 Cost of Idle Staff is calculated as (Headcount ×
Burdened Cost × Downtime). Production Losses are
calculated as (Headcount × Transactions per Hour ×
Profit per Transaction × Downtime). Lost Sales
are calculated as (Headcount × Sales per Hour ×
Profit per Sale × Downtime).
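The three formulas can be tried out with a short calculation. The sketch below applies them to the call center example; the headcount, the burdened cost, and the three sales per hour come from the text (the sales rate is read as per person, since the formula multiplies it by headcount), while the transactions per hour, profit per transaction, and profit per sale are made-up illustrative values.

```python
from dataclasses import dataclass


@dataclass
class CallCenter:
    headcount: int                  # staff involved in call handling
    burdened_cost_per_hour: float   # average burdened cost per person
    transactions_per_hour: float    # per person (hypothetical value below)
    profit_per_transaction: float   # hypothetical value below
    sales_per_hour: float           # per person
    profit_per_sale: float          # hypothetical value below


def downtime_cost(site: CallCenter, downtime_hours: float) -> dict:
    """Apply the three loss formulas to an outage of the given length."""
    idle_staff = site.headcount * site.burdened_cost_per_hour * downtime_hours
    production = (site.headcount * site.transactions_per_hour
                  * site.profit_per_transaction * downtime_hours)
    lost_sales = (site.headcount * site.sales_per_hour
                  * site.profit_per_sale * downtime_hours)
    return {
        "idle_staff": idle_staff,
        "production_losses": production,
        "lost_sales": lost_sales,
        "total": idle_staff + production + lost_sales,
    }


# 800 staff, 25 per hour and 3 sales per hour are from the example;
# the other three figures are assumptions for illustration only.
example_site = CallCenter(
    headcount=800,
    burdened_cost_per_hour=25.0,
    transactions_per_hour=20.0,
    profit_per_transaction=0.5,
    sales_per_hour=3.0,
    profit_per_sale=10.0,
)
one_hour = downtime_cost(example_site, downtime_hours=1.0)
```

With these assumed figures a one-hour outage costs 20,000 in idle staff, 8,000 in production losses, and 24,000 in lost sales. Real inputs would have to come from line managers or financial support staff.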
19 Developing a disaster recovery plan
All networks are vulnerable to disruption.
Sometimes these disruptions may come from the
most unlikely sources. Natural events such as
flooding, fire, lightning strikes, earthquakes,
tidal waves, and hurricanes are all possible, as
well as fuel shortages, electricity strikes,
viruses, hackers, system failures, and software
bugs. History shows us that these events do
happen regularly. As recently as 1999 and 2000 we
saw seemingly impossible power shortages in
California threaten to cripple Silicon Valley, as
well as a combination of fuel shortages, train
safety issues, and massive flooding.
20 In fact, various studies indicate that the
majority of system failures can be attributed to
a relatively small set of events. These include,
in decreasing order of frequency, natural
disaster, power failure, systems failure,
sabotage/viruses, fire, and human error. There
is also a general consensus that companies that
take longer than a full business week to get back
online run a high risk of being forced out of
business entirely (some analysts state as high as
50 percent).
21 A general approach to the creation of a Disaster
Recovery (DR) plan
- Benchmark the current design: Perform a full risk
assessment for all key systems and the network as
a whole. Identify key threats to system and
network integrity. Analyze core business
requirements and identify core processes and
their dependence upon the network. Assign
monetary values to loss of service or systems.
- Define the requirements: Based on business needs,
determine an acceptable recovery window for each
system and the network as a whole. If practical,
specify a worst-case recovery window and a target
recovery window. Specify priorities for mission-
or business-critical systems.
22
- Define the technical solution: Determine the
technical response to these challenges by
evaluating alternative recovery models, and
select solutions that best meet the business
requirements. Ensure that a full cost analysis of
each solution is provided, together with the
recovery times anticipated under catastrophic
failure conditions and lesser degrees of failure.
- Develop the recovery strategy: Formulate a crisis
management plan identifying the processes to be
followed and the response of key personnel to
failure scenarios. Describe where automation and
manual intervention are required. Set priorities
to clearly identify the order in which systems
should be brought back online.
23
- Develop an implementation strategy: Determine how
new/additional technology is to be deployed and
over what time period. Document changes to the
existing design. Identify how new/additional
processes and responsibilities are to be
communicated.
- Develop a test program: Determine how business-
and mission-critical systems may be exercised and
what the expected results should be. Define
procedures for rectifying test failures. Run
tests to see if the strategy works; if not, make
refinements until satisfied.
- Implement continuous monitoring and
improvements: Once the disaster recovery plan is
established, hold regular reviews to ensure that
the plan stays synchronized as the network grows
or design features are modified.
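Two of the steps above, setting restore priorities and checking test results against recovery windows, lend themselves to a small data structure. The sketch below is one possible way to record them; the class, field names, systems, and hour values are all hypothetical and not part of the slides.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTarget:
    system: str
    priority: int            # 1 = bring back online first
    target_hours: float      # target recovery window
    worst_case_hours: float  # worst-case recovery window


def restore_order(targets):
    """Order systems by priority, then by the tighter target window."""
    ranked = sorted(targets, key=lambda t: (t.priority, t.target_hours))
    return [t.system for t in ranked]


def failed_tests(targets, measured_hours):
    """List systems whose measured recovery exceeded the worst-case window.

    A system with no measurement is treated as a failure.
    """
    return [
        t.system for t in targets
        if measured_hours.get(t.system, float("inf")) > t.worst_case_hours
    ]


# Hypothetical plan and the results of one recovery drill (in hours).
plan = [
    RecoveryTarget("billing", priority=2, target_hours=4, worst_case_hours=8),
    RecoveryTarget("core-router", priority=1, target_hours=1, worst_case_hours=2),
    RecoveryTarget("intranet", priority=3, target_hours=24, worst_case_hours=48),
]
drill = {"billing": 9.5, "core-router": 1.5}
order = restore_order(plan)
failures = failed_tests(plan, drill)
```

Here the drill shows `billing` exceeding its worst-case window and `intranet` not tested at all, so both would go back into the refinement loop described in the test program step.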
24 Disaster recovery models
25 Tape or CD site backup
Tape or CD-ROM backup and
restore are widely used DR methods for sites.
Traditionally, key data repositories and
configuration files are backed up nightly or
every other night. Backup media are transported
and securely stored at a different location. This
enables complete data recovery should the main
site systems be compromised. If the primary site
becomes inoperable, the plan is to ship the media
back, reboot, and resume normal operations.
Pros and Cons: This is a low-cost solution, but the
recovery window could range from a few hours to
several days; this may prove unacceptable for
many businesses. Media reliability may not be 100
percent and, depending upon the backup frequency,
valuable data may be lost.
26 Electronic vaulting
With remote electronic
vaulting, data are archived automatically to tape
or CD over the network to a secure remote site.
Electronic vaulting ideally requires a dedicated
network connection to support large or frequent
background data transfers; otherwise, archiving
must be performed during off-peak periods or
low-utilization periods (e.g., via a nightly
backup). Backup procedures can, however, be
optimized by archiving only incremental changes
since the last archive, reducing both traffic
levels and network unavailability.
Pros and Cons: The operating costs for electronic
vaulting can be up to four times those of simple
tape or CD backup; however, this approach can be
entirely automated. Unlike simple media backup,
there is no requirement to transport backup data
physically. Recovery still depends on the most
recent backup copy, but this is likely to be more
recent due to automation. Electronic vaulting is
more reliable and significantly decreases the
recovery window (typically, just a few hours).
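The effect of archiving only incremental changes can be illustrated with a few lines of code. The sketch below selects the files modified since the last archive and totals their size; the catalogue, file names, sizes, and timestamps are hypothetical, and a real vaulting product would track changes in its own way.

```python
from datetime import datetime


def incremental_set(files: dict, last_archive: datetime) -> tuple:
    """Pick files modified after the last archive and total their size.

    `files` maps a file name to a (modified_at, size_bytes) pair.
    """
    changed = sorted(
        name for name, (modified_at, _size) in files.items()
        if modified_at > last_archive
    )
    total_bytes = sum(files[name][1] for name in changed)
    return changed, total_bytes


# Hypothetical catalogue of a data repository.
catalogue = {
    "orders.db": (datetime(2024, 3, 2, 14, 0), 500_000_000),
    "archive_2023.db": (datetime(2024, 1, 5, 9, 30), 2_000_000_000),
    "customers.db": (datetime(2024, 3, 2, 23, 15), 300_000_000),
}
last_run = datetime(2024, 3, 1, 23, 0)
to_send, volume = incremental_set(catalogue, last_run)
full_volume = sum(size for _mtime, size in catalogue.values())
```

In this made-up catalogue the incremental transfer is 0.8 GB against 2.8 GB for a full archive, which is the kind of traffic reduction the slide refers to.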
27 Data replication/disk mirroring
Remote disk
mirroring provides faster recovery and less data
loss than remote electronic vaulting. Since data
are transferred to disk rather than tape,
performance impacts are minimized. With disk
mirroring you can maintain a complete replica
file system image at the backup site; all changes
made to production data are tracked and
automatically backed up. Data are typically
synchronized in the background, and when the
recovery site is initialized or when a failed
site comes back online, all data are
resynchronized from the replica to production
storage. Note that data may be available only in
read-only mode at the recovery site if the
original site fails (to ensure at least one copy
is protected), so services will recover but
applications that are required to update data may
be somewhat compromised unless some form of local
data cache is available until the primary storage
comes online. A disk mirroring solution should
ideally be able to use a variety of disks using
industry-standard interfaces (e.g., SCSI, Fibre
Channel, etc.).
28 Data replication/disk mirroring: Pros and Cons
Data replication is more expensive than the
previous two models, and for large sites
considerable traffic volumes can be generated.
Ideally, a private storage network should be
deployed to separate storage traffic from user
traffic. Although preferable, this requires
more maintenance than the earlier models.
29 Server mirroring and clustering
These techniques
can be used to significantly reduce the recovery
time to acceptable levels. Ideally, servers
should be running live and in parallel,
distributing load between them but located at
different physical locations. If
incremental changes are frequently synchronized
between servers, then backup could be a matter of
seconds, and only a few transactions may be lost
(assuming there isn't large-scale
telecommunications or power disruption and staff
are well briefed on what to do and what not to do
in such circumstances). The increasing focus on
electronic commerce and large-scale applications
such as ERP means that this configuration is
becoming increasingly common.
30 Server mirroring and clustering: Pros and Cons
This approach is widely used at data centers
for major financial and retail institutions but
is often too expensive to justify for small
businesses. Server mirroring requires more
infrastructure to achieve (high-speed wide area
links, more routers, more firewalls, and tight
management and control systems).
31 Storage Area Networks (SANs) and Optical Storage Networks (OSNs)
There is increasing interest in
moving mission- and business-critical data off
the main network and offloading it onto a
privately managed infrastructure called a Storage
Area Network (SAN). Storage can be optically
attached via standard high-speed interfaces such
as Fibre Channel and SCSI (with optical
extenders), providing a physical separation of
storage from 600 meters to 10 kilometers. Servers
are directly attached to this network (typically
via Fibre Channel or ESCON/FICON interfaces)
and are also attached to the main user network.
SANs may be further extended (to thousands of
kilometers) via technologies such as Dense Wavelength
Division Multiplexing (DWDM), forming optical
storage networks. This allows multiple sites to
share storage over reliable high-speed private
links.
32 Storage Area Networks (SANs) and Optical Storage Networks (OSNs): Pros and Cons
This approach is
an excellent model for disaster recovery and
storage optimization. It significantly increases
complexity and cost (though storage consolidation
may recover some of these costs), and it is,
therefore, appropriate only for major enterprises
at present. One big attraction for many large
enterprises is that the whole storage
infrastructure can be outsourced to a Storage
Service Provider (SSP). This facilitates a very
reliable DR model (some providers are currently
quoting four-nines (99.99 percent) availability).
33 Quantifying availability
- A = Operational Time / Total Time
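The availability ratio is easy to turn into yearly downtime figures. The sketch below computes availability as operational time over total time and converts an availability figure into hours of downtime per year; the function names and the sample operational hours are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(operational_hours: float, total_hours: float) -> float:
    """A = operational time / total time, as a fraction between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return operational_hours / total_hours


def annual_downtime_hours(avail: float) -> float:
    """Hours of downtime per year implied by an availability figure."""
    return (1.0 - avail) * HOURS_PER_YEAR


# A system that was up for 8,751.24 of the 8,760 hours in a year.
a = availability(operational_hours=8751.24, total_hours=HOURS_PER_YEAR)

# Downtime budget implied by four-nines (99.99 percent) availability.
four_nines_hours = annual_downtime_hours(0.9999)
```

The sample system works out to roughly 99.9 percent availability, and four-nines availability leaves a little under 0.9 hours (about 53 minutes) of downtime per year.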
35 Mean Time Between Service Outages (MTBSO) or Mean
Time Between Failure (MTBF) is the average time
(expressed in hours) that a system has been
working between service outages and is typically
greater than 2,000 hours. Since modern network
devices may have a short working life (typically
five years), MTBF is often a predicted value,
based on stress-testing systems and then
forecasting availability in the future. Devices
with moving mechanical parts such as disk drives
often exhibit lower MTBFs than systems that use
fixed components (e.g., flash memory).
36 Mean Time To Repair (MTTR) is the average time to
repair systems that have failed and is usually
several orders of magnitude less than MTBF. MTTR
values may vary markedly, depending upon the type
of system under repair and the nature of the
failure. Typical values range from 30 minutes
through to 3 or 4 hours. A typical MTTR for a
complex system with little inherent redundancy
might be several hours.
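MTBF and MTTR are commonly combined into an availability figure using the standard steady-state relation A = MTBF / (MTBF + MTTR). That relation is not spelled out in the transcript, so treat its use here as an assumption; the sketch below applies it to the typical values quoted above (an MTBF of 2,000 hours and MTTRs of 30 minutes and 4 hours).

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime share of each failure/repair cycle."""
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Typical values from the text: MTBF of 2,000 hours, MTTR of 0.5 to 4 hours.
fast_repair = availability_from_mtbf(mtbf_hours=2000.0, mttr_hours=0.5)
slow_repair = availability_from_mtbf(mtbf_hours=2000.0, mttr_hours=4.0)
```

With the same MTBF, cutting the repair time from 4 hours to 30 minutes moves availability from about 99.80 percent to about 99.98 percent, which is why repair time matters as much as failure rate.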
38 For a series system
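In a series system every component must work for the system to work, so under the usual assumption of independent failures the system availability is the product of the component availabilities. The slide content itself is not in the transcript, so the sketch below shows only this standard relation; the component values are hypothetical.

```python
from math import prod


def series_availability(component_availabilities) -> float:
    """Availability of components in series, assuming independent failures."""
    values = list(component_availabilities)
    if not values:
        raise ValueError("at least one component is required")
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("availabilities must lie between 0 and 1")
    return prod(values)


# Hypothetical chain: access link, router, server, each 99 percent available.
chain = series_availability([0.99, 0.99, 0.99])
```

Three components at 99 percent each give a chain that is only about 97 percent available: a series system is always less available than its weakest component.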