Dependability in the Internet Era

1
Dependability in the Internet Era

Jim Gray
Microsoft Research
High Dependability Computing Consortium
Conference
Santa Cruz, CA 7 May 2001
REVISED 13 Feb 2005 Stanford, CA

2
Outline

The glorious past (Availability Progress)
The dark ages (current scene)
Some recommendations

3
Preview: The Last 10 Years: Availability Dark Ages
Ready for a Renaissance?

Things got better, then things got a lot worse!

[Chart: Availability versus year, 1950 to 2010; availability axis marked in 9s up to 99.999%]
4
DEPENDABILITY: The 3 ITIES

RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
AVAILABILITY: Does it now. (also 1 >> MTTR)
Availability = MTTF / (MTTF + MTTR)
System Availability: If 90% of terminals up and 99% of DB up? (=> 89% of transactions are serviced on time).
Holistic vs. Reductionist view

[Diagram labels: Security, Integrity, Reliability, Availability]
5
Fail-Fast is Good, Repair is Needed
Lifecycle of a module: fail-fast gives short fault latency.
High Availability is low UN-Availability.
Unavailability ≈ MTTR / MTTF

Improving either MTTR or MTTF gives benefit.
Simple redundancy does not help much.
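
A minimal sketch of the arithmetic behind these two slides, assuming the standard steady-state definition of availability, MTTF / (MTTF + MTTR), and independent components in series. The 1,000-hour module is a made-up illustration; only the 90% and 99% figures come from the terminal-and-database example.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a module is down; close to MTTR / MTTF when MTTF >> MTTR."""
    return mttr_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every independent part must be up."""
    product = 1.0
    for part in parts:
        product *= part
    return product


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = unavailability(1000.0, 1.0)
halved_mttr = unavailability(1000.0, 0.5)    # repair twice as fast
doubled_mttf = unavailability(2000.0, 1.0)   # fail half as often

# The terminal-and-database example: 90% of terminals up, 99% of the DB up.
system = series_availability(0.90, 0.99)

print(f"base unavailability:       {base:.6f}")
print(f"with half the MTTR:        {halved_mttr:.6f}")
print(f"with double the MTTF:      {doubled_mttf:.6f}")
print(f"terminals x DB in series:  {system:.3f}")
```

Halving MTTR and doubling MTTF each cut unavailability roughly in half, which is the sense in which improving either one gives benefit.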

6
Fault Model

Failures are independent.
So, single fault tolerance is a big win.
Hardware fails fast (dead disk, blue-screen)
Software fails-fast (or goes to sleep)
Software often repaired by reboot
Heisenbugs
Operations tasks major source of outage
Utility operations
Software upgrades

7
Disks (raid) the BIG Success Story

Duplex or Parity masks faults
Disks @ 1M hours (100 years)
But controllers fail, and sites have 1,000s of disks.
Duplexing or parity, and dual path, gives "perfect disks".
Wal-Mart never lost a byte (thousands of
disks, hundreds of failures).
Only software/operations mistakes are left.

8
Fault Tolerance vs Disaster Tolerance

Fault Tolerance masks local faults
RAID disks
Uninterruptible Power Supplies
Cluster Failover
Disaster Tolerance masks site failures
Protects against fire, flood, sabotage, ...
Also software changes, site moves, ...
Redundant system and service at remote site.

9
Availability
[Chart: availability in 9s, rising from un-managed systems to a well-managed GeoPlex]
Un-managed
Well-managed nodes: mask some hardware failures.
Well-managed packs & clones: mask hardware failures and operations tasks (e.g. software upgrades); mask some software failures.
Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures.
10
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986 (trans. Eiichi Watanabe).

Share of outages by cause (pie chart):
Vendor                  42%
Tele Comm lines         12%
Environment             11.2%
Application Software    25%
Operations              9.3%

MTTF by cause:
Vendor (hardware and software)    5 Months
Application software              9 Months
Communications lines              1.5 Years
Operations                        2 Years
Environment                       2 Years
Overall                           10 Weeks

1,383 institutions reported (6/84 - 7/85)
7,517 outages, MTTF 10 weeks, avg duration 90 MINUTES
To Get 10 Year MTTF, Must Attack All These Areas
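
A rough check of how the per-cause MTTFs relate to the overall 10-week figure, assuming the causes fail independently so that their failure rates simply add. The months-to-weeks conversion (52/12) and the reading of "90 minutes" as the mean repair time are assumptions of this sketch, not statements from the survey.

```python
# MTTF per cause in months, as listed on the slide (1.5 years = 18, 2 years = 24).
mttf_months = {
    "vendor (hardware and software)": 5.0,
    "application software": 9.0,
    "communications lines": 18.0,
    "operations": 24.0,
    "environment": 24.0,
}

# With independent causes, failure rates (1 / MTTF) add up.
total_rate_per_month = sum(1.0 / months for months in mttf_months.values())
combined_mttf_months = 1.0 / total_rate_per_month

WEEKS_PER_MONTH = 52.0 / 12.0  # assumed conversion
combined_mttf_weeks = combined_mttf_months * WEEKS_PER_MONTH

# Availability implied by a 10-week MTTF and a 90-minute average outage.
mttf_hours = 10 * 7 * 24
mttr_hours = 1.5
survey_availability = mttf_hours / (mttf_hours + mttr_hours)

print(f"combined MTTF: {combined_mttf_months:.2f} months "
      f"(about {combined_mttf_weeks:.1f} weeks)")
print(f"implied availability: {survey_availability:.4%}")
```

The combined figure lands close to the reported 10 weeks, and it shows why the slide insists on attacking every area: removing any single cause still leaves the sum of the other rates.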

11
Case Studies - Tandem Trends

MTTF improved
Shift from Hardware & Maintenance (from 50% to 10%)
to Software (62%) & Operations (15%)
NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software

12
Dependability Status circa 1995

4-year MTTF
5 9s for well-managed sys. Fault Tolerance Works.
Hardware is GREAT (maintenance and MTTF).
Software masks most hardware faults.
Many hidden software outages in operations:
New Software.
Utilities.
Need to make all hardware/software changes ONLINE.
Software seems to define a 30-year MTTF ceiling.
Reasonable Goal: 100-year MTTF.
Class 4 today => class 6 tomorrow.

13
Honorable Mention

The nice folks at Tandem (now HP)
Made failover fast (30 seconds or less).
Made change online
Add hardware/software
Reorganize database.
Rolling upgrades.
Added at least one 9 to their story.

14
And Then?

Hardware got better (& more complex)
Software got better (& more complex)
RAID is standard, Snapshots becoming standard
Cluster in a box: commodity failover
Remote replication is standard.

15
Outline

The glorious past (Availability Progress)
The dark ages (current scene)
Some recommendations

16
Progress?

MTTF improved from 1950-1995
MTTR: incremental improvements from 1970 on (failover)
Hardware and Software online change (PnP) is now standard
Then the Internet arrived:
No project can take more than 3 months.
Time to market is everything
Change is good.

17
The Internet Changed Expectations

1990
Phones delivered 99.999%
ATMs delivered 99.99%
Failures were front-page news.
Few hackers
Outages last an hour

2005
Cell phones deliver 90%
Web sites deliver 99%
Failures are business-page news
Many hackers.
Outages last a day

This is progress?
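
To put the 1990 and 2005 figures side by side, here is a small sketch that converts an availability percentage into expected downtime per year. It assumes a 365-day year and treats each figure as a long-run average; the labels are just names for the four numbers on this slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected hours of outage per year at the given availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR


figures = {
    "phones, 1990": 99.999,
    "ATMs, 1990": 99.99,
    "web sites, 2005": 99.0,
    "cell phones, 2005": 90.0,
}

for name, percent in figures.items():
    hours = downtime_hours_per_year(percent)
    print(f"{name:>18}: {percent:7.3f}% up, about {hours:8.2f} hours down per year")
```

Five 9s is roughly five minutes of outage a year; two 9s is more than three and a half days.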
18
Eric Brewer said it best: ACID vs BASE, the internet litmus test
(copy of slide 8 of http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld008.htm)

BASE: Basic Availability, Soft State, Eventual Consistency
Availability FIRST
Weak consistency: stale data is OK
Approximate answers OK
Best effort
Aggressive (optimistic)
Easier Evolution.
Simpler!
Faster

ACID: Atomicity, Consistency, Isolation, Durability
Availability?
Strong consistency
Isolation
Focus on commit
Conservative (Pessimistic)
Difficult evolution (e.g. schema)
Nested transactions

I think it is a spectrum
19
Why (1) Complexity

Internet sites are MUCH more complex.
NAP
Firewall/proxy/IPsprayer
Web
DMZ
App server
DB server
Links to other sites
tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
Skill level is much reduced

20
A Data Center (500 servers)
21
A Schematic of HotMail

7,000 servers
100 backend stores with 300TB (cooked)
many data centers
Links to
Internet Mail gateways
Ad-rotator
Passport
5 B messages per day
350M mailboxes, 250M active
1M new per day.
New software every 3 months (small changes weekly).

[Schematic labels: Internet, Local Directors, Front Doors, Member Directory, Login Servers, Graphics Servers, AD Servers, Incoming Mail Servers, gateways, USTORES, Data stores, MSERVS, Switched Ethernet, Telnet Management]
22
Why (2) Velocity

No project can take more than 13 weeks.
Time to market is everything
Functionality is everything
Faster, cheaper,

23
Why (3) Hackers

Hackers are a new & increased threat
Any site can be attacked from anywhere
Motives include ego, malice, and greed.
Complexity makes it hard to protect sites.
Whole-internet attacks: Slammer
Concentration of wealth makes an attractive target
Reporter: Why did you rob banks?
Willie Sutton: 'Cause that's where the money is!

Note: Eric Raymond's "How to Become a Hacker" (http://www.tuxedo.org/esr/faqs/hacker-howto.html) is the positive use of "hacker"; here I mean malicious and anti-social hackers. Black-hats, not white-hats.
24
How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.
http://www.internettrafficreport.com/main.htm
25
How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/

Median monthly ping packet loss for 2/99

26
And in 2006, about the same
27
Or In the US
28
Keynote measures Response Time and Up Time
Measures response time around the world.
Business service is better than popular service.
Has many proprietary services for SLAs.

                 Week of April 22 - April 28, 2001    Previous Week
Index            15.90                                15.78

Web Sites with Best Performance Averages
                 Ameritrade (65)    3.29              Ameritrade (64)    3.35
                 Lycos (81)         5.41              Lycos (80)         5.58
                 Yahoo! (81)        5.79              Yahoo! (80)        5.74
                 Altavista (19)     6.03              Ask Jeeves (7)     6.11
                 Go.com             7.02              Altavista (18)     6.17

Worst Average    (anonymous)       38.04              (anonymous)       37.44
29
2006: typical 97.48% Availability
30
Netcraft's Crisis-of-the-Day
32
Service Level Measurements

Many organizations are measured on SLAs
Example: 1 sec response 99% of prime time
Keynote, Netcraft, ... offer to monitor your site (probe every few min)
This probing can go deep into the tree to detect services.
Send alerts via email
Give monthly reports.
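
One way an SLA such as "1 sec response 99% of prime time" could be checked from probe data, as a hedged sketch. The function names, the sample lists, and the choice to count a failed probe (None) as a miss are all assumptions made for illustration; they do not describe how Keynote or Netcraft actually score a site.

```python
from typing import Optional, Sequence


def fraction_within_limit(samples: Sequence[Optional[float]],
                          limit_sec: float = 1.0) -> float:
    """Share of probes answered within limit_sec; None means the probe failed."""
    if not samples:
        raise ValueError("need at least one probe sample")
    good = sum(1 for s in samples if s is not None and s <= limit_sec)
    return good / len(samples)


def sla_met(samples: Sequence[Optional[float]],
            limit_sec: float = 1.0,
            target: float = 0.99) -> bool:
    """True when at least `target` of the probes met the response-time limit."""
    return fraction_within_limit(samples, limit_sec) >= target


# Hypothetical prime-time probe results (seconds), 1,000 probes each.
good_month = [0.4] * 992 + [2.5] * 5 + [None] * 3
bad_month = [0.4] * 980 + [2.5] * 15 + [None] * 5

print("good month:", fraction_within_limit(good_month), sla_met(good_month))
print("bad month: ", fraction_within_limit(bad_month), sla_met(bad_month))
```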

33
In addition

Most large sites build their own instrumentation
(several times ?)
This instrumentation is elaborate and essential
for the Network Operations Center (NOC).
There are attempts now to systematize it: Tivoli, OpenView, NetIQ, WhatsUp, MOM, ...

34
Microsoft.Com

Operations mis-configured a router
Took a day to diagnose and repair.
DOS attacks cost a fraction of a day.
Regular security patches.

35
Back-End Servers are More Stable

Generally deliver 99.99%
TerraServer, for example: single back-end failed after 2.5 y.
Went to 4-node cluster.
Fails every 2 mo.
Transparent failover in 30 sec.
Online software upgrades.
So 99.999% in backend
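
A quick check of the five-9s claim for the clustered back-end, assuming the only downtime is the 30-second failover after each failure and taking "2 months" as 60 days. Both are simplifying assumptions of this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide: one failure about every 2 months, 30 sec to fail over.
time_between_failures_sec = 60 * SECONDS_PER_DAY  # assumes 2 months = 60 days
failover_sec = 30

cluster_availability = time_between_failures_sec / (
    time_between_failures_sec + failover_sec
)
downtime_sec_per_year = (1.0 - cluster_availability) * 365 * SECONDS_PER_DAY

print(f"availability: {cluster_availability:.6%}")
print(f"downtime:     about {downtime_sec_per_year:.0f} seconds per year")
```

Under these assumptions the cluster is down for about three minutes a year, which is consistent with five 9s.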

36
eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml

Publishes operations log.
Has 99% of scheduled uptime
Schedules about 2 hours/week down.
Has had some operations outages
Has had some DOS problems.
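
A sketch of what these figures imply for round-the-clock availability. It assumes the 99% figure means the fraction of scheduled hours actually delivered and that the 2 hours of planned downtime occur every week; the slide itself does not combine the two numbers.

```python
HOURS_PER_WEEK = 7 * 24  # 168

scheduled_down_hours = 2.0       # planned maintenance per week (from the slide)
uptime_within_schedule = 0.99    # assumed meaning of "99% of scheduled uptime"

scheduled_fraction = (HOURS_PER_WEEK - scheduled_down_hours) / HOURS_PER_WEEK
overall_availability = scheduled_fraction * uptime_within_schedule

print(f"scheduled to be up: {scheduled_fraction:.2%} of the week")
print(f"overall:            {overall_availability:.2%} of the week")
```

Counting planned maintenance as downtime pulls a 99% site below 98%.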

37
And 2006.
http://www2.ebay.com/aw/announce.shtml

Welcome to eBay's System Board. Visit this board for information on scheduled site maintenance or system issues that are affecting Marketplace trading. For general eBay news, please see our General Announcements Board.

Resolved - PayPal site slowness
February 08, 2006 05:20PM PST/PT
For several hours today, members may have experienced slowness while trying to access the PayPal website. This issue has now been resolved. Thank you for your patience.

PayPal site slowness
February 08, 2006 02:38PM PST/PT
Members may be experiencing intermittent slowness while trying to access the PayPal website. We're aware of this issue and are working to fix it as quickly as possible. Thank you for your patience.

Scheduled Maintenance For This Week
February 08, 2006 02:03PM PST/PT
The eBay system will be undergoing general maintenance from approximately 23:00 PT on Thursday, February 9th to 01:00 PT on Friday, February 10th. During this maintenance period, certain eBay site features may be intermittently unavailable or slow.

38
Some Cool New Things

There are 100,000-node services.
Google File System shows the importance & benefit of Triplex.
DB replication & mirroring works (is easy).
Little things I have done:
With Leslie Lamport, unified Paxos & 2PC.
Measured mean-time-to-data-loss (and continue to measure things).

39
Outline

The glorious past (Availability Progress)
The dark ages (current scene)
Some recommendations

40
Not to throw stones but

Everyone has a serious problem.
The BEST people publish their stats.
The others HIDE their stats (check Netcraft
to see who I mean).
We have good NODE-level availability: 5-9s is reasonable.
We have TERRIBLE system-level availability: 2-9s scheduled is the goal (!).

41
Gresham's Law: bad money drives out good

People WANT features!
People WANT convenience!
People WANT cheap!
In exchange, they seem to be willing to tolerate some:
Un-availability (inconvenience)
Dirty data that needs reconciliation
Insecurity
I see it as our task to make it easier & cheaper to get high availability and Security.

42
Recommendation 1

Continue progress on back-ends.
Make management easier (AUTOMATE IT!!!)
Measure
Compare best practices
Continue to look for better algorithms.
Live in fear
We are at 10,000 node servers
We are headed for 1,000,000 node servers

43
Recommendation 2

Current security approach is unworkable
Anonymous clients
Firewall is clueless
Incredible complexity
We can't win this game!
So change the rules (redefine the problem)
No anonymity
Unified authentication/authorization model
Single-function devices (with simple interfaces)
Only one kind of interface (uddi/wsdl/soap/...).

44
Recommendation 3

Dependability requires a holistic, not reductionist, approach.
It's the WHOLE system (end-to-end, top-to-bottom).
Hard to publish in this area, hard to get tenure.
Journals want theorem-proof and crisp statements.
Companies want to make money, so do not share their knowledge.
Dependability is an important social good,
so Dependability Research needs government or philanthropic sponsorship.

45
References

Adams, E. (1984). Optimizing Preventative Service of Software Products. IBM Journal of Research and Development. 28(1): 2-14.
Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
Gray, J. (1990). A Census of Tandem System Availability between 1985 and 1990. IEEE Transactions on Reliability. 39(4): 409-418.
Gray, J. N., Reuter, A. (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
Long, D. D., J. L. Carroll, and C. J. Park. (1991). A Study of the Reliability of Internet Sites. Proc. 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
Siewiorek, D. and R. Swarz. Theory and Practice of Reliable System Design.
Birman, K. P. Building Secure and Reliable Network Applications.
Long, D., A. Muir, and R. Golding. (1995). A Longitudinal Study of Internet Host Reliability. Proc. of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, pp. 2-9.
http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
http://www2.ebay.com/aw/announce.shtml#top eBay is an excellent benchmark of best Internet practices.
van Ingen, C. Empirical Measurements of Disk Failure Rates and Error Rates (moving 2P with cheap iron).
Lamport, L. Consensus on Transaction Commit (unifies 2PC and Byzantine-Paxos).