Title: Dependability in the Internet Era
1Dependability in the Internet Era
- Jim Gray
- Microsoft Research
- High Dependability Computing Consortium Conference, Santa Cruz, CA, 7 May 2001
- REVISED 13 Feb 2005 Stanford, CA
2Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
3Preview: The Last 10 Years, Availability Dark Ages
Ready for a Renaissance?
- Things got better, then things got a lot worse!
[Chart: availability (9% up to 99.999%) plotted against year (1950 to 2010)]
4DEPENDABILITY The 3 ITIES
- RELIABILITY / INTEGRITY: Does the right thing. (also MTTF >> 1)
- AVAILABILITY: Does it now. (also 1 >> MTTR)
- Availability = MTTF / (MTTF + MTTR)
- System Availability: if 90% of terminals are up and 99% of the DB is up? (=> 89% of transactions are serviced on time).
- Holistic vs. Reductionist view
[Figure labels: Security, Integrity, Reliability, Availability]
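The arithmetic behind the 90% x 99% example can be sketched as follows. This is a minimal illustration, assuming the terminals and the DB fail independently and that a transaction needs both; the function names are illustrative and not from the talk.

```python
def availability(mttf: float, mttr: float) -> float:
    """Fraction of time a module is up: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def series_availability(*parts: float) -> float:
    """Availability of a chain that needs every part up (independence assumed)."""
    result = 1.0
    for part in parts:
        result *= part
    return result


terminals_up = 0.90
db_up = 0.99
system = series_availability(terminals_up, db_up)
print(f"{system:.3f}")  # 0.891
```

The product is always lower than the weakest part, which is one way to read the holistic view: the end-to-end number is what users see.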
5Fail-Fast is Good, Repair is Needed
- Lifecycle of a module: fail-fast gives short fault latency
- High Availability is low UN-Availability
- Unavailability ≈ MTTR / MTTF
- Improving either MTTR or MTTF gives benefit
- Simple redundancy does not help much.
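A small sketch of why improving either MTTR or MTTF helps equally, using the approximation Unavailability ≈ MTTR / MTTF (valid when MTTR is much smaller than MTTF). The numbers are made up for illustration.

```python
def approx_unavailability(mttf_hours: float, mttr_hours: float) -> float:
    """Approximate fraction of time down, assuming MTTR << MTTF."""
    return mttr_hours / mttf_hours


# Hypothetical module: fails every 1,000 hours, takes 1 hour to repair.
base = approx_unavailability(1000.0, 1.0)
faster_repair = approx_unavailability(1000.0, 0.5)  # halve MTTR
longer_life = approx_unavailability(2000.0, 1.0)    # double MTTF

print(base, faster_repair, longer_life)
```

Halving the repair time and doubling the time between failures both cut unavailability in half, so the cheaper of the two is usually the one to pursue first.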
6Fault Model
- Failures are independent. So, single fault tolerance is a big win
- Hardware fails fast (dead disk, blue-screen)
- Software fails-fast (or goes to sleep)
- Software often repaired by reboot
- Heisenbugs
- Operations tasks: major source of outage
- Utility operations
- Software upgrades
7Disks (RAID): the BIG Success Story
- Duplex or Parity masks faults
- Disks @ 1M hours (100 years)
- But
- controllers fail and
- have 1,000s of disks.
- Duplexing or parity, and dual path gives perfect disks
- Wal-Mart never lost a byte (thousands of disks, hundreds of failures).
- Only software/operations mistakes are left.
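A back-of-envelope sketch of the disk numbers above. It assumes independent failures and uses the standard approximation MTTF^2 / (2 x MTTR) for the time until both disks of a mirrored pair are down at once; the 24-hour repair time is an assumption for illustration, not a figure from the talk.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760


def expected_failures_per_year(n_disks: int, disk_mttf_hours: float) -> float:
    """Expected disk failures per year in a farm of n_disks."""
    return n_disks * HOURS_PER_YEAR / disk_mttf_hours


def mirrored_pair_mttf_hours(disk_mttf_hours: float, repair_hours: float) -> float:
    """Approximate mean time until both disks of a duplexed pair are down together."""
    return disk_mttf_hours ** 2 / (2 * repair_hours)


failures = expected_failures_per_year(1000, 1_000_000)
pair_years = mirrored_pair_mttf_hours(1_000_000, 24) / HOURS_PER_YEAR

print(f"1,000 disks: about {failures:.1f} failures per year")
print(f"mirrored pair: about {pair_years:,.0f} years between double failures")
```

With thousands of disks, single failures are routine events, yet a duplexed pair almost never loses both copies before the repair completes. That is consistent with "hundreds of failures" and "never lost a byte".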
8Fault Tolerance vs Disaster Tolerance
- Fault Tolerance: masks local faults
- RAID disks
- Uninterruptible Power Supplies
- Cluster Failover
- Disaster Tolerance: masks site failures
- Protects against fire, flood, sabotage, ...
- Also, software changes, site moves, ...
- Redundant system and service at remote site.
9Availability
[Figure: availability scale marked in 9s]
- Un-managed
- Well-managed nodes: masks some hardware failures
- Well-managed packs & clones: masks hardware failures and operations tasks (e.g. software upgrades); masks some software failures
- Well-managed GeoPlex: masks site failures (power, network, fire, move, ...); masks some operations failures
10Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans. Eiichi Watanabe).
- Vendor (hardware and software): 42% of outages, MTTF 5 months
- Application software: 25% of outages, MTTF 9 months
- Communications lines: 12% of outages, MTTF 1.5 years
- Operations: 9.3% of outages, MTTF 2 years
- Environment: 11.2% of outages, MTTF 2 years
- Overall: MTTF 10 weeks
- 1,383 institutions reported (6/84 - 7/85)
- 7,517 outages, MTTF about 10 weeks, avg duration about 90 MINUTES
- To Get 10 Year MTTF, Must Attack All These Areas
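The per-cause MTTFs combine like failure rates: the rates add, so the overall MTTF is the reciprocal of the summed reciprocals. A minimal sketch, assuming independent causes and using the rounded per-cause figures from this slide:

```python
# Per-cause MTTF in years, as rounded on the slide.
mttf_years = {
    "vendor": 5 / 12,
    "application software": 9 / 12,
    "communications lines": 1.5,
    "operations": 2.0,
    "environment": 2.0,
}


def combined_mttf(mttfs) -> float:
    """Overall MTTF when any one cause can take the system down."""
    return 1.0 / sum(1.0 / m for m in mttfs)


overall_weeks = combined_mttf(mttf_years.values()) * 52
without_vendor_weeks = combined_mttf(
    m for cause, m in mttf_years.items() if cause != "vendor"
) * 52

print(f"all causes: about {overall_weeks:.1f} weeks")
print(f"vendor faults eliminated: about {without_vendor_weeks:.1f} weeks")
```

The combined figure lands close to the reported 10 weeks. Eliminating the largest cause entirely still leaves the MTTF far short of 10 years, which is the point of the last bullet.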
11Case Studies - Tandem Trends
- MTTF improved
- Shift from Hardware & Maintenance (from 50% to 10% of outages)
- to Software (62%) & Operations (15%)
- NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software
12Dependability Status circa 1995
- 4-year MTTF
- 5 9s for well-managed sys. Fault Tolerance Works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations
- New Software.
- Utilities.
- Need to make all hardware/software changes ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable Goal: 100-year MTTF.
- class 4 today => class 6 tomorrow.
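To make the "9s" concrete, here is a small sketch that converts a count of nines into allowed downtime per year. It assumes "class n" means n nines of availability (for example, class 5 is 99.999%); that reading is an assumption, and the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for an availability of `nines` nines."""
    return MINUTES_PER_YEAR * 10.0 ** -nines


for n in range(2, 7):
    print(f"class {n}: {downtime_minutes_per_year(n):10.2f} minutes/year")
```

Each added 9 cuts the downtime budget by a factor of ten, from days per year at 2 nines to about five minutes at 5 nines.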
13Honorable Mention
- The nice folks at Tandem (now HP)
- Made failover fast (30 seconds or less).
- Made change online
- Add hardware/software
- Reorganize database.
- Rolling upgrades.
- Added at least one 9 to their story.
14And Then?
- Hardware got better (& more complex)
- Software got better (& more complex)
- RAID is standard, Snapshots becoming standard
- Cluster in a box: commodity failover
- Remote replication is standard.
15Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
16Progress?
- MTTF improved from 1950-1995
- MTTR: incremental improvements since 1970 (failover)
- Hardware and Software online change (pNp) is now standard
- Then the Internet arrived
- No project can take more than 3 months.
- Time to market is everything
- Change is good.
17The Internet Changed Expectations
- 1990
- Phones delivered 99.999%
- ATMs delivered 99.99%
- Failures were front-page news.
- Few hackers
- Outages last an hour
- 2005
- Cell phones deliver 90%
- Web sites deliver 99%
- Failures are business-page news
- Many hackers.
- Outages last a day
This is progress?
18Eric Brewer said it best: ACID vs BASE, the internet litmus test
(copy of slide 8 of http://www.ccs.neu.edu/groups/IEEE/ind-acad/brewer/sld008.htm)
- BASE: Basic Availability, Soft State, Eventual Consistency
- Availability FIRST
- Weak consistency: stale data is OK
- Approximate answers OK
- Best effort
- Aggressive (optimistic)
- Easier evolution
- Simpler!
- Faster
- ACID: Atomicity, Consistency, Isolation, Durability
- Availability?
- Strong consistency
- Isolation
- Focus on commit
- Conservative (pessimistic)
- Difficult evolution (e.g. schema)
- Nested transactions
I think it is a spectrum
19Why (1) Complexity
- Internet sites are MUCH more complex.
- NAP
- Firewall/proxy/IPsprayer
- Web
- DMZ
- App server
- DB server
- Links to other sites
- tcp/http/html/dhtml/dom/xml/com/corba/cgi/sql/fs/os
- Skill level is much reduced
20A Data Center (500 servers)
21A Schematic of HotMail
- 7,000 servers
- 100 backend stores with 300TB (cooked)
- many data centers
- Links to
- Internet Mail gateways
- Ad-rotator
- Passport
-
- 5 B messages per day
- 350M mailboxes, 250M active
- 1M new per day.
- New software every 3 months (small changes weekly).
[Diagram: HotMail schematic. Recoverable labels: Internet, Switched Ethernet, Local Director, MSERVS, Front Doors, Member Directory, Login, Graphics Servers, Incoming Mail Servers, AD Servers, USTORES, Data, gateway, Telnet Management]
22Why (2) Velocity
- No project can take more than 13 weeks.
- Time to market is everything
- Functionality is everything
- Faster, cheaper,
23Why (3) Hackers
- Hackers are a new & increased threat
- Any site can be attacked from anywhere
- Motives include ego, malice, and greed.
- Complexity makes it hard to protect sites.
- Whole-internet attacks: Slammer
- Concentration of wealth makes attractive target
- Reporter: Why did you rob banks?
- Willie Sutton: 'Cause that's where the money is!
Note: Eric Raymond's How to Become a Hacker (http://www.tuxedo.org/esr/faqs/hacker-howto.html) is the positive use of "Hacker"; here I mean malicious and anti-social hackers. Black-hats, not white-hats.
24How Bad Is It?
http://www-iepm.slac.stanford.edu/
Connectivity is poor.
http://www.internettrafficreport.com/main.htm
25How Bad Is It?
http://www-iepm.slac.stanford.edu/pinger/
- Median monthly ping packet loss for 2/99
26 And in 2006, about the same
27Or In the US
28Keynote measures Response Time and Up Time
Measures response time around the world.
Business service is better than popular service.
Has many proprietary services for SLAs.

                          Week of April 22 - April 28, 2001    Previous Week
Index                     15.90                                15.78
Web Sites with Best       Ameritrade (65)   3.29               Ameritrade (64)   3.35
Performance Averages      Lycos (81)        5.41               Lycos (80)        5.58
                          Yahoo! (81)       5.79               Yahoo! (80)       5.74
                          Altavista (19)    6.03               Ask Jeeves (7)    6.11
                          Go.com            7.02               Altavista (18)    6.17
Worst Average             (anonymous)       38.04              (anonymous)       37.44
29 2006 typical: 97.48% Availability
30Netcraft's Crisis-of-the-Day
32Service Level Measurements
- Many organizations are measured on SLAs
- Example: 1 sec response 99% of prime time
- Keynote, Netcraft, ...
- offer to monitor your site (probe every few min)
- This probing can go deep into the tree to detect services.
- Send alerts via email
- Give monthly reports.
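A rough sketch of how an SLA such as "1 sec response 99% of prime time" could be checked from periodic probe results. The data layout (a list of response times in seconds, with None for a probe that got no answer) and the function names are assumptions for illustration, not how Keynote or Netcraft actually report.

```python
from typing import Optional, Sequence


def fraction_within(samples: Sequence[Optional[float]], limit_sec: float) -> float:
    """Fraction of probes answered within limit_sec; None counts as a miss."""
    if not samples:
        raise ValueError("no probe samples")
    ok = sum(1 for t in samples if t is not None and t <= limit_sec)
    return ok / len(samples)


def sla_met(samples: Sequence[Optional[float]],
            limit_sec: float = 1.0,
            target_fraction: float = 0.99) -> bool:
    """True if at least target_fraction of probes met the response limit."""
    return fraction_within(samples, limit_sec) >= target_fraction


# Hypothetical prime-time probes: 98 fast, one slow, one unanswered.
probes = [0.2] * 98 + [3.0, None]
print(fraction_within(probes, 1.0), sla_met(probes))
```

Counting unanswered probes as misses folds up-time and response time into a single number, which is how a user experiences the service.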
33In addition
- Most large sites build their own instrumentation (several times ?)
- This instrumentation is elaborate and essential for the Network Operations Center (NOC).
- There are attempts now to systematize it: Tivoli, OpenView, NetIQ, WhatsUP, Mom, ...
34Microsoft.Com
- Operations mis-configured a router
- Took a day to diagnose and repair.
- DOS attacks cost a fraction of a day.
- Regular security patches.
35Back-End Servers are More Stable
- Generally deliver 99.99%
- TerraServer for example: single back-end failed after 2.5 y.
- Went to 4-node cluster
- Fails every 2 mo.
- Transparent failover in 30 sec.
- Online software upgrades
- So 99.999% in backend
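The 99.999% claim can be checked against the slide's own figures. A quick sketch, assuming "every 2 mo." means about 60 days between failures and that the 30-second failover is the only downtime:

```python
SECONDS_PER_DAY = 86_400

uptime_between_failures = 60 * SECONDS_PER_DAY  # "every 2 mo.", taken as 60 days
failover_seconds = 30

availability = uptime_between_failures / (uptime_between_failures + failover_seconds)
print(f"{availability:.6%}")
```

The result is a little better than five nines, so fast failover alone is enough to support the figure as long as upgrades really are online.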
36eBay: A very honest site
http://www2.ebay.com/aw/announce.shtml
- Publishes operations log.
- Has 99% of scheduled uptime
- Schedules about 2 hours/week down.
- Has had some operations outages
- Has had some DOS problems.
37And 2006.
http://www2.ebay.com/aw/announce.shtml
- Welcome to eBay's System Board. Visit this board for information on scheduled site maintenance or system issues that are affecting Marketplace trading. For general eBay news, please see our General Announcements Board.
- Resolved - PayPal site slowness
- February 08, 2006 05:20 PM PST/PT: For several hours today, members may have experienced slowness while trying to access the PayPal website. This issue has now been resolved. Thank you for your patience.
- PayPal site slowness
- February 08, 2006 02:38 PM PST/PT: Members may be experiencing intermittent slowness while trying to access the PayPal website. We're aware of this issue and are working to fix it as quickly as possible. Thank you for your patience.
- Scheduled Maintenance For This Week
- February 08, 2006 02:03 PM PST/PT: The eBay system will be undergoing general maintenance from approximately 23:00 PT on Thursday, February 9th to 01:00 PT on Friday, February 10th. During this maintenance period, certain eBay site features may be intermittently unavailable or slow.
38Some Cool New Things
- There are 100,000 node services.
- Google File System shows importance & benefit of Triplex
- DB replication & mirroring works (is easy)
- little things I have done
- With Leslie Lamport unified Paxos & 2PC
- Measured mean-time-to-data-loss (and continue to measure things).
39Outline
- The glorious past (Availability Progress)
- The dark ages (current scene)
- Some recommendations
40Not to throw stones but
- Everyone has a serious problem.
- The BEST people publish their stats.
- The others HIDE their stats (check Netcraft to see who I mean).
- We have good NODE-level availability: 5-9s is reasonable.
- We have TERRIBLE system-level availability: 2-9s scheduled is the goal (!).
41Gresham's Law: bad money drives out good
- People WANT features!
- People WANT convenience!
- People WANT cheap!
- In exchange, they seem to be willing to tolerate some
- Un-availability (inconvenience)
- Dirty data that needs reconciliation
- Insecurity
- I see it as our task to make it easier & cheaper to get high availability and security.
42Recommendation 1
- Continue progress on back-ends.
- Make management easier (AUTOMATE IT!!!)
- Measure
- Compare best practices
- Continue to look for better algorithms.
- Live in fear
- We are at 10,000 node servers
- We are headed for 1,000,000 node servers
43Recommendation 2
- Current security approach is unworkable
- Anonymous clients
- Firewall is clueless
- Incredible complexity
- We can't win this game!
- So change the rules (redefine the problem)
- No anonymity
- Unified authentication/authorization model
- Single-function devices (with simple interfaces)
- Only one kind of interface (uddi/wsdl/soap/).
44Recommendation 3
- Dependability requires holistic not reductionist approach.
- It's the WHOLE system (end-to-end, top-to-bottom)
- Hard to publish in this area, hard to get tenure.
- Journals want theorem+proof and crisp statements.
- Companies want to make money, so do not share their knowledge.
- Dependability is an important social good,
- So, it (Dependability Research) needs government or philanthropic sponsorship
45References
- Adams, E. (1984). Optimizing Preventative Service of Software Products. IBM Journal of Research and Development. 28(1): 2-14.
- Anderson, T. and B. Randell. (1979). Computing Systems Reliability.
- Garcia-Molina, H. and C. A. Polyzois. (1990). Issues in Disaster Recovery. 35th IEEE Compcon 90. 573-577.
- Gray, J. (1986). Why Do Computers Stop and What Can We Do About It. 5th Symposium on Reliability in Distributed Software and Database Systems. 3-12.
- Gray, J. (1990). A Census of Tandem System Availability between 1985 and 1990. IEEE Transactions on Reliability. 39(4): 409-418.
- Gray, J. N., Reuter, A. (1993). Transaction Processing: Concepts and Techniques. San Mateo, Morgan Kaufmann.
- Lampson, B. W. (1981). Atomic Transactions. Distributed Systems -- Architecture and Implementation: An Advanced Course. ACM, Springer-Verlag.
- Laprie, J. C. (1985). Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th FTCS. 2-11.
- Long, D.D., J. L. Carroll, and C.J. Park (1991). A study of the reliability of Internet sites. Proc 10th Symposium on Reliable Distributed Systems, pp. 177-186, Pisa, September 1991.
- Theory and Practice of Reliable System Design, Dan Siewiorek, Robert Swarz
- Building Secure and Reliable Network Applications, Ken P. Birman
- Darrell Long, Andrew Muir and Richard Golding, "A Longitudinal Study of Internet Host Reliability," Proc of the Symposium on Reliable Distributed Systems, Bad Neuenahr, Germany, IEEE, 1995, p. 2-9
- http://www.netcraft.com/ They have even better for-fee data as well, but for-free is really excellent.
- http://www2.ebay.com/aw/announce.shtmltop eBay is an excellent benchmark of best Internet practices
- Empirical Measurements of Disk Failure Rates and Error Rates, C. van Ingen (moving 2P with cheap iron)
- Consensus on Transaction Commit, L. Lamport (unifies 2PC and Byzantine-Paxos)