Title: The E-risks of E-commerce
1 The E-risks of E-commerce
- Professor Ken Birman
- Dept. of Computer Science
- Cornell University
2 Reliability
- If it stops ticking when it takes a licking, your e-commerce company could tank
- So you need to know that your technology base is reliable
- It does what it should do, does it when needed, does it correctly, and is accessible to your customers
3 A Quiz
- Q: When and why did Sun Microsystems have a losing quarter?
- A (email reply to Ken Birman): "Mr. Birman, Sun experienced a loss in Q4 FY89 (June 1989). This was the quarter in which we transitioned to new manufacturing, order processing, and inventory control systems."
  Andrew Casey, Manager, Investor Relations, Sun Microsystems, Inc., (650) 336-0761, andrew.casey_at_corp.sun.com
4 Typical Web Session
- get http://www.cs.cornell.edu/People/ken
- [Diagram: the request passes through a firewall; labels "where" and "what" mark the server and the page being fetched]
5 Typical Web Session
- [Diagram: the browser resolves www.cs.cornell.edu to IP address 128.64.31.77, then the request passes through a firewall to a load-balancing proxy]
- (a short Python sketch of these two steps follows)
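A minimal Python sketch (not part of the original slides) of the two steps in this diagram: resolve the hostname via DNS, then issue an HTTP GET. The hostname and path are just the example from slide 4; a real browser adds caching, proxies, and HTTPS on top of this.

    import socket
    import http.client

    host = "www.cs.cornell.edu"
    path = "/People/ken"

    # Step 1: DNS resolution (the "resolve" arrow in the diagram)
    ip_address = socket.gethostbyname(host)
    print("resolved", host, "to", ip_address)

    # Step 2: HTTP GET to the resolved server (possibly via a load-balancing proxy)
    conn = http.client.HTTPConnection(host, 80, timeout=5)
    conn.request("GET", path)
    response = conn.getresponse()
    print("status:", response.status)
    conn.close()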
6 The Web's dark side
- Netscape error: "Web server www.cs.cornell.edu... not responding. Server may have crashed or is overloaded."  [OK]
7 Right URL, but the request times out. Why?
- The web server could be down
- Your network connection may have failed
- There could be a problem in the DNS
- There could be a network routing problem
- The Internet may be experiencing an overload
- Your web caching proxy may be down
- Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN
- The URL itself may be wrong
- A router or network link may have failed and the Internet may not yet have rerouted around the problem (a small diagnostic sketch follows this list)
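A small diagnostic sketch, assuming nothing beyond the Python standard library, showing how a client can at least tell a few of these failure modes apart: DNS trouble versus a connection that times out versus a host that refuses the connection. All of these produce the same "not responding" symptom in the browser.

    import socket

    def diagnose(host, port=80, timeout=5):
        try:
            ip = socket.gethostbyname(host)          # DNS lookup
        except socket.gaierror:
            return "DNS problem: name could not be resolved"
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return "server at %s accepted the connection" % ip
        except socket.timeout:
            return "timeout: server down, overloaded, or a routing problem"
        except ConnectionRefusedError:
            return "host is up but nothing is listening on port %d" % port

    print(diagnose("www.cs.cornell.edu"))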
8 E-Trade computers crash again -- and again
Edupage Editors, Sun, 07 Feb 1999 10:28:30 -0500
- The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented, "it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others."
9 Reliable Distributed Computing: Increasingly urgent, yet unsolved
- Distributed computing has swept the world
- Impact has become revolutionary
- Vast wave of applications migrating to networks
- Already as critical a national infrastructure as water, electricity, or telephones
- Yet distributed systems remain
  - Unreliable, prone to inexplicable outages
  - Insecure, easily attacked
  - Difficult (and costly) to program, bug-prone
10 A National Imperative
- Potential for catastrophe cited by
  - Presidential Commission on Critical Infrastructure Protection (PCCIP)
  - National Academy of Sciences study on Trust in Cyberspace
- These experts warn that we need a quantum improvement in technologies
- Meanwhile, your e-commerce venture is at grave risk of stumbling just like many others
11 A Business Imperative
12 A Business Imperative
13 E-projects Often Fail
- E-commerce revolves around computing
- Even business and marketing people are at the mercy of these systems
- When your company's computing systems aren't running, you're out of business
14 Big and Little Pictures
- It is too easy to understand reliability as a narrow technical issue
- In fact, many systems and companies stumble by building unreliable technologies, because of a mixture of poor management and poor technical judgment
- Reliable systems demand a balance between good management and good technology
15 A Glimpse of Some Unreliable Systems
- Quick review of some failed projects
- These were characterized by poor reliability of the final product
- But the issues were not really technical
- As future managers you need to understand this phenomenon!
16 Tales from the Software Crypt
Source: Jerry Saltzer, keynote address, SOSP 1999
17 Tales from the Software Crypt
Source: Jerry Saltzer, keynote address, SOSP 1999
18 1995 Standish Group Study
- Success (20%): on time, on budget, on function
- Challenged (50%): over budget, missed schedule, lacks functions; typically 2x budget, 2x completion time, 2/3 of planned functionality
- Impaired (30%): scrapped
Source: Jerry Saltzer, keynote address, SOSP 1999
19 A strange picture
- Many technology projects fail, for lots of reasons
- But some succeed
- Today we do web-based hotel reservations all the time, yet Confirm failed
- The French air traffic project was a success, yet the US project lost $6 billion
- Is there a pattern?
20 Recurring Problems
- Incommensurate scaling
- Too many ideas
- Mythical man-month
- Bad ideas included
- Modularity is hard
- Bad-news diode
- Best people are far more productive than average employees
- New is better; not-even-available-yet is best
- Magic bullet syndrome
Source: Jerry Saltzer, keynote address, SOSP 1999
21 1995 Study of Tandem Computer Systems
- 77% of failures are software problems.
- Software fault-tolerance techniques can overcome about 75% of detected faults.
- Loose coupling between primary and backup is important for software fault tolerance.
- Over two-thirds (72%) of measured software failures are recurrences of previously reported faults.
Source: Jerry Saltzer, keynote address, SOSP 1999
22 A Buggy Aside
- Q: What are the two main categories of software bugs called?
- A: Bohrbugs and Heisenbugs
- Q: Why?
23 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
24 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
- A Bohr bug is a nasty but well-defined thing
25 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
- A Bohr bug is a nasty but well-defined thing
- Your technical people can reproduce it, so they can nail it
26 Heisenbug
- Heisenberg modeled the atom as a cloud of electrons and a cloud-like nucleus
- The closer you look, the more it wiggles
- A Heisenbug moves when your people try to pin it down. They won't find it easy to fix.
27 Why?
- Bohrbugs tend to be deterministic errors: outright mistakes in the code
- Once you understand what triggers them, they are easy to search for and fix
- Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, the effect may appear long after the bug first occurred. They are hard to fix because at the time the mistake happened, nothing obvious went wrong. (Toy examples of both kinds follow.)
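Two toy Python illustrations (my own examples, not from the lecture) of the distinction. The first is a Bohrbug: a deterministic mistake that fails the same way on every run. The second is a classic Heisenbug pattern: a data race whose effect depends on thread timing, so adding print statements or a debugger can make it vanish.

    # Bohrbug: calling average([]) crashes reproducibly, every single time.
    def average(values):
        return sum(values) / len(values)

    # Heisenbug: two threads update a shared counter without a lock.
    import threading

    counter = 0

    def increment(n):
        global counter
        for _ in range(n):
            counter += 1     # unsynchronized read-modify-write

    threads = [threading.Thread(target=increment, args=(1_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Depending on scheduling (and Python version), this can come out
    # below 4,000,000 on some runs and be correct on others.
    print("expected 4000000, got", counter)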
28 Why Systems Fail
- Mostly, because something crashes
- Usually, software or a human error
- Mean time to failure improves with age, but software problems remain prevalent
- Every kind of software system is prone to failures. Failure to plan for failures is the most common way for e-systems to fail.
29 E-reliability
- We want e-commerce solutions to be reliable, but what should this mean?
- Fault-tolerant?
- Secure?
- Fast enough?
- Accessible to customers?
- Deliver critical services when needed, where needed, in a correct, timely manner
30 Costs of a Failure
31 Minimizing Downtime
- The idea is to design critical parts of your system to survive failures
- Two basic approaches:
  - Recoverable systems are designed to restart without human intervention, but may wait until the outage is repaired
  - Highly available systems are designed to keep running during the failure
32 Recoverability
- The technology is called transactions
- We'll discuss this next time, but
- The main issue is the time needed to restart the service
- For a large database, half an hour or more is not at all unusual
- Faster restart requires a warm standby (a minimal transaction sketch follows)
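The details come next lecture, but here is a minimal sketch of the transaction idea using Python's built-in sqlite3 module. The table and column names are made up for illustration; the point is that both updates commit together or neither does, so recovery after a crash rolls back to the last committed state.

    import sqlite3

    conn = sqlite3.connect("shop.db")
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        # Both updates happen, or neither does -- even if we crash in between,
        # recovery restores the database to the last committed state.
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
    finally:
        conn.close()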
33 High Availability
- The idea is to have a way to keep the system running even while some parts are crashed
- For example, a backup that takes over if the primary fails
- The backup is kept warm
- This involves replicating information
- As changes occur, the backup may lag behind (a sketch of the warm-backup idea follows)
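A minimal in-process sketch of the warm-backup idea (an assumption for illustration, not any particular product). In a real system the primary and backup run on separate machines and updates travel over a network, which is exactly why the backup can lag by whatever is still in flight.

    class Replica:
        def __init__(self):
            self.state = {}

        def apply(self, key, value):
            self.state[key] = value

    class Primary(Replica):
        def __init__(self, backup):
            super().__init__()
            self.backup = backup

        def update(self, key, value):
            self.apply(key, value)          # apply locally first
            self.backup.apply(key, value)   # then ship the change to the warm backup

    backup = Replica()
    primary = Primary(backup)
    primary.update("order-42", "shipped")

    # If the primary crashes now, the backup already holds the change and can
    # take over; with asynchronous shipping it might be a few updates behind.
    print(backup.state)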
34 Complexity
- The looming threat to your e-commerce solution, no matter what it may be
- Even simple systems are hard to make reliable
- Complex systems are almost impossible to make reliable
- Yet innovative e-commerce projects often require fairly complex technologies!
35 Two Side-by-Side Case Studies
- American Advanced Automation System
  - Intended as a replacement for the air traffic control system
  - Needed because Pres. Reagan fired many controllers in 1981
  - But the project was a fiasco; lost $6B
- French Phidias System
  - Similar goals, slightly less ambitious
  - But rolled out, on time and on budget, in 1999
36 Background
- Air traffic control systems are using 1970s technology
- Extremely costly to maintain and impossible to upgrade
- Meanwhile, load on controllers is rising steadily
- Can't easily reduce the load
37 Air Traffic Control System (one site)
- [Diagram: onboard radar, X.500 directory, team of controllers, and the air traffic database (flight plans, etc.)]
38 Politics
- The government wanted to upgrade the whole thing and solve a nagging problem
- Controllers demanded various simplifications and powerful new tools
- Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center
39 Technology
- IBM bid the project and proposed to use its own workstations
- These aren't super reliable, so IBM proposed to adopt a new approach to fault-tolerance
- The idea is to plan for failure
  - Detect failures when they occur
  - Automatically switch to backups
40 Core Technical Issue?
- The problem revolves around high availability
- Waiting for a restart was not seen as an option; the goal is 10 seconds of downtime in 10 years
- So IBM proposed a replication scheme much like the load-balancing approach
- IBM had the primary and backup simply do the same work, keeping them in the same state
41 Technology
- [Diagram: conceptual flow of the system: radar -> find tracks -> identify flight -> look up record -> plan actions -> human action]
- [Diagram: IBM's fault-tolerant "process pair" concept: two identical copies of the same pipeline run side by side, each driven by the radar input (a sketch of this idea follows)]
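A sketch of the process-pair idea in Python (my simplification, not IBM's actual design): both members of the pair process the same radar inputs, so they stay in the same state, and if the primary fails the backup's state is already current.

    class TrackProcessor:
        """One copy of the conceptual pipeline: find tracks, identify flight, ..."""
        def __init__(self, name):
            self.name = name
            self.tracks = {}
            self.alive = True

        def process(self, radar_report):
            # find tracks / identify flight / look up record, greatly simplified
            flight_id, position = radar_report
            self.tracks[flight_id] = position

    primary = TrackProcessor("primary")
    backup = TrackProcessor("backup")

    def handle(radar_report):
        # Every input is delivered to both members of the pair.
        for replica in (primary, backup):
            if replica.alive:
                replica.process(radar_report)
        # Output comes from the primary while it is up, otherwise from the backup.
        current = primary if primary.alive else backup
        return current.tracks

    handle(("AA123", (42.44, -76.50)))
    primary.alive = False                        # simulate a crash of the primary
    print(handle(("AA123", (42.45, -76.48))))    # the backup already has the state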
42 Why is this Hard?
- The system has many real-time constraints on it
- Actions need to occur promptly
- Even if something fails, we want the human controller to continue to see updates
- IBM's technology
  - Based on a research paper by Flaviu Cristian
  - But it had never been used except for proof-of-concept purposes, on a small scale in the laboratory
43 Politics
- IBM's proposal sounded good
- ...and they were the second-lowest bidder
- ...and they had the most aggressive schedule
- So the FAA selected them over the alternatives
- IBM took on the whole thing all at once
44 Disaster Strikes
- Immediate confusion: all parts of the system seemed interdependent
- To design part A, I need to know how part B, also being designed, will work
- Controllers didn't like early proposals and insisted on major changes to the design
- The fault-tolerance idea was one of the reasons IBM was picked, but it made the system so complex that it went on the back burner
45 Summary of Simplifications
- Focus on some core components
- Postpone worrying about fault-tolerance until later
- Try to build a simple version that can be fleshed out later
- ...but the simplification wasn't enough. Too many players kept intruding with requirements
46 Crash and Burn
- The technical guys saw it coming, probably as early as one year into the effort
- But they kept it secret (the bad-news diode)
- Anyhow, management wasn't listening (they've heard it all before: whining engineers!)
- The fault-tolerance scheme didn't work; many technical issues were unresolved
- The FAA kept out of the technical issues
- But a mixture of changing specifications and serious technical issues was at the root of the problems
47 What came out?
- In the USA, nothing.
- The entire system was useless: the technology was of an all-or-nothing style and nothing was ready to deploy
- The British later rolled out a very limited version of a similar technology, late and with many bugs, but it does work
48 Contrast with the French
- They took a very incremental approach
- The early design sought to cut back as much as possible
- "If it isn't mandatory, don't do it yet"
- The focus was on the console cluster architecture and fault-tolerance
- They insisted on using off-the-shelf technology
49 Contrast with the French
- Managers intervened in technology choices
- For example, the vendor wanted to do a home-brew fault-tolerance technology
- The French insisted on a specific existing technology and refused to bid out the work until vendors accepted
- A critical good call, as it worked out
50 Learning by Doing
- To gain experience with the technology, they tested, and tested, and tested
- Designed simple prototypes and played with them
- Discovered that a large cluster would perform poorly
- But found a sweet spot and worked within it
- This forced the project to cut back on some goals
51 Testing
- 9/10ths of the time and expense on any system is in
  - Testing
  - Debugging
  - Integration
- Many projects overlook this
- The French planned conservatively
52 Software Bugs
- Figure one bug per 10 lines in new code
- But as many as one per 250 lines in old code
- Bugs show up under stress
- The trick is to run a system in an unstressed mode
- The French identified the stress points and designed to steer far from them
- Their design also assumed that components would fail, and automated the restart (a sketch of such a restart watchdog follows)
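A sketch, assuming a hypothetical component script named worker.py, of what "automated the restart" can look like in its simplest form: a watchdog that treats a component's exit as routine and simply starts it again.

    import subprocess
    import time

    def run_forever(command, delay=2.0):
        """Start the component, and restart it each time it dies."""
        while True:
            process = subprocess.Popen(command)
            exit_code = process.wait()          # block until the component exits
            print("component exited with code", exit_code, "-- restarting")
            time.sleep(delay)                   # brief pause to avoid a restart storm

    if __name__ == "__main__":
        run_forever(["python", "worker.py"])    # worker.py is a placeholder name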
53 All of this worked!
- Take-aways from the French project?
- Complex technical issues at the core of the system
- But they managed to break the big project into pieces
- Do the critical core first, separately, and focus exclusively on it
- Test, test, test
- Don't build anything you can possibly buy
- Management was technically sophisticated enough to make some critical calls
54 Your Problem
- E-commerce systems are at e-risk
- These e-risks take many forms
  - System complexity
  - Failure to plan for failures
  - Poor project management
- Ignore this at our peril, as we've seen
- But how can we learn to do better?
55 Keys to Reliability
- Know the basic technologies
- Realize that software is buggy and failures will happen
- Design to treat failure as a mundane event
- Failure to plan for failure is the biggest e-risk!
- Complexity is a huge threat. Use your naiveté as an advantage: if you can't understand it, why assume that they can understand it?
56 E-commerce Technologies
- The network and associated services
- Databases
- Web servers
- Scripts: the "glue" your people use to tie it all together (a small sketch of this glue follows)
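A small sketch of that glue role, using only the Python standard library: a tiny web handler that looks up a product price in a database and returns it. The table name, sample data, and port are made up for illustration; real e-commerce glue is far larger, but the shape is the same.

    import sqlite3
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        product = environ.get("PATH_INFO", "/").lstrip("/") or "widget"
        conn = sqlite3.connect("catalog.db")
        conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT PRIMARY KEY, price REAL)")
        conn.execute("INSERT OR IGNORE INTO products VALUES ('widget', 9.99)")
        conn.commit()
        row = conn.execute("SELECT price FROM products WHERE name = ?", (product,)).fetchone()
        conn.close()
        body = ("%s costs $%.2f\n" % (product, row[0])) if row else "no such product\n"
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body.encode("utf-8")]

    if __name__ == "__main__":
        # Serve on localhost:8000; try http://localhost:8000/widget in a browser.
        make_server("localhost", 8000, app).serve_forever()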
57 Next Lecture
- Look at some realistic e-commerce systems
- Ask ourselves where to start first, if we need to convince ourselves that the system will be reliable enough
- The trick is to balance between system complexity and adequate risk coverage