Title: The E-risks of E-commerce
1 The E-risks of E-commerce
- Professor Ken Birman
- Dept. of Computer Science
- Cornell University
2 Reliability
- If it stops ticking when it takes a licking, your e-commerce company could tank
- So you need to know that your technology base is reliable
- It does what it should do, does it when needed, does it correctly, and is accessible to your customers
3 A Quiz
- Q: When and why did Sun Microsystems have a losing quarter?
- A (email reply to Ken Birman): "Mr. Birman, Sun experienced a loss in Q4 FY89 (June 1989). This was the quarter in which we transitioned to new manufacturing, order processing, and inventory control systems."
  Andrew Casey, Manager, Investor Relations, Sun Microsystems, Inc., (650) 336-0761, andrew.casey_at_corp.sun.com
4 Typical Web Session
- get http://www.cs.cornell.edu/People/ken
- [Diagram: the request passes through a firewall; labels "where" and "what" mark the server and the page being fetched]
5 Typical Web Session
- [Diagram: the browser resolves www.cs.cornell.edu to IP address 128.64.31.77, then the request passes through a firewall to a load-balancing proxy]
- (a short Python sketch of these two steps follows)
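A minimal Python sketch (not part of the original slides) of the two steps in this diagram: resolve the hostname via DNS, then issue an HTTP GET. The hostname and path are just the example from slide 4; a real browser adds caching, proxies, and HTTPS on top of this.

    import socket
    import http.client

    host = "www.cs.cornell.edu"
    path = "/People/ken"

    # Step 1: DNS resolution (the "resolve" arrow in the diagram)
    ip_address = socket.gethostbyname(host)
    print("resolved", host, "to", ip_address)

    # Step 2: HTTP GET to the resolved server (possibly via a load-balancing proxy)
    conn = http.client.HTTPConnection(host, 80, timeout=5)
    conn.request("GET", path)
    response = conn.getresponse()
    print("status:", response.status)
    conn.close()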
6 The Web's dark side
- Netscape error: "Web server www.cs.cornell.edu... not responding. Server may have crashed or is overloaded."  [OK]
7 Right URL, but the request times out. Why?
- The web server could be down
- Your network connection may have failed
- There could be a problem in the DNS
- There could be a network routing problem
- The Internet may be experiencing an overload
- Your web caching proxy may be down
- Your PC might have a problem, or your version of Netscape (or Explorer), or the file system you are using, or your LAN
- The URL itself may be wrong
- A router or network link may have failed and the Internet may not yet have rerouted around the problem (a small diagnostic sketch follows this list)
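A small diagnostic sketch, assuming nothing beyond the Python standard library, showing how a client can at least tell a few of these failure modes apart: DNS trouble versus a connection that times out versus a host that refuses the connection. All of these produce the same "not responding" symptom in the browser.

    import socket

    def diagnose(host, port=80, timeout=5):
        try:
            ip = socket.gethostbyname(host)          # DNS lookup
        except socket.gaierror:
            return "DNS problem: name could not be resolved"
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return "server at %s accepted the connection" % ip
        except socket.timeout:
            return "timeout: server down, overloaded, or a routing problem"
        except ConnectionRefusedError:
            return "host is up but nothing is listening on port %d" % port

    print(diagnose("www.cs.cornell.edu"))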
8 E-Trade computers crash again -- and again
Edupage Editors, Sun, 07 Feb 1999 10:28:30 -0500
- The computer system of online securities firm E-Trade crashed on Friday for the third consecutive day. "It was just a software glitch. I think we were all frustrated by it," says an E-Trade executive. Industry analyst James Mark of Deutsche Bank commented, "it's the application on a large scale. As soon as E-Trade's volumes started spiking up, they had the same problems as others."
9 Reliable Distributed Computing: Increasingly urgent, yet unsolved
- Distributed computing has swept the world
- Impact has become revolutionary
- Vast wave of applications migrating to networks
- Already as critical a national infrastructure as water, electricity, or telephones
- Yet distributed systems remain
  - Unreliable, prone to inexplicable outages
  - Insecure, easily attacked
  - Difficult (and costly) to program, bug-prone
10 A National Imperative
- Potential for catastrophe cited by
  - Presidential Commission on Critical Infrastructure Protection (PCCIP)
  - National Academy of Sciences study on Trust in Cyberspace
- These experts warn that we need a quantum improvement in technologies
- Meanwhile, your e-commerce venture is at grave risk of stumbling just like many others
11 A Business Imperative
12 A Business Imperative
13 E-projects Often Fail
- E-commerce revolves around computing
- Even business and marketing people are at the mercy of these systems
- When your company's computing systems aren't running, you're out of business
14 Big and Little Pictures
- It is too easy to understand reliability as a narrow technical issue
- In fact, many systems and companies stumble by building unreliable technologies, because of a mixture of poor management and poor technical judgment
- Reliable systems demand a balance between good management and good technology
15 A Glimpse of Some Unreliable Systems
- Quick review of some failed projects
- These were characterized by poor reliability of the final product
- But the issues were not really technical
- As future managers you need to understand this phenomenon!
16 Tales from the Software Crypt
Source: Jerry Saltzer, keynote address, SOSP 1999
17 Tales from the Software Crypt
Source: Jerry Saltzer, keynote address, SOSP 1999
18 1995 Standish Group Study
- Success (20%): on time, on budget, on function
- Challenged (50%): over budget, missed schedule, lacks functions; typically 2x budget, 2x completion time, 2/3 of planned functionality
- Impaired (30%): scrapped
Source: Jerry Saltzer, keynote address, SOSP 1999
19 A strange picture
- Many technology projects fail, for lots of reasons
- But some succeed
- Today we do web-based hotel reservations all the time, yet Confirm failed
- The French air traffic project was a success, yet the US project lost $6 billion
- Is there a pattern?
20 Recurring Problems
- Incommensurate scaling
- Too many ideas
- Mythical man-month
- Bad ideas included
- Modularity is hard
- Bad-news diode
- Best people are far more productive than average employees
- New is better; not-even-available-yet is best
- Magic bullet syndrome
Source: Jerry Saltzer, keynote address, SOSP 1999
21 1995 Study of Tandem Computer Systems
- 77% of failures are software problems.
- Software fault-tolerance techniques can overcome about 75% of detected faults.
- Loose coupling between primary and backup is important for software fault tolerance.
- Over two-thirds (72%) of measured software failures are recurrences of previously reported faults.
Source: Jerry Saltzer, keynote address, SOSP 1999
22 A Buggy Aside
- Q: What are the two main categories of software bugs called?
- A: Bohrbugs and Heisenbugs
- Q: Why?
23 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
24 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
- A Bohr bug is a nasty but well-defined thing
25 Bohr Model of Atom
- Bohr argued that the nucleus was a little ball
- A Bohr bug is a nasty but well-defined thing
- Your technical people can reproduce it, so they can nail it
26 Heisenbug
- Heisenberg modeled the atom as a cloud of electrons and a cloud-like nucleus
- The closer you look, the more it wiggles
- A Heisenbug moves when your people try to pin it down. They won't find it easy to fix.
27 Why?
- Bohrbugs tend to be deterministic errors: outright mistakes in the code
- Once you understand what triggers them, they are easy to search for and fix
- Heisenbugs are often delayed side-effects of an old error. Like a bad tank of gas, the effect may appear long after the bug first occurred. They are hard to fix because at the time the mistake happened, nothing obvious went wrong. (Toy examples of both kinds follow.)
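Two toy Python illustrations (my own examples, not from the lecture) of the distinction. The first is a Bohrbug: a deterministic mistake that fails the same way on every run. The second is a classic Heisenbug pattern: a data race whose effect depends on thread timing, so adding print statements or a debugger can make it vanish.

    # Bohrbug: calling average([]) crashes reproducibly, every single time.
    def average(values):
        return sum(values) / len(values)

    # Heisenbug: two threads update a shared counter without a lock.
    import threading

    counter = 0

    def increment(n):
        global counter
        for _ in range(n):
            counter += 1     # unsynchronized read-modify-write

    threads = [threading.Thread(target=increment, args=(1_000_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Depending on scheduling (and Python version), this can come out
    # below 4,000,000 on some runs and be correct on others.
    print("expected 4000000, got", counter)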
28 Why Systems Fail
- Mostly, because something crashes
- Usually, software or a human error
- Mean time to failure improves with age, but software problems remain prevalent
- Every kind of software system is prone to failures. Failure to plan for failures is the most common way for e-systems to fail.
29 E-reliability
- We want e-commerce solutions to be reliable, but what should this mean?
- Fault-tolerant?
- Secure?
- Fast enough?
- Accessible to customers?
- Deliver critical services when needed, where needed, in a correct, timely manner
30 Costs of a Failure
31 Minimizing Downtime
- The idea is to design critical parts of your system to survive failures
- Two basic approaches:
  - Recoverable systems are designed to restart without human intervention, but may wait until the outage is repaired
  - Highly available systems are designed to keep running during the failure
32 Recoverability
- The technology is called transactions
- We'll discuss this next time, but
- The main issue is the time needed to restart the service
- For a large database, half an hour or more is not at all unusual
- Faster restart requires a warm standby (a minimal transaction sketch follows)
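The details come next lecture, but here is a minimal sketch of the transaction idea using Python's built-in sqlite3 module. The table and column names are made up for illustration; the point is that both updates commit together or neither does, so recovery after a crash rolls back to the last committed state.

    import sqlite3

    conn = sqlite3.connect("shop.db")
    conn.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100), ('bob', 0)")
    conn.commit()

    try:
        # Both updates happen, or neither does -- even if we crash in between,
        # recovery restores the database to the last committed state.
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
    finally:
        conn.close()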
33 High Availability
- The idea is to have a way to keep the system running even while some parts are crashed
- For example, a backup that takes over if the primary fails
- The backup is kept warm
- This involves replicating information
- As changes occur, the backup may lag behind (a sketch of the warm-backup idea follows)
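A minimal in-process sketch of the warm-backup idea (an assumption for illustration, not any particular product). In a real system the primary and backup run on separate machines and updates travel over a network, which is exactly why the backup can lag by whatever is still in flight.

    class Replica:
        def __init__(self):
            self.state = {}

        def apply(self, key, value):
            self.state[key] = value

    class Primary(Replica):
        def __init__(self, backup):
            super().__init__()
            self.backup = backup

        def update(self, key, value):
            self.apply(key, value)          # apply locally first
            self.backup.apply(key, value)   # then ship the change to the warm backup

    backup = Replica()
    primary = Primary(backup)
    primary.update("order-42", "shipped")

    # If the primary crashes now, the backup already holds the change and can
    # take over; with asynchronous shipping it might be a few updates behind.
    print(backup.state)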
34 Complexity
- The looming threat to your e-commerce solution, no matter what it may be
- Even simple systems are hard to make reliable
- Complex systems are almost impossible to make reliable
- Yet innovative e-commerce projects often require fairly complex technologies!
35 Two Side-by-Side Case Studies
- American Advanced Automation System
  - Intended as a replacement for the air traffic control system
  - Needed because Pres. Reagan fired many controllers in 1981
  - But the project was a fiasco; lost $6B
- French Phidias System
  - Similar goals, slightly less ambitious
  - But rolled out, on time and on budget, in 1999
36 Background
- Air traffic control systems are using 1970s technology
- Extremely costly to maintain and impossible to upgrade
- Meanwhile, load on controllers is rising steadily
- Can't easily reduce the load
37 Air Traffic Control System (one site)
- [Diagram: onboard radar, X.500 directory, team of controllers, and the air traffic database (flight plans, etc.)]
38 Politics
- The government wanted to upgrade the whole thing and solve a nagging problem
- Controllers demanded various simplifications and powerful new tools
- Everyone assumed that what you use at home can be adapted to the demands of an air traffic control center
39 Technology
- IBM bid the project and proposed to use its own workstations
- These aren't super reliable, so IBM proposed to adopt a new approach to fault-tolerance
- The idea is to plan for failure
  - Detect failures when they occur
  - Automatically switch to backups
40 Core Technical Issue?
- The problem revolves around high availability
- Waiting for a restart was not seen as an option; the goal is 10 seconds of downtime in 10 years
- So IBM proposed a replication scheme much like the load-balancing approach
- IBM had the primary and backup simply do the same work, keeping them in the same state
41 Technology
- [Diagram: conceptual flow of the system: radar -> find tracks -> identify flight -> look up record -> plan actions -> human action]
- [Diagram: IBM's fault-tolerant "process pair" concept: two identical copies of the same pipeline run side by side, each driven by the radar input (a sketch of this idea follows)]
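A sketch of the process-pair idea in Python (my simplification, not IBM's actual design): both members of the pair process the same radar inputs, so they stay in the same state, and if the primary fails the backup's state is already current.

    class TrackProcessor:
        """One copy of the conceptual pipeline: find tracks, identify flight, ..."""
        def __init__(self, name):
            self.name = name
            self.tracks = {}
            self.alive = True

        def process(self, radar_report):
            # find tracks / identify flight / look up record, greatly simplified
            flight_id, position = radar_report
            self.tracks[flight_id] = position

    primary = TrackProcessor("primary")
    backup = TrackProcessor("backup")

    def handle(radar_report):
        # Every input is delivered to both members of the pair.
        for replica in (primary, backup):
            if replica.alive:
                replica.process(radar_report)
        # Output comes from the primary while it is up, otherwise from the backup.
        current = primary if primary.alive else backup
        return current.tracks

    handle(("AA123", (42.44, -76.50)))
    primary.alive = False                        # simulate a crash of the primary
    print(handle(("AA123", (42.45, -76.48))))    # the backup already has the state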
42 Why is this Hard?
- The system has many real-time constraints on it
- Actions need to occur promptly
- Even if something fails, we want the human controller to continue to see updates
- IBM's technology
  - Based on a research paper by Flaviu Cristian
  - But it had never been used except for proof-of-concept purposes, on a small scale in the laboratory
43 Politics
- IBM's proposal sounded good
- ...and they were the second-lowest bidder
- ...and they had the most aggressive schedule
- So the FAA selected them over the alternatives
- IBM took on the whole thing all at once
44 Disaster Strikes
- Immediate confusion: all parts of the system seemed interdependent
- To design part A, I need to know how part B, also being designed, will work
- Controllers didn't like early proposals and insisted on major changes to the design
- The fault-tolerance idea was one of the reasons IBM was picked, but it made the system so complex that it went on the back burner
45 Summary of Simplifications
- Focus on some core components
- Postpone worrying about fault-tolerance until later
- Try to build a simple version that can be fleshed out later
- ...but the simplification wasn't enough. Too many players kept intruding with requirements
46 Crash and Burn
- The technical guys saw it coming, probably as early as one year into the effort
- But they kept it secret (the bad-news diode)
- Anyhow, management wasn't listening (they've heard it all before: whining engineers!)
- The fault-tolerance scheme didn't work; many technical issues were unresolved
- The FAA kept out of the technical issues
- But a mixture of changing specifications and serious technical issues was at the root of the problems
47 What came out?
- In the USA, nothing.
- The entire system was useless: the technology was of an all-or-nothing style and nothing was ready to deploy
- The British later rolled out a very limited version of a similar technology, late and with many bugs, but it does work
48 Contrast with the French
- They took a very incremental approach
- The early design sought to cut back as much as possible
- "If it isn't mandatory, don't do it yet"
- The focus was on the console cluster architecture and fault-tolerance
- They insisted on using off-the-shelf technology
49 Contrast with the French
- Managers intervened in technology choices
- For example, the vendor wanted to do a home-brew fault-tolerance technology
- The French insisted on a specific existing technology and refused to bid out the work until vendors accepted
- A critical good call, as it worked out
50 Learning by Doing
- To gain experience with the technology, they tested, and tested, and tested
- Designed simple prototypes and played with them
- Discovered that a large cluster would perform poorly
- But found a sweet spot and worked within it
- This forced the project to cut back on some goals
51 Testing
- 9/10ths of the time and expense on any system is in
  - Testing
  - Debugging
  - Integration
- Many projects overlook this
- The French planned conservatively
52 Software Bugs
- Figure one bug per 10 lines in new code
- But as many as one per 250 lines in old code
- Bugs show up under stress
- The trick is to run a system in an unstressed mode
- The French identified the stress points and designed to steer far from them
- Their design also assumed that components would fail, and automated the restart (a sketch of such a restart watchdog follows)
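A sketch, assuming a hypothetical component script named worker.py, of what "automated the restart" can look like in its simplest form: a watchdog that treats a component's exit as routine and simply starts it again.

    import subprocess
    import time

    def run_forever(command, delay=2.0):
        """Start the component, and restart it each time it dies."""
        while True:
            process = subprocess.Popen(command)
            exit_code = process.wait()          # block until the component exits
            print("component exited with code", exit_code, "-- restarting")
            time.sleep(delay)                   # brief pause to avoid a restart storm

    if __name__ == "__main__":
        run_forever(["python", "worker.py"])    # worker.py is a placeholder name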
53 All of this worked!
- Take-aways from the French project?
- Complex technical issues at the core of the system
- But they managed to break the big project into pieces
- Do the critical core first, separately, and focus exclusively on it
- Test, test, test
- Don't build anything you can possibly buy
- Management was technically sophisticated enough to make some critical calls
54 Your Problem
- E-commerce systems are at e-risk
- These e-risks take many forms
  - System complexity
  - Failure to plan for failures
  - Poor project management
- Ignore this at our peril, as we've seen
- But how can we learn to do better?
55 Keys to Reliability
- Know the basic technologies
- Realize that software is buggy and failures will happen
- Design to treat failure as a mundane event
- Failure to plan for failure is the biggest e-risk!
- Complexity is a huge threat. Use your naiveté as an advantage: if you can't understand it, why assume that they can understand it?
56 E-commerce Technologies
- The network and associated services
- Databases
- Web servers
- Scripts: the "glue" your people use to tie it all together (a small sketch of this glue follows)
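A small sketch of that glue role, using only the Python standard library: a tiny web handler that looks up a product price in a database and returns it. The table name, sample data, and port are made up for illustration; real e-commerce glue is far larger, but the shape is the same.

    import sqlite3
    from wsgiref.simple_server import make_server

    def app(environ, start_response):
        product = environ.get("PATH_INFO", "/").lstrip("/") or "widget"
        conn = sqlite3.connect("catalog.db")
        conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT PRIMARY KEY, price REAL)")
        conn.execute("INSERT OR IGNORE INTO products VALUES ('widget', 9.99)")
        conn.commit()
        row = conn.execute("SELECT price FROM products WHERE name = ?", (product,)).fetchone()
        conn.close()
        body = ("%s costs $%.2f\n" % (product, row[0])) if row else "no such product\n"
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [body.encode("utf-8")]

    if __name__ == "__main__":
        # Serve on localhost:8000; try http://localhost:8000/widget in a browser.
        make_server("localhost", 8000, app).serve_forever()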
57 Next Lecture
- Look at some realistic e-commerce systems
- Ask ourselves where to start first, if we need to convince ourselves that the system will be reliable enough
- The trick is to balance between system complexity and adequate risk coverage