The e-risks of e-commerce

Transcript and Presenter's Notes


1
the e-risks of e-commerce
  • Professor Ken Birman
  • Dept. of Computer Science
  • Cornell University

2
Reliability
  • If it stops ticking when it takes a licking, your
    e-commerce company could tank
  • So you need to know that your technology base is
    reliable
  • It does what it should do, does it when needed,
    does it correctly, and is accessible to your
    customers.

3
A Quiz
  • Q: When and why did Sun Microsystems have a
    losing quarter?

Ken Birman
Mr. Birman,
Sun experienced a loss in Q4 FY89 (June 1989). This
was the quarter in which we transitioned to new
manufacturing, order processing and inventory
control systems.
Andrew Casey
Manager, Investor Relations
Sun Microsystems, Inc.
(650) 336-0761
andrew.casey_at_corp.sun.com
4
Typical Web Session
get http://www.cs.cornell.edu/People/ken
[Diagram labels: where, what, firewall]
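
The "where" and "what" labels on this slide presumably mark the two halves of the request: the host name says where to send it, and the path says what to fetch. A minimal sketch of that split, using Python's standard `urllib.parse`; the variable names `where` and `what` are illustrative and not part of the original slide.

```python
from urllib.parse import urlsplit

url = "http://www.cs.cornell.edu/People/ken"
parts = urlsplit(url)

where = parts.hostname   # the server the browser must locate and contact
what = parts.path        # the resource it asks that server for

print(where)
print(what)
```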
5
Typical Web Session
resolve www.cs.cornell.edu
IP address: 128.64.31.77
[Diagram labels: load-balancing proxy, firewall]
6
The Web's dark side
Netscape error: web server www.cs.cornell.edu...
not responding. Server may have crashed or is
overloaded.
[OK]
7
Right URL, but the request times out. Why?
  • The web server could be down
  • Your network connection may have failed
  • There could be a problem in the DNS
  • There could be a network routing problem
  • The Internet may be experiencing an overload
  • Your web caching proxy may be down
  • Your PC might have a problem, or your version of
    Netscape (or Explorer), or the file system you
    are using, or your LAN
  • The URL itself may be wrong
  • A router or network link may have failed and the
    Internet may not yet have rerouted around the
    problem
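
Several of the causes listed above can be told apart, at least coarsely, by checking each step of a connection in turn. The sketch below is one possible way to do that with Python's standard `socket` module; the function name `diagnose` and the wording of its return values are hypothetical, and it cannot distinguish every cause on the list (an overloaded Internet and a failed router both look like a timeout).

```python
import socket

def diagnose(host, port=80, timeout=3.0):
    """Return a coarse guess at why a request to host:port is failing."""
    try:
        # Step 1: name resolution. A DNS problem or a wrong host name shows up here.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "name lookup failed (DNS problem or wrong URL)"

    family, socktype, proto, _canonname, addr = infos[0]
    try:
        # Step 2: TCP connection to the first address returned.
        with socket.socket(family, socktype, proto) as s:
            s.settimeout(timeout)
            s.connect(addr)
    except socket.timeout:
        return "timed out (overload, routing problem, or failed link)"
    except ConnectionRefusedError:
        return "connection refused (host is up, server process is not)"
    except OSError:
        return "network error (local connection or LAN problem)"
    return "reachable"

print(diagnose("127.0.0.1", 80, timeout=1.0))
```

Even this simple check shows why a timeout is hard to interpret: several quite different failures produce the same symptom.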

8
E-Trade computers crash again -- and again
Edupage Editors
Sun, 07 Feb 1999 10:28:30 -0500
  • The computer system of online securities firm
    E-Trade crashed on Friday for the third
    consecutive day. "It was just a software glitch.
    I think we were all frustrated by it," says an
    E-Trade executive. Industry analyst James Mark of
    Deutsche Bank commented, "It's the application on
    a large scale. As soon as E-Trade's volumes
    started spiking up, they had the same problems as
    others."

9
Reliable Distributed Computing: Increasingly
urgent, yet unsolved
  • Distributed computing has swept the world
  • Impact has become revolutionary
  • Vast wave of applications migrating to networks
  • Already as critical a national infrastructure as
    water, electricity, or telephones
  • Yet distributed systems remain:
  • Unreliable, prone to inexplicable outages
  • Insecure, easily attacked
  • Difficult (and costly) to program, bug-prone

10
A National Imperative
  • Potential for catastrophe cited by:
  • Presidential Commission on Critical
    Infrastructure Protection (PCCIP)
  • National Academy of Sciences Study on Trust in
    Cyberspace
  • These experts warn that we need a quantum
    improvement in technologies
  • Meanwhile, your e-commerce venture is at grave
    risk of stumbling just like many others

11
A Business Imperative
12
A Business Imperative
13
E-projects Often Fail
  • e-commerce revolves around computing
  • Even business and marketing people are at the
    mercy of these systems
  • When your company's computing systems aren't
    running, you're out of business

14
Big and Little Pictures
  • It is too easy to understand reliability as a
    narrow technical issue
  • In fact, many systems and companies stumble by
    building unreliable technologies, because of a
    mixture of poor management and poor technical
    judgment
  • Reliable systems demand a balance between good
    management and good technology

15
A Glimpse of Some Unreliable Systems
  • Quick review of some failed projects
  • These were characterized by poor reliability of
    the final product
  • But the issues were not really technical
  • As future managers you need to understand this
    phenomenon!

16
Tales from the Software Crypt
Source: Jerry Saltzer, Keynote address, SOSP 1999
17
Tales from the Software Crypt
Source: Jerry Saltzer, Keynote address, SOSP 1999
18
1995 Standish Group Study
  • Success (20%): on time, on budget, on function
  • Challenged (50%): over budget, missed schedule,
    lacks functions (2x budget, 2x completion time,
    2/3 of planned functionality)
  • Impaired (30%): scrapped

Source: Jerry Saltzer, Keynote address, SOSP 1999
19
A strange picture
  • Many technology projects fail
  • For lots of reasons
  • But some succeed
  • Today we do web-based hotel reservations all the
    time, yet Confirm failed
  • French air traffic project was a success, yet US
    project lost $6 billion
  • Is there a pattern?

20
Recurring Problems
  • Incommensurate scaling
  • Too many ideas
  • Mythical man-month
  • Bad ideas included
  • Modularity is hard
  • Bad-news diode
  • Best people are far more productive than average
    employees
  • New is better, not-even-available-yet is best
  • Magic bullet syndrome

Source: Jerry Saltzer, Keynote address, SOSP 1999
21
1995 Study of Tandem Computer Systems
  • 77% of failures are software problems.
  • Software fault-tolerance techniques can overcome
    about 75% of detected faults.
  • Loose coupling between primary and backup is
    important for software fault tolerance.
  • Over two-thirds (72%) of measured software
    failures are recurrences of previously reported
    faults.

Source: Jerry Saltzer, Keynote address, SOSP 1999
22
A Buggy Aside
  • Q: What are the two main categories of software
    bugs called?
  • A: Bohrbugs and Heisenbugs
  • Q: Why?

23
Bohr Model of Atom
  • Bohr argued that the nucleus was a little ball

24
Bohr Model of Atom
  • Bohr argued that the nucleus was a little ball
  • A Bohrbug is a nasty but well-defined thing

25
Bohr Model of Atom
  • Bohr argued that the nucleus was a little ball
  • A Bohrbug is a nasty but well-defined thing
  • Your technical people can reproduce it, so
    they can nail it

26
Heisenbug
  • Heisenberg modeled the atom as a cloud of
    electrons and a cloud-like nucleus
  • The closer you look, the more it wiggles
  • A Heisenbug moves when your people try to pin
    it down. They won't find it easy to fix.

27
Why?
  • Bohrbugs tend to be deterministic errors:
    outright mistakes in the code
  • Once you understand what triggers them, they are
    easy to search for and fix
  • Heisenbugs are often delayed side-effects of an
    old error. Like a bad tank of gas, the effect may
    show up long after the bug first occurs. They are
    hard to fix because at the time the mistake
    happened, nothing obvious went wrong
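
The contrast above can be made concrete with a small sketch. It is only an illustration of the two categories as this slide describes them; the functions `average`, `record_sale`, and `ship` and the `stock` table are invented for the example.

```python
def average(values):
    # Bohrbug: an outright mistake (no check for an empty list).
    # It fails the same way every time, so it is easy to reproduce and fix.
    return sum(values) / len(values)

stock = {"widget": 2}

def record_sale(item, quantity):
    # The old error: nothing checks that enough stock exists.
    # The call "succeeds", so nothing obvious goes wrong at this point.
    stock[item] -= quantity

def ship(item):
    # The damage only surfaces here, later and far from the code that caused it.
    if stock[item] < 0:
        raise RuntimeError(f"negative stock for {item}")
    return f"shipped {item}"
```

Calling `average([])` fails immediately and identically on every run. Calling `record_sale("widget", 5)` appears to work, and the failure only shows up when `ship("widget")` runs some time later, which is what makes this second kind of bug harder to trace.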

28
Why Systems fail
  • Mostly, because something crashes
  • Usually, software or a human error
  • Mean time to failure improves with age but
    software problems remain prevalent
  • Every kind of software system is prone to
    failures. Failure to plan for failures is the
    most common way for e-systems to fail.

29
E-reliability
  • We want e-commerce solutions to be reliable but
    what should this mean?
  • Fault-tolerant?
  • Secure?
  • Fast enough?
  • Accessible to customers?
  • Deliver critical services when needed, where
    needed, in a correct, timely manner

30
Costs of a Failure
31
Minimizing Downtime
  • Idea is to design critical parts of your system
    to survive failures
  • Two basic approaches
  • Recoverable systems are designed to restart
    without human intervention but may wait until
    outage is repaired
  • Highly available systems are designed to keep
    running during failure

32
Recoverability
  • The technology is called transactions
  • We'll discuss this next time, but
  • Main issue is time needed to restart the service
  • For a large database, half an hour or more is not
    at all unusual
  • Faster restart requires a warm standby

33
High Availability
  • Idea is to have a way to keep the system running
    even while some parts are crashed
  • For example, a backup that takes over if primary
    fails
  • Backup is kept warm
  • This involves replicating information
  • As changes occur, backup may lag behind
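
The primary/backup arrangement described above, including the point that the backup may lag behind, can be sketched in a few lines. This is a toy model under stated assumptions, not a description of any particular product: the class names `Replica` and `PrimaryBackupPair` and the explicit `replicate` step are invented for illustration.

```python
from collections import deque

class Replica:
    """One copy of the service state."""
    def __init__(self):
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class PrimaryBackupPair:
    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()
        self.pending = deque()  # changes the backup has not applied yet

    def write(self, key, value):
        # The primary applies the change at once; the backup gets it later.
        self.primary.apply(key, value)
        self.pending.append((key, value))

    def replicate(self):
        # Bring the warm backup up to date with everything pending.
        while self.pending:
            self.backup.apply(*self.pending.popleft())

    def fail_over(self):
        # The primary has crashed: the backup takes over with whatever it has.
        # Changes still pending are the price of the backup's lag.
        lost = list(self.pending)
        self.pending.clear()
        self.primary, self.backup = self.backup, Replica()
        return lost

pair = PrimaryBackupPair()
pair.write("order-1", "paid")
pair.replicate()
pair.write("order-2", "paid")   # not yet replicated
lost = pair.fail_over()
print(lost)
print(pair.primary.state)
```

The service keeps running after the failover, but the second order is missing from the new primary. How much lag to tolerate is the central design choice in this style of system.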

34
Complexity
  • The looming threat to your e-commerce solution,
    no matter what it may be
  • Even simple systems are hard to make reliable
  • Complex systems are almost impossible to make
    reliable
  • Yet innovative e-commerce projects often require
    fairly complex technologies!

35
Two Side-by-Side Case Studies
  • American Advanced Automation System
  • Intended as replacement for air traffic control
    system
  • Needed because Pres. Reagan fired many
    controllers in 1981
  • But project was a fiasco, lost $6B
  • French Phidias System
  • Similar goals, slightly less ambitious
  • But rolled out, on time and on budget, in 1999

36
Background
  • Air traffic control systems are using 1970s
    technology
  • Extremely costly to maintain and impossible to
    upgrade
  • Meanwhile, load on controllers is rising steadily
  • Can't easily reduce load

37
Air Traffic Control system (one site)
[Diagram labels: Onboard, Radar, X.500 Directory, Team of Controllers, Air Traffic Database (flight plans, etc.)]
38
Politics
  • Government wanted to upgrade the whole thing,
    solve a nagging problem
  • Controllers demanded various simplifications and
    powerful new tools
  • Everyone assumed that what you use at home can be
    adapted to the demands of an air traffic control
    center

39
Technology
  • IBM bid the project, proposed to use its own
    workstations
  • These aren't super reliable, so they proposed to
    adapt a new approach to fault-tolerance
  • Idea is to plan for failure
  • Detect failures when they occur
  • Automatically switch to backups

40
Core Technical Issue?
  • Problem revolves around high availability
  • Waiting for restart is not seen as an option:
    the goal is 10 sec of downtime in 10 years
  • So IBM proposed a replication scheme much like
    the load balancing approach
  • IBM had primary and backup simply do the same
    work, keeping them in the same state
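
To see how demanding a goal of 10 seconds of downtime in 10 years is, it helps to turn it into an availability figure. The sketch below does the arithmetic; the comparison case (one half-hour restart per year, borrowing the restart time quoted earlier for a large database) is an assumption made for illustration, as are the function and variable names.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def availability(downtime_seconds, period_years):
    """Fraction of the period during which the service is up."""
    return 1.0 - downtime_seconds / (period_years * SECONDS_PER_YEAR)

# Stated goal: 10 seconds of downtime in 10 years.
goal = availability(10, 10)

# Assumed comparison: a single half-hour restart once a year.
one_restart_per_year = availability(30 * 60, 1)

print(f"goal:                 {goal:.10f}")
print(f"one restart per year: {one_restart_per_year:.6f}")
```

The goal works out to better than 99.99999% availability, while a single half-hour restart per year already drops the figure to about 99.994%. That gap is why waiting for a restart was ruled out.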

41
Technology
Conceptual flow of system:
radar -> find tracks -> identify flight -> look up record -> plan actions -> human action
IBM's fault-tolerant process pair concept:
[Diagram: the same flow drawn twice, as two parallel copies]
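
The process pair idea (a primary and a backup doing the same work at each stage, so that both stay in the same state) can be sketched as follows. This is a toy illustration only, not IBM's design: the class names, the `alive` flag, and the two stand-in stage functions are all invented for the example, and real systems must also handle failure detection, timing, and re-creating a lost replica.

```python
class Replica:
    """One copy of a deterministic pipeline stage, with its own state."""
    def __init__(self, fn):
        self.fn = fn
        self.alive = True
        self.count = 0  # state: how many inputs this copy has processed

    def process(self, item):
        self.count += 1
        return self.fn(item)

class ProcessPair:
    """Primary and backup both process every input."""
    def __init__(self, fn):
        self.replicas = [Replica(fn), Replica(fn)]

    def process(self, item):
        results = [r.process(item) for r in self.replicas if r.alive]
        if not results:
            raise RuntimeError("both replicas have failed")
        # Both copies are deterministic, so any surviving result will do.
        return results[0]

def run(pipeline, item):
    for stage in pipeline:
        item = stage.process(item)
    return item

def find_tracks(blip):
    return blip + 1      # stand-in for real track finding

def identify_flight(track):
    return track * 2     # stand-in for real flight identification

pipeline = [ProcessPair(find_tracks), ProcessPair(identify_flight)]
print(run(pipeline, 3))
pipeline[0].replicas[0].alive = False   # the primary of stage 1 crashes
print(run(pipeline, 3))                 # the backup's result is used instead
```

Because the backup has seen every input, it can take over without a restart. The cost is that every stage does its work twice and the two copies must never diverge, which is hard to guarantee under real-time constraints.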
42
Why is this Hard?
  • The system has many real-time constraints on it
  • Actions need to occur promptly
  • Even if something fails, we want the human
    controller to continue to see updates
  • IBM's technology:
  • Based on a research paper by Flaviu Cristian
  • But had never been used except for proof of
    concept purposes, on a small scale in the
    laboratory

43
Politics
  • IBM's proposal sounded good
  • and they were the second lowest bidder
  • and they had the most aggressive schedule
  • So the FAA selected them over alternatives
  • IBM took on the whole thing all at once

44
Disaster Strikes
  • Immediate confusion: all parts of the system
    seemed interdependent
  • To design part A I need to know how part B, also
    being designed, will work
  • Controllers didn't like early proposals and
    insisted on major changes to design
  • Fault-tolerance idea was one of the reasons IBM
    was picked, but made the system so complex that
    it went on the back burner

45
Summary of Simplifications
  • Focus on some core components
  • Postpone worry about fault-tolerance until later
  • Try and build a simple version that can be
    fleshed out later
  • But the simplification wasn't enough. Too many
    players kept intruding with requirements

46
Crash and Burn
  • The technical guys saw it coming
  • Probably as early as one year into the effort
  • But they kept it secret (bad news diode)
  • Anyhow, management wasn't listening (they've
    heard it all before: whining engineers!)
  • The fault-tolerance scheme didn't work
  • Many technical issues unresolved
  • The FAA kept out of the technical issues
  • But a mixture of changing specifications and
    serious technical issues were at the root of the
    problems

47
What came out?
  • In the USA, nothing.
  • The entire system was useless: the technology
    was of an all-or-nothing style and nothing was
    ready to deploy
  • British later rolled out a very limited version
    of a similar technology, late, with many bugs,
    but it does work

48
Contrast with French
  • They took a very incremental approach
  • Early design sought to cut back as much as
    possible
  • If it isn't mandatory, don't do it yet
  • Focus was on console cluster architecture and
    fault-tolerance
  • They insisted on using off-the-shelf technology

49
Contrast with French
  • Managers intervened in technology choices
  • For example, the vendor wanted to do a home-brew
    fault-tolerance technology
  • French insisted on a specific existing technology
    and refused to bid out the work until vendors
    accepted
  • A critical good call as it worked out

50
Learning by Doing
  • To gain experience with technology
  • They tested, and tested, and tested
  • Designed simple prototypes and played with them
  • Discovered that large cluster would perform
    poorly
  • But found a sweet spot and worked within it
  • This forced project to cut back on some goals

51
Testing
  • 9/10 of the time and expense on any system goes into:
  • Testing
  • Debugging
  • Integration
  • Many projects overlook this
  • French planned conservatively

52
Software Bugs
  • Figure on a bug in 1 of every 10 lines of new code
  • But still as many as 1 in 250 lines of old code
  • Bugs show up under stress
  • Trick is to run a system in an unstressed mode
  • French identified stress points and designed to
    steer far from them
  • Their design also assumed that components would
    fail and automated the restart
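
The rule-of-thumb bug rates above (1 in 10 lines of new code, 1 in 250 lines of old code) can be turned into a rough estimate for a whole system. The sketch below only does that arithmetic; the function name and the example line counts are invented for illustration, and the rates are the slide's own estimates rather than measured values.

```python
def expected_bugs(new_lines, old_lines, new_rate=1 / 10, old_rate=1 / 250):
    """Rough bug estimate from the rule-of-thumb rates quoted above."""
    return new_lines * new_rate + old_lines * old_rate

# Hypothetical system: 1,000 lines of new code on top of 25,000 lines of old code.
estimate = expected_bugs(1_000, 25_000)
print(round(estimate))
```

In this hypothetical case the small amount of new code contributes as many expected bugs as the much larger body of old code, which is one argument for buying or reusing proven components rather than building new ones.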

53
All of this worked!
  • Take-aways from French project?
  • Complex technical issues at the core of the
    system
  • But they managed to break big project into pieces
  • Do the critical core first, separately, and focus
    exclusively on it
  • Test, test, test
  • Don't build anything you can possibly buy
  • Management was technically sophisticated enough
    to make some critical calls

54
Your Problem
  • e-commerce systems are at e-risk
  • These e-risks take many forms
  • System complexity
  • Failure to plan for failures
  • Poor project management
  • Ignore this at our peril, as we've seen
  • But how can we learn to do better?

55
Keys to Reliability
  • Know the basic technologies
  • Realize that software is buggy and failures will
    happen.
  • Design to treat failure as a mundane event
  • Failure to plan for failure is the biggest
    e-risk!
  • Complexity is a huge threat. Use your naiveté as
    an advantage: if you can't understand it, why
    assume that they can understand it?
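
"Design to treat failure as a mundane event" can be illustrated with a small supervisor that restarts a component automatically when it crashes, and only escalates after repeated failures. This is a minimal sketch under stated assumptions: `run_with_restart`, `flaky_service`, and the restart limit are hypothetical names and values chosen for the example.

```python
def run_with_restart(start_component, max_restarts=3):
    """Run a component, restarting it automatically when it crashes.

    Returns the component's result and the number of restarts needed.
    """
    restarts = 0
    while True:
        try:
            return start_component(), restarts
        except Exception as exc:
            if restarts >= max_restarts:
                raise  # too many failures: escalate to a human
            restarts += 1
            print(f"component failed ({exc}); restart #{restarts}")

# A stand-in component that crashes twice before it works.
attempts = {"n": 0}

def flaky_service():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("crash")
    return "served request"

result, restarts = run_with_restart(flaky_service)
print(result, restarts)
```

The point of the design is that a crash is handled by routine code rather than by an emergency call to the engineering team.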

56
E-commerce Technologies
  • The network and associated services
  • Databases
  • Web servers
  • Scripts: the glue your people use to tie it
    all together

57
Next Lecture
  • Look at some realistic e-commerce systems
  • Ask ourselves where to start first, if we need to
    convince ourselves that the system will be
    reliable enough
  • Trick is to balance between system complexity and
    adequate risk coverage