Scott Andersen - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Scott Andersen


1
Designing a large mail system
  • Scott Andersen
  • Chairman, IASA Board of Education

2
Big Rocks
  • Design for the business
  • Gather requirements
  • Understand what is needed vs what is wanted
  • Design from the technology not TO the technology
  • Design for operations
  • Can I recover this?
  • Can I meet my SLA
  • Design for the next migration
  • Sustained innovation

3
Architectural Decisions: A matter of scope
  • Product family scope (product family architecture
    decisions): decisions optimize over the whole, making
    tradeoffs and compromises across the products for
    the overall good of the whole
  • Product A scope, Product B scope (product
    architecture decisions): decisions made here are
    tuned to product A
  • Component scope
  • Architecture is the set of decisions that cannot
    be delegated without compromising overall system
    objectives.

4
Design for the business
  • What does the business need from the solution?
  • Reliability
  • Scalability
  • Process enablement
  • Process Improvement

5
Design for the business
  • What does the business want from the solution?
  • Enable new technologies?
  • Mobility
  • Remote access
  • Any time, anywhere access to my email?
  • Reliable solution (it's just there)

6
Design from the technology
  • Selecting the right solution
  • Map business needs to the solutions
  • Map migration considerations to the solutions
  • Map operations requirements to the solutions
  • Design criteria: the weeds of the solution
  • Number of users today
  • Size of mailbox
  • User growth (can it happen in this economy?)
  • Storage design and requirements
  • Amount of mail moved
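
The design criteria above (number of users, mailbox size, user growth) feed directly into the storage estimate. A minimal sketch of that arithmetic follows; the function name, the 20% overhead factor, and all the example figures are illustrative assumptions, not numbers from this presentation.

```python
def estimate_mail_storage_gb(users, mailbox_gb, annual_growth, years, overhead=0.2):
    """Rough capacity estimate: projected users x mailbox quota, plus overhead.

    Every input is a planning assumption, not a measurement. `overhead` is a
    hypothetical allowance for indexes, logs and free-space headroom.
    """
    projected_users = users * (1 + annual_growth) ** years
    raw_gb = projected_users * mailbox_gb
    return raw_gb * (1 + overhead)


# Hypothetical example: 10,000 users, 2 GB quota, 5% yearly growth, 3-year horizon.
estimate = estimate_mail_storage_gb(10_000, 2.0, 0.05, 3)
print(f"Plan for roughly {estimate:,.0f} GB of mailbox storage")
```

Changing the growth rate or the planning horizon shows quickly how sensitive the storage design is to the user-growth assumption.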

7
Design from the technology
  • Are mobile users worse than a virus?
  • Alignment with the business
  • Users: "I want to keep all my mail"
  • Legal or HR: "we would like to be able to capture
    the mail of specific users"
  • Legal: "we do not want mail older than XX days
    stored on our network"
  • Operations: "we want to be able to back up this
    solution every night in our available window"
  • Management: "we want a mail system that enables
    productivity and collaboration without limiting
    users"
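
The operations requirement above (a full backup every night inside the available window) can be checked with simple arithmetic before any hardware is chosen. The sketch below assumes a single sustained throughput figure; the function name and the example numbers are hypothetical.

```python
def backup_fits_window(data_gb, throughput_mb_per_s, window_hours):
    """Return (fits, hours_needed) for a full backup at a sustained throughput.

    Assumes 1 GB = 1024 MB and a constant transfer rate, which real backup
    jobs rarely achieve; treat the result as a best case.
    """
    seconds_needed = (data_gb * 1024) / throughput_mb_per_s
    hours_needed = seconds_needed / 3600
    return hours_needed <= window_hours, hours_needed


# Hypothetical example: 20 TB of mail, 400 MB/s sustained, 8-hour nightly window.
fits, hours = backup_fits_window(20 * 1024, 400, 8)
print(f"Needs {hours:.1f} h; fits in window: {fits}")
```

When the answer is "does not fit", the choices are the ones the slide implies: smaller mailboxes, a retention limit, more partitions backed up in parallel, or a longer window.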

8
Design for operability
  • System-to-admin ratio helps understand admin costs
  • Tracking overall ops costs rather than head count
    doesn't work
  • Outsourcing will save ½ the expense without
    solving the underlying issue
  • Inefficient properties: 2:1
  • Average property: 150:1
  • Live Search: 2,200:1
  • Autopilot often cited as the solution
  • Autopilot only part of the solution
  • Hardest part is addressing the apps issues
  • 80% of operational problems have genesis in
    design & development

9
Four Major Error Classes
  • Human operator error is the leading cause of
    dependability problems in many domains
  • In my experience O/L S/W failures
    underrepresented in above
  • H/W issues considerably less common
  • Automation reduces costs while also eliminating
    admin interaction
  • Every interaction brings risk of error

Source: D. Patterson et al., "Recovery Oriented
Computing (ROC): Motivation, Definition,
Techniques, and Case Studies," UC Berkeley
Technical Report UCB//CSD-02-1175, March 2002.
10
What does operations do?
  • Teams: Messenger, Contacts and Storage, OSSG and
    BUIT services
  • 51% of time spent on deployment & incident
    management (known resolution technique)

Source: Deepak Patil, WLO (8/14/2006)
11
ROC Design Priorities
  • Recovery Oriented Computing (ROC)
  • Assume everything will fail
  • S/W & H/W can fail at any time
  • Build redundancy at all levels in system
  • Scale out rather than up
  • "Gold Plated" machines not much more reliable
  • S/W failures dominate H/W
  • More workload on a single system increases
    potential failure impact
  • Even good H/W still fails more frequently than
    typical SLAs allow
  • Reduce operations costs
  • Inexpensive hardware slice
  • 1 to 2 orders of magnitude fewer systems
    engineers
  • Increase reliability through redundancy & less
    operator interaction
  • Goal: 24x7 availability with 8x5 operations
  • Lights out operation is more reliable
  • Write system such that S/W quality problems are
    reported but don't show to customers. It should
    take many failures to miss SLA
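
The case for redundancy on inexpensive hardware can be illustrated with a small availability calculation. This is a sketch under a strong assumption: replicas fail independently and the service is up whenever at least one replica is up. The 99% per-node figure is hypothetical.

```python
def availability_with_redundancy(node_availability, replicas):
    """Availability if the service survives as long as one replica is up.

    Assumes independent failures, which correlated S/W bugs or shared
    infrastructure can violate in practice.
    """
    return 1 - (1 - node_availability) ** replicas


def downtime_minutes_per_year(availability):
    """Expected unavailable minutes in a 365-day year."""
    return (1 - availability) * 365 * 24 * 60


# Hypothetical: each inexpensive node is up 99% of the time.
for n in (1, 2, 3):
    a = availability_with_redundancy(0.99, n)
    print(f"{n} replica(s): {a:.6f} available, "
          f"{downtime_minutes_per_year(a):.1f} min/year down")
```

Under these assumptions a single node cannot meet a typical SLA on its own, while a few replicas mean it takes several simultaneous failures to miss it.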

12
Overall Application Design
  • Implement ROC Principles
  • Service Design Best Practices
  • Single-box deployment
  • Development and test in full environment
  • Quick service health check
  • Zero trust of underlying components: assume
    failure
  • Pod or cluster independence
  • Implement & test ops tools & disaster response
  • Simplicity in all things
  • Partition everything
  • Version everything

13
Dependency Management
  • Expect latency & failures in dependent services
  • Run on cached data or offer degraded services
  • Test failure & latency frequently in production
  • Don't depend upon features not yet shipped
  • It takes time to work out reliability & scaling
    issues
  • Select dependent components & services
    thoughtfully
  • On-server components need consistent quality
    goals
  • Dependent services should be large granule
    (worth sharing)
  • Isolate services & decouple components
  • Contain faults within services
  • Assume different upgrade rates
  • Rather than auth on each connect, use session key
    and refresh every N hours (avoids login storms)
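
"Run on cached data or offer degraded services" can be sketched as a thin wrapper around the call to a dependent service. The class name, the staleness limit and the simulated directory lookup below are all hypothetical; a real `fetch` would also need its own timeout so that latency turns into a failure the wrapper can catch.

```python
import time


class CachedDependency:
    """Call a dependent service; fall back to the last good value if it fails."""

    def __init__(self, fetch, max_stale_seconds):
        self._fetch = fetch                  # callable that may raise or time out
        self._max_stale = max_stale_seconds  # how old a cached value may be
        self._value = None
        self._fetched_at = None

    def get(self):
        try:
            self._value = self._fetch()
            self._fetched_at = time.monotonic()
            return self._value, "fresh"
        except Exception:
            age_ok = (
                self._fetched_at is not None
                and time.monotonic() - self._fetched_at <= self._max_stale
            )
            if age_ok:
                return self._value, "degraded"  # serve cached data, flag it
            raise  # nothing usable cached: contain the fault at the caller


# Simulated dependency: succeeds once, then times out.
calls = {"n": 0}

def flaky_lookup():
    calls["n"] += 1
    if calls["n"] > 1:
        raise TimeoutError("directory service timed out")
    return ["alice@example.com"]


directory = CachedDependency(flaky_lookup, max_stale_seconds=300)
first = directory.get()
second = directory.get()
print(first[1], second[1])
```

Returning the "degraded" flag alongside the data lets the service report the problem to operations while still answering the user.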

14
Release Cycle & Testing
  • Ship often (full release every 90 days)
  • Small releases ship more smoothly
  • Increases pace of innovation
  • Long stabilization periods not required in
    services
  • Support 1 version, no-back-level support, 1
    configuration, installed in 1 way, in 1
    environment
  • Use production data to find problems (traffic
    capture)
  • Measurable release criteria
  • Release criteria includes quality and throughput
    data
  • Never deploy anything without tested roll-back
  • Test in production via incremental deployment &
    roll-back
  • Track all recovered errors to protect against
    automation-supported service entropy
  • Test all error paths in integration & in
    production
  • Continue testing after release
  • 2% to 5% load from automated testing is affordable
    and finds errors FAST
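
Incremental deployment with a tested roll-back can be sketched as a loop over independent pods. The function and its `deploy`, `health_check` and `rollback` callables are hypothetical placeholders for whatever tooling a team actually uses; the example simulates one pod failing its health check.

```python
def incremental_deploy(pods, deploy, health_check, rollback):
    """Deploy one pod at a time; on the first failed check, roll back every
    pod touched so far (most recent first) and stop.

    Returns (succeeded, pods_touched).
    """
    touched = []
    for pod in pods:
        deploy(pod)
        touched.append(pod)
        if not health_check(pod):
            for done in reversed(touched):
                rollback(done)
            return False, touched
    return True, touched


# Simulated environment: three pods on v1, with pod-2 failing after upgrade.
versions = {"pod-1": "v1", "pod-2": "v1", "pod-3": "v1"}

ok, touched = incremental_deploy(
    list(versions),
    deploy=lambda p: versions.__setitem__(p, "v2"),
    health_check=lambda p: p != "pod-2",  # pretend pod-2 is unhealthy on v2
    rollback=lambda p: versions.__setitem__(p, "v1"),
)
print(ok, touched, versions)
```

Because the roll-back path runs in the same loop as the deployment, it gets exercised whenever a release goes wrong rather than only in a disaster.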

15
Design for the next migration
  • The great Oklahoma Land Grab
  • Would a GPS have been an advantage?
  • Why are migrations like a land grab?
  • Can you sustain innovation?

16
Migrations: many moving pieces
17
The IO of migrations
Basic: it sounds simple; even the word is often applied to simple things. Yet Basic represents the most complex state an organization can be in prior to a migration. Basic means that the overall IT maturity is low, in the sense that there is not a lot of automation. This often occurs in older organizations or organizations that have recently undergone severe budget cuts. In this scenario the migration will only be successful if the organization is moved up a level (to Standardized) or gains a competitive advantage from the migration. A friend of mine once described these folks as being in "job protection mode" all the time.

Standardized is our next tier. Just like Basic, Standardized sounds like an easy state to migrate from. In fact, while it represents a more automated IT environment than Basic does, it is still quite manual. Manual process is what drives cost and risk in migrations. So how do we help a Standardized IT shop move forward? The first thing is to help them build a plan for a Dynamic IT environment: they see the business value of moving to a Dynamic IT environment but are looking for the project that can lead them to this state. A competitive advantage is always a benefit for the customer.

Rationalized represents an organization well on the way to maturity. This is the first state that allows for easier migrations within the organization. Here we have an IT team that has automated many of the processes and procedures that involve touching end users. This is the first step in helping customers be consistently successful with their migrations. Consistently successful migrations lead to additional migrations and allow for the move from migration to transition.

Dynamic organizations move quickly, recognizing the business value of IT improvement and leveraging that improvement to automate and streamline processes. A Dynamic organization is ready for transition and no longer migrates anything; it has moved to the concept of transition.