Title: Scott Andersen
1Designing a large mail system
- Scott Andersen
- Chairman, IASA Board of Education
2Big Rocks
- Design for the business
- Gather requirements
- Understand what is needed vs what is wanted
- Design from the technology not TO the technology
- Design for operations
- Can I recover this?
- Can I meet my SLA?
- Design for the next migration
- Sustained innovation
3Architectural Decisions: A matter of scope
- Product family scope: product family architecture decisions optimize over the whole, making tradeoffs and compromises across the products for the overall good of the whole
- Product scope (Product A, Product B): product architecture decisions made here are tuned to product A
- Component scope
- Architecture is the set of decisions that cannot be delegated without compromising overall system objectives.
4Design for the business
- What does the business need from the solution?
- Reliability
- Scalability
- Process enablement
- Process Improvement
5Design for the business
- What does the business want from the solution?
- Enable new technologies?
- Mobility
- Remote access
- Any time, anywhere access to my email?
- Reliable solution (it's just there)
6Design from the technology
- Selecting the right solution
- Map business needs to the solutions
- Map migration considerations to the solutions
- Map operations requirements to the solutions
- Design criteria: the weeds of the solution
- Number of users today
- Size of mailbox
- User growth (can it happen in this economy?)
- Storage design and requirements
- Amount of mail moved
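The design criteria above feed a simple sizing calculation. The following is a minimal sketch, not a vendor sizing tool: the `SizingInputs` fields, the overhead factor, and all sample numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SizingInputs:
    users_today: int            # number of users today
    mailbox_quota_gb: float     # size of mailbox, per user
    annual_user_growth: float   # e.g. 0.05 for 5% per year (assumed figure)
    overhead_factor: float      # assumed allowance for indexes, logs, free space


def users_after(inputs: SizingInputs, years: int) -> int:
    """Project the user count with compound annual growth, rounded to whole users."""
    return round(inputs.users_today * (1 + inputs.annual_user_growth) ** years)


def storage_tb(inputs: SizingInputs, years: int) -> float:
    """Mailbox storage in TB if every projected user fills the quota."""
    total_gb = (users_after(inputs, years)
                * inputs.mailbox_quota_gb
                * inputs.overhead_factor)
    return total_gb / 1024


if __name__ == "__main__":
    # Hypothetical organization; replace with measured values.
    inputs = SizingInputs(users_today=10_000, mailbox_quota_gb=2.0,
                          annual_user_growth=0.05, overhead_factor=1.2)
    for year in (0, 3):
        print(f"year {year}: {users_after(inputs, year)} users, "
              f"{storage_tb(inputs, year):.1f} TB")
```

Running the projection for several years shows whether the storage design still holds at the next migration, not only on day one.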
7Design from the technology
- Are mobile users worse than a virus?
- Alignment with the business
- Users: I want to keep all my mail
- Legal or HR: we would like to be able to capture the mail of specific users
- Legal: we do not want mail older than XX days stored on our network
- Operations: we want to be able to back up this solution every night in our available window
- Management: we want a mail system that enables productivity and collaboration without limiting users
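Two of the requirements above, the nightly backup window and the retention limit, can be checked with simple arithmetic. This is a rough sketch under stated assumptions: the function names are hypothetical, throughput is treated as constant, and the retention age stays a parameter because the slide leaves it as XX.

```python
from datetime import datetime, timedelta


def backup_hours(data_gb: float, throughput_gb_per_hour: float) -> float:
    """Hours a full backup takes at a sustained (assumed constant) throughput."""
    if throughput_gb_per_hour <= 0:
        raise ValueError("throughput must be positive")
    return data_gb / throughput_gb_per_hour


def fits_window(data_gb: float, throughput_gb_per_hour: float,
                window_hours: float) -> bool:
    """True when the full backup finishes inside the nightly window."""
    return backup_hours(data_gb, throughput_gb_per_hour) <= window_hours


def is_past_retention(received: datetime, now: datetime,
                      max_age_days: int) -> bool:
    """True when a message is older than the retention limit set by Legal."""
    return now - received > timedelta(days=max_age_days)
```

If `fits_window` returns False for the projected mail store, the conflict between "keep all my mail" and the operations window has to be resolved in the design, for example by splitting the store into smaller units that back up in parallel.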
8Design for operability
- System-to-admin ratio helps understand admin costs
- Tracking overall ops costs rather than head count doesn't work
- Outsourcing will save ½ the expense without solving the underlying issue
- Inefficient properties: 2:1
- Average property: 150:1
- Live Search: 2,200:1
- Autopilot often cited as the solution
- Autopilot only part of the solution
- Hardest part is addressing the app issues
- 80% of operational problems have genesis in design and development
9Four Major Error Classes
- Human operator error is the leading cause of
dependability problems in many domains
- In my experience, O/L S/W failures are underrepresented in the above
- H/W issues considerably less common
- Automation reduces costs while also eliminating admin interaction
- Every interaction brings risk of error
Source: D. Patterson et al., "Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies", UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.
10What does operations do?
- Teams: Messenger, Contacts and Storage, OSSG and BUIT services
- 51% of time spent on deployment and incident management (known resolution technique)
Source: Deepak Patil, WLO (8/14/2006)
11ROC Design Priorities
- Recovery Oriented Computing (ROC)
- Assume everything will fail
- S/W and H/W can fail at any time
- Build redundancy at all levels in system
- Scale out rather than up
- Gold-plated machines not much more reliable
- S/W failures dominate H/W
- More workload on a single system increases potential failure impact
- Even good H/W still fails more frequently than typical SLAs allow
- Reduce operations costs
- Inexpensive hardware slice
- 1 to 2 orders of magnitude fewer systems engineers
- Increase reliability through redundancy and less operator interaction
- Goal: 24x7 availability with 8x5 operations
- Lights-out operation is more reliable
- Write system such that S/W quality problems are reported but don't show to customers. It should take many failures to miss SLA
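The argument for redundancy over gold-plated machines can be made concrete with a small calculation. The sketch below assumes replicas fail independently, which real systems only approximate (shared power, network, and software bugs correlate failures); the function names and sample figures are illustrative, not from the talk.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of a service that is up when at least one replica is up.

    Assumes independent failures: the service is down only when every
    replica is down at once, which has probability (1 - single) ** replicas.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - single) ** replicas


def replicas_needed(single: float, target: float) -> int:
    """Smallest replica count whose combined availability meets the target."""
    if not (0.0 < single < 1.0 and 0.0 < target < 1.0):
        raise ValueError("single and target must be strictly between 0 and 1")
    count = 1
    while combined_availability(single, count) < target:
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical commodity server that is up 99% of the time.
    for n in (1, 2, 3):
        print(n, f"{combined_availability(0.99, n):.6f}")
```

Under the independence assumption, a few inexpensive machines exceed an availability that no single machine reaches, which is the sense in which it should take many failures to miss the SLA.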
12Overall Application Design
- Implement ROC Principles
- Service Design Best Practices
- Single-box deployment
- Development and test in full environment
- Quick service health check
- Zero trust of underlying components: assume failure
- Pod or cluster independence
- Implement and test ops tools and disaster response
- Simplicity in all things
- Partition everything
- Version everything
13Dependency Management
- Expect latency and failures in dependent services
- Run on cached data or offer degraded services
- Test failure and latency frequently in production
- Don't depend upon features not yet shipped
- It takes time to work out reliability and scaling issues
- Select dependent components and services thoughtfully
- On-server components need consistent quality goals
- Dependent services should be large granule (worth sharing)
- Isolate services and decouple components
- Contain faults within services
- Assume different upgrade rates
- Rather than auth on each connect, use session key and refresh every N hours (avoids login storms)
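"Run on cached data or offer degraded services" can be sketched as a thin wrapper around a dependent service. This is one possible shape, not the approach of any specific product: the `CachedDependency` class, its status strings, and the staleness limit are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class CachedDependency(Generic[T]):
    """Wraps a call to a dependent service and falls back to cached data."""

    def __init__(self, fetch: Callable[[str], T], max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[T, float]] = {}

    def get(self, key: str) -> Tuple[Optional[T], str]:
        """Return (value, status); status is 'fresh', 'stale' or 'unavailable'."""
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency failed: serve cached data if it is recent enough.
            cached = self._cache.get(key)
            if cached is not None and self._clock() - cached[1] <= self._max_stale:
                return cached[0], "stale"
            # Nothing usable: the caller offers a degraded service instead.
            return None, "unavailable"
        self._cache[key] = (value, self._clock())
        return value, "fresh"
```

The caller decides what "degraded" means for each status, for example showing a directory entry marked as possibly out of date rather than failing the whole request. A production version would also bound the time spent waiting on `fetch`, since the slide lists latency alongside failure.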
14Release Cycle and Testing
- Ship often (full release every 90 days)
- Small releases ship more smoothly
- Increases pace of innovation
- Long stabilization periods not required in services
- Support 1 version, no back-level support, 1 configuration, installed in 1 way, in 1 environment
- Use production data to find problems (traffic capture)
- Measurable release criteria
- Release criteria includes quality and throughput data
- Never deploy anything without tested roll-back
- Test in production via incremental deployment and roll-back
- Track all recovered errors to protect against automation-supported service entropy
- Test all error paths in integration and in production
- Continue testing after release
- 2 to 5% load from automated testing is affordable and finds errors FAST
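Incremental deployment with tested roll-back can be outlined in a few lines. The sketch below is an assumption-laden illustration: `deploy`, `rollback`, and `healthy` are hypothetical callbacks supplied by the caller, and the stage fractions are arbitrary defaults rather than figures from the talk.

```python
from typing import Callable, List, Sequence


def incremental_deploy(hosts: Sequence[str],
                       deploy: Callable[[str], None],
                       rollback: Callable[[str], None],
                       healthy: Callable[[str], bool],
                       stages: Sequence[float] = (0.1, 0.5, 1.0)) -> bool:
    """Deploy in growing stages; roll everything back if any stage is unhealthy.

    Each stage value is the cumulative fraction of hosts that should be
    running the new release once that stage completes.
    """
    done: List[str] = []
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(done):target]:
            deploy(host)
            done.append(host)
        # The release criterion is checked on every host deployed so far.
        if not all(healthy(host) for host in done):
            for host in reversed(done):
                rollback(host)
            return False
    return True
```

Because the roll-back path runs whenever a stage fails its health check, it is exercised as part of normal releases instead of only during an emergency.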
15Design for the next migration
- The great Oklahoma Land Grab
- Would a GPS have been an advantage?
- Why are migrations like a land grab?
- Can you sustain innovation?
16Migrations: many moving pieces
17The IO of migrations
Basic: it sounds simple, and the word is often applied to simple things. Yet Basic represents the most complex state an organization can be in prior to a migration. Basic means that overall IT maturity is low, in the sense that there is not a lot of automation. This often occurs in older organizations or in organizations that have recently undergone severe budget cuts. In this scenario the migration will only be successful if the organization is moved up a level (to Standardized) or gains a competitive advantage from the migration. A friend of mine once described these folks as being in "job protection mode" all the time.

Standardized is our next tier. Just like Basic, Standardized sounds like an easy state to migrate from. In fact, while it represents a more automated IT environment than Basic does, it is still quite manual. Manual process is what drives cost and risk in migrations. So how do we help a Standardized IT shop move forward? The first thing is to help them build a plan for a Dynamic IT environment. They see the business value of moving to a Dynamic IT environment but are looking for the project that can lead them to this state. A competitive advantage is always a benefit for the customer.

Rationalized represents an organization well on the way to maturity. This is the first state that allows for easier migrations within the organization. Here we have an IT team that has automated many of the processes and procedures that involve touching end users. This is the first step in helping customers be consistently successful with their migrations. Consistently successful migrations lead to additional migrations and allow for the move from migration to transition.

Dynamic organizations move quickly, recognizing the business value of IT improvement and leveraging that improvement to automate and streamline processes. A Dynamic organization is ready for transition and no longer migrates anything: it has moved to the concept of transition.