Title: Building Fault-Tolerant Enterprise Applications
1. Building Fault-Tolerant Enterprise Applications
- Erin Mulder
- Chariot Solutions
- chariotsolutions.com
- Brian McCallister
- Fort Hill Company
- forthillcompany.com
2. Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
3. Goals of Fault Tolerance
What are we really worried about?
- Availability
- Integrity
- Confidentiality
- Usability
- Cost
4. Goals of Fault Tolerance
What can go wrong?
- User Error
- Concurrent Changes
- Bugs
- Resource Failure/Downtime
- System Overload
- Misconfiguration
- Sabotage
5. Goals of Fault Tolerance
Themes we'll keep visiting
- Prevention
- Code Guidelines and Reviews
- Automated Validation and Testing
- Performance / Stress Testing
- Detection
- Logging and Auditing
- Validation Patterns
- Monitoring
- Recovery
- Exception handling patterns
- Error feedback loop
- Redundancy
6. Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
7. User Recoverable Errors
Simple validation error
- What do you do when the user
- Leaves a required field blank
- Enters a value too big for the database field
- Types letters in a numeric field
- Selects inconsistent options
- Tries to do things in the wrong order
8. User Recoverable Errors
Simple validation error
- Fault tolerance is more than detection
- Prevent the user from making errors
- Set maxlengths on input fields
- Use character masks
- Specify units
- Show example input
- Don't allow the selection of inconsistent options
- Don't present navigation options that aren't meant to be followed
9. User Recoverable Errors
Simple validation error
- Help the user recover quickly
- Highlight all errors clearly
- Show help text and examples for invalid fields
- If some other action is required first, launch it instead of interrupting the flow with frustrating errors
- Perception is everything!
- Log the error for later analysis
- Save enough information to recreate
- Start automatically handling common mistakes
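The advice to highlight all errors clearly can be sketched as a validator that collects every problem in one pass instead of stopping at the first. This is only an illustration: the class name `SignupValidator`, the field names `name` and `quantity`, and the 40-character limit are assumptions, not part of the original slides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical validator: gathers every field error in one pass so the
// page can highlight all of them together, with an example in each message.
final class SignupValidator {
    static final int MAX_NAME_LENGTH = 40; // assumed database column size

    // Returns field name -> user-facing message; an empty map means valid.
    static Map<String, String> validate(Map<String, String> form) {
        Map<String, String> errors = new LinkedHashMap<>();

        String name = field(form, "name");
        if (name.isEmpty()) {
            errors.put("name", "Name is required.");
        } else if (name.length() > MAX_NAME_LENGTH) {
            errors.put("name", "Name must be at most " + MAX_NAME_LENGTH + " characters.");
        }

        String quantity = field(form, "quantity");
        if (quantity.isEmpty()) {
            errors.put("quantity", "Quantity is required, for example 3.");
        } else {
            try {
                if (Integer.parseInt(quantity) < 1) {
                    errors.put("quantity", "Quantity must be at least 1.");
                }
            } catch (NumberFormatException e) {
                errors.put("quantity", "Quantity must be a whole number, for example 3.");
            }
        }
        return errors;
    }

    // Treats a missing or null field the same as a blank one.
    private static String field(Map<String, String> form, String key) {
        String value = form.get(key);
        return value == null ? "" : value.trim();
    }
}
```

Returning a map keyed by field name lets the page both list the errors at the top and mark each offending field.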
10. User Recoverable Errors
Optimistic concurrency clash
- Everything looks good until the save
- Then
- Item has just gone out of stock
- Another user has just updated the same document
- Time has passed and action is no longer allowed
11. User Recoverable Errors
Optimistic concurrency clash
- Increase save points
- Alert user to potential risk
- Low stock
- Another user just accessed this record
- Another user has soft lock on record
- Offer useful options for resolving collision
- Merge changes
- Backorder
- Automatically retry later
- Email me when it is available
- Give tips for avoiding future collisions
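One common way to detect the clash at save time is a version number on each record. The sketch below uses an in-memory map as a stand-in for the database; `VersionedStore` and its methods are hypothetical names, and a real system would typically do the same check with a version column in the UPDATE statement.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory store illustrating a version check on save.
final class VersionedStore {
    // Immutable snapshot of a record plus the version it was read at.
    static final class Versioned {
        final String value;
        final long version;

        Versioned(String value, long version) {
            this.value = value;
            this.version = version;
        }
    }

    private final Map<String, Versioned> rows = new ConcurrentHashMap<>();

    void insert(String key, String value) {
        rows.put(key, new Versioned(value, 1));
    }

    Versioned read(String key) {
        return rows.get(key);
    }

    // Saves only if nobody else has saved since expectedVersion was read.
    // Returns false on a clash so the caller can offer merge or retry options.
    boolean update(String key, String newValue, long expectedVersion) {
        Versioned current = rows.get(key);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // replace() succeeds only if the row is still the snapshot we checked.
        return rows.replace(key, current, new Versioned(newValue, expectedVersion + 1));
    }
}
```

A `false` result is the point at which the application can present the resolution options listed above rather than a bare error.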
12. User Recoverable Errors
Bookmarks, back buttons and browsers
- User escapes normal page flow
- Bookmarks login page or internal page
- Uses back button
- Opens a new window within same session
- Session times out
- Missing context from previous requests
- Next click is like bookmark to internal page
- Other browser oddities
- Double-clicking submit buttons
- Pressing stop button in the middle of a request
13. User Recoverable Errors
Bookmarks, back buttons and sessions
- Prevention is difficult: the user is in control
- JavaScript can sometimes help
- JavaScript can sometimes hurt
- Plan for and test each of these scenarios
- Plan for handling out-of-sequence requests
14. User Recoverable Errors
Bookmarks, back buttons and sessions
- To seamlessly handle session timeouts and out-of-sequence requests, consider
- Persistent sessions (saved to database)
- Passing state in every request (form fields or URL rewriting)
- Storing state in custom cookies
- Adding custom logic to recover from timed-out sequences
- To simply detect and alert, consider
- Using listener to catch session expiration
- Using state validation to catch out-of-sequence requests
- Redirecting user to session expiration page
- To improve process
- Log session losses (requests within expired session)
- Consider increasing session timeout
- Consider using prevention techniques described above
- Increase save points
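State validation for out-of-sequence requests can be as small as tracking how far the user has legitimately progressed. The following is a rough sketch, assuming a four-step checkout; `CheckoutFlow` and the step names are invented for illustration, and one instance would be kept per user session.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical per-session tracker for a multi-page flow.
final class CheckoutFlow {
    static final List<String> STEPS = Arrays.asList("cart", "shipping", "payment", "confirm");

    private int completed = 0; // number of steps finished so far

    // A step may be shown if it was already completed (back button)
    // or if it is the next step in sequence.
    boolean mayEnter(String step) {
        int index = STEPS.indexOf(step);
        return index >= 0 && index <= completed;
    }

    // Where to send a request that arrived out of sequence.
    String resumeStep() {
        return STEPS.get(Math.min(completed, STEPS.size() - 1));
    }

    // Records a finished step; ignored unless it is the expected next step.
    void complete(String step) {
        if (STEPS.indexOf(step) == completed) {
            completed++;
        }
    }
}
```

When `mayEnter` returns false, redirecting to `resumeStep()` puts the user back on a valid page instead of showing an error.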
15. User Recoverable Errors
Bookmarks, back buttons and sessions
- To minimize impact of back button, consider
- Techniques described for out-of-sequence requests
- Redirecting to GETs instead of returning responses to POSTs
- To work around double submissions, consider
- Disabling submit button after first click
- Susceptible to Stop button or request timeout
- Minimizing response times!
- Detecting on server side using request ID
- Difficult to return correct response to second request
- Immediately forwarding to intermediate page which can forward on when response is ready
- To handle multiple windows, consider
- Passing state in every request
- Adapting web frameworks to map state (e.g. Struts form beans) by primary key or request ID instead of a static name
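Server-side detection of double submissions using a request ID might look like the sketch below: a one-time token is issued with each form and can be claimed only once. `SubmitTokens` is a hypothetical helper, not a framework class, and it does not address the harder problem noted above of what to return to the second request.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical registry of one-time form tokens.
final class SubmitTokens {
    private final Set<String> outstanding = ConcurrentHashMap.newKeySet();

    // Called when rendering the form; the token goes into a hidden field.
    String issue() {
        String token = UUID.randomUUID().toString();
        outstanding.add(token);
        return token;
    }

    // Called when the form is posted. Only the first post carrying a given
    // token returns true; repeats and unknown tokens return false.
    boolean claim(String token) {
        return token != null && outstanding.remove(token);
    }
}
```

Because `remove` on the concurrent set is atomic, two simultaneous posts of the same token cannot both be accepted.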
16. Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
17. Expected Application Errors
Resource is unavailable
- Database is down for maintenance
- No connection to integrated partner service
- Resource is overloaded
- Out of DB connections
- JMS Queue full
18. Expected Application Errors
Resource is unavailable
- To prevent, consider
- Coordinating maintenance schedules
- Planning for failover at the resource level
- Increasing hardware budget?
- Increasing transaction timeout seconds (caution: last resort)
- To handle, analyze transactional requirements
- Is immediate user response necessary?
- Can the resource access be handled asynchronously with an extended, logical transaction?
- Plan rollbacks carefully to allow for retries (consider idempotence, sub-transactions)
- Alert operator/admin if out of SLA
- Log all outages (study for patterns)
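Where the operation is idempotent, retrying after a rollback can be sketched as a small helper with a growing pause between attempts. `Retry.withBackoff` is a hypothetical name, and the doubling delay is one possible policy rather than anything the slides prescribe.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper for idempotent operations against a resource
// that may be briefly unavailable.
final class Retry {
    // Runs the operation up to maxAttempts times, doubling the pause
    // between attempts. Rethrows the last failure if every attempt fails.
    static <T> T withBackoff(Callable<T> operation, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e; // a real system would log the outage here
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2;
                }
            }
        }
        throw last;
    }
}
```

This is only safe when repeating the operation cannot apply the same change twice, which is why the slide pairs retries with idempotence.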
19. Expected Application Errors
Application is overloaded
- Mentioned on CNBC
- Linked from Slashdot
- Denial of Service
20. Expected Application Errors
Application is overloaded
- Test under heavy load
- Tune hot spots
- Run with excess capacity
- Throttle at network level
- Use JMS and other asynchronous technologies to throttle on backend
- Tune application server to degrade gracefully
- Monitor carefully
- Be prepared to scale out, not just up
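Inside the application, one simple way to degrade gracefully is to cap concurrent work and answer immediately with a fallback when the cap is reached. The sketch below uses a `java.util.concurrent.Semaphore`; the `Throttle` class itself is hypothetical and complements, rather than replaces, throttling at the network level or through JMS.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical in-process throttle: at most maxConcurrent callers run the
// expensive work at once; everyone else gets a quick fallback response.
final class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Runs work if a slot is free; otherwise returns busyFallback without waiting.
    <T> T call(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) {
            return busyFallback;
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always free the slot, even if work throws
        }
    }
}
```

Rejecting quickly keeps threads from piling up behind a slow resource, so the rest of the application stays responsive.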
21. Expected Application Errors
Bugs and other undocumented features
- Friendly bug
- Triggers invalid state
- Causes VM or app server to throw exception
- Greedy bug
- Monopolizes resources
- Leaks connections
- Silent and deadly bug
- Corrupts data
22. Expected Application Errors
Bugs and other undocumented features
- To handle friendly bugs
- Bulletproof your transactions and rollbacks
- Write up strict guidelines
- Conduct code reviews
- If developers are junior and/or integrity is a lot more important than performance, consider using unchecked application exceptions
- Catch Throwable somewhere in the UI
- Display sanitized errors to user
- Log carefully to allow easy debugging (use timestamps, thread IDs, transaction IDs)
- Alert operator/administrator
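Catching `Throwable` at the top of the UI, showing a sanitized message, and logging the detail could be combined roughly as follows. `ErrorBoundary` and the incident reference scheme are assumptions made for this sketch; it uses `java.util.logging`, which stamps each record with the time.

```java
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical top-level wrapper around a page action.
final class ErrorBoundary {
    private static final Logger LOG = Logger.getLogger(ErrorBoundary.class.getName());

    // Runs the action. On any failure, logs full detail under an incident ID
    // and returns a message that is safe to show the user.
    static String render(String user, Callable<String> action) {
        try {
            return action.call();
        } catch (Throwable t) {
            String incident = UUID.randomUUID().toString().substring(0, 8);
            LOG.log(Level.SEVERE, "incident=" + incident + " user=" + user
                    + " thread=" + Thread.currentThread().getName(), t);
            return "Sorry, something went wrong. Please quote reference " + incident + ".";
        }
    }
}
```

The short reference gives support staff a way to find the matching log entry without exposing stack traces or internal messages to the user.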
23. Expected Application Errors
Bugs and other undocumented features
- To handle greedy bugs
- Reduce transaction timeout seconds
- Handle timeouts in the same way as friendly bugs
- Monitor carefully
- Log statistics (number of transaction timeouts, CPU usage, memory usage, GC, network traffic, stuck threads)
- Automate log analysis
- Trigger a thread dump (kill -3) during hot spots
- Alert operator/administrator to hot spots
- Use clustering to contain damage
24. Expected Application Errors
Bugs and other undocumented features
- To handle silent and deadly bugs
- Bulletproof transaction settings
- Validate on multiple levels, use referential integrity
- Audit everything
- Unless performance/cost prohibits, keep a complete audit trail on every table (easy with triggers, aspects or code generators), try to include transaction ID
- Flush caches regularly
- After a save, load the record from the database and display back to the user
- Run periodic audits with human review
- Plan for how to use audit trail to recover from data corruption
- Early detection is key: escalate user concerns!
25. Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
26. System Failure
Never have an unplanned outage
- Determine acceptable downtime
- Plan clustering / failover accordingly
- Monitor carefully so outages are detected immediately
- Be ready with a tiny planned outage page and server in advance
- Consider offsite host
- Build this functionality into non-Web clients at development time
- Plan for transaction recovery
- Plan for JMS recovery
- Use quiescing load balancing to bring servers offline for maintenance
27. Agenda
- Goals of Fault Tolerance
- User Recoverable Errors
- Expected Application Errors
- System Failure
- Useful Strategies
- Discussion
28. Useful Strategies
Be sure that you develop guidelines for
- Error Messages
- Validation (format, business rules, size, cleansing)
- Logging (when, where, what)
- Auditing
- Monitoring (level of automation, alerts)
- Transactions (who rolls back, checked vs. unchecked)
- Sessions and Caching (request vs. session, flushing)
- Clustering
29. Useful Strategies
Error Messages
- For validation errors, be sure to
- Include format and size hints
- Show examples
- Give more information than the basic field label
- Mention the error at the top of the screen
- Highlight the actual field
- Wherever possible, catch all errors at the same time!
- For other user-recoverable errors
- Let the user know what to do next
- If the user can't recover
- Apologize
- Give no details
- Suggest workarounds
- (Silently log and alert!)
30. Useful Strategies
Validation
- If possible, validate at all levels
- Common strategies
- Externalize validation rules and use AOP or code generation (CG) techniques to build them into each layer
- Clearly define which layers are responsible for which types of validation. For example
- All format errors handled in web tier
- All business rule violations handled in application tier
- All field lengths enforced at data tier
- If business logic isn't dependent on fields, define them dynamically, along with their validation rules
- Use a rules engine
31. Useful Strategies
Logging
- Log in all tiers
- Helps diagnose problems
- More secure
- Define logging levels (debug, info, error, etc.)
- Create consistent guidelines for those levels
- Include
- Timestamp
- Current User
- Some sort of thread ID, transaction ID, etc.
- Don't make logs a source of failure (watch disk space, JMS load, etc.)
32. Useful Strategies
Auditing
- Unless you have a performance or space problem, audit all changes
- Provides accountability
- Easier to support users
- Easier to debug
- Easier to recover from disaster
- Easier to detect attacks
- Include
- Timestamp
- Current User
- Some sort of thread ID, transaction ID, etc.
- Complete data record or diff
- Decide whether a failure to audit should sink the
transaction
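An audit record carrying the items listed above (timestamp, current user, transaction ID, and a diff) might be built like this. `AuditEntry`, its fields, and the `old -> new` diff format are hypothetical; in practice the same data is often captured by triggers, aspects or generated code.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical audit record: who changed what, when, in which transaction.
final class AuditEntry {
    final Instant timestamp;
    final String user;
    final String transactionId;
    final Map<String, String> changes; // field -> "old -> new"

    private AuditEntry(String user, String transactionId, Map<String, String> changes) {
        this.timestamp = Instant.now();
        this.user = user;
        this.transactionId = transactionId;
        this.changes = changes;
    }

    // Compares two snapshots of a record and keeps only the fields that differ.
    static AuditEntry diff(String user, String transactionId,
                           Map<String, String> before, Map<String, String> after) {
        Set<String> fields = new TreeSet<>(before.keySet());
        fields.addAll(after.keySet());
        Map<String, String> changes = new TreeMap<>();
        for (String field : fields) {
            String oldValue = before.get(field);
            String newValue = after.get(field);
            if (!Objects.equals(oldValue, newValue)) {
                changes.put(field, oldValue + " -> " + newValue);
            }
        }
        return new AuditEntry(user, transactionId, changes);
    }
}
```

Storing a diff keeps the trail compact; storing the complete record instead makes recovery simpler at the cost of space.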
33. Useful Strategies
Monitoring
- Who is watching the logs?
- Common strategies include
- 24/7 operations center
- Business hours operations center
- Overworked admin whenever she happens to think about it (not recommended!)
- Automated, redundant processes that analyze logs and raise alerts to on-call administrators
- Logs show more than critical errors
- Ideally, mine them for clues on usability, performance problems and attacks
34. Useful Strategies
Transactions
- What runs in a transaction?
- Who is responsible for rolling it back?
- Common strategies
- Top server-side tier creates a user transaction, catches all errors and then determines its fate
- Container-managed transactions with session façade
- Top level methods responsible for rollbacks
- Unchecked (runtime) exceptions, so rollbacks are automatic
- Mostly unchecked exceptions with a few special cases
- How to pick a transaction timeout?
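The first strategy, where the top tier owns the transaction and decides its fate, can be sketched with a small template method. The `Tx` interface here is a hypothetical stand-in for whatever transaction API the application actually uses; it is not a J2EE type.

```java
import java.util.concurrent.Callable;

// Hypothetical transaction boundary owned by the top server-side tier.
final class TransactionTemplate {
    // Stand-in for the application's real transaction API (an assumption).
    interface Tx {
        void begin();
        void commit();
        void rollback();
    }

    // Commits on success, rolls back on any failure, and rethrows checked
    // exceptions as unchecked so callers cannot silently swallow them.
    static <T> T inTransaction(Tx tx, Callable<T> work) {
        tx.begin();
        try {
            T result = work.call();
            tx.commit();
            return result;
        } catch (Throwable t) {
            tx.rollback();
            if (t instanceof RuntimeException) throw (RuntimeException) t;
            if (t instanceof Error) throw (Error) t;
            throw new IllegalStateException("Transaction rolled back", t);
        }
    }
}
```

Keeping the rollback decision in one place answers the question of who is responsible: lower layers only throw, and this boundary alone commits or rolls back.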
35. Useful Strategies
Sessions and Caching
- Session or request?
- Common strategies
- Everything in HTTP session
- Stateful session beans
- Hidden form fields
- URL rewriting
- Encrypted cookies
- Unencrypted cookies
- When to flush cache?
- Trade-off between integrity and performance
- Caches can mask data problems
36. Useful Strategies
Clustering
- Why use clusters?
- Availability
- Scalability
- Will this application need a cluster?
- Can you take it offline for maintenance?
- Can you take it offline to scale it up?
- Are you sure you won't need to scale it out?
- Useful option, but not the answer to everything
- Usually requires work on existing applications
- Build applications to be clusterable from the start
- Make shared state serializable
- Obey J2EE restrictions
- Consider how much should be stored in session
- Test on a cluster so you know how close you are
37. Discussion
Get the slides online at http://www.chariotsolutions.com/slides