Building Fault-Tolerant Enterprise Applications

1
Building Fault-Tolerant Enterprise Applications

Erin Mulder
Chariot Solutions
chariotsolutions.com

Brian McCallister
Fort Hill Company
forthillcompany.com
2
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

3
Goals of Fault Tolerance
What are we really worried about?

Availability
Integrity
Confidentiality
Usability
Cost

4
Goals of Fault Tolerance
What can go wrong?

User Error
Concurrent Changes
Bugs
Resource Failure/Downtime
System Overload
Misconfiguration
Sabotage

5
Goals of Fault Tolerance
Themes we'll keep visiting

Prevention
Code Guidelines and Reviews
Automated Validation and Testing
Performance / Stress Testing
Detection
Logging and Auditing
Validation Patterns
Monitoring
Recovery
Exception handling patterns
Error feedback loop
Redundancy

6
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

7
User Recoverable Errors
Simple validation error

What do you do when the user:
Leaves a required field blank
Enters a value too big for the database field
Types letters in a numeric field
Selects inconsistent options
Tries to do things in the wrong order

8
User Recoverable Errors
Simple validation error

Fault tolerance is more than detection
Prevent the user from making errors
Set maxlengths on input fields
Use character masks
Specify units
Show example input
Don't allow the selection of inconsistent options
Don't present navigation options that aren't meant to be followed

9
User Recoverable Errors
Simple validation error

Help the user recover quickly
Highlight all errors clearly
Show help text and examples for invalid fields
If some other action is required first, launch it instead of interrupting the flow with frustrating errors
Perception is everything!
Log the error for later analysis
Save enough information to recreate
Start automatically handling common mistakes

10
User Recoverable Errors
Optimistic concurrency clash

Everything looks good until the save
Then
Item has just gone out of stock
Another user has just updated the same document
Time has passed and action is no longer allowed

11
User Recoverable Errors
Optimistic concurrency clash

Increase save points
Alert user to potential risk
Low stock
Another user just accessed this record
Another user has soft lock on record
Offer useful options for resolving collision
Merge changes
Backorder
Automatically retry later
Email me when it is available
Give tips for avoiding future collisions
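One common way to detect the "another user has just updated the same document" case is a version check at save time. The following is a minimal sketch under stated assumptions: `VersionedStore`, `Doc` and `StaleVersionException` are hypothetical names, and the in-memory map stands in for a database table with a version column.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store; a real system would keep the version in a database column. */
class VersionedStore {
    /** Immutable snapshot: the content plus the version it was read at. */
    static final class Doc {
        final String content;
        final long version;
        Doc(String content, long version) { this.content = content; this.version = version; }
    }

    /** Signals an optimistic concurrency clash and carries the current state for merging. */
    static final class StaleVersionException extends RuntimeException {
        final Doc current;
        StaleVersionException(Doc current) {
            super("Record changed since it was read; now at version " + current.version);
            this.current = current;
        }
    }

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    void create(String id, String content) { docs.put(id, new Doc(content, 1)); }

    Doc load(String id) { return docs.get(id); }

    /** Saves only if nobody else has saved since expectedVersion was read. */
    Doc save(String id, String newContent, long expectedVersion) {
        return docs.compute(id, (key, current) -> {
            if (current == null) throw new IllegalArgumentException("No such record: " + id);
            if (current.version != expectedVersion) throw new StaleVersionException(current);
            return new Doc(newContent, current.version + 1);
        });
    }
}
```

The exception carries the current record so the caller can offer the user a merge or a retry rather than a bare error.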

12
User Recoverable Errors
Bookmarks, back buttons and browsers

User escapes normal page flow
Bookmarks login page or internal page
Uses back button
Opens a new window within same session
Session times out
Missing context from previous requests
Next click is like bookmark to internal page
Other browser oddities
Double-clicking submit buttons
Pressing stop button in the middle of a request

13
User Recoverable Errors
Bookmarks, back buttons and sessions

Prevention is difficult: the user is in control
Javascript can sometimes help
Javascript can sometimes hurt
Plan for and test each of these scenarios
Plan for handling out-of-sequence requests

14
User Recoverable Errors
Bookmarks, back buttons and sessions

To seamlessly handle session timeouts and out-of-sequence requests, consider
Persistent sessions (saved to database)
Passing state in every request (form fields or URL rewriting)
Storing state in custom cookies
Adding custom logic to recover from timed-out sequences
To simply detect and alert, consider
Using listener to catch session expiration
Using state validation to catch out-of-sequence requests
Redirecting user to session expiration page
To improve process
Log session losses (requests within expired session)
Consider increasing session timeout
Consider using prevention techniques described above
Increase save points
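State validation for out-of-sequence requests could look roughly like the sketch below. It assumes a linear, wizard-style flow; `FlowGuard` and the step names are hypothetical, and one instance would be kept per user session.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical per-session guard for a linear page flow. */
class FlowGuard {
    private final List<String> steps;
    private int furthest = 0; // index of the first step the user has not completed yet

    FlowGuard(String... steps) { this.steps = Arrays.asList(steps.clone()); }

    /**
     * Accepts a request for the next expected step or for any step already
     * completed (for example after the back button); rejects unknown steps
     * and attempts to skip ahead (for example a bookmarked internal page).
     */
    synchronized boolean accept(String step) {
        int index = steps.indexOf(step);
        if (index < 0 || index > furthest) return false;
        if (index == furthest) furthest++;
        return true;
    }
}
```

A rejected request is the point at which to redirect the user to a sensible starting page and log the event.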

15
User Recoverable Errors
Bookmarks, back buttons and sessions

To minimize impact of back button, consider
Techniques described for out-of-sequence requests
Redirecting to GETs instead of returning responses to POSTs
To work around double submissions, consider
Disabling submit button after first click
Susceptible to Stop button or request timeout
Minimizing response times!
Detecting on server side using request id
Difficult to return correct response to second request
Immediately forwarding to intermediate page which can forward on when response is ready
To handle multiple windows, consider
Passing state in every request
Adapting web frameworks to map state (e.g. Struts form beans) by primary key or request ID instead of a static name
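Server-side detection of double submissions by request ID might be sketched as follows. `SubmitGuard` is a hypothetical name, the request ID is assumed to arrive as a hidden form field generated when the page was rendered, and a real implementation would also expire old entries.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical guard that runs the work for a given request ID at most once. */
class SubmitGuard {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    /**
     * Runs the action the first time a request ID is seen and remembers its
     * result; a repeated submission with the same ID gets the stored result.
     * The action must return a non-null value for the result to be remembered.
     */
    String submitOnce(String requestId, Supplier<String> action) {
        return results.computeIfAbsent(requestId, id -> action.get());
    }
}
```

Storing the result addresses the difficulty of returning a correct response to the second request: the duplicate receives the same confirmation as the original.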

16
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

17
Expected Application Errors
Resource is unavailable

Database is down for maintenance
No connection to integrated partner service
Resource is overloaded
Out of DB connections
JMS Queue full

18
Expected Application Errors
Resource is unavailable

To prevent, consider
Coordinating maintenance schedules
Planning for failover at the resource level
Increasing hardware budget ?
Increasing transaction timeout seconds (caution: last resort)
To handle, analyze transactional requirements
Is immediate user response necessary?
Can the resource access be handled asynchronously with an extended, logical transaction?
Plan rollbacks carefully to allow for retries (consider idempotence, sub-transactions)
Alert operator/admin if out of SLA
Log all outages (study for patterns)
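Retrying against a temporarily unavailable resource is only safe when the operation is idempotent. The sketch below shows one possible retry helper with doubling delays; `Retry.withRetries` is a hypothetical name, and treating every `RuntimeException` as retryable is a simplifying assumption (real code would retry only on errors known to be transient).

```java
import java.util.function.Supplier;

/** Hypothetical helper: retry an idempotent operation with doubling delays. */
class Retry {
    static <T> T withRetries(Supplier<T> op, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be at least 1");
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                // Out of attempts: surface the last failure to the caller.
                if (attempt == maxAttempts) throw e;
            }
            try {
                Thread.sleep(delay);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                throw new IllegalStateException("Interrupted while waiting to retry", ie);
            }
            delay *= 2;
        }
    }
}
```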

19
Expected Application Errors
Application is overloaded

Mentioned on CNBC
Linked from Slashdot
Denial of Service

20
Expected Application Errors
Application is overloaded

Test under heavy load
Tune hot spots
Run with excess capacity
Throttle at network level
Use JMS and other asynchronous technologies to throttle on backend
Tune application server to degrade gracefully
Monitor carefully
Be prepared to scale out, not just up
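The slide talks about throttling at the network level and with JMS; as a stand-in, the sketch below illustrates the same idea inside one process, using a semaphore to cap concurrent work and return a quick "busy" answer instead of queueing without limit. `Throttle` and its method are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical in-process throttle: at most maxConcurrent requests run at once. */
class Throttle {
    private final Semaphore permits;

    Throttle(int maxConcurrent) { this.permits = new Semaphore(maxConcurrent); }

    /** Runs the work if a slot is free; otherwise returns the fallback immediately. */
    <T> T tryRun(Supplier<T> work, T busyFallback) {
        if (!permits.tryAcquire()) return busyFallback;
        try {
            return work.get();
        } finally {
            permits.release(); // always free the slot, even if the work throws
        }
    }
}
```

Rejecting early with a friendly "try again shortly" page is one way to degrade gracefully rather than letting every request slow down.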

21
Expected Application Errors
Bugs and other undocumented features

Friendly bug
Triggers invalid state
Causes VM or app server to throw exception
Greedy bug
Monopolizes resources
Leaks connections
Silent and deadly bug
Corrupts data

22
Expected Application Errors
Bugs and other undocumented features

To handle friendly bugs
Bulletproof your transaction rollbacks
Write up strict guidelines
Conduct code reviews
If developers are junior and/or integrity is a lot more important than performance, consider using unchecked application exceptions
Catch Throwable somewhere in the UI
Display sanitized errors to user
Log carefully to allow easy debugging (use timestamps, thread IDs, transaction IDs)
Alert operator/administrator
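Catching `Throwable` at the top of the UI tier, logging the detail and showing a sanitized message could be sketched like this. `ErrorBoundary` is a hypothetical name, the incident ID is an added assumption to link the user's report to the log entry, and `java.util.logging` stands in for whatever logging framework is in use.

```java
import java.util.UUID;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical last line of defense around a UI request handler. */
class ErrorBoundary {
    private static final Logger LOG = Logger.getLogger(ErrorBoundary.class.getName());

    static String handle(Supplier<String> request) {
        try {
            return request.get();
        } catch (Throwable t) {
            // Full detail goes to the log; the user sees only a reference number.
            String incidentId = UUID.randomUUID().toString();
            LOG.log(Level.SEVERE,
                    "incident=" + incidentId + " thread=" + Thread.currentThread().getName(), t);
            return "Sorry, something went wrong. Please quote reference "
                    + incidentId + " if you contact support.";
        }
    }
}
```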

23
Expected Application Errors
Bugs and other undocumented features

To handle greedy bugs
Reduce transaction timeout seconds
Handle timeouts in the same way as friendly bugs
Monitor carefully
Log statistics (number of transaction timeouts, CPU usage, memory usage, GC, network traffic, stuck threads)
Automate log analysis
Trigger a thread dump (kill -3) during hot spots
Alert operator/administrator to hot spots
Use clustering to contain damage
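Besides `kill -3`, a thread dump can also be captured from inside the JVM through the standard `java.lang.management` API, which makes it easier to trigger automatically when monitoring sees a hot spot. A minimal sketch, with `ThreadDump` as a hypothetical helper name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

/** Hypothetical helper: capture a textual dump of all live threads. */
class ThreadDump {
    static String capture() {
        StringBuilder sb = new StringBuilder();
        // false, false: skip lock and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }
}
```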

24
Expected Application Errors
Bugs and other undocumented features

To handle silent and deadly bugs
Bulletproof transaction settings
Validate on multiple levels, use referential integrity
Audit everything
Unless performance/cost prohibits, keep a complete audit trail on every table (easy with triggers, aspects or code generators), try to include transaction ID
Flush caches regularly
After a save, load the record from the database and display back to the user
Run periodic audits with human review
Plan for how to use audit trail to recover from data corruption
Early detection is key: escalate user concerns!

25
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

26
System Failure
Never have an unplanned outage

Determine acceptable downtime
Plan clustering / failover accordingly
Monitor carefully so outages are detected immediately
Be ready with a tiny planned outage page and server in advance
Consider offsite host
Build this functionality into non-Web clients at development time
Plan for transaction recovery
Plan for JMS recovery
Use quiescing load balancing to bring servers offline for maintenance

27
Agenda

Goals of Fault Tolerance
User Recoverable Errors
Expected Application Errors
System Failure
Useful Strategies
Discussion

28
Useful Strategies
Be sure that you develop guidelines for

Error Messages
Validation (format, business rules, size, cleansing)
Logging (when, where, what)
Auditing
Monitoring (level of automation, alerts)
Transactions (who rolls back, checked vs. unchecked)
Sessions and Caching (request vs. session, flushing)
Clustering

29
Useful Strategies
Error Messages

For validation errors, be sure to
Include format and size hints
Show examples
Give more information than the basic field label
Mention the error at the top of the screen
Highlight the actual field
Wherever possible, catch all errors at the same time!
For other user-recoverable errors
Let the user know what to do next
If the user can't recover
Apologize
Give no details
Suggest workarounds
(Silently log and alert!)

30
Useful Strategies
Validation

If possible, validate at all levels
Common strategies
Externalize validation rules and use AOP or code generation (CG) techniques to build them into each layer
Clearly define which layers are responsible for which types of validation. For example:
All format errors handled in web tier
All business rule violations handled in application tier
All field lengths enforced at data tier
If business logic isn't dependent on fields, define them dynamically, along with their validation rules
Use a rules engine
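A very small illustration of externalized validation rules: the rules are held as data, separate from the code that applies them, and every violation is collected in one pass so the user sees all errors at the same time. `FieldRules` and the field names in the test are hypothetical; a rules engine or generated validators would replace this in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical rule table applied to a submitted form. */
class FieldRules {
    private static final class Rule {
        final String field;
        final Predicate<String> ok;
        final String message;
        Rule(String field, Predicate<String> ok, String message) {
            this.field = field; this.ok = ok; this.message = message;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a rule; the message should include format and size hints. */
    FieldRules require(String field, Predicate<String> ok, String message) {
        rules.add(new Rule(field, ok, message));
        return this;
    }

    /** Checks every rule and returns all violations, not just the first. */
    List<String> validate(Map<String, String> form) {
        List<String> errors = new ArrayList<>();
        for (Rule rule : rules) {
            String value = form.get(rule.field);
            if (value == null) value = ""; // treat a missing field as blank
            if (!rule.ok.test(value)) errors.add(rule.field + ": " + rule.message);
        }
        return errors;
    }
}
```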

31
Useful Strategies
Logging

Log in all tiers
Helps diagnose problems
More secure
Define logging levels (debug, info, error, etc.)
Create consistent guidelines for those levels
Include
Timestamp
Current User
Some sort of thread ID, transaction ID, etc.
Don't make logs a source of failure (watch disk space, JMS load, etc.)
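A log line carrying the fields listed above might be assembled as in this sketch. `LogLine.format` and the `key=value` layout are assumptions for illustration; most logging frameworks can add the same context through their own pattern or context facilities.

```java
import java.time.Instant;

/** Hypothetical formatter that puts the standard context on every log line. */
class LogLine {
    static String format(Instant when, String level, String user, String txId, String message) {
        return when + " " + level
                + " user=" + user
                + " thread=" + Thread.currentThread().getName()
                + " tx=" + txId
                + " " + message;
    }
}
```

Keeping the transaction ID on every line lets entries from different tiers be stitched back together when diagnosing a problem.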

32
Useful Strategies
Auditing

Unless you have a performance or space problem, audit all changes
Provides accountability
Easier to support users
Easier to debug
Easier to recover from disaster
Easier to detect attacks
Include
Timestamp
Current User
Some sort of thread ID, transaction ID, etc.
Complete data record or diff
Decide whether a failure to audit should sink the transaction
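When the audit trail stores a diff rather than the complete record, the diff can be computed field by field. A minimal sketch, assuming records are available as field-name-to-value maps; `AuditDiff` and the `old -> new` notation are hypothetical.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical helper: describe what changed between two versions of a record. */
class AuditDiff {
    /** Returns field name mapped to "old -> new" for every field whose value changed. */
    static Map<String, String> diff(Map<String, String> before, Map<String, String> after) {
        Set<String> fields = new TreeSet<>(before.keySet());
        fields.addAll(after.keySet());
        Map<String, String> changes = new TreeMap<>();
        for (String field : fields) {
            String oldValue = before.get(field);
            String newValue = after.get(field);
            if (!Objects.equals(oldValue, newValue)) {
                changes.put(field, oldValue + " -> " + newValue);
            }
        }
        return changes;
    }
}
```

The resulting map would be stored alongside the timestamp, current user and transaction ID.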

33
Useful Strategies
Monitoring

Who is watching the logs?
Common strategies include
24/7 operations center
Business hours operation center
Overworked admin whenever she happens to think about it (not recommended!)
Automated, redundant processes that analyze logs and raise alerts to on-call administrators
Logs show more than critical errors
Ideally, mine them for clues on usability, performance problems and attacks

34
Useful Strategies
Transactions

What runs in a transaction?
Who is responsible for rolling it back?
Common strategies
Top server-side tier creates a user transaction, catches all errors and then determines its fate
Container-managed transactions with session façade
Top level methods responsible for rollbacks
Unchecked (runtime) exceptions, so rollbacks are automatic
Mostly unchecked exceptions with a few special cases
How to pick a transaction timeout?
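The first strategy, where the top server-side tier owns the transaction and decides its fate, can be sketched as a small template. The nested `Tx` interface is a hypothetical stand-in for whatever transaction API is in use (for example a JTA user transaction), and rolling back on any unchecked exception or error is an assumption in line with the "unchecked exceptions, so rollbacks are automatic" option.

```java
import java.util.function.Supplier;

/** Hypothetical template: one place that commits or rolls back. */
class TxTemplate {
    /** Stand-in for the real transaction API. */
    interface Tx {
        void commit();
        void rollback();
    }

    static <T> T inTransaction(Tx tx, Supplier<T> work) {
        try {
            T result = work.get();
            tx.commit();
            return result;
        } catch (RuntimeException | Error e) {
            // Any failure below this tier rolls the whole unit of work back.
            tx.rollback();
            throw e;
        }
    }
}
```

Lower tiers then never call commit or rollback themselves; they simply throw.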

35
Useful Strategies
Sessions and Caching