Title: Distributed, Virtualized Fault Tolerant Systems
1Distributed, Virtualized Fault Tolerant Systems
2- What does downtime cost you?
3Agenda
- Fault Tolerance: An Introduction
- Components of Fault Tolerant Systems
- Fault Tolerant Servers
- The Distributed Approach
- The Virtualization Approach
- Fault Tolerance with FOSS
- Resources
- Q & A
4Introduction
- Fault tolerance is the property that enables a
system to continue operating properly in the event
of the failure of some of its parts
- Source: http://en.wikipedia.org/wiki/Fault_tolerant_systems
5Components
- Network
- Hardware
- OS
- Middleware
- Applications
- Storage
- Monitoring
- Notification
- Recovery
- Security
[Diagram: Component Inter-dependencies. Labels: Notification, Users, Administrators, Dependencies, Storage, Monitoring, Recovery, Application, Middleware, OS, Hardware, Network]
6Fault-Tolerant Servers
- Fault-tolerant servers have lock-step H/W
- Redundant power supplies, processors, storage
- Five nines (99.999%) uptime: about 5 minutes of downtime per year
- Clustering averages only 99.9% uptime
- Downtime of about eight hours and 44 minutes per year
- Does not account for OS or application issues
Source: Brad Lightner, Director of Product and
Solution Integration, NEC
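The uptime percentages above translate into yearly downtime with simple arithmetic. The following is a minimal sketch, assuming a 365-day year; the helper name `downtime_per_year` is illustrative and not part of the talk.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap, 365-day year


def downtime_per_year(availability_percent: float) -> timedelta:
    """Return the expected yearly downtime for a given uptime percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return timedelta(minutes=MINUTES_PER_YEAR * unavailable_fraction)


if __name__ == "__main__":
    for label, pct in [("five nines", 99.999), ("clustering", 99.9)]:
        print(f"{label} ({pct}%): {downtime_per_year(pct)}")
```

Under this assumption, 99.999% works out to a little over 5 minutes per year and 99.9% to roughly 8 hours 45 minutes, close to the figures quoted on the slide.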
7The Distributed Approach
- Distribute mission critical applications
- High-Availability / Failover Clusters
- Replication
- Redundancy
- Diversity
- Disadvantages
- Additional investment in H/W and licenses
- Administration Overhead
- Applications may need to be cluster aware
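Why replication and redundancy raise availability can be illustrated with basic probability. The sketch below assumes node failures are independent, which real clusters only approximate (shared storage, shared networks, and common software bugs all break that assumption). The function names and the example percentages are illustrative, not measurements from the talk.

```python
from math import prod


def parallel_availability(node_availabilities):
    """Availability of a redundant group that is up while any one node is up."""
    return 1.0 - prod(1.0 - a for a in node_availabilities)


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(component_availabilities)


if __name__ == "__main__":
    pair = parallel_availability([0.99, 0.99])
    print(f"two 99% nodes in a failover pair: {pair:.4%}")
    # The pair still depends on shared pieces such as storage.
    chain = serial_availability([pair, 0.999])
    print(f"same pair behind 99.9% shared storage: {chain:.4%}")
```

The second figure shows why a cluster is only as good as the components every node shares: the serial dependency pulls the combined availability back down.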
8The Distributed Approach
- A single server with tons of redundant bits
doesn't help you if the OS or applications that
it serves get borked.
- Easy to scale up infrastructure incrementally,
which helps with overall scalability
- Fully parallel resources: independent memory
bus, independent IO bus, independent CPUs,
independent NIC
- More flexibility in how resources are divided
amongst mutually exclusive tasks
9The Virtualization Approach
- OS Instances for running specific apps
- Fail-over between OS instances
- greatly reduces recovery time
- Separate components to avoid failure
- Enhanced security with separation
- Auto-provisioning for recovery
- Performance penalties
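The idea of failing over between OS instances can be sketched as a heartbeat monitor that promotes a standby once the active instance goes silent. This is a simplified simulation with an explicit clock, not how Xen or Linux-HA Heartbeat is implemented; the class name, the instance names `vm-a` and `vm-b`, and the 3-second timeout are all hypothetical.

```python
class FailoverMonitor:
    """Promote a standby instance after the active one misses heartbeats."""

    def __init__(self, instances, timeout=3.0):
        if not instances:
            raise ValueError("at least one instance is required")
        self.instances = list(instances)   # names, in order of preference
        self.timeout = timeout             # seconds of silence before failover
        self.active = self.instances[0]
        self.last_seen = {name: 0.0 for name in self.instances}

    def heartbeat(self, name, now):
        """Record that `name` reported in at time `now`."""
        self.last_seen[name] = now

    def is_alive(self, name, now):
        return now - self.last_seen[name] <= self.timeout

    def check(self, now):
        """Return the instance that should serve traffic at time `now`."""
        if not self.is_alive(self.active, now):
            for candidate in self.instances:
                if self.is_alive(candidate, now):
                    self.active = candidate   # fail over to a live standby
                    break
        return self.active


if __name__ == "__main__":
    monitor = FailoverMonitor(["vm-a", "vm-b"], timeout=3.0)
    monitor.heartbeat("vm-a", now=1.0)
    monitor.heartbeat("vm-b", now=1.0)
    print(monitor.check(now=2.0))   # both instances have reported recently
    monitor.heartbeat("vm-b", now=5.0)
    print(monitor.check(now=6.0))   # vm-a has been silent since t=1.0
```

Recovery time in this model is bounded by the heartbeat timeout plus the check interval, which is the sense in which failover between already-running instances shortens recovery compared with rebuilding a failed host.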
10Fault Tolerance with FOSS
- Test Case: Version Control System
- Subversion running on OpenSuSE
- Apache 2 based network service
- FSFS Backend
- Dependencies
- Neon, Perl, Python, Swig
- Storage is not local to the system
11Fault Tolerance with FOSS
The Distributed Approach
[Diagram: The Distributed Approach. Labels: Users, Administrators, OpenFiler (Storage), Notification, Apache w/ SVN, GFS, Neon, Swig, OpenSuSE 10.0, Hardware 1, Hardware 2, Monitoring (FAM, Nagios, Heartbeat), Recovery (Linux-HA, IP-Takeover, CHPox), Failover]
12Fault Tolerance with FOSS
The Virtualized Approach
[Diagram: The Virtualized Approach. Labels: Users, Administrators, Notification, OpenFiler (Storage), Apache w/ SVN, Neon, Swig, OpenSuSE 10.0, Monitoring, Recovery, Xen 0, Xen 1, Hardware]
13Fault Tolerance with FOSS
- Fault Tolerance possible to achieve with FOSS
- Holds true for either approach
- Distribution / Virtualization not exclusive of
each other
- Advantages
- Ability to integrate components
- Choice
- Cost Effectiveness
- Disadvantages
- Administration Overhead
- System Complexity
14Resources
- Operating System: Linux
- Distributed File System: GFS / Lustre
- Cluster Management: OpenMOSIX / OSCAR / C3
- Data Replication: FAM, IMON / dnotify
- Data Management: OpenFiler
- Application Replication: CHpox
- Virtualization: Xen Hypervisor
- Failover: Linux-HA, LVS
- Monitoring: Ganglia / Nagios / OpenNMS
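To show how a monitoring tool from this list could watch the Apache-served Subversion test case, here is a minimal sketch of a Nagios-style check. It relies only on the standard Nagios plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN) plus one line of output. The function names, the response-time thresholds, and the URL passed on the command line are assumptions for illustration, not part of the talk's setup.

```python
import sys
import time
import urllib.error
import urllib.request

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # Nagios plugin exit codes


def classify(http_status, elapsed, warn_after=1.0, crit_after=5.0):
    """Map an HTTP status and response time (seconds) to a plugin result."""
    if http_status is None:
        return CRITICAL, "CRITICAL - no response"
    if http_status >= 400:
        return CRITICAL, f"CRITICAL - HTTP {http_status}"
    if elapsed >= crit_after:
        return CRITICAL, f"CRITICAL - {elapsed:.2f}s response time"
    if elapsed >= warn_after:
        return WARNING, f"WARNING - {elapsed:.2f}s response time"
    return OK, f"OK - HTTP {http_status} in {elapsed:.2f}s"


def probe(url, timeout=10.0):
    """Fetch `url` and return (HTTP status or None, elapsed seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code            # server answered with an error status
    except (urllib.error.URLError, OSError):
        status = None                # connection refused, DNS failure, timeout
    return status, time.monotonic() - start


# Runs only when a repository URL is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    code, message = classify(*probe(sys.argv[1]))
    print(message)
    sys.exit(code)
```

Keeping `classify` separate from `probe` means the thresholds can be tested without a network, and the same classification could feed a notification channel instead of an exit code.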
15- Questions?
- Demo of Virtualization Approach at FOSS Village
after the talk
16Backups
17FOSS - Components
- OS: Linux
- Middleware: GFS, Lustre, FAM/IMON, dnotify
- Virtualization: Xen Hypervisor
- Storage, FS: OpenFiler, GFS, Lustre
- Monitoring: Ganglia, Nagios
- Notification: SMTP, IM, RSS, SMS, Pager
- Recovery: IP-takeover, Heartbeat, CHpox
- Security: SELinux, chroot jails, virtual
instances