Title: FT NT: A Tutorial on Microsoft Cluster Server
1. FT NT: A Tutorial on Microsoft Cluster Server (formerly Wolfpack)
- Joe Barrera
- Jim Gray
- Microsoft Research
- {joebar, gray} @ microsoft.com
- http://research.microsoft.com/barc
2. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
3. DEPENDABILITY: The 3 ITIES
- RELIABILITY / INTEGRITY: does the right thing (also: large MTTF)
- AVAILABILITY: does it now (also: small MTTR)
- Availability = MTTF / (MTTF + MTTR)
- If 90% of terminals are up and 99% of the DB is up, then >89% of transactions are serviced on time.
- Holistic vs. Reductionist view
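In symbols, with a worked example using the survey numbers on the next slide (MTTF of 10 weeks, about 100,800 minutes, and 90-minute average outages):

    \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}},
    \qquad
    \frac{100{,}800}{100{,}800 + 90} \approx 0.9991 \approx 99.9\%

So even a 10-week MTTF yields roughly three 9s of availability when repair takes only 90 minutes; the availability classes mentioned later count these 9s.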
4. Case Study: Japan
- "Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans. Eiichi Watanabe)
[Pie chart: share of outages by cause: Vendor, Tele Comm lines, Environment, Application Software, Operations; overall MTTF 10 weeks]
- Vendor (hardware and software): 5 months
- Application software: 9 months
- Communications lines: 1.5 years
- Operations: 2 years
- Environment: 2 years
- Overall MTTF: 10 weeks
- 1,383 institutions reported (6/84 - 7/85)
- 7,517 outages; MTTF 10 weeks, average duration 90 MINUTES
- To get a 10-year MTTF, must attack all of these areas
5. Case Studies: Tandem Trends
- MTTF improved
- Shift from Hardware and Maintenance (from 50% down to 10%) to Software (62%) and Operations (15%)
- NOTE: systematic under-reporting of:
- Environment
- Operations errors
- Application Software
6. Summary of FT Studies
- Current situation: ~4-year MTTF => fault tolerance works.
- Hardware is GREAT (maintenance and MTTF).
- Software masks most hardware faults.
- Many hidden software outages in operations:
- new software
- utilities
- Must make all software ONLINE.
- Software seems to define a 30-year MTTF ceiling.
- Reasonable goal: 100-year MTTF
- class 4 today => class 6 tomorrow (i.e., from four 9s of availability to six)
7. Fault Tolerance vs. Disaster Tolerance
- Fault-Tolerance: masks local faults
- RAID disks
- Uninterruptible Power Supplies
- Cluster failover
- Disaster Tolerance: masks site failures
- protects against fire, flood, sabotage, ...
- redundant system and service at a remote site
8. The Microsoft Vision: Plug & Play Dependability
- Transactions for reliability
- Clusters for availability
- Security
- All built into the OS
9. Cluster Goals
- Manageability
- manage nodes as a single system
- perform server maintenance without affecting users
- mask faults, so repair is non-disruptive
- Availability
- restart failed applications and servers
- un-availability ~ MTTR / MTBF, so quick repair helps
- detect/warn administrators of failures
- Scalability
- add nodes for incremental
- processing
- storage
- bandwidth
10. Fault Model
- Failures are independent, so single-fault tolerance is a big win
- Hardware fails fast (blue-screen)
- Software fails fast (or goes to sleep)
- Software often repaired by reboot
- Heisenbugs
- Operations tasks: major source of outage
- utility operations
- software upgrades
11. Clusters: Servers Combined to Improve Availability and Scalability
- Cluster: a group of independent systems working together as a single system. Clients see scalable, FT services (a single system image).
- Node: a server in a cluster. May be an SMP server.
- Interconnect: the communications link used for intra-cluster status info such as heartbeats. Can be Ethernet.
12. Microsoft Cluster Server
- 2-node availability: Summer '97 (20,000 beta testers now)
- Commoditize fault-tolerance (high availability):
- commodity hardware (no special hardware)
- easy to set up and manage
- lots of applications work out of the box
- 16-node scalability later (next year?)
13. Failover Example
[Diagram: Server 1 and Server 2 share the Web site files and Database files; the Web site and Database resources can run on either server]
14. MS Press Failover Demo
- Client/Server
- Software failure
- Admin shutdown
- Server failure
- Resource states: Pending, Partial, Failed, Offline
15. Demo Configuration
[Diagram: two-node Windows NT Server cluster sharing a SCSI disk cabinet]
16. Demo Administration
[Diagram: Server Alice runs SQL Trace and Globe; Server Betty runs SQL Trace; both share the SCSI disk cabinet in the Windows NT Server cluster; a client connects to the cluster]
17. Generic Stateless Application: Rotating Globe
- Mplay32 is a generic app
- Registered with MSCS
- MSCS restarts it on failure
- Move/restart takes about 2 seconds
- Fail over if:
- 4 failures (process exits)
- in 3 minutes
- settable default
18. Demo: Moving or Failing Over an Application
[Diagram: Windows NT Server cluster with shared SCSI disk cabinet]
19. Generic Stateful Application: NotePad
- Notepad saves state on the shared disk
- Failure before save => lost changes
- Failover or move (disk state moves too)
20. Demo Step 1: Alice Delivering Service
[Diagram: Alice active (SQL activity), Betty idle (no SQL activity); SQL, ODBC, and IIS on both nodes; shared SCSI disk cabinet; the client reaches the cluster via IP/HTTP]
21. Demo Step 2: Request Move to Betty
[Diagram: same configuration; a move is requested, and HTTP traffic shifts toward Betty]
22. Demo Step 3: Betty Delivering Service
[Diagram: SQL, ODBC, and IIS now active on Betty; shared SCSI disk cabinet]
23. Demo Step 4: Power Fail Betty, Alice Takes Over
[Diagram: Windows NT Server cluster with shared SCSI disk cabinet]
24. Demo Step 5: Alice Delivering Service
[Diagram: Alice active (SQL activity); SQL, ODBC, and IIS on Alice; client via IP/HTTP]
25. Demo Step 6: Reboot Betty; Betty Can Now Take Over
[Diagram: Alice delivering service (SQL activity), Betty back online as standby; client via IP/HTTP]
26. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
27. Cluster and NT Abstractions
[Diagram: the cluster abstractions (Resource, Group, Cluster) parallel the NT abstractions (Service, Node, Domain)]
28. Basic NT Abstractions
- Service: a program or device managed by a node
- e.g., file service, print service, database server
- can depend on other services (startup ordering)
- can be started, stopped, paused, failed
- Node: a single (tightly-coupled) NT system
- hosts services; belongs to a domain
- services on a node always remain co-located
- unit of service co-location; involved in naming services
- Domain: a collection of nodes
- cooperation for authentication, administration, naming
29. Cluster Abstractions
- Resource: a program or device managed by a cluster
- e.g., file service, print service, database server
- can depend on other resources (startup ordering)
- can be online, offline, paused, failed
- Resource Group: a collection of related resources
- hosts resources; belongs to a cluster
- unit of co-location; involved in naming resources
- Cluster: a collection of nodes, resources, and groups
- cooperation for authentication, administration, naming
30. Resources
- Resources have...
- Type: what it does (file, DB, print, web)
- An operational state (online/offline/failed)
- Current and possible nodes
- Containing resource group
- Dependencies on other resources
- Restart parameters (in case of resource failure)
31. Resource Types
- Built-in types
- Generic Application
- Generic Service
- Internet Information Server (IIS) Virtual Root
- Network Name
- TCP/IP Address
- Physical Disk
- FT Disk (Software RAID)
- Print Spooler
- File Share
- Added by others
- Microsoft SQL Server,
- Message Queues,
- Exchange Mail Server,
- Oracle,
- SAP R/3
- Your application? (use developer kit wizard).
32. Physical Disk
33. TCP/IP Address
34. Network Name
35. File Share
36. IIS (WWW/FTP) Server
37. Print Spooler
38. Resource States
- Resource states:
- Offline: exists, not offering service
- Online: offering service
- Failed: not able to offer service
- Resource failure may cause:
- local restart
- other resources to go offline
- resource group to move
- (all subject to group and resource parameters)
- Resource failure detected by:
- polling failure
- node failure
[State diagram: transitions among Offline, Offline Pending, and Online, driven by messages such as "Go Online!", "Go Off-line!", "I'm here!", "I'm Online!", "I'm Off-line!"]
39. Resource Dependencies
- Similar to NT service dependencies
- Orderly startup and shutdown:
- a resource is brought online after any resources it depends on are online
- a resource is taken offline before any resources it depends on
- Interdependent resources:
- form dependency trees
- move among nodes together
- fail over together
- as per resource group (see the sketch after this list)
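The online-ordering rule amounts to a depth-first walk of the dependency tree. A minimal sketch in C; the Resource type and the do_online()/do_offline() operations are hypothetical stand-ins for the real resource calls:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct Resource {
        const char       *name;
        struct Resource **deps;      /* resources this one depends on */
        size_t            dep_count;
        bool              online;
    } Resource;

    /* hypothetical per-resource operations supplied elsewhere */
    extern void do_online(Resource *r);
    extern void do_offline(Resource *r);

    /* Bring r online only after everything it depends on is online. */
    void bring_online(Resource *r)
    {
        for (size_t i = 0; i < r->dep_count; i++)
            if (!r->deps[i]->online)
                bring_online(r->deps[i]);
        do_online(r);
        r->online = true;
    }

    /* Shutdown runs the other way: dependents first, then r.
     * (The caller walks the tree top-down before calling this.) */
    void take_offline(Resource *r)
    {
        do_offline(r);
        r->online = false;
    }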
40. Dependencies Tab
41. NT Registry
- Stores all configuration information
- software
- hardware
- Hierarchical (name, value) map
- Has an open, documented interface
- Is secure
- Is visible across the net (RPC interface)
- Typical entry:
- \Software\Microsoft\MSSQLServer\MSSQLServer\
- DefaultLogin = GUEST
- DefaultDomain = REDMOND
42. Cluster Registry
- Separate from the local NT Registry
- Replicated at each node
- algorithms explained later
- Maintains configuration information:
- cluster members
- cluster resources
- resource and group parameters (e.g., restart)
- Stable storage
- Refreshed from the master copy when a node joins the cluster
43. Other Resource Properties
- Name
- Restart policy (restart N times, then fail over)
- Startup parameters
- Private configuration info (resource-type specific)
- per-node as well, if necessary
- Poll intervals (LooksAlive, IsAlive, Timeout)
- These properties are all kept in the Cluster Registry
44. General Resource Tab
45. Advanced Resource Tab
46. Resource Groups
- Every resource belongs to a resource group.
- Resource groups move (fail over) as a unit.
- Dependencies NEVER cross groups. (Dependency trees are contained within groups.)
- A group may contain a forest of dependency trees.
[Diagram: a Payroll Group containing a Web Server and a SQL Server, depending on an IP Address and on Drives E and F]
47. Moving a Resource Group
48. Group Properties
- CurrentState: Online, Partially Online, Offline
- Members: resources that belong to the group
- members determine which nodes can host the group
- Preferred Owners: ordered list of host nodes
- FailoverThreshold: how many faults cause failover
- FailoverPeriod: time window for the failover threshold (a sketch of this test follows)
- FailbackWindowStart: when can failback happen?
- FailbackWindowEnd: when can failback happen?
- Everything (except CurrentState) is stored in the registry
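How FailoverThreshold and FailoverPeriod plausibly combine: restart a failed resource in place until too many failures fall inside the window, then move the group. A sketch in C; the types and the ring-buffer bookkeeping are illustrative, not the MSCS implementation:

    #include <stdbool.h>
    #include <time.h>

    #define MAX_FAULTS 64

    typedef struct {
        int    threshold;           /* FailoverThreshold          */
        int    period_s;            /* FailoverPeriod, in seconds */
        time_t faults[MAX_FAULTS];  /* recent failure timestamps  */
        int    nfaults;
    } GroupPolicy;

    /* Record a failure; true means "fail the group over" rather
     * than restarting the resource locally. */
    bool record_failure(GroupPolicy *g, time_t now)
    {
        g->faults[g->nfaults++ % MAX_FAULTS] = now;

        int recent = 0;             /* failures inside the window */
        for (int i = 0; i < g->nfaults && i < MAX_FAULTS; i++)
            if (now - g->faults[i] <= g->period_s)
                recent++;
        return recent > g->threshold;
    }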
49. Failover and Failback
- Failover parameters:
- timeout on LooksAlive, IsAlive
- local restarts in the failure window; after this, offline
- Failback to the preferred node
- (during the failback window)
- Do resource failures affect the group?
[Diagram: nodes \\Alice and \\Betty, each running the Cluster Service; the IP address and network name resources move between them]
50. Cluster Concepts: Clusters
[Diagram: a cluster contains multiple resource groups, each containing resources]
51. Cluster Properties
- Defined Members: nodes that can join the cluster
- Active Members: nodes currently joined to the cluster
- Resource Groups: groups in the cluster
- Quorum Resource:
- stores a copy of the cluster registry
- used to form a quorum
- Network: which network is used for communication
- All properties kept in the Cluster Registry
52. Cluster API Functions (operations on nodes and groups)
- Find and communicate with Cluster
- Query/Set Cluster properties
- Enumerate Cluster objects
- Nodes
- Groups
- Resources and Resource Types
- Cluster Event Notifications
- Node state and property changes
- Group state and property changes
- Resource state and property changes
53. Cluster Management
54. Demo
- Server startup and shutdown
- Installing applications
- Changing status
- Failing over
- Transferring ownership of groups or resources
- Deleting Groups and Resources
55. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
56. Architecture
- Top tier provides cluster abstractions
- Middle tier provides distributed operations
- Bottom tier is NT and drivers
[Diagram, three tiers: Failover Manager and Resource Monitor on top; Cluster Registry, Global Update, Quorum, and Membership in the middle; Windows NT Server, Cluster Disk Driver, and Cluster Net Drivers at the bottom]
57. Membership and Regroup
- Membership:
- used for orderly addition and removal from the active nodes
- Regroup:
- used for failure detection (via heartbeat messages)
- forceful eviction from the active nodes
58. Membership
- Defined cluster: all nodes
- Active cluster:
- subset of the defined cluster
- includes the Quorum Resource
- stable (no regroup in progress)
59. Quorum Resource
- Usually (but not necessarily) a SCSI disk
- Requirements:
- arbitrates for a resource by supporting the challenge/defense protocol
- capable of storing the cluster registry and logs
- Configuration Change Logs:
- track changes to the configuration database when any defined member is missing (not active)
- prevent configuration partitions in time
60. Challenge/Defense Protocol
- SCSI-2 has reserve/release verbs
- semaphore on the disk controller
- Owner gets a lease on the semaphore
- renews the lease once every 3 seconds
- To preempt ownership (sketched below):
- challenger clears the semaphore (SCSI bus reset)
- waits 10 seconds
- 3 seconds for renewal + 2 seconds bus settle time
- x2 to give the owner two chances to renew
- if still clear, then the former owner loses the lease
- challenger issues reserve to acquire the semaphore
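The timing in code form; a minimal sketch assuming hypothetical scsi_reserve / scsi_bus_reset / scsi_is_reserved wrappers for the SCSI-2 verbs (these helpers are not a real driver API):

    #include <stdbool.h>
    #include <windows.h>

    extern void scsi_reserve(HANDLE disk);      /* acquire the semaphore  */
    extern void scsi_bus_reset(HANDLE disk);    /* clears any reservation */
    extern bool scsi_is_reserved(HANDLE disk);  /* probe ownership        */

    /* Defender: renew the lease once every 3 seconds. */
    void defend(HANDLE disk)
    {
        for (;;) {
            scsi_reserve(disk);          /* re-assert the reservation */
            Sleep(3000);
        }
    }

    /* Challenger: clear the semaphore, then wait
     * 10 s = 2 x (3 s renewal + 2 s bus settle), giving a live
     * defender two chances to renew before we conclude it is dead. */
    bool challenge(HANDLE disk)
    {
        scsi_bus_reset(disk);            /* clears the semaphore       */
        Sleep(10000);
        if (scsi_is_reserved(disk))
            return false;                /* defender renewed: we lose  */
        scsi_reserve(disk);              /* still clear: take the disk */
        return true;
    }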
61. Challenge/Defense Protocol: Successful Defense
[Diagram: the defender keeps renewing, so the challenger detects a reservation and backs off]
62. Challenge/Defense Protocol: Successful Challenge
[Diagram: the defender node is down; the challenger detects no reservation and acquires the disk]
63. Regroup
- Invariant: all members agree on the membership
- Regroup re-computes the membership
- Each node sends a heartbeat message to a peer (default is one per second)
- Regroup if two lost heartbeat messages (see the sketch below):
- suspicion that the sender is dead
- failure detection in bounded time
- Uses a 5-round protocol to agree
- checks communication among nodes
- a suspected missing node may survive
- Upper levels (global update, etc.) informed of the regroup event
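The detection rule is small enough to sketch; the names are hypothetical and the 5-round agreement itself is elided:

    #include <time.h>

    /* One heartbeat per second; two missed heartbeats trigger regroup,
     * giving failure detection in bounded time. */
    #define HEARTBEAT_INTERVAL_S 1
    #define MISSED_LIMIT         2

    extern void start_regroup(void);   /* the 5-round protocol */

    void check_peer(time_t last_heartbeat_from_peer, time_t now)
    {
        if (now - last_heartbeat_from_peer >=
                MISSED_LIMIT * HEARTBEAT_INTERVAL_S)
            start_regroup();           /* suspect the sender is dead */
    }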
64. Membership State Machine
[State diagram, roughly: Initialize -> Member Search. Found Online Member -> Joining; Join Succeeds and Synchronize Succeeds -> Online. Member Search Fails -> Quorum Disk Search; Acquire (reserve) Quorum Disk -> Forming -> Online; Search or Reserve Fails -> Sleeping. From Online, Lost Heartbeat -> Regroup; Non-Minority and Quorum -> back to Online; Minority or no Quorum -> leave the cluster]
65. Joining a Cluster
- When a node starts up, it mounts and configures only local, non-cluster devices
- Starts the Cluster Service, which:
- looks in the local (stale) registry for members
- asks each member in turn to sponsor the new node's membership (stop when a sponsor is found)
- Sponsor (any active member):
- authenticates the applicant
- broadcasts the applicant to cluster members
- sends the updated registry to the applicant
- Applicant becomes a cluster member (sketched below)
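The join-or-form decision as a sketch; all helper functions here are hypothetical stand-ins for the RPC calls involved:

    #include <stdbool.h>

    extern int  stale_member_list(const char **members, int max);
    extern bool ask_to_sponsor(const char *member);     /* sponsor authenticates us */
    extern void receive_registry(const char *sponsor);  /* fresh cluster registry   */
    extern void form_new_cluster(void);                 /* arbitrate for quorum     */

    void join_or_form(void)
    {
        const char *members[16];
        int n = stale_member_list(members, 16);

        for (int i = 0; i < n; i++) {
            if (ask_to_sponsor(members[i])) {   /* stop at first sponsor */
                receive_registry(members[i]);   /* refresh stale state   */
                return;                         /* now an active member  */
            }
        }
        form_new_cluster();  /* nobody answered: form a cluster (next slide) */
    }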
66. Forming a Cluster (when joining fails)
- Use the registry to find the quorum resource
- Attach to (arbitrate for) the quorum resource
- Update the cluster registry from the quorum resource
- e.g., if we were down when it was in use
- Form a new one-node cluster
- Bring other cluster resources online
- Let others join your cluster
67. Leaving a Cluster (Gracefully)
- Pause:
- move all groups off this member
- change to paused state (remains a cluster member)
- Offline:
- move all groups off this member
- send a ClusterExit message to all cluster members
- prevents regroup
- prevents stalls during departure transitions
- close cluster connections (now not an active cluster member)
- Cluster Service stops on the node
- Evict: remove the node from the defined member list
68. Leaving a Cluster (Node Failure)
- Node (or communication) failure triggers regroup
- If after regroup:
- minority group OR no quorum device: group does NOT survive
- non-minority group AND quorum device: group DOES survive
- Non-minority rule (in code below):
- number of new members > 1/2 of the old active cluster
- prevents a minority from seizing the quorum device at the expense of a larger, potentially surviving cluster
- Quorum guarantees correctness:
- prevents split-brain
- e.g., with a newly forming cluster containing a single node
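The survival test is a one-liner; a sketch (whether "non-minority" is strictly greater than half, as the slide's wording suggests, is an assumption here):

    #include <stdbool.h>

    /* A regrouped set of nodes survives only if it is a non-minority
     * of the old active cluster AND it holds the quorum device. */
    bool survives(int new_members, int old_active, bool has_quorum)
    {
        bool non_minority = 2 * new_members > old_active;
        return non_minority && has_quorum;
    }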
69. Global Update
- Propagates updates to all nodes in the cluster
- Used to maintain the replicated cluster registry
- Updates are atomic and totally ordered
- Tolerates all benign failures
- Depends on membership:
- all are up
- all can communicate
- R. Carr, Tandem Systems Review, V1.2, 1985, sketches the regroup and global update protocols.
70. Global Update Algorithm
- Cluster has a locker node that regulates updates
- oldest active node in the cluster
- Send the update to the locker node
- Update the other (active) nodes
- in seniority order (e.g., locker first)
- this includes the updating node
- Failure of all updated nodes:
- update never happened
- updated nodes will roll back on recovery
- Survival of any updated node:
- new locker is oldest, and so has the update if any node does
- new locker restarts the update
- (sketched below)
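A sketch of the flow, after Carr's protocol; rpc_send() and the seniority-ordered nodes[] array are hypothetical:

    #define MAX_NODES 16

    extern int  n_nodes;
    extern int  nodes[MAX_NODES];    /* seniority order; nodes[0] = locker */
    extern void rpc_send(int node, const void *update);

    void global_update(const void *update)
    {
        rpc_send(nodes[0], update);        /* 1. lock at the locker node  */
        for (int i = 0; i < n_nodes; i++)  /* 2. apply in seniority order */
            rpc_send(nodes[i], update);    /*    (locker first; includes  */
                                           /*    the updating node)       */
        /* If every updated node fails, the update never happened.
         * If any survives, the oldest survivor is the new locker,
         * has already seen the update, and restarts it. */
    }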
71. Cluster Registry
- Separate from the local NT Registry
- Maintains cluster configuration:
- members, resources, restart parameters, etc.
- Stable storage
- Replicated at each member
- Global Update protocol
- NT Registry keeps a local copy
72. Cluster Registry Bootstrapping
- Membership uses the Cluster Registry for its list of nodes
- Circular dependency!
- Solution:
- Membership uses the stale local cluster registry
- refresh after joining or forming the cluster
- Master is either:
- the quorum device, or
- the active members
73. Resource Monitor
- Polls resources
- IsAlive and LooksAlive
- Detects failures:
- polling failure
- failure event from the resource
- Higher levels tell it:
- Online, Offline
- Restart
74. Failover Manager
- Assigns groups to nodes based on:
- failover parameters
- possible nodes for each resource in the group
- preferred nodes for the resource group
75. Failover (Resource Goes Offline)
[Flowchart, roughly: the Resource Monitor detects a resource error and attempts local restarts; once the resource retry limit is exceeded, it notifies the Failover Manager, which switches the resource (and its dependents) Offline and checks the Failover Window and Failover Threshold. If the failover conditions are outside those constraints, the group is left in a partially Online state until the failback window; otherwise arbitration looks for another owner, and the Failover Manager on the new system brings the resource Online.]
76. Pushing a Group (Resource Failure)
[Flowchart, roughly:
1. The Resource Monitor notifies the Resource Manager of a resource failure.
2. If no resource has "Affect the Group" = True, leave the group in a partially Online state.
3. Otherwise the Resource Manager enumerates all objects in the dependency tree of the failed resource and takes each dependent resource Offline.
4. The Resource Manager notifies the Failover Manager that the dependency tree is Offline and needs to fail over.
5. The Failover Manager performs arbitration to locate a new owner for the group.
6. The Failover Manager on the new owner node brings the resources Online.]
77. Pulling a Group (Node Failure)
[Flowchart, roughly:
1. The Cluster Service notifies the Failover Manager of the node failure.
2. The Failover Manager determines which groups were owned by the failed node and that they need to fail over.
3. The Failover Manager performs arbitration to locate a new owner for the groups.
4. The Failover Manager on the new owner(s) brings the resources Online in dependency order.]
78. Failback to Preferred Owner Node
- A group may have a Preferred Owner
- Preferred Owner comes back online
- Will only occur during the Failback Window (time slot, e.g., at night)
[Flowchart, roughly: when the Preferred Owner comes back Online and the time is within the Failback Window, the Resource Manager notifies the Failover Manager that the group is Offline and needs to fail over to the Preferred Owner; arbitration locates the Preferred Owner, the Resource Manager takes each resource on the current owner Offline, and the Failover Manager on the Preferred Owner brings the resources Online.]
79. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
80. Process Structure
- Cluster Service:
- Failover Manager
- Cluster Registry
- Global Update
- Quorum
- Membership
- Resource Monitors:
- Resource DLLs
- Resources:
- services
- applications
[Diagram: these components inside a node]
81. Resource Control
- Commands:
- CreateResource()
- OnlineResource()
- OfflineResource()
- TerminateResource()
- CloseResource()
- ShutdownProcess()
- And resource events
[Diagram: within a node, the Cluster Service makes private calls to the Resource Monitor, which calls into the resource DLL, which makes private calls to the resource]
82. Resource DLLs
- Calls to a resource DLL (skeleton below):
- Open: get handle
- Online: start offering service
- Offline: stop offering service
- as a standby, or
- pair is offline
- LooksAlive: quick check
- IsAlive: thorough check
- Terminate: forceful offline
- Close: release handle
[Diagram: the Resource Monitor makes standard calls into the DLL, which makes private calls to the resource]
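A skeleton of the seven entry points in C. This is a sketch: the real Resource API (resapi.h) uses RESID / RESOURCE_HANDLE types, specific signatures, and a registered function table, all simplified away here:

    #include <stdbool.h>
    #include <stdlib.h>

    typedef void *RESID;               /* simplified resource identifier */

    RESID MyOpen(const wchar_t *name)  /* get a handle                   */
    {
        return malloc(1);              /* per-resource state lives here  */
    }

    unsigned MyOnline(RESID r)         /* start offering service         */
    {
        return 0;                      /* 0 = success                    */
    }

    unsigned MyOffline(RESID r)        /* stop offering service cleanly  */
    {
        return 0;
    }

    bool MyLooksAlive(RESID r)         /* quick, cheap check             */
    {
        return true;
    }

    bool MyIsAlive(RESID r)            /* thorough check                 */
    {
        return true;
    }

    void MyTerminate(RESID r)          /* forceful offline               */
    {
    }

    void MyClose(RESID r)              /* release the handle             */
    {
        free(r);
    }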
83. Cluster Communications
- Most communication via DCOM/RPC
- UDP used for membership heartbeat messages
- Standard (e.g., Ethernet) interconnects
[Diagram: management apps reach the Cluster Service via DCOM/RPC for admin; heartbeats travel over UDP; the Cluster Service talks to the Resource Monitors via DCOM/RPC]
84. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
85. Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC Wizard
- Cluster API
86. Virtual Servers
- Problem:
- client and server applications do not want the node name to change when the server app moves to another node
- A Virtual Server simulates an NT node:
- resource group (name, disks, databases, ...)
- NetName and IP address (node \\a keeps its name and IP address as it moves)
- virtual registry (the registry moves (is replicated))
- virtual Service Control
- virtual RPC service
- Challenges:
- limit the app to the virtual server's devices and services
- client reconnect on failover (easy if connectionless, e.g., web clients)
[Diagram: Virtual Server \\a at 1.2.3.4]
87. Virtual Servers (before failover)
- Nodes \\Y and \\Z support virtual servers \\A and \\B
- Things that need to fail over transparently:
- client connection
- server dependencies
- service names
- binding to local resources
- binding to local servers
[Diagram: node \\Y hosts virtual server \\A (SAP on A, SQL, drive S:); node \\Z hosts virtual server \\B (SAP on B, SQL, drive T:)]
88. Virtual Servers (just after failover)
- \\Y's resources and groups (i.e., Virtual Server \\A) moved to \\Z
- \\A's resources bind to each other and to local resources (e.g., the local file system):
- registry
- physical resources
- security domain
- time
- Transactions used to make DB state consistent
- To work, local resources on \\Y and \\Z have to be similar
- e.g., time must remain monotonic after failover
[Diagram: node \\Z now hosts both \\A and \\B]
89. Address Failover and Client Reconnection
- Name and address rebind to the new node
- details later
- Clients reconnect:
- failure is not transparent
- must log on again
- client context lost (encourages connectionless designs)
- applications could maintain context
[Diagram: node \\Z now serves virtual servers \\A and \\B (SAP on A, SAP on B, SQL, drives S: and T:)]
90. Mapping Local References to Group-Relative References
- Send client requests to the correct server:
- \\A\SAP refers to \\.\SQL
- \\B\SAP refers to \\.\SQL
- Must remap references:
- \\A\SAP to \\.\SQLA
- \\B\SAP to \\.\SQLB
- Also handles namespace collision
- Done via:
- modifying server apps, or
- DLLs to transparently rename
[Diagram: nodes \\Y and \\Z hosting virtual servers \\A and \\B, each with SAP and SQL]
91. Naming and Binding and Failover
- Services rely on the NT node name and/or IP address to advertise shares, printers, and services.
- Applications register names to advertise services:
- example: \\Alice\SQL (i.e., <node>\<service>)
- example: 128.2.2.2:80 (http://www.foo.com/)
- Binding:
- clients bind to an address (e.g., name -> IP address)
- thus the node name and IP address must fail over along with the services (preserve client bindings)
92. Client-to-Cluster Communications: IP Address Mobility Based on MAC Rebinding
- Cluster clients:
- must use IP (TCP, UDP, NBT, ...)
- must reconnect or retry after failure
- Cluster servers:
- all cluster nodes must be on the same LAN segment
- the IP address rebinds to the failover node's MAC address
- transparent to client or server
- low-level ARP (address resolution protocol) rebinds the IP address to the new MAC address
[Diagram: a WAN client knows Alice <-> 200.110.120.4, Virtual Alice <-> 200.110.120.5, Betty <-> 200.110.120.6, Virtual Betty <-> 200.110.120.7; on the local network, the router maps 200.110.120.4 and .5 to Alice's MAC, and 200.110.120.6 and .7 to Betty's MAC]
93. Time
- Time must increase monotonically
- otherwise applications get confused
- e.g., make/nmake/build
- Time is maintained within failover resolution
- not hard, since failover is on the order of seconds
- Time is a resource, so one node owns the time resource
- Other nodes periodically correct drift from the owner's time
94. Application Local NT Registry Checkpointing
- Resources can request that local NT registry sub-trees be replicated
- Changes written out to the quorum device
- Uses the registry change-notification interface (sketched below)
- Changes read and applied on fail-over
[Diagram: while \\A runs on \\X, each registry update is checkpointed to the quorum device; after failover, \\A on \\B reads the checkpoint back]
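A sketch of the watch loop using the documented Win32 change-notification call (RegNotifyChangeKeyValue); where the checkpoint is written, and the save routine itself, are hypothetical:

    #include <windows.h>

    /* hypothetical: serialize the sub-tree to the quorum device */
    extern void save_subtree_to_quorum(HKEY subtree);

    void watch_and_checkpoint(HKEY subtree)
    {
        HANDLE event = CreateEvent(NULL, FALSE, FALSE, NULL);
        for (;;) {
            RegNotifyChangeKeyValue(
                subtree,
                TRUE,                         /* watch the whole sub-tree */
                REG_NOTIFY_CHANGE_NAME |      /* key create/delete        */
                REG_NOTIFY_CHANGE_LAST_SET,   /* value writes             */
                event,
                TRUE);                        /* signal asynchronously    */
            WaitForSingleObject(event, INFINITE);
            save_subtree_to_quorum(subtree);  /* checkpoint the change    */
        }
    }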
95. Registry Replication
96. Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC Wizard
- Cluster API
97. Generic Resource DLLs
- Generic Application DLL:
- simplest: just starts and stops the application, and makes sure the process is alive
- Generic Service DLL (sketched below):
- translates DLL calls into equivalent NT service calls
- Online => Service Start
- Offline => Service Stop
- Looks/IsAlive => Service Status
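The mapping in code. These are real Win32 service-controller calls, but the resource-DLL framing, the "MyService" name, and the omitted error handling make this a sketch rather than the shipped Generic Service DLL:

    #include <windows.h>

    static SC_HANDLE open_svc(void)       /* handle caching omitted   */
    {
        SC_HANDLE scm = OpenSCManager(NULL, NULL, SC_MANAGER_CONNECT);
        return OpenService(scm, TEXT("MyService"), SERVICE_ALL_ACCESS);
    }

    void GenericOnline(void)              /* Online  => Service Start */
    {
        StartService(open_svc(), 0, NULL);
    }

    void GenericOffline(void)             /* Offline => Service Stop  */
    {
        SERVICE_STATUS st;
        ControlService(open_svc(), SERVICE_CONTROL_STOP, &st);
    }

    BOOL GenericIsAlive(void)             /* Is/LooksAlive => Status  */
    {
        SERVICE_STATUS st;
        QueryServiceStatus(open_svc(), &st);
        return st.dwCurrentState == SERVICE_RUNNING;
    }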
98. Generic Application
99. Generic Service
100. Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC Wizard
- Cluster API
101. Resource DLL VC Wizard
- Asks for the resource type name
- Asks for an optional service to control
- Asks for other parameters (and associated types)
- Generates DLL source code
- Source can be modified as necessary
- e.g., additional checks for Looks/IsAlive
102. Creating a New Workspace
103. Specifying Resource Type Name
104. Specifying Resource Parameters
105. Automatic Code Generation
106. Customizing the Code
107. Application Support
- Virtual Servers
- Generic Resource DLLs
- Resource DLL VC Wizard
- Cluster API
108. Cluster API
- Allows resources to:
- examine dependencies
- manage per-resource data
- change parameters (e.g., failover)
- listen for cluster events
- etc.
- Specs and API became public Sept 1996
- on all MSDN Level 3
- on the web site: http://www.microsoft.com/clustering.htm
- (example below)
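For flavor, a small client that connects to the local cluster and lists its nodes through the published Cluster API (clusapi.h); error handling is omitted, so treat it as a sketch:

    #include <windows.h>
    #include <clusapi.h>
    #include <stdio.h>

    int main(void)
    {
        HCLUSTER  hc = OpenCluster(NULL);   /* NULL = the local cluster */
        HCLUSENUM he = ClusterOpenEnum(hc, CLUSTER_ENUM_NODE);

        WCHAR name[256];
        DWORD type, len, i = 0;
        for (;;) {
            len = 256;
            if (ClusterEnum(he, i++, &type, name, &len) != ERROR_SUCCESS)
                break;                      /* e.g. ERROR_NO_MORE_ITEMS */
            wprintf(L"node: %s\n", name);
        }
        ClusterCloseEnum(he);
        CloseCluster(hc);
        return 0;
    }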
109. Cluster API Documentation
110. Outline
- Why FT and Why Clusters
- Cluster Abstractions
- Cluster Architecture
- Cluster Implementation
- Application Support
- Q&A
111. Research Topics?
- Even easier to manage
- Transparent failover
- Instant failover
- Geographic distribution (disaster tolerance)
- Server pools (load-balanced pool of processes)
- Process pairs (active/backup process)
- 10,000 nodes?
- Better algorithms
- Shared memory or shared disk among nodes
- a truly bad idea?
112. References
- Microsoft NT site: http://www.microsoft.com/ntserver/
- BARC site (e.g., these slides): http://research.microsoft.com/joebar/wolfpack
- Inside Windows NT, H. Custer, Microsoft Press, ISBN 155615481.
- "Tandem Global Update Protocol", R. Carr, Tandem Systems Review, V1.2, 1985. Sketches the regroup and global update protocols.
- "VAXclusters: a Closely Coupled Distributed System", Kronenberg, N., Levy, H., Strecker, W., ACM TOCS, V4.2, 1986. A (the) shared-disk cluster.
- In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN 0134376250. Argues for shared-nothing.
- Transaction Processing: Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.