FT NT: A Tutorial on Microsoft Cluster Server

1996, 1997 Microsoft Corp.

1
FT NT: A Tutorial on Microsoft Cluster Server (formerly Wolfpack)
  • Joe Barrera
  • Jim Gray
  • Microsoft Research
  • {joebar, gray}@microsoft.com
  • http://research.microsoft.com/barc

2
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

3
DEPENDABILITY: The 3 ITIES
  • RELIABILITY / INTEGRITY: Does the right thing. (also large MTTF)
  • AVAILABILITY: Does it now. (also small MTTR)
    Availability = MTTF / (MTTF + MTTR)
    System Availability: if 90% of terminals are up and 99% of the DB is up, then about 89% of transactions are serviced on time.
  • Holistic vs. Reductionist view
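
The availability arithmetic on this slide can be sketched in a few lines of Python. The function names and the 10-week / 90-minute inputs are illustrative assumptions, and the sketch assumes terminal and DB failures are independent.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def series_availability(*parts: float) -> float:
    """Availability of a chain in which every part must be up (independent failures)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


# Slide example: 90% of terminals up and 99% of the DB up.
end_to_end = series_availability(0.90, 0.99)
print(f"{end_to_end:.3f}")  # 0.891

# Assumed inputs for illustration: 10-week MTTF, 90-minute MTTR.
print(f"{availability(10 * 7 * 24, 1.5):.5f}")
```

Because MTTR sits in the denominator next to a much larger MTTF, halving repair time removes about as much downtime as doubling MTTF.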

[Figure: Security, Integrity, Reliability, Availability]
4
Case Study - Japan
"Survey on Computer Security", Japan Info Dev Corp., March 1986. (trans: Eiichi Watanabe).
[Pie chart, share of outages by cause: Vendor 42%, Application Software 25%, Tele Comm lines 12%, Environment 11.2%, Operations 9.3%]
  • Vendor (hardware and software): 5 months
  • Application software: 9 months
  • Communications lines: 1.5 years
  • Operations: 2 years
  • Environment: 2 years
  • Overall: 10 weeks
  • 1,383 institutions reported (6/84 - 7/85)
  • 7,517 outages, MTTF 10 weeks, avg duration 90 MINUTES
  • To get a 10-year MTTF, must attack all these areas
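
A quick check, in Python, of how the per-cause figures combine into the 10-week overall MTTF. It assumes independent causes (so failure rates add) and assumes a reporting period of about 13 months; both are assumptions of this sketch, not statements from the survey.

```python
# Per-cause MTTF in months, from the slide (1.5 years = 18 months, 2 years = 24 months).
mttf_months = {
    "vendor": 5,
    "application software": 9,
    "communications lines": 18,
    "operations": 24,
    "environment": 24,
}

WEEKS_PER_MONTH = 52 / 12


def combined_mttf(mttfs):
    """Failure rates of independent causes add; combined MTTF is 1 / (sum of rates)."""
    return 1.0 / sum(1.0 / m for m in mttfs)


overall_months = combined_mttf(mttf_months.values())
overall_weeks = overall_months * WEEKS_PER_MONTH

# Cross-check from the raw counts: institution-weeks observed per outage.
ASSUMED_PERIOD_MONTHS = 13  # assumption: 6/84 through 7/85 is roughly 13 months
observed_weeks = 1383 * ASSUMED_PERIOD_MONTHS * WEEKS_PER_MONTH / 7517

print(round(overall_weeks, 1), round(observed_weeks, 1))
```

Both routes land near 10 weeks, which is why no single cause can be fixed in isolation to reach a 10-year MTTF: the remaining causes still dominate the sum of rates.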

5
Case Studies - Tandem Trends
  • MTTF improved
  • Shift from Hardware & Maintenance (from 50% to 10%) to Software (62%) & Operations (15%)
  • NOTE: Systematic under-reporting of Environment, Operations errors, and Application Software

6
Summary of FT Studies
  • Current Situation: 4-year MTTF => Fault Tolerance Works.
  • Hardware is GREAT (maintenance and MTTF).
  • Software masks most hardware faults.
  • Many hidden software outages in operations
  • New Software.
  • Utilities.
  • Must make all software ONLINE.
  • Software seems to define a 30-year MTTF ceiling.
  • Reasonable Goal: 100-year MTTF. Class 4 today => class 6 tomorrow.

7
Fault Tolerance vs Disaster Tolerance
  • Fault Tolerance: masks local faults
  • RAID disks
  • Uninterruptible Power Supplies
  • Cluster Failover
  • Disaster Tolerance: masks site failures
  • Protects against fire, flood, sabotage, ...
  • Redundant system and service at remote site.

8
The Microsoft Vision: Plug & Play Dependability
  • Transactions for reliability
  • Clusters for availability
  • Security
  • All built into the OS

[Figure: Security, Integrity / Reliability, Availability]
9
Cluster Goals
  • Manageability
  • Manage nodes as a single system
  • Perform server maintenance without affecting
    users
  • Mask faults, so repair is non-disruptive
  • Availability
  • Restart failed applications & servers
  • Un-availability ≈ MTTR / MTBF, so quick repair.
  • Detect/warn administrators of failures
  • Scalability
  • Add nodes for incremental
  • processing
  • storage
  • bandwidth

10
Fault Model
  • Failures are independent, so single-fault tolerance is a big win
  • Hardware fails fast (blue-screen)
  • Software fails-fast (or goes to sleep)
  • Software often repaired by reboot
  • Heisenbugs
  • Operations tasks major source of outage
  • Utility operations
  • Software upgrades

11
Cluster: Servers Combined to Improve Availability & Scalability
  • Cluster: A group of independent systems working together as a single system. Clients see scalable & FT services (single system image).
  • Node: A server in a cluster. May be an SMP server.
  • Interconnect: Communications link used for intra-cluster status info such as heartbeats. Can be Ethernet.

12
Microsoft Cluster Server
  • 2-node availability: Summer 97 (20,000 Beta Testers now)
  • Commoditize fault-tolerance (high availability)
  • Commodity hardware (no special hardware)
  • Easy to set up and manage
  • Lots of applications work out of the box.
  • 16-node scalability later (next year?)

13
Failover Example
[Diagram labels: Server 1, Server 2; Web site and Database on each server; Web site files; Database files]
14
MS Press Failover Demo
  • Client/Server
  • Software failure
  • Admin shutdown
  • Server failure

[Legend: Resource States - Pending, Partial, Failed, Offline]
15
Demo Configuration
SCSI Disk Cabinet
Windows NT Server Cluster
16
Demo Administration
[Diagram: Server Alice runs SQL Trace and Globe; Server Betty runs SQL Trace; SCSI Disk Cabinet; Windows NT Server Cluster; Client]
17
Generic Stateless Application: Rotating Globe
  • Mplay32 is generic app.
  • Registered with MSCS
  • MSCS restarts it on failure
  • Move/restart: 2 seconds
  • Fail-over if 4 failures (process exits) in 3 minutes (settable default)
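
The restart-then-fail-over rule can be sketched as a sliding-window counter. This is a minimal Python illustration; the class and method names are hypothetical and are not MSCS identifiers. Only the defaults (4 failures in 3 minutes) come from the slide.

```python
from collections import deque


class RestartPolicy:
    """Restart locally until `threshold` failures fall within `period_s` seconds."""

    def __init__(self, threshold: int = 4, period_s: float = 180.0):
        self.threshold = threshold
        self.period_s = period_s
        self._failures = deque()  # timestamps of recent failures

    def on_failure(self, now_s: float) -> str:
        """Record a process exit and return the action to take."""
        self._failures.append(now_s)
        # Forget failures that have aged out of the window.
        while now_s - self._failures[0] > self.period_s:
            self._failures.popleft()
        if len(self._failures) >= self.threshold:
            self._failures.clear()
            return "failover"
        return "restart"


policy = RestartPolicy()
print([policy.on_failure(t) for t in (0, 10, 20, 30)])
```

Four exits within 30 seconds trigger a fail-over, while the same four exits spread 100 seconds apart would only ever cause local restarts.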

18
Demo Moving or Failing Over An Application
SCSI Disk Cabinet
Windows NT Server Cluster
19
Generic Stateful Application: NotePad
  • Notepad saves state on shared disk
  • Failure before save => lost changes
  • Failover or move (disk state move)

20
Demo Step 1: Alice Delivering Service
[Diagram labels: SQL Activity, No SQL Activity; SQL, ODBC, and IIS on each server; SCSI Disk Cabinet; Windows NT Server Cluster; IP; HTTP]
21
Demo Step 2: Request Move to Betty
[Diagram labels: SCSI Disk Cabinet, Windows NT Server Cluster, HTTP]
22
Demo Step 3: Betty Delivering Service
[Diagram labels: SQL, ODBC, and IIS on each server; SCSI Disk Cabinet; Windows NT Server Cluster]
23
Demo Step 4: Power Fail Betty, Alice Takeover
[Diagram labels: SCSI Disk Cabinet, Windows NT Server Cluster]
24
Demo Step 5: Alice Delivering Service
[Diagram labels: SQL Activity, No SQL Activity, SQL, ODBC, IIS, SCSI Disk Cabinet, Windows NT Server Cluster, IP, HTTP]
25
Demo Step 6: Reboot Betty, which can now take over
[Diagram labels: SQL Activity, No SQL Activity, SQL, ODBC, IIS, SCSI Disk Cabinet, Windows NT Server Cluster, IP, HTTP]
26
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

27
Cluster and NT Abstractions
[Diagram: Cluster Abstractions (Resource, Group, Cluster) alongside NT Abstractions (Service, Node, Domain)]
28
Basic NT Abstractions
  • Service: program or device managed by a node
  • e.g., file service, print service, database server
  • can depend on other services (startup ordering)
  • can be started, stopped, paused, failed
  • Node: a single (tightly-coupled) NT system
  • hosts services; belongs to a domain
  • services on node always remain co-located
  • unit of service co-location; involved in naming services
  • Domain: a collection of nodes
  • cooperation for authentication, administration, naming

29
Cluster Abstractions
  • Resource: program or device managed by a cluster
  • e.g., file service, print service, database server
  • can depend on other resources (startup ordering)
  • can be online, offline, paused, failed
  • Resource Group: a collection of related resources
  • hosts resources; belongs to a cluster
  • unit of co-location; involved in naming resources
  • Cluster: a collection of nodes, resources, and groups
  • cooperation for authentication, administration, naming

30
Resources
Resource
Cluster
Group
  • Resources have...
  • Type: what it does (file, DB, print, web)
  • An operational state (online/offline/failed)
  • Current and possible nodes
  • Containing Resource Group
  • Dependencies on other resources
  • Restart parameters (in case of resource failure)

31
Resource Types
  • Built-in types
  • Generic Application
  • Generic Service
  • Internet Information Server (IIS) Virtual Root
  • Network Name
  • TCP/IP Address
  • Physical Disk
  • FT Disk (Software RAID)
  • Print Spooler
  • File Share
  • Added by others
  • Microsoft SQL Server,
  • Message Queues,
  • Exchange Mail Server,
  • Oracle,
  • SAP R/3
  • Your application? (use developer kit wizard).

32
Physical Disk
33
TCP/IP Address
34
Network Name
35
File Share
36
IIS (WWW/FTP) Server
37
Print Spooler
38
Resource States
  • Resource states
  • Offline: exists, not offering service
  • Online: offering service
  • Failed: not able to offer service
  • Resource failure may cause
  • local restart
  • other resources to go offline
  • resource group to move
  • (all subject to group and resource parameters)
  • Resource failure detected by
  • Polling failure
  • Node failure

[State diagram messages: "I'm Online!", "Go Off-line!", "Offline Pending", "I'm here!", "Go Online!", "I'm Off-line!"]
39
Resource Dependencies
  • Similar to NT Service Dependencies
  • Orderly startup & shutdown
  • A resource is brought online after any resources it depends on are online.
  • A resource is taken offline before any resources it depends on are taken offline.
  • Interdependent resources
  • Form dependency trees
  • move among nodes together
  • failover together
  • as per resource group
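
The ordering rule is a topological sort of the dependency tree. A minimal Python sketch follows, using the standard-library `graphlib` module (Python 3.9+). The resource names echo the Payroll Group example in these slides, but the specific dependency edges are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependencies: each resource maps to the resources it depends on.
depends_on = {
    "Web Server": {"IP Address", "Drive E"},
    "SQL Server": {"Drive F"},
    "IP Address": set(),
    "Drive E": set(),
    "Drive F": set(),
}


def online_order(deps):
    """Dependencies come online before the resources that need them."""
    return list(TopologicalSorter(deps).static_order())


def offline_order(deps):
    """Going offline is the reverse: dependents stop before their dependencies."""
    return list(reversed(online_order(deps)))


print(online_order(depends_on))
print(offline_order(depends_on))
```

`TopologicalSorter` raises `CycleError` on circular dependencies, which matches the requirement that dependencies form trees.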

40
Dependencies Tab
41
NT Registry
  • Stores all configuration information
  • Software
  • Hardware
  • Hierarchical (name, value) map
  • Has an open, documented interface
  • Is secure
  • Is visible across the net (RPC interface)
  • Typical Entry
  • \Software\Microsoft\MSSQLServer\MSSQLServer\
  • DefaultLogin = GUEST
  • DefaultDomain = REDMOND

42
Cluster Registry
  • Separate from local NT Registry
  • Replicated at each node
  • Algorithms explained later
  • Maintains configuration information
  • Cluster members
  • Cluster resources
  • Resource and group parameters (e.g. restart)
  • Stable storage
  • Refreshed from master copy when node joins
    cluster

43
Other Resource Properties
  • Name
  • Restart policy (restart N times, failover)
  • Startup parameters
  • Private configuration info (resource type
    specific)
  • Per-node as well, if necessary
  • Poll Intervals (LooksAlive, IsAlive, Timeout)
  • These properties are all kept in Cluster Registry

44
General Resource Tab
45
Advanced Resource Tab
46
Resource Groups
Resource
Cluster
Group
  • Every resource belongs to a resource group.
  • Resource groups move (failover) as a unit
  • Dependencies NEVER cross groups. (Dependency
    trees contained within groups.)
  • Group may contain forest of dependency trees

[Example: Payroll Group containing Web Server, SQL Server, IP Address, Drive E, Drive F]
47
Moving a Resource Group
48
Group Properties
  • CurrentState: Online, Partially Online, Offline
  • Members: resources that belong to group
  • members determine which nodes can host group.
  • Preferred Owners: ordered list of host nodes
  • FailoverThreshold: How many faults cause failover
  • FailoverPeriod: Time window for failover threshold
  • FailbackWindowStart: When can failback happen?
  • FailbackWindowEnd: When can failback happen?
  • Everything (except CurrentState) is stored in registry

49
Failover and Failback
  • Failover parameters
  • timeout on LooksAlive, IsAlive
  • local restarts within the failure window; after this, offline.
  • Failback to preferred node
  • (during failback window)
  • Do resource failures affect group?

[Diagram labels: Node \\Betty and Node \\Alice, each with a Cluster Service; IPaddr; name]
50
Cluster Concepts: Clusters
[Diagram: a Cluster containing Groups, each containing Resources]
51
Cluster Properties
  • Defined Members: nodes that can join the cluster
  • Active Members: nodes currently joined to cluster
  • Resource Groups: groups in a cluster
  • Quorum Resource
  • Stores copy of cluster registry.
  • Used to form quorum.
  • Network: which network is used for communication
  • All properties kept in Cluster Registry

52
Cluster API Functions (operations on nodes & groups)
  • Find and communicate with Cluster
  • Query/Set Cluster properties
  • Enumerate Cluster objects
  • Nodes
  • Groups
  • Resources and Resource Types
  • Cluster Event Notifications
  • Node state and property changes
  • Group state and property changes
  • Resource state and property changes

53
Cluster Management
54
Demo
  • Server startup and shutdown
  • Installing applications
  • Changing status
  • Failing over
  • Transferring ownership of groups or resources
  • Deleting Groups and Resources

55
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

56
Architecture
  • Top tier provides cluster abstractions
  • Middle tier provides distributed operations
  • Bottom tier is NT and drivers

[Diagram, tiers top to bottom: Failover Manager, Resource Monitor, Cluster Registry; Global Update, Quorum, Membership; Windows NT Server, Cluster Disk Driver, Cluster Net Drivers]
57
Membership and Regroup
  • Membership
  • Used for orderly addition and removal from
    active nodes
  • Regroup
  • Used for failure detection (via heartbeat
    messages)
  • Forceful eviction from active nodes

[Diagram, tiers top to bottom: Failover Manager, Resource Monitor, Cluster Registry; Global Update, Membership, Regroup; Windows NT Server, Cluster Disk Driver, Cluster Net Drivers]
58
Membership
  • Defined cluster all nodes
  • Active cluster
  • Subset of defined cluster
  • Includes Quorum Resource
  • Stable (no regroup in progress)

59
Quorum Resource
  • Usually (but not necessarily) a SCSI disk
  • Requirements
  • Arbitrates for a resource by supporting the
    challenge/defense protocol
  • Capable of storing cluster registry and logs
  • Configuration Change Logs
  • Tracks changes to configuration database when any
    defined member missing (not active)
  • Prevents configuration partitions in time

60
Challenge/Defense Protocol
  • SCSI-2 has reserve/release verbs
  • Semaphore on disk controller
  • Owner gets lease on semaphore
  • Renews lease once every 3 seconds
  • To preempt ownership
  • Challenger clears semaphore (SCSI bus reset)
  • Waits 10 seconds
  • 3 seconds for renewal + 2 seconds bus settle time
  • x2 to give owner two chances to renew
  • If still clear, then former owner loses lease
  • Challenger issues reserve to acquire semaphore
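
A toy Python simulation of the challenge/defense timing above. `QuorumDisk` is a hypothetical stand-in for the reservation held by the disk controller, not a real SCSI interface, and time advances in whole seconds for simplicity.

```python
RENEW_INTERVAL_S = 3   # owner renews its lease every 3 seconds
BUS_SETTLE_S = 2       # bus settle time after a reset
CHALLENGE_WAIT_S = 2 * (RENEW_INTERVAL_S + BUS_SETTLE_S)  # two chances to renew


class QuorumDisk:
    """Toy model of the reservation (semaphore) on the disk controller."""

    def __init__(self):
        self.reserved_by = None

    def reserve(self, node: str) -> bool:
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def bus_reset(self) -> None:
        self.reserved_by = None


def challenge(disk: QuorumDisk, challenger: str, defender: str, defender_alive: bool) -> str:
    """Return the node that owns the quorum disk after one challenge."""
    disk.bus_reset()                              # challenger clears the semaphore
    for t in range(1, CHALLENGE_WAIT_S + 1):      # challenger waits
        if defender_alive and t % RENEW_INTERVAL_S == 0:
            disk.reserve(defender)                # a live owner renews and defends
    if disk.reserved_by is None:                  # still clear: former owner loses
        disk.reserve(challenger)
    return disk.reserved_by


disk = QuorumDisk()
disk.reserve("Alice")
print(CHALLENGE_WAIT_S, challenge(disk, "Betty", "Alice", defender_alive=True))
```

A live defender re-reserves during the wait and keeps the disk; a dead one never renews, so the challenger's reserve succeeds.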

61
Challenge/Defense Protocol: Successful Defense
62
Challenge/Defense Protocol: Successful Challenge
[Diagram labels: Defender Node, Challenger Node, No reservation detected]
63
Regroup
  • Invariant: All members agree on members
  • Regroup re-computes members
  • Each node sends heartbeat message to a peer
    (default is one per second)
  • Regroup if two lost heartbeat messages
  • suspicion that sender is dead
  • failure detection in bounded time
  • Uses a 5-round protocol to agree.
  • Checks communication among nodes.
  • Suspected missing node may survive.
  • Upper levels (global update, etc.) informed of
    regroup event.
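
The heartbeat trigger for regroup can be sketched as follows. This Python fragment covers only the "two lost heartbeats" detection rule with the one-per-second default; it does not model the 5-round agreement protocol, and the function names are hypothetical.

```python
HEARTBEAT_INTERVAL_S = 1.0  # slide default: one heartbeat per second
MISSED_LIMIT = 2            # regroup after two lost heartbeats


def missed_heartbeats(last_seen_s: float, now_s: float,
                      interval_s: float = HEARTBEAT_INTERVAL_S) -> int:
    """Number of expected heartbeats that have not arrived since the last one seen."""
    return max(0, int((now_s - last_seen_s) // interval_s))


def should_regroup(last_seen_s: float, now_s: float,
                   interval_s: float = HEARTBEAT_INTERVAL_S,
                   limit: int = MISSED_LIMIT) -> bool:
    """Suspect the sender, and start regroup, once `limit` heartbeats are missing."""
    return missed_heartbeats(last_seen_s, now_s, interval_s) >= limit


print(should_regroup(10.0, 11.5), should_regroup(10.0, 12.0))
```

Under these assumptions a silent peer is suspected within about two heartbeat intervals, which is the bounded detection time the slide refers to.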

64
Membership State Machine
[State diagram. States: Initialize, Sleeping, Member Search, Quorum Disk Search, Forming, Joining, Online, Regroup. Transition labels: Start Cluster, Search Fails, Search or Reserve Fails, Found Online Member, Acquire (reserve) Quorum Disk, Minority or no Quorum, Non-Minority and Quorum, Join Succeeds, Synchronize Succeeds, Lost Heartbeat]
65
Joining a Cluster
  • When a node starts up, it mounts and configures
    only local, non-cluster devices
  • Starts Cluster Service which
  • looks in local (stale) registry for members
  • Asks each member in turn to sponsor new node's membership. (Stop when sponsor found.)
  • Sponsor (any active member)
  • Sponsor authenticates applicant
  • Broadcasts applicant to cluster members
  • Sponsor sends updated registry to applicant
  • Applicant becomes a cluster member

66
Forming a Cluster (when Joining fails)
  • Use registry to find quorum resource
  • Attach to (arbitrate for) quorum resource
  • Update cluster registry from quorum resource
  • e.g. if we were down when it was in use
  • Form new one-node cluster
  • Bring other cluster resources online
  • Let others join your cluster

67
Leaving A Cluster (Gracefully)
  • Pause
  • Move all groups off this member.
  • Change to paused state (remains a cluster member)
  • Offline
  • Move all groups off this member.
  • Sends ClusterExit message to all cluster members
  • Prevents regroup
  • Prevents stalls during departure transitions
  • Close Cluster connections (now not an active
    cluster member)
  • Cluster service stops on node
  • Evict: remove node from defined member list

68
Leaving a Cluster (Node Failure)
  • Node (or communication) failure triggers Regroup
  • If after regroup
  • Minority group OR no quorum device
  • group does NOT survive
  • Non-minority group AND quorum device
  • group DOES survive
  • Non-Minority rule
  • Number of new members >= 1/2 old active cluster
  • Prevents minority from seizing quorum device at
    the expense of a larger potentially surviving
    cluster
  • Quorum guarantees correctness
  • Prevents split-brain
  • e.g. with newly forming cluster containing a
    single node
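
The survival test after regroup reduces to two conditions. A minimal Python sketch, reading "non-minority" as "at least half of the old active cluster" (the reading under which one node of a 2-node cluster can survive); the function name is hypothetical.

```python
def group_survives(new_members, old_active, owns_quorum: bool) -> bool:
    """A regrouped set of nodes survives only if it is a non-minority AND holds the quorum device."""
    non_minority = 2 * len(new_members) >= len(old_active)
    return non_minority and owns_quorum


# Two-node cluster, Betty fails: Alice is exactly half, so she survives if she wins the quorum disk.
print(group_survives({"Alice"}, {"Alice", "Betty"}, owns_quorum=True))

# One node out of three is a minority and must not seize the quorum device.
print(group_survives({"A"}, {"A", "B", "C"}, owns_quorum=True))
```

When a cluster splits exactly in half, both halves pass the non-minority test, and the quorum device is what guarantees only one of them continues.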

69
Global Update
  • Propagates updates to all nodes in cluster
  • Used to maintain replicated cluster registry
  • Updates are atomic and totally ordered
  • Tolerates all benign failures.
  • Depends on membership
  • all are up
  • all can communicate
  • R. Carr, Tandem Systems Review. V1.2 1985,
    sketches regroup and global update protocol.

70
Global Update Algorithm
  • Cluster has locker node that regulates updates.
  • Oldest active node in cluster
  • Send Update to locker node
  • Update other (active) nodes
  • in seniority order (e.g. locker first)
  • this includes the updating node
  • Failure of all updated nodes
  • Update never happened
  • Updated nodes will roll back on recovery
  • Survival of any updated nodes
  • New locker is oldest and so has update if any do.
  • New locker restarts update
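
A simplified Python simulation of the seniority-ordered update and its recovery rule. The `Node` class, `propagate`, and `recover` are hypothetical names; the sketch treats "has the key" as "has the update" and ignores locking, rollback on restart, and message transport.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    alive: bool = True
    registry: dict = field(default_factory=dict)


def propagate(nodes, key, value, stop_after=None) -> bool:
    """Apply an update to active nodes in seniority order (nodes[0] is the oldest, the locker).

    `stop_after` simulates the sender failing after that many nodes were updated.
    """
    done = 0
    for node in nodes:
        if not node.alive:
            continue
        if stop_after is not None and done == stop_after:
            return False  # sender died mid-update
        node.registry[key] = value
        done += 1
    return True


def recover(nodes, key) -> str:
    """The oldest survivor becomes locker and restarts the update if it has it."""
    survivors = [n for n in nodes if n.alive]
    locker = survivors[0]
    if key in locker.registry:
        propagate(survivors, key, locker.registry[key])
    return locker.name


cluster = [Node("A"), Node("B"), Node("C")]
propagate(cluster, "restart_limit", 4, stop_after=2)  # A and B updated, C not yet
cluster[0].alive = False                              # old locker fails
print(recover(cluster, "restart_limit"), cluster[2].registry)
```

Because updates are applied oldest-first, the oldest survivor has the update whenever any survivor does, so it alone can decide whether to finish the update.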

71
Cluster Registry
  • Separate from local NT Registry
  • Maintains cluster configuration
  • members, resources, restart parameters, etc.
  • Stable storage
  • Replicated at each member
  • Global Update protocol
  • NT Registry keeps local copy

72
Cluster Registry Bootstrapping
  • Membership uses Cluster Registry for list of
    nodes
  • Circular dependency
  • Solution
  • Membership uses stale local cluster registry
  • Refresh after joining or forming cluster
  • Master is either
  • quorum device, or
  • active members

73
Resource Monitor
  • Polls resources
  • IsAlive and LooksAlive
  • Detects failures
  • polling failure
  • failure event from resource
  • Higher levels tell it
  • Online, Offline
  • Restart

74
Failover Manager
Failover Manager
  • Assigns groups to nodes based on
  • Failover parameters
  • Possible nodes for each resource in group
  • Preferred nodes for resource group
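
One way the placement inputs above could combine, sketched in Python. This illustrates the three criteria only and is not the actual MSCS arbitration algorithm; the function name, argument shapes, and the alphabetical fallback are assumptions.

```python
def choose_owner(preferred_owners, possible_nodes, active_nodes, failed_node=None):
    """Pick a node to host a group.

    preferred_owners: ordered list of node names for the group.
    possible_nodes:   dict mapping each resource in the group to the nodes able to host it.
    active_nodes:     nodes currently in the cluster.
    """
    candidates = set(active_nodes) - {failed_node}
    for nodes in possible_nodes.values():
        candidates &= set(nodes)   # every resource in the group must be hostable
    for node in preferred_owners:  # honor the preferred-owner ordering first
        if node in candidates:
            return node
    # Assumed fallback: any remaining candidate, chosen deterministically.
    return min(candidates) if candidates else None


possible = {"SQL Server": {"Alice", "Betty"}, "Drive F": {"Alice", "Betty"}}
print(choose_owner(["Alice", "Betty"], possible, ["Alice", "Betty"], failed_node="Alice"))
```

Intersecting the per-resource node sets reflects the slide's point that a group's members determine which nodes can host the group.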

75
Failover (Resource Goes Offline)
Notify Failover Manager.
Resource Manager Detects resource error.
Failover Manager checks Failover Window
and Failover Threshold
Attempt to restart resource.
Wait for Failback Window
Are Failover conditions within Constraints?
No
Has the Resource Retry limit been exceeded?
No
Yes
Leave Group in partially Online state.
Yes
No
Can another owner be found? (Arbitration)
Switch resource (and Dependants) Offline.
Notify Failover Manager on the new system
to bring resource Online.
Yes
76
Pushing a Group (Resource Failure)
Resource Monitor notifies Resource Manager of
resource failure.
Resource Manager enumerates all objects in
the Dependency Tree of the failed resource.
Resource Manager notifies Failover Manager that
the Dependency Tree is Offline and needs to fail
over.
Failover Manager performs Arbitration to locate a
new owner for the group.
Resource Manager takes each depending resource
Offline.
Failover Manager on the new owner node brings the
resources Online.
Any resource has Affect the Group True
No
Leave Group in partially Online state.
Yes
77
Pulling a Group (Node Failure)
Cluster Service notifies Failover Manager of node
failure.
Failover Manager determines which groups were
owned by the failed node.
Failover Manager performs Arbitration to locate a
new owner for the groups.
Failover Manager on the new owner(s) bring the
resources Online in dependency order.
Resource Manager notifies Failover Manager that
the node is Offline and the groups it owned
need to fail over.
78
Failback to Preferred Owner Node
  • Group may have a Preferred Owner
  • Preferred Owner comes back online
  • Will only occur during the Failback
    Window (time slot, e.g. at night)
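
A failback window such as "at night" usually wraps past midnight, which is easy to get wrong. A minimal Python check, assuming the window is expressed as whole hours with an exclusive end; that representation is an assumption of this sketch.

```python
def in_failback_window(hour: int, start_hour: int, end_hour: int) -> bool:
    """True if `hour` (0-23) lies in [start_hour, end_hour), allowing wrap past midnight."""
    if start_hour <= end_hour:
        return start_hour <= hour < end_hour
    # Window wraps midnight, e.g. 22:00 to 04:00.
    return hour >= start_hour or hour < end_hour


# Night-time window from 22:00 to 04:00.
print([h for h in range(24) if in_failback_window(h, 22, 4)])
```

Outside the window the group simply stays on its current owner until the window opens.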

Resource Manager takes each resource on
the current owner Offline.
Preferred owner comes back Online.
Failover Manager performs Arbitration to locate
the Preferred Owner of the group.
Is the time within the Failback Window?
Resource Manager notifies Failover Manager that
the Group is Offline and needs to fail over to
the Preferred Owner.
Failover Manager on the Preferred Owner brings
the resources Online.
79
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

80
Process Structure
  • Cluster Service
  • Failover Manager
  • Cluster Registry
  • Global Update
  • Quorum
  • Membership
  • Resource Monitor
  • Resource Monitor
  • Resource DLLs
  • Resources
  • Services
  • Applications

A Node
Cluster Service
81
Resource Control
  • Commands
  • CreateResource()
  • OnlineResource()
  • OfflineResource()
  • TerminateResource()
  • CloseResource()
  • ShutdownProcess()
  • And resource events

A Node
Resource Monitor
Cluster Service
Private calls
Resource Monitor
DLL
Private calls
Resource
82
Resource DLLs
  • Calls to Resource DLL
  • Open: get handle
  • Online: start offering service
  • Offline: stop offering service
  • as a standby or
  • pair-is offline
  • LooksAlive: Quick check
  • IsAlive: Thorough check
  • Terminate: Forceful Offline
  • Close: release handle

Resource Monitor
DLL
Private calls
Resource
Std calls
83
Cluster Communications
  • Most communication via DCOM /RPC
  • UDP used for membership heartbeat messages
  • Standard (e.g. Ethernet) interconnects

Management apps
DCOM / RPC admin UDP Heartbeat
Cluster Service
DCOM / RPC
Resource Monitors
Resource Monitors
84
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

85
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC Wizard
  • Cluster API

86
Virtual Servers
  • Problem
  • Client and Server Applications do not want node
    name to change when server app moves to another
    node.
  • A Virtual Server simulates an NT Node
  • Resource Group (name, disks, databases, ...)
  • NetName and IP address (node \\a keeps name and IP address as it moves)
  • Virtual Registry (registry moves (is replicated))
  • Virtual Service Control
  • Virtual RPC service
  • Challenges
  • Limit app to virtual server's devices and services.
  • Client reconnect on failover (easy if
    connectionless -- eg web-clients)

[Diagram: Virtual Server \\a, 1.2.3.4]
87
Virtual Servers (before failover)
  • Nodes \\Y and \\Z support virtual servers \\A and
    \\B
  • Things that need to fail over transparently
  • Client connection
  • Server dependencies
  • Service names
  • Binding to local resources
  • Binding to local servers

[Diagram: nodes \\Y and \\Z, each with SAP and SQL; drives T:\ and S:\; virtual servers \\A and \\B; SAP on A, SAP on B]
88
Virtual Servers (just after failover)
  • \\Y resources and groups (i.e. Virtual Server \\A) moved to \\Z
  • \\A resources bind to each other and to local resources (e.g., local file system)
  • Registry
  • Physical resource
  • Security domain
  • Time
  • Transactions used to make DB state consistent.
  • To work, local resources on \\Y and \\Z have to
    be similar
  • E.g. time must remain monotonic after failover

\\Y
\\Z
\\A
\\B
89
Address Failover and Client Reconnection
  • Name and Address rebind to new node
  • Details later
  • Clients reconnect
  • Failure not transparent
  • Must log on again
  • Client context lost (encourages connectionless)
  • Applications could maintain context

90
Mapping Local References to Group-Relative
References
  • Send client requests to correct server
  • \\A\SAP refers to \\.\SQL
  • \\B\SAP refers to \\.\SQL
  • Must remap references
  • \\A\SAP to \\.\SQLA
  • \\B\SAP to \\.\SQLB
  • Also handles namespace collision
  • Done via
  • modifying server apps, or
  • DLLs to transparently rename

91
Naming and Binding and Failover
  • Services rely on the NT node name and/or IP address to advertise Shares, Printers, and Services.
  • Applications register names to advertise services
  • Example: \\Alice\SQL (i.e. \\<node>\<service>)
  • Example: 128.2.2.2:80 (http://www.foo.com/)
  • Binding
  • Clients bind to an address (e.g. name -> IP address)
  • Thus the node name and IP address must fail over along with the services (preserve client bindings)

92
Client to Cluster Communications: IP address mobility based on MAC rebinding
  • Cluster Clients
  • Must use IP (TCP, UDP, NBT,... )
  • Must Reconnect or Retry after failure
  • Cluster Servers
  • All cluster nodes must be on same LAN segment
  • IP rebinds to failover MAC addr
  • Transparent to client or server
  • Low-level ARP (address resolution protocol) rebinds IP address to new MAC address.
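
The effect of the rebinding can be pictured as one changed entry in the router's IP-to-MAC table. A Python sketch using the addresses and MAC labels from this slide's diagram; the `rebind` helper is hypothetical, and in a real network the change is driven by ARP traffic rather than a function call.

```python
def rebind(arp_table: dict, ip: str, new_mac: str) -> dict:
    """Return a copy of the table with `ip` now resolving to `new_mac`."""
    updated = dict(arp_table)
    updated[ip] = new_mac
    return updated


router_arp = {
    "200.110.120.4": "AliceMAC",  # Alice (physical node)
    "200.110.120.5": "AliceMAC",  # Virtual Alice
    "200.110.120.6": "BettyMAC",  # Betty (physical node)
    "200.110.120.7": "BettyMAC",  # Virtual Betty
}

# Alice fails: Virtual Alice's IP address moves to Betty's network adapter.
after_failover = rebind(router_arp, "200.110.120.5", "BettyMAC")
print(after_failover["200.110.120.5"])
```

Clients keep using the same virtual IP address; only the MAC address behind it changes, which is why all nodes must sit on the same LAN segment.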

[Diagram]
Client: Alice <-> 200.110.12.4, Virtual Alice <-> 200.110.12.5, Betty <-> 200.110.12.6, Virtual Betty <-> 200.110.12.7
WAN
Alice <-> 200.110.120.4, Virtual Alice <-> 200.110.120.5
Betty <-> 200.110.120.6, Virtual Betty <-> 200.110.120.7
Router: 200.110.120.4 -> AliceMAC, 200.110.120.5 -> AliceMAC, 200.110.120.6 -> BettyMAC, 200.110.120.7 -> BettyMAC
Local Network
93
Time
  • Time must increase monotonically
  • Otherwise applications get confused
  • e.g. make/nmake/build
  • Time is maintained within failover resolution
  • Not hard, since failover on order of seconds
  • Time is a resource, so one node owns time
    resource
  • Other nodes periodically correct drift from owner's time
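
One simple way to keep reported time from going backwards across a failover is to clamp each reading to the largest value already handed out. This Python sketch illustrates that idea only; it is not a description of how MSCS implements its time resource, and the class and function names are hypothetical.

```python
class FailoverClock:
    """Hands out timestamps that never decrease, even if the underlying clock steps back."""

    def __init__(self):
        self._last = None

    def read(self, local_clock_s: float) -> float:
        if self._last is None or local_clock_s > self._last:
            self._last = local_clock_s
        return self._last


def drift_correction(owner_time_s: float, local_time_s: float) -> float:
    """Offset a non-owner node would add to its clock to match the time owner."""
    return owner_time_s - local_time_s


clock = FailoverClock()
# Third reading comes from a new node whose clock is 0.5 s behind.
print([clock.read(t) for t in (100.0, 101.0, 100.5, 102.0)])
```

Tools such as make compare file timestamps, so a clock that steps backwards after failover can make fresh outputs look stale.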

94
Application Local NT Registry Checkpointing
  • Resources can request that local NT registry
    sub-trees be replicated
  • Changes written out to quorum device
  • Uses registry change notification interface
  • Changes read and applied on fail-over

[Diagram: \\A on \\X writes each registry update to the Quorum Device; after failover, \\A on \\B reads the registry from the Quorum Device]
95
Registry Replication
96
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC Wizard
  • Cluster API

97
Generic Resource DLLs
  • Generic Application DLL
  • Simplest just starts, stops application, and
    makes sure process is alive
  • Generic Service DLL
  • Translates DLL calls into equivalent NT Server
    calls
  • Online => Service Start
  • Offline => Service Stop
  • Looks/IsAlive => Service Status
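
The Generic Application idea (start the program, stop it, and check that the process is alive) can be sketched in Python with `subprocess`. The class is an illustration only: real resource DLLs are native code called by the Resource Monitor, and the method names here merely mirror the entry points listed in these slides.

```python
import subprocess
import sys


class GenericApplication:
    """Toy resource wrapper: online/offline/looks_alive for one child process."""

    def __init__(self, argv):
        self.argv = argv
        self.proc = None

    def online(self) -> None:
        """Start offering service by launching the application."""
        self.proc = subprocess.Popen(self.argv)

    def looks_alive(self) -> bool:
        """Quick check: the process has been started and has not exited."""
        return self.proc is not None and self.proc.poll() is None

    def is_alive(self) -> bool:
        """Thorough check; a generic wrapper has nothing deeper to test than the process."""
        return self.looks_alive()

    def offline(self) -> None:
        """Stop offering service and wait for the process to exit."""
        if self.proc is not None:
            self.proc.terminate()
            self.proc.wait()
            self.proc = None


# Hypothetical application: a Python child process that just sleeps.
app = GenericApplication([sys.executable, "-c", "import time; time.sleep(30)"])
app.online()
print(app.looks_alive())
app.offline()
print(app.looks_alive())
```

An application-specific resource would replace `is_alive` with a real probe, for example issuing a test query, which is the kind of customization the wizard-generated code allows.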

98
Generic Application
99
Generic Service
100
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC Wizard
  • Cluster API

101
Resource DLL VC Wizard
  • Asks for resource type name
  • Asks for optional service to control
  • Asks for other parameters (and associated types)
  • Generates DLL source code
  • Source can be modified as necessary
  • E.g. additional checks for Looks/IsAlive

102
Creating a New Workspace
103
Specifying Resource Type Name
104
Specifying Resource Parameters
105
Automatic Code Generation
106
Customizing The Code
107
Application Support
  • Virtual Servers
  • Generic Resource DLLs
  • Resource DLL VC Wizard
  • Cluster API

108
Cluster API
  • Allows resources to
  • Examine dependencies
  • Manage per-resource data
  • Change parameters (e.g. failover)
  • Listen for cluster events
  • etc.
  • Specs & API became public Sept 1996
  • On all MSDN Level 3
  • On web site
  • http://www.microsoft.com/clustering.htm

109
Cluster API Documentation
110
Outline
  • Why FT and Why Clusters
  • Cluster Abstractions
  • Cluster Architecture
  • Cluster Implementation
  • Application Support
  • Q&A

111
Research Topics?
  • Even easier to manage
  • Transparent failover
  • Instant failover
  • Geographic distribution (disaster tolerance)
  • Server pools (load-balanced pool of processes)
  • Process pair (active/backup process)
  • 10,000 nodes?
  • Better algorithms
  • Shared memory or shared disk among nodes
  • a truly bad idea?

112
References
  • Microsoft NT site: http://www.microsoft.com/ntserver/
  • BARC site (e.g. these slides): http://research.microsoft.com/joebar/wolfpack
  • Inside Windows NT, H. Custer, Microsoft Press, ISBN 155615481
  • Tandem Global Update Protocol, R. Carr, Tandem Systems Review, V1.2, 1985. Sketches regroup and global update protocol.
  • VAXclusters: A Closely Coupled Distributed System, Kronenberg, N., Levey, H., Strecker, W., ACM TOCS, V4.2, 1986. A (the) shared disk cluster.
  • In Search of Clusters: The Coming Battle in Lowly Parallel Computing, Gregory F. Pfister, Prentice Hall, 1995, ISBN 0134376250. Argues for shared nothing.
  • Transaction Processing: Concepts and Techniques, Gray, J., Reuter, A., Morgan Kaufmann, 1994, ISBN 1558601902. Survey of outages, transaction techniques.