Title: Module 7: Server Cluster Maintenance and Troubleshooting
Overview
- Cluster Maintenance
- Troubleshooting Cluster Service
- Server cluster maintenance and troubleshooting are two separate disciplines. Maintenance is continuous, whereas troubleshooting begins when a problem is discovered and ends when the problem is resolved. The two disciplines are complementary, however: when every troubleshooting procedure that you follow fails, you will need to rebuild the cluster from a backup tape that was generated during a maintenance procedure.
- After completing this module, you will be able to:
- Perform the steps to successfully back up a server cluster.
- Perform the steps to successfully restore a server cluster.
- Evict a node from a server cluster.
- Identify the tools that are necessary to troubleshoot a cluster failure.
- Interpret the entries in the cluster log.
- Identify and troubleshoot common server cluster failures: network communications, small computer system interface (SCSI) configuration problems, and group, resource, and quorum failures.
Cluster Maintenance
- Backup
- Restoring the First Node
- Restoring Cluster Disks
- Restoring the Second Node
- Evicting a Node
- Cluster service uses the self-tuning features of Microsoft Windows 2000 and requires very little maintenance. The only day-to-day maintenance operation that you need to perform is to back up the cluster.
- Under special circumstances, a node in the cluster may need to be replaced, for example, when your organization decides to perform a hardware upgrade. In this situation, you need to evict a node from the cluster and add the upgraded node to the cluster.
Backup
- Backing Up the System State
- Backing Up the Local Disk
- Backing Up the Cluster Disk
- Backing up the cluster is no different from backing up Microsoft Windows 2000 Advanced Server. It is recommended that you perform regular backups by using the Windows 2000 Backup program (NTBackup) or another compatible backup program. Additional backup agents are still necessary to back up applications running on the cluster, such as Microsoft SQL Server and Microsoft Exchange.
- Note: A cluster-aware backup program will be able to perform the same backup operations as NTBackup, especially with regard to backing up the System State and the cluster configuration database.
Backing Up the System State
- The configuration information for the cluster is stored in the registry on each node (HKEY_LOCAL_MACHINE\Cluster). The Backup tool that is included with Windows 2000 backs up the cluster database when you back up each node's system state.
- NTBackup backs up the system state on each node. The system state includes:
- The quorum log.
- The local registry.
- The Cluster registry hive.
Backing Up the Local Disk
- Follow standard computer backup procedures to back up the operating system and the data on the local drives. You must also back up key cluster files on the local disks.
- On each node, back up the cluster database files: %systemroot%\cluster\CLUSDB and %systemroot%\cluster\CLUSDB.LOG.
- On each node, back up the clustering service files in %systemroot%\cluster\.
- Note: Backup is essential, but so is regular testing to make sure that backups and restores actually work as expected. A good practice is to schedule test backup and restore operations frequently.
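The per-node file list above can be turned into a small backup manifest. The following is a minimal sketch, not part of any Microsoft tool: the function name `cluster_backup_paths` is hypothetical, the log file name is taken to be CLUSDB.LOG, and the system root (for example `C:\WINNT`) is passed in by the caller rather than detected.

```python
import ntpath

# Cluster database files named in the text; the folder entry covers
# the remaining clustering service files.
CLUSTER_DB_FILES = ("CLUSDB", "CLUSDB.LOG")


def cluster_backup_paths(system_root):
    """Return the local-disk paths to include in each node's backup.

    system_root is the value of %systemroot% on the node, supplied by
    the caller (hypothetical example: r"C:\\WINNT").
    """
    cluster_dir = ntpath.join(system_root, "cluster")
    paths = [ntpath.join(cluster_dir, name) for name in CLUSTER_DB_FILES]
    # The trailing backslash marks the whole folder as a backup target.
    paths.append(cluster_dir + "\\")
    return paths


if __name__ == "__main__":
    for path in cluster_backup_paths(r"C:\WINNT"):
        print(path)
```

Using `ntpath` rather than `os.path` keeps the Windows path separators correct even when the manifest is generated on a non-Windows administration machine.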
Backing Up the Cluster Disks
- It is critical to back up the cluster files on the quorum disk and the data on the cluster disks, because Cluster service writes information to files in the \MSCS directory on the quorum disk and cluster-aware applications will likely place data on the cluster disk. Because either node of the cluster could own the cluster disk resource at any time, it is possible for each node to back up the data on the drive. However, having each node back up data would require you to install backup hardware and software on each cluster node, which is not the best solution.
- One possibility is to identify a nonclustered server running Windows 2000 Server and schedule it to back up data remotely through a network connection to the cluster disk's administrative share or to a hidden share that you create. For example, you might create FBackup$, GBackup$, HBackup$, and WBackup$ file share resources on the virtual server for the root of drives F, G, H, and W. F, G, and H would be cluster disks with data, and W would be the drive letter for the quorum disk. Hidden shares do not appear in a browse list, and you can configure them to allow access only to members of the Backup Operators group.
- The following sections describe the procedure for restoring a server cluster in the event that both nodes and the cluster disk fail. It is also possible that any one of the components in the cluster could fail independently. In the case of a single failed component, you follow the same procedure for restoring that specific component.
- Performing a complete restore of a server cluster is a straightforward process:
- Restore a node of the cluster.
- Restore the cluster disks of the restored first node.
- Restore the remaining node of the cluster.
- Perform node testing.
Restoring the First Node
- Steps For Restoring a Server Cluster
- Restore the first node
- Restore the cluster disks
- Restore the second node
- Perform node testing
Restoring a Node of the Cluster
- To restore a node in a server cluster, you follow the same procedure that you would use to restore a Windows 2000 operating system:
- Install a fresh copy of Windows 2000 Advanced Server on the node to be restored.
- Log on as Administrator and restore the system and boot partition, system state, and associated volumes from the backup. Make sure that you select the option to restore the system state to the original location in the backup program.
- Restart the node.
- Perform the steps for restoring the cluster disk. These steps follow in the next section.
- Note: The difference between the time of the backup and the time of the restoration to the new computer may affect the computer account on the domain controller. You may have to join a workgroup and then rejoin the domain.
Restoring Cluster Disks
- Restoring Disk Signature Files
- Restoring the Data on the Cluster Disk
- Restoring the Cluster Configuration Files
- After you have restored a node in the cluster, you must restore the cluster disks. Restoring the cluster disks involves restoring the disk signature file that the cluster uses to identify the disk. You may also need to restore a cluster disk if you are running out of disk space or if a disk failure is impending. Mistakes made while replacing a cluster disk can be costly: the consequence can be the irrecoverable loss of all of the data on that disk. If the disk is the quorum disk, the server cluster's configuration data is at risk.
- Before restoring the cluster disks, stop Cluster service on all of the nodes of the cluster. Stopping Cluster service ensures that it will not attempt to start, which would place a lock on the disks.
Restoring Disk Signature Files
- Because Cluster service relies on disk signatures to identify and mount volumes, if a disk is replaced, or if the bus is re-enumerated, Cluster service will not find the disk signatures that it is expecting and will not function.
- You can run Dumpcfg.exe to extract the disk signature from the registry and write it to the new disk. Cluster service will then recognize the new disk and successfully start the resource.
- Note: Dumpcfg.exe is a Resource Kit utility that restores an old disk signature to a new disk.
- If the disk that you are replacing is the quorum disk, use Cluster Administrator to move the quorum to a different disk, and then proceed with the replacement of the disk. After the new disk is brought online, you can move the quorum back to it.
Restoring the Data on the Cluster Disk
- Restoring the data on the cluster disk is the same as restoring a local disk. Before restoring the data, make sure that you have assigned each cluster disk the same drive letter that it had before the disaster or failure. When restoring, make sure that you restore the data to the original location, and verify its integrity after you have completed the restore.
Restoring the Cluster Configuration Files
- The cluster configuration files include the cluster database and the quorum log. The cluster database is the database of configuration data (cluster objects and their settings) that is pertinent to the cluster. This database is the product of the cluster registry key checkpoint and the changes that are recorded in the quorum log. All of the nodes of the cluster maintain a local copy of this database in the cluster hive of the node's local registry.
- After you have restored the disk signature file and data, you can start the server cluster. If the cluster files were not restored, or were corrupted, the following procedure can restore the cluster database from the registry of the restored node.
- Identify the node on which you will restore the database (in the case of a disaster restore, this will be the first node that you have restored). Restore the cluster database on the selected node by restoring the system state. Restoring the system state creates a temporary folder called Cluster_backup under the %Systemroot%\Cluster folder.
- You use NTBackup to restore the cluster configuration files, which places them on the node. You then restore the cluster database to the node's registry by using the Clusrest.exe tool. Clusrest.exe restores both the quorum log (Quolog.log) file and the cluster database (Clusdb).
- Note: The Clusrest.exe tool is available in the Windows 2000 Resource Kit. This tool is a free download from www.microsoft.com.
Restoring the Second Node
- Restoring the Remaining Node(s) of a Cluster
- Perform Node Testing
- After you complete the process of restoring a node of a cluster, and Cluster service has started successfully on the newly restored node, you can start the restore process on the other node of the cluster.
Restoring the Remaining Node(s) of the Cluster
- The restoration of the second node of a cluster follows the same procedure as restoring the first node, except that you will not have to restore the cluster disks.
Performing Node Testing
- Testing the failover and failback policy is recommended before putting the cluster back into production.
- Verify that the disk and cluster resources are available on the correct node.
- Fail over each group and resource to verify that they can successfully start on the other node of the cluster.
- Test the failback policy of each resource by allowing the resource to fail back to a preferred owner after the node has come back online.
Evicting a Node
- Steps for Evicting a Node
- Back up both nodes
- Verify backup
- Move all groups to the remaining node
- Stop Cluster service on the node to be removed
- Evict the node
- Unplug the server from the shared bus
- If you need to change a node of a cluster, for example, to add a more powerful server, you need to logically remove the node before physically removing it from the cluster. After you configure the new server with the shared bus and the public and private networks, you can run the Cluster Installation Wizard.
- To remove a node from a cluster, right-click the node in Cluster Administrator to access the menu with the Stop Cluster Service and Evict Node options.
- To evict a node:
- Back up both nodes.
- Verify the backup.
- Move all of the groups to the remaining node.
- Stop Cluster service on the node that is to be removed.
- Evict the node.
- Unplug the server from the shared bus (if the shared bus is a SCSI bus, be careful about termination).
- Note: If a new server is to join the cluster later, run the Cluster Installation Wizard and select Join a Cluster.
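The scriptable middle of the eviction procedure (move groups, stop the service, evict) can be sketched as an ordered command plan. This is only an illustration under stated assumptions: the module itself uses Cluster Administrator, so the use of the `cluster.exe` command line (`cluster group ... /moveto:` and `cluster node ... /evict`) and of `net stop clussvc` is an assumption that should be checked against the installed version. The function name and the node and group names are hypothetical. Backups, backup verification, and unplugging the shared bus remain manual steps.

```python
def eviction_steps(node_to_evict, remaining_node, groups):
    """Return (run_on, command) pairs in the order they should be run.

    The command syntax is an assumption based on the cluster.exe
    command-line tool; nothing here executes the commands.
    """
    steps = []
    # Move every group to the node that stays in the cluster.
    for group in groups:
        steps.append(
            (remaining_node, f'cluster group "{group}" /moveto:{remaining_node}')
        )
    # Stop Cluster service on the node that is being removed.
    steps.append((node_to_evict, "net stop clussvc"))
    # Evict the stopped node, issuing the command from the remaining node.
    steps.append((remaining_node, f"cluster node {node_to_evict} /evict"))
    return steps


if __name__ == "__main__":
    plan = eviction_steps("NODE2", "NODE1", ["Cluster Group", "SQL Group"])
    for run_on, command in plan:
        print(f"[{run_on}] {command}")
```

Returning a plan instead of running commands lets an administrator review the order before anything changes on the cluster.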
Troubleshooting Cluster Service
- Troubleshooting Tools
- Examining the Cluster Log
- Troubleshooting Network Communications
- SCSI Configuration Problems
- Group and Resource Failures
- Quorum Log Corruption
- Troubleshooting a problem with Cluster service can be more complex than troubleshooting a single server because of the virtual servers and the need for intracluster communications. Virtual servers change ownership from one node to another, which may cause network connectivity problems. Applications running on the cluster are difficult to troubleshoot, because they are running on a virtual server instead of a physical server. You could also have a node-to-node communication problem, because servers usually work independently of each other and not together. You might experience hardware problems with the shared bus and the cluster disk resources.
- The most common failures are due to improper configurations within groups and resources. Cluster service will fail if the quorum log becomes corrupt. It is important to know how to repair the quorum log to restart the cluster.
- You use the same tools to identify problems on the cluster as you would use to identify problems on a physical server. The best resource for troubleshooting is the cluster log, because Cluster service records the activity of each node in the cluster log. This log can help you identify problems on the node or in the cluster.
Troubleshooting Tools
- Disk Manager
- Task Manager
- Performance Monitor
- Network Monitor
- Dr. Watson
- Services Snap-in
- When troubleshooting Cluster service, you can use the same tools and methodologies that you would use when troubleshooting Windows 2000 Advanced Server.
- Cluster service writes logging information to the system log of every node in the cluster. Cluster service also writes a more detailed log of cluster activity to the cluster log on each node. Use these two sources to gather information when you begin troubleshooting a problem. You will be able to determine whether the problem is related to the network, to services or applications, or to physical components in the cluster.
- Note: Use Event Viewer to filter the system log on event source ClusSvc. You can view general events, such as "Microsoft Cluster service failed to join the cluster on this node" and "Microsoft Cluster service successfully created a cluster on this node."
- After you have determined the type of problem, you can use the following tools to search for the source of the problem. You must check each node individually when using any of these tools.
- Disk Manager. Use Disk Manager to check the health of the cluster disk. You can check whether the operating system recognizes the disks, and whether the cluster disks are basic or dynamic. You also need to verify that the drive letters of the cluster disks are the same on both nodes.
- Task Manager. You can verify that Cluster service is running in Microsoft Windows 2000 Task Manager. You can also use Task Manager as a performance monitor, but it does not provide the level of detail that Performance Monitor does. In Task Manager, you can check the CPU utilization percentage and the memory resources on the node.
- Performance Monitor. Microsoft Windows 2000 Performance Monitor is the primary tool for finding bottlenecks on servers running Windows 2000. It is recommended that you create a baseline before and after you add cluster resources to the cluster. You also need to create a baseline on each node during failover and failback of resources to check for potential physical resource deficiencies. It is recommended that you configure a computer to monitor the Cluster service property on every node of the cluster, and to send an e-mail message to an administrator when a node or the cluster is offline.
- Network Monitor. You use Microsoft Windows 2000 Network Monitor to troubleshoot any node-to-node and client-to-node communication. You must configure Network Monitor to capture data on the private network to see node-to-node communication.
- Dr. Watson. Dr. Watson is a user-mode debugging tool. If a clustered application or Cluster Administrator crashes, the debugging information is found in the Dr. Watson log file.
- Services Snap-in. Cluster service runs as a service in Windows 2000. If Cluster service is not running correctly, check the properties of the service through the Services snap-in to ensure that the default properties have not changed. Verify that Cluster service:
- Is set to start automatically.
- Is set to log on as the designated domain service account.
- Is set to restart after a failure.
- Make sure that the following four services have started:
- Network Connections (Network Connections has a Remote Procedure Call (RPC) dependency)
- RPC
- Windows Management Instrumentation Driver Extensions
- Windows Time
Examining the Cluster Log
- The cluster log is a diagnostic log that is a more complete record of cluster activity than the Microsoft Windows 2000 Event Log. The cluster log records the Cluster service activity (Clussvc.exe and associated processes) that leads up to the events that are recorded in the event log. Although the event log can point you to a problem, the cluster log helps you to determine the source of the problem. For diagnosis, check the event log for general information and the cluster log for specific details about the cluster status. If you see a problem in the event log, note the timestamp and go to approximately the same timestamp in the cluster log.
- The cluster log is enabled by default when you install Cluster service, but it will not start logging information until after the first restart of the node. Cluster log output is written to %SystemRoot%\Cluster\Cluster.log, and you can view it with Microsoft WordPad.
Setting the Logging Level
- You can set four logging levels in the cluster log. The default level is 2, which logs enough information for normal troubleshooting. To set a different logging level, click Start, point to Settings, click Control Panel, and then double-click the System icon. On the Advanced tab, create a system environment variable called ClusterLogLevel with a value of 0, 1, 2, or 3, where 0 = no logging, 1 = errors only, 2 = errors and warnings, and 3 = everything that happens.
Setting the Log File Size
- The log file defaults to a maximum size of 8 megabytes (MB). When the log file size reaches 8 MB, the log file will start overwriting the data in the log file. To specify a larger file size, add the registry entry ClusterLogSize under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters. ClusterLogSize has a type of DWORD, and it should specify the maximum size in MB for the log file. If this value is set to 0, logging is disabled.
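One way to apply the same log size on every node is to generate a registry import (.reg) file and import it on each node. The sketch below only builds the file text; it follows the key location and DWORD type given above, and the function name `cluster_log_size_reg` is hypothetical. Review the output before importing it on a production node.

```python
# Registry location for the setting, as given in the text above.
PARAMETERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters"
)


def cluster_log_size_reg(size_mb):
    """Build .reg file text that sets ClusterLogSize to size_mb megabytes.

    A value of 0 disables logging, according to the text above.
    """
    if not isinstance(size_mb, int) or not 0 <= size_mb <= 0xFFFFFFFF:
        raise ValueError("size_mb must be an integer that fits in a DWORD")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{PARAMETERS_KEY}]",
        # DWORD values are written as eight hexadecimal digits.
        f'"ClusterLogSize"=dword:{size_mb:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(cluster_log_size_reg(16))
```

Generating the file rather than editing the registry directly keeps the change reviewable and repeatable across both nodes.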
Cluster Log Entries
- There are two types of cluster log entries: Component Event Log entries and Resource dynamic-link library (DLL) log entries. Cluster service is made up of a number of components, such as the database manager and the global update manager. The cluster log records the interactions of these components, making it a powerful diagnostic tool. Because resource groups are the basic unit of failover, resource DLL entries are essential to understanding cluster activity.
- The first line in the body of a typical cluster log is:
- 378.32c::1999/06/09-18:00:18.874 Cluster service started - Cluster Node Version 3.2051
- The main elements of this line are common to every line of the log:
- The IDs of the process and thread issuing the log entry. These two IDs are concatenated, separated by a period. In the previous example, the Process ID is 378, and the Thread ID is 32c.
- Timestamp. The timestamp is recorded in Greenwich Mean Time (GMT), in the following format: yyyy/mm/dd-hh:mm:ss.sss
- Event description. One example of an event description would be "Cluster service started".
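The three elements described above (process and thread IDs, GMT timestamp, event description) can be pulled apart with a regular expression. This is a minimal sketch that assumes the raw log separates the IDs from the timestamp with `::` and writes the time as `hh:mm:ss.sss`; the function name `parse_cluster_log_line` is hypothetical.

```python
import re
from datetime import datetime, timezone

# process.thread::yyyy/mm/dd-hh:mm:ss.sss event description
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]+)\.(?P<tid>[0-9a-fA-F]+)::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<event>.*)$"
)


def parse_cluster_log_line(line):
    """Split one cluster log line into its common elements.

    Returns None when the line does not match the assumed layout.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    # Timestamps are recorded in GMT, so attach the UTC time zone.
    timestamp = datetime.strptime(
        match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f"
    ).replace(tzinfo=timezone.utc)
    return {
        "process_id": match.group("pid"),
        "thread_id": match.group("tid"),
        "timestamp": timestamp,
        "event": match.group("event"),
    }


if __name__ == "__main__":
    sample = (
        "378.32c::1999/06/09-18:00:18.874 "
        "Cluster service started - Cluster Node Version 3.2051"
    )
    print(parse_cluster_log_line(sample))
```

Parsing the timestamp into a time-zone-aware value makes it straightforward to line up cluster log entries with event log entries, which are usually displayed in local time.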
Component Event Log Entries
- In the following example, NM indicates the component that wrote the event to the cluster log; in this case, NM stands for node manager.
- 378.380::1999/06/09-18:00:50.881 [NM] Forming cluster membership.
Resource DLL Log Entries
- The following example is a cluster log entry for a resource DLL event. This example is one of the entries from the disk arbitration process.
- 15c.458::1999/06/09-18:00:47.897 Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999).
- Instead of listing an abbreviated component name between the timestamp and event description as component log entries do, entries describing resource DLL events list the following information:
- Resource type (Physical Disk)
- Resource name (<Disk D:>)
- The event description, which in this example is [DISKARB] Arbitration Parameters (1 9999).
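The difference between the two entry types can be sketched as a small classifier that works on the text following the timestamp. It assumes that the raw log writes component names in square brackets (for example `[NM]`) and resource names in angle brackets (for example `<Disk D:>`); both the layout and the function name `classify_event` are assumptions made for illustration.

```python
import re

# "[NM] Forming cluster membership."
COMPONENT_RE = re.compile(r"^\[(?P<component>[A-Za-z]+)\]\s*(?P<text>.*)$")
# "Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999)."
RESOURCE_RE = re.compile(
    r"^(?P<rtype>[^<]+?)\s*<(?P<rname>[^>]+)>:?\s*(?P<text>.*)$"
)


def classify_event(event):
    """Classify the part of a log line that follows the timestamp."""
    match = COMPONENT_RE.match(event)
    if match:
        return {
            "kind": "component",
            "component": match.group("component"),
            "description": match.group("text"),
        }
    match = RESOURCE_RE.match(event)
    if match:
        return {
            "kind": "resource_dll",
            "resource_type": match.group("rtype").strip(),
            "resource_name": match.group("rname"),
            "description": match.group("text"),
        }
    # Lines such as "Cluster service started" carry neither marker.
    return {"kind": "other", "description": event}


if __name__ == "__main__":
    print(classify_event("[NM] Forming cluster membership."))
    print(classify_event(
        "Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999)."
    ))
```

Grouping entries by resource name in this way makes it easier to follow a single resource through a failover in a long log.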
Troubleshooting Network Communications
- Troubleshooting Node-to-Node Communication
- Verify RPC Communications
- Verify Cluster Heartbeats
- Troubleshooting Client-to-Node Communications
- Check NetBT Cache with Nbtstat
- Ping IP Address
- WINS Static Mappings
- There are two types of cluster network communications that can fail: the client may be unable to access the cluster, or the nodes may be unable to communicate with each other. When client communications are interrupted, there is a problem with the public network. When the nodes are unable to communicate, there is a problem with either the public or the private network. Troubleshooting these two types of network-related problems requires different approaches.
Troubleshooting Node-to-Node Communications
- You can use Windows 2000 Network Monitor before installing Cluster service to capture the trace of the ping between the nodes on the public and private networks. After Cluster service is installed, you use Network Monitor to verify remote procedure call (RPC) communication and cluster heartbeats.
- Note: You can also use RPC Ping, which is an RPC connectivity verification tool that is a free download from www.microsoft.com. This tool verifies that Windows 2000 Server services are responding to remote procedure call requests between nodes.
Verifying RPC Communication
- To verify that RPC communication is occurring between the nodes of a cluster, use a network capture utility, such as Microsoft Network Monitor. Windows 2000 Server includes a simple version of Network Monitor that you can install by using the Network program in Control Panel.
- To verify RPC communication, configure the capture utility to capture all of the traffic between the nodes of a cluster. After you have started a capture, using Cluster Administrator to create a group or resource will result in RPC traffic between the nodes.
Verifying Cluster Heartbeats
- As with RPC communication, to verify that cluster heartbeats are occurring between the nodes of a cluster, you must use a network capture utility.
- Cluster service uses User Datagram Protocol (UDP) port 3343 to send heartbeats on the network. Use Network Monitor to capture traffic on port 3343 to verify that both nodes of the cluster are sending and receiving cluster heartbeats.
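Once a capture has been exported, the check that both nodes send and receive heartbeats can be automated. The sketch below assumes the capture has already been converted to simple records with `src`, `dst`, `protocol`, and `dst_port` fields; that record layout, the function names, and the sample addresses are all hypothetical and are not a Network Monitor export format.

```python
HEARTBEAT_PORT = 3343  # UDP port that Cluster service uses for heartbeats


def heartbeat_status(packets, nodes):
    """Report, per node, whether it sent and received heartbeat traffic.

    packets: iterable of dicts with "src", "dst", "protocol", "dst_port".
    nodes:   addresses of the cluster nodes on the captured network.
    """
    status = {node: {"sent": False, "received": False} for node in nodes}
    for packet in packets:
        is_heartbeat = (
            packet["protocol"].lower() == "udp"
            and packet["dst_port"] == HEARTBEAT_PORT
        )
        if not is_heartbeat:
            continue
        if packet["src"] in status:
            status[packet["src"]]["sent"] = True
        if packet["dst"] in status:
            status[packet["dst"]]["received"] = True
    return status


def heartbeats_ok(packets, nodes):
    """True only if every node both sent and received a heartbeat."""
    return all(
        s["sent"] and s["received"]
        for s in heartbeat_status(packets, nodes).values()
    )


if __name__ == "__main__":
    capture = [
        {"src": "10.0.0.1", "dst": "10.0.0.2", "protocol": "UDP", "dst_port": 3343},
        {"src": "10.0.0.2", "dst": "10.0.0.1", "protocol": "UDP", "dst_port": 3343},
    ]
    print(heartbeats_ok(capture, ["10.0.0.1", "10.0.0.2"]))
```

A one-sided result (a node that sends but never receives) usually points at the network path or adapter on the silent side rather than at Cluster service itself.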
Troubleshooting Client-to-Node Communications
- After a failover occurs, clients must still be able to gain access to a cluster, even though they will be accessing a different node. Clients must be able to resolve any cluster network names so that they always connect to the node on which the resources are online. If clients cannot connect to virtual servers, verify that:
- The client is accessing the cluster by using the correct network name or IP address.
- The client has the Transmission Control Protocol/Internet Protocol (TCP/IP) protocol correctly installed and configured.
Check NetBT Cache with Nbtstat
- Depending on the resource that is being accessed,
the client can address the cluster by specifying
either the resource network name or the IP
address. In the case of the network name, you can
verify proper name resolution by checking the
NetBT cache (using the Nbtstat.exe utility) to
determine whether the name had been previously
resolved. Also, confirm proper Windows Internet
Name Service (WINS) configuration, at the client
and at the cluster nodes.
Ping IP Address Using Ping Utility
- If the client is accessing the resource through a
specific IP address, ping the IP address of the
cluster resource and cluster nodes from a command
prompt.
WINS Static Mappings
- You should not create static network name to IP
address mappings for any cluster names in a WINS
database. WINS is the only name resolution method
that will cause problems when using static
mappings, because WINS static mappings use the
media access control (MAC) address of the network
card as part of the static mapping.
- If clients are having a problem connecting to a virtual server, an administrator might have created a WINS static mapping for a virtual server. The node for which the mapping is created will be able to bring the network name resource online, and clients will be able to connect. However, if failover occurs, the second node in the cluster will be able to bring the IP address online but not the network name. When the second node attempts to bring the network name online, WINS will return an error preventing it from registering the network name. WINS prevents the network name from going online because the second node does not have the same physical address as the one recorded in the static mapping for the network name.
- Note: For more WINS troubleshooting information, see "Recommended WINS Configuration for Microsoft Cluster Server," Q193890, on the Student compact disc.
SCSI Configuration Problems
- SCSI Controllers
- SCSI Termination
- SCSI Cabling
- If hardware failures occur, you may have to replace hardware components of the cluster. If you replace components in the SCSI subsystems, you need to make sure that the new SCSI configurations conform to the following guidelines.
- SCSI IDs. Each device on the shared SCSI bus must have a unique SCSI ID. Most SCSI controllers default to SCSI ID 7. Therefore, you must change the SCSI ID for one of the controllers on the shared SCSI bus to something other than ID 7.
- Boot Time SCSI Bus Reset. Cluster service uses SCSI bus resets, but in a controlled way during a membership regroup operation. Some SCSI controllers reset the SCSI bus when they initialize at start time, before Windows 2000 is loaded. If the SCSI controllers reset the SCSI bus, the bus reset can interrupt any data transfers between the other node and drives on the shared SCSI bus. Therefore, you should disable automatic SCSI bus resets, if possible, by using the adapter configuration program accessible at computer start time.
- Non-Compliant Controllers. It is important to verify that the SCSI controllers that are being used are on the Cluster service Hardware Compatibility List (HCL). For a SCSI controller to work with Cluster service, it must support the SCSI reserve and release commands and bus resets.
- Active or Forced-Perfect Termination. There are three types of termination that are used for terminating the SCSI bus: passive termination, active termination, and forced perfect termination. Because both active and forced perfect termination use electronics to provide termination, these types provide the best termination. You should not use passive termination in a cluster, because it can result in problems, such as unnecessary failover or inability to access the quorum disk.
- On-Card Termination. Many SCSI controllers provide on-card termination; however, the on-card termination does not provide termination when the computer is not turned on. On-card termination only becomes an issue when external terminators are not used. When using external terminators, the on-card termination should be disabled.
- Tri-Link or Y-cable SCSI Connectors. Attaching Y-cables or tri-link connectors to the back of the SCSI controllers at each end of the bus is one method that you can use to allow the SCSI bus to remain terminated even when one node is turned off. These components allow you to use external terminators that will continue to provide termination if a node is turned off. You must ensure that the SCSI cards in the nodes are not providing termination when using these connectors.
- Long Cables. It is very common to have multiple external SCSI drives on the shared SCSI bus. When configuring multiple external drives, it is very important not to exceed the maximum combined cable length that the controller manufacturer recommends. The SCSI specifications specify the maximum combined cable length when using different types of cabling. If the manufacturer of the controller recommends a shorter distance, be sure to follow the recommendation of the manufacturer.
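Two of the guidelines above, unique SCSI IDs and combined cable length, lend themselves to a quick check against a planned configuration. This is a minimal sketch with hypothetical function and device names; the maximum cable length is supplied by the caller from the controller manufacturer's recommendation, because the right value depends on the type of cabling.

```python
from collections import Counter


def duplicate_scsi_ids(devices):
    """Return {scsi_id: [device names]} for every ID used more than once.

    devices maps a device name (controller or drive) on the shared bus
    to its SCSI ID.
    """
    counts = Counter(devices.values())
    return {
        scsi_id: sorted(name for name, value in devices.items() if value == scsi_id)
        for scsi_id, count in counts.items()
        if count > 1
    }


def cable_length_ok(segment_lengths_m, max_total_m):
    """True if the combined length of all cable segments is within the limit."""
    return sum(segment_lengths_m) <= max_total_m


if __name__ == "__main__":
    bus = {"node1-controller": 7, "node2-controller": 7, "disk-f": 0, "disk-g": 1}
    # Both controllers left at the default ID 7 show up as a conflict.
    print(duplicate_scsi_ids(bus))
    print(cable_length_ok([1.0, 1.0, 0.5], max_total_m=3.0))
```

An empty result from `duplicate_scsi_ids` means every device on the shared bus has a unique ID.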
Group and Resource Failures
- If groups or resources are not available to clients, you need to verify whether it is a restart, failover, or failback problem. In Cluster Administrator, you will see a visual notification that a group or a resource in a group is offline. Because there are a variety of reasons for a failure, you will have to troubleshoot the cause to find out whether it is a resource or group failure.
Problem and Possible Resolution

Problem: A resource fails, but is not brought back online.
Possible resolution:
- In the Policies dialog box for the resource properties, verify that Don't restart is cleared (not selected).
- Verify that the resource dependencies are correctly configured.
- Verify that any dependent resources are online.

Problem: The default quorum resource will not come online.
Possible resolution:
- Verify that there are no hardware errors by using Event Viewer and looking for disk input/output (I/O) error messages.

Problem: Cannot bring a group online.
Possible resolution:
- Verify that there are no hardware or configuration problems with any disk resources for the group.
- Verify that the resource dependencies are correctly configured.
- Move the group to the other node and attempt to bring the group online. If this works, verify that the first node can gain access to everything that is necessary to bring the group's resources online (for instance, the disk resource).

Problem: A group cannot be moved or failed over to the other node.
Possible resolution:
- Verify that the resource is properly installed on the node.
- Verify that the other node is set as a possible owner for all resources in the group in the Properties dialog box for the resource.

Problem: A group failed over but did not fail back.
Possible resolution:
- Verify that the failback policies for the group are properly configured. In the Properties dialog box for the group, verify that Prevent failback is cleared. If Failback immediately is selected, be sure to wait long enough for the group to fail back.
- Check these settings for all of the resources within a group. Because groups fail over as a whole, one resource that is prevented from failing back will affect the entire group.
- Ensure that the node to which you want the groups to fail back is configured as the preferred owner of the group. If not, Cluster service will leave the groups on the node to which they failed over.

Problem: The entire group failed and has not restarted.
Possible resolution:
- If the node on which the group had been running is offline, verify that the other node is a possible owner of the group and of all of the resources in the group.
- Ensure that the group has not exceeded its failover threshold or its failover period.
- Bring the resources online one at a time to determine which resource is causing the problem.
- Create a temporary group (for testing purposes), and then move the resources to it one at a time, bringing each resource online after moving the resource.
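The interaction between a group's failover threshold and failover period can be illustrated with a small model. This is a sketch under an assumption, not a description of Cluster service internals: it treats the threshold as the maximum number of failovers allowed within the most recent period, measured in hours. The function name `can_fail_over` is hypothetical.

```python
from datetime import datetime, timedelta


def can_fail_over(previous_failovers, now, threshold, period_hours):
    """Return True if another failover stays within the group's limits.

    previous_failovers: datetimes of earlier failovers of the group.
    threshold:          maximum failovers allowed within the period.
    period_hours:       length of the sliding window, in hours.
    """
    window_start = now - timedelta(hours=period_hours)
    # Only failovers inside the window count against the threshold.
    recent = [t for t in previous_failovers if t > window_start]
    return len(recent) < threshold


if __name__ == "__main__":
    now = datetime(2000, 1, 1, 12, 0)
    history = [datetime(2000, 1, 1, 11, 0), datetime(2000, 1, 1, 11, 30)]
    # Two failovers in the last six hours with a threshold of two:
    # under this model the group would be left offline.
    print(can_fail_over(history, now, threshold=2, period_hours=6))
```

In this model, a group that fails repeatedly in a short time stops failing over and stays offline, which matches the symptom of an entire group that failed and has not restarted.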
Quorum Log Corruption
- Reset the Quorum Log
- clussvc -debug -resetquorumlog
- Delete the Quorum Log
- -noquorumlogging
- Microsoft Cluster service maintains details about changes within the cluster through a quorum log file. If this file becomes corrupted for any reason, it is possible that Cluster service will not start. The following error message may occur when you attempt to start Cluster service on a node of the server cluster: Event ID: 1147, Source: ClusSvc.
- If the cluster will not start because of a corrupted quorum log, you can reset the quorum log. If Cluster service still will not start after attempting a reset, you can access the quorum disk and remove the corrupted quorum log.
75Reset the Quorum Log
- If you do not have a backup of the quorum log
file, perform the following steps:
- Open a command prompt.
- Go to the %SystemRoot%\Cluster folder.
- Start Cluster service by typing clussvc -debug
-resetquorumlog, which attempts to create a new
quorum log file that is based on the cluster
configuration information in the local system's
cluster registry hive.
- Stop Cluster service by pressing CTRL+C.
- Restart Cluster service by typing net start
clussvc
- Close the command prompt.
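The reset steps above can be written down as an ordered checklist. The following Python sketch only builds and prints the command lines; it does not run them. The function name is hypothetical, the default C:\WINNT value for the system root is an assumption, and the use of cd /d to change drive and folder in one step is a choice made for this sketch rather than something the procedure specifies.

```python
import ntpath
from typing import List


def reset_quorum_log_steps(system_root: str = r"C:\WINNT") -> List[str]:
    """Return the quorum log reset procedure as ordered command lines."""
    # ntpath builds a Windows-style path regardless of the host platform.
    cluster_dir = ntpath.join(system_root, "Cluster")
    return [
        f'cd /d "{cluster_dir}"',
        "clussvc -debug -resetquorumlog",
        "(press CTRL+C to stop Cluster service)",
        "net start clussvc",
    ]


if __name__ == "__main__":
    for number, step in enumerate(reset_quorum_log_steps(), start=1):
        print(f"{number}. {step}")
```

Because clussvc -debug runs in the foreground until it is interrupted, the CTRL+C step stays a manual action in this sketch.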
76Delete the Quorum Log
- If the log file becomes corrupted and cannot be
reset, Cluster service may not start. To correct
this problem, you must use the -noquorumlogging
option when starting Cluster service. This option
allows the cluster to start without quorum
logging. You may then access the quorum disk and
remove the corrupted Quolog.log file.
77- Use the following procedure to help recover from
this situation:
- If Cluster service is running, use Control Panel
on both nodes to stop Cluster service.
- On one node, use the Services tool in Control
Panel to specify the startup parameter for
Cluster service as -noquorumlogging and start the
service.
- On the quorum disk, run Chkdsk. If the disk does
not show corruption, the log file may be
corrupted. In this case, delete the Quolog.log
file and any .tmp files that are located in the
MSCS folder on the quorum disk.
- In Services, stop Cluster service, and then start
Cluster service without startup parameters. After
the service starts, you may start it on the other
node.
- Note: When you disable quorum logging within a
cluster, changes to the cluster configuration
cannot be logged. If a node goes offline during
this period, recent changes may be lost if
changes could not be communicated to the other
node. Quorum logging should only be disabled when
necessary to recover from log file corruption.
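Before deleting anything from the quorum disk, it can help to list exactly which files the procedure targets. The following Python sketch finds Quolog.log and any .tmp files directly inside a given MSCS folder and returns them without deleting them. The function name is hypothetical, and the folder path is supplied by the caller because the drive letter of the quorum disk varies between clusters.

```python
from pathlib import Path
from typing import List


def quorum_cleanup_candidates(mscs_dir: str) -> List[Path]:
    """List Quolog.log and any .tmp files directly inside mscs_dir."""
    matches = []
    for entry in Path(mscs_dir).iterdir():
        name = entry.name.lower()  # match file names case-insensitively
        if entry.is_file() and (name == "quolog.log" or name.endswith(".tmp")):
            matches.append(entry)
    return matches
```

Review the returned list, and delete the files only after Cluster service has been started with -noquorumlogging and Chkdsk reports no disk corruption.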
78Lab A Cluster Maintenance
79Objectives
- After completing this lab, you will be able to
- Back up cluster configuration files.
- Restore cluster configuration files.
- Evict a node from the cluster.
- Uninstall Cluster service.
80Scenario
- In this exercise, you will back up a node's
system state, which includes the cluster
configuration files. After the backup is
complete, you will restore the system state and
verify that the cluster configuration files were
restored to the node. At this point, to restore
the cluster, you would run the Clusrest.exe
utility, but for the purposes of this lab, you
will not restore the cluster. You will evict a
node from a cluster and uninstall Cluster
service on both nodes.
- The following exercises will refer to your
computers as Node A and Node B. For this lab, you
will perform all of the tasks on both Node A and
Node B, with the exception of evicting a node,
which you will perform only on Node B.
81Exercise 1 Backup and Restore
- In this exercise, you will learn how NTBackup is
used to back up and restore the cluster.
82To back up the cluster
- Complete this task from Node A and Node B.
- Click Start, point to Programs, point to
Accessories, point to System Tools, and then
click Backup.
- In the Backup dialog box, click Backup Wizard.
- In the Backup Wizard dialog box, click Next.
- Select Only back up the System State data, and
then click Next.
- In the Backup media or file name dialog box,
type c:\Backup.bkf and then click Next.
- Click Finish to start the backup.
- NTBackup will start backing up the system state,
which will take a couple of minutes.
- When the backup is complete, click Close.
83To Restore the Cluster
- In the Backup dialog box, click Restore Wizard.
- Click Next.
- Click Import File to locate the backup file of
the system state.
- In the Catalog backup file dialog box, type
c:\Backup.bkf and then click OK.
- In the What to restore box, expand File, and
then expand Media created.
- Select the System State check box, click Next,
and then click Finish.
- In the Enter Backup File Name dialog box, click
OK.
84- The Restore process will take a couple of
minutes.
- When Restore is complete, click Close.
- When prompted to restart the computer, click No.
- Close NTBackup.
- Note: NTBackup does not restore the cluster files
to the cluster disk. NTBackup places the cluster
files on the local node.
85To examine the cluster files that are restored by
NTBackup
- Click Start, and then click Run.
- In the Run dialog box, type %systemroot%\cluster
and then click OK.
- Double-click the cluster_backup folder to view
the files that are restored by NTBackup.
- What utility would you use to restore these files
to the shared drive? ___________________
86To create a group after backup
- To test the restore process, you will create a
group after the backup. The restore procedure
will roll back the cluster to its state at the
time the backup was performed.
- Perform this task from Node A.
- In Cluster Administrator, click File, point to
New, and then click Group.
- In the New Group dialog box, fill out the
following properties:
Name: Test Group
Description: Test Group
- Click Next.
- In the Preferred Owners dialog box, click Finish.
- Click OK to acknowledge that the group was
successfully created.
87To install Clusrest.exe
- In this task, you will install the Clusrest
utility on Node B and restore the cluster to the
state of the last backup. Close Cluster
Administrator on Node A and Node B if it is
running.
- Perform this task from Node B.
- On the Start menu, click Run.
- In the Run dialog box, type
c:\moc\2087a\labfiles\mscs and then click OK.
- In the Microsoft Web Installation Wizard Tool
CLUSRESTEXE dialog box, click Next.
- Click I Agree, and then click Next.
- Click Install Now.
- Click Finish.
88- On the Start menu, click Run.
- In the Run dialog box, type cmd and then click
OK.
- At the command prompt, type cd\program
files\resource kit and then press ENTER.
- At the command prompt, type clusrest and then
press ENTER.
- At the command prompt, type y to continue.
- Wait for clusrest to finish before proceeding.
- Open Cluster Administrator.
- Expand Groups and notice that the Test Group that
was created in the previous task is now missing
and that Node A is offline.
89Exercise 2 Removing Cluster Service
- In this exercise, you will remove Cluster service
from both computers in the cluster.
90To evict a node
- Complete this task from Node A only.
- Log on as Administrator with a password of
password.
- Open Cluster Administrator from the
Administrative Tools menu.
- If prompted, click Yes to restart Cluster service
on Node A.
- Right-click Node B.
- Click Stop Cluster Service.
- Click Yes.
- Right-click Node B.
- Click Evict Node.
- Click Yes.
91To remove Cluster service from Node A and Node B
- Complete this task from Node A and Node B.
- Log on as Administrator with a password of
password.
- On the Start menu, point to Settings, and then
click Control Panel.
- Open Add/Remove Programs from Control Panel.
- Click Add/Remove Windows Components.
- Clear the Cluster Service check box, and then
click Next.
- Click Finish.
- Click Yes to restart the computer.
92Review
- Cluster Maintenance
- Troubleshooting Cluster Service