Title: IT Essentials II Network Operating Systems
1IT Essentials II Network Operating Systems
- Chapter 13
- Troubleshooting the Operating System
2Identifying Problems
- Most problems can be assigned to one of the following
- Hardware - a component has malfunctioned, or is
expected but not present.
- Kernel - a bug or lack of functionality in the system
kernel (e.g. a module not loaded) sometimes
causes problems of ambiguous origin.
- Application software - user-level application
software or command utilities may behave
strangely, or simply crash.
- Configuration - system services or application
software may be misconfigured.
- User error - one of the most frequent sources of
error conditions is computer users
attempting to do something the wrong way.
- All of the above can be categorized as
- Consistent - problems that occur reliably and
demonstrably, again and again.
- Inconsistent - problems that occur only sporadically,
or under indeterminate conditions.
3Identifying Application Problems
- Common signs of application bugs are
- Failure to execute
- The program won't start at all; the main file might
not have permission to execute.
- Or it seems to start, but fails to initialize
entirely, and either exits or stalls partway up.
- Program crashes, data not saved
- Sometimes error messages are recorded
- Sometimes a core file is left behind, indicating
the application itself suffered a catastrophic
failure
- A variant of this is the locked-up program: the
application is left running but unable to proceed
- Resource exhaustion
- Refers primarily to CPU time, memory, and disk
space
- An application consumes too much memory and
ultimately begins to swap so badly that the whole
system is affected
- Program-specific misbehaviour
- To do with the running program itself
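Two of the checks above (execute permission and disk space) can be sketched in Python. This is only an illustration, not part of the course material; the function names and the 90% threshold are assumptions made for the example.

```python
import os
import shutil


def can_execute(path):
    """Return True if the file exists and the current user may execute it."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def disk_nearly_full(path="/", threshold=0.90):
    """Return True if the filesystem holding `path` is fuller than `threshold`.

    The 0.90 default is an arbitrary example value, not a recommendation.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.used / usage.total > threshold
```

A program that fails to start can be checked with can_execute first; a system that swaps or stalls can be checked with disk_nearly_full alongside the usual memory tools.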
4Configuration Problems
- Can present themselves in many ways, e.g. poor
screen resolution when running a high-end monitor
and graphics card; the Xconfigurator program may need
to be run
- Programs that depend on networking services are
particularly liable to cause problems
- The first place to look is in the configuration file
/etc/fstab
- If a configuration problem only happens to one
person in a group, it is liable to be caused by
something that person did.
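When inspecting /etc/fstab it can help to split each line into its fields. The sketch below assumes the usual whitespace-separated layout (device, mount point, file system type, options, then the optional dump and fsck pass numbers); the class and function names are made up for this example.

```python
from typing import NamedTuple


class FstabEntry(NamedTuple):
    device: str
    mount_point: str
    fs_type: str
    options: list
    dump: int
    fsck_pass: int


def parse_fstab(text):
    """Parse fstab-style text into entries, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"malformed fstab line: {line!r}")
        # The last two fields are optional and default to 0.
        dump = int(fields[4]) if len(fields) > 4 else 0
        fsck_pass = int(fields[5]) if len(fields) > 5 else 0
        entries.append(FstabEntry(fields[0], fields[1], fields[2],
                                  fields[3].split(","), dump, fsck_pass))
    return entries
```

A malformed line raises ValueError, which is itself a useful hint that the file is misconfigured.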
5System Tools and Utilities
- Utilities return information about how the system
or a file should be configured, but they do not say which
exact file or system setting is
misconfigured
- setserial - utility that provides information about and
sets options for the serial ports on the system
- lpq - command that helps resolve printing
problems by displaying all the jobs that are waiting
to be printed
- ifconfig - entered at the shell to return the
current network interface configuration of the
system
- route - displays or sets the routing information
of the system
6Fixing Persistent Problems and Log Files
- Most log files are located in the /var/log
directory or a subdirectory
- Log files can be used for
- Monitoring system loads - servers need to handle
requests efficiently. Log files can be used to
determine what requests are being made that might
cause the server to run slowly
- Intrusion attempts and detection - examination of
system log files can help in finding out how and
where the intrusion occurred, as well as what
changes the attacker made to the system or
network
- Normal system functioning - log files can be
examined to ensure the system is functioning
normally. If something is wrong, the information
in the log files can help identify and eliminate
the problem.
- Missing entries - log files with missing entries
can indicate that something on the server is not
functioning properly or is misconfigured
- Error messages - many log files contain
various error messages that can be used to locate
and identify problems or misconfiguration within
the server
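As a rough illustration of the last point, the sketch below scans log lines for words that often mark trouble. The keyword list and function name are assumptions chosen for this example; real log formats vary by service and distribution.

```python
import re

# Words that commonly signal a problem; extend to suit the service in question.
ERROR_PATTERN = re.compile(r"\b(error|fail(ed|ure)?|denied)\b", re.IGNORECASE)


def find_error_lines(lines):
    """Return (line_number, text) pairs for lines that look like errors."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Because an open file yields its lines when iterated, a file under /var/log can be passed in directly, for example `with open(path) as log: find_error_lines(log)`, where `path` is whichever log file is being examined.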
7Fstab and LILO Boot Errors
- The dmesg command can be used to display the recent
kernel messages, also known as the kernel ring
buffer. For example, variants of this command
can reveal details about drivers.
- The LILO boot loader is the first piece of code that
takes control of the boot process from the BIOS.
It loads the Linux kernel, and then passes
control entirely to the Linux kernel. If LILO is
not working properly the system won't boot.
Following are some of the LILO error codes
- No error code - LILO hasn't loaded
- L error-code - LILO has started to boot but it is
unable to load the 2nd stage boot loader
(error-code is a two-digit number generated by the BIOS).
- LI - LILO has started and the 1st and 2nd stage
loaders have been loaded, but the 2nd stage
loader won't run
- LI101010 - LILO has been loaded and is running
properly but it cannot locate the kernel image
- LIL - 1st and 2nd stage loaders successfully
loaded and are running, but LILO is unable to
read the information it needs to work
- LIL? - 2nd stage boot loader has been loaded
correctly but is at an incorrect address
- LILO - LILO has loaded and is running, indicating
that no problem with LILO is causing the system
not to boot
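The error-code list above lends itself to a simple lookup table. The sketch below is only a study aid built from that list; the function name is made up, and the "L 04" form used in the usage note is a hypothetical example of an L followed by a two-digit BIOS code.

```python
# Meanings are taken from the list above; the keys are what LILO prints.
LILO_PROMPT_MEANINGS = {
    "": "LILO has not loaded",
    "L": "LILO started but cannot load the second-stage boot loader",
    "LI": "second-stage loader was loaded but will not run",
    "LI101010": "LILO is running but cannot locate the kernel image",
    "LIL": "both stages run, but LILO cannot read the data it needs",
    "LIL?": "second-stage loader was loaded at an incorrect address",
    "LILO": "LILO loaded and is running; the boot problem lies elsewhere",
}


def explain_lilo_prompt(screen_text):
    """Map the text LILO left on screen to the meaning listed above."""
    text = screen_text.strip()
    parts = text.split()
    # "L" followed by a two-digit BIOS error code.
    if len(parts) == 2 and parts[0] == "L" and len(parts[1]) == 2:
        return f"{LILO_PROMPT_MEANINGS['L']} (BIOS error code {parts[1]})"
    return LILO_PROMPT_MEANINGS.get(text, "unrecognised LILO output")
```

For instance, explain_lilo_prompt("L 04") reports the second-stage load failure together with the BIOS code, while any text not in the table is reported as unrecognised.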
8Booting Without LILO
- When LILO fails completely the following can be
used
- LOADLIN - DOS utility
- Comes with the installation CDs, located in the
dosutils directory.
- To use it you need a DOS partition or a DOS boot disk,
a copy of LOADLIN.EXE, and a copy of the Linux
kernel.
- Boot from a raw kernel on a floppy
- The kernel is copied to a floppy disk using the
dd if=vmlinuz of=/dev/fd0 command, where vmlinuz is the
name of the kernel
- LILO on a floppy
- The preferred method, as it is the fastest.
- To install LILO on a floppy, edit the lilo.conf file,
changing the boot line to boot=/dev/fd0.
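The lilo.conf edit described above can be sketched as a small text transformation. This is an illustration under assumptions: the function name is hypothetical, and it only rewrites the text, leaving it to the administrator to save the result and reinstall the boot loader.

```python
import re

# Matches the global "boot=" line, allowing optional spaces around "=".
BOOT_LINE = re.compile(r"^\s*boot\s*=")


def point_lilo_at_floppy(conf_text, device="/dev/fd0"):
    """Return lilo.conf text with its boot line pointing at `device`."""
    lines = []
    replaced = False
    for line in conf_text.splitlines():
        if BOOT_LINE.match(line):
            lines.append(f"boot={device}")
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        # No boot line found: add one at the top, with the global options.
        lines.insert(0, f"boot={device}")
    return "\n".join(lines) + "\n"
```

After the edited file is saved, the lilo command normally has to be run again so that the boot loader is actually written to the floppy.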
9Emergency Boot System
- Linux provides an emergency copy of LILO
to boot the system if the original fails.
- Known as the Emergency Boot System.
- To use this copy of LILO, configuration changes
must be made in lilo.conf. The steps to
make these changes are as follows
- Change where the regular disk's root partition is
mounted.
- Mount it somewhere in the emergency boot system,
such as /mnt/std.
- Ensure the /boot directory is in its own
partition. Mount it instead of or in addition to
the root partition.
- Last, change the kernel images and other boot
options to what is normal.
- i.e. the boot and root options should point to
the regular hard disk.
LILO Bootlabel - an example of the GUI
configuration screen in which the root mount is
changed
10Emergency Boot Disks
- There are various types of Linux boot disks
available
- Linux installation disks - included in the media used
to install the OS in the first place. At the lilo
prompt type linux rescue
- Tom's Root/Boot Disk (tomsrtbt) - has a bootable
root file system, is downloadable from the Internet, and
fits onto a floppy disk.
- ZipSlack - available for Slackware Linux. Can be
installed on a small partition or on a removable
drive; slightly too big to fit on a floppy.
- Demo Linux or SuSE Evaluation - one of the better
emergency boot disk utilities available because
it is the most complete; must be burned onto
a CD-ROM
- Custom boot disk - required if the system being
worked on contains hardware that needs special
drivers or other special features.
- Simplest method of creation: modify an existing
boot disk by adding the required extras.
11Package Dependency Problems
- Some packages require other packages or libraries
to run.
- Linux will usually notify users if a package has
dependencies
- A few examples of events that can cause
dependency problems and conflicts are listed
below
- Missing libraries or support programs - libraries
are a type of support code used by many programs
as if they were part of the program itself.
- Incompatible libraries or support programs -
different versions of libraries and support
programs are available, corresponding to current and
past versions of the programs installed. The correct
version therefore needs to be used.
- Duplicate files or features - can cause programs
to not function correctly.
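The idea of a missing dependency can be illustrated with a toy model. The sketch below is not how rpm or any real package manager works internally; the function name, the dictionary layout, and the package names in the usage note are all hypothetical.

```python
def missing_dependencies(package, requires, installed):
    """Return the set of packages that must be installed before `package`.

    `requires` maps a package name to the names it depends on directly.
    Dependencies are followed transitively; cycles are tolerated.
    """
    missing = set()
    seen = {package}
    stack = list(requires.get(package, ()))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue
        seen.add(dep)
        if dep not in installed:
            missing.add(dep)
        stack.extend(requires.get(dep, ()))
    return missing
```

With requires = {"editor": ["libgui", "libc"], "libgui": ["libx"]} and only "libc" installed, the function reports that "libgui" and "libx" are still needed before "editor" can run.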
12Solutions to Package Dependency Problems
- Force the installation
- If the error is on a package for which the user has
manually compiled the source code, then
installation can be forced.
- Note: xxxxxxxx.rpm represents any rpm package
- Modify the system - the correct and recommended
solution is to modify the system so that
it has the dependencies needed to run
properly.
- Rebuild the problem package from source code - in
some instances it may be necessary to rebuild the
package from source code if dependency
error messages are showing up.
- Some dependencies are caused by recompilation of
the program and dependencies changing.
- To rebuild an RPM, call rpm with the
rebuild command
13Application Failure
- Difficult to spot, as they don't present in an
obvious way. No error message will be given
outlining the problem. Problems include
14Troubleshooting Loss of Network connectivity
- The most basic networking problem is the inability of
two computers to communicate. This can be due to
a hardware and/or software problem.
- First rule: check for physical connectivity.
- Ensure cables are properly plugged in at both
ends
- The network adapter is functioning (check the
link light)
- The hub status lights are on
- And no simple hardware malfunctions have happened
15Operator Error
16TCP/IP Utilities and Troubleshooting Steps
Connectivity testing: Ping, Traceroute
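Ping and traceroute are invoked slightly differently on Windows and on Linux/UNIX. The sketch below only builds the command lines and leaves running them to the caller; the function name and the default of four echo requests are assumptions made for this example.

```python
import platform


def connectivity_commands(host, count=4, system=None):
    """Return argv lists for a ping and a route trace to `host`.

    `system` defaults to the current platform; pass "Windows" or "Linux"
    to see the commands for another operating system.
    """
    system = system or platform.system()
    if system == "Windows":
        # Windows uses -n for the request count and names the tool tracert.
        return [["ping", "-n", str(count), host], ["tracert", host]]
    # Linux/UNIX use -c for the request count and the traceroute tool.
    return [["ping", "-c", str(count), host], ["traceroute", host]]
```

Each returned list can be handed to subprocess.run; a failing ping to a nearby address suggests a local problem, while the route trace shows how far packets get before they stop.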
17Other Tools
- Windows 2000 diagnostic tools
- Netdiag - runs a standard set of network tests and
generates a report of the results
- Pathping - a combination of the ping command and
the tracert command
- Wake-On-LAN (WOL) - enables an
administrator to power up a computer by sending a
signal (magic packet) to a NIC with WOL
technology.
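The magic packet itself has a simple layout: six 0xFF bytes followed by the target NIC's MAC address repeated sixteen times. The sketch below builds and broadcasts one; the function names are made up, and UDP port 9 with the limited broadcast address is a common choice rather than a requirement.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    digits = mac.replace(":", "").replace("-", "")
    if len(digits) != 12:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(digits)  # raises ValueError on non-hex input
    # Six 0xFF bytes, then the target MAC repeated sixteen times.
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

The target machine only wakes if WOL is enabled on its NIC and in its firmware settings.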
18Disaster Recovery
Risk Analysis
19RAID Redundancy
As well as disks, other components in the server
can be configured for redundancy, including
power supplies, UPS, cooling fans, network
interface adapters, and processors.
20Clustering
- Group of independent computers working together
as a single system; two main types
- Shared-device model - applications running in a
cluster can access any hardware resource
connected to any node/server in the cluster
- Nothing-shared model - each node has ownership of
a resource, so there is no competition for the
resources
- Used to ensure mission-critical applications and
resources are as highly available as possible
- Advantages of clustering
- Fault tolerance - can tolerate the failure of
components up to a complete computer without
impacting the capability to support
mission-critical applications.
- High availability - the cluster will not be
unavailable for reasons such as maintenance,
upgrades, or configuration changes. A correctly
configured cluster will come very close to 100%
availability.
- Scalability - resources can be added to the
cluster transparently to the system users.
- Easier manageability - servers can be managed as a
group, dramatically reducing the number and
amount of management tasks when compared to an
equivalent number of standalone servers.
- Disadvantages of clustering
- Clusters can be significantly more expensive than
the equivalent standalone servers, due to the
additional software and specialized hardware.
- More complex than setting up a single server.
21Hot Swapping, Warm Swapping and Hot Spares
- Hot swap (also known as hot pluggable) - the
capability to add and remove components from a computer
while it's running and have the operating system
automatically recognize the change.
- e.g. hard drives; particularly useful in
conjunction with a hardware RAID controller.
- Hot spare - a component kept on hand in case of
an equipment failure.
- Examples include
- disk drives,
- RAID controllers,
- NICs,
- and any other critical component that could be
used to replace a failed component.
- In some mission-critical environments, an entire
server can be designated a hot spare.
- Warm swap - a compromise between hot swap and hot
spare. Generally done in conjunction with a hard
drive failure.
- The disk array must be shut down before the drive can be
replaced, and all I/O for that array stopped, so users
cannot access the system.
- Referred to as a warm swap because the server
does not have to be powered down to replace the
drive.
22Disaster Recovery Steps
Testing the Plan
23The Facility Goes Down
- The facility can go down due to
- a natural disaster such as an earthquake or a
flood,
- sabotage such as a bomb,
- or even just an extended power outage.
- A place to resume critical business activities
may be required: a disaster-recovery site. Two
types of disaster-recovery sites are commonly
used in the industry.
24Questions?