High availability is a key component of network performance.
Periods of internal unavailability reduce productivity and keep employees from having information at hand when it's needed. Unavailability of public services discourages customers and partners.
Availability monitoring lets an organization detect and fix any lapses quickly. It can help to prove that a service lives up to its service level agreements.
What monitoring does
The purpose of availability monitoring is to identify the failure of any server, service, or device on the network to respond within acceptable time limits.
It will report when the component fails to respond and when it becomes available again. Short periods of unavailability may simply be logged; if a component doesn't come back promptly, the monitor will issue an alert.
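The log-then-alert behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name OutageTracker, the 60-second grace period, and the event labels are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OutageTracker:
    """Tracks one component and records down, alert, and recovered events."""

    alert_after: float = 60.0  # hypothetical grace period, in seconds
    down_since: Optional[float] = None
    alerted: bool = False
    events: List[Tuple[float, str]] = field(default_factory=list)

    def record(self, now: float, responded: bool) -> None:
        """Feed in the result of one check taken at time `now` (seconds)."""
        if responded:
            # Only report a recovery if the component was previously down.
            if self.down_since is not None:
                self.events.append((now, "recovered"))
            self.down_since = None
            self.alerted = False
            return
        if self.down_since is None:
            # First missed response: a short outage is simply logged.
            self.down_since = now
            self.events.append((now, "down"))
        elif not self.alerted and now - self.down_since >= self.alert_after:
            # Still down after the grace period: escalate once.
            self.alerted = True
            self.events.append((now, "alert"))
```

A component that misses one check and then answers produces only "down" and "recovered" entries. One that stays down past the grace period also produces a single "alert" entry.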
Loss of availability may have several causes:
Software failure. An application may crash or enter an unresponsive state. Whether it restarts successfully or not, the situation needs to be noted. It may indicate a bug that needs fixing or a configuration problem.
Server hardware failure. The server may have physical problems that make it stop running or intermittently fail to respond. Such problems tend to get worse over time.
Network packet loss. If the network is prone to dropping packets, the services that depend on it can become intermittently unreachable.
Network connectivity failure. A server's or site's connection to the network may fail for any number of reasons. This will make every service behind that connection unavailable.
Network hardware failure. A component within the network, such as a router or switch, may fail intermittently or permanently. Cables can break or become disconnected.
Resource overload. An excessive number of requests may keep a service from responding within an acceptable time or cause it to crash from resource exhaustion. This can result from abnormally high traffic or a DDoS attack.
Types of monitoring tools
Several kinds of tools are helpful in availability monitoring. Each has its own strengths, and they can be used effectively in combination.
SNMP tools. The Simple Network Management Protocol supports polling of resources. Agents collect information from devices and enter it in the management information base (MIB). A manager queries the agents and assembles the information for administrators or technicians. SNMP tools are good for identifying faults at the device level. They're easy to set up and provide constant monitoring, but they can detect only certain types of problems.
Traffic flow tools. These tools monitor incoming and outgoing data. They can help to identify application-specific issues. Since traffic passes through them, they can themselves be a point of failure or limit throughput, so they have to be used conservatively. They might be slow to catch problems with services that aren't constantly used.
Active monitoring tools. Sending test requests to services at frequent intervals can catch those that aren't responding, even when there is no incoming traffic. Requests can be designed to monitor any or all available services. These requests impose some overhead on the service, and they need to be tailored to the services being monitored.
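As a rough illustration of active monitoring, the Python sketch below opens a TCP connection to a service and reports how long the connection took, or None if the service did not answer in time. The function name probe_tcp and the 2-second default timeout are assumptions made for this example. A real check would usually go further and send a request tailored to the service, such as an HTTP request or a database query.

```python
import socket
import time
from typing import Optional


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Return the connection time in seconds, or None if the probe failed."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts and
        # refused connections) when the service cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

A scheduler would call a function like this for each monitored service at a fixed interval and pass the result to whatever logic decides between logging and alerting.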
Using the tools
Availability monitoring tools will notify administrators in a variety of ways, depending on the type and severity of the problem. A brief failure to respond may simply be logged, or an email may be sent.
A more serious outage will result in direct notification, through an alert on the dashboard screen or an SMS message.
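One simple way to express this escalation is to map the length of an outage to a set of notification channels. The sketch below is hypothetical: the function name channels_for, the 30-second and 300-second thresholds, and the channel labels are assumptions chosen for illustration, and real tools typically let administrators configure these per service.

```python
from typing import List


def channels_for(
    outage_seconds: float,
    brief_limit: float = 30.0,     # assumed cutoff for a "brief" failure
    serious_limit: float = 300.0,  # assumed cutoff for a "serious" outage
) -> List[str]:
    """Pick notification channels based on how long a component has been down."""
    if outage_seconds < brief_limit:
        return ["log"]                    # brief failure: just record it
    if outage_seconds < serious_limit:
        return ["log", "email"]           # longer lapse: low-urgency notice
    return ["log", "dashboard", "sms"]    # serious outage: direct notification
```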
When a non-transient problem arises, the administrator needs to diagnose and correct it.
The first step is to identify the point of failure, which a well-organized set of availability tools will do. It may be possible just from that to identify a hardware or connectivity failure and fix the problem. If the problem is in the configuration or software, it may be necessary to examine the system logs or notify a software engineer.
In addition to providing notice of critical outages, availability monitoring logs ongoing performance, giving a definite number for each component's uptime percentage. If these numbers aren't up to acceptable levels, it will be necessary to fix the problem, perhaps by installing more reliable hardware, fixing software bugs, or adding failover components.
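The uptime percentage is the share of a reporting window during which the component was available. A minimal Python sketch of the arithmetic follows. The function name uptime_percent is hypothetical, and the code assumes that outages are given as non-overlapping (start, end) pairs in the same time unit as the window.

```python
from typing import Iterable, Tuple


def uptime_percent(
    window_start: float,
    window_end: float,
    outages: Iterable[Tuple[float, float]],
) -> float:
    """Return the percentage of the window during which the component was up."""
    total = window_end - window_start
    if total <= 0:
        raise ValueError("window_end must be later than window_start")
    down = 0.0
    for start, end in outages:
        # Count only the part of each outage that falls inside the window.
        clipped_start = max(start, window_start)
        clipped_end = min(end, window_end)
        if clipped_end > clipped_start:
            down += clipped_end - clipped_start
    return 100.0 * (total - down) / total
```

For example, 2,592 seconds of downtime in a 30-day window (2,592,000 seconds) works out to 99.9 percent uptime, a figure that can be compared directly against a service level agreement.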
With constant monitoring, availability problems will get prompt attention, so that services meet their obligations and keep users satisfied and productive.