Any computer system will fail, given enough time. Hardware problems, such as a power failure or component defect, can bring it down. The software might have a bug that rare circumstances trigger, or an external attack exploiting a flaw might stop it from running properly. A required external service might go down. Operator error might mess everything up.
It's important to catch and remedy any of these situations as quickly as possible. If they go unnoticed, the downtime could hurt a business's reputation and cost it money.
It isn't just hard failures that need detection. A service that fails to respond within an acceptable time can also be considered faulty. If that happens too often, it needs to be addressed; otherwise, users will be dissatisfied with the system's performance.
Automatic control systems are especially in need of fault monitoring, even if they don't interact directly with humans. If they fail or aren't sufficiently responsive, the operation of equipment may suffer. The result could be poor performance by the equipment or even a breakdown.
Fault monitoring systems exist to catch these failures. Software watching the network can discover non-responsive services and start remedial action within seconds.
Types of fault monitoring
The two main categories of fault monitoring are active and passive. One system can use both types of monitoring.
Active monitoring pings or queries a service and awaits a response. If no response arrives within a specified time interval, that may indicate a problem. If 100% responsiveness is expected, the monitoring software will take action immediately. If some level of packet dropping is considered normal, it will report an error only when the proportion of dropped queries exceeds a threshold.
Effective use of active monitoring requires striking a balance between performance and rapid detection. Too many queries covering every subsystem could degrade performance; with too little monitoring, outages take longer to catch.
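To make that concrete, here's a minimal sketch of an active monitor in Python. The host, port, polling interval, window size, and failure threshold are all illustrative knobs rather than recommended values, and a real probe would speak the service's own protocol instead of making a bare TCP connect.

```python
import socket
import time

def probe(host, port, timeout=2.0):
    """Try one TCP connection; True means the service answered in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def monitor(host, port, interval=10.0, window=20, threshold=0.2):
    """Poll the service and alert when the failure rate over the last
    `window` probes exceeds `threshold`. Set threshold=0 if 100%
    responsiveness is expected."""
    results = []
    while True:
        results.append(probe(host, port))
        del results[:-window]          # keep only the most recent probes
        failures = results.count(False) / len(results)
        if failures > threshold:
            print(f"ALERT: {host}:{port} failing {failures:.0%} of probes")
        time.sleep(interval)  # a longer interval lightens the load but delays detection
```

Lengthening the interval or shrinking the probe set is exactly the performance-versus-detection trade-off described above.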
Passive monitoring covers several methods. If the software detects a severe error, it can signal the monitoring system. If both the application and the monitor follow the syslog standard, this is fairly easy to implement. The monitoring system can also detect timeout errors and take action if there are too many.
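As a sketch of the passive side, a Python application can hand severe errors to the local syslog daemon using only the standard library. The logger name and message here are invented, and the socket path assumes a typical Linux host.

```python
import logging
import logging.handlers

# Route severe application errors to the local syslog daemon, where a
# syslog-aware monitoring system can pick them up. /dev/log is the usual
# Unix socket on Linux; use ("syslog-host", 514) for UDP instead.
logger = logging.getLogger("orders")   # "orders" is a hypothetical app name
handler = logging.handlers.SysLogHandler(address="/dev/log")
handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
logger.addHandler(handler)

# CRITICAL maps to the syslog "critical" severity; the monitor can alert
# on anything at this level or above.
logger.critical("database connection pool exhausted")
```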
The complexities of modern networks
The nature of fault monitoring has changed as networks have grown in complexity. Mobile devices, the Internet of Things, and cloud connections haven't just added to the number of devices but made the topology more intricate. Networks no longer have as well-defined a boundary as they once did. Devices come and go on a regular basis.
Network complexity makes fault localization more difficult. A single failing component can cascade into many fault reports across different devices, and isolating the one (or ones) actually causing the problem isn't always easy. If the network reroutes dynamically in response to faults, the monitoring process is chasing a moving target. That's good for avoiding immediate outages, but not so good for finding and fixing the underlying fault.
It's impractical to monitor each device with a direct connection, so indirect methods are necessary to tell where bottlenecks or discontinuities are. It's something like testing a complex electrical circuit from its input points to tell which component is failing.
Human testers can't keep track of all the combinations unaided, so automated testing software that can map out the whole network is necessary. When a fault occurs, it needs to conduct a set of queries to narrow down the point of failure. A lot of data will come back, and the software should analyze it rather than leaving the administrator to sort through it.
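A crude illustration of that narrowing-down process: given a route the mapping software has already discovered, probe each hop in order and stop at the first one that fails to answer. The addresses below are hypothetical, and the ping flags shown are the Linux ones.

```python
import subprocess

def reachable(host):
    """One ICMP echo with a one-second wait (flags as on Linux ping)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                            capture_output=True)
    return result.returncode == 0

def localize(path):
    """Walk a pre-mapped route outward from the monitor and report the
    first hop that fails to answer; the fault likely sits at or just
    before it."""
    for hop in path:
        if not reachable(hop):
            return hop
    return None

# Hypothetical route from the monitor out to an application server:
# localize(["10.0.0.1", "10.0.1.1", "app-server.example.com"])
```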
Intermittent faults in a complex network are the hardest to detect. Testing may have to run for a long time before it catches the problem, yet it can't flood the network with so many pings that normal operations slow down. Software to catch these problems needs to find answers without overloading the network. Overloading thresholds differ from network to network, so administrators will need to configure the testing to strike the right balance between finding problems and adding traffic.
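One common way to enforce that balance is a token bucket, as in the sketch below; the rate and burst values are placeholders an administrator would tune per network.

```python
import time

class ProbeBudget:
    """A token bucket that caps how fast a long-running hunt for an
    intermittent fault may send probes. rate_per_sec and burst are the
    per-network tuning knobs."""
    def __init__(self, rate_per_sec=1.0, burst=5):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens in proportion to elapsed time, up to capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# The probing loop sends a query only when budget.allow() is True and
# simply waits otherwise, so the test can run for days without
# flooding the network.
```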
Remedial action
When it detects a problem, a fault monitoring system needs to take some kind of action.
One possibility is to attempt restarting the failed software or the computer on which it is running. If it's running in a virtual machine or container, the indicated response might be to kill the existing instance and launch a new one.
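Both responses reduce to a few commands. Here's a sketch assuming a systemd-managed host and the Docker CLI; the service and container names would come from the monitor's configuration.

```python
import subprocess

def restart_service(name):
    """Restart on the same host; assumes a systemd-managed Linux box."""
    subprocess.run(["systemctl", "restart", name], check=True)

def replace_container(name, image):
    """Kill-and-relaunch pattern for containerized services (Docker CLI)."""
    subprocess.run(["docker", "rm", "-f", name], check=False)
    subprocess.run(["docker", "run", "-d", "--name", name, image], check=True)
```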
Other possibilities are to switch over to a failover system or to send an urgent notification, such as an automated phone call, to an administrator. If a critical system fails and can't quickly recover, administrators will need to start emergency procedures.
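Urgent notifications usually go out through some gateway's HTTP API. Here's a generic sketch, with a hypothetical webhook URL standing in for a real phone or SMS gateway:

```python
import json
import urllib.request

def notify(message, url="https://alerts.example.com/hook"):
    """POST a JSON alert to an on-call webhook. The URL is a placeholder;
    a phone-call or SMS gateway is driven the same way."""
    body = json.dumps({"text": message}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)
```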
If the monitoring system includes an administrative console, a fault notice will appear there regardless of what other action is taken. Administrators will have to check if the remedial action was successful. If it was, they should look into the source of the failure. If it wasn't, they need to find out why and take steps to recover.
Administrators need to review the logs periodically. If a service fails repeatedly, there is probably an underlying problem that needs addressing. A software bug, unreliable hardware, or system overloading might be the cause. If it's ignored and usage increases, the failures will only grow more frequent.
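Part of that review can be automated. The sketch below tallies restart events from a hypothetical log format and flags services that fail repeatedly; the pattern would need adjusting to the monitor's actual log output.

```python
import collections
import re

def frequent_failers(log_path, threshold=3):
    """Count restart events per service and flag repeat offenders.
    The matched line format ("restarted service <name>") is hypothetical."""
    pattern = re.compile(r"restarted service (\S+)")
    counts = collections.Counter()
    with open(log_path) as fh:
        for line in fh:
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return {svc: n for svc, n in counts.items() if n >= threshold}
```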
Repeated underperformance faults require assessing the hardware and software environment. Changing the configuration may bring performance back to acceptable levels. If not, a hardware upgrade may be necessary.
The advantage of cloud monitoring
Many system managers hesitate to install fault monitoring because of the burden it could impose. Badly designed fault monitoring systems do overload the network and impair performance, but it doesn't have to be that way. With cloud-based monitoring, only a lightweight agent resides in the network; the analysis and identification of problems happen in the external cloud service.
Cloud-based monitoring is more reliable than internal monitoring. If a network misbehaves or is under a DDoS attack, all its services are affected. Monitoring and reporting will slow down just when they're needed most. Notifications might not go out. If the whole network becomes inaccessible, a cloud-based system will detect the lack of response and report it.
A well-known benefit of cloud computing is scalability, and it applies to fault monitoring. With the growth of mobile computing and smart devices, even small networks grow faster than we realize, and they're spread out beyond any single location. An onsite box for fault monitoring will require upgrades as the network grows. Since it has little to do most of the time, it might not be obvious that the network has outgrown it until a crisis happens. Cloud-based monitoring provides the capacity that's needed as the network grows more complex.
One cloud monitoring system can cover multiple data centers, as well as remote connections. Centralized monitoring helps to identify problems across VPNs and WANs. This allows faster identification and correction of problems.
Maintenance is simpler, too. Software upgrades, including new releases of the onsite agent, are part of the service, and prompt bug fixes give more confidence in the system's security.
Fault monitoring is part of a network management strategy for keeping systems reliable and fixing problems before they become serious. When it's done properly, it pays for itself.