Whether it is the middle of the day or the middle of the night, nobody in charge of a network wants to get “that call”: there is a major problem and the network is down. It usually starts with one or two complaints, such as “hey, I can’t open my email” or “something is wrong with my web browser”, but those few complaints quickly turn into many, and suddenly you know there is a real problem. What you may not know is what to do next.
In this blog post, I will examine some basic troubleshooting steps that every network manager should take when investigating an issue. Whether you have a staff of 2 or 200, these common-sense steps still apply. Of course, depending on what you discover during your investigation, you may need to take some additional steps to fully determine the root cause of the problem and how to fix it.
Step 1. Determine the extent of the problem.
You need to pinpoint the scope of the issue as quickly as possible. Is it limited to a single physical location, such as one office, or is it network-wide, including WANs and remote users? The answer provides valuable insight into where to go next. If the problem is contained within a single location, then you can be pretty sure that the cause is also within that location (or at the very least within that location plus any uplink connections to other locations).
It may not seem intuitive, but if the issue is network-wide with multiple affected locations, this can actually narrow down the problem. It probably resides in the “core” of your network, because that is usually the only place where a single fault can affect such a large portion of the network. That may not make it easier to fix, but it generally does help with identification.
If you’re lucky, you might be able to narrow the issue down even further to a clear segment, like “only wireless users” or “everything on VLAN 100”. In that case, you can jump straight into deep-dive troubleshooting on just those areas.
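To make the scoping idea concrete, here is a minimal Python sketch that groups incoming trouble reports by site and segment. Everything in it is an assumption for illustration: the function name summarize_scope, the (site, segment) report format, and the site names are hypothetical and do not come from any particular help desk or monitoring tool.

```python
from collections import Counter


def summarize_scope(reports, all_sites):
    """Classify outage scope from (site, segment) trouble reports.

    reports   -- list of (site, segment) tuples, one per complaint (hypothetical format)
    all_sites -- every site on the network, used to recognize a network-wide issue
    """
    if not reports:
        return "no reports"

    sites = {site for site, _ in reports}
    segments = Counter(segment for _, segment in reports)

    if len(sites) == 1:
        scope = f"single site: {next(iter(sites))}"
    elif sites >= set(all_sites):
        # Every known site is affected, so the core is the first suspect.
        scope = "network-wide (check the core first)"
    else:
        scope = "multiple sites: " + ", ".join(sorted(sites))

    # If every complaint comes from the same segment, call that out too.
    if len(segments) == 1:
        scope += f"; only segment: {next(iter(segments))}"
    return scope


# Example with made-up site names:
print(summarize_scope([("HQ", "wireless"), ("HQ", "wireless")], ["HQ", "Branch"]))
```

With the two made-up reports above, the result points at a single site and a single segment, which is exactly the situation where you can skip ahead to focused troubleshooting.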
Step 2. Try to determine if it is server/application related or network related.
This starts with the common “ping test”. The big question you need to answer is this: do my users have connectivity to the servers they are trying to reach but (for some reason) cannot access the applications, which means the problem is in the servers or apps? Or do they have no connectivity at all, which points to a network issue?
This simple step can go a long way towards troubleshooting the issue. If there is no network connectivity, then the issue resides in the infrastructure, most commonly in L2/L3 devices and firewalls. I’ve seen many cases where a single firewall rule was the cause of an entire network outage.
If there is connectivity, then you need to investigate the servers and applications themselves. Common network management platforms should be able to inform you of server availability, including tests for service port availability and the status of services and processes. A widespread issue that happens all at once usually indicates a problem stemming from a patch or other update or install that was performed on multiple systems simultaneously.
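The triage in this step can be sketched in a few lines of Python. This is only a rough illustration under some assumptions: the helper names (host_responds_to_ping, port_is_open, classify) are made up for this post, the system ping command is assumed to be installed, and the host and port in the usage comment are placeholders. Keep in mind that some firewalls drop ICMP, so a failed ping on its own is not conclusive.

```python
import platform
import socket
import subprocess


def host_responds_to_ping(host, timeout=5):
    """Send one ICMP echo using the system ping command (assumed to be installed)."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def port_is_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(ping_ok, port_ok):
    """Map the two test results to the next place to look."""
    if port_ok:
        return "service reachable: look at the application or the client"
    if ping_ok:
        return "host reachable, service down: investigate the server/application"
    return "no connectivity: investigate the network (L2/L3 devices, firewalls)"


# Usage (placeholder host and port):
# print(classify(host_responds_to_ping("mail.example.com"),
#                port_is_open("mail.example.com", 25)))
```

Running the same two checks from a few different locations also feeds back into Step 1, since it shows whether the loss of connectivity is local or network-wide.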
Step 3. Use your network management system to pinpoint, roll back, and/or restart.
Good management systems today should be able to identify when the problem first occurred and potentially even the root cause of the issue (especially for network issues). You should also have backup and restore capabilities for all systems. That way, in a complete failure scenario, you can always fall back to a known good configuration or state. Lastly, you should then be able to restart your services or devices and restore service.
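If your management system can export its polling history, the “when did it start” and “what do I roll back to” questions can be answered with very little code. The sketch below assumes a hypothetical data format: availability samples as (timestamp, is_up) pairs sorted by time, and backups as (timestamp, name) pairs. Neither the format nor the function names come from any specific product.

```python
from datetime import datetime


def first_failure(samples):
    """Return the timestamp at which the current outage began, or None if the device is up.

    samples -- list of (timestamp, is_up) pairs, sorted oldest to newest
    """
    start = None
    for timestamp, is_up in samples:
        if is_up:
            start = None          # device recovered, forget any earlier outage
        elif start is None:
            start = timestamp     # first failed poll of a new outage
    return start


def last_good_backup(backups, outage_start):
    """Return the most recent (timestamp, name) backup taken before the outage began."""
    earlier = [backup for backup in backups if backup[0] < outage_start]
    return max(earlier, key=lambda backup: backup[0], default=None)


# Example with made-up data:
polls = [
    (datetime(2024, 1, 1, 9, 0), True),
    (datetime(2024, 1, 1, 9, 5), False),
    (datetime(2024, 1, 1, 9, 10), False),
]
configs = [
    (datetime(2023, 12, 31, 23, 0), "nightly-config-a"),
    (datetime(2024, 1, 1, 9, 7), "config-taken-during-outage"),
]
started = first_failure(polls)
print(started, last_good_backup(configs, started))
```

Picking only backups taken before the first failed poll is a deliberate choice here: a backup captured during the outage may already contain the change that caused it.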
In some cases, there may have been a hardware failure that needs to be addressed before a device can come back online. Having spare parts or emergency maintenance contracts will certainly help in that case. If the issue is more complex, like the overloading of a circuit or system, then steps may need to be put in place to restrict usage until additional capacity can be added. With most datacenters running on virtualized platforms today, additional compute and storage capacity can in many cases be added in less than 60 minutes.
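For the overloaded-circuit case, a quick utilization calculation can tell you whether restricting usage is warranted. The sketch below assumes you can read two byte (octet) counter samples for the interface, taken a known number of seconds apart, and that the counter did not wrap between them. The function names and the 80 percent threshold are illustrative assumptions, not a recommendation from any vendor.

```python
def utilization_percent(octets_start, octets_end, interval_seconds, link_bps):
    """Average link utilization between two octet counter readings.

    Assumes the counter did not wrap or reset during the interval.
    """
    bits_transferred = (octets_end - octets_start) * 8
    return 100.0 * bits_transferred / (interval_seconds * link_bps)


def is_overloaded(utilization, threshold_percent=80.0):
    """Flag a circuit whose utilization meets or exceeds a chosen threshold (hypothetical default)."""
    return utilization >= threshold_percent


# Example: 375,000,000 bytes in 60 seconds on a 100 Mbps circuit is 50% utilization.
usage = utilization_percent(0, 375_000_000, 60, 100_000_000)
print(usage, is_overloaded(usage))
```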
Network issues happen to every organization. Those that know how to respond effectively and take a step-by-step approach to troubleshooting will be able to restore service quickly.
I hope these three steps to take when your network goes down were useful. Don’t forget to subscribe for our weekly blogs.