Much like monitoring the health of your body, monitoring the health of your IT systems can get complicated. There are potentially hundreds of data points that you could monitor, and customers often ask me to help them decide which ones they should monitor, mostly because so many KPI options are available.
However, once you begin to monitor a particular KPI, you are implicitly stating that this KPI is important (since you are monitoring it) and that you must therefore respond when it raises an alarm. This can easily (and quickly) lead to “monitor sprawl”, where you end up monitoring so many data points and generating so many alerts that you can’t really understand what is happening or, worse yet, you begin to ignore some alarms because you have too many to look at.
In the end, one of the most important aspects of designing a sustainable IT monitoring system is to determine what the critical performance indicators really are, and then focus on those. In this blog post, I will highlight the 3 most important KPIs to monitor on your Windows servers, although, as you will see, these same KPIs are suited to any server platform.
1. Processor Utilization
Most monitoring systems have a statically defined threshold for processor utilization somewhere between 75% and 85%. In general, I agree that 80% should be the “simple” baseline threshold for core utilization.
However, there is more than meets the eye to this KPI. It is very common for a CPU to exceed this threshold for a short period of time. Without some consideration for the length of time that this mark is broken, a system could easily generate a large number of alerts that are not actionable by the response team.
I usually recommend a “grace period” of about 5 minutes before an alarm should be created. This provides enough time for a common CPU spike to return to an OK state, but is also short enough that when a real bottleneck occurs due to CPU utilization, the monitoring team is alerted promptly.
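The grace period can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular monitoring product implements it: the class name SustainedThreshold and its update method are hypothetical, and the 80% and 300-second defaults simply mirror the values suggested above.

```python
from typing import Optional


class SustainedThreshold:
    """Raise an alarm only after a value stays above a threshold for a grace period.

    Hypothetical helper for illustration; the defaults mirror the 80% / 5 minute
    guidance in the text.
    """

    def __init__(self, threshold: float = 80.0, grace_seconds: float = 300.0) -> None:
        self.threshold = threshold
        self.grace_seconds = grace_seconds
        self._breach_start: Optional[float] = None  # time of first sample in the current breach

    def update(self, timestamp: float, value: float) -> bool:
        """Feed one sample; return True if the alarm condition is met."""
        if value <= self.threshold:
            # Back to an OK state: a short spike never becomes an alarm.
            self._breach_start = None
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        return timestamp - self._breach_start >= self.grace_seconds


if __name__ == "__main__":
    monitor = SustainedThreshold()
    # One CPU sample per minute: a short spike, a recovery, then a sustained breach.
    samples = [90, 95, 70, 85, 90, 92, 88, 91, 93]
    for minute, cpu in enumerate(samples):
        if monitor.update(minute * 60, cpu):
            print(f"ALARM at t={minute * 60}s: CPU {cpu}%")
```

Resetting the timer whenever the value drops back below the threshold is what filters out the common short spikes while still alerting on a real, sustained bottleneck.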
It is also important to take into consideration the type of server that you are monitoring. A well-scoped VM should in fact see high average utilization, so a high CPU value on its own is not necessarily a sign of trouble there. In that case, it may be useful to also monitor a value like the total percentage interrupt time. You may want to alarm when total percentage interrupt time is greater than 10% for 10 minutes. This value, combined with the standard CPU utilization mentioned above, can provide a simple but effective KPI for CPU health.
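As a rough sketch of how the two CPU checks could be evaluated together, the Python below looks at the most recent samples of each metric. The function names sustained_above and cpu_alarms are hypothetical, the samples are assumed to arrive as (timestamp in seconds, value) pairs from whatever collector you use, and the thresholds are the ones suggested above (80% for 5 minutes, 10% interrupt time for 10 minutes).

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value in percent)


def sustained_above(samples: Sequence[Sample], threshold: float, duration_s: float) -> bool:
    """True if the trailing run of samples above `threshold` spans at least `duration_s`.

    `samples` must be in chronological order.
    """
    if not samples:
        return False
    end = samples[-1][0]
    start = None
    for ts, value in reversed(samples):
        if value <= threshold:
            break  # the breach run ends at the first OK sample going backwards
        start = ts
    return start is not None and end - start >= duration_s


def cpu_alarms(cpu: Sequence[Sample], interrupt: Sequence[Sample]) -> List[str]:
    """Evaluate both CPU health checks and return the names of those in alarm."""
    alarms = []
    if sustained_above(cpu, 80.0, 5 * 60):
        alarms.append("cpu_utilization")
    if sustained_above(interrupt, 10.0, 10 * 60):
        alarms.append("interrupt_time")
    return alarms


if __name__ == "__main__":
    # Eleven one-minute samples: CPU pinned at 90%, interrupt time a healthy 4%.
    cpu = [(t, 90.0) for t in range(0, 660, 60)]
    interrupt = [(t, 4.0) for t in range(0, 660, 60)]
    print(cpu_alarms(cpu, interrupt))
```

On a busy but healthy VM you might choose to page only on the interrupt-time alarm and treat the utilization alarm as informational; that routing decision is left out of the sketch.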
2. Memory Utilization
As with CPU, memory bottlenecks are usually considered to occur at around 80% memory utilization. Again, memory utilization spikes are common enough (especially in VMs) that we want to allow for some time before we raise an alarm. Typically, memory utilization over 80-85% for 5 minutes is a good criterion to start with.
This can be adjusted over time as you get to understand the performance of particular servers or groups of servers. For example, Exchange servers typically have a different memory usage pattern compared to Web servers or traditional file servers. It is important to baseline these various systems and make appropriate deviations in the alert criteria for each.
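One simple way to keep these per-role deviations manageable is a default rule plus a small table of overrides. The Python below is only a sketch: the role names and override values are placeholders I made up for illustration, not recommendations, and should be replaced with numbers that come out of your own baselining.

```python
from typing import Dict

# Starting point from the text: alarm above 80% for 5 minutes.
DEFAULT_RULE: Dict[str, float] = {"threshold_pct": 80.0, "grace_seconds": 300}

# Placeholder overrides per server role (hypothetical values, replace after baselining).
ROLE_OVERRIDES: Dict[str, Dict[str, float]] = {
    "exchange": {"threshold_pct": 90.0},
    "web": {"threshold_pct": 85.0},
}


def memory_rule_for(role: str) -> Dict[str, float]:
    """Return the memory alert rule for a server role, falling back to the default."""
    rule = dict(DEFAULT_RULE)  # copy so the default is never mutated
    rule.update(ROLE_OVERRIDES.get(role.lower(), {}))
    return rule


if __name__ == "__main__":
    for role in ("Exchange", "Web", "File"):
        print(role, memory_rule_for(role))
```

Keeping the overrides in one place makes it easy to see which server groups deviate from the baseline and by how much.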
The amount of paging on a server is also a memory-related KPI that is important to track. If your monitoring system is able to track memory pages per second, then I recommend also including this KPI in your monitoring views. Together with standard committed memory utilization, these KPIs provide a solid picture of memory health on a server.
3. Disk Utilization
Disk Drive monitoring encompasses a few different aspects of the drives. The most basic of course is drive utilization. This is commonly measured as an amount of free disk space (and not as an amount of used disk space).
This KPI should be measured both as a percentage of free space (10% is the most common threshold I see) and as an absolute value, for example 200MB free. Both of these metrics are important to watch, and each should have its own alert. It is also key to understand that a system drive might need a different threshold than non-system drives.
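To show why both measurements matter, here is a minimal Python sketch that checks free space against a percentage and an absolute threshold and reports each one separately. The function names check_free_space and check_drive are hypothetical; the 10% and 200MB defaults are just the example values from above, and shutil.disk_usage from the standard library is used to read the drive.

```python
import shutil
from typing import List

MB = 1024 ** 2


def check_free_space(total_bytes: int, free_bytes: int,
                     min_free_pct: float = 10.0,
                     min_free_bytes: int = 200 * MB) -> List[str]:
    """Return the names of the free-space checks that are in alarm."""
    alerts = []
    if total_bytes > 0 and 100.0 * free_bytes / total_bytes < min_free_pct:
        alerts.append("percent_free")
    if free_bytes < min_free_bytes:
        alerts.append("absolute_free")
    return alerts


def check_drive(path: str, **thresholds) -> List[str]:
    """Read real usage for `path` (for example "C:\\") and run both checks."""
    usage = shutil.disk_usage(path)
    return check_free_space(usage.total, usage.free, **thresholds)


if __name__ == "__main__":
    # A large drive trips the percentage check long before the absolute one...
    print(check_free_space(total_bytes=100 * 1024 * MB, free_bytes=5 * 1024 * MB))
    # ...while a small drive can pass the percentage check with very little room left.
    print(check_free_space(total_bytes=1024 * MB, free_bytes=150 * MB))
```

Because the thresholds are parameters, a system drive can be given different values than data drives simply by passing different arguments.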
A second aspect of HDD performance is the set of KPIs associated with the time it takes for disk reads and writes. This is commonly described as “average disk seconds per transfer”, although you may see it described in other terms. In this case the hardware that is used greatly influences the correct thresholds for such a KPI, so I cannot make a recommendation here. However, most HDD manufacturers publish appropriate performance figures for their drives. You can usually find this information on the vendor’s website for your specific drives.
The last component of drive monitoring is pure logical drive availability. It seems obvious, but I have seen many monitoring systems that unfortunately ignore it, usually because it is not enabled by default and nobody ever thinks to check. An example would be checking that the C:\, D:\ and E:\ drives (or whatever should exist) are available on a server. This is simple, but it can be a lifesaver when a drive is lost for some reason and you want to be alerted quickly.
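An availability check of this kind can be very small. The Python sketch below simply tests whether each expected drive root exists; the function name missing_drives and the list of expected drives are assumptions for illustration, and a real monitoring agent may well use a different mechanism.

```python
import os
from typing import Callable, Iterable, List

# Drives that should exist on this server (example list, adjust per server).
EXPECTED_DRIVES = ["C:\\", "D:\\", "E:\\"]


def missing_drives(expected: Iterable[str],
                   exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Return the expected drive roots that are currently not available.

    `exists` is injectable so the check can be exercised without real drives.
    """
    return [drive for drive in expected if not exists(drive)]


if __name__ == "__main__":
    lost = missing_drives(EXPECTED_DRIVES)
    if lost:
        print("ALARM: drives not available:", ", ".join(lost))
    else:
        print("All expected drives are available.")
```

The important part is the explicit list of what should exist: the check can only alert on a lost drive if someone has told it that the drive was supposed to be there.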
Summary:
To make sure that your Windows servers are fully operational, there are a few really critical KPIs that I think you should focus on. By eliminating some of the “alert noise”, you can make sure that important alerts are not lost.
Of course, each server also has application and service functions that need to be monitored. We will explore the best practices for server application monitoring in a future blog post.