Statistics Rollups Are Evil

by Bob Plankers on October 18, 2012


It’s pretty common for statistics-gathering software like MRTG, Cacti, VMware vCenter, and so on to roll statistics up over time by averaging them. This helps save space, as well as cut down on the processing needed to look at and graph the data.

The problem is that the process is lossy. These systems save disk, memory, and CPU by averaging the data over longer and longer time periods. Those averages remove spikes and make the data less and less representative of what actually happened on your system or network. They also make the data damn near useless for planning and troubleshooting.

Let’s start with an example I drew up in Excel to simulate something like vCenter recording an application server’s CPU load every 60 seconds for an hour (I’m picking on vCenter, but this could easily be MRTG, Cacti, SolarWinds, System Center, Control Center, the built-in graphs on your storage array, etc.). If we plot the 1-minute data we get:

[Figure: the 1-minute data, plotted]
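The raw numbers behind that chart aren’t in the post, but here’s a minimal Python sketch of the kind of data I mean: sixty 1-minute samples of CPU usage in MHz, mostly a quiet baseline with transient spikes up toward 10000 MHz. The baseline, spike height, and spike probability are made-up values purely for illustration.

```python
import random

random.seed(42)  # fixed seed so the illustration is repeatable

def simulate_cpu_mhz(minutes=60, baseline=2000, peak=10000, spike_chance=0.2):
    """Fake 1-minute CPU samples (MHz): a quiet baseline with transient spikes.

    Every parameter here is an assumption for illustration; real numbers would
    come from your monitoring tool's highest-resolution samples.
    """
    return [
        peak if random.random() < spike_chance
        else random.randint(baseline - 500, baseline + 500)
        for _ in range(minutes)
    ]

one_minute = simulate_cpu_mhz()
print("peak:   ", max(one_minute))                       # hits 10000 during spikes
print("average:", round(sum(one_minute) / len(one_minute)))
```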

It’s spiky, which is plausible for an application server where requests are transient. Now let’s look at the same data after we’ve rolled it up by averaging it over 5-minute intervals:

[Figure: the same data averaged into 5-minute intervals]

Wow, that doesn’t look like the same data at all, does it? Imagine if we carried that forward to the other common intervals, like 15 minutes or 60 minutes. A 60-minute average of my sample data is 4384, yet the 1-minute graph shows spikes up to 10000.
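Here’s a sketch of the rollup itself, again with made-up 1-minute samples so the snippet stands on its own. Averaging every 5 samples, and then all 60, shows how the apparent peak shrinks even though the raw data actually hit 10000 MHz (the exact averages will differ from my Excel example, but the shape of the problem is the same).

```python
import random
from statistics import mean

random.seed(42)
# Made-up 1-minute CPU samples (MHz): quiet baseline with occasional spikes to 10000.
one_minute = [
    10000 if random.random() < 0.2 else random.randint(1500, 2500)
    for _ in range(60)
]

def rollup(samples, window):
    """Average consecutive `window`-sample chunks -- exactly what a lossy rollup does."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]

five_minute = rollup(one_minute, 5)    # the "5-minute" view
sixty_minute = rollup(one_minute, 60)  # the "60-minute" view

print("true 1-minute peak:       ", max(one_minute))
print("peak of 5-minute averages:", round(max(five_minute)))
print("60-minute average:        ", round(sixty_minute[0]))
```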

If we think about this in MHz, as vCenter often does, the 5 minute averages tell us that we don’t need any more CPU than 6000 MHz. The 60 minute average (which we’ll often see if we look at long-term graphs, like a year) says we need 4384 MHz. So we go buy an instance in a public cloud that has 5000 MHz of CPU, and watch as all of our users get really, really upset because the app is really, really slow. After all, we really needed a server with 10000 MHz of CPU. We just couldn’t tell from our “data.”

Another example might be troubleshooting, where someone approaches us about slow performance. If they wait long enough to register the complaint, we won’t be able to tell from the graphs that there was a problem. Perhaps they’re hitting a CPU limit at 10000 MHz, but the graphs only show 6000 MHz peaks because they’ve been averaged out. Based on the faulty data, we conclude that it isn’t a problem and waste a lot of time looking at other stuff. Not cool.

So how do you fix this? You simply gather and keep the highest-resolution statistics you have the storage and processing power for. Some performance monitoring packages let you choose how often to poll and how long to keep data at the various resolutions. Many have pretty serious limits, though, in which case you just have to learn not to trust the long-term data. :(
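As a rough sanity check on the storage side, here’s a back-of-the-envelope calculation in Python. The 8-bytes-per-sample figure is an assumption (one bare 64-bit value, no timestamps or index overhead), so treat the result as a lower bound, but it suggests a year of raw 1-minute samples is only a few megabytes per metric.

```python
# Rough cost of keeping raw 1-minute samples for a full year, per metric.
SAMPLES_PER_YEAR = 60 * 24 * 365   # 525,600 one-minute samples
BYTES_PER_SAMPLE = 8               # assumed: one bare 64-bit value, no overhead

per_metric_mb = SAMPLES_PER_YEAR * BYTES_PER_SAMPLE / (1024 * 1024)
print(f"~{per_metric_mb:.1f} MB per metric per year")   # roughly 4 MB
```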
