Statistics Rollups Are Evil

It’s pretty common for statistics-gathering software, like MRTG, Cacti, VMware vCenter, etc., to roll statistics up over time by averaging them. This saves space and cuts down on the processing needed to view and graph the data.

The problem is that the process is lossy. These systems save disk, memory, and CPU by averaging the data over longer and longer time periods. Those averages remove spikes and make the data less and less representative of what actually happened on your system or network. It also makes it damn near useless for planning and troubleshooting.

Let’s start with an example I drew up in Excel to simulate something like vCenter recording an application server’s CPU load every 60 seconds for an hour (I’m picking on vCenter, but this could easily be MRTG, Cacti, SolarWinds, System Center, Control Center, built-in graphs on your storage array, etc.). If we plot the 1-minute data we get:

[Graph: 1-minute data]

It’s spiky, which is plausible for an application server where requests are transient. Now let’s look at the same data after we’ve rolled it up by averaging it over 5-minute intervals:

[Graph: 5-minute rollup of the same data]

Wow, that doesn’t look like the same data at all, does it? Imagine carrying that forward to the other common intervals, like 15 or 60 minutes. The 60-minute average of my sample data is 4384, yet the 1-minute graph shows spikes up to 10000.

If we think about this in MHz, as vCenter often does, the 5-minute averages tell us that we never need more than 6000 MHz of CPU. The 60-minute average (which we’ll often see if we look at long-term graphs, like a year) says we need 4384 MHz. So we go buy an instance in a public cloud that has 5000 MHz of CPU, and watch as all of our users get really, really upset because the app is really, really slow. After all, we really needed a server with 10000 MHz of CPU. We just couldn’t tell from our “data.”
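To see the same effect without Excel, here’s a minimal Python sketch. The sample data below is made up (it is not the spreadsheet behind the graphs above), but the behavior is the same: each wider averaging window shaves more off the apparent peak, which is exactly the number that matters for sizing.

```python
import random

random.seed(1)

# Hypothetical 1-minute CPU samples (MHz) for one hour: mostly modest load,
# with a handful of transient spikes up toward 10000 MHz.
samples = [random.randint(1500, 4500) for _ in range(60)]
for i in random.sample(range(60), 6):
    samples[i] = random.randint(9000, 10000)

def rollup(data, interval):
    """Average consecutive windows of `interval` samples, like a rollup job would."""
    windows = [data[i:i + interval] for i in range(0, len(data), interval)]
    return [sum(w) / len(w) for w in windows]

five_min = rollup(samples, 5)
one_hour = rollup(samples, 60)

print("1-minute peak :", max(samples))          # the load a user actually hit
print("5-minute peak :", round(max(five_min)))  # already noticeably lower
print("60-minute avg :", round(one_hour[0]))    # nowhere near the real peak
```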

Another example might be troubleshooting, where someone approaches us about slow performance. If they wait long enough to register the complaint, we won’t be able to tell from the graphs that there was a problem. Perhaps they’re hitting a CPU limit at 10000 MHz, but the graphs only show 6000 MHz peaks because they’ve been averaged out. Based on the faulty data we conclude that the CPU isn’t the problem and waste a lot of time looking at other stuff. Not cool.

So how do you fix this? You simply gather and keep the highest-resolution statistics you have storage and processing power for. Some performance monitoring packages let you choose how often to poll and how long to keep data at the various resolutions. Many have pretty serious limits, though, in which case you just have to learn to not trust the long-term data. :(
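Whether you *can* keep high-resolution data is mostly a storage question, and a quick back-of-the-envelope calculation is worth doing before accepting a tool’s default rollups. The 8 bytes per sample below is an assumption (roughly one double-precision value, ignoring any per-tool indexing or metadata overhead), not a figure from any particular product:

```python
# Back-of-the-envelope storage cost of keeping full-resolution samples.
BYTES_PER_SAMPLE = 8            # assumed: one double-precision value per sample
SAMPLES_PER_DAY = 24 * 60       # one sample per minute

def storage_per_metric(days, bytes_per_sample=BYTES_PER_SAMPLE):
    """Rough bytes needed to keep `days` of 1-minute samples for one metric."""
    return days * SAMPLES_PER_DAY * bytes_per_sample

one_year = storage_per_metric(365)
print(f"1 metric, 1 year of 1-minute data: {one_year / 1e6:.1f} MB")
print(f"1000 metrics, 1 year:              {1000 * one_year / 1e9:.1f} GB")
```

Real tools add overhead on top of that, but the order of magnitude suggests that aggressive rollups are often a default setting rather than a hard necessity.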

Comments on this entry are closed.

  • One compromise would be to store extrema along with mean. So for each n-minute interval, you’d store the average, minimum, and maximum value of $metric (some measure of spread would be awesome too, but I won’t be greedy). This reduces the number of data points that have to be stored and processed without giving up important information.

  • Averages are horrible for any data with enough outliers, especially if the outliers are of interest to you (and if you’re a sysadmin, they are). There’s a lot of other stuff you can do with raw data: use a box plot over a time interval, use a few moving averages with different window sizes, or use an average that’s weighted exponentially toward the most recent events… (there’s a rough sketch of that after the comments).

  • I’ve been saying this for years: round-robin databases are for the 1990s. We should instead be using column-store databases like MonetDB to store time-series data about systems.

  • This is an ongoing battle – do we build for peak or average? Is “rush hour” acceptable? We expect traffic to suck during our commutes in the mornings and afternoons because the roads weren’t built to accommodate the peak loads. Yet we grumble but move on because we saved money. Same thing with compute resources – do you want to save money and live with occasional slow periods or do you throw money at the problem? How long of a slowdown are we willing to tolerate?

    Funding is never infinite so we have to cap it somewhere. Sometimes averages are just good enough.

  • As you point out, averages are really only useful for showing trends, but they can be quite useless in troubleshooting specific issues. However, even for trends they can be misleading, as spikes or troughs get more weight in the average – instead, a geometric mean should be used. Graph that alongside the standard deviation to get an idea of the “spikiness” of the data.

    Beyond that, I prefer to look at 95th or 99th percentile values for sizing capacity (there’s a sketch of that after the comments).

    It’s disappointing that so many of these tools roll up averages out of the box, because it makes any further analysis impossible.

  • Wow, does this ever make Munin look good.

    Oh sure, if you *only* look at the graph, you get the same results, but Munin shows extended time periods on different graphs, which if anything actually works to illustrate the problem you talk about here. Every time you look at a graph, you say “oh, those peaks and valleys sure are smoothed on these other graphs, what happened to our data? Oh yeah, it’s averaged”.

    And then it also shows actual peak values over those same time periods.

    As for the “do we throw money at it?” comment above, it all depends on what your customers are paying for, and what will annoy them. If it’s e-mail and incoming mail takes 3 minutes longer to arrive at peak times due to load, it’s not such a big deal. If it’s VoIP and calls start dropping and voice quality gets mangled at peak times due to load, that’s a huge problem! Sometimes it’s better to underutilize a server (or oversubscribe a cloud) if that’s what it takes to keep your customers from hating you and taking their business elsewhere.

    So yeah. Use your own judgement.

  • I agree completely with Ben and have done this myself… keep the min/max values along with the average. Sure, it makes your RRDs (or whatever tech you’re using) 3x larger, but that is far better than trying to keep months of higher-resolution data.

    I used to do many graphs where I would plot the min/max values, shade in between, and then plot the average. While not perfect, seeing where the average fell between those points gave you a good feel for how the rest of the data was behaving.

    • I agree with you guys. The way I see it, you can either have a 3x larger RRD file or just trash the data completely because it’s useless without the min/max.

  • Good post, Bob! A question, though: how does one find the issue if they keep all the data? That would be like looking for a needle in a haystack, so don’t averages help you isolate the areas of concern and then let you dig? I think having the averages + the detailed data points is the best solution. Just my few cents.

    • If you’re troubleshooting, you often know what time a problem occurred because of other clues, like users complaining or a load balancer failing over to backup hosts at a certain time. Being able to look back at that time and see high-resolution data from your network or systems is very useful.

      Averages can serve a purpose and help people to make decisions, but you just have to be aware that they are averages. Many people don’t seem to realize that these monitoring systems take their raw data and make it into other things over time.

  • Your description of the roll-up problem is so much better than my own. I recently tried to address this problem in vCenter through a VMware Fling idea. I’d be interested to hear your feedback on my recommendation…
    http://goo.gl/YodJP or http://openinnovationcontest.vmware.com/ct/ct_a_view_idea.bix?i=3EAA9D49
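A couple of the comments above suggest keeping the minimum and maximum alongside the average for each rolled-up interval. As a rough illustration of that idea (plain Python, not tied to any particular monitoring tool or RRD format), the rollup might look like this:

```python
def rollup_with_extremes(samples, interval):
    """Roll raw samples into min/mean/max per interval so spikes survive
    in long-term data, at roughly 3x the storage of a plain average."""
    buckets = []
    for i in range(0, len(samples), interval):
        window = samples[i:i + interval]
        buckets.append({
            "min": min(window),
            "mean": sum(window) / len(window),
            "max": max(window),
        })
    return buckets

# Hypothetical 1-minute CPU samples (MHz) with a couple of spikes.
samples = [2000, 2400, 9800, 2600, 2300, 10000, 1900, 2500, 9500, 2200]
for bucket in rollup_with_extremes(samples, 5):
    print(bucket)   # the max field still shows the ~10000 MHz peaks
```

Plotting the min/max band and drawing the average through it, as described above, recovers much of the feel of the raw data at a fraction of the storage.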
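The percentile-based sizing and exponentially weighted moving average mentioned in the thread are also only a few lines each. This is a hypothetical sketch over raw samples, not a built-in feature of any of the tools discussed:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=95 for 95th-percentile sizing."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average: recent samples count more."""
    out, current = [], values[0]
    for v in values:
        current = alpha * v + (1 - alpha) * current
        out.append(current)
    return out

samples = [2000, 2400, 9800, 2600, 2300, 10000, 1900, 2500, 9500, 2200]
print("mean           :", sum(samples) / len(samples))   # misleadingly low
print("95th percentile:", percentile(samples, 95))        # close to the true peak
print("EWMA (last)    :", round(ewma(samples)[-1], 1))
```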
