Do Not Collect System Performance Data From Guest OSes

This is post #12 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

Fans of Doctor Who have often heard the phrase “the Doctor lies.” The explanation for his lies is that, because he skips around in time, he knows things that others cannot know yet. Hypervisors are like that, too. Guest OSes don’t know that they aren’t the only OS on the hardware, and the hypervisor lies to them about CPUs, RAM, and system timers because, like the Doctor, the hypervisor is skipping a VM forward in time. And that’s the rub: only the hypervisor knows what the truth is.

Many traditional performance monitoring systems involve installing an agent on the guest OS, which then monitors OS metrics like CPU utilization and RAM usage. Because the hypervisor lies about execution times, RAM allocations, and the like, the metrics gathered inside the guest OS are inaccurate. The hypervisor keeps similar statistics, though, and because it knows the truth, those stats are correct. For example, a guest OS might report that something is using 100% of a CPU. That doesn’t mean that it is using 100% of a real CPU. Examining the performance data from the hypervisor might show that there is contention on the parent host, or that a CPU limit is in place for the VM.
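To make the gap between the guest's view and reality a little more concrete, here is a minimal sketch, not something from the original post. It assumes a Linux guest whose hypervisor reports "steal" time in /proc/stat; some hypervisors leave that column at zero, in which case the guest has no hint at all. The function name steal_pct is made up for illustration. It takes two samples of the aggregate "cpu" line and prints the share of elapsed CPU time the guest was told it did not get.

```shell
#!/bin/sh
# steal_pct (hypothetical helper): given two /proc/stat-style aggregate
# "cpu" lines, print the percentage of elapsed CPU time reported as steal.
# Columns after the "cpu" label are:
#   user nice system idle iowait irq softirq steal guest guest_nice
# guest and guest_nice are already counted in user and nice, so only
# user..steal (awk fields 2..9) are summed for the total.
steal_pct() {
    # $1 = earlier sample line, $2 = later sample line
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total      # total CPU time in this sample
            s[NR] = $9         # steal column in this sample
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (s[2] - s[1]) / dt
        }'
}
```

On a live guest the two samples could be taken a few seconds apart, for example: a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat); steal_pct "$a" "$b". Even a nonzero result is still only what the hypervisor chose to tell the guest, which is why the hypervisor's own counters remain the reference.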

What difference does this make?

Accurate statistics are important for system troubleshooting and sizing. Using the wrong information will lead you to make bad decisions. Furthermore, the collection of statistics isn’t free. It takes CPU, RAM, disk, and network resources to collect and process that performance information. Why do it if it’s going to be wrong? You might also save some money on licensing, depending on how your guest OS agents are licensed.

So what do I do?

Remove any agents you have running on VMs that collect system performance data, or disable their ability to collect system performance stats. Gather that data directly from the hypervisor instead. If you use system tools like sysstat you may wish to comment out the cron entries in /etc/cron.d/sysstat:

# Run system activity accounting tool every 10 minutes
##*/10 * * * * root /usr/lib64/sa/sa1 -S DISK 1 1
# 0 * * * * root /usr/lib64/sa/sa1 -S DISK 600 6 &
# Generate a daily summary of process accounting at 23:53
##53 23 * * * root /usr/lib64/sa/sa2 -A

and disable the system service:

$ sudo chkconfig sysstat off
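If you have many VMs to touch, the cron edit above can be scripted. The following is a minimal sketch under a few assumptions: the function name disable_sysstat_cron is made up for illustration, the entries to disable are the ones that call sa1 or sa2 from a directory named sa, and your sed supports -E (GNU and BSD sed both do). It keeps a backup copy next to the file and leaves lines that are already commented out alone.

```shell
#!/bin/sh
# disable_sysstat_cron FILE (hypothetical helper): comment out active
# sa1/sa2 cron entries in FILE, keeping the original as FILE.bak.
disable_sysstat_cron() {
    cron_file=$1
    # Keep a backup with the original permissions and timestamps.
    cp -p "$cron_file" "$cron_file.bak" || return 1
    # Prefix '#' to lines whose first non-blank character is not '#'
    # and that invoke .../sa/sa1 or .../sa/sa2; everything else is
    # copied through unchanged.
    sed -E 's|^([[:space:]]*[^#[:space:]].*/sa/sa[12]( .*)?)$|#\1|' \
        "$cron_file.bak" > "$cron_file"
}
```

Run as root, it would be called as disable_sysstat_cron /etc/cron.d/sysstat. Try it on a copy of the file first, since package layouts and paths differ between distributions.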

Comments on this entry are closed.

  • Love the analogy using Dr Who – it’s quality

  • Does Linux memory utilization get reflected properly from the hypervisor with paging concerns? Windows doesn’t match OS vs. hypervisor.

  • Realistically, you need to look from both sides to get the full picture.

    Just looking at the hypervisor, you will miss a ton of detail.

  • This is good to think about, especially the implications of agents in your guests all doing something at the same time. It is also a good idea to check that all the stats you need are actually collected by the hypervisor. Detailed information, such as what memory is used for (buffers vs. apps) or which application or partition on the system disk I/O is coming from, may not be there. There is a robust marketplace for tools that gather stats from the OS and applications, while performance stats recorded by the hypervisor, though more accurate, are likely to be tied to the vendor’s management console and ecosystem.

    Undeniably, if you need accurate information, the hypervisor is the only place where that exists, but some things are useful to see from the guest’s perspective as well, and the guest is the only place where detail exists.

  • While the stats inside the virtual machine don’t know the full truth, the guest OS can only react (moving memory pages to swap, falling over, etc.) to the information given to it by the hypervisor.

    The important bit that I got from this post was “have I, or any of the unix admins in the team, ever looked at the stats gathered from /etc/cron.d/sysstat for troubleshooting?” And I think that is a no.
    Thanks and I’ll be sure to pass on the entire series to our unix group for them to look over. This is fantastic. ;)