One of my favorite places to glean tuning tips, especially obscure ones, is SPECint and VMmark notes and disclosures. I have a number of Dell R900s in my virtual infrastructure, so I perused the Dell R900 VMmark results the other day. The notes show that the testers had disabled the Hardware Prefetcher and the Adjacent Sector Prefetch in the BIOS.
I didn’t know much about the Hardware Prefetcher or the Adjacent Sector Prefetch, so I started poking around. Dell doesn’t have much information about these features, but IBM does, and since they’re Intel processor features the descriptions should be mostly accurate across vendors. The IBM xSeries 366 tuning tips have a nice blurb about the hardware prefetcher:
By default, hardware prefetching is enabled on the x366 processors which enables the processors to prefetch extra cache lines for every memory request. Recent tests in the performance lab have shown that you will get the best performance for most commercial application types if you disable this feature. The performance gain can be as much as 20% depending on the application. To disable prefetch, go to BIOS Setup (press F1 when prompted at boot) and select Advanced Settings -> CPU and set HW Prefetch to Disabled. For high-performance computing (HPC) applications, we recommend you leave HW Prefetch enabled. Future releases of BIOS that ship to enable dual-core will have HW Prefetch disabled by default.
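If you want to double-check from the OS that the BIOS change actually stuck, here’s a minimal sketch of my own (not from IBM’s docs), assuming a Linux host with the msr kernel module loaded (modprobe msr) and root access. It reads IA32_MISC_ENABLE (MSR 0x1A0), where, on the Core-microarchitecture Xeons in these boxes, bit 9 is documented as the hardware prefetcher disable and bit 19 as the adjacent cache line prefetch disable. Both bits are model-specific, so verify them against Intel’s documentation for your exact CPU:

```python
# Sketch: read IA32_MISC_ENABLE via /dev/cpu/N/msr to check the two
# prefetch settings. Assumes Linux, the msr kernel module, and root.
# Bit positions (9 and 19) are model-specific to Core-microarchitecture
# Xeons -- confirm them for your CPU before trusting the output.
import struct

IA32_MISC_ENABLE = 0x1A0

def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR value for one logical CPU."""
    with open(f"/dev/cpu/{cpu}/msr", "rb") as f:
        f.seek(reg)                 # the MSR address is the file offset
        return struct.unpack("<Q", f.read(8))[0]

val = read_msr(0, IA32_MISC_ENABLE)
print("hardware prefetcher:     ", "disabled" if val & (1 << 9) else "enabled")
print("adjacent sector prefetch:", "disabled" if val & (1 << 19) else "enabled")
```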
Tom’s Hardware has an excellent explanation for this in their Intel Nehalem architecture article:
With the Conroe architecture, Intel was especially proud of its hardware prefetchers. As you know, a prefetch is a mechanism that observes memory access patterns and tries to anticipate which data will be needed several cycles in advance. The point is to return the data to the cache, where it will be more readily accessible to the processor while trying to maximize bandwidth by using it when the processor doesn’t need it.
This technique produced remarkable results with most desktop applications, but in the server world the result was often a loss of performance. There are many reasons for that inefficiency. First of all, memory accesses are often much less easy to predict with server applications. Database accesses, for example, aren’t linear—when an item of data is accessed in memory, the adjacent data won’t necessarily be called on next. That limits the prefetcher’s effectiveness. But the main problem was with memory bandwidth in multi-socket configurations. As we said earlier, there was already a bottleneck between processors, but in addition, the prefetchers added additional pressure at this level. When a microprocessor wasn’t accessing memory, the prefetchers kicked in to use bandwidth they assumed was available. They had no way of knowing at that precise point that the other processor might need the bandwidth. That meant the prefetchers could deprive a processor of bandwidth that was already at a premium in this kind of configuration. To solve the problem, Intel had no better solution to offer than to disable the prefetchers in these situations—hardly a satisfactory answer.
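To make the non-linear access point concrete, here’s a rough, hypothetical illustration (mine, not from the article): summing the same array through sequential indices and through shuffled, database-like indices. The shuffled pass defeats cache-line prefetching, so it runs noticeably slower even though it touches exactly the same data:

```python
# Rough illustration of prefetcher-friendly vs. prefetcher-hostile access.
# Absolute timings are dominated by Python/NumPy overhead, but the gap
# between the two passes comes largely from cache and prefetch behavior.
import time
import numpy as np

N = 20_000_000                      # ~160 MB of float64, far bigger than cache
data = np.random.rand(N)
seq_idx = np.arange(N)              # linear walk: easy to prefetch
rnd_idx = np.random.permutation(N)  # same indices in random order

for name, idx in (("sequential", seq_idx), ("random", rnd_idx)):
    start = time.perf_counter()
    total = data[idx].sum()         # gather, then reduce
    print(f"{name:>10}: {time.perf_counter() - start:.2f}s  (sum={total:.1f})")
```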
There is also a good description of the adjacent sector prefetch and the processor hardware prefetcher at the end of this VMware Communities post, and it correlates nicely with the information above.
It’s not hard to imagine that VMware ESX workloads would really compete with the prefetcher, since VMs execute on different cores all over the place and beat up on RAM in a non-linear fashion. As such, on my R900s I’m disabling the hardware prefetcher but leaving the adjacent sector prefetch enabled.
It does look like the Intel 5500-series (Nehalem) CPUs fix this problem, though, so it isn’t something you’d need to tune on the Dell R610/R710. The Dell R710 VMmark disclosure reflects that, as there is no mention of prefetcher changes. For now, I’m just hoping I can squeeze a few extra percentage points of performance out of the machines I already have. 🙂
I’ve got an R900 as well but haven’t changed it from the defaults. I suppose this is only an issue if you’re maxing out your FSB? It would be interesting to see more details on the performance increase this can yield. Is this a 0.5% type thing or a 10% type thing?
Hmmm.
IBM was claiming a gain of as much as 20% for commercial application workloads. I’m guessing the drag on performance scales with the load on the machine. The FSBs aren’t super fast on these boxes anyhow, nothing like the memory bandwidth of the new Nehalem machines.
As of this morning I’ve disabled the hardware prefetcher on all of my R900s, so we’ll see how it works. No problems so far.
FYI: yes, the Nehalem prefetcher algorithm apparently now recognizes wasted bandwidth from overly aggressive prefetching; see page 75:
http://www.scribd.com/doc/15507330/Intel-Nehalem-Core-Architecture-