AMD & Linux Data Corruption

Mad props to Don MacAskill for getting the word out that AMD-based machines with more than 4 GB of RAM running Linux may be subject to a silent data corruption problem, mainly on machines with NVidia chipsets. It is fixed in kernel 2.6.21, but the fix is not yet in a shipping Red Hat kernel. The workaround if you find yourself in this position is to tell the kernel to ignore the hardware IOMMU with the kernel option “iommu=soft”, or to build yourself a kernel that doesn’t have the problem.
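If you want a quick way to see whether a box matches that description, here is a rough sketch in Python. It is only a heuristic under my own assumptions: the function names are made up for illustration, it reads the standard /proc/cpuinfo, /proc/meminfo and /proc/cmdline files, and it trusts the version string, so a vendor kernel with the fix backported would still be flagged. It also does not look at the chipset at all.

```python
import platform
import re

FOUR_GB_IN_KB = 4 * 1024 * 1024  # MemTotal in /proc/meminfo is reported in kB


def parse_kernel_version(release):
    """Return (major, minor, patch) from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())


def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def is_amd(cpuinfo_text):
    """True if /proc/cpuinfo contents report an AMD processor."""
    return "AuthenticAMD" in cpuinfo_text


def has_soft_iommu(cmdline_text):
    """True if the kernel was booted with the iommu=soft workaround."""
    return "iommu=soft" in cmdline_text.split()


def may_be_affected(release, meminfo_text, cpuinfo_text, cmdline_text):
    """Heuristic only: AMD CPU, more than 4 GB of RAM, a kernel older
    than 2.6.21, and no iommu=soft on the command line."""
    return (
        is_amd(cpuinfo_text)
        and mem_total_kb(meminfo_text) > FOUR_GB_IN_KB
        and parse_kernel_version(release) < (2, 6, 21)
        and not has_soft_iommu(cmdline_text)
    )


def read_text(path):
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    try:
        flagged = may_be_affected(
            platform.release(),
            read_text("/proc/meminfo"),
            read_text("/proc/cpuinfo"),
            read_text("/proc/cmdline"),
        )
    except (OSError, ValueError) as error:
        print("could not check this machine: %s" % error)
    else:
        print("may be affected" if flagged else "probably not affected")
```

A "may be affected" answer just means the machine deserves a closer look, not that data has actually been corrupted.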

This points to a bigger problem with things like Red Hat’s Kernel Application Binary Interface (kABI) compatibility guarantee: agility. kABI compatibility sounds great to developers, but it significantly increases the response time to problems like this. With Red Hat Enterprise Linux they state that they won’t make changes unless those changes address a demonstrated issue encountered by customers, preserve ABI/API compatibility, and are minor feature enhancements. That second point is the killer. They can’t just go wildly patching the kernel, because every patch needs to be examined and carefully merged to guarantee no ABI/API changes. This gets quite tricky when the patches need to be backported, such as when the original LKML patch is against 2.6.21. Red Hat Enterprise Linux 5 is at 2.6.18, and even worse, RHEL 4 is at 2.6.9. Every new upstream kernel release widens the gap and makes backporting harder.

So a customer needs to report a bug. Red Hat will then find a patch, backport it into their kernels, and send it to QA. This whole process can take months of work by people with intricate knowledge of the kernel. Tough job, for sure, especially when you have customers suffering and probably complaining, while developers and vendors will complain if you change anything. Rock, meet hard place.

At least this bug has a workaround. 🙂