Playing Mastermind With My RAM

I have a Dell PowerEdge R610 in one of my VMware vSphere clusters that has been reporting memory errors. In fact, the machine wouldn’t boot, and the front panel suggested I reseat all the RAM. Okay…

0. Reseat all the RAM. Didn’t work, as expected.

1. Pull all twelve DIMMs out, put four back in. That worked, machine comes up.

2. Put four more DIMMs back. That worked, machine comes up.

3. Put last four DIMMs in. Machine doesn’t boot, same original error.

4. Pull last set of DIMMs out. Boot machine. Notice that BIOS is really old. Upgrade BIOS, thinking this is some stupid BIOS bug. Machine continues to boot.

5. Put last four DIMMs back in. New BIOS actually tells me which DIMMs are bad. Nice, except it says that A1 and A4 are bad. Two DIMMs? Yeah, not likely.

6. Order single replacement DIMM from Dell, decide to play Mastermind with RAM.

7. Replace DIMM A1. Machine switches to saying DIMMs B3 and B5 are bad. Really? The B-bank DIMMs are on the other CPU.

8. Stifle disbelief, take loose DIMM from A1 and replace B3.

9. Machine switches to saying DIMM B5 is bad.

10. Take loose DIMM from B3 and replace B5. Machine likes that, has all of its RAM again, and I probably have the offending DIMM out now. Probably.
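
The group-at-a-time approach in the steps above can be sketched in Python. This is a simulation only, not anything Dell or VMware provides: `boots` is a hypothetical stand-in for physically powering on the machine, the DIMM labels are made up, and the sketch assumes exactly one bad DIMM that fails every boot, which, as the steps show, real hardware does not always honor.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical stand-in for "power on and see if it POSTs".
BootTest = Callable[[List[str]], bool]


def find_bad_group(
    dimms: Sequence[str], boots: BootTest, group_size: int = 4
) -> Tuple[List[str], List[str]]:
    """Add DIMMs back one group at a time.

    Returns (known_good, suspects): the DIMMs installed before the first
    failed boot, and the group whose addition caused the failure.
    suspects is empty if every boot succeeded.
    """
    installed: List[str] = []
    for start in range(0, len(dimms), group_size):
        group = list(dimms[start:start + group_size])
        if not boots(installed + group):
            return installed, group
        installed.extend(group)
    return installed, []


def find_bad_dimm(
    known_good: Sequence[str], suspects: Sequence[str], boots: BootTest
) -> Optional[str]:
    """Test each suspect alone alongside the known-good set."""
    for dimm in suspects:
        if not boots(list(known_good) + [dimm]):
            return dimm
    return None


if __name__ == "__main__":
    # Simulated machine: twelve DIMMs with made-up labels, one of them faulty.
    all_dimms = [f"dimm{i}" for i in range(1, 13)]
    faulty = {"dimm10"}

    def simulated_boot(installed: List[str]) -> bool:
        return not (faulty & set(installed))

    good, suspects = find_bad_group(all_dimms, simulated_boot)
    print(find_bad_dimm(good, suspects, simulated_boot))  # prints "dimm10"
```

With twelve DIMMs in groups of four, that is at most three boots to find the bad group and four more to find the bad DIMM, instead of up to twelve one-at-a-time tests.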

Lessons here: A) physical hardware sucks. B) linear troubleshooting rules. C) keep your firmware up to date.

Comments on this entry are closed.

  • times like this I think to myself “c’mon cloud computing hurry up already I’m tired of this crap”

  • Hi,

    Did you check OpenManage for memory-related errors?

    Did you have an orange alert message saying something like “E2111 SBE log disable DIMM X” on the front panel?

    Thank you

  • We don’t run the OpenManage software on our VMware ESX hosts, but we do use IPMI via the BMC, and yes, it did register a memory error.

    The front panel error was not the one you described; it was something along the lines of “Memory not detected. Please reseat DIMMs.” (Sorry, I don’t have the exact error, as I don’t have my notes with me right now.)

    It’s okay, though — the errors have come back so I’m going to get another shot at this.
