Rube Goldberg Lives in My Machines, Part 1

I really feel like I’m pulling explanations out of my ass lately. You know what I mean? It’s like I’m inventing a damn Rube Goldberg machine in my head to explain the weird stuff at work.

“Hey Bob, the patches that you applied last week to our machine are causing serious I/O problems. We need them off of there ASAP.”

“Really? We’re running those same things on identical machines, and all manner of different machines, and all of those work really well.”

“Huh. Things are really messed up. What are you going to do?”

“I’ll get one of my guys to look at it. Hang on.”

“We really need those patches reverted.”

“Yeah. Let me make sure it’s the patches before we start dicking around.”

This machine runs the world’s largest MRTG implementation. Rather than split the workload across multiple machines, it’s an 8-way, fibre-channel-connected monolithic badass. The guys who wrote the monitoring system around MRTG don’t believe in directories, so all of the RRD files for 500,000 network ports sit in one freaking directory. They also don’t believe me that directory lookups are O(N), but hey, their mess meant a pair of 8-way machines for us to play with. Anyhow, on Thursday I’d put Red Hat Enterprise Linux AS 3 Update 6 on it, updated the EMC PowerPath software, and flashed the QLogic QLA2342 firmware to 1.47. I did the same thing to its identical sibling, another behemoth with a different workload, and that machine was fine.
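For the record, here’s the argument I keep losing, reduced to a toy sketch. It models a non-indexed directory as a flat list of entries, which is roughly how classic ext3 without dir_index searches a directory, versus a hashed directory. The file names are made up, and this is a model of the data structure, not a filesystem benchmark:

```python
#!/usr/bin/env python
"""Toy model of why 500,000 RRD files in one directory hurts: a
non-indexed directory is searched linearly, like a list; a hashed
directory (dir_index and friends) behaves more like a dict.
Illustrative only, not a filesystem benchmark."""
import timeit

NUM_FILES = 500000
names = ["port-%06d.rrd" % i for i in range(NUM_FILES)]

flat_dir = list(names)   # linear scan per lookup: O(N)
hashed_dir = set(names)  # hashed lookup: roughly O(1)

target = "port-499999.rrd"  # worst case: the last entry in the directory

linear = timeit.timeit(lambda: target in flat_dir, number=10)
hashed = timeit.timeit(lambda: target in hashed_dir, number=10)

print("10 linear lookups: %.4f s" % linear)
print("10 hashed lookups: %.6f s" % hashed)
```

Ten worst-case lookups against the list take measurably long; against the set they’re effectively instant. Now multiply by every RRD update across 500,000 ports.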

One of my team members took a look at it. I was the guy who patched and rebooted it the other day, but I tried to stay out of it because I wanted him to find the problem on his own. It’s really hard for me not to fire up iostat, vmstat, and top and just get a feel for the machine, though, so I did. My god, this monster was doing 4 MB/sec in bursts every three seconds. I’ve watched it do 200 MB/sec constantly when we were testing it. My coworker did some quality linear troubleshooting, backing down to an older version of PowerPath first, because EMC software is notoriously unstable. When that didn’t fix it, he backed down to an older kernel. That didn’t do it, either. Call it a gut feeling, but I doubted it was the fibre channel card’s firmware. So now what?
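If you’ve never peeked under iostat’s hood, the numbers it prints come straight from the kernel’s per-disk counters. Here’s a minimal sketch of the same idea, assuming a 2.6-style /proc/diskstats; the RHEL 3 kernel in this story actually predates that file, so treat this as an illustration of the technique, not the exact tool I ran:

```python
#!/usr/bin/env python
"""Poor man's iostat: sample /proc/diskstats twice and print MB/sec.

Assumes the 2.6-and-later /proc/diskstats layout, where the fields
after the device name are I/O counters and sectors are 512 bytes."""
import time

SECTOR_BYTES = 512
INTERVAL = 3.0  # the bursts in this story came every three seconds

def read_sectors():
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 10:  # skip short (old partition-style) lines
                continue
            # fields[2] is the device name, fields[5] is sectors read,
            # fields[9] is sectors written
            stats[fields[2]] = (int(fields[5]), int(fields[9]))
    return stats

before = read_sectors()
time.sleep(INTERVAL)
after = read_sectors()

for name in sorted(after):
    if name not in before:
        continue
    rd = (after[name][0] - before[name][0]) * SECTOR_BYTES / INTERVAL / 1e6
    wr = (after[name][1] - before[name][1]) * SECTOR_BYTES / INTERVAL / 1e6
    if rd or wr:
        print("%-10s read %8.2f MB/s  write %8.2f MB/s" % (name, rd, wr))
```

A healthy run on this box looked like a steady 200 MB/sec; what we were getting instead was a 4 MB/sec burp once per sample.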

Being the work-loving knob I am, I went home and watched the mofo for a while. Eventually my coworker showed up on IM and concluded that the machine was hosed:

“Dude, I’m cancelling all the upgrades I’m doing because the software is hosed. Is that okay?”

“Um, no – all the other identical boxes are fine. This is total B.S. I think it’s the storage array. Can you log in and look?”

“You think so? Sure, I can look. What are we looking for?”

“Is the mirror to the DR site still active? Is it syncing or synced? Actually, screw it, break the mirror. I want to start eliminating possible causes.”

Two minutes later the problem went away.

“Hey, did you break the mirror?”

“No, I haven’t even logged in all the way yet.”

“WTF, WTF, WTF! The thing just unloaded, went like a bat out of hell, and is normal now.”

“What did you do?”

“NOTHING.”

“Okay, I’ll look at the array logs to see what happened…. Oh, they don’t say anything.”

EMC’s MirrorView is a pile of crap. We use it to mirror to our DR site. Look at it funny and the mirror breaks. If the array burps, the mirrors break. If the SAN does anything remotely interesting, like a topology change, the mirrors break. Hey, we’re just lucky they fixed the bugs where MirrorView would dump and crash the storage processors, too. This fragile piece of software also gets slow whenever it has any real work to do. And because the mirroring is synchronous, a slow mirror means slow host I/O (there’s a back-of-the-envelope sketch of this after the quote below), which is why I wanted the mirror broken. But the array, almost magically, read my mind and fixed itself. Or something. I don’t know anymore. Tomorrow I have to come up with an explanation for this, and I don’t think I can say “the mofo was taunting me” and remain credible:

“The I/O from MRTG catching up after Thursday’s outage overwhelmed the mirroring software. The mirror fell behind after the reboot and only cleared its backlog today, coincidentally while we were watching, and luckily after we’d made changes that ruled out the hardware and OS. Reverting the updates changed nothing, so it is quite unlikely that the host OS or hardware is at fault. The workload is simply at the edge of what the storage system can handle.”
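At least the arithmetic backs that story up. Here’s a toy model of why a backlogged synchronous mirror strangles host I/O: the array can’t acknowledge a write until the DR copy has it, so every millisecond of mirror delay lands directly on the host. Every number below is hypothetical, purely for illustration, not a measurement from our array:

```python
#!/usr/bin/env python
"""Toy model: synchronous mirroring couples host write latency to the
mirror. All latencies are hypothetical, purely for illustration."""

LOCAL_WRITE_MS = 0.5    # hypothetical: land the write in array cache
WAN_RTT_MS = 10.0       # hypothetical: round trip to the DR site
REMOTE_WRITE_MS = 0.5   # hypothetical: commit on the remote array

def write_latency_ms(mirror_backlog_ms):
    # Synchronous mirroring: the host ack waits for the full remote
    # round trip, plus any queueing delay a backlogged mirror adds.
    return LOCAL_WRITE_MS + WAN_RTT_MS + REMOTE_WRITE_MS + mirror_backlog_ms

IO_SIZE_MB = 64.0 / 1024.0  # one outstanding 64 KB write at a time

for backlog_ms in (0.0, 50.0, 200.0):
    latency_ms = write_latency_ms(backlog_ms)
    # With one write in flight, throughput is just size over latency.
    mb_per_sec = IO_SIZE_MB / (latency_ms / 1000.0)
    print("mirror backlog %6.1f ms -> write latency %6.1f ms, ~%5.2f MB/s"
          % (backlog_ms, latency_ms, mb_per_sec))
```

Single-digit MB/sec with a healthy mirror and one write in flight, a few hundred KB/sec when the mirror is backlogged. The host never got slower hardware; it just inherited the mirror’s latency.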

…and the ball rolls into the basket, which strikes a match, lights a candle, burns a rope which opens a trapdoor, dropping bacon in a frying pan, tripping a switch to start the stove, and I have breakfast. WTFTF.