Fun with ipcrm

Speaking of screwing up, I was reminiscing about some of the other screwups I’ve made.

We were having trouble with a server that appeared to have a memory leak. The major application on it was an Oracle database, and I had been reading something about ipcs and ipcrm and how you should check for orphaned shared memory segments and delete them. It sounds dumb now, but at the time it seemed compelling. Not wanting to do anything rash, I checked AIX’s man page for ipcrm, which indicated that ipcrm, run with certain flags, wouldn’t delete a shared memory segment that was still in use. Cool.

I ran the command, and it appeared to do nothing. I figured it was a dead end.

What the man page didn’t tell me, and what I didn’t notice, was that the command changed the permissions on all shared memory segments to 000. As promised, it didn’t remove them, but it did make them unusable. Um, thanks.
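In hindsight, a quick look at the segment permissions right after running the command would have caught this. Here is a minimal sketch in Python of that kind of check. It assumes the column layout that Linux's `ipcs -m` prints (key, shmid, owner, perms, bytes, nattch); AIX's ipcs lays its output out differently, so the parsing would need adjusting there. The function names and the sample text are made up for illustration.

```python
def parse_ipcs_shm(text):
    """Parse Linux-style `ipcs -m` text into a list of dicts, one per segment.

    Assumed columns: key shmid owner perms bytes nattch [status].
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a hex key such as 0x00000000; skip headers and blanks.
        if len(fields) < 6 or not fields[0].startswith("0x"):
            continue
        segments.append({
            "shmid": int(fields[1]),
            "owner": fields[2],
            "perms": int(fields[3], 8),  # the perms column is octal
            "bytes": int(fields[4]),
            "nattch": int(fields[5]),
        })
    return segments


def unusable_segments(segments):
    """Return segments whose mode is 000: still allocated, but nobody can attach."""
    return [s for s in segments if s["perms"] == 0]


# Hypothetical sample output, not taken from a real system.
SAMPLE = """\
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 65536      oracle     600        4096       2
0x0052e2c1 98305      oracle     0          8192       5
"""

if __name__ == "__main__":
    for seg in unusable_segments(parse_ipcs_shm(SAMPLE)):
        print("segment", seg["shmid"], "has mode 000")
```

Running the same check before and after a risky command, and comparing the results, is the cheap sanity test I skipped.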

About 30 minutes later a DBA called me, and we pieced together what had happened. I was honest about what I had done, which helped us get things fixed quickly[0]. It led to a bug report for the man page, since I considered the undocumented permission change an error of omission (it was fixed, BTW). It also led to better monitoring of our databases. We had been monitoring the database by checking that its processes were running, but when I changed the permissions everything just stopped. The processes were still in memory, just not doing anything, which was a failure mode we’d never seen.
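The monitoring lesson generalizes: a check that only asks "is the process there?" can't see a process that is present but stuck. A rough sketch of the alternative, in Python, is to make the monitor do a small piece of real work and treat a hang as a failure. The `probe` callable here is a placeholder for whatever "real work" means in your environment (for a database, presumably a trivial query); the function name and timeout are my own invention, not what we actually deployed.

```python
import threading


def probe_with_timeout(probe, timeout_s=5.0):
    """Return True only if probe() finishes without raising within timeout_s.

    A probe that hangs (the failure mode in this story) counts as a failure.
    """
    result = {"ok": False}

    def run():
        try:
            probe()
            result["ok"] = True
        except Exception:
            result["ok"] = False

    # Daemon thread, so a probe that never returns can't keep the monitor alive.
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(timeout_s)
    return (not worker.is_alive()) and result["ok"]
```

A process-list check would have passed the whole time the database sat idle; a probe like this would have failed as soon as the first timeout expired.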

It still comes up sometimes in meetings with those DBAs, usually when someone jokes about it.

[0] Honesty is always the best policy, even if you screwed up big time. However, many organizations discourage honesty by making the penalties severe, like instant termination, regardless of your history. What’s better: honesty and learning from your mistakes, or making people gun-shy, scared to do anything, and dishonest when they do act, hiding their changes?