Some of my favorite system outages are denial-of-service attacks brought on by coders who code as if nothing will ever go wrong. For instance, take the following section of pseudocode:
foreach $email (@giant_list_of_customer_email_addrs) {
    @customer_info = get_database_info_for_customer($email);
    if (!@customer_info) {    # lookup returned nothing
        send_error_email_to_admins($email);
    } else {
        send_customer_email(@customer_info);
        undef(@customer_info);
    }
}
When get_database_info_for_customer() fails (such as when the database is down for maintenance), someone gets an email for every failure. That's merely annoying when @giant_list_of_customer_email_addrs is 50 people, but when it's 200,000 people it's a big problem. First, you get hundreds of copies of sendmail running (or whatever the mailer function uses; with a coder this lazy it's usually something wildly inefficient). Second, your local SMTP server gets overwhelmed, and its spam and antivirus scanners start melting down. Third, your mail spool fills, which causes other problems, like bounce messages that exacerbate the situation. And now, instead of just having your database server down for maintenance and a scheduled script that couldn't run, you have some of your other systems down as part of an incident too. Whoops.
Never send error email in a loop. If you need to notify someone that a script had a failure, do it outside the loop: push each error onto a list as you hit it, then include the whole list in a single message at the end (see the sketch below). Certainly, in this case, detecting a database problem and bailing out early would have helped, too. But with all the things that can go wrong with databases, customer information, and program input in general, it's still worthwhile to avoid the inadvertent DoS by moving error notification outside the loop. At least that way you only risk sending a single, potentially giant, email. 🙂
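Here's a minimal sketch of that pattern in Perl. The helpers are the same hypothetical ones from the pseudocode above, plus an invented get_customer_list(), and I'm assuming send_error_email_to_admins() can take a message body rather than a single address:

use strict;
use warnings;

my @giant_list_of_customer_email_addrs = get_customer_list();  # hypothetical loader
my @errors;            # failed lookups collect here instead of paging anyone
my $consecutive = 0;

foreach my $email (@giant_list_of_customer_email_addrs) {
    my @customer_info = get_database_info_for_customer($email);
    if (!@customer_info) {
        push @errors, $email;    # remember the failure; don't mail yet
        # Crude circuit breaker: 20 failures in a row almost certainly
        # means the database itself is down, so stop hammering it.
        last if ++$consecutive >= 20;
        next;
    }
    $consecutive = 0;
    send_customer_email(@customer_info);
}

# One summary email after the loop, no matter how many lookups failed.
if (@errors) {
    send_error_email_to_admins(
        "Lookup failed for " . scalar(@errors) . " addresses:\n"
        . join("\n", @errors)
    );
}

Worst case, that's one delivery attempt and one (admittedly giant) message, not 200,000 of them.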
Some of the purpose-built monitoring software solutions are smart about this, though, thank god. They can combine multiple alerts into one, send an alert at most every x minutes while the condition persists, and set up alert dependencies (if the device doesn't respond to a ping, don't tell me about every single service that's down on it).
I'm a fan of using monitoring software rather than rolling your own, though the catch is that it's very difficult (if not impossible) to bring everything into one single smart system. I'm a big fan of PRTG for Windows monitoring, for what it's worth. If you're a Linux shop you'll probably roll your own anyway…
Dan, I agree — a purpose-built tool is often better about stuff like that than a roll-your-own script, if only because the people writing it can think of stuff like that and handle it. We use Nagios for our monitoring, and we’ve also got dependencies set up so if the monitoring system can’t ping across a router it won’t tell us everything is down, either (well, it won’t be able to, but at least it won’t kill itself trying).
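For anyone wiring that up themselves, the usual Nagios mechanism is the parents directive on the host definition: hosts behind a dead parent get flagged UNREACHABLE instead of DOWN, and typical notification settings don't page on that. A rough sketch, with made-up host names and addresses:

define host {
    use        generic-host       ; assumes a generic-host template exists
    host_name  branch-router
    address    192.0.2.1
}

define host {
    use        generic-host
    host_name  branch-web01
    address    192.0.2.10
    parents    branch-router      ; router DOWN means this host goes
                                  ; UNREACHABLE, which most setups don't page on
}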
The "email in a loop" example I was thinking of is often one where a programmer slaps something together in 10 minutes to make a CxO's wish of emailing large population X come true, and if they'd spent 15 minutes on it instead of 10, I wouldn't have been paged at 4 AM on a Sunday to fix a whole slew of problems. 🙂
Back when I was an admin of a vBulletin site I ran into a very similar problem. The code that connects to the database would wend you an email if it couldn’t connect.
The trouble started when you had a busy site with upwards of 1,000 simultaneous users at the time when your database went down. That equated to 1,000 simultaneous emails, followed by another 1,000 simultaneous emails when everyone refreshed to see if you were back up, followed by another, and another…
I made a quick hack that touched a file in /tmp every time it would have sent an email, and then only actually sent one if the file's timestamp was more than 10 minutes old. Using the local filesystem still meant we got one email per web server, but that wasn't so bad, because sometimes it could help diagnose the problem if only one of them sent an error email.
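The actual hack lived in vBulletin's PHP, but the idea in Perl (to match the article's example; the path and function names here are invented) looks roughly like this:

# Send at most one alert per outage: the marker file's mtime is
# refreshed on every failure, so it only looks "old" on the first
# failure after a quiet spell.
sub maybe_send_alert {
    my $marker = '/tmp/db_error.stamp';    # per-server, since /tmp is local
    my $quiet  = 10 * 60;                  # 10 minutes, in seconds

    my $old = !-e $marker || time() - (stat(_))[9] > $quiet;

    # "touch" the marker whether or not we alert
    open(my $fh, '>>', $marker) and close($fh);   # create if missing
    utime(undef, undef, $marker);                 # bump mtime to now

    send_error_email("Can't connect to the database") if $old;
}

Note that this throttles per outage rather than per interval: while failures keep arriving the mtime stays fresh, which is exactly what you want during a refresh storm.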
I don't know if vBulletin have fixed this in the current version, but I sure wish they'd thought about it before it killed my mail server.
Hmmm… s/wend/send/ but the typo kinda makes sense the way it is. 🙂
Y'know, I'm going to agree with the not-sending-alerts-in-a-loop part, but not the "your server will melt" part. Having watched my boss get hammered by this on a weekly basis (his duct-tape language is FoxPro, yes, in 2010), the self-mailbomb is very, very real. I just don't see why you think it would send mail and spam servers into a flaming spiral of doom. Mine sure don't! At worst, it will delay other mail to that recipient's network for however long it takes to drain the queue, or until I start noticing the hundreds of "ping!" sounds coming from the next office. Frankly, if *any* load can take down a server, I consider that box to be misconfigured. Any server should have sane limits on how much work it tries to do at once. That's the whole point of a queue!
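To put rough numbers on "sane limits": in Postfix, for example, the knobs look something like this (values illustrative, not recommendations):

# /etc/postfix/main.cf (excerpt)
# Cap how many smtpd/delivery processes run at once overall
default_process_limit = 100
# Cap parallel deliveries to any single destination
default_destination_concurrency_limit = 20
# Refuse more than this many messages per minute from one client
smtpd_client_message_rate_limit = 60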
This is much like those horrible "Apache tuning" guides that tell everyone to crank MaxClients up to 256; then the first day the site gets popular, it sinks under the load from obscene swap pressure. When you tune a box, take a few minutes to test it under the absolute worst-case scenario you can simulate, because that's precisely what will be happening the next time your pager goes off at 3 a.m.
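For concreteness, the back-of-the-envelope sizing goes something like this (the per-child and RAM figures are made up for illustration; measure your own under load):

# httpd.conf (Apache 2.2, prefork MPM)
# If each child holds ~50 MB resident and the box has 2 GB of RAM
# with ~512 MB reserved for the OS and everything else:
#   (2048 MB - 512 MB) / 50 MB per child ~= 30 workers
# MaxClients 256 on that box would demand ~12.8 GB -- hello, swap.
<IfModule prefork.c>
    StartServers         5
    MinSpareServers      5
    MaxSpareServers     10
    ServerLimit         30
    MaxClients          30
    MaxRequestsPerChild 4000
</IfModule>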
Sadly, my boss still thinks it's perfectly acceptable for a single-processor box to hit a load average of 15.0. At least Windows gives you a nice idiot-proof metric like "100% CPU Usage". That may well be the only thing I like about Windows servers, but I like it a lot!
I was going to say something about rudeness, missing points, and whatnot, but I think I can sum up my feelings about you and your boss with the old phrase “A people hire A people, B people hire C people.”