Landslide

I love building services that are fast and stable.

It is a tribute to the service that lots of people start using it.

Lots of people using it, all at once, sometimes overwhelm it, making it not fast and stable. You’re chugging along and wham! Suddenly you’re flying down the mountainside in a landslide.

This is where the last three weeks of my life have gone.

In this particular case it’s a mailing list server. One of the places I work is a large university, which many of you have probably figured out by now (I also have my own consulting business, which is also probably pretty obvious). This university, which is probably pretty obvious but shall remain nameless, has classically had several different mailing list solutions. Two for general lists: an ancient implementation of Listproc, and an implementation of Lyris ListManager. One for class lists: custom software written in Perl 4 on AIX 4.1.5 in 1998, tightly integrated with Sendmail 5. No security holes there. The last was a solution for student organizations, which was sendmail aliases. Yawn.

So, a couple of us got together a few years back and said “This all sucks. We need to pick one list manager and do that well.” We had a bake-off between a few solutions and we picked Lyris ListManager.

We’ve been pretty happy with ListManager. It’s got the same number of flaws every other product has, and it’s obvious that Lyris is learning and growing with its customers, but that’s fine (usually). There is a database component and an application component. We bought two Dell PowerEdge 2650s to run this: dual 2.4 GHz CPUs, 2 GB of RAM, and a bunch of local disk for the database machine.

Somewhere around moving the student organizations over last fall we added more RAM to both machines.

This last semester was the first time class lists were in Lyris, too, and we had one mailing list solution. Each class had a list, and one of our programmers took the initiative and customized the ListManager interface so it was easy for professors and TAs to enable their list, which was populated from our student data system, and do other basic tasks. It was fast, had spam and virus scanning, was fairly easy for n00bs to use, and had enough options to keep power list users happy.

Bliss. Mail trip latency was good, support cases down, and it was chugging away. We were in the garden of Eden. Hey, look, a snake! With an apple!

The worst times for university IT systems are like the worst parts of a loop in source code: the beginning and the end. The first three weeks of fall semester are hell, and so are the last three weeks of spring semester. If your system isn’t built to deal with those cases you’re in trouble.

We were in trouble. We knew this three weeks ago. Things were great during the year, but as more and more email went out about final exam locations, grades, student orgs over the summer, and so on, traffic rose. The disk I/O graph was looking fairly logarithmic, until a plateau formed in it: the local disks had maxed out.

I tweak the disk and filesystem tuning. The plateau goes up a bunch, turning into a little mesa in the graph. Mesa roja, the color of doom.

I say “screw local storage, we’ve got these huge EMC disk arrays. Let’s attach them.” That fixes the disk I/O problem. My mesa in the graph becomes a shelf on the side of a craggy peak. I’ve traded one problem for another, though. I have a CPU problem on the database server.

There is a phenomenon with people and slow technology, especially web and mail servers: if they don’t get what they want within whatever time period they’ve decided is reasonable, they try again. If the web server is slow they reload the page and put more load on the server. If the mail server is slow they send the mail again and put more load on the server. They think systems are like plumbing, where if there’s a clog, more pressure will blow it out. There’s a guy at work who thinks this about print queues.
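None of those retrying humans are under my control, but it’s the same lesson I keep preaching to anyone writing a client: back off when the other end is slow instead of applying more pressure. A minimal sketch of that in Python, where send_message is just a stand-in for whatever flaky operation you’re retrying:

```
import random
import time

def send_with_backoff(send_message, max_attempts=5, base_delay=2.0):
    """Retry a flaky operation, waiting longer (with jitter) after each
    failure instead of immediately hammering the server again."""
    for attempt in range(max_attempts):
        try:
            return send_message()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            # Exponential backoff with jitter: wait up to 2s, 4s, 8s, ...
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
```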

So my database server is churning on multiple copies of things, rejecting the duplicates, but it has to do all that extra work to find them. I need a faster machine, as this four-year-old box isn’t cutting it. Two options: migrate the database from PostgreSQL on the Dell to Oracle on a huge AIX box, or get a faster Dell. I want a Dell PowerEdge 6850: quad CPUs, lots of RAM. Why mess around?

Money is tight, and we have the huge AIX box and Oracle. I need a test environment to try this, because anything I do to the production setup, like copying database dumps, makes it slower, and that’s bad. So I reload the test list server machines with production data, taking an outage to dump the database and building a huge mail queue for later. Then I discover the conversion tools eat RAM. Like “hey, let’s select the whole table into RAM and then write it to disk.” One of those “Lyris is learning with their customers” sorts of things. That isn’t quite what they do, but the process does consume 4-6 GB of RAM. So my test machine, a virtual machine, gets 10 GB of swap. Then I discover that the tool just dumps core right at the end of the whole thing, but before it’s finished. Of course, it takes three hours to get to that point, so messing with it takes a couple of days.

Then Lyris support wants a copy of our database, and the core files, because they want to muck with it. Hey, cool. But to do that we need them to sign an NDA, because we might be giving them protected data, and they need to be held accountable for their actions. The person on our end who deals with that is on vacation. No backup for them. Arggh.
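About that “select the whole table into RAM” approach: a streaming export keeps memory flat no matter how big the table gets. This is just a sketch of the general idea using a psycopg2 server-side cursor, not Lyris’s code, and the table name is made up:

```
import csv
import psycopg2

def export_table(dsn, outfile):
    """Dump a big table to CSV in batches via a server-side cursor,
    so memory use stays flat instead of growing with the table."""
    conn = psycopg2.connect(dsn)
    try:
        # Giving the cursor a name makes it a server-side cursor:
        # rows come over in chunks of itersize, not all at once.
        cur = conn.cursor(name="big_export")
        cur.itersize = 10000
        cur.execute("SELECT * FROM members_big_table")  # hypothetical table
        with open(outfile, "w", newline="") as fh:
            writer = csv.writer(fh)
            for row in cur:
                writer.writerow(row)
    finally:
        conn.close()
```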

While all this is happening we’re fielding numerous complaints and spending more than 50% of our time communicating with management and customers about the problems. Our help desk writes us long letters about SLAs and things. Our directors tell us “get the problem fixed ASAP but you absolutely cannot interfere with operations before the end of the semester.” Hmmm… so no more outages to do anything to help the situation.

So now we’re mitigating the load on the list servers. For instance, you can’t delete lists right now, because the act of deleting a list does some very intensive things to the database. Like vacuuming tables. Dude, don’t put a PostgreSQL vacuum statement in your app. WTF. I spent four days watching queries and putting more indexes on tables, which hurts CPU for INSERTs and UPDATEs but helps a lot for certain SELECTs. And I can’t vacuum the database without creating mail backlogs, which really sucks because you have to vacuum the database or you’re in bigger trouble later.
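The compromise I keep circling back to is vacuuming opportunistically: only the busiest tables, and only when the mail queue is short. A rough sketch of that idea, run out of cron; the queue path, threshold, and table names are all assumptions about my setup, not anything Lyris ships:

```
import glob
import psycopg2

MQUEUE_DIR = "/var/spool/mqueue"         # sendmail's queue directory (assumed)
QUIET_QUEUE_DEPTH = 50                   # "quiet enough" threshold; tune to taste
BUSY_TABLES = ["members_", "messages_"]  # hypothetical names for the churny tables

def mail_queue_depth():
    """Count queued messages by counting sendmail's qf* control files."""
    return len(glob.glob(MQUEUE_DIR + "/qf*"))

def vacuum_if_quiet(dsn):
    """Vacuum the busiest tables, but only when mail isn't backing up."""
    if mail_queue_depth() > QUIET_QUEUE_DEPTH:
        return  # backlog building; skip this run and let cron try again later
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # VACUUM can't run inside a transaction block
    cur = conn.cursor()
    for table in BUSY_TABLES:
        cur.execute("VACUUM ANALYZE " + table)
    conn.close()
```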

Can I rewind my life to three weeks ago and try this all again?

What’s the moral of the story here? Maybe it’s “don’t let load sneak up on you.” It could be “do some load testing periodically, you dolt, so you know your limits.” Or maybe there isn’t a moral at all, just fate being cruel.

While it is sort of refreshing to have a problem to deal with, rather than always speaking about theoretical & potential problems, I’m really kicking myself for being too busy to notice its precursors beforehand.

*sigh*