Treading Water

It’s Saturday, and one of my servers is having a conniption fit. The load average is swinging like a sine wave between 6 and 12, and the box is trying to do a lot of I/O.

This machine is set up as a mail server, optimized for latency rather than throughput. It was a mail server, actually, until it turned into a Sophos PureMessage reporting and management machine for the other seven spam and virus filtering machines. The stats from the other machines are stored in a PostgreSQL database that ships bundled with newer versions of PureMessage. I’m not quite sure what Sophos is doing under the hood, but this machine suddenly needs to do a lot more I/O, and going from mail server to database server on this machine is like asking your toaster to cook a turkey.

I’m not a complete PostgreSQL newbie, so I take a look at the way the database is configured. I can’t change or even look at the disk subsystem right now, so that’s out. But I can give it more memory to use to do its work. Since this is a pre-packaged PostgreSQL install I have to find the postgresql.conf file. And, lo and behold, the parameters it ships with suck. On top of that, this thing has the autovacuum daemon running, but it isn’t set up to do any analyze operations, and when the vacuum tries to run it’s preempted by other jobs that are updating the tables it wants to vacuum. So it never happens.

I check kernel.shmall and kernel.shmmax. The machine has 4 GB of memory. Those parameters are set to 3 GB.
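Checking those is a quick sysctl call; keep in mind that shmmax is reported in bytes and shmall in pages, so the raw numbers take a moment of arithmetic to read as “3 GB”:

sysctl kernel.shmmax kernel.shmall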

I check shared_buffers and sort_mem:

shared_buffers = 131072
sort_mem = 10240

shared_buffers is 131072 8 KB blocks, which is only 1 GB, and sort_mem is 10 MB per sort. That’s low for 3 GB of shared memory.

Since this is a reporting tool and not super-production, what’s fsync set to?

#fsync = true

Ah yes, the default.
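If you’d rather not trust the conf file, the running server will tell you what it actually loaded; assuming I can connect to a database as the superuser, something like:

psql -c "SHOW shared_buffers;"
psql -c "SHOW sort_mem;"
psql -c "SHOW fsync;"

(On this vintage of PostgreSQL the parameter really is sort_mem; newer releases renamed it work_mem.)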

My plan of attack is coalescing; the path is becoming clear. I need to bump everything up.

First, change the amount of shared memory available to 2 GB, leaving 2 GB for the OS, programs, and various work buffers. I have no idea what Sophos is doing on the machine, so this seems reasonable.
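A sketch of what that looks like with sysctl, assuming 4 KB pages (shmmax is set in bytes, shmall in pages):

sysctl -w kernel.shmmax=2147483648
sysctl -w kernel.shmall=524288

The same two settings go in /etc/sysctl.conf (minus the “sysctl -w”) if I want them to survive a reboot.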

Second, I compute the size of shared_buffers. I want to have 2 GB of shared memory, and PostgreSQL uses 8 KB blocks. There’s also a 512 KB overhead. So:

2.0 GB * 1024 * 1024 = 2097152 KB
(2097152 KB - 512 KB) = 2096640 KB
2096640 KB / 8 KB = 262080
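Or, letting the shell check my arithmetic:

echo $(( (2 * 1024 * 1024 - 512) / 8 ))

which prints 262080.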

Third, I take a wild stab at sort_mem. 10 MB doesn’t seem like enough. How about 16 MB? It’s bigger, but not a ton.

Fourth, fsync needs to get set to false. Yes, this has data implications should the box crash, but honestly if people wanted to do this right they wouldn’t have installed a database on a mail server. Plus, it’s reports. Whatever. Right now it doesn’t look like anything is getting done on this box, reports or otherwise.

So now I have at the top of postgresql.conf:

#-----------
shared_buffers = 262080
sort_mem = 16384
fsync = false
#-----------

I restart PostgreSQL. Hot damn, it works. Actually, it’s just “hot damn, it starts.” Now for the vacuum.

Oh, crap — all of the other servers jumped on this thing, and now the vacuum is stuck. Quick, change the port it listens on, and restart!
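The port change is one more line in postgresql.conf; 5433 here is an arbitrary pick, anything the other servers aren’t pointed at will do:

port = 5433

followed by another restart (on a stock install that’s the init script or a pg_ctl restart; I have no idea what Sophos’s packaging calls it).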

Ah, now I can vacuum in peace. A little "vacuumdb --all --analyze -v -v" does the trick. While it’s running I can muck with elvtune on the disk, too, and using iostat I can watch the throughput.
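For the throughput-watching part, extended device stats on a short interval are plenty; something like:

iostat -x 5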

The default elvtune on this box is read_latency of 2048 and write_latency of 8192. Since I’m after throughput and not latency I want to bump those up.

elvtune -r 8192 -w 32768 -b 6 /dev/sda
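Running elvtune with just the device name should print the current settings back, which is an easy way to confirm the change stuck:

elvtune /dev/sda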

And all of a sudden I’m seeing huge I/O spikes mixed in with my normal read and write traffic. w00t.

Six hours later, though, the machine is swamped again. Maybe it is like cooking a turkey with a toaster. *sigh*