We use Sophos PureMessage to scan for spam and viruses. We had eight Dell PowerEdge 2650s with varying CPUs and RAM, clustered behind an Alteon Layer 4 application switch. Our mail systems (sendmail and Sun Java System Messaging Server) use the PureMessage connector to talk to this cluster. The Layer 4 switch balances connections with a weighted least-connections algorithm.
The amount of spam we’ve had to deal with has tripled in the last 12 months, and as a result this cluster was starting to max out. There are two ways to think about mail volumes: peak and overall. When someone says they handle 15 million messages a day, they don’t usually mean a steady 625,000 per hour, or 10,416 per minute. Averages don’t tell the whole story, so you need to know what your peak traffic looks like. For us, the peaks were getting too high. We needed to add capacity.
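To put rough numbers on the difference between average and peak, here is a back-of-the-envelope sketch. The 15-million-a-day figure is the one from the example above; the peak-to-average ratio is only an assumption for illustration, so measure your own.

```python
# Rough sketch: why a daily average understates what a cluster must handle.
# The 15M/day figure is the example from the text; the peak-to-average ratio
# is an assumed, illustrative value -- real traffic is spikier than an average.
messages_per_day = 15_000_000

avg_per_hour = messages_per_day / 24            # ~625,000 msgs/hour
avg_per_minute = messages_per_day / (24 * 60)   # ~10,416 msgs/minute

peak_to_average = 2.5                           # assumption; measure your own traffic
peak_per_minute = avg_per_minute * peak_to_average

print(f"average: {avg_per_minute:,.0f} msgs/min, assumed peak: {peak_per_minute:,.0f} msgs/min")
```

It’s the peak figure, not the average, that the cluster has to be sized for.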
Adding capacity is sometimes a crapshoot. How much capacity are you really adding? How fast are the new machines? When should you think about upgrading again? How do you get managers to give you money when you don’t know any of these answers?
We know PureMessage is mainly CPU intensive. It can also be memory intensive when it is doing a lot of scanning. The PureMessage connector opens a single connection to the cluster and pumps a lot of mail through it, so a fast network connection is important, but we don’t incur the overhead of opening and closing a lot of connections. It doesn’t do much disk I/O except for logging.
Dell’s fastest 1U server is their PowerEdge 1950. Given what we knew, we ordered four new 1950s, each configured with dual dual-core Intel 5160 3.0 GHz CPUs, 8 GB of RAM, and two mirrored 146 GB 15,000 RPM disks. We configured each with a single gigabit network connection. Each runs Red Hat Enterprise Linux 4, runs its own caching name server (using Red Hat’s default caching-nameserver package), and has some minimal network buffer tuning done.
Our load-testing procedure involves shrinking the cluster until the fewest possible servers are handling the load at the highest utilization we can sustain without queues developing. We have scripts that monitor the PureMessage logs to determine the number of messages per minute, as well as the queue sizes on the mail systems. We monitor vmstat and iostat to determine system usage. We test this way, rather than using a benchmark, because it’s really hard to simulate spam in a way that is meaningful for a test. By testing with real traffic we get real numbers the first time, without spending days setting the test up. The worst thing that can happen is that we develop a queue, which we then drain immediately so nobody notices.
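Below is a minimal sketch of the sort of per-minute counter such a monitoring script might use; the log path and the per-message match string are hypothetical, so adjust both to your own PureMessage log location and format.

```python
#!/usr/bin/env python3
# Minimal load-test monitor sketch: tail the PureMessage log and report the
# number of messages processed per minute. LOG_PATH and MATCH are hypothetical
# placeholders -- point them at your own log file and per-message marker.
import time

LOG_PATH = "/opt/pmx/var/log/message_log"   # hypothetical path
MATCH = "processed"                          # hypothetical per-message marker

def count_per_minute(path=LOG_PATH, match=MATCH):
    with open(path) as log:
        log.seek(0, 2)                       # start at the end, like tail -f
        while True:
            window_end = time.time() + 60
            count = 0
            while time.time() < window_end:
                line = log.readline()
                if not line:
                    time.sleep(0.5)          # wait for new log lines
                elif match in line:
                    count += 1
            print(f"{time.strftime('%H:%M')}  {count} msgs/min")

if __name__ == "__main__":
    count_per_minute()
```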
We were able to drive the load on two of our 1950s to 95% (user, system, and wait CPU percentages added together), with each processing a peak of 3150 messages per minute. Extrapolating, 100% should be around 3300 messages per minute. That’s 4.75 million messages a day per host.
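A quick sketch of that extrapolation, assuming throughput scales roughly linearly with CPU utilization:

```python
# Extrapolate the rate measured at ~95% CPU to an estimated 100% ceiling,
# assuming throughput scales roughly linearly with CPU utilization.
observed_rate = 3150                          # msgs/min measured at ~95% CPU
ceiling = round(observed_rate / 0.95, -2)     # ~3316, rounded to 3300 msgs/min
per_day = ceiling * 60 * 24                   # 4.75 million msgs/day per host

print(f"ceiling: {ceiling:,.0f} msgs/min, about {per_day / 1e6:.2f}M msgs/day per host")
```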
Wow.
Now, why shouldn’t we use that number?
1. The Layer 4 load-balancing switch does an imperfect job of distributing the load evenly, so we want excess capacity in case the switch singles one of our servers out. Least connections doesn’t mean least work.
2. We want enough capacity to be able to remove a server from the cluster for maintenance at peak times.
3. Sometimes the scanning rules from Sophos consume more CPU time to achieve the same throughput, either because of software errors or because that’s what needs to happen to get the job done.
4. Sometimes the mix of email we are getting is harder to scan than other times, taking longer to reach the short-circuit “it’s definitely spam” mark where the scanner can stop processing it. For that matter, legitimate email probably takes the longest, since it has to be evaluated by every rule.
5. We have other processes running, like name servers, log processing, etc., which consume some CPU time. This is mostly factored into our numbers because of the way we test, but adding a little buffer for these isn’t a bad idea.
6. Most importantly, we don’t want to plan to add capacity when we reach 3300 messages per minute. At that point we’re full. We want to start the process when we reach a lower threshold, like 75% utilization. That way if there are delays we won’t hit 100%.
Because of these factors, we chose a target of 65% of capacity, or somewhere around 2200 messages a minute per host. Across the four servers that’s 8800 messages a minute, or 12.67 million per day at 65% (nearly 20 million a day at peak, though! Wow!).
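For reference, a sketch of that cluster-wide arithmetic, using the figures quoted above:

```python
# Cluster-wide planning numbers from the per-host ceiling and the 65% target.
per_host_ceiling = 3300     # msgs/min at 100% CPU (estimated above)
per_host_target = 2200      # msgs/min, roughly 65% of the ceiling
hosts = 4

cluster_target = per_host_target * hosts                  # 8800 msgs/min
target_per_day = cluster_target * 60 * 24 / 1e6           # ~12.67 million/day
peak_per_day = per_host_ceiling * hosts * 60 * 24 / 1e6   # ~19 million/day ("nearly 20")

print(f"target: {cluster_target:,} msgs/min ({target_per_day:.2f}M/day); "
      f"theoretical peak: {peak_per_day:.1f}M/day")
```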
I hope this is useful to folks planning to implement PureMessage or expand their existing implementations. This work was done in conjunction with two other fellows, one of whom is our primary PureMessage admin (the other guy and I are sysadmins, keeping the systems healthy but not doing much maintenance on the application itself). If you have questions just leave them in the comments section and I will pass them along if I can’t answer them.
We’re not in the same league, but useful to know … interesting stuff …
Hi Bob,
You might want to take a look at our connection management solution, which was designed by former ActiveState developers (some of the people who built the product that became PureMessage) specifically to address the loading challenges posed by today’s spam problem. Traffic Control is being used by numerous large PureMessage customers to significantly reduce the load on their PMX clusters. For example, we have one customer who now processes 6M connections a day on just two boxes with plenty of CPU to spare.
But don’t take my word for it. Give us a call and we’ll set you up with a trial.
Regards,
Ken Simpson
CEO, MailChannels
Hi Bob,
One thing I would ask about is not the number of emails, but the type of traffic. Some of the simpler stuff can be handled by existing tools.
Firstly, do you have the blocker daemon running?
It’s a lot more efficient to let blocker run outside of the policy (blocker is checked before the policy is accessed).
To relieve I/O, possibly consider separating the transaction log from the database?
Are your machines tuned for the amount of memory that you have?
Have the shared memory settings (shmall/shmmax), effective_cache, vacuum_mem, and shared_buffers been increased to account for your increased memory?
After that you can look at the policy itself: is the default policy right for you? Are there tests that could be applied ahead of the default order?
🙂
Tim