Should vs. Does, again.

Mark Callaghan over at High Availability MySQL made some comments about the wording in the MySQL documentation. Regular readers of my blog will know why I find his comments interesting: the MySQL documentation states that replicating from a 5.0 master to a 5.1 slave should work. This is very different from stating that it does work. That section of the manual should enter the 5.1 no-use case competition. Frankly, I hate the word “should.” To see it in vendor documentation like this is terrible, because it’s a weasel word. It puts the onus of testing and support on the end user, and gives the vendor a cop-out when it doesn’t work. “Well, we only said it should work, not that …

Read More

Why Does rnd() Keep Changing?

My friend Tom found this; I thought it was worth re-sharing: I can think of several ways of making things like /dev/random stop changing, mainly based on what my customers have done to machines.

Intel 7400 Memory Population

Intel’s The Server Room blog has an interesting tidbit of information for those of us thinking about servers with the Intel 7400 series of CPUs in them: As mentioned before, an MP Xeon 7400 series server will provide four channels of FBD memory. There are a couple of considerations here. First, latency to memory increases for every DIMM added to the system. This is important to note because you can keep the memory latency to a minimum by adding fewer high capacity DIMMs. Second, be sure to evenly distribute the DIMMs across all the channels. In other words, don’t fill up all the slots on one channel and then lightly populate the rest. Some systems get faster when you have …

Read More

The Beauty of Logs

I’m not sure how many times I’ve been asked by coworkers, friends, and random people if I know how to fix a problem. The conversation always goes something like: “Hi Bob. I am getting error XYZ when I try to use scp with public keys to copy a VMDK file from one ESX host to another. Can you tell me what I’m doing wrong?” “Hi Joe. It could be one of thousands of things. You might try looking at /var/log/messages or /var/log/secure to see what SSH thinks the problem is.” “Bob, thanks! It was a permission problem for my authorized_keys file.” Neato. The nice thing about logs is that they often give you information that helps you solve a problem[0]. …
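To illustrate the point, a single grep is often all it takes. The log line below is a fabricated sample (the PID and path are made up) in the style sshd actually emits; real entries land in /var/log/secure on Red Hat-ish systems or /var/log/auth.log on Debian-ish ones:

```shell
# Simulated troubleshooting session: sshd records *why* a public-key login
# failed, even when the client only sees a silent fallback to passwords.
log=$(mktemp)
echo 'sshd[1234]: Authentication refused: bad ownership or modes for file /home/joe/.ssh/authorized_keys' >> "$log"

# The log states the cause directly -- no guessing required:
grep -o 'bad ownership or modes' "$log"

# The usual fix: sshd ignores key files that anyone else can write to.
# chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
```

No amount of staring at the client side would have told you that; the server-side log hands you the answer.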

Read More

Failure Modes I Haven’t Seen Before

It’s a rare day when I get to see operating systems fail in ways I’ve never seen before. I’ve been having the strangest problems with a virtual machine I’m trying to deploy. It boots but won’t come up properly on the network. Services will start but complain about the network, or just be unresponsive. I can’t ping it, either. I’ve deployed several other virtual machines today from this same image, so it isn’t the image. Regardless, I redeployed it. Still messed up. I double-checked the network settings, /etc/hosts, /etc/resolv.conf, gateway devices, netstat, route, everything. Nothing is wrong. I changed the IP address to something else, and it works great. I checked with my NOC to see if the IP I’d …

Read More

Hell Breaking Loose

A great George Carlin quote came through today, via Quotes of the Day: “I’m not concerned about all hell breaking loose, but that a PART of hell will break loose… it’ll be much harder to detect.” As a system administrator that’s exactly the attitude I take on monitoring.

Complexity vs. Availability

A few days ago I wrote an article on downtime. In the end it was an article on how big, complex, highly available systems get really expensive and hard to maintain. The interesting thing is that highly available systems end up having more failures because there are so many more components than in simpler systems. Every time you add a component you increase the risk of failure. Obviously the goal of a highly available, highly redundant system is to survive outages. It’s just that the likelihood of actually having a problem is much greater. Buy a second server with mirrored disks for load balancing and now you have four disks to worry about instead of two. Ditto for power …
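The “more components, more failures” point can be sketched numerically. Using an illustrative annual failure probability (my number, not a figure from the article): if each disk fails independently with probability p in a given year, the chance that at least one of n disks fails is 1 − (1 − p)^n.

```python
# Hypothetical per-disk annual failure probability; the exact number
# doesn't matter, only how the combined risk grows with component count.
def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent components fails."""
    return 1 - (1 - p) ** n

print(round(p_any_failure(0.03, 2), 4))  # two disks  -> 0.0591
print(round(p_any_failure(0.03, 4), 4))  # four disks -> 0.1147
```

Going from two disks to four nearly doubles the odds that *something* fails, even though the mirrored pair makes the service itself more likely to survive any single failure.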

Read More

Scheduled Downtime vs. Availability

Reader Ben asked a good question in the comments of my previous post about downtime. I’m going to take a stab at this with the hopes that others will chime in and augment/correct my thinking. “Isn’t it true that scheduled downtime is not usually factored in when calculating historic availability trends? If you had a scheduled maintenance window for 6 hours every Saturday morning, that wouldn’t count at all towards your downtime calculations. That could also affect the number of minutes per 9 calculation above.” I am of the opinion that service availability[0] should be measured as the amount of time the service was available. If your service is down for 6 hours every week it’s not available during that …
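To put a number on it, here’s a quick sketch using Ben’s hypothetical window and counting scheduled maintenance as downtime:

```python
# A 6-hour maintenance window every Saturday, counted against availability.
MINUTES_PER_YEAR = 365 * 24 * 60   # 525600, ignoring leap years
scheduled = 6 * 60 * 52            # 6 hours/week in minutes: 18720
availability = 100 * (1 - scheduled / MINUTES_PER_YEAR)
print(f"{availability:.2f}%")      # 96.44%
```

Measured that way, a weekly 6-hour window alone caps you below 97% availability, before a single unscheduled outage happens.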

Read More

Sun x4540, Thumper, as FC Target?

Hey, is anyone out there using a Sun x4540 (aka Thumper) as a fibre channel target on their SAN? The COMSTAR project that’s part of OpenSolaris appears to be able to make a volume available over the SAN. I’m very interested in that, but I’d like to know what other people think about it, especially in terms of stability & performance[0], before I go trying anything. If you have any thoughts could you leave me a comment? ————— [0] Yes, I know it’s SATA, but for what I want to do, and across 48 drives, it’ll probably be fine. Plus they come with a god-awful amount of RAM. Go caching go!
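For context, the COMSTAR path I have in mind looks roughly like this. This is a sketch, not a tested recipe: the pool and volume names are made up, the LU GUID placeholder has to come from sbdadm’s output, and it assumes the HBA has already been flipped into target mode under COMSTAR’s fct driver.

```shell
# Carve a ZFS volume out of the pool and register it as a SCSI logical unit.
zfs create -V 500G tank/fcvol
sbdadm create-lu /dev/zvol/rdsk/tank/fcvol

# Make the LU visible to initiators. This exposes it to all hosts; a real
# setup would scope access with host and target groups via stmfadm.
stmfadm add-view <lu-guid-from-sbdadm-output>
stmfadm list-lu -v
```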

Downtime

(Matt over at Standalone Sysadmin posted a thought the other day about downtime, which coincided nicely with an explanation I ended up writing for a customer about their downtime requirements. Since I had it written up anyhow I figured I’d post it here.) Downtime is often discussed in terms of the number of 9’s in the percentage of availability. So four 9’s would be 99.99% available, which translates to 52.56 minutes of downtime a year (ignoring leap years). It breaks down like this:

98% is 10512 minutes of downtime.
99% is 5256 minutes of downtime.
99.9% is 525.6 minutes of downtime.
99.99% is 52.56 minutes of downtime.
99.999% is 5.26 minutes of downtime.

By default all customers want five 9’s or …
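The breakdown above is just arithmetic; a few lines reproduce it:

```python
# Minutes of allowed downtime per year at each availability level
# (365-day year, matching the figures in the post).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600
for pct in (98, 99, 99.9, 99.99, 99.999):
    allowed = MINUTES_PER_YEAR * (100 - pct) / 100
    print(f"{pct}% -> {round(allowed, 2)} minutes of downtime")
```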

Read More