How Things Get Built
Just ran into this again. Hits home for me; I’ve been running around trying to reconcile people’s views of projects and work to be done.
The danger of collecting data is that you need to know what you’re looking at before you conclude anything. I say this because every once in a while someone new to system performance statistics starts perusing all the performance graphs we have for our servers. They see a memory graph that is almost entirely green “Used Memory,” and their reaction is “OH MY GOD WE ARE OUT OF RAM GET MORE RAM WHY AREN’T YOU DOING ANYTHING YOU SLOVENLY SYSADMINS OMFG WTF BBQ.” My reaction: A) No, you are not out of RAM. Our other monitoring systems tell us when that happens. B) Operating systems know that RAM is way faster than disks, so when an operating system has RAM …
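The short version of (B) is that the kernel uses otherwise-idle RAM as page cache, so “used” on a graph rarely means “committed.” Here’s a minimal Python sketch (Linux only, assuming the standard /proc/meminfo field names; this is illustrative, not our actual monitoring) that splits the “used” number into reclaimable cache and truly committed memory:

    # Rough sketch (Linux only): split a "used memory" number into page cache
    # versus memory that is actually committed by applications.

    def meminfo():
        """Parse /proc/meminfo into a dict of values in kB."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.split()[0])
        return info

    m = meminfo()
    used = m["MemTotal"] - m["MemFree"]
    cache = m.get("Buffers", 0) + m.get("Cached", 0)
    print("total RAM:            %9d kB" % m["MemTotal"])
    print("'used' on the graph:  %9d kB" % used)
    print("  reclaimable cache:  %9d kB" % cache)
    print("  actually committed: %9d kB" % (used - cache))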
The concept of reliability isn’t nearly as straightforward as it seems. It also depends heavily on what you are protecting yourself against. A good example of this is hard disks. You can protect yourself against a single drive failure by adding another disk and mirroring them. However, in doing this you add a controller that is now a point of failure. You also add another disk that may fail in a way that causes disruptions to both disks (freak the controller out, freak the SCSI bus out, etc.). Is it worth the additional risk? Sure, as long as the controller is way more reliable than the drives. On the other hand, sometimes the best way to make a service reliable …
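To put rough numbers on that trade-off, here’s a back-of-envelope sketch. All the failure probabilities are invented for illustration, and failures are treated as independent, which real hardware doesn’t always honor:

    # Back-of-envelope comparison: one disk versus a mirrored pair behind an
    # extra controller. All probabilities below are made-up annual figures.

    p_disk = 0.03         # assumed chance a given disk fails this year
    p_controller = 0.005  # assumed chance the added controller fails this year

    # Single disk: the service is down if that one disk dies.
    p_single = p_disk

    # Mirror: the service is down if both disks die (assumed independent)
    # or if the controller, our new single point of failure, dies.
    p_both_disks = p_disk ** 2
    p_mirror = 1 - (1 - p_both_disks) * (1 - p_controller)

    print("single disk risk:         %.4f" % p_single)
    print("mirror + controller risk: %.4f" % p_mirror)
    # With these numbers the mirror still comes out ahead, but only because
    # the controller is assumed to be far more reliable than the drives.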
Are you a sysadmin? You might want to go check out some of the other bloggers out there. A great place to start is over at planetsysadmin.com. That site aggregates a number of blogs, mainly sysadmin stuff, and their blogroll is full of great folks with a lot of great content. I was picked up by them quite a while ago, but I have no idea if I’m still getting rebroadcast there (understandably, since I’ve posted a lot of random content in the past). Regardless, good stuff, and worth a read. Not saying I agree all the time, but a conversation is boring if you’re always in agreement. 🙂 This post was originally going to be a “hey, read …
I’ve long been a fan of RAID 5. Since you only lose one disk’s worth of space to parity, it has been the best way to maximize local disk space. Sure, the performance isn’t the greatest, but I haven’t had applications that taxed the local drives, and the disk space plus generally decent performance have been a good trade-off. In the last six months, though, I’ve had three machines die from a double drive fault. This is the Achilles’ heel of RAID 5: a single drive failure is as much as it can tolerate. In two of those cases the array had a hot spare drive, and the second drive faulted during the process of rebuilding onto that spare. …
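As a rough illustration of why the rebuild window is the scary part, here’s a sketch with invented numbers; the array size, rebuild time, and failure rate are all assumptions, and it treats drive failures as independent, which drives from the same batch often aren’t:

    # Why RAID 5 rebuilds are nerve-wracking: after the first fault, the array
    # survives only if none of the remaining drives fails before the rebuild
    # onto the hot spare finishes. All numbers here are illustrative guesses.

    n_disks = 6              # assumed disks in the array
    rebuild_hours = 12       # assumed time to rebuild onto the hot spare
    annual_fail_rate = 0.03  # assumed per-disk chance of failing in a year

    hourly = annual_fail_rate / (365.0 * 24)   # crude per-hour failure rate
    survivors = n_disks - 1

    # Chance that at least one surviving disk faults during the rebuild window.
    p_second_fault = 1 - (1 - hourly) ** (survivors * rebuild_hours)

    print("chance of a second fault during one rebuild: %.5f" % p_second_fault)
    # Small per rebuild, but it compounds across many arrays and many years,
    # and correlated failures (same batch, same heat, same vibration) make it
    # worse than this independent-failure math suggests.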
This morning my building was hit by lightning. The power sorta browned out for about 10 seconds, and then everything was fine again. Except some control system freaked out, and the chillers in the data center stopped receiving cold water. Oops. Thus began an hour of frantically shutting down development, test, staging, and otherwise non-production machines, all in an attempt to keep the room temperature down while the facilities guys fixed the problem. It worked. A couple of things occurred to me during all of this: 1. We discovered that the Jabber server we use to coordinate outage handling is open to the world. We had customers joining our chat room. Not that we were hiding anything, but …
Directories in ext2 and ext3 used to be simple linked lists. These had scalability problems: when you put a lot of files in a directory, programs like ‘ls’ took quadratic (O(n²)) amounts of time to complete. To resolve this the ext3 folks added a new directory indexing feature, which replaces the linked lists with an “HTree.” I’d never heard of an HTree before, either, and Daniel Phillips, the inventor, explains in a paper presented at USENIX ALS 2001: …I went on to discard the BTree approach, and invented a new type of indexing structure whose characteristics lie somewhere between those of a tree and a hash table. Since I was unable to find anything similar mentioned in the literature I …
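For a feel of what the index buys you, here’s a toy Python sketch. It is not the kernel’s actual data structure, just a linear-scan list versus a hashed index standing in for the HTree:

    # Toy illustration: if every directory lookup is a linear scan, touching
    # each of n entries once costs O(n^2) work overall; an index (hash/HTree
    # style) keeps each lookup roughly constant.

    import time

    n = 5000
    names = ["file%06d" % i for i in range(n)]

    # Linked-list-style directory: list membership is a linear scan.
    start = time.time()
    for name in names:
        _ = name in names              # O(n) per lookup -> O(n^2) total
    print("linear scans: %.2fs" % (time.time() - start))

    # Indexed directory, standing in for ext3's HTree.
    index = set(names)
    start = time.time()
    for name in names:
        _ = name in index              # ~O(1) per lookup
    print("indexed:      %.2fs" % (time.time() - start))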
Four people now have sent me the link to Seth Godin’s “Bobcasting” post.[0] “I call it that because instead of reaching the masses, it’s just about reaching Bob.” As a guy named Bob I couldn’t agree more. 🙂 In all seriousness, though, his idea is dead on. The key is control. End-user control. Most information doesn’t need to be a popup, an email, or an instant message. It just needs to be out there so that when I’m ready for it I can get it. As a sysadmin I see this a lot with folks building email alerts into everything. Some of my coworkers get hundreds of status email messages a week, saying everything is good and reporting statistics like …
The Unofficial Apple Weblog posts a story about the iPhone running /bin/sh when it crashes. Of course, there isn’t a keyboard so you end up doing a restore. Since the iPhone didn’t ship with /bin/sh anyhow, couldn’t you put a script in its place to reboot your phone by calling init or shutdown? Or put something in your .bashrc to sleep for five minutes and then reboot? Just a thought.
To all those system administrators who have come before me, who have shared their wisdom with me personally or through books, articles, blogs, forum and list postings, I say thank you. I stand on the shoulders of giants every day I work in this field. To all those system administrators working to advance this profession, in LOPSA and other organizations, I say thank you. It is because of you that we even have this day. To all those system administrators out there who toil every day in relative anonymity, ensuring the services we rely on stay operational, I say thank you. It is you who make things work, keeping the users, developers, and managers happy day after day. Happy System …