AMD & Linux Data Corruption

Mad props to Don MacAskill for getting the word out that AMD-based machines with more than 4 GB of RAM running Linux may be subject to a silent data corruption problem, mainly on machines with NVIDIA chipsets. It is fixed in kernel 2.6.21, but not yet in a shipping Red Hat kernel. If you find yourself in this position, the workaround is to tell the kernel to ignore the hardware IOMMU with the kernel option “iommu=soft”, or to build yourself a kernel that doesn’t have the problem. This points to a bigger problem with things like Red Hat’s Kernel Application Binary Interface compatibility guarantee: agility. kABI compatibility sounds great to developers, but it significantly increases the response time to problems like this. With Red …
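
To confirm that a machine actually booted with the workaround, one option is to look for the option on the running kernel's command line. This is a minimal sketch, not part of the original post: the function name has_iommu_soft is made up for illustration, and it assumes /proc/cmdline is readable.

```shell
#!/bin/sh
# Succeeds when the given kernel command line contains the iommu=soft option.
# The function name is hypothetical; it only inspects the string it is given.
has_iommu_soft() {
    case " $1 " in
        *" iommu=soft "*) return 0 ;;
    esac
    return 1
}

# Report the state of the running kernel (skipped where /proc/cmdline is absent).
if [ -r /proc/cmdline ]; then
    if has_iommu_soft "$(cat /proc/cmdline)"; then
        echo "iommu=soft is active"
    else
        echo "iommu=soft is NOT set"
    fi
fi
```

On a GRUB-booted Red Hat system the option would normally be appended to the kernel line in /boot/grub/grub.conf, and it takes effect at the next reboot.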

What Were They Thinking – Guy Kawasaki and Jeffrey Pfeffer

Guy Kawasaki has a great interview with Jeffrey Pfeffer, author of What Were They Thinking? Two quotes stand out for me: “sometimes…the best leadership is less leadership. No seed can grow if it is dug up and examined every week, and for people to innovate and get things done, sometimes they need some time and space and resources.” It does take the right touch, though. Some folks will not get things done when given time and space, and some will flourish. The trick is to know who is who and treat them accordingly. For example, I like to let a big problem “stew” for a few days before I start working on it. This bothers some of my coworkers who …

Red Hat broke (by fixing) NIC detection on Dell PowerEdge 2950s

On Dell PowerEdge 1950 and 2950 hardware the built-in network interfaces have always been detected backwards under Red Hat Enterprise Linux 4. The NIC labeled “1” is eth1, the NIC labeled “2” is eth0. Okay, no problem, we were able to figure that out and compensate. It isn’t hard to reverse the cables. The latest Red Hat Enterprise Linux 4 kernel patch (2.6.9-55.0.2, maybe 2.6.9-55, too) fixes the detection on 2950s. So when you patch and reboot, your cable is suddenly in the wrong port. Found that out the hard way about 30 minutes ago on a machine 82.3 miles from me. Luckily I had two cables on this machine, and my network engineers just swapped the port configurations. Just …

Just Pull the Drive

I don’t know about other hardware, but on Dell PowerEdge servers the best way to fix a dead drive is just to pull it out and put a new one in while everything is up and running. It’s blissfully simple. Walk up to the box, pull the drive, put a new one in, and wait until the status light turns green. I walk away after the whole array starts blinking as it rebuilds the missing disk. Every time, and I really mean every time, I’ve seen someone try to use the Windows or Linux-based RAID controller software to help them replace a disk, they’ve ended up either needing to power cycle the whole machine or doing something dumb. Dumb, like …

Write Your Documentation as a Script

Which would you rather have: a document telling you how to start an application in 10 easy steps, or a script (shell, Perl, Makefile, etc.) that does it for you? I’d pick the script: The script is self-documenting. You can look at it and see what it will do. If you need to troubleshoot something you can just run the commands yourself. If you need to change the documentation you just change the script. The script can help ensure that the environment is correct for the application. Do you need to set environment variables, like JAVA_HOME, ORACLE_HOME, etc.? Just do it at the top of the script. You can call the script at boot, and have the application start automatically. …
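
As an illustration of the idea, here is a minimal sketch of such a start script. Everything specific in it is a placeholder assumed for the example: the application name myapp, the default paths, and the helper names require_dir and start_app.

```shell
#!/bin/sh
# Hypothetical start script: the paths and the application name are placeholders.
JAVA_HOME="${JAVA_HOME:-/usr/java/default}"
APP_HOME="${APP_HOME:-/opt/myapp}"
export JAVA_HOME APP_HOME

# Fail early, with a readable message, if a required directory is missing.
require_dir() {
    if [ ! -d "$2" ]; then
        echo "ERROR: $1 points to $2, which is not a directory" >&2
        return 1
    fi
}

start_app() {
    require_dir JAVA_HOME "$JAVA_HOME" || return 1
    require_dir APP_HOME "$APP_HOME" || return 1
    # The command a written procedure would ask you to type by hand.
    "$JAVA_HOME/bin/java" -jar "$APP_HOME/myapp.jar" &
    echo "started myapp with PID $!"
}

# Act only when asked, so the functions can also be sourced and run by hand.
if [ "${1:-}" = "start" ]; then
    start_app
fi
```

Running the script with the argument start launches the application; reading it tells you exactly which environment it expects, which is the documentation.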

Show and Adjust ext3 Mount Counts With tune2fs

The folks over at Ubuntu Geek have a post about installing showfsck, a utility that will show you the number of mounts left before you’ll get an fsck. You can do the same thing with /sbin/tune2fs:

    $ sudo /sbin/tune2fs -l /dev/datavg/www_lv
    tune2fs 1.35 (28-Feb-2004)
    Filesystem volume name:
    Last mounted on:
    Filesystem UUID:          18c4cf33-9dc5-4230-9eea-17eed1060f46
    Filesystem magic number:  0xEF53
    Filesystem revision #:    1 (dynamic)
    Filesystem features:      has_journal ext_attr filetype sparse_super
    …<snip>…
    Mount count:              14
    Maximum mount count:      39
    Last checked:             Sat Dec 2 13:46:24 2006
    Check interval:           15552000 (6 months)
    Next check after:         Thu May 31 14:46:24 2007
    …<snip>…

When your mount count reaches the maximum mount count you’ll get an fsck on the next reboot. You can adjust this with tune2fs, …
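
To get a single "mounts left" number out of that listing, the two count fields can be subtracted. The following is a rough sketch rather than anything from the original post; the function name mounts_left is invented here, and it assumes the field labels look like the output shown above.

```shell
#!/bin/sh
# Read "tune2fs -l" output on stdin and print how many mounts remain
# before the maximum mount count is reached.
# Exits non-zero if either field is missing from the input.
mounts_left() {
    awk -F: '
        /^Mount count:/         { count = $2 + 0; have_count = 1 }
        /^Maximum mount count:/ { max = $2 + 0;   have_max = 1 }
        END {
            if (!have_count || !have_max) exit 1
            print max - count
        }'
}
```

It would be used as `sudo /sbin/tune2fs -l /dev/datavg/www_lv | mounts_left`, which for the output above prints 25 (39 minus 14). The tune2fs man page documents `-c` for setting the maximum mount count; a maximum of -1 means the count-based check is disabled, in which case the number printed here is not meaningful.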

LOPSA Sysadmin Days, August 6 & 7

The League of Professional System Administrators is holding their second Sysadmin Days, this time in Cherry Hill, NJ, which is just outside of Philadelphia. August 6th and 7th. I spoke at their first Sysadmin Days last November in Phoenix, and had a great time. These are low-cost training opportunities, and unlike other organizations they are OS-agnostic and also focus as much on the professional side of things as they do the technical. This program includes Windows, Mac OS X, and Linux technical sessions, as well as sessions on time management, ethics, documentation, communication, policies, and compliance. If you sign up by July 13th there are excellent discounts available, especially if you are a student or from an educational institution. Lots …

Broken Windows

Have you read “The Pragmatic Programmer” by Andrew Hunt and David Thomas? No? For what it’s worth I think you should. System administrators and software developers have so much in common, but we just don’t realize it. “Two sides of the same coin,” or something like that. A bunch of software development books would make great system administration books if you just replaced the word “software” with “operating system.” 🙂 As an example, check out this short excerpt on software entropy and broken windows, straight from the book. Sound familiar? What do you have that could be considered a broken window? I’ve got a few. One is my Linux server build system. When it was first designed we only supported …

What Data Do We Really Need In A CMDB?

A while ago I wrote a diatribe on keeping too much data. Recently I have been asked what data I do suggest keeping in a configuration management database (CMDB).

My preliminary answer: Less is more. Store whatever data you absolutely cannot live without. Be stingy, for every item requires maintenance, and maintenance requires time which nobody has. Don’t store anything you can query from the machine or elsewhere. Do store information that will help you repair the machine if there is a problem.

My suggestions for things to start with: Server name. (duh!) 🙂 Hardware serial number. From this you can usually look up the system configuration as shipped. Hardware manufacturer model number, which will tell you a lot. Hardware warranty expiration. …

How to Configure IPMI on a Dell PowerEdge running Red Hat Enterprise Linux

This is intended to help fairly knowledgeable people get IPMI working on their hosts so they can issue remote commands to their hardware. I focus on Red Hat Enterprise Linux on a Dell, but it is likely to work on other hosts, distributions, and OSes, too. This works for me on Dell PowerEdge 1850, 2850, 1950, and 2950 hardware. Dell PowerEdge 1650, 2650, and 1750 servers have an older implementation of IPMI which will let you issue commands locally on those models, but not to them over the network. Before you begin: The Baseboard Management Controller (BMC) is the thing that implements IPMI. It piggybacks on the first built-in NIC so you have to have that attached …
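
For a sense of what a "remote command" looks like once the BMC is reachable, here is a small sketch that only prints a command line rather than running it. It assumes the ipmitool client, which this excerpt does not name; the helper ipmi_status_cmd, the host name, and the user name are all placeholders.

```shell
#!/bin/sh
# Print (do not run) an ipmitool command that asks a remote BMC for power status.
# -I lan selects the IPMI-over-LAN interface, -H the BMC address, -U the user,
# and -a makes ipmitool prompt for the password instead of taking it as an argument.
# The helper name and the example host and user below are hypothetical.
ipmi_status_cmd() {
    printf 'ipmitool -I lan -H %s -U %s -a chassis power status\n' "$1" "$2"
}

ipmi_status_cmd bmc.example.com admin
```

Printing the command first makes it easy to review the host and user before anything is sent to the hardware.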
