Category: System Administration

Perceived Productivity »

“What, you just sit around all day browsing Wikipedia?”

“Excuse me?”

“What are you looking at in Wikipedia?”

“The article on X-Men.”

“Tough day at work, I suppose.”

“Um, I’m trying to figure out a naming scheme for the 10 new servers I’m bringing in. That okay with you?”

“Oh, sorry.”

Just because you think I’m not doing work doesn’t mean you’re right.

(also, great site for naming schemes: namingschemes.com)

Cloud Computing »

My friend Terry’s slightly unorthodox take on cloud computing:

To hell with cloud computing. Clouds are puffy crap that float lazily by. Is that what you want out of your service provider? Just floating by without a care in the world?

It is time for tornado computing. Or hurricane computing. Real wrath of God type stuff. I want an architecture that knocks me off my feet, whips my apps around and hurls them half way through a tree. I don’t want my data intact for some script kiddie to steal. I want it like a frog in a blender; unrecognizably processed with a taste only I care for.

So to that end I am setting half of my air handlers to “Freakin’ Steaming,” the other half to “Ice Storm,” and locking the doors until the screaming stops. By this time tomorrow you should have some form of cloud computing in the data center, maybe a squall somewhere over the mainframe if you’re lucky. Viva La Revolucion!

Interestingly enough, that pretty much sums up my feelings, too. Service providers don’t seem to address the DR, legal, privacy, and security concerns that corporations have, and they don’t seem to care. A Microsoft rep even told a coworker of mine that “it’s no big deal as every bit of information about you is practically out there already.” Given that sort of attitude, how can I do anything but build my own cloud?

Your Sysadmin Should Know Why Backups Are Good »

You know, if you’re a system administrator there are a few things you should know (and probably do). One of those things is why you should have backups.

If you can’t figure out why, perhaps you should find a different profession.

Seriously.

I’m fine if you don’t keep backups because you’ve thought about it and you are taking a calculated risk. However, having to explain why backups are valuable to someone who, until this moment, I considered a peer is ridiculous.

It’s like having to explain what DNS does to someone who calls themselves a network administrator. I’ve done that, too.

What’s a Good Workflow/Request Tool? »

Dear readers,

You folks are full of good ideas, so here’s my latest question. I’m rethinking workflow for my group of 20+ admins, so the customers we interact with have a nice single point of contact and the admins have a good idea of what’s in the queue for work. I’m looking for tools to help us. How we’ve lived this long without something to help us is a real wonder.

The tool needs to be able to accept email and web-based requests. It would be nice if it could have some logic in it so that the customer could help direct who gets the request by choosing the OS and (perceived) priority. It should be fairly lightweight overall. I don’t want to have to slog through a ton of pages to close a ticket, or spend longer on the administrivia than the request took to complete.

There’s the venerable RT. What else is out there that’s cool, easy to use and run, and helps more than it hurts?

:-)

Accountability and Signatures »

One of my favorite tricks lately to make people understand how serious I am about things is to get them to sign a form.

You want to run your server without backups? I don’t recommend it at all, but I’ll do whatever you say. Just sign this form acknowledging that you know the risks, you know you could lose all your data at any time for any reason (including things I might do), and regardless of cause you don’t hold me accountable for anything.

You want to let your employee take a machine out of the building without following our procedures for wiping the drives? We have a policy against that and it’s a terrible idea, but no big deal. Here’s the form to sign saying that you take complete responsibility for all the data, sensitive or otherwise, on that machine. Why do you have to sign this? Well, this way, when the data on that machine leaks out and causes identity theft, etc., I have a “get out of jail free” card. Yes, jail.

The thing is, nobody ever wants to sign these forms and put their name, in writing, on a really bad idea.

Building NRPE on Solaris 10 with SSL Support »

Solaris 10 ships OpenSSL as part of the OS distribution, in /usr/sfw. It appears that they have removed some of the ciphers in order to be compliant with export restrictions. Unfortunately, that throws a wrench in things when you want your Solaris Nagios server to use the Nagios Remote Plugin Executor (NRPE) to securely talk to other hosts. In my case, my Nagios server is a Sun T2000 and I’m referring to NRPE version 2.12. Newer versions may fix these issues.

First, I built NRPE 2.12 with:

./configure --with-ssl-lib=/usr/sfw/lib \
--with-ssl-inc=/usr/sfw/include --with-ssl=/usr/sfw \
--prefix=/opt/whatever

Once that was done the error I was getting on the target Linux host (in /var/log/messages) was the ultra-informative:

Error: Could not complete SSL handshake. 5

I checked that I could telnet to port 5666 on the host to be monitored, and got a connection. If that hadn’t worked, I’d have made sure that my firewalls were set up correctly, that /etc/hosts.allow had a line authorizing the Nagios server, and that nrpe.cfg permitted the Nagios server to connect.

Then I checked that I could start NRPE on the host to be monitored with the -n flag to disable SSL, and was able to run check_nrpe manually with the -n flag and have it work.

It ultimately appeared to be an SSL issue. Everything worked except when I enabled SSL.

There appear to be two fixes. First, you can install the export-controlled SUNWcry and SUNWcryr packages and get those additional ciphers, which theoretically fixes the problem. For various reasons I chose the second fix suggested by Jim Pirzyk in the Nagios FAQs: change the source. Line 152 of check_nrpe.c goes from:

SSL_CTX_set_cipher_list(ctx,"ADH");

to

SSL_CTX_set_cipher_list(ctx,"ADH:-ADH-AES256-SHA");

Basically you tell OpenSSL to not try using the 256-bit AES ciphers, which aren’t there. Additionally, to get nrpe to build you need to comment out lines 616-619 of nrpe.c:

/*      else if(!strcmp(varvalue,"authpriv"))
                log_facility=LOG_AUTHPRIV;
        else if(!strcmp(varvalue,"ftp"))
                log_facility=LOG_FTP; */

Those log facilities aren’t supported on Solaris.
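If you’d rather not comment code out, another way to handle missing facilities is to guard them with the preprocessor. This is only a sketch of the idea, not what the attached patch does: facility_from_name() is a made-up helper, and falling back to LOG_DAEMON is my assumption for the example.

```c
#include <string.h>
#include <syslog.h>

/* Hypothetical helper: map a facility name from a config file to its
 * syslog constant. Facilities the platform's <syslog.h> does not define
 * are compiled out instead of being commented out by hand. */
int facility_from_name(const char *name)
{
    if (strcmp(name, "daemon") == 0)
        return LOG_DAEMON;
    if (strcmp(name, "auth") == 0)
        return LOG_AUTH;
#ifdef LOG_AUTHPRIV
    if (strcmp(name, "authpriv") == 0)
        return LOG_AUTHPRIV;
#endif
#ifdef LOG_FTP
    if (strcmp(name, "ftp") == 0)
        return LOG_FTP;
#endif
    /* Unknown or unsupported name: assumed default for this sketch. */
    return LOG_DAEMON;
}
```

The same source then builds on a platform that lacks those two facilities and on one that has them, with no local edits to carry around.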

I’ve attached a patch for both issues. You can apply it to the 2.12 source with:

cd nrpe-2.12; gpatch -p1 < nrpe-2.12.solaris10.patch

I’ll likely send this along to the NRPE folks. At any rate, here’s hoping you don’t beat your head against this as hard as I did.

Jargon »

Overheard at the grocery store yesterday:

“Oh my God, Doug, there you are. We’ve been trying to find you. They need the M-O-D at the service counter, there’s a lady there going absolutely nuts.” I’d been listening to them page the M-O-D for ten minutes, and I’d been watching this guy help bag groceries for five.

“What’s the M-O-D?” he asked.

“Manager on Duty,” she said, in the snottiest voice she’d use with her boss. “That’s you.”

I bet if they’d paged a MANAGER he would have responded. Which makes me think about all the jargon I use on a daily basis. Given that people won’t generally ask for clarification when they don’t understand something because they don’t want to feel stupid, how do I know that they’re on the same page as me?

Best bet might just be to use less jargon.

Just Because You Deleted A File Doesn’t Mean It’s Gone »

I ran into a case the other day where someone was reporting an operating system bug. A filesystem was 98% full, but an examination of that filesystem showed that it should only be 25% full.

It isn’t a bug. In order to understand why it isn’t, we need to know something about how files are stored, and then how they are deleted. A good place to start is the basic structure behind a UNIX-style filesystem, the inode. According to Wikipedia:

an inode is a data structure on a traditional Unix-style file system such as UFS. An inode stores basic information about a regular file, directory, or other file system object… Each file has an inode and is identified by an inode number (often referred to as an “i-number” or “ino”) in the file system where it resides.

Inodes store information on files such as user and group ownership, access mode (read, write, execute permissions) and type of file. There is a fixed number of inodes, which indicates the maximum number of files each file system can hold. Typically when a file system is created about 1% of it is devoted to inodes.

Very importantly, inodes store a file’s metadata and the location of its data, but not its name. Because file names are stored elsewhere, in directory entries, an inode can have multiple names. Enter the hard link, which is a way to give the same file data multiple names inside a filesystem:

“A hard link is a reference, or pointer, to physical data on a storage volume. On most file systems, all named files are hard links. The name associated with the file is simply a label that refers the operating system to the actual data. As such, more than one name can be associated with the same data. Though called by different names, any changes made will affect the actual data, regardless of how the file is called at a later time. Hard links can only refer to data that exists on the same file system.”
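To see two names pointing at one inode, here is a small sketch in C. hard_link_demo() is a name I made up for illustration; it creates a file, gives it a second name with link(), and reports whether stat() sees the same inode behind both names and how many links that inode has.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create `first`, hard-link it as `second`, and report through the out
 * parameters whether both names share an inode and how many links it has.
 * Both names are removed again before returning.
 * Returns 0 on success, -1 on any failure. */
int hard_link_demo(const char *first, const char *second,
                   int *same_inode, long *link_count)
{
    struct stat a, b;
    int rc = -1;
    int fd = open(first, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        return -1;
    close(fd);

    if (link(first, second) == 0) {
        if (stat(first, &a) == 0 && stat(second, &b) == 0) {
            /* Same device and same inode number means same file. */
            *same_inode = (a.st_dev == b.st_dev && a.st_ino == b.st_ino);
            *link_count = (long)a.st_nlink;
            rc = 0;
        }
        unlink(second);
    }
    unlink(first);
    return rc;
}
```

Called with two unused paths on the same filesystem, it should report one shared inode and a link count of 2.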

On most operating systems a file is marked for deletion when the last name for it is removed from the filesystem:

The process of unlinking disassociates a name from the data on the volume without destroying the associated data. The data is still accessible as long as at least one link that points to it still exists. When the last link is removed, the space is considered free.

This is true for files that are not open. However, if a file is deleted but it is still held open by a process, the space doesn’t actually get marked as free until that process closes that filehandle.

That’s the “bug”: you can delete a file that is still open, but the space isn’t free. “du” walks the directory tree and only counts files that still have names, while “df” asks the filesystem how many blocks are in use, so “du” might add up to 25% usage while “df” shows 98%. This happens a lot with big log files. You go in, find the huge file, copy it somewhere, delete the original, and then note that nothing changed. The file isn’t there anymore but the space isn’t free. Lots of people scratch their heads, declare it an OS bug, and reboot. A reboot fixes the problem by closing every file, but restarting the process, or sending it a “kill -HUP” as you would with syslog, accomplishes the same thing by forcing the software to close and reopen its logs, which frees the space.
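Here is a rough C sketch of that behavior, assuming a local filesystem (NFS handles this differently). unlink_while_open() is a hypothetical name: it writes a file, removes its only name while the descriptor is still open, and then asks fstat() what is left.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `len` bytes to a new file at `path`, unlink it while it is still
 * open, and report the link count and size fstat() sees afterwards.
 * Returns the number of bytes read back through the descriptor, or -1. */
long unlink_while_open(const char *path, const char *data, size_t len,
                       long *links_after, long *size_after)
{
    char buf[256];
    struct stat st;
    ssize_t got = -1;
    int fd;

    if (len > sizeof buf)
        return -1;
    fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) == (ssize_t)len   /* fill the file */
        && unlink(path) == 0                    /* remove its only name */
        && fstat(fd, &st) == 0                  /* the inode is still there */
        && lseek(fd, 0, SEEK_SET) == 0) {
        *links_after = (long)st.st_nlink;
        *size_after = (long)st.st_size;
        got = read(fd, buf, sizeof buf);        /* data is still readable */
    } else {
        unlink(path);                           /* best-effort cleanup */
    }
    close(fd);  /* only now can the filesystem reclaim the blocks */
    return (long)got;
}
```

Between the unlink() and the close() the file has no name, so nothing that walks directories will find it, yet its blocks are still allocated. That window is the gap between the two numbers.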

This “bug” is actually a feature for some folks, though: it’s a way to securely use temporary files. A program could create a temporary file, open it, and then delete it so it isn’t visible in the filesystem, but it’s still there and usable to the program. In fact, the tmpfile() C library function does this for you.
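A minimal sketch of the tmpfile() approach, using only the standard C library. tmpfile_roundtrip() is a hypothetical helper for illustration: it writes a string to an anonymous temporary file and reads it back, without ever handling a file name.

```c
#include <stdio.h>
#include <string.h>

/* Round-trip `msg` through an anonymous temporary file from tmpfile().
 * Copies what was read back into `out` (at most outsz - 1 bytes, always
 * NUL-terminated). Returns 0 on success, -1 on failure. */
int tmpfile_roundtrip(const char *msg, char *out, size_t outsz)
{
    size_t len = strlen(msg);
    size_t n;
    FILE *fp;

    if (outsz == 0)
        return -1;
    fp = tmpfile();             /* opened for update, no name to manage */
    if (fp == NULL)
        return -1;
    if (fwrite(msg, 1, len, fp) != len) {
        fclose(fp);
        return -1;
    }
    rewind(fp);                 /* back to the start before reading */
    n = fread(out, 1, outsz - 1, fp);
    out[n] = '\0';
    fclose(fp);                 /* the file is removed automatically */
    return 0;
}
```

Since the program never sees a path, there is no name for another process to guess or race against, and nothing to clean up if the program dies early.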

ONLamp has a great list of secure programming techniques as an excerpt from “Practical UNIX & Internet Security,” which mentions these topics and more. Also, if you aren’t familiar with inodes, directories, etc. those Wikipedia articles linked above are a good starting point. Consider it required reading if you’re a system administrator. :-)
