Archive for November, 2005

William Blake »

There are a few blogs that I read, well, just to read them. I like to think that maybe some people read my blog for that reason once in a while. One of these blogs is caterina.net. I don’t know how I found her blog once upon a time, but hey, she’s got a good quote today:

“I must create a system, or be enslaved by another man’s.” - William Blake

Right-on.

Regurgitating the Documentation »

My favorite blog posts, forum posts, and web sites are the ones that just regurgitate the documentation for a product. Any moron that gets the product documentation can read it. Sure, I’ll give some people a break if the docs are awful. But, for instance, Oracle has documented the living crap out of installing their products on Linux. Why do people feel the need to type up how to install Oracle products on Linux? Is it just to steal search-engine results away from the company itself?

Installing Oracle 10g on Linux is a great example of this B.S. I know, for a fact, that if you grab the Oracle quick-start guide they tell you exactly how to install their product, step-by-step. Let’s take a quick look at user-supplied documentation and see if people are adding value or stealing search results.

How to Install Oracle 10g on RedHat Enterprise 3 - suggestion: skip the 80% of your documentation that Oracle took care of and just put in the part about setting the environment variables. If you want to be helpful to n00bs you should explain that you need to set ORACLE_BASE to whatever you installed to, not necessarily /u01/app/oracle. They’re going to cut-n-paste your stuff and wonder why it’s not working. The product is also “Red Hat” (two words) “Enterprise Linux 3.” You get a D.
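
For the cut-n-paste crowd, here is a minimal sketch of the point about ORACLE_BASE: read it from the environment and complain loudly if it is missing, rather than assuming /u01/app/oracle. The function names are made up for illustration, and nothing here comes from Oracle's own tooling.

```python
import os


def resolve_oracle_base(env=None):
    """Return ORACLE_BASE from the given environment mapping, or raise.

    `env` defaults to os.environ; passing a dict makes this easy to try out.
    """
    env = os.environ if env is None else env
    base = env.get("ORACLE_BASE", "").strip()
    if not base:
        # Deliberately no fallback to /u01/app/oracle: guessing is the bug.
        raise RuntimeError(
            "ORACLE_BASE is not set; point it at wherever you installed Oracle"
        )
    return base


def describe_oracle_base(env=None):
    """Report the configured ORACLE_BASE and whether that directory exists."""
    base = resolve_oracle_base(env)
    status = "found" if os.path.isdir(base) else "directory not found"
    return f"ORACLE_BASE={base} ({status})"
```

Calling `describe_oracle_base()` in a login shell's environment tells you whether the variable points at a real directory, before any copied commands get run.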

Installing Oracle 10g R1,R2 on RHEL 4, 3, 2.1, etc. is my next target. This guy gets props for calling it RHEL, and for having it updated for recent releases of the software. He restates a ton of what is in the manual already, but he also adds value to some of it. The post-installation tasks and tips & hints sections are useful, but the obviousness of the errors and problems section makes me want to take a chainsaw, have someone hold my beer, and say “watch this” as I lop my own head off. I’ll give you a C for your effort.

Next up is the Gentoo Wiki’s HOWTO Install Oracle 10g. The first parts are near-verbatim restatements of the Oracle docs. The end parts and the screen shots are impressive, though, because they give you what you need to run Oracle 10g on Gentoo. Never mind that these kinds of docs are the bane of support staff everywhere. Gentoo isn’t exactly a supported Linux distribution, so everybody using these docs is 100% off the reservation. Add a statement to that effect at the beginning and all is good. You get a C-, mostly because of the section telling people to steal libaio from SuSE. Way to leech.

ORACLE-BASE - Linux and Oracle just condenses all of the documentation from Oracle down into a page and a half, minus all that neat explanation of why you’re making changes that Oracle supplies. And that’s the part that isn’t a link farm. Way to steal attention from Oracle, like those jerks that clone Wikipedia to steal their searches. You get an F.

Togaware’s Oracle 10g Release 2 has people going off the reservation by installing on Debian. Hey, cool, but again there is no disclaimer that people are going to be unsupported. Apparently the section on creating an /etc/redhat-release is supposed to imply that. Most of this is the standard Oracle docs, but the section on using Oracle 10g and importing data might be useful. The troubleshooting section looks useful. You get a B.

That’s just the first page of the Google results, and the best ones. Looking at page two drains my spirit with spelling errors, reposting the Oracle docs, link farms, and more. People, if you’re going to write your own docs:

  • don’t duplicate those that the vendor wrote already.
  • maintain the pages, which may involve removing some if the vendor catches up.
  • check the spelling on your pages.
  • tell people if you are leading them down a path that is unsupported. Just because you can do something doesn’t mean they should. If you’re screwing around you can run Oracle on Gentoo, but it isn’t appropriate for a production database, and 95% of the people reading your docs aren’t going to make that distinction on their own.
  • above all, add value. Don’t bother rewriting what has been said already if you aren’t going to add something to it.

Perpetual Betas and Version Numbers »

I was just reading the notes on day one of OSBC over at RedMonk, and the idea of a perpetual beta sorta annoyed me:

Perpetual Beta: as popularized by Google, this is the notion that software is a process rather than an end state. But the key insight here was that the customer appetite for this dynamic evolution - which Adam Bosworth might term intelligent reaction - is likely to be proportional to the business importance of that application. In other words, Google News can afford to experiment in ways that Google Search probably cannot. I’d never thought about it in quite those terms, but it’s a good point.

Then I read through the Wikipedia article on release stages.

My understanding of the process might be limited, as a lowly sysadmin, but isn’t all software in some form of beta? A perpetual beta acknowledges that software is a process and not an end state? Maybe I’m missing something here, but DUH. As a software user for a good 15 years now I’ve learned this. Heck, my mother has figured this out. The advent of web-based applications has really muddied the water with betas and version numbering. Since a web-based application has a much shorter release cycle (you do control most of the execution environment, after all), putting a product in perpetual beta doesn’t seem like the wisest thing to do. Can’t we just call it version 1.0 and go from there?

It also seems like we’ve lost the idea of “alpha” with web-based applications. I know “beta” as being feature-complete but unstable. “Alpha” means that it isn’t necessarily feature complete. Why don’t we see a perpetual alpha release from Google? What is a product that is perpetually having features added to it, but it’s stable? It’s a normal software product, that’s what it is. Gmail is stable, but they keep adding features. Isn’t this more like alpha than beta? No, of course not. It’s just part of the process. So why not skip the whole thing and just give it a version number, like 1.0, once you are done testing it? We know it’s a process. We’re part of the process, and we’re cool with version 1.0.1, 1.0.3, 1.1.0, etc. especially since we don’t usually have to do anything when you go from 1.0 to 1.0.3, except rejoice in the new features or bugfixes.

Along these same lines, I just get annoyed when I troll through Freshmeat. People, if you’re going to bother with a release, and it is your first, call it version 1.0. We’re all cool with that. These sub-1.0 numbers are annoying, and to me they show that you lack confidence in your work. Why should I use it, then? Why did you release it? Plus, most package management systems have trouble with a leading zero and a trailing letter, as in 0.9.7a. Yes, that’s you, OpenSSL. For god’s sake, are you ever going to acknowledge that your software is stable and runs the Earth? Multiply your versions by 10 and ditch the damn letters. As a cheerleading squad would put it, it’s A-N-N-O-Y-I-N-G.
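
To show why something like 0.9.7a is a pain to order, here is a rough sketch of a version sort key. It is not how any particular package manager does it; `version_key` is a made-up name, and the rule that a letter sorts after the bare number but before any further numeric component is my own assumption.

```python
import re


def version_key(version):
    """Turn a version string into a tuple that sorts in release order.

    Digit runs become (1, int) and letter runs become (0, str). The leading
    tag keeps ints and strings from ever being compared with each other.
    """
    key = []
    for token in re.findall(r"\d+|[A-Za-z]+", version):
        if token.isdigit():
            key.append((1, int(token)))
        else:
            key.append((0, token))
    return tuple(key)


versions = ["0.9.10", "0.9.7a", "0.9.7", "1.0"]

# Plain string sorting puts 0.9.10 before 0.9.7 because "1" < "7".
naive_order = sorted(versions)

# Sorting with the key compares numeric parts as numbers.
release_order = sorted(versions, key=version_key)
```

Here `naive_order` comes out as 0.9.10, 0.9.7, 0.9.7a, 1.0, while `release_order` is 0.9.7, 0.9.7a, 0.9.10, 1.0. Every tool has to pick its own rule for that trailing letter, which is exactly the annoyance.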

So, in short, get over whatever phobia you have about calling something 1.0. Grow a spine and take your stuff out of beta when you release it, especially if it’s a web app like Google News or Gmail. We know you’re adding things to it. When you release new versions those are 1.3, 2.0, 8.0.4, etc. No more of this 0.9.7a crap, or perpetual beta. It’s ridiculous.

Rube Goldberg Lives in My Machines, Part 1 »

I really feel like I’m pulling explanations out of my ass lately. You know what I mean? It’s like I’m inventing a damn Rube Goldberg machine in my head to explain the weird stuff at work.

“Hey Bob, the patches that you applied last week to our machine are causing serious I/O problems. We need them off of there ASAP.”

“Really? We’re running those same things on identical machines, and all manner of different machines, and all of those work really well.”

“Huh. Things are really messed up. What are you going to do?”

“I’ll get one of my guys to look at it. Hang on.”

“We really need those patches reverted.”

“Yeah. Let me make sure it’s the patches before we start dicking around.”

This machine runs the world’s largest MRTG implementation. Rather than have multiple machines with a split workload, it’s an 8-way, fibre-channel connected monolithic badass. Our guys that wrote the monitoring system around MRTG don’t believe in directories, so all of the RRD files for 500,000 network ports sit in one freaking directory. They also don’t believe me that directory lookups aren’t O(N), but hey, their mess meant a pair of 8-way machines for us to play with. Anyhow, on Thursday I’d put Red Hat Enterprise Linux AS 3 Update 6 on it, updated the EMC PowerPath software, and flashed the QLogic QLA2342 firmware to 1.47. I did this same thing to its identical sibling, another behemoth with a different workload, and the other machine was fine.
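
For what it's worth, spreading those files out is not hard. Below is a minimal sketch of one common approach, hashing each file name into a couple of levels of subdirectories. This is not what our monitoring system does; `sharded_path`, the root path, and the file name are all invented for illustration.

```python
import hashlib
import os


def sharded_path(root, filename, levels=2, width=2):
    """Map a file name to root/<aa>/<bb>/filename using a hash of the name.

    With levels=2 and width=2 there are 256 * 256 = 65,536 leaf directories,
    so 500,000 files average out to fewer than 8 per directory.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, filename)


# Hypothetical RRD file name; the same name always lands in the same place.
example = sharded_path("/data/rrd", "router1_port42.rrd")
```

The mapping is deterministic, so anything that knows the file name can find the file without scanning one enormous directory.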

One of my team members took a look at it. I was the guy that patched and rebooted it the other day, but I was trying to stay out of it because I wanted him to find it on his own. It’s really hard for me to not fire up iostat, vmstat, and top and just get a feel for the machine, so I did. My god, this monster machine was doing 4 MB/sec in bursts every three seconds. I’ve watched this thing do 200 MB/sec constantly when we were testing it. My coworker did some quality linear troubleshooting, backing down to an older version of PowerPath because EMC software is notoriously unstable. When that didn’t fix it he backed down to an older kernel. That didn’t do it, either. Sure, it’s my gut feeling, but I doubt it’s the fibre channel card’s firmware. So now what?

Being the work-loving knob I am, I went home and watched the mofo for a while. Eventually my coworker showed up on IM and concluded that the machine was hosed:

“Dude, I’m cancelling all the upgrades I’m doing because the software is hosed. Is that okay?”

“Um, no – all the other, identical boxes are fine. This is total B.S. I think it’s the storage array. Can you log in and look?”

“You think so? Sure, I can look. What are we looking for?”

“Is the mirror to the DR site still active? Is it syncing or synched? Actually, screw it, break the mirror. I want to start eliminating all the causes.”

Two minutes later the problem went away.

“Hey, did you break the mirror?”

“No, I haven’t even logged in all the way yet.”

“WTF, WTF, WTF! The thing just unloaded, went like a bat out of hell, and is normal now.”

“What did you do?”

“NOTHING.”

“Okay, I’ll look at the array logs to see what happened…. Oh, they don’t say anything.”

EMC’s MirrorView is a pile of crap. We use it to mirror to our DR site. If you look at it funny the mirror breaks. If the array burps the mirrors break. If the SAN does anything remotely interesting, like a topology change, the mirrors break. Hey, we’re just lucky that they fixed the bugs where MirrorView would dump and crash the storage processors, too. This fragile piece of software also gets slow if it has any sort of work to do. And because it’s synchronous mirroring, that means your host I/O gets slow, hence the reason I wanted the mirror broken. But the array, almost magically, read my mind and fixed itself. Or something. I don’t know anymore. But tomorrow I have to come up with an explanation for this, and I don’t think I can say “the mofo was taunting me” and remain credible:

“The I/O from MRTG as it caught up from the outage on Thursday overwhelmed the mirroring software. It was overwhelmed after the reboot and only finally caught up today, coincidentally while we were watching it, but luckily after we’d made some changes to rule out the hardware and OS. The work we did to revert the updates did nothing, and we concluded it is quite unlikely that the host OS or hardware is at fault. The workload is just nearly too much for the storage system.”

…and the ball rolls into the basket, which strikes a match, lights a candle, burns a rope which opens a trapdoor, dropping bacon in a frying pan, tripping a switch to start the stove, and I have breakfast. WTFTF.
