Uptime Is Not Something To Be Revered

by Bob Plankers on March 14, 2013

in Security, System Administration, Virtualization

Slashdot has a link to a tribute video to a Sun server that was up continuously for 3737 days. That’s 10.23 years. It’s like a sequoia tree seeing the passage of civilization around it.

My thoughts on this:

  • The data center and infrastructure powering this machine were built in such a way as to keep it powered continuously for 10 years. Whoever built and ran that infrastructure was doing a good job. It’s a generalization, but I bet there are very few cloud providers that can boast anything like that.
  • That version of Sun Solaris is reliable enough to keep operating for years without disruption. Most OSes are, by the way, even Microsoft Windows.
  • That particular hardware is reliable enough to keep operating for years. Factors that influence this include enough built-in hot-serviceable redundancy, a stable environment for the server to run in, clean power, etc.
  • Given that something like 85% of downtime is caused by human error, the admins of this host were either competent enough to operate it without disrupting service or didn’t touch it much.
  • The workloads for this host were sized appropriately for the host for 10 years, and any errors in this regard were resolvable without a restart.
  • This host probably has 10+ years of security holes on it. I’m not super familiar with patching Solaris hosts but, generally, running processes don’t pick up library updates until they are restarted. So even if the host is patched, you’d have to restart everything running on it to guarantee that library security updates take effect. Possible, just not likely. The kernel itself likely has not been patched, unless there is a mechanism to load new code in on the fly (like Ksplice on Linux). The comments on the video indicate there hasn’t been much patching.
  • We cannot infer anything about service availability from the little we know about this system design. There are many services that do not require high availability or continuous uptime and a vendor warranty with a certain level of response might be just fine. We can speculate that the service contract on this hardware is probably expensive, though the particular economics of replacing the system vs. maintaining a service contract are unknown to us.
  • The people who built this system may be gone, retired, perhaps even dead. Hopefully the builders left good documentation about it so that the current admins understand its role and configuration.
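
The point about library updates can be checked on a live system. Here is a minimal sketch, assuming a Linux host rather than the Solaris box in the video: it reads /proc/<pid>/maps and reports shared libraries that a process still has mapped even though the file on disk has been deleted or replaced, which is what usually happens when a package update swaps out a library underneath a running process. The function names stale_libraries and scan_processes are made up for illustration, and the ".so" filename check is a rough heuristic, not a complete test.

```python
import os
import re

# Rough heuristic for shared-object names such as libc-2.31.so or libssl.so.1.1
_SO_PATTERN = re.compile(r"\.so(\.|$)")
_DELETED_SUFFIX = " (deleted)"


def stale_libraries(maps_text):
    """Return the set of deleted shared-library paths in one maps listing."""
    stale = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode pathname (pathname is optional)
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        path = fields[5].strip()
        if not path.endswith(_DELETED_SUFFIX):
            continue
        path = path[: -len(_DELETED_SUFFIX)]
        if _SO_PATTERN.search(os.path.basename(path)):
            stale.add(path)
    return stale


def scan_processes(proc_root="/proc"):
    """Map pid -> stale library paths for every process we can read."""
    results = {}
    if not os.path.isdir(proc_root):
        # Not a Linux-style /proc; nothing to scan.
        return results
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "maps")) as handle:
                stale = stale_libraries(handle.read())
        except OSError:
            # The process exited or we lack permission; skip it.
            continue
        if stale:
            results[int(entry)] = stale
    return results


if __name__ == "__main__":
    for pid, libs in sorted(scan_processes().items()):
        print(pid, ", ".join(sorted(libs)))
```

Run as root to see every process. Anything it lists is still executing the old copy of a library, patched on disk or not, and will keep doing so until that process is restarted.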

To me, security is the biggest problem here, because patching is a big part of defense-in-depth. Firewalls are neat but you have to punch holes in the firewall to let people use applications, right? If an application running on a host like this gets compromised it may be very easy for the attacker to compromise the rest of the system by exploiting 10+ years of kernel vulnerabilities. Game over. In the face of threats like APT1, where attackers are coming from inside your network, or even just a firewall rule misconfiguration that isn’t caught, the kernel and system software are effectively the last good line of defense on most systems. When they are current, they limit a compromise to the application rather than the whole host, and they prevent the attackers from establishing a beachhead inside your security perimeter where they can compromise other hosts from the inside. As such, it is very important for system software to stay current, especially in an era of virtualization, where physical hardware issues are lessened and worked around with live migration and fault tolerance features. Seeing OSes outlive their vendor support is unfortunately becoming pretty common, as hardware lifespans just don’t provide a natural OS upgrade point anymore.

These guys seem pretty aware that this wasn’t an ideal situation, and I’m not picking on them. In fact, I rather enjoyed the video, because how often do you really see something like this? This is just my occasional opportunity to reiterate that this isn’t how we, as system administrators and IT staff, should be regularly doing business. We should not be encouraging our customers and employers to do business this way, either. High-uptime systems like this become serious liabilities in so many ways, from security to lack of understanding and documentation, that when we discover them we should do what these guys did: shut it down.
