The Eternal Wait For Vendor Software Updates

There’s been a fair amount of commentary & impatience from IT staff as we wait for vendors to patch their products for the OpenSSL Heartbleed vulnerability. Why don’t they hurry up? They’ve had 10 days now, what’s taking so long? How big of a deal is it to change a few libraries?

Perhaps, to understand this, we need to consider how software development works.

The Software Development Life Cycle

To understand why vendors take a while to do their thing we need to understand how they work. In short, there are a few different phases they work through when designing a new system or responding to bug reports.

Requirement Analysis is where someone figures out precisely what the customer wants and what the constraints are, like budget. It’s a lot of back & forth between stakeholders, end users, and the project staff. In the case of a bug report, like “OMFG OPENSSL LEAKING DATA INTERNET HOLY CRAP” the requirements are often fairly clear. Bugs aren’t always clear, though, which is why you sometimes get a lot of questions from support guys.

Design is where the technical details of implementation show up. The project team takes the customer requirements and turns them into a technical design. In the case of a bug the team figures out how to fix the problem without breaking other stuff. That’s sometimes a real art. Read bugs filed against the kernel in Red Hat’s Bugzilla if you want to see guys try very hard to fix problems without breaking other things.

Implementation is where someone sits down and codes whatever was designed, or implements the agreed-upon fix.

The testing phase can be a variety of things. For new code it’s often full system testing, integration testing, and end-user acceptance testing. But if this is a bug, the testing is often Quality Assurance. Basically a QA team is trying to make sure that whoever coded a fix didn’t introduce more problems along the way. If they find a problem, called a regression, they work with the Engineering team to get it resolved before it ships.

Evolution is basically just deploying what was built. For software vendors there’s a release cycle, and then the process starts again.

So what? Why can’t they just fix the OpenSSL problem?

The problem is that in an organization with a lot of coders, a sudden need for an unplanned release really messes with a lot of things, short-circuiting the requirements, design, and implementation phases and wreaking havoc in testing.

Using this fine graphic I’ve borrowed from a Git developer we can get an idea of how this happens. In this case there’s a “master” branch of the code that customer releases are done from. Feeding that, there’s a branch called “release” that is likely owned by the QA guys. When the developers think they’re ready for a release they merge “develop” up into “release” and QA tests it. If it is good it moves on to “master.”

Developers who are adding features and fixing bugs create their own branches (“feature/xxx” etc.) where they can work, and then merge into “develop.” At each level there’s usually senior coders and project managers acting as gatekeepers, doing review and managing the flow of updates. On big code bases there are sometimes hundreds of branches open at any given time.
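If you’ve never worked this way, a rough sketch of that flow in plain git commands looks like the following; the branch names mirror the diagram, and the commit message and tag are purely illustrative:

# A developer branches off "develop" to work on a fix
git checkout develop
git checkout -b feature/openssl-fix

# ...write code, commit, get it reviewed...
git commit -am "Bounds-check the heartbeat payload length"

# A gatekeeper merges the finished work back into "develop"
git checkout develop
git merge --no-ff feature/openssl-fix

# When "develop" looks release-worthy it moves up to "release" for QA,
# and only after QA passes does it land on "master" and get tagged
git checkout release && git merge --no-ff develop
git checkout master && git merge --no-ff release
git tag -a v5.5.0u1a -m "Maintenance release"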

So now imagine that you’re a company like VMware, and you’ve just done a big software release, like VMware vSphere 5.5 Update 1, that has huge new functionality in it (VSAN).[0] There’s a lot of coding activity against your code base because you’re fixing new bugs that are coming in. You’re probably also adding features, and you’re doing all this against multiple major versions of the product. You might have had a plan for a maintenance release in a couple of months, but suddenly this OpenSSL thing pops up. It’s such a basic system library that it affects everything, so everybody will need to get involved at some level.

On top of that, the QA team is in hell because it isn’t just the OpenSSL fix that needs testing. A ton of other stuff was checked in, and is in the queue to be released. But all that needs testing, first. And if they find a regression they might not even be able to jettison the problem code, because it’ll be intertwined with other code in the version control system. So they need to sort it out, and test more, and sort more out, and test again, until it works like it should. The best way out is through, but the particular OpenSSL fix can’t get released until everything else is ready.

This all takes time, to communicate and resolve problems and coordinate hundreds of people. We need to give them that time. While the problem is urgent, we don’t really want software developers doing poor work because they’re burnt out. We also don’t want QA to miss steps or burn out, either, because this is code that we need to work in our production environments. Everybody is going to run this code, because they have to. If something is wrong it’ll create a nightmare for customers and support, bad publicity, and ill will.

So let’s not complain about the pace of vendor-supplied software updates appearing, without at least recognizing our hypocrisy. Let’s encourage them to fix the problem correctly, doing solid QA and remediation so the problem doesn’t get worse. Cut them some slack for a few more days while we remember that this is why we have mitigating controls, and defense-in-depth. Because sometimes one of the controls fails, for an uncomfortably long time, and it’s completely out of our control.

—–

[0] This is 100% speculative; while I have experience with development teams I have no insight into VMware or IBM or any of the other companies I’m waiting for patches from.

8 Practical Notes about Heartbleed (CVE-2014-0160)

I see a lot of misinformation floating around about the OpenSSL Heartbleed bug. In case you’ve been living under a rock, OpenSSL versions 1.0.1 through 1.0.1f are vulnerable to a condition where a particular feature will leak the contents of memory. This is bad, because memory often contains things like the private half of public-key cryptographic exchanges (which should always stay private), protected information, parts of your email, instant messenger conversations, and other information such as logins and passwords for things like web applications.

This problem is bad, but freaking out about it, and talking out of our duffs about it, adds to the problem.

You can test if you’re vulnerable with http://filippo.io/Heartbleed/ — just specify a host and a port, or with http://s3.jspenguin.org/ssltest.py from the command line with Python.

1. Not all versions of OpenSSL are vulnerable. Only fairly recent ones, and given the way enterprises patch you might be just fine. Verify the problem before you start scheduling remediations.
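For example, on an RPM-based distribution you can check what you’re actually running and whether your vendor has already backported a fix without bumping the version string (a sketch; package tools and changelog wording vary by distribution):

# What version is installed?
$ openssl version
$ rpm -q openssl

# Red Hat and others backport fixes, so search the package changelog
# for the CVE rather than trusting the version string alone
$ rpm -q --changelog openssl | grep CVE-2014-0160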

2. Heartbleed doesn’t leak all system memory. It only leaks information from the affected process, like a web server running with a flawed version of OpenSSL. A modern operating system prevents one process from accessing another’s memory space. The big problem is for things like IMAP servers and web applications that process authentication data, where that authentication information will be present in the memory space of the web server. That’s why this is bad, but it doesn’t automatically mean that things like your SSH-based logins to a host are compromised, nor that everything stored on a vulnerable server is exposed.

Of course, it’s always a good idea to change your passwords on a regular basis.

3. People are focusing on web servers being vulnerable, but many other services can be, too: email servers (imapd, sendmail, etc.), databases (MySQL), snmpd, and so on, and some of those services handle sensitive authentication information as well. There’s lots of email that I wouldn’t want others to gain access to, like password reset tokens, what my wife calls me, etc.

4. A good way, under Linux, to see what’s running and using the crypto libraries is the lsof command:

$ sudo lsof | egrep "libssl|libcrypto" | cut -f 1 -d " " | sort | uniq
cupsd
dovecot
dsmc
httpd
imap-logi
java
mysqld
named
nmbd
ntpd
sendmail
smbd
snmpd
snmptrapd
spamd
squid
ssh
sshd
sudo
tuned
vsftpd

This does not list programs that depend on the OpenSSL libraries but aren’t currently running. For those you might try mashing up find with ldd, mixing in -perm and -type a bit, something like the sketch below.
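A rough starting point, assuming your binaries live in the usual places (adjust the paths for your environment, and expect a few false positives from scripts and symlinks):

# Walk common binary directories and ask ldd which executables link
# against libssl or libcrypto, whether or not they're running right now
$ sudo find /usr/bin /usr/sbin /usr/libexec /opt -type f -perm -u+x 2>/dev/null | \
    while read f; do
        ldd "$f" 2>/dev/null | egrep -q "libssl|libcrypto" && echo "$f"
    done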

5. Just because you patched doesn’t mean that the applications using those libraries are safe. Applications load a copy of the library into memory when they start, so replacing the files on disk means almost nothing unless you restart the applications, too. In the lsof output in item #4 all of those processes have a copy of libcrypto or libssl, and all would need a restart to load the fixed version.
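On Linux, one way to spot processes that are still holding the old, now-deleted copy of the library after you’ve patched is lsof again; this is a sketch, and the exact output format varies a bit between lsof versions:

# Anything mapping a libssl/libcrypto file marked deleted was started
# before the patch and still needs a restart
$ sudo lsof -n | egrep "libssl|libcrypto" | egrep "DEL|deleted"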

Furthermore, some OSes, like AIX, maintain a shared library cache, so it’s not even enough to replace it on disk. In AIX’s case you need to run /usr/sbin/slibclean as well to purge the flawed library from the cache so the fixed copy gets read from disk.

In most cases so far I have chosen to reboot the OSes rather than try to find and restart everything. Nuke it from orbit, it’s the only way to be sure.

6. Patching the system libraries is one thing, but many applications deliver libraries as part of their installations. You should probably use a command like find to search for them:

$ sudo find / -name libssl*; sudo find / -name libcrypto*
/opt/tivoli/tsm/client/ba/bin/libssl.so.0.9.8
/opt/tivoli/tsm/client/api/bin64/libssl.so.0.9.8
/home/plankers/pfs/openssl-1.0.1e/libssl.a
/home/plankers/pfs/openssl-1.0.1e/libssl.pc
/usr/lib/libssl.so.10
/usr/lib/libssl.so.1.0.1e
/usr/lib64/libssl.so.10
/usr/lib64/libssl3.so
/usr/lib64/libssl.so
/usr/lib64/pkgconfig/libssl.pc
/usr/lib64/libssl.so.1.0.1e
/opt/tivoli/tsm/client/ba/bin/libcrypto.so.0.9.8
/opt/tivoli/tsm/client/api/bin64/libcrypto.so.0.9.8
/home/plankers/pfs/openssl-1.0.1e/libcrypto.a
/home/plankers/pfs/openssl-1.0.1e/libcrypto.pc
/usr/lib/libcrypto.so.1.0.1e
/usr/lib/libcrypto.so.10
/usr/lib64/libcrypto.so.1.0.1e
/usr/lib64/libcrypto.so.10
/usr/lib64/libcrypto.so
/usr/lib64/pkgconfig/libcrypto.pc

In this example you can see that the Tivoli Storage Manager client has its own copy of OpenSSL, version 0.9.8, which isn’t vulnerable. I’ve got a vulnerable copy of OpenSSL 1.0.1e in my home directory from when I was messing around with Perfect Forward Secrecy. The rest looks like OpenSSL 1.0.1e but I know that it’s a patched copy from Red Hat. I will delete the vulnerable copy so there is no chance something could link against it.

7. If you were running a vulnerable web, email, or other server application you should change your SSL keys, because the whole point is that nobody but you should know your private keys. If someone knows your private keys they’ll be able to decrypt your traffic, NSA-style, or conduct a man-in-the-middle attack where they insert themselves between your server and a client and pretend to be you. Man-in-the-middle is difficult to achieve, but remember that this vulnerability has been around for about two years (April 19, 2012) so we don’t know who else knew about it. The safe assumption is that some bad guys did. So change your keys. Remember that lots of things have SSL keys: mail servers, web servers, Jabber servers, etc.
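If it has been a while since you generated keys, a replacement key and certificate signing request from the openssl command line looks roughly like this (the file names are just examples, and your CA will have its own submission process):

# Generate a new 2048-bit RSA key and a CSR to send to your CA
$ openssl req -new -newkey rsa:2048 -nodes \
    -keyout www.example.com.key -out www.example.com.csr

# Sanity check: the key and CSR moduli should match
$ openssl rsa -noout -modulus -in www.example.com.key | openssl md5
$ openssl req -noout -modulus -in www.example.com.csr | openssl md5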

8. While you’re messing with all your SSL certs, step up your SSL security in general. A great testing tool I use is the Qualys SSL Labs Server Test, and they link to best practices from the results page.

Good luck.

Upgrading to VMware vCenter Server Appliance 5.5 from Windows vCenter 5.1

My coworkers and I recently undertook the task of upgrading our vSphere 5.1 environment to version 5.5. While upgrades of this nature aren’t really newsworthy we did something of increasing interest in the VMware world: we switched from the Windows-based vCenter Server on a physical host to the vCenter Server Appliance, or vCSA, which is a VM. This is the story of that process.

If you aren’t familiar with the vCSA it is a vCenter implementation delivered as a SuSE-based appliance from VMware. It has been around for several major versions, but until vSphere 5.5 it didn’t have both feature parity with Windows and the ability to support very many hosts & VMs without connecting to an external database. Under vSphere 5.5 the embedded database has improved to support 100 hosts and 3000 virtual machines, which easily covers our needs. While my team consists of very capable IT professionals, able to run Windows and MS SQL Server with their proverbial eyes shut and limbs tied behind them, it’d be better if we simply didn’t need to. On top of all of this, upgrades between major vCenter releases on Windows have always been perilous, with full reinstalls the norm. The few major upgrades we’ve done with the vCSA have been pretty straightforward and easy, and when they weren’t we just reverted the snapshot and tried again.

There are still some limitations to the vCSA. It doesn’t support linked mode, because linked mode is built on the Active Directory Application Mode (ADAM) functionality in Windows (which is also why a Windows vCenter cannot reside on a domain controller). We don’t use linked mode because it makes the environment more complicated, without much return on investment for the time we would spend dealing with the additional complexity. The vCSA doesn’t support vCenter Heartbeat, either. We don’t use Heartbeat because it’s fairly expensive, and if our vCenter servers are virtual machines we can use snapshots, replication, HA, and DRS to help minimize possible downtime.

Last, the vCSA doesn’t include Update Manager support, so you still need a Windows guest to run it, and, if you follow directions, a MS SQL Server, too. We thought about those directions, and how we actually use Update Manager. We use Update Manager to keep our infrastructure updated, but it isn’t critical to our operations, and our Update Manager configuration isn’t complicated (the default baselines, add the Dell OpenManage depot URL, upload a couple of custom ESXi boot images, Dell EqualLogic MPMs, and newer Broadcom drivers for our blades). Coupled with the ability to take snapshots, and our use of Veeam Backup & Replication to back the whole thing up, what would we lose if it was down for a day? (Nothing, we plan our patching in advance). Does anybody but my team rely on it? (No.) What would we lose if we had to rebuild it from scratch or restore it from backup? (About an hour of someone’s time). Are we concerned with SQL database performance for this application? (No, we run scans and remediations asynchronously — we start them and walk away). Given this, we decided to build a Windows VM for each of our vCSAs to run Update Manager, and we would use the MS SQL Server Express database it offers to install for non-production use. Easy.

While it is possible to run vCenter inside the cluster it manages, not all the VMware components support that as well as vCenter does. As a result, VMware best practices for vCloud instances, and likely many other things going forward, now include the idea of a separate “management cluster.” This cluster should be a simple, independent cluster that houses all the management components for a site, like vCenter, vCloud Director, Chargeback, Site Recovery Manager, Infrastructure Navigator, etc. Not only does this simplify administration, it helps organizations properly compute the overhead costs associated with their environments, and it makes some aspects of licensing and business continuity & disaster recovery easier. Since we were redoing all our management infrastructure anyhow we decided it would be a good time to implement this. It looks something like:

VMware Environment Design with Management Cluster

There isn’t an official upgrade process to move from Windows vCenter to the vCSA, so we had to come up with our own. What we’ve done in the past is disconnect an ESXi host from vCenter with all the VMs running, and add it to another vCenter somewhere else. When we tested that we found a big snag: the vSphere Distributed Switches (vDS) disappeared. In vSphere 5.1 VMware added the ability to export a vDS configuration and import it somewhere else, which, in theory, should have made this easy. When we did that export/import and then reconnected our ESXi hosts the vDS on the host didn’t mate up with vCenter’s vDS, erasing the vDS on the host and leaving our VMs with no network. Not good.

As it turns out, there is a bug in vSphere 5.1 that prevents this from working correctly, which has been fixed in vCenter 5.1 Update 2. Our vCenter was 5.1 Update 1, and because Windows vCenter upgrades are often a crapshoot we didn’t feel like wasting a ton of our staff time getting to Update 2. Most of our network links are redundant, and standard virtual switches import seamlessly. So, using a bunch of PowerCLI commands we moved the redundant NICs to a new standard vSwitch and recreated the tagged VLAN port groups.

Our general plan became:

  1. Build the new management cluster first, get that set up, tested, and debugged. This also gives people a chance to upgrade clients and whatnot. Deploy a Veeam backup proxy into the management cluster so you can back the new appliances up.
  2. Get the new production cluster vCSA deployed, get authentication working, and duplicate the clusters, resource pools (enable DRS in partial mode), folder structure, and permissions. This was also a good time to work through some of the vSphere Hardening Guide, dealing with SSL, resetting service account passwords to long random strings, and ensuring there is a service account for each add-on (vCOPS, Veeam, VIN, etc.).
  3. Document resource pool configurations, as the cutover process will mess with them and you want to know the way they were set up originally.
  4. Document HA exceptions and settings.
  5. Document all DRS rules and groups for re-creation on the new vCSA (you can’t create rules until vCenter sees the VMs).
  6. Import a copy of the vSphere Distributed Switches, because even if we couldn’t use them straight up it made rebuilding easier. Resist the urge to upgrade them to 5.5 at this point — remember that you’ll be importing ESXi 5.1 hosts which can’t participate in a newer vDS. We also audited the port group configurations at this time.
  7. Set Update Manager up so we could do ESXi 5.5 upgrades.
  8. Verify all physical network port configurations. We actually didn’t do this, trusting that they’d been set up correctly by our Networking group. We discovered, the hard way, that at some point some of our ports became misconfigured through human error (switchport trunk allowed vlan vs. switchport trunk allowed vlan add — under Cisco IOS the word “add” is very significant), and others through configuration rot. As you’d expect, this caused outages when VMs were migrated to those ports. It’s an easy fix: repair the ports, put the VMs back on the primary NIC, or put the primary NIC in the standard vSwitch temporarily. I suggest you trust but verify. Actually, I suggest you automate and remove the humans from the process altogether.
  9. One day before, remove all extra infrastructure components (Infrastructure Navigator, vC Ops, NetApp Virtual Storage Console, etc.) from the old vCenter. There may be a way to keep vCenter Operations Manager going and just move it, but in our testing it lost track of the VMs that moved, even when it could see them on a different vCenter. So we just dumped the reports we wanted, documented the customizations, and planned to start fresh on the other side.
  10. One day before, split the networking and move all VMs to the standard virtual switches. Use PowerCLI to reduce time and errors. Isolate workloads that do not have redundant networking or rely on a vDS feature to one host that can stay on the old vCenter until a future scheduled outage window. I would suggest using the backup or secondary links for the standard vSwitch. Why? When you add a host to a vDS you’ll be prompted to specify the uplink NIC for that host. vCenter will assign that NIC to the first uplink slot. You can save some work by choosing wisely in this step.
  11. Remove the ESXi hosts from the vDSes.
  12. Day of the upgrade, disable vSphere Replication and Veeam Backup & Replication. We aren’t using these heavily, relying on array-based replication for most of our stuff. If you care about this you will definitely want to test this more than I did.
  13. Disable HA on the old vCenter (we didn’t want something we did to trigger it, and we’d be online anyhow to restart VMs if something went wrong).
  14. Cripple DRS by putting it into manual mode. Don’t ever disable DRS — your resource pools will go away.
  15. One at a time, disconnect (don’t remove) the ESXi hosts from the old vCenter, and add them to the vCSA. We asked it to keep the resource pools, grafting them into the root resource pool. This operation seems to mess with the resource pool settings a bit so you want to have already created good resource pools as part of step 2, and then you can just move the VMs out of one and into the other.
  16. Move all ESXi hosts to the vCSA except the host that has workloads with specific networking needs. Get them organized into clusters.
  17. Sort out resource pools.
  18. Recreate DRS & HA rules & customizations.
  19. Readd the ESXi hosts to the vCSA vDSes. Migrate VMs back to the vDS, then remove the standard vSwitch and re-add that NIC to the secondary uplink.
  20. Fix & restart Veeam Backup & Replication & vSphere Replication.
  21. During the wee hours of the next morning we moved the ESXi hosts with the specialized networking on them. We’d documented the networking configurations so we could rapidly rebuild them when they lost their vDS configurations, minimizing the outage.
  22. Remove all ESXi hosts from the old Windows vCenter. I like doing this in case I have to restart the old box; I don’t want that old vCenter getting ideas about my hosts. Shut the old vCenter down. I also disabled the services so they wouldn’t restart, or cause alarms (we have a monitoring check to see if any service set to “Automatic” isn’t running).
  23. Re-enable HA. Put DRS back in fully automated mode. Resolve issues.
  24. Move any service DNS names (A or CNAME records) to where they need to go. We did this earlier in the process and discovered that the old vCenter used some of these names internally. It crashed and halted our upgrade for 45 minutes while we switched them back and restarted the old vCenter. We might have been able to hack it with a hosts file entry or some DNS trickery but it wasn’t worth it, as all the new vCSAs had their own DNS entries.
  25. Wait for a day or two to make sure everything is stable. If it isn’t you still have a working vCenter 5.1, and you haven’t upgraded your hosts yet.
  26. Upgrade your hosts to 5.5 using Update Manager.
  27. Upgrade your vDS to 5.5.
  28. Switch scripts & third-party apps to use the new vCSAs. You might also be able to do this earlier in the process, depending on the tool.

Bask in the glory of vSphere 5.5 running as a vCSA. It’s a lot of steps but mostly straightforward. It is also a good opportunity to learn how to script a lot of this if you are unfamiliar with it. For a couple of these steps I just used PowerCLI one-liners and populated them with Excel’s autocomplete, pasting them into the PowerCLI window as I needed them. Crude? Yes. Effective? Yes. I’m way better in C, C++, and Perl than I am in PowerShell. Also, I owe Alan Renouf and Luc Dekens beer.

Have we had any issues so far? Yes! No major upgrade is without a bunch of little problems.

  • One of our hosts crashed a few months ago from a hardware fault, and during the upgrade when we moved VMs around we were getting disconnected NICs on VMs. When we tried to reconnect them we’d get “Invalid configuration for device ‘0’” and the NIC wouldn’t reconnect. A temporary fix is to reassign that NIC to a different port group, save it, then reassign it to the correct port group and reconnect it. The KB indicates that this is a problem with the cached information for the vDS but that it’s fixed in 5.1. I don’t think it is. Nevertheless, we just split the primary & backup NICs again, created a new vDS, and moved all the VMs a third time.
  • We are also having a hell of a time with the VMware Tools under 5.5, where it just deletes the VMXNET3 driver and quits, leaving our Windows VMs without a NIC. We have a support case open on that one, moving incredibly slowly. I doubt that’s a problem with the vCSA, just the typical craptastic Tools installer, probably not checking return codes or handling error conditions well (not that I’m bitter, every hour my team spends dealing with Windows Tools issues is an hour out of our lives we never get back).
  • Lastly, there’s a bug with the vCSA that causes the console functionality in the web client to fail, citing “Could not connect to x.y.z.a:7331.” There is a nice KB article on how to fix it that my team found, and I’m hoping that it’s fixed permanently in 5.5 Update 1. I know bugs happen, but I don’t like it when we customize appliances.

So, in conclusion, I hope this helps someone. Feel free to use the comments here to add upgrade tips if you’ve been through it or correct glaring mistakes I’ve made in writing this up. That’s why I like you folks. :)

What Clients Don't Know (and Why It's Your Fault)

“Whether you work with outside clients or whether you’re part of an internal team your job is always, always going to include having to convince someone of something. Because your job isn’t just making things. Believe it or not, that’s the easy part. You’re going to spend 90% of your time convincing people that shit you thought up in the shower this morning is right. Your job is to figure out whether something should be made, how it’s made, and always, always, always work to convince someone that you’ve made the right choices.”

That’s a quote from Mike Monteiro’s presentation at An Event Apart Austin 2013, a presentation that seems suited to system administrators, IT consultants, and IT professionals in general. Thing is, the presentation is actually talking to designers, about designers. But design is a client services business, just like IT, whether we act like it or not (often the client is just the organization itself). We IT guys can definitely learn a thing or two from our design brethren, and watching this is a great start. To whet your appetite here are three of the hundred or so points he makes:

  • “When you’re afraid to make [an argument] with your clients what you really are saying is that I think you’re too dumb to understand, and they are not! Put them at ease by letting them know that you’re not just going to do whatever they ask for. Clients need to know that you’re confident enough to not even let them screw up a project. Never work for somebody you can’t argue with. And definitely never work for somebody that you can’t say no to.”
  • “Design is the solution to a problem within a set of constraints. There is no bigger constraint that you have than your budget.”
  • “Make sure that you talk to everybody on their side, and always, always, as a professional courtesy, find out if there is another designer in the vicinity and announce yourself. That other designer is going to have a lot of valuable information for you, and projects go better when somebody isn’t feeling butthurt for being left out. And if you are that internal designer don’t be a jerk. Make the people from the outside your friends. Don’t make competing comps to show your boss behind their back… Crap like that helps no one and only serves to jeopardize the project.”

Watch it. Now. All of it. In HD. At work, because it’s 100% professional development, which we all need more of, and your boss didn’t have to send you to another state to get it. Then share the link.

What Clients Don’t Know (and Why It’s Your Fault) by Mike Monteiro – An Event Apart Austin from Jeffrey Zeldman on Vimeo.

Hat tip to Alex King.

Update to VMware vCenter Server Appliance & NTP Issues

Earlier today I posted “VMware vCenter Server Appliance 5.5.0 Has An Insecure NTP Server.” One of the reasons I like VMware is that they’re responsive to customer issues. This situation is no different. I just spoke with a few guys involved in VMware security, and this is what I’ve learned.

1. There has been mitigation information available internally to VMware Support/GSS since shortly after the vulnerability was published.

If you call VMware Support your best bet is to reference the CVE number, CVE-2013-5211. I have not called VMware Support to confirm this, or to verify that they’re able to properly resolve the issue if you don’t reference the CVE number. In the future I’ll make sure to reference the CVE number if a problem I’m dealing with has one.

2. If you do not have NTP time sync enabled you are not susceptible. If you are using Integrated Windows Authentication on your vCSA you are not susceptible.

With AD the appliance syncs to the time on the domain controllers. These now become alternate remediation paths, though you probably shouldn’t shut NTP time sync off without syncing another way.

3. There is public KB information available on remediating this problem, in the “Timekeeping best practices for Linux guests” article (KB 1006427).

The fix is really inconspicuous, at the bottom, in red. The VMware folks and I thoroughly disagree on whether a passing generic mention of how to fix this is an adequate way to inform affected customers of a security issue in a particular shipping product. Their point is that they didn’t have a fix for it yet, and did not want to alarm customers or give attackers more information. My point is that the attackers have already found these open NTP servers by scanning the whole internet for the last month, so there’s no more damage to be done in that regard. Letting people know they’re probably affected and that it can be simply mitigated is a responsible thing to do, especially since there are likely folks out there that don’t know this is even happening to them.

Furthermore, by not publishing a real, public KB article they are doing a disservice to Windows-oriented system administrators who likely are unfamiliar with Linux and the vi editor. Sysadmins don’t like calling support when we can just fix something ourselves and get on with our lives, and it’d be nice to afford sysadmins more familiar with Windows than Linux the same opportunity.

4. There is a pending fix for these issues in all affected VMware products and the fixes will ship soon.

It sounds like we won’t have to wait too long for a real fix for the immediate problem. We also discussed the need for the other things I asked for, like proper firewalling and control of said firewall via the management interface (VAMI on port 5480), and that’s being taken back to Engineering.

5. Concerns about security issues can always be sent to security@vmware.com, which is always staffed and responsive.

I have not tried emailing them, yet, but after my conversations this afternoon I fully believe it.

I’ve updated my original post to reflect some of this new information. To be clear, I’m not at all angry about this. I’m mostly just disappointed, as I expected more public disclosure for a very public vulnerability, especially since mitigation techniques were available. I love being able to search the KB, find a fix for a problem, and move on. That wasn’t really the case here, and I hope this post goes some way towards explaining to all those inside VMware what I’m thinking as a long-time customer, vExpert, and usually friendly extroverted blogger. :)

A big thanks to my old friends in VMware for helping out with this, and my new friends in VMware Security!

VMware vCenter Server Appliance 5.5.0 Has An Insecure NTP Server

Update: I have updated this article to reflect some new information provided by VMware. I have also published new notes and discussion as a separate blog post.

On January 10, 2014 a vulnerability in ntpd, the Network Time Protocol daemon, was made public (US CERT VU#348126):

UDP protocols such as NTP can be abused to amplify denial-of-service attack traffic. Servers running the network time protocol (NTP) based on implementations of ntpd prior to version 4.2.7p26 that use the default unrestricted query configuration are susceptible to a reflected denial-of-service (DRDoS) attack. Other proprietary NTP implementations may also be affected.

I have encountered several vCenter Server Appliances, version 5.5.0 build 1476327 and older, that were exposed to the general Internet and had this vulnerability. In these cases they were participating in DDoS attacks.
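If you want to check a host yourself, the query the attackers abuse is “monlist,” and you can send one with ntpdc. A vulnerable, unrestricted server returns a list of recent clients; a patched or restricted one returns nothing or times out. The hostname here is just a placeholder:

# Ask the NTP daemon for its monitor list (the amplification vector)
$ ntpdc -n -c monlist vcsa.example.com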

Yesterday I looked to the VMware KB to see if there were any security updates for these vCSAs, or mitigation approaches. Despite the vulnerability being over a month old there is no mention of it from VMware, nor is there a fix of any sort. The vulnerability probably extends to older versions of VMware ESX, too, if you are using NTP on them (as per best practices).

If you are running a vCenter Server Appliance I strongly suggest that you open a case with VMware Support regarding this problem. They have internal KB information about mitigating this. Ask them to search for CVE-2013-5211.

If you want to mitigate this problem on your own there are two ways to do it. First, VMware actually has public KB information in 1006427. It’s just buried (search that KB for CVE-2013-5211). Follow my steps below to edit the file and add their information.

If you want to mitigate the problem in a completely unsupported manner, but the one recommended by SANS and other organizations, you can SSH into the vCSA as root, and add “disable monitor” to /etc/ntp.conf. You can do this with the following steps:

  1. vi /etc/ntp.conf
  2. Move the cursor using the arrow keys to just below the entry called “driftfile /var/lib/ntp/drift/ntp.drift”
  3. Type an ‘i’ to put vi into insert mode. Don’t type the single quotes I use here, just the letter i.
  4. Type “disable monitor” and hit Enter.
  5. Type ‘ESC’ to get vi out of insert mode.
  6. Type ‘:wq’ to get vi to write the file and quit.
  7. service ntp restart
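If you would rather not drive vi by hand, or you have several appliances to fix, the same edit can be scripted. This is a sketch, so back up ntp.conf first and eyeball the result before restarting the service:

# Add "disable monitor" after the driftfile line if it isn't there already
cp /etc/ntp.conf /etc/ntp.conf.bak
grep -q "^disable monitor" /etc/ntp.conf || \
    sed -i '/^driftfile/a disable monitor' /etc/ntp.conf
service ntp restart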

As with all problems like this we should ask why this happened in the first place. My questions to VMware are:

  • Why is there an open NTP server running on the vCSA at all? I understand that when I configure NTP it will start an NTP daemon to keep the time in sync, but my expectation is that it would be completely firewalled, given that it isn’t intended to be a real NTP server.
  • Why is the NTP configuration (/etc/ntp.conf) on the vCSA not secured?
  • Why are there no firewall controls for any virtual appliances available through the web interface at port 5480? I should be able to control what services are open or not, and to what IPv4 and IPv6 addresses. This should be basic functionality in the Appliance Studio, so that all appliances built with it get these basic security features.
  • Why has it been a month since the publication of a security vulnerability that affects VMware products, and days since DDoS attacks started, yet there has been no action taken and no patches released?

These are basic & obvious security measures, and part of the common security practice called “defense in depth.” VMware, you have failed in a few different ways. Fix it.

Constructive comments are welcome. Please don’t tell me that the fix for this is to use an external firewall. In the short term that may be the case, but real security on all the appliances delivered as part of a software package should be the goal, as well as proper security response from our vendors.

The Lone Bookshelf: The Macintosh Way by Guy Kawasaki

(This is the inaugural post of my Lone Bookshelf series. Find more posts using the “Books” category)

Last summer my family moved to a different house. By itself, moving isn’t that big of a deal. Take everything out of the old house, put it on a truck, unload it into the new house. What is a big deal is sorting. At the old house all of our stuff had a place, carefully curated and filed and sorted and stored. At the new place our stuff had piles in the middle of rooms. Ugh.

The Macintosh Way Cover

I have three large bookshelves from my college years that needed a new home in our new home. Bookshelves are a particularly pernicious piece of furniture. By themselves they are bulky. Fulfilling their purpose makes them very heavy. Somewhere around the third heavy box of books up three flights of stairs I asked myself why I was doing this. Was I bragging to people that this is how well-read I am? Did I use them as reference? Perhaps I just needed ballast, worried that the house would tip over in a particular direction.

Right then I started sorting my books, keeping only those that were influential or inspiring, or that I still use as references. Now, as I am finally getting my home office & lab under control, I am going through what I kept to find which ones have newer editions and electronic versions. One of these books is Guy Kawasaki’s “The Macintosh Way,” which Mr. Kawasaki has recently offered as a free electronic copy in PDF, MOBI, or EPUB formats.

Mr. Kawasaki was Apple Computer Corporation’s chief evangelist back when the Macintosh was first being developed. His job, put crudely, was to get software companies to develop software for a platform that didn’t really exist yet. As Calxeda discovered (the hard way) and ARM is working to mitigate, a new technology is only as good as the software that runs on it. Unfortunately, when there isn’t much software for a platform there isn’t much interest from consumers. The same was true of the Macintosh. In order to sell Macs they needed software. In order to get software they needed a customer base. It was Kawasaki’s job to fix that paradoxical situation.

The book is funny and sarcastic. It uses eclectic references, talks a lot about early Apple, and mocks Apple a lot (“What is the difference between Apple and the Cub Scouts?” The answer: “The Cub Scouts have adult supervision.”). It also talks a lot about how Apple picked up the Hewlett-Packard Way baton as a way to do things right and treat customers well, how to hire the right people, how to build community, how to do a good demo, etc. Despite the book being 24 years old the topics, exercises, and commentary are incredibly relevant to those of us working in the IT industry.

It’s a fast read and the sections are perfect for reading while flying, as you can put it down and pick it up quickly. The price is free if you grab the electronic copy from the tweet above. Otherwise Amazon has it, too. I’ve had my copy since it was first published, and I’m attached to it, so it’s going back on my bookshelf again. :)

New Java Security Settings: More Proof That Oracle Hates You

I began the day yesterday updating to Java 7u51, after which absolutely none of my enterprise Java applications worked anymore. I could not reach the consoles of my Rackspace cloud servers. I could not open the iDRAC console on my Dell PowerEdge. They all exited with some error about the Permissions attribute not being set. Being the guy that I am, I decided to search for the error. Turns out that 7u51 sneaks a major change into a point release: on the default Java security slider setting of “high” no applet may run if it’s self-signed, unsigned, or is missing the Permissions attribute.

Unfortunately, that describes all enterprise software, at least all the current versions of things I’m using.

This isn’t a trivial change. This is the sort of change that accompanies a major version, heralded far and wide for months, with customers given a choice about deployment and testing. Is that what happened? No, because this release is also a security update. So people across the globe autoupdate and suddenly can’t do anything, because absolutely no Java applets meet these criteria (probably not even Oracle’s own).

So into the Java control panel we go:

Java Control Panel, 7u51

What sort of company labels the bottom part of a three-position slider “medium” when the description is “least secure?” Oh, a disingenuous one, that’s right.

The fix is basically to disable security, either globally by moving the slider (as I did, because I’m not a moron and can tell what the security prompt is for)[0] or for specific sites (like my entry for mycloud.rackspace.com). Of course, none of this is really what I want. I don’t want to trust mycloud.rackspace.com implicitly, because I don’t want just any applet running from there. I only want the console applet that I requested. I don’t want to lower all my security settings, either, but I’m going to, because I need to do my job.

Assuming that Oracle is trying to fix some legitimate problem, they’ve now completely bungled their shot at it. By changing defaults in what is essentially a point release they’re ensuring that no software has been updated to conform to their new standards, and users will have to change the security settings to simply continue doing their job. The right time and place for a change like this is a major version release, when all other parts of the support ecosystem already need to test and recertify against the new version.

Instead, it’s a mess, which is just par for the course when working with Oracle.

——-

[0] Pre-emptive snarky comment: “Well, that’s the problem they’re trying to fix, people are morons.” My coworkers and I have a saying, “you cannot fix people problems with technology.” This is squarely a people problem, and the “fix” here doesn’t make it less of a people problem because they botched it. Besides, if I’m an attacker I’ll just recompile my malicious applet with a Permissions manifest and go back to slurping up your credit card numbers. It wouldn’t surprise me to learn that malicious apps are already updated.

Redundant Gigabit Management NICs, Please

I’ve been doing a lot of system design work lately, building virtualization infrastructure for places where there is no pre-existing infrastructure available (also known as the revered “green field” deployment). One of the biggest issues I’ve had is that 10 Gbps switches can fall back to 1 Gbps when the proper transceiver is installed, but they cannot go down to 10 or 100 Mbps.

“So what?” you ask. “Nobody in their right mind uses 10 or 100 Mbps anymore.”

Management interfaces do, because the manufacturers haven’t bothered to update them to triple speed NICs (10/100/1000 Mbps). The Dell PowerVault 124T tape library can only do 10/100 Mbps. Brocade fibre channel switches, including their newest models, only have 10/100 Mbps capabilities on their management NIC.

Because of this, when I’m designing a new environment, instead of putting two 10 Gbps switches out in the field I now need at least three switches: two 10 Gbps switches and something that can do 10/100 Mbps.

“Again, so what?” you say. “Switches like that are a dime a dozen, and everybody uses old 10/100 switches for management.”

Yes, but I don’t have a 10/100 switch available at that site. So now I have to spend money to acquire one, and spend money to pay someone to configure it, maintain it, keep it on a service contract, monitor it, have it consume 1U of space, etc. If it had a NIC that could do 10/100/1000 I could plug it right into a leftover port on my nice big, monitored, already-there-and-configured 10 Gbps switches and move on with my life. Even the cheap Cisco Linksys desktop switches, available from Best Buy for $99, have 10/100/1000 available. Why doesn’t my $40,000 fibre channel switch?

On top of all that, why isn’t there redundancy for the management NIC on some of my equipment? My day is bad enough when I lose or misconfigure a switch. Not being able to reach other equipment during a crisis limits my options. I don’t like limited options, especially when the equipment is five hours away.

I’ve singled out Dell and Brocade a bit, both here and with my comments on Twitter, but remember that I know their products very well. They are not the only folks that have this problem. Vendors, if you have a copper management NIC on your device please upgrade it to redundant, gigabit-capable NICs.

Better Linux Disk Caching & Performance with vm.dirty_ratio & vm.dirty_background_ratio

This is post #16 in my December 2013 series about Linux Virtual Machine Performance Tuning. For more, please see the tag “Linux VM Performance Tuning.”

In previous posts on vm.swappiness and using RAM disks we talked about how the memory on a Linux guest is used for the OS itself (the kernel, buffers, etc.), applications, and also for file cache. File caching is an important performance improvement, and read caching is a clear win in most cases, balanced against applications using the RAM directly. Write caching is trickier. The Linux kernel stages disk writes into cache, and over time asynchronously flushes them to disk. This has a nice effect of speeding disk I/O but it is risky. When data isn’t written to disk there is an increased chance of losing it.

There is also the chance that a lot of I/O will overwhelm the cache, too. Ever written a lot of data to disk all at once, and seen large pauses on the system while it tries to deal with all that data? Those pauses are a result of the cache deciding that there’s too much data to be written asynchronously (as a non-blocking background operation, letting the application process continue), and switches to writing synchronously (blocking and making the process wait until the I/O is committed to disk). Of course, a filesystem also has to preserve write order, so when it starts writing synchronously it first has to destage the cache. Hence the long pause.

The nice thing is that these are controllable options, and based on your workloads & data you can decide how you want to set them up. Let’s take a look:

$ sysctl -a | grep dirty
 vm.dirty_background_ratio = 10
 vm.dirty_background_bytes = 0
 vm.dirty_ratio = 20
 vm.dirty_bytes = 0
 vm.dirty_writeback_centisecs = 500
 vm.dirty_expire_centisecs = 3000

vm.dirty_background_ratio is the percentage of system memory that can be filled with “dirty” pages — memory pages that still need to be written to disk — before the pdflush/flush/kdmflush background processes kick in to write it to disk. My example is 10%, so if my virtual server has 32 GB of memory that’s 3.2 GB of data that can be sitting in RAM before something is done.

vm.dirty_ratio is the absolute maximum amount of system memory that can be filled with dirty pages before everything must get committed to disk. When the system gets to this point all new I/O blocks until dirty pages have been written to disk. This is often the source of long I/O pauses, but is a safeguard against too much data being cached unsafely in memory.

vm.dirty_background_bytes and vm.dirty_bytes are another way to specify these parameters. If you set the _bytes version the _ratio version will become 0, and vice-versa.
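On virtual machines with a lot of RAM the percentage knobs get coarse; 1% of 32 GB is already 320 MB of dirty data. The byte-based variants give finer control, something like this (the values are purely illustrative):

# Start background writeback at 64 MB of dirty data, force synchronous
# writes at 256 MB; setting the _bytes knobs automatically zeroes _ratio
$ sudo sysctl -w vm.dirty_background_bytes=$((64 * 1024 * 1024))
$ sudo sysctl -w vm.dirty_bytes=$((256 * 1024 * 1024))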

vm.dirty_expire_centisecs is how long something can be in cache before it needs to be written. In this case it’s 30 seconds. When the pdflush/flush/kdmflush processes kick in they will check to see how old a dirty page is, and if it’s older than this value it’ll be written asynchronously to disk. Since holding a dirty page in memory is unsafe this is also a safeguard against data loss.

vm.dirty_writeback_centisecs is how often the pdflush/flush/kdmflush processes wake up and check to see if work needs to be done.

You can also see statistics on the page cache in /proc/vmstat:

$ cat /proc/vmstat | egrep "dirty|writeback"
 nr_dirty 878
 nr_writeback 0
 nr_writeback_temp 0

In my case I have 878 dirty pages waiting to be written to disk.
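If you want to watch those counters move while you generate some test I/O, something like this works (a sketch):

# Watch the dirty and writeback counters update every second
$ watch -n1 'egrep "dirty|writeback" /proc/vmstat'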

Approach 1: Decreasing the Cache

As with most things in the computer world, how you adjust these depends on what you’re trying to do. In many cases we have fast disk subsystems with their own big, battery-backed NVRAM caches, so keeping things in the OS page cache is risky. Let’s try to send I/O to the array in a more timely fashion and reduce the chance our local OS will, to borrow a phrase from the service industry, be “in the weeds.” To do this we lower vm.dirty_background_ratio and vm.dirty_ratio by adding new numbers to /etc/sysctl.conf and reloading with “sysctl -p”:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 10

This is a typical approach on virtual machines, as well as Linux-based hypervisors. I wouldn’t suggest setting these parameters to zero, as some background I/O is nice to decouple application performance from short periods of higher latency on your disk array & SAN (“spikes”).
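It’s also worth trying values at runtime before committing them to /etc/sysctl.conf; sysctl -w changes take effect immediately and don’t survive a reboot, so a bad guess is easy to back out. A sketch:

# Try the new values live
$ sudo sysctl -w vm.dirty_background_ratio=5
$ sudo sysctl -w vm.dirty_ratio=10

# Happy with the behavior? Make it permanent.
$ echo "vm.dirty_background_ratio = 5" | sudo tee -a /etc/sysctl.conf
$ echo "vm.dirty_ratio = 10" | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p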

Approach 2: Increasing the Cache

There are scenarios where raising the cache dramatically has positive effects on performance. These situations are where the data contained on a Linux guest isn’t critical and can be lost, and usually where an application is writing to the same files repeatedly or in repeatable bursts. In theory, by allowing more dirty pages to exist in memory you’ll rewrite the same blocks over and over in cache, and just need to do one write every so often to the actual disk. To do this we raise the parameters:

vm.dirty_background_ratio = 50
vm.dirty_ratio = 80

Sometimes folks also increase the vm.dirty_expire_centisecs parameter to allow more time in cache. Beyond the increased risk of data loss, you also run the risk of long I/O pauses if that cache gets full and needs to destage, because on large VMs there will be a lot of data in cache.

Approach 3: Both Ways

There are also scenarios where a system has to deal with infrequent, bursty traffic to slow disk (batch jobs at the top of the hour, midnight, writing to an SD card on a Raspberry Pi, etc.). In that case an approach might be to allow all that write I/O to be deposited in the cache so that the background flush operations can deal with it asynchronously over time:

vm.dirty_background_ratio = 5
vm.dirty_ratio = 80

Here the background processes will start writing right away when it hits that 5% ceiling but the system won’t force synchronous I/O until it gets to 80% full. From there you just size your system RAM and vm.dirty_ratio to be able to consume all the written data. Again, there are tradeoffs with data consistency on disk, which translates into risk to data. Buy a UPS and make sure you can destage cache before the UPS runs out of power. :)

No matter the route you choose you should always be gathering hard data to support your changes and help you determine if you are improving things or making them worse. In this case you can get data from many different places, including the application itself, /proc/vmstat, /proc/meminfo, iostat, vmstat, and many of the things in /proc/sys/vm. Good luck!