How to Disable Windows IPv6 Temporary Addresses

The default Microsoft Windows IPv6 implementation has privacy extensions enabled, where IPv6 temporary addresses are used for client activities. The idea is that IPv6 has so many addresses available to it that we can create extra ones to help mask our activities. In practice these temporary addresses are largely pointless, and are very unhelpful if firewalls and ACLs are configured to allow access from a specific static address.

By themselves, IP addresses aren’t a good way to authenticate people but they often form another layer of defense. This is especially important for IT infrastructure where there often aren’t (or can’t be) sophisticated authentication mechanisms.

Paste these commands into an administrator-level PowerShell or Command Prompt and then restart your PC:

netsh interface ipv6 set global randomizeidentifiers=disabled
netsh interface ipv6 set privacy state=disabled
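
After the restart, one way to sanity-check the result is to confirm that the PC’s address now ends in an interface identifier derived from the network adapter’s MAC address (the modified EUI-64 format) rather than a random one. The Python sketch below is only an illustration: the function names, the sample address, and the sample MAC are made up for the example, and it assumes Windows falls back to EUI-64 identifiers once randomizeidentifiers is disabled.

```python
import ipaddress


def eui64_interface_id(mac: str) -> int:
    """Return the 64-bit modified EUI-64 interface identifier for a MAC address."""
    octets = bytes(int(part, 16) for part in mac.replace("-", ":").split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Flip the universal/local bit of the first octet and insert ff:fe
    # between the two halves of the MAC (RFC 4291, Appendix A).
    eui = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:]
    return int.from_bytes(eui, "big")


def uses_mac_derived_id(address: str, mac: str) -> bool:
    """True if the low 64 bits of the IPv6 address match the MAC's EUI-64 identifier."""
    low_64_bits = int(ipaddress.IPv6Address(address)) & ((1 << 64) - 1)
    return low_64_bits == eui64_interface_id(mac)


# Hypothetical values; substitute what ipconfig /all reports on your PC.
print(uses_mac_derived_id("2001:db8::211:22ff:fe33:4455", "00-11-22-33-44-55"))
```

If the check fails for every address on the adapter, the settings probably have not taken effect yet, or the address came from DHCPv6 or manual configuration instead.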

I also disable Teredo tunneling, so my traffic isn’t going places I don’t know about:

netsh interface teredo set state disable

Good luck!

Should We Panic About the KPTI/KAISER Intel CPU Design Flaw?

As a followup to yesterday’s post, I’ve been asked: should we panic about the KPTI/KAISER/F*CKWIT Intel CPU design flaw?

My answer was: it depends on a lot of unknowns. There are NDAs around a lot of the fixes so it’s hard to know the scope and effect. We also don’t know how much this will affect particular workloads. The folks over at Sophos have a nice writeup today about the actual problem (link below) but in short, the fix will reduce the effectiveness of the CPU’s speculative execution and on-die caches, forcing it to go out to main memory more. Main memory (what we call RAM) is 20x slower than the CPU’s L2 cache (look below for a good link showing the speed/latency differences between computer components). How that affects driver performance, workloads, I/O, and so on is hard to tell now.
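
To get a feel for why a lower cache hit rate matters, here is a rough back-of-the-envelope model in Python. It uses only the 20x ratio mentioned above. The hit rates are invented for illustration and are not measurements of any patch, and real CPUs have more cache levels than this two-tier model assumes.

```python
def average_access_cost(hit_rate: float, cache_cost: float = 1.0,
                        memory_ratio: float = 20.0) -> float:
    """Average cost of one memory access, in units of one L2 cache hit."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    memory_cost = cache_cost * memory_ratio
    # Hits are served from cache; misses pay the full trip to main memory.
    return hit_rate * cache_cost + (1.0 - hit_rate) * memory_cost


# Illustrative hit rates only.
for hit_rate in (0.99, 0.95, 0.90):
    cost = average_access_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: {cost:.2f}x the cost of a cache hit")
```

In this simplified model, dropping from a 95% to a 90% hit rate raises the average access cost from about 1.95 to about 2.9 cache-hit units, which is why a small change in cache effectiveness can show up as a large change in performance.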

Here’s what I think, based on my experience with stuff like this:

First, there are some people out there with gaming benchmarks saying there’s no performance impact. They’re benchmarking the wrong thing, though. This isn’t about GPUs, it’s about CPUs, and the frame rate they can get while killing each other online is mostly dependent on the Graphics Processing Unit, or GPU.

If you use physical servers that are only accessed by a trusted team, and you have excess capacity, then you should remain calm. Doubly so if you have a test environment and/or can simulate production workloads. Don’t panic; apply your security updates according to your regularly scheduled process.

If you own virtual infrastructure and your company is the only user of it, insofar as everything from the hardware to the applications is run by the same trusted group of admins, don’t panic. Plan to use your normal patching process for both the hypervisor and the workloads, but keep in mind that there might be a loss of performance.

If you own virtual infrastructure and there are workloads on it that are outside of your control you will need to set yourself up to respond quickly to the patches when they are released. I wouldn’t panic, but you’re going to need to move faster than usual. I’d be getting a plan together for testing and deployment right now both for the hypervisors and the workloads you do control, prioritizing the hypervisors. Keep in mind the loss of performance. I might plan to start with a smaller cluster and work my way up to a larger one. I might be warning staff about some extra work coming up, and warning other projects that something is happening and timelines might change a bit.

If you use the public cloud I’d be looking up the Azure, AWS, and Google Compute Engine notices about this problem and seeing if your workloads will be forcibly rebooted in the near future. I’d also make plans to patch your virtual machines, and keep in mind the possible loss of performance depending on your instance type.

If you use containers I’d make sure that your baseline images are all patched once patches are released. The same goes for template VMs, unless you have a process that brings them current immediately upon deployment or you build VMs dynamically.

I would stop trusting all internet-supplied VM appliances and container images until they have documented updates. If you didn’t build it yourself you don’t know it’s safe.

In all the scenarios I’d be doing some basic capacity planning so you have a baseline to compare to, auditing to make sure that applications are patched, and auditing firewall rules and access control.

As the British say, keep calm and carry on. Good luck.

Intel CPU Design Flaw, Performance Degradation, Security Updates

I was just taking a break and reading some tech news and I saw a wonderfully detailed post from El Reg (link below) about an Intel CPU design flaw and impending crisis-level security updates to fix it. As if that wasn’t bad enough, the fix for the problem is estimated to decrease performance by 5% to 30%, with older systems being the hardest hit.

Welcome to 2018, folks.

In short, an Intel CPU tries to keep itself busy by speculating about what it’s going to need to work on next. On Intel CPUs (but not AMD) this speculative execution doesn’t properly respect the security boundaries between the OS kernel and userspace applications, so you can trick an Intel processor into letting you read memory you shouldn’t have access to. That’s a big problem because that memory could hold encryption keys & other secrets, virtual machines, anything.

So what? Here are my thoughts:

  1. All of our systems just got 30% more expensive. Put another way, we are all about to lose 5-30% of the systems we paid for, if they’re built on Intel hardware. That includes network switches, storage arrays, traditional servers, everything.
  2. I’m guessing there’s a class-action lawsuit in the works already against Intel, if only to establish whose fault this is (not Dell, HP, etc. but Intel’s).
  3. We don’t know the effects of these updates yet, insofar as whether the performance hit will be global, just to CPU or memory, just to I/O, or some mix. We also don’t know how workloads will react to this. If you don’t have a proper test and/or QA environment you’re going to fly by the seat of your pants for a bit.
  4. What we can surmise, though, is that all system benchmarks are now null & void. This is an epoch, the great extinction of performance data from vendors. As of right now any sizing or performance data offered by a vendor needs to meet with questions around when that data was gathered, what OS levels & patches, and probably should have some written guarantees in the contract.
  5. If you have a system or application that’s Intel-based and within 30% of “full” you probably should start thinking about your options, especially if it’s on older hardware.
  6. If you aren’t collecting performance data from your systems you should get that going. There are lots of options, from established vendors like SolarWinds and newcomers like Uila to open-source tools like Observium. Historical performance data is essential for assessing a situation like this, as well as for system sizing and troubleshooting.
  7. Microsoft has announced that Azure instances will be rebooted on January 10, 2018. AWS is dancing around the same message. They don’t have live migration, like vMotion, so it’s a huge deal when they decide to fix something like this. The speed and scope of the reaction should tell you how important this is. It also should delineate how helpful things like vMotion are in a VMware vSphere environment, where you’ll be able to update the infrastructure without taking applications down (versus the public cloud which doesn’t live-migrate workloads). Yes, in an ideal world applications are built to not care, but very few of the world’s companies have their systems set up that way (and that’s a discussion for the comments or over a beer).
  8. Remember that the public cloud will take a performance hit, too. Yet one more way the public cloud DOESN’T actually help IT. At least a SaaS application means it’s someone else’s problem, though.
  9. Companies that don’t patch won’t have a problem with this, but that’s gross criminal negligence (e.g. Equifax, etc.) and should be the subject of whistleblowing action from here on out. Companies that do patch are getting screwed, of course, but this is solid due diligence and part of the cost of doing business. Truth is, regular patching is the #1 way to prevent security problems, but defense-in-depth is equally important (multiple other security controls that can help mitigate a problem like this until you figure out what you’re going to do to fix it). This update isn’t going to be avoidable for long, so you might as well suck it up and deal with it.
  10. I’d bet HPC/supercomputing folks won’t apply this update, but hopefully they have an understanding of their workloads and defense-in-depth. Losing even 5% of a system like TACC’s Stampede would hurt. Seen another way, Intel’s insecure design practices just made things like cancer research 5-30% slower.
  11. If you don’t take snapshots or image-level backups, now might be the time to try it, so you can roll things back quickly. Remember, though, that snapshots are a performance hit on their own. Rolling back the OS patches might be acceptable, too. The point is to have an answer to the question “how do we go back to the way things were after this patch is applied?” You might need to buy yourself some time to cope with these updates.
  12. AMD is probably going to try to make hay here, because they’re not affected. However, AMD systems have classically had problems of their own, such as bugs that ended up disabling all L3 cache, etc. There’s no high ground to be occupied by them. As always insist on actual performance data around vendor promises, and insist that those promises get documented, preferably in contractual form.
  13. Sysadmins are merely the messengers here, but we need to begin communicating this problem to the business around us. Our managers, VPs, CTOs, CIOs, everybody. This is an all-hands issue. The effect on IT is clear, but if we get ahead of it with our management stacks it’ll demonstrate our competence & security-mindedness. It’ll also clear the path for when we ask to buy something to cope with the 30% capacity hit.
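
The capacity arithmetic behind several of these points can be sketched in a few lines of Python. This is a simplified model, not a sizing tool: it assumes the performance loss applies evenly to the whole workload, and the function names and sample figures are made up for the example.

```python
def utilization_after_loss(current_utilization: float, performance_loss: float) -> float:
    """Utilization of the same workload after the system loses some capacity.

    Both arguments are fractions, e.g. 0.70 for 70% busy and 0.30 for a 30% loss.
    """
    if not 0.0 <= performance_loss < 1.0:
        raise ValueError("performance_loss must be in the range [0, 1)")
    return current_utilization / (1.0 - performance_loss)


def cost_increase_per_unit_of_work(performance_loss: float) -> float:
    """Fractional increase in cost per unit of work for the same hardware spend."""
    if not 0.0 <= performance_loss < 1.0:
        raise ValueError("performance_loss must be in the range [0, 1)")
    return 1.0 / (1.0 - performance_loss) - 1.0


# Hypothetical system that is 70% busy today, at both ends of the estimate.
for loss in (0.05, 0.30):
    after = utilization_after_loss(0.70, loss)
    extra = cost_increase_per_unit_of_work(loss)
    print(f"{loss:.0%} loss: utilization {after:.0%}, cost per unit of work +{extra:.0%}")
```

Under these assumptions a system that is 70% busy today is completely full after a 30% hit, which is the reasoning behind the “within 30% of full” rule of thumb. Note also that losing 30% of capacity raises the cost per unit of work by about 43%, not 30%, because the same money now buys only 70% of the work.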

As always, good luck.

Update (2018/01/03): Should we panic about the KPTI/KAISER Intel CPU design flaw?

Apple Deserves What It Gets From This Battery Fiasco

Yesterday Apple issued an apology for the intentional slowing of iPhones because of aging in the iPhone battery. As part of that they announced a number of changes, like a $29 battery replacement and actually giving people information and choices about how their device functions.

This says a few things to me. First, it says that they have gouged consumers for the cost of a battery all these years. Second, it tells me they are scared enough of these class-action lawsuits to admit fault publicly.

There are a million reasons why an iPhone might perform poorly, especially after an upgrade. This has little to do with the battery, and likely more to do with background maintenance tasks that happen after an OS update. Of course, I am guessing at this, because Apple never tells anybody anything about what is going on. Don’t believe me? Look at the release notes for a software update. They don’t tell people what they fixed or what they changed, or when they do it’s either a lie or a lie of omission. “Improves power management during peak workloads to avoid unexpected shutdowns” is what iOS 10.2.1 said. The word “improve” is a blatant lie, given what we now know about their fix. Perhaps they also feel Steve Jobs’ health has improved since his death.

Beyond lying, they don’t expose controls to users that might allow the users to customize behavior or make choices. After all, they’ve been throwing shade at PCs for years essentially saying choice is bad because it might add complexity. They make it very difficult to service devices which forces people to choose between Apple’s own now-apparent price gouging and a third-party that might disable the device. Apple builds their devices in ways that make common end-user repairs very risky, while saying that those measures are for our own protection. Nor do they expose information about the devices that might enable a user to make informed choices on their own, or enable an honest secondary market for these devices.

The net effect of all this tight-lipped behavior is that they have opened themselves up to legal action from everyone that has a slow iPhone 6, 6s, or 7, for any reason. The average consumer now has very plausible reasons to think that Apple is and has been screwing them into buying new iPhones. After all, Apple has a long history of being dishonest. Look at the iPhone 4S and the faulty water detection devices. Look at Antennagate and all the other problems with cracking, bending, and subsequent screen malfunctions that they blamed on user behavior instead of their own impeccable design. Watch their “geniuses” at an Apple Store weasel out of covering anything under AppleCare. Observe how they’ve quietly brought back the DRM that Steve Jobs removed. Look at their corporate behavior, talking out of both sides of their mouths about their Congressional lobbying, as well as their hiding money offshore. They have made this bed for themselves.

Some folks are saying that this was a colossal communications error on Apple’s part. For a company that prides itself on appearing intentional about everything, they cannot say now that this was a screwup. It was a calculated risk, a big bet on a massive lie of omission. They could have chosen to expose battery information in iOS, like my PC laptops have done for decades. They could have written their battery “explainer” then, too. Instead, they bet that they could keep their secondary market & repair lock-in and the status quo by hiding it all, all while their sales go up. And up they went, to record valuations of their company based on sales they dishonestly forced.

So here’s to hoping that the worldwide legal system gives Apple the comeuppance they are due. I hope it’s big enough to cause stock losses, penalizing the investors that support such ongoing dishonesty. More than all of that, though, I hope this is a warning to other organizations. Up-front honesty is always the best policy, even if it seems hard. It never — never — gets better if you let your customers figure it out themselves. And they always will.

Let’s Just Keep An Eye On The Time

“You’re asking me how a watch works. For now, let’s just keep an eye on the time.” – Alejandro, Sicario

I’ve enjoyed the eclectic roles Benicio del Toro has been playing these last few years. His appearance in recent space movies reminded me of this quote of his from the movie Sicario. Often enough in our own technological roles we are asked to explain ourselves, explain why something is the way it is or why we want it to be a particular way. How do you convey to someone in just a minute the years of school, decades of experience, days in noisy data centers, nights bringing systems back online, hours staring at configurations that are wrong and scripts that don’t work, dumb mistakes, clever moves, seemingly clever moves that were really dumb mistakes… all that led to this particular moment, this particular system design, this particular decision?

I kinda just want to tell them to keep an eye on the time.

Fixing Veeam “Can’t Delete Replica When It Is Being Processed” Errors

I’ve used Veeam Backup & Replication for a long time now, and when we restructure storage, redeploy VMs, or change our replication jobs we sometimes end up in situations where we get the error:

Error: Can't delete replica when it is being processed

Here’s how I fix it. As always, your mileage may vary, and free advice is often worth what you paid, especially from a stranger on the Internet. Veeam support is probably a safe but much higher latency source of non-free advice.

  1. Stop the affected jobs and disable them.
  2. Ensure that the replicas are gone from both the VMware environment (vCenter) and Backup & Replication (Replicas -> Ready, then right-click and Delete From Disk). Don’t delete it from the configuration or you’ll have to go in and edit the VM Exclusions.
  3. Browse the target datastore and remove any orphaned disk files or folders associated with that replica.
  4. Clone the affected job (by doing this you make sure that the replica you’re about to create has nearly identical properties to what the other job would have created). Change the virtual machine list to be only the affected VM or VMs.
  5. Run the replication job. Let it complete. It should create a replica of the VM.
  6. Edit the original job. Enable “Low connection bandwidth (enable replica seeding).” In the “Seeding” tab check “Map replicas to existing VMs” and click the “Detect” button to automatically map most/all of them. Check to make sure the affected VMs are detected, and fix it if not.
  7. Run the original job. It should complete without error now.
  8. Edit the original job and disable “Low connection bandwidth (enable replica seeding).” I’d run the job again just to make sure everything is good.

Good luck.

7 Ways IT Staff Can Prepare for the Holidays

For us IT types it is important to maintain a good balance between work and our lives. Just as they say that good fences make good neighbors, I’ve found that a good delineation between work and home improves both. The holiday season is taxing, though. People rush around trying to wrap up loose ends, they’re using vacation they’re going to lose, and they’re generally scattered and distracted, which isn’t a good thing.

If you’re lucky enough to work somewhere with a true 24×7 operations center then coverage over the holidays is already thought out. However, most IT staff in the world aren’t in places like that. Here are some thoughts I have about how to defend your time off over the holiday season.

1. Talk to your manager about expectations.

Ideally your manager has told you what they expect over the holidays, but if that isn’t the case you might have to bring it up. Certainly there is a range of competency when it comes to leadership but hopefully you can have an honest & fair discussion about staff availability.

I always have a million questions about expectations. For example, will anybody be working during the holidays? Remember that not everyone follows a Christian religion. What I think of as Christmas will just be “Monday” to several of my coworkers. Will they need support? Does that mean you’re on call? If you’re on call, what are the expectations? Can you support them remotely, or might you need to go into the office? What timeframe do they expect you to respond in? What timeframe do they expect you to get to the office in? How will they contact you?

If your boss doesn’t have answers seize the chance to propose something reasonable. I might start with “no calls on the actual holidays, and one person will be available by phone, but not on site, during business hours between Christmas and New Year’s Day” and go from there.

Remember that this stuff is work, so any work you do reduces any vacation time you’re taking. Keep that in mind if you’re trying to use up vacation.

2. Publish & honor holiday hours.

Many problems in IT are communication issues. Getting called on your day off is both a communication issue and a problem. Head it off by making a calendar of who’s available when so your team knows who is in and who isn’t, and who’s on call. Make that a PDF and send it to people so it’s in their email. That way they can get to it if they’re traveling.

The second part of that is honoring those hours. Where I work you’re either working or you’re not working. If you’re working you’re fair game, but if you aren’t working we don’t call. If we do call there’s no expectation that you’ll answer.

My teammates will text the group if there’s something going on, which is nicely non-disruptive. If we’re busy we ignore it, if we have a few minutes we’ll jump in, or suggest a course of action. Above all, respect your team’s personal lives and time away from work.

3. Institute a pre-holiday change freeze.

There have been several studies done over the last couple decades that showed a high percentage (85% or more) of incidents & outages are caused by IT staff themselves. To me that means if we’re all on vacation there’s a much smaller chance something bad will happen!

You don’t want to deal with preventable problems when you’re trying to leave on vacation. You also don’t want to have to call or be called to help someone fix something they could have left until January. To avoid that, invoke a change freeze. For example, you could declare that no changes will be made between December 18 and January 2 unless they are to resolve a critical issue. As Shakespeare would put it, “on pain of death.”
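
If your changes go through any kind of script or deployment pipeline, the freeze can be enforced with a small date check. Here is a minimal Python sketch using the example dates above. It assumes both end dates are inclusive and that critical fixes are exempt, and the names are invented for illustration.

```python
from datetime import date

# (month, day) bounds of the example freeze window, assumed inclusive.
FREEZE_START = (12, 18)
FREEZE_END = (1, 2)


def in_change_freeze(day: date, critical: bool = False) -> bool:
    """Return True if a change planned for `day` should be blocked by the freeze."""
    if critical:
        # Critical fixes are allowed even during the freeze.
        return False
    month_day = (day.month, day.day)
    # The window wraps around the new year, so a date is inside it when it is
    # on or after the start OR on or before the end.
    return month_day >= FREEZE_START or month_day <= FREEZE_END


print(in_change_freeze(date(2017, 12, 20)))                 # routine change during the freeze
print(in_change_freeze(date(2017, 12, 20), critical=True))  # critical fix during the freeze
```

A deployment script could call this first and refuse to continue when it returns True, which turns the freeze into something a tool enforces instead of something people have to remember.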

Every IT shop I’ve ever worked in has had lots of stuff to do that didn’t involve changing production systems. Find something else to do. Write some documentation. Build a wiki for your documents. Vacuum your office. Vacuum the data center. Clean out that closet of old cables. Demo some new software. Restock printer supplies. Brush up on your Linux admin skills. Learn a scripting language. Test your backup system by restoring something. Take long lunches and do team building exercises at a pub. You’ll figure it out. One last thing, though — all these things are part of your job. Stand your ground if someone says you aren’t working, because they’re wrong.

4. Find opportunities for users to help themselves.

One thing you can do is find opportunities for your users to help themselves. Write a “what to do if your IT person is out” document. There’s no reason that a user couldn’t reload paper in a printer, or reboot their PC, or do some basic troubleshooting before they interrupt your vacation time.

For example, I once found that adding a second printer to people’s PCs, a printer in a neighboring department, for instance, got me out of 95% of urgent printer calls. People get really angry when they can’t print right before a meeting. Write up some tips and give them permission to help themselves out.

5. Socialize limited availability and light work schedules.

Now’s the time to mention at meetings that the holiday week will be lightly staffed, so turnaround times might be longer, project work will be on hold, and issues that aren’t absolutely life or death will be tabled. I’ve always found that by mentioning these issues early you get people thinking about it in a non-threatening way, and scheduling assumptions can be sorted out proactively & calmly.

6. Set up an on-call phone number.

It’s one thing for your team to know your personal cell phone number, but you really don’t want everybody in your organization knowing it. Trust me. If you are doing an on-call rotation that involves actual calling you should find a way to forward a business phone number to whoever is responding. Maybe it’s tricks with Google Voice, or someone logs into Burner or Signal or something when it’s their turn. Be creative but make sure it’s reliable. Make sure to document it and send the document to people via email so they can find it when they need to.

7. Set up & test remote access methods.

The time to test your ability to get in remotely is BEFORE you need it. Make sure you can get in from the road if you’re traveling. Do you need a hotspot? Can you VPN in? What happens if the VPN concentrator can’t talk to your Active Directory? Can you get to the consoles of all of your devices? Does everyone on your team know how to get to everything?
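
Part of that testing can be scripted. The Python sketch below checks whether a list of services accepts TCP connections from wherever you happen to be sitting. Every host name and port in it is a placeholder, not a real system. A successful connection only proves the port is reachable, not that you can log in, so treat it as a first pass.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_targets(targets: dict) -> dict:
    """Map each target name to True/False depending on whether it is reachable."""
    return {name: can_connect(host, port) for name, (host, port) in targets.items()}


# Hypothetical targets; replace with your own VPN, jump host, consoles, and so on.
TARGETS = {
    "vpn": ("vpn.example.com", 443),
    "jump host (RDP)": ("jump.example.com", 3389),
    "vcenter": ("vcenter.example.com", 443),
}
```

Calling check_targets(TARGETS) from a hotspot or a home connection, before the holidays start, gives a quick list of what you can and cannot reach from outside the office.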

Anything you can’t reach remotely is potentially a reason you’ll have to go into the office. Going into the office means having to put on pants, comb your hair, log out of Overwatch, sober up — all things we just don’t want to have to do while we’re on vacation. So get it together, folks, and lay the foundation for a very merry end of December.

Good luck!

Consistency Is the Hobgoblin of Little Minds

Ever heard someone tell you “consistency is the hobgoblin of little minds?”

They’re misquoting Ralph Waldo Emerson by leaving out an important word: foolish.

That’s like leaving out the word “not” in a statement. The whole meaning changes because of the omission. We can all agree that “I am on fire” and “I am not on fire” are two very different statements. The same is true here. Let’s examine the actual quote:

A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall.

As with most things, context matters, which is what makes this quote inappropriate almost everywhere I see it used. In the context of IT, with consistency, a great soul can trade meaningless & soul-crushing work for important & strategic tasks, moving their organization forward rather than struggling just to keep up. In IT, consistency is generally a good thing, and when it is delivered via standards and automation it forms a stable & solid & predictable foundation on which we can build towering pinnacles of applications and services. Stability and predictability are important things to app developers, end users, and those of us that want to take a vacation from time to time.

However, there are foolish consistencies. When your automation becomes handcuffs and not an enabler it’s foolish. When your standards are held so tightly that you cannot enable new business ventures because of them it’s foolish. When efficiencies aren’t taken advantage of, new technology and methods eschewed, and/or positive changes avoided with the excuse of “standards” or “that isn’t the way we do things” it’s foolish.

Neither standards nor our tools are an end unto themselves. They exist to enable greater things, and when that stops being true we need to change them so they are helpful to us again. After all, having two standards is still better than having 1500 one-offs.

 

Advice On Downgrading Adobe Flash

VMware has a KB article out (linked below) about the Adobe Flash crashes that happen if you’re running the latest version of Flash (27.0.0.170). A lot of us were caught off guard recently when our PCs updated themselves and we couldn’t get into our VMware vSphere environments.

The VMware KB article suggests downgrading your Flash client. Left by itself this is completely irresponsible advice.

1. The Adobe Flash update addresses a critical security vulnerability that is being exploited in the wild. The security advisory (linked below) states:

Adobe has released a security update for Adobe Flash Player for Windows, Macintosh, Linux and Chrome OS. This update addresses a critical type confusion vulnerability that could lead to code execution.

Adobe is aware of a report that an exploit for CVE-2017-11292 exists in the wild, and is being used in limited, targeted attacks against users running Windows.

(As an aside, Adobe acknowledges Kaspersky Labs staff, which makes me think that they’re making good on their promises to figure out how Russian hackers used their software to exfiltrate NSA data.)

2. If you downgrade your Flash installations you will need to disable the auto-updaters, which is what got us all into these situations. I don’t know about you but I always forget to re-enable the updaters, and that’s bad.

3. There are workarounds. The HTML5 client, though incomplete, gets many people back in business. Microsoft Edge and Internet Explorer seem to work with Flash on Windows 10 1703, too, at least for all my team’s environments.

So what’s my advice?

  • Limp along with Microsoft Edge and the HTML5 client until VMware updates their clients. I think it’s safe to assume they’re working on it. Start making plans to patch your vCenter in the next few weeks.
  • If you don’t have the HTML5 client you can get it as a VMware Fling (link below).
  • If you absolutely have to downgrade Flash don’t run the vulnerable Flash on a PC you use for anything else. It’s annoying but you can survive a few weeks of this, provided you’re running a supported version of vSphere.
    1. Use network- & host-based firewalls to prevent all traffic that isn’t destined for your vSphere implementations. You’ll probably need to allow DNS, as well, but I’d keep it really locked down. I would even think twice about joining it to Active Directory.
    2. You should already be running antivirus and antimalware on your systems but it’s especially important for systems that are intentionally out of date.
    3. Use a virtual machine running in VMware Workstation for the insecure client. Make it non-persistent and use it for nothing else. Or a Windows Server installation with Terminal Services enabled.
    4. Put a calendar reminder in for your team to clean this whole thing up in a month.
    5. If you have dedicated IT security personnel (CISOs and such) reach out to them proactively. Make a business case around this — you need to do this to be able to support the environments, but you’re being responsible about the risk.
  • If you’re running an unsupported version of vSphere you need to upgrade ASAP. This is a great business driver for it. Never let a crisis go to waste! vSphere 5.5 goes end-of-support on 09/19/2018 so I’d even consider using this as a driver to get to 6.5…

Good luck & stay safe.

———————-

Stop Chrome Autoplay

If you didn’t catch this on Twitter:

In short, go to chrome://flags/-policy and set it to “Document user activation required.”

It’s funny how simple things can be so virally popular.

While Chrome can sync settings between browsers where I am logged in, I have got to figure out if there’s an API to set Chrome configuration options automatically…
