vSphere Upgrade Saga: The Great Failure

As you may know, I have been having air-conditioning problems that, tied to system failures, have given me issues to address on a regular basis, at least until the AC is fixed properly. That seems to be a protracted affair. Time passes… My AC is finally fixed! Apparently, ductless AC units can be tricky.

Actually, they are pretty simple. There is a rotor to push cool air off the coil within the inside unit. There is a line set that includes drainage, inflow and outflow copper tubing, and electricity. This line set connects the outside unit to the inside unit. The outside unit is also pretty simple. It contains a heat disperser, fan, and condenser. There are only two control boards in a ductless unit: one on the inside and one on the outside.

With my ductless unit, I have had leaks within the line set, in the outside unit, and now in the inside unit. What makes these leaks hard to find is that some of them can only be detected with a fairly sensitive electronic nose. In all, I have had nearly every part of the ductless unit replaced except for the electrical wiring, the inside unit control board, the blower, and the cases for the outside and inside units.

Yes, you read correctly: the vast majority of this ductless unit has been replaced over the years. Each leak has taken a while to find. Each leak has forced me to take my robust virtual environment and run it at a much-reduced capacity of just one node and enough virtual machines to run my businesses.

Over the years, I have gotten really good at moving things around, shutting down systems, and keeping the lights on, so to speak. Along the way, I have also added myriad sensors to my rack enclosure and replaced elements that generate high heat with ones that generate less heat. That effort will go on over the next few years.

What this disaster has shown me is that we need to be able to detect the most disastrous issues early and as they happen, and then respond to them quickly. A proper script to do everything I need done is a powerful tool, one that I still need to complete. However, tying such a script into monitoring would be difficult.

Why? One of my sensors is purely visual. The others do not currently report anywhere useful. I get a warning on my cell phone, and from there, I have to go in, move things around, and shut things down in the proper order. All of that, while scriptable, still needs a good trigger. You also have to think through the logic quite well. One mess-up, and your recovery from disaster will be a true disaster. The vSphere Upgrade Saga is full of those types of issues.
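To make the idea of a "good trigger" concrete, here is a minimal sketch in Python. It assumes a sensor that can report rack temperature as a number, which is not something my current sensors do. The class name, the 35 °C threshold, and the three-reading requirement are all hypothetical values chosen for illustration. The point is only that a single odd reading should not start a shutdown, and that the trigger should fire once rather than repeatedly.

```python
from dataclasses import dataclass


@dataclass
class OverheatTrigger:
    """Fires once after several consecutive hot readings (hypothetical values)."""

    threshold_c: float = 35.0    # assumed rack temperature limit, in Celsius
    required_readings: int = 3   # consecutive hot readings needed before firing
    _hot_streak: int = 0         # how many hot readings in a row so far
    fired: bool = False          # True once the trigger has gone off

    def observe(self, temp_c: float) -> bool:
        """Feed one reading; return True only on the reading that fires."""
        if self.fired:
            # Already fired: never start the shutdown a second time.
            return False
        if temp_c >= self.threshold_c:
            self._hot_streak += 1
        else:
            # A single cool reading resets the count.
            self._hot_streak = 0
        if self._hot_streak >= self.required_readings:
            self.fired = True
            return True
        return False


if __name__ == "__main__":
    trigger = OverheatTrigger()
    for reading in [36.0, 36.5, 30.0, 36.0, 37.0, 38.0]:
        if trigger.observe(reading):
            print(f"Trigger fired at {reading} C: start the shutdown process")
```

Requiring consecutive readings is one simple way to avoid acting on a glitch. The right threshold and count depend entirely on the room and the hardware.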

In my environment, the process is this:

1. Ensure all critical virtual machines are on the HPE StoreVirtual Cluster
2. Move HPE StoreVirtual Failover Manager to local storage on one node
3. Shut down all desktops running on VSAN
4. Place my non-HPE StoreVirtual node into maintenance mode
5. Shut down that node
6. Place the node without the Failover Manager on it into maintenance mode
7. Shut down the HPE StoreVirtual node using the HPE StoreVirtual console (shutting it down via vSphere causes corruption)
8. Shut down the node now in maintenance mode
9. Shut down my KVM node, which takes out one of my storage devices
10. Shut down my 10 G switch, as it is no longer needed

This leaves me with exactly one node and the switch stack to support external connections.
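Because the order matters so much, the process above can be sketched as an ordered runbook that stops at the first step that fails instead of pressing on. This is only a sketch in Python, not my finished script. The names `run_in_order`, `dry_run`, and `StepFailed` are hypothetical, and every action here is a placeholder that just prints what it would do. Real actions would have to call whatever tooling manages each piece, and step 7 in particular has to go through the HPE StoreVirtual console rather than vSphere.

```python
from typing import Callable, List, Tuple

# A step is a description plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


class StepFailed(Exception):
    """Raised when a step reports failure; later steps are not attempted."""


STEP_NAMES = [
    "Ensure all critical virtual machines are on the HPE StoreVirtual Cluster",
    "Move HPE StoreVirtual Failover Manager to local storage on one node",
    "Shut down all desktops running on VSAN",
    "Place the non-HPE StoreVirtual node into maintenance mode",
    "Shut down that node",
    "Place the node without the Failover Manager into maintenance mode",
    "Shut down the HPE StoreVirtual node using the HPE StoreVirtual console",
    "Shut down the node now in maintenance mode",
    "Shut down the KVM node",
    "Shut down the 10 G switch",
]


def dry_run(description: str) -> Callable[[], bool]:
    """Build a placeholder action that only prints what would be done."""
    def action() -> bool:
        print(f"[dry run] {description}")
        return True
    return action


def build_dry_run_steps() -> List[Step]:
    """Pair every step name with a placeholder action, keeping the order."""
    return [(name, dry_run(name)) for name in STEP_NAMES]


def run_in_order(steps: List[Step]) -> List[str]:
    """Run steps in sequence and stop at the first one that fails."""
    completed: List[str] = []
    for number, (description, action) in enumerate(steps, start=1):
        if not action():
            raise StepFailed(f"step {number} failed: {description}")
        completed.append(description)
    return completed


if __name__ == "__main__":
    finished = run_in_order(build_dry_run_steps())
    print(f"{len(finished)} steps completed")
```

Stopping on the first failure is a deliberate choice: shutting down a later node after an earlier step did not finish is exactly the kind of mess-up that turns a recovery into a disaster.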

This is currently the end of my AC issues. Yet, I still monitor the situation, waiting for the next failure. I now have the start of better tooling to aid in protecting my environment, but it is tooling I hope to never need.
