vSphere Upgrade Saga: Catastrophic Failure Recovery

Recently, my environment suffered what looked like a major catastrophe: the IBM DS3400 SAN I use experienced a two-drive failure, which meant the RAID 5 array was effectively gone. Or was it? I will cover why it was not completely lost in the writeup that follows, but most of my data was damaged, which forced me to restore what I could from backup and recreate my entire set of management tools. Luckily, however, email, AD, accounting, and other critical business data was not lost. Some was restored from backup; some was rebuilt from the array.

In essence, a RAID 5 array is generally non-recoverable from a two-disk failure, but I had an ace in the hole, so to speak. I knew that sometimes the DS3400 throws drive errors when there are actually no errors. To solve this, I effectively chose a disk and reenabled it by setting it from failed back to optimal. But before I could even do that, I had to set up my management console for the SAN, which I did within a VMware Fusion Windows 7 VM. Not ideal, but it does the trick when everything else has failed.

There is one saving grace for the entire issue: I never once lost internet connectivity, as my virtual firewall stayed running even though its disk was no longer accessible. In addition, as long as the physical nodes stayed running, their local caches still held the ARP and DNS entries for my local network.

Lesson #0: Shut down all VMs on the array except for any firewall VMs needed to access the outside world.

Step 1: Install IBM DS Storage Manager 10 into a VM.

The only way to manage the IBM DS3400 I have is through the storage manager. Although DNS was down, I remembered the IP address of at least one of the controllers, which allowed me to connect and find the problem: two disks had failed, and the hot spare had not come into use. On an IBM DS3400, or perhaps just on this particular array, disks tend to appear failed when they are not actually failed. So one option was to attempt to reset one of the failed disks to good and let the array rebuild. But which disk? I chose the first of the failed disks for my experiment; if that failed, I would move on to the second.
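
Before reviving anything, it is worth confirming what the array itself reports about each drive. The same script editor (or SMcli) can show this; the two commands below are only a sketch based on the standard DS-series script syntax, so verify the exact keywords against the CLI and script commands guide for your firmware level:

show storageSubsystem healthStatus;
show allDrives summary;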

Using the script editor (Tools -> Execute Script) that is part of the IBM System Storage DS Storage Manager 10, I executed the following command:

revive drive [88,5];

This command did not revive the drive, so I then moved on to the next command:

set drive [88,5] operationalState=optimal;

And that worked. The next thing I saw was that the hot spare had taken over for the second failed drive, and all the lights were flickering green except for that drive's. By the way, the revived drive is running just fine now, with no issues; I am still not sure what really caused the failure.

Lesson #1: Let the array rebuild before doing anything else!

However, I was impatient, so I did not learn Lessons #0 and #1 until far too late.

Step 2: Let the array rebuild.

This is the most important step. Why? Because the array will rebuild, and while some data may end up with errors, most of it generally will not. All the data was there as of the moment of the two-disk failure, which apparently happened moments after the array sent email about the first failure; we know this because the hot spare never had time to take over.

Unfortunately, I did not let the array rebuild for all my VMs. I should have, as my attempts to recover AD, for example, corrupted the AD virtual disks, forcing a restore from backup.

Step 3: Install a vCenter Server.

Virtual machine restorations require a management construct into which the images can be restored. Granted, I could have restored directly to a host, but I was afraid my management systems were fully corrupted, and I wanted to be able to fix everything as it became available. So, I installed the VMware vCenter Server Appliance (VCSA) from an OVA I had kept from the last time I downloaded vCenter. I keep a repository of crucial images on a non-SAN-based tier of storage; this data is eventually backed up to BDXL Blu-ray drives. Since I did not have AD, I enabled SSO when I installed the VCSA. Into SSO I entered a user not only for myself but also one for use by the restoration services, which fits the virtual environment security practice of one login per service in use. Furthermore, since DNS was still down, I modified the /etc/hosts file on the VCSA with the IP address of each ESXi host.
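
With DNS down, name resolution has to be hard-coded. On the VCSA this is simply a matter of appending entries to /etc/hosts; the hostnames and addresses below are placeholders for illustration, not my actual configuration:

# /etc/hosts on the VCSA -- example entries only
192.168.1.21   esx01.example.local   esx01
192.168.1.22   esx02.example.local   esx02
192.168.1.23   esx03.example.local   esx03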

Granted, my Windows-based vCenter Server was gone, as was its database. But with my small environment, I could recover fairly easily. The hardest part was reconfiguring the hosts to use the newly created VDS, as I had no VDS backup to import.

Lesson #2: Export VDS definitions to a backup repository separate from vCenter database backups. [ Using the Web Client for each VDS: Right click on VDS -> All vCenter Actions -> Export Configuration … ]

Step 4: Install a restoration VM.

In order to restore corrupted VMs, I first had to recover my backup VM and import my backups into the software for restoration. Actually, recover is too easy a word: I reinstalled the VM from a template (since the templates were not running, their bits were all there) and readied it for use. The templates could have been corrupted by the array failure, but the disks had not actually gone bad; something had simply told the array there was an issue. This Windows 2008 R2 template was used to create a brand-new VM, which was then patched, updated, and finally brought online as a restoration server.

The restoration server had to be configured with an LMHOSTS file that contained the IPs and FQDNs of all the ESXi hosts, as well as of the newly installed vCenter server. At this point, my DNS server was not yet restored, so I had to let this VM think it had proper name resolution for everything.
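
As a minimal sketch of the sort of entries involved (hostnames and addresses are placeholders, not my real ones): the hosts file in %SystemRoot%\System32\drivers\etc covers the FQDNs, while the lmhosts file in the same directory covers the short NetBIOS names, with the PRE keyword preloading each entry into the name cache:

# %SystemRoot%\System32\drivers\etc\hosts -- example FQDN entries only
192.168.1.21   esx01.example.local
192.168.1.30   vcsa.example.local

# %SystemRoot%\System32\drivers\etc\lmhosts -- example NetBIOS entries only
192.168.1.21   esx01   #PRE
192.168.1.30   vcsa    #PRE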

Lesson #3: Keep a non-AD version of the backup tool on a secondary non-SAN datastore for restoration purposes.

Step 5: Restore AD.

Had I not been impatient, I would have learned Lesson #1 already, but I had not. So, I removed the existing AD server and restored it from backup using the backup tool I had just built. I had some trouble with role-based access controls, as I run a least-privilege configuration for all service accounts in use, but once those were figured out, with a little help from the documentation, I was ready to restore my AD system. AD is the critical dependency for each of my other subsystems, or, more to the point, it was.

Lesson #4: Make sure there are blueprints available that show all dependencies for backups. [ VMware vCenter Infrastructure Navigator can work as one such source of blueprints. ]

By the time I got this far, the array had safely rebuilt itself, so it was time to determine the extent of the damage. Unfortunately, I had already started to reboot VMs so that I could manage the system. The VMs I rebooted included my Windows vCenter Server, my HP SIM server (used to manage the hardware), and a few others. These systems were toast, and some would not even boot once the array had rebuilt. What this meant was that I had to recreate my entire virtualization management stack around my newly installed VCSA, yet another task that had to take place.

As I rebooted each machine, I determined its current state and recovered any data I could off the system. Some I marked for reinstall/recreate, such as my virtual desktops; others I safely recovered and then updated to the latest VMware Tools and patches. The key is that my main mail server and other critical components all worked just fine after a reboot. I cannot stress Lesson #1 enough: let the array rebuild before proceeding with any potential recovery.

My impatience forced me to go down a different path. Perhaps it is a better path. One of the things I discovered needed to change was that my virtual storage appliance (VSA), HP StoreVirtual, was dependent upon my SAN. That dependency was removed, which was easy to do, as the VSA runs as a set of VMs. My VSA VMs did not reboot well, so I reinstalled them from OVF and placed them on local storage associated with the VSA itself. I lost nothing within the VSA, just the two controllers and the failover manager, all of which were easy to reinstall.

Lesson #5: Do not have a VSA dependent on another storage array.

There are lessons to learn from every failure, and I am still working through all of mine. I have changed how I back up, what I back up, and which tools are available for recovery. I keep a copy of a backup extraction tool with my backups, as well as a non-AD-based restoration VM on a different storage device from my SAN, and I am continuing to uncover new things to keep in a backup repository. So far, that list includes:

  • Host Profile(s)
  • VDS Export
  • vShield Manager backup
  • Barracuda Spam and Antivirus Firewall Vx configuration
  • Export of my backup server configuration

Is there something else that you keep a backup of outside of your normal full or other VM backups? Are there other lessons you have learned?
