vSphere Upgrade Saga: HPE StoreVirtual Upgrades Go Bad

I recently decided to disable my very hot, very expensive-to-cool spinning-disk Fibre Channel SAN. It was also the slowest thing on my storage network, with only 4 Gb controllers. Upgrading it would be too expensive at the moment. Instead, I upgraded my HPE StoreVirtual VSAs to have more disk space. As I am licensed for up to 10 TBs, I figured I would take advantage of that and increase my storage.

I also made several other decisions related to my storage upgrade, including moving to a 10 Gb network and switching my HPE StoreVirtuals to all flash. Given that I have two of them that are in sync using RAIN with replication, I felt safe enough going with SATA-based SSDs. Still, I wanted high-quality ones with a large mean time between failures.

Sizing my new array had to include the 3 TBs already allocated to my StoreVirtual and the 6 TBs of my FC SAN. So, in essence, I needed 9 TBs per side of my HPE StoreVirtual, or 18 TBs total. I also chose to expand my capacity by one disk so that I would need seven disks (six for storage and one as a spare) for each StoreVirtual in use. I already used Samsung SSDs for VMware vSAN, so I chose to use them once more for my StoreVirtual arrays. I ended up purchasing fourteen 1.92 TB SSDs from Samsung.

The seven disks per StoreVirtual were configured as a RAID 5 array with one spare disk, offering 8.942 TB of raw storage to each HPE StoreVirtual, of which 8.74 TB was made available to each HPE StoreVirtual VSA instance. Not quite 9 TBs of usable space, but sufficient for my needs.
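The sizing arithmetic can be sketched in a few lines of Python. This is only a rough check under stated assumptions: it uses the nominal 1.92 TB label capacity, treats six disks as RAID 5 members with the seventh as a hot spare, and assumes the storage tools report binary units (TiB) while the drive label uses decimal units (TB). The helper name raid5_usable_bytes is made up for this illustration. The result lands close to the 8.74 TB presented to each VSA, but the exact figures a controller reports will differ with real drive capacity and metadata overhead.

```python
TB = 10**12   # decimal terabyte, as printed on a drive label
TIB = 2**40   # binary tebibyte, which many storage tools display as "TB"


def raid5_usable_bytes(disk_bytes: int, member_disks: int) -> int:
    """Usable bytes of a RAID 5 set: one member's worth of space holds parity."""
    if member_disks < 3:
        raise ValueError("RAID 5 needs at least three member disks")
    return (member_disks - 1) * disk_bytes


# Capacity needed per side: existing StoreVirtual allocation plus the FC SAN.
required_tb = 3 + 6

# Assumption: nominal label capacity; real drives expose slightly different sizes.
disk_bytes = 1_920 * 10**9

# Seven disks per node: six RAID 5 members, one hot spare (adds no capacity).
usable = raid5_usable_bytes(disk_bytes, member_disks=6)
usable_tb = usable / TB
usable_tib = usable / TIB

print(f"required per side: {required_tb} TB")
print(f"usable (decimal):  {usable_tb:.2f} TB")
print(f"usable (binary):   {usable_tib:.2f} TiB")
```

The gap between the decimal and binary figures is why six 1.92 TB disks in RAID 5 come out a little short of the 9 TBs target.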

I moved every VM to other storage using VMware Storage vMotion. The VMs ended up either on the FC SAN or another iSCSI server. That iSCSI server had 10 TBs, but it is what I would consider secondary storage. It ran at 10 Gb, but the StoreVirtual runs at 20 Gb, given the backplane of the blades I use. Then, I:

Shut down one HPE StoreVirtual VSA machine
Removed the storage disk from the virtual machine
Removed the existing RAID array using hpssacli
Removed the six 900 GB 10 K RPM 2.5″ SAS disks
Removed the blank for the seventh drive and added in the seven Samsung 1.92 TB SSDs
Created the RAID 5 with spare using hpssacli
Created a virtual disk on the new RAID 5 array
Assigned the 8.74 TB virtual disk to the HPE StoreVirtual VSA
Rebooted the HPE StoreVirtual VSA
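The hpssacli portion of the steps above might look roughly like the following. This is a dry-run sketch that only prints command lines; it does not run anything. The controller slot (0), the old logical drive number (1), the array letter (A), and the drive bay IDs (1I:1:1 through 1I:1:7) are all assumptions for illustration, so check the output of "hpssacli ctrl all show config" on the real host before using any of them.

```python
import shlex


def raid5_with_spare_commands(slot, member_drives, spare_drive, old_ld=1, array="A"):
    """Build hpssacli command lines for the array rebuild (nothing is executed)."""
    ctrl = ["hpssacli", "ctrl", f"slot={slot}"]
    return [
        # Remove the existing logical drive (done before pulling the old disks).
        ctrl + ["ld", str(old_ld), "delete", "forced"],
        # After the new SSDs are seated: create the RAID 5 logical drive.
        ctrl + ["create", "type=ld", "drives=" + ",".join(member_drives), "raid=5"],
        # Attach the seventh disk to the new array as a spare.
        ctrl + ["array", array, "add", f"spares={spare_drive}"],
        # Confirm the resulting layout.
        ctrl + ["show", "config"],
    ]


# Assumed bay naming (port:box:bay); real IDs depend on the server and controller.
drives = [f"1I:1:{bay}" for bay in range(1, 7)]
commands = raid5_with_spare_commands(slot=0, member_drives=drives, spare_drive="1I:1:7")

for cmd in commands:
    print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them against the controller's actual configuration before typing them on the host.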

This worked a treat! I now had an HPE StoreVirtual with 8.74 TBs of usable storage.

However, this is where issues began. The other VSA had issues with this change. Replication was broken, basically, and that caused management to do odd things. To fix this, I disabled replication and moved the second VSA out of the storage array. This did not quite go as well as I thought it would. But more on that later. I then repeated the steps above.

This is where things got really strange on me. The second VSA would not join the array with the old name. Apparently, it still existed. So, I forced the removal, and it came back into the array. RAIN was set up, and things looked good.

The next step was to move VMs back to the HPE StoreVirtual. To do that, I joined the HPE StoreVirtual to a datastore cluster and started to place the other datastores into maintenance mode. During that transfer of many VMs, the second-added (actually primary) HPE StoreVirtual VSA crashed. This was the sign of bad things to come.

I rebooted the VSA, and voilà, things came back and the cluster started to work as expected. Still, not all was well. During a routine patch of VMware vSphere, the StoreVirtual would become inaccessible if the first node went down, even if the second node was alive and well. It was so bad that I lost vCenter and many other VMs until the node came back. By then, my virtual machines had ended up with corrupted disks.
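One way to reason about why losing a single node could take the storage offline is manager quorum. The sketch below assumes a simple majority rule: more than half of the configured managers must be reachable. The manager counts are hypothetical, and nothing here confirms that quorum was the actual cause of this outage. It only shows how a missing or stale manager entry changes the outcome when one node goes down.

```python
def has_quorum(total_managers: int, reachable_managers: int) -> bool:
    """Majority rule: more than half of the configured managers must be reachable."""
    return reachable_managers > total_managers // 2


# Hypothetical healthy layout: two VSA managers plus a Failover Manager.
healthy = has_quorum(total_managers=3, reachable_managers=2)

# Hypothetical degraded layout: only the two VSAs count as managers,
# and one of them is down for patching.
no_tiebreaker = has_quorum(total_managers=2, reachable_managers=1)

# Hypothetical stale entry: a removed node is still counted as a manager,
# so four are configured but only two respond while one node is down.
stale_entry = has_quorum(total_managers=4, reachable_managers=2)

print(f"3 managers, 1 down:        quorum={healthy}")
print(f"2 managers, 1 down:        quorum={no_tiebreaker}")
print(f"4 configured, 2 reachable: quorum={stale_entry}")
```

In this model, an odd number of healthy managers survives the loss of one node, while an even count or a stale entry does not.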

Reboots of the VMs and several fscks later, everything was back up and running. What made this more difficult was that my AC decided to die at exactly the same time, and I had to power off VMs and hosts. This included the FC array and the secondary iSCSI server. I was literally down to one node: the secondary StoreVirtual node. However, it took me a bit to find out which one caused the issue and how to fix it. I ended up with virtual machines that were corrupted several times before I discovered it was the secondary StoreVirtual that I had to keep running.

Once the AC was fixed, it was time to fix the primary HPE StoreVirtual. To do that, I went through the steps to remove the primary HPE StoreVirtual from the RAIN cluster:

Copy the license key for the primary HPE StoreVirtual VSA
Disable RAIN
Stop the manager on the primary HPE StoreVirtual VSA
Move the Failover Manager out of the cluster
Move the primary HPE StoreVirtual VSA out of the cluster
Power off the primary HPE StoreVirtual VSA
Delete the primary HPE StoreVirtual VSA from disk
Reimport the VSA OVF with the same name and MAC address as the primary HPE StoreVirtual VSA
Change the virtual hardware of the primary HPE StoreVirtual VSA to match the first one
Create a new virtual disk of 8.7 TBs
Power on the primary HPE StoreVirtual VSA
Log in and give it the appropriate IP address, netmask, and default route
Join the primary HPE StoreVirtual VSA to the cluster
Add the Failover Manager back to the array
Relicense the primary HPE StoreVirtual VSA
Re-enable RAIN
Wait for replication to finish

Then, all I had to do was wait for replication to finish. By then, the AC was back on and I could power on the FC SAN, which finally allowed me to use Storage vMotion to move all data off the HPE StoreVirtual so I could run some tests.

Those tests included upgrading the HPE StoreVirtual, upgrading the hosts, and ensuring I did not once more lose access to a running virtual machine. All tests passed.

Lesson learned: properly remove HPE StoreVirtual VSAs from the cluster before working on them!

View my other vSphere Upgrade Saga posts.
