vSphere with Tanzu - SupervisorControlPlaneVM stucks in state NotReady
Power outage related circumstances recently brought my vSphere with Tanzu Workload Management1 enabled homelab cluster (Supervisor Cluster) into a not desired state which of course I had to fix.
And I do not mean that I have to protect my homelab against the next power outage by finally installing a UPS 😉.
What I actually mean is the following…
Failed to get available workloads: bad gateway
The last couple of weeks I spent a lot of time using my Tanzu Kubernetes Cluster(s)2 to get my head around as well as my hands dirty on this awesome project Knative3. More to come soon 😉. Interacting with a healthy vSphere Supervisor Cluster4 is necessary for e.g. the provisioning of new Tanzu Kubernetes Cluster or the instantiation of vSphere Native Pods.
Unfortunately, my attempt to login into mine ends quicker than expected with the error message:
My first instinct was to quickly have a look at the Workload Management subsection in the vSphere Client (Menu –> Workload Management or ctrl + alt + 7) and here my suspicion that something is wrong was confirmed. The Config Status of my cluster Malibu was in
Configuring state and the following two Status Messages as shown in Figure I were displayed.
I observed this
Configuring state for a while longer and I was hoping it gets fixed automagically but it didn’t.
Seriously, I do believe that this is really due to various (again) circumstances which I was facing with my homelab and normally the desired state (
ready) for the Supervisor Cluster or more specifically, for the Supervisor Control Plane VMs, will be recovered automatically.
HealthState WCP service unhealthy
The next move I did was checking the state of my Tanzu Kubernetes Cluster also via the vSphere Client but they were no longer displayed at all. Things went strange 👻.
Consequently, I checked the overall health of my vCenter Server as well as the health state of the services and especially my attention was on the
wcp service (Workload Control Plane). Checking the service state can be done in two ways:
- vCenter Appliance Management Interface aka VAMI (vcenter url:5480)
- via the
shell- which requires an enabled and running
sshservice on the vCenter Server Appliance
shell output was the following:
And here a little bit more coloured:
Let’s see if the HealthState will change after restarting the service:
Well, way better. Let’s move on from here.
Supervisor Control Plane node status
wcp service back in operating state led me to start over from where I began. This time logging in to my Supervisor Cluster went well and I checked the state of the three Control Plane Nodes by executing
kubectl get nodes.
Control Plane Node 1: STATUS NotReady
With this new gains, I also checked the Virtual Machine state via the Remote Console and surprisingly, it seems that the power outages affected this particular Node (VM) badly. The Remote Console window was swarmed with error messages saying
print_req_error: I/O error, dev sda, sector ...[counting up]
I was also trying to figure out if there’s a way to get this fixed on the Operating System level but there was no chance or at least there wasn’t one for my expertise.
vSphere ESXi Agent Manager
If the desired state cannot be recovered automatically again, what option remains?
Well, there’s always something to learn. My well appreciated colleague Dominik Zorgnotti was pointing me to the vSphere ESXi Agent Manager (EAM), which in the end turned out to be the solution for my problem. To be honest, I never was in the situation to make use of the EAM before and therefore it wasn’t on my radar at all but I was quite happy to get to know this component now. What it does?
You will find the EAM in the vSphere Client under Menu -> Administration -> vCenter Server Extensions -> vSphere ESXi Agent Manager.
The three Supervisor Control Plane VMs can be found via the Configure tab and are listed in the column as Agency named with prefix
vmware-vsc-apiserver-xxxxxx. By selecting the three dots besides the name, it gives us the two options Delete Agency as well as Remove All Issues. See Figure V.
But which of the three “Agency’s” is our affected one? I’ve never recognized this used naming pattern before, not in the vSphere Client nor by using
kubectl describe node). Ultimately, I took a look at the summary page of the VM and the Notes widget enlightened me.
This Virtual Machine is a VMware agent implementing support for vSphere Workloads. Its lifecycle operations are managed by VMware vCenter Server. EAM Agency: vmware-vsc-apiserver-w8mqd8
See also Figure VI:
Having found the missing piece, I first went with the Remove All Issues option but it didn’t solve my problem.
With having in mind, that the Supervisor Cluster is a high available construct consisting of three members and it’s current degraded state, I checked twice if the one I picked is the right one before hitting the Delete Agency 🔴 button.
Immediately thereafter, the affected VM got deleted and a new one was on it’s way.
At the end, my cluster reached the Config Status
Running, all three Nodes were
Ready and the
wcp service never became (until now)