Recovering an OCP/OKD Cluster After a Long Time Powered Off
By Mark DeNeve
Introduction
If you are like me, you have multiple lab clusters of OpenShift or OKD at home or at work. Each of these clusters takes up a significant amount of resources, so you may shut them down to save power or compute. Or perhaps you run a cluster in one of the many supported cloud providers and power the machines down to save costs when you are not using them. If you leave the cluster powered off for more than two weeks, you will find that when you power it back on you are unable to connect to the cluster or the console. Most of the time, this is because one or more internal certificates expired while the cluster was off. There is a quick fix for this, which we will discuss below.
Prerequisites
To follow the steps below, you will need a cluster that has been powered off for at least two weeks, with internal certificates that have expired. You will also need the private SSH key for the “core” user on the master nodes. You should have set this key aside when you first built your cluster. We will use it to connect remotely to one of the master nodes and approve the pending certificate signing requests (CSRs) so that the cluster can come back online.
Procedure
With ALL master nodes powered on, use ssh to connect to one of the master nodes, then switch to the root user with sudo su -:
$ ssh core@<master node ip address here>
Red Hat Enterprise Linux CoreOS 47.84.202109082139-0
Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
managed by the Machine Config Operator (`clusteroperator/machine-config`).
---
[core@ocp47-89pwv-master-0 ~]$ sudo su -
[root@ocp47-89pwv-master-0 ~]#
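Before going further, it can be worth confirming that an expired certificate really is the problem. The sketch below checks the kubelet's client certificate with openssl. It assumes the certificate is at /var/lib/kubelet/pki/kubelet-client-current.pem and that openssl is installed on the node; verify both on your own cluster. The function name cert_is_expired is only an illustrative helper, not part of any OpenShift tooling.

```shell
# Hypothetical helper: report whether the certificate in file $1 has expired.
# Exit status: 0 = expired, 1 = still valid, 2 = file missing or unreadable.
cert_is_expired() {
    cert_file=$1
    [ -r "$cert_file" ] || return 2
    # "-checkend 0" makes openssl exit 0 only if the certificate is still
    # valid right now, and non-zero if it is already past its end date.
    if openssl x509 -noout -checkend 0 -in "$cert_file" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Assumed location of the kubelet client certificate on the node.
KUBELET_CERT=${KUBELET_CERT:-/var/lib/kubelet/pki/kubelet-client-current.pem}

# Example usage as root on a master node:
#   openssl x509 -noout -enddate -in "$KUBELET_CERT"
#   cert_is_expired "$KUBELET_CERT" && echo "kubelet client certificate has expired"
```

If the end date printed by openssl is in the past, the steps that follow should apply to your cluster.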
As the root user, set your KUBECONFIG to point to the recovery kubeconfig file:
[root@ocp47-89pwv-master-0 node-kubeconfigs]# export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
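If you script this step, a small guard helps catch a mistyped or missing path before any oc command runs. This is a minimal sketch: the path is the one used above, and the variable RECOVERY_KUBECONFIG and the function use_recovery_kubeconfig are hypothetical names chosen for illustration.

```shell
# Path to the localhost recovery kubeconfig used in this article.
# Override RECOVERY_KUBECONFIG if your cluster keeps it elsewhere.
RECOVERY_KUBECONFIG=${RECOVERY_KUBECONFIG:-/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig}

# Hypothetical helper: export KUBECONFIG only if the file is readable,
# otherwise print a message to stderr and return non-zero.
use_recovery_kubeconfig() {
    if [ ! -r "$RECOVERY_KUBECONFIG" ]; then
        echo "recovery kubeconfig not readable: $RECOVERY_KUBECONFIG" >&2
        return 1
    fi
    export KUBECONFIG="$RECOVERY_KUBECONFIG"
}

# Example usage as root on a master node:
#   use_recovery_kubeconfig && oc get nodes
```

The function has to be defined and called in your current shell (or sourced from a file) so that the exported KUBECONFIG is visible to the oc commands that follow.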
You should now be able to get a listing of the Certificate Signing Requests that are pending using the oc command:
[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8jhdv   6m38s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-8th88   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-c4m5z   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-chpcm   6m29s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-dzf9x   6m28s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hbf8v   6m35s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-kbq9c   6m23s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-lc65f   6m25s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-n5wmr   22m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-p6h2d   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-q45zs   22m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-t4znl   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
Note that there are many CSRs that are pending. We can approve all of these with one command:
[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-8jhdv approved
certificatesigningrequest.certificates.k8s.io/csr-8th88 approved
certificatesigningrequest.certificates.k8s.io/csr-c4m5z approved
certificatesigningrequest.certificates.k8s.io/csr-chpcm approved
certificatesigningrequest.certificates.k8s.io/csr-dzf9x approved
certificatesigningrequest.certificates.k8s.io/csr-hbf8v approved
certificatesigningrequest.certificates.k8s.io/csr-kbq9c approved
certificatesigningrequest.certificates.k8s.io/csr-lc65f approved
certificatesigningrequest.certificates.k8s.io/csr-n5wmr approved
certificatesigningrequest.certificates.k8s.io/csr-p6h2d approved
certificatesigningrequest.certificates.k8s.io/csr-q45zs approved
certificatesigningrequest.certificates.k8s.io/csr-t4znl approved
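One caveat with the one-liner: GNU xargs runs its command once even when the input is empty, so if nothing is pending, oc is invoked with no CSR names. Adding --no-run-if-empty to xargs avoids that. Alternatively, the same logic can be written as a small function, sketched below. It reuses the go-template from the command above; the function names list_pending_csrs and approve_pending_csrs are hypothetical.

```shell
# Print the names of CSRs that have no status yet (that is, still pending).
# Uses the same go-template as the one-liner in this article.
list_pending_csrs() {
    oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Hypothetical helper: approve every pending CSR, doing nothing if there
# are none. Prints the number of CSRs it approved.
approve_pending_csrs() {
    pending=$(list_pending_csrs) || return 1
    count=0
    # CSR names contain no whitespace, so word splitting is safe here.
    for name in $pending; do
        oc adm certificate approve "$name" >/dev/null || return 1
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage (with KUBECONFIG already set):
#   approve_pending_csrs
```

Calling approve_pending_csrs when nothing is pending simply prints 0 instead of producing an error.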
At this point, the first batch of CSRs (the kubelet client certificates) is approved. We can validate this by listing the CSRs again:
[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-5hc7n   2m55s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-9hjtn                                        Pending
csr-5x5h2   2m50s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-9mbkt                                        Pending
csr-8fs5g   2m52s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-2                                            Pending
csr-8jhdv   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8th88   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-c4m5z   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-chpcm   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dzf9x   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-hbf8v   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kbq9c   10m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lc65f   10m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-n5wmr   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nk7vj   2m52s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-1                                            Pending
csr-nls7t   2m44s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-s2bbm                                        Pending
csr-p6h2d   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q45zs   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q9bfm   2m48s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-0                                            Pending
csr-t4znl   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
You will note that new CSRs have already shown up in the list. These are kubelet-serving requests from the individual nodes, and they should be approved automatically by the system now that the first batch of CSRs has been approved. However, you can re-run the approval command from above to approve the pending certificates and move things along.
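Because CSRs arrive in waves, re-running the approval by hand can get tedious. The following is a rough sketch of a loop that keeps approving whatever is pending until two consecutive polls find nothing left. The function name approve_in_rounds, the default of 20 rounds, and the 30 second delay are all assumptions made for illustration; tune them for your environment.

```shell
# Hypothetical helper: approve pending CSRs in rounds until two consecutive
# polls find nothing pending, or until max_rounds is reached.
# Arguments (both optional): max_rounds, delay between polls in seconds.
approve_in_rounds() {
    max_rounds=${1:-20}
    delay=${2:-30}
    # Same go-template as the one-liner in this article.
    template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
    round=0
    quiet=0
    while [ "$round" -lt "$max_rounds" ]; do
        round=$((round + 1))
        pending=$(oc get csr -o go-template="$template") || return 1
        if [ -z "$pending" ]; then
            quiet=$((quiet + 1))
            # Two empty polls in a row: assume no more CSRs are coming.
            [ "$quiet" -ge 2 ] && return 0
        else
            quiet=0
            for name in $pending; do
                oc adm certificate approve "$name" || return 1
            done
        fi
        sleep "$delay"
    done
    return 1    # gave up while CSRs were still appearing
}

# Example usage (with KUBECONFIG already set):
#   approve_in_rounds 20 30
```

Note that this approves every pending CSR without inspecting it, exactly as the one-liner does. That is reasonable for a lab cluster you control, but it is a choice to make deliberately.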
Give your cluster a few more minutes (up to 10 in my experience) and your cluster should be available for use again. You can now continue to use your OpenShift cluster just where you left it the last time you powered it down.
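Rather than guessing when the cluster is back, you can poll node status. This is a minimal sketch that assumes the default column layout of oc get nodes (NAME, STATUS, ROLES, AGE, VERSION). The function names and the default timeout of 600 seconds are hypothetical, with the timeout chosen to match the ten-minute estimate above.

```shell
# Print how many nodes have a STATUS other than exactly "Ready".
# A node shown as "Ready,SchedulingDisabled" is counted as not ready here.
# Returns non-zero if the node list cannot be retrieved or is empty.
count_not_ready_nodes() {
    nodes=$(oc get nodes --no-headers 2>/dev/null) || return 1
    [ -n "$nodes" ] || return 1
    printf '%s\n' "$nodes" | awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll until every node reports Ready, or give up
# after roughly $1 seconds. $2 is the polling interval in seconds.
wait_for_nodes_ready() {
    timeout=${1:-600}
    interval=${2:-15}
    waited=0
    while [ "$waited" -le "$timeout" ]; do
        if not_ready=$(count_not_ready_nodes) && [ "$not_ready" -eq 0 ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 1
}

# Example usage (with KUBECONFIG already set):
#   wait_for_nodes_ready 600 15 && echo "all nodes Ready"
```

Nodes reporting Ready is only part of the picture; oc get clusteroperators is also worth watching until the operators settle.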