Recovering an OCP/OKD Cluster After a Long Time Powered Off

Posted by Mark DeNeve on Thursday, January 13, 2022

Introduction

If you are like me, you have multiple lab clusters of OpenShift or OKD at home or at work. Each of these clusters consumes a significant amount of resources, so you may shut them down to save power or compute. Or perhaps you run a cluster on one of the many supported cloud providers and power the machines down to save costs when you are not using them. If you leave the cluster powered off for more than two weeks, you will find that when you power it back on you cannot connect to the cluster or the console. Most of the time, this is because one or more internal certificates expired while the cluster was off. There is a quick fix for this, which we will discuss below.

Prerequisites

To follow the steps below, you will need a cluster that has been powered off for at least two weeks and whose internal certificates have expired. You will also need the private SSH key for the “core” user on the master nodes, which you should have set aside when you first built your cluster. We will use this key to connect to one of the master nodes and approve the pending Certificate Signing Requests (CSRs) so that the cluster can come back online.

Procedure

With ALL master nodes powered on, use ssh to connect to one of the master nodes, then switch to the root user with sudo su -:

$ ssh core@<master node ip address here>
Red Hat Enterprise Linux CoreOS 47.84.202109082139-0
  Part of OpenShift 4.7, RHCOS is a Kubernetes native operating system
  managed by the Machine Config Operator (`clusteroperator/machine-config`).
---
[core@ocp47-89pwv-master-0 ~]$ sudo su -
[root@ocp47-89pwv-master-0 ~]#

As the root user, set your KUBECONFIG to point to the recovery kubeconfig file:

[root@ocp47-89pwv-master-0 node-kubeconfigs]# export KUBECONFIG=/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
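If the recovery kubeconfig is missing or unreadable, later oc commands fail in ways that are easy to misread. As a small safeguard, the export could be wrapped in a readability check. This is only a sketch: use_kubeconfig is a hypothetical helper name, not something shipped with oc or RHCOS.

```shell
# Hypothetical helper: export KUBECONFIG only if the given file is readable.
# Returns non-zero (and leaves KUBECONFIG untouched) otherwise.
use_kubeconfig() {
  if [ -r "$1" ]; then
    export KUBECONFIG="$1"
    echo "KUBECONFIG set to $1"
  else
    echo "cannot read kubeconfig: $1" >&2
    return 1
  fi
}

# Usage on the master node, as root (path taken from the step above):
#   use_kubeconfig /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs/localhost-recovery.kubeconfig
```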

You should now be able to get a listing of the Certificate Signing Requests that are pending using the oc command:

[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-8jhdv   6m38s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-8th88   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-c4m5z   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-chpcm   6m29s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-dzf9x   6m28s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-hbf8v   6m35s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-kbq9c   6m23s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-lc65f   6m25s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-n5wmr   22m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-p6h2d   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-q45zs   22m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-t4znl   21m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
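On a larger cluster this listing can get long. One way to pull out just the pending entries is to filter the table on its last column. The sketch below assumes the default table output shown above, where CONDITION is the final column, and pending_csr_names is a hypothetical helper name.

```shell
# Hypothetical helper: print the names of CSRs whose CONDITION is "Pending".
# Reads the table produced by `oc get csr` on standard input.
pending_csr_names() {
  # NR > 1 skips the header row; $NF is the last column (CONDITION).
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Usage on the master node:
#   oc get csr | pending_csr_names          # list pending CSR names
#   oc get csr | pending_csr_names | wc -l  # count them
```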

Note that there are many CSRs that are pending. We can approve all of these with one command:

[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-8jhdv approved
certificatesigningrequest.certificates.k8s.io/csr-8th88 approved
certificatesigningrequest.certificates.k8s.io/csr-c4m5z approved
certificatesigningrequest.certificates.k8s.io/csr-chpcm approved
certificatesigningrequest.certificates.k8s.io/csr-dzf9x approved
certificatesigningrequest.certificates.k8s.io/csr-hbf8v approved
certificatesigningrequest.certificates.k8s.io/csr-kbq9c approved
certificatesigningrequest.certificates.k8s.io/csr-lc65f approved
certificatesigningrequest.certificates.k8s.io/csr-n5wmr approved
certificatesigningrequest.certificates.k8s.io/csr-p6h2d approved
certificatesigningrequest.certificates.k8s.io/csr-q45zs approved
certificatesigningrequest.certificates.k8s.io/csr-t4znl approved
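One caveat with the one-liner: if nothing is pending, xargs still runs oc adm certificate approve with no arguments, which produces an error. A slightly more defensive variant, assuming GNU xargs is available on the node, adds --no-run-if-empty. The function name approve_pending_csrs is hypothetical and used only for illustration.

```shell
# Hypothetical wrapper around the same one-liner used above.
# --no-run-if-empty (GNU xargs) skips the approve call when no CSR is pending.
approve_pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve
}

# Usage on the master node, with KUBECONFIG already exported:
#   approve_pending_csrs
```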

At this point, the primary CSRs are approved. We can validate that the CSRs have been approved:

[root@ocp47-89pwv-master-0 node-kubeconfigs]# oc get csr
NAME        AGE     SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-5hc7n   2m55s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-9hjtn                                        Pending
csr-5x5h2   2m50s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-9mbkt                                        Pending
csr-8fs5g   2m52s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-2                                            Pending
csr-8jhdv   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8th88   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-c4m5z   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-chpcm   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-dzf9x   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-hbf8v   11m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kbq9c   10m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-lc65f   10m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-n5wmr   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-nk7vj   2m52s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-1                                            Pending
csr-nls7t   2m44s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-worker-s2bbm                                        Pending
csr-p6h2d   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q45zs   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-q9bfm   2m48s   kubernetes.io/kubelet-serving                 system:node:ocp47-89pwv-master-0                                            Pending
csr-t4znl   26m     kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued

Note that new CSRs, this time for the kubernetes.io/kubelet-serving signer, have already shown up in the list. These should be approved automatically by the system now that the primary CSRs have been approved. However, you can re-run the approval command from above on the pending certificates to move things along.
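Because new CSRs keep appearing for a little while, re-running the approval by hand can get tedious. A rough sketch of doing it in rounds follows. The function name, the number of rounds, and the delay are all illustrative assumptions, not values from this procedure.

```shell
# Hypothetical helper: approve pending CSRs in several rounds.
#   $1 = number of rounds (default 5), $2 = seconds to wait between rounds (default 30)
approve_csrs_in_rounds() {
  max_rounds="${1:-5}"
  delay_seconds="${2:-30}"
  round=1
  while [ "$round" -le "$max_rounds" ]; do
    # Same go-template as above: names of CSRs that have no status yet.
    pending=$(oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}')
    if [ -z "$pending" ]; then
      echo "round $round: no pending CSRs"
    else
      echo "$pending" | xargs oc adm certificate approve
    fi
    round=$((round + 1))
    # Wait before the next round, but not after the last one.
    if [ "$round" -le "$max_rounds" ]; then
      sleep "$delay_seconds"
    fi
  done
}

# Usage on the master node, with KUBECONFIG already exported:
#   approve_csrs_in_rounds 5 30
```

The loop deliberately keeps going after an empty round, since the kubelet-serving requests may only show up once the earlier client certificates have been issued.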

Give your cluster a few more minutes (up to 10 in my experience) and it should be available for use again. You can then pick up with your OpenShift cluster right where you left it when you last powered it down.
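To confirm the cluster really is back, you can check that every node reports Ready. The sketch below filters the default table output of oc get nodes, assuming STATUS is the second column. The helper name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: print nodes whose STATUS is anything other than exactly "Ready".
# Reads the table produced by `oc get nodes` on standard input.
# A combined status such as "Ready,SchedulingDisabled" is also reported.
not_ready_nodes() {
  # NR > 1 skips the header row; $1 is NAME, $2 is STATUS.
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Usage on the master node:
#   oc get nodes | not_ready_nodes
```

Empty output means all nodes are Ready. Running oc get clusteroperators afterwards gives a similar view of the cluster operators.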