OpenShift Machine Remediation

Posted by Mark DeNeve on Tuesday, October 31, 2023

Kubernetes, and thus OpenShift, is designed to host applications in such a way that if a node hosting your application fails, the application is automatically rescheduled on another node and everything “just keeps working”. This happens without any intervention by an administrator, letting you continue on with your life without being bothered by some on-call alert system. But what about the node that failed? While the app may be up and running, you now have a node that is no longer pulling its weight. Your cluster capacity is reduced, and if enough of these failed nodes accumulate, other apps may be affected or your cluster may fail.

If you are in operations, you know just how annoying it can be to be called to reboot or fix that failed node. Why not let the platform self-heal using the Node Health Check Operator, the Self Node Remediation Operator and the Machine Deletion Remediation Operator? These Operators are available in OperatorHub and are included as part of the OpenShift platform.

In this blog post, we will use the Node Health Check Operator to create an automation that runs an escalating remediation, starting with a reboot of the unhealthy node and finishing with the complete destruction and redeployment of a node with a system failure. By using an escalating remediation we ensure that the cluster keeps enough compute resources with as little impact as possible. Since it’s all automated, you and your operations team are never bothered to repair or replace a node.

Prerequisites

  • OpenShift 4 cluster - tested with OCP 4.13 with a working MachineAPI
    • The Machine API is configured automatically for any cluster installed using the IPI install process
  • Administrator level privileges in the cluster
  • The OpenShift oc command-line tool

Install the Node Health Check Operator for Red Hat OpenShift

We will start by installing the Node Health Check Operator. The Node Health Check Operator is responsible for watching the nodes in a cluster and flagging those nodes that are failing or in a non-healthy state so that they can be isolated and remediated back to health.

  1. Log in to the OpenShift Container Platform web console
  2. Navigate to Operators → OperatorHub
  3. Enter “Node Health Check Operator” into the filter box
  4. Select the “Node Health Check Operator” and click Install
  5. On the Install Operator page, leave all defaults selected, and click Install

Wait for the Operator to install before proceeding to the next section. Once installed, the Operator adds a “NodeHealthChecks” entry under the “Compute” section of the OpenShift UI Administrator view. The Node Health Check Operator also installs the Self Node Remediation Operator and its default CR, which we will talk more about in the Configuring Automatic Remediation of Nodes section below.
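
If you prefer the command line, you can also confirm the install by listing the ClusterServiceVersions in the openshift-operators namespace. The exact CSV name and version will vary with your cluster, so treat the grep pattern below as an example rather than the definitive name:

$ oc get csv -n openshift-operators | grep -i healthcheck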

Install the Machine Deletion Remediation Operator for Red Hat OpenShift

Next up we will install the Machine Deletion Remediation Operator. The Machine Deletion Remediation Operator controls the second stage of our escalating node remediation and is responsible for deleting a Machine from the cluster when the Node Health Check Operator calls for it.

  1. Log in to the OpenShift Container Platform web console
  2. Navigate to Operators → OperatorHub
  3. Enter “Machine Deletion Remediation operator” into the filter box
  4. Select the “Machine Deletion Remediation operator” and click Install
  5. On the Install Operator page, select all defaults, and click Install

Wait for the Operator to install before proceeding to the next section.
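
As a quick sanity check, you can also confirm that the remediation CRDs from both Operators are now registered. In my testing the relevant API groups all end in medik8s.io, though the exact list may differ by Operator version:

$ oc get crd | grep medik8s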

Configuring Automatic Remediation of Nodes

Now that the base components are installed, it’s time to configure them. We will need to define both a MachineDeletionRemediationTemplate and a NodeHealthCheck. We will create an “escalating remediation”, which means that the first action will not be to delete and re-create the failed host. In stage one we will trigger a simple reboot of the host; if that does not fix the issue, we will move on to the second, more drastic stage of deleting and re-creating the failed node.

We will be limiting the scope of nodes that can be auto-remediated, as the OpenShift platform does not support the automatic management of master nodes. If you are using OpenShift Data Foundation, or use local storage for your containers, you should also ensure that you don’t enable automatic deletion.

Check the SelfNodeRemediationConfig

We will start by checking the Self Node Remediation configuration, which controls how a failing node is rebooted. You can review the config using the command below and, if necessary, change any of the default settings. We will leave the defaults for this example.

$ oc get SelfNodeRemediationConfig self-node-remediation-config -n openshift-operators -o yaml
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: openshift-operators
spec:
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  safeTimeToAssumeNodeRebootedSeconds: 180
  watchdogFilePath: /dev/watchdog

If you would like to understand more about how Self Node Remediation works, be sure to check the How it Works page for more in-depth details.
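
If you do decide to change one of the defaults, you can edit the CR directly or patch it. The following is only a sketch that bumps safeTimeToAssumeNodeRebootedSeconds to an arbitrary 300 seconds; pick a value that matches how long your hardware actually takes to reboot:

$ oc patch SelfNodeRemediationConfig self-node-remediation-config \
    -n openshift-operators --type merge \
    -p '{"spec":{"safeTimeToAssumeNodeRebootedSeconds":300}}'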

Create a MachineDeletionRemediationTemplate

Next up we will create a MachineDeletionRemediationTemplate, which is used by the NodeHealthCheck to delete failing nodes. Start by creating a file called mdr-template.yaml and put the following contents in it:

apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: worker-delete-remediation
  namespace: openshift-operators
spec:
  template:
    spec: {}

Now, apply that file to your cluster:

$ oc create -f mdr-template.yaml
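
You can confirm the template was created with a quick get; the output should list the worker-delete-remediation template we just defined:

$ oc get MachineDeletionRemediationTemplate -n openshift-operators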

Create a NodeHealthCheck

With our SelfNodeRemediationConfig reviewed and a MachineDeletionRemediationTemplate created, we can configure our NodeHealthCheck. We will create a NodeHealthCheck that only acts on nodes that have the node-role.kubernetes.io/worker label assigned.

WARNING: If you are running a Single Node OpenShift cluster, or a Compact Cluster (three nodes), this health check will select and monitor your shared master nodes. This may cause issues with your cluster. It is not recommended to run this tool in a Single Node or Compact cluster.

As previously mentioned, we will be configuring this health check to take a staged approach to remediation: first attempt a reboot of the failed node, and only if the node is still unhealthy after 5 minutes, start the process of deleting the machine. Depending on your particular cluster, a 5 minute timeout may not be enough time for a complete reboot, so be sure to adjust it for how long your nodes take to reboot.

If the node has not returned to a healthy state after 5 minutes, the remediation will move on to the next stage and trigger the Machine Deletion Remediation Operator to delete the node.

Once the machine is deleted, the Machine API controller will notice that there are not enough machines to satisfy the configured MachineSet replicas target, and start building a replacement node.
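
This only works if your workers are managed by a MachineSet with a replica count to maintain, so it is worth verifying that on your cluster before continuing (the MachineSet names will of course differ from mine):

$ oc get machinesets -n openshift-machine-api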

Create a file called nhc-config.yaml and populate with the following contents:

apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: worker-node-healthcheck
spec:
  escalatingRemediations:
    - order: 1
      remediationTemplate:
        apiVersion: self-node-remediation.medik8s.io/v1alpha1
        kind: SelfNodeRemediationTemplate
        name: self-node-remediation-resource-deletion-template
        namespace: openshift-operators
      timeout: 5m
    - order: 2
      remediationTemplate:
        apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
        kind: MachineDeletionRemediationTemplate
        name: worker-delete-remediation
        namespace: openshift-operators
      timeout: 20m
  minHealthy: 51%
  selector:
    matchExpressions:
      - key: node-role.kubernetes.io/worker
        operator: Exists
        values: []
  unhealthyConditions:
    - duration: 300s
      status: 'False'
      type: Ready
    - duration: 300s
      status: Unknown
      type: Ready

Now we will apply that file to our cluster:

$ oc create -f nhc-config.yaml
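
Note that the NodeHealthCheck references the default self-node-remediation-resource-deletion-template created by the Self Node Remediation Operator, as well as the MachineDeletionRemediationTemplate we created earlier. If a referenced template is missing, the health check disables itself (you can see an example of that in the event log later in this post), so it is worth confirming both templates exist; template names may vary slightly between Operator versions:

$ oc get SelfNodeRemediationTemplate -n openshift-operators
$ oc get MachineDeletionRemediationTemplate -n openshift-operators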

With the NodeHealthCheck configuration applied, we will check the state of the NodeHealthCheck.

$ oc describe nodehealthcheck
...
Status:
  Conditions:
    Last Transition Time:  2023-10-10T13:16:25Z
    Message:               No issues found, NodeHealthCheck is enabled.
    Reason:                NodeHealthCheckEnabled
    Status:                False
    Type:                  Disabled
  Healthy Nodes:           4
  Last Update Time:        2023-10-10T13:16:25Z
  Observed Nodes:          4
  Phase:                   Enabled
  Reason:                  NHC is enabled, no ongoing remediation
Events:
  Type     Reason    Age    From             Message
  ----     ------    ----   ----             -------
  Normal   Enabled   2m12s  NodeHealthCheck  No issues found, NodeHealthCheck is enabled.

The output above is truncated, but the important information is shown. Specifically, we can see that it has identified 4 nodes (“Observed Nodes”) and that all 4 nodes are healthy (“Healthy Nodes”). Now let’s go break something and see this in action.

Testing the Health Check

Now for the fun part: we are going to check that the Node Health Check Operator is working. I will be testing with a cluster built using IPI on VMware vSphere, but the tests should work on any cluster that has the proper platform configuration.

WARNING: we are about to intentionally cause the disruption of a node in the cluster. Don’t ever do this in a production environment unless you know what you are doing! You have been WARNED!

We will cause a temporary disruption of a node by stopping the kubelet service. Since the kubelet service is critical to the node’s operation, this will cause the node to go unhealthy. SSH to the node you want to disrupt using the core account that was set up during the cluster install.

$ ssh core@<ip of worker>
$ sudo su -
# systemctl stop kubelet
# exit
$ exit

Now it will take 5 minutes (300 seconds) for the NodeHealthCheck to pick up on this failed state. This is due to the configuration in our nhc-config.yaml file from earlier where we configured:

  unhealthyConditions:
    - duration: 300s
      status: 'False'
      type: Ready
    - duration: 300s
      status: Unknown
      type: Ready
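
While you wait, you can watch the node drop into a NotReady state with a simple watch on the nodes:

$ oc get nodes -w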

After 5 minutes, check the status of our nodehealthcheck:

$ oc describe nodehealthcheck
...
Status:
  Conditions:
    Last Transition Time:  2023-10-10T13:16:25Z
    Message:               No issues found, NodeHealthCheck is enabled.
    Reason:                NodeHealthCheckEnabled
    Status:                False
    Type:                  Disabled
  Healthy Nodes:           3
  In Flight Remediations:
    acm-775pj-worker-l7vfq:  2023-10-10T13:30:15Z
  Last Update Time:          2023-10-10T13:30:15Z
  Observed Nodes:            4
  Phase:                     Remediating
  Reason:                    NHC is remediating 1 nodes
  Unhealthy Nodes:
    Name:  acm-775pj-worker-l7vfq
    Remediations:
      Resource:
        API Version:  self-node-remediation.medik8s.io/v1alpha1
        Kind:         SelfNodeRemediation
        Name:         acm-775pj-worker-l7vfq
        Namespace:    openshift-operators
        UID:          5561c44d-75fd-4103-856b-11608eb37cfe
      Started:        2023-10-10T13:30:15Z
Events:
  Type     Reason              Age   From             Message
  ----     ------              ----  ----             -------
  Warning  Disabled            27m   NodeHealthCheck  Disabling NHC. Reason: RemediationTemplateNotFound, Message: Remediation template not found: "failed to get external remediation template openshift-operators/worker-remediation: machinedeletionremediationtemplates.machine-deletion-remediation.medik8s.io \"worker-remediation\" not found"
  Normal   Enabled             14m   NodeHealthCheck  No issues found, NodeHealthCheck is enabled.
  Normal   RemediationCreated  61s   NodeHealthCheck  Created remediation object for node acm-775pj-worker-l7vfq
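
Behind the scenes, the first remediation stage created a SelfNodeRemediation resource for the unhealthy node, matching the resource shown in the status above. While the remediation is in flight you can list it directly; it is cleaned up once the node returns to a healthy state:

$ oc get SelfNodeRemediation -n openshift-operators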

With the node detected as failed, if you watch the console of the worker node, you should see it start to reboot. Also notice that the “Phase” is now listed as “Remediating”. Wait for your node to reboot and return to a healthy state and then check the NodeHealthCheck state again. You should see that the node has returned to a healthy state and that the NodeHealthCheck “Phase” has returned to “Enabled”.

Testing the Escalation Process

OK, so we have seen a simple reboot work, but what happens if the node becomes corrupted in a way that prevents it from booting properly? This is where the escalating remediation comes into play. With our current configuration, if the node is still not healthy 5 minutes after the reboot remediation starts (roughly 10 minutes after the original failure, once you include the 5 minute detection window), we move to the next stage and delete the node so it can be recreated.

WARNING: we are about to intentionally corrupt a node’s boot disk. Don’t ever do this in production. You have been WARNED!

SSH to the node you want to disrupt using the core account that was set up during the cluster install.

$ ssh core@<ip of worker>
$ sudo su -
# dd if=/dev/urandom of=/dev/sda bs=1024 count=1000
# sync
# systemctl stop kubelet
# exit
$ exit

Now we wait. The first thing that will happen is that the node gets rebooted, just like before. After the 5 minute detection window the machine will reboot, but since we corrupted the boot disk, the node will never come back online. After another 5 minutes, the second stage will trigger and a MachineDeletionRemediation will kick in. This will delete the Machine from OpenShift, and the MachineSet controller will handle replacing the failed node with a new node.

In my lab environment this means that the vSphere Machine controller will delete the failed virtual machine, create a new virtual machine, and join it to the cluster.
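
If you want to watch this happen, keep an eye on the Machine objects while the remediation runs; you should see the failed Machine disappear and a new one move through provisioning to Running (names and timing will differ in your environment):

$ oc get machines -n openshift-machine-api -w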

We can check and see that the node that failed has been replaced by a new node:

$ oc get nodes
NAME                     STATUS   ROLES                  AGE    VERSION
acm-775pj-master-0       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-master-1       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-master-2       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-worker-mbxcg   Ready    worker                 43d    v1.26.7+c7ee51f
acm-775pj-worker-rmd8z   Ready    worker                 184d   v1.26.7+c7ee51f
acm-775pj-worker-skjxx   Ready    worker                 23m    v1.26.7+c7ee51f
acm-775pj-worker-ws7nb   Ready    worker                 119d   v1.26.7+c7ee51f

SUCCESS! Our failed node has been replaced and our cluster is back to a healthy state with full capacity.

Conclusion

Kubernetes and OpenShift have the power to manage applications and keep them running in the face of application and node failures. With the addition of the Node Health Check and Machine Deletion Remediation Operators, we can extend that self-healing capability to the nodes of the cluster themselves.