OpenShift Machine Remediation
By Mark DeNeve
Kubernetes, and thus OpenShift, is designed to host applications in such a way that if a node hosting your application fails, the application is automatically rescheduled on another node and everything “just keeps working”. This happens without any intervention by an administrator, letting you get on with your life instead of being paged by an on-call alert system. But what about that node that failed? While the app may be up and running, you now have a node that is no longer pulling its weight: your cluster capacity is reduced, and if you get enough of these failed nodes, other apps may be affected or your cluster may fail.
If you are in operations, you know just how annoying it can be to be called to reboot or fix that failed node. Why not let the platform heal itself using the Node Health Check Operator, the Self Node Remediation Operator, and the Machine Deletion Remediation Operator? These Operators are available in OperatorHub and are included as part of the OpenShift platform.
In this blog post, we will use the Node Health Check Operator to create an automation that runs an escalating remediation, starting with a reboot of the unhealthy node and finishing with the complete destruction and redeployment of a node with a system failure. By using an escalating remediation we ensure that the cluster keeps enough compute resources with as little impact as possible. Since it's all automated, you and your operations team are never bothered to repair or replace a node.
Prerequisites
- OpenShift 4 cluster - tested with OCP 4.13 with a working MachineAPI
- The MachineAPI is configured for any cluster installed using the IPI install process (see the quick check after this list)
- Administrator level privileges in the cluster
- The OpenShift oc command-line tool
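If you want to confirm that the Machine API is actually managing your worker nodes before relying on it for remediation, a quick check like the one below should show your MachineSets and Machines with matching desired and ready counts. The names and counts will of course differ in your cluster.
$ oc get machinesets -n openshift-machine-api
$ oc get machines -n openshift-machine-api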
Install the Node Health Check Operator for Red Hat OpenShift
We will start by installing the Node Health Check Operator. The Node Health Check Operator is responsible for watching the nodes in a cluster and flagging those nodes that are failing or in a non-healthy state so that they can be isolated and remediated back to health.
- Log in to the OpenShift Container Platform web console
- Navigate to Operators → OperatorHub
- Enter “Node Health Check Operator” into the filter box
- Select the “Node Health Check Operator” and click Install
- On the Install Operator page, leave all defaults selected, and click Install
Wait for the Operator to install before proceeding to the next section. This will add a “NodeHealthChecks” entry under the “Compute” section of the Administrator view in the OpenShift UI. The Node Health Check Operator will also install Self Node Remediation and its default configuration CR, which we will talk more about in the Configuring Automatic Remediation of Nodes section below.
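If you prefer to install Operators from the command line instead of the web console, a Subscription along the lines of the sketch below should accomplish the same thing. The package name and channel shown here are assumptions based on the current OperatorHub catalog entry, so confirm them first (for example with oc get packagemanifests -n openshift-marketplace | grep -i healthcheck) before applying.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: node-healthcheck-operator
  namespace: openshift-operators
spec:
  channel: stable                   # assumed channel, verify in OperatorHub
  name: node-healthcheck-operator   # assumed package name, verify in OperatorHub
  source: redhat-operators
  sourceNamespace: openshift-marketplace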
Install the Machine Deletion Remediation Operator for Red Hat OpenShift
Next up we will install the Machine Deletion Remediation Operator. The Machine Deletion Remediation Operator controls the second stage of our escalating node remediation and is responsible for deleting a Machine from the cluster when the Node Health Check Operator calls for it.
- Log in to the OpenShift Container Platform web console
- Navigate to Operators → OperatorHub
- Enter “Machine Deletion Remediation operator” into the filter box
- Select the “Machine Deletion Remediation operator” and click Install
- On the Install Operator page, leave all defaults selected, and click Install
Wait for the Operator to install before proceeding to the next section.
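Both Operators install into the openshift-operators namespace by default. A quick way to confirm the installs completed is to list the ClusterServiceVersions and check that each reports a Succeeded phase (the exact names and version strings will vary):
$ oc get csv -n openshift-operators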
Configuring Automatic Remediation of Nodes
Now that the base components are installed, it's time to configure them. We will need to define both a MachineDeletionRemediationTemplate and a NodeHealthCheck. We will create an “escalating remediation”, which means the first action will not be to delete and re-create the failed host. In the first stage we trigger a simple reboot of the host; only if that does not fix the issue do we move on to the second, more drastic stage of deleting and re-creating the failed node.
We will be limiting the scope of nodes that can be auto-remediated, as the OpenShift platform does not support the automatic management of master nodes. If you are using OpenShift Data Foundation, or use local storage for your containers, you should also ensure that you don't enable automatic deletion for the nodes hosting that storage.
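Since the NodeHealthCheck we create later selects nodes by the worker role label, you can preview which nodes will be in scope for auto-remediation ahead of time:
$ oc get nodes -l node-role.kubernetes.io/worker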
Check the SelfNodeRemediationConfiguration
We will start by checking the configuration of Self Node Remediation, which is used to reboot a node when it is failing. You can review the config using the command below and, if necessary, change any of the default settings. We will leave the defaults for this example.
$ oc get SelfNodeRemediationConfig -n openshift-operators -o yaml
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: openshift-operators
spec:
  apiCheckInterval: 15s
  apiServerTimeout: 5s
  isSoftwareRebootEnabled: true
  maxApiErrorThreshold: 3
  peerApiServerTimeout: 5s
  peerDialTimeout: 5s
  peerRequestTimeout: 5s
  peerUpdateInterval: 15m
  safeTimeToAssumeNodeRebootedSeconds: 180
  watchdogFilePath: /dev/watchdog
If you would like to understand more about how Self Node Remediation works, be sure to check the How it Works page for more in-depth details.
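If you do decide to tune one of these settings, you can patch the CR in place. As a purely illustrative example, if your nodes take longer than three minutes to reboot you might raise safeTimeToAssumeNodeRebootedSeconds; the value of 300 below is an assumption, not a recommendation.
$ oc patch SelfNodeRemediationConfig self-node-remediation-config \
    -n openshift-operators --type merge \
    -p '{"spec":{"safeTimeToAssumeNodeRebootedSeconds":300}}'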
Create a MachineDeletionRemediationTemplate
Next up we will create a MachineDeletionRemediationTemplate, which is used by the NodeHealthCheck to delete failing nodes. Start by creating a file called mdr-template.yaml and put the following contents in it:
apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
kind: MachineDeletionRemediationTemplate
metadata:
  name: worker-delete-remediation
  namespace: openshift-operators
spec:
  template:
    spec: {}
Now, apply that file to your cluster:
$ oc create -f mdr-template.yaml
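You can confirm that the template landed in the namespace where the NodeHealthCheck will look for it:
$ oc get machinedeletionremediationtemplates -n openshift-operators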
Create a NodeHealthCheck
With our SelfNodeRemediation CR configured, and a MachineDeletionRemediationTemplate created, we can configure our NodeHealthCheck. We will create a NodeHealthCheck that only works on nodes with the label node-role.kubernetes.io/worker assigned.
WARNING: If you are running a Single Node OpenShift cluster, or a Compact Cluster (three nodes), this health check will select and monitor your shared master nodes. This may cause issues with your cluster. It is not recommended to run this tool in a Single Node or Compact cluster.
As previously mentioned, we will be configuring this health check to take a staged approach to remediation: first attempt a reboot of the failed node, and only after 5 minutes start the process to delete the machine. Depending on your particular cluster, a 5 minute timeout may not be enough time for a complete reboot, so be sure to adjust it to match how long your nodes take to reboot.
If the node has not returned to a healthy state after 5 minutes, the remediation will move on to the next stage and trigger the Machine Deletion Remediation Operator to delete the node.
Once the machine is deleted, the Machine API controller will notice that there are not enough Machines to satisfy the configured MachineSet replicas target, and will start building a replacement node.
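If you want to watch this reconciliation happen during the testing later in this post, keeping an eye on the MachineSet replica counts from a second terminal makes it easy to spot the moment a replacement Machine is created:
$ oc get machinesets -n openshift-machine-api -w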
Create a file called nhc-config.yaml and populate it with the following contents:
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: worker-node-healthcheck
spec:
  escalatingRemediations:
  - order: 1
    remediationTemplate:
      apiVersion: self-node-remediation.medik8s.io/v1alpha1
      kind: SelfNodeRemediationTemplate
      name: self-node-remediation-resource-deletion-template
      namespace: openshift-operators
    timeout: 5m
  - order: 2
    remediationTemplate:
      apiVersion: machine-deletion-remediation.medik8s.io/v1alpha1
      kind: MachineDeletionRemediationTemplate
      name: worker-delete-remediation
      namespace: openshift-operators
    timeout: 20m
  minHealthy: 51%
  selector:
    matchExpressions:
    - key: node-role.kubernetes.io/worker
      operator: Exists
      values: []
  unhealthyConditions:
  - duration: 300s
    status: 'False'
    type: Ready
  - duration: 300s
    status: Unknown
    type: Ready
Now we will apply that file to our cluster:
$ oc create -f nhc-config.yaml
With the NodeHealthCheck configuration applied, we will check the state of the NodeHealthCheck.
$ oc describe nodehealthcheck
...
Status:
  Conditions:
    Last Transition Time:  2023-10-10T13:16:25Z
    Message:               No issues found, NodeHealthCheck is enabled.
    Reason:                NodeHealthCheckEnabled
    Status:                False
    Type:                  Disabled
  Healthy Nodes:           4
  Last Update Time:        2023-10-10T13:16:25Z
  Observed Nodes:          4
  Phase:                   Enabled
  Reason:                  NHC is enabled, no ongoing remediation
Events:
  Type    Reason   Age    From             Message
  ----    ------   ----   ----             -------
  Normal  Enabled  2m12s  NodeHealthCheck  No issues found, NodeHealthCheck is enabled.
The output above is truncated, but the important information is shown. Specifically, we can see that it has identified 4 nodes (“Observed Nodes”) and that all 4 nodes are healthy (“Healthy Nodes”). Now let's go break something and see this in action.
Testing the healthcheck
Now for the fun part: we are going to check that the health check is actually working. I will be testing with a cluster built using IPI on VMware; the tests should work on any cluster that has the proper platform configuration.
WARNING: we are about to intentionally cause the disruption of a node in the cluster. Don't ever do this in a production environment unless you know what you are doing! You have been WARNED!
We will cause a temporary disruption of a node by stopping the kubelet service. Since the kubelet service is critical to the node's operation, this will cause the node to go unhealthy. SSH to the node you want to disrupt using the core account that was set up during the cluster install.
$ ssh core@<ip of worker>
$ sudo su -
# systemctl stop kubelet
# exit
$ exit
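While you wait for the failure to be detected, you can watch the node transition from Ready to NotReady from another terminal:
$ oc get nodes -w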
Now it will take 5 minutes (300 seconds) for the NodeHealthCheck to pick up on this failed state. This is due to the configuration in our nhc-config.yaml file from earlier, where we configured:
unhealthyConditions:
- duration: 300s
  status: 'False'
  type: Ready
- duration: 300s
  status: Unknown
  type: Ready
After 5 minutes, check the status of our NodeHealthCheck again:
$ oc describe nodehealthcheck
...
Status:
  Conditions:
    Last Transition Time:  2023-10-10T13:16:25Z
    Message:               No issues found, NodeHealthCheck is enabled.
    Reason:                NodeHealthCheckEnabled
    Status:                False
    Type:                  Disabled
  Healthy Nodes:           3
  In Flight Remediations:
    acm-775pj-worker-l7vfq:  2023-10-10T13:30:15Z
  Last Update Time:          2023-10-10T13:30:15Z
  Observed Nodes:            4
  Phase:                     Remediating
  Reason:                    NHC is remediating 1 nodes
  Unhealthy Nodes:
    Name:  acm-775pj-worker-l7vfq
    Remediations:
      Resource:
        API Version:  self-node-remediation.medik8s.io/v1alpha1
        Kind:         SelfNodeRemediation
        Name:         acm-775pj-worker-l7vfq
        Namespace:    openshift-operators
        UID:          5561c44d-75fd-4103-856b-11608eb37cfe
      Started:        2023-10-10T13:30:15Z
Events:
  Type     Reason              Age  From             Message
  ----     ------              ---  ----             -------
  Warning  Disabled            27m  NodeHealthCheck  Disabling NHC. Reason: RemediationTemplateNotFound, Message: Remediation template not found: "failed to get external remediation template openshift-operators/worker-remediation: machinedeletionremediationtemplates.machine-deletion-remediation.medik8s.io \"worker-remediation\" not found"
  Normal   Enabled             14m  NodeHealthCheck  No issues found, NodeHealthCheck is enabled.
  Normal   RemediationCreated  61s  NodeHealthCheck  Created remediation object for node acm-775pj-worker-l7vfq
With the node detected as failed, if you watch the console of the worker node, you should see it start to reboot. Also notice that the “Phase” is now listed as “Remediating”. Wait for your node to reboot and return to a healthy state and then check the NodeHealthCheck state again. You should see that the node has returned to a healthy state and that the NodeHealthCheck “Phase” has returned to “Enabled”.
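If you are curious what the remediation object itself looks like, you can inspect the SelfNodeRemediation CR that was created for the failing node. It lives in the openshift-operators namespace, as shown in the output above; the node name below is from my lab and will differ in yours.
$ oc get selfnoderemediation -n openshift-operators
$ oc describe selfnoderemediation acm-775pj-worker-l7vfq -n openshift-operators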
Testing the Escalation process
OK, so we have seen a simple reboot work, but what happens if the node becomes corrupted in a way where it no longer boots properly? This is where the escalating remediation comes into play. With our current configuration, if the node is still not healthy roughly 10 minutes after the failure (5 minutes to detect the unhealthy condition plus the 5 minute first-stage timeout), we move to the next stage and delete the node so it can be recreated.
WARNING: we are about to intentionally corrupt a node’s boot disk. Don’t ever do this in production. You have been WARNED!
SSH to the node you want to disrupt using the core account that was set up during the cluster install.
$ ssh core@<ip of worker>
$ sudo su -
# dd if=/dev/urandom of=/dev/sda bs=1024 count=1000
# sync
# systemctl stop kubelet
# exit
$ exit
Now we wait. Just like before, about 5 minutes after the kubelet stops, the node will be flagged as unhealthy and rebooted. However, since we also corrupted the boot disk, the node will never come back online. After another 5 minutes (the first-stage timeout), the second stage will trigger and a MachineDeletionRemediation will kick in. This will delete the Machine from OpenShift, and the MachineSet controller will handle replacing the failed node with a new node.
In my lab environment this means that the vSphere machine controller will delete the failed virtual machine, create a new virtual machine, and join it to the cluster.
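To follow along while this happens, you can watch the Machine objects being deleted and recreated, and look at the MachineDeletionRemediation CR that the NodeHealthCheck created for the second stage (the lowercase plural resource name below is assumed to follow the usual pattern):
$ oc get machines -n openshift-machine-api -w
$ oc get machinedeletionremediations -A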
We can check and see that the node that failed has been replaced by a new node:
$ oc get nodes
NAME                     STATUS   ROLES                  AGE    VERSION
acm-775pj-master-0       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-master-1       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-master-2       Ready    control-plane,master   184d   v1.26.7+c7ee51f
acm-775pj-worker-mbxcg   Ready    worker                 43d    v1.26.7+c7ee51f
acm-775pj-worker-rmd8z   Ready    worker                 184d   v1.26.7+c7ee51f
acm-775pj-worker-skjxx   Ready    worker                 23m    v1.26.7+c7ee51f
acm-775pj-worker-ws7nb   Ready    worker                 119d   v1.26.7+c7ee51f
SUCCESS! Our failed node has been replaced and our cluster is back to a healthy state with full capacity.
Conclusion
Kubernetes and OpenShift have the power to manage applications and keep them running in the face of application and node failures, and with the addition of the Node Health Check and Machine Deletion Remediation Operators we can extend that self-healing to the nodes of the cluster themselves.