Understanding OpenShift MachineConfigs and MachineConfigPools
By Mark DeNeve
Introduction
OpenShift 4 is built upon Red Hat CoreOS (RHCOS), and RHCOS is managed differently than most traditional operating systems. Unlike other Kubernetes distributions, where you must manage the base operating system as well as the Kubernetes distribution itself, with OpenShift 4 the RHCOS operating system and the Kubernetes platform are tightly coupled. RHCOS, including any system-level configuration, is managed through MachineConfigs and MachineConfigPools. These constructs allow you to manage system configuration and detect configuration drift on your control plane and worker nodes.
MachineConfigs are responsible for creating and maintaining local RHCOS configuration settings on each node. These settings can be any of the following:
- user creation/deletion
- kernel configs
- file system directories and permissions
- configuration files
- systemd units
In this blog post, we will use a dummy configuration file as an example to distribute to our worker nodes. We will create a file called /etc/sensitive.conf containing the text “critical_config_data”, distribute it to our worker nodes, and see how the MachineConfigPool handles changes to this file, both through the proper channels and through manual, out-of-band changes. We will then cover the creation of additional MachineConfigPools and see how they can be used to manage multiple pools of hardware in heterogeneous clusters.
Terminology
We will be working with a few kinds of Kubernetes objects in this blog.
- Machine - the object that describes the host for a node. A Machine has a providerSpec which describes the attributes of a host for a cloud provider, such as AWS, Azure, or vSphere.
- Node - a Kubernetes construct that can run workloads as pods. A node can be part of at most one MachineConfigPool.
- MachineConfigs - MachineConfig objects define the configuration that you want to apply to a given machine. These can be things like kernel parameters, files, systemd units, etc. A full list of items that are configurable by MachineConfigs can be found here: Machine Configuration Tasks
- MachineConfigPools - A MachineConfigPool is a collection of MachineConfigs, selected based on labels defined on the MachineConfigs. A machine can belong to only one MachineConfigPool; you cannot apply multiple MachineConfigPools to the same machine. MachineConfigPools are responsible for pulling together all the MachineConfigs for a given type of node and applying them to its Machines.
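Both object types can be inspected with read-only commands on any cluster; oc also registers the short names mc and mcp for them:
$ oc get machineconfigs       # short form: oc get mc
$ oc get machineconfigpools   # short form: oc get mcp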
Prerequisites
- OpenShift Cluster 4.10 or later
- Cluster Admin privileges on an OpenShift Cluster
- oc command
Test MachineConfig
We will start by creating our test configuration file, /etc/sensitive.conf, which will contain one line of data: “critical_config_data”. Larger, more complex files should be created using Butane, a tool that simplifies the creation of MachineConfig files.
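For reference, here is a minimal Butane sketch that should render an equivalent MachineConfig (untested here; it assumes the openshift variant at spec version 4.10.0, which maps to Ignition 3.2.0). One nicety is that Butane accepts the familiar octal file mode:
# 100-critical-config.bu
# Render with: butane 100-critical-config.bu -o 100-critical-config.yaml
variant: openshift
version: 4.10.0
metadata:
  name: 100-critical-config
  labels:
    machineconfiguration.openshift.io/role: worker
storage:
  files:
    - path: /etc/sensitive.conf
      mode: 0644        # octal here; Butane converts it to decimal in the rendered Ignition
      overwrite: true
      contents:
        inline: critical_config_data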
Create a new file called “100-critical-config.yaml” and put the following contents in it.
 1  ---
 2  apiVersion: machineconfiguration.openshift.io/v1
 3  kind: MachineConfig
 4  metadata:
 5    labels:
 6      machineconfiguration.openshift.io/role: worker
 7    name: 100-critical-config
 8  spec:
 9    config:
10      ignition:
11        version: 3.2.0
12      storage:
13        files:
14          - contents:
15              source: data:,critical_config_data%0A
16            mode: 420
17            overwrite: true
18            path: /etc/sensitive.conf
Note lines 6, 15, and 16. Line 6 is where we define which role we want our configuration file applied to; we will start by applying this config file only to the “worker” role. Line 15 embeds the file contents as a URL-encoded data URL (the %0A is a trailing newline). The mode on line 16 is DECIMAL, not octal. Normally when setting file permissions in Linux one thinks of “0644” as being “-rw-r--r--”, but this octal setting must be stored as a decimal in an Ignition file, which means that 0644 becomes 420. You can use your favorite octal-to-decimal calculator to make this easier.
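If you have a shell handy, bash’s printf builtin will do the conversion in both directions (a leading zero makes it parse the argument as octal):
$ printf '%d\n' 0644   # octal to decimal
420
$ printf '%o\n' 420    # decimal back to octal
644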
With our MachineConfig file created, we will apply it to our OpenShift Cluster:
$ oc login
$ oc create -f 100-critical-config.yaml
machineconfig.machineconfiguration.openshift.io/100-critical-config created
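Behind the scenes, the MachineConfig controller merges our new object with the existing worker MachineConfigs into a fresh rendered-worker-<hash> config, which you can watch appear with:
$ oc get mc | grep rendered-worker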
With the new MachineConfig applied to our cluster, we will look at our MachineConfigPools to see how it is applied. Run the following command and note that the “worker” MachineConfigPool shows as “Updating”.
$ oc get mcp
NAME     CONFIG              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6   True      False      False      3              3                   3                     0                      8d
worker   rendered-worker-b   False     True       False      4              0                   0                     0                      8d
Run oc get nodes and notice that one of your worker nodes is now “NotReady,SchedulingDisabled”:
$ oc get nodes
NAME                        STATUS                        ROLES    AGE   VERSION
ocp410-zh4dg-master-0       Ready                         master   8d    v1.23.3+e419edf
ocp410-zh4dg-master-1       Ready                         master   8d    v1.23.3+e419edf
ocp410-zh4dg-master-2       Ready                         master   8d    v1.23.3+e419edf
ocp410-zh4dg-worker-d98q7   NotReady,SchedulingDisabled   worker   8d    v1.23.3+e419edf
ocp410-zh4dg-worker-q8q2x   Ready                         worker   8d    v1.23.3+e419edf
ocp410-zh4dg-worker-qsv9s   Ready                         worker   8d    v1.23.3+e419edf
ocp410-zh4dg-worker-r2hcx   Ready                         worker   14h   v1.23.3+e419edf
Run the oc get mcp command again and you should now see that one of your machines has updated:
$ oc get mcp
NAME     CONFIG              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6   True      False      False      3              3                   3                     0                      8d
worker   rendered-worker-b   False     True       False      4              1                   1                     0                      8d
Allow this process to complete as it works through each of your machines. Wait until the oc get mcp command shows UPDATED as “True” for the worker pool before continuing to the next section.
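If you would rather block than poll, oc wait can watch for the same Updated condition you will see later in the oc describe mcp output (the timeout value here is an arbitrary choice):
$ oc wait mcp/worker --for=condition=Updated --timeout=30m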
Out of Band Change
Now that our file has been applied to all our worker nodes, let’s examine it and make a small change locally. Use the oc debug node command to connect to one of your worker nodes:
$ oc debug node/ocp410-zh4dg-worker-d98q7
Starting pod/ocp410-zh4dg-worker-d98q7-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.16.25.127
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /etc/sensitive.conf
critical_config_data
sh-4.4#
Open a new terminal window and run oc get mcp --watch. Then go back to your oc debug command window and run the following command:
# echo "more data" >> /etc/sensitive.conf
You should see the oc get mcp command output change, showing a DEGRADED state. To get more detail, run the oc describe mcp/worker command:
$ oc describe mcp/worker
...
Status:
  Conditions:
    Last Transition Time:  2022-03-14T19:50:08Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2022-03-23T15:53:14Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2022-03-23T15:53:14Z
    Message:               All nodes are updating to rendered-worker-af144fcfd50fb859d796318769bb4a66
    Reason:
    Status:                True
    Type:                  Updating
    Last Transition Time:  2022-03-23T15:53:14Z
    Message:               Node ocp410-zh4dg-worker-d98q7 is reporting: "content mismatch for file \"/etc/sensitive.conf\""
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
Note that the status message shows there is an issue with node “ocp410-zh4dg-worker-d98q7” and that the problem is a “content mismatch”. You can also see that the annotation machineconfiguration.openshift.io/state: Degraded has been added to the node, indicating that there is an issue with it:
$ oc describe node/ocp410-zh4dg-worker-d98q7
Name:               ocp410-zh4dg-worker-d98q7
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.openshift.io/os_id=rhcos
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.vsphere.vmware.com":"ocp410-zh4dg-worker-d98q7"}
                    k8s.ovn.org/host-addresses: ["172.16.25.127","172.16.25.91"]
                    ...
                    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-af144fcfd50fb859d796318769bb4a66
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-af144fcfd50fb859d796318769bb4a66
                    machineconfiguration.openshift.io/reason: content mismatch for file "/etc/sensitive.conf"
                    machineconfiguration.openshift.io/state: Degraded
...
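For a quick cluster-wide view of this state annotation, rather than describing nodes one at a time, a jsonpath one-liner works; note that the dots inside the annotation key must be escaped:
$ oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{end}'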
One thing to note is that in this state, the Machine Config Operator will not fix the issue on its own. Additional steps are required to remediate this out-of-band change.
Manually Forcing MCO to resync
To force the machine-config-daemon on the affected node to re-validate and re-apply its configuration, connect to the node and create the file /run/machine-config-daemon-force:
$ oc debug node/ocp410-zh4dg-worker-d98q7
Starting pod/ocp410-zh4dg-worker-d98q7-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.16.25.127
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# touch /run/machine-config-daemon-force
sh-4.4#
You may need to wait for up to 15 minutes for the force command to take effect. The node will be drained and rebooted to re-apply the configuration.
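If you prefer a one-shot command to an interactive session, oc debug can create the force file directly; the debug pod is torn down as soon as the touch completes:
$ oc debug node/ocp410-zh4dg-worker-d98q7 -- chroot /host touch /run/machine-config-daemon-force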
Creating a new MachineConfigPool
So what if you need a different configuration file on some of your nodes? How do you handle that? The best way is to create a new MachineConfigPool that applies to only certain machines.
Start by creating a new file called “gpu-mcp.yaml” containing a MachineConfigPool that will apply to nodes with a role called “gpu”:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: gpu
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,gpu]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/gpu: ""
Now let’s create a new file called “100-gpu-config.yaml” containing a MachineConfig that will target only nodes with the role “gpu”:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: gpu
  name: 100-gpu-config
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:,gpuenabled%0A
          mode: 420
          overwrite: true
          path: /etc/gpu.conf
Apply these two changes to your cluster and then review the status of the MachineConfigPools:
$ oc create -f gpu-mcp.yaml
machineconfigpool.machineconfiguration.openshift.io/gpu created
$ oc create -f 100-gpu-config.yaml
machineconfig.machineconfiguration.openshift.io/100-gpu-config created
$ oc get mcp
NAME     CONFIG               UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
gpu      rendered-gpu-af14    True      False      False      0              0                   0                     0                      47s
master   rendered-master-67   True      False      False      3              3                   3                     0                      8d
worker   rendered-worker-af   True      False      False      4              4                   4                     0                      8d
You will see that we now have a new MCP called “gpu”, but it has no machines. We can fix this by adding an additional role to one of our machines:
$ oc label node/ocp410-zh4dg-worker-d98q7 node-role.kubernetes.io/gpu=
$ oc get mcp
NAME     CONFIG               UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
gpu      rendered-gpu-900e    False     True       False      1              0                   0                     0                      3m55s
master   rendered-master-67   True      False      False      3              3                   3                     0                      8d
worker   rendered-worker-af   True      False      False      3              3                   3                     0                      8d
Notice now that the gpu MCP has one node in it, and the worker MCP’s machine count has decreased by one. Because the gpu pool’s machineConfigSelector matches MachineConfigs labeled for either the worker or the gpu role, its rendered configuration is a superset of the worker configuration. If we connect to the machine we tagged as “gpu”, we will see that both the /etc/sensitive.conf and /etc/gpu.conf files are present:
$ oc debug node/ocp410-zh4dg-worker-d98q7
Starting pod/ocp410-zh4dg-worker-d98q7-debug ...
To use host binaries, run `chroot /host`
Pod IP: 172.16.25.127
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# cat /etc/sensitive.conf
critical_config_data
sh-4.4# cat /etc/gpu.conf
gpuenabled
sh-4.4#
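To see exactly which MachineConfigs were merged into the gpu pool’s rendered config, you can also read the source list from the pool’s status (the field names are from the machineconfiguration.openshift.io/v1 API; the exact entries will vary by cluster):
$ oc get mcp gpu -o jsonpath='{range .status.configuration.source[*]}{.name}{"\n"}{end}'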
Cleanup
The MachineConfigs that we created here do nothing useful, and we shouldn’t leave unused configurations lying around.
First we will remove the label from the machine that we tagged as node-role=gpu:
$ oc label node/ocp410-zh4dg-worker-d98q7 node-role.kubernetes.io/gpu-
node/ocp410-zh4dg-worker-d98q7 unlabeled
This will remove the node from the “gpu” MachineConfigPool. The node will reboot and the /etc/gpu.conf file will be removed from the node. We can then delete the “gpu” MachineConfigPool and the “100-gpu-config” MachineConfig and finally the “100-critical-config” MachineConfig from the cluster.
$ oc delete mcp/gpu
machineconfigpool.machineconfiguration.openshift.io "gpu" deleted
$ oc delete mc/100-gpu-config
machineconfig.machineconfiguration.openshift.io "100-gpu-config" deleted
$ oc delete mc/100-critical-config
machineconfig.machineconfiguration.openshift.io "100-critical-config" deleted
The Machine Config Operator will go through and remove the customizations we made to all of our worker nodes, deleting the “/etc/sensitive.conf” file one node at a time, just as when we added the file:
$ oc get mcp
NAME     CONFIG              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-6   True      False      False      3              3                   3                     0                      11d
worker   rendered-worker-a   False     True       False      4              0                   0                     0                      11d
When the oc get mcp command shows all pools as Updated, your cluster is back in the same state it was at the beginning of this post.
Conclusion
Managing custom OS configurations on nodes in OpenShift is now handled by the Machine Config Operator. By using MachineConfigs and MachineConfigPools you can be sure that the configuration that you want to apply to your nodes is applied, and if there is drift, it is called out so that you can address and remediate the node.