Security Context
For the CKS we need to go deeper into this topic; for a refresher, take another look at the CKA material on security context.
A security context lets us define privilege and access control settings at the pod level or at the container level.
We can specify:
- user ID (runAsUser)
- group ID (runAsGroup)
- privilege escalation
- Linux capabilities
- others
spec:
  # Pod level (applied to all containers)
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: busybox
    command: ["sleep", "3600"]
    image: busybox
    # Container level
    securityContext:
      runAsUser: 0  # Overrides the pod-level runAsUser for this container
    # Effective values for this container:
    # runAsUser: 0
    # runAsGroup: 3000
    # fsGroup: 2000
  - name: busybox-2
    command: ["sleep", "3600"]
    image: busybox
    # Inherited from the pod-level securityContext:
    # runAsUser: 1000
    # runAsGroup: 3000
    # fsGroup: 2000
In this definition we're saying that:
- runAsUser -> uid=1000 (user ID)
- runAsGroup -> gid=3000 (primary group ID)
- fsGroup -> groups=2000 (supplementary group ID, also applied to mounted volumes)
As of Kubernetes 1.25 (stable; introduced as alpha in 1.23), the pod spec has an os field whose name can be windows or linux (the default). This field indicates which operating system the pod will run on. Besides being useful to kube-scheduler, it also affects security context validation: if os.name is windows, Linux-only securityContext fields must not be set.
spec:
  os:
    name: windows
  containers:
  - name: windows-container
    image: mcr.microsoft.com/windows/servercore:ltsc2022
Some parameters exist at the pod level, some at the container level, and some at both. Let's do a quick analysis just to see what's possible; you don't need to memorize this for the CKS.
A quick overview table covering only the main parameters.
| Parameter | Type | OS | Pod Level | Container Level | Description |
|---|---|---|---|---|---|
| allowPrivilegeEscalation | boolean | Linux | No | Yes | Controls whether the process can gain more privileges than its parent. Automatically true if the container runs with privileged: true or has the CAP_SYS_ADMIN capability. |
| appArmorProfile | AppArmorProfile | Linux | Yes | Yes | If set, the AppArmor profile applied to the pod or container. |
| capabilities | Capabilities | Linux | No | Yes | Adds or drops Linux capabilities in the container. |
| fsGroup | integer | Linux | Yes | No | A supplemental group applied to all containers in the pod and to mounted volumes. |
| privileged | boolean | Linux | No | Yes | Runs the container in privileged mode, granting privileges equivalent to root on the host. Default is false. |
| procMount | string | Linux | No | Yes | The type of proc mount used by the container. |
| readOnlyRootFilesystem | boolean | Linux | No | Yes | Whether the container's root filesystem (/) should be read-only. Default is false. |
| runAsGroup | integer | Linux | Yes | Yes | The GID used to run the entrypoint process. |
| runAsNonRoot | boolean | Linux | Yes | Yes | Requires the container to run as a non-root user; the kubelet validates the image's user at start time. |
| runAsUser | integer | Linux | Yes | Yes | The UID to run as. Defaults to the user specified in the container image metadata. |
| seLinuxOptions | SELinuxOptions | Linux | Yes | Yes | If not specified, the container runtime allocates a random SELinux context for each container. |
| seccompProfile | SeccompProfile | Linux | Yes | Yes | Seccomp options for the containers. |
| supplementalGroups | integer array | Linux | Yes | No | A list of GIDs applied to the first process in each container. |
| supplementalGroupsPolicy | string | Linux | Yes | No | Only used if supplementalGroups is defined. |
| sysctls | Sysctl array | Linux | Yes | No | A list of namespaced sysctls used by the pod. |
| windowsOptions | WindowsSecurityContextOptions | Windows | Yes | Yes | Windows-specific settings applied to all containers. |
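The capabilities field from the table is container-level only and takes add and drop lists. A minimal sketch (the specific capabilities chosen here are illustrative, not taken from this lesson):

```yaml
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      capabilities:
        drop: ["ALL"]               # start from an empty capability set
        add: ["NET_BIND_SERVICE"]   # re-add only what the workload needs
```

Dropping ALL and re-adding only what's needed is the usual least-privilege pattern.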
Let's use this YAML as a base; we'll modify it several times.
root@cks-master:~# k run pod --image=busybox --command -oyaml --dry-run=client -- sh -c 'sleep 1d' > pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~#
runAsUser and runAsGroup (pod and container levels)
- The busybox image runs as root by default, and we didn't pass anything to change that.
- Any file we create is owned by the user we're running as, which in this case is root.
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
/ # touch test
/ # ls -lh test
-rw-r--r-- 1 root root 0 Aug 29 12:00 test
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
Let's define a user for the pod; the container will inherit it since we won't override it.
- We changed the pod's user.
- The container starts in /, where user 1000 has no permission to create anything.
- If we move to a world-writable location like /tmp, we can create files, and they are owned by the specified user and group.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
~ $ id
uid=1000 gid=3000
~ $ touch test
touch: test: Permission denied
~ $ pwd
/
~ $ cd tmp/
/tmp $ touch test
/tmp $ ls -lh test
-rw-r--r-- 1 1000 3000 0 Aug 29 12:07 test
/tmp $ exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
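fsGroup appeared in the first example but wasn't exercised here; it mainly matters when volumes are mounted, since volume contents are made group-owned by that GID and the GID is added as a supplementary group. A minimal sketch (the emptyDir volume and the /data path are illustrative):

```yaml
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000          # volume files get this GID; also added as a supplementary group
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    volumeMounts:
    - name: data
      mountPath: /data     # writable by GID 2000 thanks to fsGroup
  volumes:
  - name: data
    emptyDir: {}
```

Inside the container, id should then show groups=2000, and files created under /data belong to group 2000.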
runAsNonRoot (pod and container level)
Now let's force the container to run as non-root, without specifying a user. If the container image already defines a non-root user there's no problem, but if it runs as root the container won't start.
- In this scenario we keep the pod-level user, so there are no problems: the main process already runs as non-root user 1000.
- Note that the owner of process 1 is user 1000, which we kept.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k get pods
NAME READY STATUS RESTARTS AGE
pod 1/1 Running 0 5s
root@cks-master:~# k exec -it pod -- sh
~ $
~ $ ps
PID USER TIME COMMAND
1 1000 0:00 sh -c sleep 1d
8 1000 0:00 sh
14 1000 0:00 ps
~ $ exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
However, if we remove the user, we hit the problem mentioned above.
- In this case runAsNonRoot requires the image itself to define a non-root user, which busybox does not.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  # securityContext:
  #   runAsUser: 1000
  #   runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k get pod
NAME READY STATUS RESTARTS AGE
pod 0/1 CreateContainerConfigError 0 3s
root@cks-master:~# k get pod pod -o jsonpath={.status.containerStatuses.*.state} | jq
{
"waiting": {
"message": "container has runAsNonRoot and image will run as root (pod: \"pod_default(7dd567c0-ead9-4460-a012-35acd9122bad)\", container: pod)",
"reason": "CreateContainerConfigError"
}
}
root@cks-master:~# k delete pod pod
pod "pod" deleted
The default nginx image runs as root, but there's an alternative image that doesn't; let's use it for testing.
root@cks-master:~# k run nginx --image=nginxinc/nginx-unprivileged -o yaml --dry-run=client > podnonroot.yaml
root@cks-master:~# vim podnonroot.yaml
root@cks-master:~# cat podnonroot.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginxinc/nginx-unprivileged
    name: nginx
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f podnonroot.yaml
pod/nginx created
root@cks-master:~# k exec -it nginx -- bash
nginx@nginx:/$ id
uid=101(nginx) gid=101(nginx) groups=101(nginx)
nginx@nginx:/$ exit
exit
root@cks-master:~# k delete pod nginx
pod "nginx" deleted
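Along the same lines, readOnlyRootFilesystem from the table can be tried with the same base pod. A minimal sketch (the emptyDir mount giving the pod a writable /tmp is an assumption about what the workload needs):

```yaml
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      readOnlyRootFilesystem: true   # any write outside mounted volumes fails
    volumeMounts:
    - name: tmp
      mountPath: /tmp                # writable scratch space via a volume
  volumes:
  - name: tmp
    emptyDir: {}
```

With this in place, touch /test fails with a read-only filesystem error, while /tmp remains writable.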
privileged (Container Level)
By default containers run unprivileged, but it's possible to run them as privileged.
One use case is running docker-in-docker, that is, a container inside a container; another is a container that needs access to all host devices.
Running a container as privileged means that user 0 (root) in the container is directly mapped to user 0 (root) on the host. Normally, one of the isolation guarantees of containers is that a UID inside the container may match a UID on the host or in other containers while remaining a distinct identity with different permissions; privileged mode removes much of that isolation.
With the sysctl command we can set kernel parameters at runtime, but this requires root permission.
Without privileged, even as root inside the container we can't change them: the relevant /proc paths are mounted read-only.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
/ # sysctl kernel.hostname=cks
sysctl: error setting key 'kernel.hostname': Read-only file system
Now let's set privileged to true. Remember that privileged is a container-level security context only.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      privileged: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
# For the record, the hostname is the pod name itself (in this case, pod).
# The sysctl command changes the value under /proc temporarily; it does not touch /etc/hostname.
root@cks-master:~# k exec pod -it -- sh
/ # cat /proc/sys/kernel/hostname
pod
/ # sysctl kernel.hostname=cks-test
kernel.hostname = cks-test
/ # cat /proc/sys/kernel/hostname
cks-test
/ # cat /etc/hostname
pod
/ # exit
# On the worker node where the pod is running, the hostname is unchanged: even with privileged, the change applies inside the pod's UTS namespace, not on the host.
root@cks-worker:~# cat /proc/sys/kernel/hostname
cks-worker
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
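If the only reason for privileged was to set a sysctl, the pod-level sysctls field from the table is a much narrower alternative for namespaced sysctls. A sketch (kernel.shm_rmid_forced is one of the sysctls Kubernetes treats as safe by default):

```yaml
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced   # a "safe" namespaced sysctl
      value: "1"
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
```

Unsafe sysctls must first be allowed on the node via the kubelet's --allowed-unsafe-sysctls flag.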
allowPrivilegeEscalation (Container Level)
Now let's talk about allowPrivilegeEscalation, which defaults to true.
allowPrivilegeEscalation is a security setting that controls whether a process inside a container can gain additional privileges, for example via sudo or setuid binaries.
How it works:
- allowPrivilegeEscalation: true (default): allows processes inside the container to escalate their privileges. This may be necessary for applications that need to temporarily elevate privileges for certain operations.
- allowPrivilegeEscalation: false: blocks privilege escalation. Even if the container runs as root, it won't be able to use mechanisms like sudo or setuid to gain additional privileges. This setting is an extra layer of security that limits what processes inside the container can do.

Relationship with privileged and runAsNonRoot:
- privileged: if the container runs with privileged: true, the allowPrivilegeEscalation setting is effectively ignored, since the container already has full privileges on the host.
- runAsNonRoot: if runAsNonRoot: true is configured, allowPrivilegeEscalation should generally be false as well, since the goal is to ensure the container has no root access and no way to escalate to it.
A NoNewPrivs value of 0 means the flag is disabled, i.e., the process can escalate privileges.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      # This is already the default; it's here just to confirm
      allowPrivilegeEscalation: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ #
/ # cat /proc/1/status | grep NoNewPrivs
NoNewPrivs:0
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
Let's change it to false; NoNewPrivs should now be 1, showing the flag is enabled.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: pod
name: pod
spec:
containers:
- command:
- sh
- -c
- sleep 1d
image: busybox
name: pod
resources: {}
securityContext:
allowPrivilegeEscalation: false
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ #
/ # cat /proc/1/status | grep NoNewPrivs
NoNewPrivs:1
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
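Putting the fields from this section together, a commonly recommended hardened baseline for a pod looks roughly like this (a sketch, not an official template):

```yaml
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    runAsNonRoot: true               # refuse to start if the image runs as root
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```

Individual workloads may need to relax specific fields, but starting locked down and opening up deliberately is the safer direction.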
AppArmor and Seccomp
Take an overview of these two security mechanisms we have in Linux.
This content, although covered in the CKS, is provided in the apparmor and seccomp sections.
Studying both tools is necessary.