Security Context
For the CKS we need to go deeper into this topic; for a refresher, take another look at the CKA material on security context.
A security context lets us define privilege and access control settings at the pod level or at the container level.
We can specify:
- user ID (runAsUser)
- group ID (runAsGroup)
- privilege escalation
- Linux capabilities
- others
spec:
  # Pod level (applied to all containers)
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000
  containers:
  - name: busybox
    command: ["sleep", "3600"]
    image: busybox
    # Container level
    securityContext:
      runAsUser: 0  # Overrides the pod-level runAsUser for this container
    # Effective values for this container:
    # runAsUser: 0
    # runAsGroup: 3000
    # fsGroup: 2000
  - name: busybox-2
    command: ["sleep", "3600"]
    image: busybox
    # Inherited from the pod-level securityContext:
    # runAsUser: 1000
    # runAsGroup: 3000
    # fsGroup: 2000
In this definition we're saying that:
- runAsUser -> uid=1000 (user ID)
- runAsGroup -> gid=3000 (primary group ID)
- fsGroup -> groups=2000 (supplementary group ID, also applied to mounted volumes)
As of Kubernetes 1.25 (stable; introduced as alpha in 1.23), the pod spec has an os field whose name can be windows or linux (the default). This field indicates which operating system the pod will run on. Besides being useful to kube-scheduler, it also affects security context validation: if os.name is windows, Linux-only securityContext fields must not be set.
spec:
  os:
    name: windows
  containers:
  - name: windows-container
    image: mcr.microsoft.com/windows/servercore:ltsc2022
Some parameters exist at the pod level, some at the container level, and some at both. Let's do a quick analysis just to see what's possible; you don't need to memorize this for the CKS.
A quick overview table covering only the main parameters.
| Parameter | Type | OS | Pod Level | Container Level | Description |
|---|---|---|---|---|---|
| allowPrivilegeEscalation | boolean | Linux | No | Yes | Controls whether the process can gain more privileges than its parent. Automatically true if the container runs with privileged: true or has the CAP_SYS_ADMIN capability. |
| appArmorProfile | AppArmorProfile | Linux | Yes | Yes | If set, the AppArmor profile applied to the pod or container. |
| capabilities | Capabilities | Linux | No | Yes | Adds or drops Linux capabilities in the container. |
| fsGroup | integer | Linux | Yes | No | A supplemental group applied to all containers in the pod and to mounted volumes. |
| privileged | boolean | Linux | No | Yes | Runs the container in privileged mode, granting privileges equivalent to root on the host. Default is false. |
| procMount | string | Linux | No | Yes | The type of proc mount used by the container. |
| readOnlyRootFilesystem | boolean | Linux | No | Yes | Whether the container's root filesystem (/) should be read-only. Default is false. |
| runAsGroup | integer | Linux | Yes | Yes | The GID used to run the entrypoint process. |
| runAsNonRoot | boolean | Linux | Yes | Yes | Requires the container to run as a non-root user; the kubelet validates the image's user at start time. |
| runAsUser | integer | Linux | Yes | Yes | The UID to run as. Defaults to the user specified in the container image metadata. |
| seLinuxOptions | SELinuxOptions | Linux | Yes | Yes | If not specified, the container runtime allocates a random SELinux context for each container. |
| seccompProfile | SeccompProfile | Linux | Yes | Yes | Seccomp options for the containers. |
| supplementalGroups | integer array | Linux | Yes | No | A list of GIDs applied to the first process in each container. |
| supplementalGroupsPolicy | string | Linux | Yes | No | Only used if supplementalGroups is defined. |
| sysctls | Sysctl array | Linux | Yes | No | A list of namespaced sysctls used by the pod. |
| windowsOptions | WindowsSecurityContextOptions | Windows | Yes | Yes | Windows-specific settings applied to all containers. |
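The capabilities field from the table is container-level only and takes add and drop lists. A minimal sketch (the specific capabilities chosen here are illustrative, not taken from this lesson):

```yaml
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      capabilities:
        drop: ["ALL"]               # start from an empty capability set
        add: ["NET_BIND_SERVICE"]   # re-add only what the workload needs
```

Dropping ALL and re-adding only what's needed is the usual least-privilege pattern.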
Let's use this YAML as a base; we'll modify it several times.
root@cks-master:~# k run pod --image=busybox --command -oyaml --dry-run=client -- sh -c 'sleep 1d' > pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~#
runAsUser and runAsGroup (pod and container levels)
- The busybox image runs as root by default, and we didn't pass anything to change that.
- Any file we create is owned by the user we're running as, which in this case is root.
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
/ # touch test
/ # ls -lh test
-rw-r--r-- 1 root root 0 Aug 29 12:00 test
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
Let's define a user for the pod; the container will inherit it since we won't override it.
- We changed the pod's user.
- The container starts in /, where user 1000 has no permission to create anything.
- If we move to a world-writable location like /tmp, we can create files, and they are owned by the specified user and group.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
~ $ id
uid=1000 gid=3000
~ $ touch test
touch: test: Permission denied
~ $ pwd
/
~ $ cd tmp/
/tmp $ touch test
/tmp $ ls -lh test
-rw-r--r-- 1 1000 3000 0 Aug 29 12:07 test
/tmp $ exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
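fsGroup appeared in the first example but wasn't exercised here; it mainly matters when volumes are mounted, since volume contents are made group-owned by that GID and the GID is added as a supplementary group. A minimal sketch (the emptyDir volume and the /data path are illustrative):

```yaml
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    fsGroup: 2000          # volume files get this GID; also added as a supplementary group
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    volumeMounts:
    - name: data
      mountPath: /data     # writable by GID 2000 thanks to fsGroup
  volumes:
  - name: data
    emptyDir: {}
```

Inside the container, id should then show groups=2000, and files created under /data belong to group 2000.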
runAsNonRoot (pod and container level)
Now let's force the container to run as non-root, without specifying a user. If the container image already defines a non-root user there's no problem, but if it runs as root the container won't start.
- In this scenario we keep the pod-level user, so there are no problems: the main process already runs as non-root user 1000.
- Note that the owner of process 1 is user 1000, which we kept.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k get pods
NAME READY STATUS RESTARTS AGE
pod 1/1 Running 0 5s
root@cks-master:~# k exec -it pod -- sh
~ $
~ $ ps
PID USER TIME COMMAND
1 1000 0:00 sh -c sleep 1d
8 1000 0:00 sh
14 1000 0:00 ps
~ $ exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
However, if we remove the user, we hit the problem mentioned above.
- In this case runAsNonRoot requires the image itself to define a non-root user, which busybox does not.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  # securityContext:
  #   runAsUser: 1000
  #   runAsGroup: 3000
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k get pod
NAME READY STATUS RESTARTS AGE
pod 0/1 CreateContainerConfigError 0 3s
root@cks-master:~# k get pod pod -o jsonpath={.status.containerStatuses.*.state} | jq
{
"waiting": {
"message": "container has runAsNonRoot and image will run as root (pod: \"pod_default(7dd567c0-ead9-4460-a012-35acd9122bad)\", container: pod)",
"reason": "CreateContainerConfigError"
}
}
root@cks-master:~# k delete pod pod
pod "pod" deleted
The default nginx image runs as root, but there's an alternative image that doesn't; let's use it for testing.
root@cks-master:~# k run nginx --image=nginxinc/nginx-unprivileged -o yaml --dry-run=client > podnonroot.yaml
root@cks-master:~# vim podnonroot.yaml
root@cks-master:~# cat podnonroot.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginxinc/nginx-unprivileged
    name: nginx
    resources: {}
    securityContext:
      runAsNonRoot: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f podnonroot.yaml
pod/nginx created
root@cks-master:~# k exec -it nginx -- bash
nginx@nginx:/$ id
uid=101(nginx) gid=101(nginx) groups=101(nginx)
nginx@nginx:/$ exit
exit
root@cks-master:~# k delete pod nginx
pod "nginx" deleted
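Along the same lines, readOnlyRootFilesystem from the table can be tried with the same base pod. A minimal sketch (the emptyDir mount giving the pod a writable /tmp is an assumption about what the workload needs):

```yaml
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      readOnlyRootFilesystem: true   # any write outside mounted volumes fails
    volumeMounts:
    - name: tmp
      mountPath: /tmp                # writable scratch space via a volume
  volumes:
  - name: tmp
    emptyDir: {}
```

With this in place, touch /test fails with a read-only filesystem error, while /tmp remains writable.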
privileged (Container Level)
By default containers run unprivileged, but it's possible to run them as privileged.
One use case is running docker-in-docker, that is, a container inside a container; another is a container that needs access to all host devices.
Running a container as privileged means that user 0 (root) in the container is directly mapped to user 0 (root) on the host. Normally, one of the isolation guarantees of containers is that a UID inside the container may match a UID on the host or in other containers while remaining a distinct identity with different permissions; privileged mode removes much of that isolation.
With the sysctl command we can set kernel parameters at runtime, but this requires root permission.
Without privileged, even as root inside the container we can't change them: the relevant /proc paths are mounted read-only.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ # id
uid=0(root) gid=0(root) groups=10(wheel)
/ # sysctl kernel.hostname=cks
sysctl: error setting key 'kernel.hostname': Read-only file system
Now let's set privileged to true. Remember that privileged is a container-level security context only.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      privileged: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
# For the record, the hostname is the pod name itself (in this case, pod).
# The sysctl command changes the value under /proc temporarily; it does not touch /etc/hostname.
root@cks-master:~# k exec pod -it -- sh
/ # cat /proc/sys/kernel/hostname
pod
/ # sysctl kernel.hostname=cks-test
kernel.hostname = cks-test
/ # cat /proc/sys/kernel/hostname
cks-test
/ # cat /etc/hostname
pod
/ # exit
# On the worker node where the pod is running, the hostname is unchanged: even with privileged, the change applies inside the pod's UTS namespace, not on the host.
root@cks-worker:~# cat /proc/sys/kernel/hostname
cks-worker
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
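If the only reason for privileged was to set a sysctl, the pod-level sysctls field from the table is a much narrower alternative for namespaced sysctls. A sketch (kernel.shm_rmid_forced is one of the sysctls Kubernetes treats as safe by default):

```yaml
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced   # a "safe" namespaced sysctl
      value: "1"
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
```

Unsafe sysctls must first be allowed on the node via the kubelet's --allowed-unsafe-sysctls flag.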
allowPrivilegeEscalation (Container Level)
Now let's talk about allowPrivilegeEscalation, which defaults to true.
allowPrivilegeEscalation is a security setting that controls whether a process inside a container can gain additional privileges, for example via sudo or setuid binaries.
How it works:
- allowPrivilegeEscalation: true (default): allows processes inside the container to escalate their privileges. This may be necessary for applications that need to temporarily elevate privileges for certain operations.
- allowPrivilegeEscalation: false: blocks privilege escalation. Even if the container runs as root, it won't be able to use mechanisms like sudo or setuid to gain additional privileges. This setting is an extra layer of security that limits what processes inside the container can do.

Relationship with privileged and runAsNonRoot:
- privileged: if the container runs with privileged: true, the allowPrivilegeEscalation setting is effectively ignored, since the container already has full privileges on the host.
- runAsNonRoot: if runAsNonRoot: true is configured, allowPrivilegeEscalation should generally be false as well, since the goal is to ensure the container has no root access and no way to escalate to it.
A NoNewPrivs value of 0 means the flag is disabled, i.e., the process can escalate privileges.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: pod
  name: pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 1d
    image: busybox
    name: pod
    resources: {}
    securityContext:
      # This is already the default; it's here just to confirm
      allowPrivilegeEscalation: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ #
/ # cat /proc/1/status | grep NoNewPrivs
NoNewPrivs:0
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
Let's change it to false; NoNewPrivs should now be 1, showing the flag is enabled.
root@cks-master:~# vim pod.yaml
root@cks-master:~# cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: pod
name: pod
spec:
containers:
- command:
- sh
- -c
- sleep 1d
image: busybox
name: pod
resources: {}
securityContext:
allowPrivilegeEscalation: false
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
root@cks-master:~# k apply -f pod.yaml
pod/pod created
root@cks-master:~# k exec -it pod -- sh
/ #
/ # cat /proc/1/status | grep NoNewPrivs
NoNewPrivs:1
/ # exit
root@cks-master:~# k delete pod pod --force --grace-period 0
Warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
pod "pod" force deleted
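Putting the fields from this section together, a commonly recommended hardened baseline for a pod looks roughly like this (a sketch, not an official template):

```yaml
spec:
  securityContext:
    runAsUser: 1000
    runAsGroup: 3000
    runAsNonRoot: true               # refuse to start if the image runs as root
  containers:
  - name: busybox
    image: busybox
    command: ["sh", "-c", "sleep 1d"]
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```

Individual workloads may need to relax specific fields, but starting locked down and opening up deliberately is the safer direction.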
AppArmor and Seccomp
Take an overview of these two security mechanisms we have in Linux.
This content, although covered in the CKS, is provided in the apparmor and seccomp sections.
Studying both tools is necessary.