Security Context

Container vs Host vs Privileges

First, let's understand a bit about container security, since the Kubernetes settings only make sense once we understand containers.

The host that runs the containers has its own processes: those of the operating system itself, plus the Docker daemon or whichever container runtime is installed.

To have something to inspect, let's run a container and have it sleep for 1 hour, so that a process keeps it alive and it doesn't exit.

docker run --name ubuntu  ubuntu sleep 3600

The container shares the host's kernel, but the two are isolated through Linux namespaces. The host has its own namespaces and each container gets a separate set.

That's why a container can see only its own processes and nothing outside of it.
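One way to check this isolation is to compare the namespace links the kernel exposes under /proc. The sketch below is only an illustration: same_namespace is a hypothetical helper name, and the usage lines assume the container started above is still running under the name ubuntu.

```shell
# Succeed when two namespace links (for example /proc/<pid>/ns/pid)
# point to the same namespace. The kernel renders each link as
# something like "pid:[4026531836]"; equal text means same namespace.
same_namespace() {
    [ "$(readlink "$1")" = "$(readlink "$2")" ]
}

# Two processes started from this shell share its PID namespace:
same_namespace /proc/self/ns/pid /proc/self/ns/pid && echo "same PID namespace"

# Usage against the container (reading another user's /proc entries needs root):
#   cpid=$(docker inspect --format '{{.State.Pid}}' ubuntu)
#   sudo readlink /proc/1/ns/pid "/proc/$cpid/ns/pid"
# If the two printed identifiers differ, the container's process lives
# in a different PID namespace than the host's init process.
```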

List the container from the host, then look at the processes inside it:

docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED              STATUS              PORTS                       NAMES
764da973f0fe   ubuntu                 "sleep 3600"             About a minute ago   Up About a minute                               ubuntu

docker exec -it ubuntu  ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0   2792  1408 ?        Ss   08:04   0:00 sleep 3600
root           7  0.0  0.0   7064  2944 pts/0    Rs+  08:05   0:00 ps aux

The process that keeps the container running has PID 1.

On the host:

ps -aux | grep sleep
david-p+ 1628755  0.0  0.0 2066812 25536 pts/2   Sl+  05:04   0:00 docker run --name ubuntu ubuntu sleep 3600
root     1628843  0.0  0.0   2792  1408 ?        Ss   05:04   0:00 sleep 3600
david-p+ 1631734  0.0  0.0   9220  2560 pts/5    R+   05:08   0:00 grep --color=auto sleep

On the host we can see the container's process, sleep 3600, but with a different PID. A process has a separate PID in each PID namespace it belongs to: here it is 1 inside the container and 1628843 on the host.
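The kernel records both numbers in the NSpid line of /proc/<pid>/status (available since kernel 4.1). A minimal sketch for reading it follows; ns_pids is a hypothetical helper name, and the usage line assumes the process from the ps output above is still running.

```shell
# Print the PIDs a process has in each nested PID namespace,
# outermost namespace first, given the path to its status file.
ns_pids() {
    awk '/^NSpid:/ {
        for (i = 2; i <= NF; i++)
            printf "%s%s", $i, (i < NF ? " " : "\n")
    }' "$1"
}

# For the current shell there is usually a single PID:
if [ -r "/proc/$$/status" ]; then
    ns_pids "/proc/$$/status"
fi

# Usage on the host for the container's sleep process:
#   ns_pids /proc/1628843/status
# This should list the host PID followed by the PID inside the container.
```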

If we create another container, what can we see?

docker run --name centos centos sleep 3600

docker exec -it centos ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  23048  2560 ?        Ss   08:14   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
root           7  1.0  0.0  44668  3456 pts/0    Rs+  08:16   0:00 ps aux

# and on the host

ps -aux | grep sleep
david-p+ 1628755  0.0  0.0 2066812 24948 pts/2   Sl+  05:04   0:00 docker run --name ubuntu ubuntu sleep 3600
root     1628843  0.0  0.0   2792  1408 ?        Ss   05:04   0:00 sleep 3600
david-p+ 1634805  0.0  0.0 1992824 26052 pts/5   Sl+  05:14   0:00 docker run --name centos centos sleep 3600
root     1634985  0.0  0.0  23048  2560 ?        Ss   05:14   0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
david-p+ 1639928  0.0  0.0   9220  2560 pts/4    S+   05:17   0:00 grep --color=auto sleep

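Here we can see that the process inside the container runs as the root user.

We can force another user to run the process by passing --user 1000 to the command. On most Linux distributions, 1000 is the UID of the first regular user created after root. On my personal machine my user ID is 1001, so I use that below. Since the name ubuntu is reused, the earlier container has to be removed first (docker rm -f ubuntu).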

docker run --name ubuntu --user 1001 ubuntu sleep 3600

docker exec -it ubuntu  ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
1001           1  0.0  0.0   2792  1536 ?        Ss   08:23   0:00 sleep 3600
1001           7  0.0  0.0   7064  2944 pts/0    Rs+  08:23   0:00 ps aux

Or define it in the Dockerfile when building a custom image:

FROM ubuntu
USER 1001

Now on the host we have:

ps aux | grep sleep
david-p+ 1658884  0.0  0.0 1845040 25728 pts/9   Sl+  05:38   0:00 docker run --name ubuntu --user 1001 ubuntu sleep 3600
david-p+ 1658972  0.0  0.0   2792  1408 ?        Ss   05:38   0:00 sleep 3600
david-p+ 1660399  0.0  0.0   9220  2688 pts/8    S+   05:40   0:00 grep --color=auto sleep

And who owns the process? david-p+, my own user (ps truncates long user names with a +).

What we can see is that, by default (without user namespace remapping), user IDs are shared between the container and the host. Root in the container is the same UID 0 as root on the host, and UID 1001 in the container is the host's UID 1001, which is my own account. That's why sleep shows up on the host as a process of my user.
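Because ps shortens user names, comparing numeric UIDs is less ambiguous. A small sketch that reads the real UID from a status file is shown below; proc_uid is a hypothetical helper name, and the PID in the usage line is the one from the ps output above.

```shell
# Print the real UID of a process from its /proc/<pid>/status file.
# The Uid line holds four values: real, effective, saved, filesystem.
proc_uid() {
    awk '/^Uid:/ { print $2 }' "$1"
}

# The current shell, for comparison with `id -u`:
if [ -r "/proc/$$/status" ]; then
    proc_uid "/proc/$$/status"
fi

# Usage on the host for the container's sleep process:
#   proc_uid /proc/1658972/status
# and inside the container:
#   docker exec ubuntu id -u
# With --user 1001 both should report the same number.
```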

But wasn't it supposed to be isolated? If this is true, isn't it dangerous?

Can the process inside the container do everything that root can do on the host? Actually, no. Docker uses a Linux feature called capabilities to limit the powers of root (or of whichever user was passed) inside the container. It's the same user, but with limited powers.

The root user can do anything on the host system as we already know.

To see the full list of capabilities the kernel defines, we can look at the file /usr/include/linux/capability.h.
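Each capability in that header has a bit number, and the kernel reports a process's effective set as a hex mask in the CapEff line of /proc/<pid>/status. The sketch below tests single bits of such a mask. has_cap is a hypothetical helper name, and the mask a80425fb is an assumption: it is the value that corresponds to Docker's default capability set.

```shell
# Succeed if capability bit number $2 is set in the hex mask $1.
# Bit numbers come from capability.h, e.g. CAP_CHOWN = 0,
# CAP_NET_ADMIN = 12, CAP_SYS_BOOT = 22.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

mask=00000000a80425fb   # assumed: CapEff of a default Docker container

if has_cap "$mask" 0;  then echo "CHOWN: granted";     else echo "CHOWN: not granted";     fi
if has_cap "$mask" 12; then echo "NET_ADMIN: granted"; else echo "NET_ADMIN: not granted"; fi

# To read the real mask of a process:
#   grep CapEff /proc/self/status
#   docker exec ubuntu grep CapEff /proc/1/status
```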

If we want to remove this limitation so that the container's root user can do more things we can do:

# This container already runs as root; --privileged additionally gives that root all capabilities and access to the host's devices.
docker run --name ubuntu --privileged  ubuntu sleep 3600

# Or add only a specific capability (--cap-drop would remove one). NET_ADMIN is not in Docker's default set.
docker run --name ubuntu --cap-add NET_ADMIN ubuntu sleep 3600
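To make the effect of --cap-add and --cap-drop concrete, the sketch below models them as bit operations on the capability mask. It is only a model, not what Docker runs internally: add_cap and drop_cap are hypothetical helper names, and the starting mask a80425fb is assumed to be Docker's default set.

```shell
DEFAULT_MASK=a80425fb   # assumed default capability mask

# Set bit $2 in hex mask $1 and print the result as 16 hex digits.
add_cap() {
    printf '%016x\n' $(( 0x$1 | (1 << $2) ))
}

# Clear bit $2 in hex mask $1 and print the result as 16 hex digits.
drop_cap() {
    printf '%016x\n' $(( 0x$1 & ~(1 << $2) ))
}

add_cap  "$DEFAULT_MASK" 12   # like --cap-add NET_ADMIN, prints 00000000a80435fb
drop_cap "$DEFAULT_MASK" 0    # like --cap-drop CHOWN,    prints 00000000a80425fa
```

The printed values can be compared with the CapEff line of /proc/1/status inside a container started with the corresponding flag.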

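Here is the list. Docker grants the first group (AUDIT_WRITE through SYS_CHROOT) to containers by default. The second group (AUDIT_CONTROL onward) is not granted unless it is added with --cap-add or --privileged.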

Capability Key	Capability Description
AUDIT_WRITE	Write records to the kernel audit log.
CHOWN	Make arbitrary changes to file UIDs and GIDs (see chown(2)).
DAC_OVERRIDE	Bypass file read, write, and execute permission checks.
FOWNER	Bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file.
FSETID	Don't clear set-user-ID and set-group-ID permission bits when a file is modified.
KILL	Bypass permission checks for sending signals.
MKNOD	Create special files using mknod(2).
NET_BIND_SERVICE	Bind a socket to privileged Internet domain ports (port numbers less than 1024).
NET_RAW	Use RAW and PACKET sockets.
SETFCAP	Set file capabilities.
SETGID	Make arbitrary manipulations of process GIDs and supplementary GID list.
SETPCAP	Modify process capabilities.
SETUID	Make arbitrary manipulations of process UIDs.
SYS_CHROOT	Use chroot(2), change root directory.
AUDIT_CONTROL	Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules.
AUDIT_READ	Allow reading the audit log via multicast netlink socket.
BLOCK_SUSPEND	Allow preventing system suspends.
BPF	Allow creating BPF maps, loading BPF Type Format (BTF) data, retrieve JITed code of BPF programs and more.
CHECKPOINT_RESTORE	Allow checkpoint/restore related operations. Introduced in kernel 5.9.
DAC_READ_SEARCH	Bypass file read permission checks and directory read and execute permission checks.
IPC_LOCK	Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2)).
IPC_OWNER	Bypass permission checks for operations on System V IPC objects.
LEASE	Establish leases on arbitrary files (see fcntl(2)).
LINUX_IMMUTABLE	Set the FS_APPEND_FL and FS_IMMUTABLE_FL i-node flags.
MAC_ADMIN	Allow MAC configuration or state changes. Implemented for the Smack LSM.
MAC_OVERRIDE	Override Mandatory Access Control (MAC). Implemented for the Smack Linux Security Module (LSM).
NET_ADMIN	Perform various network-related operations.
NET_BROADCAST	Make socket broadcasts and listen to multicasts.
PERFMON	Allow privileged system performance and observability operations using perf_events, i915_perf and other kernel subsystems.
SYS_ADMIN	Perform a range of system administration operations.
SYS_BOOT	Use reboot(2) and kexec_load(2), reboot and load a new kernel for later execution.
SYS_MODULE	Load and unload kernel modules.
SYS_NICE	Raise process nice value (nice(2), setpriority(2)) and change the nice value for arbitrary processes.
SYS_PACCT	Use acct(2), switch process accounting on or off.
SYS_PTRACE	Trace arbitrary processes using ptrace(2).
SYS_RAWIO	Perform I/O port operations (iopl(2) and ioperm(2)).
SYS_RESOURCE	Override resource limits.
SYS_TIME	Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock.
SYS_TTY_CONFIG	Use vhangup(2); employ various privileged ioctl(2) operations on virtual terminals.
SYSLOG	Perform privileged syslog(2) operations.
WAKE_ALARM	Trigger something that will wake up the system.

A user with all of these capabilities could kill system processes, reboot the machine, change the network configuration, create new users, change file permissions, and so on.

docker run -it --name ubuntu --privileged ubuntu bash
root@a40542f4c76e:~# reboot
bash: reboot: command not found

The Ubuntu image is minimal and ships few utilities, which already helps with security. Alpine is even smaller, with much less. That's why Alpine is great. Less is more!

docker run -it --name centos --privileged centos:8 bash
reboot
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
Failed to talk to init daemon.

# Inside the container
[root@5622c6f14e5e /]# reboot -f
Rebooting.

# Back on the host
docker ps -a | grep centos
5622c6f14e5e   5d0da3dc9764  "bash" 2 minutes ago   Exited (129) 2 minutes ago  centos

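In this case the container stopped, not the host: docker ps shows it as Exited (129). Because the container has its own PID namespace, the kernel turns the reboot request into a signal that terminates the container's PID 1. Still, the request was accepted, meaning the container had the privilege to issue it.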
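The Exited (129) status in the docker ps output can be decoded: by shell convention, an exit status above 128 means the process was terminated by signal number status minus 128. A small sketch, with exit_signal as a hypothetical helper name:

```shell
# Print the name of the signal encoded in an exit status,
# or "none" if the status does not indicate death by signal.
exit_signal() {
    if [ "$1" -gt 128 ]; then
        kill -l "$(( $1 - 128 ))"
    else
        echo none
    fi
}

exit_signal 129   # prints HUP
exit_signal 0     # prints none
```

So the container's main process ended on a SIGHUP rather than exiting on its own.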

Using the --privileged flag when starting a Docker container grants it all capabilities and access to the devices of the host system, and it disables the default seccomp and AppArmor profiles. The namespaces remain in place, however, so the container does not automatically see the host's processes or filesystem.

Commands executed inside a container therefore still run in the container's own namespaces, even when it is started with --privileged. That isolation is a fundamental feature of container technology, but with --privileged it is no longer a strong security boundary.

So even as root inside a privileged container, you do not get direct access to the host by default, only through specific techniques such as mounting the host filesystem inside the container. A privileged container has the device access needed to do exactly that, so treat --privileged as close to root on the host and use it only when necessary.

With this understood, we can now look at what we can configure in Kubernetes.

The securityContext can be applied to all containers in the pod if declared at the pod spec level, only to a specific container if declared at the container level, or both, with the container level taking precedence over the pod level. However, capabilities can only be set at the container level.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: ubuntu
  name: ubuntu
spec:
  # Pod level
  securityContext:
    runAsUser: 1000

  containers:
  - name: ubuntu
    command: ["sleep","3600"]
    image: ubuntu
    # Container level: override the pod-level user and add capabilities
    # securityContext:
    #   runAsUser: 1001
    #   capabilities:
    #     add: ["NET_ADMIN","SYSLOG"]

    # Or keep the pod-level user and only add capabilities
    # securityContext:
    #   capabilities:
    #     add: ["NET_ADMIN","SYSLOG"]
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
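The precedence rule for runAsUser can be modeled in a few lines. This is only an illustration of the rule, not how Kubernetes implements it, and effective_run_as_user is a hypothetical helper name.

```shell
# Resolve the effective runAsUser for a container:
# the container-level value ($1) wins when it is set,
# otherwise the pod-level value ($2) applies.
effective_run_as_user() {
    container_level=$1
    pod_level=$2
    echo "${container_level:-$pod_level}"
}

effective_run_as_user ""   1000   # only pod level set, prints 1000
effective_run_as_user 1001 1000   # container level overrides, prints 1001
```

If neither level sets a value, the result is empty, which corresponds to the container running as the user defined by its image.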

To find out which user the container inside the pod is running as:

# kubectl exec <pod> -c <container> -- <command>
kubectl exec pods/nginx-david-c8644f94d-nlgch -c nginx -- whoami
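Note that whoami fails when the UID has no entry in the container's /etc/passwd, which can happen with an arbitrary runAsUser. Printing numeric IDs avoids that. A minimal sketch, with current_ids as a hypothetical helper name and the pod and container names taken from the command above:

```shell
# Print the numeric UID and GID of the current process.
# Unlike whoami, this does not need a matching /etc/passwd entry.
current_ids() {
    printf 'uid=%s gid=%s\n' "$(id -u)" "$(id -g)"
}

current_ids

# The same check inside the pod:
#   kubectl exec pods/nginx-david-c8644f94d-nlgch -c nginx -- id
```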
