Security Context
Container vs Host vs Privileges
First, let's look at container security in general, since the Kubernetes settings only make sense once we understand containers.
The host that runs the containers has its own processes: those of the operating system itself, plus the Docker daemon (or whichever container runtime is installed).
To have something to analyze, let's run a container and put it to sleep for 1 hour, so that a process keeps it alive.
docker run --name ubuntu ubuntu sleep 3600
The container shares the same kernel as the host, but they are isolated through namespaces in Linux. The host has its namespace and each container has its own separate one.
That's why the container can only see its processes and nothing outside of it.
List the running containers, then look at the processes from inside the container:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
764da973f0fe ubuntu "sleep 3600" About a minute ago Up About a minute ubuntu
docker exec -it ubuntu ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 2792 1408 ? Ss 08:04 0:00 sleep 3600
root 7 0.0 0.0 7064 2944 pts/0 Rs+ 08:05 0:00 ps aux
Inside the container, the process that keeps it running (sleep 3600) has PID 1.
On the host:
ps -aux | grep sleep
david-p+ 1628755 0.0 0.0 2066812 25536 pts/2 Sl+ 05:04 0:00 docker run --name ubuntu ubuntu sleep 3600
root 1628843 0.0 0.0 2792 1408 ? Ss 05:04 0:00 sleep 3600
david-p+ 1631734 0.0 0.0 9220 2560 pts/5 R+ 05:08 0:00 grep --color=auto sleep
On the host we can see the container's process, sleep 3600, but with a different PID. A process has a separate PID in each PID namespace it is visible from: here it is 1 inside the container's namespace and 1628843 in the host's.
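One way to see both PIDs of the same process at once is the NSpid line in /proc/<pid>/status, which on reasonably recent kernels (4.1 and later) lists the PID in each nested PID namespace, outermost first. A minimal sketch; the helper name ns_pids is made up for illustration:

```shell
#!/bin/sh
# ns_pids is a hypothetical helper, not part of Docker.
# $1 is the path to a /proc/<pid>/status file.
# Prints the PIDs from the NSpid line, outermost namespace first.
ns_pids() {
    awk '/^NSpid:/ { $1 = ""; sub(/^ /, ""); print }' "$1"
}

# For a process that is not inside a nested PID namespace
# this prints a single number.
ns_pids /proc/self/status
```

Run on the host against the container's sleep process (ns_pids /proc/1628843/status for the listing above), it would be expected to print two numbers: the host PID followed by 1.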
If we create another container, what can we see?
docker run --name centos centos sleep 3600
docker exec -it centos ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 23048 2560 ? Ss 08:14 0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
root 7 1.0 0.0 44668 3456 pts/0 Rs+ 08:16 0:00 ps aux
# and on the host
ps -aux | grep sleep
david-p+ 1628755 0.0 0.0 2066812 24948 pts/2 Sl+ 05:04 0:00 docker run --name ubuntu ubuntu sleep 3600
root 1628843 0.0 0.0 2792 1408 ? Ss 05:04 0:00 sleep 3600
david-p+ 1634805 0.0 0.0 1992824 26052 pts/5 Sl+ 05:14 0:00 docker run --name centos centos sleep 3600
root 1634985 0.0 0.0 23048 2560 ? Ss 05:14 0:00 /usr/bin/coreutils --coreutils-prog-shebang=sleep /usr/bin/sleep 3600
david-p+ 1639928 0.0 0.0 9220 2560 pts/4 S+ 05:17 0:00 grep --color=auto sleep
Notice that inside both containers the process runs as the root user.
We can make the process run as another user by passing --user <uid> to the command. Generally, 1000 is the first user created other than root; on my personal machine my user ID is 1001, so that is what I use here (after removing the earlier ubuntu container, since the name is reused).
docker run --name ubuntu --user 1001 ubuntu sleep 3600
docker exec -it ubuntu ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
1001 1 0.0 0.0 2792 1536 ? Ss 08:23 0:00 sleep 3600
1001 7 0.0 0.0 7064 2944 pts/0 Rs+ 08:23 0:00 ps aux
Alternatively, set the user in the Dockerfile of a custom image:
FROM ubuntu
USER 1001
Now on the host we have:
ps aux | grep sleep
david-p+ 1658884 0.0 0.0 1845040 25728 pts/9 Sl+ 05:38 0:00 docker run --name ubuntu --user 1001 ubuntu sleep 3600
david-p+ 1658972 0.0 0.0 2792 1408 ? Ss 05:38 0:00 sleep 3600
david-p+ 1660399 0.0 0.0 9220 2688 pts/8 S+ 05:40 0:00 grep --color=auto sleep
And who is executing the process? david-p+, which is my own username as truncated by ps.
What we see is that the container and the host share the same users. Root in the container is root on the system, and user 1001 in the container is user 1001 on the host. Since that UID is mine, sleep shows up there as a process owned by my user.
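The UIDs match because, with default daemon settings (no userns-remap, not rootless), Docker does not put the container in a separate user namespace. One way to check this is /proc/<pid>/uid_map, which holds the mapping between UIDs inside a namespace and outside it. A minimal sketch; uid_map_kind is a made-up helper name:

```shell
#!/bin/sh
# uid_map_kind is a hypothetical helper, not part of Docker.
# $1 is the path to a /proc/<pid>/uid_map file.
# Prints "identity" when UID 0 inside maps to UID 0 outside over the
# whole 32-bit range (the line "0 0 4294967295"), otherwise "remapped".
uid_map_kind() {
    awk '$1 == 0 && $2 == 0 && $3 == 4294967295 { found = 1 }
         END { print (found ? "identity" : "remapped") }' "$1"
}

# Mapping of the current process
uid_map_kind /proc/self/uid_map
```

Under the assumptions above, running the same check inside the container (for example through docker exec) should report identity, which is exactly why the container's users are the host's users.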
But wasn't it supposed to be isolated? If this is true, isn't it dangerous?
Can the process inside the container do anything that the root user can do on the host? Actually, no. Docker relies on a Linux feature called capabilities, which splits root's power into separate privileges, and it grants the container only a subset of them. Root inside the container (or whichever user was passed) is the same user as on the host, just with limited powers.
On the host, by contrast, root holds every capability and can do anything, as we already know.
To see the full list of capabilities that exist, look at the file /usr/include/linux/capability.h.
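The capabilities a process actually holds are exposed as hexadecimal bitmasks on the Cap* lines of /proc/<pid>/status, where bit N corresponds to capability number N in capability.h. The sketch below tests single bits of such a mask. The helper name has_cap is made up, and the mask 00000000a80425fb is the value commonly seen for root in a default Docker container, which may differ between Docker versions:

```shell
#!/bin/sh
# has_cap is a hypothetical helper, not part of Docker.
# $1 = hex mask taken from a Cap* line of /proc/<pid>/status
# $2 = capability number from capability.h
has_cap() {
    if [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]; then echo yes; else echo no; fi
}

# Effective capabilities of the current process
grep '^CapEff:' /proc/self/status

# Mask commonly seen for root in a default Docker container (assumption)
has_cap 00000000a80425fb 0     # CAP_CHOWN      -> yes
has_cap 00000000a80425fb 12    # CAP_NET_ADMIN  -> no
has_cap 00000000a80425fb 21    # CAP_SYS_ADMIN  -> no
```

Where the libcap tools are installed, capsh --decode=<mask> gives the same information as a list of names.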
If we want to remove this limitation so that the container's root user can do more things we can do:
# The container already runs as root; --privileged grants it all capabilities and access to the host's devices.
docker run --name ubuntu --privileged ubuntu sleep 3600
# Or add a single capability with --cap-add (--cap-drop removes one)
docker run --name ubuntu --cap-add NET_ADMIN ubuntu sleep 3600
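Here is the list. The first 14 entries (AUDIT_WRITE through SYS_CHROOT) are the ones Docker grants by default; the rest must be added explicitly.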
| Capability Key | Capability Description |
|---|---|
| AUDIT_WRITE | Write records to the kernel audit log. |
| CHOWN | Make arbitrary changes to file UIDs and GIDs (see chown(2)). |
| DAC_OVERRIDE | Bypass file read, write, and execute permission checks. |
| FOWNER | Bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file. |
| FSETID | Don't clear set-user-ID and set-group-ID permission bits when a file is modified. |
| KILL | Bypass permission checks for sending signals. |
| MKNOD | Create special files using mknod(2). |
| NET_BIND_SERVICE | Bind a socket to privileged Internet domain ports (port numbers less than 1024). |
| NET_RAW | Use RAW and PACKET sockets. |
| SETFCAP | Set file capabilities. |
| SETGID | Make arbitrary manipulations of process GIDs and supplementary GID list. |
| SETPCAP | Modify process capabilities. |
| SETUID | Make arbitrary manipulations of process UIDs. |
| SYS_CHROOT | Use chroot(2), change root directory. |
| AUDIT_CONTROL | Enable and disable kernel auditing; change auditing filter rules; retrieve auditing status and filtering rules. |
| AUDIT_READ | Allow reading the audit log via multicast netlink socket. |
| BLOCK_SUSPEND | Allow preventing system suspends. |
| BPF | Allow creating BPF maps, loading BPF Type Format (BTF) data, retrieve JITed code of BPF programs and more. |
| CHECKPOINT_RESTORE | Allow checkpoint/restore related operations. Introduced in kernel 5.9. |
| DAC_READ_SEARCH | Bypass file read permission checks and directory read and execute permission checks. |
| IPC_LOCK | Lock memory (mlock(2), mlockall(2), mmap(2), shmctl(2)). |
| IPC_OWNER | Bypass permission checks for operations on System V IPC objects. |
| LEASE | Establish leases on arbitrary files (see fcntl(2)). |
| LINUX_IMMUTABLE | Set the FS_APPEND_FL and FS_IMMUTABLE_FL i-node flags. |
| MAC_ADMIN | Allow MAC configuration or state changes. Implemented for the Smack LSM. |
| MAC_OVERRIDE | Override Mandatory Access Control (MAC). Implemented for the Smack Linux Security Module (LSM). |
| NET_ADMIN | Perform various network-related operations. |
| NET_BROADCAST | Make socket broadcasts and listen to multicasts. |
| PERFMON | Allow privileged system performance and observability operations using perf_events, i915_perf and other kernel subsystems. |
| SYS_ADMIN | Perform a range of system administration operations. |
| SYS_BOOT | Use reboot(2) and kexec_load(2), reboot and load a new kernel for later execution. |
| SYS_MODULE | Load and unload kernel modules. |
| SYS_NICE | Raise process nice value (nice(2), setpriority(2)) and change the nice value for arbitrary processes. |
| SYS_PACCT | Use acct(2), switch process accounting on or off. |
| SYS_PTRACE | Trace arbitrary processes using ptrace(2). |
| SYS_RAWIO | Perform I/O port operations (iopl(2) and ioperm(2)). |
| SYS_RESOURCE | Override resource limits. |
| SYS_TIME | Set system clock (settimeofday(2), stime(2), adjtimex(2)); set real-time (hardware) clock. |
| SYS_TTY_CONFIG | Use vhangup(2); employ various privileged ioctl(2) operations on virtual terminals. |
| SYSLOG | Perform privileged syslog(2) operations. |
| WAKE_ALARM | Trigger something that will wake up the system. |
A process holding all of these capabilities could simply kill system processes, reboot the machine, change network configurations, create new users, change file permissions, etc.
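Since most workloads need only a few of these, a common hardening pattern goes the other way: drop everything and add back only what is required. A minimal sketch using the --cap-drop and --cap-add flags of docker run; the helper name run_with_cap is made up for illustration:

```shell
#!/bin/sh
# run_with_cap is a hypothetical helper, not part of Docker.
# $1 = the single capability to keep (for example NET_BIND_SERVICE).
# Starts a throwaway ubuntu container with every other capability
# dropped and prints the capability masks of the process inside it.
run_with_cap() {
    docker run --rm --cap-drop ALL --cap-add "$1" ubuntu grep '^Cap' /proc/self/status
}

# Example (requires a running Docker daemon):
#   run_with_cap NET_BIND_SERVICE
```

Comparing the printed CapEff mask with the one from a default container shows how much smaller the remaining set of privileges is.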
docker run -it --name ubuntu --privileged ubuntu bash
root@a40542f4c76e:~# reboot
bash: reboot: command not found
The Ubuntu image is minimal and ships without many utilities (reboot among them), which already helps with security. Alpine is smaller still. That's why Alpine is great: less is more!
docker run -it --name centos --privileged centos:8 bash
reboot
System has not been booted with systemd as init system (PID 1). Can't operate.
Failed to connect to bus: Host is down
Failed to talk to init daemon.
# Inside the container
[root@5622c6f14e5e /]# reboot -f
Rebooting.
# Back on the host
docker ps -a | grep centos
5622c6f14e5e 5d0da3dc9764 "bash" 2 minutes ago Exited (129) 2 minutes ago centos
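In this case it was the container that went down, not the host. Exited (129) is 128 + 1, meaning the container's PID 1 was terminated by SIGHUP, which is how the kernel handles a reboot requested from inside a PID namespace. Still, the request was accepted, which shows the container had the privilege to make it.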
The --privileged flag grants the container every capability and access to the host's devices, and it also lifts the default seccomp and AppArmor restrictions.
The container still runs in its own namespaces, so commands executed inside it do not see the host's processes or filesystem by default. That is why reboot -f above only stopped the container.
This isolation should not be mistaken for a security boundary, though. Root in a privileged container can reach the host with little effort, for example by mounting one of the host's block devices, so --privileged should be treated as roughly equivalent to root on the host and avoided unless strictly necessary. The same caution applies to mounting the host filesystem inside the container.
With that understood, let's look at what we can do in Kubernetes.
The securityContext applies to all containers in the pod if declared at the pod spec level, and only to a specific container if declared at the container level. When both are present, the container level takes precedence over the pod level. Capabilities, however, can only be set at the container level.
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: ubuntu
  name: ubuntu
spec:
  # Pod level
  securityContext:
    runAsUser: 1000
  containers:
  - name: ubuntu
    command: ["sleep","3600"]
    image: ubuntu
    # securityContext:
    #   runAsUser: 1001  # << changing the user
    #   capabilities:
    #     add: ["NET_ADMIN","SYSLOG"]
    # Keeping the same user but adding capabilities
    # securityContext:
    #   capabilities:
    #     add: ["NET_ADMIN","SYSLOG"]
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
To find out which user a container inside a pod is running as:
# kubectl exec <pod> -c <container> -- <command>
kubectl exec pods/nginx-david-c8644f94d-nlgch -c nginx -- whoami
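Note that whoami fails when the UID has no entry in the image's /etc/passwd, which is often the case with a numeric runAsUser. Printing the numeric identity with id avoids that, and the Cap* lines of /proc/1/status show which capabilities the container's main process ended up with. A minimal sketch; the helper name pod_identity is made up, and the pod and container names in the example are the ones from the manifest above:

```shell
#!/bin/sh
# pod_identity is a hypothetical helper wrapping kubectl exec.
# $1 = pod name, $2 = container name
pod_identity() {
    # Numeric UID/GID of the user the container runs as
    kubectl exec "$1" -c "$2" -- id
    # Capability masks of the container's main process
    kubectl exec "$1" -c "$2" -- grep '^Cap' /proc/1/status
}

# Example (requires access to a cluster):
#   pod_identity ubuntu ubuntu
```

This assumes the image ships id and grep, which is true for ubuntu but not necessarily for very small images.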