Seccomp (Secure Computing Mode)

Seccomp is a Linux kernel feature that restricts the system calls (syscalls) a process can make. In its original strict mode, it works like a security mode applied to the process: the process can only call exit(), sigreturn(), and read() and write() on file descriptors that are already open, and any other syscall makes the kernel terminate it with SIGKILL. In filter mode the reaction is configurable: the kernel can, for example, log the event, return an error, send SIGSYS, or kill the process.

prctl()?

prctl() (process control) is a Linux system call that lets a process configure various aspects of its own behavior and environment, such as security settings, signal handling, and resource control.

Using prctl() we can activate seccomp (PR_SET_SECCOMP), which restricts the syscalls the process can make. Besides activating seccomp, prctl() can also configure signal handling, memory protection, and so on.
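To show that prctl() is a general-purpose call and not only a seccomp switch, here is a minimal Python sketch that queries two unrelated process attributes. The option numbers come from <linux/prctl.h>; the function name query_process_flags is made up for this illustration, and loading libc through ctypes.CDLL(None) assumes a Linux system.

```python
import ctypes

# Option numbers from <linux/prctl.h>
PR_GET_DUMPABLE = 3        # whether the process may produce core dumps
PR_GET_NO_NEW_PRIVS = 39   # whether execve() is barred from granting new privileges


def query_process_flags():
    """Read two process attributes through prctl() (illustrative helper)."""
    # CDLL(None) exposes the symbols already loaded in the process, including libc's prctl
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    flags = {}
    for name, option in (("dumpable", PR_GET_DUMPABLE),
                         ("no_new_privs", PR_GET_NO_NEW_PRIVS)):
        value = libc.prctl(option, zero, zero, zero, zero)
        if value < 0:
            raise OSError(ctypes.get_errno(), f"prctl failed for {name}")
        flags[name] = value
    return flags


if __name__ == "__main__":
    print(query_process_flags())
```

The no_new_privs flag matters for seccomp too: an unprivileged process has to set it before the kernel accepts a seccomp filter.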

Seccomp modes can be:

SECCOMP_MODE_DISABLED: Seccomp disabled.
SECCOMP_MODE_STRICT: Only a very limited set of syscalls is allowed.
SECCOMP_MODE_FILTER: Allows configuration of syscall filters using the seccomp filter API.
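A quick way to see which of these modes a process is in is the Seccomp field of /proc/<pid>/status, which holds 0, 1 or 2. The sketch below parses that field; the helper names are made up for this example.

```python
# Numeric values the kernel reports in the "Seccomp:" field of /proc/<pid>/status
SECCOMP_MODES = {
    0: "SECCOMP_MODE_DISABLED",
    1: "SECCOMP_MODE_STRICT",
    2: "SECCOMP_MODE_FILTER",
}


def parse_seccomp_mode(status_text):
    """Return the seccomp mode name found in a /proc status dump, or None."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" is a different field, so match the exact key
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES[int(line.split(":", 1)[1])]
    return None


def current_seccomp_mode(path="/proc/self/status"):
    """Read the seccomp mode of the current process."""
    with open(path) as status_file:
        return parse_seccomp_mode(status_file.read())


if __name__ == "__main__":
    print(current_seccomp_mode())
```

Run inside a container started with a seccomp profile, this would be expected to report SECCOMP_MODE_FILTER; on a plain host shell it usually reports SECCOMP_MODE_DISABLED.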

If it were in Python.

import ctypes
import os

PR_SET_SECCOMP = 22
SECCOMP_MODE_STRICT = 1

def set_seccomp_mode():
    # use_errno=True lets us report why prctl() failed
    libc = ctypes.CDLL('libc.so.6', use_errno=True)
    result = libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT)
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"Error activating seccomp with prctl: {os.strerror(err)}")

if __name__ == "__main__":
    try:
        print(os.getpid())  # getpid() works before activation
        set_seccomp_mode()
        print("Seccomp activated successfully!", flush=True)
        os.getpid()  # No longer allowed: the kernel terminates the process with SIGKILL
    except OSError as e:
        print(f"Error: {e}")

If it were in Golang.

package main

import (
    "fmt"
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    fmt.Println(os.Getpid()) // getpid() works before activation
    // Activates seccomp in strict mode (unix.Prctl takes the option plus four arguments)
    err := unix.Prctl(unix.PR_SET_SECCOMP, unix.SECCOMP_MODE_STRICT, 0, 0, 0)
    if err != nil {
        log.Fatalf("Error activating seccomp with prctl: %v", err)
    }
    // The Go runtime issues syscalls of its own, so the process may be killed even before this line
    fmt.Println("Seccomp activated successfully!")
    os.Getpid() // No longer allowed: the process is terminated with SIGKILL
}

If it were in Rust...

// Assumes the libc crate as a dependency; it exposes the raw prctl() binding
use std::io;
use std::process;

fn main() {
    println!("{}", process::id()); // getpid() works before activation
    // Activates seccomp in strict mode
    let result = unsafe { libc::prctl(libc::PR_SET_SECCOMP, libc::SECCOMP_MODE_STRICT) };
    if result == 0 {
        println!("Seccomp activated successfully!");
    } else {
        eprintln!("Error activating seccomp: {}", io::Error::last_os_error());
    }
    process::id(); // No longer allowed once seccomp is active: the process is killed
}

Note that a system call was placed before and after activating seccomp to show the difference. It is advisable to activate seccomp only after the application's services are ready and all the files it needs are open.
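To observe the before/after effect without losing the main process, the experiment can be run in a forked child while the parent inspects how the child ended. This is only a sketch: run_strict_child and the exit code 3 are invented for the example, and it assumes a Linux system where os.fork() is available. In environments that already run under a seccomp filter (a Docker container with the default profile, for instance), the kernel refuses the switch to strict mode and the child exits normally instead.

```python
import ctypes
import os
import signal

PR_SET_SECCOMP = 22
SECCOMP_MODE_STRICT = 1
EXIT_UNAVAILABLE = 3  # arbitrary exit code chosen for this sketch


def run_strict_child():
    """Fork a child that enters strict mode and then calls getpid()."""
    # Resolve prctl before forking so lookup errors surface in the parent
    prctl = ctypes.CDLL(None, use_errno=True).prctl
    zero = ctypes.c_ulong(0)
    pid = os.fork()
    if pid == 0:
        try:
            rc = prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_STRICT),
                       zero, zero, zero)
            if rc == 0:
                os.getpid()  # forbidden in strict mode: the kernel sends SIGKILL
        finally:
            # Reached normally only when strict mode could not be enabled
            os._exit(EXIT_UNAVAILABLE)
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("exited", os.WEXITSTATUS(status))


if __name__ == "__main__":
    outcome = run_strict_child()
    if outcome == ("signaled", signal.SIGKILL):
        print("child was killed after entering strict mode")
    else:
        print("strict mode not available here:", outcome)
```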

Of course, there are seccomp libraries for various languages and easier ways to activate it; this demonstration only shows prctl() activating seccomp directly in the code. libseccomp, for example, configures seccomp by adding and removing permissions.

Developers will rarely "waste time" on this or even know about it, and besides, the code would be restricted to Linux.

We can work around this by applying seccomp to a process from the outside, at the moment it is started, which reduces the concern during development.

Seccomp-BPF (Secure Computing Mode with Berkeley Packet Filter)

Seccomp evolved and was combined with BPF filters, allowing advanced syscall filtering. It offers granular control over which syscalls a process can execute, so detailed and restrictive security policies can be created for processes, especially those running in containers such as Docker's.

Without BPF, in SECCOMP_MODE_STRICT, the process can only make very basic syscalls such as read, write, exit, and sigreturn. Any attempt to use another syscall results in process termination.
With Seccomp-BPF, you can specify which syscalls a process may call, which should be blocked, and which should result in a specific signal or error. The libseccomp library mentioned earlier lets you add permissions for more syscalls.
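To make the idea of a filter concrete, the following Python sketch assembles the bytes of a small classic BPF program of the kind seccomp accepts: check the architecture, load the syscall number, allow it if it is on a list, and otherwise return a default action. It only builds the program and does not install it. The opcode and return-value constants are taken from <linux/bpf_common.h> and <linux/seccomp.h>, the syscall numbers are the x86_64 ones, and the function names are invented for this illustration.

```python
import errno
import struct

# Classic BPF opcodes (<linux/bpf_common.h>)
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS: load a 32-bit word at an offset
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K: jump if accumulator == constant
BPF_RET_K = 0x06     # BPF_RET | BPF_K: return a constant

# seccomp return values (<linux/seccomp.h>)
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_ALLOW = 0x7FFF0000

AUDIT_ARCH_X86_64 = 0xC000003E
OFFSET_NR, OFFSET_ARCH = 0, 4  # field offsets inside struct seccomp_data


def insn(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k."""
    return struct.pack("=HBBI", code, jt, jf, k)


def build_allowlist(syscall_numbers, default_ret):
    """Build a filter that allows the given syscalls and applies default_ret to the rest."""
    count = len(syscall_numbers)
    if count > 255:
        raise ValueError("jump offsets are 8 bits wide in this simple layout")
    program = [
        insn(BPF_LD_W_ABS, 0, 0, OFFSET_ARCH),
        insn(BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),       # right arch: skip the kill
        insn(BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        insn(BPF_LD_W_ABS, 0, 0, OFFSET_NR),
    ]
    for index, number in enumerate(syscall_numbers):
        # On a match, jump over the remaining comparisons and the default return
        program.append(insn(BPF_JEQ_K, count - index, 0, number))
    program.append(insn(BPF_RET_K, 0, 0, default_ret))
    program.append(insn(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW))
    return b"".join(program)


if __name__ == "__main__":
    # x86_64 numbers: read=0, write=1, rt_sigreturn=15, getpid=39, exit=60
    prog = build_allowlist([0, 1, 15, 39, 60], SECCOMP_RET_ERRNO | errno.EPERM)
    print(len(prog) // 8, "instructions")
```

Libraries such as libseccomp generate and load programs like this one, which is why application code normally deals only with syscall names and actions.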

In Golang, an idea of how to use it.

package main

import (
    "log"

    seccomp "github.com/seccomp/libseccomp-golang"
)

func main() {
    // The lib takes care of loading the filter into the kernel, leaving the developer focused on adding and removing permissions.
    // Creates a seccomp filter with the default kill action
    filter, err := seccomp.NewFilter(seccomp.ActKill)
    if err != nil {
        log.Fatalf("Error creating seccomp filter: %v", err)
    }
    defer filter.Release()

    // Syscalls to allow, resolved by name so the list works across architectures.
    // A real Go program needs more than these (the runtime uses futex, exit_group, etc.).
    syscallsToAllow := []string{
        "getpid", // Extra
        "read",
        "write",
        "exit",
        "rt_sigreturn",
    }

    for _, name := range syscallsToAllow {
        call, err := seccomp.GetSyscallFromName(name)
        if err != nil {
            log.Fatalf("Error resolving syscall %q: %v", name, err)
        }
        if err := filter.AddRule(call, seccomp.ActAllow); err != nil {
            log.Fatalf("Error adding syscall rule: %v", err)
        }
    }

    // Applies the seccomp filter to the current process
    if err := filter.Load(); err != nil {
        log.Fatalf("Error loading seccomp filter: %v", err)
    }
}

Software Using Seccomp

According to the seccomp article on Wikipedia, several pieces of software use seccomp or support it.

Just a few:

Android
Various sandboxes
Docker and LXD for containers
Chrome, Firefox
Snap and Flatpak
OpenSSH
etc

Seccomp Containers and Kubernetes

First, keep in mind that the container runtime cannot change a seccomp profile once it is in use. The predefined rules stay in effect until the container exits or dies.

We need to create a profile listing all the syscalls that will be allowed; this profile is then applied to the process. The profile is defined in JSON.

Let's understand the structure of a seccomp profile.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "archMap": [
    {
      "architecture": "SCMP_ARCH_X86_64",
      "subArchitectures": [
        "SCMP_ARCH_X86",
        "SCMP_ARCH_X32"
      ]
    },
    {
      "architecture": "SCMP_ARCH_AARCH64",
      "subArchitectures": [
        "SCMP_ARCH_ARM"
      ]
    }
  ],
  "syscalls": [
    {
      "names": ["getpid", "read", "write", "exit", "rt_sigreturn"],
      "action": "SCMP_ACT_ALLOW",
      "args": []
    },
    {
      "names": ["open"],
      "action": "SCMP_ACT_ERRNO",
      "args": [],
      "comment": "Blocks the open() system call"
    }
  ]
}

Main sections:

defaultAction: Defines the default action to be taken for any syscall that is not explicitly listed in the profile. SCMP_ACT_ERRNO means the process will receive an error (errno) when trying to execute them.
archMap: The system architectures (and their sub-architectures) to which the profile applies.
syscalls: The list of syscall rules. Each entry contains:
- names: A list of syscalls to which the rule applies.
- action: Action to be taken for the listed syscalls.
  - SCMP_ACT_ALLOW: Allows execution of the syscall.
  - SCMP_ACT_ERRNO: Returns an errno error when the syscall is called.
  - SCMP_ACT_KILL: Terminates the process that tried to execute the syscall.
  - SCMP_ACT_TRAP: Sends a SIGSYS signal to the process.
- args: List of arguments to define conditional rules based on the syscall arguments. It is empty in this example, but can be used to allow or block syscalls based on their parameters.
- comment (optional): A comment field to describe the rule.
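The way these fields fit together can be sketched with a small Python lookup that answers "what happens if the process calls X?" for a profile like the one above. This is a simplified model, not how a container runtime evaluates a profile: it ignores args and archMap, and taking the first rule that names the syscall is an assumption made for the sketch. The function name action_for is invented.

```python
import json

# Trimmed copy of the example profile (archMap and args omitted)
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["getpid", "read", "write", "exit", "rt_sigreturn"],
     "action": "SCMP_ACT_ALLOW"},
    {"names": ["open"], "action": "SCMP_ACT_ERRNO"}
  ]
}
""")


def action_for(profile, syscall_name):
    """Return the action of the first rule naming the syscall, else defaultAction."""
    for rule in profile.get("syscalls", []):
        if syscall_name in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("getpid", "open", "mkdir"):
        print(name, "->", action_for(PROFILE, name))
```

Here mkdir is not listed anywhere, so it falls through to defaultAction, which is also why the explicit open rule in the example profile is redundant and serves only as documentation.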

A more complex example is Docker's default profile, taken from the Docker documentation and also used by containerd.

Podman uses a different but quite similar profile, available in Podman's GitHub repository.

Now let's create a container using docker's seccomp.

# Creating the profile file with docker's profile.
root@cks-worker:~# vim default.json
# Running the nginx container to see if it works
root@cks-worker:~# docker run --security-opt seccomp=default.json nginx
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2024/09/13 13:07:10 [notice] 1#1: using the "epoll" event method
2024/09/13 13:07:10 [notice] 1#1: nginx/1.27.1
2024/09/13 13:07:10 [notice] 1#1: built by gcc 12.2.0 (Debian 12.2.0-14)
2024/09/13 13:07:10 [notice] 1#1: OS: Linux 5.15.0-1067-gcp
2024/09/13 13:07:10 [notice] 1#1: getrlimit(RLIMIT_NOFILE): 1048576:1048576
2024/09/13 13:07:10 [notice] 1#1: start worker processes
2024/09/13 13:07:10 [notice] 1#1: start worker process 29
2024/09/13 13:07:10 [notice] 1#1: start worker process 30
^C2024/09/13 13:07:28 [notice] 1#1: signal 2 (SIGINT) received, exiting
2024/09/13 13:07:28 [notice] 29#29: exiting
2024/09/13 13:07:28 [notice] 29#29: exit
2024/09/13 13:07:28 [notice] 30#30: exiting
2024/09/13 13:07:28 [notice] 30#30: exit
2024/09/13 13:07:28 [notice] 1#1: signal 17 (SIGCHLD) received from 29
2024/09/13 13:07:28 [notice] 1#1: worker process 29 exited with code 0
2024/09/13 13:07:28 [notice] 1#1: worker process 30 exited with code 0

So far so good. Remove write from the syscalls list and test again.

root@cks-worker:~# cat default.json | grep "write"
        "pwrite64",
        "pwritev",
        "pwritev2",
        "write", # <<<<<
        "writev"
        "s390_pci_mmio_write",
        "process_vm_writev",

root@cks-worker:~# vim default.json
root@cks-worker:~# cat default.json | grep "write"
        "pwrite64",
        "pwritev",
        "pwritev2",
        "writev"
        "s390_pci_mmio_write",
        "process_vm_writev",

root@cks-worker:~# docker run --security-opt seccomp=default.json nginx
docker: Error response from daemon: OCI runtime start failed: cannot start an already running container: unknown.
ERRO[0000] error waiting for container:
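The manual edit done in vim above can also be scripted, which avoids accidentally touching similarly named entries such as writev or pwrite64. The sketch below assumes the profile follows the structure described earlier (a syscalls list whose entries have a names list); the function name and the output file name are made up for the example.

```python
import copy
import json


def remove_syscall(profile, name):
    """Return a copy of the profile with the exact syscall name dropped from every rule."""
    updated = copy.deepcopy(profile)
    for rule in updated.get("syscalls", []):
        rule["names"] = [entry for entry in rule.get("names", []) if entry != name]
    return updated


if __name__ == "__main__":
    # default.json is the profile file used above; the output name is hypothetical
    with open("default.json") as source:
        profile = json.load(source)
    with open("default-no-write.json", "w") as target:
        json.dump(remove_syscall(profile, "write"), target, indent=2)
```

Writing to a separate file keeps the original profile intact, so putting write back later is just a matter of pointing --security-opt at the original file again.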

Moving on to Kubernetes...

It is possible to configure the kubelet to apply a default seccomp profile to all pods automatically by passing it the --seccomp-default flag. Before doing so, make sure all your workloads run correctly with that profile.

When a pod references a specific profile, the kubelet looks for it in the /var/lib/kubelet/seccomp/ folder, so let's create our profile and put it there.

It is necessary that all nodes have this profile available, or at least those that will run the specific pod.

We can define seccomp at the pod level or per container using the security context.
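The two levels differ only in where the seccompProfile block sits in the pod spec: under spec.securityContext for the whole pod, or under a container's securityContext for that container, in which case the container-level value takes precedence. As an illustration, this Python sketch builds such a manifest as JSON, which kubectl also accepts; the helper name pod_manifest is invented for the example.

```python
import json


def pod_manifest(name, image, pod_profile=None, container_profile=None):
    """Build a minimal Pod manifest with optional pod-level and container-level seccomp profiles."""
    container = {"name": name, "image": image}
    if container_profile:
        # Container level: applies to this container and overrides the pod-level setting
        container["securityContext"] = {"seccompProfile": container_profile}
    spec = {"containers": [container]}
    if pod_profile:
        # Pod level: applies to every container that does not set its own profile
        spec["securityContext"] = {"seccompProfile": pod_profile}
    return {"apiVersion": "v1", "kind": "Pod",
            "metadata": {"name": name}, "spec": spec}


if __name__ == "__main__":
    manifest = pod_manifest(
        "nginx", "nginx",
        pod_profile={"type": "RuntimeDefault"},
        container_profile={"type": "Localhost", "localhostProfile": "default.json"},
    )
    print(json.dumps(manifest, indent=2))
```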

# On a worker node
root@cks-worker:~# mkdir -p /var/lib/kubelet/seccomp
# Putting write back in place.
root@cks-worker:~# vim default.json
root@cks-worker:~# cat default.json | grep write
        "pwrite64",
        "pwritev",
        "pwritev2",
        "write",
        "writev"
        "s390_pci_mmio_write",
        "process_vm_writev",
root@cks-worker:~# mv default.json /var/lib/kubelet/seccomp/
root@cks-worker:~# ls /var/lib/kubelet/seccomp/
default.json

# Now let's run a container pointing to this profile.
# On the master...
root@cks-master:~# k run nginx --image=nginx -oyaml --dry-run=client > nginx.yaml

root@cks-master:~# vim nginx.yaml
root@cks-master:~# cat nginx.yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: nginx
  name: nginx
spec:
  containers:
  - image: nginx
    name: nginx
    resources: {}
    securityContext:
      seccompProfile:
        type: Localhost
        # path to the profile from the seccomp folder
        localhostProfile: default.json
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

root@cks-master:~# k apply -f nginx.yaml
pod/nginx created

root@cks-master:~# k get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          4s

root@cks-master:~# k describe pod nginx
Name:             nginx
Namespace:        default
Priority:         0
Service Account:  default
Node:             cks-worker/10.128.0.7
Start Time:       Fri, 13 Sep 2024 13:54:22 +0000
Labels:           run=nginx
Annotations:      cni.projectcalico.org/containerID: e29bb575577eb4f5d7a1686520fcc1375b3efa95c922d07da33a3f68321e4ae0
                  cni.projectcalico.org/podIP: 192.168.1.19/32
                  cni.projectcalico.org/podIPs: 192.168.1.19/32
Status:           Running
IP:               192.168.1.19
IPs:
  IP:  192.168.1.19
Containers:
  nginx:
    Container ID:        containerd://bb31ae8ab0db99ee90bb85dc0ea34a6709879cfb4f356a63dff0050a47c6d0ab
    Image:               nginx
    Image ID:            docker.io/library/nginx@sha256:04ba374043ccd2fc5c593885c0eacddebabd5ca375f9323666f28dfd5a9710e3
    Port:                <none>
    Host Port:           <none>
    SeccompProfile:      Localhost # <<<<<
      LocalhostProfile:  default.json #<<<<<
    State:               Running
      Started:           Fri, 13 Sep 2024 13:54:23 +0000
    Ready:               True
    Restart Count:       0
    Environment:         <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mmzj8 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  kube-api-access-mmzj8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
