I have been working on SELinux for over 15 years. I switched my primary job to working on containers several years ago, but one of the first things I did with containers was to add SELinux support. Now all of the container projects I work on including CRI-O, Podman, Buildah as well as Docker, Moby, Rocket, runc, systemd-nspawn, lxc ... all have SELinux support. I also maintain the container-selinux policy package which all of these container runtimes rely on.
Anyway, container runtimes started adding the no-new-privileges capability a couple of years ago.
The no_new_privs kernel feature works as follows:
Processes set the no_new_privs bit in the kernel, and it persists across fork, clone, & exec.
The no_new_privs bit ensures that the process and its children do not gain any additional privileges.
Processes are not allowed to unset the no_new_privs bit once set.
no_new_privs processes are not allowed to change uid/gid or gain any other capabilities, even if the process executes setuid binaries or executables with file capability bits set.
no_new_privs prevents Linux Security Modules (LSMs) like SELinux from transitioning to process labels that have access not allowed to the current process. This means an SELinux process is only allowed to transition to a process type with fewer privileges.
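To make the first few points concrete, here is a minimal sketch in Python that sets the bit through the prctl(2) system call and checks that it sticks. It assumes a Linux system and calls libc through ctypes. The helper names (`set_no_new_privs`, `demo_in_child`, and so on) are made up for this illustration. The demo runs in a forked child so the calling process is not permanently marked.

```python
import ctypes
import os

# Constants from <linux/prctl.h>
PR_SET_NO_NEW_PRIVS = 38
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0):
    # prctl() is variadic and expects unsigned long arguments.
    ul = ctypes.c_ulong
    return _libc.prctl(ctypes.c_int(option), ul(arg2), ul(0), ul(0), ul(0))


def get_no_new_privs():
    """Return 1 if the bit is set for this process, 0 if not."""
    return _prctl(PR_GET_NO_NEW_PRIVS)


def set_no_new_privs():
    if _prctl(PR_SET_NO_NEW_PRIVS, 1) != 0:
        raise OSError(ctypes.get_errno(), "prctl(PR_SET_NO_NEW_PRIVS) failed")


def try_unset_no_new_privs():
    """Return True if the kernel let us clear the bit (it should refuse)."""
    return _prctl(PR_SET_NO_NEW_PRIVS, 0) == 0


def demo_in_child():
    """Set the bit in a forked child and report what was observed."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: this is the process that gets marked
        status = 1
        try:
            os.close(read_fd)
            set_no_new_privs()
            after_set = get_no_new_privs()
            unset_allowed = try_unset_no_new_privs()
            after_unset = get_no_new_privs()
            grandchild = os.fork()
            if grandchild == 0:
                # Report the inherited bit through the exit code.
                os._exit(get_no_new_privs())
            _, wstatus = os.waitpid(grandchild, 0)
            inherited = os.WEXITSTATUS(wstatus)
            msg = f"{after_set} {int(unset_allowed)} {after_unset} {inherited}"
            os.write(write_fd, msg.encode())
            status = 0
        finally:
            os._exit(status)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    after_set, unset_allowed, after_unset, inherited = (int(x) for x in data.split())
    return {
        "after_set": after_set,
        "unset_allowed": bool(unset_allowed),
        "after_unset": after_unset,
        "inherited_by_child": inherited,
    }


if __name__ == "__main__":
    print(demo_in_child())
```

The sketch only shows the bit being set, refusing to be cleared, and surviving a fork. It does not exercise setuid binaries or SELinux transitions.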
Oops, that last point is a problem for containers and SELinux. Suppose I am running a command like
# podman run -ti --security-opt no-new-privileges fedora sh
On an SELinux system, the podman command would usually be running as unconfined_t, and podman usually asks for the container process to be launched as container_t.
docker run -ti --security-opt no-new-privileges fedora sh
In the case of Docker, the docker daemon is usually running as container_runtime_t, and it will attempt to launch the container as container_t.
But the user also asked for no-new-privileges. With both the label change and no_new_privs requested, the kernel would not allow the process to transition from unconfined_t -> container_t. And in the Docker case the kernel would not allow a transition from container_runtime_t -> container_t.
Well, you may say that is pretty dumb. no_new_privs is supposed to be a security measure that prevents a process from gaining further privs, but in this case it is actually preventing us from lessening its SELinux access.
Well, the SELinux kernel and policy had the concept of "typebounds", where a policy writer could declare that one type bounds another type. For example
typebounds container_runtime_t container_t;
and the kernel would then make sure that container_t has no more allow rules than container_runtime_t. This concept proved to be problematic for two reasons.
Writing policy for typebounds was very difficult, and in some cases we would have to add additional access to the bounding type. An example of this is that SELinux can control the `entrypoint` of a process. For example, we write policy that says httpd_t can only be entered by executables labeled with the entrypoint type httpd_exec_t. We also had a rule that container_runtime_t can only be entered via the entrypoint type container_runtime_exec_t. But since we wanted to allow any process to be run inside of a container, we wrote rules that all executable types could be used as entrypoints to container_t. With typebounds we needed to add all of these rules to container_runtime_t, meaning we would have to allow all executables to be run as container_runtime_t. Not ideal.
The second problem with typebounds was that the kernel and policy only allowed a single typebounds for a type. So if we wanted to allow unconfined_t processes to launch container_t processes, we would end up writing rules like
typebounds unconfined_t container_runtime_t;
typebounds container_runtime_t container_t;
Now unconfined_t would need to grow all of the allow rules of container_runtime_t and container_t.
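To see how the allow rules pile up along a typebounds chain, here is a small Python model. It is a toy, not how the kernel or the policy compiler actually represents policy. Each allow rule is just a (target type, class, permission) tuple, the rule sets are invented for the illustration, and `missing_for_bounds` is a hypothetical helper name.

```python
def missing_for_bounds(allow, bounds):
    """Report the rules each bounding type lacks.

    allow:  dict mapping a type to its set of (target, class, permission) rules
    bounds: dict mapping a bounded type to the type that bounds it
    Returns a dict mapping each bounding type to the rules it would have to
    gain so that every type bounded below it has no more access than it has.
    """
    missing = {}
    for bounded in bounds:
        needed = set(allow.get(bounded, set()))
        seen = {bounded}
        ancestor = bounds.get(bounded)
        while ancestor is not None and ancestor not in seen:
            gap = needed - allow.get(ancestor, set())
            if gap:
                missing.setdefault(ancestor, set()).update(gap)
            # The next type up must cover this ancestor's own rules as well.
            needed |= allow.get(ancestor, set())
            seen.add(ancestor)
            ancestor = bounds.get(ancestor)
    return missing


# Invented rule sets, far smaller than any real policy.
allow = {
    "unconfined_t": {("user_home_t", "file", "read")},
    "container_runtime_t": {("container_var_lib_t", "dir", "write")},
    "container_t": {
        ("container_file_t", "file", "entrypoint"),
        ("bin_t", "file", "entrypoint"),
    },
}

# typebounds unconfined_t container_runtime_t;
# typebounds container_runtime_t container_t;
bounds = {
    "container_runtime_t": "unconfined_t",
    "container_t": "container_runtime_t",
}

if __name__ == "__main__":
    for bounding_type, rules in sorted(missing_for_bounds(allow, bounds).items()):
        print(bounding_type, "would need", sorted(rules))
```

In this toy policy container_runtime_t has to pick up both entrypoint rules from container_t, and unconfined_t has to pick up those plus the rule from container_runtime_t, which is the growth described above.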
Well, I was complaining about this to Lukas Vrabec, the guy who took over selinux-policy from me, and he told me about this new permission called nnp_transition. The policy writer could write a rule into policy to say that a process could nnp_transition from one domain to another.
allow container_runtime_t container_t:process2 nnp_transition;
allow unconfined_t container_t:process2 nnp_transition;
With a recent enough kernel, SELinux would allow the transition even if the no_new_privs kernel flag was set, and the typebounds rules were NOT in place.
Boy, did I feel like an SELinux NEWBIE. I added the rules on Fedora 27 and suddenly everything started working. As of RHEL7.5 this feature will be back ported into the RHEL7 kernel. Awesome.
While I was looking at the nnp_transition rules, I noticed that there was also a nosuid_transition permission. nosuid allows people to mount a file system with the nosuid flag, which tells the kernel that even if a setuid application exists on this file system, the kernel should ignore the setuid bit and not allow a process to gain privilege via the file. You always want untrusted file systems like USB sticks to be mounted with this flag. Well, SELinux systems similarly ignore transition rules for files on a nosuid file system. Similar to no_new_privs, this blocks a process from transitioning from a privileged domain to a less privileged domain. But the nosuid_transition permission allows us to tell the kernel to allow transitions from one domain to another even if the file system is marked nosuid.
allow container_runtime_t container_t:process2 nosuid_transition;
allow unconfined_t container_t:process2 nosuid_transition;
This means that even if a user used podman to execute a file on a nosuid file system, the process would be allowed to transition from unconfined_t to container_t.
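The way nnp_transition, nosuid_transition, and typebounds fit together can be sketched as one decision function. This is a simplified Python model of the check as I understand it, not kernel code. It assumes the loaded policy enables the nnp_nosuid_transition policy capability, and it leaves out the ordinary transition rules that still have to allow the domain change. The `Policy` class and `transition_allowed` function are hypothetical names for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    # (old_type, new_type) pairs granted process2 nnp_transition
    nnp_transition: set = field(default_factory=set)
    # (old_type, new_type) pairs granted process2 nosuid_transition
    nosuid_transition: set = field(default_factory=set)
    # bounded type -> the type that bounds it
    typebounds: dict = field(default_factory=dict)

    def is_bounded_by(self, new_type, old_type):
        """True if old_type appears in the chain of types bounding new_type."""
        seen = set()
        current = self.typebounds.get(new_type)
        while current is not None and current not in seen:
            if current == old_type:
                return True
            seen.add(current)
            current = self.typebounds.get(current)
        return False


def transition_allowed(policy, old_type, new_type,
                       no_new_privs=False, nosuid_mount=False):
    """Model only the extra check applied under no_new_privs or nosuid."""
    if not no_new_privs and not nosuid_mount:
        return True  # neither restriction is active
    if old_type == new_type:
        return True  # not a transition at all
    pair = (old_type, new_type)
    # Every restriction that is active needs its matching permission.
    nnp_ok = (not no_new_privs) or pair in policy.nnp_transition
    nosuid_ok = (not nosuid_mount) or pair in policy.nosuid_transition
    if nnp_ok and nosuid_ok:
        return True
    # Otherwise fall back to the older typebounds check.
    return policy.is_bounded_by(new_type, old_type)


if __name__ == "__main__":
    policy = Policy(
        nnp_transition={("unconfined_t", "container_t")},
        nosuid_transition={("unconfined_t", "container_t")},
    )
    print(transition_allowed(policy, "unconfined_t", "container_t",
                             no_new_privs=True, nosuid_mount=True))
```

With an empty policy the no_new_privs case is refused, which is the failure described earlier. Adding the two allow rules lets the transition through without any typebounds statement.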
Well, it is nice to know there are still new things that I can learn about SELinux.