Containers on Linux have been marketed as the cure-all for your infrastructure: easy to build, manage and distribute. They have also been touted as lightweight virtual machines, with some security isolation to boot. However, the walls of your containers are more like many thin layers of paper than a single steel enclosure. My goal is to remove the shroud of hype and talk about the underlying Linux kernel technologies that form the walls of your containers, so we can understand where attacks can occur. These technologies are the raw material for containers, and knowing them helps explain how container isolation really works.
In Linux a container is simply a collection of namespaced resources plus other technologies like cgroups, overlayfs, iptables and Linux Security Modules (LSMs). Different container runtimes may use different sets of these technologies and take different approaches. For example, Docker emphasizes service isolation, whereas LXC presents an environment that mimics a virtual machine sans virtualization. The Linux kernel allows one to compose the various namespace technologies in a myriad of ways. For example, if you want virtual networking you can use network namespaces on their own. If you want to limit a process’s CPU usage you can apply cgroup limits to a single process.
Typically one can think of a container as a very special process and its children. Conversely, one can think of namespaces as existing only while the owning process exists (with some exceptions such as network namespaces). For example, if I launch a Docker container and then ‘kill -9’ the container process from the host, then my container is effectively gone.
A very common misconception is that a Linux container is a lightweight virtual machine. This is not the case: a container shares the host kernel and is not virtualized. Because of this, Linux has to rely on layers of protection to prevent access to the host system.
Containers are an amalgamation of various technologies to limit resources, provide isolation, and mimic various aspects of a full system. Here I’ll go through the smorgasbord of various Linux technologies.
Control groups, or cgroups, are a Linux kernel feature that limits the resources available to a particular group of processes. Resources such as CPU, memory, and I/O can be limited using cgroups.
An example of the power of cgroups is the docker run flags --memory and --cpus, which limit resources when running a new container.
Enforcing cgroup limits is important to ensure that a container cannot mount a denial-of-service attack on the host machine.
Namespaces isolate a global operating system resource and are the heart of containers. New namespaces can be minted by passing flags to the clone system call or by using the unshare system call. Below is a summary of the different namespaces and a short description of what each accomplishes.
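As a minimal sketch of those flags, the script below starts a throwaway container with a memory cap and a CPU cap. The image name (alpine) and both limit values are illustrative assumptions, not recommendations, and the script does nothing if Docker is not installed.

```shell
# Illustrative values: the image and both limits are assumptions for this sketch.
IMAGE="alpine"
MEMORY_LIMIT="256m"   # hard memory cap, enforced through the memory cgroup controller
CPU_LIMIT="0.5"       # at most half of one CPU, enforced through the cpu cgroup controller

if command -v docker >/dev/null 2>&1; then
  # Docker turns --memory and --cpus into cgroup settings for the container.
  docker run --rm --memory "$MEMORY_LIMIT" --cpus "$CPU_LIMIT" "$IMAGE" \
    echo "running with cgroup limits applied"
else
  echo "docker not found, skipping" >&2
fi
```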
PID: Your container will have its own set of PIDs. Generally the root process of the container will appear as PID 1 within the PID namespace.
Network: This allows your container to have its own IP address and routing.
IPC: System V IPC objects and POSIX message queues are unique within a namespace.
Mount: The path ‘/’ inside your container looks different than ‘/’ on your root filesystem. This provides path isolation and makes it easy for containerized apps to read from regular paths without disturbing the host filesystem.
User: This requires a mapping of host user IDs to IDs inside the namespace, but can be very effective in ensuring that UID 0 inside a container is different from UID 0 on the host.
UTS: This allows your container to have its own hostname.
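One way to see these namespaces on a running Linux system is through /proc. The sketch below lists the namespaces the current shell belongs to. Each entry is a symlink whose target names the namespace type and an ID, and two processes that show the same ID share that namespace. The loop simply skips entries that do not exist on the system it runs on.

```shell
# Each entry under /proc/<pid>/ns is a symlink such as "uts:[4026531838]".
for ns in pid net ipc mnt user uts; do
  if [ -e "/proc/self/ns/$ns" ]; then
    printf '%-5s %s\n' "$ns" "$(readlink "/proc/self/ns/$ns")"
  fi
done
```

Running the same loop inside a container and on the host should show different IDs for every namespace the container runtime has unshared.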
User namespaces are not enabled by default in Docker, and their absence can be the source of many security problems. For example, if a sensitive host directory is bind-mounted into a container and the container user is UID 0, then that user could potentially attack the host system. Such sensitive mounts are easy to detect by inspecting Docker container configurations.
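A rough sketch of such a check follows. The list of sensitive paths in is_sensitive_path is a hypothetical starting point, not a complete policy, and the helper name is mine. The script relies on docker inspect exposing each mount's host-side Source field, and it does nothing if Docker is not installed.

```shell
# Hypothetical list of host paths considered sensitive; tune it for your environment.
is_sensitive_path() {
  case "$1" in
    /|/etc|/etc/*|/root|/root/*|/proc|/proc/*|/var/run/docker.sock) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v docker >/dev/null 2>&1; then
  for id in $(docker ps -q 2>/dev/null); do
    # Print the host-side source of every mount, one per line.
    docker inspect --format '{{range .Mounts}}{{println .Source}}{{end}}' "$id" |
      while read -r src; do
        if [ -n "$src" ] && is_sensitive_path "$src"; then
          echo "container $id mounts sensitive host path: $src"
        fi
      done
  done
fi
```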
Overlayfs and AUFS allow for overlaying an ‘upper’ filesystem onto a ‘lower’ filesystem. This forms the basis of Docker’s layering and means that instead of having to copy an entire filesystem on each new container we only track changes within an ‘upper’ overlay. The ‘lower’ filesystem can even be another overlayfs filesystem which allows for stacking.
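The layering can be tried by hand with a plain overlay mount. This is a sketch under a few assumptions: the directory names are illustrative, the mount needs root and a kernel with overlayfs support, and the mount step is skipped when either is missing.

```shell
# Directory names are illustrative. Mounting needs root and overlayfs support,
# so the mount step is skipped when either is missing.
work=$(mktemp -d) || exit 1
mkdir -p "$work/lower" "$work/upper" "$work/work" "$work/merged"
echo "from lower" > "$work/lower/file.txt"

if [ "$(id -u)" -eq 0 ] &&
   mount -t overlay overlay \
     -o "lowerdir=$work/lower,upperdir=$work/upper,workdir=$work/work" \
     "$work/merged" 2>/dev/null; then
  echo "changed in container" > "$work/merged/file.txt"  # the write lands in upper
  ls "$work/upper"                                        # file.txt now exists in the upper layer
  umount "$work/merged"
fi

# The lower layer is never modified, whether or not the mount happened.
lower_after=$(cat "$work/lower/file.txt")
echo "lower layer still contains: $lower_after"
rm -rf "$work"
```

Writes through the merged view are stored in the upper directory only, which is why many containers can share one read-only lower layer.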
Linux has a few Security Modules with AppArmor and SELinux being the most prominent. These security modules hook into various points in the Linux kernel and can perform actions such as allowing, denying, or logging events. Security modules implement Mandatory Access Control which supplements traditional UNIX Discretionary Access Control.
AppArmor allows a program to ship with a profile that describes which resources the program can access. For example, one could limit a program to accessing only files within a certain path, allow it only to read files within another path, limit a process's network connections, or even prevent it from executing other processes entirely.
SELinux allows one to apply contexts to files and processes, and to use policy to describe which programs and users can access which contexts. Its policy language is very expressive.
In Docker a default profile is applied to all container processes, but it is easy to override with the --security-opt flag, for example --security-opt apparmor=<profile name>.
LSMs form a strong perimeter of defense, but the default configurations may be more permissive to allow for ease of use. Ideally you should run containers with the smallest set of permissions possible; however, this requires tuning to your specific application.
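A sketch of the override, assuming an AppArmor host: the profile name my-custom-profile is hypothetical and must already be loaded into the kernel, and the alpine image is only a placeholder. The script does nothing if Docker is not installed.

```shell
# "my-custom-profile" is a hypothetical profile name; it must already be loaded
# into the kernel (for example with apparmor_parser) before Docker can use it.
PROFILE="my-custom-profile"
IMAGE="alpine"

# On AppArmor hosts this file reads Y when the module is enabled.
if [ -r /sys/module/apparmor/parameters/enabled ]; then
  echo "AppArmor enabled: $(cat /sys/module/apparmor/parameters/enabled)"
fi

if command -v docker >/dev/null 2>&1; then
  # Replace Docker's default profile (docker-default) with the custom one.
  docker run --rm --security-opt "apparmor=$PROFILE" "$IMAGE" true
  # SELinux hosts use label options instead, e.g. --security-opt label=type:<type>.
fi
```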
Each process can have ulimits applied to ensure it does not exceed certain operating system resource limits. For example, if I have a process that needs to open many files, I may increase the number of open file descriptors the process is allowed to have, using ulimit -n in a shell or the --ulimit flag of docker run.
Limiting container processes to sane ulimit defaults is important; otherwise containers may deny other applications access to operating system resources.
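The sketch below shows the soft and hard open-file limits of the current shell and changes the soft limit inside a subshell so the rest of the session is unaffected. The values 256 and 1024:2048 are illustrative assumptions. An unprivileged process may move its soft limit anywhere up to the hard limit, while raising the hard limit needs privileges.

```shell
echo "soft limit: $(ulimit -Sn), hard limit: $(ulimit -Hn)"

# Command substitution runs in a subshell, so the new soft limit stays local to it.
limited=$(ulimit -Sn 256 2>/dev/null && ulimit -Sn)
echo "soft limit inside subshell: $limited"

if command -v docker >/dev/null 2>&1; then
  # For a container, --ulimit nofile=<soft>:<hard> sets the same limits at start-up.
  docker run --rm --ulimit nofile=1024:2048 alpine sh -c 'ulimit -n'
fi
```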
Traditional UNIX permission checks only distinguish between the privileged UID 0 and everybody else. Linux capabilities allow for more fine-grained control of what a process is allowed to do. For example, the ability to perform network administration (CAP_NET_ADMIN) or to send signals to arbitrary processes (CAP_KILL) can be enabled or disabled for any particular process. Docker starts containers with a restricted default set of capabilities, and the set can be adjusted using the --cap-add and --cap-drop flags. For example, adding NET_ADMIN allows a container to create a dummy network interface.
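A sketch of the NET_ADMIN case follows. The alpine image and the interface name dummy0 are assumptions for illustration. The first part only reads the capability bitmask that Linux exposes in /proc, and the Docker part is skipped when Docker is not installed.

```shell
# Assumptions: the alpine image and the interface name dummy0 are illustrative.
IMAGE="alpine"
CAP="NET_ADMIN"

# On Linux, CapEff in /proc/<pid>/status is the effective capability bitmask
# of a process started from this shell.
if [ -r /proc/self/status ]; then
  grep '^CapEff:' /proc/self/status
fi

if command -v docker >/dev/null 2>&1; then
  # Without --cap-add the kernel refuses to create the interface;
  # with NET_ADMIN added, the same command is permitted.
  docker run --rm --cap-add "$CAP" "$IMAGE" \
    sh -c 'ip link add dummy0 type dummy && ip link show dummy0'
fi
```

Dropping capabilities works the same way in reverse, for example --cap-drop KILL, and is usually the safer direction to tune.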
Seccomp, or Secure Computing Mode, is a kernel feature that acts like a firewall for syscalls. You can log, allow, or deny various syscalls and even provide programmable filters that check syscall arguments. Docker provides a default seccomp profile with some sane defaults; however, you can also provide your own profile by passing --security-opt seccomp=<path to profile.json>.
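As a small illustration, the sketch below writes a hypothetical profile that allows every syscall except the ones used to create directories, then starts a container with it. An allow-by-default profile like this is weaker than Docker's default profile and is only meant to show the file format. The alpine image and the choice of syscalls are assumptions.

```shell
# Hypothetical profile: allow everything except creating directories.
PROFILE=$(mktemp) || exit 1
cat > "$PROFILE" <<'EOF'
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "syscalls": [
    { "names": ["mkdir", "mkdirat"], "action": "SCMP_ACT_ERRNO" }
  ]
}
EOF

if command -v docker >/dev/null 2>&1; then
  # mkdir inside the container should now fail because the syscall is denied.
  docker run --rm --security-opt "seccomp=$PROFILE" alpine mkdir /tmp/demo ||
    echo "mkdir was blocked or the container could not start"
fi
# The temporary profile is left in place; remove it with rm -f "$PROFILE" when done.
```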
This is yet another subsystem where you can limit specific syscalls and parameters to provide another layer of security for your container.
All these technologies combined allow us to create the walls of our container. Together they limit resources, secure and isolate processes, and even let a process look like its own machine with a hostname and IP address. Defaults are generally chosen for ease of use rather than security, so care must be taken when deploying your containers. With these basics we can now see the walls of the container.
Using this knowledge at Confluera we’ve built an amazing product that can detect when attackers break out of the confines of your containers! For a demo please send an email to demo@confluera.com to see for yourself.