How is Sandboxing implemented?

Question

What I would like to know is twofold. First off, what is sandboxing? Is it the trapping of OS system calls, followed by a decision on whether to let each one pass through? Secondly, how is it implemented to begin with? Would it be by way of hooks in the SSDT (kernel level)?

forest · Answer 1 · 2022-06-05T20:54:19.470

Well, this answer ended up fairly long, but sandboxing is a huge topic. At its most basic, sandboxing is a technique to minimize the effect a program will have on the rest of the system in the case of malice or malfunction. This can be for testing or for enhancing the security of a system. The reason one might want to use a sandbox also varies, and in some cases it is not even related to security, for example in the case of OpenBSD's systrace. The main uses of a sandbox are:

Program testing to detect broken packages, especially during builds.
Malware analysis to understand behavior of malicious software.
Securing untrusted or unsafe applications to minimize damage they can do.

There are many sandboxing techniques, all with differing threat models. Some may just reduce attack surface area by limiting APIs that can be used, while others define access controls using formalized models similar to Bell-LaPadula or Biba. I'll be describing a few popular sandboxing techniques, mostly for Linux, but I will also touch on other operating systems.

Seccomp

Seccomp is a Linux security feature that reduces kernel attack surface area. It is technically a syscall filter and not a sandbox, but is often used to augment sandboxes. There are two types of seccomp filters, called mode 1 (strict mode) and mode 2 (BPF mode).

Mode 1

Seccomp mode 1 is the most strict, and original, mode. When a program enables mode 1 seccomp, it is limited to using only four hardcoded syscalls: read(), write(), exit(), and rt_sigreturn(). Any file descriptors that will be needed must be created before enforcing seccomp. In the case of a violation, the offending process is terminated with SIGKILL.

The following sample program, taken from an answer I wrote on another StackExchange site, securely executes a function, supplied as raw x86 machine code, that returns 42:

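#include <unistd.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>

/* x86 machine code for "mov al,42; ret" aka "return 42" */
static const unsigned char code[] = "\xb0\x2a\xc3";

int main(void)
{
    int fd[2], ret = -1;

    /* spawn child process, connected by a pipe */
    if (pipe(fd) != 0)
        return 1;
    if (fork() == 0) {
        close(fd[0]);

        /* copy the code into executable memory first: a const array is
           usually not executable, and mmap() is forbidden under mode 1 */
        void *mem = mmap(NULL, sizeof(code), PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
            _exit(1);
        memcpy(mem, code, sizeof(code));

        /* enter mode 1 seccomp and execute untrusted bytecode */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(1);
        ret = ((uint8_t (*)(void))mem)();

        /* send result over pipe, and exit */
        write(fd[1], &ret, sizeof(ret));
        syscall(SYS_exit, 0);
    } else {
        close(fd[1]);

        /* read the result from the pipe, and print it */
        if (read(fd[0], &ret, sizeof(ret)) != sizeof(ret))
            return 1;
        printf("untrusted bytecode returned %d\n", ret);
        return 0;
    }
}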

Mode 2

Mode 2 seccomp, also called seccomp-bpf, involves a userspace-created policy being sent to the kernel, defining which syscalls are permitted, what arguments are allowed for those syscalls, and what action should be taken in the case of a syscall violation. The filter comes in the form of BPF bytecode, a special type of instruction set that is interpreted in the kernel and used to implement filters. This is used in the Chrome/Chromium and OpenSSH sandboxes on Linux, for example.

A simple program that prints the current PID using seccomp-bpf:

#include <seccomp.h>
#include <unistd.h>
#include <stdio.h>
#include <errno.h>

int main(void)
{
    /* initialize the libseccomp context */
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);

    /* allow exiting */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);

    /* allow getting the current pid */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getpid), 0);

    /* allow changing data segment size, as required by glibc */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(brk), 0);

    /* allow writing up to 512 bytes to fd 1 */
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 2,
        SCMP_A0(SCMP_CMP_EQ, 1),
        SCMP_A2(SCMP_CMP_LE, 512));

    /* if writing to any other fd, return -EBADF */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EBADF), SCMP_SYS(write), 1,
        SCMP_A0(SCMP_CMP_NE, 1));

    /* load and enforce the filters */
    seccomp_load(ctx);
    seccomp_release(ctx);

    printf("this process is %d\n", getpid());
    return 0;
}

Because the Linux syscall ABI keeps arguments in general purpose registers, only these registers are validated by seccomp. This is fine in some cases, such as when an argument is a bitwise ORed list of flags, but in cases where the argument is a pointer to memory, filtering will not work. The reason for this is that the pointer only references memory, so validating the pointer only ensures that the pointer itself is allowed, not that the memory it references has not been changed between the check and the kernel's use of it. This means that it is not possible to reliably filter certain arguments for syscalls like open() (where the path is a pointer to a null-terminated string in memory). To filter paths and similar objects, mandatory access controls or another LSM-based framework must be used.
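To make the register limitation concrete, here is a minimal sketch of a filter written as raw BPF, without libseccomp. The function names install_getppid_filter and demo_raw_filter are my own, chosen for illustration, and the architecture check that a real filter should begin with is left out for brevity. The only input the filter has is struct seccomp_data (the syscall number, the architecture, the instruction pointer, and six 64-bit argument values), and a BPF program has no way to dereference any of them:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Install a filter that makes getppid() fail with EPERM and allows every
   other syscall. Returns 0 on success, -1 on failure.
   NOTE: a production filter should first check seccomp_data.arch. */
static int install_getppid_filter(void)
{
    struct sock_filter insns[] = {
        /* load the syscall number from struct seccomp_data */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if it is getppid, fall through to ERRNO; otherwise skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };

    /* required before an unprivileged process may load a filter */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Fork a child, install the filter there, and report what happened:
   0 = getppid() was denied with EPERM, 1 = it was not denied,
   2 = the filter could not be installed or the child died. */
int demo_raw_filter(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 2;
    if (pid == 0) {
        if (install_getppid_filter() != 0)
            _exit(2);
        errno = 0;
        long r = syscall(SYS_getppid);
        _exit((r == -1 && errno == EPERM) ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}
```

Calling demo_raw_filter() from main() should return 0 on a kernel with seccomp-bpf support. The filter is installed in a child because, like mode 1, it can never be removed from a process once loaded.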

Pledge

OpenBSD has added a syscall filter similar to (but more coarse-grained than) seccomp called pledge (previously tame). Pledge is a syscall that applications can opt into, essentially "pledging" that they will limit their uses of various kernel interfaces. No matter how much the application begs, the kernel won't revoke the restrictions once they are in place, even if it changes its mind.

Pledge allows an application to make a promise, which is a group of actions which will be permitted, with all others being denied. In essence, it is promising to only use functions it explicitly requests beforehand. Some (non-exhaustive) examples:

The stdio promise allows basic functionality like closing file descriptors or managing memory.
The rpath promise allows syscalls that can be used to read the filesystem.
The wpath promise allows syscalls that can be used to write to the filesystem.
The tmppath promise allows syscalls that can read/write, but only in /tmp.
The id promise allows syscalls that are used to change credentials, like setuid().

Although pledge is much more coarse-grained than seccomp, it is, for this reason, much easier to use and maintain. Because of this, OpenBSD has progressed to adding pledge support to a wide variety of their base applications, from things as security-sensitive as sshd to things as trivial as cat. This "security by default" architecture ends up greatly improving the security of the system as a whole, even if individual promises are coarse-grained and not particularly flexible.
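As a rough illustration of how little code opting in takes, here is a sketch of a helper that drops a process down to the stdio promise. The function name drop_to_stdio is hypothetical, and the non-OpenBSD branch is only there so that the snippet compiles elsewhere; pledge(2) itself exists only on OpenBSD:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Restrict the calling process to the "stdio" promise.
   Returns 0 on success, -1 with errno set on failure. */
int drop_to_stdio(void)
{
#ifdef __OpenBSD__
    /* first argument: promises for this process;
       second argument: promises applied after execve (NULL = unchanged) */
    return pledge("stdio", NULL);
#else
    /* assumption for portability of this sketch: report "not supported" */
    errno = ENOSYS;
    return -1;
#endif
}
```

A program would typically open whatever files and sockets it needs first, call a helper like this, and only then start handling untrusted input. After that point, a syscall outside the promise (such as open()) gets the process killed rather than merely returning an error.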

Chroot

A chroot is a *nix feature that allows setting a new path as the root directory for a given program, forcing it to see everything as relative to that path. This is not usually used for security, since a privileged program can often escape a chroot, and because it does not isolate IPC or networking, allowing even unprivileged processes to do mischief like killing other processes. In a pinch, it can be used to augment other security techniques. It is very useful for preventing an application from doing accidental damage, and for giving legacy software a view of the filesystem that it expects.

Chrooting bash, for example, would involve putting any executables and libraries it needs into the new directory, and running the chroot utility (which itself just calls the syscall of the same name):

host ~ # ldd /bin/bash
        linux-vdso.so.1 (0x0000036b3fb5a000)
        libreadline.so.6 => /lib64/libreadline.so.6 (0x0000036b3f6e5000)
        libncurses.so.6 => /lib64/libncurses.so.6 (0x0000036b3f47e000)
        libc.so.6 => /lib64/libc.so.6 (0x0000036b3f0bc000)
        /lib64/ld-linux-x86-64.so.2 (0x0000036b3f938000)
host ~ # ldd /bin/ls
        linux-vdso.so.1 (0x000003a093481000)
        libc.so.6 => /lib64/libc.so.6 (0x000003a092e9d000)
        /lib64/ld-linux-x86-64.so.2 (0x000003a09325f000)
host ~ # mkdir -p newroot/{lib64,bin}
host ~ # cp -aL /lib64/{libreadline,libncurses,libc}.so.6 newroot/lib64
host ~ # cp -aL /lib64/ld-linux-x86-64.so.2 newroot/lib64
host ~ # cp -a /bin/{bash,ls} newroot/bin
host ~ # pwd
/root
host ~ # chroot newroot /bin/bash
bash-4.3# pwd
/
bash-4.3# ls
bin  lib64
bash-4.3# ls /bin
bash  ls
bash-4.3# id
bash: id: command not found

Only a process with the CAP_SYS_CHROOT capability is able to enter a chroot. This is necessary to prevent a malicious program from creating its own copy of /etc/passwd in a directory it controls, and chrooting into it with a setuid program like su, tricking the binary into giving it root.
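For programs that chroot themselves rather than relying on the chroot utility, the underlying call looks roughly like the sketch below. The names enter_chroot and demo_chroot are made up for this example. The important detail is the chdir("/") after chroot(): the syscall does not change the working directory, so without it the process keeps a handle on a directory outside the new root.

```c
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Make `dir` the root directory of the calling process.
   Returns 0 on success, -1 with errno set on failure. */
static int enter_chroot(const char *dir)
{
    if (chroot(dir) != 0)
        return -1;      /* typically EPERM without CAP_SYS_CHROOT */
    /* move the working directory inside the new root */
    return chdir("/");
}

/* Try the above in a child process so the caller is unaffected:
   0 = child ended up with "/" as its cwd inside the chroot,
   1 = chroot succeeded but the cwd was unexpected,
   2 = chroot was not permitted or the child died. */
int demo_chroot(const char *dir)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return 2;
    if (pid == 0) {
        char cwd[64];
        if (enter_chroot(dir) != 0)
            _exit(2);
        if (getcwd(cwd, sizeof(cwd)) == NULL)
            _exit(1);
        _exit(strcmp(cwd, "/") == 0 ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    return WEXITSTATUS(status);
}
```

Run as an ordinary user, demo_chroot("/tmp") returns 2 because of the capability check described above. A real daemon would also drop root (setgid() and setuid()) right after entering the chroot, since a process that stays privileged can usually get back out.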

Namespaces

On Linux, namespaces are used to isolate system resources, giving a namespaced program a different understanding of what resources it owns. This is commonly used to implement containers. From the namespaces(7) manpage:

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.

The following 7 namespaces are supported under Linux (kernels 5.6 and later add an eighth, the time namespace, not listed here):

cgroup - Cgroup root directory
IPC - System V IPC and POSIX message queues
Network - Network interfaces, stacks, ports, etc
Mount - Mountpoints, similar in function to a chroot
PID - Process IDs
User - User and group IDs
UTS - Hostname and domain name

An example of PID namespaces using the unshare utility:

host ~ # echo $$
25688
host ~ # unshare --fork --pid
host ~ # echo $$
1
host ~ # logout
host ~ # echo $$
25688

While these can be used to augment sandboxing or even be used as an integral part of a sandbox, some of them can reduce security. User namespaces, when unprivileged (the default), expose a much greater attack surface area from the kernel. Many kernel vulnerabilities are exploitable by unprivileged processes when the user namespace is enabled. On some kernels, you can disable unprivileged user namespaces by setting kernel.unprivileged_userns_clone to 0, or, if that specific sysctl is not available on your system, setting user.max_user_namespaces to 0. If you are building your own kernel, you can set CONFIG_USER_NS=n to disable user namespaces globally.
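The sketch below shows both halves of that trade-off in a few lines. An unprivileged child creates a user namespace together with a UTS namespace, which makes it "root" inside them, and is then allowed to call sethostname(), something it could never do on the host. The host's hostname is untouched. The function name demo_uts_namespace and the hostname "sandbox" are arbitrary choices for the example, and I call the syscall directly only to avoid needing _GNU_SOURCE; glibc's unshare() wrapper does the same thing.

```c
#include <string.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>
#include <linux/sched.h>   /* CLONE_NEWUSER, CLONE_NEWUTS */

/* Returns 0 if a child could rename "its" host without affecting ours,
   1 if the host's hostname changed (isolation failed),
   2 if namespaces could not be created (disabled or not permitted). */
int demo_uts_namespace(void)
{
    char before[256] = "", after[256] = "";
    int status;
    pid_t pid;

    if (gethostname(before, sizeof(before) - 1) != 0)
        return 2;

    pid = fork();
    if (pid < 0)
        return 2;
    if (pid == 0) {
        /* new user namespace (grants capabilities inside it) plus a
           new UTS namespace owned by that user namespace */
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWUTS) != 0)
            _exit(2);
        /* privileged operation, but confined to the new namespace */
        if (sethostname("sandbox", strlen("sandbox")) != 0)
            _exit(2);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return 2;
    if (WEXITSTATUS(status) != 0)
        return 2;

    if (gethostname(after, sizeof(after) - 1) != 0)
        return 2;
    return strcmp(before, after) == 0 ? 0 : 1;
}
```

On a system where unprivileged user namespaces have been disabled with one of the sysctls above, the unshare call fails and the function returns 2.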

Mandatory Access Controls

A MAC is a framework for defining what a program can and cannot do, on a whitelist basis. A program is represented as a subject. Anything the program wants to act on, such as a file, path, network interface, or port is represented as an object. The rules for accessing the object are called the permission, or flag. Take the AppArmor policy for the ping utility, with added comments:

#include <tunables/global>

/bin/ping {
  # use header files containing more rules
  #include <abstractions/base>
  #include <abstractions/consoles>
  #include <abstractions/nameservice>

  capability net_raw,  # allow having CAP_NET_RAW
  capability setuid,   # allow being setuid
  network inet raw,    # allow creating raw sockets

  /bin/ping mixr,      # allow mmaping, executing, and reading
  /etc/modules.conf r, # allow reading
}

With this policy in place, the ping utility, if compromised, cannot read from your home directory, execute a shell, write new files, etc. This kind of sandboxing is used for securing a server or workstation. Other than AppArmor, some popular MACs include SELinux, TOMOYO, and SMACK. These are typically implemented in the kernel as a Linux Security Module, or LSM. This is a subsystem under Linux that provides modules with hooks for various actions (like changing credentials and accessing objects) so they can enforce a security policy.

Hypervisors

A hypervisor is virtualization software. It usually leverages hardware features that allow isolating all system resources, such as CPU cores, memory, hardware, etc. A virtualized system believes not just that it has root, but that it has ring 0 (kernel mode). Hardware is either abstracted by the CPU (the case for the CPU cores and memory itself), or emulated by the hypervisor software (for more complex hardware, such as NICs). Because the guest is made to believe it owns the whole system, anything that can run on that architecture will tend to also run in the virtual machine, allowing a Linux host with a Windows guest, or a FreeBSD host with a Solaris guest. In theory, a hypervisor will prevent any actions in the guest from affecting the host.

A useful resource for gaining a low-level understanding of how a guest is set up is an LWN writeup on the KVM API. KVM is a kernel interface supported on Linux and Illumos (a family of open source forks of Solaris) for setting up virtual machines. As it's fairly low level (interaction only being through ioctls on a file descriptor opened against the /dev/kvm character device), it is typically the backend for projects like QEMU. KVM itself makes use of privileged hardware virtualization features such as VT-x on Intel processors and AMD-V on AMD.

Hypervisors are commonly used to assist in malware analysis, as with the popular Cuckoo sandbox. Cuckoo creates a virtual machine that malware can be run on, and analyzes the internal state of the system by reading memory contents, dumping memory, etc. Because the guest runs a full operating system, it is often more difficult for malware to realize it is running virtualized. Common techniques like attaching a dummy debugger to itself so it cannot be debugged can be fooled, though some malware can attempt to detect virtualization (with varying levels of sophistication). Hypervisor detection itself is a very broad and complex subject.

Hypervisors are often (ab)used for security, such as by Qubes OS or Bromium (using the Xen hypervisor to isolate Fedora and Windows, respectively). Whether or not this is a good idea is often debated, due to bugs in Xen cropping up repeatedly. A famous and rather abrasive quote from Theo de Raadt, founder of OpenBSD, on the topic of virtualization when relied on for security:

You are absolutely deluded, if not stupid, if you think that a worldwide collection of software engineers who can't write operating systems or applications without security holes, can then turn around and suddenly write virtualization layers without security holes.

Whether or not a hypervisor is a good choice for security depends on many factors. It is typically easier to use and maintain, since it isolates an entire guest operating system, but it has a large attack surface area and does not provide fine-grained protections.

Containers

Containers are similar to hypervisors, but rather than using virtualization, they use namespaces. Each container has every resource put in its own namespace, allowing every container to run an independent operating system. The init process on the container sees itself as PID 1 running as root, but the host sees it as just another non-init and non-root PID. However, as they share the host's kernel, they can only run the same type of operating system as the host. Additionally, while containers can have root processes that can do privileged actions like setting up network interfaces (only in that container's namespace, of course), they cannot change global kernel settings that would affect all containers. Docker and OpenVZ are popular container implementations.

Because containers fundamentally rely on user namespaces of various implementations (standard Linux namespaces for Docker, and a bespoke namespace technology for OpenVZ), they are often criticized for providing poor security. Container escapes and privilege escalation vulnerabilities are not uncommon on these systems. The reason for these security issues stems from the fact that the namespace root user can interact with the kernel in new and unexpected ways. While the kernel is designed not to let the namespace root user make any obviously dangerous changes to the system that cannot be kept in a namespace (like sysctl tweaks), the root is still able to interact with a lot more of the kernel than an unprivileged process. Because of this, a vulnerability that can only be exploited by root in the process of setting up a virtual network interface, for example, could be exploited by an unprivileged process if that process is able to enter a user namespace. This is an issue even if the syscalls can only "see" the network interface of the container.

In the end, a user namespace simply allows an unprivileged user to interact with far more of the kernel. The surface area is increased to such an extent that many vulnerabilities that otherwise would be relatively harmless instead become LPEs. This is what happens when kernel developers tend not to keep a security mentality when writing code that only root can interact with.

Other technologies

Linux is certainly not the only operating system that has sandboxing. Many other operating systems have their own technology, implemented in various different ways and with varying threat models:

AppContainer on Windows provides isolation similar to a combination of chroots and namespaces. Domains, files, networks, and even windows are isolated.
Seatbelt on OSX acts as mandatory access controls, limiting the resources a confined application can access. It has seen its share of bypasses.
Jails on FreeBSD build on the concept of chroot. A jail assigns an IP address to the program running in it and gives it its own hostname. Unlike a chroot, it is designed for security.
Zones on Solaris are advanced containers that run a copy of Solaris' userland under the host's kernel. They are similar to FreeBSD Jails, but more feature-rich.

score 1 · Answer 2 · edited Sep 29 '17 at 13:47

A quick look at the Wikipedia page about sandboxing:

In computer security, a sandbox is a security mechanism for separating running programs, usually in an effort to mitigate system failures or software vulnerabilities from spreading. It is often used to execute untested or untrusted programs or code, possibly from unverified or untrusted third parties, suppliers, users or websites, without risking harm to the host machine or operating system. A sandbox typically provides a tightly controlled set of resources for guest programs to run in, such as scratch space on disk and memory. Network access, the ability to inspect the host system or read from input devices are usually disallowed or heavily restricted.

In the sense of providing a highly controlled environment, sandboxes may be seen as a specific example of virtualization. Sandboxing is frequently used to test unverified programs that may contain a virus or other malicious code, without allowing the software to harm the host device.

So, a sandbox is an implementation that creates a controlled and restricted environment in which to run or analyze an untrusted application, task, or anything else that can run on a computer.

Sandboxes are often implemented with virtual machines, because a virtual machine behaves like an actual physical system yet can be easily set up and monitored.

Another way is Secure Computing Mode (seccomp), mentioned in the same Wikipedia page:

It's a sandbox built into the Linux kernel. When activated, seccomp only allows the write(), read(), exit(), and sigreturn() system calls.

Then there is the well-known operating-system-level virtualization, also known as containerization. Quoting from the Wikipedia page about it:

It's an operating system feature in which the kernel allows the existence of multiple isolated user-space instances. Such instances, called containers, partitions, virtualization engines (VEs), or jails (FreeBSD jail or chroot jail), may look like real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can see all resources (connected devices, files and folders, network shares, CPU power, quantifiable hardware capabilities) of that computer. However, programs running inside a container can only see the container's contents and devices assigned to the container.

There are some other technologies and applications that use sandboxing, like the Java Runtime Environment. Sandboxing and other virtualization techniques are used a lot for testing apps and environments before final release, performing many tasks simultaneously with less hardware, analyzing malware, etc. I hope this gives you a brief idea of the uses and the extent of usability. Feel free to ask anything else you want below!