2

I have been thinking about how an app could be "perfectly" isolated from the rest of the system. I know we will never achieve "perfect" in practice, but in theory, how could one go about it? I have put together a few thoughts:

  1. IMHO the security we are talking about has to sit at the lowest possible level, in other words the only place that software cannot circumvent: hardware.

I am not advocating against security restrictions at higher levels, but since complexity rises with every additional level, there will be security holes and someone will find them.

Now by hardware I of course mean the kernel mode of the processor and its protection rings.

  2. I have looked at the various system calls of different operating systems, and here again I see a huge amount of complexity in the sheer number of system calls. If I were to create a sandbox, I would reduce the system calls to the absolute minimum required.

It is worth mentioning here that the apps that are supposed to run in this sandbox can be implemented/built specifically for the sandbox. I do not want to introduce the requirement of being able to port existing applications into this sandbox!

So to have something we can agree on, let's say the apps in the sandbox need:

  • timing (Edit in response to @Pascal: time & date, hardware timers)
  • reading / writing / creating / deleting files (no directory modification required)
  • networking
  • sound
  • display (specifically OpenGL)

Of course those need to be restricted too: there will be a supervisor process that decides which hosts the sandboxed process may connect to, which files it may see and change, and to which part of the screen it may draw.

  3. In C it is common practice to call functions with a pointer (or address) to a range in memory (e.g. an array or a string) and a number of bytes to read/write. When designing the system calls for the sandbox, I would go a way that is more common in higher-level programming languages: every pointer must point to a struct that first contains the datatype of the memory range and its size. I would also only allow memory ranges that have been acquired by calling one of the system calls, i.e. a system call always knows what it is working with and fails if it is not given something already known to it in the first place (including a safe storage/copy of the type and size information, so that the kernel cannot be tricked into writing/reading outside the range).
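A rough sketch of what I have in mind (all names are invented for illustration, this is not an existing API):

```c
#include <stddef.h>
#include <stdint.h>

/* Datatypes the sandbox kernel knows about. */
typedef enum {
    SB_TYPE_BYTES,   /* opaque byte buffer */
    SB_TYPE_UTF8,    /* text               */
    SB_TYPE_PIXELS   /* image data         */
} sb_datatype_t;

/* Every pointer handed to a sandbox syscall points at one of these.
 * The kernel writes the header when the range is allocated and keeps
 * its own private copy of type and size, so a process that overwrites
 * the header cannot trick the kernel into reading or writing out of
 * range. */
typedef struct {
    sb_datatype_t type;
    size_t        size;    /* usable bytes in data[] */
    uint8_t       data[];  /* the actual memory range */
} sb_buffer_t;

/* Hypothetical syscalls: buffers can only come from sb_alloc(), and
 * every other syscall rejects pointers it has not handed out itself. */
sb_buffer_t *sb_alloc(sb_datatype_t type, size_t size);
int          sb_write(int fd, const sb_buffer_t *buf);
```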

As you probably guessed, I made a few assumptions that would need verification. I am restricting those assumptions to the major OSs: Linux, Windows, macOS, Android, iOS. PIU stands for "process in user mode" (= not kernel mode). So here goes: is it true that

  • A PIU can only ever read/write its own memory, never any other, unless through system calls?
  • A PIU can only ever access the hard drives through system calls?
  • A PIU can only ever access the network through system calls?
  • A PIU can only ever access any other hardware through system calls?
  • When forking a process, I can disable/remove/replace each and every system call?
  • There is no way for a PIU to use any other system call than the ones I defined/allowed when forking?
  • In a custom system call (= inside kernel mode) I can use the original system calls directly (without using interrupts)?

So far, if all my assumptions are correct, I would (theoretically) be able to create a "perfectly" safe sandbox. There is one area, though, for which I have absolutely no feel: display.

How does OpenGL work? Does it talk directly to the graphics card, or does it use system calls as well? Can something running on the GPU affect other graphical applications? Can I draw to any area of the screen, or can that be restricted? Is the GPU somehow connected to the current CPU mode (kernel/user)?

When answering this question, please refrain from arguing that one should not replace system calls, or that one should use this or that existing sandbox/VM implementation. If we talk about the replacement of system calls, please tell me what security and performance implications that might have; if we talk about existing sandbox implementations, tell me how they solve certain problems related to the question, or how certain malware circumvented their security mechanisms.

John Smith
  • I think seccomp might be what you're looking for, with regards to restricting system calls. – user253751 Mar 03 '17 at 03:42
  • Also, a perfectly isolated app can't do anything. That's not what you want. If you give it access to files, networking, and so on then it's no longer perfectly isolated is it? – user253751 Mar 03 '17 at 03:50
  • @immibis Ok yeah, the wording "perfect" might not be perfect ;-). Appreciate the hint about seccomp! – John Smith Mar 03 '17 at 11:49
  • In that case - kernels already do a very good job of sandboxing applications! But there are deliberate holes in the sandbox because otherwise your application wouldn't be able to do anything. Sometimes the holes let the application do more things than you want it to be able to do. – user253751 Mar 03 '17 at 22:33

5 Answers

3

Since your question was so broad (as others have pointed out), and each person answered a different part of your question, I'll try to answer this part.

It is worth mentioning here that the apps that are supposed to run in this sandbox can be implemented/built specifically for the sandbox. I do not want to introduce the requirement of being able to port existing applications into this sandbox!

Sorry, but this is going to be effectively impossible to do well. My job heavily involves writing sandboxes, and it almost always means I have to patch whatever I am sandboxing if I want to improve security in any real way. Most programs call unnecessary syscalls with arguments that fundamentally cannot be filtered without very heavy kernel changes (you cannot filter the memory pointed to by an address in a syscall argument using a generic syscall filter).
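To make the limitation concrete: a seccomp-BPF filter runs against a snapshot of the registers at syscall entry, nothing more. This is the actual structure the filter sees (from the Linux UAPI header linux/seccomp.h):

```c
/* The only data a seccomp-BPF filter is ever given. args[] holds raw
 * register values; if an argument is a pointer, the filter sees the
 * address as a plain number and has no way to dereference it, so the
 * pointed-to memory cannot be checked at all. */
struct seccomp_data {
    int    nr;                  /* syscall number                 */
    __u32  arch;                /* AUDIT_ARCH_* value             */
    __u64  instruction_pointer; /* CPU IP at time of the syscall  */
    __u64  args[6];             /* syscall arguments (raw values) */
};
```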

There are some experimental techniques, however. One hypervisor-based sandbox spawns each process in its own virtual machine: the host kernel is duplicated as soon as fork() is called (the memory is copy-on-write, to avoid memory overhead), and the guest kernel is destroyed when the process exits. The project is called Capsule, by Quarkslab. The processes are denied hardware access, and if the guest kernel is exploited, it cannot exploit the host. This may be a good starting point for a sandbox, at least if you want to go the experimental route and need applications to work without modification. I rather doubt it supports OpenGL, though.

timing (Edit in response to @Pascal: time & date, hardware timers)

Getting the time on Linux involves a special mechanism called the vDSO (virtual dynamic shared object), which is not a true syscall: it is a small ELF object the kernel maps into every process, and it implements calls like gettimeofday() and time() in user space. When time() is called, the process reads a constantly updated, read-only page of memory that the kernel sets up and all processes can read (the VVAR page), which contains information such as the current time. This lets the process get the time without a context switch. Where nanosecond precision is required, the vDSO code uses the RDTSC instruction, which is still much faster than issuing gettimeofday() as a true syscall. While people may argue about the security of a shared page like the VVAR, the security of these read-only "syscalls" is hard to dispute.
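For illustration, an ordinary-looking call that is normally serviced entirely by the vDSO on Linux, without ever entering the kernel:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    /* With glibc on Linux this usually never leaves user mode: the
     * vDSO code reads the kernel-maintained time page directly. */
    clock_gettime(CLOCK_REALTIME, &ts);
    printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}
```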

Hardware timers are another issue. You have to read either the HPET (High Precision Event Timer) or the RTC (real-time clock); I'm not sure which one you mean. I imagine you meant the RTC, in which case you could limit the ioctl argument to RTC_RD_TIME, which simply fills out a struct rtc_time when called against the RTC character device. I can't imagine that being exploited.
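A minimal sketch of that single allowed operation, using the standard Linux RTC interface (the device node may be /dev/rtc or /dev/rtc0 depending on the system):

```c
#include <fcntl.h>
#include <linux/rtc.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void) {
    struct rtc_time tm;
    int fd = open("/dev/rtc0", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    /* RTC_RD_TIME only fills out the caller's struct rtc_time; a
     * sandbox could whitelist exactly this one ioctl request. */
    if (ioctl(fd, RTC_RD_TIME, &tm) < 0) { perror("ioctl"); close(fd); return 1; }
    printf("%04d-%02d-%02d %02d:%02d:%02d\n",
           tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday,
           tm.tm_hour, tm.tm_min, tm.tm_sec);
    close(fd);
    return 0;
}
```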

reading / writing / creating / deleting files (no directory modification required)

This is easy. You can use a mix of standard syscall filtering (implementations like seccomp) and LSM hooks or similar to do this.
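As a sketch of the syscall-filtering half, using libseccomp (the helper name and the exact allow-list are illustrative, not a complete policy):

```c
#include <seccomp.h>  /* link with -lseccomp */
#include <stdlib.h>

/* Install a kill-by-default filter that only permits the handful of
 * file-related syscalls the sandboxed app is supposed to need. */
static void install_file_filter(void) {
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_KILL);
    if (!ctx) abort();
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(openat), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(read), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(write), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(close), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(unlink), 0);
    seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
    if (seccomp_load(ctx) < 0) abort();
    seccomp_release(ctx);
}
```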

networking

It would not be difficult to design a sandbox that allows secure network access. You would absolutely need to disallow all unnecessary protocols, limiting yourself to TCP, UDP, and ICMP on top of IPv4 and, if necessary, IPv6. Other protocols like DCCP and SCTP have a history of vulnerabilities (very recently, a horrible use-after-free in DCCP resulted in a local privilege escalation). The Linux networking code is quite complex, but if all you need is socket(), connect(), bind(), accept(), listen(), setsockopt(), getsockopt(), and shutdown(), then you can do the rest entirely with other, generic I/O syscalls like read() and write().

You would, however, want to strictly limit the *sockopt syscalls, since they have a history of issues. If you are reimplementing/replacing syscalls, you could drastically limit what they can do, at the expense of limiting flexibility.
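Unlike pointer arguments, the plain integer selectors of socket() can be pinned down by a register-value filter. A sketch with libseccomp (the policy itself is made up for illustration):

```c
#include <netinet/in.h>
#include <seccomp.h>
#include <sys/socket.h>

/* Allow socket() only for IPv4 TCP and UDP. A real policy would use
 * SCMP_CMP_MASKED_EQ on the type argument to tolerate flags like
 * SOCK_CLOEXEC; the default action and all other rules are elided. */
static int add_socket_rules(scmp_filter_ctx ctx) {
    int rc = seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 2,
                              SCMP_A0(SCMP_CMP_EQ, AF_INET),
                              SCMP_A1(SCMP_CMP_EQ, SOCK_STREAM));
    if (rc < 0) return rc;
    return seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(socket), 2,
                            SCMP_A0(SCMP_CMP_EQ, AF_INET),
                            SCMP_A1(SCMP_CMP_EQ, SOCK_DGRAM));
}
```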

sound

Sound requires accessing character devices in /dev/snd/ with the ioctl() syscall, passing complex buffers to it. An alternative would be to go through a sound server like PulseAudio or JACK. It would be hard to have a "perfect" sandbox with audio. I don't know much about audio, so I won't comment on this.

display (specifically OpenGL)

You could not do OpenGL securely. If you use the framebuffer, you can write to the display simply by opening, say, /dev/fb0 and writing to it. While even this has proven vulnerable in the past, it is much, much easier to secure on the kernel side; the performance is much lower, however. If you use Xorg or Wayland, you only need networking access, and even if the unsandboxed Xorg/Wayland has access to hardware acceleration, your application doesn't. This gives you better performance than directly accessing the framebuffer, but not the performance OpenGL would (so video games, rendering tools, etc. will not work).
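To give a feel for how simple raw framebuffer access is (and why per-region mediation is the hard part), a minimal sketch that paints a square through /dev/fb0, assuming a 32-bits-per-pixel mode:

```c
#include <fcntl.h>
#include <linux/fb.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    struct fb_var_screeninfo vi;
    struct fb_fix_screeninfo fi;
    int fd = open("/dev/fb0", O_RDWR);
    if (fd < 0) return 1;
    ioctl(fd, FBIOGET_VSCREENINFO, &vi);  /* resolution, depth   */
    ioctl(fd, FBIOGET_FSCREENINFO, &fi);  /* stride, buffer size */
    uint8_t *fb = mmap(NULL, fi.smem_len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) return 1;
    /* Paint the top-left 100x100 pixels white: plain memory writes. */
    for (uint32_t y = 0; y < 100 && y < vi.yres; y++)
        for (uint32_t x = 0; x < 100 && x < vi.xres; x++)
            *(uint32_t *)(fb + y * fi.line_length + x * 4) = 0xFFFFFFFFu;
    munmap(fb, fi.smem_len);
    close(fd);
    return 0;
}
```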

Software OpenGL (Mesa's LLVM-based llvmpipe) works, but being entirely CPU-bound it is much slower.

How does OpenGL work? Does it talk directly to the graphics card, or does it use system calls as well? Can something running on the GPU affect other graphical applications? Can I draw to any area of the screen, or can that be restricted? Is the GPU somehow connected to the current CPU mode (kernel/user)?

The process communicates using the ioctl() system call on a file descriptor pointing to a character device such as /dev/dri/card0. The call typically passes a buffer, so there is a lot of room for bugs. Even if you whitelist the individual ioctl request numbers, the buffers of the requests that remain cannot be whitelisted in a syscall sandbox. This is the same reason seccomp is unable to whitelist anything other than the raw register values of syscall arguments: if you try to whitelist anything they point to (say, a struct containing OpenGL-related data in an ioctl sent to the master DRM node), a concurrent thread can modify the contents of the struct right after they have been checked for correctness, while the first thread is entering kernel mode. The relevant research paper is titled Exploiting Concurrency Vulnerabilities in System Call Wrappers.

You cannot restrict a process's graphics if it has access to the master DRM node. It can tell the GPU to allocate a large pixel buffer without filling it, then read the contents, which will be whatever previously used framebuffers occupied that memory. This is so hard to restrict that even a process inside a virtual machine, if given OpenGL access, can view graphics from processes outside the virtual machine. The issue has been nicknamed the Palinopsia bug.
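The gist of the technique, as a sketch (assumes a current desktop OpenGL context; the context setup is elided, and out must hold w*h*4 bytes):

```c
#include <GL/gl.h>

/* Ask the driver for texture storage but never upload any data, then
 * read the storage back. On vulnerable drivers the returned bytes are
 * whatever previously occupied that VRAM, e.g. another application's
 * framebuffer. */
void dump_stale_vram(unsigned char *out, int w, int h) {
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    /* data == NULL: storage is allocated but left uninitialized. */
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, w, h, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
    /* Copy the uninitialized contents into process memory. */
    glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, out);
    glDeleteTextures(1, &tex);
}
```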

guest
  • I appreciate your insights about timers in Linux as well as OpenGL. It's crazy that this made it into browsers while seeming so vulnerable (I read the article and know they addressed at least the Palinopsia bug, though). Hypervisor-based virtualization is where my thoughts were going. The project (if it becomes one; I am in the process of researching whether it is doable) will be about creating a "secure" container for apps that use a specifically designed API and are intended to be run only in that container. I will have a look at Capsule. Thanks a lot! – John Smith Mar 05 '17 at 13:59
  • Javascript running in a browser cannot abuse the Palinopsia bug because WebGL is an abstraction of OpenGL, and is far more limited. The browser itself _can_, because it has raw access to `/dev/dri/card0`, but a web page would need to actually exploit the browser to make use of full OpenGL to use it. – guest Mar 05 '17 at 23:25
2

This is an unbelievably broad question.

Here's a few problems that kinda kill your approach in my opinion:

  1. As far as I know, at least under Linux, forking a process does not give you any kind of control over which kernel system calls are available.

  2. Limiting system calls will reduce the attack surface, but you want timing (whatever that is), reading and writing files, networking, sound and graphics to be available in your sandbox. That opens up a huge attack surface and doesn't reduce the number of system calls by any relevant percentage.

  3. Even if you manage to limit system calls to a small set, the real problem is that operating system kernels contain bugs, too. So if even one system call you allow contains a bug that can be exploited, your perfect security is gone.

  4. You're saying that you want to use hardware for your sandboxing, so software can't circumvent it. But then you drift away from that idea and keep talking about limiting system calls. Which hardware functionality would you want to leverage that modern operating systems don't already use? About the only two things that come to mind are that Linux only uses 2 out of 4 protection rings on x86 hardware, and that it forgoes much of the segmentation capabilities in favour of paging and the address space virtualization that comes with it.

  5. Your point number 3 about safe memory ranges is already mostly fulfilled by said address space virtualization - each process gets its own address space. For sandboxing purposes, it doesn't matter if a process overwrites its own memory - it's just important that it can't overwrite another process's memory, or kernel memory. Kernels take very good care that this can't happen; they use hardware support to isolate processes from each other, and if one process manages to trick the kernel into writing or reading memory that's off limits, it is because of kernel bugs (see 3.). What makes you think that your sandbox / sandbox kernel would be bug-free?

I think instead of focusing on system calls, you should focus on reducing kernel complexity. One way to achieve that is to say goodbye to monolithic kernels and move to a microkernel design. It's generally accepted that microkernels offer better isolation between different subsystems, which increases security. It also means you can move drivers out of ring 0 and ideally demote them to normal user processes. However, Linus Torvalds has been arguing for twenty years that this kind of design costs too much performance and usability, and that there are no microkernels available that do the kind of heavy lifting that monolithic kernels such as Linux and Windows do. In his words (while arguing with Tanenbaum about kernel design): "Linux wins heavily on points of being available now."

To conclude, if you want perfect isolation, think about buying a second computer and don't move data between the two.

Edit: Moved the added section to its own answer.

Out of Band
  • Good point about safe memory ranges! And I like the part about kernel complexity, you are right about that. Regarding (4.): by "using hardware for sandboxing" I did not mean using new, unused hardware features; I meant it in contrast to what, for example, the JVM does when running untrusted code. The next points were about the design of the sandbox on top of those. You are right about 3., I cannot exclude such bugs, but I can reduce their probability & impact by adding another layer of security on top of them (and still use them). Sorry about the lack of clarity in 1: that must and can be done within the kernel – John Smith Mar 03 '17 at 09:42
  • About 2: I think it is possible to reduce the number of syscalls by a relevant percentage. Consider networking. To open a network connection to example.com you need a DNS request (1 to 3 syscalls), then some connect or open syscall, and then you have to choose between `read`, `readv`, `recvfrom`, and `recvmsg`. I would replace all those with `connect("hostname-string")`, which does the whole DNS lookup, and `read(socketdescriptor, buffer, numofbytes)`. Complexity reduced from 7 to 2 (roughly, anyway). [By timing I meant the syscalls related to time and date and querying system timers] – John Smith Mar 03 '17 at 09:59
  • the crucial point is that the additional (complexity decreasing) layer of security is on top of the kernel's routines, but below the hardware enforced mechanism that prevents circumventing them. – John Smith Mar 03 '17 at 11:24
  • Thanks @Pascal, I appreciate your insights. I cannot upvote your answer again thus I am not able to reward you properly. I suggest moving your edit to a new answer. – John Smith Mar 03 '17 at 19:47
  • I wish to comment on your edit but I guess it makes sense for you to move it first – John Smith Mar 03 '17 at 19:53
2

I'll take on the "process in user mode" part. Although the OS should not matter, because kernel mode vs. user mode is a switch in the CPU itself (which also has other modes, but almost no OS uses them), there are cases where an OS does the switch badly. So, discounting bugs in the kernel:

A PIU can only ever read/write its own memory, never any other, unless through system calls?

Yes, absolutely. All OSes today use virtual memory, which means that every process believes it is the only thing running on the hardware and that it can access the entire memory. All memory that the process accesses is mapped over memory pages and then mapped again onto real memory; the kernel keeps track of all the pages the process is accessing.

A PIU can only ever access the hard drives through system calls?

Yes again, given that there are no bugs in the kernel code or the CPU. Some assembly instructions use the system bus directly (the system bus is used to talk to pretty much any hard drive), yet none of these instructions is allowed to execute while the CPU is in user mode. On encountering such an instruction, the CPU fires an interrupt, abandons the execution, returns to kernel mode and lets the kernel deal with the interrupt; which normally means that the kernel will scrap the process that caused it.

A PIU can only ever access the network through system calls?

Absolutely the same as above: system calls only. The network hardware is also reached over the system bus. The only difference is that there are more network system calls than hard-drive system calls (not counting the calls specific to exotic filesystems, maybe), so there may be more bugs in the implementation.

A PIU can only ever access any other hardware through system calls?

Yes. Any hardware is connected to an IRQ line; the IRQ causes a CPU interrupt, which hands control to the kernel. The kernel then queries the device and uses the system bus to read or write its data.

The only hardware that does not go over a bus is the memory inside the CPU (the CPU cache). There are also often optimisations by which physical memory can be talked to without negotiating for a bus. So yes, memory can often be talked to almost directly, but there is still the MMU (memory management unit, normally part of the CPU itself), which performs the memory mapping we saw above.

When forking a process, I can disable/remove/replace each and every system call?

Forking a process is something the kernel does. I can't speak for the Windows way of dealing with this, but all the others (of those mentioned in the question) copy the process memory (actually copy-on-write, but that does not matter here), adjust internal data structures (in kernel memory) and let the scheduler do its work.

You can change the kernel code to perform this some other way, though it wouldn't be trivial: the map from syscall number to syscall procedure is global, so you would need to keep track yourself of which processes have access to which system calls.
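A hypothetical sketch of the bookkeeping this implies (the names are invented; this is not real Linux code):

```c
/* Per-process syscall permission set: since the syscall table is
 * global, restriction means carrying an allow-bitmap around in the
 * process structure and consulting it in the common dispatch path
 * before indexing the global table. */
#define NR_SYSCALLS   512
#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct task_sandbox {
    unsigned long allowed[NR_SYSCALLS / BITS_PER_WORD];
};

static inline int syscall_allowed(const struct task_sandbox *sb, int nr) {
    if (nr < 0 || nr >= NR_SYSCALLS)
        return 0;
    return (sb->allowed[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}
/* The dispatcher would run syscall_allowed(current_sandbox(), nr)
 * before calling through the global syscall table. */
```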

There is no way for a PIU to use any other system call than the ones I defined/allowed when forking?

Up to how you implement that.

In a custom system call (= inside kernel mode) I can use the original system calls directly (without using interrupts)?

Yes and no. Not the system calls in the kernel headers. Yet for every system call the kernel has a procedure in its own code, and you can call that instead. (I mean the Linux kernel; I can't say anything about Windows.)
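As a rough sketch of what such a custom syscall could look like on Linux, calling the internal implementation (here vfs_read()) directly instead of re-entering the syscall path; wiring the entry into the syscall table is elided and the policy check is a placeholder:

```c
#include <linux/fs.h>
#include <linux/syscalls.h>

/* Hypothetical restricted read: does its own sandbox checks, then
 * hands the real work to the kernel's internal vfs_read() routine,
 * with no interrupt or syscall-entry round trip involved. */
SYSCALL_DEFINE3(sb_read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);
    ssize_t ret = -EBADF;

    if (f.file) {
        loff_t pos = f.file->f_pos;
        /* ... sandbox-specific policy checks would go here ... */
        ret = vfs_read(f.file, buf, count, &pos);
        if (ret >= 0)
            f.file->f_pos = pos;
        fdput(f);
    }
    return ret;
}
```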

Colophon

Kernels tend to do a good job of the above. The problem starts with the fact that some system calls serve more than one subsystem (IPC, for example). Assuming that you implement a syscall mapper in the kernel: if you disable open(), you break not only filesystem operations but also parts of networking. On the other hand, the process would still be able to use mmap() to get the contents of a file into program memory.

There is a plethora of system calls, and some of them have more than one use (e.g. unlink() for both hard and soft links, or the aforementioned open()). Getting exactly right which system calls you need and which you don't is not easy, and the process is error prone.

grochmal
  • Thanks for answering my assumptions! On the colophon: the idea is to completely remove the process's access to the original syscalls and define my own set of syscalls that in turn are layered upon the original kernel routines. This restricted set of system calls will have exactly **one** use. In the case of file access those might be `exists("path/to/file")`, `remove("path/to/file")`, `fd = open("path/to/file")`, `read(fd, buffer, numofbytes)`, `write(fd, buffer, numofbytes)` and `random_write(fd, buffer, startbyte, numofbytes)` and no others! – John Smith Mar 03 '17 at 11:31
  • @JohnSmith you would need to recompile the program for that. The kernel sets up a piece of (kernel-space) memory in which it keeps the list of syscalls, and updates a register to point to that place. When `int 0x80` happens (the syscall interrupt), the CPU directs the userspace process to the correct kernel routine through that map. That's why I say the list is global. You would also need to prevent the program from actually issuing an `int 0x80` instruction, which is error prone because it is blacklisting. If you are recompiling, you may as well use your own compiler instead – grochmal Mar 03 '17 at 18:02
  • Ah, I think I got it now. Since the goal is "perfect" security I cannot assume that I will have control over compilation. Therefore I need to find another way of restricting the process from calling the original syscalls. I think the only way left is hardware virtualization then... I found a nice article about that: https://www.codeproject.com/Articles/215458/Virtualization-for-System-Programmers – John Smith Mar 03 '17 at 19:38
  • @JohnSmith - Yes, VMX would be a good way to catch the syscall interrupt and then figure out what to do with it. At least from the little that I know about VMX, you can build a trigger for a context switch (user mode to kernel mode) and decide (on the host machine) whether to allow it or not. I don't know much about VMX though. On the other hand, I'm confident that plain KVM (the Linux kernel's built-in virtualisation) would not be capable of that, although it is built on top of VMX. – grochmal Mar 03 '17 at 21:00
  • Do you know about any resources about that trigger? I am having a hard time googling it – John Smith Mar 03 '17 at 21:03
  • @JohnSmith - It must be inside KVM code. There is something about VMCALL vs. VMMCALL which I never followed too closely. But we do have two question here on sec.SE: [q1](http://security.stackexchange.com/questions/9786/breaking-out-of-a-strict-linux-sandbox-running-virtually-under-windows-do-the-l/9792#9792), [q2](http://security.stackexchange.com/questions/9877/can-an-unprivileged-process-in-a-hardware-virtualized-system-cause-a-vmexit-wi). Both are quite old though. – grochmal Mar 03 '17 at 22:29
1

Thoughts on your comments about how syscall reduction would improve security

the crucial point is that the additional (complexity decreasing) layer of security is on top of the kernel's routines

and

This restricted set of system calls will have exactly one use. In the case of file access those might be exists("path/to/file"), remove("path/to/file"), fd = open("path/to/file"), read(fd, buffer, numofbytes), write(fd, buffer, numofbytes) and random_write(fd, buffer, startbyte, numofbytes) and no others!

I don't understand how that will provide additional security.

For me, it's the opposite - by adding an additional layer, you're making the system more complex, thus introducing more opportunities for bugs, instead of making it simpler and more secure.

Consider this: You provide fewer entry points for user mode code, but your code still calls kernel functions. So if we assume that some of these kernel functions contain security weaknesses, you'll have to add code to your entry point to abort with an error condition when a user provides unsafe arguments. But in order to do that, you'll already need to know about the security problems of the kernel code. So why not simply fix these weaknesses in the kernel and forget about the additional layer you put in between the kernel and the user?

(It's true that you might not call every kernel function the syscall interface provides, but you want to achieve near-perfect sandboxing, so it's not just about reducing risk a little by removing some code paths)

There's a second problem with the idea that removing (for example) readv, recvfrom, and recvmsg and only providing read will increase security:

We begin by reminding ourselves that the kernel syscall interface is well tested and has stood the test of time.

Look at it like this: the syscalls are a kind of front end, a storefront. Obviously, customers interact with the store mainly through this storefront, so it's polished, and there are security cameras and other measures in place to deal with thieves and other unwelcome customers. But the storefront doesn't actually do much except represent the store; the real work is done deep in the back rooms, and when a customer enters the store and asks to be served, most of the required work is delegated to the back rooms, out of sight. It doesn't really matter who serves you - the young lady or the grumpy old man - they still pass your order along to the back rooms. And it's most likely there that the problems start.

It's unlikely that there are many serious bugs left in "kernel frontend" code, which has been scrutinized for over twenty years (in the Linux case). It's much more likely that bugs will be present in contributed device driver code, and reducing the number of syscalls does absolutely nothing to protect against exploits based on those bugs. What would help is isolating driver code in unprivileged containers, so that a bug in any one driver can't be exploited to gain control over the whole system.

Check out Qubes OS, for example - its main design idea is to isolate most parts of the network stack and the handling of USB devices in their own virtual machines using the type 1 Xen hypervisor. I think that's a clever approach. You probably also know about the grsecurity Linux kernel patch, which hardens kernel security by implementing, among other things, bounds checks when copying data from userland to the kernel. The grsecurity website also hosts a few papers about kernel security which might interest you.

Out of Band
  • I wish to note that on the general idea we have been in the same boat since the beginning: reducing complexity will increase security. I agree with you that there should be the possibility of running drivers in unprivileged containers (that was a new one for me), and I agree with you about a less complex kernel design. When thinking about this sandbox I do not purely have Linux in mind. I agree with you that hardening the system does increase security in a way that I want. Apart from that, I still think that reducing the attack surface will also *increase* security - albeit not closing all holes. – John Smith Mar 04 '17 at 09:35
  • Yes, agreed. Your question just seemed to take a view that I'd think was too narrow, but as _one component_ of a secure system, reducing the interface complexity of the kernel certainly is a good idea. This is the situation we have with hypervisors which have a much smaller hypercall interface. But note that the hypervisors can do that because they don't have as much work to do as a regular kernel, so obviously the interface they have to present to the world can be much smaller, too. – Out of Band Mar 04 '17 at 09:47
  • I think you could see a sandbox in a similar way couldn't you? It also makes sense to present a much smaller interface to whatever is running in it than a regular kernel. And apart from hardening the system I think you'd want untrusted code (like a website) in an environment that just simply gives the code a lot less abilities than your normal trusted apps, which can and should have a lot more control over the system in general, and that in a way that "is not circumventable" → hardware. – John Smith Mar 04 '17 at 10:06
1

An important part of the question that has not been addressed yet is graphics. I read up a lot more on the topic, and it turns out that it pretty much cannot be secured at the current state of the art.

For example, a process can simply read the data that a previously running process wrote to the GPU's memory.

The only possibility is to restrict access to the GPU altogether and provide some API/X-server-like thing to draw to the screen.
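As a sketch of what I mean (every name here is hypothetical): the sandboxed app never touches the GPU; it fills a plain pixel buffer and asks the supervisor to present it, and the supervisor decides where, and whether, it appears on screen:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct sb_surface sb_surface_t;  /* opaque, owned by the supervisor */

/* The supervisor allocates the surface and remembers its dimensions. */
sb_surface_t *sb_surface_create(uint32_t width, uint32_t height);

/* pixels: width*height 32-bit RGBA values, copied out of the sandbox;
 * the supervisor clips and composites them onto the region of the
 * screen it has assigned to this app. */
int sb_surface_present(sb_surface_t *s, const uint32_t *pixels, size_t count);

void sb_surface_destroy(sb_surface_t *s);
```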

John Smith