
I've recently been learning about how ASLR (address space layout randomization) works on Linux. On Fedora and Red Hat Enterprise Linux, at least, there are two kinds of executable programs:

  • Position Independent Executables (PIEs) receive strong address randomization. Apparently, the location of everything is randomized, separately for each program. Network-facing daemons apparently ought to be compiled as PIEs (with the -fpie compiler flag and the -pie linker flag) to ensure they receive this full-strength randomization.

  • Other executables receive only partial address randomization. The executable's code segment is not randomized: it sits at a fixed, predictable address that is the same across Linux systems. Shared libraries, in contrast, are randomized: they are loaded at a random position that is the same for all such programs on the system.

I think I understand why non-PIE executables get the weaker form of randomization for shared libraries (it is needed for prelink, which speeds up linking and loading of executables). I also think I understand why the executable code segment of a non-PIE executable isn't randomized at all: randomizing the location of the code segment requires the program to be compiled as a PIE in the first place.

Still, leaving the location of the executable code segment un-randomized is potentially a security risk (for instance, it makes ROP attacks easier), so it would be good to understand whether it's feasible to provide full randomization for all binaries.

So, is there a reason not to compile everything as PIE? Is there a performance overhead to compiling as PIE? If so, how large is the overhead on different architectures, particularly on x86_64, where address randomization is most effective?


D.W.
  • Fedora [requires use of PIE](https://fedoraproject.org/wiki/Packaging:Guidelines?rd=Packaging/Guidelines#PIE) for long-running processes, or any process that runs as root or has suid binaries or capabilities. – Michael Hampton Sep 04 '13 at 04:16
  • I would rephrase this title as "What are the disadvantages of randomizing the executable code segment?". Because now most Linux distros do randomize it by default :-) e.g. Ubuntu 18.04: https://stackoverflow.com/questions/2463150/what-is-the-fpie-option-for-position-independent-executables-in-gcc-and-ld/51308031#51308031 – Ciro Santilli OurBigBook.com Jul 12 '18 at 14:26

3 Answers


Although the details vary greatly between architectures, what I say here applies equally well to 32-bit x86 and 64-bit x86, and also to ARM and PowerPC: faced with the same issues, almost all architecture designers have used similar solutions.


There are (roughly speaking) four kinds of "accesses", at assembly level, which are relevant to the "position-independent" system: there are function calls (call opcodes) and data accesses, and both may target either an entity within the same object (where an object is a "shared object", i.e. a DLL, or the executable file itself) or within another object. Data accesses to stack variables are not relevant here; I am talking about data accesses to global variables or static constant data (in particular the contents of what appears, at source level, to be literal character strings). In a C++ context, virtual methods are referenced by what are, internally, function pointers stored in special tables (called "vtables"); for the purposes of this answer, these are data accesses as well, even though a method is code.

The call opcode uses a target address which is relative: it is an offset computed between the current instruction pointer (technically, the first byte after the argument to the call opcode) and the call target address. This means that function calls within the same object can be fully resolved at (static) link time; they don't show up in the dynamic symbol tables, and they are "position-independent". On the other hand, function calls into other objects (cross-DLL calls, or calls from the executable file into a DLL) must go through some indirection which is handled by the dynamic linker. The call opcode must still jump "somewhere", and the dynamic linker must be able to adjust that target dynamically. The format tries to achieve two characteristics:

  • Lazy linking: the call target is looked for and resolved only when it is first used.
  • Shared pages: as much as possible, in-memory structures should be kept identical to the corresponding bytes in executable files, to promote sharing across multiple invocations (if two processes load the same DLL, the code should be present only once in RAM) and easier paging (when RAM is tight, a page which is an unmodified copy of a chunk of data in a file can be evicted from physical RAM, since it can be reloaded at will).

Since sharing is on a per-page basis, this means that dynamically altering the call argument (the few bytes after the call opcode) should be avoided. Instead, the compiled code uses a Global Offset Table (or several -- I simplify things a bit). Basically, the call jumps to a small piece of code which does the actual call, and is subject to modification by the dynamic linker. All such small wrappers, for a given object, are stored together in pages which the dynamic linker will modify; these pages are at a fixed offset from the code, so the argument to call is computed at static link time and need not be modified afterwards. When the object is first loaded, all the wrappers point to a dynamic linker function which does the linking upon first invocation; that function modifies the wrapper itself to point to the resolved target, for subsequent invocations. The assembly-level juggling is intricate but works well.

Data accesses follow a similar pattern, but they don't have relative addressing. That is, a data access will use an absolute address. That address will be computed within a register, which will then be used for the access. The x86 line of CPUs can have the absolute address directly as part of the opcode; on RISC architectures, with fixed-size opcodes, the address will be loaded as two or three successive instructions.

In a non-PIE executable file, the target address of a data element is known to the static linker, which can hardcode it directly in the opcode which does the access. In a PIE executable, or in a DLL, this is not possible since the target address is not known before execution (it depends on other objects which will be loaded in RAM, and also on ASLR). Instead, the binary code must use the GOT again. The GOT address is dynamically computed into a base register. On 32-bit x86, the base register is conventionally %ebx and the following code is typical:

    call nextaddress
nextaddress:
    popl %ebx
    addl $somefixedvalue, %ebx

The first call simply jumps to the next opcode (so the relative address here is just a zero); since this is a call, it pushes the return address (which is the address of the popl opcode) on the stack, and the popl extracts it. At that point, %ebx contains the address of popl, so a simple addition modifies that value to point to the start of the GOT. Data accesses can then be done relative to %ebx.


So what is changed by compiling an executable file as PIE? Actually not much. A "PIE executable" means making the main executable a DLL, and loading it and linking it just like any other DLL. This implies the following:

  • Function calls are unmodified.
  • Data accesses from code in the main executable, to data elements which also are in the main executable, incur some extra overhead. All other data accesses are unaltered.

The overhead from data accesses is due to the use of a conventional register to point at the GOT: one extra indirection, one register used for this functionality (this impacts register-starved architectures like 32-bit x86), and some extra code to recompute the pointer to the GOT.

However, data accesses are already somewhat "slow" compared with accesses to local variables, so compiled code already caches such accesses when possible (the variable's value is kept in a register and flushed only when needed; and even when flushed, the variable's address is also kept in a register). Moreover, global variables are shared between threads, so most application code that uses such global data only reads it (when writes are performed, they are done under the protection of a mutex, and grabbing the mutex incurs a much bigger cost anyway). Most CPU-intensive code works on registers and stack variables, and won't be impacted by making the code position-independent.

At most, compiling code as PIE implies a size overhead of about 2% on typical code, with no measurable impact on code efficiency, so that's hardly a problem (I got that figure from discussions with people involved in the development of OpenBSD; the "+2%" was a problem for them in the very specific situation of trying to fit a barebones system on a boot floppy disc).


Non-C/C++ code may have trouble with PIE, though. When producing compiled code, the compiler must "know" whether it is for a DLL or for a static executable, in order to include the code chunks which find the GOT. There won't be many packages in a Linux OS which run into issues, but Emacs would be a candidate for trouble, with its Lisp dump-and-reload feature.

Note that code in Python, Java, C#/.NET, Ruby... is completely out of scope of all this. PIE is for "traditional" code in C or C++.

Thomas Pornin
  • Fascinating. Thank you! So, bottom line: most packages could be compiled as PIE with essentially no impact? (A 2% increase in size, but that's minor; and only for C/C++ packages; and some packages might have to be blacklisted and compiled as non-PIE.) That's interesting. I wonder why distributions don't already do that.... – D.W. Sep 03 '13 at 18:07
  • Ubuntu has PIE activated for a [small list of packages](https://wiki.ubuntu.com/Security/Features#pie). They claim a 5% to 10% CPU overhead on 32-bit x86, which I don't completely believe. They also say that on x86-64 it will "eventually be made the default" after some more testing is done. These pages were last updated two years ago, so the idea may have dropped out of fashion. – Thomas Pornin Sep 03 '13 at 20:18
  • *Since sharing is on a per-page basis, this means that dynamically altering the call argument (the few bytes after the call opcode) should be avoided.* I don't quite follow what point you're making here. – lynks Sep 19 '13 at 10:54
  • 1
    @lynks: suppose that the same DLL is loaded by two processes, but not at the same address in each process. If an opcode must be modified, then it will have to be modified differently in the other process, and the whole page can no longer be shared: it must be duplicated. This contradicts one of the goal of shared objects, i.e. to save RAM. By regrouping all the addresses which must be adjusted in the GOT, only the GOT has to be duplicated, and the bulk of the code can be shared. – Thomas Pornin Sep 19 '13 at 10:58
  • 1
    Windows has another solution for that, which is that it will try to load each DLL at the same address for all process (the address is chosen randomly with ASLR, but remains fixed for all process and until the next boot or the next time the DLL is completely unloaded). This can trigger address space fragmentation issues, and Linux does not follow that road. – Thomas Pornin Sep 19 '13 at 11:00
  • Ahh I understand. I suppose the Windows method has a performance benefit, if they do indeed modify calls et al. in the code itself. Thanks again. – lynks Sep 19 '13 at 11:04
  • How does the system tell whether the target of a static call instruction, e.g., `call abc`, is a relative offset (in the PIC case) or an absolute address (in PDC)? Or does the Loader change all relative static targets to absolute addresses after relocating the text segment? – MEE May 17 '16 at 17:52
  • Each opcode always uses a relative or absolute address. It is hardwired that way in the CPU. – Thomas Pornin May 17 '16 at 18:07
  • Sorry if this sounds like a noob question, but what is the `offset` in `call offset` relative to? And who updates that base when the code is relocated? – MEE May 17 '16 at 18:23
  • 3
    Relative addressing is relative to the opcode's address itself. Details depend on the involved architecture, but in practice you can imagine it as follows: the opcode is encoded, in machine code, as xx yy yy yy yy, where "xx" is a byte that says "this is a `call` opcode", and the "yy" encode a value (here over four bytes). To get the target address, that four-byte value is to be added to the address of the "xx" byte itself. (Actual representation may differ but the idea is preserved.) – Thomas Pornin May 17 '16 at 18:36
  • x86-64 has RIP-relative addressing for static data, 32-bit x86 doesn't. That's a huge difference. 32-bit has to `call` a function that returns the return address to find out its own return address, then keep that in a register, tying up one of only 7 general-purpose registers (not including the stack pointer). Some specific asm details are mentioned in [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427)/. – Peter Cordes May 16 '18 at 03:08

One reason why some Linux distributions may be hesitant to compile all executables as Position-Independent Executables (PIE), so that the executable code is randomized, is concern about performance. The trouble with performance concerns is that people sometimes worry about performance even when it isn't an issue. So, it would be nice to have detailed measurements of the actual cost.

Fortunately, the following paper presents some measurements of the cost of compiling executables as PIE:

The paper analyzed the performance overhead of enabling PIE on a set of CPU-intensive programs (namely, the SPEC CPU2006 benchmarks). Since we expect this class of executables to show the worst performance overheads due to PIE, this gives a conservative, worst-case estimate of the potential performance impact.

To summarize the paper's main findings:

  • On 32-bit x86 architectures, the performance overhead can be substantial: an average slowdown of about 10% on the SPEC CPU2006 benchmarks (CPU-intensive programs), and up to a 25% slowdown for a few of the programs.

  • On 64-bit x86-64 architectures, the performance overhead is much smaller: an average slowdown of about 3% on the CPU-intensive programs. The overhead would likely be even less for many programs people actually use (as many programs are not CPU-intensive).

This suggests that enabling PIE for all executables on 64-bit architectures would be a reasonable step for security, and the performance impact is very small. However, enabling PIE for all executables on 32-bit architectures would be too costly.

D.W.

It's fairly obvious why position-dependent executables aren't randomized.

"Position dependent" simply means that at least some addresses are hardcoded. In particular, this may apply to branch addresses. Moving the base address of the executable segment moves all branch destinations as well.

There are two alternatives for such hardcoded addresses: either replace them by IP-relative addresses (so the CPU can determine the absolute address at runtime), or fix them up at load time (when the base address is known).

You of course need a compiler which can generate such executables.

MSalters
  • 1
    Yes. That's what I guessed in my question. Thank you for confirming it. The real question was at the end: why don't we compile everything as position-independent, then? – D.W. Sep 03 '13 at 17:59
  • @D.W. (Your comment would probably be a very good question on http://cs.stackexchange.com ). IP-relative branching isn't really welcomed on every CPU arch. IP-relative data structures take away a CPU register and turn every data access into indexed/indirect addressing. There are other problems as well, relating mainly to speed, but none of them is really hard. As far as I know, most code in current Linuxes is compiled PIC. – peterh Aug 19 '15 at 18:33
  • @D.W. O.k. Afaik, the difference between "position independent code" and "position independent executable" is similar to the difference between a "bikeshed" and a "cycle store". There is another, practical reason why distros didn't do it long ago: distros have continuous troubles with their upstream software developers. The upstream developers in most cases don't really like to work too much on their build scripts and often won't understand why different features (parallel build, better or configurable optimization flags, etc.) would be needed. – peterh Aug 19 '15 at 18:46
  • @D.W. And overwriting the build scripts in 40000 packages would be a tremendous amount of work, which would make the distros only a little bit better, but a lot buggier. In the case of the non-free world, as far as I know, nobody knows (or is able to influence) how the different software is compiled, but I think for most professional software vendors, the difference between PIC and non-PIC code is considered totally unimportant. – peterh Aug 19 '15 at 18:48