Although the details vary greatly between architectures, what I say here applies equally well to 32-bit x86 and 64-bit x86, and also to ARM and PowerPC: faced with the same issues, almost all architecture designers have used similar solutions.
There are (roughly speaking) four kinds of "accesses", at assembly level, which are relevant to the position-independent system: function calls (`call` opcodes) and data accesses, each of which may target either an entity within the same object (where an "object" is a shared object, i.e. a DLL, or the executable file itself) or an entity within another object. Data accesses to stack variables are not relevant here; I am talking about data accesses to global variables or static constant data (in particular the contents of what appears, at source level, to be literal character strings). In a C++ context, virtual methods are referenced through what are, internally, function pointers in special tables (called "vtables"); for the purposes of this answer, these are data accesses as well, even though a method is code.
The `call` opcode uses a target address which is relative: it is an offset computed between the current instruction pointer (technically, the first byte after the argument to the `call` opcode) and the call target address. This means that function calls within the same object can be fully resolved at (static) link time; they don't show up in the dynamic symbol tables, and they are "position-independent". On the other hand, function calls to other objects (cross-DLL calls, or calls from the executable file to a DLL) must go through some indirection which is handled by the dynamic linker. The `call` opcode must still jump "somewhere", and the dynamic linker wants to adjust it dynamically. The format tries to achieve two characteristics:
- Lazy linking: the call target is looked up and resolved only when it is first used.
- Shared pages: as much as possible, in-memory structures should be kept identical to the corresponding bytes in the executable files, to promote sharing across multiple invocations (if two processes load the same DLL, the code should be present only once in RAM) and easier paging (when RAM is tight, a page which is an unmodified copy of a chunk of data in a file can be evicted from physical RAM, since it can be reloaded at will).
Since sharing is done on a per-page basis, dynamically altering the `call` argument (the few bytes after the `call` opcode) should be avoided. Instead, the compiled code uses a Global Offset Table (GOT), or several (I am simplifying things a bit). Basically, the `call` jumps to a small piece of code which does the actual call, and which is subject to modification by the dynamic linker. All such small wrappers for a given object are stored together in pages which the dynamic linker will modify; these pages are at a fixed offset from the code, so the argument to `call` is computed at static link time and need not be modified afterwards. When the object is first loaded, all the wrappers point to a dynamic linker function which performs the linking upon first invocation; that function then modifies the wrapper itself to point to the resolved target, for subsequent invocations. The assembly-level juggling is intricate but works well.
Data accesses follow a similar pattern, but they don't have relative addressing. That is, a data access uses an absolute address. That address is computed in a register, which is then used for the access. The x86 line of CPUs can have the absolute address directly as part of the opcode; on RISC architectures, with fixed-size opcodes, the address will be loaded as two or three successive instructions.
In a non-PIE executable file, the target address of a data element is known to the static linker, which can hardcode it directly into the opcode which performs the access. In a PIE executable, or in a DLL, this is not possible, since the target address is not known before execution (it depends on the other objects which will be loaded in RAM, and also on ASLR). Instead, the binary code must use the GOT again. The GOT address is dynamically computed into a base register. On 32-bit x86, the base register is conventionally `%ebx`, and the following code is typical:
```
call nextaddress
nextaddress:
popl %ebx
addl $somefixedvalue, %ebx
```
The first `call` simply jumps to the next opcode (so the relative displacement here is just zero); since this is a `call`, it pushes the return address (which is also the address of the `popl` opcode) on the stack, and the `popl` extracts it. At that point, `%ebx` contains the address of the `popl`, so a simple addition modifies that value to point to the start of the GOT. Data accesses can then be done relative to `%ebx`.
So what is changed by compiling an executable file as PIE? Actually not much. A "PIE executable" is the main executable built as a DLL, then loaded and linked just like any other DLL. This implies the following:
- Function calls are unmodified.
- Data accesses from code in the main executable, to data elements which also are in the main executable, incur some extra overhead. All other data accesses are unaltered.
The overhead from data accesses is due to the use of a conventional register to point at the GOT: one extra indirection, one register used for this functionality (this impacts register-starved architectures like 32-bit x86), and some extra code to recompute the pointer to the GOT.
However, data accesses are already somewhat "slow" when compared with accesses to local variables, so compiled code already caches such accesses when possible (the variable value is kept in a register and flushed only when needed; and even when flushed, the variable address itself is kept in a register). This effect is reinforced by the fact that global variables are shared between threads, so most application code which uses such global data uses it in a read-only way (when writes are performed, they are done under the protection of a mutex, and grabbing the mutex incurs a much bigger cost anyway). Most CPU-intensive code works on registers and stack variables, and will not be impacted by making the code position-independent.
At most, compiling code as PIE will imply a size overhead of about 2% on typical code, with no measurable impact on code efficiency, so that's hardly a problem (I got that figure from discussing with people involved in the development of OpenBSD; the "+2%" was a problem for them in the very specific situation of trying to fit a barebone system on a boot floppy disc).
Non-C/C++ code may have trouble with PIE, though. When producing compiled code, the compiler must "know" whether it is for a DLL (or PIE) or for a non-PIE executable, in order to include the code chunks which find the GOT. There will not be many packages in a Linux OS which incur issues, but Emacs would be a candidate for trouble, with its Lisp dump-and-reload feature.
Note that code in Python, Java, C#/.NET, Ruby... is completely out of scope of all this. PIE is for "traditional" code in C or C++.