Why Your Shared Library Code Is Slow
This article explains why your code is slower when you compile it as a dynamic shared object (a DSO, such as a DLL or a shared library).
Position Independent Code
When you compile a shared library (with -fPIC, in the case of gcc), you are specifying that you want the compiler to generate position independent code.
Position independent code never uses straightforward machine addressing modes to access global variables, static variables, or non-static functions. Instead, position independent code accesses these kinds of data indirectly.
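As a rough illustration of which accesses are affected, here is a minimal sketch of a library source file. The file name counter.c and every identifier in it are hypothetical and chosen only for this example. The comments mark the accesses that go through the indirections described in the rest of this article.

```c
/* counter.c: a hypothetical library source, used only for illustration. */

/* Exported global variable. In position independent code its address
 * is not baked into the instructions; it is looked up indirectly. */
int call_count = 0;

/* File-local (static) function. Only non-static functions are called
 * through the indirect path, so a call to this helper is not. */
static int twice(int x)
{
    return 2 * x;
}

/* Exported (non-static) function. Callers reach it indirectly. */
int counted_double(int x)
{
    call_count++;     /* global access: the address is fetched first */
    return twice(x);  /* call to a static function */
}
```

Assuming gcc on an ELF system, the file could be built with `gcc -fPIC -shared counter.c -o libcounter.so`, and the generated instructions inspected with `objdump -d libcounter.so`.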
PLTs and the GOT
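PLTs. To invoke a non-static function, position independent code uses a so-called procedure linkage table, or PLT. A PLT is a section of the address space that holds short sequences of instructions called stub functions. There is one stub per non-static function. Calls to non-static functions are transformed by the compiler to calls of the stub functions in the PLT. On some platforms the dynamic loader patches the stubs themselves, so the PLT must be both writable and executable. On others, such as x86 ELF systems, the stubs are read-only and jump through an address slot that the loader updates.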
GOT. To access a global variable or a static variable, position independent code uses a so-called global offset table, or GOT. A GOT is a table of address slots, one for each global variable or static variable referred to by the program or library in which the GOT appears. The symbol names recorded alongside the table tell the dynamic loader, at link-time (which is load-time, for shared libraries), which address to store in each slot.
Symbol Resolution and Lazy Binding
The operating system's dynamic loader initialises PLT entries at load-time (when a program that uses the shared library is exec'd). Their initial values are a sequence of instructions that invokes the run-time linker (the dynamic loader) to resolve the address of the function to which the PLT entry will indirect.
To resolve an entry means to find the actual address for the entry's symbol. The address for the resolved entry is then cached.
The operating system's dynamic loader resolves the PLT entries when a program that is linked to your shared library first uses the symbol in the PLT, and not before. This procedure is called lazy binding. Lazy binding is typically the default mode of operation for non-static function invocation in shared libraries.
GOT entries are not bound lazily. The operating system's dynamic loader resolves the GOT entries at load time. When a global variable or static variable is referenced, it is done via the resolved GOT entries in the invoking code.
Mechanisms for Indirection
Resolved PLT Entry
To invoke a function via the PLT, as compared to invoking a function directly, there are many more instructions to execute, and there is a change in the locality of the instruction stream. Here's why.
There are two cases to consider: the case in which the PLT entry has already been resolved, and the case in which the PLT entry has not yet been resolved (the first access of the function).
First, let us consider the case in which the PLT entry has already been resolved. This is the most common case. The calling code simply calls the PLT entry as if it were a function. The resolved PLT entry is a jump instruction together with the address of the non-static function that is actually to be executed. This small sequence of instructions was put in place when the PLT entry was bound (lazily). So, to call a non-static function in position independent code, the CPU first fetches the instructions of the PLT entry, which lie outside the calling code, and those instructions then jump to the address of the desired function. The pseudo-assembly code might look like this:
    ...
    call pltentryfor_foo    ; calling foo() indirectly
    ...
PLT:
    ...
pltentryfor_foo:
    jmp foo
pltentryfor_bar:
    jmp bar
    ...
Remember, those code sequences in the PLT do not exist in your shared library. Only the addresses of those code sequences (relative to the calling code) are known at compile time.
Unresolved PLT Entry
Now, let us consider the case in which the PLT entry has not been resolved. This happens when the symbol is first used at run-time. Lazy binding pays off when no code that uses the symbol is ever invoked: in that case, the resolution cost is never paid.
The initial value of any PLT entry is a sequence of instructions which invokes the run-time linker. The run-time linker resolves the real function address and stores it in the PLT entry, so that the next time the PLT entry is invoked, it will call the desired function. This resolution process, performed by the run-time linker, is very costly in comparison to the cost of a simple direct call. The cost is paid only upon the first invocation of the function.
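The resolve-once-then-cache behaviour can be modelled in plain C with a function pointer. This is only a simulation under stated assumptions, not how a real dynamic loader is implemented. The names square_slot, lookup, symbol_table, resolver_calls, and call_square are all hypothetical. The pointer plays the role of the PLT entry, and the lookup function stands in for the run-time linker's symbol search.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*func_ptr)(int);

/* The "real" function that the caller ultimately wants to run. */
static int real_square(int x) { return x * x; }

/* Counts how often the slow resolution path runs. */
int resolver_calls = 0;

/* Toy symbol table standing in for the run-time linker's lookup. */
struct symbol { const char *name; func_ptr addr; };
static const struct symbol symbol_table[] = { { "square", real_square } };

static func_ptr lookup(const char *name)
{
    size_t n = sizeof symbol_table / sizeof symbol_table[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(symbol_table[i].name, name) == 0)
            return symbol_table[i].addr;
    return NULL;
}

static int square_unresolved(int x);

/* Plays the role of the PLT entry: it starts out pointing at the
 * resolver stub rather than at the real function. */
static func_ptr square_slot = square_unresolved;

/* First call only: resolve the symbol, cache the address, then call. */
static int square_unresolved(int x)
{
    resolver_calls++;
    func_ptr target = lookup("square");  /* the expensive step */
    if (target == NULL)
        abort();                         /* unresolved symbol */
    square_slot = target;                /* later calls skip the resolver */
    return target(x);
}

/* What calling code does: always one indirection through the slot. */
int call_square(int x)
{
    return square_slot(x);
}
```

In this model the first call_square(3) runs the resolver and every later call goes straight through the cached pointer, so resolver_calls stays at 1 however many calls follow.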
Accesses of global variables and static variables are via the GOT. For each access of such a variable, the compiler generates code to load the appropriate GOT entry. The entry is the address of the desired variable. With that address, the code will then access the value of the desired variable. As compared to non-PIC, this method results in extra code to do the extra load, and then an additional data access to perform the extra load itself.
The generated code does not know the exact address of the GOT at compile time, but it knows the address of the GOT relative to its own address. Code fragments like the following pseudo-assembly language are common in IA32 PIC:
    call __i686.get_pc_thunk.bx              ; ebx = address of the next instruction
    add ebx, GLOBAL_OFFSET_TABLE             ; ebx = address of the GOT
    mov eax, [ebx+the_offset_into_the_GOT]   ; eax = address of the global
The __i686.get_pc_thunk.bx function simply puts the value of the PC (its own return address) into the EBX register. To that is added the (possibly negative) offset of the GOT relative to the currently executing code. The result is the address of the GOT in register EBX. Next, the value of the GOT entry with the index corresponding to some desired global variable is loaded into register EAX. After this sequence of instructions, the actual value of the global still has not been fetched. Only the global's address is known. An additional load instruction is required to obtain the value of the global.
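The extra load can be made visible with a small C model. It is a sketch only: the array named got, the index constants, and the two variables are hypothetical, and a real GOT is filled in by the dynamic loader rather than by a static initialiser as it is here.

```c
/* Two "global variables" that the code wants to read. */
int shared_limit = 100;
int shared_total = 7;

/* Stand-in for the GOT: one address slot per global variable. */
enum { GOT_SHARED_LIMIT, GOT_SHARED_TOTAL, GOT_SIZE };
static int *got[GOT_SIZE] = { &shared_limit, &shared_total };

/* Non-PIC style: the variable's address is known, so one load suffices. */
int read_total_direct(void)
{
    return shared_total;
}

/* PIC style: find the table, load the slot, then load the value. */
int read_total_via_got(void)
{
    int **got_base = got;                       /* locate the GOT */
    int *address = got_base[GOT_SHARED_TOTAL];  /* load the entry (an address) */
    return *address;                            /* load the value itself */
}
```

Both functions return the same value. The second simply spells out the two dependent loads that the PIC instruction sequence performs.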
Note that these indirect function accesses happen even when the calling functions are within the same shared library! This is because users of the shared library are permitted to override a symbol (function name or global variable) with their own versions. This is true for ELF, but not necessarily for all DSO formats. For functions, this behaviour can be overridden with the -Bsymbolic-functions linker flag, which one often sees passed to the linker through a gcc command-line switch.
If the magnitude of run-time overhead for function invocations and global variable references discussed above reminds you of the overheads associated with interpreted languages and virtual machines, then you understand. DSOs are trading performance for the various benefits of run-time linking. Many would suggest that the benefits of PIC do not outweigh the performance costs.
This run-time overhead might be one reason why your DSO is slow.