tub -- evade TASK_UNMAPPED_BASE

for large dynamic arrays with shared libs under Linux on x86

February 18, 2005

homepage: http://BitWagon.com/tub/tub.html
download: http://BitWagon.com/tub/tub-0.98.tgz (21KB; GPLv2)
author: jreiser BitWagon com

If an app wants to use large dynamic arrays (up to 2.8GB or so), then until recently the default behavior of many distributions of Linux on x86 has forced the app to be linked statically, forgoing dynamic linking and shared libraries.  When dynamic linking is used, the first mmap(0, ...) gets assigned to a fixed address TASK_UNMAPPED_BASE, which is (TASK_SIZE / 3) in linux/include/asm-i386/processor.h.  Typically TASK_SIZE is 3GB (0xc0000000), so TASK_UNMAPPED_BASE is 1GB (0x40000000), which leaves something less than 2GB as the largest chunk of contiguous address space for dynamic user data.  Although it is nearly trivial to make task_unmapped_base part of inherited per-process state controlled by setrlimit() and getrlimit(), Linus has not done so.  Even if Linus' source were changed tomorrow, it would still be a year or two before an application could reasonably rely on this feature being present in an arbitrary installation.  Hardware with 64-bit address space is still somewhat expensive and uncommon.
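
For a concrete picture of the default placement, the following demonstration program (not part of tub) prints where the kernel puts the first anonymous mapping.  Built as an ordinary dynamically-linked program on a stock 3GB/1GB x86 kernel, the address typically comes back at or just above 0x40000000:

/* demo: where does the kernel place an anonymous mmap(0, ...)? */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    void *p = mmap(0, 1<<20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("mmap(0, 1MB, ...) = %p\n", p);  /* typically near TASK_UNMAPPED_BASE */
    return 0;
}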

tub is a user-mode hack which works around the problem on today's Linux for x86.  When linked into the application, tub intercepts all mmap(0, ...), chooses from its master list the page frames to be used, and calls the kernel with mmap(frame_address,,,MAP_FIXED,,).  In effect, tub allows the application to set task_unmapped_base to any address.  Setting task_unmapped_base = brk(0); allows for maximum contiguous address space.  Shared libraries and dynamic linking can still be used because tub intercepts even the mmap() calls performed by the dynamic linker ld-linux.so.2.  Prelinked shared libraries work OK: the prelinking is ignored, and the library is mapped into the page range with the lowest available addresses.  As long as the .so was compiled with -fpic, readonly pages are shared just as much as before.  Pages with relocations (including the _GLOBAL_OFFSET_TABLE_) probably are shared less than usual, because the prelinked relocated values must be relocated again.
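
The following is a minimal sketch (not tub's actual source; intercepted_mmap and tub_base are hypothetical names) of the substitution just described: a request that would let the kernel choose the address is redirected to the lowest frames available, starting from a base equal to brk(0).  The real tub edits the arguments in place before the int $0x80 and tracks frames in a bitmap rather than with a bump pointer:

#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

static char *tub_base;   /* hypothetical: plays the role of task_unmapped_base */

void *intercepted_mmap(void *addr, size_t len, int prot, int flags,
                       int fd, off_t off)
{
    if (tub_base == 0)
        tub_base = sbrk(0);          /* task_unmapped_base = brk(0) */
    if (addr == 0) {                 /* the kernel would have chosen */
        addr = tub_base;             /* use the lowest available frames instead */
        flags |= MAP_FIXED;
        tub_base += (len + 4095) & ~(size_t)4095;   /* round up to whole pages */
    }
    return mmap(addr, len, prot, flags, fd, off);
}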

tub consists of about 2.8KB of compiled code written in C and assembler, plus some link-time scripts.  The link-time scripts make the app look like it has no PT_INTERP, and change the entry point to be inside tub code.  Because the on-disk app has no PT_INTERP, execve() starts the process at Elf32_Ehdr.e_entry instead of at the entry to the program interpreter /lib/ld-linux.so.2.  Upon entry at runtime, tub changes AT_ENTRY to _start, reverts the current process image to having a PT_INTERP, and maps the program interpreter itself.  tub arranges to intercept all calls from the program interpreter to mmap/mmap64/munmap, restores the stack, and then jumps to the entry point of the program interpreter.  Any successful mmap/mmap64 with PROT_EXEC, MAP_PRIVATE, and !MAP_ANONYMOUS is scanned for further instances of mmap/mmap64/munmap, which are also intercepted.
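
As an illustration of the AT_ENTRY step, here is a minimal sketch, assuming 32-bit x86.  tub does this on the initial stack before control ever reaches ld-linux.so.2; this fragment merely shows how to locate and rewrite the entry in the auxiliary vector from main():

#include <elf.h>
#include <stdint.h>
#include <stdio.h>

extern char _start[];    /* the application's true entry point */

int main(int argc, char **argv, char **envp)
{
    (void)argc; (void)argv;
    char **p = envp;
    while (*p)
        ++p;                                    /* skip past the environment */
    Elf32_auxv_t *av = (Elf32_auxv_t *)(p + 1); /* auxv follows the NULL */
    for (; av->a_type != AT_NULL; ++av)
        if (av->a_type == AT_ENTRY) {
            printf("AT_ENTRY was %#lx\n", (unsigned long)av->a_un.a_val);
            av->a_un.a_val = (uint32_t)_start;  /* what tub substitutes */
        }
    return 0;
}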

For example, the interception for mmap looks like:
mmap:  # as in ld-linux.so.2 or libc.so.6
    mov    %ebx,%edx            # save %ebx (preserved register) in a scratch
    mov    $0x5a,%eax           # __NR_mmap: old_mmap takes a pointer to 6 args
    lea    0x4(%esp,1),%ebx     # %ebx = address of the argument block on the stack
    int    $0x80   # or call *%gs:0x10
    mov    %edx,%ebx            # restore %ebx
    cmp    $0xfffff000,%eax     # return values -4095..-1 indicate errors
    ja     error
    ret
mmap:  # as rewritten by tub during execution
    mov    %ebx,%edx
    call   __pre_mmap           # 5 bytes in place of the mov; may rewrite the args, must load %eax=$0x5a
    lea    0x4(%esp,1),%ebx
    int    $0x80  # or call *%gs:0x10
    mov    %edx,%ebx
    call   __post_mmap          # 5 bytes in place of the cmp; sets the flags so the ja still works
    ja     error
    ret

Each intercepting call takes 5 bytes, the same as the overwritten mov and cmp.  The assembly-language routines __pre_mmap and __post_mmap handle scratch register contents and processor flags, then call corresponding C-language routines tub_pre_mmap and tub_post_mmap.  By taking care with the subroutine linkage conventions (arguments on stack [by value-result] and in registers, and return value), everything just fits.  tub_pre_mmap looks for argument values that should be changed, consults a bitmap of free pages, changes the addr to be the desired frame, and ORs  MAP_FIXED into flags.
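
A minimal sketch of that argument rewriting (not tub's actual source; the names, the signature, and the first-fit search are assumptions made for illustration):

#include <stddef.h>
#include <sys/mman.h>

#define PAGE_SHIFT 12
#define NFRAMES (0xC0000000u >> PAGE_SHIFT)   /* 3GB of user address space */

static unsigned char frame_bitmap[NFRAMES / 8];   /* bit set = frame is free */

static int frame_is_free(unsigned f)
{
    return frame_bitmap[f >> 3] & (1u << (f & 7));
}

static unsigned find_free_run(size_t npages)   /* first fit; 0 means failure */
{
    unsigned f, run = 0;
    for (f = 0; f < NFRAMES; ++f) {
        run = frame_is_free(f) ? run + 1 : 0;
        if (run == npages)
            return f + 1 - npages;
    }
    return 0;   /* frame 0 (the zero page) is never handed out */
}

/* Reached via the 5-byte 'call __pre_mmap' patch.  The parameters are the
 * very words on the stack that old_mmap will read, so editing them here
 * (value-result) changes the request the kernel sees. */
void tub_pre_mmap(unsigned long *addr, size_t *len, int *prot,
                  int *flags, int *fd, unsigned long *offset)
{
    (void)prot; (void)fd; (void)offset;
    if (*addr == 0 && !(*flags & MAP_FIXED)) {
        size_t npages = (*len + (1u << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
        unsigned frame = find_free_run(npages);
        if (frame != 0) {
            *addr = (unsigned long)frame << PAGE_SHIFT;
            *flags |= MAP_FIXED;    /* force placement at the chosen frame */
        }
    }
}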

As of 2004-04-24, tub has been enhanced to work with glibc-2.3.2 and NPTL, and with ld-linux.so.2 "over-mapping" an executable file by including .bss in the first mmap (in order to guarantee address-space reservations when there are "holes" in the new PT_LOAD or the existing address space).  Also, version 0.94 fixed bugs in handling mmap64() and exec-shield (random placement by the Linux kernel of individual mmap() requests that do not specify MAP_FIXED).  Version 0.95 (2005-02-05) handles mremap(), and accommodates some quirks of gcc 3.3.1-2mdk and the #include files of kernel-2.6.8.1-12mdk.  Version 0.96 (2005-02-16) fixes a SIGBUS that happened with some modules such as libpthread-0.10.so which have a large .bss.  Version 0.97 (2005-02-18) makes tub more robust by removing some dependencies on the particular code generated by differing versions of gcc.  Version 0.98 (2008-07-16) adapts to the evolution of elf.h and Linux 2.6.24.

Detecting the body of mmap/mmap64/munmap in newly-mapped pages is heuristic and not as robust as it could be.  The allocator for page frames is multi-thread safe and somewhat efficient; it uses a spin wait during thread-to-thread contention.  The allocator also detects re-entrant use by a signal handler.  In theory such a situation can be handled, but the complexity is not worth it, so the current implementation gives a message on stderr and aborts.  Of course, doing an explicit mmap (or any system call) in a signal handler is a dubious idea.  However, *printf() buffering typically uses mmap.  So, establish buffering (or no buffering) by calling setbuf, setbuffer, setlinebuf or setvbuf for the FILE before enabling the handler.
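
One way to follow that advice (an ordinary example, not tub code):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static char outbuf[BUFSIZ];

static void handler(int sig)
{
    (void)sig;
    fputs("caught signal\n", stdout);   /* still dubious, but no mmap needed */
}

int main(void)
{
    /* give stdout its buffer now, before any handler can run, so the first
       *printf() cannot trigger an mmap() from inside a handler */
    setvbuf(stdout, outbuf, _IOLBF, sizeof outbuf);

    signal(SIGALRM, handler);           /* enable the handler only afterwards */
    alarm(1);
    pause();
    printf("done\n");
    return 0;
}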