LzmaFlow is re-factored code from lzmaSDK, specifically the decompressor function LzmaDecode() from lzma443.tar.bz2. The modifications make the code easier to understand and easier to adapt for specific purpose. The changes introduce a separate variable for each unique data flow, encapsulate the probability state machine in a single small function, and avoid torturous macro syntax. Inserting or deleting a single inline keyword moves between fastest and smallest code. Applying minor tweaks by hand to the assembly language generated by gcc can produce object code that is simultaneously fastest and smallest. The author of the original lzma443 expressed little interest in the modifications.
Version 0.2 adds hand-compiled code for several architectures: i386, amd64 (x86_64), ARM (and Thumb), 32-bit PowerPC, and MIPS R3000. In general the hand-compiled code is a factor of 2.5 to 3 smaller, while executing nearly as fast as the gcc-compiled original C code from the lzmaSDK-443. The conditional compilation features _LZMA_IN_CB and _LZMA_OUT_READ are not supported by this version of the hand-compiled source. The hand-compiled assembly-language files are licensed under the GNU Lesser General Public License version 2.1 (GNU LGPL) or the Common Public License v 1.0 (CPL).
Performance: In general the goal of hand compiling was smallest code size. However, a few adjustments were made to increase speed as long as the space increase was small. A typical adjustment was to align the beginning of the main loop at an instruction cache line boundary.
The test case for speed is three intervals of uncompressed size 0x980, 0x8c0, 0xcaf8; and compressed size 0x5f5, 0x584, 0x4c9f, respectively. The measurement is the time taken to uncompress them 1000 times, generating a total of 56,632,000 bytes. Your Mileage May Vary. Note that the hardware caches are >=64KB except on ARM, which is 32KB.
In the table below, the styles "fast" and "size" designate the gcc-compiled versions of the original lzmaSDK files LzmaDecode.c and LzmaDecodeSize.c, respectively. The other styles "hand", "tiny", and "thumb" designate code that was hand-compiled from LzmaDecodeFlow.c, which is a re-written version of LzmaDecode.c.
style | .text | seconds | output rate |
---|---|---|---|
amd64 (2.010GHz Athlon-64 3200+ family 15 model 47; gcc 4.1.1) | |||
fast | 3132 | 3.06 | 18.5 MB/s |
hand | 1089 | 3.13 | 18.1 MB/s |
size | 2755 | 3.99 | 14.2 MB/s |
i386 (2.010GHz Athlon-64 3200+ family 15 model 47 in i686 mode; gcc 4.1.1) | |||
fast | 3409 | 3.26 | 17.4 MB/s |
tiny | 983 | 3.46 | 16.4 MB/s |
hand | 1033 | 3.56 | 15.9 MB/s |
size | 2517 | 4.46 | 12.7 MB/s |
PowerPC32 (1.000GHz PowerMac G4 model 7455; gcc 4.1.1) | |||
fast | 3048 | 3.85 | 14.7 MB/s |
inline | 3900 | 3.89 | 14.6 MB/s |
hand | 1260 | 4.72 | 12.0 MB/s |
size | 3168 | 8.14 | 6.96 MB/s |
i386 (1.614GHz Pentium4 family 15 model 2; gcc 4.0.2) | |||
hand | 1033 | 9.12 | 6.21 MB/s |
tiny | 983 | 9.17 | 6.18 MB/s |
fast | 2913 | 9.20 | 6.16 MB/s |
size | 2132 | 11.03 | 5.13 MB/s |
MIPS R3000 (199MHz BCM3302 v0.8; gcc 3.4.4) | |||
fast | 3068 | 47.3 | 1.197 MB/s |
hand | 1336 | 49.9 | 1.135 MB/s |
size | 3064 | 90.7 | 0.624 MB/s |
ARM (133MHz XScale iXP42x Family rev 1 (v5l); gcc 4.1.2) | |||
hand | 1156 | 66.6 | 0.850 MB/s |
fast | 3300 | 71.1 | 0.797 MB/s |
thumb | 880 | 93.2 | 0.608 MB/s |
size | 2248 | 128.4 | 0.441 MB/s |
Comments: The PowerPC instruction decoder slows down when the size of a basic block is too small. The 'inline' style shows the speed of the gcc-compiled C code when adding inline to the declaration of rcGetBit().
The download file http://BitWagon.com/LzmaFlow/LzmaFlow-0.2.tgz (212KB) includes the modified LzmaDecodeFlow.c, the hand-compiled assembly language source files, and the original source lzma443.tar.bz2 from http://sourceforge.net/projects/sevenzip/.