Saturday, November 28, 2009

Seekable tarballs

In the Linux world, the most common archive tool is tar, hands down. You take a bunch of files, you throw them in a tar file, and then you compress the tar file with your compressor of choice. GZIP is the most common, leading to a .tar.gz/.tgz extension. BZIP2 is also common (.tar.bz2), but I've played around with LZMA (.tar.lzma) and rzip (.tar.rz) as well.

I'm only going to talk about gzip/DEFLATE because that's the only general compression algorithm I've considered in this approach.

It occurred to me there's a way to make a tarball (and, indeed, any DEFLATE stream) seekable. It stems from the reason DEFLATE isn't really seekable in the first place; The actual encoding of the data depends on data that it's already seen, so you can't just peek at any one place in the stream and start decoding there without knowing what came earlier.

In the case of DEFLATE, there's a compressor state that keeps changing as the compressor processes more data, in order to improve the compressor's efficiency. That state represents how the compressor will encode the symbols it sees in the near future, and the symbols that it sees will cause it to change its state to be more efficient for the symbols following.

In the process of decompressing a DEFLATE stream, that same state gets reconstructed, referenced and updated as a decode key as the stream continues. One can't normally jump to the middle of a DEFLATE stream because one needs to have that state in the form that it would be by that point in the stream.

The solution is simple; Save off a copy of the compressor/decompressor state wherever you want a seek point. Keep an index containing the original data stream address, the deflate stream address, and the decompressor state one would have at those two points.

Put the index in a separate file, after the terminator indicator, or multiplex it using some sort of container format; I don't care. Point is, you have the option of resuming a decompression run that you never really started.

Yes, this harms the compression efficiency, in a way. To seek, you need that decompressor state, and that decompressor state will have a nonzero size. No worries; There are all kinds of ways to tune how large the index will be:

  • You could set an index size as a percentage of the resulting deflate size. After building N bytes of deflate stream data, you save off an index point of M bytes, where (M/(N+M)) is your target percentage.

  • If you know enough about your source data stream, you could be selective about where you put the seek points. For example, if you know that the source stream is a tar file, you can put a seek point at the beginning of each file in the tar archive.

  • You don't have to generate the seek index at all until you'd like to start examining the file. Most of the tools I've used that allow be to browse tarballs decompress the entire archive into a temporary file or directory. In such cases, generating the seek index as an initial step saves the disk cost of a temp file or directory, and the seek index can be kept for later reference.


Another interesting side effect of the seek index is that it allows parallel decoding. If the underlying I/O subsystem is up to the task, multiple chunks of the overall DEFLATE stream may be decompressed simultaneously, with each seek index point as a potential starting point.

No comments:

Post a Comment