[Compress] BZip2 file object size?

garydgregory
Hi All,

BZip2FileObject does not implement doGetContentSize() and always returns
-1, which causes VFS to blow up if you try to read. Can this kind of
content only be streamed?

Gary

Re: [Compress] BZip2 file object size?

Stefan Bodewig
On 2019-10-18, Gary Gregory wrote:

> BZip2FileObject does not implement doGetContentSize() and always returns
> -1, which causes VFS to blow up if you try to read. Can this kind of
> content only be streamed?

First, an "I'm not an expert in the bzip2 file format" disclaimer.

From what I can tell, the file format does not record the uncompressed
size anywhere.

BZip2 files consist of a series of blocks, each of which holds the result
of compressing a multiple of 100000 uncompressed bytes. The multiple
(the block size, a number between 1 and 9) is part of the metadata. All
blocks except for the last have compressed the same number of original
bytes.
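
To make the block-size metadata concrete, here is a minimal sketch that
reads the multiple from the start of a stream. It assumes the usual
4-byte stream header ("BZh" followed by an ASCII digit from '1' to '9');
the class and method names are made up for illustration and are not part
of Commons Compress or VFS.

```java
final class BZip2Header {
    /** Bytes per block-size unit, as described in the thread. */
    static final int UNIT = 100_000;

    /**
     * Returns the block-size multiple (1-9) stored in the stream header.
     * Assumes the header is the 4 bytes 'B', 'Z', 'h', then an ASCII digit.
     */
    static int readLevel(byte[] fileStart) {
        if (fileStart == null || fileStart.length < 4
                || fileStart[0] != 'B' || fileStart[1] != 'Z' || fileStart[2] != 'h') {
            throw new IllegalArgumentException("not a bzip2 stream");
        }
        int level = fileStart[3] - '0';
        if (level < 1 || level > 9) {
            throw new IllegalArgumentException("invalid block size digit");
        }
        return level;
    }

    /** Nominal number of uncompressed bytes per full block for a given multiple. */
    static int blockSizeBytes(int level) {
        return level * UNIT;
    }
}
```

For a file written with the highest setting, the fourth byte is '9' and
the nominal block size comes out as 900000 bytes.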

So you could count the blocks for an estimate and uncompress the last
block for the exact uncompressed size, but in the end you have to
uncompress at least some part of the content to get the uncompressed
size.
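
A rough sketch of both options follows: bounds derived from a block
count, and an exact count obtained by draining an already-decompressing
stream. The bounds take the description above at face value (every block
but the last holds exactly the nominal number of bytes). Real encoders
may fill blocks differently (I believe bzip2 applies a run-length pass
before splitting into blocks), so treat them as an estimate only. The
class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class UncompressedSize {
    static final long UNIT = 100_000L;

    /** Lower bound: all blocks full except a last block holding one byte. */
    static long minEstimate(long blockCount, int level) {
        if (blockCount <= 0) {
            return 0;
        }
        return (blockCount - 1) * level * UNIT + 1;
    }

    /** Upper bound: every block, including the last, is full. */
    static long maxEstimate(long blockCount, int level) {
        if (blockCount <= 0) {
            return 0;
        }
        return blockCount * level * UNIT;
    }

    /** Exact size: read the decompressed stream to the end and count bytes. */
    static long countBytes(InputStream decompressed) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = decompressed.read(buffer)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }
}
```

In practice the stream passed to countBytes would be something like a
BZip2CompressorInputStream from Commons Compress wrapped around the raw
file, which means the whole file gets decompressed once just to learn
its size.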

Also, I believe blocks don't need to start on byte boundaries, so even
counting the blocks will be a bit more tricky. There are parallel
implementations of bzip2 in the Hadoop ecosystem (uncompressing blocks
in parallel) which must have solved this part, though.
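
One way to deal with the missing byte alignment is to slide a 48-bit
window over the data one bit at a time and look for the block header
magic. The sketch below assumes the magic is 0x314159265359 and that
bits are packed most significant bit first. It is not taken from Hadoop
or Commons Compress, and the names are made up. The same bit pattern can
also occur by chance inside compressed data, so the offsets it returns
are only candidates that still need validating.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockMagicScanner {
    /** Assumed 48-bit block header magic. */
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offsets at which the 48-bit magic starts. */
    static List<Long> findCandidates(byte[] data) {
        List<Long> offsets = new ArrayList<>();
        long window = 0;   // the most recent 48 bits seen
        long bitsSeen = 0; // total number of bits consumed so far
        for (byte b : data) {
            for (int bit = 7; bit >= 0; bit--) {
                // Shift the next bit in, most significant bit of each byte first.
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MAGIC) {
                    offsets.add(bitsSeen - 48);
                }
            }
        }
        return offsets;
    }
}
```

The number of candidates gives a block count for the estimate discussed
earlier, subject to the false-positive caveat.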

Stefan
