[compress] Do we want 7z Archive*Stream-like classes

[compress] Do we want 7z Archive*Stream-like classes

Stefan Bodewig
Hi all,

over this weekend I added 7z support to the compress antlib which I also
like to use as a second testbed for Commons Compress - I even found a
bug for archives that only contain empty directories.

The antlib is based on the interface provided by Archive*Stream even
when it is not using any streams at all, so I added
SevenZ(In|Out)putStreams that only work on files and delegate all calls
to the corresponding SevenZ(Out)File[1].  They are not streams at all.
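
A minimal sketch of that kind of delegation, for illustration only. FileBackedArchive is a hypothetical stand-in for a file-based reader such as SevenZFile; it is not the Commons Compress or antlib API, and all names below are assumptions.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stand-in for a file-based archive reader (assumption, not a real API). */
interface FileBackedArchive extends Closeable {
    /** Advances to the next entry and returns its name, or null when there are no more. */
    String nextEntryName() throws IOException;

    /** Reads from the current entry; returns -1 at the end of that entry. */
    int read(byte[] buf, int off, int len) throws IOException;
}

/** Has the shape of a stream, but every call is forwarded to the file-based reader. */
final class DelegatingArchiveInputStream extends InputStream {
    private final FileBackedArchive archive;

    DelegatingArchiveInputStream(FileBackedArchive archive) {
        this.archive = archive;
    }

    String getNextEntryName() throws IOException {
        return archive.nextEntryName();
    }

    @Override
    public int read() throws IOException {
        byte[] one = new byte[1];
        int n = archive.read(one, 0, 1);
        return n < 1 ? -1 : one[0] & 0xFF;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        return archive.read(b, off, len);
    }

    @Override
    public void close() throws IOException {
        archive.close();
    }
}
```

Nothing here wraps another InputStream, which is why such a class can only be constructed from a file and not from arbitrary stream input.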

Would those classes be useful inside of Commons Compress or should they
better be kept out as they'd promise more than they can deliver?

[1] http://svn.apache.org/repos/asf/ant/antlibs/compress/trunk/src/main/org/apache/ant/compress/util/SevenZStreamFactory.java

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [compress] Do we want 7z Archive*Stream-like classes

Torsten Curdt-3
Hm - it is indeed a little misleading. So I am +0 for an inclusion.
Is a stream based implementation of 7z somewhat feasible - at least in
theory?

cheers,
Torsten



Re: [compress] Do we want 7z Archive*Stream-like classes

Stefan Bodewig
On 2013-09-29, Torsten Curdt wrote:

> Hm - it is indeed a little misleading. So I am +0 for an inclusion.

This is what I feel as well.

> Is a stream based implementation of 7z somewhat feasible - at least in
> theory?

I'm in no way as familiar with the format as Damjan is but IMHO it is
feasible - but likely pretty memory hungry.  Even more so for the
writing side.  Similar to zip some information is stored in a central
place but in this case at the front of the archive.

Stefan


Re: [compress] Do we want 7z Archive*Stream-like classes

Benedikt Ritter-4
2013/9/30 Stefan Bodewig <[hidden email]>

> On 2013-09-29, Torsten Curdt wrote:
>
> > Hm - it is indeed a little misleading. So I am +0 for an inclusion.
>
> This is what I feel as well.
>
> > Is a stream based implementation of 7z somewhat feasible - at least in
> > theory?
>
> I'm in no way as familiar with the format as Damjan is but IMHO it is
> feasible - but likely pretty memory hungry.  Even more so for the
> writing side.  Similar to zip some information is stored in a central
> place but in this case at the front of the archive.
>

Hi Stefan,

just out of curiosity: is this memory problem related to Java or to 7z in
general?

Benedikt




--
http://people.apache.org/~britter/
http://www.systemoutprintln.de/
http://twitter.com/BenediktRitter
http://github.com/britter

Re: [compress] Do we want 7z Archive*Stream-like classes

Bernd Eckenfels
Hello,

I think it is not related to Java, but is a general problem of some file
formats with regard to streaming access.

If a format needs seeking/random access there are basically three options
(with the Java classes but also in other languages). The first is having a
random access file (which in this context means you write the stream to a
temp file and work on it). The second is doing the buffering in memory
(mark/reset style). This might be a problem if you have to read from the
end of the file, as you need to keep everything in between in memory. The
third option would be to allow the provided input stream to be opened
multiple times (either by providing some form of "data source" or by
supporting clone/reset on the input stream). (Another option would be a
random access-like buffer, but the amount of work to do that might not be
worth it as you can easily use a temp file.)
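
As an illustration of the first option, here is a rough sketch that spools an arbitrary InputStream to a temp file and hands back a seekable channel. It uses only java.nio.file calls from the JDK; the class name TempFileSpooler and the "spool-" prefix are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: turns a forward-only stream into a seekable channel via a temp file. */
final class TempFileSpooler {
    private TempFileSpooler() {
    }

    /** Copies the whole stream to a temp file and returns a read-only, seekable channel on it. */
    static SeekableByteChannel spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".bin");
        try {
            // createTempFile already created the file, so the copy has to replace it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            // DELETE_ON_CLOSE asks the JDK to remove the temp file when the channel is closed.
            return Files.newByteChannel(tmp,
                    StandardOpenOption.READ, StandardOpenOption.DELETE_ON_CLOSE);
        } catch (IOException | RuntimeException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
    }
}
```

The cost is a full copy of the input to disk before the first byte can be interpreted, but memory use stays constant regardless of archive size.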

For the 7z stream I guess the minimum which can be done is working with a
temp file. But a general idea for this (and other compressors) is an "if
you can provide multiple input streams you can use ..." API.
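
One possible shape for such an API, as a sketch only. ReopenableSource and Reopening are hypothetical names invented here, not part of Commons Compress. The idea is that the caller supplies something that can open a fresh stream each time, and "seeking" is emulated by reopening and skipping.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical "data source": every call returns a fresh stream positioned at byte 0. */
interface ReopenableSource {
    InputStream open() throws IOException;
}

final class Reopening {
    private Reopening() {
    }

    /** Reads exactly len bytes starting at offset by reopening the source and skipping forward. */
    static byte[] readAt(ReopenableSource src, long offset, int len) throws IOException {
        try (InputStream in = src.open()) {
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    // skip() may legally return 0; fall back to consuming one byte.
                    if (in.read() < 0) {
                        throw new EOFException("offset beyond end of source");
                    }
                    skipped = 1;
                }
                remaining -= skipped;
            }
            byte[] out = new byte[len];
            int got = 0;
            while (got < len) {
                int n = in.read(out, got, len - got);
                if (n < 0) {
                    throw new EOFException("source ended before " + len + " bytes were read");
                }
                got += n;
            }
            return out;
        }
    }
}
```

With this, a reader could fetch metadata from the end of an archive first and then reopen the source to read entries from the front, at the price of re-reading skipped bytes on every reopen.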

Greetings
Bernd




--
http://www.zusammenkunft.net


Re: [compress] Do we want 7z Archive*Stream-like classes

Stefan Bodewig
In reply to this post by Benedikt Ritter-4
On 2013-09-30, Benedikt Ritter wrote:

> 2013/9/30 Stefan Bodewig <[hidden email]>

>> I'm in no way as familiar with the format as Damjan is but IMHO it is
>> feasible - but likely pretty memory hungry.  Even more so for the
>> writing side.  Similar to zip some information is stored in a central
>> place but in this case at the front of the archive.

> just out of curiosity: is this memory problem related to Java or to 7z in
> general?

What Bernd said.

Reading may be simpler, here you can store the meta-information from the
start of the file in memory and then read entries as you go, ZipFile
inside the zip package does something like this.

When you consider writing you'll have to write metadata about all
entries before you even start to write the first bytes of the first
entry.  Either you build up everything in memory or you use a temporary
output.  This is not without precedent in Compress, pack200 allows users
to choose between two strategies that provide exactly those two options.

Stefan


Re: [compress] Do we want 7z Archive*Stream-like classes

Damjan Jovanovic
On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <[hidden email]> wrote:

> On 2013-09-30, Benedikt Ritter wrote:
>
>> 2013/9/30 Stefan Bodewig <[hidden email]>
>
>>> I'm in no way as familiar with the format as Damjan is but IMHO it is
>>> feasible - but likely pretty memory hungry.  Even more so for the
>>> writing side.  Similar to zip some information is stored in a central
>>> place but in this case at the front of the archive.
>
>> just out of curiosity: is this memory problem related to Java or to 7z in
>> general?
>
> What Bernd said.
>
> Reading may be simpler, here you can store the meta-information from the
> start of the file in memory and then read entries as you go, ZipFile
> inside the zip package does something like this.

From what I remember:

The "meta-information" can be anywhere in the file, as can the
compressed files themselves. The 7zip tool seems to write the
meta-information at the end of the 7z file when multi-file archives
are created. Compressed file codecs, positions, lengths, and solid
compression details are only stored in the meta-information, so it's
not possible to write a streaming reader without O(n) memory in the
worst case.

> When you consider writing you'll have to write metadata about all
> entries before you even start to write the first bytes of the first
> entry.  Either you build up everything in memory or you use a temporary
> output.  This is not without precedent in Compress, pack200 allows users
> to choose between two strategies that provide exactly those two options.

Writing also requires seeking or O(n) memory, as the initial header at
the beginning of the file contains the offset to the next header, and
we only know the size/contents/location of the next header once all
the files have been written.
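
To make the seek-back step concrete, here is a sketch against a deliberately simplified toy layout (an assumption for illustration, not the real 7z format): an 8-byte start header holding the offset of a trailing header, then the entry data, then the trailing header. The class name PatchedHeaderWriter is made up.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

/** Toy writer: [8-byte offset of trailing header][entry data...][trailing header]. */
final class PatchedHeaderWriter {
    static final int START_HEADER_SIZE = Long.BYTES;

    private PatchedHeaderWriter() {
    }

    static void write(Path target, List<byte[]> entries, byte[] trailingHeader)
            throws IOException {
        try (SeekableByteChannel ch = Files.newByteChannel(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.TRUNCATE_EXISTING,
                StandardOpenOption.WRITE)) {
            // 1. Reserve the start header with zeros; the real value is not known yet.
            writeFully(ch, ByteBuffer.allocate(START_HEADER_SIZE));
            // 2. Write all entry data.
            for (byte[] entry : entries) {
                writeFully(ch, ByteBuffer.wrap(entry));
            }
            // 3. Only now is the location of the trailing header known.
            long trailingOffset = ch.position();
            writeFully(ch, ByteBuffer.wrap(trailingHeader));
            // 4. Seek back to the start and patch in the offset.
            ch.position(0);
            ByteBuffer start = ByteBuffer.allocate(START_HEADER_SIZE).putLong(trailingOffset);
            start.flip();
            writeFully(ch, start);
        }
    }

    private static void writeFully(SeekableByteChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }
}
```

Step 4 is the part a plain OutputStream cannot do. Without seeking, everything written in steps 2 and 3 would have to be held back (in memory or in a temp file) until the offset is known.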

Since we now have multiple archivers that require seeking, I suggest
we add a SeekableStream class or something along those lines. The
Commons Imaging project also has the same problem to solve for images,
and it uses ByteSources, which can be arrays, files, or an InputStream
wrapper that caches what has been read (so seeking is efficient, while
it only reads as much from the InputStream as is necessary).
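
A rough sketch of the caching-wrapper idea follows. This is not the Commons Imaging ByteSource API; CachingSeekableInput and its methods are hypothetical names, and the sketch assumes the cached data fits in a single byte array.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Hypothetical wrapper: remembers every byte pulled from the stream so earlier positions can be re-read. */
final class CachingSeekableInput {
    private final InputStream in;
    private byte[] cache = new byte[8192];
    private int cached;   // number of valid bytes in cache
    private boolean eof;

    CachingSeekableInput(InputStream in) {
        this.in = in;
    }

    /** Copies up to len bytes starting at absolute position pos; returns -1 if pos is at or past the end. */
    int readAt(long pos, byte[] dst, int off, int len) throws IOException {
        if (pos < 0 || len < 0 || pos + len > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("position out of range for this sketch");
        }
        fillTo(pos + len);
        if (pos >= cached) {
            return -1;
        }
        int n = (int) Math.min(len, cached - pos);
        System.arraycopy(cache, (int) pos, dst, off, n);
        return n;
    }

    /** How many bytes have been consumed from the underlying stream so far. */
    long bytesPulled() {
        return cached;
    }

    /** Pulls from the stream only until the cache covers [0, end) or the stream ends. */
    private void fillTo(long end) throws IOException {
        while (!eof && cached < end) {
            if (cached == cache.length) {
                cache = Arrays.copyOf(cache, cache.length * 2);
            }
            int want = (int) Math.min(cache.length - cached, end - cached);
            int n = in.read(cache, cached, want);
            if (n < 0) {
                eof = true;
            } else {
                cached += n;
            }
        }
    }
}
```

Backward seeks are served from the cache without touching the stream, but a read near the end of the input forces everything before it into memory, which is the O(n) worst case mentioned above.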

> Stefan
>

Damjan


Re: [compress] Do we want 7z Archive*Stream-like classes

Stefan Bodewig
On 2013-10-01, Damjan Jovanovic wrote:

> On Tue, Oct 1, 2013 at 6:09 AM, Stefan Bodewig <[hidden email]> wrote:

>> Reading may be simpler, here you can store the meta-information from the
>> start of the file in memory and then read entries as you go, ZipFile
>> inside the zip package does something like this.

> From what I remember:

> The "meta-information" can be anywhere in the file, as can the
> compressed files themselves. The 7zip tool seems to write the
> meta-information at the end of the 7z file when multi-file archives
> are created.

Oh yes, my understanding has been pretty much wrong and re-reading your
implementation has helped me to see more clearly.  Right now I think the
important metadata actually is at the end but there is a smaller part at
the front - in particular a pointer to the Header holding the metadata.
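
The "pointer at the front, metadata at the end" arrangement can be sketched from the reading side as well. The layout below is a simplified toy (an 8-byte offset at position 0, metadata running from that offset to the end of the file), assumed for illustration and not the actual 7z header structure. PointerFirstReader is a made-up name.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;

/** Toy reader: follows the offset stored in the first 8 bytes to the metadata at the end. */
final class PointerFirstReader {
    private PointerFirstReader() {
    }

    static byte[] readTrailingHeader(SeekableByteChannel ch) throws IOException {
        // Read the small fixed-size part at the front.
        ByteBuffer start = ByteBuffer.allocate(Long.BYTES);
        ch.position(0);
        readFully(ch, start);
        start.flip();
        long offset = start.getLong();

        long size = ch.size() - offset;
        if (offset < Long.BYTES || size < 0 || size > Integer.MAX_VALUE) {
            throw new IOException("corrupt start header, offset=" + offset);
        }
        // Jump over all entry data straight to the metadata.
        ByteBuffer header = ByteBuffer.allocate((int) size);
        ch.position(offset);
        readFully(ch, header);
        return header.array();
    }

    private static void readFully(SeekableByteChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) {
                throw new EOFException("truncated archive");
            }
        }
    }
}
```

The jump in the middle is what a forward-only stream cannot do without first consuming, and somehow retaining, all the entry data in between.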

> Compressed file codecs, positions, lengths, and solid compression
> details are only stored in the meta-information, so it's not possible
> to write a streaming reader without O(n) memory in the worst case.

I agree.

> Writing also requires seeking or O(n) memory, as the initial header at
> the beginning of the file contains the offset to the next header, and
> we only know the size/contents/location of the next header once all
> the files have been written.

or a temporary file to which the first header could be prepended - but
if you have that, you could use seeking as well.  So yes, I agree again.

> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

Interesting idea.

Right now I'm willing to postpone any streaming API for 7z and rather
cut a release with a files-only API.

Stefan


Re: [compress] Do we want 7z Archive*Stream-like classes

dam6923 .
> Since we now have multiple archivers that require seeking, I suggest
> we add a SeekableStream class or something along those lines. The
> Commons Imaging project also has the same problem to solve for images,
> and it uses ByteSources, which can be arrays, files, or an InputStream
> wrapper that caches what has been read (so seeking is efficient, while
> it only reads as much from the InputStream as is necessary).

I would also like to advocate for this approach.  I was looking into
writing an implementation of a Google Snappy decompressor, but was
unable to wrap it effectively in an InputStream.  Having a seekable
stream would make my efforts a better fit for this library.


Re: [compress] Do we want 7z Archive*Stream-like classes

jochen-2
In reply to this post by Stefan Bodewig
Document what they can deliver, so that they don't promise too much.





--
"That's what prayers are ... it's frightened people trying to make friends
with the bully!"

Terry Pratchett. The Last Hero

Re: [compress] Do we want 7z Archive*Stream-like classes

Bernd Eckenfels
In reply to this post by Stefan Bodewig
Hello,

just wanted to point out that TrueVFS/TrueZIP has a NIO.2
SeekableByteChannel implementation. That's the same thing as would be
needed for [compress]: https://truezip.java.net/

Bernd
