[COMPRESS] TIFF file identified as TAR

[COMPRESS] TIFF file identified as TAR

Allison, Timothy B.
COMPRESS colleagues,
   On TIKA-2591[0], a user reports that a specific type of TIFF is being identified as a TAR file.  Is this something we should try to fix at the Tika level, or is this something that would be better fixed in COMPRESS?
   Thank you!

           Best,

               Tim

[0] https://issues.apache.org/jira/browse/TIKA-2591

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Re: [COMPRESS] TIFF file identified as TAR

Stefan Bodewig
On 2018-02-27, Allison, Timothy B. wrote:

>    On TIKA-2591[0], a user reports that a specific type of TIFF is
>    being identified as a TAR file.  Is this something we should try to
>    fix at the Tika level, or is this something that would be better
>    fixed in COMPRESS?

TAR auto-detection is, erm, clumsy. But this is due to the format not
being built for being detected.

This is how it works right now:

* read the first candidate header of 512 bytes

* look at the eight bytes that contain the "ustar" string and the
  version and verify they look like something we support.

* verify the checksum of the candidate tar header

It is extremely unlikely that you will find a file that contains the
literal "ustar" and a bunch of NULs and also a matching checksum at the
right places, but you seem to have found one.

Of course it is possible we've got a bug, so we should look at the TIFF
file and verify it really looks like a TAR.  If there is no bug I'm not
sure what else we could do - or what TIKA could do.

Stefan

---------------------------------------------------------------------

Re: [COMPRESS] TIFF file identified as TAR

Stefan Bodewig
On 2018-02-27, Stefan Bodewig wrote:

> On 2018-02-27, Allison, Timothy B. wrote:

>>    On TIKA-2591[0], a user reports that a specific type of TIFF is
>>    being identified as a TAR file.  Is this something we should try to
>>    fix at the Tika level, or is this something that would be better
>>    fixed in COMPRESS?

> TAR auto-detection is, erm, clumsy. But this is due to the format not
> being built for being detected.

> This is how it works right now:

> * read the first candidate header of 512 bytes

> * look at the eight bytes that contain the "ustar" string and the
>   version and verify they look like something we support.

> * verify the checksum of the candidate tar header

Actually I was misreading the code. It is either "ustar and version
look good" or "parses as tar header with correct checksum", not both.
So the chance of false positives is bigger.

Unfortunately this has proven necessary to detect all valid TAR
archives: https://issues.apache.org/jira/browse/COMPRESS-117
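
As a rough illustration of the either/or check described above, here is a minimal Java sketch. It is not the Commons Compress code: the class and method names are hypothetical, only the POSIX and old GNU magic/version byte patterns are accepted, and "parses as tar header" is reduced to the checksum comparison alone. The offsets (checksum field at 148, magic at 257) come from the ustar header layout.

```java
import java.util.Arrays;

// Hypothetical sketch, not the Commons Compress implementation.
final class TarSniffSketch {
    static final int HEADER_LEN = 512;
    static final int CHKSUM_OFFSET = 148;
    static final int CHKSUM_LEN = 8;
    static final int MAGIC_OFFSET = 257;

    // Magic plus version as eight bytes: POSIX "ustar\0" + "00", old GNU "ustar  \0".
    private static final byte[] POSIX = {'u', 's', 't', 'a', 'r', 0, '0', '0'};
    private static final byte[] GNU = {'u', 's', 't', 'a', 'r', ' ', ' ', 0};

    static boolean hasUstarMagic(byte[] h) {
        if (h.length < HEADER_LEN) {
            return false;
        }
        byte[] field = Arrays.copyOfRange(h, MAGIC_OFFSET, MAGIC_OFFSET + 8);
        return Arrays.equals(field, POSIX) || Arrays.equals(field, GNU);
    }

    static boolean hasValidChecksum(byte[] h) {
        if (h.length < HEADER_LEN) {
            return false;
        }
        // The stored checksum is an octal number, optionally preceded by spaces.
        long stored = 0;
        int digits = 0;
        for (int i = 0; i < CHKSUM_LEN; i++) {
            byte b = h[CHKSUM_OFFSET + i];
            if (b >= '0' && b <= '7') {
                stored = stored * 8 + (b - '0');
                digits++;
            } else if (digits > 0) {
                break; // NUL or space ends the octal field
            } else if (b != ' ') {
                return false; // only spaces may precede the digits
            }
        }
        if (digits == 0) {
            return false;
        }
        // Sum all header bytes as unsigned, counting the checksum field as spaces.
        long sum = 0;
        for (int i = 0; i < HEADER_LEN; i++) {
            boolean inChecksumField = i >= CHKSUM_OFFSET && i < CHKSUM_OFFSET + CHKSUM_LEN;
            sum += inChecksumField ? ' ' : (h[i] & 0xFF);
        }
        return sum == stored;
    }

    // Either test alone is enough, which is why false positives are more likely.
    static boolean looksLikeTar(byte[] first512) {
        return hasUstarMagic(first512) || hasValidChecksum(first512);
    }
}
```

With an "or" between the two tests, any 512-byte block whose bytes happen to sum to the octal number stored at offset 148 is accepted even without "ustar" in it.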

Stefan

---------------------------------------------------------------------

RE: [COMPRESS] TIFF file identified as TAR

Allison, Timothy B.
As always, thank you, Stefan!

We might add a kluge at the Tika level to check for TIFF first... unless you'd like that kluge in your code? 😉

The reporter recommended one option: a conditional that checks the tarHeader variable to see if it starts with one of the TIFF magic numbers ("II", 49 49 2A 00, or "MM", 4D 4D 00 2A).
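
A minimal sketch of what such a guard could look like, in Java. The class and method names are hypothetical, and where the check would actually be called (Tika's detector or the COMPRESS matching code) is left open; only the two four-byte signatures come from the suggestion above.

```java
// Hypothetical helper, not part of Tika or Commons Compress.
final class TiffGuardSketch {
    // True if the buffer starts with a little-endian ("II") or big-endian ("MM") TIFF signature.
    static boolean startsWithTiffMagic(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        boolean littleEndian = header[0] == 0x49 && header[1] == 0x49
                && header[2] == 0x2A && header[3] == 0x00;
        boolean bigEndian = header[0] == 0x4D && header[1] == 0x4D
                && header[2] == 0x00 && header[3] == 0x2A;
        return littleEndian || bigEndian;
    }
}
```

A caller would skip the TAR heuristics whenever startsWithTiffMagic returns true for the first bytes of the candidate header.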





