[CODEC] Sign Extension Error in Murmur3

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[CODEC] Sign Extension Error in Murmur3

Claude Warren
There is an error in the current Murmur3 code introduced by sign extension
errors.  This is documented in CODEC-264.[1]

I have created a pull request to fix it.[2]

While the code changes did not change any of the existing Murmur3 tests, I
did add new tests that failed until the changes were applied.

However, I am concerned that the commons-codec Murmur3 code is in the wild
and this change will impact some hashes.

Therefore, I am asking for a discussion concerning how to apply any patches
like CODEC-264 to hashing algorithms that are in the wild.

Thx,
Claude

[1] https://issues.apache.org/jira/browse/CODEC-264
[2] https://github.com/apache/commons-codec/pull/27
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

garydgregory
I feel like I am missing something basic in the assumption of this issue:
there is no such thing as an unsigned int in Java and the ticket talks
about (C?) unsigned ints. Please help me understand how or why we should
care about C vs. Java ints. Are we comparing apples to oranges here?

Thank you,
Gary

On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:

> There is an error in the current Murmur3 code introduced by sign extension
> errors.  This is documented in CODEC-264.[1]
>
> I have created a pull request to fix it.[2]
>
> While the code changes did not change any of the existing Murmur3 tests, I
> did add new tests that failed until the changes were applied.
>
> However, I am concerned that the commons-codec Murmur3 code is in the wild
> and this change will impact some hashes.
>
> Therefore, I am asking for a discussion concerning how to apply any patches
> like CODEC-264 to hashing algorithms that are in the wild.
>
> Thx,
> Claude
>
> [1] https://issues.apache.org/jira/browse/CODEC-264
> [2] https://github.com/apache/commons-codec/pull/27
>
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Alex Herbert


> On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
>
> I feel like I am missing something basic in the assumption of this issue:
> there is no such thing as an unsigned int in Java and the ticket talks
> about (C?) unsigned ints. Please help me understand how or why we should
> care about C vs. Java ints. Are we comparing apples to oranges here?

When a byte is converted to an int there is sign extension if it is negative. A negative byte will have all bits above the 8th bit set to 1. So if the byte is negative then when converted to an int for bit shift and xor operations the raw bits are not the same.

These are not the same:

byte b = -1;
(int) b != (b & 0xff);
b << 8 != (b & 0xff) << 8;
b << 16 != (b & 0xff) << 16;

The original code has the use of the 0xff mask for most of the murmur3 algorithm. It has been missed from the final steps applied the the last 3 bytes in the hash32 algorithm variant.

Alex



>
> Thank you,
> Gary
>
> On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
>
>> There is an error in the current Murmur3 code introduced by sign extension
>> errors.  This is documented in CODEC-264.[1]
>>
>> I have created a pull request to fix it.[2]
>>
>> While the code changes did not change any of the existing Murmur3 tests, I
>> did add new tests that failed until the changes were applied.
>>
>> However, I am concerned that the commons-codec Murmur3 code is in the wild
>> and this change will impact some hashes.
>>
>> Therefore, I am asking for a discussion concerning how to apply any patches
>> like CODEC-264 to hashing algorithms that are in the wild.
>>
>> Thx,
>> Claude
>>
>> [1] https://issues.apache.org/jira/browse/CODEC-264
>> [2] https://github.com/apache/commons-codec/pull/27
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

sebb-2-2
As I see it, there are two use cases here.

1) Commons Codec code is used exclusively, in which case it is
essential that the output does not change for a given seed.

2) Commons Codec is used in conjunction with other implementations, in
which case it is essential that Codec is aligned with other
implementations

Both use cases seem valid to me, so I think it's important to allow
both old and new algorithms to co-exist going forward.

However:
Should the default behaviour change (with option to revert)?
Or would it be better to make the corrected behaviour optional?

Or deliberately break the API to force users to choose between old and
new behaviour?

Note that Wikipedia says that the x86 and x64 versions of the
algorithm generate different results in the case of the 128 bit hash.
However Commons Codec does not have different implementations for x86
and x64; it is not clear which has been implemented.
The JavaDoc also fails to mention that there is a 64bit hash.

The class clearly needs additional test data; it's possible that there
are discrepancies in other hash sizes as well.

The code has multiple implementations for each hash size depending on
the parameter types, which increases the number of tests needed.

Note that the original C++ implementation only has 3 hash methods:
MurmurHash3_x86_32
MurmurHash3_x86_128
MurmurHash3_x64_128

AFAICT these all take the same parameters:
- unsigned byte array
- length
- unsigned 32 bit seed
- pointer to output buffer

No idea why the Codec version allows additional input such as one (or
two) longs, etc.
Unless there is a defined standard for how these should be handled, it
will be impossible to prove that they behave correctly.

On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]> wrote:

>
>
>
> > On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
> >
> > I feel like I am missing something basic in the assumption of this issue:
> > there is no such thing as an unsigned int in Java and the ticket talks
> > about (C?) unsigned ints. Please help me understand how or why we should
> > care about C vs. Java ints. Are we comparing apples to oranges here?
>
> When a byte is converted to an int there is sign extension if it is negative. A negative byte will have all bits above the 8th bit set to 1. So if the byte is negative then when converted to an int for bit shift and xor operations the raw bits are not the same.
>
> These are not the same:
>
> byte b = -1;
> (int) b != (b & 0xff);
> b << 8 != (b & 0xff) << 8;
> b << 16 != (b & 0xff) << 16;
>
> The original code has the use of the 0xff mask for most of the murmur3 algorithm. It has been missed from the final steps applied the the last 3 bytes in the hash32 algorithm variant.
>
> Alex
>
>
>
> >
> > Thank you,
> > Gary
> >
> > On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
> >
> >> There is an error in the current Murmur3 code introduced by sign extension
> >> errors.  This is documented in CODEC-264.[1]
> >>
> >> I have created a pull request to fix it.[2]
> >>
> >> While the code changes did not change any of the existing Murmur3 tests, I
> >> did add new tests that failed until the changes were applied.
> >>
> >> However, I am concerned that the commons-codec Murmur3 code is in the wild
> >> and this change will impact some hashes.
> >>
> >> Therefore, I am asking for a discussion concerning how to apply any patches
> >> like CODEC-264 to hashing algorithms that are in the wild.
> >>
> >> Thx,
> >> Claude
> >>
> >> [1] https://issues.apache.org/jira/browse/CODEC-264
> >> [2] https://github.com/apache/commons-codec/pull/27
> >>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren
I think the way to prove they behave correctly is to test the results
against the original "C" implementation.  With this in mind I proposed the
changes.

I agree that there should be a way to revert to the old code as there is
code in the wild that depends on the earlier implementation.

I know that there are broken implementations in other Apache projects, but
those are not libraries (i.e. Cassandra).

I would suggest that perhaps the changes that I provided be renamed as
hash128Standard and hash32Standard (or some other designator to show that
it is different from the original CODEC implementation).

Alternatively hash128 and hash32 could be deprecated  and replaced with
hashx86_32Old, and hashx64_128Old and provide the corrected versions for
hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
hashx86_128 implementation.

My opinion is that we should provide Murmur3 hash implementations that
produce the same results as the "C"/C++ implementations for cross
application compatibility.

Claude

On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:

> As I see it, there are two use cases here.
>
> 1) Commons Codec code is used exclusively, in which case it is
> essential that the output does not change for a given seed.
>
> 2) Commons Codec is used in conjunction with other implementations, in
> which case it is essential that Codec is aligned with other
> implementations
>
> Both use cases seem valid to me, so I think it's important to allow
> both old and new algorithms to co-exist going forward.
>
> However:
> Should the default behaviour change (with option to revert)?
> Or would it be better to make the corrected behaviour optional?
>
> Or deliberately break the API to force users to choose between old and
> new behaviour?
>
> Note that Wikipedia says that the x86 and x64 versions of the
> algorithm generate different results in the case of the 128 bit hash.
> However Commons Codec does not have different implementations for x86
> and x64; it is not clear which has been implemented.
> The JavaDoc also fails to mention that there is a 64bit hash.
>
> The class clearly needs additional test data; it's possible that there
> are discrepancies in other hash sizes as well.
>
> The code has multiple implementations for each hash size depending on
> the parameter types, which increases the number of tests needed.
>
> Note that the original C++ implementation only has 3 hash methods:
> MurmurHash3_x86_32
> MurmurHash3_x86_128
> MurmurHash3_x64_128
>
> AFAICT these all take the same parameters:
> - unsigned byte array
> - length
> - unsigned 32 bit seed
> - pointer to output buffer
>
> No idea why the Codec version allows additional input such as one (or
> two) longs, etc.
> Unless there is a defined standard for how these should be handled, it
> will be impossible to prove that they behave correctly.
>
> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
> wrote:
> >
> >
> >
> > > On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
> > >
> > > I feel like I am missing something basic in the assumption of this
> issue:
> > > there is no such thing as an unsigned int in Java and the ticket talks
> > > about (C?) unsigned ints. Please help me understand how or why we
> should
> > > care about C vs. Java ints. Are we comparing apples to oranges here?
> >
> > When a byte is converted to an int there is sign extension if it is
> negative. A negative byte will have all bits above the 8th bit set to 1. So
> if the byte is negative then when converted to an int for bit shift and xor
> operations the raw bits are not the same.
> >
> > These are not the same:
> >
> > byte b = -1;
> > (int) b != (b & 0xff);
> > b << 8 != (b & 0xff) << 8;
> > b << 16 != (b & 0xff) << 16;
> >
> > The original code has the use of the 0xff mask for most of the murmur3
> algorithm. It has been missed from the final steps applied the the last 3
> bytes in the hash32 algorithm variant.
> >
> > Alex
> >
> >
> >
> > >
> > > Thank you,
> > > Gary
> > >
> > > On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
> > >
> > >> There is an error in the current Murmur3 code introduced by sign
> extension
> > >> errors.  This is documented in CODEC-264.[1]
> > >>
> > >> I have created a pull request to fix it.[2]
> > >>
> > >> While the code changes did not change any of the existing Murmur3
> tests, I
> > >> did add new tests that failed until the changes were applied.
> > >>
> > >> However, I am concerned that the commons-codec Murmur3 code is in the
> wild
> > >> and this change will impact some hashes.
> > >>
> > >> Therefore, I am asking for a discussion concerning how to apply any
> patches
> > >> like CODEC-264 to hashing algorithms that are in the wild.
> > >>
> > >> Thx,
> > >> Claude
> > >>
> > >> [1] https://issues.apache.org/jira/browse/CODEC-264
> > >> [2] https://github.com/apache/commons-codec/pull/27
> > >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren
As a third option as @melloware said in the pull request comments the
original implementation came from Apache Hive.  The current hashes could be
named hash31Hive and hash128Hive

On Mon, Nov 4, 2019 at 12:53 AM Claude Warren <[hidden email]> wrote:

> I think the way to prove they behave correctly is to test the results
> against the original "C" implementation.  With this in mind I proposed the
> changes.
>
> I agree that there should be a way to revert to the old code as there is
> code in the wild that depends on the earlier implementation.
>
> I know that there are broken implementations in other Apache projects, but
> those are not libraries (i.e. Cassandra).
>
> I would suggest that perhaps the changes that I provided be renamed as
> hash128Standard and hash32Standard (or some other designator to show that
> it is different from the original CODEC implementation).
>
> Alternatively hash128 and hash32 could be deprecated  and replaced with
> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> hashx86_128 implementation.
>
> My opinion is that we should provide Murmur3 hash implementations that
> produce the same results as the "C"/C++ implementations for cross
> application compatibility.
>
> Claude
>
> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
>
>> As I see it, there are two use cases here.
>>
>> 1) Commons Codec code is used exclusively, in which case it is
>> essential that the output does not change for a given seed.
>>
>> 2) Commons Codec is used in conjunction with other implementations, in
>> which case it is essential that Codec is aligned with other
>> implementations
>>
>> Both use cases seem valid to me, so I think it's important to allow
>> both old and new algorithms to co-exist going forward.
>>
>> However:
>> Should the default behaviour change (with option to revert)?
>> Or would it be better to make the corrected behaviour optional?
>>
>> Or deliberately break the API to force users to choose between old and
>> new behaviour?
>>
>> Note that Wikipedia says that the x86 and x64 versions of the
>> algorithm generate different results in the case of the 128 bit hash.
>> However Commons Codec does not have different implementations for x86
>> and x64; it is not clear which has been implemented.
>> The JavaDoc also fails to mention that there is a 64bit hash.
>>
>> The class clearly needs additional test data; it's possible that there
>> are discrepancies in other hash sizes as well.
>>
>> The code has multiple implementations for each hash size depending on
>> the parameter types, which increases the number of tests needed.
>>
>> Note that the original C++ implementation only has 3 hash methods:
>> MurmurHash3_x86_32
>> MurmurHash3_x86_128
>> MurmurHash3_x64_128
>>
>> AFAICT these all take the same parameters:
>> - unsigned byte array
>> - length
>> - unsigned 32 bit seed
>> - pointer to output buffer
>>
>> No idea why the Codec version allows additional input such as one (or
>> two) longs, etc.
>> Unless there is a defined standard for how these should be handled, it
>> will be impossible to prove that they behave correctly.
>>
>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
>> wrote:
>> >
>> >
>> >
>> > > On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
>> > >
>> > > I feel like I am missing something basic in the assumption of this
>> issue:
>> > > there is no such thing as an unsigned int in Java and the ticket talks
>> > > about (C?) unsigned ints. Please help me understand how or why we
>> should
>> > > care about C vs. Java ints. Are we comparing apples to oranges here?
>> >
>> > When a byte is converted to an int there is sign extension if it is
>> negative. A negative byte will have all bits above the 8th bit set to 1. So
>> if the byte is negative then when converted to an int for bit shift and xor
>> operations the raw bits are not the same.
>> >
>> > These are not the same:
>> >
>> > byte b = -1;
>> > (int) b != (b & 0xff);
>> > b << 8 != (b & 0xff) << 8;
>> > b << 16 != (b & 0xff) << 16;
>> >
>> > The original code has the use of the 0xff mask for most of the murmur3
>> algorithm. It has been missed from the final steps applied the the last 3
>> bytes in the hash32 algorithm variant.
>> >
>> > Alex
>> >
>> >
>> >
>> > >
>> > > Thank you,
>> > > Gary
>> > >
>> > > On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
>> > >
>> > >> There is an error in the current Murmur3 code introduced by sign
>> extension
>> > >> errors.  This is documented in CODEC-264.[1]
>> > >>
>> > >> I have created a pull request to fix it.[2]
>> > >>
>> > >> While the code changes did not change any of the existing Murmur3
>> tests, I
>> > >> did add new tests that failed until the changes were applied.
>> > >>
>> > >> However, I am concerned that the commons-codec Murmur3 code is in
>> the wild
>> > >> and this change will impact some hashes.
>> > >>
>> > >> Therefore, I am asking for a discussion concerning how to apply any
>> patches
>> > >> like CODEC-264 to hashing algorithms that are in the wild.
>> > >>
>> > >> Thx,
>> > >> Claude
>> > >>
>> > >> [1] https://issues.apache.org/jira/browse/CODEC-264
>> > >> [2] https://github.com/apache/commons-codec/pull/27
>> > >>
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email]
>> > For additional commands, e-mail: [hidden email]
>> >
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

sebb-2-2
In reply to this post by Claude Warren
On Mon, 4 Nov 2019 at 00:53, Claude Warren <[hidden email]> wrote:

>
> I think the way to prove they behave correctly is to test the results
> against the original "C" implementation.  With this in mind I proposed the
> changes.
>
> I agree that there should be a way to revert to the old code as there is
> code in the wild that depends on the earlier implementation.
>
> I know that there are broken implementations in other Apache projects, but
> those are not libraries (i.e. Cassandra).
>
> I would suggest that perhaps the changes that I provided be renamed as
> hash128Standard and hash32Standard (or some other designator to show that
> it is different from the original CODEC implementation).
>
> Alternatively hash128 and hash32 could be deprecated  and replaced with
> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> hashx86_128 implementation.
>
> My opinion is that we should provide Murmur3 hash implementations that
> produce the same results as the "C"/C++ implementations for cross
> application compatibility.

I agree: but the original implementation has far fewer methods.
So how does one check the Codec methods that have no corresponding C/C++ method?

Maybe the Codec implementation needs to be drastically pruned as well.

> Claude
>
> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
>
> > As I see it, there are two use cases here.
> >
> > 1) Commons Codec code is used exclusively, in which case it is
> > essential that the output does not change for a given seed.
> >
> > 2) Commons Codec is used in conjunction with other implementations, in
> > which case it is essential that Codec is aligned with other
> > implementations
> >
> > Both use cases seem valid to me, so I think it's important to allow
> > both old and new algorithms to co-exist going forward.
> >
> > However:
> > Should the default behaviour change (with option to revert)?
> > Or would it be better to make the corrected behaviour optional?
> >
> > Or deliberately break the API to force users to choose between old and
> > new behaviour?
> >
> > Note that Wikipedia says that the x86 and x64 versions of the
> > algorithm generate different results in the case of the 128 bit hash.
> > However Commons Codec does not have different implementations for x86
> > and x64; it is not clear which has been implemented.
> > The JavaDoc also fails to mention that there is a 64bit hash.
> >
> > The class clearly needs additional test data; it's possible that there
> > are discrepancies in other hash sizes as well.
> >
> > The code has multiple implementations for each hash size depending on
> > the parameter types, which increases the number of tests needed.
> >
> > Note that the original C++ implementation only has 3 hash methods:
> > MurmurHash3_x86_32
> > MurmurHash3_x86_128
> > MurmurHash3_x64_128
> >
> > AFAICT these all take the same parameters:
> > - unsigned byte array
> > - length
> > - unsigned 32 bit seed
> > - pointer to output buffer
> >
> > No idea why the Codec version allows additional input such as one (or
> > two) longs, etc.
> > Unless there is a defined standard for how these should be handled, it
> > will be impossible to prove that they behave correctly.
> >
> > On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
> > wrote:
> > >
> > >
> > >
> > > > On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
> > > >
> > > > I feel like I am missing something basic in the assumption of this
> > issue:
> > > > there is no such thing as an unsigned int in Java and the ticket talks
> > > > about (C?) unsigned ints. Please help me understand how or why we
> > should
> > > > care about C vs. Java ints. Are we comparing apples to oranges here?
> > >
> > > When a byte is converted to an int there is sign extension if it is
> > negative. A negative byte will have all bits above the 8th bit set to 1. So
> > if the byte is negative then when converted to an int for bit shift and xor
> > operations the raw bits are not the same.
> > >
> > > These are not the same:
> > >
> > > byte b = -1;
> > > (int) b != (b & 0xff);
> > > b << 8 != (b & 0xff) << 8;
> > > b << 16 != (b & 0xff) << 16;
> > >
> > > The original code has the use of the 0xff mask for most of the murmur3
> > algorithm. It has been missed from the final steps applied the the last 3
> > bytes in the hash32 algorithm variant.
> > >
> > > Alex
> > >
> > >
> > >
> > > >
> > > > Thank you,
> > > > Gary
> > > >
> > > > On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
> > > >
> > > >> There is an error in the current Murmur3 code introduced by sign
> > extension
> > > >> errors.  This is documented in CODEC-264.[1]
> > > >>
> > > >> I have created a pull request to fix it.[2]
> > > >>
> > > >> While the code changes did not change any of the existing Murmur3
> > tests, I
> > > >> did add new tests that failed until the changes were applied.
> > > >>
> > > >> However, I am concerned that the commons-codec Murmur3 code is in the
> > wild
> > > >> and this change will impact some hashes.
> > > >>
> > > >> Therefore, I am asking for a discussion concerning how to apply any
> > patches
> > > >> like CODEC-264 to hashing algorithms that are in the wild.
> > > >>
> > > >> Thx,
> > > >> Claude
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/CODEC-264
> > > >> [2] https://github.com/apache/commons-codec/pull/27
> > > >>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Alex Herbert


> On 4 Nov 2019, at 02:13, sebb <[hidden email]> wrote:
>
> On Mon, 4 Nov 2019 at 00:53, Claude Warren <[hidden email] <mailto:[hidden email]>> wrote:
>>
>> I think the way to prove they behave correctly is to test the results
>> against the original "C" implementation.  With this in mind I proposed the
>> changes.
>>
>> I agree that there should be a way to revert to the old code as there is
>> code in the wild that depends on the earlier implementation.
>>
>> I know that there are broken implementations in other Apache projects, but
>> those are not libraries (i.e. Cassandra).
>>
>> I would suggest that perhaps the changes that I provided be renamed as
>> hash128Standard and hash32Standard (or some other designator to show that
>> it is different from the original CODEC implementation).
>>
>> Alternatively hash128 and hash32 could be deprecated  and replaced with
>> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
>> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>> hashx86_128 implementation.
>>
>> My opinion is that we should provide Murmur3 hash implementations that
>> produce the same results as the "C"/C++ implementations for cross
>> application compatibility.
>
> I agree: but the original implementation has far fewer methods.
> So how does one check the Codec methods that have no corresponding C/C++ method?

AFAICT the methods that accept a short, int, or long are convenience methods to avoid creating a byte[] for the data. The JUnit tests demonstrate this by passing the same values to a ByteBuffer and then calling the corresponding byte[] methods. The methods do assume the endian order of the primitive datatypes as processed by a default ByteBuffer (Big Endian). So perhaps this should be added to the Javadoc with an example for the equivalent call to the method using the byte[].

Note that these methods will not be broken. The bug in the code is for input byte[] lengths that are not a factor of 4. The fix is for any of the leftover 1, 2, or 3 bytes when computing the hash.

The broken methods are:

public static int hash32(final byte[] data)
public static int hash32(final String data)
public static int hash32(final byte[] data, final int length)
public static int hash32(final byte[] data, final int length, final int seed)
public static int hash32(final byte[] data, final int offset, final int length, final int seed)

This class is also broken:
public static class IncrementalHash32

>
> Maybe the Codec implementation needs to be drastically pruned as well.
>
>> Claude
>>
>> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
>>
>>> As I see it, there are two use cases here.
>>>
>>> 1) Commons Codec code is used exclusively, in which case it is
>>> essential that the output does not change for a given seed.
>>>
>>> 2) Commons Codec is used in conjunction with other implementations, in
>>> which case it is essential that Codec is aligned with other
>>> implementations
>>>
>>> Both use cases seem valid to me, so I think it's important to allow
>>> both old and new algorithms to co-exist going forward.
>>>
>>> However:
>>> Should the default behaviour change (with option to revert)?
>>> Or would it be better to make the corrected behaviour optional?
>>>
>>> Or deliberately break the API to force users to choose between old and
>>> new behaviour?
>>>
>>> Note that Wikipedia says that the x86 and x64 versions of the
>>> algorithm generate different results in the case of the 128 bit hash.
>>> However Commons Codec does not have different implementations for x86
>>> and x64; it is not clear which has been implemented.
>>> The JavaDoc also fails to mention that there is a 64bit hash.

And lacks <p> and <a> tags in the header as appropriate.

>>>
>>> The class clearly needs additional test data; it's possible that there
>>> are discrepancies in other hash sizes as well.
>>>
>>> The code has multiple implementations for each hash size depending on
>>> the parameter types, which increases the number of tests needed.
>>>
>>> Note that the original C++ implementation only has 3 hash methods:
>>> MurmurHash3_x86_32
>>> MurmurHash3_x86_128
>>> MurmurHash3_x64_128
>>>
>>> AFAICT these all take the same parameters:
>>> - unsigned byte array
>>> - length
>>> - unsigned 32 bit seed
>>> - pointer to output buffer
>>>
>>> No idea why the Codec version allows additional input such as one (or
>>> two) longs, etc.
>>> Unless there is a defined standard for how these should be handled, it
>>> will be impossible to prove that they behave correctly.
>>>
>>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
>>> wrote:
>>>>
>>>>
>>>>
>>>>> On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]> wrote:
>>>>>
>>>>> I feel like I am missing something basic in the assumption of this
>>> issue:
>>>>> there is no such thing as an unsigned int in Java and the ticket talks
>>>>> about (C?) unsigned ints. Please help me understand how or why we
>>> should
>>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
>>>>
>>>> When a byte is converted to an int there is sign extension if it is
>>> negative. A negative byte will have all bits above the 8th bit set to 1. So
>>> if the byte is negative then when converted to an int for bit shift and xor
>>> operations the raw bits are not the same.
>>>>
>>>> These are not the same:
>>>>
>>>> byte b = -1;
>>>> (int) b != (b & 0xff);
>>>> b << 8 != (b & 0xff) << 8;
>>>> b << 16 != (b & 0xff) << 16;
>>>>
>>>> The original code has the use of the 0xff mask for most of the murmur3
>>> algorithm. It has been missed from the final steps applied the the last 3
>>> bytes in the hash32 algorithm variant.
>>>>
>>>> Alex
>>>>
>>>>
>>>>
>>>>>
>>>>> Thank you,
>>>>> Gary
>>>>>
>>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
>>>>>
>>>>>> There is an error in the current Murmur3 code introduced by sign
>>> extension
>>>>>> errors.  This is documented in CODEC-264.[1]
>>>>>>
>>>>>> I have created a pull request to fix it.[2]
>>>>>>
>>>>>> While the code changes did not change any of the existing Murmur3
>>> tests, I
>>>>>> did add new tests that failed until the changes were applied.
>>>>>>
>>>>>> However, I am concerned that the commons-codec Murmur3 code is in the
>>> wild
>>>>>> and this change will impact some hashes.
>>>>>>
>>>>>> Therefore, I am asking for a discussion concerning how to apply any
>>> patches
>>>>>> like CODEC-264 to hashing algorithms that are in the wild.
>>>>>>
>>>>>> Thx,
>>>>>> Claude
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>>>>>> [2] https://github.com/apache/commons-codec/pull/27
>>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [hidden email]
>>> For additional commands, e-mail: [hidden email]
>>>
>>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email] <mailto:[hidden email]>
> For additional commands, e-mail: [hidden email] <mailto:[hidden email]>
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren
There are a number of issues with the format and potential bugs in the
Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
that tripped me up was the mix of tab/space lines.  Anyway, these issues
can be fixed in later pull requests.  I think is one overarching question
and one specific question:

In general, what is the policy for how do we deal with digest
implementations that have a bug and that have been released into the wild?

Specifically in this case, what do we do to solve the problem?

I was in favor of the following specific action:

1. Deprecate all affected methods and document the issue.
2. Move the deprecated methods into a new class called HiveMurmur3 and have
all deprecation point to this class.
3. Create a new Murmur3 class with method names like hash128_x64 (ones that
do not conflict with the old names).

But the more I think about it the more I think that we should follow KISS
and

1. Create a new Murmur3 class with method names like hash128_x64 (ones that
do not conflict with the old names).
2. Update the javadocs for the defective methods with notes indicating they
have defects.  Perhaps write a readme type document explaining in detail
the issue.

I think in general we should follow the same action with any digest defect:
New methods and Javadoc old.

Claude




On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 4 Nov 2019, at 02:13, sebb <[hidden email]> wrote:
> >
> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <[hidden email] <mailto:
> [hidden email]>> wrote:
> >>
> >> I think the way to prove they behave correctly is to test the results
> >> against the original "C" implementation.  With this in mind I proposed
> the
> >> changes.
> >>
> >> I agree that there should be a way to revert to the old code as there is
> >> code in the wild that depends on the earlier implementation.
> >>
> >> I know that there are broken implementations in other Apache projects,
> but
> >> those are not libraries (i.e. Cassandra).
> >>
> >> I would suggest that perhaps the changes that I provided be renamed as
> >> hash128Standard and hash32Standard (or some other designator to show
> that
> >> it is different from the original CODEC implementation).
> >>
> >> Alternatively hash128 and hash32 could be deprecated  and replaced with
> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> >> hashx86_128 implementation.
> >>
> >> My opinion is that we should provide Murmur3 hash implementations that
> >> produce the same results as the "C"/C++ implementations for cross
> >> application compatibility.
> >
> > I agree: but the original implementation has far fewer methods.
> > So how does one check the Codec methods that have no corresponding C/C++
> method?
>
> AFAICT the methods that accept a short, int, or long are convenience
> methods to avoid creating a byte[] for the data. The JUnit tests
> demonstrate this by passing the same values to a ByteBuffer and then
> calling the corresponding byte[] methods. The methods do assume the endian
> order of the primitive datatypes as processed by a default ByteBuffer (Big
> Endian). So perhaps this should be added to the Javadoc with an example for
> the equivalent call to the method using the byte[].
>
> Note that these methods will not be broken. The bug in the code is for
> input byte[] lengths that are not a factor of 4. The fix is for any of the
> leftover 1, 2, or 3 bytes when computing the hash.
>
> The broken methods are:
>
> public static int hash32(final byte[] data)
> public static int hash32(final String data)
> public static int hash32(final byte[] data, final int length)
> public static int hash32(final byte[] data, final int length, final int
> seed)
> public static int hash32(final byte[] data, final int offset, final int
> length, final int seed)
>
> This class is also broken:
> public static class IncrementalHash32
>
> >
> > Maybe the Codec implementation needs to be drastically pruned as well.
> >
> >> Claude
> >>
> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
> >>
> >>> As I see it, there are two use cases here.
> >>>
> >>> 1) Commons Codec code is used exclusively, in which case it is
> >>> essential that the output does not change for a given seed.
> >>>
> >>> 2) Commons Codec is used in conjunction with other implementations, in
> >>> which case it is essential that Codec is aligned with other
> >>> implementations
> >>>
> >>> Both use cases seem valid to me, so I think it's important to allow
> >>> both old and new algorithms to co-exist going forward.
> >>>
> >>> However:
> >>> Should the default behaviour change (with option to revert)?
> >>> Or would it be better to make the corrected behaviour optional?
> >>>
> >>> Or deliberately break the API to force users to choose between old and
> >>> new behaviour?
> >>>
> >>> Note that Wikipedia says that the x86 and x64 versions of the
> >>> algorithm generate different results in the case of the 128 bit hash.
> >>> However Commons Codec does not have different implementations for x86
> >>> and x64; it is not clear which has been implemented.
> >>> The JavaDoc also fails to mention that there is a 64bit hash.
>
> And lacks <p> and <a> tags in the header as appropriate.
>
> >>>
> >>> The class clearly needs additional test data; it's possible that there
> >>> are discrepancies in other hash sizes as well.
> >>>
> >>> The code has multiple implementations for each hash size depending on
> >>> the parameter types, which increases the number of tests needed.
> >>>
> >>> Note that the original C++ implementation only has 3 hash methods:
> >>> MurmurHash3_x86_32
> >>> MurmurHash3_x86_128
> >>> MurmurHash3_x64_128
> >>>
> >>> AFAICT these all take the same parameters:
> >>> - unsigned byte array
> >>> - length
> >>> - unsigned 32 bit seed
> >>> - pointer to output buffer
> >>>
> >>> No idea why the Codec version allows additional input such as one (or
> >>> two) longs, etc.
> >>> Unless there is a defined standard for how these should be handled, it
> >>> will be impossible to prove that they behave correctly.
> >>>
> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
> >>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]>
> wrote:
> >>>>>
> >>>>> I feel like I am missing something basic in the assumption of this
> >>> issue:
> >>>>> there is no such thing as an unsigned int in Java and the ticket
> talks
> >>>>> about (C?) unsigned ints. Please help me understand how or why we
> >>> should
> >>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
> >>>>
> >>>> When a byte is converted to an int there is sign extension if it is
> >>> negative. A negative byte will have all bits above the 8th bit set to
> 1. So
> >>> if the byte is negative then when converted to an int for bit shift
> and xor
> >>> operations the raw bits are not the same.
> >>>>
> >>>> These are not the same:
> >>>>
> >>>> byte b = -1;
> >>>> (int) b != (b & 0xff);
> >>>> b << 8 != (b & 0xff) << 8;
> >>>> b << 16 != (b & 0xff) << 16;
> >>>>
> >>>> The original code has the use of the 0xff mask for most of the murmur3
> >>> algorithm. It has been missed from the final steps applied the the
> last 3
> >>> bytes in the hash32 algorithm variant.
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>>
> >>>>>
> >>>>> Thank you,
> >>>>> Gary
> >>>>>
> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
> >>>>>
> >>>>>> There is an error in the current Murmur3 code introduced by sign
> >>> extension
> >>>>>> errors.  This is documented in CODEC-264.[1]
> >>>>>>
> >>>>>> I have created a pull request to fix it.[2]
> >>>>>>
> >>>>>> While the code changes did not change any of the existing Murmur3
> >>> tests, I
> >>>>>> did add new tests that failed until the changes were applied.
> >>>>>>
> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
> the
> >>> wild
> >>>>>> and this change will impact some hashes.
> >>>>>>
> >>>>>> Therefore, I am asking for a discussion concerning how to apply any
> >>> patches
> >>>>>> like CODEC-264 to hashing algorithms that are in the wild.
> >>>>>>
> >>>>>> Thx,
> >>>>>> Claude
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
> >>>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [hidden email]
> >>>> For additional commands, e-mail: [hidden email]
> >>>>
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [hidden email]
> >>> For additional commands, e-mail: [hidden email]
> >>>
> >>>
> >>
> >> --
> >> I like: Like Like - The likeliest place on the web
> >> <http://like-like.xenei.com>
> >> LinkedIn: http://www.linkedin.com/in/claudewarren
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email] <mailto:
> [hidden email]>
> > For additional commands, e-mail: [hidden email] <mailto:
> [hidden email]>
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren
I took the approach that I would leave the original code there and add new
methods hash128_x64 and hash32_x86.  I also marked the older methods as
deprecated with a note that the implementation has a sign extension error
and to use the new methods for new development and cases where the change
does not impact the code in the wild.

Claude

On Mon, Nov 4, 2019 at 10:50 AM Claude Warren <[hidden email]> wrote:

> There are a number of issues with the format and potential bugs in the
> Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
> that tripped me up was the mix of tab/space lines.  Anyway, these issues
> can be fixed in later pull requests.  I think is one overarching question
> and one specific question:
>
> In general, what is the policy for how do we deal with digest
> implementations that have a bug and that have been released into the wild?
>
> Specifically in this case, what do we do to solve the problem?
>
> I was in favor of the following specific action:
>
> 1. Deprecate all affected methods and document the issue.
> 2. Move the deprecated methods into a new class called HiveMurmur3 and
> have all deprecation point to this class.
> 3. Create a new Murmur3 class with method names like hash128_x64 (ones
> that do not conflict with the old names).
>
> But the more I think about it the more I think that we should follow KISS
> and
>
> 1. Create a new Murmur3 class with method names like hash128_x64 (ones
> that do not conflict with the old names).
> 2. Update the javadocs for the defective methods with notes indicating
> they have defects.  Perhaps write a readme type document explaining in
> detail the issue.
>
> I think in general we should follow the same action with any digest
> defect: New methods and Javadoc old.
>
> Claude
>
>
>
>
> On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>> > On 4 Nov 2019, at 02:13, sebb <[hidden email]> wrote:
>> >
>> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <[hidden email] <mailto:
>> [hidden email]>> wrote:
>> >>
>> >> I think the way to prove they behave correctly is to test the results
>> >> against the original "C" implementation.  With this in mind I proposed
>> the
>> >> changes.
>> >>
>> >> I agree that there should be a way to revert to the old code as there
>> is
>> >> code in the wild that depends on the earlier implementation.
>> >>
>> >> I know that there are broken implementations in other Apache projects,
>> but
>> >> those are not libraries (i.e. Cassandra).
>> >>
>> >> I would suggest that perhaps the changes that I provided be renamed as
>> >> hash128Standard and hash32Standard (or some other designator to show
>> that
>> >> it is different from the original CODEC implementation).
>> >>
>> >> Alternatively hash128 and hash32 could be deprecated  and replaced with
>> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions
>> for
>> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>> >> hashx86_128 implementation.
>> >>
>> >> My opinion is that we should provide Murmur3 hash implementations that
>> >> produce the same results as the "C"/C++ implementations for cross
>> >> application compatibility.
>> >
>> > I agree: but the original implementation has far fewer methods.
>> > So how does one check the Codec methods that have no corresponding
>> C/C++ method?
>>
>> AFAICT the methods that accept a short, int, or long are convenience
>> methods to avoid creating a byte[] for the data. The JUnit tests
>> demonstrate this by passing the same values to a ByteBuffer and then
>> calling the corresponding byte[] methods. The methods do assume the endian
>> order of the primitive datatypes as processed by a default ByteBuffer (Big
>> Endian). So perhaps this should be added to the Javadoc with an example for
>> the equivalent call to the method using the byte[].
>>
>> Note that these methods will not be broken. The bug in the code is for
>> input byte[] lengths that are not a factor of 4. The fix is for any of the
>> leftover 1, 2, or 3 bytes when computing the hash.
>>
>> The broken methods are:
>>
>> public static int hash32(final byte[] data)
>> public static int hash32(final String data)
>> public static int hash32(final byte[] data, final int length)
>> public static int hash32(final byte[] data, final int length, final int
>> seed)
>> public static int hash32(final byte[] data, final int offset, final int
>> length, final int seed)
>>
>> This class is also broken:
>> public static class IncrementalHash32
>>
>> >
>> > Maybe the Codec implementation needs to be drastically pruned as well.
>> >
>> >> Claude
>> >>
>> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
>> >>
>> >>> As I see it, there are two use cases here.
>> >>>
>> >>> 1) Commons Codec code is used exclusively, in which case it is
>> >>> essential that the output does not change for a given seed.
>> >>>
>> >>> 2) Commons Codec is used in conjunction with other implementations, in
>> >>> which case it is essential that Codec is aligned with other
>> >>> implementations
>> >>>
>> >>> Both use cases seem valid to me, so I think it's important to allow
>> >>> both old and new algorithms to co-exist going forward.
>> >>>
>> >>> However:
>> >>> Should the default behaviour change (with option to revert)?
>> >>> Or would it be better to make the corrected behaviour optional?
>> >>>
>> >>> Or deliberately break the API to force users to choose between old and
>> >>> new behaviour?
>> >>>
>> >>> Note that Wikipedia says that the x86 and x64 versions of the
>> >>> algorithm generate different results in the case of the 128 bit hash.
>> >>> However Commons Codec does not have different implementations for x86
>> >>> and x64; it is not clear which has been implemented.
>> >>> The JavaDoc also fails to mention that there is a 64bit hash.
>>
>> And lacks <p> and <a> tags in the header as appropriate.
>>
>> >>>
>> >>> The class clearly needs additional test data; it's possible that there
>> >>> are discrepancies in other hash sizes as well.
>> >>>
>> >>> The code has multiple implementations for each hash size depending on
>> >>> the parameter types, which increases the number of tests needed.
>> >>>
>> >>> Note that the original C++ implementation only has 3 hash methods:
>> >>> MurmurHash3_x86_32
>> >>> MurmurHash3_x86_128
>> >>> MurmurHash3_x64_128
>> >>>
>> >>> AFAICT these all take the same parameters:
>> >>> - unsigned byte array
>> >>> - length
>> >>> - unsigned 32 bit seed
>> >>> - pointer to output buffer
>> >>>
>> >>> No idea why the Codec version allows additional input such as one (or
>> >>> two) longs, etc.
>> >>> Unless there is a defined standard for how these should be handled, it
>> >>> will be impossible to prove that they behave correctly.
>> >>>
>> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
>> >>> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]>
>> wrote:
>> >>>>>
>> >>>>> I feel like I am missing something basic in the assumption of this
>> >>> issue:
>> >>>>> there is no such thing as an unsigned int in Java and the ticket
>> talks
>> >>>>> about (C?) unsigned ints. Please help me understand how or why we
>> >>> should
>> >>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
>> >>>>
>> >>>> When a byte is converted to an int there is sign extension if it is
>> >>> negative. A negative byte will have all bits above the 8th bit set to
>> 1. So
>> >>> if the byte is negative then when converted to an int for bit shift
>> and xor
>> >>> operations the raw bits are not the same.
>> >>>>
>> >>>> These are not the same:
>> >>>>
>> >>>> byte b = -1;
>> >>>> (int) b != (b & 0xff);
>> >>>> b << 8 != (b & 0xff) << 8;
>> >>>> b << 16 != (b & 0xff) << 16;
>> >>>>
>> >>>> The original code has the use of the 0xff mask for most of the
>> murmur3
>> >>> algorithm. It has been missed from the final steps applied the the
>> last 3
>> >>> bytes in the hash32 algorithm variant.
>> >>>>
>> >>>> Alex
>> >>>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>> Thank you,
>> >>>>> Gary
>> >>>>>
>> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
>> >>>>>
>> >>>>>> There is an error in the current Murmur3 code introduced by sign
>> >>> extension
>> >>>>>> errors.  This is documented in CODEC-264.[1]
>> >>>>>>
>> >>>>>> I have created a pull request to fix it.[2]
>> >>>>>>
>> >>>>>> While the code changes did not change any of the existing Murmur3
>> >>> tests, I
>> >>>>>> did add new tests that failed until the changes were applied.
>> >>>>>>
>> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
>> the
>> >>> wild
>> >>>>>> and this change will impact some hashes.
>> >>>>>>
>> >>>>>> Therefore, I am asking for a discussion concerning how to apply any
>> >>> patches
>> >>>>>> like CODEC-264 to hashing algorithms that are in the wild.
>> >>>>>>
>> >>>>>> Thx,
>> >>>>>> Claude
>> >>>>>>
>> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
>> >>>>>>
>> >>>>
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: [hidden email]
>> >>>> For additional commands, e-mail: [hidden email]
>> >>>>
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: [hidden email]
>> >>> For additional commands, e-mail: [hidden email]
>> >>>
>> >>>
>> >>
>> >> --
>> >> I like: Like Like - The likeliest place on the web
>> >> <http://like-like.xenei.com>
>> >> LinkedIn: http://www.linkedin.com/in/claudewarren
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [hidden email] <mailto:
>> [hidden email]>
>> > For additional commands, e-mail: [hidden email] <mailto:
>> [hidden email]>
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren
https://github.com/apache/commons-codec/pull/27

On Mon, Nov 11, 2019 at 4:25 PM Claude Warren <[hidden email]> wrote:

> I took the approach that I would leave the original code there and add new
> methods hash128_x64 and hash32_x86.  I also marked the older methods as
> deprecated with a note that the implementation has a sign extension error
> and to use the new methods for new development and cases where the change
> does not impact the code in the wild.
>
> Claude
>
> On Mon, Nov 4, 2019 at 10:50 AM Claude Warren <[hidden email]> wrote:
>
>> There are a number of issues with the format and potential bugs in the
>> Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
>> that tripped me up was the mix of tab/space lines.  Anyway, these issues
>> can be fixed in later pull requests.  I think is one overarching question
>> and one specific question:
>>
>> In general, what is the policy for how do we deal with digest
>> implementations that have a bug and that have been released into the wild?
>>
>> Specifically in this case, what do we do to solve the problem?
>>
>> I was in favor of the following specific action:
>>
>> 1. Deprecate all affected methods and document the issue.
>> 2. Move the deprecated methods into a new class called HiveMurmur3 and
>> have all deprecation point to this class.
>> 3. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>>
>> But the more I think about it the more I think that we should follow KISS
>> and
>>
>> 1. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>> 2. Update the javadocs for the defective methods with notes indicating
>> they have defects.  Perhaps write a readme type document explaining in
>> detail the issue.
>>
>> I think in general we should follow the same action with any digest
>> defect: New methods and Javadoc old.
>>
>> Claude
>>
>>
>>
>>
>> On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <[hidden email]>
>> wrote:
>>
>>>
>>>
>>> > On 4 Nov 2019, at 02:13, sebb <[hidden email]> wrote:
>>> >
>>> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <[hidden email] <mailto:
>>> [hidden email]>> wrote:
>>> >>
>>> >> I think the way to prove they behave correctly is to test the results
>>> >> against the original "C" implementation.  With this in mind I
>>> proposed the
>>> >> changes.
>>> >>
>>> >> I agree that there should be a way to revert to the old code as there
>>> is
>>> >> code in the wild that depends on the earlier implementation.
>>> >>
>>> >> I know that there are broken implementations in other Apache
>>> projects, but
>>> >> those are not libraries (i.e. Cassandra).
>>> >>
>>> >> I would suggest that perhaps the changes that I provided be renamed as
>>> >> hash128Standard and hash32Standard (or some other designator to show
>>> that
>>> >> it is different from the original CODEC implementation).
>>> >>
>>> >> Alternatively hash128 and hash32 could be deprecated  and replaced
>>> with
>>> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions
>>> for
>>> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>>> >> hashx86_128 implementation.
>>> >>
>>> >> My opinion is that we should provide Murmur3 hash implementations that
>>> >> produce the same results as the "C"/C++ implementations for cross
>>> >> application compatibility.
>>> >
>>> > I agree: but the original implementation has far fewer methods.
>>> > So how does one check the Codec methods that have no corresponding
>>> C/C++ method?
>>>
>>> AFAICT the methods that accept a short, int, or long are convenience
>>> methods to avoid creating a byte[] for the data. The JUnit tests
>>> demonstrate this by passing the same values to a ByteBuffer and then
>>> calling the corresponding byte[] methods. The methods do assume the endian
>>> order of the primitive datatypes as processed by a default ByteBuffer (Big
>>> Endian). So perhaps this should be added to the Javadoc with an example for
>>> the equivalent call to the method using the byte[].
>>>
>>> Note that these methods will not be broken. The bug in the code is for
>>> input byte[] lengths that are not a factor of 4. The fix is for any of the
>>> leftover 1, 2, or 3 bytes when computing the hash.
>>>
>>> The broken methods are:
>>>
>>> public static int hash32(final byte[] data)
>>> public static int hash32(final String data)
>>> public static int hash32(final byte[] data, final int length)
>>> public static int hash32(final byte[] data, final int length, final int
>>> seed)
>>> public static int hash32(final byte[] data, final int offset, final int
>>> length, final int seed)
>>>
>>> This class is also broken:
>>> public static class IncrementalHash32
>>>
>>> >
>>> > Maybe the Codec implementation needs to be drastically pruned as well.
>>> >
>>> >> Claude
>>> >>
>>> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <[hidden email]> wrote:
>>> >>
>>> >>> As I see it, there are two use cases here.
>>> >>>
>>> >>> 1) Commons Codec code is used exclusively, in which case it is
>>> >>> essential that the output does not change for a given seed.
>>> >>>
>>> >>> 2) Commons Codec is used in conjunction with other implementations,
>>> in
>>> >>> which case it is essential that Codec is aligned with other
>>> >>> implementations
>>> >>>
>>> >>> Both use cases seem valid to me, so I think it's important to allow
>>> >>> both old and new algorithms to co-exist going forward.
>>> >>>
>>> >>> However:
>>> >>> Should the default behaviour change (with option to revert)?
>>> >>> Or would it be better to make the corrected behaviour optional?
>>> >>>
>>> >>> Or deliberately break the API to force users to choose between old
>>> and
>>> >>> new behaviour?
>>> >>>
>>> >>> Note that Wikipedia says that the x86 and x64 versions of the
>>> >>> algorithm generate different results in the case of the 128 bit hash.
>>> >>> However Commons Codec does not have different implementations for x86
>>> >>> and x64; it is not clear which has been implemented.
>>> >>> The JavaDoc also fails to mention that there is a 64bit hash.
>>>
>>> And lacks <p> and <a> tags in the header as appropriate.
>>>
>>> >>>
>>> >>> The class clearly needs additional test data; it's possible that
>>> there
>>> >>> are discrepancies in other hash sizes as well.
>>> >>>
>>> >>> The code has multiple implementations for each hash size depending on
>>> >>> the parameter types, which increases the number of tests needed.
>>> >>>
>>> >>> Note that the original C++ implementation only has 3 hash methods:
>>> >>> MurmurHash3_x86_32
>>> >>> MurmurHash3_x86_128
>>> >>> MurmurHash3_x64_128
>>> >>>
>>> >>> AFAICT these all take the same parameters:
>>> >>> - unsigned byte array
>>> >>> - length
>>> >>> - unsigned 32 bit seed
>>> >>> - pointer to output buffer
>>> >>>
>>> >>> No idea why the Codec version allows additional input such as one (or
>>> >>> two) longs, etc.
>>> >>> Unless there is a defined standard for how these should be handled,
>>> it
>>> >>> will be impossible to prove that they behave correctly.
>>> >>>
>>> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[hidden email]>
>>> >>> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <[hidden email]>
>>> wrote:
>>> >>>>>
>>> >>>>> I feel like I am missing something basic in the assumption of this
>>> >>> issue:
>>> >>>>> there is no such thing as an unsigned int in Java and the ticket
>>> talks
>>> >>>>> about (C?) unsigned ints. Please help me understand how or why we
>>> >>> should
>>> >>>>> care about C vs. Java ints. Are we comparing apples to oranges
>>> here?
>>> >>>>
>>> >>>> When a byte is converted to an int there is sign extension if it is
>>> >>> negative. A negative byte will have all bits above the 8th bit set
>>> to 1. So
>>> >>> if the byte is negative then when converted to an int for bit shift
>>> and xor
>>> >>> operations the raw bits are not the same.
>>> >>>>
>>> >>>> These are not the same:
>>> >>>>
>>> >>>> byte b = -1;
>>> >>>> (int) b != (b & 0xff);
>>> >>>> b << 8 != (b & 0xff) << 8;
>>> >>>> b << 16 != (b & 0xff) << 16;
>>> >>>>
>>> >>>> The original code has the use of the 0xff mask for most of the
>>> murmur3
>>> >>> algorithm. It has been missed from the final steps applied the the
>>> last 3
>>> >>> bytes in the hash32 algorithm variant.
>>> >>>>
>>> >>>> Alex
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Thank you,
>>> >>>>> Gary
>>> >>>>>
>>> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <[hidden email]> wrote:
>>> >>>>>
>>> >>>>>> There is an error in the current Murmur3 code introduced by sign
>>> >>> extension
>>> >>>>>> errors.  This is documented in CODEC-264.[1]
>>> >>>>>>
>>> >>>>>> I have created a pull request to fix it.[2]
>>> >>>>>>
>>> >>>>>> While the code changes did not change any of the existing Murmur3
>>> >>> tests, I
>>> >>>>>> did add new tests that failed until the changes were applied.
>>> >>>>>>
>>> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
>>> the
>>> >>> wild
>>> >>>>>> and this change will impact some hashes.
>>> >>>>>>
>>> >>>>>> Therefore, I am asking for a discussion concerning how to apply
>>> any
>>> >>> patches
>>> >>>>>> like CODEC-264 to hashing algorithms that are in the wild.
>>> >>>>>>
>>> >>>>>> Thx,
>>> >>>>>> Claude
>>> >>>>>>
>>> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>>> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
>>> >>>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> ---------------------------------------------------------------------
>>> >>>> To unsubscribe, e-mail: [hidden email]
>>> >>>> For additional commands, e-mail: [hidden email]
>>> >>>>
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: [hidden email]
>>> >>> For additional commands, e-mail: [hidden email]
>>> >>>
>>> >>>
>>> >>
>>> >> --
>>> >> I like: Like Like - The likeliest place on the web
>>> >> <http://like-like.xenei.com>
>>> >> LinkedIn: http://www.linkedin.com/in/claudewarren
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: [hidden email] <mailto:
>>> [hidden email]>
>>> > For additional commands, e-mail: [hidden email] <mailto:
>>> [hidden email]>
>>>
>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren