[collections] Bloom filters

[collections] Bloom filters

Alex Herbert
I had a look through all the BloomFilter code. Thanks Claude for the
contribution.

Some items that need clarifying:


1. HashFunctionIdentity.Signedness

This is not fully documented as to what the sign applies to. There is no
javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
states "Identifies the signedness of the calculations for this function".

I assume this applies to the Hash computation in 'long
HashFunction.apply(byte[], int)'

Does this mean the hash algorithm has a variant that can treat the
bytes/seed as signed or unsigned? If so, which one, because 2 enum values
cannot cover all 4 possibilities. Since there is no javadoc it is
unclear exactly what this property is supposed to indicate.


2. HashFunctionIdentity comparators

The interface HashFunctionIdentity defines constants which are
comparators. Do these need to be here? Typically comparators should be
Serializable. Are these comparators ever going to be bundled with
BloomFilters and serialized? There are lots of ways to compare
HashFunctionIdentities. Comparators are for ordering. So why do you need
to order a HashFunction and why choose the two implementations provided?

 From what I can see the COMMON comparator is used in the main code in
Shape, DynamicHasher and StaticHasher with:

if (compare(a, b) != 0) {
     // not allowed ...
}

The DEEP comparator is not used in the main code.

The use case for checking that the hash function is the same seems to be
better satisfied by using a package private HashFunctionIdentityEquality
helper method:

     static boolean areEqual(HashFunctionIdentity a,
                             HashFunctionIdentity b) {
         return (a.getSignedness() == b.getSignedness() &&
                 a.getProcessType() == b.getProcessType() &&
                 a.getName().equalsIgnoreCase(b.getName()));
     }

This would drop the comparators that no-one actually requires and
clean up the public API.

I note that not everything uses the comparator.
AbstractBloomFilter.verfiyHasher uses:

if (a.getSignature() != b.getSignature()) {
     // not allowed ...
}

However the signature (a long value) could be the same by chance with
p=2^-64. So why is this class allowed to use getSignature() while others
do a full hash identity comparison? It seems to me you either do one or
the other: a full property equality test, or a quick signature check
(which is error prone); but not mix and match in different parts of the
code.


3. AbstractBloomFilter

I considered changing the cardinality methods to avoid using BitSet
cardinality() and computing the cardinality directly. BitSet will
duplicate the input long[] in the constructor. For the or/xor/and
cardinality this is 2 array creations that can be avoided with a direct
implementation. For the cardinality method it avoids 1 array creation
via BitSet. Here is the type of implementation for a direct computation:

     @Override
     public int cardinality() {
         int count = 0;
         for (final long bits : getBits()) {
             count += Long.bitCount(bits);
         }
         return count;
     }

     @Override
     public int andCardinality(final BloomFilter other) {
         verifyShape(other);
         final long[] mine = getBits();
         final long[] theirs = other.getBits();
         final int limit = Integer.min(mine.length, theirs.length);
         int count = 0;
         for (int i = 0; i < limit; i++) {
             count += Long.bitCount(mine[i] & theirs[i]);
         }
         return count;
     }

     @Override
     public int orCardinality(final BloomFilter other) {
         verifyShape(other);
         final long[] mine = getBits();
         final long[] theirs = other.getBits();
         long[] small;
         long[] big;
         if (mine.length > theirs.length) {
             big = mine;
             small = theirs;
         } else {
             small = mine;
             big = theirs;
         }
         int count = 0;
         for (int i = 0; i < small.length; i++) {
             // OR
             count += Long.bitCount(small[i] | big[i]);
         }
         for (int i = small.length; i < big.length; i++) {
             count += Long.bitCount(big[i]);
         }
         return count;
     }

The xorCardinality is the same as orCardinality but using the ^ operator.
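For reference it would be the following (same structure as orCardinality
above; the tail loop is unchanged because XOR against the implicit zero
words of the shorter array leaves the bits as they are):

     @Override
     public int xorCardinality(final BloomFilter other) {
         verifyShape(other);
         final long[] mine = getBits();
         final long[] theirs = other.getBits();
         long[] small;
         long[] big;
         if (mine.length > theirs.length) {
             big = mine;
             small = theirs;
         } else {
             small = mine;
             big = theirs;
         }
         int count = 0;
         for (int i = 0; i < small.length; i++) {
             // XOR
             count += Long.bitCount(small[i] ^ big[i]);
         }
         for (int i = small.length; i < big.length; i++) {
             // XOR with the implicit zero words keeps the bits unchanged
             count += Long.bitCount(big[i]);
         }
         return count;
     }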

I didn't make the change but this should be considered after a
performance test with JMH.

There are a few other places where specialised implementations may be
more performant, so a JMH benchmark would be useful for Bloom filter,
e.g. the important methods:

BloomFilter.contains(Hasher)

BloomFilter.contains(BloomFilter)


Alex





Re: [collections] Bloom filters

Claude Warren
Alex,

Thank you for your comments.

See comments inline.



On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <[hidden email]>
wrote:

> I had a look through all the BloomFilter code. Thanks Claude for the
> contribution.
>
> Some items that need clarifying:
>
>
> 1. HashFunctionIdentity.Signedness
>
> This is not fully documented as to what the sign applies to. There is no
> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
> states "Identifies the signedness of the calculations for this function".
>
> I assume this applies to the Hash computation in 'long
> HashFunction.apply(byte[], int)'
>
> Does this mean the hash algorithm has a variant that can treat the
> bytes/seed as signed or unsigned. If so which one because 2 enum values
> cannot cover all 4 possibilities. Since there is no javadoc it is
> unclear exactly what this property is supposed to indicate.
>
I will attempt to clarify in the javadoc. Basically when the underlying
hash function executes it returns an array of bytes. The algorithm can
treat those bytes as signed or unsigned long values. For the Java
implementations this is probably signed but other languages could
implement it as unsigned. For example the murmur128 hash returns 128 bits;
this could be interpreted as 2 64-bit signed or unsigned longs. The
Signedness attribute describes how that is interpreted. As for the other
values, the seed is a signed value and the byte buffer is interpreted as
per the underlying hash function.


> 2. HashFunctionIdentity comparators
>
> The interface HashFunctionIdentity defines constants which are
> comparators. Do these need to be here? Typically comparators should be
> Serializable. Are these comparators ever going to be bundled with
> BloomFilters and serialized? There are lots of ways to compare
> HashFunctionIdentities. Comparators are for ordering. So why do you need
> to order a HashFunction and why choose the two implementations provided?
>

There was a discussion on the list about building factories where the
provider of the HashFunctionIdentity would be important (Think in terms of
the JCA where multiple providers can provide the same hash/encryption
functions but an application may wish to choose one over the other).

>
>  From what I can see the COMMON comparator is used in the main code in
> Shape, DynamicHasher and StaticHasher with:
>
> if (compare(a, b) != 0) {
>      // not allowed ...
> }
>
> The DEEP comparator is not used in the main code.


> The use case for checking that the hash function is the same seems to be
> better satisfied by using a package private HashFunctionIdentityEquality
> helper method:
>
>      static boolean areEqual(HashFunctionIdentity a,
> HashFunctionIdentity b) {
>          return (a.getSignedness() == b.getSignedness() &&
>              a.getProcessType() == b.getProcessType() &&
>              a.getName().equalsIgnoreCase(b.getName()));
>      }
>
> This would drop the comparators that no-one actually requires and
> clean-up the public API.
>
I can see the value of this solution.

> I note that not everything uses the comparator.

> AbstractBloomFilter.verfiyHasher uses:
>
> if (a.getSignature() != b.getSignature()) {
>      // not allowed ...
> }
>
> However the signature (a long value) could be the same by chance with
> p=2^-64. So why is this class allow to use getSignature() but others do
> a full hash identity comparison? It seems to me you either do one or the
> other: a full property equality test; or a quick signature check (which
> is error prone); but not mix and match in different parts of the code.
>
The reasoning for the second part is that bloom filter comparisons must be
very fast.  The possibility of signature collision is low but not zero.
However, trusting the name has issues as well.  There are a number of
murmur128 hash implementations that are not correct.  Using the signature
will detect those because the signature value is built by the hashing
function executing on the name.  Keep in mind that the actual function that
performed the hash may not be present when the signature is verified.
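
In rough terms the idea is something like the following sketch. This is
illustrative only: the actual seed value, charset and buffer layout used
to build the signature are implementation details and are assumed here.

     // Sketch only: the signature is the hash function applied to its own
     // name (StandardCharsets is java.nio.charset.StandardCharsets; the
     // seed of 0 is an assumption). A broken or mismatched implementation
     // produces a different signature even if the reported name matches.
     long signature = hashFunction.apply(
             hashFunction.getName().getBytes(StandardCharsets.UTF_8), 0);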

>
> 3. AbtractBloomFilter
>
> I considered changing the cardinality methods to avoid using BitSet
> cardinality() and computing the cardinality direct. BitSet will
> duplicate the input long[] in the constructor. For the or/xor/and
> cardinality this is 2 array creations that can be avoided with a direct
> implementation. For the cardinality method it avoids 1 array creation
> via BitSet. Here is the type of implementation for a direct computation:
>
>      @Override
>      public int cardinality() {
>          int count = 0;
>          for (final long bits : getBits()) {
>              count += Long.bitCount(bits);
>          }
>          return count;
>      }
>
>      @Override
>      public int andCardinality(final BloomFilter other) {
>          verifyShape(other);
>          final long[] mine = getBits();
>          final long[] theirs = other.getBits();
>          final int limit = Integer.min(mine.length, theirs.length);
>          int count = 0;
>          for (int i = 0; i < limit; i++) {
>              count += Long.bitCount(mine[i] & theirs[i]);
>          }
>          return count;
>      }
>
>      @Override
>      public int orCardinality(final BloomFilter other) {
>          verifyShape(other);
>          final long[] mine = getBits();
>          final long[] theirs = other.getBits();
>          long[] small;
>          long[] big;
>          if (mine.length > theirs.length) {
>              big = mine;
>              small = theirs;
>          } else {
>              small = mine;
>              big = theirs;
>          }
>          int count = 0;
>          for (int i = 0; i < small.length; i++) {
>              // OR
>              count += Long.bitCount(small[i] | big[i]);
>          }
>          for (int i = small.length; i < big.length; i++) {
>              count += Long.bitCount(big[i]);
>          }
>          return count;
>      }
>
> The xorCardinality is the same as orCardinality but using the ^ operator.
>
> I didn't make the change but this should be considered after a
> performance test with JMH.
>

Indeed your solutions would work against the longs returned by getBits()
and make more sense.


>
> There are a few other places where specialised implementations may be
> more performant so a JMH benchmark would be useful for Bloom filter,
> e.g. the important
>
> BloomFilter.contains(Hasher)
>
> BloomFilter.contains(BloomFilter)
>
If you have better "standard" implementations let's implement them.
Implementations that have special needs (e.g. CountingBloomFilter,
MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as they
see fit.


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert


> On 17 Feb 2020, at 20:30, Claude Warren <[hidden email]> wrote:
>
> Alex,
>
> Thank you for your comments.
>
> See comments inline.
>
>
>
> On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <[hidden email]>
> wrote:
>
>> I had a look through all the BloomFilter code. Thanks Claude for the
>> contribution.
>>
>> Some items that need clarifying:
>>
>>
>> 1. HashFunctionIdentity.Signedness
>>
>> This is not fully documented as to what the sign applies to. There is no
>> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
>> states "Identifies the signedness of the calculations for this function".
>>
>> I assume this applies to the Hash computation in 'long
>> HashFunction.apply(byte[], int)'
>>
>> Does this mean the hash algorithm has a variant that can treat the
>> bytes/seed as signed or unsigned. If so which one because 2 enum values
>> cannot cover all 4 possibilities. Since there is no javadoc it is
>> unclear exactly what this property is supposed to indicate.
>>
>> I will attempt to clarify in the javadoc.  Basically when the underlying
> hash function executes it returns an array of bytes.  The algorithm can
> treat those bytes as signed or unsigned long values.  For the java
> implementations this is probably signed but other languages could implement
> it as unsigned.   For example the murmur128 has returns 128 bytes this
> could be interpreted as 2 64-bit signed or unsigned longs.  The Signedness
> attribute describes how that is interpreted.  As for the other values the
> seed is a signed value and the byte buffer is a byte buffer interpreted as
> per the underlying hash function.

OK. So we add comments to the javadoc to state that the signedness is the interpretation of the return value of

long HashFunction.apply(byte[], int)

So there can be algorithms that return a signed long and those that should return an unsigned long but this is not a supported type in Java.

Either way it doesn’t really matter; it seems more for correctness than for actual practical use, given that unsigned and signed arithmetic on twos-complement integers is the same.

>
>
>> 2. HashFunctionIdentity comparators
>>
>> The interface HashFunctionIdentity defines constants which are
>> comparators. Do these need to be here? Typically comparators should be
>> Serializable. Are these comparators ever going to be bundled with
>> BloomFilters and serialized? There are lots of ways to compare
>> HashFunctionIdentities. Comparators are for ordering. So why do you need
>> to order a HashFunction and why choose the two implementations provided?
>>
>
> There was a discussion on the list about building factories where the
> provider of the HashFunctionIdentity would be important (Think in terms of
> the JCA where multiple providers can provide the same hash/encryption
> functions but an application may wish to choose one over the other).

OK. But given that this is not being done we could drop it from the API for now to make it cleaner.

>
>>
>> From what I can see the COMMON comparator is used in the main code in
>> Shape, DynamicHasher and StaticHasher with:
>>
>> if (compare(a, b) != 0) {
>>     // not allowed ...
>> }
>>
>> The DEEP comparator is not used in the main code.
>
>
>> The use case for checking that the hash function is the same seems to be
>> better satisfied by using a package private HashFunctionIdentityEquality
>> helper method:
>>
>>     static boolean areEqual(HashFunctionIdentity a,
>> HashFunctionIdentity b) {
>>         return (a.getSignedness() == b.getSignedness() &&
>>             a.getProcessType() == b.getProcessType() &&
>>             a.getName().equalsIgnoreCase(b.getName()));
>>     }
>>
>> This would drop the comparators that no-one actually requires and
>> clean-up the public API.
>>
>> I can see the value of this solution.
>
> I note that not everything uses the comparator.
>> AbstractBloomFilter.verfiyHasher uses:
>>
>> if (a.getSignature() != b.getSignature()) {
>>     // not allowed ...
>> }
>>
>> However the signature (a long value) could be the same by chance with
>> p=2^-64. So why is this class allow to use getSignature() but others do
>> a full hash identity comparison? It seems to me you either do one or the
>> other: a full property equality test; or a quick signature check (which
>> is error prone); but not mix and match in different parts of the code.
>>
>> The reasoning for the second part is that bloom filter comparisons must be
> very fast.  The possibility of signature collision is low but not zero.
> However, trusting the name has issues as well.  There are a number of
> murmur128 hash implementations that are not correct.  Using the signature
> will detect those because the signature value is built by the hashing
> function executing on the name.  Keep in mind that the actual function that
> performed the hash may not be present when the signature is verified.

So given, as you point out, that the comparison of the name is also open to flaws I would suggest dropping the comparators and using getSignature() to check for hash function equality. It is not perfect but then no other solution is either and at least it is fast.

I think at least for the Murmur32x86Iterative class, which is stateless, there is a case for a getInstance() method which could return a singleton. Perhaps this pattern should be applied to all the HashFunction implementations, i.e. private constructor and a getInstance() method to obtain an instance.
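
The pattern would be something like the sketch below, shown on a
hypothetical stateless hash function class rather than the real one
(the class name and method body are illustrative only):

    // Illustrative only: private constructor plus getInstance() on a
    // stateless function, so a single shared instance is thread-safe.
    public final class ExampleHashFunction {
        private static final ExampleHashFunction INSTANCE = new ExampleHashFunction();

        private ExampleHashFunction() {
            // no instance state
        }

        public static ExampleHashFunction getInstance() {
            return INSTANCE;
        }

        public long apply(final byte[] buffer, final int seed) {
            // stateless computation over buffer and seed would go here
            return 0L;
        }
    }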

>
>>
>> 3. AbtractBloomFilter
>>
>> I considered changing the cardinality methods to avoid using BitSet
>> cardinality() and computing the cardinality direct. BitSet will
>> duplicate the input long[] in the constructor. For the or/xor/and
>> cardinality this is 2 array creations that can be avoided with a direct
>> implementation. For the cardinality method it avoids 1 array creation
>> via BitSet. Here is the type of implementation for a direct computation:
>>
>>     @Override
>>     public int cardinality() {
>>         int count = 0;
>>         for (final long bits : getBits()) {
>>             count += Long.bitCount(bits);
>>         }
>>         return count;
>>     }
>>
>>     @Override
>>     public int andCardinality(final BloomFilter other) {
>>         verifyShape(other);
>>         final long[] mine = getBits();
>>         final long[] theirs = other.getBits();
>>         final int limit = Integer.min(mine.length, theirs.length);
>>         int count = 0;
>>         for (int i = 0; i < limit; i++) {
>>             count += Long.bitCount(mine[i] & theirs[i]);
>>         }
>>         return count;
>>     }
>>
>>     @Override
>>     public int orCardinality(final BloomFilter other) {
>>         verifyShape(other);
>>         final long[] mine = getBits();
>>         final long[] theirs = other.getBits();
>>         long[] small;
>>         long[] big;
>>         if (mine.length > theirs.length) {
>>             big = mine;
>>             small = theirs;
>>         } else {
>>             small = mine;
>>             big = theirs;
>>         }
>>         int count = 0;
>>         for (int i = 0; i < small.length; i++) {
>>             // OR
>>             count += Long.bitCount(small[i] | big[i]);
>>         }
>>         for (int i = small.length; i < big.length; i++) {
>>             count += Long.bitCount(big[i]);
>>         }
>>         return count;
>>     }
>>
>> The xorCardinality is the same as orCardinality but using the ^ operator.
>>
>> I didn't make the change but this should be considered after a
>> performance test with JMH.
>>
>
> Indeed your solutions would work against the longs returned by getBits()
> and makes more sense.

I will look at how other projects have incorporated JMH for performance tests. It would be valuable here.

The purpose would be to show that it is faster to use a BloomFilter.contains(…) method in front of a full check for object presence (i.e. in a HashMap) rather than just a HashMap. Do you have any such examples? I remember a while back you posted some information on speed testing. Say, for example, a filter storing 10000 random URLs, then testing whether random URLs (with some actual bad hits mixed in) are in/not in the set of bad URLs. I could mock this up as a simple speed test for the Bloom filter but if you have other examples then it would be useful.
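
As a rough starting point, a JMH skeleton for that URL test could look
like the following. Everything named here is illustrative; the Bloom
filter is stood in for by a plain Predicate<String>, since the filter
construction API is exactly what is under discussion.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.SplittableRandom;
    import java.util.concurrent.TimeUnit;
    import java.util.function.Predicate;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Benchmark)
    public class UrlFilterBenchmark {
        private Set<String> badUrls;
        private String[] queries;
        private Predicate<String> prefilter; // stand-in for a Bloom filter front check

        @Setup
        public void setup() {
            final SplittableRandom rng = new SplittableRandom(42);
            badUrls = new HashSet<>();
            for (int i = 0; i < 10_000; i++) {
                badUrls.add("http://bad.example/" + rng.nextLong());
            }
            // Queries are mostly misses with a few genuine bad URLs mixed in.
            queries = new String[100_000];
            final String[] bad = badUrls.toArray(new String[0]);
            for (int i = 0; i < queries.length; i++) {
                queries[i] = rng.nextInt(100) == 0
                    ? bad[rng.nextInt(bad.length)]
                    : "http://ok.example/" + rng.nextLong();
            }
            // Replace with a real Bloom filter built over badUrls when benchmarking.
            prefilter = badUrls::contains;
        }

        @Benchmark
        public int hashMapOnly() {
            int hits = 0;
            for (final String url : queries) {
                if (badUrls.contains(url)) {
                    hits++;
                }
            }
            return hits;
        }

        @Benchmark
        public int bloomThenHashMap() {
            int hits = 0;
            for (final String url : queries) {
                // The filter answers "definitely not present" cheaply; only
                // possible hits fall through to the full set lookup.
                if (prefilter.test(url) && badUrls.contains(url)) {
                    hits++;
                }
            }
            return hits;
        }
    }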

>
>
>>
>> There are a few other places where specialised implementations may be
>> more performant so a JMH benchmark would be useful for Bloom filter,
>> e.g. the important
>>
>> BloomFilter.contains(Hasher)
>>
>> BloomFilter.contains(BloomFilter)
>>
>> If you have better "standard"  implementations let's implement them.
> Implementations that have special needs (e.g. CountingBloomFilter,
> MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as they
> see fit.

I was actually just wondering about a custom implementation of BitSetBloomFilter. There are only a few operations from BitSet that are used:

cardinality()
setBit(int)
or(BitSet)
and(BitSet)
xor(BitSet)
stream()

However BitSet has some overhead in that it maintains the ability to expand in size and will do so for some of these operations. If replaced with a FixedBitSetBloomFilter then it may be faster at the cost of having a large initial allocation overhead.

At least in the case of the or/and/xor methods these are used with a cloned BitSet for a cardinality count. This has the memory allocation overhead for a potentially large object that will be thrown away. It is used via the default method for BloomFilter.contains(BloomFilter) which calls andCardinality. This may be the most commonly used method for a BitSetBloomFilter other than BloomFilter.contains(Hasher).

All the methods apart from stream() are easy to implement. In that case it seems what is actually required is a primitive iterator of nextSetBit(int). So they require some real world use cases to determine if a custom FixedBitSetBloomFilter with its own internal long[] would be better than delegating to BitSet.
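
To make the nextSetBit(int) point concrete, a minimal sketch of a
fixed-size long[] bit set is below. It is not a proposed API; the class
and method names are illustrative and it assumes non-negative indices.

    // Minimal fixed-size bit set backed by a long[] sized from the number of bits.
    final class FixedBitSet {
        private final long[] words;

        FixedBitSet(final int numberOfBits) {
            words = new long[(numberOfBits + Long.SIZE - 1) / Long.SIZE];
        }

        void set(final int bitIndex) {
            // bitIndex >> 6 selects the word; << uses only the low 6 bits of the shift
            words[bitIndex >> 6] |= 1L << bitIndex;
        }

        boolean get(final int bitIndex) {
            return (words[bitIndex >> 6] & (1L << bitIndex)) != 0;
        }

        int cardinality() {
            int count = 0;
            for (final long word : words) {
                count += Long.bitCount(word);
            }
            return count;
        }

        // Equivalent of BitSet.nextSetBit(int); returns -1 when no more bits are set.
        int nextSetBit(final int fromIndex) {
            int wordIndex = fromIndex >> 6;
            if (wordIndex >= words.length) {
                return -1;
            }
            // Mask off the bits below fromIndex in the first word
            long word = words[wordIndex] & (-1L << fromIndex);
            while (true) {
                if (word != 0) {
                    return (wordIndex << 6) + Long.numberOfTrailingZeros(word);
                }
                if (++wordIndex == words.length) {
                    return -1;
                }
                word = words[wordIndex];
            }
        }
    }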


Re: [collections] Bloom filters

Claude Warren
On Mon, Feb 17, 2020 at 9:52 PM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 17 Feb 2020, at 20:30, Claude Warren <[hidden email]> wrote:
> >
> > Alex,
> >
> > Thank you for your comments.
> >
> > See comments inline.
> >
> >
> >
> > On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <[hidden email]
> <mailto:[hidden email]>>
> > wrote:
> >
> >> I had a look through all the BloomFilter code. Thanks Claude for the
> >> contribution.
> >>
> >> Some items that need clarifying:
> >>
> >>
> >> 1. HashFunctionIdentity.Signedness
> >>
> >> This is not fully documented as to what the sign applies to. There is no
> >> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
> >> states "Identifies the signedness of the calculations for this
> function".
> >>
> >> I assume this applies to the Hash computation in 'long
> >> HashFunction.apply(byte[], int)'
> >>
> >> Does this mean the hash algorithm has a variant that can treat the
> >> bytes/seed as signed or unsigned. If so which one because 2 enum values
> >> cannot cover all 4 possibilities. Since there is no javadoc it is
> >> unclear exactly what this property is supposed to indicate.
> >>
> >> I will attempt to clarify in the javadoc.  Basically when the underlying
> > hash function executes it returns an array of bytes.  The algorithm can
> > treat those bytes as signed or unsigned long values.  For the java
> > implementations this is probably signed but other languages could
> implement
> > it as unsigned.   For example the murmur128 has returns 128 bytes this
> > could be interpreted as 2 64-bit signed or unsigned longs.  The
> Signedness
> > attribute describes how that is interpreted.  As for the other values the
> > seed is a signed value and the byte buffer is a byte buffer interpreted
> as
> > per the underlying hash function.
>
> OK. So we add comments to the javadoc to state that the signedness is the
> interpretation of the return value of
>
> long HashFunction.apply(byte[], int)
>
> So there can be algorithms that return a signed long and those that should
> return an unsigned long but this is not a supported type in Java.
>
> Either way it doesn’t really matter, it seems more for correctness than
> for actual practical use given that unsigned and signed arithmetic on
> twos-complement integers is the same.
>

My maths is rusty.  If A=0xF000ABCD is interpreted as unsigned and
B=0xF000ABCD is interpreted as signed, does (A mod N) = (B mod N) for all
positive values of N?  If so then you are correct and Signedness does not
matter and can be removed.  I thought that the statement was false.


> >
> >
> >> 2. HashFunctionIdentity comparators
> >>
> >> The interface HashFunctionIdentity defines constants which are
> >> comparators. Do these need to be here? Typically comparators should be
> >> Serializable. Are these comparators ever going to be bundled with
> >> BloomFilters and serialized? There are lots of ways to compare
> >> HashFunctionIdentities. Comparators are for ordering. So why do you need
> >> to order a HashFunction and why choose the two implementations provided?
> >>
> >
> > There was a discussion on the list about building factories where the
> > provider of the HashFunctionIdentity would be important (Think in terms
> of
> > the JCA where multiple providers can provide the same hash/encryption
> > functions but an application may wish to choose one over the other).
>
> OK. But given that this is not being done we could drop it from the API
> for now to make it cleaner.
>
OK

>
> >
> >>
> >> From what I can see the COMMON comparator is used in the main code in
> >> Shape, DynamicHasher and StaticHasher with:
> >>
> >> if (compare(a, b) != 0) {
> >>     // not allowed ...
> >> }
> >>
> >> The DEEP comparator is not used in the main code.
> >
> >
> >> The use case for checking that the hash function is the same seems to be
> >> better satisfied by using a package private HashFunctionIdentityEquality
> >> helper method:
> >>
> >>     static boolean areEqual(HashFunctionIdentity a,
> >> HashFunctionIdentity b) {
> >>         return (a.getSignedness() == b.getSignedness() &&
> >>             a.getProcessType() == b.getProcessType() &&
> >>             a.getName().equalsIgnoreCase(b.getName()));
> >>     }
> >>
> >> This would drop the comparators that no-one actually requires and
> >> clean-up the public API.
> >>
> >> I can see the value of this solution.
> >
> > I note that not everything uses the comparator.
> >> AbstractBloomFilter.verfiyHasher uses:
> >>
> >> if (a.getSignature() != b.getSignature()) {
> >>     // not allowed ...
> >> }
> >>
> >> However the signature (a long value) could be the same by chance with
> >> p=2^-64. So why is this class allow to use getSignature() but others do
> >> a full hash identity comparison? It seems to me you either do one or the
> >> other: a full property equality test; or a quick signature check (which
> >> is error prone); but not mix and match in different parts of the code.
> >>
> >> The reasoning for the second part is that bloom filter comparisons must
> be
> > very fast.  The possibility of signature collision is low but not zero.
> > However, trusting the name has issues as well.  There are a number of
> > murmur128 hash implementations that are not correct.  Using the signature
> > will detect those because the signature value is built by the hashing
> > function executing on the name.  Keep in mind that the actual function
> that
> > performed the hash may not be present when the signature is verified.
>
> So given as you point out that the comparison of the name is also open to
> flaws I would suggest dropping the comparators and using getSignature() to
> check for hash function equality. It is not perfect but then no other
> solution is either and as least it is fast.
>

Q: The code used earlier in the process (while building the filters) uses
the comparators to do a deeper check at a point where the impact of such a
check is minimal.  The later signature check is used where speed is
required and we can reasonably assume the deeper check has been performed.
Does it not make sense to retain both?


> I think at least for the Murmur32x86Iterative class which is stateless
> there is a case for a getInstance() method which could return a singleton
> given the methods are stateless. Perhaps this pattern should be applied to
> all the HashFunction implementations, i.e. private constructor and a
> getInstance() method to obtain an instance.
>
>
I am not certain this can be done in all cases.  Certainly the javadoc
would need to specify that the instance returned must be thread safe.  For
example a simple static instance of MessageDigest (currently used in the
MD5 implementation) could not be used. Additionally, in genomics I believe
there are hasher constructors that require properties of the object being
hashed before the hashing begins.


> >
> >>
> >> 3. AbtractBloomFilter
> >>
> >> I considered changing the cardinality methods to avoid using BitSet
> >> cardinality() and computing the cardinality direct. BitSet will
> >> duplicate the input long[] in the constructor. For the or/xor/and
> >> cardinality this is 2 array creations that can be avoided with a direct
> >> implementation. For the cardinality method it avoids 1 array creation
> >> via BitSet. Here is the type of implementation for a direct computation:
> >>
> >>     @Override
> >>     public int cardinality() {
> >>         int count = 0;
> >>         for (final long bits : getBits()) {
> >>             count += Long.bitCount(bits);
> >>         }
> >>         return count;
> >>     }
> >>
> >>     @Override
> >>     public int andCardinality(final BloomFilter other) {
> >>         verifyShape(other);
> >>         final long[] mine = getBits();
> >>         final long[] theirs = other.getBits();
> >>         final int limit = Integer.min(mine.length, theirs.length);
> >>         int count = 0;
> >>         for (int i = 0; i < limit; i++) {
> >>             count += Long.bitCount(mine[i] & theirs[i]);
> >>         }
> >>         return count;
> >>     }
> >>
> >>     @Override
> >>     public int orCardinality(final BloomFilter other) {
> >>         verifyShape(other);
> >>         final long[] mine = getBits();
> >>         final long[] theirs = other.getBits();
> >>         long[] small;
> >>         long[] big;
> >>         if (mine.length > theirs.length) {
> >>             big = mine;
> >>             small = theirs;
> >>         } else {
> >>             small = mine;
> >>             big = theirs;
> >>         }
> >>         int count = 0;
> >>         for (int i = 0; i < small.length; i++) {
> >>             // OR
> >>             count += Long.bitCount(small[i] | big[i]);
> >>         }
> >>         for (int i = small.length; i < big.length; i++) {
> >>             count += Long.bitCount(big[i]);
> >>         }
> >>         return count;
> >>     }
> >>
> >> The xorCardinality is the same as orCardinality but using the ^
> operator.
> >>
> >> I didn't make the change but this should be considered after a
> >> performance test with JMH.
> >>
> >
> > Indeed your solutions would work against the longs returned by getBits()
> > and makes more sense.
>
> I will look at how other projects have incorporated JMH for performance
> tests. It would be valuable here.
>
> The purpose would be to show that it is faster to use a
> BloomFilter.contains(…) method in front of a full check for object presence
> (i.e. in a HashMap) rather than just a HashMap. Do you have any such
> examples? I remember a while back you posted some information on speed
> testing. Say for example a filter to store 10000 random URLs and then
> testing random URL hits (with some actual bad hits) are in/not in the set
> of bad URLs. I could mock this up as a simple speed test for the Bloom
> filter but if you have other examples then it would be useful.
>
>
The classes that perform the testing I reported earlier are in
https://github.com/Claudenw/BloomTest.


> >
> >
> >>
> >> There are a few other places where specialised implementations may be
> >> more performant so a JMH benchmark would be useful for Bloom filter,
> >> e.g. the important
> >>
> >> BloomFilter.contains(Hasher)
> >>
> >> BloomFilter.contains(BloomFilter)
> >>
> >> If you have better "standard"  implementations let's implement them.
> > Implementations that have special needs (e.g. CountingBloomFilter,
> > MultidimensionalBloomFilter, AttenuatedBloomFilter) will implement as
> they
> > see fit.
>
> I was actually just wondering about a custom implementation of
> BitSetBloomFilter. There are only a few operations from BitSet that are
> used:
>
> cardinality()
> setBit(int)
> or(BitSet)
> and(BitSet)
> xor(BitSet)
> stream()
>
> However BitSet has some overhead in that it maintains the ability to
> expand in size and will do so for some of these operations. If replaced
> with a FixedBitSetBloomFilter then it may be faster at the cost of having a
> large initial allocation overhead.
>
> At least in the case of the or/and/xor methods these are used with a
> cloned BitSet for a cardinality count. This has the memory allocation
> overhead for a potentially large object that will be thrown away. It is
> used via the default method for BloomFilter.contains(BloomFilter) which
> calls andCardinality. This may be the most commonly used method for a
> BitSetBloomFilter other than BloomFilter.contains(Hasher).
>
> All the methods apart from stream() are easy to implement. In that case it
> seems what is actually required is a primitive iterator of nextSetBit(int).
> So they require some real world use cases to determine if a custom
> FixedBitSetBloomFilter with its own internal long[] would be better than
> delegating to BitSet.
>
I have an example of a different bitset implementation using an EWAH
compressed bitmap.  It can be found in
https://github.com/Claudenw/MultidimentionalBloom/blob/master/src/main/java/org/xenei/bloom/filter/EWAHBloomFilter.java
part of a multidimensional Bloom filter library:
https://github.com/Claudenw/MultidimentionalBloom

You can also find some interesting reading concerning bitmaps and Bloom
filter implementations from Daniel Lemire on his GitHub:
https://github.com/lemire



--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert


> On 18 Feb 2020, at 08:02, Claude Warren <[hidden email]> wrote:
>
> On Mon, Feb 17, 2020 at 9:52 PM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>>> On 17 Feb 2020, at 20:30, Claude Warren <[hidden email] <mailto:[hidden email]>> wrote:
>>>
>>> Alex,
>>>
>>> Thank you for your comments.
>>>
>>> See comments inline.
>>>
>>>
>>>
>>> On Mon, Feb 17, 2020 at 3:20 PM Alex Herbert <[hidden email] <mailto:[hidden email]>
>> <mailto:[hidden email] <mailto:[hidden email]>>>
>>> wrote:
>>>
>>>> I had a look through all the BloomFilter code. Thanks Claude for the
>>>> contribution.
>>>>
>>>> Some items that need clarifying:
>>>>
>>>>
>>>> 1. HashFunctionIdentity.Signedness
>>>>
>>>> This is not fully documented as to what the sign applies to. There is no
>>>> javadoc on the enum values for SIGNED and UNSIGNED. The current javadoc
>>>> states "Identifies the signedness of the calculations for this
>> function".
>>>>
>>>> I assume this applies to the Hash computation in 'long
>>>> HashFunction.apply(byte[], int)'
>>>>
>>>> Does this mean the hash algorithm has a variant that can treat the
>>>> bytes/seed as signed or unsigned. If so which one because 2 enum values
>>>> cannot cover all 4 possibilities. Since there is no javadoc it is
>>>> unclear exactly what this property is supposed to indicate.
>>>>
>>>> I will attempt to clarify in the javadoc.  Basically when the underlying
>>> hash function executes it returns an array of bytes.  The algorithm can
>>> treat those bytes as signed or unsigned long values.  For the java
>>> implementations this is probably signed but other languages could
>> implement
>>> it as unsigned.   For example the murmur128 has returns 128 bytes this
>>> could be interpreted as 2 64-bit signed or unsigned longs.  The
>> Signedness
>>> attribute describes how that is interpreted.  As for the other values the
>>> seed is a signed value and the byte buffer is a byte buffer interpreted
>> as
>>> per the underlying hash function.
>>
>> OK. So we add comments to the javadoc to state that the signedness is the
>> interpretation of the return value of
>>
>> long HashFunction.apply(byte[], int)
>>
>> So there can be algorithms that return a signed long and those that should
>> return an unsigned long but this is not a supported type in Java.
>>
>> Either way it doesn’t really matter, it seems more for correctness than
>> for actual practical use given that unsigned and signed arithmetic on
>> twos-complement integers is the same.
>>
>
> My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned  and
> B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N) for all
> positive values of N?  If so then you are correct and Signedness does not
> matter and can be removed.  I thought that the statement was false.
>

Yes, you are correct. Addition and multiplication use the bits in the same way. Modulus does not.
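
For example, with the value from your message (comments show the
computed results):

    // Same 32-bit pattern, two interpretations.
    int raw = 0xF000ABCD;                              // negative as a signed int
    long signed = raw;                                 // sign-extended: -268391475
    long unsigned = Integer.toUnsignedLong(raw);       // 4026575821
    // Within one fixed width, + and * produce the same bit pattern either
    // way, but the modulus does not:
    System.out.println(Math.floorMod(signed, 100L));   // 25
    System.out.println(unsigned % 100L);               // 21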

So are the Murmur3 hash functions actually unsigned? I thought the original C code returned unsigned 64-bit values (uint64_t).

IIUC the purpose of the Signedness enum is to specify that a negative value returned from the hash function should be used as a numerical magnitude if then used in downstream computations since sign extension on negatives will set leading bits to 1s. I.e. using the negative as if it were unsigned (and all bits are random) is not valid. Perhaps you can give an example of how signed or unsigned would be handled differently.

>
>>>
>>>
>>>> 2. HashFunctionIdentity comparators
>>>>
>>>> The interface HashFunctionIdentity defines constants which are
>>>> comparators. Do these need to be here? Typically comparators should be
>>>> Serializable. Are these comparators ever going to be bundled with
>>>> BloomFilters and serialized? There are lots of ways to compare
>>>> HashFunctionIdentities. Comparators are for ordering. So why do you need
>>>> to order a HashFunction and why choose the two implementations provided?
>>>>
>>>
>>> There was a discussion on the list about building factories where the
>>> provider of the HashFunctionIdentity would be important (Think in terms
>> of
>>> the JCA where multiple providers can provide the same hash/encryption
>>> functions but an application may wish to choose one over the other).
>>
>> OK. But given that this is not being done we could drop it from the API
>> for now to make it cleaner.
>>
> OK
>
>>
>>>
>>>>
>>>> From what I can see the COMMON comparator is used in the main code in
>>>> Shape, DynamicHasher and StaticHasher with:
>>>>
>>>> if (compare(a, b) != 0) {
>>>>    // not allowed ...
>>>> }
>>>>
>>>> The DEEP comparator is not used in the main code.
>>>
>>>
>>>> The use case for checking that the hash function is the same seems to be
>>>> better satisfied by using a package private HashFunctionIdentityEquality
>>>> helper method:
>>>>
>>>>    static boolean areEqual(HashFunctionIdentity a,
>>>> HashFunctionIdentity b) {
>>>>        return (a.getSignedness() == b.getSignedness() &&
>>>>            a.getProcessType() == b.getProcessType() &&
>>>>            a.getName().equalsIgnoreCase(b.getName()));
>>>>    }
>>>>
>>>> This would drop the comparators that no-one actually requires and
>>>> clean-up the public API.
>>>>
>>>> I can see the value of this solution.
>>>
>>> I note that not everything uses the comparator.
>>>> AbstractBloomFilter.verfiyHasher uses:
>>>>
>>>> if (a.getSignature() != b.getSignature()) {
>>>>    // not allowed ...
>>>> }
>>>>
>>>> However the signature (a long value) could be the same by chance with
>>>> p=2^-64. So why is this class allow to use getSignature() but others do
>>>> a full hash identity comparison? It seems to me you either do one or the
>>>> other: a full property equality test; or a quick signature check (which
>>>> is error prone); but not mix and match in different parts of the code.
>>>>
>>>> The reasoning for the second part is that bloom filter comparisons must
>> be
>>> very fast.  The possibility of signature collision is low but not zero.
>>> However, trusting the name has issues as well.  There are a number of
>>> murmur128 hash implementations that are not correct.  Using the signature
>>> will detect those because the signature value is built by the hashing
>>> function executing on the name.  Keep in mind that the actual function
>> that
>>> performed the hash may not be present when the signature is verified.
>>
>> So given as you point out that the comparison of the name is also open to
>> flaws I would suggest dropping the comparators and using getSignature() to
>> check for hash function equality. It is not perfect but then no other
>> solution is either and as least it is fast.
>>
>
> Q: The code used earlier in the process (while building the filters) uses
> the comparators to do a deeper check at a point where the impact of such a
> check is minimal.  The later signature check is used where speed is
> required and we can reasonably assume the deeper check has been performed.
> Does it not make sense to retain both?

In this case we can use an internal package private equality function and drop the comparators from the public API. I’ll make these changes (with your comments on why we are using a full deep equals or quick signature equals) and you can review.

>
>
>> I think at least for the Murmur32x86Iterative class which is stateless
>> there is a case for a getInstance() method which could return a singleton
>> given the methods are stateless. Perhaps this pattern should be applied to
>> all the HashFunction implementations, i.e. private constructor and a
>> getInstance() method to obtain an instance.
>>
>>
> I am not certain this can be done in all cases.  Certainly  the javadoc
> would need to specify that the instance returned must be thread safe.  For
> example a simple static instance of MessageDigest (currently used in MD5
> implementation) could not be used. Additionally, In genomics I believe
> there are hasher constructors that require properties of the object being
> hashed before the hashing begins.

I did state that a singleton can be used for Murmur32x86Iterative; it was not clear that I was saying a singleton is not suitable for the others. I was discussing a change to the API to hide constructors.

The advantage of using the getInstance() method over a public INSTANCE field is that if in the future the class cannot be a singleton then the API does not need to change. So to allow support for the singleton I suggested adding the getInstance() method. In this case it would be strange to have a getInstance() method in Murmur32x86Iterative but not any of the others. Given that the classes are final a private constructor is fine and so a factory constructor approach could be used.

Anyway, since the other classes would require a newInstance rather than the singleton version getInstance, it is a rearrangement that does not provide much benefit. Perhaps I’ll just add a note to the javadoc for Murmur32x86Iterative that the class is thread-safe. If this changes in the future then the javadoc can be updated.

>
>
>>>
>>>>
>>>> 3. AbtractBloomFilter
>>>>
>>>> I considered changing the cardinality methods to avoid using BitSet
>>>> cardinality() and computing the cardinality direct. BitSet will
>>>> duplicate the input long[] in the constructor. For the or/xor/and
>>>> cardinality this is 2 array creations that can be avoided with a direct
>>>> implementation. For the cardinality method it avoids 1 array creation
>>>> via BitSet. Here is the type of implementation for a direct computation:
>>>>
>>>>    @Override
>>>>    public int cardinality() {
>>>>        int count = 0;
>>>>        for (final long bits : getBits()) {
>>>>            count += Long.bitCount(bits);
>>>>        }
>>>>        return count;
>>>>    }
>>>>
>>>>    @Override
>>>>    public int andCardinality(final BloomFilter other) {
>>>>        verifyShape(other);
>>>>        final long[] mine = getBits();
>>>>        final long[] theirs = other.getBits();
>>>>        final int limit = Integer.min(mine.length, theirs.length);
>>>>        int count = 0;
>>>>        for (int i = 0; i < limit; i++) {
>>>>            count += Long.bitCount(mine[i] & theirs[i]);
>>>>        }
>>>>        return count;
>>>>    }
>>>>
>>>>    @Override
>>>>    public int orCardinality(final BloomFilter other) {
>>>>        verifyShape(other);
>>>>        final long[] mine = getBits();
>>>>        final long[] theirs = other.getBits();
>>>>        long[] small;
>>>>        long[] big;
>>>>        if (mine.length > theirs.length) {
>>>>            big = mine;
>>>>            small = theirs;
>>>>        } else {
>>>>            small = mine;
>>>>            big = theirs;
>>>>        }
>>>>        int count = 0;
>>>>        for (int i = 0; i < small.length; i++) {
>>>>            // OR
>>>>            count += Long.bitCount(small[i] | big[i]);
>>>>        }
>>>>        for (int i = small.length; i < big.length; i++) {
>>>>            count += Long.bitCount(big[i]);
>>>>        }
>>>>        return count;
>>>>    }
>>>>
>>>> The xorCardinality is the same as orCardinality but using the ^
>> operator.
>>>>
>>>> I didn't make the change but this should be considered after a
>>>> performance test with JMH.
>>>>
>>>
>>> Indeed your solutions would work against the longs returned by getBits()
>>> and makes more sense.
>>
>> I will look at how other projects have incorporated JMH for performance
>> tests. It would be valuable here.
>>
>> The purpose would be to show that it is faster to use a
>> BloomFilter.contains(…) method in front of a full check for object presence
>> (i.e. in a HashMap) rather than just a HashMap. Do you have any such
>> examples? I remember a while back you posted some information on speed
>> testing. Say for example a filter to store 10000 random URLs and then
>> testing random URL hits (with some actual bad hits) are in/not in the set
>> of bad URLs. I could mock this up as a simple speed test for the Bloom
>> filter but if you have other examples then it would be useful.
>>
>>
> The classes that perform the testing I reported earlier are in
> https://github.com/Claudenw/BloomTest.

Thanks. I’ll take a look.


Re: [collections] Bloom filters

Claude Warren
On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <[hidden email]>
wrote:


> > My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned  and
> > B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N) for
> all
> > positive values of N?  If so then you are correct and Signedness does not
> > matter and can be removed.  I thought that the statement was false.
> >
>
> Yes, you are correct. Addition and multiplication use the bits in the same
> way. Modulus does not.
>
> So are the Murmur3 hash functions actually unsigned? I thought the
> original c code returned unsigned 64-bit values (uint64_t).
>
> IIUC the purpose of the Signedness enum is to specify that a negative
> value returned from the hash function should be used as a numerical
> magnitude if then used in downstream computations since sign extension on
> negatives will set leading bits to 1s. I.e. using the negative as if it
> were unsigned (and all bits are random) is not valid. Perhaps you can give
> an example of how signed or unsigned would be handled differently.
>
>
The murmur hash is not a good example here as it returns the resulting
buffer as 2 signed longs.  MD5 is probably a better example, as
messageDigest.digest() returns a 128-bit digest (a byte[16]).  The code
then interprets that as 2 signed longs.  A C implementation could interpret
it as 2 unsigned longs.  Additionally, there are hashes that return smaller
buffers, such as murmur2 which returns a 32-bit unsigned int in the C code.
In Java that could be represented as a long as per the
Integer.toUnsignedLong() method, or it could just be cast to a long and
thus signed.

The values returned are then passed to floorMod( value, numberOfBits ) to
turn on the appropriate bit in the resulting bit vector.  As noted above
this will yield different values depending upon how the byte[] was
represented.

Claude

Re: [collections] Bloom filters

Alex Herbert
On 18/02/2020 09:54, Claude Warren wrote:

> On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <[hidden email]>
> wrote:
>
>
>>> My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned  and
>>> B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N) for
>> all
>>> positive values of N?  If so then you are correct and Signedness does not
>>> matter and can be removed.  I thought that the statement was false.
>>>
>> Yes, you are correct. Addition and multiplication use the bits in the same
>> way. Modulus does not.
>>
>> So are the Murmur3 hash functions actually unsigned? I thought the
>> original c code returned unsigned 64-bit values (uint64_t).
>>
>> IIUC the purpose of the Signedness enum is to specify that a negative
>> value returned from the hash function should be used as a numerical
>> magnitude if then used in downstream computations since sign extension on
>> negatives will set leading bits to 1s. I.e. using the negative as if it
>> were unsigned (and all bits are random) is not valid. Perhaps you can give
>> an example of how signed or unsigned would be handled differently.
>>
>>
> The murmur hash is not a good example here as it returns the resulting
> buffer as 2 signed longs.  The MD5 is probably a better example as the
> messageDigest.digest() returns a byte[128].  The code then interprets that
> as 2 signed longs.  A C implementation could interpret that as 2 unsigned
> longs.  Additionally, there are hashes that return smaller buffers such as
> with murmur2 which returns a 32bit unsigned int in the C code.  In java
> that could be represented as a long as per the Integer.asUnsignedlong()
> method or it could just be cast to a long and thus signed.
>
> The values returned are then passed to floorMod( value, numberOfBits ) to
> turn on the appropriate bit in the resulting bit vector.  As noted above
> this will yield different values depending upon how the byte[] was
> represented.
>
> Claude

Thanks for the explanation. So the signedness is only appropriate when
the returned long from the hash function is smaller than 64-bits. Should
it be documented to indicate that the upper sign bit of the 64-bit long
will never be set by the Hasher (see example below)?

Reading about floorMod(x, y): it will return a non-negative value for any
x if y is positive. If the input x and y have the same sign you will get
the same result as x % y. However floorMod has to have a condition to
check if the signs are the same and so will be slower.

So floorMod could be replaced with x % y when the Signedness is UNSIGNED
and y is positive, or when both x & y are known to be positive. Looking
at the current code that seems to apply to
AbstractBloomFilter.contains(Hasher) which is using int values from the
Hasher and assumes they are positive (it uses idx / Long.SIZE as a
positive index). So this could use idx % Long.SIZE instead of floorMod.
This makes sense as the Hasher should have already sanitised the raw
output from the hash function into a positive int index for the
appropriate bit in the filter.

The same applies to HasherBloomFilter.getBits() which is using
floorMod(idx, Long.SIZE) when a negative idx would throw an index out of
bounds on the previous line.

So floorMod can be eliminated in two places to increase speed.
However, looking at BitSet, the divide and mod operations can be
removed altogether using shifts:

     final int buffIdx = idx / Long.SIZE;
     final int pwr = Math.floorMod(idx, Long.SIZE);
     final long buffOffset = 1L << pwr;

     // Equivalent, assuming idx is positive, since the << left shift of a
     // long uses only the first 6 bits of the shift distance.

     final int buffIdx = idx >> 6;
     final long buffOffset = 1L << idx;


This also opens the possibility of using the HashFunctionIdentity
Signedness to do the modulus in DynamicHasher:

     /**
      * Identifies the signedness of the calculations for this function.
      * <p>
      * When the hash function executes it typically returns an array of bytes.
      * That array is converted into one or more numerical values which will be
      * provided as a {@code long} primitive type. The signedness identifies if
      * those {@code long} values are signed or unsigned. For example a hash
      * function that outputs only 32-bits can be unsigned if converted using
      * {@link Integer#toUnsignedLong(int)}. A hash function that outputs more
      * than 64-bits is typically signed.
      * </p>
      */
     enum Signedness {
         /**
          * The result of {@link HashFunction#apply(byte[], int)} is signed,
          * thus the sign bit may be set.
          *
          * <p>The result can be used with {@code Math.floorMod(x, y)} to
          * generate a positive value if y is positive.
          *
          * @see Math#floorMod(int, int)
          */
         SIGNED {
             @Override
             int mod(long x, int y) {
                 // Cast to long to workaround a bug in animal-sniffer.
                 return (int) Math.floorMod(x, (long) y);
             }
         },
         /**
          * The result of {@link HashFunction#apply(byte[], int)} is unsigned,
          * thus the sign bit is never set.
          *
          * <p>The result can be used with {@code x % y} to generate a positive
          * value if y is positive.
          */
         UNSIGNED {
             @Override
             int mod(long x, int y) {
                 return (int) (x % y);
             }
         };

         /**
          * Get the modulus of the arguments. This function assumes the
          * divisor is positive. The modulus is to be used to provide a
          * positive index from random input value {@code x}.
          *
          * @param x the dividend
          * @param y the divisor (must be positive)
          * @return the modulus
          */
         abstract int mod(long x, int y);
     }

DynamicHasher.Iterator then uses:

     @Override
     public int nextInt() {
         if (hasNext()) {
             if (funcCount >= shape.getNumberOfHashFunctions()) {
                 funcCount = 0;
                 buffer++;
             }
             return function.getSignedness().mod(
                       function.apply(buffers.get(buffer), funcCount++),
                       shape.getNumberOfBits());
         }
         throw new NoSuchElementException();
     }

Adding this mod function is a speed optimisation that will doubtless
come back around as a bug when it is misused. So should I update the
Signedness javadoc to use the comments above (based on your PR and this
discussion) but without the abstract mod function? For now we can leave
Math.floorMod in the DynamicHasher and remove it from
AbstractBloomFilter and HasherBloomFilter using bit shifts. A negative x
in those cases would throw an index out of bounds exception in the same
function.
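
For reference, a minimal sketch of the shift-based check (assuming idx has
already been sanitised to a non-negative value and buffer is the long[]
backing the filter; not the actual AbstractBloomFilter code):

    // Hedged sketch: word index from the high bits, bit mask from the low 6 bits.
    static boolean containsBit(final long[] buffer, final int idx) {
        return (buffer[idx >> 6] & (1L << idx)) != 0;
    }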




Re: [collections] Bloom filters

Alex Herbert
In reply to this post by Claude Warren
I've updated master with some of the fixes discussed.

I've been looking through the rest of the BloomFilter code and the
CountingBloomFilter appears to be broken:

1. When another CountingBloomFilter is merged or removed the counts of
the other filter are ignored. This is not as documented.

2. When a remove operation is performed and the count for an index
decrements past zero it throws an exception. This is not documented but
is enforced in the unit tests. This is very unfriendly. Digging into this
part of the code, there is a duplication of logic for remove. The
first part will not throw if the decrement passes zero. Then the same
index is decremented again (with the same new value) and it will throw
this time for underflow. So maybe at some point it didn't throw and now
it does.

I commented the code where I think the problems are and added tests to
master that fail with the current implementation.

I thought this type of logic is ideally suited to a Bag. So I rewrote
the class using Bag<Integer>. This passes the same tests but has two
differences:

a) It will not throw when remove would underflow. The bit index just
gets set to zero (and that item is removed from the Bag).

b) Bag has no support for overflow in the count of each item. It
actually has a weird feature that adding 2 to Integer.MAX_VALUE will
overflow to negative, and then subtracting 1 would reset the count to
zero because it was still negative after subtracting 1. In Bag this is
not an issue as it is limited by Collection.size() returning an int for
the count of everything in the Bag. So if you add that many items the
collection will break elsewhere too.

I think the change (a) is an improvement. But are there situations where
you want to know that you cannot remove a filter exactly from another
one? If so perhaps a removeExactly(...) method for this case. I'd prefer
to just drop the throwing of exceptions. After all if you do an
andCardinality(BloomFilter) you do not get errors thrown when one cannot
be "subtracted" from the other.

Change (b) is faster but could lead to errors. So how likely is overflow
in a counting Bloom filter?

Both can be supported but it would have to use a custom Bag
implementation. This is pretty trivial and would actually speed up other
parts of the filter by having direct access to the Map underlying the
Bag for faster merge and remove operations. This would then be a change
to the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
to store counts.
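
As a rough sketch of the sort of structure this implies (the names here are
hypothetical, not the existing AbstractMapBag internals):

    // Hedged sketch: a count map keyed by bit index with a mutable value,
    // avoiding a boxed Integer replacement on every increment.
    final Map<Integer, MutableInteger> counts = new HashMap<>();

    static final class MutableInteger {
        int value;
    }

    void increment(final int index, final int delta) {
        counts.computeIfAbsent(index, k -> new MutableInteger()).value += delta;
    }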

I can update the code to fix the implementation but will only do so
after a decision on overflow is made. We could even have a custom
implementation that stores counts as a long for very high counting
filters. Is this likely to be used? Only in a situation where the
exceptions from overflow are an issue.


Other questions about this:

- Why public Stream<Map.Entry<Integer, Integer>> getCounts()?

Perhaps this is better served raw as arrays of length 2. This has lower
memory overhead:

public Stream<int[]> getCounts()

I've ruled out an accessor method as there is no equivalent for boolean
getBit(int index) in BloomFilter, e.g:

public int getCount(int index)

- Why a TreeMap? Is the order of the indices important?

Or do you have some benchmarks to show that the TreeMap handles lots of
growth and shrinkage better than a HashMap. There are situations where
each one would be a better choice and so perhaps this is a case for
having a CountingBloomFilter with a choice for the backing Map. This
could be done using an abstract class with implementations. This is one
to leave until some benchmarks are in place in the main code.




Re: [collections] Bloom filters

Claude Warren
Last one first: why a tree map?  I think it is a holdover from an earlier
implementation.  It can be any reasonable Map (e.g. HashMap).

On the remove() issue: you are absolutely correct, there is a bug.  It
looks like a partial edit got restored back into the code.  The first block
should be removed.  In fact it should probably read

            if (val == null || val == 0) {
                throw new IllegalStateException("Underflow on index " + idx);
            } else if (val == 1) {
                counts.remove(idx);
            } else {
                counts.put(idx, val - 1);
            }

The question of merging/removing counting bloom filters is an interesting
one.

Merging a "normal" Bloom filter A into a counting Bloom filter B simply
increments all the bits specified  by A.getBits().

I took the approach that a counting bloom filter was just another Bloom
filter.  So merging counting Bloom filter A into counting Bloom filter B
simply incremented all the bits specified by A.getBits().

However the source has moved on since the original implementation was laid
down, and it might make sense that merging bloom filters adds the counts
from A into B.  Then if the developer wanted to perform a simple merge for
a Bloom filter the developer could use B.merge( A.getHasher() ).  However,
I felt that since a counting Bloom filter is an instance of Bloom filter
the merge should be driven by getBits().

Perhaps it makes sense for counting Bloom filters to have another method
add( CountingBloomFilter a) that would add the counts from A into B.  This
would necessitate a subtract( CountingBloomFilter a) as well.
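
As a rough illustration only (using the Stream<Map.Entry<Integer, Integer>>
returned by getCounts() and the current Map<Integer, Integer> backing), an
add( CountingBloomFilter ) could look something like:

    // Hedged sketch: combine the other filter's counts into this one.
    // Note: no overflow check is performed here.
    void add(final CountingBloomFilter other) {
        other.getCounts().forEach(entry ->
            counts.merge(entry.getKey(), entry.getValue(), Integer::sum));
    }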

The getCounts() method returns a Map.Entry because it was more semantic in
nature (e.g. it is obvious that it is a key-value pair) but I am not
averse to changing it to an int[].  If the question is why have the method
at all, then the answers are: 1. it is needed for testing (bad answer), 2.
it is the only way to retrieve the counts for each bit, which can be
required by some applications.

Using a Bag might make sense, but my reading of the AbstractMapBag is that
it tracks the total of all counts in the bag, so we would overflow that
counter before we overflowed a single count.   This definition of size()
differs from the definition of size() for a map and so we would have to
rewrite the cardinality() method.  However, it would be nice if we could
lift the MutableInteger class.

In addition, we do need to check for overflow and underflow.  An underflow
indicates that a filter was removed that was not originally added.
Overflow is possible when processing streams as in a "stable Bloom filter" (
https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).

Claude


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

garydgregory
On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <[hidden email]> wrote:

> Last one first, why a tree map?  I think it is a holdover from an earlier
> implementation.  It can be any reasonable Map (e.g. HashMap).
>
> on the remove() issue.  You are absolutely correct, there is a bug.  I
>

May you please add a test to validate whatever fix you guys implement? I'm
not following the details here but I want to stress the importance of tests
to help future maintainers validate any future changes and protect us from
regression bugs. IOW, while the current code base is in your head, more
tests! ;-)

Gary



Re: [collections] Bloom filters

Alex Herbert


I added a test that shows the current implementation does not do what I thought it should do based on the documentation in javadoc:

"For each bit that is turned on in the other filter; if the other filter is
 also a CountingBloomFilter the count is added to this filter, otherwise the
 count is incremented by one."

My test checks the counts are added.

However it does what Claude thought it should do. The original test checks that the indices are incremented by 1.

So what to do about counting? If overflow or underflow occurs on addition or subtraction you can throw an exception to signal the error, but if a partial operation has been made this leaves the filter in an undetermined state. The filter does not actually correspond to a sum of filters anymore, so it is invalid. Should any further operations on the filter be prevented? In this case throwing an IllegalStateException for all further operations seems extreme. It may be better to have the addition/subtraction return true/false if the operation was clean, i.e. similar to a standard Set collection that returns true if adding/removing an item changes the state, false otherwise. The filter could also maintain a state flag to indicate that an overflow/underflow has occurred and the filter no longer represents a sum of filters.

Given this idea we change merge to increment the indices by 1 for any BloomFilter including a counting filter. It will not throw if it overflows but will set an invalid state flag. The logic here is that the method is inherited from BloomFilter and so will function in the same way as other filters by accumulating set bits (indices).

Then add some methods:

boolean add(BloomFilter)
boolean remove(BloomFilter)
boolean isInvalid()

These update the counts using the counts of the other filter if it is a CountingBloomFilter, otherwise add or subtract 1. The methods return true if the operation added or removed the filter entirely, i.e. it is now a part of the sum or was removed from the sum. Otherwise they return false and the filter is marked as an invalid sum of filters. This leaves the counting filter in an undetermined state but it can still be used as a non-counting filter as the indices will be correctly maintained.

It is a compromise but there does not seem to be a nice way to do this without a lot of overhead. That would be to check the add/remove can be done entirely before actually performing the operation. If the operation would under/overflow then an exception can be thrown and the filter is left in a usable state. This could be achieved with some efficiency if the filter stored the maximum count. Then any add would be able to check for potential overflow quickly and do a safe add. However the subtraction could not know the indices that may underflow using a max value (since 0 - x underflows) and would always have to do a safe removal by checking all decrements before actually doing the entire removal.

One idea here is that a reusable list is kept to store all the indices that are to be decremented. Stage 1 finds all the indices and checks for underflow in the subtraction. If underflow then throw. Otherwise traverse the list and actually do the subtraction. The overhead of the second list traversal is likely to be much lower than the initial identification of indices in the backing map. This all becomes easy to do if the backing data structure stores both the index and a mutable field, e.g.

Set<MutableIndex> counts;
class MutableIndex {
   int index;
   int count;
   int hashCode() { return index; }
   // etc
}

You then find all the MutableIndex objects in the Set and put them in a reusable temp List (with the new count stored in a parallel list). If all can be incremented/decremented without under/overflow then just go over the list and update the count using the precomputed new count. The entire operation then works as a single transaction.
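
A rough sketch of that two-stage removal (hypothetical names such as find(),
purely to make the transaction idea concrete):

    // Hedged sketch: stage 1 checks every decrement, stage 2 applies them,
    // so the filter is never left in a partially subtracted state.
    void subtract(final Map<Integer, Integer> otherCounts) {
        final List<MutableIndex> toUpdate = new ArrayList<>();
        final List<Integer> newCounts = new ArrayList<>();
        for (final Map.Entry<Integer, Integer> e : otherCounts.entrySet()) {
            final MutableIndex mi = find(e.getKey()); // lookup in the Set<MutableIndex>
            if (mi == null || mi.count < e.getValue()) {
                throw new IllegalStateException("Underflow on index " + e.getKey());
            }
            toUpdate.add(mi);
            newCounts.add(mi.count - e.getValue());
        }
        // No underflow detected: apply the precomputed counts.
        for (int i = 0; i < toUpdate.size(); i++) {
            toUpdate.get(i).count = newCounts.get(i);
        }
    }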

It may be better to put this into a TransactionCountingBloomFilter, and then CountingBloomFilter is left to be faster and not care about over/underflow.

Thoughts on this approach?



Re: [collections] Bloom filters

Claude Warren
I think the compromise solution of removing the thrown exception and adding
an isInvalid() method makes sense and fits within the parameters of how the
counting Bloom filter should work.  However, I think the add() and remove()
(or add and subtract) methods should only accept CountingBloomFilter
arguments.  The reasoning being that the methods are specific to the
counting filter and are defined in terms of the internal structure of the
filter (the count of each bit) that is not present in the "standard" Bloom
filter.  I think that makes it lexically clear.

Do you have a branch someplace where you are doing this work?   I would
like to see the proposed changes you outlined for the AbstractBloomFilter
implemented as I think you are correct and they would be faster and cleaner.

Claude


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert


> On 19 Feb 2020, at 21:14, Claude Warren <[hidden email]> wrote:
>
> I think the compromise solution of removing the thrown exception and adding
> an isInvalid() method makes sense and fits within the parameters of how the
> counting Bloom filter should work.  However, I think the add() and remove()
> methods (or add and subtract) should only accept CountingBloomFilter
> arguments.  The reasoning is that these methods are specific to the
> counting filter and are defined in terms of its internal structure (the
> count of each bit), which is not present in the "standard" Bloom
> filter.  I think that makes it lexically clear.

OK. We can always add a transaction counting Bloom filter later that will ensure entire filters are added or removed cleanly. For now it can be "user beware" documentation stating that if you subtract a filter you did not add, you can end up with a set of counts that is not a sum of filters.

I do note that you could do this anyway even with a transaction counting Bloom filter, e.g. A+B+C+D - E. It may or may not throw, and the trust is on the user to ensure they do not subtract E from something that never had it added.

Thinking about the invalid flag, perhaps the methods merge/add/subtract should just return 0 if the filter state is valid and -1 if invalid. Thus, once a filter is invalid, all calls would return the invalid flag.

The reason for this is efficiency. The counts are allowed to roam freely over the full range of an int. You keep a valid flag that is the bitwise OR of every count you have adjusted. If any over/underflow occurs then the valid flag will have a negative sign bit. The methods then return the signed shift of the valid flag:

return valid >> 31;

This would be -1 if any count ever over/underflows, zero otherwise.

This eliminates any conditional statements from the filter operations. The user just has to know this is how it works and so any downstream consumption of the counts would have to first check the isInvalid flag and sanitise them as necessary. You can even view the counts as unsigned int32 if you never subtract filters.
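
For concreteness, here is a minimal sketch of that bookkeeping (assuming an int[] backing array; the class and method names are illustrative, not the proposed API):

    class CountTracker {
        private final int[] counts;
        private int state; // accumulates the sign bit of every adjusted count

        CountTracker(int numberOfBits) {
            counts = new int[numberOfBits];
        }

        // Adjust the given indices by delta; returns 0 while valid, -1 once any
        // count has ever over/underflowed (and for all subsequent calls).
        int adjust(int[] indices, int delta) {
            for (final int i : indices) {
                final int c = counts[i] + delta;
                counts[i] = c;
                state |= c; // a negative result sets the sign bit of state
            }
            return state >> 31;
        }

        boolean isInvalid() {
            return state < 0;
        }
    }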

>
> Do you have a branch someplace where you are doing this work?   I would
> like to see the proposed changes you outlined for the AbstractBloomFilter
> implemented as I think you are correct and they would be faster and cleaner.

No but I’ll create one. I’m waiting for a big PR against collections to go through (an update to force use of checkstyle). Then I’ll revert the changes I made to the CountingBloomFilterTest to its previous state and put the changes into a branch. These would have a new internal structure for the counts and the methods for add/subtract.

For the AbstractBloomFilter do you mean the use of the mod(long, int) method in the Signedness enum? I think the removal of Math.floorMod can be done in two places with bit shift but should we also use the % operator instead of Math.floorMod for Signedness.UNSIGNED?
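
(For reference, an illustration of why the signed and unsigned interpretations give different indices; the values are arbitrary and this is not the library code:)

    long hash = 0xF000ABCD00000000L;  // a hash value with the sign bit set
    long m = 19171;                   // number of bits in the filter

    long signedIndex   = Math.floorMod(hash, m);           // value treated as signed
    long unsignedIndex = Long.remainderUnsigned(hash, m);  // value treated as unsigned
    // The two indices generally differ, hence the Signedness property matters
    // when mapping a hash to a bit index.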


>
> Claude
>
> On Wed, Feb 19, 2020 at 12:41 AM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>>> On 18 Feb 2020, at 22:34, Gary Gregory <[hidden email]> wrote:
>>>
>>> On Tue, Feb 18, 2020 at 5:32 PM Claude Warren <[hidden email]> wrote:
>>>
>>>> Last one first, why a tree map?  I think it is a holdover from an
>> earlier
>>>> implementation.  It can be any reasonable Map (e.g. HashMap).
>>>>
>>>> on the remove() issue.  You are absolutely correct, there is a bug.  I
>>>>
>>>
>>> May you please add a test to validate whatever fix you guys implement?
>> I'm
>>> not following the details here but I want to stress the importance of
>> tests
>>> to help future maintainers validate any future changes and protect us
>> from
>>> regression bugs. IOW, while the current code base is in your head, more
>>> tests! ;-)
>>>
>>
>> I added a test that shows the current implementation does not do what I
>> thought it should do based on the documentation in javadoc:
>>
>> "For each bit that is turned on in the other filter; if the other filter is
>> also a CountingBloomFilter the count is added to this filter, otherwise
>> the
>> count is incremented by one."
>>
>> My test checks the counts are added.
>>
>> However it does what Claude thought it should do. The original test checks
>> that the indices are incremented by 1.
>>
>> So what to do about counting? If overflow or underflow occur on addition
>> or subtraction you can throw an exception to signal error but if a partial
>> operation has been made this leaves the filter in an undetermined state.
>> The filter does not actually correspond to a sum of filters anymore. So it
>> is invalid. Should any further operations on the filter be prevented? In
>> this case throwing an IllegalStateException for all further operations
>> seems extreme. It may be better to have the addition/subtraction return
>> true/false if the operation was clean, i.e. similar to a standard Set
>> collection that returns true if adding/removing an item changes the state,
>> false otherwise. The filter could also maintain a state flag to indicate
>> that an overflow/underflow has occurred and the filter no longer represents
>> a sum of filters.
>>
>> Given this idea we change merge to increment the indices by 1 for any
>> BloomFilter including a counting filter. It will not throw if it overflows
>> but will set an invalid state flag. The logic here is the method is
>> inherited from BloomFilter and so will function in the same way as other
>> filters by accumulating set bits (indices).
>>
>> Then add some methods:
>>
>> boolean add(BloomFilter)
>> boolean remove(BloomFilter)
>> boolean isInvalid()
>>
>> These update the counts using the counts of the other filter if it is a
>> CountingBloomFilter, otherwise add or subtract 1. The methods return true
>> if the operation added or removed the filter entirely, i.e. it is now a
>> part of the sum or was removed from the sum. Otherwise they return false
>> and the filter is marked as an invalid sum of filters. This leaves the
>> counting filter in an undetermined state but it can still be used as a
>> non-counting filter as the indices will be correctly maintained.
>>
>> It is a compromise but there does not seem to be a nice way to do this
>> without a lot of overhead. That would be to check the add/remove can be
>> done entirely before actually performing the operation. If the operation
>> would under/overflow then an exception can be thrown and the filter is left
>> in a usable state. This could be achieved with some efficiency if the
>> filter stored the maximum count. Then any add would be able to check for
>> potential overflow quickly and do a safe add. However the subtraction could
>> not know the indices that may underflow using a max value (since 0 - x
>> underflows) and would always have to do a safe removal by checking all
>> decrements before actually doing the entire removal. One idea here is a
>> reusable list is kept to store all the indices that are to be decremented.
>> Stage 1 finds all the indices and checks for underflow in the subtraction.
>> If underflow then throw. Otherwise traverse the list and actually do the
>> subtraction. The overhead of the second list traversal is likely to be much
>> lower than the initial identification of indices in the backing map. This
>> all becomes easy to do if the backing data structure stores both the index
>> and a mutable field, e.g.
>>
>> Set<MutableIndex> counts;
>> class MutableIndex {
>>   int index;
>>   int count;
>>   int hashCode() { return index; }
>>   // etc
>> }
>>
>> You then find all the MutableIndex objects in the Set and put them in a
>> reusable temp List (with the new count stored in a parallel list). If all
>> can be incremented/decremented without under/overflow then just go over the
>> list and update the count using the precomputed new count. The entire
>> operation then works as a single transaction.
>>
>> It may be better put into a TransactionCountingBloomFilter and then
>> CountingBloomFilter is left to be faster and not care about over/underflow.
>>
>> Thoughts on this approach?
>>
>>> Gary
>>>
>>>
>>>> looks like a partial edit got restored back into the code.  The first
>> block
>>>> should be removed.  In fact it should probably read
>>>>
>>>>          if (val == null || val == 0) {
>>>>               throw new IllegalStateException("Underflow on index " +
>>>> idx);
>>>>           } else if (val == 1) {
>>>>               counts.remove(idx);
>>>>           } else {
>>>>               counts.put(idx, val - 1);
>>>>           }
>>>>
>>>> The question of merging/removing counting bloom filters is an
>> interesting
>>>> one.
>>>>
>>>> Merging a "normal" Bloom filter A into a counting Bloom filter B simply
>>>> increments all the bits specified  by A.getBits().
>>>>
>>>> I took the approach that a counting bloom filter was just another Bloom
>>>> filter.  So merging counting Bloom filter A into counting Bloom filter B
>>>> simply incremented all the bits specified by A.getBits().
>>>>
>>>> However the source has moved on since the original implementation was
>> laid
>>>> down, and it might make sense that merging bloom filters adds the counts
>>>> from A into B.  Then if the developer wanted to perform a simple merge
>> for
>>>> a Bloom filter the developer could use B.merge( A.getHasher() ).
>> However,
>>>> I felt that since a counting Bloom filter is an instance of Bloom filter
>>>> the merge should be driven by getBits().
>>>>
>>>> Perhaps it makes sense for counting Bloom filters to have another method
>>>> add( CountingBloomFilter a) that would add the counts from A into B.
>> This
>>>> would necessitate a subtract( CountingBloomFilter a) as well.
>>>>
>>>> the getCounts() method returns a Map.Entry because it was more semantic
>> in
> >>>> nature (e.g. it is obvious that it is a key-value pair) but I am not
> >>>> averse to changing it to an int[].  If the question is why the
>> method
>>>> at all  then the answers are 1. it is needed for testing (bad answer),
>> 2.
>>>> it is the only way to retrieve the counts for each bit which can be
>>>> required by some applications.
>>>>
>>>> Using a Bag might make sense, but my reading of the AbstractMapBag is
>> that
>>>> it tracks the total of all counts in the bag, so we would overflow that
>>>> counter before we overflowed a single count.   This definition of size()
>>>> differs from the definition of size() for a map and so we would have to
>>>> rewrite the cardinality() method.  However, it would be nice if we could
>>>> lift the MutableInteger class.
>>>>
>>>> In addition, we do need to check for overflow and underflow.  An
>> underflow
>>>> indicates that a filter was removed that was not originally added.
>>>> Overflow is possible when processing streams as in a "stable Bloom
>> filter"
>>>> (
>>>> https://en.wikipedia.org/wiki/Bloom_filter#Stable_Bloom_filters).
>>>>
>>>> Claude
>>>>
>>>>
>>>> On Tue, Feb 18, 2020 at 5:31 PM Alex Herbert <[hidden email]>
>>>> wrote:
>>>>
>>>>> I've updated master with some of the fixes discussed.
>>>>>
>>>>> I've been looking through the rest of the BloomFilter code and the
>>>>> CountingBloomFilter appears to be broken:
>>>>>
> >>>>> 1. When another CountingBloomFilter is merged or removed the counts of
>>>>> the other filter are ignored. This is not as documented.
>>>>>
>>>>> 2. When a remove operation is performed and the count for an index
>>>>> decrements past zero it throws an exception. This is not documented but
>>>>> enforced in the unit tests. This is very unfriendly. Digging into this
>>>>> part of the code and there is a duplication of logic for remove. The
>>>>> first part will not throw if the decrement passes zero. Then the same
>>>>> index is decremented again (with the same new value) and it will throw
>>>>> this time for underflow. So maybe at some point it didn't throw and now
>>>>> it does.
>>>>>
>>>>> I commented the code where I think the problems are and added tests to
>>>>> master that fail with the current implementation.
>>>>>
>>>>> I thought this type of logic is ideally suited to a Bag. So I rewrote
>>>>> the class using Bag<Integer>. This passes the same tests but has two
>>>>> differences:
>>>>>
>>>>> a) It will not throw when remove would underflow. The bit index just
>>>>> gets set to zero (and that item is removed from the Bag).
>>>>>
>>>>> b) Bag has no support for overflow in the count of each item. It
>>>>> actually has a weird feature that adding 2 to Integer.MAX_VALUE will
>>>>> overflow to negative and then subtracting 1 would reset the count to
>>>>> zero because it was still negative after subtracting 1. In Bag this is
>>>>> not an issue as it is limited by Collections.size() returning an int
>> for
>>>>> the count of everything in the Bag. So if you add that many items the
>>>>> collection will break elsewhere too.
>>>>>
>>>>> I think the change (a) is an improvement. But are there situations
>> where
>>>>> you want to know that you cannot remove a filter exactly from another
>>>>> one? If so perhaps a removeExactly(...) method for this case. I'd
>> prefer
>>>>> to just drop the throwing of exceptions. After all if you do an
>>>>> andCardinality(BloomFilter) you do not get errors thrown when one
>> cannot
>>>>> be "subtracted" from the other.
>>>>>
>>>>> Change (b) is faster but could lead to errors. So how likely is
>> overflow
>>>>> in a counting Bloom filter?
>>>>>
>>>>> Both can be supported but it would have to use a custom Bag
>>>>> implementation. This is pretty trivial and would actually speed up
>> other
>>>>> parts of the filter by having direct access to the Map underlying the
>>>>> Bag for faster merge and remove operations. This would then be a change
>>>>> to the current Map<Integer, Integer> to a Map<Integer, MutableInteger>
>>>>> to store counts.
>>>>>
>>>>> I can update the code to fix the implementation but will only do so
>>>>> after a decision on overflow is made. We could even have a custom
>>>>> implementation that stores counts as a long for very high counting
>>>>> filters. Is this likely to be used? Only in a situation where the
>>>>> exceptions from overflow are an issue.
>>>>>
>>>>>
>>>>> Other questions about this:
>>>>>
>>>>> - Why public Stream<Map.Entry<Integer, Integer>> getCounts()?
>>>>>
>>>>> Perhaps this is better served raw as arrays of length 2. This has lower
>>>>> memory overhead:
>>>>>
>>>>> public Stream<int[]> getCounts()
>>>>>
>>>>> I've ruled out an accessor method as there is no equivalent for boolean
>>>>> getBit(int index) in BloomFilter, e.g:
>>>>>
>>>>> public int getCount(int index)
>>>>>
>>>>> - Why a TreeMap? Is the order of the indices important?
>>>>>
>>>>> Or do you have some benchmarks to show that the TreeMap handles lots of
>>>>> growth and shrinkage better than a HashMap. There are situations where
>>>>> each one would be a better choice and so perhaps this is a case for
>>>>> having a CountingBloomFilter with a choice for the backing Map. This
>>>>> could be done using an abstract class with implementations. This is one
>>>>> to leave until some benchmarks are in place in the main code.
>>>>>
>>>>>
>>>>> On 18/02/2020 09:54, Claude Warren wrote:
>>>>>> On Tue, Feb 18, 2020 at 9:12 AM Alex Herbert <
>> [hidden email]
>>>>>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>>> My maths is rusty.  If A=0xF000ABCD as interpreted as an unsigned
>>>> and
>>>>>>>> B=0xF000ABCD but interpreted as a signed does (A mod N) = (B mod N)
>>>> for
>>>>>>> all
>>>>>>>> positive values of N?  If so then you are correct and Signedness
>> does
>>>>> not
>>>>>>>> matter and can be removed.  I thought that the statement was false.
>>>>>>>>
>>>>>>> Yes, you are correct. Addition and multiplication use the bits in the
>>>>> same
>>>>>>> way. Modulus does not.
>>>>>>>
>>>>>>> So are the Murmur3 hash functions actually unsigned? I thought the
>>>>>>> original c code returned unsigned 64-bit values (uint64_t).
>>>>>>>
>>>>>>> IIUC the purpose of the Signedness enum is to specify that a negative
>>>>>>> value returned from the hash function should be used as a numerical
>>>>>>> magnitude if then used in downstream computations since sign
>> extension
>>>>> on
>>>>>>> negatives will set leading bits to 1s. I.e. using the negative as if
>>>> it
>>>>>>> were unsigned (and all bits are random) is not valid. Perhaps you can
>>>>> give
>>>>>>> an example of how signed or unsigned would be handled differently.
>>>>>>>
>>>>>>>
>>>>>> The murmur hash is not a good example here as it returns the resulting
>>>>>> buffer as 2 signed longs.  The MD5 is probably a better example as the
>>>>>> messageDigest.digest() returns a byte[128].  The code then interprets
>>>>> that
>>>>>> as 2 signed longs.  A C implementation could interpret that as 2
>>>> unsigned
>>>>>> longs.  Additionally, there are hashes that return smaller buffers
>> such
>>>>> as
>>>>>> with murmur2 which returns a 32bit unsigned int in the C code.  In
>> java
>>>>>> that could be represented as a long as per the
>> Integer.asUnsignedlong()
>>>>>> method or it could just be cast to a long and thus signed.
>>>>>>
>>>>>> The values returned are then passed to floorMod( value, numberOfBits )
>>>> to
>>>>>> turn on the appropriate bit in the resulting bit vector.  As noted
>>>> above
>>>>>> this will yield different values depending upon how the byte[] was
>>>>>> represented.
>>>>>>
>>>>>> Claude
>>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [hidden email]
>>>>> For additional commands, e-mail: [hidden email]
>>>>>
>>>>>
>>>>
>>>> --
>>>> I like: Like Like - The likeliest place on the web
>>>> <http://like-like.xenei.com>
>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren




Re: [collections] Bloom filters

Claude Warren
Way back in your first post on this topic you talked about changing the
AbstractBloomFilter to use:


     @Override
     public int cardinality() {
         int count = 0;
         for (final long bits : getBits()) {
             count += Long.bitCount(bits);
         }
         return count;
     }

among other changes.  Those were the changes I was referring to.

Claude


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert
I have created a PR that contains most of the changes discussed in this thread (see [1]).

Please review the changes and comment here or on GitHub.

I have left the CountingBloomFilter alone. I will reimplement the add/subtract functionality as discussed into another PR.

Alex

[1] https://github.com/apache/commons-collections/pull/133 <https://github.com/apache/commons-collections/pull/133>



Re: [collections] Bloom filters

Claude Warren
Alex, would you take a look at pull request 131 [1]?  It adds a new hasher
implementation and makes the HashFunctionValidator available for public use.

https://github.com/apache/commons-collections/pull/131


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert


> On 29 Feb 2020, at 07:46, Claude Warren <[hidden email]> wrote:
>
> Alex, would you take a look at pull request 131 [1]?  It adds a new hasher
> implementation and makes the HashFunctionValidator available for public use.
>
> https://github.com/apache/commons-collections/pull/131 <https://github.com/apache/commons-collections/pull/131>

OK. I’ll take a look.

I’ve been thinking about the counting Bloom filter and the backing storage. In summary:

1. The backing storage should be a fixed array.
2. Merging a Hasher should count duplicate indices, not flatten them all to a single count.

For background I’ve used the standard formulas to estimate the number of indices that will be non-zero in a Bloom filter. The wikipedia page gives this formula for the expected number of bits set to 0 (E(q)) if you have inserted i elements into a filter of size m using k hash functions:

E(q) = (1 - 1/m)^(k * i)

So a rough guess of the number of indices (bits) used by a filter is 1-E(q).

Here is a table of Bloom filters with different collision probabilities and the proportion of bits that will be set when 1%, 10%, 100% of the capacity of the filter has been met:

n       p       m       k       I       E(q)    bits
1000    1E-04   19171   13      10      0.9932  0.0068
1000    1E-04   19171   13      100     0.9344  0.0656
1000    1E-04   19171   13      1000    0.5076  0.4924
1000    1E-05   23963   17      10      0.9929  0.0071
1000    1E-05   23963   17      100     0.9315  0.0685
1000    1E-05   23963   17      1000    0.4919  0.5081
1000    1E-06   28756   20      10      0.9931  0.0069
1000    1E-06   28756   20      100     0.9328  0.0672
1000    1E-06   28756   20      1000    0.4988  0.5012
10000   1E-06   287552  20      100     0.9931  0.0069
10000   1E-06   287552  20      1000    0.9328  0.0672
10000   1E-06   287552  20      10000   0.4988  0.5012

The point is that if you create a Bloom filter and fill it to 10% of the intended capacity the number of indices used will be about 6-7% of the filter bits.
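
(The table values can be reproduced from the formula above; a quick sanity check, not library code:)

    // n = 1000, p = 1e-4  =>  m = 19171, k = 13 (first three rows of the table)
    double m = 19171;
    int k = 13;
    for (int i : new int[] {10, 100, 1000}) {
        double eq = Math.pow(1.0 - 1.0 / m, (double) k * i); // E(q)
        System.out.printf("i=%d  E(q)=%.4f  bits=%.4f%n", i, eq, 1.0 - eq);
    }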

So how to store the counts? Currently the counting bloom filter uses a TreeMap<Integer, Integer>. I tried:

TreeMap<Integer, Integer>
HashMap<Integer, Integer>
TreeSet<MutableCount>
HashSet<MutableCount>
int[]

The MutableCount is a custom object that stores the bit index and uses it for a hash code and then has a mutable integer count field. It allows the count to be incremented/decremented if the object is in the set:

    static final class MutableCount implements Comparable<MutableCount> {
        final int key;    // the bit index; also used as the hash code
        int count;        // mutable count, updated in place

        @Override
        public int hashCode() { return key; }
        @Override
        public int compareTo(MutableCount other) { return Integer.compare(key, other.key); }
        // equals(Object) compares only the key
    }

This is adapted from the Bag<T> collection which stores an item count with a MutableInteger. Here the mutable count is part of the same object T inserted into the Set. So you can find the object, change the count and not have to put it back into the set. This is more efficient than changing the Integer stored in a Map.

I’ve estimated memory usage using an idea based on this article from JavaWorld: Java Tip 130: Do you know your data size? [1].

Basically you:

- create an object and throw it away. All classes are then initialised.
- Then you free memory (run garbage collection) and get the current memory usage
- Then create a lot of your object (held in an array)
- Then measure memory usage again
- memory = (after - before) / count
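
A rough sketch of that procedure (helper names are mine; the figures depend on the JVM so treat them as indicative only):

    static long estimateBytesPerInstance(java.util.function.Supplier<Object> factory,
                                         int count) {
        factory.get();                                  // initialise classes, then discard
        final Runtime rt = Runtime.getRuntime();
        rt.gc();
        final long before = rt.totalMemory() - rt.freeMemory();
        final Object[] hold = new Object[count];        // keep strong references
        for (int i = 0; i < count; i++) {
            hold[i] = factory.get();
        }
        rt.gc();
        final long after = rt.totalMemory() - rt.freeMemory();
        return (after - before) / count;
    }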

Here is some example output for n bits set in size m:

 13107 / 262144 (0.050) : TreeMap<Integer, Integer>      =   733947 bytes
 26214 / 262144 (0.100) : TreeMap<Integer, Integer>      =  1467866 bytes
 13107 / 262144 (0.050) : TreeSet<MutableCount>          =   838928 bytes
 26214 / 262144 (0.100) : TreeSet<MutableCount>          =  1677776 bytes
 13107 / 262144 (0.050) : HashMap<Integer, Integer>      =  1677712 bytes
 26214 / 262144 (0.100) : HashMap<Integer, Integer>      =  2306739 bytes
 13107 / 262144 (0.050) : HashSet<MutableCount>          =  1782664 bytes
 26214 / 262144 (0.100) : HashSet<MutableCount>          =  2516656 bytes
     0 / 262144 (0.000) : int[]                          =  1048608 bytes
     0 / 262144 (0.000) : short[]                        =   524320 bytes

The estimate is accurate to 0.0001% for the arrays so the method is working. The HashMap was created with the capacity set to the expected capacity of the filter (m).

I’ve chosen these sizes because at 5% full a HashSet is less memory efficient than using a fixed size array, and at 10% the TreeSet is also less efficient.

Note that the java.util.Tree/HashSet versions just wrap a Map and insert a dummy object for all keys in the Map. So here a Set is not as efficient as a Map because in the Map test I always inserted the same Integer object representing 1. This would be the same as using a Set with an Integer key but here the Set had to contain the MutableCount which has an extra int field and is larger than an Integer.

These data lead me to think that a counting Bloom filter should just use a fixed size backing array:

- If created using the same Shape as a standard Bloom filter it uses a fixed size. This has high memory cost when the filter is empty but when it exceeds 10% of the intended capacity it is more efficient than a dynamic backing storage.
- All operations will operate in order(n) time for an operation with another filter with n indices. Each individual index count in the filter will have order(1) time for access/update. Performance will be limited by the memory cache of the entire array.

The issue is that a counting Bloom filter with a very low number of inserted items will be memory inefficient. But under what circumstance will such a filter be used in a short-term lifecycle? If it is simply to merge into another filter then this can be done using a merge with a Hasher. If counts are to be merged then perhaps we provide a method to merge counts using the same data structure returned by the CountingBloomFilter getCounts() method, e.g. using a stream of <index,count> pairs:

Stream<int[]> getCounts();
void add(Stream<int[]> counts);
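
A hedged sketch of how those two methods could look over a fixed int[] backing array (the counts field and the int[]{index, count} convention are assumptions from the proposal above; uses java.util.stream.IntStream):

    public Stream<int[]> getCounts() {
        return IntStream.range(0, counts.length)
                        .filter(i -> counts[i] != 0)
                        .mapToObj(i -> new int[] {i, counts[i]});
    }

    public void add(Stream<int[]> entries) {
        // No Shape/HashFunctionIdentity validation is possible here (see below)
        entries.forEach(e -> counts[e[0]] += e[1]);
    }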

The issue here is the Shape and HashFunctionIdentity of the origin of the merge cannot be validated. So just leave it out and use the merge with a Hasher.


Thus the next issue with the counting Bloom filter implementation. Currently when it merges with a Hasher it puts all the indices into a Set and so will only increment the count by 1 for each index identified by the Hasher. This appears to miss the entire point of the counting Bloom filter. If I hash an object to generate k indices I would hope that I do get k indices. But the hash may not be perfect and I may get [1, k] indices with some duplications. This is part of the signature of that object with the given hash. So surely a counting Bloom filter should accommodate this. If my Hasher generates the same index 20 times this should result in the count of that index incrementing by 20.

The result is that if an object is added directly to a counting Bloom filter using a Hasher it will have a different result than if it is added to a standard Bloom filter and then that filter is added to the counting Bloom filter.

Opinions on this?

Alex



[1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html



Re: [collections] Bloom filters

Claude Warren
The idea of a backing array is fine. The only problem I see with it is in
very large filters (on the order of 10^8 bits and larger), but document
the size calculation and let the developer worry about it.

As for the merge question: merge is a standard Bloom filter operation.  It
is well defined in the literature.  Merging a Bloom filter into a counting
Bloom filter means incrementing the bit counts.  I think that merge/remove
should continue to operate as though the parameter were a standard Bloom
filter.

We had spoken of implementing an add/delete method pair that would
operate on CountingBloomFilters and would add/subtract the counts (e.g.
add(CountingBloomFilter) and subtract(CountingBloomFilter)).

I disagree with your proposal for the merge(Hasher) implementation, and I
am not certain that an add(Hasher) makes sense.  First consider that the
Hasher returns the bits that are to be enabled in the Bloom filter so
collisions are expected.  In the normal course of events a Hasher is used
to create a normal Bloom filter where all the duplicates are removed.  That
filter is then merged into a CountingBloomFilter.  So in some sense the
Hasher and the normal Bloom filter are the same.  So I would expect the
merge of the Hasher and the merge of the normal Bloom filter created from
that hasher into a CountingBloomFilter to yield the same result.  If you
wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
CountingBloomFilter you could implement them with duplicate counting, but I am
not certain of the validity of such a count and I fear that it muddies the
waters with respect to what the CountingBloomFilter is counting.
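
To make the two semantics concrete, a small sketch (the IntStream of indices stands in for a Hasher, counts is an assumed int[] backing array, and neither method is the current API):

    // merge: the Hasher is treated like a standard Bloom filter, duplicates collapse
    void merge(IntStream hasherIndices) {
        hasherIndices.distinct().forEach(i -> counts[i]++);
    }

    // add: every index produced by the Hasher is counted, duplicates included
    void add(IntStream hasherIndices) {
        hasherIndices.forEach(i -> counts[i]++);
    }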

Claude

On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 29 Feb 2020, at 07:46, Claude Warren <[hidden email]> wrote:
> >
> > Alex would you take a look at pull request 131 [1].  it adds a new hasher
> > implementation and makes the HashFunctionValidator available for public
> use.
> >
> > https://github.com/apache/commons-collections/pull/131 <
> https://github.com/apache/commons-collections/pull/131>
>
> OK. I’ll take a look.
>
> I’ve been thinking about the counting Bloom filter and the backing
> storage. In summary:
>
> 1. The backing storage should be a fixed array.
> 2. Merging a Hasher should count duplicate indices, not flatten them all
> to a single count.
>
> For background I’ve used the standard formulas to estimate the number of
> indices that will be non-zero in a Bloom filter. The wikipedia page gives
> this formula for the expected number of bits set to 0 (E(q)) if you have
> inserted i elements into a filter of size m using k hash functions:
>
> E(q) = (1 - 1/m)^ki"
>
> So a rough guess of the number of indices (bits) used by a filter is
> 1-E(q).
>
> Here is a table of Bloom filters with different collision probabilities
> and the proportion of bits that will be set when 1%, 10%, 100% of the
> capacity of the filter has been met:
>
> n       p       m       k       I       E(q)    bits
> 1000    1E-04   19171   13      10      0.9932  0.0068
> 1000    1E-04   19171   13      100     0.9344  0.0656
> 1000    1E-04   19171   13      1000    0.5076  0.4924
> 1000    1E-05   23963   17      10      0.9929  0.0071
> 1000    1E-05   23963   17      100     0.9315  0.0685
> 1000    1E-05   23963   17      1000    0.4919  0.5081
> 1000    1E-06   28756   20      10      0.9931  0.0069
> 1000    1E-06   28756   20      100     0.9328  0.0672
> 1000    1E-06   28756   20      1000    0.4988  0.5012
> 10000   1E-06   287552  20      100     0.9931  0.0069
> 10000   1E-06   287552  20      1000    0.9328  0.0672
> 10000   1E-06   287552  20      10000   0.4988  0.5012
>
> The point is that if you create a Bloom filter and fill it to 10% of the
> intended capacity the number of indices used will be about 6-7% of the
> filter bits.
>
> So how to store the counts? Currently the counting bloom filter uses a
> TreeMap<Integer, Integer>. I tried:
>
> TreeMap<Integer, Integer>
> HashMap<Integer, Integer>
> TreeSet<MutableCount>
> HashSet<MutableCount>
> int[]
>
> The MutableCount is a custom object that stores the bit index and uses it
> for a hash code and then has a mutable integer count field. It allows the
> count to be incremented/decremented if the object is in the set:
>
>     static final class MutableCount implements Comparable<MutableCount> {
>         final int key;
>         int count;
>         // etc
>     }
>
> This is adapted from the Bag<T> collection which stores an item count with
> a MutableInteger. Here the mutable count is part of the same object T
> inserted into the Set. So you can find the object, change the count and not
> have to put it back into the set. This is more efficient than changing the
> Integer stored in a Map.
>
> I’ve estimated memory usage using an idea based on this article from
> JavaWorld: Java Tip 130: Do you know your data size? [1].
>
> Basically you:
>
> - create an object and throw it away. All classes are then initialised.
> - Then you free memory (run garbage collection) and get the current memory
> usage
> - Then create a lot of your object (held in an array)
> - Then measure memory usage again
> - memory = (after - before) / count
>
> Here is some example output for n bits set in size m:
>
>  13107 / 262144 (0.050) : TreeMap<Integer, Integer>      =   733947 bytes
>  26214 / 262144 (0.100) : TreeMap<Integer, Integer>      =  1467866 bytes
>  13107 / 262144 (0.050) : TreeSet<MutableCount>          =   838928 bytes
>  26214 / 262144 (0.100) : TreeSet<MutableCount>          =  1677776 bytes
>  13107 / 262144 (0.050) : HashMap<Integer, Integer>      =  1677712 bytes
>  26214 / 262144 (0.100) : HashMap<Integer, Integer>      =  2306739 bytes
>  13107 / 262144 (0.050) : HashSet<MutableCount>          =  1782664 bytes
>  26214 / 262144 (0.100) : HashSet<MutableCount>          =  2516656 bytes
>      0 / 262144 (0.000) : int[]                          =  1048608 bytes
>      0 / 262144 (0.000) : short[]                        =   524320 bytes
>
> The estimate is accurate to 0.0001% for the arrays so the method is
> working. The HashMap was created with the capacity set to the expected
> capacity of the filter (m).
>
> I’ve chosen these sizes because at 5% full a HashSet is less memory
> efficient than using a fixed size array, and at 10% the TreeSet is also
> less efficient.
>
> Note that the java.util.Tree/HashSet versions just wrap a Map and insert a
> dummy object for all keys in the Map. So here a Set is not as efficient as
> a Map because in the Map test I always inserted the same Integer object
> representing 1. This would be the same as using a Set with an Integer key
> but here the Set had to contain the MutableCount which has an extra int
> field and is larger than an Integer.
>
> These data lead me to think that a counting Bloom filter should just use a
> fixed size backing array:
>
> - If created using the same Shape as a standard Bloom filter it uses a
> fixed size. This has high memory cost when the filter is empty but when it
> exceeds 10% of the intended capacity it is more efficient than a dynamic
> backing storage.
> - All operations will operate in order(n) time for an operation with
> another filter with n indices. Each individual index count in the filter
> will have order(1) time for access/update. Performance will be limited by
> the memory cache of the entire array.
>
> The issue is that a counting Bloom filter with a very low number of
> inserted items will be memory inefficient. But under what circumstance will
> such a filter be used in a short-term lifecycle? If it is simply to merge
> into another filter then this can be done using a merge with a Hasher. If
> counts are to be merged then perhaps we provide a method to merge counts
> using the same data structure returned by the CountingBloomFilter
> getCounts() method, e.g. using a stream of <index,count> pairs:
>
> Stream<int[]> getCounts();
> void add(Stream<int[]> counts);
>
> The issue here is the Shape and HashFunctionIdentity of the origin of the
> merge cannot be validated. So just leave it out and use the merge with a
> Hasher.
>
>
> Thus the next issue with the counting Bloom filter implementation.
> Currently when it merges with a Hasher it puts all the indices into a Set
> and so will only increment the count by 1 for each index identified by the
> Hasher. This appears to miss the entire point of the counting Bloom filter.
> If I hash an object to generate k indices I would hope that I do get k
> indices. But the hash may not be perfect and I may get [1, k] indices with
> some duplications. This is part of the signature of that object with the
> given hash. So surely a counting Bloom filter should accommodate this. If
> my Hasher generates the same index 20 times this should result in the count
> of that index incrementing by 20.
>
> The result is that if an object is added directly to a counting Bloom filter
> using a Hasher it will have a different result than if it is added to a standard
> Bloom filter and that filter is then added to the counting Bloom filter.
>
> Opinions on this?
>
> Alex
>
>
>
> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
>
> >
> > On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <[hidden email]>
> > wrote:
> >
> >> I have created a PR that contains most of the changes discussed in this
> >> thread (see [1]).
> >>
> >> Please review the changes and comment here or on GitHub.
> >>
> >> I have left the CountingBloomFilter alone. I will reimplement the
> >> add/subtract functionality as discussed into another PR.
> >>
> >> Alex
> >>
> >> [1] https://github.com/apache/commons-collections/pull/133 <
> >> https://github.com/apache/commons-collections/pull/133>
> >>
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [collections] Bloom filters

Alex Herbert


> On 1 Mar 2020, at 09:28, Claude Warren <[hidden email]> wrote:
>
> The idea of a backing array is fine and the only problem I see with it is
> in very large filters (on the order of 10^8 bits and larger) but document
> the size calculation and let the developer worry about it.

Let us look at the use case where we max out the array. Using the Bloom filter calculator:

n = 149,363,281
p = 0.001000025 (1 in 1000)
m = 2147483647 (256MiB)
k = 10

n = 74,681,641
p = 0.000001 (1 in 999950)
m = 2147483647 (256MiB)
k = 20

n = 49,787,761
p = 0.000000001 (1 in 999924899)
m = 2147483647 (256MiB)
k = 30

So you will be able to put somewhere in the order of 10^8 or 10^7 items into the filter. I would say that anyone putting more than that into the filter has an unusual use case. The CountingBloomFilter can throw an exception if m is too large and will throw an OutOfMemoryError if you cannot allocate an array large enough.
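
These figures follow from the standard formulas m = -n*ln(p) / (ln 2)^2 (rounded up) and k = (m/n)*ln 2 (rounded). A quick sketch that reproduces them (the class and method names are only illustrative):

final class BloomFilterSizing {
    // m = ceil(-n * ln(p) / (ln 2)^2)
    static long numberOfBits(double n, double p) {
        return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }
    // k = round((m / n) * ln 2)
    static int numberOfHashFunctions(double n, double m) {
        return (int) Math.round(m / n * Math.log(2));
    }
}

// numberOfBits(149_363_281, 0.001) is roughly 2147483647 (differences from the
// calculator output above are rounding) and
// numberOfHashFunctions(149_363_281, 2147483647) = 10, matching the first case.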

One clear point here is that you cannot use a short as a 16-bit count would easily overflow. So you have to use an integer array for the counts.

A maximum length int[] is roughly 8GB.

What would another implementation cost in terms of memory? The TreeMap<Integer, Integer> was the most space efficient. In the previous e-mail the saturation of a Bloom filter's bits was approximately 50% when at the intended capacity. So we have to estimate the size of a TreeMap containing Integer.MAX_VALUE/2 indices ~ 2^30. The memory test shows the TreeMap memory scales linearly with entries:

 32768 /  65536 (0.500) : TreeMap<Integer, Integer>      =  1834061 bytes
 65536 / 131072 (0.500) : TreeMap<Integer, Integer>      =  3669080 bytes
131072 / 262144 (0.500) : TreeMap<Integer, Integer>      =  7339090 bytes

So what is the memory for a TreeMap with 2^30 indices? I make it about:

(2^30 / 131,072) * 7,339,090 bytes ~ 6e10 bytes = 55.99 GB

I would say that this amount of RAM is unusual. It is definitely not as efficient as using an array. So very large counting Bloom filters are going to require some thought as to the hardware they run on. This may not be the case in 10 years time.

I would say that we should try an int[] backing array for the storage implementation and document its limitations. A different implementation could be provided in the future if required.

This could be done by making CountingBloomFilter an interface that extends BloomFilter with the methods:

subtract(BloomFilter filter)
subtract(Hasher filter)

These will negate the effect of the corresponding merge(BloomFilter) operation.

Do we also need access to the counts and add/subtract of another CountingBloomFilter?:

add(CountingBloomFilter filter);
subtract(CountingBloomFilter filter);

Iterator<int[]> getCounts();
int getSize(); // Number of items added

The CountingBloomFilter is then an interface that defines how to reverse the merge of some bits into the filter.

My concern is the inefficiency of the creation of objects in any method that provides access to the counts (e.g. above using an iterator as for Hasher.getBits()). I presume this method would be to allow some type of storage/serialisation of the filter akin to the long[] getBits() method of BloomFilter. So it may be better done using a method:

int getCount(int index);

The caller can then use long[] getBits() to get the indices set in the filter and then for each non-zero bit index call getCount(index). Or just not include the method as the counts are only of concern when storing the filter. This functionality is cleaner pushed into an implementation.
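
For example, a storage loop might look something like the sketch below. It assumes the proposed int getCount(int index) method and the usual BitSet-style layout of long[] getBits() (bit i of word w is index 64 * w + i); none of this is meant as the final API:

static void writeCounts(CountingBloomFilter filter, java.io.DataOutput out)
        throws java.io.IOException {
    final long[] words = filter.getBits();
    for (int w = 0; w < words.length; w++) {
        long bits = words[w];
        while (bits != 0) {
            final int index = w * Long.SIZE + Long.numberOfTrailingZeros(bits);
            out.writeInt(index);
            out.writeInt(filter.getCount(index)); // proposed method
            bits &= bits - 1; // clear the lowest set bit
        }
    }
}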

In a previous post we discussed whether to throw an exception on overflow/underflow or set an invalid flag. Using the invalid flag idea the interface would be:

interface CountingBloomFilter {
    int add(CountingBloomFilter filter);
    int subtract(BloomFilter filter);
    int subtract(Hasher filter);
    int subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The status would be negative if any operation overflowed/underflowed, or zero if OK. The current status is returned by the add/subtract methods.

However I note that overflow may not be a concern. To overflow a count you would have to add so many items that the filter would require a number of bits that is unrealistic to store in memory:

n = 2147483647
p = 0.001000025 (1 in 1000)
m = 30875634182 (3.59GiB)
k = 10

If you want to add 2 billion items (and overflow an integer count) then your filter would be so big it would break the rest of the API that uses a 32-bit int for the bit index.

Thus only underflow is a realistic concern. This could be documented as handled in an implementation specific manner (i.e. throw or ignore). The API is then simplified to:

interface CountingBloomFilter {
    boolean add(CountingBloomFilter filter);
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    boolean subtract(CountingBloomFilter filter);
    int getStatus();

    // Maybe
    int getSize();
    int getCount(int index);
}

The boolean is used to state that add/subtract did not over/underflow. Implementations can throw if they require it.
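
As a sketch of how an int[] backed implementation might honour that contract (the class and field names are illustrative, and it works on the raw bit words rather than a BloomFilter to keep the example short):

final class IntArrayCounts {
    private final int[] counts;
    private boolean valid = true;

    IntArrayCounts(int numberOfBits) {
        counts = new int[numberOfBits];
    }

    // Reverse a merge: decrement the count for every set bit.
    // Returns false if this call underflowed any count.
    boolean subtract(long[] bits) {
        boolean clean = true;
        for (int w = 0; w < bits.length; w++) {
            long word = bits[w];
            while (word != 0) {
                final int index = w * Long.SIZE + Long.numberOfTrailingZeros(word);
                if (--counts[index] < 0) {
                    clean = false;
                    valid = false; // remember for isValid()
                }
                word &= word - 1; // clear the lowest set bit
            }
        }
        return clean;
    }

    boolean isValid() {
        return valid;
    }
}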

The question then becomes what getSize() represents if an add/subtract did not execute cleanly. Under this scheme it would be the number of (add - subtract) operations. The status flag would indicate whether the size, or any of the counts from getCount(), is valid. The simpler API is to not allow access to counts/size or adding/subtracting counts:

interface CountingBloomFilter {
    boolean subtract(BloomFilter filter);
    boolean subtract(Hasher filter);
    int getStatus();
    // Or something like ...
    boolean isValid();
}

This filter is then only concerned with reversing the merge of Bloom filters with a valid status flag to indicate that the current state is consistent (i.e. all filters have been cleanly merged/subtracted).

WDYT?

>
> As for the merge question.  merge is a standard bloom filter operation.  It
> is well defined in the literature.  Merging a bloom filter into a counting
> bloom filter means incrementing the bit counts.  I  think that merge/remove
> should continue to operate as though the parameter were a standard bloom
> filter.
>

OK. So the count is to represent the number of filters that had a bit set at that index. This makes it more clear.

> We had spoken of implementing an adding/deleting method pair that would
> operate on CountingBloomFilters and would add/subtract the counts.  (e.g.
> add(CountingBloomFilter) and subtract(CountingBloomFilter))
>
> I disagree with your proposal for the merge(Hasher) implementation, and I
> am not certain that an add(Hasher) makes sense.  First consider that the
> Hasher returns the bits that are to be enabled in the Bloom filter so
> collisions are expected.  In the normal course of events a Hasher is used
> to create a normal Bloom filter where all the duplicates are removed.  That
> filter is then merged into a CountingBloomFilter.  So in some sense the
> Hasher and the normal Bloom filter are the same.  So I would expect the
> merge of the Hasher and the merge of the normal Bloom filter created from
> that hasher into a CountingBloomFilter to yield the same result.  If you
> wanted to add an add(Hasher)/delete(Hasher) pair of functions to a
> CountingBloomFilter you could implement with duplicate counting, but I am
> not certain of the validity of such a count and I fear that it muddies the
> waters with respect to what the CountingBloomFilter is counting.

Agreed.
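
A sketch of that agreed behaviour for merging a Hasher's indices (the iterator simply stands in for whatever the Hasher provides; this is not a proposed implementation):

// Duplicate indices from the hasher are collapsed before incrementing, so
// merging the Hasher and merging the Bloom filter built from it give the
// same counts.
static void mergeIndices(int[] counts, java.util.PrimitiveIterator.OfInt indices) {
    final java.util.BitSet distinct = new java.util.BitSet(counts.length);
    indices.forEachRemaining((int index) -> distinct.set(index));
    distinct.stream().forEach(index -> counts[index]++);
}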

>
> Claude
>
> On Sat, Feb 29, 2020 at 2:13 PM Alex Herbert <[hidden email] <mailto:[hidden email]>>
> wrote:
>
>>
>>
>>> On 29 Feb 2020, at 07:46, Claude Warren <[hidden email] <mailto:[hidden email]>> wrote:
>>>
>>> Alex would you take a look at pull request 131 [1].  it adds a new hasher
>>> implementation and makes the HashFunctionValidator available for public
>> use.
>>>
>>> https://github.com/apache/commons-collections/pull/131 <https://github.com/apache/commons-collections/pull/131> <
>> https://github.com/apache/commons-collections/pull/131 <https://github.com/apache/commons-collections/pull/131>>
>>
>> OK. I’ll take a look.
>>
>> I’ve been thinking about the counting Bloom filter and the backing
>> storage. In summary:
>>
>> 1. The backing storage should be a fixed array.
>> 2. Merging a Hasher should count duplicate indices, not flatten them all
>> to a single count.
>>
>> For background I’ve used the standard formulas to estimate the number of
>> indices that will be non-zero in a Bloom filter. The wikipedia page gives
>> this formula for the expected number of bits set to 0 (E(q)) if you have
>> inserted i elements into a filter of size m using k hash functions:
>>
>> E(q) = (1 - 1/m)^(ki)
>>
>> So a rough guess of the number of indices (bits) used by a filter is
>> 1-E(q).
>>
>> Here is a table of Bloom filters with different collision probabilities
>> and the proportion of bits that will be set when 1%, 10%, 100% of the
>> capacity of the filter has been met:
>>
>> n       p       m       k       I       E(q)    bits
>> 1000    1E-04   19171   13      10      0.9932  0.0068
>> 1000    1E-04   19171   13      100     0.9344  0.0656
>> 1000    1E-04   19171   13      1000    0.5076  0.4924
>> 1000    1E-05   23963   17      10      0.9929  0.0071
>> 1000    1E-05   23963   17      100     0.9315  0.0685
>> 1000    1E-05   23963   17      1000    0.4919  0.5081
>> 1000    1E-06   28756   20      10      0.9931  0.0069
>> 1000    1E-06   28756   20      100     0.9328  0.0672
>> 1000    1E-06   28756   20      1000    0.4988  0.5012
>> 10000   1E-06   287552  20      100     0.9931  0.0069
>> 10000   1E-06   287552  20      1000    0.9328  0.0672
>> 10000   1E-06   287552  20      10000   0.4988  0.5012
>>
>> The point is that if you create a Bloom filter and fill it to 10% of the
>> intended capacity the number of indices used will be about 6-7% of the
>> filter bits.
>>
>> So how to store the counts? Currently the counting bloom filter uses a
>> TreeMap<Integer, Integer>. I tried:
>>
>> TreeMap<Integer, Integer>
>> HashMap<Integer, Integer>
>> TreeSet<MutableCount>
>> HashSet<MutableCount>
>> int[]
>>
>> The MutableCount is a custom object that stores the bit index and uses it
>> for a hash code and then has a mutable integer count field. It allows the
>> count to be incremented/decremented if the object is in the set:
>>
>>    static final class MutableCount implements Comparable<MutableCount> {
>>        final int key;
>>        int count;
>>        // etc
>>    }
>>
>> This is adapted from the Bag<T> collection which stores an item count with
>> a MutableInteger. Here the mutable count is part of the same object T
>> inserted into the Set. So you can find the object, change the count and not
>> have to put it back into the set. This is more efficient than changing the
>> Integer stored in a Map.
>>
>> I’ve estimated memory usage using an idea based on this article from
>> JavaWorld: Java Tip 130: Do you know your data size? [1].
>>
>> Basically you:
>>
>> - create an object and throw it away. All classes are then initialised.
>> - Then you free memory (run garbage collection) and get the current memory
>> usage
>> - Then create a lot of your object (held in an array)
>> - Then measure memory usage again
>> - memory = (after - before) / count
>>
>> Here is some example output for n bits set in size m:
>>
>> 13107 / 262144 (0.050) : TreeMap<Integer, Integer>      =   733947 bytes
>> 26214 / 262144 (0.100) : TreeMap<Integer, Integer>      =  1467866 bytes
>> 13107 / 262144 (0.050) : TreeSet<MutableCount>          =   838928 bytes
>> 26214 / 262144 (0.100) : TreeSet<MutableCount>          =  1677776 bytes
>> 13107 / 262144 (0.050) : HashMap<Integer, Integer>      =  1677712 bytes
>> 26214 / 262144 (0.100) : HashMap<Integer, Integer>      =  2306739 bytes
>> 13107 / 262144 (0.050) : HashSet<MutableCount>          =  1782664 bytes
>> 26214 / 262144 (0.100) : HashSet<MutableCount>          =  2516656 bytes
>>     0 / 262144 (0.000) : int[]                          =  1048608 bytes
>>     0 / 262144 (0.000) : short[]                        =   524320 bytes
>>
>> The estimate is accurate to 0.0001% for the arrays so the method is
>> working. The HashMap was created with the capacity set to the expected
>> capacity of the filter (m).
>>
>> I’ve chosen these sizes because at 5% full a HashSet is less memory
>> efficient than using a fixed size array, and at 10% the TreeSet is also
>> less efficient.
>>
>> Note that the java.util.Tree/HashSet versions just wrap a Map and insert a
>> dummy object for all keys in the Map. So here a Set is not as efficient as
>> a Map because in the Map test I always inserted the same Integer object
>> representing 1. This would be the same as using a Set with an Integer key
>> but here the Set had to contain the MutableCount which has an extra int
>> field and is larger than an Integer.
>>
>> These data lead me to think that a counting Bloom filter should just use a
>> fixed size backing array:
>>
>> - If created using the same Shape as a standard Bloom filter it uses a
>> fixed size. This has high memory cost when the filter is empty but when it
>> exceeds 10% of the intended capacity it is more efficient than a dynamic
>> backing storage.
>> - All operations will operate in order(n) time for an operation with
>> another filter with n indices. Each individual index count in the filter
>> will have order(1) time for access/update. Performance will be limited by
>> the memory cache of the entire array.
>>
>> The issue is that a counting Bloom filter with a very low number of
>> inserted items will be memory inefficient. But under what circumstance will
>> such a filter be used in a short-term lifecycle? If it is simply to merge
>> into another filter then this can be done using a merge with a Hasher. If
>> counts are to be merged then perhaps we provide a method to merge counts
>> using the same data structure returned by the CountingBloomFilter
>> getCounts() method, e.g. using a stream of <index,count> pairs:
>>
>> Stream<int[]> getCounts();
>> void add(Stream<int[]> counts);
>>
>> The issue here is the Shape and HashFunctionIdentity of the origin of the
>> merge cannot be validated. So just leave it out and use the merge with a
>> Hasher.
>>
>>
>> Thus the next issue with the counting Bloom filter implementation.
>> Currently when it merges with a Hasher it puts all the indices into a Set
>> and so will only increment the count by 1 for each index identified by the
>> Hasher. This appears to miss the entire point of the counting Bloom filter.
>> If I hash an object to generate k indices I would hope that I do get k
>> indices. But the hash may not be perfect and I may get [1, k] indices with
>> some duplications. This is part of the signature of that object with the
>> given hash. So surely a counting Bloom filter should accommodate this. If
>> my Hasher generates the same index 20 times this should result in the count
>> of that index incrementing by 20.
>>
>> The result is that if an object is added directly to a counting Bloom filter
>> using a Hasher it will have a different result than if it is added to a standard
>> Bloom filter and that filter is then added to the counting Bloom filter.
>>
>> Opinions on this?
>>
>> Alex
>>
>>
>>
>> [1] http://www.javaworld.com/javaworld/javatips/jw-javatip130.html
>>
>>>
>>> On Tue, Feb 25, 2020 at 12:35 AM Alex Herbert <[hidden email]>
>>> wrote:
>>>
>>>> I have created a PR that contains most of the changes discussed in this
>>>> thread (see [1]).
>>>>
>>>> Please review the changes and comment here or on GitHub.
>>>>
>>>> I have left the CountingBloomFilter alone. I will reimplement the
>>>> add/subtract functionality as discussed into another PR.
>>>>
>>>> Alex
>>>>
>>>> [1] https://github.com/apache/commons-collections/pull/133 <
>>>> https://github.com/apache/commons-collections/pull/133>
>>>>
>>>>
>>>>
>>>
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com <http://like-like.xenei.com/>>
> LinkedIn: http://www.linkedin.com/in/claudewarren <http://www.linkedin.com/in/claudewarren>

Re: [collections] Bloom filters

Claude Warren
I think the CountingBloomFilter interface needs to extend BloomFilter.

I think I am confused.

I would expect CountingBloomFilter to have

interface CountingBloomFilter extends BloomFilter {
    // these 2 methods are the reverse of merge()
    void remove( BloomFilter filter );
    void remove( Hasher hasher );

    // these 2 methods are the addition/subtraction of counts
    void add( CountingBloomFilter filter );
    void subtract( CountingBloomFilter filter );

    // 2 methods to retrieve data
    Stream<long[]> getCounts();
    boolean isValid();
}
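
A sketch of the intended call pattern (only the method names come from the interface above and from BloomFilter.merge(); the wrapper method is purely illustrative):

static void example(CountingBloomFilter counting, BloomFilter item,
        CountingBloomFilter other) {
    counting.merge(item);     // from BloomFilter: increments the counts
    counting.remove(item);    // reverses that merge

    counting.add(other);      // add the other filter's counts
    counting.subtract(other); // subtract them again

    if (!counting.isValid()) {
        throw new IllegalStateException("a remove/subtract took a count below zero");
    }
}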

Claude




--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren