[BloomFilters] changes to BloomFilter


[BloomFilters] changes to BloomFilter

Claude Warren
We have spoken elsewhere about removing getHasher() and adding iterator().
In addition, should we add forEachBit(IntConsumer)?
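A sketch of what those two additions could look like, with forEachBit built as a default method on top of the iterator (the interface name BitIndexSource is illustrative, not the agreed BloomFilter API):

```java
import java.util.PrimitiveIterator;
import java.util.function.IntConsumer;

// Hypothetical sketch of the two proposed additions.
interface BitIndexSource {
    /** Iterate the indexes of the enabled bits. */
    PrimitiveIterator.OfInt iterator();

    /** Feed every enabled bit index to the consumer. */
    default void forEachBit(IntConsumer consumer) {
        final PrimitiveIterator.OfInt it = iterator();
        while (it.hasNext()) {
            consumer.accept(it.nextInt());
        }
    }
}
```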



--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
On Sun, 15 Mar 2020, 17:27 Claude Warren, <[hidden email]> wrote:

> We have spoken elsewhere about removing getHasher() and adding iterator()
> In addition should we add forEachBit( IntConsumer )?


I was thinking the same. So we provide an iterator allowing fail-fast on the
first index that fails a criterion, e.g. for contains, and a forEach
allowing efficient receipt of all indexes.

The only thing missing is whether we add a spliterator which has in its API
the ability to specify DISTINCT and the exact size of the number of
indexes. The spliterator can be a default method using the iterator to
create it. An implementation can provide one if it wants.
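A sketch of such a default spliterator method (interface and method names are illustrative; DISTINCT and SIZED can be reported because a filter knows its exact cardinality):

```java
import java.util.PrimitiveIterator;
import java.util.Spliterator;
import java.util.Spliterators;

// Hypothetical sketch of a default spliterator built from the iterator.
interface IndexedFilter {
    PrimitiveIterator.OfInt iterator();

    int cardinality();

    default Spliterator.OfInt spliterator() {
        // Wrap the iterator, reporting the exact size and distinctness.
        return Spliterators.spliterator(iterator(), cardinality(),
            Spliterator.DISTINCT | Spliterator.ORDERED | Spliterator.SIZED);
    }
}
```

An implementation that stores its indexes natively (e.g. in a sorted array) could override this with a more efficient spliterator.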




Re: [BloomFilters] changes to BloomFilter

Claude Warren
I made a quick pass at changing getHasher() to iterator().

I think we can get rid of HasherBloomFilter as its purpose was really to
create a Bloom filter for temporary usage and it doesn't seem to be
required if we have a hasher that can be created from a Shape and a
function that creates an Iterator.




Re: [BloomFilters] changes to BloomFilter

Alex Herbert

On 16/03/2020 07:57, Claude Warren wrote:
> I made a quick pass at changing getHasher() to iterator().

Was that a look at the feasibility, or have you started work on this? If so
then I'll not start work on it as well.

I changed master to return a boolean for the merge operations in
BloomFilter. So the outstanding changes are to drop getHasher() from the
BloomFilter interface in favour of an iterator, spliterator and a
forEachBit method.

> I think we can get rid of HasherBloomFilter as its purpose was really to
> create a Bloom filter for temporary usage and it doesn't seem to be
> required if we have a hasher that can be created from a Shape and a
> function that creates an Iterator.

I agree.

One change that could be made is to clarify the contract between a
Hasher and a BloomFilter. At present the Hasher can operate without a
defined contract in this method:

PrimitiveIterator.OfInt getBits(Shape shape)

It should validate that it can generate indexes for the shape. But it
doesn't have to. It could return unlimited indexes and they could be
outside the number of bits of the BloomFilter.

There does not appear to be any control anywhere on the number of hash
functions generated by the Hasher. I would expect this test in the
AbstractBloomFilterTest to pass:

    @Test
    public void hasherMergeTest() {
        int n = 1;
        int m = 10;
        HashFunctionIdentity h = new HashFunctionIdentityImpl(
            "provider", "name", Signedness.SIGNED, ProcessType.CYCLIC, 0L);
        Hasher hasher = new Hasher() {
            @Override
            public boolean isEmpty() {
                return false;
            }
            @Override
            public HashFunctionIdentity getHashFunctionIdentity() {
                return h;
            }
            @Override
            public OfInt getBits(Shape shape) {
                // Do not respect the shape's number of hash functions,
                // but do respect the number of bits
                return IntStream.range(0, m).iterator();
            }
        };
        for (int k = 1; k < 5; k++) {
            Shape shape = new Shape(h, n, m, k);
            BloomFilter bf = createEmptyFilter(shape);
            bf.merge(hasher);
            assertEquals("incorrect cardinality", k, bf.cardinality());
        }
    }

It currently does not pass, as the BloomFilters do not respect the Shape
with which they were created, i.e. they disregard the number of hash
functions in the Shape. So does the Hasher.

I think some of the control should be returned to the BloomFilter. The
Hasher would be reduced to a simple generator of data for the
BloomFilter, for example:

     PrimitiveIterator.OfInt getBits(int m);
     PrimitiveIterator.OfInt getBits(int k, int m);
     PrimitiveIterator.OfLong getBits();

The BloomFilter then accepts responsibility for converting the primitives
to a suitable index and creating the correct number of hash functions
(i.e. indexes).

A merge operation with a BloomFilter then becomes:

- check the Hasher is using the correct hash function identity
- ask the Hasher for an iterator
- set k bits in the filter using the iterator, mapping each to the range
[0, m)

The BloomFilter has then encapsulated its state and respects the Shape.
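The merge steps above could be sketched as follows (illustrative only: BitSet stands in for the filter's bit storage, and the hasher is reduced to a raw long stream as proposed):

```java
import java.util.BitSet;
import java.util.PrimitiveIterator;

// Sketch of a merge where the filter owns the index mapping: take exactly
// k values from the hasher's raw long stream and floor-mod each into the
// index range [0, m).
final class MergeSketch {
    static void merge(BitSet bits, PrimitiveIterator.OfLong hashes, int k, int m) {
        for (int i = 0; i < k && hashes.hasNext(); i++) {
            // floorMod keeps negative hash values inside [0, m)
            bits.set((int) Math.floorMod(hashes.nextLong(), (long) m));
        }
    }
}
```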

The HashFunction will convert byte[] to a long.

The Hasher exists to convert anything to a byte[] format.

This perhaps needs the Hasher.Builder to be revised to include more
methods that accept all the primitive data types. These are all
converted to a single byte[] representation. Thus the Hasher.Builder is
effectively a specification for a ByteBuffer. Once an object is
decomposed into the byte[] it can be fed through the HashFunction with
different seeds or using the cyclic method to create the iterator. The
BloomFilter consumes the raw long output from the stream produced by the
Hasher and sets k bits within the range m.
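The "Builder as a specification for a ByteBuffer" idea could be sketched like this (the class name and method set are illustrative, not the proposed API):

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch: each with(...) call appends the value's byte representation,
// and build() yields the single byte[] fed to the HashFunction.
final class ByteSpecBuilder {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    ByteSpecBuilder with(long value) {
        bytes.writeBytes(ByteBuffer.allocate(Long.BYTES).putLong(value).array());
        return this;
    }

    ByteSpecBuilder with(String value) {
        bytes.writeBytes(value.getBytes(StandardCharsets.UTF_8));
        return this;
    }

    byte[] build() {
        return bytes.toByteArray();
    }
}
```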

Alex



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [BloomFilters] changes to BloomFilter

Claude Warren
First, I think that hasher.getBits(Shape) should be renamed to
iterator(Shape). It was poorly named at the start.

By definition a Hasher knows how many items it is going to insert.

The Shape tells the hasher how many hash functions to apply to each item.
The Shape number of items is how many items are expected to be in the final
Bloom filter, it is more the expected value not a hard limit.

Keeping in mind the possibility of hash collisions, I don't see a way to
check that the Hasher has respected the number of functions.

The static hasher for example will not return duplicates, so it might appear
that it has not respected the number of functions. In addition there is no
indication from the hasher of how many items it contains.

The inputs to the Hasher.Builder are byte buffers that are fed to the hash
algorithm; they are inputs to that algorithm. So primitive types would
simply convert from the primitive type to its byte buffer representation.
Is that what you meant?

The hasher contract is that it will generate integers in the proper range
and use the proper number of hash functions for each item that was added to
the builder and that repeated calls to getBits(Shape) will return the same
values.

Did I misunderstand something?

Claude



Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 16 Mar 2020, at 18:58, Claude Warren <[hidden email]> wrote:
>
> First I think that the hasher.getBits( Shape ) should be renamed to
> iterator( Shape ).  It was poorly named at the start.

I can fix that.

>
> By definition a Hasher knows how many items it is going to insert.
>
> The Shape tells the hasher how many hash functions to apply to each item.

OK. This was my misunderstanding. There is a contract that the Hasher is expected to fulfil, but it is just not recorded in the javadoc. I can update the docs to indicate that:

"A Hasher represents items of arbitrary byte size as a byte representation of fixed size (a hash). The hash for each item is created using a hash function; use of different seeds allows generation of different hashes for the same item. The hashes can be dynamically converted into the bit index representation used by a Bloom filter. The shape of the Bloom filter defines the number of indexes per item and the range of the indexes. The hasher functions to generate the correct number of indexes in the range required by the Bloom filter for each item it represents.

Note that the process of generating hashes and mapping them to a Bloom filter shape may create duplicate indexes. The hasher may generate fewer than the required number of hash functions per item if duplicates have been removed."

> The Shape number of items is how many items are expected to be in the final
> Bloom filter, it is more the expected value not a hard limit.

Yes. As discussed before, this is not actually required for a Bloom filter to function; it is required to maintain the intended purpose of the filter when it was constructed.

>
> Keeping in mind the possibility of hash collisions, I don't see a way to
> check that the Hasher has respected the number of functions.

It would require encapsulating the Hasher functionality inside the Bloom filter. That would require passing the hash function and/or hasher to the Bloom filter on construction. The BloomFilter interface would then be changed to not accept a hasher in the contains and merge methods but the raw byte[] representation of an object. Or it could accept the Object itself if you provide a method to convert the object to bytes.

Encapsulating the conversion of objects to the hash then to the indexes is how the BloomFilter has been implemented in Guava. The implementation there is much simpler. A BloomFilter is typed to accept objects of type T. It has three methods:

put(T)
putAll(BloomFilter<T>)
mightContain(T)

Underneath it uses a Funnel<T>, which you specify, that converts T to one or more primitives/byte[]/String that are passed to a Sink. The Sink accepts data which is dynamically fed through a hash function.

Pros:

- Simple encapsulation of adding items to a filter
- Dynamic hashing without large byte[] intermediate buffers

Cons:

- No configuration of the hash function
- You lose type safety if you want to add different types of items. You have to use T = Object.
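A minimal Guava-style typed wrapper illustrates the trade-off (this is not Guava's code: the Funnel role is played by a plain T -> byte[] function, and the k indexes are derived with a toy hashing scheme purely to keep the sketch short):

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.function.Function;

// Sketch of a filter that owns hashing and index mapping, so callers
// pass items of T directly (Guava-style encapsulation).
final class TypedFilter<T> {
    private final BitSet bits;
    private final int k, m;
    private final Function<T, byte[]> funnel;

    TypedFilter(int k, int m, Function<T, byte[]> funnel) {
        this.k = k;
        this.m = m;
        this.bits = new BitSet(m);
        this.funnel = funnel;
    }

    // Toy per-seed index derivation; a real filter would use a HashFunction.
    private int index(byte[] data, int seed) {
        return Math.floorMod(31 * Arrays.hashCode(data) + seed, m);
    }

    void put(T item) {
        byte[] data = funnel.apply(item);
        for (int i = 0; i < k; i++) {
            bits.set(index(data, i));
        }
    }

    boolean mightContain(T item) {
        byte[] data = funnel.apply(item);
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(data, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Note the con above: to store both Strings and Integers in one such filter, T would have to be Object.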



>
> The static hasher for example will not return duplicates so it might appear
> that it has not respected the number of functions.  In addition there is no
> indication from the hasher how many items it contains..

Yes. So we state that the hasher represents one or more items.

>
> The inputs to the hash.builder are byte buffers that are fed to the hash
> algorithm.  They are inputs to that algorithm.  So primitive types would
> simply convert from the primitive type to its byte buffer representation.
> Is that what you meant?

I was unclear on the purpose of the Hasher.Builder. It seemed incomplete. If the builder is to add items then it seems strange to have:

with(byte property)
with(String property)

It also seems strange to throw 'IllegalStateException if the Hasher is locked’ without explaining what this means. Is the builder intended to be concurrent? What is ‘locked’? Etc.

The byte could not possibly represent many meaningful objects. The string is trivially converted to UTF-8 bytes (as is done in the DynamicHasher). Both these methods could be added to the interface as default methods or preferably dropped as they are so trivial.

I changed the documentation to remove the encoding as UTF-8 requirement from the with(String) method. It seems like an implementation detail and a Hasher.Builder implementation can decide how to convert the String. It is faster to use UTF-16 bytes, for instance. I understand UTF-8 is a cross-platform standard, but mandating that it has to be done is too restrictive IMO. It would be better as:

with(CharSequence, Charset)
withUnencoded(CharSequence)

I was interpreting the Hasher.Builder as a builder of a single byte[] for hashing where you would pass different primitive values or byte[] for the same Object you want to convert. This is essentially a ByteBuffer. But if it is to receive an entire object for each call then (a) it should be documented as such; (b) it should be simplified to just the byte[] method with perhaps another one/two:

with(byte[])
with(byte[], int length)
with(T)
with(ByteBuffer)

Adding the T method would make the interface typed as Hasher.Builder<T>. It would require a function to convert items T to a byte[]:

Collection<T> items = ...
BloomFilter bf = …
Function<T, byte[]> converter = …
HashFunction hf = ...

for (T item : items) {
    bf.merge(new DynamicHasher.Builder<>(hf, converter).with(item).build());
}

Or:

DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf, converter);
for (T item : items) {
    builder.with(item);
}
bf.merge(builder.build());

I think the Hasher.Builder interface can be removed. It does not really add anything to the API without a factory somewhere to be able to create Hasher.Builder instances since each instance has no methods for reset:

Hasher h = factory.create().with(x).with(y).with(z).build();

If you do not have either a factory to create a Hasher.Builder or the ability to reset a Builder then why have a Hasher.Builder interface? Passing around just a single instance of the builder has limited use. I would drop the interface and leave it to Hasher implementations to define how they want to be constructed.

>
> The hasher contract is that it will generate integers in the proper range
> and use the proper number of hash functions for each item that was added to
> the builder and that repeated calls to getBits(Shape) will return the same
> values.
>
> Did I misunderstand something?

No, I did. Hence the need to clarify all the javadoc.

What I think we are missing with the Hasher is the simplicity of the Guava implementation. What you ideally would like to do is:

Collection<T> items = ...
BloomFilter bf = …

for (T item : items) {
    bf.merge(item);
}

Currently you have to do something like:

Collection<T> items = ...
BloomFilter bf = …
Function<T, Hasher> itemToHasher = …

for (T item : items) {
    bf.merge(itemToHasher.apply(item));
}

The itemToHasher is effectively an improved DynamicHasher.Builder<T> as above.

It would also be possible for the itemToHasher to recycle byte[] space by returning the same Hasher object that has been reset and filled with the next item. This would not be thread-safe but would save on intermediate storage.
 
All of this still hinges on having a byte[] representation to feed to a HashFunction. Moving away from the current design would change HashFunction to specify methods to accept blocks of data and have a final method to get the hash, making the HashFunction an online hash. This would then match Guava by having some object accept items T and require a function to map the item T to blocks of data.

However I note that the current implementation that accepts a byte[] for each call to get a hash value with a different seed can either use the byte[] or not (if cyclic). If the HashFunction was online then the choice of cyclic or iterative would not easily be possible. The Guava implementation creates a single hash and then the BloomFilter always uses this with a cyclic method. So the move away from the current design would be less flexible to allow different implementations of hashing.

So we keep the byte[] interface to HashFunction for now. A performance test can be used to determine if there is an advantage to an advanced DynamicHasher.Builder which can recycle byte[] space. Javadoc should be added to the HashFunction to indicate that the same bytes passed with the same seed should create the same output. The same bytes with a different seed should create different output with very high probability. A seed of zero is used as a reset signal, for implementations that cache computation results, that the byte[] input is different from the previous call.
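A sketch of how an implementation might honour that seed-as-reset contract (the class name is hypothetical and the hash is a toy stand-in, not a real HashFunction):

```java
import java.util.Arrays;

// Sketch: seed == 0 signals that the byte[] is new input, so the cached
// expensive pre-processing is rebuilt; non-zero seeds reuse the cache.
final class CachingHash {
    private long precomputed;

    long apply(byte[] buffer, int seed) {
        if (seed == 0) {
            // Stand-in for an expensive computation done once per input.
            precomputed = Arrays.hashCode(buffer);
        }
        return precomputed * 31 + seed;
    }
}
```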


The last thing is that Hasher.isEmpty() is not used anywhere except the unit tests. It seems strange to have it. Can we just assume a Hasher is not empty? An empty hasher would return an iterator that does nothing.


In summary:

1. change Hasher getBits to iterator
2. improve documentation of Hasher and the contract that it should fulfil with respect to items and a Shape
3. potentially drop Hasher.Builder unless there is a way to reset the Builder or create more
4. Or make Hasher.Builder typed to an object <T> so it is clear the with(…) methods are to accept a full representation of an item and add it to the in-progress Hasher currently being built
5. Improve HashFunction javadoc on the use of the seed as a reset signal
6. Drop Hasher.isEmpty()

That should clarify the current functionality.

Thoughts on this?

Alex



Re: [BloomFilters] changes to BloomFilter

Alex Herbert
Another item: ObjectsHashIterative is marked as Signedness.SIGNED.

The computation is done using 32-bit integers. So the long output can be negative but the upper 32-bits are always either entirely 0 or entirely 1. I think this is a candidate for converting to an unsigned long and allowing it to be Signedness.UNSIGNED.
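The conversion amounts to masking off the sign extension (a sketch; the class name is illustrative):

```java
// When the computation is done in 32-bit arithmetic, widening to long
// sign-extends the upper 32 bits. Masking with 0xFFFFFFFFL keeps only
// the lower 32 bits, yielding a non-negative (unsigned) long.
final class Unsigned {
    static long toUnsignedLong(int hash32) {
        return hash32 & 0xFFFFFFFFL;
    }
}
```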

Alex




Re: [BloomFilters] changes to BloomFilter

Claude Warren
In reply to this post by Alex Herbert
On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <[hidden email]>
wrote:

>
> > The Shape tells the hasher how many hash functions to apply to each item.
>
> OK. This was my misunderstanding. There is a contract that the Hasher is
> expected to fulfil but it is just not recorded in the javadoc. I can update
> the docs to indicate that:
>
> "A Hasher represents items of arbitrary byte size as a byte representation
> of fixed size (a hash). The hash for each item is created using a hash
> function; use of different seeds allows generation of different hashes for
> the same item. The hashes can be dynamically converted into the bit index
> representation used by a Bloom filter. The shape of the Bloom filter
> defines the number of indexes per item and the range of the indexes. The
> hasher functions to generate the correct number of indexes in the range
> required by the Bloom filter for each item it represents.
>
> Note that the process of generating hashes and mapping them to a Bloom
> filter shape may create duplicate indexes. The hasher may generate fewer
> than the required number of hash functions per item if duplicates have been
> removed."
>
Looks good.

>
> >
> > The inputs to the hash.builder are byte buffers that are fed to the hash
> > algorithm.  They are inputs to that algorithm.  So primitive types would
> > simply convert from the primitive type to its byte buffer representation.
> > Is that what you meant?
>
> I was unclear on the purpose of the Hasher.Builder. It seemed incomplete.
> If the builder is to add items then it seems strange to have:
>
> with(byte property)
> with(String property)
>
> It also seems strange to throw 'IllegalStateException if the Hasher is
> locked’ without explaining what this means. Is the builder intended to be
> concurrent? What is ‘locked’? Etc.
>

Good catch. Documentation error. Originally the Hasher.Builder was locked
once it generated a Hasher; that was removed, but the documentation was not
cleaned up.


> The byte could not possibly represent many meaningful objects. The string
> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
> Both these methods could be added to the interface as default methods or
> preferrably dropped as they are so trivial.
>
> I changed the documentation to remove the encoding as UTF-8 requirement
> from the with(String) method. It seems like an implementation detail and a
> Hasher.Builder implementation can decide how to convert the String. It is
> faster to use UTF-16 bytes for instance. I understand UTF-8 is for
> cross-platform standard. But mandating that it has to be done is too
> restrictive IMO. It would be better as:
>
> with(CharSequence, Charset)
> withUnencoded(CharSequence)
>
>
The Hasher.Builder is patterned after the cryptographic MessageDigest
pattern, where a number of update() calls are made before a final digest()
produces the hash. Originally it was update() and build(), but an earlier
suggestion was to change update() to with().

Adding with( byte[] buffer, int offset, int len ) makes sense.

with(String) is a convenience method. It was documented as UTF-8 so that,
when trying to build equivalent filters in a different language, there is
no need to look deep in the hasher code to figure out what to do.

with(byte) is also a convenience method.

They are documented in Hasher.Builder to be common across all builders.
Obviously different Builder implementations can add other functionality.

One could build a decorator that wraps a builder and adds the with(<T>)
functionality.

The Builder interface is designed to be a balance between the minimal
implementation needed to be functional and an implementation that can handle
all arbitrary types.


> I was interpreting the Hasher.Builder as a builder of a single byte[] for
> hashing where you would pass different primitive values or byte[] for the
> same Object you want to convert. This is essentially a ByteBuffer. But if
> it is to receive an entire object for each call then (a) it should be
> documented as such; (b) it should be simplified to just the byte[] method
> with perhaps another one/two:
>
> with(byte[])
> with(byte[], int length)
> with(T)
> with(ByteBuffer)
>

> Adding the T method would make the interface typed as Hasher.Builder<T>.
> It would require a function to convert items T to a byte[]:
>
>
I think the T and ByteBuffer implementations are better left to a decorator
class.  But that is just my take.  Since the Bloom filter only supports a
logical equivalent of equals putting numerical values into a Bloom filter
is not that common.  My experience is that most values are Strings.

Collection<T> items = ...

> BloomFilter bf = …
> Function<T, byte[]> converter = …
> HashFunction hf = ...
>
> for (T item : items) {
>     bf.merge(new DynamicHasher.Builder<>(hf,
> converter).with(item).build());
> }
>
> Or:
>
> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf,
> converter);
> for (T item : items) {
>     builder.with(item);
> }
> bf.merge(builder.build());
>
> I think the Hasher.Builder interface can be removed. It does not really
> add anything to the API without a factory somewhere to be able to create
> Hasher.Builder instances since each instance has no methods for reset:
>
> At one time Builder.build() performed a reset but that was removed by
request, I would have no issues putting it back, or adding a reset method.

The Hasher.Builder interface defines a common interface across all
builders.  So I can write my code to build Bloom filters and swap out the
hasher implementation easily as long as I stick to the Builder.Hasher
methods.

Hasher h = factory.create().with(x).with(y).with(z).build();
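For illustration, that factory pattern could be sketched like this (all type names here are hypothetical stand-ins, not the commons-collections API):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in interfaces illustrating the factory idea above;
// the real commons-collections Hasher/Hasher.Builder types differ.
public class FactoryDemo {
    interface Builder {
        Builder with(byte[] item);
        List<byte[]> build(); // stand-in for a built Hasher
    }

    interface BuilderFactory {
        Builder create();
    }

    // A trivial list-backed implementation of the common interface.
    static class ListBuilder implements Builder {
        private final List<byte[]> items = new ArrayList<>();

        @Override
        public Builder with(byte[] item) {
            items.add(item);
            return this;
        }

        @Override
        public List<byte[]> build() {
            return new ArrayList<>(items);
        }
    }

    // Client code depends only on the factory and the Builder interface,
    // so the hasher implementation can be swapped freely.
    static int buildItemCount(BuilderFactory factory) {
        byte[] x = "x".getBytes(StandardCharsets.UTF_8);
        byte[] y = "y".getBytes(StandardCharsets.UTF_8);
        byte[] z = "z".getBytes(StandardCharsets.UTF_8);
        return factory.create().with(x).with(y).with(z).build().size();
    }

    public static void main(String[] args) {
        System.out.println(buildItemCount(ListBuilder::new)); // 3
    }
}
```

The point of the interface is exactly this: buildItemCount never names a concrete builder, so any implementation supplied via the factory will do.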

>
> If you do not have either a factory to create a Hasher.Builder or the
> ability to reset a Builder then why have a Hasher.Builder interface?
> Passing around just a single instance of the builder has limited use. I
> would drop the interface and leave it to Hasher implementations to define
> how they want to be constructed.
>
> >
> > The hasher contract is that it will generate integers in the proper range
> > and use the proper number of hash functions for each item that was added
> to
> > the builder and that repeated calls to getBits(Shape) will return the
> same
> > values.
> >
> > Did I misunderstand something?
>
> No, I did. Hence the need to clarify all the javadoc.
>
> What I think we are missing with the Hasher is the simplicity of the Guava
> implementation. What you ideally would like to do is:
>
> Collection<T> items = ...
> BloomFilter bf = …
>
> for (T item : items) {
>     bf.merge(item);
> }
>
> This change significantly reduces the applicability of this implementation
of Bloom filter across applications.  I am currently working on several
applications where Hashers are sent across a network to query repositories.

By separating the Hasher from the Filter (separation of concerns), I
don't have to send large objects across the network to do the query.  Nor
do I have to expose the properties I am querying for.  In addition the
endpoints I am querying are free to resize the bloom filters they store as
they see fit based on size of collection and other Shape based concerns.
With the Guava implementation this is not possible.

Note: "properties" in the above paragraph are items placed into a single
builder.
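A toy sketch of that network scenario (all names hypothetical, and the hashes are illustrative, not the library's): the client ships only per-item hashes standing in for a serialized Hasher, and each endpoint maps them onto its own filter shape, so endpoints remain free to size their stored filters as they see fit.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class RemoteQuerySketch {
    // Client side: one long per item; the queried values never cross the wire.
    public static long hashItem(String item) {
        long h = 1125899906842597L; // arbitrary toy seed for a 64-bit hash
        for (byte b : item.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    // Endpoint side: derive k indexes for the endpoint's own (k, m) shape.
    private static int index(long itemHash, int seed, int m) {
        return Math.floorMod(Long.hashCode(itemHash * 31 + seed), m);
    }

    public static void merge(BitSet filter, long itemHash, int k, int m) {
        for (int seed = 0; seed < k; seed++) {
            filter.set(index(itemHash, seed, m));
        }
    }

    public static boolean contains(BitSet filter, long itemHash, int k, int m) {
        for (int seed = 0; seed < k; seed++) {
            if (!filter.get(index(itemHash, seed, m))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int k = 3;
        int m = 1024; // this endpoint's chosen shape
        BitSet filter = new BitSet(m);
        merge(filter, hashItem("alice"), k, m);
        System.out.println(contains(filter, hashItem("alice"), k, m)); // true
    }
}
```

Because the index derivation happens endpoint-side, two endpoints storing filters of different sizes can both answer the same query payload.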

Currently you have to do something like:

>
> Collection<T> items = ...
> BloomFilter bf = …
> Function<T, Hasher> itemToHasher = …
>
> for (T item : items) {
>     bf.merge(itemToHasher.apply(item));
> }
>
> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as
> above.
>

Yes, and simple to construct from the components in the bloomfilter
packages.  In addition, I think you have made the assumption that T
contains a single value to be added, in which case a Function<T,byte[]>
would suffice and, assuming bf is not a CountingBloomFilter, the above can
be written as:

Hasher.Builder builder = new ....
Function<T,byte[]> f = new ...
items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) );
bf.merge( builder.build() );

Additionally if T is an object with multiple elements that are to be added
to the filter then

Hasher.Builder nativeBuilder = new ....
DecoratorT builder = new DecoratorT( nativeBuilder ); // adds the with(T) method
items.iterator().forEachRemaining( t -> builder.with( t ) );
bf.merge( builder.build() );

there is no difference in result between
-
loop {
bf.merge( new Builder().with(x).build() )
}

-- and --

new Builder()
loop {
builder.with( x )
}
bf.merge( builder.build() );
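The equivalence claimed above can be demonstrated with a toy filter (a BitSet plus an illustrative hash, not the library API): OR-ing per-item filters and setting all bits in one filter produce identical results.

```java
import java.util.BitSet;
import java.util.List;

// Toy demonstration: merging one single-item "hasher" at a time sets exactly
// the same bits as merging one hasher built from all the items, because
// setting bits in a plain Bloom filter is an associative OR.
public class MergeEquivalence {
    static final int M = 64; // number of bits in the filter
    static final int K = 3;  // hash functions per item

    // Toy hash: set K indexes in [0, M) derived from the item and a seed.
    static void merge(BitSet filter, String item) {
        for (int seed = 0; seed < K; seed++) {
            filter.set(Math.floorMod(item.hashCode() * 31 + seed, M));
        }
    }

    public static boolean equivalent(List<String> items) {
        // Path 1: a fresh single-item filter per item, merged (OR) into one.
        BitSet perItem = new BitSet(M);
        for (String item : items) {
            BitSet single = new BitSet(M);
            merge(single, item);
            perItem.or(single);
        }
        // Path 2: every item accumulated into one filter, merged once.
        BitSet batched = new BitSet(M);
        for (String item : items) {
            merge(batched, item);
        }
        return perItem.equals(batched);
    }

    public static void main(String[] args) {
        System.out.println(equivalent(List.of("x", "y", "z"))); // true
    }
}
```

Note this only holds for plain filters; a counting Bloom filter would record different counts for duplicate indexes depending on how the merges are batched.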


> It would also be possible for the itemToHasher to recycle byte[] space by
> returning the same Hasher object that has been reset and filled with the
> next item. This would not be thread-safe but would save on intermediate
> storage.
>
> All of this still fixes on having a byte[] representation to feed to a
> HashFunction. Moving away from the current design would change HashFunction
> to specify methods to accept blocks of data and have a final method to get
> the hash. So making the HashFunction an online hash. This would then match
> Guava by having some object accept items T and require a function to
> map the item T to blocks of data.
>
> However I note that the current implementation that accepts a byte[] for
> each call to get a hash value with a different seed can either use the
> byte[] or not (if cyclic). If the HashFunction was online then the choice
> of cyclic or iterative would not easily be possible. The Guava
> implementation creates a single hash and then the BloomFilter always uses
> this with a cyclic method. So the move away from the current design would
> be less flexible to allow different implementations of hashing.
>
> So we keep the byte[] interface to HashFunction for now. A performance
> test can be used to determine if there is an advantage to an advanced
> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
> added to the HashFunction to indicate that the same bytes passed with the
> same seed should create the same output. The same bytes with a different
> seed should create different output with very high probability. A seed of
> zero is used as a reset signal for implementations that have cached
> computation results that the byte[] input is different from the previous
> call.
>
>
> The last thing is that the Hasher.isEmpty() is not used anywhere except
> the unit tests. It seems strange to have it. Can we just assume a Hasher
> is not empty. An empty hasher would return an iterator that does nothing.
>
> Yes, I guess we don't need isEmpty(). I think it was originally used in
BloomFilter.merge() in a guard statement to test if the merge actually
needed to attempt to do anything.


> In summary:
>
> 1. change Hasher getBits to iterator
>
agree

> 2. improve documentation of Hasher and the contract that it should fulfil
> with respect to items and a Shape
>
absolutely

> 3. potentially drop Hasher.Builder unless there is a way to reset the
> Builder or create more
>
add the reset or have build() implicitly do a reset.

4. Or make Hasher.Builder typed to an object <T> so it is clear the with(…)
> methods are to accept a full representation of an item and add it to the
> in-progress Hasher currently being built
>
disagree.

5. Improve HashFunction javadoc on the use of the seed as a reset signal

agree

> 6. Drop Hasher.isEmpty()
>
ambivalent.

>
> That should clarify the current functionality.
>
> Thought on this?
>
> Alex
>
>
> >
> > Claude
> >
> >
> > On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <[hidden email]>
> > wrote:
> >
> >>
> >> On 16/03/2020 07:57, Claude Warren wrote:
> >>> I made a quick pass at changing getHasher() to iterator().
> >>
> >> A look at the feasibility or have you started work on this? If so then
> >> I'll not start work on it as well.
> >>
> >> I changed master to return a boolean for the merge operations in
> >> BloomFilter. So the outstanding changes are to drop getHasher() from the
> >> BloomFilter interface in favour of an iterator, spliterator and a
> >> forEachBit method.
> >>
> >>> I think we can get rid of HasherBloomFilter as its purpose was really
> to
> >>> create a Bloom filter for temporary usage and it doesn't seem to be
> >>> required if we have a hasher that can be created from a Shape and a
> >>> function that creates an Iterator.
> >>
> >> I agree.
> >>
> >> One change that could be made is to clarify the contract between a
> >> Hasher and a BloomFilter. At present the Hasher can operate without a
> >> defined contract in this method:
> >>
> >> PrimitiveIterator.OfInt getBits(Shape shape)
> >>
> >> It should validate that it can generate indexes for the shape. But it
> >> doesn't have to. It could return unlimited indexes and they could be
> >> outside the number of bits of the BloomFilter.
> >>
> >> There does not appear to be any control anywhere on the number of hash
> >> functions generated by the Hasher. I would expect this test in the
> >> AbstractBloomFilterTest to pass:
> >>
> >>     @Test
> >>     public void hasherMergeTest() {
> >>         int n = 1;
> >>         int m = 10;
> >>         HashFunctionIdentity h = new
> >> HashFunctionIdentityImpl("provider", "name",
> >>             Signedness.SIGNED, ProcessType.CYCLIC, 0L);
> >>         Hasher hasher = new Hasher() {
> >>             @Override
> >>             public boolean isEmpty() {
> >>                 return false;
> >>             }
> >>             @Override
> >>             public HashFunctionIdentity getHashFunctionIdentity() {
> >>                 return h;
> >>             }
> >>             @Override
> >>             public OfInt getBits(Shape shape) {
> >>                 // Do not respect the shape number of hash functions
> >> but do respect
> >>                 // the number of bits
> >>                 return IntStream.range(0, m).iterator();
> >>             }
> >>         };
> >>         for (int k = 1; k < 5; k++) {
> >>             Shape shape = new Shape(h, n, m, k);
> >>             BloomFilter bf = createEmptyFilter(shape);
> >>             bf.merge(hasher);
> >>             assertEquals("incorrect cardinality", k, bf.cardinality());
> >>         }
> >>     }
> >>
> >> It currently does not as all the BloomFilters will not respect the Shape
> >> with which they were created, i.e. they disregard the number of hash
> >> functions in the Shape. So does the Hasher.
> >>
> >> I think some of the control should be returned to the BloomFilter. The
> >> Hasher would be reduced to a simple generator of data for the
> >> BloomFilter, for example:
> >>
> >>     PrimitiveIterator.OfInt getBits(int m);
> >>     PrimitiveIterator.OfInt getBits(int k, int m);
> >>     PrimitiveIterator.OfLong getBits();
> >>
> >> The BloomFilter then accept responsibility for converting the primitives
> >> to a suitable index and creating the correct number of hash functions
> >> (i.e. indexes).
> >>
> >> A merge operation with a BloomFilter then becomes:
> >>
> >> - check the Hasher is using the correct hash function identity
> >> - ask the Hasher for an iterator
> >> - set k bits in the filter using the iterator, mapping each to the range
> >> [0, m)
> >>
> >> The BloomFilter has then encapsulated its state and respects the Shape.
> >>
> >> The HashFuntion will convert byte[] to a long.
> >>
> >> The Hasher exists to convert anything to a byte[] format.
> >>
> >> This perhaps needs the Hasher.Builder to be revised to include more
> >> methods that accept all the primitive data types. These are all
> >> converted to a single byte[] representation. Thus the Hasher.Builder is
> >> effectively a specification for a ByteBuffer. Once an object is
> >> decomposed into the byte[] it can be fed through the HashFunction with
> >> different seeds or using the cyclic method to create the iterator. The
> >> BloomFilter consumes the raw long output from the stream produced by the
> >> Hasher and sets k bits within the range m.
> >>
> >> Alex
> >>
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 17 Mar 2020, at 11:08, Claude Warren <[hidden email]> wrote:
>
> On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <[hidden email] <mailto:[hidden email]>>
> wrote:
>
>>
>>> The Shape tells the hasher how many hash functions to apply to each item.
>>
>> OK. This is may misunderstanding. There is a contract that the Hasher is
>> expected to fulfil but it is just not recorded in the javadoc. I can update
>> the docs to indicate that:
>>
>> "A Hasher represents items of arbitrary byte size as a byte representation
>> of fixed size (a hash). The hash for each item is created using a hash
>> function; use of different seeds allows generation of different hashes for
>> the same item. The hashes can be dynamically converted into the bit index
>> representation used by a Bloom filter. The shape of the Bloom filter
>> defines the number of indexes per item and the range of the indexes. The
>> hasher functions to generate the correct number of indexes in the range
>> required by the Bloom filter for each item it represents.
>>
>> Note that the process of generating hashes and mapping them to a Bloom
>> filter shape may create duplicate indexes. The hasher may generate fewer
>> than the required number of hash functions per item if duplicates have been
>> removed."
>>
>> Looks Good

Added to master.

>
>>> The Shape number of items is how many items are expected to be in the
>> final
>>> Bloom filter, it is more the expected value not a hard limit.
>>
>> Yes. As discussed before this is not actually required for a Bloom filter
>> to function, it is required to maintain the intended purpose of the filter
>> when it was constructed.
>>
>>
>>
>>>
>>> The static hasher for example will not return duplicates so it might
>> appear
>>> that it has not respected the number of functions.  In addition there is
>> no
>>> indication from the hasher how many items it contains..
>>
>> Yes. So we state that the hasher represents one or more items.
>>
>>>
>>> The inputs to the hash.builder are byte buffers that are fed to the hash
>>> algorithm.  They are inputs to that algorithm.  So primitive types would
>>> simply convert from the primitive type to its byte buffer representation.
>>> Is that what you meant?
>>
>> I was unclear on the purpose of the Hasher.Builder. It seemed incomplete.
>> If the builder is to add items then it seems strange to have:
>>
>> with(byte property)
>> with(String property)
>>
>> It also seems strange to throw 'IllegalStateException if the Hasher is
>> locked’ without explaining what this means. Is the builder intended to be
>> concurrent? What is ‘locked’? Etc.
>>
>
> Good catch. Documentation error.  Originally the Hasher.Builder was locked
> once it generated a Hasher; that behaviour was removed but the documentation
> was not cleaned up.

I’ve removed this from the Hasher.Builder interface.

>
>
>> The byte could not possibly represent many meaningful objects. The string
>> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
>> Both these methods could be added to the interface as default methods or
>> preferably dropped as they are so trivial.
>>
>> I changed the documentation to remove the encoding as UTF-8 requirement
>> from the with(String) method. It seems like an implementation detail and a
>> Hasher.Builder implementation can decide how to convert the String. It is
>> faster to use UTF-16 bytes for instance. I understand UTF-8 is the
>> cross-platform standard. But mandating that it has to be done is too
>> restrictive IMO. It would be better as:
>>
>> with(CharSequence, Charset)
>> withUnencoded(CharSequence)
>>
>>
> The Hasher.Builder is patterned after the cryptography MessageDigest
> pattern where a number of update() calls are made before a final digest()
> produces the hash.  Originally it was update() and build() but earlier
> suggestion was to change the update() to with().
>
> Adding with( byte[] buffer, int offset, int len ) makes sense.
>
> with(String) is a convenience method; it was documented as UTF-8 so that
> when trying to build equivalent filters in a different language there
> was no need to look deep in the hasher code to figure out what to do.
>
> with( byte ) is also a convenience method.
>
> They are documented in Hasher.Builder to be common across all builders.
> Obviously different Builder implementations can add other functionality.
>
> One could build a decorator that wraps a builder and adds the with(<T>)
> functionality.
>
> The Builder interface is designed to be a balance between the minimal
> implementation needed to be functional and an implementation that can handle
> all arbitrary types.

OK. What I think is different from the MessageDigest pattern is that a MessageDigest is designed to receive byte data that it composes into a single digest representing the combined bytes.

Here the Hasher.Builder is designed to create a Hasher that represents one or more items. Each item is a set of byte data. So each method to add data is effectively functioning as a complete addition of a single item. This is how the implementation of the DynamicHasher works. Each call to add data will add a byte[] to the List<byte[]>. When you call iterator(Shape) on the resulting Hasher you will get an iterator that generates k indexes per byte[] buffer that was added.
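An illustrative sketch of that behaviour (not the actual DynamicHasher source; the hash here is a toy stand-in for a seeded HashFunction):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PrimitiveIterator;

// Each with() call stores one complete item as a byte[]; the iterator then
// emits k indexes per stored buffer, so n items yield n * k indexes.
public class ListOfItemsHasher {
    private final List<byte[]> buffers = new ArrayList<>();

    public ListOfItemsHasher with(String item) {
        buffers.add(item.getBytes(StandardCharsets.UTF_8));
        return this;
    }

    // Emits k indexes in [0, m) for every buffer: k * buffers.size() in all.
    public PrimitiveIterator.OfInt iterator(int k, int m) {
        int[] indexes = new int[k * buffers.size()];
        int pos = 0;
        for (byte[] buffer : buffers) {
            for (int seed = 0; seed < k; seed++) {
                // toy hash; the real code applies a HashFunction with a seed
                indexes[pos++] = Math.floorMod(Arrays.hashCode(buffer) * 31 + seed, m);
            }
        }
        return Arrays.stream(indexes).iterator();
    }

    public static int countIndexes(int items, int k) {
        ListOfItemsHasher hasher = new ListOfItemsHasher();
        for (int i = 0; i < items; i++) {
            hasher.with("item-" + i);
        }
        int count = 0;
        for (PrimitiveIterator.OfInt it = hasher.iterator(k, 64); it.hasNext(); it.nextInt()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countIndexes(3, 5)); // 15: 3 items x 5 indexes each
    }
}
```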

Thus my point that adding a single byte is confusing. Not many entire objects would have a single byte representation!

It may be simpler to drop the methods with(String) and with(byte). There is the option to add:

with(byte[], int offset, int length)

However I am reluctant to do this as it would merely delay the inevitable copy to a duplicate array of the correct length, given that the HashFunction.apply(byte[], int) method does not support a range. So you would have to extract the entire byte[] anyway.

To aid the process of adding byte[] to the Hasher.Builder we can provide a ByteBuilder that would function similarly to a StringBuilder with append methods for all sorts of data. However I am wary that the JDK did not make ByteBuffer expandable as it impacts performance. A ByteBuilder would effectively be a duplicate of ByteBuffer but expandable. So perhaps we instead add this method:

with(ByteBuffer)

The method will extract the current contents of the ByteBuffer between the current position and the limit into a new byte[] and add it to the builder:

default Builder with(ByteBuffer byteBuffer) {
    final int remaining = byteBuffer.remaining();
    final byte[] byteArray = new byte[remaining];
    byteBuffer.get(byteArray);
    return with(byteArray);
}

Thus this is a bridge between hashing byte[] and the ability to hash anything, e.g.

            Hasher.Builder builder = ...;
            ByteBuffer bb = ByteBuffer.allocate(100);
            bb.putInt(42);
            bb.putDouble(123.45);
            bb.flip();
            builder.with(bb);
            bb.putInt(3);
            bb.putLong(78493577979L);
            bb.flip();
            Hasher hasher = builder.with(bb).build();

Given the design is fixed on having a byte[] to pass entirely to a HashFunction, we cannot remove the need to create intermediate byte[] storage.

An alternative is to formalise the convertor idea with a typed function:

<T> Builder with(T item, Function<T, byte[]> convertor);

Or provide a Builder decorator (which you discuss later):

    class BuilderDecorator<T> implements Hasher.Builder {
        Function<T, byte[]> convertor;
        Hasher.Builder builder;

        BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder builder) {
            this.convertor = convertor;
            this.builder = builder;
        }

        @Override
        public Hasher build() {
            return builder.build();
        }

        @Override
        public BuilderDecorator<T> with(byte[] property) {
            builder.with(property);
            return this;
        }

        public BuilderDecorator<T> with(T item) {
            return with(convertor.apply(item));
        }
    }

This is for the future perhaps.

>
>
>> I was interpreting the Hasher.Builder as a builder of a single byte[] for
>> hashing where you would pass different primitive values or byte[] for the
>> same Object you want to convert. This is essentially a ByteBuffer. But if
>> it is to receive an entire object for each call then (a) it should be
>> documented as such; (b) it should be simplified to just the byte[] method
>> with perhaps another one/two:
>>
>> with(byte[])
>> with(byte[], int length)
>> with(T)
>> with(ByteBuffer)
>>
>
>> Adding the T method would make the interface typed as Hasher.Builder<T>.
>> It would require a function to convert items T to a byte[]:
>>
>>
> I think the T and ByteBuffer implementations are better left to a decorator
> class.  But that is just my take.  Since the Bloom filter only supports a
> logical equivalent of equals, putting numerical values into a Bloom filter
> is not that common.  My experience is that most values are Strings.

So we drop all methods from the Builder except with(byte[]) and change ‘property’ to ‘item’ to clarify that the byte[] represents an entire item, not a property of a larger item.

>
> Collection<T> items = ...
>> BloomFilter bf = …
>> Function<T, byte[]> converter = …
>> HashFunction hf = ...
>>
>> for (T item : items) {
>>    bf.merge(new DynamicHasher.Builder<>(hf,
>> converter).with(item).build());
>> }
>>
>> Or:
>>
>> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf,
>> converter);
>> for (T item : items) {
>>    builder.with(item);
>> }
>> bf.merge(builder.build());
>>
>> I think the Hasher.Builder interface can be removed. It does not really
>> add anything to the API without a factory somewhere to be able to create
>> Hasher.Builder instances since each instance has no methods for reset:
>>
>> At one time Builder.build() performed a reset but that was removed by
> request, I would have no issues putting it back, or adding a reset method.

Add:

Builder reset();
Hasher buildAndReset();

Or force build() to do a reset as well? It is simplest to make build() mandate a reset(). Are there any situations where you would want to do incremental builds by adding more items to a builder and getting a bigger Hasher as a super-set of the previous Hasher?
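A minimal sketch of the "build() resets" contract (hypothetical class, not the library API): after build() the builder is empty, so a single instance can be reused for successive item sets.

```java
import java.util.ArrayList;
import java.util.List;

public class ResettingBuilder {
    private List<byte[]> items = new ArrayList<>();

    public ResettingBuilder with(byte[] item) {
        items.add(item);
        return this;
    }

    // "Hasher" stand-in: the collected buffers. Building hands off the
    // current list and starts a fresh one, i.e. an implicit reset.
    public List<byte[]> build() {
        List<byte[]> built = items;
        items = new ArrayList<>();
        return built;
    }

    public static int[] sizes() {
        ResettingBuilder builder = new ResettingBuilder();
        int first = builder.with(new byte[] {1}).with(new byte[] {2}).build().size();
        int second = builder.with(new byte[] {3}).build().size(); // 1, not 3: reset
        return new int[] {first, second};
    }

    public static void main(String[] args) {
        int[] sizes = sizes();
        System.out.println(sizes[0] + " " + sizes[1]); // 2 1
    }
}
```

Supporting incremental super-set builds instead would mean build() returning a snapshot without clearing the list, which is the behaviour the reset mandate would rule out.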

>
> The Hasher.Builder interface defines a common interface across all
> builders.  So I can write my code to build Bloom filters and swap out the
> hasher implementation easily as long as I stick to the Builder.Hasher
> methods.
>
> Hasher h = factory.create().with(x).with(y).with(z).build();
>>
>> If you do not have either a factory to create a Hasher.Builder or the
>> ability to reset a Builder then why have a Hasher.Builder interface?
>> Passing around just a single instance of the builder has limited use. I
>> would drop the interface and leave it to Hasher implementations to define
>> how they want to be constructed.
>>
>>>
>>> The hasher contract is that it will generate integers in the proper range
>>> and use the proper number of hash functions for each item that was added
>> to
>>> the builder and that repeated calls to getBits(Shape) will return the
>> same
>>> values.
>>>
>>> Did I misunderstand something?
>>
>> No, I did. Hence the need to clarify all the javadoc.
>>
>> What I think we are missing with the Hasher is the simplicity of the Guava
>> implementation. What you ideally would like to do is:
>>
>> Collection<T> items = ...
>> BloomFilter bf = …
>>
>> for (T item : items) {
>>    bf.merge(item);
>> }
>>
>> This change significantly reduces the applicability of this implementation
> of Bloom filter across applications.  I am currently working on several
> applications where Hashers are sent across a network to query repositories.
>
> By separating the Hasher from the Filter (separation of concerns), I
> don't have to send large objects across the network to do the query.  Nor
> do I have to expose the properties I am querying for.  In addition the
> endpoints I am querying are free to resize the bloom filters they store as
> they see fit based on size of collection and other Shape based concerns.
> With the Guava implementation this is not possible.
>
> Note: "properties" in the above paragraph are items placed into a single
> builder.

So as above, change ‘property’ to ‘item’.

>
> Currently you have to do something like:
>>
>> Collection<T> items = ...
>> BloomFilter bf = …
>> Function<T, Hasher> itemToHasher = …
>>
>> for (T item : items) {
>>    bf.merge(itemToHasher.apply(item));
>> }
>>
>> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as
>> above.
>>
>
> Yes, and simple to construct from the components in the bloomfilter
> packages.  In addition, I think you have made the assumption that T
> contains a single value to be added, in which case a Function<T,byte[]>
> would suffice and, assuming bf is not a CountingBloomFilter, the above can
> be written as:
>
> Hasher.Builder builder = new ....
> Function<T,byte[]> f = new ...
> items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) );
> bf.merge( builder.build() );
>
> Additionally if T is an object with multiple elements that are to be added
> to the filter then
>
> Hasher.Builder nativeBuilder = new ....
> DecoratorT builder = new DecoratorT( nativeBuilder ); // adds the with(T) method
> items.iterator().forEachRemaining( t -> builder.with( t ) );
> bf.merge( builder.build() );
>
> there is no difference in result between
> -
> loop {
> bf.merge( new Builder().with(x).build() )
> }
>
> -- and --
>
> new Builder()
> loop {
> builder.with( x )
> }
> bf.merge( builder.build() );
>
>
>> It would also be possible for the itemToHasher to recycle byte[] space by
>> returning the same Hasher object that has been reset and filled with the
>> next item. This would not be thread-safe but would save on intermediate
>> storage.
>>
>> All of this still fixes on having a byte[] representation to feed to a
>> HashFunction. Moving away from the current design would change HashFunction
>> to specify methods to accept blocks of data and have a final method to get
>> the hash. So making the HashFunction an online hash. This would then match
>> Guava by having some object accept items T and require a function to
>> map the item T to blocks of data.
>>
>> However I note that the current implementation that accepts a byte[] for
>> each call to get a hash value with a different seed can either use the
>> byte[] or not (if cyclic). If the HashFunction was online then the choice
>> of cyclic or iterative would not easily be possible. The Guava
>> implementation creates a single hash and then the BloomFilter always uses
>> this with a cyclic method. So the move away from the current design would
>> be less flexible to allow different implementations of hashing.
>>
>> So we keep the byte[] interface to HashFunction for now. A performance
>> test can be used to determine if there is an advantage to an advanced
>> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
>> added to the HashFunction to indicate that the same bytes passed with the
>> same seed should create the same output. The same bytes with a different
>> seed should create different output with very high probability. A seed of
>> zero is used as a reset signal for implementations that have cached
>> computation results that the byte[] input is different from the previous
>> call.
>>
>>
>> The last thing is that the Hasher.isEmpty() is not used anywhere except
>> the unit tests. It seems strange to have it. Can we just assume a Hasher
>> is not empty. An empty hasher would return an iterator that does nothing.
>>
>> Yes, I guess we don't need isEmpty(). I think it was originally used in
> BloomFilter.merge() in a guard statement to test if the merge actually
> needed to attempt to do anything.

I have removed it.
>
>
>> In summary:
>>
>> 1. change Hasher getBits to iterator
>>
> agree

Done.

>
>> 2. improve documentation of Hasher and the contract that it should fulfil
>> with respect to items and a Shape
>>
> absolutely

I’ve updated the Javadoc as above.

>
>> 3. potentially drop Hasher.Builder unless there is a way to reset the
>> Builder or create more
>>
> add the reset or have build() implicitly do a reset.

I think we mandate that Builder be reusable and that build() forces a reset. I also think dropping all the methods except with(byte[]) makes it very clear that the builder is to accept complete items on each call.

>
> 4. Or make Hasher.Builder typed to an object <T> so it is clear the with(…)
>> methods are to accept a full representation of an item and add it to the
>> in-progress Hasher currently being built
>>
> disagree.

Fine. We shall have a raw API for now and possibly decorators for objects hashing later.

>
> 5. Improve HashFunction javadoc on the use of the seed as a reset signal
>
> agree

Here there are more options. If we document that apply(byte[], int) is called with a seed of zero to initialise and then a non-zero seed to create new hashes for the same byte[], why not split the interface into two methods:

long hash(byte[]);
long increment(int seed);

We already have the ProcessType property for the HashFunction that allows the function to specify if the entire byte[] is used each time or if the initial hash is recycled when creating new values from a non-zero seed. From what I can see in the implementations the cyclic functions would just ignore the seed in the increment(int) method. The iterative functions use the seed.

I tweaked the ObjectsHashIterative HashFunction. But it seems to me that it could be further improved if we can assume that apply(byte[], int) is called with the same byte[] when the seed is non-zero. If this is true then the Arrays.hashCode(byte[]) result can be cached for reuse. The function should also be made UNSIGNED as the 32-bit result can be converted to an unsigned long.

Splitting the HashFunction interface into two methods makes it clear that an implementation will be called multiple times with the same byte[]. This is part of the design of the Hasher and so should be formalised into the design of the HashFunction.
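A sketch of what the proposed split could look like (the hash arithmetic here is a toy for illustration; the real library functions differ):

```java
import java.util.Arrays;

public class SplitHashDemo {
    interface SplitHashFunction {
        long hash(byte[] item);   // called once per item (the "seed zero" call)
        long increment(int seed); // called for each further hash of the same item
    }

    // Cyclic style: the initial hash is recycled; the byte[] is not reread
    // and the seed is effectively ignored.
    static class Cyclic implements SplitHashFunction {
        private long current;

        @Override
        public long hash(byte[] item) {
            current = Arrays.hashCode(item) & 0xffffffffL;
            return current;
        }

        @Override
        public long increment(int seed) {
            // derive the next value from the previous one only
            current = Long.rotateLeft(current, 11) ^ current;
            return current;
        }
    }

    // Iterative style: the byte[] is cached and rehashed with each new seed,
    // mirroring the caching opportunity described for ObjectsHashIterative.
    static class Iterative implements SplitHashFunction {
        private byte[] item;

        @Override
        public long hash(byte[] item) {
            this.item = item;
            return mix(0);
        }

        @Override
        public long increment(int seed) {
            return mix(seed);
        }

        private long mix(int seed) {
            return (Arrays.hashCode(item) & 0xffffffffL) * 31 + seed;
        }
    }

    public static long[] firstThree(SplitHashFunction f, byte[] item) {
        return new long[] {f.hash(item), f.increment(1), f.increment(2)};
    }

    public static void main(String[] args) {
        byte[] item = {1, 2, 3};
        System.out.println(Arrays.toString(firstThree(new Cyclic(), item)));
        System.out.println(Arrays.toString(firstThree(new Iterative(), item)));
    }
}
```

The two-method shape removes the zero-seed signalling convention entirely: the call sequence itself tells the implementation when a new item starts.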

WDYT?

>
>> 6. Drop Hasher.isEmpty()
>>
> ambivalent.

Dropped.

>
>>
>> That should clarify the current functionality.
>>
>> Thought on this?
>>
>> Alex
>>
>>
>>>
>>> Claude
>>>
>>>
>>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <[hidden email]>
>>> wrote:
>>>
>>>>
>>>> On 16/03/2020 07:57, Claude Warren wrote:
>>>>> I made a quick pass at changing getHasher() to iterator().
>>>>
>>>> A look at the feasibility or have you started work on this? If so then
>>>> I'll not start work on it as well.
>>>>
>>>> I changed master to return a boolean for the merge operations in
>>>> BloomFilter. So the outstanding changes are to drop getHasher() from the
>>>> BloomFilter interface in favour of an iterator, spliterator and a
>>>> forEachBit method.
>>>>
>>>>> I think we can get rid of HasherBloomFilter as its purpose was really
>> to
>>>>> create a Bloom filter for temporary usage and it doesn't seem to be
>>>>> required if we have a hasher that can be created from a Shape and a
>>>>> function that creates an Iterator.
>>>>
>>>> I agree.
>>>>
>>>> One change that could be made is to clarify the contract between a
>>>> Hasher and a BloomFilter. At present the Hasher can operate without a
>>>> defined contract in this method:
>>>>
>>>> PrimitiveIterator.OfInt getBits(Shape shape)
>>>>
>>>> It should validate that it can generate indexes for the shape. But it
>>>> doesn't have to. It could return unlimited indexes and they could be
>>>> outside the number of bits of the BloomFilter.
>>>>
>>>> There does not appear to be any control anywhere on the number of hash
>>>> functions generated by the Hasher. I would expect this test in the
>>>> AbstractBloomFilterTest to pass:
>>>>
>>>>    @Test
>>>>    public void hasherMergeTest() {
>>>>        int n = 1;
>>>>        int m = 10;
>>>>        HashFunctionIdentity h = new
>>>> HashFunctionIdentityImpl("provider", "name",
>>>>            Signedness.SIGNED, ProcessType.CYCLIC, 0L);
>>>>        Hasher hasher = new Hasher() {
>>>>            @Override
>>>>            public boolean isEmpty() {
>>>>                return false;
>>>>            }
>>>>            @Override
>>>>            public HashFunctionIdentity getHashFunctionIdentity() {
>>>>                return h;
>>>>            }
>>>>            @Override
>>>>            public OfInt getBits(Shape shape) {
>>>>                // Do not respect the shape number of hash functions
>>>> but do respect
>>>>                // the number of bits
>>>>                return IntStream.range(0, m).iterator();
>>>>            }
>>>>        };
>>>>        for (int k = 1; k < 5; k++) {
>>>>            Shape shape = new Shape(h, n, m, k);
>>>>            BloomFilter bf = createEmptyFilter(shape);
>>>>            bf.merge(hasher);
>>>>            assertEquals("incorrect cardinality", k, bf.cardinality());
>>>>        }
>>>>    }
>>>>
>>>> It currently does not, as the BloomFilters do not respect the Shape
>>>> with which they were created, i.e. they disregard the number of hash
>>>> functions in the Shape. So does the Hasher.
>>>>
>>>> I think some of the control should be returned to the BloomFilter. The
>>>> Hasher would be reduced to a simple generator of data for the
>>>> BloomFilter, for example:
>>>>
>>>>    PrimitiveIterator.OfInt getBits(int m);
>>>>    PrimitiveIterator.OfInt getBits(int k, int m);
>>>>    PrimitiveIterator.OfLong getBits();
>>>>
>>>> The BloomFilter then accepts responsibility for converting the primitives
>>>> to a suitable index and creating the correct number of hash functions
>>>> (i.e. indexes).
>>>>
>>>> A merge operation with a BloomFilter then becomes:
>>>>
>>>> - check the Hasher is using the correct hash function identity
>>>> - ask the Hasher for an iterator
>>>> - set k bits in the filter using the iterator, mapping each to the range
>>>> [0, m)
>>>>
>>>> The BloomFilter has then encapsulated its state and respects the Shape.
>>>>
>>>> The HashFunction will convert byte[] to a long.
>>>>
>>>> The Hasher exists to convert anything to a byte[] format.
>>>>
>>>> This perhaps needs the Hasher.Builder to be revised to include more
>>>> methods that accept all the primitive data types. These are all
>>>> converted to a single byte[] representation. Thus the Hasher.Builder is
>>>> effectively a specification for a ByteBuffer. Once an object is
>>>> decomposed into the byte[] it can be fed through the HashFunction with
>>>> different seeds or using the cyclic method to create the iterator. The
>>>> BloomFilter consumes the raw long output from the stream produced by the
>>>> Hasher and sets k bits within the range m.
>>>>
>>>> Alex
>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [hidden email]
>>>> For additional commands, e-mail: [hidden email]
>>>>
>>>>
>>>
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email] <mailto:[hidden email]>
>> For additional commands, e-mail: [hidden email] <mailto:[hidden email]>
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com <http://like-like.xenei.com/>>
> LinkedIn: http://www.linkedin.com/in/claudewarren <http://www.linkedin.com/in/claudewarren>
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Claude Warren
I agree with the HashFunction changes.

The only place I know of where more items are added to a hasher is in our
test code.  So I think just requiring Builder.build() to do a reset is
the correct solution.
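A reset-on-build could look like this minimal sketch (names are illustrative; a List<byte[]> stands in for the built Hasher):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal builder whose build() performs the reset, standing in for
// Hasher.Builder; the returned List<byte[]> stands in for a built Hasher.
class ResettingBuilder {
    private List<byte[]> items = new ArrayList<>();

    ResettingBuilder with(byte[] item) {
        // Copy so later mutation of the caller's array cannot change the item.
        items.add(item.clone());
        return this;
    }

    List<byte[]> build() {
        List<byte[]> built = items;
        items = new ArrayList<>(); // reset: the builder is immediately reusable
        return built;
    }
}
```

After build() the same instance can be used to accumulate the next batch of items without any explicit reset() call.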
I think Builder should have
with(byte[])
with(byte[], int offset, int len )
with(String)

I find that I use with(String) more than any other with() method.

If you want to add with(ByteBuffer) with the implementation you have above
I think that would be a good addition.  Anything else can probably be done
with a Decorator.

I would like to see https://github.com/apache/commons-collections/pull/131
get merged so that we can have more than one example of a hasher that
actually hashes.




On Tue, Mar 17, 2020 at 1:53 PM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 17 Mar 2020, at 11:08, Claude Warren <[hidden email]> wrote:
> >
> > On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <[hidden email]
> <mailto:[hidden email]>>
> > wrote:
> >
> >>
> >>> The Shape tells the hasher how many hash functions to apply to each
> item.
> >>
> >> OK. This is my misunderstanding. There is a contract that the Hasher is
> >> expected to fulfil but it is just not recorded in the javadoc. I can
> update
> >> the docs to indicate that:
> >>
> >> "A Hasher represents items of arbitrary byte size as a byte
> representation
> >> of fixed size (a hash). The hash for each item is created using a hash
> >> function; use of different seeds allows generation of different hashes
> for
> >> the same item. The hashes can be dynamically converted into the bit
> index
> >> representation used by a Bloom filter. The shape of the Bloom filter
> >> defines the number of indexes per item and the range of the indexes. The
> >> hasher functions to generate the correct number of indexes in the range
> >> required by the Bloom filter for each item it represents.
> >>
> >> Note that the process of generating hashes and mapping them to a Bloom
> >> filter shape may create duplicate indexes. The hasher may generate fewer
> >> than the required number of hash functions per item if duplicates have
> been
> >> removed."
> >>
> >> Looks Good
>
> Added to master.
>
> >
> >>> The Shape number of items is how many items are expected to be in the
> >> final
> >>> Bloom filter; it is more of an expected value than a hard limit.
> >>
> >> Yes. As discussed before this is not actually required for a Bloom
> filter
> >> to function, it is required to maintain the intended purpose of the
> filter
> >> when it was constructed.
> >>
> >>
> >>
> >>>
> >>> The static hasher for example will not return duplicates so it might
> >> appear
> >>> that it has not respected the number of functions.  In addition there
> is
> >> no
> >>> indication from the hasher how many items it contains.
> >>
> >> Yes. So we state that the hasher represents one or more items.
> >>
> >>>
> >>> The inputs to the hash.builder are byte buffers that are fed to the
> hash
> >>> algorithm.  They are inputs to that algorithm.  So primitive types
> would
> >>> simply convert from the primitive type to its byte buffer
> representation.
> >>> Is that what you meant?
> >>
> >> I was unclear on the purpose of the Hasher.Builder. It seemed
> incomplete.
> >> If the builder is to add items then it seems strange to have:
> >>
> >> with(byte property)
> >> with(String property)
> >>
> >> It also seems strange to throw 'IllegalStateException if the Hasher is
> >> locked’ without explaining what this means. Is the builder intended to
> be
> >> concurrent? What is ‘locked’? Etc.
> >>
> >
> > Good catch. Documentation error.  Originally the Hasher.Builder was
> > locked once it generated a Hasher; that behaviour was removed but the
> > documentation was not cleaned up.
>
> I’ve removed this from the Hasher.Builder interface.
>
> >
> >
> >> The byte could not possibly represent many meaningful objects. The
> string
> >> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
> >> Both these methods could be added to the interface as default methods or
> >> preferably dropped as they are so trivial.
> >>
> >> I changed the documentation to remove the encoding as UTF-8 requirement
> >> from the with(String) method. It seems like an implementation detail
> and a
> >> Hasher.Builder implementation can decide how to convert the String. It
> is
> >> faster to use UTF-16 bytes for instance. I understand UTF-8 is for
> >> cross-platform standard. But mandating that it has to be done is too
> >> restrictive IMO. It would be better as:
> >>
> >> with(CharSequence, Charset)
> >> withUnencoded(CharSequence)
> >>
> >>
> > The Hasher.Builder is patterned after the cryptography MessageDigest
> > pattern where a number of update() calls are made before a final digest()
> > produces the hash.  Originally it was update() and build() but earlier
> > suggestion was to change the update() to with().
> >
> > Adding with( byte[] buffer, int offset, int len ) makes sense.
> >
> > with(String ) is a convenience method, it was documented as UTF-8 so that
> > if when  trying to build equivalent filters in a different language there
> > was no need to look deep in the hasher code to figure out what to do.
> >
> > with( byte ) is also a convenience method.
> >
> > They are documented in Hasher.Builder to be common across all builders.
> > Obviously different Builder implementations can add other functionality.
> >
> > One could build a decorator that wraps a builder and adds the with(<T>)
> > functionality.
> >
> > The Builder interface is designed to be a balance between minimal
> > implementation need to be functional and an implementation that can
> handle
> > all arbitrary types.
>
> OK. What I think is different from the MessageDigest pattern is that a
> MessageDigest is designed to receive byte data that is composes into a
> single digest representing the combined bytes.
>
> Here the Hasher.Builder is designed to create a Hasher that represents one
> or more items. Each item is a set of byte data. So each method to add data
> is effectively functioning as a complete addition of a single item. This is
> how the implementation of the DynamicHasher works. Each call to add data
> will add a byte[] to the List<byte[]>. When you call iterator(Shape) on the
> resulting Hasher you will get an iterator that generates k indexes per
> byte[] buffer that was added.
>
> Thus my point that adding a single byte is confusing. Not many entire
> objects would have a single byte representation!
>
> It may be simpler to drop the methods with(String) and with(byte). There
> is the option to add:
>
> with(byte[], int offset, int length)
>
> However I am reluctant to do this as it would just be a delay of the
> inevitable copy to a duplicate array of the correct length given that the
> HashFunction.apply(byte[], int) method does not support a range. So you
> would have to extract the entire byte[] anyway.
>
> To aid the process of adding byte[] to the Hasher.Builder we can provided
> a ByteBuilder that would function similarly to a StringBuilder with append
> methods for all sorts of data. However I am wary that the JDK did not make
> ByteBuffer expandable as it impacts performance. A ByteBuilder would
> effectively be a duplicate of ByteBuffer but expandable. So perhaps we add
> instead this method:
>
> with(ByteBuffer)
>
> The method will extract the current contents of the ByteBuffer between the
> current position and the limit into a new byte[] and add it to the builder:
>
> default Builder with(ByteBuffer byteBuffer) {
>     final int remaining = byteBuffer.remaining();
>     final byte[] byteArray = new byte[remaining];
>     byteBuffer.get(byteArray);
>     return with(byteArray);
> }
>
> Thus this is a bridge between hashing byte[] and the ability to hash
> anything, e.g.
>
>             Hasher.Builder builder = ...;
>             ByteBuffer bb = ByteBuffer.allocate(100);
>             bb.putInt(42);
>             bb.putDouble(123.45);
>             bb.flip();
>             builder.with(bb);
>             bb.putInt(3);
>             bb.putLong(78493577979L);
>             bb.flip();
>             Hasher hasher = builder.with(bb).build();
>
> Given the design is fixed on having byte[] to pass entirely to a
> HashFunction then we cannot remove the need to create intermediate byte[]
> storage.
>
> An alternative is to formalise the convertor idea with a typed function:
>
> <T> Builder with(T item, Function<T, byte[]> convertor);
>
> Or provide a Builder decorator (which you discuss later):
>
>     class BuilderDecorator<T> implements Hasher.Builder {
>         Function<T, byte[]> convertor;
>         Hasher.Builder builder;
>
>         BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder
> builder) {
>             this.convertor = convertor;
>             this.builder = builder;
>         }
>
>         @Override
>         public Hasher build() {
>             return builder.build();
>         }
>
>         @Override
>         public BuilderDecorator<T> with(byte[] property) {
>             builder.with(property);
>             return this;
>         }
>
>         public BuilderDecorator<T> with(T item) {
>             return with(convertor.apply(item));
>         }
>     }
>
> This is for the future perhaps.
>
> >
> >
> >> I was interpreting the Hasher.Builder as a builder of a single byte[]
> for
> >> hashing where you would pass different primitive values or byte[] for
> the
> >> same Object you want to convert. This is essentially a ByteBuffer. But
> if
> >> it is to receive an entire object for each call then (a) it should be
> >> documented as such; (b) it should be simplified to just the byte[]
> method
> >> with perhaps another one/two:
> >>
> >> with(byte[])
> >> with(byte[], int length)
> >> with(T)
> >> with(ByteBuffer)
> >>
> >
> >> Adding the T method would make the interface typed as Hasher.Builder<T>.
> >> It would require a function to convert items T to a byte[]:
> >>
> >>
> > I think the T and ByteBuffer implementations are better left to a
> decorator
> > class.  But that is just my take.  Since the Bloom filter only supports a
> > logical equivalent of equals putting numerical values into a Bloom filter
> > is not that common.  My experience is that most values are Strings.
>
> So we drop all methods from the Builder except with(byte[]) and change the
> ‘property’ to ‘item’ to clarify that the byte[] is representing an entire
> item, not a property of a larger item.
>
> >
> > Collection<T> items = ...
> >> BloomFilter bf = …
> >> Function<T, byte[]> converter = …
> >> HashFunction hf = ...
> >>
> >> for (T item : items) {
> >>    bf.merge(new DynamicHasher.Builder<>(hf,
> >> converter).with(item).build());
> >> }
> >>
> >> Or:
> >>
> >> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf,
> >> converter);
> >> for (T item : Collection<T>) {
> >>    builder.with(item);
> >> }
> >> bf.merge(builder.build());
> >>
> >> I think the Hasher.Builder interface can be removed. It does not really
> >> add anything to the API without a factory somewhere to be able to create
> >> Hasher.Builder instances since each instance has no methods for reset:
> >>
> >> At one time Builder.build() performed a reset but that was removed by
> > request, I would have no issues putting it back, or adding a reset
> method.
>
> Add:
>
> Builder reset();
> Hasher buildAndReset();
>
> Or force build() to do a reset as well? It is simplest to make build()
> mandate a reset(). Any situations where you would want to do incremental
> builds by adding more items to a builder and getting a bigger Hasher as a
> super-set of the previous Hasher?
>
> >
> > The Hasher.Builder interface defines a common interface across all
> > builders.  So I can write my code to build Bloom filters and swap out the
> > hasher implementation easily as long as I stick to the Builder.Hasher
> > methods.
> >
> > Hasher h = factory.create().with(x).with(y).with(z).build();
> >>
> >> If you do not have either a factory to create a Hasher.Builder or the
> >> ability to reset a Builder then why have a Hasher.Builder interface?
> >> Passing around just a single instance of the builder has limited use. I
> >> would drop the interface and leave it to Hasher implementations to
> define
> >> how they want to be constructed.
> >>
> >>>
> >>> The hasher contract is that it will generate integers in the proper
> range
> >>> and use the proper number of hash functions for each item that was
> added
> >> to
> >>> the builder and that repeated calls to getBits(Shape) will return the
> >> same
> >>> values.
> >>>
> >>> Did I misunderstand something?
> >>
> >> No, I did. Hence the need to clarify all the javadoc.
> >>
> >> What I think we are missing with the Hasher is the simplicity of the
> Guava
> >> implementation. What you ideally would like to do is:
> >>
> >> Collection<T> items = ...
> >> BloomFilter bf = …
> >>
> >> for (T item : items) {
> >>    bf.merge(item);
> >> }
> >>
> >> This change significantly reduces the applicability of this
> implementation
> > of Bloom filter across applications.  I am currently working on several
> > applications where Hashers are sent across a network to query
> repositories.
> >
> > By separating the Hasher from the Filter (separation of concerns) then I
> > don't have to send large objects across the network to do the query.  Nor
> > do I have to expose the properties I am querying for.  In addition the
> > endpoints I am querying are free to resize the bloom filters they store
> as
> > they see fit based on size of collection and other Shape based concerns.
> > With the Guava implementation this is not possible.
> >
> > Note: "properties" in the above paragraph are items placed into a single
> > builder.
>
> So as above, change ‘property’ to ‘item’.
>
> >
> > Currently you have to do something like:
> >>
> >> Collection<T> items = ...
> >> BloomFilter bf = …
> >> Function<T, Hasher> itemToHasher = …
> >>
> >> for (T item : items) {
> >>    bf.merge(itemToHasher.apply(item));
> >> }
> >>
> >> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as
> >> above.
> >>
> >
> > Yes, and simple to construct from the components in the bloomfilter
> > packages.  In addition, I think you have made the assumption that T
> > contains a single value to be added, in which case a Function<T,byte[]>
> > would suffice and, assuming bf is not a CountingBloom filter, the above
> can
> > be written as:
> >
> > Hasher.Builder builder = new ....
> > Function<T,byte[]> f = new ...
> > items.iterator().forEachRemaining( t -> builder.with( f.apply(t) )
> > bf.merge( builder.build() )
> >
> > Additionally if T is an object with multiple elements that are to be
> added
> > to the filter then
> >
> > Hasher.Builder nativeBuilder = new ....
> > DecoratorT builder = new DecoratorT( nativeBuilder ) // adds the with(T)
> > method
> > items.iterator().forEachRemaining( t -> builder.with( t ) )
> > bf.merge( builder.build() )
> >
> > there is no difference in result between
> > -
> > loop {
> > bf.merge( new Builder().with(x).build() )
> > }
> >
> > -- and --
> >
> > new Builder()
> > loop {
> > builder.with( x )
> > }
> > bf.merge( builder.build() );
> >
> >
> >> It would also be possible for the itemToHasher to recycle byte[] space
> by
> >> returning the same Hasher object that has been reset and filled with the
> >> next item. This would not be thread-safe but would save on intermediate
> >> storage.
> >>
> >> All of this still fixes on having a byte[] representation to feed to a
> >> HashFunction. Moving away from the current design would change
> HashFunction
> >> to specify methods to accept blocks of data and have a final method to
> get
> >> the hash. So making the HashFunction an online hash. This would then
> match
> >> Guava by having the some object accept items T and require a function to
> >> map the item T to blocks of data.
> >>
> >> However I note that the current implementation that accepts a byte[] for
> >> each call to get a hash value with a different seed can either use the
> >> byte[] or not (if cyclic). If the HashFunction was online then the
> choice
> >> of cyclic or iterative would not easily be possible. The Guava
> >> implementation creates a single hash and then the BloomFilter always
> uses
> >> this with a cyclic method. So the move away from the current design
> would
> >> be less flexible to allow different implementations of hashing.
> >>
> >> So we keep the byte[] interface to HashFunction for now. A performance
> >> test can be used to determine if there is an advantage to an advanced
> >> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
> >> added to the HashFunction to indicate that the same bytes passed with
> the
> >> same seed should create the same output. The same bytes with a different
> >> seed should create different output with very high probability. A seed
> of
> >> zero is used as a reset signal for implementations that have cached
> >> computation results that the byte[] input is different from the previous
> >> call.
> >>
> >>
> >> The last thing is that the Hasher.isEmpty() is not used anywhere except
> >> the units tests. It seems strange to have it. Can we just assume a
> Hasher
> >> is not empty. An empty hasher would return an iterator that does
> nothing.
> >>
> >> Yes, I guess we don't need isEmpty() I think it was originally used in
> > BloomFilter.merge() in a guard statement to test if the merge actually
> > needed to attempt to do anything.
>
> I have removed it.
> >
> >
> >> In summary:
> >>
> >> 1. change Hasher getBits to iterator
> >>
> > agree
>
> Done.
>
> >
> >> 2. improve documentation of Hasher and the contract that it should
> fulfil
> >> with respect to items and a Shape
> >>
> > absolutely
>
> I’ve updated the Javadoc as above.
>
> >
> >> 3. potentially drop Hasher.Builder unless there is a way to reset the
> >> Builder or create more
> >>
> > add the reset or have build() implicitly do a reset.
>
> I think we mandate Builder be reusable and that build() forces a reset. I
> also think dropping all the method except with(byte[]) makes it very clear
> that the builder is to accept complete items on each call.
>
> >
> > 4. Or make Hasher.Builder typed to an object <T> so it is clear the
> with(…)
> >> methods are to accept a full representation of an item and add it to the
> >> in-progress Hasher currently being built
> >>
> > disagree.
>
> Fine. We shall have a raw API for now and possibly decorators for objects
> hashing later.
>
> >
> > 5. Improve HashFunction javadoc on the use of the seed as a reset signal
> >
> > agree
>
> Here there are more options. If we document that apply(byte[], int) is
> called with a seed of zero to initialise and then non-zero seed to create
> new hashes for the same byte[] why not make the interface split into two
> methods:
>
> long hash(byte[]);
> long increment(int seed);
>
> We already have the ProcessType property for the HashFunction that allows
> the function to specify if the entire byte[] is used each time or if the
> initial hash is recycled when creating new values from a non-zero seed.
> From what I can see in the implementations the cyclic functions would just
> ignore the seed in the increment(int) method. The iterative functions use
> the seed.
>
> I tweaked the ObjectsHashIterative HashFunction. But it seems to me that
> it could be further improved if we can assume that apply(byte[], int) is
> called with the same byte[] when the seed is non-zero. If this is true then
> the Arrays.hashCode(byte[]) result can be cached for reuse. The function
> should also be made UNSIGNED as the 32-bit result can be converted to an
> unsigned long.
>
> Splitting the HashFunction interface into two methods makes it clear that
> an implementation will be called multiple times with the same byte[]. This
> is part of the design of the Hasher and so should be formalised into the
> design of the HashFunction.
>
> WDYT?
>
> >
> >> 6. Drop Hasher.isEmpty()
> >>
> > ambivalent.
>
> Dropped.
>
> >
> >>
> >> That should clarify the current functionality.
> >>
> >> Thoughts on this?
> >>
> >> Alex
> >>
> >>
> >>>
> >>> Claude
> >>>
> >>>
> >>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <[hidden email]
> >
> >>> wrote:
> >>>
> >>>>
> >>>> On 16/03/2020 07:57, Claude Warren wrote:
> >>>>> I made a quick pass at changing getHasher() to iterator().
> >>>>
> >>>> A look at the feasibility or have you started work on this? If so then
> >>>> I'll not start work on it as well.
> >>>>
> >>>> I changed master to return a boolean for the merge operations in
> >>>> BloomFilter. So the outstanding changes are to drop getHasher() from
> the
> >>>> BloomFilter interface in favour of an iterator, spliterator and a
> >>>> forEachBit method.
> >>>>
> >>>>> I think we can get rid of HasherBloomFilter as its purpose was really
> >> to
> >>>>> create a Bloom filter for temporary usage and it doesn't seem to be
> >>>>> required if we have a hasher that can be created from a Shape and a
> >>>>> function that creates an Iterator.
> >>>>
> >>>> I agree.
> >>>>
> >>>> One change that could be made is to clarify the contract between a
> >>>> Hasher and a BloomFilter. At present the Hasher can operate without a
> >>>> defined contract in this method:
> >>>>
> >>>> PrimitiveIterator.OfInt getBits(Shape shape)
> >>>>
> >>>> It should validate that it can generate indexes for the shape. But it
> >>>> doesn't have to. It could return unlimited indexes and they could be
> >>>> outside the number of bits of the BloomFilter.
> >>>>
> >>>> There does not appear to be any control anywhere on the number of hash
> >>>> functions generated by the Hasher. I would expect this test in the
> >>>> AbstractBloomFilterTest to pass:
> >>>>
> >>>>    @Test
> >>>>    public void hasherMergeTest() {
> >>>>        int n = 1;
> >>>>        int m = 10;
> >>>>        HashFunctionIdentity h = new
> >>>> HashFunctionIdentityImpl("provider", "name",
> >>>>            Signedness.SIGNED, ProcessType.CYCLIC, 0L);
> >>>>        Hasher hasher = new Hasher() {
> >>>>            @Override
> >>>>            public boolean isEmpty() {
> >>>>                return false;
> >>>>            }
> >>>>            @Override
> >>>>            public HashFunctionIdentity getHashFunctionIdentity() {
> >>>>                return h;
> >>>>            }
> >>>>            @Override
> >>>>            public OfInt getBits(Shape shape) {
> >>>>                // Do not respect the shape number of hash functions
> >>>> but do respect
> >>>>                // the number of bits
> >>>>                return IntStream.range(0, m).iterator();
> >>>>            }
> >>>>        };
> >>>>        for (int k = 1; k < 5; k++) {
> >>>>            Shape shape = new Shape(h, n, m, k);
> >>>>            BloomFilter bf = createEmptyFilter(shape);
> >>>>            bf.merge(hasher);
> >>>>            assertEquals("incorrect cardinality", k, bf.cardinality());
> >>>>        }
> >>>>    }
> >>>>
> >>>> It currently does not, as the BloomFilters do not respect the
> Shape
> >>>> with which they were created, i.e. they disregard the number of hash
> >>>> functions in the Shape. So does the Hasher.
> >>>>
> >>>> I think some of the control should be returned to the BloomFilter. The
> >>>> Hasher would be reduced to a simple generator of data for the
> >>>> BloomFilter, for example:
> >>>>
> >>>>    PrimitiveIterator.OfInt getBits(int m);
> >>>>    PrimitiveIterator.OfInt getBits(int k, int m);
> >>>>    PrimitiveIterator.OfLong getBits();
> >>>>
> >>>> The BloomFilter then accepts responsibility for converting the
> primitives
> >>>> to a suitable index and creating the correct number of hash functions
> >>>> (i.e. indexes).
> >>>>
> >>>> A merge operation with a BloomFilter then becomes:
> >>>>
> >>>> - check the Hasher is using the correct hash function identity
> >>>> - ask the Hasher for an iterator
> >>>> - set k bits in the filter using the iterator, mapping each to the
> range
> >>>> [0, m)
> >>>>
> >>>> The BloomFilter has then encapsulated its state and respects the
> Shape.
> >>>>
> >>>> The HashFunction will convert byte[] to a long.
> >>>>
> >>>> The Hasher exists to convert anything to a byte[] format.
> >>>>
> >>>> This perhaps needs the Hasher.Builder to be revised to include more
> >>>> methods that accept all the primitive data types. These are all
> >>>> converted to a single byte[] representation. Thus the Hasher.Builder
> is
> >>>> effectively a specification for a ByteBuffer. Once an object is
> >>>> decomposed into the byte[] it can be fed through the HashFunction with
> >>>> different seeds or using the cyclic method to create the iterator. The
> >>>> BloomFilter consumes the raw long output from the stream produced by
> the
> >>>> Hasher and sets k bits within the range m.
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: [hidden email]
> >>>> For additional commands, e-mail: [hidden email]
> >>>>
> >>>>
> >>>
> >>> --
> >>> I like: Like Like - The likeliest place on the web
> >>> <http://like-like.xenei.com>
> >>> LinkedIn: http://www.linkedin.com/in/claudewarren
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email] <mailto:
> [hidden email]>
> >> For additional commands, e-mail: [hidden email] <mailto:
> [hidden email]>
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com <http://like-like.xenei.com/>>
> > LinkedIn: http://www.linkedin.com/in/claudewarren <
> http://www.linkedin.com/in/claudewarren>
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 17 Mar 2020, at 15:41, Claude Warren <[hidden email]> wrote:
>
> I agree with the HashFunction changes.

OK, but which ones?

Changing HashFunction to have two methods:

long hash(byte[])
long increment(int seed)

Or just correctly documenting what we have.

And/or fixing the ObjectsHashIterative to be unsigned.

Also see below on the use of a range.

>
> The only place I know of where more items are added to a hasher is in our
> test code.  So I think just requiring Builder.build() to do a reset is
> the correct solution.

OK.

> I think Builder should have
> with(byte[])
> with(byte[], int offset, int len )

Not convinced here. The HashFunction requires a byte[] and cannot operate on a range. This change should be made in conjunction with a similar change to HashFunction. So should we update HashFunction to:

long hash(byte[] data, int offset, int length);
long increment(int seed);

Using a range may be exploited to recycle byte[] storage. It does however break the DynamicHasher which currently stores only the byte[]. It also adds complexity for any iterative implementation that then requires storing the offset and length. This would suggest this:

// Initial hash, seed = 0
long hash(byte[] data, int offset, int length);
// Incremental hash, only called with the same data, offset and length and a non-zero seed.
long hash(byte[] data, int offset, int length, int seed);

I favour the former and force implementations to hold a reference to the data if they require it for the incremental hash.
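To illustrate the favoured form, here is a minimal sketch (all names and the hash algorithm itself are assumptions for illustration, not the current API): the implementation caches the range on the initial call so that the incremental calls need only the seed.

```java
// Hypothetical sketch of the proposed range-based HashFunction.
// The fold used in rawHash is a trivial stand-in algorithm.
public class IterativeRangeFunction {
    private byte[] data;
    private int offset;
    private int length;

    /** Initial hash, seed = 0. The range is cached for incremental calls. */
    public long hash(byte[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
        return rawHash(0);
    }

    /** Incremental hash, called with the same data, offset and length. */
    public long increment(int seed) {
        return rawHash(seed);
    }

    private long rawHash(int seed) {
        // Only bytes inside [offset, offset + length) contribute
        long h = seed * 0x9E3779B97F4A7C15L + 1;
        for (int i = offset; i < offset + length; i++) {
            h = 31 * h + data[i];
        }
        return h;
    }
}
```

A hasher built on this could then recycle one large byte[] and re-point the range at each item without copying.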

> with(String)
>
> I find that I use with(String) more than any other with() method.

That may be so but String.getBytes(Charset) is trivial to call for the user. Then they get to decide on the encoding and not leave it to the Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a cross-language standard. Leave it out of the API for now, or add both:

Builder with(CharSequence, Charset);
Builder withUnencoded(CharSequence);

I would argue that you may use BloomFilters for Strings but if we see a BloomFilter as a collection then we should really support all Objects (with a decorator or by typing the Builder) or not support Objects. Currently we are not supporting any Object so for now would drop this and the Hasher.Builder then becomes a very simple API that specifies that you put in items represented as a byte[] and call build to create a Hasher containing those items and reset for further use.
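That simple contract could be sketched as follows (a stand-in, not the real interface; the List<byte[]> here stands in for the built Hasher):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the minimal Builder contract: with(byte[]) adds one complete
// item; build() returns the accumulated items and resets the builder.
public class MinimalBuilder {
    private List<byte[]> items = new ArrayList<>();

    /** Adds one item, represented as a complete byte[]. */
    public MinimalBuilder with(byte[] item) {
        items.add(item);
        return this;
    }

    /** Hands the items over (here as a plain list) and implicitly resets. */
    public List<byte[]> build() {
        final List<byte[]> built = items;
        items = new ArrayList<>();
        return built;
    }
}
```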


>
> If you want to add with(ByteBuffer) with the implementation you have above
> I think that would be a good addition.  Anything else can probably be done
> with a Decorator.

I’d suggest leaving all this for now and correcting what we have.

>
> I would like to see https://github.com/apache/commons-collections/pull/131
> get merged so that we can have more than one example of a hasher that
> actually hashes.
>

As above I’d hold off on merging anything new as it just adds overhead when the current API is still changing.


>
>
>
> On Tue, Mar 17, 2020 at 1:53 PM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>>> On 17 Mar 2020, at 11:08, Claude Warren <[hidden email]> wrote:
>>>
>>> On Tue, Mar 17, 2020 at 12:28 AM Alex Herbert <[hidden email]>
>>> wrote:
>>>
>>>>
>>>>> The Shape tells the hasher how many hash functions to apply to each
>> item.
>>>>
>>>> OK. This is my misunderstanding. There is a contract that the Hasher is
>>>> expected to fulfil but it is just not recorded in the javadoc. I can
>> update
>>>> the docs to indicate that:
>>>>
>>>> "A Hasher represents items of arbitrary byte size as a byte
>> representation
>>>> of fixed size (a hash). The hash for each item is created using a hash
>>>> function; use of different seeds allows generation of different hashes
>> for
>>>> the same item. The hashes can be dynamically converted into the bit
>> index
>>>> representation used by a Bloom filter. The shape of the Bloom filter
>>>> defines the number of indexes per item and the range of the indexes. The
>>>> hasher functions to generate the correct number of indexes in the range
>>>> required by the Bloom filter for each item it represents.
>>>>
>>>> Note that the process of generating hashes and mapping them to a Bloom
>>>> filter shape may create duplicate indexes. The hasher may generate fewer
>>>> than the required number of hash functions per item if duplicates have
>> been
>>>> removed."
>>>>
>>>> Looks Good
>>
>> Added to master.
>>
>>>
>>>>> The Shape number of items is how many items are expected to be in the
>>>> final
>>>>> Bloom filter, it is more the expected value not a hard limit.
>>>>
>>>> Yes. As discussed before this is not actually required for a Bloom
>> filter
>>>> to function, it is required to maintain the intended purpose of the
>> filter
>>>> when it was constructed.
>>>>
>>>>
>>>>
>>>>>
>>>>> The static hasher for example will not return duplicates so it might
>>>> appear
>>>>> that it has not respected the number of functions.  In addition there
>> is
>>>> no
>>>>> indication from the hasher how many items it contains.
>>>>
>>>> Yes. So we state that the hasher represents one or more items.
>>>>
>>>>>
>>>>> The inputs to the hash.builder are byte buffers that are fed to the
>> hash
>>>>> algorithm.  They are inputs to that algorithm.  So primitive types
>> would
>>>>> simply convert from the primitive type to its byte buffer
>> representation.
>>>>> Is that what you meant?
>>>>
>>>> I was unclear on the purpose of the Hasher.Builder. It seemed
>> incomplete.
>>>> If the builder is to add items then it seems strange to have:
>>>>
>>>> with(byte property)
>>>> with(String property)
>>>>
>>>> It also seems strange to throw 'IllegalStateException if the Hasher is
>>>> locked’ without explaining what this means. Is the builder intended to
>> be
>>>> concurrent? What is ‘locked’? Etc.
>>>>
>>>
>>> Good catch. Documentation error.  Originally the Hasher.Builder was
>> locked
>>> once it generated a Hasher that was removed but the documentation was not
>>> cleaned up.
>>
>> I’ve removed this from the Hasher.Builder interface.
>>
>>>
>>>
>>>> The byte could not possibly represent many meaningful objects. The
>> string
>>>> is trivially converted to UTF-8 bytes (as is done in the DynamicHasher).
>>>> Both these methods could be added to the interface as default methods or
>>>> preferably dropped as they are so trivial.
>>>>
>>>> I changed the documentation to remove the encoding as UTF-8 requirement
>>>> from the with(String) method. It seems like an implementation detail
>> and a
>>>> Hasher.Builder implementation can decide how to convert the String. It
>> is
>>>> faster to use UTF-16 bytes for instance. I understand UTF-8 is for
>>>> cross-platform standard. But mandating that it has to be done is too
>>>> restrictive IMO. It would be better as:
>>>>
>>>> with(CharSequence, Charset)
>>>> withUnencoded(CharSequence)
>>>>
>>>>
>>> The Hasher.Builder is patterned after the cryptography MessageDigest
>>> pattern where a number of update() calls are made before a final digest()
>>> produces the hash.  Originally it was update() and build() but earlier
>>> suggestion was to change the update() to with().
>>>
>>> Adding with( byte[] buffer, int offset, int len ) makes sense.
>>>
>>> with(String ) is a convenience method, it was documented as UTF-8 so that
>>> if when  trying to build equivalent filters in a different language there
>>> was no need to look deep in the hasher code to figure out what to do.
>>>
>>> with( byte ) is also a convenience method.
>>>
>>> They are documented in Hasher.Builder to be common across all builders.
>>> Obviously different Builder implementations can add other functionality.
>>>
>>> One could build a decorator that wraps a builder and adds the with(<T>)
>>> functionality.
>>>
>>> The Builder interface is designed to be a balance between minimal
>>> implementation need to be functional and an implementation that can
>> handle
>>> all arbitrary types.
>>
>> OK. What I think is different from the MessageDigest pattern is that a
>> MessageDigest is designed to receive byte data that is composes into a
>> single digest representing the combined bytes.
>>
>> Here the Hasher.Builder is designed to create a Hasher that represents one
>> or more items. Each item is a set of byte data. So each method to add data
>> is effectively functioning as a complete addition of a single item. This is
>> how the implementation of the DynamicHasher works. Each call to add data
>> will add a byte[] to the List<byte[]>. When you call iterator(Shape) on the
>> resulting Hasher you will get an iterator that generates k indexes per
>> byte[] buffer that was added.
>>
>> Thus my point that adding a single byte is confusing. Not many entire
>> objects would have a single byte representation!
>>
>> It may be simpler to drop the methods with(String) and with(byte). There
>> is the option to add:
>>
>> with(byte[], int offset, int length)
>>
>> However I am reluctant to do this as it would just be a delay of the
>> inevitable copy to a duplicate array of the correct length given that the
>> HashFunction.apply(byte[], int) method does not support a range. So you
>> would have to extract the entire byte[] anyway.
>>
>> To aid the process of adding byte[] to the Hasher.Builder we can provided
>> a ByteBuilder that would function similarly to a StringBuilder with append
>> methods for all sorts of data. However I am wary that the JDK did not make
>> ByteBuffer expandable as it impacts performance. A ByteBuilder would
>> effectively be a duplicate of ByteBuffer but expandable. So perhaps we add
>> instead this method:
>>
>> with(ByteBuffer)
>>
>> The method will extract the current contents of the ByteBuffer between the
>> current position and the limit into a new byte[] and add it to the builder:
>>
>> default Builder with(ByteBuffer byteBuffer) {
>>    final int remaining = byteBuffer.remaining();
>>    final byte[] byteArray = new byte[remaining];
>>    byteBuffer.get(byteArray);
>>    return with(byteArray);
>> }
>>
>> Thus this is a bridge between hashing byte[] and the ability to hash
>> anything, e.g.
>>
>>            Hasher.Builder builder = ...;
>>            ByteBuffer bb = ByteBuffer.allocate(100);
>>            bb.putInt(42);
>>            bb.putDouble(123.45);
>>            bb.flip();
>>            builder.with(bb);
>>            // reset position/limit before reusing the buffer, otherwise
>>            // the next put would overflow the consumed buffer
>>            bb.clear();
>>            bb.putInt(3);
>>            bb.putLong(78493577979L);
>>            bb.flip();
>>            Hasher hasher = builder.with(bb).build();
>>
>> Given the design is fixed on having byte[] to pass entirely to a
>> HashFunction then we cannot remove the need to create intermediate byte[]
>> storage.
>>
>> An alternative is to formalise the convertor idea with a typed function:
>>
>> <T> Builder with(T item, Function<T, byte[]> convertor);
>>
>> Or provide a Builder decorator (which you discuss later):
>>
>>    class BuilderDecorator<T> implements Hasher.Builder {
>>        Function<T, byte[]> convertor;
>>        Hasher.Builder builder;
>>
>>        BuilderDecorator(Function<T, byte[]> convertor, Hasher.Builder
>> builder) {
>>            this.convertor = convertor;
>>            this.builder = builder;
>>        }
>>
>>        @Override
>>        public Hasher build() {
>>            return builder.build();
>>        }
>>
>>        @Override
>>        public BuilderDecorator<T> with(byte[] property) {
>>            builder.with(property);
>>            return this;
>>        }
>>
>>        public BuilderDecorator<T> with(T item) {
>>            return with(convertor.apply(item));
>>        }
>>    }
>>
>> This is for the future perhaps.
>>
>>>
>>>
>>>> I was interpreting the Hasher.Builder as a builder of a single byte[]
>> for
>>>> hashing where you would pass different primitive values or byte[] for
>> the
>>>> same Object you want to convert. This is essentially a ByteBuffer. But
>> if
>>>> it is to receive an entire object for each call then (a) it should be
>>>> documented as such; (b) it should be simplified to just the byte[]
>> method
>>>> with perhaps another one/two:
>>>>
>>>> with(byte[])
>>>> with(byte[], int length)
>>>> with(T)
>>>> with(ByteBuffer)
>>>>
>>>
>>>> Adding the T method would make the interface typed as Hasher.Builder<T>.
>>>> It would require a function to convert items T to a byte[]:
>>>>
>>>>
>>> I think the T and ByteBuffer implementations are better left to a
>> decorator
>>> class.  But that is just my take.  Since the Bloom filter only supports a
>>> logical equivalent of equals putting numerical values into a Bloom filter
>>> is not that common.  My experience is that most values are Strings.
>>
>> So we drop all method from the Builder except with(byte[]) and change the
>> ‘property’ to ‘item’ to clarify that the byte[] is representing an entire
>> item, not a property of a larger item.
>>
>>>
>>> Collection<T> items = ...
>>>> BloomFilter bf = …
>>>> Function<T, byte[]> converter = …
>>>> HashFunction hf = ...
>>>>
>>>> for (T item : items) {
>>>>   bf.merge(new DynamicHasher.Builder<>(hf,
>>>> converter).with(item).build());
>>>> }
>>>>
>>>> Or:
>>>>
>>>> DynamicHasher.Builder<T> builder = new DynamicHasher.Builder<>(hf,
>>>> converter);
>>>> for (T item : Collection<T>) {
>>>>   builder.with(item);
>>>> }
>>>> bf.merge(builder.build());
>>>>
>>>> I think the Hasher.Builder interface can be removed. It does not really
>>>> add anything to the API without a factory somewhere to be able to create
>>>> Hasher.Builder instances since each instance has no methods for reset:
>>>>
>>>> At one time Builder.build() performed a reset but that was removed by
>>> request, I would have no issues putting it back, or adding a reset
>> method.
>>
>> Add:
>>
>> Builder reset();
>> Hasher buildAndReset();
>>
>> Or force build() to do a reset as well? It is simplest to make build()
>> mandate a reset(). Any situations where you would want to do incremental
>> builds by adding more items to a builder and getting a bigger Hasher as a
>> super-set of the previous Hasher?
>>
>>>
>>> The Hasher.Builder interface defines a common interface across all
>>> builders.  So I can write my code to build Bloom filters and swap out the
>>> hasher implementation easily as long as I stick to the Builder.Hasher
>>> methods.
>>>
>>> Hasher h = factory.create().with(x).with(y).with(z).build();
>>>>
>>>> If you do not have either a factory to create a Hasher.Builder or the
>>>> ability to reset a Builder then why have a Hasher.Builder interface?
>>>> Passing around just a single instance of the builder has limited use. I
>>>> would drop the interface and leave it to Hasher implementations to
>> define
>>>> how they want to be constructed.
>>>>
>>>>>
>>>>> The hasher contract is that it will generate integers in the proper
>> range
>>>>> and use the proper number of hash functions for each item that was
>> added
>>>> to
>>>>> the builder and that repeated calls to getBits(Shape) will return the
>>>> same
>>>>> values.
>>>>>
>>>>> Did I misunderstand something?
>>>>
>>>> No, I did. Hence the need to clarify all the javadoc.
>>>>
>>>> What I think we are missing with the Hasher is the simplicity of the
>> Guava
>>>> implementation. What you ideally would like to do is:
>>>>
>>>> Collection<T> items = ...
>>>> BloomFilter bf = …
>>>>
>>>> for (T item : items) {
>>>>   bf.merge(item);
>>>> }
>>>>
>>>> This change significantly reduces the applicability of this
>> implementation
>>> of Bloom filter across applications.  I am currently working on several
>>> applications where Hashers are sent across a network to query
>> repositories.
>>>
>>> By separating the Hasher from the Filter (separation of concerns) then I
>>> don't have to send large objects across the network to do the query.  Nor
>>> do I have to expose the properties I am querying for.  In addition the
>>> endpoints I am querying are free to resize the bloom filters they store
>> as
>>> they see fit based on size of collection and other Shape based concerns.
>>> With the Guava implementation this is not possible.
>>>
>>> Note: "properties" in the above paragraph are items placed into a single
>>> builder.
>>
>> So as above, change ‘property’ to ‘item’.
>>
>>>
>>> Currently you have to do something like:
>>>>
>>>> Collection<T> items = ...
>>>> BloomFilter bf = …
>>>> Function<T, Hasher> itemToHasher = …
>>>>
>>>> for (T item : items) {
>>>>   bf.merge(itemToHasher.apply(item));
>>>> }
>>>>
>>>> The itemToHasher is effectively an improved DynamicHasher.Builder<T> as
>>>> above.
>>>>
>>>
>>> Yes, and simple to construct from the components in the bloomfilter
>>> packages.  In addition, I think you have made the assumption that T
>>> contains a single value to be added, in which case a Function<T,byte[]>
>>> would suffice and, assuming bf is not a CountingBloom filter, the above
>> can
>>> be written as:
>>>
>>> Hasher.Builder builder = new ....
>>> Function<T,byte[]> f = new ...
>>> items.iterator().forEachRemaining( t -> builder.with( f.apply(t) ) );
>>> bf.merge( builder.build() );
>>>
>>> Additionally if T is an object with multiple elements that are to be
>> added
>>> to the filter then
>>>
>>> Hasher.Builder nativeBuilder = new ....
>>> DecoratorT builder = new DecoratorT( nativeBuilder ) // adds the with(T)
>>> method
>>> items.iterator().forEachRemaining( t -> builder.with( t ) );
>>> bf.merge( builder.build() );
>>>
>>> there is no difference in result between
>>> -
>>> loop {
>>> bf.merge( new Builder().with(x).build() )
>>> }
>>>
>>> -- and --
>>>
>>> new Builder()
>>> loop {
>>> builder.with( x )
>>> }
>>> bf.merge( builder.build() );
>>>
>>>
>>>> It would also be possible for the itemToHasher to recycle byte[] space
>> by
>>>> returning the same Hasher object that has been reset and filled with the
>>>> next item. This would not be thread-safe but would save on intermediate
>>>> storage.
>>>>
>>>> All of this still fixes on having a byte[] representation to feed to a
>>>> HashFunction. Moving away from the current design would change
>> HashFunction
>>>> to specify methods to accept blocks of data and have a final method to
>> get
>>>> the hash. So making the HashFunction an online hash. This would then
>> match
>>>> Guava by having the some object accept items T and require a function to
>>>> map the item T to blocks of data.
>>>>
>>>> However I note that the current implementation that accepts a byte[] for
>>>> each call to get a hash value with a different seed can either use the
>>>> byte[] or not (if cyclic). If the HashFunction was online then the
>> choice
>>>> of cyclic or iterative would not easily be possible. The Guava
>>>> implementation creates a single hash and then the BloomFilter always
>> uses
>>>> this with a cyclic method. So the move away from the current design
>> would
>>>> be less flexible to allow different implementations of hashing.
>>>>
>>>> So we keep the byte[] interface to HashFunction for now. A performance
>>>> test can be used to determine if there is an advantage to an advanced
>>>> DynamicHasher.Builder which can recycle byte[] space. Javadoc should be
>>>> added to the HashFunction to indicate that the same bytes passed with
>> the
>>>> same seed should create the same output. The same bytes with a different
>>>> seed should create different output with very high probability. A seed
>> of
>>>> zero is used as a reset signal for implementations that have cached
>>>> computation results that the byte[] input is different from the previous
>>>> call.
>>>>
>>>>
>>>> The last thing is that the Hasher.isEmpty() is not used anywhere except
>>>> the units tests. It seems strange to have it. Can we just assume a
>> Hasher
>>>> is not empty. An empty hasher would return an iterator that does
>> nothing.
>>>>
>>>> Yes, I guess we don't need isEmpty(). I think it was originally used in
>>> BloomFilter.merge() in a guard statement to test if the merge actually
>>> needed to attempt to do anything.
>>
>> I have removed it.
>>>
>>>
>>>> In summary:
>>>>
>>>> 1. change Hasher getBits to iterator
>>>>
>>> agree
>>
>> Done.
>>
>>>
>>>> 2. improve documentation of Hasher and the contract that it should
>> fulfil
>>>> with respect to items and a Shape
>>>>
>>> absolutely
>>
>> I’ve updated the Javadoc as above.
>>
>>>
>>>> 3. potentially drop Hasher.Builder unless there is a way to reset the
>>>> Builder or create more
>>>>
>>> add the reset or have build() implicitly do a reset.
>>
>> I think we mandate Builder be reusable and that build() forces a reset. I
>> also think dropping all the method except with(byte[]) makes it very clear
>> that the builder is to accept complete items on each call.
>>
>>>
>>> 4. Or make Hasher.Builder typed to an object <T> so it is clear the
>> with(…)
>>>> methods are to accept a full representation of an item and add it to the
>>>> in-progress Hasher currently being built
>>>>
>>> disagree.
>>
>> Fine. We shall have a raw API for now and possibly decorators for objects
>> hashing later.
>>
>>>
>>> 5. Improve HashFunction javadoc on the use of the seed as a reset signal
>>>
>>> agree
>>
>> Here there are more options. If we document that apply(byte[], long) is
>> called with a seed of zero to initialise and then non-zero seed to create
>> new hashes for the same byte[] why not make the interface split into two
>> methods:
>>
>> long hash(byte[]);
>> long increment(int seed);
>>
>> We already have the ProcessType property for the HashFunction that allows
>> the function to specify if the entire byte[] is used each time or if the
>> initial hash is recycled when creating new values from a non-zero seed.
>> From what I can see in the implementations the cyclic functions would just
>> ignore the seed in the increment(int) method. The iterative functions use
>> the seed.
>>
>> I tweaked the ObjectsHashIterative HashFunction. But it seems to me that
>> it could be further improved if we can assume that apply(byte[], int) is
>> called with the same byte[] when the seed is non-zero. If this is true then
>> the Arrays.hashCode(byte[]) result can be cached for reuse. The function
>> should also be made UNSIGNED as the 32-bit result can be converted to an
>> unsigned long.
>>
>> Splitting the HashFunction interface into two methods makes it clear that
>> an implementation will be called multiple times with the same byte[]. This
>> is part of the design of the Hasher and so should be formalised into the
>> design of the HashFunction.
>>
>> WDYT?
>>
>>>
>>>> 6. Drop Hasher.isEmpty()
>>>>
>>> ambivalent.
>>
>> Dropped.
>>
>>>
>>>>
>>>> That should clarify the currently functionality.
>>>>
>>>> Thought on this?
>>>>
>>>> Alex
>>>>
>>>>
>>>>>
>>>>> Claude
>>>>>
>>>>>
>>>>> On Mon, Mar 16, 2020 at 6:34 PM Alex Herbert <[hidden email]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On 16/03/2020 07:57, Claude Warren wrote:
>>>>>>> I made a quick pass at changing getHasher() to iterator().
>>>>>>
>>>>>> A look at the feasibility or have you started work on this? If so then
>>>>>> I'll not start work on it as well.
>>>>>>
>>>>>> I changed master to return a boolean for the merge operations in
>>>>>> BloomFilter. So the outstanding changes are to drop getHasher() from
>> the
>>>>>> BloomFilter interface in favour of an iterator, spliterator and a
>>>>>> forEachBit method.
>>>>>>
>>>>>>> I think we can get rid of HasherBloomFilter as its purpose was really
>>>> to
>>>>>>> create a Bloom filter for temporary usage and it doesn't seem to be
>>>>>>> required if we have a hasher that can be created from a Shape and a
>>>>>>> function that creates an Iterator.
>>>>>>
>>>>>> I agree.
>>>>>>
>>>>>> One change that could be made is to clarify the contract between a
>>>>>> Hasher and a BloomFilter. At present the Hasher can operate without a
>>>>>> defined contract in this method:
>>>>>>
>>>>>> PrimitiveIterator.OfInt getBits(Shape shape)
>>>>>>
>>>>>> It should validate that it can generate indexes for the shape. But it
>>>>>> doesn't have to. It could return unlimited indexes and they could be
>>>>>> outside the number of bits of the BloomFilter.
>>>>>>
>>>>>> There does not appear to be any control anywhere on the number of hash
>>>>>> functions generated by the Hasher. I would expect this test in the
>>>>>> AbstractBloomFilterTest to pass:
>>>>>>
>>>>>>   @Test
>>>>>>   public void hasherMergeTest() {
>>>>>>       int n = 1;
>>>>>>       int m = 10;
>>>>>>       HashFunctionIdentity h = new
>>>>>> HashFunctionIdentityImpl("provider", "name",
>>>>>>           Signedness.SIGNED, ProcessType.CYCLIC, 0L);
>>>>>>       Hasher hasher = new Hasher() {
>>>>>>           @Override
>>>>>>           public boolean isEmpty() {
>>>>>>               return false;
>>>>>>           }
>>>>>>           @Override
>>>>>>           public HashFunctionIdentity getHashFunctionIdentity() {
>>>>>>               return h;
>>>>>>           }
>>>>>>           @Override
>>>>>>           public OfInt getBits(Shape shape) {
>>>>>>               // Do not respect the shape number of hash functions
>>>>>> but do respect
>>>>>>               // the number of bits
>>>>>>               return IntStream.range(0, m).iterator();
>>>>>>           }
>>>>>>       };
>>>>>>       for (int k = 1; k < 5; k++) {
>>>>>>           Shape shape = new Shape(h, n, m, k);
>>>>>>           BloomFilter bf = createEmptyFilter(shape);
>>>>>>           bf.merge(hasher);
>>>>>>           assertEquals("incorrect cardinality", k, bf.cardinality());
>>>>>>       }
>>>>>>   }
>>>>>>
>>>>>> It currently does not as all the BloomFilters will not respect the
>> Shape
>>>>>> with which they were created, i.e. they disregard the number of hash
>>>>>> functions in the Shape. So does the Hasher.
>>>>>>
>>>>>> I think some of the control should be returned to the BloomFilter. The
>>>>>> Hasher would be reduced to a simple generator of data for the
>>>>>> BloomFilter, for example:
>>>>>>
>>>>>>   PrimitiveIterator.OfInt getBits(int m);
>>>>>>   PrimitiveIterator.OfInt getBits(int k, int m);
>>>>>>   PrimitiveIterator.OfLong getBits();
>>>>>>
>>>>>> The BloomFilter then accept responsibility for converting the
>> primitives
>>>>>> to a suitable index and creating the correct number of hash functions
>>>>>> (i.e. indexes).
>>>>>>
>>>>>> A merge operation with a BloomFilter then becomes:
>>>>>>
>>>>>> - check the Hasher is using the correct hash function identity
>>>>>> - ask the Hasher for an iterator
>>>>>> - set k bits in the filter using the iterator, mapping each to the
>> range
>>>>>> [0, m)
>>>>>>
>>>>>> The BloomFilter has then encapsulated its state and respects the
>> Shape.
>>>>>>
>>>>>> The HashFuntion will convert byte[] to a long.
>>>>>>
>>>>>> The Hasher exists to convert anything to a byte[] format.
>>>>>>
>>>>>> This perhaps needs the Hasher.Builder to be revised to include more
>>>>>> methods that accept all the primitive data types. These are all
>>>>>> converted to a single byte[] representation. Thus the Hasher.Builder
>> is
>>>>>> effectively a specification for a ByteBuffer. Once an object is
>>>>>> decomposed into the byte[] it can be fed through the HashFunction with
>>>>>> different seeds or using the cyclic method to create the iterator. The
>>>>>> BloomFilter consumes the raw long output from the stream produced by
>> the
>>>>>> Hasher and sets k bits within the range m.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> I like: Like Like - The likeliest place on the web
>>>>> <http://like-like.xenei.com>
>>>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> I like: Like Like - The likeliest place on the web
>>> <http://like-like.xenei.com>
>>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Claude Warren
On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 17 Mar 2020, at 15:41, Claude Warren <[hidden email]> wrote:
> >
> > I agree with the HashFunction changes.
>
> OK, but which ones?
>

DOH! this one...

>
> Changing HashFunction to have two methods:
>
> long hash(byte[])
> long increment(int seed)
>
> > I think Builder should have
> > with(byte[])
> > with(byte[], int offset, int len )
>
> Not convinced here. The HashFunction requires a byte[] and cannot operate
> on a range. This change should be made in conjunction with a similar change
> to HashFunction. So should we update HashFunction to:
>
>
Given the depth of the change let's just leave the with( byte[] )


> > with(String)
> >
> > I find that I use with(String) more than any other with() method.
>
> That may be so but String.getBytes(Charset) is trivial to call for the
> user. Then they get to decide on the encoding and not leave it to the
> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a
> cross-language standard. Leave it out of the API for now, or add both:
>
> Builder with(CharSequence, Charset);
> Builder withUnencoded(CharSequence);
>

CharSequence has no easy method to convert to a byte[]. While it could be
done, it looks to be more of a streaming interface.  Let's leave that out.


> I would argue that you may use BloomFilters for Strings but if we see a
> BloomFilter as a collection then we should really support all Objects (with
> a decorator or by typing the Builder) or not support Objects. Currently we
> are not supporting any Object so for now would drop this and the
> Hasher.Builder then becomes a very simple API that specifies that you put
> in items represented as a byte[] and call build to create a Hasher
> containing those items and reset for further use.
>

I have code examples in several places where I hash GeoCode entities.  Since
they are comprised of strings, for the most part, building a hasher for
them simply requires hashing the Strings.  Many web services use JSON and
most JSON is string based.  I disagree with removing with(String) because
it is so convenient in so many cases.  It also makes the code
cleaner/easier to read.  But if you feel strongly about removing it then OK.

The only other thing that is really bothersome is the lack of
Shape.getNumberOfBytes().  Yes, it is easy to call Math.ceil(
Shape.getNumberOfBits() / 8.0 ), but getNumberOfBytes() is much more
readable in the code.
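For what it is worth, the helper is one line of integer arithmetic (a sketch; Shape does not currently expose it, and the class name here is made up):

```java
// Hypothetical getNumberOfBytes(): (bits + 7) / 8 computes ceil(bits / 8)
// without the round-trip through double that Math.ceil requires.
public class ShapeBytes {
    public static int getNumberOfBytes(int numberOfBits) {
        return (numberOfBits + 7) / 8;
    }
}
```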

Claude

Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 17 Mar 2020, at 17:06, Claude Warren <[hidden email]> wrote:
>
> On Tue, Mar 17, 2020 at 4:38 PM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>>> On 17 Mar 2020, at 15:41, Claude Warren <[hidden email]> wrote:
>>>
>>> I agree with the HashFunction changes.
>>
>> OK, but which ones?
>>
>
> DOH! this one...
>
>>
>> Changing HashFunction to have two methods:
>>
>> long hash(byte[])
>> long increment(int seed)

OK. I’ll update.

>>
>>> I think Builder should have
>>> with(byte[])
>>> with(byte[], int offset, int len )
>>
>> Not convinced here. The HashFunction requires a byte[] and cannot operate
>> on a range. This change should be made in conjunction with a similar change
>> to HashFunction. So should we update HashFunction to:
>>
>>
> Given the depth of the change let's just leave the with( byte[] )
>
>
>>> with(String)
>>>
>>> I find that I use with(String) more than any other with() method.
>>
>> That may be so but String.getBytes(Charset) is trivial to call for the
>> user. Then they get to decide on the encoding and not leave it to the
>> Hasher. I would use UTF-16 because it would be fast. But UTF-8 is nice as a
>> cross-language standard. Leave it out of the API for now, or add both:
>>
>> Builder with(CharSequence, Charset);
>> Builder withUnencoded(CharSequence);
>>
>
> CharSequence has no easy method to convert to a byte[]. While it could be
> done, it looks to be more of a streaming interface.  Let's leave that out.

I was thinking:

        /**
         * Adds a character sequence item to the hasher using the specified encoding.
         *
         * @param item the item to add
         * @param charset the character set
         * @return a reference to this object
         */
        default Builder with(CharSequence item, Charset charset) {
            return with(item.toString().getBytes(charset));
        }

        /**
         * Adds a character sequence item to the hasher. Each 16-bit character is
         * converted to 2 bytes using little-endian order.
         *
         * @param item the item to add
         * @return a reference to this object
         */
        default Builder withUnencoded(CharSequence item) {
            final int length = item.length();
            final byte[] bytes = new byte[length * 2];
            for (int i = 0; i < length; i++) {
                final char ch = item.charAt(i);
                bytes[i * 2] = (byte) ch;
                bytes[i * 2 + 1] = (byte) (ch >>> 8);
            }
            return with(bytes);
        }

>
>
>> I would argue that you may use BloomFilters for Strings but if we see a
>> BloomFilter as a collection then we should really support all Objects (with
>> a decorator or by typing the Builder) or not support Objects. Currently we
>> are not supporting any Object so for now would drop this and the
>> Hasher.Builder then becomes a very simple API that specifies that you put
>> in items represented as a byte[] and call build to create a Hasher
>> containing those items and reset for further use.
>>
>
> I have code example in several places where I hash GeoCode entities.  Since
> they are comprised of strings, for the most part, building a hasher for
> them simply requires hashing the Strings.  Many web services use JSON and
> most JSON is string based.  I disagree with removing with(String) because
> it is so convenient in so many cases.  It also makes the code
> cleaner/easier to read.  But if you feel strongly about removing it then OK.

So with(String) is replaced by a default with(CharSequence, Charset) method.

>
> The only other thing that is really bothersome is the lack of
> Shape.getNumberOfBytes().  Yes it is easy to call Math.ceil(
> Shape.getNumberOfBits / 8.0 ).  But getNumberOfBytes() is much more
> readable in the code.

I removed this as the number of bytes is implementation dependent, for example if using a compressed BloomFilter.

If the number of bytes is intended to be for the uncompressed representation then this does not match up with the canonical format of the BloomFilter using a long[].

Q. Why do you need the number of bytes?


One thing we do not have is the computation for the false positive probability using the current cardinality (this is in the Guava code):

p(false positive) = (cardinality() / m) ^ k

Which is just a probability that you will choose all k indexes as ones that are already set. This is not on the wikipedia page. Is this something you have used?
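A sketch of that Guava-style computation (the class and method names are illustrative, not the current API):

```java
// Sketch: false positive probability estimated from the current fill,
// p = (cardinality / m) ^ k, i.e. the chance that all k chosen indexes
// are already set.
public class CurrentFpp {
    public static double probability(int cardinality, int m, int k) {
        return Math.pow((double) cardinality / m, k);
    }
}
```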


Also on the wikipedia page is a section on CountingBloomFilters where each bucket is stated as typically having 3-4 bits. Using 4 bits would allow counts up to 15 for each index. It would be more space efficient than the current use of 32-bit integers. Changing to 8-bit bytes is a simple compromise. There is also a main page on counting Bloom filters [1]:

For a typical shape with p 1e-6 and n items we have

N = 1000, M = 28756, k = 20
N = 10000, M = 287552, k = 20

Worst case scenario is each item added sets the same index and count could be equal to n. But if the hash is uniform then the count for each bit will be much lower. It is taken from the binomial distribution using number of trials = nk, and probability of success = 1/m. Here are the expected counts (l) for an index when the filter is filled with n items (p is just used to set optimal k and m using n):

n       k     m        p           l    binom(l, nk, 1/m)
1000    20    28756    1.00E-06    0    0.498815437
1000    20    28756    1.00E-06    1    0.346941705
1000    20    28756    1.00E-06    2    0.12064836
1000    20    28756    1.00E-06    4    0.004862559
1000    20    28756    1.00E-06    8    6.76619E-07
1000    20    28756    1.00E-06    16   7.10853E-17
1000    20    28756    1.00E-06    32   1.66389E-41
1000    20    28756    1.00E-06    64   2.8772E-100
1000    20    28756    1.00E-06    127  1.0449E-234
10000   20    287552   1.00E-06    0    0.498811214
10000   20    287552   1.00E-06    1    0.346937562
10000   20    287552   1.00E-06    2    0.120651928
10000   20    287552   1.00E-06    4    0.004863763
10000   20    287552   1.00E-06    8    6.77448E-07
10000   20    287552   1.00E-06    16   7.14657E-17
10000   20    287552   1.00E-06    32   1.70126E-41
10000   20    287552   1.00E-06    64   3.15E-100
10000   20    287552   1.00E-06    127  1.4984E-234

So being able to store counts up to Integer.MAX_VALUE is total overkill. We could use 4 bits per counter. But this is harder to do with the current design that uses the sign of entries as a guard digit for overflow or negative counts. A simple switch is to drop to a byte[] and allow counts up to 127. Any overflow would mark the state invalid. This makes the ArrayCountingBloomFilter 4 times more space efficient.
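A minimal sketch of that byte[] approach, where exceeding 127 marks the state invalid (class and method names are assumptions for illustration):

```java
// Sketch: byte[]-backed counting as proposed. Incrementing past 127
// does not wrap; it flags the whole filter state as invalid.
public class ByteCounter {
    private final byte[] counts;
    private boolean valid = true;

    public ByteCounter(int m) {
        counts = new byte[m];
    }

    public void increment(int index) {
        if (counts[index] == Byte.MAX_VALUE) {
            valid = false;  // overflow past 127 invalidates the state
        } else {
            counts[index]++;
        }
    }

    public boolean isValid() {
        return valid;
    }

    public int count(int index) {
        return counts[index];
    }
}
```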

Note that a counting Bloom filter is also described to have functionality we do not currently have. We can test for membership of a Hasher (e.g. an item) within the CountingBloomFilter but not query for a set number of the same item. Should this be added to the interface:

boolean contains(Hasher hasher, int count)

Here the Hasher is an issue as it can represent multiple items which could overlap indexes. In this case the 'contains a specified count' method is invalid as the expected count for those indexes that are overlapping would be higher.
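For the single-item case (distinct, non-overlapping indexes) the check itself is straightforward; a sketch over raw index/count arrays (names are hypothetical):

```java
// Sketch: a counting filter "contains n copies" test. True only if every
// index produced by the hashed item has a count of at least n.
public class ContainsCount {
    public static boolean contains(int[] counts, int[] indexes, int n) {
        for (int i : indexes) {
            if (counts[i] < n) {
                return false;
            }
        }
        return true;
    }
}
```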

[1] https://en.wikipedia.org/wiki/Counting_Bloom_filter <https://en.wikipedia.org/wiki/Counting_Bloom_filter>
>
> Claude


Re: [BloomFilters] changes to BloomFilter

Claude Warren
Builder discussion:

Let's go with

>> Builder with(CharSequence, Charset);
>> Builder withUnencoded(CharSequence);

Shape Discussion:

as for getNumberOfBytes() it should return the maximum number of bytes
returned by a getBits() call to a filter with this shape.  So yes, if there
is a compressed internal representation, no it won't be that.  It is a
method on Shape so it should literally be Math.ceil( getNumberOfBits() /
8.0 )

Basically, if you want to create an array that will fit all the bits
returned by BloomFilter.iterator() you need an array of
Shape.getNumberOfBytes().  And that is actually what I use it for.
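That use case — packing the indexes from BloomFilter.iterator() into a byte array of exactly that size — looks roughly like this (indexes shown as an int[] for brevity; the class is hypothetical):

```java
// Sketch: pack bit indexes into a byte array of ceil(numberOfBits / 8)
// bytes, little-endian bit order within each byte.
public class BitPacker {
    public static byte[] pack(int[] indexes, int numberOfBits) {
        byte[] bytes = new byte[(numberOfBits + 7) / 8];
        for (int i : indexes) {
            bytes[i / 8] |= (byte) (1 << (i % 8));
        }
        return bytes;
    }
}
```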

Bloom Filter Discussion:

I have not used probability() on a single filter, just on the Shape.
However, I can see the value of it.  It seems to me that the difference
between Shape.probability() and BloomFilter.probability() would give you an
indication of how many collisions have already occurred.

In the SetOperations class there is an "estimateSize()" method.  It is the
only method in the class that takes a single Bloom filter argument, and I
think it should be moved into BloomFilter so we have 2 new methods:

probability()
estimateSize()
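For context, the usual cardinality-based size estimate (Swamidass & Baldi) is n ≈ -(m/k) ln(1 - t/m), where t is the cardinality; a sketch, not the library implementation:

```java
// Sketch: estimate the number of inserted items from the number of set
// bits t, using n ~= -(m / k) * ln(1 - t / m).
public class SizeEstimate {
    public static double estimateSize(int cardinality, int m, int k) {
        return -((double) m / k) * Math.log(1.0 - (double) cardinality / m);
    }
}
```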

Counting Filter Discussion:

As for counting filters we could implement several
int
short
byte

Currently they would all have to return int counts but realistically that
is what would be used in code anyway.  Also, once primitives can be used in
generics this will be easier.

As for contains( Hasher, int ), I think we would need to add contains(
BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
would test that a BloomFilter has been added X times or rather that there
are enough counts in the right places to make it appear that BloomFilter
has been added X times.  When used with a Hasher, remove the duplicates,
and perform the same test.

I see no reason not to add them.

On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert <[hidden email]>
wrote:

> [snip: previous message quoted in full]
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Claude Warren
On a slightly different note.  CountingBloomFilters have no way to perform
a reload.  All other bloom filters you can dump the bits and reload
(trivial) but if you preserve the counts from a bloom filter and want to
reload them you can't.  We need a constructor that takes the index,count
pairs somehow.
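A sketch of what that reload path could look like: rebuilding a counting filter's state from preserved (index, count) pairs. The class and factory names here are assumptions, not an existing API:

```java
// Sketch: restore a counting filter from a saved dump of parallel
// index/count arrays, rather than replaying every add().
public class CountingReload {
    private final int[] counts;

    public CountingReload(int m) {
        counts = new int[m];
    }

    public static CountingReload fromCounts(int m, int[] indexes, int[] itemCounts) {
        CountingReload filter = new CountingReload(m);
        for (int i = 0; i < indexes.length; i++) {
            filter.counts[indexes[i]] = itemCounts[i];
        }
        return filter;
    }

    public int count(int index) {
        return counts[index];
    }
}
```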

On Tue, Mar 17, 2020 at 10:34 PM Claude Warren <[hidden email]> wrote:

> [snip: previous message quoted in full]


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
In reply to this post by Claude Warren


> On 17 Mar 2020, at 22:34, Claude Warren <[hidden email]> wrote:
>
> Builder discussion:
>
> Let's go with
>
>>> Builder with(CharSequence, Charset);
>>> Builder withUnencoded(CharSequence);

Added to master.

I already note that not mandating UTF-8 is annoying. I had to include StandardCharsets in a lot of places in the test code. So perhaps we add:

Builder withUtf8(CharSequence)
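As a default method it would be a one-liner delegating to with(byte[]); a sketch of just the byte conversion (the surrounding Builder interface is assumed):

```java
import java.nio.charset.StandardCharsets;

// Sketch: the UTF-8 encoding a default withUtf8(CharSequence) would
// perform before passing the bytes to with(byte[]).
public class Utf8Bytes {
    public static byte[] utf8(CharSequence item) {
        return item.toString().getBytes(StandardCharsets.UTF_8);
    }
}
```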

>
> Shape Discussion:
>
> as for getNumberOfBytes() it should return the maximum number of bytes
> returned by a getBits() call to a filter with this shape.  So yes, if there
> is a compressed internal representation, no it won't be that.  It is a
> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
> 8.0 )
>
> Basically, if you want to create an array that will fit all the bits
> returned by BloomFilter.iterator() you need an array of
> Shape.getNumberOfBytes().  And that is actually what I use it for.

Then you are also mapping the index to a byte index and a bit within the byte. So if you are doing these two actions then this is something that you should control.

>
> Bloom Filter Discussion:
>
> I have not use probability() on a single filter, just on the Shape.
> However, I can see the value of it.  It seems to me that the difference
> between Shape.probability() and BloomFilter.probability() would give you an
> indication of how many collisions have already occurred.
>
> In the SetOperations class there is an "estimateSize()" method.  It is the
> only single Bloom filter argument method in the class and I think it should
> be moved into BloomFilter so we have 2 new methods:
>
> probability()
> estimateSize()
>
> Counting Filter Discussion:
>
> As for counting filters we could implement several
> int
> short
> byte

Yes. Each supports adding a maximum number of the same item. Since we do not know the use case leaving it at only int for now is easiest.

The alternative is duplicating the logic for each backing storage, removing the public constructor, making the class abstract and providing a factory constructor:

ArrayCountingBloomFilter bf = ArrayCountingBloomFilter.create(shape, maximumDuplicateItems);

The actual instance returned is then based on the capacity required to store the duplicates.

However the counts at each index are random and may exceed the duplicates by chance. I don’t want to go the route of requiring probability computations in the factory constructor for likelihood of exceeding the capacity.

A simple approach would be to have a byte[] version when you expect not to add duplicates. This filter will simply function to allow removing items you previously added. The probabilities I previously listed show that a count of 127 by random chance is < 1e-100 if the filter is reasonably big. We should at least provide a link to this computation. It requires a binomial distribution and collections does not currently depend on commons-math.
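The computation itself is small enough to document without a commons-math dependency; a sketch of the binomial pmf used for the table earlier in the thread (computed in log space so nk = 20000 trials does not overflow):

```java
// Sketch: P(X = l) for X ~ Binomial(trials, p). The binomial coefficient
// is accumulated as a log sum to stay within double range.
public class BinomialCount {
    public static double probability(int l, long trials, double p) {
        double logC = 0;  // log of the binomial coefficient C(trials, l)
        for (int i = 0; i < l; i++) {
            logC += Math.log(trials - (double) i) - Math.log(i + 1.0);
        }
        return Math.exp(logC + l * Math.log(p) + (trials - l) * Math.log1p(-p));
    }
}
```

With n = 1000, k = 20, m = 28756 this reproduces the table values, e.g. ~0.4988 for l = 0.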

The int[] version should be used when you expect to be able to add duplicates and want to use the contains(Hasher, count) function.

>
> Currently they would all have to return int counts but realistically that
> is what would be used in code anyway.  Also, once primitives can be used in
> generics this will be easier.
>
> As for contains( Hasher, int ), I think we would need to add contains(
> BloomFilter, int).  If I understand correctly, contains( BloomFilter, X )
> would test that a BloomFilter has been added X times or rather that there
> are enough counts in the right places to make it appear that BloomFilter
> has been added X times.  When used with a Hasher, remove the duplicates,
> and perform the same test.
>
> I see no reason not to add them.

OK.

>
> On Tue, Mar 17, 2020 at 6:23 PM Alex Herbert <[hidden email]> wrote:
>
>> [snip: earlier messages quoted in full]
>>> cleaner/easier to read.  But if you feel strongly about removing it then
>> OK.
>>
>> So with(String) is replaced by a default with(CharSequence, Charset)
>> method.
>>
>>>
>>> The only other thing that is really bothersome is the lack of
>>> Shape.getNumberOfBytes().  Yes it is easy to call Math.ceil(
>>> Shape.getNumberOfBits() / 8.0 ).  But getNumberOfBytes() is much more
>>> readable in the code.
>>
>> I removed this as the number of bytes is implementation dependent, for
>> example if using a compressed BloomFilter.
>>
>> If the number of bytes is intended to be for the uncompressed
>> representation then this does not match up with the canonical format of the
>> BloomFilter using a long[].
>>
>> Q. Why do you need the number of bytes?
>>
>>
>> One thing we do not have is the computation for a false probability using
>> the current cardinality (this is in the Guava code):
>>
>> p(false positive) = (cardinality() / m) ^ k
>>
>> Which is just a probability that you will choose all k indexes as ones
>> that are already set. This is not on the wikipedia page. Is this something
>> you have used?
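[Editor's note: the estimate described above can be sketched in plain Java. This is a standalone illustration of the formula, not the actual library API; the method name is hypothetical.]

```java
public class CurrentFalsePositive {

    /**
     * Estimates the probability that a single query item is reported as
     * present, given that {@code cardinality} of the {@code m} bits are
     * already set and {@code k} hash functions are used:
     * p = (cardinality / m)^k, i.e. the chance that all k chosen indexes
     * land on bits that are already set.
     */
    static double falsePositive(int cardinality, int m, int k) {
        return Math.pow((double) cardinality / m, k);
    }
}
```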
>>
>>
>> Also on the wikipedia page is a section on CountingBloomFilters where each
>> bucket is stated as typically having 3-4 bits. Using 4 bits would allow
>> counting to 16 for each index. It would be more space efficient than the
>> current use of 32-bit integers. Changing to 8-bit bytes is a simple
>> compromise. There is also a main page on counting Bloom filters [1]:
>>
>> For a typical shape with p 1e-6 and n items we have
>>
>> N = 1000, M = 28756, k = 20
>> N = 10000, M = 287552, k = 20
>>
>> Worst case scenario is each item added sets the same index and count could
>> be equal to n. But if the hash is uniform then the count for each bit will
>> be much lower. It is taken from the binomial distribution using number of
>> trials = nk, and probability of success = 1/m. Here are the expected counts
>> (l) for an index when the filter is filled with n items (p is just used to
>> set optimal k and m using n):
>>
>> n       k     m        p           l    binom(l, nk, 1/m)
>> 1000    20    28756    1.00E-06    0    0.498815437
>> 1000    20    28756    1.00E-06    1    0.346941705
>> 1000    20    28756    1.00E-06    2    0.12064836
>> 1000    20    28756    1.00E-06    4    0.004862559
>> 1000    20    28756    1.00E-06    8    6.76619E-07
>> 1000    20    28756    1.00E-06    16   7.10853E-17
>> 1000    20    28756    1.00E-06    32   1.66389E-41
>> 1000    20    28756    1.00E-06    64   2.8772E-100
>> 1000    20    28756    1.00E-06    127  1.0449E-234
>> 10000   20    287552   1.00E-06    0    0.498811214
>> 10000   20    287552   1.00E-06    1    0.346937562
>> 10000   20    287552   1.00E-06    2    0.120651928
>> 10000   20    287552   1.00E-06    4    0.004863763
>> 10000   20    287552   1.00E-06    8    6.77448E-07
>> 10000   20    287552   1.00E-06    16   7.14657E-17
>> 10000   20    287552   1.00E-06    32   1.70126E-41
>> 10000   20    287552   1.00E-06    64   3.15E-100
>> 10000   20    287552   1.00E-06    127  1.4984E-234
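[Editor's note: the table values can be reproduced with a direct binomial calculation. This is a standalone sketch, not library code; it works in log space so the large trial count (n * k) does not overflow.]

```java
public class CounterDistribution {

    /**
     * Binomial probability of exactly {@code l} successes in {@code trials}
     * trials with success probability {@code p}. The binomial coefficient is
     * accumulated term by term in log space: log C(trials, l) =
     * sum over i of log((trials - l + i) / i).
     */
    static double binomialPmf(long trials, double p, int l) {
        double logPmf = l * Math.log(p) + (trials - l) * Math.log1p(-p);
        for (int i = 1; i <= l; i++) {
            logPmf += Math.log((double) (trials - l + i) / i);
        }
        return Math.exp(logPmf);
    }
}
```

For n = 1000, k = 20, m = 28756 this gives binomialPmf(20000, 1.0 / 28756, 0) of about 0.4988, matching the first row of the table.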
>>
>> So being able to store counts of int max value is total overkill. We could
>> using 4 bits per counter. But this is harder to do with the current design
>> that uses the sign of entries as a guard digit for overflow or negative
>> counts. A simple switch is to drop to a byte[] and allow counts up to 127.
>> Any overflow would mark the state invalid. This makes the
>> ArrayCountingBloomFilter 4 times more space efficient.
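[Editor's note: the byte-backed counter idea can be sketched as below. This is illustrative only, not the proposed ArrayCountingBloomFilter code; the sign bit of each byte serves as the overflow guard described above.]

```java
public class ByteCounters {

    private final byte[] counts;
    private boolean valid = true;

    ByteCounters(int numberOfBits) {
        counts = new byte[numberOfBits];
    }

    /** Increments the counter at an index; wrapping past 127 invalidates the state. */
    void increment(int index) {
        final byte next = (byte) (counts[index] + 1);
        if (next < 0) {
            // Overflow: the count would exceed 127, so mark the state invalid
            // rather than store a negative value.
            valid = false;
        } else {
            counts[index] = next;
        }
    }

    int count(int index) {
        return counts[index];
    }

    /** False once any counter has overflowed. */
    boolean isValid() {
        return valid;
    }
}
```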
>>
>> Note that a counting Bloom filter is also described to have functionality
>> we do not currently have. We can test for membership of a Hasher (e.g. an
>> item) within the CountingBloomFilter but not query for a set number of the
>> same item. Should this be added to the interface:
>>
>> boolean contains(Hasher hasher, int count)
>>
>> Here the Hasher is an issue as it can represent multiple items which could
>> overlap indexes. In this case the 'contains a specified count' method is
>> invalid as the expected count for those indexes that are overlapping would
>> be higher.
>>
>> [1] https://en.wikipedia.org/wiki/Counting_Bloom_filter
>>>
>>> Claude
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
In reply to this post by Claude Warren


> On 18 Mar 2020, at 11:14, Claude Warren <[hidden email]> wrote:
>
> On a slightly different note.  CountingBloomFilters have no way to perform
> a reload.  All other bloom filters you can dump the bits and reload
> (trivial) but if you preserve the counts from a bloom filter and want to
> reload them you can't.  We need a constructor that takes the index,count
> pairs somehow.

Iterator<int[]> ?

Or foolproof:

class IndexCount {
    final int index;
    final int count;
    // ...
}

Iterator<IndexCount>


The CountingBloomFilter already has a method forEachCount(…).

I was reluctant to add some sort of iterator:

Iterator<?> iterator()

But we could put in:

Iterator<IndexCount> iterator()

It would be inefficient but at least it is fool-proof. The operation is unlikely to be used very often.



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [BloomFilters] changes to BloomFilter

Claude Warren
You don't need Iterator<IndexCount> iterator() as we have forEachCount(
BitCountConsumer )

I guess we need something like add( Iterator<IndexCount>) or add(
Collection<IndexCount> ) or add( Stream<IndexCount> )

It would be nice if we could have a BitCountProducer class that we could
just pass to an add() method.
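[Editor's note: one way the reload could look is sketched below. IndexCount and the add signature are only the suggestions from this thread, not existing classes.]

```java
import java.util.Iterator;

public class CountingReload {

    /** Immutable (index, count) pair, as proposed earlier in the thread. */
    static final class IndexCount {
        final int index;
        final int count;

        IndexCount(int index, int count) {
            this.index = index;
            this.count = count;
        }
    }

    /**
     * Restores saved counts into a byte[]-backed counter array by adding
     * each (index, count) pair in turn.
     */
    static void add(byte[] counts, Iterator<IndexCount> pairs) {
        while (pairs.hasNext()) {
            final IndexCount ic = pairs.next();
            counts[ic.index] += ic.count;
        }
    }
}
```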




On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert <[hidden email]>
wrote:

>
>
> > On 18 Mar 2020, at 11:14, Claude Warren <[hidden email]> wrote:
> >
> > On a slightly different note.  CountingBloomFilters have no way to
> perform
> > a reload.  All other bloom filters you can dump the bits and reload
> > (trivial) but if you preserve the counts from a bloom filter and want to
> > reload them you can't.  We need a constructor that takes the index,count
> > pairs somehow.
>
> Iterator<int[]> ?
>
> Or foolproof:
>
> class IndexCount {
>     final int index;
>     final int count;
>     // ...
> }
>
> Iterator<IndexCount>
>
>
> The CountingBloomFilter already has a method forEachCount(…).
>
> I was reluctant to add some sort of iterator:
>
> Iterator<?> iterator()
>
> But we could put in:
>
> Iterator<IndexCount> iterator()
>
> It would be inefficient but at least it is fool-proof. The operation is
> unlikely to be used very often.
>
>
>
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Claude Warren
We are getting to the point where there are a lot of options that determine
which implementation is "best".  We could take a stab at creating a
BloomFilterFactory that takes a Shape as an argument and does a "finger in
the air" guesstimate of which implementation best fits.  Store values in
long blocks or as integers in a list, that sort of thing.  Perhaps in a
month or so when we really have some idea.



On Wed, Mar 18, 2020 at 2:16 PM Claude Warren <[hidden email]> wrote:

> You don't need Iterator<IndexCount> iterator() as we have forEachCount(
> BitCountConsumer )
>
> I guess we need something like add( Iterator<IndexCount>) or add(
> Collection<IndexCount> ) or add( Stream<IndexCount> )
>
> It would be nice if we could have a BitCountProducer class that we could
> just pass to an add() method.
>
>
>
>
> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert <[hidden email]>
> wrote:
>
>>
>>
>> > On 18 Mar 2020, at 11:14, Claude Warren <[hidden email]> wrote:
>> >
>> > On a slightly different note.  CountingBloomFilters have no way to
>> perform
>> > a reload.  All other bloom filters you can dump the bits and reload
>> > (trivial) but if you preserve the counts from a bloom filter and want to
>> > reload them you can't.  We need a constructor that takes the index,count
>> > pairs somehow.
>>
>> Iterator<int[]> ?
>>
>> Or foolproof:
>>
>> class IndexCount {
>>     final int index;
>>     final int count;
>>     // ...
>> }
>>
>> Iterator<IndexCount>
>>
>>
>> The CountingBloomFilter already has a method forEachCount(…).
>>
>> I was reluctant to add some sort of iterator:
>>
>> Iterator<?> iterator()
>>
>> But we could put in:
>>
>> Iterator<IndexCount> iterator()
>>
>> It would be inefficient but at least it is fool-proof. The operation is
>> unlikely to be used very often.
>>
>>
>>
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Claude Warren
>> Shape Discussion:
>>
>> as for getNumberOfBytes() it should return the maximum number of bytes
>> returned by a getBits() call to a filter with this shape.  So yes, if
there
>> is a compressed internal representation, no it won't be that.  It is a
>> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
>> 8.0 )
>>
>> Basically, if you want to create an array that will fit all the bits
>> returned by BloomFilter.iterator() you need an array of
>> Shape.getNumberOfBytes().  And that is actually what I use it for.

>Then you are also mapping the index to a byte index and a bit within the
byte. So if you are doing these two actions then this is something that you
should control.

BloomFilter.getBits() returns a long[].  That long[] may be shorter than the
absolute number of bytes specified by Shape.  It may also be longer.

If you want to create a copy of the byte[] you have to know how long it
should be.  The only way to determine that is from Shape, and currently
only via the ceil() calculation noted above.  There is a convenience in
knowing how long (in bytes) the buffer can be.
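[Editor's note: the conversion being discussed can be sketched as below. This is a standalone illustration; numberOfBytes is just the ceiling division that a getNumberOfBytes() method would perform.]

```java
public class BitsToBytes {

    /** Equivalent to (int) Math.ceil(numberOfBits / 8.0). */
    static int numberOfBytes(int numberOfBits) {
        return (numberOfBits + 7) / 8;
    }

    /**
     * Packs a little-endian long[] bit map into exactly
     * numberOfBytes(numberOfBits) bytes, padding with zeros when the long[]
     * is shorter than the shape requires.
     */
    static byte[] toByteArray(long[] bits, int numberOfBits) {
        final byte[] bytes = new byte[numberOfBytes(numberOfBits)];
        for (int i = 0; i < bytes.length; i++) {
            final int longIndex = i / 8;
            if (longIndex < bits.length) {
                // Extract byte i of the bit map: 8 bytes per long, little-endian.
                bytes[i] = (byte) (bits[longIndex] >>> ((i % 8) * 8));
            }
        }
        return bytes;
    }
}
```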



On Wed, Mar 18, 2020 at 2:19 PM Claude Warren <[hidden email]> wrote:

> We are getting to the point where there are a lot of options that
> determine which implementation is "best".  We could take a stab at creating
> a BloomFilterFactory that takes a Shape as an argument and does a "finger
> in the air" guesstimate of which implementation best fits.  Store values in
> long blocks or as integers in a list, that sort of thing.  Perhaps in a
> month or so when we really have some idea.
>
>
>
> On Wed, Mar 18, 2020 at 2:16 PM Claude Warren <[hidden email]> wrote:
>
>> You don't need Iterator<IndexCount> iterator() as we have forEachCount(
>> BitCountConsumer )
>>
>> I guess we need something like add( Iterator<IndexCount>) or add(
>> Collection<IndexCount> ) or add( Stream<IndexCount> )
>>
>> It would be nice if we could have a BitCountProducer class that we could
>> just pass to an add() method.
>>
>>
>>
>>
>> On Wed, Mar 18, 2020 at 11:50 AM Alex Herbert <[hidden email]>
>> wrote:
>>
>>>
>>>
>>> > On 18 Mar 2020, at 11:14, Claude Warren <[hidden email]> wrote:
>>> >
>>> > On a slightly different note.  CountingBloomFilters have no way to
>>> perform
>>> > a reload.  All other bloom filters you can dump the bits and reload
>>> > (trivial) but if you preserve the counts from a bloom filter and want
>>> to
>>> > reload them you can't.  We need a constructor that takes the
>>> index,count
>>> > pairs somehow.
>>>
>>> Iterator<int[]> ?
>>>
>>> Or foolproof:
>>>
>>> class IndexCount {
>>>     final int index;
>>>     final int count;
>>>     // ...
>>> }
>>>
>>> Iterator<IndexCount>
>>>
>>>
>>> The CountingBloomFilter already has a method forEachCount(…).
>>>
>>> I was reluctant to add some sort of iterator:
>>>
>>> Iterator<?> iterator()
>>>
>>> But we could put in:
>>>
>>> Iterator<IndexCount> iterator()
>>>
>>> It would be inefficient but at least it is fool-proof. The operation is
>>> unlikely to be used very often.
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren