[BloomFilters] changes to BloomFilter

Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 18 Mar 2020, at 14:39, Claude Warren <[hidden email]> wrote:
>
>>> Shape Discussion:
>>>
>>> as for getNumberOfBytes() it should return the maximum number of bytes
>>> returned by a getBits() call to a filter with this shape.  So yes, if
> there
>>> is a compressed internal representation, no it won't be that.  It is a
>>> method on Shape so it should literally be Math.ceil( getNumberOfBits() /
>>> 8.0 )
>>>
>>> Basically, if you want to create an array that will fit all the bits
>>> returned by BloomFilter.iterator() you need an array of
>>> Shape.getNumberOfBytes().  And that is actually what I use it for.
>
>> Then you are also mapping the index to a byte index and a bit within the
> byte. So if you are doing these two actions then this is something that you
> should control.
>
> BloomFilter.getBits returns a long[].  that long[] may be shorter than the
> absolute number of bytes specified by Shape.  It also may be longer.
>
> If you want to create a copy of the byte[] you have to know how long it
> should be.  The only way to determine that is from Shape, and currently
> only if you do the Ceil() method noted above.  There is a convenience in
> knowing how long (in bytes) the buffer can be.

Copy of what byte[]?

There is no method to create a byte[] for a BloomFilter. So no need for getNumberOfBytes().

Are you talking about compressing the long[] to a byte[] by truncating the final long into 1-8 bytes?

    BloomFilter bf; // an existing filter
    long[] bits = bf.getBits();
    // Pack the words into bytes (little-endian).
    ByteBuffer bb = ByteBuffer.allocate(bits.length * Long.BYTES).order(ByteOrder.LITTLE_ENDIAN);
    Arrays.stream(bits).forEachOrdered(bb::putLong);
    byte[] bytes = bb.array();
    // Truncate to the minimum byte count for the shape.
    int expected = (int) Math.ceil(bf.getShape().getNumberOfBits() / 8.0);
    if (bytes.length != expected) {
        bytes = Arrays.copyOf(bytes, expected);
    }

For a BloomFilter of any reasonable number of bits the storage saving will be small (e.g. for 1000 bits the long[] packs into 16 * 8 = 128 bytes while ceil(1000 / 8) = 125 bytes, saving just 3 bytes).

Is this for serialisation? This is outside of the scope of the library.




Re: [BloomFilters] changes to BloomFilter

Claude Warren
bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
Shape.getNumberOfBits() / 8.0 ) or it may be shorter.

I am building byte buffers of a fixed length that is the maximum size of
any valid bf.getBits() * Long.BYTES, so I need to know
Shape.getNumberOfBytes(). The conversion is required for some Bloom filter
indexing techniques.

And while serialization is outside the scope of the library, it is only
reasonable that we provide enough information to allow developers to
serialize/deserialise the data. For example BloomFilter allows you to get
either the long[] representation or the list of bit indexes (via OfInt) and
there are ways to reconstruct a BloomFilter if you were to write that out
and read it back.
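For illustration, a minimal sketch of that round trip using only the long[] words. The Shape parameters (m, n, k) would be written alongside, and rebuilding a concrete filter from the words depends on the implementation chosen:

import java.io.*;

class BitsRoundTrip {
    // Write the word count then the bit words.
    static byte[] writeBits(long[] bits) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bos)) {
            out.writeInt(bits.length);
            for (long word : bits) {
                out.writeLong(word);
            }
        }
        return bos.toByteArray();
    }

    // Read the bit words back; a filter can then be rebuilt from them.
    static long[] readBits(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        long[] bits = new long[in.readInt()];
        for (int i = 0; i < bits.length; i++) {
            bits[i] = in.readLong();
        }
        return bits;
    }
}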




--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
Sorry for my absence. I have been thinking about ways to move the BloomFilter API forward that consolidate the current functionality but make it simpler to use for the common case.

> On 18 Mar 2020, at 17:12, Claude Warren <[hidden email]> wrote:
>
> bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
> Shape.getNumberOfBits() / 8.0 ) or it may be shorter.

I presume you mean:

bf.getBits().length * Long.BYTES  may be equal to Math.Ceil(
Shape.getNumberOfBits() / 8.0 ) or it may be longer.

>
> I am building byte buffers of a fixed length that is the maximum size of
> any valid bf.getBits() * Long.BYTES, so I need to know
> Shape.getNumberOfBytes(). The conversion is required for some Bloom filter
> indexing techniques.

OK. I gave you an example of how to do this. In my example the Shape.getNumberOfBytes() computation was 1 line of code out of 8.

This method has a practical use if you are creating a byte[] representation of the bits in a filter. So you already have to be writing code to do that. In the context of this process it is not going to save you a lot of code. It seems like this function is very specific to your need and not something generic required within the API.

>
> And while serialization is outside the scope of the library, it is only
> reasonable that we provide enough information to allow developers to
> serialize/deserialise the data. For example BloomFilter allows you to get
> either the long[] representation or the list of bit indexes (via OfInt) and
> there are ways to reconstruct a BloomFilter if you were to write that out
> and read it back.

Yes and you are free to do data transfer how you please with that information.

There is a lot of this API that is changing. I don’t want to keep adding methods before the basic functionality is fixed. So leave Shape.getNumberOfBytes() for now.

On to the main topic for discussion ...


IMO the basic functionality is not yet fixed. It is hard to use due to the separation of the BloomFilter from the hashing of objects you want to put into it. Thus to use a BloomFilter you always need something that can create Hashers. Currently they are easy to create if you have a byte[]. So your primary use case of hashing Strings is easy. But what about other things? Here are my current issues with the design:

1. The decoupling of the hashing of an item from the BloomFilter, thus the filter does not control the number of indexes generated
2. The use of 32-bit integers for bit indexes
3. Hashing currently requires a full length byte[] to pass to a hash function
4. Thread safety

In order:

1. The decoupling of the hashing of an item from the BloomFilter, thus the filter does not control the number of indexes generated

I still think the API needs to have some simple way to use a BloomFilter that allows adding objects. Currently you cannot use a BloomFilter on its own. You can only use one if you have another BloomFilter or a Hasher. So you need to have the ability to create one of those. It is not really in the spirit of a collection. A collection should be able to interact directly with what it is a collection of. Currently the BloomFilter cannot. We have this situation:

BloomFilter bf = BloomFilters.create(shape); // Construct somehow
T t;
// lots of code (maybe with ByteBuffer): T => byte[] bytes
bf.merge(new DynamicHasher.Builder(bf.getShape()).with(bytes).build());

The final merge operation also relies on the fact that the Hasher will create the correct number of indexes for the shape, a contract that it can easily violate and so break the BloomFilter functionality.

I am thinking about a better way to do this. I like the typed BloomFilter<T> approach in Guava. To me it makes more sense for a BloomFilter in commons-collections to be typed like all other collections. Having an untyped raw interface is strange for a generic collection framework. I think we need an abstraction layer between the raw processing of indexes done in the BloomFilter and the objects T that you put in.

If you wish to maintain flexibility of defining Hashing then perhaps we extend the functionality in the Guava approach (which uses a fixed internal index generator) and allow the generator of indexes to be specified:

interface IndexGenerator<T> {
    // infinite stream of 64-bit longs. Not yet mapped to a BloomFilter shape.
    LongSupplier generate(T item);
}

IndexGenerator<T> generator = ...;
BloomFilter<T> bf = BloomFilter.create(generator, shape);
T t;
bf.add(t);
bf.contains(t);

This moves the hashing of the items inside the BloomFilter but allows the method to hash them to be configured as it is inside the IndexGenerator. This encapsulation prevents the filter state from being abused by adding hashers, which as discussed before should adhere to a contract to generate the correct number of indexes in the correct range for the number of bits but do not have to. If the BloomFilter is controlling the mapping of a stream of indexes to the appropriate range and number then this prevents state corruption.

You previously mentioned separation of concerns as a reason to have a BloomFilter and the process for creation of a Hasher separate. This was to allow you to send a specialised Hasher over a network. Under this system you can create a BloomFilter<long[]> that stores a representation of a cyclic hash. The IndexGenerator<T> would be an IndexGenerator<long[]> that simply cycles the pair of longs to create indexes. Sending the pre-hashed long[2] over a network is simple. You thus set up a remote hash algorithm to convert your items to long[2] and send them for remote query.

I’ll show an example later of how this approach could be incorporated on top of our current code base.
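In the meantime, a minimal sketch of such a generator against the IndexGenerator<T> interface above. The increment scheme (h1, h1 + h2, h1 + 2*h2, ...) is my assumption for illustration; the thread does not fix one:

import java.util.function.LongSupplier;

// Cycles a pre-computed pair of longs into an indefinite stream of
// hash values; the BloomFilter maps them to its own shape.
class PrehashedCyclicIndexGenerator implements IndexGenerator<long[]> {
    @Override
    public LongSupplier generate(long[] hash) {
        final long h1 = hash[0];
        final long h2 = hash[1];
        return new LongSupplier() {
            private long current = h1;

            @Override
            public long getAsLong() {
                long next = current;
                current += h2; // next value in the cycle
                return next;
            }
        };
    }
}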


2. The use of 32-bit integers for bit indexes

The lack of 64-bit indexes for the bit positions in a filter prevents construction of very large filters. This is a limitation as shown here:

P = 1e-6
M = 2^31, N = 74,681,641
M = 2^31 * 64, N = 4,779,624,982
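These capacities follow from the optimal-filter relation n = -m (ln 2)^2 / ln p. A one-method check (the exact integers depend on rounding):

// n = -m * (ln 2)^2 / ln(p) for an optimally configured filter.
static long capacity(double m, double p) {
    return (long) (-m * Math.log(2) * Math.log(2) / Math.log(p));
}
// capacity(Math.pow(2, 31), 1e-6)      ~ 74,681,641
// capacity(Math.pow(2, 31) * 64, 1e-6) ~ 4,779,625,000 (64x larger)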

So using a full length long[] representation, addressed in memory by a long index divided by 64, allows you to increase the capacity of an array-based filter 64-fold. This pushes it into a range where the number of items that can be stored is much larger than that fully supported by the JDK collections (which use 32-bit integers for size). It future-proofs the API for expansion when RAM is cheap. The actual change to the code underneath is minimal. It would eliminate some data structures that we currently have, e.g. for the CountingBloomFilter, but these could be marked to throw exceptions if the size is too big.

It does make the API suitable for huge filters that can for example be used to check database keys exist before a call to a database.

I suspect the use of 32-bit indexes was to allow the use of Set<Integer> and BitSet. We no longer use Set<Integer> for the counting Bloom filter. A TreeSet<Long> would function just as well where we do use it. We can create a custom implementation of BitSet to allow long indexes. The BitSet has overhead due to its dynamic sizing that we can remove. A simple BitArray implementation would be easy to do.
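As a rough illustration, a minimal BitArray sketch with long indexes, sized at construction with no dynamic resizing (bounds checks omitted):

// Fixed-size bit array addressable by a long bit index.
class BitArray {
    private final long[] words;

    BitArray(long numberOfBits) {
        // Capacity is limited by the maximum array length (~2^31 words).
        words = new long[(int) ((numberOfBits + 63) >>> 6)];
    }

    void set(long index) {
        // The long shift uses only the low 6 bits of index, as required.
        words[(int) (index >>> 6)] |= 1L << index;
    }

    boolean get(long index) {
        return (words[(int) (index >>> 6)] & (1L << index)) != 0;
    }
}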


3. Hashing currently requires a full length byte[] to pass to a hash function

Hashing requires a preallocated byte[]. I would prefer to see hashing as a dynamic process where bytes are fed into an algorithm. For example the MurmurHash3 32-bit algorithm only requires 4 bytes at a time. The 128-bit algorithm requires 2 longs. Currently the Objects32 hasher uses 1 byte at a time so could be totally online.

The Guava code has the concept of a Hasher as a sink that can accept data of any primitive type. In their implementation bytes are always processed in the input order as the code is also tied to their implementation of various hash algorithms, so the output must be correct. However there is no reason that the bytes have to be processed in order. This allows for an optimisation where a DynamicHasher can process bytes as it chooses, such as immediately processing int/float and long/double input and, for smaller input, buffering the bytes until 4 have been received. The DynamicHasher need only ensure all bytes contribute to the final hash but can choose the order in which bytes are processed.
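A minimal sketch of that buffering idea; the block mix here is a placeholder, not a real MurmurHash3 round:

// Feeds 32-bit blocks to the hash online: ints pass straight through,
// single bytes are buffered until 4 are available.
class OnlineHasher {
    private int state;
    private int buffer;
    private int count; // bytes currently buffered

    void putInt(int i) {
        processBlock(i); // a full block needs no buffering
    }

    void putByte(byte b) {
        buffer |= (b & 0xFF) << (8 * count);
        if (++count == 4) {
            processBlock(buffer);
            buffer = 0;
            count = 0;
        }
    }

    private void processBlock(int block) {
        state = state * 31 + block; // placeholder mix only
    }
}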

I’ll show an example later of how this could be incorporated.


4. Thread safety

Currently the hash functions are not thread safe. If the function is moved inside a BloomFilter then the architecture must exist to create the stream of indexes for each input object in a thread safe manner. Thus multiple threads would be able to query the same BloomFilter instance. If we create the correct design this feature will be naturally included. Obviously adding to a BloomFilter may not be concurrent (it would be for the standard implementation using a bit array) but if a filter is just a gateway filter which never changes then I think concurrent query should be expected to work.


Here is a modification to the current code using an abstraction layer to convert items T to indexes. I was thinking about something like the following skeleton:

// Represent a hash
interface Hash {
    // Size of bits of the hash that is created
    int bits();
    // 64 bits of the hash (little-endian), starting from offset * 64.
    long getLongBits(int offset);
}

// Generate a Hash from a pre-existing byte[].
// (This is modified from the current HashFunction.)
interface StaticHashFunction {
    Hash apply(byte[] data, int seed);
}

// Generate indexes for an item.
// Left to the BloomFilter to call this appropriately for its own shape.
// The IndexGenerator would be constructed with:
// - A system to convert T to data for hashing (e.g. Function<T, byte[]>)
// - A system to convert data to a Hash (e.g. StaticHashFunction)
// - Other details of how a hash is incremented, e.g. cyclic/iterative process type
interface IndexGenerator<T> {
    // The supplier provides an indefinite stream
    LongSupplier generate(T item);
}

// Note:
// Initial implementation of IndexGenerator would be
// StaticIndexGenerator using StaticHashFunction with variants for iterative or cyclic.
// In either case the user supplies the Function<T, byte[]> to create a byte[] from
// the object. This is then used to create a Hash for each call of the LongSupplier
// or using a cyclic process from a single call to create the base Hash.

// Note: A PrehashedCyclicIndexGenerator<long[]> class can be used to just generate
// the indexes directly using the first 2 longs in the long[] object using a cyclic function.
// This allows constructing a BloomFilter<long[]> and storing pre-hashed items in it. The
// hashed representation is simple to send over a network (it is only 2 longs) allowing remote
// hashing and query.


// Typed to items
interface BloomFilter<T> {
    boolean add(T item);
}


// To use it you require the function to convert objects to data
Function<T, byte[]> converter = t -> t.getBytes();

IndexGenerator<T> hasher = IndexGenerators.create(converter, staticHashFunction, processType);
BloomFilter<T> bf = new BloomFilter<>(hasher, shape);

// or common use case with default hash function and process type
BloomFilter<T> bf = BloomFilters.create(n, p, converter);


// common API
T t;
bf.add(t);
bf.contains(t);


The following is an extension to allow dynamic hashing similar to the Guava API

// Accepts primitive data
interface DataSink {
 DataSink putByte(byte b);
 DataSink putBytes(byte[] b, int offset, int length);
 DataSink putChars(CharSequence cs, Charset encoding);
 DataSink putUnencodedChars(CharSequence cs);
 DataSink putInt(int i);
 // All other primitives
}

// Specify how item T is decomposed into primitive data
interface Pipe<T> {
 void connect(T item, DataSink sink);
}

// Accepts data and dynamically hashes it.
interface Hasher extends DataSink {
 Hash build();
}

// Provides an implementation to create a Hash dynamically
interface DynamicHashFunction {
 Hasher newHasher(int seed);
 // Size of bits of the hash that is created
 int bits();
}

// To use it you require the pipe implementation to convert objects to data
Pipe<T> pipe = (t, sink) -> sink.putInt(t.getIntProperty());

// explicit control over hash function and conversion of hashes to indexes
IndexGenerator<T> hasher = IndexGenerators.create(pipe, dynamicHashFunction, processType);
BloomFilter<T> bf = new BloomFilter<>(hasher, shape);

// or common use case with default hash function and process type
BloomFilter<T> bf = BloomFilters.create(n, p, pipe);



With various factory methods to use defaults for a hash function and process type and a few implementations of Pipe for common things (e.g. String, Integer).

There are lots of holes to fill in. But it allows you to directly add items T to a BloomFilter<T> and control the hash function and the iterative process.

One main issue is how to deal with filter compatibility. We can compare filter properties k and m easily. But how to compare the IndexGenerator<T> used by the filter? IMO the current HashFunctionIdentity is subject to error. No system is perfect. Perhaps we go with the simple option of using Object.equals(Object) to test equality of the IndexGenerator. In the common use case you will have the same object for your IndexGenerator<T> when merging two filters. The equals will be fast. All the implementations we provide can override equals to make this efficient using an internal signature. The exact method can be private until we have the rest of the framework complete.
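For example, a sketch of such an implementation; the signature field and the wrapper class are hypothetical:

import java.util.function.LongSupplier;

// An IndexGenerator identified by an internal signature so that the
// compatibility check when merging two filters is a cheap equals.
final class SignedIndexGenerator<T> implements IndexGenerator<T> {
    private final long signature;
    private final IndexGenerator<T> delegate;

    SignedIndexGenerator(long signature, IndexGenerator<T> delegate) {
        this.signature = signature;
        this.delegate = delegate;
    }

    @Override
    public LongSupplier generate(T item) {
        return delegate.generate(item);
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof SignedIndexGenerator
                && signature == ((SignedIndexGenerator<?>) obj).signature;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(signature);
    }
}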

There are issues. There is a lot of object creation. I’d like to minimise this. But I don’t see how you can have a flexible API without object creation. For example you could make the BloomFilter convert the Hash to indexes and avoid the index generator object. But then you lose the ability to use an item of type long[] as a prehashed item for conversion with a customised index generator.

If this is not the best way then I still think the API needs to have some simple way to use a BloomFilter that allows adding objects. Currently you cannot use a BloomFilter on its own. You can only use one if you have another BloomFilter or a Hasher. So you need to have the ability to create one of those. It is not really in the spirit of a collection. A collection should be able to interact directly with what it is a collection of. Currently the BloomFilter cannot. So this suggestion is to make the conversion of objects to data for hashing part of the API.

Note also that here we drop the ability to query using a Hasher built with multiple objects as it is covered by querying using a BloomFilter<T> as a collection of T.

These are just thoughts. But I’d like your opinion on whether we can make this work for all the scenarios you are applying this to. I think it would work for your common use cases of storing String and the remote filter query. The goal would be to make a BloomFilter<T> like a fuzzy Set<T> (may or may not contain T) and just as easy to use. Otherwise I think you risk making the API too complex for the common use case. That would be: a user wants to store items T and knows the fields in T that uniquely define the item. They just build a function to create a byte[] representation of the object or a pipe to send data from T to a downstream receiver that will collate the data. The rest of the process to create indexes from the item data is provided by the framework.

Alex





Re: [BloomFilters] changes to BloomFilter

Claude Warren
Sorry it has taken me so long to respond, but I spent a long time thinking
about this and started a response several times.

I see no reason not to go to long bit indexes.  This would be reflected in
Shape.getNumberOfBits returning a long as well as other places.

I would like to revisit some definitions and use cases before delving into
the suggestions.
Definitions

- Bloom Filter
  - A set of bits and functions to query bit patterns.
  - Responsible for:
    - Management of storage of the set of bits.
    - Producing a standard format external representation of the set of bits.
    - Importing a standard format external representation of the set of bits.
    - Performing specific bit queries on the set of bits:
      - Cardinality (number of bits enabled) of the filter.
      - Cardinality of a logical “and” with another filter.
      - Cardinality of a logical “or” with another filter.
      - Cardinality of a logical “xor” with another filter.
      - Determining if this filter “contains” another filter (this “and” that == that).
    - Merging another filter into this one. This is not technically required but was a design decision. The alternative is for a filter to be immutable and to create a new filter from the merging of two filters.
- Shape
  - The definition of a Bloom filter. It includes:
    - The maximum number of bits in the filter. (aka: m)
    - The expected number of items encoded in the filter. (aka: n)
    - The number of hashes generated for each item in the filter. (aka: t)
  - Responsible for providing the information necessary to identify when two filters are built with the same strategy.
- Hasher
  - Converts one or more items into a Bloom filter based upon a Shape.
  - Is repeatable (i.e. when called multiple times with the same shape produces the same output).
  - Responsible for:
    - Tracking the number of items being hashed.
    - Generating the number of hash values specified by Shape for each item.

Types of usage

Representation of a single simple object

This is probably the most common usage pattern if measured by number of
constructor calls. In this use case a single object is converted to a
single item in a Bloom filter. A common use case would be to determine if a
URL has been seen. The filter is constructed by hashing the URL “t” times,
converting each hash value to the range [0,m), and enabling the associated
bits in the Bloom filter.

The single object filter is commonly used to test the presence of the
object in a collection of filters or to register the presence of the object
within the collection.

Representation of a single complex object

In this usage the Bloom filter represents a single object with more than
one property, as contrasted to the simple object where a single property is
represented. In this case the number of items in the Shape actually refers to
the number of properties for a single complex object times the number of
complex objects. So a collection of 1000 complex objects with 3 properties
each yields 3000 items in the Shape.

There are several hashing strategies used to build complex objects. The
simplest is to hash each property as though it were the property in a
single object. This works well if the individual properties being indexed
do not have any collisions across the representations. A more complex
solution is to seed the hash function with different values based upon the
property being indexed.
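A sketch of the seeded strategy; the seed scheme and the mixing function are illustrative only:

// Hash a property with a seed derived from which property it is, so
// that equal values in different properties enable different bits.
static long seededHash(byte[] propertyBytes, int propertyOrdinal, int hashRound) {
    long h = 0x9E3779B97F4A7C15L * (propertyOrdinal + 1) + hashRound;
    for (byte b : propertyBytes) {
        h = (h ^ b) * 0x100000001B3L; // FNV-style mix
    }
    return h;
}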

Representation of a single complete complex object

This use case is similar to the single complex object in that multiple
properties are represented as individual items in the resulting Bloom
filter. The difference is that the number of expected items, as defined by
the shape, is the number of properties in the object. Thus each object
produces a fully populated Bloom filter. This strategy is used when a
compact searchable representation of the object is required. This type of
filter is generally added to a Multidimensional filter (see below).

Representation of a partial complex object

In this case the representation is built using the same method as the
single complex object but not all properties are specified. This is
commonly used to locate objects in a collection of complex object Bloom
filters.

Representation of multiple objects

In this case multiple single-object Bloom filters are merged together to
create one Bloom filter representing all the objects. This is probably the
most common usage for filters that are persisted across application runs.
This implementation is simply the merged bit patterns of all the Bloom filters
being added. It is commonly used in applications like Hadoop to determine
which nodes might contain a specific file.
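In bit terms the merge is a union of the filters' words; a minimal sketch, assuming both filters share the same Shape:

// Union of two filters' bit words (equal Shape assumed).
static long[] merge(long[] a, long[] b) {
    long[] merged = a.clone();
    for (int i = 0; i < b.length; i++) {
        merged[i] |= b[i];
    }
    return merged;
}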

Multidimensional representation Bloom filters

This is a use case that extends the ability to store excess items (>”n”)
into a Bloom filter without significantly modifying the probability of
false positives. To do this efficiently the Multidimensional filter has to
know the Shape of the filters being inserted. In the most efficient cases
it requires the hasher so that it can build filters of different but
associated shapes. In addition, this type of filter can return IDs
associated with the various filters, so that it in effect becomes an index
of objects associated with Bloom filters.

This type of filter is used in systems that have many (>1000) locations to
check for a Bloom filter match as they can perform the match much faster
than the linear search alternative.

Suggestions

Move the hashing into the Bloom filter

Originally the Hasher was called a ProtoBloomFilter. There was a request
for a name change early on. Once a Hasher is created it can be used to
create any concrete Bloom filter. Currently this is done with the
getBits(Shape) method. Perhaps what is needed is a rework in this section.

If the Hasher had a method to return a LongSupplier for each item then the
Bloom filter constructor could be in control of the number of hash
functions that were called for each item.

So assume Hasher has 2 new methods:

int items(); // the number of items

LongSupplier supplier(int itemNumber); // returns the hash stream for item i

Then the BloomFilter(Hasher, Shape) constructor could call:

for (int i = 0; i < hasher.items(); i++) {
    LongSupplier sup = hasher.supplier(i);
    for (int t = 0; t < shape.getNumberOfHashFunctions(); t++) {
        enableBit(Math.floorMod(sup.getAsLong(), shape.getNumberOfBits()));
    }
}

Remove the requirement for a full length byte[]

The original Hash.Factory was a simple implementation to allow multiple
items in the Hash.

We could create a DataSink interface along the lines you are talking about.
Then have Hash.Factory extend that so that all those base data types can be
used as items in their own right and add a method with(DataSink) that would
take all the items in the DataSink and combine them into a single item for
the Hasher.

Create an add(T)/contains(T) method pair or make a simple BloomFilter<T>

First I think this would be a merge(T)/contains(T) pair. But I think it is a
bad idea to confuse the general BloomFilter responsibility with the Hasher
responsibility. I suggest leaving the general structure as defined above in
place and creating a TypedBloomFilter<T> class that works much as you
suggest. We have to make a number of assumptions. For example we assume
that BitSetBloomFilter is the appropriate implementation and the
DynamicHasher the appropriate Hasher.

class TypedBloomFilter<T> extends BitSetBloomFilter {

    private Function<T, DataSink> fn;

    TypedBloomFilter(Function<T, DataSink> fn, Shape shape) {
        super(shape);
        this.fn = fn;
    }

    public void merge(T t) {
        merge(new DynamicHasher.Builder().with(fn.apply(t)).build());
    }

    public boolean contains(T t) {
        return contains(new DynamicHasher.Builder().with(fn.apply(t)).build());
    }
}

Thread Safety

I think we note that Bloom filters are not generally safe for multithreaded
use without external synchronization.

Rework the HashFunction and HashFunctionIdentifier issues

We recognize that there are issues with HashFunctions and the identifier as
preserved in the Shape. I think that it is overly complex and prone to
error. Perhaps the proper solution is to remove the HasherIdentity from
Shape and create a method on Hasher: String getID(). In most cases the
string would return the fully qualified class name; in cases where
different hashing algorithms can be defined the string should return the
className prepended to the algorithm name or something similar. We can then
remove the HashFunction interface and all the stuff that derives from it.
Bloom filter equivalence will then depend upon Shape and Hasher equivalence.
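A sketch of that identity method; the class and field names are illustrative:

class ExampleHasher {
    private final String algorithmName = "MD5"; // illustrative value

    // Fully qualified class name, plus the algorithm where one class
    // can host several hashing algorithms.
    public String getID() {
        return getClass().getName() + "-" + algorithmName;
    }
}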



Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 22 Mar 2020, at 17:44, Claude Warren <[hidden email]> wrote:
>
> Sorry it has taken me so long to respond, but I spent a long time thinking
> about this and started a response several times.
>
> I see no reason not to go to long bit indexes.  This would be reflected in
> Shape.getNumberOfBits returning a long as well as other places.
>
> I would like to revisit some definitions and use cases before delving into
> the suggestions.
> Definitions
>
> - Bloom Filter
>   - A set of bits and functions to query bit patterns.
>   - Responsible for:
>     - Management of storage of the set of bits.
>     - Producing a standard format external representation of the set of bits.
>     - Importing a standard format external representation of the set of bits.
>     - Performing specific bit queries on the set of bits:
>       - Cardinality (number of bits enabled) of the filter.
>       - Cardinality of a logical “and” with another filter.
>       - Cardinality of a logical “or” with another filter.
>       - Cardinality of a logical “xor” with another filter.
>       - Determining if this filter “contains” another filter (this “and” that == that).

Initially it is not apparent from the javadoc why you need all the cardinality tests. We should add the primary purpose of these:

- AND cardinality can be used to determine if one filter is within another filter when combined with the cardinality of the query filter. Is there another purpose to it?
- OR cardinality will tell you how many bits would be taken after a merge (union) of filters.
- XOR cardinality will tell you how different the filters are.

Are any of these operations essential to have in BloomFilter? They can all be implemented just using the long[] getBits(). Note that when the filter's number of bits is changed from int to long, getBits() has to become a primitive iterator (OfLong) that returns the filter's bits 64 at a time, starting from bits [0, 63].

Removing the combined cardinality tests reduces clutter in the main interface and removes the requirement for a filter to implement them. It just implements an efficient iterator of bits. These operations get moved to the SetOperations class.
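A sketch of both pieces: the filter words exposed as a PrimitiveIterator.OfLong and a SetOperations-style AND cardinality computed from the words alone (equal Shape, hence equal word counts, assumed):

import java.util.PrimitiveIterator;

final class SetOps {
    // AND cardinality from two word iterators that each return 64 bits
    // per call, starting at bits [0, 63].
    static long andCardinality(PrimitiveIterator.OfLong a, PrimitiveIterator.OfLong b) {
        long count = 0;
        while (a.hasNext() && b.hasNext()) {
            count += Long.bitCount(a.nextLong() & b.nextLong());
        }
        return count;
    }
}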

>
>     - Merging another filter into this one. This is not technically required but was a design decision. The alternative is for a filter to be immutable and to create a new filter from the merging of two filters.
>
> - Shape
>   - The definition of a Bloom filter. It includes:
>     - The maximum number of bits in the filter. (aka: m)
>     - The expected number of items encoded in the filter. (aka: n)
>     - The number of hashes generated for each item in the filter. (aka: t)
>   - Responsible for providing the information necessary to identify when two filters are built with the same strategy.

And as previously discussed, the number of items is redundant other than to provide the original intention when the shape was constructed.

>
> - Hasher
>   - Converts one or more items into a Bloom filter based upon a Shape.
>   - Is repeatable (i.e. when called multiple times with the same shape produces the same output).
>   - Responsible for:
>     - Tracking the number of items being hashed.

Which currently is not in the API.

>     - Generating the number of hash values specified by Shape for each item.

Which breaks BloomFilter encapsulation.

>
>
> Types of usage
>
> Representation of a single simple object
>
> This is probably the most common usage pattern if measured by number of
> constructor calls. In this use case a single object is converted to a
> single item in a Bloom filter. A common use case would be to determine if a
> URL has been seen. The filter is constructed by hashing the URL “t” times,
> converting each hash value to the range [0,m), and enabling the associated
> bits in the Bloom filter.

OK.

>
> The single object filter is commonly used to test the presence of the
> object in a collection of filters or to register the presence of the object
> within the collection.


> Representation of a single complex object
>
> In this usage the Bloom filter represents a single object with more than
> one property, as contrasted to the simple object where a single property is
> represented. In this case the number of items in the Shape actually refers to
> the number of properties for a single complex object times the number of
> complex objects. So a collection of 1000 complex objects with 3 properties
> each yields 3000 items in the Shape.
>
> There are several hashing strategies used to build complex objects. The
> simplest is to hash each property as though it were the property in a
> single object. This works well if the individual properties being indexed
> do not have any collisions across the representations. A more complex
> solution is to seed the hash function with different values based upon the
> property being indexed.

I cannot understand why you would do this. If I have a person with a first name and last name field and I want to store 50,000 person objects you are saying that I construct a Shape to store 100,000 items as I will generate k indexes for first name and last name separately. This makes no sense to me. I want to store 50,000 items. An item is something that can be hashed. So combine the first and last name to a single set of bytes to hash and only produce k indexes, not 2k indexes.

This adds a whole layer of complexity to constructing the Shape that you have not put in your current API. I.e. when I construct a Shape I need not only the number of items I am to store but the number of properties per item.


> Representation of a single complete complex object
>
> This use case is similar to the single complex object in that multiple
> properties are represented as individual items in the resulting Bloom
> filter. The difference is that the number of expected items, as defined by
> the shape, is the number of properties in the object. Thus each object
> produces a fully populated Bloom filter. This strategy is used when a
> compact searchable representation of the object is required. This type of
> filter is generally added to a Multidimensional filter (see below).

So if using BloomFilter<T> then T would be a property in a single object. That is fine. Effectively you have BloomFilter<Object> representing a fast look-up for all the properties in an item. Here the design I suggested is less suitable as the IndexGenerator<Object> would have to handle all the object types for all the properties. I think in the end though the amount of code you have to write to work with it would be the same. Since you are not storing items in a filter but all the properties of one item you construct one by passing all the different properties to the filter merge(T) method. You then query it using a property you are interested in. The code to populate and query it would be similar with the current Hasher design or a typed and encapsulated filter as you are mapping properties of lots of different types to a single form to be put into the filter (probably a byte[]).


> Representation of a partial complex object
>
> In this case the representation is built using the same method as the
> single complex object but not all properties are specified. This is
> commonly used to locate objects in a collection of complex object Bloom
> filters.

This is not a case. It is just a sub-set of the other case.

> Representation of multiple objects
>
> In this case multiple single-object Bloom filters are merged together to
> create one Bloom filter representing all the objects. This is probably the
> most common usage for filters that are persisted across application runs.
> This implementation is simply the merged bit patterns of all the Bloom filters
> being added. It is commonly used in applications like Hadoop to determine
> which nodes might contain a specific file.

The basic use case. A filter contains many items.

> Multidimensional representation Bloom filters
>
> This is a use case that extends the ability to store excess items (>”n”)
> into a Bloom filter without significantly modifying the probability of
> false positives. To do this efficiently the Multidimensional filter has to
> know the Shape of the filters being inserted. In the most efficient cases
> it requires the hasher so that it can build filters of different but
> associated shapes. In addition, this type of filter can return IDs
> associated with the various filters, so that it in effect becomes an index
> of objects associated with Bloom filters.
>
> This type of filter is used in systems that have many (>1000) locations to
> check for a Bloom filter match as they can perform the match much faster
> than the linear search alternative.

IIUC this does not fit the current design either. A MultiDimensional filter has more than one Shape?

Can you send a code example or a link to an article on this? Basically I need more explanation of how this works. I found this:

https://github.com/lemire/bloofi

Bloofi = Bloom Filter Index

Which creates a special data structure that stores/collates filters. Each filter is of type BloomFilter<T>. These all appear to have the same shape and store the same things.

So this indexes many BloomFilters of the same shape.

What are you proposing that is beyond this? Are you suggesting that each BloomFilter<T> is storing properties of one or more items? You would like to create a query for the property x and return a list of all the BloomFilters that may contain an item that has that property?

Can you post an interface API for a MultiDimensionalBloomFilter?
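
For example, is it something like this? (A hypothetical sketch only; all names here are placeholders.)

import java.util.List;

interface MultidimensionalBloomFilter<ID> {
    // Store a filter and associate it with an ID.
    void add(ID id, BloomFilter filter);
    // Return the IDs of all stored filters that contain the query filter,
    // i.e. (stored AND query) == query.
    List<ID> search(BloomFilter query);
}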



> Suggestions Move the hashing into the Bloom filter
>
> Originally the Hasher was called a ProtoBloomFilter. There was a request
> for a name change early on. Once a Hasher is created it can be used to
> create any concrete Bloom filter. Currently this is done with the
> getBits(Shape) method. Perhaps what is needed is a rework in this section.
>
> If the Hasher had a method to return a LongSupplier for each item then the
> Bloom filter constructor could be in control of the number of hash
> functions that were called for each item.
>
> So assume Hasher has 2 new methods:
>
> int items(); // the number of items
>
> LongSupplier supplier( int itemNumber )
>
>
>
> Then BloomFilter(Hasher, Shape) constructor could call:
>
> for (int i = 0; i < hasher.items(); i++) {
>     LongSupplier sup = hasher.supplier(i);
>     for (int t = 0; t < shape.getNumberOfHashFunctions(); t++) {
>         enableBit( Math.floorMod( sup.getAsLong(), shape.getNumberOfBits() ) );
>     }
> }

In the common use case of searching for a single item, this extra loop over one item seems a waste of time.


> Remove the requirement for a full length byte[]
>
> The original Hash.Factory was a simple implementation to allow multiple
> items in the Hash.
>
> We could create a DataSink interface along the lines you are talking about.
> Then have Hash.Factory extend that so that all those base data types can be
> used as items in their own right and add a method with(DataSink) that would
> take all the items in the DataSink and combine them into a single item for
> the Hasher.

The Hasher has to be the DataSink implementation. Have a look at Guava BloomFilter<E> and see how it works. You specify what data from your item is important by passing properties to the DataSink. This is actually a Hasher that creates the hash code.
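
For reference, this is roughly how the Guava pattern looks (Person is a hypothetical type used for illustration; Funnel, PrimitiveSink and BloomFilter are in com.google.common.hash):

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnel;

Funnel<Person> personFunnel = (person, sink) ->
    sink.putUnencodedChars(person.getFirstName())
        .putUnencodedChars(person.getLastName());

BloomFilter<Person> filter = BloomFilter.create(personFunnel, 10000, 1e-6);
filter.put(person);
boolean maybe = filter.mightContain(person);

The item never has to be flattened to a byte[] up front; the funnel streams the important properties into the sink.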

> Create an add(T)/contains(T) method pair or Make a simple BloomFilter<T>
>
> First I think this would be merge(T)/contains(T) pair. But I think it is a
> bad idea to confuse the general BloomFilter responsibility with the Hasher
> responsibility. I suggest leaving the general structure as defined above in
> place and create a TypedBloomFilter<T> class that works much as you
> suggest. We have to make a number of assumptions. For example we assume
> that BitSetBloomFilter is the appropriate implementation and the
> DynamicHasher the appropriate Hasher.
>
> class TypedBloomFilter<T> extends BitSetBloomFilter {
>
>     private final Function<T,DataSink> fn;
>
>     TypedBloomFilter( Function<T,DataSink> fn, Shape shape ) {
>         super( shape );
>         this.fn = fn;
>     }
>
>     public void merge( T t ) {
>         merge( new DynamicHasher.Builder().with( fn.apply( t ) ).build() );
>     }
>
>     public boolean contains( T t ) {
>         return contains( new DynamicHasher.Builder().with( fn.apply( t ) ).build() );
>     }
> }

IMO the general BloomFilter responsibility is storing indexes. I am suggesting that the indexes always come from somewhere: an item T. So change BloomFilter to be typed to <T> and require that the filter is provided with an implementation that converts T to indexes.

If you want to store properties from an item in a filter then you can still do this. You just have a more advanced implementation to convert T (the base property type) to indexes.


> Thread Safety
>
> I think we note that Bloom filters are not generally safe for multithreaded
> use without external synchronization.

OK. But as I said the search for items in a filter can be thread-safe since no structural modification of the filter is performed. I think this should be considered. It is the same for all other collections in the Java SDK that are not explicitly concurrent. Query of them is thread safe as long as no structural modification occurs during the query.
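
A small sketch of what that would allow (assuming the typed API discussed below; ExecutorService and Future are from java.util.concurrent):

BloomFilter<String> gateway = ...; // fully populated, then never modified again
ExecutorService executor = Executors.newFixedThreadPool(4);
// contains() performs no structural modification so may run concurrently:
Future<Boolean> a = executor.submit(() -> gateway.contains("key-1"));
Future<Boolean> b = executor.submit(() -> gateway.contains("key-2"));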

> Rework the HashFunction and HashFunctionIdentifier issues
>
> We recognize that there are issues with HashFunctions and the identifier as
> preserved in the Shape. I think that it is overly complex and prone to
> error. Perhaps the proper solution is to remove the HasherIdentity from
> Shape and create a method on Hasher: String getID(). In most cases the
> string would return the fully qualified class name; in cases where
> different hashing algorithms can be defined the string should return the
> className prepended to the algorithm name or something similar. We can then
> remove the HashFunction interface and all the stuff that derives from it.
> Bloom filter equivalence will then depend upon Shape and Hasher equivalence.

Dropping the entire HashFunctionIdentity from the public API is what I suggest. Test equivalence of filters using their strategy to convert items to indexes. But if a BloomFilter is to state equivalence using the Hasher then it would need to store its own hasher. This is not currently done. If you do this then you are moving one step closer to my suggestion of storing the IndexGenerator<T> in the filter. This is effectively a dynamic hasher of objects. I avoided the name Hasher for this purpose but it could be called that.
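
A sketch of the internal-signature idea (everything here is an assumption, not current code; Function and LongSupplier are from java.util.function):

final class StaticIndexGenerator<T> implements IndexGenerator<T> {
    // Must fully describe the strategy, e.g. "MurmurHash3.x64.128:cyclic".
    private final String signature;
    private final Function<T, byte[]> converter;

    StaticIndexGenerator(String signature, Function<T, byte[]> converter) {
        this.signature = signature;
        this.converter = converter;
    }

    @Override
    public LongSupplier generate(T item) {
        // Hash converter.apply(item) and return an indefinite index stream.
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public boolean equals(Object o) {
        // Two generators are equivalent if they declare the same strategy.
        return o instanceof StaticIndexGenerator
            && signature.equals(((StaticIndexGenerator<?>) o).signature);
    }

    @Override
    public int hashCode() {
        return signature.hashCode();
    }
}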


In summary I do not currently see a good reason why BloomFilter cannot be typed. If there is a lot more code that you have to contribute, it seems to me that we should put together all the code and what you intend it to do. It is the extra cases that you have that conflict with what I have suggested. Since I do not know everything you currently do with BloomFilters, it may be better to see it all before deciding the design for commons collections.

It is important to consider what commons-collections is trying to achieve. It is a stable API for code structures that are used to collectively group items. Almost all of it is generic collections typed to items. Any new code should be compatible with this scope. Since we maintain a stable API it is essential that the API is correct on first release. This means we should consider all future functionality that we are to support. It is made easier by restricting functionality to coherent blocks that should not require a change.

What do you think to either:

1. moving all this to a bloom-filters development branch and working on it there while the API is sorted to include the remaining code
2. limiting the scope of the contribution to commons-collections to basic BloomFilter functionality

1. This option may be close to just taking over maintenance of your current code base. If the code is generally applicable then fine. From what has so far been contributed I feel that we have made a lot of changes to clarify things. If this continues then further code will require a fair bit more work and a development branch seems a suitable action.

2. This should provide a structure for BloomFilters similar to others I can find for Java (e.g. Guava). I suggest we offer something more than other implementations by allowing configuration of the hashing of items. It would allow advanced usage to optimise item hashing. This is in line with what the current code allows. But it may mean stripping the API to a minimal first release.


>
> On Fri, Mar 20, 2020 at 10:51 AM Alex Herbert <[hidden email]>
> wrote:
>
>> Sorry for an absence. I have been thinking on ways to move the BloomFilter
>> API forward that consolidates the current functionality but makes it
>> simpler to use for the common case.
>>
>>> On 18 Mar 2020, at 17:12, Claude Warren <[hidden email]> wrote:
>>>
>>> bf.getBits() * Long.BYTES  may be as long as Math.Ceil(
>>> Shape.getNumberOfBits() / 8.0 ) or it may be shorter.
>>
>> I presume you mean:
>>
>> bf.getBits().length * Long.BYTES  may be equal to Math.Ceil(
>> Shape.getNumberOfBits() / 8.0 ) or it may be longer.
>>
>>>
>>> I am building byte buffers of fixed length that is the maximum size that
>>> any valid bf.getBits() * Long.BYTES I need to know
>>> Shape.getNumberOfBytes().
>>> The  conversion is required for some Bloom filter indexing techniques.
>>
>> OK. I gave you an example of how to do this. In my example the
>> Shape.getNumberOfBytes() computation was 1 line of code out of 8.
>>
>> This method has a practical use if you are creating a byte[]
>> representation of the bits in a filter. So you already have to be writing
>> code to do that. In the context of this process it is not going to save you
>> a lot of code. It seems like this function is very specific to your need
>> and not something generic required within the API.
>>
>>>
>>> And while serialization is outside the scope of the library, it is only
>>> reasonable that we provide enough information to allow developers to
>>> serialize/deserialse the data.  For example BloomFilter allows you to get
>>> either the long[] representation or the list of bit indexes (via OfInt)
>> and
>>> there are ways to reconstruct a BloomFilter if you were to write that out
>>> and read it back.
>>
>> Yes and you are free to do data transfer how you please with that
>> information.
>>
>> There is a lot of this API that is changing. I don’t want to keep adding
>> methods before the basic functionality is fixed. So leave
>> Shape.getNumberOfBytes() for now.
>>
>> On to the main topic for discussion ...
>>
>>
>> IMO the basic functionality is not yet fixed. It is hard to use due to the
>> separation of the BloomFilter from the hashing of objects you want to put
>> into it. Thus at all times to use a BloomFilter you need to also have
>> something that can create Hashers. Currently they are easy to create if you
>> have a byte[]. So your primary use case of hashing Strings is easy. But
>> what about other things? Here are my current issues with the design:
>>
>> 1. The decoupling of the hashing of an item from the BloomFilter, thus the
>> filter does not control the number of indexes generated
>> 2. The use of 32-bit integers for bit indexes
>> 3. Hashing currently requires a full length byte[] to pass to a hash
>> function
>> 4. Thread safety
>>
>> In order:
>>
>> 1. The decoupling of the hashing of an item from the BloomFilter, thus the
>> filter does not control the number of indexes generated
>>
>> I still think the API needs to have some simple way to use a BloomFilter
>> that allows adding objects. Currently you cannot use a BloomFilter on its
>> own. You can only use them if you have another BloomFilter or a Hasher. So
>> you need to have the ability to create one of those. It is not really in
>> the spirit of a collection. A collection should be able to interact directly
>> with what it is a collection of. Currently the BloomFilter cannot. We have
>> this situation:
>>
>> BloomFilter bf = BloomFilters.create(shape); // Construct somehow
>> T t;
>> // lots of code (maybe with ByteBuffer): T => byte[] bytes
>> bf.merge(new DynamicHasher.Builder(bf.getShape()).with(bytes).build());
>>
>> The final merge operation also relies on the fact that the Hasher will
>> create the correct number of indexes for the shape, a contract that it can
>> easily violate, breaking the BloomFilter functionality.
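>>
>> For example, nothing in the types prevents this (a sketch of the hazard):
>>
>> Hasher rogue = new DynamicHasher.Builder(otherShape).with(bytes).build();
>> bf.merge(rogue); // indexes were generated for otherShape, not bf's shape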
>>
>> I am thinking about a better way to do this. I like the typed
>> BloomFilter<T> approach in Guava. To me it makes more sense for a
>> BloomFilter in commons-collections to be typed like all other collections.
>> Having an untyped raw interface is strange for a generic collection
>> framework. I think we need an abstraction layer between the raw processing
>> of indexes done in the BloomFilter and the objects T that you put in.
>>
>> If you wish to maintain flexibility of defining Hashing then perhaps we
>> extend the functionality in the Guava approach (which uses a fixed internal
>> index generator) and allow the generator of indexes to be specified:
>>
>> interface IndexGenerator<T> {
>>    // infinite stream of 64-bit longs. Not yet mapped to a BloomFilter
>> shape.
>>    LongSupplier generate(T item);
>> }
>>
>> IndexGenerator<T> generator = ...;
>> BloomFilter<T> bf = BloomFilter.create(generator, shape);
>> T t;
>> bf.add(t);
>> bf.contains(t);
>>
>> This moves the hashing of the items inside the BloomFilter but allows the
>> method to hash them to be configured as it is inside the IndexGenerator.
>> This encapsulation prevents the filter state from being abused by adding
>> hashers, which as discussed before should adhere to a contract to generate
>> the correct number of indexes in the correct range for the number of bits
>> but do not have to. If the BloomFilter is controlling the mapping of a
>> stream of indexes to the appropriate range and number then this prevents
>> state corruption.
>>
>> You previously mentioned separation of concerns as a reason to have a
>> BloomFilter and the process for creation of a Hasher separate. This was to
>> allow you to send a specialised Hasher over a network. Under this system
>> you can create a BloomFilter<long[]> that stores a representation of a
>> cyclic hash. The IndexGenerator<T> would be an IndexGenerator<long[]> that
>> simply cycles the pair of longs to create indexes. Sending the pre-hashed
>> long[2] over a network is simple. You thus set up a remote hash algorithm
>> to convert your items to long[2] and send them for remote query.
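>>
>> A rough sketch of that generator (untested; it assumes the familiar
>> "index i = hash1 + i * hash2" cyclic construction):
>>
>> IndexGenerator<long[]> cyclic = item -> new LongSupplier() {
>>     private long h = item[0];
>>     @Override
>>     public long getAsLong() {
>>         final long next = h;
>>         h += item[1];
>>         return next;
>>     }
>> };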
>>
>> I’ll show an example later of how this approach could be incorporated on
>> top of our current code base.
>>
>>
>> 2. The use of 32-bit integers for bit indexes
>>
>> The lack of 64-bit indexes for the bit positions in a filter prevents
>> construction of very large filters. This is a limitation as shown here:
>>
>> P = 1e-6
>> M = 2^31, N = 74,681,641
>> M = 2^31 * 64, N = 4,779,624,982
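>>
>> (These follow from the optimal-k relation n = m (ln 2)^2 / ln(1/p); with
>> p = 1e-6 this gives 2^31 * 0.4805 / 13.816, roughly 7.47e7, and 64 times
>> that is roughly 4.78e9.)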
>>
>> So by using a full length long[] representation that could be addressed in
>> memory from a long index divided by 64 allows you to increase capacity of
>> an array base filter by 64-fold. This pushes it into a range where the
>> number of items that can be stored is much larger than that fully supported
>> by the JDK collections (which use 32-bit integers for size). It future
>> proofs the API for expansion when RAM is cheap. The actual change to the
>> code underneath is minimal. It would eliminate some data structures that we
>> currently have e.g. for the CountingBloomFilter, but these could be marked
>> to throw exceptions if the size is too big.
>>
>> It does make the API suitable for huge filters that can for example be
>> used to check database keys exist before a call to a database.
>>
>> I suspect the use of 32-bit indexes was to allow the use of Set<Integer>
>> and BitSet. We no longer use Set<Integer> for the counting bloom filter. A
>> TreeSet<Long> would function just as well where we do use it. We can create
>> a custom implementation of BitSet to allow long indexes. The BitSet has
>> overhead due to its dynamic sizing that we can remove. A simple BitArray
>> implementation would be easy to do.
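>>
>> Something like this minimal sketch (long indexes assumed; note the backing
>> array is still bounded by the maximum Java array length, so a full
>> implementation may need more than one array):
>>
>> class BitArray {
>>     private final long[] words;
>>
>>     BitArray(long numberOfBits) {
>>         words = new long[(int) ((numberOfBits + 63) >>> 6)];
>>     }
>>
>>     void set(long index) {
>>         // A long shift in Java uses only the low 6 bits of the count.
>>         words[(int) (index >>> 6)] |= 1L << index;
>>     }
>>
>>     boolean get(long index) {
>>         return (words[(int) (index >>> 6)] & (1L << index)) != 0;
>>     }
>> }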
>>
>>
>> 3. Hashing currently requires a full length byte[] to pass to a hash
>> function
>>
>> Hashing requires a preallocated byte[]. I would prefer to see hashing as a
>> dynamic process where bytes are fed into an algorithm. For example
>> MurmurHash3 32-bit algorithm only requires 4 bytes at a time. The 128-bit
>> algorithm requires 2 longs. Currently the Objects32 hasher uses 1 byte at a
>> time so could be totally online.
>>
>> The Guava code has the concept of a Hasher as a sink that can accepts data
>> of any primitive type. In their implementation bytes are always processed
>> in the input order as the code is also tied to their implementation of
>> various hash algorithms so the output must be correct. However there is no
>> reason that the bytes have to be processed in order. This allows for an
>> optimisation where a DynamicHasher can process bytes as it chooses, such as
>> immediately processing int/float and long/double input and, for smaller input,
>> putting the bytes in temporary storage until 4 bytes have been received. The
>> DynamicHasher need only ensure all bytes contribute to the final hash but
>> can choose the order in which bytes are processed.
>>
>> I’ll show an example later of how this could be incorporated.
>>
>>
>> 4. Thread safety
>>
>> Currently the hash functions are not thread safe. If the function is moved
>> inside a BloomFilter then the architecture must exist to create the stream
>> of indexes for each input object in a thread safe manner. Thus multiple
>> threads would be able to query the same BloomFilter instance. If we create
>> the correct design this feature will be naturally included. Obviously
>> adding to a BloomFilter may not be thread-safe (it would not be for the standard
>> implementation using a bit array) but if a filter is just a gateway filter
>> which never changes then I think concurrent query should be expected to
>> work.
>>
>>
>> Here is a modification to the current code using an abstraction layer to
>> convert items T to indexes. I was thinking about something like the
>> following skeleton:
>>
>> // Represent a hash
>> interface Hash {
>>    // Size in bits of the hash that is created
>>    int bits();
>>    // 64 bits of the hash (little-endian), starting from offset * 64.
>>    long getLongBits(int offset);
>> }
>>
>> // Generate a Hash from a pre-existing byte[].
>> // (This is modified from the current HashFunction.)
>> interface StaticHashFunction {
>>    Hash apply(byte[] data, int seed);
>> }
>>
>> // Generate indexes for an item.
>> // Left to the BloomFilter to call this appropriately for its own shape.
>> // The IndexGenerator would be constructed with:
>> // - A system to convert T to data for hashing (e.g. Function<T, byte[]>)
>> // - A system to convert data to a Hash (e.g. StaticHashFunction)
>> // - Other details of how a hash is incremented, e.g. cyclic/iterative
>> process type
>> interface IndexGenerator<T> {
>>    // The supplier provides an indefinite stream
>>    LongSupplier generate(T item);
>> }
>>
>> // Note:
>> // Initial implementation of IndexGenerator would be
>> // StaticIndexGenerator using StaticHashFunction with variants for
>> iterative or cyclic.
>> // In either case the user supplies the Function<T, byte[]> to create a
>> byte[] from
>> // the object. This is then used to create a Hash for each call of the
>> LongSupplier
>> // or using a cyclic process from a single call to create the base Hash.
>>
>> // Note: A PrehashedCyclicIndexGenerator<long[]> class can be used to just
>> generate
>> // the indexes directly using the first 2 longs in the long[] object using
>> a cyclic function.
>> // This allows constructing a BloomFilter<long[]> and storing pre-hashed
>> items in it. The
>> // hashed representation is simple to send over a network (it is only 2
>> longs) allowing remote
>> // hashing and query.
>>
>>
>> // Typed to items
>> interface BloomFilter<T> {
>>    boolean add(T item);
>> }
>>
>>
>> // To use it you require the function to convert objects to data
>> Function<T, byte[]> converter = t -> t.getBytes();
>>
>> IndexGenerator<T> hasher = IndexGenerators.create(converter,
>> staticHashFunction, processType);
>> BloomFilter<T> bf = new BloomFilter<>(hasher, shape);
>>
>> // or common use case with default hash function and process type
>> BloomFilter<T> bf = BloomFilters.create(n, p, converter);
>>
>>
>> // common API
>> T t;
>> bf.add(t);
>> bf.contains(t);
>>
>>
>> The following is an extension to allow dynamic hashing similar to the
>> Guava API
>>
>> // Accepts primitive data
>> interface DataSink {
>>     DataSink putByte(byte b);
>>     DataSink putBytes(byte[] b, int offset, int length);
>>     DataSink putChars(CharSequence cs, Charset encoding);
>>     DataSink putUnencodedChars(CharSequence cs);
>>     DataSink putInt(int i);
>>     // All other primitives
>> }
>>
>> // Specify how item T is decomposed into primitive data
>> interface Pipe<T> {
>>     void connect(T item, DataSink sink);
>> }
>>
>> // Accepts data and dynamically hashes it.
>> interface Hasher extends DataSink {
>>     Hash build();
>> }
>>
>> // Provides an implementation to create a Hash dynamically
>> interface DynamicHashFunction {
>>     Hasher newHasher(int seed);
>>     // Size in bits of the hash that is created
>>     int bits();
>> }
>>
>> // To use it you require the pipe implementation to convert objects to data
>> Pipe<T> pipe = (t, sink) -> sink.putInt(t.getIntProperty());
>>
>> // explicit control over hash function and conversion of hashes to indexes
>> IndexGenerator<T> hasher = IndexGenerators.create(pipe,
>> dynamicHashFunction, processType);
>> BloomFilter<T> bf = new BloomFilter<>(hasher, shape);
>>
>> // or common use case with default hash function and process type
>> BloomFilter<T> bf = BloomFilters.create(n, p, pipe);
>>
>>
>>
>> With various factory methods to use defaults for a hash function and
>> process type and a few implementations of Pipe for common things (e.g.
>> String, Integer).
>>
>> There are lots of holes to fill in. But it allows you to directly add
>> items T to a BloomFilter<T> and control the hash function and the iterative
>> process.
>>
>> One main issue is how to deal with filter compatibility. We can compare
>> filter properties k and m easily. But how to compare the IndexGenerator<T>
>> used by the filter? IMO the current HashFunctionIdentity is subject to
>> error. No system is perfect. Perhaps we go with the simple option of using
>> Object.equals(Object) to test equality of the IndexGenerator. In the common
>> use case you will have the same object for your IndexGenerator<T> when
>> merging two filters. The equals will be fast. All the implementations we
>> provide can override equals to make this efficient using an internal
>> signature. The exact method can be private until we have the rest of the
>> framework complete.
>>
>> There are issues. There is a lot of object creation. I’d like to minimise
>> this. But I don’t see how you can have a flexible API without object
>> creation. For example you could make the BloomFilter convert the Hash to
>> indexes and avoid the index generator object. But then you lose the ability
>> to use an item of type long[] as a prehashed item for conversion with a
>> customised index generator.
>>
>> If this is not the best way then I still think the API needs to have some
>> simple way to use a BloomFilter that allows adding objects. Currently you
>> cannot use a BloomFilter on its own. You can only use them if you have
>> another BloomFilter or a Hasher. So you need to have the ability to create
>> one of those. It is not really in the spirit of a collection. A collection
>> should be able to interact directly with what it is a collection of.
>> Currently the BloomFilter cannot. So this suggestion is to make the
>> conversion of objects to data for hashing part of the API.
>>
>> Note also that here we drop the ability to query using a Hasher built with
>> multiple objects as it is covered by query using a BloomFilter<T> as a
>> collection of T.
>>
>> These are just thoughts. But I’d like you opinion on whether we can make
>> this work for all the scenarios you are applying this to. I think it would
>> work for your common use cases of storing String and the remote filter
>> query. The goal would be to make a BloomFilter<T> like a fuzzy Set<T> (may
>> or may not contain T) and just as easy to use. Otherwise I think you risk
>> making the API too complex for the common use case. That would be: a user
>> wants to store items T and knows the fields in T that uniquely define the
>> item. They just build a function to create a byte[] representation of the
>> object, or a pipe to send data from T to a downstream receiver that will
>> collate the data. The rest of the process to create indexes from the item
>> data is provided by the framework.
>>
>> Alex
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Alex Herbert


> On 23 Mar 2020, at 10:13, Alex Herbert <[hidden email]> wrote:
>
> [...]
>

Attempting to re-awaken this thread.

IMO the bloomfilter package is currently under development. The original contributor was working through submitting Bloom filter functionality in parts. My involvement was to ensure the current code that was initially merged was documented (in javadoc) and functioned as described. This led to changes of the current functionality and ideas for changes. There were many ideas in this thread history that were discussed, some agreed and not yet implemented, some still open for discussion.

Development has stalled and the vision for the final package has not been completed. At present the bloomfilter added to collections does not use generics and it has very little to do with java.util.Collection. So is it even in the remit of commons-collections?

I suggested moving to a development branch while the package functionality is stabilised if any other contributors wish to work on this. I have left development alone as it seems redundant to progress without a defined scope for the ultimate functionality.

Opinions on how to proceed?

Alex

Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Gilles Sadowski-2
Hello.

> > [...]
> >
>
> Attempting to re-awaken this thread.
>
> IMO the bloomfilter package is currently under development. The original contributor was working through submitting Bloom filter functionality in parts. My involvement was to ensure the current code that was initially merged was documented (in javadoc) and functioned as described. This led to changes of the current functionality and ideas for changes. There were many ideas in this thread history that were discussed, some agreed and not yet implemented, some still open for discussion.
>
> Development has stalled and the vision for the final package has not been completed. At present the bloomfilter added to collections does not use generics and it has very little to do with java.util.Collection. So is it even in the remit of commons-collections?

What would be the alternative(s)?

>
> I suggested moving to a development branch while the package functionality is stabilised if any other contributors wish to work on this.

IIRC, in Commons, branches other than "master" have usually become stale,
unfortunately.  They've been rare, and worked on by a single person...
Perhaps somehow mark the feature as not ready, and to be removed from
the release branch (?).

> I have left development alone as it seems redundant to progress without a defined scope for the ultimate functionality.

Do you mean that progress depends on intended usage?

>
> Opinions on how to proceed?

I got lost along the way; sorry.
Are there still uncertainties about the "basic" features?

Regards,
Gilles

>
> Alex
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Claude Warren
Bloom filters should not use generics.  That has been my stated opinion.
They are not like other collections in that you don't get out what you put
in.  They are collections of hashes so the idea that generics should be
used to somehow define what goes in is misleading.

If commons-collections is supposed to be all about putting things into a
collection and getting them back out then perhaps bloom filters do not
belong here.

The only major point of contention is whether the package should end up looking
like the Guava Bloom filter package.  In my opinion the Guava design is
very restrictive.  It does not allow for different hashing
implementations, and it forces a one-to-one correspondence between an Object
and the filter; this makes putting properties of an Object into the filter
as separate items difficult if not impossible, and it makes creating
partial filters for matching them more difficult still.

Any more complex usage of Bloom filters (e.g. in genomics) will be much
harder if not impossible.

Going the Guava route also raises the question: Why not just use Guava?

The intention of this contribution was a framework that allows the
developer to build Bloom filters that
a) meet the requirements of the application,
b) are easy to share between applications, and
c) can implement most strategies for Bloom filters.

The Guava-style implementation is easy enough to construct with the framework as
defined in the current Commons Collections Bloom filter code.  And I have
no objection to providing a Simple Bloom filter implementation that does
that.  But doing so should not modify the framework in such a way as to
make other usage more difficult.

There have been lots of good conversations and lots of improvements since
the code was contributed.  I have several open source projects that
utilized the original code and have been able to easily modify them to use
the Commons versions as development and improvements progressed.  I would
hope to continue to be able to do that as the code moves to a releasable
state.

As I said above, it may be that commons collections is not the place for
Bloom filters.  Perhaps they belong in Codec or in their own project.  I
leave that to others to decide.

Claude




On Wed, Apr 22, 2020 at 12:47 AM Gilles Sadowski <[hidden email]>
wrote:

> Hello.
>
> > > [...]
> > >
> >
> > Attempting to re-awaken this thread.
> >
> > IMO the bloomfilter package is currently under development. The original
> contributor was working through submitting Bloom filter functionality in
> parts. My involvement was to ensure the current code that was initially
> merged was documented (in javadoc) and functioned as described. This led to
> changes of the current functionality and ideas for changes. There were many
> ideas in this thread history that were discussed, some agreed and not yet
> implemented, some still open for discussion.
> >
> > Development has stalled and the vision for the final package has not
> been completed. At present the bloomfilter added to collections does not
> use generics and it has very little to do with java.util.Collection. So is
> it even in the remit of commons-collections?
>
> What would be the alternative(s)?
>
> >
> > I suggested moving to a development branch while the package
> functionality is stabilised if any other contributors wish to work on this.
>
> IIRC, in Commons, branches other than "master" have usually become stale,
> unfortunately.  They've been rare, and worked on by a single person...
> Perhaps somehow mark the feature as not ready, and to be removed from
> the release branch (?).
>
> > I have left development alone as it seems redundant to progress without
> a defined scope for the ultimate functionality.
>
> Do you mean that progress depends on intended usage?
>
> >
> > Opinions on how to proceed?
>
> I got lost along the way; sorry.
> Are there still uncertainties about the "basic" features?
>
> Regards,
> Gilles
>
> >
> > Alex
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

garydgregory
Hi All,

I'd like to pat ourselves on the back for a smooth ride so far in bringing
in new code for a new non-trivial feature :-)

IMO, let's not get hung up on generics for the first cut of the Bloom
Filter code. Why? Because generics are erased by the compiler and we can
always add them later without breaking binary compatibility. Now, if adding
generics later would require other changes or other changes would be
desired with said generics which would end up breaking BC, then we should
pause and consider why that is. Any thoughts on that?
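
A sketch of why this holds (hypothetical signatures, before and after):

// Before: public interface BloomFilter { boolean merge(BloomFilter other); }
// After:  public interface BloomFilter<T> { boolean merge(BloomFilter<T> other); }
// Both erase to the same raw method descriptors, so code compiled against the
// raw version still links against the generic one.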

WRT 'fitting in' Commons Collections, I would like someone else to reply
(please ;-)
- Does the BF code implement some Commons Collection interfaces?
- Does the BF code extend some Commons Collection classes?
- Does the BF code reuse Commons Collection utilities?

Tangent: There were UML tools/Maven plugins in the past that could
visualize these kinds of relationships; has anyone used any in the recent
past? Anything FOSS we could use?

Gary

On Wed, Apr 22, 2020 at 7:00 AM Claude Warren <[hidden email]> wrote:

> [...]
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Gilles Sadowski-2
Hello.

On Wed, 22 Apr 2020 at 14:56, Gary Gregory <[hidden email]> wrote:

>
> Hi All,
>
> I'd like to pat ourselves on the back for a smooth ride so far in bringing
> in new code for a new non-trivial feature :-)
>
> IMO, let's not get hung up on generics for the first cut of the Bloom
> Filter code. Why? Because generics are erased by the compiler and we can
> always add them later without break binary compatibility. Now, if adding
> generics later would require other changes or other changes would be
> desired with said generics which would end up breaking BC, then we should
> pause and consider why that is. Any thoughts on that?
>
> WRT to 'fitting in' Commons Collection, I would like someone else to reply
> (please ;-)
> - Does the BF code implement some Commons Collection interfaces?

No.

> - Does the BF code extend  some Commons Collection classes?

No.

> - Does the BF code reuse Commons Collection utilities?

Yes, in class "HasherBloomFilter":
import org.apache.commons.collections4.iterators.EmptyIterator;
import org.apache.commons.collections4.iterators.IteratorChain;

But it also imports from "Commons Codec", in classes "Murmur128x64Cyclic"
and "Murmur32x86Iterative":
import org.apache.commons.codec.digest.MurmurHash3;

> Tangent: There were UML tools/Maven plugins in the past that could
> visualize these kinds of relationships, has anyone used any in the recent
> past? Anything FOSS we could use?

Some info is provided in the "JDepend" report.

Gilles

>
> Gary
>
> On Wed, Apr 22, 2020 at 7:00 AM Claude Warren <[hidden email]> wrote:
>
> > Bloom filters should not use generics.  That has been my stated opinion.
> > They are not like other collections in that you don't get out what you put
> > in.  They are collections of hashes so the idea that generics should be
> > used to somehow define what goes in is misleading.
> >
> > If commons-collections is supposed to be all about putting things into a
> > collection and getting them back out then perhaps bloom filters do not
> > belong here.
> >
> > The only major point of contention is whether the package should end up
> > looking like the Guava Bloom filter package.  In my opinion the Guava
> > package is very restrictive.  It does not allow for different hashing
> > implementations; it forces a one-to-one correspondence between an Object
> > and the filter, which makes putting properties of an Object into the
> > filter as separate items difficult if not impossible, and it makes
> > creating partial filters for matching the same still more difficult.
> >
> > Any more complex usage of Bloom filters (e.g. in genomics) will be much
> > harder if not impossible.
> >
> > Going the Guava route also raises the question: why not just use Guava?
> >
> > The intention of this contribution was a framework that allows the
> > developer to build Bloom filters that
> > a) met the requirements of the application.
> > b) were easy to share between applications.
> > c) could implement most strategies for Bloom filters.
> >
> > The Guava implementation is easy enough to construct with the framework as
> > defined in the current Commons Collections Bloom filter code.  And I have
> > no objection to providing a Simple Bloom filter implementation that does
> > that.  But doing so should not modify the framework in such a way as to
> > make other usage more difficult.
> >
> > There have been lots of good conversations and lots of improvements since
> > the code was contributed.  I have several open source projects that
> > utilized the original code and have been able to easily modify them to use
> > the Commons versions as development and improvements progressed.  I would
> > hope to continue to be able to do that as the code moves to a releasable
> > state.
> >
> > As I said above, it may be that commons collections is not the place for
> > Bloom filters.  Perhaps they belong in Codec or in their own project.  I
> > leave that to others to decide.
> >
> > Claude
> >
> >
> >
> >
> > On Wed, Apr 22, 2020 at 12:47 AM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> > > Hello.
> > >
> > > > > [...]
> > > > >
> > > >
> > > > Attempting to re-awaken this thread.
> > > >
> > > > IMO the bloomfilter package is currently under development. The
> > original
> > > contributor was working through submitting Bloom filter functionality in
> > > parts. My involvement was to ensure the current code that was initially
> merged was documented (in javadoc) and functioned as described. This led
> to changes to the current functionality and ideas for further changes.
> There were many
> > > ideas in this thread history that were discussed, some agreed and not yet
> > > implemented, some still open for discussion.
> > > >
> > > > Development has stalled and the vision for the final package has not
> > > been completed. At present the bloomfilter added to collections does not
> > > use generics and it has very little to do with java.util.Collection. So
> > is
> > > it even in the remit of commons-collections?
> > >
> > > What would be the alternative(s)?
> > >
> > > >
> > > > I suggested moving to a development branch while the package
> > > functionality is stabilised if any other contributors wish to work on
> > this.
> > >
> > > IIRC, in Commons, branches other than "master" have usually become stale,
> > > unfortunately.  They've been rare, and worked on by a single person...
> > > Perhaps somehow mark the feature as not ready, and to be removed from
> > > the release branch (?).
> > >
> > > > I have left development alone as it seems redundant to progress without
> > > a defined scope for the ultimate functionality.
> > >
> > > Do you mean that progress depends on intended usage?
> > >
> > > >
> > > > Opinions on how to proceed?
> > >
> > > I got lost along the way; sorry.
> > > Are there still uncertainties about the "basic" features?
> > >
> > > Regards,
> > > Gilles
> > >
> > > >
> > > > Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
In reply to this post by garydgregory
On Wed, 22 Apr 2020 at 13:56, Gary Gregory <[hidden email]> wrote:

> Hi All,
>
> I'd like to pat ourselves on the back for a smooth ride so far in bringing
> in new code for a new non-trivial feature :-)
>
> IMO, let's not get hung up on generics for the first cut of the Bloom
> Filter code. Why? Because generics are erased by the compiler and we can
> always add them later without breaking binary compatibility. Now, if adding
> generics later would require other changes, or other changes would be
> desired with said generics which would end up breaking BC, then we should
> pause and consider why that is. Any thoughts on that?
>

Currently you add a Hasher to a BloomFilter. The BloomFilter is a set of n
bits. The Hasher provides indexes into the set given its size. The
BloomFilter can enable the bits based on a Hasher (merge) or state whether
the bits are enabled for the Hasher (query). You can also merge and query
using a BloomFilter of the same shape.

A generic version would have the BloomFilter allow merge and query with an
item T. This requires moving control of the conversion of an item into a
Hasher into the BloomFilter. That would require a restructure of the entire
BloomFilter/Hasher paradigm, so it cannot be added later. An alternative is
to provide a higher level wrapper class that sits on top of a BloomFilter
and contains a mechanism to convert item T into a Hasher. This idea can be
built on top of the existing code.

So in summary I don't think we can convert what we currently have now to
generics. We can build a generics layer on top of it.
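
As a rough sketch of that higher level wrapper (the wrapper class and its
method names are hypothetical; the merge/contains calls on BloomFilter are
assumed from the description above, and the package locations are those of
the current development code):

    import java.util.function.Function;

    import org.apache.commons.collections4.bloomfilter.BloomFilter;
    import org.apache.commons.collections4.bloomfilter.hasher.Hasher;

    // Hypothetical generic layer over the current non-generic API.
    public class TypedBloomFilter<T> {
        private final BloomFilter filter;
        private final Function<T, Hasher> toHasher;

        public TypedBloomFilter(BloomFilter filter, Function<T, Hasher> toHasher) {
            this.filter = filter;
            this.toHasher = toHasher;
        }

        // Enable the bits for the item (merge).
        public void add(T item) {
            filter.merge(toHasher.apply(item));
        }

        // Query whether all the bits for the item are enabled.
        public boolean mightContain(T item) {
            return filter.contains(toHasher.apply(item));
        }
    }

The wrapper owns the item-to-Hasher conversion, so the low level API is
untouched and the generic layer can be added, or changed, later.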

I've not checked the recent thread topics but I believe the present points
under discussion for change are:

- Switch to long indexes: Allows storing 64-times larger filters with a
basic array of long[]. Future support for much larger filters could use
specialised structures.
- Switch the mechanism for providing the enabled bits of a BloomFilter.
This is to support serialisation and generic merge of two filters which may
maintain a different internal structure. Currently it uses a method to
return a long[] and also a frozen version of a Hasher. This could be
changed to use iterators, allowing unlimited size. This is relevant if we
switch to long indexes and want to support more than Integer.MAX_VALUE * 64
bits (i.e. more than a long[] can possibly hold). Using a forEach type
pattern was also discussed; this was added to the CountingBloomFilter but
not the (super) BloomFilter. (A sketch follows this list.)
- Revisit how a Hasher communicates to a BloomFilter that it uses the same
underlying mechanism for generating indexes as all the other Hashers added
to the filter. Currently this uses a HashFunctionIdentity, which is based
on a set of properties and on applying the HashFunction to a string created
from those properties. The system is fallible.
- Revisit how to convert any object to a Hasher. Currently this requires
conversion to a byte[] which is then used with a HashFunction to create
long output using a generator pattern. This is a topic where generics could
be used: a generic framework could allow a user to specify the parts of
object T that should contribute to the byte[] used by the HashFunction.
- Possible support for dynamic HashFunctions that do not require a byte[]
and can do hashing on the fly.
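
To make the iterator/forEach idea concrete, here is a minimal sketch of
what streamed index access might look like (the interface and method names
are illustrative only, not existing API):

    import java.util.PrimitiveIterator;
    import java.util.function.LongConsumer;

    // Illustrative only: access to enabled bit indexes that is not bounded
    // by the capacity of a single long[] bit map.
    public interface IndexProducer {

        // Stream each enabled bit index to the consumer.
        void forEachIndex(LongConsumer consumer);

        // Alternatively, expose the indexes as a primitive iterator.
        PrimitiveIterator.OfLong indexIterator();
    }

A filter with any internal structure could implement this, and a generic
merge of two filters of the same shape reduces to feeding one filter's
indexes into the other.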

Some of these topics require API changes. Thus my concern that the API is
still under development.


>
> WRT to 'fitting in' Commons Collection, I would like someone else to reply
> (please ;-)
> - Does the BF code implement some Commons Collection interfaces?
>
No

> - Does the BF code extend  some Commons Collection classes?
>
No

> - Does the BF code reuse Commons Collection utilities?
>
No

It is totally separate. It uses a hash function from Commons Codec (which
Claude kindly fixed, as the MurmurHash3 function did not compute the
correct hash).
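
For anyone following along, the dependency is on MurmurHash3 from Commons
Codec; a trivial use of the corrected 128-bit variant (hash128x64, available
since Codec 1.14) looks like:

    import java.nio.charset.StandardCharsets;

    import org.apache.commons.codec.digest.MurmurHash3;

    public class HashDemo {
        public static void main(String[] args) {
            // The fixed 128-bit x64 MurmurHash3; returns the hash as two longs.
            byte[] data = "example item".getBytes(StandardCharsets.UTF_8);
            long[] hash = MurmurHash3.hash128x64(data);
            System.out.printf("%016x%016x%n", hash[0], hash[1]);
        }
    }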


>
> Tangent: There were UML tools/Maven plugins in the past that could
> visualize these kinds of relationships; has anyone used any in the recent
> past? Anything FOSS we could use?
>

No need.

git grep commons.collections4
src/main/java/org/apache/commons/collections4/bloomfilter/

The only items that are imported are those below the bloomfilter package.

Answers to Claude's points below:


> On Wed, Apr 22, 2020 at 7:00 AM Claude Warren <[hidden email]> wrote:
>
> > Bloom filters should not use generics.  That has been my stated opinion.
> > They are not like other collections in that you don't get out what you
> put
> > in.  They are collections of hashes so the idea that generics should be
> > used to somehow define what goes in is misleading.
>

That is fine. If we set out that this is the architectural decision in the
package docs then it is clear what the package is for.

Currently we have a very tight coupling between the BloomFilter, Hasher,
Shape and HashFunctionIdentity classes. This leads me to think there should
be a level above this that allows construction of a class containing all
the required components to function as a collection:

ObjectBloomFilter<T> filter = BloomFilters.create(...)

You would pass in specifications for the component parts, providing a
simple use-case API. Advanced usage can work with the low level components.
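
A sketch of what that factory might accept (entirely hypothetical API
design; none of these names exist in the code yet):

    import java.util.function.Function;

    // Hypothetical user-facing layer.
    public interface ObjectBloomFilter<T> {
        void add(T item);
        boolean mightContain(T item);
    }

    final class BloomFilters {
        private BloomFilters() {}

        // The sizing parameters would drive the Shape; itemToBytes would
        // feed the HashFunction via a Hasher factory.
        static <T> ObjectBloomFilter<T> create(int expectedItems,
                                               double falsePositiveProbability,
                                               Function<T, byte[]> itemToBytes) {
            throw new UnsupportedOperationException("design sketch only");
        }
    }

Simple use cases get the one-liner; advanced users keep working with Shape,
Hasher and friends directly.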


> >
> > If commons-collections is supposed to be all about putting things into a
> > collection and getting them back out then perhaps bloom filters do not
> > belong here.
>

Not yet, at least. We are still in development and perhaps have a bit
further to go to get to a collection of items.


> >
> > The only major point of contention is whether the package should end up
> > looking like the Guava Bloom filter package.  In my opinion the Guava
> > package is very restrictive.  It does not allow for different hashing
> > implementations; it forces a one-to-one correspondence between an Object
> > and the filter, which makes putting properties of an Object into the
> > filter as separate items difficult if not impossible, and it makes
> > creating partial filters for matching the same still more difficult.
>

Note you can always work around some issues with properties of objects by
typing the BloomFilter to a common control Object, or a superclass of the
things you want to add. The conversion of the Object to indices can then
respond appropriately based on the object type. But given we have not yet
included partial filters I don't know exactly how it would work.


> >
> > Any more complex usage of Bloom filters (e.g. in genomics) will be much
> > harder if not impossible.
> >
> > Going the Guava route also raises the question: why not just use Guava?
>

I agree that Guava's implementation does one use case very well, but does
not allow configuring the hash function or other filter functionality. We
should not underestimate the number of people who want to do the same but
have a bit more control over the hash function. This is a use case we
should try to achieve in the end product.


> >
> > The intention of this contribution was a framework that allows the
> > developer to build Bloom filters that
> > a) met the requirements of the application.
> > b) were easy to share between applications.
> > c) could implement most strategies for Bloom filters.
> >
> > The Guava implementation is easy enough to construct with the framework
> as
> > defined in the current Commons Collections Bloom filter code.  And I have
> > no objection to providing a Simple Bloom filter implementation that does
> > that.  But doing so should not modify the framework in such a way as to
> > make other usage more difficult.
>

Thus we build on top of it. But we should maintain that the entire API is
fluid until we have that functionality in place too. If we expect and allow
changes to the API then the development tag is still relevant.


> >
> > There have been lots of good conversations and lots of improvements since
> > the code was contributed.  I have several open source projects that
> > utilized the original code and have been able to easily modify them to
> use
> > the Commons versions as development and improvements progressed.  I would
> > hope to continue to be able to do that as the code moves to a releasable
> > state.
>

It is helpful to know that the code has improved.


> >
> > As I said above, it may be that commons collections is not the place for
> > Bloom filters.  Perhaps they belong in codec or in its own project.  I
> > leave that to others to decide.
>

This is the crux. Currently we have a structure that collects bits and
provides merge and query based on bits. It uses nothing from the rest of
commons collections. But a second layer can be added that provides a
collections-type API on top of the current code.

Alex
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Alex Herbert
In reply to this post by Gilles Sadowski-2
On Wed, 22 Apr 2020 at 17:17, Gilles Sadowski <[hidden email]> wrote:

>
> > - Does the BF code reuse Commons Collection utilities?
>
> Yes, in class "HasherBloomFilter":
> import org.apache.commons.collections4.iterators.EmptyIterator;
> import org.apache.commons.collections4.iterators.IteratorChain;
>

Missed that one. But I believe one of the discussions was how to improve
the HasherBloomFilter to remove this code, or even drop the
HasherBloomFilter, as the functionality it provides was heading towards
redundancy. I'd have to check back on what was discussed.
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Claude Warren
I keep wondering if Bloom filters belong in Collections.  They are not a
collection in the standard sense of the word.  Would it make more sense to
split it out as a new Commons project?  How does one even go about that?

On Wed, Apr 22, 2020 at 5:37 PM Alex Herbert <[hidden email]>
wrote:

> On Wed, 22 Apr 2020 at 17:17, Gilles Sadowski <[hidden email]>
> wrote:
>
> >
> > > - Does the BF code reuse Commons Collection utilities?
> >
> > Yes, in class "HasherBloomFilter":
> > import org.apache.commons.collections4.iterators.EmptyIterator;
> > import org.apache.commons.collections4.iterators.IteratorChain;
> >
>
> Missed that one. But I believe one of the discussions was how to improve
> the HasherBloomFilter to remove this code or even drop the
> HasherBloomFilter as the functionality it provides was heading towards
> being redundant. I'd have to check back what was discussed.
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [BloomFilters] changes to BloomFilter

Gilles Sadowski-2
Hi.

2020-05-10 18:38 UTC+02:00, Claude Warren <[hidden email]>:
> I keep wondering if Bloom filters belong in Collections.  They are not a
> collection in the standard sense of the word.

If it does not depend on [Collections] code and it provides
a functionality that can be used without [Collections], then
IMO it does not belong.

> Would it make more sense to
> split it out as a new Commons project?

+1
[And +1 to making a beta release.]

> How does one even go about that?

A request to INFRA and off we go...
[Perhaps a (lazy) vote is in order.]

Best,
Gilles

>> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
