[collections] bloom filters comments

classic Classic list List threaded Threaded
14 messages Options
Reply | Threaded
Open this post in threaded view
|

[collections] bloom filters comments

garydgregory
Hi Claude and all:

Here are a couple of comments on the bloom filter PR.

10,100 ft level comment we do not have to worry about today: Before the
release, we might want to split Commons Collection into a multi-module
project and have the BF code in it own module. The _main_ reason for this
is that it will allow the dependency on Commons Codecis be a non-optional
dependency. Optional dependency are a a bit of a pain IMO, especially when
you deal with large stacks. You end up sucking in everything that is
optional when you deliver an app because you do not know for certain what's
really required at runtime.

Closer to the ground: I would make BloomFilter an interface and rename
the current BloomFilter class AbstractBloomFilter implements BloomFilter.
This will allow the user and maintainer to see what is truly public.

Since this is a first cut for Commons Collection, I would consider only
making APIs public that must be public. Once released, we MUST maintain
binary compatibility within minor releases. Major releases allow us to
break this compatibility but these do not happen very often. I do not know
what this means yet for BF but it's a thought to consider. This kind of
work is made harder due to the current packaging of the BF code.

We could consider putting it all in one package if that helps reduce the
public API footprint.

Q: All the code in main's
org.apache.commons.collections4.bloomfilter.hasher.function is only called
from test code. Why is that?

Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'

Gary
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <[hidden email]> wrote:

> Hi Claude and all:
>
> Here are a couple of comments on the bloom filter PR.
>
> 10,100 ft level comment we do not have to worry about today: Before the
> release, we might want to split Commons Collection into a multi-module
> project and have the BF code in it own module. The _main_ reason for this
> is that it will allow the dependency on Commons Codecis be a non-optional
> dependency. Optional dependency are a a bit of a pain IMO, especially when
> you deal with large stacks. You end up sucking in everything that is
> optional when you deliver an app because you do not know for certain what's
> really required at runtime.
>
> Closer to the ground: I would make BloomFilter an interface and rename
> the current BloomFilter class AbstractBloomFilter implements BloomFilter.
> This will allow the user and maintainer to see what is truly public.
>

I have done this (not checked in) but the net effect is that all the public
methods in the AbstractBloomFilter class are reflected in the BloomFilter
interface and the Shape class has moved to the Interface as well.

I can remove several methods from BloomFilter that are not strictly
necessary for this code to function.  I am and have been reticent to do so
since there are a number of home-grown libraries used by various
researchers that provide these functions.  But if that is what it takes to
get this out the door it will be done.


> Since this is a first cut for Commons Collection, I would consider only
> making APIs public that must be public. Once released, we MUST maintain
> binary compatibility within minor releases. Major releases allow us to
> break this compatibility but these do not happen very often. I do not know
> what this means yet for BF but it's a thought to consider. This kind of
> work is made harder due to the current packaging of the BF code.
>
> We could consider putting it all in one package if that helps reduce the
> public API footprint.
>

I think that putting all the pieces into a single package will muddy the
waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
and used wherever the Bloom filter code is used.  Objects in the
o.a.c.c.bloomfilter.hasher classes are Hashers and are really only selected
when the application developer determines which hasher is best suited for
the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
HasherFunction implementations.  These can be used or special cases built
by the implementer as necessary.  Perhaps I am being a bit too pedantic but
I do not think it makes sense to merge them into a single package.


>
> Q: All the code in main's
> org.apache.commons.collections4.bloomfilter.hasher.function is only called
> from test code. Why is that?
>

They are pluggable modules.  When a developer uses them they an instance is
created and it is passed to the BloomFilter.Shape() constructor as well as
to the various Hasher constructors.  For example:

    HashFunction hashFunction = new Murmur128x86Cyclic();
    Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 100000);


>
> Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
> Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
>
> Gary
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

garydgregory
On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]> wrote:

> On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <[hidden email]>
> wrote:
>
> > Hi Claude and all:
> >
> > Here are a couple of comments on the bloom filter PR.
> >
> > 10,100 ft level comment we do not have to worry about today: Before the
> > release, we might want to split Commons Collection into a multi-module
> > project and have the BF code in it own module. The _main_ reason for this
> > is that it will allow the dependency on Commons Codecis be a non-optional
> > dependency. Optional dependency are a a bit of a pain IMO, especially
> when
> > you deal with large stacks. You end up sucking in everything that is
> > optional when you deliver an app because you do not know for certain
> what's
> > really required at runtime.
> >
> > Closer to the ground: I would make BloomFilter an interface and rename
> > the current BloomFilter class AbstractBloomFilter implements BloomFilter.
> > This will allow the user and maintainer to see what is truly public.
> >
>
> I have done this (not checked in) but the net effect is that all the public
> methods in the AbstractBloomFilter class are reflected in the BloomFilter
> interface and the Shape class has moved to the Interface as well.
>

OK, I think I like BloomFilter as an interface especially since it is used
as an argument in methods. I'll leave it to you as to whether Shape needs
to be folded in. I did notice that Shape is an argument in a few places.
Might we loose some focus and abstraction by folding Shape into the
BloomFilter interface?


> I can remove several methods from BloomFilter that are not strictly
> necessary for this code to function.  I am and have been reticent to do so
> since there are a number of home-grown libraries used by various
> researchers that provide these functions.  But if that is what it takes to
> get this out the door it will be done.
>

Once you have BloomFilter as an interface with an implementing class, it
might make it much clearer as to what really belongs in the interface. The
handy methods can stay in the abstract class of course.


>
> > Since this is a first cut for Commons Collection, I would consider only
> > making APIs public that must be public. Once released, we MUST maintain
> > binary compatibility within minor releases. Major releases allow us to
> > break this compatibility but these do not happen very often. I do not
> know
> > what this means yet for BF but it's a thought to consider. This kind of
> > work is made harder due to the current packaging of the BF code.
> >
> > We could consider putting it all in one package if that helps reduce the
> > public API footprint.
> >
>
> I think that putting all the pieces into a single package will muddy the
> waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
> and used wherever the Bloom filter code is used.  Objects in the
> o.a.c.c.bloomfilter.hasher classes are Hashers and are really only selected
> when the application developer determines which hasher is best suited for
> the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> HasherFunction implementations.  These can be used or special cases built
> by the implementer as necessary.  Perhaps I am being a bit too pedantic but
> I do not think it makes sense to merge them into a single package.
>

Let's leave it as is then.

Side note: I'll need to release Commons Codec 1.14 before we can bring in
this PR. I hope to do this after Commons VFS has gone through its own
release cycle (which should be done in 48 hours or so if all goes well.)
Well, we could bring in the PR but depending on a SNAPSHOT build is usually
not a good idea unless it is truly short term.

Thank you,
Gary


>
>
> >
> > Q: All the code in main's
> > org.apache.commons.collections4.bloomfilter.hasher.function is only
> called
> > from test code. Why is that?
> >
>
> They are pluggable modules.  When a developer uses them they an instance is
> created and it is passed to the BloomFilter.Shape() constructor as well as
> to the various Hasher constructors.  For example:
>
>     HashFunction hashFunction = new Murmur128x86Cyclic();
>     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 100000);
>
>
> >
> > Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
> > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> >
> > Gary
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
Once the interface is extracted and reduced to the minimum necessary the
following methods are removed:

orCardinality() -- we have andCardinality() and xorCardinality() this was
included for completeness.

isFull() -- true if all the bits in the vector are on. A convenience method
used to short circuit some logic.

I think these 2 methods should go into the BloomFilter interface.


Set operations:

jaccardSimilarity -- a measure of similarity in sets with range [0,1]

jaccardDistance -- defined as 1 - jaccardSimilarity

cosineSimilarity -- a measure of similarity with range  [0,1]

cosineDistance -- defined as 1 - cosineSimilarity

estimateSize -- Set operation estimate of the number of items that were
placed in the filter. (Requires Shape)

estimateUnionSize -- Set operation estimate of the number of items that
would be represented by the merging of two filters. (Requires Shape)

estimateIntersectionSize -- Set operations estimate of the number of items
in the intersection. (Requires Shape)

Perhaps it makes sense to move the Set operations into their own class with
static methods?  The set operations all operate on 2 Bloom filters.  Moving
them would clarify the AbstractBloomFilter class.

Claude


On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <[hidden email]> wrote:

> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]> wrote:
>
> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <[hidden email]>
> > wrote:
> >
> > > Hi Claude and all:
> > >
> > > Here are a couple of comments on the bloom filter PR.
> > >
> > > 10,100 ft level comment we do not have to worry about today: Before the
> > > release, we might want to split Commons Collection into a multi-module
> > > project and have the BF code in it own module. The _main_ reason for
> this
> > > is that it will allow the dependency on Commons Codecis be a
> non-optional
> > > dependency. Optional dependency are a a bit of a pain IMO, especially
> > when
> > > you deal with large stacks. You end up sucking in everything that is
> > > optional when you deliver an app because you do not know for certain
> > what's
> > > really required at runtime.
> > >
> > > Closer to the ground: I would make BloomFilter an interface and rename
> > > the current BloomFilter class AbstractBloomFilter implements
> BloomFilter.
> > > This will allow the user and maintainer to see what is truly public.
> > >
> >
> > I have done this (not checked in) but the net effect is that all the
> public
> > methods in the AbstractBloomFilter class are reflected in the BloomFilter
> > interface and the Shape class has moved to the Interface as well.
> >
>
> OK, I think I like BloomFilter as an interface especially since it is used
> as an argument in methods. I'll leave it to you as to whether Shape needs
> to be folded in. I did notice that Shape is an argument in a few places.
> Might we loose some focus and abstraction by folding Shape into the
> BloomFilter interface?
>
>
> > I can remove several methods from BloomFilter that are not strictly
> > necessary for this code to function.  I am and have been reticent to do
> so
> > since there are a number of home-grown libraries used by various
> > researchers that provide these functions.  But if that is what it takes
> to
> > get this out the door it will be done.
> >
>
> Once you have BloomFilter as an interface with an implementing class, it
> might make it much clearer as to what really belongs in the interface. The
> handy methods can stay in the abstract class of course.
>
>
> >
> > > Since this is a first cut for Commons Collection, I would consider only
> > > making APIs public that must be public. Once released, we MUST maintain
> > > binary compatibility within minor releases. Major releases allow us to
> > > break this compatibility but these do not happen very often. I do not
> > know
> > > what this means yet for BF but it's a thought to consider. This kind of
> > > work is made harder due to the current packaging of the BF code.
> > >
> > > We could consider putting it all in one package if that helps reduce
> the
> > > public API footprint.
> > >
> >
> > I think that putting all the pieces into a single package will muddy the
> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
> > and used wherever the Bloom filter code is used.  Objects in the
> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> selected
> > when the application developer determines which hasher is best suited for
> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> > HasherFunction implementations.  These can be used or special cases built
> > by the implementer as necessary.  Perhaps I am being a bit too pedantic
> but
> > I do not think it makes sense to merge them into a single package.
> >
>
> Let's leave it as is then.
>
> Side note: I'll need to release Commons Codec 1.14 before we can bring in
> this PR. I hope to do this after Commons VFS has gone through its own
> release cycle (which should be done in 48 hours or so if all goes well.)
> Well, we could bring in the PR but depending on a SNAPSHOT build is usually
> not a good idea unless it is truly short term.
>
> Thank you,
> Gary
>
>
> >
> >
> > >
> > > Q: All the code in main's
> > > org.apache.commons.collections4.bloomfilter.hasher.function is only
> > called
> > > from test code. Why is that?
> > >
> >
> > They are pluggable modules.  When a developer uses them they an instance
> is
> > created and it is passed to the BloomFilter.Shape() constructor as well
> as
> > to the various Hasher constructors.  For example:
> >
> >     HashFunction hashFunction = new Murmur128x86Cyclic();
> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 100000);
> >
> >
> > >
> > > Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> > >
> > > Gary
> > >
> >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
If the Shape class (BloomFilter.Shape) is extracted from the BloomFilter
interface and made a stand-alone class would the name change or is the fact
that it is in the o.a.c.c.bloomfilter package enough?

I prefer the name Shape to BloomFilterShape.

Claude

On Sat, Dec 28, 2019 at 9:06 AM Claude Warren <[hidden email]> wrote:

> Once the interface is extracted and reduced to the minimum necessary the
> following methods are removed:
>
> orCardinality() -- we have andCardinality() and xorCardinality() this was
> included for completeness.
>
> isFull() -- true if all the bits in the vector are on. A convenience
> method used to short circuit some logic.
>
> I think these 2 methods should go into the BloomFilter interface.
>
>
> Set operations:
>
> jaccardSimilarity -- a measure of similarity in sets with range [0,1]
>
> jaccardDistance -- defined as 1 - jaccardSimilarity
>
> cosineSimilarity -- a measure of similarity with range  [0,1]
>
> cosineDistance -- defined as 1 - cosineSimilarity
>
> estimateSize -- Set operation estimate of the number of items that were
> placed in the filter. (Requires Shape)
>
> estimateUnionSize -- Set operation estimate of the number of items that
> would be represented by the merging of two filters. (Requires Shape)
>
> estimateIntersectionSize -- Set operations estimate of the number of items
> in the intersection. (Requires Shape)
>
> Perhaps it makes sense to move the Set operations into their own class
> with static methods?  The set operations all operate on 2 Bloom filters.
> Moving them would clarify the AbstractBloomFilter class.
>
> Claude
>
>
> On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <[hidden email]>
> wrote:
>
>> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]> wrote:
>>
>> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <[hidden email]>
>> > wrote:
>> >
>> > > Hi Claude and all:
>> > >
>> > > Here are a couple of comments on the bloom filter PR.
>> > >
>> > > 10,100 ft level comment we do not have to worry about today: Before
>> the
>> > > release, we might want to split Commons Collection into a multi-module
>> > > project and have the BF code in it own module. The _main_ reason for
>> this
>> > > is that it will allow the dependency on Commons Codecis be a
>> non-optional
>> > > dependency. Optional dependency are a a bit of a pain IMO, especially
>> > when
>> > > you deal with large stacks. You end up sucking in everything that is
>> > > optional when you deliver an app because you do not know for certain
>> > what's
>> > > really required at runtime.
>> > >
>> > > Closer to the ground: I would make BloomFilter an interface and rename
>> > > the current BloomFilter class AbstractBloomFilter implements
>> BloomFilter.
>> > > This will allow the user and maintainer to see what is truly public.
>> > >
>> >
>> > I have done this (not checked in) but the net effect is that all the
>> public
>> > methods in the AbstractBloomFilter class are reflected in the
>> BloomFilter
>> > interface and the Shape class has moved to the Interface as well.
>> >
>>
>> OK, I think I like BloomFilter as an interface especially since it is used
>> as an argument in methods. I'll leave it to you as to whether Shape needs
>> to be folded in. I did notice that Shape is an argument in a few places.
>> Might we loose some focus and abstraction by folding Shape into the
>> BloomFilter interface?
>>
>>
>> > I can remove several methods from BloomFilter that are not strictly
>> > necessary for this code to function.  I am and have been reticent to do
>> so
>> > since there are a number of home-grown libraries used by various
>> > researchers that provide these functions.  But if that is what it takes
>> to
>> > get this out the door it will be done.
>> >
>>
>> Once you have BloomFilter as an interface with an implementing class, it
>> might make it much clearer as to what really belongs in the interface. The
>> handy methods can stay in the abstract class of course.
>>
>>
>> >
>> > > Since this is a first cut for Commons Collection, I would consider
>> only
>> > > making APIs public that must be public. Once released, we MUST
>> maintain
>> > > binary compatibility within minor releases. Major releases allow us to
>> > > break this compatibility but these do not happen very often. I do not
>> > know
>> > > what this means yet for BF but it's a thought to consider. This kind
>> of
>> > > work is made harder due to the current packaging of the BF code.
>> > >
>> > > We could consider putting it all in one package if that helps reduce
>> the
>> > > public API footprint.
>> > >
>> >
>> > I think that putting all the pieces into a single package will muddy the
>> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
>> > and used wherever the Bloom filter code is used.  Objects in the
>> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
>> selected
>> > when the application developer determines which hasher is best suited
>> for
>> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
>> > HasherFunction implementations.  These can be used or special cases
>> built
>> > by the implementer as necessary.  Perhaps I am being a bit too pedantic
>> but
>> > I do not think it makes sense to merge them into a single package.
>> >
>>
>> Let's leave it as is then.
>>
>> Side note: I'll need to release Commons Codec 1.14 before we can bring in
>> this PR. I hope to do this after Commons VFS has gone through its own
>> release cycle (which should be done in 48 hours or so if all goes well.)
>> Well, we could bring in the PR but depending on a SNAPSHOT build is
>> usually
>> not a good idea unless it is truly short term.
>>
>> Thank you,
>> Gary
>>
>>
>> >
>> >
>> > >
>> > > Q: All the code in main's
>> > > org.apache.commons.collections4.bloomfilter.hasher.function is only
>> > called
>> > > from test code. Why is that?
>> > >
>> >
>> > They are pluggable modules.  When a developer uses them they an
>> instance is
>> > created and it is passed to the BloomFilter.Shape() constructor as well
>> as
>> > to the various Hasher constructors.  For example:
>> >
>> >     HashFunction hashFunction = new Murmur128x86Cyclic();
>> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 100000);
>> >
>> >
>> > >
>> > > Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
>> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
>> > >
>> > > Gary
>> > >
>> >
>> >
>> > --
>> > I like: Like Like - The likeliest place on the web
>> > <http://like-like.xenei.com>
>> > LinkedIn: http://www.linkedin.com/in/claudewarren
>> >
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Gilles Sadowski-2
Le dim. 29 déc. 2019 à 15:30, Claude Warren <[hidden email]> a écrit :
>
> If the Shape class (BloomFilter.Shape) is extracted from the BloomFilter
> interface and made a stand-alone class would the name change or is the fact
> that it is in the o.a.c.c.bloomfilter package enough?
>
> I prefer the name Shape to BloomFilterShape.

If "Shape" is only used for "BloomFilter" (as the suggestion above
seems to indicate, why not declare it as a sub-interface:
---CUT---
public interface BloomFilter {
    // ...

    public interface Shape {
        // ...
    }
}
---CUT---
?

Regards,
Gilles

>
> Claude
>
> On Sat, Dec 28, 2019 at 9:06 AM Claude Warren <[hidden email]> wrote:
>
> > Once the interface is extracted and reduced to the minimum necessary the
> > following methods are removed:
> >
> > orCardinality() -- we have andCardinality() and xorCardinality() this was
> > included for completeness.
> >
> > isFull() -- true if all the bits in the vector are on. A convenience
> > method used to short circuit some logic.
> >
> > I think these 2 methods should go into the BloomFilter interface.
> >
> >
> > Set operations:
> >
> > jaccardSimilarity -- a measure of similarity in sets with range [0,1]
> >
> > jaccardDistance -- defined as 1 - jaccardSimilarity
> >
> > cosineSimilarity -- a measure of similarity with range  [0,1]
> >
> > cosineDistance -- defined as 1 - cosineSimilarity
> >
> > estimateSize -- Set operation estimate of the number of items that were
> > placed in the filter. (Requires Shape)
> >
> > estimateUnionSize -- Set operation estimate of the number of items that
> > would be represented by the merging of two filters. (Requires Shape)
> >
> > estimateIntersectionSize -- Set operations estimate of the number of items
> > in the intersection. (Requires Shape)
> >
> > Perhaps it makes sense to move the Set operations into their own class
> > with static methods?  The set operations all operate on 2 Bloom filters.
> > Moving them would clarify the AbstractBloomFilter class.
> >
> > Claude
> >
> >
> > On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <[hidden email]>
> > wrote:
> >
> >> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]> wrote:
> >>
> >> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <[hidden email]>
> >> > wrote:
> >> >
> >> > > Hi Claude and all:
> >> > >
> >> > > Here are a couple of comments on the bloom filter PR.
> >> > >
> >> > > 10,100 ft level comment we do not have to worry about today: Before
> >> the
> >> > > release, we might want to split Commons Collection into a multi-module
> >> > > project and have the BF code in it own module. The _main_ reason for
> >> this
> >> > > is that it will allow the dependency on Commons Codecis be a
> >> non-optional
> >> > > dependency. Optional dependency are a a bit of a pain IMO, especially
> >> > when
> >> > > you deal with large stacks. You end up sucking in everything that is
> >> > > optional when you deliver an app because you do not know for certain
> >> > what's
> >> > > really required at runtime.
> >> > >
> >> > > Closer to the ground: I would make BloomFilter an interface and rename
> >> > > the current BloomFilter class AbstractBloomFilter implements
> >> BloomFilter.
> >> > > This will allow the user and maintainer to see what is truly public.
> >> > >
> >> >
> >> > I have done this (not checked in) but the net effect is that all the
> >> public
> >> > methods in the AbstractBloomFilter class are reflected in the
> >> BloomFilter
> >> > interface and the Shape class has moved to the Interface as well.
> >> >
> >>
> >> OK, I think I like BloomFilter as an interface especially since it is used
> >> as an argument in methods. I'll leave it to you as to whether Shape needs
> >> to be folded in. I did notice that Shape is an argument in a few places.
> >> Might we loose some focus and abstraction by folding Shape into the
> >> BloomFilter interface?
> >>
> >>
> >> > I can remove several methods from BloomFilter that are not strictly
> >> > necessary for this code to function.  I am and have been reticent to do
> >> so
> >> > since there are a number of home-grown libraries used by various
> >> > researchers that provide these functions.  But if that is what it takes
> >> to
> >> > get this out the door it will be done.
> >> >
> >>
> >> Once you have BloomFilter as an interface with an implementing class, it
> >> might make it much clearer as to what really belongs in the interface. The
> >> handy methods can stay in the abstract class of course.
> >>
> >>
> >> >
> >> > > Since this is a first cut for Commons Collection, I would consider
> >> only
> >> > > making APIs public that must be public. Once released, we MUST
> >> maintain
> >> > > binary compatibility within minor releases. Major releases allow us to
> >> > > break this compatibility but these do not happen very often. I do not
> >> > know
> >> > > what this means yet for BF but it's a thought to consider. This kind
> >> of
> >> > > work is made harder due to the current packaging of the BF code.
> >> > >
> >> > > We could consider putting it all in one package if that helps reduce
> >> the
> >> > > public API footprint.
> >> > >
> >> >
> >> > I think that putting all the pieces into a single package will muddy the
> >> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level classes
> >> > and used wherever the Bloom filter code is used.  Objects in the
> >> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> >> selected
> >> > when the application developer determines which hasher is best suited
> >> for
> >> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> >> > HasherFunction implementations.  These can be used or special cases
> >> built
> >> > by the implementer as necessary.  Perhaps I am being a bit too pedantic
> >> but
> >> > I do not think it makes sense to merge them into a single package.
> >> >
> >>
> >> Let's leave it as is then.
> >>
> >> Side note: I'll need to release Commons Codec 1.14 before we can bring in
> >> this PR. I hope to do this after Commons VFS has gone through its own
> >> release cycle (which should be done in 48 hours or so if all goes well.)
> >> Well, we could bring in the PR but depending on a SNAPSHOT build is
> >> usually
> >> not a good idea unless it is truly short term.
> >>
> >> Thank you,
> >> Gary
> >>
> >>
> >> >
> >> >
> >> > >
> >> > > Q: All the code in main's
> >> > > org.apache.commons.collections4.bloomfilter.hasher.function is only
> >> > called
> >> > > from test code. Why is that?
> >> > >
> >> >
> >> > They are pluggable modules.  When a developer uses them they an
> >> instance is
> >> > created and it is passed to the BloomFilter.Shape() constructor as well
> >> as
> >> > to the various Hasher constructors.  For example:
> >> >
> >> >     HashFunction hashFunction = new Murmur128x86Cyclic();
> >> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 / 100000);
> >> >
> >> >
> >> > >
> >> > > Nit: Some odd formatting like spaces after open parens: 'toUpperCase(
> >> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> >> > >
> >> > > Gary
> >> > >
> >> >
> >> >
> >> > --
> >> > I like: Like Like - The likeliest place on the web
> >> > <http://like-like.xenei.com>
> >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >> >
> >>
> >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
It is currently a sub-class.  There was a suggestion to move it outside of
the BloomFilter interface.

On Sun, Dec 29, 2019 at 3:47 PM Gilles Sadowski <[hidden email]>
wrote:

> Le dim. 29 déc. 2019 à 15:30, Claude Warren <[hidden email]> a écrit :
> >
> > If the Shape class (BloomFilter.Shape) is extracted from the BloomFilter
> > interface and made a stand-alone class would the name change or is the
> fact
> > that it is in the o.a.c.c.bloomfilter package enough?
> >
> > I prefer the name Shape to BloomFilterShape.
>
> If "Shape" is only used for "BloomFilter" (as the suggestion above
> seems to indicate, why not declare it as a sub-interface:
> ---CUT---
> public interface BloomFilter {
>     // ...
>
>     public interface Shape {
>         // ...
>     }
> }
> ---CUT---
> ?
>
> Regards,
> Gilles
>
> >
> > Claude
> >
> > On Sat, Dec 28, 2019 at 9:06 AM Claude Warren <[hidden email]> wrote:
> >
> > > Once the interface is extracted and reduced to the minimum necessary
> the
> > > following methods are removed:
> > >
> > > orCardinality() -- we have andCardinality() and xorCardinality() this
> was
> > > included for completeness.
> > >
> > > isFull() -- true if all the bits in the vector are on. A convenience
> > > method used to short circuit some logic.
> > >
> > > I think these 2 methods should go into the BloomFilter interface.
> > >
> > >
> > > Set operations:
> > >
> > > jaccardSimilarity -- a measure of similarity in sets with range [0,1]
> > >
> > > jaccardDistance -- defined as 1 - jaccardSimilarity
> > >
> > > cosineSimilarity -- a measure of similarity with range  [0,1]
> > >
> > > cosineDistance -- defined as 1 - cosineSimilarity
> > >
> > > estimateSize -- Set operation estimate of the number of items that were
> > > placed in the filter. (Requires Shape)
> > >
> > > estimateUnionSize -- Set operation estimate of the number of items that
> > > would be represented by the merging of two filters. (Requires Shape)
> > >
> > > estimateIntersectionSize -- Set operations estimate of the number of
> items
> > > in the intersection. (Requires Shape)
> > >
> > > Perhaps it makes sense to move the Set operations into their own class
> > > with static methods?  The set operations all operate on 2 Bloom
> filters.
> > > Moving them would clarify the AbstractBloomFilter class.
> > >
> > > Claude
> > >
> > >
> > > On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <[hidden email]>
> > > wrote:
> > >
> > >> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]>
> wrote:
> > >>
> > >> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <
> [hidden email]>
> > >> > wrote:
> > >> >
> > >> > > Hi Claude and all:
> > >> > >
> > >> > > Here are a couple of comments on the bloom filter PR.
> > >> > >
> > >> > > 10,100 ft level comment we do not have to worry about today:
> Before
> > >> the
> > >> > > release, we might want to split Commons Collection into a
> multi-module
> > >> > > project and have the BF code in it own module. The _main_ reason
> for
> > >> this
> > >> > > is that it will allow the dependency on Commons Codecis be a
> > >> non-optional
> > >> > > dependency. Optional dependency are a a bit of a pain IMO,
> especially
> > >> > when
> > >> > > you deal with large stacks. You end up sucking in everything that
> is
> > >> > > optional when you deliver an app because you do not know for
> certain
> > >> > what's
> > >> > > really required at runtime.
> > >> > >
> > >> > > Closer to the ground: I would make BloomFilter an interface and
> rename
> > >> > > the current BloomFilter class AbstractBloomFilter implements
> > >> BloomFilter.
> > >> > > This will allow the user and maintainer to see what is truly
> public.
> > >> > >
> > >> >
> > >> > I have done this (not checked in) but the net effect is that all the
> > >> public
> > >> > methods in the AbstractBloomFilter class are reflected in the
> > >> BloomFilter
> > >> > interface and the Shape class has moved to the Interface as well.
> > >> >
> > >>
> > >> OK, I think I like BloomFilter as an interface especially since it is
> used
> > >> as an argument in methods. I'll leave it to you as to whether Shape
> needs
> > >> to be folded in. I did notice that Shape is an argument in a few
> places.
> > >> Might we loose some focus and abstraction by folding Shape into the
> > >> BloomFilter interface?
> > >>
> > >>
> > >> > I can remove several methods from BloomFilter that are not strictly
> > >> > necessary for this code to function.  I am and have been reticent
> to do
> > >> so
> > >> > since there are a number of home-grown libraries used by various
> > >> > researchers that provide these functions.  But if that is what it
> takes
> > >> to
> > >> > get this out the door it will be done.
> > >> >
> > >>
> > >> Once you have BloomFilter as an interface with an implementing class,
> it
> > >> might make it much clearer as to what really belongs in the
> interface. The
> > >> handy methods can stay in the abstract class of course.
> > >>
> > >>
> > >> >
> > >> > > Since this is a first cut for Commons Collection, I would consider
> > >> only
> > >> > > making APIs public that must be public. Once released, we MUST
> > >> maintain
> > >> > > binary compatibility within minor releases. Major releases allow
> us to
> > >> > > break this compatibility but these do not happen very often. I do
> not
> > >> > know
> > >> > > what this means yet for BF but it's a thought to consider. This
> kind
> > >> of
> > >> > > work is made harder due to the current packaging of the BF code.
> > >> > >
> > >> > > We could consider putting it all in one package if that helps
> reduce
> > >> the
> > >> > > public API footprint.
> > >> > >
> > >> >
> > >> > I think that putting all the pieces into a single package will
> muddy the
> > >> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level
> classes
> > >> > and used wherever the Bloom filter code is used.  Objects in the
> > >> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> > >> selected
> > >> > when the application developer determines which hasher is best
> suited
> > >> for
> > >> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> > >> > HasherFunction implementations.  These can be used or special cases
> > >> built
> > >> > by the implementer as necessary.  Perhaps I am being a bit too
> pedantic
> > >> but
> > >> > I do not think it makes sense to merge them into a single package.
> > >> >
> > >>
> > >> Let's leave it as is then.
> > >>
> > >> Side note: I'll need to release Commons Codec 1.14 before we can
> bring in
> > >> this PR. I hope to do this after Commons VFS has gone through its own
> > >> release cycle (which should be done in 48 hours or so if all goes
> well.)
> > >> Well, we could bring in the PR but depending on a SNAPSHOT build is
> > >> usually
> > >> not a good idea unless it is truly short term.
> > >>
> > >> Thank you,
> > >> Gary
> > >>
> > >>
> > >> >
> > >> >
> > >> > >
> > >> > > Q: All the code in main's
> > >> > > org.apache.commons.collections4.bloomfilter.hasher.function is
> only
> > >> > called
> > >> > > from test code. Why is that?
> > >> > >
> > >> >
> > >> > They are pluggable modules.  When a developer uses them they an
> > >> instance is
> > >> > created and it is passed to the BloomFilter.Shape() constructor as
> well
> > >> as
> > >> > to the various Hasher constructors.  For example:
> > >> >
> > >> >     HashFunction hashFunction = new Murmur128x86Cyclic();
> > >> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 /
> 100000);
> > >> >
> > >> >
> > >> > >
> > >> > > Nit: Some odd formatting like spaces after open parens:
> 'toUpperCase(
> > >> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> > >> > >
> > >> > > Gary
> > >> > >
> > >> >
> > >> >
> > >> > --
> > >> > I like: Like Like - The likeliest place on the web
> > >> > <http://like-like.xenei.com>
> > >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >> >
> > >>
> > >
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > >
> >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

garydgregory
I think it is time to bring this PR in and make any adjustments within
master beyond that. This will be quicker and simpler than going round and
round for simple things like Javadoc tweaks and small non-functional
changes (formatting, variable names, and so on.) I'll proceed with that
tonight.

Gary

On Sun, Dec 29, 2019 at 11:56 AM Claude Warren <[hidden email]> wrote:

> It is currently a sub-class.  There was a suggestion to move it outside of
> the BloomFilter interface.
>
> On Sun, Dec 29, 2019 at 3:47 PM Gilles Sadowski <[hidden email]>
> wrote:
>
> > Le dim. 29 déc. 2019 à 15:30, Claude Warren <[hidden email]> a écrit :
> > >
> > > If the Shape class (BloomFilter.Shape) is extracted from the
> BloomFilter
> > > interface and made a stand-alone class would the name change or is the
> > fact
> > > that it is in the o.a.c.c.bloomfilter package enough?
> > >
> > > I prefer the name Shape to BloomFilterShape.
> >
> > If "Shape" is only used for "BloomFilter" (as the suggestion above
> > seems to indicate, why not declare it as a sub-interface:
> > ---CUT---
> > public interface BloomFilter {
> >     // ...
> >
> >     public interface Shape {
> >         // ...
> >     }
> > }
> > ---CUT---
> > ?
> >
> > Regards,
> > Gilles
> >
> > >
> > > Claude
> > >
> > > On Sat, Dec 28, 2019 at 9:06 AM Claude Warren <[hidden email]>
> wrote:
> > >
> > > > Once the interface is extracted and reduced to the minimum necessary
> > the
> > > > following methods are removed:
> > > >
> > > > orCardinality() -- we have andCardinality() and xorCardinality() this
> > was
> > > > included for completeness.
> > > >
> > > > isFull() -- true if all the bits in the vector are on. A convenience
> > > > method used to short circuit some logic.
> > > >
> > > > I think these 2 methods should go into the BloomFilter interface.
> > > >
> > > >
> > > > Set operations:
> > > >
> > > > jaccardSimilarity -- a measure of similarity in sets with range [0,1]
> > > >
> > > > jaccardDistance -- defined as 1 - jaccardSimilarity
> > > >
> > > > cosineSimilarity -- a measure of similarity with range  [0,1]
> > > >
> > > > cosineDistance -- defined as 1 - cosineSimilarity
> > > >
> > > > estimateSize -- Set operation estimate of the number of items that
> were
> > > > placed in the filter. (Requires Shape)
> > > >
> > > > estimateUnionSize -- Set operation estimate of the number of items
> that
> > > > would be represented by the merging of two filters. (Requires Shape)
> > > >
> > > > estimateIntersectionSize -- Set operations estimate of the number of
> > items
> > > > in the intersection. (Requires Shape)
> > > >
> > > > Perhaps it makes sense to move the Set operations into their own
> class
> > > > with static methods?  The set operations all operate on 2 Bloom
> > filters.
> > > > Moving them would clarify the AbstractBloomFilter class.
> > > >
> > > > Claude
> > > >
> > > >
> > > > On Sat, Dec 28, 2019 at 2:01 AM Gary Gregory <[hidden email]
> >
> > > > wrote:
> > > >
> > > >> On Fri, Dec 27, 2019 at 11:36 AM Claude Warren <[hidden email]>
> > wrote:
> > > >>
> > > >> > On Fri, Dec 27, 2019 at 1:02 AM Gary Gregory <
> > [hidden email]>
> > > >> > wrote:
> > > >> >
> > > >> > > Hi Claude and all:
> > > >> > >
> > > >> > > Here are a couple of comments on the bloom filter PR.
> > > >> > >
> > > >> > > 10,100 ft level comment we do not have to worry about today:
> > Before
> > > >> the
> > > >> > > release, we might want to split Commons Collection into a
> > multi-module
> > > >> > > project and have the BF code in it own module. The _main_ reason
> > for
> > > >> this
> > > >> > > is that it will allow the dependency on Commons Codecis be a
> > > >> non-optional
> > > >> > > dependency. Optional dependency are a a bit of a pain IMO,
> > especially
> > > >> > when
> > > >> > > you deal with large stacks. You end up sucking in everything
> that
> > is
> > > >> > > optional when you deliver an app because you do not know for
> > certain
> > > >> > what's
> > > >> > > really required at runtime.
> > > >> > >
> > > >> > > Closer to the ground: I would make BloomFilter an interface and
> > rename
> > > >> > > the current BloomFilter class AbstractBloomFilter implements
> > > >> BloomFilter.
> > > >> > > This will allow the user and maintainer to see what is truly
> > public.
> > > >> > >
> > > >> >
> > > >> > I have done this (not checked in) but the net effect is that all
> the
> > > >> public
> > > >> > methods in the AbstractBloomFilter class are reflected in the
> > > >> BloomFilter
> > > >> > interface and the Shape class has moved to the Interface as well.
> > > >> >
> > > >>
> > > >> OK, I think I like BloomFilter as an interface especially since it
> is
> > used
> > > >> as an argument in methods. I'll leave it to you as to whether Shape
> > needs
> > > >> to be folded in. I did notice that Shape is an argument in a few
> > places.
> > > >> Might we loose some focus and abstraction by folding Shape into the
> > > >> BloomFilter interface?
> > > >>
> > > >>
> > > >> > I can remove several methods from BloomFilter that are not
> strictly
> > > >> > necessary for this code to function.  I am and have been reticent
> > to do
> > > >> so
> > > >> > since there are a number of home-grown libraries used by various
> > > >> > researchers that provide these functions.  But if that is what it
> > takes
> > > >> to
> > > >> > get this out the door it will be done.
> > > >> >
> > > >>
> > > >> Once you have BloomFilter as an interface with an implementing
> class,
> > it
> > > >> might make it much clearer as to what really belongs in the
> > interface. The
> > > >> handy methods can stay in the abstract class of course.
> > > >>
> > > >>
> > > >> >
> > > >> > > Since this is a first cut for Commons Collection, I would
> consider
> > > >> only
> > > >> > > making APIs public that must be public. Once released, we MUST
> > > >> maintain
> > > >> > > binary compatibility within minor releases. Major releases allow
> > us to
> > > >> > > break this compatibility but these do not happen very often. I
> do
> > not
> > > >> > know
> > > >> > > what this means yet for BF but it's a thought to consider. This
> > kind
> > > >> of
> > > >> > > work is made harder due to the current packaging of the BF code.
> > > >> > >
> > > >> > > We could consider putting it all in one package if that helps
> > reduce
> > > >> the
> > > >> > > public API footprint.
> > > >> > >
> > > >> >
> > > >> > I think that putting all the pieces into a single package will
> > muddy the
> > > >> > waters a bit.  The classes in o.a.c.c.bloomfilter are high level
> > classes
> > > >> > and used wherever the Bloom filter code is used.  Objects in the
> > > >> > o.a.c.c.bloomfilter.hasher classes are Hashers and are really only
> > > >> selected
> > > >> > when the application developer determines which hasher is best
> > suited
> > > >> for
> > > >> > the environment.  Finally o.a.c.c.bloomfilter.hasher.function are
> > > >> > HasherFunction implementations.  These can be used or special
> cases
> > > >> built
> > > >> > by the implementer as necessary.  Perhaps I am being a bit too
> > pedantic
> > > >> but
> > > >> > I do not think it makes sense to merge them into a single package.
> > > >> >
> > > >>
> > > >> Let's leave it as is then.
> > > >>
> > > >> Side note: I'll need to release Commons Codec 1.14 before we can
> > bring in
> > > >> this PR. I hope to do this after Commons VFS has gone through its
> own
> > > >> release cycle (which should be done in 48 hours or so if all goes
> > well.)
> > > >> Well, we could bring in the PR but depending on a SNAPSHOT build is
> > > >> usually
> > > >> not a good idea unless it is truly short term.
> > > >>
> > > >> Thank you,
> > > >> Gary
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Q: All the code in main's
> > > >> > > org.apache.commons.collections4.bloomfilter.hasher.function is
> > only
> > > >> > called
> > > >> > > from test code. Why is that?
> > > >> > >
> > > >> >
> > > >> > They are pluggable modules.  When a developer uses them they an
> > > >> instance is
> > > >> > created and it is passed to the BloomFilter.Shape() constructor as
> > well
> > > >> as
> > > >> > to the various Hasher constructors.  For example:
> > > >> >
> > > >> >     HashFunction hashFunction = new Murmur128x86Cyclic();
> > > >> >     Shape bloomFilterConfig = new Shape( hashFunction, 3, 1.0 /
> > 100000);
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Nit: Some odd formatting like spaces after open parens:
> > 'toUpperCase(
> > > >> > > Locale.ROOT)' should be  'toUpperCase(Locale.ROOT)'
> > > >> > >
> > > >> > > Gary
> > > >> > >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > I like: Like Like - The likeliest place on the web
> > > >> > <http://like-like.xenei.com>
> > > >> > LinkedIn: http://www.linkedin.com/in/claudewarren
> > > >> >
> > > >>
> > > >
> > > >
> > > > --
> > > > I like: Like Like - The likeliest place on the web
> > > > <http://like-like.xenei.com>
> > > > LinkedIn: http://www.linkedin.com/in/claudewarren
> > > >
> > >
> > >
> > > --
> > > I like: Like Like - The likeliest place on the web
> > > <http://like-like.xenei.com>
> > > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Gilles Sadowski-2
Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a écrit :
>
> I think it is time to bring this PR in and make any adjustments within
> master beyond that. This will be quicker and simpler than going round and
> round for simple things like Javadoc tweaks and small non-functional
> changes (formatting, variable names, and so on.) I'll proceed with that
> tonight.

Design issues were raised on the ML: With no agreement and no opinions
other than Claude's and mine, things stayed where they were.

Gilles

>> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
I believe the issue (I think history is at
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600)
is about the identification of hash implementations.

Currently there are a couple of classes involved:

Hasher interface, has a method that returns a HashFunctionIdentity and a
method that returns an iterator of enabled bits.  There are a couple of
implementations of Hasher:  the DynamicHasher contains buffers that are
passed to the hash function several times, the StaticHasher contains a list
of bits enabled by a hasher for a specific Shape.

HashFunction interface: extends HashFunctionIdentity and adds a method that
calls the actual hash function.

HashFunctionIdentity: contains the name of the hash function, the name of
the provider, the processType (cyclic or iterative), Signedness and a
signature.

There are places in the code where the actual function is not required and
is some use cases would make the implementation difficult or fragile.
These code places are where the Bloom filter has been built and the system
is verifying that two filters used the same hash function.  In these cases
the comparison is the hashName, processType and Signedness.  In cases where
the bloom filters are stored in a database retrieval would mean some sort
of serialization/deserialization of the hash function or ensure that the
hash function is otherwise available.  This is problematic.

The provider was added in a nod to a future factory that would follow the
JCA pattern and allow implementations of multiple providers.

The signature was added to support a requested quick check.  The signature
is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).

There were suggestions to create an enum of HashFunctions controlled  by
the Collections.  I think that this adds a layer of coordination and
management on the Collections team that as a team we may not want to take
on.  In addition, it makes it almost impossible for 3rd party users to
create new hash functions and test them with the library.

I believe the current implementation provides the minimal information
necessary to determine if two functions are supposed to produce the same
result.  In my mind the signature and provider methods are extra and not
necessary but desirable.

I think this is a summary of the open discussion.


On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <[hidden email]> wrote:

> Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a
> écrit :
> >
> > I think it is time to bring this PR in and make any adjustments within
> > master beyond that. This will be quicker and simpler than going round and
> > round for simple things like Javadoc tweaks and small non-functional
> > changes (formatting, variable names, and so on.) I'll proceed with that
> > tonight.
>
> Design issues were raised on the ML: With no agreement and no opinions
> other than Claude's and mine, things stayed where they were.
>
> Gilles
>
> >> [...]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Bruno P. Kinoshita-3
 Thanks for the summary Claude. I read some of the messages, but got lost in the middle of the discussion, and tend to understand problems better if there's some code to read along. And I am used to GitHub/GitLab diff interface.
So I agree with Gary that this could be a good time for a PR (maybe a draft one).
Bruno


    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <[hidden email]> wrote:  
 
 I believe the issue (I think history is at
https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600)
is about the identification of hash implementations.

Currently there are a couple of classes involved:

Hasher interface, has a method that returns a HashFunctionIdentity and a
method that returns an iterator of enabled bits.  There are a couple of
implementations of Hasher:  the DynamicHasher contains buffers that are
passed to the hash function several times, the StaticHasher contains a list
of bits enabled by a hasher for a specific Shape.

HashFunction interface: extends HashFunctionIdentity and adds a method that
calls the actual hash function.

HashFunctionIdentity: contains the name of the hash function, the name of
the provider, the processType (cyclic or iterative), Signedness and a
signature.

There are places in the code where the actual function is not required and
is some use cases would make the implementation difficult or fragile.
These code places are where the Bloom filter has been built and the system
is verifying that two filters used the same hash function.  In these cases
the comparison is the hashName, processType and Signedness.  In cases where
the bloom filters are stored in a database retrieval would mean some sort
of serialization/deserialization of the hash function or ensure that the
hash function is otherwise available.  This is problematic.

The provider was added in a nod to a future factory that would follow the
JCA pattern and allow implementations of multiple providers.

The signature was added to support a requested quick check.  The signature
is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).

There were suggestions to create an enum of HashFunctions controlled  by
the Collections.  I think that this adds a layer of coordination and
management on the Collections team that as a team we may not want to take
on.  In addition, it makes it almost impossible for 3rd party users to
create new hash functions and test them with the library.

I believe the current implementation provides the minimal information
necessary to determine if two functions are supposed to produce the same
result.  In my mind the signature and provider methods are extra and not
necessary but desirable.

I think this is a summary of the open discussion.


On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <[hidden email]> wrote:

> Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a
> écrit :
> >
> > I think it is time to bring this PR in and make any adjustments within
> > master beyond that. This will be quicker and simpler than going round and
> > round for simple things like Javadoc tweaks and small non-functional
> > changes (formatting, variable names, and so on.) I'll proceed with that
> > tonight.
>
> Design issues were raised on the ML: With no agreement and no opinions
> other than Claude's and mine, things stayed where they were.
>
> Gilles
>
> >> [...]
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren 
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
There is a pull request open:
https://github.com/apache/commons-collections/pull/83

On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
<[hidden email]> wrote:

>  Thanks for the summary Claude. I read some of the messages, but got lost
> in the middle of the discussion, and tend to understand problems better if
> there's some code to read along. And I am used to GitHub/GitLab diff
> interface.
> So I agree with Gary that this could be a good time for a PR (maybe a
> draft one).
> Bruno
>
>
>     On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> [hidden email]> wrote:
>
>  I believe the issue (I think history is at
>
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> )
> is about the identification of hash implementations.
>
> Currently there are a couple of classes involved:
>
> Hasher interface, has a method that returns a HashFunctionIdentity and a
> method that returns an iterator of enabled bits.  There are a couple of
> implementations of Hasher:  the DynamicHasher contains buffers that are
> passed to the hash function several times, the StaticHasher contains a list
> of bits enabled by a hasher for a specific Shape.
>
> HashFunction interface: extends HashFunctionIdentity and adds a method that
> calls the actual hash function.
>
> HashFunctionIdentity: contains the name of the hash function, the name of
> the provider, the processType (cyclic or iterative), Signedness and a
> signature.
>
> There are places in the code where the actual function is not required and
> is some use cases would make the implementation difficult or fragile.
> These code places are where the Bloom filter has been built and the system
> is verifying that two filters used the same hash function.  In these cases
> the comparison is the hashName, processType and Signedness.  In cases where
> the bloom filters are stored in a database retrieval would mean some sort
> of serialization/deserialization of the hash function or ensure that the
> hash function is otherwise available.  This is problematic.
>
> The provider was added in a nod to a future factory that would follow the
> JCA pattern and allow implementations of multiple providers.
>
> The signature was added to support a requested quick check.  The signature
> is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
>
> There were suggestions to create an enum of HashFunctions controlled  by
> the Collections.  I think that this adds a layer of coordination and
> management on the Collections team that as a team we may not want to take
> on.  In addition, it makes it almost impossible for 3rd party users to
> create new hash functions and test them with the library.
>
> I believe the current implementation provides the minimal information
> necessary to determine if two functions are supposed to produce the same
> result.  In my mind the signature and provider methods are extra and not
> necessary but desirable.
>
> I think this is a summary of the open discussion.
>
>
> On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <[hidden email]>
> wrote:
>
> > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a
> > écrit :
> > >
> > > I think it is time to bring this PR in and make any adjustments within
> > > master beyond that. This will be quicker and simpler than going round
> and
> > > round for simple things like Javadoc tweaks and small non-functional
> > > changes (formatting, variable names, and so on.) I'll proceed with that
> > > tonight.
> >
> > Design issues were raised on the ML: With no agreement and no opinions
> > other than Claude's and mine, things stayed where they were.
> >
> > Gilles
> >
> > >> [...]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Bruno P. Kinoshita-3
 Sorry, I'd read Gary's reply as if there was no PR yet. Reviewed it a bit now, lots of tests! Will play with the code and read the comments and finish the review by the end of the week.

Thanks Claude

    On Thursday, 9 January 2020, 11:20:40 am NZDT, Claude Warren <[hidden email]> wrote:  
 
 There is a pull request open:
https://github.com/apache/commons-collections/pull/83

On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
<[hidden email]> wrote:

>  Thanks for the summary Claude. I read some of the messages, but got lost
> in the middle of the discussion, and tend to understand problems better if
> there's some code to read along. And I am used to GitHub/GitLab diff
> interface.
> So I agree with Gary that this could be a good time for a PR (maybe a
> draft one).
> Bruno
>
>
>    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> [hidden email]> wrote:
>
>  I believe the issue (I think history is at
>
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> )
> is about the identification of hash implementations.
>
> Currently there are a couple of classes involved:
>
> Hasher interface, has a method that returns a HashFunctionIdentity and a
> method that returns an iterator of enabled bits.  There are a couple of
> implementations of Hasher:  the DynamicHasher contains buffers that are
> passed to the hash function several times, the StaticHasher contains a list
> of bits enabled by a hasher for a specific Shape.
>
> HashFunction interface: extends HashFunctionIdentity and adds a method that
> calls the actual hash function.
>
> HashFunctionIdentity: contains the name of the hash function, the name of
> the provider, the processType (cyclic or iterative), Signedness and a
> signature.
>
> There are places in the code where the actual function is not required and
> is some use cases would make the implementation difficult or fragile.
> These code places are where the Bloom filter has been built and the system
> is verifying that two filters used the same hash function.  In these cases
> the comparison is the hashName, processType and Signedness.  In cases where
> the bloom filters are stored in a database retrieval would mean some sort
> of serialization/deserialization of the hash function or ensure that the
> hash function is otherwise available.  This is problematic.
>
> The provider was added in a nod to a future factory that would follow the
> JCA pattern and allow implementations of multiple providers.
>
> The signature was added to support a requested quick check.  The signature
> is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
>
> There were suggestions to create an enum of HashFunctions controlled  by
> the Collections.  I think that this adds a layer of coordination and
> management on the Collections team that as a team we may not want to take
> on.  In addition, it makes it almost impossible for 3rd party users to
> create new hash functions and test them with the library.
>
> I believe the current implementation provides the minimal information
> necessary to determine if two functions are supposed to produce the same
> result.  In my mind the signature and provider methods are extra and not
> necessary but desirable.
>
> I think this is a summary of the open discussion.
>
>
> On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <[hidden email]>
> wrote:
>
> > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a
> > écrit :
> > >
> > > I think it is time to bring this PR in and make any adjustments within
> > > master beyond that. This will be quicker and simpler than going round
> and
> > > round for simple things like Javadoc tweaks and small non-functional
> > > changes (formatting, variable names, and so on.) I'll proceed with that
> > > tonight.
> >
> > Design issues were raised on the ML: With no agreement and no opinions
> > other than Claude's and mine, things stayed where they were.
> >
> > Gilles
> >
> > >> [...]
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren 
Reply | Threaded
Open this post in threaded view
|

Re: [collections] bloom filters comments

Claude Warren
Gary suggested merging the pull request, Gilles points out there were
issues raised on the mailing list, I posted what I think was an accurate
summary.

What needs to be done to resolve this so that a final decision on the pull
request can be made?

Claude

On Thu, Jan 9, 2020 at 2:57 AM Bruno P. Kinoshita
<[hidden email]> wrote:

>  Sorry, I'd read Gary's reply as if there was no PR yet. Reviewed it a bit
> now, lots of tests! Will play with the code and read the comments and
> finish the review by the end of the week.
>
> Thanks Claude
>
>     On Thursday, 9 January 2020, 11:20:40 am NZDT, Claude Warren <
> [hidden email]> wrote:
>
>  There is a pull request open:
> https://github.com/apache/commons-collections/pull/83
>
> On Wed, Jan 8, 2020 at 9:32 PM Bruno P. Kinoshita
> <[hidden email]> wrote:
>
> >  Thanks for the summary Claude. I read some of the messages, but got lost
> > in the middle of the discussion, and tend to understand problems better
> if
> > there's some code to read along. And I am used to GitHub/GitLab diff
> > interface.
> > So I agree with Gary that this could be a good time for a PR (maybe a
> > draft one).
> > Bruno
> >
> >
> >    On Thursday, 9 January 2020, 6:32:09 am NZDT, Claude Warren <
> > [hidden email]> wrote:
> >
> >  I believe the issue (I think history is at
> >
> >
> https://issues.apache.org/jira/browse/COLLECTIONS-728?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17003600
> > )
> > is about the identification of hash implementations.
> >
> > Currently there are a couple of classes involved:
> >
> > Hasher interface, has a method that returns a HashFunctionIdentity and a
> > method that returns an iterator of enabled bits.  There are a couple of
> > implementations of Hasher:  the DynamicHasher contains buffers that are
> > passed to the hash function several times, the StaticHasher contains a
> list
> > of bits enabled by a hasher for a specific Shape.
> >
> > HashFunction interface: extends HashFunctionIdentity and adds a method
> that
> > calls the actual hash function.
> >
> > HashFunctionIdentity: contains the name of the hash function, the name of
> > the provider, the processType (cyclic or iterative), Signedness and a
> > signature.
> >
> > There are places in the code where the actual function is not required
> and
> > is some use cases would make the implementation difficult or fragile.
> > These code places are where the Bloom filter has been built and the
> system
> > is verifying that two filters used the same hash function.  In these
> cases
> > the comparison is the hashName, processType and Signedness.  In cases
> where
> > the bloom filters are stored in a database retrieval would mean some sort
> > of serialization/deserialization of the hash function or ensure that the
> > hash function is otherwise available.  This is problematic.
> >
> > The provider was added in a nod to a future factory that would follow the
> > JCA pattern and allow implementations of multiple providers.
> >
> > The signature was added to support a requested quick check.  The
> signature
> > is calculated by calling hashFunction.apply( String.format( "%s-%s-%s",
> > getName(), getSignedness(), getProcess() ).getBytes( "UTF-8" ), 0 ).
> >
> > There were suggestions to create an enum of HashFunctions controlled  by
> > the Collections.  I think that this adds a layer of coordination and
> > management on the Collections team that as a team we may not want to take
> > on.  In addition, it makes it almost impossible for 3rd party users to
> > create new hash functions and test them with the library.
> >
> > I believe the current implementation provides the minimal information
> > necessary to determine if two functions are supposed to produce the same
> > result.  In my mind the signature and provider methods are extra and not
> > necessary but desirable.
> >
> > I think this is a summary of the open discussion.
> >
> >
> > On Wed, Jan 8, 2020 at 2:32 PM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> > > Le mer. 8 janv. 2020 à 15:15, Gary Gregory <[hidden email]> a
> > > écrit :
> > > >
> > > > I think it is time to bring this PR in and make any adjustments
> within
> > > > master beyond that. This will be quicker and simpler than going round
> > and
> > > > round for simple things like Javadoc tweaks and small non-functional
> > > > changes (formatting, variable names, and so on.) I'll proceed with
> that
> > > > tonight.
> > >
> > > Design issues were raised on the ML: With no agreement and no opinions
> > > other than Claude's and mine, things stayed where they were.
> > >
> > > Gilles
> > >
> > > >> [...]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> > >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren