New Sub-project Proposal.

classic Classic list List threaded Threaded
46 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
Le jeu. 12 sept. 2019 à 17:20, Gary Gregory <[hidden email]> a écrit :
>
> Let's talk about modules after the PR comes, I only see that as needed to
> avoid bringing in dependencies for all users. IOW I would only see breaking
> up Collections into Maven modules if either the PR is giant or it depends
> on other artifacts.

I was also considering the complementary "issue": a project wanting to
use <some functionality> (i.e. the Bloom filter in this case) without a
dependency on the rest of [Collections].
[IMHO, that's how it should work in general for low-level functionalities
(i.e. for "Commons" in particular): Make modules with just the minimum
of (interdependent) classes, and make module with higher (composed)
functionality depend on more basic ones.  All the way down to the
standard library.]

Gilles

>
> Gary
>
> On Thu, Sep 12, 2019, 11:15 Claude Warren <[hidden email]> wrote:
>
> > @Gilles
> >
> > Missed your suggestion about modularity.  Can you point me to the original
> > message or paraphrase it here?
> >
> > Claude
> >
> > On Thu, Sep 12, 2019 at 11:03 AM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> > > Le jeu. 12 sept. 2019 à 10:28, Stian Soiland-Reyes <[hidden email]> a
> > > écrit :
> > > >
> > > > On Thu, 12 Sep 2019 08:06:59 +0100, Claude Warren <[hidden email]>
> > > wrote:
> > > > > Actually the code I was thinking of is the multi-filter branch.  It
> > > cleans
> > > > > up some names and simplifies a few things.  The collections and
> > storage
> > > > > packages might be best added as examples rather than as mainline
> > code.
> > > > >
> > > > > In this case we just provide the bloom filter implementation,  If we
> > > want
> > > > > to provide the container implementation then I think it should
> > > probably be
> > > > > modified to accept any SortedSet or NavigatableSet in the
> > constructor.
> > > > >
> > > > > When I return home, next week, I'll take a swipe at moving the
> > packages
> > > > > over to org.apache.commons.collections4.bloomfilter package (unless
> > > there
> > > > > is a better package name).  We can then look at the entire code
> > > donation
> > > > > and decide what changes need to be made before it is accepted.
> > > > >
> > > > > Does this sound like a reasonable approach?
> > >
> > > Any comment about my suggestion to make [Collections] modular,
> > > starting with that code ([Collections] is nearing 30k LOC...)?
> > >
> > > Gilles
> > >
> > > > [...]
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> > >
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
In reply to this post by Claude Warren
Le jeu. 12 sept. 2019 à 17:32, Claude Warren <[hidden email]> a écrit :
>
> The base code depended on commons-lang3 for building hashes.  Is this
> acceptable or should the hash generation code from lang3 be cut and pasted
> into the classes.  Not sure what the standard is in this project.

There is no "standard".
The least common denominator is "no dependency", not even towards
other components of this "Commons" project (at the cost of duplication
of code and/or API).
The only dependencies are between modules of a component, and on
the condition that all the modules evolve synchronously (same version).

IMHO, we should and could do better (than resort to cut-and-paste).
"Shading" is an alternative.

Gilles

> On Thu, Sep 12, 2019 at 4:14 PM Claude Warren <[hidden email]> wrote:
>
> > @Gilles
> >
> > Missed your suggestion about modularity.  Can you point me to the original
> > message or paraphrase it here?
> >
> > Claude
> >
> > On Thu, Sep 12, 2019 at 11:03 AM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> >> Le jeu. 12 sept. 2019 à 10:28, Stian Soiland-Reyes <[hidden email]> a
> >> écrit :
> >> >
> >> > On Thu, 12 Sep 2019 08:06:59 +0100, Claude Warren <[hidden email]>
> >> wrote:
> >> > > Actually the code I was thinking of is the multi-filter branch.  It
> >> cleans
> >> > > up some names and simplifies a few things.  The collections and
> >> storage
> >> > > packages might be best added as examples rather than as mainline code.
> >> > >
> >> > > In this case we just provide the bloom filter implementation,  If we
> >> want
> >> > > to provide the container implementation then I think it should
> >> probably be
> >> > > modified to accept any SortedSet or NavigatableSet in the constructor.
> >> > >
> >> > > When I return home, next week, I'll take a swipe at moving the
> >> packages
> >> > > over to org.apache.commons.collections4.bloomfilter package (unless
> >> there
> >> > > is a better package name).  We can then look at the entire code
> >> donation
> >> > > and decide what changes need to be made before it is accepted.
> >> > >
> >> > > Does this sound like a reasonable approach?
> >>
> >> Any comment about my suggestion to make [Collections] modular,
> >> starting with that code ([Collections] is nearing 30k LOC...)?
> >>
> >> Gilles
> >>
> >> > [...]
> >>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
In reply to this post by Claude Warren
Le jeu. 12 sept. 2019 à 17:15, Claude Warren <[hidden email]> a écrit :
>
> @Gilles
>
> Missed your suggestion about modularity.  Can you point me to the original
> message or paraphrase it here?

https://markmail.org/message/4bibv2zsibmtyrsg

Gilles

>> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

garydgregory
In reply to this post by Gilles Sadowski-2
On Thu, Sep 12, 2019 at 11:42 AM Gilles Sadowski <[hidden email]>
wrote:

> Le jeu. 12 sept. 2019 à 17:32, Claude Warren <[hidden email]> a écrit :
> >
> > The base code depended on commons-lang3 for building hashes.  Is this
> > acceptable or should the hash generation code from lang3 be cut and
> pasted
> > into the classes.  Not sure what the standard is in this project.
>
> There is no "standard".
> The least common denominator is "no dependency", not even towards
> other components of this "Commons" project (at the cost of duplication
> of code and/or API).
>

Note that plenty of Commons components depend on other Commons components,
not just for testing. The most prominent example is Commons DBCP which
relies on Commons Pool; in which case duplicating code would be madness ;-)

Some Commons components are "higher" level, like Commons Configuration and
Commons VFS, and so depend on other Commons components and other 3rd party
libraries.

It is fair to say that other components we have like Commons Lang and
perhaps Commons IO are considered "low" level and do not depend on
anything. I would expect Commons Lang to ever depend on anything else
(except for testing of course.)

Gary

The only dependencies are between modules of a component, and on

> the condition that all the modules evolve synchronously (same version).
>
> IMHO, we should and could do better (than resort to cut-and-paste).
> "Shading" is an alternative.
>
> Gilles
>
> > On Thu, Sep 12, 2019 at 4:14 PM Claude Warren <[hidden email]> wrote:
> >
> > > @Gilles
> > >
> > > Missed your suggestion about modularity.  Can you point me to the
> original
> > > message or paraphrase it here?
> > >
> > > Claude
> > >
> > > On Thu, Sep 12, 2019 at 11:03 AM Gilles Sadowski <[hidden email]
> >
> > > wrote:
> > >
> > >> Le jeu. 12 sept. 2019 à 10:28, Stian Soiland-Reyes <[hidden email]>
> a
> > >> écrit :
> > >> >
> > >> > On Thu, 12 Sep 2019 08:06:59 +0100, Claude Warren <[hidden email]
> >
> > >> wrote:
> > >> > > Actually the code I was thinking of is the multi-filter branch.
> It
> > >> cleans
> > >> > > up some names and simplifies a few things.  The collections and
> > >> storage
> > >> > > packages might be best added as examples rather than as mainline
> code.
> > >> > >
> > >> > > In this case we just provide the bloom filter implementation,  If
> we
> > >> want
> > >> > > to provide the container implementation then I think it should
> > >> probably be
> > >> > > modified to accept any SortedSet or NavigatableSet in the
> constructor.
> > >> > >
> > >> > > When I return home, next week, I'll take a swipe at moving the
> > >> packages
> > >> > > over to org.apache.commons.collections4.bloomfilter package
> (unless
> > >> there
> > >> > > is a better package name).  We can then look at the entire code
> > >> donation
> > >> > > and decide what changes need to be made before it is accepted.
> > >> > >
> > >> > > Does this sound like a reasonable approach?
> > >>
> > >> Any comment about my suggestion to make [Collections] modular,
> > >> starting with that code ([Collections] is nearing 30k LOC...)?
> > >>
> > >> Gilles
> > >>
> > >> > [...]
> > >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
Le jeu. 12 sept. 2019 à 20:28, Gary Gregory <[hidden email]> a écrit :

>
> On Thu, Sep 12, 2019 at 11:42 AM Gilles Sadowski <[hidden email]>
> wrote:
>
> > Le jeu. 12 sept. 2019 à 17:32, Claude Warren <[hidden email]> a écrit :
> > >
> > > The base code depended on commons-lang3 for building hashes.  Is this
> > > acceptable or should the hash generation code from lang3 be cut and
> > pasted
> > > into the classes.  Not sure what the standard is in this project.
> >
> > There is no "standard".
> > The least common denominator is "no dependency", not even towards
> > other components of this "Commons" project (at the cost of duplication
> > of code and/or API).
> >
>
> Note that plenty of Commons components depend on other Commons components,
> not just for testing. The most prominent example is Commons DBCP which
> relies on Commons Pool; in which case duplicating code would be madness ;-)

Good to hear.

> Some Commons components are "higher" level, like Commons Configuration and
> Commons VFS, and so depend on other Commons components and other 3rd party
> libraries.

Right. But for those, it's their purpose to provide a uniform interface to
external APIs.
What I had in mind however is the kind of reluctance that popped up,
for example, when [Text] introduced functionality related to random
numbers but would not tolerate a dependency on the API defined in
[RNG] (even though it was defined in a maven module to not incur any
unnecessary dependencies).

> It is fair to say that other components we have like Commons Lang and
> perhaps Commons IO are considered "low" level and do not depend on
> anything. I would expect Commons Lang to ever depend on anything else
> (except for testing of course.)

Sure, [Lang] should not depend on anything.  But it also should not
attempt to provide anything that would be better located in another
component.  Again, RNG functionality is a case in point.
Creating [Text] was a step in the right direction (i.s. stop the bloat).

We don't look actively into defining/refining scopes for the components,
and how to make them modular.  This should be a design requirement,
rather than an afterthought.

Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

garydgregory
On Thu, Sep 12, 2019 at 5:24 PM Gilles Sadowski <[hidden email]>
wrote:

> Le jeu. 12 sept. 2019 à 20:28, Gary Gregory <[hidden email]> a
> écrit :
> >
> > On Thu, Sep 12, 2019 at 11:42 AM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> > > Le jeu. 12 sept. 2019 à 17:32, Claude Warren <[hidden email]> a
> écrit :
> > > >
> > > > The base code depended on commons-lang3 for building hashes.  Is this
> > > > acceptable or should the hash generation code from lang3 be cut and
> > > pasted
> > > > into the classes.  Not sure what the standard is in this project.
> > >
> > > There is no "standard".
> > > The least common denominator is "no dependency", not even towards
> > > other components of this "Commons" project (at the cost of duplication
> > > of code and/or API).
> > >
> >
> > Note that plenty of Commons components depend on other Commons
> components,
> > not just for testing. The most prominent example is Commons DBCP which
> > relies on Commons Pool; in which case duplicating code would be madness
> ;-)
>
> Good to hear.
>
> > Some Commons components are "higher" level, like Commons Configuration
> and
> > Commons VFS, and so depend on other Commons components and other 3rd
> party
> > libraries.
>
> Right. But for those, it's their purpose to provide a uniform interface to
> external APIs.
> What I had in mind however is the kind of reluctance that popped up,
> for example, when [Text] introduced functionality related to random
> numbers but would not tolerate a dependency on the API defined in
> [RNG] (even though it was defined in a maven module to not incur any
> unnecessary dependencies).
>
> > It is fair to say that other components we have like Commons Lang and
> > perhaps Commons IO are considered "low" level and do not depend on
> > anything. I would expect Commons Lang to ever depend on anything else
> > (except for testing of course.)
>
> Sure, [Lang] should not depend on anything.  But it also should not
> attempt to provide anything that would be better located in another
> component.  Again, RNG functionality is a case in point.
> Creating [Text] was a step in the right direction (i.s. stop the bloat).
>
> We don't look actively into defining/refining scopes for the components,
> and how to make them modular.  This should be a design requirement,
> rather than an afterthought.
>

Gilles,

Take a look at the PR, I added comments to allow the PR to have 0 deps.

Gary


>
> Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
Hi.

> > [...]
> >
>
> Gilles,
>
> Take a look at the PR, I added comments to allow the PR to have 0 deps.

How about creating a branch on the Apache repository, so that
we can commit examples of what needs to be modified in order
to comply with the project's style and requirements?

Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
In reply to this post by garydgregory
> > [...]
>
> Gilles,
>
> Take a look at the PR, I added comments to allow the PR to have 0 deps.
>
> Gary
>
>

IMO, the package should be named "bloomfilter" (without "s").
Naming seems inconsistent in [Collections]: Some (package)
names are singular, other plural.

* Indent must be 4 spaces.
* All fields and methods must be commented.
* "FilterConfig" contains a "main" method.
* Local variables and fields must be declared "final" whenever constant.
* Some methods are declared "final", others not.
* "BloomFilter" should be made immutable.
* "BloomFilter" could do without a second constructor: "FilterConfig"
should have a method that returns a "BitSet".
* "BloomFilter" must make a defensive copy of its "bitSet" argument.
* What's the difference between "match" and "equals"?
* Why is "STORED_SIZE" public?
* "Utils" classes should be avoided.
* Hash function should be imported/shaded from "Commons Codec"
(and defined there if missing).
* Constructor(s) should be private (instantiation through factory
methods (cf. ValJO).
* "Serializable" should be avoided.
* The "update" methods are missing explanations (or reference) about
their purpose.  Also, it seems that "update" and "build" are redundant.
* Does "getLog" return a standard value?  If so, the reference should
appear in the Javadoc (not just as code comment).
* What is the purpose of method "asBitSet()"?
* (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
"ProtoBloomFilterBuilder" should be a static inner class of that
factory.

Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Claude Warren
@Gilles

I am happy to rename the package without the plural if that is the
standard, I will also fix the indent issue.  Is there a definition that can
be quickly imported into Eclipse to do the proper formatting?

I am adding/updating all comments in the code.

FilterConfig contains a main method to make it easy to assess how the
various settings influence the length and number of functions in the
filter.  This could go into another class or a Utility class but neither of
those seem like a better solution.

I agree with your assessment of final and immutable for methods, variables
and classes.

I am removing the FilterConfig constructor for BloomFilter and ensuring the
BloomFilter makes a defensive copy.

The difference between match() and equals() is that equals() is a bit
vector equality and match is the standard bloom filter comparison T&C = T.
Documentation is being added to the match() and inverseMatch() to explain
this.

STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).  It
identifies how many bytes the FilterConfig will use when written to a
DataOuptut or read from a DataInput.

I think the Utils class can be removed once the other code is moved to
Java8 streams and not extended iterators.

Hash function is changed to Objects.hash()

Constructors are/will be private where factories make sense.  For some the
simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )

I have no problem removing the Serializable on FilterConfig, but can you
explain (or point to a post) why it should be avoided?

The update/build methods are being documented and an explanation of the
difference included.

getLog() is also being documented

The asBitSet() method was to extract the bitset from the bloom filter so
that it could be stored otherwise used with BitSet based processes.  The
name of the method is being chagned to getBitSet() and it makes a defensive
copy.

The ProtoBloomFilter is not a factory.  It is really a bean and there are
times when it makes sense to use the ProtoBloomFilter.  The name is
important as it identifies what it is.  It contains the necessary
information to build any configuration of BloomFilter.  There is a proposal
to use ProtoBloomFilters in service calls.  It could be defined as an
interface with a concrete implementation in the ProtoBloomFilterBuilder.

As an example of ProtoBloomFilter usage.  Say you have a Person object that
has the user's first name, last name, eye color, hair color.

The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
update() followed by a build() to build the protoBloomFilter.
ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
).update( "Warren" ).update( "brown" ).update( "silver" ).build();

I can create a bloom filter configuration that has 1x10^7  items and
accepts 1 / 1x10^6 collisions.
FilterConfig config1 = new FilterConfig( 10000000, 1000000)

and build the BloomFilter
BloomFilter filter1 = proto.create( config1 )

I can use that filter to determine if the person is in a collection of
1x10^7 items.  The implementation of that collection could be multiple
buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To check
that I create an appropriate FilterConfig
FilterConfig config2 = new FilterConfig( 10000, 5000 )

and then create the proper bloom filter
BloomFilter filter2 = proto.create( config2 )

Claude


On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <[hidden email]>
wrote:

> > > [...]
> >
> > Gilles,
> >
> > Take a look at the PR, I added comments to allow the PR to have 0 deps.
> >
> > Gary
> >
> >
>
> IMO, the package should be named "bloomfilter" (without "s").
> Naming seems inconsistent in [Collections]: Some (package)
> names are singular, other plural.
>
> * Indent must be 4 spaces.
> * All fields and methods must be commented.
> * "FilterConfig" contains a "main" method.
> * Local variables and fields must be declared "final" whenever constant.
> * Some methods are declared "final", others not.
> * "BloomFilter" should be made immutable.
> * "BloomFilter" could do without a second constructor: "FilterConfig"
> should have a method that returns a "BitSet".
> * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> * What's the difference between "match" and "equals"?
> * Why is "STORED_SIZE" public?
> * "Utils" classes should be avoided.
> * Hash function should be imported/shaded from "Commons Codec"
> (and defined there if missing).
> * Constructor(s) should be private (instantiation through factory
> methods (cf. ValJO).
> * "Serializable" should be avoided.
> * The "update" methods are missing explanations (or reference) about
> their purpose.  Also, it seems that "update" and "build" are redundant.
> * Does "getLog" return a standard value?  If so, the reference should
> appear in the Javadoc (not just as code comment).
> * What is the purpose of method "asBitSet()"?
> * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> "ProtoBloomFilterBuilder" should be a static inner class of that
> factory.
>
> Regards,
> Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

garydgregory
On Sat, Sep 14, 2019 at 2:15 AM Claude Warren <[hidden email]> wrote:

> @Gilles
>
> I am happy to rename the package without the plural if that is the
> standard, I will also fix the indent issue.  Is there a definition that can
> be quickly imported into Eclipse to do the proper formatting?
>
> I am adding/updating all comments in the code.
>
> FilterConfig contains a main method to make it easy to assess how the
> various settings influence the length and number of functions in the
> filter.  This could go into another class or a Utility class but neither of
> those seem like a better solution.
>
> I agree with your assessment of final and immutable for methods, variables
> and classes.
>
> I am removing the FilterConfig constructor for BloomFilter and ensuring the
> BloomFilter makes a defensive copy.
>
> The difference between match() and equals() is that equals() is a bit
> vector equality and match is the standard bloom filter comparison T&C = T.
> Documentation is being added to the match() and inverseMatch() to explain
> this.
>
> STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).  It
> identifies how many bytes the FilterConfig will use when written to a
> DataOuptut or read from a DataInput.
>
> I think the Utils class can be removed once the other code is moved to
> Java8 streams and not extended iterators.
>
> Hash function is changed to Objects.hash()
>
> Constructors are/will be private where factories make sense.  For some the
> simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
>
> I have no problem removing the Serializable on FilterConfig, but can you
> explain (or point to a post) why it should be avoided?
>

From the master:
https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s

Also related:
https://opensource.googleblog.com/2017/03/operation-rosehub.html

Gary


> The update/build methods are being documented and an explanation of the
> difference included.
>
> getLog() is also being documented
>
> The asBitSet() method was to extract the bitset from the bloom filter so
> that it could be stored otherwise used with BitSet based processes.  The
> name of the method is being chagned to getBitSet() and it makes a defensive
> copy.
>
> The ProtoBloomFilter is not a factory.  It is really a bean and there are
> times when it makes sense to use the ProtoBloomFilter.  The name is
> important as it identifies what it is.  It contains the necessary
> information to build any configuration of BloomFilter.  There is a proposal
> to use ProtoBloomFilters in service calls.  It could be defined as an
> interface with a concrete implementation in the ProtoBloomFilterBuilder.
>
> As an example of ProtoBloomFilter usage.  Say you have a Person object that
> has the user's first name, last name, eye color, hair color.
>
> The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> update() followed by a build() to build the protoBloomFilter.
> ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
> ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
>
> I can create a bloom filter configuration that has 1x10^7  items and
> accepts 1 / 1x10^6 collisions.
> FilterConfig config1 = new FilterConfig( 10000000, 1000000)
>
> and build the BloomFilter
> BloomFilter filter1 = proto.create( config1 )
>
> I can use that filter to determine if the person is in a collection of
> 1x10^7 items.  The implementation of that collection could be multiple
> buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To check
> that I create an appropriate FilterConfig
> FilterConfig config2 = new FilterConfig( 10000, 5000 )
>
> and then create the proper bloom filter
> BloomFilter filter2 = proto.create( config2 )
>
> Claude
>
>
> On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <[hidden email]>
> wrote:
>
> > > > [...]
> > >
> > > Gilles,
> > >
> > > Take a look at the PR, I added comments to allow the PR to have 0 deps.
> > >
> > > Gary
> > >
> > >
> >
> > IMO, the package should be named "bloomfilter" (without "s").
> > Naming seems inconsistent in [Collections]: Some (package)
> > names are singular, other plural.
> >
> > * Indent must be 4 spaces.
> > * All fields and methods must be commented.
> > * "FilterConfig" contains a "main" method.
> > * Local variables and fields must be declared "final" whenever constant.
> > * Some methods are declared "final", others not.
> > * "BloomFilter" should be made immutable.
> > * "BloomFilter" could do without a second constructor: "FilterConfig"
> > should have a method that returns a "BitSet".
> > * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> > * What's the difference between "match" and "equals"?
> > * Why is "STORED_SIZE" public?
> > * "Utils" classes should be avoided.
> > * Hash function should be imported/shaded from "Commons Codec"
> > (and defined there if missing).
> > * Constructor(s) should be private (instantiation through factory
> > methods (cf. ValJO).
> > * "Serializable" should be avoided.
> > * The "update" methods are missing explanations (or reference) about
> > their purpose.  Also, it seems that "update" and "build" are redundant.
> > * Does "getLog" return a standard value?  If so, the reference should
> > appear in the Javadoc (not just as code comment).
> > * What is the purpose of method "asBitSet()"?
> > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > "ProtoBloomFilterBuilder" should be a static inner class of that
> > factory.
> >
> > Regards,
> > Gilles
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
In reply to this post by Claude Warren
Hi.

Le sam. 14 sept. 2019 à 08:15, Claude Warren <[hidden email]> a écrit :
>
> @Gilles
>
> I am happy to rename the package without the plural if that is the
> standard, I will also fix the indent issue.  Is there a definition that can
> be quickly imported into Eclipse to do the proper formatting?

Hopefully, someone else can answer that one.

> I am adding/updating all comments in the code.
>
> FilterConfig contains a main method to make it easy to assess how the
> various settings influence the length and number of functions in the
> filter.  This could go into another class or a Utility class but neither of
> those seem like a better solution.

I don't think there should be a "main" method in a library.
Also, just printing to stdout does not seem very flexible.
Perhaps define a "toString()" method?

> I agree with your assessment of final and immutable for methods, variables
> and classes.
>
> I am removing the FilterConfig constructor for BloomFilter and ensuring the
> BloomFilter makes a defensive copy.
>
> The difference between match() and equals() is that equals() is a bit
> vector equality and match is the standard bloom filter comparison T&C = T.
> Documentation is being added to the match() and inverseMatch() to explain
> this.
>
> STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).  It
> identifies how many bytes the FilterConfig will use when written to a
> DataOuptut or read from a DataInput.

Why is it necessary to know such internal details?
Anyways, I think that persistence is best left to higher-level code.

>
> I think the Utils class can be removed once the other code is moved to
> Java8 streams and not extended iterators.
>
> Hash function is changed to Objects.hash()
>
> Constructors are/will be private where factories make sense.  For some the
> simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )

See
    https://www.baeldung.com/java-constructors-vs-static-factory-methods
and
    https://blog.joda.org/2014/03/valjos-value-java-objects.html

> I have no problem removing the Serializable on FilterConfig, but can you
> explain (or point to a post) why it should be avoided?

Same as above (not prejudging how application want to handle serialization).

> The update/build methods are being documented and an explanation of the
> difference included.

Am I wrong that the "build(...)" methods only spare the typing of 2
characters?
If so, it's not worth the increase of the API size.

>
> getLog() is also being documented
>
> The asBitSet() method was to extract the bitset from the bloom filter so
> that it could be stored otherwise used with BitSet based processes.  The
> name of the method is being chagned to getBitSet() and it makes a defensive
> copy.
>
> The ProtoBloomFilter is not a factory.  It is really a bean and there are
> times when it makes sense to use the ProtoBloomFilter.  The name is
> important as it identifies what it is.  It contains the necessary
> information to build any configuration of BloomFilter.  There is a proposal
> to use ProtoBloomFilters in service calls.  It could be defined as an
> interface with a concrete implementation in the ProtoBloomFilterBuilder.
>
> As an example of ProtoBloomFilter usage.  Say you have a Person object that
> has the user's first name, last name, eye color, hair color.
>
> The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> update() followed by a build() to build the protoBloomFilter.
> ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
> ).update( "Warren" ).update( "brown" ).update( "silver" ).build();

I'm still missing why it cannot be a factory, but maybe it is a
question of design habits.  What about the following:

public class BloomFilter {
    private final BitSet bitSet;

    private BloomFilter(BitSet bs) {
        bitSet = (BitSet) (bs.clone());
    }

    /** Factory method. (cf. "ValJO") */
    public BloomFilter of(BitSet bs) {
        return new BloomFilter(bs);
    }

    // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
    public static class Builder {
         private final Set<Hash> hashes = new HashSet<>();

         private Builder() {}

         /** Factory method(s). */
         public Builder with(Builder other) {
              hashes.addAll(other.hashes);
         }
         public Builder with(ByteBuffer b) { /* ... */ }
         // etc.

        /** Factory (converter). */
        public BloomFilter create(FilterConfig cfg) {
             final BitSet set = new BitSet(cfg.getNumberOfBits());
             for (Hash hash : hashes) {
                  hash.populate(set, cfg);
             }
             return BloomFilter.of(set);
        }
    }
}

Perhaps I'm taking too many shortcuts... But, as pointed out by
Gary, more coverage is required; and unit tests will help figure out
the leanest design.

>
> I can create a bloom filter configuration that has 1x10^7  items and
> accepts 1 / 1x10^6 collisions.
> FilterConfig config1 = new FilterConfig( 10000000, 1000000)
>
> and build the BloomFilter
> BloomFilter filter1 = proto.create( config1 )
>
> I can use that filter to determine if the person is in a collection of
> 1x10^7 items.  The implementation of that collection could be multiple
> buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To check
> that I create an appropriate FilterConfig
> FilterConfig config2 = new FilterConfig( 10000, 5000 )
>
> and then create the proper bloom filter
> BloomFilter filter2 = proto.create( config2 )
>

With my above attempt at an alternative API, we'd have (untested):

BloomFilter.Builder proto = BloomFilter
     .with("Claude")
     .with("Warren")
     .with("brown").
     .with("silver");

BloomFilter filter1 = proto.create(config1);
BloomFilter filter2 = proto.create(config2);


IIUC, in your current implementation of "ProtoBloomFilter", there
is a side-effect of "build()" (i.e. the call to "clear()"); thus calling
"build()" twice will not yield the same result.  I don't think that's
a desirable feature.

Regards,
Gilles

> Claude
>
>
> On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <[hidden email]>
> wrote:
>
> > > > [...]
> > >
> > > Gilles,
> > >
> > > Take a look at the PR, I added comments to allow the PR to have 0 deps.
> > >
> > > Gary
> > >
> > >
> >
> > IMO, the package should be named "bloomfilter" (without "s").
> > Naming seems inconsistent in [Collections]: Some (package)
> > names are singular, other plural.
> >
> > * Indent must be 4 spaces.
> > * All fields and methods must be commented.
> > * "FilterConfig" contains a "main" method.
> > * Local variables and fields must be declared "final" whenever constant.
> > * Some methods are declared "final", others not.
> > * "BloomFilter" should be made immutable.
> > * "BloomFilter" could do without a second constructor: "FilterConfig"
> > should have a method that returns a "BitSet".
> > * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> > * What's the difference between "match" and "equals"?
> > * Why is "STORED_SIZE" public?
> > * "Utils" classes should be avoided.
> > * Hash function should be imported/shaded from "Commons Codec"
> > (and defined there if missing).
> > * Constructor(s) should be private (instantiation through factory
> > methods (cf. ValJO).
> > * "Serializable" should be avoided.
> > * The "update" methods are missing explanations (or reference) about
> > their purpose.  Also, it seems that "update" and "build" are redundant.
> > * Does "getLog" return a standard value?  If so, the reference should
> > appear in the Javadoc (not just as code comment).
> > * What is the purpose of method "asBitSet()"?
> > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > "ProtoBloomFilterBuilder" should be a static inner class of that
> > factory.
> >
> > Regards,
> > Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

sebb-2-2
On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <[hidden email]> wrote:

>
> Hi.
>
> Le sam. 14 sept. 2019 à 08:15, Claude Warren <[hidden email]> a écrit :
> >
> > @Gilles
> >
> > I am happy to rename the package without the plural if that is the
> > standard, I will also fix the indent issue.  Is there a definition that can
> > be quickly imported into Eclipse to do the proper formatting?
>
> Hopefully, someone else can answer that one.
>
> > I am adding/updating all comments in the code.
> >
> > FilterConfig contains a main method to make it easy to assess how the
> > various settings influence the length and number of functions in the
> > filter.  This could go into another class or a Utility class but neither of
> > those seem like a better solution.
>
> I don't think there should be a "main" method in a library.
> Also, just printing to stdout does not seem very flexible.
> Perhaps define a "toString()" method?
>
> > I agree with your assessment of final and immutable for methods, variables
> > and classes.
> >
> > I am removing the FilterConfig constructor for BloomFilter and ensuring the
> > BloomFilter makes a defensive copy.
> >
> > The difference between match() and equals() is that equals() is a bit
> > vector equality and match is the standard bloom filter comparison T&C = T.
> > Documentation is being added to the match() and inverseMatch() to explain
> > this.
> >
> > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).  It
> > identifies how many bytes the FilterConfig will use when written to a
> > DataOuptut or read from a DataInput.
>
> Why is it necessary to know such internal details?
> Anyways, I think that persistence is best left to higher-level code.
>
> >
> > I think the Utils class can be removed once the other code is moved to
> > Java8 streams and not extended iterators.
> >
> > Hash function is changed to Objects.hash()
> >
> > Constructors are/will be private where factories make sense.  For some the
> > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
>
> See
>     https://www.baeldung.com/java-constructors-vs-static-factory-methods
> and
>     https://blog.joda.org/2014/03/valjos-value-java-objects.html
>
> > I have no problem removing the Serializable on FilterConfig, but can you
> > explain (or point to a post) why it should be avoided?
>
> Same as above (not prejudging how application want to handle serialization).

It's difficult to code Serializable classes properly, and just as hard
to maintain them.
Also the serialised format becomes part of the API making changes harder.

We should not promise what we likely cannot continue to deliver,
especially since it may not be needed.

Serializable classes are also a target for code exploits.

> > The update/build methods are being documented and an explanation of the
> > difference included.
>
> Am I wrong that the "build(...)" methods only spare the typing of 2
> characters?
> If so, it's not worth the increase of the API size.
>
> >
> > getLog() is also being documented
> >
> > The asBitSet() method was to extract the bitset from the bloom filter so
> > that it could be stored otherwise used with BitSet based processes.  The
> > name of the method is being chagned to getBitSet() and it makes a defensive
> > copy.
> >
> > The ProtoBloomFilter is not a factory.  It is really a bean and there are
> > times when it makes sense to use the ProtoBloomFilter.  The name is
> > important as it identifies what it is.  It contains the necessary
> > information to build any configuration of BloomFilter.  There is a proposal
> > to use ProtoBloomFilters in service calls.  It could be defined as an
> > interface with a concrete implementation in the ProtoBloomFilterBuilder.
> >
> > As an example of ProtoBloomFilter usage.  Say you have a Person object that
> > has the user's first name, last name, eye color, hair color.
> >
> > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > update() followed by a build() to build the protoBloomFilter.
> > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
> > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
>
> I'm still missing why it cannot be a factory, but maybe it is a
> question of design habits.  What about the following:
>
> public class BloomFilter {
>     private final BitSet bitSet;
>
>     private BloomFilter(BitSet bs) {
>         bitSet = (BitSet) (bs.clone());
>     }
>
>     /** Factory method. (cf. "ValJO") */
>     public BloomFilter of(BitSet bs) {
>         return new BloomFilter(bs);
>     }
>
>     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
>     public static class Builder {
>          private final Set<Hash> hashes = new HashSet<>();
>
>          private Builder() {}
>
>          /** Factory method(s). */
>          public Builder with(Builder other) {
>               hashes.addAll(other.hashes);
>          }
>          public Builder with(ByteBuffer b) { /* ... */ }
>          // etc.
>
>         /** Factory (converter). */
>         public BloomFilter create(FilterConfig cfg) {
>              final BitSet set = new BitSet(cfg.getNumberOfBits());
>              for (Hash hash : hashes) {
>                   hash.populate(set, cfg);
>              }
>              return BloomFilter.of(set);
>         }
>     }
> }
>
> Perhaps I'm taking too many shortcuts... But, as pointed out by
> Gary, more coverage is required; and unit tests will help figure out
> the leanest design.
>
> >
> > I can create a bloom filter configuration that has 1x10^7  items and
> > accepts 1 / 1x10^6 collisions.
> > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> >
> > and build the BloomFilter
> > BloomFilter filter1 = proto.create( config1 )
> >
> > I can use that filter to determine if the person is in a collection of
> > 1x10^7 items.  The implementation of that collection could be multiple
> > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To check
> > that I create an appropriate FilterConfig
> > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> >
> > and then create the proper bloom filter
> > BloomFilter filter2 = proto.create( config2 )
> >
>
> With my above attempt at an alternative API, we'd have (untested):
>
> BloomFilter.Builder proto = BloomFilter
>      .with("Claude")
>      .with("Warren")
>      .with("brown").
>      .with("silver");
>
> BloomFilter filter1 = proto.create(config1);
> BloomFilter filter2 = proto.create(config2);
>
>
> IIUC, in your current implementation of "ProtoBloomFilter", there
> is a side-effect of "build()" (i.e. the call to "clear()"); thus calling
> "build()" twice will not yield the same result.  I don't think that's
> a desirable feature.
>
> Regards,
> Gilles
>
> > Claude
> >
> >
> > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> > > > > [...]
> > > >
> > > > Gilles,
> > > >
> > > > Take a look at the PR, I added comments to allow the PR to have 0 deps.
> > > >
> > > > Gary
> > > >
> > > >
> > >
> > > IMO, the package should be named "bloomfilter" (without "s").
> > > Naming seems inconsistent in [Collections]: Some (package)
> > > names are singular, other plural.
> > >
> > > * Indent must be 4 spaces.
> > > * All fields and methods must be commented.
> > > * "FilterConfig" contains a "main" method.
> > > * Local variables and fields must be declared "final" whenever constant.
> > > * Some methods are declared "final", others not.
> > > * "BloomFilter" should be made immutable.
> > > * "BloomFilter" could do without a second constructor: "FilterConfig"
> > > should have a method that returns a "BitSet".
> > > * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> > > * What's the difference between "match" and "equals"?
> > > * Why is "STORED_SIZE" public?
> > > * "Utils" classes should be avoided.
> > > * Hash function should be imported/shaded from "Commons Codec"
> > > (and defined there if missing).
> > > * Constructor(s) should be private (instantiation through factory
> > > methods (cf. ValJO).
> > > * "Serializable" should be avoided.
> > > * The "update" methods are missing explanations (or reference) about
> > > their purpose.  Also, it seems that "update" and "build" are redundant.
> > > * Does "getLog" return a standard value?  If so, the reference should
> > > appear in the Javadoc (not just as code comment).
> > > * What is the purpose of method "asBitSet()"?
> > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > factory.
> > >
> > > Regards,
> > > Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

garydgregory
On Sun, Sep 15, 2019 at 8:17 PM sebb <[hidden email]> wrote:

> On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <[hidden email]>
> wrote:
> >
> > Hi.
> >
> > Le sam. 14 sept. 2019 à 08:15, Claude Warren <[hidden email]> a écrit
> :
> > >
> > > @Gilles
> > >
> > > I am happy to rename the package without the plural if that is the
> > > standard, I will also fix the indent issue.  Is there a definition
> that can
> > > be quickly imported into Eclipse to do the proper formatting?
> >
> > Hopefully, someone else can answer that one.
> >
> > > I am adding/updating all comments in the code.
> > >
> > > FilterConfig contains a main method to make it easy to assess how the
> > > various settings influence the length and number of functions in the
> > > filter.  This could go into another class or a Utility class but
> neither of
> > > those seem like a better solution.
> >
> > I don't think there should be a "main" method in a library.
> > Also, just printing to stdout does not seem very flexible.
> > Perhaps define a "toString()" method?
> >
> > > I agree with your assessment of final and immutable for methods,
> variables
> > > and classes.
> > >
> > > I am removing the FilterConfig constructor for BloomFilter and
> ensuring the
> > > BloomFilter makes a defensive copy.
> > >
> > > The difference between match() and equals() is that equals() is a bit
> > > vector equality and match is the standard bloom filter comparison T&C
> = T.
> > > Documentation is being added to the match() and inverseMatch() to
> explain
> > > this.
> > >
> > > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).
> It
> > > identifies how many bytes the FilterConfig will use when written to a
> > > DataOuptut or read from a DataInput.
> >
> > Why is it necessary to know such internal details?
> > Anyways, I think that persistence is best left to higher-level code.
> >
> > >
> > > I think the Utils class can be removed once the other code is moved to
> > > Java8 streams and not extended iterators.
> > >
> > > Hash function is changed to Objects.hash()
> > >
> > > Constructors are/will be private where factories make sense.  For some
> the
> > > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
> >
> > See
> >     https://www.baeldung.com/java-constructors-vs-static-factory-methods
> > and
> >     https://blog.joda.org/2014/03/valjos-value-java-objects.html
> >
> > > I have no problem removing the Serializable on FilterConfig, but can
> you
> > > explain (or point to a post) why it should be avoided?
> >
> > Same as above (not prejudging how application want to handle
> serialization).
>
> It's difficult to code Serializable classes properly, and just as hard
> to maintain them.
> Also the serialised format becomes part of the API making changes harder.
>
> We should not promise what we likely cannot continue to deliver,
> especially since it may not be needed.
>
> Serializable classes are also a target for code exploits.
>

+1 to removing Serializable support in its current implementation. If
serialization is a requirement, then it should be implemented using a
Serialization Proxy a la J. Block:

https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s

Gary


> > > The update/build methods are being documented and an explanation of the
> > > difference included.
> >
> > Am I wrong that the "build(...)" methods only spare the typing of 2
> > characters?
> > If so, it's not worth the increase of the API size.
> >
> > >
> > > getLog() is also being documented
> > >
> > > The asBitSet() method was to extract the bitset from the bloom filter
> so
> > > that it could be stored otherwise used with BitSet based processes.
> The
> > > name of the method is being chagned to getBitSet() and it makes a
> defensive
> > > copy.
> > >
> > > The ProtoBloomFilter is not a factory.  It is really a bean and there
> are
> > > times when it makes sense to use the ProtoBloomFilter.  The name is
> > > important as it identifies what it is.  It contains the necessary
> > > information to build any configuration of BloomFilter.  There is a
> proposal
> > > to use ProtoBloomFilters in service calls.  It could be defined as an
> > > interface with a concrete implementation in the
> ProtoBloomFilterBuilder.
> > >
> > > As an example of ProtoBloomFilter usage.  Say you have a Person object
> that
> > > has the user's first name, last name, eye color, hair color.
> > >
> > > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > > update() followed by a build() to build the protoBloomFilter.
> > > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update( "Claude"
> > > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
> >
> > I'm still missing why it cannot be a factory, but maybe it is a
> > question of design habits.  What about the following:
> >
> > public class BloomFilter {
> >     private final BitSet bitSet;
> >
> >     private BloomFilter(BitSet bs) {
> >         bitSet = (BitSet) (bs.clone());
> >     }
> >
> >     /** Factory method. (cf. "ValJO") */
> >     public BloomFilter of(BitSet bs) {
> >         return new BloomFilter(bs);
> >     }
> >
> >     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
> >     public static class Builder {
> >          private final Set<Hash> hashes = new HashSet<>();
> >
> >          private Builder() {}
> >
> >          /** Factory method(s). */
> >          public Builder with(Builder other) {
> >               hashes.addAll(other.hashes);
> >          }
> >          public Builder with(ByteBuffer b) { /* ... */ }
> >          // etc.
> >
> >         /** Factory (converter). */
> >         public BloomFilter create(FilterConfig cfg) {
> >              final BitSet set = new BitSet(cfg.getNumberOfBits());
> >              for (Hash hash : hashes) {
> >                   hash.populate(set, cfg);
> >              }
> >              return BloomFilter.of(set);
> >         }
> >     }
> > }
> >
> > Perhaps I'm taking too many shortcuts... But, as pointed out by
> > Gary, more coverage is required; and unit tests will help figure out
> > the leanest design.
> >
> > >
> > > I can create a bloom filter configuration that has 1x10^7  items and
> > > accepts 1 / 1x10^6 collisions.
> > > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> > >
> > > and build the BloomFilter
> > > BloomFilter filter1 = proto.create( config1 )
> > >
> > > I can use that filter to determine if the person is in a collection of
> > > 1x10^7 items.  The implementation of that collection could be multiple
> > > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To
> check
> > > that I create an appropriate FilterConfig
> > > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> > >
> > > and then create the proper bloom filter
> > > BloomFilter filter2 = proto.create( config2 )
> > >
> >
> > With my above attempt at an alternative API, we'd have (untested):
> >
> > BloomFilter.Builder proto = BloomFilter
> >      .with("Claude")
> >      .with("Warren")
> >      .with("brown").
> >      .with("silver");
> >
> > BloomFilter filter1 = proto.create(config1);
> > BloomFilter filter2 = proto.create(config2);
> >
> >
> > IIUC, in your current implementation of "ProtoBloomFilter", there
> > is a side-effect of "build()" (i.e. the call to "clear()"); thus calling
> > "build()" twice will not yield the same result.  I don't think that's
> > a desirable feature.
> >
> > Regards,
> > Gilles
> >
> > > Claude
> > >
> > >
> > > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <[hidden email]>
> > > wrote:
> > >
> > > > > > [...]
> > > > >
> > > > > Gilles,
> > > > >
> > > > > Take a look at the PR, I added comments to allow the PR to have 0
> deps.
> > > > >
> > > > > Gary
> > > > >
> > > > >
> > > >
> > > > IMO, the package should be named "bloomfilter" (without "s").
> > > > Naming seems inconsistent in [Collections]: Some (package)
> > > > names are singular, other plural.
> > > >
> > > > * Indent must be 4 spaces.
> > > > * All fields and methods must be commented.
> > > > * "FilterConfig" contains a "main" method.
> > > > * Local variables and fields must be declared "final" whenever
> constant.
> > > > * Some methods are declared "final", others not.
> > > > * "BloomFilter" should be made immutable.
> > > > * "BloomFilter" could do without a second constructor: "FilterConfig"
> > > > should have a method that returns a "BitSet".
> > > > * "BloomFilter" must make a defensive copy of its "bitSet" argument.
> > > > * What's the difference between "match" and "equals"?
> > > > * Why is "STORED_SIZE" public?
> > > > * "Utils" classes should be avoided.
> > > > * Hash function should be imported/shaded from "Commons Codec"
> > > > (and defined there if missing).
> > > > * Constructor(s) should be private (instantiation through factory
> > > > methods (cf. ValJO).
> > > > * "Serializable" should be avoided.
> > > > * The "update" methods are missing explanations (or reference) about
> > > > their purpose.  Also, it seems that "update" and "build" are
> redundant.
> > > > * Does "getLog" return a standard value?  If so, the reference should
> > > > appear in the Javadoc (not just as code comment).
> > > > * What is the purpose of method "asBitSet()"?
> > > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > > factory.
> > > >
> > > > Regards,
> > > > Gilles
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Claude Warren
I am refactoring some of the code to make the Builder and Hash classes
enclosed in the ProtoBloomFilter class.  I have also add the proxy
serializer as noted in the effective java talk.  This simplifies the
classes significantly.  I will have unit tests and fix the javadocs before
the next push.

Claude

On Mon, Sep 16, 2019 at 2:01 AM Gary Gregory <[hidden email]> wrote:

> On Sun, Sep 15, 2019 at 8:17 PM sebb <[hidden email]> wrote:
>
> > On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <[hidden email]>
> > wrote:
> > >
> > > Hi.
> > >
> > > Le sam. 14 sept. 2019 à 08:15, Claude Warren <[hidden email]> a
> écrit
> > :
> > > >
> > > > @Gilles
> > > >
> > > > I am happy to rename the package without the plural if that is the
> > > > standard, I will also fix the indent issue.  Is there a definition
> > that can
> > > > be quickly imported into Eclipse to do the proper formatting?
> > >
> > > Hopefully, someone else can answer that one.
> > >
> > > > I am adding/updating all comments in the code.
> > > >
> > > > FilterConfig contains a main method to make it easy to assess how the
> > > > various settings influence the length and number of functions in the
> > > > filter.  This could go into another class or a Utility class but
> > neither of
> > > > those seem like a better solution.
> > >
> > > I don't think there should be a "main" method in a library.
> > > Also, just printing to stdout does not seem very flexible.
> > > Perhaps define a "toString()" method?
> > >
> > > > I agree with your assessment of final and immutable for methods,
> > variables
> > > > and classes.
> > > >
> > > > I am removing the FilterConfig constructor for BloomFilter and
> > ensuring the
> > > > BloomFilter makes a defensive copy.
> > > >
> > > > The difference between match() and equals() is that equals() is a bit
> > > > vector equality and match is the standard bloom filter comparison T&C
> > = T.
> > > > Documentation is being added to the match() and inverseMatch() to
> > explain
> > > > this.
> > > >
> > > > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).
> > It
> > > > identifies how many bytes the FilterConfig will use when written to a
> > > > DataOuptut or read from a DataInput.
> > >
> > > Why is it necessary to know such internal details?
> > > Anyways, I think that persistence is best left to higher-level code.
> > >
> > > >
> > > > I think the Utils class can be removed once the other code is moved
> to
> > > > Java8 streams and not extended iterators.
> > > >
> > > > Hash function is changed to Objects.hash()
> > > >
> > > > Constructors are/will be private where factories make sense.  For
> some
> > the
> > > > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
> > >
> > > See
> > >
> https://www.baeldung.com/java-constructors-vs-static-factory-methods
> > > and
> > >     https://blog.joda.org/2014/03/valjos-value-java-objects.html
> > >
> > > > I have no problem removing the Serializable on FilterConfig, but can
> > you
> > > > explain (or point to a post) why it should be avoided?
> > >
> > > Same as above (not prejudging how application want to handle
> > serialization).
> >
> > It's difficult to code Serializable classes properly, and just as hard
> > to maintain them.
> > Also the serialised format becomes part of the API making changes harder.
> >
> > We should not promise what we likely cannot continue to deliver,
> > especially since it may not be needed.
> >
> > Serializable classes are also a target for code exploits.
> >
>
> +1 to removing Serializable support in its current implementation. If
> serialization is a requirement, then it should be implemented using a
> Serialization Proxy a la J. Block:
>
> https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s
>
> Gary
>
>
> > > > The update/build methods are being documented and an explanation of
> the
> > > > difference included.
> > >
> > > Am I wrong that the "build(...)" methods only spare the typing of 2
> > > characters?
> > > If so, it's not worth the increase of the API size.
> > >
> > > >
> > > > getLog() is also being documented
> > > >
> > > > The asBitSet() method was to extract the bitset from the bloom filter
> > so
> > > > that it could be stored otherwise used with BitSet based processes.
> > The
> > > > name of the method is being chagned to getBitSet() and it makes a
> > defensive
> > > > copy.
> > > >
> > > > The ProtoBloomFilter is not a factory.  It is really a bean and there
> > are
> > > > times when it makes sense to use the ProtoBloomFilter.  The name is
> > > > important as it identifies what it is.  It contains the necessary
> > > > information to build any configuration of BloomFilter.  There is a
> > proposal
> > > > to use ProtoBloomFilters in service calls.  It could be defined as an
> > > > interface with a concrete implementation in the
> > ProtoBloomFilterBuilder.
> > > >
> > > > As an example of ProtoBloomFilter usage.  Say you have a Person
> object
> > that
> > > > has the user's first name, last name, eye color, hair color.
> > > >
> > > > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > > > update() followed by a build() to build the protoBloomFilter.
> > > > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update(
> "Claude"
> > > > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
> > >
> > > I'm still missing why it cannot be a factory, but maybe it is a
> > > question of design habits.  What about the following:
> > >
> > > public class BloomFilter {
> > >     private final BitSet bitSet;
> > >
> > >     private BloomFilter(BitSet bs) {
> > >         bitSet = (BitSet) (bs.clone());
> > >     }
> > >
> > >     /** Factory method. (cf. "ValJO") */
> > >     public BloomFilter of(BitSet bs) {
> > >         return new BloomFilter(bs);
> > >     }
> > >
> > >     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
> > >     public static class Builder {
> > >          private final Set<Hash> hashes = new HashSet<>();
> > >
> > >          private Builder() {}
> > >
> > >          /** Factory method(s). */
> > >          public Builder with(Builder other) {
> > >               hashes.addAll(other.hashes);
> > >          }
> > >          public Builder with(ByteBuffer b) { /* ... */ }
> > >          // etc.
> > >
> > >         /** Factory (converter). */
> > >         public BloomFilter create(FilterConfig cfg) {
> > >              final BitSet set = new BitSet(cfg.getNumberOfBits());
> > >              for (Hash hash : hashes) {
> > >                   hash.populate(set, cfg);
> > >              }
> > >              return BloomFilter.of(set);
> > >         }
> > >     }
> > > }
> > >
> > > Perhaps I'm taking too many shortcuts... But, as pointed out by
> > > Gary, more coverage is required; and unit tests will help figure out
> > > the leanest design.
> > >
> > > >
> > > > I can create a bloom filter configuration that has 1x10^7  items and
> > > > accepts 1 / 1x10^6 collisions.
> > > > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> > > >
> > > > and build the BloomFilter
> > > > BloomFilter filter1 = proto.create( config1 )
> > > >
> > > > I can use that filter to determine if the person is in a collection
> of
> > > > 1x10^7 items.  The implementation of that collection could be
> multiple
> > > > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To
> > check
> > > > that I create an appropriate FilterConfig
> > > > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> > > >
> > > > and then create the proper bloom filter
> > > > BloomFilter filter2 = proto.create( config2 )
> > > >
> > >
> > > With my above attempt at an alternative API, we'd have (untested):
> > >
> > > BloomFilter.Builder proto = BloomFilter
> > >      .with("Claude")
> > >      .with("Warren")
> > >      .with("brown").
> > >      .with("silver");
> > >
> > > BloomFilter filter1 = proto.create(config1);
> > > BloomFilter filter2 = proto.create(config2);
> > >
> > >
> > > IIUC, in your current implementation of "ProtoBloomFilter", there
> > > is a side-effect of "build()" (i.e. the call to "clear()"); thus
> calling
> > > "build()" twice will not yield the same result.  I don't think that's
> > > a desirable feature.
> > >
> > > Regards,
> > > Gilles
> > >
> > > > Claude
> > > >
> > > >
> > > > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <
> [hidden email]>
> > > > wrote:
> > > >
> > > > > > > [...]
> > > > > >
> > > > > > Gilles,
> > > > > >
> > > > > > Take a look at the PR, I added comments to allow the PR to have 0
> > deps.
> > > > > >
> > > > > > Gary
> > > > > >
> > > > > >
> > > > >
> > > > > IMO, the package should be named "bloomfilter" (without "s").
> > > > > Naming seems inconsistent in [Collections]: Some (package)
> > > > > names are singular, other plural.
> > > > >
> > > > > * Indent must be 4 spaces.
> > > > > * All fields and methods must be commented.
> > > > > * "FilterConfig" contains a "main" method.
> > > > > * Local variables and fields must be declared "final" whenever
> > constant.
> > > > > * Some methods are declared "final", others not.
> > > > > * "BloomFilter" should be made immutable.
> > > > > * "BloomFilter" could do without a second constructor:
> "FilterConfig"
> > > > > should have a method that returns a "BitSet".
> > > > > * "BloomFilter" must make a defensive copy of its "bitSet"
> argument.
> > > > > * What's the difference between "match" and "equals"?
> > > > > * Why is "STORED_SIZE" public?
> > > > > * "Utils" classes should be avoided.
> > > > > * Hash function should be imported/shaded from "Commons Codec"
> > > > > (and defined there if missing).
> > > > > * Constructor(s) should be private (instantiation through factory
> > > > > methods (cf. ValJO).
> > > > > * "Serializable" should be avoided.
> > > > > * The "update" methods are missing explanations (or reference)
> about
> > > > > their purpose.  Also, it seems that "update" and "build" are
> > redundant.
> > > > > * Does "getLog" return a standard value?  If so, the reference
> should
> > > > > appear in the Javadoc (not just as code comment).
> > > > > * What is the purpose of method "asBitSet()"?
> > > > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > > > factory.
> > > > >
> > > > > Regards,
> > > > > Gilles
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [hidden email]
> > For additional commands, e-mail: [hidden email]
> >
> >
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Matt Sicker
Due to numerous security flaws in Java serialization, if you do use
it, make sure to use serialization proxies and make them as simple as
possible.

On Sun, 15 Sep 2019 at 20:32, Claude Warren <[hidden email]> wrote:

>
> I am refactoring some of the code to make the Builder and Hash classes
> enclosed in the ProtoBloomFilter class.  I have also add the proxy
> serializer as noted in the effective java talk.  This simplifies the
> classes significantly.  I will have unit tests and fix the javadocs before
> the next push.
>
> Claude
>
> On Mon, Sep 16, 2019 at 2:01 AM Gary Gregory <[hidden email]> wrote:
>
> > On Sun, Sep 15, 2019 at 8:17 PM sebb <[hidden email]> wrote:
> >
> > > On Mon, 16 Sep 2019 at 00:17, Gilles Sadowski <[hidden email]>
> > > wrote:
> > > >
> > > > Hi.
> > > >
> > > > Le sam. 14 sept. 2019 à 08:15, Claude Warren <[hidden email]> a
> > écrit
> > > :
> > > > >
> > > > > @Gilles
> > > > >
> > > > > I am happy to rename the package without the plural if that is the
> > > > > standard, I will also fix the indent issue.  Is there a definition
> > > that can
> > > > > be quickly imported into Eclipse to do the proper formatting?
> > > >
> > > > Hopefully, someone else can answer that one.
> > > >
> > > > > I am adding/updating all comments in the code.
> > > > >
> > > > > FilterConfig contains a main method to make it easy to assess how the
> > > > > various settings influence the length and number of functions in the
> > > > > filter.  This could go into another class or a Utility class but
> > > neither of
> > > > > those seem like a better solution.
> > > >
> > > > I don't think there should be a "main" method in a library.
> > > > Also, just printing to stdout does not seem very flexible.
> > > > Perhaps define a "toString()" method?
> > > >
> > > > > I agree with your assessment of final and immutable for methods,
> > > variables
> > > > > and classes.
> > > > >
> > > > > I am removing the FilterConfig constructor for BloomFilter and
> > > ensuring the
> > > > > BloomFilter makes a defensive copy.
> > > > >
> > > > > The difference between match() and equals() is that equals() is a bit
> > > > > vector equality and match is the standard bloom filter comparison T&C
> > > = T.
> > > > > Documentation is being added to the match() and inverseMatch() to
> > > explain
> > > > > this.
> > > > >
> > > > > STORED_SIZE is similar to Long.BYTES (perhaps it should be renamed).
> > > It
> > > > > identifies how many bytes the FilterConfig will use when written to a
> > > > > DataOuptut or read from a DataInput.
> > > >
> > > > Why is it necessary to know such internal details?
> > > > Anyways, I think that persistence is best left to higher-level code.
> > > >
> > > > >
> > > > > I think the Utils class can be removed once the other code is moved
> > to
> > > > > Java8 streams and not extended iterators.
> > > > >
> > > > > Hash function is changed to Objects.hash()
> > > > >
> > > > > Constructors are/will be private where factories make sense.  For
> > some
> > > the
> > > > > simple constructor makes more sense.  (e.g. BloomFilter(BitSet) )
> > > >
> > > > See
> > > >
> > https://www.baeldung.com/java-constructors-vs-static-factory-methods
> > > > and
> > > >     https://blog.joda.org/2014/03/valjos-value-java-objects.html
> > > >
> > > > > I have no problem removing the Serializable on FilterConfig, but can
> > > you
> > > > > explain (or point to a post) why it should be avoided?
> > > >
> > > > Same as above (not prejudging how application want to handle
> > > serialization).
> > >
> > > It's difficult to code Serializable classes properly, and just as hard
> > > to maintain them.
> > > Also the serialised format becomes part of the API making changes harder.
> > >
> > > We should not promise what we likely cannot continue to deliver,
> > > especially since it may not be needed.
> > >
> > > Serializable classes are also a target for code exploits.
> > >
> >
> > +1 to removing Serializable support in its current implementation. If
> > serialization is a requirement, then it should be implemented using a
> > Serialization Proxy a la J. Block:
> >
> > https://www.youtube.com/watch?v=V1vQf4qyMXg&feature=youtu.be&t=56m12s
> >
> > Gary
> >
> >
> > > > > The update/build methods are being documented and an explanation of
> > the
> > > > > difference included.
> > > >
> > > > Am I wrong that the "build(...)" methods only spare the typing of 2
> > > > characters?
> > > > If so, it's not worth the increase of the API size.
> > > >
> > > > >
> > > > > getLog() is also being documented
> > > > >
> > > > > The asBitSet() method was to extract the bitset from the bloom filter
> > > so
> > > > > that it could be stored otherwise used with BitSet based processes.
> > > The
> > > > > name of the method is being chagned to getBitSet() and it makes a
> > > defensive
> > > > > copy.
> > > > >
> > > > > The ProtoBloomFilter is not a factory.  It is really a bean and there
> > > are
> > > > > times when it makes sense to use the ProtoBloomFilter.  The name is
> > > > > important as it identifies what it is.  It contains the necessary
> > > > > information to build any configuration of BloomFilter.  There is a
> > > proposal
> > > > > to use ProtoBloomFilters in service calls.  It could be defined as an
> > > > > interface with a concrete implementation in the
> > > ProtoBloomFilterBuilder.
> > > > >
> > > > > As an example of ProtoBloomFilter usage.  Say you have a Person
> > object
> > > that
> > > > > has the user's first name, last name, eye color, hair color.
> > > > >
> > > > > The ProtoBloomFilterBuilder would accept the 4 values as 4 calls to
> > > > > update() followed by a build() to build the protoBloomFilter.
> > > > > ProtoBloomFilter proto = ne wProtoBloomFilterBuilder().update(
> > "Claude"
> > > > > ).update( "Warren" ).update( "brown" ).update( "silver" ).build();
> > > >
> > > > I'm still missing why it cannot be a factory, but maybe it is a
> > > > question of design habits.  What about the following:
> > > >
> > > > public class BloomFilter {
> > > >     private final BitSet bitSet;
> > > >
> > > >     private BloomFilter(BitSet bs) {
> > > >         bitSet = (BitSet) (bs.clone());
> > > >     }
> > > >
> > > >     /** Factory method. (cf. "ValJO") */
> > > >     public BloomFilter of(BitSet bs) {
> > > >         return new BloomFilter(bs);
> > > >     }
> > > >
> > > >     // Combined "ProtoBloomFilter" and "ProtoBloomFilterBuilder" logic.
> > > >     public static class Builder {
> > > >          private final Set<Hash> hashes = new HashSet<>();
> > > >
> > > >          private Builder() {}
> > > >
> > > >          /** Factory method(s). */
> > > >          public Builder with(Builder other) {
> > > >               hashes.addAll(other.hashes);
> > > >          }
> > > >          public Builder with(ByteBuffer b) { /* ... */ }
> > > >          // etc.
> > > >
> > > >         /** Factory (converter). */
> > > >         public BloomFilter create(FilterConfig cfg) {
> > > >              final BitSet set = new BitSet(cfg.getNumberOfBits());
> > > >              for (Hash hash : hashes) {
> > > >                   hash.populate(set, cfg);
> > > >              }
> > > >              return BloomFilter.of(set);
> > > >         }
> > > >     }
> > > > }
> > > >
> > > > Perhaps I'm taking too many shortcuts... But, as pointed out by
> > > > Gary, more coverage is required; and unit tests will help figure out
> > > > the leanest design.
> > > >
> > > > >
> > > > > I can create a bloom filter configuration that has 1x10^7  items and
> > > > > accepts 1 / 1x10^6 collisions.
> > > > > FilterConfig config1 = new FilterConfig( 10000000, 1000000)
> > > > >
> > > > > and build the BloomFilter
> > > > > BloomFilter filter1 = proto.create( config1 )
> > > > >
> > > > > I can use that filter to determine if the person is in a collection
> > of
> > > > > 1x10^7 items.  The implementation of that collection could be
> > multiple
> > > > > buckets of  1x10^4 items with a collision rate of 5x10^3 items.  To
> > > check
> > > > > that I create an appropriate FilterConfig
> > > > > FilterConfig config2 = new FilterConfig( 10000, 5000 )
> > > > >
> > > > > and then create the proper bloom filter
> > > > > BloomFilter filter2 = proto.create( config2 )
> > > > >
> > > >
> > > > With my above attempt at an alternative API, we'd have (untested):
> > > >
> > > > BloomFilter.Builder proto = BloomFilter
> > > >      .with("Claude")
> > > >      .with("Warren")
> > > >      .with("brown").
> > > >      .with("silver");
> > > >
> > > > BloomFilter filter1 = proto.create(config1);
> > > > BloomFilter filter2 = proto.create(config2);
> > > >
> > > >
> > > > IIUC, in your current implementation of "ProtoBloomFilter", there
> > > > is a side-effect of "build()" (i.e. the call to "clear()"); thus
> > calling
> > > > "build()" twice will not yield the same result.  I don't think that's
> > > > a desirable feature.
> > > >
> > > > Regards,
> > > > Gilles
> > > >
> > > > > Claude
> > > > >
> > > > >
> > > > > On Fri, Sep 13, 2019 at 2:22 PM Gilles Sadowski <
> > [hidden email]>
> > > > > wrote:
> > > > >
> > > > > > > > [...]
> > > > > > >
> > > > > > > Gilles,
> > > > > > >
> > > > > > > Take a look at the PR, I added comments to allow the PR to have 0
> > > deps.
> > > > > > >
> > > > > > > Gary
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > IMO, the package should be named "bloomfilter" (without "s").
> > > > > > Naming seems inconsistent in [Collections]: Some (package)
> > > > > > names are singular, other plural.
> > > > > >
> > > > > > * Indent must be 4 spaces.
> > > > > > * All fields and methods must be commented.
> > > > > > * "FilterConfig" contains a "main" method.
> > > > > > * Local variables and fields must be declared "final" whenever
> > > constant.
> > > > > > * Some methods are declared "final", others not.
> > > > > > * "BloomFilter" should be made immutable.
> > > > > > * "BloomFilter" could do without a second constructor:
> > "FilterConfig"
> > > > > > should have a method that returns a "BitSet".
> > > > > > * "BloomFilter" must make a defensive copy of its "bitSet"
> > argument.
> > > > > > * What's the difference between "match" and "equals"?
> > > > > > * Why is "STORED_SIZE" public?
> > > > > > * "Utils" classes should be avoided.
> > > > > > * Hash function should be imported/shaded from "Commons Codec"
> > > > > > (and defined there if missing).
> > > > > > * Constructor(s) should be private (instantiation through factory
> > > > > > methods (cf. ValJO).
> > > > > > * "Serializable" should be avoided.
> > > > > > * The "update" methods are missing explanations (or reference)
> > about
> > > > > > their purpose.  Also, it seems that "update" and "build" are
> > > redundant.
> > > > > > * Does "getLog" return a standard value?  If so, the reference
> > should
> > > > > > appear in the Javadoc (not just as code comment).
> > > > > > * What is the purpose of method "asBitSet()"?
> > > > > > * (IIUC) "ProtoBloomFilter" should be renamed "BloomFilterFactory".
> > > > > > "ProtoBloomFilterBuilder" should be a static inner class of that
> > > > > > factory.
> > > > > >
> > > > > > Regards,
> > > > > > Gilles
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [hidden email]
> > > > For additional commands, e-mail: [hidden email]
> > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [hidden email]
> > > For additional commands, e-mail: [hidden email]
> > >
> > >
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



--
Matt Sicker <[hidden email]>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
Hello.

Here are a few comment from a quick browse of today's update
of PR #83.

* "package private for testing" is not a good reason (IMO)
* There are spurious blank spaces in some of the file ("git diff"
shows them in red)
* You should always perform "git rebase master"
* In "Builder":
** Field "hashes" should be "final"
** I still fail to see the necessity to duplicate "with" functionality
with "build" methods
* Why should so many classes be "Serializable"?
* "StandardBloomFilter" is documented as "immutable" but it is not:
There are 2 non "final" fields and 1 "final" but "protected" (hence
it can be modified outside the class without ensuring its invariants)
* IMHO use of transient fields is detrimental to the robustness of the
design: It seems that "hamming weight" is not _used_ by the
implementations (called only in the unit tests?)
* Ditto for "logValue"
* Did you check whether "murmur 128 hash" is available in "Commons
Codec"?
* Name of classes should not be abbreviated (cf. "Config", "Proto", "Stats" ...)
* Package name is still plural: Should be singular IMO
* Rule checks general suppression should be removed in the config
files (CheckStyle, FindBugs) and replace by specific (within code)
statements
if applicable (e.g. the disabling of "no default in switch" is unnecessary
since AFAICT a "default" is always provided).
* All (also "private") fields and methods must be commented with Javadoc,
not just code comment.
* There should be a reference to what is a Bloom filter
* Occurrences of "bloom" should be replaced with "Bloom"
* There some style issues:
** There should be no space after an opening parenthesis, or before
a closing one
** Curly braces position: opening on the same line, closing on a
separate line

Best regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Claude Warren
For the style issues.... is there an Eclipse style package that meets the
commons style or some other tool that will correctly configure the format
and style options in Eclipse?

On Mon, Sep 23, 2019 at 10:54 AM Gilles Sadowski <[hidden email]>
wrote:

> Hello.
>
> Here are a few comment from a quick browse of today's update
> of PR #83.
>
> * "package private for testing" is not a good reason (IMO)
> * There are spurious blank spaces in some of the file ("git diff"
> shows them in red)
> * You should always perform "git rebase master"
> * In "Builder":
> ** Field "hashes" should be "final"
> ** I still fail to see the necessity to duplicate "with" functionality
> with "build" methods
> * Why should so many classes be "Serializable"?
> * "StandardBloomFilter" is documented as "immutable" but it is not:
> There are 2 non "final" fields and 1 "final" but "protected" (hence
> it can be modified outside the class without ensuring its invariants)
> * IMHO use of transient fields is detrimental to the robustness of the
> design: It seems that "hamming weight" is not _used_ by the
> implementations (called only in the unit tests?)
> * Ditto for "logValue"
> * Did you check whether "murmur 128 hash" is available in "Commons
> Codec"?
> * Name of classes should not be abbreviated (cf. "Config", "Proto",
> "Stats" ...)
> * Package name is still plural: Should be singular IMO
> * Rule checks general suppression should be removed in the config
> files (CheckStyle, FindBugs) and replace by specific (within code)
> statements
> if applicable (e.g. the disabling of "no default in switch" is unnecessary
> since AFAICT a "default" is always provided).
> * All (also "private") fields and methods must be commented with Javadoc,
> not just code comment.
> * There should be a reference to what is a Bloom filter
> * Occurrences of "bloom" should be replaced with "Bloom"
> * There some style issues:
> ** There should be no space after an opening parenthesis, or before
> a closing one
> ** Curly braces position: opening on the same line, closing on a
> separate line
>
> Best regards,
> Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Claude Warren
I will rework to remove the package private and other access issues noted.

In Builder there is a difference between with() and build().  This follows
the pattern established by MessageDigest[1] where it is possible to build a
digest in one call or by adding multiple items and then calling digest.
The differences here are digest has an update() method the builder uses
with(), digest uses a digest() method to return the result builder uses
build().

The serializable classes are classes that in the indexing code are sent
across the wire or stored.  I suppose there are other ways to do this and
the serialization could be removed.  Not a direction I would like to take
but is doable.

HammingWeight and Log values re used in indexing.  I think HammingWeight is
used in the collections as well.  HammingWeight is a well known property of
a bloom filter.  The Log not so much so.  I suppose the log could be
removed and created as a separate library method that would execute against
BitSets.  Moving the calculation outside of the Bloom filter means that we
will have to make copies of bloom filters to perform the log and the log
will not be retained with the filter when used in indexing.

Murmer 128 hash is not in Commons Codec (at least I could not find it).
Perhaps it should be added?  I could extract the code and provide it there
if that is the appropriate solution.

ProtoBloomFilter is not an abbreviation of PrototypeBloomFilter as it is
not a Bloom filter but a primitive object from which Bloom filters may be
created.  I think this is a proper use of the Proto[2] prefix.

Package name plural.  I thought there was some disagreement on this,
however I will change it.

I will revert the FindBugs config change as it is now commented out
anyway.  How does one suppress the fall through checkstyle error.  I could
find no annotations that would suppress the error and did not want to add
another dependency to the build.  I did find that other components in the
project are using the checkstyle filter and so added the supression of the
FallThrough check.  I

The reference to what is a Bloom filter is in
bloomfilters/package-info.java.  Does there need to be a reference from the
BloomFilter interface definition to that?

Thank you for taking the time to review and comment.

Claude

[1]
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/security/MessageDigest.html
[2] https://www.dictionary.com/browse/proto?s=t

On Mon, Sep 23, 2019 at 11:13 AM Claude Warren <[hidden email]> wrote:

> For the style issues.... is there an Eclipse style package that meets the
> commons style or some other tool that will correctly configure the format
> and style options in Eclipse?
>
> On Mon, Sep 23, 2019 at 10:54 AM Gilles Sadowski <[hidden email]>
> wrote:
>
>> Hello.
>>
>> Here are a few comment from a quick browse of today's update
>> of PR #83.
>>
>> * "package private for testing" is not a good reason (IMO)
>> * There are spurious blank spaces in some of the file ("git diff"
>> shows them in red)
>> * You should always perform "git rebase master"
>> * In "Builder":
>> ** Field "hashes" should be "final"
>> ** I still fail to see the necessity to duplicate "with" functionality
>> with "build" methods
>> * Why should so many classes be "Serializable"?
>> * "StandardBloomFilter" is documented as "immutable" but it is not:
>> There are 2 non "final" fields and 1 "final" but "protected" (hence
>> it can be modified outside the class without ensuring its invariants)
>> * IMHO use of transient fields is detrimental to the robustness of the
>> design: It seems that "hamming weight" is not _used_ by the
>> implementations (called only in the unit tests?)
>> * Ditto for "logValue"
>> * Did you check whether "murmur 128 hash" is available in "Commons
>> Codec"?
>> * Name of classes should not be abbreviated (cf. "Config", "Proto",
>> "Stats" ...)
>> * Package name is still plural: Should be singular IMO
>> * Rule checks general suppression should be removed in the config
>> files (CheckStyle, FindBugs) and replace by specific (within code)
>> statements
>> if applicable (e.g. the disabling of "no default in switch" is unnecessary
>> since AFAICT a "default" is always provided).
>> * All (also "private") fields and methods must be commented with Javadoc,
>> not just code comment.
>> * There should be a reference to what is a Bloom filter
>> * Occurrences of "bloom" should be replaced with "Bloom"
>> * There some style issues:
>> ** There should be no space after an opening parenthesis, or before
>> a closing one
>> ** Curly braces position: opening on the same line, closing on a
>> separate line
>>
>> Best regards,
>> Gilles
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
>>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


--
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Alex Herbert
In reply to this post by Claude Warren
On 23/09/2019 11:13, Claude Warren wrote:
> For the style issues.... is there an Eclipse style package that meets the
> commons style or some other tool that will correctly configure the format
> and style options in Eclipse?

The Commons style across most projects is loosely based on the Java
coding style, although there is no definite single style. Checkstyle is
used to make sure the code is consistent in the project so you should
aim to pass those checks. Just run:

mvn checkstyle:checkstyle

You can then look at the report in target/site/checkstyle.html

You can even get a plugin for Eclipse for checkstyle [1]. It can be
setup to work with the project checkstyle config and suppression file.
This will highlight problems for you directly in the Eclipse editor.

However it is useful to let a formatter do some work for you. Personally
I rarely use the Eclipse formatter for Commons and not on a whole file.
It is useful to highlight sections of code and then run the formatter on
those. It will help speed up reformatting. E.g. if you have lots of
loops and if statements with braces on the 'wrong' line.

I attached a modified version of the built-in Java coding style from
eclipse that I use for the Commons RNG project.

This one is based on 120 characters per line and a few other project
specific styles. You may have to tweak the settings to match what the
project checkstyle configuration will allow. For example line wrapping
with expressions will either have the operator on the same line or the
next line.

I do not trust the formatter to run as fire-and-forget. I would locally
stash/stage files in git, run the formatter then do a diff to see what
it has done. You can do this with the git tools within Eclipse (e.g.
Compare with index).

Note that Eclipse has poor options for formatting HTML in comments. For
example it compresses list elements <LI> into one large line whereas it
is more human readable to have each on a single line. So make sure to
check any javadoc with large blocks containing HTML tags. Or do not
format them and only locally highlight code for formatting.

If you find that this example is nowhere close to the
commons-collections checkstyle rules then just throw it away and start
from the default java codestyle in Eclipse. Configuring the rules is
pretty easy.

Alex

[1] https://checkstyle.org/eclipse-cs/#!/



---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

eclipse-format-java-modifed.xml (33K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: New Sub-project Proposal.

Gilles Sadowski-2
In reply to this post by Claude Warren
Hi.

Le lun. 23 sept. 2019 à 12:59, Claude Warren <[hidden email]> a écrit :
>
> I will rework to remove the package private and other access issues noted.
>
> In Builder there is a difference between with() and build().  This follows
> the pattern established by MessageDigest[1] where it is possible to build a
> digest in one call or by adding multiple items and then calling digest.
> The differences here are digest has an update() method the builder uses
> with(), digest uses a digest() method to return the result builder uses
> build().

I noted the difference but still think it better to leave out the all
the "build" methods except the argument-less "build()".  IMHO,
saving the one call to the latter is not worth the duplication.

> The serializable classes are classes that in the indexing code are sent
> across the wire or stored.

At first sight, I'd say that serialization is out-of-scope (we
should let application developers deal with that using the
available accessors).

> I suppose there are other ways to do this and
> the serialization could be removed.  Not a direction I would like to take
> but is doable.
>
> HammingWeight and Log values re used in indexing.  I think HammingWeight is
> used in the collections as well.  HammingWeight is a well known property of
> a bloom filter.

Why is it not computed unconditionally at instantiation?

>  The Log not so much so.  I suppose the log could be
> removed and created as a separate library method that would execute against
> BitSets.  Moving the calculation outside of the Bloom filter means that we
> will have to make copies of bloom filters to perform the log and the log
> will not be retained with the filter when used in indexing.

In both cases, a method could return (but not store) the values; up
to the application developers to manage them according to their
needs (?).

>
> Murmer 128 hash is not in Commons Codec (at least I could not find it).
> Perhaps it should be added?

It seems so.  Better ask in another post (with [Codec] as the "Subject:"
prefix).

>  I could extract the code and provide it there
> if that is the appropriate solution.
>
> ProtoBloomFilter is not an abbreviation of PrototypeBloomFilter as it is
> not a Bloom filter but a primitive object from which Bloom filters may be
> created.  I think this is a proper use of the Proto[2] prefix.
>
> Package name plural.  I thought there was some disagreement on this,
> however I will change it.
>
> I will revert the FindBugs config change as it is now commented out
> anyway.  How does one suppress the fall through checkstyle error.  I could
> find no annotations that would suppress the error and did not want to add
> another dependency to the build.  I did find that other components in the
> project are using the checkstyle filter and so added the supression of the
> FallThrough check.

I think that the comment
---CUT---
// fallthru
---CUT---
should do the trick.

>
> The reference to what is a Bloom filter is in
> bloomfilters/package-info.java.

OK, then.  Sorry for the false positive. ;-)

>  Does there need to be a reference from the
> BloomFilter interface definition to that?
>
> Thank you for taking the time to review and comment.

Thank you for your contribution,
Gilles

>
> Claude
>
> [1]
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/security/MessageDigest.html
> [2] https://www.dictionary.com/browse/proto?s=t
>
> On Mon, Sep 23, 2019 at 11:13 AM Claude Warren <[hidden email]> wrote:
>
> > For the style issues.... is there an Eclipse style package that meets the
> > commons style or some other tool that will correctly configure the format
> > and style options in Eclipse?
> >
> > On Mon, Sep 23, 2019 at 10:54 AM Gilles Sadowski <[hidden email]>
> > wrote:
> >
> >> Hello.
> >>
> >> Here are a few comment from a quick browse of today's update
> >> of PR #83.
> >>
> >> * "package private for testing" is not a good reason (IMO)
> >> * There are spurious blank spaces in some of the file ("git diff"
> >> shows them in red)
> >> * You should always perform "git rebase master"
> >> * In "Builder":
> >> ** Field "hashes" should be "final"
> >> ** I still fail to see the necessity to duplicate "with" functionality
> >> with "build" methods
> >> * Why should so many classes be "Serializable"?
> >> * "StandardBloomFilter" is documented as "immutable" but it is not:
> >> There are 2 non "final" fields and 1 "final" but "protected" (hence
> >> it can be modified outside the class without ensuring its invariants)
> >> * IMHO use of transient fields is detrimental to the robustness of the
> >> design: It seems that "hamming weight" is not _used_ by the
> >> implementations (called only in the unit tests?)
> >> * Ditto for "logValue"
> >> * Did you check whether "murmur 128 hash" is available in "Commons
> >> Codec"?
> >> * Name of classes should not be abbreviated (cf. "Config", "Proto",
> >> "Stats" ...)
> >> * Package name is still plural: Should be singular IMO
> >> * Rule checks general suppression should be removed in the config
> >> files (CheckStyle, FindBugs) and replace by specific (within code)
> >> statements
> >> if applicable (e.g. the disabling of "no default in switch" is unnecessary
> >> since AFAICT a "default" is always provided).
> >> * All (also "private") fields and methods must be commented with Javadoc,
> >> not just code comment.
> >> * There should be a reference to what is a Bloom filter
> >> * Occurrences of "bloom" should be replaced with "Bloom"
> >> * There some style issues:
> >> ** There should be no space after an opening parenthesis, or before
> >> a closing one
> >> ** Curly braces position: opening on the same line, closing on a
> >> separate line
> >>
> >> Best regards,
> >> Gilles
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: [hidden email]
> >> For additional commands, e-mail: [hidden email]
> >>
> >>
> >
> > --
> > I like: Like Like - The likeliest place on the web
> > <http://like-like.xenei.com>
> > LinkedIn: http://www.linkedin.com/in/claudewarren
> >
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

123