Re: [math] Discuss: New feature MiniBatchKMeansClusterer

classic Classic list List threaded Threaded
27 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
[Re-sending post so that it is attached to the original thread.]

Hello.

Le mar. 18 févr. 2020 à 04:49, 陈 涛 <[hidden email]> a écrit :
>
> Hi Gilles:
>
>    I really do not know if anyone received my last mail, no one replay me for a long time so I send it again and copy to you with another email system.

Sorry for the delay. :-}

>
> > Some remarks:
>
> > * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
> > does not call the existing "KMeansPlusPlusClusterer".
> > Code seems duplicated: As there is a case for reuse, the currently
> > "private" centroid initialization code should be factored out.
>
> This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm,

I guess that you mean that we discuss your implementation of the
algorithm referenced in the Javadoc.

> and when "centroidOf" is extracted from "KMeansPlusPlusClusterer",
>
> the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but "KMeansClusterer",

I don't follow.

>
> this is a significant change, so I did not reactor it.

Significant changes are welcome (since the next release will contain
other major changes anyways) if they improve the code base (like e.g.
reducing code duplication).

>
>
>
> > * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
> > that a computation is performed (as opposed to selecting from an
> > existing data structure).

I think I'd prefer "selectCentroids".

>
> It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be "centroidOf"?

I don't understand.

It would be easier if you create a pull request, so that we can clearly see
what codes are added/removed/changed.

>
> The subclass "RandomCentroidInitializer" and "KMeansPlusPlusCentroidInitializer" indicate the algorithm used.
>
>
> > * Not convinced that there should be so many constructors (in most
> > cases, it makes no sense to pick default values that are likely to
> > be heavily application-dependent.
>
> I can add more constructors.

No, the constructors with default values clutter the API, for
no obvious gain IMHO.
[If the default values make sense, they must be documented.]

>
> I'd like a builder class more than constructors, but does not meet the historical code style.

Now is the time for improving the API.
It would be quite helpful to create a report on JIRA with "sub-tasks"
for all such API proposed changes.

> > * Input should generally be validated: e.g. the maximum number of
> > iterations should not be changed unwittingly; rather, an exception
> > should be raised if the user passed a negative value.
>
> Thanks for your advices, I will improve these.
>
> > Could be nice to illustrate (not just with a picture, but in a table
> > with entries average over several runs) the differences in result
> > between the implementations, using various setups (number of
> > clusters, stopping criterion, etc.).
>
> I will make more tests, include benchmarks.
>
> It is a challenge for me to generate the various kinds of test data,
>
> could anybody supply me the test data of this comparsion: http://commons.apache.org/proper/commons-math/userguide/ml.html

They are generated programmatically; code is here:
    https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide

[I've just updated the codes so that they compiles and run using
the new dependencies (see the "README" file).]

> > "MT_64" is probably not the best default.  And this is one of the
> > parameters from which there should not be a default IMO.
>
> I will do more tests

You don't need to test the generators; users should choose
by themselves (from those provided in "Commons RNG").

>
> > [Note: there are spurious characters in your message (see e.g. the
> > paragraph quoted just above) that make it difficult to read.]
>
> I had well format my mail in my mail box, it may been changed by the Mail List service.
>
> I will try various kinds of mail editor. It will helpful if you told me which mail editor is work well with the ML.

It's probably an encoding thing (setting it to "UTF-8" should be
fine).

Best regards,
Gilles

> > [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi Gilles:
  I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use Unicode" for this mail.
  I have created a pull request: https://github.com/apache/commons-math/pull/118
  And a comparsion between KMeans and MiniBatchKMeans by using the "org.apache.commons.math4.userguide.ClusterAlgorithmComparison":





------------------ Original ------------------
From:  "Gilles Sadowski";<[hidden email]>;
Date:  Feb 19, 2020
To:  "[hidden email]"<[hidden email]>;

Subject:  Re: [math] Discuss: New feature MiniBatchKMeansClusterer



[Re-sending post so that it is attached to the original thread.]

Hello.

Le mar. 18 févr. 2020 à 04:49, 陈 涛 <[hidden email]> a écrit :
>
> Hi Gilles:
>
>    I really do not know if anyone received my last mail, no one replay me for a long time so I send it again and copy to you with another email system.

Sorry for the delay. :-}

>
> > Some remarks:
>
> > * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
> > does not call the existing "KMeansPlusPlusClusterer".
> > Code seems duplicated: As there is a case for reuse, the currently
> > "private" centroid initialization code should be factored out.
>
> This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm,

I guess that you mean that we discuss your implementation of the
algorithm referenced in the Javadoc.

> and when "centroidOf" is extracted from "KMeansPlusPlusClusterer",
>
> the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but "KMeansClusterer",

I don't follow.

>
> this is a significant change, so I did not reactor it.

Significant changes are welcome (since the next release will contain
other major changes anyways) if they improve the code base (like e.g.
reducing code duplication).

>
>
>
> > * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
> > that a computation is performed (as opposed to selecting from an
> > existing data structure).

I think I'd prefer "selectCentroids".

>
> It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be "centroidOf"?

I don't understand.

It would be easier if you create a pull request, so that we can clearly see
what codes are added/removed/changed.

>
> The subclass "RandomCentroidInitializer" and "KMeansPlusPlusCentroidInitializer" indicate the algorithm used.
>
>
> > * Not convinced that there should be so many constructors (in most
> > cases, it makes no sense to pick default values that are likely to
> > be heavily application-dependent.
>
> I can add more constructors.

No, the constructors with default values clutter the API, for
no obvious gain IMHO.
[If the default values make sense, they must be documented.]

>
> I'd like a builder class more than constructors, but does not meet the historical code style.

Now is the time for improving the API.
It would be quite helpful to create a report on JIRA with "sub-tasks"
for all such API proposed changes.

> > * Input should generally be validated: e.g. the maximum number of
> > iterations should not be changed unwittingly; rather, an exception
> > should be raised if the user passed a negative value.
>
> Thanks for your advices, I will improve these.
>
> > Could be nice to illustrate (not just with a picture, but in a table
> > with entries average over several runs) the differences in result
> > between the implementations, using various setups (number of
> > clusters, stopping criterion, etc.).
>
> I will make more tests, include benchmarks.
>
> It is a challenge for me to generate the various kinds of test data,
>
> could anybody supply me the test data of this comparsion: http://commons.apache.org/proper/commons-math/userguide/ml.html

They are generated programmatically; code is here:
    https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide

[I've just updated the codes so that they compiles and run using
the new dependencies (see the "README" file).]

> > "MT_64" is probably not the best default.  And this is one of the
> > parameters from which there should not be a default IMO.
>
> I will do more tests

You don't need to test the generators; users should choose
by themselves (from those provided in "Commons RNG").

>
> > [Note: there are spurious characters in your message (see e.g. the
> > paragraph quoted just above) that make it difficult to read.]
>
> I had well format my mail in my mail box, it may been changed by the Mail List service.
>
> I will try various kinds of mail editor. It will helpful if you told me which mail editor is work well with the ML.

It's probably an encoding thing (setting it to "UTF-8" should be
fine).

Best regards,
Gilles

> > [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hi.

Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]> a écrit :
>
> Hi Gilles:
>   I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use Unicode" for this mail.

The contents seems fine.  However, there is something wrong: replying to your
message sends to you instead of the mailing list...

>   I have created a pull request: https://github.com/apache/commons-math/pull/118

That PR fails the build:
   https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&utm_medium=notification

Also, the code doesn't take into account all the remarks which I
made in previous comments (e.g. file JIRA reports to allow tracking
of changes to existing code and not mix those with new code).

In addition, you should avoid mixing massive cosmetic changes (e.g.
Javadoc reformatting) within a commit than contains other types of
change:
    https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf

>   And a comparsion between KMeans and MiniBatchKMeans by using the "org.apache.commons.math4.userguide.ClusterAlgorithmComparison":

Functionality seems to work fine, but ability to review incremental
changes is extremely important in a collaborative project.

Thanks,
Gilles


>
> Hello.
>
> Le mar. 18 févr. 2020 à 04:49, 陈 涛 <[hidden email]> a écrit :
> >
> > Hi Gilles:
> >
> >    I really do not know if anyone received my last mail, no one replay me for a long time so I send it again and copy to you with another email system.
>
> Sorry for the delay. :-}
>
> >
> > > Some remarks:
> >
> > > * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
> > > does not call the existing "KMeansPlusPlusClusterer".
> > > Code seems duplicated: As there is a case for reuse, the currently
> > > "private" centroid initialization code should be factored out.
> >
> > This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm,
>
> I guess that you mean that we discuss your implementation of the
> algorithm referenced in the Javadoc.
>
> > and when "centroidOf" is extracted from "KMeansPlusPlusClusterer",
> >
> > the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but "KMeansClusterer",
>
> I don't follow.
>
> >
> > this is a significant change, so I did not reactor it.
>
> Significant changes are welcome (since the next release will contain
> other major changes anyways) if they improve the code base (like e.g.
> reducing code duplication).
>
> >
> >
> >
> > > * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
> > > that a computation is performed (as opposed to selecting from an
> > > existing data structure).
>
> I think I'd prefer "selectCentroids".
>
> >
> > It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be "centroidOf"?
>
> I don't understand.
>
> It would be easier if you create a pull request, so that we can clearly see
> what codes are added/removed/changed.
>
> >
> > The subclass "RandomCentroidInitializer" and "KMeansPlusPlusCentroidInitializer" indicate the algorithm used.
> >
> >
> > > * Not convinced that there should be so many constructors (in most
> > > cases, it makes no sense to pick default values that are likely to
> > > be heavily application-dependent.
> >
> > I can add more constructors.
>
> No, the constructors with default values clutter the API, for
> no obvious gain IMHO.
> [If the default values make sense, they must be documented.]
>
> >
> > I'd like a builder class more than constructors, but does not meet the historical code style.
>
> Now is the time for improving the API.
> It would be quite helpful to create a report on JIRA with "sub-tasks"
> for all such API proposed changes.
>
> > > * Input should generally be validated: e.g. the maximum number of
> > > iterations should not be changed unwittingly; rather, an exception
> > > should be raised if the user passed a negative value.
> >
> > Thanks for your advices, I will improve these.
> >
> > > Could be nice to illustrate (not just with a picture, but in a table
> > > with entries average over several runs) the differences in result
> > > between the implementations, using various setups (number of
> > > clusters, stopping criterion, etc.).
> >
> > I will make more tests, include benchmarks.
> >
> > It is a challenge for me to generate the various kinds of test data,
> >
> > could anybody supply me the test data of this comparsion: http://commons.apache.org/proper/commons-math/userguide/ml.html
>
> They are generated programmatically; code is here:
>     https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide
>
> [I've just updated the codes so that they compiles and run using
> the new dependencies (see the "README" file).]
>
> > > "MT_64" is probably not the best default.  And this is one of the
> > > parameters from which there should not be a default IMO.
> >
> > I will do more tests
>
> You don't need to test the generators; users should choose
> by themselves (from those provided in "Commons RNG").
>
> >
> > > [Note: there are spurious characters in your message (see e.g. the
> > > paragraph quoted just above) that make it difficult to read.]
> >
> > I had well format my mail in my mail box, it may been changed by the Mail List service.
> >
> > I will try various kinds of mail editor. It will helpful if you told me which mail editor is work well with the ML.
>
> It's probably an encoding thing (setting it to "UTF-8" should be
> fine).
>
> Best regards,
> Gilles
>
> > > [...]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi Gilles:
&nbsp; Sorry for my unfamiliar in contribution. I started a new PR for most of your suggestion:
https://github.com/apache/commons-math/pull/120
I remain have one question below:


------------------&nbsp;Original&nbsp;------------------
From:&nbsp;"Gilles Sadowski"<[hidden email]&gt;;
Date:&nbsp;Mon, Feb 24, 2020 09:52 PM
To:&nbsp;"dev"<[hidden email]&gt;;

Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer



&gt;Hi.
&gt;
&gt;Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]&gt; a écrit :
&gt;&gt;
&gt;&gt; Hi Gilles:
&gt;&gt;&nbsp;&nbsp; I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use &gt;Unicode" for this mail.
&gt;
&gt;The contents seems fine.&nbsp; However, there is something wrong: replying to your
&gt;message sends to you instead of the mailing list...
&gt;
&gt;&gt;&nbsp;&nbsp; I have created a pull request: https://github.com/apache/commons-math/pull/118
&gt;
&gt;That PR fails the build:
&gt;&nbsp; &nbsp;https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&amp;utm_medium=notification
&gt;
&gt;Also, the code doesn't take into account all the remarks which I
&gt;made in previous comments (e.g. file JIRA reports to allow tracking
&gt;of changes to existing code and not mix those with new code).
&gt;


I did not get what kinds of problem should tracking by JIRA.
In my opinion all the change is related to my PR.&nbsp;
The JIRA issue about Feature MiniBatch has been closed.
Should I fire a new issue?

&gt;In addition, you should avoid mixing massive cosmetic changes (e.g.
&gt;Javadoc reformatting) within a commit than contains other types of
&gt;change:
&gt;&nbsp; &nbsp; https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf
&gt;
&gt;&gt;&nbsp;&nbsp; And a comparsion between KMeans and MiniBatchKMeans by using the &gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
&gt;
&gt;Functionality seems to work fine, but ability to review incremental
&gt;changes is extremely important in a collaborative project.
&gt;


Thanks for your&nbsp;remind, and I created a new PR that improve the problems one by one.

&gt;Thanks,
&gt;Gilles

Thank you,
Chentao

&gt;
&gt; Hello.
&gt;
&gt; Le mar. 18 févr. 2020 à 04:49, 陈 涛 <[hidden email]&gt; a écrit :
&gt; &gt;
&gt; &gt; Hi Gilles:
&gt; &gt;
&gt; &gt;&nbsp;&nbsp;&nbsp; I really do not know if anyone received my last mail, no one replay me for a long time so I send it again and copy to you with another email system.
&gt;
&gt; Sorry for the delay. :-}
&gt;
&gt; &gt;
&gt; &gt; &gt; Some remarks:
&gt; &gt;
&gt; &gt; &gt; * I didn't get why the "KMeansPlusPlusCentroidInitializer" class
&gt; &gt; &gt; does not call the existing "KMeansPlusPlusClusterer".
&gt; &gt; &gt; Code seems duplicated: As there is a case for reuse, the currently
&gt; &gt; &gt; "private" centroid initialization code should be factored out.
&gt; &gt;
&gt; &gt; This is alpha version for discuss the "MiniBatchKMeansClusterer" algorithm,
&gt;
&gt; I guess that you mean that we discuss your implementation of the
&gt; algorithm referenced in the Javadoc.
&gt;
&gt; &gt; and when "centroidOf" is extracted from "KMeansPlusPlusClusterer",
&gt; &gt;
&gt; &gt; the "KMeansPlusPlusClusterer" is not "KMeansPlusPlusClusterer" anymore but "KMeansClusterer",
&gt;
&gt; I don't follow.
&gt;
&gt; &gt;
&gt; &gt; this is a significant change, so I did not reactor it.
&gt;
&gt; Significant changes are welcome (since the next release will contain
&gt; other major changes anyways) if they improve the code base (like e.g.
&gt; reducing code duplication).
&gt;
&gt; &gt;
&gt; &gt;
&gt; &gt;
&gt; &gt; &gt; * In "CentroidInitializer", I'd rename "chooseCentroids" to emphasize
&gt; &gt; &gt; that a computation is performed (as opposed to selecting from an
&gt; &gt; &gt; existing data structure).
&gt;
&gt; I think I'd prefer "selectCentroids".
&gt;
&gt; &gt;
&gt; &gt; It is extract from "KMeansPlusPlusClusterer.centroidOf", should remain be "centroidOf"?
&gt;
&gt; I don't understand.
&gt;
&gt; It would be easier if you create a pull request, so that we can clearly see
&gt; what codes are added/removed/changed.
&gt;
&gt; &gt;
&gt; &gt; The subclass "RandomCentroidInitializer" and "KMeansPlusPlusCentroidInitializer" indicate the algorithm used.
&gt; &gt;
&gt; &gt;
&gt; &gt; &gt; * Not convinced that there should be so many constructors (in most
&gt; &gt; &gt; cases, it makes no sense to pick default values that are likely to
&gt; &gt; &gt; be heavily application-dependent.
&gt; &gt;
&gt; &gt; I can add more constructors.
&gt;
&gt; No, the constructors with default values clutter the API, for
&gt; no obvious gain IMHO.
&gt; [If the default values make sense, they must be documented.]
&gt;
&gt; &gt;
&gt; &gt; I'd like a builder class more than constructors, but does not meet the historical code style.
&gt;
&gt; Now is the time for improving the API.
&gt; It would be quite helpful to create a report on JIRA with "sub-tasks"
&gt; for all such API proposed changes.
&gt;
&gt; &gt; &gt; * Input should generally be validated: e.g. the maximum number of
&gt; &gt; &gt; iterations should not be changed unwittingly; rather, an exception
&gt; &gt; &gt; should be raised if the user passed a negative value.
&gt; &gt;
&gt; &gt; Thanks for your advices, I will improve these.
&gt; &gt;
&gt; &gt; &gt; Could be nice to illustrate (not just with a picture, but in a table
&gt; &gt; &gt; with entries average over several runs) the differences in result
&gt; &gt; &gt; between the implementations, using various setups (number of
&gt; &gt; &gt; clusters, stopping criterion, etc.).
&gt; &gt;
&gt; &gt; I will make more tests, include benchmarks.
&gt; &gt;
&gt; &gt; It is a challenge for me to generate the various kinds of test data,
&gt; &gt;
&gt; &gt; could anybody supply me the test data of this comparsion: http://commons.apache.org/proper/commons-math/userguide/ml.html
&gt;
&gt; They are generated programmatically; code is here:
&gt;&nbsp;&nbsp;&nbsp;&nbsp; https://gitbox.apache.org/repos/asf?p=commons-math.git;a=tree;f=src/userguide/java/org/apache/commons/math4/userguide
&gt;
&gt; [I've just updated the codes so that they compiles and run using
&gt; the new dependencies (see the "README" file).]
&gt;
&gt; &gt; &gt; "MT_64" is probably not the best default.&nbsp; And this is one of the
&gt; &gt; &gt; parameters from which there should not be a default IMO.
&gt; &gt;
&gt; &gt; I will do more tests
&gt;
&gt; You don't need to test the generators; users should choose
&gt; by themselves (from those provided in "Commons RNG").
&gt;
&gt; &gt;
&gt; &gt; &gt; [Note: there are spurious characters in your message (see e.g. the
&gt; &gt; &gt; paragraph quoted just above) that make it difficult to read.]
&gt; &gt;
&gt; &gt; I had well format my mail in my mail box, it may been changed by the Mail List service.
&gt; &gt;
&gt; &gt; I will try various kinds of mail editor. It will helpful if you told me which mail editor is work well with the ML.
&gt;
&gt; It's probably an encoding thing (setting it to "UTF-8" should be
&gt; fine).
&gt;
&gt; Best regards,
&gt; Gilles
&gt;
&gt; &gt; &gt; [...]
&gt;

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

[Side question: Do you send a copy of your posts directly to me?
If so, that is not necessary and is even annoying because when I
hit "reply" in my mail client, the conversation continues off-list...]

Le mar. 25 févr. 2020 à 14:53, CT <[hidden email]> a écrit :
>
> Hi Gilles:
>   Sorry for my unfamiliar in contribution. I started a new PR for most of your suggestion:
> https://github.com/apache/commons-math/pull/120

Sorry for seemingly nit-picking but the global issue is the same
as with PR #118: It contains too many unrelated changes.
There should be *one* PR for each batch of significant changes.
More precisely, for this work, that would entail (a.o. things), one
PR for each of the following:
 * Class "ClusterUtils" (design yet to be discussed)
 * Factoring out whatever code is necessary for your proposed
feature (that would otherwise be duplicated)

We try and avoid bloating the API, hence the changes which
I've suggested:
 * remove the constructors with default parameters

Overall, each PR should probably contain a single commit, in
order to ease review.

> I remain have one question below:
>
> ------------------ Original ------------------
> From: "Gilles Sadowski"<[hidden email]>;
> Date: Mon, Feb 24, 2020 09:52 PM
> To: "dev"<[hidden email]>;
> Subject: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
>
> >Hi.
> >
> >Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]> a écrit :
> >>
> >> Hi Gilles:
> >>   I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use >Unicode" for this mail.
> >
> >The contents seems fine.  However, there is something wrong: replying to your
> >message sends to you instead of the mailing list...
> >
> >>   I have created a pull request: https://github.com/apache/commons-math/pull/118
> >
> >That PR fails the build:
> >   https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&utm_medium=notification
> >
> >Also, the code doesn't take into account all the remarks which I
> >made in previous comments (e.g. file JIRA reports to allow tracking
> >of changes to existing code and not mix those with new code).
> >
>
> I did not get what kinds of problem should tracking by JIRA.
> In my opinion all the change is related to my PR.

Yes, they are related but they could be provided in more focused
PRs as explained above.

> The JIRA issue about Feature MiniBatch has been closed.

It is not closed:
   https://issues.apache.org/jira/projects/MATH/issues/MATH-1509

> Should I fire a new issue?

You should create one JIRA issue per self-contained change.
You could also create "sub-tasks" of MATH-1509 (for changes that
are strongly related to your contribution).

>
> >In addition, you should avoid mixing massive cosmetic changes (e.g.
> >Javadoc reformatting) within a commit than contains other types of
> >change:
> >    https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf
> >
> >>   And a comparsion between KMeans and MiniBatchKMeans by using the >"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
> >
> >Functionality seems to work fine, but ability to review incremental
> >changes is extremely important in a collaborative project.
> >
>
> Thanks for your remind, and I created a new PR that improve the problems one by one.

I hope that I've now fixed the misunderstanding,
Gilles

> > [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
In reply to this post by Gilles Sadowski-2
Hi,


------------------&nbsp;Original&nbsp;------------------
From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
Date:&nbsp;Wed, Feb 26, 2020 00:32 AM
To:&nbsp;"Commons Developers List"<[hidden email]&gt;;

Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer



&gt;Hello.
&gt;
&gt;[Side question: Do you send a copy of your posts directly to me?
&gt;If so, that is not necessary and is even annoying because when I
&gt;hit "reply" in my mail client, the conversation continues off-list...]

Yes, I always copy to you, I think the ML never forward my email to others,
that I subscribe the ML with my other email and never received my own email.
Sorry for annoying&nbsp;you, but you are the only person received my email because I CC to you.

&gt;&gt;Le mar. 25 févr. 2020 à 14:53, CT <[hidden email]&gt; a écrit :
&gt;&gt;
&gt;&gt; Hi Gilles:
&gt;&gt;&nbsp;&nbsp; Sorry for my unfamiliar in contribution. I started a new PR for most of your suggestion:
&gt;&gt; https://github.com/apache/commons-math/pull/120
&gt;
&gt;Sorry for seemingly nit-picking but the global issue is the same
&gt;as with PR #118: It contains too many unrelated changes.
&gt;There should be *one* PR for each batch of significant changes.
&gt;More precisely, for this work, that would entail (a.o. things), one
&gt;PR for each of the following:
&gt;&nbsp;* Class "ClusterUtils" (design yet to be discussed)
&gt;&nbsp;* Factoring out whatever code is necessary for your proposed
&gt;feature (that would otherwise be duplicated)
&gt;
&gt;We try and avoid bloating the API, hence the changes which
&gt;I've suggested:
&gt;&nbsp;* remove the constructors with default parameters
&gt;
&gt;Overall, each PR should probably contain a single commit, in
&gt;order to ease review.
Do you mean this:
&nbsp;* For JIRA issue #MATH-1509 Start a PR&nbsp;with "MiniBatchKMeansClusterer", but without the "ClusterUtils",
despite the duplicate code between "MiniBatchKMeansClusterer" and "KMeansPlusPlusClusterer",
also with "CentroidInitializer" and test code with in a single commit.
&nbsp;* Suggestions like "remove the constructors with default parameters" should apply as a new commit of the PR above,
and tracking by a subtask of JIRA issue&nbsp;#MATH-1509.
&nbsp;* Fire a new JIRA issue for the duplicate&nbsp;code, and start another PR with "ClusterUtils" in,
and extract duplicate code into "ClusterUtils".


&gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
&gt;&gt; I remain have one question below:
&gt;&gt;
&gt;&gt; ------------------ Original ------------------
&gt; From: "Gilles Sadowski"<[hidden email]&gt;;
&gt; Date: Mon, Feb 24, 2020 09:52 PM
&gt; To: "dev"<[hidden email]&gt;;
&gt; Subject: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
&gt;
&gt; &gt;Hi.
&gt; &gt;
&gt; &gt;Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]&gt; a écrit :
&gt; &gt;&gt;
&gt; &gt;&gt; Hi Gilles:
&gt; &gt;&gt;&nbsp;&nbsp; I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use &gt;Unicode" for this mail.
&gt; &gt;
&gt; &gt;The contents seems fine.&nbsp; However, there is something wrong: replying to your
&gt; &gt;message sends to you instead of the mailing list...
&gt; &gt;
&gt; &gt;&gt;&nbsp;&nbsp; I have created a pull request: https://github.com/apache/commons-math/pull/118
&gt; &gt;
&gt; &gt;That PR fails the build:
&gt; &gt;&nbsp;&nbsp; https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&amp;utm_medium=notification
&gt; &gt;
&gt; &gt;Also, the code doesn't take into account all the remarks which I
&gt; &gt;made in previous comments (e.g. file JIRA reports to allow tracking
&gt; &gt;of changes to existing code and not mix those with new code).
&gt; &gt;
&gt;
&gt; I did not get what kinds of problem should tracking by JIRA.
&gt; In my opinion all the change is related to my PR.
&gt;
&gt;Yes, they are related but they could be provided in more focused
&gt;PRs as explained above.
&gt;
&gt; The JIRA issue about Feature MiniBatch has been closed.
&gt;
&gt;It is not closed:
&gt;&nbsp;&nbsp; https://issues.apache.org/jira/projects/MATH/issues/MATH-1509
&gt;
&gt;&gt; Should I fire a new issue?
&gt;
&gt;You should create one JIRA issue per self-contained change.
&gt;You could also create "sub-tasks" of MATH-1509 (for changes that
&gt;are strongly related to your contribution).

&gt;&gt;
&gt;&gt; &gt;In addition, you should avoid mixing massive cosmetic changes (e.g.
&gt;&gt; &gt;Javadoc reformatting) within a commit than contains other types of
&gt;&gt; &gt;change:
&gt;&gt; &gt;&nbsp;&nbsp;&nbsp; https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf
&gt;&gt; &gt;
&gt;&gt; &gt;&gt;&nbsp;&nbsp; And a comparsion between KMeans and MiniBatchKMeans by using the&nbsp;&gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
&gt;&gt; &gt;
&gt;&gt; &gt;Functionality seems to work fine, but ability to review incremental
&gt;&gt; &gt;changes is extremely important in a collaborative project.
&gt;&gt; &gt;
&gt;&gt;
&gt;&gt; Thanks for your remind, and I created a new PR that improve the problems one by one.

&gt;I hope that I've now fixed the misunderstanding,
&gt;Gilles

&gt;&gt; &gt; [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
In reply to this post by Gilles Sadowski-2
Hi Gilles,


------------------ Original ------------------
From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
Date:&nbsp;Wed, Feb 26, 2020 00:32 AM
To:&nbsp;"Commons Developers List"<[hidden email]&gt;;

Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer



&gt;Hello.
&gt;
&gt;[Side question: Do you send a copy of your posts directly to me?
&gt;If so, that is not necessary and is even annoying because when I
&gt;hit "reply" in my mail client, the conversation continues off-list...]

Yes, I always copy to you, I found the ML never forward my email up to now
(that I subscribe the ML with my other email and never received my own email).
Sorry for annoying you, but you are the only person received my email because I CC to you.

&gt;&gt;Le mar. 25 févr. 2020 à 14:53, CT <[hidden email]&gt; a écrit :
&gt;&gt;
&gt;&gt; Hi Gilles:
&gt;&gt; &nbsp; Sorry for my unfamiliar in contribution. I started a new PR for most of your suggestion:
&gt;&gt; https://github.com/apache/commons-math/pull/120
&gt;
&gt;Sorry for seemingly nit-picking but the global issue is the same
&gt;as with PR #118: It contains too many unrelated changes.
&gt;There should be *one* PR for each batch of significant changes.
&gt;More precisely, for this work, that would entail (a.o. things), one
&gt;PR for each of the following:
&gt; * Class "ClusterUtils" (design yet to be discussed)
&gt; * Factoring out whatever code is necessary for your proposed
&gt;feature (that would otherwise be duplicated)
&gt;
&gt;We try and avoid bloating the API, hence the changes which
&gt;I've suggested:
&gt; * remove the constructors with default parameters
&gt;
&gt;Overall, each PR should probably contain a single commit, in
&gt;order to ease review.
Do you mean this:
&nbsp;* For JIRA issue #MATH-1509 Start a PR with "MiniBatchKMeansClusterer", but without the "ClusterUtils",
despite the duplicate code between "MiniBatchKMeansClusterer" and "KMeansPlusPlusClusterer",
also with "CentroidInitializer" and test code with in a single commit.
&nbsp;* Suggestions like "remove the constructors with default parameters" should apply as a new commit of the PR above,
and tracking by a subtask of JIRA issue #MATH-1509.
&nbsp;* Fire a new JIRA issue for the duplicate code, and start another PR with "ClusterUtils" in,
and extract duplicate code into "ClusterUtils".


&gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
&gt;&gt; I remain have one question below:
&gt;&gt;
&gt;&gt; ------------------ Original ------------------
&gt; From: "Gilles Sadowski"<[hidden email]&gt;;
&gt; Date: Mon, Feb 24, 2020 09:52 PM
&gt; To: "dev"<[hidden email]&gt;;
&gt; Subject: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
&gt;
&gt; &gt;Hi.
&gt; &gt;
&gt; &gt;Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]&gt; a écrit :
&gt; &gt;&gt;
&gt; &gt;&gt; Hi Gilles:
&gt; &gt;&gt; &nbsp; I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use &gt;Unicode" for this mail.
&gt; &gt;
&gt; &gt;The contents seems fine.&nbsp; However, there is something wrong: replying to your
&gt; &gt;message sends to you instead of the mailing list...
&gt; &gt;
&gt; &gt;&gt; &nbsp; I have created a pull request: https://github.com/apache/commons-math/pull/118
&gt; &gt;
&gt; &gt;That PR fails the build:
&gt; &gt; &nbsp; https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&amp;utm_medium=notification
&gt; &gt;
&gt; &gt;Also, the code doesn't take into account all the remarks which I
&gt; &gt;made in previous comments (e.g. file JIRA reports to allow tracking
&gt; &gt;of changes to existing code and not mix those with new code).
&gt; &gt;
&gt;
&gt; I did not get what kinds of problem should tracking by JIRA.
&gt; In my opinion all the change is related to my PR.
&gt;
&gt;Yes, they are related but they could be provided in more focused
&gt;PRs as explained above.
&gt;
&gt; The JIRA issue about Feature MiniBatch has been closed.
&gt;
&gt;It is not closed:
&gt; &nbsp; https://issues.apache.org/jira/projects/MATH/issues/MATH-1509
&gt;
&gt;&gt; Should I fire a new issue?
&gt;
&gt;You should create one JIRA issue per self-contained change.
&gt;You could also create "sub-tasks" of MATH-1509 (for changes that
&gt;are strongly related to your contribution).

&gt;&gt;
&gt;&gt; &gt;In addition, you should avoid mixing massive cosmetic changes (e.g.
&gt;&gt; &gt;Javadoc reformatting) within a commit than contains other types of
&gt;&gt; &gt;change:
&gt;&gt; &gt;&nbsp; &nbsp; https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf
&gt;&gt; &gt;
&gt;&gt; &gt;&gt; &nbsp; And a comparsion between KMeans and MiniBatchKMeans by using the &gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
&gt;&gt; &gt;
&gt;&gt; &gt;Functionality seems to work fine, but ability to review incremental
&gt;&gt; &gt;changes is extremely important in a collaborative project.
&gt;&gt; &gt;
&gt;&gt;
&gt;&gt; Thanks for your remind, and I created a new PR that improve the problems one by one.

&gt;I hope that I've now fixed the misunderstanding,
&gt;Gilles

&gt;&gt; &gt; [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

Le mer. 26 févr. 2020 à 03:49, CT <[hidden email]> a écrit :

>
> Hi Gilles,
>
>
> ------------------ Original ------------------
> From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
> Date:&nbsp;Wed, Feb 26, 2020 00:32 AM
> To:&nbsp;"Commons Developers List"<[hidden email]&gt;;
>
> Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer
>
>
>
> &gt;Hello.
> &gt;
> &gt;[Side question: Do you send a copy of your posts directly to me?
> &gt;If so, that is not necessary and is even annoying because when I
> &gt;hit "reply" in my mail client, the conversation continues off-list...]
>
> Yes, I always copy to you, I found the ML never forward my email up to now
> (that I subscribe the ML with my other email and never received my own email).
> Sorry for annoying you, but you are the only person received my email because I CC to you.

This last mail of yours was received by the mailing list, as witnessed
by the archive copy:
    https://markmail.org/message/qjk6zmdypqigxl2k

Hitting "reply" in my mail client now works as expected (setting the
destination to the ML).

However, you sent the same post twice.  And the quoted text is
somewhat mangled again by the quotation mark being replaced
by the corresponding HTML entity ("&gt;").

> &gt;&gt;Le mar. 25 févr. 2020 à 14:53, CT <[hidden email]&gt; a écrit :
> &gt;&gt;
> &gt;&gt; Hi Gilles:
> &gt;&gt; &nbsp Sorry for my unfamiliar in contribution. I started a new PR for most of your suggestion:
> &gt;&gt; https://github.com/apache/commons-math/pull/120
> &gt;
> &gt;Sorry for seemingly nit-picking but the global issue is the same
> &gt;as with PR #118: It contains too many unrelated changes.
> &gt;There should be *one* PR for each batch of significant changes.
> &gt;More precisely, for this work, that would entail (a.o. things), one
> &gt;PR for each of the following:
> &gt; * Class "ClusterUtils" (design yet to be discussed)
> &gt; * Factoring out whatever code is necessary for your proposed
> &gt;feature (that would otherwise be duplicated)
> &gt;
> &gt;We try and avoid bloating the API, hence the changes which
> &gt;I've suggested:
> &gt; * remove the constructors with default parameters
> &gt;
> &gt;Overall, each PR should probably contain a single commit, in
> &gt;order to ease review.
> Do you mean this:
> &nbsp;* For JIRA issue #MATH-1509 Start a PR with "MiniBatchKMeansClusterer", but without the "ClusterUtils",
> despite the duplicate code between "MiniBatchKMeansClusterer" and "KMeansPlusPlusClusterer",
> also with "CentroidInitializer" and test code with in a single commit.
> &nbsp;* Suggestions like "remove the constructors with default parameters" should apply as a new commit of the PR above,
> and tracking by a subtask of JIRA issue #MATH-1509.
> &nbsp;* Fire a new JIRA issue for the duplicate code, and start another PR with "ClusterUtils" in,
> and extract duplicate code into "ClusterUtils".

No, you should start with the smallest possible self-contained PR.
For example, why should we commit a code that defines several
constructors, while we already know that a second commit should
remove them?

As you've noticed that some functionality must be factored out of
"KMeansPlusPlusClusterer", this should be done first as a separate
JIRA issue. IIUC, you propose "ClusterUtils".  By reviewing a
minimal PR, we should be able to examine whether another
approach might be better (than a "utility" class) in order to expose
functionality common to all clusterer algorithms.
For example, could all "Kmeans" implementations inherit from
a common base class?

Best regards,
Gilles

> &gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
> &gt;&gt; I remain have one question below:
> &gt;&gt;
> &gt;&gt; ------------------ Original ------------------
> &gt; From: "Gilles Sadowski"<[hidden email]&gt;;
> &gt; Date: Mon, Feb 24, 2020 09:52 PM
> &gt; To: "dev"<[hidden email]&gt;;
> &gt; Subject: Re: [math] Discuss: New feature MiniBatchKMeansClusterer
> &gt;
> &gt; &gt;Hi.
> &gt; &gt;
> &gt; &gt;Le sam. 22 févr. 2020 à 14:37, CT <[hidden email]&gt; a écrit :
> &gt; &gt;&gt;
> &gt; &gt;&gt; Hi Gilles:
> &gt; &gt;&gt; &nbsp; I really appricate for your patient to help me to solve the mail sending problem, I try to set the only setting about charset "Use &gt;Unicode" for this mail.
> &gt; &gt;
> &gt; &gt;The contents seems fine.&nbsp; However, there is something wrong: replying to your
> &gt; &gt;message sends to you instead of the mailing list...
> &gt; &gt;
> &gt; &gt;&gt; &nbsp; I have created a pull request: https://github.com/apache/commons-math/pull/118
> &gt; &gt;
> &gt; &gt;That PR fails the build:
> &gt; &gt; &nbsp; https://travis-ci.org/apache/commons-math/builds/653293451?utm_source=github_status&amp;utm_medium=notification
> &gt; &gt;
> &gt; &gt;Also, the code doesn't take into account all the remarks which I
> &gt; &gt;made in previous comments (e.g. file JIRA reports to allow tracking
> &gt; &gt;of changes to existing code and not mix those with new code).
> &gt; &gt;
> &gt;
> &gt; I did not get what kinds of problem should tracking by JIRA.
> &gt; In my opinion all the change is related to my PR.
> &gt;
> &gt;Yes, they are related but they could be provided in more focused
> &gt;PRs as explained above.
> &gt;
> &gt; The JIRA issue about Feature MiniBatch has been closed.
> &gt;
> &gt;It is not closed:
> &gt; &nbsp; https://issues.apache.org/jira/projects/MATH/issues/MATH-1509
> &gt;
> &gt;&gt; Should I fire a new issue?
> &gt;
> &gt;You should create one JIRA issue per self-contained change.
> &gt;You could also create "sub-tasks" of MATH-1509 (for changes that
> &gt;are strongly related to your contribution).
>
> &gt;&gt;
> &gt;&gt; &gt;In addition, you should avoid mixing massive cosmetic changes (e.g.
> &gt;&gt; &gt;Javadoc reformatting) within a commit than contains other types of
> &gt;&gt; &gt;change:
> &gt;&gt; &gt;&nbsp; &nbsp; https://github.com/apache/commons-math/pull/118/commits/47a055e6264d084547854f9290461f020f2131cf
> &gt;&gt; &gt;
> &gt;&gt; &gt;&gt; &nbsp; And a comparsion between KMeans and MiniBatchKMeans by using the &gt;&gt;"org.apache.commons.math4.userguide.ClusterAlgorithmComparison":
> &gt;&gt; &gt;
> &gt;&gt; &gt;Functionality seems to work fine, but ability to review incremental
> &gt;&gt; &gt;changes is extremely important in a collaborative project.
> &gt;&gt; &gt;
> &gt;&gt;
> &gt;&gt; Thanks for your remind, and I created a new PR that improve the problems one by one.
>
> &gt;I hope that I've now fixed the misunderstanding,
> &gt;Gilles
>
> &gt;&gt; &gt; [...]
>

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
In reply to this post by Gilles Sadowski-2
Hi Gilles,
------------------&nbsp;Original&nbsp;------------------
From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
Date:&nbsp;Wed, Feb 26, 2020 05:41 PM
To:&nbsp;"Commons Developers List"<[hidden email]&gt;;

Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer



&gt;[...]

&gt;&gt; Do you mean this:
&gt;&gt; &amp;nbsp;* For JIRA issue #MATH-1509 Start a PR with "MiniBatchKMeansClusterer", but without the "ClusterUtils",
&gt;&gt; despite the duplicate code between "MiniBatchKMeansClusterer" and "KMeansPlusPlusClusterer",
&gt;&gt; also with "CentroidInitializer" and test code with in a single commit.
&gt;&gt; &amp;nbsp;* Suggestions like "remove the constructors with default parameters" should apply as a new commit of the PR above,
&gt;&gt; and tracking by a subtask of JIRA issue #MATH-1509.
&gt;&gt; &amp;nbsp;* Fire a new JIRA issue for the duplicate code, and start another PR with "ClusterUtils" in,
&gt;&gt; and extract duplicate code into "ClusterUtils".
&gt;
&gt;No, you should start with the smallest possible self-contained PR.
&gt;For example, why should we commit a code that defines several
&gt;constructors, while we already know that a second commit should
&gt;remove them?
&gt;
&gt;As you've noticed that some functionality must be factored out of
&gt;"KMeansPlusPlusClusterer", this should be done first as a separate
&gt;JIRA issue. IIUC, you propose "ClusterUtils".&nbsp; By reviewing a
&gt;minimal PR, we should be able to examine whether another
&gt;approach might be better (than a "utility" class) in order to expose
&gt;functionality common to all clusterer algorithms.
&gt;For example, could all "Kmeans" implementations inherit from
&gt;a common base class?

Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
then start a PR and a disscuss about "ClusterUtils"?
And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?

&gt;[...]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

[Please try and set your mail client to send plain text messages.]

Le mer. 26 févr. 2020 à 14:05, CT <[hidden email]> a écrit :

>
> Hi Gilles,
> ------------------&nbsp;Original&nbsp;------------------
> From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
> Date:&nbsp;Wed, Feb 26, 2020 05:41 PM
> To:&nbsp;"Commons Developers List"<[hidden email]&gt;;
>
> Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer
>
>
>
> &gt;[...]
>
> &gt;&gt; Do you mean this:
> &gt;&gt; &amp;nbsp;* For JIRA issue #MATH-1509 Start a PR with "MiniBatchKMeansClusterer", but without the "ClusterUtils",
> &gt;&gt; despite the duplicate code between "MiniBatchKMeansClusterer" and "KMeansPlusPlusClusterer",
> &gt;&gt; also with "CentroidInitializer" and test code with in a single commit.
> &gt;&gt; &amp;nbsp;* Suggestions like "remove the constructors with default parameters" should apply as a new commit of the PR above,
> &gt;&gt; and tracking by a subtask of JIRA issue #MATH-1509.
> &gt;&gt; &amp;nbsp;* Fire a new JIRA issue for the duplicate code, and start another PR with "ClusterUtils" in,
> &gt;&gt; and extract duplicate code into "ClusterUtils".
> &gt;
> &gt;No, you should start with the smallest possible self-contained PR.
> &gt;For example, why should we commit a code that defines several
> &gt;constructors, while we already know that a second commit should
> &gt;remove them?
> &gt;
> &gt;As you've noticed that some functionality must be factored out of
> &gt;"KMeansPlusPlusClusterer", this should be done first as a separate
> &gt;JIRA issue. IIUC, you propose "ClusterUtils".&nbsp; By reviewing a
> &gt;minimal PR, we should be able to examine whether another
> &gt;approach might be better (than a "utility" class) in order to expose
> &gt;functionality common to all clusterer algorithms.
> &gt;For example, could all "Kmeans" implementations inherit from
> &gt;a common base class?
>
> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
> then start a PR and a disscuss about "ClusterUtils"?
> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?

I cannot guarantee that the whole process will be streamlined.
In effect, you can work on multiple branches (one for each
prospective PR).
I'd say that you should start by describing (here on the ML) the
rationale for "ClusterUtils" (and contrast it with say, a common
base class).
[Only when the design has been agreed on,  a JIRA issue to
implement it should be created in order to track the actual
coding work).]

Gilles

> &gt;[...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

>Hello.
>
>[Please try and set your mail client to send plain text messages.]
>
>Le mer. 26 févr. 2020 à 14:05, CT <[hidden email]> a écrit :
>>
>> Hi Gilles,
>> ------------------&nbsp;Original&nbsp;------------------
>> From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
>> Date:&nbsp;Wed, Feb 26, 2020 05:41 PM
>> To:&nbsp;"Commons Developers List"<[hidden email]&gt;;
>>
>> Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer
>>
>>
>>
> [...]
>>
>> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>> then start a PR and a disscuss about "ClusterUtils"?
>> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
>
>I cannot guarantee that the whole process will be streamlined.
>In effect, you can work on multiple branches (one for each
>prospective PR).
>I'd say that you should start by describing (here on the ML) the
>rationale for "ClusterUtils" (and contrast it with say, a common
>base class).
>[Only when the design has been agreed on,  a JIRA issue to
>implement it should be created in order to track the actual
>coding work).]

OK, I think we should start from here:

The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
 could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.

There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
 and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
 the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
 and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.

>
>Gilles
>
>> [...]
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

[Message formatting is fine now.  Thanks!]

Le mer. 26 févr. 2020 à 15:20, [hidden email] <[hidden email]> a écrit :

>
> Hi,
>
> >Hello.
> >
> >[Please try and set your mail client to send plain text messages.]
> >
> >Le mer. 26 févr. 2020 à 14:05, CT <[hidden email]> a écrit :
> >>
> >> Hi Gilles,
> >> ------------------&nbsp;Original&nbsp;------------------
> >> From:&nbsp;"GillesSadowski"<[hidden email]&gt;;
> >> Date:&nbsp;Wed, Feb 26, 2020 05:41 PM
> >> To:&nbsp;"Commons Developers List"<[hidden email]&gt;;
> >>
> >> Subject:&nbsp;Re: [math] Discuss: New feature MiniBatchKMeansClusterer
> >>
> >>
> >>
> > [...]
> >>
> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
> >> then start a PR and a disscuss about "ClusterUtils"?
> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
> >
> >I cannot guarantee that the whole process will be streamlined.
> >In effect, you can work on multiple branches (one for each
> >prospective PR).
> >I'd say that you should start by describing (here on the ML) the
> >rationale for "ClusterUtils" (and contrast it with say, a common
> >base class).
> >[Only when the design has been agreed on,  a JIRA issue to
> >implement it should be created in order to track the actual
> >coding work).]
>
> OK, I think we should start from here:
>
> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
>
> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
>

At first sight, I prefer option 1.
Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
users of the functionality (and so should not be part of the "public" API).
Method "centroidOf" looks generally useful.  Shouldn't it be part of
the "Cluster"
interface?  What is the difference with method "getCenter" (define by class
"CentroidCluster")?

Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

> [...]
>> >>
>> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>> >> then start a PR and a disscuss about "ClusterUtils"?
>> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
>> >
>> >I cannot guarantee that the whole process will be streamlined.
>> >In effect, you can work on multiple branches (one for each
>> >prospective PR).
>> >I'd say that you should start by describing (here on the ML) the
>> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >base class).
>> >[Only when the design has been agreed on,  a JIRA issue to
>> >implement it should be created in order to track the actual
>> >coding work).]
>>
>> OK, I think we should start from here:
>>
>> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
>>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
>>
>> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
>> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
>>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
>>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
>> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
>>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
>>
>
>At first sight, I prefer option 1.
>Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
>users of the functionality (and so should not be part of the "public" API).

Persuasive explain, and I agree with you, that extract a abstract class for KMeans is better.
And how can we make a conclusion?
---------------------------------------------

Mention the "public API", I suppose there should be a series of "CentroidInitializer",
 that "chooseInitialCenters" with various of algorithms.
The k-means++ cluster algorithm is a special implementation of k-means
 which initialize cluster centers with k-means++ algorithm.
So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be "KMeansClusterer"
 with a "KMeansPlusPlusCentroidInitializer" strategy.
When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is a common k-means.

----------------------------------------------------------
>Method "centroidOf" looks generally useful.  Shouldn't it be part of
>the "Cluster"
>interface?  What is the difference with method "getCenter" (define by class
>"CentroidCluster")?

My understanding is,:
 * "Cluster" is a data class that carry the result of a clustering,
"getCenter" is just a get method of CentroidCluster for get the value of a center point.
 * "Cluster[er]" is a (Interface of )algorithm that classify points to sets of Cluster.
 * "CentroidCluster" is the result of a group of special Clusterer algorithm like k-means, 
 "centroidOf" is a specific logic to calculate the center point for a collection of points.
[Instead the DBScan cluster algorithm dose not care about the "Centroid"]

So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
 but different with "CentroidCluster.getCenter".

>
>Regards,
>Gilles
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hi.

Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a écrit :

>
> Hi,
>
> > [...]
> >> >>
> >> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
> >> >> then start a PR and a disscuss about "ClusterUtils"?
> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
> >> >
> >> >I cannot guarantee that the whole process will be streamlined.
> >> >In effect, you can work on multiple branches (one for each
> >> >prospective PR).
> >> >I'd say that you should start by describing (here on the ML) the
> >> >rationale for "ClusterUtils" (and contrast it with say, a common
> >> >base class).
> >> >[Only when the design has been agreed on,  a JIRA issue to
> >> >implement it should be created in order to track the actual
> >> >coding work).]
> >>
> >> OK, I think we should start from here:
> >>
> >> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
> >>
> >> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
> >> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
> >>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
> >>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
> >> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
> >>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
> >>
> >
> >At first sight, I prefer option 1.
> >Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
> >users of the functionality (and so should not be part of the "public" API).
>
> Persuasive explain, and I agree with you, that extract a abstract class for KMeans is better.
> And how can we make a conclusion?
> ---------------------------------------------
>
> Mention the "public API", I suppose there should be a series of "CentroidInitializer",
>  that "chooseInitialCenters" with various of algorithms.
> The k-means++ cluster algorithm is a special implementation of k-means
>  which initialize cluster centers with k-means++ algorithm.
> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be "KMeansClusterer"
>  with a "KMeansPlusPlusCentroidInitializer" strategy.
> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is a common k-means.
>
> ----------------------------------------------------------
> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
> >the "Cluster"
> >interface?  What is the difference with method "getCenter" (define by class
> >"CentroidCluster")?
>
> My understanding is,:
>  * "Cluster" is a data class that carry the result of a clustering,
> "getCenter" is just a get method of CentroidCluster for get the value of a center point.
>  * "Cluster[er]" is a (Interface of )algorithm that classify points to sets of Cluster.
>  * "CentroidCluster" is the result of a group of special Clusterer algorithm like k-means,
>  "centroidOf" is a specific logic to calculate the center point for a collection of points.
> [Instead the DBScan cluster algorithm dose not care about the "Centroid"]
>
> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
>  but different with "CentroidCluster.getCenter".

I may be missing something about the existing design,
but it seems strange that "CentroidCluster" is initialized
with a given "center", yet it is possible to add points after
initialization (which IIUC would invalidate the "center").
It would seem that "center" should be a property computed
from the contents of "Cluster" e.g.:

@FunctionalInterface
public interface ClusterCenterComputer<T extends Clusterable> {
    T centroidOf(Cluster<T> cluster);
}

Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com



--------------
[hidden email]

>Hi.
>
>Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a écrit :
>>
>> Hi,
>>
>> > [...]
>> >> >>
>> >> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>> >> >> then start a PR and a disscuss about "ClusterUtils"?
>> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
>> >> >
>> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >In effect, you can work on multiple branches (one for each
>> >> >prospective PR).
>> >> >I'd say that you should start by describing (here on the ML) the
>> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >base class).
>> >> >[Only when the design has been agreed on,  a JIRA issue to
>> >> >implement it should be created in order to track the actual
>> >> >coding work).]
>> >>
>> >> OK, I think we should start from here:
>> >>
>> >> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
>> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
>> >>
>> >> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
>> >> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
>> >>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
>> >>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
>> >> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
>> >>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
>> >>
>> >
>> >At first sight, I prefer option 1.
>> >Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
>> >users of the functionality (and so should not be part of the "public" API).
>>
>> Persuasive explain, and I agree with you, that extract a abstract class for KMeans is better.
>> And how can we make a conclusion?
>> ---------------------------------------------
>>
>> Mention the "public API", I suppose there should be a series of "CentroidInitializer",
>>  that "chooseInitialCenters" with various of algorithms.
>> The k-means++ cluster algorithm is a special implementation of k-means
>>  which initialize cluster centers with k-means++ algorithm.
>> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be "KMeansClusterer"
>>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is a common k-means.
>>
>> ----------------------------------------------------------
>> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >the "Cluster"
>> >interface?  What is the difference with method "getCenter" (define by class
>> >"CentroidCluster")?
>>
>> My understanding is,:
>>  * "Cluster" is a data class that carry the result of a clustering,
>> "getCenter" is just a get method of CentroidCluster for get the value of a center point.
>>  * "Cluster[er]" is a (Interface of )algorithm that classify points to sets of Cluster.
>>  * "CentroidCluster" is the result of a group of special Clusterer algorithm like k-means,
>>  "centroidOf" is a specific logic to calculate the center point for a collection of points.
>> [Instead the DBScan cluster algorithm dose not care about the "Centroid"]
>>
>> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
>>  but different with "CentroidCluster.getCenter".
>
>I may be missing something about the existing design,
>but it seems strange that "CentroidCluster" is initialized
>with a given "center", yet it is possible to add points after
>initialization (which IIUC would invalidate the "center").

The "centroidOf" could be part of "CentroidCluster",
but I think the existsing desgin was focus on decouple of "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".

Center recalculate often happens in each iteration of k-means Clustering, 
always with points reassign to clusters.
We often use k-means as two pharse:
Pharse 1: Training, classify thousands of points to set of clusters.
Pharse 2: Predict, predict which cluster is best for a new point,
or add a new point to the best cluster in ClusterSet,
but we never update the cluster center until next retraining.

The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" just design for pharse "Training",
it is clearly if we can consider "CentroidCluster" as a pure data class just for k-means clustering result.

If we want the cluster result useful enough for parse "Predict",
 the result of "KMeansPlusPlusClusterer.cluster" should return a  "ClusterSet":
```java
public interface ClusterSet<T extends Clusterable> extends Collection<T> {
  // Retrun the cluster which the point should belong to.
  Cluster predict(T point);
  // Add a point to best cluster.
  void addPoint(T point);
}
```
And "centroidOf"(just used in clustering iteration) can move up into a abstract class like "CenroidClusterer".

>It would seem that "center" should be a property computed
>from the contents of "Cluster" e.g.:
>
>@FunctionalInterface
>public interface ClusterCenterComputer<T extends Clusterable> {
>    T centroidOf(Cluster<T> cluster);
>}
>
>Regards,
>Gilles
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hi.

Le ven. 28 févr. 2020 à 05:04, [hidden email] <[hidden email]> a écrit :

>
>
>
>
> --------------
> [hidden email]
> >Hi.
> >
> >Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a écrit :
> >>
> >> Hi,
> >>
> >> > [...]
> >> >> >>
> >> >> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
> >> >> >
> >> >> >I cannot guarantee that the whole process will be streamlined.
> >> >> >In effect, you can work on multiple branches (one for each
> >> >> >prospective PR).
> >> >> >I'd say that you should start by describing (here on the ML) the
> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
> >> >> >base class).
> >> >> >[Only when the design has been agreed on,  a JIRA issue to
> >> >> >implement it should be created in order to track the actual
> >> >> >coding work).]
> >> >>
> >> >> OK, I think we should start from here:
> >> >>
> >> >> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
> >> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
> >> >>
> >> >> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
> >> >> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
> >> >>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
> >> >> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
> >> >>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
> >> >>
> >> >
> >> >At first sight, I prefer option 1.
> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
> >> >users of the functionality (and so should not be part of the "public" API).
> >>
> >> Persuasive explain, and I agree with you, that extract a abstract class for KMeans is better.
> >> And how can we make a conclusion?
> >> ---------------------------------------------
> >>
> >> Mention the "public API", I suppose there should be a series of "CentroidInitializer",
> >>  that "chooseInitialCenters" with various of algorithms.
> >> The k-means++ cluster algorithm is a special implementation of k-means
> >>  which initialize cluster centers with k-means++ algorithm.
> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be "KMeansClusterer"
> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is a common k-means.
> >>
> >> ----------------------------------------------------------
> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
> >> >the "Cluster"
> >> >interface?  What is the difference with method "getCenter" (define by class
> >> >"CentroidCluster")?
> >>
> >> My understanding is,:
> >>  * "Cluster" is a data class that carry the result of a clustering,
> >> "getCenter" is just a get method of CentroidCluster for get the value of a center point.
> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to sets of Cluster.
> >>  * "CentroidCluster" is the result of a group of special Clusterer algorithm like k-means,
> >>  "centroidOf" is a specific logic to calculate the center point for a collection of points.
> >> [Instead the DBScan cluster algorithm dose not care about the "Centroid"]
> >>
> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
> >>  but different with "CentroidCluster.getCenter".
> >
> >I may be missing something about the existing design,
> >but it seems strange that "CentroidCluster" is initialized
> >with a given "center", yet it is possible to add points after
> >initialization (which IIUC would invalidate the "center").
>
> The "centroidOf" could be part of "CentroidCluster",
> but I think the existsing desgin was focus on decouple of "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".

I don't see why we need both "Cluster" and "CentroidCluster".
Indeed, as suggested before, the "center" can be computed
from a "Cluster", but does not need to be stored in it.

>
> Center recalculate often happens in each iteration of k-means Clustering,
> always with points reassign to clusters.
> We often use k-means as two pharse:
> Pharse 1: Training, classify thousands of points to set of clusters.
> Pharse 2: Predict, predict which cluster is best for a new point,
> or add a new point to the best cluster in ClusterSet,

Method "cluster" returns a "List<Cluster>"; there is no need for a
new "ClusterSet" class.
Also, IIUC the centers can be collected into a "List<Clusterable>",
so that the association is through the index into the list(s).

> but we never update the cluster center until next retraining.

IMO, that's the reason for *not*" storing the center (in such a
mutable instance).

>
> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" just design for pharse "Training",
> it is clearly if we can consider "CentroidCluster" as a pure data class just for k-means clustering result.

See above.

Discussing the existing design further, I think that the "cluster" method should
rather be:
---CUT---
public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist)
---CUT---

And, similarly,
---CUT---
@FunctionalInterface
public interface ClusterFinder<T extends Clusterable> {
    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure dist);
}
---CUT---

> If we want the cluster result useful enough for parse "Predict",
>  the result of "KMeansPlusPlusClusterer.cluster" should return a  "ClusterSet":
> ```java
> public interface ClusterSet<T extends Clusterable> extends Collection<T> {
>   // Retrun the cluster which the point should belong to.
>   Cluster predict(T point);
>   // Add a point to best cluster.
>   void addPoint(T point);
> }
> ```

This  "ClusterSet" seems less flexible than a "List<Cluster>".

> And "centroidOf"(just used in clustering iteration) can move up into a abstract class like "CenroidClusterer".

It seems that this method could be useful for users too.

Best,
Gilles

> >It would seem that "center" should be a property computed
> >from the contents of "Cluster" e.g.:
> >
> >@FunctionalInterface
> >public interface ClusterCenterComputer<T extends Clusterable> {
> >    T centroidOf(Cluster<T> cluster);
> >}
> >
> >Regards,
> >Gilles
> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

>Hi.
>
>Le ven. 28 févr. 2020 à 05:04, [hidden email] <[hidden email]> a écrit :
>>
>>
>>
>>
>> --------------
>> [hidden email]
>> >Hi.
>> >
>> >Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a écrit :
>> >>
>> >> Hi,
>> >>
>> >> > [...]
>> >> >> >>
>> >> >> >> Do you mean I should fire a JIRA issue about reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
>> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after all done?
>> >> >> >
>> >> >> >I cannot guarantee that the whole process will be streamlined.
>> >> >> >In effect, you can work on multiple branches (one for each
>> >> >> >prospective PR).
>> >> >> >I'd say that you should start by describing (here on the ML) the
>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>> >> >> >base class).
>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>> >> >> >implement it should be created in order to track the actual
>> >> >> >coding work).]
>> >> >>
>> >> >> OK, I think we should start from here:
>> >> >>
>> >> >> The method "centroidOf"  and "chooseInitialCenters" in KMeansPlusPlusClusterer
>> >> >>  could be reused by other KMeans Clusterer like MiniBatchKMeansClusterer which I want to implement.
>> >> >>
>> >> >> There are two solution for reuse "centroidOf"  and "chooseInitialCenters":
>> >> >> 1. Extract a abstract class for KMeans Clusterer named "AbstractKMeansClusterer",
>> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected methods in it;
>> >> >>  the EmptyClusterStrategy and related logic can also move to the "AbstractKMeansClusterer".
>> >> >> 2. Create a static utility class, and move "centroidOf"  and "chooseInitialCenters" in it,
>> >> >>  and some useful clustering method like predict(Predict which cluster is best for a specified point) can put in it.
>> >> >>
>> >> >
>> >> >At first sight, I prefer option 1.
>> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no interest to
>> >> >users of the functionality (and so should not be part of the "public" API).
>> >>
>> >> Persuasive explain, and I agree with you, that extract a abstract class for KMeans is better.
>> >> And how can we make a conclusion?
>> >> ---------------------------------------------
>> >>
>> >> Mention the "public API", I suppose there should be a series of "CentroidInitializer",
>> >>  that "chooseInitialCenters" with various of algorithms.
>> >> The k-means++ cluster algorithm is a special implementation of k-means
>> >>  which initialize cluster centers with k-means++ algorithm.
>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can be "KMeansClusterer"
>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer", it is a common k-means.
>> >>
>> >> ----------------------------------------------------------
>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>> >> >the "Cluster"
>> >> >interface?  What is the difference with method "getCenter" (define by class
>> >> >"CentroidCluster")?
>> >>
>> >> My understanding is,:
>> >>  * "Cluster" is a data class that carry the result of a clustering,
>> >> "getCenter" is just a get method of CentroidCluster for get the value of a center point.
>> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to sets of Cluster.
>> >>  * "CentroidCluster" is the result of a group of special Clusterer algorithm like k-means,
>> >>  "centroidOf" is a specific logic to calculate the center point for a collection of points.
>> >> [Instead the DBScan cluster algorithm dose not care about the "Centroid"]
>> >>
>> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists yet),
>> >>  but different with "CentroidCluster.getCenter".
>> >
>> >I may be missing something about the existing design,
>> >but it seems strange that "CentroidCluster" is initialized
>> >with a given "center", yet it is possible to add points after
>> >initialization (which IIUC would invalidate the "center").
>>
>> The "centroidOf" could be part of "CentroidCluster",
>> but I think the existsing desgin was focus on decouple of "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
>
>I don't see why we need both "Cluster" and "CentroidCluster".
>Indeed, as suggested before, the "center" can be computed
>from a "Cluster", but does not need to be stored in it.

Typical usecase for a List<CentroidCluster> is, when we need classify a new point,
calculate the distance of the new point to each CentroidCluster.center is the simplest,
and the center should be cached.

>
>>
>> Center recalculate often happens in each iteration of k-means Clustering,
>> always with points reassign to clusters.
>> We often use k-means as two pharse:
>> Pharse 1: Training, classify thousands of points to set of clusters.
>> Pharse 2: Predict, predict which cluster is best for a new point,
>> or add a new point to the best cluster in ClusterSet,
>
>Method "cluster" returns a "List<Cluster>"; there is no need for a
>new "ClusterSet" class.
>Also, IIUC the centers can be collected into a "List<Clusterable>",
>so that the association is through the index into the list(s).
>
>> but we never update the cluster center until next retraining.
>
>IMO, that's the reason for *not*" storing the center (in such a
>mutable instance).
>
>>
>> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math" just design for pharse "Training",
>> it is clearly if we can consider "CentroidCluster" as a pure data class just for k-means clustering result.
>
>See above.
>
>Discussing the existing design further, I think that the "cluster" method should
>rather be:
>---CUT---
>public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure dist)
>---CUT---
>
>And, similarly,
>---CUT---
>@FunctionalInterface
>public interface ClusterFinder<T extends Clusterable> {
>    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure dist);
>}
>---CUT---

I think there is a balance between flexibility, efficiency and usability.
The DistanceMeasure should be assigned once for a kmeans, and can not change in exists clusters.
So DistanceMeasure should be a property of Cluster, but redundant for a List of Cluster.

Consider the Cluster fewly use separately, the DistanceMeasure should be a readonly property of List<Cluster>,
If there is a ClusterSet, so we can serialize and unserialize the Clusters correctly and with no redundancy.

As a comparison, dl4j's design is easy to use for predict, especial the ClusterSet
but leak of  flexibility(stroke coupling to nd4j) and efficiency(complexity on nd4j implemention):
https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering

>
>> If we want the cluster result useful enough for parse "Predict",
>>  the result of "KMeansPlusPlusClusterer.cluster" should return a  "ClusterSet":
>> ```java
>> public interface ClusterSet<T extends Clusterable> extends Collection<T> {
>>   // Retrun the cluster which the point should belong to.
>>   Cluster predict(T point);
>>   // Add a point to best cluster.
>>   void addPoint(T point);
>> }
>> ```
>
>This  "ClusterSet" seems less flexible than a "List<Cluster>".

"ClusterSet" is useful for predict as I mentioned above.

>
>> And "centroidOf"(just used in clustering iteration) can move up into a abstract class like "CenroidClusterer".
>
>It seems that this method could be useful for users too.
>

May be "centroidOf" is useful, but now it is just used in kmeans.

>Best,
>Gilles
>
>> >It would seem that "center" should be a property computed
>> >from the contents of "Cluster" e.g.:
>> >
>> >@FunctionalInterface
>> >public interface ClusterCenterComputer<T extends Clusterable> {
>> >    T centroidOf(Cluster<T> cluster);
>> >}
>> >
>> >Regards,
>> >Gilles
>> >
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

2020-03-05 4:50 UTC+01:00, [hidden email] <[hidden email]>:

> Hi,
>
>>Hi.
>>
>>Le ven. 28 févr. 2020 à 05:04, [hidden email] <[hidden email]> a
>> écrit :
>>>
>>>
>>>
>>>
>>> --------------
>>> [hidden email]
>>> >Hi.
>>> >
>>> >Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a
>>> > écrit :
>>> >>
>>> >> Hi,
>>> >>
>>> >> > [...]
>>> >> >> >>
>>> >> >> >> Do you mean I should fire a JIRA issue about
>>> >> >> >> reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>>> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
>>> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after
>>> >> >> >> all done?
>>> >> >> >
>>> >> >> >I cannot guarantee that the whole process will be streamlined.
>>> >> >> >In effect, you can work on multiple branches (one for each
>>> >> >> >prospective PR).
>>> >> >> >I'd say that you should start by describing (here on the ML) the
>>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>>> >> >> >base class).
>>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>>> >> >> >implement it should be created in order to track the actual
>>> >> >> >coding work).]
>>> >> >>
>>> >> >> OK, I think we should start from here:
>>> >> >>
>>> >> >> The method "centroidOf"  and "chooseInitialCenters" in
>>> >> >> KMeansPlusPlusClusterer
>>> >> >>  could be reused by other KMeans Clusterer like
>>> >> >> MiniBatchKMeansClusterer which I want to implement.
>>> >> >>
>>> >> >> There are two solution for reuse "centroidOf"  and
>>> >> >> "chooseInitialCenters":
>>> >> >> 1. Extract a abstract class for KMeans Clusterer named
>>> >> >> "AbstractKMeansClusterer",
>>> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected
>>> >> >> methods in it;
>>> >> >>  the EmptyClusterStrategy and related logic can also move to the
>>> >> >> "AbstractKMeansClusterer".
>>> >> >> 2. Create a static utility class, and move "centroidOf"  and
>>> >> >> "chooseInitialCenters" in it,
>>> >> >>  and some useful clustering method like predict(Predict which
>>> >> >> cluster is best for a specified point) can put in it.
>>> >> >>
>>> >> >
>>> >> >At first sight, I prefer option 1.
>>> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no
>>> >> > interest to
>>> >> >users of the functionality (and so should not be part of the "public"
>>> >> > API).
>>> >>
>>> >> Persuasive explain, and I agree with you, that extract a abstract
>>> >> class for KMeans is better.
>>> >> And how can we make a conclusion?
>>> >> ---------------------------------------------
>>> >>
>>> >> Mention the "public API", I suppose there should be a series of
>>> >> "CentroidInitializer",
>>> >>  that "chooseInitialCenters" with various of algorithms.
>>> >> The k-means++ cluster algorithm is a special implementation of k-means
>>> >>  which initialize cluster centers with k-means++ algorithm.
>>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can
>>> >> be "KMeansClusterer"
>>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer",
>>> >> it is a common k-means.
>>> >>
>>> >> ----------------------------------------------------------
>>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>>> >> >the "Cluster"
>>> >> >interface?  What is the difference with method "getCenter" (define by
>>> >> > class
>>> >> >"CentroidCluster")?
>>> >>
>>> >> My understanding is,:
>>> >>  * "Cluster" is a data class that carry the result of a clustering,
>>> >> "getCenter" is just a get method of CentroidCluster for get the value
>>> >> of a center point.
>>> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to
>>> >> sets of Cluster.
>>> >>  * "CentroidCluster" is the result of a group of special Clusterer
>>> >> algorithm like k-means,
>>> >>  "centroidOf" is a specific logic to calculate the center point for a
>>> >> collection of points.
>>> >> [Instead the DBScan cluster algorithm dose not care about the
>>> >> "Centroid"]
>>> >>
>>> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists
>>> >> yet),
>>> >>  but different with "CentroidCluster.getCenter".
>>> >
>>> >I may be missing something about the existing design,
>>> >but it seems strange that "CentroidCluster" is initialized
>>> >with a given "center", yet it is possible to add points after
>>> >initialization (which IIUC would invalidate the "center").
>>>
>>> The "centroidOf" could be part of "CentroidCluster",
>>> but I think the existsing desgin was focus on decouple of
>>> "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
>>
>>I don't see why we need both "Cluster" and "CentroidCluster".
>>Indeed, as suggested before, the "center" can be computed
>>from a "Cluster", but does not need to be stored in it.
>
> Typical usecase for a List<CentroidCluster> is, when we need classify a new
> point,

Is there a use for "Cluster" instances that do not have a "center"?

> calculate the distance of the new point to each CentroidCluster.center is
> the simplest,
> and the center should be cached.

Yes, but not necessarily within a dedicated class.
I agree that there should be an easy way (API-wise) to classify
a point wrt a list of clusters but we should avoid the potential
inconsistency of a mutable "Cluster" instance.
Efficiency during cluster building (a.o. caching the current center)
is a separate issue from using the resulting list of clusters.

>
>>
>>>
>>> Center recalculate often happens in each iteration of k-means Clustering,
>>> always with points reassign to clusters.
>>> We often use k-means as two pharse:
>>> Pharse 1: Training, classify thousands of points to set of clusters.
>>> Pharse 2: Predict, predict which cluster is best for a new point,
>>> or add a new point to the best cluster in ClusterSet,
>>
>>Method "cluster" returns a "List<Cluster>"; there is no need for a
>>new "ClusterSet" class.
>>Also, IIUC the centers can be collected into a "List<Clusterable>",
>>so that the association is through the index into the list(s).
>>
>>> but we never update the cluster center until next retraining.
>>
>>IMO, that's the reason for *not*" storing the center (in such a
>>mutable instance).
>>
>>>
>>> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math"
>>> just design for pharse "Training",
>>> it is clearly if we can consider "CentroidCluster" as a pure data class
>>> just for k-means clustering result.
>>
>>See above.
>>
>>Discussing the existing design further, I think that the "cluster" method
>> should
>>rather be:
>>---CUT---
>>public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure
>> dist)
>>---CUT---
>>
>>And, similarly,
>>---CUT---
>>@FunctionalInterface
>>public interface ClusterFinder<T extends Clusterable> {
>>    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure
>> dist);
>>}
>>---CUT---
>
> I think there is a balance between flexibility, efficiency and usability.
> The DistanceMeasure should be assigned once for a kmeans,

Of course, during learning, only one distance measure is to
be used.
But it does not imply that it must be an instance field of the
clustering algorithm.

> and can not change
> in exists clusters.

The "Cluster" class does not define a "DistanceMeasure" field.

> So DistanceMeasure should be a property of Cluster, but redundant for a List
> of Cluster.

Above, you said that it must be a property of "kmeans"...

> Consider the Cluster fewly use separately, the DistanceMeasure should be a
> readonly property of List<Cluster>,

Why?
shouldn't a user be able to query the same list of clusters
using different distance measures?

> If there is a ClusterSet, so we can serialize and unserialize the Clusters
> correctly and with no redundancy.

I don't know about "ClusterSet"; what is its interface?

Anyways, "serialization" is yet another issue, and it should be
dealt with separately from the usage API.
Do you suggest to (de)serialize "Cluster" objects (i.e. instances
of an unknown type)?

> As a comparison, dl4j's design is easy to use for predict, especial the
> ClusterSet
> but leak of  flexibility(stroke coupling to nd4j) and efficiency(complexity
> on nd4j implemention):
> https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering

Please show stripped down versions of the API that illustrates
the points which you want to highlight.  Thanks.

>>
>>> If we want the cluster result useful enough for parse "Predict",
>>>  the result of "KMeansPlusPlusClusterer.cluster" should return a
>>> "ClusterSet":
>>> ```java
>>> public interface ClusterSet<T extends Clusterable> extends Collection<T>
>>> {
>>>   // Retrun the cluster which the point should belong to.
>>>   Cluster predict(T point);
>>>   // Add a point to best cluster.
>>>   void addPoint(T point);
>>> }
>>> ```
>>
>>This  "ClusterSet" seems less flexible than a "List<Cluster>".
>
> "ClusterSet" is useful for predict as I mentioned above.

Cf. previous comment.

>>
>>> And "centroidOf"(just used in clustering iteration) can move up into a
>>> abstract class like "CenroidClusterer".
>>
>>It seems that this method could be useful for users too.
>>
>
> May be "centroidOf" is useful, but now it is just used in kmeans.

Is "Kmeans" the only algorithm that computes clusters centers?
Do users never need to compute a center?

Best regards,
Gilles

>>> >It would seem that "center" should be a property computed
>>> >from the contents of "Cluster" e.g.:
>>> >
>>> >@FunctionalInterface
>>> >public interface ClusterCenterComputer<T extends Clusterable> {
>>> >    T centroidOf(Cluster<T> cluster);
>>> >}
>>> >

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

>Hello.
>
>2020-03-05 4:50 UTC+01:00, [hidden email] <[hidden email]>:
>> Hi,
>>
>>>Hi.
>>>
>>>Le ven. 28 févr. 2020 à 05:04, [hidden email] <[hidden email]> a
>>> écrit :
>>>>
>>>>
>>>>
>>>>
>>>> --------------
>>>> [hidden email]
>>>> >Hi.
>>>> >
>>>> >Le jeu. 27 févr. 2020 à 06:17, [hidden email] <[hidden email]> a
>>>> > écrit :
>>>> >>
>>>> >> Hi,
>>>> >>
>>>> >> > [...]
>>>> >> >> >>
>>>> >> >> >> Do you mean I should fire a JIRA issue about
>>>> >> >> >> reuse&nbsp;"centroidOf" and "chooseInitialCenters",
>>>> >> >> >> then start a PR and a disscuss about "ClusterUtils"?
>>>> >> >> >> And then&nbsp;start the PR of "MiniBatchKMeansClusterer" after
>>>> >> >> >> all done?
>>>> >> >> >
>>>> >> >> >I cannot guarantee that the whole process will be streamlined.
>>>> >> >> >In effect, you can work on multiple branches (one for each
>>>> >> >> >prospective PR).
>>>> >> >> >I'd say that you should start by describing (here on the ML) the
>>>> >> >> >rationale for "ClusterUtils" (and contrast it with say, a common
>>>> >> >> >base class).
>>>> >> >> >[Only when the design has been agreed on,  a JIRA issue to
>>>> >> >> >implement it should be created in order to track the actual
>>>> >> >> >coding work).]
>>>> >> >>
>>>> >> >> OK, I think we should start from here:
>>>> >> >>
>>>> >> >> The method "centroidOf"  and "chooseInitialCenters" in
>>>> >> >> KMeansPlusPlusClusterer
>>>> >> >>  could be reused by other KMeans Clusterer like
>>>> >> >> MiniBatchKMeansClusterer which I want to implement.
>>>> >> >>
>>>> >> >> There are two solution for reuse "centroidOf"  and
>>>> >> >> "chooseInitialCenters":
>>>> >> >> 1. Extract a abstract class for KMeans Clusterer named
>>>> >> >> "AbstractKMeansClusterer",
>>>> >> >>  and move "centroidOf"  and "chooseInitialCenters" as protected
>>>> >> >> methods in it;
>>>> >> >>  the EmptyClusterStrategy and related logic can also move to the
>>>> >> >> "AbstractKMeansClusterer".
>>>> >> >> 2. Create a static utility class, and move "centroidOf"  and
>>>> >> >> "chooseInitialCenters" in it,
>>>> >> >>  and some useful clustering method like predict(Predict which
>>>> >> >> cluster is best for a specified point) can put in it.
>>>> >> >>
>>>> >> >
>>>> >> >At first sight, I prefer option 1.
>>>> >> >Indeed, o.a things "chooseInitialCenters" is a method that is of no
>>>> >> > interest to
>>>> >> >users of the functionality (and so should not be part of the "public"
>>>> >> > API).
>>>> >>
>>>> >> Persuasive explain, and I agree with you, that extract a abstract
>>>> >> class for KMeans is better.
>>>> >> And how can we make a conclusion?
>>>> >> ---------------------------------------------
>>>> >>
>>>> >> Mention the "public API", I suppose there should be a series of
>>>> >> "CentroidInitializer",
>>>> >>  that "chooseInitialCenters" with various of algorithms.
>>>> >> The k-means++ cluster algorithm is a special implementation of k-means
>>>> >>  which initialize cluster centers with k-means++ algorithm.
>>>> >> So if there is a "CentroidInitializer", "KMeansPlusPlusClusterer" can
>>>> >> be "KMeansClusterer"
>>>> >>  with a "KMeansPlusPlusCentroidInitializer" strategy.
>>>> >> When "KMeansClusterer" initialize with a "RandomCentroidInitializer",
>>>> >> it is a common k-means.
>>>> >>
>>>> >> ----------------------------------------------------------
>>>> >> >Method "centroidOf" looks generally useful.  Shouldn't it be part of
>>>> >> >the "Cluster"
>>>> >> >interface?  What is the difference with method "getCenter" (define by
>>>> >> > class
>>>> >> >"CentroidCluster")?
>>>> >>
>>>> >> My understanding is,:
>>>> >>  * "Cluster" is a data class that carry the result of a clustering,
>>>> >> "getCenter" is just a get method of CentroidCluster for get the value
>>>> >> of a center point.
>>>> >>  * "Cluster[er]" is a (Interface of )algorithm that classify points to
>>>> >> sets of Cluster.
>>>> >>  * "CentroidCluster" is the result of a group of special Clusterer
>>>> >> algorithm like k-means,
>>>> >>  "centroidOf" is a specific logic to calculate the center point for a
>>>> >> collection of points.
>>>> >> [Instead the DBScan cluster algorithm dose not care about the
>>>> >> "Centroid"]
>>>> >>
>>>> >> So, "centroidOf" may be a method of "CentroidCluster[er]"(not exists
>>>> >> yet),
>>>> >>  but different with "CentroidCluster.getCenter".
>>>> >
>>>> >I may be missing something about the existing design,
>>>> >but it seems strange that "CentroidCluster" is initialized
>>>> >with a given "center", yet it is possible to add points after
>>>> >initialization (which IIUC would invalidate the "center").
>>>>
>>>> The "centroidOf" could be part of "CentroidCluster",
>>>> but I think the existsing desgin was focus on decouple of
>>>> "DistanceMeasure"("centroidOf" depends on it) and "CentroidCluster".
>>>
>>>I don't see why we need both "Cluster" and "CentroidCluster".
>>>Indeed, as suggested before, the "center" can be computed
>>>from a "Cluster", but does not need to be stored in it.
>>
>> Typical usecase for a List<CentroidCluster> is, when we need classify a new
>> point,
>
>Is there a use for "Cluster" instances that do not have a "center"?

Yes, the DBScan clusters do not have centers,
Center is meaningless for DBScan clusters, 
if force the DBScan clusters to have centers,
2 or more clusters may share one center,
for example the data points graph is two concentric circles.
The commons-math userguide show us there is no center for DBScan clusers:
http://commons.apache.org/proper/commons-math/userguide/ml.html

>
>> calculate the distance of the new point to each CentroidCluster.center is
>> the simplest,
>> and the center should be cached.
>
>Yes, but not necessarily within a dedicated class.
>I agree that there should be an easy way (API-wise) to classify
>a point wrt a list of clusters but we should avoid the potential
>inconsistency of a mutable "Cluster" instance.
>Efficiency during cluster building (a.o. caching the current center)
>is a separate issue from using the resulting list of clusters.
>
>>
>>>
>>>>
>>>> Center recalculate often happens in each iteration of k-means Clustering,
>>>> always with points reassign to clusters.
>>>> We often use k-means as two pharse:
>>>> Pharse 1: Training, classify thousands of points to set of clusters.
>>>> Pharse 2: Predict, predict which cluster is best for a new point,
>>>> or add a new point to the best cluster in ClusterSet,
>>>
>>>Method "cluster" returns a "List<Cluster>"; there is no need for a
>>>new "ClusterSet" class.
>>>Also, IIUC the centers can be collected into a "List<Clusterable>",
>>>so that the association is through the index into the list(s).
>>>
>>>> but we never update the cluster center until next retraining.
>>>
>>>IMO, that's the reason for *not*" storing the center (in such a
>>>mutable instance).
>>>
>>>>
>>>> The KMeansPlusPlusClusterer and other Cluster algorithm in "commons-math"
>>>> just design for pharse "Training",
>>>> it is clearly if we can consider "CentroidCluster" as a pure data class
>>>> just for k-means clustering result.
>>>
>>>See above.
>>>
>>>Discussing the existing design further, I think that the "cluster" method
>>> should
>>>rather be:
>>>---CUT---
>>>public List<Cluster<T>> cluster(Collections<T> points, DistanceMeasure
>>> dist)
>>>---CUT---
>>>
>>>And, similarly,
>>>---CUT---
>>>@FunctionalInterface
>>>public interface ClusterFinder<T extends Clusterable> {
>>>    public int indexOf(T point, List<Cluster<T> clusters, DistanceMeasure
>>> dist);
>>>}
>>>---CUT---
>>
>> I think there is a balance between flexibility, efficiency and usability.
>> The DistanceMeasure should be assigned once for a kmeans,
>
>Of course, during learning, only one distance measure is to
>be used.
>But it does not imply that it must be an instance field of the
>clustering algorithm.

I agree the DistanceMeasure should be a argument of Clusterer.cluster,
 as you metioned above.
It is a API change, how do we start it?

>
>> and can not change
>> in exists clusters.
>
>The "Cluster" class does not define a "DistanceMeasure" field.
>
>> So DistanceMeasure should be a property of Cluster, but redundant for a List
>> of Cluster.
>
>Above, you said that it must be a property of "kmeans"...
>
>> Consider the Cluster fewly use separately, the DistanceMeasure should be a
>> readonly property of List<Cluster>,
>
>Why?
>shouldn't a user be able to query the same list of clusters
>using different distance measures?

The distance measure has significant effect in clustering,
 especial on high dimension data.
It should not be changed through training and predict.
Different distance measures in exists clusters is meaningless.

>
>> If there is a ClusterSet, so we can serialize and unserialize the Clusters
>> correctly and with no redundancy.
>
>I don't know about "ClusterSet"; what is its interface?
>
>Anyways, "serialization" is yet another issue, and it should be
>dealt with separately from the usage API.
>Do you suggest to (de)serialize "Cluster" objects (i.e. instances
>of an unknown type)?
>
>> As a comparison, dl4j's design is easy to use for predict, especial the
>> ClusterSet
>> but leak of  flexibility(stroke coupling to nd4j) and efficiency(complexity
>> on nd4j implemention):
>> https://github.com/eclipse/deeplearning4j/tree/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering
>
>Please show stripped down versions of the API that illustrates
>the points which you want to highlight.  Thanks.
>
I think there are two graceful design in DL4j:
1. The "ClusterSet" class is useful;
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/cluster/ClusterSet.java
2. The kmeans++ is not the only center initialize algorithm in "KMeansClustering".
https://github.com/eclipse/deeplearning4j/blob/master/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/kmeans/KMeansClustering.java

>>>
>>>> If we want the cluster result useful enough for parse "Predict",
>>>>  the result of "KMeansPlusPlusClusterer.cluster" should return a
>>>> "ClusterSet":
>>>> ```java
>>>> public interface ClusterSet<T extends Clusterable> extends Collection<T>
>>>> {
>>>>   // Retrun the cluster which the point should belong to.
>>>>   Cluster predict(T point);
>>>>   // Add a point to best cluster.
>>>>   void addPoint(T point);
>>>> }
>>>> ```
>>>
>>>This  "ClusterSet" seems less flexible than a "List<Cluster>".
>>
>> "ClusterSet" is useful for predict as I mentioned above.
>
>Cf. previous comment.
>
>>>
>>>> And "centroidOf"(just used in clustering iteration) can move up into a
>>>> abstract class like "CenroidClusterer".
>>>
>>>It seems that this method could be useful for users too.
>>>
>>
>> May be "centroidOf" is useful, but now it is just used in kmeans.
>
>Is "Kmeans" the only algorithm that computes clusters centers?
>Do users never need to compute a center?
It seems "centroidOf" only use in KMeansPlusPlusClusterer,
and other centroid clusterer like MultiKMeansPlusPlusClusterer uses KMeansPlusPlusClusterer.
The "ClusterEvaluator.centroidOf" is partial duplicate.

>
>Best regards,
>Gilles
>
>>>> >It would seem that "center" should be a property computed
>>>> >from the contents of "Cluster" e.g.:
>>>> >
>>>> >@FunctionalInterface
>>>> >public interface ClusterCenterComputer<T extends Clusterable> {
>>>> >    T centroidOf(Cluster<T> cluster);
>>>> >}
>>>> >
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

Le jeu. 5 mars 2020 à 13:23, [hidden email] <[hidden email]> a écrit :
>
> [...]
> It is a API change, how do we start it?

I'd suggest putting the new API in a (temporary) package
  org.apache.math4.ml.clustering2

Regards,
Gilles

> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12