Re: [math] Discuss: New feature MiniBatchKMeansClusterer

classic Classic list List threaded Threaded
27 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hi Tao.

I've merged PR #128 but please see my comment on the JIRA page.[1]

Thanks for your interest in improving the library,
Gilles

[1] https://issues.apache.org/jira/browse/MATH-1509?focusedCommentId=17064306&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17064306

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

    I have started 2 PRs to solve the problem you metioned.

    About the "CentroidInitializer" I have a new idea:
Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
and add a construct parameter and a property "useKMeansPlusPlus" to "KMeansPlusPlusCluster":
```java
// Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
public class KMeansPlusPlusClusterer<T extends Clusterable> extends Clusterer<T> {
    public KMeansPlusPlusClusterer(final int k, final int maxIterations,
                               final DistanceMeasure measure,
                               final UniformRandomProvider random,
                               final EmptyClusterStrategy emptyStrategy,
+                             final useKMeansPlusPlus) {
    // ...
-  // Use K-means++ to choose the initial centers.
-  this.centroidInitializer = new KMeansPlusPlusCentroidInitializer(measure, random);
+  this.useKMeansPlusPlus = useKMeansPlusPlus;
}

public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}

// Make "chooseInitialCenters" package-private and call "CentroidInitializer.selectCentroids"
// Then the chooseInitialCenters can be reused by "MiniBatchKMeans".
List<CentroidCluster<T>> chooseInitialCenters(final Collection<T> points){
    // Use K-means++ to choose the initial centers.
    final CentroidInitializer centroidInitializer = useKMeansPlusPlus?
                            new KMeansPlusPlusCentroidInitializer(this.measure, this.random)
                            :new RandomCentroidInitializer(this.random);
    return centroidInitializer.selectCentroids(points, this.k);
}

// Make CentroidInitializer private
private static interface CentroidInitializer {
    <T extends Clusterable> List<CentroidCluster<T>> selectCentroids(final Collection<T> points, final int k);
}
private static class RandomCentroidInitializer implements CentroidInitializer {...}
private static class KMeansPlusPlusCentroidInitializer implements CentroidInitializer {...}
```

The "CentroidInitializer" only used in "KMeansPlusPlusClusterer" and "MiniBatchKMeans",
 the other k-means based algorithm use "KMeansPlusPlusClusterer" as a parameter.
```java
// Changes in "MiniBatchKMeansClusterer"
public class MiniBatchKMeansClusterer<T extends Clusterable>
    public MiniBatchKMeansClusterer(final int k,
                                    final int maxIterations,
                                    final int batchSize,
                                    final int initIterations,
                                    final int initBatchSize,
                                    final int maxNoImprovementTimes,
                                    final DistanceMeasure measure,
                                    final UniformRandomProvider random,
                                    final EmptyClusterStrategy emptyStrategy,
+                                  final useKMeansPlusPlus) {
-       super(k, maxIterations, measure, random, emptyStrategy);
+      super(k, maxIterations, measure, random, emptyStrategy, useKMeansPlusPlus);
        //...
    }
    //...
    private List<CentroidCluster<T>> initialCenters(final List<T> points) {
        //...
-      final List<CentroidCluster<T>> clusters = getCentroidInitializer().selectCentroids(initialPoints, getK());
+     final List<CentroidCluster<T>> clusters = chooseInitialCenters(initialPoints);
        //...
    }
}
```


>Hi Tao.
>
>I've merged PR #128 but please see my comment on the JIRA page.[1]
>
>Thanks for your interest in improving the library,
>Gilles
>
>[1] https://issues.apache.org/jira/browse/MATH-1509?focusedCommentId=17064306&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17064306
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

Le mar. 24 mars 2020 à 06:39, [hidden email] <[hidden email]> a écrit :

>
> Hi,
>
>     I have started 2 PRs to solve the problem you metioned.
>
>     About the "CentroidInitializer" I have a new idea:
> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
> and add a construct parameter and a property "useKMeansPlusPlus" to "KMeansPlusPlusCluster":
> ```java
> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
> public class KMeansPlusPlusClusterer<T extends Clusterable> extends Clusterer<T> {
>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>                                final DistanceMeasure measure,
>                                final UniformRandomProvider random,
>                                final EmptyClusterStrategy emptyStrategy,
> +                             final useKMeansPlusPlus) {
>     // ...
> -  // Use K-means++ to choose the initial centers.
> -  this.centroidInitializer = new KMeansPlusPlusCentroidInitializer(measure, random);
> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
> }

What if one comes up with a third way to initialize the centroids?
If you can ensure that there is no other initialization procedure,
a boolean is fine, if not, we could still make the existing procedures
package-private (e.g. moving them in as classes defined within
"KMeansPlusPlusClusterer".

Also, in the current implementation of "KMeansPlusPlusClusterer", the
initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
Perhaps we don't want to depart from the original (?) algorithm; if so,
the new constructor could be made protected (thus simplifying the API).

> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}

Why should this method be defined?

Regards,
Gilles

> [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

>Hello.
>
>Le mar. 24 mars 2020 à 06:39, [hidden email] <[hidden email]> a écrit :
>>
>> Hi,
>>
>>     I have started 2 PRs to solve the problem you metioned.
>>
>>     About the "CentroidInitializer" I have a new idea:
>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
>> and add a construct parameter and a property "useKMeansPlusPlus" to "KMeansPlusPlusCluster":
>> ```java
>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends Clusterer<T> {
>>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>                                final DistanceMeasure measure,
>>                                final UniformRandomProvider random,
>>                                final EmptyClusterStrategy emptyStrategy,
>> +                             final useKMeansPlusPlus) {
>>     // ...
>> -  // Use K-means++ to choose the initial centers.
>> -  this.centroidInitializer = new KMeansPlusPlusCentroidInitializer(measure, random);
>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>> }
>
>What if one comes up with a third way to initialize the centroids?
>If you can ensure that there is no other initialization procedure,
>a boolean is fine, if not, we could still make the existing procedures
>package-private (e.g. moving them in as classes defined within
>"KMeansPlusPlusClusterer".

As I know the k-means has two center initialize methods, random and k-means++ so far,
use a boolean to choose which method to use is good enough for current use,
but there are two situations use need to implement the center initialize method themselves:
1. The Commoans Maths's implements is not good enough;
2. There are new center initialize methods.

>
>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>Perhaps we don't want to depart from the original (?) algorithm; if so,
>the new constructor could be made protected (thus simplifying the API).

k-means++ is the recommend center initialize method for now days,
show we let user to fall back to random choose centers, that is a question need to tradeoff.
Show we make the API simple or rich?

>
>> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>
>Why should this method be defined?

To let user get their cluster parameters, same as "getEmptyStrategy()"

>
>Regards,
>Gilles
>
>> [...]
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
2020-03-25 4:54 UTC+01:00, [hidden email] <[hidden email]>:

> Hi,
>
>>Hello.
>>
>>Le mar. 24 mars 2020 à 06:39, [hidden email] <[hidden email]> a écrit
>> :
>>>
>>> Hi,
>>>
>>>     I have started 2 PRs to solve the problem you metioned.
>>>
>>>     About the "CentroidInitializer" I have a new idea:
>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
>>> and add a construct parameter and a property "useKMeansPlusPlus" to
>>> "KMeansPlusPlusCluster":
>>> ```java
>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
>>> Clusterer<T> {
>>>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>>                                final DistanceMeasure measure,
>>>                                final UniformRandomProvider random,
>>>                                final EmptyClusterStrategy emptyStrategy,
>>> +                             final useKMeansPlusPlus) {
>>>     // ...
>>> -  // Use K-means++ to choose the initial centers.
>>> -  this.centroidInitializer = new
>>> KMeansPlusPlusCentroidInitializer(measure, random);
>>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>>> }
>>
>>What if one comes up with a third way to initialize the centroids?
>>If you can ensure that there is no other initialization procedure,
>>a boolean is fine, if not, we could still make the existing procedures
>>package-private (e.g. moving them in as classes defined within
>>"KMeansPlusPlusClusterer".
>
> As I know the k-means has two center initialize methods, random and
> k-means++ so far,
> use a boolean to choose which method to use is good enough for current use,

Indeed but the question is: Will it be future-proof?
[We don't want to break compatibility (and have to release a
new major version of the library) for having overlooked the
question which I'm asking.]

> but there are two situations use need to implement the center initialize
> method themselves:
> 1. The Commoans Maths's implements is not good enough;

As this is FLOSS, the understanding (IMO) is in that case users
would contribute back to fix what needs be.

> 2. There are new center initialize methods.

So, that would be arguing against using a boolean (?).
Cf. above (about breaking compatibility).

>>
>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>>Perhaps we don't want to depart from the original (?) algorithm; if so,
>>the new constructor could be made protected (thus simplifying the API).
>
> k-means++ is the recommend center initialize method for now days,
> show we let user to fall back to random choose centers, that is a question
> need to tradeoff.

Is there any advantage to "random" init vs "kmeans" init?
E.g. is "random" faster, yet would give similar clustering results?

> Show we make the API simple or rich?

I'd keep it simple until we fix the (IMHO) more important
issues of thread-safety and sparse data.

>>
>>> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>>
>>Why should this method be defined?
>
> To let user get their cluster parameters, same as "getEmptyStrategy()"
>

Well, obviously.  But then, perhaps obviously too, the "user" is the
one who called the constructor in the first place, and knowns the
value of all the arguments.
And if we consider the general use-case, when client code is passed
an instance as type "Clusterer", then the specific accessor methods
are not available anymore (short of using "instanceof" and a cast).


Regards,
Gilles

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

chentao106@qq.com
Hi,

>2020-03-25 4:54 UTC+01:00, [hidden email] <[hidden email]>:
>> Hi,
>>
>>>Hello.
>>>
>>>Le mar. 24 mars 2020 à 06:39, [hidden email] <[hidden email]> a écrit
>>> :
>>>>
>>>> Hi,
>>>>
>>>>     I have started 2 PRs to solve the problem you metioned.
>>>>
>>>>     About the "CentroidInitializer" I have a new idea:
>>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
>>>> and add a construct parameter and a property "useKMeansPlusPlus" to
>>>> "KMeansPlusPlusCluster":
>>>> ```java
>>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
>>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
>>>> Clusterer<T> {
>>>>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
>>>>                                final DistanceMeasure measure,
>>>>                                final UniformRandomProvider random,
>>>>                                final EmptyClusterStrategy emptyStrategy,
>>>> +                             final useKMeansPlusPlus) {
>>>>     // ...
>>>> -  // Use K-means++ to choose the initial centers.
>>>> -  this.centroidInitializer = new
>>>> KMeansPlusPlusCentroidInitializer(measure, random);
>>>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
>>>> }
>>>
>>>What if one comes up with a third way to initialize the centroids?
>>>If you can ensure that there is no other initialization procedure,
>>>a boolean is fine, if not, we could still make the existing procedures
>>>package-private (e.g. moving them in as classes defined within
>>>"KMeansPlusPlusClusterer".
>>
>> As I know the k-means has two center initialize methods, random and
>> k-means++ so far,
>> use a boolean to choose which method to use is good enough for current use,
>
>Indeed but the question is: Will it be future-proof?
>[We don't want to break compatibility (and have to release a
>new major version of the library) for having overlooked the
>question which I'm asking.]
>
>> but there are two situations use need to implement the center initialize
>> method themselves:
>> 1. The Commoans Maths's implements is not good enough;
>
>As this is FLOSS, the understanding (IMO) is in that case users
>would contribute back to fix what needs be.
>
>> 2. There are new center initialize methods.
>
>So, that would be arguing against using a boolean (?).
>Cf. above (about breaking compatibility).
The name "KMeansPlusPlusClusterer" decide it won't support new initialize methods.
Whether "KMeansPlusPlusClusterer" compatible random 

>
>>>
>>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
>>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
>>>Perhaps we don't want to depart from the original (?) algorithm; if so,
>>>the new constructor could be made protected (thus simplifying the API).
>>
>> k-means++ is the recommend center initialize method for now days,
>> show we let user to fall back to random choose centers, that is a question
>> need to tradeoff.
>
>Is there any advantage to "random" init vs "kmeans" init?
>E.g. is "random" faster, yet would give similar clustering results?

k-means++ is an improvement of k-means(random init).
The performance loss by using "k-means++" is tinily relative to entire clustering process.
I think keep old state is OK(make chooseInitialCenters package-private),
but now "CentroidInitializer" is a middle state.

>
>> Show we make the API simple or rich?
>
>I'd keep it simple until we fix the (IMHO) more important
>issues of thread-safety and sparse data.

Some Personal opinion about "Sparse data":
Math ML do not use a martix for clustering, but list of vectors(double arrays),
 this is flexible and simple (This is the main attraction of Math ML to me).

Operations on sparse martixs can be faster than common martixs,
but not that faster on sparse vectors, but the performance improvement will bring complexity.

Anyhow if we want to start this optimization,  I think there are 2 ways:
1. Define a abstract Martix, and make Cluser API work on Martix, this is a big change.
2. Defina a abstract Vector, and replace double[] with Vector, implement a series of operations on Vector, include Sparse Vector.

>
>>>
>>>> public boolean isUseKMeansPlusPlus() {return this.useKMeansPlusPlus;}
>>>
>>>Why should this method be defined?
>>
>> To let user get their cluster parameters, same as "getEmptyStrategy()"
>>
>
>Well, obviously.  But then, perhaps obviously too, the "user" is the
>one who called the constructor in the first place, and knowns the
>value of all the arguments.
>And if we consider the general use-case, when client code is passed
>an instance as type "Clusterer", then the specific accessor methods
>are not available anymore (short of using "instanceof" and a cast).
>
>
>Regards,
>Gilles
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [hidden email]
>For additional commands, e-mail: [hidden email]
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Reply | Threaded
Open this post in threaded view
|

Re: Re: [math] Discuss: New feature MiniBatchKMeansClusterer

Gilles Sadowski-2
Hello.

> > [...]
> >>>>
> >>>>     I have started 2 PRs to solve the problem you metioned.
> >>>>
> >>>>     About the "CentroidInitializer" I have a new idea:
> >>>> Move CentroidInitializers as inner classes of "KMeansPlusPlusCluster",
> >>>> and add a construct parameter and a property "useKMeansPlusPlus" to
> >>>> "KMeansPlusPlusCluster":
> >>>> ```java
> >>>> // Add "useKMeansPlusPlus" to "KMeansPlusPlusClusterer"
> >>>> public class KMeansPlusPlusClusterer<T extends Clusterable> extends
> >>>> Clusterer<T> {
> >>>>     public KMeansPlusPlusClusterer(final int k, final int maxIterations,
> >>>>                                final DistanceMeasure measure,
> >>>>                                final UniformRandomProvider random,
> >>>>                                final EmptyClusterStrategy emptyStrategy,
> >>>> +                             final useKMeansPlusPlus) {
> >>>>     // ...
> >>>> -  // Use K-means++ to choose the initial centers.
> >>>> -  this.centroidInitializer = new
> >>>> KMeansPlusPlusCentroidInitializer(measure, random);
> >>>> +  this.useKMeansPlusPlus = useKMeansPlusPlus;
> >>>> }
> >>>
> >>>What if one comes up with a third way to initialize the centroids?
> >>>If you can ensure that there is no other initialization procedure,
> >>>a boolean is fine, if not, we could still make the existing procedures
> >>>package-private (e.g. moving them in as classes defined within
> >>>"KMeansPlusPlusClusterer".
> >>
> >> As I know the k-means has two center initialize methods, random and
> >> k-means++ so far,
> >> use a boolean to choose which method to use is good enough for current use,
> >
> >Indeed but the question is: Will it be future-proof?
> >[We don't want to break compatibility (and have to release a
> >new major version of the library) for having overlooked the
> >question which I'm asking.]
> >
> >> but there are two situations use need to implement the center initialize
> >> method themselves:
> >> 1. The Commoans Maths's implements is not good enough;
> >
> >As this is FLOSS, the understanding (IMO) is in that case users
> >would contribute back to fix what needs be.
> >
> >> 2. There are new center initialize methods.
> >
> >So, that would be arguing against using a boolean (?).
> >Cf. above (about breaking compatibility).
> The name "KMeansPlusPlusClusterer" decide it won't support new initialize methods.
> Whether "KMeansPlusPlusClusterer" compatible random

Sorry, but I don't understand what you mean.

> >>>
> >>>Also, in the current implementation of "KMeansPlusPlusClusterer", the
> >>>initialization is not configurable ("KMeansPlusPlusCentroidInitializer").
> >>>Perhaps we don't want to depart from the original (?) algorithm; if so,
> >>>the new constructor could be made protected (thus simplifying the API).
> >>
> >> k-means++ is the recommend center initialize method for now days,
> >> show we let user to fall back to random choose centers, that is a question
> >> need to tradeoff.
> >
> >Is there any advantage to "random" init vs "kmeans" init?
> >E.g. is "random" faster, yet would give similar clustering results?
>
> k-means++ is an improvement of k-means(random init).

Do I understand correctly that to call the algorithm KMeans++,
the only change (from "Kmeans") is the centroid initialization
(using "KMeans" vs using "random")?
If so, it would be a contradiction (name vs functionality) to allow
"random" in a class called "KMeansPlusPlusClusterer".

So it seems the alternatives are:
 (1) Drop "random" init (and keep "KMeansPlusPlusClusterer") if
you are positive that this functionality won"t be missed.
 (2) Allow flexibility for init (and rename "KMeansPlusPlusClusterer"
to "KMeansClusterer"), adding a boolean argument to the constructor:
"doKMeansPlusPlus" to select "random" (false) or "kmeans" (true).
 (3) Allow flexibility by passing a function to perform the init.

According to what you wrote, (1) is the better alternative right
now. Option (3) might (perhaps!) become a nice enhancement,
but only after we settle on a better design for the fundamental
classes ("Cluster", "ClusterSet", "ClusterResult", ...).

> The performance loss by using "k-means++" is tinily relative to entire clustering process.
> I think keep old state is OK(make chooseInitialCenters package-private),
> but now "CentroidInitializer" is a middle state.

+1 if you mean that you'll remove, for now, the package "initialization".
Please submit a PR.

> >
> >> Show we make the API simple or rich?
> >
> >I'd keep it simple until we fix the (IMHO) more important
> >issues of thread-safety and sparse data.
>
> Some Personal opinion about "Sparse data":
> Math ML do not use a martix for clustering, but list of vectors(double arrays),
>  this is flexible and simple (This is the main attraction of Math ML to me).
>
> Operations on sparse martixs can be faster than common martixs,
> but not that faster on sparse vectors, but the performance improvement will bring complexity.
>
> Anyhow if we want to start this optimization,  I think there are 2 ways:
> 1. Define a abstract Martix, and make Cluser API work on Martix, this is a big change.
> 2. Defina a abstract Vector, and replace double[] with Vector, implement a series of operations on Vector, include Sparse Vector.

Unless I'm missing some point, this is going astray; my mentioning
of "sparse data" is related to issue MATH-1330.[1]  IIUC the concern
there was that it should be possible to process the data without
forcing the instantiation of a "double[]".


Best,
Gilles

[1] https://issues.apache.org/jira/browse/MATH-1330

> > [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

12