[statistics][descriptive] Classes or static methods for common descriptive statistics?

Eric Barnhill
The previous commons-math interface for descriptive statistics used a
paradigm of constructing classes for various statistical functions and
calling evaluate(). Example

Mean mean = new Mean();
double mn = mean.evaluate(double[])

I wrote this type of code all through grad school and always found it
unnecessarily bulky.  To me these summary statistics are classic use cases
for static methods:

double mean = Mean.evaluate(double[])

I don't have any particular problem with the evaluate() syntax.

I looked over the old Math 4 API to see if there were any benefits to the
previous class-oriented approach that we might not want to lose. But I
don't think there were; the functionality outside of evaluate() is minimal.

Finally we should consider whether we really need a separate class for each
statistic at all. Do we want to call:

Mean.evaluate()

or

SummaryStats.mean()

or maybe

Stats.mean() ?

The last being nice and compact.

Let's make a decision so our esteemed mentee Virendra knows in what
direction to take his work this summer. :)

Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?

Alex Herbert


> On 28 May 2019, at 18:09, Eric Barnhill <[hidden email]> wrote:
>
> The previous commons-math interface for descriptive statistics used a
> paradigm of constructing classes for various statistical functions and
> calling evaluate(). Example
>
> Mean mean = new Mean();
> double mn = mean.evaluate(double[])
>
> I wrote this type of code all through grad school and always found it
> unnecessarily bulky.  To me these summary statistics are classic use cases
> for static methods:
>
> double mean = Mean.evaluate(double[])
>
> I don't have any particular problem with the evaluate() syntax.
>
> I looked over the old Math 4 API to see if there were any benefits to the
> previous class-oriented approach that we might not want to lose. But I
> don't think there were; the functionality outside of evaluate() is minimal.

A quick check shows that evaluate comes from UnivariateStatistic. This has some more methods that add little to an instance view of the computation:

double evaluate(double[] values) throws MathIllegalArgumentException;
double evaluate(double[] values, int begin, int length) throws MathIllegalArgumentException;
UnivariateStatistic copy();

However it is extended by StorelessUnivariateStatistic which adds methods to update the statistic:

void increment(double d);
void incrementAll(double[] values) throws MathIllegalArgumentException;
void incrementAll(double[] values, int start, int length) throws MathIllegalArgumentException;
double getResult();
long getN();
void clear();
StorelessUnivariateStatistic copy();

This type of functionality would be lost by static methods.

If you are moving to a functional interface type pattern for each statistic then you will lose the other functionality possible with an instance state, namely updating with more values or combining instances.

So this is a question of whether updating a statistic is required after the first computation.
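
To make the trade-off concrete, here is a minimal sketch of an updatable mean. The names RunningMeanDemo and RunningMean are hypothetical and only loosely mirror the increment()/getResult()/getN()/clear() methods listed above; this is not the Commons Math implementation.

```java
public class RunningMeanDemo {

    /** Hypothetical storeless mean: keeps only a count and the current mean. */
    static final class RunningMean {
        private long n;
        private double mean;

        /** Update the mean with one more value. */
        void increment(double d) {
            n++;
            mean += (d - mean) / n;
        }

        /** Current mean, or NaN when no values have been added. */
        double getResult() {
            return n == 0 ? Double.NaN : mean;
        }

        long getN() {
            return n;
        }

        void clear() {
            n = 0;
            mean = 0;
        }
    }

    public static void main(String[] args) {
        RunningMean m = new RunningMean();
        m.increment(1);
        m.increment(2);
        System.out.println(m.getResult()); // 1.5
        m.increment(6);                    // update after the first result
        System.out.println(m.getResult()); // 3.0
    }
}
```

A static method can return the first result, but it has nowhere to keep the count and mean needed for the later update.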

Will there be an alternative in the library for a map-reduce type operation using instances that can be combined with DoubleStream.collect? Its signature is:

    <R> R collect(Supplier<R> supplier,
                  ObjDoubleConsumer<R> accumulator,
                  BiConsumer<R, R> combiner);

Here <R> would be Mean:

double mean = Arrays.stream(new double[1000]).collect(Mean::new, Mean::add, Mean::add).getMean();

where Mean provides:

void add(double value);
void add(Mean other);
double getMean();

(Untested code)
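
For reference, a runnable version of the idea might look like the following. The Mean class here is a hypothetical stand-in with exactly the three methods named above, and it uses a naive sum rather than whatever algorithm the library would choose.

```java
import java.util.stream.DoubleStream;

public class MeanCollectDemo {

    /** Hypothetical mutable container usable with DoubleStream.collect. */
    static final class Mean {
        private long n;
        private double sum;

        /** Accumulator: add a single value. */
        void add(double value) {
            n++;
            sum += value;
        }

        /** Combiner: merge the state of another instance. */
        void add(Mean other) {
            n += other.n;
            sum += other.sum;
        }

        double getMean() {
            return n == 0 ? Double.NaN : sum / n;
        }
    }

    static double mean(double... values) {
        // The two Mean::add references resolve to different overloads.
        return DoubleStream.of(values)
                           .collect(Mean::new, Mean::add, Mean::add)
                           .getMean();
    }

    public static void main(String[] args) {
        System.out.println(mean(1, 2, 3, 4)); // 2.5
    }
}
```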


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?

Gilles Sadowski-2
Hello.

On Tue, 28 May 2019 at 20:36, Alex Herbert <[hidden email]> wrote:

> A quick check shows that evaluate comes from UnivariateStatistic. This has some more methods that add little to an instance view of the computation:
>
> double evaluate(double[] values) throws MathIllegalArgumentException;
> double evaluate(double[] values, int begin, int length) throws MathIllegalArgumentException;
> UnivariateStatistic copy();
>
> However it is extended by StorelessUnivariateStatistic which adds methods to update the statistic:
>
> void increment(double d);
> void incrementAll(double[] values) throws MathIllegalArgumentException;
> void incrementAll(double[] values, int start, int length) throws MathIllegalArgumentException;
> double getResult();
> long getN();
> void clear();
> StorelessUnivariateStatistic copy();
>
> This type of functionality would be lost by static methods.
>
> If you are moving to a functional interface type pattern for each statistic then you will lose the other functionality possible with an instance state, namely updating with more values or combining instances.
>
> So this is a question of whether updating a statistic is required after the first computation.
>
> Will there be an alternative in the library for a map-reduce type operation using instances that can be combined using Stream.collect:
>
>     <R> R collect(Supplier<R> supplier,
>                   ObjDoubleConsumer<R> accumulator,
>                   BiConsumer<R, R> combiner);
>
> Here <R> would be Mean:
>
> double mean = Arrays.stream(new double[1000]).collect(Mean::new, Mean::add, Mean::add).getMean() with:
>
> void add(double);
> void add(Mean);
> double getMean();
>
> (Untested code)

I'm not sure I understand the implicit conclusions of this conversation
and the other one there:
    https://markmail.org/message/7dmyhzuy6lublyb5

Do we agree that the core issue is *not* how to compute a mean, or a
median, or a fourth moment, but how any and all of those can be
computed seamlessly through a functional API (stream)?

As Alex pointed out, a useful functionality is the ability to "combine"
instances, e.g. if data are collected by several threads.
A potential use-case is the retrieval of the current value of (any)
statistical quantities while the data continues to be collected.
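
As a rough illustration of the "combine" use case, the sketch below lets each worker fill its own accumulator and merges them afterwards. It uses the JDK's DoubleSummaryStatistics as a stand-in for a library statistic; the class and method names CombineAcrossThreadsDemo and summarise are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class CombineAcrossThreadsDemo {

    /** Summarise each chunk asynchronously, then merge the partial results. */
    static DoubleSummaryStatistics summarise(List<double[]> chunks) {
        List<CompletableFuture<DoubleSummaryStatistics>> partials = new ArrayList<>();
        for (double[] chunk : chunks) {
            partials.add(CompletableFuture.supplyAsync(() -> {
                // Each worker owns its accumulator, so no locking is needed.
                DoubleSummaryStatistics s = new DoubleSummaryStatistics();
                for (double v : chunk) {
                    s.accept(v);
                }
                return s;
            }));
        }
        DoubleSummaryStatistics total = new DoubleSummaryStatistics();
        for (CompletableFuture<DoubleSummaryStatistics> f : partials) {
            total.combine(f.join()); // merge one worker's state
        }
        return total;
    }

    public static void main(String[] args) {
        List<double[]> chunks = Arrays.asList(new double[] {1, 2}, new double[] {3, 4, 5});
        DoubleSummaryStatistics s = summarise(chunks);
        System.out.println(s.getCount() + " " + s.getAverage() + " " + s.getMax());
    }
}
```

This version waits for every worker to finish; reading a combined value while collection is still running would need extra synchronisation that is not shown here.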

An initial idea would be:
public interface StatQuantity {
    public double value(double[] data); // For "basic" usage.
    public double value(DoubleStream data); // For "advanced" usage.
}

public class StatCollection {
    /** Specify which quantities this collection will hold/compute. */
    public StatCollection(Map<String, StatQuantity> stats) { /*... */ }

    /**
     * Start a worker thread.
     * @param data Values for which the stat quantities must be computed.
     */
    public void startCollector(DoubleStream data) { /* ... */ }

    /** Combine current state of workers. */
    public void collect() { /* ... */ }

    /** @return the current (combined) value of a named quantity. */
    public double get(String name) { /* ... */ }

    private class StatCollector implements Callable<Void> {
        StatCollector(DoubleStream data) { /* ... */ }

        @Override
        public Void call() { /* ... */ return null; }
    }
}

This is all totally untested, very partial, and probably wrong-headed but
I thought that we were looking at this kind of refactoring.

Regards,
Gilles

Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?

Alex Herbert

On 29/05/2019 12:50, Gilles Sadowski wrote:

> I'm not sure I understand the implicit conclusions of this conversation
> and the other one there:
>      https://markmail.org/message/7dmyhzuy6lublyb5
>
> Do we agree that the core issue is *not* how to compute a mean, or a
> median, or a fourth moment, but how any and all of those can be
> computed seamlessly through a functional API (stream)?
>
> As Alex pointed out, a useful functionality is the ability to "combine"
> instances, e.g. if data are collected by several threads.
> A potential use-case is the retrieval of the current value of (any)
> statistical quantities while the data continues to be collected.
>
> An initial idea would be:
> public interface StatQuantity {
>      public double value(double[]); // For "basic" usage.
>      public double value(DoubleStream); // For "advanced" usage.
> }
>
> public class StatCollection {
>      /** Specify which quantities this collection will hold/compute. */
>      public StatCollection(Map<String, StatQuantity> stats) { /*... */ }
>
>      /**
>       * Start a worker thread.
>       * @param data Values for which the stat quantities must be computed.
>       */
>      public void startCollector(DoubleStream data) { /* ... */ }
>
>      /** Combine current state of workers. */
>      public void collect() { /* ... */ }
>
>      /** @return the current (combined) value of a named quantity. */
>      public double get(String name) { /* ... */ }
>
>      private StatCollector implements Callable {
>          StatCollector(DoubleStream data) { /* ... */ }
>      }
> }
>
> This is all totally untested, very partial, and probably wrong-headed but
> I thought that we were looking at this kind of refactoring.
>
> Regards,
> Gilles

I don't think you can pass in a Stream to be worked on. The Stream API
requires that you pass something into the stream, and the stream contents
are then changed (intermediate operation) or consumed (terminal
operation). The stream pipeline is activated only when a terminal
operation is invoked.
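
The lazy activation is easy to check with a small experiment. In this sketch (the class name LazyStreamDemo is hypothetical) a counter inside peek() stays at zero until the terminal sum() call runs the pipeline.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.DoubleStream;

public class LazyStreamDemo {

    /** Returns {elements seen before the terminal operation, elements seen after it}. */
    static int[] countSeen() {
        AtomicInteger seen = new AtomicInteger();

        // Intermediate operation only: nothing is processed yet.
        DoubleStream pipeline = DoubleStream.of(1, 2, 3)
                                            .peek(v -> seen.incrementAndGet());
        int before = seen.get();

        // Terminal operation: the pipeline runs now.
        pipeline.sum();
        int after = seen.get();

        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] counts = countSeen();
        System.out.println(counts[0] + " " + counts[1]); // 0 3
    }
}
```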

So the new classes have to be usable in intermediate and terminal
operations.

If the idea of the refactoring was to move all the old API to a new API
that can be used with streams then each Statistic should be based on
ideas presented in:

java.util.DoubleSummaryStatistics
java.util.IntSummaryStatistics
java.util.LongSummaryStatistics

Each of which implements, respectively:

DoubleConsumer
IntConsumer
LongConsumer

Plus:

- a method for combining with themselves
- an empty constructor (to act as a Supplier of the object)

So this would require:

public interface DoubleStatQuantity extends DoubleConsumer {
   // inherits:
   // public void accept(double value)
   public void combine(DoubleStatQuantity other);
   public double value();
   public DoubleStatQuantity newInstance();
}

Note that the combine method would have to check the input is of the
correct type. This can be fixed using self-types with Java generics [1]:

public interface DoubleStatQuantity<B extends DoubleStatQuantity<B>> extends DoubleConsumer {
   public void combine(B other);
   public double value();
   public B newInstance();
}

public class Max implements DoubleStatQuantity<Max> {
   private double max = Double.NEGATIVE_INFINITY;

   @Override
   public void accept(double value) {
     max = Math.max(max, value);
   }

   @Override
   public void combine(Max other) {
     max = Math.max(max, other.max);
   }

   @Override
   public double value() {
     return max;
   }

   @Override
   public Max newInstance() {
     return new Max();
   }
}
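
A self-typed statistic of this shape can be passed to DoubleStream.collect without any casts. The following is a condensed, self-contained sketch of that usage; the wrapper class SelfTypedMaxDemo and the helper max(double...) are hypothetical, and newInstance() is omitted because collect only needs the constructor reference.

```java
import java.util.function.DoubleConsumer;
import java.util.stream.DoubleStream;

public class SelfTypedMaxDemo {

    /** Self-typed interface: combine() only accepts the implementing type. */
    interface DoubleStatQuantity<B extends DoubleStatQuantity<B>> extends DoubleConsumer {
        void combine(B other);
        double value();
    }

    static final class Max implements DoubleStatQuantity<Max> {
        private double max = Double.NEGATIVE_INFINITY;

        @Override
        public void accept(double value) {
            max = Math.max(max, value);
        }

        @Override
        public void combine(Max other) {
            max = Math.max(max, other.max);
        }

        @Override
        public double value() {
            return max;
        }
    }

    static double max(double... values) {
        return DoubleStream.of(values)
                           .collect(Max::new, Max::accept, Max::combine)
                           .value();
    }

    public static void main(String[] args) {
        System.out.println(max(3, -1, 7.5)); // 7.5
    }
}
```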

It is a matter of opinion on whether this is readable. It is probably
why the JDK implementations offer the functionality but do not declare
it in a generic way.

The StatCollection would then be:

public class StatCollection implements DoubleConsumer {
   private Map<String, DoubleStatQuantity<?>> stats;

   /** Specify which quantities this collection will hold/compute. */
   public StatCollection(Map<String, DoubleStatQuantity<?>> stats) {
     this.stats = stats;
   }

   @Override
   public void accept(double value) {
     stats.values().forEach(stat -> stat.accept(value));
   }

   /** @return the current value of a named quantity. */
   public double get(String name) {
     return stats.get(name).value();
   }
}

A more performant implementation is required (based on a list and stats
by index) but this is the idea.
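
One possible reading of "based on a list and stats by index" is sketched here. It is an assumption about the intended design, uses a simplified non-generic interface, and all class names are hypothetical.

```java
import java.util.function.DoubleConsumer;
import java.util.stream.DoubleStream;

public class IndexedStatsDemo {

    /** Simplified, non-generic stand-in for the interface discussed above. */
    interface DoubleStatQuantity extends DoubleConsumer {
        double value();
    }

    static final class Sum implements DoubleStatQuantity {
        private double sum;
        @Override public void accept(double v) { sum += v; }
        @Override public double value() { return sum; }
    }

    static final class Max implements DoubleStatQuantity {
        private double max = Double.NEGATIVE_INFINITY;
        @Override public void accept(double v) { max = Math.max(max, v); }
        @Override public double value() { return max; }
    }

    /** Holds the statistics in an array and looks them up by position. */
    static final class StatCollection implements DoubleConsumer {
        private final DoubleStatQuantity[] stats;

        StatCollection(DoubleStatQuantity... stats) {
            this.stats = stats.clone();
        }

        @Override
        public void accept(double value) {
            for (DoubleStatQuantity s : stats) {
                s.accept(value);
            }
        }

        double get(int index) {
            return stats[index].value();
        }
    }

    public static void main(String[] args) {
        StatCollection stats = new StatCollection(new Sum(), new Max());
        DoubleStream.of(1, 2, 3).forEach(stats); // the collection is itself a DoubleConsumer
        System.out.println(stats.get(0) + " " + stats.get(1)); // 6.0 3.0
    }
}
```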

Note that with the generic <?> it is not easily possible for
StatCollection to implement DoubleStatQuantity<StatCollection> as the
combine method ends up having to combine stats of type
DoubleStatQuantity<?>. I've not tried but expect a runtime exception if
the classes are different when combined.

I prefer the route of adding a set of classes that implement the current
algorithms, support DoubleConsumer and have a combine method. These
requirements could be specified by an interface:

public interface DoubleStatQuantity extends DoubleConsumer {
   // inherits:
   // public void accept(double value)
   public void combine(DoubleStatQuantity other);
   public double value();
}

If the other instance cannot be combined, the combine method should do
one of two things: throw, or ignore the input.
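
A sketch of the "throw" option, assuming the non-generic interface above: the implementation checks the runtime type before merging. The wrapper class CombineCheckDemo, the Sum and Min classes, and the choice of IllegalArgumentException are illustrative assumptions only.

```java
import java.util.function.DoubleConsumer;

public class CombineCheckDemo {

    interface DoubleStatQuantity extends DoubleConsumer {
        void combine(DoubleStatQuantity other);
        double value();
    }

    static final class Sum implements DoubleStatQuantity {
        private double sum;

        @Override
        public void accept(double value) {
            sum += value;
        }

        @Override
        public void combine(DoubleStatQuantity other) {
            // Without self-types the check has to happen at runtime.
            if (!(other instanceof Sum)) {
                throw new IllegalArgumentException("Incompatible statistic: " + other);
            }
            sum += ((Sum) other).sum;
        }

        @Override
        public double value() {
            return sum;
        }
    }

    static final class Min implements DoubleStatQuantity {
        private double min = Double.POSITIVE_INFINITY;

        @Override
        public void accept(double value) {
            min = Math.min(min, value);
        }

        @Override
        public void combine(DoubleStatQuantity other) {
            if (!(other instanceof Min)) {
                throw new IllegalArgumentException("Incompatible statistic: " + other);
            }
            min = Math.min(min, ((Min) other).min);
        }

        @Override
        public double value() {
            return min;
        }
    }

    public static void main(String[] args) {
        Sum a = new Sum();
        a.accept(1);
        Sum b = new Sum();
        b.accept(2);
        a.combine(b);
        System.out.println(a.value()); // 3.0
        try {
            a.combine(new Min());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
    }
}
```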

Note that the interface is not strictly necessary other than to
support a generic combined statistics collection. However this may be
better served with a dedicated class to compute combined statistics (as
per the JDK classes) in order to eliminate duplication in the algorithm.

Keeping the idea simple and based on the current JDK implementations
would allow the port to begin. It can always be made more explicit with
interfaces to specify operations later.


[1] https://www.sitepoint.com/self-types-with-javas-generics/

Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?

Eric Barnhill
In reply to this post by Gilles Sadowski-2
At the end of the day, like we just saw on the user list today, users are
going to come around with arrays and want to get the mean, median,
variance, or quantiles of that array. The easiest way to do this is to have
some sort of static method that delivers these:

double mean = Stats.mean(double[] data)

and the user doesn't have to think more than that. Yes, this should be
implemented functionally, although in this simple case we probably just
need to call Java's SummaryStats() under the hood. If we overcomplicate
this, again like we just saw on the user list, users will simply not use
the code.
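
For the simple array case, such a helper could be as small as the sketch below. It assumes "Java's SummaryStats()" refers to java.util.DoubleSummaryStatistics, obtained here through DoubleStream.summaryStatistics(); the Stats class itself is hypothetical.

```java
import java.util.Arrays;

public final class Stats {

    private Stats() {
        // Utility class, not meant to be instantiated.
    }

    /** Arithmetic mean of the array; the JDK returns 0.0 for an empty array. */
    public static double mean(double[] data) {
        return Arrays.stream(data).summaryStatistics().getAverage();
    }

    public static void main(String[] args) {
        System.out.println(Stats.mean(new double[] {1, 2, 3, 4})); // 2.5
    }
}
```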

Then yes, I agree Alex's argument for updateable instances containing state
is compelling. How to relate these more complicated instances with the
simple cases is a great design question.

But first, let's nail the Matlab/Numpy case of just having an array of
doubles and wanting the mean / median. I am just speaking of my own use
cases here but I used exactly this functionality all the time:

Mean m = new Mean();
double mean = m.evaluate(data);

and I think this should be the central use case for the new module.


Re: [statistics][descriptive] Classes or static methods for common descriptive statistics?

Alex Herbert


> On 29 May 2019, at 21:57, Eric Barnhill <[hidden email]> wrote:
>
> At the end of the day, like we just saw on the user list today. users are
> going to come around with arrays and want to get the mean, median,
> variance, or quantiles of that array. The easiest way to do this is to have
> some sort of static method that delivers these:
>
> double mean = Stats.mean(double[] data)

This Stats class can be just a utility class with static helper methods invoking the appropriate class implementation.

All the algorithms should be in one place (to minimise code duplication).

I don’t think calling SummaryStats under the hood is the best solution for these helper methods. It does a lot more work than is necessary to compute one metric. It should be done with individual classes for each metric and an appropriate helper method for each.
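
Roughly, the arrangement could look like this: one class per metric owns the algorithm, and the static helper only delegates to it. All names are hypothetical, and the running-mean update is just a placeholder for whichever algorithm the library settles on.

```java
import java.util.function.DoubleConsumer;

public class StatsDelegationDemo {

    /** Hypothetical per-metric class holding the single copy of the algorithm. */
    static final class Mean implements DoubleConsumer {
        private long n;
        private double mean;

        @Override
        public void accept(double value) {
            n++;
            mean += (value - mean) / n;
        }

        double value() {
            return n == 0 ? Double.NaN : mean;
        }
    }

    /** Static helpers that only delegate; no algorithm lives here. */
    static final class Stats {
        private Stats() {
        }

        static double mean(double[] data) {
            Mean m = new Mean();
            for (double v : data) {
                m.accept(v);
            }
            return m.value();
        }
    }

    public static void main(String[] args) {
        System.out.println(Stats.mean(new double[] {2, 4, 6})); // 4.0
    }
}
```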

Looking at math4 this would be helpers for:

moment/FirstMoment.java
moment/FourthMoment.java
moment/GeometricMean.java
moment/Kurtosis.java
moment/Mean.java
moment/SecondMoment.java
moment/SemiVariance.java
moment/Skewness.java
moment/StandardDeviation.java
moment/ThirdMoment.java
moment/Variance.java
rank/Max.java
rank/Median.java
rank/Min.java
rank/Percentile.java
summary/Product.java
summary/Sum.java
summary/SumOfLogs.java
summary/SumOfSquares.java
DescriptiveStatistics.java (mean, variance, StdDev, Max, Min, Count, Sum, Skewness, Kurtosis, Percentile)
SummaryStatistics.java (mean, variance, StdDev, Max, Min, Count, Sum)

I left out those operating on a double[] for each increment (not a single double):

moment/VectorialCovariance.java
moment/VectorialMean.java
MultivariateSummaryStatistics.java

I also left out this one, as it is an approximation for use when the entire double[] cannot be held in memory:

rank/PSquarePercentile.java

Note that some metrics are not applicable to undefined data lengths and so cannot be written to support streams:

Median
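
The reason is visible in a textbook median, sketched below: it needs all values at once so that they can be sorted. This is the plain definition, not the math4 Median class, and the class name MedianDemo is made up for the example.

```java
import java.util.Arrays;

public class MedianDemo {

    /** Textbook median: sort a copy, then take the middle value(s). */
    static double median(double[] data) {
        if (data.length == 0) {
            return Double.NaN;
        }
        double[] sorted = data.clone(); // the whole data set must be in memory
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    public static void main(String[] args) {
        System.out.println(median(new double[] {5, 1, 3}));    // 3.0
        System.out.println(median(new double[] {4, 1, 3, 2})); // 2.5
    }
}
```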

