[Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Virendra singh Rajpurohit
I've been trying to write a summary statistics class, and I have some doubts. There is a class DoubleSummaryStatistics in the java.util package (there are two more, for int and long). I'll attach this file here.
Do I have to design SummaryStatistics in this way only? I mean, the description of DoubleSummaryStatistics is: "This class is designed to work with (though does not require) streams. For example, you can compute summary statistics on a stream of doubles with:
 
 DoubleSummaryStatistics stats = doubleStream.collect(DoubleSummaryStatistics::new,
                                                      DoubleSummaryStatistics::accept, 
                                                      DoubleSummaryStatistics::combine);"
Earlier, my understanding of the project was that the user just has to call the function "getSummary()" and all the calculations will be done automatically in streams. But as we can see, with DoubleSummaryStatistics we have to call the collect() method.
There are some functions like max, min, sum, count and average which are already defined in this class. So should I extend this class in my class or not? Also, I'll have to add more statistics beyond max, min and sum; for that I have to override the accept() function, which will be used for streams.

Warm Regards,
--
Virendra Singh Rajpurohit

University of Petroleum and Energy Studies,Dehradun







---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Alex Herbert


> On 2 Jun 2019, at 13:45, Virendra singh Rajpurohit <[hidden email]> wrote:
>
> I've been trying to write a summary statistics class, and I have some doubts. There is a class DoubleSummaryStatistics in the java.util package (there are two more, for int and long). I'll attach this file here.
> Do I have to design SummaryStatistics in this way only? I mean, the description of DoubleSummaryStatistics is: "This class is designed to work with (though does not require) streams <https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html>. For example, you can compute summary statistics on a stream of doubles with:
>  
>  DoubleSummaryStatistics stats = doubleStream.collect(DoubleSummaryStatistics::new,
>                                                       DoubleSummaryStatistics::accept,
>                                                       DoubleSummaryStatistics::combine);"
> Earlier, my understanding of the project was that the user just has to call the function "getSummary()" and all the calculations will be done automatically in streams.

If you put all the work with streams inside the getSummary() function, then the user cannot decide how to build the stream (e.g. serial or parallel). So a design like the JDK class, which works with streams supplied by the caller, would be better.
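
To illustrate the point about leaving stream construction to the caller, here is a minimal sketch that uses only the JDK class. The class name SerialOrParallel, the helper summarize() and the sample data are made up for this illustration and are not part of any proposed API.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

class SerialOrParallel {
    /** Collects whatever stream the caller built; it may be serial or parallel. */
    static DoubleSummaryStatistics summarize(DoubleStream stream) {
        return stream.collect(DoubleSummaryStatistics::new,
                              DoubleSummaryStatistics::accept,
                              DoubleSummaryStatistics::combine);
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0};
        // The caller, not the statistics class, decides how the stream runs.
        DoubleSummaryStatistics serial = summarize(DoubleStream.of(data));
        DoubleSummaryStatistics parallel = summarize(DoubleStream.of(data).parallel());
        System.out.println(serial.getAverage());
        System.out.println(parallel.getMax());
    }
}
```

The same collect() call serves both cases because combine() is only invoked when the stream is actually split.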

> But as we can see, with DoubleSummaryStatistics we have to call the collect() method.
> There are some functions like max, min, sum, count and average which are already defined in this class. So should I extend this class in my class or not? Also, I'll have to add more statistics beyond max, min and sum; for that I have to override the accept() function, which will be used for streams.

You could extend this JDK class to add functionality. In the accept and combine methods, just call super.accept and super.combine, then do the extra work you require.
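
A minimal sketch of that pattern follows. The class name ExtendedSummaryStatistics and the extra statistic (a raw sum of squares) are assumptions chosen only to keep the example short; rejecting a plain JDK instance in combine() is likewise a choice of this sketch, not something settled in the thread.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

/** Hypothetical subclass: adds a sum of squares on top of the JDK statistics. */
class ExtendedSummaryStatistics extends DoubleSummaryStatistics {
    private double sumOfSquares;

    @Override
    public void accept(double value) {
        super.accept(value);            // JDK bookkeeping: count, sum, min, max
        sumOfSquares += value * value;  // extra work for the added statistic
    }

    @Override
    public void combine(DoubleSummaryStatistics other) {
        // Only our own type carries the extra state, so check before mutating.
        if (!(other instanceof ExtendedSummaryStatistics)) {
            throw new IllegalArgumentException("expected ExtendedSummaryStatistics");
        }
        super.combine(other);
        sumOfSquares += ((ExtendedSummaryStatistics) other).sumOfSquares;
    }

    public double getSumOfSquares() {
        return sumOfSquares;
    }

    public static void main(String[] args) {
        ExtendedSummaryStatistics stats = DoubleStream.of(1.0, 2.0, 3.0)
            .collect(ExtendedSummaryStatistics::new,
                     ExtendedSummaryStatistics::accept,
                     ExtendedSummaryStatistics::combine);
        System.out.println(stats.getSum());
        System.out.println(stats.getSumOfSquares());
    }
}
```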

One useful stat that is missing from the class is variance. A first addition would be to extend DoubleSummaryStatistics and add a variance (plus standard deviation) function with a variant for the population variance (or population standard deviation).

Note that a method to add a second moment to another second moment is required. This is not present in math4 AFAIK. There is this parallel variance algorithm [1] that would allow you to implement the combine() method to join two instances of your summary statistics class.
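
As a rough sketch of how such a combine() could look, the following storeless accumulator keeps a count, a mean and a second moment (the sum of squared deviations from the mean). It uses Welford's update for single values and the pairwise merge from the parallel algorithm described in [1]. The class name RunningVariance and its method names are hypothetical.

```java
import java.util.stream.DoubleStream;

/** Hypothetical storeless accumulator for count, mean and second moment. */
class RunningVariance {
    private long n;
    private double mean;
    private double m2; // sum of squared deviations from the current mean

    /** Welford's update for a single value. */
    void accept(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Pairwise merge of two partial results (parallel algorithm in [1]). */
    void combine(RunningVariance other) {
        if (other.n == 0) {
            return;
        }
        if (n == 0) {
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        m2 += other.m2 + delta * delta * ((double) n * other.n / total);
        mean += delta * other.n / total;
        n = total;
    }

    /** Sample variance (divides by n - 1); NaN for fewer than two values. */
    double getVariance() {
        return n > 1 ? m2 / (n - 1) : Double.NaN;
    }

    /** Population variance (divides by n); NaN when no values were added. */
    double getPopulationVariance() {
        return n > 0 ? m2 / n : Double.NaN;
    }

    public static void main(String[] args) {
        RunningVariance v = DoubleStream.of(2, 4, 4, 4, 5, 5, 7, 9).parallel()
            .collect(RunningVariance::new, RunningVariance::accept, RunningVariance::combine);
        System.out.println(v.getPopulationVariance());
        System.out.println(Math.sqrt(v.getPopulationVariance()));
    }
}
```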

Alex


[1] https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm


Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Eric Barnhill
In reply to this post by Virendra singh Rajpurohit
As discussed on prior threads, you should have both. There will need to be
static convenience methods for a user who wants to make a very simple call,
say Stats.mean(). But, as Alex said, this convenience class will just be a
front end for the statistics functionality itself. That needs to be in its
own classes (Mean(), Variance()), which can produce instances that give the
user more flexibility. For example, storeless statistics like Mean(),
Variance() or StandardDeviation() should be updatable, as Gilles said, or
handle different kinds of streams, as Alex said. Yet these classes need to
be designed so that they perform as well as simple implementations when
desired.
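
One possible shape for this two-layer arrangement is sketched below. The names Stats and Mean come from the message above, but every method signature here (accept, combine, getResult, the varargs mean) is an assumption made for illustration, not an agreed design.

```java
import java.util.function.DoubleConsumer;
import java.util.stream.DoubleStream;

/** Hypothetical storeless mean; an instance can keep receiving values. */
class Mean implements DoubleConsumer {
    private long n;
    private double mean;

    @Override
    public void accept(double x) {
        n++;
        mean += (x - mean) / n; // incremental update, no values stored
    }

    /** Merges another instance, e.g. a partial result from a parallel stream. */
    public void combine(Mean other) {
        long total = n + other.n;
        if (total > 0) {
            mean += (other.mean - mean) * other.n / total;
            n = total;
        }
    }

    public double getResult() {
        return n > 0 ? mean : Double.NaN;
    }
}

/** Hypothetical convenience front end that only delegates to Mean. */
final class Stats {
    private Stats() {}

    static double mean(double... values) {
        Mean m = new Mean();
        for (double v : values) {
            m.accept(v);
        }
        return m.getResult();
    }

    public static void main(String[] args) {
        // Simple call for the user who just wants a number.
        System.out.println(Stats.mean(1.0, 2.0, 3.0, 4.0));
        // Flexible use: the same class as the target of a stream collect.
        Mean m = DoubleStream.of(1.0, 2.0, 3.0, 4.0).parallel()
            .collect(Mean::new, Mean::accept, Mean::combine);
        m.accept(10.0); // still updatable afterwards
        System.out.println(m.getResult());
    }
}
```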






Re: [Commons][Descriptive][STATISTICS-7][GSoC] SummaryStatistics class design & Whether to use DoubleSummaryStatistics class from java.util package?

Gilles Sadowski-2
Hello.

Side note: Top-posting is quite annoying in these discussions...

On Sun, 2 Jun 2019 at 21:27, Eric Barnhill <[hidden email]> wrote:

>
> As discussed on prior threads, you should have both. There will need to be
> static convenience methods for a user who wants to make a very simple call,
> say Stats.mean(). But, as Alex said, this convenience class will just be a
> front end for the statistics functionality itself. That needs to be in its
> own classes (Mean(), Variance()), which can produce instances that give the
> user more flexibility. For example, storeless statistics like Mean(),
> Variance() or StandardDeviation() should be updatable, as Gilles said, or
> handle different kinds of streams, as Alex said. Yet these classes need to
> be designed so that they perform as well as simple implementations when
> desired.
>

Related discussion:
    https://issues.apache.org/jira/browse/STATISTICS-14

I agree with the requirement that "simple" usage must be possible.
However, it seems to me that the discussion is upside-down: simple
usage can always be provided by another layer (similar to the "toArray"
method in JDK's "List").  Seamless integration with streams is not
as obvious; hence it should not be an afterthought.
Unless I'm mistaken, another way to look at it is the "in-memory" vs
"storeless" divide, the latter being the most interesting case (when the
quantity can be computed) design-wise.

I suggest that the testing ground (read: code) is to provide the variance.
And see how it plays with a "DoubleStream", how it can also provide
"sum of squares" and "mean"; or how, inversely, "sum of squares" and
"mean" can be "combined" to provide variance.
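
As a small illustration of the inverse direction, the sketch below obtains the sample variance from a count, a mean and a sum of squares, each computed from a "DoubleStream". It assumes that "sum of squares" means the raw sum of x*x; the class and method names are hypothetical. This textbook formula can lose precision through cancellation when the mean is large relative to the spread, so it is shown only to make the relationship concrete, not as a recommended implementation.

```java
import java.util.stream.DoubleStream;

class VarianceFromParts {
    /** Sample variance from n, the mean and the raw sum of squares (sum of x*x). */
    static double variance(long n, double mean, double sumOfSquares) {
        return (sumOfSquares - n * mean * mean) / (n - 1);
    }

    public static void main(String[] args) {
        double[] data = {2, 4, 4, 4, 5, 5, 7, 9};
        // Two separate passes over the data, each a plain DoubleStream reduction.
        double mean = DoubleStream.of(data).average().orElse(Double.NaN);
        double sumSq = DoubleStream.of(data).map(x -> x * x).sum();
        System.out.println(variance(data.length, mean, sumSq));
    }
}
```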

Regards,
Gilles
