[math] Apache Commons Math Median performance

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[math] Apache Commons Math Median performance

Marco Neumann
I am evaluating the use of Apache Math Commons Median for the querying of
large data sets in another Apache project called Apache Jena.

In my preliminary performance tests I was surprised to find that a simple
implementation of a median function with Arrays.sort() and a programmatic
selection of the median value yields much faster results
than Median().evaluate() or DescriptiveStatistics.getPercentile(50).

Since we only use this function for  Arrays of confirmed numbers is there a
particular benefit in using Apache Commons Math for this task or are we
better advised to use our own implementation here?

Thank You
Reply | Threaded
Open this post in threaded view
|

Re: [math] Apache Commons Math Median performance

Gilles Sadowski-2
Hello.

Le mer. 29 mai 2019 à 12:24, Marco Neumann <[hidden email]> a écrit :
>
> I am evaluating the use of Apache Math Commons Median for the querying of
> large data sets in another Apache project called Apache Jena.
>
> In my preliminary performance tests I was surprised to find that a simple
> implementation of a median function with Arrays.sort() and a programmatic
> selection of the median value yields much faster results
> than Median().evaluate() or DescriptiveStatistics.getPercentile(50).

:-(

> Since we only use this function for  Arrays of confirmed numbers

What is a "confirmed number"?

> is there a
> particular benefit in using Apache Commons Math for this task or are we
> better advised to use our own implementation here?

There is ongoing work to refactor the "o.a.c.m.stat.descriptive" package
of "Commons Math".  The new code will be in "Commons Statistics".[1]
Your observation is an interesting data point for this task; could you please
file a report in JIRA[2] and/or mention on the "dev" ML?

Thanks,
Gilles

[1] http://commons.apache.org/proper/commons-statistics/
[2] https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-15?filter=allopenissues

>
> Thank You

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Apache Commons Math Median performance

Eric Barnhill
In reply to this post by Marco Neumann
Hi Marco,

Thanks a lot for this feedback. I am one of the contribs building out the
new commons-numbers release which will replace much of commons-math
including the statistics libraries.

I took a look at the commons-math code for Median. Based on my reading of
it I am not surprised by what you report. Median just invokes Percentile.
Percentile does seem to have some overhead, for example lots of range and
null checking. As as one off Median is probably not as efficient as a
simple sort. However it can store data and recalculate percentiles, and
there it might be efficient to use, as well as having flexibility in how it
is implemented.

I quite agree that most users simply want to call median() on their array
and expect at least as efficient an algorithm as they could hand code
themselves. For medians as we all learned in CS 101 class, for a guaranteed
result one cannot improve on O(n log n) time, just sorting like you
mentioned, however a method can deliver O(log n) time on average, but with
a potential worst case of O(n^2), and the user could be given this choice
of implementation.

The fact that the commons-math contributors did not seem aware of this
finding shows some of the occasional weaknesses I am finding as we design a
new, fleeter library in commons-numbers. We are redesigning the statistics
libraries over this summer with support from Google Summer of Code, so I
will see what I can do. Without feedback like yours we would not find these
points of improvement, so thanks.

On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[hidden email]>
wrote:

> I am evaluating the use of Apache Math Commons Median for the querying of
> large data sets in another Apache project called Apache Jena.
>
> In my preliminary performance tests I was surprised to find that a simple
> implementation of a median function with Arrays.sort() and a programmatic
> selection of the median value yields much faster results
> than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
>
> Since we only use this function for  Arrays of confirmed numbers is there a
> particular benefit in using Apache Commons Math for this task or are we
> better advised to use our own implementation here?
>
> Thank You
>
Reply | Threaded
Open this post in threaded view
|

Re: [math] Apache Commons Math Median performance

Marco Neumann
Thank you for the feedback and confirmation Eric. I am looking forward to
the new statistics libraries. I am glad that the work continuous on
these libraries and that they will be available to the community to improve
and enhance our projects.



On Wed, May 29, 2019 at 10:56 PM Eric Barnhill <[hidden email]>
wrote:

> Hi Marco,
>
> Thanks a lot for this feedback. I am one of the contribs building out the
> new commons-numbers release which will replace much of commons-math
> including the statistics libraries.
>
> I took a look at the commons-math code for Median. Based on my reading of
> it I am not surprised by what you report. Median just invokes Percentile.
> Percentile does seem to have some overhead, for example lots of range and
> null checking. As as one off Median is probably not as efficient as a
> simple sort. However it can store data and recalculate percentiles, and
> there it might be efficient to use, as well as having flexibility in how it
> is implemented.
>
> I quite agree that most users simply want to call median() on their array
> and expect at least as efficient an algorithm as they could hand code
> themselves. For medians as we all learned in CS 101 class, for a guaranteed
> result one cannot improve on O(n log n) time, just sorting like you
> mentioned, however a method can deliver O(log n) time on average, but with
> a potential worst case of O(n^2), and the user could be given this choice
> of implementation.
>
> The fact that the commons-math contributors did not seem aware of this
> finding shows some of the occasional weaknesses I am finding as we design a
> new, fleeter library in commons-numbers. We are redesigning the statistics
> libraries over this summer with support from Google Summer of Code, so I
> will see what I can do. Without feedback like yours we would not find these
> points of improvement, so thanks.
>
> On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[hidden email]>
> wrote:
>
> > I am evaluating the use of Apache Math Commons Median for the querying of
> > large data sets in another Apache project called Apache Jena.
> >
> > In my preliminary performance tests I was surprised to find that a simple
> > implementation of a median function with Arrays.sort() and a programmatic
> > selection of the median value yields much faster results
> > than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
> >
> > Since we only use this function for  Arrays of confirmed numbers is
> there a
> > particular benefit in using Apache Commons Math for this task or are we
> > better advised to use our own implementation here?
> >
> > Thank You
> >
>


--


---
Marco Neumann
KONA
Reply | Threaded
Open this post in threaded view
|

Re: [math] Apache Commons Math Median performance

Marco Neumann
In reply to this post by Gilles Sadowski-2
Hi Gilles,

On Wed, May 29, 2019 at 11:18 PM Gilles Sadowski <[hidden email]>
wrote:

> Hello.
>
> Le mer. 29 mai 2019 à 12:24, Marco Neumann <[hidden email]> a
> écrit :
> >
> > I am evaluating the use of Apache Math Commons Median for the querying of
> > large data sets in another Apache project called Apache Jena.
> >
> > In my preliminary performance tests I was surprised to find that a simple
> > implementation of a median function with Arrays.sort() and a programmatic
> > selection of the median value yields much faster results
> > than Median().evaluate() or DescriptiveStatistics.getPercentile(50).
>
> :-(
>

no worries, I still consider Apache Commons Math still a very valuable
effort.


>
> > Since we only use this function for  Arrays of confirmed numbers
>
> What is a "confirmed number"?
>

should probably read more like programmatically confirmed "numbers" rather
than "confirmed number". I am not dealing with NaN and infinite values in
the sort at the moment.


> > is there a
> > particular benefit in using Apache Commons Math for this task or are we
> > better advised to use our own implementation here?
>
> There is ongoing work to refactor the "o.a.c.m.stat.descriptive" package
> of "Commons Math".  The new code will be in "Commons Statistics".[1]
> Your observation is an interesting data point for this task; could you
> please
> file a report in JIRA[2] and/or mention on the "dev" ML?
>

I can certainly file a report and will do so tomorrow. I am looking forward
to the results of the work on the new stats package!

Best,
Marco

Thanks,

> Gilles
>
> [1] http://commons.apache.org/proper/commons-statistics/
> [2]
> https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-15?filter=allopenissues
>
> >
> > Thank You
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>

--


---
Marco Neumann
KONA