I am evaluating the use of Apache Math Commons Median for the querying of
large data sets in another Apache project called Apache Jena. In my preliminary performance tests I was surprised to find that a simple implementation of a median function with Arrays.sort() and a programmatic selection of the median value yields much faster results than Median().evaluate() or DescriptiveStatistics.getPercentile(50). Since we only use this function for Arrays of confirmed numbers is there a particular benefit in using Apache Commons Math for this task or are we better advised to use our own implementation here? Thank You |
Hello.
Le mer. 29 mai 2019 à 12:24, Marco Neumann <[hidden email]> a écrit : > > I am evaluating the use of Apache Math Commons Median for the querying of > large data sets in another Apache project called Apache Jena. > > In my preliminary performance tests I was surprised to find that a simple > implementation of a median function with Arrays.sort() and a programmatic > selection of the median value yields much faster results > than Median().evaluate() or DescriptiveStatistics.getPercentile(50). :-( > Since we only use this function for Arrays of confirmed numbers What is a "confirmed number"? > is there a > particular benefit in using Apache Commons Math for this task or are we > better advised to use our own implementation here? There is ongoing work to refactor the "o.a.c.m.stat.descriptive" package of "Commons Math". The new code will be in "Commons Statistics".[1] Your observation is an interesting data point for this task; could you please file a report in JIRA[2] and/or mention on the "dev" ML? Thanks, Gilles [1] http://commons.apache.org/proper/commons-statistics/ [2] https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-15?filter=allopenissues > > Thank You --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
In reply to this post by Marco Neumann
Hi Marco,
Thanks a lot for this feedback. I am one of the contribs building out the new commons-numbers release which will replace much of commons-math including the statistics libraries. I took a look at the commons-math code for Median. Based on my reading of it I am not surprised by what you report. Median just invokes Percentile. Percentile does seem to have some overhead, for example lots of range and null checking. As as one off Median is probably not as efficient as a simple sort. However it can store data and recalculate percentiles, and there it might be efficient to use, as well as having flexibility in how it is implemented. I quite agree that most users simply want to call median() on their array and expect at least as efficient an algorithm as they could hand code themselves. For medians as we all learned in CS 101 class, for a guaranteed result one cannot improve on O(n log n) time, just sorting like you mentioned, however a method can deliver O(log n) time on average, but with a potential worst case of O(n^2), and the user could be given this choice of implementation. The fact that the commons-math contributors did not seem aware of this finding shows some of the occasional weaknesses I am finding as we design a new, fleeter library in commons-numbers. We are redesigning the statistics libraries over this summer with support from Google Summer of Code, so I will see what I can do. Without feedback like yours we would not find these points of improvement, so thanks. On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[hidden email]> wrote: > I am evaluating the use of Apache Math Commons Median for the querying of > large data sets in another Apache project called Apache Jena. > > In my preliminary performance tests I was surprised to find that a simple > implementation of a median function with Arrays.sort() and a programmatic > selection of the median value yields much faster results > than Median().evaluate() or DescriptiveStatistics.getPercentile(50). > > Since we only use this function for Arrays of confirmed numbers is there a > particular benefit in using Apache Commons Math for this task or are we > better advised to use our own implementation here? > > Thank You > |
Thank you for the feedback and confirmation Eric. I am looking forward to
the new statistics libraries. I am glad that the work continuous on these libraries and that they will be available to the community to improve and enhance our projects. On Wed, May 29, 2019 at 10:56 PM Eric Barnhill <[hidden email]> wrote: > Hi Marco, > > Thanks a lot for this feedback. I am one of the contribs building out the > new commons-numbers release which will replace much of commons-math > including the statistics libraries. > > I took a look at the commons-math code for Median. Based on my reading of > it I am not surprised by what you report. Median just invokes Percentile. > Percentile does seem to have some overhead, for example lots of range and > null checking. As as one off Median is probably not as efficient as a > simple sort. However it can store data and recalculate percentiles, and > there it might be efficient to use, as well as having flexibility in how it > is implemented. > > I quite agree that most users simply want to call median() on their array > and expect at least as efficient an algorithm as they could hand code > themselves. For medians as we all learned in CS 101 class, for a guaranteed > result one cannot improve on O(n log n) time, just sorting like you > mentioned, however a method can deliver O(log n) time on average, but with > a potential worst case of O(n^2), and the user could be given this choice > of implementation. > > The fact that the commons-math contributors did not seem aware of this > finding shows some of the occasional weaknesses I am finding as we design a > new, fleeter library in commons-numbers. We are redesigning the statistics > libraries over this summer with support from Google Summer of Code, so I > will see what I can do. Without feedback like yours we would not find these > points of improvement, so thanks. > > On Wed, May 29, 2019 at 3:24 AM Marco Neumann <[hidden email]> > wrote: > > > I am evaluating the use of Apache Math Commons Median for the querying of > > large data sets in another Apache project called Apache Jena. > > > > In my preliminary performance tests I was surprised to find that a simple > > implementation of a median function with Arrays.sort() and a programmatic > > selection of the median value yields much faster results > > than Median().evaluate() or DescriptiveStatistics.getPercentile(50). > > > > Since we only use this function for Arrays of confirmed numbers is > there a > > particular benefit in using Apache Commons Math for this task or are we > > better advised to use our own implementation here? > > > > Thank You > > > -- --- Marco Neumann KONA |
In reply to this post by Gilles Sadowski-2
Hi Gilles,
On Wed, May 29, 2019 at 11:18 PM Gilles Sadowski <[hidden email]> wrote: > Hello. > > Le mer. 29 mai 2019 à 12:24, Marco Neumann <[hidden email]> a > écrit : > > > > I am evaluating the use of Apache Math Commons Median for the querying of > > large data sets in another Apache project called Apache Jena. > > > > In my preliminary performance tests I was surprised to find that a simple > > implementation of a median function with Arrays.sort() and a programmatic > > selection of the median value yields much faster results > > than Median().evaluate() or DescriptiveStatistics.getPercentile(50). > > :-( > no worries, I still consider Apache Commons Math still a very valuable effort. > > > Since we only use this function for Arrays of confirmed numbers > > What is a "confirmed number"? > should probably read more like programmatically confirmed "numbers" rather than "confirmed number". I am not dealing with NaN and infinite values in the sort at the moment. > > is there a > > particular benefit in using Apache Commons Math for this task or are we > > better advised to use our own implementation here? > > There is ongoing work to refactor the "o.a.c.m.stat.descriptive" package > of "Commons Math". The new code will be in "Commons Statistics".[1] > Your observation is an interesting data point for this task; could you > please > file a report in JIRA[2] and/or mention on the "dev" ML? > I can certainly file a report and will do so tomorrow. I am looking forward to the results of the work on the new stats package! Best, Marco Thanks, > Gilles > > [1] http://commons.apache.org/proper/commons-statistics/ > [2] > https://issues.apache.org/jira/projects/STATISTICS/issues/STATISTICS-15?filter=allopenissues > > > > > Thank You > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > -- --- Marco Neumann KONA |
Free forum by Nabble | Edit this page |