Greetings!
Recently I stumbled into the Commons math project; nice design, good abstractions, "smart updates" and even unit tests! :-) the Smart updates are a key feature for event stream processing / time series simulation. The only piece that is missing from a time series analysis and simulation perspective is the ability to supply a lag that defines a fixed sample size and perform rolling calculations. I was very happy to see this as an item on the wish list. A ThoughtWorks colleague (Yaxin Wang) and I are prototyping a java time series simulation engine and we are considering the commons math as the base of our numerical libraries. In order to do this we need to complete the rolling calculations, so here is our first spike (spike means prototype that can be thrown away / not a real patch.) We thought we would start with an easy case; mean, which uses sum. We have already combined the rolling calculations with the smart update algorithms before in the numerical libraries for our previous time series simulation engine. As you have mentioned in the wish list notes, our past experience is that some of the algorithms can not avoid using queues for rolling updates case. Obviously it is something pretty fundamental to the design and requires a bit of work across a lot of places to do this for all the statistics (at least starting with summary statistics.) Please give feedback on the design, any issues with performance (better data structure than the queue we used), etc! If the community is OK with this initial spike, then we can start submitting patches. :-) /brad --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] statistics.tar.gz (1K) Download Attachment |
On 9/22/07, Bradford Cross <[hidden email]> wrote:
> Greetings! > > Recently I stumbled into the Commons math project; nice design, good > abstractions, "smart updates" and even unit tests! :-) > Thanks! > the Smart updates are a key feature for event stream processing / time > series simulation. The only piece that is missing from a time series > analysis and simulation perspective is the ability to supply a lag that > defines a fixed sample size and perform rolling calculations. > That functionality actually already exists in the DescriptiveStatistics class. You can set a "window size" for rolling computations of univariate statistics using the concrete implementation of this class, o.a.c.math.stat.descriptive.DescriptiveStatisticsImpl. See http://commons.apache.org/math/userguide/stat.html > I was very happy to see this as an item on the wish list. The wishlist item is not as clear as it could be. Sorry about that. In addition to the computations in DescriptiveStatistics that require that you maintain all of the values in the current window in memory, we also support "storeless" computation of statistics than can be computed in one pass through the data. This allows very large data streams to be handled with fixed storage overhead. I think that what the wishlist item refers to is something in between - ways to support the window concept without storing all of the data. Strictly speaking, this is impossible, but doing things like sampling from the streams, periodically resetting or maintaining arrays of storeless stats with different offsets would in theory be possible. > > A ThoughtWorks colleague (Yaxin Wang) and I are prototyping a java time > series simulation engine and we are considering the commons math as the base > of our numerical libraries. In order to do this we need to complete the > rolling calculations, so here is our first spike (spike means prototype that > can be thrown away / not a real patch.) We thought we would start with an > easy case; mean, which uses sum. > > We have already combined the rolling calculations with the smart update > algorithms before in the numerical libraries for our previous time series > simulation engine. As you have mentioned in the wish list notes, our past > experience is that some of the algorithms can not avoid using queues for > rolling updates case. Obviously it is something pretty fundamental to the > design and requires a bit of work across a lot of places to do this for all > the statistics (at least starting with summary statistics.) > > Please give feedback on the design, any issues with performance (better data > structure than the queue we used), etc! > > If the community is OK with this initial spike, then we can start submitting > patches. :-) > Thanks for the contribution! There are a few problems with incorporating the code as is, though. First it uses generics and the concurrent package, which requires JDK 1.5 and our current minimum JDK level is 1.3. That could probably be eliminated fairly easily, though. The second is really whether or not the queue implementation is going to improve performance over the ResizeableDoubleArray store that DescriptiveStatisticsImpl uses now. If you think so and can demonstrate with benchmarks, we can talk about swapping out that implementation. Otherwise, its probably better to use ResizeableDoubleArray. I am +1 on adding a RollingStatistic abstract base class (would prefer that name to "Statistic" since it is specialized) like you have defined and rolling versions of the individual statistics. This would be a convenience over the current setup and provide a more intuitive way to access rolling stats than to use DescriptiveStatisticsImpl as a container. Currently this is only the only way to do it. So if you can refactor to either use ResizableDoubleArray as the backing store (look at DescriptiveStatisticsImpl.apply - the convenience classes could just use that pattern) or otherwise eliminate the JDK 1.5 dependency, I would support adding the rolling stats. If I understand correctly the idea of what you mean by Sum, and Mean (using constructor arguments to determine whether or not statistic is rolling), I would prefer to leave the existing statistics in commons-math as is and introduce Rolling versions as separate classes. One more thing. It is very important that any contributions that you make can be made in accordance with the Apache Contributor's License Agreement. Have a look here: http://www.apache.org/licenses/#clas and make sure you can agree to those terms. Then you can start submitting patches with attachements to Jira tickets. Thanks! Phil --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
On 9/29/07, Phil Steitz <[hidden email]> wrote:
> > On 9/22/07, Bradford Cross <[hidden email]> > > > the Smart updates are a key feature for event stream processing / time > > series simulation. The only piece that is missing from a time series > > analysis and simulation perspective is the ability to supply a lag that > > defines a fixed sample size and perform rolling calculations. > > > > That functionality actually already exists in the > DescriptiveStatistics class. You can set a "window size" for rolling > computations of univariate statistics using the concrete > implementation of this class, > o.a.c.math.stat.descriptive.DescriptiveStatisticsImpl. See > http://commons.apache.org/math/userguide/stat.html cool - i did not see this yet. > > > > If the community is OK with this initial spike, then we can start > submitting > > patches. :-) > > > > Thanks for the contribution! There are a few problems with > incorporating the code as is, though. First it uses generics and the > concurrent package, which requires JDK 1.5 and our current minimum JDK > level is 1.3. That could probably be eliminated fairly easily, > though. The second is really whether or not the queue implementation > is going to improve performance over the ResizeableDoubleArray store > that DescriptiveStatisticsImpl uses now. If you think so and can > demonstrate with benchmarks, we can talk about swapping out that > implementation. Otherwise, its probably better to use > ResizeableDoubleArray. Yes, this is just a "spike" - a proof of concept. :-) Today I setup a benchmark test and swapped in lots of different collections. The fastest java queue I found is the ArrayDeque from java-6. Interestingly, the calculations are about twice as fast using this queue compared with some of the other queue implementations in the java collections for a run of about 10K calls to calculate(). Nevertheless, the ResizableDoubleArray seems to be a bit faster. I will formalize my benchmarking with a bit more rigor and publish the results on this thread. I am +1 on adding a RollingStatistic abstract base class (would prefer > that name to "Statistic" since it is specialized) like you have > defined and rolling versions of the individual statistics. This would > be a convenience over the current setup and provide a more intuitive > way to access rolling stats than to use DescriptiveStatisticsImpl as a > container. Currently this is only the only way to do it. So if you > can refactor to either use ResizableDoubleArray as the backing store > (look at DescriptiveStatisticsImpl.apply - the convenience classes > could just use that pattern) or otherwise eliminate the JDK 1.5 > dependency, I would support adding the rolling stats. If I understand > correctly the idea of what you mean by Sum, and Mean (using > constructor arguments to determine whether or not statistic is > rolling), I would prefer to leave the existing statistics in > commons-math as is and introduce Rolling versions as separate classes. Sounds good - I will start working on the RollingStatistics. As for the convience pattern I used in Mean/Sum (using constructor arguments to determine whether or not statistic is rolling) - it is easy to do the refactorings later after the rolling statistics are added. We can just leave the current statistics as is and wait to see if we find some valuable reason to do it. One more thing. It is very important that any contributions that you > make can be made in accordance with the Apache Contributor's License > Agreement. Have a look here: > http://www.apache.org/licenses/#clas > and make sure you can agree to those terms. Yep, no problems. :-) Then you can start > submitting patches with attachements to Jira tickets. Sounds good. |
In reply to this post by Phil Steitz
OK, I have created a patch...I tried to follow the instructions to file a
bug on bugzilla but i can't seem to find the right place to file a new bug to either commons or commons math. I wonder if someone could help me out. /b On 9/29/07, Phil Steitz <[hidden email]> wrote: > > On 9/22/07, Bradford Cross <[hidden email]> wrote: > > Greetings! > > > > Recently I stumbled into the Commons math project; nice design, good > > abstractions, "smart updates" and even unit tests! :-) > > > Thanks! > > > the Smart updates are a key feature for event stream processing / time > > series simulation. The only piece that is missing from a time series > > analysis and simulation perspective is the ability to supply a lag that > > defines a fixed sample size and perform rolling calculations. > > > > That functionality actually already exists in the > DescriptiveStatistics class. You can set a "window size" for rolling > computations of univariate statistics using the concrete > implementation of this class, > o.a.c.math.stat.descriptive.DescriptiveStatisticsImpl. See > http://commons.apache.org/math/userguide/stat.html > > > I was very happy to see this as an item on the wish list. > > The wishlist item is not as clear as it could be. Sorry about that. > In addition to the computations in DescriptiveStatistics that require > that you maintain all of the values in the current window in memory, > we also support "storeless" computation of statistics than can be > computed in one pass through the data. This allows very large data > streams to be handled with fixed storage overhead. I think that what > the wishlist item refers to is something in between - ways to support > the window concept without storing all of the data. Strictly > speaking, this is impossible, but doing things like sampling from the > streams, periodically resetting or maintaining arrays of storeless > stats with different offsets would in theory be possible. > > > > A ThoughtWorks colleague (Yaxin Wang) and I are prototyping a java time > > series simulation engine and we are considering the commons math as the > base > > of our numerical libraries. In order to do this we need to complete the > > rolling calculations, so here is our first spike (spike means prototype > that > > can be thrown away / not a real patch.) We thought we would start with > an > > easy case; mean, which uses sum. > > > > We have already combined the rolling calculations with the smart update > > algorithms before in the numerical libraries for our previous time > series > > simulation engine. As you have mentioned in the wish list notes, our > past > > experience is that some of the algorithms can not avoid using queues for > > rolling updates case. Obviously it is something pretty fundamental to > the > > design and requires a bit of work across a lot of places to do this for > all > > the statistics (at least starting with summary statistics.) > > > > Please give feedback on the design, any issues with performance (better > data > > structure than the queue we used), etc! > > > > If the community is OK with this initial spike, then we can start > submitting > > patches. :-) > > > > Thanks for the contribution! There are a few problems with > incorporating the code as is, though. First it uses generics and the > concurrent package, which requires JDK 1.5 and our current minimum JDK > level is 1.3. That could probably be eliminated fairly easily, > though. The second is really whether or not the queue implementation > is going to improve performance over the ResizeableDoubleArray store > that DescriptiveStatisticsImpl uses now. If you think so and can > demonstrate with benchmarks, we can talk about swapping out that > implementation. Otherwise, its probably better to use > ResizeableDoubleArray. > > I am +1 on adding a RollingStatistic abstract base class (would prefer > that name to "Statistic" since it is specialized) like you have > defined and rolling versions of the individual statistics. This would > be a convenience over the current setup and provide a more intuitive > way to access rolling stats than to use DescriptiveStatisticsImpl as a > container. Currently this is only the only way to do it. So if you > can refactor to either use ResizableDoubleArray as the backing store > (look at DescriptiveStatisticsImpl.apply - the convenience classes > could just use that pattern) or otherwise eliminate the JDK 1.5 > dependency, I would support adding the rolling stats. If I understand > correctly the idea of what you mean by Sum, and Mean (using > constructor arguments to determine whether or not statistic is > rolling), I would prefer to leave the existing statistics in > commons-math as is and introduce Rolling versions as separate classes. > > One more thing. It is very important that any contributions that you > make can be made in accordance with the Apache Contributor's License > Agreement. Have a look here: > http://www.apache.org/licenses/#clas > and make sure you can agree to those terms. Then you can start > submitting patches with attachements to Jira tickets. > > Thanks! > > Phil > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > |
On 10/3/07, Bradford Cross <[hidden email]> wrote:
> OK, I have created a patch...I tried to follow the instructions to file a > bug on bugzilla but i can't seem to find the right place to file a new bug > to either commons or commons math. > > I wonder if someone could help me out. Start here for [math]: http://commons.apache.org/math/issue-tracking.html -- Wendy --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
In reply to this post by Bradford Cross
On 10/3/07, Bradford Cross <[hidden email]> wrote:
> OK, I have created a patch...I tried to follow the instructions to file a > bug on bugzilla but i can't seem to find the right place to file a new bug > to either commons or commons math. > > I wonder if someone could help me out. > Sorry for the response latency and sorry if you were led to Bugzilla by the incorrect link that I just noticed on http://commons.apache.org/math/developers.html. I will fix that. We now use Jira for issue tracking / patch submission. Here is a link to the commons math issues page: http://commons.apache.org/math/issue-tracking.html Phil > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
Cool - first patch finally submitted. :-)
On 10/6/07, Phil Steitz <[hidden email]> wrote: > > On 10/3/07, Bradford Cross <[hidden email]> wrote: > > OK, I have created a patch...I tried to follow the instructions to file > a > > bug on bugzilla but i can't seem to find the right place to file a new > bug > > to either commons or commons math. > > > > I wonder if someone could help me out. > > > > Sorry for the response latency and sorry if you were led to Bugzilla > by the incorrect link that I just noticed on > http://commons.apache.org/math/developers.html. I will fix that. > > We now use Jira for issue tracking / patch submission. Here is a link > to the commons math issues page: > > http://commons.apache.org/math/issue-tracking.html > > Phil > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > |
Thanks!
Had a quick look and have a couple of comments. First, have a look at the developers guide (linked on the [math] website) for style and other guidlines and if you are not set up yet with maven, let me know if you need help getting set up so you can get checkstyle and pmd reports. We should probably talk a little about the API - i.e., what the method names actually mean - and this all needs to be specified in the method javadoc. Commons math is as much about the API as the implementations, so we try to be pretty careful about getting the names and semantics right. Like other commons components, we have an implicit commitment to maintain backward compatibility, so that makes it even more important to get the APIs right. So can you describe what exactly the public methods mean and why they are named as they are? One more patch-generation point. I notice that the patch includes some variable name changes and other stylistic changes to existing code. While there is nothing wrong with suggesting this kind of change, we try to separate the style / formatting changes from the changes that introduce new features or fix bugs. That makes the diffs and change logs easier to read. So it would be good to remove those changes from the patch. Checkstyle will flag this, but two other little things I noticed are the presence of tabs (we use spaces in place of tabs) and if - then - else with no braces (we like braces). Thanks again for your interest and contributions! Phil On 10/11/07, Bradford Cross <[hidden email]> wrote: > Cool - first patch finally submitted. :-) > > On 10/6/07, Phil Steitz <[hidden email]> wrote: > > > > On 10/3/07, Bradford Cross <[hidden email]> wrote: > > > OK, I have created a patch...I tried to follow the instructions to file > > a > > > bug on bugzilla but i can't seem to find the right place to file a new > > bug > > > to either commons or commons math. > > > > > > I wonder if someone could help me out. > > > > > > > Sorry for the response latency and sorry if you were led to Bugzilla > > by the incorrect link that I just noticed on > > http://commons.apache.org/math/developers.html. I will fix that. > > > > We now use Jira for issue tracking / patch submission. Here is a link > > to the commons math issues page: > > > > http://commons.apache.org/math/issue-tracking.html > > > > Phil > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [hidden email] > > For additional commands, e-mail: [hidden email] > > > > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email] |
On 10/12/07, Phil Steitz <[hidden email]> wrote:
> > Thanks! > > Had a quick look and have a couple of comments. > > First, have a look at the developers guide (linked on the [math] > website) for style and other guidlines and if you are not set up yet > with maven, let me know if you need help getting set up so you can > get checkstyle and pmd reports. >>>Sure. We should probably talk a little about the API - i.e., what the method > names actually mean - and this all needs to be specified in the method > javadoc. Commons math is as much about the API as the > implementations, so we try to be pretty careful about getting the > names and semantics right. Like other commons components, we have an > implicit commitment to maintain backward compatibility, so that makes > it even more important to get the APIs right. So can you describe > what exactly the public methods mean and why they are named as they > are? >>>I agree. calculate() can be renamed to increment() for consistency with the rest of the code base. I also suggest that all increment() methods return the double value of the incremental calculation rather than void, which forces the client code to do statistic.increment(), and statistic.getResult() - the change doesn't violate backward compatibility and makes client code use of the API simpler and more intuitive. Also note that evaluate() methods return double, so this change would make calculation methods consistent. On this note, I also think it is better to decompose each method of calculation into a different object. There is static calculation on a collection via evaluate, incremental accumulated calculate, and incremental rolling calculate. This way we can simplify the main statistic classes by making them decorators, keep backwards compatibility in the API for all the statistics, and delegate the three different calculation types to specialized objects rather than cluttering things up. Maybe there are also better names ... like staticEvaluate, accumulatedEvaluate, rollingEvaluate..or likewise for ___Calculate ...bBut that would violate backward compatibility. One more patch-generation point. I notice that the patch includes > some variable name changes and other stylistic changes to existing > code. While there is nothing wrong with suggesting this kind of > change, we try to separate the style / formatting changes from the > changes that introduce new features or fix bugs. That makes the diffs > and change logs easier to read. So it would be good to remove those > changes from the patch. > > Checkstyle will flag this, but two other little things I noticed are > the presence of tabs (we use spaces in place of tabs) and if - then - > else with no braces (we like braces). >>>No problem, I will make these changes, add javadoc and resubmit the patch. Thanks again for your interest and contributions! > > Phil > > On 10/11/07, Bradford Cross <[hidden email]> wrote: > > Cool - first patch finally submitted. :-) > > > > On 10/6/07, Phil Steitz <[hidden email]> wrote: > > > > > > On 10/3/07, Bradford Cross <[hidden email]> wrote: > > > > OK, I have created a patch...I tried to follow the instructions to > file > > > a > > > > bug on bugzilla but i can't seem to find the right place to file a > new > > > bug > > > > to either commons or commons math. > > > > > > > > I wonder if someone could help me out. > > > > > > > > > > Sorry for the response latency and sorry if you were led to Bugzilla > > > by the incorrect link that I just noticed on > > > http://commons.apache.org/math/developers.html. I will fix that. > > > > > > We now use Jira for issue tracking / patch submission. Here is a link > > > to the commons math issues page: > > > > > > http://commons.apache.org/math/issue-tracking.html > > > > > > Phil > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [hidden email] > > > For additional commands, e-mail: [hidden email] > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [hidden email] > For additional commands, e-mail: [hidden email] > > |
I am fixing these issues tonight and finishing up the ranking statistics.
I still need to optimize the sorted deque structure that I created for RollingPercentile that uses a linked list and hte ResizeableDoubleArray. I will create a patch thhat includes the first version while I work an a more performant and less hackish implimentaiton of htis data structure - when I submit the patch please have a look and give me any suggestions. I notice that the Moment statistics, FirstMoment, SecondMoment and their wrappers Mean, Variance, etc. all extend from AbstractStatelessUnivariateStatistic and generally use the increment() calcualtion to produce the value returned by the evaluate() function. This is right in line with the abstractions I am talking about in my last email - we have a few different kinds of calculations; static (pass in an array, get back a value), incremental-accumulating (keep passing in values, keep updating calcualtion), incremental-rolling (keep passing in values, removing data outside the lag window, and updating calculation.) I propose to continue the abstractions in this direction, maintain backward compatability, and change increment() from void to double return type as I mention in the previous email. In the case of ranking statistics, there are currently no impliemntations for incrent() - I propose to add those. Also, while I was working on the rank statistics tonight I realized that we have two different cases for incremental rank statistics. 1) You want to use Percentile and specify a quantile in the constructor, you are interested in repeated calls to increment() for the same quantile (f.ex. min, max, median.) 2) You want to use Perceltile with incremental updates but without specifying a quantile in the constructor, by executing repeated calls on increment() and calling getValue with a quantile argument; this is useful if you need, f.ex. both min and max of the same data, there is no need to create multiple instances of Percentile. This si implimented in the current Percentile with setQuantile() ... again we can keep taht for backward compatability and just have a convienience method for getValue(double quantile). Thoughts? On 10/12/07, Bradford Cross <[hidden email]> wrote: > > > > On 10/12/07, Phil Steitz <[hidden email]> wrote: > > > > Thanks! > > > > Had a quick look and have a couple of comments. > > > > First, have a look at the developers guide (linked on the [math] > > website) for style and other guidlines and if you are not set up yet > > with maven, let me know if you need help getting set up so you can > > get checkstyle and pmd reports. > > > >>>Sure. > > We should probably talk a little about the API - i.e., what the method > > names actually mean - and this all needs to be specified in the method > > javadoc. Commons math is as much about the API as the > > implementations, so we try to be pretty careful about getting the > > names and semantics right. Like other commons components, we have an > > implicit commitment to maintain backward compatibility, so that makes > > it even more important to get the APIs right. So can you describe > > what exactly the public methods mean and why they are named as they > > are? > > > >>>I agree. calculate() can be renamed to increment() for consistency > with the rest of the code base. I also suggest that all increment() methods > return the double value of the incremental calculation rather than void, > which forces the client code to do statistic.increment(), and > statistic.getResult() - the change doesn't violate backward compatibility > and makes client code use of the API simpler and more intuitive. Also note > that evaluate() methods return double, so this change would make calculation > methods consistent. On this note, I also think it is better to decompose > each method of calculation into a different object. There is static > calculation on a collection via evaluate, incremental accumulated calculate, > and incremental rolling calculate. This way we can simplify the main > statistic classes by making them decorators, keep backwards compatibility in > the API for all the statistics, and delegate the three different calculation > types to specialized objects rather than cluttering things up. Maybe there > are also better names ... like staticEvaluate, accumulatedEvaluate, > rollingEvaluate..or likewise for ___Calculate ...bBut that would violate > backward compatibility. > > One more patch-generation point. I notice that the patch includes > > some variable name changes and other stylistic changes to existing > > code. While there is nothing wrong with suggesting this kind of > > change, we try to separate the style / formatting changes from the > > changes that introduce new features or fix bugs. That makes the diffs > > and change logs easier to read. So it would be good to remove those > > changes from the patch. > > > > Checkstyle will flag this, but two other little things I noticed are > > the presence of tabs (we use spaces in place of tabs) and if - then - > > else with no braces (we like braces). > > > > >>>No problem, I will make these changes, add javadoc and resubmit the > patch. > > Thanks again for your interest and contributions! > > > > Phil > > > > On 10/11/07, Bradford Cross <[hidden email]> wrote: > > > Cool - first patch finally submitted. :-) > > > > > > On 10/6/07, Phil Steitz <[hidden email]> wrote: > > > > > > > > On 10/3/07, Bradford Cross < [hidden email]> wrote: > > > > > OK, I have created a patch...I tried to follow the instructions to > > file > > > > a > > > > > bug on bugzilla but i can't seem to find the right place to file a > > new > > > > bug > > > > > to either commons or commons math. > > > > > > > > > > I wonder if someone could help me out. > > > > > > > > > > > > > Sorry for the response latency and sorry if you were led to Bugzilla > > > > > > by the incorrect link that I just noticed on > > > > http://commons.apache.org/math/developers.html. I will fix that. > > > > > > > > We now use Jira for issue tracking / patch submission. Here is a > > link > > > > to the commons math issues page: > > > > > > > > http://commons.apache.org/math/issue-tracking.html > > > > > > > > Phil > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > To unsubscribe, e-mail: [hidden email] > > > > For additional commands, e-mail: [hidden email] > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [hidden email] > > For additional commands, e-mail: [hidden email] > > > > > |
New patch for rolling stats is attached to ticket. Notes are included.
Please review and get back to me. On 10/13/07, Bradford Cross <[hidden email]> wrote: > > I am fixing these issues tonight and finishing up the ranking statistics. > > > I still need to optimize the sorted deque structure that I created for > RollingPercentile that uses a linked list and hte ResizeableDoubleArray. I > will create a patch thhat includes the first version while I work an a more > performant and less hackish implimentaiton of htis data structure - when I > submit the patch please have a look and give me any suggestions. > > I notice that the Moment statistics, FirstMoment, SecondMoment and their > wrappers Mean, Variance, etc. all extend from > AbstractStatelessUnivariateStatistic and generally use the increment() > calcualtion to produce the value returned by the evaluate() function. This > is right in line with the abstractions I am talking about in my last email - > we have a few different kinds of calculations; static (pass in an array, get > back a value), incremental-accumulating (keep passing in values, keep > updating calcualtion), incremental-rolling (keep passing in values, removing > data outside the lag window, and updating calculation.) I propose to > continue the abstractions in this direction, maintain backward > compatability, and change increment() from void to double return type as I > mention in the previous email. > > In the case of ranking statistics, there are currently no impliemntations > for incrent() - I propose to add those. Also, while I was working on the > rank statistics tonight I realized that we have two different cases for > incremental rank statistics. 1) You want to use Percentile and specify a > quantile in the constructor, you are interested in repeated calls to > increment() for the same quantile ( f.ex. min, max, median.) 2) You want > to use Perceltile with incremental updates but without specifying a quantile > in the constructor, by executing repeated calls on increment() and calling > getValue with a quantile argument; this is useful if you need, f.ex. both > min and max of the same data, there is no need to create multiple instances > of Percentile. This si implimented in the current Percentile with > setQuantile() ... again we can keep taht for backward compatability and just > have a convienience method for getValue(double quantile). > > Thoughts? > > > > On 10/12/07, Bradford Cross <[hidden email]> wrote: > > > > > > > > On 10/12/07, Phil Steitz <[hidden email]> wrote: > > > > > > Thanks! > > > > > > Had a quick look and have a couple of comments. > > > > > > First, have a look at the developers guide (linked on the [math] > > > website) for style and other guidlines and if you are not set up yet > > > with maven, let me know if you need help getting set up so you can > > > get checkstyle and pmd reports. > > > > > > >>>Sure. > > > > We should probably talk a little about the API - i.e., what the method > > > names actually mean - and this all needs to be specified in the method > > > javadoc. Commons math is as much about the API as the > > > implementations, so we try to be pretty careful about getting the > > > names and semantics right. Like other commons components, we have an > > > implicit commitment to maintain backward compatibility, so that makes > > > it even more important to get the APIs right. So can you describe > > > what exactly the public methods mean and why they are named as they > > > are? > > > > > > >>>I agree. calculate() can be renamed to increment() for consistency > > with the rest of the code base. I also suggest that all increment() methods > > return the double value of the incremental calculation rather than void, > > which forces the client code to do statistic.increment(), and > > statistic.getResult() - the change doesn't violate backward > > compatibility and makes client code use of the API simpler and more > > intuitive. Also note that evaluate() methods return double, so this change > > would make calculation methods consistent. On this note, I also think it is > > better to decompose each method of calculation into a different object. > > There is static calculation on a collection via evaluate, incremental > > accumulated calculate, and incremental rolling calculate. This way we can > > simplify the main statistic classes by making them decorators, keep > > backwards compatibility in the API for all the statistics, and delegate the > > three different calculation types to specialized objects rather than > > cluttering things up. Maybe there are also better names ... like > > staticEvaluate, accumulatedEvaluate, rollingEvaluate..or likewise for > > ___Calculate ...bBut that would violate backward compatibility. > > > > One more patch-generation point. I notice that the patch includes > > > some variable name changes and other stylistic changes to existing > > > code. While there is nothing wrong with suggesting this kind of > > > change, we try to separate the style / formatting changes from the > > > changes that introduce new features or fix bugs. That makes the diffs > > > and change logs easier to read. So it would be good to remove those > > > changes from the patch. > > > > > > Checkstyle will flag this, but two other little things I noticed are > > > the presence of tabs (we use spaces in place of tabs) and if - then - > > > else with no braces (we like braces). > > > > > > > > >>>No problem, I will make these changes, add javadoc and resubmit the > > patch. > > > > Thanks again for your interest and contributions! > > > > > > Phil > > > > > > On 10/11/07, Bradford Cross <[hidden email] > wrote: > > > > Cool - first patch finally submitted. :-) > > > > > > > > On 10/6/07, Phil Steitz <[hidden email]> wrote: > > > > > > > > > > On 10/3/07, Bradford Cross < [hidden email]> wrote: > > > > > > OK, I have created a patch...I tried to follow the instructions > > > to file > > > > > a > > > > > > bug on bugzilla but i can't seem to find the right place to file > > > a new > > > > > bug > > > > > > to either commons or commons math. > > > > > > > > > > > > I wonder if someone could help me out. > > > > > > > > > > > > > > > > Sorry for the response latency and sorry if you were led to > > > Bugzilla > > > > > by the incorrect link that I just noticed on > > > > > http://commons.apache.org/math/developers.html . I will fix that. > > > > > > > > > > We now use Jira for issue tracking / patch submission. Here is a > > > link > > > > > to the commons math issues page: > > > > > > > > > > http://commons.apache.org/math/issue-tracking.html > > > > > > > > > > Phil > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: [hidden email] > > > > > For additional commands, e-mail: [hidden email] > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: [hidden email] > > > For additional commands, e-mail: [hidden email] > > > > > > > > > |
Free forum by Nabble | Edit this page |