

Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.

Key: MATH607
URL: https://issues.apache.org/jira/browse/MATH607
Project: Commons Math
Issue Type: New Feature
Affects Versions: 3.0
Environment: Java
Reporter: greg sterijevski
Fix For: 3.0
The current multiple regression class does a QR decomposition on the complete data set, which requires loading the complete dataset in core. For large datasets, and especially when data mining or stepwise regression must be run on them, this is not practical. There are techniques which form the normal equations on the fly, as well as ones which form the QR decomposition on an update basis. I am proposing, first, the specification of an "UpdatingLinearRegression" interface which defines the basic functionality all such techniques must fulfill.
Related to this 'updating' regression, the results of running a regression on some subset of the data should be encapsulated in an immutable object. This is to ensure that subsequent additions of observations do not corrupt or render inconsistent parameter estimates. I am calling this interface "RegressionResults".
Once the community has reached a consensus on the interface, work on the concrete implementation of these techniques will take place.
Thanks,
Greg
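For illustration only, the two proposed abstractions might look roughly like the Java sketch below. The attached patch is not reproduced in this thread, so every method name and signature here is an assumption, loosely based on names that come up later in the discussion. InterceptOnlyRegression is a hypothetical toy implementation (a constant-only model whose estimate is the running mean of y), included only so that the sketch runs.

```java
// Hypothetical sketch: names and signatures are assumptions, not the patch.

/** Immutable snapshot of a fit; later observations must not change it. */
interface RegressionResults {
    double[] getParameterEstimates();
    long getNumberOfObservations();
    double getErrorSumSquares();
}

/** Contract for regressions that consume observations incrementally. */
interface UpdatingLinearRegression {
    void addObservation(double[] x, double y);
    void addObservations(double[][] x, double[] y);
    RegressionResults regress();
    void clear();
}

/** Toy implementation: constant-only model, so the estimate is the mean of y. */
final class InterceptOnlyRegression implements UpdatingLinearRegression {
    private long n;
    private double mean;
    private double sse; // running sum of squared deviations from the mean

    @Override
    public void addObservation(double[] x, double y) {
        // x is ignored: this toy model has no regressors besides the constant.
        n++;
        double delta = y - mean;
        mean += delta / n;
        sse += delta * (y - mean);
    }

    @Override
    public void addObservations(double[][] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        for (int i = 0; i < y.length; i++) {
            addObservation(x[i], y[i]);
        }
    }

    @Override
    public RegressionResults regress() {
        // Copy the current state so the snapshot is unaffected by later updates.
        final double[] beta = {mean};
        final long nobs = n;
        final double ess = sse;
        return new RegressionResults() {
            @Override public double[] getParameterEstimates() { return beta.clone(); }
            @Override public long getNumberOfObservations() { return nobs; }
            @Override public double getErrorSumSquares() { return ess; }
        };
    }

    @Override
    public void clear() {
        n = 0;
        mean = 0.0;
        sse = 0.0;
    }
}
```

The point of the sketch is the snapshot behaviour: a RegressionResults obtained from regress() keeps its values even if more observations are added afterwards.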

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: updating_reg_ifaces
This is the patch file with the proposed changes.
> Current Multiple Regression Object does calculations with all data incore. There are non incore techniques which would be useful with large datasets.
> 
>
> Key: MATH607
> URL: https://issues.apache.org/jira/browse/MATH607
> Project: Commons Math
> Issue Type: New Feature
> Affects Versions: 3.0
> Environment: Java
> Reporter: greg sterijevski
> Labels: Gentleman's, QR, Regression, Updating, decomposition, lemma
> Fix For: 3.0
>
> Attachments: updating_reg_ifaces
>
> Original Estimate: 840h
> Remaining Estimate: 840h
>



[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060744#comment13060744 ]
Phil Steitz commented on MATH607:

First, thanks for pushing this along and sorry to be slow to respond.
I like both of the abstractions, but I am not sure that defining interfaces is the best way to go in either case. The reporting interface (RegressionResults) could be a concrete class and it is probably best to define a base class that omits some of the reported stats (e.g. isRedundant, getRedundant). Making this a class gives us more flexibility. It also makes it a little easier / more convenient for users who want to store off intermediate results. One thing that I would add to either the base or an extended version is adjusted R-squared. I think it is also a good idea at this point to ask what else might be missing. Your suggestions on redundancy are a good example. For now, I would suggest making RegressionResults a serializable class as we finalize its contents. One small quibble on naming: s/getNobs/getNumberOfObservations or, if that is too onerous, getN (similar to other stats).
Regarding the model interface, I would again suggest that we just define this as a class, UpdatingOLSRegression. I suppose that if we end up implementing a weighted or other non-OLS version, we might want to factor out a common interface like what exists for MultipleLinearRegression, but in retrospect, I am not sure that interface was worth much. Note that all that we could factor out is essentially what is in MultivariateRegression, which is analogous to your RegressionResults.
So, modulo the one name change, I propose to just change these to classes and get going on the implementation. Any other suggestions on what we should add / modify in the RegressionResults?
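As a point of reference for the adjusted R-squared suggestion, here is a minimal sketch of the usual formula, 1 - (1 - R^2)(n - 1)/(n - k - 1) for a model with an intercept and k regressors. The class and method names are hypothetical, and the no-intercept branch, 1 - (1 - R^2) n/(n - k), is one common convention rather than something settled in this thread.

```java
// Hypothetical helper; not part of any proposed patch.
final class AdjustedRSquared {
    private AdjustedRSquared() { }

    /**
     * @param rSquared     unadjusted R^2
     * @param n            number of observations
     * @param k            number of regressors, excluding any intercept
     * @param hasIntercept whether an intercept term was estimated
     */
    static double of(double rSquared, long n, int k, boolean hasIntercept) {
        long p = hasIntercept ? k + 1 : k; // number of estimated parameters
        if (n <= p) {
            throw new IllegalArgumentException("need more observations than parameters");
        }
        // With an intercept, the total sum of squares has n - 1 degrees of freedom.
        double totalDf = hasIntercept ? n - 1 : n;
        return 1.0 - (1.0 - rSquared) * totalDf / (n - p);
    }
}
```

Note that the result depends on knowing whether the model contains a constant, so the results object would need that information from the regression that produced it.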


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060744#comment13060744 ]
Phil Steitz edited comment on MATH607 at 7/6/11 6:31 PM:



[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060774#comment13060774 ]
greg sterijevski commented on MATH607:

Phil,
underlying solver is QR or Gaussian this info would exist. If the underlying
method is SVD, then we would register the rank reduction, but we would not
be able to attribute it to a particular column in the design matrix.
I am probably in agreement with making RegressionResults concrete, but
there were a couple of considerations which forced me to interface.
Say that I begin with the following augmented matrix:
    X'X   X'Y
    Y'X   Y'Y
where X is the design matrix (nobs x nreg) and Y is the dependent variable
(nobs x 1).
On a copy of the cross products matrix (the thing above), I get the
following via Gaussian elimination:
    inv(X'X)   beta
    beta'      e'e
inv(X'X) is the inverse of the X'X matrix. beta is the OLS vector of
slopes. e'e is the sum of squared errors.
Getting most of the info (that RegressionResults surfaces) is simply a
matter of indexing. All I need to do in this case is write a wrapper around
a symmetric matrix which implements the interface.
I suppose that there could be a constructor which took the matrix above and
did the indexing, but that seems too dirty. Furthermore, there are probably
other optimized formats for OLS which have similar aspects. I wanted to keep
the door open to other schemes, without making (potentially large) copies of
variance matrices, standard errors and so forth a necessity.
On the name of the getter for number of observations, I am okay with
whatever you feel is a better name.
So you are saying the UpdatingOLSRegression should be an abstract class? There
are not that many methods in the interface. That would be okay if we were sure
that subclasses always overrode either the regress(...) methods or the
addObservations(...) methods. I worry that you might end up with a base class
full of nothing but abstract functions.
> So, modulo the one name change, I propose to just change these to classes


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060782#comment13060782 ]
greg sterijevski commented on MATH607:

Sorry for duplicating part of my response, but gmail has truncated it (maybe google is telling me something about my ideas... ;0 )
My complete response is:
I agree on eliminating getRedundant() and isRedundant(int idx). If the underlying solver is QR or Gaussian this info would exist. If the underlying method is SVD, then we would register the rank reduction, but we would not be able to attribute it to a particular column in the design matrix.
I am probably in agreement with making RegressionResults concrete, but there were a couple of considerations which forced me to interface.
Say that I begin with the following augmented matrix:
    X'X   X'Y
    Y'X   Y'Y
where X is the design matrix (nobs x nreg) and Y is the dependent variable (nobs x 1).
On a copy of the cross products matrix (the thing above), I get the following via Gaussian elimination:
    inv(X'X)   beta
    beta'      e'e
inv(X'X) is the inverse of the X'X matrix. beta is the OLS vector of slopes. e'e is the sum of squared errors.
Getting most of the info (that RegressionResults surfaces) is simply a matter of indexing. All I need to do in this case is write a wrapper around a symmetric matrix which implements the interface.
I suppose that there could be a constructor which took the matrix above and did the indexing, but that seems too dirty. Furthermore, there are probably other optimized formats for OLS which have similar aspects. I wanted to keep the door open to other schemes, without making (potentially large) copies of variance matrices, standard errors and so forth a necessity.
On the name of the getter for number of observations, I am okay with whatever you feel is a better name.
> Regarding the model interface, I would again suggest that we just define this as a class, UpdatingOLSRegression. I suppose that if we end up implementing a weighted or other nonOLS version, we might want to factor out a common interface like what exists for MultipleLinearRegression, but in retrospect, I am not sure that interface was worth much. Note that all that we could factor out is essentially what is in MultivariateRegression, which is analogous to your RegressionResults.
So you are saying the UpdatingOLSRegression should be an abstract class? There are not that many methods in the interface. That would be okay if we were sure that subclasses always overrode either the regress(...) methods or the addObservations(...) methods. I worry that you might end up with a base class full of nothing but abstract functions.
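To make the augmented-matrix argument concrete, here is a minimal sketch that accumulates the cross products one observation at a time and then applies a sweep step (a form of Gauss-Jordan elimination) to each X pivot. The class and method names are hypothetical. With the sign convention used in this sketch the upper-right block holds beta and the lower-left block holds -beta'; other conventions differ in sign. There is no pivoting or rank check beyond a zero-pivot test.

```java
// Hypothetical sketch of OLS via sweeping the augmented cross-products matrix.
final class SweepOls {
    private SweepOls() { }

    /** Builds the (p+1) x (p+1) matrix [X'X X'y; y'X y'y] one observation at a time. */
    static double[][] crossProducts(double[][] x, double[] y) {
        int p = x[0].length;
        double[][] a = new double[p + 1][p + 1];
        for (int obs = 0; obs < x.length; obs++) {
            double[] row = new double[p + 1];
            System.arraycopy(x[obs], 0, row, 0, p);
            row[p] = y[obs];
            for (int i = 0; i <= p; i++) {
                for (int j = 0; j <= p; j++) {
                    a[i][j] += row[i] * row[j];
                }
            }
        }
        return a;
    }

    /** Sweeps the matrix in place on pivot k. */
    static void sweep(double[][] a, int k) {
        int m = a.length;
        double d = a[k][k];
        if (d == 0.0) {
            throw new ArithmeticException("zero pivot: redundant regressor " + k);
        }
        for (int j = 0; j < m; j++) {
            a[k][j] /= d;
        }
        for (int i = 0; i < m; i++) {
            if (i == k) {
                continue;
            }
            double b = a[i][k];
            for (int j = 0; j < m; j++) {
                a[i][j] -= b * a[k][j];
            }
            a[i][k] = -b / d;
        }
        a[k][k] = 1.0 / d;
    }

    /** Returns the swept matrix: inv(X'X) top-left, beta in the last column, e'e bottom-right. */
    static double[][] fit(double[][] x, double[] y) {
        double[][] a = crossProducts(x, y);
        for (int k = 0; k < a.length - 1; k++) {
            sweep(a, k);
        }
        return a;
    }
}
```

After fit(...), reading off the estimates, the sum of squared errors and inv(X'X) is only a matter of indexing into the returned array, which is the motivation given above for wrapping the matrix rather than copying values out of it.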


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060791#comment13060791 ]
Phil Steitz commented on MATH607:

I get your point on the Results interface. It did not look "large" to me at first (i.e., generally O(vars) vs. O(obs)). If it could get "large" it would indeed be better to leave it as an interface. The problem there is really nailing it, because interfaces are very hard to change. My sense at this point is that we may want to rev this a few times before it is really stable, so a concrete class would be better to start with. Also, having the "value" class is handy. StatisticalSummaryValues is an example of that (it implements the interface that preceded it, so maybe having both is a good longer-term solution). If it turns out to be too unwieldy to create the results factory methods, I am OK starting with the interface approach, but in that case we should review it very carefully prior to release.
I did not mean to suggest that UpdatingOLSRegression should be an abstract class. If and when a weighted or non-OLS updating regression is implemented, we might consider introducing an abstract parent, but I would need to see good reason for this. IMO, what we have now in OLS, WLS is of marginal value (I mean the abstract superclass and interface).


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060792#comment13060792 ]
greg sterijevski commented on MATH607:

One more thing, on the subject of the adjusted R Squared. I am not sure I would include this, since this is dependent on knowledge that a constant exists. I currently envision being handed some data. If the data has a column which is nothing but ones, great. If not, great again. I could not come up with an elegant way to handle constant detection, and therefore a clean way to determine the Busse R squared.
I guess we could keep a flag for each regressor. If the regressor has a changed value then we would say it is not a constant. The other approach is to test the residuals for bias: if there is no bias, then constant or not we are okay. That would be messy, though, since I do not keep the data around. Either way makes for a bit of unpleasantness that yields very little?
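The per-regressor flag idea could be sketched as follows. The class name is hypothetical, and the sketch treats a column as constant only if every value seen so far is exactly equal to the first one, which is an assumption about how strict the test should be.

```java
// Hypothetical sketch of the "flag for each regressor" approach.
final class ConstantColumnTracker {
    private double[] first;   // first row seen, used as the reference values
    private boolean[] varies; // varies[j] is true once column j has changed

    /** Records one row of the design matrix without storing it. */
    void observe(double[] row) {
        if (first == null) {
            first = row.clone();
            varies = new boolean[row.length];
            return;
        }
        if (row.length != first.length) {
            throw new IllegalArgumentException("row length changed");
        }
        for (int j = 0; j < row.length; j++) {
            if (row[j] != first[j]) {
                varies[j] = true;
            }
        }
    }

    /** True if column j has held a single value in every row observed so far. */
    boolean isConstant(int j) {
        return first != null && !varies[j];
    }

    /** True if at least one column has been constant so far. */
    boolean hasConstantColumn() {
        if (first == null) {
            return false;
        }
        for (boolean v : varies) {
            if (!v) {
                return true;
            }
        }
        return false;
    }
}
```

The cost is one comparison per regressor per observation plus two small arrays, so it does not require keeping the data around.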


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060797#comment13060797 ]
Phil Steitz commented on MATH607:

Thanks, I forgot to mention that important point. Initially, we took the "take what we are given" approach, but that proved confusing and error-prone for users (forcing them to add unitary columns to input data). I think it is best to expect no unitary columns in the design matrix and have the user explicitly specify "noIntercept" to estimate a model without an intercept term. This is how the MultipleLinearRegression impls now work. (See the javadoc for newSampleData in AbstractMultipleLinearRegression.) In the updating impls, this can work the same way, allowing users to omit initial "1"s from added rows. I guess this will have to be a constructor parameter to work correctly in the updating impls.
Another thing I forgot to mention is careful specification and validation of array shape constraints on input data (i.e., when things have to be rectangular and/or of length = previously determined nVars). I liked the lack of a setter for the number of explanatory variables, but that means the first addData becomes definitional.
One final suggestion: maybe the row version of addData should be addObservation and the matrix version should be addObservations.
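A rough sketch of how these three points could fit together: a noIntercept constructor flag, shape validation in which the first observation fixes nVars, and separate addObservation / addObservations methods. DesignRowBuilder is a hypothetical helper, not a proposed class; it only prepares rows and does not estimate anything.

```java
// Hypothetical sketch: validates input shapes and prepends the intercept column.
final class DesignRowBuilder {
    private final boolean noIntercept;
    private int nVars = -1; // unknown until the first observation arrives

    DesignRowBuilder(boolean noIntercept) {
        this.noIntercept = noIntercept;
    }

    /** Validates one row and returns it with a leading 1 unless noIntercept is set. */
    double[] addObservation(double[] x) {
        if (x == null || x.length == 0) {
            throw new IllegalArgumentException("empty observation");
        }
        if (nVars < 0) {
            nVars = x.length; // the first observation is definitional
        } else if (x.length != nVars) {
            throw new IllegalArgumentException(
                "expected " + nVars + " regressors but got " + x.length);
        }
        if (noIntercept) {
            return x.clone();
        }
        double[] row = new double[nVars + 1];
        row[0] = 1.0;
        System.arraycopy(x, 0, row, 1, nVars);
        return row;
    }

    /** Validates a whole block before accepting any of its rows. */
    double[][] addObservations(double[][] x, double[] y) {
        if (x == null || y == null || x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same number of rows");
        }
        int expected = nVars;
        for (double[] r : x) {
            if (r == null) {
                throw new IllegalArgumentException("null row");
            }
            if (expected < 0) {
                expected = r.length;
            }
            if (r.length != expected) {
                throw new IllegalArgumentException("input is not rectangular");
            }
        }
        double[][] rows = new double[x.length][];
        for (int i = 0; i < x.length; i++) {
            rows[i] = addObservation(x[i]);
        }
        return rows;
    }

    int getNumberOfVariables() {
        return nVars;
    }
}
```

Checking the whole matrix before touching any state means a ragged input cannot leave the regression partially updated.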


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060800#comment13060800 ]
greg sterijevski commented on MATH607:

On the results object:
There are vars * (vars + 1) / 2 elements in the cov matrix, vars int
parameters, vars int standard errors and some other assorted stuff. Not
terribly large at first. However, consider doing panel regression via dummy
variables: the covariance matrix can get large very quickly. That being said,
I don't think making RegressionResults a concrete class is a gamestopper.
Should I send a follow up patch with results made concrete?
On the regression object:
Are you concerned that we will be removing methods from any interface we
specify today? Or do you think the contract is too restrictive? The reason I
am pushing for interface is that I have two candidates for concrete
implementation of updating regression. The first implementation is based on
Gentleman's lemma and is detailed in the following article:
Algorithm AS 274: Least Squares Routines to Supplement those of Gentleman
Author: Alan J. Miller
Source: Journal of the Royal Statistical Society, Series C (Applied Statistics), Vol. 41, No. 2 (1992)
The second approach is the one detailed in this article by Goodnight:
A Tutorial on the SWEEP Operator
James H. Goodnight
The American Statistician, Vol. 33, No. 3 (Aug., 1979), pp. 149-158.
The first approach never forms the cross products matrix, the second does.
They are significantly different approaches to dealing with large data sets.
How would I do this in the concrete class you propose?
Thanks,
Greg
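To illustrate what "never forms the cross products matrix" can mean in practice, here is a minimal sketch that updates a triangular factor R and the rotated response one observation at a time using plain Givens rotations. This is a related, simpler technique and not the square-root-free algorithm of Gentleman and Miller (AS 274); the class and method names are hypothetical and there is no handling of rank deficiency beyond a zero-diagonal check.

```java
// Hypothetical sketch: row-by-row QR update with plain Givens rotations.
final class GivensUpdatingRegression {
    private final int p;
    private final double[][] r; // upper-triangular factor, p x p
    private final double[] z;   // first p entries of Q'y
    private double sse;         // accumulated squared residual components
    private long n;

    GivensUpdatingRegression(int p) {
        this.p = p;
        this.r = new double[p][p];
        this.z = new double[p];
    }

    /** Folds one observation into R and z; the row itself is not retained. */
    void addObservation(double[] x, double y) {
        if (x.length != p) {
            throw new IllegalArgumentException("expected " + p + " regressors");
        }
        double[] xr = x.clone();
        double yr = y;
        for (int i = 0; i < p; i++) {
            if (xr[i] == 0.0) {
                continue; // nothing to eliminate in this column
            }
            double h = Math.hypot(r[i][i], xr[i]);
            double c = r[i][i] / h;
            double s = xr[i] / h;
            r[i][i] = h;
            for (int j = i + 1; j < p; j++) {
                double rij = r[i][j];
                r[i][j] = c * rij + s * xr[j];
                xr[j] = -s * rij + c * xr[j];
            }
            double zi = z[i];
            z[i] = c * zi + s * yr;
            yr = -s * zi + c * yr;
        }
        sse += yr * yr; // what is left of y after all rotations
        n++;
    }

    /** Solves R beta = z by back substitution. */
    double[] solve() {
        double[] beta = new double[p];
        for (int i = p - 1; i >= 0; i--) {
            if (r[i][i] == 0.0) {
                throw new IllegalStateException("rank-deficient design");
            }
            double sum = z[i];
            for (int j = i + 1; j < p; j++) {
                sum -= r[i][j] * beta[j];
            }
            beta[i] = sum / r[i][i];
        }
        return beta;
    }

    double getErrorSumSquares() {
        return sse;
    }

    long getN() {
        return n;
    }
}
```

Only the triangular factor (p * (p + 1) / 2 meaningful entries, stored here in a full array for simplicity), the vector z and two scalars are kept, whereas the sweep approach keeps the full cross products. Both sit behind the same kind of add-then-solve contract.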


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13060814#comment13060814 ]
Phil Steitz commented on MATH607:

I did not see the parameter covariance matrix in RegressionResults. I agree with your basic point on this, though. I am less concerned with wanting to add stuff than including things that we either wish we had omitted (e.g. the redundancy stuff as just an example) or typed or constrained differently. How about starting with a minimalist concrete class and once we have the interface stabilized, we can peel it off and keep the class for persisting / serializing results.
Sorry to flip/flop, but looking carefully at the UpdatingLinearRegression interface again, I think it is fine to just add it as an interface. I would suggest the s/data/observation change in my last comment though and maybe renaming it to UpdatingMultipleLinearRegression.
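
As a rough illustration of the shape being discussed (an updating interface that speaks in terms of observations, plus a minimal immutable results class), here is a hedged Java sketch. The method signatures are assumptions made for this example and are not taken from the attached patch; RegressionResultsSketch and RunningSimpleRegression are hypothetical names, and the toy implementation handles only an intercept plus one regressor so that it stays short.

```java
/** Sketch of the interface shape; the method names and signatures are assumptions. */
interface UpdatingMultipleLinearRegression {
    void addObservation(double[] x, double y);
    long getN();
    void clear();
    RegressionResultsSketch regress();
}

/** Hypothetical minimal value class: immutable snapshot of a fit. */
final class RegressionResultsSketch {
    private final double[] parameters;
    private final long n;

    RegressionResultsSketch(double[] parameters, long n) {
        this.parameters = parameters.clone(); // defensive copy on the way in
        this.n = n;
    }

    double[] getParameterEstimates() {
        return parameters.clone();            // defensive copy on the way out
    }

    long getN() {
        return n;
    }
}

/** Toy implementation: intercept plus a single regressor, kept as running sums. */
final class RunningSimpleRegression implements UpdatingMultipleLinearRegression {
    private long n;
    private double sx;
    private double sy;
    private double sxx;
    private double sxy;

    @Override
    public void addObservation(double[] x, double y) {
        if (x.length != 1) {
            throw new IllegalArgumentException("expected exactly one regressor");
        }
        n++;
        sx += x[0];
        sy += y;
        sxx += x[0] * x[0];
        sxy += x[0] * y;
    }

    @Override
    public long getN() {
        return n;
    }

    @Override
    public void clear() {
        n = 0;
        sx = 0.0;
        sy = 0.0;
        sxx = 0.0;
        sxy = 0.0;
    }

    @Override
    public RegressionResultsSketch regress() {
        double denom = n * sxx - sx * sx;
        if (n < 2 || denom == 0.0) {
            throw new IllegalStateException("not enough distinct observations");
        }
        double slope = (n * sxy - sx * sy) / denom;
        double intercept = (sy - slope * sx) / n;
        // The snapshot is independent of this object's later state.
        return new RegressionResultsSketch(new double[] {intercept, slope}, n);
    }
}
```

The property that matters is that a results object returned by regress() stays fixed when more observations are added afterwards, which is the consistency concern raised in the issue description.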


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: updating_reg_cut2
Phil,
Attached is the patch based on your comments. Please review.
Greg


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13062832#comment13062832 ]
Phil Steitz commented on MATH607:

Thanks, Greg!
I committed the patch with minor modifications to make it (almost) consistent with [math] style guidelines. (Running "mvn site" and looking at the checkstyle report shows where problems are with patches). I didn't make any really substantive changes, but there is still some work to be done. I wanted to get the classes committed, though, so we could start the implementation work and refine them as we go.
Here is what still needs attention on the interface/value classes:
1) There is some missing javadoc
2) I made the static constants for the overall stats private in RegressionResults. I did not see any use for them outside of the class and in fact I think it would likely be better to replace the internal array representation of those data by an inner class with proper field names or just define separate data members. Maybe you see that array as having variable length for some models? I am OK leaving as is for now, but let's keep it all private.
3) We can wait to fix this until we know more exactly what is going to come out of the implementations, but we need to fit the exceptions into the [math] hierarchy and be explicit in the throws clauses.
4) There are a couple of references in the javadoc for "redundancy flags" but these are not actually available in the RegressionResults. Probably the references should be dropped and subclasses that expose these will be added for models that include them.
5) The preconditions statements are good to retain, but I don't think they actually belong in the RegressionResults javadoc. Most likely they should be in the javadoc for either UpdatingRegression#regress or the implementations.
Thanks for the patch!


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: regres_change1
Mea culpa,
I made a mistake in retrieving the standard errors. Two lines are defective.


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: millerregtest, millerreg
Attached is the Miller regression and tests.


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13068001#comment13068001 ]
Phil Steitz commented on MATH607:

Second Miller patch committed in r1148557, modified to meet coding standards, other than a) missing javadoc and b) exceptions.


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: millerreg_take2
Attached patch should fix the checkstyle errors... for the miller regression.


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel ]
greg sterijevski updated MATH607:

Attachment: RegressResults2
This patch should fix the errors in checkstyle for RegressionResults.


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13069167#comment13069167 ]
Phil Steitz commented on MATH607:

Checkstyle fixes committed in r1149281, r1149335. Many thanks, Greg. Still remaining:
1) Fix exceptions. I will do this if I get no objections to RegressionModelSpecificationException proposed on the mailing list.
2) findbugs config to ignore unsafe access check in RegressionResults.
3) RegressionResults has no getter for rank.
4) I am still not getting why we need globalFitInfo in RegressionResults. Why not just name the fields instead of maintaining an array and names for indexes into it?


[ https://issues.apache.org/jira/browse/MATH607?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13069256#comment13069256 ]
greg sterijevski commented on MATH607:

Phil,
1. Fix exceptions. I am not 100% sure what I needed to do in order to
correctly exclude this from the bug report. I did not want to commit
something half baked. I would appreciate your help here.
3. Yes, the rank getter is missing. I can put that in.
4. There are a couple of reasons I thought we should keep all that info in
an array.
a.) Neater, all of the information on the fit is in one member
variable, as opposed to 5, 10 or 15 member variables. We really should have
a GlobalInfoEnum maybe? Then we could eliminate all the getters with:
public double getGlobalFitInfo( GlobalInfoEnum gie );
b.) Serialization is a bit easier should a hand coded serialization
routine need to be written.
c.) Model Selection. If we use the regression results object in model
selection algorithms, then the criteria used to evaluate goodness of fit
could be accessible by an index (or enum) into that array. For example, I
might write a little app that runs a million regressions and chooses the top
1% by Rsquared. (I know that this example is complete ad hoc[ery].) You
might then decide that mean squared error is really the criterion you want
to use. Instead of recoding the objective function to call
getMeanSquaredError() instead of getRSquared(), you simply provide the index
or the enum.
d.) Growth. While we have a few parameters of global fit now, I am sure
that number will grow. We might need to add the likelihood function value, an F
test of global applicability,.... In a simple beans interface setup we would
add many methods... I can't help but feel that this is messy and tedious for
the user.
Greg
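
A small Java sketch may help weigh point 4. It shows one way the enum-keyed accessor from (a) could look, and how the model selection scenario in (c) could switch criteria without recoding. Everything here is hypothetical and not from any attached patch: the enum constants, the FitSummary class standing in for the relevant part of RegressionResults, and the ModelSelection helper are assumptions made for illustration only.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical enum of global fit statistics; the constants are illustrative. */
enum GlobalInfoEnum {
    R_SQUARED,
    ADJUSTED_R_SQUARED,
    MEAN_SQUARED_ERROR,
    SUM_SQUARED_ERRORS
}

/** Hypothetical stand-in for the global fit portion of a results object. */
final class FitSummary {
    private final EnumMap<GlobalInfoEnum, Double> info;

    FitSummary(Map<GlobalInfoEnum, Double> values) {
        // Copy so that later changes to the caller's map do not leak in.
        this.info = new EnumMap<>(GlobalInfoEnum.class);
        this.info.putAll(values);
    }

    /** Single accessor keyed by enum, in place of one getter per statistic. */
    double getGlobalFitInfo(GlobalInfoEnum gie) {
        Double v = info.get(gie);
        if (v == null) {
            throw new IllegalArgumentException("statistic not available: " + gie);
        }
        return v;
    }
}

/** Hypothetical helper: the selection criterion is a parameter, not hard-coded. */
final class ModelSelection {
    private ModelSelection() {
    }

    static FitSummary best(List<FitSummary> fits, GlobalInfoEnum criterion, boolean higherIsBetter) {
        if (fits.isEmpty()) {
            throw new IllegalArgumentException("no fits to choose from");
        }
        FitSummary best = fits.get(0);
        for (FitSummary f : fits) {
            double d = f.getGlobalFitInfo(criterion) - best.getGlobalFitInfo(criterion);
            if (higherIsBetter ? d > 0 : d < 0) {
                best = f;
            }
        }
        return best;
    }
}
```

Switching from R-squared to mean squared error is then a change of arguments, e.g. best(fits, GlobalInfoEnum.MEAN_SQUARED_ERROR, false). The trade-off raised earlier in the thread still applies: named fields and getters are checked at compile time, whereas a keyed lookup can fail at run time for a statistic that a given model does not supply.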

