[statistics] Pull request for GLSMultipleLinearRegression

[statistics] Pull request for GLSMultipleLinearRegression

Елена Картышева
Hello.

I would like to propose a pull request that adds an option to use a variance vector instead of a covariance matrix. It lets users avoid unnecessary memory usage and excessive computation when the errors are uncorrelated but heteroscedastic, which makes it possible to work with huge input matrices. Using a variance vector in such cases reduces the time complexity from O(N^2) to just O(N) (where N is the number of observations) and dramatically reduces memory usage. For example, in my own work I needed to train a generalized linear model. The iteratively reweighted least squares (IRLS) algorithm requires a weighted regression with more than a million observations. The current implementation would require approximately 12 terabytes of memory, while the patched version needs only 8 megabytes. Since IRLS is an iterative algorithm, a million-fold reduction in complexity is also pretty handy.
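
A rough sketch of the arithmetic behind figures like these, assuming dense storage and 8-byte doubles. The class and method names are hypothetical and are not part of the proposed patch, and the value of N is only illustrative:

```java
/** Back-of-the-envelope storage estimate; hypothetical helper, not library code. */
final class CovarianceStorage {

    /** Bytes needed for a dense n-by-n covariance matrix of 8-byte doubles. */
    static long denseMatrixBytes(long n) {
        return Math.multiplyExact(Math.multiplyExact(n, n), (long) Double.BYTES);
    }

    /** Bytes needed for a length-n variance vector of 8-byte doubles. */
    static long vectorBytes(long n) {
        return Math.multiplyExact(n, (long) Double.BYTES);
    }

    public static void main(String[] args) {
        long n = 1_000_000L; // illustrative number of observations (assumption)
        System.out.printf("matrix: %.1f TB%n", denseMatrixBytes(n) / 1e12);
        System.out.printf("vector: %.1f MB%n", vectorBytes(n) / 1e6);
    }
}
```

At exactly one million observations this gives 8 * 10^12 bytes for the matrix and 8 * 10^6 bytes for the vector. The matrix figure grows quadratically with N, so somewhat more than a million observations reaches the order of magnitude quoted above.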

 
--
Sincerely yours, Elena Kartysheva.

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: [statistics] Pull request for GLSMultipleLinearRegression

Ben Nguyen
Hello,

There is currently a transition under way from the commons-math-stat libraries to the new commons-statistics library. I am working on the regression-related design for my Google Summer of Code project. I am a new contributor and would love to work with more people who have used these tools extensively and can offer more insight.

The transition is mostly in the design stages. We are still figuring out essential problems, such as which linear math library to use (not the one from commons-math, since it's outdated) and how to design a better, more flexible UI.

I have not yet looked into GLS in as much depth as OLS or the new LogisticRegression component, so perhaps you can help contribute to the GLS component to ensure your needs are met. Our goal is also to maximize efficiency in all areas, using Java 8 features such as the Streams API where they would increase performance.

Issue for regression component, please post insights here as well: https://issues.apache.org/jira/browse/STATISTICS-8
GitHub Repo: https://github.com/apache/commons-statistics

Thank you for your post,
Cheers,
-Ben Nguyen

From: Елена Картышева
Sent: Thursday, May 23, 2019 8:44 AM
To: dev
Subject: [statistics] Pull request for GLSMultipleLinearRegression



Re: [statistics] Pull request for GLSMultipleLinearRegression

Eric Barnhill
In reply to this post by Елена Картышева
Hi Elena,

Thanks for this intriguing idea. As far as I know, IRLS requires a
matrix. Can you point me to a citation where I can read about this
vector-based approach?

Thanks,
Eric



Re: [statistics] Pull request for GLSMultipleLinearRegression

Alexey Dievsky
Hi Eric,

As Elena's mentor, I can try to shed some light on this question.
IRLS, as used in GLM fitting, does require a matrix, but that matrix is
always diagonal. See, for example, https://bwlewis.github.io/GLM/ , section
"Algorithm IRLS". That is exactly the motivation for this pull request:
implement the special case of generalized least squares where the
covariance matrix is diagonal, since for a diagonal matrix we can
perform the least squares estimation in O(n) instead of O(n^2) for
both time and memory (where n is the number of observations).
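
To illustrate the diagonal special case: the GLS estimate is
beta = (X' S^-1 X)^-1 X' S^-1 y, and when S = diag(v) the inverse is
just diag(1/v), so X' S^-1 X and X' S^-1 y can be accumulated in one
pass over the observations. Below is a minimal sketch under stated
assumptions. The class DiagonalGls and its methods are hypothetical and
are not the code from the pull request. It solves the normal equations
directly, which is simple but less numerically robust than a QR-based
solver.

```java
/** Illustrative weighted least squares for a diagonal covariance (hypothetical). */
final class DiagonalGls {

    /**
     * Estimates beta minimizing sum_i (y_i - x_i . beta)^2 / variance_i.
     * x has n rows and p columns; include a column of ones for an intercept.
     */
    static double[] fit(double[][] x, double[] y, double[] variance) {
        int n = x.length;
        int p = x[0].length;
        double[][] a = new double[p][p + 1]; // augmented system [X'WX | X'Wy]
        for (int i = 0; i < n; i++) {
            double w = 1.0 / variance[i];    // weight = inverse variance
            for (int j = 0; j < p; j++) {
                double wx = w * x[i][j];
                for (int k = 0; k < p; k++) {
                    a[j][k] += wx * x[i][k];
                }
                a[j][p] += wx * y[i];
            }
        }
        return solve(a);
    }

    /** Gaussian elimination with partial pivoting on a p x (p + 1) system. */
    private static double[] solve(double[][] a) {
        int p = a.length;
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (a[col][col] == 0.0) {
                throw new ArithmeticException("singular normal equations");
            }
            for (int r = col + 1; r < p; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] beta = new double[p];
        for (int r = p - 1; r >= 0; r--) {   // back substitution
            double s = a[r][p];
            for (int c = r + 1; c < p; c++) {
                s -= a[r][c] * beta[c];
            }
            beta[r] = s / a[r][r];
        }
        return beta;
    }

    public static void main(String[] args) {
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};           // exactly y = 1 + 2 * x
        double[] variance = {0.5, 1, 2, 4};  // made-up per-observation variances
        double[] beta = fit(x, y, variance);
        System.out.println(beta[0] + " " + beta[1]);
    }
}
```

With p regressors, this sketch needs O(n * p^2) time and only O(p^2)
memory beyond the inputs, which is linear in n for fixed p. The n-by-n
covariance matrix is never formed.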

Sincerely,
Aleksei Dievskii.

On Fri, May 24, 2019 at 2:32 AM Eric Barnhill <[hidden email]> wrote:

>
> Hi Elena,
>
> Thanks for this intriguing idea. As far as I ever knew IRLS requires a
> matrix. Can you provide me with a citation where I can read about this
> vector-based approach?
>
> Thanks,
> Eric
