[math] correlation analysis with NaNs



Martin Rosellen
> Hi.
>>> [...]
>>>> Now is as good a time as any to think about how to correctly
>>>> represent and handle missing data.  The unfortunate thing is that in
>>>> Java working with primitive doubles we are back to the old Fortran
>>>> days of having no natural representation of a missing value.
>>>> Sticking with primitives, the only thing we can do is either use NaN
>>>> or allow the "missing" designator to be configured by the user.  I am
>>>> curious what others have done in this area.
>>> As you say, as I said, with primitive double, there is no value that
>>> can readily serve as "missing". It's a user's choice (e.g.
>>> "Double.NaN", "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative
>>> value", ...), that depends on the context.
>>>> The second question is what strategies do we support for handling
>>>> missing data and how do we represent those strategies.   The
>>>> simplest and easiest strategy to implement is to delete observations
>>>> that include missing data.  This is a data-only strategy and would
>>>> work the same way across algorithms.  I am afraid, however, that this
>>>> is the only strategy that is not algorithm-dependent (unless you
>>>> consider, e.g. EM as a missing data strategy or very simple
>>>> imputation strategies).  So that means individual algorithms need to
>>>> include missing data strategies in their specifications.  It might be
>>>> good to define and implement these for the correlation and regression
>>>> classes and see if we can generalize.  Any ideas on how best to do
>>>> this?
>>> I'm sorry if I'm dense, but I don't remember if or why the option that
>>> users should provide clean input data to CM has been ruled out.
>>> I.e. filtering (by user) is done before computation (by CM's algo).
>>> If the data is missing, how can you use it (to correlate, to fit, ...)?
>> There are multiple techniques that can be used to adjust for missing data,
>> depending on the algorithm.  See [1], for example, for a summary of the
>> kinds of techniques that can be used in regression.
>> Basically, saying users need to adjust the data before providing it to the
>> algorithm allows only the "data only" approaches and may make other analyses
>> on the same data inconvenient or impossible.
>> Phil
>> [1]
>> http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
>> I agree that we should consider a more comprehensive treatment of missing
>> data. Perhaps we should start by designing an interface that could be
>> implemented by existing classes. For example, an imputation interface could
>> have methods like miimpute, mianalyze and misummarize and this interface
>> could be implemented in a class that extends OLSMultipleLinearRegression.
>> This approach allows each estimation method to adopt its own treatment of
>> missing data.
>> An alternative is to develop data structures that represent the original and
>> complete data sets. Missing data methods could be applied to the data
>> structures and return a complete data set for use in estimation methods.
>> I guess the decision is whether the missing data treatment should be part of
>> an independent data structure or integrated into the estimation method.
>> Just some thoughts about possible ways of handling it.
>> Patrick
> Is the issue (in CM) about handling missing data or representing missing
> data?
> IIUC, handling is algorithm-dependent. Representation is a matter of
> convention (i.e. user-dependent).
> My proposal would be that for every algorithm that is able to handle
> missing data, we provide an argument (to constructors) that specifies the
> "double" value that represents a missing value.
> Regards,
> Gilles


I have been following this topic with great interest on the dev mailing list.
Thanks to Phil for the nice summary of approaches to handling missing data.
For my purposes, pairwise deletion would be the best strategy. There are
currently no issues in the tracking system for new missing-data strategies.
I can do the filtering on my own, but I want to state here that new
strategies would be of great benefit.
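
For readers unfamiliar with the term, pairwise deletion can be sketched roughly as follows. This is a minimal illustration in plain Java, assuming NaN marks a missing value; the class `PairwiseCorrelation` and its methods are hypothetical names and are not part of Commons Math. For each pair of variables, only the observations where both values are present contribute to that pair's Pearson correlation.

```java
/** Hypothetical helper for illustration only; not a Commons Math class. */
final class PairwiseCorrelation {

    private PairwiseCorrelation() {}

    /**
     * Pearson correlation of x and y, using only the positions where
     * neither value is NaN (NaN is assumed to mean "missing").
     * Returns NaN when fewer than two complete pairs remain.
     */
    static double correlation(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        // First pass: count complete pairs and accumulate sums for the means.
        int n = 0;
        double sumX = 0.0;
        double sumY = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                continue;
            }
            n++;
            sumX += x[i];
            sumY += y[i];
        }
        if (n < 2) {
            return Double.NaN;
        }
        double meanX = sumX / n;
        double meanY = sumY / n;
        // Second pass: centered cross products over the same complete pairs.
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (Double.isNaN(x[i]) || Double.isNaN(y[i])) {
                continue;
            }
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Correlation matrix over columns; each entry uses its own set of complete pairs. */
    static double[][] correlationMatrix(double[][] columns) {
        int k = columns.length;
        double[][] r = new double[k][k];
        for (int i = 0; i < k; i++) {
            for (int j = i; j < k; j++) {
                double c = correlation(columns[i], columns[j]);
                r[i][j] = c;
                r[j][i] = c;
            }
        }
        return r;
    }
}
```

Because each matrix entry may be computed from a different subset of observations, a matrix built this way is not guaranteed to be positive semidefinite, which is one reason the choice of strategy depends on the algorithm that consumes it.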



Re: [math] correlation analysis with NaNs

Ted Dunning
This is a nice way to allow various conventions.  Sometimes -1 might be the
right value.  Other times a more principled value like NaN might be

>> My proposal would be that for every algorithm that is able to handle
>> missing data, we provide an argument (to constructors) that specifies the
>> "double" value that represents a missing value.
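
The constructor-argument proposal quoted above might look roughly like the sketch below. This is only an illustration under stated assumptions: `MissingValueFilter`, `isMissing` and `completeCases` are hypothetical names, not an existing Commons Math API, and the strategy shown (dropping every observation with a missing value) is the simple "data only" one discussed in the thread. The one subtlety is that NaN never compares equal to itself, so a NaN designator has to be tested with `Double.isNaN`.

```java
import java.util.Arrays;

/** Hypothetical sketch of a user-configurable "missing" designator. */
final class MissingValueFilter {

    /** The double value that the caller has chosen to mean "missing". */
    private final double missing;

    MissingValueFilter(double missing) {
        this.missing = missing;
    }

    /** True if v is the configured missing-value designator. */
    boolean isMissing(double v) {
        // NaN != NaN under ==, so the NaN convention needs a dedicated check.
        return Double.isNaN(missing) ? Double.isNaN(v) : v == missing;
    }

    /**
     * Drops every observation where x or y is missing and returns
     * the remaining values as { keptX, keptY }.
     */
    double[][] completeCases(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double[] keptX = new double[x.length];
        double[] keptY = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (!isMissing(x[i]) && !isMissing(y[i])) {
                keptX[n] = x[i];
                keptY[n] = y[i];
                n++;
            }
        }
        // Trim the buffers to the number of complete observations.
        return new double[][] {Arrays.copyOf(keptX, n), Arrays.copyOf(keptY, n)};
    }
}
```

With `new MissingValueFilter(-1)` the value -1 is treated as missing, and with `new MissingValueFilter(Double.NaN)` NaN is. The two filtered arrays could then be handed to an existing computation such as `PearsonsCorrelation.correlation(double[], double[])`.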