Re: [math] correlation analysis with NaNs


Re: [math] correlation analysis with NaNs

Thomas Neidhart
Hi Patrick,

On 11/07/2012 04:37 PM, Patrick Meyer wrote:

> I agree that it would be nice to have a constructor that allows you to
> specify the ranking algorithm only.
>
> As far as NaN and the Spearman correlation, maybe we should add a default
> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> encountered. R uses this treatment of missing data and forces users to
> choose how to handle it. If we implemented something like listwise or
> pairwise deletion it could be used in other classes too. As such, treatment
> of missing data should be part of a larger discussion and handled in a more
> comprehensive and systematic way.

I think this additional option makes sense, but I forward this
discussion to the dev mailing list where it is better suited.

Thomas

> -----Original Message-----
> From: Thomas Neidhart [mailto:[hidden email]]
> Sent: Wednesday, November 07, 2012 8:09 AM
> To: [hidden email]
> Subject: Re: [math] correlation analysis with NaNs
>
> On 11/07/2012 01:38 PM, Patrick Meyer wrote:
>> You are getting values like 2.5 because of the default ties strategy.
>> If you do not want to use that method, create an instance of
>> RankingAlgorithm with a different ties strategy and pass it to the
>> constructor for the SpearmanCorrelation. This approach also gives you
>> control over the method for dealing with NaNs. Something like,
>>
>> // create data matrix (3 rows, 2 columns)
>> double[] column1 = new double[]{Double.NaN, 1, 2};
>> double[] column2 = new double[]{10, 2, 10};
>> Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
>> for (int i = 0; i < column1.length; i++) {
>>     mydata.setEntry(i, 0, column1[i]);
>>     mydata.setEntry(i, 1, column2[i]);
>> }
>>
>> // compute correlation
>> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
>>         TiesStrategy.RANDOM);
>> SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);
>>
>> Try that.
>
> Hi,
>
> this will not really help imho.
>
> As far as I can see, there are at least two problems with the current use of
> the RankingAlgorithm in the SpearmanCorrelation class:
>
>  * there is no way to select the ranking algorithm in the constructor
>    without passing the values at the same time
>  * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
>    the NaN only from the input array where it occurs but not in the
>    corresponding array, thus rendering it useless as it will result in
>    exceptions (array lengths differ)
>
> Would you be able to create an issue for this on the issue tracker and
> provide the test case?
>
> Thanks,
>
> Thomas
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>
>

Re: [math] correlation analysis with NaNs

Gilles Sadowski
On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:

> Hi Patrick,
>
> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> > I agree that it would be nice to have a constructor that allows you to
> > specify the ranking algorithm only.
> >
> > As far as NaN and the Spearman correlation, maybe we should add a default
> > strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> > encountered. R uses this treatment of missing data and forces users to
> > choose how to handle it. If we implemented something like listwise or
> > pairwise deletion it could be used in other classes too. As such, treatment
> > of missing data should be part of a larger discussion and handled in a more
> > comprehensive and systematic way.
>
> I think this additional option makes sense, but I forward this
> discussion to the dev mailing list where it is better suited.

I'm wary of having CM handle "missing" data.
For one thing we'd have to define a "convention" to represent missing data.
There is no good way to do that in Java. Using NaN for this purpose in a
low-level library is not a good idea IMHO. Then, any convention might not be
suitable for some user applications, which would lead such an application's
developer to filter the data anyway in order to change his representation to
CM's representation. Rather than calling two redundant filtering codes, I'd
rather assume that CM gets a clean input on which its algorithm can operate.
As usual, the input is subjected to precondition checks, and exceptions are
thrown if the data is not clean enough.

In summary: data validation (in the sense of discarding input) should not be
done _before_ calling CM routines.


Regards,
Gilles

> Thomas
>
> > -----Original Message-----
> > From: Thomas Neidhart [mailto:[hidden email]]
> > Sent: Wednesday, November 07, 2012 8:09 AM
> > To: [hidden email]
> > Subject: Re: [math] correlation analysis with NaNs
> >
> > On 11/07/2012 01:38 PM, Patrick Meyer wrote:
> >> You are getting values like 2.5 because of the default ties strategy.
> >> If you do not want to use that method, create an instance of
> >> RankingAlgorithm with a different ties strategy and pass it to the
> >> constructor for the SpearmanCorrelation. This approach also gives you
> >> control over the method for dealing with NaNs. Something like,
> >>
> >> // create data matrix (3 rows, 2 columns)
> >> double[] column1 = new double[]{Double.NaN, 1, 2};
> >> double[] column2 = new double[]{10, 2, 10};
> >> Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
> >> for (int i = 0; i < column1.length; i++) {
> >>     mydata.setEntry(i, 0, column1[i]);
> >>     mydata.setEntry(i, 1, column2[i]);
> >> }
> >>
> >> // compute correlation
> >> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
> >>         TiesStrategy.RANDOM);
> >> SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);
> >>
> >> Try that.
> >
> > Hi,
> >
> > this will not really help imho.
> >
> > As far as I can see, there are at least two problems with the current use of
> > the RankingAlgorithm in the SpearmanCorrelation class:
> >
> >  * there is no way to select the ranking algorithm in the constructor
> >    without passing the values at the same time
> >  * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
> >    the NaN only from the input array where it occurs but not in the
> >    corresponding array, thus rendering it useless as it will result in
> >    exceptions (array lengths differ)
> >
> > Would you be able to create an issue for this on the issue tracker and
> > provide the test case?
> >
> > Thanks,
> >
> > Thomas


Re: [math] correlation analysis with NaNs

Sébastien Brisard
Hi,

2012/11/8 Gilles Sadowski <[hidden email]>:

> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>> Hi Patrick,
>>
>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>> > I agree that it would be nice to have a constructor that allows you to
>> > specify the ranking algorithm only.
>> >
>> > As far as NaN and the Spearman correlation, maybe we should add a default
>> > strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>> > encountered. R uses this treatment of missing data and forces users to
>> > choose how to handle it. If we implemented something like listwise or
>> > pairwise deletion it could be used in other classes too. As such, treatment
>> > of missing data should be part of a larger discussion and handled in a more
>> > comprehensive and systematic way.
>>
>> I think this additional option makes sense, but I forward this
>> discussion to the dev mailing list where it is better suited.
>
> I'm wary of having CM handle "missing" data.
> For one thing we'd have to define a "convention" to represent missing data.
> There is no good way to do that in Java. Using NaN for this purpose in a
> low-level library is not a good idea IMHO.
>
I agree with Gilles, here. If I remember correctly, R has a special
value NA, or something similar, which differs from NaN.

>
> Then, any convention might not be
> suitable for some user applications, which would lead such an application's
> developer to filter the data anyway in order to change his representation to
> CM's representation. Rather than calling two redundant filtering codes, I'd
> rather assume that CM gets a clean input on which its algorithm can operate.
> As usual, the input is subjected to precondition checks, and exceptions are
> thrown if the data is not clean enough.
>
> In summary: data validation (in the sense of discarding input) should not be
> done _before_ calling CM routines.
>
+1.

Sébastien

>
> Regards,
> Gilles
>
>> Thomas
>>
>> > -----Original Message-----
>> > From: Thomas Neidhart [mailto:[hidden email]]
>> > Sent: Wednesday, November 07, 2012 8:09 AM
>> > To: [hidden email]
>> > Subject: Re: [math] correlation analysis with NaNs
>> >
>> > On 11/07/2012 01:38 PM, Patrick Meyer wrote:
>> >> You are getting values like 2.5 because of the default ties strategy.
>> >> If you do not want to use that method, create an instance of
>> >> RankingAlgorithm with a different ties strategy and pass it to the
>> >> constructor for the SpearmanCorrelation. This approach also gives you
>> >> control over the method for dealing with NaNs. Something like,
>> >>
>> >> // create data matrix (3 rows, 2 columns)
>> >> double[] column1 = new double[]{Double.NaN, 1, 2};
>> >> double[] column2 = new double[]{10, 2, 10};
>> >> Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
>> >> for (int i = 0; i < column1.length; i++) {
>> >>     mydata.setEntry(i, 0, column1[i]);
>> >>     mydata.setEntry(i, 1, column2[i]);
>> >> }
>> >>
>> >> // compute correlation
>> >> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
>> >>         TiesStrategy.RANDOM);
>> >> SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);
>> >>
>> >> Try that.
>> >
>> > Hi,
>> >
>> > this will not really help imho.
>> >
>> > As far as I can see, there are at least two problems with the current use of
>> > the RankingAlgorithm in the SpearmanCorrelation class:
>> >
>> >  * there is no way to select the ranking algorithm in the constructor
>> >    without passing the values at the same time
>> >  * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
>> >    the NaN only from the input array where it occurs but not in the
>> >    corresponding array, thus rendering it useless as it will result in
>> >    exceptions (array lengths differ)
>> >
>> > Would you be able to create an issue for this on the issue tracker and
>> > provide the test case?
>> >
>> > Thanks,
>> >
>> > Thomas
>

Re: [math] correlation analysis with NaNs

Thomas Neidhart
On 11/08/2012 02:01 PM, Sébastien Brisard wrote:

> Hi,
>
> 2012/11/8 Gilles Sadowski <[hidden email]>:
>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>> Hi Patrick,
>>>
>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>> I agree that it would be nice to have a constructor that allows you to
>>>> specify the ranking algorithm only.
>>>>
>>>> As far as NaN and the Spearman correlation, maybe we should add a default
>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>>>> encountered. R uses this treatment of missing data and forces users to
>>>> choose how to handle it. If we implemented something like listwise or
>>>> pairwise deletion it could be used in other classes too. As such, treatment
>>>> of missing data should be part of a larger discussion and handled in a more
>>>> comprehensive and systematic way.
>>>
>>> I think this additional option makes sense, but I forward this
>>> discussion to the dev mailing list where it is better suited.
>>
>> I'm wary of having CM handle "missing" data.
>> For one thing we'd have to define a "convention" to represent missing data.
>> There is no good way to do that in Java. Using NaN for this purpose in a
>> low-level library is not a good idea IMHO.
>>
> I agree with Gilles, here. If I remember correctly, R has a special
> value NA, or something similar, which differs from NaN.
>>
>> Then, any convention might not be
>> suitable for some user applications, which would lead such an application's
>> developer to filter the data anyway in order to change his representation to
>> CM's representation. Rather than calling two redundant filtering codes, I'd
>> rather assume that CM gets a clean input on which its algorithm can operate.
>> As usual, the input is subjected to precondition checks, and exceptions are
>> thrown if the data is not clean enough.
>>
>> In summary: data validation (in the sense of discarding input) should not be
>> done _before_ calling CM routines.
>>
> +1.

ok, I am now confused. First you say that CM should not be involved in
data cleaning, but then you state that data validation should not be
done before calling CM? Maybe there is one *not* too many?

I think the proposition from Patrick was to exactly do that: throw an
exception if such invalid data is encountered (NaNStrategy.FAIL).
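
A fail-fast check of the kind described here could look roughly like the sketch below. It is plain Java and does not use Commons Math: the class name NaNCheck and the method requireNoNaN are hypothetical names chosen for illustration, and the exception type is an assumption, not what a NaNStrategy.FAIL implementation would necessarily throw.

```java
// Hypothetical helper, not part of Commons Math: reject input that contains NaN.
final class NaNCheck {

    private NaNCheck() { }

    /** Throws IllegalArgumentException if any element of data is NaN. */
    static void requireNoNaN(double[] data) {
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i])) {
                throw new IllegalArgumentException("NaN encountered at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        requireNoNaN(new double[] {1, 2, 3}); // clean input: returns normally
        try {
            requireNoNaN(new double[] {Double.NaN, 1, 2});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a check like this as the default, the caller is forced to decide how NaNs are handled instead of getting a silently altered result.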

The other thing is, that the NaNStrategy.REMOVED is broken, so either we
fix it or deprecate it.

Thomas


Re: [math] correlation analysis with NaNs

Gilles Sadowski
On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:

> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> > Hi,
> >
> > 2012/11/8 Gilles Sadowski <[hidden email]>:
> >> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> >>> Hi Patrick,
> >>>
> >>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>> I agree that it would be nice to have a constructor that allows you to
> >>>> specify the ranking algorithm only.
> >>>>
> >>>> As far as NaN and the Spearman correlation, maybe we should add a default
> >>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> >>>> encountered. R uses this treatment of missing data and forces users to
> >>>> choose how to handle it. If we implemented something like listwise or
> >>>> pairwise deletion it could be used in other classes too. As such, treatment
> >>>> of missing data should be part of a larger discussion and handled in a more
> >>>> comprehensive and systematic way.
> >>>
> >>> I think this additional option makes sense, but I forward this
> >>> discussion to the dev mailing list where it is better suited.
> >>
> >> I'm wary of having CM handle "missing" data.
> >> For one thing we'd have to define a "convention" to represent missing data.
> >> There is no good way to do that in Java. Using NaN for this purpose in a
> >> low-level library is not a good idea IMHO.
> >>
> > I agree with Gilles, here. If I remember correctly, R has a special
> > value NA, or something similar, which differs from NaN.
> >>
> >> Then, any convention might not be
> >> suitable for some user applications, which would lead such an application's
> >> developer to filter the data anyway in order to change his representation to
> >> CM's representation. Rather than calling two redundant filtering codes, I'd
> >> rather assume that CM gets a clean input on which its algorithm can operate.
> >> As usual, the input is subjected to precondition checks, and exceptions are
> >> thrown if the data is not clean enough.
> >>
> >> In summary: data validation (in the sense of discarding input) should not be
> >> done _before_ calling CM routines.
> >>
> > +1.
>
> ok, I am now confused. First you say that CM should not be involved in
> data cleaning, but then you state that data validation should not be
> done before calling CM? Maybe there is one *not* too many?

Yes, you are right: I wrote the opposite of what I meant.
---
  In summary: data validation (in the sense of discarding input) should
  be done _before_ calling CM routines.
---

>
> I think the proposition from Patrick was to exactly do that: throw an
> exception if such invalid data is encountered (NaNStrategy.FAIL).
>
> The other thing is, that the NaNStrategy.REMOVED is broken, so either we
> fix it or deprecate it.

+1
[I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
However, if nobody could actually rely on this feature because it is broken,
I'd prefer to remove it.]


Sorry for the confusion,
Gilles


Re: [math] correlation analysis with NaNs

Phil Steitz
On 11/8/12 8:23 AM, Gilles Sadowski wrote:

> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
>>> Hi,
>>>
>>> 2012/11/8 Gilles Sadowski <[hidden email]>:
>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>>>> Hi Patrick,
>>>>>
>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>>>> I agree that it would be nice to have a constructor that allows you to
>>>>>> specify the ranking algorithm only.
+1 - patches welcome.
>>>>>>
>>>>>> As far as NaN and the Spearman correlation, maybe we should add a default
>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>>>>>> encountered. R uses this treatment of missing data and forces users to
>>>>>> choose how to handle it. If we implemented something like listwise or
>>>>>> pairwise deletion it could be used in other classes too. As such, treatment
>>>>>> of missing data should be part of a larger discussion and handled in a more
>>>>>> comprehensive and systematic way.
+1 to develop a strategy for how to represent and
handle missing data (see below)

>>>>> I think this additional option makes sense, but I forward this
>>>>> discussion to the dev mailing list where it is better suited.
>>>> I'm wary of having CM handle "missing" data.
>>>> For one thing we'd have to define a "convention" to represent missing data.
>>>> There is no good way to do that in Java. Using NaN for this purpose in a
>>>> low-level library is not a good idea IMHO.
>>>>
>>> I agree with Gilles, here. If I remember correctly, R has a special
>>> value NA, or something similar, which differs from NaN.
>>>> Then, any convention might not be
>>>> suitable for some user applications, which would lead such an application's
>>>> developer to filter the data anyway in order to change his representation to
>>>> CM's representation. Rather than calling two redundant filtering codes, I'd
>>>> rather assume that CM gets a clean input on which its algorithm can operate.
>>>> As usual, the input is subjected to precondition checks, and exceptions are
>>>> thrown if the data is not clean enough.
>>>>
>>>> In summary: data validation (in the sense of discarding input) should not be
>>>> done _before_ calling CM routines.
>>>>
>>> +1.
>> ok, I am now confused. First you say that CM should not be involved in
>> data cleaning, but then you state that data validation should not be
>> done before calling CM? Maybe there is one *not* too many?
> Yes, you are right: I wrote the opposite of what I meant.
> ---
>   In summary: data validation (in the sense of discarding input) should
>   be done _before_ calling CM routines.
> ---
>
>> I think the proposition from Patrick was to exactly do that: throw an
>> exception if such invalid data is encountered (NaNStrategy.FAIL).
>>
>> The other thing is, that the NaNStrategy.REMOVED is broken, so either we
>> fix it or deprecate it.

That we should fix.  Please open a JIRA for this.  I assume you are
talking about the implementation in NaturalRanking.
> +1
> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
> However, if nobody could actually rely on this feature because it is broken,
> I'd prefer to remove it.]

There are two issues here.  One is specific to ranking algorithms.
To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
the result has to be a total ordering.  The NaNStrategy.REMOVED
strategy is intended to represent removal of NaNs from the data to
be ordered.  If it is not implemented correctly in NaturalRanking or
other rankings that is a bug and needs to be fixed.
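
To illustrate why a ranking needs an explicit NaN policy, the sketch below uses only the Java standard library. The class NaNOrdering and its two methods are hypothetical names for this example. It shows that the relational operators cannot order NaN against anything, while Double.compare (and hence Arrays.sort on a double[]) imposes a total order by placing NaN above every other value. That placement is just one possible convention, which is the kind of choice a NaNStrategy makes explicit.

```java
import java.util.Arrays;

// Hypothetical demo class: how Java orders (or fails to order) NaN.
final class NaNOrdering {

    private NaNOrdering() { }

    /** True if a and b can be ordered with the <, > and == operators. */
    static boolean orderedByOperators(double a, double b) {
        return a < b || a > b || a == b;
    }

    /** Sorted copy of data; Arrays.sort uses the total order of Double.compareTo. */
    static double[] sortedCopy(double[] data) {
        double[] copy = Arrays.copyOf(data, data.length);
        Arrays.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        // All three relational comparisons involving NaN are false.
        System.out.println(orderedByOperators(Double.NaN, 1.0));
        // Double.compare treats NaN as larger than any other value.
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY) > 0);
        // As a consequence, NaN ends up last after sorting.
        System.out.println(Arrays.toString(sortedCopy(new double[] {2.0, Double.NaN, 1.0})));
    }
}
```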

The second issue is the more general one of how to represent and
handle missing data.  I have always seen that as a limitation that
we would eventually address on an algorithm by algorithm basis.
Different algorithms can be configured to do different things when
missing data are encountered.  It is not always possible or
desirable to preprocess the data to "eliminate" or impute missing
data.  Saying that we are just not going to deal with it is a
limitation that I don't think we should impose.  I would like to
hear others' ideas about good ways to model missing data in Java.

Phil


>
>
> Sorry for the confusion,
> Gilles
>

Re: [math] correlation analysis with NaNs

Phil Steitz
On 11/8/12 9:44 AM, Phil Steitz wrote:

> On 11/8/12 8:23 AM, Gilles Sadowski wrote:
>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
>>>> Hi,
>>>>
>>>> 2012/11/8 Gilles Sadowski <[hidden email]>:
>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
>>>>>> Hi Patrick,
>>>>>>
>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
>>>>>>> I agree that it would be nice to have a constructor that allows you to
>>>>>>> specify the ranking algorithm only.
> +1 - patches welcome.
>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a default
>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
>>>>>>> encountered. R uses this treatment of missing data and forces users to
>>>>>>> choose how to handle it. If we implemented something like listwise or
>>>>>>> pairwise deletion it could be used in other classes too. As such, treatment
>>>>>>> of missing data should be part of a larger discussion and handled in a more
>>>>>>> comprehensive and systematic way.
> +1 to develop a strategy for how to represent and
> handle missing data (see below)
>>>>>> I think this additional option makes sense, but I forward this
>>>>>> discussion to the dev mailing list where it is better suited.
>>>>> I'm wary of having CM handle "missing" data.
>>>>> For one thing we'd have to define a "convention" to represent missing data.
>>>>> There is no good way to do that in Java. Using NaN for this purpose in a
>>>>> low-level library is not a good idea IMHO.
>>>>>
>>>> I agree with Gilles, here. If I remember correctly, R has a special
>>>> value NA, or something similar, which differs from NaN.
>>>>> Then, any convention might not be
>>>>> suitable for some user applications, which would lead such an application's
>>>>> developer to filter the data anyway in order to change his representation to
>>>>> CM's representation. Rather than calling two redundant filtering codes, I'd
>>>>> rather assume that CM gets a clean input on which its algorithm can operate.
>>>>> As usual, the input is subjected to precondition checks, and exceptions are
>>>>> thrown if the data is not clean enough.
>>>>>
>>>>> In summary: data validation (in the sense of discarding input) should not be
>>>>> done _before_ calling CM routines.
>>>>>
>>>> +1.
>>> ok, I am now confused. First you say that CM should not be involved in
>>> data cleaning, but then you state that data validation should not be
>>> done before calling CM? Maybe there is one *not* too many?
>> Yes, you are right: I wrote the opposite of what I meant.
>> ---
>>   In summary: data validation (in the sense of discarding input) should
>>   be done _before_ calling CM routines.
>> ---
>>
>>> I think the proposition from Patrick was to exactly do that: throw an
>>> exception if such invalid data is encountered (NaNStrategy.FAIL).
>>>
>>> The other thing is, that the NaNStrategy.REMOVED is broken, so either we
>>> fix it or deprecate it.
> That we should fix.  Please open a JIRA for this.  I assume you are
> talking about the implementation in NaturalRanking.
>> +1
>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
>> However, if nobody could actually rely on this feature because it is broken,
>> I'd prefer to remove it.]
> There are two issues here.  One is specific to ranking algorithms.
> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
> the result has to be a total ordering.  The NaNStrategy.REMOVED
> strategy is intended to represent removal of NaNs from the data to
> be ordered.  If it is not implemented correctly in NaturalRanking or
> other rankings that is a bug and needs to be fixed.

Sorry, I just reread Patrick's original mail.  IIUC, there is
nothing wrong with the implementation of NaNStrategy.REMOVED in
NaturalRanking or other implemented rankings.  The problem is how
the Spearman's impl handles it.  That is indeed a bug in Spearman's
impl that should be fixed.  The correct fix is to throw out the
corresponding entry in the second array when REMOVED is the
configured NaNStrategy.  I agree with Patrick that adding .FAIL and
setting that as the default is a good idea.  Patches welcome.
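
The fix described above amounts to pairwise deletion: an observation is dropped from both arrays whenever either value is NaN. A minimal sketch in plain Java follows. The class PairwiseDeletion and the method removeNaNPairs are hypothetical names, and this is not the code that went into the Spearman's implementation.

```java
import java.util.Arrays;

// Hypothetical helper: drop index i from both arrays if x[i] or y[i] is NaN.
final class PairwiseDeletion {

    private PairwiseDeletion() { }

    /** Returns {xs, ys}, the inputs with all NaN-containing pairs removed. */
    static double[][] removeNaNPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("array lengths differ");
        }
        double[] xs = new double[x.length];
        double[] ys = new double[y.length];
        int n = 0; // number of complete pairs kept so far
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                xs[n] = x[i];
                ys[n] = y[i];
                n++;
            }
        }
        return new double[][] {Arrays.copyOf(xs, n), Arrays.copyOf(ys, n)};
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        // prints [[1.0, 2.0], [2.0, 10.0]]
        System.out.println(Arrays.deepToString(removeNaNPairs(column1, column2)));
    }
}
```

Both returned arrays always have the same length, so ranking them afterwards cannot run into the "array lengths differ" exception mentioned earlier in the thread.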

>
> The second issue is the more general one of how to represent and
> handle missing data.  I have always seen that as a limitation that
> we would eventually address on an algorithm by algorithm basis.
> Different algorithms can be configured to do different things when
> missing data are encountered.  It is not always possible or
> desirable to preprocess the data to "eliminate" or impute missing
> data.  Saying that we are just not going to deal with it is a
> limitation that I don't think we should impose.  I would like to
> hear others' ideas about good ways to model missing data in Java.
>
> Phil
>
>
>>
>> Sorry for the confusion,
>> Gilles
>>

Re: [math] correlation analysis with NaNs

Thomas Neidhart
On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <[hidden email]> wrote:

> On 11/8/12 9:44 AM, Phil Steitz wrote:
> > On 11/8/12 8:23 AM, Gilles Sadowski wrote:
> >> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
> >>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> >>>> Hi,
> >>>>
> >>>> 2012/11/8 Gilles Sadowski <[hidden email]>:
> >>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> >>>>>> Hi Patrick,
> >>>>>>
> >>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>>>>> I agree that it would be nice to have a constructor that allows you to
> >>>>>>> specify the ranking algorithm only.
> > +1 - patches welcome.
> >>>>>>> As far as NaN and the Spearman correlation, maybe we should add a default
> >>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> >>>>>>> encountered. R uses this treatment of missing data and forces users to
> >>>>>>> choose how to handle it. If we implemented something like listwise or
> >>>>>>> pairwise deletion it could be used in other classes too. As such, treatment
> >>>>>>> of missing data should be part of a larger discussion and handled in a more
> >>>>>>> comprehensive and systematic way.
> > +1 to develop a strategy for how to represent and
> > handle missing data (see below)
> >>>>>> I think this additional option makes sense, but I forward this
> >>>>>> discussion to the dev mailing list where it is better suited.
> >>>>> I'm wary of having CM handle "missing" data.
> >>>>> For one thing we'd have to define a "convention" to represent missing data.
> >>>>> There is no good way to do that in Java. Using NaN for this purpose in a
> >>>>> low-level library is not a good idea IMHO.
> >>>>>
> >>>> I agree with Gilles, here. If I remember correctly, R has a special
> >>>> value NA, or something similar, which differs from NaN.
> >>>>> Then, any convention might not be
> >>>>> suitable for some user applications, which would lead such an application's
> >>>>> developer to filter the data anyway in order to change his representation to
> >>>>> CM's representation. Rather than calling two redundant filtering codes, I'd
> >>>>> rather assume that CM gets a clean input on which its algorithm can operate.
> >>>>> As usual, the input is subjected to precondition checks, and exceptions are
> >>>>> thrown if the data is not clean enough.
> >>>>>
> >>>>> In summary: data validation (in the sense of discarding input) should not be
> >>>>> done _before_ calling CM routines.
> >>>>>
> >>>> +1.
> >>> ok, I am now confused. First you say that CM should not be involved in
> >>> data cleaning, but then you state that data validation should not be
> >>> done before calling CM? Maybe there is one *not* too many?
> >> Yes, you are right: I wrote the opposite of what I meant.
> >> ---
> >>   In summary: data validation (in the sense of discarding input) should
> >>   be done _before_ calling CM routines.
> >> ---
> >>
> >>> I think the proposition from Patrick was to exactly do that: throw an
> >>> exception if such invalid data is encountered (NaNStrategy.FAIL).
> >>>
> >>> The other thing is, that the NaNStrategy.REMOVED is broken, so either we
> >>> fix it or deprecate it.
> > That we should fix.  Please open a JIRA for this.  I assume you are
> > talking about the implementation in NaturalRanking.
> >> +1
> >> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
> >> However, if nobody could actually rely on this feature because it is broken,
> >> I'd prefer to remove it.]
> > There are two issues here.  One is specific to ranking algorithms.
> > To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
> > the result has to be a total ordering.  The NaNStrategy.REMOVED
> > strategy is intended to represent removal of NaNs from the data to
> > be ordered.  If it is not implemented correctly in NaturalRanking or
> > other rankings that is a bug and needs to be fixed.
>
> Sorry, I just reread Patrick's original mail.  IIUC, there is
> nothing wrong with the implementation of NaNStrategy.REMOVED in
> NaturalRanking or other implemented rankings.  The problem is how
> the Spearman's impl handles it.  That is indeed a bug in Spearman's
> impl that should be fixed.  The correct fix is to throw out the
> corresponding entry in the second array when REMOVED is the
> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
> setting that as the default is a good idea.  Patches welcome.
> >
> > The second issue is the more general one of how to represent and
> > handle missing data.  I have always seen that as a limitation that
> > we would eventually address on an algorithm by algorithm basis.
> > Different algorithms can be configured to do different things when
> > missing data are encountered.  It is not always possible or
> > desirable to preprocess the data to "eliminate" or impute missing
> > data.  Saying that we are just not going to deal with it is a
> > limitation that I don't think we should impose.  I would like to
> > hear others' ideas about good ways to model missing data in Java.
>

Hi Phil,

ok I have created three new issues:

 * MATH-891
 * MATH-892
 * MATH-893

Regarding NaNStrategy.REMOVED, I think it will be necessary to adjust
the RankingAlgorithm interface a bit. Right now, it only takes a
one-dimensional array as input. But in the case of correlations, you have
two input arrays. If you remove the NaN values from one array, you have no
way of knowing at which indices they were removed, so you cannot do the
same with the other array.

Thomas

Re: [math] correlation analysis with NaNs

Phil Steitz
On 11/9/12 12:18 AM, Thomas Neidhart wrote:

> Hi Phil,
>
> ok I have created three new issues:
>
>  * MATH-891
>  * MATH-892
>  * MATH-893

Thanks!
>
> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
> the RankingAlgorithm interface a bit. Right now, it only takes as input a
> one-dimensional array. But in case of correlations, you have two input
> arrays. If you remove from one array the NaN values, you have no means to
> know at which index they have been removed to do the same with the other
> array.

Or you push that responsibility to the client - in this case
SpearmansCorrelation.   My first thought on how to fix the
Spearman's impl was to have it compare lengths of ranked / unranked
when invoked with the REMOVED NaN strategy and then scan the
original arrays when removals happen, adjusting the ranked arrays
accordingly.  

Phil
>
> Thomas
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [math] correlation analysis with NaNs

Thomas Neidhart
On 11/09/2012 11:14 PM, Phil Steitz wrote:

> On 11/9/12 12:18 AM, Thomas Neidhart wrote:
>> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
>> the RankingAlgorithm interface a bit. Right now, it only takes as input a
>> one-dimensional array. But in case of correlations, you have two input
>> arrays. If you remove from one array the NaN values, you have no means to
>> know at which index they have been removed to do the same with the other
>> array.
>
> Or you push that responsibility to the client - in this case
> SpearmansCorrelation.   My first thought on how to fix the
> Spearman's impl was to have it compare lengths of ranked / unranked
> when invoked with the REMOVED NaN strategy and then scan the
> original arrays when removals happen, adjusting the ranked arrays
> accordingly.  

I thought about this a bit more, and I do not think it can be done
safely on the client side (i.e. SpearmansCorrelation).

Consider the following case:

 x: [NaN, 1, 2]
 y: [1, NaN, 2]

the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:

 x: [1, 2]
 y: [1, 2]

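On the client side everything looks fine, since both ranked arrays have
the same length, but in fact we would correlate misaligned data: the first
remaining x value comes from index 1, while the first remaining y value
comes from index 0.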

Additionally, on the client side, we have no means to know the actual
NaNStrategy that is used, as it is hidden in the ranking algorithm.

Moreover, comparing with the original array may also not work, as the
ranking algorithm may change the data, so alignment is not always possible.
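
The misalignment can be reproduced without any ranking code at all. What
follows is only a minimal sketch in plain Java, not Commons Math code:
removeNaNs and pairwiseDelete are hypothetical helpers introduced purely
for illustration. The first mimics what a per-array REMOVED-style strategy
does, the second keeps only the indices that are complete in both arrays.

```java
import java.util.Arrays;

public class NaNAlignmentSketch {

    /** Hypothetical helper: drops NaN entries from one array, independently of any other array. */
    public static double[] removeNaNs(double[] data) {
        return Arrays.stream(data).filter(d -> !Double.isNaN(d)).toArray();
    }

    /** Hypothetical helper: keeps only the indices at which neither x nor y is NaN. */
    public static double[][] pairwiseDelete(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        double[] keptX = new double[x.length];
        double[] keptY = new double[y.length];
        int n = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                keptX[n] = x[i];
                keptY[n] = y[i];
                n++;
            }
        }
        return new double[][] {Arrays.copyOf(keptX, n), Arrays.copyOf(keptY, n)};
    }

    public static void main(String[] args) {
        double[] x = {Double.NaN, 1, 2};
        double[] y = {1, Double.NaN, 2};

        // Independent removal: both results have length 2, but the surviving
        // values no longer come from the same observations.
        System.out.println(Arrays.toString(removeNaNs(x)));
        System.out.println(Arrays.toString(removeNaNs(y)));

        // Pairwise deletion: only index 2 is complete in both arrays.
        double[][] clean = pairwiseDelete(x, y);
        System.out.println(Arrays.toString(clean[0]) + " " + Arrays.toString(clean[1]));
    }
}
```

With the arrays from the example, independent removal leaves two values on
each side, whereas pairwise deletion leaves a single complete observation,
which is too little data to compute a correlation.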

>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>>> setting that as the default is a good idea.  Patches welcome.

NaNStrategy.FAILED has already been added. Shall we make it the
default, then? What do you think?

Thomas



Re: [math] correlation analysis with NaNs

Gilles Sadowski
On Sun, Nov 18, 2012 at 11:01:18PM +0100, Thomas Neidhart wrote:

> The NaNStrategy.FAILED has been added already, shall we make it the
> default then, what do you think?

+1

Gilles



Re: [math] correlation analysis with NaNs

Phil Steitz
In reply to this post by Thomas Neidhart
On 11/18/12 2:01 PM, Thomas Neidhart wrote:

> I thought about this a bit more, and I do not think it can be done
> safely on the client side (i.e. SpearmansCorrelation).
>
> Consider the following case:
>
>  x: [NaN, 1, 2]
>  y: [1, NaN, 2]
>
> the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:
>
>  x: [1, 2]
>  y: [1, 2]
>
> on the client side, everything looks fine, but in fact we would
> correlate wrong data.
>
> Additionally, on the client side, we have no means to know the actual
> NaNStrategy that is used, as it is hidden in the ranking algorithm.
>
> Moreover, comparing with the original array may also not work, as the
> ranking algorithm may change the data, so alignment is not always possible
>
>>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>>>> setting that as the default is a good idea.  Patches welcome.
> The NaNStrategy.FAILED has been added already, shall we make it the
> default then, what do you think?

I think that is probably best, since what I was trying to do was a
poor man's strategy for missing data.  In the case above, I would
have the client eliminate both of the first two observations, so
there would not be enough data left, but this is hard to document
and implement and is really just a hack to support one missing data
scenario.

Now is as good a time as any to think about how to correctly
represent and handle missing data.  The unfortunate thing is that in
Java working with primitive doubles we are back to the old Fortran
days of having no natural representation of a missing value.
Sticking with primitives, the only options are to use NaN or to
allow the "missing" designator to be configured by the user.  I
am curious what others have done in this area.

The second question is which strategies we support for handling
missing data and how we represent those strategies.   The
simplest and easiest strategy to implement is to delete observations
that include missing data.  This is a data-only strategy and would
work the same way across algorithms.  I am afraid, however, that
this is the only strategy that is not algorithm-dependent (unless
you consider, e.g. EM as a missing data strategy or very simple
imputation strategies).  So that means individual algorithms need to
include missing data strategies in their specifications.  It might
be good to define and implement these for the correlation and
regression classes and see if we can generalize.  Any ideas on how
best to do this?
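
To make the data-only strategy concrete, here is a rough sketch of
deleting observations that contain missing data, with the "missing"
designator supplied by the caller. It is plain Java and only an
illustration: the class ListwiseDeletionSketch and the method
deleteIncomplete are hypothetical names, not existing Commons Math API,
and -999 is just an example sentinel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.DoublePredicate;

public class ListwiseDeletionSketch {

    /**
     * Hypothetical helper: returns copies of the observations (rows) that contain
     * no missing value. What counts as "missing" is decided by the caller.
     */
    public static double[][] deleteIncomplete(double[][] observations, DoublePredicate isMissing) {
        List<double[]> complete = new ArrayList<>();
        for (double[] row : observations) {
            boolean hasMissing = false;
            for (double value : row) {
                if (isMissing.test(value)) {
                    hasMissing = true;
                    break;
                }
            }
            if (!hasMissing) {
                complete.add(row.clone()); // copy so the result does not alias the input
            }
        }
        return complete.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] data = {
            {Double.NaN, 1},
            {1, Double.NaN},
            {2, 2},
            {-999, 3}
        };

        // NaN as the missing designator.
        double[][] byNaN = deleteIncomplete(data, v -> Double.isNaN(v));

        // NaN plus a user-configured sentinel value (an assumption for this example).
        double[][] bySentinel = deleteIncomplete(data, v -> Double.isNaN(v) || v == -999.0);

        System.out.println(byNaN.length + " rows kept with NaN as missing");
        System.out.println(bySentinel.length + " rows kept with NaN or -999 as missing");
    }
}
```

Because the filter runs before any algorithm sees the data, it behaves
the same way for correlation and regression; anything beyond deletion
would still have to live in the individual algorithms.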

Phil




---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


Re: [math] correlation analysis with NaNs

Gilles Sadowski
On Sun, Nov 18, 2012 at 09:27:41PM -0800, Phil Steitz wrote:

> On 11/18/12 2:01 PM, Thomas Neidhart wrote:
> > On 11/09/2012 11:14 PM, Phil Steitz wrote:
> >> On 11/9/12 12:18 AM, Thomas Neidhart wrote:
> >>> On Thu, Nov 8, 2012 at 7:21 PM, Phil Steitz <[hidden email]> wrote:
> >>>
> >>>> On 11/8/12 9:44 AM, Phil Steitz wrote:
> >>>>> On 11/8/12 8:23 AM, Gilles Sadowski wrote:
> >>>>>> On Thu, Nov 08, 2012 at 05:00:52PM +0100, Thomas Neidhart wrote:
> >>>>>>> On 11/08/2012 02:01 PM, Sébastien Brisard wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> 2012/11/8 Gilles Sadowski <[hidden email]>:
> >>>>>>>>> On Thu, Nov 08, 2012 at 09:39:00AM +0100, Thomas Neidhart wrote:
> >>>>>>>>>> Hi Patrick,
> >>>>>>>>>>
> >>>>>>>>>> On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> >>>>>>>>>>> I agree that it would be nice to have a constructor that allows you to
> >>>>>>>>>>> specify the ranking algorithm only.
> >>>>> +1 - patches welcome.
> >>>>>>>>>>> As far as NaN and the Spearman correlation, maybe we should add a default
> >>>>>>>>>>> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> >>>>>>>>>>> encountered. R uses this treatment of missing data and forces users to
> >>>>>>>>>>> choose how to handle it. If we implemented something like listwise or
> >>>>>>>>>>> pairwise deletion it could be used in other classes too. As such, treatment
> >>>>>>>>>>> of missing data should be part of a larger discussion and handled in a more
> >>>>>>>>>>> comprehensive and systematic way.
> >>>>> +1 to develop a strategy for how to represent and
> >>>>> handle missing data (see below)
> >>>>>>>>>> I think this additional option makes sense, but I forward this
> >>>>>>>>>> discussion to the dev mailing list where it is better suited.
> >>>>>>>>> I'm wary of having CM handle "missing" data.
> >>>>>>>>> For one thing we'd have to define a "convention" to represent missing data.
> >>>>>>>>> There is no good way to do that in Java. Using NaN for this purpose in a
> >>>>>>>>> low-level library is not a good idea IMHO.
> >>>>>>>>>
> >>>>>>>> I agree with Gilles, here. If I remember correctly, R has a special
> >>>>>>>> value NA, or something similar, which differs from NaN.
> >>>>>>>>> Then, any convention might not be
> >>>>>>>>> suitable for some user applications, which would lead such an application's
> >>>>>>>>> developer to filter the data anyway in order to change his representation to
> >>>>>>>>> CM's representation. Rather than calling two redundant filtering codes, I'd
> >>>>>>>>> rather assume that CM gets a clean input on which its algorithm can operate.
> >>>>>>>>> As usual, the input is subjected to precondition checks, and exceptions are
> >>>>>>>>> thrown if the data is not clean enough.
> >>>>>>>>>
> >>>>>>>>> In summary: data validation (in the sense of discarding input) should not be
> >>>>>>>>> done _before_ calling CM routines.
> >>>>>>>>>
> >>>>>>>> +1.
> >>>>>>> ok, I am now confused. First you say that CM should not be involved in
> >>>>>>> data cleaning, but then you state that data validation should not be
> >>>>>>> done before calling CM? Maybe there is one *not* too many?
> >>>>>> Yes, you are right: I wrote the opposite of what I meant.
> >>>>>> ---
> >>>>>>   In summary: data validation (in the sense of discarding input) should
> >>>>>>   be done _before_ calling CM routines.
> >>>>>> ---
> >>>>>>
> >>>>>>> I think the proposition from Patrick was to exactly do that: throw an
> >>>>>>> exception if such invalid data is encountered (NaNStrategy.FAIL).
> >>>>>>>
> >>>>>>> The other thing is that NaNStrategy.REMOVED is broken, so either we
> >>>>>>> fix it or deprecate it.
> >>>>> That we should fix.  Please open a JIRA for this.  I assume you are
> >>>>> talking about the implementation in NaturalRanking.
> >>>>>> +1
> >>>>>> [I mean (I think): If people rely on CM's removal of NaNs, we could fix it.
> >>>>>> However, if nobody could actually rely on this feature because it is broken,
> >>>>>> I'd prefer to remove it.]
> >>>>> There are two issues here.  One is specific to ranking algorithms.
> >>>>> To be well-defined, a RankingAlgorithm needs a NaNStrategy, since
> >>>>> the result has to be a total ordering.  The NaNStrategy.REMOVED
> >>>>> strategy is intended to represent removal of NaNs from the data to
> >>>>> be ordered.  If it is not implemented correctly in NaturalRanking or
> >>>>> other rankings that is a bug and needs to be fixed.
> >>>> Sorry, I just reread Patrick's original mail.  IIUC, there is
> >>>> nothing wrong with the implementation of NaNStrategy.REMOVED in
> >>>> NaturalRanking or other implemented rankings.  The problem is how
> >>>> the Spearman's impl handles it.  That is indeed a bug in Spearman's
> >>>> impl that should be fixed.  The correct fix is to throw out the
> >>>> corresponding entry in the second array when REMOVED is the
> >>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
> >>>> setting that as the default is a good idea.  Patches welcome.
> >>>>> The second issue is the more general one of how to represent and
> >>>>> handle missing data.  I have always seen that as a limitation that
> >>>>> we would eventually address on an algorithm by algorithm basis.
> >>>>> Different algorithms can be configured to do different things when
> >>>>> missing data are encountered.  It is not always possible or
> >>>>> desirable to preprocess the data to "eliminate" or impute missing
> >>>>> data.  Saying that we are just not going to deal with it is a
> >>>>> limitation that I don't think we should impose.  I would like to
> >>>>> hear others' ideas about good ways to model missing data in Java.
> >>> Hi Phil,
> >>>
> >>> ok I have created three new issues:
> >>>
> >>>  * MATH-891
> >>>  * MATH-892
> >>>  * MATH-893
> >> Thanks!
> >>> Regarding the NaNStrategy.REMOVED, I think it will be necessary to adjust
> >>> the RankingAlgorithm interface a bit. Right now, it only takes as input a
> >>> one-dimensional array. But in case of correlations, you have two input
> >>> arrays. If you remove from one array the NaN values, you have no means to
> >>> know at which index they have been removed to do the same with the other
> >>> array.
> >> Or you push that responsibility to the client - in this case
> >> SpearmansCorrelation.   My first thought on how to fix the
> >> Spearman's impl was to have it compare lengths of ranked / unranked
> >> when invoked with the REMOVED NaN strategy and then scan the
> >> original arrays when removals happen, adjusting the ranked arrays
> >> accordingly.  
> > I thought about this a bit more, and I do not think it can be done
> > safely on the client side (i.e. SpearmansCorrelation).
> >
> > Consider the following case:
> >
> >  x: [NaN, 1, 2]
> >  y: [1, NaN, 2]
> >
> > the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:
> >
> >  x: [1, 2]
> >  y: [1, 2]
> >
> > on the client side, everything looks fine, but in fact we would
> > correlate wrong data.
> >
> > Additionally, on the client side, we have no means to know the actual
> > NaNStrategy that is used, as it is hidden in the ranking algorithm.
> >
> > Moreover, comparing with the original array may also not work, as the
> > ranking algorithm may change the data, so alignment is not always possible
> >
> >>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
> >>>> setting that as the default is a good idea.  Patches welcome.
> > The NaNStrategy.FAILED has been added already, shall we make it the
> > default then, what do you think?
>
> I think that is probably best, since what I was trying to do was a
> poor man's strategy for missing data.  In the case above, I would
> have the client eliminate both of the first two observations, so
> there would not be enough data left, but this is hard to document
> and implement and is really just a hack to support one missing data
> scenario.
>
> Now is as good a time as any to think about how to correctly
> represent and handle missing data.  The unfortunate thing is that in
> Java working with primitive doubles we are back to the old Fortran
> days of having no natural representation of a missing value.
> Sticking with primitives, the only thing we can do is either use NaN
> or allow the "missing" designator to be configured by the user.  I
> am curious what others have done in this area.

As you say (and as I said earlier), with primitive doubles there is no value
that can readily serve as "missing". It is the user's choice (e.g. "Double.NaN",
"Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative value", ...), and it
depends on the context.
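
To make the point concrete, here is a minimal sketch of a user-configurable "missing" designator expressed as a predicate. The class and method names are hypothetical and not part of Commons Math; the three conventions shown are the ones listed above.

```java
import java.util.function.DoublePredicate;

/** Hypothetical sketch, not part of Commons Math. */
final class MissingDesignator {

    /** Counts the entries of data that the caller's predicate flags as missing. */
    static int countMissing(double[] data, DoublePredicate isMissing) {
        int count = 0;
        for (double v : data) {
            if (isMissing.test(v)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double[] data = {Double.NaN, -1, 2, Double.MAX_VALUE};
        // NaN has to be detected with Double.isNaN, because NaN == NaN is false.
        System.out.println(countMissing(data, v -> Double.isNaN(v)));        // prints 1
        System.out.println(countMissing(data, v -> v == Double.MAX_VALUE));  // prints 1
        // NaN < 0 is false, so only -1 matches the "any negative value" convention.
        System.out.println(countMissing(data, v -> v < 0));                  // prints 1
    }
}
```

The same array yields a different set of "missing" entries under each convention, which is why the choice cannot be fixed without knowing the context.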

>
> The second question is what strategies do we support for handling
> missing data and how do we represent those strategies.   The
> simplest and easiest strategy to implement is to delete observations
> that include missing data.  This is a data-only strategy and would
> work the same way across algorithms.  I am afraid, however, that
> this is the only strategy that is not algorithm-dependent (unless
> you consider, e.g. EM as a missing data strategy or very simple
> imputation strategies).  So that means individual algorithms need to
> include missing data strategies in their specifications.  It might
> be good to define and implement these for the correlation and
> regression classes and see if we can generalize.  Any ideas on how
> best to do this?

I'm sorry if I'm dense, but I don't remember if or why the option that users
should provide clean input data to CM has been ruled out.
I.e. filtering (by user) is done before computation (by CM's algo).
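
A minimal sketch of that division of labour, under the assumption that NaN is the only thing that makes input "not clean": the algorithm side only checks a precondition and throws, and any filtering is left to the caller. The class name and the use of IllegalArgumentException are illustrative choices, not CM's actual API.

```java
/** Hypothetical precondition check, not part of Commons Math. */
final class CleanInputCheck {

    /** Throws if any entry is NaN; otherwise returns normally. */
    static void requireNoNaN(double[] data) {
        for (int i = 0; i < data.length; i++) {
            if (Double.isNaN(data[i])) {
                throw new IllegalArgumentException("NaN at index " + i);
            }
        }
    }

    public static void main(String[] args) {
        requireNoNaN(new double[] {1, 2, 3}); // clean input passes silently
        try {
            requireNoNaN(new double[] {1, Double.NaN, 3});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "NaN at index 1"
        }
    }
}
```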

If the data is missing, how can you use it (to correlate, to fit, ...)?


Regards,
Gilles



Re: [math] correlation analysis with NaNs

Phil Steitz
On 11/19/12 3:31 AM, Gilles Sadowski wrote:

> On Sun, Nov 18, 2012 at 09:27:41PM -0800, Phil Steitz wrote:
>> On 11/18/12 2:01 PM, Thomas Neidhart wrote:
>>> [...]
>>> I thought about this a bit more, and I do not think it can be done
>>> safely on the client side (i.e. SpearmansCorrelation).
>>>
>>> Consider the following case:
>>>
>>>  x: [NaN, 1, 2]
>>>  y: [1, NaN, 2]
>>>
>>> the ranking algorithm with a NaNStrategy of REMOVED would rank as follows:
>>>
>>>  x: [1, 2]
>>>  y: [1, 2]
>>>
>>> on the client side, everything looks fine, but in fact we would
>>> correlate wrong data.
>>>
>>> Additionally, on the client side, we have no means to know the actual
>>> NaNStrategy that is used, as it is hidden in the ranking algorithm.
>>>
>>> Moreover, comparing with the original array may also not work, as the
>>> ranking algorithm may change the data, so alignment is not always possible
>>>
>>>>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL and
>>>>>> setting that as the default is a good idea.  Patches welcome.
>>> The NaNStrategy.FAILED has been added already, shall we make it the
>>> default then, what do you think?
>> I think that is probably best, since what I was trying to do was a
>> poor man's strategy for missing data.  In the case above, I would
>> have the client eliminate both of the first two observations, so
>> there would not be enough data left, but this is hard to document
>> and implement and is really just a hack to support one missing data
>> scenario.
>>
>> Now is as good a time as any to think about how to correctly
>> represent and handle missing data.  The unfortunate thing is that in
>> Java working with primitive doubles we are back to the old Fortran
>> days of having no natural representation of a missing value.
>> Sticking with primitives, the only thing we can do is either use NaN
>> or allow the "missing" designator to be configured by the user.  I
>> am curious what others have done in this area.
> As you say, as I said, with primitive double, there is no value that can
> readily serve as "missing". It's a user's choice (e.g. "Double.NaN",
> "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative value", ...), that
> depends on the context.
>
>> The second question is what strategies do we support for handling
>> missing data and how do we represent those strategies.   The
>> simplest and easiest strategy to implement is to delete observations
>> that include missing data.  This is a data-only strategy and would
>> work the same way across algorithms.  I am afraid, however, that
>> this is the only strategy that is not algorithm-dependent (unless
>> you consider, e.g. EM as a missing data strategy or very simple
>> imputation strategies).  So that means individual algorithms need to
>> include missing data strategies in their specifications.  It might
>> be good to define and implement these for the correlation and
>> regression classes and see if we can generalize.  Any ideas on how
>> best to do this?
> I'm sorry if I'm dense, but I don't remember if or why the option that users
> should provide clean input data to CM has been ruled out.
> I.e. filtering (by user) is done before computation (by CM's algo).
>
> If the data is missing, how can you use it (to correlate, to fit, ...)?

There are multiple techniques that can be used to adjust for missing
data, depending on the algorithm.  See [1], for example, for a
summary of the kinds of techniques that can be used in regression.
Basically, saying users need to adjust the data before providing it
to the algorithm allows only the "data only" approaches, and may be
inconvenient or make it impossible to perform other analyses on
the same data.
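
As a small illustration of that cost, the sketch below counts how many rows remain usable when incomplete rows are deleted up front (listwise) versus when each pair of variables keeps its own complete rows (pairwise deletion, mentioned earlier in this thread). The class and method names are hypothetical, the data are made up for the example, and NaN is assumed to mark a missing value.

```java
/** Hypothetical sketch, not part of Commons Math. */
final class DeletionComparison {

    /** Number of rows in which every listed column holds a non-NaN value. */
    static int completeRows(double[][] rows, int... columns) {
        int count = 0;
        for (double[] row : rows) {
            boolean complete = true;
            for (int c : columns) {
                if (Double.isNaN(row[c])) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        double nan = Double.NaN;
        double[][] rows = {
            {1, 2, nan},
            {2, nan, 5},
            {3, 4, 6},
            {nan, 1, 7},
        };
        // Listwise: a row survives only if all three variables are present.
        System.out.println(completeRows(rows, 0, 1, 2)); // prints 1
        // Pairwise: each pair of variables uses its own complete rows.
        System.out.println(completeRows(rows, 0, 1));    // prints 2
        System.out.println(completeRows(rows, 0, 2));    // prints 2
        System.out.println(completeRows(rows, 1, 2));    // prints 2
    }
}
```

Once the caller has deleted the three incomplete rows, the per-pair analysis can no longer recover them.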

Phil

[1]
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html

>
>
> Regards,
> Gilles
>




RE: [math] correlation analysis with NaNs

Patrick Meyer


-----Original Message-----
From: Phil Steitz [mailto:[hidden email]]
Sent: Monday, November 19, 2012 12:45 PM
To: Commons Developers List
Subject: Re: [math] correlation analysis with NaNs

On 11/19/12 3:31 AM, Gilles Sadowski wrote:
> On Sun, Nov 18, 2012 at 09:27:41PM -0800, Phil Steitz wrote:
>> On 11/18/12 2:01 PM, Thomas Neidhart wrote:
>>> [...]
>>> I thought about this a bit more, and I do not think it can be done
>>> safely on the client side (i.e. SpearmansCorrelation).
>>>
>>> Consider the following case:
>>>
>>>  x: [NaN, 1, 2]
>>>  y: [1, NaN, 2]
>>>
>>> the ranking algorithm with a NaNStrategy of REMOVED would rank as
>>> follows:
>>>
>>>  x: [1, 2]
>>>  y: [1, 2]
>>>
>>> on the client side, everything looks fine, but in fact we would
>>> correlate wrong data.
>>>
>>> Additionally, on the client side, we have no means to know the
>>> actual NaNStrategy that is used, as it is hidden in the ranking
>>> algorithm.
>>>
>>> Moreover, comparing with the original array may also not work, as
>>> the ranking algorithm may change the data, so alignment is not
>>> always possible
>>>
>>>>>> configured NaNStrategy.  I agree with Patrick that adding .FAIL
>>>>>> and setting that as the default is a good idea.  Patches welcome.
>>> The NaNStrategy.FAILED has been added already, shall we make it the
>>> default then, what do you think?
>> I think that is probably best, since what I was trying to do was a
>> poor man's strategy for missing data.  In the case above, I would
>> have the client eliminate both of the first two observations, so
>> there would not be enough data left, but this is hard to document and
>> implement and is really just a hack to support one missing data
>> scenario.
>>
>> Now is as good a time as any to think about how to correctly
>> represent and handle missing data.  The unfortunate thing is that in
>> Java working with primitive doubles we are back to the old Fortran
>> days of having no natural representation of a missing value.
>> Sticking with primitives, the only thing we can do is either use NaN
>> or allow the "missing" designator to be configured by the user.  I am
>> curious what others have done in this area.
> As you say, as I said, with primitive double, there is no value that
> can readily serve as "missing". It's a user's choice (e.g.
> "Double.NaN", "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative
> value", ...), that depends on the context.
>
>> The second question is what strategies do we support for handling
>> missing data and how do we represent those strategies.   The
>> simplest and easiest strategy to implement is to delete observations
>> that include missing data.  This is a data-only strategy and would
>> work the same way across algorithms.  I am afraid, however, that this
>> is the only strategy that is not algorithm-dependent (unless you
>> consider, e.g. EM as a missing data strategy or very simple
>> imputation strategies).  So that means individual algorithms need to
>> include missing data strategies in their specifications.  It might be
>> good to define and implement these for the correlation and regression
>> classes and see if we can generalize.  Any ideas on how best to do
>> this?
> I'm sorry if I'm dense, but I don't remember if or why the option that
> users should provide clean input data to CM has been ruled out.
> I.e. filtering (by user) is done before computation (by CM's algo).
>
> If the data is missing, how can you use it (to correlate, to fit, ...)?

There are multiple techniques that can be used to adjust for missing data,
depending on the algorithm.  See [1], for example, for a summary of the
kinds of techniques that can be used in regression.
Basically, saying users need to adjust the data before providing it to the
algorithm allows only the "data only" approaches and may be inconvenient or
make impossible other analyses to be performed on the same data.

Phil

[1]
http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html


I agree that we should consider a more comprehensive treatment of missing
data. Perhaps we should start by designing an interface that could be
implemented by existing classes. For example, an imputation interface could
have methods like miimpute, mianalyze and misummarize and this interface
could be implemented in a class that extends OLSMultipleLinearRegression.
This approach allows each estimation method to adopt its own treatment of
missing data.
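
A minimal sketch of what such an interface might look like, under stated assumptions: the names Imputer and MeanImputer are hypothetical and do not exist in Commons Math, only a single impute method is shown (analogues of mianalyze and misummarize are left out), and mean imputation stands in for whatever treatment an estimation class would really choose.

```java
import java.util.Arrays;

/** Hypothetical sketch; none of these types exist in Commons Math. */
public class ImputationSketch {

    /** Assumed interface: returns a copy of the data with missing entries filled in. */
    public interface Imputer {
        double[] impute(double[] values);
    }

    /** Very simple strategy: replace each NaN by the mean of the observed values. */
    public static class MeanImputer implements Imputer {
        @Override
        public double[] impute(double[] values) {
            double sum = 0;
            int observed = 0;
            for (double v : values) {
                if (!Double.isNaN(v)) {
                    sum += v;
                    observed++;
                }
            }
            if (observed == 0) {
                throw new IllegalArgumentException("no observed values");
            }
            double mean = sum / observed;
            double[] out = values.clone(); // leave the caller's array untouched
            for (int i = 0; i < out.length; i++) {
                if (Double.isNaN(out[i])) {
                    out[i] = mean;
                }
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Imputer imputer = new MeanImputer();
        double[] completed = imputer.impute(new double[] {1.0, Double.NaN, 3.0});
        System.out.println(Arrays.toString(completed));
    }
}
```

An estimation class could accept an Imputer in its constructor, which keeps the choice of treatment with the algorithm that knows how to use it.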

An alternative is to develop data structures that represent the original and
complete data sets. Missing data methods could be applied to the data
structures and return a complete data set for use in estimation methods.
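
A rough sketch of the data-structure alternative, assuming NaN marks a missing entry and that listwise deletion is the only method offered. The class name IncompleteDataSketch and its completeCases method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container: keeps the original rows, hands out a complete data set. */
public class IncompleteDataSketch {

    private final double[][] original;

    public IncompleteDataSketch(double[][] rows) {
        original = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) {
            original[i] = rows[i].clone(); // defensive copy
        }
    }

    /** Listwise deletion: keep only the rows that contain no NaN. */
    public double[][] completeCases() {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : original) {
            boolean complete = true;
            for (double v : row) {
                if (Double.isNaN(v)) {
                    complete = false;
                    break;
                }
            }
            if (complete) {
                kept.add(row.clone());
            }
        }
        return kept.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] data = {{Double.NaN, 10}, {1, 2}, {2, 10}};
        double[][] complete = new IncompleteDataSketch(data).completeCases();
        System.out.println(complete.length + " complete rows");
    }
}
```

The result is an ordinary double[][] that can be passed to any estimation method, while the original data stay available for a different treatment.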

I guess the decision is whether the missing data treatment should be part of
an independent data structure or integrated into the estimation method.
Just some thoughts about possible ways of handling it.

Patrick



>
>
> Regards,
> Gilles
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>
>



Re: [math] correlation analysis with NaNs

Gilles Sadowski
Hi.

> > [...]

> >>
> >> Now is as good a time as any to think about how to correctly
> >> represent and handle missing data.  The unfortunate thing is that in
> >> Java working with primitive doubles we are back to the old Fortran
> >> days of having no natural representation of a missing value.
> >> Sticking with primitives, the only thing we can do is either use NaN
> >> or allow the "missing" designator to be configured by the user.  I am
> >> curious what others have done in this area.
> > As you say, as I said, with primitive double, there is no value that
> > can readily serve as "missing". It's a user's choice (e.g.
> > "Double.NaN", "Double.MAX_VALUE", "-Double.MAX_VALUE", "any negative
> > value", ...), that depends on the context.
> >
> >> The second question is what strategies do we support for handling
> >> missing data and how do we represent those strategies.   The
> >> simplest and easiest strategy to implement is to delete observations
> >> that include missing data.  This is a data-only strategy and would
> >> work the same way across algorithms.  I am afraid, however, that this
> >> is the only strategy that is not algorithm-dependent (unless you
> >> consider, e.g. EM as a missing data strategy or very simple
> >> imputation strategies).  So that means individual algorithms need to
> >> include missing data strategies in their specifications.  It might be
> >> good to define and implement these for the correlation and regression
> >> classes and see if we can generalize.  Any ideas on how best to do
> >> this?
> > I'm sorry if I'm dense, but I don't remember if or why the option that
> > users should provide clean input data to CM has been ruled out.
> > I.e. filtering (by user) is done before computation (by CM's algo).
> >
> > If the data is missing, how can you use it (to correlate, to fit, ...)?
>
> There are multiple techniques that can be used to adjust for missing data,
> depending on the algorithm.  See [1], for example, for a summary of the
> kinds of techniques that can be used in regression.
> Basically, requiring users to adjust the data before providing it to the
> algorithm allows only the "data only" approaches, and it may make other
> analyses of the same data inconvenient or impossible.
>
> Phil
>
> [1]
> http://www.uvm.edu/~dhowell/StatPages/More_Stuff/Missing_Data/Missing.html
>
>
> I agree that we should consider a more comprehensive treatment of missing
> data. Perhaps we should start by designing an interface that could be
> implemented by existing classes. For example, an imputation interface could
> have methods like miimpute, mianalyze and misummarize and this interface
> could be implemented in a class that extends OLSMultipleLinearRegression.
> This approach allows each estimation method to adopt its own treatment of
> missing data.
>
> An alternative is to develop data structures that represent the original and
> complete data sets. Missing data methods could be applied to the data
> structures and return a complete data set for use in estimation methods.
>
> I guess the decision is whether the missing data treatment should be part of
> an independent data structure or integrated into the estimation method.
> Just some thoughts about possible ways of handling it.
>
> Patrick
>

Is the issue (in CM) about handling missing data or representing missing
data?
IIUC, handling is algorithm-dependent. Representation is a matter of
convention (i.e. user-dependent).

My proposal would be that for every algorithm that is able to handle
missing data, we provide an argument (to constructors) that specifies the
"double" value that represents a missing value.
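
To illustrate, here is a minimal sketch with a plain mean standing in for "an algorithm that is able to handle missing data". The class MissingAwareMean is hypothetical, not an existing CM class. The one subtlety is that the comparison must not use ==, since NaN == NaN is false.

```java
/** Hypothetical example: the "missing" marker is a constructor argument. */
public class MissingAwareMean {

    private final double missingValue;

    public MissingAwareMean(double missingValue) {
        this.missingValue = missingValue;
    }

    private boolean isMissing(double v) {
        // Double.compare treats NaN as equal to NaN, so NaN can serve as the marker.
        // Note that it also distinguishes 0.0 from -0.0.
        return Double.compare(v, missingValue) == 0;
    }

    /** Mean of the entries that are not marked as missing; NaN if there are none. */
    public double evaluate(double[] values) {
        double sum = 0;
        int n = 0;
        for (double v : values) {
            if (!isMissing(v)) {
                sum += v;
                n++;
            }
        }
        return n == 0 ? Double.NaN : sum / n;
    }

    public static void main(String[] args) {
        double[] a = {1.0, -999.0, 3.0};
        double[] b = {1.0, Double.NaN, 3.0};
        System.out.println(new MissingAwareMean(-999.0).evaluate(a));
        System.out.println(new MissingAwareMean(Double.NaN).evaluate(b));
    }
}
```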


Regards,
Gilles



[math] pearson and spearman correlation runtime complexity

Martin Rosellen
Hi again,

I tried to implement the Pearson and Spearman algorithms myself and the
computation took very long. That is why I now use the Commons Math
solution. I am curious about the runtime complexity of the Pearson and
the Spearman correlation coefficients. Can someone help me with that?

Greetz
Martin



Re: [math] pearson and spearman correlation runtime complexity

Ted Dunning
Can you say more about how you implemented these?

The Pearson coefficient should be quite simple.  A few passes through the
data should suffice and it can probably be done in one pass, especially if
you aren't worried about 1 ULP accuracy.
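
For illustration, a one-pass version might look like the sketch below. It uses a Welford-style running update of the means and co-moments, which is my choice here and not necessarily what Commons Math does internally. It takes O(n) time and constant extra memory.

```java
/** Sketch of a single-pass Pearson correlation (Welford-style updates). */
public class OnePassPearson {

    public static double correlation(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        double meanX = 0, meanY = 0; // running means
        double m2x = 0, m2y = 0;     // running sums of squared deviations
        double cxy = 0;              // running sum of cross deviations
        for (int i = 0; i < x.length; i++) {
            int n = i + 1;
            double dx = x[i] - meanX; // deviation from the previous mean
            double dy = y[i] - meanY;
            meanX += dx / n;
            meanY += dy / n;
            m2x += dx * (x[i] - meanX); // previous-mean times updated-mean deviation
            m2y += dy * (y[i] - meanY);
            cxy += dx * (y[i] - meanY);
        }
        // NaN if either variable is constant (zero variance).
        return cxy / Math.sqrt(m2x * m2y);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};
        System.out.println(correlation(x, y));
    }
}
```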

The Spearman coefficient should be no worse than the cost of sorting plus
the cost of the Pearson computation.  There are often faster methods as
well if there are no ties.
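
As a sketch of that cost argument: rank both variables by sorting, O(n log n), then run Pearson on the ranks, O(n). The code below is self-contained and does not use the Commons Math classes. Tied values get the average of their positions, and the no-ties shortcut is the textbook formula 1 - 6*sum(d^2)/(n(n^2-1)).

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch: Spearman correlation as sorting plus Pearson on the ranks. */
public class SpearmanSketch {

    /** Ranks starting at 1; tied values share the average of their positions. */
    public static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> values[i])); // O(n log n)
        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
                end++; // extend the run of tied values
            }
            double avg = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = avg;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Plain two-pass Pearson correlation, O(n). */
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i] / n;
            meanY += y[i] / n;
        }
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
            syy += (y[i] - meanY) * (y[i] - meanY);
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static double spearman(double[] x, double[] y) {
        return pearson(averageRanks(x), averageRanks(y));
    }

    /** Shortcut that is valid only when neither variable has ties. */
    public static double spearmanNoTies(double[] x, double[] y) {
        double[] rx = averageRanks(x);
        double[] ry = averageRanks(y);
        int n = x.length;
        double sumD2 = 0;
        for (int i = 0; i < n; i++) {
            double d = rx[i] - ry[i];
            sumD2 += d * d;
        }
        return 1.0 - 6.0 * sumD2 / (n * ((double) n * n - 1.0));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(averageRanks(new double[] {10, 2, 10})));
        System.out.println(spearman(new double[] {1, 2, 3, 4, 5}, new double[] {2, 1, 4, 3, 5}));
    }
}
```

The tied 10s in the first example each get rank 2.5, which is the kind of fractional rank discussed earlier in the thread.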

On Thu, Dec 13, 2012 at 6:57 AM, Martin Rosellen <
[hidden email]> wrote:

> Hi again,
>
> I tried to implement the Pearson and Spearman algorithms myself and the
> computation took very long. That is why I now use the Commons Math
> solution. I am curious about the runtime complexity of the Pearson and the
> Spearman correlation coefficients. Can someone help me with that?
>
> Greetz
> Martin
>