[math] correlation analysis with NaNs

[math] correlation analysis with NaNs

Martin Rosellen
Dear all,

I have difficulties using the Spearman correlation analysis with double
arrays that may contain NaN entries. As you see in my example I want to
analyse the columns with entries {Double.NaN, 1, 2} and {10, 2, 10}. The
output of the execution of the code below is:

Ranking [1.0, 2.0]
Ranking [2.5, 1.0, 2.5]
correlations 0.8660254037844386


{code}
         double[] column1 = new double[]{Double.NaN, 1, 2};
         double[] column2 = new double[]{10, 2, 10};

         NaturalRanking rank = new NaturalRanking(NaNStrategy.REMOVED);
         double[] ranking1 = rank.rank(column1);
         double[] ranking2 = rank.rank(column2);

         System.out.println("Ranking " + Arrays.toString(ranking1));
         System.out.println("Ranking " + Arrays.toString(ranking2));

         SpearmansCorrelation s_corrs = new SpearmansCorrelation();
         double correlations = s_corrs.correlation(column1, column2);

         System.out.println("correlations " + correlations);
{code}

As I understand Spearman, the correlation should be 1, because tuples
that contain NaNs should be ignored in both the ranking and the
correlation analysis. What I don't understand is why there are ranks
like 2.5.

My workaround works as follows:
- use NaNStrategy.FIXED, so that the NaNs stay in place
- execute the ranking
- round down fractional ranks like 2.5, skipping NaN entries (casting
a NaN to int would turn it into 0)
- run a custom Pearson correlation on the ranked arrays that ignores
tuples containing NaNs

Here is the code:
{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};

// FIXED leaves NaNs at their original positions in the rank array
NaturalRanking rank = new NaturalRanking(NaNStrategy.FIXED);

double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);

// truncate fractional (tied) ranks; NaN entries are left untouched
for (int i = 0; i < ranking1.length; i++) {
    if (!Double.isNaN(ranking1[i])) {
        ranking1[i] = (int) ranking1[i];
    }

    if (!Double.isNaN(ranking2[i])) {
        ranking2[i] = (int) ranking2[i];
    }
}

System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));

PearsonsCorrelation p_corrs = new PearsonsCorrelation();
// correlationNaNs is the custom method mentioned above (not part of
// Commons Math); it skips tuples containing NaN and runs on the ranks
double correlations = p_corrs.correlationNaNs(ranking1, ranking2);

System.out.println("correlations " + correlations);
{code}

I hope that my solution for dealing with NaNs isn't missing anything.
Perhaps you can comment on this.

Kind regards
Martin


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]


RE: [math] correlation analysis with NaNs

Patrick Meyer
You are getting values like 2.5 because of the default ties strategy, which
assigns tied values the average of their ranks. If you do not want that
behaviour, create a RankingAlgorithm instance with a different ties strategy
and pass it to the SpearmansCorrelation constructor. This approach also gives
you control over how NaNs are handled. Something like:

//create data matrix (one row per observation, one column per variable)
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};
Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
for (int i = 0; i < column1.length; i++) {
    mydata.setEntry(i, 0, column1[i]);
    mydata.setEntry(i, 1, column2[i]);
}

//compute correlation
NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
        TiesStrategy.RANDOM);
SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);

Try that.
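
To make the tie-averaging explanation concrete, here is a minimal, library-free sketch that reproduces the numbers from the original post. It rests on two assumptions that are not stated in the thread: that the default ranking averages the ranks of tied values, and that it treats NaN as the largest value. The class and method names (AverageRankDemo, rank, pearson) are made up for illustration and are not Commons Math API.

```java
import java.util.Arrays;

class AverageRankDemo {

    /** Ranks 1..n; tied values share their average rank; NaN sorts last (assumed). */
    static double[] rank(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int idx = 0; idx < n; idx++) {
            order[idx] = idx;
        }
        // Double.compare places NaN after every other value
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            // extend j over the run of values equal to the one at position i
            while (j + 1 < n
                    && Double.compare(values[order[j + 1]], values[order[i]]) == 0) {
                j++;
            }
            double average = (i + j) / 2.0 + 1.0; // mean of positions i+1 .. j+1
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = average;
            }
            i = j + 1;
        }
        return ranks;
    }

    /** Plain Pearson correlation of two equally long arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] r1 = rank(new double[]{Double.NaN, 1, 2}); // NaN gets the top rank
        double[] r2 = rank(new double[]{10, 2, 10});        // the two 10s tie
        System.out.println(Arrays.toString(r1));
        System.out.println(Arrays.toString(r2));
        System.out.println(pearson(r1, r2));
    }
}
```

Under those assumptions the two 10s occupy positions 2 and 3 and each receives (2 + 3) / 2 = 2.5, while the NaN is ranked 3 rather than dropped, so the Pearson correlation of the ranks comes out at about 0.866 instead of 1.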



-----Original Message-----
From: Martin Rosellen [mailto:[hidden email]]
Sent: Wednesday, November 07, 2012 6:10 AM
To: Commons Users List
Subject: [math] correlation analysis with NaNs



Re: [math] correlation analysis with NaNs

Thomas Neidhart
On 11/07/2012 01:38 PM, Patrick Meyer wrote:

> You are getting values like 2.5 because of the default ties strategy. If you
> do not want to use that method, create an instance of RankingAlgorithm with
> a different ties strategy and pass it to the constructor for the
> SpearmanCorrelation. This approach also gives you control over the method
> for dealing with NaNs. Something like,
>
> //create data matrix
> double[] column1 = new double[]{Double.NaN, 1, 2};
> double[] column2 = new double[]{10, 2, 10};
> Array2DRowRealMatrix mydata = new Array2DRowRealMatrix();
> For(int i=0;i<column1.length;i++){
> mydata.addToEntry(i, 0, column1[i]);
> mydata.addToEntry(i, 1, column2[i]);
> }
>
> //compute correlation
> NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED,
> TiesStrategy.RANDOM);
> SpearmanCorrelation spearman = new SpearmanCorrelation(ranking, mydata);
>
> Try that.

Hi,

this will not really help imho.

As far as I can see, there are at least two problems with the current
use of the RankingAlgorithm in the SpearmansCorrelation class:

 * there is no way to select the ranking algorithm in the constructor
   without passing the values at the same time
 * the NaNStrategy.REMOVED does not work symmetrically, i.e. it removes
   the NaN only from the input array where it occurs but not in the
   corresponding array, thus rendering it useless as it will result in
   exceptions (array lengths differ)
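
The asymmetry described in the second point can be illustrated without the library. The sketch below is only an assumption about the effect, not the Commons Math implementation: removeNaNs is a hypothetical helper that, like a per-array removal strategy, sees one array at a time and therefore cannot drop the partner value.

```java
import java.util.Arrays;

class AsymmetricRemovalDemo {

    /** Drops NaN entries from one array, with no knowledge of any paired array. */
    static double[] removeNaNs(double[] values) {
        return Arrays.stream(values).filter(v -> !Double.isNaN(v)).toArray();
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};

        double[] cleaned1 = removeNaNs(column1); // loses its first entry
        double[] cleaned2 = removeNaNs(column2); // nothing to remove

        // The arrays no longer line up, so a pairwise correlation cannot be computed
        System.out.println(cleaned1.length + " vs " + cleaned2.length);
    }
}
```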

Would you be able to create an issue for this on the issue tracker and
provide the test case?

Thanks,

Thomas


RE: [math] correlation analysis with NaNs

Patrick Meyer
I agree that it would be nice to have a constructor that allows you to
specify only the ranking algorithm.

As for NaN and the Spearman correlation, maybe we should add a default
strategy of NaNStrategy.FAIL so that an exception is thrown if any NaN is
encountered. R uses this treatment of missing data and forces users to
choose how to handle it. If we implemented something like listwise or
pairwise deletion, it could be used in other classes too. As such, the
treatment of missing data should be part of a larger discussion and handled
in a more comprehensive and systematic way.
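
As a rough illustration of what listwise deletion could look like for this use case, the following library-free sketch drops every observation in which either value is NaN, ranks what remains (averaging ties), and then correlates the ranks. All names here (ListwiseSpearmanDemo, dropNaNPairs, rank, pearson, spearman) are hypothetical and not part of Commons Math; this is one possible reading of the proposal, not a settled design.

```java
import java.util.Arrays;

class ListwiseSpearmanDemo {

    /** Keeps only the positions where neither x nor y is NaN. */
    static double[][] dropNaNPairs(double[] x, double[] y) {
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                kept++;
            }
        }
        double[] cx = new double[kept];
        double[] cy = new double[kept];
        int k = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                cx[k] = x[i];
                cy[k] = y[i];
                k++;
            }
        }
        return new double[][]{cx, cy};
    }

    /** Ranks 1..n, giving tied values the average of their positions. */
    static double[] rank(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int idx = 0; idx < n; idx++) {
            order[idx] = idx;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[n];
        int i = 0;
        while (i < n) {
            int j = i;
            while (j + 1 < n
                    && Double.compare(values[order[j + 1]], values[order[i]]) == 0) {
                j++;
            }
            double average = (i + j) / 2.0 + 1.0;
            for (int k = i; k <= j; k++) {
                ranks[order[k]] = average;
            }
            i = j + 1;
        }
        return ranks;
    }

    /** Plain Pearson correlation of two equally long arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = Arrays.stream(x).sum() / n;
        double meanY = Arrays.stream(y).sum() / n;
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman correlation after listwise deletion of NaN observations. */
    static double spearman(double[] x, double[] y) {
        double[][] clean = dropNaNPairs(x, y);
        return pearson(rank(clean[0]), rank(clean[1]));
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        System.out.println(spearman(column1, column2));
    }
}
```

For the data from the original post, only the pairs (1, 2) and (2, 10) survive, both columns rank as [1, 2], and the correlation is 1. Deleting before ranking also means no ranks need to be rounded afterwards.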



-----Original Message-----
From: Thomas Neidhart [mailto:[hidden email]]
Sent: Wednesday, November 07, 2012 8:09 AM
To: [hidden email]
Subject: Re: [math] correlation analysis with NaNs


Re: [math] correlation analysis with NaNs

Thomas Neidhart
Hi Patrick,

On 11/07/2012 04:37 PM, Patrick Meyer wrote:

> I agree that it would be nice to have a constructor that allows you to
> specify only the ranking algorithm.
>
> As far as NaN and the Spearman correlation, maybe we should add a default
> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> encountered. R uses this treatment of missing data and forces users to
> choose how to handle it. If we implemented something like listwise or
> pairwise deletion it could be used in other classes too. As such, treatment
> of missing data should be part of a larger discussion and handled in a more
> comprehensive and systematic way.

I think this additional option makes sense, but I will forward this
discussion to the dev mailing list, where it is better suited.

Thomas
