Dear all,
I have difficulties using the Spearman correlation analysis with double arrays that may contain NaN entries. As you can see in my example, I want to analyse the columns with entries {Double.NaN, 1, 2} and {10, 2, 10}. The output of the code below is:

Ranking [1.0, 2.0]
Ranking [2.5, 1.0, 2.5]
correlations 0.8660254037844386

{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};

NaturalRanking rank = new NaturalRanking(NaNStrategy.REMOVED);
double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);
System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));

SpearmansCorrelation s_corrs = new SpearmansCorrelation();
double correlations = s_corrs.correlation(column1, column2);
System.out.println("correlations " + correlations);
{code}

As I understand Spearman, the result of the correlation should be 1, because tuples that contain NaNs should be ignored both in the ranking and in the correlation analysis. What I don't understand is why there are ranks like 2.5.
My workaround works as follows:

- use NaNStrategy.FIXED, so that the NaNs stay in place
- execute the ranking
- round down ranks like 2.5 if they are not NaN (NaNs would be cast to 0.0)
- execute a custom Pearson correlation, which ignores tuples with NaNs, on the ranked arrays

Here is the code:

{code}
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};

NaturalRanking rank = new NaturalRanking(NaNStrategy.FIXED);
double[] ranking1 = rank.rank(column1);
double[] ranking2 = rank.rank(column2);

// Round ranks down; skip NaNs, because (int) Double.NaN is 0
for (int i = 0; i < ranking1.length; i++) {
    if (!Double.isNaN(ranking1[i])) {
        ranking1[i] = (int) ranking1[i];
    }
    if (!Double.isNaN(ranking2[i])) {
        ranking2[i] = (int) ranking2[i];
    }
}
System.out.println("Ranking " + Arrays.toString(ranking1));
System.out.println("Ranking " + Arrays.toString(ranking2));

// correlationNaNs is my custom method that skips tuples containing NaN
PearsonsCorrelation p_corrs = new PearsonsCorrelation();
double correlations = p_corrs.correlationNaNs(ranking1, ranking2);
System.out.println("correlations " + correlations);
{code}

I hope that my solution for dealing with NaNs isn't missing anything. Perhaps you can comment on this.

Kind regards
Martin

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
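Both numbers in the post can be reproduced by hand. The sketch below is plain Java, not the Commons Math code; TieRankingSketch, averageRanks and pearson are names made up for this illustration. It ranks each column so that tied values share the average of their positions, which is where 2.5 comes from: the two 10s occupy positions 2 and 3. It also assumes that a NaN is ranked as the largest value. That assumption is not stated anywhere in the thread; it is made here only because it gives ranks [3, 1, 2] for the first column and a correlation of about 0.866, which matches the reported value.

```java
import java.util.Arrays;

/** Plain-Java illustration; not the Commons Math implementation. */
final class TieRankingSketch {

    /** Ranks start at 1; tied values share the average of their positions; NaN sorts last. */
    static double[] averageRanks(double[] values) {
        int n = values.length;
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            order[i] = i;
        }
        // Double.compare orders NaN above every other value (assumption: NaN ranks as largest).
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        double[] ranks = new double[n];
        int start = 0;
        while (start < n) {
            int end = start;
            while (end + 1 < n
                    && Double.compare(values[order[end + 1]], values[order[start]]) == 0) {
                end++;
            }
            // Sorted positions start..end hold equal values; average their 1-based ranks.
            double shared = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = shared;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Pearson correlation of two equally long arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    public static void main(String[] args) {
        double[] ranks1 = averageRanks(new double[] {Double.NaN, 1, 2});
        double[] ranks2 = averageRanks(new double[] {10, 2, 10});
        System.out.println("Ranking " + Arrays.toString(ranks1));
        System.out.println("Ranking " + Arrays.toString(ranks2));
        System.out.println("correlation " + pearson(ranks1, ranks2));
    }
}
```

With the rank deviations (1, -1, 0) and (0.5, -1, 0.5), the correlation works out to 1.5 / sqrt(2 * 1.5), roughly 0.866.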
You are getting values like 2.5 because of the default ties strategy (TiesStrategy.AVERAGE), which gives tied values the average of their ranks. If you do not want to use that method, create an instance of a RankingAlgorithm with a different ties strategy and pass it to the constructor of SpearmansCorrelation. This approach also gives you control over the method for dealing with NaNs. Something like,

//create data matrix
double[] column1 = new double[]{Double.NaN, 1, 2};
double[] column2 = new double[]{10, 2, 10};
Array2DRowRealMatrix mydata = new Array2DRowRealMatrix(column1.length, 2);
for (int i = 0; i < column1.length; i++) {
    mydata.setEntry(i, 0, column1[i]);
    mydata.setEntry(i, 1, column2[i]);
}

//compute correlation
NaturalRanking ranking = new NaturalRanking(NaNStrategy.FIXED, TiesStrategy.RANDOM);
SpearmansCorrelation spearman = new SpearmansCorrelation(mydata, ranking);

Try that.

-----Original Message-----
From: Martin Rosellen [mailto:[hidden email]]
Sent: Wednesday, November 07, 2012 6:10 AM
To: Commons Users List
Subject: [math] correlation analysis with NaNs
On 11/07/2012 01:38 PM, Patrick Meyer wrote:
> You are getting values like 2.5 because of the default ties strategy. If you
> do not want to use that method, create an instance of RankingAlgorithm with
> a different ties strategy and pass it to the constructor for the
> SpearmansCorrelation. This approach also gives you control over the method
> for dealing with NaNs.

Hi,

this will not really help imho.

As far as I can see, there are at least two problems with the current use of the RankingAlgorithm in the SpearmansCorrelation class:

* there is no way to select the ranking algorithm in the constructor without passing the values at the same time
* NaNStrategy.REMOVED does not work symmetrically, i.e. it removes the NaN only from the input array where it occurs but not from the corresponding array, thus rendering it useless as it will result in exceptions (array lengths differ)

Would you be able to create an issue for this on the issue tracker and provide the test case?

Thanks,

Thomas
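The symmetric removal that NaNStrategy.REMOVED does not perform can be sketched in plain Java. This is only an illustration under stated assumptions, not Commons Math code: PairwiseDeletionSketch and its methods are hypothetical names. A position is dropped from both arrays when either value is NaN, so the two arrays keep the same length, and the Spearman coefficient is then taken as the Pearson correlation of the average ranks. For the data in this thread, {1, 2} and {2, 10} remain and the result is 1, which is what Martin expected.

```java
import java.util.Arrays;

/** Plain-Java illustration of pairwise deletion; not part of Commons Math. */
final class PairwiseDeletionSketch {

    /** Keeps only positions where both values are non-NaN; returns {xKept, yKept}. */
    static double[][] dropNaNPairs(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        double[] xs = new double[x.length];
        double[] ys = new double[y.length];
        int kept = 0;
        for (int i = 0; i < x.length; i++) {
            if (!Double.isNaN(x[i]) && !Double.isNaN(y[i])) {
                xs[kept] = x[i];
                ys[kept] = y[i];
                kept++;
            }
        }
        return new double[][] {Arrays.copyOf(xs, kept), Arrays.copyOf(ys, kept)};
    }

    /** Average ranks for NaN-free input: count of smaller values plus half the tie group. */
    static double[] averageRanks(double[] v) {
        double[] ranks = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            int smaller = 0;
            int equal = 0; // includes v[i] itself
            for (double other : v) {
                if (other < v[i]) {
                    smaller++;
                } else if (other == v[i]) {
                    equal++;
                }
            }
            ranks[i] = smaller + (equal + 1) / 2.0;
        }
        return ranks;
    }

    /** Pearson correlation of two equally long arrays. */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Spearman correlation after removing every tuple that contains a NaN. */
    static double spearman(double[] x, double[] y) {
        double[][] clean = dropNaNPairs(x, y);
        return pearson(averageRanks(clean[0]), averageRanks(clean[1]));
    }

    public static void main(String[] args) {
        double[] column1 = {Double.NaN, 1, 2};
        double[] column2 = {10, 2, 10};
        System.out.println("correlation " + spearman(column1, column2));
    }
}
```

Filtering before ranking matters here: once the first tuple is gone, the second column no longer contains a tie, so no fractional ranks appear and no rounding is needed.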
I agree that it would be nice to have a constructor that allows you to specify the ranking algorithm only.

As far as NaN and the Spearman correlation are concerned, maybe we should add a default strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is encountered. R uses this treatment of missing data and forces users to choose how to handle it. If we implemented something like listwise or pairwise deletion, it could be used in other classes too. As such, treatment of missing data should be part of a larger discussion and handled in a more comprehensive and systematic way.

-----Original Message-----
From: Thomas Neidhart [mailto:[hidden email]]
Sent: Wednesday, November 07, 2012 8:09 AM
To: [hidden email]
Subject: Re: [math] correlation analysis with NaNs
Hi Patrick,
On 11/07/2012 04:37 PM, Patrick Meyer wrote:
> I agree that it would be nice to have a constructor that allows you to
> specify the ranking algorithm only.
>
> As far as NaN and the Spearman correlation, maybe we should add a default
> strategy of NaNStrategy.FAIL so that an exception would occur if any NaN is
> encountered. R uses this treatment of missing data and forces users to
> choose how to handle it. If we implemented something like listwise or
> pairwise deletion it could be used in other classes too. As such, treatment
> of missing data should be part of a larger discussion and handled in a more
> comprehensive and systematic way.

I think this additional option makes sense, but I will forward this discussion to the dev mailing list, where it is better suited.

Thomas
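The difference between the options mentioned in this exchange can be made concrete with a small plain-Java sketch. The Policy enum, the MissingDataSketch class and the sample rows are invented for illustration and do not exist in Commons Math. FAIL throws as soon as a NaN is seen, listwise deletion drops every row that has a NaN in any column, and pairwise deletion only looks at the two columns being correlated, so it can keep more observations.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration of missing-data policies; all names are hypothetical. */
final class MissingDataSketch {

    /** Hypothetical policy names; not a Commons Math type. */
    enum Policy { FAIL, LISTWISE }

    static boolean hasNaN(double[] row) {
        for (double v : row) {
            if (Double.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the rows without any NaN, or throws under FAIL if a NaN is found. */
    static double[][] completeRows(double[][] rows, Policy policy) {
        List<double[]> kept = new ArrayList<>();
        for (double[] row : rows) {
            if (!hasNaN(row)) {
                kept.add(row);
            } else if (policy == Policy.FAIL) {
                throw new IllegalArgumentException("NaN found; choose a deletion policy explicitly");
            }
        }
        return kept.toArray(new double[0][]);
    }

    /** Pairwise deletion: number of rows usable when correlating columns a and b. */
    static int pairwiseCount(double[][] rows, int a, int b) {
        int count = 0;
        for (double[] row : rows) {
            if (!Double.isNaN(row[a]) && !Double.isNaN(row[b])) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample: four observations of three variables.
        double[][] rows = {
            {Double.NaN, 10, 1},
            {1, 2, Double.NaN},
            {2, 10, 3},
            {3, 4, 4}
        };
        System.out.println("listwise rows: " + completeRows(rows, Policy.LISTWISE).length);
        System.out.println("pairwise rows for columns 0 and 1: " + pairwiseCount(rows, 0, 1));
    }
}
```

In this made-up sample, listwise deletion leaves two rows for every pair of columns, while pairwise deletion leaves three rows for columns 0 and 1. The price of pairwise deletion is that different coefficients in a correlation matrix are then based on different subsets of the data.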