Apache Commons-Math seems like an excellent library, but what I don't understand is why we are using 'double' everywhere instead of BigDecimal.

I wrote a simple program to run a Multiple Regression Analysis followed by Rank, and compared my results to those from the R stats package. Because of a lack of precision, the 'ranks' are way off. I am assuming that if we had used BigDecimal, the ranks would have matched the ones from R.

Is there something I am missing?
Doesn't R use doubles under the covers? Note this quote from the manual:

*R has no single precision data type. All real numbers are stored in double precision format*.

(from http://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html)

Any difference in the results that you saw is likely due to different algorithms.

If you mean rank as in the rank of a matrix, then the exact value is very much a matter of judgment, since it involves an implicit comparison of a numerical value to zero.

Using BigDecimal is very unlikely to have significantly affected your results.

On Thu, Feb 11, 2010 at 9:29 PM, Something Something <[hidden email]> wrote:

> I wrote a simple program to run a Multiple Regression Analysis followed by
> Rank, and compared my results to those from R stats package and because of
> lack of precision the 'ranks' are way off. I mean I am assuming that if we
> had used BigDecimal the ranks would have matched to the ones from R.
>
> Is there something I am missing?

--
Ted Dunning, CTO
DeepDyve
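The remark about matrix rank depending on a comparison to zero can be illustrated with a minimal sketch. It assumes plain JDK classes only; the class name `RankSketch`, the `rank` helper, and the two tolerances are made up for illustration and are not taken from Commons-Math or R. The same nearly singular matrix is reported as rank 1 or rank 2 depending on the threshold chosen.

```java
class RankSketch {
    /**
     * Numerical rank via Gaussian elimination with partial pivoting.
     * A pivot whose magnitude is <= tol is treated as zero (hypothetical helper).
     */
    static int rank(double[][] a, double tol) {
        int rows = a.length;
        int cols = a[0].length;
        // Work on a copy so the caller's matrix is left untouched.
        double[][] m = new double[rows][];
        for (int i = 0; i < rows; i++) {
            m[i] = a[i].clone();
        }
        int rank = 0;
        for (int c = 0; c < cols && rank < rows; c++) {
            // Pick the largest remaining entry in this column as pivot.
            int p = rank;
            for (int r = rank + 1; r < rows; r++) {
                if (Math.abs(m[r][c]) > Math.abs(m[p][c])) {
                    p = r;
                }
            }
            if (Math.abs(m[p][c]) <= tol) {
                continue; // judged to be zero: no pivot in this column
            }
            double[] t = m[p];
            m[p] = m[rank];
            m[rank] = t;
            for (int r = rank + 1; r < rows; r++) {
                double f = m[r][c] / m[rank][c];
                for (int k = c; k < cols; k++) {
                    m[r][k] -= f * m[rank][k];
                }
            }
            rank++;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Second row is almost exactly twice the first.
        double[][] a = { {1.0, 2.0}, {2.0, 4.0 + 1e-12} };
        System.out.println(rank(a, 1e-9));  // prints 1
        System.out.println(rank(a, 1e-15)); // prints 2
    }
}
```

Neither answer is wrong; the tolerance encodes how small a number has to be before it counts as zero, and more digits of precision do not remove that choice.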
Interesting that this is a precision issue. I'm not surprised: depending on what you are doing, double precision may not be enough. It depends a lot on how the calculations are broken into smaller parts. BigDecimal is fantastically useful...

Andy
http://www.geog.leeds.ac.uk/people/a.turner/

-----Original Message-----
From: Ted Dunning [mailto:[hidden email]]
Sent: 12 February 2010 05:59
To: Commons Users List
Subject: Re: Why not BigDecimal?

Doesn't R use doubles under the covers? Note this quote from the manual:

*R has no single precision data type. All real numbers are stored in double precision format*.

(from http://stat.ethz.ch/R-manual/R-patched/library/base/html/double.html)

Any difference in the results that you saw is likely due to different algorithms.

If you mean rank as in the rank of a matrix, then the exact value is very much a matter of judgment, since it involves an implicit comparison of a numerical value to zero.

Using BigDecimal is very unlikely to have significantly affected your results.
Something Something wrote:
> Apache Commons-Math seems like an excellent library, but what I don't
> understand is why we are using 'double' everywhere instead of BigDecimal.

Commons-math is a low-level library intended to be used by many different types of applications. Using primitive double as the standard type is a fair bet for integrating smoothly with a priori unknown applications. Using BigDecimal would greatly restrict the audience. Also note that BigDecimal lacks many functions (sin, cos, sqrt, cbrt, exp, log ...). I also doubt its speed would be on par with primitive doubles, but this is only a personal guess that would need to be verified.

> I wrote a simple program to run a Multiple Regression Analysis followed by
> Rank, and compared my results to those from R stats package and because of
> lack of precision the 'ranks' are way off. I mean I am assuming that if we
> had used BigDecimal the ranks would have matched to the ones from R.

There are many other things that could explain the differences. Even with BigDecimal, it is difficult to set the proper scale, so your assumption needs to be verified.

Luc

> Is there something I am missing?

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]
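The difficulty of choosing a scale can be seen in a minimal sketch using only `java.math`. The class name `ScaleSketch` and its two helpers are hypothetical names chosen for this illustration. An exact BigDecimal division of 1 by 3 has no finite decimal result, so the caller has to pick a precision up front, and the result is still an approximation.

```java
import java.math.BigDecimal;
import java.math.MathContext;

class ScaleSketch {
    /** True if a / b has no exact finite decimal result, so plain divide() throws. */
    static boolean needsExplicitPrecision(BigDecimal a, BigDecimal b) {
        try {
            a.divide(b);
            return false;
        } catch (ArithmeticException e) {
            // Non-terminating decimal expansion
            return true;
        }
    }

    /** Division rounded to the given number of significant digits. */
    static BigDecimal divide(BigDecimal a, BigDecimal b, int digits) {
        return a.divide(b, new MathContext(digits));
    }

    public static void main(String[] args) {
        BigDecimal one = BigDecimal.ONE;
        BigDecimal three = new BigDecimal(3);
        System.out.println(needsExplicitPrecision(one, three)); // prints true
        System.out.println(divide(one, three, 20)); // prints 0.33333333333333333333
        System.out.println(1.0 / 3.0);              // prints 0.3333333333333333
    }
}
```

How many digits to request is a judgment call that depends on the whole computation, which is why switching types alone does not guarantee matching results.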
It is not a precision issue. R and commons-math use different algorithms with the same underlying numerical implementation.

It is even an open question which result is better. R has lots of credibility, but I have found cases where it lacked precision (and I coded up a patch that was accepted).

Unbounded precision integers and rationals are very useful, but not usually for large scale numerical programming. Except in a very few cases, if you need more than 17 digits of precision, you have other very serious problems that precision won't help.

On Fri, Feb 12, 2010 at 1:40 AM, Andy Turner <[hidden email]> wrote:

> Interesting that this is a precision issue. I'm not surprised depending on
> what you are doing, double precision may not be enough. It depends a lot on
> how the calculations are broken into smaller parts. BigDecimal is
> fantastically useful...

--
Ted Dunning, CTO
DeepDyve
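For a sense of where the "17 digits" figure comes from, here is a small sketch using only `java.lang.Math`. The class name `DoublePrecisionSketch` and its helpers are illustrative names, not part of any library. It shows the spacing of doubles near 1.0 and how a sufficiently small increment is lost to rounding.

```java
class DoublePrecisionSketch {
    /** Gap between 1.0 and the next larger double, i.e. 2^-52. */
    static double machineEpsilon() {
        return Math.ulp(1.0);
    }

    /** True if adding delta to x is lost entirely to rounding. */
    static boolean isAbsorbed(double x, double delta) {
        return x + delta == x;
    }

    public static void main(String[] args) {
        System.out.println(machineEpsilon());        // prints 2.220446049250313E-16
        System.out.println(isAbsorbed(1.0, 1e-17));  // prints true
        System.out.println(isAbsorbed(1.0, 1e-15));  // prints false
    }
}
```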
Okay... Let's not worry about R, BigDecimal & precision for the time being. I might have been looking at the wrong values, so let's hold that thought.

Let's take a simple example for getting Y-Hat values using Multiple Regression, given in this PDF:
http://www.utdallas.edu/~herve/abdi-prc-pretty.pdf

I created a small CSV called students.csv that contains the following data:

s1 14 4 1
s2 23 4 2
s3 30 7 2
s4 50 7 4
s5 39 10 3
s6 67 10 6

Col headers: Student id, Memory span (Y), age (X1), speech rate (X2)

Now the expected results are:

yHat[0]:15.166666666666668
yHat[1]:24.666666666666668
yHat[2]:27.666666666666664
yHat[3]:46.666666666666664
yHat[4]:40.166666666666664
yHat[5]:68.66666666666667

This is based on the following equation (given in the PDF): Y = 1.67 + X1 + 9.50 X2

I wrote the following quick and dirty code to use OLSMultipleLinearRegression. The 'calculateHat()' method returns a RealMatrix, but I can't see the above results in there. Am I using this class correctly? Please let me know. Thanks.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Scanner;
    import org.apache.commons.math.linear.RealMatrix;
    import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;

    private static void regression1() {
        double[][] X = new double[6][2];   // predictors: age, speech rate
        double[] Y = new double[6];        // response: memory span
        try {
            File file = new File("C:\\students.csv");
            FileReader reader = new FileReader(file);
            BufferedReader in = new BufferedReader(reader);
            String line;
            int count = 0;
            while ((line = in.readLine()) != null) {
                // Each row is: id Y X1 X2, separated by single spaces
                Scanner scanner = new Scanner(line);
                scanner.useDelimiter(" ");
                String[] cols = new String[4];
                int col = 0;
                while (scanner.hasNext()) {
                    cols[col++] = scanner.next();
                }
                Y[count] = Double.valueOf(cols[1]);
                X[count][0] = Double.valueOf(cols[2]);
                X[count][1] = Double.valueOf(cols[3]);
                count++;
            }
            in.close();
            reader.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
        OLSMultipleLinearRegression regression = new OLSMultipleLinearRegression();
        regression.newSampleData(Y, X);
        RealMatrix matrix = regression.calculateHat();
        System.out.println("matrix:" + matrix.getColumnDimension());
    }

On Fri, Feb 12, 2010 at 12:08 PM, Ted Dunning <[hidden email]> wrote:

> It is not a precision issue. R and commons-math use different algorithms
> with the same underlying numerical implementation.
>
> It is even an open question which result is better. R has lots of
> credibility, but I have found cases where it lacked precision (and I coded
> up a patch that was accepted).
>
> Unbounded precision integers and rationals are very useful, but not usually
> for large scale numerical programming. Except in a very few cases, if you
> need more than 17 digits of precision, you have other very serious problems
> that precision won't help.
Something Something wrote:
> I wrote the following small quick and dirty code to use
> OLSMultipleLinearRegression. The 'calculateHat()' method returns a
> RealMatrix, but I can't see the above results in there. Am I using this
> class correctly? Please let me know. Thanks.

The "hat matrix," as defined in the javadoc for calculateHat, is not the same as the vector of yHat values. See the javadoc and the references that it contains for the definition of the hat matrix. To compute predicted values, you need to post-multiply the design matrix, X, by the estimated coefficients. Using the variable definitions from your code, this is

    RealVector b = regression.calculateBeta();
    RealVector yHat = X.operate(b);

Side note: the residuals, Y - Y-hat, are available directly via estimateResiduals; but to get predicted values directly you need to compute them from the coefficients and design matrix as above. A computePredictedValues method added to AbstractMultipleLinearRegression might be a good enhancement, as well as a predict(RealVector) method similar to what SimpleRegression has. Patches welcome!

Phil
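The relationship yHat = X b can be checked on the students data with a self-contained sketch. This is not the Commons-Math implementation: it assumes plain JDK arrays, solves the textbook normal equations (X'X) b = X'y by Gauss-Jordan elimination, and adds the intercept column of ones by hand. The class name `YHatSketch` and the methods `fit` and `predict` are hypothetical names for this illustration, and no check for a singular system is made.

```java
class YHatSketch {
    /**
     * Least-squares coefficients from the normal equations (X'X) b = X'y.
     * x must already contain a leading column of ones for the intercept.
     */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length;
        int p = x[0].length;
        double[][] a = new double[p][p + 1]; // augmented matrix [X'X | X'y]
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) {
                for (int k = 0; k < p; k++) {
                    a[j][k] += x[i][j] * x[i][k];
                }
                a[j][p] += x[i][j] * y[i];
            }
        }
        // Gauss-Jordan elimination with partial pivoting
        for (int c = 0; c < p; c++) {
            int piv = c;
            for (int r = c + 1; r < p; r++) {
                if (Math.abs(a[r][c]) > Math.abs(a[piv][c])) {
                    piv = r;
                }
            }
            double[] t = a[c];
            a[c] = a[piv];
            a[piv] = t;
            for (int r = 0; r < p; r++) {
                if (r == c) {
                    continue;
                }
                double f = a[r][c] / a[c][c];
                for (int k = c; k <= p; k++) {
                    a[r][k] -= f * a[c][k];
                }
            }
        }
        double[] b = new double[p];
        for (int j = 0; j < p; j++) {
            b[j] = a[j][p] / a[j][j];
        }
        return b;
    }

    /** Predicted values: each row of x multiplied by the coefficient vector b. */
    static double[] predict(double[][] x, double[] b) {
        double[] yHat = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < b.length; j++) {
                yHat[i] += x[i][j] * b[j];
            }
        }
        return yHat;
    }

    public static void main(String[] args) {
        // Columns: intercept (1), age (X1), speech rate (X2)
        double[][] x = {
            {1, 4, 1}, {1, 4, 2}, {1, 7, 2}, {1, 7, 4}, {1, 10, 3}, {1, 10, 6}
        };
        double[] y = {14, 23, 30, 50, 39, 67}; // memory span
        double[] b = fit(x, y);
        double[] yHat = predict(x, b);
        for (int i = 0; i < yHat.length; i++) {
            System.out.println("yHat[" + i + "]:" + yHat[i]);
        }
    }
}
```

The coefficients come out close to 1.67, 1 and 9.50, matching the equation from the PDF, and the predicted values agree with the expected yHat list up to rounding in the last digits.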