# Commons Math vs. Excel stats?

6 messages
Open this post in threaded view
|

## Commons Math vs. Excel stats?

 I'm having a weird problem when using the commons math package.  When I run statistics using the Commons math, then compare the results to Excel, I get different standard deviation and median, but min, max, and count are the same.  I'd appreciate any ideas on how Commons Math and Excel differ in these calculations. MEDIAN:  Excel:  468,231   CommonsMath:  485,711 STD:        Excel:    11,861   CommonsMath:    10,678 The data set is 18,000 integers so I won't include those.  They are mostly 6 digit numbers.  Here's the code: import org.apache.commons.math.stat.descriptive.moment.StandardDeviation; import org.apache.commons.math.stat.descriptive.rank.Max; import org.apache.commons.math.stat.descriptive.rank.Median; import org.apache.commons.math.stat.descriptive.rank.Min; import gnu.trove.TDoubleHashSet; public class ExampleForMailingList {     StandardDeviation std                    = new StandardDeviation( );     Min               min                    = new Min( );     Max               max                    = new Max( );     Median            medianInstance               = new Median();     private double    minimum                = 0;     private double    maximum                = 0;     private double    standardDev            = 0;     private double median = 0;     private boolean   isCalcDone             = false;     private double   count                  = 0;     /**      * data If the length is zero, then only 0 measurements were added.      */     TDoubleHashSet    data                   = new TDoubleHashSet( );     /**      * If the measurement is greater than 0, then add it to the data.      *      * @param measurement      */     public void addMeasurement( int measurement ) {             data.add( measurement );             count++;     }     /**      * Must be called before using the getters.  This method calculates the statistics.      */     public void calculate() {         try {             double[] dataArray = data.toArray( );             minimum = min.evaluate( dataArray );             maximum = max.evaluate( dataArray );             standardDev = std.evaluate( dataArray );             median = medianInstance.evaluate(dataArray);             isCalcDone = true;         } catch ( RuntimeException e ) {             // TODO Auto-generated catch block             e.printStackTrace( );         }     } // calculate     public double getMinimum() throws CalcNotDoneException {         return minimum;     } // get minimum     public double getMaximum() throws CalcNotDoneException {            return maximum;     } // get maximum     public double getStd() throws CalcNotDoneException {          return standardDev;     } // get std     public double getMedian() throws CalcNotDoneException {  return median;     } // get median     /**      * Converts a result set into a set of statistics which a table model consumes. Calculates:
* 1. min
* 2. average
* 3. max
* 4. median
* 5. percent threshold violations
* @param resultSetArg      *            Results of an order table query     */     public void processResults( ResultSet results,String column ) {         int value = Integer.MAX_VALUE;            try {             while ( results.next( ) ) {                       value = ( int ) results.getLong( column );                         if ( value > -1 ) {                             addMeasurement( value );                         }                 }         } catch ( SQLException e ) {             // TODO Auto-generated catch block             e.printStackTrace();         } // while  } // processResults     public static void main( String[] args ) {         ExampleForMailingList example = new ExampleForMailingList();         example.processResults(ResultSet set,"columnA");         example.calculate( );         System.out.println("std: "+ example.getStd( ));         System.out.println("std: "+ example.getMedian( ));     } } Thanks!
Open this post in threaded view
|

## Re: Commons Math vs. Excel stats?

 I remember having similar things a while back (between custom Java and Excel rather than Commons Math). I seem to recall Excel having some overflow oddities/bugs that led to the incorrect results. It was a couple of years ago at a previous company, so my memory isn't too hot :) Hen On 11/7/06, Jeff Drew <[hidden email]> wrote: > I'm having a weird problem when using the commons math package.  When I run > statistics using the Commons math, then compare the results to Excel, I get > different standard deviation and median, but min, max, and count are the > same.  I'd appreciate any ideas on how Commons Math and Excel differ in > these calculations. > > MEDIAN:  Excel:  468,231   CommonsMath:  485,711 > STD:        Excel:    11,861   CommonsMath:    10,678 > > The data set is 18,000 integers so I won't include those.  They are mostly 6 > digit numbers.  Here's the code: > > import org.apache.commons.math.stat.descriptive.moment.StandardDeviation; > import org.apache.commons.math.stat.descriptive.rank.Max; > import org.apache.commons.math.stat.descriptive.rank.Median; > import org.apache.commons.math.stat.descriptive.rank.Min; > import gnu.trove.TDoubleHashSet; > > public class ExampleForMailingList { > >     StandardDeviation std                    = new StandardDeviation( ); > >     Min               min                    = new Min( ); > >     Max               max                    = new Max( ); > >     Median            medianInstance               = new Median(); > >     private double    minimum                = 0; > >     private double    maximum                = 0; > >     private double    standardDev            = 0; > >     private double median = 0; > >     private boolean   isCalcDone             = false; > >     private double   count                  = 0; > >     /** >      * data If the length is zero, then only 0 measurements > were added. >      */ >     TDoubleHashSet    data                   = new TDoubleHashSet( ); > >     /** >      * If the measurement is greater than 0, then add it to the > data. >      * >      * @param measurement >      */ >     public void addMeasurement( int measurement ) { > >             data.add( measurement ); > >             count++; >     } > >     /** >      * Must be called before using the getters.  This method calculates the > statistics. >      */ >     public void calculate() { > >         try { >             double[] dataArray = data.toArray( ); > >             minimum = min.evaluate( dataArray ); > >             maximum = max.evaluate( dataArray ); > >             standardDev = std.evaluate( dataArray ); > >             median = medianInstance.evaluate(dataArray); > >             isCalcDone = true; > >         } catch ( RuntimeException e ) { >             // TODO Auto-generated catch block >             e.printStackTrace( ); >         } >     } // calculate > >     public double getMinimum() throws CalcNotDoneException { >         return minimum; >     } // get minimum > >     public double getMaximum() throws CalcNotDoneException { >            return maximum; >     } // get maximum > >     public double getStd() throws CalcNotDoneException { >          return standardDev; >     } // get std > >     public double getMedian() throws CalcNotDoneException { >  return median; >     } // get median > >     /** >      * Converts a result set into a set of statistics which a table model > consumes. Calculates:
>      * 1. min
>      * 2. average
>      * 3. max
>      * 4. median
>      * 5. percent threshold violations
> >      * @param resultSetArg >      *            Results of an order table query >     */ >     public void processResults( ResultSet results,String column ) { > >         int value = Integer.MAX_VALUE; > >            try { >             while ( results.next( ) ) { > >                       value = ( int ) results.getLong( column ); > >                         if ( value > -1 ) { >                             addMeasurement( value ); >                         } >                 } >         } catch ( SQLException e ) { >             // TODO Auto-generated catch block >             e.printStackTrace(); >         } // while >  } // processResults > >     public static void main( String[] args ) { >         ExampleForMailingList example = new ExampleForMailingList(); >         example.processResults(ResultSet set,"columnA"); >         example.calculate( ); > >         System.out.println("std: "+ example.getStd( )); >         System.out.println("std: "+ example.getMedian( )); >     } > } > > Thanks! > > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]
Open this post in threaded view
|

## Re: Commons Math vs. Excel stats?

 In reply to this post by Jeff Drew-2 Median is defined as the middle value of a sorted list, or if the list has an even number of elements, the average of the two middle values of the sorted list. The code below uses this definition to calculate the median. If this code gives the same answer as commons math for your large dataset, then perhaps Excel is wrong. double median(double[] values) {      if (values.length == 0) {          return Double.NaN; // act same as commons.Math      }      double[] copy = new double[values.length];      System.arraycopy(values, 0, copy, 0, values.length);      java.util.Arrays.sort(copy);      int midpos = copy.length / 2;      if (copy.length % 2 == 1) { // odd number of elements          return copy[midpos]; // return middle element      }      // even number of elements: return avg of two middle elements      return (copy[midpos - 1] + copy[midpos]) / 2; } I tested with some small datasets, and this method gave the same answer as commons.Math... Hope this helps, --Remko Popma Jeff Drew wrote: > I'm having a weird problem when using the commons math package.  When I run > statistics using the Commons math, then compare the results to Excel, I get > different standard deviation and median, but min, max, and count are the > same.  I'd appreciate any ideas on how Commons Math and Excel differ in > these calculations. > > MEDIAN:  Excel:  468,231   CommonsMath:  485,711 > STD:        Excel:    11,861   CommonsMath:    10,678 > > The data set is 18,000 integers so I won't include those.  They are > mostly 6 > digit numbers.  Here's the code: > > import org.apache.commons.math.stat.descriptive.moment.StandardDeviation; > import org.apache.commons.math.stat.descriptive.rank.Max; > import org.apache.commons.math.stat.descriptive.rank.Median; > import org.apache.commons.math.stat.descriptive.rank.Min; > import gnu.trove.TDoubleHashSet; > > public class ExampleForMailingList { > >    StandardDeviation std                    = new StandardDeviation( ); > >    Min               min                    = new Min( ); > >    Max               max                    = new Max( ); > >    Median            medianInstance               = new Median(); > >    private double    minimum                = 0; > >    private double    maximum                = 0; > >    private double    standardDev            = 0; > >    private double median = 0; > >    private boolean   isCalcDone             = false; > >    private double   count                  = 0; > >    /** >     * data If the length is zero, then only 0 measurements > were added. >     */ >    TDoubleHashSet    data                   = new TDoubleHashSet( ); > >    /** >     * If the measurement is greater than 0, then add it to the > data. >     * >     * @param measurement >     */ >    public void addMeasurement( int measurement ) { > >            data.add( measurement ); > >            count++; >    } > >    /** >     * Must be called before using the getters.  This method calculates the > statistics. >     */ >    public void calculate() { > >        try { >            double[] dataArray = data.toArray( ); > >            minimum = min.evaluate( dataArray ); > >            maximum = max.evaluate( dataArray ); > >            standardDev = std.evaluate( dataArray ); > >            median = medianInstance.evaluate(dataArray); > >            isCalcDone = true; > >        } catch ( RuntimeException e ) { >            // TODO Auto-generated catch block >            e.printStackTrace( ); >        } >    } // calculate > >    public double getMinimum() throws CalcNotDoneException { >        return minimum; >    } // get minimum > >    public double getMaximum() throws CalcNotDoneException { >           return maximum; >    } // get maximum > >    public double getStd() throws CalcNotDoneException { >         return standardDev; >    } // get std > >    public double getMedian() throws CalcNotDoneException { > return median; >    } // get median > >    /** >     * Converts a result set into a set of statistics which a table model > consumes. Calculates:
>     * 1. min
>     * 2. average
>     * 3. max
>     * 4. median
>     * 5. percent threshold violations
> >     * @param resultSetArg >     *            Results of an order table query >    */ >    public void processResults( ResultSet results,String column ) { > >        int value = Integer.MAX_VALUE; > >           try { >            while ( results.next( ) ) { > >                      value = ( int ) results.getLong( column ); > >                        if ( value > -1 ) { >                            addMeasurement( value ); >                        } >                } >        } catch ( SQLException e ) { >            // TODO Auto-generated catch block >            e.printStackTrace(); >        } // while > } // processResults > >    public static void main( String[] args ) { >        ExampleForMailingList example = new ExampleForMailingList(); >        example.processResults(ResultSet set,"columnA"); >        example.calculate( ); > >        System.out.println("std: "+ example.getStd( )); >        System.out.println("std: "+ example.getMedian( )); >    } > } > > Thanks! > --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]
Open this post in threaded view
|

## RE: Commons Math vs. Excel stats?

 In reply to this post by Jeff Drew-2 Hi, Not sure of the details of either Commons Math or Excel. But could this be a precission issue. Are something like BigInteger and BigDecimal used in the calculations? If not precision will probably be lost for large calculations. Best wishes, Andy [hidden email] http://www.geog.leeds.ac.uk/people/a.turner-----Original Message----- From: Jeff Drew [mailto:[hidden email]] Sent: 07 November 2006 22:51 To: [hidden email] Subject: Commons Math vs. Excel stats? I'm having a weird problem when using the commons math package.  When I run statistics using the Commons math, then compare the results to Excel, I get different standard deviation and median, but min, max, and count are the same.  I'd appreciate any ideas on how Commons Math and Excel differ in these calculations. MEDIAN:  Excel:  468,231   CommonsMath:  485,711 STD:        Excel:    11,861   CommonsMath:    10,678 The data set is 18,000 integers so I won't include those.  They are mostly 6 digit numbers.  Here's the code: import org.apache.commons.math.stat.descriptive.moment.StandardDeviation; import org.apache.commons.math.stat.descriptive.rank.Max; import org.apache.commons.math.stat.descriptive.rank.Median; import org.apache.commons.math.stat.descriptive.rank.Min; import gnu.trove.TDoubleHashSet; public class ExampleForMailingList {     StandardDeviation std                    = new StandardDeviation( );     Min               min                    = new Min( );     Max               max                    = new Max( );     Median            medianInstance               = new Median();     private double    minimum                = 0;     private double    maximum                = 0;     private double    standardDev            = 0;     private double median = 0;     private boolean   isCalcDone             = false;     private double   count                  = 0;     /**      * data If the length is zero, then only 0 measurements were added.      */     TDoubleHashSet    data                   = new TDoubleHashSet( );     /**      * If the measurement is greater than 0, then add it to the data.      *      * @param measurement      */     public void addMeasurement( int measurement ) {             data.add( measurement );             count++;     }     /**      * Must be called before using the getters.  This method calculates the statistics.      */     public void calculate() {         try {             double[] dataArray = data.toArray( );             minimum = min.evaluate( dataArray );             maximum = max.evaluate( dataArray );             standardDev = std.evaluate( dataArray );             median = medianInstance.evaluate(dataArray);             isCalcDone = true;         } catch ( RuntimeException e ) {             // TODO Auto-generated catch block             e.printStackTrace( );         }     } // calculate     public double getMinimum() throws CalcNotDoneException {         return minimum;     } // get minimum     public double getMaximum() throws CalcNotDoneException {            return maximum;     } // get maximum     public double getStd() throws CalcNotDoneException {          return standardDev;     } // get std     public double getMedian() throws CalcNotDoneException {  return median;     } // get median     /**      * Converts a result set into a set of statistics which a table model consumes. Calculates:
* 1. min
* 2. average
* 3. max
* 4. median
* 5. percent threshold violations
* @param resultSetArg      *            Results of an order table query     */     public void processResults( ResultSet results,String column ) {         int value = Integer.MAX_VALUE;            try {             while ( results.next( ) ) {                       value = ( int ) results.getLong( column );                         if ( value > -1 ) {                             addMeasurement( value );                         }                 }         } catch ( SQLException e ) {             // TODO Auto-generated catch block             e.printStackTrace();         } // while  } // processResults     public static void main( String[] args ) {         ExampleForMailingList example = new ExampleForMailingList();         example.processResults(ResultSet set,"columnA");         example.calculate( );         System.out.println("std: "+ example.getStd( ));         System.out.println("std: "+ example.getMedian( ));     } } Thanks! --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]
Open this post in threaded view
|

## Re: Commons Math vs. Excel stats?

 In reply to this post by Jeff Drew-2 Selon Jeff Drew <[hidden email]>: > I'm having a weird problem when using the commons math package.  When I run > statistics using the Commons math, then compare the results to Excel, I get > different standard deviation and median, but min, max, and count are the > same.  I'd appreciate any ideas on how Commons Math and Excel differ in > these calculations. > > MEDIAN:  Excel:  468,231   CommonsMath:  485,711 > STD:        Excel:    11,861   CommonsMath:    10,678 > > The data set is 18,000 integers so I won't include those.  They are mostly 6 > digit numbers.  Here's the code: This is weird ... For the median, one way to check what happens is to sort your data in ascending order set and look at the data at the middle index. If you have an even number of samples 2k and 468231 is at index k (counting from 1) and 485711 is at index k+1 then it is a matter of interpretation. If you have an even number of samples 2k+1, then the result MUST BE the value at index k+1 (counting from 1). Could you check this in both Excel and CommonsMath ? For the standard deviation, one way to check is to split your data set in two parts, compute the various moments and combine them afterwards to check. Unfortunately, I am currently replying to you from a public area and cannot provide you with the equations for the combination. These equations are based on the linearity of the expectancy (is this the right english term ?) and the definition of the variance from the expectancy. If you prefer to wait until Monday, I can provide  this equations for you. Luc --------------------------------------------------------------------- To unsubscribe, e-mail: [hidden email] For additional commands, e-mail: [hidden email]