[jira] [Commented] (NUMBERS-142) Improve LinearCombination accuracy during summation of the round-off errors

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

[jira] [Commented] (NUMBERS-142) Improve LinearCombination accuracy during summation of the round-off errors

ASF GitHub Bot (Jira)

    [ https://issues.apache.org/jira/browse/NUMBERS-142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038014#comment-17038014 ]

Alex Herbert commented on NUMBERS-142:

I have put all the code together for this into a [PR 77|https://github.com/apache/commons-numbers/pull/77] that implements the LinearCombination as an interface. The PR puts the code in the JMH module for performance testing.

Have a look and see what you think of the design. Any current use of the LinearCombination class can be replaced by choosing an implementation from LinearCombinations:
// current
double sum1 = LinearCombination.value(1, 2, 3, 4, 5, 6);
// new
double sum2 = LinearCombinations.Exact.INSTANCE.value(1, 2, 3, 4, 5, 6);
The implementations and references for the performance test are:
|Standard|The standard precision dot product.|
|Current|The current implementation in LinearCombination. This is a variant of Ogita's Dot2s algorithm for 2-fold precision that uses the summation of the round-off parts in a different order to Ogita's paper.|
|Dekker|Dekker's double-length precision dot product.|
|Dot2s|Ogita's Dot2s algorithm for 2-fold precision. This method does not require a working array (see discussion below).|
|DotK|Ogita's DotK algorithm for K-fold precision. The k can be specified as a parameter to the instance. Singletons are provided for Dot3 through to Dot7.|
|Expanded|Shewchuk's arbitrary precision algorithms for exact computation using variable length p-bit numbers where the bits are stored in doubles.|
|Exact|Uses BigDecimal.|

The raw output of the performance test is:
Benchmark                             (c)  (length)    (name)  (size)  Mode  Cnt         Score        Error  Units
LinearCombinationPerformance.twoD    1e20       N/A  standard    1000  avgt    5      6710.841 ±    268.219  ns/op
LinearCombinationPerformance.twoD    1e20       N/A   current    1000  avgt    5     21571.098 ±    589.948  ns/op
LinearCombinationPerformance.twoD    1e20       N/A    dekker    1000  avgt    5     22798.832 ±    535.729  ns/op
LinearCombinationPerformance.twoD    1e20       N/A     dot2s    1000  avgt    5     15739.809 ±    279.495  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot2    1000  avgt    5     17551.580 ±    299.583  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot3    1000  avgt    5     24441.026 ±    912.896  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot4    1000  avgt    5     30620.358 ±    334.428  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot5    1000  avgt    5     38060.952 ±    910.109  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot6    1000  avgt    5     45877.618 ±   2071.469  ns/op
LinearCombinationPerformance.twoD    1e20       N/A      dot7    1000  avgt    5     53256.087 ±   3022.064  ns/op
LinearCombinationPerformance.twoD    1e20       N/A  extended    1000  avgt    5     24391.977 ±   1150.544  ns/op
LinearCombinationPerformance.twoD    1e20       N/A     exact    1000  avgt    5   4293537.179 ±  29041.774  ns/op
LinearCombinationPerformance.threeD  1e20       N/A  standard    1000  avgt    5      7677.865 ±    274.872  ns/op
LinearCombinationPerformance.threeD  1e20       N/A   current    1000  avgt    5     30349.426 ±   1436.852  ns/op
LinearCombinationPerformance.threeD  1e20       N/A    dekker    1000  avgt    5     25110.456 ±   2464.337  ns/op
LinearCombinationPerformance.threeD  1e20       N/A     dot2s    1000  avgt    5     22854.831 ±   1008.271  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot2    1000  avgt    5     23625.147 ±    703.822  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot3    1000  avgt    5     26691.231 ±    270.333  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot4    1000  avgt    5     36075.471 ±    445.362  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot5    1000  avgt    5     55473.600 ±  22107.620  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot6    1000  avgt    5     60561.452 ±   2094.282  ns/op
LinearCombinationPerformance.threeD  1e20       N/A      dot7    1000  avgt    5     68363.077 ±   2514.378  ns/op
LinearCombinationPerformance.threeD  1e20       N/A  extended    1000  avgt    5     44935.899 ±    908.291  ns/op
LinearCombinationPerformance.threeD  1e20       N/A     exact    1000  avgt    5   5152541.712 ±  74986.141  ns/op
LinearCombinationPerformance.fourD   1e20       N/A  standard    1000  avgt    5      8397.397 ±     77.708  ns/op
LinearCombinationPerformance.fourD   1e20       N/A   current    1000  avgt    5     39489.947 ±   1018.499  ns/op
LinearCombinationPerformance.fourD   1e20       N/A    dekker    1000  avgt    5     46615.843 ±    889.102  ns/op
LinearCombinationPerformance.fourD   1e20       N/A     dot2s    1000  avgt    5     29625.524 ±    578.690  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot2    1000  avgt    5     31495.629 ±   4817.193  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot3    1000  avgt    5     46525.082 ±    588.173  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot4    1000  avgt    5     55568.213 ±   2678.669  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot5    1000  avgt    5     65862.113 ±   4156.989  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot6    1000  avgt    5     74365.334 ±   2252.189  ns/op
LinearCombinationPerformance.fourD   1e20       N/A      dot7    1000  avgt    5     83908.295 ±   2334.416  ns/op
LinearCombinationPerformance.fourD   1e20       N/A  extended    1000  avgt    5     66471.331 ±   1468.997  ns/op
LinearCombinationPerformance.fourD   1e20       N/A     exact    1000  avgt    5   6242338.287 ± 303731.813  ns/op
LinearCombinationPerformance.nD      1e20        32  standard    1000  avgt    5     40881.169 ±   1028.620  ns/op
LinearCombinationPerformance.nD      1e20        32   current    1000  avgt    5    488370.139 ±   6822.698  ns/op
LinearCombinationPerformance.nD      1e20        32    dekker    1000  avgt    5    376277.106 ±  41055.472  ns/op
LinearCombinationPerformance.nD      1e20        32     dot2s    1000  avgt    5    274352.368 ±   4212.051  ns/op
LinearCombinationPerformance.nD      1e20        32      dot2    1000  avgt    5    362834.497 ±   4109.045  ns/op
LinearCombinationPerformance.nD      1e20        32      dot3    1000  avgt    5    480087.086 ±   8505.884  ns/op
LinearCombinationPerformance.nD      1e20        32      dot4    1000  avgt    5    596274.595 ±   8312.549  ns/op
LinearCombinationPerformance.nD      1e20        32      dot5    1000  avgt    5    716167.473 ±  10982.713  ns/op
LinearCombinationPerformance.nD      1e20        32      dot6    1000  avgt    5    838779.146 ±  36630.756  ns/op
LinearCombinationPerformance.nD      1e20        32      dot7    1000  avgt    5    953464.494 ±  38360.647  ns/op
LinearCombinationPerformance.nD      1e20        32  extended    1000  avgt    5   1442858.294 ±  27216.570  ns/op
LinearCombinationPerformance.nD      1e20        32     exact    1000  avgt    5  24739454.322 ± 439917.888  ns/op
So on this machine the new Dot2s implementation is faster than the current implementation. The current implementation uses two loops in the computation over the length of the input vectors. The new implementation uses only 1 loop. The number of double multiply and add operations should be the same so I assume the second loop is consuming extra time.

Dekker's method is slower than Dot2s (optimised for no array allocation) or even Dot2 (which has an array allocation) but still delivers only 2-fold precision. The Dekker method requires a double comparison to order the parts of a summation for each term of the dot product. The number of float operations is less than Dot2 but the branch for the comparison is slow.

By my working the Extended method will have the same number of floating-point computations as DotK when k=n+1 where n is the length of the vector for the dot product. We see this here as the time for Dot3 (twoD), Dot4 (threeD) and Dot5 (fourD) are approximately the same as the Extended method. This has been noted in the comments on the class that for large k and short dot products it would be better to use the Extended method instead. But this has not actually been enforced in code by preventing using a large k for short dot products.
h2. An enhancement

One performance improvement not currently in the code in the PR is the use of a temporary array for computations. Currently all the methods are effectively static and thread safe. They allocate working arrays for each method call. This adds a lot of garbage collection overhead. When I tried a variation of Shewchuk's extended precision method as described in his paper it was slower until I allowed reuse of the working arrays. This method used 3 working arrays and not 1 so garbage collection overhead is high. Trying array reuse on other implementations also adds performance benefit. I scrapped the Shewchuk variation but it does open the question of whether to support thread-safe computation or allow faster thread-unsafe instances. 

The methods DotK and Extended require a working array of length 2N for a dot product on input arrays of length N. If this working array can be reused then the performance gain is about 10-20%. To avoid cluttering the functional interface for the dot products this could be done by allowing instances of the implementations to be created that are not thread safe as they reuse working space. This would be of benefit for this scenario:
int k = 4;
// DotK will hold an internal working array, resized or reused as appropriate for each computation
LinearCombination.ND lc = new LinearCombinations.DotK(k);
for (int i = 0; i < BIG; i++) {
    double[] a = ...;
    double[] b = ...;
    double sum = lc.value(a, b);
This could of course be extended by allowing the constructor to choose whether to be thread-safe or not. All of the standard instance singletons could be thread-safe but a user could construct an instance for each working thread that is performing these computations:
// thread-safe instance
LinearCombination.ND lc1 = LinearCombinations.DotK.DOT_4;
// thread-unsafe instance with local working array
int k = 4;
LinearCombination.ND lc2 = new LinearCombinations.DotK(k);

The use of a thread-local instance for faster computations may be of benefit in Commons Geometry for methods that use an extended precision computation with many vectors.

> Improve LinearCombination accuracy during summation of the round-off errors
> ---------------------------------------------------------------------------
>                 Key: NUMBERS-142
>                 URL: https://issues.apache.org/jira/browse/NUMBERS-142
>             Project: Commons Numbers
>          Issue Type: Improvement
>          Components: arrays
>    Affects Versions: 1.0
>            Reporter: Alex Herbert
>            Assignee: Alex Herbert
>            Priority: Minor
>         Attachments: array_performance.jpg, cond_no.jpg, error_vs_condition_no.jpg, inline_perfomance.jpg
>          Time Spent: 20m
>  Remaining Estimate: 0h
> The algorithm in LinearCombination is an implementation of dot2 from [Ogita el al|http://citeseerx.ist.psu.edu/viewdoc/summary?doi=] (algorithm 5.3). There is a subtle difference in that the original dot2 algorithm sums the round-off from the products and the round-off from the product summation together. The method in LinearCombination sums them separately (using an extra variable) and adds them at the end. This actually improves the accuracy under conditions where the round-off is of greater sum than the products, as described below.
> The dot2 algorithm suffers when the total sum is close to 0 but the intermediate sum is far enough from zero that there is a large difference between the exponents of summed terms and the final result. In this case the sum of the round-off is more important than the sum of the products which due to massive cancellation is zero. The case is important for Complex numbers which require a computation of log1p(x^2+y^2-1) when x^2+y^2 is close to 1 such that log(1) would be ~zero yet the true logarithm is representable to very high precision.
> This can be protected against by using the dotK algorithm of Ogita et al with K>2. This saves all the round-off parts from each product and the running sum. These are subject to an error free transform that repeatedly adds adjacent pairs to generate a new split pair with a closer upper and lower part. Over time this will order the parts from low to high and these can be summed low first for an error free dot product.
> Using this algorithm with a single pass (K=3 for dot3) removes the cancellation error observed in the mentioned use case. Adding a single pass over the parts changes the algorithm from 25n floating point operations (flops) to 31n flops for the sum of n products.
> A second change for the algorithm is to switch to using [Dekker's|https://doi.org/10.1007/BF01397083] algorithm (Dekker, 1971) to split the number. This extracts two 26-bit mantissas from a 53-bit mantis (in IEEE 754 the leading bit in front of the of the 52-bit mantissa is assumed to be 1). This is done by multiplication by 2^s+1 with s = ceil(53/2) = 27:
> big = (2^s+1) * a
> a_hi = (big - (big - a))
> The extra bit of precision is carried into the sign of the low part of the split number.
> This is in contrast to the method in LinearCombination that uses a simple mask on the long representation to obtain the a_hi part in 26-bits and the lower part will be 27 bits.
> The advantage of Dekker's method is it produces 2 parts with 26 bits in the mantissa that can be multiplied exactly. The disadvantage is the potential for overflow requiring a branch condition to check for extreme values.
> It also appropriately handles very small sub-normal numbers that would be masked to create a 0 high part with all the non-zero bits left in the low part using the current method. This will have knock on effects on split multiplication which requires the high part to be larger.
> A simple change to the current implementation to use Dekker's split improves the precision on a wide range of test data (not shown).

This message was sent by Atlassian Jira