[math] Matrix parallel operations

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[math] Matrix parallel operations

ole ersoy
Hi,

Hope ya'll are having an awesome new year!

Some matrix operations, like createRealIdentityMatrix can be turned into one liners like this:

        IntStream.range(0, dimension).forEach(i -> m.setEntry(i, i, 1.0));

And can be performed in parallel like this:
         IntStream.range(0, dimension).parallel().forEach(i -> m.setEntry(i, i, 1.0));

Applying that approach in general we could probably create a ParallelXxxRealMatrix fairly easily.   Just thought I'd float the idea.

Cheers,
Ole


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

Otmar Ertl
On Sat, Jan 2, 2016 at 4:38 AM, Ole Ersoy <[hidden email]> wrote:

> Hi,
>
> Hope ya'll are having an awesome new year!
>
> Some matrix operations, like createRealIdentityMatrix can be turned into one
> liners like this:
>
>        IntStream.range(0, dimension).forEach(i -> m.setEntry(i, i, 1.0));
>
> And can be performed in parallel like this:
>         IntStream.range(0, dimension).parallel().forEach(i -> m.setEntry(i,
> i, 1.0));
>
> Applying that approach in general we could probably create a
> ParallelXxxRealMatrix fairly easily.   Just thought I'd float the idea.
>
> Cheers,
> Ole
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: [hidden email]
>

Once we go for Java 8, yes. However, all these constructs have some
overhead. I doubt that the naive parallelization using parallel() is
faster than the sequential counterpart, especially for operations such
as matrix initializations. A more efficient parallel implementation
would require the definition of a Spliterator that divides an
operation into less but larger chunks of work in order to amortize
synchronization costs. In general, the implementation of efficient and
well scaling parallel matrix operations requires more work than
writing just a single line of code.

Otmar

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

ole ersoy
Hi Otmar,

On 01/02/2016 10:33 AM, Otmar Ertl wrote:

> On Sat, Jan 2, 2016 at 4:38 AM, Ole Ersoy <[hidden email]> wrote:
>> Hi,
>>
>> Hope ya'll are having an awesome new year!
>>
>> Some matrix operations, like createRealIdentityMatrix can be turned into one
>> liners like this:
>>
>>         IntStream.range(0, dimension).forEach(i -> m.setEntry(i, i, 1.0));
>>
>> And can be performed in parallel like this:
>>          IntStream.range(0, dimension).parallel().forEach(i -> m.setEntry(i,
>> i, 1.0));
>>
>> Applying that approach in general we could probably create a
>> ParallelXxxRealMatrix fairly easily.   Just thought I'd float the idea.
>>
>> Cheers,
>> Ole
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: [hidden email]
>>
> Once we go for Java 8, yes. However, all these constructs have some
> overhead. I doubt that the naive parallelization using parallel() is
> faster than the sequential counterpart, especially for operations such
> as matrix initializations.
I'm sure it depends...But just to find out whether there is any merit to it I ran the identity matrix code single threaded and parallel (With JMH), and these were the results (On my HP Envy laptop 8 cores):

# Run complete. Total time: 00:14:08

Benchmark                          Mode  Cnt      Score     Error Units
MyBenchmark.parallelIdentity      thrpt  200  18875.739 ± 400.382 ops/s
MyBenchmark.singleThreadIdentity  thrpt  200  11046.079 ± 130.713 ops/s

So almost twice the throughput with parallel().  This was on a 20K X 20K matrix.  I'll paste the test code at the bottom.

>   A more efficient parallel implementation
> would require the definition of a Spliterator that divides an
> operation into less but larger chunks of work in order to amortize
> synchronization costs. In general, the implementation of efficient and
> well scaling parallel matrix operations requires more work than
> writing just a single line of code.
I'm sure you are right.  I'm going to test array based matrix vector multiplication next.  I'm curious to see what happens once some basic operations are added.

Cheers,
Ole


public class IdentityBenchmark {

     @State(Scope.Thread)
     public static class Identity {
         public int dimension = 20000;
         public double[][] m = new double[dimension][dimension];
     }

     @Benchmark
     public void singleThreadIdentity(Identity m) {
         IntStream.range(0, m.dimension).forEach(i -> m.m[i][i] = 1.0);
     }

     @Benchmark
     public void parallelIdentity(Identity m) {
         IntStream.range(0, m.dimension).parallel().forEach(i -> m.m[i][i] = 1.0);
     }
}


---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

ole ersoy
In reply to this post by Otmar Ertl
Hi,

I ran another test using a single parallel loop for array based matrix vector multiplication.  Throughput almost tripled (Test pasted at bottom):

# Run complete. Total time: 00:13:24

Benchmark                                      Mode  Cnt Score Error  Units
MultiplyBenchmark.parallelMultiplication      thrpt  200  2221.682 ± 48.689  ops/s
MultiplyBenchmark.singleThreadMultiplication  thrpt  200   818.755 ±  9.782  ops/s

public class MultiplyBenchmark {

     public static double[] multiplySingleThreaded(double[][] matrix, double[] vector) {
         return Arrays.stream(matrix)
                 .mapToDouble(row -> IntStream.range(0, row.length).mapToDouble(col -> row[col]
                         * vector[col]).sum())
                 .toArray();
     }

     public static double[] multiplyConcurrent(double[][] matrix, double[] vector) {
         return Arrays.stream(matrix).parallel()
                 .mapToDouble(row -> IntStream.range(0, row.length).mapToDouble(col -> row[col]
                         * vector[col]).sum())
                 .toArray();
     }

     @State(Scope.Thread)
     public static class Matrix {
         static int size = 10000;
         static double[] vector = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

         public static double[][] matrix = new double[size][10];

         static {
             for (int i = 0; i < size; i++) {
                 matrix[i] = vector.clone();
             }
         }
     }

     @Benchmark
     public void singleThreadMultiplication(Matrix m) {
         multiplySingleThreaded(m.matrix, m.vector);
     }

     @Benchmark
     public void parallelMultiplication(Matrix m) {
         multiplyConcurrent(m.matrix, m.vector);
     }
}

Cheers,
Ole

---------------------------------------------------------------------
To unsubscribe, e-mail: [hidden email]
For additional commands, e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

Otmar Ertl
Am 03.01.2016 7:49 vorm. schrieb "Ole Ersoy" <[hidden email]>:
>
> Hi,
>
> I ran another test using a single parallel loop for array based matrix
vector multiplication.  Throughput almost tripled (Test pasted at bottom):
>
> # Run complete. Total time: 00:13:24
>
>
> Benchmark                                      Mode  Cnt Score Error
Units
> MultiplyBenchmark.parallelMultiplication      thrpt  200  2221.682 ±
48.689  ops/s
> MultiplyBenchmark.singleThreadMultiplication  thrpt  200   818.755 ±
9.782  ops/s
>
> public class MultiplyBenchmark {
>
>     public static double[] multiplySingleThreaded(double[][] matrix,
double[] vector) {
>         return Arrays.stream(matrix)
>                 .mapToDouble(row -> IntStream.range(0,
row.length).mapToDouble(col -> row[col]
>                         * vector[col]).sum())
>                 .toArray();
>     }
>
>     public static double[] multiplyConcurrent(double[][] matrix, double[]
vector) {
>         return Arrays.stream(matrix).parallel()
>                 .mapToDouble(row -> IntStream.range(0,
row.length).mapToDouble(col -> row[col]

>                         * vector[col]).sum())
>                 .toArray();
>     }
>
>     @State(Scope.Thread)
>     public static class Matrix {
>         static int size = 10000;
>         static double[] vector = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
>
>         public static double[][] matrix = new double[size][10];
>
>         static {
>             for (int i = 0; i < size; i++) {
>                 matrix[i] = vector.clone();
>             }
>         }
>     }
>
>     @Benchmark
>     public void singleThreadMultiplication(Matrix m) {
>         multiplySingleThreaded(m.matrix, m.vector);
>     }
>
>     @Benchmark
>     public void parallelMultiplication(Matrix m) {
>         multiplyConcurrent(m.matrix, m.vector);
>
>     }
> }
>
> Cheers,
> Ole
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [hidden email]
> For additional commands, e-mail: dev-help@ <[hidden email]>
commons.apache.org <[hidden email]>
>

I am curious to see how this compares to simple for-loops which I can
imagine help the JIT compiler to do loop unrolling and to make use of
instruction-level parallelism.

Otmar
Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

ole ersoy


On 01/03/2016 02:06 AM, Otmar Ertl wrote:

> Am 03.01.2016 7:49 vorm. schrieb "Ole Ersoy" <[hidden email]>:
>> Hi,
>>
>> I ran another test using a single parallel loop for array based matrix
> vector multiplication.  Throughput almost tripled (Test pasted at bottom):
>> # Run complete. Total time: 00:13:24
>>
>>
>> Benchmark                                      Mode  Cnt Score Error
> Units
>> MultiplyBenchmark.parallelMultiplication      thrpt  200  2221.682 ±
> 48.689  ops/s
>> MultiplyBenchmark.singleThreadMultiplication  thrpt  200   818.755 ±
> 9.782  ops/s
>> public class MultiplyBenchmark {
>>
>>      public static double[] multiplySingleThreaded(double[][] matrix,
> double[] vector) {
>>          return Arrays.stream(matrix)
>>                  .mapToDouble(row -> IntStream.range(0,
> row.length).mapToDouble(col -> row[col]
>>                          * vector[col]).sum())
>>                  .toArray();
>>      }
>>
>>      public static double[] multiplyConcurrent(double[][] matrix, double[]
> vector) {
>>          return Arrays.stream(matrix).parallel()
>>                  .mapToDouble(row -> IntStream.range(0,
> row.length).mapToDouble(col -> row[col]
>>                          * vector[col]).sum())
>>                  .toArray();
>>      }
>>
>>      @State(Scope.Thread)
>>      public static class Matrix {
>>          static int size = 10000;
>>          static double[] vector = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
>>
>>          public static double[][] matrix = new double[size][10];
>>
>>          static {
>>              for (int i = 0; i < size; i++) {
>>                  matrix[i] = vector.clone();
>>              }
>>          }
>>      }
>>
>>      @Benchmark
>>      public void singleThreadMultiplication(Matrix m) {
>>          multiplySingleThreaded(m.matrix, m.vector);
>>      }
>>
>>      @Benchmark
>>      public void parallelMultiplication(Matrix m) {
>>          multiplyConcurrent(m.matrix, m.vector);
>>
>>      }
>> }
>>
>> Cheers,
>> Ole
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [hidden email]
>> For additional commands, e-mail: dev-help@ <[hidden email]>
> commons.apache.org <[hidden email]>
> I am curious to see how this compares to simple for-loops which I can
> imagine help the JIT compiler to do loop unrolling and to make use of
> instruction-level parallelism.
According to the person that helped out initially on stackoverflow the stream based loop is slightly faster.  In his experiment the for loop took 100 seconds and the stream did it in 89 seconds.

http://stackoverflow.com/questions/34519952/java-8-matrix-vector-multiplication

Not so sure about that though.  Just read up on the below article and, like you are saying, there are some tricks for making for loops very fast:
https://jaxenter.com/java-performance-tutorial-how-fast-are-the-java-8-streams-118830.html

Looking at the results per the article, sticking to primitives and for loops can be wildly faster than streams.  Looks like I'm going to have to follow her advice and benchmark a lot :).

Cheers,
Ole

Reply | Threaded
Open this post in threaded view
|

Re: [math] Matrix parallel operations

ole ersoy
In reply to this post by Otmar Ertl
[...]
> I am curious to see how this compares to simple for-loops which I can
> imagine help the JIT compiler to do loop unrolling and to make use of
> instruction-level parallelism.
Otmar,

Just read another article that evaluated some of Angelika Langers results.
http://blog.codefx.org/java/stream-performance/

Here's a quote:

(Don’t ask me why for the arithmetic operation streaming the array’s elements is faster than looping over them. I have been banging my head against that wall for a while.)

I'm hoping that streaming is faster for Vectors than using the current built in iterator for things like scalar addition, etc.

Cheers,
Ole