[jira] Created: (COLLECTIONS-302) CollectionUtils.subtract() should not use ArrayList to improve speed

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (COLLECTIONS-302) CollectionUtils.subtract() should not use ArrayList to improve speed

Dmitri Blinov (Jira)
CollectionUtils.subtract() should not use ArrayList to improve speed
--------------------------------------------------------------------

                 Key: COLLECTIONS-302
                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-302
             Project: Commons Collections
          Issue Type: Improvement
          Components: Collection
            Reporter: Joachim Rudolph
            Priority: Minor


The implementation of version 3.2.1 is
public static Collection subtract(final Collection a, final Collection b) {
        ArrayList list = new ArrayList( a );
        for (Iterator it = b.iterator(); it.hasNext();) {
            list.remove(it.next());
        }
        return list;
    }
when a and b are large and similar the subtract implementation will call ArrayList.remove() frequently which copies a potentially large part of the list using system.arraycopy.

Suggestion : use LinkedList ( at least for large lists )

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (COLLECTIONS-302) CollectionUtils.subtract() should not use ArrayList to improve speed

Dmitri Blinov (Jira)

    [ https://issues.apache.org/jira/browse/COLLECTIONS-302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642375#action_12642375 ]

Henri Yandell commented on COLLECTIONS-302:
-------------------------------------------

Ran a basic test and interestingly ArrayList came out better.

{code:java}
import java.util.*;
import org.apache.commons.collections.*;

public class Bob {

    public static void main(String[] args) throws Exception {
        test(Integer.parseInt(args[0]));
    }
    public static void test(int n) {
        Collection a = new ArrayList();
        for(int i=0; i<n; i++) {
            a.add("bob"+i);
        }
        Collection b = new ArrayList();
        for(int i=0; i<n; i+=2) {
            b.add("bob"+i);
        }
        long t1 = System.currentTimeMillis();
        CollectionUtils.subtract(a, b);
        long t2 = System.currentTimeMillis();
        System.err.println("T" + n + ": " + (t2-t1));
    }

}
{code}

For an input of 10,000, both were around 550 msec. For 100,000 the ArrayList was 58000, while the LinkedList was 84000. Hardly scientific as I'm not repeating the test in the same run so could be missing out on JIT improving a second run, not running multiple times etc. My suspicion is that the ArrayList constructor checks to see if things are ArrayLists and does quick arraycopies, while the LinkedList constructor just sits and plods along. I retested by changing the input to LinkedLists from ArrayLists and the time doubled up to 102000. Of course when I try LinkedList passing in to the LinkedList variant, it goes up to 125000. Ah well.

Point of all that - apart from implying that more testing is needed - is that the collection type used might want to depend on the type of the 'a' variable.

> CollectionUtils.subtract() should not use ArrayList to improve speed
> --------------------------------------------------------------------
>
>                 Key: COLLECTIONS-302
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-302
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>            Reporter: Joachim Rudolph
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The implementation of version 3.2.1 is
> public static Collection subtract(final Collection a, final Collection b) {
>         ArrayList list = new ArrayList( a );
>         for (Iterator it = b.iterator(); it.hasNext();) {
>             list.remove(it.next());
>         }
>         return list;
>     }
> when a and b are large and similar the subtract implementation will call ArrayList.remove() frequently which copies a potentially large part of the list using system.arraycopy.
> Suggestion : use LinkedList ( at least for large lists )

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (COLLECTIONS-302) CollectionUtils.subtract() should not use ArrayList to improve speed

Dmitri Blinov (Jira)
In reply to this post by Dmitri Blinov (Jira)

    [ https://issues.apache.org/jira/browse/COLLECTIONS-302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642451#action_12642451 ]

Joachim Rudolph commented on COLLECTIONS-302:
---------------------------------------------

I did some tests myself, I just underestimated the speed of System.arraycopy()

In my problem the resulting collection would be almost empty.
For this case the following code models the algorithm :

{code:title=Bar.java|borderStyle=solid}
    private static final int REPS = 50;
    @SuppressWarnings("unchecked")
    public static void test(int n) throws InterruptedException {
        Collection a = new ArrayList(n);
        for(int i=0; i<n; i++) {
            a.add("bob"+i);
        }
       
        for (int r = 0; r < REPS; ++r) {
            ArrayList c = new ArrayList(a);
            long t1 = System.currentTimeMillis();
            while( c.size() > 0 ){
                c.remove(0);
            }
            long t2 = System.currentTimeMillis();
            System.err.println("T" + n + ": " + (t2-t1)+ " ms");
        }
    }
{code}

In this testcase for n = 200000, System.arraycopy() is called 200000 times with an average of 100000 references to be moved.
The code runs on my machine within 12.3 seconds/iteration which is about 6.5 GBytes/sec, much better than I expected.
So I will use Arraylists on many other problems where I was worried about the O(N^2) performance before.

I should have done my profiling before... sorry.




> CollectionUtils.subtract() should not use ArrayList to improve speed
> --------------------------------------------------------------------
>
>                 Key: COLLECTIONS-302
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-302
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>            Reporter: Joachim Rudolph
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The implementation of version 3.2.1 is
> public static Collection subtract(final Collection a, final Collection b) {
>         ArrayList list = new ArrayList( a );
>         for (Iterator it = b.iterator(); it.hasNext();) {
>             list.remove(it.next());
>         }
>         return list;
>     }
> when a and b are large and similar the subtract implementation will call ArrayList.remove() frequently which copies a potentially large part of the list using system.arraycopy.
> Suggestion : use LinkedList ( at least for large lists )

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Closed: (COLLECTIONS-302) CollectionUtils.subtract() should not use ArrayList to improve speed

Dmitri Blinov (Jira)
In reply to this post by Dmitri Blinov (Jira)

     [ https://issues.apache.org/jira/browse/COLLECTIONS-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell closed COLLECTIONS-302.
-------------------------------------

    Resolution: Won't Fix

No worries - closing the issue as there's nothing major here to work on.

> CollectionUtils.subtract() should not use ArrayList to improve speed
> --------------------------------------------------------------------
>
>                 Key: COLLECTIONS-302
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-302
>             Project: Commons Collections
>          Issue Type: Improvement
>          Components: Collection
>            Reporter: Joachim Rudolph
>            Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The implementation of version 3.2.1 is
> public static Collection subtract(final Collection a, final Collection b) {
>         ArrayList list = new ArrayList( a );
>         for (Iterator it = b.iterator(); it.hasNext();) {
>             list.remove(it.next());
>         }
>         return list;
>     }
> when a and b are large and similar the subtract implementation will call ArrayList.remove() frequently which copies a potentially large part of the list using system.arraycopy.
> Suggestion : use LinkedList ( at least for large lists )

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.