[jira] Created: (COLLECTIONS-328) ListUtils.intersect is slow

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

[jira] Created: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)
ListUtils.intersect is slow
---------------------------

                 Key: COLLECTIONS-328
                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
             Project: Commons Collections
          Issue Type: Improvement
            Reporter: Jilles van Gurp


ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.

current:
public static List intersection(final List list1, final List list2) {
        final ArrayList result = new ArrayList();
        final Iterator iterator = list2.iterator();

        while (iterator.hasNext()) {
            final Object o = iterator.next();

            if (list1.contains(o)) {
                result.add(o);
            }
        }

        return result;

Basically would work by inserting list1 into a HashSet like this:
Set objs = new HashSet();
objs.addAll(list1);

and then instead of list1.contains do a objs.contains

Further performance can be gained by picking the smallest list for this with a simple size comparison.

BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.


--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)

    [ https://issues.apache.org/jira/browse/COLLECTIONS-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716827#action_12716827 ]

Henri Yandell commented on COLLECTIONS-328:
-------------------------------------------

Need to look at performance numbers from small lists to big lists to work out if the HashSet is ever not worth it.

I assume the point of the method is to maintain the duplicate entries.

> ListUtils.intersect is slow
> ---------------------------
>
>                 Key: COLLECTIONS-328
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
>             Project: Commons Collections
>          Issue Type: Improvement
>            Reporter: Jilles van Gurp
>
> ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.
> current:
> public static List intersection(final List list1, final List list2) {
>         final ArrayList result = new ArrayList();
>         final Iterator iterator = list2.iterator();
>         while (iterator.hasNext()) {
>             final Object o = iterator.next();
>             if (list1.contains(o)) {
>                 result.add(o);
>             }
>         }
>         return result;
> Basically would work by inserting list1 into a HashSet like this:
> Set objs = new HashSet();
> objs.addAll(list1);
> and then instead of list1.contains do a objs.contains
> Further performance can be gained by picking the smallest list for this with a simple size comparison.
> BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Commented: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)
In reply to this post by Gilles (Jira)

    [ https://issues.apache.org/jira/browse/COLLECTIONS-328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716834#action_12716834 ]

Jilles van Gurp commented on COLLECTIONS-328:
---------------------------------------------

It was so slow that I essentially did more or less what I describe above. I left the original running while I was doing this and it was still running by the time I deployed my fix. I aborted it and restarted it and then it finished in seconds. The big question is indeed what to do with small lists. For big lists there's no question about what is the right approach. Also you might wonder about the memory impact (modest, I would say). Otherwise, HashSet.contains and List.contains should produce same results in this case so this should not affect semantics. If list size is a concern, you could maybe come up with a threshold for deciding on which implementation to use.

 

> ListUtils.intersect is slow
> ---------------------------
>
>                 Key: COLLECTIONS-328
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
>             Project: Commons Collections
>          Issue Type: Improvement
>            Reporter: Jilles van Gurp
>
> ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.
> current:
> public static List intersection(final List list1, final List list2) {
>         final ArrayList result = new ArrayList();
>         final Iterator iterator = list2.iterator();
>         while (iterator.hasNext()) {
>             final Object o = iterator.next();
>             if (list1.contains(o)) {
>                 result.add(o);
>             }
>         }
>         return result;
> Basically would work by inserting list1 into a HashSet like this:
> Set objs = new HashSet();
> objs.addAll(list1);
> and then instead of list1.contains do a objs.contains
> Further performance can be gained by picking the smallest list for this with a simple size comparison.
> BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)
In reply to this post by Gilles (Jira)

     [ https://issues.apache.org/jira/browse/COLLECTIONS-328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell updated COLLECTIONS-328:
--------------------------------------

    Fix Version/s: 3.3

> ListUtils.intersect is slow
> ---------------------------
>
>                 Key: COLLECTIONS-328
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
>             Project: Commons Collections
>          Issue Type: Improvement
>            Reporter: Jilles van Gurp
>             Fix For: 3.3
>
>
> ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.
> current:
> public static List intersection(final List list1, final List list2) {
>         final ArrayList result = new ArrayList();
>         final Iterator iterator = list2.iterator();
>         while (iterator.hasNext()) {
>             final Object o = iterator.next();
>             if (list1.contains(o)) {
>                 result.add(o);
>             }
>         }
>         return result;
> Basically would work by inserting list1 into a HashSet like this:
> Set objs = new HashSet();
> objs.addAll(list1);
> and then instead of list1.contains do a objs.contains
> Further performance can be gained by picking the smallest list for this with a simple size comparison.
> BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Updated: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)
In reply to this post by Gilles (Jira)

     [ https://issues.apache.org/jira/browse/COLLECTIONS-328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Rogan updated COLLECTIONS-328:
-------------------------------------

    Attachment: IntersectHashSet.patch

Attached is a patch to implement this improvement.

The elements from the smaller list are inserted into a HashSet, and the intersection is then generated by checking whether the HashSet contains each element from the larger list (if true, include that element in the intersection).

> ListUtils.intersect is slow
> ---------------------------
>
>                 Key: COLLECTIONS-328
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
>             Project: Commons Collections
>          Issue Type: Improvement
>            Reporter: Jilles van Gurp
>             Fix For: 4.0
>
>         Attachments: IntersectHashSet.patch
>
>
> ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.
> current:
> public static List intersection(final List list1, final List list2) {
>         final ArrayList result = new ArrayList();
>         final Iterator iterator = list2.iterator();
>         while (iterator.hasNext()) {
>             final Object o = iterator.next();
>             if (list1.contains(o)) {
>                 result.add(o);
>             }
>         }
>         return result;
> Basically would work by inserting list1 into a HashSet like this:
> Set objs = new HashSet();
> objs.addAll(list1);
> and then instead of list1.contains do a objs.contains
> Further performance can be gained by picking the smallest list for this with a simple size comparison.
> BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply | Threaded
Open this post in threaded view
|

[jira] Closed: (COLLECTIONS-328) ListUtils.intersect is slow

Gilles (Jira)
In reply to this post by Gilles (Jira)

     [ https://issues.apache.org/jira/browse/COLLECTIONS-328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henri Yandell closed COLLECTIONS-328.
-------------------------------------

    Fix Version/s: 4.0-beta-1
                       (was: 4.0)
       Resolution: Fixed

Patch applied, thanks Thomas.

I tweaked it slightly to work with COLLECTIONS-359, thus the identified intersection is removed from the hash set to avoid duplicates.

It feels to me that cpu performance would increase if we built the hashset on the larger of the datasets as we would end up walking the smaller list, but it also feels that that would hurt memory more often.

svn ci -m "Applying Thomas Rogan's patch to COLLECTIONS-328, improving the performance to ListUtils.intersection in the manner described by Jilles van Gurp"
Sending        src/java/org/apache/commons/collections/ListUtils.java
Transmitting file data .
Committed revision 965176.


> ListUtils.intersect is slow
> ---------------------------
>
>                 Key: COLLECTIONS-328
>                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-328
>             Project: Commons Collections
>          Issue Type: Improvement
>            Reporter: Jilles van Gurp
>             Fix For: 4.0-beta-1
>
>         Attachments: IntersectHashSet.patch
>
>
> ListUtils.intersect is quite slow and can be improved by using a HashSet for the contains operation which cuts the complexity from n^2 to n. I ran into this by intersecting two lists with a few hundred thousand elements.
> current:
> public static List intersection(final List list1, final List list2) {
>         final ArrayList result = new ArrayList();
>         final Iterator iterator = list2.iterator();
>         while (iterator.hasNext()) {
>             final Object o = iterator.next();
>             if (list1.contains(o)) {
>                 result.add(o);
>             }
>         }
>         return result;
> Basically would work by inserting list1 into a HashSet like this:
> Set objs = new HashSet();
> objs.addAll(list1);
> and then instead of list1.contains do a objs.contains
> Further performance can be gained by picking the smallest list for this with a simple size comparison.
> BTW what is this method supposed to do with duplicate entries in lists? Semantics are not really clear here as opposed to set intersection.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.