Rational Java: Java8 Lambdas: Sorting Performance Pitfall EXPLAINED

Written in collaboration with Peter Lawrey.

A few days ago I raised a serious problem with the performance of sorting using the new Java8 declarative style. See blogpost here. In that post I only pointed out the issue but in this post I'm going to go a bit deeper into understanding and explaining the causes of the problem. This will be done by reproducing the issue using the declarative style, and bit by bit modifying the code until we have removed the performance issue and are left with the performance that we would expect using the old style compare.

To recap, we are sorting instances of this class:

private static class MyComparableInt{
        private int a,b,c,d;

        public MyComparableInt(int i) {
            a = i%2;
            b = i%10;
            c = i%1000;
            d = i;
        }

        public int getA() return a;
        public int getB() return b;
        public int getC() return c;
        public int getD() return d;
}

Using the declarative Java 8 style (below) it took ~6s to sort 10m instances:

List mySortedList = myComparableList.stream()
      .sorted(Comparator.comparing(MyComparableInt::getA)
                              .thenComparing(MyComparableInt::getB)
                              .thenComparing(MyComparableInt::getC)
                              .thenComparing(MyComparableInt::getD))
      .collect(Collectors.toList());

Using a custom sorter (below) took ~1.6s to sort 10m instances.
This is the code call to sort:

List mySortedList = myComparableList.stream()
                    .sorted(MyComparableIntSorter.INSTANCE)
                    .collect(Collectors.toList());

Using this custom Comparator:

public enum MyComparableIntSorter 
    implements Comparator<MyComparableInt>{
        INSTANCE;

        @Override
        public int compare(MyComparableInt o1, MyComparableInt o2) {
            int comp = Integer.compare(o1.getA(), o2.getA());
            if(comp==0){
                comp = Integer.compare(o1.getB(), o2.getB());
                if(comp==0){
                    comp = Integer.compare(o1.getC(), o2.getC());
                    if(comp==0){
                        comp = Integer.compare(o1.getD(), o2.getD());
                    }
                }
            }
            return comp;
        }
 }

Let's create a comparing method in our class so we can analyse the code more closely. The reason for the comparing method is to allow us to easily swap implementations but leave the calling code the same.

In all cases this is how the comparing method will be called:

List mySortedList = myComparableList.stream()
                    .sorted(comparing(
                            MyComparableInt::getA,
                            MyComparableInt::getB,
                            MyComparableInt::getC,
                            MyComparableInt::getD))
                    .collect(Collectors.toList());

The first implementation of comparing is pretty much a copy of the one in jdk.

public static <T, U extends Comparable<? super U>> Comparator<T> 
   comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  Comparator.comparing(ke1).thenComparing(ke2)
                  .thenComparing(ke3).thenComparing(ke4);
    }

Not surprisingly this took ~6s to run through the test - but at least we have reproduced the problem and have a basis for moving forward.

Let's look at the flight recording for this test:

As can be seen there are two big issues:
1. A performance issue in the lambda$comparing method
2. Repeatedly calling Integer.valueOf (auto-boxing)

Let's try and deal with the first one which is in the comparing method. At first sight this seems strange because when you look at the code there's not much happening in that method. One thing however that is going on here extensively are virtual table lookups as the code finds the correct implementation of the function. Virtual table lookups are used when there are multiple methods called from a single line of code. We can eliminate this source of latency with the following implementation of comparing. By expanding all of the uses of the Function interface each line can only call one implementation and thus the method is inlined.

public static <T, U extends Comparable<? super U>> Comparator<T> 
       comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  (c1, c2) -> {
            int comp = compare(ke1.apply(c1), ke1.apply(c2));
            if (comp == 0) {
                comp = compare(ke2.apply(c1), ke2.apply(c2));
                if (comp == 0) {
                    comp = compare(ke3.apply(c1), ke3.apply(c2));
                    if (comp == 0) {
                        comp = compare(ke4.apply(c1), ke4.apply(c2));
                    }
                }
            }
            return comp;
        };
    }

By unwinding the method the JIT should be able to inline the method lookup.
Indeed the time almost halves to 3.5s, let's look at the Flight Recording for this run:

When I first saw this I was very surprised because as yet we haven't done any changes to reduce the calls to Integer.valueOf but that percentage has gone right down! What has has actually happened here is that, because of the changes we made to allow inlining, the Integer.valueOf has been inlined and the time taken for the Integer.valueOf is being blamed on the caller (lambda$comparing) which has inlined the callee (Integer.valueOf). This is a common problem in profilers as they can be mistaken as to which method to blame especially when inlining has taken place.

But we know that in the previous Flight Recording Integer.valueOf was highlighted so let's remove that with this implementation of comparing and see if we can reduce the time further.

return  (c1, c2) -> {
    int comp = compare(ke1.applyAsInt(c1), ke1.applyAsInt(c2));
    if (comp == 0) {
        comp = compare(ke2.applyAsInt(c1), ke2.applyAsInt(c2));
        if (comp == 0) {
           comp = compare(ke3.applyAsInt(c1), ke3.applyAsInt(c2));
           if (comp == 0) {
             comp = compare(ke4.applyAsInt(c1), ke4.applyAsInt(c2));
           }
         }
    }
    return comp;
};

With this implementation the time goes right down to 1.6s which is what we could achieve with the custom Comparator.

Let's again look at the flight recording for this run:

All the time is now going in the actual sort methods and not in overhead.

In conclusion we have learnt a couple of interesting things from this investigation:

Using the new Java8 declarative sort will in some cases be up to 4x slower than writing a custom Comparator because of the cost of auto-boxing and virtual table lookups.
FlightRecorder whilst being better than other profilers (see my first blog post on this issue) will still attribute time to the wrong methods especially when inlining is taking place.

[Update] In the comments I was asked by Nitsan to include the source code and to execute the tests as a JMH benchmark. I think that these are excellent improvements to the post. Here's the source code. The results of the JMH tests are below:

Benchmark Mode Cnt Score Error Units

CompTestBenchmark.bmCustomComparator thrpt 20 2598.617 ± 67.634 ops/s

CompTestBenchmark.bmJDKComparator thrpt 20 751.110 ± 14.835 ops/s

CompTestBenchmark.bmNoVTLComparator thrpt 20 1348.750 ± 30.382 ops/s

CompTestBenchmark.bmNoVTLOrAutoBoxingComparator thrpt 20 2202.995 ± 43.673 ops/s

The JMH tests were carried out on a list of 1000 items and confirm the results I saw when I ran the test on 10m items.

Aleksey Shipilev made a few comments (see comments section) and amended the JMH benchmark. It's definitely worth checking out his code changes.

14 comments:

Unknown26 January 2015 at 15:37
Have you tried using the comparingInt method and hoisting the creation of your comparator into a field? This should give you the opportunity to keep a nice declarative style and remove the performance problems you're encountering around boxing.
Nitsan26 January 2015 at 16:49
Can you post the test code? Java benchmarking is such that without your benchmarking harness it's hard to accept any of the claims above without benchmarking independently...
Have you tried to use -XX:+DebugNonSafepoints to improve FlightRecorder accuracy?
Aleksey Shipilev:2 February 2015 at 19:27
Well, it's not enough to wrap your custom runner in JMH @Benchmark. Here's how you do it: http://cr.openjdk.java.net/~shade/scratch/FixedCompTestBenchmark.java

Rational Java

Pages

Thursday, 22 January 2015

Java8 Lambdas: Sorting Performance Pitfall EXPLAINED

14 comments: