Thursday, 22 January 2015

Java8 Lambdas: Sorting Performance Pitfall EXPLAINED

Written in collaboration with Peter Lawrey.

A few days ago I raised a serious problem with the performance of sorting using the new Java8 declarative style. See blogpost here.  In that post I only pointed out the issue but in this post I'm going to go a bit deeper into understanding and explaining the causes of the problem.  This will be done by reproducing the issue using the declarative style, and bit by bit modifying the code until we have removed the performance issue and are left with the performance that we would expect using the old style compare.

To recap, we are sorting instances of this class:

private static class MyComparableInt{
        private int a,b,c,d;

        public MyComparableInt(int i) {
            a = i%2;
            b = i%10;
            c = i%1000;
            d = i;
        }

        public int getA() return a;
        public int getB() return b;
        public int getC() return c;
        public int getD() return d;
}

Using the declarative Java 8 style (below) it took ~6s to sort 10m instances:


List mySortedList = myComparableList.stream()
      .sorted(Comparator.comparing(MyComparableInt::getA)
                              .thenComparing(MyComparableInt::getB)
                              .thenComparing(MyComparableInt::getC)
                              .thenComparing(MyComparableInt::getD))
      .collect(Collectors.toList());

Using a custom sorter (below) took ~1.6s to sort 10m instances.
This is the code call to sort:

List mySortedList = myComparableList.stream()
                    .sorted(MyComparableIntSorter.INSTANCE)
                    .collect(Collectors.toList());

Using this custom Comparator:

public enum MyComparableIntSorter 
    implements Comparator<MyComparableInt>{
        INSTANCE;

        @Override
        public int compare(MyComparableInt o1, MyComparableInt o2) {
            int comp = Integer.compare(o1.getA(), o2.getA());
            if(comp==0){
                comp = Integer.compare(o1.getB(), o2.getB());
                if(comp==0){
                    comp = Integer.compare(o1.getC(), o2.getC());
                    if(comp==0){
                        comp = Integer.compare(o1.getD(), o2.getD());
                    }
                }
            }
            return comp;
        }
 }

Let's create a comparing method in our class so we can analyse the code more closely. The reason for the comparing method is to allow us to easily swap implementations but leave the calling code the same.

In all cases this is how the comparing method will be called:

List mySortedList = myComparableList.stream()
                    .sorted(comparing(
                            MyComparableInt::getA,
                            MyComparableInt::getB,
                            MyComparableInt::getC,
                            MyComparableInt::getD))
                    .collect(Collectors.toList());

The first implementation of comparing is pretty much a copy of the one in jdk.

public static <T, U extends Comparable<? super U>> Comparator<T> 
   comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  Comparator.comparing(ke1).thenComparing(ke2)
                  .thenComparing(ke3).thenComparing(ke4);
    }

Not surprisingly this took ~6s to run through the test - but at least we have reproduced the problem and have a basis for moving forward.

Let's look at the flight recording for this test:



As can be seen there are two big issues:  
1. A performance issue in the lambda$comparing method 
2. Repeatedly calling Integer.valueOf (auto-boxing)


Let's try and deal with the first one which is in the comparing method.  At first sight this seems strange because when you look at the code there's not much happening in that method.  One thing however that is going on here extensively are virtual table lookups as the code finds the correct implementation of the function.   Virtual table lookups are used when there are multiple methods called from a single line of code. We can eliminate this source of latency with the following implementation of comparing. By expanding all of the uses of the Function interface each line can only call one implementation and thus the method is inlined.

public static <T, U extends Comparable<? super U>> Comparator<T> 
       comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  (c1, c2) -> {
            int comp = compare(ke1.apply(c1), ke1.apply(c2));
            if (comp == 0) {
                comp = compare(ke2.apply(c1), ke2.apply(c2));
                if (comp == 0) {
                    comp = compare(ke3.apply(c1), ke3.apply(c2));
                    if (comp == 0) {
                        comp = compare(ke4.apply(c1), ke4.apply(c2));
                    }
                }
            }
            return comp;
        };
    }

By unwinding the method the JIT should be able to inline the method lookup.
Indeed the time almost halves to 3.5s, let's look at the Flight Recording for this run:




When I first saw this I was very surprised because as yet we haven't done any changes to reduce the calls to Integer.valueOf but that percentage has gone right down!  What has has actually happened here is that, because of the changes we made to allow inlining, the Integer.valueOf has been inlined and the time taken for the Integer.valueOf is being blamed on the caller (lambda$comparing) which has inlined the callee (Integer.valueOf).  This is a common problem in profilers as they can be mistaken as to which method to blame especially when inlining has taken place.

But we know that in the previous Flight Recording Integer.valueOf was highlighted so let's remove that with this implementation of comparing and see if we can reduce the time further.

return  (c1, c2) -> {
    int comp = compare(ke1.applyAsInt(c1), ke1.applyAsInt(c2));
    if (comp == 0) {
        comp = compare(ke2.applyAsInt(c1), ke2.applyAsInt(c2));
        if (comp == 0) {
           comp = compare(ke3.applyAsInt(c1), ke3.applyAsInt(c2));
           if (comp == 0) {
             comp = compare(ke4.applyAsInt(c1), ke4.applyAsInt(c2));
           }
         }
    }
    return comp;
};

With this implementation the time goes right down to 1.6s which is what we could achieve with the custom Comparator.

Let's again look at the flight recording for this run:



All the time is now going in the actual sort methods and not in overhead.

In conclusion we have learnt a couple of interesting things from this investigation:

  1. Using the new Java8 declarative sort will in some cases be up to 4x slower than writing a custom Comparator because of the cost of auto-boxing and virtual table lookups.
  2. FlightRecorder whilst being better than other profilers (see my first blog post on this issue) will still attribute time to the wrong methods especially when inlining is taking place.
[Update] In the comments I was asked by Nitsan to include the source code and to execute the tests as a JMH benchmark.  I think that these are excellent improvements to the post. Here's the source code. The results of the JMH tests are below:

Benchmark                                         Mode  Cnt     Score    Error  Units
CompTestBenchmark.bmCustomComparator             thrpt   20  2598.617 ± 67.634  ops/s
CompTestBenchmark.bmJDKComparator                thrpt   20   751.110 ± 14.835  ops/s
CompTestBenchmark.bmNoVTLComparator              thrpt   20  1348.750 ± 30.382  ops/s
CompTestBenchmark.bmNoVTLOrAutoBoxingComparator  thrpt   20  2202.995 ± 43.673  ops/s

The JMH tests were carried out on a list of 1000 items and confirm the results I saw when I ran the test on 10m items.

Aleksey Shipilev made a few comments (see comments section) and amended the JMH benchmark. It's definitely worth checking out his code changes.

14 comments:

  1. Have you tried using the comparingInt method and hoisting the creation of your comparator into a field? This should give you the opportunity to keep a nice declarative style and remove the performance problems you're encountering around boxing.

    ReplyDelete
    Replies
    1. Thanks for the tip - I'll give it a go.

      Delete
  2. Can you post the test code? Java benchmarking is such that without your benchmarking harness it's hard to accept any of the claims above without benchmarking independently...
    Have you tried to use -XX:+DebugNonSafepoints to improve FlightRecorder accuracy?

    ReplyDelete
    Replies
    1. I'll tidy up the code and post in the next few days.
      Thanks for the tip about FlightRecorder - will give that a go and see if it helps at all.

      Delete
    2. Save yourself (and your readers) some time and effort and use JMH... please.
      When you've done that, you could save yourself some more time by using the awesome (yet pretty hardcore) profiler packed with it: perfasm

      Delete
    3. Thanks again for the advice. Will do.

      Delete
    4. From what I understand, JMH is useful for micro benchmarking i.e. executing short running tasks many times. In my scenario I'm creating a list of 10m objects and sorting that list. This is a long running task varying between 1.4s and 6.5s depending on the type of sort. Would you still advocate using JMH?

      Delete
    5. Is that a genuine client use case? does it differ hugely from using a 10K/100K sized list as far as you are concerned? In the post you seem to care mostly about the comparator code and it's impact on the sort function, could you make the same point on a smaller collection? Are your conclusions in other words limited in scope to very large collections?
      JMH cares not how long the method under test takes, you can set the length of the warmup phase and measurement phase whichever way you like.

      Delete
    6. As you suggested I'm going to use JMH on a smaller sized list. The results should be in line with what I found on the larger list. Will hopefully be posting the code and results in the next few days. Thanks for the advice.

      Delete
    7. Post updated with JMH results and source code reference.

      Delete
  3. Well, it's not enough to wrap your custom runner in JMH @Benchmark. Here's how you do it: http://cr.openjdk.java.net/~shade/scratch/FixedCompTestBenchmark.java

    ReplyDelete
    Replies
    1. Thanks so much for taking the time to add that - it really was very helpful. What was your command line for running the Benchmark?

      Delete
    2. Just run the self-executable JAR. The suitable defaults are set with annotations. The profilers are enabled with "-prof stack" and "-prof perfasm", respectively.

      Delete
    3. Thanks again. I've tried out the methods you suggested and written up my results in blog post to hopefully make it easier for anyone else working with JMH for the first time. http://www.rationaljava.com/2015/02/jmh-how-to-setup-and-run-jmh-benchmark.html

      Delete