Friday 30 January 2015

Java8 Multi-threading ForkJoinPool: Dealing with exceptions

One of the main motivations behind the introduction of Java 8 lambdas was the ability to use multicore machines as easily as possible (see Mastering Lambdas: Java Programming in a Multicore World).  By simply changing your code from collection.stream()... to collection.parallelStream()... you have instant multi-threading at your disposal, which brings with it all the CPU power on your machine.  (Let's ignore contention at this point.)

If you print out the names of the threads used by parallelStream you will notice that they are the same threads used by the ForkJoin framework and look something like this:

[ForkJoinPool.commonPool-worker-1]
[ForkJoinPool.commonPool-worker-2]

See Benjamin Winterberg's blog for a nicely worked example of this.
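A quick way to see those thread names for yourself (a minimal sketch; the exact names and how many workers appear will vary by machine and parallelism setting):

```java
import java.util.stream.IntStream;

public class ThreadNames {
    public static void main(String[] args) {
        // Print the distinct thread names a parallel stream actually uses.
        // Expect the calling thread plus ForkJoinPool.commonPool-worker-N entries.
        IntStream.range(0, 1_000)
                 .parallel()
                 .mapToObj(i -> Thread.currentThread().getName())
                 .distinct()
                 .forEach(System.out::println);
    }
}
```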

Now in Java 8 you can use this commonPool directly via the new static method ForkJoinPool.commonPool().  This returns the ForkJoinPool instance (which is an ExecutorService) backed by the common pool of threads - the same ones that are used in parallelStream. This means that any work you do directly with the commonPool will play very nicely with work done in parallelStream, especially the thread scheduling and work stealing between threads.

Let's work through an example of how to use Fork/Join, especially in dealing with the tricky subject of exceptions.

Firstly, obtain an instance of the commonPool by calling ForkJoinPool.commonPool().  You can submit tasks to it using the submit() method. Because we are using Java 8 we can pass in lambda expressions, which is really neat.  As with all ExecutorService implementations you can pass either instances of Runnable or Callable into submit().  When you pass a lambda into the submit() method, the compiler infers whether it is a Runnable or a Callable from the shape of the lambda (does it return a value? can it throw a checked exception?).
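A quick sketch of that inference in action - the first lambda returns nothing so becomes a Runnable, the second returns a value so becomes a Callable:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class InferenceDemo {
    public static void main(String[] args) throws Exception {
        ForkJoinPool commonPool = ForkJoinPool.commonPool();

        // No value returned, no checked exception: inferred as Runnable
        Future<?> asRunnable =
                commonPool.submit(() -> System.out.println("ran as Runnable"));

        // Returns a value: inferred as Callable<String>
        Future<String> asCallable = commonPool.submit(() -> "ran as Callable");

        asRunnable.get();
        System.out.println(asCallable.get());
    }
}
```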

This leads to an interesting problem which highlights how lambdas work.  Suppose you have a method with a void return type (like a Runnable) that throws a checked exception (like a Callable) - see the method throwException() in the code listing below for such an example.  If you write this code it won't compile:

Future task1 = commonPool.submit(() -> {
            throwException("task 1");
        });
The reason for this is that the compiler assumes, because of the void return type, that you are trying to create a Runnable.  Of course a Runnable can't throw a checked Exception.  To get around this problem you need to force the compiler to understand that you are creating a Callable, which is allowed to throw an Exception, using this code trick:

Future task1 = commonPool.submit(() -> {
            throwException("task 1");
            return null;
        });
This is a bit messy but does the job. Arguably the compiler could have worked this out itself.
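If the return null trick bothers you, it can be hidden behind a small helper; this is a hypothetical sketch (ThrowingRunnable and submitThrowing are my own names, not JDK API):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class SubmitHelper {
    // Hypothetical functional interface: like Runnable, but allowed to throw
    @FunctionalInterface
    interface ThrowingRunnable {
        void run() throws Exception;
    }

    // Wraps the task as a Callable<Void> so checked exceptions are permitted
    static Future<Void> submitThrowing(ForkJoinPool pool, ThrowingRunnable task) {
        return pool.submit(() -> {
            task.run();
            return null;   // the Callable trick, hidden in one place
        });
    }

    public static void main(String[] args) throws Exception {
        Future<Void> f = submitThrowing(ForkJoinPool.commonPool(), () -> {
            throw new java.io.IOException("boom");
        });
        try {
            f.get();
        } catch (ExecutionException e) {
            // Walk to the root cause (ForkJoin may add wrapper exceptions)
            Throwable t = e;
            while (t.getCause() != null) t = t.getCause();
            System.out.println("root cause: " + t.getMessage());
        }
    }
}
```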

Two more things to highlight in the full code listing below.  One, you can see how many threads are available in the pool using commonPool.getParallelism(); this can be adjusted with the parameter '-Djava.util.concurrent.ForkJoinPool.common.parallelism'. Two, notice how you can unwrap the ExecutionException so that your code can present an IOException to its callers rather than a non-specific ExecutionException.  Also note that this code fails on the first exception.  If you want to collect all the exceptions you would have to structure the code appropriately, perhaps returning a List of exceptions, or, more neatly, throwing a custom exception containing a list of underlying exceptions.


public class ForkJoinTest {
    public void run() throws IOException{
        ForkJoinPool commonPool = ForkJoinPool.commonPool();

        Future task1 = commonPool.submit(() -> {
            throwException("task 1");
            return null;
        });
        Future task2 = commonPool.submit(() -> {
            throwException("task 2");
            return null;
        });

        System.out.println("Do something while tasks being " +
                "executed on " + commonPool.getParallelism()
                + " threads");

        try {
            //wait on the result from task2
            task2.get();
            //wait on the result from task1
            task1.get();
        } catch (InterruptedException e) {
            throw new AssertionError(e);
        } catch (ExecutionException e) {
            Throwable innerException = e.getCause();
            if (innerException instanceof RuntimeException) {
                innerException = innerException.getCause();
                if(innerException instanceof IOException){
                    throw (IOException) innerException;
                }
            }
            throw new AssertionError(e);
        }
    }

    public void throwException(String message) throws IOException,
            InterruptedException {
        Thread.sleep(100);
        System.out.println(Thread.currentThread() 
            + " throwing IOException");
        throw new IOException("Throw exception for " + message);
    }

    public static void main(String[] args) throws IOException{
        new ForkJoinTest().run();
    }
}
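The fail-fast handling above could be extended to collect every failure rather than stopping at the first; here is a hypothetical sketch (MultipleFailures is my own class name, not anything in the JDK):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;

public class CollectFailures {
    // Hypothetical custom exception carrying all underlying failures
    static class MultipleFailures extends Exception {
        final List<Throwable> causes;
        MultipleFailures(List<Throwable> causes) {
            super(causes.size() + " task(s) failed");
            this.causes = causes;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ForkJoinPool commonPool = ForkJoinPool.commonPool();
        Runnable failing1 = () -> { throw new RuntimeException("task 1 failed"); };
        Runnable failing2 = () -> { throw new RuntimeException("task 2 failed"); };
        Future<?> task1 = commonPool.submit(failing1);
        Future<?> task2 = commonPool.submit(failing2);

        List<Throwable> failures = new ArrayList<>();
        for (Future<?> task : Arrays.asList(task1, task2)) {
            try {
                task.get();                   // wait for each task in turn
            } catch (ExecutionException e) {
                failures.add(e.getCause());   // record, don't rethrow yet
            }
        }
        if (!failures.isEmpty()) {
            MultipleFailures all = new MultipleFailures(failures);
            System.out.println(all.getMessage());
        }
    }
}
```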

Thursday 22 January 2015

Book review: Mastering Lambdas: Java Programming in a Multicore World (Maurice Naftalin)

Java 8 with its introduction of lambda expressions is the biggest change to the Java language since it was created some 20 years ago. No escaping from it - if you want to stay relevant as a Java programmer you will need to know about lambda expressions and it is certainly worth the effort to really get to grips with the new semantics and paradigms that Java 8 delivers.

Naftalin's book is actually the third book I've read on the subject. The first two, whilst reasonably well written, didn't really deliver much more than I could have learnt by reading the numerous tutorials out there on the web. Any Java 8 tutorial will explain how to use lambdas in your code, and if that's all you want then you don't really want to buy a book at all. When I buy a book I expect a lot more than that. I expect a beginning, a middle and an end. And to this expectation Naftalin delivers beautifully. He explains the rationale and motivation behind lambda expressions. Are they just a sprinkling of syntactic sugar? No, he explains, they are much more than that. He sets the scene, as is hinted at in the title of his book, 'a multicore world', and explains how Java needs to adapt to continue to hold its own so that it will still be one of the most important languages for the next 20 years.

I've been using Java 8 for the last 6 months and am fairly well acquainted with the syntax, but when I read his description of the syntax I could only wish that I had had this book 6 months ago! Naftalin has clearly put a huge amount of effort and consideration into how he delivers his ideas, which is perhaps why it wasn't one of the first books on the market to greet Java 8 as it appeared in the early part of last year.

Some books go off on a tangent trying to introduce their readers to functional programming through lambdas. I think that is a mistake. I've been a Java programmer since the language emerged and have no wish whatsoever to change my programming style into a functional style. If I wanted to do that I would use a functional language like Erlang. I want to continue using Java but with some new concepts that can be delivered through lambdas. The declarative nature of lambdas is extremely nice, but we don't have to throw out the proverbial baby with the bath water and ditch everything that is good about OOP. I believe that Naftalin shares this opinion, as is evident from the way in which he introduces us to practical uses of lambdas in our code.

In conclusion, I would highly recommend this book to any Java developer who wants to get a real understanding of lambdas. The book is extremely readable, not particularly long (all the best books are short) but will tell you everything you will ever need to know about the subject. If you digest all the information in this short book you will be a far better programmer than someone who has just read a few tutorials on the web!

Java8 Lambdas: Sorting Performance Pitfall EXPLAINED

Written in collaboration with Peter Lawrey.

A few days ago I raised a serious problem with the performance of sorting using the new Java 8 declarative style - see my blog post here.  In that post I only pointed out the issue, but in this post I'm going to go a bit deeper into understanding and explaining its causes.  This will be done by reproducing the issue using the declarative style, then modifying the code bit by bit until we have removed the performance issue and are left with the performance we would expect using the old style compare.

To recap, we are sorting instances of this class:

private static class MyComparableInt {
        private int a, b, c, d;

        public MyComparableInt(int i) {
            a = i % 2;
            b = i % 10;
            c = i % 1000;
            d = i;
        }

        public int getA() { return a; }
        public int getB() { return b; }
        public int getC() { return c; }
        public int getD() { return d; }
}

Using the declarative Java 8 style (below) it took ~6s to sort 10m instances:


List<MyComparableInt> mySortedList = myComparableList.stream()
      .sorted(Comparator.comparing(MyComparableInt::getA)
                              .thenComparing(MyComparableInt::getB)
                              .thenComparing(MyComparableInt::getC)
                              .thenComparing(MyComparableInt::getD))
      .collect(Collectors.toList());

Using a custom sorter (below) took ~1.6s to sort 10m instances.
This is the code call to sort:

List<MyComparableInt> mySortedList = myComparableList.stream()
                    .sorted(MyComparableIntSorter.INSTANCE)
                    .collect(Collectors.toList());

Using this custom Comparator:

public enum MyComparableIntSorter 
    implements Comparator<MyComparableInt>{
        INSTANCE;

        @Override
        public int compare(MyComparableInt o1, MyComparableInt o2) {
            int comp = Integer.compare(o1.getA(), o2.getA());
            if(comp==0){
                comp = Integer.compare(o1.getB(), o2.getB());
                if(comp==0){
                    comp = Integer.compare(o1.getC(), o2.getC());
                    if(comp==0){
                        comp = Integer.compare(o1.getD(), o2.getD());
                    }
                }
            }
            return comp;
        }
 }

Let's create a comparing method in our class so we can analyse the code more closely. The reason for the comparing method is to allow us to easily swap implementations but leave the calling code the same.

In all cases this is how the comparing method will be called:

List<MyComparableInt> mySortedList = myComparableList.stream()
                    .sorted(comparing(
                            MyComparableInt::getA,
                            MyComparableInt::getB,
                            MyComparableInt::getC,
                            MyComparableInt::getD))
                    .collect(Collectors.toList());

The first implementation of comparing is pretty much a copy of the one in the JDK:

public static <T, U extends Comparable<? super U>> Comparator<T> 
   comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  Comparator.comparing(ke1).thenComparing(ke2)
                  .thenComparing(ke3).thenComparing(ke4);
    }

Not surprisingly this took ~6s to run through the test - but at least we have reproduced the problem and have a basis for moving forward.

Let's look at the flight recording for this test:



As can be seen there are two big issues:  
1. A performance issue in the lambda$comparing method 
2. Repeatedly calling Integer.valueOf (auto-boxing)


Let's try to deal with the first one, which is in the comparing method.  At first sight this seems strange because when you look at the code there's not much happening in that method.  One thing, however, that is going on here extensively is virtual table lookup, as the code finds the correct implementation of the function.  Virtual table lookups are needed when there are multiple possible implementations called from a single line of code. We can eliminate this source of latency with the following implementation of comparing. By expanding all of the uses of the Function interface, each line can only call one implementation and thus the method can be inlined.

public static <T, U extends Comparable<? super U>> Comparator<T> 
       comparing(
            Function<? super T, ? extends U> ke1,
            Function<? super T, ? extends U> ke2,
            Function<? super T, ? extends U> ke3,
            Function<? super T, ? extends U> ke4)
    {
        return  (c1, c2) -> {
            int comp = compare(ke1.apply(c1), ke1.apply(c2));
            if (comp == 0) {
                comp = compare(ke2.apply(c1), ke2.apply(c2));
                if (comp == 0) {
                    comp = compare(ke3.apply(c1), ke3.apply(c2));
                    if (comp == 0) {
                        comp = compare(ke4.apply(c1), ke4.apply(c2));
                    }
                }
            }
            return comp;
        };
    }

By unwinding the method the JIT should be able to inline the method lookup.
Indeed the time almost halves, to 3.5s. Let's look at the Flight Recording for this run:




When I first saw this I was very surprised because as yet we haven't made any changes to reduce the calls to Integer.valueOf, but that percentage has gone right down!  What has actually happened here is that, because of the changes we made to allow inlining, Integer.valueOf has been inlined and the time it takes is being blamed on the caller (lambda$comparing) which has inlined the callee (Integer.valueOf).  This is a common problem with profilers: they can be mistaken as to which method to blame, especially when inlining has taken place.

But we know that in the previous Flight Recording Integer.valueOf was highlighted, so let's remove the boxing with this implementation of comparing and see if we can reduce the time further.

return  (c1, c2) -> {
    int comp = compare(ke1.applyAsInt(c1), ke1.applyAsInt(c2));
    if (comp == 0) {
        comp = compare(ke2.applyAsInt(c1), ke2.applyAsInt(c2));
        if (comp == 0) {
           comp = compare(ke3.applyAsInt(c1), ke3.applyAsInt(c2));
           if (comp == 0) {
             comp = compare(ke4.applyAsInt(c1), ke4.applyAsInt(c2));
           }
         }
    }
    return comp;
};

With this implementation the time goes right down to 1.6s which is what we could achieve with the custom Comparator.
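For the snippet above to compile, the method signature presumably changes from Function to ToIntFunction (so that applyAsInt exists and nothing is boxed), with the bare compare calls becoming Integer.compare. A sketch, shown with two keys rather than four for brevity:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToIntFunction;

public class IntComparing {
    // Primitive-specialised comparing: applyAsInt returns int, so no boxing
    public static <T> Comparator<T> comparing(
            ToIntFunction<? super T> ke1,
            ToIntFunction<? super T> ke2) {
        return (c1, c2) -> {
            int comp = Integer.compare(ke1.applyAsInt(c1), ke1.applyAsInt(c2));
            if (comp == 0) {
                comp = Integer.compare(ke2.applyAsInt(c1), ke2.applyAsInt(c2));
            }
            return comp;
        };
    }

    public static void main(String[] args) {
        // Sort int pairs by first element, then second
        List<int[]> pairs = Arrays.asList(
                new int[]{1, 2}, new int[]{0, 9}, new int[]{1, 1});
        pairs.sort(comparing(p -> p[0], p -> p[1]));
        pairs.forEach(p -> System.out.println(p[0] + "," + p[1]));
        // prints 0,9 then 1,1 then 1,2
    }
}
```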

Let's again look at the flight recording for this run:



All the time is now going in the actual sort methods and not in overhead.

In conclusion we have learnt a couple of interesting things from this investigation:

  1. Using the new Java 8 declarative sort can in some cases be up to 4x slower than writing a custom Comparator, because of the cost of auto-boxing and virtual table lookups.
  2. Flight Recorder, whilst better than other profilers (see my first blog post on this issue), will still attribute time to the wrong methods, especially when inlining is taking place.
[Update] In the comments I was asked by Nitsan to include the source code and to execute the tests as a JMH benchmark.  I think that these are excellent improvements to the post. Here's the source code. The results of the JMH tests are below:

Benchmark                                         Mode  Cnt     Score    Error  Units
CompTestBenchmark.bmCustomComparator             thrpt   20  2598.617 ± 67.634  ops/s
CompTestBenchmark.bmJDKComparator                thrpt   20   751.110 ± 14.835  ops/s
CompTestBenchmark.bmNoVTLComparator              thrpt   20  1348.750 ± 30.382  ops/s
CompTestBenchmark.bmNoVTLOrAutoBoxingComparator  thrpt   20  2202.995 ± 43.673  ops/s

The JMH tests were carried out on a list of 1000 items and confirm the results I saw when I ran the test on 10m items.

Aleksey Shipilev made a few comments (see comments section) and amended the JMH benchmark. It's definitely worth checking out his code changes.

Tuesday 20 January 2015

Java8 Lambdas Tip - Collect SortedGroupingBy

Java 8 introduces a great new feature which allows your code to easily decompose a List of objects into a Map of Lists of objects, keyed on a particular attribute.

This is best shown by example.
Let's say you have a list of Books as below:


public class Book{
    String author;
    Date published;
    int copiesSold;
    String category;

    public String getAuthor() {
        return author;
    }

    public Date getPublished() {
        return published;
    }

    public int getCopiesSold() {
        return copiesSold;
    }

    public String getCategory() {
        return category;
    }
}

To group them into a map of authors to books used to be a little bit painful but is now a one-liner!

Map<String, List<Book>> authorsToBooks = books
       .stream()
       .collect(Collectors.groupingBy(Book::getAuthor));

The only problem you might have with this is that the default Map implementation returned is a HashMap, which of course is unordered, and you might well want to order by the key - in this example, by the author.  Of course, you could always sort the Map in a second step, but there's a way to do it in one pass.

To fix that let's introduce this static utility function:

public static <T, K extends Comparable<K>> Collector<T, ?, TreeMap<K, List<T>>> 
   sortedGroupingBy(Function<T, K> function) {
        return Collectors.groupingBy(function, 
           TreeMap::new, Collectors.toList());
}

We can call it like this:

Map<String, List<Book>> authorsToBooks = books
       .stream()
       .collect(sortedGroupingBy(Book::getAuthor));

This time the Map implementation is a TreeMap, which returns the Map of authors to their books in alphabetical order.
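If you need an ordering other than the keys' natural one, a variant taking a Comparator is a small extension of the same idea - a sketch built on the three-argument Collectors.groupingBy overload (the names sortedGroupingBy and byFirstLetter are mine):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collector;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SortedGrouping {
    // Like sortedGroupingBy, but the key order is supplied by the caller
    public static <T, K> Collector<T, ?, TreeMap<K, List<T>>>
            sortedGroupingBy(Function<T, K> classifier, Comparator<? super K> order) {
        return Collectors.groupingBy(classifier,
                () -> new TreeMap<>(order), Collectors.toList());
    }

    public static void main(String[] args) {
        // Group sample words by first letter, keys in reverse alphabetical order
        Map<String, List<String>> byFirstLetter =
                Stream.of("apple", "avocado", "banana", "cherry")
                      .collect(sortedGroupingBy((String s) -> s.substring(0, 1),
                              Comparator.<String>reverseOrder()));
        System.out.println(byFirstLetter.keySet()); // [c, b, a]
    }
}
```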

Monday 19 January 2015

Maven tip: Release Without Deployment

This is how you do a Maven release, as simple as this:

mvn release:prepare
mvn release:perform

This will move you up a version in your code. Say, for example, your last released version is 1.2 (meaning the pom.xml on HEAD is on 1.3-SNAPSHOT); by default your release version will be 1.3 and the version in the pom.xml on HEAD will be moved to 1.4-SNAPSHOT.

A nice feature is that if you use the flag -dryRun=true on either of the above commands you can go through all the steps without any of the files being checked in. It gives you an opportunity to examine the changes to the pom.xml files before you go through with the commit.

To facilitate the release procedure you need to have these lines configured in your highest level pom.xml file.

<scm>
  <connection>scm:git:git@github.com:YourRepository.git</connection>
  <url>scm:git:git@github.com:YourRepository.git</url>
</scm>

By default the code will be built and deployed to a server.  Typically you will want to deploy to Maven Central or to a local Nexus server.  This is done using the releaseProfiles property of the maven-release-plugin.

But this will not always be the case.  If you just want to do a release of the code, to Git for example, and not deploy the artifacts anywhere, you need to override the maven-release-plugin's goals property, which is set to deploy by default, and set it to install as below:

<plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-release-plugin</artifactId>
         <version>2.5.1</version>
         <configuration>
              <goals>install</goals>
          </configuration>
</plugin>



Saturday 17 January 2015

Maven Tip: Adding a Custom Jar to your Maven Project

Sometimes you come across a jar which you want to include in your Maven project but is not in Maven Central.  This happened to me recently with a great web scraping library called jaunt which I wanted to include in my Maven java project.

One way (maybe the best way) is to add the jar to your local Maven repository, which you can do with this command:


mvn install:install-file -Dfile=jaunt0.9.9.4.jar -DgroupId=com.jaunt.code -DartifactId=jaunt -Dversion=0.9.9.4 -Dpackaging=jar

Then add to your dependencies in Maven as below:

 <dependency>
            <groupId>com.jaunt.code</groupId>
            <artifactId>jaunt</artifactId>
            <version>0.9.9.4</version>
  </dependency>

The issue with this approach is if you want to share your project with people who don't have access to your local Maven repository.  For example, if you want to put your project on Git for general release.

Of course you could always check the jar itself into source control, but that's not always possible or necessarily desirable. If you have any good solutions please leave them in the comments.

Friday 16 January 2015

Java Flight Recorder (JFR)

JFR is a Java profiler which allows you to investigate the runtime characteristics of your code. Typically you will use a profiler to determine which parts of your code are causing large amounts of memory allocation or consuming excess CPU.

There are plenty of products out there.  In the past I've used YourKit, OptimizeIt, JProfiler, NetBeans and others. Each has its benefits and it is largely a matter of personal preference as to which you choose. My current favourite is YourKit. It integrates nicely into IntelliJ, has a relatively low overhead and presents its reports well.

The truth is that profiling is a very inexact science and it is often worth looking at more than one profiler to build up a clearer picture of what exactly is going on in your program. To my knowledge most profilers rely on the JVMPI/JVMTI agents to probe the Java program. A major problem with this is safe points: your Java program can only be probed when it is at a safe point, so you can get a false picture of what is really going on, especially if much of the activity happens between safe points. Also, all profilers add overhead to a varying degree.  Profiler overhead will change the characteristics of your program and may cause misleading results from your analysis. Much more information here.

Enter JFR.  JFR has been bundled with the JDK since release 7u40. It is built with direct access to the JVM, which not only means very low overhead (claimed to be less than 1% in nearly all cases) but also means it does not rely on safe points.  Have a look here at an example of how radically different an analysis from YourKit and JFR can look.

To run JFR you need to add these switches to your Java command line:

-XX:+UnlockCommercialFeatures -XX:+FlightRecorder

JFR is accessed through Java Mission Control (JMC).  To launch JMC just type jmc on your command line; if you have the JDK in your path the JMC console will launch.  You should see your Java program in the left-hand pane.  Right-click on your program and then start a flight recording.


You will be presented with a dialog box where you can just accept the defaults (sample for a minute) and then your results will be displayed.  It's worth playing around with the options to find out how this will work best for you.  As with all good products this GUI is fairly intuitive.

As you can tell from the command line switches, it is a commercial feature.  I'm not exactly sure what that means but you can read more about that in the documentation here. Also, you can run this from the command line; it's all in the documentation.

One problem I did find was when I downloaded the latest Java8 snapshot (at this time 1.8.0_40-ea) I was unable to launch my program and got the following message:

/Library/Java/JavaVirtualMachines/jdk1.8.0_40.jdk/Contents/Home/bin/
Error: Trying to use 'UnlockCommercialFeatures', but commercial features are not available in this VM.
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

In summary, JFR is a great addition to any developer's toolkit, and as long as you are using JDK release 7u40 or above it's certainly worth trying it out on your code.

(I encourage you to have a look at a previous post First rule of performance optimisation in conjunction with JFR)

Tuesday 13 January 2015

What's Stopping Me Using Java8 Lambdas - Try Debugging Them

I'm a great fan of Java 8 features!  The declarative code style whilst possibly expensive in some cases is a massive leap forward for Java.

But - I'm still hesitant to use Java8 lambdas because it's so difficult to debug them!  See the screenshot below, which is currently at the breakpoint on line 18:


Notice how you only have visibility of the variables inside the lambda.  It is not even possible to view the contents of the List 'letters' over which you are iterating.

Clearly in this contrived example it makes little difference, but when you are debugging a complex system this becomes a serious pain in the neck.  Time and again I've found myself replacing lambda constructs with old style loops just so that I can see what's going on in variables outside the scope of the lambda.

I'm not sure if this problem is confined to IntelliJ or whether it exists in Eclipse, Netbeans etc [EDIT from the response from IntelliJ it seems to be a Java issue not an IntelliJ one] but IntelliJ is so great and I'm certainly not moving away from it in the near future.


I've raised this with the guys at IntelliJ and they have promised that someone will be looking into this problem.  I'll be updating this post with any progress.

[EDIT] UPDATE FROM INTELLIJ

Unfortunately, it is not something we can do on our own; we're waiting for the JDK to support lambdas in the debugger.
Here's the corresponding request in the JDK tracker: https://bugs.openjdk.java.net/browse/JDK-8054220.
You are welcome to vote for it or star it to receive notifications about future changes.

Monday 12 January 2015

Java8 Sorting - Performance Pitfall

Java 8 brings all the goodness of lambdas to enable us to program using a declarative style. But is it really free? And should we be concerned about the price we have to pay for the new programming goodies?

Here's an example where we might have to worry.

Consider sorting instances of this simple class:
private static class MyComparableInt{
        private int a,b,c,d;

        public MyComparableInt(int i) {
            a = i%2;
            b = i%10;
            c = i%1000;
            d = i;
        }

        public int getA() { return a; }
        public int getB() { return b; }
        public int getC() { return c; }
        public int getD() { return d; }
}
We can sort using the new Java 8 declarative syntax as below:

List<MyComparableInt> mySortedComparableList = 
      myComparableList.stream()
       .sorted(Comparator.comparing(
         MyComparableInt::getA).thenComparing(
         MyComparableInt::getB).thenComparing(
         MyComparableInt::getC).thenComparing(
         MyComparableInt::getD))
       .collect(Collectors.toList());

Or we can sort the old way (still using lambdas) using this code:

List<MyComparableInt> mySortedComparableList = 
         myComparableList.stream()
           .sorted(MyComparableIntSorter.INSTANCE)
           .collect(Collectors.toList());

public enum MyComparableIntSorter implements 
     Comparator<MyComparableInt>{
        INSTANCE;

        @Override
        public int compare(MyComparableInt o1, MyComparableInt o2) {
            int comp = Integer.compare(o1.getA(), o2.getA());
            if(comp==0){
                comp = Integer.compare(o1.getB(), o2.getB());
                if(comp==0){
                    comp = Integer.compare(o1.getC(), o2.getC());
                    if(comp==0){
                        comp = Integer.compare(o1.getD(), o2.getD());
                    }
                }
            }
            return comp;
        }
 }
When I ran this test with 10m objects the sort took ~6.5s using the declarative syntax but only 1.5s using the old syntax.  That's almost 4 times as long!  So where is the time going? Presumably it's in the overhead of marshalling between the thenComparing methods.

Interestingly, if you try exactly the same test but replace int with String, the times change as follows.  Again for 10m objects, the new syntax takes ~11.5s whilst the old syntax takes ~7s.  Using String, where the marshalling is less significant, the new syntax takes only 1.5 times as long as the old syntax.

When I ran this in YourKit I ended up with a not so useful CPU analysis:

When I profiled this in Flight Recorder you can see a more useful analysis (because it doesn't rely as much on safe points).

The time is spent firstly in calls to virtual methods through the lambdas and secondly converting ints to Integers.

In my follow up post I will be explaining, using a step by step optimisation, how the code matches the analysis in Flight Recorder.

In summary, whilst the new syntax looks great and is wonderfully expressive, if you are worried about performance you should stick to the old syntax.  Once again it seems that there is no such thing as a free lunch!
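One middle ground worth noting: the JDK also ships primitive-specialised comparators, Comparator.comparingInt and thenComparingInt, which keep the declarative style while avoiding the auto-boxing (though not the virtual dispatch). A minimal sketch using a stand-in class P rather than MyComparableInt:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ComparingIntDemo {
    // Stand-in for MyComparableInt, with two int keys for brevity
    static class P {
        final int a, d;
        P(int a, int d) { this.a = a; this.d = d; }
        int getA() { return a; }
        int getD() { return d; }
    }

    public static void main(String[] args) {
        // comparingInt takes a ToIntFunction, so getA/getD are never boxed
        List<P> sorted = Arrays.asList(new P(1, 5), new P(0, 7), new P(1, 2))
                .stream()
                .sorted(Comparator.comparingInt(P::getA)
                        .thenComparingInt(P::getD))
                .collect(Collectors.toList());
        sorted.forEach(p -> System.out.println(p.getA() + "," + p.getD()));
        // prints 0,7 then 1,2 then 1,5
    }
}
```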


Sunday 11 January 2015

My favourite Java puzzler 2 + 1 = 4 !!

Here's my current favourite Java puzzler.

How can you get your code to do this?
Integer b = 2;
Integer c = 1;

System.out.println("b+c : " + (b+c) ); // output: 'b+c : 4' !!

There are no tricks with System.out.println(), i.e. you would see the same value in a debugger.
Clue: You need to add a few lines of code somewhere before this in your program.

Scroll down for the solution.
.

.

.

.

.

.

.

.

.

.

.

public static void main(String... args)throws Exception{
        Integer a = 2;

        Field valField = a.getClass().getDeclaredField("value");
        valField.setAccessible(true);
        valField.setInt(a, 3);

        Integer b = 2;
        Integer c = 1;

        System.out.println("b+c : " + (b+c) ); // b+c : 4
}
As you can see (and I would encourage you to go to the source code for Integer) there is a static cache (look for the inner class IntegerCache) in which the Integer objects for small values are stored and reused. The cache covers all values from -128 to 127, although you can raise the upper bound using the property java.lang.Integer.IntegerCache.high.
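The cache is also the reason behind the classic == surprise with boxed Integers (with the default cache settings):

```java
public class CacheDemo {
    public static void main(String[] args) {
        Integer a = 127, b = 127;
        System.out.println(a == b);  // true: both references come from the cache
        Integer c = 128, d = 128;
        System.out.println(c == d);  // false: outside the default cache range
    }
}
```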

Friday 9 January 2015

First rule of performance optimisation

Let's start with a system with no obvious performance bottlenecks.  By that I mean that there are no glaring algorithmic problems which are grinding your system to a halt.  e.g. a tight loop which is reading a property from a file without caching the result.

You want your system to run as fast as possible, so where do you start?  Most profilers (e.g. my current favourite, YourKit) have modules for memory tracing and CPU tracing.  Since the aim of the exercise is for your program to run faster, you start by looking at the CPU? - Wrong!  The first place to start is the memory, and in particular object allocation.

What you should always try to do in the first instance is reduce your object allocation as much as possible.  The reason this is not intuitive (at least it wasn't to me) is that we know object allocation is fast - in fact it's super fast, even compared to languages like C. (There's lots of discussion on the web about exactly when and in what circumstances it can be faster - but it's undeniably fast.)  So, if object allocation is so fast, why is it slowing my program down and why should I start by minimising it?
  1. It puts extra pressure on the garbage collector.  Having more objects in the system (especially if they are not short lived) will give your garbage collector more work and slow down the system that way.
  2. It fills up your CPU caches with garbage, forcing them to flush, so the CPU has to keep going further out, to L2 and L3 cache and then to main memory, to retrieve the data.  Roughly speaking, each extra level from which data has to be fetched costs an order of magnitude more time (see graphic below).  So even if object allocation is fast, it causes cache misses and thus many wasted CPU cycles which will slow your program down.
  3. Do the easy things first. It's far easier in general to reduce allocation (by caching etc) than it is to fix algorithms when looking at a CPU performance trace.  Changing the allocations may completely change the performance characteristics of your program, and it may be that any changes to algorithms carried out prior to that will have been a waste of time.
  4. Profilers lie (this is a must watch video).  It's really hard to know, when looking at CPU traces where exactly the bottlenecks lie.  Profilers however do not lie about the allocations.
  5. High object allocation is often a bad smell in the code.  Looking for excess object allocation will lead you to algorithmic issues.
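As a tiny illustration of reducing allocation without changing the result, here is a sketch (not a benchmark) of hoisting an object out of a hot loop:

```java
public class AllocationDemo {
    public static void main(String[] args) {
        int n = 100_000;

        // Allocating version: a new StringBuilder (and String) per iteration
        int len1 = 0;
        for (int i = 0; i < n; i++) {
            String s = new StringBuilder().append("row-").append(i).toString();
            len1 += s.length();
        }

        // Reusing version: one StringBuilder, reset with setLength(0)
        StringBuilder sb = new StringBuilder();
        int len2 = 0;
        for (int i = 0; i < n; i++) {
            sb.setLength(0);
            sb.append("row-").append(i);
            len2 += sb.length();
        }

        System.out.println(len1 == len2); // same result, far fewer allocations
    }
}
```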