Monday 31 August 2015

An Unexpected Allocation - JIT Compilation Jitter

Whilst working on ByteWatcher (see my last post) I came across something rather strange.

This is the actual code snippet used to find out how much allocation there has been on a particular thread:

return (long) mBeanServer.invoke(
  name,
  GET_THREAD_ALLOCATED_BYTES,
  PARAMS,
  SIGNATURE
);

For the full context see here.
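To make the snippet a little more concrete, here is a rough sketch of how those constants might be wired up, assuming the standard getThreadAllocatedBytes operation exposed via the java.lang:type=Threading MBean (the class and field names below are mine, not necessarily ByteWatcher's - see the linked source for the real thing):

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ThreadAllocationReader {
  // operation provided by the com.sun.management extension of ThreadMXBean
  private static final String GET_THREAD_ALLOCATED_BYTES = "getThreadAllocatedBytes";
  private static final String[] SIGNATURE = {long.class.getName()}; // takes a thread id
  private final MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
  private final ObjectName name;
  private final Object[] PARAMS;

  public ThreadAllocationReader(Thread thread) throws Exception {
    // the ThreadMXBean is registered under java.lang:type=Threading
    name = new ObjectName(ManagementFactory.THREAD_MXBEAN_NAME);
    PARAMS = new Object[]{thread.getId()};
  }

  public long threadAllocatedBytes() throws Exception {
    return (long) mBeanServer.invoke(name, GET_THREAD_ALLOCATED_BYTES, PARAMS, SIGNATURE);
  }
}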

(The way ByteWatcher works is to call this method periodically to monitor allocation.)

One important point to note, especially if you want to provide an accurate number for a program's allocation, is that calling the above code itself causes allocation.

The allocation caused by this call must be deducted from the number returned so that we isolate the allocation caused by the program.

i.e. call to mBeanServer = program thread allocation + call overhead
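Or, expressed as rough code, using the method names that appear in the test further down:

// the difference between two readings includes one call's worth of overhead,
// so deduct it to isolate the program's own allocation
long before = am.threadAllocatedBytes();
// ... program code under test ...
long after = am.threadAllocatedBytes();
long programAllocation = (after - before) - am.getMeasuringCostInBytes();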

What I noticed was that this amount of allocation was usually 336 bytes.

However, when I called this method in a loop I discovered something interesting: every so often it would allocate a different amount.

For this test:

@Test
public void testQuietMeasuringThreadAllocatedBytes() {
  ByteWatcherSingleThread am = new ByteWatcherSingleThread();
  System.out.println("MeasuringCostInBytes = " + am.getMeasuringCostInBytes());
  // record the allocation counter on every iteration
  long[] marks = new long[1_000_000];
  for (int i = 0; i < 1_000_000; i++) {
    marks[i] = am.threadAllocatedBytes();
  }

  // report every iteration at which the per-call allocation changed
  long prevDiff = -1;
  for (int i = 1; i < 1_000_000; i++) {
    long diff = marks[i] - marks[i - 1];
    if (prevDiff != diff)
      System.out.println("Allocation changed at iteration " + i + "->" + diff);
    prevDiff = diff;
  }
}

This was a typical result:

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->28184
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336
Allocation changed at iteration 1686->600
Allocation changed at iteration 1687->336
Allocation changed at iteration 2765->672
Allocation changed at iteration 2766->336
Allocation changed at iteration 5458->496
Allocation changed at iteration 5459->336
Allocation changed at iteration 6213->656
Allocation changed at iteration 6214->336
Allocation changed at iteration 6535->432
Allocation changed at iteration 6536->336
Allocation changed at iteration 6557->8536
Allocation changed at iteration 6558->336
Allocation changed at iteration 7628->576
Allocation changed at iteration 7629->336
Allocation changed at iteration 8656->4432
Allocation changed at iteration 8657->336
Allocation changed at iteration 9698->968
Allocation changed at iteration 9699->336
Allocation changed at iteration 11881->1592
Allocation changed at iteration 11882->336
Allocation changed at iteration 12796->1552
Allocation changed at iteration 12797->336
Allocation changed at iteration 13382->456
Allocation changed at iteration 13383->336
Allocation changed at iteration 14844->608
Allocation changed at iteration 14845->336
Allocation changed at iteration 36685->304
Allocation changed at iteration 52522->336
Allocation changed at iteration 101440->400
Allocation changed at iteration 101441->336

Given that there was definitely no allocation in the program itself, it was a mystery to me why the same call was sometimes allocating different amounts.

In summary, over 1,000,000 runs, the program allocated different amounts about 25 times. Notably, after about 100k iterations the spikes stopped.

I shared this problem with Heinz Kabutz and Chris Newland. Chris noticed that the allocations were down to JIT compilation jitter.

This can be seen very clearly by re-running the test with the flag -Xint (run in interpreted mode only, i.e. no JIT compilation). Now there were only 2 spikes:

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->28184
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336

Similarly, running with the flag -Xcomp (compile only):

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->29696
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336

So now we can be pretty confident that it is JIT compilation jitter which is causing the rogue allocations.

I can't pinpoint the exact source of the allocation, but it seems understandable that the JVM might allocate something while JIT compilation activity is going on.

To compensate for this I introduced a calibration phase in the constructor of ByteWatcher which was further refined by Heinz.

You can see the calibration code here, but it consists of a couple of phases (a rough sketch follows below):

  1. Call the method that works out how much the thread has allocated in a tight loop (we call it 100,000 times) - this allows the JIT to properly warm up the code so that it is all compiled
  2. Wait 50 milliseconds - this gives the JVM a chance to finish its compilation jitter
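Roughly speaking, the calibration boils down to something like this (a sketch only - the real, refined code is in the linked source):

private void calibrate() throws InterruptedException {
  // phase 1: call the measuring method in a tight loop so the JIT compiles that code path
  for (int i = 0; i < 100_000; i++) {
    threadAllocatedBytes();
  }
  // phase 2: give the JVM a moment to finish any outstanding compilation activity
  Thread.sleep(50);
}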
With this code in the constructor, even running with no special flags there are no allocation spikes.

Conclusion:
  • JIT compilation jitter causes some allocation
  • Running the program without compilation jitter (i.e. only interpreted or only compiled) greatly reduces that allocation but does not quite eliminate it.
  • After about 100k runs the allocation spikes stop, which indicates that it takes around 100k runs for the jitter to settle. This is interesting because we know that the code should compile after 10k iterations.



Friday 14 August 2015

ByteWatcher: A New Regression Testing Tool For Allocations

This post is to introduce ByteWatcher, a tool I've written in collaboration with Heinz Kabutz (the Java Specialist).

Regular readers will recognise it as an evolution of AllocationMonitor, which I described in this post. For the history of the project and an explanation of some of the concepts, I refer you to that post.

So what is ByteWatcher?

In a sentence, ByteWatcher monitors your allocation and alerts you when a specified allocation limit is breached.

Regression testing is important.  I don't think that there are many projects these days that are not filled to the brim with unit tests. Often the amount of code in the test branch of a project will far exceed the code in the main branch.  

Principally we test:

  • That functionality has not been broken.
  • For certain programs, that the code has not slowed down.
  • For other programs, that the size of the distribution has not exceeded a given amount.

But there is another attribute that you might want to test which is your program's allocation.

I've written a few posts on the problems and effects of allocation; see, for example: First rule of performance optimisation, Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis and A Case Study in Analysing Latency: Lock vs Synchronized.

For those programs where real-time (and therefore predictable latencies up to the high percentiles) is critical, allocation can be disastrous. The reason for this is that your allocations will, at some point, have to be garbage collected.  A GC will cause a pause that will kill your predictable latencies. (There are other reasons to avoid allocation and I refer you to the links above for those).

With ByteWatcher you can add a couple of lines to your regression tests that check whether a specified allocation limit has been exceeded. The documentation for getting the code, along with example usage, is all in the README of the project, but the basic structure is shown below.
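(The embedded example hasn't survived here, so the snippet below is only an illustration of the shape of such a test - runCodeUnderTest() and the assertion style are mine, not the actual API; the README has the real usage.)

import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class AllocationRegressionTest {
  @Test
  public void shouldNotAllocateMoreThanLimit() {
    ByteWatcherSingleThread watcher = new ByteWatcherSingleThread();
    long before = watcher.threadAllocatedBytes();

    runCodeUnderTest(); // hypothetical stand-in for the code being regression tested

    long allocated = watcher.threadAllocatedBytes() - before
        - watcher.getMeasuringCostInBytes();
    assertTrue("Allocated " + allocated + " bytes", allocated <= 1024);
  }

  private void runCodeUnderTest() {
    // the production code whose allocation you want to keep in check
  }
}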



Hope you find ByteWatcher useful and I welcome any feedback or contributions to the project.

Also thank you to Chris Newland (Mr JITWatch) for your help on some of the trickier allocation spikes we were seeing. I'll leave those for another post.

Tuesday 11 August 2015

A Case Study in Analysing Latency: Lock vs Synchronized

In particular in this post we will discuss:
  • Garbage created by java.util.concurrent.locks.Lock
  • Comparing Lock to synchronized
  • How to measure latency programmatically
  • The impact of contention on Lock and synchronized
  • The impact of co-ordinated omission on latency tests
Back to one of my favourite topics: garbage creation / allocation. See my previous posts (e.g. First rule of performance optimisation and Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis) for more details on this subject, especially why allocation is such a critical factor in understanding performance issues.

Something I came across a couple of days ago, while I was trying to diagnose some strange effects of allocation during JIT compilation, was that java.util.concurrent.locks.ReentrantLock allocates, but only when under contention. (This can easily be proved by running a test program, like the one below, that creates contention on Lock with -verbose:gc.)
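A stripped-down version of such a program might look like this (illustrative only - several threads hammering one ReentrantLock to create contention; with a single thread the GC lines disappear):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class ContendedLockAllocation {
  private static final Lock lock = new ReentrantLock();
  private static long counter = 0;

  public static void main(String[] args) throws InterruptedException {
    int threads = 4; // 1 => no contention => no allocation from the lock
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      workers[i] = new Thread(() -> {
        for (int j = 0; j < 100_000_000; j++) {
          lock.lock();
          try {
            counter++;
          } finally {
            lock.unlock();
          }
        }
      });
      workers[i].start();
    }
    for (Thread worker : workers) worker.join();
    System.out.println(counter);
  }
}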

Sample gc output for contended Lock below:
[GC (Allocation Failure)  16384K->1400K(62976K), 0.0016854 secs]
[GC (Allocation Failure)  17784K->1072K(62976K), 0.0011939 secs]
[GC (Allocation Failure)  17456K->1040K(62976K), 0.0008452 secs]
[GC (Allocation Failure)  17424K->1104K(62976K), 0.0008338 secs]
[GC (Allocation Failure)  17488K->1056K(61952K), 0.0008799 secs]
[GC (Allocation Failure)  17440K->1024K(61952K), 0.0010529 secs]
[GC (Allocation Failure)  17408K->1161K(61952K), 0.0012381 secs]
[GC (Allocation Failure)  17545K->1097K(61440K), 0.0004592 secs]
[GC (Allocation Failure)  16969K->1129K(61952K), 0.0004500 secs]
[GC (Allocation Failure)  17001K->1129K(61952K), 0.0003857 secs]

I wondered whether the garbage collections necessary to clean up these allocations would mean that Lock, in a highly contended environment, would be a worse choice for synchronisation than using the in-built 'synchronized'. 

Of course the question is more academic than anything else. If you really did care that much about latency, you would never (or certainly should never) find yourself in a situation where so much thread locking would be necessary.  Nevertheless stay with me because the process and results are interesting.

A bit of history: Locks were introduced into Java in version 1.5 in 2004. Locks, together with the rest of the concurrent utilities, were desperately needed to simplify concurrency constructs. Up to that point you had to deal with the built-in synchronized and wait()/notify() on Object.

ReentrantLock offers much functionality over and above synchronized; to name but a few features (sketched briefly after this list):
  • Being unstructured - i.e. you are not limited to using it in a single block or method; it allows you to hold the lock across several methods.
  • Lock polling
  • Time out waiting for the lock
  • Configurable fairness policy
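A quick sketch of what those features look like in code (illustrative only):

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockFeatures {
  // configurable fairness policy: a fair lock favours the longest-waiting thread
  private final ReentrantLock lock = new ReentrantLock(true);

  // unstructured: acquire the lock in one method...
  public void begin() {
    lock.lock();
  }

  // ...and release it in another
  public void end() {
    lock.unlock();
  }

  // lock polling: try to acquire without blocking
  public boolean tryOnce() {
    return lock.tryLock();
  }

  // time out waiting for the lock
  public boolean tryFor(long millis) throws InterruptedException {
    return lock.tryLock(millis, TimeUnit.MILLISECONDS);
  }
}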
But how do they perform in a latency test?

I wrote a simple test below to compare the performance of Lock against synchronized. 
  • The code allows you to vary the number of threads (1 thread means that there is no contention) and thus adjust the amount of contention.
  • To measure with and without coordinated omission (see previous blog Effects of Coordinated Omission)
  • To test either Lock or synchronized.
  • To record the results I used a Histogram class created by Peter Lawrey; you can find it as a utility in Chronicle-Core over here.
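The full test (with the Histogram class and the co-ordinated omission switch) is in the linked code; the skeleton below is only a rough, single-threaded sketch of the measuring idea, with made-up iteration counts and a crude percentile calculation standing in for the real Histogram:

import java.util.Arrays;
import java.util.concurrent.locks.ReentrantLock;

public class LockLatencySketch {
  private static final ReentrantLock LOCK = new ReentrantLock();
  private static final Object MONITOR = new Object();
  private static long counter = 0;

  public static void main(String[] args) {
    boolean useLock = true;        // switch between Lock and synchronized
    int iterations = 1_000_000;    // the real test used 200,000,000 with a 10,000 warm-up
    long[] samples = new long[iterations];

    for (int i = 0; i < iterations; i++) {
      long start = System.nanoTime();
      if (useLock) {
        LOCK.lock();
        try {
          counter++;
        } finally {
          LOCK.unlock();
        }
      } else {
        synchronized (MONITOR) {
          counter++;
        }
      }
      samples[i] = System.nanoTime() - start; // time for one lock/unlock round trip
    }

    Arrays.sort(samples);
    System.out.printf("50%%=%dns 99%%=%dns 99.99%%=%dns worst=%dns%n",
        samples[iterations / 2],
        samples[(int) (iterations * 0.99)],
        samples[(int) (iterations * 0.9999)],
        samples[iterations - 1]);
  }
}

The real test additionally runs several of these loops in parallel to create contention, and can replay the intended start times to adjust for co-ordinated omission.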


So here are the results:

These are the results where co-ordinated omission was ignored:


  • The times are measured in microseconds. 
  • The latency distribution is across the top of the graph. 
  • Contention in this test meant running the program with 4 threads.
  • The tests were run on an MBP i7 with 8 logical CPUs. 
  • Each test comprised 200,000,000 iterations with a 10,000 iteration warm-up.
  • Throughput when adjusting for co-ordinated omission was 1 iteration/microsecond. 



As expected, without contention the results are pretty much the same. The JIT will have optimised away the Lock and synchronized.

With contention, using Lock was marginally faster in the lower percentiles, but again there was really not much in it. So even though there were many minor garbage collections, they don't seem to have had a noticeable effect in slowing down the Lock. If anything, Lock was slightly faster overall.

These are the results adjusted for co-ordinated omission.


The numbers are of course higher, as they allow for the true latency incurred.

Again with no contention the lock and synchronized perform the same - no great surprises there.

With contention, up to the 99th percentile we now see synchronized outperforming Lock by 10X. After that the times were pretty much the same.

I could speculate that the GC pauses, which are between 300 and 1200 microseconds, are the cause of the relative slowness of Lock compared to synchronized, especially as the slowdown is apparent only up to the 99th percentile - after this the latencies are probably down to hardware and the OS. However, that would be just speculation on my part without further investigation.

Conclusion:

The takeaway from this post is more about the process involved in measuring and analysing latencies. It is interesting that Lock allocates when contended, but this is unlikely to make any practical difference in the real world.