Monday, 21 September 2015

Chronicle-Bytes: New Feature Prepending Bytes

Chronicle-Bytes is an Open Source project under the Chronicle group of technologies.

It has got some really interesting features and I would definitely recommend it if:
  • you find yourself at all frustrated with java.nio.ByteBuffer 
  • you are in a low latency environment and want to avoid allocation
Some of the extensions over java.nio.ByteBuffer are explained in the README for the project.


If you are new to Chronicle-Bytes and want to see a worked example comparing Chronicle-Bytes with java NIO ByteBuffer see this article.

Chronicle-Bytes just introduced a new feature - 'prepending' bytes.

Chronicle-Bytes is one of the key building blocks we used in creating Chronicle-FIX; you can read more about it and the performance we achieved here. Prepending is one of the features we added to Chronicle-Bytes whilst building Chronicle-FIX, and it helped us hit the numbers quoted in that article.

Let's say you have the following scenario (you find this sort of construct in FIX messages):
  • You need to write a variable length message into a buffer. 
  • Preceding the message you need to write the message length to the buffer. 
  • The length must be written in text.
e.g.  11 hello world or 5 hello

The problem here is that until you've written the message into the buffer you can't know how long it is, so you don't know how much space will be required to write out the length which must precede the message. E.g. if the message is less than 10 bytes, the space required for the length will be 1 byte; if the message length is from 10-99 bytes, the space required will be 2 bytes, etc.
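The digit count drives the space calculation. A minimal, self-contained sketch of that calculation (plain Java, not part of Chronicle-Bytes; the class and method names are my own):

```java
// How many bytes a textual length prefix needs: one per decimal digit.
public class LengthDigits {
    static int digits(int n) {
        int count = 1;
        while (n > 9) {     // each extra power of ten adds one digit
            count++;
            n /= 10;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(digits(5));    // 1 byte for "5"
        System.out.println(digits(11));   // 2 bytes for "11"
        System.out.println(digits(115));  // 3 bytes for "115"
    }
}
```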

Before this new feature, this is the code you would have had to write.
(In the example debug output for Hello World below,
hyphens - signify empty bytes,
a single | marks the write position,
a double || marks the read position.)

StringBuilder sb = new StringBuilder("Hello World");
Bytes bytes = Bytes.elasticByteBuffer();

bytes.clear();
|||---------------------
//leave 8 bytes for writing the length
bytes.writeLong(0);
||--------|-------------
//remember the position at which you started writing
long pos = bytes.writePosition();
pos=8
//write the string to the buffer
bytes.appendUtf8(sb);
||--------Hello World|--
//remember the position you finished writing
long pos2 = bytes.writePosition();
pos2=19
//work out how many bytes you require to store the length
int sblen = sb.length();
int numlen = 1;
while(sblen > 9){
   numlen++;
   sblen /= 10;
}
numlen=2
bytes.writePosition(pos - numlen - 1);
||-----|---Hello World--
bytes.append(sb.length());
||-----11|-Hello World--
bytes.append(' ');
||-----11 |Hello World--
bytes.writePosition(pos2);
||-----11 Hello World|--
bytes.readPosition(pos - numlen - 1);
-----||11 Hello World|--

Job done - but a lot of work!

Now look at the alternative when you can use prepending.

bytes.clearAndPad(8);
//moves the read and write position to 8
--------|||-------------
bytes.appendUtf8(sb);
//normal append to buffer
--------||Hello World|--
bytes.prewriteByte((byte) ' ');
//prewriteByte writes the byte before the read position
//and then moves the read position back the number of bytes.
-------|| Hello World|--
bytes.prepend(sb.length());
//prepend writes the long (in text) backwards before the read position and then moves the read position back the number of bytes.
-----||11 Hello World|--
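To make the effect concrete without the library, here is a plain java.nio sketch of the same idea (the class and method names are my own, not the Chronicle-Bytes API): pad the front of the buffer, append the message, then write the separator and length backwards from the message start.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PrependSketch {
    public static String encode(String msg) {
        byte[] payload = msg.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(8 + payload.length);
        int start = 8;                        // leave 8 bytes of padding at the front
        buf.position(start);
        buf.put(payload);                     // normal append of the message
        int limit = buf.position();

        // "prepend": write backwards from the message start
        int pos = start;
        buf.put(--pos, (byte) ' ');           // separator first (it sits last)
        int len = payload.length;
        do {
            buf.put(--pos, (byte) ('0' + (len % 10))); // digits, least significant first
            len /= 10;
        } while (len > 0);

        byte[] out = new byte[limit - pos];   // readable region is [pos, limit)
        for (int i = 0; i < out.length; i++) out[i] = buf.get(pos + i);
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("Hello World")); // 11 Hello World
    }
}
```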

That's a massive saving of effort, allowing you to write to byte buffers very efficiently.


Monday, 31 August 2015

An Unexpected Allocation - JIT Compilation Jitter

Whilst working on ByteWatcher (see my last post) I came across something rather strange.

This is the actual code snippet used to find out how much allocation there has been on a particular thread:

return (long) mBeanServer.invoke(
  name,
  GET_THREAD_ALLOCATED_BYTES,
  PARAMS,
  SIGNATURE
);

For the full context see here.

(The way ByteWatcher works is to call this method periodically to monitor allocation.)

One important point to note, especially when you want to provide an accurate number for a program's allocation, is that calling the above code itself causes allocation.

The allocation caused by this call must be deducted from the number returned so that we isolate the allocation caused by the program.

i.e. call to mBeanServer = program thread allocation + call overhead
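On HotSpot this measurement can be sketched with the (non-standard) com.sun.management.ThreadMXBean; the class below is my own illustration of the overhead subtraction, not the ByteWatcher code:

```java
import java.lang.management.ManagementFactory;

// HotSpot-specific: com.sun.management.ThreadMXBean is not part of the
// standard API, but on HotSpot the platform ThreadMXBean implements it.
public class AllocationProbe {
    private static final com.sun.management.ThreadMXBean BEAN =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    public static long threadAllocatedBytes() {
        return BEAN.getThreadAllocatedBytes(Thread.currentThread().getId());
    }

    public static void main(String[] args) {
        long before = threadAllocatedBytes();
        long after = threadAllocatedBytes();
        long callOverhead = after - before;      // cost of the measuring call itself

        before = threadAllocatedBytes();
        byte[] garbage = new byte[1 << 20];      // deliberately allocate ~1 MB
        after = threadAllocatedBytes();
        // program thread allocation = measured delta - call overhead
        System.out.println("callOverhead=" + callOverhead
                + " programAllocation=" + (after - before - callOverhead)
                + " (array of " + garbage.length + " bytes)");
    }
}
```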

What I noticed was that this amount of allocation was usually 336 bytes.

However when I called this method in a loop I discovered something interesting.  Every so often it would allocate a different amount.

For this test:

@Test
  public void testQuietMeasuringThreadAllocatedBytes() {
    ByteWatcherSingleThread am = new ByteWatcherSingleThread();
    System.out.println("MeasuringCostInBytes = " + am.getMeasuringCostInBytes());
    long[] marks = new long[1_000_000];
    for (int i = 0; i < 1_000_000; i++) {
      marks[i] = am.threadAllocatedBytes();
    }

    long prevDiff = -1;
    for (int i = 1; i < 1_000_000; i++) {
      long diff = marks[i] - marks[i - 1];
      if (prevDiff != diff)
        System.out.println("Allocation changed at iteration " + i + "->" + diff);
      prevDiff = diff;
    }
  }

This was a typical result:

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->28184
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336
Allocation changed at iteration 1686->600
Allocation changed at iteration 1687->336
Allocation changed at iteration 2765->672
Allocation changed at iteration 2766->336
Allocation changed at iteration 5458->496
Allocation changed at iteration 5459->336
Allocation changed at iteration 6213->656
Allocation changed at iteration 6214->336
Allocation changed at iteration 6535->432
Allocation changed at iteration 6536->336
Allocation changed at iteration 6557->8536
Allocation changed at iteration 6558->336
Allocation changed at iteration 7628->576
Allocation changed at iteration 7629->336
Allocation changed at iteration 8656->4432
Allocation changed at iteration 8657->336
Allocation changed at iteration 9698->968
Allocation changed at iteration 9699->336
Allocation changed at iteration 11881->1592
Allocation changed at iteration 11882->336
Allocation changed at iteration 12796->1552
Allocation changed at iteration 12797->336
Allocation changed at iteration 13382->456
Allocation changed at iteration 13383->336
Allocation changed at iteration 14844->608
Allocation changed at iteration 14845->336
Allocation changed at iteration 36685->304
Allocation changed at iteration 52522->336
Allocation changed at iteration 101440->400
Allocation changed at iteration 101441->336

Given that there was definitely no allocation in the program, it was a mystery to me why the same call sometimes allocated different amounts.

In summary, over 1,000,000 runs, the program allocated different amounts about 25 times. Notably there were no spikes after 100k iterations.

I shared this problem with Heinz Kabutz and Chris Newland. Chris noticed that the allocations were down to JIT compilation jitter.

This can be seen very clearly by re-running the test with the flag -Xint (run in interpreted mode only, i.e. no JIT compilation). Now there were only 2 spikes.

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->28184
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336

Similarly running with the flag -Xcomp (compile only)

MeasuringCostInBytes = 336
Allocation changed at iteration 1->336
Allocation changed at iteration 12->29696
Allocation changed at iteration 13->360
Allocation changed at iteration 14->336

So now we can be pretty confident that it is JIT compilation jitter which is causing the rogue allocations.

I can't pinpoint the exact cause of the allocation but I guess it's understandable that there might be some sort of allocation when JIT compilation jitter occurs.

To compensate for this I introduced a calibration phase in the constructor of ByteWatcher which was further refined by Heinz.

You can see the calibration code here but it consists of a couple of phases:

  1. Call the method that works out how much the thread has allocated in a tight loop (we call it 100,000 times) - this allows the JIT to properly warm up the code so that it's all compiled.
  2. Wait 50 milliseconds - this gives the JVM a chance to finish its compilation jitter.
With this code in the constructor, even running with no special flags there are no allocation spikes.
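The two phases can be sketched as follows; `calibrate` and the `measure` supplier are hypothetical stand-ins of my own, not the actual ByteWatcher code:

```java
import java.util.function.LongSupplier;

public class Calibration {
    public static void calibrate(LongSupplier measure) {
        // Phase 1: call the measuring method in a tight loop so the JIT
        // fully warms up and compiles the code path.
        for (int i = 0; i < 100_000; i++) measure.getAsLong();
        // Phase 2: give the JVM a chance to finish its compilation jitter.
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        long[] calls = {0};
        calibrate(() -> ++calls[0]);
        System.out.println("warm-up calls: " + calls[0]); // warm-up calls: 100000
    }
}
```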

Conclusion:
  • JIT compilation jitter causes some allocation
  • Running the program without compilation jitter (i.e. only interpreted or only compiled) greatly reduces that allocation but does not quite eliminate it.
  • After 100k runs the allocations stop, which indicates that it takes 100k runs for the jitter to stop. This is interesting because we know that the code should compile after 10k iterations.



Friday, 14 August 2015

ByteWatcher: A New Regression Testing Tool For Allocations

This post is to introduce ByteWatcher, a tool I've written in collaboration with Heinz Kabutz (the Java Specialist).

My regular readers will recognise it as an evolution of AllocationMonitor, which I described in this post. For the history of the project and an explanation of some of the concepts, I refer you to that post.

So what is ByteWatcher?

In a sentence, ByteWatcher monitors your allocation and alerts you when a specified allocation limit is breached.

Regression testing is important.  I don't think that there are many projects these days that are not filled to the brim with unit tests. Often the amount of code in the test branch of a project will far exceed the code in the main branch.  

Principally we test:

  • That functionality has not been broken. 
  • For certain programs, that the code has not slowed down. 
  • For other programs, that the size of the distribution has not exceeded a given amount.

But there is another attribute that you might want to test which is your program's allocation.

I've written a few posts on the problems and effects of allocation: see, for example, The First Rule of Performance Optimisation, Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis and A Case Study in Analysing Latency: Lock vs Synchronized.

For those programs where real-time behaviour (and therefore predictable latencies up to the high percentiles) is critical, allocation can be disastrous. The reason for this is that your allocations will, at some point, have to be garbage collected. A GC will cause a pause that will kill your predictable latencies. (There are other reasons to avoid allocation and I refer you to the links above for those.)

With ByteWatcher you can add a couple of lines into your regression tests that will check whether the expected allocation has been exceeded or not.  The documentation for getting the code and example usage is all in the README of the project but this is the basic structure.



Hope you find ByteWatcher useful and I welcome any feedback or contributions to the project.

Also thank you to Chris Newland (Mr JITWatch) for your help on some of the trickier allocation spikes we were seeing. I'll leave those for another post.

Tuesday, 11 August 2015

A Case Study in Analysing Latency: Lock vs Synchronized

In particular in this post we will discuss:
  • Garbage created by java.concurrent.Lock
  • Comparing Lock to synchronized
  • How to measure latency programmatically
  • The impact of contention on Lock and synchronized
  • The impact of co-ordinated omission on latency tests
Back to one of my favourite topics: garbage creation/allocation. See my previous posts (e.g. The First Rule of Performance Optimisation and Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis) for more details on this subject, especially on why allocation is such a critical factor in understanding performance issues.

Something I came across a couple of days ago, while I was trying to diagnose some strange effects of allocation during JIT compilation, was that java.util.concurrent.locks.ReentrantLock allocates, but only when under contention. (This can easily be proved by running a test program like the one below, creating contention on Lock, with -verbosegc.)

Sample gc output for contended Lock below:
[GC (Allocation Failure)  16384K->1400K(62976K), 0.0016854 secs]
[GC (Allocation Failure)  17784K->1072K(62976K), 0.0011939 secs]
[GC (Allocation Failure)  17456K->1040K(62976K), 0.0008452 secs]
[GC (Allocation Failure)  17424K->1104K(62976K), 0.0008338 secs]
[GC (Allocation Failure)  17488K->1056K(61952K), 0.0008799 secs]
[GC (Allocation Failure)  17440K->1024K(61952K), 0.0010529 secs]
[GC (Allocation Failure)  17408K->1161K(61952K), 0.0012381 secs]
[GC (Allocation Failure)  17545K->1097K(61440K), 0.0004592 secs]
[GC (Allocation Failure)  16969K->1129K(61952K), 0.0004500 secs]

[GC (Allocation Failure)  17001K->1129K(61952K), 0.0003857 secs]

I wondered whether the garbage collections necessary to clean up these allocations would mean that Lock, in a highly contended environment, would be a worse choice for synchronisation than using the in-built 'synchronized'. 

Of course the question is more academic than anything else. If you really did care that much about latency, you would never (or certainly should never) find yourself in a situation where so much thread locking would be necessary.  Nevertheless stay with me because the process and results are interesting.

A bit of history: Locks were introduced into Java in version 1.5 in 2004. Locks, together with the rest of the concurrent utilities, were desperately needed to simplify concurrency constructs. Up to that point you had to deal with the built-in synchronized and wait()/notify() on Object.

ReentrantLock offers much functionality over and above synchronized, to name but a few:
  • Being unstructured - i.e. you are not limited to using it in a block or method; it allows you to hold the lock across several methods.
  • Lock polling
  • Time out waiting for the lock
  • Configurable fairness policy
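The polling, timeout and fairness points above can be sketched like this (my own example, not from the benchmark):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class LockFeatures {
    public static void main(String[] args) {
        ReentrantLock lock = new ReentrantLock(true);      // configurable fairness policy

        if (lock.tryLock()) {                              // lock polling: never blocks
            try {
                System.out.println("acquired by polling");
            } finally {
                lock.unlock();
            }
        }

        try {
            if (lock.tryLock(10, TimeUnit.MILLISECONDS)) { // time out waiting for the lock
                try {
                    System.out.println("acquired within timeout");
                } finally {
                    lock.unlock();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

None of this is possible with synchronized, which must acquire and release within a single block or method and will wait indefinitely.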
But how do they perform in a latency test?

I wrote the simple test below to compare the performance of Lock against synchronized. The code allows you to:
  • Vary the number of threads (1 thread means that there is no contention) and thus adjust the amount of contention.
  • Measure with and without coordinated omission (see my previous blog, Effects of Coordinated Omission).
  • Run the test with either Lock or synchronized.
  • Record the results. You will notice that I used a Histogram class created by Peter Lawrey; you can find it as a utility in Chronicle-Core over here.
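As a rough sketch of the shape of such a test (my own minimal harness, not the actual benchmark or its Histogram recording), the two variants guard the same trivial critical section and are exercised by several threads to create contention:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockVsSynchronized {
    private static final ReentrantLock LOCK = new ReentrantLock();
    private static final Object MONITOR = new Object();
    static long lockCount, syncCount;

    static void withLock() {
        LOCK.lock();
        try { lockCount++; } finally { LOCK.unlock(); }
    }

    static void withSynchronized() {
        synchronized (MONITOR) { syncCount++; }
    }

    static void run(int threads, int iterations) {
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                for (int j = 0; j < iterations; j++) { withLock(); withSynchronized(); }
            });
            ts[i].start();
        }
        for (Thread t : ts) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    public static void main(String[] args) {
        run(4, 100_000);   // 4 threads => contention on both guards
        System.out.println("lock=" + lockCount + " sync=" + syncCount); // lock=400000 sync=400000
    }
}
```

The real benchmark would wrap each call in timing code and feed the result into a histogram rather than a counter.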


So here are the results:

These are the results where co-ordinated omission was ignored:


  • The times are measured in microseconds. 
  • The latency distribution is across the top of the graph. 
  • Contention in this test meant running the program with 4 threads. 
  • The tests were run on an MBP i7 with 8 logical CPUs. 
  • Each test comprised 200,000,000 iterations with a 10,000 iteration warmup.
  • Throughput when adjusting for co-ordinated omission was 1 iteration/microsecond. 



As expected, without contention the results are pretty much the same. The JIT will have optimised away the Lock and synchronized.

With contention, using Lock was marginally faster in the lower percentiles, but again there is really not much in it. So even though there were many minor garbage collections, they don't seem to have noticeably slowed down the Lock. If anything, Lock is slightly faster overall.

These are the results adjusted for co-ordinated omission.


The numbers are of course higher as they allow for the true latency caused.

Again with no contention the lock and synchronized perform the same - no great surprises there.

With contention, up to the 99th percentile we now see synchronized outperforming lock by 10X. After that the times were pretty much the same. 

I could speculate that the effects of the GC collections, which take between 300-1200 microseconds, are the cause of the slowness of Lock compared to synchronized, especially as the slowdown is apparent only up to the 99th percentile - after this the latencies are probably down to hardware and OS. However, that would be just speculation on my part without further investigation.

Conclusion:

The takeaway from this post is more about the process involved in measuring and analysing latencies. It is interesting that Lock allocates when contended, but that is unlikely to make any practical difference in the real world.




Tuesday, 28 July 2015

The difference between Efficiency and Performance - It's not all about the Big O

As a software developer you should understand how to make the best use of your hardware in order to give the user the best experience possible with your program.  In order to do this you need to know how to write efficient code and how to write performant code.

(I was watching this CppCon video given by Chandler Carruth, which helped crystallise these concepts for me.)

Efficiency - the amount of work you need to do.
Performance - how fast you can do that work.

Efficiency - governed by your algorithm.
Performance - governed by your data structures.

Efficiency and performance are not dependent on one another.

It follows that you can be efficient without being performant: you do very little work to get the job done but do it slowly. E.g. you are a snail that needs to travel 10m from point X to Y; you take the most direct route but slither along extremely slowly.

It also follows that you can be performant without being efficient: you do a lot of work to get the job done but do it very fast. E.g. you are Superman and need to travel 10m from X to Y; you fly right round the globe to get there, but go very fast.

What we are of course aiming for is to be efficient and performant at the same time!

The efficiency of our algorithms is usually measured by Big O notation, which describes how an algorithm scales with input size. We are taught in computer science classes to aim for the Holy Grail of O(1) and to get scared when we see O(n²) or worse.

What we focus on less is how the algorithm actually performs. One of the biggest issues contributing to performance is how data is stored in the data structure. Generally speaking, we want data that is used together to be stored close together in memory so that we don't end up with CPU cache misses. (See The First Rule of Performance Optimisation where I touch on this issue.) If we access data in a predictable manner, the hardware will pre-fetch data for us into the cache, which can be a huge time saving over randomly accessing data from all over memory and pulling it into the caches on an 'as it is needed' basis.

One of the worst culprits is the good old LinkedList. It has O(n) access and search and, coupled with O(1) insertion and deletion, might be considered a very good data structure to use in a program. Whilst of course I'm not writing off this important data structure, the problem is that its linked nodes are not held together in memory, so it is not a performant data structure to traverse: you cannot help but end up with many expensive cache misses.
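A small illustration of the point (timings vary by machine, so treat the printed numbers as indicative only): both lists give the same answer, but the ArrayList's backing array is contiguous, while each LinkedList node is a separate heap object that can sit anywhere in memory, adding an extra indirection per element.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TraversalLocality {
    static long sum(List<Integer> list) {
        long total = 0;
        for (int v : list) total += v;   // sequential, prefetch-friendly access pattern
        return total;
    }

    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < 100_000; i++) { arrayList.add(i); linkedList.add(i); }

        long t0 = System.nanoTime();
        long a = sum(arrayList);
        long t1 = System.nanoTime();
        long b = sum(linkedList);
        long t2 = System.nanoTime();
        // same result; on most hardware the ArrayList pass is faster
        System.out.println("arrayList=" + (t1 - t0) + "ns linkedList=" + (t2 - t1)
                + "ns equal=" + (a == b));
    }
}
```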

Perhaps the best example of the difference and trade-off between efficiency and performance is looking at sort algorithms. 

We are taught that bubble sort is to be avoided at all costs; after all, it is O(n²). We should rather go for binary sort, which comes in at a much more acceptable O(n log n).

However, to be efficient you sometimes have to pay an overhead penalty which slows you down in the short term. For example, if you had to edit a bunch of files you might do them manually, one by one. Alternatively, you could spend the time creating a script to do the edits. At some point, continuing to edit manually becomes slower than taking the hit of creating the script.

Bubble sort is performant: it does its work very quickly and doesn't pay the penalty that binary sort pays to be more efficient at larger volumes. So as long as we keep the numbers low (below about 60 items), bubble sort will outperform any other type of sort.
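For reference, a minimal bubble sort - the simplicity (tight loop, adjacent swaps, no setup overhead) is exactly why it wins at small sizes:

```java
import java.util.Arrays;

public class BubbleSort {
    static void bubbleSort(int[] a) {
        for (int i = a.length - 1; i > 0; i--) {     // each pass bubbles the max to the end
            for (int j = 0; j < i; j++) {
                if (a[j] > a[j + 1]) {               // swap adjacent out-of-order pair
                    int tmp = a[j];
                    a[j] = a[j + 1];
                    a[j + 1] = tmp;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] small = {5, 1, 4, 2, 3};
        bubbleSort(small);
        System.out.println(Arrays.toString(small)); // [1, 2, 3, 4, 5]
    }
}
```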

In fact, if you look at the implementation of java.util.Arrays.sort() you will find in the class java.util.DualPivotQuicksort a number of sorting strategies which are triggered at different thresholds. Again, efficiency vs performance.


/**
 * If the length of an array to be sorted is less than this
 * constant, Quicksort is used in preference to merge sort.
 */
private static final int QUICKSORT_THRESHOLD = 286;

/**
 * If the length of an array to be sorted is less than this
 * constant, insertion sort is used in preference to Quicksort.
 */
private static final int INSERTION_SORT_THRESHOLD = 47;

/**
 * If the length of a byte array to be sorted is greater than this
 * constant, counting sort is used in preference to insertion sort.
 */
private static final int COUNTING_SORT_THRESHOLD_FOR_BYTE = 29;


I wrote a quick JMH benchmark which demonstrates this nicely.

With 5000 elements binary sort with its greater efficiency is much faster:

Benchmark                     Mode  Cnt    Score    Error  Units
Sorter.binarySearchForValue  thrpt    5  915.362 ± 51.438  ops/s
Sorter.bubbleSort            thrpt    5  171.021 ±  9.818  ops/s

With 50 elements bubble sort, with its greater performance, is faster:

Benchmark                     Mode  Cnt       Score      Error  Units
Sorter.binarySearchForValue  thrpt    5   90011.861 ± 2051.384  ops/s
Sorter.bubbleSort            thrpt    5  114931.583 ± 6423.543  ops/s




Summary

Whilst Big O notation is incredibly useful for predicting the efficiency (amount of work) of an algorithm, we must also pay attention to the performance of the data structures on which those algorithms operate.  In certain situations a less efficient but more performant algorithm may lead to greater speed overall. 

Monday, 6 July 2015

Measuring Allocations Programmatically

I picked up this tip from The Java Specialists' Newsletter written by Heinz Kabutz. (For any Java developer wanting to learn what goes on under the covers of the JDK, this newsletter is an absolute must!)

Especially for developers writing low latency code, but even in normal Java code, allocation is the thing you really want to avoid. See my previous posts 'The First Rule of Performance Optimisation' and 'Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis' for more details.

Previous to this tip I had always used a profiler to work out my allocations, or I suppose you could have used a call to Runtime to see how much heap memory had been allocated by the JVM.

Using MBeans, we are able to query an individual thread for its allocations. This gives us a very precise way of measuring whether a specific thread has allocated and, if so, how much it has allocated. In situations where you are coding for zero allocation you can include a call to this code in your tests, asserting that there has been no allocation.

Below is a simple class you can use based on the tip from the newsletter.

You'll notice that the constructor does a calibration which adjusts for the amount of allocation created by the bean itself.

There is also some defensive code to ensure that the class is only called from a single thread.

You can call the method markAllocations to find the number of bytes that have been allocated since the last mark. printAllocations is a handy method to print the allocations since the last mark to standard out. Allocations are rebased after the class is constructed, after a call to reset, or after a call to markAllocations or printAllocations.

In a test you might have code like this:

AllocationsMeasure measure = new AllocationsMeasure();
...
//critical code
...
assertEquals(0, measure.markAllocations());

Full code for AllocationsMeasure below:


Thursday, 2 July 2015

The Effect of Coordinated Omission - Measure loopback latency using a simple non-blocking NIO client/server

In this post I demonstrate a number of ideas and techniques: 
  1. How to write a simple non-blocking NIO client/server
  2. How to time latency loopback on your machine
  3. How to measure latencies in percentiles (as opposed to simple average)
  4. The effects of co-ordinated omission
  5. The relationship between latency and throughput
Thanks to Peter Lawrey for his help with this post.

I was recently developing a low latency benchmark for a client-server application. Initially I was simulating the benchmark on a single machine using TCP over loopback. The first measure I wanted to quantify was how much of the recorded latency I needed to allow for simple loopback latency; I would then be able to understand more clearly the latency added by my actual application.

To do this I created a program (code at the end of the post) that transfers a single byte from the client to the server and back again. This is done repeatedly and the results are processed.

The program is written using non-blocking java NIO to optimise the loopback latencies as much as possible. 
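A stripped-down sketch of the single-byte echo (my own minimal version, busy-spinning instead of using a Selector, and not the benchmark code at the end of this post):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class EchoLoopback {
    // Send one byte from client to server over loopback and echo it back.
    public static byte echoOnce(byte b) {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             SocketChannel client = SocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0)); // ephemeral port
            server.configureBlocking(false);
            client.configureBlocking(false);
            client.connect(server.getLocalAddress());

            SocketChannel peer;
            while ((peer = server.accept()) == null) Thread.onSpinWait();
            while (!client.finishConnect()) Thread.onSpinWait();
            peer.configureBlocking(false);

            ByteBuffer out = ByteBuffer.wrap(new byte[]{b});
            while (out.hasRemaining()) client.write(out);      // client sends

            ByteBuffer in = ByteBuffer.allocate(1);
            while (in.hasRemaining()) peer.read(in);           // server receives
            in.flip();
            while (in.hasRemaining()) peer.write(in);          // server echoes

            ByteBuffer back = ByteBuffer.allocate(1);
            while (back.hasRemaining()) client.read(back);     // client receives
            peer.close();
            back.flip();
            return back.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(echoOnce((byte) 42)); // 42
    }
}
```

The real benchmark would keep the channels open and time many of these round trips rather than setting up a connection per byte.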

More important than recording the average time, the percentile latencies are recorded. (See a previous post here for a discussion of how to measure latency.) Critically, the code factors in coordinated omission. (To learn more about this see here, from Gil Tene.) In a nutshell, you don't time a job from when it started but from when it should have started.
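The adjustment can be sketched as follows; the names are mine, not from the benchmark code. Latency is measured from when each request should have started (a fixed schedule), not from when the loop actually got round to sending it:

```java
public class CoordinatedOmission {
    // intended start of iteration i = testStart + i * interval
    static long adjustedLatency(long testStartNanos, long intervalNanos,
                                int iteration, long actualEndNanos) {
        long intendedStart = testStartNanos + iteration * intervalNanos;
        return actualEndNanos - intendedStart;
    }

    public static void main(String[] args) {
        long interval = 12_500;   // 80,000 msgs/sec => one send every 12.5us
        // Iteration 3 should have started at 3 * 12_500 = 37_500ns, but a
        // stall delayed the actual send to 90_000ns; it completed at 100_000ns.
        long naive = 100_000 - 90_000;                            // 10_000ns: hides the stall
        long adjusted = adjustedLatency(0, interval, 3, 100_000); // 62_500ns: includes it
        System.out.println("naive=" + naive + "ns adjusted=" + adjusted + "ns");
    }
}
```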

These are results on my 2 year old MBP.

Starting latency test rate: 80000 ... Average time 6001870.9
Loop back echo latency was 6007821.4/10525630.0 11,538,962/11,642,739 11,652,536/11653522 11,653,616 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 70000 ... Average time 4393443.9
Loop back echo latency was 4388003.2/7743356.1 8,496,550/8,576,130 8,583,327/8584029 8,584,097 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 60000 ... Average time 3054243.7
Loop back echo latency was 2943178.9/5463715.9 5,961,190/6,009,637 6,014,795/6015409 6,015,453 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 50000 ... Average time 1420485.4
Loop back echo latency was 1423450.9/2462332.1 2,694,163/2,715,932 2,717,909/2718146 2,718,163 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 40000 ... Average time 4828.0
Loop back echo latency was 79.8/23261.9 44,756/47,379 47,662/47712 47,713 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 30000 ... Average time 45.1
Loop back echo latency was 23.3/46.5 630/2,211 2,904/3092 3,106 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 25000 ... Average time 1095.0
Loop back echo latency was 24.8/36.8 47,300/58,166 59,151/59346 59,357 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 20000 ... Average time 26.7
Loop back echo latency was 24.7/29.5 60/88 259/339 340 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 15000 ... Average time 26.5
Loop back echo latency was 24.8/28.1 60/72 102/252 252 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 10000 ... Average time 26.7
Loop back echo latency was 24.8/28.9 61/72 165/369 369 us for 50/90 99/99.9 99.99/99.999 worst %tile

Running the same test with -Dcoordinated.omission=true

### Running with Coordinated Omission ###

Starting latency test rate: 80000 ... Average time 25.6
Loop back echo latency was 23.3/30.8 59/74 164/1604 9,165 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 70000 ... Average time 25.9

Loop back echo latency was 23.3/31.1 59/74 187/1726 11,358 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 60000 ... Average time 26.2

Loop back echo latency was 25.1/28.3 60/66 90/166 419 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 50000 ... Average time 26.4

Loop back echo latency was 24.6/28.3 60/72 232/1996 4,104 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 40000 ... Average time 26.2

Loop back echo latency was 24.6/28.5 60/67 90/251 319 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 30000 ... Average time 26.6

Loop back echo latency was 24.5/28.3 60/72 162/2257 3,236 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 25000 ... Average time 26.4

Loop back echo latency was 24.6/27.8 61/71 96/254 375 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 20000 ... Average time 26.4

Loop back echo latency was 24.5/27.4 61/70 96/243 293 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 15000 ... Average time 26.8

Loop back echo latency was 24.6/29.0 61/74 285/887 887 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 10000 ... Average time 26.7

Loop back echo latency was 24.6/28.5 61/73 99/307 307 us for 50/90 99/99.9 99.99/99.999 worst %tile


As you can see, the effects of throughput are completely ignored! It would appear that even at a rate of 80,000 messages a second your 99.99 percentile is 374us, when in fact it is way, way larger than that.

In fact, you can only hit your target latencies when your throughput is nearer 20,000 messages a second. As you would understand intuitively, there is a trade-off between throughput and latency.

The results for a server machine (i7-3790X 3.5GHz running Ubuntu 15.04) are as below:


Starting latency test rate: 200000 ... Average time 9417750.1
Loop back echo latency was 9349156.4/17229351.3 18,907,915/19,073,572 19,089,824/19091465 19,091,645 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 160000 ... Average time 3155097.0
Loop back echo latency was 3186332.1/5649721.3 6,159,813/6,211,035 6,216,502/6217022 6,217,047 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 140000 ... Average time 47245.0
Loop back echo latency was 19.0/236226.4 339,075/362,914 364,575/364749 364,767 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 120000 ... Average time 8.2
Loop back echo latency was 6.6/13.1 22/89 388/1427 1,519 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 100000 ... Average time 7.3
Loop back echo latency was 6.6/7.4 18/20 59/107 317 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 80000 ... Average time 7.1
Loop back echo latency was 6.7/7.7 18/19 22/100 619 us for 50/90 99/99.9 99.99/99.999 worst %tile

Running the same test with -Dcoordinated.omission=true

### Running with Coordinated Omission ###

Starting latency test rate: 200000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 122 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 160000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 131 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 140000 ... Average time 6.9
Loop back echo latency was 6.5/7.1 18/19 21/23 442 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 120000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 121 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 100000 ... Average time 6.9
Loop back echo latency was 6.5/6.9 18/19 21/23 106 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 80000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 17/19 21/23 104 us for 50/90 99/99.9 99.99/99.999 worst %tile

On the server machine, you can achieve stable latencies (~7us) at a throughput of 100,000 messages a second.  

An interesting observation is the relationship between latency and throughput, and the effect that lowering latency has on raising throughput (comparing the server and laptop). For more on this topic see Peter Lawrey's excellent presentation here.

The code for this test is below: