Monday 18 April 2016

A Series of Posts on JLBH (Java Latency Benchmark Harness)

Introducing a set of 5 blog posts on JLBH

This set of posts should be useful for anyone responsible for creating or benchmarking real Java applications.

As well as introducing the open source JLBH benchmarking harness, we go into some of the subtleties that underpin latency benchmarking in Java, with real code examples.


1) JLBH - Introducing Java Latency Benchmarking Harness

  • What is JLBH
  • What was the motivation for JLBH
  • Differences between JMH and JLBH
  • Quick start guide

2) JLBH Examples 1 - Why Code Should be Benchmarked in Context

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context

3) JLBH Examples 2 - Accounting for Coordinated Omission

  • Running JLBH with and without accounting for coordinated omission
  • An example showing in numbers the effects of coordinated omission
  • A discussion about flow control

4) JLBH Examples 3 - The Effects of Throughput on Latency

  • A discussion about the effects of throughput on latency
  • How to use JLBH to measure TCP loopback
  • Adding probes to test both halves of the TCP round trip
  • Watching the effect of increasing throughput on latency
  • Understanding that you have to drop throughput to achieve good latencies at high percentiles

5) JLBH Examples 4 - Benchmarking QuickFIX vs Chronicle FIX

  • Using JLBH to benchmark QuickFIX
  • Observing how QuickFIX latencies degrade through the percentiles
  • Comparing QuickFIX with Chronicle FIX

JLBH Examples 4 - Benchmarking QuickFIX vs Chronicle FIX

In this post:

  • Using JLBH to test QuickFIX
  • Observing how QuickFIX latencies degrade through the percentiles
  • Comparing QuickFIX with Chronicle FIX


As mentioned in the Introduction to JLBH, the primary reason JLBH was created was to measure the Chronicle-FIX engine.

We use all the features of JLBH, particularly the throughput lever and the accounting for coordinated omission, to get some realistic timings for QuickFIX.

Later in this article we will look at some results for Chronicle-FIX, but first let's look at benchmarking QuickFIX, the open source FIX engine implementation.

This is the scenario we will be benchmarking:
  • The client creates a NewOrderSingle which is then passed to the server.  
  • The server parses the NewOrderSingle 
  • The server creates an ExecutionReport which is sent back to the client.
  • The client receives the ExecutionReport
End to end time is measured from when the client starts creating the NewOrderSingle to when the client receives the ExecutionReport. 

Note: We need to carry the start time that the benchmark was called with right through the program. To do this we use a trick and set the start time as the value of the ClOrdID tag.

If you want to get the benchmark running on your servers you should clone this GitHub repo; all jars and configuration files are set up there.

For the sake of this post, here is an outline of the benchmark code (the full source is in the repo above).
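
Below is a condensed sketch of the client side. It assumes the JLBH API used throughout these posts (JLBHTask, JLBHOptions, jlbh.sample()) and the QuickFIX/J message classes; the class name, the run parameters and the omitted FIX session setup are illustrative rather than the exact code from the repo, but it shows the JLBH wiring and the ClOrdID trick described above.

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import quickfix.FieldNotFound;
import quickfix.Session;
import quickfix.SessionID;
import quickfix.SessionNotFound;
import quickfix.field.ClOrdID;
import quickfix.fix42.ExecutionReport;
import quickfix.fix42.NewOrderSingle;

// Client side of the QuickFIX benchmark (sketch). The server half simply parses the
// NewOrderSingle and replies with an ExecutionReport carrying the same ClOrdID.
public class QuickFixBenchmark implements JLBHTask {

    private JLBH jlbh;
    private SessionID sessionID;   // captured in the QuickFIX Application.onLogon() callback (setup omitted)

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        // ... start the QuickFIX acceptor (server) and initiator (client) here ...
    }

    @Override
    public void run(long startTimeNS) {
        try {
            NewOrderSingle nos = new NewOrderSingle();
            // the trick: carry the start time through the system in the ClOrdID tag
            nos.set(new ClOrdID(Long.toString(startTimeNS)));
            // ... set the remaining mandatory tags (Symbol, Side, TransactTime, OrdType) ...
            Session.sendToTarget(nos, sessionID);
        } catch (SessionNotFound e) {
            throw new AssertionError(e);
        }
    }

    // called from the client Application.fromApp() when the ExecutionReport arrives
    void onExecutionReport(ExecutionReport report) throws FieldNotFound {
        long startTimeNS = Long.parseLong(report.getClOrdID().getValue());
        jlbh.sample(System.nanoTime() - startTimeNS);
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(20_000)              // illustrative
                .iterations(60_000)                    // illustrative
                .throughput(2_000)                     // vary: 2,000 / 10,000 / 50,000 per second
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)
                .jlbhTask(new QuickFixBenchmark());
        new JLBH(options).start();
    }
}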



These are the results that I see running on my server, an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz. All times are in microseconds.

At a throughput of 2,000/s


Percentile   run1         run2         run3      % Variation
50:           270.34       270.34       233.47         9.52
90:           352.26       335.87      1867.78        75.25
99:          6684.67      4849.66     69206.02        89.84
99.9:       13369.34     12845.06    163577.86        88.67
99.99:      81788.93     20447.23    163577.86        82.35
worst:     111149.06     98566.14    163577.86        30.54

At a throughput of 10,000/s


Percentile   run1         run2         run3      % Variation
50:           184.32       176.13       176.13         0.00 
90:           573.44       270.34       249.86         5.18 
99:         19398.66      2686.98      5111.81        37.56 
99.9:       28835.84      7733.25      7995.39         2.21 
99.99:      30932.99      9699.33      9175.04         3.67 
worst:      30932.99      9699.33      9175.04         3.67 

The mean here is ~200us, but as you move up through the percentiles the latencies really begin to degrade.

A lot of this is due to the amount of garbage that is being created! You can see this by running the benchmark with the JVM flag -verbose:gc.

Note that the latencies for 10k/s are slightly better than at 2k/s. This is rather counter-intuitive: we would expect latencies to increase as throughput increases, as we find with the 50k/s run below, where the latencies are much worse. The explanation for the 10k/2k discrepancy is that as throughput increases the CPU caches and other resources can stay hotter, and therefore, up to a point, latencies actually get better as throughput increases. This is another reason why it is so important to measure your latencies at a realistic throughput for your application.

In fact when you push the throughput up to 50,000/s, even your 90th percentile (1 in every 10 iterations) is completely blown and you end up with multi-millisecond delays.

At a throughput of 50,000/s

Percentile   run1         run2         run3      % Variation   var(log)
50:           176.13       176.13       176.13         0.00       11.82
90:         12845.06     29884.42      3604.48        82.94       21.01
99:         34603.01     94371.84     17301.50        74.81       25.26
99.9:       42991.62     98566.14     25690.11        65.41       25.84
99.99:      45088.77     98566.14     27787.26        62.94       25.93
worst:      45088.77     98566.14     27787.26        62.94       25.93

The issues here are not just the mean time (assuming that ~200us is too slow for you) but, even more worryingly, the way the numbers degrade as throughput is raised and as you move up the higher percentiles.

Let's compare this with Chronicle-FIX.  The test is run for exactly the same scenario and on the same machine.

The results look like this:

At a throughput of 2000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.90       0.00    
90:            18.94        18.94        18.94       0.00    
99:            26.11        30.21        23.04       17.18    
99.9:          35.84        39.94        33.79       10.81    
99.99:        540.67       671.74       401.41       65.41    
worst:        638.98      1081.34       606.21       61.59    

At a throughput of 10,000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.13         3.08     
90:            18.94        18.94        18.94         0.00     
99:            26.11        22.02        20.99         3.15     
99.9:          88.06        33.79        83.97        49.75     
99.99:        999.42       167.94       802.82        71.59     
worst:       1146.88       249.86       966.66        65.67     

At a throughput of 50,000/s


Percentile   run1         run2         run3      % Variation  
50:            15.62        15.10        15.62         2.21     
90:            17.92        16.90        16.90         0.00        
99:            22.02        30.21        29.18         2.29        
99.9:         120.83       352.26        33.79        86.27       
99.99:        335.87       802.82        96.26        83.03       
worst:        450.56       901.12       151.55        76.73       

The Chronicle-FIX mean is around 16us, which is 12 times faster than QuickFIX.

But it's more than that, because nearly all of that time is in the TCP round trip.  When you measure the TCP time alone (see the last post, JLBH Examples 3 - The Effects of Throughput on Latency) it turns out that most of that time is the TCP round trip itself, ~10us.

Therefore if you deduct the TCP times you get (in microseconds):
  • QuickFIX       200 - 10 = 190
  • Chronicle-FIX   16 - 10 = 6
  • Chronicle-FIX is over 30 times faster than QuickFIX
And as has been demonstrated, if you care about your upper percentiles it gets much worse than that.
For completeness it should be noted that the server this was benchmarked on was reasonably noisy.  It had latency spikes of around 400us, which accounts for the larger numbers showing up in the higher percentiles.
Also this test uses loopback TCP, which puts a huge stress on the Linux kernel.  In fact strange things happen when you push the throughput up really high (you can try this with the simple TCP test), so this is not the best way to test Chronicle-FIX.  It is only used as a comparison against QuickFIX.
Using Chronicle-FIX, if you measure the parsing of a FIX message into its data model, including logging, on a tuned server you actually see this profile, tested across throughputs from 10,000/s to 200,000/s:
Percentile   run1         run2         run3         run4         run5
50:             1.01         1.01         1.01         1.01         1.06
90:             1.12         1.12         1.12         1.12         1.12
99:             1.38         1.31         1.44         1.31         2.11
99.9:           2.88         2.88         2.88         2.88         4.03
99.99:          3.26         3.14         3.39         3.14         6.02
worst:          5.25         6.27        22.02        20.99        18.94


Friday 15 April 2016

JLBH Examples 3 - The Effects of Throughput on Latency

In this post:

  • A discussion about the effects of throughput on latency
  • How to use JLBH to measure TCP loopback
  • Adding probes to test both halves of the TCP round trip
  • Watching the effect of increasing throughput on latency
  • Understanding that you have to drop throughput to achieve good latencies at high percentiles.


In the last post we saw the effects of accounting for coordinated omission: measuring the effect that a delay to one iteration will have on subsequent iterations.

Intuitively we understand that throughput will affect latency. It seems natural that as we raise throughput we will also raise latency. 

Going into a very crowded shop will affect how fast you can select and purchase your goods. On the other hand, consider a very rarely frequented shop. It could be that in such a shop the shopkeeper is away from the till having a tea break, and your purchase will be delayed as you wait for him to put down his cup of tea and make his way to the counter where he can serve you.

This is exactly what you find when running benchmarks and varying throughput.

In general, as you raise throughput, latency increases; but at some point, when throughput drops below a certain threshold, latency can increase as well.

The code below times a round trip TCP call over loopback.

We add two probes:

  • client2server - the amount of time taken to complete the first half of the round trip
  • server2client - the amount of time taken to complete the second half of the trip

These probes do not take into account co-ordinated omission - it is only the end to end time that takes coordinated omission into account.

Here's the code for the benchmark (the full source is in the GitHub repo):
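
What follows is a condensed sketch of the same structure rather than the exact code from the repo. It assumes the JLBH probe API (jlbh.addProbe() returning a NanoSampler); the port number, the blocking socket handling and the run parameters are illustrative. The client sends its start time to the server, the server samples the first half of the trip and replies with its own timestamp, so the client can sample the second half and the end to end time.

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import net.openhft.chronicle.core.util.NanoSampler;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class TcpLoopbackBenchmark implements JLBHTask {

    private static final int PORT = 8007;                 // illustrative port
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(8);
    private JLBH jlbh;
    private NanoSampler client2server;
    private NanoSampler server2client;
    private SocketChannel client;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        client2server = jlbh.addProbe("client2server");
        server2client = jlbh.addProbe("server2clientProbe");
        try {
            ServerSocketChannel ssc = ServerSocketChannel.open();
            ssc.bind(new InetSocketAddress("localhost", PORT));
            Thread server = new Thread(() -> runServer(ssc), "server");
            server.setDaemon(true);
            server.start();
            client = SocketChannel.open(new InetSocketAddress("localhost", PORT));
            client.socket().setTcpNoDelay(true);
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }

    // Echo server: reads the client's start time, samples the first half of the
    // round trip, then replies with its own timestamp.
    private void runServer(ServerSocketChannel ssc) {
        try (SocketChannel channel = ssc.accept()) {
            channel.socket().setTcpNoDelay(true);
            ByteBuffer bb = ByteBuffer.allocateDirect(8);
            while (true) {
                bb.clear();
                while (bb.hasRemaining())
                    if (channel.read(bb) < 0) return;
                bb.flip();
                client2server.sampleNanos(System.nanoTime() - bb.getLong());
                bb.clear();
                bb.putLong(System.nanoTime());            // server send time
                bb.flip();
                while (bb.hasRemaining())
                    channel.write(bb);
            }
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }

    @Override
    public void run(long startTimeNS) {
        try {
            buffer.clear();
            buffer.putLong(startTimeNS);
            buffer.flip();
            while (buffer.hasRemaining())
                client.write(buffer);
            buffer.clear();
            while (buffer.hasRemaining())
                client.read(buffer);
            buffer.flip();
            long serverSendNS = buffer.getLong();
            long nowNS = System.nanoTime();
            server2client.sampleNanos(nowNS - serverSendNS);   // second half of the trip
            jlbh.sample(nowNS - startTimeNS);                  // end to end time
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(50_000)
                .iterations(50_000)
                .throughput(20_000)                    // vary this to see the effect on latency
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)
                .jlbhTask(new TcpLoopbackBenchmark());
        new JLBH(options).start();
    }
}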


And here are the results when run at a throughput of 20,000 iterations / second:

Warm up complete (50000 iterations took 2.296s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 34 / 2,950  19,400 / 20,450  20,450 - 20,450
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 16 / 26  38 / 72  287 - 336
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 16 / 27  40 / 76  319 - 901
OS Jitter (26,960)                              50/90 99/99.9 99.99 - worst was 9.0 / 16  44 / 1,340  10,220 - 11,800
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 42 / 868  4,590 / 5,110  5,370 - 5,370
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 20 / 27  38 / 92  573 - 2,560
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 19 / 27  38 / 72  868 - 1,740
OS Jitter (13,314)                              50/90 99/99.9 99.99 - worst was 9.0 / 16  32 / 96  303 - 672
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 34 / 152  999 / 2,160  2,290 - 2,290
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 17 / 26  36 / 54  201 - 901
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 16 / 25  36 / 50  225 - 1,740
OS Jitter (14,306)                              50/90 99/99.9 99.99 - worst was 9.0 / 15  23 / 44  160 - 184
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation
50:            33.79        41.98        33.79        13.91       
90:          2949.12       868.35       151.55        75.92       
99:         19398.66      4587.52       999.42        70.53       
99.9:       20447.23      5111.81      2162.69        47.62     
99.99:      20447.23      5373.95      2293.76        47.24       
worst:      20447.23      5373.95      2293.76        47.24       
--------------------------------------------------------------------
-------------------------------- SUMMARY (client2server)------------
Percentile   run1         run2         run3      % Variation   
50:            16.13        19.97        16.90        10.81       
90:            26.11        27.14        26.11         2.55       
99:            37.89        37.89        35.84         3.67       
99.9:          71.68        92.16        54.27        31.76       
99.99:        286.72       573.44       200.70        55.32       
worst:        335.87      2555.90       901.12        55.04       
--------------------------------------------------------------------
-------------------------------- SUMMARY (server2clientProbe)-------
Percentile   run1         run2         run3      % Variation   
50:            16.13        18.94        16.13        10.43       
90:            27.14        27.14        25.09         5.16       
99:            39.94        37.89        35.84         3.67       
99.9:          75.78        71.68        50.18        22.22       
99.99:        319.49       868.35       225.28        65.55       
worst:        901.12      1736.70      1736.70         0.00       
--------------------------------------------------------------------

What should happen is that:

client2server + server2client ~= endToEnd

And this is more or less what happens at the 50th percentile.

Take the second run for the purposes of this demonstration:

19.97 + 18.94 ~= 41.98

If that was all you were measuring you might say that there would be no problem running 20k messages/s through my machine.

However my laptop clearly can't handle this volume; look at the second run again, this time at the 90th percentile:

27.14 + 27.14 !~= 868.35

And as you move up the percentiles it just gets worse and worse...

If, however, I change the throughput to 5k messages / second, I see this at the 90th percentile:

32.23 + 29.38 ~= 62.46

So we see that if you want to achieve low latencies at the high percentiles you have to drop your throughput to the correct level.

This is why it is so important that we are able to vary throughput with JLBH.

JLBH Examples 2 - Accounting for Coordinated Omission

In this post:

  • Running JLBH with and without accounting for coordinated omission
  • An example showing in numbers the effects of coordinated omission
  • A discussion about flow control


This is the example I used when describing what it would be like if you measured without accounting for coordinated omission:

Let's imagine you are waiting for a train and get delayed in the station for an hour because the train in front of you was late.  Let's then imagine you get on the train an hour late and the train takes its usual half an hour to reach its destination. If you don't account for coordinated omission you will not consider yourself to have suffered any delay, as your journey took exactly the correct amount of time, even though you waited for an hour at the station before departing!

But that's exactly what you do when you run a micro benchmark. You time each 'journey' and not the waiting time.

And the truth is, that's absolutely fine for a micro benchmark. But it is not fine when you want to measure the latency of an application.

By default JLBH measures end to end times accounting for coordinated omission, although you do have a setting to measure without accounting for it.

I wrote this simple benchmark to show just how dramatic an effect accounting for coordinated omission can have.

In this example after every 10k iterations we add in a millisecond latency:
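
Here's a sketch of such a task, using the run parameters visible in the output below (40k warm-up iterations, 1.1 million iterations per run at a target throughput of 100,000/s). The class name and the busy-wait used to simulate the one millisecond stall are illustrative; coordinatedOmission(false/true) in the text corresponds to the accountForCoordinatedOmission option here.

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;

public class CoordinatedOmissionDemo implements JLBHTask {

    private JLBH jlbh;
    private int count = 0;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
    }

    @Override
    public void run(long startTimeNS) {
        // every 10,000th iteration, stall for roughly one millisecond
        if (count++ % 10_000 == 0) {
            long end = System.nanoTime() + 1_000_000;
            while (System.nanoTime() < end) {
                // busy wait to simulate the stall
            }
        }
        jlbh.sample(System.nanoTime() - startTimeNS);
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(40_000)
                .iterations(1_100_000)
                .throughput(100_000)
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)   // set to false for the first profile below
                .jlbhTask(new CoordinatedOmissionDemo());
        new JLBH(options).start();
    }
}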



If you set coordinatedOmission(false) then you get this profile - as expected the millisecond delay can only be seen at the highest percentiles, from the 99.99th percentile upwards. Or to put it another way, it only affects one in every 10k iterations - not really surprising.

Warm up complete (40000 iterations took 0.046s)
-------------------------------- BENCHMARK RESULTS (RUN 1) -----------
Run time: 11.593s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.20 / 0.33  999 / 999 - 1,930
OS Jitter (14,986)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  68 / 1,080  3,210 - 4,330
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) -----------
Run time: 11.49s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.16 / 0.28  999 / 999 - 999
OS Jitter (13,181)                              50/90 99/99.9 99.99 - worst was 8.4 / 12  36 / 62  270 - 573
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) -----------
Run time: 11.494s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.16 / 0.26  999 / 999 - 1,030
OS Jitter (13,899)                              50/90 99/99.9 99.99 - worst was 8.4 / 13  42 / 76  160 - 541
----------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)-----------------
Percentile   run1         run2         run3      % Variation
50:             0.11         0.11         0.11         0.00  
90:             0.13         0.13         0.13         0.00  
99:             0.20         0.16         0.16         3.31    
99.9:           0.33         0.28         0.26         3.88   
99.99:        999.42       999.42       999.42         0.00  
99.999:       999.42       999.42       999.42         0.00    
worst:       1933.31       999.42      1032.19         2.14   

----------------------------------------------------------------------

But if you set coordinatedOmission(true) you see the true effect of this delay.

Warm up complete (40000 iterations took 0.044s)
-------------------------------- BENCHMARK RESULTS (RUN 1) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.17  385 / 1,930  4,590 / 5,370 - 5,370
OS Jitter (13,605)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  68 / 1,080  5,110 - 5,900
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.12 / 0.18  42 / 901  999 / 999 - 1,030
OS Jitter (13,156)                              50/90 99/99.9 99.99 - worst was 8.4 / 13  38 / 68  209 - 467
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.12 / 0.18  46 / 901  999 / 999 - 999
OS Jitter (13,890)                              50/90 99/99.9 99.99 - worst was 8.4 / 14  44 / 80  250 - 1,870
----------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)-----------------
Percentile   run1         run2         run3      % Variation   
50:             0.11         0.12         0.12         0.00       
90:             0.17         0.18         0.18         0.00       
99:           385.02        41.98        46.08         6.11       
99.9:        1933.31       901.12       901.12         0.00       
99.99:       4587.52       999.42       999.42         0.00       
99.999:      5373.95       999.42       999.42         0.00       
worst:       5373.95      1032.19       999.42         2.14       

----------------------------------------------------------------------

In fact one in every hundred (not one in 10,000) iterations is affected to some degree.  You can also see the progressive effect of the latency as you head up the percentiles.

This clearly demonstrates in numbers why coordinated omission must be a significant part of your benchmarking, especially if you can't exert flow control in your program. Flow control is the ability to stop consuming if you are not keeping up, e.g. bumping users off your site if you are too busy.  FIX engines cannot exert flow control, i.e. you can't tell the market to slow down because you can't keep up! Programs that exert flow control are consumer centric, whilst programs that don't exert flow control are producer centric.

Accounting for coordinated omission goes hand in hand with being able to measure latency at a defined throughput, which is something we will look at in the next example.




JLBH Examples 1 - Why Code Should be Benchmarked in Context

In this post:

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context


In the last post, 'Introducing JLBH', we introduced JLBH, the latency testing tool that Chronicle used to test Chronicle-FIX and which is now available as open source.

In the next few posts we are going to look at some example applications.

All the code for the examples can be found here in my GitHub project.

One of the points I made in the introduction to JLBH was that it is important to benchmark code in context. That means benchmarking code in an environment as close to how it will be running in real life as possible. This post demonstrates this in practice.

Let's look at a relatively expensive Java operation - Date Serialisation - and see how long it takes:

First here's a JMH benchmark:
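
The exact benchmark (DateSerialiseJMH) is in the GitHub repo; the sketch below keeps the same JMH structure reflected in the results that follow (sample mode, microsecond output), but uses a plain ObjectOutputStream round trip of a java.util.Date as a stand-in for the serialisation being timed.

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Date;
import java.util.concurrent.TimeUnit;

public class DateSerialiseJMH {

    @Benchmark
    @BenchmarkMode(Mode.SampleTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public byte[] test() throws IOException {
        // stand-in operation: serialise a Date to a byte array
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new Date());
        }
        return bytes.toByteArray();    // returned so the JIT cannot eliminate the work
    }

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(DateSerialiseJMH.class.getSimpleName())
                .warmupIterations(3)
                .measurementIterations(3)
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}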



Running on my laptop (MBP i7) these are the results I get:

Result "test":
  4.578 ±(99.9%) 0.046 us/op [Average]
  (min, avg, max) = (3.664, 4.578, 975.872), stdev = 6.320
  CI (99.9%): [4.533, 4.624] (assumes normal distribution)
  Samples, N = 206803
        mean =      4.578 ±(99.9%) 0.046 us/op
         min =      3.664 us/op
  p( 0.0000) =      3.664 us/op
  p(50.0000) =      4.096 us/op
  p(90.0000) =      5.608 us/op
  p(95.0000) =      5.776 us/op
  p(99.0000) =      8.432 us/op
  p(99.9000) =     24.742 us/op
  p(99.9900) =    113.362 us/op
  p(99.9990) =    847.245 us/op
  p(99.9999) =    975.872 us/op
         max =    975.872 us/op


# Run complete. Total time: 00:00:21

Benchmark                Mode     Cnt  Score   Error  Units

DateSerialiseJMH.test  sample  206803  4.578 ± 0.046  us/op

A mean time of 4.5us for the operation.

We get pretty much the same results running the test with JLBH:
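
Again, the original source is in the repo; this sketch wraps the same stand-in operation in a JLBHTask, with the run parameters taken from the output below (400k warm-up iterations, 1 million iterations per run at 100,000/s, 3 runs).

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Date;

public class DateSerialiseJLBH implements JLBHTask {

    private JLBH jlbh;
    private byte[] blackhole;   // keeps the result alive so the work cannot be optimised away

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
    }

    @Override
    public void run(long startTimeNS) {
        blackhole = serialiseDate();                     // the operation under test
        jlbh.sample(System.nanoTime() - startTimeNS);    // end to end time for this iteration
    }

    // stand-in operation, same as the JMH sketch above
    private static byte[] serialiseDate() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Date());
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(400_000)
                .iterations(1_000_000)
                .throughput(100_000)
                .runs(3)
                .recordOSJitter(true)
                .accountForCoordinatedOmission(true)
                .jlbhTask(new DateSerialiseJLBH());
        new JLBH(options).start();
    }
}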


These are the results:

Warm up complete (400000 iterations took 2.934s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  352 / 672  803 / 901 - 934
OS Jitter (13,939)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  639 / 4,130  12,850 - 20,450
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 705  836 / 934 - 967
OS Jitter (11,016)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  606 / 770  868 - 1,340
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 737  901 / 999 - 1,030
OS Jitter (12,319)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  573 / 737  803 - 901
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:             4.22         4.22         4.22         0.00    
90:             5.76         5.76         5.76         0.00    
99:           352.26       434.18       434.18         0.00    
99.9:         671.74       704.51       737.28         3.01    
99.99:        802.82       835.58       901.12         4.97    
worst:        901.12       933.89       999.42         4.47    
--------------------------------------------------------------------

A mean time of 4.2us for the operation.

Note: This is a case where there is no advantage to using JLBH over JMH.  I just include the code as a comparison.

Now we're going to run exactly the same operation but inside a TCP call. The code will work like this:
  1. The client sends the server a FIX message over TCP loopback (localhost)
  2. Server reads the message
  3. Server does the date serialisation
  4. Server returns a message to the client
As explained in the previous post, JLBH allows us to produce a latency profile for any part of the code.  We will add a probe for stage 3.
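
The full benchmark is in the repo; the fragment below shows just the server-side piece that matters here, registering the probe and sampling it around stage 3 (the TCP plumbing for stages 1, 2 and 4 follows the same pattern as the loopback benchmark in the previous post). The class name and the stand-in serialisation are illustrative.

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.util.NanoSampler;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.util.Date;

// Server-side fragment: register the probe once, then sample it around stage 3.
public class DateSerialiseServerProbe {

    private final NanoSampler serialisationProbe;

    DateSerialiseServerProbe(JLBH jlbh) {
        // this name is what appears as "date serialisation" in the results below
        this.serialisationProbe = jlbh.addProbe("date serialisation ");
    }

    // called once per incoming message, after the server has read it off the socket
    byte[] onMessage(byte[] fixMessage) {
        long start = System.nanoTime();
        byte[] serialisedDate = serialiseDate();                        // stage 3 - the code under test
        serialisationProbe.sampleNanos(System.nanoTime() - start);      // latency profile for stage 3 only
        return serialisedDate;                                          // stage 4 - written back to the client
    }

    // stand-in operation, same as in the earlier sketches
    private static byte[] serialiseDate() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new Date());
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }
}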



This time the results look like this:

Warm up complete (50000 iterations took 3.83s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ------------------------
Run time: 6.712s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 822,080 / 1,509,950  1,711,280 / 1,711,280  1,711,280 - 1,711,280
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  31 / 50  901 - 2,420
OS Jitter (64,973)                              50/90 99/99.9 99.99 - worst was 8.1 / 16  40 / 1,540  4,850 - 18,350
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 6.373s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 1,107,300 / 1,375,730  1,375,730 / 1,375,730  1,375,730 - 1,375,730
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  29 / 52  901 - 1,670
OS Jitter (40,677)                              50/90 99/99.9 99.99 - worst was 8.4 / 16  34 / 209  934 - 1,470
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 5.333s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 55,570 / 293,600  343,930 / 343,930  343,930 - 343,930
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 9.0 / 16  26 / 38  770 - 1,030
OS Jitter (32,042)                              50/90 99/99.9 99.99 - worst was 9.0 / 13  22 / 58  737 - 934
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:        822083.58   1107296.26     55574.53        92.66    
90:       1509949.44   1375731.71    293601.28        71.07    
99:       1711276.03   1375731.71    343932.93        66.67    
99.9:     1711276.03   1375731.71    343932.93        66.67    
99.99:    1711276.03   1375731.71    343932.93        66.67    
worst:    1711276.03   1375731.71    343932.93        66.67    
--------------------------------------------------------------------
-------------------------------- SUMMARY (date serialisation )------
Percentile   run1         run2         run3      % Variation   
50:            11.01        11.01         8.96        13.22    
90:            18.94        18.94        15.62        12.44    
99:            31.23        29.18        26.11         7.27    
99.9:          50.18        52.22        37.89        20.14    
99.99:        901.12       901.12       770.05        10.19    
worst:       2424.83      1671.17      1032.19        29.21    

--------------------------------------------------------------------

As can be seen, the very same Date serialisation takes over twice as long, from ~4.5us to ~10us.

This isn't the place to go into too much detail about why the code takes longer to run in context, but it's to do with CPU caches getting filled in between calls to the date serialisation.

When all we are running (as in the micro benchmark) is the Date serialisation, it fits nicely into a CPU cache and never needs to be cleared out.  However, when there is a gap between calls to the Date serialisation, the code for the operation gets cleared out and needs to be reloaded.

JLBH allows you to benchmark code in context and that's an important part of latency benchmarking.