Tuesday 24 October 2017

Allocation free logging with log4j2

Introduction

Recently I was working for a client, trying to remove some GC pauses from a large, well crafted Java system. After profiling I realised that most of the garbage was being produced by logging!! Was there a simple, unobtrusive way to remove all that allocation?  It turns out there was :)

Which framework should I use for GC free logging?


Let's turn our attention back to the age-old question of which logging framework to use.

The usual candidates are
  • log4j2
  • logback
  • java.util.logging
  • slf4j (simple)
and unless you are doing something super funky you might not think it makes a huge difference which one you choose.

There have been many studies done comparing performance, and of course this can be important if you are playing in the ultra low latency arena (see Benchmarking Java logging frameworks).


But what I'm concerned about is allocation!!

See my earlier post on the evils of allocation, The First Rule of Performance Optimisation.

In a typical system 30%-50% of all allocation can be in logging!!

So even if most people don't mind logging taking a couple of milliseconds longer in one framework than another, they will almost certainly care about long GC pauses caused by the garbage created by the logging framework.

And all this logging allocation can be removed by simple configuration.

Log4j2 from 2.6 onwards is allocation free, so there's really no excuse not to take advantage of it :)
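For the record, garbage-free mode is enabled by default in standalone applications from 2.6 onwards, while web applications have to opt in. These are the two switches documented on the 2.6-era garbage-free logging page (check the documentation for your exact version):

# JVM flags controlling log4j2's garbage-free mode (2.6+)
-Dlog4j2.enable.threadlocals=true
-Dlog4j2.enable.direct.encoders=true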

Let's see it in action

It's one thing making the allocation free claim, but let's see if it holds true in practice.

Consider the simple logging code below. Note that I've used slf4j in the code and therefore only need to change the configuration to run with the different logging frameworks.  It's a good idea to do this because although log4j2 might be the best framework at the moment, who knows what tomorrow will bring...
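The original snippet isn't reproduced here, so below is a minimal sketch of the kind of loop involved, written against the slf4j API (the class name and message are purely illustrative):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative stand-in for the benchmark code: log in a tight loop so that
// any allocation done by the logging framework dominates the profile.
public class LoggingBenchmark {
    // slf4j API only - the framework underneath is chosen by the classpath
    private static final Logger LOGGER = LoggerFactory.getLogger(LoggingBenchmark.class);

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            // A constant message: note that logging a primitive through slf4j,
            // e.g. LOGGER.debug("count={}", i), would autobox and so allocate
            // whichever framework is underneath.
            LOGGER.debug("some constant message to exercise the appenders");
            Thread.sleep(1); // stop the loop saturating a core
        }
    }
}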



Run with log4j2 - no allocation

If we configure the program to run with log4j2, using this Maven configuration:

<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.25</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-api</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.logging.log4j</groupId>
    <artifactId>log4j-slf4j-impl</artifactId>
    <version>2.9.1</version>
</dependency>

And we use this log4j2 configuration:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="INFO">
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{HH:mm:ss:SSS} [%t] %-5level %logger{36} %msg%n"/>
        </Console>
        <File name="MyFile" fileName="all.log" immediateFlush="false" append="false">
            <PatternLayout pattern="%d{dd MMM yyyy HH:mm:ss,SSS} [%t] %-5level %logger{36} %msg%n"/>
        </File>
    </Appenders>
    <Loggers>
        <Root level="debug">
            <AppenderRef ref="MyFile"/>
        </Root>
    </Loggers>
</Configuration>

Then we get 0 allocation!

We can prove this by running Flight Recorder on the program (see below):

The only memory being allocated is due to Flight Recorder itself (by the way, FR can be configured so that this doesn't happen).
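For reference, a recording like the one above can be started from the command line roughly as follows on an Oracle JDK 8 of that era (the class name is the illustrative one from the sketch above; adjust duration and filename to taste):

java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder \
     -XX:StartFlightRecording=duration=60s,filename=logging.jfr \
     LoggingBenchmark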



Run using logback

Try exactly the same thing using Logback.

Use this Maven configuration:

<dependency>
    <groupId>ch.qos.logback</groupId>
    <artifactId>logback-classic</artifactId>
    <version>1.2.3</version>
</dependency>

Use this logback configuration:

<configuration>
    <appender name="FILE" class="ch.qos.logback.core.FileAppender">
        <file>myApp.log</file>
        <encoder>
            <pattern>%date %level [%thread] %logger{10} [%file:%line] %msg%n</pattern>
        </encoder>
    </appender>
    <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%msg%n</pattern>
        </encoder>
    </appender>
    <root level="debug">
        <appender-ref ref="FILE" />
        
    </root>
</configuration>

And when we run with Flight Recorder we see a huge amount of allocation!


But a word of warning....

You need to use the log4j2 configuration exactly as specified in the documentation (see the supported layouts).  If you change even the date format slightly, allocation will again be rampant.

Summary

  • Use slf4j so that you can easily change your logging implementation
  • Use log4j2 to avoid allocation
  • Make sure you stick to the supported layouts so that logging remains allocation free

Friday 14 October 2016

Dell XPS13 Dual boot Windows 10 with Ubuntu - The Definitive Guide

Excuse a slight departure from the normal subject of this blog.

So I've just got a brand new XPS13 (i7-7500U CPU @ 2.70 GHz - Aug 16) - a lovely machine if you want something ultra portable.  I specifically bought the 512GB version so that I could comfortably install Linux to dual boot with the pre-installed Windows 10. (Btw you do have the option of buying the developer edition which comes pre-installed with Ubuntu, but I wanted both Windows and Ubuntu and reckoned it was easier doing it this way round.)

Having set up dual boot many times before I was not expecting this to be too difficult, but I went through hours of pain, dead ends, helpful and unhelpful Q&A sites and YouTube tutorials before I finally managed to get this working. I even completely corrupted my Windows install thanks to an unhelpful answer on askubuntu and ended up having to reinstall Windows :(

To save everyone else the pain, this is the definitive way to get dual boot (Windows 10 / Ubuntu 16.04) working on this Dell XPS machine. Hope you find it helpful - if you have any questions or comments please leave them below. As always, it's not that hard once you know how :)

Firstly you need to prepare your machine:

1. Run Command Prompt as Admin
2. Invoke a Safe Mode boot with the command: bcdedit /set {current} safeboot minimal
3. Restart the PC and enter your BIOS during bootup.
4. Change from IDE to AHCI mode then Save & Exit.
5. Windows 10 will launch in Safe Mode.
6. Right click the Window icon and select to run the Command Prompt in Admin mode from among the various options.
7. Cancel Safe Mode booting with the command: bcdedit /deletevalue {current} safeboot
8. Restart your PC once more and this time it will boot up normally but with AHCI mode activated.
9. Bask in the reflected glory of being a total Windows 10 God 

(Source of this information)

Spend 8 minutes watching this video but don't do anything just yet:


Next as per the video:

  • prepare your Linux distro on a USB - I used Ubuntu 16.04.
  • shrink your drive by the required amount for your Ubuntu distribution.


Next we need to effect some changes to the bios:

Restart the machine and press F12 when you see the Dell logo. You will be able to go to 'bios settings'. In the bios settings go to General -> Advanced Boot Options and click on 'Enable UEFI Network Stack'. Save and Exit.

Reboot the machine.

Make sure you have your flash drive in your machine at this point and re-enter the bios settings.  This time go to General -> Boot Sequence and click on 'Add Boot Option'.  Enter Ubuntu as the 'Boot Option Name' and then in 'File Name' click on EFI/boot/grub64.

You will see a new entry at the top of the screen under Windows Boot Manager called ubuntu.  Change the order so that Ubuntu is called before Windows Boot Manager. 'Apply' and save your settings and then Exit.

When the machine reboots you should be presented with the Grub menu, from where you can install Ubuntu as per the rest of the instructions in the video.

Once Ubuntu is installed you can reboot and choose whether to go into windows or ubuntu as you would in a normal dual boot environment.

One thing you will notice is that the default screen resolution on Ubuntu makes everything pretty tiny, so you might want to change the display resolution to 2048x1152 before continuing.

Monday 18 April 2016

A Series of Posts on JLBH (Java Latency Benchmark Harness)

Introducing a set of 5 blog posts on JLBH

This set of posts should be useful for anyone responsible for creating or benchmarking real Java applications.

As well as introducing the open source JLBH benchmarking harness we go into some of the subtleties that underpin latency benchmarking in Java with real code examples.


1) JLBH - Introducing Java Latency Benchmarking Harness

  • What is JLBH
  • What was the motivation for JLBH
  • Differences between JMH and JLBH
  • Quick start guide

2) JLBH Examples 1 - Why Code Should be Benchmarked in Context

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context

3) JLBH Examples 2 - Accounting for Coordinated Omission

  • Running JLBH with and without accounting for coordinated omission
  • An example showing, in numbers, the effects of coordinated omission
  • A discussion about flow control

4) JLBH Examples 3 - The Effects of Throughput on Latency

  • A discussion about the effects of throughput on latency
  • How to use JLBH to measure TCP loopback
  • Adding probes to test both halves of the TCP round trip
  • Watching the effect of increasing throughput on latency
  • Understanding that you have to drop throughput to achieve good latencies at high percentiles

5) JLBH Examples 4 - Benchmarking QuickFix vs ChronicleFix

  • Using JLBH to benchmark QuickFIX
  • Observing how QuickFIX latencies degrade through the percentiles
  • Comparing QuickFIX with Chronicle FIX

JLBH Examples 4 - Benchmarking QuickFix vs ChronicleFix

In this post:

  • Using JLBH to test QuickFIX
  • Observing how QuickFix latencies degrade through the percentiles
  • Comparing QuickFIX with Chronicle FIX


As mentioned in the Introduction to JLBH, the primary reason JLBH was created was to measure the Chronicle-FIX engine.

We use all the features of JLBH, particularly the throughput lever and the accounting for coordinated omission, to get some realistic timings for QuickFIX.

Later in this article we will look at some results for Chronicle-FIX, but first let's look at benchmarking QuickFIX, the open source implementation of a FIX engine.

This is the scenario we will be benchmarking:
  • The client creates a NewOrderSingle which is then passed to the server.  
  • The server parses the NewOrderSingle 
  • The server creates an ExecutionReport which is sent back to the client.
  • The client receives the ExecutionReport
End to end time is measured from when the client starts creating the NewOrderSingle to when the client receives the ExecutionReport. 

Note: We need to carry the start time that the benchmark was called with right through the program. To do this we use a trick and stash the start time in the ClOrdId tag.
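Sketched out with the QuickFIX/J Message API, the trick looks something like this (my own sketch; the session plumbing is elided):

import quickfix.FieldNotFound;
import quickfix.Message;
import quickfix.field.ClOrdID;

// The benchmark start time rides along in tag 11 (ClOrdId), which the
// ExecutionReport echoes back to the client.
public class StartTimeCarrier {

    // Outbound: stash the JLBH start time in the ClOrdId of the NewOrderSingle.
    static void tagStartTime(Message newOrderSingle, long startTimeNS) {
        newOrderSingle.setString(ClOrdID.FIELD, Long.toString(startTimeNS));
    }

    // Inbound: recover the start time from the ExecutionReport so the client
    // can sample the end to end latency.
    static long recoverStartTime(Message executionReport) throws FieldNotFound {
        return Long.parseLong(executionReport.getString(ClOrdID.FIELD));
    }
}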

If you want to get the benchmark running on your servers you should clone this GitHub repo; all the jars and configuration files are set up there.

For the sake of this post here's the code for the benchmark.
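(The full listing isn't reproduced here; this skeleton is a sketch of its shape, with the FIX wiring elided. The throughput value matches the first run below; the other option values are illustrative.)

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;

public class QuickFixBenchmark implements JLBHTask {
    private JLBH jlbh;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        // start the FIX acceptor and initiator here
    }

    @Override
    public void run(long startTimeNS) {
        // create a NewOrderSingle with startTimeNS stashed in its ClOrdId
        // and send it; the callback that receives the ExecutionReport calls:
        //     jlbh.sample(System.nanoTime() - startTimeNS);
    }

    public static void main(String[] args) {
        JLBHOptions options = new JLBHOptions()
                .warmUpIterations(50_000)
                .iterations(20_000)
                .throughput(2_000)   // the lever varied below: 2k, 10k, 50k
                .runs(3)
                .jlbhTask(new QuickFixBenchmark());
        new JLBH(options).start();
    }
}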



These are the results I see running on my server, an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.

At a throughput of 2,000/s


Percentile   run1         run2         run3      % Variation
50:           270.34       270.34       233.47         9.52
90:           352.26       335.87      1867.78        75.25
99:          6684.67      4849.66     69206.02        89.84
99.9:       13369.34     12845.06    163577.86        88.67
99.99:      81788.93     20447.23    163577.86        82.35
worst:     111149.06     98566.14    163577.86        30.54

At a throughput of 10,000/s


Percentile   run1         run2         run3      % Variation
50:           184.32       176.13       176.13         0.00 
90:           573.44       270.34       249.86         5.18 
99:         19398.66      2686.98      5111.81        37.56 
99.9:       28835.84      7733.25      7995.39         2.21 
99.99:      30932.99      9699.33      9175.04         3.67 
worst:      30932.99      9699.33      9175.04         3.67 

The mean here is ~200us, but as you move up through the percentiles the latencies really begin to degrade.

A lot of this is due to the amount of garbage that is being created! You can see this by running the benchmark with the JVM flag -verbose:gc.

Note that the latencies for 10k/s are slightly better than at 2k/s. This is rather counter intuitive as we would expect latencies to increase as throughput increases, for example as we find with the 50k run below where the latencies are much worse. The explanation for the 10k / 2k discrepancy is that as throughput increases the CPU caches and other resources can stay hotter and therefore, up to a point, latencies actually get better as throughput increases.  This is another reason why it is so important to measure your latencies at a realistic throughput for your application. 

In fact when you up the throughput to 50,000/s even your 90th percentile (1 in every 10 iterations) is completely blown, and you end up with multi-millisecond delays.

At a throughput of 50,000/s

Percentile   run1         run2         run3      % Variation   var(log)
50:           176.13       176.13       176.13         0.00       11.82
90:         12845.06     29884.42      3604.48        82.94       21.01
99:         34603.01     94371.84     17301.50        74.81       25.26
99.9:       42991.62     98566.14     25690.11        65.41       25.84
99.99:      45088.77     98566.14     27787.26        62.94       25.93
worst:      45088.77     98566.14     27787.26        62.94       25.93

The issues here are not just the mean time (assuming that ~200us is too slow for you) but, even more worryingly, the way the numbers degrade as throughput is raised and as you study the higher percentiles.

Let's compare this with Chronicle-FIX.  The test is run for exactly the same scenario and on the same machine.

The results look like this:

At a throughput of 2,000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.90       0.00    
90:            18.94        18.94        18.94       0.00    
99:            26.11        30.21        23.04       17.18    
99.9:          35.84        39.94        33.79       10.81    
99.99:        540.67       671.74       401.41       65.41    
worst:        638.98      1081.34       606.21       61.59    

At a throughput of 10,000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.13         3.08     
90:            18.94        18.94        18.94         0.00     
99:            26.11        22.02        20.99         3.15     
99.9:          88.06        33.79        83.97        49.75     
99.99:        999.42       167.94       802.82        71.59     
worst:       1146.88       249.86       966.66        65.67     

At a throughput of 50,000/s


Percentile   run1         run2         run3      % Variation  
50:            15.62        15.10        15.62         2.21     
90:            17.92        16.90        16.90         0.00        
99:            22.02        30.21        29.18         2.29        
99.9:         120.83       352.26        33.79        86.27       
99.99:        335.87       802.82        96.26        83.03       
worst:        450.56       901.12       151.55        76.73       

The Chronicle-FIX mean is around 16us, which is 12 times faster than QuickFIX.

But it's more than that, because nearly all of that time is spent in the TCP round trip.  When you measure the TCP round trip on its own (see the last post, JLBH Examples 3 - The Effects of Throughput on Latency) it turns out to be ~10us.

Therefore if you deduct the TCP times you get:
  • QuickFIX:       200 - 10 = 190us
  • Chronicle-FIX:   16 - 10 = 6us
  • Chronicle-FIX is over 30 times faster than QuickFIX
And as has been demonstrated, if you care about your upper percentiles it gets much worse than that.

For completeness it should be noted that the server this was benchmarked on was reasonably noisy.  It had latency spikes of around 400us, which accounts for the larger numbers showing up in the higher percentiles.

Also, this test uses loopback TCP, which puts a huge stress on the Linux kernel.  In fact strange things happen when you push the throughput up really high (you can try this with the simple TCP test), so this is not the best way to test Chronicle-FIX; it is only used as a comparison against QuickFIX.

Using Chronicle-FIX, if you measure the parsing of a FIX message into its data model, including logging, on a tuned server, you actually see this profile, tested across throughputs from 10,000/s to 200,000/s:
Percentile   run1         run2         run3         run4         run5
50:             1.01         1.01         1.01         1.01         1.06
90:             1.12         1.12         1.12         1.12         1.12
99:             1.38         1.31         1.44         1.31         2.11
99.9:           2.88         2.88         2.88         2.88         4.03
99.99:          3.26         3.14         3.39         3.14         6.02
worst:          5.25         6.27        22.02        20.99        18.94


Friday 15 April 2016

JLBH Examples 3 - The Effects of Throughput on Latency

In this post:

  • A discussion about the effects of throughput on latency
  • How to use JLBH to measure TCP loopback
  • Adding probes to test both halves of the TCP round trip
  • Watching the effect of increasing throughput on latency
  • Understanding that you have to drop throughput to achieve good latencies at high percentiles.


In the last post we saw the effects of accounting for coordinated omission, i.e. measuring the effect that a delay to one iteration has on subsequent iterations.

Intuitively we understand that throughput will affect latency. It seems natural that as we raise throughput we will also raise latency. 

Going into a very crowded shop will affect how fast you can select and purchase your goods. On the other hand, consider a very rarely frequented shop. It could be that in such a shop the shopkeeper is away from the till having a tea break, and your purchase will be delayed while you wait for him to put down his cup of tea and make his way to the counter where he can serve you.

This is exactly what you find when you run benchmarks and vary the throughput.

In general, as you raise throughput latency increases, but once throughput drops below a certain threshold latency can increase as well.

The code below times a round trip TCP call over loopback.

We add two probes:

  • client2server - the amount of time taken to complete the first half of the round trip
  • server2client - the amount of time taken to complete the second half of the round trip

These probes do not take coordinated omission into account - it is only the end to end time that takes coordinated omission into account.

Here's the code for the benchmark:
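(A sketch of its shape, showing just the probe wiring; the loopback socket plumbing is elided, and NanoSampler here is Chronicle-Core's sampler interface.)

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;
import net.openhft.chronicle.core.util.NanoSampler;

// The two probes time each half of the TCP round trip; jlbh.sample() records
// the end to end time, which is the only measure adjusted for coordinated
// omission.
public class TcpLoopbackBenchmark implements JLBHTask {
    private JLBH jlbh;
    private NanoSampler client2server;
    private NanoSampler server2client;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
        this.client2server = jlbh.addProbe("client2server");
        this.server2client = jlbh.addProbe("server2clientProbe");
        // open the loopback ServerSocket / Socket pair here
    }

    @Override
    public void run(long startTimeNS) {
        // write a message to the server; when the server reads it:
        //     client2server.sampleNanos(serverReceiveTimeNS - startTimeNS);
        // when the echo arrives back at the client:
        //     server2client.sampleNanos(clientReceiveTimeNS - serverSendTimeNS);
        //     jlbh.sample(clientReceiveTimeNS - startTimeNS);
    }

    public static void main(String[] args) {
        new JLBH(new JLBHOptions()
                .warmUpIterations(50_000)
                .iterations(50_000)
                .throughput(20_000)   // drop this to 5,000 and watch the percentiles recover
                .runs(3)
                .jlbhTask(new TcpLoopbackBenchmark())).start();
    }
}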


And here are the results when run at a throughput of 20,000 iterations / second:

Warm up complete (50000 iterations took 2.296s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 34 / 2,950  19,400 / 20,450  20,450 - 20,450
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 16 / 26  38 / 72  287 - 336
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 16 / 27  40 / 76  319 - 901
OS Jitter (26,960)                              50/90 99/99.9 99.99 - worst was 9.0 / 16  44 / 1,340  10,220 - 11,800
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 42 / 868  4,590 / 5,110  5,370 - 5,370
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 20 / 27  38 / 92  573 - 2,560
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 19 / 27  38 / 72  868 - 1,740
OS Jitter (13,314)                              50/90 99/99.9 99.99 - worst was 9.0 / 16  32 / 96  303 - 672
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 2.5s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (50,000)                            50/90 99/99.9 99.99 - worst was 34 / 152  999 / 2,160  2,290 - 2,290
client2server (50,000)                          50/90 99/99.9 99.99 - worst was 17 / 26  36 / 54  201 - 901
server2clientProbe (50,000)                     50/90 99/99.9 99.99 - worst was 16 / 25  36 / 50  225 - 1,740
OS Jitter (14,306)                              50/90 99/99.9 99.99 - worst was 9.0 / 15  23 / 44  160 - 184
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation
50:            33.79        41.98        33.79        13.91
90:          2949.12       868.35       151.55        75.92
99:         19398.66      4587.52       999.42        70.53
99.9:       20447.23      5111.81      2162.69        47.62
99.99:      20447.23      5373.95      2293.76        47.24
worst:      20447.23      5373.95      2293.76        47.24
--------------------------------------------------------------------
-------------------------------- SUMMARY (client2server)------------
Percentile   run1         run2         run3      % Variation   
50:            16.13        19.97        16.90        10.81       
90:            26.11        27.14        26.11         2.55       
99:            37.89        37.89        35.84         3.67       
99.9:          71.68        92.16        54.27        31.76       
99.99:        286.72       573.44       200.70        55.32       
worst:        335.87      2555.90       901.12        55.04       
--------------------------------------------------------------------
-------------------------------- SUMMARY (server2clientProbe)-------
Percentile   run1         run2         run3      % Variation   
50:            16.13        18.94        16.13        10.43       
90:            27.14        27.14        25.09         5.16       
99:            39.94        37.89        35.84         3.67       
99.9:          75.78        71.68        50.18        22.22       
99.99:        319.49       868.35       225.28        65.55       
worst:        901.12      1736.70      1736.70         0.00       
--------------------------------------------------------------------

What should happen is that:

client2server + server2client ~= endToEnd

And this is more or less what happens at the 50th percentile.

Take the second run for the purposes of this demonstration:

19.97 + 18.94 ~= 41.98

If that was all you were measuring you might say that there would be no problem running 20k messages/s through my machine.

However my laptop clearly can't handle this volume. Look at the second run again, this time at the 90th percentile:

27.14 + 27.14 !~= 868.35

And as you move up the percentiles it just gets worse and worse...

If however I change the throughput to 5k messages/second, I see this at the 90th percentile:

32.23 + 29.38 ~= 62.46

So we see that if you want to achieve low latencies at the high percentiles you have to drop your throughput to the correct level.

This is why it is so important that we are able to vary throughput with JLBH.

JLBH Examples 2 - Accounting for Coordinated Omission

In this post:

  • Running JLBH with and without accounting for coordinated omission
  • An example showing, in numbers, the effects of coordinated omission
  • A discussion about flow control


This is the example I used when describing what it would be like if you measured without accounting for coordinated omission:

Let's imagine you are waiting for a train and get delayed in the station for an hour because the train in front of you was late.  Let's then imagine you get on the train an hour late and the train takes its usual half an hour to reach its destination. If you don't account for coordinated omission you will not consider yourself to have suffered any delay, as your journey took exactly the correct amount of time, even though you waited for an hour at the station before departing!

But that's exactly what you do when you run a micro benchmark. You time each 'journey' and not the waiting time.

And the truth is, that's absolutely fine for a micro benchmark. But it is not fine when you want to measure the latency of an application.

By default JLBH measures end to end times accounting for coordinated omission although you do have a setting to measure it without accounting for coordinated omission. 

I wrote this simple benchmark to show just how dramatic an effect accounting for coordinated omission can have.

In this example after every 10k iterations we add in a millisecond latency:
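(Sketch reconstruction of that benchmark as a JLBH task; note that in the Chronicle-Core versions I've seen, the coordinated omission setter on JLBHOptions is spelled accountForCoordinatedOmmission.)

import net.openhft.chronicle.core.jlbh.JLBH;
import net.openhft.chronicle.core.jlbh.JLBHOptions;
import net.openhft.chronicle.core.jlbh.JLBHTask;

// Every 10,000th iteration stalls for ~1ms; when coordinated omission is
// accounted for, that stall is charged to every iteration it delays.
public class CoordinatedOmissionDemo implements JLBHTask {
    private JLBH jlbh;
    private int count;

    @Override
    public void init(JLBH jlbh) {
        this.jlbh = jlbh;
    }

    @Override
    public void run(long startTimeNS) {
        if (++count % 10_000 == 0) {
            long end = System.nanoTime() + 1_000_000; // ~1ms stall
            while (System.nanoTime() < end) {
                // busy spin rather than sleep, to keep the stall precise
            }
        }
        jlbh.sample(System.nanoTime() - startTimeNS);
    }

    public static void main(String[] args) {
        new JLBH(new JLBHOptions()
                .warmUpIterations(40_000)
                .iterations(1_100_000)
                .throughput(100_000)
                .runs(3)
                .accountForCoordinatedOmmission(false) // flip to true for the second set of results
                .jlbhTask(new CoordinatedOmissionDemo())).start();
    }
}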



If you set coordinatedOmission(false) then you get this profile - as expected the millisecond delay can only be seen at the highest percentiles, from the 99.99th percentile upwards. Or to put it another way, it only affects one in every 10k iterations - not really surprising.

Warm up complete (40000 iterations took 0.046s)
-------------------------------- BENCHMARK RESULTS (RUN 1) -----------
Run time: 11.593s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.20 / 0.33  999 / 999 - 1,930
OS Jitter (14,986)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  68 / 1,080  3,210 - 4,330
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) -----------
Run time: 11.49s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.16 / 0.28  999 / 999 - 999
OS Jitter (13,181)                              50/90 99/99.9 99.99 - worst was 8.4 / 12  36 / 62  270 - 573
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) -----------
Run time: 11.494s
Correcting for co-ordinated:false
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.13  0.16 / 0.26  999 / 999 - 1,030
OS Jitter (13,899)                              50/90 99/99.9 99.99 - worst was 8.4 / 13  42 / 76  160 - 541
----------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)-----------------
Percentile   run1         run2         run3      % Variation
50:             0.11         0.11         0.11         0.00  
90:             0.13         0.13         0.13         0.00  
99:             0.20         0.16         0.16         3.31    
99.9:           0.33         0.28         0.26         3.88   
99.99:        999.42       999.42       999.42         0.00  
99.999:       999.42       999.42       999.42         0.00    
worst:       1933.31       999.42      1032.19         2.14   

----------------------------------------------------------------------

But if you set coordinatedOmission(true) you see the true effect of this delay.

Warm up complete (40000 iterations took 0.044s)
-------------------------------- BENCHMARK RESULTS (RUN 1) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.11 / 0.17  385 / 1,930  4,590 / 5,370 - 5,370
OS Jitter (13,605)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  68 / 1,080  5,110 - 5,900
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.12 / 0.18  42 / 901  999 / 999 - 1,030
OS Jitter (13,156)                              50/90 99/99.9 99.99 - worst was 8.4 / 13  38 / 68  209 - 467
----------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) -----------
Run time: 11.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,100,000)                         50/90 99/99.9 99.99/99.999 - worst was 0.12 / 0.18  46 / 901  999 / 999 - 999
OS Jitter (13,890)                              50/90 99/99.9 99.99 - worst was 8.4 / 14  44 / 80  250 - 1,870
----------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)-----------------
Percentile   run1         run2         run3      % Variation   
50:             0.11         0.12         0.12         0.00       
90:             0.17         0.18         0.18         0.00       
99:           385.02        41.98        46.08         6.11       
99.9:        1933.31       901.12       901.12         0.00       
99.99:       4587.52       999.42       999.42         0.00       
99.999:      5373.95       999.42       999.42         0.00       
worst:       5373.95      1032.19       999.42         2.14       

----------------------------------------------------------------------

In fact one in every hundred (not one in 10,000) iterations is affected to some degree.  You can also see the progressive effect of the latency as you move up the percentiles.

This clearly demonstrates in numbers why accounting for coordinated omission must be a significant part of your benchmarking, especially if you can't exert flow control in your program. Flow control is the ability to stop consuming if you are not keeping up, e.g. bumping users off your site if you are too busy.  FIX engines cannot exert flow control, i.e. you can't tell the market to slow down because you can't keep up! Programs that exert flow control are consumer centric, whilst programs that don't exert flow control are producer centric.

Accounting for coordinated omission goes hand in hand with being able to measure latency at a defined throughput, which is something we will look at in the next example.