Monday, 18 April 2016

JLBH Examples 4 - Benchmarking QuickFix vs ChronicleFix

In this post:

  • Using JLBH to test QuickFIX
  • Observing how QuickFix latencies degrade through the percentiles
  • Comparing QuickFIX with Chronicle FIX


As mentioned in the Introduction to JLBH the primary reason JLBH was created was to measure the Chronicle-FIX engine.

We use all the features of JLBH, particularly the throughput lever and the accounting for coordinated omission to get some realistic timings for QuickFIX.

Later in this article we will look at some results for ChronicleFIX but first let's look at benchmarking QuickFix the open source implementation of a Fix Engine. 

This is the scenario we will be benchmarking:
  • The client creates a NewOrderSingle which is then passed to the server.  
  • The server parses the NewOrderSingle 
  • The server creates an ExecutionReport which is send back to the client.
  • The client receives the ExecutionReport
End to end time is measured from when the client starts creating the NewOrderSingle to when the client receives the ExecutionReport. 

Note: We need to keep the start time that the benchmark was called with right the the program. In order to do this we use a trick and set the start time to the tag ClOrdId.

If you want to get the benchmark running on your servers you should clone this GitHub repo all jars and configurations files are set up there.

For the sake of this post here's the code for the benchmark.



These are the results that I see running on my server Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz.

At a throughput of 2,000/s


Percentile   run1         run2         run3      % Variation
50:           270.34       270.34       233.47         9.52
90:           352.26       335.87      1867.78        75.25
99:          6684.67      4849.66     69206.02        89.84
99.9:       13369.34     12845.06    163577.86        88.67
99.99:      81788.93     20447.23    163577.86        82.35
worst:     111149.06     98566.14    163577.86        30.54

At a throughput of 10,000/s


Percentile   run1         run2         run3      % Variation
50:           184.32       176.13       176.13         0.00 
90:           573.44       270.34       249.86         5.18 
99:         19398.66      2686.98      5111.81        37.56 
99.9:       28835.84      7733.25      7995.39         2.21 
99.99:      30932.99      9699.33      9175.04         3.67 
worst:      30932.99      9699.33      9175.04         3.67 

The mean here is ~200us but as you up through the percentiles the latencies really begin to degrade.

A lot of this is due to the amount of garbage that is being created! You can see this by running the benchmark with the jvm flag -verbosegc.

Note that the latencies for 10k/s are slightly better than at 2k/s. This is rather counter intuitive as we would expect latencies to increase as throughput increases, for example as we find with the 50k run below where the latencies are much worse. The explanation for the 10k / 2k discrepancy is that as throughput increases the CPU caches and other resources can stay hotter and therefore, up to a point, latencies actually get better as throughput increases.  This is another reason why it is so important to measure your latencies at a realistic throughput for your application. 

In fact when you up the throughput to 50,000/s even your 90th percentile (1 in every 10 iterations) is completely blown and you end up with multi millisecond delays.

At a throughput of 50,000/s

Percentile   run1         run2         run3      % Variation   var(log)
50:           176.13       176.13       176.13         0.00       11.82
90:         12845.06     29884.42      3604.48        82.94       21.01
99:         34603.01     94371.84     17301.50        74.81       25.26
99.9:       42991.62     98566.14     25690.11        65.41       25.84
99.99:      45088.77     98566.14     27787.26        62.94       25.93
worst:      45088.77     98566.14     27787.26        62.94       25.93

The issues here are not just the mean time (assuming that ~200us is too slow for you) but even more worryingly the way the numbers degrade as throughput is raised and as you study the higher percentiles.

Let's compare this Chronicle-FIX.  The test is run for exactly the same scenario and on the same machine.

The results look like this:

At a throughput of 2000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.90       0.00    
90:            18.94        18.94        18.94       0.00    
99:            26.11        30.21        23.04       17.18    
99.9:          35.84        39.94        33.79       10.81    
99.99:        540.67       671.74       401.41       65.41    
worst:        638.98      1081.34       606.21       61.59    

At a throughput of 10,000/s

Percentile   run1         run2         run3      % Variation   
50:            16.90        16.90        16.13         3.08     
90:            18.94        18.94        18.94         0.00     
99:            26.11        22.02        20.99         3.15     
99.9:          88.06        33.79        83.97        49.75     
99.99:        999.42       167.94       802.82        71.59     
worst:       1146.88       249.86       966.66        65.67     
At a throughput of 50,000/s


Percentile   run1         run2         run3      % Variation  
50:            15.62        15.10        15.62         2.21     
90:            17.92        16.90        16.90         0.00        
99:            22.02        30.21        29.18         2.29        
99.9:         120.83       352.26        33.79        86.27       
99.99:        335.87       802.82        96.26        83.03       
worst:        450.56       901.12       151.55        76.73       
The Chronicle-FIX mean is around 16us which is 12 times faster than QuickFIX.

But it's more than that because nearly all that time is in the TCP round trip.  When you measure the TCP time (see last post JLBH Examples 3 - The Affects of Throughput on Latency) it turns out that most of that time is TCP ~10us.

Therefore if you deduct the TCP times you get 
  • QuickFix       200 - 10 = 190
  • Chronicle-FIX   16 - 10 = 6
  • Chronicle-FIX is over 30 times faster than QF
And as has been demonstrated if you care about your upper percentiles than it get much worse than that.
For completeness it should be noted that that the server this was benchmarked on was reasonably noisy.  It had latency spikes of around 400us which accounts for the larger numbers that are shown up in the higher percentiles.
Also this test uses loopback TCP which puts a huge stress on the Linux kernel.  In fact strange things happen when you push the throughput up really high (you can try this with the simple TCP test) - so this is not the best way to test Chronicle-FIX.  It is only used as a comparison against Quick FIX.
Using Chronicle-FIX, if you measure the parsing of a fix message into its data model, including logging, on a tuned server you actually see this profile tested across throughput profiles from 10,000/s to 200,000/s:
Percentile   run1         run2         run3         run4         run5     50:             1.01         1.01         1.01         1.01         1.06 90:             1.12         1.12         1.12         1.12         1.12         99:             1.38         1.31         1.44         1.31         2.11         99.9:           2.88         2.88         2.88         2.88         4.03       99.99:          3.26         3.14         3.39         3.14         6.02        worst:          5.25         6.27        22.02        20.99        18.94   


No comments:

Post a Comment