Friday, 15 April 2016

JLBH Examples 1 - Why Code Should be Benchmarked in Context

In this post:

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context


In the last post, 'Introducing JLBH' we introduced JLBH the latency testing tool that Chronicle used to test Chronicle-FIX and is now available as open source.

In the next few posts we are going to look at some example applications: 

All the code for the examples cane be found here in my GitHub project:

One of the points I made in the introduction to JLBH was that it is important to benchmark code in context. That means benchmarking code in an environment as close to how it will be running in real life as possible. This post demonstrates this in practice.

Let's look at a relatively expensive Java operation - Date Serialisation - and see how long it takes:

First here's a JMH benchmark:



Running on my laptop (MBP i7) these are the results I get:

Result "test":
  4.578 ±(99.9%) 0.046 us/op [Average]
  (min, avg, max) = (3.664, 4.578, 975.872), stdev = 6.320
  CI (99.9%): [4.533, 4.624] (assumes normal distribution)
  Samples, N = 206803
        mean =      4.578 ±(99.9%) 0.046 us/op
         min =      3.664 us/op
  p( 0.0000) =      3.664 us/op
  p(50.0000) =      4.096 us/op
  p(90.0000) =      5.608 us/op
  p(95.0000) =      5.776 us/op
  p(99.0000) =      8.432 us/op
  p(99.9000) =     24.742 us/op
  p(99.9900) =    113.362 us/op
  p(99.9990) =    847.245 us/op
  p(99.9999) =    975.872 us/op
         max =    975.872 us/op


# Run complete. Total time: 00:00:21

Benchmark                Mode     Cnt  Score   Error  Units

DateSerialiseJMH.test  sample  206803  4.578 ± 0.046  us/op

A mean time of 4.5us for the operation:

We get pretty much the same results running the test with JLBH:


These are the results:

Warm up complete (400000 iterations took 2.934s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  352 / 672  803 / 901 - 934
OS Jitter (13,939)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  639 / 4,130  12,850 - 20,450
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 705  836 / 934 - 967
OS Jitter (11,016)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  606 / 770  868 - 1,340
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 737  901 / 999 - 1,030
OS Jitter (12,319)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  573 / 737  803 - 901
---------------------------------------------------------------------------------------------------- SUMMARY (end to end)---------------Percentile   run1         run2         run3      % Variation   
50:             4.22         4.22         4.22         0.00    
90:             5.76         5.76         5.76         0.00    
99:           352.26       434.18       434.18         0.00    
99.9:         671.74       704.51       737.28         3.01    
99.99:        802.82       835.58       901.12         4.97    
worst:        901.12       933.89       999.42         4.47    
--------------------------------------------------------------------

A mean time of 4.2us for the operation:

Note: This is case where there is no advantage using JLBH over JMH.  I just include the code as a comparison.

Now we're going to run exactly the same operation but inside a TCP call the code will work like this:
  1. Client send the server a fix message over TCP loopback (localhost)
  2. Server reads the message
  3. Server does the date serialisation
  4. Server returns a message to the client
As explained in the previous post JLBH allows us to produce a latency profile for any part of the code.  We will add a probe for stage 3.



This time the results look like this:

Warm up complete (50000 iterations took 3.83s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ------------------------
Run time: 6.712s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 822,080 / 1,509,950  1,711,280 / 1,711,280  1,711,280 - 1,711,280
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  31 / 50  901 - 2,420
OS Jitter (64,973)                              50/90 99/99.9 99.99 - worst was 8.1 / 16  40 / 1,540  4,850 - 18,350
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 6.373s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 1,107,300 / 1,375,730  1,375,730 / 1,375,730  1,375,730 - 1,375,730
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  29 / 52  901 - 1,670
OS Jitter (40,677)                              50/90 99/99.9 99.99 - worst was 8.4 / 16  34 / 209  934 - 1,470
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 5.333s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 55,570 / 293,600  343,930 / 343,930  343,930 - 343,930
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 9.0 / 16  26 / 38  770 - 1,030
OS Jitter (32,042)                              50/90 99/99.9 99.99 - worst was 9.0 / 13  22 / 58  737 - 934
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:        822083.58   1107296.26     55574.53        92.66    
90:       1509949.44   1375731.71    293601.28        71.07    
99:       1711276.03   1375731.71    343932.93        66.67    
99.9:     1711276.03   1375731.71    343932.93        66.67    
99.99:    1711276.03   1375731.71    343932.93        66.67    
worst:    1711276.03   1375731.71    343932.93        66.67    
--------------------------------------------------------------------
-------------------------------- SUMMARY (date serialisation )------
Percentile   run1         run2         run3      % Variation   
50:            11.01        11.01         8.96        13.22    
90:            18.94        18.94        15.62        12.44    
99:            31.23        29.18        26.11         7.27    
99.9:          50.18        52.22        37.89        20.14    
99.99:        901.12       901.12       770.05        10.19    
worst:       2424.83      1671.17      1032.19        29.21    

--------------------------------------------------------------------

As can be seen the very same Date Serialisation take over twice as long from ~4.5us to ~10us.

It's not really the place here to go into too much detail about why the code takes longer to run when in context but it's to do with CPU caches getting filled in between calls to the date serialisation.

When all we are running (as in the micro benchmark) is the Date Serialisation then that can fit nicely into a CPU cache and never needs to get cleared out.  However when there is a gap between calls to the Date serialisation the code for the operation gets cleared out and needs to be reloaded.

JLBH allows you to benchmark code in context and that's an important part of latency benchmarking.

4 comments:

  1. Hello,
    I cannot agree more about running in context. I always considered microbenchmark as an idealization of the execution.

    Reality is far from this. So any optimizations based on a Microbenchark, considering the order of magnitude between L1 cache and Dram access, can be completely ruin by a L3 miss.

    Cheers

    ReplyDelete
    Replies
    1. For sure!

      This really hits you hard when you are given a time budget of 2us and you have micro benchmarked your code at 0.5us so you falsely believe you have lots of spare time.
      Then you get into the real application and you realise that your code actually takes 1.5us to run and you are right on the edge of your budget.

      Delete
  2. All centennial celebrations have ended in Indy 500 2017. The Indianapolis 500 Live race will take place on May 28, but the field will be officially established this weekend, with qualifying on Saturday and Sunday.

    Saturday's qualifying rounds will determine the Fast 9, and the nine drivers will eliminate him for the first nine positions on the field on Sunday after the No. 10 to No. 33 positions are determined.

    Both qualifying days can be viewed on ABC or via Indy 500 Live streaming.

    ReplyDelete

  3. Apple recently launched the beta version of iOS 11 at the WWDC event held in California. But you need to shell out 8 dollars to install it on your iOS devices. If you want to download it for free and check out the features, then just
    Click Here

    ReplyDelete