Friday, 15 April 2016

JLBH Examples 1 - Why Code Should be Benchmarked in Context

In this post:

  • A side by side example using JMH and JLBH for Date serialisation
  • Measuring Date serialisation in a microbenchmark
  • Measuring Date serialisation as part of a proper application
  • How to add a probe to your JLBH benchmark
  • Understanding the importance of measuring code in context


In the last post, 'Introducing JLBH' we introduced JLBH the latency testing tool that Chronicle used to test Chronicle-FIX and is now available as open source.

In the next few posts we are going to look at some example applications: 

All the code for the examples cane be found here in my GitHub project:

One of the points I made in the introduction to JLBH was that it is important to benchmark code in context. That means benchmarking code in an environment as close to how it will be running in real life as possible. This post demonstrates this in practice.

Let's look at a relatively expensive Java operation - Date Serialisation - and see how long it takes:

First here's a JMH benchmark:



Running on my laptop (MBP i7) these are the results I get:

Result "test":
  4.578 ±(99.9%) 0.046 us/op [Average]
  (min, avg, max) = (3.664, 4.578, 975.872), stdev = 6.320
  CI (99.9%): [4.533, 4.624] (assumes normal distribution)
  Samples, N = 206803
        mean =      4.578 ±(99.9%) 0.046 us/op
         min =      3.664 us/op
  p( 0.0000) =      3.664 us/op
  p(50.0000) =      4.096 us/op
  p(90.0000) =      5.608 us/op
  p(95.0000) =      5.776 us/op
  p(99.0000) =      8.432 us/op
  p(99.9000) =     24.742 us/op
  p(99.9900) =    113.362 us/op
  p(99.9990) =    847.245 us/op
  p(99.9999) =    975.872 us/op
         max =    975.872 us/op


# Run complete. Total time: 00:00:21

Benchmark                Mode     Cnt  Score   Error  Units

DateSerialiseJMH.test  sample  206803  4.578 ± 0.046  us/op

A mean time of 4.5us for the operation:

We get pretty much the same results running the test with JLBH:


These are the results:

Warm up complete (400000 iterations took 2.934s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  352 / 672  803 / 901 - 934
OS Jitter (13,939)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  639 / 4,130  12,850 - 20,450
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 705  836 / 934 - 967
OS Jitter (11,016)                              50/90 99/99.9 99.99 - worst was 8.4 / 17  606 / 770  868 - 1,340
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 10.0s
Correcting for co-ordinated:true
Target throughput:100000/s = 1 message every 10us
End to End: (1,000,000)                         50/90 99/99.9 99.99/99.999 - worst was 4.2 / 5.8  434 / 737  901 / 999 - 1,030
OS Jitter (12,319)                              50/90 99/99.9 99.99 - worst was 8.4 / 15  573 / 737  803 - 901
---------------------------------------------------------------------------------------------------- SUMMARY (end to end)---------------Percentile   run1         run2         run3      % Variation   
50:             4.22         4.22         4.22         0.00    
90:             5.76         5.76         5.76         0.00    
99:           352.26       434.18       434.18         0.00    
99.9:         671.74       704.51       737.28         3.01    
99.99:        802.82       835.58       901.12         4.97    
worst:        901.12       933.89       999.42         4.47    
--------------------------------------------------------------------

A mean time of 4.2us for the operation:

Note: This is case where there is no advantage using JLBH over JMH.  I just include the code as a comparison.

Now we're going to run exactly the same operation but inside a TCP call the code will work like this:
  1. Client send the server a fix message over TCP loopback (localhost)
  2. Server reads the message
  3. Server does the date serialisation
  4. Server returns a message to the client
As explained in the previous post JLBH allows us to produce a latency profile for any part of the code.  We will add a probe for stage 3.



This time the results look like this:

Warm up complete (50000 iterations took 3.83s)
-------------------------------- BENCHMARK RESULTS (RUN 1) ------------------------
Run time: 6.712s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 822,080 / 1,509,950  1,711,280 / 1,711,280  1,711,280 - 1,711,280
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  31 / 50  901 - 2,420
OS Jitter (64,973)                              50/90 99/99.9 99.99 - worst was 8.1 / 16  40 / 1,540  4,850 - 18,350
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 2) ---------
Run time: 6.373s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 1,107,300 / 1,375,730  1,375,730 / 1,375,730  1,375,730 - 1,375,730
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 11 / 19  29 / 52  901 - 1,670
OS Jitter (40,677)                              50/90 99/99.9 99.99 - worst was 8.4 / 16  34 / 209  934 - 1,470
--------------------------------------------------------------------
-------------------------------- BENCHMARK RESULTS (RUN 3) ---------
Run time: 5.333s
Correcting for co-ordinated:true
Target throughput:20000/s = 1 message every 50us
End to End: (100,000)                           50/90 99/99.9 99.99 - worst was 55,570 / 293,600  343,930 / 343,930  343,930 - 343,930
date serialisation  (100,000)                   50/90 99/99.9 99.99 - worst was 9.0 / 16  26 / 38  770 - 1,030
OS Jitter (32,042)                              50/90 99/99.9 99.99 - worst was 9.0 / 13  22 / 58  737 - 934
--------------------------------------------------------------------
-------------------------------- SUMMARY (end to end)---------------
Percentile   run1         run2         run3      % Variation   
50:        822083.58   1107296.26     55574.53        92.66    
90:       1509949.44   1375731.71    293601.28        71.07    
99:       1711276.03   1375731.71    343932.93        66.67    
99.9:     1711276.03   1375731.71    343932.93        66.67    
99.99:    1711276.03   1375731.71    343932.93        66.67    
worst:    1711276.03   1375731.71    343932.93        66.67    
--------------------------------------------------------------------
-------------------------------- SUMMARY (date serialisation )------
Percentile   run1         run2         run3      % Variation   
50:            11.01        11.01         8.96        13.22    
90:            18.94        18.94        15.62        12.44    
99:            31.23        29.18        26.11         7.27    
99.9:          50.18        52.22        37.89        20.14    
99.99:        901.12       901.12       770.05        10.19    
worst:       2424.83      1671.17      1032.19        29.21    

--------------------------------------------------------------------

As can be seen the very same Date Serialisation take over twice as long from ~4.5us to ~10us.

It's not really the place here to go into too much detail about why the code takes longer to run when in context but it's to do with CPU caches getting filled in between calls to the date serialisation.

When all we are running (as in the micro benchmark) is the Date Serialisation then that can fit nicely into a CPU cache and never needs to get cleared out.  However when there is a gap between calls to the Date serialisation the code for the operation gets cleared out and needs to be reloaded.

JLBH allows you to benchmark code in context and that's an important part of latency benchmarking.

14 comments:

  1. Hello,
    I cannot agree more about running in context. I always considered microbenchmark as an idealization of the execution.

    Reality is far from this. So any optimizations based on a Microbenchark, considering the order of magnitude between L1 cache and Dram access, can be completely ruin by a L3 miss.

    Cheers

    ReplyDelete
    Replies
    1. For sure!

      This really hits you hard when you are given a time budget of 2us and you have micro benchmarked your code at 0.5us so you falsely believe you have lots of spare time.
      Then you get into the real application and you realise that your code actually takes 1.5us to run and you are right on the edge of your budget.

      Delete
  2. JLBH does seem to work better. Thanks for recommending it and posting the example as well. Your post will definitely help everyone make the right decision.

    ReplyDelete
  3. InstantPcApps
    It's a hub for free mac, pc, android and ios software. You can get any kind of game, software and mobile application free of cost. 4K Video Downloader Crack

    ReplyDelete
  4. Hello,
    Thanks a lot dude,It means a lot of me, Its amazing, I have learned many things on your blog. If you dont mind kindly visit my blog...
    Mehndi Artist in Delhi | Best Mehndi Artist in Delhi | Simple Mehndi Design | Mehndi Service at Home in Delhi

    ReplyDelete
  5. You realize the great math practice for your children? Prodigy Game is math practice used to be a battle — however not any longer.

    ReplyDelete
  6. It's very useful blog post with informative and insightful content and i had good experience with this information. We, at the CRS info solutions ,help candidates in acquiring certificates, master interview questions, and prepare brilliant resumes.Find top Salesforce admin interview questions in 2020.
    These Salesforce developer interview questions are highly helpful in 2020. You can read these Salesforce lightning interview questions and Salesforce integration interview questions which are prepared by industry experts.

    ReplyDelete
  7. 3. Al Jefferson, Bobcats: Jefferson is already out with a badly sprained ankle. Charlotte is a tough place to play for any standout because there is very little help. Most of us breathe a little sigh of relief. Now it time to gather up the wrinkled, torn wrapping paper, frayed ribbons and bows, smashed gift boxes, and those pieces of infernal, impossible to open rigid plastic packaging (does anybody else hate trying to break into that stuff?) and to put it all into garbage bags or recycling bins for hauling away. The sooner, the better, I say.. (tags: Jordan Shoes For Sale Cheap, 2020 Jordan Release Dates, Ray Ban Sunglasses Outlet)

    If people want marriage, however, they would have to find a private ceremony for that. I dare say many churches and synagogues would not offer such a ceremony to gays or to polygamists and as private institutions that would be their right. Government, on the other hand, should have no such power.. (tags: Cheap Real Yeezys, Discount Jordan Shoes Wholesale, Cheap Yeezys)

    Hunt is replaced in the halves by Darren Nicholls. Leeson Ah Mau (rested following Denver Test) returns at starting prop for Vaughan. Blake Lawrie starts at prop for De Belin. Visit the Best Offers page on the website to check for cellar deals and savings. Wine lovers can score extra discounts by joining the Laithwaites Wine Club. Members receive cases of hand picked wines and can enjoy an introductory case of wine for 50% off, with two free glasses included in the offer (tags: Coach Outlet Store Online , Michael Kors Bags Sale, Coach Outlet Store, MK Outlet).

    ReplyDelete