Tuesday 28 July 2015

The difference between Efficiency and Performance - It's not all about the Big O

As a software developer you should understand how to make the best use of your hardware in order to give your users the best possible experience with your program.  To do this you need to know how to write both efficient code and performant code.

(I was watching this CppCon talk given by Chandler Carruth, which helped crystallise these concepts for me.)

Efficiency - the amount of work you need to do.
Performance - how fast you can do that work.

Efficiency - governed by your algorithm.
Performance - governed by your data structures.

Efficiency and performance are not dependent on one another.

It follows that you can be efficient without being performant: you do very little work to get the job done, but do it slowly. For example, you are a snail that needs to travel 10m from point X to point Y. You take the most direct route but slither along extremely slowly.

It also follows that you can be performant without being efficient: you do a lot of work to get the job done, but do it very fast. For example, you are Superman and need to travel 10m from X to Y. You fly right round the globe to get there, but go very fast.

What we are of course aiming for is to be efficient and performant at the same time!

The efficiency of our algorithms is usually measured with Big O notation, which describes how an algorithm scales with input size. We are taught in computer science classes to aim for the Holy Grail of O(1) and to get scared when we see O(n²) or worse.

What we focus on less is how the algorithm actually performs.  One of the biggest factors in performance is how data is stored in the data structure. Generally speaking, we want data that is used together to be stored close together in memory so that we don't end up with CPU cache misses.  (See The First Rule of Performance Optimisation where I touch on this issue.) If we access data in a predictable manner, the hardware will pre-fetch data for us into the cache, which can be a huge time saving over randomly accessing data from all over memory and pulling it into the caches on an 'as needed' basis.

One of the worst culprits is the good old LinkedList.  With O(n) access and search coupled with O(1) insertion and deletion, it might be considered a very good data structure to use in a program.  Whilst I'm of course not writing off this important data structure, the problem is that the linked nodes are not held together in memory, so it is not the most performant data structure to traverse: you cannot help but end up with many expensive cache misses.
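
As a quick illustration of the point (a rough sketch, not a rigorous benchmark - use JMH for that), here is a little program that sums the same numbers held in an ArrayList and in a LinkedList. Both traversals are O(n), so the efficiency is identical; any difference in speed comes down to memory layout:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class TraversalLocality {
    public static void main(String[] args) {
        List<Integer> arrayList = new ArrayList<>();
        List<Integer> linkedList = new LinkedList<>();
        for (int i = 0; i < 1_000_000; i++) {
            arrayList.add(i);
            linkedList.add(i);
        }
        // The ArrayList is backed by one contiguous array of references,
        // so traversal is cache and pre-fetch friendly. The LinkedList's
        // nodes may be scattered across the heap, so each hop from node
        // to node risks a cache miss.
        System.out.println("ArrayList:  " + timeSum(arrayList) + " ms");
        System.out.println("LinkedList: " + timeSum(linkedList) + " ms");
    }

    static long timeSum(List<Integer> list) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (int value : list) sum += value;      // O(n) in both cases
        if (sum == 0) throw new AssertionError(); // stop the JIT eliminating the loop
        return System.currentTimeMillis() - start;
    }
}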

Perhaps the best example of the difference, and the trade-off, between efficiency and performance comes from looking at sorting algorithms.

We are taught that bubble sort is to be avoided at all costs. After all, it is O(n²).  We should rather reach for an O(n log n) algorithm such as quicksort or merge sort.

However, to be efficient you sometimes have to pay an overhead penalty which slows you down in the short term. For example, if you had to edit a bunch of files you might do them manually, one by one. Alternatively you could spend time creating a script to do the edits.  At some point the cumulative cost of editing manually exceeds the one-off cost of creating the script.

Bubble sort is performant: it does its work very quickly and doesn't pay the up-front penalty that quicksort pays to be more efficient at larger volumes.  So as long as we keep the numbers low (below about 60 items) bubble sort can outperform more sophisticated sorts.

In fact, if you look at the implementation of java.util.Arrays.sort() you will find, in the class java.util.DualPivotQuicksort, a number of sorting strategies which are triggered at different thresholds. Again, efficiency vs performance.


/**
 * If the length of an array to be sorted is less than this
 * constant, Quicksort is used in preference to merge sort.
 */
private static final int QUICKSORT_THRESHOLD = 286;

/**
 * If the length of an array to be sorted is less than this
 * constant, insertion sort is used in preference to Quicksort.
 */
private static final int INSERTION_SORT_THRESHOLD = 47;

/**
 * If the length of a byte array to be sorted is greater than this
 * constant, counting sort is used in preference to insertion sort.
 */
private static final int COUNTING_SORT_THRESHOLD_FOR_BYTE = 29;


I wrote a quick JMH benchmark which demonstrates this nicely.
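
The benchmark code itself isn't reproduced here, but a sketch of the general shape it might take is below (class, method and parameter names are mine, not the original's):

import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
public class SortBenchmark {

    @Param({"50", "5000"})
    int size;

    int[] data;

    // Regenerate the input before every invocation so each sort
    // starts from unsorted data.
    @Setup(Level.Invocation)
    public void setup() {
        data = new Random(42).ints(size).toArray();
    }

    @Benchmark
    public int[] jdkSort() {
        Arrays.sort(data); // dual-pivot quicksort under the hood
        return data;
    }

    @Benchmark
    public int[] bubbleSort() {
        for (int i = 0; i < data.length - 1; i++)
            for (int j = 0; j < data.length - 1 - i; j++)
                if (data[j] > data[j + 1]) {
                    int tmp = data[j];
                    data[j] = data[j + 1];
                    data[j + 1] = tmp;
                }
        return data;
    }
}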

With 5000 elements the O(n log n) sort, with its greater efficiency, is much faster:

Benchmark                     Mode  Cnt    Score    Error  Units
Sorter.binarySearchForValue  thrpt    5  915.362 ± 51.438  ops/s
Sorter.bubbleSort            thrpt    5  171.021 ±  9.818  ops/s

With 50 elements bubble sort, with its greater performance, is faster:

Benchmark                     Mode  Cnt       Score      Error  Units
Sorter.binarySearchForValue  thrpt    5   90011.861 ± 2051.384  ops/s
Sorter.bubbleSort            thrpt    5  114931.583 ± 6423.543  ops/s




Summary

Whilst Big O notation is incredibly useful for predicting the efficiency (amount of work) of an algorithm, we must also pay attention to the performance of the data structures on which those algorithms operate.  In certain situations a less efficient but more performant algorithm may lead to greater speed overall. 

Monday 6 July 2015

Measuring Allocations Programmatically

I picked up this tip from The Java Specialists' Newsletter written by Heinz Kabutz.  (For all Java developers wanting to learn what goes on under the covers of the JDK, this newsletter is an absolute must!)

Especially for developers writing low latency code, but even for those writing normal Java code, allocation is the thing you really want to avoid. See my previous posts 'The First Rule of Optimisation' and 'Revisiting the First Rule of Performance Optimisation: Effects of Escape Analysis' for more details.

Prior to this tip I had always used a profiler to work out my allocations, or I suppose you could use a call to Runtime to see how much heap memory had been allocated by the JVM.
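
The Runtime approach looks something like the sketch below; it is JVM-wide, so it is easily distorted by other threads allocating or by a GC running between the two readings:

public class HeapDelta {
    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();
        long before = runtime.totalMemory() - runtime.freeMemory();
        byte[] payload = new byte[100_000]; // stands in for the code under test
        long after = runtime.totalMemory() - runtime.freeMemory();
        System.out.println("Approximately " + (after - before) + " bytes allocated");
    }
}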

Using MBeans we are able to query an individual thread for its allocations.  This gives us a very precise way of measuring whether a specific thread has allocated and, if so, how much it has allocated.  In situations where you are coding for zero allocation you can include a call to this code in your tests, asserting that there has been no allocation.
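
On HotSpot, the bean returned by ManagementFactory.getThreadMXBean() also implements com.sun.management.ThreadMXBean, which exposes the per-thread counter. The basic query looks like this:

import java.lang.management.ManagementFactory;

public class ThreadAllocationQuery {
    public static void main(String[] args) {
        // Cast to the com.sun.management extension to get at the
        // per-thread allocation counter (HotSpot specific).
        com.sun.management.ThreadMXBean bean =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
        long threadId = Thread.currentThread().getId();
        long before = bean.getThreadAllocatedBytes(threadId);
        byte[] payload = new byte[1024]; // something measurable
        long after = bean.getThreadAllocatedBytes(threadId);
        System.out.println("This thread allocated ~" + (after - before) + " bytes");
    }
}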

Below is a simple class you can use based on the tip from the newsletter.

You'll notice that the constructor does a calibration which adjusts for the amount of allocation created by the bean itself.

There is also some defensive code to ensure that the class is only called from a single thread.

You can call the method markAllocations to find the number of bytes that have been allocated since the last mark. printAllocations is a handy method to print the allocations since the last mark to standard out.  Allocations are rebased when the class is constructed, on a call to reset, and on a call to markAllocations or printAllocations.

In a test you might have code like this:

AllocationsMeasure measure = new AllocationsMeasure();
...
//critical code
...
assertEquals(0, measure.markAllocations());

Full code for AllocationsMeasure below:
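
(A minimal sketch of such a class, based on the description above, is shown here; the calibration and defensive checks follow the behaviour described, but the details of the original may differ.)

import java.lang.management.ManagementFactory;

public class AllocationsMeasure {
    private final Thread thread = Thread.currentThread();
    private final com.sun.management.ThreadMXBean threadMXBean =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();
    private final long calibration;
    private long lastMark;

    public AllocationsMeasure() {
        // Calibrate: the measurement call itself may allocate, so record
        // that cost and subtract it from future readings.
        long before = threadAllocatedBytes();
        long after = threadAllocatedBytes();
        calibration = after - before;
        reset();
    }

    public void reset() {
        checkThread();
        lastMark = threadAllocatedBytes();
    }

    // Bytes allocated by this thread since the last mark (rebases the mark).
    public long markAllocations() {
        checkThread();
        long allocated = threadAllocatedBytes() - lastMark - calibration;
        reset();
        return Math.max(0, allocated);
    }

    public void printAllocations() {
        System.out.println("Allocated " + markAllocations() + " bytes since last mark");
    }

    private long threadAllocatedBytes() {
        return threadMXBean.getThreadAllocatedBytes(thread.getId());
    }

    private void checkThread() {
        // Defensive: the measurement is only meaningful on the creating thread.
        if (Thread.currentThread() != thread)
            throw new IllegalStateException("AllocationsMeasure is single-threaded");
    }
}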


Thursday 2 July 2015

The Effect of Coordinated Omission - Measuring loopback latency using a simple non-blocking NIO client/server

In this post I demonstrate a number of ideas and techniques: 
  1. How to write a simple non-blocking NIO client/server
  2. How to time latency loopback on your machine
  3. How to measure latencies in percentiles (as opposed to a simple average)
  4. The effects of coordinated omission
  5. The relationship between latency and throughput
Thanks to Peter Lawrey for his help with this post.

I was recently developing a low latency benchmark for a client server application.  Initially I was simulating the benchmark on a single machine using TCP over loopback. The first thing I wanted to quantify was how much of the recorded latency I should attribute to simple loopback latency; I would then be able to understand more clearly the latency added by my actual application.

To do this I created a program (code at the end of the post) that transfers a single byte from the client to the server and back again. This is done repeatedly and the results are processed.

The program is written using non-blocking Java NIO to keep the loopback latencies as low as possible.

Rather than just recording the average time, the percentile latencies are recorded. (See a previous post here for a discussion of how to measure latency.) Critically, the code corrects for coordinated omission.  (To learn more about this see here from Gil Tene.) In a nutshell, you don't time a job from when it actually started but from when it should have started.
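
In code, the difference comes down to which timestamp you subtract from. A sketch of the timing logic (the names and the simulated round trip are mine):

public class CoordinatedOmissionDemo {
    // Stands in for the NIO echo round trip in the real test.
    static void sendAndWaitForEcho() throws InterruptedException {
        Thread.sleep(1);
    }

    public static void main(String[] args) throws InterruptedException {
        int targetRate = 20_000; // requests per second
        long gapNs = 1_000_000_000L / targetRate;
        long testStart = System.nanoTime();
        for (int i = 0; i < 1_000; i++) {
            // When the i-th request *should* have started.
            long intendedStart = testStart + i * gapNs;
            while (System.nanoTime() < intendedStart) { } // wait for the scheduled send time

            long actualStart = System.nanoTime();
            sendAndWaitForEcho();

            long naive = System.nanoTime() - actualStart;       // ignores any backlog
            long corrected = System.nanoTime() - intendedStart; // charges the delay to the request
            // Record 'corrected', not 'naive', in the latency histogram.
        }
    }
}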

These are the results on my two-year-old MacBook Pro:

Starting latency test rate: 80000 ... Average time 6001870.9
Loop back echo latency was 6007821.4/10525630.0 11,538,962/11,642,739 11,652,536/11653522 11,653,616 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 70000 ... Average time 4393443.9
Loop back echo latency was 4388003.2/7743356.1 8,496,550/8,576,130 8,583,327/8584029 8,584,097 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 60000 ... Average time 3054243.7
Loop back echo latency was 2943178.9/5463715.9 5,961,190/6,009,637 6,014,795/6015409 6,015,453 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 50000 ... Average time 1420485.4
Loop back echo latency was 1423450.9/2462332.1 2,694,163/2,715,932 2,717,909/2718146 2,718,163 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 40000 ... Average time 4828.0
Loop back echo latency was 79.8/23261.9 44,756/47,379 47,662/47712 47,713 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 30000 ... Average time 45.1
Loop back echo latency was 23.3/46.5 630/2,211 2,904/3092 3,106 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 25000 ... Average time 1095.0
Loop back echo latency was 24.8/36.8 47,300/58,166 59,151/59346 59,357 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 20000 ... Average time 26.7
Loop back echo latency was 24.7/29.5 60/88 259/339 340 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 15000 ... Average time 26.5
Loop back echo latency was 24.8/28.1 60/72 102/252 252 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 10000 ... Average time 26.7
Loop back echo latency was 24.8/28.9 61/72 165/369 369 us for 50/90 99/99.9 99.99/99.999 worst %tile

Running the same test with -Dcoordinated.omission=true

### Running with Coordinated Omission ###

Starting latency test rate: 80000 ... Average time 25.6
Loop back echo latency was 23.3/30.8 59/74 164/1604 9,165 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 70000 ... Average time 25.9
Loop back echo latency was 23.3/31.1 59/74 187/1726 11,358 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 60000 ... Average time 26.2
Loop back echo latency was 25.1/28.3 60/66 90/166 419 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 50000 ... Average time 26.4
Loop back echo latency was 24.6/28.3 60/72 232/1996 4,104 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 40000 ... Average time 26.2
Loop back echo latency was 24.6/28.5 60/67 90/251 319 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 30000 ... Average time 26.6
Loop back echo latency was 24.5/28.3 60/72 162/2257 3,236 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 25000 ... Average time 26.4
Loop back echo latency was 24.6/27.8 61/71 96/254 375 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 20000 ... Average time 26.4
Loop back echo latency was 24.5/27.4 61/70 96/243 293 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 15000 ... Average time 26.8
Loop back echo latency was 24.6/29.0 61/74 285/887 887 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 10000 ... Average time 26.7
Loop back echo latency was 24.6/28.5 61/73 99/307 307 us for 50/90 99/99.9 99.99/99.999 worst %tile


As you can see, the effects of throughput on latency are completely hidden! It would appear that even at a rate of 80,000 messages a second your 99.99 percentile is a mere 164us, when in fact it is way, way larger than that.

In fact you can only hit your target latencies when your throughput is nearer 20,000 messages a second.  As you would intuitively understand, there is a trade-off between throughput and latency.

The results for a server machine (i7-3790X, 3.5 GHz, Ubuntu 15.04) are below:


Starting latency test rate: 200000 ... Average time 9417750.1
Loop back echo latency was 9349156.4/17229351.3 18,907,915/19,073,572 19,089,824/19091465 19,091,645 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 160000 ... Average time 3155097.0
Loop back echo latency was 3186332.1/5649721.3 6,159,813/6,211,035 6,216,502/6217022 6,217,047 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 140000 ... Average time 47245.0
Loop back echo latency was 19.0/236226.4 339,075/362,914 364,575/364749 364,767 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 120000 ... Average time 8.2
Loop back echo latency was 6.6/13.1 22/89 388/1427 1,519 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 100000 ... Average time 7.3
Loop back echo latency was 6.6/7.4 18/20 59/107 317 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 80000 ... Average time 7.1
Loop back echo latency was 6.7/7.7 18/19 22/100 619 us for 50/90 99/99.9 99.99/99.999 worst %tile

Running the same test with -Dcoordinated.omission=true

### Running with Coordinated Omission ###

Starting latency test rate: 200000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 122 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 160000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 131 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 140000 ... Average time 6.9
Loop back echo latency was 6.5/7.1 18/19 21/23 442 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 120000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 18/19 21/23 121 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 100000 ... Average time 6.9
Loop back echo latency was 6.5/6.9 18/19 21/23 106 us for 50/90 99/99.9 99.99/99.999 worst %tile

Starting latency test rate: 80000 ... Average time 6.8
Loop back echo latency was 6.5/6.9 17/19 21/23 104 us for 50/90 99/99.9 99.99/99.999 worst %tile

On the server machine, you can achieve stable latencies (~7us) at a throughput of 100,000 messages a second.

An interesting observation is the relationship between latency and throughput, and the effect that lowering latency has on raising throughput (compare the server and the laptop). For more on this topic see Peter Lawrey's excellent presentation here.

The code for this test is below:
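
(The full code isn't reproduced here; the sketch below strips the idea down to its core - a single byte echoed over loopback with non-blocking NIO, timing each round trip. The real test adds rate control, percentile recording and the coordinated omission correction.)

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class LoopbackEcho {
    public static void main(String[] args) throws IOException, InterruptedException {
        Thread server = new Thread(LoopbackEcho::runServer);
        server.setDaemon(true);
        server.start();
        Thread.sleep(500); // crude: give the server time to bind

        SocketChannel client = SocketChannel.open(new InetSocketAddress("localhost", 9999));
        client.socket().setTcpNoDelay(true); // don't batch our single-byte writes
        client.configureBlocking(false);
        ByteBuffer buf = ByteBuffer.allocateDirect(1);

        for (int i = 0; i < 100_000; i++) {
            long start = System.nanoTime();
            buf.clear();
            buf.put((byte) 'x');
            buf.flip();
            while (buf.hasRemaining()) client.write(buf);
            buf.clear();
            while (buf.position() < 1) { // spin until the echo arrives
                if (client.read(buf) < 0) throw new IOException("server closed connection");
            }
            long latency = System.nanoTime() - start;
            // Record 'latency' in a histogram here rather than just averaging.
        }
        client.close();
    }

    static void runServer() {
        try (ServerSocketChannel ssc = ServerSocketChannel.open()) {
            ssc.bind(new InetSocketAddress(9999)); // arbitrary port for the sketch
            SocketChannel ch = ssc.accept(); // one client, so a blocking accept is fine
            ch.socket().setTcpNoDelay(true);
            ch.configureBlocking(false);
            ByteBuffer buf = ByteBuffer.allocateDirect(1);
            while (true) {
                buf.clear();
                while (buf.position() < 1) {
                    if (ch.read(buf) < 0) return; // client disconnected
                }
                buf.flip();
                while (buf.hasRemaining()) ch.write(buf);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}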