Monday, 16 March 2015

Creating Millions of Objects with Zero Garbage

As noted in First rule of performance optimisation, garbage is the enemy of fast code. Not only can it destroy any sort of deterministic performance by employing the services of the garbage collector but we start filling our CPU caches with garbage that will cause expensive cache misses for our program.

So, can we use Java without creating garbage? Is it possible, for example, in natural Java, to solve this problem:

Create 10m financial instrument objects, store them in a map, retrieve them and perform a calculation using each object without creating any garbage at all.

It is if you use Chronicle!  Chronicle provides libraries so that you can easily use off heap storage in the form off memory mapped files for your objects. (For full source code for this article see here.)

Let's look implementing a solution for the above problem. 

First let's have a look at how you might do this in normal Java so that we make sure we understand the problem and what happens if we use the standard Java libraries for our implementation.

This is the out put from the program:


*** Entering critical section ***
[GC (Allocation Failure)  98816K->92120K(125952K), 0.0317021 secs]
[Full GC (Ergonomics)  92120K->91917K(216576K), 0.2510530 secs]
[GC (Allocation Failure)  125197K->125430K(224256K), 0.0449051 secs]
[GC (Allocation Failure)  166390K->166686K(244224K), 0.0504341 secs]
[Full GC (Ergonomics)  166686K->165777K(387072K), 0.6243385 secs]
[GC (Allocation Failure)  226705K->226513K(388096K), 0.0785121 secs]
[GC (Allocation Failure)  293073K->293497K(392704K), 0.0825828 secs]
[Full GC (Ergonomics)  293497K->292649K(591872K), 1.2479519 secs]
[GC (Allocation Failure)  359209K->359433K(689664K), 0.0666344 secs]
[GC (Allocation Failure)  449033K->449417K(695296K), 0.1759746 secs]
[GC (Allocation Failure)  539017K->539385K(747008K), 0.1907760 secs]
[GC (Allocation Failure)  632569K->633009K(786944K), 0.2293778 secs]
[Full GC (Ergonomics)  633009K->631584K(1085952K), 2.1328028 secs]
[GC (Allocation Failure)  724768K->723368K(1146368K), 0.3092297 secs]
[GC (Allocation Failure)  827816K->825088K(1174016K), 0.3156138 secs]
[GC (Allocation Failure)  929536K->929952K(1207296K), 0.3891754 secs]
[GC (Allocation Failure)  1008800K->1009560K(1273856K), 0.4149915 secs]
[Full GC (Ergonomics)  1009560K->1007636K(1650688K), 3.4521240 secs]
[GC (Allocation Failure)  1086484K->1087425K(1671680K), 0.3884906 secs]
[GC (Allocation Failure)  1195969K->1196129K(1694208K), 0.2905121 secs]
[GC (Allocation Failure)  1304673K->1305257K(1739776K), 0.4291658 secs]
[GC (Allocation Failure)  1432745K->1433137K(1766912K), 0.4470582 secs]
[GC (Allocation Failure)  1560625K->1561697K(1832960K), 0.6003558 secs]
[Full GC (Ergonomics)  1561697K->1558537K(2343936K), 4.9359721 secs]
[GC (Allocation Failure)  1728009K->1730019K(2343936K), 0.7616385 secs]
[GC (Allocation Failure)  1899491K->1901139K(2413056K), 0.5187234 secs]
[Full GC (Ergonomics)  1901139K->1897477K(3119616K), 5.7177263 secs]
[GC (Allocation Failure)  2113029K->2114505K(3119616K), 0.6768888 secs]
[GC (Allocation Failure)  2330057K->2331441K(3171840K), 0.4812436 secs]
[Full GC (Ergonomics)  2331441K->2328578K(3530240K), 6.3054896 secs]
[GC (Allocation Failure)  2600962K->2488834K(3528704K), 0.1580837 secs]
*** Exiting critical section ***
Time for putting 32088
Time for getting 454
[GC (System.gc())  2537859K->2488834K(3547136K), 0.1599314 secs]
[Full GC (System.gc())  2488834K->2488485K(3547136K), 6.2759293 secs]
[GC (System.gc())  2488485K->2488485K(3559936K), 0.0060901 secs]
[Full GC (System.gc())  2488485K->2488485K(3559936K), 6.0975322 secs]
Memory(heap) used 2.6 GB


The two main points that jump out of this issue are, one, the number and expense of the garbage collections (clearly this could be tuned) and two the amount of heap used 2.6GB. In short, there's no getting away from it, this program produces masses of garbage.

Let's try exactly the same thing, this time using ChronicleMap

This is the code to solve the problem:

This is the output from the program:


[GC (Allocation Failure)  33280K->6595K(125952K), 0.0072065 secs]
[GC (Allocation Failure)  39875K->12177K(125952K), 0.0106678 secs]
[GC (Allocation Failure)  45457K->15289K(125952K), 0.0068434 secs]
[GC (Allocation Failure)  48569K->18357K(159232K), 0.0098287 secs]
[GC (Allocation Failure)  84917K->21008K(159232K), 0.0156393 secs]
*** Entering critical section ***
*** Exiting critical section ***
Time for putting 8554
Time for getting 4351
[GC (System.gc())  36921K->21516K(230400K), 0.0331916 secs]
[Full GC (System.gc())  21516K->15209K(230400K), 0.0630483 secs]
[GC (System.gc())  15209K->15209K(230912K), 0.0006491 secs]
[Full GC (System.gc())  15209K->15209K(230912K), 0.0234045 secs]
Memory(heap) used 18.2 MB


The main point here is obviously that there were no GCs in the critical section and that the whole program only used 18MB of heap. We have managed to create a program that ordinarily would have produced gigabytes of garbage without producing any garbage at all. 

A note on timings

ChronicleMap is clearly not a drop in replacement for ConcurrentHashMap, they have very different uses and it is beyond the scope of this post to go too much further into that line of discussion. But the main differences in functionality are  that ChronicleMap is persisted and can be shared amongst many JVMs. (ChronicleMap also has the ability to be tcp replicated.) Nevertheless it is interesting to quickly compare timings if nothing else than to make sure we are in the same ball park.  ChronicleMap was faster for putting, 8.5s compared to 32s.  But most of the time in ConcurrentHashMap was spent in GC and that might be tuned away to some extent.  ConcurrentHashMap was faster for getting, 0.5s compared to 4.3s.  Nevertheless on other runs I've seen ConcurrentHashMap taking over 7s because of a GC that occurred in that section. Even though ChronicleMap is doing significantly more work, the lack of garbage produced actually makes the timings comparable with ConcurrentHashMap.

Restarting the program   

Where ChronicleMap really comes into its own is on a restart.  Let's say your program goes down and you need to recalculate the same computation we did earlier.  In the case of ConcurrentHashMap we would have to repopulate the map in exactly the same we did earlier. With ChronicleMap, since the map is persistent it is just a matter of pointing the map at the existing file and rerunning the calculation to produce the totalQuantity.

Summary  


ConcurrentHashMapChronicleMap
gc pausesManyNone
update time32s8s
reads allowing gc7s4s
reads no gc0.5s4s
heap size2.6GB18MB
persistenceNoYes
fast restartNoYes

12 comments:

  1. Let me start out by saying that the following is in no way meant to negatively portray the Chronicle code. The OpenHFT work is amazing and I use it regularly.

    I can't reproduce any of your tests because the pom is incomplete and the code will not compile due to missing dependencies.

    You're not really drawing accurate comparisons here in a number of ways:

    1) The first test is being unfairly penalized because you're not specifying a minimum heap size as well as the maximum. You'd never run a low latency operation without also specifying -Xms as the same size as the max, in this case -Xms4G. Failure to do so will cause the JVM to perform a STW (stop the world) collection every time the heap grows. Your Chronicle test exhibits a few of these because it still does use managed heap but not as much so has less pauses.

    2) You're comparing "heap" when in reality all you're measuring is JVM-managed heap. The Chronicle code will consume process heap but it won't be measured by simply calculating JVM-managed heap which is what the code currently does. Additionally, IIRC the Chronicle implementation is based on mmap'ing files. So, while it may use less process heap (managed or otherwise), it will result in more segfaults which also have a cost. A more interesting comparison would be to measure overall system impact (RAM used, shared memory used, page faults, CPU, etc.) across what are two fundamentally different approaches to memory usage.

    Lastly, the contention that garbage will "destroy any sort of deterministic performance" and "will cause expensive cache misses" isn't really substantiated by this test. You don't measure cache effectiveness so we don't know how well either performs in that regard. It's possible to get relatively deterministic performance out of well tuned JVMs depending on workload. Speaking from experience running large scale Cassandra clusters, if all your garbage is created in the new generation and doesn't survive promotion at collection time, GC is incredibly effective and fast. It's effectively just zeroing out pointers in a preallocated memory slab. I've seen steady workloads collect gigabytes of new gen in milliseconds with little variance.

    ReplyDelete
    Replies
    1. Thanks for your comments which are well made and taken.

      Yes the test is clearly unfair as you pointed out in 1). I did allude to this in the article when I said "(clearly this could be tuned)". The main point of the article was to demonstrate that there are alternatives to creating objects on the heap. To people who are not aware of Chronicle or even the concept of off-heap data I think this might be very interesting. I used this very artificial example to demonstrate the point as clearly as I could.

      Definitely agree with point 2) if you are planning on Using ChronicleMap you do have to take into account the performance of the whole machine not just the JVM. Of course what really has to be compared in that case is ChronicleMap against another persisted solution, comparing against ConcurrentHashMap was only done for the sake of this example.

      The sorts of latencies that certain people who would use ChronicleMap might be aiming for are sub-millisecond. Ideally no gc collections at all. But accept your point that a well tuned JVM might well perform in a deterministic manner.

      Delete
  2. "In short, there's no getting away from it, this program produces masses of garbage." - This is simply not true. The program is triggering GC activity, but that does NOT mean it's creating garbage. The GC activity is triggered due to poor configuration (I ran the test with a larger heap of a fixed size, no GC prints). There's no garbage generated in the "Critical Section" (i.e there's nothing to collect, the map is a live reference).
    So here's how you create millions of objects (real objects, you can pass around and synchronise on and put into different containers and have methods on...) with no garbage, you just create the objects in question and put them into a well sized map (so there's no garbage on resizing the map) and you keep a live reference to the map.
    That's it! None of these objects will get collected -> no garbage.

    ReplyDelete
    Replies
    1. Thanks for the comment with which I completely agree.

      I've addressed some of the points in my comment above to Erik. You are of course correct, the garbage was only produced in this program because of poor configuration. The main point of the post was to introduce people to the idea that there are alternatives to on heap storage. Comparing ChronicleMap against HashMap was only done to highlight this idea. As you pointed out a much fairer (and more sensible) comparison would be to compare ChronicleMap to another persisted solution.

      There might be situations where someone has more data than they can fit into their heap. In this case ChronicleMap as an off heap solution would be a way out for them as it can exceed the main memory on the machine. One would have to accept that depending on how the data is accessed it might not be that performant due to page faulting but that really depends on the data access pattern in the program.

      Delete
    2. "the garbage was only produced in this program because of poor configuration"
      No, you are missing the point. GC activity does not mean garbage was produced.
      In particular heap resizing and promotion of objects is not indication of garbage.

      Delete
    3. Understood - I did mean and should have written GC activity.
      Thanks for pointing out.

      Delete
  3. I'll copy my comment from JavaCodeGeeks here. However as I see it is pretty similar to other's comments:

    CreateOnHeapTest doesn’t produce garbage, so it is a very bad candidate for testing garbage collection. You are storing the data in the ConcurrentHashMap, but you never remove it, so they never become eligible for collection.
    It is clearly visible in the GC log. The GC doesn’t remove these map entries, the used heap size is not reduced. And the overall heap size is growing continuously. It is a good example of memory leak, as nobody cleans up this collection, it just grows…

    The Java GC is based on the generational theory. it is about the most of the objects are used, and thrown away immediately. This is what Java GC is focusing on. GC pause time linearly depends on the number of live objects. Because of the generational theory, the number of live objects are always low, so the GC is fast.
    But in your example, -because of the memory leak- the number of live objects are very high, making the GC inefficient and slow. It has to walk through the huge reference tree every time. This is a very extreme case in this sense, not common in Java world.

    The reason for Garbage Collection is the much better performance compared to other solutions. In C++ you have to spend expensive CPU time to clean up after every single objects. In Java, cleaning up is almost free for most of the objects. As well as object allocation.
    So the overall throughput of the application must be better with GC. The drawback is that the latency is not stable. Since it is a well proven technology, I would be very surprised if experiment would show different result than this.

    ReplyDelete
  4. Thanks for taking the time to comment. Agree with all your points. As I replied earlier to previous comments the main point of the post was to demonstrate how Chronicle does not use heap memory. The HashMap was intended used purely as a contrast to show how much heap memory would be allocated in normal circumstances.

    ReplyDelete
  5. I dnt know what went wrong but when i executed the first case. It gave following output.

    ----- HASHMAP ------------------------
    *** Entering critical section ***
    [GC 29942K->29040K(120832K), 0.0402470 secs]
    [GC 58995K->57696K(152576K), 0.0496340 secs]
    [GC 121184K->121232K(185344K), 0.3986370 secs]
    [Full GC 121232K->121052K(299520K), 1.1513830 secs]
    [GC 184540K->184723K(330240K), 0.5455970 secs]
    [Full GC 184723K->184537K(441856K), 1.0315350 secs]
    [GC 278745K->278993K(474624K), 0.7381750 secs]
    [Full GC 278993K->278743K(619520K), 1.3248320 secs]
    [GC 405719K->405815K(712192K), 0.9467300 secs]
    [GC 509751K->509975K(744960K), 1.4869750 secs]
    [GC 613911K->614063K(808448K), 1.7037870 secs]
    [Full GC 614063K->613587K(1054208K), 3.2654490 secs]
    ************* STARTING GET *********************
    [GC 711379K->691041K(1113600K), 1.4565380 secs]
    [GC 788833K->691177K(1110528K), 0.8207020 secs]
    *** Exiting critical section ***
    Time for putting 15268
    Time for getting 6068
    [GC 730574K->691145K(975360K), 0.6077130 secs]
    [Full GC 691145K->225K(975360K), 0.0212590 secs]
    [GC 225K->225K(1108992K), 0.0043150 secs]
    [Full GC 225K->225K(1108992K), 0.0093000 secs]
    Memory(heap) used 2.7 MB



    When I executed 2nd case It gave this out put.
    ----- CHRONICLE MAP ------------------------
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    [GC 31744K->7497K(120832K), 0.0202670 secs]
    *** Entering critical section ***
    [GC 39241K->11625K(152576K), 0.0179970 secs]
    [GC 89009K->11665K(159232K), 0.0014340 secs]
    [GC 86417K->11665K(156672K), 0.0014400 secs]
    [GC 83857K->11665K(154624K), 0.0014570 secs]

    730 messages like this
    [GC 83857K->11665K(154624K), 0.0014570 secs]

    *** Exiting critical section ***
    Time for putting 58035
    Time for getting 88172
    [GC 28510K->11633K(154624K), 0.0013230 secs]
    [Full GC 11633K->9045K(154624K), 0.0853900 secs]
    [GC 9045K->9045K(154624K), 0.0011460 secs]
    [Full GC 9045K->9045K(154624K), 0.0375410 secs]
    Memory(heap) used 10.7 MB

    After this I dnt think 2nd one is any use for me .
    Did I make any mistake ?

    ReplyDelete
  6. Interesting... there shouldn't be any allocation in the critical section.
    Did you set -Xmx?

    Are you able to run the program using flight recorder and see where the allocation is coming from?
    (I have a blog post about how to run Flight Recorder (http://www.rationaljava.com/2015/03/java-flight-recorder-since-jdk-18040.html)

    ReplyDelete
  7. The new Samsung Galaxy Note 8 will hopefully avoid the fiasco that was the bane of the Galaxy Note 7. With the Korean giant looking to recoup its multi-billion dollar losses, we have created a wishlist of the features which will be seen on the Samsung Galaxy Note 8. We are waiting with baited breath the release of the new device.

    ReplyDelete
  8. At the same time, another container would be put on the truck to accept another load of wastes. Due to this special nature, the truck carries a high price tag. Hence companies prefer garbage truck financing to acquire it. badger garbage disposal

    ReplyDelete