Wednesday, 11 February 2015

How to detect and diagnose slow code in production

One of the more difficult tasks that a developer faces is finding and diagnosing slow running code in production.

Firstly how do you monitor your production code without slowing it down? Certainly you can't run production code through a profiler.  [EDIT] It has since been pointed out to me that adaptive profilers such as Satoris can be run in production. Even if you have a mechanism for timing your code, how then do you diagnose the issue? This is especially the case if you can't reproduce the problem in a development environment. Ideally you want to be notified when a problem happens in production and be supplied with enough information to have a reasonable chance of being able to fix or at least diagnose the issue.

Here's a mechanism suggested by my colleague Peter Lawrey that you can use to do exactly that. (Full code listing can be found here).

What you do is to create a Monitor class as below:

public class Monitor implements Runnable{
  private final Thread thread;
  private final AtomicLong startTime 
                  = new AtomicLong(Long.MAX_VALUE);
  private final int thresholdMS;

  public Monitor(Thread thread, int thresholdMS){
    this.thread = thread;
    this.thresholdMS = thresholdMS;

  public void reset(){

  public void run(){
      long timeTaken = System.currentTimeMillis()-startTime.get();
      if(timeTaken > thresholdMS){
        System.out.println(timeTaken + "-------------------------");
      try {
      } catch (InterruptedException e) {

This class will dump the stack trace of the running thread if the thread fails to reset within the threshold time.

Here's some example program demonstrating how the Monitor would be called.

Monitor monitor = new Monitor(Thread.currentThread(), 8);
Thread thread = new Thread(monitor, "MonitorThread");

while(true) {
   double x=0;
   for (int i = 0; i < 10_000; i++) {
     x += Math.sqrt(i);
     Logger.getLogger(getClass().getName()).fine("x=" + x);

This 'critical' piece of code is observed by Monitor. It will dump out a stack trace of the code if the monitor is not reset within 8ms.

If you have a Monitor watching your critical sections of code you can be sure that they are performing within the given constraint. If the code does break the constraints you can have a good idea of where the problem lies by examining the stack trace.  You can also use this to see how many times your critical code was not performant through its run time.

You can alter the wait time if you don't have the luxury of a dedicated spare CPU for monitoring. Also you might want to change the wait strategy to allow for GC pauses which would effect all threads. You might want to fine grain the timing by using System.nanoTime() rather than working in milliseconds.

No comments:

Post a Comment