How to write correct benchmarks

Baptiste Wicht

Several months ago, I wrote an article to compare the performances of short indexes for loops. I wrote that code to achieve my goal :

package com.wicht.old;

public class TestShortInt {
    public static void main(String[] args){
        long startTime = System.nanoTime();

        int resultInt = 0;

        for (int i = 0; i &amp;lt; 100000; i++){
            for (int j = 0; j &amp;lt; 32760; j++){
                resultInt += i * j;
            }
        }

        System.out.println("Temp pour int : " + (System.nanoTime() - startTime) / 1000000 + " ms");

        startTime = System.nanoTime();

        int resultShort = 0;

        for (int k = 0; k &amp;lt; 100000; k++){
            for (short f = 0; f &amp;lt; 32760; f++){
                resultShort += k * f;
            }
        }

        System.out.println("Temp pour short : " + (System.nanoTime() - startTime) / 1000000 + " ms");

        System.out.println(resultInt);
        System.out.println(resultShort);
    }
}

And i found as a result that short was two times slower than int and I was convinced of these results until a week ago.

At this time, a reader (Jean) criticized the results of my tests and gave me links to several articles about micro-benchmarking. I've read these articles and understand why my results were incorrect.

In fact, my test doesn't pay attention to several things that can change results of tests :

JVM warmup : Due to several parameters, the code is first often slow and becomes faster and faster when the execution time grows until it goes to steady-state.
Class loading : The first time you launch a benchmark, all the used classes must be loaded, increasing the execution time.
Just In Time Compiler : When the JVM identify a hot part of the code
Garbage Collector : A garbage collection can happen during the benchmark and with that the time can increase a lot.

Due to all these factors, the first runs (perhaps 10 seconds of run) are slower than the other and than can make your benchmarks completely false.

So, how can we do to have good benchmarks results ?

It's really difficult, but we can have help using a benchmark framework introduced by Brent Boyer, a software developer from Elliptic Group. This framework take care of all the previously introduced factors and made good benchmarks.

The use of this framework is really simple, you just have to create a new instance of the Benchmark class passing to it a Callable or a Runnable and the test is directly launched. Here is the example with the test of short and int in loop indexes :

public class ShortIndexesLoop {
    public static void main(String[] args) {
        Callable callableInt = new Callable(){
            public Long call() throws Exception {
                long result = 0;

                for (int f = 0; f &amp;lt; 32760; f++){
                      result += 444;
                  }

                return result;
            }
        };

        Callable callableShort = new Callable(){
            public Long call() throws Exception {
                long result = 0;

                for (short f = 0; f &amp;lt; 32760; f++){
                      result += 444;
                  }

                return result;
            }
        };

        try {
            Benchmark intBenchmark = new Benchmark(callableInt);

            System.out.println("Result with int ");
            System.out.println(intBenchmark.toString());

            Benchmark shortBenchmark = new Benchmark(callableShort);

            System.out.println("Result short ");
            System.out.println(shortBenchmark.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

To get the results, you can use Benchmark.toString() or Benchmark.toStringFull() for more statistics. You can also directly access some stats like standard deviation using Benchmark.getSd() or directly with Benchmark.getStats() to get all the stats.

Here is the result with the preceding code :

Result int
first = 807.056 us, mean = 46.032 us (CI deltas: -261.393 ns, +408.932 ns), sd = 230.929 us (CI deltas: -68.201 us, +105.262 us)
Result short
first = 721.912 us, mean = 48.234 us (CI deltas: -198.625 ns, +254.774 ns), sd = 160.196 us (CI deltas: -32.764 us, +37.882 us)

As you can see, the short version is only 104.78% slower than the int. That show that the first results were completely false.

Here is the full results of the int version :

action statistics: first = 807.056 us, mean = 46.032 us (CI deltas: -261.393 ns, +408.932 ns), sd = 230.929 us (CI deltas: -68.201 us, +105.262 us) WARNING: EXECUTION TIMES HAVE EXTREME OUTLIERS, SD VALUES MAY BE INACCURATE
    ----------
    --the action statistics were calculated from block statistics
    --each block measured 32768 task executions
    --the user says that task internally performs m = 1 actions
    --then the number of actions per block measurement is a = 32768
    --block statistics: mean = 1.508 s (CI deltas: -8.565 ms, +13.400 ms), sd = 41.803 ms (CI deltas: -12.346 ms, +19.054 ms)
    --the forumla used to convert block statistics to action statistics (mean scales as 1/a, sd scales as 1/sqrt(a)) assumes that the action execution times are iid
    ----------
    --each confidence interval (CI) is reported as either +- deltas from the point estimate, or as a closed interval ([x, y])
    --each confidence interval has confidence level = 0.95
    ----------
    --EXECUTION TIMES APPEAR TO HAVE OUTLIERS
    --this was determined using the boxplot algorithm with median = 1.498 s, interquantileRange = 34.127 ms
    --3 are EXTREME (on the high side): #57 = 1.621 s, #58 = 1.647 s, #59 = 1.688 s
    --2 are mild (on the high side): #55 = 1.570 s, #56 = 1.582 s
    ----------
    --block sd values MAY NOT REFLECT TASK'S INTRINSIC VARIATION
    --guesstimate: environmental noise explains at least 55.89418621876822% of the measured sd
    ----------
    --action sd values ALMOST CERTAINLY GROSSLY INFLATED by outliers
    --they cause at least 98.95646276911543% of the measured VARIANCE according to a equi-valued outlier model
    --model quantities: a = 32768.0, muB = 1.5083895562166663, sigmaB = 0.04180264914581472, muA = 4.603239612477619E-5, sigmaA = 2.3092919283255957E-4, tMin = 0.0, muGMin = 2.3016198062388096E-5, sigmaG = 5.754049515597024E-6, cMax1 = 1252, cMax2 = 322, cMax = 322, cOutMin = 322, varOutMin = 0.0017292260645147487, muG(cOutMin) = 2.3034259031465023E-5, U(cOutMin) = 0.002363416110812895

Like you can perhaps see when you use this framework, it gives you some warnings when by example you have extreme outliers that can make the standard deviation completely false.

You can download this framework on the web page of the Elliptic Group. I found it really powerful and easy to use and I'll use it everytime I have to do a benchmark.

To conclude, I must also say that even if you use that kind of framework, you can make very bad benchmarks if you don't test the right part of the code. Here are two really interesting articles from Brent Boyer :

My Java Benchmarks on GitHub

Java File Copy Benchmarks Update

Tip : Don’t use shorts for loop indexes !

Generate graphs benchmarks easily

Java File Copy Benchmark Updates (once again)

C++11 Synchronization Benchmark