Micro-optimizations in Kotlin — 3

The Kotlin standard library is a wonderful set of APIs, but it sometimes hides… interesting surprises. So today let’s took at the innocent looking minOf() and maxOf() functions.

These functions let you perform a min/max on a series of values, instead of just two with the more common min() and max() functions. Both min() and max() are straightforward and simply delegate — on Android and JVM — to Math.min()/Math.max() respectively, which you can assume to be implemented with intrinsics (and are on Android).

Now consider the following piece of code that loops over an array of colors to find the minimum of the red & green channels across all colors:

1var minColor = Float.MAX_VALUE
2for (color in colors) {
3    minColor = minOf(minColor, color.red, color.green)
4}

Let’s benchmark it on a Pixel 6, using 9,999 colors:

minOf(3)     86,815 ns     0 allocs

Seems reasonable, but we forgot to check the blue channel, so we’ll fix our code:

1var minColor = Float.MAX_VALUE
2for (color in colors) {
3    minColor = minOf(minColor, color.red, color.green, color.blue)
4}

And the benchmark now yields:

minOf(4)    256,048 ns  9999 allocs

The new benchmark is 2 times slower (total execution time is 3 times the previous total execution time). Does this make sense? To find the minimum of 3 values, we need to invoke min() twice:

1fun minOf(a: Float, b: Float, c: Float) = min(a, min(b, c))

And to find the minimum of 4 values, we need three min():

1fun minOf(a: Float, b: Float, c: Float, d: Float) = min(a, min(b, min(c, d)))

We would therefore reasonably expect the first benchmark to take ~2/3 of the time of the second benchmark, not 1/3. The number of allocations reported by the benchmark gives us a hint about what’s happening: with 9,999 reported allocations, it means we are doing one allocation per color that we are testing. Looking at the source code of the standard library the answer becomes obvious:

1public actual fun minOf(a: Float, vararg other: Float): Float {
2    var min = a
3    for (e in other) min = minOf(min, e)
4    return min
5}

The minOf() function offers specializations for 2 and 3 parameters, but falls back to a vararg variant starting with 4 parameters. The problem with vararg is that they are equivalent to passing an array that must be allocated at the call site. Here is the relevant DEX bytecode on Android:

 1const/4 v0, #int 3 // #3
 2new-array v1, v0, [F // type@0037
 3const/4 v2, #int 0 // #0
 4aput v4, v1, v2
 5const/4 v4, #int 1 // #1
 6aput v5, v1, v4
 7const/4 v4, #int 2 // #2
 8aput v6, v1, v4
 9if-ge v2, v0, 0017 // +000b
10aget v4, v1, v2
11invoke-static {v3, v4}, Ljava/lang/Math;.min:(FF)F // method@0008
12move-result v3
13add-int/lit8 v2, v2, #int 1 // #01
14goto 000c // -000a

The ART¹ AOT² compiler is however able to optimize this away (the fact that the allocations show up in a benchmark without speed compilation means the JIT kept array though):

1fmin s0, s0, s1
2fmin s0, s2, s0
3fmin s0, s3, s0

If you need to min/max 4 values in a tight loop, an easy fix to benefit from optimized code in interpreted, JIT, and AOT modes is to use a 4 parameter specialization:

1inline fun fastMinOf(a: Float, b: Float, c: Float, d: Float): Float {
2    return minOf(a, minOf(b, minOf(c, d)))
3}

This implementation produces exactly what we expect in DEX:

1invoke-static {v2, v3}, Ljava/lang/Math;.min:(FF)F // method@0008
2move-result v2
3invoke-static {v1, v2}, Ljava/lang/Math;.min:(FF)F // method@0008
4move-result v1
5invoke-static {v0, v1}, Ljava/lang/Math;.min:(FF)F // method@0008
6move-result v0

And of course the aarch64 assembly matches what the AOT compiler was already doing:

1fmin s0, s0, s1
2fmin s0, s2, s0
3fmin s0, s3, s0

Here is how this new version compares in the same benchmarks as before:

minOf(3)         86,815 ns     0 allocs
minOf(4)        256,048 ns  9999 allocs
fastMinOf(4)    114,490 ns     0 allocs

All the allocations are gone, and we are now closer to the 2/3 ratio between minOf(3) and fastMinOf(4).

Will this impact your app? Not unless you invoke minOf()/maxOf() with 4 values in performance critical loops. For Jetpack Compose, optimizing the interpreted and JIT use cases is also important since it helps improve the debug workflows and reduces noise in other measurements.

The Android RunTime ↩︎
Ahead-Of-Time ↩︎