There is a simple but often overlooked technique to optimize performance-sensitive code: merging (or manually inlining) functions. We often build series of low-level functions that each execute a specific computation, and then combine them to perform higher-level tasks. Taken in isolation, each of those functions does exactly what it should and might even be perfectly optimized. However, when a series of those functions work together, unnecessary or duplicated work might appear.

Let’s look at a concrete example taken from the Jetpack Compose code base. To apply the various geometric transforms that may affect a layer, Compose needs to build a matrix that combines all the transformations exposed by its APIs:

  • A pivot point used as the center of the other transforms
  • A translation
  • A rotation around each of the X, Y, and Z axes
  • A scale along the X and Y axes

In practice, this was achieved by executing the following Kotlin code:

    // This matrix is cached and re-used across invocations
    val m: Matrix ...

    // Reset to identity
    m.reset()
    // Move to the pivot point
    m.translate(-pivotX, -pivotY)
    // Apply the user's transforms
    m *= Matrix().apply {
        translate(translationX, translationY)
        rotateX(rotationX)
        rotateY(rotationY)
        rotateZ(rotationZ)
        scale(scaleX, scaleY)
    }
    m *= Matrix().apply { translate(pivotX, pivotY) }

The code is easy to read as it relies on a series of low-level APIs provided by the Matrix type, but it suffers from two problems:

  1. It allocates two temporary matrices that don’t really serve any purpose but are required to combine transforms through the multiplication operator.
  2. There is a lot of duplicated work.

The allocations are a fairly obvious problem, but the second one is a bit more subtle, and only becomes visible when looking at and comparing the implementation of all the APIs that are invoked.

For instance, the code above starts with a call to reset(), which writes all 16 values of the 4x4 matrix to turn it back into the identity matrix. This is completely unnecessary work since the code that comes next will eventually write into almost every entry of the matrix.

Going through the generic multiplication operator also means we are treating both matrices involved as arbitrary 4x4 matrices that may contain arbitrary values. However, we know exactly which values are needed at every step, making many of the intermediate computations irrelevant (an identity matrix starts with a lot of zeroes).
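To make the cost of the generic path concrete, here is a minimal sketch that appends a translation to a matrix in two ways. It does not use the Compose Matrix type: the FloatArray layout (row-major, points treated as row vectors, translation stored at indices 12 and 13) and the helper names identity, multiply, and appendTranslation are assumptions made for illustration only.

```kotlin
import kotlin.math.abs

// Hypothetical 4x4 matrix stored as a row-major FloatArray(16).
// Assumed convention: points are row vectors (p' = p * M).
fun identity(): FloatArray =
    FloatArray(16).also { it[0] = 1f; it[5] = 1f; it[10] = 1f; it[15] = 1f }

// Generic product a * b: always 64 multiply-adds, whatever the inputs contain.
fun multiply(a: FloatArray, b: FloatArray): FloatArray {
    val r = FloatArray(16)
    for (row in 0..3) for (col in 0..3) {
        var sum = 0f
        for (k in 0..3) sum += a[row * 4 + k] * b[k * 4 + col]
        r[row * 4 + col] = sum
    }
    return r
}

// Specialized version of m * T(x, y). Only valid when the last column of m
// is (0, 0, 0, 1), i.e. m has no perspective component. Two additions,
// no allocation.
fun appendTranslation(m: FloatArray, x: Float, y: Float) {
    m[12] += x
    m[13] += y
}

fun main() {
    // An arbitrary matrix without perspective: non-uniform scale plus an offset.
    val base = identity().also { it[0] = 2f; it[5] = 3f; it[12] = 7f; it[13] = -1f }

    // Generic path: allocate a translation matrix, then run the full product.
    val t = identity().also { it[12] = 4f; it[13] = 5f }
    val generic = multiply(base, t)

    // Specialized path: update the two entries that can actually change.
    val specialized = base.copyOf()
    appendTranslation(specialized, 4f, 5f)

    check(generic.indices.all { abs(generic[it] - specialized[it]) < 1e-6f })
    println("generic and specialized results match")
}
```

Both paths produce the same matrix, but the generic one pays for an allocation and 64 multiply-adds, most of which multiply by 0 or 1.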

The solution in this case was to manually merge all the function calls, and to prune any work that was not necessary to build the final matrix. This was achieved by creating a new function called resetToPivotedTransform.
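The shape of such a merged function can be sketched as follows. This is not the Compose implementation: it is a simplified illustration that drops the X and Y rotations, takes the Z rotation in radians, and works on a hypothetical row-major FloatArray(16) with a row-vector convention (p' = p * M). The names resetToPivotedTransformSketch and composed, and the order in which the user transforms are combined (scale, then rotation, then translation), are assumptions.

```kotlin
import kotlin.math.abs
import kotlin.math.cos
import kotlin.math.sin

fun identity(): FloatArray =
    FloatArray(16).also { it[0] = 1f; it[5] = 1f; it[10] = 1f; it[15] = 1f }

fun translation(x: Float, y: Float) = identity().also { it[12] = x; it[13] = y }
fun scaling(x: Float, y: Float) = identity().also { it[0] = x; it[5] = y }
fun rotationZ(radians: Float): FloatArray {
    val c = cos(radians); val s = sin(radians)
    return identity().also { it[0] = c; it[1] = s; it[4] = -s; it[5] = c }
}

// Generic 4x4 product a * b.
fun multiply(a: FloatArray, b: FloatArray): FloatArray {
    val r = FloatArray(16)
    for (row in 0..3) for (col in 0..3) {
        var sum = 0f
        for (k in 0..3) sum += a[row * 4 + k] * b[k * 4 + col]
        r[row * 4 + col] = sum
    }
    return r
}

// Reference path built from generic pieces: T(-pivot) * S * Rz * T(t) * T(pivot).
// It allocates a new array for every intermediate result.
fun composed(px: Float, py: Float, tx: Float, ty: Float,
             rz: Float, sx: Float, sy: Float): FloatArray {
    val user = multiply(multiply(scaling(sx, sy), rotationZ(rz)), translation(tx, ty))
    return multiply(multiply(translation(-px, -py), user), translation(px, py))
}

// Merged path: every entry is written exactly once, so there is no reset,
// no temporary matrix, and no generic multiplication.
fun resetToPivotedTransformSketch(out: FloatArray, px: Float, py: Float,
                                  tx: Float, ty: Float,
                                  rz: Float, sx: Float, sy: Float) {
    val c = cos(rz); val s = sin(rz)
    val m00 = sx * c; val m01 = sx * s
    val m10 = -sy * s; val m11 = sy * c
    out[0] = m00; out[1] = m01; out[2] = 0f; out[3] = 0f
    out[4] = m10; out[5] = m11; out[6] = 0f; out[7] = 0f
    out[8] = 0f; out[9] = 0f; out[10] = 1f; out[11] = 0f
    // Pivot offset pushed through scale and rotation, then translation and pivot added back.
    out[12] = -px * m00 - py * m10 + tx + px
    out[13] = -px * m01 - py * m11 + ty + py
    out[14] = 0f; out[15] = 1f
}

fun main() {
    val merged = FloatArray(16) // stands in for the cached, re-used matrix
    resetToPivotedTransformSketch(merged, 50f, 20f, 10f, -5f, 0.5f, 2f, 3f)
    val reference = composed(50f, 20f, 10f, -5f, 0.5f, 2f, 3f)
    check(merged.indices.all { abs(merged[it] - reference[it]) < 1e-3f })
    println("merged result matches the composed reference")
}
```

Checking the merged function against the composed one, as main does here, is a cheap way to make sure the pruning did not change the result.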

Ignoring the cost of allocating the intermediate matrices, the original code was executing roughly 1,062 arm64 instructions (including 39 branches or so). By merging the functions and removing duplicated work, the final code only needs 168 instructions (with 1 branch), or about 16% of the original instruction stream.

Doing this only makes sense in performance-sensitive code. You should also note that R8 and ART will perform this kind of optimization in some cases, but not always. It is fairly easy to prevent inlining from happening: for instance, the code in a function may be too big, or the function may be called from too many places. This is why looking at the final disassembly of your code can be useful to figure out where such optimizations can or should take place.