Leland and I were recently discussing how best to implement a new data structure to speed up a specific aspect of Jetpack Compose. He came up with a great idea, and nerd-sniped me in the process. The problem was to efficiently encode the occupancy of an 8x8 grid represented as a Long (each bit representing a cell in the grid).

After coming up with the bit twiddling code that quickly “rasterizes” a rectangle into the grid as a bitfield, I found myself thinking about how incredibly helpful Kotlin can be at making low-level/micro-optimized code easy to read¹ for users of an API.

The grid

To understand what the code below does, here is an example of a grid initialized with a rectangle spanning from coordinate (1, 1) to (5, 5). Note that the left/top coordinates are inclusive while the right/bottom coordinates are exclusive.

val grid = Grid(1, 1, 5, 5)
println(grid)

Gives the following output:

00000000
01111000
01111000
01111000
01111000
00000000
00000000
00000000

As mentioned earlier, we want to encode the grid in a Long, and Kotlin offers a great way to do so while maintaining type safety and a nice API surface thanks to value classes:

inline fun Grid() = Grid(0L)
inline fun Grid(r: Rect) = Grid(rasterize(r.l, r.t, r.r, r.b))
inline fun Grid(l: Int, t: Int, r: Int, b: Int) = Grid(rasterize(l, t, r, b))

@JvmInline
value class Grid @PublishedApi internal constructor(@PublishedApi internal val cells: Long) {
    override fun toString() = buildString {
        for (y in 0..7) {
            val line = (cells ushr (56 - (y shl 3)) and 0xffL).toString(2).padStart(8, '0')
            appendLine(line)
        }
    }
}

@PublishedApi
internal fun rasterize(l: Int, t: Int, r: Int, b: Int): Long {
    val w = r - l
    val h = b - t
    val scanline = 0xffL ushr (8 - w) shl (8 - r)
    val rows = 0x01_01_01_01_01_01_01_01L ushr ((8 - h) shl 3) shl ((8 - b) shl 3)
    return rows * scanline
}

Let’s ignore the definition of Rect for now; just know that it’s a type that exposes four properties called l, t, r, and b for left, top, right, and bottom respectively. If we were to ship this code as a public API we’d change the properties to have longer names, but we’ll keep the single letter versions to keep our code concise here.
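To make the shrink-and-shift trick in rasterize() concrete, here is a self-contained walk-through for the rectangle (1, 1, 5, 5). The function is copied verbatim from above; the comments are my annotations of the intermediate values:

```kotlin
fun rasterize(l: Int, t: Int, r: Int, b: Int): Long {
    val w = r - l
    val h = b - t
    // scanline: start from a full row (0xff), shrink it down to the
    // rectangle's width, then shift it into horizontal position.
    // For (1, 1, 5, 5): 0xff ushr 4 = 0x0f, shl 3 = 0x78 -> 01111000
    val scanline = 0xffL ushr (8 - w) shl (8 - r)
    // rows: one low bit per row of the grid; the same shrink-and-shift
    // trick selects which rows the rectangle covers (here, bytes 3..6).
    val rows = 0x01_01_01_01_01_01_01_01L ushr ((8 - h) shl 3) shl ((8 - b) shl 3)
    // The multiplication replicates the scanline into every selected row.
    return rows * scanline
}

fun main() {
    println(rasterize(1, 1, 5, 5).toString(16)) // 78787878000000
}
```

Note how the result matches the 0x0078787878000000 constant we will see R8 emit below.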

With this code, we can now construct grid “instances” and benefit from the aforementioned type safety without paying the cost of any abstraction.

We actually pay so little that building a grid with a constant rectangle is effectively free. The following Kotlin function:

fun defaultGrid() = Grid(1, 1, 5, 5)

Yields the following DEX bytecode:

const-wide v0, #double 2.17795e-306 // #0078787878000000
return-wide v0

While this code was compiled with R8², which means extra optimizations were applied, the result is quite impressive given how rasterize() is implemented.

Adding operators

With a working value type, we can now add a series of APIs to simplify grid manipulations. The ability to overload operators proves to be particularly useful in a case like this. With these operators we can now perform interesting operations on the grid, such as adding rectangles to produce the union:

var grid = Grid(1, 1, 5, 5)
grid += Rect(3, 3, 7, 7)
println(grid)

// Output:
// 00000000
// 01111000
// 01111000
// 01111110
// 01111110
// 00011110
// 00011110
// 00000000

Or subtracting a rectangle:

var grid = Grid(1, 1, 5, 5)
grid -= Rect(2, 2, 6, 6)
println(grid)

// Output:
// 00000000
// 01111000
// 01000000
// 01000000
// 01000000
// 00000000
// 00000000
// 00000000

Or querying if a cell is occupied:

val grid = Grid(1, 1, 5, 5)
if (grid[4, 4]) {
    println("Cell 4,4 is occupied")
}

These operators are fairly trivial to implement, and more importantly extremely efficient once compiled:

inline operator fun get(x: Int, y: Int) =
    ((cells ushr ((7 - y) shl 3)) and (0x1L shl (7 - x))) != 0L

inline operator fun plus(r: Rect) =
    Grid(cells or rasterize(r.l, r.t, r.r, r.b))

inline operator fun minus(r: Rect) =
    Grid(cells and rasterize(r.l, r.t, r.r, r.b).inv())

inline infix fun and(r: Rect) =
    Grid(cells and rasterize(r.l, r.t, r.r, r.b))

inline fun intersects(r: Rect) =
    (cells and rasterize(r.l, r.t, r.r, r.b)) != 0L

We could keep going and add other operators such as xor, or provide functions that operate on two grids (the union of two grids, for instance, would just be a plus operator that ORs the cells values of both grids).
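As a sketch of what those grid-to-grid operations could look like, here is a simplified, self-contained Grid (the real class above keeps cells internal, and the operator names here are my own choices following the same pattern):

```kotlin
// Simplified Grid for illustration only; the real class hides `cells`
// behind @PublishedApi internal and inlines these operators.
@JvmInline
value class Grid(val cells: Long) {
    // Union: a cell is occupied if it is occupied in either grid.
    operator fun plus(other: Grid) = Grid(cells or other.cells)
    // Difference: clear every cell occupied in the other grid.
    operator fun minus(other: Grid) = Grid(cells and other.cells.inv())
    // Symmetric difference: cells occupied in exactly one of the two grids.
    infix fun xor(other: Grid) = Grid(cells xor other.cells)
}
```

Each operation is a single bitwise instruction on the underlying Long, so these compile just as tightly as the rectangle operators.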

Higher level APIs

A natural thing we may want to do with our grid is iterate over all occupied cells. We can easily add this capability with a forEach function that takes a lambda:

inline fun forEach(block: (Int, Int) -> Unit) {
    var v = cells
    while (v.hasNext()) {
        val index = v.get()
        block(index and 0x7, index ushr 3)
        v = v.next()
    }
}

@PublishedApi
internal inline fun Long.get() = 63 - countTrailingZeroBits()

@PublishedApi
internal inline fun Long.hasNext() = this != 0L

@PublishedApi
internal inline fun Long.next() = this and (this - 1L)

The resulting aarch64 code is nice and tidy since countTrailingZeroBits() is an ART intrinsic that compiles down to rbit + clz. Also note how we use extension functions to make our own code easier to read at no extra cost.
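To sanity-check the iteration helpers, here is a standalone adaptation of the loop (forEachCell is a free-function stand-in for the member forEach; rasterize() and the Long extensions are copied from above) that collects every occupied cell of a small rectangle:

```kotlin
fun rasterize(l: Int, t: Int, r: Int, b: Int): Long {
    val w = r - l
    val h = b - t
    val scanline = 0xffL ushr (8 - w) shl (8 - r)
    val rows = 0x01_01_01_01_01_01_01_01L ushr ((8 - h) shl 3) shl ((8 - b) shl 3)
    return rows * scanline
}

// The "iterator" is just the Long itself, consumed one set bit at a time.
fun Long.get() = 63 - countTrailingZeroBits()
fun Long.hasNext() = this != 0L
fun Long.next() = this and (this - 1L)

fun forEachCell(cells: Long, block: (Int, Int) -> Unit) {
    var v = cells
    while (v.hasNext()) {
        val index = v.get()
        block(index and 0x7, index ushr 3) // bit index maps back to (x, y)
        v = v.next()
    }
}

fun main() {
    val visited = mutableListOf<Pair<Int, Int>>()
    forEachCell(rasterize(2, 3, 4, 5)) { x, y -> visited += x to y }
    println(visited) // [(3, 4), (2, 4), (3, 3), (2, 3)]
}
```

Note the iteration order: because each step peels off the lowest set bit, cells are visited from the bottom-right of the rectangle back towards the top-left. If callers need a specific order that would have to be documented.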

Hiding the madness

With all of the above, the users of our API can be completely oblivious to what’s going on in our code and they can happily write highly readable code such as:

var grid = Grid(1, 1, 5, 5)
grid += Rect(3, 3, 7, 7)
grid = grid and Rect(5, 5, 8, 8)

grid.forEach { x, y ->
    process(x, y)
}

All of our abstractions will just go away and leave behind only highly optimized code. This forEach loop for instance:

var total = 0
grid.forEach { x, y ->
    total += x * y
}

Will produce the following aarch64 code:

mov w2, #0x3f
mov w0, #0x0
cmp x1, #0x0 (0)
cset w3, ne
cbz x1, #+0x2c (addr 0x29ec)
rbit x3, x1
clz x3, x3
sub w3, w2, w3
lsr w4, w3, #3
and w3, w3, #0x7
sub x5, x1, #0x1 (1)
madd w0, w3, w4, w0
and x1, x1, x5
ldr x21, [x21]
b #-0x30 (addr 0x29b8)

It is important to note that this kind of low-level optimized code also helps the compilers apply their own optimizations as the side-effect free code is easier to reason about.

The mighty R8 is impressive

We already briefly mentioned R8 above, but let’s look at a larger example. To truly understand the scope of what R8 can do, let’s see how our Rect works:

inline fun Int.uncheckedCoerceIn(minimumValue: Int, maximumValue: Int) =
    this.coerceAtLeast(minimumValue).coerceAtMost(maximumValue)

inline fun Rect(l: Int, t: Int, r: Int, b: Int) =
    Rect(
        ((l.uncheckedCoerceIn(0, 8) and 0xf) shl 24) or
        ((t.uncheckedCoerceIn(0, 8) and 0xf) shl 16) or
        ((r.uncheckedCoerceIn(0, 8) and 0xf) shl  8) or
        ((b.uncheckedCoerceIn(0, 8) and 0xf)       )
    )

@JvmInline
value class Rect @PublishedApi internal constructor(@PublishedApi internal val points: Int) {
    inline val l: Int get() = points ushr 24
    inline val t: Int get() = (points shr 16) and 0xf
    inline val r: Int get() = (points shr 8) and 0xf
    inline val b: Int get() = points and 0xf

    override fun toString() = "Rect($l, $t, $r, $b)"
}

Because our grid is an 8x8 grid, we can represent a rectangle using a single integer (we could even use a Short). Encoding and decoding the rectangle coordinates is straightforward; we only throw in a range coercion to guarantee the coordinates are valid³.
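As a quick sanity check of the packing scheme, here is the Rect code stripped of its inline/visibility annotations so it runs standalone, plus a round trip through packing and clamping:

```kotlin
fun Int.uncheckedCoerceIn(minimumValue: Int, maximumValue: Int) =
    this.coerceAtLeast(minimumValue).coerceAtMost(maximumValue)

// Packs four 4-bit coordinates into a single Int, clamping each to [0, 8].
fun Rect(l: Int, t: Int, r: Int, b: Int) =
    Rect(
        ((l.uncheckedCoerceIn(0, 8) and 0xf) shl 24) or
        ((t.uncheckedCoerceIn(0, 8) and 0xf) shl 16) or
        ((r.uncheckedCoerceIn(0, 8) and 0xf) shl 8) or
        (b.uncheckedCoerceIn(0, 8) and 0xf)
    )

@JvmInline
value class Rect(val points: Int) {
    val l: Int get() = points ushr 24
    val t: Int get() = (points shr 16) and 0xf
    val r: Int get() = (points shr 8) and 0xf
    val b: Int get() = points and 0xf
}

fun main() {
    // Coordinates survive a pack/unpack round trip...
    println(Rect(1, 2, 5, 7).let { listOf(it.l, it.t, it.r, it.b) }) // [1, 2, 5, 7]
    // ...and out-of-range values are clamped to the 8x8 grid.
    println(Rect(-3, 0, 42, 8).let { listOf(it.l, it.t, it.r, it.b) }) // [0, 0, 8, 8]
}
```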

Given this definition, you can understand that quite a bit is happening in this code:

var grid = Grid(1, 1, 5, 5)
grid += Rect(3, 3, 7, 7)

We call rasterize() twice, we pack 4 coordinates into a rectangle, and unpack those 4 coordinates. Without R8, the generated aarch64 code is:

sub x16, sp, #0x2000 (8192)
ldr wzr, [x16]
str x0, [sp, #-80]!
stp x22, x23, [sp, #32]
stp x24, x25, [sp, #48]
stp x26, lr, [sp, #64]
ldr x21, [x21]
mov w25, #0x8
mov x24, #0xff
mov x23, #0x101010101010101
mov w22, #0xf
adrp x0, #+0x3000 (addr 0x5000)
adr lr, #+0xc (addr 0x2e9c)
ldr w0, [x0, #28]
cbnz x20, #+0x5c8 (addr 0x3460)
cbz w0, #+0x154 (addr 0x2ff0)
ldrb w16, [x0, #115]
cmp w16, #0xf0 (240)
b.lo #+0x148 (addr 0x2ff0)
mov w2, #0x0
mov w1, #0x3
mov w0, #0x3d
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
mov x2, x25
mov x1, x0
mov w0, #0x3e
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
and w0, w0, #0xf
lsl w26, w0, #24
mov w2, #0x0
mov w1, #0x3
mov w0, #0x3d
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
mov x2, x25
mov x1, x0
mov w0, #0x3e
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
and w0, w0, #0xf
orr w26, w26, w0, lsl #16
mov w2, #0x0
mov w1, #0x7
mov w0, #0x3d
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
mov x2, x25
mov x1, x0
mov w0, #0x3e
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
and w0, w0, #0xf
orr w26, w26, w0, lsl #8
mov w2, #0x0
mov w1, #0x7
mov w0, #0x3d
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
mov x2, x25
mov x1, x0
mov w0, #0x3e
ldr lr, [tr, #1240] ; pInvokeStaticTrampolineWithAccessCheck
blr lr
and w0, w0, #0xf
orr w0, w0, w26
adrp x1, #+0x3000 (addr 0x5000)
adr lr, #+0xc (addr 0x2f7c)
ldr w1, [x1, #32]
cbnz x20, #+0x448 (addr 0x33c0)
cbz w1, #+0x84 (addr 0x3000)
ldrb w16, [x1, #115]
cmp w16, #0xf0 (240)
b.lo #+0x78 (addr 0x3000)
and w1, w22, w0, asr #16
and w2, w22, w0, asr #8
and w3, w0, #0xf
sub w0, w2, w0, lsr #24
sub w1, w3, w1
sub w0, w25, w0
lsr x0, x24, x0
sub w2, w25, w2
lsl x0, x0, x2
sub w1, w25, w1
lsl w1, w1, #3
lsr x1, x23, x1
sub w2, w25, w3
lsl w2, w2, #3
lsl x1, x1, x2
mul x0, x0, x1
mov x16, #0x78000000
movk x16, #0x7878, lsl #32
movk x16, #0x78, lsl #48
orr x0, x0, x16
ldp x22, x23, [sp, #32]
ldp x24, x25, [sp, #48]
ldp x26, lr, [sp, #64]
add sp, sp, #0x50 (80)
ret
mov w0, #0x4
bl #+0x56c (addr 0x3560) ; pResolveType
bl #+0x578 (addr 0x3570) ; pInitializeStaticStorage
b #-0x150 (addr 0x2eac)
str x0, [sp, #16]
mov w0, #0x6
bl #+0x558 (addr 0x3560) ; pResolveType
bl #+0x564 (addr 0x3570) ; pInitializeStaticStorage
mov w1, w0
ldr x0, [sp, #16]
b #-0x8c (addr 0x2f8c)

And what if we enable R8? We get a single constant load:

mov x0, #0x1e00
movk x0, #0x7e1e, lsl #16
movk x0, #0x787e, lsl #32
movk x0, #0x78, lsl #48
ret

Turn on R8.

The complete version of the code presented here is available as a GitHub gist.


  1. Except the bit twiddling code itself. It remains one of my major complaints about Kotlin. ↩︎

  2. We’ll get back to this with a more complex example. You want to enable R8. ↩︎

  3. We use our own version of coerceIn(). The standard one has unwanted side effects that break a number of optimizations. I may write a post about it in the future. ↩︎