I recently discussed an optimization that I worked on following Leland’s successful nerd snipe. That, however, was not the end of it. He also needed to test for intersecting/overlapping rectangles. The most obvious way to achieve this is pretty straightforward:

1// A rectangle is defined by its left (l), top (t),
2// right (r), and bottom (b) coordinates
3data class Rect(val l: Int, val t: Int, val r: Int, val b: Int) {
4    fun overlaps(other: Rect) =
5        l < other.r && other.l < r && t < other.b && other.t < b

The source code is nice and tidy, but the generated assembly is less than ideal (as always, the code was optimized with R8 first):

 1stp x0, lr, [sp, #-16]!
 2ldr w0, [x2, #16]
 3ldr w3, [x1, #12]
 4cmp w3, w0
 5b.ge #+0x3c
 6ldr w0, [x2, #12]
 7ldr w3, [x1, #16]
 8cmp w0, w3
 9b.ge #+0x2c
10ldr w0, [x1, #20]
11ldr w3, [x2, #8]
12cmp w0, w3
13b.ge #+0x1c
14ldr w0, [x2, #20]
15ldr w1, [x1, #8]
16cmp w0, w1
17b.ge #+0xc
18mov w0, #0x1
19b #+0x8
20mov w0, #0x0
21ldp xzr, lr, [sp], #16

It’s a direct translation of the Kotlin source code, meaning we have generated 4 conditional branches (plus an inconditional jump). While we may benefit from an early exit, all those branches can be (and likely will be for our use case) mispredicted, which isn’t good for performance.

Before we go any further let’s make something clear. If you are an app developer, the kind of optimizations described below do not matter for the vast majority, if not all, of code you write. They can however teach you important things about how to optimize code and why, should you ever need to. Also, it’s interesting and fun.

Getting rid of the branches Link to heading

We can remove the conditional branches in this particular test by replacing the logical operators && with bitwise operators and:

1fun overlaps(other: Rect) =
2    (l < other.r) and (other.l < r) and (t < other.b) and (other.t < b)

The source code is similar, but the generated assembly is quite different, even though the number of instructions is similar, going from 22 down to 21:

 1stp x0, lr, [sp, #-16]!
 2ldr w0, [x2, #16]
 3ldr w3, [x1, #12]
 4cmp w3, w0
 5cset w3, lt
 6ldr w4, [x2, #12]
 7ldr w5, [x1, #16]
 8cmp w4, w5
 9cset w4, lt
10and w3, w3, w4
11ldr w4, [x1, #20]
12cmp w4, w0
13cset w0, lt
14and w0, w3, w0
15ldr w2, [x2, #20]
16ldr w1, [x1, #8]
17cmp w2, w1
18cset w1, lt
19and w0, w0, w1
20ldp xzr, lr, [sp], #16

Looking closely, we can see that all branches are now gone. Does this matter? Using a benchmark to test whether a source rectangle overlaps with any of 10,000 other rectangles, we obtain the following results on a Google Pixel 6:

&&      62,552 ns
and     36,476 ns

The new version is 1.7x faster, not bad for such a small change. That’s great. Send that PR, and merge it. Unless… maybe we can do better?

Silly idea #1 Link to heading

If we look closely at the previous implementations we’ll note that we compare each source coordinate (left, top, etc.) to its opposite (src.left to dst.right, src.right to dst.left, etc.). Using this, we can build other ways to achieve those comparisons. For instance we can find the min and max of each pair of left and right coordinates, and look at the difference. If the max minus the min is negative, then the rectangles don’t overlap.

In the following implementation, we do this for the horizontal and vertical coordinates, or the results, and check if the sign bit is set. If it’s set, then the rectangles don’t overlap on one or both of the 2 axis, meaning they don’t overlap at all:

1fun overlaps(other: Rect): Boolean {
2    val x1 = maxOf(l, other.l)
3    val y1 = maxOf(t, other.t)
4    val x2 = minOf(r, other.r)
5    val y2 = minOf(b, other.b)
6    return (x2 - x1) or (y2 - y1) and 0x8000_0000.toInt() == 0

I’ll spare you the resulting aarch64 assembly, but it’s branchless, despite having, perhaps unsurprisingly, more instructions (27). Let’s compare it to our previous solutions anyway:

&&      62,552 ns
and     36,476 ns
silly1  40,590 ns

This solution is not as good as the simpler one based on bitwise and, but it’s still better than our original code because we got rid of all those branches.

Trying to be clever thankfully didn’t pan out (the code is much harder to understand!), so what else can we do?

Objects begone! Link to heading

Looking back at the assembly code from our original solutions, we can see a number of ldr instructions caused by reading the fields of a class. The DEX1 bytecode is easier to understand, for instance when we read the right coordinate from an object:

1iget v0, v3, LRect;.r:I

What if we simply ditched the class and operated on integers directly? To be truly useful, this solution would require storing rectangles in an array of ints making their manipulation a bit more difficult, but what kind of benefit, if any, could we get from this? Let’s rewrite our tests:

 1fun logical(l: Int, t: Int, r: Int, b: Int, l1: Int, t1: Int, r1: Int, b1: Int): Boolean {
 2    return l < r1 && l1 < r && t < b1 && t1 < b
 5fun bitwise(l: Int, t: Int, r: Int, b: Int, l1: Int, t1: Int, r1: Int, b1: Int): Boolean {
 6    return (l < r1) and (l1 < r) and (t < b1) and (t1 < b)
 9fun silly1(l: Int, t: Int, r: Int, b: Int, l1: Int, t1: Int, r1: Int, b1: Int): Boolean {
10    val x1 = maxOf(l, l1)
11    val y1 = maxOf(t, t1)
12    val x2 = minOf(r, r1)
13    val y2 = minOf(b, b1)
14    return (x2 - x1) or (y2 - y1) and 0x8000_0000.toInt() == 0

And let’s look at the generated assembly for the “bitwise” version of the test, our fastest solution so far:

 1ldr w0, [sp, #36]
 2cmp w1, w7
 3cset w1, lt
 4cmp w5, w3
 5cset w3, lt
 6and w1, w1, w3
 7cmp w2, w0
 8cset w0, lt
 9and w0, w1, w0
10cmp w6, w4
11cset w1, lt
12and w0, w0, w1

The code remains branchless, with only 13 instructions instead of 21, and more importantly with none of those pesky memory loads. The reduced number of instructions and the absence of loads greatly improves our benchmarks:

object.&&      62,552 ns
object.and     36,476 ns
object.silly1  40,590 ns

ints.&&        10,084 ns
ints.and        5,535 ns
ints.silly1     6,591 ns

Using primitives instead of objects sped up the original implementation (&&) by 6x! Similar improvements are seen in the other implementations. This means our fastest implementation, bitwise and operating on integers, is now ~11x faster than our original solution.

Silly idea #2 Link to heading

Since we are now working with primitives and arrays of primitives, there is another solution we can try. The idea for this last attempt is to perform all 4 comparisons at once. To do this, we need to pack the left, top, right, and bottom coordinates of a rectangle inside a single primitive. Since we are working with Kotlin, this means the widest type we can use is a Long (64 bits), and each coordinate will be constrained to 16 bits, meaning the maximum value will be 65,535. Actually, because of the way we will perform the comparisons, each value will be limited to 15 bits, or 32,767. The original problem Leland and I were discussing involved rectangles defined in screen coordinates, so this limitation is acceptable. Somewhat2.

First, we need a way to pack our coordinates. Note that for our comparisons to work, we need to encode the “source” and “destination” rectangles differently3. This is another limitation of this solution, but considering we want to test a single source rectangle against many destination rectangles, this limitation is once again perfectly acceptable:

 1// Encoded as left, right, top, bottom in a Long
 2fun SourcePackedRect(l: Int, t: Int, r: Int, b: Int) =
 3    ((l and 0xffff).toLong() shl 48) or
 4    ((r and 0xffff).toLong() shl 32) or
 5    ((t and 0xffff).toLong() shl 16) or
 6    ((b and 0xffff).toLong()       )
 8// Encoded as right, left, bottom, top in a Long
 9fun DestinationPackedRect(l: Int, t: Int, r: Int, b: Int) =
10    ((r and 0xffff).toLong() shl 48) or
11    ((l and 0xffff).toLong() shl 32) or
12    ((b and 0xffff).toLong() shl 16) or
13    ((t and 0xffff).toLong()       )

Using these encodings we can now do some bit shifting and masking to construct two longs, one contaning src.left, dst.left, src.top, dst.top and the other one dst.right, src.right, dst.bottom, src.bottom. The two longs can be compared with a bit of bitwise trickery to compare each 16-bit word, generating a Long that will contain the value 0x8000 for any of the 16-bit word where the source is less than the destination:

1fun overlaps(src: Long, dst: Long): Boolean {
2    val a = (src and 0xffff_0000_ffff_0000UL.toLong()) or (dst and 0x0000_ffff_0000_ffffL)
3    val b = (dst and 0xffff_0000_ffff_0000UL.toLong()) or (src and 0x0000_ffff_0000_ffffL)
4    return (b - a) and 0x8000_8000_8000_8000UL.toLong() == 0L

It is possible to lift the 2^15 limitation using a more complex solution involving computing the subtraction carry-out vector. The last line would become:

return (a.inv() and b) or ((a xor b).inv() and (a - b)) and 0x8000_8000_8000_8000UL.toLong() != 0L

This code is difficult to use because of the specific encoding required, and hard to read. The resulting assembly has 13 instructions and no branches, just like the bitwise integer solution.

Is it faster than our existing solutions? Well… yes and no:

object.&&      62,552 ns
object.and     36,476 ns
object.silly1  40,590 ns

ints.&&        10,084 ns
ints.and        5,535 ns
ints.silly1     6,591 ns

silly2         20,189 ns 

It’s better than using objects but (much) worse than just using integers. Silly idea instead. We can make it sillier by encoding each rectangle as two longs, to get rid of the 4 masking operations we start with (we go from 13 instructions to 9, and the times goes grom 20,189 ns to 7,462 ns; still not as good as the simplest solution).

Conclusion Link to heading

Even the simplest problems can have a surprising number of possible solutions (I tried a couple more I didn’t describe here but they weren’t interesting nor fast, and I’m sure there are other ways). Should you ditch classes and objects? If you are writing the kind of code where the answer is yes, you already know that the answer is yes. Should you take an easy win like the one showed in the and solution? If the code is on a hot path, absolutely. The code remains easy to read.

In the next blog post, we’ll talk about how Jake Wharton caused me to go down another silly rabbit hole.

  1. The bytecode used by the Android Runtime, ART. ↩︎

  2. Because nobody will ever use a display with more than 32,768 pixels horizontally or vertically, right? ↩︎

  3. We could pack them the same way and do some extra bit shifting but what’s the fun in that. ↩︎