The Android Runtime (ART) offers a nice memory safety feature when accessing the content of an array. The indices you use are automatically checked against the bounds of the array to prevent unsafe memory accesses. To achieve this, ART generates extra machine instructions to throw an ArrayIndexOutOfBoundsException when the index is invalid.

Here is a simple Kotlin example:

fun scaleZ(values: FloatArray, scale: Float) = values[2] * scale

After translation to arm64 assembly, we obtain the following result:

stp  x0, lr, [sp, #-16]!
ldr  w0, [x1, #8]
cmp  w0, #0x2 (2)
b.ls #+0x14 (addr 0x1200)
ldr  s1, [x1, #20]
fmul s0, s0, s1
ldp  xzr, lr, [sp], #16
ret
mov  x1, x0
mov  w0, #0x2
bl   #+0x508 (addr 0x1710) ; pThrowArrayBounds

The ldr, cmp, and b.ls instructions near the top check that the array has at least 3 entries, so that index 2 is valid. When the comparison fails, execution jumps to a slow path at the end of the function that calls an ART-internal helper, pThrowArrayBounds, to throw the exception.

Having this kind of safety is wonderful for many reasons I do not need to enumerate. It is however painful to know that so many instructions are generated and executed in situations where you know that the index cannot be out of bounds. The compilers in the Android toolchain (including ART’s JIT and AOT compilers) try to be smart about this and avoid making unnecessary checks when possible. There are however limits to what they can assume.
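
For example, in a plain forward loop like the hypothetical function below (not from the original post), ART's own bounds check elimination can usually prove that every access is in bounds and drop the checks without any help:

fun sum(values: FloatArray): Float {
    var total = 0f
    for (i in values.indices) {
        // The induction variable i is provably within [0, values.size),
        // so the runtime check can typically be removed by the compiler.
        total += values[i]
    }
    return total
}

The interesting cases are the ones where no such proof is visible to the compiler.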

Helping the compiler

Let’s take a concrete example and look at the Matrix used in Jetpack Compose:

@kotlin.jvm.JvmInline
value class Matrix(
    val values: FloatArray = floatArrayOf(
        1f, 0f, 0f, 0f,
        0f, 1f, 0f, 0f,
        0f, 0f, 1f, 0f,
        0f, 0f, 0f, 1f
    )
) {
    inline operator fun get(row: Int, column: Int) = values[(row * 4) + column]

    inline operator fun set(row: Int, column: Int, v: Float) {
        values[(row * 4) + column] = v
    }
}

This class is a simple wrapper around an array of 16 floats that is never re-allocated. We therefore know that this array is always exactly of size 16. Unfortunately, there is no guarantee at the language level that this will be the case: even though the array is marked val, we could replace it via reflection or JNI, or, in the case of a regular class, access the field during partial construction of an instance, before it has been assigned.
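
As a hedged illustration (using a hypothetical Holder class rather than the value class above, since reflection behaves differently for inline classes), even a val-backed field can be swapped out at runtime:

class Holder(val values: FloatArray = FloatArray(16))

fun main() {
    val holder = Holder()
    // Swap the supposedly fixed-size array behind the val property,
    // subject to the runtime's reflection restrictions.
    val field = Holder::class.java.getDeclaredField("values")
    field.isAccessible = true
    field.set(holder, FloatArray(4))
    println(holder.values.size) // prints 4, not 16
}

Since the runtime allows this kind of thing, the compiler has to assume the worst about the array's size.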

This can have fairly large consequences for somewhat innocent-looking code like this:

fun Matrix.isIdentity(): Boolean {
    val v = values
    return v[0] == 1f &&
        v[1] == 0f &&
        v[2] == 0f &&
        v[3] == 0f &&
        v[4] == 0f &&
        v[5] == 1f &&
        v[6] == 0f &&
        v[7] == 0f &&
        v[8] == 0f &&
        v[9] == 0f &&
        v[10] == 1f &&
        v[11] == 0f &&
        v[12] == 0f &&
        v[13] == 0f &&
        v[14] == 0f &&
        v[15] == 1f
}

The arm64 code generated is the following:

stp  x0, lr, [sp, #-16]!
fmov s0, #0x70 (1.0000)
ldr  w0, [x1, #8]
cmp  w0, #0x0 (0)
b.ls #+0x150 (addr 0x1240)
ldr  s1, [x1, #12]
fcmp s1, s0
b.ne #+0x138 (addr 0x1234)
cmp  w0, #0x1 (1)
b.ls #+0x148 (addr 0x124c)
ldr  s1, [x1, #16]
fcmp s1, #0.0
b.ne #+0x124 (addr 0x1234)
cmp  w0, #0x2 (2)
b.ls #+0x140 (addr 0x1258)
ldr  s1, [x1, #20]
fcmp s1, #0.0
b.ne #+0x110 (addr 0x1234)
cmp  w0, #0x3 (3)
b.ls #+0x138 (addr 0x1264)
ldr  s1, [x1, #24]
fcmp s1, #0.0
b.ne #+0xfc (addr 0x1234)
cmp  w0, #0x4 (4)
b.ls #+0x130 (addr 0x1270)
ldr  s1, [x1, #28]
fcmp s1, #0.0
b.ne #+0xe8 (addr 0x1234)
cmp  w0, #0x5 (5)
b.ls #+0x128 (addr 0x127c)
ldr  s1, [x1, #32]
fcmp s1, s0
b.ne #+0xd4 (addr 0x1234)
cmp  w0, #0x6 (6)
b.ls #+0x120 (addr 0x1288)
ldr  s1, [x1, #36]
fcmp s1, #0.0
b.ne #+0xc0 (addr 0x1234)
cmp  w0, #0x7 (7)
b.ls #+0x118 (addr 0x1294)
ldr  s1, [x1, #40]
fcmp s1, #0.0
b.ne #+0xac (addr 0x1234)
cmp  w0, #0x8 (8)
b.ls #+0x110 (addr 0x12a0)
ldr  s1, [x1, #44]
fcmp s1, #0.0
b.ne #+0x98 (addr 0x1234)
cmp  w0, #0x9 (9)
b.ls #+0x108 (addr 0x12ac)
ldr  s1, [x1, #48]
fcmp s1, #0.0
b.ne #+0x84 (addr 0x1234)
cmp  w0, #0xa (10)
b.ls #+0x100 (addr 0x12b8)
ldr  s1, [x1, #52]
fcmp s1, s0
b.ne #+0x70 (addr 0x1234)
cmp  w0, #0xb (11)
b.ls #+0xf8 (addr 0x12c4)
ldr  s1, [x1, #56]
fcmp s1, #0.0
b.ne #+0x5c (addr 0x1234)
cmp  w0, #0xc (12)
b.ls #+0xf0 (addr 0x12d0)
ldr  s1, [x1, #60]
fcmp s1, #0.0
b.ne #+0x48 (addr 0x1234)
cmp  w0, #0xd (13)
b.ls #+0xe8 (addr 0x12dc)
ldr  s1, [x1, #64]
fcmp s1, #0.0
b.ne #+0x34 (addr 0x1234)
cmp  w0, #0xe (14)
b.ls #+0xe0 (addr 0x12e8)
ldr  s1, [x1, #68]
fcmp s1, #0.0
b.ne #+0x20 (addr 0x1234)
cmp  w0, #0xf (15)
b.ls #+0xd8 (addr 0x12f4)
ldr  s1, [x1, #72]
fcmp s1, s0
b.ne #+0xc (addr 0x1234)
mov  w0, #0x1
b    #+0x8 (addr 0x1238)
mov  w0, #0x0
ldp  xzr, lr, [sp], #16
ret
mov  x1, x0
mov  w0, #0x0
bl   #+0x5f8 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x1
bl   #+0x5ec (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x2
bl   #+0x5e0 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x3
bl   #+0x5d4 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x4
bl   #+0x5c8 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x5
bl   #+0x5bc (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x6
bl   #+0x5b0 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x7
bl   #+0x5a4 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x8
bl   #+0x598 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0x9
bl   #+0x58c (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xa
bl   #+0x580 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xb
bl   #+0x574 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xc
bl   #+0x568 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xd
bl   #+0x55c (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xe
bl   #+0x550 (addr 0x1840) ; pThrowArrayBounds
mov  x1, x0
mov  w0, #0xf
bl   #+0x544 (addr 0x1840) ; pThrowArrayBounds

We have a total of 136 instructions. For each array access, the compiler first checks the bounds and, if the test fails, execution jumps to one of the 16 pThrowArrayBounds calls at the end of the function to throw an exception with the appropriate index value. That's a lot of code we will never need.

Thankfully, we can help the compiler with a simple change, the explicit size check at the top of the function below:

fun Matrix.isIdentity(): Boolean {
    val v = values
    if (v.size < 16) return false
    return v[0] == 1f &&
        v[1] == 0f &&
        v[2] == 0f &&
        v[3] == 0f &&
        v[4] == 0f &&
        v[5] == 1f &&
        v[6] == 0f &&
        v[7] == 0f &&
        v[8] == 0f &&
        v[9] == 0f &&
        v[10] == 1f &&
        v[11] == 0f &&
        v[12] == 0f &&
        v[13] == 0f &&
        v[14] == 0f &&
        v[15] == 1f
}

This check, while apparently useless since our array is never allocated with fewer than 16 elements, gives the compiler enough information to get rid of all the bounds checks:

stp  x0, lr, [sp, #-16]!
ldr  w0, [x1, #8]
cmp  w0, #0x10 (16)
b.ge #+0xc (addr 0x10f8)
mov  w0, #0x0
b    #+0xd4 (addr 0x11c8)
fmov s0, #0x70 (1.0000)
ldr  s1, [x1, #12]
fcmp s1, s0
b.ne #+0xc0 (addr 0x11c4)
ldr  s1, [x1, #16]
fcmp s1, #0.0
b.ne #+0xb4 (addr 0x11c4)
ldr  s1, [x1, #20]
fcmp s1, #0.0
b.ne #+0xa8 (addr 0x11c4)
ldr  s1, [x1, #24]
fcmp s1, #0.0
b.ne #+0x9c (addr 0x11c4)
ldr  s1, [x1, #28]
fcmp s1, #0.0
b.ne #+0x90 (addr 0x11c4)
ldr  s1, [x1, #32]
fcmp s1, s0
b.ne #+0x84 (addr 0x11c4)
ldr  s1, [x1, #36]
fcmp s1, #0.0
b.ne #+0x78 (addr 0x11c4)
ldr  s1, [x1, #40]
fcmp s1, #0.0
b.ne #+0x6c (addr 0x11c4)
ldr  s1, [x1, #44]
fcmp s1, #0.0
b.ne #+0x60 (addr 0x11c4)
ldr  s1, [x1, #48]
fcmp s1, #0.0
b.ne #+0x54 (addr 0x11c4)
ldr  s1, [x1, #52]
fcmp s1, s0
b.ne #+0x48 (addr 0x11c4)
ldr  s1, [x1, #56]
fcmp s1, #0.0
b.ne #+0x3c (addr 0x11c4)
ldr  s1, [x1, #60]
fcmp s1, #0.0
b.ne #+0x30 (addr 0x11c4)
ldr  s1, [x1, #64]
fcmp s1, #0.0
b.ne #+0x24 (addr 0x11c4)
ldr  s1, [x1, #68]
fcmp s1, #0.0
b.ne #+0x18 (addr 0x11c4)
ldr  s1, [x1, #72]
fcmp s1, s0
b.ne #+0xc (addr 0x11c4)
mov  w0, #0x1
b    #+0x8 (addr 0x11c8)
mov  w0, #0x0
ldp  xzr, lr, [sp], #16
ret

The function is now down to 60 instructions and does exactly what we wrote in our Kotlin code.

Inlining

Calling functions is another barrier that prevents the compiler from fully understanding your code. Here is the original implementation of the *= operator for Matrix:

operator fun timesAssign(m: Matrix) {
    val v00 = dot(this, 0, m, 0)
    val v01 = dot(this, 0, m, 1)
    val v02 = dot(this, 0, m, 2)
    val v03 = dot(this, 0, m, 3)
    val v10 = dot(this, 1, m, 0)
    val v11 = dot(this, 1, m, 1)
    val v12 = dot(this, 1, m, 2)
    val v13 = dot(this, 1, m, 3)
    val v20 = dot(this, 2, m, 0)
    val v21 = dot(this, 2, m, 1)
    val v22 = dot(this, 2, m, 2)
    val v23 = dot(this, 2, m, 3)
    val v30 = dot(this, 3, m, 0)
    val v31 = dot(this, 3, m, 1)
    val v32 = dot(this, 3, m, 2)
    val v33 = dot(this, 3, m, 3)

    val v = values
    v[ 0] = v00
    v[ 1] = v01
    v[ 2] = v02
    v[ 3] = v03
    v[ 4] = v10
    v[ 5] = v11
    v[ 6] = v12
    v[ 7] = v13
    v[ 8] = v20
    v[ 9] = v21
    v[10] = v22
    v[11] = v23
    v[12] = v30
    v[13] = v31
    v[14] = v32
    v[15] = v33
}
}

private fun dot(m1: Matrix, row: Int, m2: Matrix, column: Int): Float {
    return m1[row, 0] * m2[0, column] +
        m1[row, 1] * m2[1, column] +
        m1[row, 2] * m2[2, column] +
        m1[row, 3] * m2[3, column]
}

Each call to the dot product function dot() executes a rather lengthy function that performs its own array bounds checks (two, one for each array passed as an argument), for a total of 32 checks, plus the one done in timesAssign() itself when writing the values at the end. It is interesting to note that, in this case, the compiler generates a single bounds check for the writes, which results in a deoptimized (interpreted) execution of the function on failure, instead of one bounds check per access as we saw in isIdentity().

Written in this manner, timesAssign() requires executing around 860 instructions to complete. We can bring this down to 180 instructions by doing two things (sketched below):

  • Marking dot() as inline
  • Doing a manual array size check for both arrays at the beginning of timesAssign()
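
Here is a minimal sketch of what those two changes could look like, assuming the Matrix value class shown earlier and written as a top-level function and an extension so the snippet stands alone; it is illustrative rather than the exact code shipped in Compose:

// Sketch only: names and structure are assumptions, not the exact Compose code.
private inline fun dot(m1: Matrix, row: Int, m2: Matrix, column: Int): Float {
    return m1[row, 0] * m2[0, column] +
        m1[row, 1] * m2[1, column] +
        m1[row, 2] * m2[2, column] +
        m1[row, 3] * m2[3, column]
}

operator fun Matrix.timesAssign(m: Matrix) {
    val v = values
    val o = m.values
    // The "useless" checks: they give the compiler a proof that every index
    // used below (0..15 in both arrays) is in bounds.
    if (v.size < 16 || o.size < 16) return

    val v00 = dot(this, 0, m, 0); val v01 = dot(this, 0, m, 1)
    val v02 = dot(this, 0, m, 2); val v03 = dot(this, 0, m, 3)
    val v10 = dot(this, 1, m, 0); val v11 = dot(this, 1, m, 1)
    val v12 = dot(this, 1, m, 2); val v13 = dot(this, 1, m, 3)
    val v20 = dot(this, 2, m, 0); val v21 = dot(this, 2, m, 1)
    val v22 = dot(this, 2, m, 2); val v23 = dot(this, 2, m, 3)
    val v30 = dot(this, 3, m, 0); val v31 = dot(this, 3, m, 1)
    val v32 = dot(this, 3, m, 2); val v33 = dot(this, 3, m, 3)

    v[ 0] = v00; v[ 1] = v01; v[ 2] = v02; v[ 3] = v03
    v[ 4] = v10; v[ 5] = v11; v[ 6] = v12; v[ 7] = v13
    v[ 8] = v20; v[ 9] = v21; v[10] = v22; v[11] = v23
    v[12] = v30; v[13] = v31; v[14] = v32; v[15] = v33
}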

After inlining, the compiler can perform extra optimizations, such as eliminating redundant bounds checks and loads. Load elimination goes beyond the scope of this post; we will explore it in further detail in the next post.

Always worth it?

Of course, like any optimization technique, whether doing this matters depends largely on the code you're applying it to. Make sure to always benchmark, but this technique might be useful in code that processes large arrays (audio, text, pixels, etc.).
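
As a hypothetical example (not from the original post), a strided-access routine can sometimes benefit from a single up-front size check that makes the computed indices provably valid; whether the compiler actually drops the per-element checks depends on its version, so measure:

// Scale interleaved stereo samples in place (hypothetical helper).
fun scaleStereo(samples: FloatArray, frameCount: Int, left: Float, right: Float) {
    // One guard up front instead of two bounds checks per frame.
    if (samples.size < frameCount * 2) return
    for (frame in 0 until frameCount) {
        val i = frame * 2
        samples[i] *= left      // left channel
        samples[i + 1] *= right // right channel
    }
}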

Even when the compiler does a good job on its own, you can save a few instructions. For instance, this Kotlin function produces fairly optimal arm64 instructions:

fun scale(x: Float = 1f, y: Float = 1f, z: Float = 1f) {
    this[0, 0] *= x
    this[0, 1] *= x
    this[0, 2] *= x
    this[0, 3] *= x
    this[1, 0] *= y
    this[1, 1] *= y
    this[1, 2] *= y
    this[1, 3] *= y
    this[2, 0] *= z
    this[2, 1] *= z
    this[2, 2] *= z
    this[2, 3] *= z
}

The generated assembly contains a single bounds check:

sub  x16, sp, #0x2000 (8192)
ldr  wzr, [x16]
str  x0, [sp, #-32]!
str  lr, [sp, #24]
ldr  w0, [x1, #8]
cmp  w0, #0xb (11)
b.ls #+0xa0 (addr 0x11f8)
ldr  s3, [x1, #12]
fmul s3, s0, s3
str  s3, [x1, #12]
ldr  s3, [x1, #16]
fmul s3, s0, s3
str  s3, [x1, #16]
ldr  s3, [x1, #20]
fmul s3, s0, s3
str  s3, [x1, #20]
ldr  s3, [x1, #24]
fmul s0, s0, s3
str  s0, [x1, #24]
ldr  s0, [x1, #28]
fmul s0, s1, s0
str  s0, [x1, #28]
ldr  s0, [x1, #32]
fmul s0, s1, s0
str  s0, [x1, #32]
ldr  s0, [x1, #36]
fmul s0, s1, s0
str  s0, [x1, #36]
ldr  s0, [x1, #40]
fmul s0, s1, s0
str  s0, [x1, #40]
ldr  s0, [x1, #44]
fmul s0, s2, s0
str  s0, [x1, #44]
ldr  s0, [x1, #48]
fmul s0, s2, s0
str  s0, [x1, #48]
ldr  s0, [x1, #52]
fmul s0, s2, s0
str  s0, [x1, #52]
ldr  s0, [x1, #56]
fmul s0, s2, s0
str  s0, [x1, #56]
ldr  lr, [sp, #24]
add  sp, sp, #0x20 (32)
ret
str  x0, [sp, #8]
mov  x0, #0x5
bl   #+0x4f0 (addr 0x16f0) ; pDeoptimize

The only gain to be had here is to get rid of the deoptimization path at the end (a binary size and instruction cache improvement) and to simplify the prologue of the function. Adding our own bounds check saves 7 instructions over the entire function, 3 of those being the deoptimization path itself.
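
The post does not show the exact guard, but it could look something like the sketch below (written as an extension so it stands alone):

fun Matrix.scale(x: Float = 1f, y: Float = 1f, z: Float = 1f) {
    // Early return if the array is unexpectedly small; pick whatever
    // failure behavior makes sense for your code (this could also throw).
    if (values.size < 16) return
    this[0, 0] *= x
    this[0, 1] *= x
    this[0, 2] *= x
    this[0, 3] *= x
    this[1, 0] *= y
    this[1, 1] *= y
    this[1, 2] *= y
    this[1, 3] *= y
    this[2, 0] *= z
    this[2, 1] *= z
    this[2, 2] *= z
    this[2, 3] *= z
}

With a check along these lines, the generated assembly becomes: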

stp  x0, lr, [sp, #-16]!
ldr  w0, [x1, #8]
cmp  w0, #0x10 (16)
b.lt #+0x94 (addr 0x11e0)
ldr  s3, [x1, #12]
fmul s3, s0, s3
str  s3, [x1, #12]
ldr  s3, [x1, #16]
fmul s3, s0, s3
str  s3, [x1, #16]
ldr  s3, [x1, #20]
fmul s3, s0, s3
str  s3, [x1, #20]
ldr  s3, [x1, #24]
fmul s0, s0, s3
str  s0, [x1, #24]
ldr  s0, [x1, #28]
fmul s0, s1, s0
str  s0, [x1, #28]
ldr  s0, [x1, #32]
fmul s0, s1, s0
str  s0, [x1, #32]
ldr  s0, [x1, #36]
fmul s0, s1, s0
str  s0, [x1, #36]
ldr  s0, [x1, #40]
fmul s0, s1, s0
str  s0, [x1, #40]
ldr  s0, [x1, #44]
fmul s0, s2, s0
str  s0, [x1, #44]
ldr  s0, [x1, #48]
fmul s0, s2, s0
str  s0, [x1, #48]
ldr  s0, [x1, #52]
fmul s0, s2, s0
str  s0, [x1, #52]
ldr  s0, [x1, #56]
fmul s0, s2, s0
str  s0, [x1, #56]
ldp  xzr, lr, [sp], #16
ret

Use this wisely in your own code, and make sure to check that what you are doing actually improves both the generated instruction stream and your benchmarks.