The Android Runtime (ART) offers a nice memory safety feature when accessing the content of an array. The indices you use are automatically checked against the bounds of the array to prevent unsafe memory accesses. To achieve this, ART generates extra machine instructions to throw an ArrayIndexOutOfBoundsException when the index is invalid.
Here is a simple Kotlin example:
fun scaleZ(values: FloatArray, scale: Float) = values[2] * scale
After translation to arm64 assembly, we obtain the following result:
stp x0, lr, [sp, #-16]!
ldr w0, [x1, #8]
cmp w0, #0x2 (2)
b.ls #+0x14 (addr 0x1200)
ldr s1, [x1, #20]
fmul s0, s0, s1
ldp xzr, lr, [sp], #16
ret
mov x1, x0
mov w0, #0x2
bl #+0x508 (addr 0x1710) ; pThrowArrayBounds
The ldr/cmp/b.ls sequence loads the array's length and checks that index 2 is valid, which requires the array to have at least 3 entries. When the comparison fails, execution jumps to the epilogue, where a call into an ART-internal function throws the exception.
Having this kind of safety is wonderful for many reasons I do not need to enumerate. It is however painful to know that so many instructions are generated and executed in situations where you know that the index cannot be out of bounds. The compilers in the Android toolchain (including ART’s JIT and AOT compilers) try to be smart about this and avoid making unnecessary checks when possible. There are however limits to what they can assume.
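For example, when the index is derived directly from the array's own size, as in the loop below, ART's compilers can typically prove every access is in bounds and elide the checks entirely (a small illustration of mine, not taken from the code discussed in this post):

// The loop bound comes from the array itself, so the compiler can usually
// prove that values[i] is always valid and remove the bounds check.
fun sum(values: FloatArray): Float {
    var total = 0f
    for (i in values.indices) {
        total += values[i]
    }
    return total
}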
Helping the compiler
Let’s take a concrete example and look at the Matrix class used in Jetpack Compose:
@kotlin.jvm.JvmInline
value class Matrix(
    val values: FloatArray = floatArrayOf(
        1f, 0f, 0f, 0f,
        0f, 1f, 0f, 0f,
        0f, 0f, 1f, 0f,
        0f, 0f, 0f, 1f
    )
) {
    inline operator fun get(row: Int, column: Int) = values[(row * 4) + column]

    inline operator fun set(row: Int, column: Int, v: Float) {
        values[(row * 4) + column] = v
    }
}
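As a quick illustration (mine, not from the original class), the operators flatten a (row, column) pair into a single index in the backing array:

val m = Matrix()
val center = m[2, 2] // reads values[2 * 4 + 2], i.e. values[10]
m[3, 0] = 5f         // writes values[3 * 4 + 0], i.e. values[12]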
This class is a simple wrapper around an array of 16 floats that is never re-allocated. We therefore know that this array is always exactly of size 16. Unfortunately, there is no guarantee at the language level that this will be the case. Even though the array is marked val, we could reallocate it via reflection or JNI, or, in the case of a class, try to access it during partial construction of an instance.
This can have fairly large consequences for somewhat innocent-looking code like this:
fun Matrix.isIdentity(): Boolean {
    val v = values
    return v[0] == 1f &&
           v[1] == 0f &&
           v[2] == 0f &&
           v[3] == 0f &&
           v[4] == 0f &&
           v[5] == 1f &&
           v[6] == 0f &&
           v[7] == 0f &&
           v[8] == 0f &&
           v[9] == 0f &&
           v[10] == 1f &&
           v[11] == 0f &&
           v[12] == 0f &&
           v[13] == 0f &&
           v[14] == 0f &&
           v[15] == 1f
}
The arm64 code generated is the following:
stp x0, lr, [sp, #-16]!
fmov s0, #0x70 (1.0000)
ldr w0, [x1, #8]
cmp w0, #0x0 (0)
b.ls #+0x150 (addr 0x1240)
ldr s1, [x1, #12]
fcmp s1, s0
b.ne #+0x138 (addr 0x1234)
cmp w0, #0x1 (1)
b.ls #+0x148 (addr 0x124c)
ldr s1, [x1, #16]
fcmp s1, #0.0
b.ne #+0x124 (addr 0x1234)
cmp w0, #0x2 (2)
b.ls #+0x140 (addr 0x1258)
ldr s1, [x1, #20]
fcmp s1, #0.0
b.ne #+0x110 (addr 0x1234)
cmp w0, #0x3 (3)
b.ls #+0x138 (addr 0x1264)
ldr s1, [x1, #24]
fcmp s1, #0.0
b.ne #+0xfc (addr 0x1234)
cmp w0, #0x4 (4)
b.ls #+0x130 (addr 0x1270)
ldr s1, [x1, #28]
fcmp s1, #0.0
b.ne #+0xe8 (addr 0x1234)
cmp w0, #0x5 (5)
b.ls #+0x128 (addr 0x127c)
ldr s1, [x1, #32]
fcmp s1, s0
b.ne #+0xd4 (addr 0x1234)
cmp w0, #0x6 (6)
b.ls #+0x120 (addr 0x1288)
ldr s1, [x1, #36]
fcmp s1, #0.0
b.ne #+0xc0 (addr 0x1234)
cmp w0, #0x7 (7)
b.ls #+0x118 (addr 0x1294)
ldr s1, [x1, #40]
fcmp s1, #0.0
b.ne #+0xac (addr 0x1234)
cmp w0, #0x8 (8)
b.ls #+0x110 (addr 0x12a0)
ldr s1, [x1, #44]
fcmp s1, #0.0
b.ne #+0x98 (addr 0x1234)
cmp w0, #0x9 (9)
b.ls #+0x108 (addr 0x12ac)
ldr s1, [x1, #48]
fcmp s1, #0.0
b.ne #+0x84 (addr 0x1234)
cmp w0, #0xa (10)
b.ls #+0x100 (addr 0x12b8)
ldr s1, [x1, #52]
fcmp s1, s0
b.ne #+0x70 (addr 0x1234)
cmp w0, #0xb (11)
b.ls #+0xf8 (addr 0x12c4)
ldr s1, [x1, #56]
fcmp s1, #0.0
b.ne #+0x5c (addr 0x1234)
cmp w0, #0xc (12)
b.ls #+0xf0 (addr 0x12d0)
ldr s1, [x1, #60]
fcmp s1, #0.0
b.ne #+0x48 (addr 0x1234)
cmp w0, #0xd (13)
b.ls #+0xe8 (addr 0x12dc)
ldr s1, [x1, #64]
fcmp s1, #0.0
b.ne #+0x34 (addr 0x1234)
cmp w0, #0xe (14)
b.ls #+0xe0 (addr 0x12e8)
ldr s1, [x1, #68]
fcmp s1, #0.0
b.ne #+0x20 (addr 0x1234)
cmp w0, #0xf (15)
b.ls #+0xd8 (addr 0x12f4)
ldr s1, [x1, #72]
fcmp s1, s0
b.ne #+0xc (addr 0x1234)
mov w0, #0x1
b #+0x8 (addr 0x1238)
mov w0, #0x0
ldp xzr, lr, [sp], #16
ret
mov x1, x0
mov w0, #0x0
bl #+0x5f8 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x1
bl #+0x5ec (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x2
bl #+0x5e0 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x3
bl #+0x5d4 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x4
bl #+0x5c8 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x5
bl #+0x5bc (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x6
bl #+0x5b0 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x7
bl #+0x5a4 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x8
bl #+0x598 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0x9
bl #+0x58c (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xa
bl #+0x580 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xb
bl #+0x574 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xc
bl #+0x568 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xd
bl #+0x55c (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xe
bl #+0x550 (addr 0x1840) ; pThrowArrayBounds
mov x1, x0
mov w0, #0xf
bl #+0x544 (addr 0x1840) ; pThrowArrayBounds
We have a total of 136 instructions. For each array access, the compiler first checks the bounds and, if the test fails, execution jumps to one of the 16 pThrowArrayBounds calls in the epilogue to produce an exception message with the appropriate index value. That's a lot of code we will never need.
Thankfully, we can help the compiler with a simple change, an explicit size check at the top of the function:
fun Matrix.isIdentity(): Boolean {
    val v = values
    if (v.size < 16) return false
    return v[0] == 1f &&
           v[1] == 0f &&
           v[2] == 0f &&
           v[3] == 0f &&
           v[4] == 0f &&
           v[5] == 1f &&
           v[6] == 0f &&
           v[7] == 0f &&
           v[8] == 0f &&
           v[9] == 0f &&
           v[10] == 1f &&
           v[11] == 0f &&
           v[12] == 0f &&
           v[13] == 0f &&
           v[14] == 0f &&
           v[15] == 1f
}
This check, while apparently useless since our array is never allocated with fewer than 16 elements, gives the compiler enough information to get rid of all the bounds checks:
stp x0, lr, [sp, #-16]!
ldr w0, [x1, #8]
cmp w0, #0x10 (16)
b.ge #+0xc (addr 0x10f8)
mov w0, #0x0
b #+0xd4 (addr 0x11c8)
fmov s0, #0x70 (1.0000)
ldr s1, [x1, #12]
fcmp s1, s0
b.ne #+0xc0 (addr 0x11c4)
ldr s1, [x1, #16]
fcmp s1, #0.0
b.ne #+0xb4 (addr 0x11c4)
ldr s1, [x1, #20]
fcmp s1, #0.0
b.ne #+0xa8 (addr 0x11c4)
ldr s1, [x1, #24]
fcmp s1, #0.0
b.ne #+0x9c (addr 0x11c4)
ldr s1, [x1, #28]
fcmp s1, #0.0
b.ne #+0x90 (addr 0x11c4)
ldr s1, [x1, #32]
fcmp s1, s0
b.ne #+0x84 (addr 0x11c4)
ldr s1, [x1, #36]
fcmp s1, #0.0
b.ne #+0x78 (addr 0x11c4)
ldr s1, [x1, #40]
fcmp s1, #0.0
b.ne #+0x6c (addr 0x11c4)
ldr s1, [x1, #44]
fcmp s1, #0.0
b.ne #+0x60 (addr 0x11c4)
ldr s1, [x1, #48]
fcmp s1, #0.0
b.ne #+0x54 (addr 0x11c4)
ldr s1, [x1, #52]
fcmp s1, s0
b.ne #+0x48 (addr 0x11c4)
ldr s1, [x1, #56]
fcmp s1, #0.0
b.ne #+0x3c (addr 0x11c4)
ldr s1, [x1, #60]
fcmp s1, #0.0
b.ne #+0x30 (addr 0x11c4)
ldr s1, [x1, #64]
fcmp s1, #0.0
b.ne #+0x24 (addr 0x11c4)
ldr s1, [x1, #68]
fcmp s1, #0.0
b.ne #+0x18 (addr 0x11c4)
ldr s1, [x1, #72]
fcmp s1, s0
b.ne #+0xc (addr 0x11c4)
mov w0, #0x1
b #+0x8 (addr 0x11c8)
mov w0, #0x0
ldp xzr, lr, [sp], #16
ret
The function is now down to 60 instructions and does only what we wrote in our Kotlin code.
Inlining
Calling functions is another barrier that prevents the compiler from fully understanding your code. Here is the original implementation of the *= operator for Matrix:
operator fun timesAssign(m: Matrix) {
    val v00 = dot(this, 0, m, 0)
    val v01 = dot(this, 0, m, 1)
    val v02 = dot(this, 0, m, 2)
    val v03 = dot(this, 0, m, 3)
    val v10 = dot(this, 1, m, 0)
    val v11 = dot(this, 1, m, 1)
    val v12 = dot(this, 1, m, 2)
    val v13 = dot(this, 1, m, 3)
    val v20 = dot(this, 2, m, 0)
    val v21 = dot(this, 2, m, 1)
    val v22 = dot(this, 2, m, 2)
    val v23 = dot(this, 2, m, 3)
    val v30 = dot(this, 3, m, 0)
    val v31 = dot(this, 3, m, 1)
    val v32 = dot(this, 3, m, 2)
    val v33 = dot(this, 3, m, 3)

    val v = values
    v[ 0] = v00
    v[ 1] = v01
    v[ 2] = v02
    v[ 3] = v03
    v[ 4] = v10
    v[ 5] = v11
    v[ 6] = v12
    v[ 7] = v13
    v[ 8] = v20
    v[ 9] = v21
    v[10] = v22
    v[11] = v23
    v[12] = v30
    v[13] = v31
    v[14] = v32
    v[15] = v33
}

private fun dot(m1: Matrix, row: Int, m2: Matrix, column: Int): Float {
    return m1[row, 0] * m2[0, column] +
           m1[row, 1] * m2[1, column] +
           m1[row, 2] * m2[2, column] +
           m1[row, 3] * m2[3, column]
}
Each call to the dot product function dot() executes a rather lengthy function that performs its own array bounds checks (two, one for each array passed as an argument), for a total of 32 checks, plus the one done in timesAssign() itself when writing the values at the end. It is interesting to note that, in this case, the compiler generates a single bounds check for the writes which, on failure, results in a deoptimized (interpreted) execution of the function, instead of one bounds check per access like we saw in isIdentity().
Written in this manner, timesAssign() requires executing around 860 instructions to complete. We can bring this down to 180 instructions by doing two things:

- Marking dot() as inline
- Doing a manual array size check for both arrays at the beginning of timesAssign()
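The post does not show the resulting source, but putting those two changes together could look like the sketch below, written as extension and top-level functions to keep it self-contained; the early return on a size mismatch mirrors the isIdentity() trick and is only one way to surface the guarantee:

private inline fun dot(m1: Matrix, row: Int, m2: Matrix, column: Int): Float {
    return m1[row, 0] * m2[0, column] +
           m1[row, 1] * m2[1, column] +
           m1[row, 2] * m2[2, column] +
           m1[row, 3] * m2[3, column]
}

operator fun Matrix.timesAssign(m: Matrix) {
    // Manual size check on both backing arrays, done up front. Combined with
    // the inlined dot() above, it gives the compiler enough information to
    // drop the per-access bounds checks.
    if (values.size < 16 || m.values.size < 16) return

    val v00 = dot(this, 0, m, 0)
    val v01 = dot(this, 0, m, 1)
    val v02 = dot(this, 0, m, 2)
    val v03 = dot(this, 0, m, 3)
    val v10 = dot(this, 1, m, 0)
    val v11 = dot(this, 1, m, 1)
    val v12 = dot(this, 1, m, 2)
    val v13 = dot(this, 1, m, 3)
    val v20 = dot(this, 2, m, 0)
    val v21 = dot(this, 2, m, 1)
    val v22 = dot(this, 2, m, 2)
    val v23 = dot(this, 2, m, 3)
    val v30 = dot(this, 3, m, 0)
    val v31 = dot(this, 3, m, 1)
    val v32 = dot(this, 3, m, 2)
    val v33 = dot(this, 3, m, 3)

    val v = values
    v[ 0] = v00
    v[ 1] = v01
    v[ 2] = v02
    v[ 3] = v03
    v[ 4] = v10
    v[ 5] = v11
    v[ 6] = v12
    v[ 7] = v13
    v[ 8] = v20
    v[ 9] = v21
    v[10] = v22
    v[11] = v23
    v[12] = v30
    v[13] = v31
    v[14] = v32
    v[15] = v33
}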
After inlining, the compiler can perform extra optimizations, such as eliminating redundant bounds checks and loads. The elimination of the loads goes beyond the scope of this post, and we will explore it in further detail in the next post.
Always worth it?
Of course, like any optimization technique, whether doing this matters highly depends on the code you’re applying it to. Make sure to always benchmark, but this technique might be useful in code that processes large arrays (audio, text, pixels, etc.). Even when the compiler does a good job on its own, you can save a few instructions.
For instance, this Kotlin function produces fairly optimal arm64 instructions:
fun scale(x: Float = 1f, y: Float = 1f, z: Float = 1f) {
    this[0, 0] *= x
    this[0, 1] *= x
    this[0, 2] *= x
    this[0, 3] *= x
    this[1, 0] *= y
    this[1, 1] *= y
    this[1, 2] *= y
    this[1, 3] *= y
    this[2, 0] *= z
    this[2, 1] *= z
    this[2, 2] *= z
    this[2, 3] *= z
}
The generated assembly contains a single bounds check:
sub x16, sp, #0x2000 (8192)
ldr wzr, [x16]
str x0, [sp, #-32]!
str lr, [sp, #24]
ldr w0, [x1, #8]
cmp w0, #0xb (11)
b.ls #+0xa0 (addr 0x11f8)
ldr s3, [x1, #12]
fmul s3, s0, s3
str s3, [x1, #12]
ldr s3, [x1, #16]
fmul s3, s0, s3
str s3, [x1, #16]
ldr s3, [x1, #20]
fmul s3, s0, s3
str s3, [x1, #20]
ldr s3, [x1, #24]
fmul s0, s0, s3
str s0, [x1, #24]
ldr s0, [x1, #28]
fmul s0, s1, s0
str s0, [x1, #28]
ldr s0, [x1, #32]
fmul s0, s1, s0
str s0, [x1, #32]
ldr s0, [x1, #36]
fmul s0, s1, s0
str s0, [x1, #36]
ldr s0, [x1, #40]
fmul s0, s1, s0
str s0, [x1, #40]
ldr s0, [x1, #44]
fmul s0, s2, s0
str s0, [x1, #44]
ldr s0, [x1, #48]
fmul s0, s2, s0
str s0, [x1, #48]
ldr s0, [x1, #52]
fmul s0, s2, s0
str s0, [x1, #52]
ldr s0, [x1, #56]
fmul s0, s2, s0
str s0, [x1, #56]
ldr lr, [sp, #24]
add sp, sp, #0x20 (32)
ret
str x0, [sp, #8]
mov x0, #0x5
bl #+0x4f0 (addr 0x16f0) ; pDeoptimize
The only gain to be had here is getting rid of the epilogue (a binary size/instruction cache improvement) and simplifying the prologue of the function. Adding our own bounds check saves 7 instructions over the entire function, 3 of those being the epilogue itself.
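The post doesn't show the modified Kotlin source, but the added check could look something like this sketch (written as an extension; returning early on the impossible size mismatch is just one possible choice):

fun Matrix.scale(x: Float = 1f, y: Float = 1f, z: Float = 1f) {
    // Redundant at the language level, but it lets the compiler remove the
    // remaining bounds check and the pDeoptimize slow path at the end of the
    // function, which in turn simplifies the prologue.
    if (values.size < 16) return
    this[0, 0] *= x
    this[0, 1] *= x
    this[0, 2] *= x
    this[0, 3] *= x
    this[1, 0] *= y
    this[1, 1] *= y
    this[1, 2] *= y
    this[1, 3] *= y
    this[2, 0] *= z
    this[2, 1] *= z
    this[2, 2] *= z
    this[2, 3] *= z
}

With that check in place, the generated arm64 code becomes: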
stp x0, lr, [sp, #-16]!
ldr w0, [x1, #8]
cmp w0, #0x10 (16)
b.lt #+0x94 (addr 0x11e0)
ldr s3, [x1, #12]
fmul s3, s0, s3
str s3, [x1, #12]
ldr s3, [x1, #16]
fmul s3, s0, s3
str s3, [x1, #16]
ldr s3, [x1, #20]
fmul s3, s0, s3
str s3, [x1, #20]
ldr s3, [x1, #24]
fmul s0, s0, s3
str s0, [x1, #24]
ldr s0, [x1, #28]
fmul s0, s1, s0
str s0, [x1, #28]
ldr s0, [x1, #32]
fmul s0, s1, s0
str s0, [x1, #32]
ldr s0, [x1, #36]
fmul s0, s1, s0
str s0, [x1, #36]
ldr s0, [x1, #40]
fmul s0, s1, s0
str s0, [x1, #40]
ldr s0, [x1, #44]
fmul s0, s2, s0
str s0, [x1, #44]
ldr s0, [x1, #48]
fmul s0, s2, s0
str s0, [x1, #48]
ldr s0, [x1, #52]
fmul s0, s2, s0
str s0, [x1, #52]
ldr s0, [x1, #56]
fmul s0, s2, s0
str s0, [x1, #56]
ldp xzr, lr, [sp], #16
ret
Use this wisely in your own code, and make sure to check that what you are doing actually improves the generated instruction stream and your benchmarks.
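To act on the "always benchmark" advice, a Jetpack Microbenchmark test is one way to compare the two versions on-device. This is only a sketch: it assumes the androidx.benchmark library is set up and that the Matrix and isIdentity() shown earlier are available in the test's module.

import androidx.benchmark.junit4.BenchmarkRule
import androidx.benchmark.junit4.measureRepeated
import androidx.test.ext.junit.runners.AndroidJUnit4
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4::class)
class MatrixBenchmark {
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    private val matrix = Matrix()
    private var identity = false

    @Test
    fun isIdentity() {
        benchmarkRule.measureRepeated {
            // Run once with and once without the manual size check, then
            // compare the reported timings.
            identity = matrix.isIdentity()
        }
    }
}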