Jake Wharton recently caused me to go down yet another silly optimization rabbit hole when he nonchalantly linked to a piece of code used to count the number of digits in a Long during a Slack conversation about Kotlin’s lack of ternary operator. This of course triggered folks like Madis Pink and me to want to optimize it…

Counting digits Link to heading

The simplest way to count the number of digits would be to compute log10(n).toInt() + 1, where n is our input number. Unfortunately logarithmic functions like Kotlin’s log10 are only defined for floating point numbers. If our input is an Int or a Long, we could first convert to Double and then call log10, but not all Long values can be stored in a Double (any value above 2^53), and we would have to special case 0. We must therefore find a different solution1.

Important Note

In the rest of this article, we’ll assume the input number is positive to keep the code easier to understand. You can see how Jake handled negative values in the original source code. There might be better ways of doing it in the various solutions presented below, but I’ll leave it as an exercise to the reader.

Let’s start with the most straightforward solution:

 1fun simpleCount(v: Long): Int {
 2    if (v < 10L) return 1
 3    if (v < 100L) return 2
 4    if (v < 1000L) return 3
 5    if (v < 10000L) return 4
 6    if (v < 100000L) return 5
 7    if (v < 1000000L) return 6
 8    if (v < 10000000L) return 7
 9    if (v < 100000000L) return 8
10    if (v < 1000000000L) return 9
11    if (v < 10000000000L) return 10
12    if (v < 100000000000L) return 11
13    if (v < 1000000000000L) return 12
14    if (v < 10000000000000L) return 13
15    if (v < 100000000000000L) return 14
16    if (v < 1000000000000000L) return 15
17    if (v < 10000000000000000L) return 16
18    if (v < 100000000000000000L) return 17
19    if (v < 1000000000000000000L) return 18
20    return 19
21}

I like this solution because of how obvious it is. What I don’t like about it is the large number of branches and instructions it generates. I will spare you the full arm64 assembly dump, just know that it yields 111 instructions and 32 branches. While early returns will likely spare you the full execution of the generate code, this won’t be great for the instruction cache nor the branch predictor.

On a Pixel 6, this solution takes 63,236 ns to execute over a large array of random inputs.

Using a binary search Link to heading

Jake’s solution is similar to the one we just saw, but re-arranges the tests to optimize for likely values, and effectively performs a binary search.

 1fun binarySearchCount(v: Long): Int {
 2    return if (v < 100000000L) {
 3        if (v < 10000L) {
 4            if (v < 100L) {
 5                if (v < 10L) {
 6                    1
 7                } else {
 8                    2
 9                }
10            } else if (v < 1000L) {
11                3
12            } else {
13                4
14            }
15        } else if (v < 1000000L) {
16            if (v < 100000L) {
17                5
18            } else {
19                6
20            }
21        } else if (v < 10000000L) {
22            7
23        } else {
24            8
25        }
26    } else if (v < 1000000000000L) {
27        if (v < 10000000000L) {
28            if (v < 1000000000L) {
29                9
30            } else {
31                10
32            }
33        } else if (v < 100000000000L) {
34            11
35        } else {
36            12
37        }
38    } else if (v < 1000000000000000L) {
39        if (v < 10000000000000L) {
40            13
41        } else if (v < 100000000000000L) {
42            14
43        } else {
44            15
45        }
46    } else if (v < 100000000000000000L) {
47        if (v < 10000000000000000L) {
48            16
49        } else {
50            17
51        }
52    } else if (v < 1000000000000000000L) {
53        18
54    } else {
55        19
56    }
57}

This new solution has a similar number of instructions as the previous one (108 vs 111), but only a quarter of the branches, for a total of 8 branches, down from 32. In addition, you are guaranteed to never execute all the branches.

With the smarter way to go throught the branches, the binary search takes 27,034 ns to execute, or only ~40% of the original execution time. This is already a great improvement.

Educated guess Link to heading

What makes this problem “difficult” is that we are trying to work in base 10 while computers work in base 2. We can however use this to our advantage to estimate how many digits our number has. To do this we need to count the number of leading zeroes in the binary representation of the number. Using this formula:

1val bitCount = 64 - n.countLeadingZeroBits()

We know how many bits are required to encode the number. Knowing this we can use a lookup table to map this number of bits to a possible number of digits. For instance if the input is 237, the formula above tells us we need 8 bits to encode the value. From this we need to map to the closest power of 10. This can be done either via another lookup table (like I did oringally) or by using Madis’ solution:

1val powerOfTen = bitCount * 3 / 10

With our example of 237, the value returned is 2 because the closest power of ten is 100, or 10^2. If the input is less than the matching power of ten, we can return powerOfTen as the number of digits, and if the input is greater we need to adjust the result and add 1:

 1private val PowersOfTen = longArrayOf(
 2    0,
 3    10,
 4    100,
 5    1000,
 6    10000,
 7    100000,
 8    1000000,
 9    10000000,
10    100000000,
11    1000000000,
12    10000000000,
13    100000000000,
14    1000000000000,
15    10000000000000,
16    100000000000000,
17    1000000000000000,
18    10000000000000000,
19    100000000000000000,
20    1000000000000000000
21)
22
23fun log2Count(v: Long): Int {
24    val guess = ((64 - v.countLeadingZeroBits()) * 3) / 10
25    return guess + (if (v >= PowersOfTen[guess]) 1 else 0)
26}

The expensive integer division can be removed using another trick from Madis (which is similar to what I have seen compilers do automatically):

1fun log2Count(v: Long): Int {
2    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
3    return guess + (if (v >= PowersOfTen[guess]) 1 else 0)
4}

With this solution we end up with only 24 instructions and 2 branches (1 branch always executed for the array bounds check, the other one in case we fail the check, which can’t happen in our case):

 1str x0, [sp, #-32]!
 2str lr, [sp, #24]
 3mov w3, #0x40
 4mov w2, #0xa
 5clz x4, x1
 6sub w3, w3, w4
 7mul w2, w3, w2
 8lsr w2, w2, #5
 9ldr w3, [x0]
10ldr w0, [x3, #224]
11ldr w3, [x0, #8]
12cmp w2, w3
13b.hs #+0x24
14add w0, w0, #0x10 (16)
15ldr x0, [x0, x2, lsl #3]
16cmp x1, x0
17cset w0, ge
18add w0, w2, w0
19ldr lr, [sp, #24]
20add sp, sp, #0x20 (32)
21ret
22mov x0, x2
23mov x1, x3
24bl #+0x144

This solution runs in 16,396 ns on a Pixel 6, taking only only 60% of the time taken by the binary search, and only 25% of the naive implementation.

Conservative approximation Link to heading

If you don’t need to know the exact number of digits, but a close approximation that’s guaranteed to be always greater or equal to the real answer (to size a buffer for instance), you can drop the entire lookup table, which greatly simplifies the generated arm64 assembly.

1fun conservativeLog2Count(v: Long): Int {
2    val guess = ((64 - v.countLeadingZeroBits()) * 10) ushr 5
3    return guess + 1
4}

The resulting aarch64 code has only 8 instructions and no branches:

1mov w3, #0x40
2mov w2, #0x4d1
3mov w0, #0x1
4clz x1, x1
5sub w1, w3, w1
6mul w1, w1, w2
7add w0, w0, w1, lsr #12
8ret

This approximation saves us even more time, running in 14,853 ns.

Recap Link to heading

Here is the final comparison between the various implementations:

ImplementationInstructions/branchesTimeSpeed
Branches111 / 3263,236 ns1.0x
Binary search108 / 827,034 ns2.3x
log224 / 216,396 ns3.8x
log2 conservative8 / 014,853 ns4.2x

Frustratingly — but unsurprisingly — a C++ implementation of log2Count produces 12 instructions instead of Kotlin’s 24, and no branches. This is of course due to the fact that C++ doesn’t do any bounds check for us on the array. I haven’t tried to benchmark this solution using JNI and @CriticalNative but I doubt the savings will be greater than the cost of the JNI overhead.


  1. We could use the log10 technique for any value lower than 2^53 and otherwise use one of the solutions that follow, but after benchmarking it’s not faster than any of the other solutions except the most naive one. ↩︎