In the previous post, we saw how we could micro-optimize `Int.sign` to save a few instructions. We are now going to turn to `Float.sign` (and by extension `Double.sign`).

`Float.sign` returns the sign of a single-precision float value as a single-precision float value. While similar to `Int.sign`, this API must handle a special case: Not-a-Number (`NaN`). The exact behavior of the API is that it will return:

- `-1.0f` if the value is negative
- `+/-0.0f` if the value is zero (floats can encode both positive and negative zero)
- `1.0f` if the value is positive
- `NaN` if the value is `NaN`
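As a quick sanity check, these rules can be observed directly with `java.lang.Math.signum` (a standalone snippet, not from the original post):

```java
public class SignumDemo {
    public static void main(String[] args) {
        System.out.println(Math.signum(-3.5f));      // -1.0
        System.out.println(Math.signum(-0.0f));      // -0.0 (negative zero is preserved)
        System.out.println(Math.signum(0.0f));       // 0.0
        System.out.println(Math.signum(42.0f));      // 1.0
        System.out.println(Math.signum(Float.NaN));  // NaN
    }
}
```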

An easy way to implement this API ourselves is to return the input when the input equals `0.0f` or `NaN`, and to return the input’s sign copied onto `1.0f` otherwise. Translated to code, we can write:

```
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    1.0f.withSign(this)
}
```

On Android, Kotlin does not implement `Float.sign` as above, but delegates to `java.lang.Math.signum` instead:

```
public actual inline val Float.sign: Float get() = nativeMath.signum(this)
```

If we look at the implementation of `signum()` on Android, we find:

```
public static float signum(float f) {
    return (f == 0.0f || Float.isNaN(f)) ? f : copySign(1.0f, f);
}
```

We can now turn to the generated aarch64 assembly to see what happens once the code runs on an actual device:

```
1   fcmp s0, #0.0
2   b.eq #+0x28 (addr 0x266c)
3   fcmp s0, s0
4   b.ne #+0x20 (addr 0x266c)
5   fmov s1, #0x70 (1.0000)
6   fmov w0, s0
7   and w0, w0, #0x80000000
8   fmov w1, s1
9   and w1, w1, #0x7fffffff
10  orr w0, w0, w1
11  fmov s0, w0
12  ret
```

The first interesting thing we can notice is that both `isNaN()` and `copySign()` disappear as function calls and are instead replaced with their implementation, via a combination of inlining and intrinsics in ART.

The code is a pretty direct translation of the original Java source:

- Lines 1 and 2 check if the value is `0.0f`
- Lines 3 and 4 check if the value is `NaN` ^{1}
- And the rest implements `copySign()`

So… all good? I could describe the assembly step by step, but it will be easier for most readers if we look at the Java implementation of `copySign()` directly:

```
public static float copySign(float magnitude, float sign) {
    return Float.intBitsToFloat(
        (Float.floatToRawIntBits(sign) & (FloatConsts.SIGN_BIT_MASK)) |
        (Float.floatToRawIntBits(magnitude) & (FloatConsts.EXP_BIT_MASK | FloatConsts.SIGNIF_BIT_MASK))
    );
}
```

What looks like a lot of scary-looking bit manipulation actually relies on a simple fact: the most-significant bit (MSB) of a float (or double) encodes the sign of the number. When the MSB is set to 1, the number is negative; otherwise the number is positive.

This code also uses `floatToRawIntBits()` to get the bit representation of a float as an integer.
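A small standalone Java snippet (not part of the original implementation) makes the sign bit visible: the only difference between the raw bits of a value and its negation is the most-significant bit.

```java
public class SignBitDemo {
    public static void main(String[] args) {
        // The MSB of the IEEE 754 encoding is the sign bit.
        System.out.printf("%08x%n", Float.floatToRawIntBits( 2.5f));  // 40200000
        System.out.printf("%08x%n", Float.floatToRawIntBits(-2.5f));  // c0200000
        // Masking with 0x80000000 isolates the sign bit.
        System.out.printf("%08x%n", Float.floatToRawIntBits(-2.5f) & 0x80000000);  // 80000000
    }
}
```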

Given this information, the code should be easier to follow:

- First we mask the bit representation of the `sign` input with `0x80000000` to extract the sign bit
- Then we mask the bit representation of the `magnitude` input with `0x7fffffff` to extract all the bits *except* the sign bit
- We combine both with a binary OR
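Those three steps can be replayed by hand in plain Java; the input values below are arbitrary, chosen just for illustration:

```java
public class CopySignByHand {
    public static void main(String[] args) {
        float magnitude = 1.0f;
        float sign = -5.0f;
        int signBit = Float.floatToRawIntBits(sign) & 0x80000000;       // step 1
        int magBits = Float.floatToRawIntBits(magnitude) & 0x7fffffff;  // step 2
        float result = Float.intBitsToFloat(signBit | magBits);         // step 3
        System.out.println(result);                      // -1.0
        System.out.println(Math.copySign(1.0f, -5.0f));  // -1.0, the same result
    }
}
```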

Go back to the aarch64 assembly above, and you’ll see it’s exactly what happens (hint: `fmov` is what does both `intBitsToFloat()` and `floatToRawIntBits()` at the assembly level, and I will have something interesting to say about this in a future post). This tells us something interesting about `copySign()`: it is not an intrinsic, but it is inlined, and the functions it calls are themselves intrinsics. All function calls disappear.

So far so good, but if we look more closely at the assembly we can notice something rather silly. The code spends a few instructions loading the constant `1.0f` just to extract its non-sign bits (the exponent and significand):

```
1   fcmp s0, #0.0
2   b.eq #+0x28 (addr 0x266c)
3   fcmp s0, s0
4   b.ne #+0x20 (addr 0x266c)
5   fmov s1, #0x70 (1.0000)
6   fmov w0, s0
7   and w0, w0, #0x80000000
8   fmov w1, s1
9   and w1, w1, #0x7fffffff
10  orr w0, w0, w1
11  fmov s0, w0
12  ret
```

But `1.0f` is a known constant, and instead of extracting its bits at runtime we could just… use the hexadecimal representation of `1.0f`, `0x3f800000`.
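That constant is easy to verify with a throwaway snippet:

```java
public class OneBits {
    public static void main(String[] args) {
        System.out.printf("%08x%n", Float.floatToRawIntBits(1.0f));  // 3f800000
        System.out.println(Float.intBitsToFloat(0x3f800000));        // 1.0
    }
}
```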

This is something you would normally expect a compiler or an optimizer to do when executing a constant-folding pass. Unfortunately, ART currently does not perform constant folding *through intrinsics*^{2}.

So what can we do about it? We can rewrite `Float.sign`/`signum()` to bypass `copySign()`/`withSign()` and do the sign copy ourselves. Here’s a Kotlin version:

```
public inline val Float.sign: Float get() = if (this == 0.0f || isNaN()) {
    this
} else {
    Float.fromBits((toRawBits() and 0x80000000.toInt()) or 0x3f800000)
}
```
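For readers who want to experiment outside of ART, here is a hypothetical plain-Java port of the same rewrite (the class name is made up for illustration), which should agree with `Math.signum` for every input:

```java
public class FastSign {
    static float sign(float f) {
        return (f == 0.0f || Float.isNaN(f))
                ? f
                : Float.intBitsToFloat((Float.floatToRawIntBits(f) & 0x80000000) | 0x3f800000);
    }

    public static void main(String[] args) {
        for (float f : new float[] { -3.5f, -0.0f, 0.0f, 42.0f, Float.NaN }) {
            // Both columns should print the same value.
            System.out.println(sign(f) + " vs " + Math.signum(f));
        }
    }
}
```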

Doing this saves two instructions in the generated aarch64 code, and this optimization will be delivered in a future update of `libcore`, ART’s standard library:

```
fcmp s0, #0.0
b.eq #+0x20 (addr 0x26a4)
fcmp s0, s0
cset w0, ne
cbnz w0, #+0x14 (addr 0x26a4)
fmov w0, s0
and w0, w0, #0x80000000
orr w0, w0, #0x3f800000
fmov s0, w0
ret
```

It is interesting to note that swapping implementations has the side effect of changing how `isNaN()` is performed. Instead of a comparison (`fcmp`) and a jump (`b.ne`), we now use a comparison followed by `cset` and `cbnz`. This behavior is apparently caused by the different code generation paths taken in the two cases (inlining vs. not), and it means we could in theory save another instruction.

*Update*: Thanks to Pete Cawley’s suggestion, I tried a C++ implementation. The C++ version is just a straight port of the Java/Kotlin implementation:

```
#include <cmath>
#include <cstdint>
#include <cstring>

__attribute__((always_inline))
inline uint32_t to_uint32(float x) {
    uint32_t a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

__attribute__((always_inline))
inline float to_float(uint32_t x) {
    float a;
    std::memcpy(&a, &x, sizeof(x));
    return a;
}

float sign(float x) {
    if (x == 0.0f || std::isnan(x)) {
        return x;
    } else {
        uint32_t d = to_uint32(x);
        return to_float((d & 0x80000000) | 0x3f800000);
    }
}
```

With this implementation, the compiler (clang 17.0) will produce the following aarch64 code:

```
fmov w8, s0
fcmp s0, #0.0
and w8, w8, #0x80000000
orr w8, w8, #0x3f800000
fmov s1, w8
fcsel s1, s0, s1, eq
fcsel s0, s0, s1, vs
ret
```

This new version saves another 2 instructions, for a total of 4 instructions (30%) compared to the original implementation (and it’s branchless!). This solution relies on the fact that `fcmp` will set the overflow flag when either operand is `NaN`. Since we perform a comparison against `0.0f`, which we know cannot be `NaN`, we can check the `V` flag to know if the input is `NaN`. This is achieved above using the `fcsel` instruction and the `vs` condition. This means that unless ART could perform the same optimization automatically, it might be worth implementing `Math.signum` as an intrinsic.

Next time, we’ll take a look at one of the following topics:

- `floatToRawIntBits()` and a not-so-micro micro-optimization
- Optimizing value classes
- Using a better `HashMap`
- Faster range-checks
- Optimizing code size with de-inlining
- Optimizing a text parser