Sunday, October 24, 2021

[SOLVED] Why does my "=r"(var) output not pick the same register as "a"(var) input?

Issue

I'm learning how to use __asm__ volatile in GCC and ran into a problem. I want to implement a function that performs an atomic compare-and-exchange and returns the value that was previously stored in the destination.

Why does an "=a"(expected) output constraint work, but an "=r"(expected) constraint lets the compiler generate code that doesn't work?

Case 1.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t atomic_cas(uint64_t * destination, uint64_t expected, uint64_t value){
    __asm__ volatile (
        "lock cmpxchgq %3, %1":
        "=a" (expected) :
        "m" (*destination), "a" (expected), "r" (value) :
        "memory"
    );

    return expected;
}

int main(void){
    uint64_t v1 = 10;
    uint64_t result = atomic_cas(&v1, 10, 5);
    printf("%" PRIu64 "\n", result);           //prints 10, the value before, OK
    printf("%" PRIu64 "\n", v1);               //prints 5, the new value, OK
}

It works as expected. Now consider the following case:

Case 2.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

uint64_t atomic_cas(uint64_t * destination, uint64_t expected, uint64_t value){
    __asm__ volatile (
        "lock cmpxchgq %3, %1":
        "=r" (expected) : // <----- I replaced "a" with "r", expecting GCC to work out the register from the inputs
        "m" (*destination), "a" (expected), "r" (value) :
        "memory"
    );

    return expected;
}

int main(void){
    uint64_t v1 = 10;
    uint64_t result = atomic_cas(&v1, 10, 5);
    printf("%" PRIu64 "\n", result);            //prints 5, wrong
    printf("%" PRIu64 "\n", v1);                //prints 5, the new value, OK 
}

I examined generated assembly and noticed the following things:

I. In both cases the function code is the same and looks like this:

   0x0000555555554760 <+0>:     mov    rax,rsi
   0x0000555555554763 <+3>:     lock cmpxchg QWORD PTR [rdi],rdx
   0x0000555555554768 <+8>:     ret 

II. The problem appeared when GCC inlined atomic_cas: in the latter case the correct value was not passed to the printf function. Here is the relevant fragment of disas main:

0x00000000000005f6 <+38>:    lock cmpxchg QWORD PTR [rsp],rdx
0x00000000000005fc <+44>:    lea    rsi,[rip+0x1f1]        # 0x7f4
0x0000000000000603 <+51>:    mov    rdx,rax ;  <----- This instruction is absent in Case 2.
0x0000000000000606 <+54>:    mov    edi,0x1
0x000000000000060b <+59>:    xor    eax,eax

QUESTION: Why does replacing rax (a) with an arbitrary register (r) produce a wrong result? I expected it to work in both cases.

UPD. I compile with the following flags: -Wl,-z,lazy -Warray-bounds -Wextra -Wall -g3 -O3


Solution

First of all, see https://gcc.gnu.org/wiki/DontUseInlineAsm. There is basically zero reason to roll your own CAS instead of using the builtin bool __atomic_compare_exchange(type *ptr, type *expected, type *desired, bool weak, int success_memorder, int failure_memorder), documented at https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html. It works even on non-_Atomic variables.
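
As a minimal sketch of that advice, the function from the question can be written with the _n variant of the builtin, which takes the desired value directly rather than through a pointer. The name atomic_cas_builtin and the choice of __ATOMIC_SEQ_CST for both orderings are assumptions made for illustration; lock cmpxchg is a full barrier on x86, so sequential consistency is the closest match to the original asm.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name. Returns the value that was in *destination before the
   operation; if it equalled `expected`, *destination now holds `value`. */
uint64_t atomic_cas_builtin(uint64_t *destination, uint64_t expected, uint64_t value)
{
    /* On failure the builtin stores the value it found into `expected`.
       On success `expected` is untouched and already equals the old value.
       Either way, `expected` ends up holding the previous contents. */
    __atomic_compare_exchange_n(destination, &expected, value,
                                false,             /* strong CAS: no spurious failure */
                                __ATOMIC_SEQ_CST,  /* memory order on success */
                                __ATOMIC_SEQ_CST); /* memory order on failure */
    return expected;
}
```

Called as atomic_cas_builtin(&v1, 10, 5) with v1 == 10, this returns 10 and leaves v1 == 5, which is the behaviour Case 1 was after.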


"=r" tells GCC it can ask for the output in whatever register it wants, so it can avoid having to mov the result there itself (like here, where GCC wants the output in RSI as an arg for printf), and/or avoid overwriting an input it still needs. That's the entire point of "=r" as opposed to specific-register constraints.

If you want to tell GCC that the register it picks for input is also the output register, use "+r". Or in this case since you need it to pick RAX, use "+a"(expected).
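
Applied to the function in the question, that could look like the sketch below. The name atomic_cas_fixed is hypothetical. Two changes beyond the constraint letter are my own assumptions rather than part of the answer: *destination is declared as a read-write "+m" operand, and a "cc" clobber is listed because cmpxchg writes the flags (GCC already assumes this on x86, so it only serves as documentation). The #else branch exists only so the sketch still compiles on targets other than x86-64.

```c
#include <stdint.h>

/* Hypothetical name: the question's atomic_cas with a read-write RAX operand. */
uint64_t atomic_cas_fixed(uint64_t *destination, uint64_t expected, uint64_t value)
{
#if defined(__x86_64__)
    __asm__ volatile (
        "lock cmpxchgq %2, %1"
        : "+a" (expected),      /* %0: RAX, read as comparand, written with old value */
          "+m" (*destination)   /* %1: memory operand, read and possibly written */
        : "r" (value)           /* %2: new value, any other general-purpose register */
        : "memory", "cc"
    );
#else
    /* Fallback for other targets (assumption, not part of the asm technique). */
    __atomic_compare_exchange_n(destination, &expected, value, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
    return expected;
}
```

Because %0 is now both the input and the output, GCC knows the result comes back in RAX and emits the mov to the printf argument register itself after inlining.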

There's already syntax for making the compiler pick the same register for 2 constraints with separate variables for input and output, specifically matching constraints: "=r"(outvar) : "0"(invar).
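
A sketch of the matching-constraint form for the same CAS, with a separate output variable, might look like this. The names atomic_cas_matching and old are hypothetical, the "+m" operand and "cc" clobber are my assumptions as a more explicit way to describe cmpxchg, and the #else branch is only a portability fallback.

```c
#include <stdint.h>

/* Hypothetical name: input and output are different C variables that are
   forced into the same register by the matching constraint "0". */
uint64_t atomic_cas_matching(uint64_t *destination, uint64_t expected, uint64_t value)
{
    uint64_t old;
#if defined(__x86_64__)
    __asm__ volatile (
        "lock cmpxchgq %3, %1"
        : "=a" (old),           /* %0: output, in RAX */
          "+m" (*destination)   /* %1: memory operand, read and possibly written */
        : "0" (expected),       /* %2: input, same register as operand 0 (RAX) */
          "r" (value)           /* %3: new value */
        : "memory", "cc"
    );
#else
    /* Fallback for other targets (assumption, not part of the asm technique). */
    old = expected;
    __atomic_compare_exchange_n(destination, &old, value, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
#endif
    return old;
}
```

Note that operand numbers count outputs first, so adding the second output shifts value to %3 while expected becomes %2.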

It would be a missed optimization if the syntax didn't let you describe a non-destructive instruction that could produce output in a different register from the input(s).


You can see which registers GCC actually picked by expanding the operands inside an asm comment in the template.

Remember that GNU C inline asm is just text substitution into your template. The compiler literally has no idea what the asm instructions do, and doesn't even check they're valid. (That only happens when the assembler reads the compiler output).

    ...
    asm volatile (
    "lock cmpxchgq %3, %1   # 0 out: %0  |  2 in: %2" 
    : ...
    ...

The resulting asm shows the problem very clearly (Godbolt GCC7.4):

        lock cmpxchgq %rsi, (%rsp)   # 0 out: %rsi  |  2 in: %rax
        leaq    .LC0(%rip), %rdi
        xorl    %eax, %eax
        call    printf@PLT

(I used AT&T syntax so that your cmpxchgq %reg,mem matches the mem,reg operand order documented by Intel, although both GAS and clang's built-in assembler seem to accept the other order too. I also used it because of the operand-size suffix in the template.)

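GCC takes the opportunity to ask for the "=r"(expected) output in RSI as an arg for printf. Your bug is that your template wrongly assumes operand %0 will be RAX: cmpxchg always leaves the old value in RAX, but with "=r" GCC is free to read the result from whatever register it chose.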


There are lots of examples of the lack of any implicit connection between inputs and outputs that happen to use the same C var. For example, you can swap 2 C variables with an empty asm statement, just using constraints. See "How to write a short block of inline gnu extended assembly to swap the values of two integer variables?"
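
A minimal sketch of that swap, assuming GCC or clang (the function name swap_via_constraints is hypothetical): the template is empty, so the asm statement emits no instructions and the exchange comes entirely from how the constraints tie operands to registers.

```c
/* Hypothetical helper: swaps *x and *y using only asm constraints. */
void swap_via_constraints(int *x, int *y)
{
    int a = *x, b = *y;
    /* Output %0 (a) shares a register with the input b ("0"),
       output %1 (b) shares a register with the input a ("1").
       After the empty asm, the compiler reads a from the register that
       held b and b from the register that held a. */
    __asm__ ("" : "=r" (a), "=r" (b) : "0" (b), "1" (a));
    *x = a;
    *y = b;
}
```

The same C variable appears as both an input and an output here, yet each output is read from the register of the other variable's input. That is exactly the independence between inputs and outputs that broke Case 2.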



Answered By - Peter Cordes