Issue
I was working on a personal project recently when I stumbled across an odd issue.
In a very tight loop I have an integer with a value between 0 and 15. I need to get -1 for values 0, 1, 8, and 9 and 1 for values 4, 5, 12, and 13.
I turned to godbolt to check a few options and was surprised that it seemed the compiler couldn't optimize a switch statement the same way as an if chain.
The link is here: https://godbolt.org/z/WYVBFl
The code is:
const int lookup[16] = {-1, -1, 0, 0, 1, 1, 0, 0, -1, -1, 0, 0, 1, 1, 0, 0};
int a(int num) {
return lookup[num & 0xF];
}
int b(int num) {
num &= 0xF;
if (num == 0 || num == 1 || num == 8 || num == 9)
return -1;
if (num == 4 || num == 5 || num == 12 || num == 13)
return 1;
return 0;
}
int c(int num) {
num &= 0xF;
switch (num) {
case 0: case 1: case 8: case 9:
return -1;
case 4: case 5: case 12: case 13:
return 1;
default:
return 0;
}
}
I would have thought that b and c would yield the same results, and I was hoping that I could read the bit-hacks to come up with an efficient implementation myself since my solution (the switch statement - in another form) was fairly slow.
Oddly, b
compiled to bit-hacks while c
was either pretty much un-optimized or reduced to a different case of a
depending on target hardware.
Can anybody explain why there is this discrepancy? What is the 'correct' way to optimize this query?
EDIT:
Clarification
I want the switch solution to be the fastest, or a similarly "clean" solution. However when compiled with optimizations on my machine the if solution is significantly faster.
I wrote a quick program to demonstrate and TIO has the same results as I find locally: Try it online!
With static inline
the lookup table speeds up a bit: Try it online!
Solution
If you explicitely enumerate all the cases, gcc is very efficient :
int c(int num) {
num &= 0xF;
switch (num) {
case 0: case 1: case 8: case 9:
return -1;
case 4: case 5: case 12: case 13:
return 1;
case 2: case 3: case 6: case 7: case 10: case 11: case 14: case 15:
//default:
return 0;
}
}
is just compiled in a simple indexed branch :
c:
and edi, 15
jmp [QWORD PTR .L10[0+rdi*8]]
.L10:
.quad .L12
.quad .L12
.quad .L9
.quad .L9
.quad .L11
.quad .L11
.quad .L9
.quad .L9
.quad .L12
etc...
Note that if default:
is uncommented, gcc turns back to its nested branch version.
Answered By - Alain Merigot