I noticed a curious thing on my computer.^{*} The handwritten divisibility test is significantly faster than the `%`

operator. Consider the minimal example:

^{* AMD Ryzen Threadripper 2990WX, GCC 9.2.0}

```
static int divisible_ui_p(unsigned int m, unsigned int a)
{
if (m <= a) {
if (m == a) {
return 1;
}
return 0;
}
m += a;
m >>= __builtin_ctz(m);
return divisible_ui_p(m, a);
}
```

The example is limited by odd `a`

and `m > 0`

. However, it can be easily generalized to all `a`

and `m`

. The code just converts the division to a series of additions.

Now consider the test program compiled with `-std=c99 -march=native -O3`

:

```
for (unsigned int a = 1; a < 100000; a += 2) {
for (unsigned int m = 1; m < 100000; m += 1) {
#if 1
volatile int r = divisible_ui_p(m, a);
#else
volatile int r = (m % a == 0);
#endif
}
}
```

... and the results on my computer:

```
| implementation | time [secs] |
|--------------------|-------------|
| divisible_ui_p | 8.52user |
| builtin % operator | 17.61user |
```

Therefore more than 2 times faster.

The question: Can you tell me how the code behaves on your machine? Is it missed optimization opportunity in GCC? Can you do this test even faster?

**UPDATE:**
As requested, here is a minimal reproducible example:

```
#include <assert.h>
static int divisible_ui_p(unsigned int m, unsigned int a)
{
if (m <= a) {
if (m == a) {
return 1;
}
return 0;
}
m += a;
m >>= __builtin_ctz(m);
return divisible_ui_p(m, a);
}
int main()
{
for (unsigned int a = 1; a < 100000; a += 2) {
for (unsigned int m = 1; m < 100000; m += 1) {
assert(divisible_ui_p(m, a) == (m % a == 0));
#if 1
volatile int r = divisible_ui_p(m, a);
#else
volatile int r = (m % a == 0);
#endif
}
}
return 0;
}
```

compiled with `gcc -std=c99 -march=native -O3 -DNDEBUG`

on AMD Ryzen Threadripper 2990WX with

```
gcc --version
gcc (Gentoo 9.2.0-r2 p3) 9.2.0
```

**UPDATE2:** As requested, the version that can handle any `a`

and `m`

(if you also want to avoid integer overflow, the test has to be implemented with integer type twice as long as the input integers):

```
int divisible_ui_p(unsigned int m, unsigned int a)
{
#if 1
/* handles even a */
int alpha = __builtin_ctz(a);
if (alpha) {
if (__builtin_ctz(m) < alpha) {
return 0;
}
a >>= alpha;
}
#endif
while (m > a) {
m += a;
m >>= __builtin_ctz(m);
}
if (m == a) {
return 1;
}
#if 1
/* ensures that 0 is divisible by anything */
if (m == 0) {
return 1;
}
#endif
return 0;
}
```

`r`

s that you calculate are indeed equal to each other.`a % b`

have`b`

much smaller than`a`

. Through most iterations in your test case, they are of similar size, or`b`

is bigger, and your version can be faster on many CPUs in those situations.