Sunday, October 24, 2021

[SOLVED] clang performance drop when using uniform_real_distribution

Issue

The following code results in very different times for g++ and clang++ when using uniform_real_distribution.

#include <iostream>
#include <sstream>
#include <fstream>

#include <chrono>
#include <random>


std::mt19937::result_type seed = 0;
std::mt19937 gen(seed);
// std::uniform_int_distribution<size_t> distr(0, 1);
std::uniform_real_distribution<double> distr(0.0,1.0);

int main()
{
    auto t_start = std::chrono::steady_clock::now();
    for (auto i = 1; i <= 1000000; ++i)
    {
        distr(gen);
    }
    auto t_end = std::chrono::steady_clock::now();
    std::cout << "elapsed time: " << std::chrono::duration_cast<std::chrono::nanoseconds>(t_end - t_start).count()  << " ns\n" << std::endl;

    return 0;
}

Compiled with the following commands:

clang++ -std=c++17 -O3 -flto -march=native -mllvm -inline-threshold=10000000 rng.cpp -o rng
g++ -std=c++17 -O3 -march=native rng.cpp -o rng

This results in the following times:

clang:  272929774 ns

gcc:    12054635 ns

When using the commented-out distribution instead, the times are:

clang:  48155862 ns

gcc:    50226810 ns

I found a fairly old question that covers the same problem, but none of the proposed solutions worked in my case:

Clang performance drop for specific C++ random number generation

Does anyone have an idea what is going on here?


Solution

Take a look at the generated assembly on Godbolt.

With gcc, the compiler removed the call to distr(gen); entirely:

.L27:
        dec     esi
        je      .L25

This is a for loop that does nothing, so the gcc timing for uniform_real_distribution measures an empty loop.

With clang, the compiler was not able to eliminate the call:

.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        mov     edi, offset gen
        call    double std::generate_canonical<double, 53ul, std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul> >(std::mersenne_twister_engine<unsigned long, 32ul, 624ul, 397ul, 31ul, 2567483615ul, 11ul, 4294967295ul, 7ul, 2636928640ul, 15ul, 4022730752ul, 18ul, 1812433253ul>&)
        dec     ebx
        jne     .LBB0_1

And generate_canonical was actually called.

In short, you must use the result of distr(gen); in a way that affects the program's observable outcome; otherwise the compiler is free to remove that code.


The simplest way to fix it is to accumulate the results of distr(gen); and print the sum.

Now, when you look at the assembly, you can see that clang calls the function std::generate_canonical<double, 53ul, std::mersenne_twister_engine< .... >>, while gcc places the corresponding code inline.

Most probably this difference is caused by a different organization of the standard library. Clang appears to use a version built into the standard library, while with gcc the template from the header file was used to generate the code directly in the newly created assembly. When a compiler reaches external code from a library, it cannot tell exactly what that code does, so it is unable to optimize it away (some side effects may be hidden in the library).



Answered By - Marek R