Sunday, April 3, 2022

[SOLVED] Intel(x86_64) 64 bit vs 32 bit integer arithmetic performance difference

Issue

I was testing some program and I came upon a rather unexpected anomaly.

I wrote a simple program that computed prime numbers, and used pthreads API to parallelize this workload.

After conducting some tests, I found that if i used uint64_t as the datatype for calculations and loops, the program took significantly more time to run than if i used uint32_t.

Here is the code that I ran:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <pthread.h>

#define UINT uint64_t
#define SIZE (1024 * 1024)

typedef struct _data
{
    UINT start;
    UINT len;
    int t;
    UINT c;
}data;

int isprime(UINT x)
{
    uint8_t flag = 1;

    if(x < 2)
        return 0;

    for(UINT i = 2;i < x/2; i++)
    {
        if(!(x % i ))
        {
            flag = 0;
            break;
        }
    }

    return flag;
}

void* calc(void *p)
{
    data *a = (data*)p;

    //printf("thread no. %d has start: %lu length: %lu\n",a->t,a->start,a->len);

    for(UINT i = a->start; i < a->len; i++)
    {
        if(isprime(i))
            a->c++;
    }

    //printf("thread no. %d found %lu primes\n", a->t,a->c);
    
    pthread_exit(NULL);
}

int main(int argc,char **argv)
{   
    pthread_t *t;
    data *a;
    uint32_t THREAD_COUNT;

    if(argc < 2)
        THREAD_COUNT = 1;
    else
        sscanf(argv[1],"%u",&THREAD_COUNT);
    
    t = (pthread_t*)malloc(THREAD_COUNT * sizeof(pthread_t));
    a = (data*)malloc(THREAD_COUNT * sizeof(data));

    printf("executing the application on %u thread(s).\n",THREAD_COUNT);

    for(uint8_t i = 0; i < THREAD_COUNT; i++)
    {
        a[i].t = i;
        a[i].start = i * (SIZE / THREAD_COUNT);
        a[i].len = a[i].start + (SIZE / THREAD_COUNT);
        a[i].c = 0;
    }
    
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_create(&t[i],NULL,calc,(void*)&a[i]);
    for(uint8_t i = 0; i < THREAD_COUNT; i++)
        pthread_join(t[i],NULL);

    free(a);
    free(t);
    
    return 0;
}

I changed the UINT macro between uint32_t and uint64_t and compiled and ran the program and determined its runtime using time command on linux.

I found major difference between the runtime for uint64_t vs uint32_t.

On using uint32_t it took the program 46s to run while using uint64_t it took 2m49s to run!

I wrote a blog post about it here : https://qcentlabs.com/index.php/2021/02/01/intelx86_64-64-bit-vs-32-bit-arithmetic-big-performance-difference/

You can check out the post if you want more information.

What might be the issue behind this? Are 64 bit arithmetic slower on x86_64 than 32 bit one?


Solution

In general 64-bit arithmetic is as fast as 32-bit, ignoring things like larger operands taking up more memory and BW (bandwidth), and on x86-64 addressing the full 64-bit registers requires longer instructions.

However, you have managed to hit one of the few exceptions to this rule, namely the div instruction for calculating divisions.



Answered By - janneb
Answer Checked By - Marilyn (WPSolving Volunteer)