Tuesday, December 28, 2021

[SOLVED] How to transform column to row in Terminal Linux? But more complex

Issue

How I transform column to row in Terminal Linux? But more complex... The following is an example of my data:

SNP_Name        ID_Animal        Allele        Chr        Position
rs01            215               AB            1            100
rs02            215               AA            2            200
rs03            215               BA            3            300
rs04            215               AA            4            400
rs01            300               AB            1            100
rs02            300               BB            2            200
rs03            300               AA            3            300
rs04            300               AB            4            400
rs01            666               BB            1            100
rs02            666               AA            2            200
rs03            666               AB            3            300
rs04            666               AB            4            400

I want to transform this into the following:

SNP_Name     Chr     Position   215(ID_animal)  300(ID_Animal) 666(ID_Animal)
rs01         1       100        AB              AB            BB
rs02         2       200        AA              BB            AA
rs03         3       300        BA              AA            AB
rs04         4       400        AA              AB            AB

The row ID_animal change in column with respective allele. How I do this? But I will work with 55,000 repetition per ID_animal. So, I want to be only 55,000 row and (animal number+SNP_Name+Chr+Position) of column.

Thank you.


Solution

The issue here is the amount of data, and I don't want to give a solution that reads everything into memory and then outputs it.

To do this, I'd like to parse and output the data for each SNP (rs number) in turn rather than for each animal in turn. But the data is given to us in the wrong order (it's sorted by animal).

So the first thing we need to do is to sort the data by SNP (the first column). I will also remove the header row at the same time as it is not needed for the data transformation.

I'm assuming that the data is stored in the file data.in:

$ sed '1d' data.in | sort -o data.tmp

We now have:

$ cat data.tmp
rs01            215               AB            1            100
rs01            300               AB            1            100
rs01            666               BB            1            100
rs02            215               AA            2            200
rs02            300               BB            2            200
rs02            666               AA            2            200
rs03            215               BA            3            300
rs03            300               AA            3            300
rs03            666               AB            3            300
rs04            215               AA            4            400
rs04            300               AB            4            400
rs04            666               AB            4            400

Then I run the following to produce the result:

$ awk -f script.awk data.tmp >data.new

The awk script is quite long, so it makes sense to have it in it's own script file rather than as a "one-liner":

FNR == 1    {
    # at first line of input

    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2; # animal ID
    all[c++]    = $3; # allele

    do_header   = 1;  # output header when done parsing this SNP

    next;
}

rsid == $1 {
    # still reading animal ID/allele for this SNP
    aid[c]      = $2;
    all[c++]    = $3;
    next;
}

{
    if (do_header) {
        # output header

        printf("SNP_name\tChr\tPosition\t");
        for (c in aid) {
            printf("%d\t", aid[c]);
        }
        printf("\n");

        do_header = 0;
    }

    # output line with data from previous SNP    
    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");

    # store data for this SNP
    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2;
    all[c++]    = $3;
}

END {
    # output line for last SNP

    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");
}

This produces the tab-delimited file data.new with the following contents, for the given input:

SNP_name    Chr Position    215 300 666
rs01    1   100 AB  AB  BB
rs02    2   200 AA  BB  AA
rs03    3   300 BA  AA  AB
rs04    4   400 AA  AB  AB

NOTE: This requires that all animals were genotyped for exactly the same SNPs. The same animal IDs needs to occur for every SNP. No exceptions.



Answered By - Kusalananda