Monday, January 8, 2024

[SOLVED] Use awk to extract column data with multiple names (UNIX)

January 08, 2024 awk, large-data, multiple-columns, subset, unix

Issue

Hey I am newer to using Linux and I have a large dataset with 5k columns all with different names comparing patient data. I want to subset the "PC" data from the other groups (such as BC, PR... etc). There are about 100 members for the PC all labeled PC****. where **** is a case number.

Data looks like this (Although delimiter = \t ):

ID_REF	BC0010	BC0011	PC0023	PC0013	PR0007	PR0053
Gene1	1.177703095	-1.32116828	-2.176250582	-2.079342976	-1.427146954	10.64102737
Gene2	1.177703095	-1.32116828	-2.176250582	-2.079342976	-1.427146954	10.64102737
Gene3	1.177703095	-1.32116828	-2.176250582	-2.079342976	-1.427146954	10.64102737

What I am trying to get is (again with delimiter = \t):

ID_REF	PC0023	PC0013
Gene1	-2.176250582	-2.079342976
Gene2	-2.176250582	-2.079342976
Gene3	-2.176250582	-2.079342976

So far I have tried :

 $ awk -F/t -v col=*PC 'NR==1{for(i=1;i<=NF;i++){if($i==col){c=i;break}} print $c} NR>1{print $c}' data.txt > pcData.txt

However for some reason this doesn't subset the data and I end up getting all columns returned.

I am not sure what to put into col=??? to make this work.

Thanks for your help!

Solution

1st issue is the input field delimiter (-F/t); wrong symbol for a tab; a couple options: -F$'\t' or in a BEGIN{} block (eg, BEGIN { FS="\t" }); since it looks like you want to maintain a tab as the output field delimiter then one idea:

# replace

awk -F/t 'NR==1 ....

# with

awk 'BEGIN { FS=OFS="\t" } NR==1 ....

2nd issue is the check for a column name starting with PC; you'll have problems getting awk to understand what *PC is supposed to mean; since you want to know about column names that start with PC, one idea:

awk -v col='PC' 'NR==1{for (i=1;i<=NF;i++) if ( index($i,col)==1 ) ....'

3rd issue is that you want to build a list (how about an array?) to store a list of columns of interest; you're currently breaking out of your NR==1 loop on the first match where what you want to do is keep track of the column numbers (i) for all matches; since you also want to keep the 1st column, something like:

if ( index($i,col)==1 || i==1 ) cols[++c]=i

... then for all input lines you print the columns represented by the numbers in the array.

Rolling all of this into a complete script (and changing some names):

awk -v pfx='PC' '
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1;i<=NF;i++)
            if ( (index($i,pfx)==1) || i==1)
               cols[++c]=i
      }
      { for (i=1;i<=c;i++)
            printf "%s%s", (i==1 ? "" : OFS), $(cols[i])
        print ""
      }
' data.txt

This generates:

ID_REF  PC0023  PC0013
Gene1   -2.176250582    -2.079342976
Gene2   -2.176250582    -2.079342976
Gene3   -2.176250582    -2.079342976

Answered By - markp-fuso

Answer Checked By - Marie Seifert (WPSolving Admin)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Monday, January 8, 2024

[SOLVED] Use awk to extract column data with multiple names (UNIX)

Issue

Solution

Popular Posts

Labels