Issue
Hey I am newer to using Linux and I have a large dataset with 5k columns all with different names comparing patient data. I want to subset the "PC" data from the other groups (such as BC, PR... etc). There are about 100 members for the PC all labeled PC****. where **** is a case number.
Data looks like this (Although delimiter = \t ):
ID_REF | BC0010 | BC0011 | PC0023 | PC0013 | PR0007 | PR0053 |
---|---|---|---|---|---|---|
Gene1 | 1.177703095 | -1.32116828 | -2.176250582 | -2.079342976 | -1.427146954 | 10.64102737 |
Gene2 | 1.177703095 | -1.32116828 | -2.176250582 | -2.079342976 | -1.427146954 | 10.64102737 |
Gene3 | 1.177703095 | -1.32116828 | -2.176250582 | -2.079342976 | -1.427146954 | 10.64102737 |
What I am trying to get is (again with delimiter = \t):
ID_REF | PC0023 | PC0013 |
---|---|---|
Gene1 | -2.176250582 | -2.079342976 |
Gene2 | -2.176250582 | -2.079342976 |
Gene3 | -2.176250582 | -2.079342976 |
So far I have tried :
$ awk -F/t -v col=*PC 'NR==1{for(i=1;i<=NF;i++){if($i==col){c=i;break}} print $c} NR>1{print $c}' data.txt > pcData.txt
However for some reason this doesn't subset the data and I end up getting all columns returned.
I am not sure what to put into col=??? to make this work.
Thanks for your help!
Solution
1st issue is the input field delimiter (-F/t
); wrong symbol for a tab; a couple options: -F$'\t'
or in a BEGIN{}
block (eg, BEGIN { FS="\t" }
); since it looks like you want to maintain a tab as the output field delimiter then one idea:
# replace
awk -F/t 'NR==1 ....
# with
awk 'BEGIN { FS=OFS="\t" } NR==1 ....
2nd issue is the check for a column name starting with PC
; you'll have problems getting awk to understand what *PC
is supposed to mean; since you want to know about column names that start with PC
, one idea:
awk -v col='PC' 'NR==1{for (i=1;i<=NF;i++) if ( index($i,col)==1 ) ....'
3rd issue is that you want to build a list (how about an array?) to store a list of columns of interest; you're currently breaking out of your NR==1
loop on the first match where what you want to do is keep track of the column numbers (i
) for all matches; since you also want to keep the 1st column, something like:
if ( index($i,col)==1 || i==1 ) cols[++c]=i
... then for all input lines you print the columns represented by the numbers in the array.
Rolling all of this into a complete script (and changing some names):
awk -v pfx='PC' '
BEGIN { FS=OFS="\t" }
NR==1 { for (i=1;i<=NF;i++)
if ( (index($i,pfx)==1) || i==1)
cols[++c]=i
}
{ for (i=1;i<=c;i++)
printf "%s%s", (i==1 ? "" : OFS), $(cols[i])
print ""
}
' data.txt
This generates:
ID_REF PC0023 PC0013
Gene1 -2.176250582 -2.079342976
Gene2 -2.176250582 -2.079342976
Gene3 -2.176250582 -2.079342976
Answered By - markp-fuso Answer Checked By - Marie Seifert (WPSolving Admin)