Thursday, July 28, 2022

[SOLVED] Extracting multiple strings and combining to .csv in bash

Issue

I have a long list of IDs that I need to parse. From each line I want to extract three pieces of information and write them to a 3-column CSV. Column 1 = the field between the first two | characters (the XXXX in tr|XXXX|), column 3 = the text after the second | but before OS=.

Column 2 would be conditional. If there is 'GN=XXX' in the line, I'd like it to return XXX. If GN= isn't present, I'd like to write the first section of column 3 (i.e. up to the first space).

Input:

>tr|I1WXP1|I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment) OS=uncultured euryarchaeote OX=114243 GN=mcrA PE=4 SV=1
>tr|A0A059VAR9|A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment) OS=Halorubrum sp. Ga66 OX=1480727 GN=atpB PE=3 SV=1
>tr|Q51760|Q51760_9EURY Glutaredoxin-like protein OS=Pyrococcus furiosus OX=2261 PE=1 SV=1

Desired output:

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein

I can get the first two with awk, e.g.:

awk '{split($0,a,"|"); print a[2]}' file

But I can't work out the conditional, or how to act on the 'GN=' pattern neatly.

So for example, extracting bold text:

tr|**I1WXP1**|**I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)** OS=uncultured euryarchaeote OX=114243 GN=**mcrA** PE=4 SV=1

Becomes:

I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)

Solution

Whenever your input contains tag=value pairs, as yours does, I find it best to first create an array that holds that mapping. You can then access the values by their tags (names) however you like, e.g. using any awk:

$ cat tst.awk
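BEGIN { FS="[|]"; OFS="," }
{
    # Start each record with an empty tag-to-value map.
    delete tag2val

    # description = $3 up to (not including) the first tag=value pair.
    description = $3; sub(/ +[^ ]+=.*/,"",description)
    # assignments = the rest of $3, i.e. the tag=value pairs.
    assignments = substr($3,length(description)+1)

    # Default GN to the first word of the description; a real GN= below overwrites it.
    tag2val["GN"] = description; sub(/ .*/,"",tag2val["GN"])

    # Split the pairs on spaces and store each value under its tag.
    split(assignments,a," ")
    for ( i in a ) {
        tag = a[i]; sub(/=.*/,"",tag)
        val = substr(a[i],length(tag)+2)
        tag2val[tag] = val
    }

    print $2, tag2val["GN"], description
}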

$ awk -f tst.awk file
I1WXP1,mcrA,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Q51760_9EURY Glutaredoxin-like protein
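
As an aside, when only the three requested columns are needed, the conditional can also be written directly with match(). This is a minimal sketch rather than part of the original answer; the wrapper name extract_gn is hypothetical, and it assumes the description always ends right before OS=, as in the sample input:

```shell
# extract_gn is a hypothetical wrapper name used only for this sketch.
extract_gn() {
    awk '
    BEGIN { FS="[|]"; OFS="," }
    {
        # Column 3: everything in the third field before " OS=".
        description = $3; sub(/ +OS=.*/,"",description)

        # Column 2: the GN= value when present, else the first word of column 3.
        if ( match($3,/ GN=[^ ]+/) ) {
            gene = substr($3,RSTART+4,RLENGTH-4)   # skip the 4 characters " GN="
        }
        else {
            gene = description; sub(/ .*/,"",gene)
        }

        print $2, gene, description
    }
    ' "$@"
}
```

Calling extract_gn file on the sample input should give the same three lines as above. The tag2val script remains the more general option, since every tag is available by name.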

With that approach, printing or testing other fields is trivial, e.g.:

$ cat tst.awk
BEGIN { FS="[|]"; OFS="," }
{
    delete tag2val

    description = $3; sub(/ +[^ ]+=.*/,"",description)
    assignments = substr($3,length(description)+1)

    tag2val["GN"] = description; sub(/ .*/,"",tag2val["GN"])

    split(assignments,a," ")
    for ( i in a ) {
        tag = a[i]; sub(/=.*/,"",tag)
        val = substr(a[i],length(tag)+2)
        tag2val[tag] = val
    }

    print $2, tag2val["GN"], tag2val["OS"], tag2val["PE"], description
}

$ awk -f tst.awk file
I1WXP1,mcrA,uncultured,4,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,Halorubrum,3,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Pyrococcus,1,Q51760_9EURY Glutaredoxin-like protein
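
Note that the OS values in the output above are cut at the first space (uncultured rather than uncultured euryarchaeote), because the assignments are split on spaces. If the full multi-word values are wanted, one possible variant appends every word that has no = to the value of the preceding tag. This is a sketch, not part of the original answer; the wrapper name parse_tags is made up for illustration, and it assumes no value contains a word with an = in it:

```shell
# parse_tags is a hypothetical wrapper name used only for this sketch.
parse_tags() {
    awk '
    BEGIN { FS="[|]"; OFS="," }
    {
        delete tag2val

        description = $3; sub(/ +[^ ]+=.*/,"",description)
        assignments = substr($3,length(description)+1)

        tag2val["GN"] = description; sub(/ .*/,"",tag2val["GN"])

        # Walk the words in order: a word of the form tag=... starts a new tag,
        # any other word is appended to the value of the current tag.
        n = split(assignments,a," ")
        tag = ""
        for ( i=1; i<=n; i++ ) {
            if ( a[i] ~ /^[^ =]+=/ ) {
                tag = a[i]; sub(/=.*/,"",tag)
                tag2val[tag] = substr(a[i],length(tag)+2)
            }
            else if ( tag != "" ) {
                tag2val[tag] = tag2val[tag] " " a[i]
            }
        }

        print $2, tag2val["GN"], tag2val["OS"], tag2val["PE"], description
    }
    ' "$@"
}
```

$ parse_tags file
I1WXP1,mcrA,uncultured euryarchaeote,4,I1WXP1_9EURY Methyl coenzyme M reductase subunit A (Fragment)
A0A059VAR9,atpB,Halorubrum sp. Ga66,3,A0A059VAR9_9EURY V-type ATP synthase beta chain (Fragment)
Q51760,Q51760_9EURY,Pyrococcus furiosus,1,Q51760_9EURY Glutaredoxin-like protein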


Answered By - Ed Morton
Answer Checked By - Katrina (WPSolving Volunteer)