Thursday, June 16, 2022

[SOLVED] Searching within strings in a single field containing an in-field delimiter across two files using awk?

June 16, 2022 awk, grep, mapping

Issue

I have two files.

file01.tab:

Q86IC9   PGEN_.00g000010
P04177   PGEN_.00g000020
Q8L840   PGEN_.00g000050
Q61043   PGEN_.00g000060
A1E2V0   PGEN_.00g000080
P34456   PGEN_.00g000090
P34457   PGEN_.00g000120
O00463   PGEN_.00g000210
Q00945   PGEN_.00g000230
Q5SWK7   PGEN_.00g000240

file02.tab:

Q86IC9;Q552T5                                                omt5
P04177                                                       Th
Q8L840;O04092;Q9FT71                                         RECQL4A
Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7  Nin
A1E2V0                                                       BIRC3
P34456                                                       Uncharacterized
P34457                                                       uncharacterized
O00463;B4DIS9;B4E0A2;Q6FHY1                                  TRAF5
Q00945                                                       RING
Q5SWK7;Q8BXX5;Q9CXG1                                         Rnf145

I want to use the first column in file01.tab to join with the first column in file02.tab. I could do this with grep, but I need the output to be formatted in the following fashion:

PGEN_.00g000010 Q86IC9;Q552T5           omt5
PGEN_.00g000020 P04177                  Th
PGEN_.00g000050 Q8L840;O04092;Q9FT71    RECQL4A

I've come very close to success using the following awk code:

awk 'NR==FNR{a[$1]=$0; next} ($1 in a){print $2,a[$1]}' file02.tab file01.tab

That one-liner will produce the following:

PGEN_.00g000020 P04177  Th
PGEN_.00g000080 A1E2V0  BIRC3
PGEN_.00g000090 P34456  Uncharacterized
PGEN_.00g000120 P34457  uncharacterized
PGEN_.00g000230 Q00945  RING
PGEN_.00g000280 Q8ZXT3  protein
PGEN_.00g000300 Q5REG4  DTX3
PGEN_.00g000450 A0JMR6  mysm1
PGEN_.00g000490 Q7D513  Hercynine
PGEN_.00g000530 A6H769  RPS7

The code does not find matches where $1 in file02.tab contains the in-field semicolon delimiter. It only finds matches for rows that have a single entry in $1.
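A minimal sketch of why the lookup misses those rows, assuming the default whitespace field splitting used by the one-liner above: the whole semicolon-joined string becomes the array key, so a lone accession such as Q86IC9 never equals it. The two input rows are copied from file02.tab; the variable name keys is only for illustration.

```shell
# Print the value awk would use as the array key ($1) for two file02.tab rows.
keys=$(printf 'Q86IC9;Q552T5   omt5\nP04177   Th\n' |
  awk '{ print "[" $1 "]" }')
printf '%s\n' "$keys"
# [Q86IC9;Q552T5]
# [P04177]
```

Only the second key is a bare accession, which is why only single-entry rows are matched.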

Obviously, grep can handle the searching using two input files, but I don't know how to format the output from the grep results, since the formatting requires info from both input files.

Is there any way to accomplish this with an awk one-liner or should I put together a small script instead?

Solution

Would you please try the following:

awk -v FS='[;[:space:]]+' 'NR==FNR {a[$1]=$2; next} $1 in a {print a[$1], $0}' file01.tab file02.tab

Output:

PGEN_.00g000010 Q86IC9;Q552T5                                                omt5
PGEN_.00g000020 P04177                                                       Th
PGEN_.00g000050 Q8L840;O04092;Q9FT71                                         RECQL4A
PGEN_.00g000060 Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7  Nin
PGEN_.00g000080 A1E2V0                                                       BIRC3
PGEN_.00g000090 P34456                                                       Uncharacterized
PGEN_.00g000120 P34457                                                       uncharacterized
PGEN_.00g000210 O00463;B4DIS9;B4E0A2;Q6FHY1                                  TRAF5
PGEN_.00g000230 Q00945                                                       RING
PGEN_.00g000240 Q5SWK7;Q8BXX5;Q9CXG1                                         Rnf145

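FS='[;[:space:]]+' splits each line on any run of semicolons and/or whitespace characters, so $1 in file02.tab is the first accession only, while $0 still holds the original, unmodified line.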
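To see what that field separator does to a multi-accession row, here is a small sketch. It assumes an awk that understands POSIX character classes such as [:space:] (gawk and recent mawk do); the variable name fields is only for illustration.

```shell
# Split one file02.tab row on runs of semicolons and/or whitespace,
# then print the field count, the first field and the last field.
fields=$(printf 'Q8L840;O04092;Q9FT71   RECQL4A\n' |
  awk -v FS='[;[:space:]]+' '{ print NF; print $1; print $NF }')
printf '%s\n' "$fields"
# 4
# Q8L840
# RECQL4A
```

The first accession lands in $1, which is the value the one-liner looks up in the array built from file01.tab.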
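I have switched the order of the files for simplicity: file01.tab is read first and supplies the lookup table, and file02.tab is then matched against it.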
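The one-liner above compares only the first accession of each file02.tab row. If the key in file01.tab could be any accession in the semicolon-separated list, a possible variation (not part of the original answer) is to split $1 on ";" and test each piece. In this sketch the scratch directory, the array names id and acc, and the second file01.tab row (keyed on O04092 instead of Q8L840) are assumptions made purely for illustration.

```shell
# Hypothetical scratch copies holding two rows of each file.
tmp=$(mktemp -d)
printf '%s\t%s\n' Q86IC9 PGEN_.00g000010 O04092 PGEN_.00g000050 > "$tmp/file01.tab"
printf '%s\t%s\n' 'Q86IC9;Q552T5' omt5 'Q8L840;O04092;Q9FT71' RECQL4A > "$tmp/file02.tab"

result=$(awk '
  # First file: remember the PGEN identifier for each accession.
  NR == FNR { id[$1] = $2; next }
  # Second file: try every accession in the semicolon-separated list.
  {
    n = split($1, acc, ";")
    for (i = 1; i <= n; i++)
      if (acc[i] in id) { print id[acc[i]], $0; break }
  }
' "$tmp/file01.tab" "$tmp/file02.tab")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Here the RECQL4A row is matched through its second accession; the break stops after the first hit so each file02.tab row is printed at most once.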

Answered By - tshiono

Answer Checked By - Timothy Miller (WPSolving Admin)

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
