Issue
I have two files.
file01.tab:
Q86IC9 PGEN_.00g000010
P04177 PGEN_.00g000020
Q8L840 PGEN_.00g000050
Q61043 PGEN_.00g000060
A1E2V0 PGEN_.00g000080
P34456 PGEN_.00g000090
P34457 PGEN_.00g000120
O00463 PGEN_.00g000210
Q00945 PGEN_.00g000230
Q5SWK7 PGEN_.00g000240
file02.tab:
Q86IC9;Q552T5 omt5
P04177 Th
Q8L840;O04092;Q9FT71 RECQL4A
Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7 Nin
A1E2V0 BIRC3
P34456 Uncharacterized
P34457 uncharacterized
O00463;B4DIS9;B4E0A2;Q6FHY1 TRAF5
Q00945 RING
Q5SWK7;Q8BXX5;Q9CXG1 Rnf145
I want to use the first column in file01.tab to join with the first column in file02.tab. I could do this with grep, but I need the output to be formatted in the following fashion:
PGEN_.00g000010 Q86IC9;Q552T5 omt5
PGEN_.00g000020 P04177 Th
PGEN_.00g000050 Q8L840;O04092;Q9FT71 RECQL4A
I've come very close to success using the following awk code:
awk 'NR==FNR{a[$1]=$0; next} ($1 in a){print $2,a[$1]}' file02.tab file01.tab
That one-liner will produce the following:
PGEN_.00g000020 P04177 Th
PGEN_.00g000080 A1E2V0 BIRC3
PGEN_.00g000090 P34456 Uncharacterized
PGEN_.00g000120 P34457 uncharacterized
PGEN_.00g000230 Q00945 RING
PGEN_.00g000280 Q8ZXT3 protein
PGEN_.00g000300 Q5REG4 DTX3
PGEN_.00g000450 A0JMR6 mysm1
PGEN_.00g000490 Q7D513 Hercynine
PGEN_.00g000530 A6H769 RPS7
The code does not find matches where $1 of file02.tab contains an in-field semicolon delimiter. It only finds matches for lines that have a single entry in $1.
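A minimal sketch of why the lookup fails, assuming the columns are tab-separated as the .tab extension suggests (the sample line is copied from file02.tab above). With awk's default field separator, the whole semicolon-separated list is a single field, so the array key never equals a bare ID from file01.tab:

```shell
# Assumption: columns are separated by a tab; the default FS treats
# any run of blanks the same way, so spaces would behave identically.
line='Q86IC9;Q552T5\tomt5\n'

# $1 is the entire ID list, not just the first ID.
key=$(printf "$line" | awk '{ print $1 }')
echo "$key"      # Q86IC9;Q552T5

# Storing the line under that key means a lookup by "Q86IC9" misses.
found=$(printf "$line" | awk '{ a[$1] = $0 }
    END { print (("Q86IC9" in a) ? "match" : "no match") }')
echo "$found"    # no match
```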
Obviously, grep can handle the searching using two input files, but I don't know how to format the output from the grep results, since the formatting requires info from both input files.
Is there any way to accomplish this with an awk one-liner or should I put together a small script instead?
Solution
Would you please try the following:
awk -v FS='[;[:space:]]+' 'NR==FNR {a[$1]=$2; next} $1 in a {print a[$1], $0}' file01.tab file02.tab
Output:
PGEN_.00g000010 Q86IC9;Q552T5 omt5
PGEN_.00g000020 P04177 Th
PGEN_.00g000050 Q8L840;O04092;Q9FT71 RECQL4A
PGEN_.00g000060 Q61043;A0A1Y7VJL5;B2RQ73;B7ZMZ9;E9Q488;E9Q4S3;Q674R4;Q6ZPM7 Nin
PGEN_.00g000080 A1E2V0 BIRC3
PGEN_.00g000090 P34456 Uncharacterized
PGEN_.00g000120 P34457 uncharacterized
PGEN_.00g000210 O00463;B4DIS9;B4E0A2;Q6FHY1 TRAF5
PGEN_.00g000230 Q00945 RING
PGEN_.00g000240 Q5SWK7;Q8BXX5;Q9CXG1 Rnf145
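- FS='[;[:space:]]+' splits each line on a run of semicolons or whitespace characters, so the first ID of every list in file02.tab becomes $1.
- I have switched the order of files for simplicity: file01.tab is read first and its ID-to-gene mapping is stored in the array a.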
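To see the separator at work, here is a small sketch that can be run in a scratch directory. It assumes tab-separated input and uses only two rows copied from the question; the temporary directory and the variable names fields and joined are just for illustration:

```shell
# Assumption: the real files are tab-separated; rows are copied from the question.
tmpdir=$(mktemp -d)
printf 'Q86IC9\tPGEN_.00g000010\nP04177\tPGEN_.00g000020\n' > "$tmpdir/file01.tab"
printf 'Q86IC9;Q552T5\tomt5\nP04177\tTh\n' > "$tmpdir/file02.tab"

# With this FS, each ID in the list becomes its own field.
fields=$(awk -v FS='[;[:space:]]+' \
    'NR==1 { print NF ": " $1 " | " $2 " | " $3 }' "$tmpdir/file02.tab")
echo "$fields"   # 3: Q86IC9 | Q552T5 | omt5

# The join from the answer: remember file01.tab's mapping, then look up
# the first ID of each file02.tab line and print the untouched line after it.
joined=$(awk -v FS='[;[:space:]]+' \
    'NR==FNR {a[$1]=$2; next} $1 in a {print a[$1], $0}' \
    "$tmpdir/file01.tab" "$tmpdir/file02.tab")
echo "$joined"

rm -rf "$tmpdir"
```

Because no field is modified, $0 is printed exactly as read, so the original separator between the ID list and the gene name is preserved. Note that only the first ID of each list is used as the lookup key; matching on the later IDs would need an explicit loop over the fields.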
Answered By - tshiono Answer Checked By - Timothy Miller (WPSolving Admin)