Issue
I need to filter a file so that only the records whose "Identities" value is greater than 90% are kept.
Example file
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName:
Full=ATP-binding cassette sub-family A member 1; AltName:
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding
cassette 1; AltName: Full=Cholesterol efflux regulatory
protein [Homo sapiens]
BAB63210.1 ABCA1 [Homo sapiens]
Length=2261
Score = 2246 bits (5819), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)
The result should be like this:
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
I edited my code following this post:
awk '{RS="Identities"; FS=" "; original_block=$0; gsub(/\(|\)|%|,/,""); if ($5 >= 99) print original_block}' my_file.txt
Got this:
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
= 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName:
Full=ATP-binding cassette sub-family A member 1; AltName:
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding
cassette 1; AltName: Full=Cholesterol efflux regulatory
protein [Homo sapiens]
BAB63210.1 ABCA1 [Homo sapiens]
Length=2261
Score = 2246 bits (5819), Expect = 0.0, Method: Compositional matrix adjust.
= 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)
Another attempt was:
awk '{RS="Identities"; FS=" "; gsub(/\(|\)|%|,/,""); if ($5 >= 99) print }' my_file.txt
And I get this:
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
= 1174/1174 100 Positives = 1174/1174 100 Gaps = 0/1174 0
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName:
Full=ATP-binding cassette sub-family A member 1; AltName:
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding
cassette 1; AltName: Full=Cholesterol efflux regulatory
protein [Homo sapiens]
BAB63210.1 ABCA1 [Homo sapiens]
Length=2261
Score = 2246 bits (5819), Expect = 0.0, Method: Compositional matrix adjust.
= 1150/2284 50 Positives = 1511/2284 66 Gaps = 81/2284 4
My filtering strategy is not working. Could you please help me?
Solution
I assume that > is used solely to mark where a block starts, that you are always interested in the first percentage, and that percentages are always given as integers (e.g. 90% but not 90.1%). Under those assumptions I would use GNU AWK in the following way. Let the content of file.txt be
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName:
Full=ATP-binding cassette sub-family A member 1; AltName:
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding
cassette 1; AltName: Full=Cholesterol efflux regulatory
protein [Homo sapiens]
BAB63210.1 ABCA1 [Homo sapiens]
Length=2261
Score = 2246 bits (5819), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)
then
awk 'BEGIN{RS=">";FPAT="[0-9]+%"}int($1)>90{print}' file.txt
output
EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
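Explanation: First, I tell GNU AWK that > is the record separator (RS) and that fields are runs of one or more digits followed by a % sign (FPAT, a GNU AWK extension). Then, for each record, I use the int function to get the numeric value of the first field, in this case 100 from 100%. The record is printed if that value is greater than 90. Because > is consumed as the record separator, the printed records no longer start with >. Keep in mind the assumptions described earlier, and if you wish to use my code, test it with other data.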
(tested in gawk 4.2.1)
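If the leading > has to be kept in the output, or GNU AWK is not available, the same idea can be sketched without RS and FPAT. The following is only a sketch, built on the same assumptions as above (blocks start with >, percentages are integers). It collects the lines of each block and reads the first percentage from the Identities line. The names emit, block, keep and min are illustrative choices, not part of the original answer, and the sample input is a shortened copy of the example file. A block is kept if any of its Identities lines exceeds the threshold.

```shell
# Shortened sample input (two records from the question) in a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203
Score = 2445 bits (6337), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
Length=2261
Score = 2246 bits (5819), Expect = 0.0, Method: Compositional matrix adjust.
Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)
EOF

result=$(awk -v min=90 '
  # Print the collected block if it was marked, then reset the state.
  function emit() { if (keep) printf "%s", block; block = ""; keep = 0 }

  # A line starting with ">" closes the previous block and opens a new one.
  /^>/ { emit() }

  # Every line, including the ">" header, is appended to the current block.
  { block = block $0 "\n" }

  # On the Identities line, take the first "(NN%)" and compare it with min.
  /^ *Identities = / {
      if (match($0, /\([0-9]+%\)/)) {
          pct = substr($0, RSTART + 1, RLENGTH - 3) + 0   # "(100%)" -> 100
          if (pct > min) keep = 1
      }
  }

  # The last block has no following ">" line, so handle it at the end.
  END { emit() }
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

With this sample input, only the five lines of the EAW73057.1 record are printed, including the > header line. Because the percentage is read only from the Identities line, a stray percentage in a description line would not affect the result here, but as with the one-liner above this should be tested on real data before relying on it.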
Answered By - Daweo Answer Checked By - Mildred Charles (WPSolving Admin)