Saturday, June 4, 2022

[SOLVED] Filter lines according to a percentage

Issue

I need to filter some files, keeping only the blocks in which the "Identities" value is greater than 90%.

Example file

>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
 O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName: 
Full=ATP-binding cassette sub-family A member 1; AltName: 
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding 
cassette 1; AltName: Full=Cholesterol efflux regulatory 
protein [Homo sapiens]
 BAB63210.1 ABCA1 [Homo sapiens]
Length=2261

 Score = 2246 bits (5819),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)

The result should be like this:

>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)

I edited my code following this post:

awk '{RS="Identities"; FS=" "; original_block=$0; gsub(/\(|\)|%|,/,""); if ($5 >= 99) print original_block}' my_file.txt

Got this:

   >EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.

 = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
 O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName: 
Full=ATP-binding cassette sub-family A member 1; AltName: 
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding 
cassette 1; AltName: Full=Cholesterol efflux regulatory 
protein [Homo sapiens]
 BAB63210.1 ABCA1 [Homo sapiens]
Length=2261

 Score = 2246 bits (5819),  Expect = 0.0, Method: Compositional matrix adjust.

 = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)

Another attempt was:

awk '{RS="Identities"; FS=" "; gsub(/\(|\)|%|,/,""); if ($5 >= 99) print }' my_file.txt

And I get this:

   >EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 = 1174/1174 100 Positives = 1174/1174 100 Gaps = 0/1174 0
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
 O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName: 
Full=ATP-binding cassette sub-family A member 1; AltName: 
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding 
cassette 1; AltName: Full=Cholesterol efflux regulatory 
protein [Homo sapiens]
 BAB63210.1 ABCA1 [Homo sapiens]
Length=2261

 Score = 2246 bits (5819),  Expect = 0.0, Method: Compositional matrix adjust.
 = 1150/2284 50 Positives = 1511/2284 66 Gaps = 81/2284 4

My filtering strategy is not working. Could you please help me?


Solution

I assume that > is used solely to mark where a block starts, that you are always interested in the 1st percentage, and that percentages are always given as integers (e.g. 90% but not 90.1%). Under those assumptions I would use GNU AWK in the following way. Let file.txt content be

>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
 O95477.3 RecName: Full=Phospholipid-transporting ATPase ABCA1; AltName: 
Full=ATP-binding cassette sub-family A member 1; AltName: 
Full=ATP-binding cassette transporter 1; Short=ABC-1; Short=ATP-binding 
cassette 1; AltName: Full=Cholesterol efflux regulatory 
protein [Homo sapiens]
 BAB63210.1 ABCA1 [Homo sapiens]
Length=2261

 Score = 2246 bits (5819),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)

then

awk 'BEGIN{RS=">";FPAT="[0-9]+%"}int($1)>90{print}' file.txt

output

EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform 
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)

Explanation: Firstly I inform GNU AWK that > is used as the record separator (RS) and that fields are 1 or more digits followed by a % sign (FPAT). Then for each record I use the int function to get the numeric value of the first field, in this case 100 from 100%, and print the record if that value is greater than 90. Note that the leading > is consumed as the record separator, so it does not appear in the output. Keep in mind the assumptions described earlier, and if you wish to use my code, test it with other data.
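The one-liner above relies on FPAT, which is specific to GNU AWK, and its output lacks the leading > because that character is consumed as the record separator. Below is a minimal sketch of a variant that should also run on other awk implementations and puts the > back. It rests on the same assumptions as the answer (> appears only at block starts, and the first NN% in a block is the Identities value). The file names sample.txt and filtered.txt and the abridged sample data are only for illustration.

```shell
# Abridged two-record sample based on the example file (illustrative only).
cat > sample.txt <<'EOF'
>EAW73057.1 ATP-binding cassette, sub-family A (ABC1), member 4, isoform
CRA_b, partial [Homo sapiens]
Length=1203

 Score = 2445 bits (6337),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1174/1174 (100%), Positives = 1174/1174 (100%), Gaps = 0/1174 (0%)
>NP_005493.2 phospholipid-transporting ATPase ABCA1 [Homo sapiens]
Length=2261

 Score = 2246 bits (5819),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 1150/2284 (50%), Positives = 1511/2284 (66%), Gaps = 81/2284 (4%)
EOF

awk 'BEGIN { RS = ">" }
     # match() finds the leftmost "digits%" in the record; under the stated
     # assumptions that is the Identities percentage.
     match($0, /[0-9]+%/) {
         pct = substr($0, RSTART, RLENGTH - 1) + 0   # drop the trailing %
         # printf without an extra newline avoids a blank line after each
         # record; the ">" removed by RS is written back explicitly.
         if (pct > 90) printf ">%s", $0
     }' sample.txt > filtered.txt

cat filtered.txt
```

The empty record before the first > contains no percentage, so match() fails and it is skipped without special handling.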

(tested in gawk 4.2.1)
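If the integer-percentage assumption does not hold for your data, one possible adjustment is to anchor the pattern on the Identities line itself and allow a decimal point in the number. This is only a sketch: the records seqA and seqB, the file names, and the variable min are made up for illustration and are not taken from the question.

```shell
# Hypothetical records with and without a decimal percentage.
cat > sample_decimal.txt <<'EOF'
>seqA hypothetical record
 Identities = 905/1000 (90.5%), Positives = 950/1000 (95%), Gaps = 0/1000 (0%)
>seqB hypothetical record
 Identities = 900/1000 (90%), Positives = 950/1000 (95%), Gaps = 0/1000 (0%)
EOF

awk -v min=90 'BEGIN { RS = ">" }
     # Match only the percentage that follows "Identities = a/b (".
     match($0, /Identities = [0-9]+\/[0-9]+ \([0-9.]+%/) {
         hit = substr($0, RSTART, RLENGTH - 1)   # matched text without the %
         sub(/.*\(/, "", hit)                    # keep only the number
         if (hit + 0 > min) printf ">%s", $0     # restore the ">" eaten by RS
     }' sample_decimal.txt > filtered_decimal.txt

cat filtered_decimal.txt
```

Passing the threshold with -v min=90 keeps the program text unchanged when you want a different cutoff. The comparison is strict, so a record at exactly 90% is not printed.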



Answered By - Daweo
Answer Checked By - Mildred Charles (WPSolving Admin)