Saturday, November 25, 2023

[SOLVED] How to extract protein sequence IDs that are present as singletons in a cluster?

awk, bash, grep, linux, sed

Issue

I have a large dataset containing clusters of protein sequences. Each cluster is represented by a header line with the cluster number, followed by one row per protein sequence found in that cluster. Some clusters contain several sequences, while others contain only one (i.e., singletons). I want to extract the protein sequence IDs that are present as singletons, that is, the sequences that are the only member of their cluster.

Here is an example of the dataset:

>Cluster 0
0       310aa, >ref_ENST00000279791... at 100.00%
1       415aa, >ref_ENST00000641310... *
>Cluster 1
0       310aa, >ENST00000279791.590... at 100.00%
1       310aa, >ENST00000332650.693... at 100.00%
2       413aa, >ENST00000641310.590... *
3       310aa, >ENST00000279791.590... at 99.35%
4       310aa, >ENST00000332650.693... at 99.35%
>Cluster 2
0       399aa, >ENST00000641310.394... *
>Cluster 3
0       311aa, >ENST00000641081.179... at 96.14%
1       395aa, >ENST00000641310.395... *
2       311aa, >ENST00000641581.842... at 96.14%
3       311aa, >ENST00000641668.842... at 96.14%
4       311aa, >ENST00000641081.179... at 96.14%
5       299aa, >ENST00000641310.395... at 100.00%
6       311aa, >ENST00000641581.842... at 96.14%
7       311aa, >ENST00000641668.842... at 96.14%
>Cluster 4
0       380aa, >ENST00000641310.583... *
1       314aa, >ENST00000332238.915... at 95.86%
2       310aa, >ENST00000641310.583... at 97.10%
>Cluster 5
0       370aa, >ref_ENST00000314644... *
1       316aa, >ref_ENST00000642128... at 100.00%
>Cluster 6
0       367aa, >ENST00000641310.213... *
1       326aa, >ENST00000531945.112... at 96.32%
2       319aa, >ENST00000641123.112... at 98.12%
3       313aa, >ENST00000641310.213... at 99.68%
>Cluster 7
0       367aa, >ENST00000641310.284... *

In this example, I want to extract the IDs of the protein sequences that are the only member of their cluster (i.e., singletons). For the dataset above, the desired output is:

ENST00000641310.394
ENST00000641310.284

#!/bin/bash

# Assuming the dataset is stored in a file called "dataset.txt"
input_file="dataset.txt"

# Loop through each line in the input file
while IFS= read -r line; do
  # Check if the line starts with ">Cluster"
  if [[ $line == ">Cluster"* ]]; then
    cluster_number=${line#>Cluster }
    cluster_number=${cluster_number//[^0-9]/}
    cluster_found=false
  fi

  # Check if the line contains a singleton protein sequence
  if [[ $line == *"... *" ]]; then
    protein_sequence=$(echo "$line" | awk -F"[>, ]" '{print $4}')
    cluster_found=true
  fi

  # Print the singleton protein sequence if a cluster was found
  if [[ $cluster_found == true ]]; then
    echo "$protein_sequence"
  fi
done < "$input_file"

I tried the script above, but it did not work.

Let me know if anything is unclear.

Solution

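If I put your data in a file called protein.txt, then I can do this on Linux (be aware that a multi-character record separator such as RS='>Cluster' is not portable: POSIX leaves the behavior unspecified, so this requires GNU awk or another awk that supports it):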

awk -F'\n' -v RS='>Cluster' 'NF==3' protein.txt

which prints the clusters that contain exactly one sequence:

 2
0       399aa, >ENST00000641310.394... *

 7
0       367aa, >ENST00000641310.284... *

Is that what you are looking for?
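
For context, the one-liner works because each record is everything between two ">Cluster" markers. Split on newlines, a single-member cluster yields three fields: the cluster number, the one member line, and an empty field after the final newline. If only the bare IDs are wanted, and without relying on a multi-character RS, something along these lines might work. It is only a sketch: the function name singleton_ids and the helper emit are made up for illustration, and it assumes every member line carries its ID between ">" and "..." as in the sample data.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): print the ID of
# every cluster that has exactly one member line.
# Assumption: the ID sits between ">" and "..." on each member line.
singleton_ids() {
  awk '
    # Print the buffered ID if the cluster that just ended had one member.
    function emit() { if (members == 1) print id }

    # Header line: close the previous cluster and start counting again.
    /^>Cluster/ { emit(); members = 0; next }

    # Member line (blank lines are skipped because NF is 0 for them).
    NF {
      members++
      id = $0
      sub(/^[^>]*>/, "", id)    # drop everything up to and including ">"
      sub(/\.\.\..*$/, "", id)  # drop the trailing "..." and what follows
    }

    # The last cluster has no header after it, so close it here.
    END { emit() }
  ' "$1"
}
```

Called as singleton_ids protein.txt, this should print ENST00000641310.394 and ENST00000641310.284 for the sample data, one per line. It reads the file line by line with the default record separator, so it should also run under awks other than GNU awk.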

Answered By - Sergiu

Answer Checked By - David Goodson (WPSolving Volunteer)

This answer was collected from Stack Overflow and tested by PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
