Sunday, October 30, 2022

[SOLVED] replace portion of fasta headers

Issue

I would like to replace a portion of the headers in a FASTA file (the part surrounded by _) using a text file that maps each old ID to its replacement. Hope someone can help me! Thanks

#fasta file:
>mir-2_scf7180000350313_41896
CCATCAGAGTGGTTGTGATGTGGTGCTATTGATTCATATCACAGCCAGCTTTGATGAG
>mir-92a-2_scf7180000349939_17298
AGGTGGGGATGGGGGCAATATTTGTGAATGATTAAATTCAAATTGCACTTGTCCCGGCCTGC
>mir-279a_scf7180000350374_48557
AATGAGTGGCGGTCTAGTGCACGGTCGATAAAGTTGTGACTAGATCCACACTCATTAAG

#key_file.txt
scf7180000350313 NW_011929472.1
scf7180000349939 NW_011929473.1
scf7180000350374 NW_011929474.1

#expected result
>mir-2_NW_011929472.1_41896
CCATCAGAGTGGTTGTGATGTGGTGCTATTGATTCATATCACAGCCAGCTTTGATGAG
>mir-92a-2_NW_011929473.1_17298
AGGTGGGGATGGGGGCAATATTTGTGAATGATTAAATTCAAATTGCACTTGTCCCGGCCTGC
>mir-279a_NW_011929474.1_48557
AATGAGTGGCGGTCTAGTGCACGGTCGATAAAGTTGTGACTAGATCCACACTCATTAAG

Solution

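You can try this awk script. It reads the key file into an array first, then rewrites each header line of the FASTA file.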

$ awk '
    NR == FNR{r[$1] = $2; next}      # first file only: store old ID -> new ID in an associative array
    /^>/{                            # header lines only (those beginning with >)
      for(i in r){                   # loop over the old IDs (the array keys)
        n = sub(i, r[i], $0)         # replace key i with r[i] in the current line; sub() returns 1 if it matched
        if(n == 1){break}            # stop at the first key that matched instead of testing the rest
      }
    }1' key_file.txt fasta-file
>mir-2_NW_011929472.1_41896
CCATCAGAGTGGTTGTGATGTGGTGCTATTGATTCATATCACAGCCAGCTTTGATGAG
>mir-92a-2_NW_011929473.1_17298
AGGTGGGGATGGGGGCAATATTTGTGAATGATTAAATTCAAATTGCACTTGTCCCGGCCTGC
>mir-279a_NW_011929474.1_48557
AATGAGTGGCGGTCTAGTGCACGGTCGATAAAGTTGTGACTAGATCCACACTCATTAAG
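
A possible variant, not part of the original answer: sub() treats each key as a regular expression and matches it anywhere in the header, and the loop tests keys one by one. The sketch below instead splits the header on _ and looks up the middle field directly in the array. It assumes every header has exactly three _-separated fields, as in the sample above; headers with a different shape are printed unchanged. The file names input.fasta and renamed.fasta are placeholders.

```shell
# Recreate the sample inputs from the question (file names are placeholders)
cat > key_file.txt <<'EOF'
scf7180000350313 NW_011929472.1
scf7180000349939 NW_011929473.1
scf7180000350374 NW_011929474.1
EOF

cat > input.fasta <<'EOF'
>mir-2_scf7180000350313_41896
CCATCAGAGTGGTTGTGATGTGGTGCTATTGATTCATATCACAGCCAGCTTTGATGAG
>mir-92a-2_scf7180000349939_17298
AGGTGGGGATGGGGGCAATATTTGTGAATGATTAAATTCAAATTGCACTTGTCCCGGCCTGC
>mir-279a_scf7180000350374_48557
AATGAGTGGCGGTCTAGTGCACGGTCGATAAAGTTGTGACTAGATCCACACTCATTAAG
EOF

awk '
    NR == FNR { r[$1] = $2; next }        # first file: old ID -> new ID
    /^>/ {                                # header lines only
        n = split($0, part, "_")          # assumption: name_ID_number
        if (n == 3 && (part[2] in r))     # exact lookup of the middle field
            $0 = part[1] "_" r[part[2]] "_" part[3]
    }
    1                                     # print every line, changed or not
' key_file.txt input.fasta > renamed.fasta
```

Because the lookup is an exact array access, a key that happens to be a substring of another ID cannot match by accident, and the cost per header no longer grows with the size of the key file.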


Answered By - Andre Wildberg
Answer Checked By - Katrina (WPSolving Volunteer)