Monday, October 24, 2022

[SOLVED] Remove everything after and including a tab in FASTA header?

Issue

I am trying to keep only the first field identifier for each sequence in a .fasta file that looks like this:

>hetGla3    ENST00000215754.179
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10   ENST00000215754.270
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN

I want to remove the tab and the "ENST..." identifier after it, returning:

>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN

I have already tried using sed to strip the whitespace from the headers, but it doesn't appear to work (the headers still contain the second field):

sed 's/\.[^\.]*//'

Any help would be greatly appreciated! Thank you.


Solution

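Using GNU sed, restrict the substitution to header lines (those starting with >) and delete everything from the first run of spaces or the first tab to the end of the line: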

$ sed -E '/^>/s/( +|\t).*//' input_file
>hetGla3
ATGCCGATGTTCGTCTTGAACACCAACGTGCCCCGCGCCTCTGTGCCGGACGGGTTCCTCTCCGAGCTCACCCAGCAGCTGGCGCAGGCCACTGGCAAGCCGGCCCAGTATATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACCTTCGCGGGCTCATCCGAGCCCTGCGCGCTCTGCAGCCTGCACAGCATCGGCAAGATAGGCGGCGTTCAGAATCGCTCGTACAGCAAGCTGCTGTGTGGCCTGCTGGCGGAGCGCCTGCGTATCAGTCCGGACAGGATCTACATCAACTACTACGACATGAATGCGGCCAATGTGGGCTGGAACGGCTCCACCTTCGCTNNN
>musMus10
ATGCCTATGTTCATCGTGAACACCAATGTTCCCCGCGCCTCCGTGCCAGAGGGGTTTCTGTCGGAGCTCACCCAGCAGCTGGCGCAGGCCACCGGCAAGCCCGCACAGTACATCGCAGTGCACGTGGTCCCGGACCAGCTCATGACTTTTAGCGGCACGAACGATCCCTGCGCCCTCTGCAGCCTGCACAGCATCGGCAAGATCGGTGGTGCCCAGAACCGCAACTACAGTAAGCTGCTGTGTGGCCTGCTGTCCGATCGCCTGCACATCAGCCCGGACCGGGTCTACATCAACTATTACGACATGAACGCTGCCAACGTGGGCTGGAACGGTTCCACCTTCGCTNNN
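
The command above relies on GNU sed understanding \t inside the pattern. As a rough sketch for systems where that may not hold, the same idea can be written with the POSIX [[:space:]] class, or with awk. The function names strip_headers_sed and strip_headers_awk are made up for illustration and are not part of the original answer; both assume that everything after the first space or tab in a header line can be discarded.

```shell
# Hypothetical helpers; the names are illustrative only.

# sed variant: on lines starting with ">", delete from the first
# whitespace character (space or tab) to the end of the line.
strip_headers_sed() {
    sed '/^>/s/[[:space:]].*//' "$@"
}

# awk variant: apply the same substitution to header lines only,
# then print every line (sequence lines pass through unchanged).
strip_headers_awk() {
    awk '/^>/ { sub(/[ \t].*/, "") } { print }' "$@"
}

# Example usage (input_file and output_file are placeholders):
#   strip_headers_sed input_file > output_file
```

Both functions read the files given as arguments, or standard input when none are given, so they can also sit at the end of a pipeline. Writing to a new file rather than editing in place keeps the original FASTA available if the headers turn out to need the second field after all.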


Answered By - HatLess
Answer Checked By - Willingham (WPSolving Volunteer)