Issue
>2654570298
MRNYSYKGKWEKLLTPEIVKKLTLINEFKGEQRLFIKAHKDELKELSELA
KIQSTEASNKIEGIFTSDDRFKSLAQAKTTPRNRNESEIAGYRDVLNTIH
DSYEYIPISASYFLQLHRDLYKFVAKNDVGKFKSSDNIIRETDEKGNERL
RFRPVPAWETPAAIDELCKAYADAKEEIDPLILNAMFILDFLCIHPFNDG
NGRMSRLLTLLLLYKTGFIVGKYISIEKIIEESKETYYEVLQDSLVGWHE
NENDYKPFVNYMLGVIVNAYKEFESRTELVTNPNLTKSDRIREIIKDHIG
TITKAELLEMNPDISDTTVQRTLAKLLKNNDIKKIGGGRYTKYTWNTEEQ
>2654570299|K03427
MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
>2654570301
MNESELYKELGILTKDKSKWAENIQYVSSLLNHESAKIQAKALWLLGEMG
LEYPDSIQDAVPMVASFCDSENALLRERAVNALGRIGRGNYNLIEPYWSD
LFRFASDDEPKVRLSFIWASENVATNTPDIYENHMSVFESLLHDIDDKVR
MESPEIFRVLGKRRPEFVIPYIEQLQKMAETDSNRVVRIHSLGAIKVTTS
K
>2654570302
MWNMIWPLVLIVGSNCFYNICTKSMPEGTNTFGALTVTYLVGAVLSAVLF
VVSVKPAGVLNEISKINWTSFVLGLVIVGLEAGYVFLYRAGWKVSNGALT
ANICLAIALIVIGFLLYKESISIKQVAGIVVCGFGLFLING
>2654570303|K01153
MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
HRIKNSKKED
I would like to filter it printing only the sequences whose header contains a "|K", either using awk, grep, or anything similar. Desired output:
>2654570299|K03427
MITGELKNKIDGLWDVFAAGGLVNPLEVIEQITYLMFIKDLDDVDKRKEK
ESAMLGLPYKSIFAGEVKIGDRTIEGTQLKWSVFHDFSAGRMYAIMQEWV
FPFIKNLHSDKNSTYSKYMDDAIFKFPTPLLLSKVVDSLDEIYEIMNSTL
VLDVRGDVYEYLLNKIASAGRNGQFRTPRHIIRMMVEMVEPKADDVICDP
GDLLKVCKTKKTELLFLALFLRMLKVGGRCACIVPDGVLFGSSKAHKDIR
KQVVEENRLEAVISMPSGVFKPYAGVSTAILIFTKTGHGGTDNVWFYDMT
ADGYSLDDKRTPVSENDIPDIIERFKNLDKEIDRERTDKSFMVPKQDIAD
NDYDLSINKYKEVVYEKIEYPPTSEIMADIREIEMEIGKEMDELEKLLNI
>2654570303|K01153
MKNKELLKRVGYVVLICLSFFVATWYFFENNKICTICWIAIGSKNVYDLV
HRIKNSKKED
Note that the number of lines between one header and the next is not always the same, and a line break always separates one sequence from the following header.
Could anyone help?
Solution
Using awk or sed:
sed -e '/|K/, /^$/ p; d' database.txt
awk '/\|K/, /^$/' database.txt
Both commands do the same thing: they look for a line containing |K and print from that line through the next blank line. In the sed version, the print is the explicit p (the d that follows deletes the pattern space and moves on to the next input line, which suppresses sed's default output), while the awk version relies on awk's default action: a pattern with no action block prints every line it selects.
The two tools use slightly different regular expression dialects. sed defaults to basic regular expressions, where `|` is an ordinary character, while awk uses extended regular expressions, where `|` is the alternation operator, so it must be escaped as `\|` in the awk example.
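One way to sidestep the dialect difference is a bracket expression, since `[|]` matches a literal `|` in both basic and extended regular expressions. The snippet below is a small sketch of that idea; the two-record sample input is made up for illustration and is not taken from the question's file.

```shell
# Made-up sample: one header without "|K" and one with it.
sample=$(printf '>1\nAAA\n>2|K01\nCCC\n')

# [|] is a bracket expression containing only "|", so it is a literal
# pipe character in sed (basic regex) and in awk (extended regex) alike.
sed_hits=$(printf '%s\n' "$sample" | sed -n '/[|]K/p')
awk_hits=$(printf '%s\n' "$sample" | awk '/[|]K/')

printf 'sed: %s\nawk: %s\n' "$sed_hits" "$awk_hits"
# prints:
# sed: >2|K01
# awk: >2|K01
```

The same pattern text can then be used unchanged in either tool, which makes it harmless to copy between the sed and awk variants.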
For more detail on the syntax, both awk and sed are documented in their man pages (man awk, man sed), which describe how each language works.
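The range patterns in the commands above end at a blank line, so they depend on the file having an empty line after each record. If that is not the case, a possible alternative is to decide at each header line whether to keep the record and then print lines until the next header. The following is a minimal sketch under that assumption; `filter_k_records` and the variable `keep` are hypothetical names chosen for this example.

```shell
# Hypothetical helper: print only the records whose header contains "|K".
# Reads the files given as arguments, or standard input if there are none.
filter_k_records() {
    awk '
        # On every header line, record whether it contains a literal "|K".
        /^>/ { keep = ($0 ~ /\|K/) }
        # Print the current line (header or sequence) while keep is true.
        keep
    ' "$@"
}

# Usage, with the file name from the answer above:
# filter_k_records database.txt
```

Because the decision is made per header, this works whether or not records are separated by blank lines. A blank line that follows a kept record is printed along with it.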
Answered By - Michael Back Answer Checked By - Marie Seifert (WPSolving Admin)