Issue
I have a large file with thousands of lines that looks like this:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00002235.4
TTACGCAT
TAGGCCAG
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I want to grep specific ids (after the > sign), for example ENST00001234.1, and then get the lines after the match until the next > [regardless of the number of lines]. I want to grep about 63 ids in this way at once.
If I grep the ids ENST00001234.1 and ENST00005546.9, the ideal output should be:
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
I tried awk '/ENST00001234.1/ENST00005546.9/{print}' but it did not help.
Solution
You can set > as the record separator:
$ awk -F'\n' -v RS='>' -v ORS= '$1=="ENST00001234.1"{print RS $0}' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
-F'\n' makes it easier to compare the search term with the first line
-v RS='>' sets > as the input record separator
-v ORS= clears the output record separator; otherwise you'll get an extra newline in the output
$1=="ENST00001234.1" does a string comparison that must match the entire first line; otherwise you'd have to escape regex metacharacters like . and add anchors
print RS $0 prints > and the record content if a match is found
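To illustrate the point about escaping and anchors, here is a minimal sketch that runs the string comparison next to an equivalent regex match. It recreates the question's sample in a temporary directory; the directory and the variable names string_out and regex_out are assumptions made for illustration only.

```shell
# Work in a throwaway directory so no existing ip.txt is overwritten.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1

# Sample input from the question.
printf '%s\n' \
  '>ENST00001234.1' 'ACGTACGTACGG' 'TTACCCAGTACG' 'ATCGCATTCAGC' \
  '>ENST00002235.4' 'TTACGCAT' 'TAGGCCAG' \
  '>ENST00005546.9' 'TTTATCGC' 'TTAGGGTAT' > ip.txt

# String comparison: the id is used literally, so no escaping is needed.
string_out=$(awk -F'\n' -v RS='>' -v ORS= '$1=="ENST00001234.1"{print RS $0}' ip.txt)

# Regex form: ^ and $ anchor the match to the whole first line,
# and \. matches a literal dot rather than any character.
regex_out=$(awk -F'\n' -v RS='>' -v ORS= '$1 ~ /^ENST00001234\.1$/{print RS $0}' ip.txt)

# Both commands select the same record; print one of them.
printf '%s\n' "$regex_out"
```

The string comparison is usually the simpler choice here, since ids containing . would otherwise need escaping one by one.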
If you want to match more than one search term, put them in a file:
$ cat f1
ENST00001234.1
ENST00005546.9
$ awk 'BEGIN{FS="\n"; ORS=""}
NR==FNR{a[$0]; next}
$1 in a{print RS $0}' f1 RS='>' ip.txt
>ENST00001234.1
ACGTACGTACGG
TTACCCAGTACG
ATCGCATTCAGC
>ENST00005546.9
TTTATCGC
TTAGGGTAT
Here, the contents of f1 are used to build the keys for array a. Once the first file is read, RS='>' changes the record separator for the second file.
$1 in a checks whether the first line of the record matches a key in array a.
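The two-file approach can be tried end to end with the sketch below. It assumes a throwaway directory and a hypothetical output file named out.fa; ip.txt and f1 follow the names used above. As a sanity check, it also compares the number of extracted headers with the number of ids requested.

```shell
# Work in a throwaway directory so no existing files are overwritten.
tmpdir=$(mktemp -d) && cd "$tmpdir" || exit 1

# Sample input from the question.
printf '%s\n' \
  '>ENST00001234.1' 'ACGTACGTACGG' 'TTACCCAGTACG' 'ATCGCATTCAGC' \
  '>ENST00002235.4' 'TTACGCAT' 'TAGGCCAG' \
  '>ENST00005546.9' 'TTTATCGC' 'TTAGGGTAT' > ip.txt

# Ids to extract: one per line, without the leading ">".
printf '%s\n' 'ENST00001234.1' 'ENST00005546.9' > f1

# f1 is read line by line to fill array a; RS switches to ">"
# only when awk reaches the RS='>' argument, just before ip.txt.
awk 'BEGIN{FS="\n"; ORS=""}
     NR==FNR{a[$0]; next}
     $1 in a{print RS $0}' f1 RS='>' ip.txt > out.fa

# Sanity check: every requested id should yield exactly one header line.
wanted=$(grep -c . f1)
found=$(grep -c '^>' out.fa)
echo "requested: $wanted, extracted: $found"
```

With the real data, f1 would hold the roughly 63 ids, one per line. Because the test is on the whole first line, each id in f1 has to equal the header text after > exactly.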
Answered By - Sundeep