Issue
Consider the following file.txt:
@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
@A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCTCCTCCGGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGA
+
FFFF:FFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
@A00940:70:HTCYYDRXX:2:2101:2772:1063 1:N:0:ATCACG
TGGTGGCAGGCACCTGTAATCCCAGCTACTCGGGAGCCTGAGGCAGGAGAA
I am trying to grep all the characters of the lines that start with @ up to, but not including, the first colon. In this example the result would be A00940.
I have tried this:
cat file.txt | grep '[^:]*'
and this:
cat file.txt | grep '^(.*?):'
but neither command works. Why is that?
Solution
The pattern [^:]* matches any character except : zero or more times, and does not take the @ char into account. Because the quantifier is *, it can also match an empty string, so every line matches and grep prints each line in full.
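The pattern ^(.*?): matches from the start of the string as few characters as possible up to the first occurrence of :, and also does not take the @ char into account. In addition, grep without -E or -P uses basic regular expressions, where ( ) and ? are literal characters, so this command prints nothing for the sample file.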
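To see both failures concretely, here is a minimal sketch that counts the lines each attempted pattern selects. The two-line sample variable stands in for file.txt, and the variable names are illustrative only. It assumes grep's default basic regular expression syntax.

```shell
# Hypothetical two-line sample standing in for file.txt
sample='@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG'

# [^:]* may match the empty string, so every line is selected
first=$(printf '%s\n' "$sample" | grep -c '[^:]*' || true)

# In basic regex syntax ( ) and ? are literal, so no line is selected
# (grep exits non-zero on no match, hence the || true)
second=$(printf '%s\n' "$sample" | grep -c '^(.*?):' || true)

echo "first attempt selects $first line(s), second selects $second"
```

Counting with -c rather than printing keeps the difference between "matches everything" and "matches nothing" easy to compare.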
One option is to use -P for a Perl-compatible regex with a positive lookbehind to assert an @ to the left, together with -o to print only the matched part:
grep -oP '(?<=@)[^@:]+' file.txt
The pattern matches:
(?<=@) Positive lookbehind, assert an @ directly to the left of the current position
[^@:]+ Negated character class, match 1+ times any character except @ and :
Output
A00940
A00940
A00940
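The lookbehind above accepts an @ anywhere in a line. Since the question asks only for lines that start with @, a possible variation (not part of the original answer) anchors the match with ^ and uses \K to leave the @ out of the reported match. This sketch assumes a grep built with -P support; the sample variable is a hypothetical stand-in for file.txt.

```shell
# Hypothetical sample: one header line and one sequence line
sample='@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG'

# ^@     only consider lines that begin with @
# \K     discard what was matched so far (the @) from the reported match
# [^:]+  take everything up to, but not including, the first colon
anchored=$(printf '%s\n' "$sample" | grep -oP '^@\K[^:]+')

echo "$anchored"
```

On a real file the same pattern would be used as grep -oP '^@\K[^:]+' file.txt.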
Another option using gawk with a capture group:
gawk 'match($0, /@([^@:]+):/, a) {print a[1]}' file.txt
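The three-argument form of match() is specific to gawk. Where neither gawk nor grep -P is available, a similar result can be sketched with sed and a basic-regex capture group. This is a swapped-in alternative, not the answer's own method, and the sample variable is a hypothetical stand-in for file.txt.

```shell
# Hypothetical sample: one header line and one sequence line
sample='@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG'

# -n suppresses default printing; the trailing p prints only lines where
# the substitution succeeded. \([^:]*\) captures the text between the
# leading @ and the first colon, and \1 replaces the whole line with it.
ids=$(printf '%s\n' "$sample" | sed -n 's/^@\([^:]*\):.*/\1/p')

echo "$ids"
```

Against a file, the equivalent command would be sed -n 's/^@\([^:]*\):.*/\1/p' file.txt.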
Answered By - The fourth bird