Wednesday, December 29, 2021

[SOLVED] grep a pattern until a specific character (:)

Issue

Consider the following file.txt:

@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
@A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG
CGCCCCCTCCTCCGGTCGCCGCCGCGGTGTCCGCGCGTGGGTCCTGAGGGA
+
FFFF:FFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF
@A00940:70:HTCYYDRXX:2:2101:2772:1063 1:N:0:ATCACG
TGGTGGCAGGCACCTGTAATCCCAGCTACTCGGGAGCCTGAGGCAGGAGAA

I am trying to grep all the characters of the lines that start with @, up to but not including the first colon. In this example the result would be A00940.

I have tried this:

cat file.txt | grep '[^:]*'

and this:

cat file.txt | grep '^(.*?):'

but neither command works. Why is that?


Solution

The pattern [^:]* matches zero or more characters other than :, and it does not take the @ char into account. Because the * quantifier also allows an empty match, every line matches, and without -o grep prints each matching line in full.

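The pattern ^(.*?): is Perl-style syntax. With -P it would match from the start of the line, as few characters as possible, up to the first occurrence of :, so the match would still include the @ char and the colon. Without -P, grep uses basic regular expressions, where (, ) and ? are literal characters, so no line in this file matches.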
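One way to see both failures is to count the lines each attempt selects. The sketch below assumes a POSIX shell and uses a temporary file (the variable names are made up for illustration) holding the first record from the question.

```shell
# Build a small sample from the first record in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG
TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFF:FFF::FFFFFFF:FFFFFFFFFFF
EOF

# Attempt 1: '[^:]*' can match the empty string, so all 4 lines are selected.
first_count=$(grep -c '[^:]*' "$sample")

# Attempt 2: in basic regex syntax '(', ')' and '?' are literals, so 0 lines are selected.
# grep exits non-zero when nothing matches, hence the '|| true'.
second_count=$(grep -c '^(.*?):' "$sample" || true)

echo "attempt 1 selected $first_count of 4 lines"
echo "attempt 2 selected $second_count of 4 lines"
rm -f "$sample"
```

The first attempt selects everything and the second selects nothing, which is why neither prints just A00940.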


One option is to use -P for a Perl-compatible regex with a positive lookbehind that asserts an @ to the left, together with -o so that only the matched part is printed.

grep -oP '(?<=@)[^@:]+' file.txt

The pattern matches:

  • (?<=@) Positive lookbehind, asserts that an @ is directly to the left of the current position
  • [^@:]+ Negated character class, matches 1+ times any character except @ and :

Output

A00940
A00940
A00940
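
Note that the lookbehind accepts an @ anywhere in a line, not only at the start. If the data could contain an @ elsewhere (FASTQ quality strings may), an anchored variant is safer. The following is a sketch using sed, which also works where grep lacks -P; the sample quality line containing @ is invented for illustration.

```shell
# Sample input: one header from the question plus a made-up quality line containing '@'.
sample=$(mktemp)
printf '%s\n' \
  '@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG' \
  'TAGCACTGGG' \
  '+' \
  'FFFF@FF:FF' > "$sample"

# -n suppresses default output; the trailing p prints only lines where the
# substitution succeeded. ^@ anchors at the start of the line, and
# \([^:]*\) captures everything before the first colon.
ids=$(sed -n 's/^@\([^:]*\):.*/\1/p' "$sample")
echo "$ids"
rm -f "$sample"
```

On this sample the sed command prints only A00940, whereas the unanchored lookbehind would also report FF from the last line.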

Another option is to use gawk with a capture group:

gawk 'match($0, /@([^@:]+):/, a) {print a[1]}' file.txt
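
The three-argument form of match() is a gawk extension. Where only a POSIX awk is available, a rough equivalent is to split on the colon and drop the leading @. This is a sketch, not the answerer's method, and the sample file is a temporary one built from lines in the question.

```shell
# Sample input: two headers and one sequence line from the question.
sample=$(mktemp)
printf '%s\n' \
  '@A00940:70:HTCYYDRXX:2:2101:1561:1063 1:N:0:ATCACG' \
  'TAGCACTGGGCTGTGAGACTGTCGTGTGTGCTTTGGATCAAGCAAGATCGG' \
  '@A00940:70:HTCYYDRXX:2:2101:2175:1063 1:N:0:ATCACG' > "$sample"

# -F: makes ':' the field separator, /^@/ keeps only lines starting with '@',
# and substr($1, 2) prints the first field without its leading '@'.
ids=$(awk -F: '/^@/ { print substr($1, 2) }' "$sample")
echo "$ids"
rm -f "$sample"
```

A quality line that itself begins with @ would still be picked up. If the file strictly follows four-line records, selecting header lines with NR % 4 == 1 instead of /^@/ is one way around that.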


Answered By - The fourth bird