Sunday, October 24, 2021

[SOLVED] egrep + quantifier not working

Issue

egrep isn't matching in the following example, and from everything I've read it should. The expression is '{% +'. What I'm trying to accomplish is to match all the {% %} tags in my markdown files. As I understand it, the expression should match {% followed by one or more spaces, and fail to match if there is no space. The same expression matches in PowerShell, so I'm wondering what I'm missing.

Snippet to match against

{% highlight ruby %}
{% endhighlight %}

cat file.md | egrep '{% +'

Solution

For me, your regex works as expected. Given an input file file.md containing:

{% highlight ruby %}
{% endhighlight %}
not this line, though
nor {%this%}

When I run your command (without the cat, which is a useless use of cat, or UUoC), I get the output shown:

$ egrep '{% +' file.md
{% highlight ruby %}
{% endhighlight %}
$

You've not identified which version of egrep you are using and which platform you are using it on. I'm running Mac OS X 10.11.6 and using egrep (BSD grep) 2.5.1-FreeBSD (but I also get the same result with GNU Grep 2.25).
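To report which grep is in use, something like the following sketch may help. It assumes a grep that accepts -V, which both GNU grep and BSD grep do; the variable name grep_version is only illustrative.

```shell
# Capture the first line of the version banner.
# stderr is merged in case an implementation prints its banner there.
grep_version=$(grep -V 2>&1 | head -n 1)
printf '%s\n' "$grep_version"
```

The exact wording of the banner differs between implementations, so no particular output is shown here.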

You should be aware, though, that { is a metacharacter to egrep, and the problem may be that it is not handling the initial { as you expect.

For example, here's a more complex egrep invocation that should only select the endhighlight line:

$ egrep '\{% {1,4}[a-z]{4,20} {1,4}%\}' file.md
{% endhighlight %}
$

I used the backslashes to escape the first and last braces. The {n,m} notation means that the preceding regex (here, a blank or [a-z]) must match x times, where n ≤ x ≤ m. You can omit ,m to require exactly n matches, or write {4,} for four or more; check the manual for the details. However, on my machine, I can also run:

$ egrep '{% {1,4}[a-z]{4,20} {1,4}%}' file.md
{% endhighlight %}
$

Presumably, because the first { doesn't start an {n,m} sequence, it is treated as an ordinary character.
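The interval forms mentioned above ({n}, {n,} and {n,m}) can be tried in isolation. This is a small sketch using made-up test strings (ab, abbb, abbbbb) rather than the file from the question; the variable names are illustrative only.

```shell
# Three test strings with 1, 3 and 5 trailing b characters.
input=$(printf '%s\n' 'ab' 'abbb' 'abbbbb')

# {2,4}: between two and four b's, so only abbb is selected.
bounded=$(printf '%s\n' "$input" | grep -E '^ab{2,4}$')

# {3}: exactly three b's, so again only abbb is selected.
exact=$(printf '%s\n' "$input" | grep -E '^ab{3}$')

# {2,}: two or more b's, so abbb and abbbbb are selected.
open_ended=$(printf '%s\n' "$input" | grep -E '^ab{2,}$')

printf 'bounded: %s\n' "$bounded"
printf 'exact: %s\n' "$exact"
printf 'open-ended:\n%s\n' "$open_ended"
```

In each of these patterns the { directly follows something it can quantify, which is the case POSIX defines. The leading { in '{% +' has nothing before it to quantify.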

If you look at the POSIX specification for Extended Regular Expressions, you'll find that it says using { like that is undefined behaviour:

*+?{

The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

  • If these characters appear first in an ERE, or immediately following a <vertical-line>, <circumflex>, or <left-parenthesis>

  • If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

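So, according to POSIX, you are using a regex that produces undefined results. Because the behaviour is undefined, any result is acceptable as far as POSIX is concerned, whether egrep treats the leading { as a literal character (as mine does) or fails to match (as yours apparently does).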

Clearly, you should be able to use the following and get the result you expect:

$ egrep '\{% +' file.md
{% highlight ruby %}
{% endhighlight %}
$
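Since the POSIX text quoted above says { is special except inside a bracket expression, another way to make the brace literal is to write it as [{]. The sketch below recreates the sample file in a scratch directory (the mktemp location and variable names are assumptions for illustration) and uses grep -E, which is the POSIX spelling of egrep; recent GNU grep releases warn that egrep is obsolescent.

```shell
# Work in a scratch directory so an existing file.md is not overwritten.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '{% highlight ruby %}' \
    '{% endhighlight %}' \
    'not this line, though' \
    'nor {%this%}' > "$tmpdir/file.md"

# [{] is a bracket expression containing only a left brace, so the brace
# is matched literally without relying on undefined behaviour.
matches=$(grep -E '[{]% +' "$tmpdir/file.md")
printf '%s\n' "$matches"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

This should select the same two lines as the backslash-escaped version: the highlight and endhighlight lines, but not {%this%}, which has no space after the %.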


Answered By - Jonathan Leffler