Friday, July 29, 2022

[SOLVED] stop condition for emulating "grep -oE" with awk

Issue

I'm trying to emulate GNU grep -Eo with a standard awk call.

The man page describes the -o option as follows:

-o, --only-matching
     Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line.
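
For reference, here is a minimal illustration of that behavior, assuming a grep that supports -o (such as GNU grep). The input string and the pattern a+ are only examples:

```shell
# Each non-empty match is printed on its own line.
echo abcaaaca | grep -oE 'a+'
# a
# aaa
# a
```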

For now I have this code:

#!/bin/sh

regextract() {
    [ "$#" -ge 2 ] || return 1
    __regextract_ere=$1
    shift
    awk -v FS='^$' -v ERE="$__regextract_ere" '
        {
            while ( match($0,ERE) && RLENGTH > 0 ) {
                print substr($0,RSTART,RLENGTH)
                $0 = substr($0,RSTART+1)
            }
        }
    ' "$@"
}

My question is: when the match is zero-length, should I keep trying to match the rest of the line, or should I move on to the next line (as I currently do)? I can't find an input and regex that would need the former, but I suspect one exists. Any ideas?


Solution

Here's a POSIX awk version, which works with a* (or any POSIX awk regex):

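echo abcaaaca |
awk -v regex='a*' '
{
    # match() returns 0 only when no match (not even an empty one) remains
    while (match($0, regex)) {
        # print non-empty matches only, as grep -o does
        if (RLENGTH) print substr($0, RSTART, RLENGTH)
        # drop the matched text; after an empty match, skip one character instead
        $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
        # nothing left to scan: stop, since an empty match could otherwise repeat forever
        if ($0 == "") break
    }
}'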

Prints:

a
aaa
a
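
To answer the question directly: an input and regex for which stopping at the first zero-length match loses output do exist. The sketch below runs the loop from the question and the loop above on the same input; the function names stop_on_empty and skip_on_empty are made up for illustration. With the regex a* and the input bca, the very first match is empty (at position 1, before the b), so the question's loop gives up before it ever reaches the a.

```shell
#!/bin/sh

# Loop from the question: leaves the line at the first zero-length match.
stop_on_empty() {
    awk -v ERE="$1" '{
        while (match($0, ERE) && RLENGTH > 0) {
            print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + 1)
        }
    }'
}

# Loop from the answer: skips one character after a zero-length match
# and keeps scanning the rest of the line.
skip_on_empty() {
    awk -v regex="$1" '{
        while (match($0, regex)) {
            if (RLENGTH) print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
            if ($0 == "") break
        }
    }'
}

echo bca | stop_on_empty 'a*'   # prints nothing
echo bca | skip_on_empty 'a*'   # prints: a
```

So the loop has to continue past a zero-length match, advancing by one character each time, until the line is exhausted.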

Both POSIX awk and grep -E use POSIX extended regular expressions, but awk also interprets C-style escapes (such as \t), while grep -E does not. For strict compatibility with grep you would have to account for that difference.
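
A small sketch of the awk side of that difference follows. The helper name extract is hypothetical and simply wraps the loop shown above. The grep side is left out on purpose, because grep -E is not required to give \t any special meaning and its result may vary between implementations.

```shell
#!/bin/sh

# Hypothetical helper wrapping the loop shown above.
extract() {
    awk -v regex="$1" '{
        while (match($0, regex)) {
            if (RLENGTH) print substr($0, RSTART, RLENGTH)
            $0 = substr($0, RSTART + (RLENGTH > 0 ? RLENGTH : 1))
            if ($0 == "") break
        }
    }'
}

# awk reads \t as a tab, so this extracts the tab character, not a "t".
# od -c makes the otherwise invisible tab visible.
printf 'a\tb\n' | extract '\t' | od -c
```

A pattern written for grep -E that contains a backslash may therefore need adjusting before it is handed to awk.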



Answered By - dan
Answer Checked By - Willingham (WPSolving Volunteer)