Monday, October 25, 2021

[SOLVED] Extract everything between first and last occurrence of the same pattern in a single iteration

Issue

This question is very much the same as this one, except that I want to do it as fast as possible, making only a single pass over the (unfortunately gzip-compressed) file.

Given the pattern CAPTURE and input

1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......

Print:

100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE

Can this be accomplished with a regular expression?

I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I am not quite sure, as regular expressions these days provide lookaheads, etc.


Solution

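You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially. Lines before the first match are discarded, every later match flushes the buffer, and whatever is still buffered at end of input came after the last match and is never printed.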

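#!/usr/bin/env perl
use warnings;
use strict;

my $first=1;    # true until the first CAPTURE line has been seen
my @buf;        # lines read since the most recent CAPTURE line
while ( my $line = <> ) {
    # Before the first match there is nothing to keep.
    push @buf, $line unless $first;
    if ( $line=~/CAPTURE/ ) {
        if ($first) {
            # First match: start the output with this line.
            @buf = ($line);
            $first = 0;
        }
        # Everything buffered lies between two matches, so print it.
        print @buf;
        @buf = ();
    }
}
# Lines still in @buf here came after the last match and are dropped.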

Feed the input into this program via zcat file.gz | perl script.pl.

Which can of course be jammed into a one-liner, if need be...

zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
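
If you would rather not spawn zcat, the same buffering loop can read the gzip file directly. The following is a minimal sketch, assuming the core module IO::Uncompress::Gunzip is available; the helper name extract_span and its arguments are made up for illustration and are not part of the answer above.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
use IO::Uncompress::Gunzip qw($GunzipError);

# Copy everything from the first to the last line matching $re,
# reading $in exactly once. $in and $out are line-oriented filehandles.
# (extract_span is a hypothetical helper name, chosen for this sketch.)
sub extract_span {
    my ($in, $out, $re) = @_;
    my $seen = 0;   # becomes true after the first match
    my @buf;        # lines read since the most recent match
    while ( my $line = <$in> ) {
        push @buf, $line if $seen;
        if ( $line =~ $re ) {
            @buf = ($line) unless $seen;
            $seen = 1;
            print {$out} @buf;   # flush: these lines lie between two matches
            @buf = ();
        }
    }
    return $seen;
}

# Usage: perl script.pl file.gz
if (@ARGV) {
    my $file = shift @ARGV;
    my $z = IO::Uncompress::Gunzip->new($file)
        or die "Cannot open $file: $GunzipError\n";
    extract_span($z, \*STDOUT, qr/CAPTURE/);
    $z->close;
}
```

Memory use is bounded by the longest gap between two matching lines, since the buffer is flushed on every match. Note that IO::Uncompress::Gunzip by default stops at the end of the first gzip member; if the file consists of several concatenated members (which zcat reads through), pass MultiStream => 1 to new.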

Can this be accomplished with a regular expression?

You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.

zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
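
To see what the single-regex version matches, here is a small sketch that applies the same pattern to the sample input from the question, held in a string. The wrapper span_by_regex is a hypothetical name introduced only for this illustration.

```perl
use warnings;
use strict;

# Same regex as the one-liner above, wrapped in a hypothetical helper.
# Group 2 matches the first CAPTURE line, (?s:.*) greedily spans the
# lines in between, and (?2) re-runs group 2's pattern on the last
# CAPTURE line. Returns the matched span, or undef if nothing matches.
sub span_by_regex {
    my ($text) = @_;
    return $text =~ /((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m ? $1 : undef;
}

# Sample input from the question, one line per element.
my $sample = join "\n",
    '1:.........', '...........', '100:CAPTURE', '...........',
    '150:CAPTURE', '...........', '200:CAPTURE', '...........',
    '1000:......', '';

print span_by_regex($sample);
```

On this sample it prints the five lines from 100:CAPTURE through 200:CAPTURE. One difference from the buffering script, as far as I can tell: the pattern needs two distinct matching lines, so input with only a single CAPTURE line produces no match here, whereas the buffering script prints that one line.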


Answered By - haukex