Sunday, October 24, 2021

[SOLVED] Not able to extract with grep

Issue

> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_UX182_CR0,0,182,268_AL_.jpg"
itemprop="image" />

I want to extract the URL of the poster from the above text. This is my grep statement:

count=$(grep -zPo '(?<=> <img alt=").*?src="\K.*?(?="itemprop="image")'  ~/movie_local)

movie_local is where I saved the page source of the site. I am still learning grep and haven't fully mastered it, so please go easy on me. Could you please help me out? :)


Solution

(As has been said many times before, the best solution is to use an HTML parser.)

With GNU grep, try this simplified version:

grep -zPo '<img alt=[^/]+?src="\K[^"]+' ~/movie_local

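A fixed version of your original attempt (note the (?s) prefix; see below for an explanation). The lookahead is also shortened to (?="), because in the input the closing " of the URL is followed by a newline, not directly by itemprop="image":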

grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' ~/movie_local

Alternatively, with [\s\S] used ad hoc to match any character, including \n:

grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' ~/movie_local
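
To try the commands above without the original page source, the following is a minimal sketch that recreates the snippet from the question in a temporary file (a stand-in for ~/movie_local, not the asker's actual file) and runs the simplified command against it. It assumes GNU grep built with -P support. The tr step is an addition of this sketch: with -z, grep terminates each output record with a NUL byte rather than a newline.

```shell
#!/usr/bin/env bash
# Recreate the snippet from the question in a temporary file
# (hypothetical stand-in for ~/movie_local).
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_UX182_CR0,0,182,268_AL_.jpg"
itemprop="image" />
EOF

# -z: treat the whole file as one record; -P: PCRE; -o: print only the match.
# tr converts the NUL that -z appends to each match into a newline.
url=$(grep -zPo '<img alt=[^/]+?src="\K[^"]+' "$sample" | tr '\0' '\n')

printf '%s\n' "$url"
rm -f "$sample"
```

Swapping in either of the other two commands should produce the same URL for this input.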

As for why your attempt didn't work:

  • When you use -P (for PCRE, i.e. Perl-Compatible Regular Expression, support), . does not match \n characters by default, so even though you're using -z to read the entire input at once, .* won't match across line boundaries. You have two choices:

    • Set option s ("dotall") at the start of the regex - (?s) - this makes . match any character, including \n
    • Ad-hoc workaround: use [\s\S] instead of .
  • As an aside: the \K construct is a syntactically simpler and sometimes more flexible alternative to a lookbehind assertion ((?<=...)).

    • Your command had both, which did no harm in this case, but was unnecessary.
    • By contrast, had you tried (?<=>\s*<img alt=") for more flexible whitespace matching - note the \s* in place of the original single space - your lookbehind assertion would have failed, because lookbehind assertions must be of fixed length (at least as of GNU grep v2.26).
      However, using just \K would have worked: >\s*<img alt="\K
      \K simply drops everything matched so far from the reported match (it is not included in the output).
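
The points above can be checked on a tiny input. The following is a rough sketch, again assuming GNU grep with -P support; the two-line input and the variable names are made up for illustration. It compares . with and without (?s), shows the [\s\S] workaround, and uses \K after a variable-length prefix (\s*).

```shell
#!/usr/bin/env bash
# Hypothetical two-line input; -z makes grep see it as a single record.
input=$'a="1"\nb="2"\n'

# Without (?s), . does not match the newline, so nothing matches.
# grep exits with status 1 on no match, hence the "|| true".
plain=$(printf '%s' "$input" | grep -zPo 'a=.*b="\K[^"]+' | tr '\0' '\n' || true)

# With (?s) ("dotall"), . also matches the newline.
dotall=$(printf '%s' "$input" | grep -zPo '(?s)a=.*b="\K[^"]+' | tr '\0' '\n')

# [\s\S] matches any character, whether or not option s is set.
adhoc=$(printf '%s' "$input" | grep -zPo 'a=[\s\S]*b="\K[^"]+' | tr '\0' '\n')

# \K after a variable-length prefix: \s* absorbs any amount of whitespace.
kreset=$(printf '%s' '>   <img alt="Poster"' | grep -Po '>\s*<img alt="\K[^"]+')

printf 'plain=[%s] dotall=[%s] adhoc=[%s] kreset=[%s]\n' \
  "$plain" "$dotall" "$adhoc" "$kreset"
```

Only the first command comes back empty; the other three report the text after the \K.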


Answered By - mklement0