Issue
> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_UX182_CR0,0,182,268_AL_.jpg"
itemprop="image" />
I want to extract the url of the poster from the above text. This is my grep statement:
count=$(grep -zPo '(?<=> <img alt=").*?src="\K.*?(?="itemprop="image")' ~/movie_local)
movie_local is where I had saved the page source of the site. I am learning grep and haven't got complete command of it yet, so please do go soft on me. Could you please help me out? :)
Solution
(As has been said many times before, the best solution is to use an HTML parser.)
With GNU grep, try this simplified version:
grep -zPo '<img alt=[^/]+?src="\K[^"]+' ~/movie_local
A fixed version of your original attempt (note the (?s) prefix; see below for an explanation):
grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' ~/movie_local
Alternatively, use [\s\S] ad hoc to match any character, including \n:
grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' ~/movie_local
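To try the three commands above without the saved page, here is a minimal sketch that runs them against a temporary file. The temporary file is only a stand-in for ~/movie_local and contains just the snippet from the question. Because -z makes grep terminate each output item with a NUL byte, the sketch strips it with tr before storing the result in a variable; the variable names are made up for illustration.

```shell
# Hypothetical stand-in for ~/movie_local: just the snippet from the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
> <img alt="Citizen Kane Poster" title="Citizen Kane Poster"
src="https://images-na.ssl-images-amazon.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_UX182_CR0,0,182,268_AL_.jpg"
itemprop="image" />
EOF

# -z terminates each output item with NUL, so strip it before capturing.
url_simple=$(grep -zPo '<img alt=[^/]+?src="\K[^"]+' "$sample" | tr -d '\0')
url_dotall=$(grep -zPo '(?s)> <img alt=".*?src="\K.*?(?=")' "$sample" | tr -d '\0')
url_class=$(grep -zPo '> <img alt="[\s\S]*?src="\K.*?(?=")' "$sample" | tr -d '\0')

# All three variants should print the same URL.
printf '%s\n' "$url_simple" "$url_dotall" "$url_class"

rm -f "$sample"
```

If the result is captured with $(...) as in the question, stripping the NUL also avoids the warning that newer bash versions print about ignored null bytes in command substitutions.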
As for why your attempt didn't work:
When you use -P (for PCRE (Perl-Compatible Regular Expression) support), . does not match \n characters by default, so even though you're using -z to read the entire input at once, .* won't match across line boundaries. You have two choices:
- Set option s ("dotall") at the start of the regex with (?s); this makes . match any character, including \n.
- Ad-hoc workaround: use [\s\S] instead of . wherever the match needs to span lines.
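The difference is easiest to see on a tiny two-line input. The following is a sketch only, assuming GNU grep built with PCRE support; the sample function and its input are made up for illustration and are not part of the original page.

```shell
# Two-line sample input (hypothetical, for illustration only).
sample() { printf 'alt="x"\nsrc="y"\n'; }

# Default: . stops at \n, so nothing matches (grep exits with status 1).
plain=$(sample | grep -zPo 'alt=.*src=' | tr -d '\0' || true)

# (?s) turns on dotall, so .* can cross the line boundary.
dotall=$(sample | grep -zPo '(?s)alt=.*src=' | tr -d '\0')

# [\s\S] matches any character, newline included, without changing options.
class=$(sample | grep -zPo 'alt=[\s\S]*src=' | tr -d '\0')

printf 'plain:  [%s]\n' "$plain"
printf 'dotall: [%s]\n' "$dotall"
printf 'class:  [%s]\n' "$class"
```

The first result should be empty, while the other two should contain the text from alt= up to and including src=, spanning both lines.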
As an aside: the \K construct is a syntactically simpler and sometimes more flexible alternative to a lookbehind assertion ((?<=...)).
- Your command had both, which did no harm in this case, but was unnecessary.
- By contrast, had you tried (?<=>\s*<img alt=") for more flexible whitespace matching (note the \s* in place of the original single space), your lookbehind assertion would have failed, because lookbehind assertions must be of fixed length (at least as of GNU grep v2.26). However, using just \K would have worked: >\s*<img alt="\K
\K simply discards everything matched so far, so that it is not included in the output.
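A small sketch of the \K point, again assuming GNU grep with PCRE support. The input line is a made-up variant of the question's snippet with extra spaces after the >, and the variable names are illustrative. The variable-length lookbehind is run only to show how your grep reacts; the exact error message depends on the PCRE version, so the sketch merely reports whether the pattern was accepted.

```shell
# Hypothetical input: like the question's first line, but with several spaces.
line='>   <img alt="Citizen Kane Poster" title="Citizen Kane Poster"'

# \K after a variable-length prefix: everything before \K is dropped from the output.
with_k=$(printf '%s\n' "$line" | grep -Po '>\s*<img alt="\K[^"]+')

# A fixed-length lookbehind is fine too, but cannot contain \s*.
fixed=$(printf '%s\n' "$line" | grep -Po '(?<=<img alt=")[^"]+')

printf '%s\n' "$with_k" "$fixed"

# Variable-length lookbehind: typically rejected at regex compile time.
if printf '%s\n' "$line" | grep -Po '(?<=>\s*<img alt=")[^"]+' 2>/dev/null; then
  echo "this PCRE build accepted the variable-length lookbehind"
else
  echo "variable-length lookbehind was rejected"
fi
```

Both with_k and fixed should hold the alt text, which is why \K is usually the more convenient choice whenever the prefix may vary in length.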
Answered By - mklement0