Issue
I have a directory with a ton of text files containing various URLs. I want to grep all subdomains of a particular URL. For example, a text file could look like this:
https://subdomain1.stage.domain.tld/[...]
https://subdomain2.stage.domain.tld/test/[...]
https://subdomain3.pre.domain.tld/[...]
https://subdomain4.prod.domain.tld/files/[...]
I want to extract all subdomains of .stage.domain.tld, in this case subdomain1 and subdomain2, where a subdomain may contain a-z, A-Z, 0-9, - and _ characters.
After failing to figure it out myself, I need to ask for help. I have tried several attempts; the closest I could get is
grep -r -i '(http|https)://[^/"]+.stage.domain.tld' .
But it does not extract the subdomains. Any suggestions as to what is wrong? Thanks a lot!
Solution
Using GNU grep, you can use the following PCRE-based grep command:
grep -oP 'https?://\K[\w-]+(?=\.stage\.domain\.tld)' file
See the online grep demo. Regex details:
https?:// - matches http:// or https://
\K - match reset operator discarding all text matched so far
[\w-]+ - one or more letters, digits, _ or -
(?=\.stage\.domain\.tld) - a positive lookahead: immediately to the right, there must be a .stage.domain.tld string.
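As a quick sanity check, the command can be run against sample lines like those in the question (the URL paths here are illustrative placeholders):

```shell
# Feed the sample URLs to grep; only the .stage.domain.tld subdomains survive
printf '%s\n' \
  'https://subdomain1.stage.domain.tld/page' \
  'https://subdomain2.stage.domain.tld/test/page' \
  'https://subdomain3.pre.domain.tld/page' \
  'https://subdomain4.prod.domain.tld/files/page' |
grep -oP 'https?://\K[\w-]+(?=\.stage\.domain\.tld)'
# prints:
# subdomain1
# subdomain2
```

For the directory of text files from the question, run it recursively instead, e.g. grep -roPh '...' . (the -h flag suppresses the file-name prefixes that -r adds).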
With sed, assuming all URLs are at the start of separate lines:
sed -En 's~https?://([[:alnum:]_-]+)\.stage\.domain\.tld.*~\1~p' file
See this online sed demo. It matches the same way as the grep command, with .* added to consume the rest of each line and _ included in the bracket expression so underscores are also allowed. \1 replaces the whole match with the Group 1 contents, -n suppresses the default line output, and p prints the final results.
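The sed variant can be checked the same way on sample input (again with placeholder paths, and with _ added to the bracket expression so underscores are matched too):

```shell
# Non-matching lines are suppressed by -n; matching lines print only Group 1
printf '%s\n' \
  'https://subdomain1.stage.domain.tld/page' \
  'https://subdomain4.prod.domain.tld/files/page' |
sed -En 's~https?://([[:alnum:]_-]+)\.stage\.domain\.tld.*~\1~p'
# prints:
# subdomain1
```

Note that -E enables extended regular expressions in GNU sed, and ~ is used as the substitution delimiter so the slashes in the URL do not need escaping.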
Answered By - Wiktor Stribiżew