Issue
I want to display all the links from https://www.gentoo.org/downloads/mirrors/ in the terminal. First, the script would wget the web page to a file called index.html; then a grep or sed command would simply print every https://, http:// and ftp:// link to the terminal.
Can someone help me with this command? I know it's simple, but I'm not really familiar with either of these commands.
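For the download step, I assume something like this would do (wget's -O option names the output file):
wget -O index.html https://www.gentoo.org/downloads/mirrors/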
What I tried:
grep "<code>" index.html
Output:
<a href="ftp://mirrors.tera-byte.com/pub/gentoo"><code>ftp://mirrors.tera-byte.com/pub/gentoo</code></a>
<a href="http://gentoo.mirrors.tera-byte.com/"><code>http://gentoo.mirrors.tera-byte.com/</code></a>
<a href="rsync://mirrors.tera-byte.com/gentoo"><code>rsync://mirrors.tera-byte.com/gentoo</code></a>
How can I strip the whitespace, the tags and all the unnecessary text around the links?
Solution
If you just want the bare links to remain, you can try this grep:
grep -Eo '(https?|ftp)://[^<>"]*' index.html
The -o flag prints only the matched part of each line, so this displays just the http://, https:// and ftp:// matches, one per line.
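If you don't need to keep index.html around, a one-shot pipeline can skip the intermediate file entirely (a sketch, assuming wget is installed; -qO- writes the page to stdout). Each mirror URL appears twice per line, once in the href attribute and once in the <code> text, so sort -u deduplicates:
wget -qO- https://www.gentoo.org/downloads/mirrors/ | grep -Eo '(https?|ftp)://[^<>"]*' | sort -u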
If the need is to match only within <code> blocks, this sed will work:
sed -En '/<code>/ s#.*((https?|ftp)://[^<>"]*).*#\1#p' index.html
The /<code>/ address restricts the substitution to lines containing a <code> tag, and the p flag combined with -n prints only the lines where a URL was actually captured.
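Run on the three sample lines from the question, this should print just the two matching links; the rsync:// line contains <code> but no http or ftp URL, so its substitution fails and -n suppresses it:
ftp://mirrors.tera-byte.com/pub/gentoo
http://gentoo.mirrors.tera-byte.com/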