Issue
How can I parse a URL such as https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
so that only virtualbox.org/virtualbox/6.1.36 remains? Here is what I have so far:
TEST_URLS=(
    https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe
    https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe
    https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2
)
for url in "${TEST_URLS[@]}"; do
    without_proto="${url#*:\/\/}"
    without_auth="${without_proto##*@}"
    [[ $without_auth =~ ^([^:\/]+)(:[[:digit:]]+\/|:|\/)?(.*) ]]
    PROJECT_HOST="${BASH_REMATCH[1]}"
    PROJECT_PATH="${BASH_REMATCH[3]}"
    echo "given: $url"
    echo " -> host: $PROJECT_HOST path: $PROJECT_PATH"
done
Solution
If I am correct, you need to extract a string of the form...
hostname.tld/dirname
...where tld is the top-level domain and dirname is the directory part of the path to the file.
That means filtering out the URL scheme and any subdomains at the beginning, then filtering out the file basename at the end. All solutions have assumptions. This one assumes one of the original three-letter top-level domains, i.e. .com, .org, .net, .int, .edu, .gov, .mil.
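To make that assumption concrete, here is a minimal sketch that checks whether a URL's host ends in a three-letter top-level domain before any stripping is attempted. The function name has_three_letter_tld and the example.co.uk and example.io URLs are made up for illustration and are not part of the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the host part of the URL ends in a
# dot followed by exactly three lowercase letters (.com, .org, .net, ...).
has_three_letter_tld() {
    local url=$1
    local host_and_path=${url#*://}   # drop the scheme, e.g. "https://"
    local host=${host_and_path%%/*}   # keep the text before the first slash
    [[ $host =~ \.[a-z]{3}$ ]]
}

# The second and third URLs are invented to show hosts outside the assumption.
for url in \
    'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe' \
    'https://example.co.uk/downloads/file.tar.gz' \
    'https://example.io/downloads/file.tar.gz'
do
    if has_three_letter_tld "$url"; then
        echo "ok:      $url"
    else
        echo "skipped: $url"
    fi
done
```

Only the first URL passes the check. Hosts ending in a two-letter or longer top-level domain would need a different pattern.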
This possible solution uses sed with the -r option for extended regular expressions.
It creates two filters and uses them to chop off the ends that you don't want (hopefully).
It also uses a capture group in filter_end, so as to keep the / in the filter.
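As an illustration of what the two filters hold, this small sketch applies both sed expressions to the first test URL only and prints the results. It assumes a sed that accepts -r, such as GNU sed, which also accepts -E for the same purpose.

```shell
#!/usr/bin/env bash
url='https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'

# Delete from "hostname.tld/" to the end of the URL; what is left is the
# unwanted prefix (scheme plus any subdomains).
filter_start=$(printf '%s\n' "$url" | sed -r 's/([^.\/][a-z]+\.[a-z]{3})\/.*//')

# Replace everything up to and including the last "/" with a single "/";
# what is left is the unwanted suffix, with its leading slash kept.
filter_end=$(printf '%s\n' "$url" | sed 's/.*\(\/\)/\1/g')

printf 'prefix to strip: %s\n' "$filter_start"
printf 'suffix to strip: %s\n' "$filter_end"
```

For this URL the prefix is https://download. and the suffix is /VirtualBox-6.1.36-152435-Win.exe, which is exactly what gets removed from each end.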
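test_urls=(
    'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
    'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
    'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)
for url in "${test_urls[@]}"
do
    # Unwanted prefix: delete from "hostname.tld/" to the end of the URL,
    # which leaves the scheme and any subdomains.
    filter_start=$(
        echo "$url" |
        sed -r 's/([^.\/][a-z]+\.[a-z]{3})\/.*//' )
    # Unwanted suffix: reduce everything up to the last "/" to that "/",
    # which leaves "/basename".
    filter_end=$(
        echo "$url" |
        sed 's/.*\(\/\)/\1/g' )
    # Strip both ends. The inner quotes make the filters match literally
    # instead of being treated as glob patterns.
    out_string="${url#"$filter_start"}"
    out_string="${out_string%"$filter_end"}"
    echo "$out_string"
done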
Output:
virtualbox.org/virtualbox/6.1.36
github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4
sourceforge.net/project/libtirpc/libtirpc/1.3.1
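For comparison, the same result can probably be reached without sed, using only bash parameter expansion and the =~ operator. This is a sketch of a different technique, not the method used above: the function name strip_url is made up, and it keeps the three-letter top-level domain assumption while being slightly more permissive about which characters the host name may contain.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original answer.
strip_url() {
    local url=$1
    local rest=${url#*://}       # drop the scheme, e.g. "https://"
    local host=${rest%%/*}       # text before the first slash
    local path=${rest#"$host"}   # remainder, starting with "/" (or empty)
    path=${path%/*}              # drop the file basename

    # Keep only the last two labels of the host: "name.tld".
    if [[ $host =~ ([^.]+\.[a-z]{3})$ ]]; then
        printf '%s%s\n' "${BASH_REMATCH[1]}" "$path"
    else
        return 1                 # host does not fit the three-letter TLD assumption
    fi
}

test_urls=(
    'https://download.virtualbox.org/virtualbox/6.1.36/VirtualBox-6.1.36-152435-Win.exe'
    'https://github.com/notepad-plus-plus/notepad-plus-plus/releases/download/v8.4.4/npp.8.4.4.Installer.x64.exe'
    'https://downloads.sourceforge.net/project/libtirpc/libtirpc/1.3.1/libtirpc-1.3.1.tar.bz2'
)
for url in "${test_urls[@]}"; do
    strip_url "$url"
done
```

Because the function returns a non-zero status when the host does not match, a caller can tell an unsupported URL apart from a successfully stripped one.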
Answered By - adebayo10k Answer Checked By - Marilyn (WPSolving Volunteer)