Issue
I'm trying to use this curl command to download a bunch of gzipped xml sitemaps that contain product urls.
By default it goes to the robots.txt file, finds the sitemap file that contains the URLs for all the individual sitemaps, uncompresses them, and then pulls out the URLs in the individual sitemaps that point to the individual products.
What I'd like to do instead is download each individual sitemap (over 400) to its own file and then manipulate those sitemaps on my local machine.
curl -N https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)$/\1/p' |
sed 's/\r$//g' |
xargs -n1 curl -N |
grep -oP '<loc>\K[^<]*' |
xargs -n1 curl -N |
gunzip |
grep -oP '<loc>\K[^<]*' |
gzip > \
somefile.txt.gz
Right now it puts all the data in one file, which is just too large. I've tried a few variations and eventually came up with this:
curl -N https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)$/\1/p' |
xargs -n1 curl -N |
grep -oP '<loc>\K[^<]*' |
sort > carid-list-of-compressed-sitemaps.txt
which works nicely and gives me a list of the gzipped XML sitemaps, but I can't quite figure out how to get the individual uncompressed sitemaps that have the product URLs in them.
So basically I want to download all the individual product sitemaps that contain the individual product URLs.
Solution
Use two steps. I deleted the $ in the first sed command, because .* already matches to the end of the line. I also removed the gzip, which was not needed with my test site.
caridlist="carid-list-of-compressed-sitemaps.txt"
curl -sN https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)/\1/p' |
xargs -n1 curl -sN |
grep -oP '<loc>\K[^<]*' > "${caridlist}"
filenumber=1
urlinfile=1
while IFS= read -r site_url; do
    # Append this sitemap's product URLs to the current numbered file.
    curl -sN "${site_url}" |
        grep -oP '<loc>\K[^<]*' >> "somefile_${filenumber}.txt"
    ((urlinfile++))
    # After a batch of sitemaps, move on to the next output file.
    if ((urlinfile==10)); then
        ((filenumber++))
        urlinfile=1
    fi
done < "${caridlist}"
Answered By - Walter A