Issue
I'm trying to use this curl command to download a bunch of gzipped xml sitemaps that contain product urls.
By default it goes to the robots.txt file, finds the sitemap file that contains the URLs for all the individual sitemaps, uncompresses them, and then pulls out the URLs in the individual sitemaps that point to the individual products.
What I'd like to do instead is download each individual sitemap (over 400) to its own file and then manipulate those sitemaps on my local machine.
curl -N https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)$/\1/p' |
sed 's/\r$//g' |
xargs -n1 curl -N |
grep -oP '<loc>\K[^<]*' |
xargs -n1 curl -N |
gunzip |
grep -oP '<loc>\K[^<]*' |
gzip > \
somefile.txt.gz
Right now it puts all the data in one file, which is just too large. I've tried a few variations and eventually came up with this:
curl -N https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)$/\1/p' |
xargs -n1 curl -N |
grep -oP '<loc>\K[^<]*' |
sort > carid-list-of-compressed-sitemaps.txt
which works nicely and gives me a list of the gzipped XML sitemaps, but I can't quite figure out how to get the individual uncompressed sitemaps that have the product URLs in them.
So basically I want to download all the individual product sitemaps that contain the individual product URLs.
Solution
Use two steps. I deleted the $ in the first sed command, because .* already matches to the end of the line. I also removed the gzip, which was not needed with my test site.
caridlist="carid-list-of-compressed-sitemaps.txt"
curl -sN https://www.example.com/robots.txt |
sed -n 's/^Sitemap: \(.*\)/\1/p' |
xargs -n1 curl -sN |
grep -oP '<loc>\K[^<]*' > "${caridlist}"
filenumber=1
urlinfile=1
while IFS= read -r site_url; do
    # Append this sitemap's product URLs to the current numbered file.
    curl -sN "${site_url}" |
        grep -oP '<loc>\K[^<]*' >> "somefile_${filenumber}.txt"
    ((urlinfile++))
    # After a batch of sitemaps, move on to the next output file.
    if ((urlinfile==10)); then
        ((filenumber++))
        urlinfile=1
    fi
done < "${caridlist}"
Answered By - Walter A