Friday, May 13, 2022

[SOLVED] Python webscraping script on AWS keeps failing after 1.5 hours/fetching 10,000 xmls

May 13, 2022 amazon-s3, amazon-web-services, linux, nohup, python

Issue

I wrote a Python script that runs from an AWS instance and fetches xml files off an S3 server to place in a folder on the instance. The script works fine, except for the fact that after about an hour and a half, or about the time it takes to fetch 10,000-15,000 xmls, I get the following error:

HTTP Error 500: Internal Server Error

Following this error, I am told that the folder I tell the script to place the fetched xml in can't be found, i.e.

[Errno 2] No such file or directory:

I have tried running this script both from ssh, using screen and using nohup, but I get the same issue each time. As I have about 200,000 xmls to fetch, I'd like to just run this script once and go do something else for the 20+ hours it needs to run.

For reference, the script I wrote is below:

import os

import feather

df = feather.read_dataframe('avail.feather')

import xmltodict

urls = df['URL']

import urllib.request
import time
import requests

ticker=0 
start = time.time()
for u in urls[ticker:len(urls)]:
    #os.chdir('/home/stan/Documents/Dissertation Idea Box/IRS_Data')
    ticker += 1
    print("Starting URL",ticker, "of", len(urls),"........." ,(ticker/len(urls))*100, "percent done")
    if u is None:
        print("NO FILING")
        end = time.time()
        m, s = divmod(end-start, 60)
        h, m = divmod(m, 60)
        print("Elapsed Time:","%02d:%02d:%02d" % (h, m, s))        
        continue

    u = u.replace('https','http')
    r = requests.get(u)    
    doc = xmltodict.parse(r.content)
    try:
        os.chdir("irs990s")
        urllib.request.urlretrieve(u, u.replace('http://s3.amazonaws.com/irs-form-990/',''))
        print("FETCHED!","..........",u)
    except Exception as e: 
        print("ERROR!!!","..........",u)
        print(e)
        end = time.time()
        m, s = divmod(end-start, 60)
        h, m = divmod(m, 60)
        print("Elapsed Time:","%02d:%02d:%02d" % (h, m, s))
        continue
    end = time.time()
    m, s = divmod(end-start, 60)
    h, m = divmod(m, 60)
    print("Elapsed Time:","%02d:%02d:%02d" % (h, m, s))
    os.chdir('..')

Solution

I don't know the first thing about Python, but the problem seems apparent enough all the same.

When the S3 error occurs, your except block runs continue, which skips the rest of the instructions in the enclosing loop and moves on to the next value from the top of the loop. That skips the os.chdir('..') at the end of the loop, so your current working directory is still irs990s. On the next iteration, os.chdir("irs990s") looks for a directory called irs990s inside the then-current directory, which is already irs990s, so it fails with [Errno 2] No such file or directory. Because that call sits inside the same try block, every later iteration fails the same way.
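The failure mode described above can be reproduced without touching S3. The following is a minimal sketch, not the asker's script: simulate is a hypothetical helper that replays the same chdir/continue pattern inside a temporary directory, with a RuntimeError standing in for the HTTP 500.

```python
import os
import tempfile


def simulate(outcomes):
    """Replay the loop's chdir pattern.

    outcomes[i] is True if the (pretend) download in iteration i succeeds.
    Returns a list of (iteration, exception class name) for each failure.
    """
    errors = []
    start_dir = os.getcwd()
    with tempfile.TemporaryDirectory() as base:
        os.mkdir(os.path.join(base, "irs990s"))
        os.chdir(base)
        try:
            for i, ok in enumerate(outcomes):
                try:
                    os.chdir("irs990s")
                    if not ok:
                        # Stand-in for the S3 failure in the real script.
                        raise RuntimeError("HTTP Error 500: Internal Server Error")
                except Exception as e:
                    errors.append((i, type(e).__name__))
                    continue  # skips the os.chdir('..') below
                os.chdir("..")
        finally:
            # Leave the temporary directory before it is deleted.
            os.chdir(start_dir)
    return errors


print(simulate([True, False, True, True]))
# [(1, 'RuntimeError'), (2, 'FileNotFoundError'), (3, 'FileNotFoundError')]
```

A single simulated server error in iteration 1 is enough to make every following iteration fail with FileNotFoundError, which is the [Errno 2] seen in the question.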

There are a couple of lessons here.

Don't keep switching in and out of a directory using os.chdir('..'). That's very bad form and prone to subtle bugs; case in point, see above. Use absolute paths. If you really want relative paths, that's fine, but don't do it this way. Capture the working directory at startup, or configure a base working directory, and use that to fully qualify your paths, including any path you pass to chdir.
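As a rough illustration of the absolute-path approach, here is a minimal sketch. BASE_DIR, OUT_DIR and target_path are hypothetical names, and it assumes the file name is simply the last component of the URL path, which is what stripping the bucket prefix produces for keys that contain no slashes.

```python
import os
from urllib.parse import urlparse

# Captured once at startup; nothing below ever calls os.chdir.
BASE_DIR = os.path.abspath(os.getcwd())
OUT_DIR = os.path.join(BASE_DIR, "irs990s")


def target_path(url, out_dir=OUT_DIR):
    """Return the absolute path where the file behind `url` should be saved."""
    # Assumption: the last component of the URL path is the file name.
    filename = os.path.basename(urlparse(url).path)
    return os.path.join(out_dir, filename)
```

In the loop, the download call would then become something like urllib.request.urlretrieve(u, target_path(u)), after creating the folder once with os.makedirs(OUT_DIR, exist_ok=True). A failed download no longer changes where the next file is written.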

Design your code to anticipate occasional errors from S3, or from any web service from any provider, and retry 5XX errors after a brief, increasing delay (exponential backoff).
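A retry with exponential backoff could look roughly like the sketch below. fetch_with_retry is a hypothetical helper, and the attempt count and base delay are illustrative guesses rather than values recommended by S3. The retrieve and sleep parameters exist only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, dest, max_attempts=5, base_delay=1.0,
                     retrieve=urllib.request.urlretrieve, sleep=time.sleep):
    """Download `url` to `dest`, retrying 5XX responses with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return retrieve(url, dest)
        except urllib.error.HTTPError as e:
            # Client errors (4XX) will not fix themselves, and the last
            # attempt has nothing left to wait for: re-raise in both cases.
            if e.code < 500 or attempt == max_attempts - 1:
                raise
            # Waits base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

With the defaults, a URL that keeps returning 500 is tried five times, with waits of 1, 2, 4 and 8 seconds in between, before the error is passed on to the caller. Adding a little random jitter to each delay is a common refinement.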

Answered By - Michael - sqlbot

Answer Checked By - Candace Johnson (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
