Tuesday, February 6, 2024

[SOLVED] How to get exact page content in wget if error code is 404

Issue

I have two URLs: one works, and the other points to a deleted page. The working URL is fine, but for the deleted-page URL wget receives a 404 and returns nothing instead of the actual page content.

Working URL

import os
def curl(url):
    data = os.popen('wget -qO- %s '% url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_41/bio")

Output:

https://www.reverbnation.com/artist_41/bio
80067

Deleted-page URL

import os
def curl(url):
    data = os.popen('wget -qO- %s '% url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_42/bio")

Output:

https://www.reverbnation.com/artist_42/bio
0

I get a length of 0, but the live page has some content in it.

How can I receive the exact content with wget or curl?


Solution

wget has a switch called "--content-on-error":

--content-on-error
           If this is set to on, wget will not skip the content
           when the server responds with a http status code that
           indicates error.

In other words, wget outputs the response body even when the server responds with an HTTP status code that indicates an error.

So just add it to your code and you will have the "content" of the 404 pages too:

import os
def curl(url):
    # --content-on-error keeps the body of error responses such as 404
    data = os.popen('wget --content-on-error -qO- %s ' % url).read()
    print (url)
    print (len(data))
    #print (data)

curl("https://www.reverbnation.com/artist_42/bio")
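If you would rather not build a shell string with os.popen, the same wget call can be sketched with subprocess and an argument list. This is only an illustration under some assumptions: the helper names build_wget_command and fetch are made up for this example, wget is assumed to be on the PATH, and the page is assumed to decode as UTF-8.

```python
import subprocess


def build_wget_command(url):
    """Return the wget argument list for fetching url to stdout.

    Passing a list (instead of one shell string) means the URL is handed
    to wget as a single argument, so characters such as '&' or ';' in it
    are not interpreted by a shell.
    """
    return ["wget", "--content-on-error", "-q", "-O", "-", url]


def fetch(url):
    """Run wget and return (exit_code, body).

    The body is returned even for error responses because of
    --content-on-error; the exit code lets the caller tell the two
    cases apart.
    """
    result = subprocess.run(
        build_wget_command(url),
        stdout=subprocess.PIPE,      # capture the page body
        stderr=subprocess.DEVNULL,   # discard anything wget writes to stderr
    )
    body = result.stdout.decode("utf-8", errors="replace")
    return result.returncode, body
```

Calling fetch("https://www.reverbnation.com/artist_42/bio") should then give you the body of the 404 page together with wget's exit code, which is normally non-zero when the server answers with an error status. As for curl, it prints the response body regardless of the status code unless you pass -f/--fail, so no extra switch is needed there.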


Answered By - Jadi
Answer Checked By - Marilyn (WPSolving Volunteer)