Issue
I have a bunch of crawled endpoint results from a domain, parsed into JSON format. I've successfully processed the JSON to extract its endpoints using sed, and I want to merge all the sed commands into my Python crawl script. Here's the crawl output:
{'Results': [{'Result': {'IsDB': 'True', 'Spend': 367, 'Paths': [{'Technologies': [{'Categories': ['Japan hosting'], 'Name': 'Internet Initiative Japan','Link': 'https://www.iij.ad.jp'},{'Name': 'GlobalSign Domain Verification', 'Link': 'https://support.globalsign.com/customer/portal/articles/2167245-performing-domain-verification---dns-txt-record'}]}]}]}
The reason I use regex instead of jq to parse the JSON is that the JSON is sometimes invalid, and the single quotes ' ' raise further exceptions.
The problem is that I have to use the os module to execute bash commands, and the sed commands themselves contain inner quotes. Here's the implementation:
temp = "testi.txt"
os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}")
os.system(f"sed -i 's/Tag/\\nTag/g' {temp}")
os.system(f"sed -i '/Tag/d' {temp}")
os.system(f"sed -i '/Result/d' {temp}")
os.system(f"sed -i 's/\', \'*$//' {temp}")
os.system(f"sed -i 's/^http:\/\///' {temp}")
os.system(f"sed -i 's/^https:\/\///' {temp}")
os.system(f"sed -i 's/\/.*//' {temp}")
As expected, these three sed commands are the ones that break, resulting in yet another mess:
os.system(f"sed -i 's/http:\/\//\nhttp:\/\//g' {temp}")
os.system(f"sed -i 's/https:\/\//\nhttps:\/\//g' {temp}")
os.system(f"sed -i 's/\', \'*$//' {temp}")
I have tried escaping the newline \n and the inner quotes ', ' with a double backslash, but that didn't work:
's/http:\/\//\\nhttp:\/\//g'
's/https:\/\//\\nhttps:\/\//g'
's/\\', \\'*$//'
Another reason to use sed is that the target file contains multiple lines, so I thought it would be easier to use sed than to read each line in Python. Any help would be appreciated.
Solution
You seem to be looking simply for
import re
urls = re.findall(r'(?<=")https?://[^"]+(?=")', text)
Your sample shows single quotes, but real JSON uses double quotes, so I assumed your sample shows how Python parsed it. If your broken JSON really has single quotes instead of double quotes, swap them around in the code:
urls = re.findall(r"(?<=')https?://[^']+(?=')", text)
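If the input might mix both quote styles, one option is a single pattern that accepts either quote character on both sides. This is only a sketch: the second, double-quoted entry in text and its example.com URL are made up for illustration, and a URL that itself contains a quote character would be cut short.

```python
import re

# Pseudo-JSON with both quote styles. The first entry comes from the
# question's sample; the second is a hypothetical double-quoted entry.
text = (
    "{'Name': 'Internet Initiative Japan', 'Link': 'https://www.iij.ad.jp'}, "
    '{"Name": "Hypothetical entry", "Link": "http://example.com/some/path"}'
)

# Match http:// or https:// URLs that sit between quote characters
# (either ' or "); the lookarounds keep the quotes out of the result.
URL_PATTERN = re.compile(r"""(?<=["'])https?://[^"']+(?=["'])""")

urls = URL_PATTERN.findall(text)
print(urls)
```

Because the pattern has no capturing groups, findall returns the matched URLs themselves as a flat list.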
You'd obviously have the pseudo-JSON as a string in text.
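The last sed steps in the question strip the scheme and everything after the first slash, so I am assuming the end goal is bare hostnames. Under that assumption, a rough sketch could combine the regex with urllib.parse.urlsplit from the standard library. The helper name extract_hosts is hypothetical, and sample is abbreviated from the crawl output in the question.

```python
import re
from urllib.parse import urlsplit


def extract_hosts(text):
    """Return the host part of every quoted http(s) URL found in text."""
    # Accept either quote style around the URL, as in the regex above.
    urls = re.findall(r"""(?<=["'])https?://[^"']+(?=["'])""", text)
    # netloc is the part between the scheme and the path; it would also
    # include a port or user info if the URL carried one.
    return [urlsplit(url).netloc for url in urls]


# Abbreviated from the crawl output shown in the question.
sample = (
    "{'Categories': ['Japan hosting'], 'Name': 'Internet Initiative Japan',"
    "'Link': 'https://www.iij.ad.jp'},{'Name': 'GlobalSign Domain Verification', "
    "'Link': 'https://support.globalsign.com/customer/portal/articles/"
    "2167245-performing-domain-verification---dns-txt-record'}"
)

print(extract_hosts(sample))
```

Since re.findall scans the whole string, line breaks included, the multi-line file from the question can be read in one go, for example with open(temp).read(), and passed to the helper without looping over lines.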
Answered By - tripleee Answer Checked By - Mary Flores (WPSolving Volunteer)