Issue
From a given HTML document I need to extract specific URLs. For example, an <a> element and its href attribute look like this:
<a href="https://hoster.com/some_description-specific_name-more_description.html">
I need to extract only the URLs that include "hoster.com" and "specific_name".
I have used BeautifulSoup on a Raspberry Pi, but I can only do the basic thing, which extracts all URLs from an HTML file:
from bs4 import BeautifulSoup
with open("page.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Solution
You could select your elements more specifically with CSS selectors:
soup.select('a[href*="hoster.com"][href*="specific_name"]')
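As a minimal, self-contained sketch of the selector above: the sample HTML and the name links are made up for illustration, and the link text is a placeholder. Each [href*="..."] part matches when the attribute value contains that substring, and chaining two of them requires both to match.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, only for demonstration
html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">first</a>
<a href="https://hoster.com/other_page.html">second</a>
<a href="https://lobster.com/some_description-specific_name-more_description.html">third</a>
<a>no href at all</a>
'''

soup = BeautifulSoup(html, 'html.parser')

# Both attribute selectors must match the same <a> element;
# tags without an href are never selected, so a['href'] is safe here.
links = [a['href'] for a in soup.select('a[href*="hoster.com"][href*="specific_name"]')]
print(links)
```

Only the first link should be printed, because it is the only one whose href contains both substrings.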
But if multiple patterns have to match, I would recommend keeping them in a list (called pattern here) and checking all of them:
# href=True skips <a> tags without an href, so link['href'] cannot raise a KeyError
for link in soup.find_all('a', href=True):
    if all(s in link['href'] for s in pattern):
        print(link['href'])
Example
from bs4 import BeautifulSoup

html = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">
<a href="https://lobster.com/some_description-specific_name-more_description.html">
<a href="https://hipster.com/some_description-specific_name-more_description.html">
'''
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
pattern = ['hoster.com', 'specific_name']
# href=True skips <a> tags that have no href attribute
for link in soup.find_all('a', href=True):
    if all(s in link['href'] for s in pattern):
        print(link['href'])
Output
https://hoster.com/some_description-specific_name-more_description.html
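Note that a plain substring test also accepts URLs such as https://ghoster.com/... or URLs that merely mention hoster.com in their path. If that matters, one possible variation (not part of the original answer) is to compare the parsed host name exactly. The sketch below assumes a hypothetical helper named matching_links and made-up sample URLs; it uses urllib.parse.urlparse from the standard library.

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

def matching_links(html, host, needle):
    """Return hrefs whose host name equals `host` and whose path contains `needle`.

    Hypothetical helper for illustration; names and parameters are assumptions.
    """
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for link in soup.find_all('a', href=True):
        parts = urlparse(link['href'])
        # hostname is None for relative URLs, so those never match
        if parts.hostname == host and needle in parts.path:
            result.append(link['href'])
    return result

# Made-up sample markup for demonstration
sample = '''
<a href="https://hoster.com/some_description-specific_name-more_description.html">a</a>
<a href="https://ghoster.com/some_description-specific_name-more_description.html">b</a>
<a href="https://example.org/hoster.com/specific_name.html">c</a>
<a href="/relative/specific_name.html">d</a>
'''
found = matching_links(sample, 'hoster.com', 'specific_name')
print(found)
```

This is stricter than the substring check: only the first link is kept, since the other hosts are ghoster.com and example.org and the last href has no host at all. Subdomains such as www.hoster.com would need an extra condition.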
Answered By - HedgeHog