Issue
I have a database table which holds 10,000+ entries with URLs. I am testing to see whether URLs are active or dead using get_headers() PHP function and I have set a task which runs every hour (entries are tested in chuncks of 500 approx.)
It works fine but, it times out because, it has to enquire lots of entries at once.
Is there a better way of doing this and reduce server load at the same time? I also tried CURL but, it has the same timeout issue.
Thanks
Solution
According to this comment on PHP's documentation it may be better to use cURL.
- I would store in the DB a timestamp of the last URL check for each entry.
- Create a cron job that triggers a PHP script every minute or every x minutes.
The PHP cron task:
- I would query the database to get all the links with a timestamp older than one day (or x hours), sorted by older first and a limit of let's say 20 links or more.
- Loop over these links and check each link with cURL:
- Set correct cURL
CURLOPT_CONNECTTIMEOUT
andCURLOPT_TIMEOUT
timeouts so that the check can be done but without exceeding a long time also. Perhaps 5 to 10 seconds? - Check the URL and update the timestamp once it's done.
- See what
getrusage()
returns you to see if you are soon arriving at theini_get('max_execution_time')
value. Or create your own timer withmicrotime()
. But the idea is to break the loop if needed to avoid the script to be suddenly killed.
- Set correct cURL
You could then try to improve speed by using curl_multi_*()
routines. I just didn't find a nice solution where you read the result of each finished handle in order to update the DB. If you have to wait for all requests to finish then why not making some packages of URL requests and run them all. This is what I tried quickly, but by reading the result (which would not be your case):
<?php
define('TIMEOUT', 8);
$urls = [
'https://www.php.net/manual/fr/function.curl-multi-select.php',
'https://www.bag.admin.ch/bag/fr/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/testen.html',
'https://ofsp-coronavirus.ch/tester/',
'https://www.liguepulmonaire.ch/nc/fr/maladies-et-consequences/bpco/test-de-risque-bpco.html',
'https://www.ge.ch/covid-19-se-faire-tester/faire-test-covid-19',
'https://www.gsi.be.ch/fr/start/themen/coronavirus/testen/wo-kann-ich-einen-covid-19-test-machen-.html',
'https://diagnostics.roche.com/ch/fr/products/params/sars-cov-2-rapid-antigen-nasal-test.html',
'https://www.unispital-basel.ch/fr/patients-proches/votre-sejour/massnahmen-coronavirus/test-coronavirus/',
'https://www.insel.ch/fr/patients-et-visiteurs/coronavirus/test-du-covid-dans-linsel-gruppe',
'https://www.bag.admin.ch/bag/en/home/krankheiten/ausbrueche-epidemien-pandemien/aktuelle-ausbrueche-epidemien/novel-cov/testen.html',
'https://www.fr.ch/sante/covid-19/test-covid-19-liste-des-pharmacies-autorisees',
'https://www.fr.ch/document/425951',
'https://www.m3-test.ch/',
'https://www.ehnv.ch/covid-19',
'https://covid-19testing.genepredictis.com/',
'https://www.vd.ch/themes/mobilite/automobile/test-schuhfried/',
'https://www.hirslanden.ch/fr/corporate/campagne/covid-19-test/tests-repetitifs-preventifs-entreprises-ecoles.html',
];
$multi_handle = curl_multi_init();
$handles = [];
foreach ($urls as $i => $url) {
$handles[$i] = curl_init();
curl_setopt_array($handles[$i], [
CURLOPT_URL => $url,
CURLOPT_HEADER => false,
CURLOPT_RETURNTRANSFER => true,
CURLOPT_CONNECTTIMEOUT => TIMEOUT,
CURLOPT_TIMEOUT => TIMEOUT,
]);
curl_multi_add_handle($multi_handle, $handles[$i]);
}
$running = null;
do {
curl_multi_exec($multi_handle, $running);
usleep(1000);
} while ($running > 0);
$results = [];
foreach ($handles as $i => $handle) {
$results[$i] = curl_multi_getcontent($handle);
curl_multi_remove_handle($multi_handle, $handle);
}
curl_multi_close($multi_handle);
var_export($results);
Answered By - Patrick Janser Answer Checked By - Terry (WPSolving Volunteer)