Issue
UPDATE: I've narrowed the problem a bit, so I'm removing unnecessary code and examples:
UPDATE 2: After leaving the cron job running at 12-hour intervals for quite some time (each run ending in success, but writing nothing to BigQuery), we were surprised to discover that, after about a week, one of the cron jobs did successfully write to BigQuery, accompanied by Stackdriver logs stating that "this request caused a new process to be started for your application (...)", as copied below. Subsequent jobs again stopped writing. Now I'm wondering whether this is somehow connected to cached app state (with some expiration period) or to a credential expiration date, either of which might prevent further writes to BigQuery after the first without producing errors.
Problem description:
I'm trying to set up a cron job in App Engine (standard environment) that queries from and writes data back to BigQuery (the datasets are in the same project as the deployed app). The cron jobs execute successfully, but they only write to BigQuery on the first execution after deployment; afterwards they still report success but write nothing.
The main difference I've found is in the Stackdriver logs: executions that correctly write produce extra debug and info messages, while subsequent executions produce none:
2018-04-19 04:44:03.933 CEST
Converted retries value: 3 -> Retry(total=3, connect=None, read=None, redirect=None, status=None) (/base/data/home/apps/e~<redacted>/lib/urllib3/util/retry.py:200)
2018-04-19 04:44:04.154 CEST
Making request: POST https://accounts.google.com/o/oauth2/token (/base/data/home/apps/e~<redacted>/lib/google/auth/transport/requests.py:117)
2018-04-19 04:44:04.160 CEST
Starting new HTTPS connection (1): accounts.google.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.329 CEST
https://accounts.google.com:443 "POST /o/oauth2/token HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.339 CEST
Starting new HTTPS connection (1): www.googleapis.com (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:824)
2018-04-19 04:44:04.802 CEST
https://www.googleapis.com:443 "POST /bigquery/v2/projects/<redacted>/jobs HTTP/1.1" 200 None (/base/data/home/apps/e~<redacted>/lib/urllib3/connectionpool.py:396)
2018-04-19 04:44:04.813 CEST
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time. This request may thus take longer and use more CPU than a typical request for your application.
I've tried:
Adding the BigQuery DataOwner and User permissions to the default App Engine service account, but this had no effect.
There are mentions that the google.cloud library is not fully supported for standard App Engine, so I tried authenticating with OAuth2/httplib2/googleapiclient credentials instead, but this is the first time I've tried that approach and I don't understand how to put the pieces together; without the google.cloud library I also don't know how to write a proper query for BQ.
Other credential setup approaches as suggested below, but connecting to BQ doesn't seem to be the issue: they all connect and write (once), just not on repeated runs within the already-deployed App Engine app.
Below is the full implementation:
app.yaml:
runtime: python27
api_version: 1
threadsafe: true
handlers:
- url: /bigquerycron
script: bigquerycron.app
login: admin
libraries:
- name: ssl
version: latest
env_variables:
GAE_USE_SOCKETS_HTTPLIB : 'true'
bigquerycron.py
from __future__ import absolute_import
from google.cloud import bigquery
import webapp2
class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')

def writeDataTest(dataset_id='<redacted>', table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'
    query_job = client.query(
        """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""",
        job_config=job_config)

writeDataTest()

app = webapp2.WSGIApplication([
    ('/bigquerycron', MainPage),
], debug=True)
cron.yaml:
cron:
- url: /bigquerycron
schedule: every 30 minutes
Solution
Credentials were not the problem in this specific case; the issue is simply the placement of the function call, stemming from a misunderstanding of how App Engine works. The BigQuery function call should be moved inside the MainPage class definition. The fixed bigquerycron.py looks like this (only a single line of code is moved):
from __future__ import absolute_import
from google.cloud import bigquery
import webapp2
class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('CRON test page')
        writeDataTest()

def writeDataTest(dataset_id='<redacted>', table_id='<redacted>'):
    client = bigquery.Client.from_service_account_json("credentials.json")
    job_config = bigquery.QueryJobConfig()
    table_ref = client.dataset(dataset_id).table(table_id)
    job_config.destination = table_ref
    job_config.write_disposition = 'WRITE_APPEND'
    query_job = client.query(
        """SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source""",
        job_config=job_config)

app = webapp2.WSGIApplication([
    ('/bigquerycron', MainPage),
], debug=True)
The version in the OP indeed writes to BigQuery only once, when the App Engine app is loaded for the first time; all subsequent calls just execute the MainPage class, which in this case does nothing, because the actual BigQuery code sits outside of it. This also explains the mysterious write after a week: whenever App Engine started a new instance (the "new process to be started" log line), the module body ran again, firing the module-level call one more time.
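The effect can be illustrated without App Engine at all. In the minimal sketch below (the names write_data and Handler are illustrative stand-ins for the OP's writeDataTest and MainPage), module-level statements run exactly once when the module is imported, which on GAE happens when a new instance starts, while a handler method runs on every request:

```python
# Module-level vs. per-request execution, in plain Python.
# On App Engine, the module body runs once per instance start;
# handler methods run once per incoming request.

calls = []

def write_data():
    # Stand-in for the OP's writeDataTest().
    calls.append('write')

write_data()  # module level: runs once, at import time

class Handler(object):
    def get(self):
        # Per-request: runs every time the URL is hit.
        write_data()

h = Handler()
h.get()  # first simulated request
h.get()  # second simulated request

print(len(calls))  # prints 3: one import-time call plus one per request
```

With the call at module level only (the OP's version), len(calls) would stay at 1 no matter how many requests arrive, which is exactly the observed "writes once per deployment/instance" behavior.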
Additionally, it would be beneficial to rewrite the app without the google-cloud-python library, which is not supported in GAE Standard (https://github.com/GoogleCloudPlatform/google-cloud-python/issues/1893). This is especially unfortunate since even the official BigQuery documentation for Python (https://cloud.google.com/bigquery/docs/) uses this library. There are various workarounds to keep using it, however, including some mentioned in the linked GitHub issue and in "Using gcloud-python in GAE"; a similar workaround was used in this example.
But as mentioned, the best option would be to use a dedicated Google API Client Library for Python: https://developers.google.com/api-client-library/python/
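For reference, a hedged sketch of what the same job could look like through google-api-python-client. The helper name build_append_query_body is illustrative (not from the OP's code), the <redacted> placeholders are kept as in the question, and the commented-out wiring at the bottom assumes the googleapiclient and oauth2client packages plus App Engine default credentials; it is untested against a real project:

```python
def build_append_query_body(project_id, dataset_id, table_id):
    """Request body for the BigQuery REST API's jobs.insert: run a query
    and append its result to a destination table (mirrors the OP's
    QueryJobConfig with WRITE_APPEND)."""
    return {
        'configuration': {
            'query': {
                'query': "SELECT CURRENT_DATETIME() AS Datetime, 'CRON' as Source",
                'useLegacySql': False,
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': table_id,
                },
                'writeDisposition': 'WRITE_APPEND',
            }
        }
    }

# Untested wiring sketch (call this from inside the request handler,
# for the same reason as in the fix above):
# from googleapiclient import discovery
# from oauth2client.client import GoogleCredentials
#
# credentials = GoogleCredentials.get_application_default()
# bq = discovery.build('bigquery', 'v2', credentials=credentials)
# bq.jobs().insert(
#     projectId='<redacted>',
#     body=build_append_query_body('<redacted>', '<redacted>', '<redacted>'),
# ).execute()
```

The body builder is plain data, so it works the same whichever HTTP client ends up sending it; only the credential and transport wiring differs between libraries.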
Answered By - Vaeqal