BigQuery: Monitor Query Costs via INFORMATION_SCHEMA
Cost monitoring in Google BigQuery can be a difficult task, especially within a growing organization and lots of (independent) stakeholders that have access to the data. If your organization is not using reserved slots (flat-rate pricing) but is billed by the number of bytes processed (on-demand pricing), costs can get quickly out of hand, and we need the means to investigate or "debug" the BigQuery usage in order to understand:
- who ran queries with a high cost
- what were the exact queries
- when did those queries run (and are they maybe even running regularly)
Previously, we had to manually set up query logging via Stackdriver as explained in the article
Taking a practical approach to BigQuery cost monitoring
but in late 2019 BigQuery introduced
INFORMATION_SCHEMA
views
as a beta feature that also contain data about BigQuery jobs via the
INFORMATION_SCHEMA.JOBS_BY_*
views
and became generally available (GA) at 2020-06-16
Examples
SELECT
creation_time,
job_id,
project_id,
user_email,
total_bytes_processed,
query
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
SELECT * FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION
Working Example
<script src="https://gist.github.com/paslandau/980f251cbbb2f0dddff9da6425007e3f.js"><script src="https://gist.github.com/paslandau/980f251cbbb2f0dddff9da6425007e3f.js">- this query will select the most interesting fields in terms of cost monitoring from the
INFORMATION_SCHEMA.JOBS_BY_USER
view for all jobs that have been run in region US in the currently selected project - the
cost_in_dollar
is estimate by calculating thetotal_bytes_processed
in Terabyte and multiplying the result with $5.00 (which corresponds to the the cost as of today 2020-06-21). Also, we only take those costs into account if the query was not answered from the cache (see thecache_hit != true
condition) - the
creation_time
is converted to our local timezone - the results are restricted to the past 30 days by using the
WHERE
clause to filter on the partition columncreation_time
- feel free to replace
JOBS_BY_PROJECT
withJOBS_BY_USER
orJOBS_BY_ORGANIZATION
Run on BigQuery
Notes
While playing around with the INFORMATION_SCHEMA
views I've hit a couple of gotchas:
- the different views require different permissions
- the views are regionalized, i.e. we must prefix the region (see
region-us
in the view specification) and must run the job in that region (e.g. from the BigQuery UI viaMore > Query Settings > Processing location
) - it is not possible to mix multiple regions in the query, because a query with processing location
US
can only access resources in locationUS
. Though it would be very helpful for organizations that actively use different locations, something like this is not possible:SELECT * FROM (SELECT * `region-us`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION) UNION ALL (SELECT * `region-eu`.INFORMATION_SCHEMA.JOBS_BY_ORGANIZATION)
- data is currently only kept for the past 180 days
- the
JOBS_BY_USER
view seems to "match" the user based on the email address. My user email adress is a@googlemail.com
address; in the user column it is stored as@gmail.com
. Thus, I get no results when using
Truncated by Planet PHP, read more at the original (another 1126 bytes)