BigQuery: Use expression subqueries for querying nested and repeated fields
BigQuery allows to define nested and repeated fields
in a table. Although this is very powerful, it makes it much more complex to retrieve the
data if one is not used to such structures. Especially beginners tend to use an
UNNEST
statement
on the nested fields, followed by a huge
GROUP BY
statement on the not-originally-repeated fields. Imho, using
expression subqueries
is oftentimes the better approach here.
Code
SELECT
id,
(SELECT value from t.repeated_fields LIMIT 1)
FROM
table t
Caution: When using expression subqueries, you need to make sure that the result is a single value (scalar or array), otherwise you will get the error message
Scalar subquery produced more than one element
In the example code above this is ensured by enforcing one result via LIMIT 1
.
Working Example
<script src="https://gist.github.com/paslandau/03c73ee5eef2ce217af82a8f7edcb125.js"><script src="https://gist.github.com/paslandau/03c73ee5eef2ce217af82a8f7edcb125.js">Run on BigQuery
Links
Use cases
The most prominent use case is probably the BigQuery export schema of Google Analytics. To be honest, I also feel that the schema is not very friendly for newcomers with its ~30 RECORD-type (nested) fields and 300+ columns.
In a nutshell, each row represents one session.
A session consists of multiple hits. Those hits are also available in the nested and repeated hits
field. But wait, there is more...
Each hit can have a number of so called customDimensions
(meta data that can be attached to each hit). So the resulting table structue looks something
like this:
- field_1
- field_2
- hits
- field_1
- field_2
- customDimensions
- index
- value
The following example uses the public Google Analytics sample dataset for BigQuery and shows a couple of sample expression subqueries
SELECT
fullVisitorId,
visitStartTime,
TIMESTAMP_SECONDS(visitStartTime) as started_at,
TIMESTAMP_SECONDS(visitStartTime + CAST( (SELECT time from t.hits ORDER BY hitNumber DESC LIMIT 1) /1000 AS INT64)) as ended_at,
(SELECT COUNT(*) from t.hits) as hit_count,
(SELECT page.hostname || page.pagePath from t.hits WHERE isEntrance = TRUE) as landing_page,
(
SELECT
(SELECT COUNT(*) from h.customDimensions)
FROM
t.hits h
WHERE
hitNumber = 1
) as customDimension_count_of_first_hit,
FROM
`bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
ORDER BY
visitStartTime asc