BigQuery: Use expression subqueries for querying nested and repeated fields

  • You are here: Free PHP » Uncategorized » BigQuery: Use expression subqueries for querying nested and repeated fields

BigQuery allows to define nested and repeated fields in a table. Although this is very powerful, it makes it much more complex to retrieve the data if one is not used to such structures. Especially beginners tend to use an UNNEST statement on the nested fields, followed by a huge GROUP BY statement on the not-originally-repeated fields. Imho, using expression subqueries is oftentimes the better approach here.

Code

SELECT
  id,
  (SELECT value from t.repeated_fields LIMIT 1)
FROM
  table t  

Caution: When using expression subqueries, you need to make sure that the result is a single value (scalar or array), otherwise you will get the error message

Scalar subquery produced more than one element

In the example code above this is ensured by enforcing one result via LIMIT 1.

Working Example

<script src="https://gist.github.com/paslandau/03c73ee5eef2ce217af82a8f7edcb125.js"><script src="https://gist.github.com/paslandau/03c73ee5eef2ce217af82a8f7edcb125.js">

Run on BigQuery

Open in BigQuery Console

BigQuery Console: How to use expression subqueries for nested and repeated fields example

Links

Use cases

The most prominent use case is probably the BigQuery export schema of Google Analytics. To be honest, I also feel that the schema is not very friendly for newcomers with its ~30 RECORD-type (nested) fields and 300+ columns.

In a nutshell, each row represents one session. A session consists of multiple hits. Those hits are also available in the nested and repeated hits field. But wait, there is more... Each hit can have a number of so called customDimensions (meta data that can be attached to each hit). So the resulting table structue looks something like this:

- field_1
- field_2
- hits
  - field_1
  - field_2
  - customDimensions
    - index
    - value 

The following example uses the public Google Analytics sample dataset for BigQuery and shows a couple of sample expression subqueries

SELECT
  fullVisitorId,
  visitStartTime,
  TIMESTAMP_SECONDS(visitStartTime) as started_at,
  TIMESTAMP_SECONDS(visitStartTime + CAST( (SELECT time from t.hits ORDER BY hitNumber DESC LIMIT 1) /1000 AS INT64)) as ended_at,
  (SELECT COUNT(*) from t.hits) as hit_count,
  (SELECT page.hostname || page.pagePath from t.hits WHERE isEntrance = TRUE) as landing_page,
  (
    SELECT
      (SELECT COUNT(*) from h.customDimensions)
    FROM
      t.hits h
    WHERE
      hitNumber = 1
   ) as customDimension_count_of_first_hit,
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170801` t
ORDER BY
  visitStartTime asc
Powered by Gewgley