BigQuery: Calculate the MEDIAN in BigQuery

  • You are here: Free PHP » Uncategorized » BigQuery: Calculate the MEDIAN in BigQuery

There is no MEDIAN() function in Google BigQuery, but we can still calculate the MEDIAN with the PERCENTILE_CONT(x, 0.5) or PERCENTILE_DISC(x, 0.5) functions. The difference between those two functions is the linear interpolation that is applied when using PERCENTILE_CONT(x, 0.5) - so that's probably what you want when dealing with numeric values. Take the numbers 1,2,3,4 for example:

  • PERCENTILE_CONT(x, 0.5) yields 2.5 (as the 50% percentile is exactly between 2 and 3)
  • PERCENTILE_DISC(x, 0.5) yields 2 (as the 50% percentile is >= 2)

Example

SELECT
  PERCENTILE_CONT(x, 0.5) OVER() AS median_cont,
  PERCENTILE_DISC(x, 0.5) OVER() AS median_disc
FROM
  UNNEST([1,2,3,4]) as x
LIMIT 1

Result

median_cont median_disc
2.5 2

Caution: As of today (2020-06-20), BigQuery only supports PERCENTILE_CONT and PERCENTILE_DISC for window functions (hence the OVER() clause and the LIMIT 1 in the example above):

PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later.

Source: SO: percentile functions with GROUPBY in BigQuery

The more common use case is probably to calculate the median as a result of a GROUP BY statement. I.e. I would like to write something like this to get the median of quantity per product_id.

SELECT
  product_id,
  PERCENTILE_CONT(quantity, 0.5) AS median
GROUP BY
  product_id

Right now, that is only possible for the average via AVG() but not for the median. But we can still work around that limitation by using the PERCENTILE_CONT function on a window partitioned by product_id, then group by the product_id (to get only one row per product_id) and resolve a single median value via ANY_VALUE().

Working Example

<script src="https://gist.github.com/paslandau/14940ec0fd34dc30b36377886c308ab3.js"><script src="https://gist.github.com/paslandau/14940ec0fd34dc30b36377886c308ab3.js">

Run on BigQuery

Open in BigQuery UI

BigQuery UI: MEDIAN in BigQuery example

Links

Notes

There is also the APPROX_QUANTILES() function (mentioned here) that can by applied to a GROUP BY. I didn't have a practical use case for approximate functions yet, though. Thus, I don't know the implications of "not using an exact calculation" and rather mention this for the sake of completeness. Example:

SELECT
  product_id,
  APPROX_QUANTILES(quantity, 100)[OFFSET(50)] as approx_median
GROUP BY
  product_id
Powered by Gewgley