BigQuery: Calculate the MEDIAN in BigQuery
There is no MEDIAN()
function in Google BigQuery, but we can still calculate the MEDIAN
with the
PERCENTILE_CONT(x, 0.5)
or
PERCENTILE_DISC(x, 0.5)
functions. The difference between those two functions is the linear interpolation that is applied
when using PERCENTILE_CONT(x, 0.5)
- so that's probably what you want when dealing with numeric
values. Take the numbers 1,2,3,4
for example:
PERCENTILE_CONT(x, 0.5)
yields2.5
(as the 50% percentile is exactly between 2 and 3)PERCENTILE_DISC(x, 0.5)
yields2
(as the 50% percentile is >= 2)
Example
SELECT
PERCENTILE_CONT(x, 0.5) OVER() AS median_cont,
PERCENTILE_DISC(x, 0.5) OVER() AS median_disc
FROM
UNNEST([1,2,3,4]) as x
LIMIT 1
Result
median_cont | median_disc |
---|---|
2.5 | 2 |
Caution: As of today (2020-06-20), BigQuery only supports PERCENTILE_CONT
and PERCENTILE_DISC
for window functions
(hence the OVER()
clause and the LIMIT 1
in the example above):
PERCENTILE_CONT is under development, and we will publish the documentation once it is GA. We will support it as analytic function first, and we plan to support it as aggregate function (allowing GROUP BY) later.
Source: SO: percentile functions with GROUPBY in BigQuery
The more common use case is probably to calculate the median as a result of a GROUP BY
statement.
I.e. I would like to write something like this to get the median of quantity
per product_id
.
SELECT
product_id,
PERCENTILE_CONT(quantity, 0.5) AS median
GROUP BY
product_id
Right now, that is only possible for the average via
AVG()
but not for the median. But we can still work around that limitation by using the
PERCENTILE_CONT
function on a window partitioned by product_id
, then group by the product_id
(to get only one row per product_id
)
and resolve a single median value via ANY_VALUE()
.
Working Example
<script src="https://gist.github.com/paslandau/14940ec0fd34dc30b36377886c308ab3.js"><script src="https://gist.github.com/paslandau/14940ec0fd34dc30b36377886c308ab3.js">Run on BigQuery
Links
Notes
There is also the APPROX_QUANTILES()
function (mentioned here) that can by applied to a GROUP BY
. I didn't have a practical use case for approximate functions
yet, though. Thus, I don't know the implications of "not using an exact calculation" and rather mention
this for the sake of completeness. Example:
SELECT
product_id,
APPROX_QUANTILES(quantity, 100)[OFFSET(50)] as approx_median
GROUP BY
product_id