I discovered SQL max_by() and min_by() functions and now I can not imagine my work without them
I consider my knowledge of SQL to be advanced (for an analyst). I rarely search for code examples and problem solutions on the internet when writing queries. However, this makes me experience some sort of a curse of knowledge - when I rely on what I already know, I do not learn anything new.
So recently I looked at the list of available aggregate functions described in the Presto documentation. I found two functions that sparked my interest: min_by()
and max_by()
. Actually, I’ve seen them there many times before, but that was the first time when I was not only looking at them but also writing a query in which they were potentially useful. And when such two conditions align, then the magic happens.
The task that I had was to write a query that aggregates all historical data about transactions and returns one record for each transaction that informs about:
- the initial value,
- the most recent value,
- the most recent status,
- and when the last update happened.
The source table contained a stream of events with all the historical updates for each transaction.
transaction_id | status | value | event_created_at |
---|---|---|---|
1 | pending | 100 | 2022-06-01 |
1 | pending |
50 | 2022-06-05 |
2 | pending | 20 | 2022-06-06 |
2 | confirmed | 20 | 2022-06-07 |
1 | canceled | 0 | 2022-06-08 |
As you can see, the value and the status are changing over time, and I needed the first row and the last row for each transaction. My initial idea was to use the usual CTE (this is this temporary result you create using the WITH statement) and some JOINs as follows:
WITH base AS (
SELECT
transaction_id
, min(event_created_at) as first_event
, max(event_created_at) as last_event
FROM table
GROUP BY 1
)
SELECT
t1.transaction_id
, t2.status AS last_status
, t1.value AS initial_value
, t2.value AS last_value
, t2.event_created_at as last_update_at
FROM table AS t1
INNER JOIN table AS t2
ON t1.transaction_id = t2.transaction_id
INNER JOIN base AS b
ON t1.transaction_id = b.transaction_id
AND t1.event_created_at = b.first_event
AND t2.event_created_at = b.last_event
It works, however, it takes some time to understand what is happening here. And this is only a simplified version of the data and query I actually worked with. When a bug appeared, I struggled to debug it to find the issue. And that was the moment I decided to refactor the query. So, with the help of min_by()
and max_by()
, I replaced the CTE and JOINs with only one GROUP BY.
According to the documentation:
max_by(x, y) - Returns the value of x associated with the maximum value of y over all input values.
min_by(x, y) - Returns the value of x associated with the minimum value of y over all input values.
So here they are in action:
SELECT
transaction_id
, max_by(status, event_created_at) AS last_status
, min_by(value, event_created_at) AS initial_value
, max_by(value, event_created_at) AS last_value
, max(event_created_at) as last_update_at
FROM table
GROUP BY 1
how cool is that?! It is so simple to understand what the query does now.
There are many more potential applications of those functions in analytics, for instance, finding a day of the month with most/least visitors on the website. I hope you find them usefull as I did!