I discovered SQL max_by() and min_by() functions and now I can not imagine my work without them

I consider my knowledge of SQL to be advanced (for an analyst). I rarely search for code examples and problem solutions on the internet when writing queries. However, this makes me experience some sort of a curse of knowledge - when I rely on what I already know, I do not learn anything new.

So recently I looked at the list of available aggregate functions described in the Presto documentation. I found two functions that sparked my interest: min_by() and max_by(). Actually, I’ve seen them there many times before, but that was the first time when I was not only looking at them but also writing a query in which they were potentially useful. And when such two conditions align, then the magic happens.

The task that I had was to write a query that aggregates all historical data about transactions and returns one record for each transaction that informs about:

  • the initial value,
  • the most recent value,
  • the most recent status,
  • and when the last update happened.

The source table contained a stream of events with all the historical updates for each transaction.

transaction_id status value event_created_at
1 pending 100 2022-06-01
1 pending
50 2022-06-05
2 pending 20 2022-06-06
2 confirmed 20 2022-06-07
1 canceled 0 2022-06-08

As you can see, the value and the status are changing over time, and I needed the first row and the last row for each transaction. My initial idea was to use the usual CTE (this is this temporary result you create using the WITH statement) and some JOINs as follows:

WITH base AS (
  SELECT
    transaction_id
    , min(event_created_at) as first_event
    , max(event_created_at) as last_event
  FROM table
  GROUP BY 1
)
SELECT
  t1.transaction_id
  , t2.status AS last_status
  , t1.value AS initial_value
  , t2.value AS last_value
  , t2.event_created_at as last_update_at
FROM table AS t1
INNER JOIN table AS t2
  ON t1.transaction_id = t2.transaction_id
INNER JOIN base AS b
  ON t1.transaction_id = b.transaction_id
    AND t1.event_created_at = b.first_event
    AND t2.event_created_at = b.last_event

It works, however, it takes some time to understand what is happening here. And this is only a simplified version of the data and query I actually worked with. When a bug appeared, I struggled to debug it to find the issue. And that was the moment I decided to refactor the query. So, with the help of min_by() and max_by(), I replaced the CTE and JOINs with only one GROUP BY.

According to the documentation:

max_by(x, y) - Returns the value of x associated with the maximum value of y over all input values.

min_by(x, y) - Returns the value of x associated with the minimum value of y over all input values.

So here they are in action:

SELECT
  transaction_id
  , max_by(status, event_created_at) AS last_status
  , min_by(value, event_created_at) AS initial_value
  , max_by(value, event_created_at) AS last_value
  , max(event_created_at) as last_update_at
FROM table
GROUP BY 1

how cool is that?! It is so simple to understand what the query does now.

There are many more potential applications of those functions in analytics, for instance, finding a day of the month with most/least visitors on the website. I hope you find them usefull as I did!

Written on June 27, 2022
Tags: #sql