Programming Rants: Simpler Way to find rows with Largest/Smallest value for each group in SQL

Sometimes we need to find largest or smallest value for each column and also associated column. Recently I found out that DISTINCT ON is quite good, than my previous way (LEFT JOIN less IS NULL, or WHERE IN SELECT MAX/MIN), see this example:

CREATE TABLE test1 ( id bigserial primary key, val int, name text );

INSERT INTO test1(val,name) VALUES(6,'a'),(9,'a'),(5,'b'),(11,'b'),(2,'c'),(8,'c');

SELECT * FROM test1;

id | val | name
----+-----+------
1 | 6 | a
2 | 9 | a
3 | 5 | b
4 | 11 | b
5 | 2 | c
6 | 8 | c

To find which id for each name that has greatest value, we usually do something like this (WHERE IN SELECT MAX):

SELECT *
FROM test1 x1
WHERE (name, val) IN (
SELECT name, MAX(val)
FROM test1
GROUP BY 1
)
ORDER BY x1.name;

id | val | name
----+-----+------
2 | 9 | a
4 | 11 | b
6 | 8 | c

Sort (cost=73.95..74.64 rows=275 width=44)
Sort Key: test1.name
-> Hash Join (cost=33.50..62.81 rows=275 width=44)
Hash Cond: ((test1.name = test1_1.name) AND (test1.val = (max(test1_1.val))))
-> Seq Scan on test1 (cost=0.00..21.00 rows=1100 width=44)
-> Hash (cost=30.50..30.50 rows=200 width=36)
-> HashAggregate (cost=26.50..28.50 rows=200 width=36)
Group Key: test1_1.name
-> Seq Scan on test1 test1_1 (cost=0.00..21.00 rows=1100 width=36)

Or this (LEFT JOIN less IS NULL):

SELECT x1.*
FROM test1 x1
LEFT JOIN test1 x2
ON x1.name = x2.name
AND x1.val < x2.val
WHERE x2.id IS NULL
GROUP BY x1.id, x1.name
ORDER BY x1.name;

id | val | name
----+-----+------
2 | 9 | a
4 | 11 | b
6 | 8 | c

Group (cost=264.68..264.75 rows=10 width=44)
Group Key: x1.name, x1.id
-> Sort (cost=264.68..264.70 rows=10 width=44)
Sort Key: x1.name, x1.id
-> Merge Left Join (cost=153.14..264.51 rows=10 width=44)
Merge Cond: (x1.name = x2.name)
Join Filter: (x1.val < x2.val)
Filter: (x2.id IS NULL)
-> Sort (cost=76.57..79.32 rows=1100 width=44)
Sort Key: x1.name
-> Seq Scan on test1 x1 (cost=0.00..21.00 rows=1100 width=44)
-> Sort (cost=76.57..79.32 rows=1100 width=44)
Sort Key: x2.name
-> Seq Scan on test1 x2 (cost=0.00..21.00 rows=1100 width=44)

There some other way, that I found simpler (and faster in real life) also return only one row if there's duplicate value, that is DISTINCT ON!

SELECT DISTINCT ON (x1.name) *
FROM test1 x1
ORDER BY x1.name, x1.val DESC;

id | val | name
----+-----+------
2 | 9 | a
4 | 11 | b
6 | 8 | c

Unique (cost=76.57..82.07 rows=200 width=44)
-> Sort (cost=76.57..79.32 rows=1100 width=44)
Sort Key: name, val
-> Seq Scan on test1 x1 (cost=0.00..21.00 rows=1100 width=44)

So in this case, we look for highest value, distinct by name. Well, that's all for now.

Programming Rants

2016-02-10

Simpler Way to find rows with Largest/Smallest value for each group in SQL

No comments:

Post a Comment