My introduction to databases and PostgreSQL was for web application development and statistical analysis. I learned just enough SQL to get the queries to return the right answers. Because of my work with PostGIS (and FOSS4G) I became friends with Paul Ramsey. We are now co-workers at Crunchy Data and he is helping me up my SQL-fu. One of the first lessons he taught me was "Try to use joins rather than subqueries."
Today's post is going to work through this advice, as Paul and I work through some SQL.
What We Are Trying to Answer
I am trying to create some new teaching and speaking materials on using SQL in data science and I was working on some pre-analysis data manipulation. I had a table, fire_weather, which is a subset of the weather table, and I want to find all the entries in weather that are NOT in fire_weather. My initial instinct was to write a subquery but this seemed like a straightforward and easy query to follow Paul's "use a join" advice.
Wrong Join Query
I started with this join query:
select count(weather.*) from weather, fire_weather where weather.id != fire_weather.id;
And it didn't work (otherwise I wouldn't be writing this blog post). It turns out that this does a cross join where we end up with all the pairwise combinations of all rows in both tables.
The Proper Join Query
So at this point I slack-up (as opposed to ring up on the phone) Paul and we start discussing how to do the proper join. It turns out the right syntax is:
select count(weather.*) from weather left join fire_weather on weather.id = fire_weather.id where fire_weather.id is null;
Basically you do a left outer join, giving you all the rows from the weather table and only the fire_weather entries that match. Then you filter out all the records where there are matches for fire_weather. Pretty simple to understand but not very set like, as in using set theory (which is the basis of relations in relational database systems). Then again, we now have a working join query.
As part of my journey to greater understanding of SQL in PostgreSQL, I have become a big fan of EXPLAIN ANALYZE for for timings and looking at the query plan. Just out of curiosity I decide to look at the timing and query plan for the join query. Here is the output and it took about 7 milliseconds with a somewhat complicated query plan:
Aggregate (cost=1358.58..1358.59 rows=1 width=8) (actual time=6.780..6.781 rows=1 loops=1)
-> Hash Anti Join (cost=1168.76..1358.58 rows=1 width=64) (actual time=1.648..6.218 rows=6802 loops=1)
Hash Cond: (weather.id = fire_weather.id)
-> Seq Scan on weather (cost=0.00..159.05 rows=8205 width=68) (actual time=0.008..3.157 rows=8205 loops=1)
-> Hash (cost=785.56..785.56 rows=30656 width=4) (actual time=1.609..1.609 rows=1403 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 306kB
-> Seq Scan on fire_weather (cost=0.00..785.56 rows=30656 width=4) (actual time=0.008..1.399 rows=1403 loops=1)
Planning Time: 0.110 ms
Execution Time: 6.807 ms
Now the Subquery
And now I wanted to see how my original idea for a subquery would perform. Here is the subquery way to answer the same question:
select count(weather.*) from weather where id not in (select id from fire_weather);
You should see why this query appealed to me, it's very set based and very simple to write. When I look at this query with explain analyze I get:
Aggregate (cost=1052.02..1052.03 rows=1 width=8) (actual time=7.458..7.459 rows=1 loops=1)
-> Seq Scan on weather (cost=862.20..1041.76 rows=4102 width=64) (actual time=1.139..6.727 rows=6802 loops=1)
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 1403
-> Seq Scan on fire_weather (cost=0.00..785.56 rows=30656 width=4) (actual time=0.004..0.643 rows=1403 loops=1)
Planning Time: 0.068 ms
Execution Time: 7.497 ms
So we end up with a very simple plan and timings that are about about the same as the join. Paul and I discussed why the timings might be so similar and we came up with at least two reasons:
- The dataset has very few rows (8k) so the subquery performance might degrade with a larger data set.
- My machine has NVMe disk drives giving sequential access an even bigger performance difference.
Another Set Based Subquery
Finally Paul, came up with one more set based query to answer the same question:
select * from weather except select weather.* from weather join fire_weather on weather.id = fire_weather.id
This one uses a new SQL clause, EXCEPT, which is part of the set operation query combiners. One big restraint on these queries is that the queries on each side of the except clause must return the same columns and datatypes. This explains why this query can't return the total row count. But this query turned out to be worse in performance and a much more complicated query plan:
HashSetOp Except (cost=0.00..1984.13 rows=8205 width=74) (actual time=10.382..11.171 rows=6802 loops=1)
-> Append (cost=0.00..1532.86 rows=16410 width=74) (actual time=0.021..5.867 rows=9608 loops=1)
" -> Subquery Scan on ""*SELECT* 1"" (cost=0.00..241.10 rows=8205 width=44) (actual time=0.021..2.300 rows=8205 loops=1)"
-> Seq Scan on weather (cost=0.00..159.05 rows=8205 width=40) (actual time=0.006..0.537 rows=8205 loops=1)
" -> Subquery Scan on ""*SELECT* 2"" (cost=261.61..1209.71 rows=8205 width=44) (actual time=1.796..2.788 rows=1403 loops=1)"
-> Hash Join (cost=261.61..1127.66 rows=8205 width=40) (actual time=1.795..2.583 rows=1403 loops=1)
Hash Cond: (fire_weather.id = weather_1.id)
-> Seq Scan on fire_weather (cost=0.00..785.56 rows=30656 width=4) (actual time=0.008..0.413 rows=1403 loops=1)
-> Hash (cost=159.05..159.05 rows=8205 width=40) (actual time=1.768..1.768 rows=8205 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 765kB
-> Seq Scan on weather weather_1 (cost=0.00..159.05 rows=8205 width=40) (actual time=0.002..0.518 rows=8205 loops=1)
Planning Time: 0.127 ms
Execution Time: 11.975 ms
The Surprise Twist
Then I thought some more about the query Paul suggested and realized that we didn't really need the join on the right hand side of the except clause. Since fire_weather contains all the same columns as weather we can just use the columns we want and get the response we expected.
select id from weather except select id from fire_weather;
Now this has nice set syntax making it really easy to understand. If we wanted to actually get the count like in the other queries we can wrap our query in a CTE.
with count_me as (select id from weather except select id from fire_weather) select count(*) from count_me;
With this golden ticket we get 6 ms query times and a query plans that is cleaner but not simplest. I should note that cleanliness and simplicity are not key factors in evaluating a query plan.
HashSetOp Except (cost=0.00..1624.68 rows=8205 width=8) (actual time=4.834..5.313 rows=6802 loops=1)
-> Append (cost=0.00..1527.52 rows=38861 width=8) (actual time=0.004..3.248 rows=9608 loops=1)
-> Subquery Scan on "*SELECT* 1" (cost=0.00..241.10 rows=8205 width=8) (actual time=0.004..1.903 rows=8205 loops=1)
-> Seq Scan on weather (cost=0.00..159.05 rows=8205 width=4) (actual time=0.003..1.001 rows=8205 loops=1)
-> Subquery Scan on "*SELECT* 2" (cost=0.00..1092.12 rows=30656 width=8) (actual time=0.003..0.554 rows=1403 loops=1)
-> Seq Scan on fire_weather (cost=0.00..785.56 rows=30656 width=4) (actual time=0.002..0.409 rows=1403 loops=1)
Planning Time: 0.246 ms
Execution Time: 5.897 ms
My Final Takeaways
Here are the final lessons I would like to leave you with from this little exercise.
- Never eyeball query times - these were all the same speed to my eye. Explain analyze is your friend.
- Write the query in the way that makes the most sense and then do timings. If it's not good then look to an alternative (probably joins)
- There are multiple ways to arrive at the same answer in SQL - the "right" answer is going to be highly situational dependent. A few things that will influence the result:
- Your data size - a query might stop being "ok" as your data size grows
- Your disk speed
- Memory size
- Processor speed
- How often you plan to execute the query
- Finally, time spent improving your SQL knowledge and skills will pay off handsomely. Even if you don't write the most efficient queries, they are still usually faster than writing a lot of procedural code. As I learn more and more SQL patterns the more amazed I am at all the code I can replace with a few lines of SQL (and I usually get a huge performance boost).
And with that list, we wrap up this little blog post. I hope you found the journey and insights interesting and helpful. I would love to hear your experience working with joins versus subselects. You can reach out in the comments below or on Twitter to the Crunchy Data account or my account. Happy coding!