Blog | Data Data

Won’t You Be My Neighbor? Quickly Finding Who is Nearby

Jonathan S. Katz
PostgreSQL Performance Indexing

Many applications these days want us to know how close we are to things:

  • What are the three closest coffee shops to my current location?
  • Which is the nearest airport to the office?
  • What are the two closest subway stops to the restaurant?

and countless more examples.

Another way of asking these questions is to say “who are my nearest neighbors to me?” This maps to a classic algorithmic problem: efficiently finding the K-nearest neighbors (or K-NN), where K is a constant. For example, the first question would be a 3-NN problem as we are trying to find the 3 closest coffee shops.

(If you are interested in learning more about K-NN problems in general, I highly recommend looking at how you can solve this using n-dimensional Voronoi diagrams, a wonderful data structure developed in the field of computational geometry.)

How can we use PostgreSQL to help us quickly find our closest neighbors? Let’s explore.

Building a Query

In an earlier article about “Why Covering Indexes are Incredibly Helpful,” we used a data set that kept track of when people visited coffee shops located in the New York City area. Recall that our table looks like this:

CREATE TABLE visits (
    id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
    visitor UUID NOT NULL,
    visited_at timestamptz NOT NULL,
    geocode point NOT NULL
);

As mentioned in this article, because proper benchmarking is a very large topic, when you see query times, please use them as directional guides for the purposes of this demonstration.

Let’s say we want to figure out the three friends that were closest to our location at a given date and time. To build out this query, we need to know:

  • The date/time range for the query
  • The distance between our location and that of our friends

PostgreSQL defines a distance operator for geometric types that looks like this “<->” that, in the case of points, calculates the 2-Dimensional Euclidean distance. For example:

SELECT POINT(0,0) <-> POINT(1,1);

    ?column?
-----------------
 1.4142135623731

For completeness sake, the smaller the number, the closer the two points are to each other, which is important for the next step of the process.

If we want to find the three friends who were closest to us on October 1, 2012 between 7:00am and 9:00am, we could construct a query like this:

SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;

The “K-nearest neighbor” portion of the query can be seen in ORDER BY POINT(40.7127263,-74.0066592) <-> geocode LIMIT 3 which is another of saying “order by the shortest distance between my current location and all the other recorded locations, but find the 3 closest locations to me.”

How does this perform? Let’s pretend that we still have the covering indexes in place from the previous article:

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;
Limit (cost=53755.18..53755.54 rows=3 width=48) (actual time=120.890..128.781 rows=3 loops=1)
-> Gather Merge (cost=53755.18..53794.45 rows=328 width=48) (actual time=120.889..128.778 rows=3 loops=1)
Workers Planned: 4
Workers Launched: 4
-> Sort (cost=52755.12..52755.32 rows=82 width=48) (actual time=115.623..115.625 rows=2 loops=5)
Sort Key: (('(40.7127263,-74.0066592)'::point <-> geocode))
Sort Method: top-N heapsort Memory: 25kB
Worker 0: Sort Method: top-N heapsort Memory: 25kB
Worker 1: Sort Method: top-N heapsort Memory: 25kB
Worker 2: Sort Method: top-N heapsort Memory: 25kB
Worker 3: Sort Method: quicksort Memory: 25kB
-> Parallel Seq Scan on visits (cost=0.00..52754.06 rows=82 width=48) (actual time=65.256..115.476 rows=50 loops=5)
Filter: ((visited_at >= '2012-10-01 07:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-01 09:00:00-04'::timestamp with time zone))
Rows Removed by Filter: 805600
Planning Time: 0.112 ms
Execution Time: 128.808 ms

Looking at this query plan, PostgreSQL determined that none of the indexes could be used, and does a full (parallelized) sequential scan on the visits table in order to find the 3 closest people. This is not very efficient, but we could possibly do better.

KNN-GiST: Efficiently Find Neighbors

PostgreSQL 9.1 introduced the KNN-GiST index as a way to accelerate searching for neighbors. It has been implemented on several data types, including points and trigrams, and is also leveraged by the PostGIS geospatial extension.

You can use a KNN-GiST index simply by creating a GiST index on a supported data type, which in this case, is the geocode column:

CREATE INDEX visits_geocode_gist_idx ON visits USING gist(geocode);
VACUUM ANALYZE visits;

To demonstrate its power, let’s see what happens if we try to find the 3 nearest points to a given location:

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;
Limit (cost=0.41..0.73 rows=3 width=48) (actual time=0.237..0.244 rows=3 loops=1)
-> Index Scan using visits_geocode_gist_idx on visits (cost=0.41..423200.97 rows=4028228 width=48) (actual time=0.236..0.242 rows=3 loops=1)
Order By: (geocode <-> '(40.7127263,-74.0066592)'::point)
Planning Time: 0.089 ms
Execution Time: 0.265 ms

Wow! Even though it’s not an apples-to-oranges comparison to the query before, we can already see that the KNN-GiST index on our point type lets us find things that are close to us incredibly quickly! Now let’s try running our original query to find who is closest to us on October 1, 2012 between 7am and 9am and see if this index speeds up:

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;
Limit (cost=0.41..4012.19 rows=3 width=48) (actual time=184.327..184.332 rows=3 loops=1)
-> Index Scan using visits_geocode_gist_idx on visits (cost=0.41..433272.35 rows=324 width=48) (actual time=184.326..184.330 rows=3 loops=1)
Order By: (geocode <-> '(40.7127263,-74.0066592)'::point)
Filter: ((visited_at >= '2012-10-01 07:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-01 09:00:00-04'::timestamp with time zone))
Rows Removed by Filter: 499207
Planning Time: 0.140 ms
Execution Time: 184.361 ms

No luck: in this case, because we need also need to filter out our data for the given date/time range, PostgreSQL is unable to take advantage of the KNN-GiST index.

btree_gist to the Rescue

One possible solution is to create a multicolumn index on (visited_at, geocode), but in order to do that, we need set up the “btree_gist” extension. In short, the btree_gist extension allows you to use the standard B-tree quality operators with a GiST index. You can install this extension by executing the following:

CREATE EXTENSION btree_gist;

Before we try creating the multicolumn index, first, let’s drop the previous index:

DROP INDEX visits_geocode_gist_idx;

Now, let’s create the multicolumn index on (visited_at,geocode):

CREATE INDEX visits_visited_at_geocode_gist_idx ON visits USING gist(visited_at, geocode);
VACUUM ANALYZE visits;

What happens to the execution time?

EXPLAIN ANALYZE SELECT visitor, visited_at, geocode
FROM visits
WHERE
    visited_at BETWEEN '2012-10-01 07:00' AND '2012-10-01 09:00'
ORDER BY POINT(40.7127263,-74.0066592) <-> geocode
LIMIT 3;
Limit (cost=0.41..12.64 rows=3 width=48) (actual time=0.047..0.049 rows=3 loops=1)
-> Index Scan using visits_visited_at_geocode_gist_idx on visits (cost=0.41..1348.69 rows=331 width=48) (actual time=0.046..0.048 rows=3 loops=1)
Index Cond: ((visited_at >= '2012-10-01 07:00:00-04'::timestamp with time zone) AND (visited_at <= '2012-10-01 09:00:00-04'::timestamp with time zone))
Order By: (geocode <-> '(40.7127263,-74.0066592)'::point)
Planning Time: 0.133 ms
Execution Time: 0.068 ms

Excellent! It seems we are taking advantage of the KNN-GiST index now.

One question you may have: why not just use a B-tree index on visited_at? That solution could work if you know you will eliminate most rows before performing the sort. However, if you begin to look over a larger date/time range or there are directionally “a lot” of rows within the date/time range, you will see a significant performance boost by using this multicolumn KNN-GiST index.

KNN-GiST Indexes: A Hidden Gem for Finding Friends

One of the wonderful things about PostgreSQL is that it is chock full of hidden gems that can significantly help accelerate your workloads. When used appropriately, the KNN-GiST index can significantly speed up your queries where you need to find things that are nearby, which is a common need for applications. They are not without cost: KNN-GiST indexes are larger than regular GiST indexes, but the speedup KNN-GiST indexes provide is significantly (if not orders of magnitude) larger than the additional space and should be in your toolbox for any location-aware application you build.

 

0 replies