Blog | Crunchy Data

Quick and Dirty Address Matching with LibPostal

Paul Ramsey
PostgreSQL postgis

Most businesses have databases of previous customers, and data analysts will frequently be asked to join arbitrary data to the customer tables in order to provide analysis.

Unfortunately joining address data together is notoriously difficult:

  • The same address can be expressed in many ways
  • The parts of addresses are not always clear
  • There are valid lexically very similar addresses very nearby any given address

One way of doing quick and dirty categorization of address data to find duplicates or join tables is to apply "address normalization" to the address data.

Address normalization attempts to convert an arbitrary input address string into a finite number of standard forms that the input might intend to represent.

It's not hard to imagine writing your own normalizer in python or another scripting language: convert standard abbreviations to fully spelled out words, deal with equivalencies like "first" and "1st" and "1", strip out lexical quirks like capitalization and diacritics. But what if you had to do it for a second language and region? And then another? You might be bi-lingual, but are you tri-lingual? Quad?

Enter libpostal, a natural language processing library trained on over 1B address records in the international OpenStreetMap database. For use inside PostgreSQL, you can use the pgsql-postal extension to call the library directly from SQL.

libpostal only has two operations, "address normalization" and "address parsing", that are exposed by pgsql-postal with the postal_normalize() and postal_parse() functions.

Normalization takes an address string and converts it to all the standard forms that "make sense". For example:

SELECT unnest(postal_normalize('390 Greenwich St., New york, ny, 10013'));
390 greenwich saint new york ny 10013
390 greenwich saint new york new york 10013
390 greenwich street new york ny 10013
390 greenwich street new york new york 10013

Parsing takes apart a string into address components and returns a JSONB of those components:

SELECT jsonb_pretty(postal_parse('390 Greenwich St., New york, ny, 10013'));
{                           
"city": "new york",     
"road": "greenwich st.",
"state": "ny",          
"postcode": "10013",    
"house_number": "390"   
}

We can use normalization to create a table of text searchable address strings, and then use full text search to efficiently search that table for potential matches for new addresses.

International Normalization

As we saw above, normalization takes raw address strings and turns them into "possible standard forms", which are suitable for searching against. They aren't necessarily the best forms, more regionally-aware parsing can do a better job of standard North American parsing and formatting, but where libpostal shines is being a ready-to-run fully international solution that doesn't even need to be told what language it is working on.

For example, this address in Berlin:

SELECT unnest(postal_normalize('Potsdamer Straße 3, 10785 Berlin, Germany'));
potsdamer strasse 3 10785 berlin germany

The diacritics are stripped out, but otherwise it is not heavily changed. The same address with an abbreviation:

SELECT unnest(postal_normalize('POTSDAMER STR. 3, 10785 BERLIN'));
potsdamer strasse 3 10785 berlin

The library recognizes based on nothing but context that it is working in German, and that 'STR' is short for 'strasse', not 'street'.

Build and Install LibPostal

  • Download or clone the libpostal
  • Build and install the library
    cd libpostal 
    ./bootstrap.sh 
    sudo mkdir /opt/libpostal 
    ./configure --datadir=/opt/libpostal 
    make 
    sudo make install 
  • Download or clone the pgsql-postal extension
  • Build and install the extension
    cd pgsql-postal 
    make 
    sudo make install 
  • Create a working database
    createdb postal 
  • Create the extensions
    psql -d postal -c 'create extension postal' psql -d postal -c 'create extension fuzzystrmatch' 

Load and Prepare the Addresses

For this example, we will load a CSV file of New York addresses.

Now we need a table to load the addresses into:

DROP TABLE IF EXISTS addresses; 
CREATE TABLE addresses (
    pk serial primary key,
    lon double precision, 
    lat double precision, 
    number varchar, 
    street varchar, 
    unit varchar, 
    city varchar, 
    district varchar, 
    region varchar, 
    postcode varchar, 
    id double precision, 
    hash varchar
);

Copy the New York data into the empty table:

COPY addresses (
    lon,
    lat,
    number,
    street,
    unit,
    city,
    district,
    region,
    postcode,
    id,
    hash
)
FROM 'us/ny/city_of_new_york.csv' WITH (format csv, header true);

Create the Normalized Address Table

The New York address table has addresses already split out into components, but we want to feed them to the normalizer as one string, so we actually stick them back together again.

SELECT concat_ws(', ', concat_ws(' ', unit, number, street), 'NEW YORK', postcode)
FROM addresses
LIMIT 5;
1458 36 ST, NEW YORK, 11218
927 80 ST, NEW YORK, 11228<
2116 73 ST, NEW YORK, 11204
2168 73 ST, NEW YORK, 11204
105 E  32 ST, NEW YORK, 11226

We are also going to want to search the normalized address using the PostgreSQL full-text search engine, so while creating the normalized form we'll also build a tsvector of the normalized text.

DROP TABLE IF EXISTS addresses_normalized; 
CREATE TABLE addresses_normalized AS 
SELECT pk, 
na, -- normalized address 
to_tsvector('simple', na) AS ts -- tsearch address 
FROM ( 
SELECT pk, 
unnest(postal_normalize(concat_ws(', ', concat_ws(' ', unit, number, street), 'NEW YORK', postcode))) AS na 
FROM addresses ) 
AS subq;

Now we have a search table that we can join back to the original addresses via the primary key, pk.

SELECT * FROM addresses_normalized LIMIT 5;
   
  pk   |                na                 |                            ts             
  --------+-----------------------------------+-----------------------------------------------------------
817072 | 410 liberty avenue new york 11207 | '11207':6 '410':1 'avenue':3 'liberty':2 'new':4 'york':5
817073 | 5006 snyder avenue new york 11203 | '11203':6 '5006':1 'avenue':3 'new':4 'snyder':2 'york':5
817074 | 586 liberty avenue new york 11207 | '11207':6 '586':1 'avenue':3 'liberty':2 'new':4 'york':5
817075 | 45 hewitt avenue new york 10301   | '10301':6 '45':1 'avenue':3 'hewitt':2 'new':4 'york':5
817076 | 88 coventry road new york 10304   | '10304':6 '88':1 'coventry':2 'new':4 'road':3 'york':5

Search Using the Normalized Address Table

To search the addresses with a single input, we need to keep a few things in mind:

  • A single search term can generate multiple normalized forms that we will want to search with
  • We will need to turn those normalized forms into text search queries
  • Order matters in addresses, so we need to run a phrase search

We start the query using WITH to create a set of text search queries to join to the search table, then join them and add the levenshtein distance to get a feel for how lexically close each potential match is.

WITH normalized AS (
    SELECT unnest(postal_normalize('388 Greenwich st, ny')) AS na
), tsqueries as (
    SELECT phraseto_tsquery('simple', na) AS ts, na
    FROM normalized
)
SELECT DISTINCT ON (pk)
    addresses.*,
    addresses_normalized.na AS addresses_normalized_na,
    tsqueries.na AS tsqueries_na,
    levenshtein(tsqueries.na, addresses_normalized.na) AS levenshtein
FROM addresses
JOIN addresses_normalized USING (pk)
JOIN tsqueries ON (addresses_normalized.ts @@ tsqueries.ts)
ORDER BY pk;

With 1M New York addresses and 1.7M normalized candidates, the search takes about 90ms or less.

Join Using the Normalized Address Table

We can use the normalized address table and the normalizer to do a bulk join of another address table to the master table the New York addresses.

  • Download the New York open data Health Centers data
  • Create a table for the data
    CREATE TABLE health_centers (
        name text,
        address text,
        postal text,
        telephone text,
        lat double precision,
        lon double precision,
        board text,
        council text,
        censustract text,
        bin text,
        bbl text,
        nta text,
        borough text
    );
  • Load the data
    COPY health_centers FROM 'rows.csv' WITH (format csv, header true);

    The health center data has quite different form for addresses from the base New York address table, but we can still get a lot of joins.

    WITH queries AS (
        SELECT
            name as health_name,
            address as health_address,
            phraseto_tsquery('simple', unnest(postal_normalize(concat_ws(', ', address, 'new york')))) AS tsq
        FROM health_centers
    )
    SELECT
        addresses.*,
        addresses_normalized.*,
        queries.health_name,
        queries.health_address
    FROM addresses
    JOIN addresses_normalized using (pk)
    JOIN queries ON (addresses_normalized.ts @@ queries.tsq);

    Here's one example of a matched record:

    Note the difference between the ’street’ and ’number’ from the address table to the ‘health_address’ from the health center table. We are able to match lexically quite dissimilar data.
    pk             | 485629
    lon            | -74.0091443
    lat            | 40.645177
    number         | 514
    street         | 49 ST
    unit           | 
    city           | NEW YORK
    district       | 
    region         | NEW YORK
    postcode       | 11220
    id             | 3017183
    hash           | a373b8e9e014c4b4
    geog           | 0101000020E6100000785CF9D1958052C0D190F12895524440
    pk             | 485629
    na             | 514 49 street new york 11220
    ts             | '11220':6 '49':2 '514':1 'new':4 'street':3 'york':5
    health_name    | Sunset Park Family Health Center
    health_address | 514 49th Street

    And another:

    pk             | 902117
    lon            | -73.9440892
    lat            | 40.7911112
    number         | 212
    street         | E  106 ST
    unit           | 
    city           | NEW YORK
    district       | 
    region         | NEW YORK
    postcode       | 10029
    id             | 9106834
    hash           | 9802bcab491efb2a
    geog           | 0101000020E61000006FA01BF56B7C52C0EABFBD2143654440
    pk             | 902117
    na             | 212 east 106 street new york 10029
    ts             | '10029':7 '106':3 '212':1 'east':2 'new':5 'street':4 'york':6
    health_name    | Settlement Health & Medical Services
    health_address | 212 East 106th Street

    Conclusion

    While it is not a 100% solution, using normalized addresses and full text search provides a relatively fast (less than 100ms) matching approach for loose address matching.

0 replies