Many applications today record data from sensors, devices, tracking information, and other things that share a common attribute: a timestamp that is always increasing. This timestamp is very valuable, as it serves as the basis for types of lookups, analytical queries, and more.
Common table expressions, aka CTEs, aka WITH queries, are not only the gateway to writing recursive SQL queries, but also help developers write maintainable SQL. WITH query clauses can help developers who are more comfortable writing in imperative languages to feel more comfortable writing SQL, as well as help reduce writing redundant code by reusing a particular common table expressions multiple times in a query.
A new patch, scheduled to be a part of PostgreSQL 12 major release later in the year, introduces the ability, under certain conditions, to inline common table expressions within a query. This is a huge feature: many developers could suddenly see their existing queries speed up significantly, and the ability to explicitly specify when to inline (i.e. the planner "substitutes" a reference to the CTE in the main query and can then optimize further) or, conversely, materialize (i.e. place the CTE into memory but lose out on certain planning & execution optimizations).
But why is this a big deal? Before we look into the future, first let's understand how WITH queries currently work in PostgreSQL.
An important part of running a production PostgreSQL database system (and for that matter, any database software) is to ensure you are prepared for disaster. There are many ways to go about preparing your system for disaster, but one of the simplest and most effective ways to do this is by taking periodic backups of your database clusters.
How does one typically go about setting up taking a periodic backup? If you’re running PostgreSQL on a Linux based system, the solution is to often use cron, and setting up a crontab entry similar to this in your superuser account:
# take a daily base backup at 1am to a mount point on an external disk # using pg_basebackup 0 1 * * * /usr/bin/env pg_basebackup –D /your/external/mount/
However, if you’re managing tens, if not hundreds and thousands of PostgreSQL databases, this very quickly becomes an onerous task and you will need some automation to help you scale your disaster recovery safely and efficiently.
Automating Periodic Backups
The Crunchy PostgreSQL Operator, an application for managing PostgreSQL databases in a Kubernetes-based environment in is designed for managing thousands of PostgreSQL database from a single interface to help with challenges like the above. One of the key features of the PostgreSQL Operator is to utilize Kubernetes Labels to apply commands across many PostgreSQL databases. Later in this article, we will see how we can take advantage of labels in order to set backup policies across many clusters.
Interested in running PostgreSQL natively on Kubernetes? Let's look at a few quick steps to get up and running with the open source Crunchy PostgreSQL Operator for Kubernetes on your choice of Kubernetes deployment.
The Crunchy PostgreSQL Operator (aka "pgo") provides a quickstart script to automate the deployment of the Crunchy PostgreSQL Operator to a number of popular Kubernetes environments, including Google Kubernetes Engine (GKE), OpenShift Container Platform (OCP) and Pivotal Container Service (PKS). You can download the quickstart script here:
The script will deploy the operator to a GKE Kubernetes cluster or an OpenShift Container Platform cluster. The script assumes you have a StorageClass defined for persistence.
Many applications these days want us to know how close we are to things:
- What are the three closest coffee shops to my current location?
- Which is the nearest airport to the office?
- What are the two closest subway stops to the restaurant?
and countless more examples.
Another way of asking these questions is to say “who are my nearest neighbors to me?” This maps to a classic algorithmic problem: efficiently finding the K-nearest neighbors (or K-NN), where K is a constant. For example, the first question would be a 3-NN problem as we are trying to find the 3 closest coffee shops.
(If you are interested in learning more about K-NN problems in general, I highly recommend looking at how you can solve this using n-dimensional Voronoi diagrams, a wonderful data structure developed in the field of computational geometry.)
How can we use PostgreSQL to help us quickly find our closest neighbors? Let’s explore.
The PostgreSQL 11 release is nearly here (maybe in the next couple of weeks?!), and while a lot of the focus will be on the improvements to the overall performance of the system (and rightly so!), it's important to notice some features that when used appropriately, will provide noticeable performance improvements to your applications.
One example of such feature is the introduction of "covering indexes" for B-tree indexes. A covering index allows a user to perform an index-only scan if the select list in the query matches the columns that are included in the index. You can specify the additional columns for the index using the "INCLUDE" keyword, e.g.
CREATE INDEX a_b_idx ON x (a,b) INCLUDE (c);
Theoretically, this can reduce the amount of I/O your query needs to use in order to retrieve information (traditionally, I/O is the biggest bottleneck on database systems). Additionally, the data types included in a covering index do not need to be B-tree indexable; you can add any data type to the INCLUDE part of a CREATE INDEX statement.
However, you still need to be careful how you deploy covering indexes: each column you add to the index still takes up space on disk, and there is still a cost for maintaining the index, for examples, on row updates.
Understanding these trade offs, you can still apply covering indexes in very helpful ways that can significantly help your applications.
A Simple Example: Tracking Coffee Shop Visits
Crunchy Data recently announced the publication of the CIS PostgreSQL Benchmark by the Center for Internet Security, a nonprofit organization that provides publications around standards and best practices for securing technologies systems. This CIS PostgreSQL Benchmark builds on earlier work that Crunchy started when it helped to publish the PostgreSQL Security Technical Implementation Guide (PostgreSQL STIG) and provides guidance and steps to help secure your PostgreSQL databases.
What is a CIS Benchmark?
A CIS Benchmark is a set of guidelines and best practices for securely configuring a target system. Authoring a CIS Benchmark is a collaborative process as CIS involves considerable peer reviews and discussion before a major version is published, to ensure there is a general consensus on the best practices for deploying a secure system.
Having replicas in a database environment can fulfill many goals: create a high-availability environment, load balance read-only queries across several nodes, have a read-only database to run analytical queries on, and more. PostgreSQL introduced streaming replication in PostgreSQL 9.0 almost 8 years ago to help facilitate satisfying the above requirements, and through the years, the community has added many features to continue to enhance the replication user experience.
There are many different ways to configure streaming replication in PostgreSQL, but fundamentally you need to consider if you want to have asynchronous or synchronous replication. With asynchronous replication, the default in PostgreSQL, a transaction is considered complete when it is committed on the primary database; the replica databases will replay and consume the transaction on their own time. With synchronous replication, the transaction will not be considered complete until it has been committed on all of the replicas (though PostgreSQL 10 introduced “quorum commit” which allows you to configure how many replicas need to commit the transaction before it’s considered complete).
Choosing which form of replication you need for your environment is out-of-scope for this article. Rather, we will look at what happens when a replica becomes out-of-sync with a primary, what to do, and how you can avoid this situation. Credits to Doug Hunley for originally compiling this guide.
Detecting An Out-of-Sync Replica
Crunchy Data recently released version 2.6 of the PostgreSQL Kubernetes Operator, a powerful controller that follows the Kubernetes Operator pattern that enables users to provision and manage thousands of PostgreSQL databases at scale. We thought that we would demonstrate some of the features in this latest version over the next few weeks, which includes support for manual database failover, storage selection, node (or server) selection preference, and many goodies that make it easier to manage a large PostgreSQL cluster.
If you have been asked to provide a CSV that someone can open up in their favorite spreadsheet editor, chances are you have used the PostgreSQL COPY command. COPY has been around since the early open source releases of PostgreSQL back in the late 1990s and was designed to quickly get data in and out of PostgreSQL.
COPY is also incredibly helpful for ingesting data into a table, especially if you have a lot of it to ingest, and will generally outperform INSERT. Let’s explore a few ways to use COPY to load some data into a table.