Troubleshooting Postgres in Kubernetes

Bob Pacheco


In my role as a Solutions Architect at Crunchy Data, I help customers get up and running with Crunchy Postgres for Kubernetes (CPK). Installing and managing a Postgres cluster in Kubernetes has never been easier. However, sometimes things don't go as planned, and I've noticed a few major areas where Kubernetes installations go awry. Today I want to walk through some of the most common issues I see when people try to get up and running with Postgres in Kubernetes and offer basic troubleshooting ideas for each. Now sure, your issue might not be in here, but if you're just trying to diagnose a bad install or a failing cluster, here's my go-to list of where to start.

The Order of Things: CRD, Operator, Cluster, Pod

Let’s get started with a basic understanding of how things get installed and by what. You can use that knowledge to determine where to look first when something that you are expecting does not appear during your installation.

Custom Resource Definition (CRD): The CPK Operator requires a Custom Resource Definition (CRD). It is possible to have multiple CRDs per Operator. Our most recent Operator, 5.5, has 3 CRD examples, postgres-operator.crunchydata.com_postgresclusters.yaml being one of them. The user applies all CRD files to the Kubernetes cluster. The CRDs must be installed before the operator.
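Before installing the Operator, you can quickly confirm the CRDs landed in the cluster. The full CRD name below is inferred from the group_plural naming of the file mentioned above, so double check it against your install manifests:

kubectl get crd | grep crunchydata.com
kubectl get crd postgresclusters.postgres-operator.crunchydata.com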

Operator: The CPK Operator gets installed by the user applying a manager.yaml file that describes a Kubernetes object of kind:Deployment. This creates the Deployment and the Deployment creates the Operator pod. The Operator itself is a container running in a pod.
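If the Operator pod never appears, start with the Deployment that should have created it. This is a minimal sketch assuming the Deployment is named pgo, which matches the pod names you will see later in this post:

kubectl -n postgres-operator get deployments
kubectl -n postgres-operator describe deployment pgo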

Postgres Cluster: A CPK Postgres Cluster is typically created by the user applying a postgres.yaml file containing the PostgresCluster.spec, that describes a Kubernetes object of kind:PostgresCluster.
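For reference, here is a rough sketch of what a minimal PostgresCluster manifest looks like. The names, Postgres version, and storage sizes are illustrative; check the CPK documentation for the exact spec supported by your Operator version:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  postgresVersion: 16
  instances:
    - name: instance1
      dataVolumeClaimSpec:
        accessModes:
          - 'ReadWriteOnce'
        resources:
          requests:
            storage: 1Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - 'ReadWriteOnce'
              resources:
                requests:
                  storage: 1Gi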

Pods: The StatefulSets and Deployments create the individual pods they describe. The Operator creates a StatefulSet for each Postgres pod and for the pgBackRest repo host pod (if applicable). Deployments are also created for pgBouncer pods (if applicable). If you are missing a pod, describe the Deployment or StatefulSet that owns it. If you are missing a Deployment or StatefulSet, the CPK Operator logs will usually indicate why.
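A sketch of that workflow, assuming the namespace and Operator Deployment name used elsewhere in this post (substitute the StatefulSet name reported by the first command):

kubectl -n postgres-operator get statefulsets,deployments
kubectl -n postgres-operator describe statefulset <statefulset-name>
kubectl -n postgres-operator logs deploy/pgo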

Image Pulls

Next, let's look at image pull issues. There are two primary reasons you would receive an image pull error: (1) you do not have permission to connect to the registry or pull the requested image, or (2) the requested image is not in the registry.

Permissions Example

I am attempting to deploy the CPK Operator.

kubectl apply -n postgres-operator -k install/default --server-side

I see that I have an ImagePullBackOff error.

kubectl -n postgres-operator get pods
NAME                   READY   STATUS             RESTARTS   AGE
pgo-5694b9545c-ggz7g   0/1     ImagePullBackOff   0          27s

When a pod is not coming up in Kubernetes, the first thing we will do is describe the pod and look at the events at the bottom of the output.

kubectl -n postgres-operator describe pod pgo-5694b9545c-ggz7g
...
Events:
  Type     Reason     Age                     From               Message
  ----     ------     ----                    ----               -------
  ...
  Normal   Pulling    6m9s (x4 over 7m39s)    kubelet            Pulling image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0"
  Warning  Failed     6m9s (x4 over 7m39s)    kubelet            Failed to pull image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": rpc error: code = Unknown desc = failed to pull and unpack image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": failed to resolve reference "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0": failed to authorize: failed to fetch anonymous token: unexpected status from GET request to <https://access.crunchydata.com/api/v1/auth/jwt/container/token/?scope=repository%3Acrunchydata%2Fpostgres-operator%3Apull&service=crunchy-container-registry:> 403 Forbidden

Looking at the events, we see that we attempted to pull the crunchydata/postgres-operator:ubi8-5.5.0-0 image from the Crunchy Data registry. The Failed event ends with: 403 Forbidden. This means that we do not have permission to pull this image from the registry.

Adding a pull secret

To resolve the issue, we will create a pull secret and add it to the deployment. You can find more information on creating pull secrets for private registries in the CPK documentation.
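As a rough sketch, creating the secret with kubectl and referencing it from the Operator Deployment looks something like this. The secret name crunchy-regcred and the credential placeholders are examples only; use the values for your Crunchy Data account:

kubectl -n postgres-operator create secret docker-registry crunchy-regcred \
  --docker-server=registry.crunchydata.com \
  --docker-username=<your-account> \
  --docker-password=<your-token>

# then reference the secret in the Deployment's pod template (manager.yaml):
spec:
  template:
    spec:
      imagePullSecrets:
        - name: crunchy-regcred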

We create the image pull secret and add it to the Deployment per the documentation, apply the change, and delete the failed pod. Now we see that the pod is recreated and the image is pulled successfully.

kubectl apply -n postgres-operator -k install/default --server-side

kubectl -n postgres-operator delete pod pgo-5694b9545c-xnpjg
pod "pgo-5694b9545c-xnpjg" deleted

kubectl -n postgres-operator get pods
NAME                   READY   STATUS    RESTARTS   AGE
pgo-5694b9545c-xnpjg   1/1     Running   0          23s

Image Not In Registry Example

We again attempt to deploy the Operator and see that we have an ImagePullBackOff error.

kubectl -n postgres-operator get pods
NAME                   READY   STATUS             RESTARTS   AGE
pgo-6bfc9554b7-6h4jd   0/1     ImagePullBackOff   0          22s

Just like before, we will describe the pod and look at the events to determine why this is happening:

kubectl -n postgres-operator describe pod pgo-6bfc9554b7-6h4jd

...
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
...
  Normal   Pulling    4m30s (x4 over 6m5s)  kubelet            Pulling image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0"
  Warning  Failed     4m30s (x4 over 6m4s)  kubelet            Failed to pull image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": rpc error: code = NotFound desc = failed to pull and unpack image "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": failed to resolve reference "registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0": registry.crunchydata.com/crunchydata/postgres-operator:ubi8-5.50.0-0: not found

This time we see that we tried to pull the crunchydata/postgres-operator:ubi8-5.50.0-0 image from the Crunchy Data registry. However, the image is not found. Upon closer inspection of the image listed in the CPK Operator kustomization.yaml file we see that we have a typo. We had a tag of ubi8-5.50.0-0 when it should have been ubi8-5.5.0-0.

images:
  - name: postgres-operator
    newName: registry.crunchydata.com/crunchydata/postgres-operator
    newTag: ubi8-5.50.0-0

Changing tag names

We make the correction to the file and apply the change. The pod is automatically recreated with the correct image tag.
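The corrected images entry in kustomization.yaml now points at the tag we used in the first example:

images:
  - name: postgres-operator
    newName: registry.crunchydata.com/crunchydata/postgres-operator
    newTag: ubi8-5.5.0-0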

kubectl apply -n postgres-operator -k install/default --server-side

kubectl -n postgres-operator get pods
NAME                   READY   STATUS    RESTARTS   AGE
pgo-6bfc9554b7-6h4jd   1/1     Running   0          96s

By using kubectl describe on the pods, we were able to see why we were getting image pull errors and easily correct them.

Resource Allocation

Another important area to check when troubleshooting a failed Kubernetes installation is resource allocation: making sure pods have the CPU and memory they need. The most common issues I see at installation time are:

  • Requesting more resources than are available on the Kubernetes nodes.
  • Resource requests that are too low for the containers in the pod to operate properly.
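Before setting requests and limits, it helps to know what the nodes can actually provide. Two quick checks (kubectl top requires the metrics-server add-on, so it may not be available in every cluster):

kubectl describe nodes | grep -A 6 'Allocatable:'
kubectl top nodes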

Resource Request Exceeds Availability

Here in this postgres.yaml we set some resource requests and limits for our Postgres pods. We are requesting 5 CPUs and setting a limit of 10 CPUs per Postgres pod.

instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 10000m
        memory: 256Mi
      requests:
        cpu: 5000m
        memory: 100Mi

When we create the Postgres cluster and look at the pods we find them in a pending state.

kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-pgbouncer-7c467748d-tl4pn   2/2     Running   0          103s
hippo-ha-pgbouncer-7c467748d-v6s4d   2/2     Running   0          103s
hippo-ha-pgha1-bzrb-0                0/5     Pending   0          103s
hippo-ha-pgha1-z7nl-0                0/5     Pending   0          103s
hippo-ha-repo-host-0                 2/2     Running   0          103s
pgo-6ccdb8b5b-m2zsc                  1/1     Running   0          48m

Let's describe one of the pending pods and look at the events:

kubectl -n postgres-operator describe pod hippo-ha-pgha1-bzrb-0
Name:             hippo-ha-pgha1-bzrb-0
Namespace:        postgres-operator
...
Events:
  Type     Reason             Age                    From                Message
  ----     ------             ----                   ----                -------
  Warning  FailedScheduling   3m41s (x2 over 3m43s)  default-scheduler   0/2 nodes are available: 2 Insufficient cpu. preemption: 0/2 nodes are available: 2 No preemption victims found for incoming pod..

We see that there is not enough CPU available on the nodes to meet our request. We reduce our resource requests and limits and try again.

instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 1000m
        memory: 256Mi
      requests:
        cpu: 500m
        memory: 100Mi

kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-backup-jb8t-tgdtx           1/1     Running   0          13s
hippo-ha-pgbouncer-7c467748d-s8wq6   2/2     Running   0          34s
hippo-ha-pgbouncer-7c467748d-zhcmf   2/2     Running   0          34s
hippo-ha-pgha1-hmrq-0                5/5     Running   0          35s
hippo-ha-pgha1-xxtf-0                5/5     Running   0          35s
hippo-ha-repo-host-0                 2/2     Running   0          35s
pgo-6ccdb8b5b-m2zsc                  1/1     Running   0          124m

Now we see that all of our pods are running as expected.

Insufficient Resource Request

What happens if we don't allocate enough resources? Here we set very low CPU requests and limits: a request of 5m (five millicores of CPU) and a limit of 10m.

instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 10m
        memory: 256Mi
      requests:
        cpu: 5m
        memory: 100Mi

We apply the manifest and take a look at the pods.

kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS      AGE
hippo-ha-pgbouncer-7c467748d-hnf5k   2/2     Running   0             93s
hippo-ha-pgbouncer-7c467748d-q28t9   2/2     Running   0             93s
hippo-ha-pgha1-r2qs-0                4/5     Running   2 (11s ago)   93s
hippo-ha-pgha1-x2ft-0                4/5     Running   2 (8s ago)    93s
hippo-ha-repo-host-0                 2/2     Running   0             93s
pgo-6ccdb8b5b-m2zsc                  1/1     Running   0             136m

We see that our Postgres pods are only showing 4/5 containers running and 90 seconds after creation they have already restarted twice. This is a clear indication that something is wrong. Let's look at the logs for the Postgres container to see what is going on.

kubectl -n postgres-operator logs hippo-ha-pgha1-r2qs-0 -c database

We didn't get any logs back, which indicates that the Postgres container is never getting far enough to start. Now we will adjust the CPU request and limit to more reasonable values and try again. I normally don't go below 500m.
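As an aside, when a container keeps restarting without producing any logs, two extra checks usually help pin down the cause. This sketch uses the same pod name as above; --previous shows output from the prior container instance, and the pod's events often record failed probes or OOM kills:

kubectl -n postgres-operator logs hippo-ha-pgha1-r2qs-0 -c database --previous
kubectl -n postgres-operator get events --field-selector involvedObject.name=hippo-ha-pgha1-r2qs-0

Here is the adjusted spec: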

instances:
  - name: pgha1
    replicas: 2
    resources:
      limits:
        cpu: 1000m
        memory: 256Mi
      requests:
        cpu: 500m
        memory: 100Mi

kubectl apply -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created

Now we see that our cluster is up and running with all expected containers.

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-backup-pv9n-tr7mh           1/1     Running   0          6s
hippo-ha-pgbouncer-7c467748d-45jj9   2/2     Running   0          33s
hippo-ha-pgbouncer-7c467748d-lqfz2   2/2     Running   0          33s
hippo-ha-pgha1-8kh2-0                5/5     Running   0          34s
hippo-ha-pgha1-v4t5-0                5/5     Running   0          34s
hippo-ha-repo-host-0                 2/2     Running   0          33s
pgo-6ccdb8b5b-m2zsc                  1/1     Running   0          147m

Storage Allocation

Lastly, we will look at some common issues when allocating storage to our pods. The storage allocation issues people run into most often at installation time are:

  • Improper Resource Request
  • Unsupported Storage Class

Improper Resource Request Example

Here is an example of the storage we want to allocate to our Postgres cluster pods in the postgres.yaml:

dataVolumeClaimSpec:
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1GB

When we attempt to apply the manifest we see this output on the command line:

kubectl apply -n postgres-operator -k high-availability
The PostgresCluster "hippo-ha" is invalid: spec.instances[0].dataVolumeClaimSpec.resources.requests.storage: Invalid value: "1GB": spec.instances[0].dataVolumeClaimSpec.resources.requests.storage in body should match '^(\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\\+|-)?(([0-9]+(\\.[0-9]*)?)|(\\.[0-9]+))))?$'

The value of "1GB" is invalid. The error message tells you where in the manifest the error is. It is in the spec.instances[0].dataVolumeClaimSpec.resources.requests.storage section of the manifest. The message even provides the regex that is used for validation.

When we enter the valid value of 1Gi we are able to deploy our Postgres cluster. Remember that storage sizes must use Kubernetes quantity suffixes such as Gi or Mi (or decimal G and M); GB and MB are not accepted. More syntax specifics are in the Kubernetes docs.

dataVolumeClaimSpec:
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-backup-ngg5-56z7z           1/1     Running   0          10s
hippo-ha-pgbouncer-7c467748d-4q887   2/2     Running   0          35s
hippo-ha-pgbouncer-7c467748d-lc2sr   2/2     Running   0          35s
hippo-ha-pgha1-w9vc-0                5/5     Running   0          35s
hippo-ha-pgha1-zhx8-0                5/5     Running   0          35s
hippo-ha-repo-host-0                 2/2     Running   0          35s
pgo-6ccdb8b5b-vzzkp                  1/1     Running   0          12m

Unsupported Storage Class Example

We want to specify a specific storage class to be used with our Postgres cluster pods:

dataVolumeClaimSpec:
  storageClassName: foo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi

When we apply the manifest we see that our Postgres pods get stuck in a "pending" state.

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-pgbouncer-7c467748d-jxxpf   2/2     Running   0          3m42s
hippo-ha-pgbouncer-7c467748d-wdtvq   2/2     Running   0          3m42s
hippo-ha-pgha1-79gr-0                0/5     Pending   0          3m42s
hippo-ha-pgha1-xv2t-0                0/5     Pending   0          3m42s
hippo-ha-repo-host-0                 2/2     Running   0          3m42s
pgo-6ccdb8b5b-vzzkp                  1/1     Running   0          24m

At this point it is not clear to us why the pods are pending. Let's describe one of them and look at the events to see if we can get more information.

kubectl -n postgres-operator describe pod hippo-ha-pgha1-79gr-0
Name:             hippo-ha-pgha1-79gr-0
Namespace:        postgres-operator
...
Events:
  Type     Reason             Age                   From                Message
  ----     ------             ----                  ----                -------
  Normal   NotTriggerScaleUp  31s (x32 over 5m34s)  cluster-autoscaler  pod didn't trigger scale-up:
  Warning  FailedScheduling   13s (x6 over 5m36s)   default-scheduler   0/2 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/2 nodes are available: 2 Preemption is not helpful for scheduling..

In the describe events we see that the pod has unbound immediate PersistentVolumeClaims. What does that mean? It means that Kubernetes was not able to satisfy our storage claim, so the PersistentVolumeClaim remains unbound and the pod cannot be scheduled. If we examine our dataVolumeClaimSpec we see that we set three specific values:

dataVolumeClaimSpec:
  storageClassName: foo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi
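Describing the PersistentVolumeClaim itself usually names the problem directly, often with an event indicating the storage class was not found. A quick sketch (the claim names are generated by the Operator, so substitute the ones reported by kubectl get pvc):

kubectl -n postgres-operator get pvc
kubectl -n postgres-operator describe pvc <pvc-name>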

We review the available storage classes in our Kubernetes provider. In this case we are deploying on GKE. We see that we have 3 storage classes available to us:

[Screenshot: the three storage classes available in our GKE cluster]
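The same list is available directly from kubectl; the class names will vary by provider:

kubectl get storageclass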

We delete the failed cluster deployment:

kubectl delete -n postgres-operator -k high-availability
postgrescluster.postgres-operator.crunchydata.com "hippo-ha" deleted

We update the storageClassName in our manifest to a supported storage class and apply it.

dataVolumeClaimSpec:
  storageClassName: standard-rwo
  accessModes:
    - 'ReadWriteOnce'
  resources:
    requests:
      storage: 1Gi

kubectl apply -n postgres-operator -k high-availability
configmap/db-init-sql created
postgrescluster.postgres-operator.crunchydata.com/hippo-ha created

Now we see that all of our pods are up and running.

kubectl -n postgres-operator get pods
NAME                                 READY   STATUS    RESTARTS   AGE
hippo-ha-backup-jstq-c8n67           1/1     Running   0          6s
hippo-ha-pgbouncer-7c467748d-5smt9   2/2     Running   0          31s
hippo-ha-pgbouncer-7c467748d-6vb7t   2/2     Running   0          31s
hippo-ha-pgha1-9s2g-0                5/5     Running   0          32s
hippo-ha-pgha1-drmv-0                5/5     Running   0          32s
hippo-ha-repo-host-0                 2/2     Running   0          32s
pgo-6ccdb8b5b-vzzkp                  1/1     Running   0          44m

We Did It!

In this blog we were able to identify, diagnose, and correct common installation issues that sometimes occur when installing Postgres in Kubernetes. We learned how to use kubectl describe to gather the information that helped us diagnose each issue. The lessons learned here don't just apply to Postgres: these types of issues can happen with any application running in Kubernetes if the manifest is not correct or proper resources have not been allocated. Congratulations! You now have the knowledge you need to solve common installation issues.


Written by

Bob Pacheco

January 17, 2024