Huge Pages and Postgres in Containers

Chris Bandy

We recently took part in a community effort to fix huge pages for anyone running Postgres in containers or with Crunchy Postgres for Kubernetes. With our partner Red Hat, we worked on a patch to the underlying OCI (Open Container Initiative) runtime specification, and we also worked on a patch for Postgres 16. For those of you using huge pages or running in containers, this write-up has some additional notes on our solution. We’re really proud of these improvements because they help Postgres, Kubernetes, and every container runtime!

Background on Huge Pages

CPUs translate virtual memory addresses to physical addresses in chunks called “pages.” Pages are typically 4 KB each, but nearly all CPU architectures provide a way to use larger sizes, often 2 MB or 1 GB. Linux calls those larger pages “huge pages,” and they are more efficient for large amounts of memory because far fewer pages are needed to map the same address range. Huge pages can improve Postgres performance and protect Postgres background processes from the Out Of Memory (OOM) killer. Anyone who adjusts Postgres shared_buffers should consider tuning their system’s huge pages.
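To put rough numbers on that, here is a small Python sketch that counts how many pages it takes to map the same amount of memory at each page size. The 2 GiB figure and the helper name pages_needed are just illustrative choices, not anything Postgres or Linux defines.

```python
# Illustrative arithmetic only: how many pages are needed to map a region
# of memory at different page sizes. The sizes are the common ones
# mentioned above; the 2 GiB region is an arbitrary example.
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(memory_bytes: int, page_bytes: int) -> int:
    """Return the number of pages required to cover memory_bytes, rounding up."""
    return -(-memory_bytes // page_bytes)


region = 2 * GIB
for label, page_size in (("4 KB", 4 * KIB), ("2 MB", 2 * MIB), ("1 GB", GIB)):
    print(f"{label} pages needed for 2 GiB: {pages_needed(region, page_size)}")
```

For 2 GiB, that works out to 524,288 pages at 4 KB, 1,024 pages at 2 MB, and 2 pages at 1 GB.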

Because huge pages are so great, Crunchy Postgres for Kubernetes makes them super easy to use in the resources portion of the PostgresCluster YAML. The following example starts Postgres with 10 GiB of memory in total: 8 GiB of regular memory plus 2 GiB of huge pages. Kubernetes finds a machine the right size, and Postgres uses what’s available.

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg
spec:
  postgresVersion: 14
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: [ReadWriteOnce]
              resources: { requests: { storage: 1Gi } }
  instances:
    - dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 1Gi } }
      resources:
        requests:
          cpu: 2
          memory: 8Gi
        limits:
          hugepages-2Mi: 2Gi

Huge pages missing in container runtimes

Crunchy Postgres for Kubernetes initially released this feature in 2021. Every once in a while, we would get a report that Postgres could not initialize and failed with a Bus error, which pointed to huge pages as the culprit. The reporter would change their environment or set huge_pages = off and be satisfied. Earlier this year, we decided to dig in and really identify what was going on.

We reproduced the issue and saw that the container’s hugetlb.2MB.limit_in_bytes cgroup field matched the Kubernetes hugepages-2Mi limit, but hugetlb.2MB.rsvd.limit_in_bytes did not. That difference explained the error, and setting the latter field made everything work again. (That made sense: the rsvd field was added to the kernel precisely because no one liked the bus errors caused by the other field.)
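If you want to check for the same mismatch in your own containers, a minimal Python sketch along these lines compares the two fields. It assumes a cgroup v1 hugetlb controller mounted at /sys/fs/cgroup/hugetlb inside the container, which may not match your system, and the function names are ours for illustration.

```python
from pathlib import Path
from typing import Optional, Tuple

# Assumption: cgroup v1 with the hugetlb controller mounted here.
# Adjust the path for your environment.
CGROUP_DIR = Path("/sys/fs/cgroup/hugetlb")


def read_limit(cgroup_dir: Path, filename: str) -> Optional[int]:
    """Read one cgroup limit file as an integer, or None if it is unavailable."""
    try:
        return int((cgroup_dir / filename).read_text().strip())
    except (OSError, ValueError):
        return None


def hugetlb_limits(cgroup_dir: Path, page_size: str = "2MB") -> Tuple[Optional[int], Optional[int]]:
    """Return (usage limit, reservation limit) for the given huge page size."""
    usage = read_limit(cgroup_dir, f"hugetlb.{page_size}.limit_in_bytes")
    rsvd = read_limit(cgroup_dir, f"hugetlb.{page_size}.rsvd.limit_in_bytes")
    return usage, rsvd


def limits_agree(cgroup_dir: Path, page_size: str = "2MB") -> bool:
    """True only when both fields are readable and hold the same value."""
    usage, rsvd = hugetlb_limits(cgroup_dir, page_size)
    return usage is not None and usage == rsvd


if __name__ == "__main__":
    usage, rsvd = hugetlb_limits(CGROUP_DIR)
    print(f"limit_in_bytes:      {usage}")
    print(f"rsvd.limit_in_bytes: {rsvd}")
    print("fields agree" if limits_agree(CGROUP_DIR) else "fields differ or are unavailable")
```

A result of "fields differ" with a hugepages-2Mi limit set is the situation described above, where the usage limit was configured but the reservation limit was not.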

Surely, we thought, something in the tall stack between Kubernetes and Linux was supposed to configure this field. Which one was misbehaving and deserved a bug report? None of them, it turns out! Container runtimes didn’t set this field because v1.0 of the OCI Runtime Specification didn’t mention it at all.

Naturally, we were not the first to notice this. We found issues related to this error and these cgroup fields scattered across Postgres, Kubernetes, operators, containers, and container runtimes for years. None, however, were making any progress toward a solution.

Odin Ugedal reported the problem to OCI and suggested a solution in 2020. Kailun Qin submitted a patch in 2021. We reviewed this patch and enlisted the help of our partner, Red Hat, to get it merged. It is now part of OCI Runtime Specification v1.1, released in July 2023. Look for it to be in container runtimes soon! 🎉

Postgres 16 allows huge pages during initialization

While we were working on OCI, David Angel, a PGO user, engaged with the pgsql-bugs mailing list after struggling to avoid the issue on a system with huge pages. As a result, our own Tom Lane added a feature to Postgres 16 that allows server variables to be set through initdb. With that, running initdb with --set huge_pages=off works on any system where huge pages are broken for any reason.
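As a rough illustration of how that option might be used from a provisioning script, this Python sketch builds an initdb command line with one --set flag per setting. The helper name initdb_command and the data directory path are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex
from typing import List, Mapping


def initdb_command(data_dir: str, settings: Mapping[str, str]) -> List[str]:
    """Build an initdb argument list with one --set name=value per setting.

    Requires Postgres 16 or later, where initdb accepts --set.
    """
    argv = ["initdb", "--pgdata", data_dir]
    for name, value in settings.items():
        argv += ["--set", f"{name}={value}"]
    return argv


# Hypothetical data directory, for illustration only.
command = initdb_command("/pgdata/pg16", {"huge_pages": "off"})
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as-is once the path points at a real, empty data directory.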

Better huge pages in the future for everyone

The two changes above are proper long-term solutions, but they’ll take time to make their way into your environments. In the meantime, Crunchy Postgres for Kubernetes has implemented workarounds that will keep you running smoothly with huge pages.

October 3, 2023