Huge Pages and Postgres in Containers
We recently helped with a community solution for using huge pages when you’re running Postgres in containers or with Crunchy Postgres for Kubernetes. We worked on a patch to the underlying OCI (Open Container Initiative) runtime specification with our partner Red Hat and also worked on a patch for Postgres 16. For those of you using huge pages or running in containers, we have some additional notes on our solution in this write-up. We’re really proud of the improvements we’ve made because they help Postgres, Kubernetes, and every container runtime!
Background on Huge Pages
CPUs translate virtual memory addresses to physical addresses in chunks called
“pages.” Pages are typically 4 KB each, but nearly all CPU architectures provide
a way to use larger sizes, often 2 MB or 1 GB. Those larger pages are called
“huge pages” in Linux and are more efficient when using lots of memory. Huge
pages can improve Postgres performance and protect Postgres background processes
from the Out Of Memory (OOM) manager. Anyone who adjusts Postgres
shared_buffers should consider tuning their system’s huge pages.
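As a rough illustration of what "tuning huge pages" looks like from the operating system's side, the sketch below parses the huge page counters that Linux reports in /proc/meminfo. The function names and the sample text are made up for this example; only the HugePages_Total, HugePages_Free, and Hugepagesize field names come from the kernel.

```python
import re

# Hypothetical sample of the huge page lines found in /proc/meminfo.
SAMPLE_MEMINFO = """\
MemTotal:       16384000 kB
HugePages_Total:    1024
HugePages_Free:     1000
Hugepagesize:       2048 kB
"""

FIELDS = re.compile(r"(HugePages_Total|HugePages_Free|Hugepagesize):\s+(\d+)")


def parse_hugepage_info(meminfo_text):
    """Return the huge page counters found in the text of /proc/meminfo."""
    info = {}
    for line in meminfo_text.splitlines():
        match = FIELDS.match(line)
        if match:
            info[match.group(1)] = int(match.group(2))
    return info


def free_hugepage_bytes(info):
    """Free huge page memory in bytes (Hugepagesize is reported in kB)."""
    return info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0) * 1024


info = parse_hugepage_info(SAMPLE_MEMINFO)
print(info)
print(free_hugepage_bytes(info), "bytes of huge pages are free")
```

On a real Linux host, passing the contents of /proc/meminfo instead of the sample string should give that machine's numbers.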
Because huge pages are so great, Crunchy Postgres for Kubernetes makes them
super easy to use in the resources
portion of the PostgresCluster YAML. The
following example starts Postgres with 10 GiB of memory, 2 GiB of which are huge
pages. Kubernetes finds a machine the right size, and Postgres uses what’s
available.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pg
spec:
  postgresVersion: 14
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: [ReadWriteOnce]
              resources: { requests: { storage: 1Gi } }
  instances:
    - dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 1Gi } }
      resources:
        requests:
          cpu: 2
          memory: 8Gi
        limits:
          hugepages-2Mi: 2Gi
Huge pages missing in container runtimes
Crunchy Postgres for Kubernetes initially released this feature in 2021. Every
once in a while, we would get a report that Postgres could not initialize due to
a "Bus error", indicating that huge pages were to blame. The reporter would
change their environment or set huge_pages = off and be satisfied. Earlier this
year, we decided to dig in and really identify what was going on.
We reproduced the issue and saw that the container’s
hugetlb.2MB.limit_in_bytes cgroup field matched the Kubernetes hugepages-2Mi
limit, but hugetlb.2MB.rsvd.limit_in_bytes did not. That difference explained
the error, and setting the latter field made everything work again. (That made
sense: the rsvd field was added to the kernel because no one liked the bus
errors caused by the other field.)
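The comparison described above can be sketched in a few lines. This is a minimal example, not the tooling we used: the function names are hypothetical, and the directory you pass in depends on how cgroups are mounted on your system (on many cgroup v1 hosts the container sees these files under /sys/fs/cgroup/hugetlb, which is an assumption here). Only the two file names come from the investigation.

```python
from pathlib import Path


def read_limit(cgroup_dir, file_name):
    """Read one cgroup limit file as an integer, or None if it is unavailable."""
    try:
        return int((Path(cgroup_dir) / file_name).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def hugetlb_limits(cgroup_dir, page_size="2MB"):
    """Return (usage_limit, reservation_limit) for one huge page size."""
    usage = read_limit(cgroup_dir, f"hugetlb.{page_size}.limit_in_bytes")
    rsvd = read_limit(cgroup_dir, f"hugetlb.{page_size}.rsvd.limit_in_bytes")
    return usage, rsvd


def hugetlb_limits_match(cgroup_dir, page_size="2MB"):
    """True only when both limit files exist and hold the same value."""
    usage, rsvd = hugetlb_limits(cgroup_dir, page_size)
    return usage is not None and usage == rsvd
```

Calling hugetlb_limits_match("/sys/fs/cgroup/hugetlb") inside an affected container would be expected to return False, since the reservation limit was left unset while the usage limit followed the Kubernetes hugepages-2Mi value.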
Surely, we thought, something in the tall stack between Kubernetes and Linux was supposed to configure this field. Which one was misbehaving and deserved a bug report? None of them, it turns out! Container runtimes didn’t set this field because v1.0 of the OCI Runtime Specification didn’t mention it at all.
Naturally, we were not the first to notice this. We found issues related to this error and these cgroup fields scattered across Postgres, Kubernetes, operators, containers, and container runtimes for years. None, however, were making any progress toward a solution.
Odin Ugedal reported the problem to OCI and suggested a solution in 2020. Kailun Qin submitted a patch in 2021. We reviewed this patch and enlisted the help of our partner, Red Hat, to get it merged. It is now part of OCI Runtime Specification v1.1 released in July. Look for it to be in container runtimes soon! 🎉
Postgres 16 allows huge pages during initialization
While we were working on OCI, David Angel, a PGO user, engaged with the
pgsql-bugs mailing list after struggling to avoid the issue on a system with
huge pages. As a result, our own Tom Lane added a feature to Postgres 16
allowing server variables to be set using initdb. With that, initdb with
--set huge_pages=off works on any system where huge pages are broken for any
reason.
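For anyone scripting cluster creation, here is a small sketch of how that option might be passed along. The helper name and the data directory path are hypothetical; the --pgdata and --set options are the ones initdb provides in Postgres 16.

```python
import shlex


def initdb_command(data_dir, settings):
    """Build an initdb argument list that sets server variables at init time."""
    args = ["initdb", "--pgdata", data_dir]
    for name, value in settings.items():
        # Each --set name=value requires Postgres 16 or later.
        args += ["--set", f"{name}={value}"]
    return args


# /pgdata/pg16 is a placeholder path for illustration only.
cmd = initdb_command("/pgdata/pg16", {"huge_pages": "off"})
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) on a machine where the Postgres 16 initdb binary is on the PATH.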
Better huge pages in the future for everyone
The two changes above are proper long-term solutions, but they’ll take time to make their way into your environments. In the meantime, Crunchy Postgres for Kubernetes has implemented workarounds that will keep you running smoothly with huge pages.