Multi-Cloud Strategies with Crunchy Postgres for Kubernetes
Crunchy Postgres for Kubernetes supports cross-datacenter streaming replication out of the box. With so many folks asking about cross-cloud and cross-datacenter replication, we wanted to give a thorough explanation of how it works. In this post, we use streaming replication with an approach that prioritizes reducing latency and adding stability.
Cross-cloud streaming replication can be used:
- To enable multi-cloud disaster recovery
- To move clusters between cloud providers
- To move clusters between on-premises and cloud
Given the power of this feature, we decided to incorporate streaming replication directly into PGO. With the 5.2 release, this is easily configurable through the postgrescluster spec, without the need for manual Postgres configuration to set up the streaming replication.
Set Up Cloud Environments
In this sample scenario, we will create postgresclusters in both EKS and GKE. EKS will be used as our primary environment, and GKE will host the standby. PGO will need to be deployed in both EKS and GKE to create postgresclusters in both environments.
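If PGO is not already running in a given cluster, a typical installation applies the kustomize manifests from Crunchy Data's postgres-operator-examples repository. This is a minimal sketch; the kubectl context names (eks, gke) are assumptions that will differ in your setup:

$ git clone https://github.com/CrunchyData/postgres-operator-examples.git
$ cd postgres-operator-examples
# Install PGO into the EKS cluster (context name is an assumption)
$ kubectl --context eks apply -k kustomize/install/namespace
$ kubectl --context eks apply --server-side -k kustomize/install/default
# Repeat with --context gke so the operator runs in both environments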
The standby database needs to connect directly to the primary database over the network. This means the primary environment (EKS) needs to be able to create services with an external IP. In this example, we are using the LoadBalancer service type, which is easily configurable through the postgrescluster spec.
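As a minimal sketch, only the spec.service field is involved; the full primary spec in the next section includes it:

spec:
  service:
    type: LoadBalancer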
Both postgresclusters will need copies of the same TLS certificates to allow replication. Please look at the custom TLS section of our docs for guidance on creating custom cert secrets in the format that PGO expects. This will need to be done in both environments. In this example, we have copies of the cluster-cert and replication-cert secrets in both Kubernetes environments.
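For illustration, assuming you have already generated a CA and signed certificate files following the custom TLS docs (the file names below are placeholders), the secrets can be created from files like this:

# Secret containing the cluster certificate (file names are placeholders)
$ kubectl -n postgres-operator create secret generic cluster-cert \
    --from-file=ca.crt=ca.crt \
    --from-file=tls.crt=cluster.crt \
    --from-file=tls.key=cluster.key
# Secret containing the replication certificate
$ kubectl -n postgres-operator create secret generic replication-cert \
    --from-file=ca.crt=ca.crt \
    --from-file=tls.crt=replication.crt \
    --from-file=tls.key=replication.key
# Run the same commands against the other cluster so both environments
# hold identical copies of the secrets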
Create Clusters
Now that our cloud environments are configured, we can create the primary and standby clusters. First, we will create the primary cluster and allow it to start up. Then we will take note of the external IP that is created for the primary service on the cluster. Once we have that IP, we can create our standby cluster.
Primary
For the primary, we create a postgrescluster with the following spec. We have defined the custom TLS certs that we created in both environments. We have also specified that the service that exposes the PostgreSQL primary instance should have the type LoadBalancer.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: primary
  namespace: postgres-operator
spec:
  service:
    type: LoadBalancer
  postgresVersion: 14
  customTLSSecret:
    name: cluster-cert
  customReplicationTLSSecret:
    name: replication-cert
  instances:
    - name: instance1
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 1Gi } }
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: [ReadWriteOnce]
              resources: { requests: { storage: 1Gi } }
After you create a postgrescluster with this spec, wait for an initial backup to complete and for the cluster to be ready. At that point, your primary is ready and you can start setting up the standby. Before you switch to the GKE cluster, you will need the external IP from the primary-ha service.
$ kubectl get svc
NAME         TYPE           CLUSTER-IP    EXTERNAL-IP                                                                PORT(S)
primary-ha   LoadBalancer   10.100.4.48   a078e7d173f214d9ca0e7d122052aa5a-1097707392.us-east-1.elb.amazonaws.com   5432:30985/TCP
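If you want to script this step, a jsonpath query pulls out just the address (on EKS the load balancer is exposed as a hostname rather than a numeric IP):

$ kubectl get svc primary-ha \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'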
Standby
Now that we have the primary cluster, we can create our standby. Here we are using the spec.standby fields in the PostgresCluster spec. When filling out the standby spec, we have a few options: you can provide a host, a repoName, or both. In this scenario, we are using streaming replication, so we need to provide a host. The host in the spec below is the external address we copied from the primary-ha service.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: standby
  namespace: postgres-operator
spec:
  standby:
    enabled: true
    host: a078e7d173f214d9ca0e7d122052aa5a-1097707392.us-east-1.elb.amazonaws.com
  postgresVersion: 14
  customTLSSecret:
    name: cluster-cert
  customReplicationTLSSecret:
    name: replication-cert
  instances:
    - name: instance1
      replicas: 1
      dataVolumeClaimSpec:
        accessModes: [ReadWriteOnce]
        resources: { requests: { storage: 1Gi } }
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: [ReadWriteOnce]
              resources: { requests: { storage: 1Gi } }
The standby cluster will look slightly different from the primary. You can expect the standby to have instance pods (one for every replica defined in the spec) and a repo-host pod. One thing you will not see is an initial backup on the cluster.
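You can confirm this by listing the standby's pods; PGO labels every pod with its cluster name, so a label selector works well here:

$ kubectl get pods \
    --selector=postgres-operator.crunchydata.com/cluster=standby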
Verify Streaming Replication
Now that we have a standby using streaming replication, it is a good time to check that replication is configured correctly and working as expected. The first thing you should check is that any data you create on the primary is replicated over to the standby. The time this takes will depend on network latency and the size of the data. If you see data from your primary database on the standby, streaming replication is active and you are good to go.
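As a quick sketch of that check (the pod names here are examples; yours will differ), write a row on the primary and read it back on the standby:

# On EKS: create a table and a row on the primary (pod name is an example)
$ kubectl exec -it primary-instance1-abcd-0 -c database -- \
    psql -c "CREATE TABLE replication_check (id int);" \
         -c "INSERT INTO replication_check VALUES (1);"
# On GKE: the row should show up on the standby shortly after
$ kubectl exec -it standby-instance1-bkbl-0 -c database -- \
    psql -c "SELECT * FROM replication_check;"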
If you have exec privileges in your Kubernetes cluster, there are a few commands you can use to verify data replication and streaming. In the following two commands, we exec into the standby database, check that the walreceiver process is running, and check that we have a streaming status in pg_stat_wal_receiver.
$ kubectl exec -it standby-instance1-bkbl-0 -c database -- bash
bash-4.4$ ps -U postgres -x | grep walreceiver
  95 ?        Ss     0:10 postgres: standby-ha: walreceiver streaming 0/A000000
bash-4.4$ psql -c "select pid,status,sender_host from pg_stat_wal_receiver;"
 pid |  status   |                               sender_host
-----+-----------+-------------------------------------------------------------------------
  95 | streaming | a078e7d173f214d9ca0e7d122052aa5a-1097707392.us-east-1.elb.amazonaws.com
(1 row)
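You can also check from the primary's side: each connected standby appears as a row in pg_stat_replication (again, the pod name is an example):

$ kubectl exec -it primary-instance1-abcd-0 -c database -- \
    psql -c "select client_addr, state, sync_state from pg_stat_replication;"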
Promote the Standby
Now that you can see your data being replicated from the primary to the standby, you are ready to promote the standby in a disaster scenario. This is done by updating the spec of the standby cluster so that standby.enabled is false, or by removing the standby section entirely.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: standby
spec:
  standby:
    enabled: false
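Equivalently, you can flip the flag in place with a merge patch instead of editing the full manifest:

$ kubectl -n postgres-operator patch postgrescluster standby \
    --type merge --patch '{"spec":{"standby":{"enabled":false}}}'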
After you promote the standby, it will work as a fully functioning postgrescluster that you can back up, scale, and use as you would expect. You can also use the new primary to create another standby cluster!
Conclusion
If you've been looking for a solution for streaming replication, you may have come across Brian Pace's article earlier this year on Streaming Replication using pgBackRest. I'm excited that with PGO 5.2, this is even easier to set up. Streaming replication adds another tool to our operator, helping customers find the disaster recovery solutions that meet their needs.