Quick and Easy Postgres Data Compare
If you're checking archives or working with Postgres replication, data reconciliation can be a necessary task. Row counts can be one of the go to comparison methods but that does not show data mismatches. You could pull table data across the network and then compare each row and each field, but that can be a demand on resources. Today we'll walk through a simple solution for your Postgres toolbox - using Foreign Data Wrappers to connect and compare the two source datasets. With the foreign data wrapper and a little sql magic, we can compare data quickly and easily.
Creating Environments
To keep the environment simple so even with limited resources it can be
practiced, we will use a single PostgreSQL cluster with two databases (hrprod
,
hrreport
) connected via PostgreSQL Foreign Data Wrapper. The simulation here
is a production database (hrprod
) with a reporting database (hrreport
). Keep
in mind that the source and target do not have to be within the same PostgreSQL
cluster.
For speed of creating the environment, the Crunchy Postgres for Kubernetes was used and a simple PostgreSQL cluster deployed using the Postgres Operator Examples repository.
The rest of the steps will only show the steps performed within psql
from the
database containers.
Production Setup (hrprod)
The steps to create the simulated production database is simple: create the
database, create the postgres_fdw
extension, create the employee
table and
lastly populate the employee
table with three rows of data.
postgres=> create database hrprod;
CREATE DATABASE
postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> create extension postgres_fdw;
CREATE EXTENSION
hrprod=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE
hrprod=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
Reporting Setup (hrreport)
The steps are then repeated to create the simulated reporting database.
postgres=> create database hrreport;
CREATE DATABASE
postgres=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> create extension postgres_fdw;
CREATE EXTENSION
hrreport=> create table employee (id int, first_name varchar(50), last_name varchar(50), department varchar(20));
CREATE TABLE
hrreport=> insert into employee (id, first_name, last_name, department) values (1,'John','Smith','explorer'),(2,'George','Washington','government'),(3,'Thomas','Edison','inventor');
INSERT 0 3
With this, the setup is complete and the data in the employee
table match in
both databases.
Data Compare
The compare will be performed from the reporting database side (hrreport
). To
start, a temporary table named data_compare
is created. The data_compare
table is to store three pieces of information:
source_name
column that identifies where the data came from (hrprod
orhrreport
in this example).id
column that will store the value(s) of the primary key from the table.hash_value
column that stores the hash value of all the non-key fields in the table.
Note that if the table has a composite key, the id
column would be populated
by joining the values into a single string. The hash occurs on the source side
and only the hashed value is used for the comparison, greatly reducing network
traffic, transfer time, etc.
Setup Data Compare
Create the data_compare
table in both the production (hrprod
) and target
(hrreport
) databases.
hrreport=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> CREATE TABLE data_compare
(source_name VARCHAR(140),
id VARCHAR(1000),
hash_value varchar(100)
);
CREATE TABLE
hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> CREATE TABLE data_compare
(source_name VARCHAR(140),
id VARCHAR(1000),
hash_value varchar(100)
);
CREATE TABLE
An INSERT
statement will be executed on both the source and target to populate
the data_compare
table and then the contents of the tables compared to
identify differences. To reduce time and transfer for multiple compare passes,
the data_compare
table contents can be transferred via the foreign table or
pg_dump
, etc.
The following steps were used to create the foreign table.
hrreport=> CREATE SERVER hrprod FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'hrprod', port '5432');
CREATE SERVER
hrreport=> CREATE USER MAPPING FOR current_user SERVER hrprod options (user 'postgres', password 'welcome1');
CREATE USER MAPPING
CREATE FOREIGN TABLE hrprod_data_compare (source_name varchar(140), id varchar(1000), hash_value varchar(100)) SERVER hrprod OPTIONS (table_name 'data_compare');
Perform Initial Compare
Populate the data_compare
table in both the source (hrprod
) and target
(hrreport
) databases.
hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrprod' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrreport' source_name, id::text, md5(concat_ws('|',first_name, last_name, department)) hash_value FROM employee e);
INSERT 0 3
At this point we know that the data is exactly the same so let's look at the SQL that is used to perform the actual comparison.
hrreport=> SELECT COALESCE(s.id,t.id) id,
s.hash_value source_hash_value, t.hash_value target_hash_value,
CASE WHEN s.hash_value = t.hash_value THEN 'equal'
WHEN s.id IS NULL THEN 'row not on source'
WHEN t.id IS NULL THEN 'row not on target'
ELSE 'difference'
END compare_result
FROM hrprod_data_compare s
FULL JOIN data_compare t ON s.id=t.id;
id | source_hash_value | target_hash_value | compare_result
----+----------------------------------+----------------------------------+-------------------
1 | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
2 | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
3 | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
(3 rows)
The compare_result
confirms that two sets of data are equal. An alternate
compare SQL is included at the end of this article to show various ways the data
can be compared when the two data_compare
tables are combined.
Create an Out-Of-Sync Condition and Compare
At this stage, three rows exist in the table and the data matches.
hrprod=> SELECT * FROM employee;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
(3 rows)
To create the out of sync, the following changes will be performed:
- In
hrprod
, add CS Lewis with id 4, Charles Babbage with id 5, Blaise Pascal with id 6. - In
hrreport
, add Charles Babbage with id 4, CS Lewis with id 5, Kenny Rogers with id 7.
Notice that the ids for CS Lewis and Charles Babbage have been swapped and a
unique record added to each database (Blaise Pascal to hrprod
and Kenny Rogers
to hrreport
). The compare should show that 3 rows match, 2 rows have
differences and 2 rows are in one database but not the other.
Up first, changes to source (hrprod
).
hrprod=> INSERT INTO employee (id, first_name, last_name, department)
VALUES (4,'CS','Lewis','author'),(5,'Charles','Babbage','math'),(6,'Blaise','Pascal','math');
hrprod=> SELECT * FROM employee ORDER BY id;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
4 | CS | Lewis | author
5 | Charles | Babbage | math
6 | Blaise | Pascal | math
(6 rows)
Now the changes to the target (hrreport
).
hrreport=> INSERT INTO employee (id, first_name, last_name, department)
VALUES (5,'CS','Lewis','author'),(4,'Charles','Babbage','math'),(7,'Kenny','Rogers','music');
hrreport=> SELECT * FROM employee ORDER BY id;
id | first_name | last_name | department
----+------------+------------+------------
1 | John | Smith | explorer
2 | George | Washington | government
3 | Thomas | Edison | inventor
4 | Charles | Babbage | math
5 | CS | Lewis | author
7 | Kenny | Rogers | music
(6 rows)
To summarize the current state:
- Three rows that match (id=1, 2, 3)
- Two rows that do not match (id=4, id=5)
- Two rows that exist in one but not the other (id=6, id=7)
Let's now clear the data_compare
tables and perform the compare again.
postgres=> \c hrprod
You are now connected to database "hrprod" as user "postgres".
hrprod=> DELETE FROM data_compare;
DELETE 3
hrprod=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrprod' source_name, id::text id, md5(textin(record_out(e))) FROM employee e);
INSERT 0 6
hrprod=> \c hrreport
You are now connected to database "hrreport" as user "postgres".
hrreport=> DELETE FROM data_compare;
DELETE 3
hrreport=> INSERT INTO data_compare (source_name, id, hash_value)
(SELECT 'hrreport' source_name, id::text id, md5(textin(record_out(e))) FROM employee e);
INSERT 0 6
Now for the compare and the results.
hrreport=> SELECT COALESCE(s.id,t.id) id,
s.hash_value source_hash_value, t.hash_value target_hash_value,
CASE WHEN s.hash_value = t.hash_value THEN 'equal'
WHEN s.id IS NULL THEN 'row not on source'
WHEN t.id IS NULL THEN 'row not on target'
ELSE 'difference'
END compare_result
FROM hrprod_data_compare s
FULL JOIN data_compare t ON s.id=t.id;
id | source_hash_value | target_hash_value | compare_result
----+----------------------------------+----------------------------------+-------------------
1 | 681c37a127083d90164a9f04b5f92759 | 681c37a127083d90164a9f04b5f92759 | equal
2 | 6e181f686815319daa07c5e0e1ddcd27 | 6e181f686815319daa07c5e0e1ddcd27 | equal
3 | 4d4eba0d792cb227d247a3b0f9f66979 | 4d4eba0d792cb227d247a3b0f9f66979 | equal
4 | bbee9d6cccbeac4e9125ec78507c4eb7 | 57acef6ed228a52b8c42f0a6c155e62b | difference
5 | 57acef6ed228a52b8c42f0a6c155e62b | bbee9d6cccbeac4e9125ec78507c4eb7 | difference
6 | 047742fb256df0b78cebc3fbbc3ca4ad | | row not on target
7 | | 66e5e35673780bd392d2f81d589fbb52 | row not on source
(7 rows)
The above output indicates that rows with id = 1 thru 3 exists in both databases
and the content of the rows match. Rows with id 4 and 5 exists in each database
but the contents of the row is different. Going a step further, one could see
that the hash values are the same between the two different rows but associated
to the wrong id. Row with id 6 only exist on the target (hrreport
) while the
row with id 7 only exists on the source (hrprod
). In total, there are 4 rows
that are out of sync.
With the rows identified, proper steps can be performed to sync the appropriate
rows. Last thought, imagine for a moment that logical replication was in place
between the two databases and changes were pending on the target due to lag. The
INSERT into the data_compare
could be performed only on the rows flagged as
out of sync to verify just those rows once replication lag is gone.
Conclusion
Comparing data can be a monumental task. However, this little trick has come in handy over the years when expensive data compare software packages were not an option. There is still room for some creativity with the compare SQL to meet the exact needs of the compare. For example, only show rows that are missing from one side or the other.
Alternate Compare SQL:
SELECT id, hash_value,
count(src1) src1,
count(src2) src2
FROM
( SELECT a.*,
1 src1,
null src2
FROM data_compare a
WHERE source_name='hrprod'
UNION ALL
SELECT b.*,
null src1,
2 src2
FROM data_compare b
WHERE source_name='hrreport'
) c
GROUP BY id, hash_value
HAVING count(src1) <> count(src2);
So by setting up postgres_fdw, hashing the non-key fields, and writing a sql query to see if any rows are different - you can do a quick and simple Postgres data comparison. Have another solution you like for data compare? Let us know at @crunchydata.