A Look at Postgres 15: MERGE Command with Examples
With PostgreSQL 15 comes a new SQL command called MERGE. MERGE has been in the SQL standard for quite a while, but it has only now made its way into the PostgreSQL codebase. Prior to Postgres 15, if you wanted to refresh a target table from a source table, you could use the "upsert" method with the ON CONFLICT clause.
Now, MERGE can be used instead! Some situations where MERGE makes a lot of sense are:
- data loading from external sources, through foreign data wrappers
- staged and batched processing jobs
About MERGE
Let's look at the synopsis in the documentation:
[ WITH with_query [, ...] ]
MERGE INTO target_table_name [ [ AS ] target_alias ]
USING data_source ON join_condition
when_clause [...]
where data_source is:
{ source_table_name | ( source_query ) } [ [ AS ] source_alias ]
and when_clause is:
{ WHEN MATCHED [ AND condition ] THEN { merge_update | merge_delete | DO NOTHING } |
WHEN NOT MATCHED [ AND condition ] THEN { merge_insert | DO NOTHING } }
and merge_insert is:
INSERT [( column_name [, ...] )]
[ OVERRIDING { SYSTEM | USER } VALUE ]
{ VALUES ( { expression | DEFAULT } [, ...] ) | DEFAULT VALUES }
and merge_update is:
UPDATE SET { column_name = { expression | DEFAULT } |
( column_name [, ...] ) = ( { expression | DEFAULT } [, ...] ) } [, ...]
and merge_delete is:
DELETE
We can read that the data_source can be either a table, some data brought to the command by the prior WITH with_query, or a source_query written inline after USING. When the data matches, i.e. there's a matching record in the target_table_name, we can specify what to do:
- do an UPDATE as defined in the merge_update part (see UPDATE SET...), or
- do a DELETE as defined in the merge_delete, or
- do nothing
When the data doesn't match, i.e. there's no matching record in the target_table_name, then we do an INSERT (see merge_insert).
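To give those branches a concrete shape, here is a small sketch; the inventory and deliveries tables are hypothetical and not part of the example that follows, the point is only to show a conditional WHEN MATCHED, DO NOTHING, and the insert branch side by side:
-- hypothetical tables: inventory(item_id, qty) and deliveries(item_id, qty)
merge into inventory i
using deliveries d
on i.item_id = d.item_id
when matched and d.qty = 0 then
    do nothing                        -- matched, but nothing to change
when matched then
    update set qty = i.qty + d.qty    -- the merge_update branch
when not matched then
    insert (item_id, qty)             -- the merge_insert branch
    values (d.item_id, d.qty);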
MERGE example - remote sensors & batches
In working with clients, I have seen a need for data loading with remote sensors or stations. Anything that regularly sends data, with that data then loaded into a dataset, is a common need. I'm particularly excited about using MERGE to solve these issues for Postgres users.
In order to explain this data loading use case of MERGE, I'm going to set up an example. We'll have a database of stations, each station being a remote measurement tool with data coming in intermittently.
- Streaming data, data arriving continuously
- Stations emit periodic and intermittent data measures
- There's a batch collecting this data
- station_data_new is a temp table for common storage
- Data is stored long term in station_data_actual with the most recent values
- We want to keep track of when each station has been created
Let's create some tables for testing purposes. Beware that the temporary table station_data_new will only exist in the context of a given session.
create temporary table station_data_new (
station_id integer
, a integer
, b integer
);
create table station_data_actual (
station_id integer primary key
, a integer
, b integer
, created timestamp default current_timestamp
, updated timestamp default current_timestamp
);
Let's insert some sample data into station_data_new: our first 5 stations are up and have sent their first batch of data:
with measures as (
select *
from generate_series(1,5)
)
insert into station_data_new (
station_id
, a
, b
)
select
generate_series
, round(random()*100+1)
, round(random()*100+1)
from
measures;
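If you want to sanity-check the batch before merging (an optional step, not required by the walkthrough), a simple query on the temp table shows the five rows:
select *
from station_data_new
order by station_id;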
Basic MERGE
At this point, we could just do a plain INSERT with a SELECT from station_data_new into station_data_actual. Instead we'll use MERGE, because we're planning for the case where the data already exists: then an UPDATE will be issued instead of an INSERT.
merge into station_data_actual sda
using station_data_new sdn
on sda.station_id = sdn.station_id
when matched then
update set a = sdn.a, b = sdn.b, updated = default
when not matched then
insert (station_id, a, b)
values (sdn.station_id, sdn.a, sdn.b);
If you execute it once, you'll get MERGE 5 in return, and the data is inserted into station_data_actual, where the timestamps in created and updated have the same value.
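One optional way to verify that (a query of my own, not part of the original walkthrough) is to compare the two timestamps directly:
select station_id, created = updated as never_updated
from station_data_actual
order by station_id;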
To do even more testing you could (the full sequence is sketched after this list):
- Empty the staging table with truncate table station_data_new;
- Re-create data in it, changing generate_series(1,5) to generate_series(1,10)
- Issue again the very same MERGE as we ran before
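Put together, that re-test sequence could look like the following sketch, which simply restates the three steps above with the suggested batch of 10 stations:
truncate table station_data_new;

with measures as (
    select *
    from generate_series(1,10)
)
insert into station_data_new (
    station_id
  , a
  , b
)
select
    generate_series
  , round(random()*100+1)
  , round(random()*100+1)
from
    measures;

-- the very same MERGE as before
merge into station_data_actual sda
using station_data_new sdn
on sda.station_id = sdn.station_id
when matched then
    update set a = sdn.a, b = sdn.b, updated = default
when not matched then
    insert (station_id, a, b)
    values (sdn.station_id, sdn.a, sdn.b);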
You'll see the data in station_data_actual updated: stations 1 to 5 will have their updated timestamp refreshed, while stations 6 to 10 will be newly created, with created and updated equal.
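If you want to see that split for yourself, here is one more optional query of mine comparing the timestamps after the second run; stations 1 to 5 should report was_updated = true, stations 6 to 10 false:
select
    station_id
  , created
  , updated
  , updated > created as was_updated
from station_data_actual
order by station_id;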
Conclusion
MERGE was a long-awaited feature for PostgreSQL fans! Now we have it. I encourage you to review your processes wherever they merge incoming data with existing data in your database. As in this example, new data arriving on top of existing data is a good place to use it.
MERGE opens new ways of aggregating and/or merging data from many databases. In distributed models, where data exists across different locations, using foreign data wrappers with MERGE could be a really elegant solution. MERGE may simplify your processing of data throughout the database's life, with strong and fast SQL statements as opposed to functions or other complex operations.
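To make that last idea a bit more tangible, here is a minimal sketch assuming a remote database reachable through postgres_fdw; the server name, connection options, and remote table below are all hypothetical, and the remote staging table would need to be a regular (non-temporary) table:
create extension if not exists postgres_fdw;

-- hypothetical remote server holding the staging data
create server stations_remote
    foreign data wrapper postgres_fdw
    options (host 'remote-host', dbname 'stations', port '5432');

create user mapping for current_user
    server stations_remote
    options (user 'remote_user', password 'secret');

-- expose the remote staging table locally
create foreign table station_data_remote (
    station_id integer
  , a integer
  , b integer
) server stations_remote
  options (schema_name 'public', table_name 'station_data_new');

-- the foreign table then serves as the MERGE data source
merge into station_data_actual sda
using station_data_remote sdr
on sda.station_id = sdr.station_id
when matched then
    update set a = sdr.a, b = sdr.b, updated = default
when not matched then
    insert (station_id, a, b)
    values (sdr.station_id, sdr.a, sdr.b);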